# Chapter 9: Pre-trained Language Models (BERT-type)

## 80. Tokenization

In [1]:
from transformers import AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.tokenize("The movie was full of incomprehensibilities.")


['the',
 'movie',
 'was',
 'full',
 'of',
 'inc',
 '##omp',
 '##re',
 '##hen',
 '##si',
 '##bilities',
 '.']
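
The `##` prefixes in the output above mark continuation pieces of a word. As a rough illustration of how such a split can arise, here is a minimal sketch of greedy longest-match-first subword splitting. The function name `wordpiece_tokenize` and the toy vocabulary are assumptions made for this example; the real tokenizer uses the full `bert-base-uncased` vocabulary and also handles lowercasing and punctuation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subwords."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is treated as unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption for illustration, not the real BERT vocabulary).
toy_vocab = {"inc", "##omp", "##re", "##hen", "##si", "##bilities", "full", "of"}
print(wordpiece_tokenize("incomprehensibilities", toy_vocab))
# → ['inc', '##omp', '##re', '##hen', '##si', '##bilities']
```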

## 81. Mask prediction

In [2]:
from transformers import pipeline

fill_mask = pipeline("fill-mask", model=model_name, framework="pt")
masked_text = "The movie was full of [MASK]."
outputs = fill_mask(masked_text)
print(outputs[0])

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMaskedLM were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'score': 0.10711917281150818, 'token': 4569, 'token_str': 'fun', 'sequence': 'the movie was full of fun.'}


## 82. Top-k mask prediction

In [3]:
import pandas as pd

fill_mask = pipeline("fill-mask", model=model_name, top_k=10, framework="pt")
masked_text = "The movie was full of [MASK]."
outputs = fill_mask(masked_text)
display(pd.DataFrame(outputs))

# Top 10 predictions: https://kazuhira-r.hatenablog.com/entry/2024/01/03/221331



| | score | token | token_str | sequence |
|---|---|---|---|---|
| 0 | 0.107119 | 4569 | fun | the movie was full of fun. |
| 1 | 0.066345 | 20096 | surprises | the movie was full of surprises. |
| 2 | 0.044684 | 3689 | drama | the movie was full of drama. |
| 3 | 0.027217 | 3340 | stars | the movie was full of stars. |
| 4 | 0.025413 | 11680 | laughs | the movie was full of laughs. |
| 5 | 0.019517 | 2895 | action | the movie was full of action. |
| 6 | 0.019038 | 8277 | excitement | the movie was full of excitement. |
| 7 | 0.01829 | 2111 | people | the movie was full of people. |
| 8 | 0.015031 | 6980 | tension | the movie was full of tension. |
| 9 | 0.014646 | 2189 | music | the movie was full of music. |
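
The `score` column above is a probability over the vocabulary for the masked position. The sketch below shows one way such a ranking can be produced from raw scores: apply a softmax, then keep the `k` largest entries. The helper names and the toy logits are assumptions for illustration and are not taken from the pipeline's internals.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(tokens, logits, k):
    """Return the k (token, probability) pairs with the highest probability."""
    probs = softmax(logits)
    ranked = sorted(zip(tokens, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy candidates and logits (assumed values, not real model output).
toy_tokens = ["fun", "surprises", "drama", "stars", "chairs"]
toy_logits = [4.0, 3.5, 3.1, 2.6, -1.0]
for token, prob in top_k(toy_tokens, toy_logits, k=3):
    print(f"{token}: {prob:.3f}")
```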


## 83. Sentence vectors from the CLS token

In [4]:
from transformers import AutoModel
import torch
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

texts = [
  "The movie was full of fun.",
  "The movie was full of excitement.",
  "The movie was full of crap.",
  "The movie was full of rubbish."
]

# Encode the sentences
model = AutoModel.from_pretrained(model_name)
encoded_texts = tokenizer(texts, padding=True, truncation=True, max_length=512)

# Build sentence vectors from the [CLS] position (index 0 of the last hidden state)
input_ids = torch.tensor(encoded_texts["input_ids"])
outputs = model(input_ids)
sentencevec = outputs[0][:,0,:]

# Cosine similarity
cs_array = cosine_similarity(sentencevec.detach().cpu().numpy(), sentencevec.detach().cpu().numpy())
print(cs_array)

# Sentence vectors from the CLS token: https://qiita.com/ichiroex/items/6e305a5d5bed7d715c2f
# Cosine similarity:                   https://analysis-navi.com/?p=688

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[[0.99999976 0.9880607  0.9557658  0.9475323 ]
 [0.9880607  1.         0.9541273  0.9486636 ]
 [0.9557658  0.9541273  0.99999976 0.9806931 ]
 [0.9475323  0.9486636  0.9806931  0.99999994]]
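
For reference, the pairwise matrix above can be reproduced by hand: each entry is the dot product of two vectors divided by the product of their norms. The sketch below uses plain Python lists and made-up 2-dimensional vectors (assumptions for illustration) instead of the 768-dimensional BERT outputs. Diagonal values such as `0.99999976` in the output above are float32 rounding of 1.

```python
import math


def cosine_similarity_matrix(vectors):
    """Return the matrix of cosine similarities between all pairs of vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [sum(a * b for a, b in zip(u, v)) / (nu * nv) for v, nv in zip(vectors, norms)]
        for u, nu in zip(vectors, norms)
    ]


# Toy vectors (assumed values): the first and third are orthogonal.
toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
for row in cosine_similarity_matrix(toy_vectors):
    print([round(value, 4) for value in row])
```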


## 84. Sentence vectors by averaging

In [5]:
# Build sentence vectors by averaging the hidden states of non-padding tokens
input_ids = torch.tensor(encoded_texts["input_ids"])
outputs = model(input_ids)
att_mask = torch.tensor(encoded_texts["attention_mask"])
att_mask = att_mask.unsqueeze(-1)
sentencevec = (outputs[0] * att_mask).sum(dim=1) / att_mask.sum(dim=1)

# Cosine similarity
cs_array = cosine_similarity(sentencevec.detach().cpu().numpy(), sentencevec.detach().cpu().numpy())
print(cs_array)

# Sentence vectors by averaging: https://qiita.com/anyai_corp/items/1d66feea6102c28dd077

[[1.0000002  0.95681167 0.8489995  0.81688434]
 [0.95681167 1.         0.8351837  0.7938447 ]
 [0.8489995  0.8351837  1.0000001  0.92255366]
 [0.81688434 0.7938447  0.92255366 1.        ]]
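
The averaging in the cell above multiplies the hidden states by the attention mask so that padding positions do not contribute. A minimal sketch of the same idea for a single sentence, using plain lists and made-up numbers (the function name and the toy values are assumptions):

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors of one sentence, skipping positions whose mask is 0."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Two real tokens followed by one padding position with deliberately large values.
toy_token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]
print(masked_mean_pool(toy_token_vectors, toy_mask))  # → [2.0, 3.0]
```

Without the mask, the padding vector would pull the average toward `[34.67, 35.33]`, which is why the mask is summed in the denominator as well.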


## 85. Preparing the dataset

In [6]:
!wget https://dl.fbaipublicfiles.com/glue/data/SST-2.zip -P data/
!unzip -o data/SST-2.zip -d data/
!rm data/SST-2.zip

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
--2025-04-25 03:47:04--  https://dl.fbaipublicfiles.com/glue/data/SST-2.zip
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 3.163.224.94, 3.163.224.86, 3.163.224.57, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|3.163.224.94|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7439277 (7.1M) [application/zip]
Saving to: 'data/SST-2.zip'

2025-04-25 03:47:04 (45.5 MB/s) - 'data/SST-2.zip' saved [7439277/7439277]


In [7]:
!pip install datasets


In [8]:
from datasets import Dataset

def make_dataset(file_name):
  df = pd.read_csv(file_name, sep='\t')
  df['label'] = df['label'].astype(int)
  df['tokens'] = df['sentence'].apply(tokenizer.tokenize)
  return Dataset.from_pandas(df)

train_dataset =  make_dataset('./data/SST-2/train.tsv')
dev_dataset = make_dataset('./data/SST-2/dev.tsv')

print('Training examples:')
for i in range(5):
  print(train_dataset[i])
print('')

print('Validation examples:')
for i in range(5):
  print(dev_dataset[i])

# Converting a column to int with pandas: https://tanuhack.com/pandas-type-conversion/
# Applying a function with pandas:        https://note.nkmk.me/python-pandas-map-applymap-apply/

Training examples:
{'sentence': 'hide new secretions from the parental units ', 'label': 0, 'tokens': ['hide', 'new', 'secret', '##ions', 'from', 'the', 'parental', 'units']}
{'sentence': 'contains no wit , only labored gags ', 'label': 0, 'tokens': ['contains', 'no', 'wit', ',', 'only', 'labor', '##ed', 'gag', '##s']}
{'sentence': 'that loves its characters and communicates something rather beautiful about human nature ', 'label': 1, 'tokens': ['that', 'loves', 'its', 'characters', 'and', 'communicate', '##s', 'something', 'rather', 'beautiful', 'about', 'human', 'nature']}
{'sentence': 'remains utterly satisfied to remain the same throughout ', 'label': 0, 'tokens': ['remains', 'utterly', 'satisfied', 'to', 'remain', 'the', 'same', 'throughout']}
{'sentence': 'on the worst revenge-of-the-nerds clichés the filmmakers could dredge up ', 'label': 0, 'tokens': ['on', 'the', 'worst', 'revenge', '-', 'of', '-', 'the', '-', 'ne', '##rds', 'cl', '##iche', '##s', 'the', 'filmmakers', 'could', 'dr', '

## 86. Creating mini-batches

In [9]:
from transformers import BatchEncoding, DataCollatorWithPadding

def preprocess_text_classification(example: dict[str, str | int]) -> BatchEncoding:
  encoded_example = tokenizer(example["sentence"], padding=True, truncation=True, max_length=512)
  encoded_example["labels"] = example["label"]
  return encoded_example

# Encode the datasets
encoded_train_dataset = train_dataset.map(preprocess_text_classification, remove_columns=train_dataset.column_names)
encoded_dev_dataset = dev_dataset.map(preprocess_text_classification, remove_columns=dev_dataset.column_names)

# Build a mini-batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
batch_inputs = data_collator(encoded_train_dataset[0:4])
print({name: tensor.size() for name, tensor in batch_inputs.items()})

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'input_ids': torch.Size([4, 15]), 'token_type_ids': torch.Size([4, 15]), 'attention_mask': torch.Size([4, 15]), 'labels': torch.Size([4])}
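
The shapes above show that the four examples were padded to a common length of 15, the longest sequence in that batch. The sketch below illustrates this kind of per-batch (dynamic) padding with plain lists. The function name `pad_batch` and the toy token IDs are assumptions; `pad_id=0` matches the `pad_token_id` reported in the BERT config, and the real collator additionally pads `token_type_ids` and converts everything to tensors.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy token IDs of different lengths (assumed values).
toy_batch = [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 7, 8, 102]]
padded = pad_batch(toy_batch)
print(len(padded["input_ids"]), len(padded["input_ids"][0]))  # → 3 6
print(padded["attention_mask"][1])  # → [1, 1, 1, 0, 0, 0]
```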


## 87. Fine-tuning

In [10]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

training_args = TrainingArguments(
  output_dir="model/model_87",
  per_device_train_batch_size=32,
  per_device_eval_batch_size=32,
  learning_rate=2e-5,
  lr_scheduler_type="linear",
  warmup_ratio=0.1,
  num_train_epochs=3,
  save_strategy="epoch",
  logging_strategy="epoch",
  evaluation_strategy="epoch",
  load_best_model_at_end=True,
  metric_for_best_model="accuracy",
  fp16=True
)

def compute_accuracy(eval_pred: tuple[np.ndarray, np.ndarray]) -> dict[str, float]:
  predictions, labels = eval_pred
  predictions = np.argmax(predictions, axis=1)
  return {"accuracy": (predictions == labels).mean()}

trainer = Trainer(
    model=model,
    train_dataset=encoded_train_dataset,
    eval_dataset=encoded_dev_dataset,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_accuracy,
)
trainer.train()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.2495 | 0.260327 | 0.923165 |
| 2 | 0.1161 | 0.275403 | 0.920872 |
| 3 | 0.0727 | 0.292709 | 0.918578 |


***** Running Evaluation *****
  Num examples = 872
  Batch size = 32
Saving model checkpoint to model/checkpoint-2105
Configuration saved in model/checkpoint-2105/config.json
Model weights saved in model/checkpoint-2105/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 872
  Batch size = 32
Saving model checkpoint to model/checkpoint-4210
Configuration saved in model/checkpoint-4210/config.json
Model weights saved in model/checkpoint-4210/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 872
  Batch size = 32
Saving model checkpoint to model/checkpoint-6315
Configuration saved in model/checkpoint-6315/config.json
Model weights saved in model/checkpoint-6315/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from model/checkpoint-2105 (score: 0.9231651376146789).


TrainOutput(global_step=6315, training_loss=0.14612548538171763, metrics={'train_runtime': 423.8496, 'train_samples_per_second': 476.695, 'train_steps_per_second': 14.899, 'total_flos': 4168078408484460.0, 'train_loss': 0.14612548538171763, 'epoch': 3.0})
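
The training arguments combine `lr_scheduler_type="linear"` with `warmup_ratio=0.1`. As a rough sketch of what that schedule looks like over the 6315 steps reported above, the function below ramps the learning rate linearly from 0 to the peak during the first 10% of steps and then decays it linearly back to 0. The function name is hypothetical, and rounding the warmup length up with `math.ceil` is an assumption about how the ratio is converted to steps.

```python
import math


def linear_warmup_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


total = 6315  # global_step from the TrainOutput above
for step in (0, 316, 632, 3000, 6315):
    print(step, linear_warmup_lr(step, total))
```

Under these assumptions the peak of `2e-5` is reached at step 632, roughly a third of the way through the first epoch.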

## 88. Sentiment polarity analysis

In [16]:
texts = [
  "The movie was full of fun.",
  "The movie was full of excitement.",
  "The movie was full of crap.",
  "The movie was full of rubbish."
]

model_path = "./model/model_87/checkpoint-2105"
model = AutoModelForSequenceClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoded_texts = tokenizer(texts, padding=True, truncation=True, max_length=512)

input_ids = torch.tensor(encoded_texts["input_ids"])
outputs = model(input_ids)

for text, output in zip(texts, outputs.logits):
  print(f'"{text}": {np.argmax(output.detach().cpu().numpy())}')

loading configuration file ./model/model_87/checkpoint-2105/config.json
Model config BertConfig {
  "_name_or_path": "./model/model_87/checkpoint-2105",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.24.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file ./model/model_87/checkpoint-2105/pytorch_model.bin
All model checkpoint weights were used when initializing BertForSequenceClassification.



"The movie was full of fun.": 1
"The movie was full of excitement.": 1
"The movie was full of crap.": 0
"The movie was full of rubbish.": 0


## 89. Changing the architecture

In [20]:
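model_name = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Re-encode the data with the DeBERTa tokenizer: the datasets and collator built in
# section 86 hold BERT token IDs, which do not match this model's vocabulary.
# The run logged below appears to have reused the BERT-encoded data, which would
# likely explain its low accuracy.
encoded_train_dataset = train_dataset.map(preprocess_text_classification, remove_columns=train_dataset.column_names)
encoded_dev_dataset = dev_dataset.map(preprocess_text_classification, remove_columns=dev_dataset.column_names)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)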

training_args = TrainingArguments(
  output_dir="model/model_89",
  per_device_train_batch_size=32,
  per_device_eval_batch_size=32,
  learning_rate=2e-5,
  lr_scheduler_type="linear",
  warmup_ratio=0.1,
  num_train_epochs=5,
  save_strategy="epoch",
  logging_strategy="epoch",
  evaluation_strategy="epoch",
  load_best_model_at_end=True,
  metric_for_best_model="accuracy",
  fp16=True
)

def compute_accuracy(eval_pred: tuple[np.ndarray, np.ndarray]) -> dict[str, float]:
  predictions, labels = eval_pred
  predictions = np.argmax(predictions, axis=1)
  return {"accuracy": (predictions == labels).mean()}

trainer = Trainer(
    model=model,
    train_dataset=encoded_train_dataset,
    eval_dataset=encoded_dev_dataset,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_accuracy,
)
trainer.train()

loading configuration file config.json from cache at /home/j329nish/.cache/huggingface/hub/models--microsoft--deberta-v3-base/snapshots/8ccc9b6f36199bec6961081d44eb72fb3f7353f3/config.json
Model config DebertaV2Config {
  "_name_or_path": "microsoft/deberta-v3-base",
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-07,
  "max_position_embeddings": 512,
  "max_relative_positions": -1,
  "model_type": "deberta-v2",
  "norm_rel_ebd": "layer_norm",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_dropout": 0,
  "pooler_hidden_act": "gelu",
  "pooler_hidden_size": 768,
  "pos_att_type": [
    "p2c",
    "c2p"
  ],
  "position_biased_input": false,
  "position_buckets": 256,
  "relative_attention": true,
  "share_att_key": true,
  "transformers_version": "4.24.0",
  "type_vocab_size": 0,
  "vocab_size": 128100

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.6846 | 0.67144 | 0.587156 |
| 2 | 0.6507 | 0.610676 | 0.694954 |
| 3 | 0.572 | 0.588943 | 0.711009 |
| 4 | 0.4671 | 0.620907 | 0.744266 |
| 5 | 0.404 | 0.649707 | 0.74656 |


***** Running Evaluation *****
  Num examples = 872
  Batch size = 32
Saving model checkpoint to model/model_89/checkpoint-2105
Configuration saved in model/model_89/checkpoint-2105/config.json
Model weights saved in model/model_89/checkpoint-2105/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 872
  Batch size = 32
Saving model checkpoint to model/model_89/checkpoint-4210
Configuration saved in model/model_89/checkpoint-4210/config.json
Model weights saved in model/model_89/checkpoint-4210/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 872
  Batch size = 32
Saving model checkpoint to model/model_89/checkpoint-6315
Configuration saved in model/model_89/checkpoint-6315/config.json
Model weights saved in model/model_89/checkpoint-6315/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 872
  Batch size = 32
Saving model checkpoint to model/model_89/checkpoint-8420
Configuration saved in model/model_89/checkpoint-8420/config.json
Model we

TrainOutput(global_step=10525, training_loss=0.5556783050029691, metrics={'train_runtime': 1333.1039, 'train_samples_per_second': 252.602, 'train_steps_per_second': 7.895, 'total_flos': 6945137162497512.0, 'train_loss': 0.5556783050029691, 'epoch': 5.0})