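# Chapter 9: Pretrained Language Models (BERT-style)

In this chapter, we use BERT-style pretrained models to predict masked words, compute sentence vectors, and build a sentiment analyzer (a positive/negative classifier).

## 80. Tokenization

Split the sentence "The movie was full of incomprehensibilities." into tokens and display the token sequence.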

In [8]:
from transformers import BertTokenizer, logging

# Suppress warnings (show errors only)
logging.set_verbosity_error()

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "The movie was full of incomprehensibilities."
tokens = tokenizer.tokenize(text)

print(tokens)

['the', 'movie', 'was', 'full', 'of', 'inc', '##omp', '##re', '##hen', '##si', '##bilities', '.']
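
The `##` prefix in the output marks a piece that continues the previous one. As a rough illustration of how such a split can arise, the following is a minimal sketch of greedy longest-match-first subword splitting over a toy vocabulary. The function name `wordpiece` and the vocabulary `toy_vocab` are hypothetical and chosen only to reproduce the split above; the real tokenizer also handles lowercasing, punctuation, and a vocabulary of roughly 30,000 entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption, not the real BERT vocabulary).
toy_vocab = {"full", "inc", "##omp", "##re", "##hen", "##si", "##bilities"}

print(wordpiece("full", toy_vocab))
print(wordpiece("incomprehensibilities", toy_vocab))
```

A frequent word such as "full" survives as a single piece, while a rare word is broken into several shorter pieces that the vocabulary does contain.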


## 81. Mask prediction

Find the most appropriate token to fill "[MASK]" in "The movie was full of [MASK]."

In [9]:
from transformers import pipeline
from pprint import pprint

# Build the pipeline and predict
unmasker = pipeline("fill-mask", model="bert-base-uncased")
results = unmasker("The movie was full of [MASK].")
pprint(results[0])


{'score': 0.10711909830570221,
 'sequence': 'the movie was full of fun.',
 'token': 4569,
 'token_str': 'fun'}


## 82. Top-k mask prediction

Find the 10 most appropriate tokens to fill "[MASK]" in "The movie was full of [MASK].", together with their probabilities (likelihoods).

In [10]:
from transformers import pipeline

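# Build the fill-mask pipeline
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Input sentence (the mask token must be written exactly as [MASK], in uppercase)
text = "The movie was full of [MASK]."

# top_k=10 returns the 10 highest-scoring predictions
results = unmasker(text, top_k=10)

# Print the results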
for i, result in enumerate(results, 1):
    token = result["token_str"]
    score = result["score"]
    print(f"{i}. {token:<15} (probability: {score:.4f})")


1. fun             (probability: 0.1071)
2. surprises       (probability: 0.0663)
3. drama           (probability: 0.0447)
4. stars           (probability: 0.0272)
5. laughs          (probability: 0.0254)
6. action          (probability: 0.0195)
7. excitement      (probability: 0.0190)
8. people          (probability: 0.0183)
9. tension         (probability: 0.0150)
10. music           (probability: 0.0146)
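
The scores above are probabilities over the vocabulary at the masked position. A minimal sketch of that step, assuming the model has already produced one logit per vocabulary entry: apply a softmax, then keep the k largest values. The helper names `softmax` and `top_k`, the 5-token vocabulary, and the logits are all made up for illustration and are not real BERT scores.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(tokens, logits, k):
    """Return the k most probable (token, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(tokens, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical vocabulary and logits (assumptions for illustration only).
toy_tokens = ["fun", "drama", "people", "music", "the"]
toy_logits = [4.0, 3.1, 2.2, 2.0, -1.0]

for rank, (token, prob) in enumerate(top_k(toy_tokens, toy_logits, k=3), 1):
    print(f"{rank}. {token:<8} (probability: {prob:.4f})")
```

Because the softmax runs over the whole vocabulary, the top-10 probabilities in the real output do not need to sum to 1.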


## 83. Sentence vectors from the [CLS] token

For every pair of the following sentences, compute the cosine similarity using the final-layer embedding of the [CLS] token.

- "The movie was full of fun."
- "The movie was full of excitement."
- "The movie was full of crap."
- "The movie was full of rubbish."


In [11]:
from transformers import BertTokenizer, BertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity
import itertools

# Device setup (use the GPU if one is available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model and tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.to(device)
model.eval()

# Target sentences
sentences = [
    "The movie was full of fun.",
    "The movie was full of excitement.",
    "The movie was full of crap.",
    "The movie was full of rubbish."
]

# Get the final-layer [CLS] embedding for each sentence
cls_embeddings = []

with torch.no_grad():
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True, padding=True).to(device)
        outputs = model(**inputs)
        cls_embed = outputs.last_hidden_state[:, 0, :]  # the [CLS] token is at position 0
        cls_embeddings.append(cls_embed.cpu())

# Concatenate the embeddings into one tensor and convert to a NumPy array
cls_embeddings = torch.cat(cls_embeddings, dim=0).numpy()

# Compute cosine similarities
similarities = cosine_similarity(cls_embeddings)

# Output: cosine similarity for every pair
pairs = list(itertools.combinations(range(len(sentences)), 2))
for i, j in pairs:
    print(f"Similarity between:\n  \"{sentences[i]}\"\n  \"{sentences[j]}\"\n  => {similarities[i][j]:.4f}\n")


Similarity between:
  "The movie was full of fun."
  "The movie was full of excitement."
  => 0.9881

Similarity between:
  "The movie was full of fun."
  "The movie was full of crap."
  => 0.9558

Similarity between:
  "The movie was full of fun."
  "The movie was full of rubbish."
  => 0.9475

Similarity between:
  "The movie was full of excitement."
  "The movie was full of crap."
  => 0.9541

Similarity between:
  "The movie was full of excitement."
  "The movie was full of rubbish."
  => 0.9487

Similarity between:
  "The movie was full of crap."
  "The movie was full of rubbish."
  => 0.9807
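
For reference, the cosine similarity of two vectors is their dot product divided by the product of their lengths. The sketch below spells this out in plain Python; the 3-dimensional vectors are hypothetical stand-ins for the 768-dimensional embeddings of bert-base-uncased.

```python
import math


def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors (assumptions for illustration only).
vectors = {"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0], "c": [3.0, 0.0, -1.0]}

print(cosine_similarity(vectors["a"], vectors["b"]))  # same direction
print(cosine_similarity(vectors["a"], vectors["c"]))  # orthogonal
```

The measure depends only on direction, not on vector length, which is why "a" and "b" count as maximally similar.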



## 84. Sentence vectors by averaging

For every pair of the following sentences, compute the cosine similarity using the mean of the final-layer embedding vectors.

- "The movie was full of fun."
- "The movie was full of excitement."
- "The movie was full of crap."
- "The movie was full of rubbish."

In [12]:
from transformers import BertTokenizer, BertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity
import itertools

# Device setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model and tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.to(device)
model.eval()

# Target sentences
sentences = [
    "The movie was full of fun.",
    "The movie was full of excitement.",
    "The movie was full of crap.",
    "The movie was full of rubbish."
]

# Compute the mean embedding for each sentence
mean_embeddings = []

with torch.no_grad():
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True, padding=True).to(device)
        outputs = model(**inputs)

        # last_hidden_state: (1, seq_len, hidden_dim)
        token_embeddings = outputs.last_hidden_state.squeeze(0)  # (seq_len, hidden_dim)
        attention_mask = inputs["attention_mask"].squeeze(0)     # (seq_len)

        # Use attention_mask to exclude PAD tokens from the mean
        mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size())
        sum_embeddings = torch.sum(token_embeddings * mask_expanded, dim=0)
        sum_mask = torch.clamp(mask_expanded.sum(dim=0), min=1e-9)
        mean_embedding = sum_embeddings / sum_mask

        mean_embeddings.append(mean_embedding.cpu())

# Convert to a NumPy array and compute cosine similarities
mean_embeddings = torch.stack(mean_embeddings).numpy()
similarities = cosine_similarity(mean_embeddings)

# Output
pairs = list(itertools.combinations(range(len(sentences)), 2))
for i, j in pairs:
    print(f"Similarity between:\n  \"{sentences[i]}\"\n  \"{sentences[j]}\"\n  => {similarities[i][j]:.4f}\n")


Similarity between:
  "The movie was full of fun."
  "The movie was full of excitement."
  => 0.9568

Similarity between:
  "The movie was full of fun."
  "The movie was full of crap."
  => 0.8490

Similarity between:
  "The movie was full of fun."
  "The movie was full of rubbish."
  => 0.8169

Similarity between:
  "The movie was full of excitement."
  "The movie was full of crap."
  => 0.8352

Similarity between:
  "The movie was full of excitement."
  "The movie was full of rubbish."
  => 0.7938

Similarity between:
  "The movie was full of crap."
  "The movie was full of rubbish."
  => 0.9226
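
The cell above encodes one sentence at a time, so its attention mask is all ones and the mean covers every position, including [CLS] and [SEP]. The mask starts to matter once several sentences are padded into one batch. The following plain-Python sketch shows the masked mean on a single padded sequence; `masked_mean` and the toy vectors are hypothetical names and values used only for illustration.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1; ignore padding (mask 0)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                total[d] += vector[d]
    count = max(count, 1)  # guard against an all-padding row
    return [value / count for value in total]


# Hypothetical sequence: 3 real tokens and 1 padding position, 2 dimensions each.
toy_vectors = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
toy_mask = [1, 1, 1, 0]

print(masked_mean(toy_vectors, toy_mask))
```

Without the mask, the padding vector `[9.0, 9.0]` would be averaged in and distort the sentence vector.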



## 85. Preparing the dataset

From the [Stanford Sentiment Treebank (SST)](https://dl.fbaipublicfiles.com/glue/data/SST-2.zip) distributed with the [General Language Understanding Evaluation (GLUE)](https://gluebenchmark.com/) benchmark, load the texts and polarity labels of the training set (train.tsv) and the development set (dev.tsv), then convert every text into a token sequence.

In [13]:
!wget https://dl.fbaipublicfiles.com/glue/data/SST-2.zip
!unzip SST-2.zip

--2025-05-21 02:59:49--  https://dl.fbaipublicfiles.com/glue/data/SST-2.zip
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 13.35.7.128, 13.35.7.50, 13.35.7.38, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|13.35.7.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7439277 (7.1M) [application/zip]
Saving to: ‘SST-2.zip’


2025-05-21 02:59:50 (10.8 MB/s) - ‘SST-2.zip’ saved [7439277/7439277]

Archive:  SST-2.zip
   creating: SST-2/
  inflating: SST-2/dev.tsv           
   creating: SST-2/original/
  inflating: SST-2/original/README.txt  
  inflating: SST-2/original/SOStr.txt  
  inflating: SST-2/original/STree.txt  
  inflating: SST-2/original/datasetSentences.txt  
  inflating: SST-2/original/datasetSplit.txt  
  inflating: SST-2/original/dictionary.txt  
  inflating: SST-2/original/original_rt_snippets.txt  
  inflating: SST-2/original/sentiment_labels.txt  
  inflating: SST-2/test.tsv          
  inflating: SST-2/train.tsv  

In [29]:
import pandas as pd
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# train.tsv and dev.tsv are tab-separated with a header row: sentence, label
train_data = pd.read_csv('SST-2/train.tsv', sep='\t')
dev_data = pd.read_csv('SST-2/dev.tsv', sep='\t')

print(train_data.head())

def to_token_records(df):
    """Convert each row into {"sentence": list of tokens, "label": polarity label}."""
    return [
        {"sentence": tokenizer.tokenize(sentence), "label": int(label)}
        for sentence, label in zip(df["sentence"], df["label"])
    ]

train_data1 = to_token_records(train_data)
dev_data1 = to_token_records(dev_data)

                                            sentence  label
0       hide new secretions from the parental units       0
1               contains no wit , only labored gags       0
2  that loves its characters and communicates som...      1
3  remains utterly satisfied to remain the same t...      0
4  on the worst revenge-of-the-nerds clichés the ...      0


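## 86. Building a mini-batch

For a subset of the training data loaded in problem 85 (for example, the first four examples), apply padding and related processing so that the token sequences have equal length, and assemble them into a mini-batch.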

In [33]:
from transformers import BertTokenizer
import torch

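# First four training examples as raw strings, taken from the train_data DataFrame.
# The token lists in train_data1 cannot be passed here: the tokenizer tries to
# unpack each inner list as a (text, text_pair) pair and raises
# "ValueError: too many values to unpack (expected 2)".
train_sentences = train_data["sentence"].tolist()[:4]

# Load the tokenizer (BERT base uncased)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize, pad, and convert to tensors (mini-batch construction)
batch_encoding = tokenizer(
    train_sentences,
    padding=True,            # pad to the longest sentence in the batch
    truncation=True,         # cut overlong sentences (probably not needed here)
    return_tensors="pt"      # return PyTorch tensors
)

# Inspect the result (optional)
print("Input IDs:")
print(batch_encoding["input_ids"])
print("\nAttention Mask:")
print(batch_encoding["attention_mask"])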
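
Padding itself is a simple operation: every id sequence is extended to the length of the longest one, and an attention mask records which positions are real. A minimal plain-Python sketch follows. The helper `pad_batch` is hypothetical; the ids 101, 102, and 0 are the [CLS], [SEP], and [PAD] ids of bert-base-uncased, while the remaining ids are made up.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad id sequences to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Hypothetical id sequences of different lengths (token ids other than 101/102 are made up).
toy_batch = [[101, 7, 8, 9, 102], [101, 7, 102], [101, 5, 6, 102]]

ids, mask = pad_batch(toy_batch)
for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```

Every row now has the same length, so the batch can be stacked into a single rectangular tensor.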

## 87. Fine-tuning

Fine-tune the pretrained model for the sentiment polarity task using the training set, and measure the accuracy of the fine-tuned model on the validation set.

In [30]:
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
import numpy as np

# 1. Tokenizer and dataset conversion
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def tokenize(example):
    return tokenizer(example["sentence"], truncation=True, padding="max_length", max_length=128)

# Build the datasets from the raw-text DataFrames loaded in problem 85.
# The pre-tokenized lists in train_data1/dev_data1 make tokenizer(...) raise
# "ValueError: too many values to unpack (expected 2)".
train_dataset = Dataset.from_pandas(train_data).map(tokenize, batched=True)
dev_dataset = Dataset.from_pandas(dev_data).map(tokenize, batched=True)

# Expose only the columns the model needs, as PyTorch tensors
train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
dev_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

# 2. Load the model (binary classification)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# 3. Training configuration
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",   # named evaluation_strategy in older transformers releases
    save_strategy="epoch",   # must match the evaluation schedule for load_best_model_at_end
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    logging_dir="./logs",
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    save_total_limit=1,
)

# 4. Evaluation metric (computed with NumPy; load_metric is no longer available in recent datasets releases)
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

# 5. Create the trainer and train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=dev_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

# 6. Evaluate (accuracy on the validation set)
eval_result = trainer.evaluate()
print(f"\n✅ Validation Accuracy: {eval_result['eval_accuracy']:.4f}")
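
The accuracy requested here is the fraction of validation examples whose highest-scoring class matches the gold label. A minimal sketch in plain Python, with the hypothetical helper `accuracy` and made-up logits and labels:

```python
def accuracy(logits, labels):
    """Fraction of rows whose argmax equals the gold label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


# Hypothetical 2-class logits and gold labels (assumptions for illustration only).
toy_logits = [[0.2, 1.3], [2.0, -0.5], [0.1, 0.4], [1.0, 0.9]]
toy_labels = [1, 0, 0, 0]

print(accuracy(toy_logits, toy_labels))
```

Three of the four toy predictions match their labels; the third row predicts class 1 while the gold label is 0.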

## 88. Polarity analysis

Using the model fine-tuned in problem 87, predict the polarity of the following sentences.

- "The movie was full of incomprehensibilities."
- "The movie was full of fun."
- "The movie was full of excitement."
- "The movie was full of crap."
- "The movie was full of rubbish."
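
One way to approach this is to run the sentences through the fine-tuned classifier and turn each row of two logits into a label. The sketch below covers only that last step, under the GLUE SST-2 convention that label 0 is negative and label 1 is positive. The helper `polarity_from_logits` is hypothetical and the logits are made up; they are not predictions from an actual model.

```python
import math

# GLUE SST-2 convention: label 0 is negative, label 1 is positive.
ID2LABEL = {0: "negative", 1: "positive"}


def polarity_from_logits(logits):
    """Map one row of 2-class logits to (label, probability of that label)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Made-up logits for two placeholder inputs (assumptions, not model output).
toy_logits = {"sentence A": [-2.0, 3.0], "sentence B": [2.5, -1.5]}

for name, logits in toy_logits.items():
    label, prob = polarity_from_logits(logits)
    print(f"{name}: {label} ({prob:.4f})")
```

With the model from problem 87, the logits could be obtained under `torch.no_grad()` as `model(**inputs).logits`, where `inputs = tokenizer(sentences, padding=True, return_tensors="pt")` has been moved to the model's device.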


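## 89. Changing the architecture

Design a classification model with an architecture different from the one in problem 87 (for example, using the [CLS] token, or max pooling over the tokens), and fine-tune the pretrained model for the sentiment polarity task. Measure the accuracy of the fine-tuned model on the validation set.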
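
As a rough illustration of the max-pooling variant, the sketch below takes the element-wise maximum over the non-padding token vectors and feeds the result to a linear classification head. It is plain Python with hypothetical names (`masked_max_pool`, `linear`) and made-up numbers; an actual solution would implement the same two steps as a PyTorch module on top of `BertModel` and learn the head weights during fine-tuning.

```python
def masked_max_pool(token_vectors, attention_mask):
    """Element-wise maximum over the token vectors whose mask entry is 1."""
    dim = len(token_vectors[0])
    pooled = [float("-inf")] * dim
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:  # padding positions never contribute to the maximum
            pooled = [max(p, x) for p, x in zip(pooled, vector)]
    return pooled


def linear(features, weights, bias):
    """Classification head: one logit per class from the pooled sentence vector."""
    return [sum(w * x for w, x in zip(row, features)) + b for row, b in zip(weights, bias)]


# Hypothetical final-layer vectors: 3 real tokens and 1 padding position, 2 dimensions.
toy_vectors = [[0.5, -1.0], [2.0, 0.0], [-3.0, 1.5], [9.0, 9.0]]
toy_mask = [1, 1, 1, 0]
# Hypothetical head weights for 2 classes; in practice these are learned.
toy_weights = [[1.0, 0.0], [0.0, 1.0]]
toy_bias = [0.0, 0.0]

pooled = masked_max_pool(toy_vectors, toy_mask)
print(pooled)
print(linear(pooled, toy_weights, toy_bias))
```

Masking before the maximum matters for the same reason as in mean pooling: without it, the padding vector would dominate every dimension.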