<a href="https://colab.research.google.com/github/haru1489248/nlp-100-nock/blob/main/ch09/section_88.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 88. Polarity Analysis
Using the model fine-tuned in Problem 87, predict the polarity of the following sentences.

“The movie was full of incomprehensibilities.”

“The movie was full of fun.”

“The movie was full of excitement.”

“The movie was full of crap.”

“The movie was full of rubbish.”



In [1]:
!pip install -U transformers evaluate

Collecting transformers
  Downloading transformers-5.0.0-py3-none-any.whl.metadata (37 kB)
Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Collecting huggingface-hub<2.0,>=1.3.0 (from transformers)
  Downloading huggingface_hub-1.3.5-py3-none-any.whl.metadata (13 kB)
Downloading transformers-5.0.0-py3-none-any.whl (10.1 MB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Downloading huggingface_hub-1.3.5-py3-none-any.whl (536 kB)
Installing collected packages: huggingface-hub, transformers, evaluate
  Attempting uninstall: huggingface-hub
    Found existing installation: huggin

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import csv
import torch
import numpy as np
from torch.utils.data import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)
import evaluate # Hugging Face's official evaluation library

In [4]:
class SST2Dataset(Dataset):
  def __init__(self, sentences, labels, tokenizer):
    super().__init__()
    self.encodings = tokenizer(sentences, truncation=True) # padding is handled by the collator
    self.labels = labels

  def __len__(self):
    return len(self.labels)

  def __getitem__(self, idx):
    # items() is a method of Python's dict.
    # encodings is a dict-compatible BatchEncoding object, so items() works on it
    item = {k: v[idx] for k, v in self.encodings.items()}
    item["labels"] = torch.tensor(self.labels[idx], dtype=torch.long)
    return item

In [None]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

In [6]:
def load_SST2(path):
  sentences, labels = [], []
  with open(path, 'r', encoding='utf-8', newline='') as f:
    reader = csv.DictReader(f, delimiter='\t')
    for row in reader:
      sentences.append(row['sentence'])
      labels.append(int(row['label']))
  return sentences, labels
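
The loader above expects a tab-separated file whose header row names a `sentence` column and a `label` column. As a minimal sketch of that layout, the snippet below parses a hypothetical two-row sample held in memory (the sample text is invented for illustration and is not taken from SST-2):

```python
import csv
import io

# Hypothetical two-row sample in the layout load_SST2 expects:
# a header line, then tab-separated "sentence" and "label" columns.
sample = "sentence\tlabel\na charming film\t1\na tedious mess\t0\n"

reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
rows = [(row["sentence"], int(row["label"])) for row in reader]
print(rows)  # [('a charming film', 1), ('a tedious mess', 0)]
```

`csv.DictReader` uses the first line as field names, which is why each row can be accessed by `row['sentence']` and `row['label']`.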

In [7]:
train_src = '/content/drive/MyDrive/SST-2/train.tsv'
dev_src = '/content/drive/MyDrive/SST-2/dev.tsv'

In [8]:
train_sentences, train_labels = load_SST2(train_src)
dev_sentences, dev_labels = load_SST2(dev_src)

In [9]:
train_dataset = SST2Dataset(train_sentences, train_labels, tokenizer)
dev_dataset = SST2Dataset(dev_sentences, dev_labels, tokenizer)

In [10]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer) # collator that pads each batch to the length of its longest sequence
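
To make per-batch (dynamic) padding concrete, here is a rough pure-Python sketch of the idea. It is not the library's implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and `pad_id=0` is assumed to be the padding token id.

```python
def pad_batch(features, pad_id=0):
    """Pad every sequence in a batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        # Real tokens keep mask 1; padded positions get mask 0.
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
    return batch

# Hypothetical token ids; real ids come from the tokenizer.
features = [
    {"input_ids": [101, 2023, 102]},
    {"input_ids": [101, 2023, 3185, 2001, 102]},
]
batch = pad_batch(features)
print(batch["input_ids"])       # [[101, 2023, 102, 0, 0], [101, 2023, 3185, 2001, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding only up to the longest sequence in each batch avoids padding every example to one global maximum length, which is why the dataset class tokenizes without `padding`.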

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2 # number of classes
)

In [None]:
accuracy = evaluate.load("accuracy")

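### What is compute_metrics?
A function that computes the evaluation metric (here, accuracy).
- What is eval_pred?
  - A tuple of (logits: the model outputs, labels: the gold labels)
  - logits.shape: (N, num_labels)
    - one score per class for each sample
  - labels.shape: (N,)
    - gold labels (0 or 1)
- What are logits?
  - BERT's classification model returns them as `outputs.logits`, where `outputs = model(...)`
  - Example (2 classes):

```
logits = [
  [2.3, -0.8], # sample 1
  [-1.1, 0.4], # sample 2
]
```
  - Scores before softmax
  - The class with the larger score is the model's prediction
- What does `np.argmax(logits, axis=-1)` do?
  - It takes the index of the highest-scoring class for each sample
  ```
  [2.3, -0.8] -> 0
  [-1.1, 0.4] -> 1
  ```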
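
The argmax step and the accuracy it feeds can be checked on a tiny example. The sketch below uses invented logits and labels, and computes accuracy by hand with NumPy instead of the `evaluate` metric, so it only illustrates the arithmetic:

```python
import numpy as np

# Hypothetical logits for three samples and two classes (0 = negative, 1 = positive).
logits = np.array([
    [2.3, -0.8],
    [-1.1, 0.4],
    [0.2, 0.1],
])
labels = np.array([0, 1, 1])  # hypothetical gold labels

preds = np.argmax(logits, axis=-1)                 # index of the highest score in each row
accuracy_value = float((preds == labels).mean())   # fraction of correct predictions

print(preds)           # [0 1 0]
print(accuracy_value)  # 2 of 3 predictions match the labels
```

The third sample is predicted as class 0 because 0.2 > 0.1, even though the margin is small; argmax ignores how confident the model is.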



In [13]:
def compute_metrics(eval_pred):
  logits, labels = eval_pred
  preds = np.argmax(logits, axis=-1)
  return accuracy.compute(predictions=preds, references=labels)

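### What is TrainingArguments?
The training configuration passed to the Trainer.
- output_dir: directory where the training results are saved
- eval_strategy: when to evaluate
  - `"epoch"`: at the end of every epoch
  - `"steps"`: every fixed number of steps
  - `"no"`: never
- save_strategy: when to save the model
- learning_rate: learning rate (2e-5 = 2*10^{-5})
- per_device_train_batch_size: training batch size per device (GPU or CPU)
- per_device_eval_batch_size: evaluation batch size (it can be larger than the training batch size because no gradients are computed)
- num_train_epochs: number of passes over the training data
- weight_decay: coefficient of the weight decay term, an L2-style regularizer that keeps the weights from growing too large
- logging_steps: how often, in steps, to log (loss and learning_rate are shown)

One step = one optimizer update (roughly one processed batch).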
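
To see what these settings imply in terms of steps, the following back-of-the-envelope sketch may help. The training-set size is an assumption (this notebook does not state it; 67,349 is the commonly cited size of the GLUE SST-2 training split), and a single device without gradient accumulation is assumed.

```python
import math

# Assumed value: the training-set size is NOT stated in this notebook.
num_train_examples = 67_349
per_device_train_batch_size = 16
num_train_epochs = 2
logging_steps = 50

# Assumes a single device and no gradient accumulation,
# so one batch corresponds to one optimizer step.
steps_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
num_log_events = total_steps // logging_steps

print(steps_per_epoch, total_steps, num_log_events)  # 4210 8420 168
```

With `eval_strategy="epoch"` and `save_strategy="epoch"`, evaluation and checkpointing would then happen twice under these assumptions, once at the end of each epoch.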

In [14]:
training_args = TrainingArguments(
    output_dir="sst2-bert",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_steps=50,
)

In [15]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=dev_dataset,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()
result = trainer.evaluate()
print(result) # the result includes eval_accuracy

In [17]:
sentences = [
    "The movie was full of incomprehensibilities.",
    "The movie was full of fun.",
    "The movie was full of excitement.",
    "The movie was full of crap.",
    "The movie was full of rubbish."
]

In [19]:
encodings = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

In [23]:
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
with torch.no_grad():
  logits = model(**encodings.to(device)).logits
  pred_ids = torch.argmax(logits, dim=-1) # dim=-1 selects the last dimension (the class dimension: negative, positive)
  prods = torch.softmax(logits, dim=-1) # turn the logits into probabilities that sum to 1
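
What softmax does to a pair of logits can be reproduced without PyTorch. The sketch below is a plain-Python illustration with invented logits; `softmax` here is a hypothetical helper, not the `torch.softmax` implementation.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.3, -0.8]  # hypothetical logits for one sentence (negative, positive)
probs = softmax(logits)
pred_id = max(range(len(probs)), key=probs.__getitem__)  # same class as argmax of the logits

print(probs, sum(probs), pred_id)
```

Because softmax preserves the ordering of the scores, taking argmax over the probabilities gives the same class as taking argmax over the logits; the probabilities only add a sense of how decisive the prediction is.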

In [24]:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

In [26]:
for text, pid, pr in zip(sentences, pred_ids, prods):
  neg, pos = pr.tolist()
  print(f"{text}\n pred={id2label[int(pid)]} (NEG={neg:.3f}, POS={pos:.3f})\n")

The movie was full of incomprehensibilities.
 pred=NEGATIVE (NEG=0.998, POS=0.002)

The movie was full of fun.
 pred=POSITIVE (NEG=0.000, POS=1.000)

The movie was full of excitement.
 pred=POSITIVE (NEG=0.000, POS=1.000)

The movie was full of crap.
 pred=NEGATIVE (NEG=0.999, POS=0.001)

The movie was full of rubbish.
 pred=NEGATIVE (NEG=0.999, POS=0.001)

