<a href="https://colab.research.google.com/github/kooose38/pytools_nlp/blob/master/Bert%E5%AE%9F%E8%A3%85%E5%81%8F_20220225.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [7]:
!pip install -q transformers fugashi ipadic



In [127]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from pprint import pprint 
import os 
import time 
from tqdm.auto import tqdm 

import torch
import torch.nn as nn 
from torch.utils.data import Dataset, DataLoader
from transformers import BertJapaneseTokenizer, BertForMaskedLM

plt.style.use("ggplot")

# Converting words to `id`s

[https://huggingface.co/docs/transformers/main_classes/tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer)

In [9]:
text = "私は犬が好きだ"
tk = BertJapaneseTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")

In [10]:
tk.tokenize(text)

['私', 'は', '犬', 'が', '好き', 'だ']

In [11]:
tk.encode(text)

[2, 1325, 9, 2928, 14, 3596, 75, 3]

In [12]:
tk.encode_plus(text)

{'input_ids': [2, 1325, 9, 2928, 14, 3596, 75, 3], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

In [15]:
# Pad to a fixed sequence length and return the result as tensors
result = tk.encode_plus(text, padding="max_length", max_length=20, return_tensors="pt")
pprint(result)

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'input_ids': tensor([[   2, 1325,    9, 2928,   14, 3596,   75,    3,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}
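The padded output above can be reproduced by hand to see what `padding="max_length"` does. This is a minimal sketch, not the tokenizer's actual implementation; the helper name `pad_to_max_length` is hypothetical, and the pad id of 0 is an assumption read off the output above.

```python
def pad_to_max_length(input_ids, max_length, pad_id=0):
    """Right-pad a list of token ids and build the matching attention mask.

    pad_id=0 is an assumption taken from the padded output shown above.
    """
    if len(input_ids) > max_length:
        raise ValueError("sequence is longer than max_length; truncate it first")
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": list(input_ids) + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding that the model should ignore
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
    }


token_ids = [2, 1325, 9, 2928, 14, 3596, 75, 3]  # ids of "私は犬が好きだ" from above
padded = pad_to_max_length(token_ids, max_length=20)
print(padded["input_ids"])
print(padded["attention_mask"])
```

The attention mask is what lets the model tell the 8 real tokens apart from the 12 padding positions.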


In [19]:
tk.decode([2, 1325, 9, 2928, 14, 3596, 75, 3])

'[CLS] 私 は 犬 が 好き だ [SEP]'

# Predicting masked words

In [21]:
tk.mask_token_id # mask_id

4

In [59]:
ids = tk.encode(text)
ids[3] = tk.mask_token_id # mask the token "犬" (dog)
ids

[2, 1325, 9, 4, 14, 3596, 75, 3]

In [60]:
tk.decode(ids) # check the result

'[CLS] 私 は [MASK] が 好き だ [SEP]'

The input has shape `[batch size, sequence length]`.

In [28]:
x = torch.LongTensor(ids).unsqueeze(0) # convert to a model input tensor
x

tensor([[   2, 1325,    9,    4,   14, 3596,   75,    3]])

In [29]:
model = BertForMaskedLM.from_pretrained("cl-tohoku/bert-base-japanese")


Some weights of the model checkpoint at cl-tohoku/bert-base-japanese were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


The output has shape `[batch size, sequence length, vocabulary size]`.

In [55]:
y = model(x).logits
y.size()

torch.Size([1, 8, 32000])

In [58]:
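softmax = nn.Softmax(dim=-1)
y = softmax(y)  # convert logits to probabilities over the vocabulary
k = 5
result = torch.topk(y.squeeze(0)[3], k=k)  # top-k candidates for the masked position
for i in range(k):
    prob = round(result.values[i].item() * 100.0, 2)
    res = result.indices[i].item()
    ids[3] = res  # put the candidate token in place of [MASK]
    pred = tk.decode(ids, skip_special_tokens=True)
    print("Prediction: ", pred, " Probability: ", prob, "%")

Prediction:  私 は サッカー が 好き だ  Probability:  4.15 %
Prediction:  私 は あなた が 好き だ  Probability:  2.86 %
Prediction:  私 は 僕 が 好き だ  Probability:  2.57 %
Prediction:  私 は 野球 が 好き だ  Probability:  2.34 %
Prediction:  私 は 音楽 が 好き だ  Probability:  1.9 %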
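The cell above applies a softmax over the vocabulary axis and then keeps the top-k entries. The same two steps can be sketched in plain Python on a made-up score vector, which may make the ranking easier to follow. The function names and the toy scores are illustrative assumptions, not part of PyTorch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k):
    """Return (index, probability) pairs for the k largest probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in order[:k]]


toy_logits = [2.0, 0.5, 3.0, -1.0]  # made-up scores for a 4-token vocabulary
probs = softmax(toy_logits)
best = top_k(probs, k=2)
print(best)
```

In the notebook the vector has 32,000 entries (one per vocabulary token) instead of 4, and the indices are decoded back to tokens with `tk.decode`.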


Methods for predicting multiple masked tokens at once:

* Greedy search
* Beam search
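As a rough illustration of the first method, one common reading of greedy search fills the masks from left to right, committing to the best-scoring token at each masked position before moving on. The sketch below assumes that reading. `greedy_fill_masks`, `score_fn` and `toy_score_fn` are hypothetical names, and the toy scorer is a stand-in for a real masked language model so the example runs without downloading weights.

```python
def greedy_fill_masks(ids, mask_id, score_fn):
    """Fill each masked position, left to right, with its highest-scoring token.

    score_fn(ids) must return one list of vocabulary scores per position.
    """
    filled = list(ids)
    for pos, token in enumerate(ids):
        if token != mask_id:
            continue
        scores = score_fn(filled)[pos]  # re-score using the tokens chosen so far
        filled[pos] = max(range(len(scores)), key=scores.__getitem__)
    return filled


MASK_ID = 5      # toy mask id (assumption for this example only)
VOCAB_SIZE = 6   # toy vocabulary: ids 0..5


def toy_score_fn(ids):
    """Toy scorer: always prefers (previous token + 1) mod 5."""
    rows = []
    for pos in range(len(ids)):
        prev = ids[pos - 1] if pos > 0 else 0
        row = [0.0] * VOCAB_SIZE
        row[(prev + 1) % 5] = 1.0
        rows.append(row)
    return rows


print(greedy_fill_masks([0, MASK_ID, MASK_ID, 3], MASK_ID, toy_score_fn))
```

With the model loaded earlier, `score_fn` could plausibly be `lambda ids: model(torch.LongTensor(ids).unsqueeze(0)).logits[0].tolist()`. Beam search differs in that it keeps several candidate sequences at each step instead of only the best one.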

# Text classification

## Building with BertModel

In [119]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from transformers import BertTokenizer, BertModel, BertForSequenceClassification
from torch.optim import AdamW  # transformers.AdamW is deprecated; use the PyTorch optimizer

In [61]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
!unzip smsspamcollection.zip

--2022-02-24 08:58:17--  https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 203415 (199K) [application/x-httpd-php]
Saving to: ‘smsspamcollection.zip’


2022-02-24 08:58:18 (5.60 MB/s) - ‘smsspamcollection.zip’ saved [203415/203415]

Archive:  smsspamcollection.zip
  inflating: SMSSpamCollection       
  inflating: readme                  


In [118]:
class SpamConfig:
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    MODEL_TYPE = "bert-base-uncased"
    batch_size = 64 
    epoch = 3 
    lr = 1e-6 
    debug = True

cfg = SpamConfig()


This section uses the [UCI SMS Spam Collection Data Set](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). It contains 5,572 SMS messages, 747 of which are spam.

In [79]:
df = pd.read_csv("SMSSpamCollection", sep='\t', header=None)
df.head()

|   | 0 | 1 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [80]:
# Create the labels
label2id = {"ham": 0, "spam": 1}
df.columns = ["label", "text"]
df["label"] = df.label.map(label2id)

# Split into training and test data
train_df, test_df = train_test_split(df, stratify=df.label, random_state=123, test_size=0.3)

In [84]:
# Load the tokenizer
tk = BertTokenizer.from_pretrained(cfg.MODEL_TYPE)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Define a dataset class in the usual PyTorch style.

In [96]:
class SpamDataset(Dataset):
    def __init__(self, df, tk):
        self.df = df.reset_index(drop=True)
        self.tk = tk 

    def __len__(self):
        return self.df.shape[0]

    def __getitem__(self, idx):
        dat = self.df.iloc[idx]
        text, label = dat["text"], dat["label"]
        token = self.tk.encode_plus(
            text,
            max_length=64, # kept short to reduce training time
            padding="max_length",
            truncation=True
        )

        return {
            "input_ids": torch.tensor(token["input_ids"], dtype=torch.long), 
            "attention_mask": torch.tensor(token["attention_mask"], dtype=torch.long), 
            "token_type_ids": torch.tensor(token["token_type_ids"], dtype=torch.long),
            "target": torch.tensor(label, dtype=torch.long)
        }

In [102]:
# Try fetching a single sample
ds = SpamDataset(train_df, tk)
ds[2]

{'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 'input_ids': tensor([  101,  2253,  2000, 25957,  9953,  4377,  4497,   102,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0]),
 'target': tensor(0),
 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])}

Build the model.

In [122]:
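class MySpamBertClassification(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.base = BertModel.from_pretrained(cfg.MODEL_TYPE)
        # Freeze the BERT encoder to shorten training time.
        # For better accuracy, comment out the following two lines.
        for param in self.base.parameters():
            param.requires_grad = False

        self.fc = nn.Linear(768, 2)  # hidden size of bert-base -> 2 classes (ham / spam)

    def forward(self, ids, mask, token_type_ids):
        """Forward pass: classify from the final hidden state of the [CLS] token."""
        # [0] is the last hidden state, shape [batch, seq_len, 768]
        out = self.base(ids, attention_mask=mask, token_type_ids=token_type_ids)[0]
        out = out[:, 0, :]  # keep only position 0, i.e. the [CLS] token
        out = self.fc(out)
        return out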
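The forward pass keeps only the hidden state at position 0 (the `[CLS]` token) and feeds it to a linear layer. A small sketch with plain Python lists shows the same indexing and matrix product on made-up numbers. `cls_logits` is a hypothetical helper, and the 3-dimensional hidden states stand in for BERT's 768 dimensions.

```python
def cls_logits(hidden_states, weight, bias):
    """Apply a linear layer to the first token's vector of every sequence.

    hidden_states: [batch][seq_len][hidden]
    weight:        [n_classes][hidden]  (same layout as nn.Linear.weight)
    bias:          [n_classes]
    """
    logits = []
    for sequence in hidden_states:
        cls_vec = sequence[0]  # equivalent to out[:, 0, :] for one sample
        logits.append([
            sum(w * h for w, h in zip(row, cls_vec)) + b
            for row, b in zip(weight, bias)
        ])
    return logits


# Made-up values: batch of 2 sequences, 2 tokens each, hidden size 3
hidden = [
    [[1.0, 2.0, 3.0], [9.0, 9.0, 9.0]],
    [[0.0, 1.0, 0.0], [5.0, 5.0, 5.0]],
]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
bias = [0.5, -0.5]
print(cls_logits(hidden, weight, bias))
```

The vectors of the second token never affect the result, which mirrors how the classifier above ignores every position except `[CLS]`.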

In [123]:
def train_f(dataloader, model, criterion, optimizer, cfg, epoch):
    """Train the model for one epoch and return it with the mean loss."""
    device = cfg.device
    # model.to(device)
    model.train()
    losses = []
    for x in tqdm(dataloader, total=len(dataloader)):
        # Move the batch to the device (GPU if available)
        ids = x["input_ids"].to(device)
        mask = x["attention_mask"].to(device)
        types = x["token_type_ids"].to(device)
        target = x["target"].to(device)

        output = model(ids, mask, types)
        loss = criterion(output, target)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        losses.append(loss.item())
    mean_loss = np.mean(losses)
    print(f"epoch: {epoch+1} | loss: {mean_loss}")
    return model, mean_loss

Run the training.


In [124]:
model = MySpamBertClassification(cfg)
model.to(cfg.device)
optimizer = AdamW(model.parameters(), lr=cfg.lr) # optimizer
criterion = nn.CrossEntropyLoss() # loss function

train_ds = SpamDataset(train_df, tk)
train_dl = DataLoader(
    train_ds,
    batch_size=cfg.batch_size,
    shuffle=True,
    drop_last=True,
    num_workers=2,
    pin_memory=True
)
losses = []
for e in range(1 if cfg.debug else cfg.epoch):
    model, loss = train_f(train_dl, model, criterion, optimizer, cfg, e)
    losses.append(loss)





epoch: 1 | loss: 0.9260820696751276
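The training `DataLoader` above sets `drop_last=True`, so an incomplete final batch is discarded. The number of batches per epoch can be estimated with a short sketch. `num_batches` is a hypothetical helper, and rounding the test split size up is an assumption about how the 30% split is sized.

```python
import math


def num_batches(n_samples, batch_size, drop_last):
    """Number of batches a DataLoader-style iterator yields per epoch."""
    if drop_last:
        return n_samples // batch_size       # partial last batch is discarded
    return math.ceil(n_samples / batch_size)  # partial last batch is kept


n_total = 5572                      # dataset size stated earlier
n_test = math.ceil(n_total * 0.3)   # assumption: the test size is rounded up
n_train = n_total - n_test

print(num_batches(n_train, 64, drop_last=True))   # training loader settings
print(num_batches(n_test, 64, drop_last=False))   # loader that keeps every sample
```

Dropping the last batch keeps every training batch the same size; for evaluation, `drop_last=False` matters because every sample needs a prediction.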


In [None]:
plt.plot(losses)

In [125]:
# Save the model
torch.save(model.state_dict(), "trained.model")

In [None]:
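# Load the model
model = MySpamBertClassification(cfg)
# map_location lets a checkpoint saved on GPU load on a CPU-only machine
model.load_state_dict(torch.load("trained.model", map_location=cfg.device))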
model.to(cfg.device)
model.eval()

In [129]:
def test_f(dataloader, model, cfg):
    """Return the predicted class id for every sample in the dataloader."""
    predict = []
    device = cfg.device
    model.eval()  # disable dropout for inference
    with torch.no_grad():
        for x in tqdm(dataloader, total=len(dataloader)):
            # Move the batch to the device (GPU if available)
            ids = x["input_ids"].to(device)
            mask = x["attention_mask"].to(device)
            types = x["token_type_ids"].to(device)

            output = model(ids, mask, types)
            # index of the larger logit = predicted class
            output = torch.argmax(output, dim=-1).cpu().numpy()
            predict.extend(output)
    return np.array(predict)

Compute the predictions.

In [131]:
test_ds = SpamDataset(test_df, tk)
test_dl = DataLoader(
    test_ds, 
    batch_size=cfg.batch_size,
    shuffle=False, 
    drop_last=False
)

predict = test_f(test_dl, model, cfg)


In [132]:
# Confusion matrix
test_df["predict"] = predict
confusion_matrix(test_df["label"], test_df["predict"])

array([[  18, 1430],
       [   1,  223]])
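In scikit-learn's `confusion_matrix`, rows are the true labels and columns are the predicted labels, so with `ham = 0` and `spam = 1` the matrix above reads `[[TN, FP], [FN, TP]]`. A short sketch turns it into summary metrics. `binary_metrics` is a hypothetical helper written for this example.

```python
def binary_metrics(cm):
    """Summarise a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "spam_precision": tp / (tp + fp) if tp + fp else 0.0,
        "spam_recall": tp / (tp + fn) if tp + fn else 0.0,
    }


cm = [[18, 1430], [1, 223]]  # values from the confusion matrix above
metrics = binary_metrics(cm)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Read this way, 1,430 of the 1,448 ham messages were predicted as spam, so spam recall is high but overall accuracy is low for this run.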

## Building with BertForSequenceClassification

To fine-tune, specify which layers to train.
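One way to specify the trainable layers is to switch `requires_grad` on or off by parameter name. The sketch below is an assumption about how this could be done, not the notebook's own code: `set_trainable` is a hypothetical helper, and the parameter names are illustrative stand-ins for what `model.named_parameters()` would yield, so the example runs without loading a model.

```python
from types import SimpleNamespace


def set_trainable(named_params, trainable_prefixes):
    """Enable gradients only for parameters whose name starts with a given prefix.

    named_params: iterable of (name, parameter) pairs, e.g. model.named_parameters().
    Returns the names of the parameters left trainable.
    """
    prefixes = tuple(trainable_prefixes)
    trainable = []
    for name, param in named_params:
        param.requires_grad = name.startswith(prefixes)
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Stand-ins for real parameters; the names are illustrative assumptions
fake_params = [
    ("bert.embeddings.word_embeddings.weight", SimpleNamespace(requires_grad=True)),
    ("bert.encoder.layer.10.output.dense.weight", SimpleNamespace(requires_grad=True)),
    ("bert.encoder.layer.11.output.dense.weight", SimpleNamespace(requires_grad=True)),
    ("classifier.weight", SimpleNamespace(requires_grad=True)),
]

# Train only the last encoder layer and the classification head
trainable_names = set_trainable(fake_params, ["bert.encoder.layer.11.", "classifier."])
print(trainable_names)
```

With a real model, the call would plausibly be `set_trainable(model.named_parameters(), [...])`, after printing the parameter names once to confirm the prefixes.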