<h2 id="80-id番号への変換">80. ID番号への変換</h2>
<p>問題51で構築した学習データ中の単語にユニークなID番号を付与したい．学習データ中で最も頻出する単語に<code class="language-plaintext highlighter-rouge">1</code>，2番目に頻出する単語に<code class="language-plaintext highlighter-rouge">2</code>，……といった方法で，学習データ中で2回以上出現する単語にID番号を付与せよ．そして，与えられた単語列に対して，ID番号の列を返す関数を実装せよ．ただし，出現頻度が2回未満の単語のID番号はすべて<code class="language-plaintext highlighter-rouge">0</code>とせよ．</p>


In [1]:
# データ分割
import pandas as pd
from sklearn.model_selection import train_test_split

# csvファイルを読み込む
path = "/Users/nyuton/Documents/100knock-2023/trainee_nyutonn/chapter08/data/newsCorpora.csv"
df = pd.read_table(path, header=None, sep='\\t', engine='python')
df.columns = ['ID', 'TITLE', 'URL', 'PUBLISHER', 'CATEGORY', 'STORY', 'HOSTNAME', 'TIMESTAMP']

# PUBLISHERが特定の行のみを取り出す
publishers = ['Reuters', 'Huffington Post', 'Businessweek', 'Contactmusic.com', 'Daily Mail']
daily_mails = df[df['PUBLISHER'].isin(publishers)]

# 訓練データ、検証データ、テストデータに分ける
train_data, non_train, train_target, non_train_target = train_test_split(daily_mails[['TITLE', 'CATEGORY']], daily_mails['CATEGORY'], train_size=0.8, random_state=10, stratify=daily_mails['CATEGORY'])
valid_data, test_data, valid_target, test_target = train_test_split(non_train, non_train_target, train_size=0.5, random_state=10,  stratify=non_train_target)
print(len(train_data), len(valid_data), len(test_data))

# テキストファイルに書き込む
train_data.to_csv('work/train.txt', header=None, index=None, sep='\t')
valid_data.to_csv('work/valid.txt', header=None, index=None, sep='\t')
test_data.to_csv('work/test.txt', header=None, index=None, sep='\t')
print(len(train_data))
train_data.head()

10684 1336 1336
10684


Unnamed: 0,TITLE,CATEGORY
409106,Selena Gomez exposes her derriere in VERY shor...,e
290867,Hillshire Says Tyson Foods Bid Superior to Pin...,b
36532,'Friends saw him hit me': Johnny Weir opens up...,e
358830,'As funny as a liver transplant!' Melissa McCa...,e
67622,Piers Morgan Delivers One Final Blow To Gun Vi...,e


In [2]:
# 単語の回数を数え上げ
from collections import Counter, defaultdict
from nltk.tokenize import word_tokenize


def make_vocab(train_data):
    vocab = defaultdict(int)
    for id, (title, category) in train_data.iterrows():
        # words = title.split()
        words = word_tokenize(title)
        for word in words:
            vocab[word] += 1
    vocab = Counter(vocab)
    return vocab
# vocab.most_common()

In [3]:
from nltk.tokenize import word_tokenize

vocab = make_vocab(train_data)

# 単語列から出現頻度インデックスを返す関数
def sentence2index(sentence):
    # 文を単語列に分割
    words = word_tokenize(sentence)
    # 単語のみのリストに分割する
    vocab_order, cnt_list = zip(*vocab.most_common())
    index_output = []
    for word in words:
        # 語彙にないときは 0
        if word not in vocab:
            index = 0
        # 回数が1のときも0
        elif cnt_list[vocab_order.index(word)] == 1:
            index = 0
        # 語彙にあるとき，0 インデックスなので +1 する
        else:  
            index = vocab_order.index(word) + 1

        index_output.append(index)
    return index_output


sentence = "Kim Kardashian Takes The Plunge In A Simple Black Tee"
print(sentence)
print(sentence2index(sentence))

Kim Kardashian Takes The Plunge In A Simple Black Tee
[39, 35, 581, 14, 4855, 20, 24, 8956, 794, 0]


<h2 id="81-rnnによる予測">81. RNNによる予測</h2>
<p>ID番号で表現された単語列\(\boldsymbol{x} = (x_1, x_2, \dots, x_T)\)がある．ただし，\(T\)は単語列の長さ，\(x_t \in \mathbb{R}^{V}\)は単語のID番号のone-hot表記である（\(V\)は単語の総数である）．再帰型ニューラルネットワーク（RNN: Recurrent Neural Network）を用い，単語列\(\boldsymbol{x}\)からカテゴリ\(y\)を予測するモデルとして，次式を実装せよ．</p>

<p>\[\overrightarrow{h}_0 = 0, \\
\overrightarrow{h}_t = {\rm \overrightarrow{RNN}}(\mathrm{emb}(x_t), \overrightarrow{h}_{t-1}), \\
y = {\rm softmax}(W^{(yh)} \overrightarrow{h}_T + b^{(y)}))\]</p>

<p>ただし，\(\mathrm{emb}(x) \in \mathbb{R}^{d_w}\)は単語埋め込み（単語のone-hot表記から単語ベクトルに変換する関数），\(\overrightarrow{h}_t \in \mathbb{R}^{d_h}\)は時刻\(t\)の隠れ状態ベクトル，\({\rm \overrightarrow{RNN}}(x,h)\)は入力\(x\)と前時刻の隠れ状態\(h\)から次状態を計算するRNNユニット，\(W^{(yh)} \in \mathbb{R}^{L \times d_h}\)は隠れ状態ベクトルからカテゴリを予測するための行列，\(b^{(y)} \in \mathbb{R}^{L}\)はバイアス項である（\(d_w, d_h, L\)はそれぞれ，単語埋め込みの次元数，隠れ状態ベクトルの次元数，ラベル数である）．RNNユニット\({\rm \overrightarrow{RNN}}(x,h)\)には様々な構成が考えられるが，典型例として次式が挙げられる．</p>

<p>\[{\rm \overrightarrow{RNN}}(x,h) = g(W^{(hx)} x + W^{(hh)}h + b^{(h)}))\]</p>

<p>ただし，\(W^{(hx)} \in \mathbb{R}^{d_h \times d_w}，W^{(hh)} \in \mathbb{R}^{d_h \times d_h}, b^{(h)} \in \mathbb{R}^{d_h}\)はRNNユニットのパラメータ，\(g\)は活性化関数（例えば\(\tanh\)やReLUなど）である．</p>
<p>なお，この問題ではパラメータの学習を行わず，ランダムに初期化されたパラメータで\(y\)を計算するだけでよい．次元数などのハイパーパラメータは，\(d_w = 300, d_h=50\)など，適当な値に設定せよ（以降の問題でも同様である）．</p>


In [4]:
import torch
import torch.nn as nn
class RNNmodel(nn.Module):
    def __init__(self, vocab_size, padding_idx, emb_size=300, hidden_size=50, n_labels=4) -> None:
        super(RNNmodel, self).__init__()
        self.hidden_size = hidden_size
        # 入力ベクトルの大きさが異なるので，emb層で形をそろえる
        # Embedding で次元が１つ増える
        self.emb = nn.Embedding(vocab_size, emb_size, padding_idx=padding_idx)
        # batch_first とは？ -> バッチサイズの次元を前に持ってくるかどうか
        # RNN は変な順番になっているので，batch_first をするといい感じの順番になる
        # self.rnn = nn.RNN(emb_size, hidden_size, nonlinearity='tanh', batch_first=True)
        self.rnn = nn.RNN(emb_size, hidden_size, nonlinearity='tanh')
        self.func = nn.Linear(hidden_size, n_labels)

    def forward(self, x):
        # self.batch_size = x.size()[0]
        # h0 = self.init_hidden(x.device)  # h0 の初期化 ゼロベクトル
        h0 = torch.zeros(1, self.hidden_size)
        # h0 = torch.zeros(1, self.batch_size, self.hidden_size)
        emb = self.emb(x)  # 入力サイズが異なるので統一する
        # RNN は出力が２つ返ってくる
        # 1つめは全部の出力，2つめは最後の出力
        # 2つめの出力だと，そのまま渡せる
        x_rnn, h_last = self.rnn(emb, h0)  # RNN

        # print(h_last.shape)
        # print(h_last[:, -1].shape)
        # print(h_last[:, -1])
        # 一般の３つの軸：バッチサイズ，系列長，語彙の次元数

        # これを実行すると次元数のちがいでエラーになる
        # -> 入力した x がバッチサイズを含んでいないから
        # out = self.func(x_rnn[:, -1, :]) # 最後の層だけ取り出す
        # これもエラーになる
        # -> 軸の順番が 系列長，語彙の次元数，だったから
        # out = self.func(x_rnn[:, -1]) # 最後の層だけ取り出す
        # out = self.func(x_rnn[-1, :]) # これだとうまくいく！！
        out = self.func(h_last) # 現在のhだけ取り出す
        return out

    # 隠れ層の初期化
    # def init_hidden(self, device):
    #     hidden = torch.zeros(1, self.batch_size, self.hidden_size, device=device)
    #     return hidden

In [5]:
from torch.utils.data import Dataset

class CreateDataset(Dataset):
    def __init__(self, X, y, tokenizer):
        self.X = X
        self.y = y
        self.tokenizer = tokenizer

    # len(Dataset) で返す値を指定
    def __len__(self):
        return len(self.y)
    
    # Dataset[index] で返す値を指定
    def __getitem__(self, index):
        text = self.X.iloc[index]
        input_features = self.tokenizer(text)
        label = self.y.iloc[index]
        return {
            'inputs': torch.tensor(input_features, dtype=torch.int64),
            'labels': torch.tensor(label, dtype=torch.int64)
        }

In [6]:
# ラベル
category_dict = {'b': 0, 't': 1, 'e': 2, 'm': 3}
y_train = train_data['CATEGORY'].map(category_dict)
y_valid = valid_data['CATEGORY'].map(category_dict)
y_test = test_data['CATEGORY'].map(category_dict)

# 特徴量データセット
X_train = CreateDataset(train_data['TITLE'], y_train, sentence2index)
X_valid = CreateDataset(valid_data['TITLE'], y_valid, sentence2index)
X_test = CreateDataset(test_data['TITLE'], y_test, sentence2index)

# 使い方の例
from pprint import pprint
print(len(X_train))
pprint(X_train[1])

10684
{'inputs': tensor([1824,   58, 1825, 1016,  556, 6650,    2, 2535,  117]),
 'labels': tensor(0)}


In [7]:
vocab = make_vocab(train_data)
vocab_size = len(vocab) + 1  # padding の分 +1 する
padding_idx = len(vocab)  # 空き単語を埋めるときは最大値を入れる
emb_size = 300  # ハイパラ
hidden_size = 50  # ハイパラ
n_labels = 4  # ラベル数

model = RNNmodel(vocab_size, padding_idx, emb_size, hidden_size, n_labels)
# model.eval()

# 先頭3件の入出力を表示
for i in range(3):
    print(f"{i}番目")
    print(f"入力ベクトル：{X_train[i]}")
    print(f"出力ベクトル：{model(X_train[i]['inputs'])}")
    print(f"予測ラベル　：{model(X_train[i]['inputs']).argmax()}")
    print(f"正解ラベル　：{X_train[i]['labels'].item()}")

0番目
入力ベクトル：{'inputs': tensor([ 145,  153, 6648,   59, 5079,    6, 2950, 1823,    0,  741, 6649,    2,
          23,    3]), 'labels': tensor(2)}
出力ベクトル：tensor([[ 0.1662,  0.1981,  0.8108, -0.2331]], grad_fn=<AddmmBackward0>)
予測ラベル　：2
正解ラベル　：2
1番目
入力ベクトル：{'inputs': tensor([1824,   58, 1825, 1016,  556, 6650,    2, 2535,  117]), 'labels': tensor(0)}
出力ベクトル：tensor([[ 0.1360, -0.3244,  0.2765, -0.3773]], grad_fn=<AddmmBackward0>)
予測ラベル　：2
正解ラベル　：0
2番目
入力ベクトル：{'inputs': tensor([6651, 6652, 1263,  224, 1637,    5,    7,  324, 4066,  533,   43,  169,
         142,    0,    0,    5,   33,    3]), 'labels': tensor(2)}
出力ベクトル：tensor([[ 0.1289, -0.0115,  0.7583, -0.3192]], grad_fn=<AddmmBackward0>)
予測ラベル　：2
正解ラベル　：2


In [8]:
# 確認 入力サイズがちがう．．．
print(X_train[0]['inputs'].shape)
print(X_train[1]['inputs'].shape)

torch.Size([14])
torch.Size([9])


<h2 id="82-確率的勾配降下法による学習">82. 確率的勾配降下法による学習</h2>
<p>確率的勾配降下法（SGD: Stochastic Gradient Descent）を用いて，問題81で構築したモデルを学習せよ．訓練データ上の損失と正解率，評価データ上の損失と正解率を表示しながらモデルを学習し，適当な基準（例えば10エポックなど）で終了させよ．</p>


In [17]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from torch.utils.data import DataLoader
from tqdm import tqdm
import time
from sklearn.metrics import accuracy_score


def train(model, output_path, total_epochs):
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    loss_func = nn.CrossEntropyLoss()

    # 指定した epoch 数だけ学習
    for epoch in range(total_epochs):
        train_total_loss = 0.
        train_acc_cnt = 0

        # パラメータ更新
        model.train()
        for data in X_train:
            # print(data['inputs'])
            # print(model(data['inputs']))
            # print(data['labels'])
            x = data['inputs']
            y = data['labels']
            y_pred = model(x)[0]
            loss = loss_func(y_pred, y)  # 損失計算
            optimizer.zero_grad()  # 勾配の初期化
            loss.backward()  # 勾配計算
            optimizer.step()  # パラメータ修正
            train_total_loss += loss.item()

            # 正解率の計算  # ここで計算するのはまずいかも，学習エポックが終わってからやったほうがよさそう
            # 次の問題からは修正
            if y.item() == y_pred.argmax():
                train_acc_cnt += 1

        # valid のロスと正解率の計算
        model.eval()
        valid_acc_cnt = 0
        valid_total_loss = 0.
        with torch.no_grad():
            for data in X_valid:
                x = data['inputs']
                y = data['labels']
                y_pred = model(x)[0]
                loss = loss_func(y_pred, y)  # 損失計算
                optimizer.zero_grad()  # 勾配の初期化
                # loss.backward()  # 勾配計算
                # optimizer.step()  # パラメータ修正
                valid_total_loss += loss.item()

                # 正解率の計算
                if y.item() == y_pred.argmax():
                    valid_acc_cnt += 1

        # 表示
        train_ave_loss = train_total_loss / len(X_train)
        train_acc = train_acc_cnt / len(X_train)
        valid_ave_loss = valid_total_loss / len(X_valid)
        valid_acc = valid_acc_cnt / len(X_valid)
        print(f"epoch{epoch}: train_loss = {train_ave_loss}, train_acc = {train_acc}, valid_loss = {valid_ave_loss}, valid_acc = {valid_acc}")

    # パラメータを保存
    torch.save(model.state_dict(), output_path)

vocab = make_vocab(train_data)
vocab_size = len(vocab) + 1  # padding の分 +1 する
padding_idx = len(vocab)  # 空き単語を埋めるときは最大値を入れる
emb_size = 300  # ハイパラ
hidden_size = 50  # ハイパラ
n_labels = 4  # ラベル数

model = RNNmodel(vocab_size, padding_idx, emb_size, hidden_size, n_labels)
output_path = "./trained_param.npz"
total_epochs = 10
train(model, output_path, total_epochs)



RuntimeError: For unbatched 2-D input, hx should also be 2-D but got 3-D tensor

<h2 id="83-ミニバッチ化gpu上での学習">83. ミニバッチ化・GPU上での学習</h2>
<p>問題82のコードを改変し，\(B\)事例ごとに損失・勾配を計算して学習を行えるようにせよ（\(B\)の値は適当に選べ）．また，GPU上で学習を実行せよ．</p>


In [10]:
# GPUにしても早くなってないような気がする．．．
# 走らせながら正答率を計測する valid acc と 最後にまとめて計算する valid acc2 の結果が異なるのがかなり気になる．．．
import torch
import torch.nn as nn
class RNNmodel(nn.Module):
    def __init__(self, vocab_size, padding_idx, emb_size=300, hidden_size=50, n_labels=4, batch_size=64, device='cpu') -> None:
        super(RNNmodel, self).__init__()
        self.hidden_size = hidden_size
        self.batch_size = batch_size
        self.device = device
        # 入力ベクトルの大きさが異なるので，emb層で形をそろえる
        self.emb = nn.Embedding(vocab_size, emb_size, padding_idx=padding_idx)
        # batch_first とは？ -> batch_size と emb の２次元目のサイズが異なるときに合わせている？？
        self.rnn = nn.RNN(emb_size, hidden_size, nonlinearity='tanh', batch_first=True)
        # self.rnn = nn.RNN(emb_size, hidden_size, nonlinearity='tanh')
        self.func = nn.Linear(hidden_size, n_labels)

    def forward(self, x):
        # バッチサイズを固定すると，一番最後の余りの分がおかしくなるので，動的に毎回決める！
        self.batch_size = x.size()[0]
        h0 = torch.zeros(1, self.batch_size, self.hidden_size, device=self.device) # ここを変更した
        emb = self.emb(x)  # 入力サイズが異なるので統一する
        x_rnn, h_last = self.rnn(emb, h0)  # RNN
        out = self.func(x_rnn[:, -1, :]) # 最後の層だけ取り出す # ここを変更した
        # out = self.func(x_rnn[:, -1]) # 最後の層だけ取り出す
        # out = self.func(h_last) #現在のhだけ取り出す
        return out

In [11]:
from torch.utils.data import Dataset
import numpy as np

class CreateDataset(Dataset):
    def __init__(self, X, y, tokenizer):
        self.X = X
        self.y = y
        self.tokenizer = tokenizer

    # len(Dataset) で返す値を指定
    def __len__(self):
        return len(self.y)
    
    # Dataset[index] で返す値を指定
    def __getitem__(self, index):
        titles = self.X.iloc[index]
        # スライス記法のとき
        if type(index) == slice:
            labels = self.y.iloc[index]
            input_features = []
            labels_tensor = []
            for title, label in zip(titles, labels):
                input_feature = torch.tensor(self.tokenizer(title))
                input_features.append(input_feature)
                labels_tensor.append(torch.tensor(label, dtype=torch.int64))
        else:
            text = self.X.iloc[index]
            input_features = torch.tensor(self.tokenizer(text), dtype=torch.int64)
            labels_tensor = torch.tensor(self.y.iloc[index], dtype=torch.int64)
        return {
            'inputs': input_features,
            'labels': labels_tensor
        }

In [12]:
import pandas as pd
from nltk.tokenize import word_tokenize

from torch.utils.data import Dataset
import numpy as np

class CreateDataset2(Dataset):
    def __init__(self, X, y):
        self.X = X
        self.y = y

    # len(Dataset) で返す値を指定
    def __len__(self):
        return len(self.y)

    # Dataset[index] で返す値を指定
    def __getitem__(self, index):
        titles = self.X[index]
        # print(titles)
        # スライス記法のとき
        if type(index) == slice:
            labels = self.y[index]
            input_features = []
            labels_tensor = []
            for title, label in zip(titles, labels):
                print(title, label)
                input_feature = torch.tensor(title)
                input_features.append(input_feature)
                labels_tensor.append(torch.tensor(label, dtype=torch.int64))
        else:
            text = self.X[index]
            input_features = torch.tensor(text, dtype=torch.int64)
            labels_tensor = torch.tensor(self.y.iloc[index], dtype=torch.int64)
        return {
            'inputs': input_features,
            'labels': labels_tensor
        }
    
# 単語列から出現頻度インデックスを返す関数
def words2index(words, vocab):
    # 単語のみのリストに分割する
    vocab_order, cnt_list = zip(*vocab.most_common())
    index_output = []
    for word in words:
        # 語彙にないときは 0
        if word not in vocab:
            index = 0
        # 回数が1のときも0
        elif cnt_list[vocab_order.index(word)] == 1:
            index = 0
        # 語彙にあるとき，0 インデックスなので +1 する
        else:  
            index = vocab_order.index(word) + 1

        index_output.append(index)
    return index_output


def tokenize_dataset(dataset):
    tokenized_dataset = []
    for data in dataset:
        tokenized_dataset.append(word_tokenize(data))
    return tokenized_dataset

# データ取得
train_data = pd.read_table('./work/train.txt', names=['TITLE', 'CATEGORY'])
valid_data = pd.read_table('./work/valid.txt', names=['TITLE', 'CATEGORY'])
test_data = pd.read_table('./work/test.txt', names=['TITLE', 'CATEGORY'])

# nltk で tokenized する
train_tokenized = tokenize_dataset(train_data['TITLE'])
valid_tokenized = tokenize_dataset(train_data['TITLE'])
test_tokenized = tokenize_dataset(train_data['TITLE'])

# id に変換する
train_tokenized_id = [words2index(words, vocab) for words in train_tokenized]
valid_tokenized_id = [words2index(words, vocab) for words in valid_tokenized]
test_tokenized_id = [words2index(words, vocab) for words in test_tokenized]

# rnn に実際に渡す形にする
X_train_tokenized = CreateDataset2(train_tokenized_id, y_train)
X_valid_tokenized = CreateDataset2(valid_tokenized_id, y_valid)
X_test_tokenized = CreateDataset2(test_tokenized_id, y_test)

In [13]:
X_train_tokenized[0]

{'inputs': tensor([ 145,  153, 6648,   59, 5079,    6, 2950, 1823,    0,  741, 6649,    2,
           23,    3]),
 'labels': tensor(2)}

In [14]:
import pickle
f = open('./work/train_tokenized.txt', 'wb')
pickle.dump(train_tokenized, f)
f = open('./work/valid_tokenized.txt', 'wb')
pickle.dump(valid_tokenized, f)
f = open('./work/test_tokenized.txt', 'wb')
pickle.dump(test_tokenized, f)

In [15]:
# tokenized_text をロード
f = open('./work/train_tokenized.txt', 'rb')
train_tokenized = pickle.load(f)
f = open('./work/valid_tokenized.txt', 'rb')
valid_tokenized = pickle.load(f)
f = open('./work/test_tokenized.txt', 'rb')
test_tokenized = pickle.load(f)

In [16]:
X_train = CreateDataset(train_data['TITLE'], y_train, sentence2index)
X_valid = CreateDataset(valid_data['TITLE'], y_valid, sentence2index)
X_test = CreateDataset(test_data['TITLE'], y_test, sentence2index)

In [130]:
# サーバ上で動かなかった問題解決 -> model.to('cpu') を model に入れていなかった -> 正） model = model.to('cpu')
# 今度はhiddenとinputのdeviceが違うというエラーが発生 -> xiをcudaに戻したら解決
from sklearn.metrics import accuracy_score

def measure_acc(model, X, y, device):
    model.eval()
    model = model.to(device)
    with torch.no_grad():
        pred_y = []
        for xi in X:
            xi = xi.to(device)
            pred_yi = model(xi[None]).argmax()
            pred_yi = pred_yi.to('cpu')
            pred_y.append(pred_yi)
    return accuracy_score(pred_y, y)

# 単語の回数を数え上げ
from collections import Counter, defaultdict
from nltk.tokenize import word_tokenize

# vocabを生成
def make_vocab(train_data):
    vocab = defaultdict(int)
    for id, (title, category) in train_data.iterrows():
        # words = title.split()
        words = word_tokenize(title)
        for word in words:
            vocab[word] += 1
    vocab = Counter(vocab)
    return vocab
# vocab.most_common()

In [24]:
# 修正版
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from torch.utils.data import DataLoader # データローダ使ってみる
from tqdm import tqdm
import time
from sklearn.metrics import accuracy_score
import wandb # 追加
# wait and biases
import wandb


def train(model, train_loader, valid_loader, output_path, total_epochs, device, lr=0.01):
    wandb.init(project="chapter09_83")
    wandb.run.name = 'modify-tokenized-run'
    
    optimizer = optim.SGD(model.parameters(), lr=lr)
    loss_func = nn.CrossEntropyLoss()

    model = model.to(device)
    # 指定した epoch 数だけ学習
    for epoch in range(total_epochs):
        train_total_loss = 0.
        train_acc_cnt = 0
        train_cnt = 0

        # パラメータ更新
        model.train()
        for i, data in enumerate(tqdm(train_loader)):
            x = data['inputs']
            x = x.to(device)
            y = data['labels']
            y = y.to(device)
            y_pred = model(x)

            # バッチの中で損失計算
            train_loss = 0.
            for yi, yi_pred in zip(y, y_pred):
                loss_i = loss_func(yi_pred, yi)
                train_loss += loss_i
            
            optimizer.zero_grad()  # 勾配の初期化
            train_loss.backward()  # 勾配計算
            optimizer.step()  # パラメータ修正
            train_total_loss += train_loss.item()

            # バッチの中で正解率の計算 # ここを修正
            for yi, yi_pred in zip(y, y_pred):
                if yi.item() == yi_pred.argmax():
                    train_acc_cnt += 1
                train_cnt += 1

            # wandb に10ステップごとにログを書く
            if i % 10 == 0:
                train_running_loss = train_total_loss / train_cnt
                train_running_acc = train_acc_cnt / train_cnt
                wandb.log({'train_loss': train_running_loss, 'train_acc': train_running_acc})
                
        # train のロスと正解率の計算
        # model.eval()
        # train_acc2 = measure_acc(model, X_train[:]['inputs'], X_train[:]['labels'], device)


        # valid のロスと正解率の計算
        model.eval()
        valid_acc_cnt = 0
        valid_total_loss = 0.
        with torch.no_grad():
            for i, data in enumerate(tqdm(valid_loader)):
                x = data['inputs']
                x = x.to(device)
                y = data['labels']
                y = y.to(device)
                y_pred = model(x)

                # バッチの中で損失計算
                valid_loss = 0.
                for yi, yi_pred in zip(y, y_pred):
                    # print(yi)
                    # print(yi_pred)
                    loss_i = loss_func(yi_pred, yi)
                    valid_loss += loss_i

                optimizer.zero_grad()  # 勾配の初期化
                # valid_loss.backward()  # 勾配計算
                # optimizer.step()  # パラメータ修正
                valid_total_loss += valid_loss

                # バッチの中で正解率の計算  # ここを修正
                for yi, yi_pred in zip(y, y_pred):
                    if yi.item() == yi_pred.argmax():
                        valid_acc_cnt += 1


            # valid のロスと正解率の計算
            # valid_acc2 = measure_acc(model, X_valid[:]['inputs'], X_valid[:]['labels'], device)

        # 表示
        train_ave_loss = train_total_loss / len(X_train_tokenized)
        train_acc = train_acc_cnt / len(X_train_tokenized)
        valid_ave_loss = valid_total_loss / len(X_valid_tokenized)
        valid_acc = valid_acc_cnt / len(X_valid_tokenized)
        print(f"epoch{epoch}: train_loss = {train_ave_loss}, train_acc = {train_acc}, valid_loss = {valid_ave_loss}, valid_acc = {valid_acc}")
        # print(f'train_acc2: {train_acc2}, valid_acc2: {valid_acc2}')

    # パラメータを保存
    torch.save(model.state_dict(), output_path)
    wandb.finish()

In [136]:
# 訓練データの最後で何故かエラーが起きる -> あまりの分がおかしくなっていた！
# ロスが下がっていない．．．なぜだ．．．
# あまりにも時間がかかる -> 最初に nltk_tokenized してからデータを渡したらめちゃめちゃ早くなった！
# 40分 -> 40秒
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

#バッチサイズ
batch_size = 32
vocab = make_vocab(train_data)
PADDING_IDX = len(vocab)
vocab_size = len(vocab) + 1

#ミニバッチを取り出して長さを揃える関数
def collate_fn(batch):
    sorted_batch = sorted(batch, key=lambda x: x['inputs'].shape[0], reverse=True)
    sequences = [x['inputs'] for x in sorted_batch]
    # padding 処理
    sequences_padded = torch.nn.utils.rnn.pad_sequence(sequences, batch_first=True, padding_value=PADDING_IDX)
    labels = torch.LongTensor([x['labels'] for x in sorted_batch])
    return {'inputs': sequences_padded, 'labels': labels}

vocab_size = len(vocab) + 1  # padding の分 +1 する
padding_idx = len(vocab)  # 空き単語を埋めるときは最大値を入れる
emb_size = 300  # ハイパラ
hidden_size = 50  # ハイパラ
n_labels = 4  # ラベル数
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"device: {device}")

train_loader = DataLoader(X_train_tokenized, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)
valid_loader = DataLoader(X_valid_tokenized, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)
test_loader = DataLoader(X_test_tokenized, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)

model = RNNmodel(vocab_size, padding_idx, emb_size, hidden_size, n_labels, batch_size, device)
output_path = "./trained_param.npz"
total_epochs = 10
train(model, train_loader, valid_loader, output_path, total_epochs, device)

device: cpu


100%|██████████| 334/334 [00:02<00:00, 158.07it/s]
100%|██████████| 42/42 [00:00<00:00, 701.51it/s]


epoch0: train_loss = 1.1195700514053235, train_acc = 0.541463871209285, valid_loss = 1.2387775182724, valid_acc = 0.41467065868263475


100%|██████████| 334/334 [00:02<00:00, 166.90it/s]
100%|██████████| 42/42 [00:00<00:00, 709.28it/s]


epoch1: train_loss = 1.1781565783310366, train_acc = 0.4976600524148259, valid_loss = 1.2570102214813232, valid_acc = 0.4176646706586826


100%|██████████| 334/334 [00:01<00:00, 168.46it/s]
100%|██████████| 42/42 [00:00<00:00, 636.61it/s]


epoch2: train_loss = 1.1999710910287256, train_acc = 0.47791089479595655, valid_loss = 1.4195775985717773, valid_acc = 0.39895209580838326


100%|██████████| 334/334 [00:02<00:00, 159.02it/s]
100%|██████████| 42/42 [00:00<00:00, 674.61it/s]


epoch3: train_loss = 1.1283182767570263, train_acc = 0.5428678397603893, valid_loss = 1.341603398323059, valid_acc = 0.4161676646706587


100%|██████████| 334/334 [00:02<00:00, 165.10it/s]
100%|██████████| 42/42 [00:00<00:00, 655.92it/s]


epoch4: train_loss = 1.1225296284366038, train_acc = 0.5623362036690378, valid_loss = 1.2817877531051636, valid_acc = 0.4176646706586826


100%|██████████| 334/334 [00:02<00:00, 152.97it/s]
100%|██████████| 42/42 [00:00<00:00, 560.95it/s]


epoch5: train_loss = 1.2834279893355975, train_acc = 0.4635904155746911, valid_loss = 1.363749384880066, valid_acc = 0.405688622754491


100%|██████████| 334/334 [00:02<00:00, 149.76it/s]
100%|██████████| 42/42 [00:00<00:00, 628.72it/s]


epoch6: train_loss = 1.3568002125244827, train_acc = 0.461344065892924, valid_loss = 1.2740203142166138, valid_acc = 0.4086826347305389


100%|██████████| 334/334 [00:02<00:00, 165.85it/s]
100%|██████████| 42/42 [00:00<00:00, 813.50it/s]


epoch7: train_loss = 1.686722041872368, train_acc = 0.4219393485585923, valid_loss = 1.4258145093917847, valid_acc = 0.2776946107784431


100%|██████████| 334/334 [00:01<00:00, 167.82it/s]
100%|██████████| 42/42 [00:00<00:00, 779.87it/s]


epoch8: train_loss = 1.9663927106096788, train_acc = 0.4134219393485586, valid_loss = 2.5287065505981445, valid_acc = 0.4244011976047904


100%|██████████| 334/334 [00:01<00:00, 172.36it/s]
100%|██████████| 42/42 [00:00<00:00, 646.57it/s]


epoch9: train_loss = 1.7828145623430247, train_acc = 0.422688131785848, valid_loss = 1.567039966583252, valid_acc = 0.41467065868263475




0,1
train_acc,▃▆▇▇▆▆▆▅▆▅▄▄▆▇▇▇█▇██▅▆▄▄▃▂▃▃▄▂▂▂▁▁▂▁▁▁▁▂
train_loss,▁▁▁▁▁▁▁▁▁▁▂▂▁▁▁▁▁▁▁▁▂▁▂▂▃▃▃▃▂▃▄▅▇█▇▇█▇▆▆

0,1
train_acc,0.42249
train_loss,1.77866


<h2 id="84-単語ベクトルの導入">84. 単語ベクトルの導入</h2>
<p>事前学習済みの単語ベクトル（例えば，Google Newsデータセット（約1,000億単語）での<a href="https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing">学習済み単語ベクトル</a>）で単語埋め込み\(\mathrm{emb}(x)\)を初期化し，学習せよ．</p>


In [22]:
import numpy as np
from tqdm import tqdm # 進捗表示
from gensim.models import KeyedVectors
from nltk.tokenize import word_tokenize

# 70からもってきた
word2vec_model = KeyedVectors.load_word2vec_format('./../chapter08/data/GoogleNews-vectors-negative300.bin', binary=True) 

vocab_size = len(vocab) + 1
emb_size = 300
padding_idx = len(vocab)

rnn_emb = nn.Embedding(vocab_size, emb_size, padding_idx=padding_idx)
word2vec_model = KeyedVectors.load_word2vec_format('./../chapter08/data/GoogleNews-vectors-negative300.bin', binary=True) 
embedding_weight_matrix = rnn_emb(torch.tensor([0])).detach().numpy().copy()

# ID持ちの単語が学習済み単語ベクトルを持っていればそれを行方向に足していき，なければnn.Embeddingのベクトルを行方向に足していく
for key, value in vocab.items():
    try:
        embedding_weight_matrix = np.vstack((embedding_weight_matrix, word2vec_model[key]))
    except KeyError:
        embedding_weight_matrix = np.vstack((embedding_weight_matrix, rnn_emb(torch.tensor([value])).detach().numpy().copy()))
    
embedding_weight_matrix = np.vstack((embedding_weight_matrix, np.zeros(emb_size, dtype=np.float32)))  # paddingの分
embedding_weight_matrix = torch.from_numpy(embedding_weight_matrix)

In [138]:
# やっぱりロスが下がらない．．．
# 埋め込み単語ベクトルを変更可能にする
# emb_weight に初期値を入れることで，モード変更をする
import torch
import torch.nn as nn
class RNNmodel(nn.Module):
    def __init__(self, vocab_size, padding_idx, emb_size=300, hidden_size=50, n_labels=4, batch_size=64, device='cpu', emb_weight=None) -> None:
        super(RNNmodel, self).__init__()
        self.hidden_size = hidden_size
        self.batch_size = batch_size
        self.device = device
        # 入力ベクトルの大きさが異なるので，emb層で形をそろえる
        if emb_weight is None:
            self.emb = nn.Embedding(vocab_size, emb_size, padding_idx=padding_idx)
        # 追加
        else:
            self.emb = nn.Embedding.from_pretrained(emb_weight, padding_idx=padding_idx)
        
        # batch_first とは？ -> batch_size と emb の２次元目のサイズが異なるときに合わせている？？
        self.rnn = nn.RNN(emb_size, hidden_size, nonlinearity='tanh', batch_first=True)
        # self.rnn = nn.RNN(emb_size, hidden_size, nonlinearity='tanh')
        self.func = nn.Linear(hidden_size, n_labels)

    def forward(self, x):
        # print(x)
        # バッチサイズを固定すると，一番最後の余りの分がおかしくなるので，動的に毎回決める！
        self.batch_size = x.size()[0]
        h0 = torch.zeros(1, self.batch_size, self.hidden_size, device=self.device) # ここを変更した
        emb = self.emb(x)  # 入力サイズが異なるので統一する
        x_rnn, h_last = self.rnn(emb, h0)  # RNN
        out = self.func(x_rnn[:, -1, :]) # 最後の層だけ取り出す # ここを変更した
        # out = self.func(x_rnn[:, -1]) # 最後の層だけ取り出す
        # out = self.func(h_last) #現在のhだけ取り出す
        return out
    
    
vocab_size = len(vocab) + 1  # padding の分 +1 する
padding_idx = len(vocab)  # 空き単語を埋めるときは最大値を入れる
emb_size = 300  # ハイパラ
hidden_size = 50  # ハイパラ
n_labels = 4  # ラベル数
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"device: {device}")

train_loader = DataLoader(X_train_tokenized, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
valid_loader = DataLoader(X_valid_tokenized, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
test_loader = DataLoader(X_test_tokenized, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)

model = RNNmodel(vocab_size, padding_idx, emb_size, hidden_size, n_labels, batch_size, device, embedding_weight_matrix)
output_path = "./trained_param.npz"
total_epochs = 10
train(model, train_loader, valid_loader, output_path, total_epochs, device)

device: cpu


100%|██████████| 334/334 [00:01<00:00, 282.81it/s]
100%|██████████| 42/42 [00:00<00:00, 547.13it/s]


epoch0: train_loss = 1.1600203039225725, train_acc = 0.47079745413702734, valid_loss = 1.1983046531677246, valid_acc = 0.40718562874251496


100%|██████████| 334/334 [00:01<00:00, 306.11it/s]
100%|██████████| 42/42 [00:00<00:00, 499.16it/s]


epoch1: train_loss = 1.1352735627951671, train_acc = 0.4990640209659304, valid_loss = 1.2707889080047607, valid_acc = 0.41392215568862273


100%|██████████| 334/334 [00:01<00:00, 315.23it/s]
100%|██████████| 42/42 [00:00<00:00, 512.81it/s]


epoch2: train_loss = 1.1646572150170067, train_acc = 0.47285660801198054, valid_loss = 1.2272007465362549, valid_acc = 0.4184131736526946


100%|██████████| 334/334 [00:01<00:00, 290.58it/s]
100%|██████████| 42/42 [00:00<00:00, 518.58it/s]


epoch3: train_loss = 1.1323917480677266, train_acc = 0.5125421190565331, valid_loss = 1.2472282648086548, valid_acc = 0.4116766467065868


100%|██████████| 334/334 [00:01<00:00, 319.46it/s]
100%|██████████| 42/42 [00:00<00:00, 572.71it/s]


epoch4: train_loss = 1.127316132505839, train_acc = 0.528453762635717, valid_loss = 1.2504022121429443, valid_acc = 0.4116766467065868


100%|██████████| 334/334 [00:01<00:00, 279.66it/s]
100%|██████████| 42/42 [00:00<00:00, 428.32it/s]


epoch5: train_loss = 1.1492555729685006, train_acc = 0.5042119056533134, valid_loss = 1.2610063552856445, valid_acc = 0.4154191616766467


100%|██████████| 334/334 [00:01<00:00, 305.71it/s]
100%|██████████| 42/42 [00:00<00:00, 370.62it/s]


epoch6: train_loss = 1.138463007470991, train_acc = 0.5115125421190565, valid_loss = 1.3136309385299683, valid_acc = 0.41317365269461076


100%|██████████| 334/334 [00:01<00:00, 281.86it/s]
100%|██████████| 42/42 [00:00<00:00, 462.65it/s]


epoch7: train_loss = 1.1461825918856743, train_acc = 0.5067390490453014, valid_loss = 1.2337405681610107, valid_acc = 0.41467065868263475


100%|██████████| 334/334 [00:01<00:00, 263.02it/s]
100%|██████████| 42/42 [00:00<00:00, 419.37it/s]


epoch8: train_loss = 1.2122050458150426, train_acc = 0.46686634219393486, valid_loss = 1.2385830879211426, valid_acc = 0.4176646706586826


100%|██████████| 334/334 [00:01<00:00, 308.60it/s]
100%|██████████| 42/42 [00:00<00:00, 511.54it/s]


epoch9: train_loss = 1.4167171244048216, train_acc = 0.45011231748408836, valid_loss = 1.282361626625061, valid_acc = 0.4124251497005988




0,1
train_acc,▁▂▃▄▄▆▆▆▆▄▄▄▆▅▆▇█▇▇▇▇▆▇▆▆▆▇▇▅▆▆▆▆▆▅▄▄▄▄▄
train_loss,▄▃▃▂▂▂▂▂▂▃▃▃▃▂▂▂▁▂▂▂▂▂▂▂▁▂▂▂▃▂▂▂▃▃▃▃▄▄▄█

0,1
train_acc,0.45043
train_loss,1.41579


In [27]:
# やっぱりロスが下がらない．．．
# 埋め込み単語ベクトルを変更可能にする
# emb_weight に初期値を入れることで，モード変更をする
import torch
import torch.nn as nn
class RNNmodel(nn.Module):
    def __init__(self, vocab_size, padding_idx, emb_size=300, hidden_size=50, n_labels=4, batch_size=64, device='cpu', emb_weight=None) -> None:
        super(RNNmodel, self).__init__()
        self.hidden_size = hidden_size
        self.batch_size = batch_size
        self.device = device
        # 入力ベクトルの大きさが異なるので，emb層で形をそろえる
        if emb_weight is None:
            self.emb = nn.Embedding(vocab_size, emb_size, padding_idx=padding_idx)
        # 追加
        else:
            self.emb = nn.Embedding.from_pretrained(emb_weight, padding_idx=padding_idx)
        
        # batch_first とは？ -> batch_size と emb の２次元目のサイズが異なるときに合わせている？？
        self.rnn = nn.RNN(emb_size, hidden_size, nonlinearity='tanh', batch_first=True)
        # self.rnn = nn.RNN(emb_size, hidden_size, nonlinearity='tanh')
        self.func = nn.Linear(hidden_size, n_labels)

    def forward(self, x):
        # print(x)
        # バッチサイズを固定すると，一番最後の余りの分がおかしくなるので，動的に毎回決める！
        self.batch_size = x.size()[0]
        h0 = torch.zeros(1, self.batch_size, self.hidden_size, device=self.device) # ここを変更した
        emb = self.emb(x)  # 入力サイズが異なるので統一する
        x_rnn, h_last = self.rnn(emb, h0)  # RNN
        # out = self.func(x_rnn[:, -1, :]) # 最後の層だけ取り出す # ここを変更した
        # out = self.func(x_rnn[:, -1]) # 最後の層だけ取り出す
        out = self.func(h_last[-1]) #現在のhだけ取り出す
        return out

PADDING_IDX = len(vocab)
    
#ミニバッチを取り出して長さを揃える関数
def collate_fn(batch):
    sorted_batch = sorted(batch, key=lambda x: x['inputs'].shape[0], reverse=True)
    sequences = [x['inputs'] for x in sorted_batch]
    # padding 処理
    sequences_padded = torch.nn.utils.rnn.pad_sequence(sequences, batch_first=True, padding_value=PADDING_IDX)
    labels = torch.LongTensor([x['labels'] for x in sorted_batch])
    return {'inputs': sequences_padded, 'labels': labels}

vocab_size = len(vocab) + 1  # padding の分 +1 する
padding_idx = len(vocab)  # 空き単語を埋めるときは最大値を入れる
emb_size = 300  # ハイパラ
hidden_size = 50  # ハイパラ
n_labels = 4  # ラベル数
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"device: {device}")

batch_size = 4

train_loader = DataLoader(X_train_tokenized, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
valid_loader = DataLoader(X_valid_tokenized, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
test_loader = DataLoader(X_test_tokenized, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)

model = RNNmodel(vocab_size, padding_idx, emb_size, hidden_size, n_labels, batch_size, device, embedding_weight_matrix)
output_path = "./trained_param.npz"
total_epochs = 10
train(model, train_loader, valid_loader, output_path, total_epochs, device)

device: cpu




100%|██████████| 2671/2671 [00:03<00:00, 832.43it/s]
100%|██████████| 334/334 [00:00<00:00, 2602.11it/s]


epoch0: train_loss = 1.1205019014174462, train_acc = 0.5296705353800075, valid_loss = 1.2792941331863403, valid_acc = 0.40793413173652693


100%|██████████| 2671/2671 [00:02<00:00, 898.58it/s]
100%|██████████| 334/334 [00:00<00:00, 2860.11it/s]


epoch1: train_loss = 1.092181903744226, train_acc = 0.5595282665668289, valid_loss = 1.2651035785675049, valid_acc = 0.39820359281437123


100%|██████████| 2671/2671 [00:02<00:00, 985.74it/s] 
100%|██████████| 334/334 [00:00<00:00, 2948.21it/s]


epoch2: train_loss = 1.0642659534780063, train_acc = 0.5895731935604642, valid_loss = 1.2983145713806152, valid_acc = 0.4199101796407186


100%|██████████| 2671/2671 [00:02<00:00, 987.49it/s] 
100%|██████████| 334/334 [00:00<00:00, 2913.90it/s]


epoch3: train_loss = 1.0875748823317013, train_acc = 0.5656121302882815, valid_loss = 1.4570056200027466, valid_acc = 0.38922155688622756


100%|██████████| 2671/2671 [00:02<00:00, 998.32it/s] 
100%|██████████| 334/334 [00:00<00:00, 2853.38it/s]


epoch4: train_loss = 1.077451624408593, train_acc = 0.5794646199925122, valid_loss = 1.3294237852096558, valid_acc = 0.4101796407185629


100%|██████████| 2671/2671 [00:02<00:00, 945.23it/s] 
100%|██████████| 334/334 [00:00<00:00, 2665.98it/s]


epoch5: train_loss = 1.0854484595997151, train_acc = 0.5723511793335829, valid_loss = 1.3304475545883179, valid_acc = 0.4086826347305389


100%|██████████| 2671/2671 [00:02<00:00, 951.28it/s] 
100%|██████████| 334/334 [00:00<00:00, 2849.20it/s]


epoch6: train_loss = 1.0888683725188886, train_acc = 0.5659865219019093, valid_loss = 1.2992968559265137, valid_acc = 0.405688622754491


100%|██████████| 2671/2671 [00:02<00:00, 973.86it/s] 
100%|██████████| 334/334 [00:00<00:00, 2864.74it/s]


epoch7: train_loss = 1.099557658687418, train_acc = 0.5601834518906776, valid_loss = 1.2932552099227905, valid_acc = 0.3884730538922156


100%|██████████| 2671/2671 [00:02<00:00, 973.71it/s] 
100%|██████████| 334/334 [00:00<00:00, 2821.37it/s]


epoch8: train_loss = 1.0804013058693835, train_acc = 0.5830213403219768, valid_loss = 1.2701021432876587, valid_acc = 0.3997005988023952


100%|██████████| 2671/2671 [00:03<00:00, 871.90it/s]
100%|██████████| 334/334 [00:00<00:00, 2689.36it/s]


epoch9: train_loss = 1.1112056158591817, train_acc = 0.548764507675028, valid_loss = 1.2645072937011719, valid_acc = 0.406437125748503




0,1
train_acc,▁▁▂▂▅▃▅▄██▇▇▃▄▅▅▆▇▇▇▆▅▆▆▃▆▇▆▃▅▆▅▃▅▆▇█▆▅▄
train_loss,█▆▆▅▅▅▄▄▂▁▂▂▅▄▄▃▃▂▂▂▃▃▃▃▆▂▂▃▆▄▃▄▅▄▃▃▁▂▃▄

0,1
train_acc,0.54876
train_loss,1.11121


<h2 id="85-双方向rnn多層化">85. 双方向RNN・多層化</h2>
<p>順方向と逆方向のRNNの両方を用いて入力テキストをエンコードし，モデルを学習せよ．</p>

<p>\[\overleftarrow{h}_{T+1} = 0, \\
\overleftarrow{h}_t = {\rm \overleftarrow{RNN}}(\mathrm{emb}(x_t), \overleftarrow{h}_{t+1}), \\
y = {\rm softmax}(W^{(yh)} [\overrightarrow{h}_T; \overleftarrow{h}_1] + b^{(y)}))\]</p>

<p>ただし，\(\overrightarrow{h}_t \in \mathbb{R}^{d_h}, \overleftarrow{h}_t \in \mathbb{R}^{d_h}\)はそれぞれ，順方向および逆方向のRNNで求めた時刻\(t\)の隠れ状態ベクトル，\({\rm \overleftarrow{RNN}}(x,h)\)は入力\(x\)と次時刻の隠れ状態\(h\)から前状態を計算するRNNユニット，\(W^{(yh)} \in \mathbb{R}^{L \times 2d_h}\)は隠れ状態ベクトルからカテゴリを予測するための行列，\(b^{(y)} \in \mathbb{R}^{L}\)はバイアス項である．また，\([a; b]\)はベクトル\(a\)と\(b\)の連結を表す。</p>
<p>さらに，双方向RNNを多層化して実験せよ．</p>


In [139]:
# 今度はロスがちょっと下がった！
# 双方向と多層化に拡張
import torch
import torch.nn as nn
class RNNmodel(nn.Module):
    def __init__(self, vocab_size, padding_idx, emb_size=300, hidden_size=50, n_labels=4, batch_size=64, device='cpu', emb_weight=None, bidirectional=False, layers=1) -> None:
        super(RNNmodel, self).__init__()
        self.hidden_size = hidden_size
        self.batch_size = batch_size
        self.device = device
        # 双方向と多層化に拡張
        self.bidirectional = bidirectional
        self.layers = layers
        self.directions = bidirectional + 1 # 単方向：１， 双方向：2
        # 入力ベクトルの大きさが異なるので，emb層で形をそろえる
        if emb_weight is None:
            self.emb = nn.Embedding(vocab_size, emb_size, padding_idx=padding_idx)
        else:
            self.emb = nn.Embedding.from_pretrained(emb_weight, padding_idx=padding_idx)
        
        # batch_first とは？ -> batch_size と emb の２次元目のサイズが異なるときに合わせている？？
        # 双方向と多層化に対応        
        self.rnn = nn.RNN(emb_size, hidden_size, self.layers, nonlinearity='tanh', batch_first=True, bidirectional=bidirectional)
        # self.rnn = nn.RNN(emb_size, hidden_size, nonlinearity='tanh')
        # 双方向に対応， 隠れ層に self.directions を掛ける
        self.func = nn.Linear(hidden_size * self.directions, n_labels)

    def forward(self, x):
        # バッチサイズを固定すると，一番最後の余りの分がおかしくなるので，動的に毎回決める！
        self.batch_size = x.size()[0]
        # 多層と双方向に対応， １次元目を 1 -> self.layers * self.directions
        h0 = torch.zeros(self.layers * self.directions, self.batch_size, self.hidden_size, device=self.device) # ここを変更した
        emb = self.emb(x)  # 入力サイズが異なるので統一する
        x_rnn, h_last = self.rnn(emb, h0)  # RNN
        # 双方向に対応
        if self.bidirectional:
            out = self.func(torch.cat([h_last[-2], h_last[-1]], dim=1))
        else:
            out = self.func(x_rnn[:, -1, :]) # 最後の層だけ取り出す # ここを変更した
        # out = self.func(x_rnn[:, -1]) # 最後の層だけ取り出す
        # out = self.func(h_last) #現在のhだけ取り出す
        return out
    
    
vocab_size = len(vocab) + 1  # padding の分 +1 する
padding_idx = len(vocab)  # 空き単語を埋めるときは最大値を入れる
emb_size = 300  # ハイパラ
hidden_size = 50  # ハイパラ
n_labels = 4  # ラベル数
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"device: {device}")

train_loader = DataLoader(X_train_tokenized, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
valid_loader = DataLoader(X_valid_tokenized, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
test_loader = DataLoader(X_test_tokenized, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)

model = RNNmodel(vocab_size, padding_idx, emb_size, hidden_size, n_labels, batch_size, device, embedding_weight_matrix, bidirectional=True, layers=3)
output_path = "./trained_param.npz"
total_epochs = 10
train(model, train_loader, valid_loader, output_path, total_epochs, device)

device: cpu


100%|██████████| 334/334 [00:03<00:00, 110.60it/s]
100%|██████████| 42/42 [00:00<00:00, 296.44it/s]


epoch0: train_loss = 1.033255908963208, train_acc = 0.5848932983901161, valid_loss = 1.528912901878357, valid_acc = 0.38547904191616766


100%|██████████| 334/334 [00:02<00:00, 116.35it/s]
100%|██████████| 42/42 [00:00<00:00, 284.25it/s]


epoch1: train_loss = 0.8740515920474082, train_acc = 0.6771808311493822, valid_loss = 1.831870675086975, valid_acc = 0.37350299401197606


100%|██████████| 334/334 [00:02<00:00, 119.43it/s]
100%|██████████| 42/42 [00:00<00:00, 292.05it/s]


epoch2: train_loss = 0.8174228320091416, train_acc = 0.6977723698989142, valid_loss = 1.669603943824768, valid_acc = 0.39895209580838326


100%|██████████| 334/334 [00:02<00:00, 121.34it/s]
100%|██████████| 42/42 [00:00<00:00, 291.68it/s]


epoch3: train_loss = 0.7720045291810659, train_acc = 0.7175215275177836, valid_loss = 1.857621431350708, valid_acc = 0.375


100%|██████████| 334/334 [00:02<00:00, 123.16it/s]
100%|██████████| 42/42 [00:00<00:00, 285.69it/s]


epoch4: train_loss = 0.7209171405315221, train_acc = 0.7299700486709098, valid_loss = 2.036397695541382, valid_acc = 0.36227544910179643


100%|██████████| 334/334 [00:03<00:00, 107.68it/s]
100%|██████████| 42/42 [00:00<00:00, 187.17it/s]


epoch5: train_loss = 0.6848365614272288, train_acc = 0.7436353425683264, valid_loss = 2.268230438232422, valid_acc = 0.37649700598802394


100%|██████████| 334/334 [00:03<00:00, 110.00it/s]
100%|██████████| 42/42 [00:00<00:00, 253.33it/s]


epoch6: train_loss = 0.6584594965605912, train_acc = 0.755335080494197, valid_loss = 2.163968324661255, valid_acc = 0.38772455089820357


100%|██████████| 334/334 [00:02<00:00, 116.59it/s]
100%|██████████| 42/42 [00:00<00:00, 287.23it/s]


epoch7: train_loss = 0.6272403965367668, train_acc = 0.7690003743916136, valid_loss = 2.366326093673706, valid_acc = 0.39221556886227543


100%|██████████| 334/334 [00:03<00:00, 97.09it/s] 
100%|██████████| 42/42 [00:00<00:00, 229.38it/s]


epoch8: train_loss = 0.5942053247685292, train_acc = 0.7816360913515538, valid_loss = 2.5697007179260254, valid_acc = 0.3884730538922156


100%|██████████| 334/334 [00:02<00:00, 117.05it/s]
100%|██████████| 42/42 [00:00<00:00, 221.80it/s]


epoch9: train_loss = 0.5916570539735132, train_acc = 0.7819168850617746, valid_loss = 2.5829761028289795, valid_acc = 0.38173652694610777




0,1
train_acc,▁▂▃▃▅▅▅▅▆▆▆▆▇▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇█▇▇▇▇███████
train_loss,█▇▇▇▅▅▅▅▄▄▄▄▂▃▃▃▃▃▃▃▂▂▃▂▂▂▂▂▁▂▂▂▂▁▁▁▁▁▁▁

0,1
train_acc,0.78163
train_loss,0.59182


<h2 id="86-畳み込みニューラルネットワーク-cnn">86. 畳み込みニューラルネットワーク (CNN)</h2>
<p>ID番号で表現された単語列\(\boldsymbol{x} = (x_1, x_2, \dots, x_T)\)がある．ただし，\(T\)は単語列の長さ，\(x_t \in \mathbb{R}^{V}\)は単語のID番号のone-hot表記である（\(V\)は単語の総数である）．畳み込みニューラルネットワーク（CNN: Convolutional Neural Network）を用い，単語列\(\boldsymbol{x}\)からカテゴリ\(y\)を予測するモデルを実装せよ．</p>
<p>ただし，畳み込みニューラルネットワークの構成は以下の通りとする．</p>
<ul>
<li>単語埋め込みの次元数: \(d_w\)</li>
<li>畳み込みのフィルターのサイズ: 3 トークン</li>
<li>畳み込みのストライド: 1 トークン</li>
<li>畳み込みのパディング: あり</li>
<li>畳み込み演算後の各時刻のベクトルの次元数: \(d_h\)</li>
<li>畳み込み演算後に最大値プーリング（max pooling）を適用し，入力文を\(d_h\)次元の隠れベクトルで表現</li>
</ul>
<p>すなわち，時刻\(t\)の特徴ベクトル\(p_t \in \mathbb{R}^{d_h}\)は次式で表される．</p>

<p>\[p_t = g(W^{(px)} [\mathrm{emb}(x_{t-1}); \mathrm{emb}(x_t); \mathrm{emb}(x_{t+1})] + b^{(p)}))\]</p>

<p>ただし，\(W^{(px)} \in \mathbb{R}^{d_h \times 3d_w}, b^{(p)} \in \mathbb{R}^{d_h}\)はCNNのパラメータ，\(g\)は活性化関数（例えば\(\tanh\)やReLUなど），\([a; b; c]\)はベクトル\(a, b, c\)の連結である．なお，行列\(W^{(px)}\)の列数が\(3d_w\)になるのは，3個のトークンの単語埋め込みを連結したものに対して，線形変換を行うためである．</p>
<p>最大値プーリングでは，特徴ベクトルの次元毎に全時刻における最大値を取り，入力文書の特徴ベクトル\(c \in \mathbb{R}^{d_h}\)を求める．\(c[i]\)でベクトル\(c\)の\(i\)番目の次元の値を表すことにすると，最大値プーリングは次式で表される．</p>

<p>\[c[i] = \max_{1 \leq t \leq T} p_t[i]]\]</p>

<p>最後に，入力文書の特徴ベクトル\(c\)に行列\(W^{(yc)} \in \mathbb{R}^{L \times d_h}\)とバイアス項\(b^{(y)} \in \mathbb{R}^{L}\)による線形変換とソフトマックス関数を適用し，カテゴリ\(y\)を予測する．</p>

<p>\[y = {\rm softmax}(W^{(yc)} c + b^{(y)}))\]</p>

<p>なお，この問題ではモデルの学習を行わず，ランダムに初期化された重み行列で\(y\)を計算するだけでよい．</p>


In [142]:
from torch.nn import functional as F

class CNN(nn.Module):
    def __init__(self, vocab_size, padding_idx, out_channels,  emb_size=300, kernel_heights=3, stride=1, n_labels=4, device="cpu", emb_weight=None) -> None:
        """
        stride: 動かす単位（小さいほど細かい）
        kenel_height: 窓の大きさ
        out_channels: 
        conv2d: convolution 層（次元を維持しつつ畳み込み）
        max_pool1d: pooling層（最大値を取り，ダウンサンプリングする）
        """
        super(CNN, self).__init__()
        # 入力ベクトルの大きさが異なるので，emb層で形をそろえる
        if emb_weight is None:
            self.emb = nn.Embedding(vocab_size, emb_size, padding_idx=padding_idx)
        else:
            self.emb = nn.Embedding.from_pretrained(emb_weight, padding_idx=padding_idx)
        self.conv = nn.Conv2d(1, out_channels, (kernel_heights, emb_size), stride, (padding_idx, 0))
        self.drop = nn.Dropout(0.3)
        self.func = nn.Linear(out_channels, n_labels)
        
    def forward(self, x):
        emb = self.emb(x).unsqueeze(1)
        conv = self.conv(emb)  # 畳み込み層
        act = F.relu(conv.squeeze(3))  # 活性化関数
        max_pool = F.max_pool1d(act, act.size()[2])  # pooling 層
        out = self.func(self.drop(max_pool.squeeze(2)))  # 全結合層？
        return out

In [143]:
vocab_size = len(vocab) + 1  # padding の分 +1 する
padding_idx = len(vocab)  # 空き単語を埋めるときは最大値を入れる
out_channels = 50 # ハイパラ？
emb_size = 300  # ハイパラ
kernel_height = 3
stride = 1
n_labels = 4  # ラベル数

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"device: {device}")

model = CNN(vocab_size, padding_idx, out_channels, emb_size, kernel_height, stride, n_labels, device)

for i in range(5):
    xi = X_train_tokenized[i]['inputs']
    yi = X_train_tokenized[i]['labels']
    pred_probs = model(xi.unsqueeze(0))
    print(f"予測値：{pred_probs}")
    print(f"予測ラベル：{pred_probs.argmax()}")
    print(f"正解ラベル：{yi}")

device: cpu
予測値：tensor([[ 1.6340,  0.9635, -0.7566,  0.2272]], grad_fn=<AddmmBackward0>)
予測ラベル：0
正解ラベル：2
予測値：tensor([[ 0.8758,  0.1956, -0.2810,  0.8973]], grad_fn=<AddmmBackward0>)
予測ラベル：3
正解ラベル：0
予測値：tensor([[ 1.1807, -0.1821, -1.1514,  0.2550]], grad_fn=<AddmmBackward0>)
予測ラベル：0
正解ラベル：2
予測値：tensor([[ 0.6415,  0.3134, -0.9988, -0.2397]], grad_fn=<AddmmBackward0>)
予測ラベル：0
正解ラベル：2
予測値：tensor([[ 1.3651,  0.0785, -0.9187,  0.1324]], grad_fn=<AddmmBackward0>)
予測ラベル：0
正解ラベル：2


<h2 id="87-確率的勾配降下法によるcnnの学習">87. 確率的勾配降下法によるCNNの学習</h2>
<p>確率的勾配降下法（SGD: Stochastic Gradient Descent）を用いて，問題86で構築したモデルを学習せよ．訓練データ上の損失と正解率，評価データ上の損失と正解率を表示しながらモデルを学習し，適当な基準（例えば10エポックなど）で終了させよ．</p>


In [144]:
# ちゃんとロスが下がった！！
# cpuで回すと10時間くらい
# GPUだと一瞬だった -> 10分弱くらい
vocab_size = len(vocab) + 1  # padding の分 +1 する
padding_idx = len(vocab)  # 空き単語を埋めるときは最大値を入れる
emb_size = 300  # ハイパラ
hidden_size = 50  # ハイパラ
n_labels = 4  # ラベル数
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"device: {device}")

train_loader = DataLoader(X_train_tokenized, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
valid_loader = DataLoader(X_valid_tokenized, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
test_loader = DataLoader(X_test_tokenized, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)

model = CNN(vocab_size, padding_idx, out_channels, emb_size, kernel_height, stride, n_labels, device)
output_path = "./trained_param.npz"
total_epochs = 10
train(model, train_loader, valid_loader, output_path, total_epochs, device)

device: cpu


 51%|█████     | 170/334 [16:28<15:53,  5.81s/it]


KeyboardInterrupt: 

In [173]:
!cat src/q87.log

INFO:root:epoch0: train_loss = 25.6953570352695, train_acc = 0.5040247098464994, valid_loss = 14.356131553649902, valid_acc = 0.6137724550898204
INFO:root:epoch1: train_loss = 8.639520003883993, train_acc = 0.654530138524897, valid_loss = 3.2860076427459717, valid_acc = 0.6961077844311377
INFO:root:epoch2: train_loss = 2.6602175040264693, train_acc = 0.7282852864095845, valid_loss = 2.087599515914917, valid_acc = 0.6729041916167665
INFO:root:epoch3: train_loss = 1.2603243913422002, train_acc = 0.7751778360164733, valid_loss = 1.8979467153549194, valid_acc = 0.7245508982035929
INFO:root:epoch4: train_loss = 0.8703111402776883, train_acc = 0.8084050917259453, valid_loss = 1.532151222229004, valid_acc = 0.7372754491017964
INFO:root:epoch5: train_loss = 0.6320607877768456, train_acc = 0.827311868214152, valid_loss = 1.5626899003982544, valid_acc = 0.7485029940119761
INFO:root:epoch6: train_loss = 0.48046755281917525, train_acc = 0.8442530887308124, valid_loss = 1.5024590492248535, valid_ac

<h2 id="88-パラメータチューニング">88. パラメータチューニング</h2>
<p>問題85や問題87のコードを改変し，ニューラルネットワークの形状やハイパーパラメータを調整しながら，高性能なカテゴリ分類器を構築せよ．</p>


一番性能が良かったCNNを採用する

optuna でパラメータを自動最適化する

あまりにも時間がかかるので，src/q88.py で実行した

ここに関しては，GPUのほうがCPUよりも3倍くらい早かった気がする

running.log に実行履歴が残っている

In [58]:
# パラメータチューニング用に引数を変更
from torch.nn import functional as F

class CNN(nn.Module):
    def __init__(self, vocab_size, padding_idx, out_channels,  emb_size=300, kernel_heights=3, stride=1, n_labels=4, device="cpu", emb_weight=None, active_func='relu', dropout=0.3) -> None:
        """
        stride: 動かす単位（小さいほど細かい）
        kenel_height: 窓の大きさ
        out_channels: 
        conv2d: convolution 層（次元を維持しつつ畳み込み）
        max_pool1d: pooling層（最大値を取り，ダウンサンプリングする）
        """
        super(CNN, self).__init__()
        # 入力ベクトルの大きさが異なるので，emb層で形をそろえる
        if emb_weight is None:
            self.emb = nn.Embedding(vocab_size, emb_size, padding_idx=padding_idx)
        else:
            self.emb = nn.Embedding.from_pretrained(emb_weight, padding_idx=padding_idx)
        self.conv = nn.Conv2d(1, out_channels, (kernel_heights, emb_size), stride, (padding_idx, 0))
        self.drop = nn.Dropout(dropout)
        self.func = nn.Linear(out_channels, n_labels)
        self.active_func = active_func # 活性化関数をパラメータにする
        
    def forward(self, x):
        emb = self.emb(x).unsqueeze(1)
        conv = self.conv(emb)  # 畳み込み層

        # 活性化関数の最適化を行う
        if self.active_func == 'relu':
            act = F.relu(conv.squeeze(3))
        elif self.active_func == 'tanh':
            act = torch.tanh(conv.squeeze(3))
        elif self.active_func == 'mish':
            act = F.mish(conv.squeeze(3))
        else:
            act = F.relu(conv.squeeze(3))

        max_pool = F.max_pool1d(act, act.size()[2])  # pooling 層
        out = self.func(self.drop(max_pool.squeeze(2)))  # 全結合層？
        return out

In [59]:
# early stoppingを差し込む
class EarlyStopping:
    """earlystoppingクラス"""

    def __init__(self, patience=5, verbose=False, path='checkpoint_model.pth'):
        """引数：最小値の非更新数カウンタ、表示設定、モデル格納path"""

        self.patience = patience    #設定ストップカウンタ
        self.verbose = verbose      #表示の有無
        self.counter = 0            #現在のカウンタ値
        self.best_score = None      #ベストスコア
        self.early_stop = False     #ストップフラグ
        self.val_loss_min = np.Inf   #前回のベストスコア記憶用
        self.path = path             #ベストモデル格納path

    def __call__(self, val_loss, model):
        """
        特殊(call)メソッド
        実際に学習ループ内で最小lossを更新したか否かを計算させる部分
        """
        score = -val_loss

        if self.best_score is None:  #1Epoch目の処理
            self.best_score = score   #1Epoch目はそのままベストスコアとして記録する
            self.checkpoint(val_loss, model)  #記録後にモデルを保存してスコア表示する
        elif score < self.best_score:  # ベストスコアを更新できなかった場合
            self.counter += 1   #ストップカウンタを+1
            if self.verbose:  #表示を有効にした場合は経過を表示
                print(f'EarlyStopping counter: {self.counter} out of {self.patience}')  #現在のカウンタを表示する 
            if self.counter >= self.patience:  #設定カウントを上回ったらストップフラグをTrueに変更
                self.early_stop = True
        else:  #ベストスコアを更新した場合
            self.best_score = score  #ベストスコアを上書き
            self.checkpoint(val_loss, model)  #モデルを保存してスコア表示
            self.counter = 0  #ストップカウンタリセット

    def checkpoint(self, val_loss, model):
        '''ベストスコア更新時に実行されるチェックポイント関数'''
        if self.verbose:  #表示を有効にした場合は、前回のベストスコアからどれだけ更新したか？を表示
            print(f'Validation loss decreased ({self.val_loss_min:.6f} --> {val_loss:.6f}).  Saving model ...')
        torch.save(model.state_dict(), self.path)  #ベストモデルを指定したpathに保存
        self.val_loss_min = val_loss  #その時のlossを記録する


In [171]:
f = open('work/train-id.txt', 'wb')
pickle.dump(train_tokenized_id, f)
f = open('work/valid-id.txt', 'wb')
pickle.dump(valid_tokenized_id, f)
f = open('work/test-id.txt', 'wb')
pickle.dump(test_tokenized_id, f)

In [67]:
# パラメータチューニング用に引数と返り値を変更
# early stopping の機構を追加
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from torch.utils.data import DataLoader # データローダ使ってみる
from tqdm import tqdm
import time
from sklearn.metrics import accuracy_score

# 学習率を引数に追加
def train(model, train_loader, valid_loader, output_path, total_epochs, device, lr=0.01, op='sgd'):
    earlystopping = EarlyStopping(patience=3, verbose=True)
    
    # 最適化手法を変更
    if op == 'sgd':  
        optimizer = optim.SGD(model.parameters(), lr=lr)
    elif op == 'adam':
        optimizer = optim.Adam(model.parameters(), lr=lr)
    elif op == 'rmsprop':
        optimizer = optim.RMSprop(model.parameters(), lr=lr)
    else:
        optimizer = optim.SGD(model.parameters(), lr=lr)
        
    loss_func = nn.CrossEntropyLoss()

    model = model.to(device)
    # 指定した epoch 数だけ学習
    for epoch in range(total_epochs):
        train_total_loss = 0.
        train_acc_cnt = 0

        # パラメータ更新
        model.train()
        for data in tqdm(train_loader):
            x = data['inputs']
            x = x.to(device)
            y = data['labels']
            y = y.to(device)
            y_pred = model(x)

            # バッチの中で損失計算
            train_loss = 0.
            for yi, yi_pred in zip(y, y_pred):
                loss_i = loss_func(yi_pred, yi)
                train_loss += loss_i
            
            optimizer.zero_grad()  # 勾配の初期化
            train_loss.backward()  # 勾配計算
            optimizer.step()  # パラメータ修正
            train_total_loss += train_loss.item()

            # バッチの中で正解率の計算
            for yi, yi_pred in zip(y, y_pred):
                if yi.item() == yi_pred.argmax():
                    train_acc_cnt += 1
        
        #★毎エポックearlystoppingの判定をさせる★
        train_ave_loss = train_total_loss / len(X_train_tokenized)
        
        earlystopping(train_ave_loss, model) #callメソッド呼び出し
        if earlystopping.early_stop: #ストップフラグがTrueの場合、breakでforループを抜ける
            print(f"epoch{epoch}: train_loss = {train_ave_loss}")
            print("Early Stopping!")
            break
                
        # train のロスと正解率の計算
        model.eval()
        train_acc = measure_acc(model, X_train_tokenized[:]['inputs'], X_train_tokenized[:]['labels'], device)


        # valid のロスと正解率の計算
        model.eval()
        valid_acc_cnt = 0
        valid_total_loss = 0.
        with torch.no_grad():
            for data in tqdm(valid_loader):
                x = data['inputs']
                x = x.to(device)
                y = data['labels']
                y = y.to(device)
                y_pred = model(x)

                # バッチの中で損失計算
                valid_loss = 0.
                for yi, yi_pred in zip(y, y_pred):
                    # print(yi)
                    # print(yi_pred)
                    loss_i = loss_func(yi_pred, yi)
                    valid_loss += loss_i

                optimizer.zero_grad()  # 勾配の初期化
                # valid_loss.backward()  # 勾配計算
                # optimizer.step()  # パラメータ修正
                valid_total_loss += valid_loss

                # バッチの中で正解率の計算
                for yi, yi_pred in zip(y, y_pred):
                    if yi.item() == yi_pred.argmax():
                        valid_acc_cnt += 1

            # valid のロスと正解率の計算
            valid_acc = measure_acc(model, X_valid[:]['inputs'], X_valid[:]['labels'], device)

        # 表示
        train_ave_loss = train_total_loss / len(X_train_tokenized)
        # train_acc = train_acc_cnt / len(X_train)
        valid_ave_loss = valid_total_loss / len(X_valid_tokenized)
        # valid_acc = valid_acc_cnt / len(X_valid)
        print(f"epoch{epoch}: train_loss = {train_ave_loss}, train_acc = {train_acc}, valid_loss = {valid_ave_loss}, valid_acc = {valid_acc}")

    # パラメータを保存
    torch.save(model.state_dict(), output_path)
    
    # valid loss を返り値とする
    return valid_ave_loss

In [68]:
# optuna でパラメータの自動最適化
from typing import Any
import optuna

def objective(trial):
    # 固定のもの
    vocab_size = len(vocab) + 1  # padding の分 +1 する
    padding_idx = len(vocab)  # 空き単語を埋めるときは最大値を入れる
    n_labels = 4  # ラベル数
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    train_loader = DataLoader(X_train_tokenized, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
    valid_loader = DataLoader(X_valid_tokenized, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
    test_loader = DataLoader(X_test_tokenized, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
    output_path = "./trained_param.npz"
    total_epochs = 10

    # ハイパラを変更させる
    out_channels = trial.suggest_categorical('out_channels', [16, 32, 64, 128])  # これだけよくわかっていない
    emb_size = trial.suggest_categorical('emb_size', [50, 100, 200, 300])  # 特徴ベクトルの次元数
    kernel_height = trial.suggest_int('kernel_height', 1, 5, step=1)  # 窓の大きさ
    stride = trial.suggest_int('stride', 1, 2, step=1)  # 窓を動かす単位
    active_func = trial.suggest_categorical('active_func', ['relu', 'tanh', 'mish'])  # 活性化関数
    lr = trial.suggest_float('lr', 1e-3, 1e-2, log=True)  # 学習率
    dropout = trial.suggest_float('dropout', 0.2, 0.5)  # ドロップアウト
    op = trial.suggest_categorical('optimizer', ['rmsprop', 'adam', 'sgd'])  # 最適化手法 

    print(f"device: {device}")
    model = CNN(vocab_size, padding_idx, out_channels, emb_size, kernel_height, stride, n_labels, device, active_func=active_func, dropout=dropout)
    valid_loss = train(model, train_loader, valid_loader, output_path, total_epochs, device, lr, op)

    # 訓練の最後で得られた valid_loss でパラメータチューニングを行う
    return valid_loss

In [69]:
# 流石に時間がかかりすぎるのでサーバで実行 (100時間くらいかかる？)
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=10)

[I 2023-11-08 21:26:36,072] A new study created in memory with name: no-name-640055c7-92dc-42e9-930c-ccce990283a0


device: cpu


  1%|          | 2/334 [00:09<27:00,  4.88s/it]
[W 2023-11-08 21:26:45,886] Trial 0 failed with parameters: {'out_channels': 16, 'emb_size': 200, 'kernel_height': 5, 'stride': 2, 'active_func': 'relu', 'lr': 0.004388631453821411, 'dropout': 0.4163492420037371, 'optimizer': 'sgd'} because of the following error: KeyboardInterrupt().
Traceback (most recent call last):
  File "/Users/nyuton/.pyenv/versions/anaconda3-2022.10/lib/python3.9/site-packages/optuna/study/_optimize.py", line 200, in _run_trial
    value_or_values = func(trial)
  File "/var/folders/bw/n738lb8d773382cjs4qszbbh0000gn/T/ipykernel_2658/4258090099.py", line 29, in objective
    valid_loss = train(model, train_loader, valid_loader, output_path, total_epochs, device, lr, op)
  File "/var/folders/bw/n738lb8d773382cjs4qszbbh0000gn/T/ipykernel_2658/4177580950.py", line 50, in train
    train_loss.backward()  # 勾配計算
  File "/Users/nyuton/.pyenv/versions/anaconda3-2022.10/lib/python3.9/site-packages/torch/_tensor.py", line 48

KeyboardInterrupt: 

In [65]:
print(f"最高精度のACC：{study.best_value}")
print('最高精度のパラメータ')
pprint(study.best_params)

ValueError: No trials are completed yet.

<h2 id="89-事前学習済み言語モデルからの転移学習">89. 事前学習済み言語モデルからの転移学習</h2>
<p>事前学習済み言語モデル（例えば<a href="https://github.com/google-research/bert">BERT</a>など）を出発点として，ニュース記事見出しをカテゴリに分類するモデルを構築せよ．</p>


こちらもかなり時間がかかるので，src/q89.pyを用いてサーバ上で回した

出力ログを q89.running.log ファイルに残した

In [46]:
import tensorflow as tf
import transformers
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertForSequenceClassification

class BERTmodel(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.bert_sc = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=4)

    def forward(self, encoding):
        outputs = self.bert_sc(**encoding)
        return outputs
    
class CreateDataset(torch.utils.data.Dataset):
    def __init__(self, X, y, transform=None):
        self.X = X
        self.y = y
    
    def __len__(self):
        return len(self.y)
    
    def __getitem__(self, index):
        return {
            'inputs': self.X[index],
            'labels': self.y[index]
        }

In [47]:
# BERT用
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from torch.utils.data import DataLoader # データローダ使ってみる
from tqdm import tqdm
import time
from sklearn.metrics import accuracy_score

def bert_train(model, train_loader, valid_loader, output_path, total_epochs, device, lr=0.01):
    optimizer = optim.SGD(model.parameters(), lr=lr)
    loss_func = nn.CrossEntropyLoss()

    # BERTモデルのエンコード用
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


    model = model.to(device)
    # 指定した epoch 数だけ学習
    for epoch in range(total_epochs):
        train_total_loss = 0.
        train_acc_cnt = 0

        # パラメータ更新
        model.train()
        for batch in tqdm(train_loader):
            x_texts = batch['inputs']
            x_encordings = tokenizer(
                list(x_texts), 
                max_length=128, 
                padding='max_length', 
                truncation=True, 
                return_tensors='pt', 
                return_attention_mask=True, 
                return_token_type_ids=True
            )
            x_encordings = x_encordings.to(device)
            y = batch['labels']
            y = y.to(device)
            y_pred = model(x_encordings).logits

            # バッチの中で損失計算
            train_loss = loss_func(y_pred, y)

            # train_loss = 0.
            # for yi, yi_pred in zip(y, y_pred):
            #     loss_i = loss_func(yi_pred, yi)
            #     train_loss += loss_i
            
            optimizer.zero_grad() # 勾配の初期化
            train_loss.backward()  # 勾配計算
            optimizer.step()  # パラメータ修正
            train_total_loss += train_loss.item()

            # バッチの中で正解率の計算 # ここを修正
            for yi, yi_pred in zip(y, y_pred):
                if yi.item() == yi_pred.argmax():
                    train_acc_cnt += 1
                
        # train のロスと正解率の計算
        model.eval()
        # train_acc = measure_acc(model, X_train[:]['inputs'], X_train[:]['labels'], device)


        # valid のロスと正解率の計算
        model.eval()
        valid_acc_cnt = 0
        valid_total_loss = 0.
        with torch.no_grad():
            for batch in tqdm(valid_loader):
                x_texts = batch['inputs']
                x_encordings = tokenizer(
                    list(x_texts), 
                    max_length=128, 
                    padding='max_length', 
                    truncation=True, 
                    return_tensors='pt', 
                    return_attention_mask=True, 
                    return_token_type_ids=True
                )
                x_encordings = x_encordings.to(device)
                y = batch['labels']
                y = y.to(device)
                y_pred = model(x_encordings).logits

                # バッチの中で損失計算
                valid_loss = loss_func(y_pred, y)
                # valid_loss = 0.
                # for yi, yi_pred in zip(y, y_pred):
                #     # print(yi)
                #     # print(yi_pred)
                #     loss_i = loss_func(yi_pred, yi)
                #     valid_loss += loss_i

                optimizer.zero_grad()  # 勾配の初期化
                # valid_loss.backward()  # 勾配計算
                # optimizer.step()  # パラメータ修正
                valid_total_loss += valid_loss

                # バッチの中で正解率の計算  # ここを修正
                for yi, yi_pred in zip(y, y_pred):
                    if yi.item() == yi_pred.argmax():
                        valid_acc_cnt += 1

            # valid のロスと正解率の計算
            # valid_acc = measure_acc(model, X_valid[:]['inputs'], X_valid[:]['labels'], device)

        # 表示
        train_ave_loss = train_total_loss / len(X_train_tokenized)
        train_acc = train_acc_cnt / len(X_train_tokenized)
        valid_ave_loss = valid_total_loss / len(X_valid_tokenized)
        valid_acc = valid_acc_cnt / len(X_valid_tokenized)
        print(f"epoch{epoch}: train_loss = {train_ave_loss}, train_acc = {train_acc}, valid_loss = {valid_ave_loss}, valid_acc = {valid_acc}")

    # パラメータを保存
    torch.save(model.state_dict(), output_path)

In [54]:
# BERTモデルに入れるためのデータセットの作成
category_dict = {'b': 0, 't': 1, 'e': 2, 'm': 3}
batch_size = 32

y_train = torch.tensor(train_data['CATEGORY'].map(category_dict).values, dtype=torch.int64)
y_valid = torch.tensor(valid_data['CATEGORY'].map(category_dict).values, dtype=torch.int64)
y_test = torch.tensor(test_data['CATEGORY'].map(category_dict).values, dtype=torch.int64)

train_set = CreateDataset(train_data['TITLE'].to_list(), y_train)
valid_set = CreateDataset(valid_data['TITLE'].to_list(), y_valid)
test_set = CreateDataset(test_data['TITLE'].to_list(), y_test)

train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_set, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=True)

In [55]:
model = BERTmodel()
total_epochs = 10
lr = 0.01
device = 'cpu'

bert_train(model, train_loader, valid_loader, 'bert_param.npz', total_epochs, device, lr)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

epoch0: train_loss = 0.0036510499323275775, train_acc = 0.08152377386746537, valid_loss = 0.025179985910654068, valid_acc = 0.7155688622754491


100%|██████████| 42/42 [04:49<00:00,  6.89s/it]
100%|██████████| 42/42 [01:30<00:00,  2.16s/it]


epoch1: train_loss = 0.0021149115124386108, train_acc = 0.1012729314863347, valid_loss = 0.011974153108894825, valid_acc = 0.8720059880239521


100%|██████████| 42/42 [05:07<00:00,  7.33s/it]
100%|██████████| 42/42 [01:38<00:00,  2.35s/it]


epoch2: train_loss = 0.0016038271140701892, train_acc = 0.10922875327592661, valid_loss = 0.011902498081326485, valid_acc = 0.8787425149700598


  5%|▍         | 2/42 [00:22<07:37, 11.45s/it]


KeyboardInterrupt: 

In [6]:
# 結果
!cat src/q89.running.log

INFO:root:Start running...
INFO:root:Start running...
INFO:root:epoch0: train_loss = 0.013288951948220946, train_acc = 0.8549232497192063, valid_loss = 0.007369755767285824, valid_acc = 0.9184131736526946
INFO:root:epoch1: train_loss = 0.007372740262478467, train_acc = 0.9213777611381505, valid_loss = 0.006601303815841675, valid_acc = 0.9236526946107785
INFO:root:epoch2: train_loss = 0.005779119322473237, train_acc = 0.9395357543991014, valid_loss = 0.0064792754128575325, valid_acc = 0.9206586826347305
INFO:root:epoch3: train_loss = 0.004371707276756712, train_acc = 0.953575439910146, valid_loss = 0.005881978198885918, valid_acc = 0.937125748502994
INFO:root:epoch4: train_loss = 0.0037143926281943632, train_acc = 0.9605952826656683, valid_loss = 0.007777113933116198, valid_acc = 0.9221556886227545
INFO:root:epoch5: train_loss = 0.003243646150737658, train_acc = 0.9663983526769, valid_loss = 0.005479299463331699, valid_acc = 0.9446107784431138
INFO:root:epoch6: train_loss = 0.0025273937