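## 80. Converting words to ID numbers
1. ~~Build and save a dictionary that maps words to IDs (slow, so from the second run onward skip straight to step 2)~~
2. Load the dictionary
3. Define a function that converts a word sequence into an ID sequence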

In [1]:
# Load the dictionary from the pickle file
import pickle

with open('../data/ch09/name_to_id.pkl', 'rb') as tf:
    name_to_id = pickle.load(tf)
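
Step 1 is struck out above, so the notebook does not show how `name_to_id.pkl` was produced. The following is a minimal sketch of one plausible construction, assuming the usual scheme for this exercise: words are ranked by frequency, the most frequent word gets ID 1, and words seen fewer than twice get no entry so that they later map to 0. The function name `build_name_to_id`, the `min_count` threshold, and the toy corpus are assumptions for illustration, not the author's code.

```python
from collections import Counter

def build_name_to_id(token_lists, min_count=2):
    """Assign IDs 1, 2, ... to words in descending order of frequency.

    Words seen fewer than `min_count` times get no entry, so a later
    lookup can fall back to ID 0 for them (hypothetical threshold).
    """
    counts = Counter(token for tokens in token_lists for token in tokens)
    name_to_id = {}
    # most_common() sorts by count; ties keep first-seen order
    for word, count in counts.most_common():
        if count < min_count:
            break
        name_to_id[word] = len(name_to_id) + 1
    return name_to_id

# Hypothetical toy corpus, already tokenized
toy_corpus = [
    ['stocks', 'rise', 'as', 'fed', 'holds', 'rates'],
    ['fed', 'holds', 'rates', 'again'],
    ['stocks', 'fall', 'as', 'rates', 'rise'],
]
toy_name_to_id = build_name_to_id(toy_corpus)
print(toy_name_to_id)
```

In this toy corpus `rates` appears three times and receives ID 1, while `again` and `fall` appear once and are left out.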

In [2]:
# Function that returns the sequence of ID numbers for a given word sequence
# ch06 used CountVectorizer, so its analyzer is reused here for consistent preprocessing
from sklearn.feature_extraction.text import CountVectorizer

# analyzer: function that preprocesses a string and splits it into a list of tokens
# (built once here rather than on every call)
analyzer = CountVectorizer().build_analyzer()

def convert_words_to_ids(words):
    '''
    input : words (word sequence, as a single string)
    output: ids (list of ID numbers)
    '''
    word_list = analyzer(words)

    # Unknown words get ID 0
    ids = [name_to_id.get(word, 0) for word in word_list]

    return ids
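
The default `CountVectorizer` analyzer lowercases the text and keeps only tokens of two or more word characters (its default `token_pattern` is `(?u)\b\w\w+\b`), so single-character tokens and punctuation never reach the dictionary lookup. The sketch below reproduces that behaviour with the standard `re` module so it runs without scikit-learn; `toy_name_to_id` and `toy_convert_words_to_ids` are hypothetical names used only here.

```python
import re

# Same pattern as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r'(?u)\b\w\w+\b')

# Hypothetical toy dictionary; the real one is loaded from name_to_id.pkl
toy_name_to_id = {'to': 1, 'the': 2, 'fed': 3, 'rates': 4}

def toy_convert_words_to_ids(text, name_to_id):
    # Lowercase, tokenize, then look up each token (unknown words -> 0)
    tokens = TOKEN_PATTERN.findall(text.lower())
    return tokens, [name_to_id.get(token, 0) for token in tokens]

tokens, ids = toy_convert_words_to_ids(
    'The Fed to raise rates by a quarter-point', toy_name_to_id)
print(tokens)
print(ids)
```

The single-letter token `a` is dropped before lookup, and `quarter-point` is split at the hyphen into two tokens, both unknown here.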

## GPU preparation
1. Check which GPUs are available
2. Select the GPU to use
3. Check the number of GPUs visible to PyTorch

In [3]:
# Check which GPUs are available
!nvidia-smi

Tue Jun 14 19:24:01 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 471.41       Driver Version: 471.41       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ... WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   59C    P8    N/A /  N/A |    679MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [4]:
# Select the GPU (must be set before PyTorch initializes CUDA)
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # use GPU 0

In [5]:
# Check
import torch
print(torch.cuda.device_count())  # number of GPUs available to PyTorch

1


## Preparation
1. Get the vocabulary size
2. Prepare the training data (labels)
3. Prepare the training data (features)
4. Fix the random seeds

In [3]:
# Vocabulary size: IDs 1..max for known words, plus one for unknown words (all share ID 0) and one for padding
vocab_size = max(name_to_id.values())+2
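
The `+2` reserves two IDs beyond the ones stored in the dictionary: 0 for unknown words and `max_id + 1` for padding. A small sketch with a hypothetical three-word dictionary makes the layout concrete:

```python
# Hypothetical toy dictionary: IDs 1..3 are real words
toy_name_to_id = {'rates': 1, 'fed': 2, 'stocks': 3}

max_id = max(toy_name_to_id.values())
unknown_id = 0            # shared by every out-of-vocabulary word
padding_id = max_id + 1   # one past the largest word ID
toy_vocab_size = max_id + 2

# Every ID the embedding layer can receive: 0, 1, ..., vocab_size - 1
all_ids = [unknown_id, *sorted(toy_name_to_id.values()), padding_id]
print(toy_vocab_size, all_ids)
```

The padding ID equals `vocab_size - 1`, which is the value used later for both `padding_value` and `padding_idx`.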

In [4]:
# Prepare the training, validation, and test data
import torch
import torch.nn as nn
import numpy as np
import torch.nn.functional as F

# Labels: reuse the output of ch08
Y_train = np.loadtxt('../data/ch08/Y_train.txt')
Y_valid = np.loadtxt('../data/ch08/Y_valid.txt')
Y_test = np.loadtxt('../data/ch08/Y_test.txt')

# Convert to int64 tensors for PyTorch
Y_train_long = torch.tensor(Y_train, dtype=torch.int64)
Y_valid_long = torch.tensor(Y_valid, dtype=torch.int64)
Y_test_long = torch.tensor(Y_test, dtype=torch.int64)

In [5]:
# Features: built with convert_words_to_ids (from problem 80)
def convert_text_to_features(fname):
    '''
    input : fname (path to a text file, one example per line)
    output: features (int64 tensor of shape (number of lines, longest sequence length))
    '''
    with open(fname, encoding='utf-8') as f:
        lines = f.readlines()

    # Convert each line to a list of IDs
    ids_list = [convert_words_to_ids(line) for line in lines]

    # Convert each ID list to a tensor
    ids_tensor = [torch.tensor(ids, dtype=torch.int64) for ids in ids_list]

    # Pad with the largest ID + 1 (= vocab_size - 1)
    features = torch.nn.utils.rnn.pad_sequence(ids_tensor, batch_first=True, padding_value=vocab_size-1)

    return features

# Feature extraction
X_train_long = convert_text_to_features('../data/ch06/train.txt')
X_valid_long = convert_text_to_features('../data/ch06/valid.txt')
X_test_long = convert_text_to_features('../data/ch06/test.txt')
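
`pad_sequence` right-pads every sequence to the length of the longest one in the list. The sketch below runs it on three short ID sequences; the values and the padding ID 9 are made up for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical ID sequences of different lengths
toy_ids = [
    torch.tensor([3, 1, 4], dtype=torch.int64),
    torch.tensor([5], dtype=torch.int64),
    torch.tensor([2, 6], dtype=torch.int64),
]
toy_padding_id = 9  # stands in for vocab_size - 1

padded = pad_sequence(toy_ids, batch_first=True, padding_value=toy_padding_id)
print(padded.shape)  # (number of sequences, longest length)
print(padded)
```

Because each file is padded separately, the training, validation, and test tensors can end up with different widths. That is harmless for an RNN, which accepts any sequence length.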

In [6]:
# Fix the random seeds
import random

def fix_seed(seed):
    # Python's random module
    random.seed(seed)
    # NumPy
    np.random.seed(seed)
    # PyTorch (CPU and all GPUs)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
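
Each call in `fix_seed` resets one library's generator to a known state, which is why `fix_seed(42)` is called right before each model is built. The sketch below shows the effect with Python's `random` module alone; NumPy and PyTorch generators behave the same way when re-seeded. `toy_draws` is a hypothetical helper.

```python
import random

def toy_draws(seed, n=3):
    # Re-seed, then draw n numbers from Python's generator
    random.seed(seed)
    return [random.random() for _ in range(n)]

same_a = toy_draws(42)
same_b = toy_draws(42)
other = toy_draws(0)
print(same_a == same_b)  # identical seed, identical draws
print(same_a == other)   # different seed, different draws
```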

## 83. Mini-batching and training on the GPU
1. ~~Define a GPU-ready RNN~~
2. Define a GPU-ready function for measuring accuracy
3. ~~Train on the GPU with mini-batches~~

In [8]:
# Measure loss and accuracy
def measure_loss_accuracy2(model, criterion, dataloader):
    '''
    input : model, criterion, dataloader
    output: loss, accuracy

    The dataloader must use drop_last=True, because init_hidden()
    sizes the hidden state for model.batch_size.
    '''
    running_loss = 0
    correct_num = 0
    total_num = 0
    device = model.device
    # Evaluation only: no gradients needed
    with torch.no_grad():
        for X, Y in dataloader:
            # Move the batch to the GPU
            X = X.to(device)
            Y = Y.to(device)
            model.init_hidden()
            predict_y = model(X)

            # Loss
            loss = criterion(predict_y, Y)
            running_loss += loss.item()

            # Accuracy: compare the argmax of the logits with the labels
            predict_label = predict_y.argmax(dim=1)
            correct_num += (predict_label == Y).sum().item()
            total_num += Y.size(0)

    loss_avg = running_loss/len(dataloader)
    accuracy = correct_num/total_num

    return loss_avg, accuracy
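
The prediction for each example is the index of its largest logit, and accuracy is the fraction of those indices that match the labels. A minimal sketch on a hand-made batch (the logits and labels are invented for illustration):

```python
import torch

# 4 examples, 3 classes; each row holds the logits for one example
toy_logits = torch.tensor([
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.9, 0.8, 0.1],
    [0.1, 0.2, 3.0],
])
toy_labels = torch.tensor([0, 1, 1, 2])

predicted = toy_logits.argmax(dim=1)                  # class index per row
num_correct = (predicted == toy_labels).sum().item()  # count matching rows
accuracy = num_correct / toy_labels.size(0)
print(predicted.tolist(), accuracy)
```

Comparing whole tensors at once avoids a Python loop over the batch and works for any batch size.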

## 85. Bidirectional RNN and multiple layers
1. Define the biRNN
2. Train the biRNN
3. Train the multi-layer biRNN

In [13]:
# Definition of the biRNN
class biRNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, output_size, batch_size, num_layers=1):
        # Layer definitions
        super().__init__()
        self.hidden_size = hidden_size
        self.batch_size = batch_size
        self.num_layers = num_layers

        self.emb = nn.Embedding(vocab_size, embedding_dim, padding_idx=vocab_size-1)
        # Bidirectional: bidirectional=True
        self.birnn = nn.RNN(embedding_dim, hidden_size, num_layers=self.num_layers, batch_first=True, bidirectional=True)
        # Bidirectional: the output has twice the hidden size (forward + backward)
        self.fc = nn.Linear(hidden_size*2, output_size)

        # Move to the GPU if one is available (nn.Module.to works in place)
        self.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
        self.to(self.device)

    def init_hidden(self, batch_size=None):
        if batch_size is None:
            batch_size = self.batch_size
        # Bidirectional: one initial state per layer and direction (num_layers * 2)
        self.hidden_state = torch.zeros(self.num_layers*2, batch_size, self.hidden_size, device=self.device)

    def forward(self, x):
        x = self.emb(x)
        # Run the bidirectional RNN
        x_birnn, self.hidden_state = self.birnn(x, self.hidden_state)
        # Classify from the output at the last time step
        x = self.fc(x_birnn[:,-1,:])
        return x
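
With `bidirectional=True`, the last time step of the output holds the forward direction's final state but the backward direction's state after reading only one token, because the backward pass starts at the end of the sequence. The backward direction's final state sits at time step 0. The sketch below checks this on random input; the sizes are arbitrary, and concatenating the two final states is shown as one possible alternative to `x_birnn[:,-1,:]`, not as what the notebook does.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size = 5
rnn = nn.RNN(input_size=3, hidden_size=hidden_size, batch_first=True, bidirectional=True)

x = torch.randn(2, 7, 3)  # (batch, sequence length, features)
output, h_n = rnn(x)      # initial hidden state defaults to zeros

print(output.shape)  # (batch, seq_len, 2 * hidden_size)
print(h_n.shape)     # (num_layers * 2, batch, hidden_size)

# Forward direction: its final state is the first half of the last time step
forward_matches = torch.allclose(output[:, -1, :hidden_size], h_n[0])
# Backward direction: its final state is the second half of the first time step
backward_matches = torch.allclose(output[:, 0, hidden_size:], h_n[1])
print(forward_matches, backward_matches)

# One possible sentence representation: both final states side by side
final_states = torch.cat([h_n[-2], h_n[-1]], dim=1)
print(final_states.shape)
```

In this notebook the sequences are also right-padded, so for all but the longest example the last time step is a padding token.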

In [14]:
import torch.optim as optim

# Hyperparameters
embedding_dim = 300
hidden_size = 50
max_epoch = 25
batch_size = 8
lr = 0.002

# Dataloaders (drop_last=True keeps every batch at exactly batch_size examples)
train_dataset2 = torch.utils.data.TensorDataset(X_train_long, Y_train_long)
train_dataloader2 = torch.utils.data.DataLoader(train_dataset2, batch_size=batch_size, drop_last=True)

valid_dataset2 = torch.utils.data.TensorDataset(X_valid_long, Y_valid_long)
valid_dataloader2 = torch.utils.data.DataLoader(valid_dataset2, batch_size=batch_size, drop_last=True)

# Model definition
fix_seed(42)
birnn = biRNN(vocab_size=vocab_size, embedding_dim=embedding_dim, hidden_size=hidden_size, output_size=4, batch_size=batch_size)
criterion4 = nn.CrossEntropyLoss()
optimizer4 = optim.SGD(birnn.parameters(), lr=lr)
device = birnn.device

# Training with SGD
for epoch in range(max_epoch):
    running_loss = 0
    for X, Y in train_dataloader2:
        # Move the batch to the GPU
        X = X.to(device)
        Y = Y.to(device)
        # forward
        birnn.init_hidden()
        predict_y = birnn(X)
        loss = criterion4(predict_y, Y)

        # backward
        optimizer4.zero_grad()
        loss.backward()

        # update
        optimizer4.step()
        running_loss += loss.item()

    # Report loss and accuracy
    # (train loss is averaged over the epoch's batches; train accuracy is measured after the epoch)
    valid_loss, valid_acc = measure_loss_accuracy2(birnn, criterion4, valid_dataloader2)
    _, train_acc = measure_loss_accuracy2(birnn, criterion4, train_dataloader2)
    train_loss = running_loss/len(train_dataloader2)

    print(f'epoch: {epoch}')
    print(f'   train loss: {train_loss}\ttrain acc: {train_acc}')
    print(f'   valid loss: {valid_loss}\tvalid acc: {valid_acc}')

epoch: 0
   train loss: 1.2211097791846772	train acc: 0.42265917602996256
   valid loss: 1.1651802605497623	valid acc: 0.4221556886227545
epoch: 1
   train loss: 1.1614401209666934	train acc: 0.42359550561797754
   valid loss: 1.1614140843202967	valid acc: 0.4251497005988024
epoch: 2
   train loss: 1.1602560297826703	train acc: 0.42696629213483145
   valid loss: 1.1605084495630094	valid acc: 0.4311377245508982
epoch: 3
   train loss: 1.1594274669550777	train acc: 0.4339887640449438
   valid loss: 1.1594108942739978	valid acc: 0.4416167664670659
epoch: 4
   train loss: 1.1581736343630245	train acc: 0.4485955056179775
   valid loss: 1.157831371901278	valid acc: 0.4513473053892216
epoch: 5
   train loss: 1.1561409006850996	train acc: 0.4650749063670412
   valid loss: 1.155529128934095	valid acc: 0.46032934131736525
epoch: 6
   train loss: 1.1527960650036844	train acc: 0.47921348314606743
   valid loss: 1.1525606869937417	valid acc: 0.4648203592814371
epoch: 7
   train loss: 1.147797621501

In [15]:
import torch.optim as optim

# Hyperparameters
embedding_dim = 300
hidden_size = 50
num_layers = 3
max_epoch = 25
batch_size = 8
lr = 0.002

# Dataloaders: reuse train_dataloader2 and valid_dataloader2 defined in 85-2

# Model definition
fix_seed(42)
deepbirnn = biRNN(vocab_size=vocab_size, embedding_dim=embedding_dim, hidden_size=hidden_size, output_size=4, batch_size=batch_size, num_layers=num_layers)
criterion5 = nn.CrossEntropyLoss()
optimizer5 = optim.SGD(deepbirnn.parameters(), lr=lr)
device = deepbirnn.device

# Training with SGD
for epoch in range(max_epoch):
    running_loss = 0
    for X, Y in train_dataloader2:
        # Move the batch to the GPU
        X = X.to(device)
        Y = Y.to(device)
        # forward
        deepbirnn.init_hidden()
        predict_y = deepbirnn(X)
        loss = criterion5(predict_y, Y)

        # backward
        optimizer5.zero_grad()
        loss.backward()

        # update
        optimizer5.step()
        running_loss += loss.item()

    # Report loss and accuracy
    # (train loss is averaged over the epoch's batches; train accuracy is measured after the epoch)
    valid_loss, valid_acc = measure_loss_accuracy2(deepbirnn, criterion5, valid_dataloader2)
    _, train_acc = measure_loss_accuracy2(deepbirnn, criterion5, train_dataloader2)
    train_loss = running_loss/len(train_dataloader2)

    print(f'epoch: {epoch}')
    print(f'   train loss: {train_loss}\ttrain acc: {train_acc}')
    print(f'   valid loss: {valid_loss}\tvalid acc: {valid_acc}')

epoch: 0
   train loss: 1.1780160137776579	train acc: 0.42827715355805246
   valid loss: 1.159561382082408	valid acc: 0.437125748502994
epoch: 1
   train loss: 1.158968941817123	train acc: 0.4394194756554307
   valid loss: 1.157778420134219	valid acc: 0.45434131736526945
epoch: 2
   train loss: 1.1571951397349325	train acc: 0.4557116104868914
   valid loss: 1.155216756338131	valid acc: 0.47380239520958084
epoch: 3
   train loss: 1.154140216759528	train acc: 0.47771535580524344
   valid loss: 1.1510753756511711	valid acc: 0.47604790419161674
epoch: 4
   train loss: 1.1479934675416696	train acc: 0.4954119850187266
   valid loss: 1.14546078336453	valid acc: 0.468562874251497
epoch: 5
   train loss: 1.1363347492860945	train acc: 0.5129213483146068
   valid loss: 1.1446377003264285	valid acc: 0.4812874251497006
epoch: 6
   train loss: 1.092938307184405	train acc: 0.6211610486891386
   valid loss: 1.0322581186979831	valid acc: 0.6107784431137725
epoch: 7
   train loss: 0.9889863099498248	tra