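Starting from the LSTM part-of-speech tagger code practiced earlier, we change only the forward function to build a sentence classification model that uses the last hidden vector as the sentence representation. Training on the full dataset is slow, so we build a simple model using only 1,000 sentences (n_data = 1000).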

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

print('pytorch version = {}'.format(torch.__version__))

pytorch version = 1.0.1.post2


In [2]:
import config
from navermovie_comments import load_comments_image_without_padding
from navermovie_comments import load_trained_embedding
import numpy as np

n_data = 1000
sents, labels, idx_to_vocab = load_comments_image_without_padding(
    large=True, tokenize='soynlp_unsup', n_data=n_data)

# transform list of int to torch.tensor
X = [torch.LongTensor(sent) for sent in sents]

# transform label to torch.tensor
# binary label: 1 if the raw label is 6 or lower, otherwise 0
def encode_label(y):
    return int(y <= 6)

Y = torch.LongTensor([encode_label(y) for y in labels])

word2vec_model = load_trained_embedding(data_name='large',
    tokenize='soynlp_unsup', embedding='word2vec')

wv = word2vec_model.wv.vectors

In [3]:
from collections import Counter
Counter(Y.numpy())

Counter({1: 235, 0: 765})

Because we use pre-trained word embedding vectors, we create an nn.Embedding layer and then copy the given embedding vectors into it, as in the CNN sentence classification example.

```python
class Model:
    def __init__(self, ... ):
        # ...
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        if pretrained_wordvec is not None:
            self.word_embeddings.weight.data.copy_(torch.from_numpy(pretrained_wordvec))
```
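The copy step can be checked in isolation. The following is a minimal sketch, assuming a small random matrix (`pretrained_wordvec` here is a hypothetical stand-in, not the real word2vec vectors). It also shows `nn.Embedding.from_pretrained` as one possible alternative to the manual copy; the tutorial itself uses `copy_`.

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical stand-in for the pre-trained matrix, shape (vocab_size, embedding_dim)
pretrained_wordvec = np.random.rand(5, 3).astype(np.float32)
vocab_size, embedding_dim = pretrained_wordvec.shape

# Approach used in this tutorial: create the layer, then overwrite its weights
emb_copy = nn.Embedding(vocab_size, embedding_dim)
emb_copy.weight.data.copy_(torch.from_numpy(pretrained_wordvec))

# Alternative: build the layer directly from the tensor.
# freeze=False keeps the vectors trainable, as they are after the manual copy.
emb_from = nn.Embedding.from_pretrained(
    torch.from_numpy(pretrained_wordvec), freeze=False)

# Both layers return the same vectors for the same word indices
idx = torch.LongTensor([0, 2, 4])
same = torch.equal(emb_copy(idx), emb_from(idx))
print('identical lookups = {}'.format(same))
```

Passing `freeze=True` instead would keep the pre-trained vectors fixed during training, which is a different setting from the one used in this notebook.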

To use the last output vector as the sentence vector, we select lstm_out[-1]. Feeding it to the linear layer hidden2label gives the prediction vector y.

```python
class Model:
    def __init__(self, ...):
        # ...
        self.hidden2label = nn.Linear(hidden_dim, label_size)

    def forward(self, sentence):
        # ...
        y = self.hidden2label(lstm_out[-1])
        return y
```
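To see what `lstm_out[-1]` contains, the shapes can be inspected on a toy input. This is a minimal sketch with made-up dimensions and random input, assuming the same single-layer, unidirectional `nn.LSTM` and batch size 1 as in this tutorial. In that setting the last output step coincides with the final hidden state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding_dim, hidden_dim, seq_len = 4, 3, 5  # toy sizes, chosen for illustration

lstm = nn.LSTM(embedding_dim, hidden_dim)
x = torch.randn(seq_len, 1, embedding_dim)    # (seq_len, batch=1, embedding_dim)

# Omitting the initial state makes nn.LSTM start from zeros
lstm_out, (hidden, cell) = lstm(x)

print(lstm_out.shape)  # one output vector per time step: (seq_len, 1, hidden_dim)
print(hidden.shape)    # final hidden state: (num_layers=1, 1, hidden_dim)

last_out = lstm_out[-1]   # output at the last time step, shape (1, hidden_dim)
last_hidden = hidden[-1]  # final hidden state of the last layer, shape (1, hidden_dim)
matches = torch.allclose(last_out, last_hidden)
print('lstm_out[-1] equals hidden[-1]: {}'.format(matches))
```

The shape `(1, hidden_dim)` is what `hidden2label` receives, so the prediction `y` has shape `(1, label_size)`.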

In [4]:
class Model(nn.Module):

    def __init__(self, hidden_dim, label_size, pretrained_wordvec):
        super(Model, self).__init__()

        # prepare word embedding vector
        vocab_size, embedding_dim = pretrained_wordvec.shape
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.word_embeddings.weight.data.copy_(torch.from_numpy(pretrained_wordvec))

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        self.hidden_dim = hidden_dim
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to label space
        self.hidden2label = nn.Linear(hidden_dim, label_size)

    def init_hidden(self):
        # (num_layers, minibatch_size, hidden_dim)
        return (torch.zeros(1, 1, self.hidden_dim),
                torch.zeros(1, 1, self.hidden_dim))

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        x = embeds.view(len(sentence), 1, -1)
        hidden, cell = self.init_hidden()
        lstm_out, (hidden, cell) = self.lstm(x, (hidden, cell))
        # Use only last output
        y = self.hidden2label(lstm_out[-1])
        return y

Define the parameters for training the model. Because the vocabulary is large, we use a fairly large hidden dimension, 64.

In [5]:
hidden_dim = 64
label_size = np.unique(Y.numpy()).shape[0]
max_epochs = 10

Create the model, loss function, and optimizer.

In [6]:
model = Model(hidden_dim, label_size, wv)
loss_fun = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.02)

The training procedure is the same as for the previous models.

In [8]:
def train(model, X, Y, optimizer, loss_func, epoch, checkpoints=200):
    def status(loss_sum, i):
        loss_tmp = loss_sum / (i+1)
        args = (epoch, i+1, loss_tmp)
        print('\repoch = {}, batch = {}, training loss = {:.3}'.format(*args))

    loss_sum = 0
    for i, (x, y) in enumerate(zip(X, Y)):
        model.zero_grad()
        y_pred = model(x)
        loss = loss_func(y_pred, y.view(1))
        loss.backward()
        optimizer.step()
        loss_sum += loss.item()

        if i % checkpoints == checkpoints-1:
            status(loss_sum, i)
    status(loss_sum, i)

    return model
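The call `loss_func(y_pred, y.view(1))` in the loop above depends on the shapes that `nn.CrossEntropyLoss` expects: logits of shape `(N, C)` and class indices of shape `(N,)`. Iterating over `Y` yields 0-dimensional tensors, so each label is reshaped to `(1,)`. The sketch below illustrates this with made-up logits (the values are assumptions for illustration only) and compares the result with a manual negative log-softmax.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

loss_func = nn.CrossEntropyLoss()

# Made-up prediction for one sentence: raw scores (logits), shape (1, label_size=2)
y_pred = torch.tensor([[2.0, -1.0]])

# A single label as it comes out of iterating over Y: a 0-dimensional tensor
y = torch.tensor(0)
print(y.dim(), y.view(1).shape)

# CrossEntropyLoss applies log-softmax internally, so y_pred stays unnormalized
loss = loss_func(y_pred, y.view(1))

# Manual check: negative log-probability of the true class
manual = -F.log_softmax(y_pred, dim=1)[0, int(y)]
agree = torch.allclose(loss, manual)
print('loss = {:.4f}, matches manual computation = {}'.format(loss.item(), agree))
```

This is also why the model's forward function returns the output of `hidden2label` directly, without a softmax layer.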

In [9]:
model = Model(hidden_dim, label_size, wv)
loss_fun = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.02, momentum=0.5)

for epoch in range(1, max_epochs + 1):
    model = train(model, X, Y, optimizer, loss_fun, epoch)
    print()

epoch = 1, batch = 200, training loss = 0.629
epoch = 1, batch = 400, training loss = 0.538
epoch = 1, batch = 600, training loss = 0.478
epoch = 1, batch = 800, training loss = 0.423
epoch = 1, batch = 1000, training loss = 0.378
epoch = 1, batch = 1000, training loss = 0.378

epoch = 2, batch = 200, training loss = 0.558
epoch = 2, batch = 400, training loss = 0.443
epoch = 2, batch = 600, training loss = 0.385
epoch = 2, batch = 800, training loss = 0.34
epoch = 2, batch = 1000, training loss = 0.306
epoch = 2, batch = 1000, training loss = 0.306

epoch = 3, batch = 200, training loss = 0.438
epoch = 3, batch = 400, training loss = 0.348
epoch = 3, batch = 600, training loss = 0.295
epoch = 3, batch = 800, training loss = 0.263
epoch = 3, batch = 1000, training loss = 0.237
epoch = 3, batch = 1000, training loss = 0.237

epoch = 4, batch = 200, training loss = 0.337
epoch = 4, batch = 400, training loss = 0.264
epoch = 4, batch = 600, training loss = 0.216
epoch = 4, batch = 800, tr

In [10]:
with torch.no_grad():
    n_correct = 0
    for x, y in zip(X, Y):
        y_pred = model(x)
        score, predicted = torch.max(y_pred.data, dim=1)
        n_correct += (int(predicted) == int(y))
    accuracy = n_correct / len(labels)

print('accuracy = {}'.format(accuracy))

accuracy = 0.998
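
Note that this accuracy is computed on the same 1,000 sentences that were used for training, and the two classes are imbalanced. As a rough point of comparison, the sketch below computes the accuracy of always predicting the majority class, using the label counts printed earlier in this notebook.

```python
from collections import Counter

# Label counts taken from the Counter output earlier in this notebook
label_counts = Counter({1: 235, 0: 765})

# Always predicting the most frequent label gives this accuracy
majority_label, majority_count = label_counts.most_common(1)[0]
baseline_accuracy = majority_count / sum(label_counts.values())

print('majority label = {}, baseline accuracy = {:.3f}'.format(
    majority_label, baseline_accuracy))
# majority label = 0, baseline accuracy = 0.765
```

A score measured on held-out sentences would be needed to judge how well the model generalizes; the figure above only describes the fit to the training data.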
