# 6 - Transformers for Sentiment Analysis

In this notebook we will be using the transformer model, first introduced in [this](https://arxiv.org/abs/1706.03762) paper. Specifically, we will be using the BERT (Bidirectional Encoder Representations from Transformers) model from [this](https://arxiv.org/abs/1810.04805) paper. 

Transformer models are considerably larger than anything else covered in these tutorials. As such we are going to use the [transformers library](https://github.com/huggingface/transformers) to get pre-trained transformers and use them as our embedding layers. We will freeze (not train) the transformer and only train the remainder of the model which learns from the representations produced by the transformer. In this case we will be using a multi-layer bi-directional GRU, however any model can learn from these representations.

## Preparing Data

First, as always, let's set the random seeds for deterministic results.

In [4]:
import torch

import random
import numpy as np

SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

The transformer has already been trained with a specific vocabulary, which means we need to train with the exact same vocabulary and also tokenize our data in the same way that the transformer did when it was initially trained.

Luckily, the transformers library has tokenizers for each of the transformer models provided. In this case we are using the BERT model which ignores casing (i.e. will lower case every word). We get this by loading the pre-trained `bert-base-uncased` tokenizer.

In [5]:
# Import BertTokenizer from the transformers library
from transformers import BertTokenizer
# Load the tokenizer that belongs to the pre-trained bert-base-uncased model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

The `tokenizer` has a `vocab` attribute which contains the actual vocabulary we will be using. We can check how many tokens are in it by checking its length.

In [6]:
# Print the size of the tokenizer's vocabulary (vocab),
# i.e. the number of unique tokens BERT was pre-trained with
len(tokenizer.vocab)

30522

Using the tokenizer is as simple as calling `tokenizer.tokenize` on a string. This will tokenize and lower case the data in a way that is consistent with the pre-trained transformer model.

In [7]:
# Tokenize a string with the tokenizer and print the result
# Note that the text is lower-cased and the question mark becomes its own token
tokens = tokenizer.tokenize('Hello WORLD how ARE yoU?')

print(tokens)

['hello', 'world', 'how', 'are', 'you', '?']


We can numericalize tokens using our vocabulary using `tokenizer.convert_tokens_to_ids`.

In [8]:
# Use the tokenizer's vocabulary to convert each token into its unique ID
indexes = tokenizer.convert_tokens_to_ids(tokens)

print(indexes)

[7592, 2088, 2129, 2024, 2017, 1029]


The transformer was also trained with special tokens to mark the beginning and end of the sentence, detailed [here](https://huggingface.co/transformers/model_doc/bert.html#transformers.BertModel), as well as standard padding and unknown tokens. We can also get these from the tokenizer.

**Note**: the tokenizer does have a beginning of sequence and end of sequence attributes (`bos_token` and `eos_token`) but these are not set and should not be used for this transformer.

In [9]:
# Get the special tokens from the tokenizer
# [CLS]: placed at the start of every input sequence for BERT
# [SEP]: separates two sentences or marks the end of a sequence
# [PAD]: padding token used to bring the sequences in a batch to the same length
# [UNK]: stands in for any token that is not in the vocabulary
init_token = tokenizer.cls_token
eos_token = tokenizer.sep_token
pad_token = tokenizer.pad_token
unk_token = tokenizer.unk_token

print(init_token, eos_token, pad_token, unk_token)

[CLS] [SEP] [PAD] [UNK]


We can get the indexes of the special tokens by converting them using the vocabulary...

In [10]:
init_token_idx = tokenizer.convert_tokens_to_ids(init_token)
eos_token_idx = tokenizer.convert_tokens_to_ids(eos_token)
pad_token_idx = tokenizer.convert_tokens_to_ids(pad_token)
unk_token_idx = tokenizer.convert_tokens_to_ids(unk_token)

print(init_token_idx, eos_token_idx, pad_token_idx, unk_token_idx)

101 102 0 100


...or by explicitly getting them from the tokenizer.

In [11]:
init_token_idx = tokenizer.cls_token_id
eos_token_idx = tokenizer.sep_token_id
pad_token_idx = tokenizer.pad_token_id
unk_token_idx = tokenizer.unk_token_id

print(init_token_idx, eos_token_idx, pad_token_idx, unk_token_idx)

101 102 0 100


Another thing we need to handle is that the model was trained on sequences with a defined maximum length - it does not know how to handle sequences longer than it has been trained on. We can get the maximum length of these input sizes by checking the `max_model_input_sizes` for the version of the transformer we want to use. In this case, it is 512 tokens.

In [12]:
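# Maximum input length accepted by the BERT model
# (newer transformers releases may no longer provide this attribute;
# tokenizer.model_max_length is an alternative)
max_input_length = tokenizer.max_model_input_sizes['bert-base-uncased']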

print(max_input_length)

512


Previously we have used the `spaCy` tokenizer to tokenize our examples. However, we now need to define a function that we will pass to our `TEXT` field that will handle all the tokenization for us. It will also cut down the number of tokens to a maximum length. Note that our maximum length is 2 less than the actual maximum length. This is because we need to add two tokens to each sequence, one at the start and one at the end.

In [13]:
# Tokenize a sentence and trim the tokens to fit BERT's maximum input length
def tokenize_and_cut(sentence):
    tokens = tokenizer.tokenize(sentence)
    # Trim to the maximum input length minus 2,
    # leaving room for the [CLS] and [SEP] tokens that are added later.
    tokens = tokens[:max_input_length-2]
    return tokens
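
To make the length arithmetic concrete, here is a minimal, library-free sketch. The helper `add_special_tokens` and the dummy token ids are illustrative assumptions, not part of torchtext or transformers; `101` and `102` are the `[CLS]` and `[SEP]` indexes printed earlier, and `512` is the maximum length printed above.

```python
MAX_INPUT_LENGTH = 512     # value printed for bert-base-uncased above
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] indexes printed above

def add_special_tokens(token_ids, max_len=MAX_INPUT_LENGTH):
    """Hypothetical helper: trim to max_len - 2, then wrap with [CLS] and [SEP]."""
    body = token_ids[:max_len - 2]
    return [CLS_ID] + body + [SEP_ID]

long_review = [2000] * 600   # 600 dummy token ids, longer than the model allows
short_review = [2000] * 10   # a short sequence is left untouched apart from the wrapping

print(len(add_special_tokens(long_review)))   # 512
print(len(add_special_tokens(short_review)))  # 12
```

Trimming to 510 tokens first is what guarantees that the wrapped sequence never exceeds 512.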

Now we define our fields. The transformer expects the batch dimension to be first, so we set `batch_first = True`. As we already have the vocabulary for our text, provided by the transformer, we set `use_vocab = False` to tell torchtext that we'll be handling the vocabulary side of things. We pass our `tokenize_and_cut` function as the tokenizer. The `preprocessing` argument is a function that takes in the example after it has been tokenized; this is where we will convert the tokens to their indexes. Finally, we define the special tokens, making note that we are defining them to be their index value and not their string value, i.e. `100` instead of `[UNK]`. This is because the sequences will already be converted into indexes.

We define the label field as before.

In [14]:
from torchtext.legacy import data

# TEXT is the Field object that handles the input text: loading, tokenization, preprocessing and batching
TEXT = data.Field(batch_first = True, # put the batch dimension first
                  use_vocab = False, # no vocabulary needs to be built because we use BERT's
                  tokenize = tokenize_and_cut, # tokenize with the function defined above
                  preprocessing = tokenizer.convert_tokens_to_ids, # convert tokens to IDs
                  init_token = init_token_idx, # ID of the start token
                  eos_token = eos_token_idx, # ID of the end token
                  pad_token = pad_token_idx, # ID of the padding token
                  unk_token = unk_token_idx) # ID of the unknown token

# LABEL is the Field object that handles the labels (typically the class in a classification task, or a regression value)
LABEL = data.LabelField(dtype = torch.float)

We load the data and create the validation splits as before.

In [15]:
from torchtext.legacy import datasets
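# Load the IMDB dataset
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
# Split train_data into training and validation sets
# Note: random.seed(SEED) returns None; the call reseeds Python's global RNG as a side effect
train_data, valid_data = train_data.split(random_state = random.seed(SEED))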

In [16]:
print(f"Number of training examples: {len(train_data)}")
print(f"Number of validation examples: {len(valid_data)}")
print(f"Number of testing examples: {len(test_data)}")

Number of training examples: 17500
Number of validation examples: 7500
Number of testing examples: 25000


We can check an example and ensure that the text has already been numericalized.

In [17]:
# Print the 7th example (index 6) of train_data;
# its text is already stored as token IDs
print(vars(train_data.examples[6]))

{'text': [1000, 6887, 26802, 6491, 1000, 1997, 3245, 2001, 1037, 3811, 12483, 1010, 17109, 1010, 12459, 1998, 2200, 2434, 5469, 17312, 1010, 1998, 1010, 1999, 2028, 2773, 1010, 8754, 1012, 1996, 2034, 8297, 1997, 2997, 2001, 2175, 2854, 1010, 25591, 1010, 2895, 1011, 8966, 1998, 3811, 14036, 1012, 2044, 1996, 2034, 8297, 2174, 1010, 1000, 6887, 26802, 6491, 1000, 8543, 2123, 2522, 15782, 16230, 2100, 4593, 10858, 2047, 4784, 1012, 1000, 6887, 26802, 6491, 3523, 1011, 2935, 1997, 1996, 2757, 1000, 1997, 2807, 2003, 5121, 2025, 1037, 3143, 4945, 1010, 2009, 2130, 2003, 3243, 14036, 1010, 2021, 2045, 2003, 2053, 2062, 2434, 3012, 1010, 1998, 1996, 7143, 4740, 2000, 3288, 1999, 2242, 2047, 1010, 2024, 2012, 2335, 13310, 8462, 1010, 2029, 3084, 2009, 3243, 15640, 1999, 7831, 2000, 2049, 16372, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1011, 27594, 2545, 1011, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 3243, 1999, 1996, 2927, 1010, 2057, 2024, 3107, 1996, 3595, 2369, 1996, 8

We can use the `convert_ids_to_tokens` to transform these indexes back into readable tokens.

In [18]:
# Convert the IDs stored under the 'text' key of the example above back into tokens with the BERT tokenizer
tokens = tokenizer.convert_ids_to_tokens(vars(train_data.examples[6])['text'])

print(tokens)

['"', 'ph', '##anta', '##sm', '"', 'of', '1979', 'was', 'a', 'highly', 'atmospheric', ',', 'creepy', ',', 'scary', 'and', 'very', 'original', 'horror', 'flick', ',', 'and', ',', 'in', 'one', 'word', ',', 'cult', '.', 'the', 'first', 'sequel', 'of', '1988', 'was', 'go', '##ry', ',', 'witty', ',', 'action', '-', 'packed', 'and', 'highly', 'entertaining', '.', 'after', 'the', 'first', 'sequel', 'however', ',', '"', 'ph', '##anta', '##sm', '"', 'creator', 'don', 'co', '##sca', '##rell', '##y', 'apparently', 'lacked', 'new', 'ideas', '.', '"', 'ph', '##anta', '##sm', 'iii', '-', 'lord', 'of', 'the', 'dead', '"', 'of', '1994', 'is', 'certainly', 'not', 'a', 'complete', 'failure', ',', 'it', 'even', 'is', 'quite', 'entertaining', ',', 'but', 'there', 'is', 'no', 'more', 'original', '##ity', ',', 'and', 'the', 'desperate', 'attempts', 'to', 'bring', 'in', 'something', 'new', ',', 'are', 'at', 'times', 'tires', '##ome', ',', 'which', 'makes', 'it', 'quite', 'disappointing', 'in', 'comparison', 
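
The `##` prefix in the output marks a piece that continues the previous piece within the same word. BERT's WordPiece tokenizer splits a word greedily, always taking the longest piece it can find in the vocabulary. The sketch below illustrates that idea on a tiny hand-made vocabulary; `wordpiece` and `toy_vocab` are illustrative assumptions, not the library's implementation or its real 30,522-entry vocabulary.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lower-cased word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, made up for this example.
toy_vocab = {"ph", "##anta", "##sm", "original", "##ity", "horror"}

print(wordpiece("phantasm", toy_vocab))     # ['ph', '##anta', '##sm']
print(wordpiece("originality", toy_vocab))  # ['original', '##ity']
print(wordpiece("zebra", toy_vocab))        # ['[UNK]']
```

This is why a rare word such as "phantasm" still maps to known indexes instead of `[UNK]`.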

Although we've handled the vocabulary for the text, we still need to build the vocabulary for the labels.

In [19]:
# Map 'pos' and 'neg' to unique integer IDs
LABEL.build_vocab(train_data)

In [20]:
print(LABEL.vocab.stoi)

defaultdict(None, {'neg': 0, 'pos': 1})


As before, we create the iterators. Ideally we want to use the largest batch size that we can as I've found this gives the best results for transformers.

In [21]:

BATCH_SIZE = 128

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Create iterators that load the data in batches with data.BucketIterator
# BucketIterator tries to batch together examples of similar length,
# which minimizes the amount of padding and improves computational efficiency
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE, 
    device = device)
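
As a rough illustration of why grouping examples of similar length helps, the sketch below counts the pad tokens needed when each batch is padded to its longest example. It is not how `BucketIterator` is implemented; the helper names and the review lengths are made up for this example.

```python
def padding_needed(batches):
    """Total number of pad tokens when each batch is padded to its longest example."""
    return sum(max(batch) * len(batch) - sum(batch) for batch in batches)

def make_batches(lengths, batch_size):
    """Chop a list of sequence lengths into consecutive batches."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]

lengths = [12, 500, 15, 480, 20, 510, 18, 470]  # made-up review lengths in tokens

arbitrary = make_batches(lengths, batch_size=4)         # batches in dataset order
bucketed = make_batches(sorted(lengths), batch_size=4)  # similar lengths together

print(padding_needed(arbitrary))  # 2015
print(padding_needed(bucketed))   # 95
```

Every pad token still passes through the transformer, so less padding means less wasted computation per batch.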

## Build the Model

Next, we'll load the pre-trained model, making sure to load the same model as we did for the tokenizer.

In [22]:
# 'bert-base-uncased' is the base version of BERT.
# It was trained on lower-cased text and is the smaller of the two original BERT sizes (base and large).
# The from_pretrained method loads the model,
# giving us a BERT model that includes the pre-trained weights.
from transformers import BertTokenizer, BertModel

bert = BertModel.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Next, we'll define our actual model. 

Instead of using an embedding layer to get embeddings for our text, we'll be using the pre-trained transformer model. These embeddings will then be fed into a GRU to produce a prediction for the sentiment of the input sentence. We get the embedding dimension size (called the `hidden_size`) from the transformer via its config attribute. The rest of the initialization is standard.

Within the forward pass, we wrap the transformer in a `no_grad` to ensure no gradients are calculated over this part of the model. The transformer actually returns the embeddings for the whole sequence as well as a *pooled* output. The [documentation](https://huggingface.co/transformers/model_doc/bert.html#transformers.BertModel) states that the pooled output is "usually not a good summary of the semantic content of the input, you’re often better with averaging or pooling the sequence of hidden-states for the whole input sequence", hence we will not be using it. The rest of the forward pass is the standard implementation of a recurrent model, where we take the hidden state over the final time-step, and pass it through a linear layer to get our predictions.

In [23]:
import torch.nn as nn

# Define BERTGRUSentiment, a sentiment analysis model that combines BERT with a GRU (Gated Recurrent Unit).
# The model first embeds the sentence with BERT and passes the result to the GRU. The GRU's final hidden state is then passed to a linear layer for classification.
class BERTGRUSentiment(nn.Module):
    def __init__(self,
                 bert,
                 hidden_dim,
                 output_dim,
                 n_layers,
                 bidirectional,
                 dropout):
        
        super().__init__()
        
        self.bert = bert
        # Get the embedding dimension of the BERT model.
        embedding_dim = bert.config.to_dict()['hidden_size']
        # Define the GRU layer.
        self.rnn = nn.GRU(embedding_dim,
                          hidden_dim,
                          num_layers = n_layers,
                          bidirectional = bidirectional,
                          batch_first = True,
                          dropout = 0 if n_layers < 2 else dropout)
        # Define the linear layer used for the final classification.
        self.out = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):
        
        #text = [batch size, sent len]
                
        with torch.no_grad():
            embedded = self.bert(text)[0]
                
        #embedded = [batch size, sent len, emb dim]
        
        _, hidden = self.rnn(embedded)
        
        #hidden = [n layers * n directions, batch size, hid dim]
        
        if self.rnn.bidirectional:
            hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
        else:
            hidden = self.dropout(hidden[-1,:,:])
                
        #hidden = [batch size, hid dim]
        
        output = self.out(hidden)
        
        #output = [batch size, out dim]
        
        return output
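
The indexing `hidden[-2,:,:]` and `hidden[-1,:,:]` relies on how PyTorch orders the first axis of the final hidden state: layers vary slowest and directions vary fastest. The plain-Python sketch below only enumerates that layout; `hidden_layout` is a hypothetical helper written for this illustration.

```python
def hidden_layout(n_layers, bidirectional):
    """List (layer, direction) for each index along the first axis of the final hidden state."""
    directions = ["forward", "backward"] if bidirectional else ["forward"]
    return [(layer, direction) for layer in range(n_layers) for direction in directions]

layout = hidden_layout(n_layers=2, bidirectional=True)
for index, (layer, direction) in enumerate(layout):
    print(index, layer, direction)

# The two slices concatenated in forward() are the top layer's two directions.
print(layout[-2], layout[-1])  # (1, 'forward') (1, 'backward')
```

With a unidirectional GRU the last entry alone is the top layer, which matches the `else` branch of the forward pass.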

Next, we create an instance of our model using standard hyperparameters.

In [24]:
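# Define the model's hyperparameters.
HIDDEN_DIM = 256 # dimension of the GRU hidden state
OUTPUT_DIM = 1  # output dimension (a single value for positive/negative classification)
N_LAYERS = 2 # number of GRU layers
BIDIRECTIONAL = True  # use a bidirectional GRU
DROPOUT = 0.25 # dropout rate

# Initialize the `BERTGRUSentiment` model,
# passing the pre-trained BERT model along with the hyperparameters defined above.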
model = BERTGRUSentiment(bert,
                         HIDDEN_DIM,
                         OUTPUT_DIM,
                         N_LAYERS,
                         BIDIRECTIONAL,
                         DROPOUT)

We can check how many parameters the model has. Our standard models have under 5M, but this one has 112M! Luckily, 110M of these parameters are from the transformer and we will not be training those.

In [25]:
# Define a function that returns the total number of trainable parameters of a model.
def count_parameters(model):
    # model.parameters() returns all of the model's parameters.
    # p.requires_grad says whether the parameter is updated during training.
    # p.numel() returns the number of elements in the parameter.
    # The function therefore sums the element counts of all trainable parameters.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 112,241,409 trainable parameters


In order to freeze parameters (not train them) we need to set their `requires_grad` attribute to `False`. To do this, we simply loop through all of the `named_parameters` in our model and if they're a part of the `bert` transformer model, we set `requires_grad = False`. 

In [26]:
# Parameters that belong to the BERT model are not updated during training
for name, param in model.named_parameters():                
    if name.startswith('bert'):
        param.requires_grad = False

We can now see that our model has under 3M trainable parameters, making it almost comparable to the `FastText` model. However, the text still has to propagate through the transformer which causes training to take considerably longer.

In [27]:
# count_parameters is already defined above; call it again now that BERT is frozen
print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 2,759,169 trainable parameters


We can double check the names of the trainable parameters, ensuring they make sense. As we can see, they are all the parameters of the GRU (`rnn`) and the linear layer (`out`).

In [28]:
for name, param in model.named_parameters():                
    if param.requires_grad:
        print(name)

rnn.weight_ih_l0
rnn.weight_hh_l0
rnn.bias_ih_l0
rnn.bias_hh_l0
rnn.weight_ih_l0_reverse
rnn.weight_hh_l0_reverse
rnn.bias_ih_l0_reverse
rnn.bias_hh_l0_reverse
rnn.weight_ih_l1
rnn.weight_hh_l1
rnn.bias_ih_l1
rnn.bias_hh_l1
rnn.weight_ih_l1_reverse
rnn.weight_hh_l1_reverse
rnn.bias_ih_l1_reverse
rnn.bias_hh_l1_reverse
out.weight
out.bias
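
The figure of 2,759,169 can be reproduced by hand from the parameter names listed above. The sketch below is plain Python and assumes an embedding dimension of 768, the `hidden_size` of `bert-base-uncased`, which the notebook reads from `bert.config` without printing it. The two helper functions are written for this illustration only.

```python
def gru_parameter_count(input_dim, hidden_dim, n_layers, bidirectional):
    """Count GRU parameters: 3 gates, each with input weights, hidden weights and two biases."""
    n_directions = 2 if bidirectional else 1
    total = 0
    for layer in range(n_layers):
        # Layers above the first receive the concatenated outputs of both directions.
        layer_input = input_dim if layer == 0 else hidden_dim * n_directions
        per_direction = (
            3 * hidden_dim * layer_input   # weight_ih
            + 3 * hidden_dim * hidden_dim  # weight_hh
            + 3 * hidden_dim               # bias_ih
            + 3 * hidden_dim               # bias_hh
        )
        total += per_direction * n_directions
    return total

def linear_parameter_count(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

EMBEDDING_DIM = 768  # assumption: hidden_size of bert-base-uncased
gru = gru_parameter_count(EMBEDDING_DIM, hidden_dim=256, n_layers=2, bidirectional=True)
out = linear_parameter_count(256 * 2, 1)

print(f"{gru + out:,}")  # 2,759,169
```

The GRU accounts for 2,758,656 of these parameters and the linear layer for the remaining 513.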


## Train the Model

As is standard, we define our optimizer and criterion (loss function).

In [29]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())

In [30]:
# BCEWithLogitsLoss suits binary classification; it applies the sigmoid internally, so the model outputs raw logits
criterion = nn.BCEWithLogitsLoss()
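
As a rough, per-example illustration of what this criterion computes, the sketch below compares binary cross-entropy taken from the sigmoid probability with an equivalent form computed directly from the logit. Both functions are written in plain Python for this example and are not PyTorch's implementation; the rearranged form is a common way to avoid taking the logarithm of zero for large logits.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_naive(logit, target):
    """Binary cross-entropy computed from the probability sigmoid(logit)."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def bce_with_logits(logit, target):
    """The same loss computed directly from the logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# Made-up (logit, target) pairs; the two columns printed for each pair agree.
for logit, target in [(2.0, 1.0), (-1.5, 1.0), (0.3, 0.0)]:
    print(round(bce_naive(logit, target), 6), round(bce_with_logits(logit, target), 6))
```

Because the sigmoid is folded into the loss, the model's final linear layer has no activation of its own.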

Place the model and criterion onto the GPU (if available).

In [31]:
model = model.to(device)
criterion = criterion.to(device)

Next, we'll define functions for: calculating accuracy, performing a training epoch, performing an evaluation epoch and calculating how long a training/evaluation epoch takes.

In [32]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    # preds are logits, so pass them through a sigmoid to obtain probabilities,
    # then round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc

In [33]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        # Reset the gradients of all parameters to zero.
        # This is needed because PyTorch accumulates gradients across backward() calls.
        optimizer.zero_grad()
        # Compute the model's predictions for the current batch.
        predictions = model(batch.text).squeeze(1)
        # Compute the loss from the predictions and the true labels.
        loss = criterion(predictions, batch.label)
        # Compute the accuracy from the predictions and the true labels.
        acc = binary_accuracy(predictions, batch.label)
        # Compute the gradients of the loss.
        loss.backward()
        # Update the model's parameters with the optimizer.
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [34]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    # Put the model in evaluation mode.
    # Layers such as dropout and batch normalization then behave in evaluation mode.
    model.eval()
    # No gradients are needed here, so run the computation inside torch.no_grad().
    # This reduces memory usage and speeds things up.
    with torch.no_grad():
    
        for batch in iterator:
            # Compute the model's predictions for the current batch.
            predictions = model(batch.text).squeeze(1)
            # Compute the loss from the predictions and the true labels.
            loss = criterion(predictions, batch.label)
            # Compute the accuracy from the predictions and the true labels.
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
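
Note that `epoch_acc / len(iterator)` is a mean of per-batch accuracies, so a smaller final batch counts as much as a full one. With 7,500 validation examples and a batch size of 128, for instance, a final partial batch would hold 76 examples. The sketch below compares the two ways of averaging; the helper names and the per-batch counts are made up for illustration, and in practice the difference is usually small.

```python
def mean_of_batch_accuracies(batches):
    """What train() and evaluate() report: every batch counts equally."""
    return sum(correct / size for correct, size in batches) / len(batches)

def per_example_accuracy(batches):
    """Accuracy over all examples: every example counts equally."""
    total_correct = sum(correct for correct, _ in batches)
    total_size = sum(size for _, size in batches)
    return total_correct / total_size

# (number correct, batch size) pairs; the values are made up for illustration.
batches = [(120, 128), (116, 128), (38, 76)]

print(mean_of_batch_accuracies(batches))
print(per_example_accuracy(batches))
```

The two numbers coincide whenever all batches have the same size.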

In [35]:
import time
# Compute the elapsed time in minutes and seconds
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Finally, we'll train our model. This takes considerably longer than any of the previous models due to the size of the transformer. Even though we are not training any of the transformer's parameters we still need to pass the data through the model which takes a considerable amount of time on a standard GPU.

In [36]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    # Run one epoch of training followed by validation
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
        
    end_time = time.time()
        
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    # Save the model whenever the validation loss improves
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut6-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 6m 44s
	Train Loss: 0.535 | Train Acc: 72.30%
	 Val. Loss: 0.294 |  Val. Acc: 88.04%
Epoch: 02 | Epoch Time: 6m 48s
	Train Loss: 0.322 | Train Acc: 86.64%
	 Val. Loss: 0.240 |  Val. Acc: 90.73%
Epoch: 03 | Epoch Time: 6m 49s
	Train Loss: 0.272 | Train Acc: 88.83%
	 Val. Loss: 0.235 |  Val. Acc: 90.68%
Epoch: 04 | Epoch Time: 6m 49s
	Train Loss: 0.238 | Train Acc: 90.40%
	 Val. Loss: 0.227 |  Val. Acc: 91.06%
Epoch: 05 | Epoch Time: 6m 49s
	Train Loss: 0.209 | Train Acc: 91.80%
	 Val. Loss: 0.219 |  Val. Acc: 91.09%


We'll load up the parameters that gave us the best validation loss and try these on the test set - which gives us our best results so far!

In [37]:
# Load the saved model parameters
model.load_state_dict(torch.load('tut6-model.pt'))
# Evaluate performance on the test set
test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.211 | Test Acc: 91.51%


## Inference

We'll then use the model to test the sentiment of some sequences. We tokenize the input sequence, trim it down to the maximum length, add the special tokens to either side, convert it to a tensor, add a fake batch dimension and then pass it through our model.

In [38]:
# Inference
def predict_sentiment(model, tokenizer, sentence):
    model.eval()
    # Tokenize the sentence into a list of tokens.
    tokens = tokenizer.tokenize(sentence)
    # Keep at most the maximum input length minus 2 tokens,
    # leaving room for the start and end tokens.
    tokens = tokens[:max_input_length-2]
    # Convert the tokens to indexes and add the start and end tokens.
    indexed = [init_token_idx] + tokenizer.convert_tokens_to_ids(tokens) + [eos_token_idx]
    # Convert the list of indexes to a tensor and move it to the right device (CPU or GPU).
    tensor = torch.LongTensor(indexed).to(device)
    # Add a batch dimension, because the model expects batched input.
    tensor = tensor.unsqueeze(0)
    # Compute the model's prediction and apply the sigmoid to get a value between 0 and 1.
    prediction = torch.sigmoid(model(tensor))
    # Return the prediction as a Python float.
    return prediction.item()

In [39]:
predict_sentiment(model, tokenizer, "This film is terrible")

0.05781012400984764

In [40]:
predict_sentiment(model, tokenizer, "This film is great")

0.956889271736145
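
The returned value is the sigmoid output, so values near 0 correspond to `neg` and values near 1 to `pos`, following the label mapping printed earlier. A minimal sketch of turning that number into a label name is shown below. `label_from_probability` and the 0.5 threshold are assumptions made for this illustration; the threshold mirrors the rounding used in `binary_accuracy`.

```python
LABEL_STOI = {"neg": 0, "pos": 1}  # mapping printed by LABEL.vocab.stoi above
ITOS = {index: label for label, index in LABEL_STOI.items()}

def label_from_probability(probability, threshold=0.5):
    """Hypothetical helper: map a sigmoid output to the label name."""
    return ITOS[int(probability > threshold)]

print(label_from_probability(0.05781012400984764))  # neg
print(label_from_probability(0.956889271736145))    # pos
```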