# 3. Neural Machine Translation by Jointly Learning to Align and Translate
- At long last, this is where attention begins!

## Introduction

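Below is the model we trained in tutorial 1:
- 2 layers
- LSTM cell
- teacher forcing during training
- at inference, the previous prediction is fed back as the next input (input feeding)
- (in the paper, the source sequence is fed in reverse order)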

![image.png](https://github.com/bentrevett/pytorch-seq2seq/raw/49df8404d938a6edbf729876405558cc2c2b3013/assets/seq2seq1.png)

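Below is the model we trained in tutorial 2:
- 1 layer
- GRU cell
- teacher forcing during training
- at inference, the previous prediction is fed back as the next input (input feeding)
- source sequence fed in forward order
- (Decoder) instead of feeding only the token embedding into the GRU, the input is built as follows:
    - GRU(y_t, z)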

![image.png](https://github.com/bentrevett/pytorch-seq2seq/raw/49df8404d938a6edbf729876405558cc2c2b3013/assets/seq2seq7.png)

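- The information compression problem! As mentioned before, it matters a great deal in seq2seq.
- Did tutorial 2 solve it completely?
- That fix was made on the decoder side. The context vector itself is still not one that was handed over with a precise understanding of the source!
    - Think of Kumon worksheets.
    - If you are handed the answer key but never note which chapter each answer comes from,
    - it is basically just an open-book test... and that is no good.
- Attention grew out of exactly this point of view.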

## Experiment Setup
- From here on, the setup code is pasted without commentary.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

from torchtext import data, datasets
from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator

import spacy
import numpy as np

import random
import math
import time

import torchtext


print(torchtext.__version__)
print(torch.__version__)
print(spacy.__version__)
print(np.__version__)

# fix the random seeds for reproducibility
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

# load the spaCy models
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

# define the tokenizer functions
def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

# define the data fields
SRC = Field(tokenize = tokenize_de, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)

TRG = Field(tokenize = tokenize_en, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)

# load data
train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'), 
                                                    fields = (SRC, TRG))

# keep only tokens that appear at least 2 times
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)

# select the device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

# set the batch size and build the train/valid/test iterators
BATCH_SIZE = 64

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE, 
    device = device)

0.5.0
1.7.1+cu101
2.2.4
1.19.5
cuda


## Building the Seq2Seq Model

### Encoder
- What changed?
    - We set the `bidirectional` argument to `True`!


![img](https://github.com/bentrevett/pytorch-seq2seq/raw/49df8404d938a6edbf729876405558cc2c2b3013/assets/seq2seq8.png)

In [2]:
x = torch.randn(3, 12, 10) # (bsz, seq_len, input_dim)

In [3]:
gru1 = nn.GRU(10, 16, bidirectional=False, batch_first=True)
gru1

GRU(10, 16, batch_first=True)

In [4]:
gru2 = nn.GRU(10, 16, bidirectional=True, batch_first=True)
gru2

GRU(10, 16, batch_first=True, bidirectional=True)

In [5]:
sum(p.numel() for p in gru1.parameters()), sum(p.numel() for p in gru2.parameters())

(1344, 2688)
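
The two counts above can be checked by hand. The following is a minimal sketch, assuming the usual single-layer GRU layout of three gates, each with an input weight, a hidden weight, and two bias vectors. `gru_param_count` is a hypothetical helper written for this note, not a PyTorch function.

```python
def gru_param_count(input_size, hidden_size, bidirectional=False):
    """Parameter count of a single-layer GRU (hypothetical helper)."""
    gates = 3  # reset, update, and new gates
    # per direction: W_ih (gates*hidden x input), W_hh (gates*hidden x hidden),
    # plus two bias vectors b_ih and b_hh of size gates*hidden each
    per_direction = gates * hidden_size * (input_size + hidden_size + 2)
    return per_direction * (2 if bidirectional else 1)


print(gru_param_count(10, 16))                      # 1344
print(gru_param_count(10, 16, bidirectional=True))  # 2688
```

The bidirectional GRU simply holds a second, independent set of weights for the reverse direction, which is why the count doubles.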

In [6]:
output, hidden = gru2(x)

In [7]:
output.size() # forward and backward outputs concatenated!

torch.Size([3, 12, 32])

In [8]:
hidden.size()

torch.Size([2, 3, 16])

In [9]:
forward_hidden = hidden[0, :, :] # final forward output
forward_hidden

tensor([[ 0.0670,  0.3398,  0.2181, -0.3097,  0.1076, -0.2641, -0.0169, -0.0980,
         -0.1255,  0.6076,  0.4330,  0.2523, -0.0939, -0.1524,  0.2407,  0.2703],
        [-0.5180,  0.1661, -0.0847, -0.3948, -0.0249, -0.2101,  0.1275,  0.1916,
         -0.0967,  0.1239,  0.2186,  0.3209,  0.4888,  0.0072,  0.2770,  0.2544],
        [-0.3849,  0.1163, -0.0131, -0.3293, -0.0205,  0.6072,  0.4943,  0.0874,
          0.4139, -0.2560,  0.1637, -0.3013, -0.1651, -0.1445,  0.4079, -0.1269]],
       grad_fn=<SliceBackward>)

In [10]:
forward_output = output[:, -1, :16]
forward_output

tensor([[ 0.0670,  0.3398,  0.2181, -0.3097,  0.1076, -0.2641, -0.0169, -0.0980,
         -0.1255,  0.6076,  0.4330,  0.2523, -0.0939, -0.1524,  0.2407,  0.2703],
        [-0.5180,  0.1661, -0.0847, -0.3948, -0.0249, -0.2101,  0.1275,  0.1916,
         -0.0967,  0.1239,  0.2186,  0.3209,  0.4888,  0.0072,  0.2770,  0.2544],
        [-0.3849,  0.1163, -0.0131, -0.3293, -0.0205,  0.6072,  0.4943,  0.0874,
          0.4139, -0.2560,  0.1637, -0.3013, -0.1651, -0.1445,  0.4079, -0.1269]],
       grad_fn=<SliceBackward>)

In [11]:
backward_hidden = hidden[1, :, :] # final backward output
backward_hidden

tensor([[-0.0442, -0.1527,  0.2618,  0.0436,  0.4084,  0.0973,  0.0249, -0.1257,
          0.0592, -0.0033, -0.2466, -0.0610, -0.5022,  0.1437, -0.2513,  0.2731],
        [ 0.4148, -0.1290, -0.3368, -0.1570, -0.1727, -0.5331, -0.1311, -0.1707,
         -0.1839,  0.1936, -0.5336,  0.3160, -0.1950,  0.2886, -0.2423, -0.2670],
        [ 0.0942,  0.0764, -0.0954, -0.0169,  0.1336, -0.1355,  0.1616, -0.0181,
          0.0220,  0.0859, -0.1345,  0.0155,  0.0903,  0.1224, -0.0048, -0.0106]],
       grad_fn=<SliceBackward>)

In [12]:
backward_output = output[:, 0, 16:]
backward_output

tensor([[-0.0442, -0.1527,  0.2618,  0.0436,  0.4084,  0.0973,  0.0249, -0.1257,
          0.0592, -0.0033, -0.2466, -0.0610, -0.5022,  0.1437, -0.2513,  0.2731],
        [ 0.4148, -0.1290, -0.3368, -0.1570, -0.1727, -0.5331, -0.1311, -0.1707,
         -0.1839,  0.1936, -0.5336,  0.3160, -0.1950,  0.2886, -0.2423, -0.2670],
        [ 0.0942,  0.0764, -0.0954, -0.0169,  0.1336, -0.1355,  0.1616, -0.0181,
          0.0220,  0.0859, -0.1345,  0.0155,  0.0903,  0.1224, -0.0048, -0.0106]],
       grad_fn=<SliceBackward>)

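- What changed in the implementation?
    - A fully connected layer has been added!
    - The final forward and backward hidden states are concatenated, projected down to the decoder hidden dimension by a fully connected layer, and then passed through tanh.
    - Because attention is computed over the encoder output at every input time step, the encoder returns both `outputs` and `hidden`.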
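
Written as an equation, the projection performed by the encoder's `fc` layer looks roughly like this. The symbols are notation assumed for this note: $\overrightarrow{h}_T$ is the final forward hidden state, $\overleftarrow{h}_1$ is the final backward hidden state (which sits at the first source position), $W$ and $b$ are the weight and bias of `fc`, and $s_0$ is the initial decoder hidden state.

```latex
s_0 = \tanh\left( W \left[ \overrightarrow{h}_T \,;\, \overleftarrow{h}_1 \right] + b \right),
\qquad W \in \mathbb{R}^{d_{\text{dec}} \times 2 d_{\text{enc}}}
```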

In [13]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
        
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        
        outputs, hidden = self.rnn(embedded)
                
        #outputs = [src len, batch size, hid dim * num directions]
        #hidden = [n layers * num directions, batch size, hid dim]
        
        #hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        #outputs are always from the last layer
        
        #hidden [-2, :, : ] is the last of the forwards RNN 
        #hidden [-1, :, : ] is the last of the backwards RNN
        
        #initial decoder hidden is final hidden state of the forwards and backwards 
        #  encoder RNNs fed through a linear layer
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))
        
        #outputs = [src len, batch size, enc hid dim * 2]
        #hidden = [batch size, dec hid dim]
        
        return outputs, hidden

## Attention
- In one line?
    - Decoder: Hey Encoder, could you summarize the results and hand them over... please?
    - Encoder: Got it! Phew... I'll have Attention do it.

![image.png](https://github.com/bentrevett/pytorch-seq2seq/raw/49df8404d938a6edbf729876405558cc2c2b3013/assets/seq2seq9.png)

- https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html 
- Concat-based additive attention
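
As a sketch of what the `Attention` module in the next cell computes, the three steps can be written as below. The notation is assumed for this note: $s_{t-1}$ is the previous decoder hidden state, $h_i$ is the encoder output at source position $i$, $W_a$ and $b_a$ are the weight and bias of `attn`, $v$ is the bias-free `v` layer, and $T$ is the source length.

```latex
E_{t,i} = \tanh\left( W_a \left[ s_{t-1} \,;\, h_i \right] + b_a \right) \\
\hat{a}_{t,i} = v^{\top} E_{t,i} \\
a_{t,i} = \frac{\exp(\hat{a}_{t,i})}{\sum_{j=1}^{T} \exp(\hat{a}_{t,j})}
```

The softmax runs over source positions, so each row of attention weights sums to 1.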

In [14]:
class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)
        
    def forward(self, hidden, encoder_outputs):
        
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]
        
        #repeat decoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        #hidden = [batch size, src len, dec hid dim]
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2))) 
        
        #energy = [batch size, src len, dec hid dim]

        attention = self.v(energy).squeeze(2)
        
        #attention= [batch size, src len]
        
        return torch.softmax(attention, dim=1)

## Decoder

![image.png](https://github.com/bentrevett/pytorch-seq2seq/raw/49df8404d938a6edbf729876405558cc2c2b3013/assets/seq2seq10.png)

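- What changed?
    - The GRU no longer just receives a plain context; it now receives an attention-weighted one!
    - So the attention layers will already have worked out which input positions the context should focus on, right?
    - The important point is that the attention-weighted context (`weighted`) is fed in both when computing the hidden state and when computing the lm_head (`fc_out`)!
    - In other words, we went from feeding a plain context to feeding a context that attends over the input sequence.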
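
To make the weighted context concrete, here is a minimal pure-Python sketch of what `torch.bmm(a, encoder_outputs)` does for a single batch element: a softmax over scores, followed by a weighted sum of encoder outputs. The helper names and the toy numbers are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_context(weights, encoder_outputs):
    """Weighted sum of encoder outputs: sum_i weights[i] * encoder_outputs[i]."""
    dim = len(encoder_outputs[0])
    return [sum(w * h[d] for w, h in zip(weights, encoder_outputs))
            for d in range(dim)]


# toy values, made up for illustration: 3 source positions, 2-dim encoder outputs
scores = [2.0, 0.5, 0.5]
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

a = softmax(scores)
weighted = weighted_context(a, encoder_outputs)
print(a)         # weights sum to 1, with the largest on the first position
print(weighted)  # the context is pulled toward the first encoder output
```

In the decoder this same vector is concatenated with the token embedding for the GRU input and reused in `fc_out`.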

In [15]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()

        self.output_dim = output_dim
        self.attention = attention
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, encoder_outputs):
             
        #input = [batch size]
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        
        input = input.unsqueeze(0)
        
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]
        
        a = self.attention(hidden, encoder_outputs)
                
        #a = [batch size, src len]
        
        a = a.unsqueeze(1)
        
        #a = [batch size, 1, src len]
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        
        weighted = torch.bmm(a, encoder_outputs)
        
        #weighted = [batch size, 1, enc hid dim * 2]
        
        weighted = weighted.permute(1, 0, 2)
        
        #weighted = [1, batch size, enc hid dim * 2]
        
        rnn_input = torch.cat((embedded, weighted), dim = 2)
        
        #rnn_input = [1, batch size, (enc hid dim * 2) + emb dim]
            
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        
        #output = [seq len, batch size, dec hid dim * n directions]
        #hidden = [n layers * n directions, batch size, dec hid dim]
        
        #seq len, n layers and n directions will always be 1 in this decoder, therefore:
        #output = [1, batch size, dec hid dim]
        #hidden = [1, batch size, dec hid dim]
        #this also means that output == hidden
        assert (output == hidden).all()
        
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim = 1))
        
        #prediction = [batch size, output dim]
        
        return prediction, hidden.squeeze(0)

## Seq2Seq
- The code is essentially the same!
- What matters is that the decoder, through the Attention module, builds a context that carries which input sequence positions to focus on. It uses that context to compute its hidden state, and again in the lm_head (`fc_out`) when generating each token.

In [16]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use teacher forcing 75% of the time
        
        batch_size = src.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        #encoder_outputs is all hidden states of the input sequence, back and forwards
        #hidden is the final forward and backward hidden states, passed through a linear layer
        encoder_outputs, hidden = self.encoder(src)
                
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        for t in range(1, trg_len):
            
            #insert input token embedding, previous hidden state and all encoder hidden states
            #receive output tensor (predictions) and new hidden state
            output, hidden = self.decoder(input, hidden, encoder_outputs)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1

        return outputs
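
The teacher forcing decision inside the loop above can be isolated as a tiny sketch. `choose_next_input` is a hypothetical helper that mirrors the `random.random() < teacher_forcing_ratio` check; the tokens are placeholder strings.

```python
import random


def choose_next_input(target_token, predicted_token, teacher_forcing_ratio, rng=random):
    """Return the ground-truth token with probability teacher_forcing_ratio,
    otherwise the model's own prediction (hypothetical helper)."""
    teacher_force = rng.random() < teacher_forcing_ratio
    return target_token if teacher_force else predicted_token


rng = random.Random(1234)
# ratio 1.0: always the ground truth
print(choose_next_input("dog", "cat", 1.0, rng))  # dog
# ratio 0.0: always the prediction, which is what evaluate() requests by passing 0
print(choose_next_input("dog", "cat", 0.0, rng))  # cat
```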

## Training seq2seq

In [17]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

# 1
# enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
# dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)
# 2
# enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, ENC_DROPOUT)
# dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, DEC_DROPOUT)

attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)

model = Seq2Seq(enc, dec, device).to(device)

- Weights are initialized the same way as in tutorial 2.
- One difference: biases are initialized to zero.

In [18]:
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7855, 256)
    (rnn): GRU(256, 512, bidirectional=True)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=1536, out_features=512, bias=True)
      (v): Linear(in_features=512, out_features=1, bias=False)
    )
    (embedding): Embedding(5893, 256)
    (rnn): GRU(1280, 512)
    (fc_out): Linear(in_features=1792, out_features=5893, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

- As expected, adding attention noticeably increases the parameter count!
- It comes to about 20M. That is still a very small model, so don't worry.

In [19]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 20,518,917 trainable parameters
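
The total can be reproduced from the layer shapes in the printed model. This is a back-of-the-envelope sketch: `linear_params` and `gru_params` are hypothetical helpers, and the vocabulary sizes are read off the `Embedding` layers in the model summary above.

```python
def linear_params(in_features, out_features, bias=True):
    """Weights plus optional bias of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def gru_params(input_size, hidden_size, directions=1):
    """Single-layer GRU: 3 gates, each with W_ih, W_hh, b_ih, b_hh."""
    return directions * 3 * hidden_size * (input_size + hidden_size + 2)


SRC_VOCAB, TRG_VOCAB = 7855, 5893  # from the Embedding layers printed above
EMB, ENC_HID, DEC_HID = 256, 512, 512

breakdown = {
    "encoder.embedding": SRC_VOCAB * EMB,
    "encoder.rnn": gru_params(EMB, ENC_HID, directions=2),
    "encoder.fc": linear_params(ENC_HID * 2, DEC_HID),
    "attention.attn": linear_params(ENC_HID * 2 + DEC_HID, DEC_HID),
    "attention.v": linear_params(DEC_HID, 1, bias=False),
    "decoder.embedding": TRG_VOCAB * EMB,
    "decoder.rnn": gru_params(ENC_HID * 2 + EMB, DEC_HID),
    "decoder.fc_out": linear_params(ENC_HID * 2 + DEC_HID + EMB, TRG_VOCAB),
}

for name, n in breakdown.items():
    print(f"{name:20s} {n:>12,}")
total = sum(breakdown.values())
print(f"{'total':20s} {total:>12,}")  # 20,518,917
```

By this count the attention layers themselves hold under 0.8M parameters; roughly half of the total sits in `fc_out`, whose input is widened by the context vector.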


In [20]:
optimizer = optim.Adam(model.parameters())

In [21]:
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)

- The training and evaluation code is unchanged.

In [22]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.src
        trg = batch.trg
        
        optimizer.zero_grad()
        
        output = model(src, trg)
        
        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]
        
        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [23]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.src
            trg = batch.trg

            output = model(src, trg, 0) #turn off teacher forcing

            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

- A helper to measure how long each epoch takes.

In [24]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [25]:
N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 1m 2s
	Train Loss: 4.677 | Train PPL: 107.462
	 Val. Loss: 4.302 |  Val. PPL:  73.820
Epoch: 02 | Time: 1m 1s
	Train Loss: 3.403 | Train PPL:  30.058
	 Val. Loss: 3.524 |  Val. PPL:  33.929
Epoch: 03 | Time: 1m 2s
	Train Loss: 2.722 | Train PPL:  15.214
	 Val. Loss: 3.284 |  Val. PPL:  26.674
Epoch: 04 | Time: 1m 2s
	Train Loss: 2.335 | Train PPL:  10.332
	 Val. Loss: 3.187 |  Val. PPL:  24.226
Epoch: 05 | Time: 1m 2s
	Train Loss: 2.017 | Train PPL:   7.515
	 Val. Loss: 3.167 |  Val. PPL:  23.740
Epoch: 06 | Time: 1m 2s
	Train Loss: 1.784 | Train PPL:   5.954
	 Val. Loss: 3.203 |  Val. PPL:  24.607
Epoch: 07 | Time: 1m 2s
	Train Loss: 1.596 | Train PPL:   4.934
	 Val. Loss: 3.274 |  Val. PPL:  26.417
Epoch: 08 | Time: 1m 2s
	Train Loss: 1.441 | Train PPL:   4.224
	 Val. Loss: 3.317 |  Val. PPL:  27.586
Epoch: 09 | Time: 1m 2s
	Train Loss: 1.331 | Train PPL:   3.784
	 Val. Loss: 3.410 |  Val. PPL:  30.254
Epoch: 10 | Time: 1m 1s
	Train Loss: 1.225 | Train PPL:   3.406


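- Wow! Performance has improved by a large margin.
- Among the epochs logged above, validation loss is lowest at epoch 05 (3.167) and climbs afterward while train loss keeps falling, so the later epochs are overfitting.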

In [26]:
model.load_state_dict(torch.load('tut3-model.pt'))

test_loss = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

| Test Loss: 3.178 | Test PPL:  24.010 |
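
The PPL column in these logs is just the exponential of the mean cross-entropy loss, as in `math.exp(test_loss)` above. A minimal sketch, where `perplexity` is a hypothetical helper:

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


# The printed loss is rounded to 3 decimals, so this only approximately
# reproduces the logged Test PPL of 24.010.
print(f"{perplexity(3.178):.3f}")
```

A perplexity near 24 can be read as the model being, on average, about as uncertain as a uniform choice among 24 tokens at each step.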
