# 3 - Neural Machine Translation by Jointly Learning to Align and Translate

In this third notebook on sequence-to-sequence models using PyTorch and TorchText, we'll be implementing the model from [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473). This model achieves our best perplexity yet, ~27 compared to ~34 for the previous model.

## Introduction

As a reminder, here is the general encoder-decoder model:

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/seq2seq1.png?raw=1)

In the previous model, our architecture was set-up in a way to reduce "information compression" by explicitly passing the context vector, $z$, to the decoder at every time-step and by passing both the context vector and embedded input word, $d(y_t)$, along with the hidden state, $s_t$, to the linear layer, $f$, to make a prediction.

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/seq2seq7.png?raw=1)

Even though we have reduced some of this compression, our context vector still needs to contain all of the information about the source sentence. The model implemented in this notebook avoids this compression by allowing the decoder to look at the entire source sentence (via its hidden states) at each decoding step! How does it do this? It uses *attention*. 

Attention works by first calculating an attention vector, $a$, that is the length of the source sentence. Each element of the attention vector is between 0 and 1, and the entire vector sums to 1. We then calculate a weighted sum of our source sentence hidden states, $H$, to get a weighted source vector, $w$. 

$$w = \sum_{i}a_ih_i$$

We calculate a new weighted source vector every time-step when decoding, using it as input to our decoder RNN as well as the linear layer to make a prediction. We'll explain how to do all of this during the tutorial.
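
As a rough illustration of the weighted sum above, here is a minimal sketch with made-up numbers. The hidden states `H` and the unnormalised `scores` are hypothetical toy values, not outputs of the model; the only point is to show how a softmax produces weights between 0 and 1 that sum to 1, and how those weights combine the hidden states into $w$.

```python
import torch

# Hypothetical toy values: 4 source tokens, each with a 3-dimensional hidden state.
H = torch.tensor([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [1.0, 1.0, 1.0]])           # [src len, hid dim]
scores = torch.tensor([2.0, 0.5, 0.5, 1.0])   # one unnormalised score per source token

a = torch.softmax(scores, dim=0)              # attention vector, [src len]
w = (a.unsqueeze(1) * H).sum(dim=0)           # w = sum_i a_i * h_i, [hid dim]

print(a, a.sum())
print(w)
```

The same weighted sum can be written as the matrix product `a @ H`, which is the form used later in the decoder (there with a batch dimension).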

## Preparing Data

Again, the preparation is similar to last time.

First we import all the required modules.

In [1]:
# if you are running this code in Colab, uncomment the lines below and run them
# !pip install torchtext==0.9.0
# from google.colab import drive
# drive.mount('/content/drive')

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from torchtext.legacy.datasets import Multi30k
from torchtext.legacy.data import Field, BucketIterator

import spacy
import numpy as np

import random
import math
import time


Set the random seeds for reproducibility.

In [2]:
SEED = 120220210

# set the random seeds
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)

# make cuDNN operations deterministic so that results are reproducible
torch.backends.cudnn.deterministic = True

Load the German and English spaCy models.

In [3]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

In [7]:
spacy_de.__dir__()

['_config',
 '_meta',
 '_path',
 '_optimizer',
 '_pipe_meta',
 '_pipe_configs',
 'vocab',
 '_components',
 '_disabled',
 'max_length',
 'tokenizer',
 'batch_size',
 'default_error_handler',
 '__module__',
 'lang',
 'Defaults',
 '__doc__',
 'default_config',
 '__annotations__',
 'factories',
 '_factory_meta',
 '__init__',
 '__init_subclass__',
 'path',
 'meta',
 'config',
 'disabled',
 'factory_names',
 'components',
 'component_names',
 'pipeline',
 'pipe_names',
 'pipe_factories',
 'pipe_labels',
 'has_factory',
 'get_factory_name',
 'get_factory_meta',
 'set_factory_meta',
 'get_pipe_meta',
 'get_pipe_config',
 'factory',
 'component',
 'analyze_pipes',
 'get_pipe',
 'create_pipe',
 'create_pipe_from_source',
 'add_pipe',
 '_get_pipe_index',
 'has_pipe',
 'replace_pipe',
 'rename_pipe',
 'remove_pipe',
 'disable_pipe',
 'enable_pipe',
 '__call__',
 'disable_pipes',
 'select_pipes',
 'make_doc',
 '_ensure_doc',
 '_ensure_doc_with_context',
 'update',
 'rehearse',
 'begin_training',
 ...]

In [9]:
spacy_de.vocab.__dir__()

['__iter__',
 '__init__',
 '__len__',
 '__getitem__',
 '__contains__',
 '__new__',
 'add_flag',
 'reset_vectors',
 'deduplicate_vectors',
 'prune_vectors',
 'get_vector',
 'set_vector',
 'has_vector',
 'to_disk',
 'from_disk',
 'to_bytes',
 'from_bytes',
 '_reset_cache',
 'vectors',
 'lang',
 'vectors_length',
 'lookups',
 'strings',
 'morphology',
 '_vectors',
 '_lookups',
 'writing_system',
 'get_noun_chunks',
 'length',
 '_unused_object',
 'lex_attr_getters',
 'cfg',
 '__doc__',
 '__pyx_vtable__',
 '__reduce__',
 '__setstate__',
 '__repr__',
 '__hash__',
 '__str__',
 '__getattribute__',
 '__setattr__',
 '__delattr__',
 '__lt__',
 '__le__',
 '__eq__',
 '__ne__',
 '__gt__',
 '__ge__',
 '__reduce_ex__',
 '__subclasshook__',
 '__init_subclass__',
 '__format__',
 '__sizeof__',
 '__dir__',
 '__class__']

In [22]:
spacy_de.tokenizer.__dir__()

['__call__',
 '__init__',
 '__new__',
 '__reduce__',
 'pipe',
 '_flush_cache',
 '_reset_cache',
 '_flush_specials',
 'find_infix',
 'find_prefix',
 'find_suffix',
 '_load_special_cases',
 '_validate_special_case',
 'add_special_case',
 '_reload_special_cases',
 'explain',
 'score',
 'to_disk',
 'from_disk',
 'to_bytes',
 'from_bytes',
 'token_match',
 'url_match',
 'prefix_search',
 'suffix_search',
 'infix_finditer',
 'rules',
 'faster_heuristics',
 'vocab',
 '__doc__',
 '__pyx_vtable__',
 '__repr__',
 '__hash__',
 '__str__',
 '__getattribute__',
 '__setattr__',
 '__delattr__',
 '__lt__',
 '__le__',
 '__eq__',
 '__ne__',
 '__gt__',
 '__ge__',
 '__reduce_ex__',
 '__subclasshook__',
 '__init_subclass__',
 '__format__',
 '__sizeof__',
 '__dir__',
 '__class__']

From the listing above, we can see that the tokenizer defines `__call__`, so it is applied directly to the text we want to tokenize.

We create the tokenizers.

In [17]:
def tokenize_de(text):
    """
    Tokenizes German text from a string into a list of strings
    """
    # tokenize the text with spacy_de and return the tokens as a list of strings
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings
    """
    # tokenize the text with spacy_en and return the tokens as a list of strings
    return [tok.text for tok in spacy_en.tokenizer(text)]

The fields remain the same as before.

In [25]:
# Initialize the SRC field with the German tokenizer function.
# Use <sos> and <eos> as the start-of-sequence and end-of-sequence tokens.
# Convert all tokens to lowercase.
SRC = Field(tokenize = tokenize_de, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)

TRG = Field(tokenize = tokenize_en, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)
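
To make the effect of these `Field` arguments concrete, here is a simplified sketch of the processing they describe. This is not torchtext's actual implementation: `preprocess` is a hypothetical helper, and `str.split` stands in for the spaCy tokenizers so that the example runs without any downloaded models.

```python
def preprocess(text, tokenize=str.split, init_token='<sos>', eos_token='<eos>', lower=True):
    """Hypothetical stand-in for the processing a Field applies to one sentence."""
    tokens = tokenize(text)                      # split the raw string into tokens
    if lower:
        tokens = [tok.lower() for tok in tokens] # lower = True
    return [init_token] + tokens + [eos_token]   # wrap with <sos> ... <eos>

print(preprocess('Zwei junge Männer .'))
# ['<sos>', 'zwei', 'junge', 'männer', '.', '<eos>']
```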

Load the data.

In [28]:
# Split the Multi30k dataset into train, validation and test data.
# The exts argument specifies the file extension for each language.
# The fields argument specifies the field for each language.
train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'), 
                                                    fields = (SRC, TRG))

In [33]:
train_data.__dir__()

['examples',
 'fields',
 '__module__',
 '__doc__',
 'urls',
 'name',
 'dirname',
 'splits',
 '__parameters__',
 'sort_key',
 '__init__',
 'split',
 '__getitem__',
 '__len__',
 '__iter__',
 '__getattr__',
 'download',
 'filter_examples',
 '__annotations__',
 'functions',
 '__add__',
 'register_function',
 'register_datapipe_as_function',
 '__orig_bases__',
 '__dict__',
 '__weakref__',
 '__slots__',
 '_is_protocol',
 '__class_getitem__',
 '__init_subclass__',
 '__repr__',
 '__hash__',
 '__str__',
 '__getattribute__',
 '__setattr__',
 '__delattr__',
 '__lt__',
 '__le__',
 '__eq__',
 '__ne__',
 '__gt__',
 '__ge__',
 '__new__',
 '__reduce_ex__',
 '__reduce__',
 '__subclasshook__',
 '__format__',
 '__sizeof__',
 '__dir__',
 '__class__']

In [39]:
train_data.examples[0].__dir__()

['src',
 'trg',
 '__module__',
 '__doc__',
 'fromJSON',
 'fromdict',
 'fromCSV',
 'fromlist',
 'fromtree',
 '__dict__',
 '__weakref__',
 '__repr__',
 '__hash__',
 '__str__',
 '__getattribute__',
 '__setattr__',
 '__delattr__',
 '__lt__',
 '__le__',
 '__eq__',
 '__ne__',
 '__gt__',
 '__ge__',
 '__init__',
 '__new__',
 '__reduce_ex__',
 '__reduce__',
 '__subclasshook__',
 '__init_subclass__',
 '__format__',
 '__sizeof__',
 '__dir__',
 '__class__']

In [44]:
print(len(train_data.examples))
print(len(valid_data.examples))
print(len(test_data.examples))

print(vars(train_data.examples[0]))


29000
1014
1000
{'src': ['zwei', 'junge', 'weiße', 'männer', 'sind', 'im', 'freien', 'in', 'der', 'nähe', 'vieler', 'büsche', '.'], 'trg': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']}


In [40]:
train_data.examples[0].__dict__

{'src': ['zwei',
  'junge',
  'weiße',
  'männer',
  'sind',
  'im',
  'freien',
  'in',
  'der',
  'nähe',
  'vieler',
  'büsche',
  '.'],
 'trg': ['two',
  'young',
  ',',
  'white',
  'males',
  'are',
  'outside',
  'near',
  'many',
  'bushes',
  '.']}

Build the vocabulary.

In [29]:
SRC.build_vocab(train_data, min_freq = 2) # build the SRC vocabulary from train_data, keeping tokens that appear at least 2 times
TRG.build_vocab(train_data, min_freq = 2) # build the TRG vocabulary from train_data, keeping tokens that appear at least 2 times
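
The effect of `min_freq = 2` can be sketched in plain Python. This is only an approximation of what `build_vocab` does, under the assumption that special tokens come first and the remaining tokens are ordered by frequency; `build_vocab` and `numericalise` below are hypothetical helpers written for this illustration, not torchtext functions.

```python
from collections import Counter

def build_vocab(tokenised_sentences, min_freq=2,
                specials=('<unk>', '<pad>', '<sos>', '<eos>')):
    """Hypothetical helper: keep tokens seen at least min_freq times."""
    counts = Counter(tok for sent in tokenised_sentences for tok in sent)
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))   # most frequent first
    itos = list(specials) + kept                   # index -> token
    stoi = {tok: i for i, tok in enumerate(itos)}  # token -> index
    return itos, stoi

def numericalise(tokens, stoi):
    """Tokens missing from the vocabulary map to the <unk> index."""
    return [stoi.get(tok, stoi['<unk>']) for tok in tokens]

sentences = [['ein', 'mann', '.'], ['ein', 'hund', '.'], ['eine', 'frau', '.']]
itos, stoi = build_vocab(sentences)
print(itos)                                       # ['<unk>', '<pad>', '<sos>', '<eos>', '.', 'ein']
print(numericalise(['ein', 'hund', '.'], stoi))   # [5, 0, 4]
```

Tokens that appear only once in this toy corpus (`mann`, `hund`, `eine`, `frau`) are not in the vocabulary, so they are all mapped to `<unk>`.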

In [45]:
SRC.__dir__()

['sequential',
 'use_vocab',
 'init_token',
 'eos_token',
 'unk_token',
 'fix_length',
 'dtype',
 'preprocessing',
 'postprocessing',
 'lower',
 'tokenizer_args',
 'tokenize',
 'include_lengths',
 'batch_first',
 'pad_token',
 'pad_first',
 'truncate_first',
 'stop_words',
 'is_target',
 'vocab',
 '__module__',
 '__doc__',
 'vocab_cls',
 'dtypes',
 'ignore',
 '__init__',
 '__getstate__',
 '__setstate__',
 '__hash__',
 '__eq__',
 'preprocess',
 'process',
 'pad',
 'build_vocab',
 'numericalize',
 '__dict__',
 '__weakref__',
 '__repr__',
 '__str__',
 '__getattribute__',
 '__setattr__',
 '__delattr__',
 '__lt__',
 '__le__',
 '__ne__',
 '__gt__',
 '__ge__',
 '__new__',
 '__reduce_ex__',
 '__reduce__',
 '__subclasshook__',
 '__init_subclass__',
 '__format__',
 '__sizeof__',
 '__dir__',
 '__class__']

In [47]:
SRC.vocab.__dir__()

['freqs',
 'itos',
 'unk_index',
 'stoi',
 'vectors',
 '__module__',
 '__doc__',
 'UNK',
 '__init__',
 '_default_unk_index',
 '__getitem__',
 '__getstate__',
 '__setstate__',
 '__eq__',
 '__len__',
 'lookup_indices',
 'extend',
 'load_vectors',
 'set_vectors',
 '__dict__',
 '__weakref__',
 '__hash__',
 '__repr__',
 '__str__',
 '__getattribute__',
 '__setattr__',
 '__delattr__',
 '__lt__',
 '__le__',
 '__ne__',
 '__gt__',
 '__ge__',
 '__new__',
 '__reduce_ex__',
 '__reduce__',
 '__subclasshook__',
 '__init_subclass__',
 '__format__',
 '__sizeof__',
 '__dir__',
 '__class__']

In [48]:
SRC.vocab.freqs.most_common(20)

[('.', 28809),
 ('ein', 18851),
 ('einem', 13711),
 ('in', 11895),
 ('eine', 9909),
 (',', 8938),
 ('und', 8925),
 ('mit', 8843),
 ('auf', 8745),
 ('mann', 7805),
 ('einer', 6765),
 ('der', 4990),
 ('frau', 4186),
 ('die', 3949),
 ('zwei', 3873),
 ('einen', 3479),
 ('im', 3107),
 ('an', 3062),
 ('von', 2363),
 ('sich', 2273)]

In [52]:
TRG.vocab.freqs.most_common(20)

[('a', 49165),
 ('.', 27623),
 ('in', 14886),
 ('the', 10955),
 ('on', 8035),
 ('man', 7781),
 ('is', 7525),
 ('and', 7379),
 ('of', 6871),
 ('with', 6179),
 ('woman', 3973),
 (',', 3963),
 ('two', 3886),
 ('are', 3717),
 ('to', 3128),
 ('people', 3122),
 ('at', 2927),
 ('an', 2861),
 ('wearing', 2623),
 ('shirt', 2324)]

In [51]:
print(SRC.vocab.itos[:10])
print(TRG.vocab.itos[:10])

['<unk>', '<pad>', '<sos>', '<eos>', '.', 'ein', 'einem', 'in', 'eine', ',']
['<unk>', '<pad>', '<sos>', '<eos>', 'a', '.', 'in', 'the', 'on', 'man']


Define the device.

In [53]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Create the iterators.

In [54]:
BATCH_SIZE = 512

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    device = device)

In [56]:
train_iterator.__dir__()

['batch_size',
 'train',
 'dataset',
 'batch_size_fn',
 'iterations',
 'repeat',
 'shuffle',
 'sort',
 'sort_within_batch',
 'sort_key',
 'device',
 'random_shuffler',
 '_iterations_this_epoch',
 '_random_state_this_epoch',
 '_restored_from_state',
 '__module__',
 '__doc__',
 'create_batches',
 '__init__',
 'splits',
 'data',
 'init_epoch',
 'epoch',
 '__len__',
 '__iter__',
 'state_dict',
 'load_state_dict',
 '__dict__',
 '__weakref__',
 '__repr__',
 '__hash__',
 '__str__',
 '__getattribute__',
 '__setattr__',
 '__delattr__',
 '__lt__',
 '__le__',
 '__eq__',
 '__ne__',
 '__gt__',
 '__ge__',
 '__new__',
 '__reduce_ex__',
 '__reduce__',
 '__subclasshook__',
 '__init_subclass__',
 '__format__',
 '__sizeof__',
 '__dir__',
 '__class__']

In [58]:
next(iter(train_iterator))


[torchtext.legacy.data.batch.Batch of size 512 from MULTI30K]
	[.src]:[torch.cuda.LongTensor of size 29x512 (GPU 0)]
	[.trg]:[torch.cuda.LongTensor of size 32x512 (GPU 0)]
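
The batch shapes above are **[src len, batch size]** and **[trg len, batch size]**: every sentence in a batch is padded to the length of the longest one. A minimal sketch of that padding, using `torch.nn.utils.rnn.pad_sequence` rather than torchtext and made-up token indices, with the pad index taken to be 1 as in the `itos` listing above:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 1  # index of '<pad>' in the vocabulary listing above

# Three hypothetical numericalised sentences of different lengths
# (2 = <sos>, 3 = <eos>, other indices are arbitrary).
sentences = [torch.tensor([2, 5, 9, 4, 3]),
             torch.tensor([2, 7, 3]),
             torch.tensor([2, 8, 6, 3])]

batch = pad_sequence(sentences, padding_value=PAD_IDX)  # [src len, batch size]
print(batch.shape)   # torch.Size([5, 3])
print(batch[:, 1])   # the second sentence, padded with PAD_IDX at the end
```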

## Building the Seq2Seq Model

### Encoder

First, we'll build the encoder. Similar to the previous model, we only use a single-layer GRU; however, we now use a *bidirectional RNN*. With a bidirectional RNN, we have two RNNs in each layer: a *forward RNN* going over the embedded sentence from left to right (shown below in green), and a *backward RNN* going over the embedded sentence from right to left (teal). All we need to do in code is set `bidirectional = True` and then pass the embedded sentence to the RNN as before. 

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/seq2seq8.png?raw=1)

We now have:

$$\begin{align*}
h_t^\rightarrow &= \text{EncoderGRU}^\rightarrow(e(x_t^\rightarrow),h_{t-1}^\rightarrow)\\
h_t^\leftarrow &= \text{EncoderGRU}^\leftarrow(e(x_t^\leftarrow),h_{t-1}^\leftarrow)
\end{align*}$$

Where $x_0^\rightarrow = \text{<sos>}, x_1^\rightarrow = \text{guten}$ and $x_0^\leftarrow = \text{<eos>}, x_1^\leftarrow = \text{morgen}$.

As before, we only pass an input (`embedded`) to the RNN, which tells PyTorch to initialize both the forward and backward initial hidden states ($h_0^\rightarrow$ and $h_0^\leftarrow$, respectively) to a tensor of all zeros. We'll also get two context vectors, one from the forward RNN after it has seen the final word in the sentence, $z^\rightarrow=h_T^\rightarrow$, and one from the backward RNN after it has seen the first word in the sentence, $z^\leftarrow=h_T^\leftarrow$.
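
The zero initialisation can be checked directly. The sketch below uses small, arbitrary sizes (not the notebook's hyperparameters) and compares calling a bidirectional `nn.GRU` without an initial hidden state against calling it with an explicit tensor of zeros.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb_dim, hid_dim, src_len, batch_size = 8, 16, 5, 3   # arbitrary toy sizes

rnn = nn.GRU(emb_dim, hid_dim, bidirectional=True)
embedded = torch.randn(src_len, batch_size, emb_dim)

# No initial hidden state given: PyTorch uses zeros for both directions.
out_default, hid_default = rnn(embedded)

# Explicit zeros: [n layers * num directions, batch size, hid dim]
h0 = torch.zeros(2, batch_size, hid_dim)
out_zeros, hid_zeros = rnn(embedded, h0)

print(torch.allclose(out_default, out_zeros), torch.allclose(hid_default, hid_zeros))
```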

The RNN returns `outputs` and `hidden`. 

`outputs` is of size **[src len, batch size, hid dim * num directions]** where the first `hid_dim` elements in the third axis are the hidden states from the top layer forward RNN, and the last `hid_dim` elements are hidden states from the top layer backward RNN. We can think of the third axis as being the forward and backward hidden states concatenated together, i.e. $h_1 = [h_1^\rightarrow; h_{T}^\leftarrow]$, $h_2 = [h_2^\rightarrow; h_{T-1}^\leftarrow]$, and we can denote all encoder hidden states (forward and backward concatenated together) as $H=\{ h_1, h_2, ..., h_T\}$.

`hidden` is of size **[n layers * num directions, batch size, hid dim]**, where **[-2, :, :]** gives the top layer forward RNN hidden state after the final time-step (i.e. after it has seen the last word in the sentence) and **[-1, :, :]** gives the top layer backward RNN hidden state after the final time-step (i.e. after it has seen the first word in the sentence).
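
The relationship between `outputs` and `hidden` described above can be verified on a small example. The sizes here are arbitrary toy values; the check is that the forward half of the last position of `outputs` matches `hidden[-2]`, and the backward half of the first position matches `hidden[-1]`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb_dim, hid_dim, src_len, batch_size = 8, 16, 5, 3   # arbitrary toy sizes

rnn = nn.GRU(emb_dim, hid_dim, bidirectional=True)
outputs, hidden = rnn(torch.randn(src_len, batch_size, emb_dim))

# outputs = [src len, batch size, hid dim * 2], hidden = [2, batch size, hid dim]
forward_final = outputs[-1, :, :hid_dim]   # forward RNN after the last token
backward_final = outputs[0, :, hid_dim:]   # backward RNN after the first token

print(torch.allclose(forward_final, hidden[-2]))
print(torch.allclose(backward_final, hidden[-1]))
```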

As the decoder is not bidirectional, it only needs a single context vector, $z$, to use as its initial hidden state, $s_0$, and we currently have two, a forward and a backward one ($z^\rightarrow=h_T^\rightarrow$ and $z^\leftarrow=h_T^\leftarrow$, respectively). We solve this by concatenating the two context vectors together, passing them through a linear layer, $g$, and applying the $\tanh$ activation function. 

$$z=\tanh(g(h_T^\rightarrow, h_T^\leftarrow)) = \tanh(g(z^\rightarrow, z^\leftarrow)) = s_0$$

**Note**: this is actually a deviation from the paper. Instead, they feed only the first backward RNN hidden state through a linear layer to get the context vector/decoder initial hidden state. This doesn't seem to make sense to me, so we have changed it.

As we want our model to look back over the whole of the source sentence we return `outputs`, the stacked forward and backward hidden states for every token in the source sentence. We also return `hidden`, which acts as our initial hidden state in the decoder.

In [97]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        
        self.embedding = nn.Embedding(input_dim, emb_dim) # embedding layer

        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True) # bidirectional GRU layer

        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim) # linear layer

        self.dropout = nn.Dropout(dropout) # dropout layer
        
    def forward(self, src):
        
        #src = [src len, batch size]
        
        embedded = self.dropout(self.embedding(src)) # embed the source tokens and apply dropout
        
        #embedded = [src len, batch size, emb dim]
        
        outputs, hidden = self.rnn(embedded) # run the bidirectional GRU
        
        #outputs = [src len, batch size, hid dim * num directions]
        #hidden = [n layers * num directions, batch size, hid dim]
        
        #hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        #outputs are always from the last layer
        
        #hidden [-2, :, : ] is the last of the forwards RNN 
        #hidden [-1, :, : ] is the last of the backwards RNN
        
        #initial decoder hidden is final hidden state of the forwards and backwards 
        #  encoder RNNs fed through a linear layer
        
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))) # linear layer followed by tanh
        
        #outputs = [src len, batch size, enc hid dim * 2]
        #hidden = [batch size, dec hid dim]
        
        return outputs, hidden

In [65]:
dummy_encoder = Encoder(100, 256, 512, 512, 0.5)
# note: this tensor has shape [1, 20], so the encoder reads it as src len = 1, batch size = 20
example_src = torch.LongTensor([[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]])
dummy_encoder_outputs, dummy_encoder_hidden = dummy_encoder(example_src)

In [67]:
print(dummy_encoder_outputs.shape)
print(dummy_encoder_hidden.shape)
print(dummy_encoder_outputs)
print(dummy_encoder_hidden)
    

torch.Size([1, 20, 1024])
torch.Size([20, 512])
tensor([[[ 0.2953,  0.1980,  0.3187,  ..., -0.0409, -0.0824, -0.1014],
         [-0.4718, -0.1440, -0.0953,  ...,  0.0996,  0.3089, -0.0241],
         [-0.3840,  0.1870, -0.2285,  ...,  0.0952,  0.0022,  0.1213],
         ...,
         [ 0.1388, -0.0250,  0.2598,  ...,  0.1548, -0.0739, -0.1062],
         [ 0.1485,  0.2943,  0.3038,  ...,  0.2179,  0.1191,  0.2483],
         [ 0.3894, -0.4902, -0.0830,  ...,  0.3395, -0.4165, -0.1838]]],
       grad_fn=<CatBackward0>)
tensor([[ 0.1333, -0.0796,  0.0811,  ...,  0.1036,  0.0277,  0.0182],
        [ 0.1410, -0.3436,  0.0208,  ..., -0.0740,  0.1757, -0.2197],
        [ 0.0721,  0.0695,  0.0608,  ...,  0.1166, -0.0088, -0.0233],
        ...,
        [ 0.0748,  0.0924, -0.0291,  ...,  0.0758, -0.0851, -0.1338],
        [ 0.2274,  0.0988, -0.0309,  ...,  0.0219, -0.0978, -0.1013],
        [ 0.0482, -0.0913, -0.0509,  ..., -0.1669, -0.0727, -0.1779]],
       grad_fn=<TanhBackward0>)


### Attention

Next up is the attention layer. This will take in the previous hidden state of the decoder, $s_{t-1}$, and all of the stacked forward and backward hidden states from the encoder, $H$. The layer will output an attention vector, $a_t$, that is the length of the source sentence; each element is between 0 and 1, and the entire vector sums to 1.

Intuitively, this layer takes what we have decoded so far, $s_{t-1}$, and all of what we have encoded, $H$, to produce a vector, $a_t$, that represents which words in the source sentence we should pay the most attention to in order to correctly predict the next word to decode, $\hat{y}_{t+1}$. 

First, we calculate the *energy* between the previous decoder hidden state and the encoder hidden states. As our encoder hidden states are a sequence of $T$ tensors, and our previous decoder hidden state is a single tensor, the first thing we do is `repeat` the previous decoder hidden state $T$ times. We then calculate the energy, $E_t$, between them by concatenating them together and passing them through a linear layer (`attn`) and a $\tanh$ activation function. 

$$E_t = \tanh(\text{attn}(s_{t-1}, H))$$ 

This can be thought of as calculating how well each encoder hidden state "matches" the previous decoder hidden state.

We currently have a **[dec hid dim, src len]** tensor for each example in the batch. We want this to be **[src len]** for each example in the batch as the attention should be over the length of the source sentence. This is achieved by multiplying the `energy` by a **[1, dec hid dim]** tensor, $v$.

$$\hat{a}_t = v E_t$$

We can think of $v$ as the weights for a weighted sum of the energy across all encoder hidden states. These weights tell us how much we should attend to each token in the source sequence. The parameters of $v$ are initialized randomly, but learned with the rest of the model via backpropagation. Note how $v$ is not dependent on time, and the same $v$ is used for each time-step of the decoding. We implement $v$ as a linear layer without a bias.
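
To see why a bias-free linear layer is enough to implement $v$, here is a small sketch with arbitrary toy sizes and a random stand-in for the energy. Note that in code the energy is laid out as **[batch size, src len, dec hid dim]**, so applying $v$ is a dot product over the last axis.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, src_len, dec_hid_dim = 2, 5, 8   # arbitrary toy sizes

# Random stand-in for E_t, squashed by tanh like the real energy.
energy = torch.tanh(torch.randn(batch_size, src_len, dec_hid_dim))

v = nn.Linear(dec_hid_dim, 1, bias=False)    # weight has shape [1, dec hid dim]

scores = v(energy).squeeze(2)                # [batch size, src len]
manual = energy @ v.weight.squeeze(0)        # the same dot product written out

a = torch.softmax(scores, dim=1)             # attention over the source tokens

print(torch.allclose(scores, manual, atol=1e-6))
print(a.sum(dim=1))
```

Each row of `a` sums to 1, one row per example in the batch.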

Finally, we ensure the attention vector fits the constraints of having all elements between 0 and 1 and the vector summing to 1 by passing it through a $\text{softmax}$ layer.

$$a_t = \text{softmax}(\hat{a}_t)$$

This gives us the attention over the source sentence!

Graphically, this looks something like below. This is for calculating the very first attention vector, where $s_{t-1} = s_0 = z$. The green/teal blocks represent the hidden states from both the forward and backward RNNs, and the attention computation is all done within the pink block.

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/seq2seq9.png?raw=1)

In [70]:
class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)
        
    def forward(self, hidden, encoder_outputs):
        
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]
        
        #repeat decoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        #hidden = [batch size, src len, dec hid dim]
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2))) 
        
        #energy = [batch size, src len, dec hid dim]

        attention = self.v(energy).squeeze(2)
        
        #attention= [batch size, src len]
        
        return F.softmax(attention, dim=1)

In [71]:
attention_test = Attention(512, 512)
dummy_encoder_hidden = torch.randn(1, 512)
dummy_decoder_hidden = torch.randn(1, 512)
dummy_encoder_outputs = torch.randn(10, 1, 1024)
attention_test(dummy_decoder_hidden, dummy_encoder_outputs).shape

torch.Size([1, 10])

### Decoder

Next up is the decoder. 

The decoder contains the attention layer, `attention`, which takes the previous hidden state, $s_{t-1}$, all of the encoder hidden states, $H$, and returns the attention vector, $a_t$.

We then use this attention vector to create a weighted source vector, $w_t$, denoted by `weighted`, which is a weighted sum of the encoder hidden states, $H$, using $a_t$ as the weights.

$$w_t = a_t H$$
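
In code, this product is computed for the whole batch at once with `torch.bmm`. The sketch below uses random toy tensors (not model outputs) to show that the batched matrix product gives the same result as an explicit weighted sum over the source positions.

```python
import torch

torch.manual_seed(0)
batch_size, src_len, enc_hid_dim = 2, 4, 3   # arbitrary toy sizes

a = torch.softmax(torch.randn(batch_size, src_len), dim=1)   # [batch size, src len]
H = torch.randn(batch_size, src_len, enc_hid_dim * 2)        # [batch size, src len, enc hid dim * 2]

# Batched matrix product: [batch, 1, src len] x [batch, src len, enc hid dim * 2]
weighted = torch.bmm(a.unsqueeze(1), H).squeeze(1)           # [batch size, enc hid dim * 2]

# The same thing as an explicit weighted sum over the source positions.
explicit = (a.unsqueeze(2) * H).sum(dim=1)

print(weighted.shape)
print(torch.allclose(weighted, explicit, atol=1e-6))
```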

The embedded input word, $d(y_t)$, the weighted source vector, $w_t$, and the previous decoder hidden state, $s_{t-1}$, are then all passed into the decoder RNN, with $d(y_t)$ and $w_t$ being concatenated together.

$$s_t = \text{DecoderGRU}(d(y_t), w_t, s_{t-1})$$

We then pass $d(y_t)$, $w_t$ and $s_t$ through the linear layer, $f$, to make a prediction of the next word in the target sentence, $\hat{y}_{t+1}$. This is done by concatenating them all together.

$$\hat{y}_{t+1} = f(d(y_t), w_t, s_t)$$

The image below shows decoding the first word in an example translation.

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/seq2seq10.png?raw=1)

The green/teal blocks show the forward/backward encoder RNNs which output $H$, the red block shows the context vector, $z = h_T = \tanh(g(h^\rightarrow_T,h^\leftarrow_T)) = \tanh(g(z^\rightarrow, z^\leftarrow)) = s_0$, the blue block shows the decoder RNN which outputs $s_t$, the purple block shows the linear layer, $f$, which outputs $\hat{y}_{t+1}$ and the orange block shows the calculation of the weighted sum over $H$ by $a_t$ and outputs $w_t$. Not shown is the calculation of $a_t$.

In [95]:
# Define a class named Decoder which inherits from nn.Module
class Decoder(nn.Module):
    # Define the constructor method which takes in output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, and attention as arguments
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        # Call the constructor of the parent class
        super().__init__()

        # Set the output_dim and attention attributes of the class
        self.output_dim = output_dim
        self.attention = attention
        
        # Create an embedding layer with output_dim and emb_dim as input and output dimensions respectively
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        # Create a GRU layer with (enc_hid_dim * 2) + emb_dim as input dimension and dec_hid_dim as output dimension
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        
        # Create a linear layer with (enc_hid_dim * 2) + dec_hid_dim + emb_dim as input dimension and output_dim as output dimension
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        
        # Create a dropout layer with dropout as the probability of an element to be zeroed
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, encoder_outputs):
             
        #input = [batch size]
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        
        input = input.unsqueeze(0)
        
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]
        
        a = self.attention(hidden, encoder_outputs)
                
        #a = [batch size, src len]
        
        a = a.unsqueeze(1)
        
        #a = [batch size, 1, src len]
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        
        weighted = torch.bmm(a, encoder_outputs)
        
        #weighted = [batch size, 1, enc hid dim * 2]
        
        weighted = weighted.permute(1, 0, 2)
        
        #weighted = [1, batch size, enc hid dim * 2]
        
        rnn_input = torch.cat((embedded, weighted), dim = 2)
        
        #rnn_input = [1, batch size, (enc hid dim * 2) + emb dim]
            
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        
        #output = [seq len, batch size, dec hid dim * n directions]
        #hidden = [n layers * n directions, batch size, dec hid dim]
        
        #seq len, n layers and n directions will always be 1 in this decoder, therefore:
        #output = [1, batch size, dec hid dim]
        #hidden = [1, batch size, dec hid dim]
        #this also means that output == hidden
        assert (output == hidden).all()
        
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim = 1))
        
        #prediction = [batch size, output dim]
        
        return prediction, hidden.squeeze(0)

The `forward` method of the `Decoder` class above takes three arguments: `input`, `hidden` and `encoder_outputs`.

`input` is the batch of input tokens for the current time-step, of size **[batch size]**.

`hidden` is the previous hidden state of the decoder, of size **[batch size, dec hid dim]**. It is used to predict the output at the current time-step.

`encoder_outputs` is the output of the encoder, of size **[src len, batch size, enc hid dim * 2]**: the stacked forward and backward hidden states for every token in the source sentence.

The `forward` method first embeds `input`. It then passes `hidden` and `encoder_outputs` to the `attention` module to calculate the attention weights, `a`, over the source tokens.

`a` is a 2D tensor of size **[batch size, src len]**, which `unsqueeze(1)` reshapes to **[batch size, 1, src len]**.

`encoder_outputs` is permuted from **[src len, batch size, enc hid dim * 2]** to **[batch size, src len, enc hid dim * 2]**, and `torch.bmm` then multiplies `a` by `encoder_outputs` to give the weighted sum, `weighted`.

`weighted` is concatenated with `embedded` to form the decoder RNN input, `rnn_input`, which is passed to `rnn` together with the previous hidden state, `hidden`, to obtain the output, `output`, and the new hidden state, `hidden`, for the current time-step.

Finally, `output`, `weighted` and `embedded` are concatenated with `torch.cat` and passed to `fc_out` to calculate the prediction.

The returned `prediction` has size **[batch size, output dim]**; it is the output of the final linear layer, `fc_out`.

In [82]:
output_dim = 10
emb_dim = 8
enc_hid_dim = 16
dec_hid_dim = 32
dropout = 0.1
attention = Attention(enc_hid_dim, dec_hid_dim)

decoder = Decoder(output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention)

input = torch.LongTensor([2])
hidden = torch.zeros(1, dec_hid_dim)
encoder_outputs = torch.randn(5, 1, enc_hid_dim * 2)

output, hidden = decoder(input, hidden, encoder_outputs)
print(output.size(), hidden.size())  # torch.Size([1, 10]) torch.Size([1, 32])

torch.Size([1, 10]) torch.Size([1, 32])


### Seq2Seq

This is the first model where the encoder RNN and decoder RNN don't need to have the same hidden dimensions; however, the encoder has to be bidirectional. This requirement can be removed by changing all occurrences of `enc_hid_dim * 2` to `enc_hid_dim * 2 if encoder_is_bidirectional else enc_hid_dim`, and by feeding only `hidden[-1,:,:]` to the encoder's linear layer when the encoder is not bidirectional. 
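
As a small sketch of that conditional, the hypothetical helper `encoder_output_dim` below computes the size of the encoder's output features, and the loop checks it against what `nn.GRU` actually returns for both settings. The sizes are arbitrary and the helper is not part of the notebook's model code.

```python
import torch
import torch.nn as nn

def encoder_output_dim(enc_hid_dim, encoder_is_bidirectional):
    """Hypothetical helper: feature size of each encoder hidden state in H."""
    return enc_hid_dim * 2 if encoder_is_bidirectional else enc_hid_dim

emb_dim, enc_hid_dim = 8, 16              # arbitrary toy sizes
embedded = torch.randn(5, 3, emb_dim)     # [src len, batch size, emb dim]

observed = {}
for bidirectional in (True, False):
    rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional=bidirectional)
    outputs, _ = rnn(embedded)
    observed[bidirectional] = outputs.shape[-1]
    print(bidirectional, outputs.shape[-1], encoder_output_dim(enc_hid_dim, bidirectional))
```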

This seq2seq encapsulator is similar to the last two. The only difference is that the `encoder` returns both the final hidden state (which is the final hidden state from both the forward and backward encoder RNNs passed through a linear layer) to be used as the initial hidden state for the decoder, as well as every hidden state (which are the forward and backward hidden states stacked on top of each other). We also need to ensure that `hidden` and `encoder_outputs` are passed to the decoder. 

Briefly going over all of the steps:
- the `outputs` tensor is created to hold all predictions, $\hat{Y}$
- the source sequence, $X$, is fed into the encoder to receive $z$ and $H$
- the initial decoder hidden state is set to be the `context` vector, $s_0 = z = h_T$
- we use a batch of `<sos>` tokens as the first `input`, $y_1$
- we then decode within a loop:
  - inserting the input token $y_t$, previous hidden state, $s_{t-1}$, and all encoder outputs, $H$, into the decoder
  - receiving a prediction, $\hat{y}_{t+1}$, and a new hidden state, $s_t$
  - we then decide if we are going to teacher force or not, setting the next input as appropriate

In [98]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use teacher forcing 75% of the time
        
        batch_size = src.shape[1]
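        if trg is None:
            #no target given (inference): decode from a dummy target of length 25
            #filled with index 2, the <sos> index in this vocabulary
            #this only works with teacher forcing turned off
            assert teacher_forcing_ratio == 0, 'teacher forcing requires a target sequence'
            trg = torch.full((25, batch_size), 2, dtype=torch.long, device=src.device)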
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        #encoder_outputs is all hidden states of the input sequence, back and forwards
        #hidden is the final forward and backward hidden states, passed through a linear layer
        encoder_outputs, hidden = self.encoder(src)
                
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        for t in range(1, trg_len):
            
            #insert input token embedding, previous hidden state and all encoder hidden states
            #receive output tensor (predictions) and new hidden state
            output, hidden = self.decoder(input, hidden, encoder_outputs)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1

        return outputs
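The teacher-forcing decision in `forward` is a single coin flip per decoding step, shared by the whole batch. The sketch below isolates that coin flip to show how `teacher_forcing_ratio` translates into the fraction of steps that use the ground-truth token. `teacher_forcing_fraction` is a hypothetical helper, and it uses its own seeded generator purely to make the run repeatable.

```python
import random

def teacher_forcing_fraction(teacher_forcing_ratio, n_steps, seed=0):
    """Fraction of decoding steps on which the ground-truth token would be used."""
    rng = random.Random(seed)
    # Same test as in Seq2Seq.forward: random() < teacher_forcing_ratio
    forced = sum(rng.random() < teacher_forcing_ratio for _ in range(n_steps))
    return forced / n_steps

print(teacher_forcing_fraction(0.75, 10_000))  # close to 0.75
print(teacher_forcing_fraction(0.0, 10_000))   # 0.0, never teacher force
print(teacher_forcing_fraction(1.0, 10_000))   # 1.0, always teacher force
```

Because `random.random()` returns values in `[0, 1)`, a ratio of 0 disables teacher forcing entirely, which is what the evaluation loop relies on.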

## Training the Seq2Seq Model

The rest of this tutorial is very similar to the previous one.

We initialise our parameters, encoder, decoder and seq2seq model (placing it on the GPU if we have one). 

In [99]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)

model = Seq2Seq(enc, dec, device).to(device)

We use a simplified version of the weight initialization scheme used in the paper. Here, we will initialize all biases to zero and all weights from $\mathcal{N}(0, 0.01)$.

The code below defines `init_weights`, a function that initializes the model's weights, and applies it through the model's `apply` method. `init_weights` fills every weight tensor with values drawn from a normal distribution with mean 0 and standard deviation 0.01, and sets every bias to 0. `model.apply(init_weights)` calls `init_weights` on the model and on each of its submodules, so every weight and bias in the model ends up initialized this way.

In [16]:
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7853, 256)
    (rnn): GRU(256, 512, bidirectional=True)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=1536, out_features=512, bias=True)
      (v): Linear(in_features=512, out_features=1, bias=False)
    )
    (embedding): Embedding(5893, 256)
    (rnn): GRU(1280, 512)
    (fc_out): Linear(in_features=1792, out_features=5893, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)
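To see what `apply` does with such a function, the sketch below runs the same initialization on a small stand-in module and records which modules are visited. The toy `nn.Sequential` and the `init_and_record` wrapper are assumptions made for this illustration only.

```python
import torch
import torch.nn as nn

def init_weights(m):
    # Same rule as in the notebook: normal weights, zero biases
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)

visited = []

def init_and_record(m):
    # Record the module type, then initialize its parameters
    visited.append(type(m).__name__)
    init_weights(m)

# Toy module standing in for the full Seq2Seq model
toy = nn.Sequential(nn.Linear(64, 128), nn.GRU(128, 64))
toy.apply(init_and_record)

print(visited)  # ['Linear', 'GRU', 'Sequential']
for name, param in toy.named_parameters():
    print(name, tuple(param.shape), f'std={param.std().item():.4f}')
```

`apply` visits the children first and the parent last, and `named_parameters` recurses into children, so parameters of nested modules are re-initialized more than once. This is harmless here because every pass draws from the same distribution.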

Calculate the number of parameters. We get an increase of almost 50% in the number of parameters compared with the last model. 

In [17]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 20,518,405 trainable parameters
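The total can be cross-checked by hand from the layer sizes in the model printout above. The helper functions below are hypothetical and written only for this check. The GRU formula assumes PyTorch's layout of three gates, each with input weights, hidden weights, and two bias vectors.

```python
def embedding_params(vocab_size, emb_dim):
    return vocab_size * emb_dim

def linear_params(in_features, out_features, bias=True):
    return in_features * out_features + (out_features if bias else 0)

def gru_params(input_size, hidden_size, bidirectional=False):
    # 3 gates, each with input weights, hidden weights and two bias vectors
    per_direction = 3 * (input_size * hidden_size
                         + hidden_size * hidden_size
                         + 2 * hidden_size)
    return per_direction * (2 if bidirectional else 1)

total = (
    embedding_params(7853, 256)                   # encoder embedding
    + gru_params(256, 512, bidirectional=True)    # encoder rnn
    + linear_params(1024, 512)                    # encoder fc
    + linear_params(1536, 512)                    # attention attn
    + linear_params(512, 1, bias=False)           # attention v
    + embedding_params(5893, 256)                 # decoder embedding
    + gru_params(1280, 512)                       # decoder rnn
    + linear_params(1792, 5893)                   # decoder fc_out
)

print(f'{total:,}')  # 20,518,405
```

More than half of the parameters sit in `fc_out`, whose size scales with the target vocabulary.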


We create an optimizer.

In [18]:
optimizer = optim.Adam(model.parameters())

We initialize the loss function.

In [19]:
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)
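The effect of `ignore_index` can be checked on a tiny example: padded positions contribute nothing to the loss, and the mean is taken over the remaining positions only. The pad index of 1 and the toy sizes are assumptions made for this sketch.

```python
import torch
import torch.nn as nn

PAD_IDX = 1  # assumed pad index for this sketch

torch.manual_seed(0)
logits = torch.randn(4, 6)                        # 4 positions, 6-word vocabulary
targets = torch.tensor([3, 5, PAD_IDX, PAD_IDX])  # last two positions are padding

# Loss that skips padded positions
masked_criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
loss_masked = masked_criterion(logits, targets)

# Plain loss computed on the two real positions only
plain_criterion = nn.CrossEntropyLoss()
loss_real_only = plain_criterion(logits[:2], targets[:2])

print(torch.allclose(loss_masked, loss_real_only))  # True
```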

We then create the training loop...

In [20]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.src
        trg = batch.trg
        
        optimizer.zero_grad()
        
        output = model(src, trg)
        
        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]
        
        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)
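The slicing and flattening before the loss can be tried on dummy tensors. Position 0 of the model output is never written by the decoding loop (it starts at `t = 1`), and position 0 of the target is the `<sos>` token, so both are dropped. The sizes below are arbitrary values chosen for this sketch.

```python
import torch

trg_len, batch_size, output_dim = 5, 3, 7

# Mimic the model output: row 0 stays zero, rows 1.. hold predictions
outputs = torch.zeros(trg_len, batch_size, output_dim)
outputs[1:] = torch.randn(trg_len - 1, batch_size, output_dim)

# Dummy target indices
trg = torch.randint(0, output_dim, (trg_len, batch_size))

# Drop position 0 and flatten time and batch into one dimension
flat_output = outputs[1:].view(-1, output_dim)
flat_trg = trg[1:].view(-1)

print(flat_output.shape, flat_trg.shape)  # torch.Size([12, 7]) torch.Size([12])
print(bool((outputs[0] == 0).all()))      # True
```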

...and the evaluation loop, remembering to set the model to `eval` mode and turn off teacher forcing.

In [21]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.src
            trg = batch.trg

            output = model(src, trg, 0) #turn off teacher forcing

            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

Finally, define a timing function.

In [22]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Then, we train our model, saving the parameters that give us the best validation loss.

This is the training loop. It runs for `N_EPOCHS` epochs, and in each epoch it:

- calls `train`, which runs the forward and backward passes over the training set using `train_iterator` and updates the model's parameters with `optimizer`; `criterion` is the loss function
- calls `evaluate`, which runs the forward pass over the validation set using `valid_iterator` and returns the validation loss computed by `criterion`
- saves the model if the validation loss is the lowest seen so far
- prints the epoch's statistics: the training and validation losses and their exponentials (the perplexities), which indicate how well the model is training

`CLIP` sets the threshold for gradient clipping, which guards against exploding gradients. `epoch_time` computes how long the epoch took. `best_valid_loss` holds the lowest validation loss so far; whenever a new validation loss falls below it, the model's parameters are saved.
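The checkpointing rule can be traced without any model. The sketch below replays it on the validation losses from the first nine epochs of the training log that follows (the tenth is cut off in the log, so it is left out). `epochs_that_save` is a hypothetical helper, and the second list is made-up data that shows a non-improving epoch being skipped.

```python
def epochs_that_save(valid_losses):
    """Return the 1-based epochs at which a checkpoint would be written."""
    best_valid_loss = float('inf')
    saved_epochs = []
    for epoch, valid_loss in enumerate(valid_losses, start=1):
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            # stands in for torch.save(model.state_dict(), 'tut3-model.pt')
            saved_epochs.append(epoch)
    return saved_epochs

# Validation losses of epochs 1-9 from the training log
valid_losses = [5.042, 5.035, 4.822, 4.774, 4.589, 4.534, 4.289, 4.075, 3.793]
print(epochs_that_save(valid_losses))     # [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Made-up losses: epoch 2 is worse than epoch 1, so nothing is saved there
print(epochs_that_save([3.0, 3.2, 2.9]))  # [1, 3]
```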

In [23]:
N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 3m 51s
	Train Loss: 5.705 | Train PPL: 300.435
	 Val. Loss: 5.042 |  Val. PPL: 154.779
Epoch: 02 | Time: 3m 53s
	Train Loss: 4.863 | Train PPL: 129.463
	 Val. Loss: 5.035 |  Val. PPL: 153.704
Epoch: 03 | Time: 3m 57s
	Train Loss: 4.508 | Train PPL:  90.745
	 Val. Loss: 4.822 |  Val. PPL: 124.221
Epoch: 04 | Time: 3m 54s
	Train Loss: 4.303 | Train PPL:  73.936
	 Val. Loss: 4.774 |  Val. PPL: 118.367
Epoch: 05 | Time: 3m 51s
	Train Loss: 4.107 | Train PPL:  60.735
	 Val. Loss: 4.589 |  Val. PPL:  98.423
Epoch: 06 | Time: 3m 45s
	Train Loss: 3.945 | Train PPL:  51.701
	 Val. Loss: 4.534 |  Val. PPL:  93.170
Epoch: 07 | Time: 1m 42s
	Train Loss: 3.748 | Train PPL:  42.443
	 Val. Loss: 4.289 |  Val. PPL:  72.868
Epoch: 08 | Time: 1m 28s
	Train Loss: 3.479 | Train PPL:  32.420
	 Val. Loss: 4.075 |  Val. PPL:  58.839
Epoch: 09 | Time: 1m 20s
	Train Loss: 3.275 | Train PPL:  26.431
	 Val. Loss: 3.793 |  Val. PPL:  44.378
Epoch: 10 | Time: 1m 14s
	Train Loss: 3.032 | Train PPL

Finally, we test the model on the test set using these "best" parameters.

In [25]:
model.load_state_dict(torch.load('tut3-model.pt'))

test_loss = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

| Test Loss: 3.667 | Test PPL:  39.146 |
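The perplexity reported next to each loss is the exponential of the mean cross-entropy. The sketch below recomputes it from the rounded test loss. `perplexity` is a hypothetical helper, and the comparison with a uniform guess over the 5,893-word target vocabulary is added only for scale.

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)

# Recomputed from the rounded loss, so it only approximately matches 39.146
print(f'{perplexity(3.667):.3f}')

# A model that guesses uniformly over V words has loss log(V) and perplexity V
uniform_loss = math.log(5893)
print(f'{perplexity(uniform_loss):.1f}')  # 5893.0
```

The small gap between the recomputed value and the printed 39.146 is most likely due to the loss being rounded to three decimals before display.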


We've improved on the previous model, but this came at the cost of doubling the training time.

In the next notebook, we'll be using the same architecture but using a few tricks that are applicable to all RNN architectures - packed padded sequences and masking. We'll also implement code which will allow us to look at what words in the input the RNN is paying attention to when decoding the output.

In [26]:
next(iter(train_iterator)).src
next(iter(train_iterator)).trg

tensor([[ 2,  2,  2,  ...,  2,  2,  2],
        [24,  4,  4,  ...,  4, 16,  4],
        [ 9, 64, 24,  ..., 55, 50, 55],
        ...,
        [ 1,  1,  1,  ...,  1,  1,  1],
        [ 1,  1,  1,  ...,  1,  1,  1],
        [ 1,  1,  1,  ...,  1,  1,  1]], device='cuda:0')

In [86]:
def inference(sentence):
    # Tokenize the input sentence.
    tokenized_sentence = tokenize_de(sentence)
    # Convert the tokens to vocabulary indices.
    numericalized_sentence = [SRC.vocab.stoi[t] for t in tokenized_sentence]
    # Convert the indices to a tensor of shape [src len, 1].
    sentence_tensor = torch.LongTensor(numericalized_sentence).reshape(-1, 1).to(device)
    # Predict scores (logits) over the target vocabulary for each output position.
    pred_probs = model(sentence_tensor, None, 0).squeeze(1)
    # Pick the highest-scoring index at each position.
    translation_tensor = torch.argmax(pred_probs, 1)
    # Map the indices back to target tokens, skipping the unused first position.
    translation = [TRG.vocab.itos[t] for t in translation_tensor][1:]
    # Return the translated tokens.
    return translation


In [135]:
def inference(sentence, model=model, src_field=SRC, trg_field=TRG, max_len=50):
    
    model.eval()

    # Tokenize the sentence
    tokenized_sentence = tokenize_de(sentence)

    # Convert the input sentence to indices
    numericalized = [src_field.vocab.stoi[t] for t in tokenized_sentence]
    numericalized = torch.LongTensor(numericalized).unsqueeze(1).to(device)

    # Initialize the output sentence
    trg_indexes = [trg_field.vocab.stoi[trg_field.init_token]]

    # Create the initial hidden state
    hidden = torch.zeros(1, 1, dec_hid_dim).to(device)

    # Predict the output sentence with the model
    # NOTE: model.encoder returns a tuple (encoder_outputs, hidden); passing that tuple
    # straight to the decoder is the likely cause of the AttributeError raised when
    # this function is called further down
    for i in range(max_len):
        output, hidden = model.decoder(numericalized, hidden, model.encoder(numericalized))

        # Predict the output word
        pred_token = output.argmax(2)[-1].item()

        # Stop condition: the <eos> token was predicted
        if pred_token == trg_field.vocab.stoi[trg_field.eos_token]:
            break

        # Append the predicted token to the output sentence
        trg_indexes.append(pred_token)

        # Build the next input token from the last predicted token
        numericalized = torch.LongTensor([pred_token]).unsqueeze(1).to(device)

    # Convert the output indices back to text
    trg_tokens = [trg_field.vocab.itos[i] for i in trg_indexes]

    return trg_tokens[1:]


In [None]:
def inference(sentence):
    # Print the input sentence.
    print(sentence)

In [100]:
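model = Seq2Seq(enc, dec, device)
# load_state_dict copies the saved weights into the model; assigning to
# model.state_dict would only overwrite the method without loading anything
model.load_state_dict(torch.load('tut3-model.pt'))
model.to(device)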

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7853, 256)
    (rnn): GRU(256, 512, bidirectional=True)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=1536, out_features=512, bias=True)
      (v): Linear(in_features=512, out_features=1, bias=False)
    )
    (embedding): Embedding(5893, 256)
    (rnn): GRU(1280, 512)
    (fc_out): Linear(in_features=1792, out_features=5893, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

In [87]:
example_data = ' '.join(vars(train_data.examples[12])['src'])
truth = ' '.join(vars(train_data.examples[12])['trg'])

print(example_data)
print(truth)
print(' '.join(example_data.split()[::-1]))

ein schwarzer hund und ein gefleckter hund kämpfen .
a black dog and a spotted dog are fighting
. kämpfen hund gefleckter ein und hund schwarzer ein


In [136]:
inference(example_data)

AttributeError: 'tuple' object has no attribute 'shape'
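The `AttributeError` above is consistent with the tuple returned by `model.encoder` being passed to the decoder unchanged. Below is a minimal sketch of a greedy decoding loop that unpacks that tuple once, starts from the encoder's hidden state, and feeds one token per step. `ToyEncoder` and `ToyDecoder` are hypothetical stand-ins that only mimic the call signatures used in `Seq2Seq.forward` (the toy decoder mean-pools the encoder outputs instead of using attention), and `greedy_decode` is an illustrative name, not part of the notebook.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Hypothetical stand-in with the same return signature as the notebook's Encoder."""
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional=True)
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)

    def forward(self, src):
        # src = [src len, batch size]
        outputs, hidden = self.rnn(self.embedding(src))
        # hidden[-2] is the final forward state, hidden[-1] the final backward state
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2], hidden[-1]), dim=1)))
        return outputs, hidden  # a tuple that the caller must unpack

class ToyDecoder(nn.Module):
    """Hypothetical stand-in: mean-pools encoder outputs instead of using attention."""
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim):
        super().__init__()
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU(enc_hid_dim * 2 + emb_dim, dec_hid_dim)
        self.fc_out = nn.Linear(dec_hid_dim, output_dim)

    def forward(self, input, hidden, encoder_outputs):
        # input = [batch size], hidden = [batch size, dec hid dim]
        embedded = self.embedding(input.unsqueeze(0))       # [1, batch, emb]
        pooled = encoder_outputs.mean(dim=0, keepdim=True)  # [1, batch, enc hid * 2]
        rnn_input = torch.cat((embedded, pooled), dim=2)
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        return self.fc_out(output.squeeze(0)), hidden.squeeze(0)

def greedy_decode(encoder, decoder, src_tensor, sos_idx, eos_idx, max_len=50):
    """Greedy decoding for a single sentence; src_tensor = [src len, 1]."""
    encoder.eval()
    decoder.eval()
    result = []
    with torch.no_grad():
        # Encode once and unpack the tuple before the loop
        encoder_outputs, hidden = encoder(src_tensor)
        input = torch.tensor([sos_idx], device=src_tensor.device)
        for _ in range(max_len):
            prediction, hidden = decoder(input, hidden, encoder_outputs)
            pred_token = prediction.argmax(1).item()
            if pred_token == eos_idx:
                break
            result.append(pred_token)
            # The predicted token becomes the next decoder input
            input = torch.tensor([pred_token], device=src_tensor.device)
    return result

torch.manual_seed(0)
toy_enc = ToyEncoder(input_dim=20, emb_dim=8, enc_hid_dim=16, dec_hid_dim=32)
toy_dec = ToyDecoder(output_dim=10, emb_dim=8, enc_hid_dim=16, dec_hid_dim=32)
src_tensor = torch.randint(0, 20, (5, 1))
tokens = greedy_decode(toy_enc, toy_dec, src_tensor, sos_idx=2, eos_idx=3, max_len=7)
print(tokens)
```

With the notebook's trained model, the same loop could presumably be called as `greedy_decode(model.encoder, model.decoder, sentence_tensor, sos_idx, eos_idx)`, with the two indices looked up in `TRG.vocab.stoi`. The toy modules are untrained, so the printed indices carry no meaning.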

In [102]:
inference("Ich bin Master-Student an der Graduate School der Sogang University.")

['filmed',
 'clothed',
 'pier',
 'outside',
 'welcome',
 '10',
 'dorm',
 'have',
 'voice',
 'bubbles',
 'headband',
 'american',
 'miniskirt',
 'contained',
 'snowflakes',
 'testing',
 'designed',
 'crosses',
 'cobbled',
 'built',
 'median',
 'golf',
 'curling',
 'cars']