# Welcome to Torch Study 

## February, Weeks 2-3: Sequence to Sequence Learning with Neural Networks
Implementing the details of the paper:
- seq2seq itself (v)
- the part that connects the Encoder and the Decoder (v)
- greedy search decoder (v)
- beam search decoder
- `pack_padded_sequence`
- sorting by sequence length when feeding batches (v)
- using only the most frequent words and mapping the rest to [UNK] (v)
- uniform initialization of the LSTM weights (v)
- loss ⇒ $\frac{1}{|\mathbf{S}|} \sum_{(T,S)\in \mathbf{S}} \log P(T|S)$
- gradient clipping and halving the learning rate (△)
- BLEU computation

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchtext

from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator, Iterator

import spacy
import numpy as np

import random
import math
import time

In [2]:
torch.__version__, torchtext.__version__

('1.7.1', '0.8.1')

We'll set the random seeds for deterministic results.

In [3]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

Next, we'll create the tokenizers. A tokenizer is used to turn a string containing a sentence into a list of individual tokens that make up that string, e.g. "good morning!" becomes ["good", "morning", "!"]. We'll start talking about the sentences being a sequence of tokens from now, instead of saying they're a sequence of words. What's the difference? Well, "good" and "morning" are both words and tokens, but "!" is a token, not a word. 

spaCy has a model for each language ("de_core_news_sm" for German and "en_core_web_sm" for English), which needs to be loaded so we can access the tokenizer of each model.

**Note**: the models must first be downloaded using the following on the command line: 
```
python -m spacy download en_core_web_sm
python -m spacy download de_core_news_sm
```

We load the models as such:

In [4]:
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

Next, we create the tokenizer functions. These can be passed to torchtext and will take in the sentence as a string and return the sentence as a list of tokens.

In the paper we are implementing, they find it beneficial to reverse the order of the input which they believe "introduces many short term dependencies in the data that make the optimization problem much easier". We copy this by reversing the German sentence after it has been transformed into a list of tokens.

In [5]:
def tokenize_de(text):
    """
    Tokenizes German text from a string into a list of strings (tokens) and reverses it
    """
    return [tok.text for tok in spacy_de.tokenizer(text)][::-1]

def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings (tokens)
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]
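To make the "short term dependencies" argument concrete, here is a small sketch in plain Python. The helper `alignment_distances` and the three-word sentence pair are made up for illustration, and the sketch assumes a word-for-word, monotonic alignment, which real sentence pairs only approximate. It measures how far each source token is from its aligned target token once source and target are concatenated.

```python
def alignment_distances(src, trg, reverse_src=False):
    # Toy assumption: src[i] is aligned with trg[i].
    n = len(src)
    ordered = src[::-1] if reverse_src else src
    sequence = ordered + trg  # what the encoder reads, then what the decoder emits
    return [sequence.index(t, n) - sequence.index(s) for s, t in zip(src, trg)]

src = ["zwei", "junge", "männer"]
trg = ["two", "young", "men"]

plain = alignment_distances(src, trg)
reversed_ = alignment_distances(src, trg, reverse_src=True)

print(plain)      # [3, 3, 3]
print(reversed_)  # [1, 3, 5]
```

The average distance is the same in both cases, but after reversing, the first target words sit right next to their source counterparts, which is the effect the paper credits for easier optimization.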

torchtext's `Field`s handle how data should be processed. All of the possible arguments are detailed [here](https://github.com/pytorch/text/blob/master/torchtext/data/field.py#L61). 

We set the `tokenize` argument to the correct tokenization function for each, with German being the `SRC` (source) field and English being the `TRG` (target) field. The field also appends the "start of sequence" and "end of sequence" tokens via the `init_token` and `eos_token` arguments, and converts all words to lowercase.

In [6]:
SRC = Field(tokenize = tokenize_de, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)

TRG = Field(tokenize = tokenize_en, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)



Next, we download and load the train, validation and test data. 

The dataset we'll be using is the [Multi30k dataset](https://github.com/multi30k/dataset). This is a dataset with ~30,000 parallel English, German and French sentences, each with ~12 words per sentence. 

`exts` specifies which languages to use as the source and target (source goes first) and `fields` specifies which field to use for the source and target.

In [7]:
train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'), 
                                                    fields = (SRC, TRG))



We can double check that we've loaded the right number of examples:

In [8]:
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")

Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000


We can also print out an example, making sure the source sentence is reversed:

In [9]:
print(vars(train_data.examples[0]))

{'src': ['.', 'büsche', 'vieler', 'nähe', 'der', 'in', 'freien', 'im', 'sind', 'männer', 'weiße', 'junge', 'zwei'], 'trg': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']}


The period is at the beginning of the German (src) sentence, so it looks like the sentence has been correctly reversed.

Next, we'll build the *vocabulary* for the source and target languages. The vocabulary is used to associate each unique token with an index (an integer). The vocabularies of the source and target languages are distinct.

Using the `max_size` argument, we keep only the most frequent tokens in each vocabulary (4,000 for the source and 2,000 for the target in the cells below). All other tokens are converted into an `<unk>` (unknown) token.

It is important to note that our vocabulary should only be built from the training set and not the validation/test set. This prevents "information leakage" into our model, giving us artificially inflated validation/test scores.

## Using only the n most frequent tokens

In [50]:
SRC_MOST_FREQ = 4000
TRG_MOST_FREQ = 2000

In [51]:
SRC.build_vocab(train_data, max_size = SRC_MOST_FREQ)
TRG.build_vocab(train_data, max_size = TRG_MOST_FREQ)

In [52]:
print(f"Unique tokens in source (de) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(TRG.vocab)}")

Unique tokens in source (de) vocabulary: 4004
Unique tokens in target (en) vocabulary: 2004


In [66]:
SRC.vocab.stoi[4]

0

In [53]:
len(SRC.vocab.stoi), len(TRG.vocab.stoi) # including special tokens

(4004, 2004)
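The sizes above are `max_size` plus four special tokens. The following is a minimal plain-Python sketch of that idea, not torchtext's actual implementation: `build_vocab` and `numericalize` are hypothetical helpers, the order of the special tokens is an assumption, and tie-breaking between equally frequent tokens is not meant to match torchtext.

```python
from collections import Counter

def build_vocab(tokenized_sentences, max_size,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    # Count every token, then keep the specials plus the max_size most frequent tokens.
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    itos = list(specials) + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    # Tokens outside the vocabulary fall back to the <unk> index.
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]

sentences = [["a", "dog", "runs"], ["a", "cat", "runs"], ["a", "bird"]]
itos, stoi = build_vocab(sentences, max_size=2)

print(itos)                                      # ['<unk>', '<pad>', '<sos>', '<eos>', 'a', 'runs']
print(numericalize(["a", "cat", "runs"], stoi))  # [4, 0, 5]
```

With `max_size=2` the vocabulary has 2 + 4 = 6 entries, mirroring 4000 + 4 = 4004 and 2000 + 4 = 2004 above.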

The final step of preparing the data is to create the iterators. These can be iterated on to return a batch of data which will have a `src` attribute (the PyTorch tensors containing a batch of numericalized source sentences) and a `trg` attribute (the PyTorch tensors containing a batch of numericalized target sentences). Numericalized is just a fancy way of saying they have been converted from a sequence of readable tokens to a sequence of corresponding indexes, using the vocabulary. 

We also need to define a `torch.device`. This is used to tell torchtext whether or not to put the tensors on the GPU. We use the `torch.cuda.is_available()` function, which will return `True` if a GPU is detected on our computer. We pass this `device` to the iterator.

When we get a batch of examples using an iterator we need to make sure that all of the source sentences are padded to the same length, and the same for the target sentences. Luckily, torchtext iterators handle this for us!

We use a `BucketIterator` instead of the standard `Iterator` as it creates batches in such a way that it minimizes the amount of padding in both the source and target sentences. 

## Grouping sentences of similar length
- Pass the length of `src` as the `sort_key` of the Iterator
- Without `sort=True`, no sorting happens
- BucketIterator and Iterator behave differently (https://stackoverflow.com/questions/49367871/concept-of-bucketing-in-seq2seq-model)
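Why grouping by length matters can be checked with a rough plain-Python sketch. The helper `padding_needed` and the randomly drawn sentence lengths are assumptions for illustration; the sketch only counts how many `<pad>` tokens are needed when consecutive examples form a batch.

```python
import random

def padding_needed(lengths, batch_size):
    """Total number of <pad> tokens when consecutive items form a batch."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Every sentence is padded up to the longest one in its batch.
        total += sum(max(batch) - n for n in batch)
    return total

rng = random.Random(0)
lengths = [rng.randint(5, 30) for _ in range(1000)]  # made-up sentence lengths

unsorted_pad = padding_needed(lengths, batch_size=32)
sorted_pad = padding_needed(sorted(lengths), batch_size=32)

print(f"padding without sorting: {unsorted_pad}")
print(f"padding after sorting:   {sorted_pad}")
```

Sorting by length before batching should need far fewer padding tokens, which is what the `sort_key` is for.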

In [55]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
BATCH_SIZE = 128

### BucketIterator 

In [87]:
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE, 
    device = device,
    sort_within_batch = True)

In [88]:
for _ in train_iterator:
    print(_.src)
    break

tensor([[   2,    2,    2,  ...,    2,    2,    2],
        [   4,    4,    4,  ...,    4,    4,    4],
        [ 312,  292,  439,  ..., 2079, 3988,  793],
        ...,
        [  66,   36,  493,  ...,   68,   30,  752],
        [   5,    5,    5,  ..., 1580,   18,   73],
        [   3,    3,    3,  ...,    3,    3,    3]], device='cuda:0')


In [89]:
for idx, batch in enumerate(train_iterator):
    print(batch.src.shape, batch.trg.shape)
    if idx > 10:
        break

torch.Size([11, 128]) torch.Size([11, 128])
torch.Size([13, 128]) torch.Size([13, 128])
torch.Size([15, 128]) torch.Size([10, 128])
torch.Size([17, 128]) torch.Size([14, 128])
torch.Size([16, 128]) torch.Size([16, 128])
torch.Size([12, 128]) torch.Size([13, 128])
torch.Size([19, 128]) torch.Size([20, 128])
torch.Size([13, 128]) torch.Size([15, 128])
torch.Size([17, 128]) torch.Size([19, 128])
torch.Size([21, 128]) torch.Size([25, 128])
torch.Size([15, 128]) torch.Size([15, 128])
torch.Size([21, 128]) torch.Size([27, 128])


### Iterator

In [90]:
train_iterator = Iterator(train_data, 
                          batch_size = BATCH_SIZE, 
                          sort_key=lambda e: len(e.src),
                          sort=True,
                          sort_within_batch=False)

In [91]:
from torchtext.data import Example
import random

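### Doesn't passing a `sort_key` prevent shuffling?
No! Even with sorting, if we shuffle and then sort at every epoch, the order of the data can still change (examples with equal keys keep their shuffled order).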

In [92]:
data = [(1, '하이'), (1, '방가'), (1, '룰루'), (1, 'i will kill you'), (2, 3)]
sorted(data, key = lambda e: e[0])

[(1, '하이'), (1, '방가'), (1, '룰루'), (1, 'i will kill you'), (2, 3)]

In [93]:
random.shuffle(data)
sorted(data, key = lambda e: e[0])

[(1, '방가'), (1, 'i will kill you'), (1, '룰루'), (1, '하이'), (2, 3)]

In [97]:
for idx, batch in enumerate(train_iterator):
    print(batch.src.shape, batch.trg.shape)
#     print(batch.src)
    if idx > 10:
        break

torch.Size([7, 128]) torch.Size([13, 128])
torch.Size([8, 128]) torch.Size([13, 128])
torch.Size([8, 128]) torch.Size([15, 128])
torch.Size([8, 128]) torch.Size([14, 128])
torch.Size([8, 128]) torch.Size([14, 128])
torch.Size([8, 128]) torch.Size([16, 128])
torch.Size([9, 128]) torch.Size([16, 128])
torch.Size([9, 128]) torch.Size([14, 128])
torch.Size([9, 128]) torch.Size([15, 128])
torch.Size([9, 128]) torch.Size([16, 128])
torch.Size([9, 128]) torch.Size([15, 128])
torch.Size([9, 128]) torch.Size([13, 128])


In [23]:
train_iterator, valid_iterator, test_iterator = map(lambda x: Iterator(x,
                                       batch_size = BATCH_SIZE,
                                       sort_key = lambda e: len(e.src),
                                       sort = True, device = device),
                    [train_data, valid_data, test_data])

In [24]:
for idx, batch in enumerate(train_iterator):
    print(batch.src.shape, batch.trg.shape)
    if idx > 10:
        break

torch.Size([7, 128]) torch.Size([13, 128])
torch.Size([8, 128]) torch.Size([13, 128])
torch.Size([8, 128]) torch.Size([15, 128])
torch.Size([8, 128]) torch.Size([14, 128])
torch.Size([8, 128]) torch.Size([14, 128])
torch.Size([8, 128]) torch.Size([16, 128])
torch.Size([9, 128]) torch.Size([16, 128])
torch.Size([9, 128]) torch.Size([14, 128])
torch.Size([9, 128]) torch.Size([15, 128])
torch.Size([9, 128]) torch.Size([16, 128])
torch.Size([9, 128]) torch.Size([15, 128])
torch.Size([9, 128]) torch.Size([13, 128])


## Building the Seq2Seq Model

We'll be building our model in three parts. The encoder, the decoder and a seq2seq model that encapsulates the encoder and decoder and will provide a way to interface with each.

### Encoder

First, the encoder, a 2 layer LSTM. The paper we are implementing uses a 4-layer LSTM, but in the interest of training time we cut this down to 2-layers. The concept of multi-layer RNNs is easy to expand from 2 to 4 layers. 

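In a multi-layer RNN, the input sentence, $X$, is embedded and fed into the first (bottom) layer of the RNN, and the hidden states output by that layer, $H=\{h_1, h_2, ..., h_T\}$, are used as the input to the RNN layer above it. Thus, representing each layer with a superscript, the hidden states in the first layer are given by: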

$$h_t^1 = \text{EncoderRNN}^1(e(x_t), h_{t-1}^1)$$

The hidden states in the second layer are given by:

$$h_t^2 = \text{EncoderRNN}^2(h_t^1, h_{t-1}^2)$$

Q: So the embedding dim and the RNN hidden dim have to be the same?<br>
-> No, it doesn't matter. The RNNs in a stacked RNN don't all have to be the same size!
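One way to check this is to look at the input-to-hidden weight shapes of a two-layer `nn.LSTM`. This is a small sketch using the dimensions from later in this notebook. Note that within a single `nn.LSTM` every layer shares the same `hidden_size`, so only the input size of the first layer is free to differ.

```python
import torch.nn as nn

emb_dim, hid_dim = 256, 512
rnn = nn.LSTM(emb_dim, hid_dim, num_layers=2)

# Layer 0 reads the embeddings, layer 1 reads layer 0's hidden states.
# The factor 4 comes from the four LSTM gates stacked in one matrix.
print(rnn.weight_ih_l0.shape)  # torch.Size([2048, 256])
print(rnn.weight_ih_l1.shape)  # torch.Size([2048, 512])
```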

Using a multi-layer RNN also means we'll also need an initial hidden state as input per layer, $h_0^l$, and we will also output a context vector per layer, $z^l$.

Without going into too much detail about LSTMs (see [this](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) blog post to learn more about them), all we need to know is that they're a type of RNN which instead of just taking in a hidden state and returning a new hidden state per time-step, also take in and return a *cell state*, $c_t$, per time-step.

$$\begin{align*}
h_t &= \text{RNN}(e(x_t), h_{t-1})\\
(h_t, c_t) &= \text{LSTM}(e(x_t), h_{t-1}, c_{t-1})
\end{align*}$$

We can just think of $c_t$ as another type of hidden state. Similar to $h_0^l$, $c_0^l$ will be initialized to a tensor of all zeros. Also, our context vector will now be both the final hidden state and the final cell state, i.e. $z^l = (h_T^l, c_T^l)$.

Extending our multi-layer equations to LSTMs, we get:

$$\begin{align*}
(h_t^1, c_t^1) &= \text{EncoderLSTM}^1(e(x_t), (h_{t-1}^1, c_{t-1}^1))\\
(h_t^2, c_t^2) &= \text{EncoderLSTM}^2(h_t^1, (h_{t-1}^2, c_{t-1}^2))
\end{align*}$$

Note how only our hidden state from the first layer is passed as input to the second layer, and not the cell state.

So our encoder looks something like this: 

![](assets/seq2seq2.png)

We create this in code by making an `Encoder` module, which requires we inherit from `torch.nn.Module` and use the `super().__init__()` as some boilerplate code. The encoder takes the following arguments:
- `input_dim` is the size/dimensionality of the one-hot vectors that will be input to the encoder. This is equal to the input (source) vocabulary size.
- `emb_dim` is the dimensionality of the embedding layer. This layer converts the one-hot vectors into dense vectors with `emb_dim` dimensions. 
- `hid_dim` is the dimensionality of the hidden and cell states.
- `n_layers` is the number of layers in the RNN.
- `dropout` is the amount of dropout to use. This is a regularization parameter to prevent overfitting. Check out [this](https://www.coursera.org/lecture/deep-neural-network/understanding-dropout-YaGbR) for more details about dropout.

We aren't going to discuss the embedding layer in detail during these tutorials. All we need to know is that there is a step before the words - technically, the indexes of the words - are passed into the RNN, where the words are transformed into vectors. To read more about word embeddings, check these articles: [1](https://monkeylearn.com/blog/word-embeddings-transform-text-numbers/), [2](http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html), [3](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/), [4](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/). 

The embedding layer is created using `nn.Embedding`, the LSTM with `nn.LSTM` and a dropout layer with `nn.Dropout`. Check the PyTorch [documentation](https://pytorch.org/docs/stable/nn.html) for more about these.


One thing to note is that the `dropout` argument to the LSTM is how much dropout to apply between the layers of a multi-layer RNN, i.e. between the hidden states output from layer $l$ and those same hidden states being used as the input of layer $l+1$.

In the `forward` method, we pass in the source sentence, $X$, which is converted into dense vectors using the `embedding` layer, and then dropout is applied. These embeddings are then passed into the RNN. As we pass a whole sequence to the RNN, it automatically performs the recurrent calculation of the hidden states over the whole sequence for us. Notice that we do not pass an initial hidden or cell state to the RNN: as noted in the [documentation](https://pytorch.org/docs/stable/nn.html#torch.nn.LSTM), if no hidden/cell state is passed, the RNN automatically creates an initial hidden/cell state as a tensor of all zeros.

The RNN returns: `outputs` (the top-layer hidden state for each time-step), `hidden` (the final hidden state for each layer, $h_T$, stacked on top of each other) and `cell` (the final cell state for each layer, $c_T$, stacked on top of each other).

As we only need the final hidden and cell states (to make our context vector), `forward` only returns `hidden` and `cell`.

The sizes of each of the tensors are left as comments in the code. In this implementation `n_directions` will always be 1; note, however, that bidirectional RNNs (covered in tutorial 3) will have `n_directions` as 2.
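The relationship between `outputs`, `hidden` and `cell` can be checked on random data. This is only a sketch with arbitrary small dimensions, not the model used in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
src_len, batch_size, emb_dim, hid_dim, n_layers = 7, 3, 8, 16, 2

rnn = nn.LSTM(emb_dim, hid_dim, n_layers)
embedded = torch.randn(src_len, batch_size, emb_dim)

# No (h_0, c_0) is passed, so both default to zeros.
outputs, (hidden, cell) = rnn(embedded)

print(outputs.shape)  # torch.Size([7, 3, 16])
print(hidden.shape)   # torch.Size([2, 3, 16])
print(cell.shape)     # torch.Size([2, 3, 16])

# For a unidirectional LSTM, the last time-step of `outputs`
# is the final hidden state of the top layer.
print(torch.allclose(outputs[-1], hidden[-1]))  # True
```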

# `nn.LSTM`
### **Inputs:**  input, (h_0, c_0)

**input** of shape (seq_len, batch, input_size): tensor containing the features of the input sequence. The input can also be a packed variable length sequence. See torch.nn.utils.rnn.pack_padded_sequence() or torch.nn.utils.rnn.pack_sequence() for details.

**h_0** of shape (num_layers * num_directions, batch, hidden_size): tensor containing the initial hidden state for each element in the batch. If the LSTM is bidirectional, num_directions should be 2, else it should be 1.

**c_0** of shape (num_layers * num_directions, batch, hidden_size): tensor containing the initial cell state for each element in the batch.

If (h_0, c_0) is not provided, both h_0 and c_0 default to zero

### **Outputs:** output, (h_n, c_n)
**output** of shape (seq_len, batch, num_directions * hidden_size): tensor containing the output features (h_t) from the last layer of the LSTM, for each t. If a torch.nn.utils.rnn.PackedSequence has been given as the input, the output will also be a packed sequence.

For the unpacked case, the directions can be separated using output.view(seq_len, batch, num_directions, hidden_size), with forward and backward being direction 0 and 1 respectively. Similarly, the directions can be separated in the packed case.

**h_n** of shape (num_layers * num_directions, batch, hidden_size): tensor containing the hidden state for t = seq_len.

Like output, the layers can be separated using h_n.view(num_layers, num_directions, batch, hidden_size) and similarly for c_n.

**c_n** of shape (num_layers * num_directions, batch, hidden_size): tensor containing the cell state for t = seq_len.

In [103]:
torch.Tensor([[1,2,1,2], [3,4,3,4]]).view(2, 2, -1)

tensor([[[1., 2.],
         [1., 2.]],

        [[3., 4.],
         [3., 4.]]])

In [104]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        
        outputs, (hidden, cell) = self.rnn(embedded)
        
        #outputs = [src len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #outputs are always from the top hidden layer
        
        return hidden, cell

### Decoder

Next, we'll build our decoder, which will also be a 2-layer (4 in the paper) LSTM.

![](assets/seq2seq3.png)

The `Decoder` class does a single step of decoding, i.e. it outputs a single token per time-step. The first layer receives the hidden and cell states from the previous time-step, $(s_{t-1}^1, c_{t-1}^1)$, and feeds them through the LSTM together with the current embedded token, $y_t$. The subsequent layers use the hidden state from the layer below, $s_t^{l-1}$, and the previous hidden and cell states from their own layer, $(s_{t-1}^l, c_{t-1}^l)$. This gives equations very similar to those in the encoder.

$$\begin{align*}
(s_t^1, c_t^1) &= \text{DecoderLSTM}^1(d(y_t), (s_{t-1}^1, c_{t-1}^1))\\
(s_t^2, c_t^2) &= \text{DecoderLSTM}^2(s_t^1, (s_{t-1}^2, c_{t-1}^2))
\end{align*}$$

Remember that the initial hidden and cell states of our decoder are our context vectors, which are the final hidden and cell states of our encoder from the same layer, i.e. $(s_0^l,c_0^l)=z^l=(h_T^l,c_T^l)$.

We then pass the hidden state from the top layer of the RNN, $s_t^L$, through a linear layer, $f$, to make a prediction of what the next token in the target (output) sequence should be, $\hat{y}_{t+1}$. 

$$\hat{y}_{t+1} = f(s_t^L)$$

The arguments and initialization are similar to the `Encoder` class, except that we now have an `output_dim`, which is the size of the target vocabulary.
There is also an additional `Linear` layer, used to make predictions from the top-layer hidden state.

Within the `forward` method, we accept a batch of input tokens, the previous hidden states and the previous cell states. As we are only decoding one token at a time, the input tokens always have a sequence length of 1, so we `unsqueeze` the input tokens to add a sequence length dimension. Then, similar to the encoder, we pass them through an embedding layer and apply dropout. This batch of embedded tokens is passed into the RNN together with the previous hidden and cell states. This produces an `output` (hidden state from the top layer of the RNN), a new `hidden` state (one for each layer, stacked on top of each other) and a new `cell` state (also one per layer, stacked on top of each other). We then pass the `output` (after getting rid of the sequence length dimension) through the linear layer to receive our `prediction`. Finally, we return the `prediction`, the new `hidden` state and the new `cell` state.

**Note**: as we always have a sequence length of 1, we could use `nn.LSTMCell`, instead of `nn.LSTM`, as it is designed to handle a batch of inputs that aren't necessarily in a sequence. `nn.LSTMCell` is just a single cell and `nn.LSTM` is a wrapper around potentially multiple cells. Using the `nn.LSTMCell` in this case would mean we don't have to `unsqueeze` to add a fake sequence length dimension, but we would need one `nn.LSTMCell` per layer in the decoder and to ensure each `nn.LSTMCell` receives the correct initial hidden state from the encoder. All of this makes the code less concise - hence the decision to stick with the regular `nn.LSTM`.

In [105]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.output_dim = output_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
        self.fc_out = nn.Linear(hid_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, cell):
        
        #input = [batch size]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #n directions in the decoder will always be 1, therefore:
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        input = input.unsqueeze(0)
        
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]
                
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        
        #output = [seq len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #seq len and n directions will always be 1 in the decoder, therefore:
        #output = [1, batch size, hid dim]
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        prediction = self.fc_out(output.squeeze(0))
        
        #prediction = [batch size, output dim]
        
        return prediction, hidden, cell

### Seq2Seq

For the final part of the implementation, we'll implement the seq2seq model. This will handle: 
- receiving the input/source sentence
- using the encoder to produce the context vectors 
- using the decoder to produce the predicted output/target sentence

Our full model will look like this:

![](assets/seq2seq4.png)

The `Seq2Seq` model takes in an `Encoder`, `Decoder`, and a `device` (used to place tensors on the GPU, if it exists).

For this implementation, we have to ensure that the number of layers and the hidden (and cell) dimensions are equal in the `Encoder` and `Decoder`. This is not always the case: a sequence-to-sequence model does not necessarily need the same number of layers or the same hidden dimension sizes. However, if we did something like having a different number of layers, then we would need to make decisions about how this is handled. For example, if our encoder has 2 layers and our decoder only has 1, how is this handled? Do we average the two context vectors output by the encoder? Do we pass both through a linear layer? Do we only use the context vector from the highest layer? Etc.

Our `forward` method takes the source sentence, target sentence and a teacher-forcing ratio. The teacher forcing ratio is used when training our model. When decoding, at each time-step we predict the next token in the target sequence from the previously decoded tokens, $\hat{y}_{t+1}=f(s_t^L)$. With probability equal to the teacher forcing ratio (`teacher_forcing_ratio`) we use the actual ground-truth next token as the input to the decoder at the next time-step, and with probability `1 - teacher_forcing_ratio` we use the token that the model predicted.

The first thing we do in the `forward` method is to create an `outputs` tensor that will store all of our predictions, $\hat{Y}$.
We then feed the source sentence into the encoder and receive the final hidden and cell states.
The first input to the decoder is the start of sequence (`<sos>`) token. As our `trg` tensor already has the `<sos>` token appended (all the way back when we defined the `init_token` in our `TRG` field) we get our $y_1$ by slicing into it. We know how long our target sentences should be (`trg_len`), so we loop over that many time-steps. The last token input into the decoder is the one just before the `<eos>` token; the `<eos>` token is never input into the decoder.

During each iteration of the loop, we:
- pass the input, previous hidden and previous cell states ($y_t, s_{t-1}, c_{t-1}$) into the decoder
- receive a prediction, next hidden state and next cell state ($\hat{y}_{t+1}, s_{t}, c_{t}$) from the decoder
- place our prediction, $\hat{y}_{t+1}$/`output`, in our tensor of predictions, $\hat{Y}$/`outputs`
- decide if we are going to "teacher force" or not
  - if we do, the next `input` is the ground-truth next token in the sequence, $y_{t+1}$/`trg[t]`
  - if we don't, the next `input` is the predicted next token in the sequence, $\hat{y}_{t+1}$/`top1`, which we get by doing an `argmax` over the output tensor
  
Once we've made all of our predictions, we return our tensor full of predictions, $\hat{Y}$/`outputs`.

**Note**: our decoder loop starts at 1, not 0. This means the 0th element of our `outputs` tensor remains all zeros. So our `trg` and `outputs` look something like:

$$\begin{align*}
\text{trg} = [<sos>, &y_1, y_2, y_3, <eos>]\\
\text{outputs} = [0, &\hat{y}_1, \hat{y}_2, \hat{y}_3, <eos>]
\end{align*}$$

Later on when we calculate the loss, we cut off the first element of each tensor to get:

$$\begin{align*}
\text{trg} = [&y_1, y_2, y_3, <eos>]\\
\text{outputs} = [&\hat{y}_1, \hat{y}_2, \hat{y}_3, <eos>]
\end{align*}$$
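The input-selection logic of the loop described above can be sketched in plain Python, without any model. Here `decode_with_teacher_forcing` and `predict_next` are hypothetical stand-ins: `predict_next` plays the role of one decoder step followed by `argmax`, and the token ids are made up.

```python
import random

def decode_with_teacher_forcing(trg, predict_next, teacher_forcing_ratio, rng=random):
    """trg: list of token ids starting with <sos> and ending with <eos>.
    predict_next: stand-in for one decoder step, maps an input token to a predicted token."""
    inputs = []
    outputs = [None]              # index 0 is never filled, like the zero row of `outputs`
    inp = trg[0]                  # the first input is <sos>
    for t in range(1, len(trg)):
        inputs.append(inp)
        pred = predict_next(inp)
        outputs.append(pred)
        teacher_force = rng.random() < teacher_forcing_ratio
        inp = trg[t] if teacher_force else pred
    return inputs, outputs

SOS, EOS = 2, 3
trg = [SOS, 10, 11, 12, EOS]

def always_wrong(token):
    return 99                     # a "model" that always predicts token 99

print(decode_with_teacher_forcing(trg, always_wrong, 1.0)[0])  # [2, 10, 11, 12]
print(decode_with_teacher_forcing(trg, always_wrong, 0.0)[0])  # [2, 99, 99, 99]
```

With a ratio of 1.0 the decoder always sees the ground truth, with 0.0 it only ever sees its own predictions, and in both cases `<eos>` is never used as an input.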

In [106]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
        assert encoder.n_layers == decoder.n_layers, \
            "Encoder and decoder must have equal number of layers!"
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time
        
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        #last hidden state of the encoder is used as the initial hidden state of the decoder
        hidden, cell = self.encoder(src)
        
        #first input to the decoder is the <sos> tokens
        input = trg[0, :]
        
        for t in range(1, trg_len):
            
            #insert input token embedding, previous hidden and previous cell states
            #receive output tensor (predictions) and new hidden and cell states
            output, hidden, cell = self.decoder(input, hidden, cell)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1
        
        return outputs
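`forward` needs a `trg` tensor, which is not available at inference time. The greedy search decoder from the checklist can be sketched as follows, assuming a hypothetical `step_fn(token, state)` that wraps one decoder step and returns a list of scores plus the new state. With the classes above, `step_fn` would presumably call `model.decoder(input, hidden, cell)`, but that wiring is not shown here.

```python
def greedy_decode(step_fn, state, sos_idx, eos_idx, max_len=50):
    """Repeatedly feed the most likely token back in until <eos> or max_len."""
    tokens = []
    token = sos_idx
    for _ in range(max_len):
        scores, state = step_fn(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax over the vocabulary
        if token == eos_idx:
            break
        tokens.append(token)
    return tokens

def toy_step(token, state):
    # Toy stand-in for a decoder: the state is a step counter and the
    # "model" prefers tokens 4, 5 and then <eos> (index 3).
    plan = [4, 5, 3]
    scores = [0.0] * 6
    scores[plan[min(state, len(plan) - 1)]] = 1.0
    return scores, state + 1

print(greedy_decode(toy_step, state=0, sos_idx=2, eos_idx=3))  # [4, 5]
```

The `max_len` cap is a design choice for this sketch: it keeps decoding from running forever if the model never emits `<eos>`.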

# Training the Seq2Seq Model

Now we have our model implemented, we can begin training it. 

First, we'll initialize our model. As mentioned before, the input and output dimensions are defined by the size of the vocabulary. The embedding dimensions and dropout for the encoder and decoder can be different, but the number of layers and the size of the hidden/cell states must be the same. 

We then define the encoder, decoder and then our Seq2Seq model, which we place on the `device`.

In [107]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

model = Seq2Seq(enc, dec, device).to(device)

Next up is initializing the weights of our model. In the paper they state they initialize all weights from a uniform distribution between -0.08 and +0.08, i.e. $\mathcal{U}(-0.08, 0.08)$.

We initialize weights in PyTorch by creating a function which we `apply` to our model. When using `apply`, the `init_weights` function will be called on every module and sub-module within our model. For each module we loop through all of the parameters and sample them from a uniform distribution with `nn.init.uniform_`.

In [108]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)
        
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(4004, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(2004, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (fc_out): Linear(in_features=512, out_features=2004, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)
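A quick way to confirm the initialization is to check the range of every parameter afterwards. The sketch below uses a small toy module rather than the model above, and a hypothetical `init_uniform` helper that makes a single pass over `model.parameters()`, which already covers all sub-modules.

```python
import torch
import torch.nn as nn

def init_uniform(model, bound=0.08):
    # One pass over parameters() is enough; it recurses into every sub-module.
    for param in model.parameters():
        nn.init.uniform_(param.data, -bound, bound)

torch.manual_seed(0)
toy = nn.ModuleList([nn.Embedding(10, 4), nn.LSTM(4, 8, num_layers=2), nn.Linear(8, 10)])
init_uniform(toy)

eps = 1e-6  # small tolerance for float32 rounding
lo = min(p.min().item() for p in toy.parameters())
hi = max(p.max().item() for p in toy.parameters())
print(-0.08 - eps <= lo and hi <= 0.08 + eps)  # True
```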

We also define a function that will calculate the number of trainable parameters in the model.

In [109]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 9,922,516 trainable parameters


We define our optimizer, which we use to update our parameters in the training loop. Check out [this](http://ruder.io/optimizing-gradient-descent/) post for information about different optimizers. Here, we'll use SGD, following the paper.

## SGD optimizer + halving learning rate every half epoch
We train for 7.5 epochs and, once past epoch 5, have to halve the learning rate every half epoch, so let's count epochs partway through training (in half-epoch steps).

In [137]:
from torch.optim.lr_scheduler import LambdaLR, MultiStepLR

In [138]:
optimizer = optim.SGD(model.parameters(), lr=0.7)

In [139]:
scheduler = MultiStepLR(optimizer, milestones=list(range(5, 10)), gamma=0.5)

In [174]:
from torch.utils.data import Dataset, DataLoader

In [208]:
optimizer = optim.SGD(model.parameters(), lr=0.7)
scheduler = MultiStepLR(optimizer, milestones=list(np.arange(5, 10, 0.5)), gamma=0.5)
for _ in range(10):
    print(f'Epoch {_} learning rate : {scheduler.get_lr()[0]}')
    scheduler.step()

Epoch 0 learning rate : 0.7
Epoch 1 learning rate : 0.7
Epoch 2 learning rate : 0.7
Epoch 3 learning rate : 0.7
Epoch 4 learning rate : 0.7
Epoch 5 learning rate : 0.175
Epoch 6 learning rate : 0.0875
Epoch 7 learning rate : 0.04375
Epoch 8 learning rate : 0.021875
Epoch 9 learning rate : 0.0109375
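
The schedule described in this section (a constant rate for five epochs, then halving every half epoch) needs a counter finer than whole epochs. Below is a minimal sketch in plain Python, not the notebook's implementation: `lr_multiplier`, `steps_per_epoch` and `start_epoch` are hypothetical names, and applying the first halving exactly at the 5-epoch mark is an assumption about how the rule should be read.

```python
def lr_multiplier(step, steps_per_epoch, start_epoch=5, gamma=0.5):
    """Factor to multiply the base learning rate by after `step` optimizer updates.

    The factor stays at 1.0 for the first `start_epoch` epochs and is then
    multiplied by `gamma` at the start of every half epoch (assumed reading).
    """
    half_epoch = steps_per_epoch / 2           # number of updates in half an epoch
    start_step = start_epoch * steps_per_epoch
    if step < start_step:
        return 1.0
    # Half-epoch boundaries reached so far, counting the one at start_step itself
    n_halvings = int((step - start_step) // half_epoch) + 1
    return gamma ** n_halvings


steps_per_epoch = 100   # hypothetical number of batches per epoch
base_lr = 0.7
for epoch_pos in (0, 4.5, 5, 5.5, 6, 7):
    step = int(epoch_pos * steps_per_epoch)
    print(epoch_pos, base_lr * lr_multiplier(step, steps_per_epoch))
```

A function like this could be wrapped as `lambda step: lr_multiplier(step, len(train_iterator))` and handed to `LambdaLR`, provided `scheduler.step()` is then called after every batch rather than once per epoch.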




Next, we define our loss function. The `CrossEntropyLoss` function calculates both the log softmax as well as the negative log-likelihood of our predictions. 

Our loss function calculates the average loss per token, however by passing the index of the `<pad>` token as the `ignore_index` argument we ignore the loss whenever the target token is a padding token. 

In [36]:
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)
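
To make the effect of `ignore_index` concrete, here is a small plain-Python sketch of the same idea: the negative log-likelihood is averaged only over positions whose target is not the padding index. The function name `masked_mean_nll` and the toy probabilities are illustrative assumptions; the real `CrossEntropyLoss` works on logits and tensors.

```python
import math


def masked_mean_nll(probs, targets, pad_idx):
    """Average negative log-likelihood over non-padding target positions.

    probs   : list of per-position probability lists (each sums to 1)
    targets : list of target token indices, same length as probs
    """
    total, count = 0.0, 0
    for dist, tgt in zip(probs, targets):
        if tgt == pad_idx:
            continue                    # padded positions contribute nothing
        total += -math.log(dist[tgt])
        count += 1
    return total / count                # mean over the positions that were kept


PAD = 0
probs = [[0.1, 0.7, 0.2], [0.1, 0.2, 0.7], [0.9, 0.05, 0.05]]
targets = [1, 2, PAD]                   # the last position is padding
loss = masked_mean_nll(probs, targets, PAD)
print(f'{loss:.4f}')
```

Because the padded position is skipped in both the sum and the count, sentences that are heavily padded do not pull the average loss down.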

Next, we'll define our training loop. 

First, we'll set the model into "training mode" with `model.train()`. This will turn on dropout (and batch normalization, which we aren't using) and then iterate through our data iterator.

As stated before, our decoder loop starts at 1, not 0. This means the 0th element of our `outputs` tensor remains all zeros. So our `trg` and `outputs` look something like:

$$\begin{align*}
\text{trg} = [<sos>, &y_1, y_2, y_3, <eos>]\\
\text{outputs} = [0, &\hat{y}_1, \hat{y}_2, \hat{y}_3, <eos>]
\end{align*}$$

Here, when we calculate the loss, we cut off the first element of each tensor to get:

$$\begin{align*}
\text{trg} = [&y_1, y_2, y_3, <eos>]\\
\text{outputs} = [&\hat{y}_1, \hat{y}_2, \hat{y}_3, <eos>]
\end{align*}$$

At each iteration:
- Get $X$ and $Y$ from the batch
- Zero the gradients calculated from the last batch
- Feed the source and target into the model to get the output $\hat{Y}$
- As the loss function only works on 2D inputs with 1D targets, flatten each of them with `.view`
- As mentioned above, slice off the first element of the output (and target) tensors
- Calculate the gradients with `loss.backward()`
- Clip the gradients to prevent them from exploding (a common issue in RNNs)
- Update the parameters of the model by doing an optimizer step
- Add the loss value to a running total

Finally, we return the loss averaged over all batches.
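
The clipping step in the list above can be illustrated without PyTorch. The sketch below shows clipping by global norm, which is the idea behind `clip_grad_norm_`: all gradients are treated as one long vector, and if its L2 norm exceeds the threshold, every entry is scaled down by the same factor. The function name `clip_by_global_norm` and the `eps` default are assumptions for illustration.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale a list of gradient vectors so their combined L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Scale factor is capped at 1.0, so small gradients are left untouched
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm


grads = [[3.0, 0.0], [0.0, 4.0]]        # combined norm is 5
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm, clipped)
```

Scaling every gradient by one shared factor keeps the update direction unchanged and only shortens the step.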

In [37]:
# scheduler.get_lr()

In [154]:
10 // 2

5

In [38]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    # Batch index at the half-epoch mark (intended for half-epoch LR halving; not used below yet)
    check_half = len(iterator) // 2 
    
    for i, batch in enumerate(iterator):
        
        src = batch.src
        trg = batch.trg
        
        optimizer.zero_grad()
        
        output = model(src, trg)
        
        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]
        
        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

Our evaluation loop is similar to our training loop, however as we aren't updating any parameters we don't need to pass an optimizer or a clip value.

We must remember to set the model to evaluation mode with `model.eval()`. This will turn off dropout (and batch normalization, if used).

We use the `with torch.no_grad()` block to ensure no gradients are calculated within the block. This reduces memory consumption and speeds things up. 

The iteration loop is similar (without the parameter updates), however we must ensure we turn teacher forcing off for evaluation. This will cause the model to only use its own predictions to make further predictions within a sentence, which mirrors how it would be used in deployment.
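
As a rough illustration of what the teacher forcing ratio controls, the sketch below isolates the per-step decision: with probability `teacher_forcing_ratio` the next decoder input is the ground-truth token, otherwise it is the model's own prediction. The helper name `choose_next_input` is hypothetical; it mirrors the `random.random() < teacher_forcing_ratio` test used in the model's `forward` method.

```python
import random


def choose_next_input(target_token, predicted_token, teacher_forcing_ratio, rng=random):
    """Pick the next decoder input.

    Returns the ground-truth token with probability teacher_forcing_ratio,
    otherwise the token the model predicted.
    """
    use_teacher = rng.random() < teacher_forcing_ratio
    return target_token if use_teacher else predicted_token


# A ratio of 0 (evaluation) always feeds back the prediction,
# a ratio of 1 always feeds the ground truth.
print(choose_next_input('gold', 'pred', 0.0))
print(choose_next_input('gold', 'pred', 1.0))
```

Since `random.random()` returns a value in [0, 1), a ratio of 0 can never select the ground-truth token, which is why passing `0` is enough to switch teacher forcing off.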

In [39]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.src
            trg = batch.trg

            output = model(src, trg, 0) #turn off teacher forcing

            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)
            
            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

Next, we'll create a function that we'll use to tell us how long an epoch takes.

In [40]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

We can finally start training our model!

At each epoch, we'll be checking if our model has achieved the best validation loss so far. If it has, we'll update our best validation loss and save the parameters of our model (called `state_dict` in PyTorch). Then, when we come to test our model, we'll use the saved parameters used to achieve the best validation loss. 

We'll be printing out both the loss and the perplexity at each epoch. It is easier to see a change in perplexity than a change in loss as the numbers are much bigger.
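
Perplexity here is just the exponential of the average per-token loss, so a small change in loss shows up as a larger change in perplexity. A minimal sketch of that relationship follows; the two loss values are illustrative inputs, not new results.

```python
import math


def perplexity(mean_token_loss):
    """Perplexity corresponding to an average per-token cross-entropy (in nats)."""
    return math.exp(mean_token_loss)


for loss in (4.572, 4.472):             # two losses that differ by 0.1
    print(f'loss {loss:.3f} -> PPL {perplexity(loss):7.3f}')
```

A drop of 0.1 in the loss multiplies the perplexity by `exp(-0.1)`, which is roughly a 9.5% reduction regardless of the starting value.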

In [41]:
N_EPOCHS = 1000
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
        
    scheduler.step()
    print(scheduler.get_lr())
    print(f'Epoch: {epoch + 1:02} | Time: {epoch_mins}m {epoch_secs}s ')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')



[0.48999999999999994]
Epoch: 01 | Time: 0m 12s 
	Train Loss: 5.044 | Train PPL: 155.055
	 Val. Loss: 5.266 |  Val. PPL: 193.721
[0.48999999999999994]
Epoch: 02 | Time: 0m 11s 
	Train Loss: 4.612 | Train PPL: 100.703
	 Val. Loss: 5.379 |  Val. PPL: 216.832
[0.48999999999999994]
Epoch: 03 | Time: 0m 12s 
	Train Loss: 4.571 | Train PPL:  96.604
	 Val. Loss: 5.334 |  Val. PPL: 207.280
[0.48999999999999994]
Epoch: 04 | Time: 0m 11s 
	Train Loss: 4.546 | Train PPL:  94.226
	 Val. Loss: 5.291 |  Val. PPL: 198.511
[0.48999999999999994]
Epoch: 05 | Time: 0m 11s 
	Train Loss: 4.520 | Train PPL:  91.828
	 Val. Loss: 5.218 |  Val. PPL: 184.650
[0.24499999999999997]
Epoch: 06 | Time: 0m 11s 
	Train Loss: 4.502 | Train PPL:  90.188
	 Val. Loss: 5.165 |  Val. PPL: 174.982
[0.12249999999999998]
Epoch: 07 | Time: 0m 11s 
	Train Loss: 4.489 | Train PPL:  88.992
	 Val. Loss: 5.154 |  Val. PPL: 173.103
[0.06124999999999999]
Epoch: 08 | Time: 0m 11s 
	Train Loss: 4.493 | Train PPL:  89.372
	 Val. Loss: 5.0
[... output truncated ...]
[8.500145032286354e-19]
Epoch: 64 | Time: 0m 11s 
	Train Loss: 4.505 | Train PPL:  90.500
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[4.250072516143177e-19]
Epoch: 65 | Time: 0m 11s 
	Train Loss: 4.501 | Train PPL:  90.144
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[2.1250362580715884e-19]
Epoch: 66 | Time: 0m 11s 
	Train Loss: 4.504 | Train PPL:  90.381
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.0625181290357942e-19]
Epoch: 67 | Time: 0m 11s 
	Train Loss: 4.503 | Train PPL:  90.309
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[5.312590645178971e-20]
Epoch: 68 | Time: 0m 11s 
	Train Loss: 4.506 | Train PPL:  90.555
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[2.6562953225894855e-20]
Epoch: 69 | Time: 0m 11s 
	Train Loss: 4.503 | Train PPL:  90.284
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.3281476612947428e-20]
Epoch: 70 | Time: 0m 11s 
	Train Loss: 4.502 | Train PPL:  90.199
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[6.640738306473714e-21]
Epoch: 71 | Time: 0m 11s 
	Train Loss: 4.502 | Train PPL:  90.
[... output truncated ...]
[9.215875710446733e-38]
Epoch: 127 | Time: 0m 12s 
	Train Loss: 4.504 | Train PPL:  90.340
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[4.6079378552233664e-38]
Epoch: 128 | Time: 0m 11s 
	Train Loss: 4.502 | Train PPL:  90.217
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[2.3039689276116832e-38]
Epoch: 129 | Time: 0m 11s 
	Train Loss: 4.504 | Train PPL:  90.408
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.1519844638058416e-38]
Epoch: 130 | Time: 0m 11s 
	Train Loss: 4.503 | Train PPL:  90.286
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[5.759922319029208e-39]
Epoch: 131 | Time: 0m 11s 
	Train Loss: 4.501 | Train PPL:  90.104
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[2.879961159514604e-39]
Epoch: 132 | Time: 0m 11s 
	Train Loss: 4.504 | Train PPL:  90.422
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.439980579757302e-39]
Epoch: 133 | Time: 0m 11s 
	Train Loss: 4.504 | Train PPL:  90.373
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[7.19990289878651e-40]
Epoch: 134 | Time: 0m 11s 
	Train Loss: 4.504 | Train PPL
[... output truncated ...]
[9.991872466622739e-57]
Epoch: 190 | Time: 0m 11s 
	Train Loss: 4.502 | Train PPL:  90.200
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[4.9959362333113697e-57]
Epoch: 191 | Time: 0m 11s 
	Train Loss: 4.501 | Train PPL:  90.079
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[2.4979681166556848e-57]
Epoch: 192 | Time: 0m 12s 
	Train Loss: 4.506 | Train PPL:  90.531
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.2489840583278424e-57]
Epoch: 193 | Time: 0m 12s 
	Train Loss: 4.504 | Train PPL:  90.380
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[6.244920291639212e-58]
Epoch: 194 | Time: 0m 11s 
	Train Loss: 4.503 | Train PPL:  90.307
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[3.122460145819606e-58]
Epoch: 195 | Time: 0m 11s 
	Train Loss: 4.507 | Train PPL:  90.615
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.561230072909803e-58]
Epoch: 196 | Time: 0m 12s 
	Train Loss: 4.505 | Train PPL:  90.465
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[7.806150364549015e-59]
Epoch: 197 | Time: 0m 11s 
	Train Loss: 4.502 | Train PP
[... output truncated ...]
[1.083320983551047e-75]
Epoch: 253 | Time: 0m 12s 
	Train Loss: 4.502 | Train PPL:  90.217
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[5.416604917755235e-76]
Epoch: 254 | Time: 0m 11s 
	Train Loss: 4.503 | Train PPL:  90.277
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[2.7083024588776175e-76]
Epoch: 255 | Time: 0m 11s 
	Train Loss: 4.501 | Train PPL:  90.118
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.3541512294388087e-76]
Epoch: 256 | Time: 0m 11s 
	Train Loss: 4.503 | Train PPL:  90.300
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[6.770756147194044e-77]
Epoch: 257 | Time: 0m 12s 
	Train Loss: 4.504 | Train PPL:  90.363
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[3.385378073597022e-77]
Epoch: 258 | Time: 0m 11s 
	Train Loss: 4.504 | Train PPL:  90.410
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.692689036798511e-77]
Epoch: 259 | Time: 0m 11s 
	Train Loss: 4.505 | Train PPL:  90.496
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[8.463445183992555e-78]
Epoch: 260 | Time: 0m 11s 
	Train Loss: 4.503 | Train PPL
[... output truncated ...]
[1.1745389638651786e-94]
Epoch: 316 | Time: 0m 11s 
	Train Loss: 4.504 | Train PPL:  90.382
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[5.872694819325893e-95]
Epoch: 317 | Time: 0m 11s 
	Train Loss: 4.502 | Train PPL:  90.165
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[2.9363474096629464e-95]
Epoch: 318 | Time: 0m 11s 
	Train Loss: 4.506 | Train PPL:  90.599
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.4681737048314732e-95]
Epoch: 319 | Time: 0m 10s 
	Train Loss: 4.504 | Train PPL:  90.401
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[7.340868524157366e-96]
Epoch: 320 | Time: 0m 11s 
	Train Loss: 4.504 | Train PPL:  90.355
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[3.670434262078683e-96]
Epoch: 321 | Time: 0m 11s 
	Train Loss: 4.501 | Train PPL:  90.131
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.8352171310393415e-96]
Epoch: 322 | Time: 0m 11s 
	Train Loss: 4.499 | Train PPL:  89.937
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[9.176085655196708e-97]
Epoch: 323 | Time: 0m 11s 
	Train Loss: 4.502 | Train P
[... output truncated ...]
[1.2734376962915e-113]
Epoch: 379 | Time: 0m 11s 
	Train Loss: 4.502 | Train PPL:  90.234
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[6.3671884814575e-114]
Epoch: 380 | Time: 0m 11s 
	Train Loss: 4.501 | Train PPL:  90.077
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[3.18359424072875e-114]
Epoch: 381 | Time: 0m 11s 
	Train Loss: 4.504 | Train PPL:  90.340
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.591797120364375e-114]
Epoch: 382 | Time: 0m 11s 
	Train Loss: 4.505 | Train PPL:  90.469
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[7.958985601821875e-115]
Epoch: 383 | Time: 0m 11s 
	Train Loss: 4.503 | Train PPL:  90.296
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[3.9794928009109375e-115]
Epoch: 384 | Time: 0m 11s 
	Train Loss: 4.501 | Train PPL:  90.127
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.9897464004554687e-115]
Epoch: 385 | Time: 0m 11s 
	Train Loss: 4.502 | Train PPL:  90.155
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[9.948732002277344e-116]
Epoch: 386 | Time: 0m 12s 
	Train Loss: 4.505 | Train 
[... output truncated ...]
[1.3806639168441803e-132]
Epoch: 442 | Time: 0m 11s 
	Train Loss: 4.504 | Train PPL:  90.392
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[6.9033195842209014e-133]
Epoch: 443 | Time: 0m 11s 
	Train Loss: 4.506 | Train PPL:  90.586
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[3.4516597921104507e-133]
Epoch: 444 | Time: 0m 11s 
	Train Loss: 4.504 | Train PPL:  90.344
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.7258298960552253e-133]
Epoch: 445 | Time: 0m 11s 
	Train Loss: 4.504 | Train PPL:  90.350
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[8.629149480276127e-134]
Epoch: 446 | Time: 0m 11s 
	Train Loss: 4.507 | Train PPL:  90.633
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[4.3145747401380634e-134]
Epoch: 447 | Time: 0m 11s 
	Train Loss: 4.502 | Train PPL:  90.240
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[2.1572873700690317e-134]
Epoch: 448 | Time: 0m 11s 
	Train Loss: 4.503 | Train PPL:  90.304
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.0786436850345158e-134]
Epoch: 449 | Time: 0m 11s 
	Train Loss: 4.50
[... output truncated ...]
[2.9938376362296122e-151]
Epoch: 504 | Time: 0m 12s 
	Train Loss: 4.503 | Train PPL:  90.298
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.4969188181148061e-151]
Epoch: 505 | Time: 0m 12s 
	Train Loss: 4.506 | Train PPL:  90.522
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[7.4845940905740305e-152]
Epoch: 506 | Time: 0m 11s 
	Train Loss: 4.507 | Train PPL:  90.643
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[3.7422970452870152e-152]
Epoch: 507 | Time: 0m 11s 
	Train Loss: 4.500 | Train PPL:  90.049
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.8711485226435076e-152]
Epoch: 508 | Time: 0m 11s 
	Train Loss: 4.504 | Train PPL:  90.385
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[9.355742613217538e-153]
Epoch: 509 | Time: 0m 11s 
	Train Loss: 4.502 | Train PPL:  90.154
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[4.677871306608769e-153]
Epoch: 510 | Time: 0m 10s 
	Train Loss: 4.507 | Train PPL:  90.640
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[2.3389356533043845e-153]
Epoch: 511 | Time: 0m 11s 
	Train Loss: 4.503
[... output truncated ...]
[6.491850538538026e-170]
Epoch: 566 | Time: 0m 11s 
	Train Loss: 4.503 | Train PPL:  90.264
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[3.245925269269013e-170]
Epoch: 567 | Time: 0m 11s 
	Train Loss: 4.500 | Train PPL:  89.979
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.6229626346345064e-170]
Epoch: 568 | Time: 0m 11s 
	Train Loss: 4.503 | Train PPL:  90.323
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[8.114813173172532e-171]
Epoch: 569 | Time: 0m 11s 
	Train Loss: 4.502 | Train PPL:  90.206
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[4.057406586586266e-171]
Epoch: 570 | Time: 0m 11s 
	Train Loss: 4.504 | Train PPL:  90.371
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[2.028703293293133e-171]
Epoch: 571 | Time: 0m 11s 
	Train Loss: 4.503 | Train PPL:  90.258
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.0143516466465665e-171]
Epoch: 572 | Time: 0m 12s 
	Train Loss: 4.503 | Train PPL:  90.252
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[5.0717582332328326e-172]
Epoch: 573 | Time: 0m 12s 
	Train Loss: 4.503 | 
[... output truncated ...]
[1.4076956914668239e-188]
Epoch: 628 | Time: 0m 11s 
	Train Loss: 4.506 | Train PPL:  90.531
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[7.0384784573341195e-189]
Epoch: 629 | Time: 0m 11s 
	Train Loss: 4.504 | Train PPL:  90.407
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[3.5192392286670597e-189]
Epoch: 630 | Time: 0m 11s 
	Train Loss: 4.503 | Train PPL:  90.316
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.7596196143335299e-189]
Epoch: 631 | Time: 0m 11s 
	Train Loss: 4.504 | Train PPL:  90.362
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[8.798098071667649e-190]
Epoch: 632 | Time: 0m 11s 
	Train Loss: 4.502 | Train PPL:  90.216
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[4.3990490358338247e-190]
Epoch: 633 | Time: 0m 11s 
	Train Loss: 4.508 | Train PPL:  90.736
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[2.1995245179169123e-190]
Epoch: 634 | Time: 0m 12s 
	Train Loss: 4.504 | Train PPL:  90.423
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.0997622589584562e-190]
Epoch: 635 | Time: 0m 11s 
	Train Loss: 4.50
[... output truncated ...]
[3.0524534537736297e-207]
Epoch: 690 | Time: 0m 12s 
	Train Loss: 4.501 | Train PPL:  90.110
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.5262267268868148e-207]
Epoch: 691 | Time: 0m 11s 
	Train Loss: 4.502 | Train PPL:  90.158
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[7.631133634434074e-208]
Epoch: 692 | Time: 0m 12s 
	Train Loss: 4.504 | Train PPL:  90.403
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[3.815566817217037e-208]
Epoch: 693 | Time: 0m 11s 
	Train Loss: 4.502 | Train PPL:  90.239
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.9077834086085185e-208]
Epoch: 694 | Time: 0m 12s 
	Train Loss: 4.501 | Train PPL:  90.101
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[9.538917043042593e-209]
Epoch: 695 | Time: 0m 11s 
	Train Loss: 4.507 | Train PPL:  90.632
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[4.7694585215212963e-209]
Epoch: 696 | Time: 0m 12s 
	Train Loss: 4.504 | Train PPL:  90.385
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[2.3847292607606482e-209]
Epoch: 697 | Time: 0m 11s 
	Train Loss: 4.501 
[... output truncated ...]
[6.618953331984501e-226]
Epoch: 752 | Time: 0m 11s 
	Train Loss: 4.507 | Train PPL:  90.659
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[3.3094766659922506e-226]
Epoch: 753 | Time: 0m 11s 
	Train Loss: 4.505 | Train PPL:  90.495
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.6547383329961253e-226]
Epoch: 754 | Time: 0m 11s 
	Train Loss: 4.505 | Train PPL:  90.450
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[8.273691664980626e-227]
Epoch: 755 | Time: 0m 11s 
	Train Loss: 4.506 | Train PPL:  90.568
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[4.136845832490313e-227]
Epoch: 756 | Time: 0m 11s 
	Train Loss: 4.504 | Train PPL:  90.361
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[2.0684229162451566e-227]
Epoch: 757 | Time: 0m 11s 
	Train Loss: 4.503 | Train PPL:  90.254
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.0342114581225783e-227]
Epoch: 758 | Time: 0m 11s 
	Train Loss: 4.504 | Train PPL:  90.397
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[5.1710572906128915e-228]
Epoch: 759 | Time: 0m 12s 
	Train Loss: 4.504 
[... output truncated ...]
[1.4352567164235529e-244]
Epoch: 814 | Time: 0m 11s 
	Train Loss: 4.506 | Train PPL:  90.587
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[7.176283582117764e-245]
Epoch: 815 | Time: 0m 11s 
	Train Loss: 4.501 | Train PPL:  90.071
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[3.588141791058882e-245]
Epoch: 816 | Time: 0m 11s 
	Train Loss: 4.501 | Train PPL:  90.069
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.794070895529441e-245]
Epoch: 817 | Time: 0m 11s 
	Train Loss: 4.499 | Train PPL:  89.892
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[8.970354477647205e-246]
Epoch: 818 | Time: 0m 11s 
	Train Loss: 4.502 | Train PPL:  90.163
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[4.4851772388236027e-246]
Epoch: 819 | Time: 0m 11s 
	Train Loss: 4.503 | Train PPL:  90.321
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[2.2425886194118014e-246]
Epoch: 820 | Time: 0m 11s 
	Train Loss: 4.502 | Train PPL:  90.228
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.1212943097059007e-246]
Epoch: 821 | Time: 0m 11s 
	Train Loss: 4.504 |
[... output truncated ...]
[3.1122169000416552e-263]
Epoch: 876 | Time: 0m 10s 
	Train Loss: 4.505 | Train PPL:  90.436
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.5561084500208276e-263]
Epoch: 877 | Time: 0m 11s 
	Train Loss: 4.503 | Train PPL:  90.321
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[7.780542250104138e-264]
Epoch: 878 | Time: 0m 11s 
	Train Loss: 4.503 | Train PPL:  90.315
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[3.890271125052069e-264]
Epoch: 879 | Time: 0m 11s 
	Train Loss: 4.500 | Train PPL:  90.045
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.9451355625260345e-264]
Epoch: 880 | Time: 0m 11s 
	Train Loss: 4.505 | Train PPL:  90.481
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[9.725677812630172e-265]
Epoch: 881 | Time: 0m 11s 
	Train Loss: 4.503 | Train PPL:  90.310
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[4.862838906315086e-265]
Epoch: 882 | Time: 0m 11s 
	Train Loss: 4.506 | Train PPL:  90.525
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[2.431419453157543e-265]
Epoch: 883 | Time: 0m 11s 
	Train Loss: 4.503 | 
[... output truncated ...]
[6.748544648542529e-282]
Epoch: 938 | Time: 0m 12s 
	Train Loss: 4.500 | Train PPL:  90.008
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[3.3742723242712646e-282]
Epoch: 939 | Time: 0m 11s 
	Train Loss: 4.505 | Train PPL:  90.446
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.6871361621356323e-282]
Epoch: 940 | Time: 0m 11s 
	Train Loss: 4.504 | Train PPL:  90.384
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[8.435680810678161e-283]
Epoch: 941 | Time: 0m 11s 
	Train Loss: 4.503 | Train PPL:  90.304
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[4.2178404053390807e-283]
Epoch: 942 | Time: 0m 11s 
	Train Loss: 4.504 | Train PPL:  90.346
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[2.1089202026695403e-283]
Epoch: 943 | Time: 0m 11s 
	Train Loss: 4.501 | Train PPL:  90.094
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[1.0544601013347702e-283]
Epoch: 944 | Time: 0m 11s 
	Train Loss: 4.504 | Train PPL:  90.355
	 Val. Loss: 4.572 |  Val. PPL:  96.748
[5.272300506673851e-284]
Epoch: 945 | Time: 0m 12s 
	Train Loss: 4.506 
[... output truncated ...]
[1.463357353813047e-300]
Epoch: 1000 | Time: 0m 11s 
	Train Loss: 4.502 | Train PPL:  90.153
	 Val. Loss: 4.572 |  Val. PPL:  96.748


We'll load the parameters (`state_dict`) that gave our model the best validation loss and run the model on the test set.

In [42]:
model.load_state_dict(torch.load('tut1-model.pt'))

test_loss = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

| Test Loss: 4.564 | Test PPL:  95.973 |


In the following notebook we'll implement a model that achieves improved test perplexity, but only uses a single layer in the encoder and the decoder.

# predict

In [43]:
import sys
from glob import glob
sys.path.append('/home/long8v/torch_study/paper/01_CNN/source/')
from dataloader import *

### Implementing the torchtext Field and build_vocab
Reason: `Example.fromlist` doesn't work here.

In [44]:
filepath = '/home/long8v/torch_study/paper/file'

In [45]:
files = glob(f'{filepath}/multi30k/train*')

In [46]:
def tokenize_de(text):
    """
    Tokenizes German text from a string into a list of strings (tokens); reversal is handled by `field(reverse=True)`
    """
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings (tokens)
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]

In [47]:
de_path = glob(f'{filepath}/multi30k/train*de')[0]
en_path = glob(f'{filepath}/multi30k/train*en')[0]

In [48]:
with open(en_path) as f:
    en = f.readlines()

In [49]:
with open(de_path) as f:
    de = f.readlines()

In [50]:
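from collections import defaultdict  # explicit import in case `dataloader` does not export it

class Vocab:    
    def build_vocabs(self, sentence_list):
        # Unknown tokens fall back to index 1 (<UNK>).
        # Note: looking up an unseen word in a defaultdict also stores it, so len() can grow later.
        self.stoi_dict = defaultdict(lambda: 1) 
        self.stoi_dict['<PAD>'] = 0
        self.stoi_dict['<UNK>'] = 1
        self.stoi_dict['<SOS>'] = 2
        self.stoi_dict['<EOS>'] = 3
        _index = 4
        # Assign the next free index to every token not seen before
        for sentence in sentence_list:
            for word in sentence:
                if word not in self.stoi_dict:
                    self.stoi_dict[word] = _index
                    _index += 1
        self.itos_dict = {v:k for k, v in self.stoi_dict.items()}
        
    def stoi(self, token_list):
        return [self.stoi_dict[word] for word in token_list]

    def itos(self, indices):
        # Cast each index to int first so tensor elements can be used as dictionary keys
        return " ".join([self.itos_dict[int(index)] for index in indices if self.itos_dict[int(index)] != '<PAD>'])
    
    def __len__(self):
        return len(self.stoi_dict)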
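
The `Vocab` class above indexes every token it sees. To keep only the most frequent words and map the rest to `<UNK>` (the embedding sizes 4004 and 2004 in the earlier model summary suggest such a cap was used there), a frequency-capped variant could look like the sketch below. `build_capped_vocab`, `encode` and `max_size` are hypothetical names, not part of the notebook's code.

```python
from collections import Counter


def build_capped_vocab(tokenized_sentences, max_size,
                       specials=('<PAD>', '<UNK>', '<SOS>', '<EOS>')):
    """Map the `max_size` most frequent tokens to indices, after the special tokens."""
    counts = Counter(tok for sent in tokenized_sentences
                     for tok in sent if tok not in specials)
    stoi = {tok: i for i, tok in enumerate(specials)}
    for tok, _ in counts.most_common(max_size):
        stoi[tok] = len(stoi)           # next free index
    return stoi


def encode(tokens, stoi, unk='<UNK>'):
    """Convert tokens to indices; anything outside the vocabulary becomes <UNK>."""
    return [stoi.get(tok, stoi[unk]) for tok in tokens]


sentences = [['a', 'b', 'a'], ['a', 'c', 'b']]
stoi = build_capped_vocab(sentences, max_size=2)
print(len(stoi), encode(['a', 'c', 'b'], stoi))
```

Using `dict.get` with the `<UNK>` index keeps the vocabulary size fixed after it is built, because lookups of unseen words never add new entries.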

In [51]:
class field:
    def __init__(self, tokenize = lambda e: e.split(), init_token = '<SOS>', 
                 eos_token = '<EOS>', lower = False, reverse = False):
        self.tokenize = tokenize
        self.init_token = init_token
        self.eos_token = eos_token
        self.lower = lower
        self.reverse = reverse
        self.vocab = None
    
    def build_vocab(self, data):
        self.vocab = Vocab()
        self.vocab.build_vocabs(self.get_processed_datalist(data))

    def get_processed_data(self, data):
        if self.lower:
            data = data.lower()
        tokenized_data = self.tokenize(data)
        if self.init_token:
            tokenized_data = [self.init_token] + tokenized_data
        if self.eos_token:
            tokenized_data = tokenized_data + [self.eos_token]
        if self.reverse:
            tokenized_data = tokenized_data[::-1]
        return tokenized_data
    
    def get_processed_datalist(self, datalist):
        return [self.get_processed_data(data) for data in datalist]
        

In [52]:
SRC = field(tokenize_de, '<SOS>', '<EOS>', False, reverse=True)
TRG = field(tokenize_en, '<SOS>', '<EOS>', False)

In [53]:
SRC.build_vocab(de)
TRG.build_vocab(en)

In [54]:
SRC.get_processed_datalist(['hi hi', 'hi'])

[['<EOS>', 'hi', 'hi', '<SOS>'], ['<EOS>', 'hi', '<SOS>']]

### namedtuple for `.src`, `.trg` access 

In [55]:
from collections import namedtuple  
      
# Declaring namedtuple()   
Student = namedtuple('Student',['name','age','DOB'])   
      
# Adding values   
S = Student('Nandini','19','2541997')   
      
# Access using index   
print ("The Student age using index is : ",end ="")   
print (S[1])   
      
# Access using name    
print ("The Student name using keyname is : ",end ="")   
print (S.name) 

The Student age using index is : 19
The Student name using keyname is : Nandini


In [56]:
from collections import namedtuple  

class seq2seqDataset(Dataset):
    def __init__(self, src, trg = None, field = None, device = 'cpu'):
        self.src = src
        self.trg = trg
        self.data_source = {'src':src, 'trg':trg}
        self.field = field
        self.device = device
        self.named_tuple = namedtuple('data', ['src', 'trg'])
    def __len__(self):
        return len(self.src)
    
    def __getitem__(self, idx):
        if self.trg is None:
            return self.getitem('src', idx)
        return self.named_tuple(self.getitem('src', idx), self.getitem('trg', idx))
    
    def getitem(self, field_name, idx):
        data = self.data_source[field_name][idx]
        field = self.field[field_name]
        tokenize_data = field.get_processed_data(data)
        return torch.Tensor(self.field[field_name].vocab.stoi(tokenize_data)).long().to(self.device)

In [57]:
ds = seq2seqDataset('dd', 'ee', {'src':SRC, 'trg':TRG})
for _ in ds:
    print(_.src, _.trg)

tensor([3, 1, 2]) tensor([   2, 7210,    3])
tensor([3, 1, 2]) tensor([   2, 7210,    3])


In [58]:
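from torch.nn.utils.rnn import pad_sequence  # explicit import in case `dataloader` does not export it

def pad_collate(batch):
    # batch is a list of (src, trg) pairs; pad each side to its longest sequence with <PAD> (index 0)
    (xx, yy) = zip(*batch)
    xx_pad = pad_sequence(xx, batch_first=True, padding_value=0)
    yy_pad = pad_sequence(yy, batch_first=True, padding_value=0)
    return xx_pad, yy_pad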
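
For readers unfamiliar with `pad_sequence`, the plain-Python sketch below shows the batch-first padding that the collate function relies on: every sequence is extended with the padding index up to the length of the longest one. `pad_batch` is a hypothetical helper operating on lists rather than tensors.

```python
def pad_batch(sequences, padding_value=0):
    """Right-pad every sequence with padding_value to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [padding_value] * (max_len - len(seq)) for seq in sequences]


batch = [[3, 4, 2], [3, 4, 12911, 2], [3, 4, 5, 4154, 41, 30, 2]]
padded = pad_batch(batch)
for row in padded:
    print(row)
```

The result has one row per sentence, which corresponds to the `batch_first=True` layout.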

#### sorting by length of de

In [59]:
sorted_list = sorted(zip(de, en), key=lambda e: len(e[0]))
de, en = list(zip(*sorted_list))
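
The sort above orders sentence pairs by the character length of the German string. If the goal is to minimize padding, ordering by token count and then cutting consecutive batches may be closer to what is wanted. The sketch below assumes whitespace tokenization for simplicity (the notebook itself uses spaCy tokenizers), and `length_sorted_batches` is a hypothetical helper.

```python
def length_sorted_batches(pairs, batch_size,
                          length_fn=lambda pair: len(pair[0].split())):
    """Sort (src, trg) pairs by source length and cut them into consecutive batches."""
    ordered = sorted(pairs, key=length_fn)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


pairs = [("a b c d", "x"), ("a", "y"), ("a b", "z"), ("a b c", "w")]
batches = length_sorted_batches(pairs, batch_size=2)
for b in batches:
    print(b)
```

Sentences of similar length end up in the same batch, so each batch needs little padding.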

In [60]:
dataset = seq2seqDataset(de, en, field = {'src':SRC, 'trg':TRG}, device=device)
dl = DataLoader(dataset, batch_size=10, collate_fn=pad_collate)

In [61]:
SRC.vocab.stoi_dict['<EOS>']

3

In [62]:
for _ in dl:
    print(_[0].shape, _[1].shape)
    print(_[0])
    break

torch.Size([10, 8]) torch.Size([10, 12])
tensor([[    3,     4,     2,     0,     0,     0,     0,     0],
        [    3,     4, 12911,     2,     0,     0,     0,     0],
        [    3,     4, 12911,     2,     0,     0,     0,     0],
        [    3,     4,     5,  4154,    41,    30,     2,     0],
        [    3,     4,     5,  9043,   134,    76,     2,     0],
        [    3,     4,     5,   296,  6648,    30,     2,     0],
        [    3,     4,  7562,    77,     2,     0,     0,     0],
        [    3,     4,     5,    61,   578,   331,     2,     0],
        [    3,     4,     5,   307,    12,   402,    17,     2],
        [    3,     4,     5,  1458,   273,  2622,    30,     2]],
       device='cuda:0')


In [63]:
test_de = ["Ich lerne tiefes Lernen und maschinelles Lernen. Wie oft wiederholt sich 'Lernen'?",
          "maschinelles Lernen", 'Lernen']
# English: I am learning deep learning and machine learning. How many times does "learning" repeat?

In [64]:
test_ds = seq2seqDataset(test_de, field={'src': SRC}, device=device)
test_dl = DataLoader(test_ds, batch_size=1)

In [65]:
for _ in test_dl:
    print(_)

tensor([[   3, 6866, 9649,    1, 9649,  209,    1,    1, 8114,    5,    1,    1,
           33,    1, 7285,    1, 4271,    2]], device='cuda:0')
tensor([[3, 1, 1, 2]], device='cuda:0')
tensor([[3, 1, 2]], device='cuda:0')


In [66]:
test_de

["Ich lerne tiefes Lernen und maschinelles Lernen. Wie oft wiederholt sich 'Lernen'?",
 'maschinelles Lernen',
 'Lernen']

In [67]:
for _ in test_dl:
    print([SRC.vocab.itos_dict[int(idx)] for idx in _.data[0]][::-1])

['<SOS>', 'Ich', '<UNK>', 'tiefes', '<UNK>', 'und', '<UNK>', '<UNK>', '.', 'Wie', '<UNK>', '<UNK>', 'sich', "'", '<UNK>', "'", '?', '<EOS>']
['<SOS>', '<UNK>', '<UNK>', '<EOS>']
['<SOS>', '<UNK>', '<EOS>']


In [68]:
SRC.vocab.itos_dict[4349.]

'braunhaariges'

In [69]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
        assert encoder.n_layers == decoder.n_layers, \
            "Encoder and decoder must have equal number of layers!"
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time
        
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        #last hidden state of the encoder is used as the initial hidden state of the decoder
        hidden, cell = self.encoder(src)
        
        #first input to the decoder is the <sos> tokens
        input = trg[0, :]
        
        for t in range(1, trg_len):
            
            #insert input token embedding, previous hidden and previous cell states
            #receive output tensor (predictions) and new hidden and cell states
            output, hidden, cell = self.decoder(input, hidden, cell)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1
        
        return outputs

    def greedy_decoder(self, src, max_len):
        # src arrives batch-first from the DataLoader; the encoder expects [src len, batch size]
        outputs = []
        src = src.transpose(1, 0)
        hidden, cell = self.encoder(src)
        input = torch.Tensor([2]).long().to(self.device) # <SOS> token (index 2 in Vocab)
        #input = [1], hidden = [2, 1, 512], cell = [2, 1, 512]
        for _ in range(max_len):
            output, hidden, cell = self.decoder(input, hidden, cell)
            #output = [1, 10858], hidden = [2, 1, 512], cell = [2, 1, 512]
            # greedy choice: feed the highest-scoring token back in as the next input
            top1 = output.argmax(1)
            input = top1
            outputs.append(int(top1.data))
        # always emits max_len tokens; decoding does not stop at <EOS>
        return outputs
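
The `greedy_decoder` method above always produces `max_len` tokens. A common refinement is to stop as soon as the end-of-sequence token is predicted. The sketch below shows that control flow in plain Python with a scripted stand-in for the decoder; `greedy_decode`, `step_fn` and `toy_step` are hypothetical, and the indices 2 and 3 follow the `<SOS>`/`<EOS>` values used by the `Vocab` class.

```python
def greedy_decode(step_fn, state, sos_idx, eos_idx, max_len):
    """Greedy decoding loop that stops at <EOS> or after max_len steps.

    step_fn(token, state) must return (scores, new_state), where scores is a
    list with one score per vocabulary entry.
    """
    token, output = sos_idx, []
    for _ in range(max_len):
        scores, state = step_fn(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if token == eos_idx:
            break                       # stop without emitting <EOS>
        output.append(token)
    return output


def toy_step(token, state):
    """Stand-in for a decoder step: follows a fixed script, ignoring the input token."""
    script = [4, 5, 4, 3, 5]            # the 4th prediction is <EOS> (index 3)
    scores = [0.0] * 6
    scores[script[state]] = 1.0
    return scores, state + 1


print(greedy_decode(toy_step, 0, sos_idx=2, eos_idx=3, max_len=10))
```

In the real model, `step_fn` would wrap `self.decoder(input, hidden, cell)` and `state` would be the `(hidden, cell)` pair.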

In [70]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

model = Seq2Seq(enc, dec, device).to(device)

In [71]:
for _ in train_iterator:
    print(_.src.shape)
    print(_.trg.shape)
    break

torch.Size([7, 128])
torch.Size([13, 128])




In [72]:
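for batch in test_dl:
    # x is overwritten on each iteration, so it ends up holding the tokens for the last test sentence
    x = model.greedy_decoder(batch, 10)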

In [73]:
TRG.vocab.itos(x)

'<EOS> dice labels congregation leads match dollhouse dollhouse assistance sits'