<a href="https://colab.research.google.com/github/graviraja/100-Days-of-NLP/blob/applications%2Fgeneration/applications/generation/Improved%20Machine%20Translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we will implement the model from [Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation](https://arxiv.org/pdf/1406.1078.pdf) paper. This model will have better test perplexity than the [Basic MT model](https://github.com/graviraja/100-Days-of-NLP/blob/master/applications/generation/Basic%20Machine%20Translation.ipynb) whilst using only single layer RNN in both encoder and decoder.

![](https://drive.google.com/uc?id=1zPgyT1xZ0g37OAbFM6HWSI1mxfOUxTbJ)

### Resources
- [Unreasonable effectiveness of RNN](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
- [Ben Trevett Seq2Seq](https://github.com/bentrevett/pytorch-seq2seq)
- [Multi30K dataset](https://pytorch.org/text/datasets.html#multi30k)

## Initial Setup

In [0]:
import time
import math
import random
import spacy
import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim

from torchtext.data import Field, BucketIterator
from torchtext.datasets import Multi30k

In [0]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

## Tokenization

For tokenizing the english and german sentences, we will be using spacy.

Download the models corresponding to the language.

- German - **`de`**
- English - **`en`**

In [3]:
!python -m spacy download de
!python -m spacy download en

Collecting de_core_news_sm==2.2.5
[?25l  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-2.2.5/de_core_news_sm-2.2.5.tar.gz (14.9MB)
[K     |████████████████████████████████| 14.9MB 506kB/s 
Building wheels for collected packages: de-core-news-sm
  Building wheel for de-core-news-sm (setup.py) ... [?25l[?25hdone
  Created wheel for de-core-news-sm: filename=de_core_news_sm-2.2.5-cp36-none-any.whl size=14907056 sha256=60e3728a80341be9b598fd61d6a40c4bcd343028dba35e416623cea0735b3cb8
  Stored in directory: /tmp/pip-ephem-wheel-cache-ypt1etfx/wheels/ba/3f/ed/d4aa8e45e7191b7f32db4bfad565e7da1edbf05c916ca7a1ca
Successfully built de-core-news-sm
Installing collected packages: de-core-news-sm
Successfully installed de-core-news-sm-2.2.5
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('de_core_news_sm')
[38;5;2m✔ Linking successful[0m
/usr/local/lib/python3.6/dist-packages/de_core_news_sm -->
/usr/local/

Load the models

In [0]:
tokenize_en = spacy.load('en')
tokenize_de = spacy.load('de')

## Tokenizers

In [0]:
def tokenizer_en(text):
    return [tok.text for tok in tokenize_en.tokenizer(text)]

def tokenizer_de(text):
    return [tok.text for tok in tokenize_de.tokenizer(text)]

## Field

Field defines how the data should be processed.

Since we are tokenizing using spacy, we can pass our tokenizer method to the argument **tokenizer**

In order to indicate the starting and ending of a sentence, we can init_token as **`<sos>`** and eos_token as **`<eos>`**

In [0]:
SRC = Field(
        tokenize=tokenizer_de,
        init_token='<sos>',
        eos_token='<eos>',
        lower=True
)

TRG = Field(
        tokenize=tokenizer_en,
        init_token='<sos>',
        eos_token='<eos>',
        lower=True
)

## Dataset

We will be using Multi30K dataset. Torchtext provides support for multi open datasets. This is a dataset with ~31,000 parallel English, German and French sentences.

exts specifies which languages to use as the source and target (source goes first) and fields specifies which field to use for the source and target.

In [7]:
train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'), fields=(SRC, TRG))

downloading training.tar.gz


training.tar.gz: 100%|██████████| 1.21M/1.21M [00:04<00:00, 248kB/s]


downloading validation.tar.gz


validation.tar.gz: 100%|██████████| 46.3k/46.3k [00:00<00:00, 80.0kB/s]


downloading mmt_task1_test2016.tar.gz


mmt_task1_test2016.tar.gz: 100%|██████████| 66.2k/66.2k [00:00<00:00, 73.4kB/s]


In [8]:
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")

Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000


In [9]:
print(vars(train_data.examples[0]))

{'src': ['zwei', 'junge', 'weiße', 'männer', 'sind', 'im', 'freien', 'in', 'der', 'nähe', 'vieler', 'büsche', '.'], 'trg': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']}


## Vocabulary

Using the **`min_freq argument`**, we only allow tokens that appear at least 2 times to appear in our vocabulary. Tokens that appear only once are converted into an **`<unk>`** (unknown) token.

It is important to note that our vocabulary should only be built from the training set and not the validation/test set.

In [0]:
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(valid_data, min_freq=2)

In [11]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [0]:
BATCH_SIZE = 64

## Iterators

The final step of preparing the data is to create the iterators. These can be iterated on to return a batch of data which will have a src attribute (the PyTorch tensors containing a batch of numericalized source sentences) and a trg attribute (the PyTorch tensors containing a batch of numericalized target sentences).

When we get a batch of examples using an iterator we need to make sure that all of the source sentences are padded to the same length, the same with the target sentences. Luckily, TorchText iterators handle this for us!

We use a BucketIterator instead of the standard Iterator as it creates batches in such a way that it minimizes the amount of padding in both the source and target sentences.

In [0]:
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=BATCH_SIZE,
    device=device
)

In [14]:
# sample checking
temp = next(iter(train_iterator))
temp.src.shape, temp.trg.shape

(torch.Size([25, 64]), torch.Size([29, 64]))

## Encoder

![](https://drive.google.com/uc?id=1FjGJ0Fdya_Eq2HkSS27sgyG-n_mJFrQI)

The encoder we are using is a 1 layer GRU.

For more on GRU and hidden states explaination refer to my post on RNN [here](https://github.com/graviraja/100-Days-of-NLP/blob/applications/generation/architectures/RNN.ipynb)

Once the source sentence is passed through the GRU, the final step's hidden state in all the layers will be used for decoding.

This final hidden state is called context vector, which represents the encoding of the source sentence. This will be used as the initial hidden state for decoder and also as input to decoder rnn along with decoder embedded input. This context vector will also be provided as input to Classifier along with decoder rnn output. 

> Note: In order for this to work, the number of layers in encoder and decoder must be same. Here we will be using 1 layer.

In [0]:
class Encoder(nn.Module):
    def __init__(self, input_size, emb_size, hidden_size, dropout):
        super().__init__()

        self.embedding = nn.Embedding(input_size, emb_size)
        self.rnn = nn.GRU(emb_size, hidden_size)

        self.dropout = nn.Dropout(dropout)
    
    def forward(self, input):
        # input => [seq_len, batch_size]

        embedded = self.embedding(input)
        embedded = self.dropout(embedded)
        # embedded => [seq_len, batch_size, emb_dim]

        output, hidden = self.rnn(embedded)
        # output => [seq_len, batch_size, hidden_dim]
        # hidden => [num_layers * num_dir, batch_size, hidden_dim] => [1, batch_size, hidden_dim]

        return hidden

## Decoder

![](https://drive.google.com/uc?id=1b4aQXhMU11bE3hvLyCHZUFViqNI_6aQO)

The decoder is where the implementation differs significantly from the previous model and we alleviate some of the information compression.

In the [previous implementation](https://github.com/graviraja/100-Days-of-NLP/blob/master/applications/generation/Basic%20Machine%20Translation.ipynb), the context vector is used only as the initial hidden state to decoder. i.e Only during the decoding of the first token in the target, the source information is available. For the rest of the tokens, the propagated states will be used.

However, here we provide the context vector for every decoding step, as an input to decoder and as an input to classifier. This helps the decoder to have more information about the tokens decoded so far and also the source sentence information. 

The addition of $z^1$ to the classifier also means this layer can directly see what the token is, without having to get this information from the hidden state.

In [0]:
class Decoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hidden_dim, dropout):
        super().__init__()

        self.input_dim = input_dim
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim + hidden_dim, hidden_dim)
        self.fc = nn.Linear(emb_dim + (hidden_dim * 2), input_dim)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, input, hidden, context):
        # input => [batch_size]
        # hidden => [1, batch_size, hidden_dim]
        # context => [1, batch_size, hidden_dim]

        input = input.unsqueeze(0)
        # input => [1, batch_size]

        embedded = self.embedding(input)
        embedded = self.dropout(embedded)
        # embedded => [1, batch_size, emb_dim]

        combined = torch.cat((embedded, context), dim=-1)
        # combined => [1, batch_size, emb_dim + hid_dim]

        output, hidden = self.rnn(combined, hidden)
        # output => [1, batch_size, hid_dim]
        # hidden => [1, batch_size, hid_dim]

        out_combined = torch.cat((embedded.squeeze(0), hidden.squeeze(0), context.squeeze(0)), dim=-1)
        # out_combined => [batch_size, emb_dim + 2 * hid_dim ]

        logits = self.fc(self.dropout(out_combined))
        # logits => [batch_size, input_dim]

        return logits, hidden

## Seq2Seq

For the final part of the implemenetation, we'll implement the seq2seq model. This will handle:

- receiving the input/source sentence
- using the encoder to produce the context vectors
- using the decoder to produce the predicted output/target sentence

Our full model will look like this:

![](https://drive.google.com/uc?id=1zPgyT1xZ0g37OAbFM6HWSI1mxfOUxTbJ)


The forward method takes the source sentence, target sentence and a teacher-forcing ratio.

The teacher forcing ratio is used when training our model. 

When decoding, at each time-step we will predict what the next token in the target sequence will be from the previous tokens decoded. With probability equal to the teaching forcing ratio (teacher_forcing_ratio) we will use the actual ground-truth next token in the sequence as the input to the decoder during the next time-step.

However, with probability 1 - teacher_forcing_ratio, we will use the token that the model predicted as the next input to the model, even if it doesn't match the actual next token in the sequence.

The first input to the decoder is the start of sequence `<sos>` token. As our trg tensor already has the `<sos>` token appended we get our $y_1$ by slicing into it. We know how long our target sentences should be (max_len), so we loop that many times. The last token input into the decoder is the one before the `<eos>` token

In [0]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()

        self.encoder = encoder
        self.decoder = decoder
        self.device = device
    
    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # src => [seq_len, batch_size]
        # trg => [seq_len, batch_size]

        batch_size = src.shape[-1]
        max_trg_length = trg.shape[0]
        output_dim = self.decoder.input_dim

        # outputs: to store the predictions of the decoder
        outputs = torch.zeros(max_trg_length, batch_size, output_dim).to(self.device)

        context = self.encoder(src)
        hidden = context

        # send the initial input to decoder as <sos>
        dec_inp = trg[0, :]

        # loop till the maximum target length
        for i in range(1, max_trg_length):
            output, hidden = self.decoder(dec_inp, hidden, context)
            outputs[i] = output
            # to decide whether to use the predicted as input or actual ground truth for the next time step
            teacher_force = random.random() < teacher_forcing_ratio

            # pick the top one: greedy
            # other strategy is to sample from the top_k distribution
            top1 = output.argmax(1)

            dec_inp = trg[i] if teacher_force else top1
        return outputs


In [0]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HIDDEN_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HIDDEN_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HIDDEN_DIM, DEC_DROPOUT)
model = Seq2Seq(enc, dec, device).to(device)

In [50]:
def init_weights(model):
    for name, param in model.named_parameters():
        nn.init.normal_(param.data, mean=0, std=0.01)
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7855, 256)
    (rnn): GRU(256, 512)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(844, 256)
    (rnn): GRU(768, 512)
    (fc): Linear(in_features=1280, out_features=844, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

In [51]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"The model has {count_parameters(model)} trainable parameters")

The model has 6459980 trainable parameters


## Optimizer & Loss Criterion

We use the Adam optimizer.

**`CrossEntropyLoss`**: This criterion combines `nn.LogSoftmax()` and `nn.NLLLoss()` in one single class. It is useful when training a classification problem with C classes. Ignore the **`<pad>`** index as it does not contribute to loss

In [0]:
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss(ignore_index=TRG.vocab.stoi[TRG.pad_token])

## Train

At each iteration:

- get the source and target sentences from the batch, $X$ and $Y$
- zero the gradients calculated from the last batch
- feed the source and target into the model to get the output, $\hat{Y}$
- as the loss function only works on 2d inputs with 1d targets we need to flatten each of them with .view
we slice off the first column of the output and target tensors (`<sos>` not used)
- calculate the gradients with loss.backward()
- clip the gradients to prevent them from exploding
- update the parameters of our model by doing an optimizer step
- update the loss

Finally, we return the loss that is averaged over all batches.

In [0]:
def train(model, iterator, criterion, optimizer, clip):
    epoch_loss = 0
    model.train()

    for batch in iterator:
        src = batch.src
        trg = batch.trg

        optimizer.zero_grad()
        output = model(src, trg)

        output_dim = output.shape[-1]
        loss = criterion(output[1:].view(-1, output_dim), trg[1:].view(-1))
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
    
    return epoch_loss / len(iterator)

## Evaluate

Similar to training loop, without backward pass.

In [0]:
def evaluate(model, iterator, criterion):
    epoch_loss = 0
    model.eval()

    with torch.no_grad():
        for batch in iterator:
            src = batch.src
            trg = batch.trg

            output = model(src, trg)
            output_dim = output.shape[-1]
            loss = criterion(output[1:].view(-1, output_dim), trg[1:].view(-1))
            
            epoch_loss += loss.item()
    return epoch_loss / len(iterator)

In [0]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = elapsed_time - elapsed_mins
    return elapsed_mins, elapsed_secs

## Training

In [56]:
N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss = train(model, train_iterator, criterion, optimizer, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'model.pt')
    
    print(f"Epoch: {epoch + 1} | Time: {epoch_mins}m {epoch_secs}s")
    print(f"\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}")
    print(f"\tValid Loss: {valid_loss:.3f} | Valid PPL: {math.exp(valid_loss):7.3f}")


Epoch: 1 | Time: 0m 27.00088930130005s
	Train Loss: 3.897 | Train PPL:  49.267
	Valid Loss: 3.737 | Valid PPL:  41.985
Epoch: 2 | Time: 0m 26.967023611068726s
	Train Loss: 3.196 | Train PPL:  24.427
	Valid Loss: 3.084 | Valid PPL:  21.835
Epoch: 3 | Time: 0m 26.914294719696045s
	Train Loss: 2.811 | Train PPL:  16.633
	Valid Loss: 2.770 | Valid PPL:  15.952
Epoch: 4 | Time: 0m 26.982502937316895s
	Train Loss: 2.553 | Train PPL:  12.848
	Valid Loss: 2.489 | Valid PPL:  12.052
Epoch: 5 | Time: 0m 26.98589015007019s
	Train Loss: 2.381 | Train PPL:  10.812
	Valid Loss: 2.314 | Valid PPL:  10.111
Epoch: 6 | Time: 0m 26.743107557296753s
	Train Loss: 2.245 | Train PPL:   9.440
	Valid Loss: 2.246 | Valid PPL:   9.454
Epoch: 7 | Time: 0m 26.547712087631226s
	Train Loss: 2.145 | Train PPL:   8.538
	Valid Loss: 2.182 | Valid PPL:   8.863
Epoch: 8 | Time: 0m 26.72164821624756s
	Train Loss: 2.042 | Train PPL:   7.709
	Valid Loss: 2.159 | Valid PPL:   8.667
Epoch: 9 | Time: 0m 26.534064292907715s
	Tr

## Testing

Load the pre-trained model and evaluate on the test data

In [57]:
model.load_state_dict(torch.load('model.pt'))
test_loss = evaluate(model, test_iterator, criterion)
print(f"Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f}")

Test Loss: 1.952 | Test PPL:   7.041
