<a href="https://colab.research.google.com/github/graviraja/100-Days-of-NLP/blob/applications%2Fgeneration/applications/generation/Basic%20Machine%20Translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Translation

Machine Translation (MT) is the task of automatically converting text in one natural language into another while preserving the meaning of the input and producing fluent text in the output language. In other words, a sequence in the source language is translated into a sequence in the target language.

The most common sequence-to-sequence (seq2seq) models are encoder-decoder models, which typically use a recurrent neural network (RNN) to encode the source (input) sentence into a single vector.

![overview](https://drive.google.com/uc?id=1TkFl1a68iOCfRaW6QGRBs5Jf4SHLqk9j)


In this notebook, we'll refer to this single vector as a context vector. We can think of the context vector as being an abstract representation of the entire input sentence. This vector is then decoded by a second RNN which learns to output the target (output) sentence by generating it one word at a time.

![](https://drive.google.com/uc?id=1AZ8D6qSu_rMWOHa4STtLaPM4cKvTdZvL)

We will be working on German-to-English translation, using the [Multi30k dataset](https://pytorch.org/text/datasets.html#multi30k).


### Resources

- [Unreasonable effectiveness of RNN](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
- [Ben Trevett Seq2Seq](https://github.com/bentrevett/pytorch-seq2seq)
- [Sequence to Sequence Learning with Neural Networks paper](https://arxiv.org/abs/1409.3215)



## Initial Setup

In [0]:
import math
import time
import spacy
import random
import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim

from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator

In [0]:
SEED = 64

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

### Tokenization

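For tokenizing the English and German sentences, we will be using [`spacy`](https://spacy.io/api/tokenizer/).

Download the models corresponding to each language. The short names below work with the spaCy 2.x release shown in the output; spaCy 3 and later dropped these shortcuts, so the full package names (`en_core_web_sm`, `de_core_news_sm`) are needed there.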
- German - **`de`**
- English - **`en`**

In [3]:
!python -m spacy download en
!python -m spacy download de

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')
Collecting de_core_news_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-2.2.5/de_core_news_sm-2.2.5.tar.gz (14.9MB)
     |████████████████████████████████| 14.9MB 827kB/s 
Building wheels for collected packages: de-core-news-sm
  Building wheel for de-core-news-sm (setup.py) ... done
  Created wheel for de-core-news-sm: filename=de_core_news_sm-2.2.5-cp36-none-any.whl size=14907056 sha256=3219c473175965a26972a63f5d2fc6024eeeda5b0f65dd3708cc3b16ea684c9e
  Stored in directory: /tmp/pip-ephem-wheel-cache-moor_itt/wheels/ba/3f/ed/d4aa8e45e7191b7f32db4bfad565e7da1edbf05c916ca7a1ca
Successfully built de-core-news-sm
Inst

Load the models

In [0]:
spacy_en = spacy.load('en')
spacy_de = spacy.load('de')

## Tokenizers

In [0]:
def tokenize_en(text):
    return [tok.text for tok in spacy_en.tokenizer(text)]

def tokenize_de(text):
    return [tok.text for tok in spacy_de.tokenizer(text)]

## Field

For the basic usage of Field, refer to my other notebooks [here](https://github.com/graviraja/100-Days-of-NLP/blob/master/applications/classification/Simple%20Sentiment%20Analysis.ipynb) and [here](https://github.com/graviraja/100-Days-of-NLP/blob/master/applications/classification/Improved%20Sentiment%20Analysis.ipynb).

However, in Machine Translation, the source language and target languages are different.

`Field` defines how the data should be processed.

Since we are tokenizing using spaCy, we can pass our tokenizer function to the **`tokenize`** argument.

To mark the start and end of a sentence, we set `init_token` to **`<sos>`** and `eos_token` to **`<eos>`**. Setting `lower=True` lowercases all tokens.


In [0]:
SRC = Field(
        tokenize=tokenize_de,
        init_token='<sos>',
        eos_token='<eos>',
        lower=True)

TRG = Field(
        tokenize=tokenize_en,
        init_token='<sos>',
        eos_token='<eos>',
        lower=True)
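
To make these options concrete, here is a minimal pure-Python sketch of the kind of preprocessing such a field applies to one sentence. The `preprocess` helper is hypothetical and uses whitespace splitting as a stand-in for the spaCy tokenizer; it is an illustration of the idea, not the torchtext implementation.

```python
def preprocess(text, tokenize=str.split, init_token="<sos>", eos_token="<eos>", lower=True):
    """Hypothetical stand-in for Field-style preprocessing of a single sentence."""
    tokens = tokenize(text)                      # split the raw string into tokens
    if lower:
        tokens = [tok.lower() for tok in tokens] # mirrors lower=True
    # wrap the sentence in start-of-sequence / end-of-sequence markers
    return [init_token] + tokens + [eos_token]


example = preprocess("Zwei junge Männer sind im Freien .")
print(example)
```

The decoder later relies on both markers: `<sos>` is its first input, and `<eos>` is the token it learns to emit when the translation is complete.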

## Dataset

We will be using the [Multi30K dataset](https://pytorch.org/text/datasets.html#multi30k). Torchtext provides built-in support for many open datasets, including this one. Multi30K is a dataset with ~31,000 parallel English, German and French sentences. 

**`exts`** specifies which languages to use as the source and target (source goes first) and **`fields`** specifies which field to use for the source and target.

In [7]:
train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'), fields=(SRC, TRG))

downloading training.tar.gz


training.tar.gz: 100%|██████████| 1.21M/1.21M [00:02<00:00, 532kB/s]


downloading validation.tar.gz


validation.tar.gz: 100%|██████████| 46.3k/46.3k [00:00<00:00, 171kB/s]


downloading mmt_task1_test2016.tar.gz


mmt_task1_test2016.tar.gz: 100%|██████████| 66.2k/66.2k [00:00<00:00, 156kB/s]


In [8]:
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")

Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000


In [9]:
print(vars(train_data.examples[0]))

{'src': ['zwei', 'junge', 'weiße', 'männer', 'sind', 'im', 'freien', 'in', 'der', 'nähe', 'vieler', 'büsche', '.'], 'trg': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']}


## Vocabulary

Using the **`min_freq`** argument, we only keep tokens that appear at least 2 times in the training data. Tokens that appear only once are converted into an **`<unk>`** (unknown) token.

It is important to note that our vocabulary should only be built from the training set and not the validation/test set, which prevents information from the held-out data leaking into the model.

In [0]:
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)

In [11]:
print(f"Unique tokens in source (de) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(TRG.vocab)}")

Unique tokens in source (de) vocabulary: 7855
Unique tokens in target (en) vocabulary: 5893
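
The effect of `min_freq` can be sketched in plain Python. The `build_vocab` and `numericalize` helpers below are hypothetical and the ordering of the special tokens is an assumption made for the example; torchtext's own vocabulary may order entries differently.

```python
from collections import Counter


def build_vocab(tokenized_sentences, min_freq=2,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Keep tokens seen at least `min_freq` times; specials come first."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    itos = list(specials)
    # most frequent tokens first, ties broken alphabetically
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in itos:
            itos.append(tok)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


def numericalize(sentence, stoi, unk_token="<unk>"):
    """Map tokens to indices; anything outside the vocabulary becomes <unk>."""
    return [stoi.get(tok, stoi[unk_token]) for tok in sentence]


train_sentences = [["a", "dog", "runs"], ["a", "cat", "runs"], ["a", "bird", "sings"]]
itos, stoi = build_vocab(train_sentences)
print(itos)                                       # only "a" and "runs" occur twice or more
print(numericalize(["a", "fish", "runs"], stoi))  # "fish" maps to the <unk> index
```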


In [12]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

## Iterators


The final step of preparing the data is to create the iterators. These can be iterated on to return a batch of data which will have a src attribute (the PyTorch tensors containing a batch of numericalized source sentences) and a trg attribute (the PyTorch tensors containing a batch of numericalized target sentences). 


When we get a batch of examples using an iterator we need to make sure that all of the source sentences are padded to the same length, and likewise for the target sentences. Luckily, TorchText iterators handle this for us!

We use a BucketIterator instead of the standard Iterator as it creates batches in such a way that it minimizes the amount of padding in both the source and target sentences.

In [0]:
BATCH_SIZE = 64

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=BATCH_SIZE,
    device=device
)

In [14]:
# sample checking
temp = next(iter(train_iterator))
temp.src.shape, temp.trg.shape

(torch.Size([28, 64]), torch.Size([28, 64]))
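
Why grouping sentences of similar length reduces padding can be seen with a toy calculation. This is only a sketch of the idea: the `padding_needed` helper and the sentence lengths are made up, and the real `BucketIterator` uses a more involved strategy than a plain sort.

```python
def padding_needed(lengths, batch_size):
    """Count pad tokens when consecutive chunks of `lengths` form the batches."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # every sentence is padded up to the longest sentence in its batch
        total += sum(max(batch) - n for n in batch)
    return total


lengths = [3, 12, 4, 11, 5, 10, 6, 9]   # toy sentence lengths, in arrival order
unsorted_pad = padding_needed(lengths, batch_size=2)
bucketed_pad = padding_needed(sorted(lengths), batch_size=2)
print(unsorted_pad, bucketed_pad)
```

With these toy lengths, batching in arrival order pairs short sentences with long ones, while batching after sorting pairs neighbours of almost equal length, so far fewer pad tokens are needed.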

## Encoder


![encoder](https://drive.google.com/uc?id=1-glq5zaySAlL5HpRlA9v3Zj1JXb1kUfu)

The encoder we are using is a 2-layer LSTM. The paper we are implementing uses a 4-layer LSTM, but in the interest of training time we cut this down to 2 layers.

For more on LSTMs and an explanation of hidden states, refer to my post on RNNs [here](https://github.com/graviraja/100-Days-of-NLP/blob/applications/generation/architectures/RNN.ipynb).

Once the source sentence is passed through the LSTM, the hidden and cell states of every layer at the final time-step are used for decoding.

These final states are called the `context vector`, which represents the encoding of the source sentence. They will be used as the initial hidden and cell states of the decoder. 

> *Note: For this to work, the encoder and decoder must have the same number of layers; otherwise some extra strategy is needed. For example, if the encoder has 1 layer and the decoder has 2 layers, the encoder's final hidden state needs to be replicated and passed as the initial state to both decoder layers.*

In [0]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hidden_dim, n_layers, dropout):
        super().__init__()

        self.n_layers = n_layers
        self.hidden_dim = hidden_dim

        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, num_layers=n_layers, dropout=dropout)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, src):
        # src => [seq_len, batch_size]

        embedded = self.embedding(src)
        embedded = self.dropout(embedded)
        # embedded => [seq_len, batch_size, emb_dim]

        outputs, (hidden, cell) = self.rnn(embedded)
        # outputs => [seq_len, batch_size, hid_dim]
        # hidden => [num_layers * num_dir, batch_size, hid_dim] => [2, batch_size, hid_dim]
        # cell => [num_layers * num_dir, batch_size, hid_dim] => [2, batch_size, hid_dim]

        return hidden, cell
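
The note about mismatched layer counts can be illustrated with a small shape experiment. This is a sketch under assumed toy dimensions, with random tensors standing in for embedded sentences; replicating the states with `repeat` is just one possible strategy, not something the notebook itself uses.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch_size, emb_dim, hid_dim = 5, 3, 8, 16   # toy sizes (assumed)

# A 1-layer encoder LSTM fed random "embeddings" instead of real sentences.
encoder_rnn = nn.LSTM(emb_dim, hid_dim, num_layers=1)
embedded = torch.randn(seq_len, batch_size, emb_dim)
outputs, (hidden, cell) = encoder_rnn(embedded)
# outputs      => [seq_len, batch_size, hid_dim]
# hidden, cell => [1, batch_size, hid_dim]

# Replicate the single-layer states so a 2-layer decoder LSTM can start from them.
dec_layers = 2
hidden_rep = hidden.repeat(dec_layers, 1, 1)
cell_rep = cell.repeat(dec_layers, 1, 1)

decoder_rnn = nn.LSTM(emb_dim, hid_dim, num_layers=dec_layers)
step_input = torch.randn(1, batch_size, emb_dim)       # one decoding step
_, (dec_hidden, dec_cell) = decoder_rnn(step_input, (hidden_rep, cell_rep))
print(hidden.shape, hidden_rep.shape, dec_hidden.shape)
```

When the encoder and decoder have the same depth, as in this notebook, no replication is needed and the states can be passed through unchanged.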

## Decoder

![](https://drive.google.com/uc?id=1YomvoraK5OLyJ_UO40T8K2Am6OOAaGuo)

The decoder is also a 2-layer LSTM, like the encoder. The `Decoder` class performs a single step of decoding, i.e. it outputs a single token per time-step. 

The initial hidden and cell states of each decoder layer are the context vectors $z^1, z^2$, i.e. the final hidden and cell states of the encoder layer at the same depth.


As we are only decoding one token at a time, the input tokens will always have a sequence length of 1. We unsqueeze the input tokens to add a sentence length dimension of 1. Then it is passed through the RNN. There is an additional Linear layer, used to make the predictions from the top layer hidden state.


In [0]:
class Decoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hidden_dim, n_layers, dropout):
        super().__init__()

        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        self.input_dim = input_dim

        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, num_layers=n_layers, dropout=dropout)
        self.fc_out = nn.Linear(hidden_dim, input_dim)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, input, hidden, cell):
        # input => [batch_size]
        # hidden => [num_layers, batch_size, hid_dim]
        # cell => [num_layers, batch_size, hid_dim]

        input = input.unsqueeze(0)
        # input => [1, batch_size]

        embedded = self.embedding(input)
        embedded = self.dropout(embedded)
        # embedded => [1, batch_size, emb_dim]

        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        # seq_len and num_directions will always be 1 in decoder
        # output => [seq_len, batch_size, hid_dim]
        #        => [1, batch_size, hid_dim]
        # hidden => [num_layers, batch_size, hid_dim]
        # cell   => [num_layers, batch_size, hid_dim]

        prediction = self.fc_out(output.squeeze(0))
        # prediction => [batch_size, input_dim]

        return prediction, hidden, cell
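
A single decoding step can be traced shape by shape outside the class. The sizes below are arbitrary toy values and the layers are untrained, so this only sketches how the tensors flow, including the fact that the prediction is computed from the top layer's hidden state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, emb_dim, hid_dim, n_layers, batch_size = 20, 8, 16, 2, 4   # toy sizes

embedding = nn.Embedding(vocab_size, emb_dim)
rnn = nn.LSTM(emb_dim, hid_dim, num_layers=n_layers)
fc_out = nn.Linear(hid_dim, vocab_size)

tokens = torch.randint(0, vocab_size, (batch_size,))   # one token per sentence
hidden = torch.zeros(n_layers, batch_size, hid_dim)    # stand-in for the context
cell = torch.zeros(n_layers, batch_size, hid_dim)

step_in = tokens.unsqueeze(0)                          # [1, batch_size]
output, (hidden, cell) = rnn(embedding(step_in), (hidden, cell))
# output => [1, batch_size, hid_dim]; with one time-step it equals hidden[-1]

prediction = fc_out(output.squeeze(0))                 # [batch_size, vocab_size]
next_tokens = prediction.argmax(1)                     # greedy choice, [batch_size]
print(step_in.shape, output.shape, prediction.shape, next_tokens.shape)
```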

## Seq2Seq

For the final part of the implementation, we'll implement the seq2seq model. This will handle:

- receiving the input/source sentence
- using the encoder to produce the context vectors
- using the decoder to produce the predicted output/target sentence

Our full model will look like this:

![](https://drive.google.com/uc?id=1AZ8D6qSu_rMWOHa4STtLaPM4cKvTdZvL)


The forward method takes the source sentence, target sentence and a teacher-forcing ratio.

The teacher forcing ratio is used when training our model. When decoding, at each time-step we predict the next token in the target sequence from the previously decoded tokens, $\hat{y}_{t+1}=f(s_t^L)$, where $s_t^L$ is the hidden state of the top decoder layer at step $t$. With probability equal to the teacher forcing ratio (`teacher_forcing_ratio`) we use the actual ground-truth next token in the sequence as the input to the decoder during the next time-step. 

However, with probability `1 - teacher_forcing_ratio`, we use the token that the model predicted as the next input to the model, even if it doesn't match the actual next token in the sequence.

The first input to the decoder is the start of sequence `<sos>` token. As our trg tensor already has the `<sos>` token prepended, we get our $y_1$ by slicing into it. We know how long our target sentences are (`trg_len`), so we loop over the remaining `trg_len - 1` positions. The last token input into the decoder is the one before the `<eos>` token, since nothing needs to be predicted after `<eos>`.

In [0]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()

        self.encoder = encoder
        self.decoder = decoder
        self.device = device
    
    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # src => [src_len, batch_size]
        # trg => [trg_len, batch_size]

        batch_size = src.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.input_dim

        # outputs: to store the predictions of the decoder
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)

        hidden, cell = self.encoder(src)

        # send the initial input to decoder as <sos>
        dec_inp = trg[0, :]

        # loop till the maximum target length
        for t in range(1, trg_len):
            output, hidden, cell = self.decoder(dec_inp, hidden, cell)
            outputs[t] = output

            # to decide whether to use the predicted as input or actual ground truth for the next time step
            teacher_force = random.random() < teacher_forcing_ratio

            # pick the top one: greedy
            # other strategy is to sample from the top_k distribution
            top1 = output.argmax(1)

            dec_inp = trg[t] if teacher_force else top1
        return outputs
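
The input-selection logic of the loop can be isolated in a few lines of plain Python. The `decoder_inputs` helper and the toy token lists are hypothetical; the "predictions" are fixed strings standing in for the model's greedy outputs.

```python
import random


def decoder_inputs(targets, predictions, teacher_forcing_ratio, rng):
    """Return the token fed to the decoder at each step.

    targets[t] is the ground-truth token at position t and predictions[t]
    is the (pretend) greedy model output for position t.
    """
    inputs = [targets[0]]                       # the first input is always <sos>
    for t in range(1, len(targets) - 1):        # <eos> is never fed as an input
        teacher_force = rng.random() < teacher_forcing_ratio
        inputs.append(targets[t] if teacher_force else predictions[t])
    return inputs


targets = ["<sos>", "two", "dogs", "run", "<eos>"]
predictions = [None, "a", "dog", "runs", "<eos>"]   # position 0 is never predicted

always_truth = decoder_inputs(targets, predictions, 1.0, random.Random(0))
always_model = decoder_inputs(targets, predictions, 0.0, random.Random(0))
print(always_truth)
print(always_model)
```

A ratio of 1.0 feeds only ground-truth tokens and a ratio of 0.0 feeds only the model's own outputs, which is the setting used at evaluation time.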

In [0]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)
model = Seq2Seq(enc, dec, device).to(device)

In [19]:
# initialize the weights of the model from a uniform distribution between -0.08 and 0.08 [stated in paper]
def init_weights(model):
    for name, param in model.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)

model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7855, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(5893, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (fc_out): Linear(in_features=512, out_features=5893, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

In [20]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model)} trainable parameters')

The model has 13899013 trainable parameters
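
The reported parameter count can be reproduced by hand from the layer sizes printed above. The `lstm_params` helper is a hypothetical name; it assumes the standard unidirectional LSTM layout with four gates and two bias vectors per gate.

```python
def lstm_params(input_size, hidden_size, num_layers):
    """Parameter count of a unidirectional LSTM with biases."""
    total = 0
    for layer in range(num_layers):
        # layers after the first take the previous layer's hidden state as input
        in_size = input_size if layer == 0 else hidden_size
        # 4 gates, each with input weights, recurrent weights and two bias vectors
        total += 4 * (in_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
    return total


src_vocab, trg_vocab, emb_dim, hid_dim, n_layers = 7855, 5893, 256, 512, 2

encoder_params = src_vocab * emb_dim + lstm_params(emb_dim, hid_dim, n_layers)
decoder_params = (trg_vocab * emb_dim                       # embedding
                  + lstm_params(emb_dim, hid_dim, n_layers)  # LSTM
                  + hid_dim * trg_vocab + trg_vocab)         # fc_out weights + bias
total_params = encoder_params + decoder_params
print(total_params)  # 13899013
```

Most of the decoder's extra size comes from the output projection, which maps the 512-dimensional hidden state to all 5,893 target tokens.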


## Optimizer & Criterion

We use the **`Adam`** optimizer.

**`CrossEntropyLoss`**: This criterion combines `nn.LogSoftmax()` and `nn.NLLLoss()` in one single class. It is useful when training a classification problem with C classes. We pass the index of the `<pad>` token as `ignore_index` so that padded positions do not contribute to the loss.

In [0]:
optimizer = optim.Adam(model.parameters())

In [0]:
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index=TRG_PAD_IDX)
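
What the criterion computes can be written out in plain Python. This is a simplified sketch (the `cross_entropy` helper is hypothetical and skips the usual numerical-stability tricks), meant only to show the log-softmax, the negative log-likelihood and the effect of ignoring the pad index.

```python
import math


def cross_entropy(logits, targets, ignore_index):
    """Mean negative log-likelihood over targets, skipping `ignore_index`."""
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue                                         # padded position: no loss
        log_norm = math.log(sum(math.exp(x) for x in row))   # log of the softmax denominator
        losses.append(log_norm - row[target])                # -log_softmax(row)[target]
    return sum(losses) / len(losses)


PAD = 1   # assumed pad index for this toy example
logits = [[0.0, 0.0, 0.0, 0.0],    # uniform prediction over 4 classes
          [2.0, 0.0, 0.0, 0.0],    # fairly confident in class 0
          [5.0, 1.0, 0.0, 0.0]]    # this row is paired with a pad target
targets = [2, 0, PAD]

loss = cross_entropy(logits, targets, ignore_index=PAD)
print(round(loss, 4))
```

Because the third target is the pad index, its row is skipped entirely and the mean is taken over the two real tokens only.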

## Train

At each iteration:

- get the source and target sentences from the batch, $X$ and $Y$
- zero the gradients calculated from the last batch
- feed the source and target into the model to get the output, $\hat{Y}$
- slice off the first time-step of the output and target tensors (the `<sos>` position is never predicted)
- flatten both with `.view`, as the loss function only works on 2d inputs with 1d targets
- calculate the loss between the flattened output and target
- calculate the gradients with `loss.backward()`
- clip the gradients to prevent them from exploding
- update the parameters of our model by doing an optimizer step
- add the batch loss to the running total

Finally, we return the loss averaged over all batches.

In [0]:
def train(model, iterator, criterion, optimizer, clip):
    epoch_loss = 0

    # keep the model in train mode
    model.train()

    # iterate over train data
    for i, batch in enumerate(iterator):
        src = batch.src
        trg = batch.trg
        # src => [seq_len, batch_size]
        # trg => [seq_len, batch_size]

        # zero the gradients
        optimizer.zero_grad()

        # forward pass
        output = model(src, trg)

        # drop the <sos> position and flatten to [(trg_len - 1) * batch_size, output_dim]
        # so that the shapes match what the criterion expects
        output_dim = output.shape[-1]
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)

        loss = criterion(output, trg)

        # backward pass
        loss.backward()

        # gradient clipping 
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        # update the parameters of the model
        optimizer.step()

        # update the loss
        epoch_loss += loss.item()
    
    return epoch_loss / len(iterator)
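
The gradient clipping step rescales all gradients together when their overall norm exceeds the threshold. The sketch below shows that idea on a flat list of numbers; `clip_by_global_norm` is a hypothetical helper that follows the general approach of norm-based clipping and is not a copy of the PyTorch implementation.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale `grads` so that their L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # never scale up: gradients already inside the limit are left untouched
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)

small, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(small)
```

The direction of the gradient is preserved; only its length changes, which keeps a single very large update from destabilising training.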


## Evaluate

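Similar to the training loop, but without the backward pass and parameter update: the model is put in eval mode (which disables dropout), gradients are not tracked, and teacher forcing is turned off.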

In [0]:
def evaluate(model, iterator, criterion):
    epoch_loss = 0
    
    # keep the model in eval mode
    model.eval()
    
    # do not calculate gradients
    with torch.no_grad():

        # iterate over the data
        for batch in iterator:
            src = batch.src
            trg = batch.trg
            # src => [seq_len, batch_size]
            # trg => [seq_len, batch_size]

            # forward pass
            # make sure the teacher_forcing_ratio is 0 in eval
            output = model(src, trg, 0)

            # reshaping for loss calculation
            output_dim = output.shape[-1]
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            # loss
            loss = criterion(output, trg)

            # update loss
            epoch_loss += loss.item()
    return epoch_loss / len(iterator)

In [0]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = elapsed_time - (elapsed_mins * 60)
    return elapsed_mins, elapsed_secs

## Training

In [26]:
N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    start_time = time.time()
    
    train_loss = train(model, train_iterator, criterion, optimizer, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'model.pt')
    
    print(f"Epoch {epoch + 1} | Time: {epoch_mins}m {epoch_secs}s")
    print(f"\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f} |")
    print(f"\tValid Loss: {valid_loss:.3f} | Valid PPL: {math.exp(valid_loss):7.3f} |")

Epoch 1 | Time: 0m 37.81912350654602s
	Train Loss: 4.821 | Train PPL: 124.027 |
	Valid Loss: 4.775 | Valid PPL: 118.487 |
Epoch 2 | Time: 0m 37.0620002746582s
	Train Loss: 4.154 | Train PPL:  63.693 |
	Valid Loss: 4.468 | Valid PPL:  87.161 |
Epoch 3 | Time: 0m 37.08384919166565s
	Train Loss: 3.860 | Train PPL:  47.464 |
	Valid Loss: 4.252 | Valid PPL:  70.266 |
Epoch 4 | Time: 0m 37.10718059539795s
	Train Loss: 3.643 | Train PPL:  38.214 |
	Valid Loss: 4.111 | Valid PPL:  61.026 |
Epoch 5 | Time: 0m 37.17174744606018s
	Train Loss: 3.420 | Train PPL:  30.576 |
	Valid Loss: 3.987 | Valid PPL:  53.877 |
Epoch 6 | Time: 0m 37.099289417266846s
	Train Loss: 3.232 | Train PPL:  25.334 |
	Valid Loss: 3.862 | Valid PPL:  47.541 |
Epoch 7 | Time: 0m 37.21440601348877s
	Train Loss: 3.057 | Train PPL:  21.260 |
	Valid Loss: 3.701 | Valid PPL:  40.488 |
Epoch 8 | Time: 0m 37.18330144882202s
	Train Loss: 2.901 | Train PPL:  18.190 |
	Valid Loss: 3.773 | Valid PPL:  43.524 |
Epoch 9 | Time: 0m 36.97

## Testing

Load the best checkpoint saved during training (the one with the lowest validation loss) and evaluate it on the test data.

In [27]:
model.load_state_dict(torch.load('model.pt'))
test_loss = evaluate(model, test_iterator, criterion)
print(f"\tTest Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |")

	Test Loss: 3.602 | Test PPL:  36.658 |
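
The PPL values printed throughout are the exponential of the averaged cross-entropy loss. As a rough point of comparison, the sketch below computes the perplexity of a hypothetical model that guesses uniformly over the 5,893-token target vocabulary; the `perplexity` helper is an illustrative name.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


trg_vocab_size = 5893
uniform_loss = math.log(trg_vocab_size)   # loss of a uniform guess over the vocabulary
uniform_ppl = perplexity(uniform_loss)
print(round(uniform_loss, 3), round(uniform_ppl))
```

A uniform guesser has a perplexity equal to the vocabulary size, so a test perplexity in the tens means the model has narrowed its choice at each step to a far smaller effective set of candidate tokens.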
