# Neural Machine Translation with Attention
Disclaimer: This notebook is an adapted version of [this repository](https://github.com/keon/seq2seq) and [this tutorial](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html).

In this tutorial we'll learn how to build a recurrent neural network with an attention mechanism to automatically translate from German to English!

<img src="static/model.png" width=300 align="center"/>

Imports:

In [None]:
import os
import random
import numpy as np
import re

import torch
from torch import nn
from torch.nn import functional as F
from torch import optim

# for text processing
import spacy
import torchtext

Parameters:

In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

batch_size = 32
hidden_size = 512
embed_size = 256

## Data
1. To train an NMT model in a supervised manner, we need a dataset of parallel texts in two (or more) languages. Today we'll use the **Multi30k** dataset from the `torchtext` package. It's a small dataset containing exactly what we need: parallel sentences in German and English.

In [None]:
torchtext.datasets.Multi30k

2. To tokenize text we'll use `spacy`. Before using it, you have to download the German and English language packages:
```bash
python3 -m spacy download de
python3 -m spacy download en
```

3. To work with text we'll use the handy `torchtext` abstractions [Field](https://torchtext.readthedocs.io/en/latest/data.html#torchtext.data.Field) and [BucketIterator](https://torchtext.readthedocs.io/en/latest/data.html#torchtext.data.BucketIterator). Read the docs on these classes for more details.

Let's combine all data loading and preparation stuff in one function:

In [None]:
def load_dataset(batch_size):
    spacy_de = spacy.load('de')
    spacy_en = spacy.load('en')
    url = re.compile('(<url>.*</url>)')

    def tokenize_de(text):
        return [tok.text for tok in spacy_de.tokenizer(url.sub('@URL@', text))]

    def tokenize_en(text):
        return [tok.text for tok in spacy_en.tokenizer(url.sub('@URL@', text))]

    DE = torchtext.data.Field(tokenize=tokenize_de, include_lengths=True, init_token='<sos>', eos_token='<eos>')
    EN = torchtext.data.Field(tokenize=tokenize_en, include_lengths=True, init_token='<sos>', eos_token='<eos>')
    
    train, val, test = torchtext.datasets.Multi30k.splits(exts=('.de', '.en'), fields=(DE, EN))
    DE.build_vocab(train.src, min_freq=2)
    EN.build_vocab(train.trg, max_size=10000)
    train_iter, val_iter, test_iter = torchtext.data.BucketIterator.splits(
            (train, val, test), batch_size=batch_size, repeat=False)
    return train_iter, val_iter, test_iter, DE, EN
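
The `url` pattern inside `load_dataset` collapses a `<url>...</url>` span into a single `@URL@` placeholder before tokenization. Here is a minimal standalone sketch of that step; the helper name `mask_urls` and the sample sentence are made up for illustration:

```python
import re

# Same pattern as in load_dataset. Note that `.*` is greedy, so with several
# <url> spans on one line everything up to the last </url> is replaced.
url = re.compile('(<url>.*</url>)')

def mask_urls(text):
    """Hypothetical helper: replace a <url>...</url> span with a placeholder."""
    return url.sub('@URL@', text)

sample = "see <url>http://example.com/a</url> for details"
masked = mask_urls(sample)
print(masked)
```

Masking URLs this way keeps the tokenizer from splitting them into many rare tokens that would bloat the vocabulary.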

In [None]:
train_iter, val_iter, test_iter, DE, EN = load_dataset(batch_size)
de_size, en_size = len(DE.vocab), len(EN.vocab)

Grab one batch:

In [None]:
batch = next(iter(train_iter))

`batch` has 2 attributes: `src` and `trg`. Each is a tuple holding the numerical representation of sentences of similar length (padded to a common length) together with their original lengths:

In [None]:
print("input batch shape", batch.src[0].shape)
print("input batch lengths", batch.src[1])
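
To see what "padded, plus original lengths" means, here is a small pure-Python sketch. It is not how `torchtext` does it internally; `pad_batch` is a hypothetical helper and `pad_id=1` is an assumed padding index:

```python
def pad_batch(sentences, pad_id=1):
    """Hypothetical helper: pad token-id lists to a common length.

    Returns a time-major grid of shape (t, b) as nested lists, plus the
    original lengths, mimicking the (tensor, lengths) tuple described above.
    """
    lengths = [len(s) for s in sentences]
    max_len = max(lengths)
    padded = [s + [pad_id] * (max_len - len(s)) for s in sentences]
    # transpose from (b, t) to (t, b): row i holds the i-th token of every sentence
    time_major = [list(step) for step in zip(*padded)]
    return time_major, lengths

grid, lengths = pad_batch([[2, 7, 8, 3], [2, 9, 3]])
print(grid)
print(lengths)
```

Each column of `grid` is one sentence, which is why the notebook indexes a single sentence as `batch.src[0][:, 0]`.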

Using `DE` and `EN` you can convert between string and numerical representations (via the `.vocab.stoi` and `.vocab.itos` mappings):

In [None]:
print("encoded sentence", batch.src[0][:, 0])
print("decoded sentence", [DE.vocab.itos[token] for token in batch.src[0][:, 0]])
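
The idea behind the two mappings can be sketched without `torchtext`. The builder below is a simplified, hypothetical stand-in (the function names, the order of special tokens and the tie-breaking rule are assumptions, not the library's behaviour):

```python
from collections import Counter

def build_vocab(tokenized, min_freq=1,
                specials=('<unk>', '<pad>', '<sos>', '<eos>')):
    """Hypothetical, simplified vocabulary builder (not the torchtext implementation)."""
    counts = Counter(tok for sent in tokenized for tok in sent)
    # special tokens first, then words by descending frequency (ties alphabetical)
    words = sorted((w for w, c in counts.items() if c >= min_freq),
                   key=lambda w: (-counts[w], w))
    itos = list(specials) + words                  # index -> string
    stoi = {tok: i for i, tok in enumerate(itos)}  # string -> index
    return stoi, itos

corpus = [["ein", "hund", "rennt"], ["ein", "mann", "rennt"]]
stoi, itos = build_vocab(corpus, min_freq=2)

def encode(tokens):
    # words below min_freq fall back to <unk>
    return [stoi.get(t, stoi['<unk>']) for t in tokens]

ids = encode(["ein", "hund", "rennt"])
decoded = [itos[i] for i in ids]
print(ids, decoded)
```

Words that occur fewer than `min_freq` times ("hund" here) are mapped to `<unk>`, so decoding does not always recover the original sentence.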

Now that we have the data, it's time to build the models!

# Task 1 (2 points). Seq2Seq
A **Sequence to Sequence** network, or **seq2seq** network, or **Encoder-Decoder** network, is a model consisting of two RNNs called the encoder and decoder. The encoder reads an input sequence and outputs a single vector, and the decoder reads that vector to produce an output sequence. Unlike sequence prediction with a single RNN, where every input corresponds to an output, the seq2seq model frees us from sequence length and order, which makes it ideal for translation between two languages.

With a seq2seq model, the encoder produces a single vector which, in the ideal case, encodes the “meaning” of the input sequence: a single point in some N-dimensional space of sentences.

<img src="static/seq2seq.png" width=1000 align="center"/>

### Encoder
The encoder of a seq2seq network is a bidirectional GRU that outputs a vector for every word of the input sentence. For every input word the encoder produces an output vector and a hidden state, and uses the hidden state for the next input word. To keep the output size equal to `hidden_size`, we sum the outputs over the 2 directions.

**Note:** here we use [nn.GRU](https://pytorch.org/docs/stable/nn.html#torch.nn.GRU), not [nn.GRUCell](https://pytorch.org/docs/stable/nn.html#torch.nn.GRUCell). Read the docs to understand the differences.
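
As a quick illustration of that difference, the sketch below copies the weights of a single-layer, unidirectional `nn.GRU` into an `nn.GRUCell` and checks that looping the cell over time reproduces the sequence-level output. All sizes are arbitrary and chosen only for this example:

```python
import torch
from torch import nn

torch.manual_seed(0)
t, b, e, h = 5, 3, 4, 6  # illustrative sizes

gru = nn.GRU(e, h)       # processes a whole (t, b, e) sequence in one call
cell = nn.GRUCell(e, h)  # processes a single time step of shape (b, e)

# copy the single-layer GRU weights into the cell so both compute the same function
with torch.no_grad():
    cell.weight_ih.copy_(gru.weight_ih_l0)
    cell.weight_hh.copy_(gru.weight_hh_l0)
    cell.bias_ih.copy_(gru.bias_ih_l0)
    cell.bias_hh.copy_(gru.bias_hh_l0)

x = torch.randn(t, b, e)
seq_out, last_hidden = gru(x)  # (t, b, h) and (1, b, h)

# manual loop over time steps with the cell, starting from a zero state
hx = torch.zeros(b, h)
step_outs = []
for step in range(t):
    hx = cell(x[step], hx)
    step_outs.append(hx)
cell_out = torch.stack(step_outs)  # (t, b, h)

print(torch.allclose(seq_out, cell_out, atol=1e-5))
```

In short, `nn.GRU` handles the time loop (and multiple layers and directions) for us, while `nn.GRUCell` is the single-step building block.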

In [None]:
class Encoder(nn.Module):
    def __init__(self, input_size, embed_size, hidden_size,
                 n_layers=1, dropout=0.5):
        super().__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.embed_size = embed_size
        self.embed = nn.Embedding(input_size, embed_size)
        self.gru = nn.GRU(embed_size, hidden_size, n_layers, dropout=dropout, bidirectional=True)

    def forward(self, src, hidden=None):
        """Encodes input sequence

        Args:
            src (torch tensor of shape (t, b)): input sequence
            hidden (torch tensor of shape (n_layers * n_directions, b, h)): prev hidden state (can be None)
        
        Returns:
            outputs (torch tensor of shape (t, b, h)): encoded sequence (directions are summed)
            hidden (torch tensor of shape (n_layers * n_directions, b, h)): hidden state
        """
        embedded = ## embed input
        outputs, hidden = ## forward recurrent unit
        
        # sum bidirectional outputs
        outputs = outputs.view(outputs.shape[0], outputs.shape[1], 2, self.hidden_size)
        outputs = outputs.sum(dim=2)
        
        return outputs, hidden
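
The `view(...).sum(dim=2)` trick relies on how a bidirectional `nn.GRU` lays out its output: the last dimension holds the forward features followed by the backward features. A small check on random data (sizes are illustrative only):

```python
import torch
from torch import nn

torch.manual_seed(0)
t, b, e, h = 7, 2, 4, 3  # illustrative sizes

gru = nn.GRU(e, h, bidirectional=True)
outputs, hidden = gru(torch.randn(t, b, e))  # outputs: (t, b, 2*h), hidden: (2, b, h)

# split the last dimension into (direction, feature) and sum over directions
summed_view = outputs.view(t, b, 2, h).sum(dim=2)
# the same thing written with explicit slices
summed_slice = outputs[:, :, :h] + outputs[:, :, h:]

print(summed_view.shape, torch.allclose(summed_view, summed_slice))
```

The forward direction finishes at the last time step and the backward direction finishes at the first one, so `hidden[0]` matches `outputs[-1, :, :h]` and `hidden[1]` matches `outputs[0, :, h:]`.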

### Decoder
In the simplest seq2seq model, the decoder uses only the last output of the encoder. This last output is sometimes called the context vector, as it encodes context from the entire sequence. The context vector is used as the initial hidden state of the decoder. At every step of decoding, the decoder is given an input token and a hidden state. The initial input token is the start-of-string `<sos>` token, and the first hidden state is the context vector (the encoder’s last hidden state).

But in this tutorial we'll equip our decoder with an attention mechanism!

### Decoder with attention

If only the context vector is passed between the encoder and decoder, that single vector carries the burden of encoding the entire sentence.

Attention allows the decoder network to “focus” on a different part of the encoder’s outputs for every step of the decoder’s own outputs. First we calculate a set of attention weights. These will be multiplied by the encoder output vectors to create a weighted combination. The result should contain information about that specific part of the input sequence, and thus help the decoder choose the right output words.
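
The "weighted combination" is just a batched matrix product. Here is a minimal sketch with random stand-ins for the weights and the encoder outputs (names and sizes are chosen for illustration only):

```python
import torch

torch.manual_seed(0)
b, t, h = 2, 4, 3  # illustrative sizes

encoder_outputs = torch.randn(t, b, h)                     # time-major, as the encoder returns them
attn_weights = torch.softmax(torch.randn(b, 1, t), dim=2)  # each row sums to 1 over time

# batched matrix product: (b, 1, t) @ (b, t, h) -> (b, 1, h)
context = torch.bmm(attn_weights, encoder_outputs.transpose(0, 1))

# the same result written as an explicit weighted sum over time steps
weights_tb1 = attn_weights.squeeze(1).t().unsqueeze(2)     # (t, b, 1)
manual = (weights_tb1 * encoder_outputs).sum(dim=0)        # (b, h)

print(context.shape, torch.allclose(context.squeeze(1), manual, atol=1e-6))
```

Because the weights are non-negative and sum to 1, each context vector is a convex combination of the encoder outputs for that sentence.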

<img src="static/attention.png" width=500 align="center"/>

The attention weights are calculated with another feed-forward layer, which takes the decoder’s previous hidden state and the encoder outputs as inputs. Below is a short description of the so-called Bahdanau attention mechanism (details can be found in the paper [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)).

Attention weights ($\alpha_{ij}$):

$$
    e_{ij} = f(s_{i-1}, h_j) = v^\top \tanh(W [s_{i-1}; h_j] + b) \\
    \alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}
$$

Here $s_{i-1}$ is the previous hidden state of the decoder (`hidden`), $h_j$ is the encoder output at position $j$ (`encoder_outputs`), and $v$, $W$, $b$ are learnable parameters.
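
To make the formulas concrete, here is a minimal functional sketch on random tensors. The names `W`, `v`, `s_prev` and `enc` and all sizes are chosen only for illustration and are not part of the notebook's classes:

```python
import torch
from torch import nn

torch.manual_seed(0)
b, t, h = 2, 5, 4  # illustrative sizes

W = nn.Linear(2 * h, h)   # plays the role of W and b
v = torch.randn(h)        # plays the role of v

s_prev = torch.randn(b, h)   # decoder hidden state s_{i-1}
enc = torch.randn(t, b, h)   # encoder outputs h_j, time-major

# pair s_{i-1} with every h_j: (b, t, 2h)
s_rep = s_prev.unsqueeze(1).expand(b, t, h)
pairs = torch.cat([s_rep, enc.transpose(0, 1)], dim=2)

e = torch.tanh(W(pairs)) @ v     # scores e_ij, shape (b, t)
alpha = torch.softmax(e, dim=1)  # normalise over source positions j

print(alpha.shape, alpha.sum(dim=1))
```

The softmax runs over the source positions, so for every sentence in the batch the weights form a probability distribution over the input words.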

Let's implement it:

In [None]:
class Attention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.attn = nn.Linear(self.hidden_size * 2, hidden_size)
        
        # setup attention parameters
        self.v = nn.Parameter(torch.zeros(hidden_size))
        
        stdv = 1. / np.sqrt(self.v.shape[0])
        self.v.data.uniform_(-stdv, stdv)

    def forward(self, hidden, encoder_outputs):
        """Calculates attention weights

        Args:
            hidden (torch tensor of shape (b, h)): prev hidden state of the decoder
            encoder_outputs (torch tensor of shape (t, b, h)): encoded sequence
        
        Returns:
            attn_weights (torch tensor of shape (b, 1, t)): attention weights
        """ 
        
        timestep = encoder_outputs.shape[0]
        h = hidden.repeat(timestep, 1, 1).transpose(0, 1)  # [B*T*H]
        encoder_outputs = encoder_outputs.transpose(0, 1)  # [B*T*H]
        
        # [B*T*2H]->[B*T*H]
        energy = ## concat h and encoder_outputs, feed to self.attn and then to tanh
        energy = energy.transpose(1, 2)  # [B*H*T]
        
        v = self.v.repeat(encoder_outputs.shape[0], 1).unsqueeze(1)  # [B*1*H]
        attn_weights = ## multiply by v vector to get shape [B*1*T], then apply softmax over T
        
        return attn_weights

Now let's insert attention mechanism to decoder:

In [None]:
class Decoder(nn.Module):
    def __init__(self, embed_size, hidden_size, output_size,
                 n_layers=1, dropout=0.2):
        super().__init__()
        
        self.embed_size = embed_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers

        self.embed = nn.Embedding(output_size, embed_size)
        self.dropout = nn.Dropout(dropout, inplace=True)
        self.attention = Attention(hidden_size)
        self.gru = nn.GRU(hidden_size + embed_size, hidden_size, n_layers, dropout=dropout)
        self.out = nn.Linear(hidden_size * 2, output_size)

    def forward(self, input, last_hidden, encoder_outputs):
        """Decodes with attention token by token

        Args:
            input (torch tensor of shape (b,)): input token
            last_hidden (torch tensor of shape (1, b, h)): last hidden
            encoder_outputs (torch tensor of shape (t, b, h)): encoded sequence
        
        Returns:
            output (torch tensor of shape (b, vocab_size)): output token distribution (log-probabilities)
            hidden (torch tensor of shape (1, b, h)): hidden state
            attn_weights (torch tensor of shape (b, 1, t)): attention weights
        """
        # get the embedding of the current input word (last output word)
        embedded = self.embed(input).unsqueeze(0)  # (1,B,N)
        embedded = self.dropout(embedded)
        
        # calculate attention weights and apply to encoder outputs
        attn_weights = self.attention(last_hidden[-1], encoder_outputs)  # (B,1,T)
        context = ## apply attention weights to encoder_outputs to get shape # (B,1,N) (don't forget to transpose encoder_outputs)
        context = context.transpose(0, 1)  # (1,B,N)
        
        # combine embedded input word and attended context, run through RNN
        rnn_input = torch.cat([embedded, context], 2)
        output, hidden = ## forward recurrent unit 
        output = output.squeeze(0)  # (1,B,N) -> (B,N)
        
        context = context.squeeze(0)
        output = self.out(torch.cat([output, context], 1))
        output = F.log_softmax(output, dim=1)

        return output, hidden, attn_weights

## Wrap in a single Seq2Seq model

In [None]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, src, sos_token, max_len):
        """Sequence-to-sequence inference

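        Args:
            src (torch tensor of shape (t, b)): input sequence
            sos_token (int): index of the <sos> token in the target vocabulary
            max_len (int): number of decoding steps (position 0 is left as zeros)
        
        Returns:
            outputs (torch tensor of shape (max_len, b, vocab_size)): output token log-probabilities for every step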
        """
        device = src.device
        
        batch_size = src.shape[1]
        vocab_size = self.decoder.output_size
        outputs = torch.zeros(max_len, batch_size, vocab_size).to(device)

        encoder_output, hidden = self.encoder(src)
        hidden = hidden[:self.decoder.n_layers]
        
        output = torch.full((batch_size,), sos_token, dtype=torch.long).to(device)
        
        for t in range(1, max_len):
            output, hidden, attn_weights = ## apply decoder
            outputs[t] = output
            
            top1 = output.data.max(1)[1]
            output = top1.to(device)
        return outputs
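
The loop in `forward` is greedy decoding: at every step the most probable token is fed back as the next input. The idea can be sketched in pure Python with a toy "decoder". `step_fn`, `TABLE` and the token indices are hypothetical, and unlike `forward` above this sketch stops early at `<eos>`:

```python
def greedy_decode(step_fn, sos_token, eos_token, max_len):
    """Feed the arg-max token back in as the next input until <eos> or max_len.

    step_fn(token, state) -> (scores, new_state) is a hypothetical stand-in
    for one decoder step; scores has one entry per vocabulary item.
    """
    token, state, result = sos_token, None, []
    for _ in range(max_len):
        scores, state = step_fn(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)  # arg-max
        if token == eos_token:
            break
        result.append(token)
    return result

# toy "decoder": a fixed transition table, token i prefers token TABLE[i]
TABLE = {0: 2, 2: 3, 3: 1}  # 0 = <sos>, 1 = <eos>

def toy_step(token, state):
    scores = [0.0] * 4
    scores[TABLE[token]] = 1.0
    return scores, state

decoded_tokens = greedy_decode(toy_step, sos_token=0, eos_token=1, max_len=10)
print(decoded_tokens)
```

Greedy decoding is cheap but can miss better translations; beam search is a common alternative, though it is not used in this notebook.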

# Task 2 (1 point). Train-loop

Parameters:

In [None]:
epochs = 100
lr = 0.0001
grad_clip = 10.0
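
`grad_clip` is later passed to `nn.utils.clip_grad_norm_`, which rescales all gradients together so that their global L2 norm does not exceed the given value. A small sketch on a throwaway linear model (the model and the artificially large loss are made up for illustration):

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(10, 10)
loss = (model(torch.randn(8, 10)) * 100).pow(2).sum()  # deliberately large loss
loss.backward()

max_norm = 10.0
# returns the total gradient norm measured *before* clipping
norm_before = nn.utils.clip_grad_norm_(model.parameters(), max_norm)
# recompute the global norm from the (now clipped) gradients
norm_after = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))

print(float(norm_before), float(norm_after))
```

Clipping changes only the length of the gradient, not its direction, which helps keep RNN training stable when a batch produces an unusually large gradient.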

Model:

In [None]:
encoder = Encoder(de_size, embed_size, hidden_size, n_layers=2, dropout=0.5)
decoder = Decoder(embed_size, hidden_size, en_size, n_layers=1, dropout=0.0)
seq2seq = Seq2Seq(encoder, decoder).to(device)

optimizer = optim.Adam(seq2seq.parameters(), lr=lr)
print(seq2seq)

trg_sos_token = EN.vocab.stoi['<sos>']

Helper functions `train` and `evaluate`:

In [None]:
def evaluate(model, val_iter, vocab_size, device, DE, EN):
    model.eval()
    pad = EN.vocab.stoi['<pad>']
    total_loss = 0
    for b, batch in enumerate(val_iter):
        src, len_src = batch.src
        trg, len_trg = batch.trg

        src, trg = src.to(device), trg.to(device)
        
        with torch.no_grad():
            # apply model
            output = ## your code here
            
            # calculate nll loss
            # 1) don't take into account first token (it's always <sos>)
            # 2) don't take into account pad token (ignore_index argument)
            loss = ## your code here
            
            total_loss += loss.item()
    return total_loss / len(val_iter)


def train(e, model, optimizer, train_iter, vocab_size, device, grad_clip, DE, EN):
    model.train()
    total_loss = 0
    pad = EN.vocab.stoi['<pad>']
    for b, batch in enumerate(train_iter):
        src, len_src = batch.src
        trg, len_trg = batch.trg
        src, trg = src.to(device), trg.to(device)
        optimizer.zero_grad()
        
        # apply model
        output = ## your code here
        
        # calculate nll loss
        # 1) don't take into account first token (it's always <sos>)
        # 2) don't take into account pad token (ignore_index argument)
        loss = ## your code here
        
        loss.backward()
        
        # clip gradients using nn.utils.clip_grad_norm_ by `grad_clip` value
        nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
        
        optimizer.step()
        total_loss += loss.item()

        if b % 10 == 0 and b != 0:
            total_loss = total_loss / 10
            print("[%d][loss:%5.2f][pp:%5.2f]" % (b, total_loss, np.exp(total_loss)))
            total_loss = 0
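
The two hints in the comments (skip the first token, ignore padding) can be tried out in isolation. The sketch below uses random log-probabilities in place of the model output; the sizes, the target tensor and `pad = 1` are assumptions made for the example:

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
t, b, vocab_size, pad = 4, 2, 6, 1  # illustrative sizes; pad index is an assumption

log_probs = F.log_softmax(torch.randn(t, b, vocab_size), dim=2)  # stands in for model output
trg = torch.tensor([[2, 2],
                    [4, 5],
                    [3, 3],
                    [1, 1]])  # (t, b); first row is <sos>, last row is padding

# drop position 0, flatten time and batch, and skip padded targets
loss = F.nll_loss(log_probs[1:].reshape(-1, vocab_size),
                  trg[1:].reshape(-1),
                  ignore_index=pad)

# the same quantity computed by hand over the non-pad targets
mask = trg[1:] != pad
picked = log_probs[1:].gather(2, trg[1:].unsqueeze(2)).squeeze(2)
manual = -picked[mask].mean()

print(float(loss), float(manual))
```

With `ignore_index`, padded positions contribute neither to the sum nor to the averaging denominator, so short sentences are not penalised for their padding.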

To monitor how training is going, let's translate a fixed batch of sentences every epoch:

In [None]:
fixed_test_batch = next(iter(test_iter))

In [None]:
def show_translations(seq2seq, batch, device, trg_sos_token, max_len=10, max_examples=5):
    sentence_encoded = batch.src[0].to(device)

    # run the model once for the whole batch, without dropout or gradient tracking
    seq2seq.eval()
    with torch.no_grad():
        result_encoded = seq2seq(sentence_encoded, trg_sos_token, max_len).argmax(dim=2)

    for example_i in range(min(batch.src[0].shape[1], max_examples)):
        input_encoded = batch.src[0][:, example_i]
        src_text = " ".join([DE.vocab.itos[index] for index in input_encoded][1:batch.src[1][example_i]])

        # position 0 is never written by the model, so skip it
        pred = " ".join([EN.vocab.itos[index] for index in result_encoded[:, example_i]][1:])

        gt_encoded = batch.trg[0][:, example_i]
        gt = " ".join([EN.vocab.itos[index] for index in gt_encoded][1:batch.trg[1][example_i]])

        print("input:\t", src_text)
        print("pred:\t", pred)
        print("gt:\t", gt)
        print()

Run train-loop:

In [None]:
best_val_loss = None

for e in range(1, epochs + 1):
    train(e, seq2seq, optimizer, train_iter, en_size, device, grad_clip, DE, EN)
    val_loss = evaluate(seq2seq, val_iter, en_size, device, DE, EN)
    print("[Epoch:%d] val_loss:%5.3f | val_pp:%5.2f" % (e, val_loss, np.exp(val_loss)))

    # save the model if the validation loss is the best we've seen so far.
    if best_val_loss is None or val_loss < best_val_loss:
        print("[!] saving model...")
        if not os.path.isdir("weights"):
            os.makedirs("weights")
        torch.save(seq2seq.state_dict(), 'weights/seq2seq_%d.pth' % (e))
        best_val_loss = val_loss
    
    print("Samples from test:")
    show_translations(seq2seq, fixed_test_batch, device, trg_sos_token, max_len=20, max_examples=5)
    print()
        
test_loss = evaluate(seq2seq, test_iter, en_size, device, DE, EN)
print("[TEST] loss:%5.2f" % test_loss)
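
The `pp` values printed during training are perplexities, i.e. the exponential of the mean per-token negative log-likelihood. A tiny sanity check in pure Python (the helper name and the toy numbers are made up for illustration):

```python
import math

def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

vocab = 8
# a model that guesses uniformly assigns probability 1/vocab to every token
uniform_nll = [-math.log(1.0 / vocab)] * 5
pp = perplexity(uniform_nll)
print(pp)
```

A uniform guesser has perplexity equal to the vocabulary size, so a useful model should end up far below `len(EN.vocab)`.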

Yeah, we did it!