# Neural Machine Translation
This article walks through the core concepts behind the following:
- English-to-German translation using Neural Machine Translation (NMT)
- The model uses LSTM networks with attention
- Beyond translation, machine translation also has to handle word sense disambiguation (e.g. whether `bank` refers to a `financial bank` or a `riverside bank`)
- An RNN built from LSTMs works for short to medium sentences but can suffer from vanishing gradients on long sequences
- To address this, an attention mechanism lets the decoder access all relevant parts of the input sentence regardless of its length

1. Preprocess the training and eval data
2. Implement an encoder-decoder system with attention
3. Understand how attention works
4. Build the NMT model from scratch using Trax
5. Generate translations using greedy and Minimum Bayes Risk (MBR) decoding

## Part 1. Data Preparation

### 1.1 Importing the Data

In [1]:
from termcolor import colored
import random
import numpy as np

import trax
from trax import layers as tl
from trax.fastmath import numpy as fastnp
from trax.supervised import training

DATA_DIR = './data/01'

!pip list | grep trax # trax == 1.3.4 is required 

INFO:tensorflow:tokens_length=568 inputs_length=512 targets_length=114 noise_density=0.15 mean_noise_span_length=3.0 
trax                          1.3.4


In [2]:
# Get generator function for the training set
# This will download the train dataset if no data_dir is specified
train_stream_fn = trax.data.TFDS(
    'opus/medical',
    data_dir=DATA_DIR,
    keys=('en', 'de'),
    eval_holdout_size=0.01, #1% for eval
    train=True
)

# Get generator function for the eval set
eval_stream_fn = trax.data.TFDS(
    'opus/medical',
    data_dir=DATA_DIR,
    keys=('en', 'de'),
    eval_holdout_size=0.01, #1% for eval
    train=False
)

In [3]:
train_stream = train_stream_fn()
print(colored('train data (en, de) tuple:', 'red'), next(train_stream))
print()

eval_stream = eval_stream_fn()
print(colored('eval data (en, de) tuple:', 'red'), next(eval_stream))

train data (en, de) tuple: (b'Tel: +421 2 57 103 777\n', b'Tel: +421 2 57 103 777\n')

eval data (en, de) tuple: (b'Lutropin alfa Subcutaneous use.\n', b'Pulver zur Injektion Lutropin alfa Subkutane Anwendung\n')


### 1.2 Tokenization and Formatting
- Sentences are tokenized using subword representations
- Each sentence is represented as an array of integers
- Subword representations are used to avoid out-of-vocabulary words
- For example, instead of having separate entries in your vocabulary for "fear", "fearless", "fearsome", "some", and "less", you can store only "fear", "some", and "less" and let the tokenizer combine these subwords when needed.
- This makes the vocabulary more flexible, since uncommon words do not have to be stored explicitly
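
To make the idea concrete, here is a minimal sketch of greedy longest-match segmentation over a toy vocabulary. The `segment` helper and `toy_vocab` are hypothetical names introduced for illustration, and the real `ende_32k.subword` tokenizer used by Trax differs in its details; the sketch only shows how a small set of subwords can cover several words.

```python
def segment(word, vocab):
    """Split `word` into the longest matching pieces found in `vocab`."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No subword matched: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


toy_vocab = {"fear", "some", "less"}  # hypothetical three-entry vocabulary
for w in ["fear", "fearless", "fearsome"]:
    print(w, "->", segment(w, toy_vocab))
```

With only three stored entries, all three words can be represented, which is the flexibility the bullets above describe.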

In [4]:
# Global variables that state the filename and directory of the vocabulary file
VOCAB_FILE = 'ende_32k.subword'
VOCAB_DIR = './data/01'

In [5]:
# Tokenize the dataset.
tokenized_train_stream = trax.data.Tokenize(vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)(train_stream)
tokenized_eval_stream = trax.data.Tokenize(vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)(eval_stream)

In [6]:
# Append EOS at the end of each sentence
# Integer assigned as end of sentence (EOS)
# This lets us detect when the model has completed the translation
EOS = 1

# Generator helper function to append EOS to each sentence
def append_eos(stream):
    for (inputs, targets) in stream:
        inputs_with_eos = list(inputs) + [EOS]
        targets_with_eos = list(targets) + [EOS]

        yield np.array(inputs_with_eos), np.array(targets_with_eos)

# Append EOS to the train and eval data
tokenized_train_stream = append_eos(tokenized_train_stream)
tokenized_eval_stream = append_eos(tokenized_eval_stream)

In [7]:
# Filter long sentences
# Drop overly long sentences so we do not run out of memory
# length_keys=[0, 1] means we filter on both the English and German sentences
# Both must be no longer than 256 tokens for training / 512 for eval
filtered_train_stream = trax.data.FilterByLength(
    max_length=256,
    length_keys=[0, 1]
)(tokenized_train_stream)
filtered_eval_stream = trax.data.FilterByLength(
    max_length=512,
    length_keys=[0,1]
)(tokenized_eval_stream)

train_input, train_target = next(filtered_train_stream)
print(colored(f'Single tokenized example input:', 'red' ), train_input)
print(colored(f'Single tokenized example target:', 'red'), train_target)

Single tokenized example input: [ 2538  2248    30 12114 23184 16889     5     2 20852  6456 20592  5812
  3932    96  5178  3851    30  7891  3550 30650  4729   992     1]
Single tokenized example target: [ 1872    11  3544    39  7019 17877 30432    23  6845    10 14222    47
  4004    18 21674     5 27467  9513   920   188 10630    18  3550 30650
  4729   992     1]
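
The length filter in the cell above can be pictured with a small pure-Python generator. This is a sketch of the behavior described in the comments (every selected field must fit within `max_length`), not the Trax implementation; `filter_by_length` and `toy_stream` are hypothetical names.

```python
def filter_by_length(stream, max_length, length_keys=(0, 1)):
    """Yield only examples whose selected fields all fit within max_length."""
    for example in stream:
        if max(len(example[k]) for k in length_keys) <= max_length:
            yield example


toy_stream = [
    ([5, 6, 1], [7, 8, 9, 1]),        # kept: both sides have at most 4 tokens
    ([5, 6, 7, 8, 9, 1], [7, 1]),     # dropped: the input has 6 tokens
    ([5, 1], [7, 8, 9, 10, 11, 1]),   # dropped: the target has 6 tokens
]
kept = list(filter_by_length(toy_stream, max_length=4))
print(len(kept))  # 1
```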


### 1.3 Tokenize and Detokenize Helper Functions

In [8]:
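def tokenize(input_str, vocab_file=None, vocab_dir=None):
    """Encode a string into a batch with one row of subword ids ending in EOS."""
    EOS = 1
    # trax.data.tokenize works on a stream, so wrap the string in an iterator
    inputs = next(trax.data.tokenize(
        iter([input_str]),
        vocab_file=vocab_file,
        vocab_dir=vocab_dir
    ))
    # Mark the end of the sentence and add a batch dimension
    inputs = list(inputs) + [EOS]
    batch_inputs = np.reshape(np.array(inputs), [1, -1])

    return batch_inputs

def detokenize(integers, vocab_file=None, vocab_dir=None):
    """Decode subword ids back into a string, dropping EOS and anything after it."""
    # Remove the batch dimension so we have a flat list of ids
    integers = list(np.squeeze(integers))
    EOS = 1
    if EOS in integers:
        integers = integers[:integers.index(EOS)]

    return trax.data.detokenize(integers, vocab_file=vocab_file, vocab_dir=vocab_dir)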

In [9]:
# Detokenize an input-target pair of tokenized sentences
print(colored(f'Single detokenized example input:', 'red'), detokenize(train_input, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))
print(colored(f'Single detokenized example target:', 'red'), detokenize(train_target, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))
print()

# Tokenize and detokenize a word that is not explicitly saved in the vocabulary file.
# See how it combines the subwords -- 'hell' and 'o'-- to form the word 'hello'.
print(colored(f"tokenize('hello'): ", 'green'), tokenize('hello', vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))
print(colored(f"detokenize([17332, 140, 1]): ", 'green'), detokenize([17332, 140, 1], vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))

Single detokenized example input: During treatment with olanzapine, adolescents gained significantly more weight compared with adults.

Single detokenized example target: Während der Behandlung mit Olanzapin nahmen die Jugendlichen im Vergleich zu Erwachsenen signifikant mehr Gewicht zu.


tokenize('hello'):  [[17332   140     1]]
detokenize([17332, 140, 1]):  hello


### 1.4 Bucketing
[Comprehensive Hands-on Guide to Sequence Model batching strategy: Bucketing technique](https://medium.com/@rashmi.margani/how-to-speed-up-the-training-of-the-sequence-model-using-bucketing-techniques-9e302b0fd976)

In [10]:
# Bucketing to create streams of batches

# Buckets are defined in terms of boundaries and batch sizes
# batch_sizes[i] determines the batch size for items with length < boundaries[i]
# So below, we take a batch of 256 sentences if length < 8, 128 if length is
# between 8 and 16, and so on, and only 2 if length is over 512
boundaries = [8, 16, 32, 64, 128, 256, 512]
batch_sizes = [256, 128, 64, 32, 16, 8, 4, 2]

# Create the generators
train_batch_stream = trax.data.BucketByLength(
    boundaries, 
    batch_sizes,
    length_keys=[0,1]
)(filtered_train_stream)

eval_batch_stream = trax.data.BucketByLength(
    boundaries,
    batch_sizes,
    length_keys=[0,1]
)(filtered_eval_stream)

# Add masking for the padding (0s)
train_batch_stream = trax.data.AddLossWeights(id_to_mask=0)(train_batch_stream)
eval_batch_stream = trax.data.AddLossWeights(id_to_mask=0)(eval_batch_stream)
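
The following NumPy sketch illustrates the two ideas in this cell: picking a bucket from a sentence length, and padding a batch while building loss weights that ignore padding. It is an illustration under stated assumptions, not the Trax implementation: `bucket_index`, `pad_batch`, and the toy data are hypothetical, and the sketch pads only to the longest example in the batch, whereas the Trax batch printed in section 1.5 is padded to a width of 64.

```python
import numpy as np


def bucket_index(length, boundaries):
    """Return the index of the first bucket whose boundary exceeds `length`."""
    for i, boundary in enumerate(boundaries):
        if length < boundary:
            return i
    return len(boundaries)  # overflow bucket for the longest examples


def pad_batch(pairs, pad_id=0):
    """Pad the inputs and targets of one batch to a common width."""
    width = max(max(len(x), len(y)) for x, y in pairs)
    inputs = np.full((len(pairs), width), pad_id, dtype=np.int64)
    targets = np.full((len(pairs), width), pad_id, dtype=np.int64)
    for row, (x, y) in enumerate(pairs):
        inputs[row, :len(x)] = x
        targets[row, :len(y)] = y
    # Loss weights are 1 on real target tokens and 0 on padding.
    weights = (targets != pad_id).astype(np.float32)
    return inputs, targets, weights


toy_boundaries = [4, 8]
print([bucket_index(n, toy_boundaries) for n in (3, 4, 100)])  # [0, 1, 2]

toy_pairs = [([5, 6, 1], [7, 1]), ([9, 1], [3, 4, 1])]
inputs, targets, weights = pad_batch(toy_pairs)
print(inputs.shape)      # (2, 3)
print(weights.tolist())  # [[1.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
```

Grouping sentences of similar length keeps the amount of padding per batch small, which is why shorter buckets can afford larger batch sizes.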

### 1.5 Exploring the Data

In [11]:
input_batch, target_batch, mask_batch = next(train_batch_stream)

# let's see the data type of a batch
print("input_batch data type: ", type(input_batch))
print("target_batch data type: ", type(target_batch))

# let's see the shape of this particular batch (batch length, sentence length)
print("input_batch shape: ", input_batch.shape)
print("target_batch shape: ", target_batch.shape)

input_batch data type:  <class 'numpy.ndarray'>
target_batch data type:  <class 'numpy.ndarray'>
input_batch shape:  (32, 64)
target_batch shape:  (32, 64)


- The token ids are used to look up an embedding vector for each token in the sentence
- Hence, the embedding for a sentence is a matrix
- The number of sentences in each batch is usually a power of 2 for efficient memory usage
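
As a small illustration of why a sentence becomes a matrix, the sketch below looks up rows of a random embedding table with NumPy indexing. The table, its sizes, and the token ids are made up for the example; in the model this lookup is performed by `tl.Embedding` with learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 4             # toy sizes, chosen for illustration
embedding_table = rng.normal(size=(vocab_size, d_model))

sentence = np.array([3, 7, 1, 0, 0])    # token ids, with 1 = EOS and 0 = padding
sentence_matrix = embedding_table[sentence]   # one row per token
batch = np.stack([sentence, sentence])        # shape (batch, length)
batch_tensor = embedding_table[batch]         # shape (batch, length, d_model)

print(sentence_matrix.shape)  # (5, 4)
print(batch_tensor.shape)     # (2, 5, 4)
```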

In [12]:
# pick a random index less than the batch size.
index = random.randrange(len(input_batch))

# use the index to grab an entry from the input and target batch
print(colored('THIS IS THE ENGLISH SENTENCE: \n', 'red'), detokenize(input_batch[index], vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR), '\n')
print(colored('THIS IS THE TOKENIZED VERSION OF THE ENGLISH SENTENCE: \n ', 'red'), input_batch[index], '\n')
print(colored('THIS IS THE GERMAN TRANSLATION: \n', 'red'), detokenize(target_batch[index], vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR), '\n')
print(colored('THIS IS THE TOKENIZED VERSION OF THE GERMAN TRANSLATION: \n', 'red'), target_batch[index], '\n')

THIS IS THE ENGLISH SENTENCE: 
 The adjusted mean difference was -4.3 points (CI 95% -6.4; -2.1 points, p-value < 0.0001).
 

THIS IS THE TOKENIZED VERSION OF THE ENGLISH SENTENCE: 
[   29  9701  1516  2640    53  1581   219     3   199  1164    50  7082
     5  4207 11767    15   330     3   219  7108    15   150     3   135
  1164     2   719    15   980   909 33287   913   266     3  8074  3912
 33022 30650  4729   992     1     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0] 

THIS IS THE GERMAN TRANSLATION: 
 Die angepasste mittlere Differenz betrug -4,3 Punkte (95 %-Konfidenzintervall: -6,4 bis -2,1 Punkte, p-Wert < 0,0001).
 

THIS IS THE TOKENIZED VERSION OF THE GERMAN TRANSLATION: 
[   57 30482  8385   191 14998     5 12919 20657  1581   219   227   199
  2927    50  4207 11770    15 11580  7770 13427  9436 19070     5  2801
    15   330   227   219  

## Part 2. NMT with Attention

### 2.1 Attention Overview
- An attention model will be built on top of an encoder-decoder architecture
- The RNN takes in a tokenized version of a sentence in its encoder
- It then passes the result to the decoder for translation
- A plain sequence-to-sequence model with LSTMs works well for short to medium sentences but degrades on longer ones
- This is because all the context of the input sentence is compressed into one vector that is passed to the decoder block
- As a result, the first parts of the input have very little effect on the final vector passed to the decoder

$$ENCODER \rightarrow \small{hello}\hspace{2mm} \normalsize{how}\hspace{2mm} \large{are}\hspace{2mm} \Large{you}\hspace{2mm} \huge{today}\hspace{2mm} \Huge{!} \normalsize \rightarrow DECODER$$

- Adding an attention layer to this model avoids this problem by giving the decoder access to all parts of the input sentence
- Consider a 4-word input sentence:
    - Remember that a hidden state is produced at each timestep of the encoder
    - These hidden states are all passed to the attention layer, and each is given a score based on the current activation (i.e. hidden state) of the decoder
    - For example, after predicting the first word `wie`, the attention layer receives all the encoder hidden states as well as the decoder hidden state from producing `wie`
    - Given this information, it scores each of the encoder hidden states to decide which one the decoder should focus on to produce the next word
- After training, the model might have learned to align to the second encoder hidden state and subsequently assign a high probability to the word `geht`
- With greedy decoding, that word is output as the next symbol, and the process repeats for the next word until an end-of-sentence prediction is reached
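
The greedy loop described above can be sketched in a few lines. Here `toy_next_log_probs` is a hypothetical stand-in for a trained model over a 5-token vocabulary; a real decoder would call the NMT model with the input sentence and the tokens produced so far.

```python
import numpy as np

EOS = 1


def greedy_decode(next_log_probs, max_length=10):
    """Repeatedly append the most likely next token until EOS appears."""
    output = []
    while len(output) < max_length:
        log_probs = next_log_probs(output)
        token = int(np.argmax(log_probs))
        output.append(token)
        if token == EOS:
            break
    return output


def toy_next_log_probs(prefix):
    """Hypothetical model: prefers a fixed script of tokens, then EOS."""
    script = [4, 2, 3, EOS]
    probs = np.full(5, 0.05)
    probs[script[min(len(prefix), len(script) - 1)]] = 0.8
    return np.log(probs)


print(greedy_decode(toy_next_log_probs))  # [4, 2, 3, 1]
```

The `max_length` cap is a safeguard for models that never emit EOS.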

This is scaled dot-product attention, which has the form
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
- Scores are computed from the queries (Q) and keys (K), then multiplied with the values (V) to get a context vector at a particular timestep of the decoder
- The context vector is fed to the decoder RNN to get a set of probabilities for the next predicted word
- Dividing by the square root of the key dimensionality $(\sqrt{d_k})$ keeps the dot products from growing with $d_k$, which would otherwise saturate the softmax and hurt training
- The encoder activations (hidden states) serve as the keys and values, while the decoder activations serve as the queries
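
The formula can be written out directly in NumPy. This is a single-head, unbatched sketch with made-up shapes and random inputs; it is meant to mirror the equation term by term rather than to reproduce what `tl.AttentionQKV` does internally.

```python
import numpy as np


def scaled_dot_product_attention(queries, keys, values, mask=None):
    """queries: (Lq, d_k), keys: (Lk, d_k), values: (Lk, d_v), mask: (Lq, Lk) booleans."""
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)
    if mask is not None:
        # Padding positions get a very negative score, so softmax gives them ~0 weight.
        scores = np.where(mask, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights


rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))   # 2 decoder steps (queries)
k = rng.normal(size=(3, 4))   # 3 encoder steps (keys); the last one is padding
v = rng.normal(size=(3, 4))   # values, taken from the same encoder steps
mask = np.array([[True, True, False]] * 2)

context, weights = scaled_dot_product_attention(q, k, v, mask)
print(context.shape)          # (2, 4)
print(weights.sum(axis=-1))   # each row sums to 1
```

Each row of `weights` says how much one decoder step attends to each encoder step, and the masked column receives zero weight.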

#### 2.2.1 Input Encoder
- The input encoder runs on the input tokens, creates their embeddings, and feeds them to an LSTM network
- Its outputs are the activations that serve as the keys and values for attention
- It is a serial network built from `tl.Embedding` and `tl.LSTM`

In [13]:
def input_encoder_Fn(input_vocab_size, d_model, n_encoder_layers):
    input_encoder = tl.Serial(
        tl.Embedding(input_vocab_size, d_model),
        [tl.LSTM(d_model) for _ in range(n_encoder_layers)]
    )
    return input_encoder

In [14]:
import w1_unittest

w1_unittest.test_input_encoder_fn(input_encoder_Fn)

 All tests passed


#### 2.2.2 Pre-attention Decoder
- The pre-attention decoder runs on the targets and creates activations that are used as queries in attention
- This is a serial network composed of `tl.ShiftRight`, `tl.Embedding`, and `tl.LSTM`

In [15]:
def pre_attention_decoder_fn(mode, target_vocab_size, d_model):
    pre_attention_decoder = tl.Serial(
        # Shift right to insert start-of-sentence token and implement 
        # teacher forcing during training
        tl.ShiftRight(mode),
        tl.Embedding(target_vocab_size, d_model),
        tl.LSTM(d_model)
    )
    return pre_attention_decoder
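
The effect of the shift on the targets can be sketched with NumPy. The `shift_right` helper below is a hypothetical re-implementation for illustration, assuming a shift of one position with 0 as the inserted token; it shows why the decoder at position t only sees target tokens before t during training (teacher forcing).

```python
import numpy as np


def shift_right(tokens, pad_id=0):
    """Prepend pad_id to each row and drop the last token, keeping the length."""
    tokens = np.asarray(tokens)
    shifted = np.full_like(tokens, pad_id)
    shifted[:, 1:] = tokens[:, :-1]
    return shifted


targets = np.array([[57, 30482, 8385, 1]])
print(shift_right(targets).tolist())  # [[0, 57, 30482, 8385]]
```

The decoder input at each position is the previous correct target token, while the training target at that position remains the unshifted token.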

In [16]:
w1_unittest.test_pre_attention_decoder_fn(pre_attention_decoder_fn)


 All tests passed


#### 2.2.3 Preparing the Attention Input
- This function prepares the input for the attention layer
- It takes the encoder and pre-attention decoder activations, i.e. the hidden states of the encoder and decoder
- These activations are assigned to the queries, keys, and values
- Another output is the mask that distinguishes real tokens from padding tokens
- This mask is used internally by Trax when computing the softmax so that padding tokens have no effect on the computed probabilities
- The mask is built by checking which tokens in the input correspond to padding
- Multi-headed attention computes attention multiple times to improve the model's predictions
- This adds an extra axis that has to be accounted for in the output

In [21]:
def prepare_attention_input(encoder_activations, decoder_activations, inputs):
    """
    Prepare Queries, Keys, Values and Mask for attention
    Args:
        encoder_activations fastnp.array(batch_size, padded_input_length, d_model): output from the input encoder
        decoder_activations fastnp.array(batch_size, padded_input_length, d_model): output from the pre-attention decoder
        inputs fastnp.array(batch_size, padded_input_length): padded input tokens
    
    Returns:
        queries, keys, values and mask for attention.
    """
    # Set the keys and values to the encoder activations
    keys = encoder_activations
    values = encoder_activations
    
    # Set the queries to the decoder activations
    queries = decoder_activations
    
    # Generate the mask to distinguish real tokens from padding
    # hint: inputs is positive for real tokens and 0 where they are padding
    mask = (inputs >= 1)
    
    # Add axes to the mask for attention heads and decoder length
    mask = fastnp.reshape(mask, (mask.shape[0], 1, 1, mask.shape[1]))
    
    # Broadcast so mask shape is [batch_size, attention heads, decoder-len, encoder-len]
    # Note: for this assignment, the attention heads axis is set to 1
    mask = mask + fastnp.zeros((1, 1, decoder_activations.shape[1], 1))
    
    return queries, keys, values, mask
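
The mask reshaping and broadcasting steps are easier to follow on a tiny example. The sketch below repeats them with plain NumPy instead of `fastnp`; the token ids and `decoder_len` are made up for illustration.

```python
import numpy as np

inputs = np.array([[5, 9, 1, 0, 0],    # 3 real tokens, 2 padding
                   [7, 1, 0, 0, 0]])   # 2 real tokens, 3 padding
decoder_len = 4                        # hypothetical padded target length

mask = inputs > 0                                         # (batch, encoder_len)
mask = mask.reshape(mask.shape[0], 1, 1, mask.shape[1])   # (batch, 1, 1, encoder_len)
mask = mask + np.zeros((1, 1, decoder_len, 1))            # (batch, 1, decoder_len, encoder_len)

print(mask.shape)      # (2, 1, 4, 5)
print(mask[0, 0, 0])   # [1. 1. 1. 0. 0.]
```

Adding the zeros array copies the same padding pattern to every decoder position, and it also turns the boolean mask into a float array of ones and zeros.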

In [22]:
w1_unittest.test_prepare_attention_input(prepare_attention_input)


 All tests passed


### 2.3 Implementation Overview
- Step 0: Prepare the input encoder and pre-attention decoder branches
- Step 1: Create a Serial network. It stacks the layers from the next steps one after the other.
- Step 2: Make a copy of the input and target tokens
    - The input and target tokens will be fed into different layers of the model
    - Use a `tl.Select` layer to create copies of these tokens
    - Arrange them as `[input_tokens, target_tokens, input_tokens, target_tokens]`
- Step 3: Create a parallel branch to feed the input tokens to the `input encoder` and the target tokens to the `pre-attention decoder`
    - Use `tl.Parallel` to create these sublayers in parallel
- Step 4: Call the `prepare_attention_input` function to convert the encoder and pre-attention decoder activations to a format that the attention layer will accept
    - Use `tl.Fn` to wrap the function as a layer
- Step 5: Feed Q, K, V, and the mask to the `tl.AttentionQKV` layer
    - This computes the scaled dot-product attention and outputs the attention activations and the mask
    - Though it is a one-liner, it is composed of a deep network made up of several branches
    - Deep layers pose the risk of vanishing gradients during training, which needs to be mitigated
    - A `tl.Residual` layer improves the ability to learn by adding the output of `AttentionQKV` to the queries input
    - Nest the `AttentionQKV` inside the `Residual`
- Step 6: The mask is no longer needed, so it is dropped
    - At this point in the network, the signal stack holds `[attention activations, mask, target tokens]`, and `tl.Select` can be used to keep only the attention activations and target tokens
- Step 7: Feed the attention-weighted output to the LSTM decoder
    - Stack multiple `tl.LSTM` layers to improve the output, using a number of LSTMs equal to `n_decoder_layers` of the model
- Step 8: Determine the probabilities of each subword in the vocabulary with a `tl.Dense` layer whose size equals the size of the vocabulary
- Step 9: Normalize the output to log probabilities by passing the activations from Step 8 to a `tl.LogSoftmax` layer
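
To follow what sits on the signal stack through Steps 2 to 6, it can help to trace it with plain Python lists. The `select` helper below is a hypothetical emulation of the stack effect of a Select-style layer (consume the first `max(indices) + 1` items, emit the items at `indices`), and the strings are placeholders for the actual arrays.

```python
def select(stack, indices):
    """Emulate the stack effect of a Select-style layer on a Python list."""
    n_in = max(indices) + 1
    return [stack[i] for i in indices] + stack[n_in:]


# Step 2: duplicate the input and target tokens.
stack = select(["input_tokens", "target_tokens"], [0, 1, 0, 1])
print(stack)

# Step 3: the two parallel branches consume the first two items.
stack = ["encoder_activations", "decoder_activations"] + stack[2:]

# Step 4: prepare_attention_input consumes 3 items and produces 4.
stack = ["queries", "keys", "values", "mask"] + stack[3:]

# Step 5: attention consumes Q, K, V, and the mask, and returns activations plus the mask.
stack = ["attention_activations", "mask"] + stack[4:]
print(stack)

# Step 6: drop the mask, keeping the activations and the target tokens.
stack = select(stack, [0, 2])
print(stack)  # ['attention_activations', 'target_tokens']
```

The copies made in Step 2 are what allow the input tokens to reach `prepare_attention_input` for the mask and the target tokens to survive to the end of the network.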