# Neural Machine Translation
This article walks through the core concepts behind the following:
- English-to-German translation using Neural Machine Translation (NMT)
- The model uses LSTM networks with attention
- Beyond translation, machine translation (MT) techniques also apply to word sense disambiguation (e.g. deciding whether `bank` refers to a `financial bank` or a `riverside bank`)
- An RNN built from LSTMs works for short to medium sentences but can suffer from vanishing gradients on long sequences
- To address this, an attention mechanism lets the decoder access all relevant parts of the input sentence regardless of its length

1. Preprocess the training and eval data
2. Implement an encoder-decoder system with attention
3. Understand how attention works
4. Build the NMT model from scratch using Trax
5. Generate translations using greedy and Minimum Bayes Risk (MBR) decoding
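
To make the attention idea from the overview concrete before any Trax code appears, here is a minimal sketch of dot-product attention for a single decoder step, written in plain Python. It is an illustration only and not the Trax layer used in this article; the function names `softmax` and `attend` and the toy vectors are assumptions introduced here.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Dot-product attention for one decoder step (toy illustration).

    query:  the current decoder state, a list of floats
    keys:   one vector per input position
    values: one vector per input position
    """
    # Score every input position against the decoder state
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    # The context vector is the weighted average of the value vectors
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

# Hypothetical encoder states for a three-token input sentence
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]

weights, context = attend(decoder_state, encoder_states, encoder_states)
print("attention weights:", weights)
print("context vector:", context)
```

Every input position receives a weight no matter how far it is from the current output position, which is why attention does not degrade with sentence length the way a fixed-size encoder state can.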

## Part 1. Data Preparation

### 1.1 Importing the Data

In [1]:
from termcolor import colored
import random
import numpy as np

import trax
from trax import layers as tl
from trax.fastmath import numpy as fastnp
from trax.supervised import training

DATA_DIR = './data/01'

!pip list | grep trax # trax == 1.3.4 is required 

INFO:tensorflow:tokens_length=568 inputs_length=512 targets_length=114 noise_density=0.15 mean_noise_span_length=3.0 
trax                          1.3.4


In [2]:
# Get generator function for the training set
# This will download the train dataset if no data_dir is specified
train_stream_fn = trax.data.TFDS(
    'opus/medical',
    data_dir=DATA_DIR,
    keys=('en', 'de'),
    eval_holdout_size=0.01, #1% for eval
    train=True
)

# Get generator function for the eval set
eval_stream_fn = trax.data.TFDS(
    'opus/medical',
    data_dir=DATA_DIR,
    keys=('en', 'de'),
    eval_holdout_size=0.01, #1% for eval
    train=False
)

In [3]:
train_stream = train_stream_fn()
print(colored('train data (en, de) tuple:', 'red'), next(train_stream))
print()

eval_stream = eval_stream_fn()
print(colored('eval data (en, de) tuple:', 'red'), next(eval_stream))

train data (en, de) tuple: (b'Decreased Appetite\n', b'Verminderter Appetit\n')

eval data (en, de) tuple: (b'Lutropin alfa Subcutaneous use.\n', b'Pulver zur Injektion Lutropin alfa Subkutane Anwendung\n')


### 1.2 Tokenization and Formatting
- Sentences are tokenized using subword representations
- Each sentence is represented as an array of integers
- Subword representations are used to avoid out-of-vocabulary words
- For example, instead of having separate entries in your vocabulary for "fear", "fearless", "fearsome", "some", and "less", you can simply store "fear", "some", and "less", then let the tokenizer combine these subwords when needed
- This makes the vocabulary more flexible, because uncommon words do not have to be stored explicitly
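
The subword idea can be sketched with a toy greedy longest-match splitter. This is a simplified illustration, not the algorithm behind the Trax vocabulary file; the function `segment` and the three-entry vocabulary are assumptions made for the example.

```python
def segment(word, vocab):
    """Greedy longest-match split of `word` into pieces found in `vocab`.

    Falls back to a single character when no longer piece matches.
    """
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it
        for end in range(len(word), start, -1):
            if word[start:end] in vocab or end == start + 1:
                pieces.append(word[start:end])
                start = end
                break
    return pieces

toy_vocab = {"fear", "some", "less"}  # hypothetical toy vocabulary
for word in ["fear", "fearless", "fearsome"]:
    print(word, "->", segment(word, toy_vocab))
# fear -> ['fear']
# fearless -> ['fear', 'less']
# fearsome -> ['fear', 'some']
```

Three stored entries cover all five words from the example, and an unseen word still gets a representation because the splitter can fall back to smaller pieces.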

In [4]:
# Global variables that state the filename and directory of the vocabulary file
VOCAB_FILE = 'ende_32k.subword'
VOCAB_DIR = './data/01'

In [5]:
# Tokenize the dataset.
tokenized_train_stream = trax.data.Tokenize(vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)(train_stream)
tokenized_eval_stream = trax.data.Tokenize(vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)(eval_stream)

In [6]:
# Append an end-of-sentence (EOS) token to each sentence.
# The integer 1 is reserved as the EOS id.
# EOS lets us tell when the model has finished a translation.
EOS = 1

# Generator helper function to append EOS to each sentence
def append_eos(stream):
    for (inputs, targets) in stream:
        inputs_with_eos = list(inputs) + [EOS]
        targets_with_eos = list(targets) + [EOS]
        
        yield np.array(inputs_with_eos), np.array(targets_with_eos)
        
# Append EOS to the train and eval data
tokenized_train_stream = append_eos(tokenized_train_stream)
tokenized_eval_stream = append_eos(tokenized_eval_stream)

In [7]:
# Filter out overly long sentences so we don't run out of memory.
# length_keys=[0, 1] means the limit applies to both the English and German sentences.
# Both must be no longer than 256 tokens for training / 512 for eval.
filtered_train_stream = trax.data.FilterByLength(
    max_length=256,
    length_keys=[0, 1]
)(tokenized_train_stream)
filtered_eval_stream = trax.data.FilterByLength(
    max_length=512,
    length_keys=[0,1]
)(tokenized_eval_stream)

train_input, train_target = next(filtered_train_stream)
print(colored('Single tokenized example input:', 'red'), train_input)
print(colored('Single tokenized example target:', 'red'), train_target)

Single tokenized example input: [  549   617   117   479     9  1737     4   888  3550 30650  4729   992
     1]
Single tokenized example target: [  328   468  5579    61 12657  3550 30650  4729   992     1]
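
The two preprocessing steps above, appending EOS and dropping over-long pairs, can be tried on toy data without Trax. In this sketch `filter_by_length` is a hypothetical stand-in written for illustration, not `trax.data.FilterByLength` itself, and it assumes a pair is kept only when both sides have at most `max_length` tokens. The token ids are made up.

```python
EOS = 1  # same end-of-sentence id as in the cells above

def append_eos(stream):
    """Yield each (inputs, targets) pair with EOS appended to both sides."""
    for inputs, targets in stream:
        yield list(inputs) + [EOS], list(targets) + [EOS]

def filter_by_length(stream, max_length):
    """Keep only pairs where both sequences have at most `max_length` tokens."""
    for inputs, targets in stream:
        if len(inputs) <= max_length and len(targets) <= max_length:
            yield inputs, targets

# Hypothetical token ids, chosen only for illustration
toy_pairs = [
    ([5, 6], [7, 8, 9]),
    ([5, 6, 7, 8, 9, 10], [11]),  # input side is too long once EOS is added
]

kept = list(filter_by_length(append_eos(iter(toy_pairs)), max_length=4))
print(kept)  # [([5, 6, 1], [7, 8, 9, 1])]
```

Because both functions are generators, nothing is processed until a pair is requested, which is the same lazy style the Trax streams use.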


### 1.3 Tokenize and Detokenize Helper Functions

In [9]:
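def tokenize(input_str, vocab_file=None, vocab_dir=None):
    """Encode a string as a batch of one tokenized sentence, with EOS appended."""
    EOS = 1
    # trax.data.tokenize works on a stream, so wrap the string in an iterator
    inputs = next(trax.data.tokenize(
        iter([input_str]),
        vocab_file=vocab_file,
        vocab_dir=vocab_dir
    ))
    # Mark the end of the sentence
    inputs = list(inputs) + [EOS]
    # Add a batch dimension: shape (1, sentence_length)
    batch_inputs = np.reshape(np.array(inputs), [1, -1])

    return batch_inputs

def detokenize(integers, vocab_file=None, vocab_dir=None):
    """Decode an array of token ids back into a string, stopping at EOS."""
    # Remove the batch dimension if there is one
    integers = list(np.squeeze(integers))
    EOS = 1
    # Drop EOS and everything after it (such as padding)
    if EOS in integers:
        integers = integers[:integers.index(EOS)]

    return trax.data.detokenize(integers, vocab_file=vocab_file, vocab_dir=vocab_dir)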

In [10]:
# Detokenize an input-target pair of tokenized sentences
print(colored(f'Single detokenized example input:', 'red'), detokenize(train_input, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))
print(colored(f'Single detokenized example target:', 'red'), detokenize(train_target, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))
print()

# Tokenize and detokenize a word that is not explicitly saved in the vocabulary file.
# See how it combines the subwords -- 'hell' and 'o'-- to form the word 'hello'.
print(colored(f"tokenize('hello'): ", 'green'), tokenize('hello', vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))
print(colored(f"detokenize([17332, 140, 1]): ", 'green'), detokenize([17332, 140, 1], vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))

Single detokenized example input: These measures should help to protect the environment.

Single detokenized example target: Diese Maßnahmen dienen dem Umweltschutz.


tokenize('hello'):  [[17332   140     1]]
detokenize([17332, 140, 1]):  hello
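
The EOS handling inside `detokenize` can be looked at in isolation. The sketch below uses a toy id-to-subword dictionary instead of the real vocabulary file; the two ids mirror the `hello` output above, but `toy_id_to_subword` and `toy_detokenize` are illustrative stand-ins, not part of Trax.

```python
EOS = 1

# Toy id-to-subword table; an illustrative stand-in for the real vocab file
toy_id_to_subword = {17332: "hell", 140: "o"}

def toy_detokenize(integers):
    """Drop everything from the first EOS onward, then join the subwords."""
    integers = list(integers)
    if EOS in integers:
        integers = integers[:integers.index(EOS)]
    return "".join(toy_id_to_subword[i] for i in integers)

print(toy_detokenize([17332, 140, 1]))        # hello
print(toy_detokenize([17332, 140, 1, 0, 0]))  # hello
```

Cutting at the first EOS also removes any padding zeros that follow it, so padded rows taken from a batch decode cleanly.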


### 1.4 Bucketing
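Bucketing groups sentences of similar length into the same batch, so that less padding is needed. For background, see [Comprehensive Hands-on Guide to Sequence Model batching strategy: Bucketing technique](https://medium.com/@rashmi.margani/how-to-speed-up-the-training-of-the-sequence-model-using-bucketing-techniques-9e302b0fd976).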

In [12]:
# Bucketing to create streams of batches

# Buckets are defined in terms of boundaries and batch sizes.
# batch_sizes[i] determines the batch size for items with length < boundaries[i].
# So below, we take a batch of 256 sentences of length < 8, 128 sentences if the
# length is between 8 and 16, and so on, down to only 2 if the length is over 512.
boundaries = [8, 16, 32, 64, 128, 256, 512]
batch_sizes = [256, 128, 64, 32, 16, 8, 4, 2]

# Create the generators
train_batch_stream = trax.data.BucketByLength(
    boundaries, 
    batch_sizes,
    length_keys=[0,1]
)(filtered_train_stream)

eval_batch_stream = trax.data.BucketByLength(
    boundaries,
    batch_sizes,
    length_keys=[0,1]
)(filtered_eval_stream)

# Add masking for the padding (0s)
train_batch_stream = trax.data.AddLossWeights(id_to_mask=0)(train_batch_stream)
eval_batch_stream = trax.data.AddLossWeights(id_to_mask=0)(eval_batch_stream)
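
What bucketing and loss masking do can be reproduced on a tiny stream in plain Python. This is a simplified sketch, not the Trax implementation: it assumes the `length < boundary` rule described in the comments above, assumes each batch is padded with zeros up to its bucket boundary, and the names `pad_to`, `bucket_by_length`, and `loss_weights` are introduced here for illustration.

```python
def pad_to(seq, width, pad_id=0):
    """Right-pad `seq` with `pad_id` up to `width` tokens."""
    return list(seq) + [pad_id] * (width - len(seq))

def bucket_by_length(stream, boundaries, batch_sizes, pad_id=0):
    """Group (inputs, targets) pairs into padded batches of similar length.

    A pair goes to the first bucket whose boundary exceeds the longer of its
    two sequences; pairs longer than every boundary go to the last bucket.
    A bucket is emitted as soon as it holds batch_sizes[i] pairs.
    """
    buckets = [[] for _ in batch_sizes]
    for inputs, targets in stream:
        length = max(len(inputs), len(targets))
        idx = next((i for i, b in enumerate(boundaries) if length < b),
                   len(boundaries))
        buckets[idx].append((inputs, targets))
        if len(buckets[idx]) == batch_sizes[idx]:
            if idx < len(boundaries):
                width = boundaries[idx]  # assumed: pad to the bucket boundary
            else:
                width = max(max(len(x), len(y)) for x, y in buckets[idx])
            yield ([pad_to(x, width, pad_id) for x, _ in buckets[idx]],
                   [pad_to(y, width, pad_id) for _, y in buckets[idx]])
            buckets[idx] = []

def loss_weights(batch, pad_id=0):
    """1.0 for real tokens and 0.0 for padding, so the loss ignores padding."""
    return [[0.0 if tok == pad_id else 1.0 for tok in row] for row in batch]

# Hypothetical tokenized pairs (EOS = 1 already appended)
toy_stream = [
    ([5, 6, 1], [7, 1]),
    ([5, 6, 7, 8, 9, 1], [7, 8, 1]),
    ([9, 1], [4, 3, 1]),
]
batches = list(bucket_by_length(iter(toy_stream), boundaries=[4, 8],
                                batch_sizes=[2, 1, 1]))
for inputs, targets in batches:
    print(inputs, loss_weights(inputs))
```

The long pair fills its one-slot bucket immediately and is emitted first, while the two short pairs wait until their bucket of size 2 is full. Short sentences therefore share large batches with little padding, and long sentences get small batches that still fit in memory.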

### 1.5 Exploring the Data

In [13]:
input_batch, target_batch, mask_batch = next(train_batch_stream)

# let's see the data type of a batch
print("input_batch data type: ", type(input_batch))
print("target_batch data type: ", type(target_batch))

# let's see the shape of this particular batch (batch size, sentence length)
print("input_batch shape: ", input_batch.shape)
print("target_batch shape: ", target_batch.shape)

input_batch data type:  <class 'numpy.ndarray'>
target_batch data type:  <class 'numpy.ndarray'>
input_batch shape:  (32, 64)
target_batch shape:  (32, 64)


- The tokens are used to produce an embedding vector for each token in the sentence
- Hence, the embedding of a sentence is a matrix
- The number of sentences in each batch is usually a power of 2 for efficient memory usage
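
To see why the embedding of a sentence is a matrix, here is a small lookup sketch in plain Python. The vocabulary size, embedding dimension, and random table are hypothetical values chosen for illustration; in the real model the table is a learned layer.

```python
import random

random.seed(0)  # fixed seed so the toy table is reproducible

VOCAB_SIZE = 10  # hypothetical toy vocabulary size
EMBED_DIM = 4    # hypothetical embedding dimension

# One randomly initialised vector per token id (a real model learns these)
embedding_table = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)]
                   for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    """Look up one vector per token: a (sentence_length, EMBED_DIM) matrix."""
    return [embedding_table[t] for t in token_ids]

sentence = [5, 6, 1, 0]  # toy token ids, including EOS (1) and padding (0)
matrix = embed(sentence)
print(len(matrix), len(matrix[0]))  # 4 4
```

A batch of such sentences adds one more dimension, giving an array of shape (batch size, sentence length, embedding dimension).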

In [14]:
# pick a random index less than the batch size.
index = random.randrange(len(input_batch))

# use the index to grab an entry from the input and target batch
print(colored('THIS IS THE ENGLISH SENTENCE: \n', 'red'), detokenize(input_batch[index], vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR), '\n')
print(colored('THIS IS THE TOKENIZED VERSION OF THE ENGLISH SENTENCE: \n ', 'red'), input_batch[index], '\n')
print(colored('THIS IS THE GERMAN TRANSLATION: \n', 'red'), detokenize(target_batch[index], vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR), '\n')
print(colored('THIS IS THE TOKENIZED VERSION OF THE GERMAN TRANSLATION: \n', 'red'), target_batch[index], '\n')

THIS IS THE ENGLISH SENTENCE: 
HIV infection is a disease spread by contact with blood or sexual contact with an infected individual.
 

THIS IS THE TOKENIZED VERSION OF THE ENGLISH SENTENCE: 
[ 4188 17251    16    13  3126  4078    45  2412    30  6196    66  7660
  2412    30    27 11123  1283  3550 30650  4729   992     1     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0] 

THIS IS THE GERMAN TRANSLATION: 
Die HIV-Infektion ist eine Erkrankung, die durch Kontakt mit infiziertem Blut oder durch sexuellen Kontakt mit HIV-Infizierten übertragen wird.
 

THIS IS THE TOKENIZED VERSION OF THE GERMAN TRANSLATION: 
[   57  4188    15 16015     5    24    41  9183  6818   147     2    10
   121  5378    39 12258 11813   164 18689    97   121 16695  