# Neural Machine Translation (NMT, [wikipedia](https://en.wikipedia.org/wiki/Neural_machine_translation))

An English-to-German NMT model of LSTM networks with attention.  

Implementing this using RNN and LSTMs results vanishing gradients for long sequences.

Thus I have been adding an attention mechanism.


### Data preparation


### Importing data

I used [Trax](https://github.com/google/trax) library of [Google Brain team](https://research.google/teams/brain/).

In [3]:
from termcolor import colored
import random
import numpy as np

import trax
from trax import layers as tl
from trax.fastmath import numpy as fastnp
from trax.supervised import training

import w1_unittest

!pip list | grep trax

trax                         1.3.9
You should consider upgrading via the '/opt/conda/bin/python3 -m pip install --upgrade pip' command.[0m


I used small dataset from [Opus](http://opus.nlpl.eu/), a growing collection of translated texts from the web. 

I use English to German translation subset specified as `opus/medical` which has medical texts.

All datasets are available via [Tensorflow Datasets (TFDS)](https://www.tensorflow.org/datasets).

In [4]:
# Get generator function for the training set
# This will download the train dataset if no data_dir is specified.
train_stream_fn = trax.data.TFDS('opus/medical',
                                 data_dir='./data/',
                                 keys=('en', 'de'),
                                 eval_holdout_size=0.01, # 1% for eval
                                 train=True
                                )

# Get generator function for the eval set
eval_stream_fn = trax.data.TFDS('opus/medical',
                                data_dir='./data/',
                                keys=('en', 'de'),
                                eval_holdout_size=0.01, # 1% for eval                                
                                train=False
                               )



Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Constant'


Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Constant'


Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Constant'
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Constant'


Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Constant'


Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Constant'


Notice that TFDS returns a generator *function*, not a generator.

It is very common to use generator functions in Python.

In [5]:
train_stream = train_stream_fn()
print(colored('train data (en, de) tuple:', 'red'), next(train_stream))
print()

eval_stream = eval_stream_fn()
print(colored('eval data (en, de) tuple:', 'red'), next(eval_stream))

[31mtrain data (en, de) tuple:[0m (b'Tel: +421 2 57 103 777\n', b'Tel: +421 2 57 103 777\n')

[31meval data (en, de) tuple:[0m (b'Subcutaneous use and intravenous use.\n', b'Subkutane Anwendung und intraven\xc3\xb6se Anwendung.\n')


### Tokenization and formatting

Preprocessing the sentences into a format that our model can accept.

**Subword representations** 

represent each sentence as an array of integers. 

*subword* representations to tokenize our sentences. 

This is a common technique to avoid **out-of-vocabulary** words.

In [6]:
# global variables that state the filename and directory of the vocabulary file
VOCAB_FILE = 'ende_32k.subword'
VOCAB_DIR = 'data/'

# Tokenize the dataset.
tokenized_train_stream = trax.data.Tokenize(vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)(train_stream)
tokenized_eval_stream = trax.data.Tokenize(vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)(eval_stream)

**Append an end-of-sentence token to each sentence** 

Assign a token in this case `1` to mark the end of a sentence. 

This will be useful in inference/prediction.

In [7]:
# Append EOS at the end of each sentence.

# Integer assigned as end-of-sentence (EOS)
EOS = 1

# generator helper function to append EOS to each sentence
def append_eos(stream):
    for (inputs, targets) in stream:
        inputs_with_eos = list(inputs) + [EOS]
        targets_with_eos = list(targets) + [EOS]
        yield np.array(inputs_with_eos), np.array(targets_with_eos)

# append EOS to the train data
tokenized_train_stream = append_eos(tokenized_train_stream)

# append EOS to the eval data
tokenized_eval_stream = append_eos(tokenized_eval_stream)

**Filter long sentences** 

I will place a limit on the number of tokens per sentence.

In [8]:
# Filter too long sentences to not run out of memory.
# length_keys=[0, 1] means we filter both English and German sentences, so
# both must be not longer that 256 tokens for training / 512 for eval.
filtered_train_stream = trax.data.FilterByLength(
    max_length=512, length_keys=[0, 1])(tokenized_train_stream)
filtered_eval_stream = trax.data.FilterByLength(
    max_length=512, length_keys=[0, 1])(tokenized_eval_stream)

# print a sample input-target pair of tokenized sentences
train_input, train_target = next(filtered_train_stream)
print(colored(f'Single tokenized example input:', 'red' ), train_input)
print(colored(f'Single tokenized example target:', 'red'), train_target)

[31mSingle tokenized example input:[0m [ 2538  2248    30 12114 23184 16889     5     2 20852  6456 20592  5812
  3932    96  5178  3851    30  7891  3550 30650  4729   992     1]
[31mSingle tokenized example target:[0m [ 1872    11  3544    39  7019 17877 30432    23  6845    10 14222    47
  4004    18 21674     5 27467  9513   920   188 10630    18  3550 30650
  4729   992     1]


### tokenize & detokenize helper functions

map words to their indices, and indices to their words. 

Theoretically for manual coding I would have to make use of the following

* word2Ind: a dictionary mapping the word to its index.
* ind2Word: a dictionary mapping the index to its word.
* word2Count: a dictionary mapping the word to the number of times it appears. 
* num_words: total number of words that have appeared. 

Practically I have the following functions

* tokenize(): converts a text sentence to its corresponding token list. Also converts words to subwords.
* detokenize(): converts a token list to its corresponding sentence.

In [9]:
# Setup helper functions for tokenizing and detokenizing sentences

def tokenize(input_str, vocab_file=None, vocab_dir=None):
    """Encodes a string to an array of integers

    Args:
        input_str (str): human-readable string to encode
        vocab_file (str): filename of the vocabulary text file
        vocab_dir (str): path to the vocabulary file
  
    Returns:
        numpy.ndarray: tokenized version of the input string
    """
    
    # Set the encoding of the "end of sentence" as 1
    EOS = 1
    
    # Use the trax.data.tokenize method. It takes streams and returns streams,
    # we get around it by making a 1-element stream with `iter`.
    inputs =  next(trax.data.tokenize(iter([input_str]),
                                      vocab_file=vocab_file, vocab_dir=vocab_dir))
    
    # Mark the end of the sentence with EOS
    inputs = list(inputs) + [EOS]
    
    # Adding the batch dimension to the front of the shape
    batch_inputs = np.reshape(np.array(inputs), [1, -1])
    
    return batch_inputs


def detokenize(integers, vocab_file=None, vocab_dir=None):
    """Decodes an array of integers to a human readable string

    Args:
        integers (numpy.ndarray): array of integers to decode
        vocab_file (str): filename of the vocabulary text file
        vocab_dir (str): path to the vocabulary file
  
    Returns:
        str: the decoded sentence.
    """
    
    # Remove the dimensions of size 1
    integers = list(np.squeeze(integers))
    
    # Set the encoding of the "end of sentence" as 1
    EOS = 1
    
    # Remove the EOS to decode only the original tokens
    if EOS in integers:
        integers = integers[:integers.index(EOS)] 
    
    return trax.data.detokenize(integers, vocab_file=vocab_file, vocab_dir=vocab_dir)

#### An usage example

In [10]:
# As declared earlier:
# VOCAB_FILE = 'ende_32k.subword'
# VOCAB_DIR = 'data/'

# Detokenize an input-target pair of tokenized sentences
print(colored(f'Single detokenized example input:', 'red'), detokenize(train_input, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))
print(colored(f'Single detokenized example target:', 'red'), detokenize(train_target, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))
print()

# Tokenize and detokenize a word that is not explicitly saved in the vocabulary file.
# See how it combines the subwords -- 'hell' and 'o'-- to form the word 'hello'.
print(colored(f"tokenize('hello'): ", 'green'), tokenize('hello', vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))
print(colored(f"detokenize([17332, 140, 1]): ", 'green'), detokenize([17332, 140, 1], vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))

[31mSingle detokenized example input:[0m During treatment with olanzapine, adolescents gained significantly more weight compared with adults.

[31mSingle detokenized example target:[0m Während der Behandlung mit Olanzapin nahmen die Jugendlichen im Vergleich zu Erwachsenen signifikant mehr Gewicht zu.


[32mtokenize('hello'): [0m [[17332   140     1]]
[32mdetokenize([17332, 140, 1]): [0m hello


### Bucketing

Bucketing the tokenized sentences is an important technique used to speed up training in NLP.

Our inputs have variable lengths and you want to make these the same when batching groups of sentences together. 

We can group our tokenized sentences by length and bucket

![alt text](https://miro.medium.com/max/700/1*hcGuja_d5Z_rFcgwe9dPow.png)

We batch the sentences with similar length together

We add minimal padding to make them have equal length up to the nearest power of two.

This allows to waste less computation when processing padded sequences.

In Trax, it is implemented in the [bucket_by_length](https://github.com/google/trax/blob/5fb8aa8c5cb86dabb2338938c745996d5d87d996/trax/supervised/inputs.py#L378) function.

In [11]:
# Bucketing to create streams of batches.

# Buckets are defined in terms of boundaries and batch sizes.
# Batch_sizes[i] determines the batch size for items with length < boundaries[i]
# So below, we'll take a batch of 256 sentences of length < 8, 128 if length is
# between 8 and 16, and so on -- and only 2 if length is over 512.
boundaries =  [8,   16,  32, 64, 128, 256, 512]
batch_sizes = [256, 128, 64, 32, 16,    8,   4,  2]

# Create the generators.
train_batch_stream = trax.data.BucketByLength(
    boundaries, batch_sizes,
    length_keys=[0, 1]  # As before: count inputs and targets to length.
)(filtered_train_stream)

eval_batch_stream = trax.data.BucketByLength(
    boundaries, batch_sizes,
    length_keys=[0, 1]  # As before: count inputs and targets to length.
)(filtered_eval_stream)

# Add masking for the padding (0s).
train_batch_stream = trax.data.AddLossWeights(id_to_mask=0)(train_batch_stream)
eval_batch_stream = trax.data.AddLossWeights(id_to_mask=0)(eval_batch_stream)

### Exploring the data

We will now be displaying some of our data. 

In [12]:
input_batch, target_batch, mask_batch = next(train_batch_stream)

# let's see the data type of a batch
print("input_batch data type: ", type(input_batch))
print("target_batch data type: ", type(target_batch))

# let's see the shape of this particular batch (batch length, sentence length)
print("input_batch shape: ", input_batch.shape)
print("target_batch shape: ", target_batch.shape)

input_batch data type:  <class 'numpy.ndarray'>
target_batch data type:  <class 'numpy.ndarray'>
input_batch shape:  (32, 64)
target_batch shape:  (32, 64)


The `input_batch` and `target_batch` are Numpy arrays consisting of tokenized English sentences and German sentences respectively. 

These tokens will later be used to produce embedding vectors for each word in the sentence embedding for a sentence is a matrix. 

The number of sentences in each batch is a power of 2 gives optimal computer memory usage. 

We can now visually inspect some of the data. 

In [13]:
# pick a random index less than the batch size.
index = random.randrange(len(input_batch))

# use the index to grab an entry from the input and target batch
print(colored('THIS IS THE ENGLISH SENTENCE: \n', 'red'), detokenize(input_batch[index], vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR), '\n')
print(colored('THIS IS THE TOKENIZED VERSION OF THE ENGLISH SENTENCE: \n ', 'red'), input_batch[index], '\n')
print(colored('THIS IS THE GERMAN TRANSLATION: \n', 'red'), detokenize(target_batch[index], vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR), '\n')
print(colored('THIS IS THE TOKENIZED VERSION OF THE GERMAN TRANSLATION: \n', 'red'), target_batch[index], '\n')

[31mTHIS IS THE ENGLISH SENTENCE: 
[0m A lowering of blood counts might make you more susceptible to infections, make you bruise more easily or cause fatigue.
 

[31mTHIS IS THE TOKENIZED VERSION OF THE ENGLISH SENTENCE: 
 [0m [  176 11086     7  6196 29901   616   237    72    96 22598 13844   867
     9 23626     2   237    72  8970  2565    96  2841    66  1780  8898
  8278    35  3550 30650  4729   992     1     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0] 

[31mTHIS IS THE GERMAN TRANSLATION: 
[0m Eine erniedrigte Zahl der Blutkörperchen kann Sie anfälliger für Infektionen machen; sie kann bewirken, dass bei Ihnen schneller ein blauer Fleck entsteht oder sie kann Abgeschlagenheit hervorrufen.
 

[31mTHIS IS THE TOKENIZED VERSION OF THE GERMAN TRANSLATION: 
[0m [  478  4470 12279  8177   191  1953    11  9208 23449   826   

### NMT with attention

A neural machine translation model from scratch with attention.


### Attention overview

The model we will be building uses an encoder-decoder architecture. 

RNN takes tokenzied sentence to encoder, and passes to decoder for translation.

<img src='images/plain_rnn.png'>

Adding an attention layer giving decoder access to all parts of input sentence. 

<img src='images/attention_overview.png'>

I use the Scaled Dot Product Attention

$$Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$$

Computing scores using queries (Q) and keys (K), followed by a multiplication of values (V) to get a context vector at a particular timestep of the decoder.

This context vector is fed to the decoder RNN to get a set of probabilities for the next predicted word.

The division by square root of the keys dimensionality ($\sqrt{d_k}$) improves model performance.

At my implementation, the encoder activations will be the keys and values, while the decoder activations will be the queries.

### Helper functions

These will be for the input encoder, pre-attention decoder, and preparation of the queries, keys, values, and mask.

### Input encoder

The input encoder runs on the input tokens, creates its embeddings, and feeds it to an LSTM network. 

This outputs the activations that will be the keys and values for attention. 

It is a [Serial](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.combinators.Serial) network which uses:

   - [tl.Embedding](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.core.Embedding): Converts each token to its vector representation. In this case, it is the the size of the vocabulary by the dimension of the model: `tl.Embedding(vocab_size, d_model)`. `vocab_size` is the number of entries in the given vocabulary. `d_model` is the number of elements in the word embedding.
  
   - [tl.LSTM](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.rnn.LSTM): LSTM layer of size `d_model`. We want to be able to configure how many encoder layers we have so remember to create LSTM layers equal to the number of the `n_encoder_layers` parameter.
   
<img src = "images/input_encoder.png">

### The `input_encoder_fn` function.

In [14]:
def input_encoder_fn(input_vocab_size, d_model, n_encoder_layers):
    """ Input encoder runs on the input sentence and creates
    activations that will be the keys and values for attention.
    
    Args:
        input_vocab_size: int: vocab size of the input
        d_model: int:  depth of embedding (n_units in the LSTM cell)
        n_encoder_layers: int: number of LSTM layers in the encoder
    Returns:
        tl.Serial: The input encoder
    """
    
    # create a serial network
    input_encoder = tl.Serial( 
        
        # create an embedding layer to convert tokens to vectors
        trax.layers.core.Embedding(input_vocab_size, d_model),
        
        # feed the embeddings to the LSTM layers. It is a stack of n_encoder_layers LSTM layers
        [trax.layers.rnn.LSTM(d_model, mode='train') for _ in range(n_encoder_layers)]
    )

    return input_encoder

In [15]:
# UNIT TEST
# test input_encoder_fn
w1_unittest.test_input_encoder_fn(input_encoder_fn)

[92m All tests passed


### Pre-attention decoder

Pre-attention decoder runs on the targets and creates activations that are used as queries in attention.

A Serial network

   - [tl.ShiftRight](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.attention.ShiftRight): This pads a token to the beginning of your target tokens. This will act like a start-of-sentence token that will be the first input to the decoder. During training, this shift also allows the target tokens to be passed as input to do teacher forcing.

   - [tl.Embedding](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.core.Embedding): Like in the previous function, this converts each token to its vector representation. In this case, it is the the size of the vocabulary by the dimension of the model: `tl.Embedding(vocab_size, d_model)`. `vocab_size` is the number of entries in the given vocabulary. `d_model` is the number of elements in the word embedding.
   
   - [tl.LSTM](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.rnn.LSTM): LSTM layer of size `d_model`.

### `pre_attention_decoder_fn` function


In [16]:
# UNQ_C2
def pre_attention_decoder_fn(mode, target_vocab_size, d_model):
    """ Pre-attention decoder runs on the targets and creates
    activations that are used as queries in attention.
    
    Args:
        mode: str: 'train' or 'eval'
        target_vocab_size: int: vocab size of the target
        d_model: int:  depth of embedding (n_units in the LSTM cell)
    Returns:
        tl.Serial: The pre-attention decoder
    """
    
    # create a serial network
    pre_attention_decoder = tl.Serial(
        
        # shift right to insert start-of-sentence token and implement
        # teacher forcing during training
        trax.layers.attention.ShiftRight(n_positions=1, mode='train'),

        # run an embedding layer to convert tokens to vectors
        trax.layers.core.Embedding(target_vocab_size, d_model, use_bfloat16=False),

        # feed to an LSTM layer
        trax.layers.rnn.LSTM(d_model, mode='train')
    )
    
    return pre_attention_decoder

In [17]:
# UNIT TEST
# test pre_attention_decoder_fn
w1_unittest.test_pre_attention_decoder_fn(pre_attention_decoder_fn)

[92m All tests passed


### Preparing the attention input

We want to take in the encoder and pre-attention decoder activations and assign it to the queries, keys, and values. 

In addition, another output here will be the mask to distinguish real tokens from padding tokens. 

This mask will be used internally by Trax when computing the softmax so padding tokens will not have an effect on the computated probabilities.

*multiheaded attention* is computing the attention multiple times to improve the model's predictions. 

### `prepare_attention_input` function


In [18]:
# UNQ_C3
def prepare_attention_input(encoder_activations, decoder_activations, inputs):
    """Prepare queries, keys, values and mask for attention.
    
    Args:
        encoder_activations fastnp.array(batch_size, padded_input_length, d_model): output from the input encoder
        decoder_activations fastnp.array(batch_size, padded_input_length, d_model): output from the pre-attention decoder
        inputs fastnp.array(batch_size, padded_input_length): input tokens
    
    Returns:
        queries, keys, values and mask for attention.
    """
    
    # set the keys and values to the encoder activations
    keys = encoder_activations
    values = encoder_activations

    
    # set the queries to the decoder activations
    queries = decoder_activations
    
    # generate the mask to distinguish real tokens from padding
    # hint: inputs is positive for real tokens and 0 where they are padding
    mask = inputs > 0
    
    # add axes to the mask for attention heads and decoder length.
    mask = fastnp.reshape(mask, (mask.shape[0], 1, 1, mask.shape[1]))
    
    # broadcast so mask shape is [batch size, attention heads, decoder-len, encoder-len].
    # note: for this assignment, attention heads is set to 1.
    mask = mask + fastnp.zeros((1, 1, decoder_activations.shape[1], 1))
        
    
    return queries, keys, values, mask

In [19]:
# UNIT TEST
# test prepare_attention_input
w1_unittest.test_prepare_attention_input(prepare_attention_input)

[92m All tests passed


### Implementation overview

sequence-to-sequence model with attention. 

This will be a Serial network and is illustrated in the diagram below. 

<img src = "images/NMTModel.png">

### `NMTAttn` function defines machine translation model which uses attention.

In [20]:
# UNQ_C4
def NMTAttn(input_vocab_size=33300,
            target_vocab_size=33300,
            d_model=1024,
            n_encoder_layers=2,
            n_decoder_layers=2,
            n_attention_heads=4,
            attention_dropout=0.0,
            mode='train'):
    """Returns an LSTM sequence-to-sequence model with attention.

    The input to the model is a pair (input tokens, target tokens), e.g.,
    an English sentence (tokenized) and its translation into German (tokenized).

    Args:
    input_vocab_size: int: vocab size of the input
    target_vocab_size: int: vocab size of the target
    d_model: int:  depth of embedding (n_units in the LSTM cell)
    n_encoder_layers: int: number of LSTM layers in the encoder
    n_decoder_layers: int: number of LSTM layers in the decoder after attention
    n_attention_heads: int: number of attention heads
    attention_dropout: float, dropout for the attention layer
    mode: str: 'train', 'eval' or 'predict', predict mode is for fast inference

    Returns:
    An LSTM sequence-to-sequence model with attention.
    """
    
    # Step 0: call the helper function to create layers for the input encoder
    input_encoder = input_encoder_fn(input_vocab_size, d_model, n_encoder_layers)

    # Step 0: call the helper function to create layers for the pre-attention decoder
    pre_attention_decoder = pre_attention_decoder_fn(mode, target_vocab_size, d_model)

    # Step 1: create a serial network
    model = tl.Serial( 
        
      # Step 2: copy input tokens and target tokens as they will be needed later.
      trax.layers.combinators.Select(indices=[0, 1, 0, 1]),
        
      # Step 3: run input encoder on the input and pre-attention decoder the target.
      trax.layers.combinators.Parallel(input_encoder, pre_attention_decoder),
        
      # Step 4: prepare queries, keys, values and mask for attention.
      trax.layers.base.Fn('PrepareAttentionInput', f=prepare_attention_input, n_out=4),
        
      # Step 5: run the AttentionQKV layer
      # nest it inside a Residual layer to add to the pre-attention decoder activations(i.e. queries)
      tl.Residual(tl.AttentionQKV(d_feature=d_model, n_heads=n_attention_heads, dropout=attention_dropout, mode=mode)),
      
      # Step 6: drop attention mask (i.e. index = None
      trax.layers.combinators.Select(indices=[0, 2]),
        
      # Step 7: run the rest of the RNN decoder
      [trax.layers.rnn.LSTM(d_model, mode='train') for _ in range(n_decoder_layers)],
        
      # Step 8: prepare output by making it the right size
      trax.layers.core.Dense(target_vocab_size),
        
      # Step 9: Log-softmax for output
      trax.layers.core.LogSoftmax()
    )
    
    return model

In [21]:
# print my model
model = NMTAttn()
print(model)

Serial_in2_out2[
  Select[0,1,0,1]_in2_out4
  Parallel_in2_out2[
    Serial[
      Embedding_33300_1024
      LSTM_1024
      LSTM_1024
    ]
    Serial[
      Serial[
        ShiftRight(1)
      ]
      Embedding_33300_1024
      LSTM_1024
    ]
  ]
  PrepareAttentionInput_in3_out4
  Serial_in4_out2[
    Branch_in4_out3[
      None
      Serial_in4_out2[
        _in4_out4
        Serial_in4_out2[
          Parallel_in3_out3[
            Dense_1024
            Dense_1024
            Dense_1024
          ]
          PureAttention_in4_out2
          Dense_1024
        ]
        _in2_out2
      ]
    ]
    Add_in2
  ]
  Select[0,2]_in3_out2
  LSTM_1024
  LSTM_1024
  Dense_33300
  LogSoftmax
]


**Expected Output:**

```
Serial_in2_out2[
  Select[0,1,0,1]_in2_out4
  Parallel_in2_out2[
    Serial[
      Embedding_33300_1024
      LSTM_1024
      LSTM_1024
    ]
    Serial[
      Serial[
        ShiftRight(1)
      ]
      Embedding_33300_1024
      LSTM_1024
    ]
  ]
  PrepareAttentionInput_in3_out4
  Serial_in4_out2[
    Branch_in4_out3[
      None
      Serial_in4_out2[
        _in4_out4
        Serial_in4_out2[
          Parallel_in3_out3[
            Dense_1024
            Dense_1024
            Dense_1024
          ]
          PureAttention_in4_out2
          Dense_1024
        ]
        _in2_out2
      ]
    ]
    Add_in2
  ]
  Select[0,2]_in3_out2
  LSTM_1024
  LSTM_1024
  Dense_33300
  LogSoftmax
]
```

In [22]:
# UNIT TEST
# test_NMTAttn
w1_unittest.test_NMTAttn(NMTAttn)

[92m All tests passed


### Training

We will now be training our model. 

Doing supervised training in Trax is pretty straightforward (short example [here](https://trax-ml.readthedocs.io/en/latest/notebooks/trax_intro.html#Supervised-training)).

We will be instantiating three classes for this: `TrainTask`, `EvalTask`, and `Loop`. 

### TrainTask

The [TrainTask](https://trax-ml.readthedocs.io/en/latest/trax.supervised.html#trax.supervised.training.TrainTask) class allows me to define the labeled data to use for training and the feedback mechanisms to compute the loss and update the weights. 

### a train task

In [23]:
# UNQ_C5
def train_task_function(train_batch_stream):
    """Returns a trax.training.TrainTask object.

    Args:
    train_batch_stream generator: labeled data generator

    Returns:
    A trax.training.TrainTask object.
    """
    return training.TrainTask(

        # use the train batch stream as labeled data
        labeled_data = train_batch_stream,

        # use the cross entropy loss
        loss_layer = trax.layers.metrics.CrossEntropyLoss(),

        # use the Adam optimizer with learning rate of 0.01
        optimizer= trax.optimizers.adam.Adam(learning_rate=0.01), 

        # use the `trax.lr.warmup_and_rsqrt_decay` as the learning rate schedule
        # have 1000 warmup steps with a max value of 0.01
        lr_schedule= trax.lr.warmup_and_rsqrt_decay(1000, .01),

        # have a checkpoint every 10 steps
        n_steps_per_checkpoint= 10
    )

In [24]:
train_task = train_task_function(train_batch_stream)

In [25]:
# UNIT TEST
# test_train_task
w1_unittest.test_train_task(train_task_function)

[92m All tests passed


### EvalTask

The [EvalTask](https://trax-ml.readthedocs.io/en/latest/trax.supervised.html#trax.supervised.training.EvalTask) on the other hand allows us to see how the model is doing while training. 

For my implementation, I want it to report the cross entropy loss and accuracy.

In [26]:
eval_task = training.EvalTask(
    
    ## use the eval batch stream as labeled data
    labeled_data=eval_batch_stream,
    
    ## use the cross entropy loss and accuracy as metrics
    metrics=[tl.CrossEntropyLoss(), tl.Accuracy()],
)

### Loop

The [Loop](https://trax-ml.readthedocs.io/en/latest/trax.supervised.html#trax.supervised.training.Loop) class defines the model we will train as well as the train and eval tasks to execute.

Its `run()` method allows us to execute the training for a specified number of steps.

In [25]:
# define the output directory
output_dir = 'output_dir/'

# remove old model if it exists. restarts training.
!rm -f ~/output_dir/model.pkl.gz  

# define the training loop
training_loop = training.Loop(NMTAttn(mode='train'),
                              train_task,
                              eval_tasks=[eval_task],
                              output_dir=output_dir)

In [26]:
# NOTE: Execute the training loop. This will take around 11 minutes to complete.
training_loop.run(10)


Step     20: Ran 10 train steps in 407.49 secs
Step     20: train CrossEntropyLoss |  9.37262917
Step     20: eval  CrossEntropyLoss |  8.58640289
Step     20: eval          Accuracy |  0.04535790


### Testing

I will now be using the model I have just trained to translate English sentences to German. 

I will implement this with two functions: 
* The first allows me to identify the next symbol (i.e. output token). 
* The second one takes care of combining the entire translated string.

In [27]:
# instantiate the model we built in eval mode
model = NMTAttn(mode='eval')

# initialize weights from a pre-trained model
model.init_from_file("model.pkl.gz", weights_only=True)
model = tl.Accelerate(model)

### Decoding

There are several ways to get the next token when translating a sentence.

* I can get the most probable token at each step using **greedy decoding**
* I can get a sample from a distribution

I can generalize the implementation of these two approaches by using the `tl.logsoftmax_sample()` method.

Let's briefly look at its implementation

```python
def logsoftmax_sample(log_probs, temperature=1.0):  # pylint: disable=invalid-name
  """Returns a sample from a log-softmax output, with temperature.

  Args:
    log_probs: Logarithms of probabilities (often coming from LogSofmax)
    temperature: For scaling before sampling (1.0 = default, 0.0 = pick argmax)
  """
  # This is equivalent to sampling from a softmax with temperature.
  u = np.random.uniform(low=1e-6, high=1.0 - 1e-6, size=log_probs.shape)
  g = -np.log(-np.log(u))
  return np.argmax(log_probs + g * temperature, axis=-1)
```

The key things to take away here are
* It gets random samples with the same shape as my input `log_probs`.
* The amount of "noise" added to the input by these random samples is scaled by a `temperature` setting. setting it to `0` will make the return statement equal to getting the **argmax** of `log_probs`. 

### `next_symbol()` function takes in the `input_tokens` and the `cur_output_tokens`, then return the index of the next word.


In [28]:
# UNQ_C6
def next_symbol(NMTAttn, input_tokens, cur_output_tokens, temperature):
    """Returns the index of the next token.

    Args:
        NMTAttn (tl.Serial): An LSTM sequence-to-sequence model with attention.
        input_tokens (np.ndarray 1 x n_tokens): tokenized representation of the input sentence
        cur_output_tokens (list): tokenized representation of previously translated words
        temperature (float): parameter for sampling ranging from 0.0 to 1.0.
            0.0: same as argmax, always pick the most probable token
            1.0: sampling from the distribution (can sometimes say random things)

    Returns:
        int: index of the next token in the translated sentence
        float: log probability of the next symbol
    """

    # set the length of the current output tokens
    token_length = len(cur_output_tokens)
    
    # calculate next power of 2 for padding length 
    padded_length = int(np.ceil(2 ** np.log2(token_length + 1)))

    # pad cur_output_tokens up to the padded_length
    padded = cur_output_tokens + [0] * (padded_length - token_length)
    
    # model expects the output to have an axis for the batch size in front so
    # convert `padded` list to a numpy array with shape (1, <padded_length>)
    padded_with_batch = np.array(padded)[None, :]
    
    # get the model prediction
    output, _ = NMTAttn((input_tokens, padded_with_batch))
    
    # get log probabilities from the last token output
    log_probs = output[0, token_length, :]

    # get the next symbol by getting a logsoftmax sample casted to an int
    symbol = int(trax.layers.core.logsoftmax_sample(log_probs, temperature=temperature))
    
    return symbol, float(log_probs[symbol])

In [29]:
# UNIT TEST
# test_next_symbol
w1_unittest.test_next_symbol(next_symbol, NMTAttn)

[92m All tests passed


Call the `next_symbol()` function above several times until the next output is the end-of-sentence token `EOS`.

It takes in an input string and returns the translated version of that string.

### `sampling_decode()` function.

In [30]:
# UNQ_C7
def sampling_decode(input_sentence, NMTAttn = None, temperature=0.0, vocab_file=None, vocab_dir=None, next_symbol=next_symbol, tokenize=tokenize, detokenize=detokenize):
    """Returns the translated sentence.

    Args:
        input_sentence (str): sentence to translate.
        NMTAttn (tl.Serial): An LSTM sequence-to-sequence model with attention.
        temperature (float): parameter for sampling ranging from 0.0 to 1.0.
            0.0: same as argmax, always pick the most probable token
            1.0: sampling from the distribution (can sometimes say random things)
        vocab_file (str): filename of the vocabulary
        vocab_dir (str): path to the vocabulary file

    Returns:
        tuple: (list, str, float)
            list of int: tokenized version of the translated sentence
            float: log probability of the translated sentence
            str: the translated sentence
    """
    
    # encode the input sentence
    input_tokens = tokenize(input_str=input_sentence, vocab_file=vocab_file, vocab_dir=vocab_dir)
    
    # initialize an empty the list of output tokens
    cur_output_tokens = []
    
    # initialize an integer that represents the current output index
    cur_output = 0
    
    # Set the encoding of the "end of sentence" as 1
    EOS = 1
    
    # check that the current output is not the end of sentence token
    while cur_output != EOS:
        
        # update the current output token by getting the index of the next word (hint: use next_symbol)
        cur_output, log_prob = next_symbol(NMTAttn, input_tokens, cur_output_tokens, temperature)
        
        # append the current output token to the list of output tokens
        cur_output_tokens.append(cur_output)        
    
    # detokenize the output tokens
    sentence = detokenize(integers=cur_output_tokens, vocab_file=vocab_file, vocab_dir=vocab_dir)
    
    return cur_output_tokens, log_prob, sentence

In [31]:
# Test the function above. Try varying the temperature setting with values from 0 to 1.
# Run it several times with each setting and see how often the output changes. 
sampling_decode("I love languages.", NMTAttn=model, temperature=0.0, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)

([161, 12202, 5112, 3, 1], -0.0001735687255859375, 'Ich liebe Sprachen.')

In [32]:
# UNIT TEST
# test sampling_decode
w1_unittest.test_sampling_decode(sampling_decode)

[92m All tests passed


I set a default value of `0` to the temperature setting in our implementation of `sampling_decode()` above.

As you may have noticed in the `logsoftmax_sample()` method, this setting will ultimately result in greedy decoding. 

This algorithm generates the translation by getting the most probable word at each step. 

It gets the argmax of the output array of your model and then returns that index. 

In [33]:
def greedy_decode_test(sentence, NMTAttn=None, vocab_file=None, vocab_dir=None, sampling_decode=sampling_decode, next_symbol=next_symbol, tokenize=tokenize, detokenize=detokenize):
    """Prints the input and output of our NMTAttn model using greedy decode

    Args:
        sentence (str): a custom string.
        NMTAttn (tl.Serial): An LSTM sequence-to-sequence model with attention.
        vocab_file (str): filename of the vocabulary
        vocab_dir (str): path to the vocabulary file

    Returns:
        str: the translated sentence
    """
    
    _,_, translated_sentence = sampling_decode(sentence, NMTAttn=NMTAttn, vocab_file=vocab_file, vocab_dir=vocab_dir, next_symbol=next_symbol, tokenize=tokenize, detokenize=detokenize)
    
    print("English: ", sentence)
    print("German: ", translated_sentence)
    
    return translated_sentence

In [34]:
# put a custom string here
your_sentence = 'I am hungry'

greedy_decode_test(your_sentence, NMTAttn=model, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR);

English:  I am hungry
German:  Ich bin hungrig.


In [35]:
greedy_decode_test('I am almost done with the implementation!', model, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR);

English:  I am almost done with the implementation!
German:  Ich bin fast mit der Umsetzung!


### Google translate test
English: I am almost done with the implementation!

German: Ich bin fast fertig mit der Umsetzung!

### Minimum Bayes-Risk Decoding

Getting the most probable token at each step may not necessarily produce the best results.

Another approach is to do Minimum Bayes Risk Decoding (MBR).

The general steps to implement this are

1. take several random samples
2. score each sample against all other samples
3. select the one with the highest score

### Generating samples

I use the `sampling_decode()` function I developed earlier.

I want to record the token list and log probability for each sample as these will be needed in the next step.

In [36]:
def generate_samples(sentence, n_samples, NMTAttn=None, temperature=0.6, vocab_file=None, vocab_dir=None, sampling_decode=sampling_decode, next_symbol=next_symbol, tokenize=tokenize, detokenize=detokenize):
    """Generates samples using sampling_decode()

    Args:
        sentence (str): sentence to translate.
        n_samples (int): number of samples to generate
        NMTAttn (tl.Serial): An LSTM sequence-to-sequence model with attention.
        temperature (float): parameter for sampling ranging from 0.0 to 1.0.
            0.0: same as argmax, always pick the most probable token
            1.0: sampling from the distribution (can sometimes say random things)
        vocab_file (str): filename of the vocabulary
        vocab_dir (str): path to the vocabulary file
        
    Returns:
        tuple: (list, list)
            list of lists: token list per sample
            list of floats: log probability per sample
    """
    # define lists to contain samples and probabilities
    samples, log_probs = [], []

    # run a for loop to generate n samples
    for _ in range(n_samples):
        
        # get a sample using the sampling_decode() function
        sample, logp, _ = sampling_decode(sentence, NMTAttn, temperature, vocab_file=vocab_file, vocab_dir=vocab_dir, next_symbol=next_symbol)
        
        # append the token list to the samples list
        samples.append(sample)
        
        # append the log probability to the log_probs list
        log_probs.append(logp)
                
    return samples, log_probs

In [37]:
# generate 4 samples with the default temperature (0.6)
generate_samples('how are you today?', 4, model, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)

([[93, 24, 34, 352, 102, 1],
  [93, 75, 67, 352, 102, 1],
  [595, 75, 67, 352, 102, 1],
  [93, 2743, 156, 352, 102, 1]],
 [-9.5367431640625e-06,
  -1.9073486328125e-06,
  -3.814697265625e-06,
  -5.7220458984375e-06])

### Comparing overlaps

build my functions to compare a sample against another.

There are several metrics available. 

For this implementation, I will be calculating scores for unigram overlaps.

One of the more simple metrics is the [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index) which gets the intersection over union (**IoU**) of two sets.

In [38]:
def jaccard_similarity(candidate, reference):
    """Returns the Jaccard similarity between two token lists

    Args:
        candidate (list of int): tokenized version of the candidate translation
        reference (list of int): tokenized version of the reference translation

    Returns:
        float: overlap between the two token lists
    """
    
    # convert the lists to a set to get the unique tokens
    can_unigram_set, ref_unigram_set = set(candidate), set(reference)  
    
    # get the set of tokens common to both candidate and reference
    joint_elems = can_unigram_set.intersection(ref_unigram_set)
    
    # get the set of all tokens found in either candidate or reference
    all_elems = can_unigram_set.union(ref_unigram_set)
    
    # divide the number of joint elements by the number of all elements
    overlap = len(joint_elems) / len(all_elems)
    
    return overlap

In [39]:
# let's try using the function. remember the result here and compare with the next function below.
jaccard_similarity([1, 2, 3], [1, 2, 3, 4])

0.75

One of the more commonly used metrics in machine translation is the Recall-Oriented Understudy for Gisting Evaluation **ROUGE**, [wikipedia](https://en.wikipedia.org/wiki/ROUGE_(metric)). 

For unigrams, this is called ROUGE-1.

To get the final score, I want to compute the **F1-score** ([wikipedia](https://en.wikipedia.org/wiki/F-score)) as given by

$$score = 2* \frac{(precision * recall)}{(precision + recall)}$$

### `rouge1_similarity()` function.

In [40]:
# UNQ_C8
# for making a frequency table easily
from collections import Counter

def rouge1_similarity(system, reference):
    """Returns the ROUGE-1 score between two token lists

    Args:
        system (list of int): tokenized version of the system translation
        reference (list of int): tokenized version of the reference translation

    Returns:
        float: overlap between the two token lists
    """    
    
    # make a frequency table of the system tokens (hint: use the Counter class)
    sys_counter = Counter(system)
    
    # make a frequency table of the reference tokens (hint: use the Counter class)
    ref_counter = Counter(reference)
    
    # initialize overlap to 0
    overlap = 0
    
    # run a for loop over the sys_counter object (can be treated as a dictionary)
    for token in sys_counter:
        
        # lookup the value of the token in the sys_counter dictionary (hint: use the get() method)
        token_count_sys = sys_counter.get(token)
        if token_count_sys is None:
            token_count_sys = 0
        
        # lookup the value of the token in the ref_counter dictionary (hint: use the get() method)
        token_count_ref = ref_counter.get(token)
        if token_count_ref is None:
            token_count_ref = 0
        
        # update the overlap by getting the smaller number between the two token counts above
        overlap += min(token_count_sys, token_count_ref)
    
    # get the precision
    precision = overlap / len(system)
    
    # get the recall
    recall = overlap / len(reference) 
    
    if precision + recall != 0: # @KEEPTHIS
        # compute the f1-score
        rouge1_score = 2 * (precision * recall)  / (precision + recall)
    else:
        rouge1_score = 0 
    
    return rouge1_score

In [41]:
# notice that this produces a different value from the jaccard similarity earlier
rouge1_similarity([1, 2, 3], [1, 2, 3, 4])

0.8571428571428571

In [42]:
# UNIT TEST
# test rouge1_similarity
w1_unittest.test_rouge1_similarity(rouge1_similarity)

[92m All tests passed


### Overall score

I need to compare each sample with all other samples. 

For instance, if I generated 30 sentences, we will need to compare sentence 1 to sentences 2 to 30. 

Then, I compare sentence 2 to sentences 1 and 3 to 30, and so forth. 

At each step, I get the average score of all comparisons to get the overall score for a particular sample. 

To illustrate, these will be the steps to generate the scores of a 4-sample list.

1. Get similarity score between sample 1 and sample 2
2. Get similarity score between sample 1 and sample 3
3. Get similarity score between sample 1 and sample 4
4. Get average score of the first 3 steps. This will be the overall score of sample 1.
5. Iterate and repeat until samples 1 to 4 have overall scores.

I will be storing the results in a dictionary for easy lookups.

### `average_overlap()` function.

In [51]:
# UNQ_C9
def average_overlap(similarity_fn, samples, *ignore_params):
    """Returns the arithmetic mean of each candidate sentence in the samples

    Args:
        similarity_fn (function): similarity function used to compute the overlap
        samples (list of lists): tokenized version of the translated sentences
        *ignore_params: additional parameters will be ignored

    Returns:
        dict: scores of each sample
            key: index of the sample
            value: score of the sample
    """  
    
    # initialize dictionary
    scores = {}
    
    # run a for loop for each sample
    for index_candidate, candidate in enumerate(samples):    
        
        # initialize overlap
        overlap = 0
        
        total_items = 0
        
        # run a for loop for each sample
        for index_sample, sample in enumerate(samples): # @KEEPTHIS

            # skip if the candidate index is the same as the sample index
            if index_candidate == index_sample:
                continue
                
            total_items += 1
                
            # get the overlap between candidate and sample using the similarity function
            sample_overlap = similarity_fn(candidate, sample)
            
            # add the sample overlap to the total overlap
            overlap += sample_overlap
            
        # get the score for the candidate by computing the average
        score = overlap / total_items
        
        # save the score in the dictionary. use index as the key.
        scores[index_candidate] = score
        
    return scores

In [49]:
average_overlap(jaccard_similarity, [[1, 2, 3], [1, 2, 4], [1, 2, 4, 5]], [0.4, 0.2, 0.5])

{0: 0.45, 1: 0.625, 2: 0.575}

In [50]:
# UNIT TEST
# test average_overlap
w1_unittest.test_average_overlap(average_overlap, rouge1_similarity)

[92m All tests passed


In practice, it is also common to see the **weighted** mean being used to calculate the overall score instead of just the arithmetic mean.

I add implementation of it below and I will show in experiements which one will give better results.

In [46]:
def weighted_avg_overlap(similarity_fn, samples, log_probs):
    """Returns the weighted mean of each candidate sentence in the samples

    Args:
        samples (list of lists): tokenized version of the translated sentences
        log_probs (list of float): log probability of the translated sentences

    Returns:
        dict: scores of each sample
            key: index of the sample
            value: score of the sample
    """
    
    # initialize dictionary
    scores = {}
    
    # run a for loop for each sample
    for index_candidate, candidate in enumerate(samples):    
        
        # initialize overlap and weighted sum
        overlap, weight_sum = 0.0, 0.0
        
        # run a for loop for each sample
        for index_sample, (sample, logp) in enumerate(zip(samples, log_probs)):

            # skip if the candidate index is the same as the sample index            
            if index_candidate == index_sample:
                continue
                
            # convert log probability to linear scale
            sample_p = float(np.exp(logp))

            # update the weighted sum
            weight_sum += sample_p

            # get the unigram overlap between candidate and sample
            sample_overlap = similarity_fn(candidate, sample)
            
            # update the overlap
            overlap += sample_p * sample_overlap
            
        # get the score for the candidate
        score = overlap / weight_sum
        
        # save the score in the dictionary. use index as the key.
        scores[index_candidate] = score
    
    return scores

In [47]:
weighted_avg_overlap(jaccard_similarity, [[1, 2, 3], [1, 2, 4], [1, 2, 4, 5]], [0.4, 0.2, 0.5])

{0: 0.44255574831883415, 1: 0.631244796869735, 2: 0.5575581009406329}

### Putting it all together

I will now put everything together and develop the `mbr_decode()` function. 

### `mbr_decode()` function.

In [56]:
# UNQ_C10
def mbr_decode(sentence, n_samples, score_fn, similarity_fn, NMTAttn=None, temperature=0.6, vocab_file=None, vocab_dir=None, generate_samples=generate_samples, sampling_decode=sampling_decode, next_symbol=next_symbol, tokenize=tokenize, detokenize=detokenize):
    """Returns the translated sentence using Minimum Bayes Risk decoding

    Args:
        sentence (str): sentence to translate.
        n_samples (int): number of samples to generate
        score_fn (function): function that generates the score for each sample
        similarity_fn (function): function used to compute the overlap between a pair of samples
        NMTAttn (tl.Serial): An LSTM sequence-to-sequence model with attention.
        temperature (float): parameter for sampling ranging from 0.0 to 1.0.
            0.0: same as argmax, always pick the most probable token
            1.0: sampling from the distribution (can sometimes say random things)
        vocab_file (str): filename of the vocabulary
        vocab_dir (str): path to the vocabulary file

    Returns:
        str: the translated sentence
    """
    
    # generate samples
    samples, log_probs = generate_samples(sentence, n_samples, NMTAttn=NMTAttn, temperature=temperature, vocab_file=vocab_file, vocab_dir=vocab_dir, sampling_decode=sampling_decode, next_symbol=next_symbol, tokenize=tokenize, detokenize=detokenize)
    
    # use the scoring function to get a dictionary of scores
    # pass in the relevant parameters as shown in the function definition of 
    # the mean methods you developed earlier
    scores = score_fn(similarity_fn, samples, log_probs)
    
    # find the key with the highest score
    max_score_key = list(scores.keys())[np.argmax(np.array(list(scores.values())))]
    
    # detokenize the token list associated with the max_score_key
    translated_sentence = detokenize(samples[max_score_key], vocab_file=vocab_file, vocab_dir=vocab_dir)
    
    return (translated_sentence, max_score_key, scores)

In [54]:
TEMPERATURE = 1.0

# put a custom string here
your_sentence = 'She speaks English and German.'

In [57]:
mbr_decode(your_sentence, 4, weighted_avg_overlap, jaccard_similarity, model, TEMPERATURE, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)[0]

'Sie spricht Englisch und Deutsch.'

In [59]:
mbr_decode('Congratulations!', 4, average_overlap, rouge1_similarity, model, TEMPERATURE, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)[0]

'Herzlichen Glückwunsch!'

In [63]:
mbr_decode('I have completed the implementation!', 4, average_overlap, rouge1_similarity, model, TEMPERATURE, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)[0]

'Ich habe die Ausführung fertig abgeschlossen!'

In [62]:
# UNIT TEST
# test mbr_decode
w1_unittest.test_mbr_decode(target=mbr_decode, score_fn=average_overlap, similarity_fn=rouge1_similarity)

[92m All tests passed


### References
* [Natural Language Processing with Attention Models](https://www.coursera.org/learn/attention-models-in-nlp) coursera course.