# Neural Machine Translation
This article walks through the core concepts of the following:
- English-to-German translation using Neural Machine Translation (NMT)
- The model uses LSTM networks with attention
- Beyond translation, machine translation models must also handle word sense disambiguation (e.g. whether `bank` refers to a `financial bank` or a `riverside bank`)
- An RNN with LSTMs can work for short to medium sentences but can suffer from vanishing gradients on long sequences
- To address this, an attention mechanism lets the decoder access all relevant parts of the input sentence regardless of its length

1. Preprocess the training and eval data
2. Implement an encoder-decoder system with attention
3. Understand how attention works
4. Build the NMT model from scratch using Trax
5. Generate translations using greedy and Minimum Bayes Risk (MBR) decoding

## Part 1. Data Preparation

### 1.1 Importing the Data

In [1]:
from termcolor import colored
import random
import numpy as np

import trax
from trax import layers as tl
from trax.fastmath import numpy as fastnp
from trax.supervised import training

DATA_DIR = './data/01'

# !pip list | grep trax # trax == 1.3.4 is required 

INFO:tensorflow:tokens_length=568 inputs_length=512 targets_length=114 noise_density=0.15 mean_noise_span_length=3.0 


In [None]:
# Get generator function for the training set
# This will download the train dataset if no data_dir is specified
train_stream_fn = trax.data.TFDS(
    'opus/medical',
    data_dir=DATA_DIR,
    keys=('en', 'de'),
    eval_holdout_size=0.01, #1% for eval
    train=True
)

# Get generator function for the eval set
eval_stream_fn = trax.data.TFDS(
    'opus/medical',
    data_dir=DATA_DIR,
    keys=('en', 'de'),
    eval_holdout_size=0.01, #1% for eval
    train=False
)

In [None]:
train_stream = train_stream_fn()
print(colored('train data (en, de) tuple:', 'red'), next(train_stream))
print()

eval_stream = eval_stream_fn()
print(colored('eval data (en, de) tuple:', 'red'), next(eval_stream))

### 1.2 Tokenization and Formatting
- Sentences are tokenized using subword representations
- Each sentence is represented as an array of integers
- Subword representations are used to avoid out-of-vocabulary words
- For example, instead of having separate entries in your vocabulary for "fear", "fearless", "fearsome", "some", and "less", you can simply store "fear", "some", and "less", then allow your tokenizer to combine these subwords when needed.
- This makes the vocabulary more flexible, since uncommon words do not have to be stored explicitly
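
To make the subword idea concrete, here is a minimal sketch of a greedy longest-match splitter. The three-entry vocabulary and the `toy_subword_split` helper are hypothetical and are not the algorithm or vocabulary used by `trax.data.Tokenize`; they only illustrate how a few stored pieces can cover several words.

```python
# Toy greedy longest-match subword splitter (illustrative only; this is NOT
# the algorithm or vocabulary used by trax.data.Tokenize).
TOY_VOCAB = ["fear", "some", "less"]  # hypothetical three-entry vocabulary


def toy_subword_split(word, vocab=TOY_VOCAB):
    """Split `word` into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest candidate first so longer pieces win over shorter ones.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary piece matches at this position.
            raise ValueError(f"cannot split {word!r} with this vocabulary")
    return pieces


for w in ["fear", "fearless", "fearsome"]:
    print(w, "->", toy_subword_split(w))
```

With only three stored entries, "fearless" and "fearsome" are still representable as `['fear', 'less']` and `['fear', 'some']`.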

In [None]:
# Global variables that state the filename and directory of the vocabulary file
VOCAB_FILE = 'ende_32k.subword'
VOCAB_DIR = './data/01'

In [None]:
# Tokenize the dataset.
tokenized_train_stream = trax.data.Tokenize(vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)(train_stream)
tokenized_eval_stream = trax.data.Tokenize(vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)(eval_stream)

In [None]:
# Append EOS at the end of each sentence
# Integer assigned as end of sentence (EOS)
# This will help us to infer the model has completed the translation
EOS = 1

# Generator helper function to append EOS to each sentence
def append_eos(stream):
    for (inputs, targets) in stream:
        inputs_with_eos = list(inputs) + [EOS]
        targets_with_eos = list(targets) + [EOS]
        
        yield np.array(inputs_with_eos), np.array(targets_with_eos)
        
# Append EOS to the train data
tokenized_train_stream = append_eos(tokenized_train_stream)
tokenized_eval_stream = append_eos(tokenized_eval_stream)

In [None]:
# Filter long sentences
# Filter out sentences that are too long, to avoid running out of memory
# length_keys=[0, 1] means we filter both English and German sentences
# Both must be no longer than 256 tokens for training / 512 for eval
filtered_train_stream = trax.data.FilterByLength(
    max_length=256,
    length_keys=[0, 1]
)(tokenized_train_stream)
filtered_eval_stream = trax.data.FilterByLength(
    max_length=512,
    length_keys=[0,1]
)(tokenized_eval_stream)

train_input, train_target = next(filtered_train_stream)
print(colored(f'Single tokenized example input:', 'red' ), train_input)
print(colored(f'Single tokenized example target:', 'red'), train_target)

### 1.3 Tokenize and Detokenize Helper Functions

In [None]:
def tokenize(input_str, vocab_file=None, vocab_dir=None):
    EOS = 1
    inputs = next(trax.data.tokenize(
        iter([input_str]),
        vocab_file=vocab_file,
        vocab_dir=vocab_dir
    ))
    inputs = list(inputs) + [EOS]
    batch_inputs = np.reshape(np.array(inputs), [1, -1])
    
    return batch_inputs

def detokenize(integers, vocab_file=None, vocab_dir=None):
    integers = list(np.squeeze(integers))
    EOS = 1
    if(EOS in integers):
        integers = integers[:integers.index(EOS)]
        
    return trax.data.detokenize(integers, vocab_file=vocab_file, vocab_dir=vocab_dir)

In [None]:
# Detokenize an input-target pair of tokenized sentences
print(colored(f'Single detokenized example input:', 'red'), detokenize(train_input, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))
print(colored(f'Single detokenized example target:', 'red'), detokenize(train_target, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))
print()

# Tokenize and detokenize a word that is not explicitly saved in the vocabulary file.
# See how it combines the subwords -- 'hell' and 'o'-- to form the word 'hello'.
print(colored(f"tokenize('hello'): ", 'green'), tokenize('hello', vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))
print(colored(f"detokenize([17332, 140, 1]): ", 'green'), detokenize([17332, 140, 1], vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))

### 1.4 Bucketing
[Comprehensive Hands-on Guide to Sequence Model batching strategy: Bucketing technique](https://medium.com/@rashmi.margani/how-to-speed-up-the-training-of-the-sequence-model-using-bucketing-techniques-9e302b0fd976)

In [None]:
# Bucketing to create streams of batches

# Buckets are defined in terms of boundaries and batch sizes.
# batch_sizes[i] determines the batch size for items with length < boundaries[i].
# So below, we take a batch of 256 sentences of length < 8, a batch of 128 if
# the length is between 8 and 16, and so on, down to only 2 if the length is over 512.
boundaries = [8, 16, 32, 64, 128, 256, 512]
batch_sizes = [256, 128, 64, 32, 16, 8, 4, 2]

# Create the generators
train_batch_stream = trax.data.BucketByLength(
    boundaries, 
    batch_sizes,
    length_keys=[0,1]
)(filtered_train_stream)

eval_batch_stream = trax.data.BucketByLength(
    boundaries,
    batch_sizes,
    length_keys=[0,1]
)(filtered_eval_stream)

# Add masking for the padding (0s)
train_batch_stream = trax.data.AddLossWeights(id_to_mask=0)(train_batch_stream)
eval_batch_stream = trax.data.AddLossWeights(id_to_mask=0)(eval_batch_stream)
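
The bucket lookup described in the comments can be sketched in plain Python. This is a minimal sketch, assuming the `length < boundaries[i]` rule stated in the cell; `toy_batch_size` is a hypothetical helper, not part of Trax, and Trax's exact treatment of lengths that fall exactly on a boundary may differ.

```python
import bisect

# Bucket definition modeled on the cell above (power-of-two batch sizes).
boundaries = [8, 16, 32, 64, 128, 256, 512]
batch_sizes = [256, 128, 64, 32, 16, 8, 4, 2]


def toy_batch_size(length):
    """Return the batch size of the first bucket whose boundary exceeds `length`."""
    # bisect_right counts how many boundaries are <= length, which is also
    # the index of the first boundary strictly greater than length.
    return batch_sizes[bisect.bisect_right(boundaries, length)]


print(toy_batch_size(5))    # 256
print(toy_batch_size(8))    # 128
print(toy_batch_size(600))  # 2
```

Short sentences are grouped into large batches and long sentences into small ones, so each batch holds a roughly similar number of tokens.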

### 1.5 Exploring the Data

In [None]:
input_batch, target_batch, mask_batch = next(train_batch_stream)

# let's see the data type of a batch
print("input_batch data type: ", type(input_batch))
print("target_batch data type: ", type(target_batch))

# let's see the shape of this particular batch (batch size, sentence length)
print("input_batch shape: ", input_batch.shape)
print("target_batch shape: ", target_batch.shape)

- The tokens are used to produce an embedding vector for each word in the sentence
- Hence, the embedding of a sentence is a matrix
- The number of sentences in each batch is usually a power of 2 for efficient memory usage

In [None]:
# pick a random index less than the batch size.
index = random.randrange(len(input_batch))

# use the index to grab an entry from the input and target batch
print(colored('THIS IS THE ENGLISH SENTENCE: \n', 'red'), detokenize(input_batch[index], vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR), '\n')
print(colored('THIS IS THE TOKENIZED VERSION OF THE ENGLISH SENTENCE: \n ', 'red'), input_batch[index], '\n')
print(colored('THIS IS THE GERMAN TRANSLATION: \n', 'red'), detokenize(target_batch[index], vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR), '\n')
print(colored('THIS IS THE TOKENIZED VERSION OF THE GERMAN TRANSLATION: \n', 'red'), target_batch[index], '\n')

## Part 2. NMT with Attention

### 2.1 Attention Overview
- An attention model will be built using an encoder-decoder architecture
- The encoder RNN takes in a tokenized version of a sentence
- The encoder output is then passed on to the decoder for translation
- A sequence-to-sequence model with LSTMs works effectively for short to medium sentences but degrades for longer ones
- This is because all the context of the input sentence is compressed into one vector that is passed to the decoder block
- As a result, the first parts of the input have very little effect on the final vector passed to the decoder

$$ENCODER \rightarrow \small{hello}\hspace{2mm} \normalsize{how}\hspace{2mm} \large{are}\hspace{2mm} \Large{you}\hspace{2mm} \huge{today}\hspace{2mm} \Huge{!} \normalsize \rightarrow DECODER$$

- Adding an attention layer to this model avoids this problem by giving the decoder access to all parts of the input sentence
- Consider a 4-word input sentence:
    - A hidden state is produced at each timestep of the encoder
    - These hidden states are all passed to the attention layer, and each is given a score based on the current activation (i.e. hidden state) of the decoder
    - For example, after predicting the first word `wie`, the attention layer receives all the encoder hidden states as well as the decoder hidden state from producing `wie`
    - Given this information, it scores each of the encoder hidden states to decide which one the decoder should focus on to produce the next word
- The trained model might have learned to align to the second encoder hidden state
- It then assigns a high probability to the word `geht`
- With greedy decoding, that word is output as the next symbol,
- and the process repeats to produce the next word until an end-of-sentence prediction is reached

This is scaled dot-product attention of the form
$$Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
- Scores are computed from the queries (Q) and keys (K), then multiplied with the values (V) to get a context vector at a particular timestep of the decoder
- The context vector is fed to the decoder RNN to get a set of probabilities for the next predicted word
- The division by the square root of the key dimensionality $(\sqrt{d_k})$ keeps the dot products from growing with $d_k$, which would otherwise push the softmax into regions with very small gradients
- The encoder activations (hidden states) are the keys and values, while the decoder activations are the queries
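
The formula above can be written out directly in NumPy. This is a minimal sketch for a single sentence with no batching, masking, or multiple heads; `toy_attention` and the array sizes are illustrative assumptions, not the Trax implementation.

```python
import numpy as np


def toy_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for 2-D arrays (no batch, no mask)."""
    d_k = keys.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)               # (n_queries, n_keys)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ values, weights


rng = np.random.default_rng(0)
enc = rng.normal(size=(4, 8))  # 4 encoder hidden states: keys and values
dec = rng.normal(size=(2, 8))  # 2 decoder hidden states: queries

context, weights = toy_attention(dec, enc, enc)
print(context.shape)         # (2, 8): one context vector per decoder step
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```

Each row of `weights` says how strongly one decoder step attends to each of the four encoder hidden states.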

#### 2.2.1 Input Encoder
- The input encoder runs on the input tokens, creates their embeddings, and feeds them to an LSTM network.
- This outputs the activations that will be the keys and values for attention
- It is a serial network with `tl.Embedding` and `tl.LSTM`

In [2]:
def input_encoder_Fn(input_vocab_size, d_model, n_encoder_layers):
    input_encoder = tl.Serial(
        tl.Embedding(input_vocab_size, d_model),
        [tl.LSTM(d_model) for _ in range(n_encoder_layers)]
    )
    return input_encoder

In [3]:
import w1_unittest

w1_unittest.test_input_encoder_fn(input_encoder_Fn)

All tests passed


#### 2.2.2 Pre-attention Decoder
- The pre-attention decoder runs on the targets and creates activations that are used as queries in attention
- This is a Serial network composed of `tl.ShiftRight`, `tl.Embedding`, and `tl.LSTM`

In [None]:
def pre_attention_decoder_fn(mode, target_vocab_size, d_model):
    pre_attention_decoder = tl.Serial(
        # Shift right to insert start-of-sentence token and implement 
        # teacher forcing during training
        tl.ShiftRight(1),
        tl.Embedding(target_vocab_size, d_model),
        tl.LSTM(d_model)
    )
    return pre_attention_decoder

In [None]:
w1_unittest.test_pre_attention_decoder_fn(pre_attention_decoder_fn)


#### 2.2.3 Preparing the Attention Input
- This function prepares the input for the attention layer
- It takes the encoder and pre-attention decoder activations, i.e. the hidden states of the encoder and decoder
- These activations are assigned to the queries, keys, and values
- Another output is the mask that distinguishes real tokens from padding tokens
- This mask is used internally by Trax when computing the softmax so that padding tokens have no effect on the computed probabilities
- The mask is built by observing which tokens in the input correspond to padding
- Multi-headed attention computes attention multiple times to improve the model's predictions
- This additional axis for attention heads has to be accounted for in the output
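
The mask construction can be tried on a tiny batch using plain NumPy in place of `fastnp`. The token ids and the decoder length below are made-up values for illustration.

```python
import numpy as np

# Toy batch: 2 sentences padded to length 4 (0 is the padding id).
inputs = np.array([[5, 9, 1, 0],
                   [7, 1, 0, 0]])
decoder_len = 3  # hypothetical padded target length

mask = inputs > 0                                         # (batch, encoder_len)
mask = mask.reshape(mask.shape[0], 1, 1, mask.shape[1])   # add heads and decoder axes
mask = mask + np.zeros((1, 1, decoder_len, 1))            # broadcast over decoder_len

print(mask.shape)     # (2, 1, 3, 4)
print(mask[1, 0, 0])  # [1. 1. 0. 0.]
```

The resulting shape is `(batch_size, attention heads, decoder length, encoder length)`, and every decoder position sees the same padding pattern for its sentence.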

In [None]:
def prepare_attention_input(encoder_activations, decoder_activations, inputs):
    """
    Prepare Queries, Keys, Values and Mask for attention
    Args:
        encoder_activations fastnp.array(batch_size, padded_input_length, d_model): output from the input encoder
        decoder_activations fastnp.array(batch_size, padded_input_length, d_model): output from the pre-attention decoder
        inputs fastnp.array(batch_size, padded_input_length): padded input tokens
    
    Returns:
        queries, keys, values and mask for attention.
    """
    # Set the keys and values to the encoder activations
    keys = encoder_activations
    values = encoder_activations
    
    # Set the queries to the decoder activations
    queries = decoder_activations
    
    # Generate the mask to distinguish real tokens from padding
    # hint: inputs is positive for real tokens and 0 where they are padding
    mask = (inputs >= 1)
    
    # add axes to the mask for attention heads and decoder length
    mask = fastnp.reshape(mask, (mask.shape[0], 1, 1, mask.shape[1]))
    
    # Broadcast so mask shape is [batch_size, attention heads, decoder-len, encoder-len]
    # Note: for this assignment, attention heads is set to 1
    mask = mask + fastnp.zeros((1, 1, decoder_activations.shape[1], 1))
    
    return queries, keys, values, mask

In [None]:
w1_unittest.test_prepare_attention_input(prepare_attention_input)


### 2.3 Implementation Overview
- Step 0: Prepare input encoder and pre-attention decoder branches
- Step 1: Create a Serial network. It will stack the layers in the next steps one after the other.
- Step 2: Make a copy of the input and target tokens
    - The input and target tokens will be fed into different layers of the model
    - Use a `tl.Select` layer to create copies of these tokens
    - Arrange them as `[input_tokens, target_tokens, input_tokens, target_tokens]`
- Step 3: Create a parallel branch to feed the input tokens to the `input encoder` and the target tokens to the `pre-attention decoder`
    - Use `tl.Parallel` to create these sublayers in parallel
- Step 4: Call the `prepare_attention_input` function to convert the encoder and pre-attention decoder activations to a format that the attention layer will accept
    - Use `tl.Fn` to wrap the function as a layer
- Step 5: Feed Q, K, V and the mask to the `tl.AttentionQKV` layer
    - This computes the scaled dot-product attention and outputs the attention weights and mask
    - Though it is a one-liner, it is composed of a deep network made up of several branches
    - Deep layers pose the risk of vanishing gradients during training, which needs to be mitigated
    - A `tl.Residual` layer is added to improve the ability to learn: it adds the output of `AttentionQKV` to the queries input
    - Nest the `AttentionQKV` inside `Residual`
- Step 6: The mask is no longer needed, so it is dropped
    - At this point in the network, the signal stack holds `[attention activations, mask, target tokens]`, and `tl.Select` can be used to keep just `[attention activations, target tokens]`
- Step 7: Feed the attention-weighted output to the LSTM decoder
    - Stack multiple `tl.LSTM` layers to improve the output, so remember to add a number of LSTMs equal to `n_decoder_layers`
- Step 8: Determine the probabilities of each subword in the vocabulary with a `tl.Dense` layer whose size equals the size of the vocabulary
- Step 9: Normalize the output to log probabilities by passing the activations in Step 8 to a `tl.LogSoftmax` layer
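
Steps 2 and 6 both rearrange the data stack with `tl.Select`. The sketch below is a simplified pure-Python model of that behavior, assuming the layer consumes `max(indices) + 1` items by default and leaves the rest of the stack untouched; `toy_select` is a hypothetical stand-in, not the Trax layer itself.

```python
def toy_select(stack, indices, n_in=None):
    """Simplified model of tl.Select acting on the front of a data stack."""
    # By default the layer consumes max(indices) + 1 items from the stack.
    n_in = max(indices) + 1 if n_in is None else n_in
    consumed, rest = stack[:n_in], stack[n_in:]
    # The consumed items are replaced by the selected ones, in the given order.
    return [consumed[i] for i in indices] + rest


# Step 2: duplicate the input and target tokens.
stack = ["input_tokens", "target_tokens"]
print(toy_select(stack, [0, 1, 0, 1]))
# ['input_tokens', 'target_tokens', 'input_tokens', 'target_tokens']

# Step 6: drop the mask after attention.
stack = ["attention_activations", "mask", "target_tokens"]
print(toy_select(stack, [0, 2]))
# ['attention_activations', 'target_tokens']
```

The second copy of the target tokens created in Step 2 is what remains on the stack at the end, where it serves as the label for the loss.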

In [None]:
def NMTAttn(
    input_vocab_size=33300, 
    target_vocab_size=33300, 
    d_model=1024, 
    n_encoder_layers=2, 
    n_decoder_layers=4,
    n_attention_heads=4,
    attention_dropout=0.0, 
    mode='train'
):
    """
    Returns an LSTM sequence-to-sequence model with attention
    
    The input to the model is a pair (input tokens, target tokens)
    e.g. an English sentence (tokenized) and its translation into German
    
    Args:
    input_vocab_size: int: vocab size of the input
    target_vocab_size: int: vocab size of the target
    d_model: int:  depth of embedding (n_units in the LSTM cell)
    n_encoder_layers: int: number of LSTM layers in the encoder
    n_decoder_layers: int: number of LSTM layers in the decoder after attention
    n_attention_heads: int: number of attention heads
    attention_dropout: float, dropout for the attention layer
    mode: str: 'train', 'eval' or 'predict', predict mode is for fast inference
    """
    
    # Step 0: call the helper function to create layers for the input encoder and pre attention decoder
    input_encoder = input_encoder_Fn(input_vocab_size, d_model, n_encoder_layers)
    pre_attention_decoder = pre_attention_decoder_fn(mode, target_vocab_size, d_model)
    
    # Step 1: Create a serial network
    model = tl.Serial(
        # Copy input tokens and target tokens as they will be needed later
        tl.Select([0, 1, 0, 1]),
        # Run input encoder on the input and pre-attention decoder the target
        tl.Parallel(input_encoder, pre_attention_decoder),
        # Prepare queries, keys, values and mask for attention
        tl.Fn('PrepareAttentionInput', prepare_attention_input, n_out=4),
        # Run the AttentionQKV layer
        # nest it inside a Residual layer to add to the pre-attention decoder activations
        tl.Residual(tl.AttentionQKV(d_model, n_heads=n_attention_heads, dropout=attention_dropout, mode=mode)),
        # Drop attention mask
        tl.Select([0,2], n_in=None),
        # Run the rest of the RNN decoder
        [tl.LSTM(d_model) for _ in range(n_decoder_layers)],
        # Prepare output by making it to the right size
        tl.Dense(target_vocab_size),
        tl.LogSoftmax()
        
    )
    
    
    return model

In [None]:
model = NMTAttn()
model

In [None]:
w1_unittest.test_NMTAttn(NMTAttn)

## Part 3: Training
- Train the model using supervised learning
- This uses the `TrainTask`, `EvalTask` and `Loop` classes from `trax.supervised.training`

In [None]:
train_task = training.TrainTask(
    # use the train batch stream as labeled data
    labeled_data = train_batch_stream,
    # Use the cross entropy loss
    loss_layer = tl.CrossEntropyLoss(),
    # use the Adam optimizer with learning rate of 0.01
    optimizer = trax.optimizers.Adam(0.01),
    # learning rate schedule with 1000 warmup steps and a max value of 0.01
    lr_schedule = trax.lr.warmup_and_rsqrt_decay(1000, max_value=0.01),
    # have a checkpoint every 10 steps
    n_steps_per_checkpoint = 10
)

In [None]:
w1_unittest.test_train_task(train_task)


### 3.2 Eval Task

In [None]:
eval_task = training.EvalTask(
    labeled_data=eval_batch_stream,
    metrics=[tl.CrossEntropyLoss(), tl.Accuracy()]
)

### 3.3 Loop
The `Loop` class ties together the model being trained and the train and eval tasks to execute. Its `run()` method executes the training for a specified number of steps.

In [None]:
output_dir = './data/01/output_dir'

In [None]:

#!rm -f ~/output_dir/model.pkl.gz

training_loop = training.Loop(
    NMTAttn(mode='train'),
    train_task,
    eval_tasks=[eval_task],
    output_dir=output_dir
)

In [None]:
training_loop.run(10)

## Part 4. Testing
- Method 1: Identify the next symbol, i.e. the output token
- Method 2: Combine the symbols into the entire translation string

In [None]:
import os
model_file = os.path.join(output_dir, 'model.pkl.gz')
# Instantiate the model we built in eval mode
model = NMTAttn(mode='eval')

# initialize weights from a pre-trained model
model.init_from_file(model_file, weights_only=True)
model = tl.Accelerate(model)

### 4.1 Decoding
- There are several ways to get the next token when translating a sentence
- Either take the most probable token at each step (i.e. greedy decoding) or draw a sample from a distribution
- Both approaches can be implemented with the `tl.logsoftmax_sample()` function

```python
def logsoftmax_sample(log_probs, temperature=1.0):  # pylint: disable=invalid-name
  """Returns a sample from a log-softmax output, with temperature.

  Args:
    log_probs: Logarithms of probabilities (often coming from LogSofmax)
    temperature: For scaling before sampling (1.0 = default, 0.0 = pick argmax)
  """
  # This is equivalent to sampling from a softmax with temperature.
  u = np.random.uniform(low=1e-6, high=1.0 - 1e-6, size=log_probs.shape)
  g = -np.log(-np.log(u))
  return np.argmax(log_probs + g * temperature, axis=-1)
```

Takeaways:
1. It draws random samples with the same shape as the input (i.e. `log_probs`)
2. The amount of "noise" these random samples add to the input is scaled by the `temperature` setting
3. Setting `temperature` to 0 removes the noise, so the return statement reduces to the argmax of `log_probs`
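
The effect of the temperature can be checked with a NumPy restatement of the function quoted above. `toy_logsoftmax_sample` is an illustrative copy with an optional random generator argument added for convenience, and the three-token distribution is made up.

```python
import numpy as np


def toy_logsoftmax_sample(log_probs, temperature=1.0, rng=None):
    """NumPy restatement of the quoted function, with an optional rng argument."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(low=1e-6, high=1.0 - 1e-6, size=log_probs.shape)
    g = -np.log(-np.log(u))  # Gumbel noise
    return np.argmax(log_probs + g * temperature, axis=-1)


log_probs = np.log(np.array([0.1, 0.6, 0.3]))  # toy distribution over 3 tokens
rng = np.random.default_rng(0)

# temperature=0 removes the noise term, so the result is always the argmax (index 1).
print([int(toy_logsoftmax_sample(log_probs, 0.0, rng)) for _ in range(5)])

# temperature=1 samples from the distribution, so other indices can appear.
print([int(toy_logsoftmax_sample(log_probs, 1.0, rng)) for _ in range(5)])
```

Values between 0 and 1 trade off between the deterministic argmax and sampling from the model's full distribution.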

In [None]:
def next_symbol(NMTAttn, input_tokens, cur_output_tokens, temperature):
    """
    Returns the index of the next token
    
    Args:
    NMTAttn: An LSTM Sequence to sequence model with attention
    input_tokens: (np.ndarray 1 x n_tokens): tokenized representation of the input sequence
    cur_output_tokens(list): Tokenized representation of previously translated words
    temperature: Parameter for sampling ranging from 0.0 to 1.0
        0.0: Same as argmax, always pick the most probable token
        1.0: Sampling from the distribution (can sometimes say random things)
    
    Returns:
    int: index of the next token in the translated sentence
    float: log probability of the next symbol
    """
    # Set the length of the current output tokens
    token_length = len(cur_output_tokens)
    
    # Calculate next power of 2 for padding length
    padded_length = 2**int(np.ceil(np.log2(token_length+1)))
    
    # Pad cur_output_tokens up to the padded_length
    padded = cur_output_tokens + (padded_length - token_length) * [0]
     
    # The model expects the output to have a batch axis in front, so convert
    # the `padded` list to a numpy array with shape (1, <padded_length>) where
    # the first position is the batch axis, using np.expand_dims() with axis=0
    padded_with_batch = np.expand_dims(padded, axis=0)
    
    # Get the model prediction. Remember to use the NMTAttn argument defined above
    # hint: the model accepts a tuple as input
    output, _ = NMTAttn((input_tokens, padded_with_batch))
    
    log_probs = output[0, token_length, :]
    
    # get the next symbol by getting a logsoftmax sample (*hint: cast to an int)
    symbol = int(tl.logsoftmax_sample(log_probs, temperature))
    
    return symbol, float(log_probs[symbol])

In [None]:
w1_unittest.test_next_symbol(next_symbol, model)
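
The padding step inside `next_symbol` can be checked in isolation. The sketch below reuses the same expression for the padded length; `toy_pad` is a hypothetical helper and the token ids are made up.

```python
import numpy as np


def toy_pad(cur_output_tokens):
    """Pad a token list with zeros up to the next power of two above its length."""
    token_length = len(cur_output_tokens)
    padded_length = 2 ** int(np.ceil(np.log2(token_length + 1)))
    return cur_output_tokens + (padded_length - token_length) * [0]


print(toy_pad([]))              # [0]
print(toy_pad([17, 42]))        # [17, 42, 0, 0]
print(toy_pad([17, 42, 8, 3]))  # [17, 42, 8, 3, 0, 0, 0, 0]
```

Because the padded length is always strictly greater than the number of tokens so far, position `token_length` always exists in the model output, which is the position `next_symbol` reads the next-token log probabilities from.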


In [None]:
def sampling_decode(input_sentence, NMTAttn=None, temperature=0.0, vocab_file=None, vocab_dir=None):
    """
    Returns the translated sentence
    
    Args:
    input_sentence(str): Sentence to translate
    NMTAttn(tl.Serial): An LSTM sequence-to-sequence model with attention 
    temperature(float): parameter for sampling ranging from 0.0 to 1.0
        0.0: Same as argmax, always pick the most probable token
        1.0: Sampling from the distribution (can sometimes say random things)
    vocab_file (str): filename of the vocabulary
    vocab_dir (str): path to the vocabulary file
    
    Returns:
    tuple: (list, float, str)
    list of int: tokenized version of the translated sentence
    float: log probability of the translated sentence
    str: the translated sentence
    """
    
    # Encode the input sentence
    input_tokens = tokenize(input_sentence, vocab_file=vocab_file, vocab_dir=vocab_dir)
    # initialize the list of output tokens
    cur_output_tokens = []
    
    cur_output = 0
    # Set the encoding of end of sentence as 1
    EOS = 1
    
    # Check that the current output is not the end of sentence token
    while(cur_output != EOS):
        # Update the current output token by getting the index of the next word
        cur_output, log_prob = next_symbol(NMTAttn, input_tokens, cur_output_tokens, temperature)
        
        # Append the current output token to the list of output tokens
        cur_output_tokens.append(cur_output)
        
    # detokenize the output tokens
    sentence = trax.data.detokenize(cur_output_tokens, vocab_file=vocab_file, vocab_dir=vocab_dir)
    
    return cur_output_tokens, log_prob, sentence

In [None]:
sampling_decode("I love languages.", model, temperature=0.0, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)

### Greedy Decode Test
- The temperature setting in our implementation of `sampling_decode()` above defaults to 0
- With that setting, `logsoftmax_sample()` ultimately results in greedy decoding
- The algorithm generates the translation by taking the most probable word at each step
- It takes the argmax of the model's output array and returns that index
- In the test below, the output remains the same each time you run it

In [None]:
def greedy_decode_test(sentence, NMTAttn=None, vocab_file=None, vocab_dir=None):
    """
    Prints the input and output of our NMTAttn model using Greedy decode
    Args:
    sentence (str): A custom string
    NMTAttn (tl.Serial): An LSTM sequence-to-sequence model with attention
    vocab_file (str): filename of the vocabulary
    vocab_dir (str): path to the vocabulary file
    
    Returns:
    str: the translated sentence
    """
    
    _, _, translated_sentence = sampling_decode(sentence, NMTAttn, vocab_file=vocab_file, vocab_dir=vocab_dir)
    
    print(f"English: {sentence}")
    print(f"German: {translated_sentence}")
    
    return translated_sentence
    

In [None]:
your_sentence = "I love languages"
greedy_decode_test(your_sentence, model, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)

In [None]:
greedy_decode_test('You are almost done with the assignment!', model, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR);

### 4.2 Minimum Bayes-Risk Decoding
Getting the most probable token at each step may not necessarily produce the best results. Another approach is Minimum Bayes Risk (MBR) decoding. The steps are:

1. Take several random samples
2. Score each sample against all other samples
3. Select the one with the highest score
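
The three steps above can be sketched end to end on toy data before building the real pieces. In this minimal sketch the token lists stand in for sampled translations, and `toy_jaccard` and `toy_mbr_pick` are hypothetical helpers; the similarity function is only one possible choice.

```python
def toy_jaccard(a, b):
    """Intersection over union of the unique tokens in two lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def toy_mbr_pick(samples, similarity_fn=toy_jaccard):
    """Return the index of the sample with the highest mean similarity to the rest."""
    scores = {}
    for i, candidate in enumerate(samples):
        # Step 2: score the candidate against every other sample.
        others = [s for j, s in enumerate(samples) if j != i]
        scores[i] = sum(similarity_fn(candidate, s) for s in others) / len(others)
    # Step 3: select the sample with the highest score.
    return max(scores, key=scores.get), scores


# Step 1 (stand-in): hypothetical token lists instead of sampled translations.
samples = [[1, 2, 3], [1, 2, 4], [1, 2, 4, 5]]
best, scores = toy_mbr_pick(samples)
print(best, scores)
```

The selected sample is the one that agrees most with the others, which makes it a consensus translation.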

#### 4.2.1 Generating Samples
Build a function to generate several samples. Use the `sampling_decode()` function you developed earlier to do this easily. We want to record the token list and log probability for each sample as these will be needed in the next step.

In [None]:
def generate_samples(sentence, n_samples, NMTAttn=None, temperature=0.6, vocab_file=None, vocab_dir=None):
    samples, log_probs = [], []
    
    # run a for loop to generate n samples
    for _ in range(n_samples):
        # Get a sample using the sampling_decode() function
        sample, logp, _ = sampling_decode(
            sentence,
            NMTAttn, 
            temperature,
            vocab_file=vocab_file,
            vocab_dir=vocab_dir
        )
        # Append the token_list to the samples list
        samples.append(sample)
        # Append the log probability to the log_probs list
        log_probs.append(logp)
        
    return samples, log_probs

In [None]:
# generate 4 samples with the default temperature (0.6)
generate_samples('I love languages.', 4, model, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)

#### 4.2.2 Comparing Overlaps
- Build a function to compare a sample against another
- Jaccard similarity, a simple method - Gets the intersection over union of two sets

In [9]:
def jaccard_similarity(candidate, reference):
    """
    Returns the Jaccard Similarity between two token lists
    Args:
        candidate (list of int): tokenized version of the candidate translation
        reference (list of int): tokenized version of the reference translation
        
    Returns:
        float: Overlap between the two token lists
    """
    # Convert the lists to a set to get the unique tokens
    can_unigram_set, ref_unigram_set = set(candidate), set(reference)
    
    # Get the set of tokens common to both candidate and reference
    joint_elems = can_unigram_set.intersection(ref_unigram_set)
    
    # Get the set of all tokens found in either candidate or reference
    all_elems = can_unigram_set.union(ref_unigram_set)
    
    # Divide the number of joint elements by the number of all elements
    overlap = len(joint_elems) / len(all_elems)
    
    return overlap

In [10]:
# let's try using the function. remember the result here and compare with the next function below.
jaccard_similarity([1, 2, 3], [1, 2, 3, 4])

0.75

One of the more commonly used metrics in machine translation is the ROUGE score. For unigrams, this is called ROUGE-1 and, as shown in class, you can output the scores for both precision and recall when comparing two samples. To get the final score, an F1-score is computed:

$$score = 2 * \frac{(precision * recall)}{(precision + recall)}$$

In [28]:
from collections import Counter

def rouge1_similarity(system, reference):
    
    # Make a frequency table of the system tokens 
    sys_counter = Counter(system)

    # Make a frequency table of the reference tokens
    ref_counter = Counter(reference)
    # initialize overlap to 0
    overlap = 0
    
    # Run a for loop over the sys_counter object
    for token in sys_counter:
        # Lookup the value of the token in the sys_counter dictionary
        token_count_sys = sys_counter.get(token)
        
        # Lookup the value of the token in the ref_counter dictionary
        token_count_ref = ref_counter.get(token) if token in ref_counter else 0
        
        # Update the overlap by getting the smaller number between the two token counts above
        overlap += (token_count_sys if token_count_sys <= token_count_ref else token_count_ref)
        
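    # Get the precision (overlap relative to the system length)
    precision = overlap / len(system)
    # Get the recall (overlap relative to the reference length)
    recall = overlap / len(reference)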
    
    if(precision + recall != 0):
        rouge1_score = 2 * (precision * recall) / (precision + recall)
    else:
        rouge1_score = 0
    
    return rouge1_score

In [29]:
# notice that this produces a different value from the jaccard similarity earlier
rouge1_similarity([1, 2, 3], [1, 2, 3, 4])

0.8571428571428571
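
As a worked check of this value: for the system tokens `[1, 2, 3]` and the reference tokens `[1, 2, 3, 4]`, all three system tokens appear in the reference, so the overlap is 3. Substituting into the formula gives

$$precision = \frac{3}{3} = 1, \qquad recall = \frac{3}{4} = 0.75, \qquad score = 2 * \frac{(1 * 0.75)}{(1 + 0.75)} = \frac{1.5}{1.75} \approx 0.857$$

Jaccard similarity on the same pair divides the 3 shared tokens by the 4 tokens in the union, which is why it gives 0.75 instead.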

In [30]:
# BEGIN UNIT TEST
w1_unittest.test_rouge1_similarity(rouge1_similarity)
# END UNIT TEST

All tests passed


#### 4.2.3 Overall score

We will now build a function to generate the overall score for a particular sample. As mentioned earlier, we need to compare each sample with all other samples. For instance, if we generated 30 sentences, we will need to compare sentence 1 to sentences 2 to 30. Then, we compare sentence 2 to sentences 1 and 3 to 30, and so forth. At each step, we get the average score of all comparisons to get the overall score for a particular sample. To illustrate, these will be the steps to generate the scores of a 4-sample list.

1. Get similarity score between sample 1 and sample 2
2. Get similarity score between sample 1 and sample 3
3. Get similarity score between sample 1 and sample 4
4. Get average score of the first 3 steps. This will be the overall score of sample 1.
5. Iterate and repeat until samples 1 to 4 have overall scores.

We will be storing the results in a dictionary for easy lookups.

In [37]:
def average_overlap(similarity_fn, samples, *ignore_params):
    
    # initialize dictionary
    scores = {}
    
    # run a for loop for each sample
    for index_candidate, candidate in enumerate(samples):
        
        # Init to 0.0
        overlap = 0
        
        # Run a for loop for each sample
        for index_sample, sample in enumerate(samples):
            # Skip if the candidate index is the same as the sample index
            if(index_candidate == index_sample):
                continue
                
            # Get the overlap between candidate and sample using the
            # similarity function
            sample_overlap = similarity_fn(sample, candidate)
            
            overlap += sample_overlap
            
        # Get the score for the candidate by averaging over the
        # other len(samples) - 1 samples
        score = overlap / (len(samples) - 1)
        scores[index_candidate] = score
        
    return scores

In [38]:
average_overlap(jaccard_similarity, [[1, 2, 3], [1, 2, 4], [1, 2, 4, 5]], [0.4, 0.2, 0.5])

{0: 0.45, 1: 0.625, 2: 0.575}
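
To see where the numbers in the output above come from, the pairwise scores can be laid out explicitly. This is a minimal sketch, not the notebook's implementation: `jaccard_sketch` is a hypothetical set-based stand-in for `jaccard_similarity`, and the shortcut of scoring each unordered pair only once relies on that similarity being symmetric.

```python
from itertools import combinations

def jaccard_sketch(a, b):
    # Assumed set-based Jaccard: |intersection| / |union|
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

samples = [[1, 2, 3], [1, 2, 4], [1, 2, 4, 5]]

# Score every unordered pair of samples once
pair_scores = {
    (i, j): jaccard_sketch(samples[i], samples[j])
    for i, j in combinations(range(len(samples)), 2)
}
print(pair_scores)  # pairs (0, 1), (0, 2), (1, 2) score 0.5, 0.4, 0.75

# The overall score of sample i is the mean of all pairs that involve i
overall = {}
for i in range(len(samples)):
    involved = [score for pair, score in pair_scores.items() if i in pair]
    overall[i] = sum(involved) / len(involved)
print(overall)
```

Sample 1 ends up with the highest overall score because it shares the most tokens with both of the other samples. With a similarity function that is not symmetric, each ordered pair would have to be scored separately, as the nested loop in `average_overlap` does.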

In [39]:
w1_unittest.test_average_overlap(average_overlap)

 All tests passed


In practice, it is also common to use a weighted mean instead of the plain arithmetic mean to compute the overall score. The implementation below weights each comparison by the probability of the sample being compared against; you can use it in your experiments to see which one gives better results.

In [17]:
def weighted_avg_overlap(similarity_fn, samples, log_probs):
    
    # initialize dictionary
    scores = {}
    
    # run a for loop for each sample
    for index_candidate, candidate in enumerate(samples):
        
        # Init to 0.0
        overlap, weighted_sum = 0.0, 0.0
        
        # Run a for loop for each sample
        for index_sample, (sample, logp) in enumerate(zip(samples, log_probs)):
            # Skip if the candidate index is the same as the sample index
            if(index_candidate == index_sample):
                continue
                
            # Convert log probability to linear scale
            sample_p = float(np.exp(logp))
            
            # Update the weighted sum
            weighted_sum += sample_p
                
            # Get the overlap between candidate and sample using the
            # similarity function
            sample_overlap = similarity_fn(sample, candidate)
            
            overlap += sample_overlap * sample_p
            
        # Get the score for the candidate by computing the weighted average
        score = overlap / weighted_sum
        scores[index_candidate] = score
        
    return scores

In [18]:
weighted_avg_overlap(jaccard_similarity, [[1, 2, 3], [1, 2, 4], [1, 2, 4, 5]], [0.4, 0.2, 0.5])

{0: 0.44255574831883415, 1: 0.631244796869735, 2: 0.5575581009406329}
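
The weighting step can be checked in isolation. The sketch below uses a hypothetical helper, `weighted_mean_sketch`, that is not part of the notebook. It reproduces the score of candidate 0 from the output above, whose Jaccard similarities to the other two samples are assumed to be 0.5 and 0.4, and shows that equal log probabilities reduce the weighted mean to the arithmetic mean.

```python
import numpy as np

def weighted_mean_sketch(values, log_probs):
    # Convert log probabilities to linear scale and use them as weights
    weights = np.exp(np.asarray(log_probs, dtype=float))
    values = np.asarray(values, dtype=float)
    return float(np.sum(weights * values) / np.sum(weights))

# Candidate 0 is compared against samples 1 and 2 (log_probs 0.2 and 0.5)
score_0 = weighted_mean_sketch([0.5, 0.4], [0.2, 0.5])
print(score_0)  # approx. 0.4426

# With equal log probabilities the weights cancel out
equal = weighted_mean_sketch([0.5, 0.4], [-3.0, -3.0])
print(equal)  # approx. 0.45, the arithmetic mean
```

Since the sample with the larger log probability has the lower similarity here (0.4), the weighted score is pulled slightly below the arithmetic mean of 0.45.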

### 4.2.4 Putting it all together

We will now put everything together and develop the `mbr_decode()` function. Please use the helper functions you just developed to complete this. You will want to generate samples, score each sample, pick the sample with the highest score, and then detokenize it to get the translated sentence.
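
The selection step can be tried out without a trained model by replacing the generated samples with hand-written token lists. This is only a sketch under stated assumptions: `pick_mbr_candidate` and `jaccard_sketch` are hypothetical names, the scoring uses the plain arithmetic mean, and sampling and detokenization are left out.

```python
def jaccard_sketch(a, b):
    # Assumed set-based Jaccard: |intersection| / |union|
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def pick_mbr_candidate(samples, similarity_fn):
    """Return (best_index, scores) using the arithmetic-mean overlap."""
    scores = {}
    for i, candidate in enumerate(samples):
        # Compare the candidate against every other sample
        others = [
            similarity_fn(sample, candidate)
            for j, sample in enumerate(samples)
            if j != i
        ]
        scores[i] = sum(others) / len(others)

    # scores is a dict, so take the key whose value is largest
    best_index = max(scores, key=scores.get)
    return best_index, scores

# Toy "samples": token ID lists standing in for generated translations
toy_samples = [[1, 2, 3], [1, 2, 4], [1, 2, 4, 5]]
best_index, scores = pick_mbr_candidate(toy_samples, jaccard_sketch)
print(best_index)  # 1
print(scores)
```

Because the scores are stored in a dictionary keyed by sample index, `max(scores, key=scores.get)` is a convenient way to recover the index of the winning sample.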

In [None]:
def mbr_decode(
    sentence, n_samples, score_fn, similarity_fn, 
    NMTAttn=None, temperature=0.6, vocab_file=None, vocab_dir=None
):
    
    samples, log_probs = generate_samples(
        sentence, n_samples, NMTAttn, 
        temperature, vocab_file, vocab_dir
    )
    
    # Use the scoring function passed in as `score_fn` to get a dictionary of
    # scores keyed by sample index. Both mean methods developed earlier
    # accept (similarity_fn, samples, log_probs).
    scores = score_fn(similarity_fn, samples, log_probs)
    
    # find the key with the highest score (scores is a dict, so np.argmax
    # would not work here)
    max_index = max(scores, key=scores.get)
    
    # detokenize the token list associated with the max_index
    translated_sentence = trax.data.detokenize(
        samples[max_index], vocab_file=vocab_file, vocab_dir=vocab_dir
    )

    return (translated_sentence, max_index, scores)
    

In [None]:
TEMPERATURE = 1.0

# put a custom string here
your_sentence = 'She speaks English and German.'

In [None]:
mbr_decode(your_sentence, 4, weighted_avg_overlap, jaccard_similarity, model, TEMPERATURE, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)[0]

In [None]:
mbr_decode('Congratulations!', 4, average_overlap, rouge1_similarity, model, TEMPERATURE, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)[0]

In [None]:
mbr_decode('You have completed the assignment!', 4, average_overlap, rouge1_similarity, model, TEMPERATURE, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)[0]

In [None]:
# BEGIN UNIT TEST
w1_unittest.test_mbr_decode(mbr_decode, model)
# END UNIT TEST