<a href="https://colab.research.google.com/github/OmarAlsaqa/trax/blob/master/NMT_with_Transformers_Reformers_using_Trax.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#@title
# Copyright 2021 Google LLC.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# https://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# **NMT with Transformers/Reformers using Trax**

A guide to Neural Machine Translation using ***Transformers/Reformers***. Includes a detailed tutorial using ***Trax*** in Google Colaboratory.

Machine translation is an important task in natural language processing and could be useful not only for translating one language to another but also for word sense disambiguation.

In this Notebook you will:
*   Learn how to preprocess your training and evaluation data.
*   Implement an encoder-decoder system with attention.
*   Understand how attention works.
*   Build the NMT model from scratch using Trax.
*   Generate translations using greedy and Minimum Bayes Risk (MBR) decoding.

This notebook contains many cells taken from the [Natural Language Processing Specialization](https://www.coursera.org/specializations/natural-language-processing).

# Part (-1): Run on CPU/GPU

This notebook was designed to run on TPU.

To use TPUs in Colab, click "Runtime" on the main menu bar and select Change runtime type. Set "TPU" as the hardware accelerator.


In [None]:
import os
import sys

# Optionally point Python at a local trax checkout (for example, a 'src' directory)
# by setting the TRAX_PROJECT_ROOT environment variable.
project_root = os.environ.get('TRAX_PROJECT_ROOT', '')
if project_root:
    sys.path.insert(0, project_root)

# Optionally verify the import path
print(f"Python will look for packages in: {sys.path[0]}")

# Import trax
import trax

# Verify the source of the imported package
print(f"Imported trax from: {trax.__file__}")

# Part (0): Important Imports

In [None]:
import trax
from trax import layers as tl
from trax.learning.supervised import training
from trax.data.preprocessing import inputs as preprocessing
from trax.data.encoder import encoder
from trax.data.loader.tf import base as dataset
from trax import models
from trax import optimizers
from trax.learning.supervised import lr_schedules as learning_schedule

import numpy as np

from termcolor import colored
import random
import shutil

# Part (1):  Data Preparation

**You can jump directly to the Trax Data Pipeline (optional) section and skip sections 1.1 to 1.5.**

## 1.1  Importing the Data
We will be using [ParaCrawl](https://paracrawl.eu/), a large multilingual translation dataset created by the European Union. All of these datasets are available via [TFDS para_crawl](https://www.tensorflow.org/datasets/catalog/para_crawl). We use the English-to-French dataset. You can try the other available languages by changing the `dataset_name` and `keys`, or even try other datasets available in TFDS.

Note: The first download of the dataset takes a while, so it is preferable to point `data_dir` at Google Drive rather than the Colab runtime. Because para_crawl is a large dataset, you may also want to try a different, smaller dataset.

In [None]:
# This will download the train dataset if no data_dir is specified.
train_stream_fn = dataset.TFDS('para_crawl/enfr',
                               keys=('en', 'fr'),
                               eval_holdout_size=0.01,  # 1% for eval
                               train=True)

# Get generator function for the eval set
eval_stream_fn = dataset.TFDS('para_crawl/enfr',
                              keys=('en', 'fr'),
                              eval_holdout_size=0.01,  # 1% for eval
                              train=False)

You can also work with your own dataset instead of loading one with TFDS. Opening the files as shown below creates that generator for you. Don't forget to write a similar function for the eval data.

```python
def train_stream_fn():
  # provide an infinite generator
  while True:
    # open the first language file (e.g. English sentences)
    with open('lang1.csv','r') as f1:
      # open the second language file (e.g. French sentences)
      with open('lang2.csv','r') as f2:
        # loop over the two files in parallel and yield each sentence pair.
        for l1, l2 in zip(f1,f2):
          yield (l1, l2)
```
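
The reminder about the eval data can be handled with one small factory that builds both generator functions. This is a minimal sketch rather than anything provided by Trax: `make_stream_fn`, its `repeat` flag, and the file paths are hypothetical names, and it assumes the two files are aligned line by line.

```python
def make_stream_fn(src_path, tgt_path, repeat=True):
    """Builds a generator function over two line-aligned text files.

    Hypothetical helper, not part of Trax.
    """
    def stream_fn():
        while True:
            # Read both files in parallel, one sentence pair per line.
            with open(src_path, 'r', encoding='utf-8') as src, \
                 open(tgt_path, 'r', encoding='utf-8') as tgt:
                for src_line, tgt_line in zip(src, tgt):
                    yield src_line.rstrip('\n'), tgt_line.rstrip('\n')
            # A single pass is enough for a finite (e.g. eval) stream.
            if not repeat:
                break
    return stream_fn
```

With this, `train_stream_fn = make_stream_fn('train.en', 'train.fr')` would loop over the training files indefinitely, while `eval_stream_fn = make_stream_fn('eval.en', 'eval.fr', repeat=False)` would make a single pass (the file names are placeholders).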

Notice that TFDS returns a generator *function*.

Let's print a sample pair from our train and eval data. Notice that the raw output is represented in bytes (denoted by the `b'` prefix); these will be converted to strings internally in the next steps.

In [None]:
train_stream = train_stream_fn()
print(colored('train data (en, fr) tuple:', 'red'), next(train_stream))
print()

eval_stream = eval_stream_fn()
print(colored('eval data (en, fr) tuple:', 'red'), next(eval_stream))

Now that we have imported our corpus, we will be preprocessing the sentences into a format that our model can accept. This will be composed of several steps:

## 1.2  Tokenization and Formatting


**Tokenizing the sentences using subword representations:** we want to represent each sentence as an array of integers instead of strings. For our application, we will use *subword* representations to tokenize our sentences. This is a common technique to avoid out-of-vocabulary words by allowing parts of words to be represented separately. For example, instead of having separate entries in your vocabulary for "fear", "fearless", "fearsome", "some", and "less", you can simply store "fear", "some", and "less" and let your tokenizer combine these subwords when needed. This makes the vocabulary more flexible, so you don't have to store uncommon words explicitly (e.g. *stylebender*, *nonce*, etc.). Tokenizing is done with `trax.data.Tokenize()` (imported in this notebook as `encoder.Tokenize`). The combined subword vocabulary for English, German and French (i.e. `endefr_32k.subword`) is provided by Trax. Feel free to open this file to see what the subwords look like.
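
To make the idea concrete, here is a toy greedy longest-match splitter. It is only an illustration: the vocabulary and the `split_into_subwords` helper are made up for this example, and the real Trax subword encoder uses its own vocabulary file and algorithm.

```python
def split_into_subwords(word, vocab):
    """Greedily splits `word` into the longest pieces found in `vocab`."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until a known piece matches.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


toy_vocab = {'fear', 'less', 'some'}  # hypothetical three-entry vocabulary
print(split_into_subwords('fearless', toy_vocab))  # ['fear', 'less']
print(split_into_subwords('fearsome', toy_vocab))  # ['fear', 'some']
```

Three vocabulary entries cover both compound words, which is the saving the paragraph above describes.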

In [None]:
# global variables that state the filename and directory of the vocabulary file
VOCAB_FILE = 'endefr_32k.subword'
VOCAB_DIR = 'gs://trax-ml/vocabs/'

# Tokenize the dataset.
tokenized_train_stream = encoder.Tokenize(vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)(train_stream)
tokenized_eval_stream = encoder.Tokenize(vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)(eval_stream)

**Append an end-of-sentence token to each sentence:** We will assign a token (i.e. in this case `1`) to mark the end of a sentence. This will be useful in inference/prediction so we'll know that the model has completed the translation.

In [None]:
# Append EOS at the end of each sentence.

# Integer assigned as end-of-sentence (EOS)
EOS = 1


# generator helper function to append EOS to each sentence
def append_eos(stream):
    for (inputs, targets) in stream:
        inputs_with_eos = list(inputs) + [EOS]
        targets_with_eos = list(targets) + [EOS]
        yield np.array(inputs_with_eos), np.array(targets_with_eos)


# append EOS to the train data
tokenized_train_stream = append_eos(tokenized_train_stream)

# append EOS to the eval data
tokenized_eval_stream = append_eos(tokenized_eval_stream)

**Filter long sentences:** We will place a limit on the number of tokens per sentence to ensure we won't run out of memory. This is done with the `trax.data.FilterByLength()` method and you can see its syntax below.
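
Conceptually, the filter is a generator that drops any example in which one of the selected fields is too long. The sketch below is a plain-Python stand-in written for illustration (the name `filter_by_length` and the inclusive `<=` comparison are assumptions, not Trax's source).

```python
def filter_by_length(stream, max_length, length_keys=(0, 1)):
    """Yields only examples whose selected fields have at most `max_length` tokens."""
    for example in stream:
        if all(len(example[key]) <= max_length for key in length_keys):
            yield example


pairs = [
    ([5, 6, 1], [7, 1]),           # both short: kept
    ([5, 6, 8, 9, 3, 1], [7, 1]),  # input has 6 tokens: dropped when max_length=4
]
print(list(filter_by_length(iter(pairs), max_length=4)))  # [([5, 6, 1], [7, 1])]
```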

In [None]:
# Filter out sentences that are too long, so we do not run out of memory.
# length_keys=[0, 1] means we filter both English and French sentences, so
# both must be no longer than 512 tokens for training / 1024 for eval.
filtered_train_stream = preprocessing.FilterByLength(
    max_length=512, length_keys=[0, 1])(tokenized_train_stream)
filtered_eval_stream = preprocessing.FilterByLength(
    max_length=1024, length_keys=[0, 1])(tokenized_eval_stream)

# print a sample input-target pair of tokenized sentences
train_input, train_target = next(filtered_train_stream)
print(colored(f'Single tokenized example input:', 'red'), train_input)
print(colored(f'Single tokenized example target:', 'red'), train_target)

## 1.3  tokenize & detokenize helper functions

- tokenize(): converts a text sentence to its corresponding token list (i.e. list of indices). Also converts words to subwords (parts of words).
- detokenize(): converts a token list to its corresponding sentence (i.e. string).

In [None]:
# Set up helper functions for tokenizing and detokenizing sentences
def tokenize(input_str, vocab_file=None, vocab_dir=None):
    """Encodes a string to an array of integers
    Args:
        input_str (str): human-readable string to encode
        vocab_file (str): filename of the vocabulary text file
        vocab_dir (str): path to the vocabulary file
    Returns:
        numpy.ndarray: tokenized version of the input string
    """
    # Set the encoding of the "end of sentence" as 1
    EOS = 1
    # Use the encoder's tokenize function. It takes streams and returns streams,
    # so we work around that by making a 1-element stream with `iter`.
    inputs = next(encoder.tokenize(iter([input_str]),
                                   vocab_file=vocab_file, vocab_dir=vocab_dir))
    # Mark the end of the sentence with EOS
    inputs = list(inputs) + [EOS]
    # Adding the batch dimension to the front of the shape
    batch_inputs = np.reshape(np.array(inputs), [1, -1])
    return batch_inputs


def detokenize(integers, vocab_file=None, vocab_dir=None):
    """Decodes an array of integers to a human-readable string
    Args:
        integers (numpy.ndarray): array of integers to decode
        vocab_file (str): filename of the vocabulary text file
        vocab_dir (str): path to the vocabulary file
    Returns:
        str: the decoded sentence.
    """
    # Remove the dimensions of size 1
    integers = list(np.squeeze(integers))
    # Set the encoding of the "end of sentence" as 1
    EOS = 1
    # Remove the EOS to decode only the original tokens
    if EOS in integers:
        integers = integers[:integers.index(EOS)]
    return encoder.detokenize(integers, vocab_file=vocab_file, vocab_dir=vocab_dir)

Let's see how we might use these functions:

In [None]:
# Detokenize an input-target pair of tokenized sentences
print(colored('Single detokenized example input:', 'red'),
      detokenize(train_input, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))
print(colored('Single detokenized example target:', 'red'),
      detokenize(train_target, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))
print()

# Tokenize and detokenize a word that is not explicitly saved in the vocabulary file.
# See how it combines the subwords 'hell' and 'o' to form the word 'hello'.
print(colored(f"tokenize('hello'): ", 'green'), tokenize('hello', vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))

## 1.4  Bucketing

Bucketing the tokenized sentences is an important technique used to speed up training in NLP.
Here is a
[nice article describing it in detail](https://medium.com/@rashmi.margani/how-to-speed-up-the-training-of-the-sequence-model-using-bucketing-techniques-9e302b0fd976)
but the gist is very simple. Our inputs have variable lengths, and we want to make them the same length when batching groups of sentences together. One way to do that is to pad each sentence to the length of the longest sentence in the dataset. This can waste a lot of computation, though. For example, if there are many short sentences with just two tokens, do we want to pad them when the longest sentence is composed of 100 tokens? Instead of padding with 0s to the maximum sentence length each time, we can group our tokenized sentences by length into buckets, as in this image (from the article above):

![alt text](https://miro.medium.com/max/700/1*hcGuja_d5Z_rFcgwe9dPow.png)

We batch sentences of similar length together (e.g. the blue sentences in the image above) and add only minimal padding to make them equal in length (usually up to the nearest power of two). This wastes less computation on padded positions.
In Trax, it is implemented in the [bucket_by_length](https://github.com/google/trax/blob/5fb8aa8c5cb86dabb2338938c745996d5d87d996/trax/supervised/inputs.py#L378) function.
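
The grouping logic can be sketched in a few lines of plain Python. This is a simplified illustration, not the Trax implementation: the function name is reused only for readability, examples are assumed to be lists of token ids, each batch is padded to its own longest sentence rather than to a power of two, and leftover partial batches are dropped.

```python
def bucket_by_length(stream, boundaries, batch_sizes, pad_id=0):
    """Groups (input, target) token lists of similar length into padded batches."""
    buckets = [[] for _ in batch_sizes]  # one more bucket than boundaries
    for inputs, targets in stream:
        length = max(len(inputs), len(targets))
        # Pick the first bucket whose boundary exceeds the length,
        # or the final overflow bucket if none does.
        index = next((k for k, bound in enumerate(boundaries) if length < bound),
                     len(boundaries))
        buckets[index].append((inputs, targets))
        if len(buckets[index]) == batch_sizes[index]:
            batch, buckets[index] = buckets[index], []
            width = max(max(len(src), len(tgt)) for src, tgt in batch)
            yield ([src + [pad_id] * (width - len(src)) for src, _ in batch],
                   [tgt + [pad_id] * (width - len(tgt)) for _, tgt in batch])


pairs = [([5, 1], [6, 7, 1]), ([8, 9, 1], [4, 1]), ([2, 1], [3, 1])]
for inputs, targets in bucket_by_length(iter(pairs), boundaries=[4, 16],
                                        batch_sizes=[2, 2, 2]):
    print(inputs, targets)
# prints [[5, 1, 0], [8, 9, 1]] [[6, 7, 1], [4, 1, 0]]
```

Only one padding token per row is needed here because the two sentences in the batch have nearly the same length.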

In [None]:
# Bucketing to create streams of batches.

# Buckets are defined in terms of boundaries and batch sizes.
# Batch_sizes[i] determines the batch size for items with length < boundaries[i]
# So below, we'll take a batch of 128 sentences of length < 8, 128 if length is
# between 8 and 16, and so on. 128 batch is also taken if length is over 256.
boundaries = [8, 16, 32, 64, 128, 256]
batch_sizes = [128, 128, 128, 128, 128, 128, 128]
# Notice that all batch sizes are 128: on TPUs we need the same batch size to run in parallel.
# You can use different batch sizes if you are running on GPU or CPU.

# Create the generators.
train_batch_stream = preprocessing.BucketByLength(
    boundaries, batch_sizes,
    length_keys=[0, 1]  # As before: count inputs and targets to length.
)(filtered_train_stream)

eval_batch_stream = preprocessing.BucketByLength(
    boundaries, batch_sizes,
    length_keys=[0, 1]  # As before: count inputs and targets to length.
)(filtered_eval_stream)

# Add masking for the padding (0s).
train_batch_stream = preprocessing.AddLossWeights(id_to_mask=0)(train_batch_stream)
eval_batch_stream = preprocessing.AddLossWeights(id_to_mask=0)(eval_batch_stream)

## 1.5 Exploring the data

We will now display some of our data. You will see that the functions defined above (i.e. `tokenize()` and `detokenize()`) are useful for inspecting it.

In [None]:
input_batch, target_batch, mask_batch = next(train_batch_stream)

# let's see the data type of each batch
print("input_batch data type: ", type(input_batch))
print("target_batch data type: ", type(target_batch))
print("mask_batch data type: ", type(mask_batch))

# let's see the shape of this particular batch (batch length, sentence length)
print("input_batch shape: ", input_batch.shape)
print("target_batch shape: ", target_batch.shape)
print("mask_batch shape: ", mask_batch.shape)

The `input_batch` and `target_batch` are Numpy arrays consisting of tokenized English sentences and French sentences respectively. These tokens will later be used to produce embedding vectors for each word in the sentence (so the embedding for a sentence will be a matrix).
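
The third element of each batch, `mask_batch`, holds the loss weights added by `AddLossWeights(id_to_mask=0)`: padded positions should not contribute to the loss. A minimal plain-Python sketch of that idea follows (the `loss_weights` helper is hypothetical and works on nested lists rather than NumPy arrays).

```python
def loss_weights(target_batch, id_to_mask=0):
    """Returns 1.0 for real tokens and 0.0 for positions equal to `id_to_mask`."""
    return [[0.0 if token == id_to_mask else 1.0 for token in row]
            for row in target_batch]


toy_targets = [[17, 4, 1, 0], [9, 1, 0, 0]]  # made-up token ids, 0 is padding
print(loss_weights(toy_targets))
# [[1.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
```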

We can now visually inspect some of the data. You can run the cell below several times to shuffle through the sentences.

In [None]:
# pick a random index less than the batch size.
index = random.randrange(len(input_batch))

# use the index to grab an entry from the input and target batch
print(colored('THIS IS THE ENGLISH SENTENCE: \n', 'red'),
      detokenize(input_batch[index], vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR), '\n')
print(colored('THIS IS THE TOKENIZED VERSION OF THE ENGLISH SENTENCE: \n ', 'red'), input_batch[index], '\n')
print(colored('THIS IS THE FRENCH TRANSLATION: \n', 'red'),
      detokenize(target_batch[index], vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR), '\n')
print(colored('THIS IS THE TOKENIZED VERSION OF THE FRENCH TRANSLATION: \n', 'red'), target_batch[index], '\n')

## Trax Data Pipeline (optional)

Those were the steps needed to prepare the data (steps 1.1 to 1.5). Alternatively, you can simply use the [Trax data pipeline](https://trax-ml.readthedocs.io/en/latest/notebooks/trax_intro.html#Data) `trax.data.Serial` in the next cell. **If you run this cell, you should skip steps 1.1 to 1.5.**

To work with your own dataset instead of loading one with TFDS, replace the TFDS call with `lambda _: train_stream_fn()`.
Every stage in `trax.data.Serial` is a generator function, and opening the files as shown below creates that generator for you.

```python
def train_stream_fn():
  # open the first language file (e.g. English sentences)
  with open('lang1.csv','r') as f1:
    # open the second language file (e.g. French sentences)
    with open('lang2.csv','r') as f2:
      # loop over the two files in parallel and yield each sentence pair.
      for l1, l2 in zip(f1,f2):
        yield (l1, l2)
```

and then add
```python
lambda _: train_stream_fn()
```
to `trax.data.Serial()` instead of
```python
trax.data.TFDS('para_crawl/enfr',
               data_dir='/content/drive/MyDrive/Colab Notebooks/data/',
               keys=('en', 'fr'),
               eval_holdout_size=0.01, # 1% for eval
               train=True)
```
for both the training and eval streams.
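
The reason the first stage is written as `lambda _: train_stream_fn()` becomes clearer with a simplified model of how such a pipeline composes its stages. The `serial` function below is a toy stand-in written for illustration, not Trax's `Serial`; `toy_source` and the sample token ids are made up.

```python
def serial(*fns):
    """Chains generator functions: each stage receives the previous generator."""
    def pipeline(generator=None):
        for fn in fns:
            generator = fn(generator)
        return generator
    return pipeline


def toy_source(_):
    # The first stage ignores its argument and produces the raw examples.
    yield from [([3, 4], [5]), ([6], [7, 8])]


def append_eos(stream, eos=1):
    for inputs, targets in stream:
        yield inputs + [eos], targets + [eos]


toy_pipeline = serial(toy_source, append_eos)
print(list(toy_pipeline()))  # [([3, 4, 1], [5, 1]), ([6, 1], [7, 8, 1])]
```

Every stage takes a generator and returns a generator, so a data source only has to accept (and ignore) one argument to fit in as the first stage.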

In [None]:
# MOUNT DRIVE
from google.colab import drive

drive.mount('/content/drive')

In [None]:
# if you run this cell you should skip (steps from 1.1 to 1.5).

# global variables that state the filename and directory of the vocabulary file
VOCAB_FILE = 'endefr_32k.subword'
VOCAB_DIR = 'gs://trax-ml/vocabs/'

EOS = 1


# generator helper function to append EOS to each sentence
def append_eos(stream):
    for (inputs, targets) in stream:
        inputs_with_eos = list(inputs) + [EOS]
        targets_with_eos = list(targets) + [EOS]
        yield np.array(inputs_with_eos), np.array(targets_with_eos)


train_batches_stream = preprocessing.Serial(
    dataset.TFDS('para_crawl/enfr',
                 data_dir='/content/drive/MyDrive/Colab Notebooks/data/',
                 keys=('en', 'fr'),
                 eval_holdout_size=0.01,  # 1% for eval
                 train=True),  # replace TFDS with lambda _: train_stream_fn() if you want to run with your own data
    encoder.Tokenize(vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR),
    lambda _: append_eos(_),
    preprocessing.Shuffle(),
    preprocessing.FilterByLength(max_length=512, length_keys=[0, 1]),
    preprocessing.BucketByLength(boundaries=[8, 16, 32, 64, 128, 256],
                                 batch_sizes=[128, 128, 128, 128, 128, 128, 128],
                                 length_keys=[0, 1]),
    preprocessing.AddLossWeights(id_to_mask=0)
)

eval_batches_stream = preprocessing.Serial(
    dataset.TFDS('para_crawl/enfr',
                 data_dir='/content/drive/MyDrive/Colab Notebooks/data/',
                 keys=('en', 'fr'),
                 eval_holdout_size=0.01,  # 1% for eval
                 train=False),
    encoder.Tokenize(vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR),
    lambda _: append_eos(_),
    preprocessing.Shuffle(),
    preprocessing.FilterByLength(max_length=1024, length_keys=[0, 1]),
    preprocessing.BucketByLength(boundaries=[8, 16, 32, 64, 128, 256],
                                 batch_sizes=[128, 128, 128, 128, 128, 128, 128],
                                 length_keys=[0, 1]),
    preprocessing.AddLossWeights(id_to_mask=0)
)

In [None]:
# Exploring the data
train_batch_stream = train_batches_stream()
eval_batch_stream = eval_batches_stream()
input_batch, target_batch, mask_batch = next(train_batch_stream)

# let's see the data type of batch
print("input_batch data type: ", type(input_batch))
print("target_batch data type: ", type(target_batch))
# let's see the shape of this particular batch (batch length, sentence length)
print("input_batch shape: ", input_batch.shape)
print("target_batch shape: ", target_batch.shape)

# pick a random index less than the batch size.
index = random.randrange(len(input_batch))
# use the index to grab an entry from the input and target batch
print(colored('ENGLISH SENTENCE: \n', 'red'),
      encoder.detokenize(input_batch[index], vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR), '\n')
print(colored('THE TOKENIZED VERSION OF THE ENGLISH SENTENCE: \n ', 'red'), input_batch[index], '\n')
print(colored('THE FRENCH TRANSLATION: \n', 'red'),
      encoder.detokenize(target_batch[index], vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR), '\n')
print(colored('THE TOKENIZED VERSION OF THE FRENCH TRANSLATION: \n', 'red'), target_batch[index], '\n')

# Part (2):  Model

Now that we’ve seen preprocessing, it’s time to move on to the modeling itself. Trax provides several predefined models, such as:
 - Seq2Seq with Attention
 - BERT
 - Transformer
 - Reformer

We will be using the Transformer in this notebook because Trax provides a pretrained Transformer NMT model that was trained on an English-to-German dataset. We will now train it on the English-to-French dataset, aiming for results close to those reported by the Google Brain team.

You can simply change `trax.models.Transformer` in the next cell to `trax.models.Reformer` to use the Reformer model.

```python
# you could check the available pretrained models and vocab files provided by trax by running:
!gsutil ls gs://trax-ml/
```

In [None]:
# Create a Transformer model.
model = models.Transformer(
    input_vocab_size=33600,
    d_model=512, d_ff=2048, dropout=0.1,
    n_heads=8, n_encoder_layers=6, n_decoder_layers=6,
    max_len=2048, mode='train')

# Pre-trained Transformer model config in gs://trax-ml/models/translation/ende_wmt32k.gin
# Initialize Transformer using pre-trained weights.
model.init_from_file('gs://trax-ml/models/translation/ende_wmt32k.pkl.gz',
                     weights_only=True)

# You could also initialize the model from an output checkpoint:
# simply change 'gs://trax-ml/models/translation/ende_wmt32k.pkl.gz' to 'output_dir/ + last_checkpoint'
# for example:
# model.init_from_file('/content/drive/MyDrive/Colab Notebooks/Transformer_FR_pretrained_336/model.pkl.gz',
#                      weights_only=True)

You could have a peek at the model layers.

In [None]:
model

# Part (3):  Training
We will now be training our model in this section. Doing supervised training in Trax is pretty straightforward (short example [here](https://trax-ml.readthedocs.io/en/latest/notebooks/trax_intro.html#Supervised-training)). We will be instantiating three classes for this: `TrainTask`, `EvalTask`, and `Loop`. Let's take a closer look at each of these in the sections below.

## 3.1  TrainTask

The [TrainTask](https://trax-ml.readthedocs.io/en/latest/trax.supervised.html#trax.supervised.training.TrainTask) class allows us to define the labeled data to use for training and the feedback mechanisms to compute the loss and update the weights.

In [None]:
train_task = training.TrainTask(
    # use the train batch stream as labeled data
    labeled_data=train_batch_stream,
    # use the cross entropy loss with LogSoftmax
    loss_layer=tl.CrossEntropyLossWithLogSoftmax(),
    # use the Adafactor optimizer with a learning rate of 0.001
    optimizer=optimizers.Adafactor(learning_rate=0.001, epsilon1=1e-30),
    # have 500 warmup steps
    lr_schedule=learning_schedule.multifactor(constant=1.0, warmup_steps=500),
    # have a checkpoint every 100 steps
    n_steps_per_checkpoint=100,
    # saving a checkpoint every 1000 steps on the output_dir
    n_steps_per_permanent_checkpoint=1000
)
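
To get a feel for what a warmup schedule does, the sketch below evaluates a warmup-then-decay multiplier at a few steps. It assumes the factor string `constant * linear_warmup * rsqrt_decay`, which is the default of `multifactor` in upstream Trax as far as I know; check the defaults in your Trax version before relying on these numbers. `warmup_rsqrt_multiplier` is a hypothetical helper, not a Trax function.

```python
import math


def warmup_rsqrt_multiplier(step, constant=1.0, warmup_steps=500):
    """Evaluates constant * linear_warmup * rsqrt_decay at `step` (assumed formula)."""
    linear_warmup = min(1.0, step / warmup_steps)            # ramps from 0 to 1
    rsqrt_decay = 1.0 / math.sqrt(max(step, warmup_steps))   # flat, then decays
    return constant * linear_warmup * rsqrt_decay


for step in (1, 250, 500, 2000):
    print(step, warmup_rsqrt_multiplier(step))
```

Under this assumption the value grows linearly for the first 500 steps, peaks at step 500, and then decays proportionally to the inverse square root of the step.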

## 3.2  EvalTask

The [EvalTask](https://trax-ml.readthedocs.io/en/latest/trax.supervised.html#trax.supervised.training.EvalTask) on the other hand allows us to see how the model is doing while training. For our application, we want it to report the cross entropy loss with LogSoftmax and accuracy.

In [None]:
eval_task = training.EvalTask(
    # use the eval batch stream as labeled data
    labeled_data=eval_batch_stream,
    # use the cross entropy loss with LogSoftmax and accuracy as metrics
    metrics=[tl.CrossEntropyLossWithLogSoftmax(), tl.WeightedCategoryAccuracy()],
    # n_eval_batches sets how many eval batches are averaged per evaluation.
    # It is left unset here, so the library default applies (a single batch in
    # upstream Trax; check your version). Uncomment to evaluate on more batches.
    # n_eval_batches=64
)
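
Both metrics are weighted by the mask, so padded positions are ignored. The following plain-Python sketch shows the arithmetic on a single toy sequence; the helper names and numbers are made up for illustration and this is not the Trax implementation.

```python
import math


def weighted_accuracy(predictions, targets, weights):
    """Fraction of correctly predicted tokens, counting only weighted positions."""
    correct = sum(w for p, t, w in zip(predictions, targets, weights) if p == t)
    return correct / sum(weights)


def weighted_cross_entropy(log_probs, targets, weights):
    """Mean negative log-probability of the target tokens, weighted by the mask.

    log_probs[i][k] is the log-probability of class k at position i.
    """
    total = sum(-row[t] * w for row, t, w in zip(log_probs, targets, weights))
    return total / sum(weights)


targets = [4, 7, 1, 0]               # last position is padding
predictions = [4, 2, 1, 5]           # the wrong guess on padding is ignored
weights = [1.0, 1.0, 1.0, 0.0]
print(weighted_accuracy(predictions, targets, weights))  # 2 of 3 real tokens correct

toy_log_probs = [[math.log(0.5), math.log(0.5)],
                 [math.log(0.25), math.log(0.75)]]
print(weighted_cross_entropy(toy_log_probs, [0, 1], [1.0, 1.0]))
```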

## 3.3  Loop

The [Loop](https://trax-ml.readthedocs.io/en/latest/trax.supervised.html#trax.supervised.training.Loop) class defines the model we will train as well as the train and eval tasks to execute. Its `run()` method allows us to execute the training for a specified number of steps.

In [None]:
# define the output directory
output_dir = '~/Transformer_FR_pretrained_336'

# # remove old model if it exists. restarts training.
# !rm -rf output_dir
shutil.rmtree(os.path.expanduser(output_dir), ignore_errors=True)

# define the training loop
training_loop = training.Loop(model,
                              train_task,
                              eval_tasks=[eval_task],
                              output_dir=output_dir)

In [None]:
# Start Training!
training_loop.run(5_000)

## More Steps (optional)

Because we specified `n_steps_per_permanent_checkpoint` in `training.TrainTask`, a checkpoint is saved in `output_dir` every time that number of steps has elapsed. So, if your runtime disconnects or you want to train the model for more steps to improve the results, you can load the last saved checkpoint with `training_loop.load_checkpoint`.

This is optional. Alternatively, you could use `model.init_from_file` as in the Part (2): Model cells, changing 'gs://trax-ml/models/translation/ende_wmt32k.pkl.gz' to 'output_dir/ + last_checkpoint'.

In [None]:
output_dir = '/content/drive/MyDrive/Colab Notebooks/Transformer_FR_pretrained_336/'

# This loads a checkpoint:
training_loop.load_checkpoint(directory=output_dir, filename="model.pkl.gz")
# Continue training:
training_loop.run(5000)

## Tensorboard (optional)
The Trax training loop optimizes training and creates TensorBoard logs and model checkpoints for you. You can visualize the logs using the following:


In [None]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

In [None]:
%tensorboard --logdir output_dir

If TensorBoard does not load properly and, for example, your `output_dir` is:

```python
output_dir = '/content/drive/MyDrive/Colab Notebooks/Transformer_FR_pretrained_336'
```
add:
```
%cd '/content/drive/MyDrive/Colab Notebooks/'
```
before:
```
%tensorboard --logdir output_dir
```
and change it to:
```
%tensorboard --logdir Transformer_FR_pretrained_336
```

# Part (4):  Testing

We will now be using the model you just trained to translate English sentences to French. We will implement this with two functions: The first allows you to identify the next symbol (i.e. output token). The second one takes care of combining the entire translated string.


## 4.1  Decoding

In [None]:
# Setup helper functions for tokenizing and detokenizing sentences
def tokenize(input_str, vocab_file=None, vocab_dir=None):
    """Encodes a string to an array of integers
    Args:
        input_str (str): human-readable string to encode
        vocab_file (str): filename of the vocabulary text file
        vocab_dir (str): path to the vocabulary file
    Returns:
        numpy.ndarray: tokenized version of the input string
    """
    # Set the encoding of the "end of sentence" as 1
    EOS = 1
    # Use the encoder's tokenize function. It takes streams and returns streams,
    # so we work around that by making a 1-element stream with `iter`.
    inputs = next(encoder.tokenize(iter([input_str]),
                                   vocab_file=vocab_file, vocab_dir=vocab_dir))
    # Mark the end of the sentence with EOS
    inputs = list(inputs) + [EOS]
    # Adding the batch dimension to the front of the shape
    batch_inputs = np.reshape(np.array(inputs), [1, -1])
    return batch_inputs


def detokenize(integers, vocab_file=None, vocab_dir=None):
    """Decodes an array of integers to a human readable string
    Args:
        integers (numpy.ndarray): array of integers to decode
        vocab_file (str): filename of the vocabulary text file
        vocab_dir (str): path to the vocabulary file
    Returns:
        str: the decoded sentence.
    """
    # Remove the dimensions of size 1
    integers = list(np.squeeze(integers))
    # Set the encoding of the "end of sentence" as 1
    EOS = 1
    # Remove the EOS to decode only the original tokens
    if EOS in integers:
        integers = integers[:integers.index(EOS)]
    return encoder.detokenize(integers, vocab_file=vocab_file, vocab_dir=vocab_dir)

There are several ways to get the next token when translating a sentence. For instance, we can just get the most probable token at each step (i.e. greedy decoding) or get a sample from a distribution. We can generalize the implementation of these two approaches by using the `tl.logsoftmax_sample()` method.
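
One way a single temperature parameter can cover both approaches is Gumbel-max sampling: add temperature-scaled Gumbel noise to the log-probabilities and take the argmax. The sketch below is a plain-Python illustration of that idea under this assumption; `sample_token` is a hypothetical helper and should not be read as the source of `tl.logsoftmax_sample()`.

```python
import math
import random


def sample_token(log_probs, temperature=1.0, rng=random):
    """Returns the argmax of log_probs + temperature * Gumbel noise."""
    best_index, best_score = 0, -math.inf
    for index, log_prob in enumerate(log_probs):
        # Uniform noise kept away from 0 and 1 so both logarithms are finite.
        u = rng.uniform(1e-6, 1.0 - 1e-6)
        gumbel = -math.log(-math.log(u))
        score = log_prob + temperature * gumbel
        if score > best_score:
            best_index, best_score = index, score
    return best_index


toy_log_probs = [math.log(0.1), math.log(0.7), math.log(0.2)]
print(sample_token(toy_log_probs, temperature=0.0))  # 1: noise is zeroed, so this is argmax
print(sample_token(toy_log_probs, temperature=1.0))  # varies from run to run
```

With `temperature=0.0` the noise term vanishes and the result is always the most probable token (greedy decoding); with `temperature=1.0` tokens are drawn in proportion to their probabilities.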

In [None]:
def next_symbol(model, input_tokens, cur_output_tokens, temperature):
    """Returns the index of the next token.
    Args:
        model: the NMT model.
        input_tokens (np.ndarray 1 x n_tokens): tokenized representation of the input sentence
        cur_output_tokens (list): tokenized representation of previously translated words
        temperature (float): parameter for sampling ranging from 0.0 to 1.0.
            0.0: same as argmax, always pick the most probable token
            1.0: sampling from the distribution (can sometimes say random things)
    Returns:
        int: index of the next token in the translated sentence
        float: log probability of the next symbol
    """
    # set the length of the current output tokens
    token_length = len(cur_output_tokens)
    # calculate next power of 2 for padding length
    padded_length = np.power(2, int(np.ceil(np.log2(token_length + 1))))
    # pad cur_output_tokens up to the padded_length
    padded = cur_output_tokens + [0] * (padded_length - token_length)
    # model expects the output to have an axis for the batch size in front so
    # convert `padded` list to a numpy array with shape (x, <padded_length>) where the
    # x position is the batch axis.
    padded_with_batch = np.expand_dims(padded, axis=0)
    # the model prediction.
    output, _ = model((input_tokens, padded_with_batch))
    # get log probabilities from the last token output
    log_probs = output[0, token_length, :]
    # get the next symbol by getting a logsoftmax sample
    symbol = int(tl.logsoftmax_sample(log_probs, temperature))
    return symbol, float(log_probs[symbol])
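
The padding arithmetic in `next_symbol()` is easy to check in isolation. The small helper below (a hypothetical name, using plain lists and the `math` module instead of NumPy) applies the same formula, `2 ** ceil(log2(token_length + 1))`.

```python
import math


def pad_to_power_of_two(tokens, pad_id=0):
    """Pads `tokens` with `pad_id` up to the next power of two above its length."""
    padded_length = 2 ** math.ceil(math.log2(len(tokens) + 1))
    return tokens + [pad_id] * (padded_length - len(tokens))


for tokens in ([], [12], [12, 7, 9], [12, 7, 9, 3]):
    print(len(tokens), '->', len(pad_to_power_of_two(tokens)))
# prints 0 -> 1, 1 -> 2, 3 -> 4, 4 -> 8 (one pair per line)
```

Because of the `+ 1`, the padded list always has at least one slot after the last real token, which is why indexing the model output at position `token_length` is valid.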

The `sampling_decode()` function will call the `next_symbol()` function above several times until the next output is the end-of-sentence token (i.e. `EOS`). It takes in an input string and returns the translated version of that string.


In [None]:
def sampling_decode(input_sentence, model=None, temperature=0.0, vocab_file=None, vocab_dir=None):
    """Returns the translated sentence.
    Args:
        input_sentence (str): sentence to translate.
        model: the NMT model.
        temperature (float): parameter for sampling ranging from 0.0 to 1.0.
            0.0: same as argmax, always pick the most probable token
            1.0: sampling from the distribution (can sometimes say random things)
        vocab_file (str): filename of the vocabulary
        vocab_dir (str): path to the vocabulary file
    Returns:
        tuple: (list, float, str)
            list of int: tokenized version of the translated sentence
            float: log probability of the last generated token
            str: the translated sentence
    """
    # encode the input sentence
    input_tokens = tokenize(input_sentence, vocab_file=vocab_file, vocab_dir=vocab_dir)
    # initialize the list of output tokens
    cur_output_tokens = []
    # initialize an integer that represents the current output index
    cur_output = 0
    # Set the encoding of the "end of sentence" as 1
    EOS = 1
    # check that the current output is not the end of sentence token
    while cur_output != EOS:
        # update the current output token by getting the index of the next word
        cur_output, log_prob = next_symbol(model, input_tokens, cur_output_tokens, temperature)
        # append the current output token to the list of output tokens
        cur_output_tokens.append(cur_output)
    # detokenize the output tokens
    sentence = detokenize(cur_output_tokens, vocab_file=vocab_file, vocab_dir=vocab_dir)
    return cur_output_tokens, log_prob, sentence

In [None]:
# Test the function above. Try varying the temperature setting with values from 0 to 1.
# Run it several times with each setting and see how often the output changes.
sampling_decode("Hello.", model, temperature=0.0, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)

Our implementation of `sampling_decode()` above sets the default temperature to `0`. As you may have noticed in the `logsoftmax_sample()` method, this setting reduces sampling to greedy decoding: at each step, the algorithm takes the argmax of the model's output array and returns that index, i.e. the most probable token. See the testing function and sample inputs below. The output stays the same each time you run it.
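
As a rough illustration of why a temperature of `0` gives greedy decoding, here is a NumPy sketch of Gumbel-max sampling. It is written for this tutorial as an assumption about the general technique, not a copy of the Trax `logsoftmax_sample()` source. The names `gumbel_sample` and `toy_log_probs` are hypothetical.

```python
import numpy as np


def gumbel_sample(log_probs, temperature=1.0, rng=None):
    """Hypothetical helper: sample an index by adding scaled Gumbel noise."""
    rng = np.random.default_rng() if rng is None else rng
    # draw uniform noise, kept away from 0 and 1 so the logs stay finite
    u = rng.uniform(low=1e-6, high=1.0 - 1e-6, size=log_probs.shape)
    # turn the uniform noise into Gumbel noise
    g = -np.log(-np.log(u))
    # with temperature 0 the noise term vanishes and this is a plain argmax
    return int(np.argmax(log_probs + g * temperature, axis=-1))


# toy distribution over 3 tokens; token 1 is the most probable
toy_log_probs = np.log(np.array([0.1, 0.6, 0.3]))

# temperature 0.0: every draw returns the most probable token
greedy_picks = {gumbel_sample(toy_log_probs, temperature=0.0) for _ in range(20)}
print(greedy_picks)

# temperature 1.0: draws follow the distribution, so other tokens can appear
sampled_picks = [gumbel_sample(toy_log_probs, temperature=1.0) for _ in range(20)]
print(sampled_picks)
```

With a positive temperature the noise can push a less probable token to the top, which is why repeated runs may produce different translations.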

In [None]:
def greedy_decode_test(sentence, model=None, vocab_file=None, vocab_dir=None):
    """Prints the input and output of our NMT model using greedy decode
    Args:
        sentence (str): a custom string.
        model: the NMT model.
        vocab_file (str): filename of the vocabulary
        vocab_dir (str): path to the vocabulary file
    Returns:
        str: the translated sentence
    """
    _, _, translated_sentence = sampling_decode(sentence, model, vocab_file=vocab_file, vocab_dir=vocab_dir)
    print("English: ", sentence)
    print("French: ", translated_sentence)
    return translated_sentence

In [None]:
# put a custom string here
your_sentence = 'I love languages.'
greedy_decode_test(your_sentence, model, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR);

In [None]:
greedy_decode_test('You are almost done with the assignment!', model, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR);

## 4.2 Minimum Bayes Risk Decoding

Getting the most probable token at each step may not necessarily produce the best results. Another approach is to do Minimum Bayes Risk Decoding or MBR. The general steps to implement this are:

1. take several random samples
2. score each sample against all other samples
3. select the one with the highest score
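
One way to write the selection rule in step 3, as a sketch: with $n$ samples $E_1, \dots, E_n$ and a similarity function $\mathrm{sim}$ (this notation is assumed here for illustration and does not come from the notebook), each sample is scored by its average similarity to the other samples, and the highest-scoring sample is returned:

$$E^{*} = \underset{E_i}{\arg\max}\ \frac{1}{n-1} \sum_{j \neq i} \mathrm{sim}(E_i, E_j)$$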

<a name='4.2.1'></a>
### 4.2.1 Generating samples

First, let's build a function to generate several samples. You can use the `sampling_decode()` function you developed earlier to do this easily. We want to record the token list and log probability for each sample as these will be needed in the next step.

In [None]:
def generate_samples(sentence, n_samples, model=None, temperature=0.6, vocab_file=None, vocab_dir=None):
    """Generates samples using sampling_decode()
    Args:
        sentence (str): sentence to translate.
        n_samples (int): number of samples to generate
        model: the NMT model.
        temperature (float): parameter for sampling ranging from 0.0 to 1.0.
            0.0: same as argmax, always pick the most probable token
            1.0: sampling from the distribution (can sometimes say random things)
        vocab_file (str): filename of the vocabulary
        vocab_dir (str): path to the vocabulary file
    Returns:
        tuple: (list, list)
            list of lists: token list per sample
            list of floats: log probability per sample
    """
    # define lists to contain samples and probabilities
    samples, log_probs = [], []
    # run a for loop to generate n samples
    for _ in range(n_samples):
        # get a sample using the sampling_decode() function
        sample, logp, _ = sampling_decode(sentence, model, temperature, vocab_file=vocab_file, vocab_dir=vocab_dir)
        # append the token list to the samples list
        samples.append(sample)
        # append the log probability to the log_probs list
        log_probs.append(logp)
    return samples, log_probs

In [None]:
# generate 4 samples with the default temperature (0.6)
generate_samples('I love languages.', 4, model, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)

In [None]:
detokenize([769, 31, 31720, 21, 15267, 3, 1], VOCAB_FILE, VOCAB_DIR)

### 4.2.2 Comparing overlaps

Let us now build functions to compare one sample against another. Several metrics are available, and you can experiment with any of them. We will calculate scores for unigram overlaps. One of the simpler metrics is the [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index), which is the size of the intersection of two sets divided by the size of their union.
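
To make the intersection-over-union idea concrete, here is a small worked example on made-up token lists. The token IDs and variable names are hypothetical and chosen only for illustration.

```python
# two toy token lists (IDs are made up for illustration)
candidate = [1, 2, 3]
reference = [1, 2, 3, 4]

# tokens found in both lists: {1, 2, 3}
shared = set(candidate) & set(reference)
# tokens found in either list: {1, 2, 3, 4}
combined = set(candidate) | set(reference)

toy_jaccard = len(shared) / len(combined)
print(toy_jaccard)  # 0.75

# converting to sets discards repeated tokens, so duplicates do not change the score
dup_jaccard = len(set([1, 1, 2]) & set([1, 2])) / len(set([1, 1, 2]) | set([1, 2]))
print(dup_jaccard)  # 1.0
```

Because repeated tokens are ignored, Jaccard similarity cannot tell a translation that repeats a word apart from one that uses it once.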

In [None]:
def jaccard_similarity(candidate, reference):
    """Returns the Jaccard similarity between two token lists
    Args:
        candidate (list of int): tokenized version of the candidate translation
        reference (list of int): tokenized version of the reference translation
    Returns:
        float: overlap between the two token lists
    """
    # convert the lists to a set to get the unique tokens
    can_unigram_set, ref_unigram_set = set(candidate), set(reference)
    # get the set of tokens common to both candidate and reference
    joint_elems = can_unigram_set.intersection(ref_unigram_set)
    # get the set of all tokens found in either candidate or reference
    all_elems = can_unigram_set.union(ref_unigram_set)
    # divide the number of joint elements by the number of all elements
    overlap = len(joint_elems) / len(all_elems)
    return overlap

Another widely used overlap metric is the ROUGE score, which was originally developed for summarization and is also applied to machine translation. For unigrams, this is called ROUGE-1, and it yields both a precision and a recall when comparing two samples. The final score is the F1-score of the two:

$$\text{score} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
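
A short worked example of the arithmetic, on made-up token lists, may help. The variable names and token IDs are hypothetical, and the `Counter` intersection used here is just one compact way to count clipped unigram matches.

```python
from collections import Counter

# toy token lists (IDs are made up for illustration)
system_tokens = [1, 2, 3, 3]
reference_tokens = [1, 2, 3, 4, 5]

# Counter & Counter keeps the minimum count of each shared token,
# so the second 3 in the system output is not counted as a match
matches = sum((Counter(system_tokens) & Counter(reference_tokens)).values())

precision = matches / len(system_tokens)    # 3 matches / 4 system tokens
recall = matches / len(reference_tokens)    # 3 matches / 5 reference tokens
f1 = 2 * (precision * recall) / (precision + recall)

print(matches, precision, recall)
print(round(f1, 3))  # approximately 0.667
```

Unlike Jaccard similarity, this count-based overlap is sensitive to repeated tokens.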


In [None]:
# for making a frequency table easily
from collections import Counter


def rouge1_similarity(system, reference):
    """Returns the ROUGE-1 score between two token lists
    Args:
        system (list of int): tokenized version of the system translation
        reference (list of int): tokenized version of the reference translation
    Returns:
        float: ROUGE-1 F1-score between the two token lists
    """
    # make a frequency table of the system tokens
    sys_counter = Counter(system)
    # make a frequency table of the reference tokens
    ref_counter = Counter(reference)
    # initialize overlap to 0
    overlap = 0
    # run a for loop over the sys_counter object
    for token in sys_counter:
        # lookup the value of the token in the sys_counter dictionary
        token_count_sys = sys_counter.get(token, 0)
        # lookup the value of the token in the ref_counter dictionary
        token_count_ref = ref_counter.get(token, 0)
        # update the overlap by getting the smaller number between the two token counts above
        overlap += min(token_count_sys, token_count_ref)
    # get the precision (i.e. number of overlapping tokens / number of system tokens)
    precision = overlap / sum(sys_counter.values())
    # get the recall (i.e. number of overlapping tokens / number of reference tokens)
    recall = overlap / sum(ref_counter.values())
    if precision + recall != 0:
        # compute the f1-score
        rouge1_score = 2 * ((precision * recall) / (precision + recall))
    else:
        rouge1_score = 0
    return rouge1_score

### 4.2.3 Overall score

We will now build a function to generate the overall score for a particular sample. As mentioned earlier, we need to compare each sample with all other samples. For instance, if we generated 30 sentences, we will need to compare sentence 1 to sentences 2 to 30. Then, we compare sentence 2 to sentences 1 and 3 to 30, and so forth. At each step, we get the average score of all comparisons to get the overall score for a particular sample. To illustrate, these will be the steps to generate the scores of a 4-sample list.

1. Get similarity score between sample 1 and sample 2
2. Get similarity score between sample 1 and sample 3
3. Get similarity score between sample 1 and sample 4
4. Get average score of the first 3 steps. This will be the overall score of sample 1.
5. Iterate and repeat until samples 1 to 4 have overall scores.

We will be storing the results in a dictionary for easy lookups.
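
The steps above can be sketched on a tiny, made-up example with three samples. The helper `toy_jaccard` and the token lists are hypothetical and used only to show the arithmetic.

```python
def toy_jaccard(a, b):
    """Hypothetical helper: intersection over union of two token lists."""
    return len(set(a) & set(b)) / len(set(a) | set(b))


# three toy samples (token IDs are made up for illustration)
toy_samples = [[1, 2, 3, 4], [1, 2, 3, 4, 5], [1, 2, 6]]

toy_scores = {}
for i, candidate in enumerate(toy_samples):
    # compare the candidate against every other sample, never against itself
    others = [s for j, s in enumerate(toy_samples) if j != i]
    # the overall score is the mean of those pairwise similarities
    toy_scores[i] = sum(toy_jaccard(candidate, s) for s in others) / len(others)

# the sample that agrees most with the rest gets the highest score
best_index = max(toy_scores, key=toy_scores.get)
print(toy_scores)
print(best_index)
```

Here sample 0 scores about 0.6, sample 1 about 0.567, and sample 2 about 0.367, so sample 0 is selected.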

In [None]:
def average_overlap(similarity_fn, samples, *ignore_params):
    """Returns the mean overlap of each candidate sentence with all other samples
    Args:
        similarity_fn (function): similarity function used to compute the overlap
        samples (list of lists): tokenized version of the translated sentences
        *ignore_params: additional parameters will be ignored
    Returns:
        dict: scores of each sample
            key: index of the sample
            value: score of the sample
    """
    # initialize dictionary
    scores = {}
    # run a for loop for each sample
    for index_candidate, candidate in enumerate(samples):
        # initialize overlap to 0.0
        overlap = 0.0
        # run a for loop for each sample
        for index_sample, sample in enumerate(samples):
            # skip if the candidate index is the same as the sample index
            if index_candidate == index_sample:
                continue
            # get the overlap between candidate and sample using the similarity function
            sample_overlap = similarity_fn(candidate, sample)
            # add the sample overlap to the total overlap
            overlap += sample_overlap
        # get the score for the candidate by averaging over the other samples
        score = overlap / (len(samples) - 1)
        # save the score in the dictionary. use index as the key.
        scores[index_candidate] = score
    return scores

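It is also common to use a weighted mean instead of the plain arithmetic mean to calculate the overall score. Here, each comparison is weighted by the probability of the sample being compared against, obtained by exponentiating its log probability.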
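
As a small sketch of the weighted version, the example below scores three made-up samples, weighting each comparison by the probability of the other sample. The helper `toy_jaccard`, the token lists, and the probabilities are all hypothetical values chosen for illustration.

```python
import math


def toy_jaccard(a, b):
    """Hypothetical helper: intersection over union of two token lists."""
    return len(set(a) & set(b)) / len(set(a) | set(b))


# toy samples and made-up log probabilities (probabilities 0.5, 0.2, 0.3)
toy_samples = [[1, 2, 3, 4], [1, 2, 3, 4, 5], [1, 2, 6]]
toy_log_probs = [math.log(0.5), math.log(0.2), math.log(0.3)]

weighted_scores = {}
for i, candidate in enumerate(toy_samples):
    overlap, weight_sum = 0.0, 0.0
    for j, (sample, logp) in enumerate(zip(toy_samples, toy_log_probs)):
        if i == j:
            continue  # never compare a sample with itself
        p = math.exp(logp)  # convert the log probability to linear scale
        weight_sum += p
        overlap += p * toy_jaccard(candidate, sample)
    # weighted mean: more probable samples count for more
    weighted_scores[i] = overlap / weight_sum

best_weighted = max(weighted_scores, key=weighted_scores.get)
print(weighted_scores)
print(best_weighted)
```

In this toy case, sample 1 scores about 0.625 and wins because it overlaps strongly with the most probable sample, whereas an unweighted mean over the same lists would favor sample 0.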

In [None]:
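# for converting log probabilities to linear scale
import numpy as np


def weighted_avg_overlap(similarity_fn, samples, log_probs):
    """Returns the weighted mean of each candidate sentence in the samples
    Args:
        similarity_fn (function): similarity function used to compute the overlap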
        samples (list of lists): tokenized version of the translated sentences
        log_probs (list of float): log probability of the translated sentences
    Returns:
        dict: scores of each sample
            key: index of the sample
            value: score of the sample
    """
    # initialize dictionary
    scores = {}
    # run a for loop for each sample
    for index_candidate, candidate in enumerate(samples):
        # initialize overlap and weighted sum
        overlap, weight_sum = 0.0, 0.0
        # run a for loop for each sample
        for index_sample, (sample, logp) in enumerate(zip(samples, log_probs)):
            # skip if the candidate index is the same as the sample index
            if index_candidate == index_sample:
                continue
            # convert log probability to linear scale
            sample_p = float(np.exp(logp))
            # update the weighted sum
            weight_sum += sample_p
            # get the unigram overlap between candidate and sample
            sample_overlap = similarity_fn(candidate, sample)
            # update the overlap
            overlap += sample_p * sample_overlap
        # get the score for the candidate
        score = overlap / weight_sum
        # save the score in the dictionary. use index as the key.
        scores[index_candidate] = score
    return scores

### 4.2.4 Putting it all together

We will now put everything together and develop the `mbr_decode()` function.

In [None]:
def mbr_decode(sentence, n_samples=4, score_fn=weighted_avg_overlap, similarity_fn=rouge1_similarity, model=model,
               temperature=0.6, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR):
    """Returns the translated sentence using Minimum Bayes Risk decoding
    Args:
        sentence (str): sentence to translate.
        n_samples (int): number of samples to generate
        score_fn (function): function that generates the score for each sample
        similarity_fn (function): function used to compute the overlap between a
            pair of samples
        model: the NMT model.
        temperature (float): parameter for sampling ranging from 0.0 to 1.0.
            0.0: same as argmax, always pick the most probable token
            1.0: sampling from the distribution (can sometimes say random things)
        vocab_file (str): filename of the vocabulary
        vocab_dir (str): path to the vocabulary file
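    Returns:
        tuple: (str, int, dict)
            str: the translated sentence
            int: index of the sample with the highest score
            dict: scores of all samples, keyed by sample index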
    """
    # generate samples
    samples, log_probs = generate_samples(sentence, n_samples,
                                          model, temperature,
                                          vocab_file, vocab_dir)
    # use the scoring function to get a dictionary of scores
    scores = score_fn(similarity_fn, samples, log_probs)
    # find the key with the highest score
    max_index = max(scores, key=scores.get)
    # detokenize the token list associated with the max_index
    translated_sentence = detokenize(samples[max_index], vocab_file, vocab_dir)
    return (translated_sentence, max_index, scores)

In [None]:
# put a custom string here
your_sentence = 'She speaks English, French and German.'

In [None]:
mbr_decode(your_sentence)

In [None]:
mbr_decode('You have completed the tutorial.')[0]

# **Resources**

- [Natural Language Processing Specialization](https://www.coursera.org/specializations/natural-language-processing)

- [Trax documentation](https://trax-ml.readthedocs.io/en/latest/index.html)

- [Trax community](https://gitter.im/trax-ml/community)