# Neural Machine Translation

### Credits
This notebook originates from a course from [Kyunghyun Cho](https://kyunghyuncho.me/). It can be found [here](https://github.com/nyu-dl/AMMI-2019-NLP-Part2) and the corresponding lecture [here](https://github.com/nyu-dl/NLP_DL_Lecture_Note/blob/master/lecture_note.pdf).

It was modified by [Alexandre Bérard](https://europe.naverlabs.com/people_user/alexandre-berard/) for the first [Advanced Language Processing School](http://alps.imag.fr/).

### Set up Google Translate API for Comparison

https://github.com/ssut/py-googletrans

In [None]:
from googletrans import Translator
translator = Translator()

### Python imports

In [None]:
path_to_utils = 'pyfiles'
import os
import sys
sys.path.append(path_to_utils)
import nmt_dataset
import nnet_models
import numpy as np
import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.optim.lr_scheduler import ReduceLROnPlateau
from functools import partial
import time
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import copy
from subword_nmt.apply_bpe import BPE
%matplotlib inline

## The Dataset

We will work with an English-to-French dataset from https://www.manythings.org/anki/.

The data is downloaded by running the `download-data.sh` script. You can also modify this script to download data for other language pairs.

In [None]:
data_dir = 'data'
source_lang, target_lang = 'en', 'fr'
model_dir = 'models/{}-{}'.format(source_lang, target_lang)

In [None]:
! head -5 data/train.en-fr.en

## Load and preprocess the data

1. Load the BPE model
2. Load the parallel corpora for this language pair (train, valid and test). `load_data` loads a corpus and tokenizes it with the BPE model via the `preprocess` function.
3. Create (or load) dictionaries that map BPE tokens to token IDs (`nmt_dataset.load_or_create_dictionary` function)
4. Binarize the data: map source and target text sequences to sequences of IDs, and sort the training set by length (`nmt_dataset.binarize` function)
5. Create batches (`nmt_dataset.BatchIterator` class): group multiple sequence pairs of similar length together, pad them to the maximum length and create numpy arrays that can be used to train our models

In [None]:
def reset_seed(seed=1234):
    np.random.seed(seed)
    torch.manual_seed(seed)

#### 1. Load the BPE model (multilingual BPE model, works with French, German and English)

In [None]:
bpe_path = os.path.join(data_dir, 'bpecodes.de-en-fr')

with open(bpe_path) as bpe_codes:
    bpe_model = BPE(bpe_codes)

def preprocess(line, is_source=True, source_lang=None, target_lang=None):
    return bpe_model.segment(line.lower())

def postprocess(line):
    return line.replace('@@ ', '')

def load_data(source_lang, target_lang, split='train', max_size=None):
    # max_size: max number of sentence pairs in the training corpus (None = all)
    path = os.path.join(data_dir, '{}.{}-{}'.format(split, *sorted([source_lang, target_lang])))
    return nmt_dataset.load_dataset(path, source_lang, target_lang, preprocess=preprocess, max_size=max_size)   # set max_size to 10000 for fast debugging
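To make the `@@ ` convention concrete, here is a small sketch of what a BPE segmenter produces and how `postprocess` undoes it. `toy_segment` and its `TOY_SPLITS` table are hypothetical stand-ins for the learned BPE merges; they are not how `subword_nmt` works internally.

```python
# Hypothetical split table standing in for learned BPE merges (assumption).
TOY_SPLITS = {
    'older': ['old', 'er'],
    'translation': ['trans', 'lation'],
}

def toy_segment(line):
    """Split each word into subwords; every non-final piece gets an '@@' suffix."""
    out = []
    for word in line.lower().split():
        pieces = TOY_SPLITS.get(word, [word])
        out.extend(piece + '@@' for piece in pieces[:-1])
        out.append(pieces[-1])
    return ' '.join(out)

def toy_postprocess(line):
    """Same rule as `postprocess`: drop the '@@ ' continuation markers."""
    return line.replace('@@ ', '')

segmented = toy_segment('She is older')
restored = toy_postprocess(segmented)
print(segmented)  # she is old@@ er
print(restored)   # she is older
```

Because the marker is attached to every non-final piece, removing `'@@ '` is enough to glue the subwords back into words.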

#### 2. Load and preprocess the parallel corpora (these are pandas DataFrames)

In [None]:
train_data = load_data(source_lang, target_lang, 'train', max_size=None)   # set max_size to 10000 for fast debugging
valid_data = load_data(source_lang, target_lang, 'valid')
test_data = load_data(source_lang, target_lang, 'test')
print(train_data.iloc[:5])

#### 3. Load or create the dictionaries

In [None]:
source_dict_path = os.path.join(model_dir, 'dict.{}.txt'.format(source_lang))
target_dict_path = os.path.join(model_dir, 'dict.{}.txt'.format(target_lang))

source_dict = nmt_dataset.load_or_create_dictionary(
    source_dict_path,
    train_data['source_tokenized'],
    minimum_count=10,
    reset=False    # set reset to True if you're changing the data or the preprocessing
)
print(source_dict.words[:100])

target_dict = nmt_dataset.load_or_create_dictionary(
    target_dict_path,
    train_data['target_tokenized'],
    minimum_count=10,
    reset=False
)
print(target_dict.words[:100])

In [None]:
print('source vocab size:', len(source_dict))
print('target vocab size:', len(target_dict))

#### 4. Use the dictionaries to map tokens to indices. The training set is also sorted by length for more efficient batching.

In [None]:
nmt_dataset.binarize(train_data, source_dict, target_dict, sort=True)
nmt_dataset.binarize(valid_data, source_dict, target_dict, sort=False)
nmt_dataset.binarize(test_data, source_dict, target_dict, sort=False)
print(train_data.iloc[:5])
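Steps 3 and 4 can be pictured with the following minimal sketch. It assumes a dictionary is a frequency-filtered word list with a few special symbols, and that binarizing means mapping tokens to IDs with an unknown-word fallback. The names `build_vocab`, `to_ids` and the special symbols are illustrative; the actual behavior of `nmt_dataset` may differ.

```python
from collections import Counter

# Special symbols are assumptions; the real ones are defined in `nmt_dataset`.
SPECIALS = ['<pad>', '<unk>', '<sos>', '<eos>']

def build_vocab(tokenized_lines, minimum_count=2):
    """Keep tokens seen at least `minimum_count` times, most frequent first."""
    counts = Counter(tok for line in tokenized_lines for tok in line.split())
    kept = [tok for tok, c in counts.most_common() if c >= minimum_count]
    words = SPECIALS + kept
    return words, {tok: i for i, tok in enumerate(words)}

def to_ids(line, index):
    """Map tokens to IDs, falling back to <unk>, and append <eos>."""
    unk = index['<unk>']
    return [index.get(tok, unk) for tok in line.split()] + [index['<eos>']]

corpus = ['the cat sleeps', 'the dog sleeps', 'the cat eats']
words, index = build_vocab(corpus, minimum_count=2)
ids = to_ids('the dog eats', index)
print(words)  # ['<pad>', '<unk>', '<sos>', '<eos>', 'the', 'cat', 'sleeps']
print(ids)    # [4, 1, 1, 3]
```

Tokens below the count threshold ("dog", "eats") map to `<unk>`, which is why a higher `minimum_count` shrinks the vocabulary at the cost of more unknown tokens.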

#### Data statistics:

In [None]:
print('train_size={}, valid_size={}, test_size={}, min_len={}, max_len={}'.format(
    len(train_data),
    len(valid_data),
    len(test_data),
    train_data['source_len'].min(),
    train_data['source_len'].max(),
))

print('Train source length distribution:')
print(train_data['source_len'].quantile([0.5, 0.75, 0.9, 0.95, 0.99, 0.999, 0.9999]))

#### 5. Build batches. The training batches are automatically shuffled before each epoch

In [None]:
max_len = 30       # maximum 30 tokens per sentence (longer sequences will be truncated)
batch_size = 512   # maximum 512 tokens per batch (decrease if you get OOM errors, increase to speed up training)

reset_seed()

train_iterator = nmt_dataset.BatchIterator(train_data, source_lang, target_lang, batch_size=batch_size, max_len=max_len, shuffle=True)
valid_iterator = nmt_dataset.BatchIterator(valid_data, source_lang, target_lang, batch_size=batch_size, max_len=max_len, shuffle=False)
test_iterator = nmt_dataset.BatchIterator(test_data, source_lang, target_lang, batch_size=batch_size, max_len=max_len, shuffle=False)

#### Example of training batch:

In [None]:
print(next(iter(train_iterator)))
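The batching step can be sketched as follows. This assumes, as the comments above suggest, that `batch_size` is a token budget and that sequences are truncated to `max_len`; `make_batches` is a hypothetical helper, not the `BatchIterator` implementation, and it omits shuffling.

```python
def make_batches(sequences, max_tokens=12, max_len=5, pad_id=0):
    """Group length-sorted sequences into padded batches under a token budget."""
    # Truncate, then sort by length so that each batch needs little padding.
    seqs = sorted((seq[:max_len] for seq in sequences), key=len)
    batches, current = [], []
    for seq in seqs:
        # Padded size if `seq` joined the batch: rows * longest row.
        # `seqs` is sorted, so `seq` is the longest row seen so far.
        if current and (len(current) + 1) * len(seq) > max_tokens:
            batches.append(current)
            current = []
        current.append(seq)
    if current:
        batches.append(current)
    # Pad every row to the length of the longest (last) row in its batch.
    return [
        [row + [pad_id] * (len(batch[-1]) - len(row)) for row in batch]
        for batch in batches
    ]

data = [[5, 6, 7], [8], [9, 10], [11, 12, 13, 14, 15, 16], [17, 18]]
batches = make_batches(data)
for batch in batches:
    print(batch)
```

Grouping sequences of similar length keeps the number of padding tokens low, so less computation is wasted on positions that carry no information.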

The Seq2Seq Model
=================

A Recurrent Neural Network, or RNN, is a network that operates on a
sequence and uses its own output as input for subsequent steps.

A [`Sequence to Sequence network`](http://arxiv.org/abs/1409.3215), or
seq2seq network, or [`Encoder-Decoder network`](https://arxiv.org/abs/1406.1078v3), is a model
usually consisting of two RNNs called the encoder and the decoder. The encoder reads
an input sequence and outputs a single vector, and the decoder reads
that vector to produce an output sequence. Essentially, all we need is one mechanism that reads the source sentence and creates an encoding, and another that reads the encoding and decodes it into the target language.

Unlike sequence prediction with a single RNN, where every input
corresponds to an output, the seq2seq model frees us from sequence
length and order, which makes it ideal for translation between two
languages.

Consider the sentence "I am not the
black cat" → "Je ne suis pas le chat noir". Most of the words in the input sentence have a direct
translation in the output sentence, but in a slightly different
order, e.g. "black cat" and "chat noir". Because of the "ne/pas"
construction there is also one more word in the output sentence. It would
be difficult to produce a correct translation directly from the sequence
of input words.

With a seq2seq model, the encoder creates a single vector which, in the
ideal case, encodes the "meaning" of the input sequence: a single point
in some N-dimensional space of sentences.


The Encoder
-----------

The encoder is any model that takes in a sentence and returns a representation of that sentence.

The encoder of a seq2seq network can be an RNN that outputs some value for
every word from the input sentence. For every input word the encoder
outputs a vector and a hidden state, and uses the hidden state for the
next input word.
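The recurrence described above can be sketched in plain Python. This is a toy Elman-style RNN with arbitrary weights and no bias, written only to show how the hidden state is threaded through the sequence. It is an assumption-laden illustration, not the `nnet_models.RNN_Encoder` implementation, which may use a different cell type (e.g. GRU or LSTM).

```python
import math

def rnn_step(x, h, W_xh, W_hh):
    """One Elman step: h' = tanh(W_xh x + W_hh h)."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(W_xh[j], x))
            + sum(w * hi for w, hi in zip(W_hh[j], h))
        )
        for j in range(len(h))
    ]

def encode(embedded_words, W_xh, W_hh):
    """Run the RNN over the sequence; return all states and the last one."""
    h = [0.0] * len(W_hh)
    states = []
    for x in embedded_words:
        h = rnn_step(x, h, W_xh, W_hh)
        states.append(h)
    return states, h

# Toy 2-dimensional embeddings and weights (arbitrary values).
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [-0.4, 0.6]]
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

states, last = encode(sentence, W_xh, W_hh)
_, last_reversed = encode(sentence[::-1], W_xh, W_hh)
print(len(states), last)
print(last != last_reversed)  # reading the words in another order changes the result
```

Each state depends on the current word and on everything read before it, so the final state is a fixed-size summary of the whole sequence.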

However, we will start with a simpler Bag-of-Words encoder and then move on to more complex encoders.

### Bag-of-Words Encoder

In [None]:
bow_encoder = nnet_models.BagOfWords(
    input_size=len(source_dict),
    hidden_size=512,
    num_layers=1,
    dropout=0.0,
    reduce="sum"
)

In [None]:
print(bow_encoder)
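A bag-of-words encoder can be pictured as summing (or averaging) the embeddings of the input tokens. The sketch below assumes that this is what `reduce="sum"` means; the embedding values are arbitrary and `bow_encode` is a hypothetical helper, not the `nnet_models.BagOfWords` code.

```python
# Toy embedding table with arbitrary values, chosen so that sums are exact.
EMBEDDINGS = {
    'hello': [0.5, -1.0, 0.25],
    'how':   [1.0, 0.5, -0.5],
    'are':   [-0.25, 0.75, 1.0],
    'you':   [0.5, 0.5, 0.5],
    '?':     [0.0, -0.5, 0.25],
}

def bow_encode(tokens, reduce='sum'):
    """Sum (or average) the embeddings of all tokens, ignoring their order."""
    dim = len(next(iter(EMBEDDINGS.values())))
    total = [sum(EMBEDDINGS[t][d] for t in tokens) for d in range(dim)]
    if reduce == 'mean':
        total = [v / len(tokens) for v in total]
    return total

original = bow_encode('hello how are you ?'.split())
shuffled = bow_encode('are hello ? how you'.split())
print(original)              # [1.75, 0.25, 1.5]
print(original == shuffled)  # True
```

Since addition does not depend on the order of its terms, any permutation of the same words yields the same sentence vector.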

The Decoder
--------------------

The decoder is another network that takes the encoder output vector(s) and outputs a sequence of words to create the translation.

### Decoder without Attention

In the simplest seq2seq decoder we use only the last output of the encoder. This last output is sometimes called the context vector as it encodes context from the entire sequence. This context vector can be used as the initial hidden state for an RNN decoder.

At every step of decoding, the decoder is given an input token and a hidden state. The initial input token is the start-of-string `<SOS>` token, and the first hidden state is the context vector (the encoder's last hidden state).
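The decoding loop described above can be sketched as follows. `toy_decoder_step` is a hypothetical stand-in for a trained decoder (its transitions are hard-coded and its "hidden state" is just a counter), and greedy decoding is an assumption: the notebook does not show which search strategy `nnet_models` uses.

```python
SOS, EOS = '<sos>', '<eos>'

# Hypothetical stand-in for a trained decoder: maps (previous token, hidden
# state) to (next token, new hidden state).
TRANSITIONS = {SOS: 'je', 'je': 'suis', 'suis': 'ici', 'ici': EOS}

def toy_decoder_step(prev_token, hidden):
    return TRANSITIONS[prev_token], hidden + 1

def greedy_decode(context, step_fn, max_len=30):
    """Feed each predicted token back in until <eos> or `max_len` steps."""
    token, hidden = SOS, context  # start from <sos> and the encoder context
    output = []
    for _ in range(max_len):
        token, hidden = step_fn(token, hidden)
        if token == EOS:
            break
        output.append(token)
    return output, hidden

translation, final_hidden = greedy_decode(context=0, step_fn=toy_decoder_step)
print(' '.join(translation))  # je suis ici
```

The `max_len` cap matters in practice: a poorly trained decoder may never emit the end-of-sequence token.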

In [None]:
bow_decoder = nnet_models.RNN_Decoder(
    output_size=len(target_dict),
    hidden_size=512,
    num_layers=1,
    dropout=0.0
)

In [None]:
print(bow_decoder)

In [None]:
bow_model = nnet_models.EncoderDecoder(
    bow_encoder,
    bow_decoder,
    lr=0.001,
    use_cuda=True,
    target_dict=target_dict
)

### Training code

In [None]:
def save_model(model, checkpoint_path):
    dirname = os.path.dirname(checkpoint_path)
    if dirname:
        os.makedirs(dirname, exist_ok=True)
    torch.save(model, checkpoint_path)

def train_model(
        train_iterator,
        valid_iterators,
        model,
        checkpoint_path,
        epochs=10,
        validation_frequency=1
    ):
    """
    train_iterator: instance of nmt_dataset.BatchIterator or nmt_dataset.MultiBatchIterator
    valid_iterators: list of nmt_dataset.BatchIterator
    model: instance of nnet_models.EncoderDecoder
    checkpoint_path: path of the model checkpoint
    epochs: iterate this many times over train_iterator
    validation_frequency: validate the model every N epochs
    """

    reset_seed()

    best_bleu = -1
    for epoch in range(1, epochs + 1):

        start = time.time()
        running_loss = 0

        print('Epoch: [{}/{}]'.format(epoch, epochs))

        # Iterate over training batches for one epoch
        for batch in tqdm(train_iterator, total=len(train_iterator)):
            running_loss += model.train_step(batch)

        # Average training loss for this epoch
        epoch_loss = running_loss / len(train_iterator)

        print("loss={:.3f}, time={:.2f}".format(epoch_loss, time.time() - start))
        sys.stdout.flush()

        # Evaluate and save the model
        if epoch % validation_frequency == 0:
            bleu_scores = []
            
            # Compute BLEU over all validation sets
            for valid_iterator in valid_iterators:
                src, tgt = valid_iterator.source_lang, valid_iterator.target_lang
                translation_output = model.translate(valid_iterator, postprocess)
                bleu_score = translation_output.score
                output = translation_output.output

                os.makedirs(model_dir, exist_ok=True)   # make sure the output directory exists
                with open(os.path.join(model_dir, 'valid.{}-{}.{}.out'.format(src, tgt, epoch)), 'w') as f:
                    f.writelines(line + '\n' for line in output)

                print('{}-{}: BLEU={}'.format(src, tgt, bleu_score))
                sys.stdout.flush()
                bleu_scores.append(bleu_score)

            # Average the validation BLEU scores
            bleu_score = round(sum(bleu_scores) / len(bleu_scores), 2)
            if len(bleu_scores) > 1:
                print('BLEU={}'.format(bleu_score))

            # Update the model's learning rate based on current performance.
            # This scheduler divides the learning rate by 10 if BLEU does not improve.
            model.scheduler_step(bleu_score)

            # Save a model checkpoint if it has the best validation BLEU so far
            if bleu_score > best_bleu:
                best_bleu = bleu_score
                save_model(model, checkpoint_path)

        print('=' * 50)

    print("Training completed. Best BLEU is {}".format(best_bleu))
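The learning-rate rule mentioned in the comments of `train_model` ("divide the learning rate by 10 if BLEU does not improve") can be sketched as below. `PlateauScheduler` is a hypothetical class written for illustration, and `patience=0` is an assumption: how `model.scheduler_step` configures its scheduler is not shown in this notebook.

```python
class PlateauScheduler:
    """Divide the learning rate by `factor` when the score stops improving."""

    def __init__(self, lr, factor=10.0, patience=0):
        self.lr = lr
        self.factor = factor
        self.patience = patience  # evaluations without improvement tolerated
        self.best = float('-inf')
        self.bad_evals = 0

    def step(self, score):
        if score > self.best:
            self.best = score
            self.bad_evals = 0
        else:
            self.bad_evals += 1
            if self.bad_evals > self.patience:
                self.lr /= self.factor
                self.bad_evals = 0
        return self.lr

scheduler = PlateauScheduler(lr=0.001)
# Made-up validation scores, only to exercise the rule.
history = [scheduler.step(bleu) for bleu in [10.0, 15.0, 14.5, 16.0, 16.0]]
print(history)
```

The learning rate drops after the third and fifth evaluations, the two that fail to beat the best score so far.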

### Train a model with BOW Encoder and RNN Decoder (or load a pre-trained model)

In [None]:
# Set this value to True to train your own model. By default, a pre-trained model will be loaded.
# Tip: you can set `epochs` to a small value (e.g., 2) and re-run this cell several times to continue training your model (`train_model` does not reset the model)
train_again = False

if train_again:
    checkpoint_path = os.path.join(model_dir, 'bow.pt')
else:
    checkpoint_path = os.path.join(model_dir, 'pretrained-bow.pt')

print('checkpoint path:', checkpoint_path)

if os.path.exists(checkpoint_path) and not train_again:
    bow_model = torch.load(checkpoint_path)
else:
    train_model(train_iterator, [valid_iterator], bow_model,
                epochs=10,
                checkpoint_path=checkpoint_path)

### Compute BLEU on the test set

In [None]:
print('BLEU:', bow_model.translate(test_iterator, postprocess).score)
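For readers unfamiliar with BLEU, here is a minimal sketch of the standard corpus-level definition: clipped n-gram precisions up to 4-grams, combined by a geometric mean and multiplied by a brevity penalty. It assumes one reference per sentence, whitespace tokenization and no smoothing. How `model.translate` computes its `score` is not shown in this notebook, so the numbers need not match.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (one reference per hypothesis, no smoothing), 0-100."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp, ref = hyp.split(), ref.split()
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            # Clipped counts: an n-gram is credited at most as often as it
            # occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)

refs = ['je ne suis pas le chat noir', 'elle a cinq ans de plus que moi']
hyps = ['je ne suis pas le chat', 'elle a cinq ans de plus que lui']
perfect = corpus_bleu(refs, refs)
partial = corpus_bleu(hyps, refs)
zero = corpus_bleu(['bonjour'], ['salut'])
print(perfect, round(partial, 2), zero)
```

Without smoothing, a single n-gram order with no match drives the score to zero, which is why BLEU is usually reported over a whole test set rather than per sentence.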

### Interact with the model

In [None]:
def get_binned_bleu_scores(model, valid_iterator):
    # Compute and plot BLEU scores according to sequence length
    # lengths = np.arange(0, 31, 5)
    lengths = np.arange(4, 20, 3)
    bleu_scores = np.zeros(len(lengths))

    for i in tqdm(range(1, len(lengths)), total=len(lengths) - 1):
        min_len = lengths[i - 1]
        max_len = lengths[i]

        # Filter the data of the iterator that was passed in (not the global `valid_data`)
        data = valid_iterator.data
        tmp_data = data[(data['source_len'] > min_len) & (data['source_len'] <= max_len)]
        tmp_iterator = nmt_dataset.BatchIterator(tmp_data, source_lang, target_lang, batch_size=batch_size, max_len=max_len)

        bleu_scores[i] = model.translate(tmp_iterator, postprocess).score

    lengths = lengths[1:]
    bleu_scores = bleu_scores[1:]

    plt.plot(lengths, bleu_scores, 'x-')
    plt.ylim(0, np.max(bleu_scores) + 1)
    plt.xlabel('Source length')
    plt.ylabel('BLEU score')
    
    return lengths, bleu_scores


def show_attention(input_sentence, output_words, attentions):
    # Plot an encoder-decoder attention matrix
    fig = plt.figure()
    ax = fig.add_subplot(111)
    cax = ax.matshow(attentions, cmap='bone', aspect='auto')
    fig.colorbar(cax)

    ax.set_xticklabels([''] + input_sentence.split(' ') +
                       [nmt_dataset.EOS_TOKEN], rotation=90)
    ax.set_yticklabels([''] + output_words.split(' ') +
                       [nmt_dataset.EOS_TOKEN])

    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

    plt.show()


def encode_as_batch(sentence, dictionary, source_lang, target_lang):
    # Create a batch from a single sentence
    sentence = sentence + ' ' + nmt_dataset.EOS_TOKEN
    tensor = dictionary.txt2vec(sentence).unsqueeze(0)

    return {
        'source': tensor,
        'source_len': torch.from_numpy(np.array([tensor.shape[-1]])),
        'source_lang': source_lang,
        'target_lang': target_lang
    }


def get_translation(model, sentence, dictionary, source_lang, target_lang, return_output=False):
    # Translate given sentence with given model. Also show translation outputs by Google Translate for comparison.
    print('Source:', sentence)
    sentence_tok = preprocess(sentence, is_source=True, source_lang=source_lang, target_lang=target_lang)
    print('Tokenized source:', sentence_tok)
    batch = encode_as_batch(sentence_tok, dictionary, source_lang, target_lang)
    prediction, attn_matrix, enc_self_attn = model.eval_step(batch)
    prediction = prediction[0]
    prediction_detok = postprocess(prediction)
    print('Prediction:', prediction)
    print('Detokenized prediction:', prediction_detok)

    print('Google Translate ({}->{}): {}'.format(
        source_lang,
        target_lang,
        translator.translate(sentence, src=source_lang, dest=target_lang).text
    ))
    print('Google Translate on prediction ({}->{}): {}'.format(
        target_lang,
        source_lang,
        translator.translate(prediction_detok, src=target_lang, dest=source_lang).text
    ))

    results = {
        'source': sentence,
        'source_tokens': sentence_tok.split(' ') + ['<eos>'],
        'prediction_detok': prediction_detok,
        'prediction_tokens': prediction.split(' '),
    }

    if attn_matrix is not None:
        attn_matrix = attn_matrix[0].detach().cpu().numpy()
        results['attention_matrix'] = attn_matrix
        show_attention(sentence_tok, prediction, attn_matrix)
    
    if enc_self_attn is not None:
        results['encoder_self_attention_list'] = enc_self_attn
    
    if return_output:
        return results

In [None]:
get_translation(bow_model, 'hello how are you ?', source_dict, source_lang, target_lang)

The biggest limitation of a Bag-of-Words encoder is that it is insensitive to word order: <br>
when the words of the previous sentence are shuffled, the output stays the same.

In [None]:
get_translation(bow_model, 'are hello ? how you', source_dict, source_lang, target_lang)

In [None]:
get_translation(bow_model, 'she \'s five years older than me .', source_dict, source_lang, target_lang)

## RNN Encoder + RNN Decoder

In [None]:
rnn_encoder = nnet_models.RNN_Encoder(
    input_size=len(source_dict),
    hidden_size=512,
    num_layers=1,
    dropout=0.2
)

In [None]:
print(rnn_encoder)

In [None]:
rnn_decoder = nnet_models.RNN_Decoder(
    output_size=len(target_dict),
    hidden_size=512,
    num_layers=1,
    dropout=0.2
)

In [None]:
print(rnn_decoder)

In [None]:
rnn_model = nnet_models.EncoderDecoder(
    rnn_encoder,
    rnn_decoder,
    lr=0.001,
    use_cuda=True,
    target_dict=target_dict
)

### Train a model with RNN Encoder and RNN Decoder (or load a pre-trained model)

In [None]:
# Set this value to True to train your own model. By default, a pre-trained model will be loaded.
# Tip: you can set `epochs` to a small value (e.g., 2) and re-run this cell several times to continue training your model (`train_model` does not reset the model)
train_again = False

if train_again:
    checkpoint_path = os.path.join(model_dir, 'rnn.pt')
else:
    checkpoint_path = os.path.join(model_dir, 'pretrained-rnn.pt')

print('checkpoint path:', checkpoint_path)

if os.path.exists(checkpoint_path) and not train_again:
    rnn_model = torch.load(checkpoint_path)
else:
    train_model(train_iterator, [valid_iterator], rnn_model,
                epochs=10,
                checkpoint_path=checkpoint_path)

### Compute BLEU on the test set

In [None]:
print('BLEU:', rnn_model.translate(test_iterator, postprocess).score)

### Interact with the model

In [None]:
get_translation(rnn_model, 'hello how are you ?', source_dict, source_lang, target_lang)

Contrary to the BoW encoder, an RNN is sensitive to word order.

In [None]:
get_translation(rnn_model, 'are hello ? how you', source_dict, source_lang, target_lang)

In [None]:
get_translation(rnn_model, 'she \'s five years older than me .', source_dict, source_lang, target_lang)

In [None]:
get_translation(rnn_model, 'i know that the last thing you want to do is help me .', source_dict, source_lang, target_lang)

### Plot validation BLEU according to source sequence length
The performance quickly degrades as the input length increases. This is caused by three main factors:
- The RNN decoder (without attention) only relies on the last hidden state of the encoder. This means that we have to encode the full sentence into a single fixed-size vector
- Encoder-decoder RNNs are difficult to train (because the signal has to be backpropagated through the entire sequence of states)
- The training set we used is mostly composed of very short sentences (95% of source sentences are 15 tokens or less)

In [None]:
rnn_lengths, rnn_bleu_scores = get_binned_bleu_scores(rnn_model, valid_iterator)

## RNN Encoder + RNN Decoder with Encoder-Decoder Attention

In [None]:
rnn_attn_encoder = nnet_models.RNN_Encoder(
    input_size=len(source_dict),
    hidden_size=512,
    num_layers=1,
    dropout=0.0
)

In [None]:
print(rnn_attn_encoder)

In [None]:
rnn_attn_decoder = nnet_models.AttentionDecoder(
    output_size=len(target_dict),
    hidden_size=512,
    dropout=0.0
)

In [None]:
print(rnn_attn_decoder)

In [None]:
rnn_attn_model = nnet_models.EncoderDecoder(
    rnn_attn_encoder,
    rnn_attn_decoder,
    lr=0.001,
    use_cuda=True,
    target_dict=target_dict
)
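Encoder-decoder attention lets the decoder look at all encoder states at each step instead of relying on a single context vector. The sketch below uses plain dot-product scoring, which is an assumption: the scoring function used by `nnet_models.AttentionDecoder` is not shown here and may be different (e.g. additive attention).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_states):
    """Return (weights, context) for one decoder step."""
    # One score per source position: similarity between query and state.
    scores = [sum(q * k for q, k in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: attention-weighted average of the encoder states.
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy values
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

Stacking the `weights` vectors of all decoding steps gives the attention matrix that `show_attention` plots.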

### Train a model with RNN Encoder and RNN Decoder with attention (or load a pre-trained model)

In [None]:
# Set this value to True to train your own model. By default, a pre-trained model will be loaded.
# Tip: you can set `epochs` to a small value (e.g., 2) and re-run this cell several times to continue training your model (`train_model` does not reset the model)
train_again = False

if train_again:
    checkpoint_path = os.path.join(model_dir, 'rnn-attn.pt')
else:
    checkpoint_path = os.path.join(model_dir, 'pretrained-rnn-attn.pt')

print('checkpoint path:', checkpoint_path)

if os.path.exists(checkpoint_path) and not train_again:
    rnn_attn_model = torch.load(checkpoint_path)
else:
    train_model(train_iterator, [valid_iterator], rnn_attn_model,
                epochs=10,
                checkpoint_path=checkpoint_path)

### Compute BLEU on the test set

In [None]:
print('BLEU:', rnn_attn_model.translate(test_iterator, postprocess).score)

### Plot validation BLEU according to source sequence length

In [None]:
rnn_attn_lengths, rnn_attn_bleu_scores = get_binned_bleu_scores(rnn_attn_model, valid_iterator)

In [None]:
plt.plot(rnn_lengths, rnn_bleu_scores, '--x', label='RNN without attention')
plt.plot(rnn_attn_lengths, rnn_attn_bleu_scores, '--x', label='RNN with attention')
plt.xlabel('Source length')
plt.ylabel('BLEU score')
plt.legend()

### Interact with the model and visualize attention matrices

In [None]:
get_translation(rnn_attn_model, 'hello how are you ?', source_dict, source_lang, target_lang)

In [None]:
get_translation(rnn_attn_model, 'she \'s five years older than me .', source_dict, source_lang, target_lang)

In [None]:
get_translation(rnn_attn_model, 'i know that the last thing you want to do is help me .', source_dict, source_lang, target_lang)

## Transformer Model

The [Transformer](https://arxiv.org/abs/1706.03762) is currently the state of the art for Machine Translation. Each encoder layer applies self-attention over the outputs of the previous layer. Each decoder layer combines self-attention over the previously generated target tokens with encoder-decoder attention over the encoder outputs.
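The core operation can be sketched in a few lines. This is a deliberately simplified, single-head version of scaled dot-product self-attention: queries, keys and values are the input states themselves (identity projections), and the learned projections, multiple heads, residual connections, layer normalization and feed-forward sublayers of a real Transformer are all omitted. The `causal` flag illustrates the masking used on the decoder side.

```python
import math

def self_attention(states, causal=False):
    """Scaled dot-product self-attention with identity projections."""
    dim = len(states[0])
    outputs, all_weights = [], []
    for i, query in enumerate(states):
        # With `causal=True`, position i only sees positions 0..i.
        visible = states[:i + 1] if causal else states
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        weights = [e / sum(exps) for e in exps]
        # Masked positions get weight 0.
        weights += [0.0] * (len(states) - len(weights))
        all_weights.append(weights)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, states))
            for d in range(dim)
        ])
    return outputs, all_weights

states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy values
encoder_out, encoder_weights = self_attention(states)
decoder_out, decoder_weights = self_attention(states, causal=True)
print(decoder_weights[0])  # [1.0, 0.0, 0.0]
```

Unlike an RNN, every position attends to every other position directly, so no information has to travel through a chain of hidden states.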

In [None]:
transformer_encoder = nnet_models.TransformerEncoder(
    input_size=len(source_dict),
    hidden_size=512,
    num_layers=1,
    dropout=0.0,
    heads=4
)

In [None]:
print(transformer_encoder)

In [None]:
transformer_decoder = nnet_models.TransformerDecoder(
    output_size=len(target_dict),
    hidden_size=512,
    num_layers=1,
    heads=4,
    dropout=0.0
)

In [None]:
print(transformer_decoder)

In [None]:
transformer_model = nnet_models.EncoderDecoder(
    transformer_encoder,
    transformer_decoder,
    lr=0.001,
    use_cuda=True,
    target_dict=target_dict
)

### Train a Transformer model (or load a pre-trained model)

In [None]:
# Set this value to True to train your own model. By default, a pre-trained model will be loaded.
# Tip: you can set `epochs` to a small value (e.g., 2) and re-run this cell several times to continue training your model (`train_model` does not reset the model)
train_again = False

if train_again:
    checkpoint_path = os.path.join(model_dir, 'transformer.pt')
else:
    checkpoint_path = os.path.join(model_dir, 'pretrained-transformer.pt')

print('checkpoint path:', checkpoint_path)

if os.path.exists(checkpoint_path) and not train_again:
    transformer_model = torch.load(checkpoint_path)
else:
    train_model(train_iterator, [valid_iterator], transformer_model,
                epochs=10,
                checkpoint_path=checkpoint_path)

### Compute BLEU on the test set

In [None]:
print('BLEU:', transformer_model.translate(test_iterator, postprocess).score)

### Plot validation BLEU according to source sequence length

In [None]:
transformer_lengths, transformer_bleu_scores = get_binned_bleu_scores(transformer_model, valid_iterator)

In [None]:
plt.plot(rnn_lengths, rnn_bleu_scores, '--x', label='RNN without attention')
plt.plot(transformer_lengths, transformer_bleu_scores, '--x', label='Transformer')
plt.xlabel('Source length')
plt.ylabel('BLEU score')
plt.legend()

### Interact with the model

In [None]:
from bertviz.bertviz import head_view, model_view

In [None]:
%%javascript
require.config({
  paths: {
    d3: '//cdnjs.cloudflare.com/ajax/libs/d3/5.7.0/d3.min',
    jquery: '//ajax.googleapis.com/ajax/libs/jquery/2.0.0/jquery.min',
  }
});

In [None]:
def show_head_view(results):
    self_attention = results['encoder_self_attention_list']
    tokens = results['source_tokens']
    sentence_b_start = None
    head_view(self_attention, tokens, sentence_b_start)

def show_model_view(results):
    self_attention = results['encoder_self_attention_list']
    tokens = results['source_tokens']
    sentence_b_start = None
    model_view(self_attention, tokens, sentence_b_start)

In [None]:
results = get_translation(transformer_model, 'hello how are you ?', source_dict, source_lang, target_lang, return_output=True)

In [None]:
show_head_view(results)

In [None]:
show_model_view(results)

In [None]:
results = get_translation(transformer_model, 'she \'s five years older than me .', source_dict, source_lang, target_lang, return_output=True)

In [None]:
results = get_translation(transformer_model, 'i know that the last thing you want to do is help me .', source_dict, source_lang, target_lang, return_output=True)

## Multilingual Transformer model

Load a pre-trained **de, fr <-> en** model. The same dictionary and embeddings are shared between all languages, and language codes (`<lang:de>`, `<lang:en>`, `<lang:fr>`) are prepended to each source sequence to identify the target language.

In [None]:
multi_model_dir = os.path.join('models', 'de-en-fr')

multi_dict = nmt_dataset.load_or_create_dictionary(
    os.path.join(multi_model_dir, 'dict.txt'),
    dataset=None,
    minimum_count=10,
    reset=False
)

checkpoint_path = os.path.join(multi_model_dir, 'pretrained-transformer.pt')
multi_transformer_model = torch.load(checkpoint_path)

print(multi_transformer_model)

### Multilingual evaluation

Modify the `preprocess` function to automatically prepend language codes to all source sequences (when calling `get_translation`, or `load_data`).

And load test sets in all language pairs.

In [None]:
def preprocess(line, is_source=True, source_lang=None, target_lang=None):
    line = bpe_model.segment(line.lower())
    if is_source:
        line = '<lang:{}> {}'.format(target_lang, line)
    return line

test_iterators = []

for src, tgt in ('en', 'fr'), ('fr', 'en'), ('en', 'de'), ('de', 'en'), ('de', 'fr'), ('fr', 'de'):
    dataset = load_data(src, tgt, 'test')
    nmt_dataset.binarize(dataset, source_dict=multi_dict, target_dict=multi_dict, sort=False)
    iterator = nmt_dataset.BatchIterator(dataset, src, tgt, batch_size=512, max_len=30, shuffle=False)
    test_iterators.append(iterator)

In [None]:
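# Use a distinct loop variable so that the en-fr `test_iterator` defined earlier is not overwritten
for multi_test_iterator in test_iterators[:4]:
    print('BLEU {}-{}: {}'.format(
        multi_test_iterator.source_lang,
        multi_test_iterator.target_lang,
        multi_transformer_model.translate(multi_test_iterator, postprocess).score
    ))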

### Interact with the model

In [None]:
get_translation(multi_transformer_model, 'she \'s five years older than me .', multi_dict, source_lang='en', target_lang='fr')

In [None]:
get_translation(multi_transformer_model, 'sie ist fünf jahre älter als ich .', multi_dict, source_lang='de', target_lang='en')

### Zero-shot translation

In theory, the model can do **zero-shot** translation, i.e., translate between German and French even though it has never seen German-French sentence pairs during training.

In [None]:
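# Use a distinct loop variable so that the en-fr `test_iterator` defined earlier is not overwritten
for zero_shot_iterator in test_iterators[4:]:
    print('BLEU {}-{}: {}'.format(
        zero_shot_iterator.source_lang,
        zero_shot_iterator.target_lang,
        multi_transformer_model.translate(zero_shot_iterator, postprocess).score
    ))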

#### However, in practice zero-shot performance is very bad. Interact with the model to understand why.

In [None]:
get_translation(multi_transformer_model, 'sie ist fünf jahre älter als ich .', multi_dict, 'de', 'fr')

In [None]:
get_translation(multi_transformer_model, 'elle a cinq ans de plus que moi .', multi_dict, 'fr', 'de')

## Your Turn!

### Hyper-parameter tuning

Find the best hyper-parameters for Transformer **en-fr**. Share your best test BLEU scores on a blackboard.

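*Don't forget to re-run the cell near the start of the notebook that defines the original `preprocess` function: the multilingual version defined above prepends a language code to every source sentence.*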

- Hyper-parameters: `lr`, `batch_size`, `num_layers`, `hidden_size`, `dropout`, `heads`, etc.
- Other improvements: modify the learning rate scheduler and optimizer in `nnet_models.EncoderDecoder`; use different embedding size and hidden size, etc.
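One simple way to organize the search is an exhaustive grid over a few values per hyper-parameter. The sketch below is only a scaffold: `grid_search` is a hypothetical helper, and `fake_train_and_score` is a placeholder with a made-up formula that must be replaced by code that builds, trains and validates a real model. Select the configuration on the validation set, and report test BLEU only for the final choice.

```python
import itertools

def grid_search(train_and_score, grid):
    """Try every combination in `grid`; return (best_score, best_config)."""
    names = sorted(grid)
    best_score, best_config = float('-inf'), None
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = train_and_score(**config)
        if score > best_score:
            best_score, best_config = score, config
    return best_score, best_config

# Placeholder only: the formula below is invented so that the example runs.
# Replace the body with code that trains a Transformer with these settings
# and returns its validation BLEU.
def fake_train_and_score(lr, num_layers, dropout):
    return 30.0 - abs(lr - 0.0005) * 1000 + num_layers - abs(dropout - 0.1) * 10

grid = {
    'lr': [0.001, 0.0005],
    'num_layers': [1, 2],
    'dropout': [0.0, 0.1],
}
best_score, best_config = grid_search(fake_train_and_score, grid)
print(best_config)
```

The number of runs grows multiplicatively with each added hyper-parameter, so keep the grid small or vary one hyper-parameter at a time.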