# Neural Machine Translation Lab @ ALPS Winter School

General Reference: https://github.com/nyu-dl/NLP_DL_Lecture_Note/blob/master/lecture_note.pdf <br>
Original Notebook: https://github.com/nyu-dl/AMMI-2019-NLP-Part2

# Outline

1. [Setup](#1.-Setup): install modules, download datasets and pre-trained models, load and preprocess corpora
2. [Sequence-to-sequence models](#2.-Sequence-to-sequence-models): Bag-of-Words vs RNNs vs Transformers
3. [Controlling generation with input tags](#3.-Controlling-generation-with-input-tags): politeness control and gender control
4. [Multilingual translation](#4.-Multilingual-Translation): zero-shot translation and adaptation to a new language pair
5. [NLLB-200: a massively multilingual model](#5.-NLLB-200:-a-massively-multilingual-MT-model): try NLLB-200 and fine-tune it for domain adaptation and noise robustness

Parts 1 and 2 are prerequisites for the other parts, but parts 3, 4 and 5 can be run independently of each other.

# 1. Setup

### Install packages

In [None]:
!pip install torch                  # to train neural networks
!pip install sentencepiece          # for tokenization
!pip install googletrans==3.1.0a0   # to use Google Translate
!pip install pandas                 # to store datasets in memory
!pip install sacrebleu              # for MT evaluation
!pip install matplotlib             # for plotting
!pip install requests               # to download stuff
!pip install --upgrade gdown        # to download files from Google Drive

### Python imports

In [None]:
"""
To run this notebook in Google Colab, you need to do the following first:
1. Go to "Runtime / Change runtime type", then select "GPU" in the "Hardware accelerator" drop-down list
2. Open this link: https://drive.google.com/drive/folders/1E07YaKths98YpoBCH2PjdtTPqOXgfdZB?usp=sharing
3. Then go to "Shared with me" in your Google Drive, right-click the "ALPS2023-NMT" folder
and select "Add shortcut to Drive"

Optionally, if you don't have a Google Drive account, you can set colab to False,
and the data and models will be downloaded in Colab (might take longer).
"""

import os
cpu = False            # set to True to run on CPU (much slower)
colab = True           # set to False to run locally and not from Google Colab
model_root = 'models'  # where new models will be saved

if not os.path.exists('data.py'):
    !wget https://raw.githubusercontent.com/naverlabseurope/ALPS2023-MT-LAB/main/data.py
    !wget https://raw.githubusercontent.com/naverlabseurope/ALPS2023-MT-LAB/main/models.py
    !wget https://raw.githubusercontent.com/naverlabseurope/ALPS2023-MT-LAB/main/utils.py
    !mkdir -p scripts
    !wget https://raw.githubusercontent.com/naverlabseurope/ALPS2023-MT-LAB/main/scripts/prepare.py -O scripts/prepare.py
    !wget https://raw.githubusercontent.com/naverlabseurope/ALPS2023-MT-LAB/main/scripts/download-data.sh -O scripts/download-data.sh
    !wget https://raw.githubusercontent.com/naverlabseurope/ALPS2023-MT-LAB/main/scripts/download-nllb.sh -O scripts/download-nllb.sh
        
if colab:
    # Mount your Google Drive, which should contain a shortcut to "ALPS2023-NMT"
    from google.colab import drive
    drive.flush_and_unmount()
    drive.mount('/content/drive')
    root_dir = '/content/drive/MyDrive/ALPS2023-NMT'
    # model_root = '/content/drive/MyDrive/ALPS2023-models' # uncomment to save your models to your Google Drive
    !ls {root_dir}/*
else:
    # Download the datasets and pre-trained models
    # Modify this script to download data in other language pairs than EN-FR
    !bash scripts/download-data.sh
    root_dir = '.'

import os, sys, re, time
import sacrebleu
import torch
import torch.nn as nn
import numpy as np
from tqdm.notebook import tqdm
import data, models, utils
from data import load_dataset, binarize, load_or_create_dictionary, BatchIterator, Tokenizer
%matplotlib inline

# Set up Google Translate API for comparison
from googletrans import Translator
google_translator = Translator()

## The dataset

We will work with a small English to French dataset named Tatoeba. It contains translations of short and simple sentences aimed at foreign language learners (from the [Tatoeba collaborative database](https://tatoeba.org/en/)). Of course, models trained on this data will not perform well on longer, more sophisticated sentences. They also won't be very robust to domain shift and input noise. To train stronger models, some larger datasets can be downloaded from https://www.statmt.org/wmt22/ or https://opus.nlpl.eu/.

In [None]:
# modify those to train models for a different language pair
source_lang, target_lang = 'en', 'fr'

# paths to the datasets and pretrained models
data_dir = os.path.join(root_dir, 'data')
pretrained_model_dir = os.path.join(root_dir, 'pretrained_models', f'{source_lang}-{target_lang}')

# path to the newly trained models
model_dir = os.path.join(model_root, f'{source_lang}-{target_lang}')

!mkdir -p {model_dir}
!head -5 {data_dir}/train.en-fr.en

## Load and preprocess the data

1. Load the BPE tokenizer
2. Load the parallel corpora for this language pair (train, valid and test). `load_dataset` will load a corpus and tokenize it with the BPE model, using the given `preprocess` function.
3. Create (or load) dictionaries that map BPE tokens to token IDs (`load_or_create_dictionary` function)
4. Binarize the data: map source and target text sequences to sequences of IDs, and sort the training set by length (`binarize` function)
5. Create batches (`BatchIterator` class): group multiple sequence pairs of similar length together, pad them to the maximum length and create numpy arrays that can be used to train or evaluate our models
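
Step 5 can be illustrated with a minimal sketch, which is not the actual `BatchIterator` implementation. It assumes that the batch size is a token budget that includes padding. The names `make_batches` and `pad`, and the padding index `0`, are hypothetical.

```python
def make_batches(lengths, max_tokens):
    """Group example indices into batches of similar length under a token budget.

    lengths: list of sequence lengths (one per example)
    max_tokens: maximum number of tokens per batch, counting padding
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])  # sort by length
    batches, current = [], []
    for i in order:
        # After padding, every sequence in a batch costs as much as the longest one.
        # Examples are visited by increasing length, so the longest is the new one.
        if current and (len(current) + 1) * lengths[i] > max_tokens:
            batches.append(current)
            current = []
        current.append(i)
    if current:
        batches.append(current)
    return batches


def pad(sequences, pad_idx=0):
    """Pad sequences of token IDs to the length of the longest one (pad_idx is an assumption)."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]


lengths = [3, 7, 2, 7, 3, 8]
batches = make_batches(lengths, max_tokens=16)
print(batches)                       # [[2, 0, 4], [1, 3], [5]]
print(pad([[5, 6, 2], [7, 2]]))      # [[5, 6, 2], [7, 2, 0]]
```

Grouping sequences of similar length keeps the amount of padding low, which is why the training set is sorted by length before batching.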

In [None]:
# set the random seed: initialize the random number generator for reproducibility
def reset_seed(seed=1234):
    np.random.seed(seed)
    torch.manual_seed(seed)

### 1. Load the BPE tokenizer (multilingual: works with French, German and English)

In [None]:
train_path = os.path.join(data_dir, f'train.{source_lang}-{target_lang}')
valid_path = os.path.join(data_dir, f'valid.{source_lang}-{target_lang}')
test_path = os.path.join(data_dir, f'test.{source_lang}-{target_lang}')
bpe_path = os.path.join(data_dir, 'spm.de-en-fr.model')

tokenizer = Tokenizer(bpe_path)

def preprocess(source_line, target_line=None, source_lang=None, target_lang=None):
    # BPE segmentation: e.g., 'He overslept this morning .' -> '▁He ▁o vers le pt ▁this ▁morning .'
    # modify this function to tweak the pre-processing (e.g., to add control tags / language codes).
    # 'source_lang' and 'target_lang' are not used here, but will be needed for multilingual translation later on.
    # 'preprocess' can also be called to tokenize a single source sentence (instead of a sentence pair)
    source_line = tokenizer.tokenize(source_line)
    target_line = tokenizer.tokenize(target_line)
    return source_line, target_line

def postprocess(line):
    # Merge BPE-tokenized sequences back into sequences of words:
    # "▁Ce ▁matin , ▁il ▁s ' est ▁réve illé ▁trop ▁tard ." -> "Ce matin, il s'est réveillé trop tard."
    # Used to post-process the model predictions into human-readable text.
    return tokenizer.detokenize(line)
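
To make the post-processing step more concrete, here is a minimal sketch of how SentencePiece-style pieces can be merged back into text. It assumes the usual convention that `▁` marks the start of a word. `merge_pieces` is a hypothetical helper, and `tokenizer.detokenize` may well do more than this.

```python
def merge_pieces(tokenized):
    """Merge a space-separated sequence of SentencePiece-style pieces back into text."""
    pieces = tokenized.split()
    # '▁' marks the start of a word; pieces without it continue the previous word
    return ''.join(pieces).replace('▁', ' ').strip()


print(merge_pieces("▁Ce ▁matin , ▁il ▁s ' est ▁réve illé ▁trop ▁tard ."))
# Ce matin, il s'est réveillé trop tard.
```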

### 2. Load and preprocess the parallel corpora

In [None]:
train_data = load_dataset(train_path, source_lang, target_lang, preprocess, max_size=None)  # pandas.DataFrame
# set max_size to 10000 for fast debugging
valid_data = load_dataset(valid_path, source_lang, target_lang, preprocess, max_size=500)
test_data = load_dataset(test_path, source_lang, target_lang, preprocess, max_size=500)
print(train_data[:5])   # to see the first 5 rows of train_data

### 3. Load or create the dictionaries

In [None]:
source_dict_path = os.path.join(pretrained_model_dir, f'dict.{source_lang}.txt')
target_dict_path = os.path.join(pretrained_model_dir, f'dict.{target_lang}.txt')

source_dict = load_or_create_dictionary(
    source_dict_path,
    train_data['source_tokenized'],
    reset=False,    # set reset to True if you're changing the data or the preprocessing
)
print(source_dict.words[:100])   # print the first 100 words in the source vocabulary

target_dict = load_or_create_dictionary(
    target_dict_path,
    train_data['target_tokenized'],
    reset=False,
)
print(target_dict.words[:100])

print('source vocab size:', len(source_dict))
print('target vocab size:', len(target_dict))

### 4. Use the dictionaries to map tokens to indices. The training set is also sorted by length for more efficient batching

In [None]:
binarize(train_data, source_dict, target_dict, sort=True)
binarize(valid_data, source_dict, target_dict, sort=False)
binarize(test_data, source_dict, target_dict, sort=False)
print(train_data[:5])  # print the first 5 rows of train_data
# The 'source_bin' and 'target_bin' columns contain the sequences of indices
# Indices of 2 correspond to the EOS token

### Data statistics

In [None]:
print('train_size={}, valid_size={}, test_size={}, min_len={}, max_len={}, avg_len={:.1f}'.format(
    len(train_data),
    len(valid_data),
    len(test_data),
    train_data['source_len'].min(),
    train_data['source_len'].max(),
    train_data['source_len'].mean(),
))

print('Train source length distribution:')
# The 90th percentile is the value below which 90% of the data fall.
# We see that 90% of training examples have 15 source words or less
# and 99% of all training examples have 30 source words or less.
print(train_data['source_len'].quantile([0.5, 0.9, 0.95, 0.99, 0.999]))

In [None]:
def unk_percentage(column):
    total = sum(len(ids) for ids in column)
    unk = sum((ids == data.UNK_IDX).sum() for ids in column)
    return unk / total

print(f"OOV source words: {unk_percentage(train_data['source_bin']):.2%}")
print(f"OOV target words: {unk_percentage(train_data['target_bin']):.2%}")

### 5. Build batches. The training batches are automatically shuffled before each epoch

In [None]:
MAX_LEN = 30       # maximum 30 tokens per sentence (longer sequences will be truncated)
BATCH_SIZE = 512   # maximum 512 tokens per batch (decrease if you get out-of-memory errors,
# increase to speed up training)

reset_seed()

train_iterator = BatchIterator(train_data, source_lang, target_lang, BATCH_SIZE, max_len=MAX_LEN, shuffle=True)
valid_iterator = BatchIterator(valid_data, source_lang, target_lang, BATCH_SIZE, max_len=MAX_LEN, shuffle=False)
test_iterator = BatchIterator(test_data, source_lang, target_lang, BATCH_SIZE, max_len=MAX_LEN, shuffle=False)

print('Example of training batch:')
print(next(iter(train_iterator)))

# 2. Sequence-to-sequence models

A Recurrent Neural Network, or RNN, is a network that operates on a
sequence and uses its own output as input for subsequent steps.

A Sequence-to-Sequence (seq2seq) or Encoder-Decoder model usually consists of two neural networks called the encoder and the decoder (http://arxiv.org/abs/1409.3215, https://arxiv.org/abs/1406.1078v3). The encoder reads
an input sequence and outputs a vector representation, and the decoder reads
that vector representation to produce an output sequence. Essentially, all we need is one mechanism that reads the source sentence and creates an encoding, and another that reads this encoding and decodes it into the target language.

Unlike sequence prediction with a single RNN, where every input
corresponds to an output, the seq2seq model frees us from sequence
length and order, which makes it ideal for translation between two
languages.

Consider the sentence "I am not the
black cat" → "Je ne suis pas le chat noir". Most of the words in the input sentence have a direct
translation in the output sentence, but in a slightly different
order, e.g. "black cat" and "chat noir". Because of the "ne/pas"
construction, there is also one more word in the output sentence. It would
be difficult to produce a correct translation directly from the sequence
of input words.

With a basic seq2seq model, the encoder creates a single vector which, in the
ideal case, encodes the meaning of the whole input sequence:
a single point in some N-dimensional space of sentences.


## The Encoder

The encoder is anything which takes in a sentence and gives us a vector representation of this sentence. 

The encoder of a seq2seq network can be an RNN that outputs some value for
every word from the input sentence. For every input word the encoder
outputs a vector and a hidden state, and uses the hidden state for the
next input word.

However, we will start with a simpler Bag-of-Words encoder and then move on to more complex encoders.
This encoder is a simple feed-forward network applied independently at each source position (i.e., on each word embedding). The outputs of the last encoder layer are then summed into a single vector and used as input by the RNN decoder.
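
The following toy sketch shows why summing position-wise outputs discards word order. It is not the `models.BOW_Encoder` code. The embeddings, the fixed affine map and the names `position_wise_layer` and `bow_encode` are all made up for the illustration.

```python
import math

# Hypothetical 3-dimensional embeddings, made up for this illustration
embeddings = {
    'do': [0.1, -0.4, 0.3],
    'you': [0.7, 0.2, -0.5],
    'like': [-0.3, 0.6, 0.1],
    'dogs': [0.4, 0.4, 0.9],
}


def position_wise_layer(vector):
    """Stand-in for the feed-forward layer: applied to each word independently."""
    return [max(0.0, 2.0 * x + 0.1) for x in vector]  # fixed affine map + ReLU


def bow_encode(words):
    """Transform each word embedding independently, then sum over positions."""
    outputs = [position_wise_layer(embeddings[word]) for word in words]
    return [sum(dim) for dim in zip(*outputs)]


a = bow_encode(['do', 'you', 'like', 'dogs'])
b = bow_encode(['do', 'dogs', 'like', 'you'])
# Both word orders give the same sentence vector (up to floating-point rounding)
same = all(math.isclose(x, y) for x, y in zip(a, b))
print(a, b, same)
```

Since the decoder only sees this sum, it cannot tell the two sentences apart.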

### Bag-of-Words encoder

In [None]:
bow_encoder = models.BOW_Encoder(
    source_dict=source_dict,
    embed_dim=512,
    num_layers=1,
    dropout=0.1,
    reduce='sum',
)

print(bow_encoder)

## The decoder

The decoder is another network that takes the encoder's output vector(s) and outputs a sequence of words to create the translation.

### Decoder without attention

In the simplest seq2seq decoder we use only the last output of the encoder. This last output is sometimes called the context vector as it encodes context from the entire sequence.

At every step of decoding, the decoder is given an input token and the encoder's context vector and it updates its internal state, which is then used to predict the next word. The initial input token is the start-of-sequence `<SOS>` token. The next inputs are the decoder's own predictions (at test time) or the ground-truth tokens (at train time).

In [None]:
bow_decoder = models.RNN_Decoder(
    target_dict=target_dict,
    embed_dim=512,
    num_layers=1,
    dropout=0.1,
)

print(bow_decoder)

In [None]:
bow_model = models.EncoderDecoder(
    bow_encoder,
    bow_decoder,
    lr=0.001,
    use_cuda=not cpu,
    max_len=MAX_LEN,
)

## Training and evaluation

`train_model` trains a model for a given number of epochs. It will evaluate this model on the validation sets after each training epoch, and save a checkpoint if the model has improved.

`evaluate_model` computes validation loss and chrF.

chrF (https://aclanthology.org/W16-2341/) is a string-based metric, less known than BLEU, but which has been shown to outperform BLEU (i.e., to correlate better with human judgment). It also has the advantage that, because it is at the character-level, it does not rely on word tokenization and is more language-independent than BLEU.

However, researchers will hopefully move away from string-based metrics towards learned metrics, which tend to correlate better with human judgment (e.g., BARTScore: https://arxiv.org/abs/2106.11520).
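
To give an intuition of what chrF measures, here is a simplified sentence-level sketch: an F-score over character n-gram precision and recall, with recall weighted more heavily (`beta=2`). This is an assumption-laden illustration, not the sacrebleu implementation; `char_ngrams` and `simple_chrf` are hypothetical names, and the scores will differ from those of `sacrebleu.corpus_chrf`.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring whitespace (a simplification)."""
    chars = text.replace(' ', '')
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: F-beta over averaged character n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # sentence too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


print(simple_chrf('le chat noir', 'le chat noir'))
print(simple_chrf('le chat noir', 'le chat blanc'))
```

Because matching happens at the character level, a hypothesis with a wrong inflection still gets partial credit, which a word-level metric would not give.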

`utils.plot_loss` plots the model's performance on the training and validation set (train loss and validation loss/chrF). It can be used to diagnose overfitting issues: if the training loss continues decreasing while the validation loss increases, this can mean that we are not doing enough regularization (e.g., `dropout`) or that the model is just too big for this tiny training corpus.

On the other hand, if the training loss seems to stagnate, this can mean that we're doing too much regularization or not using the right learning rate schedule: the initial learning rate is either too large or too small, or a different scheduler should be used. By default, we're using ReduceLROnPlateau, which divides the learning rate by 10 when validation chrF hasn't improved (by at least a 0.5 margin) over the previous best. Depending on the model, this can be either too aggressive or not aggressive enough.
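
The plateau rule described above can be sketched in a few lines. This is only an illustration of the logic, assuming an absolute margin and no patience (the learning rate is reduced as soon as one evaluation fails to improve); the notebook's actual scheduler settings may differ, and `reduce_on_plateau` is a hypothetical name.

```python
def reduce_on_plateau(scores, lr=0.001, factor=0.1, threshold=0.5):
    """Return the learning rate after each epoch, given validation scores (higher is better)."""
    best = float('-inf')
    history = []
    for score in scores:
        if score > best + threshold:
            best = score   # significant improvement: keep the learning rate
        else:
            lr *= factor   # no significant improvement: divide the learning rate by 10
        history.append(lr)
    return history


# The third epoch only improves chrF by 0.3, so the learning rate is reduced there
print(reduce_on_plateau([20.0, 25.0, 25.3, 27.0]))
```

A large margin or a small factor makes the schedule more aggressive, which is one of the things to tune when the training loss stagnates.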

`translate` lets you use the model to translate a single sentence (without having to preprocess it beforehand). Note that for simplicity, we do "greedy" decoding: we generate the highest-probability word at each step without seeking to maximize the sequence-level score. A slightly better and often used decoding algorithm is "beam search", which maintains a fixed number of hypotheses at each time step.
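
The difference between greedy decoding and beam search can be seen on a toy example. The sketch below does not use the notebook's models: `TABLE` is a made-up table of next-token probabilities, chosen so that the most likely first token leads to a less likely full sequence, and `next_log_probs` and `beam_search` are hypothetical names.

```python
import math

# Hypothetical toy "model": next-token probabilities given the tokens generated so far
TABLE = {
    (): {'a': 0.6, 'b': 0.4},
    ('a',): {'x': 0.55, 'y': 0.45},
    ('b',): {'x': 0.9, 'y': 0.1},
}


def next_log_probs(prefix):
    probs = TABLE.get(tuple(prefix), {'</s>': 1.0})  # end the sequence after two tokens
    return {token: math.log(p) for token, p in probs.items()}


def beam_search(beam_size, max_len=3):
    """Keep the `beam_size` best partial hypotheses at each step (beam_size=1 is greedy)."""
    beams = [([], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == '</s>':
                candidates.append((tokens, score))  # finished hypothesis: keep as is
                continue
            for token, log_p in next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + log_p))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0]


greedy_tokens, greedy_score = beam_search(beam_size=1)
beam_tokens, beam_score = beam_search(beam_size=2)
print(greedy_tokens, round(math.exp(greedy_score), 2))
print(beam_tokens, round(math.exp(beam_score), 2))
```

Greedy decoding commits to `a` at the first step, whereas a beam of size 2 keeps `b` alive long enough to find the higher-probability sequence. Real implementations usually also normalize scores by length, which is left out here.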

In [None]:
def evaluate_model(model, *test_or_valid_iterators, record=False):
    """
    Evaluate given models with given test or validation sets. This will compute both chrF and validation loss.
    
    model: instance of models.EncoderDecoder
    test_or_valid_iterators: list of BatchIterator
    record: save scores in the model checkpoint
    """
    scores = []
    
    model.half()  # half-precision decoding is faster on some GPUs (i.e., model parameters and activations
    # are stored in float16 format instead of float32)
    
    # Compute chrF and valid loss over all test or validation sets
    for iterator in test_or_valid_iterators:
        loss = 0
        hypotheses = []
        references = []
        
        for batch in iterator:
            loss += model.eval_step(batch) / len(iterator)
            hyps, _ = model.translate(batch)
            hypotheses += [postprocess(hyp) for hyp in hyps]  # detokenize
            references += batch['reference']
        
        chrf = sacrebleu.corpus_chrf(hypotheses, [references]).score

        src, tgt = iterator.source_lang, iterator.target_lang
        print(f'{src}-{tgt}: loss={loss:.2f}, chrF={chrf:.2f}')
        if record:  # store the metrics in the model checkpoint
            model.record(f'{src}_{tgt}_loss', loss)
            model.record(f'{src}_{tgt}_chrf', chrf)
        
        scores.append(chrf)

    # Average the validation chrF scores
    score = sum(scores) / len(scores)
    return score


def train_model(model, train_iterator, valid_iterators, checkpoint_path=None, epochs=10):
    """
    Train given model for the given number of epochs.
    The best performing checkpoint (according to average chrF on 'valid_iterators') will be saved
    under 'checkpoint_path'.
    
    By default, the optimizer, epoch counter and learning rate scheduler are not reset.
    This means that this function can be called several times:
        train_model(epochs=2) is equivalent to train_model(epochs=1); train_model(epochs=1)
    Call model.reset_optimizer() to reset the model to its initial optimization settings.
    
    model: instance of models.EncoderDecoder
    train_iterator: instance of data.BatchIterator used for generating training batches
    valid_iterators: list of BatchIterator used for evaluation
    checkpoint_path: path where the model will be saved (None to not save any checkpoint)
    epochs: iterate this many times over train_iterator
    """
    epochs += (model.epoch - 1)

    reset_seed()
    
    best_score = -1
    while model.epoch <= epochs:
        model.float()  # half-precision training is unstable, we do mixed-precision internally with torch.autocast instead
        
        start = time.time()
        running_loss = 0

        print(f'Epoch [{model.epoch}/{epochs}]')

        # Iterate over training batches for one epoch
        with tqdm(enumerate(train_iterator), total=len(train_iterator)) as t:

            for i, batch in t:
                running_loss += model.train_step(batch)
                model.scheduler_step(end_of_epoch=False)
                t.postfix = f' loss={running_loss / (i + 1):.3f}'

        # Mean training loss for this epoch
        epoch_loss = running_loss / len(train_iterator)

        print(f'train_loss={epoch_loss:.3f}, time={time.time() - start:.2f}')
        model.record('train_loss', epoch_loss)

        score = evaluate_model(model, *valid_iterators, record=True)

        # Update the model's learning rate based on current performance.
        # This scheduler divides the learning rate by 10 if chrF does not improve.
        model.scheduler_step(score, end_of_epoch=True)

        # Save a model checkpoint if it has the best validation chrF so far
        if score > best_score:
            best_score = score
            if checkpoint_path is not None:
                model.save(checkpoint_path)

        print('=' * 50)

    print(f'Training completed. Best chrF is {best_score:.2f}')


def make_batch(sources, dictionary, prefix=None, max_len=MAX_LEN):
    """
    Create a batch from given source sentences
    `prefix` is an optional target-side prefix (e.g., a target-side language code for the NLLB models)
    """
    batch = [
        {
            'source': dictionary.txt2vec(source, add_eos=True),
            'prefix': dictionary.txt2vec(prefix),
        }
        for source in sources
    ]
    return data.collate(batch, max_len)


def get_translations(model, sentences, preprocess=preprocess, source_lang=source_lang, target_lang=target_lang,
                     max_len=MAX_LEN):
    """
    Translate given sentences with given model
    """
    sentences_tok = []
    prefix_tok = None  # should be the same for all sentences
    for sentence in sentences:
        tokenized = preprocess(
            sentence,
            target_line=None,
            source_lang=source_lang,
            target_lang=target_lang,
        )  # returns (tokenized source, tokenized target, optional target prefix)
        sentences_tok.append(tokenized[0])
        if len(tokenized) == 3:
            prefix_tok = tokenized[-1]
        
    batch = make_batch(sentences_tok, model.source_dict, prefix=prefix_tok, max_len=max_len)
    predictions, attention = model.translate(batch)
    predictions_detok = [postprocess(prediction) for prediction in predictions]
    if prefix_tok:
        predictions = [f'{prefix_tok} {prediction}' for prediction in predictions]
    return {
        'source': sentences,
        'source_tok': sentences_tok,
        'predictions': predictions,
        'predictions_detok': predictions_detok,
        'attention': attention,
    }


def pivot_translation(model, sentences, preprocess, source_lang, target_lang, pivot_lang='en'):
    """
    Translate given sentences from `source_lang` to `target_lang` by pivot translation through `pivot_lang`
    """
    output = get_translations(model, sentences, preprocess, source_lang=source_lang, target_lang=pivot_lang)
    output = output['predictions_detok']
    output = get_translations(model, output, preprocess, source_lang=pivot_lang, target_lang=target_lang)
    output = output['predictions_detok']
    return output


def translate(model, sentence, preprocess=preprocess, source_lang=source_lang, target_lang=target_lang,
              google_translate=True, plot_attention=True, max_len=MAX_LEN):
    """
    Translate given sentence with given model and print the outputs.
    Also show translation outputs by Google Translate for comparison.

    sentence (str): sentence to translate
    preprocess: function used to tokenize the input sentence
    source_lang (str): source language code (used for Google Translate and as a parameter to "preprocess")
    target_lang (str): target language code (used for Google Translate and as a parameter to "preprocess")
    google_translate: show translations by Google Translate
    plot_attention: show the encoder-decoder attention matrix as a heatmap
    """
    output = get_translations(
        model,
        [sentence],
        preprocess=preprocess,
        source_lang=source_lang,
        target_lang=target_lang,
        max_len=max_len,
    )
    
    print('Source:                ', sentence)
    print('Tokenized source:      ', output['source_tok'][0])
    print('Prediction:            ', output['predictions'][0])
    print('Detokenized prediction:', output['predictions_detok'][0])
    print()
    
    if google_translate:
        print('Google Translate ({}->{}):               {}'.format(
            source_lang,
            target_lang,
            google_translator.translate(output['source'][0], src=source_lang, dest=target_lang).text,
        ))
        print('Google Translate on prediction ({}->{}): {}'.format(
            target_lang,
            source_lang,
            google_translator.translate(output['predictions_detok'][0], src=target_lang, dest=source_lang).text,
        ))

    if plot_attention and output['attention'] is not None:
        utils.plot_attention(output['source_tok'][0], output['predictions'][0], output['attention'][0])

### Train a model with BOW encoder and RNN decoder (or load a pre-trained model)

In [None]:
# Set this value to True to train your own model. By default, a pre-trained model will be loaded.
# Tip: you can set "epochs" to a small value (e.g., 2) and re-run this cell several times to continue training your model (`train_model` does not reset the model)
# Note that you can load the pre-trained model, then re-run this cell with train_again=True to continue training it
train_again = False

if train_again:
    checkpoint_path = os.path.join(model_dir, 'bow.pt')
else:
    checkpoint_path = os.path.join(pretrained_model_dir, 'bow.pt')

print('checkpoint path:', checkpoint_path)

if os.path.exists(checkpoint_path) and not train_again:
    bow_model.load(checkpoint_path)   # trained for 10 epochs
else:
    train_model(bow_model, train_iterator, [valid_iterator],
                epochs=2,
                checkpoint_path=checkpoint_path)

utils.plot_loss(bow_model)

### Compute chrF on the test set

In [None]:
chrf = evaluate_model(bow_model, test_iterator)

### Interact with the model

In [None]:
# Translate some English sentence with the model
translate(bow_model, 'Do you like dogs?')
# The Google Translate outputs are shown for reference to non-French speakers:
# - The en->fr output is a high-quality translation of the input sentence
# - The fr->en output is a translation back into English of our model's French translation (so that you can assess its quality)

The biggest limitation of a Bag-of-Words encoder is that it is insensitive to word order: when shuffling the words in the previous sentence, you get the same output.

In [None]:
translate(bow_model, 'Do dogs like you?')

In [None]:
translate(bow_model, "The mouse ate the cat.")

## RNN encoder + RNN decoder

Now let's look at a more powerful model, which also uses an RNN to encode the source sequence. Contrary to the Bag-of-Words encoder, it is sensitive to word order.

In [None]:
rnn_encoder = models.RNN_Encoder(
    source_dict=source_dict,
    embed_dim=512,
    num_layers=1,
    dropout=0.1,
)

print(rnn_encoder)

In [None]:
rnn_decoder = models.RNN_Decoder(
    target_dict=target_dict,
    embed_dim=512,
    num_layers=1,
    dropout=0.1,
)

print(rnn_decoder)

In [None]:
rnn_model = models.EncoderDecoder(
    rnn_encoder,
    rnn_decoder,
    lr=0.001,
    use_cuda=not cpu,
    max_len=MAX_LEN,
)

### Train a model with RNN encoder and RNN decoder (or load a pre-trained model)

In [None]:
# Set this value to True to train your own model. By default, a pre-trained model will be loaded.
# Tip: you can set "epochs" to a small value (e.g., 2) and re-run this cell several times to continue training your model (`train_model` does not reset the model)
# Note that you can load the pre-trained model, then re-run this cell with train_again=True to continue training it
train_again = False

if train_again:
    checkpoint_path = os.path.join(model_dir, 'rnn.pt')
else:
    checkpoint_path = os.path.join(pretrained_model_dir, 'rnn.pt')

print('checkpoint path:', checkpoint_path)

if os.path.exists(checkpoint_path) and not train_again:
    rnn_model.load(checkpoint_path)   # trained for 10 epochs
else:
    train_model(rnn_model, train_iterator, [valid_iterator],
                epochs=2,
                checkpoint_path=checkpoint_path)

utils.plot_loss(rnn_model)

### Compute chrF on the test set

In [None]:
chrf = evaluate_model(rnn_model, test_iterator)

### Interact with the model

In [None]:
translate(rnn_model, "She's five years older than me.")

#### Contrary to the BoW encoder, an RNN is sensitive to word ordering

In [None]:
translate(rnn_model, "Do you like dogs?")

In [None]:
translate(rnn_model, "Do dogs like you?")

In [None]:
translate(rnn_model, "The mouse ate the cat.")

## The Transformer

However, researchers observed that this type of RNN encoder/decoder model is hard to train and not very good at dealing with long sequences, because the entire input has to be encoded into a single fixed-size vector, no matter its length. Because of this, attention mechanisms were introduced between the encoder and the decoder (https://arxiv.org/abs/1409.0473): the decoder can look at different positions in the encoder depending on its own current state. This is usually implemented as a weighted average over encoder states, whose weights are computed with a learnable feed-forward network taking an encoder state and a decoder state as input.
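
As a rough illustration of this weighted average, here is a pure-Python sketch of encoder-decoder attention. The scoring function `toy_score` is only a stand-in for the learned feed-forward network, with fixed, untrained parameters chosen for the example; all names are hypothetical and this is not the code used in `models.py`.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(decoder_state, encoder_states, score_fn):
    """Weighted average of encoder states, weighted by their relevance to the decoder state."""
    weights = softmax([score_fn(enc, decoder_state) for enc in encoder_states])
    dim = len(encoder_states[0])
    context = [sum(w * enc[d] for w, enc in zip(weights, encoder_states)) for d in range(dim)]
    return context, weights


def toy_score(enc, dec):
    """Stand-in for the learned scorer v . tanh(W_enc enc + W_dec dec),
    with W_enc = W_dec = identity and v = all-ones (hypothetical, untrained)."""
    return sum(math.tanh(e + d) for e, d in zip(enc, dec))


encoder_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
context, weights = attention([1.0, 0.0], encoder_states, toy_score)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

The context vector is recomputed at every decoding step, so the decoder is no longer limited to a single fixed-size summary of the source.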

But nowadays, the preferred architecture is the [Transformer](https://arxiv.org/abs/1706.03762), which uses a more complex "multi-head attention" mechanism, not only between the encoder and the decoder but also within each of them (AKA "self-attention"), as a replacement for the recurrence of RNNs. Transformers are basically deep feed-forward networks where each layer has an attention mechanism over the preceding layer. Transformers are considerably faster to train than RNNs because all the states of a given layer can be computed in parallel.
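
Self-attention can be sketched in the same spirit. The version below is a deliberately stripped-down assumption: a single head, scaled dot-product scores, and no learned query/key/value projections, so it only shows the data flow and not what `models.TransformerEncoder` actually computes. `self_attention` is a hypothetical name.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(states):
    """Single-head scaled dot-product self-attention without learned projections:
    each output position is a weighted average over all states of the preceding layer."""
    dim = len(states[0])
    outputs = []
    for query in states:  # no position depends on another output, hence parallelizable
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in states]
        weights = softmax(scores)
        outputs.append([sum(w * value[d] for w, value in zip(weights, states)) for d in range(dim)])
    return outputs


states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(states):
    print([round(x, 3) for x in row])
```

Unlike an RNN, where state t needs state t-1, the loop over positions here has no sequential dependency, which is what makes training on GPUs fast.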

In [None]:
transformer_encoder = models.TransformerEncoder(
    source_dict=source_dict,
    embed_dim=512,
    num_layers=2,
    dropout=0.1,
    heads=4,
)

print(transformer_encoder)

In [None]:
transformer_decoder = models.TransformerDecoder(
    target_dict=target_dict,
    embed_dim=512,
    num_layers=1,
    heads=4,
    dropout=0.1,
)

print(transformer_decoder)

In [None]:
transformer_model = models.EncoderDecoder(
    transformer_encoder,
    transformer_decoder,
    lr=0.0005,
    use_cuda=not cpu,
    max_len=MAX_LEN,
)

Note that in this notebook, we're using the same learning rate scheduler for all models:
`torch.optim.lr_scheduler.ReduceLROnPlateau`, which reduces the learning rate when the validation score (chrF)
does not increase enough.
Feel free to experiment with other schedulers, using the `scheduler_fn` and `scheduler_args` parameters.


For example:
```
transformer_model = models.EncoderDecoder(
    transformer_encoder,
    transformer_decoder,
    lr=0.0005,
    use_cuda=not cpu,
    scheduler_fn=torch.optim.lr_scheduler.ExponentialLR,
    scheduler_args={'gamma': 0.5},
)
```

Transformers are often trained with warmup: starting with a small learning rate, increasing it up to a maximum value for the first N steps, then slowly decreasing it. Such a scheduler is implemented as `models.WarmupLR`.
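
One common shape for such a schedule is a linear warmup followed by an inverse square root decay, as in the original Transformer paper. The sketch below shows that shape only; it is not necessarily what `models.WarmupLR` does, and the function name and the default of 4000 warmup steps are assumptions.

```python
def warmup_inverse_sqrt(step, max_lr=0.0005, warmup_steps=4000):
    """Learning rate at a given training step (1-based):
    linear warmup up to max_lr, then inverse square root decay."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    return max_lr * (warmup_steps / step) ** 0.5


for step in [1, 1000, 4000, 16000, 64000]:
    print(step, f'{warmup_inverse_sqrt(step):.6f}')
```

With this shape, the learning rate peaks at `max_lr` after `warmup_steps` updates and is halved every time the number of steps is multiplied by 4.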

Deeper models can also be trained (Transformer encoders and decoders are often at least 6 layers). Regularization (`dropout` parameter) might need to be modified accordingly to avoid overfitting.

### Train a Transformer model (or load a pre-trained model)

In [None]:
# Set this value to True to train your own model. By default, a pre-trained model will be loaded.
# Tip: you can set "epochs" to a small value (e.g., 2) and re-run this cell several times to continue training your model (`train_model` does not reset the model)
# Note that you can load the pre-trained model, then re-run this cell with train_again=True to continue training it
train_again = False

if train_again:
    checkpoint_path = os.path.join(model_dir, 'transformer.pt')
else:
    checkpoint_path = os.path.join(pretrained_model_dir, 'transformer.pt')

print('checkpoint path:', checkpoint_path)

if os.path.exists(checkpoint_path) and not train_again:
    transformer_model.load(checkpoint_path)   # trained for 10 epochs
else:
    train_model(transformer_model, train_iterator, [valid_iterator],
                epochs=2,
                checkpoint_path=checkpoint_path)
    
utils.plot_loss(transformer_model)

### Compute chrF on the test set

In [None]:
chrf = evaluate_model(transformer_model, test_iterator)

### Interact with the model
The `translate` function also plots the encoder-decoder attention matrix. Note that this is an average over all attention heads, and only at the last decoder layer. The vertical axis shows the encoder positions (and corresponding source words) and the horizontal axis shows the decoder positions (and the words that were generated at these positions). Each cell gives the average attention weight (a value between 0 and 1) between these two positions.

Interestingly, even though the model is not trained with any prior regarding this alignment, encoder-decoder attention matrices often display linguistically-plausible alignments between the source and the target sentences.
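
The head averaging described above can be sketched in plain Python. This is only a toy illustration with made-up weights: `average_heads` is a hypothetical helper, not the function used by `translate`, and the nested-list layout `attention[head][decoder_pos][encoder_pos]` is an assumption.

```python
def average_heads(attention):
    """attention[h][t][s]: weight given by head h, at decoder position t, to encoder position s."""
    num_heads = len(attention)
    tgt_len = len(attention[0])
    src_len = len(attention[0][0])
    return [
        [sum(attention[h][t][s] for h in range(num_heads)) / num_heads
         for s in range(src_len)]
        for t in range(tgt_len)
    ]

# Toy example: 2 heads, 2 decoder positions, 3 encoder positions
attention = [
    [[0.8, 0.1, 0.1], [0.2, 0.6, 0.2]],
    [[0.4, 0.4, 0.2], [0.0, 0.2, 0.8]],
]
averaged = average_heads(attention)
for row in averaged:
    print([round(weight, 2) for weight in row])
```

Because each head's weights sum to 1 over the encoder positions, the averaged weights do too, which is why every cell of the plot stays between 0 and 1.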

In [None]:
translate(transformer_model, "Look, there's a cat in the kitchen!")

In [None]:
translate(transformer_model, "She's five years older than me.")

In [None]:
translate(transformer_model, 'I know that the last thing you want to do is help me.')

# 3. Controlling generation with input tags

Some aspects of generation can be controlled thanks to special tokens in the input. For instance, multi-domain models can be trained and used with source-side domain tags (https://aclanthology.org/R17-1049).

## Politeness control

This work https://aclanthology.org/N16-1005/ used special tokens to control the politeness of the output.

We will implement this approach for English-French translation, to control the use of the *"tu"* vs *"vous"* pronouns, which are respectively the informal and formal translations of "you".

We only need to partition the training data into formal vs informal splits, by looking for occurrences of *"tu"* and *"vous"*. Then, add source-side control tags depending on the politeness level of the target, and train the model with this.
At test time, we only need to put the right control tag and the model will know how to interpret it to pick the right level of politeness.
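
Before going through the actual implementation, the whole idea can be sketched on a toy corpus. The helper names (`contains_word`, `tag_pair`) and the three sentence pairs are hypothetical and only serve as an illustration. The sketch skips the tokenization step that a real preprocessing function would also apply.

```python
import re

def contains_word(line, word):
    # Split on non-word characters, so that "Pouvez-vous" yields the token "vous"
    tokens = re.split(r'\W', line)
    return any(token.lower() == word for token in tokens)

def tag_pair(source, target):
    """Returns the tagged line pair, or None if the target is neither formal nor informal."""
    if contains_word(target, 'vous'):
        return f'<formal> {source}', target
    if contains_word(target, 'tu'):
        return f'<informal> {source}', target
    return None

toy_corpus = [
    ('Can you help me?', 'Pouvez-vous m\'aider ?'),
    ('Can you help me?', 'Peux-tu m\'aider ?'),
    ('The cat sleeps.', 'Le chat dort.'),
]

tagged = [tag_pair(source, target) for source, target in toy_corpus]
for pair in tagged:
    print(pair)
```

The same English source receives a different tag depending on the pronoun found in its French reference, which is what lets the model associate each tag with a politeness level.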


### Politeness detector

As we only rely on the "politeness control token," it is necessary to prepare distinctive polite and non-polite training samples from the corpus.

While a lot of different aspects of French grammar can be considered here, to start with, we pick sentences that contain *"tu"* and *"vous"* — both meaning "you"  in English — and label them as "non-polite" and "polite," respectively.

In [None]:
def split_by_punct(line):
    """
    Splits on non-word characters (punctuation and whitespace): "Hello, world!" -> ["Hello", ",", "", " ", "world", "!", ""]
    This can be reverted by: ''.join(split_by_punct(line))
    """
    return re.split(r'(\W)', line or '')

def is_formal(line):
    """
    Contains formal French translations of "you"
    """
    tokens = split_by_punct(line)
    # Modify this regex to match other formal pronouns (e.g., votre/vos)
    return any(re.fullmatch(r'vous', token, re.IGNORECASE) for token in tokens)

def is_informal(line):
    """
    Contains informal French translations of "you"
    """
    tokens = split_by_punct(line)
    # Modify this regex to match other informal pronouns (e.g., ton/ta/tes)
    return any(re.fullmatch(r'tu', token, re.IGNORECASE) for token in tokens)

### Adding politeness control tags

When we identify sentences that are either polite or non-polite, we can attach corresponding control tags in front of each sentence.

In [None]:
def preprocess_formal(source_line, target_line=None, source_lang=None, target_lang=None):
    """
    Tokenizes the given line pair and prepends the <formal> source-side tag 
    """
    source_line, target_line = preprocess(source_line, target_line)
    source_line = f'<formal> {source_line}'
    return source_line, target_line

def preprocess_informal(source_line, target_line=None, source_lang=None, target_lang=None):
    """
    Tokenizes the given line pair and prepends the <informal> source-side tag 
    """
    source_line, target_line = preprocess(source_line, target_line)
    source_line = f'<informal> {source_line}'
    return source_line, target_line

def preprocess_formal_or_informal(source_line, target_line, source_lang=None, target_lang=None):
    """
    Preprocessing function for politeness control:
    - keep only line pairs whose target side has French formal or informal pronouns
    - prepend politeness control tags to the source side
    """
    if is_formal(target_line):
        return preprocess_formal(source_line, target_line)
    elif is_informal(target_line):
        return preprocess_informal(source_line, target_line)
    else:  # this line pair is neither formal nor informal
        # This example will be filtered out by load_dataset (uncomment below to keep it, without a control tag):
        # return preprocess(source_line, target_line)
        return None

### Filtering and loading the dataset

Finally, we can filter and load the dataset by passing the `preprocess_formal_or_informal` function to `load_dataset`.
This will keep only the line pairs that contain formal or informal pronouns and preprocess the sources to add control tags.

In [None]:
# Use the same dataset as before
train_path = os.path.join(data_dir, 'train.en-fr')
valid_path = os.path.join(data_dir, 'valid.en-fr')

# But preprocess it to keep only line pairs that use tu/vous pronouns and to prepend control tags
train_data_politeness = load_dataset(
    train_path, 'en', 'fr',
    preprocess=preprocess_formal_or_informal,
)

valid_data_politeness = load_dataset(
    valid_path, 'en', 'fr',
    preprocess=preprocess_formal_or_informal,
    max_size=500,
)

### Training with politeness tags 

As we are introducing new words in the vocabulary (i.e., the control tokens), we need to add them to our pretrained model's existing vocabulary.

Here, we overwrite the last two entries of the dictionary (the two least frequent tokens) so that we do not need to resize the vocabulary and embeddings.

Note that the replaced words will now be mapped to `<unk>`.
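
The effect of this replacement can be illustrated with a toy vocabulary. `ToyDictionary` is a hypothetical stand-in written for this example. It only mimics the item assignment used on `source_dict`, and the real `data.Dictionary` class may behave differently in its details.

```python
class ToyDictionary:
    """Hypothetical stand-in for a dictionary: a list of tokens plus a reverse index."""
    def __init__(self, tokens, unk='<unk>'):
        self.tokens = list(tokens)
        self.unk = unk
        self.index = {token: i for i, token in enumerate(self.tokens)}

    def __len__(self):
        return len(self.tokens)

    def __setitem__(self, i, token):
        # Forget the old token and register the new one at the same index,
        # so the embedding row at position i is reused by the new token
        del self.index[self.tokens[i]]
        self.tokens[i] = token
        self.index[token] = i

    def encode(self, words):
        unk_id = self.index[self.unk]
        return [self.index.get(word, unk_id) for word in words]

# Tokens are assumed to be sorted from most to least frequent
vocab = ToyDictionary(['<unk>', 'the', 'cat', 'zymurgy', 'quokka'])
vocab[len(vocab) - 2] = '<formal>'
vocab[len(vocab) - 1] = '<informal>'

print(vocab.encode(['<formal>', 'the', 'quokka']))
```

This prints `[3, 1, 0]`: the control tag takes over index 3, the vocabulary size is unchanged, and the overwritten word "quokka" now falls back to the `<unk>` index.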

In [None]:
source_dict = transformer_model.source_dict

# Replace some infrequent tokens with the new control tokens (these words will now be mapped to <unk>)
# This is a bit dirty, but this way we don't have to resize the pretrained model's vocabulary and embeddings
source_dict[len(source_dict) - 2] = '<formal>'
source_dict[len(source_dict) - 1] = '<informal>'

# Binarize the training and validation data with these vocabularies
binarize(train_data_politeness, source_dict, target_dict, sort=True)
binarize(valid_data_politeness, source_dict, target_dict, sort=False)

# You can see that the training source examples now start with special tokens.
print(train_data_politeness[:5])

print('train_size={}, valid_size={}, min_len={}, max_len={}, avg_len={:.1f}'.format(
    len(train_data_politeness),
    len(valid_data_politeness),
    train_data_politeness['source_len'].min(),
    train_data_politeness['source_len'].max(),
    train_data_politeness['source_len'].mean(),
))

reset_seed()

train_iterator_politeness = BatchIterator(
    train_data_politeness, 'en', 'fr',
    batch_size=BATCH_SIZE,
    max_len=MAX_LEN,
    shuffle=True,
)
valid_iterator_politeness = BatchIterator(
    valid_data_politeness, 'en', 'fr',
    batch_size=BATCH_SIZE,
    max_len=MAX_LEN,
    shuffle=False,
)

In [None]:
# Finetune the EN-FR pretrained Transformer model with the new data
new_checkpoint_path = os.path.join(model_root, 'en-fr', 'polite-transformer.pt')
transformer_model.reset_optimizer()
# Uncomment below to reload the pre-trained model
# transformer_model.load(os.path.join(pretrained_model_dir, 'transformer.pt'), reset_optimizer=True)
train_model(
    transformer_model,
    train_iterator_politeness,
    [valid_iterator_politeness],
    new_checkpoint_path,
    epochs=3,
)

### Test your polite Transformer

In [None]:
translate(transformer_model, "Would you lend me your bicycle?", preprocess_formal)

In [None]:
translate(transformer_model, "Would you lend me your bicycle?", preprocess_informal)

### Your turn!

Can you improve the `is_formal` and `is_informal` functions to find more training examples?
For instance, French possessives (*ton/ta/tes*, *votre/vos*) also have this formality distinction.
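
One possible starting point for this extension is sketched below. The names `is_formal_extended` and `is_informal_extended` are hypothetical, and the word lists are deliberately limited to the possessives mentioned above. Note that some of these forms are ambiguous (e.g., *"ton"* can also be the noun "tone"), so this heuristic will produce some false positives.

```python
import re

# Word lists limited to the pronouns and possessives discussed in this section
FORMAL_PATTERN = r'vous|votre|vos'
INFORMAL_PATTERN = r'tu|ton|ta|tes'

def split_by_punct(line):
    # Same splitting strategy as in the politeness detector
    return re.split(r'(\W)', line or '')

def is_formal_extended(line):
    tokens = split_by_punct(line)
    return any(re.fullmatch(FORMAL_PATTERN, token, re.IGNORECASE) for token in tokens)

def is_informal_extended(line):
    tokens = split_by_punct(line)
    return any(re.fullmatch(INFORMAL_PATTERN, token, re.IGNORECASE) for token in tokens)

for line in ['Où est votre vélo ?', 'Où est ton vélo ?', 'Le chat dort.']:
    print(line, is_formal_extended(line), is_informal_extended(line))
```

Because whole tokens are matched with `re.fullmatch`, words that merely contain these strings (e.g., *"tapis"*) are not detected.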

By default, `preprocess_formal_or_informal` will exclude any training example that is neither formal nor informal. This results in a very small and biased dataset. The resulting model will also catastrophically forget how to translate sentences that do not start with politeness tags. It may be beneficial (to avoid overfitting and catastrophic forgetting) to also include regular training examples, without any politeness tag.

## Controlling gender

One known issue of machine translation models (and other NLP models) is that they tend to exhibit gender biases, caused by the same biases appearing in the training data. For instance, in case of ambiguity, a doctor is more likely to be translated as masculine and a nurse as feminine.

For instance *"Dr. Dupont is very skilled"* -> *"Le Dr. Dupont est très compétent"* (*"compétent"* is masculine, the feminine form is *"compétente"*).

You will now use control tags to control the gender of the translation. Sentences starting with `<feminine>` will be translated with the feminine pronoun *"elle"* and translations of sentences starting with `<masculine>` will use the masculine pronoun *"il"*.

Unfortunately, we don't have a mainstream gender-neutral pronoun in French (like *"they"* in English).
An option called "inclusive writing" consists in writing both pronouns (e.g., *"il/elle"*), but there aren't
many natural occurrences of this in existing NLP datasets yet, so for simplicity we will stick to binary masculine/feminine.

You can mostly mirror the "Politeness control" task and change the regular expressions. Don't forget the `re.IGNORECASE` flag to also match capitalized words (e.g., both *"Elle"* and *"elle"*).

A notable difference with the previous task is that we now want to impose a feature in the output that may differ from what appears in the input. For instance, `<feminine> He eats apples` should translate as `Elle mange des pommes`. Because such things rarely occur naturally in MT data (contrary to politeness ambiguities), we will need to do some data augmentation. This can be achieved by randomly swapping the masculine and feminine pronouns in the English source lines. Modify the `feminize` and `masculinize` functions to do this.

Set `implemented = True` once you're done!
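
If you are unsure where to start with the pronoun swap, here is a minimal sketch of one way to do it. `swap_pronouns` and the two mapping tables are hypothetical names. The masculine mapping is an approximation: *"her"* is ambiguous between *"him"* and *"his"*, and this sketch always picks *"his"*.

```python
import re

TO_FEMININE = {'he': 'she', 'him': 'her', 'his': 'her', 'himself': 'herself'}
# Approximation: "her" can correspond to either "him" or "his"
TO_MASCULINE = {'she': 'he', 'her': 'his', 'hers': 'his', 'herself': 'himself'}

def swap_pronouns(line, mapping):
    def replace(match):
        word = match.group(0)
        new_word = mapping[word.lower()]
        # Keep the capitalization of the original pronoun
        return new_word.capitalize() if word[0].isupper() else new_word
    # \b ensures that only whole words are replaced (not the "he" inside "The")
    pattern = r'\b(' + '|'.join(re.escape(word) for word in mapping) + r')\b'
    return re.sub(pattern, replace, line, flags=re.IGNORECASE)

def feminize(line):
    return swap_pronouns(line, TO_FEMININE)

def masculinize(line):
    return swap_pronouns(line, TO_MASCULINE)

print(feminize('He eats his apples.'))
print(masculinize('She gave herself a break.'))
```

A more careful implementation would disambiguate *"her"* from context, but a simple swap like this may already be enough for data augmentation.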

In [None]:
implemented = False  # set to True once you've implemented all the functions below

def is_feminine(line):
    """
    Contains the French feminine pronoun "elle"
    """
    raise NotImplementedError

def is_masculine(line):
    """
    Contains the French masculine pronoun "il"
    """
    raise NotImplementedError

def preprocess_feminine(source_line, target_line, source_lang=None, target_lang=None):
    """
    Preprocessing function for feminine line pairs: the source side will have a special <feminine> token
    """
    raise NotImplementedError

def preprocess_masculine(source_line, target_line, source_lang=None, target_lang=None):
    """
    Preprocessing function for masculine line pairs: the source side will have a special <masculine> token
    """
    raise NotImplementedError

def feminize(line):
    """
    Change the English pronouns in `line` to be feminine
    """
    raise NotImplementedError

def masculinize(line):
    """
    Change the English pronouns in `line` to be masculine
    """
    raise NotImplementedError

def preprocess_masculine_or_feminine(source_line, target_line, source_lang=None, target_lang=None):
    """
    Preprocessing function for gender control:
    - add the <feminine> source tag to sentence pairs whose target side is feminine
    - add the <masculine> source tag to sentence pairs whose target side is masculine
    - do data augmentation to swap the source-side gender with probability 0.5
    """
    if is_feminine(target_line):
        if np.random.rand() < 0.5:
            source_line = masculinize(source_line)
        return preprocess_feminine(source_line, target_line)
    elif is_masculine(target_line):
        if np.random.rand() < 0.5:
            source_line = feminize(source_line)
        return preprocess_masculine(source_line, target_line)
    else:
        # return preprocess(source_line, target_line)
        return None

#### Once the previous functions have been filled in, the following can be run without modifications:

In [None]:
if implemented:
    reset_seed()

    train_data_gender = load_dataset(train_path, 'en', 'fr', preprocess=preprocess_masculine_or_feminine)
    valid_data_gender = load_dataset(valid_path, 'en', 'fr', preprocess=preprocess_masculine_or_feminine, max_size=500)

    source_dict = transformer_model.source_dict

    # Replace some infrequent tokens with the new control tokens (these words will now be mapped to <unk>)
    # This is a bit dirty, but this way we don't have to resize the pretrained model's vocabulary and embeddings
    source_dict[len(source_dict) - 2] = '<feminine>'
    source_dict[len(source_dict) - 1] = '<masculine>'

    # Binarize the training and validation data with these vocabularies
    binarize(train_data_gender, source_dict, target_dict, sort=True)
    binarize(valid_data_gender, source_dict, target_dict, sort=False)

    # You can see that the training source examples now start with special tokens.
    print(train_data_gender[:5])

    print('train_size={}, valid_size={}, min_len={}, max_len={}, avg_len={:.1f}'.format(
        len(train_data_gender),
        len(valid_data_gender),
        train_data_gender['source_len'].min(),
        train_data_gender['source_len'].max(),
        train_data_gender['source_len'].mean(),
    ))

    train_iterator_gender = BatchIterator(
        train_data_gender, 'en', 'fr',
        batch_size=BATCH_SIZE,
        max_len=MAX_LEN,
        shuffle=True,
    )
    valid_iterator_gender = BatchIterator(
        valid_data_gender, 'en', 'fr',
        batch_size=BATCH_SIZE,
        max_len=MAX_LEN,
        shuffle=False,
    )

In [None]:
if implemented:
    # Finetune the EN-FR pretrained Transformer model with the new data
    new_checkpoint_path = os.path.join(model_root, 'en-fr', 'gender-controllable-transformer.pt')
    transformer_model.reset_optimizer()
    # Uncomment below to reload the pre-trained model:
    # transformer_model.load(os.path.join(pretrained_model_dir, 'transformer.pt'), reset_optimizer=True)
    train_model(transformer_model, train_iterator_gender, [valid_iterator_gender], new_checkpoint_path, epochs=5)

In [None]:
if implemented:
    translate(transformer_model, "She is attending a Winter school.", preprocess_masculine)

In [None]:
if implemented:
    translate(transformer_model, "She is attending a Winter school.", preprocess_feminine)

# 4. Multilingual Translation

We will now look at multilingual translation, another trendy topic in machine translation. A single model can be trained to translate from multiple languages into multiple languages (https://aclanthology.org/Q17-1024/, https://arxiv.org/abs/2010.11125).
This is done by having a single multilingual BPE tokenizer and dictionary, shared between all languages. The embedding matrix and the other model parameters are also shared across languages. The multilingual model is then trained on multiple parallel datasets (e.g., EN->FR, FR->EN, DE->EN, EN->DE). The target language can be controlled with special tokens, as for politeness control.

Load a pre-trained **DE, FR <-> EN** model. The multilingual dictionary includes tokens for all three languages plus the language codes (`<lang:de>`, `<lang:en>`, `<lang:fr>`), which are prepended to each source sequence to identify the target language.

In [None]:
multi_model_dir = os.path.join(root_dir, 'pretrained_models', 'de-en-fr')

multi_dict = data.Dictionary.load(os.path.join(multi_model_dir, 'dict.txt'))

encoder = models.TransformerEncoder(source_dict=multi_dict, embed_dim=512, num_layers=2, heads=4)
decoder = models.TransformerDecoder(
    target_dict=multi_dict,
    embed_dim=512, num_layers=1, heads=4,
    embed_tokens=encoder.embed_tokens)  # tied embeddings (multilingual models usually have shared source/target embeddings)

multi_model = models.EncoderDecoder(encoder, decoder, lr=0.0005, use_cuda=not cpu, max_len=MAX_LEN)

checkpoint_path = os.path.join(multi_model_dir, 'transformer.pt')
multi_model.load(checkpoint_path)

### Multilingual evaluation

Define a new preprocessing function, `preprocess_multi`, that wraps `preprocess` and automatically prepends the target language code to every source sequence (when calling `translate` or `load_dataset`).

Then load the test sets for all language pairs.

In [None]:
def preprocess_multi(source_line, target_line, source_lang=None, target_lang=None):
    source_line, target_line = preprocess(source_line, target_line)
    source_line = f'<lang:{target_lang}> {source_line}'.strip()
    return source_line, target_line

test_sets = {}

for pair in 'en-fr', 'fr-en', 'en-de', 'de-en', 'de-fr', 'fr-de':
    src, tgt = pair.split('-')
    path = os.path.join(data_dir, f'test.{pair}')
    dataset = load_dataset(path, src, tgt, preprocess_multi, max_size=500)
    binarize(dataset, source_dict=multi_dict, target_dict=multi_dict, sort=False)
    iterator = BatchIterator(dataset, src, tgt, batch_size=BATCH_SIZE, max_len=MAX_LEN, shuffle=False)
    test_sets[pair] = iterator
    
en_centric_test_sets = list(test_sets.values())[:4]
non_en_centric_test_sets = list(test_sets.values())[4:]

In [None]:
chrf = evaluate_model(multi_model, *en_centric_test_sets)

### Interact with the model

In [None]:
# translate accepts preprocess, source_lang and target_lang arguments
translate(multi_model, "She's five years older than me.", preprocess_multi, source_lang='en', target_lang='fr')

In [None]:
translate(multi_model, 'Sie ist fünf Jahre älter als ich.', preprocess_multi, source_lang='de', target_lang='en')

### Zero-shot translation

In theory, the model can do **zero-shot** translation, i.e., translate between German and French even though it has never seen German-French sentence pairs during training.

In [None]:
chrf = evaluate_model(multi_model, *non_en_centric_test_sets)

#### However, in practice zero-shot performance is very bad. Interact with the model to understand why.

In [None]:
translate(multi_model, 'Sie ist fünf Jahre älter als ich.', preprocess_multi, source_lang='de', target_lang='fr')

In [None]:
translate(multi_model, 'Elle a cinq ans de plus que moi.', preprocess_multi, source_lang='fr', target_lang='de')

One way to use such an English-centric model to translate between two non-English languages is pivot translation. For instance, to translate from German to French, we can use the model to translate from German to English and then from English to French.
However, this approach is twice as slow, errors made in the first step propagate to the second, and some useful information may be lost in the intermediate English translation.
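
The two-step procedure can be sketched independently of any model. In the sketch below, `pivot_translate`, `toy_translate` and the word tables are all hypothetical. They are not the notebook's `pivot_translation` helper, whose signature is different.

```python
def pivot_translate(translate_fn, sentence, source_lang, target_lang, pivot_lang='en'):
    """Translates source -> pivot, then pivot -> target, with the same translation function."""
    pivot_sentence = translate_fn(sentence, source_lang, pivot_lang)
    return translate_fn(pivot_sentence, pivot_lang, target_lang)

# Toy word-by-word "model" that only knows the English-centric directions
TOY_TABLES = {
    ('de', 'en'): {'Katze': 'cat', 'Hund': 'dog'},
    ('en', 'fr'): {'cat': 'chat', 'dog': 'chien'},
}

def toy_translate(sentence, source_lang, target_lang):
    table = TOY_TABLES[(source_lang, target_lang)]
    # Unknown words are copied as-is, so errors in the first step carry over to the second
    return ' '.join(table.get(word, word) for word in sentence.split())

print(pivot_translate(toy_translate, 'Katze Hund', 'de', 'fr'))
```

Each sentence goes through the translation function twice, which is where the doubled decoding cost comes from.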

In [None]:
srcs = list(test_sets['de-fr'].data['source_data'])
refs = list(test_sets['de-fr'].data['target_data'])
hyps = pivot_translation(multi_model, srcs, preprocess_multi, 'de', 'fr', pivot_lang='en')
import sacrebleu
chrf = sacrebleu.corpus_chrf(hyps, [refs]).score
print(f'de-fr (pivot): chrF={chrf:.2f}')

## Adaptation to a new language pair

Large-scale multilingual MT models are great as they can provide translations for multiple language pairs with just a single model.

However, these models tend to be very large in terms of model parameters and require heavy computational power to train.

Therefore, when adding a new language pair, instead of re-training the model from scratch using the previous and newly added corpora, it would be more efficient to finetune the pretrained model with the new data only.

### Naive finetuning of the model
In the "Multilingual Translation" section above, we saw that the **DE, FR <-> EN** model can attempt **DE <-> FR** translation, but because it has never seen such bilingual data, its zero-shot performance is poor.

Suppose now we want to explicitly train the model to additionally support the **DE -> FR** translation.
One way is to load the corresponding dataset and finetune the pretrained model.

In [None]:
# Load DE-FR training data
src, tgt = 'de', 'fr'

train_path = os.path.join(data_dir, f'train.{src}-{tgt}')
valid_path = os.path.join(data_dir, f'valid.{src}-{tgt}')

train_data = load_dataset(train_path, src, tgt, preprocess_multi, max_size=None)  # set max_size to 10000 for fast debugging
valid_data = load_dataset(valid_path, src, tgt, preprocess_multi, max_size=500)

binarize(train_data, source_dict=multi_dict, target_dict=multi_dict, sort=True)
binarize(valid_data, source_dict=multi_dict, target_dict=multi_dict, sort=False)

reset_seed()

train_iterator = BatchIterator(train_data, src, tgt, batch_size=BATCH_SIZE, max_len=MAX_LEN, shuffle=True)
valid_iterator = BatchIterator(valid_data, src, tgt, batch_size=BATCH_SIZE, max_len=MAX_LEN, shuffle=False)

In [None]:
# Finetune the entire model on DE-FR
new_checkpoint_path = os.path.join(model_root, 'de-en-fr', 'finetuned-transformer.pt')
train_model(multi_model, train_iterator, [valid_iterator], new_checkpoint_path, epochs=1)

### Catastrophic forgetting
After the finetuning, we evaluate the model on FR-EN and DE-FR test sets. Unfortunately, this finetuning resulted in a drop in performance for FR-EN translation. This phenomenon of the model forgetting previously learned information upon learning new information is called "catastrophic forgetting".

In [None]:
# Now evaluate on FR-EN and DE-FR test sets. We see a drop in FR-EN performance (catastrophic forgetting)
chrf = evaluate_model(multi_model, test_sets['fr-en'], test_sets['de-fr'])

### Adapter modules
An alternative to finetuning and an effective way of avoiding the problem of catastrophic forgetting is the usage of adapter modules (https://arxiv.org/abs/1902.00751).

An adapter module is generally a small feedforward network with a skip connection, inserted in each Transformer layer.

The insertion of adapter modules incurs additional model parameters, but they are often kept small compared to the size of the original network.

During adapter tuning, only the adapter modules are trained with the downstream task's data — in our case, the DE-FR data — while the rest of the model parameters are fixed.

In [None]:
from models import AdapterTransformerDecoder, AdapterTransformerEncoderLayer

class AdapterLayer(nn.Module):
    # This class definition is just for show. Adapter layers are actually defined in models.py
    # Same adapter architecture as in this paper: https://arxiv.org/abs/1909.08478
    def __init__(self, input_dim, projection_dim):
        """
        input_dim: Transformer model's embedding dimension
        projection_dim: bottleneck dimension of the adapter (usually smaller than input_dim), can be tuned
        to control the amount of new parameters.
        """
        super().__init__()
        self.down = nn.Linear(input_dim, projection_dim)
        self.up = nn.Linear(projection_dim, input_dim)
        self.layer_norm = nn.LayerNorm(input_dim)
        # initialize the adapter weights to small values, so that it computes the identity function
        # (or close enough) at the beginning of training (i.e., it keeps the Transformer layer outputs mostly
        # unchanged)
        nn.init.uniform_(self.down.weight, -1e-6, 1e-6)
        nn.init.uniform_(self.up.weight, -1e-6, 1e-6)
        nn.init.zeros_(self.down.bias)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        y = self.layer_norm(x)
        # down projection to a bottleneck dimension
        y = self.down(y)
        # non-linearity
        y = F.relu(y)
        # up projection to the model's dimension
        y = self.up(y)
        # residual connection
        return x + y

class AdapterTransformerEncoder(models.TransformerEncoder):
    def __init__(self, *args, **kwargs):
        """
        Create a Transformer Encoder with adapter modules (that will be plugged in after each Transformer layer)
        """
        super().__init__(*args, **kwargs)
        for param in self.parameters():
            param.requires_grad = False  # only the adapters are trained

    def add_adapter(self, id, projection_dim, select=False, overwrite=False):
        # Create a new set of adapter modules
        for layer in self.layers:
            layer.add_adapter(id, projection_dim, overwrite=overwrite)
        if select:
            self.select_adapter(id)
            
    def select_adapter(self, id):
        # Use this method to activate a specific set of adapters (e.g., 'de-fr')
        # Set id=None to deactivate adapters (and use the initial Transformer model)
        for layer in self.layers:
            assert id is None or id in layer.adapters
            layer.adapter_id = id

    def build_layer(self, layer_id):
        # This method can be modified to add adapters only at some layers (e.g., first encoder layer)
        return AdapterTransformerEncoderLayer(self.embed_dim, self.heads, self.dropout_rate, self.ffn_dim)

In [None]:
encoder = AdapterTransformerEncoder(
    source_dict=multi_dict,
    embed_dim=512,
    num_layers=2,
    heads=4,
)
decoder = AdapterTransformerDecoder(
    target_dict=multi_dict,
    embed_dim=512,
    num_layers=1,
    heads=4,
    embed_tokens=encoder.embed_tokens,
)

adapter_model = models.EncoderDecoder(encoder, decoder, lr=0.0005, use_cuda=not cpu, max_len=MAX_LEN)

pretrained_checkpoint_path = os.path.join(multi_model_dir, 'transformer.pt')
# Load the pre-trained model's parameters.
# We reset the optimizer because its parameters do not match anymore and the learning rate might be too small.
adapter_model.load(pretrained_checkpoint_path, reset_optimizer=True)

encoder.add_adapter(
    'de-fr',
    projection_dim=64,  # bottleneck dimension of the adapters
)
decoder.add_adapter(
    'de-fr',
    projection_dim=64,
)  # adapters can also be used only in the encoder or decoder

new_checkpoint_path = os.path.join(model_root, 'de-en-fr', 'adapter-transformer.pt')

# Show the number of trained parameters.
# All Transformer parameters are frozen except the adapter parameters.
total_params = 0
trained_params = 0
for name, param in adapter_model.named_parameters():
    total_params += param.numel()
    if param.requires_grad:
        trained_params += param.numel()
print(f'Total parameters: {total_params}, trained parameters: {trained_params}')

In [None]:
# Activate the DE-FR adapters and train them on the DE-FR data (the other parameters are frozen)
# Note that you can do encoder.select_adapter(None) to train only decoder adapters
encoder.select_adapter('de-fr')
decoder.select_adapter('de-fr')
train_model(adapter_model, train_iterator, [valid_iterator], new_checkpoint_path, epochs=1)

### Turning on adapters for evaluation

After adapter training, we can turn on the DE-FR adapters to do inference. The advantage over full finetuning is that we can easily turn them off to translate in the other language pairs, and avoid the catastrophic forgetting issue.

We can see that with just 200K new parameters (2% of the initial model's size) we can adapt to the DE-FR direction without hurting performance for the other language pairs.
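
The 200K figure can be checked with a quick back-of-the-envelope computation. This sketch assumes one adapter per Transformer layer (2 encoder layers and 1 decoder layer), each with the `AdapterLayer` architecture shown earlier: a down projection, an up projection and a layer norm. `adapter_param_count` is a hypothetical helper written for this check.

```python
def adapter_param_count(input_dim, projection_dim):
    down = input_dim * projection_dim + projection_dim  # weight + bias
    up = projection_dim * input_dim + input_dim         # weight + bias
    layer_norm = 2 * input_dim                          # scale + shift
    return down + up + layer_norm

per_adapter = adapter_param_count(input_dim=512, projection_dim=64)
num_layers = 2 + 1  # encoder layers + decoder layers
total = per_adapter * num_layers

print(f'{per_adapter} parameters per adapter, {total} in total')
```

Under these assumptions, each adapter has 67,136 parameters, for a total of 201,408. The count grows roughly linearly with `projection_dim`, which makes the bottleneck size a convenient knob for trading capacity against the number of new parameters.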

In [None]:
# Activate the DE-FR adapters to translate in the DE-FR direction
encoder.select_adapter('de-fr')
decoder.select_adapter('de-fr')
chrf = evaluate_model(adapter_model, test_sets['fr-en'], test_sets['de-fr'])

In [None]:
# Deactivate the adapters to use the initial model (e.g., to translate in the English-centric directions).
encoder.select_adapter(None)
decoder.select_adapter(None)
chrf = evaluate_model(adapter_model, test_sets['fr-en'], test_sets['de-fr'])

In [None]:
# Use a `with` block to automatically deactivate the adapters after using them (this also creates them if they don't exist)
with adapter_model.adapter('de-fr'):
    chrf = evaluate_model(adapter_model, test_sets['de-fr'])

## Your turn!

1. Can you train adapters to support more language pairs? (e.g., FR-DE). You can download data and train BPE models for more languages by modifying and re-running `scripts/download-data.sh` (warning: avoid re-running it for `de` and `fr` as it will generate different test splits)
2. Can you achieve politeness control or gender control with adapters instead of control tags?
3. Another technique to add new language pairs is to re-train (or finetune) the entire model on the new language pair's data **plus** the original language pairs. Train your own {DE,FR,EN} -> {DE,FR,EN} multilingual model. Tip: you can use `data.concatenate_datasets(dataset_list)` to concatenate multiple datasets (created by `load_dataset`) into a single one, or `data.MultilingualBatchIterator(iterator_list)` to merge several batch iterators (created by `BatchIterator`) into a single one. The first and second solutions will respectively result in heterogeneous and homogeneous batches (i.e., batches containing sentence pairs from multiple language pairs or from a single one).

# 5. NLLB-200: a massively multilingual MT model

Meta AI released several models under the name "NLLB-200", which support 202 languages, many of which are not covered by any commercial MT engine to date (https://arxiv.org/abs/2207.04672).
The largest model is a mixture-of-experts model with 54B parameters, which is much too large for this notebook. But they also released smaller dense models of size: [3.3B, 1.3B, and 600M](https://github.com/facebookresearch/fairseq/tree/nllb#multilingual-translation-models). We will experiment here with the smallest model of 600M parameters.

## Play with the model

In [None]:
# NLLB takes quite a lot of GPU memory, move the previous models to CPU if you encounter OOM errors:
# utils.free_gpu_memory()

if not colab:
    !bash scripts/download-nllb.sh

nllb_model_dir = os.path.join(root_dir, 'pretrained_models', 'nllb')

nllb_dict = data.Dictionary.load(os.path.join(nllb_model_dir, 'dict.txt'))

NLLB_MAX_LEN = 100
NLLB_BATCH_SIZE = 512

# We initialize the model as a Transformer with adapters (even though it doesn't contain adapters yet),
# as this will be useful to finetune it
nllb_encoder = models.AdapterTransformerEncoder(
    source_dict=nllb_dict,
    embed_dim=1024,
    ffn_dim=4096,
    num_layers=12,
    heads=16,
    dropout=0.1,
    checkpointing=False,  # set to True if you get OOM errors
)
nllb_decoder = models.AdapterTransformerDecoder(
    target_dict=nllb_dict,
    embed_dim=1024,
    ffn_dim=4096,
    num_layers=12,
    heads=16,
    embed_tokens=nllb_encoder.embed_tokens,  # tied embeddings (multilingual models usually have shared source/target embeddings)
    dropout=0.1,
    checkpointing=False,  # set to True if you get OOM errors
)

nllb_model = models.EncoderDecoder(
    nllb_encoder,
    nllb_decoder,
    lr=0.0005,
    max_len=NLLB_MAX_LEN,
    use_cuda=not cpu,
    scheduler=models.WarmupLR,
    scheduler_args={'warmup': 500},
)

nllb_model.load(os.path.join(nllb_model_dir, '600M_distilled.pt'))

In [None]:
# mapping 2-letter language codes to NLLB's language codes (e.g., fr -> fra_Latn, en -> eng_Latn)
lang_code_mapping = {'af': 'afr_Latn', 'am': 'amh_Ethi', 'ar': 'arb_Arab', 'ast': 'ast_Latn', 'az': 'azj_Latn', 'ba': 'bak_Cyrl', 'be': 'bel_Cyrl', 'bn': 'ben_Beng', 'bs': 'bos_Latn', 'bg': 'bul_Cyrl', 'ca': 'cat_Latn', 'ceb': 'ceb_Latn', 'cs': 'ces_Latn', 'cy': 'cym_Latn', 'da': 'dan_Latn', 'de': 'deu_Latn', 'el': 'ell_Grek', 'en': 'eng_Latn', 'et': 'est_Latn', 'fi': 'fin_Latn', 'fr': 'fra_Latn', 'ff': 'fuv_Latn', 'gd': 'gla_Latn', 'ga': 'gle_Latn', 'gl': 'glg_Latn', 'gu': 'guj_Gujr', 'ht': 'hat_Latn', 'ha': 'hau_Latn', 'he': 'heb_Hebr', 'hi': 'hin_Deva', 'hr': 'hrv_Latn', 'hu': 'hun_Latn', 'hy': 'hye_Armn', 'ig': 'ibo_Latn', 'ilo': 'ilo_Latn', 'id': 'ind_Latn', 'is': 'isl_Latn', 'it': 'ita_Latn', 'jv': 'jav_Latn', 'ja': 'jpn_Jpan', 'kn': 'kan_Knda', 'ka': 'kat_Geor', 'kk': 'kaz_Cyrl', 'km': 'khm_Khmr', 'ko': 'kor_Hang', 'lo': 'lao_Laoo', 'ln': 'lin_Latn', 'lt': 'lit_Latn', 'lb': 'ltz_Latn', 'lg': 'lug_Latn', 'lv': 'lvs_Latn', 'ml': 'mal_Mlym', 'mr': 'mar_Deva', 'mk': 'mkd_Cyrl', 'mg': 'plt_Latn', 'mn': 'khk_Cyrl', 'my': 'mya_Mymr', 'nl': 'nld_Latn', 'no': 'nob_Latn', 'ne': 'npi_Deva', 'ns': 'nso_Latn', 'oc': 'oci_Latn', 'or': 'ory_Orya', 'pa': 'pan_Guru', 'fa': 'pes_Arab', 'pl': 'pol_Latn', 'pt': 'por_Latn', 'ps': 'pbt_Arab', 'ro': 'ron_Latn', 'ru': 'rus_Cyrl', 'si': 'sin_Sinh', 'sk': 'slk_Latn', 'sl': 'slv_Latn', 'sd': 'snd_Arab', 'so': 'som_Latn', 'es': 'spa_Latn', 'sq': 'als_Latn', 'sr': 'srp_Cyrl', 'ss': 'ssw_Latn', 'su': 'sun_Latn', 'sv': 'swe_Latn', 'sw': 'swh_Latn', 'ta': 'tam_Taml', 'tl': 'tgl_Latn', 'th': 'tha_Thai', 'tn': 'tsn_Latn', 'tr': 'tur_Latn', 'uk': 'ukr_Cyrl', 'ur': 'urd_Arab', 'uz': 'uzn_Latn', 'vi': 'vie_Latn', 'wo': 'wol_Latn', 'xh': 'xho_Latn', 'yi': 'ydd_Hebr', 'yo': 'yor_Latn', 'zh': 'zho_Hans', 'ms': 'zsm_Latn', 'zu': 'zul_Latn'}

nllb_bpe_path = os.path.join(nllb_model_dir, 'spm.model')
nllb_tokenizer = Tokenizer(nllb_bpe_path)

def preprocess_nllb(source_line, target_line, source_lang=None, target_lang=None):
    """
    source_lang and target_lang can each be either an NLLB language code (e.g., 'deu_Latn')
    or a short language code (e.g., 'de'), which is mapped automatically to the NLLB format
    """
    source_lang = lang_code_mapping.get(source_lang, source_lang)
    target_lang = lang_code_mapping.get(target_lang, target_lang)
    source_line = nllb_tokenizer.tokenize(source_line)
    target_line = nllb_tokenizer.tokenize(target_line)
    source_line = f'<lang:{source_lang}> {source_line}'.strip()
    target_prefix = f'<lang:{target_lang}>'
    target_line = f'{target_prefix} {target_line}'.strip()
    return source_line, target_line, target_prefix

In [None]:
nllb_test_sets = {}

for pair in 'en-fr', 'fr-en', 'de-fr':
    src, tgt = pair.split('-')
    path = os.path.join(data_dir, f'test.{pair}')
    dataset = load_dataset(path, src, tgt, preprocess_nllb, max_size=500)
    binarize(dataset, source_dict=nllb_dict, target_dict=nllb_dict, sort=False)
    iterator = BatchIterator(dataset, src, tgt, batch_size=NLLB_BATCH_SIZE, max_len=NLLB_MAX_LEN, shuffle=False)
    nllb_test_sets[pair] = iterator

In [None]:
chrf = evaluate_model(nllb_model, *nllb_test_sets.values())

In [None]:
# Pivot translation DE->EN->FR performs worse than direct translation DE->FR
src = list(nllb_test_sets['de-fr'].data['source_data'])
ref = list(nllb_test_sets['de-fr'].data['target_data'])
hyp = pivot_translation(nllb_model, src, preprocess_nllb, 'deu_Latn', 'fra_Latn', pivot_lang='eng_Latn')
import sacrebleu
chrf = sacrebleu.corpus_chrf(hyp, [ref]).score
print(f'de-fr (pivot): chrF={chrf:.2f}')
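
The chrF score used above compares character n-grams of the hypothesis and the reference. As a rough illustration only, here is a simplified sentence-level variant written from scratch. It is an assumption-laden sketch (whitespace removed, n-gram orders 1 to 6, beta = 2, plain averaging over orders) and will not reproduce `sacrebleu.corpus_chrf` exactly; the function names are made up for this example.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams after removing spaces."""
    text = text.replace(' ', '')
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: F-beta over averaged character n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

print(simple_chrf("le chat dort", "le chat mange"))
```

Because the score works on characters rather than words, a hypothesis that gets a word almost right (e.g., a wrong inflection) still earns partial credit.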

### Try translating to/from your native language

Find the language codes for your languages of interest on this page: https://github.com/facebookresearch/flores/blob/main/flores200/README.md#languages-in-flores-200
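
To make the role of these codes concrete, here is a small standalone sketch of how a short code or a full NLLB code ends up as a language tag on the source line. It mirrors the lookup-with-fallback used in `preprocess_nllb` but skips tokenization; `short_to_nllb`, `resolve_lang` and `tag_source` are names made up for this example, and the dictionary is only a subset of `lang_code_mapping`.

```python
# Subset of the notebook's lang_code_mapping, copied here so the sketch runs on its own
short_to_nllb = {'en': 'eng_Latn', 'fr': 'fra_Latn', 'de': 'deu_Latn'}

def resolve_lang(code):
    """Map a short code to its NLLB code; codes that are not in the mapping are returned unchanged."""
    return short_to_nllb.get(code, code)

def tag_source(sentence, lang):
    """Prefix the sentence with its language tag (tokenization is omitted in this sketch)."""
    return f'<lang:{resolve_lang(lang)}> {sentence}'.strip()

print(tag_source("Bonjour", 'fr'))         # short code, resolved through the mapping
print(tag_source("Bonjou", 'hat_Latn'))    # full NLLB code, passed through as-is
```

This is why `translate_nllb` accepts a code such as `'hat_Latn'` directly even when no short alias for it is defined.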

In [None]:
# short-hand for translating with the NLLB-200 model:
def translate_nllb(sentence, source_lang='eng_Latn', target_lang='fra_Latn', plot_attention=False,
                   max_len=NLLB_MAX_LEN):
    translate(
        nllb_model,
        sentence,
        preprocess_nllb,
        source_lang=source_lang,
        target_lang=target_lang,
        google_translate=False,
        plot_attention=plot_attention,
        max_len=max_len,
    )

In [None]:
translate_nllb("Hello, how are you?", target_lang='hat_Latn')

## Adapt NLLB-200 to the Tatoeba domain

NLLB-200 was trained on massive amounts of data crawled from the web, which makes it a very good generic model. However, the 600M version suffers from "negative interference", i.e., it lacks capacity to handle this many languages (the 3.3B version performs considerably better). Adapting it to a specific language direction can significantly improve its performance.
Moreover, we can improve its performance on a specific domain (e.g., our Tatoeba data) by finetuning it on data from that domain (AKA "domain adaptation").

Here we will train small adapters instead of finetuning the entire model, which requires much less GPU memory and is faster.
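
To get a feel for why adapters are cheap, the back-of-the-envelope sketch below counts the parameters of a bottleneck adapter (a down-projection to a small dimension followed by an up-projection back to the hidden size). Everything here is an assumption for illustration: the adapter layout, the hidden size of 1024, the 24 layers with one adapter each, and the helper name `adapter_param_count`. How `nllb_model.adapter` actually places its adapters is not shown in this section.

```python
def adapter_param_count(hidden_dim, projection_dim):
    """Parameters of one bottleneck adapter: two linear layers with biases."""
    down = hidden_dim * projection_dim + projection_dim   # hidden -> bottleneck
    up = projection_dim * hidden_dim + hidden_dim         # bottleneck -> hidden
    return down + up

hidden_dim = 1024        # assumed hidden size
num_layers = 24          # assumed: 12 encoder + 12 decoder layers, one adapter per layer
projection_dim = 64      # same value as projection_dim in the training cell

per_adapter = adapter_param_count(hidden_dim, projection_dim)
total = per_adapter * num_layers
share = total / 600e6    # relative to a model of roughly 600M parameters

print(f'{per_adapter:,} parameters per adapter')
print(f'{total:,} trainable parameters in total ({share:.2%} of the model)')
```

Under these assumptions only a small fraction of the weights receives gradients and optimizer state, which is where the memory savings come from.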

In [None]:
# Load EN-FR training data
src, tgt = 'en', 'fr'

train_path = os.path.join(data_dir, f'train.{src}-{tgt}')
valid_path = os.path.join(data_dir, f'valid.{src}-{tgt}')

nllb_train_data = load_dataset(train_path, src, tgt, preprocess_nllb, max_size=10000)
# set max_size to None to load the entire train set
nllb_valid_data = load_dataset(valid_path, src, tgt, preprocess_nllb, max_size=500)

binarize(nllb_train_data, source_dict=nllb_dict, target_dict=nllb_dict, sort=True)
binarize(nllb_valid_data, source_dict=nllb_dict, target_dict=nllb_dict, sort=False)

reset_seed()

nllb_train_iterator = BatchIterator(
    nllb_train_data, src, tgt, batch_size=NLLB_BATCH_SIZE, max_len=NLLB_MAX_LEN, shuffle=True
)
nllb_valid_iterator = BatchIterator(
    nllb_valid_data, src, tgt, batch_size=NLLB_BATCH_SIZE, max_len=NLLB_MAX_LEN, shuffle=False
)

In [None]:
# Train adapters on EN-FR Tatoeba (domain adaptation)
with nllb_model.adapter('en-fr', projection_dim=64, overwrite=True):
    nllb_model.reset_optimizer()  # this must always be done when adding new adapters
    train_model(nllb_model, nllb_train_iterator, [nllb_valid_iterator], checkpoint_path=None, epochs=1)

In [None]:
# Evaluate the model on the test set

print('# Original model')
chrf = evaluate_model(nllb_model, nllb_test_sets['en-fr'])
print()

print('# Adapted model')
with nllb_model.adapter('en-fr'):
    chrf = evaluate_model(nllb_model, nllb_test_sets['en-fr'])

## Robustness to noise

Try reading the following sentence: ```The poet was sntiitg alone in his own ltlite room on a very sortmy evienng; the wind was rnoirag otiudse, and the rian puerod dwon in tntorers.```

This is a noisy version of:
```The poet was sitting alone in his own little room on a very stormy evening; the wind was roaring outside, and the rain poured down in torrents.``` where all letters but the first and the last in each word have been shuffled.

Interestingly, it does not take us much effort to parse this sort of text. Let's see how NLLB-200 fares:

In [None]:
# Translate the noisy sentence (pass a larger max_len if the source or output gets truncated)
translate_nllb(
    "The poet was sntiitg alone in his own ltlite room on a very sortmy evienng; "
    "the wind was rnoirag otiudse, and the rian puerod dwon in tntorers."
)

We see that the model doesn't completely break but the translation is rather bad (read [this paper](https://arxiv.org/abs/1711.02173) for more information about this phenomenon).
Let's try to finetune NLLB-200 to improve its robustness to this specific type of noise.

`permute_letters` below takes a sentence and shuffles letters in each word except the first and last letter.

`preprocess_nllb_permute_letters` takes a pair of lines and shuffles the source side.

The goal is to train the model with noisy sources and clean targets to make it invariant to this sort of noise (i.e., the clean and noisy versions of the same sentence should give the same output).
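
The idea of "noisy sources, clean targets" can be sketched on a toy corpus without any of the notebook's helpers. The code below is only an illustration: `shuffle_inner_letters` and `make_noisy_pairs` are hypothetical names, the regex split is a stand-in for `split_by_punct`, and the two sentence pairs are made-up examples rather than Tatoeba data.

```python
import random
import re

def shuffle_inner_letters(sentence, rng):
    """Shuffle the letters of each word, keeping its first and last letter in place."""
    pieces = re.findall(r'\w+|\W+', sentence)  # runs of word / non-word characters
    noised = []
    for piece in pieces:
        if piece.isalpha() and len(piece) > 3:
            inner = list(piece[1:-1])
            rng.shuffle(inner)
            piece = piece[0] + ''.join(inner) + piece[-1]
        noised.append(piece)
    return ''.join(noised)

def make_noisy_pairs(pairs, seed=0):
    """Add noise to the source side only; the target side stays clean."""
    rng = random.Random(seed)
    return [(shuffle_inner_letters(src, rng), tgt) for src, tgt in pairs]

# Toy parallel corpus (hypothetical examples)
toy_corpus = [
    ("The rain poured down in torrents.", "La pluie tombait à torrents."),
    ("Someone ate all the cookies.", "Quelqu'un a mangé tous les biscuits."),
]

for noisy_src, clean_tgt in make_noisy_pairs(toy_corpus):
    print(f'{noisy_src} -> {clean_tgt}')
```

Since every noisy source is paired with the unchanged reference, the model is pushed to produce the same translation whether or not the input letters are scrambled.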

In [None]:
def permute_letters(sentence):
    words = split_by_punct(sentence)
    noised_words = []
    for word in words:
        if len(word) >= 3:
            word = word[0] + ''.join(np.random.permutation(list(word[1:-1]))) + word[-1]
        noised_words.append(word)
    return ''.join(noised_words)

def preprocess_nllb_permute_letters(source_line, target_line, source_lang=None, target_lang=None):
    return preprocess_nllb(
        permute_letters(source_line),
        target_line,
        source_lang=source_lang,
        target_lang=target_lang,
    )

### Your turn!

In [None]:
# Load EN-FR training data and add noise to its source side
src, tgt = 'en', 'fr'

train_path = os.path.join(data_dir, f'train.{src}-{tgt}')
valid_path = os.path.join(data_dir, f'valid.{src}-{tgt}')
test_path = os.path.join(data_dir, f'test.{src}-{tgt}')

reset_seed()

train_data_noisy = None   # to do
valid_data_noisy = None   # to do
test_data_noisy = None    # to do
implemented = False       # set to True once you're done

if implemented:
    test_iterator_clean = nllb_test_sets['en-fr']

    binarize(train_data_noisy, source_dict=nllb_dict, target_dict=nllb_dict, sort=True)
    train_iterator_noisy = BatchIterator(
        train_data_noisy, src, tgt, batch_size=NLLB_BATCH_SIZE, max_len=NLLB_MAX_LEN, shuffle=True
    )
    
    binarize(valid_data_noisy, source_dict=nllb_dict, target_dict=nllb_dict, sort=False)
    valid_iterator_noisy = BatchIterator(
        valid_data_noisy, src, tgt, batch_size=NLLB_BATCH_SIZE, max_len=NLLB_MAX_LEN, shuffle=False
    )

    binarize(test_data_noisy, source_dict=nllb_dict, target_dict=nllb_dict, sort=False)
    test_iterator_noisy = BatchIterator(
        test_data_noisy, src, tgt, batch_size=NLLB_BATCH_SIZE, max_len=NLLB_MAX_LEN, shuffle=False
    )

In [None]:
if implemented:
    chrf = evaluate_model(nllb_model, valid_iterator_noisy)

### Now train adapters called `en-fr-noisy`

In [None]:
# To do: train noise adapters here

In [None]:
translate_nllb(
    "Someone ate all the cookies from the cookie jar.",
)

translate_nllb(
    "Seoonme ate all the coieoks from the cikooe jar.",
)

translate_nllb(
    "The poet was sntiitg alone in his own ltlite room on a very sortmy evienng; "
    "the wind was rnoirag otiudse, and the rian puerod dwon in tntorers."
)

### Now test your noise adapters

In [None]:
# To do: test your noise adapters on the examples above and compute chrF scores

### Let's try other types of noise

In [None]:
def identity(line):
    return line
def alphanum(line):
    return re.sub(r'\W', '', line)
def capitalized(line):
    return line.upper()
def lowercase(line):
    return line.lower()

letters = list('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')

def char_noise(line):
    chars = list(line + ' ')
    for i in range(len(chars)):
        p = np.random.rand()
        # cumulative thresholds, so that each edit type is applied with 5% probability
        if p < 0.05:    # sub (5% prob)
            chars[i] = np.random.choice(letters)
        elif p < 0.10:  # del (5% prob)
            chars[i] = ''
        elif p < 0.15:  # ins (5% prob)
            chars[i] = np.random.choice(letters) + chars[i]
        # otherwise leave the character unchanged (85% prob)
    return ''.join(chars)

src, tgt = 'en', 'fr'
valid_path = os.path.join(data_dir, f'valid.{src}-{tgt}')

for noise_fn in identity, permute_letters, alphanum, capitalized, lowercase, char_noise:
    def preprocess_(source_line, *args, **kwargs):
        return preprocess_nllb(noise_fn(source_line), *args, **kwargs)

    reset_seed()
    valid_data_ = load_dataset(valid_path, src, tgt, preprocess_, max_size=500)
    binarize(valid_data_, source_dict=nllb_dict, target_dict=nllb_dict, sort=False)
    valid_iterator_ = BatchIterator(
        valid_data_, src, tgt, batch_size=NLLB_BATCH_SIZE, max_len=NLLB_MAX_LEN, shuffle=False
    )
    
    print('#', noise_fn.__name__)
    example = noise_fn("Someone ate all the cookies from the cookie jar.")
    translation = get_translations(nllb_model, [example], preprocess_nllb, src, tgt)['predictions_detok'][0]
    print(f'Example: "{example}" -> "{translation}"')
    chrf = evaluate_model(nllb_model, valid_iterator_)
    print()

# Check out the other topics in MT!
There are also many other interesting and important research topics in MT which are not covered in this lab session.

Here are some of them:
- **Unsupervised or low-resource MT**
  - How can we leverage monolingual data when bilingual data is not available or extremely scarce?
  - How can we improve the performance of low-resource language pairs?
- **Document-level context-aware MT**
  - How can we effectively translate a text containing multiple sentences while keeping the translation coherent and faithful?
- **Domain-adapted or personalized MT, continual learning for MT**
  - How can we extend an existing model to new domains, new language pairs, or simply newly added data?
- **Efficient MT**
  - How can we train and serve MT models more efficiently (both in terms of memory usage and CPU/GPU computation)?
  
If you are interested in finding out more about MT, you can check out the [WMT conference](https://www.statmt.org/wmt22/) that is held annually.