# Transformers for Machine Translation Tasks

The purpose of this notebook is to work with transformer models for machine translation (MT) using three different approaches:
1. Training a transformer model from scratch.
2. Using a pre-trained model.
3. Fine-tuning a pre-trained model.

The translation task we are going to study is an early example of a data-driven task defined in the field of MT. The task is called EUTRANS-I, or simply the Traveller task, and its goal is to translate, from Spanish to English, a set of sentences involving human-to-human communication situations at the front desk of a hotel. EUTRANS-I was generally tackled by means of statistical machine translation (SMT) techniques. SMT was the state-of-the-art technology preceding the advent of deep learning applications in NLP. EUTRANS-I is useful for us due to its simplicity and small size, which will allow us to execute translation experiments quickly.

An MT dataset typically consists of parallel files containing sentences or paragraphs in the source language and their corresponding translations in the target language. Separate pairs of files are often provided for training, validation and testing purposes. The original EUTRANS-I dataset was composed of 10K sentence pairs for training and 3K sentence pairs for testing, with no validation set. For this session, the original test part has been randomly shuffled and used to generate two subsets of 500 sentence pairs each, which will serve as the validation and test sets (we are not using the whole original test set in order to speed up calculations).
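
The split described above can be sketched in plain Python. This is only an illustration of the idea, not the script that produced the session files: the function name `split_validation_test`, the seed, and the placeholder sentence pairs are all assumptions.

```python
import random

def split_validation_test(pairs, subset_size=500, seed=0):
    """Shuffle sentence pairs and return two disjoint subsets of equal size."""
    if len(pairs) < 2 * subset_size:
        raise ValueError("not enough pairs for two disjoint subsets")
    shuffled = list(pairs)                  # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility (assumed seed)
    validation = shuffled[:subset_size]
    test = shuffled[subset_size:2 * subset_size]
    return validation, test

# Stand-in for the 3K original test pairs (placeholder strings, not real data).
original_test = [(f"es_{i}", f"en_{i}") for i in range(3000)]
valid_pairs, test_pairs = split_validation_test(original_test)
print(len(valid_pairs), len(test_pairs))
```

Taking both subsets from one shuffled list guarantees that no sentence pair ends up in both the validation and the test set.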

In addition, a smaller version of the EUTRANS-I dataset with only 2000 training pairs has been created for this session. This dataset may be useful for debugging purposes or to speed up calculations if GPUs are not available (the notebook is configured to run on GPUs, but their availability depends on different factors).

Both versions of the EUTRANS-I dataset are included in the materials for this session and it is recommended that the corresponding folders are put in a Google Drive folder. This is because the easiest way to execute this notebook is to resort to Google Colab.

## Translating with a Model Trained From Scratch

The first approach to MT we are going to explore consists of defining a transformer model and training it from scratch. Working with deep learning architectures has become increasingly affordable thanks to two libraries: [TensorFlow](https://www.tensorflow.org/) and [PyTorch](https://pytorch.org/). This session uses PyTorch, which can be installed locally or in the cloud (if we use Google Colab). Some of the commands executed in this notebook install the required software. Although some of the requested package versions are no longer the latest ones, they have been pinned here to avoid known bugs and incompatibilities.

The content of this section is based on the [official documentation of PyTorch for transformer models](https://pytorch.org/tutorials/beginner/translation_transformer.html). The most relevant modifications that have been made here concern data handling and enabling the comparison of translation quality between this approach and the other two approaches to be tested. In addition, the code has been largely reordered.

In [None]:
!pip install torch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 torchtext==0.14.1 torchdata==0.5.1 -f https://download.pytorch.org/whl/torch_stable.html

### Define the Transformer Code

The definition of the transformer is based on the foundational paper [Attention Is All You Need](https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf). The defined network has three parts:

1. The embedding layer, which maps a sparse representation of source tokens into a dense one. The embeddings are further augmented with positional encodings to provide position information.
2. The actual transformer model.
3. A linear output layer providing unnormalized scores (logits) for each target language token.
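
To make the positional encodings of the first part more concrete, here is a minimal sketch in plain Python of the sinusoidal formula, written to mirror the arithmetic of the `PositionalEncoding` class in the next cell. The function name `positional_encoding` and the tiny sizes are assumptions chosen for readability.

```python
import math

def positional_encoding(max_len, emb_size):
    """Return a max_len x emb_size table of sinusoidal position encodings."""
    table = [[0.0] * emb_size for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, emb_size, 2):
            # Same frequency term as exp(-i * log(10000) / emb_size) in the class below.
            angle = pos * math.exp(-i * math.log(10000) / emb_size)
            table[pos][i] = math.sin(angle)          # even dimensions use sine
            if i + 1 < emb_size:
                table[pos][i + 1] = math.cos(angle)  # odd dimensions use cosine
    return table

pe = positional_encoding(max_len=4, emb_size=6)
for row in pe:
    print([round(v, 3) for v in row])
```

Each row is the vector added to the token embedding at that position, so two identical words at different positions receive different inputs.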

In [None]:
from torch import Tensor
import torch
import torch.nn as nn
from torch.nn import Transformer
import math
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(DEVICE)

# helper Module that adds positional encoding to the token embedding to introduce a notion of word order.
class PositionalEncoding(nn.Module):
    def __init__(self,
                 emb_size: int,
                 dropout: float,
                 maxlen: int = 5000):
        super(PositionalEncoding, self).__init__()
        den = torch.exp(- torch.arange(0, emb_size, 2)* math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)

        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: Tensor):
        return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0), :])

# helper Module to convert tensor of input indices into corresponding tensor of token embeddings
class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size):
        super(TokenEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)

# Seq2Seq Network
class Seq2SeqTransformer(nn.Module):
    def __init__(self,
                 num_encoder_layers: int,
                 num_decoder_layers: int,
                 emb_size: int,
                 nhead: int,
                 src_vocab_size: int,
                 tgt_vocab_size: int,
                 dim_feedforward: int = 512,
                 dropout: float = 0.1):
        super(Seq2SeqTransformer, self).__init__()
        self.transformer = Transformer(d_model=emb_size,
                                       nhead=nhead,
                                       num_encoder_layers=num_encoder_layers,
                                       num_decoder_layers=num_decoder_layers,
                                       dim_feedforward=dim_feedforward,
                                       dropout=dropout)
        self.generator = nn.Linear(emb_size, tgt_vocab_size)
        self.src_tok_emb = TokenEmbedding(src_vocab_size, emb_size)
        self.tgt_tok_emb = TokenEmbedding(tgt_vocab_size, emb_size)
        self.positional_encoding = PositionalEncoding(
            emb_size, dropout=dropout)

    def forward(self,
                src: Tensor,
                trg: Tensor,
                src_mask: Tensor,
                tgt_mask: Tensor,
                src_padding_mask: Tensor,
                tgt_padding_mask: Tensor,
                memory_key_padding_mask: Tensor):
        src_emb = self.positional_encoding(self.src_tok_emb(src))
        tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
        outs = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None,
                                src_padding_mask, tgt_padding_mask, memory_key_padding_mask)
        return self.generator(outs)

    def encode(self, src: Tensor, src_mask: Tensor):
        return self.transformer.encoder(self.positional_encoding(
                            self.src_tok_emb(src)), src_mask)

    def decode(self, tgt: Tensor, memory: Tensor, tgt_mask: Tensor):
        return self.transformer.decoder(self.positional_encoding(
                          self.tgt_tok_emb(tgt)), memory,
                          tgt_mask)

### Define Data Related Functions

The following code defines different functions related to data handling, including:

- A class to create a PyTorch dataset from parallel text files.
- Functions related to word masking that prevent the model from looking at future words when making predictions.
- Functions to convert data samples into numerical vectors or *tensors* that will be manipulated inside the transformer model.
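
As a rough illustration of the masking idea, the sketch below builds the two kinds of mask in plain Python lists instead of tensors. The helper names `causal_mask` and `padding_mask` are hypothetical; the value conventions (0.0 for allowed, `-inf` for blocked, `True` for padding) follow the functions defined in the next cell.

```python
NEG_INF = float("-inf")

def causal_mask(size):
    """Row i may attend to columns 0..i (0.0); later columns are blocked (-inf)."""
    return [[0.0 if col <= row else NEG_INF for col in range(size)]
            for row in range(size)]

def padding_mask(token_ids, pad_idx):
    """True marks padding positions that attention should ignore."""
    return [tok == pad_idx for tok in token_ids]

for row in causal_mask(4):
    print(row)
print(padding_mask([2, 15, 8, 3, 1, 1], pad_idx=1))
```

The causal mask is added to the attention scores before the softmax, so a blocked position ends up with zero attention weight.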

In [None]:
from typing import Iterable, List

class PlainTextParallelFilesDataset(torch.utils.data.Dataset):
    def __init__(self, src_fname, trg_fname) -> None:
        self.src_fname = src_fname
        self.trg_fname = trg_fname
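        # Iterating over the file yields one line at a time; this also avoids the
        # spurious empty last sample that read().split('\n') produces when the
        # file ends with a newline.
        with open(self.src_fname, encoding='utf-8') as src_file:
            self.source = [line.rstrip('\n') for line in src_file]
        with open(self.trg_fname, encoding='utf-8') as trg_file:
            self.target = [line.rstrip('\n') for line in trg_file]

    def __getitem__(self, idx) -> tuple: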
        # load one sample by index, e.g like this:
        source_sample = self.source[idx]
        target_sample = self.target[idx]

        return source_sample, target_sample

    def __len__(self):
        return len(self.source)

    def get_source(self):
        return self.source

    def get_target(self):
        return self.target

def generate_square_subsequent_mask(sz):
    mask = (torch.triu(torch.ones((sz, sz), device=DEVICE)) == 1).transpose(0, 1)
    mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask

def create_mask(src, tgt):
    src_seq_len = src.shape[0]
    tgt_seq_len = tgt.shape[0]

    tgt_mask = generate_square_subsequent_mask(tgt_seq_len)
    src_mask = torch.zeros((src_seq_len, src_seq_len),device=DEVICE).type(torch.bool)

    src_padding_mask = (src == PAD_IDX).transpose(0, 1)
    tgt_padding_mask = (tgt == PAD_IDX).transpose(0, 1)
    return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask

from torch.nn.utils.rnn import pad_sequence

# helper function to club together sequential operations
def sequential_transforms(*transforms):
    def func(txt_input):
        for transform in transforms:
            txt_input = transform(txt_input)
        return txt_input
    return func

# function to add BOS/EOS and create tensor for input sequence indices
def tensor_transform(token_ids: List[int]):
    return torch.cat((torch.tensor([BOS_IDX]),
                      torch.tensor(token_ids),
                      torch.tensor([EOS_IDX])))

# function to collate data samples into batch tensors
def collate_fn(batch):
    src_batch, tgt_batch = [], []
    for src_sample, tgt_sample in batch:
        src_batch.append(text_transform[SRC_LANGUAGE](src_sample.rstrip("\n")))
        tgt_batch.append(text_transform[TGT_LANGUAGE](tgt_sample.rstrip("\n")))

    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX)
    return src_batch, tgt_batch

### Define Training, Loss Evaluation and Decoding Functions

The following code defines functions useful to train and evaluate the model, and also to infer the translation of sentences that were not seen during the training stage (the translation process is also referred to as *decoding*).
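
The decoding strategy used in this notebook is greedy: at each step the highest-scoring token is appended to the output. The toy sketch below shows only that control flow. The score table `TOY_SCORES` is invented for illustration and depends on the previous token alone, whereas a real decoder conditions on the source sentence and on the whole prefix.

```python
BOS, EOS = "<bos>", "<eos>"

# Hypothetical next-token scores, keyed by the previous token only.
TOY_SCORES = {
    BOS: {"good": 2.0, "morning": 0.5, EOS: 0.1},
    "good": {"morning": 1.5, "good": 0.2, EOS: 0.3},
    "morning": {EOS: 2.5, "good": 0.4, "morning": 0.1},
}

def greedy_decode_toy(scores, max_len):
    """Pick the highest-scoring token at each step until EOS or max_len."""
    output = [BOS]
    for _ in range(max_len - 1):
        candidates = scores[output[-1]]
        next_token = max(candidates, key=candidates.get)
        output.append(next_token)
        if next_token == EOS:
            break
    return output

print(greedy_decode_toy(TOY_SCORES, max_len=10))
```

Greedy search is fast but never revisits a choice; beam search, which keeps several candidate prefixes, is a common alternative.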

Training a deep neural network is a complex and sophisticated process that cannot be explained here in a detailed manner. An intuitive idea behind model training is that the internal weights (or parameters) of the network are adjusted so as to allow the model to generate the target translation contained in the training set for each sentence in the source language. When the model translates each source sentence, the generated translation may not be exactly equal to the one given as reference. The amount of difference between the generated outputs and the references is measured by means of a mathematical function called *loss function*. The resulting loss is used to adjust the weights of the network so that it can make better predictions in the future. Typically, we are interested in calculating the loss of the model, not only for the training set, but also for the validation set, so as to detect common machine learning problems such as model overfitting.
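
The loss function used later in this notebook is cross-entropy with padding positions ignored. The following plain-Python sketch shows what that computes for a handful of token positions. It is a simplified stand-in, not PyTorch's implementation, and the logits are made-up numbers.

```python
import math

def cross_entropy(logits, target, ignore_index=None):
    """Mean negative log-likelihood of the target tokens, skipping ignore_index."""
    total, counted = 0.0, 0
    for scores, tgt in zip(logits, target):
        if tgt == ignore_index:
            continue                      # padding does not contribute to the loss
        peak = max(scores)                # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[tgt]   # equals -log softmax(scores)[tgt]
        counted += 1
    return total / counted if counted else 0.0

PAD_IDX = 1
logits = [[0.1, 0.0, 3.0, 0.2],   # confident and correct (target 2)
          [2.0, 0.0, 0.1, 0.3],   # confident and wrong (target 3)
          [0.0, 5.0, 0.0, 0.0]]   # padding position, ignored
targets = [2, 3, PAD_IDX]
print(round(cross_entropy(logits, targets, ignore_index=PAD_IDX), 3))
```

A confident wrong prediction is penalized much more than a confident correct one, which is what pushes the weights towards the reference translation.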

The training process is structured as a series of *epochs*. During training, we say that an epoch was executed when every example in the training dataset has been used once to update the model parameters.
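
Because the parameters are updated once per batch, the number of updates in an epoch follows directly from the dataset and batch sizes. The short calculation below uses the batch size of 128 and the 5 epochs configured later in this notebook; the helper name `updates_per_epoch` is an assumption.

```python
import math

def updates_per_epoch(num_examples, batch_size):
    """Number of parameter updates in one epoch (the last batch may be smaller)."""
    return math.ceil(num_examples / batch_size)

BATCH_SIZE = 128   # value used later in this notebook
NUM_EPOCHS = 5     # value used later in this notebook
for name, size in [("full training set", 10_000), ("small training set", 2_000)]:
    steps = updates_per_epoch(size, BATCH_SIZE)
    print(f"{name}: {steps} updates per epoch, {NUM_EPOCHS * steps} over {NUM_EPOCHS} epochs")
```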

Before starting the training process, the input text is preprocessed. The goal of preprocessing is to make the modelling process easier and more effective. One fundamental step of text preprocessing is *tokenization*. In its most basic form, tokenization consists of splitting the raw text to be translated into individual words or tokens, although it may involve additional and more complex operations. Typically, the tokenization process separates punctuation symbols from words. For instance, the string `"Hello World!"` would be tokenized into the token list `["Hello", "World", "!"]`.
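
A very small regular-expression tokenizer is enough to reproduce the example above. This is a deliberately simplified stand-in for the Moses tokenizer used in this notebook, which handles many more cases (abbreviations, apostrophes, numbers); the function name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters/digits (Unicode-aware, so accents are kept);
    # [^\w\s] matches any single character that is neither a word character nor a space.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Hello World!"))
print(simple_tokenize("me gustaría cambiarme a otra habitación, por favor."))
```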

Since the model works with tokenized data, its output is also going to be tokenized unless we apply a *detokenization* step. This step has been incorporated into the code of the `translate` function so as to enable the comparison of translation quality results between approaches. This is because the standard translation quality metric to be applied expects raw text as input (see more on this below).
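
The inverse operation can be sketched just as briefly. The function below is a simplified stand-in for the Moses detokenizer, handling only a few closing punctuation marks; `simple_detokenize` is a hypothetical name.

```python
import re

def simple_detokenize(tokens):
    """Join tokens with spaces, then attach closing punctuation to the previous word."""
    text = " ".join(tokens)
    return re.sub(r"\s+([.,!?;:])", r"\1", text)

print(simple_detokenize(["Hello", "World", "!"]))
print(simple_detokenize(["a", "room", ",", "please", "."]))
```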

In [None]:
!pip install mosestokenizer sacremoses

from mosestokenizer import MosesTokenizer, MosesDetokenizer
from torch.utils.data import DataLoader

def train_epoch(model, optimizer, train_iter, loss_fn):
    model.train()
    losses = 0
    train_dataloader = DataLoader(train_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    for src, tgt in train_dataloader:
        src = src.to(DEVICE)
        tgt = tgt.to(DEVICE)

        tgt_input = tgt[:-1, :]

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask, tgt_mask,src_padding_mask, tgt_padding_mask, src_padding_mask)

        optimizer.zero_grad()

        tgt_out = tgt[1:, :]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()

        optimizer.step()
        losses += loss.item()

    # len() of a DataLoader is its number of batches; no need to iterate it again
    return losses / len(train_dataloader)

@torch.no_grad()  # gradients are not needed for evaluation; saves memory and time
def evaluate_loss(model, val_iter, loss_fn):
    model.eval()
    losses = 0

    val_dataloader = DataLoader(val_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    for src, tgt in val_dataloader:
        src = src.to(DEVICE)
        tgt = tgt.to(DEVICE)

        tgt_input = tgt[:-1, :]

        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask, tgt_mask,src_padding_mask, tgt_padding_mask, src_padding_mask)

        tgt_out = tgt[1:, :]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        losses += loss.item()

    # len() of a DataLoader is its number of batches; no need to iterate it again
    return losses / len(val_dataloader)

# function to generate output sequence using greedy algorithm
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)

    memory = model.encode(src, src_mask)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    for i in range(max_len-1):
        memory = memory.to(DEVICE)
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                    .type(torch.bool)).to(DEVICE)
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()

        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys


# actual function to translate input sentence into target language
def translate(model: torch.nn.Module, src_sentence: str):
    model.eval()
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = greedy_decode(
        model,  src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).flatten()
    word_list = vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))
    for sym in ["<bos>", "<eos>"]:
      if sym in word_list:
        word_list.remove(sym)
    # Pass the language code stored in TGT_LANGUAGE, not the literal string 'TGT_LANGUAGE'
    return MosesDetokenizer(TGT_LANGUAGE)(word_list)

### Enable Metrics Computation

A fundamental aspect of building a machine learning system is evaluating its performance. The evaluation can be manual or automatic. Typically, manual evaluation is expensive and subjective, since it requires the participation of human experts. On the other hand, automatic evaluation measures are cheap and objective, even if they are not able to capture information that a human expert evaluation would provide.

One of the standard translation quality measures defined in the field of MT is the [BLEU score](https://aclanthology.org/P02-1040.pdf). Given a reference text and the text generated by a translation system, the BLEU score is a number between 0 and 1 that measures translation quality, with 1 being a perfect match. The BLEU score is often expressed as a percentage.

The BLEU score can be easily calculated by installing the `evaluate` package provided by [Hugging Face](https://huggingface.co/), whose libraries will also be used to work with pre-trained models. The `evaluate` package gives access to a standardized implementation of the BLEU score called `sacrebleu`. One important aspect of BLEU score calculation with `sacrebleu` is that the text should be provided in raw (untokenized) form, which is why the translation system needs a detokenization step. Note that `sacrebleu` reports scores on a 0 to 100 scale rather than 0 to 1.
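
At the core of BLEU is the clipped (or modified) n-gram precision. The sketch below computes only that component for a single reference, in plain Python; it is not the `sacrebleu` implementation, which also combines several n-gram orders, applies a brevity penalty, and does its own tokenization. The function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as
    many times as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total

reference = "the cat is on the mat".split()
candidate = "the the the the the the the".split()
print(modified_precision(candidate, reference, 1))  # 2 of 7 unigrams survive clipping
print(modified_precision(reference, reference, 2))  # identical text: every bigram matches
```

Clipping is what stops a degenerate output that repeats a frequent word from scoring well.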

In [None]:
!pip install evaluate sacrebleu

import numpy as np
import evaluate

metric = evaluate.load("sacrebleu")

### Prepare Data

Before starting the training process, we need to load and preprocess data. For this purpose, a parallel dataset object is created for the training, validation and test sets that compose the EUTRANS-I translation task. In addition to this, each sentence to be processed by the model should be tokenized and the resulting words converted into numerical identifiers. Finally, special symbols (such as the begin-of-sentence or end-of-sentence symbols) are added to the vector of identifiers before transforming it into a tensor that can be manipulated inside the network.
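
The numericalization steps just described can be sketched without any library. The special-symbol indices match those used in the next cell; everything else (function names, the frequency-then-alphabetical ordering, the toy sentences) is an assumption for illustration and may differ from what `torchtext` does internally.

```python
from collections import Counter

SPECIALS = ["<unk>", "<pad>", "<bos>", "<eos>"]   # indices 0, 1, 2, 3
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3

def build_vocab(tokenized_sentences, min_freq=1):
    """Map each token to an integer id, with the special symbols first."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    vocab = {sym: idx for idx, sym in enumerate(SPECIALS)}
    # Assumed ordering: most frequent first, ties broken alphabetically.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Convert tokens to ids (unknown words map to UNK) and wrap them with BOS/EOS."""
    ids = [vocab.get(tok, UNK_IDX) for tok in tokens]
    return [BOS_IDX] + ids + [EOS_IDX]

train = [["una", "habitación", "doble"], ["una", "habitación", "individual"]]
vocab = build_vocab(train)
print(vocab)
print(encode(["una", "habitación", "triple"], vocab))
```

The word "triple" never appears in the toy training data, so it is mapped to the `<unk>` index; this is the role of `set_default_index` in the cell below.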

It is important to stress that the parallel files of the EUTRANS-I translation task are assumed to be stored in a Google Drive directory whose path should be provided in the code below. If the notebook is executed locally instead of using Google Colab, the files must be loaded in a different way and the code would need to be modified accordingly.

In [None]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from google.colab import drive
drive.mount('/content/drive')

# Set variables
SRC_LANGUAGE = 'es'
TGT_LANGUAGE = 'en'

# Place-holders
token_transform = {}
vocab_transform = {}

# Create source and target language tokenizer. Make sure to install the dependencies.
# https://pytorch.org/text/stable/data_utils.html
token_transform[SRC_LANGUAGE] = get_tokenizer("moses")
token_transform[TGT_LANGUAGE] = get_tokenizer("moses")

# helper function to yield list of tokens
def yield_tokens(data_iter: Iterable, language: str) -> List[str]:
    language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}

    for data_sample in data_iter:
        yield token_transform[language](data_sample[language_index[language]])

# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

# Data iterators
# NOTE: EDIT THE CONTENT OF THE "datadir" VARIABLE TO POINT TO THE GOOGLE DRIVE
#       DIRECTORY WHERE YOU HAVE COPIED YOUR FILES. IT CAN BE INTERESTING TO 
#       START USING THE SMALL VERSION OF THE DATASET AND AFTER FINISHING THE
#       EXPERIMENTS REPEAT THEM WITH THE LARGER VERSION
#datadir = "/content/drive/MyDrive/Colab Notebooks/ub_nlp_nnlm/data/"
datadir = "/content/drive/MyDrive/Colab Notebooks/ub_nlp_nnlm/data_small/"
train_iter = PlainTextParallelFilesDataset(datadir + "es.train", datadir + "en.train")
val_iter = PlainTextParallelFilesDataset(datadir + "es.valid", datadir + "en.valid")
test_iter = PlainTextParallelFilesDataset(datadir + "es.test", datadir + "en.test")

for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Create torchtext's Vocab object
    vocab_transform[ln] = build_vocab_from_iterator(yield_tokens(train_iter, ln),
                                                    min_freq=1,
                                                    specials=special_symbols,
                                                    special_first=True)
    vocab_transform[ln].set_default_index(vocab_transform[ln]['<unk>'])
    
# src and tgt language text transforms to convert raw strings into tensors indices
# NOTE: sequential_transforms is a composition of functions that is stored in 
#       text_transform. text_transform is later used in collate function
text_transform = {}
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    text_transform[ln] = sequential_transforms(token_transform[ln], # Tokenization
                                               vocab_transform[ln], # Numericalization
                                               tensor_transform) # Add BOS/EOS and create tensor

### Instantiate Model

The code below instantiates a `Seq2SeqTransformer` object.

In [None]:
torch.manual_seed(0)

SRC_VOCAB_SIZE = len(vocab_transform[SRC_LANGUAGE])
TGT_VOCAB_SIZE = len(vocab_transform[TGT_LANGUAGE])
EMB_SIZE = 512
NHEAD = 8
FFN_HID_DIM = 512
BATCH_SIZE = 128
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3

transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                                 NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)

for p in transformer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

transformer = transformer.to(DEVICE)

loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

### Train Model

At this point, everything is ready to start training the model by completing a series of epochs. After each epoch is completed, the loss of the training and validation sets is reported.
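
One way to read these per-epoch reports is to look for the epoch at which the validation loss stops improving. The sketch below does this on invented loss values; they are illustrative only and are not results obtained with this notebook. The helper name `best_epoch` is hypothetical.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1

# Made-up loss curves for illustration only (not results from this notebook).
train_losses = [4.2, 2.9, 2.1, 1.6, 1.2, 0.9]
val_losses = [4.0, 3.0, 2.4, 2.3, 2.5, 2.8]

epoch = best_epoch(val_losses)
print(f"lowest validation loss at epoch {epoch}")
if epoch < len(val_losses):
    print("validation loss rose afterwards while training loss kept falling: possible overfitting")
```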

In [None]:
from timeit import default_timer as timer
NUM_EPOCHS = 5
  
for epoch in range(1, NUM_EPOCHS+1):
    start_time = timer()
    train_loss = train_epoch(transformer, optimizer, train_iter, loss_fn)
    end_time = timer()
    val_loss = evaluate_loss(transformer, val_iter, loss_fn)
    print((f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}, "f"Epoch time = {(end_time - start_time):.3f}s"))

### Generate Sample Translations

After training the model, we are ready to use it for generating some translations. To do this we only need to call the `translate` function.

In [None]:
print(translate(transformer, "me gustaría cambiarme a otra habitación con teléfono, por favor."))
print(translate(transformer, "Me agradaría cambiarme a otra habitación con teléfono, por favor."))
print(translate(transformer, "Esto es un ejemplo de frase que no pertenece al contexto del lenguaje usado en el mostrador de un hotel."))

### Compute Metrics for Test Set (**Exercise**)

Finally, we calculate the BLEU score when translating the EUTRANS-I test set with the trained transformer model. Completing this code is left as an exercise. Consult the documentation about BLEU score computation [here](https://huggingface.co/spaces/evaluate-metric/bleu) if necessary.

In [None]:
import random

# Obtain translations for test set
# TO-BE-DONE

# Show 5 random source/prediction pairs
# TO-BE-DONE

# Compute BLEU score
# TO-BE-DONE


## Translating with a Pre-Trained Model

Instead of training a translation system from scratch, it is becoming more and more frequent to use the so-called *pre-trained* models. Pre-trained models are general machine learning models composed of a huge number of parameters which have been estimated on massive amounts of training data. This pre-training process is very costly in terms of computational resources and thus not affordable to regular machine learning practitioners. However, a pre-trained model can be downloaded by a regular user and utilized as a starting point to implement a machine learning system for a particular task using much less data and computational resources. This goal can be achieved by means of *transfer learning* techniques, and in particular *fine-tuning*. The term fine-tuning refers to the process of adapting the parameters of a pre-trained model to a particular task using a small, task-specific dataset.

Despite the fact that pre-trained systems often are not intended to be applied directly, in this section we are going to use and evaluate a pre-trained model without fine-tuning for educational purposes (the final fine-tuning stage will be executed and evaluated separately at the end of this notebook).

Just as PyTorch and TensorFlow made deep learning more accessible, working with pre-trained models is now far easier thanks to the tools provided by [Hugging Face](https://huggingface.co/), a company that develops open-source software libraries for natural language processing and other machine learning tasks, with a strong emphasis on providing pre-trained transformers.

We start the work in this section by installing some Hugging Face packages useful to work with pre-trained models. 

The rest of the content is again an adapted version of official documentation, in this case provided by Hugging Face [here](https://huggingface.co/docs/transformers/tasks/translation). This documentation can be particularly useful to find more information about the different classes and functions that are going to be used below.

In [None]:
# Transformers installation
!pip install transformers datasets sentencepiece

### Generate Datasets

The first thing we will do is load the EUTRANS-I files into Hugging Face `Dataset` objects. It is worth noting that, in addition to the training, validation and test parallel files, we now have an additional pair of files for fine-tuning (files with extension `finetun`). These files just contain a small subset of a few hundred sentences extracted from the training set (remember that the training set contains 10K sentence pairs). We will use this smaller training dataset to simulate a hypothetical scenario where we have limited training data to fine-tune the pre-trained model.
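
Before looking at the loading code, it may help to see the column-oriented layout it produces on a tiny example: one `id` column and one `translation` column holding a dictionary per sentence pair. The sketch below builds that layout from two in-memory lists; the function name and the example sentences are assumptions, not content taken from the dataset files.

```python
def pairs_to_translation_dict(src_lang, tgt_lang, src_sentences, tgt_sentences):
    """Build the {'id': [...], 'translation': [...]} layout from two parallel lists."""
    records = {"id": [], "translation": []}
    for src, tgt in zip(src_sentences, tgt_sentences):
        if src and tgt:                      # skip empty lines, e.g. a trailing newline
            records["id"].append(len(records["id"]))
            records["translation"].append({src_lang: src, tgt_lang: tgt})
    return records

# Illustrative sentences only.
spanish = ["buenos días.", "¿tiene una habitación libre?", ""]
english = ["good morning.", "do you have a free room?", ""]
data = pairs_to_translation_dict("es", "en", spanish, english)
print(data["id"])
print(data["translation"][1])
```

A dictionary with this shape, mapping column names to equally long lists, is what `Dataset.from_dict` expects.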

In [None]:
from datasets import Dataset, DatasetDict, load_dataset
from google.colab import drive
drive.mount('/content/drive')

def gen_dict_from_parallel_files(srclang, trglang, src_fname, trg_fname):
  # Load files into lists
  srclist = open(src_fname, encoding='utf-8').read().split('\n')
  trglist = open(trg_fname, encoding='utf-8').read().split('\n')
  # Create dictionary
  dpairs = {}
  dpairs['id'] = []
  dpairs['translation'] = []
  id = 0
  for src, trg in zip(srclist, trglist):
    if src and trg:
      dpairs['id'].append(id)
      pair = {}
      pair[srclang] = src
      pair[trglang] = trg
      dpairs['translation'].append(pair)
      id += 1
  # Return dictionary
  return dpairs

def gen_dsetdict_from_parallel_files(srclang, trglang, srcpref, trgpref): 
  # Generate dictionaries 
  train_dict = gen_dict_from_parallel_files(srclang, trglang, srcpref + ".train", trgpref + ".train")
  finetun_dict = gen_dict_from_parallel_files(srclang, trglang, srcpref + ".finetun", trgpref + ".finetun")
  valid_dict = gen_dict_from_parallel_files(srclang, trglang, srcpref + ".valid", trgpref + ".valid")
  test_dict = gen_dict_from_parallel_files(srclang, trglang, srcpref + ".test", trgpref + ".test")
  # Generate Dataset Dictionary
  dset = DatasetDict()
  dset["train"] = Dataset.from_dict(train_dict)
  dset["finetun"] = Dataset.from_dict(finetun_dict)
  dset["valid"] = Dataset.from_dict(valid_dict)
  dset["test"] = Dataset.from_dict(test_dict)
  return dset

# Generate datasets
SRC_LANGUAGE = 'es'
TGT_LANGUAGE = 'en'
# NOTE: EDIT THE CONTENT OF THE "datadir" VARIABLE TO POINT TO THE GOOGLE DRIVE
#       DIRECTORY WHERE YOU HAVE COPIED YOUR FILES. IT CAN BE INTERESTING TO 
#       START USING THE SMALL VERSION OF THE DATASET AND AFTER FINISHING THE
#       EXPERIMENTS REPEAT THEM WITH THE LARGER VERSION
#datadir = "/content/drive/MyDrive/Colab Notebooks/ub_nlp_nnlm/data/"
datadir = "/content/drive/MyDrive/Colab Notebooks/ub_nlp_nnlm/data_small/"
dset = gen_dsetdict_from_parallel_files(SRC_LANGUAGE, TGT_LANGUAGE, datadir + "es", datadir + "en")

### Preprocess Datasets

After creating the `Dataset` objects, we will tokenize the data. In Hugging Face, each pre-trained model comes with its own tokenizer.

Since the tokenizer is linked to its pre-trained model, the first thing we should do is to decide which model we want to use. A particular model in Hugging Face can be identified by its *checkpoint*. More specifically, a checkpoint refers to a saved state of a trained deep learning model that contains the values of its weights and other important parameters. Checkpoints are created during the training process at certain intervals or after achieving a specific performance metric. These saved checkpoints can be used to resume training from where it was left off or to make predictions with the trained model, which is what we are about to do.

When using the Hugging Face library, a checkpoint is identified by a string. For this session, we have chosen a transformer model with checkpoint [Helsinki-NLP/opus-mt-es-en](https://huggingface.co/Helsinki-NLP/opus-mt-es-en), which translates from Spanish to English. The tokenizer for this model can be instantiated by calling the `from_pretrained` method of the `AutoTokenizer` class.

In [None]:
from transformers import AutoTokenizer

def extract_text(examples, lang):
    return [example[lang] for example in examples["translation"]]

def extract_inputs_targets(examples, srclang, trglang):
    inputs = extract_text(examples, srclang)
    targets = extract_text(examples, trglang)
    return inputs, targets

def preprocess_function(examples, srclang, trglang):
    inputs, targets = extract_inputs_targets(examples, srclang, trglang)
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs

checkpoint = "Helsinki-NLP/opus-mt-es-en"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_dset = dset.map(preprocess_function, batched=True, fn_kwargs={"srclang": SRC_LANGUAGE, "trglang": TGT_LANGUAGE})

### Generate Sample Translations

At this point, we can start generating translations. For this purpose, we can use `pipeline` objects provided by the Hugging Face library. In Hugging Face, a `pipeline` object is a high-level interface for performing specific natural language processing tasks using pre-trained models.

When creating a `pipeline` object, it is necessary to specify the task to be performed, such as text classification, named entity recognition, or question answering. The `pipeline` object then automatically loads the appropriate pre-trained model and tokenizer, and applies them to the input text to perform the specified task. Optionally, we can specify a checkpoint to instantiate a `pipeline` object, which is what is done in the code below.

Finally, in order to use the newly created `pipeline` object, we only need to call it with the input text as its argument. The pipeline object will then process such input text and return the output of the NLP task. The input text can be an individual string or a list of them.
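
A translation pipeline returns a list with one dictionary per input sentence, each holding the output under the key `translation_text`. The sketch below shows how to pull the plain strings out of that structure. To keep it runnable without downloading a model, it uses a hypothetical `fake_translator` that only mimics the output layout and does not translate anything.

```python
def fake_translator(texts):
    """Stand-in that mimics the output layout of a translation pipeline:
    one {'translation_text': ...} dict per input sentence."""
    if isinstance(texts, str):
        texts = [texts]
    return [{"translation_text": f"<translation of: {t}>"} for t in texts]

def extract_translations(pipeline_output):
    """Pull the plain strings out of the list of result dicts."""
    return [item["translation_text"] for item in pipeline_output]

outputs = fake_translator(["buenos días.", "gracias."])
print(extract_translations(outputs))
```

The same `extract_translations` helper can be applied to the output of a real pipeline object, since only the list-of-dictionaries layout matters.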

In [None]:
from transformers import pipeline

translator = pipeline("translation", model=checkpoint)
print(translator("me gustaría cambiarme a otra habitación con teléfono, por favor."))
print(translator("Esto es un ejemplo de frase que no pertenece al contexto del lenguaje usado en el mostrador de un hotel."))
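
A translation pipeline returns a list with one dictionary per input sentence, each holding the output under the `translation_text` key. The sketch below shows how plain strings can be pulled out of such a result. To keep it runnable without downloading a model, `sample_output` is a hand-written stand-in with made-up translations, and `extract_translations` is a hypothetical helper name.

```python
def extract_translations(outputs):
    """Return the translated strings from a translation pipeline result."""
    return [item["translation_text"] for item in outputs]

# Hand-written stand-in for the value returned by translator([...]);
# the translations are invented for illustration, not real model output.
sample_output = [
    {"translation_text": "I would like to move to another room with a telephone, please."},
    {"translation_text": "Good morning."},
]

print(extract_translations(sample_output))
```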

### Enable Metrics Computation

We enable computation of the BLEU score here again, so as to allow independent execution of the code related to pre-trained models.

In [None]:
!pip install evaluate sacrebleu

import numpy as np
import evaluate

metric = evaluate.load("sacrebleu")
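
When calling the `sacrebleu` metric, `predictions` is a list of strings, while `references` is a list of lists, since each prediction may have several reference translations. The sketch below wraps single references into that shape using made-up sentences; `wrap_references` is a hypothetical helper name, and the call to `metric.compute` is shown only as a comment so that the snippet runs without the metric being loaded.

```python
def wrap_references(references):
    """Wrap each reference string in a list: one list of references per prediction."""
    return [[ref] for ref in references]

predictions = ["I want a room with a telephone ."]        # made-up system output
references = ["I would like a room with a telephone ."]  # made-up reference

wrapped = wrap_references(references)
print(wrapped)

# With the metric loaded above, the score would then be obtained as:
# result = metric.compute(predictions=predictions, references=wrapped)
# print(result["score"])
```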

### Compute Metrics for Test Set (**Exercise**)

To end this part of the session, we will compute translation quality metrics for the pre-trained system. For this purpose, we will again use a `pipeline` object, this time to translate the sentences contained in the test set. Completing this code is again left as an exercise.

In [None]:
import random

# Obtain translations for test set
# TO-BE-DONE

# Show 5 random source/prediction pairs
# TO-BE-DONE

# Compute BLEU score
# TO-BE-DONE


## Translating with a Fine-Tuned Pre-Trained Model

In the previous sections we have worked with a pre-trained model. Although the model generates correct translations, it has not been adapted to the particular style and vocabulary of the text contained in the EUTRANS-I dataset.

To solve this problem, we can use a small training set to fine-tune the parameters of the pre-trained system. This strategy is very useful because building a translation system for a particular task in this way requires far fewer training samples and much less computational power than training a transformer from scratch, the approach adopted at the beginning of this notebook.

### Instantiate Model

Before fine-tuning the model, we need to create a model instance. For this purpose we use `AutoModelForSeq2SeqLM`, the member of the `AutoModel` family of classes intended for sequence-to-sequence models. In Hugging Face, the `AutoModel` classes allow users to load a pre-trained model without specifying its architecture explicitly. Instead, we only need to provide the checkpoint identifying the model of interest, and the architecture is inferred from the checkpoint's configuration.

These classes are not instantiated directly. Calling the `from_pretrained` class method downloads the configuration and weights of the model (unless they are already cached) and returns the model instance. The tokenizer is loaded separately, as was done above with `AutoTokenizer`.

In [None]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

checkpoint = "Helsinki-NLP/opus-mt-es-en"
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
# NOTE: `from_pretrained` loads the model on the CPU; the `Seq2SeqTrainer` used below moves it to a GPU if one is available

### Fine-Tune Model

The code below fine-tunes the pre-trained model we have used previously. The following three steps are executed:

1. Define the training hyperparameters.
2. Pass the training arguments (including the hyperparameters) to a `Seq2SeqTrainer` object, together with the model, the datasets, the tokenizer and the data collator.
3. Call the `train` method to fine-tune the model.

Training is carried out on a partition reserved for fine-tuning (the files with the `finetun` extension). After each epoch, the loss on the evaluation set is computed.
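
The number of optimization steps follows from the size of the fine-tuning partition and the hyperparameters used in this section (batch size 16, 5 epochs). The sketch below works through the arithmetic for a single device without gradient accumulation. The partition size of 1,000 sentence pairs is a hypothetical value chosen for illustration, since the actual size depends on the files provided.

```python
import math

num_examples = 1000   # hypothetical size of the fine-tuning partition
batch_size = 16       # per_device_train_batch_size
num_epochs = 5        # num_train_epochs

# The last batch of an epoch may be smaller, hence the ceiling
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch)  # 63
print(total_steps)      # 315
```

Since `logging_steps` is set to 1, the training loss is logged at every one of these steps.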

In [None]:
from transformers import DataCollatorForSeq2Seq

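# The collator expects the model object (not the checkpoint string) so that it can prepare the decoder inputs
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)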

training_args = Seq2SeqTrainingArguments(
    output_dir="my_fine_tuned_model",
    evaluation_strategy="epoch",  # this argument is named `eval_strategy` in recent transformers releases
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=5,
    logging_steps=1,
    predict_with_generate=True,
    fp16=False,
    push_to_hub=False,
)

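trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dset["finetun"],
    # NOTE: the test split is used for the per-epoch evaluation here; a separate
    # validation split, if loaded in `dset`, would avoid monitoring training on test data
    eval_dataset=tokenized_dset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)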

trainer.train()
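
One detail worth knowing when post-processing the outputs of the trainer: `DataCollatorForSeq2Seq` pads the label sequences with the value -100 so that padded positions are ignored by the loss. Before labels can be decoded back into text, those positions need to be replaced by a real token id. The sketch below does this with NumPy on a hand-written array. The token ids and the padding id are invented for illustration (in practice `tokenizer.pad_token_id` would be used), and `restore_padding` is a hypothetical helper name.

```python
import numpy as np

def restore_padding(labels, pad_token_id):
    """Replace the ignore index (-100) with the given padding token id."""
    labels = np.asarray(labels)
    return np.where(labels != -100, labels, pad_token_id)

# Made-up label batch: two sequences, the second one padded with -100
labels = [[12, 48, 7, 0], [33, 0, -100, -100]]
print(restore_padding(labels, pad_token_id=1))
```

The resulting array can then be passed to `tokenizer.batch_decode` with `skip_special_tokens=True` to recover the reference sentences.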

### Generate Sample Translations

Now we can generate some example translations by means of a `pipeline` object defined for the fine-tuned model.

In [None]:
# To simplify execution, we ensure that the computations are carried out on the CPU
translator = pipeline("translation", model=trainer.model.to("cpu"), tokenizer=tokenizer, device=-1)
print(translator("me gustaría cambiarme a otra habitación con teléfono, por favor."))
print(translator("Esto es un ejemplo de frase que no pertenece al contexto del lenguaje usado en el mostrador de un hotel."))

### Compute Metrics for Test Set (**Exercise**)

The last step is to compute the BLEU score of the translations generated by the fine-tuned model for the EUTRANS-I test set. This step is again left as an exercise.

In [None]:
# Obtain translations for test set
# TO-BE-DONE

# Show 5 random source/prediction pairs
# TO-BE-DONE

# Compute BLEU score
# TO-BE-DONE


## Optional Exercises

To complement the results that have been obtained previously, it can be interesting to try the following:

- Experiment with different numbers of training epochs and other model hyperparameters.
- Extract larger portions of the training set and use them as the fine-tuning set in order to measure the impact on test translation quality.
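
For the second exercise, one possible way to build fine-tuning sets of increasing size is to draw a reproducible random sample of sentence indices. The sketch below uses only the standard library. The training-set size of 10,000 matches the figure given for the full dataset, while the subset sizes, the seed, the `"train"` split name and the helper name `sample_indices` are assumptions for illustration. This is not necessarily how the provided `finetun` files were created.

```python
import random

def sample_indices(dataset_size, subset_size, seed=0):
    """Return a sorted, reproducible random sample of example indices."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(dataset_size), subset_size))

for subset_size in (500, 1000, 2000):
    indices = sample_indices(10_000, subset_size)
    print(subset_size, indices[:3])

# With a Hugging Face `Dataset`, the selected rows could then be obtained with
# dset["train"].select(indices)   # "train" is an assumed split name
```

Fixing the seed keeps the subsets identical across runs, so differences in BLEU can be attributed to the amount of data rather than to the particular sample drawn.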
