# TP3, INF8225 2025, Machine translation

This TP will be due on March 27th at 8:30am.
The goal of this TP is to build a machine translation model.
You will be comparing the performance of three different architectures:
* A Vanilla RNN (**Implementation provided!**)
* A GRU-RNN (done individually)
* A Transformer (the implementation, testing and experiments with an Encoder-Decoder are done individually, but you may discuss how to do this, ideas for experiments, etc. with any of your colleagues)

You are provided with the code to load and build the PyTorch dataset, the implementation of the Vanilla RNN architecture
and the code for the training loop.
You "only" have to code the architectures (a GRU-RNN and the missing parts of the Encoder-Decoder Transformer).
Of course, the use of built-in torch layers such as `nn.GRU` or `nn.Transformer`
is forbidden, as the TP would be much easier and you would learn much less.

The source sentences are in English and the target language is French.

We hope that this TP also provides you with a basic but realistic machine learning pipeline. We hope you learn a lot from the provided code.

Do not forget to **select the runtime type as GPU!**

**Sources**

* Dataset: [Tab-delimited Bilingual Sentence Pairs](http://www.manythings.org/anki/)

<!---
M. Cettolo, C. Girardi, and M. Federico. 2012. WIT3: Web Inventory of Transcribed and Translated Talks. In Proc. of EAMT, pp. 261-268, Trento, Italy. pdf, bib. [paper](https://aclanthology.org/2012.eamt-1.60.pdf). [website](https://wit3.fbk.eu/2016-01).
-->

* The code is inspired by this [pytorch tutorial](https://pytorch.org/tutorials/beginner/torchtext_translation_tutorial.html).

*This notebook is quite big, use the table of contents to easily navigate through it.*

# Imports and data initializations

We first download and parse the dataset. From the parsed sentences
we can build the vocabularies and the torch datasets.
The end goal of this section is to have an iterator
that yields pairs of translated sentences,
where each sentence is a sequence of tokens.

## Imports

In [None]:
# Note current default torch and cuda was 2.6.0+cu124
# We need to go back to an earlier version compatible with torchtext
# This will generate some dependency issues (incompatible packages), but for things that we will not need for this TP
!pip install torch==2.1.2+cu121 -f https://download.pytorch.org/whl/torch/ --force-reinstall --no-cache-dir
!pip install torchtext==0.16.2 --force-reinstall --no-cache-dir
!pip install numpy==1.23.5 --force-reinstall --no-cache-dir
!pip install scikit-learn==1.1.3 --force-reinstall --no-cache-dir
!pip install scipy==1.9.3 --force-reinstall --no-cache-dir
!pip install spacy einops wandb torchinfo
!python -m spacy download en_core_web_sm
!python -m spacy download fr_core_news_sm


In [None]:
from itertools import takewhile
from collections import Counter, defaultdict
import numpy as np
from sklearn.model_selection import train_test_split
import pandas as pd
import torch
# cpal
print(torch.__version__)
import torch.nn as nn
import torch.optim as optim
from torch.utils.data.dataset import Dataset
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence
import torchtext
# from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator, Vocab
from torchtext.datasets import IWSLT2016
import spacy
import einops
import wandb
from torchinfo import summary
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

2.1.2+cu121


In [None]:
# Our dataset
!wget http://www.manythings.org/anki/fra-eng.zip
!unzip fra-eng.zip
df = pd.read_csv('fra.txt', sep='\t', names=['english', 'french', 'attribution'])
train = [
    (en, fr) for en, fr in zip(df['english'], df['french'])
]
train, valid = train_test_split(train, test_size=0.1, random_state=0)
print(len(train))
en_nlp = spacy.load('en_core_web_sm')
fr_nlp = spacy.load('fr_core_news_sm')
def en_tokenizer(text):
    return [tok.text.lower() for tok in en_nlp.tokenizer(text)]
def fr_tokenizer(text):
    return [tok.text.lower() for tok in fr_nlp.tokenizer(text)]
SPECIALS = ['<unk>', '<pad>', '<bos>', '<eos>']

--2025-03-27 00:43:25--  http://www.manythings.org/anki/fra-eng.zip
Resolving www.manythings.org (www.manythings.org)... 173.254.30.110
Connecting to www.manythings.org (www.manythings.org)|173.254.30.110|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7943074 (7.6M) [application/zip]
Saving to: ‘fra-eng.zip’


2025-03-27 00:43:25 (20.6 MB/s) - ‘fra-eng.zip’ saved [7943074/7943074]

Archive:  fra-eng.zip
  inflating: _about.txt              
  inflating: fra.txt                 
209462


The tokenizers are objects that split a Python string into a list of tokens (words, punctuation marks, special tokens...), each token being a string.

The special tokens each serve a particular purpose:
* *\<unk\>*: Default token that replaces any word missing from the vocabulary
* *\<pad\>*: Virtual token used as padding so that all the sentences in a batch have the same length
* *\<bos\>*: Token indicating the beginning of a sentence in the target sequence
* *\<eos\>*: Token indicating the end of a sentence in the target sequence
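As a rough illustration of how `<bos>`, `<eos>` and `<pad>` fit together, the sketch below wraps two toy sentences and pads them to a common length. The whitespace tokenizer and the helper names (`toy_tokenizer`, `wrap`, `pad_batch`) are assumptions made for this example only; the notebook itself relies on the spaCy tokenizers and on `pad_sequence`.

```python
def toy_tokenizer(text: str) -> list:
    """Hypothetical stand-in for the spaCy tokenizers: lowercase, then split on spaces."""
    return text.lower().split()


def wrap(tokens: list) -> list:
    """Surround a tokenized sentence with the <bos> and <eos> markers."""
    return ['<bos>'] + tokens + ['<eos>']


def pad_batch(batch: list) -> list:
    """Append <pad> tokens so that every sentence reaches the longest length."""
    max_len = max(len(tokens) for tokens in batch)
    return [tokens + ['<pad>'] * (max_len - len(tokens)) for tokens in batch]


sentences = ["I am here", "Go"]
batch = pad_batch([wrap(toy_tokenizer(s)) for s in sentences])
for tokens in batch:
    print(tokens)
```

Both rows end up with five tokens, which is what allows a batch to be stored as a single rectangular tensor.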

## Datasets

Functions and classes to build the vocabularies and the torch datasets.
The vocabulary is an object able to transform a string token into the id (an int) of that token in the vocabulary.
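To make the token-to-id mapping concrete, here is a minimal pure-Python sketch of a vocabulary with a minimum frequency and an `<unk>` fallback. It is not how `torchtext` is implemented: `build_toy_vocab` and `encode` are hypothetical helpers, and the alphabetical ordering of the kept tokens is an arbitrary choice for this example.

```python
from collections import Counter

SPECIALS = ['<unk>', '<pad>', '<bos>', '<eos>']


def build_toy_vocab(tokenized_sentences: list, min_freq: int = 1) -> dict:
    """Map each frequent enough token to an integer id (special tokens come first)."""
    counts = Counter(tok for sentence in tokenized_sentences for tok in sentence)
    kept = sorted(tok for tok, count in counts.items() if count >= min_freq)
    return {tok: idx for idx, tok in enumerate(SPECIALS + kept)}


def encode(tokens: list, vocab: dict) -> list:
    """Replace every token by its id, falling back to the id of <unk>."""
    unk_idx = vocab['<unk>']
    return [vocab.get(tok, unk_idx) for tok in tokens]


corpus = [['the', 'cat', 'sleeps'], ['the', 'dog', 'sleeps'], ['the', 'cat', 'eats']]
vocab = build_toy_vocab(corpus, min_freq=2)
ids = encode(['<bos>', 'the', 'dog', 'eats', '<eos>'], vocab)
print(vocab)
print(ids)
```

With `min_freq=2`, the words "dog" and "eats" appear only once, so they are left out of the vocabulary and both map to the id of `<unk>`.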

In [None]:
class TranslationDataset(Dataset):
    def __init__(
            self,
            dataset: list,
            en_vocab: Vocab,
            fr_vocab: Vocab,
            en_tokenizer,
            fr_tokenizer,
        ):
        super().__init__()

        self.dataset = dataset
        self.en_vocab = en_vocab
        self.fr_vocab = fr_vocab
        self.en_tokenizer = en_tokenizer
        self.fr_tokenizer = fr_tokenizer

    def __len__(self):
        """Return the number of examples in the dataset.
        """
        return len(self.dataset)

    def __getitem__(self, index: int) -> tuple:
        """Return a sample.

        Args
        ----
            index: Index of the sample.

        Output
        ------
            en_tokens: English tokens of the sample, as a LongTensor.
            fr_tokens: French tokens of the sample, as a LongTensor.
        """
        # Get the strings
        en_sentence, fr_sentence = self.dataset[index]

        # To list of words
        # We also add the beginning-of-sentence and end-of-sentence tokens
        en_tokens = ['<bos>'] + self.en_tokenizer(en_sentence) + ['<eos>']
        fr_tokens = ['<bos>'] + self.fr_tokenizer(fr_sentence) + ['<eos>']

        # To list of token ids
        en_tokens = self.en_vocab(en_tokens)  # list[int]
        fr_tokens = self.fr_vocab(fr_tokens)

        return torch.LongTensor(en_tokens), torch.LongTensor(fr_tokens)


def yield_tokens(dataset, tokenizer, lang):
    """Tokenize the whole dataset and yield the tokens.
    """
    assert lang in ('en', 'fr')
    sentence_idx = 0 if lang == 'en' else 1

    for sentences in dataset:
        sentence = sentences[sentence_idx]
        tokens = tokenizer(sentence)
        yield tokens


def build_vocab(dataset: list, en_tokenizer, fr_tokenizer, min_freq: int):
    """Return two vocabularies, one for each language.
    """
    en_vocab = build_vocab_from_iterator(
        yield_tokens(dataset, en_tokenizer, 'en'),
        min_freq=min_freq,
        specials=SPECIALS,
    )
    en_vocab.set_default_index(en_vocab['<unk>'])  # Default token for unknown words

    fr_vocab = build_vocab_from_iterator(
        yield_tokens(dataset, fr_tokenizer, 'fr'),
        min_freq=min_freq,
        specials=SPECIALS,
    )
    fr_vocab.set_default_index(fr_vocab['<unk>'])

    return en_vocab, fr_vocab


def preprocess(
        dataset: list,
        en_tokenizer,
        fr_tokenizer,
        max_words: int,
    ) -> list:
    """Preprocess the dataset.
    Remove samples where at least one of the sentences is too long,
    since those samples take too much memory.
    Also remove any newline character from the sentences.
    """
    filtered = []

    for en_s, fr_s in dataset:
        if len(en_tokenizer(en_s)) >= max_words or len(fr_tokenizer(fr_s)) >= max_words:
            continue

        en_s = en_s.replace('\n', '')
        fr_s = fr_s.replace('\n', '')

        filtered.append((en_s, fr_s))

    return filtered


def build_datasets(
        max_sequence_length: int,
        min_token_freq: int,
        en_tokenizer,
        fr_tokenizer,
        train: list,
        val: list,
    ) -> tuple:
    """Build the training and validation datasets.
    It takes care of the vocabulary creation.

    Args
    ----
        - max_sequence_length: Maximum number of tokens in each sequence.
            Long sequences dramatically increase the VRAM used during training.
        - min_token_freq: Minimum number of occurrences each token must have
            to be saved in the vocabulary. Reducing this number increases
            the size of the vocabularies.
        - en_tokenizer: Tokenizer for the English sentences.
        - fr_tokenizer: Tokenizer for the French sentences.
        - train and val: Lists containing the (english, french) sentence pairs.


    Output
    ------
        - (train_dataset, val_dataset): Tuple of the two TranslationDataset objects.
    """
    datasets = [
        preprocess(samples, en_tokenizer, fr_tokenizer, max_sequence_length)
        for samples in [train, val]
    ]

    en_vocab, fr_vocab = build_vocab(datasets[0], en_tokenizer, fr_tokenizer, min_token_freq)

    datasets = [
        TranslationDataset(samples, en_vocab, fr_vocab, en_tokenizer, fr_tokenizer)
        for samples in datasets
    ]

    return datasets


In [None]:
def generate_batch(data_batch: list, src_pad_idx: int, tgt_pad_idx: int) -> tuple:
    """Add padding to the given batch so that all
    the samples are of the same size.

    Args
    ----
        data_batch: List of samples.
            Each sample is a tuple of LongTensors of varying size.
        src_pad_idx: Source padding index value.
        tgt_pad_idx: Target padding index value.

    Output
    ------
        en_batch: Batch of tokens for the padded english sentences.
            Shape of [batch_size, max_en_len].
        fr_batch: Batch of tokens for the padded french sentences.
            Shape of [batch_size, max_fr_len].
    """
    en_batch, fr_batch = [], []
    for en_tokens, fr_tokens in data_batch:
        en_batch.append(en_tokens)
        fr_batch.append(fr_tokens)

    en_batch = pad_sequence(en_batch, padding_value=src_pad_idx, batch_first=True)
    fr_batch = pad_sequence(fr_batch, padding_value=tgt_pad_idx, batch_first=True)
    return en_batch, fr_batch

# Models architecture
This is where you have to code the architectures.

In a machine translation task, the model takes as input the whole
source sentence along with the currently known tokens of the target,
and predicts the next token in the target sequence.
This means that the target tokens are predicted in an autoregressive
manner, starting from the first token (right after the *\<bos\>* token) and producing tokens one by one until the final *\<eos\>* token.

Formally, we define $s = [s_1, ..., s_{N_s}]$ as the source sequence made of $N_s$ tokens.
We also define $t^i = [t_1, ..., t_i]$ as the target sequence at the beginning of the step $i$.

The output of the model parameterized by $\theta$ is:

$$
T_{i+1} = p(t_{i+1} | s, t^i ; \theta )
$$

Where $T_{i+1}$ is the distribution of the next token $t_{i+1}$.

The loss is simply a *cross-entropy loss* computed over all steps, where each class is a token of the vocabulary.
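Written out with the notation above, and assuming the target sequence has $N_t$ tokens with $t_1$ being the *\<bos\>* token (the symbol $N_t$ is introduced here only for illustration), the loss for one sentence pair can be sketched as:

$$
\mathcal{L}(\theta) = - \sum_{i=1}^{N_t - 1} \log p(t_{i+1} | s, t^i ; \theta )
$$

Implementations typically average this quantity over the non-padding target tokens of a batch rather than summing it.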

![RNN schema for machine translation](https://www.simplilearn.com/ice9/free_resources_article_thumb/machine-translation-model-with-encoder-decoder-rnn.jpg)

Note that in this image the english sentence is provided in reverse.

---

In PyTorch, there is no distinction between an intermediate layer and a whole model made of multiple layers.
Every layer or model inherits from `torch.nn.Module`.
A module needs to define the `__init__` method, where you instantiate the layers,
and the `forward` method, where you decide how the inputs and the layers of the module interact.
Thanks to PyTorch's autograd, you do not have
to implement any backward method!
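The pattern can be sketched with a deliberately small module. `TinyMLP` and its dimensions are hypothetical and unrelated to the models of this TP; the point is only to show where the layers are created, where they are applied, and that gradients come for free.

```python
import torch
import torch.nn as nn


class TinyMLP(nn.Module):
    """Hypothetical two-layer network, only meant to show the nn.Module pattern."""

    def __init__(self, dim_in: int, dim_hidden: int, dim_out: int):
        super().__init__()
        # Layers are instantiated once, in __init__.
        self.fc1 = nn.Linear(dim_in, dim_hidden)
        self.act = nn.ReLU()
        self.fc2 = nn.Linear(dim_hidden, dim_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # forward describes how the input flows through the layers.
        return self.fc2(self.act(self.fc1(x)))


model = TinyMLP(dim_in=8, dim_hidden=16, dim_out=4)
x = torch.randn(2, 8)          # [batch_size, dim_in]
y = model(x)                   # [batch_size, dim_out]
y.sum().backward()             # autograd fills the .grad of every parameter
print(y.shape)
print(model.fc1.weight.grad.shape)
```

Note that the module is called as `model(x)` and not `model.forward(x)`, and that no backward method had to be written.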

An important piece of advice is to **always look at
the shape of your input and your output.**
From that, you can often guess how the layers should interact
with the inputs to produce the right output.
You can also easily detect when something is going wrong.

You are strongly advised to use the `einops` library and the `torch.einsum` function. They require fewer operations than 'classical' code, but note that they are a bit trickier to use.
They let you describe tensor manipulations with strings, instead of chaining multiple tensor methods.
You can find a nice presentation of `einops` [here](https://einops.rocks/1-einops-basics/).
A paper about einops is also available [here](https://paperswithcode.com/paper/einops-clear-and-reliable-tensor).
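As a small taste of the string notation, the sketch below computes batched dot products between two sets of vectors, once with `torch.einsum` and once with explicit tensor methods. The tensor names and sizes are arbitrary choices for the example.

```python
import torch

batch_size, seq_len_q, seq_len_k, dim = 2, 3, 5, 4
q = torch.randn(batch_size, seq_len_q, dim)
k = torch.randn(batch_size, seq_len_k, dim)

# 'bqd,bkd->bqk': for every batch element b, take the dot product (over d)
# between each vector q[b, q_pos] and each vector k[b, k_pos].
scores_einsum = torch.einsum('bqd,bkd->bqk', q, k)

# The same computation with "classical" tensor methods.
scores_matmul = q @ k.transpose(1, 2)

same = torch.allclose(scores_einsum, scores_matmul, atol=1e-6)
print(scores_einsum.shape, same)
```

The einsum string documents the shapes directly in the code. In the same spirit, `einops.rearrange(x, 'b s d -> s b d')` plays the role of `x.permute(1, 0, 2)` while naming every axis.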

**A great tutorial on pytorch can be found [here](https://stanford.edu/class/cs224n/materials/CS224N_PyTorch_Tutorial.html).**
Spending 3 hours on this tutorial is *no* waste of time.

## RNN models

### RNN
Here, the implementation of the RNN is provided as an example. Study this code and use it as a reference for the GRU implementation, if needed.

The `RNNCell` layer produces one hidden state vector for each sentence in the batch
(useful for the output of the encoder), and also produces one embedding for each
token in each sentence (useful for the output of the decoder).

The `RNN` module is composed of a stack of `RNNCell` layers. The token embeddings
coming out of one `RNNCell` are used as the input of the next `RNNCell` layer.

**Be careful!** Our `RNNCell` implementation is not exactly the same thing as
PyTorch's `nn.RNNCell`. PyTorch implements only the operations for one token
(so you would need to loop through each token inside the `RNN` instead).

The same applies to the `GRU` and `GRUCell`.
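To make the difference concrete, the sketch below shows how PyTorch's own `nn.RNNCell` has to be driven: one call per time step, with the loop written by the caller. It is shown only for contrast with the `RNNCell` of this notebook, which contains that loop itself; the sizes are arbitrary.

```python
import torch
import torch.nn as nn

batch_size, seq_len, input_size, hidden_size = 2, 5, 3, 4
x = torch.randn(batch_size, seq_len, input_size)

cell = nn.RNNCell(input_size, hidden_size)   # handles ONE time step per call
h = torch.zeros(batch_size, hidden_size)     # initial hidden state

outputs = []
for t in range(seq_len):
    h = cell(x[:, t, :], h)                  # [batch_size, hidden_size]
    outputs.append(h)

y = torch.stack(outputs, dim=1)              # [batch_size, seq_len, hidden_size]
print(y.shape, h.shape)
```

The pair `(y, h)` obtained here has the same shapes as what the notebook's `RNNCell.forward` returns in a single call.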


In [None]:
class RNNCell(nn.Module):
    """A single RNN layer.

    Parameters
    ----------
        input_size: Size of each input token.
        hidden_size: Size of each RNN hidden state.
        dropout: Dropout rate.

    Important note: This layer does not do exactly the same thing as nn.RNNCell.
    The PyTorch implementation only performs one step over a single token for each batch element.
    This implementation takes the whole sequence of each batch element and returns the
    final hidden state along with the embeddings of each token in each sequence.
    """
    def __init__(
            self,
            input_size: int,
            hidden_size: int,
            dropout: float,
        ):
        super().__init__()

        self.hidden_size = hidden_size

        # See pytorch definition: https://pytorch.org/docs/stable/generated/torch.nn.RNN.html
        self.Wih = nn.Linear(input_size, hidden_size, device=DEVICE)
        self.Whh = nn.Linear(hidden_size, hidden_size, device=DEVICE)
        self.dropout = nn.Dropout(p=dropout)
        self.act = nn.Tanh()

    def forward(self, x: torch.FloatTensor, h: torch.FloatTensor) -> tuple:
        """Go through all the sequence in x, iteratively updating
        the hidden state h.

        Args
        ----
            x: Input sequence.
                Shape of [batch_size, seq_len, input_size].
            h: Initial hidden state.
                Shape of [batch_size, hidden_size].

        Output
        ------
            y: Token embeddings.
                Shape of [batch_size, seq_len, hidden_size].
            h: Last hidden state.
                Shape of [batch_size, hidden_size].
        """
        batch_size, seq_len, input_size = x.shape
        y = torch.zeros([batch_size, seq_len, self.hidden_size], device=DEVICE)

        for t in range(seq_len):
          input = x[:, t, :]
          w_input = self.Wih(input)
          w_hidden = self.Whh(h)
          h = self.act(w_input + w_hidden)
          y[:, t, :] = self.dropout(h)

        return y, h


class RNN(nn.Module):
    """Implementation of an RNN based
    on https://pytorch.org/docs/stable/generated/torch.nn.RNN.html.

    Parameters
    ----------
        input_size: Size of each input token.
        hidden_size: Size of each RNN hidden state.
        num_layers: Number of layers (RNNCell or GRUCell).
        dropout: Dropout rate.
        model_type: Either 'RNN' or 'GRU', to select which model we want.
            This parameter can be removed if you decide to use the module `GRU`.
            Indeed, `GRU` should have exactly the same code as this module,
            but with `GRUCell` instead of `RNNCell`. We let the freedom for you
            to decide at which level you want to specialise the modules (either
            in `TranslationRNN` by creating a `GRU` or a `RNN`, or in `RNN`
            by creating a `GRUCell` or a `RNNCell`).
    """
    def __init__(
            self,
            input_size: int,
            hidden_size: int,
            num_layers: int,
            dropout: float,
            model_type: str,
        ):
        super().__init__()

        self.hidden_size = hidden_size
        model_class = RNNCell if model_type == 'RNN' else GRUCell

        self.layers = nn.ModuleList()
        self.layers.append(model_class(input_size, hidden_size, dropout))
        for i in range(1, num_layers):
          self.layers.append(model_class(hidden_size, hidden_size, dropout))

    def forward(self, x: torch.FloatTensor, h: torch.FloatTensor=None) -> tuple:
        """Pass the input sequence through all the RNN cells.
        Returns the output and the final hidden state of each RNN layer

        Args
        ----
            x: Input sequence.
                Shape of [batch_size, seq_len, input_size].
            h: Hidden state for each RNN layer.
                Can be None, in which case an initial hidden state is created.
                Shape of [batch_size, n_layers, hidden_size].

        Output
        ------
            y: Output embeddings for each token after the RNN layers.
                Shape of [batch_size, seq_len, hidden_size].
            h: Final hidden state.
                Shape of [batch_size, n_layers, hidden_size].
        """
        input = x
        h = torch.zeros([x.shape[0], len(self.layers), self.hidden_size], device=x.device) if h is None else h
        final_h = torch.zeros_like(h, device=x.device)
        for l in range(len(self.layers)):
          input, h_out = self.layers[l](input, h[:, l, :])
          final_h[:, l, :] = h_out

        return input, final_h

### GRU
Here you have to implement a GRU-RNN. This architecture is close to the Vanilla RNN but performs different operations. Look up the PyTorch documentation to figure out the differences.
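For reference, the update rules given in the PyTorch documentation of `nn.GRU` are reproduced below, with $\sigma$ the sigmoid function, $\odot$ the element-wise product, $x_t$ the input at step $t$ and $h_{t-1}$ the previous hidden state:

$$
\begin{aligned}
r_t &= \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{t-1} + b_{hr}) \\
z_t &= \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{t-1} + b_{hz}) \\
n_t &= \tanh(W_{in} x_t + b_{in} + r_t \odot (W_{hn} h_{t-1} + b_{hn})) \\
h_t &= (1 - z_t) \odot n_t + z_t \odot h_{t-1}
\end{aligned}
$$

By comparison, the vanilla RNN has a single update, $h_t = \tanh(W_{ih} x_t + b_{ih} + W_{hh} h_{t-1} + b_{hh})$. The reset gate $r_t$ and the update gate $z_t$ are what let the GRU decide how much of the previous state to keep.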

In [None]:
class GRU(nn.Module):
    """Implementation of a GRU based on https://pytorch.org/docs/stable/generated/torch.nn.GRU.html.

    Parameters
    ----------
        input_size: Size of each input token.
        hidden_size: Size of each RNN hidden state.
        num_layers: Number of layers.
        dropout: Dropout rate.
    """
    def __init__(
            self,
            input_size: int,
            hidden_size: int,
            num_layers: int,
            dropout: float,
        ):
        super().__init__()

        self.hidden_size = hidden_size
        self.layers = nn.ModuleList()
        self.layers.append(GRUCell(input_size, hidden_size, dropout))
        for i in range(1, num_layers):
          self.layers.append(GRUCell(hidden_size, hidden_size, dropout))

    def forward(self, x: torch.FloatTensor, h: torch.FloatTensor=None) -> tuple:
        """
        Args
        ----
            x: Input sequence
                Shape of [batch_size, seq_len, input_size].
            h: Initial hidden state for each layer.
                If 'None', then an initial hidden state (a zero filled tensor)
                is created.
                Shape of [batch_size, n_layers, hidden_size].

        Output
        ------
            output:
                Shape of [batch_size, seq_len, hidden_size].
            h_n: Final hidden state.
                Shape of [batch_size, n_layers, hidden_size].
        """
        input = x
        h = torch.zeros([x.shape[0], len(self.layers), self.hidden_size], device=x.device) if h is None else h
        final_h = torch.zeros_like(h, device=x.device)
        for l in range(len(self.layers)):
          input, h_out = self.layers[l](input, h[:, l, :])
          final_h[:, l, :] = h_out

        return input, final_h


class GRUCell(nn.Module):
    """A single GRU layer.

    Parameters
    ----------
        input_size: Size of each input token.
        hidden_size: Size of each RNN hidden state.
        dropout: Dropout rate.
    """
    def __init__(
            self,
            input_size: int,
            hidden_size: int,
            dropout: float,
        ):
        super().__init__()
        self.hidden_size = hidden_size

        self.Wiz = nn.Linear(input_size, hidden_size, device=DEVICE)
        self.Whz = nn.Linear(hidden_size, hidden_size, device=DEVICE)
        self.Wir = nn.Linear(input_size, hidden_size, device=DEVICE)
        self.Whr = nn.Linear(hidden_size, hidden_size, device=DEVICE)
        self.Win = nn.Linear(input_size, hidden_size, device=DEVICE)
        self.Whn = nn.Linear(hidden_size, hidden_size, device=DEVICE)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x: torch.FloatTensor, h: torch.FloatTensor) -> tuple:
        """
        Args
        ----
            x: Input sequence.
                Shape of [batch_size, seq_len, input_size].
            h: Initial hidden state.
                Shape of [batch_size, hidden_size].

        Output
        ------
            y: Token embeddings.
                Shape of [batch_size, seq_len, hidden_size].
            h: Last hidden state.
                Shape of [batch_size, hidden_size].
        """
        batch_size, seq_len, input_size = x.shape
        y = torch.zeros([batch_size, seq_len, self.hidden_size], device=DEVICE)

        for t in range(seq_len):
          input = x[:, t, :]
          w_ir = self.Wir(input)
          w_hr = self.Whr(h)
          w_iz = self.Wiz(input)
          w_hz = self.Whz(h)
          w_in = self.Win(input)
          w_hn = self.Whn(h)
          r = torch.sigmoid(w_ir + w_hr)
          z = torch.sigmoid(w_iz + w_hz)
          n = torch.tanh(w_in + r * w_hn)
          h = z * h + (1 - z) * n
          y[:, t, :] = self.dropout(h)

        return y, h

### Translation RNN

This module instantiates a vanilla RNN or a GRU-RNN and performs the translation task. This code does the following:
* Embeds the source and target sequences and encodes the source
* Passes the final hidden state of the encoder to the decoder (one for each layer)
* Decodes the hidden state into the target sequence

We use teacher forcing for training, meaning that each next-token prediction is conditioned on the previous true target tokens rather than on the model's own earlier predictions.
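A common way to set up teacher forcing is to feed the target without its last token and to compare the predictions with the target shifted by one position. The sketch below illustrates that shift on toy ids; it assumes the special-token ids follow the order of `SPECIALS`, uses random logits as a stand-in for `model(source, decoder_input)`, and is not necessarily how the provided training loop is written.

```python
import torch
import torch.nn.functional as F

PAD_IDX, BOS_IDX, EOS_IDX = 1, 2, 3      # assumed ids, following the order of SPECIALS
vocab_size = 10

target = torch.tensor([
    [BOS_IDX, 5, 6, 7, EOS_IDX],
    [BOS_IDX, 8, EOS_IDX, PAD_IDX, PAD_IDX],
])                                        # [batch_size, tgt_seq_len]

decoder_input = target[:, :-1]            # what the decoder sees: <bos> t_1 ... t_{n-1}
labels = target[:, 1:]                    # what it must predict:   t_1 ... t_{n-1} <eos>

# Stand-in for the logits a model would return for decoder_input.
torch.manual_seed(0)
logits = torch.randn(target.shape[0], decoder_input.shape[1], vocab_size)

# Cross-entropy over every position, skipping the padding tokens.
loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),
    labels.reshape(-1),
    ignore_index=PAD_IDX,
)
print(decoder_input.shape, labels.shape, loss.item())
```

At position `i` the decoder has access to the true tokens up to `i`, and the label is the true token at `i + 1`, whatever the model predicted before.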

In [None]:
class TranslationRNN(nn.Module):
    """Basic RNN encoder and decoder for a translation task.
    It can run as a vanilla RNN or a GRU-RNN.

    Parameters
    ----------
        n_tokens_src: Number of tokens in the source vocabulary.
        n_tokens_tgt: Number of tokens in the target vocabulary.
        dim_embedding: Dimension size of the word embeddings (for both languages).
        dim_hidden: Dimension size of the hidden layers in the RNNs
            (for both the encoder and the decoder).
        n_layers: Number of layers in the RNNs.
        dropout: Dropout rate.
        src_pad_idx: Source padding index value.
        tgt_pad_idx: Target padding index value.
        model_type: Either 'RNN' or 'GRU', to select which model we want.
    """

    def __init__(
            self,
            n_tokens_src: int,
            n_tokens_tgt: int,
            dim_embedding: int,
            dim_hidden: int,
            n_layers: int,
            dropout: float,
            src_pad_idx: int,
            tgt_pad_idx: int,
            model_type: str,
        ):
        super().__init__()
        self.src_embeddings = nn.Embedding(n_tokens_src, dim_embedding, src_pad_idx)
        self.tgt_embeddings = nn.Embedding(n_tokens_tgt, dim_embedding, tgt_pad_idx)

        self.dropout_1 = nn.Dropout(dropout)
        self.dropout_2 = nn.Dropout(dropout)

        self.encoder = RNN(dim_embedding, dim_hidden, n_layers, dropout, model_type)
        self.norm = nn.LayerNorm(dim_hidden)
        self.decoder = RNN(dim_embedding, dim_hidden, n_layers, dropout, model_type)
        self.out_layer = nn.Linear(dim_hidden, n_tokens_tgt)


    def forward(
        self,
        source: torch.LongTensor,
        target: torch.LongTensor
    ) -> torch.FloatTensor:
        """Predict the target tokens logits based on the source tokens.

        Args
        ----
            source: Batch of source sentences.
                Shape of [batch_size, src_seq_len].
            target: Batch of target sentences.
                Shape of [batch_size, tgt_seq_len].

        Output
        ------
            y: Distributions over the next token for all tokens in each sentence.
                Those need to be the logits only, do not apply a softmax because
                it will be done in the loss computation for numerical stability.
                See https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html for more information.
                Shape of [batch_size, tgt_seq_len, n_tokens_tgt].
        """
        source = torch.fliplr(source)

        src_emb = self.src_embeddings(source)
        out, hidden = self.encoder(src_emb)

        hidden = self.norm(hidden)

        tgt_emb = self.tgt_embeddings(target)
        y, hidden = self.decoder(tgt_emb, hidden)

        y = self.out_layer(y)

        return y


## Transformer models
Here you have to code the Full Transformer and Decoder-Only Transformer architectures.
It is divided into three parts:
* Attention layers (done individually)
* Encoder and decoder layers (done individually)
* Full Transformer: gather the encoder and decoder layers (done individually)

The Transformer (or "Full Transformer") is presented in the paper [Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf). The [illustrated transformer](https://jalammar.github.io/illustrated-transformer/) blog can help you
understand how the architecture works.
Once this is done, you can use [the annotated transformer](https://nlp.seas.harvard.edu/2018/04/03/attention.html) to get an idea of how to code this architecture.
We encourage you to use `torch.einsum` and the `einops` library as much as you can. It will make your code simpler.

---
**Implementation order**

To help you with the implementation, we advise you to follow this order:
* Implement `TranslationTransformer` and use `nn.Transformer` instead of `Transformer`
* Implement `Transformer` and use `nn.TransformerDecoder` and `nn.TransformerEncoder`
* Implement the `TransformerDecoder` and `TransformerEncoder` and use `nn.MultiheadAttention`
* Implement `MultiHeadAttention`

Do not forget to add `batch_first=True` when necessary in the `nn` modules.
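The effect of `batch_first=True` can be checked on a built-in module before swapping in your own code. The sketch below uses PyTorch's `nn.MultiheadAttention` purely as the temporary stand-in suggested in the list above; the sizes are arbitrary.

```python
import torch
import torch.nn as nn

batch_size, seq_len, dim_model, n_heads = 2, 7, 16, 4
x = torch.randn(batch_size, seq_len, dim_model)   # batch dimension comes first

# Without batch_first=True the module would expect [seq_len, batch_size, dim_model].
mha = nn.MultiheadAttention(embed_dim=dim_model, num_heads=n_heads, batch_first=True)

out, weights = mha(x, x, x)   # self-attention: queries, keys and values are all x
print(out.shape)              # [batch_size, seq_len, dim_model]
print(weights.shape)          # [batch_size, seq_len, seq_len], averaged over the heads
```

Forgetting the flag does not always raise an error, because the module silently treats the first axis as the sequence axis, so checking the shapes is the safest habit.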

### Positional Encoding
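The module in the next cell appears to follow the sinusoidal encoding of *Attention is all you need*. With $pos$ the position of a token in the sequence, $i$ the index of a pair of embedding dimensions and $d_{model}$ the embedding size, that encoding is:

$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i / d_{model}}}\right), \qquad
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i / d_{model}}}\right)
$$

In the code, `div_term` holds the factors $10000^{-2i / d_{model}}$, computed through an exponential and a logarithm.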


In [None]:
import math
from einops import rearrange


class PositionalEncoding(nn.Module):
    """
    This PE module comes from:
    Pytorch. (2021). LANGUAGE MODELING WITH NN.TRANSFORMER AND TORCHTEXT. https://pytorch.org/tutorials/beginner/transformer_tutorial.html
    """
    def __init__(self, d_model: int, dropout: float, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

        position = torch.arange(max_len).unsqueeze(1).to(DEVICE)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)).to(DEVICE)
        pe = torch.zeros(max_len, 1, d_model).to(DEVICE)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = rearrange(x, "b s e -> s b e")
        """
        Args:
            x: Tensor, shape [seq_len, batch_size, embedding_dim]
        """
        x = x + self.pe[:x.size(0)]
        x = rearrange(x, "s b e -> b s e")
        return self.dropout(x)

### Attention layers
We use a `MultiheadAttention` module that can perform self-attention as well as cross-attention (depending on what you give as queries, keys and values).

**Attention**


It takes the multi-head queries, keys and values as input.
It computes the attention between the queries and the keys and returns the attended values.

The implementation of this function can greatly be improved with *einsums*.
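
To make the computation concrete, here is a small sketch of scaled dot-product attention on random tensors. It is an illustration under assumed sizes and an assumed `[batch, heads, sequence, head dim]` layout, not the reference solution:

```python
import math
import torch

batch_size, n_heads, len_q, len_k, dim_head = 1, 2, 3, 4, 8  # illustrative sizes
q = torch.randn(batch_size, n_heads, len_q, dim_head)
k = torch.randn(batch_size, n_heads, len_k, dim_head)
v = torch.randn(batch_size, n_heads, len_k, dim_head)

# `True` means "do not attend"; here the last key is hidden from every query.
mask = torch.zeros(batch_size, 1, len_q, len_k, dtype=torch.bool)
mask[..., -1] = True

# Similarity between every query and every key, scaled by sqrt(dim_head)
scores = torch.einsum("bhsd,bhtd->bhst", q, k) / math.sqrt(dim_head)
scores = scores.masked_fill(mask, float("-inf"))

attn = torch.softmax(scores, dim=-1)          # each row sums to 1
y = torch.einsum("bhst,bhtd->bhsd", attn, v)  # weighted average of the values
```

Masked positions receive a weight of exactly zero after the softmax, so they contribute nothing to `y`.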

**MultiheadAttention**

Computes the multi-head queries, keys and values and feeds them to the `attention` function.
You also need to merge the key padding mask and the attention mask into one mask.

The implementation of this module can greatly be improved with *einops.rearrange*.
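
One possible way to merge the two masks relies on broadcasting. The sketch below uses made-up masks, plain indexing instead of `einops`, and the convention that `True` means "masked":

```python
import torch

batch_size, len_q, len_k = 2, 3, 3  # illustrative sizes

# Sentence 0 ends with one padding token, sentence 1 has none.
key_padding_mask = torch.tensor([
    [False, False, True],
    [False, False, False],
])
# Causal mask: position i cannot attend to positions j > i.
attn_mask = torch.triu(torch.ones(len_q, len_k, dtype=torch.bool), diagonal=1)

padding = key_padding_mask[:, None, None, :]  # [batch_size, 1, 1, len_k]
causal = attn_mask[None, None, :, :]          # [1, 1, len_q, len_k]

# A position is masked if either mask forbids it.
mask = padding | causal                       # [batch_size, 1, len_q, len_k]
```

The singleton head dimension lets the same mask broadcast over all heads when it is applied to the attention scores.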

In [None]:
from einops.layers.torch import Rearrange
from einops import rearrange
import math

def attention(
        q: torch.FloatTensor,
        k: torch.FloatTensor,
        v: torch.FloatTensor,
        mask: torch.BoolTensor=None,
        dropout: nn.Dropout=None,
    ) -> tuple:
    """Computes multihead scaled dot-product attention from the
    projected queries, keys and values.

    Args
    ----
        q: Batch of queries, already split into heads.
            Shape of [batch_size, n_heads, seq_len_1, dim_head].
        k: Batch of keys, already split into heads.
            Shape of [batch_size, n_heads, seq_len_2, dim_head].
        v: Batch of values, already split into heads.
            Shape of [batch_size, n_heads, seq_len_2, dim_head].
        mask: Prevent tokens from attending to some other tokens (for padding or autoregressive attention).
            Attention is prevented where the mask is `True`.
            Shape of [batch_size, n_heads, seq_len_1, seq_len_2],
            or broadcastable to that shape.
        dropout: Dropout layer to use.

    Output
    ------
        y: Multihead scaled dot-product attention between the queries, keys and values.
            Shape of [batch_size, n_heads, seq_len_1, dim_head].
        attn: Computed attention between the keys and the queries.
            Shape of [batch_size, n_heads, seq_len_1, seq_len_2].
    """
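    d_k = q.size(-1)
    # Scaled dot-product scores between every query and every key
    scores = torch.einsum('bhsd,bhtd->bhst', q, k) / math.sqrt(d_k)

    if mask is not None:
        # `True` marks the positions that must not be attended to
        scores = scores.masked_fill(mask, -1e9)
    # Normalize the scores over the keys to get the attention weights
    attn = torch.softmax(scores, dim=-1)
    if dropout is not None:
        attn = dropout(attn)
    output = torch.einsum('bhst,bhtd->bhsd', attn, v)
    return output, attn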

class MultiheadAttention(nn.Module):
    """Multihead attention module.
    Can be used as a self-attention and cross-attention layer.
    The queries, keys and values are projected into multiple heads
    before computing the attention between those tensors.

    Parameters
    ----------
        dim: Dimension of the input tokens.
        n_heads: Number of heads. `dim` must be divisible by `n_heads`.
        dropout: Dropout rate.
    """
    def __init__(
            self,
            dim: int,
            n_heads: int,
            dropout: float,
        ):
        super().__init__()

        assert dim % n_heads == 0

        self.d_k = dim // n_heads
        self.n_heads = n_heads
        self.attn = None
        self.dropout = nn.Dropout(dropout)
        self.linears = nn.ModuleList([nn.Linear(dim, dim) for _ in range(4)])
    def forward(
            self,
            q: torch.FloatTensor,
            k: torch.FloatTensor,
            v: torch.FloatTensor,
            key_padding_mask: torch.BoolTensor = None,
            attn_mask: torch.BoolTensor = None,
        ) -> torch.FloatTensor:
        """Computes the scaled multi-head attention from the input queries,
        keys and values.

        Project those queries, keys and values before feeding them
        to the `attention` function.

        The masks are boolean masks. Tokens are prevented from attending to
        positions where the mask is `True`.

        Args
        ----
            q: Batch of queries.
                Shape of [batch_size, seq_len_1, dim_model].
            k: Batch of keys.
                Shape of [batch_size, seq_len_2, dim_model].
            v: Batch of values.
                Shape of [batch_size, seq_len_2, dim_model].
            key_padding_mask: Prevent attending to padding tokens.
                Shape of [batch_size, seq_len_2].
            attn_mask: Prevent attending to subsequent tokens.
                Shape of [seq_len_1, seq_len_2].

        Output
        ------
            y: Computed multihead attention.
                Shape of [batch_size, seq_len_1, dim_model].
        """
        # Merge the two masks into a single mask broadcastable to
        # [batch_size, n_heads, seq_len_1, seq_len_2]; `True` means "masked".
        mask = None
        if key_padding_mask is not None:
            mask = rearrange(key_padding_mask, 'b s -> b 1 1 s')

        if attn_mask is not None:
            attn_mask = rearrange(attn_mask, 's t -> 1 1 s t')
            mask = attn_mask if mask is None else mask | attn_mask

        # Project the inputs and split the embedding dimension into heads
        query, key, value = [
            rearrange(linear(x), 'b s (h d) -> b h s d', h=self.n_heads)
            for linear, x in zip(self.linears[:3], (q, k, v))
        ]

        x, self.attn = attention(query, key, value, mask=mask, dropout=self.dropout)
        # Merge the heads back and apply the output projection
        x = rearrange(x, 'b h s d -> b s (h d)')

        return self.linears[-1](x)


### Encoder and decoder layers

**TransformerEncoder**

Applies self-attention layers to the source tokens.
It only needs the source key padding mask.


**TransformerDecoder**

Applies masked self-attention layers to the target tokens and cross-attention
layers between the source and the target tokens.
It needs the source and target key padding masks, and the target attention mask.

In [None]:
class TransformerDecoderLayer(nn.Module):
    """Single decoder layer.

    Parameters
    ----------
        d_model: The dimension of decoders inputs/outputs.
        d_ff: Hidden dimension of the feedforward networks.
        nhead: Number of heads for each multi-head attention.
        dropout: Dropout rate.
    """

    def __init__(
            self,
            d_model: int,
            d_ff: int,
            nhead: int,
            dropout: float
        ):
        super().__init__()

        self.self_attn = MultiheadAttention(d_model, nhead, dropout)
        self.cross_attn = MultiheadAttention(d_model, nhead, dropout)

        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

        self.dropout = nn.Dropout(dropout)
    def forward(
            self,
            src: torch.FloatTensor,
            tgt: torch.FloatTensor,
            tgt_mask_attn: torch.BoolTensor,
            src_key_padding_mask: torch.BoolTensor,
            tgt_key_padding_mask: torch.BoolTensor,
        ) -> torch.FloatTensor:
        """Decode the next target tokens based on the previous tokens.

        Args
        ----
            src: Batch of source sentences.
                Shape of [batch_size, src_seq_len, dim_model].
            tgt: Batch of target sentences.
                Shape of [batch_size, tgt_seq_len, dim_model].
            tgt_mask_attn: Mask to prevent attention to subsequent tokens.
                Shape of [tgt_seq_len, tgt_seq_len].
            src_key_padding_mask: Mask to prevent attention to padding in src sequence.
                Shape of [batch_size, src_seq_len].
            tgt_key_padding_mask: Mask to prevent attention to padding in tgt sequence.
                Shape of [batch_size, tgt_seq_len].

        Output
        ------
            y:  Batch of sequence of embeddings representing the predicted target tokens
                Shape of [batch_size, tgt_seq_len, dim_model].
        """
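        # Masked self-attention over the target tokens, then residual + norm.
        # `MultiheadAttention.forward` returns a single tensor, not a tuple.
        tgt2 = self.self_attn(tgt, tgt, tgt, attn_mask=tgt_mask_attn,
                              key_padding_mask=tgt_key_padding_mask)
        tgt = tgt + self.dropout(tgt2)
        tgt = self.norm1(tgt)

        # Cross-attention: target queries attend to the encoded source
        tgt2 = self.cross_attn(tgt, src, src,
                               key_padding_mask=src_key_padding_mask)
        tgt = tgt + self.dropout(tgt2)
        tgt = self.norm2(tgt)

        # Position-wise feedforward network, then residual + norm
        tgt2 = self.feed_forward(tgt)
        tgt = tgt + self.dropout(tgt2)
        tgt = self.norm3(tgt)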

        return tgt


class TransformerDecoder(nn.Module):
    """Implementation of the transformer decoder stack.

    Parameters
    ----------
        d_model: The dimension of decoders inputs/outputs.
        d_ff: Hidden dimension of the feedforward networks.
        num_decoder_layer: Number of stacked decoders.
        nhead: Number of heads for each multi-head attention.
        dropout: Dropout rate.
    """

    def __init__(
            self,
            d_model: int,
            d_ff: int,
            num_decoder_layer: int,
            nhead: int,
            dropout: float
        ):
        super().__init__()

        self.layers = nn.ModuleList(
            [
              TransformerDecoderLayer(
              d_model=d_model, nhead=nhead, d_ff=d_ff,
              dropout=dropout)
              for _ in range(num_decoder_layer)
            ]
            )


    def forward(
            self,
            src: torch.FloatTensor,
            tgt: torch.FloatTensor,
            tgt_mask_attn: torch.BoolTensor,
            src_key_padding_mask: torch.BoolTensor,
            tgt_key_padding_mask: torch.BoolTensor,
        ) -> torch.FloatTensor:
        """Decodes the source sequence by sequentially passing
        the encoded source sequence and the target sequence through the decoder stack.

        Args
        ----
            src: Batch of encoded source sentences.
                Shape of [batch_size, src_seq_len, dim_model].
            tgt: Batch of target sentences.
                Shape of [batch_size, tgt_seq_len, dim_model].
            tgt_mask_attn: Mask to prevent attention to subsequent tokens.
                Shape of [tgt_seq_len, tgt_seq_len].
            src_key_padding_mask: Mask to prevent attention to padding in src sequence.
                Shape of [batch_size, src_seq_len].
            tgt_key_padding_mask: Mask to prevent attention to padding in tgt sequence.
                Shape of [batch_size, tgt_seq_len].

        Output
        ------
            y:  Batch of sequence of embeddings representing the predicted target tokens
                Shape of [batch_size, tgt_seq_len, dim_model].
        """
        for layer in self.layers:
            tgt = layer(src, tgt, tgt_mask_attn, src_key_padding_mask, tgt_key_padding_mask)
        return tgt


class TransformerEncoderLayer(nn.Module):
    """Single encoder layer.

    Parameters
    ----------
        d_model: The dimension of input tokens.
        d_ff: Hidden dimension of the feedforward networks.
        nhead: Number of heads for each multi-head attention.
        dropout: Dropout rate.
    """

    def __init__(
            self,
            d_model: int,
            d_ff: int,
            nhead: int,
            dropout: float,
        ):
        super().__init__()

        self.self_attn = MultiheadAttention(d_model, nhead, dropout)

        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        self.dropout = nn.Dropout(dropout)
    def forward(
        self,
        src: torch.FloatTensor,
        key_padding_mask: torch.BoolTensor
        ) -> torch.FloatTensor:
        """Encodes the input. Does not attend to masked inputs.

        Args
        ----
            src: Batch of embedded source tokens.
                Shape of [batch_size, src_seq_len, dim_model].
            key_padding_mask: Mask preventing attention to padding tokens.
                Shape of [batch_size, src_seq_len].

        Output
        ------
            y: Batch of encoded source tokens.
                Shape of [batch_size, src_seq_len, dim_model].
        """
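        # Self-attention over the source tokens, then residual + norm.
        # `MultiheadAttention.forward` returns a single tensor, not a tuple.
        src2 = self.self_attn(src, src, src, key_padding_mask=key_padding_mask)
        src = src + self.dropout(src2)
        src = self.norm1(src)

        # Position-wise feedforward network, then residual + norm
        src2 = self.feed_forward(src)
        src = src + self.dropout(src2)
        src = self.norm2(src)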

        return src

class TransformerEncoder(nn.Module):
    """Implementation of the transformer encoder stack.

    Parameters
    ----------
        d_model: The dimension of encoders inputs.
        dim_feedforward: Hidden dimension of the feedforward networks.
        num_encoder_layers: Number of stacked encoders.
        nheads: Number of heads for each multi-head attention.
        dropout: Dropout rate.
    """

    def __init__(
            self,
            d_model: int,
            dim_feedforward: int,
            num_encoder_layers: int,
            nheads: int,
            dropout: float
        ):
        super().__init__()
        self.layers = nn.ModuleList(
            [
              TransformerEncoderLayer(
              d_model=d_model, nhead=nheads, d_ff=dim_feedforward,
              dropout=dropout)
              for _ in range(num_encoder_layers)
            ]
            )


    def forward(
            self,
            src: torch.FloatTensor,
            key_padding_mask: torch.BoolTensor
        ) -> torch.FloatTensor:
        """Encodes the source sequence by sequentially passing
        the source sequence through the encoder stack.

        Args
        ----
            src: Batch of embedded source sentences.
                Shape of [batch_size, src_seq_len, dim_model].
            key_padding_mask: Mask preventing attention to padding tokens.
                Shape of [batch_size, src_seq_len].

        Output
        ------
            y: Batch of encoded source sequence.
                Shape of [batch_size, src_seq_len, dim_model].
        """
        for layer in self.layers:
            src = layer(src, key_padding_mask)
        return src

### Transformer
This section gathers the `Transformer` and the `TranslationTransformer` modules.

**Transformer**


The classical transformer architecture.
It takes the source and target token embeddings and
does the forward pass through the encoder and decoder.

**Translation Transformer**

Computes the source and target token embeddings, and applies a final head to produce next-token logits.
The output must not be the softmax but just the logits, because we use `nn.CrossEntropyLoss`.

It also creates the *src_key_padding_mask*, the *tgt_key_padding_mask* and the *tgt_mask_attn*.
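
To illustrate why raw logits are enough, the sketch below feeds random logits to `nn.CrossEntropyLoss` and reproduces the same value by hand with an explicit log-softmax. The sizes, the token ids and the use of `ignore_index` for padding are assumptions made for this example; the provided training loop may configure the loss differently.

```python
import torch
import torch.nn as nn

batch_size, seq_len, vocab_size, pad_idx = 2, 4, 10, 0  # illustrative values
logits = torch.randn(batch_size, seq_len, vocab_size)   # raw model output, no softmax
targets = torch.tensor([[5, 3, 7, 0],
                        [2, 9, 0, 0]])                  # 0 is the padding index here

# nn.CrossEntropyLoss expects [N, C] logits and [N] class indices,
# so the batch and sequence dimensions are flattened together.
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))

# Same value computed by hand: log-softmax, pick the target class,
# then average over the non-padding positions only.
log_probs = torch.log_softmax(logits, dim=-1)
picked = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
keep = targets != pad_idx
manual = -picked[keep].mean()
```

Because the loss applies the log-softmax internally, applying a softmax in the model as well would normalize the scores twice.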

In [None]:
class Transformer(nn.Module):
    """Implementation of a Transformer based on the paper: https://arxiv.org/pdf/1706.03762.pdf.

    Parameters
    ----------
        d_model: The dimension of encoders/decoders inputs/outputs.
        nhead: Number of heads for each multi-head attention.
        num_encoder_layers: Number of stacked encoders.
        num_decoder_layers: Number of stacked decoders.
        dim_feedforward: Hidden dimension of the feedforward networks.
        dropout: Dropout rate.
    """

    def __init__(
            self,
            d_model: int,
            nhead: int,
            num_encoder_layers: int,
            num_decoder_layers: int,
            dim_feedforward: int,
            dropout: float,
        ):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=dim_feedforward,
            dropout=dropout, batch_first=True
            )
        self.encoder = nn.TransformerEncoder(
            encoder_layer=encoder_layer, num_layers=num_encoder_layers
            )
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=dim_feedforward,
            dropout=dropout, batch_first=True
            )
        self.decoder = nn.TransformerDecoder(
            decoder_layer=decoder_layer, num_layers=num_decoder_layers
            )

    def forward(
            self,
            src: torch.FloatTensor,
            tgt: torch.FloatTensor,
            tgt_mask_attn: torch.BoolTensor,
            src_key_padding_mask: torch.BoolTensor,
            tgt_key_padding_mask: torch.BoolTensor
        ) -> torch.FloatTensor:
        """Compute next token embeddings.

        Args
        ----
            src: Batch of source sequences.
                Shape of [batch_size, src_seq_len, dim_model].
            tgt: Batch of target sequences.
                Shape of [batch_size, tgt_seq_len, dim_model].
            tgt_mask_attn: Mask to prevent attention to subsequent tokens.
                Shape of [tgt_seq_len, tgt_seq_len].
            src_key_padding_mask: Mask to prevent attention to padding in src sequence.
                Shape of [batch_size, src_seq_len].
            tgt_key_padding_mask: Mask to prevent attention to padding in tgt sequence.
                Shape of [batch_size, tgt_seq_len].

        Output
        ------
            y: Next token embeddings, given the previous target tokens and the source tokens.
                Shape of [batch_size, tgt_seq_len, dim_model].
        """
        enc = self.encoder(src, src_key_padding_mask=src_key_padding_mask)
        dec = self.decoder(
            tgt, enc, tgt_mask=tgt_mask_attn, tgt_key_padding_mask=tgt_key_padding_mask,
            memory_key_padding_mask=src_key_padding_mask
            )
        return dec


class TranslationTransformer(nn.Module):
    """Basic Transformer encoder and decoder for a translation task.
    Manage the masks creation, and the token embeddings.
    Position embeddings can be learnt with a standard `nn.Embedding` layer.

    Parameters
    ----------
        n_tokens_src: Number of tokens in the source vocabulary.
        n_tokens_tgt: Number of tokens in the target vocabulary.
        n_heads: Number of heads for each multi-head attention.
        dim_embedding: Dimension size of the word embeddings (for both language).
        dim_hidden: Dimension size of the feedforward layers
            (for both the encoder and the decoder).
        n_layers: Number of layers in the encoder and decoder.
        dropout: Dropout rate.
        src_pad_idx: Source padding index value.
        tgt_pad_idx: Target padding index value.
    """
    def __init__(
            self,
            n_tokens_src: int,
            n_tokens_tgt: int,
            n_heads: int,
            dim_embedding: int,
            dim_hidden: int,
            n_layers: int,
            dropout: float,
            src_pad_idx: int,
            tgt_pad_idx: int,
        ):
        super().__init__()

        self.transformer = nn.Transformer(
            d_model=dim_embedding,nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            dim_feedforward=dim_hidden, dropout=dropout, batch_first=True
            )

        self.src_pad_idx = src_pad_idx
        self.tgt_pad_idx = tgt_pad_idx

        self.src_embedding = nn.Embedding(n_tokens_src, dim_embedding)
        self.tgt_embedding = nn.Embedding(n_tokens_tgt, dim_embedding)

        self.positional_encoding = PositionalEncoding(dim_embedding, dropout)

        self.linear_out = nn.Linear(dim_embedding, n_tokens_tgt)

    def forward(
            self,
            source: torch.LongTensor,
            target: torch.LongTensor
        ) -> torch.FloatTensor:
        """Predict the target token logits based on the source tokens.

        Args
        ----
            source: Batch of source sentences.
                Shape of [batch_size, seq_len_src].
            target: Batch of target sentences.
                Shape of [batch_size, seq_len_tgt].

        Output
        ------
            y: Distributions over the next token for all tokens in each sentence.
                Those need to be the logits only, do not apply a softmax because
                it will be done in the loss computation for numerical stability.
                See https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html for more information.
                Shape of [batch_size, seq_len_tgt, n_tokens_tgt].
        """
        src_embedded = self.src_embedding(source)
        src_embedded = self.positional_encoding(src_embedded)

        tgt_embedded = self.tgt_embedding(target)
        tgt_embedded = self.positional_encoding(tgt_embedded)

        tgt_mask_attn = self.generate_causal_mask(target)
        src_key_padding_mask, tgt_key_padding_mask = self.generate_key_padding_mask(source, target)

        transformer_output = self.transformer(
            src=src_embedded, tgt=tgt_embedded, tgt_mask=tgt_mask_attn, src_key_padding_mask=src_key_padding_mask,
            tgt_key_padding_mask=tgt_key_padding_mask)
        output = self.linear_out(transformer_output)
        return output

    def generate_causal_mask(
            self,
            target: torch.LongTensor,
        ) -> torch.BoolTensor:
        """Generate the mask to prevent attending to subsequent tokens.

        Args
        ----
            target: Batch of target sentences.
                Shape of [batch_size, seq_len_tgt].

        Output
        ------
            tgt_mask_attn: Mask to prevent attention to subsequent tokens.
                Shape of [seq_len_tgt, seq_len_tgt].

        """

        seq_len = target.shape[1]

        tgt_mask = torch.ones((seq_len, seq_len), dtype=torch.bool)
        tgt_mask = torch.triu(tgt_mask, diagonal=1).to(target.device)

        return tgt_mask

    def generate_key_padding_mask(
            self,
            source: torch.LongTensor,
            target: torch.LongTensor,
        ) -> tuple:
        """Generate the masks to prevent attending padding tokens.

        Args
        ----
            source: Batch of source sentences.
                Shape of [batch_size, seq_len_src].
            target: Batch of target sentences.
                Shape of [batch_size, seq_len_tgt].

        Output
        ------
            src_key_padding_mask: Mask to prevent attention to padding in src sequence.
                Shape of [batch_size, seq_len_src].
            tgt_key_padding_mask: Mask to prevent attention to padding in tgt sequence.
                Shape of [batch_size, seq_len_tgt].

        """

        src_key_padding_mask = source == self.src_pad_idx
        tgt_key_padding_mask = target == self.tgt_pad_idx

        return src_key_padding_mask, tgt_key_padding_mask

# Greedy search

One idea to explore once you have your model working is to implement a greedy search to generate a target translation from a trained model and an input source string. The next token will simply be the most probable one. Compare this decoding strategy with the beam search strategy below.

In [None]:
def greedy_search(
        model: nn.Module,
        source: str,
        src_vocab: Vocab,
        tgt_vocab: Vocab,
        src_tokenizer,
        device: str,
        max_sentence_length: int,
    ) -> str:
    """Do a greedy search to produce a translation.

    Args
    ----
        model: The translation model. Assumes it produces logits (before softmax).
        source: The sentence to translate.
        src_vocab: The source vocabulary.
        tgt_vocab: The target vocabulary.
        src_tokenizer: Tokenizer splitting the source sentence into tokens.
        device: Device on which we run the inference.
        max_sentence_length: Maximum number of tokens for the translated sentence.

    Output
    ------
        sentence: The translated source sentence.
    """
    # Same token and vocabulary conventions as in `beam_search`
    src_tokens = ['<bos>'] + src_tokenizer(source) + ['<eos>']
    src_tensor = torch.LongTensor(src_vocab(src_tokens)).unsqueeze(dim=0).to(device)
    tgt_indices = tgt_vocab(['<bos>'])
    EOS_IDX = tgt_vocab['<eos>']

    model.to(device)
    with torch.no_grad():
        for _ in range(max_sentence_length):
            tgt_tensor = torch.LongTensor(tgt_indices).unsqueeze(dim=0).to(device)

            # Logits of shape [1, n_tokens, n_tokens_tgt]; keep the last position
            output = model(src_tensor, tgt_tensor)
            next_token_index = output[:, -1].argmax(dim=-1).item()

            tgt_indices.append(next_token_index)
            if next_token_index == EOS_IDX:
                break

    # Drop the leading <bos> token and, if present, the trailing <eos> token
    # (assumes a torchtext-style `Vocab` exposing `lookup_tokens`)
    result_tokens = tgt_vocab.lookup_tokens(tgt_indices[1:])
    if result_tokens and result_tokens[-1] == '<eos>':
        result_tokens = result_tokens[:-1]

    translated_sentence = ' '.join(result_tokens)
    return translated_sentence

# Beam search
Beam search is a smarter way of producing a sequence of tokens from
an autoregressive model than just using a greedy search.

The greedy search always chooses the most probable token as the one
and only next target token, and repeats this process until the *\<eos\>* token is predicted.

Instead, the beam search selects the k-most probable tokens at each step.
From those k tokens, the current sequence is duplicated k times and the k tokens are appended to the k sequences to produce new k sequences.

*You don't have to understand this code, but understanding this code once the TP is over could improve your torch tensors skills.*

---

**More explanations**

Since it is done at each step, the number of sequences grows exponentially (k sequences after the first step, k² sequences after the second...).
In order to keep the number of sequences low, we discard all sequences except the top-s most likely ones.
To do that, we keep track of the likelihood of each sequence.

Formally, we define $s = [s_1, ..., s_{N_s}]$ as the source sequence made of $N_s$ tokens.
We also define $t^i = [t_1, ..., t_i]$ as the target sequence at the beginning of the step $i$.

The output of the model parameterized by $\theta$ is:

$$
T_{i+1} = p(t_{i+1} | s, t^i ; \theta )
$$

Where $T_{i+1}$ is the distribution of the next token $t_{i+1}$.

Then, we define the likelihood of a target sentence $t = [t_1, ..., t_{N_t}]$ as:

$$
L(t) = \prod_{i=1}^{N_t - 1} p(t_{i+1} | s, t^{i}; \theta )
$$
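
As a small numerical illustration of this product (the per-step probabilities below are made up), the likelihood can also be accumulated as a sum of log-probabilities. This is a common alternative for long sequences and is not necessarily what the provided `beam_search` does:

```python
import math
import torch

# Made-up values of p(t_{i+1} | s, t^i) for a short target sentence
step_probs = torch.tensor([0.9, 0.5, 0.8, 0.7])

likelihood = step_probs.prod()           # L(t) as a product of probabilities
log_likelihood = step_probs.log().sum()  # log L(t) as a sum of log-probabilities

# A long product of probabilities underflows in float32,
# while the sum of logs stays well within range.
long_probs = torch.full((2000,), 0.5)
underflow = long_probs.prod()            # 0.5 ** 2000 is not representable
stable = long_probs.log().sum()          # 2000 * log(0.5)
```

Both quantities rank candidate sentences identically, since the logarithm is monotonic.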

Pseudocode of the beam search:
```
source: [N_s source tokens]  # Shape of [total_source_tokens]
target: [1, <bos> token]  # Shape of [n_sentences, current_target_tokens]
target_prob: [1]  # Shape of [n_sentences]
# We use `n_sentences` as the batch_size dimension

while current_target_tokens <= max_target_length:
    source = repeat(source, n_sentences)  # Shape of [n_sentences, total_source_tokens]
    predicted = model(source, target)[:, -1]  # Predict the next token distributions of all the n_sentences
    tokens_idx, tokens_prob = topk(predicted, k)

    # Append the `n_sentences * k` tokens to the `n_sentences` sentences
    target = repeat(target, k)  # Shape of [n_sentences * k, current_target_tokens]
    target = append_tokens(target, tokens_idx)  # Shape of [n_sentences * k, current_target_tokens + 1]

    # Update the sentences probabilities
    target_prob = repeat(target_prob, k)  # Shape of [n_sentences * k]
    target_prob *= tokens_prob

    if n_sentences * k >= max_sentences:
        target, target_prob = topk_prob(target, target_prob, k=max_sentences)
    else:
        n_sentences *= k

    current_target_tokens += 1
```
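
A single iteration of the pseudocode above can be traced on toy tensors. In this sketch the token ids and probabilities are made up, and `predicted` stands in for the model output:

```python
import torch

k, max_sentences = 2, 3  # illustrative beam width and number of kept sentences

# Two partial sentences (made-up token ids) with their probabilities
target = torch.tensor([[1, 4],
                       [1, 7]])
target_prob = torch.tensor([0.6, 0.4])

# Stand-in for the model: next-token distribution for each sentence
predicted = torch.tensor([[0.10, 0.20, 0.70, 0.00],
                          [0.50, 0.25, 0.15, 0.10]])
tokens_prob, tokens_idx = predicted.topk(k, dim=-1)  # both [n_sentences, k]

# Duplicate each sentence k times and append one candidate token to each copy
target = target.repeat_interleave(k, dim=0)                     # [n_sentences * k, n_tokens]
target = torch.cat((target, tokens_idx.reshape(-1, 1)), dim=1)  # [n_sentences * k, n_tokens + 1]

# Update the probabilities of the extended sentences
target_prob = target_prob.repeat_interleave(k) * tokens_prob.flatten()

# Prune: keep only the `max_sentences` most probable sentences
target_prob, keep = target_prob.topk(max_sentences)
target = target[keep]
```

Here the four candidates have probabilities 0.42, 0.12, 0.20 and 0.10, so the pruning step keeps the sentences ending in tokens 2, 0 and 1, in that order.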

In [None]:
def beautify(sentence: str) -> str:
    """Removes useless spaces.
    """
    punc = {'.', ',', ';'}
    for p in punc:
        sentence = sentence.replace(f' {p}', p)

    links = {'-', "'"}
    for l in links:
        sentence = sentence.replace(f'{l} ', l)
        sentence = sentence.replace(f' {l}', l)

    return sentence

In [None]:
def indices_terminated(
        target: torch.FloatTensor,
        eos_token: int
    ) -> tuple:
    """Split the target sentences between the terminated and the non-terminated
    sentence. Return the indices of those two groups.

    Args
    ----
        target: The sentences.
            Shape of [batch_size, n_tokens].
        eos_token: Value of the End-of-Sentence token.

    Output
    ------
        terminated: Indices of the terminated sentences (those containing the eos_token).
            Shape of [n_terminated, ].
        non-terminated: Indices of the unfinished sentences.
            Shape of [batch_size-n_terminated, ].
    """
    terminated = [i for i, t in enumerate(target) if eos_token in t]
    non_terminated = [i for i, t in enumerate(target) if eos_token not in t]
    return torch.LongTensor(terminated), torch.LongTensor(non_terminated)


def append_beams(
        target: torch.FloatTensor,
        beams: torch.FloatTensor
    ) -> torch.FloatTensor:
    """Add the beam tokens to the current sentences.
    Duplicate the sentences so one token is added per beam per batch.

    Args
    ----
        target: Batch of unfinished sentences.
            Shape of [batch_size, n_tokens].
        beams: Batch of beams for each sentences.
            Shape of [batch_size, n_beams].

    Output
    ------
        target: Batch of sentences with one beam per sentence.
            Shape of [batch_size * n_beams, n_tokens+1].
    """
    batch_size, n_beams = beams.shape
    n_tokens = target.shape[1]

    target = einops.repeat(target, 'b t -> b c t', c=n_beams)  # [batch_size, n_beams, n_tokens]
    beams = beams.unsqueeze(dim=2)  # [batch_size, n_beams, 1]

    target = torch.cat((target, beams), dim=2)  # [batch_size, n_beams, n_tokens+1]
    target = target.view(batch_size*n_beams, n_tokens+1)  # [batch_size * n_beams, n_tokens+1]
    return target


def beam_search(
        model: nn.Module,
        source: str,
        src_vocab: Vocab,
        tgt_vocab: Vocab,
        src_tokenizer,
        device: str,
        beam_width: int,
        max_target: int,
        max_sentence_length: int,
    ) -> list:
    """Do a beam search to produce probable translations.

    Args
    ----
        model: The translation model. Assumes it produces logits (before softmax).
        source: The sentence to translate.
        src_vocab: The source vocabulary.
        tgt_vocab: The target vocabulary.
        src_tokenizer: Tokenizer splitting the source sentence into tokens.
        device: Device to which we make the inference.
        beam_width: Number of top-k tokens we keep at each stage.
        max_target: Maximum number of target sentences we keep at the end of each stage.
        max_sentence_length: Maximum number of tokens for the translated sentence.

    Output
    ------
        sentences: List of sentences ordered by their likelihood.
    """
    src_tokens = ['<bos>'] + src_tokenizer(source) + ['<eos>']
    src_tokens = src_vocab(src_tokens)

    tgt_tokens = ['<bos>']
    tgt_tokens = tgt_vocab(tgt_tokens)

    # To tensor and add unitary batch dimension
    src_tokens = torch.LongTensor(src_tokens).to(device)
    tgt_tokens = torch.LongTensor(tgt_tokens).unsqueeze(dim=0).to(device)
    target_probs = torch.FloatTensor([1]).to(device)
    model.to(device)

    EOS_IDX = tgt_vocab['<eos>']
    with torch.no_grad():
        while tgt_tokens.shape[1] < max_sentence_length:
            batch_size, n_tokens = tgt_tokens.shape

            # Get next beams
            src = einops.repeat(src_tokens, 't -> b t', b=tgt_tokens.shape[0])
            predicted = model.forward(src, tgt_tokens)
            predicted = torch.softmax(predicted, dim=-1)
            probs, predicted = predicted[:, -1].topk(k=beam_width, dim=-1)

            # Separate the terminated sentences from the others
            idx_terminated, idx_not_terminated = indices_terminated(tgt_tokens, EOS_IDX)
            idx_terminated, idx_not_terminated = idx_terminated.to(device), idx_not_terminated.to(device)

            tgt_terminated = torch.index_select(tgt_tokens, dim=0, index=idx_terminated)
            tgt_probs_terminated = torch.index_select(target_probs, dim=0, index=idx_terminated)

            filter_t = lambda t: torch.index_select(t, dim=0, index=idx_not_terminated)
            tgt_others = filter_t(tgt_tokens)
            tgt_probs_others = filter_t(target_probs)
            predicted = filter_t(predicted)
            probs = filter_t(probs)

            # Add the top tokens to the previous target sentences
            tgt_others = append_beams(tgt_others, predicted)

            # Add padding to terminated target
            padd = torch.zeros((len(tgt_terminated), 1), dtype=torch.long, device=device)
            tgt_terminated = torch.cat(
                (tgt_terminated, padd),
                dim=1
            )

            # Update each target sentence probabilities
            tgt_probs_others = torch.repeat_interleave(tgt_probs_others, beam_width)
            tgt_probs_others *= probs.flatten()
            tgt_probs_terminated *= 0.999  # Penalize short sequences over time

            # Group up the terminated and the others
            target_probs = torch.cat(
                (tgt_probs_others, tgt_probs_terminated),
                dim=0
            )
            tgt_tokens = torch.cat(
                (tgt_others, tgt_terminated),
                dim=0
            )

            # Keep only the top `max_target` target sentences
            if target_probs.shape[0] <= max_target:
                continue

            target_probs, indices = target_probs.topk(k=max_target, dim=0)
            tgt_tokens = torch.index_select(tgt_tokens, dim=0, index=indices)

    sentences = []
    for tgt_sentence in tgt_tokens:
        tgt_sentence = list(tgt_sentence)[1:]  # Remove <bos> token
        tgt_sentence = list(takewhile(lambda t: t != EOS_IDX, tgt_sentence))
        tgt_sentence = ' '.join(tgt_vocab.lookup_tokens(tgt_sentence))
        sentences.append(tgt_sentence)

    sentences = [beautify(s) for s in sentences]

    # Join the sentences with their likelihood
    sentences = [(s, p.item()) for s, p in zip(sentences, target_probs)]
    # Sort the sentences by their likelihood
    sentences = [(s, p) for s, p in sorted(sentences, key=lambda k: k[1], reverse=True)]

    return sentences
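
As a rough illustration of the bookkeeping in `beam_search`, the sketch below replays two expansion steps on plain Python lists. The names `beam_step` and `toy_model` are hypothetical and not part of the TP code. The toy model returns a fixed distribution regardless of the prefix, and finished hypotheses are kept without the padding column that the tensor version needs.

```python
def beam_step(hyps, next_token_probs, beam_width, max_target, eos, decay=0.999):
    """One expansion step on a list of (tokens, probability) pairs."""
    candidates = []
    for tokens, prob in hyps:
        if eos in tokens:
            # Finished hypothesis: keep it, slightly decayed at every step
            candidates.append((tokens, prob * decay))
            continue
        probs = next_token_probs(tokens)  # dict: token -> probability
        top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:beam_width]
        for token, p in top:
            # The score of a hypothesis is the product of its token probabilities
            candidates.append((tokens + [token], prob * p))
    # Keep only the `max_target` most likely hypotheses
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:max_target]


def toy_model(tokens):
    # Hypothetical fixed distribution, independent of the prefix
    return {'le': 0.5, 'chat': 0.3, '<eos>': 0.2}


hyps = [(['<bos>'], 1.0)]
for _ in range(2):
    hyps = beam_step(hyps, toy_model, beam_width=3, max_target=3, eos='<eos>')

for tokens, prob in hyps:
    print(f'{prob:.4f}', ' '.join(tokens))
```

After the second step the finished hypothesis `<bos> <eos>` is still in the beam, which shows why short sentences need a small penalty: every unfinished hypothesis keeps multiplying its score by numbers below 1.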

# Training loop
This is a basic training loop. It takes one large configuration dictionary to avoid endless argument lists in the functions.
We use [Weights and Biases](https://wandb.ai/) to log the training runs.
It logs all training information and model performance in the cloud.
You have to create an account to use it. Accounts are free for individuals and research teams.

In [None]:
def print_logs(dataset_type: str, logs: dict):
    """Print the logs.

    Args
    ----
        dataset_type: One of "Train", "Eval" or "Test".
        logs: Dictionary mapping each metric name to its value.
    """
    desc = [
        f'{name}: {value:.2f}'
        for name, value in logs.items()
    ]
    desc = '\t'.join(desc)
    desc = f'{dataset_type} -\t' + desc
    desc = desc.expandtabs(5)
    print(desc)


def topk_accuracy(
        real_tokens: torch.LongTensor,
        probs_tokens: torch.FloatTensor,
        k: int,
        tgt_pad_idx: int,
    ) -> torch.FloatTensor:
    """Compute the top-k accuracy.
    We ignore the PAD tokens.

    Args
    ----
        real_tokens: Real tokens of the target sentence.
            Shape of [batch_size * n_tokens].
        probs_tokens: Tokens probability predicted by the model.
            Shape of [batch_size * n_tokens, n_target_vocabulary].
        k: Top-k accuracy threshold.
        tgt_pad_idx: Target padding index value.

    Output
    ------
        acc: Scalar top-k accuracy value.
    """
    total = (real_tokens != tgt_pad_idx).sum()

    _, pred_tokens = probs_tokens.topk(k=k, dim=-1)  # [batch_size * n_tokens, k]
    real_tokens = einops.repeat(real_tokens, 'b -> b k', k=k)  # [batch_size * n_tokens, k]

    good = (pred_tokens == real_tokens) & (real_tokens != tgt_pad_idx)
    acc = good.sum() / total
    return acc
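
To make the metric concrete, here is a small worked example of top-k accuracy on plain Python lists. It is only a sketch: `topk_accuracy_lists` is a hypothetical helper, the scores are made up, and index 0 is assumed to be the padding token.

```python
def topk_accuracy_lists(real_tokens, scores, k, pad_idx):
    """Top-k accuracy over non-padding positions.

    real_tokens: list of target token indices, one per position.
    scores: list of score rows, one row of vocabulary scores per position.
    """
    good, total = 0, 0
    for real, row in zip(real_tokens, scores):
        if real == pad_idx:
            continue  # Padding positions are not counted at all
        total += 1
        # Indices of the k highest scores for this position
        topk = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        good += real in topk
    return good / total


PAD = 0  # Assumed padding index for this example
real = [2, 1, 3, PAD]
scores = [
    [0.10, 0.20, 0.60, 0.10],  # Token 2 is ranked first
    [0.10, 0.30, 0.20, 0.40],  # Token 1 is ranked second
    [0.50, 0.30, 0.15, 0.05],  # Token 3 is ranked last
    [0.90, 0.05, 0.03, 0.02],  # Padding position, ignored
]

for k in [1, 2, 4]:
    print(k, topk_accuracy_lists(real, scores, k, PAD))
```

The accuracy can only grow with `k`, and the padding row never enters the denominator, which matches the role of `tgt_pad_idx` in `topk_accuracy`.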


def loss_batch(
        model: nn.Module,
        source: torch.LongTensor,
        target: torch.LongTensor,
        config: dict,
    )-> dict:
    """Compute the metrics associated with this batch.
    The metrics are:
        - loss
        - top-1 accuracy
        - top-5 accuracy
        - top-10 accuracy

    Args
    ----
        model: The model to train.
        source: Batch of source tokens.
            Shape of [batch_size, n_src_tokens].
        target: Batch of target tokens.
            Shape of [batch_size, n_tgt_tokens].
        config: Additional parameters.

    Output
    ------
        metrics: Dictionary containing evaluated metrics on this batch.
    """
    device = config['device']
    loss_fn = config['loss'].to(device)
    metrics = dict()

    source, target = source.to(device), target.to(device)
    target_in, target_out = target[:, :-1], target[:, 1:]

    # Loss
    pred = model(source, target_in)  # [batch_size, n_tgt_tokens-1, n_vocab]
    pred = pred.view(-1, pred.shape[2])  # [batch_size * (n_tgt_tokens - 1), n_vocab]
    target_out = target_out.flatten()  # [batch_size * (n_tgt_tokens - 1),]
    metrics['loss'] = loss_fn(pred, target_out)

    # Accuracy - we ignore the padding predictions
    for k in [1, 5, 10]:
        metrics[f'top-{k}'] = topk_accuracy(target_out, pred, k, config['tgt_pad_idx'])

    return metrics
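
The slicing `target[:, :-1]` and `target[:, 1:]` in `loss_batch` is what implements teacher forcing. The sketch below shows the same shift on nested Python lists. The token indices (`BOS = 1`, `EOS = 2`, `PAD = 0`) are assumptions chosen for the example, not the values of the real vocabulary.

```python
BOS, EOS, PAD = 1, 2, 0  # Hypothetical special-token indices

target = [
    [BOS, 7, 8, 9, EOS],
    [BOS, 5, EOS, PAD, PAD],
]

# Decoder input: everything except the last token
target_in = [row[:-1] for row in target]
# Expected output: everything except the first token
target_out = [row[1:] for row in target]

# At each position the model sees target_in[t] and must predict target_out[t]
for seen, expected in zip(target_in[0], target_out[0]):
    print(f'input {seen} -> predict {expected}')

# Positions that contribute to the loss once padding is ignored
n_scored = sum(tok != PAD for row in target_out for tok in row)
print('scored positions:', n_scored)
```

The model is always fed the ground-truth previous token, never its own prediction, and the padded positions of the shorter sentence are excluded from the loss.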


def eval_model(model: nn.Module, dataloader: DataLoader, config: dict) -> dict:
    """Evaluate the model on the given dataloader.
    """
    device = config['device']
    logs = defaultdict(list)

    model.to(device)
    model.eval()

    with torch.no_grad():
        for source, target in dataloader:
            metrics = loss_batch(model, source, target, config)
            for name, value in metrics.items():
                logs[name].append(value.cpu().item())

    for name, values in logs.items():
        logs[name] = np.mean(values)
    return logs


def train_model(model: nn.Module, config: dict):
    """Train the model in a teacher forcing manner.
    """
    train_loader, val_loader = config['train_loader'], config['val_loader']
    train_dataset, val_dataset = train_loader.dataset.dataset, val_loader.dataset.dataset
    optimizer = config['optimizer']
    clip = config['clip']
    device = config['device']

    columns = ['epoch']
    for mode in ['train', 'validation']:
        columns += [
            f'{mode} - {colname}'
            for colname in ['source', 'target', 'predicted', 'likelihood']
        ]
    log_table = wandb.Table(columns=columns)


    print(f'Starting training for {config["epochs"]} epochs, using {device}.')
    for e in range(config['epochs']):
        print(f'\nEpoch {e+1}')

        model.to(device)
        model.train()
        logs = defaultdict(list)

        for batch_id, (source, target) in enumerate(train_loader):
            optimizer.zero_grad()

            metrics = loss_batch(model, source, target, config)
            loss = metrics['loss']

            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
            optimizer.step()

            for name, value in metrics.items():
                logs[name].append(value.cpu().item())  # Don't forget the '.item' to free the cuda memory

            if batch_id % config['log_every'] == 0:
                for name, value in logs.items():
                    logs[name] = np.mean(value)

                train_logs = {
                    f'Train - {m}': v
                    for m, v in logs.items()
                }
                wandb.log(train_logs)
                logs = defaultdict(list)

        # Logs
        if len(logs) != 0:
            for name, value in logs.items():
                logs[name] = np.mean(value)
            train_logs = {
                f'Train - {m}': v
                for m, v in logs.items()
            }
        else:
            logs = {
                m.split(' - ')[1]: v
                for m, v in train_logs.items()
            }

        print_logs('Train', logs)

        logs = eval_model(model, val_loader, config)
        print_logs('Eval', logs)
        val_logs = {
            f'Validation - {m}': v
            for m, v in logs.items()
        }

        val_source, val_target = val_dataset[ torch.randint(len(val_dataset), (1,)) ]
        val_pred, val_prob = beam_search(
            model,
            val_source,
            config['src_vocab'],
            config['tgt_vocab'],
            config['src_tokenizer'],
            device,  # It can take a lot of VRAM
            beam_width=10,
            max_target=100,
            max_sentence_length=config['max_sequence_length'],
        )[0]
        print(val_source)
        print(val_pred)

        logs = {**train_logs, **val_logs}  # Merge dictionaries
        wandb.log(logs)  # Upload to the WandB cloud

        # Table logs
        train_source, train_target = train_dataset[ torch.randint(len(train_dataset), (1,)) ]
        train_pred, train_prob = beam_search(
            model,
            train_source,
            config['src_vocab'],
            config['tgt_vocab'],
            config['src_tokenizer'],
            device,  # It can take a lot of VRAM
            beam_width=10,
            max_target=100,
            max_sentence_length=config['max_sequence_length'],
        )[0]

        data = [
            e + 1,
            train_source, train_target, train_pred, train_prob,
            val_source, val_target, val_pred, val_prob,
        ]
        log_table.add_data(*data)

    # Log the table at the end of the training
    wandb.log({'Model predictions': log_table})
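
The call to `torch.nn.utils.clip_grad_norm_` in the loop above rescales all gradients by one common factor whenever their joint L2 norm exceeds `clip`. The sketch below reproduces that arithmetic in pure Python so it can be checked by hand. `clip_by_global_norm` is a hypothetical helper, and the small `eps` term is an assumption about how the division is stabilised.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # The factor is capped at 1 so that small gradients are left untouched
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm


# Two "parameters" whose gradients have a joint norm of sqrt(9 + 16 + 144) = 13
grads = [[3.0, 4.0], [12.0]]
clipped, norm = clip_by_global_norm(grads, max_norm=5)
new_norm = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm, new_norm)

# Gradients already below the threshold are returned unchanged
small, _ = clip_by_global_norm([[0.3, 0.4]], max_norm=5)
print(small)
```

Because every component is multiplied by the same factor, the direction of the update is preserved and only its length is limited.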

# Training the models
We can now finally train the models.
Choose the right hyperparameters, play with them and try to find
ones that lead to good models and good training curves.
Try to reach a loss under 1.0.
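
To get a feeling for what a loss of 1.0 means, it can help to convert the cross-entropy into a perplexity. The sketch below assumes the loss is a mean negative log-likelihood in nats. If some classes are down-weighted in the loss, the result is only an approximation.

```python
import math


def perplexity(loss: float) -> float:
    """Perplexity for a mean cross-entropy expressed in nats."""
    return math.exp(loss)


for loss in [3.0, 2.0, 1.0]:
    print(f'loss {loss:.1f} -> perplexity {perplexity(loss):.2f}')
```

A loss of 1.0 corresponds to a perplexity of about 2.72: on average the model hesitates between fewer than three candidate tokens at each position.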

For reference, it is possible to get decent results in approximately 20 epochs.
With CUDA enabled, one epoch, even with a big model on a big dataset, shouldn't last more than 10 minutes.
A normal epoch takes between 1 and 5 minutes.

*These estimates assume Colab Pro; they should be re-measured on free Colab to get better estimates.*

---

To test your implementations, it is easier to try your models
on a CPU instance. Indeed, Colab lowers your priority for GPU instances
the more time you have recently spent using them. It would be
a shame to use up all your GPU time on implementation testing.
Moreover, you should try your models on small datasets and with a small number of parameters.
For example, you could set:
```
MAX_SEQ_LEN = 10
MIN_TOK_FREQ = 20
dim_embedding = 40
dim_hidden = 60
n_layers = 1
```

You usually don't want to log anything onto WandB when testing your implementation.
To deactivate WandB without having to change any line of code, you can type `!wandb offline` in a cell.

Once you have correctly implemented the models, you can train bigger models on bigger datasets.
When you do this, do not forget to switch the runtime to GPU (and use `!wandb online`)!
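
As an alternative to the shell command, W&B also reads the `WANDB_MODE` environment variable, which can be set from Python. This is only a sketch: it has to run before `wandb.init` is called, and how it interacts with an earlier `!wandb offline` or `!wandb online` in the same session has not been checked here.

```python
import os

# 'offline' keeps the logs on disk, 'online' syncs them to the cloud,
# 'disabled' turns WandB logging off
os.environ["WANDB_MODE"] = "offline"

print(os.environ["WANDB_MODE"])
```

Setting the variable in a cell near the top of the notebook makes the choice visible next to the other hyperparameters.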

In [None]:
# Checking GPU and logging to wandb

!wandb login

!nvidia-smi

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: ahmed-el-shami (ahmed-el-shami-polytechnique-montr-al) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
Thu Mar 27 00:55:17 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M

In [None]:
# Instantiate the datasets

MAX_SEQ_LEN = 60
MIN_TOK_FREQ = 2
train_dataset, val_dataset = build_datasets(
    MAX_SEQ_LEN,
    MIN_TOK_FREQ,
    en_tokenizer,
    fr_tokenizer,
    train,
    valid,
)


print(f'English vocabulary size: {len(train_dataset.en_vocab):,}')
print(f'French vocabulary size: {len(train_dataset.fr_vocab):,}')

print(f'\nTraining examples: {len(train_dataset):,}')
print(f'Validation examples: {len(val_dataset):,}')

English vocabulary size: 11,043
French vocabulary size: 17,264

Training examples: 209,459
Validation examples: 23,274


In [None]:
# Build the model, the dataloaders, optimizer and the loss function
# Log every hyperparameter and argument into the config dictionary

config = {
    # General parameters
    'epochs': 5,
    'batch_size': 128,
    'lr': 1e-3,
    'betas': (0.9, 0.99),
    'clip': 5,
    'device': 'cuda' if torch.cuda.is_available() else 'cpu',

    # Model parameters
    'n_tokens_src': len(train_dataset.en_vocab),
    'n_tokens_tgt': len(train_dataset.fr_vocab),
    'n_heads': 4,
    'dim_embedding': 196,
    'dim_hidden': 256,
    'n_layers': 3,
    'dropout': 0.1,
    'model_type': 'RNN',

    # Others
    'max_sequence_length': MAX_SEQ_LEN,
    'min_token_freq': MIN_TOK_FREQ,
    'src_vocab': train_dataset.en_vocab,
    'tgt_vocab': train_dataset.fr_vocab,
    'src_tokenizer': en_tokenizer,
    'tgt_tokenizer': fr_tokenizer,
    'src_pad_idx': train_dataset.en_vocab['<pad>'],
    'tgt_pad_idx': train_dataset.fr_vocab['<pad>'],
    'seed': 0,
    'log_every': 50,  # Number of batches between each wandb logs
}

torch.manual_seed(config['seed'])

config['train_loader'] = DataLoader(
    train_dataset,
    batch_size=config['batch_size'],
    shuffle=True,
    collate_fn=lambda batch: generate_batch(batch, config['src_pad_idx'], config['tgt_pad_idx'])
)

config['val_loader'] = DataLoader(
    val_dataset,
    batch_size=config['batch_size'],
    shuffle=True,
    collate_fn=lambda batch: generate_batch(batch, config['src_pad_idx'], config['tgt_pad_idx'])
)

# Uncomment code block to select model to train here!
# model = TranslationRNN(
#     config['n_tokens_src'],
#     config['n_tokens_tgt'],
#     config['dim_embedding'],
#     config['dim_hidden'],
#     config['n_layers'],
#     config['dropout'],
#     config['src_pad_idx'],
#     config['tgt_pad_idx'],
#     config['model_type'],
# )

model = TranslationTransformer(
    config['n_tokens_src'],
    config['n_tokens_tgt'],
    config['n_heads'],
    config['dim_embedding'],
    config['dim_hidden'],
    config['n_layers'],
    config['dropout'],
    config['src_pad_idx'],
    config['tgt_pad_idx'],
)


config['optimizer'] = optim.Adam(
    model.parameters(),
    lr=config['lr'],
    betas=config['betas'],
)

weight_classes = torch.ones(config['n_tokens_tgt'], dtype=torch.float)
weight_classes[config['tgt_vocab']['<unk>']] = 0.1  # Lower the importance of that class
config['loss'] = nn.CrossEntropyLoss(
    weight=weight_classes,
    ignore_index=config['tgt_pad_idx'],  # We do not have to learn those
)

# The `summary` call commented out below raised errors, so a simple replacement is used instead
def get_model_summary(model):
    """Print the parameter counts and the architecture of the model."""
    # Count parameters
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

    print(f"Total parameters: {total_params:,}")
    print(f"Trainable parameters: {trainable_params:,}")

    # Print model architecture
    print(model)


get_model_summary(model)
# summary(
#     model,
#     input_size=[
#         (config['batch_size'], config['max_sequence_length']),
#         (config['batch_size'], config['max_sequence_length'])
#     ],
#     dtypes=[torch.long, torch.long],
#     depth=3,
# )

Total parameters: 10,950,700
Trainable parameters: 10,950,700
TranslationTransformer(
  (transformer): Transformer(
    (encoder): TransformerEncoder(
      (layers): ModuleList(
        (0-2): 3 x TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=196, out_features=196, bias=True)
          )
          (linear1): Linear(in_features=196, out_features=256, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=256, out_features=196, bias=True)
          (norm1): LayerNorm((196,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((196,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
        )
      )
      (norm): LayerNorm((196,), eps=1e-05, elementwise_affine=True)
    )
    (decoder): TransformerDecoder(
      (layers): ModuleList(
        (0-2): 3

In [None]:
!wandb online  # online / offline / disabled to activate, deactivate or turn off WandB logging

with wandb.init(
        config=config,
        project='INF8225 - TP3',  # Title of your project
        group='Transformer - small',  # In what group of runs do you want this run to be in?
        save_code=True,
    ):
    train_model(model, config)

W&B online. Running your script from this directory will now sync to the cloud.

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: ahmed-el-shami (ahmed-el-shami-polytechnique-montr-al) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


Starting training for 5 epochs, using cuda.

Epoch 1
Train -   loss: 2.01     top-1: 0.61    top-5: 0.80    top-10: 0.85


  output = torch._nested_tensor_from_mask(output, src_key_padding_mask.logical_not(), mask_check=False)


Eval -    loss: 1.80     top-1: 0.63    top-5: 0.83    top-10: 0.87
The baby cried.
le bébé continua.

Epoch 2
Train -   loss: 1.60     top-1: 0.67    top-5: 0.85    top-10: 0.89
Eval -    loss: 1.45     top-1: 0.68    top-5: 0.87    top-10: 0.91
There was fear in his eyes.
il y avait peur dans ses yeux.

Epoch 3
Train -   loss: 1.40     top-1: 0.70    top-5: 0.88    top-10: 0.91
Eval -    loss: 1.27     top-1: 0.72    top-5: 0.89    top-10: 0.92
I noticed that she was wearing new glasses.
j'ai remarqué qu'elle portait de nouveaux lunettes.

Epoch 4
Train -   loss: 1.28     top-1: 0.72    top-5: 0.89    top-10: 0.92
Eval -    loss: 1.19     top-1: 0.73    top-5: 0.90    top-10: 0.93
Aren't you angry right now?
n'es-tu pas en colère, maintenant ?

Epoch 5
Train -   loss: 1.23     top-1: 0.73    top-5: 0.90    top-10: 0.93
Eval -    loss: 1.12     top-1: 0.74    top-5: 0.91    top-10: 0.94
He negotiated a free trade agreement with Canada.
il est membre d'un accord libre avec du canada.


Run history:
Train - loss,█▆▆▄▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
Train - top-1,▁▃▄▅▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇█████████████████████
Train - top-10,▁▅▆▆▆▇▇▇▇▇▇▇▇▇▇█████████████████████████
Train - top-5,▁▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇████████████████████
Validation - loss,█▄▃▂▁
Validation - top-1,▁▄▆▇█
Validation - top-10,▁▅▇▇█
Validation - top-5,▁▅▇▇█

Run summary:
Train - loss,1.22543
Train - top-1,0.72981
Train - top-10,0.92898
Train - top-5,0.89981
Validation - loss,1.12245
Validation - top-1,0.74353
Validation - top-10,0.93655
Validation - top-5,0.91128


In [None]:
sentence = "It is possible to try your work here."

preds = beam_search(
    model,
    sentence,
    config['src_vocab'],
    config['tgt_vocab'],
    config['src_tokenizer'],
    config['device'],
    beam_width=10,
    max_target=100,
    max_sentence_length=config['max_sequence_length']
)[:5]

for i, (translation, likelihood) in enumerate(preds):
    print(f'{i}. ({likelihood*100:.5f}%) \t {translation}')

0. (8.10878%) 	 il est possible d'essayer de votre travail ici.
1. (6.63808%) 	 c'est possible d'essayer de votre travail ici.
2. (5.24078%) 	 il est possible de essayer de votre travail ici.
3. (5.23896%) 	 c'est possible d'essayer de ton travail ici.
4. (4.90898%) 	 c'est possible de essayer de votre travail ici.


In [None]:
sweep_config = {
    'method': 'bayes',
    'metric': {
        'name': 'Validation - loss',
        'goal': 'minimize'
    },
    'parameters': {
        'model_type': {
            'values': ['Transformer']
        },
        'n_layers': {
            'values': [2, 3, 4]
        },
        'dim_embedding': {
            'values': [128, 192, 256]
        },
        'dim_hidden': {
            'values': [256, 384, 512]
        },
        'n_heads': {
            'values': [2, 4, 8]
        },
        'dropout': {
            'min': 0.1,
            'max': 0.2,
            'distribution': 'uniform'
        },
        'batch_size': {
            'values': [64, 96, 128]
        },
        'lr': {
            'values': [0.0001, 0.0003, 0.001]
        },
        'clip': {
            'value': 5
        },
        'epochs': {
            'value': 5
        },
        'max_sequence_length': {
            'values': [40, 60]
        },
        'min_token_freq': {
            'value': 2
        },
        'log_every': {
            'value': 50
        }
    }
}
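
Before launching a sweep it can be useful to sanity-check the search space. The sketch below copies the discrete values from `sweep_config`, counts the possible combinations (the continuous `dropout` range is left out), and checks that every `dim_embedding` is divisible by every `n_heads`. That last check assumes the attention layers split the embedding evenly across heads, as is usual for multi-head attention.

```python
from itertools import product

# Discrete choices copied from the sweep configuration
grid = {
    'n_layers': [2, 3, 4],
    'dim_embedding': [128, 192, 256],
    'dim_hidden': [256, 384, 512],
    'n_heads': [2, 4, 8],
    'batch_size': [64, 96, 128],
    'lr': [0.0001, 0.0003, 0.001],
    'max_sequence_length': [40, 60],
}

# Number of distinct discrete configurations
n_combinations = 1
for values in grid.values():
    n_combinations *= len(values)
print('discrete combinations:', n_combinations)  # 1458

# Pairs that would break an even split of the embedding across heads
bad = [
    (dim, heads)
    for dim, heads in product(grid['dim_embedding'], grid['n_heads'])
    if dim % heads != 0
]
print('invalid (dim_embedding, n_heads) pairs:', bad)  # []
```

With well over a thousand discrete combinations, a sweep limited to a handful of runs only samples a tiny part of the space, which is one reason to prefer a guided method such as `'bayes'` over an exhaustive grid.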

In [None]:
def train_with_config(config=None):
    with wandb.init(config=config):
        config = wandb.config

        train_dataset, val_dataset = build_datasets(
            config.max_sequence_length,
            config.min_token_freq,
            en_tokenizer,
            fr_tokenizer,
            train,
            valid,
        )

        train_loader = DataLoader(
            train_dataset,
            batch_size=config.batch_size,
            shuffle=True,
            collate_fn=lambda batch: generate_batch(batch, train_dataset.en_vocab['<pad>'], train_dataset.fr_vocab['<pad>'])
        )

        val_loader = DataLoader(
            val_dataset,
            batch_size=config.batch_size,
            shuffle=False,
            collate_fn=lambda batch: generate_batch(batch, train_dataset.en_vocab['<pad>'], train_dataset.fr_vocab['<pad>'])
        )

        if config.model_type == "Transformer":
            model = TranslationTransformer(
                n_tokens_src=len(train_dataset.en_vocab),
                n_tokens_tgt=len(train_dataset.fr_vocab),
                n_heads=config.n_heads,
                dim_embedding=config.dim_embedding,
                dim_hidden=config.dim_hidden,
                n_layers=config.n_layers,
                dropout=config.dropout,
                src_pad_idx=train_dataset.en_vocab['<pad>'],
                tgt_pad_idx=train_dataset.fr_vocab['<pad>'],
            )
        else:
            model = TranslationRNN(
                n_tokens_src=len(train_dataset.en_vocab),
                n_tokens_tgt=len(train_dataset.fr_vocab),
                dim_embedding=config.dim_embedding,
                dim_hidden=config.dim_hidden,
                n_layers=config.n_layers,
                dropout=config.dropout,
                src_pad_idx=train_dataset.en_vocab['<pad>'],
                tgt_pad_idx=train_dataset.fr_vocab['<pad>'],
                model_type=config.model_type,
            )

        optimizer = optim.Adam(
            model.parameters(),
            lr=config.lr,
            betas=(0.9, 0.99),
        )

        weight_classes = torch.ones(len(train_dataset.fr_vocab), dtype=torch.float)
        weight_classes[train_dataset.fr_vocab['<unk>']] = 0.1
        loss_fn = nn.CrossEntropyLoss(
            weight=weight_classes,
            ignore_index=train_dataset.fr_vocab['<pad>'],
        )

        training_config = {
            'train_loader': train_loader,
            'val_loader': val_loader,
            'optimizer': optimizer,
            'loss': loss_fn,
            'clip': config.clip,
            'device': 'cuda' if torch.cuda.is_available() else 'cpu',
            'epochs': config.epochs,
            'src_vocab': train_dataset.en_vocab,
            'tgt_vocab': train_dataset.fr_vocab,
            'src_tokenizer': en_tokenizer,
            'tgt_tokenizer': fr_tokenizer,
            'src_pad_idx': train_dataset.en_vocab['<pad>'],
            'tgt_pad_idx': train_dataset.fr_vocab['<pad>'],
            'log_every': config.log_every,
            'max_sequence_length': config.max_sequence_length,
        }

        train_model(model, training_config)

In [None]:
sweep_id = wandb.sweep(sweep_config, project="translation-transformer-sweeps")

wandb.agent(sweep_id, function=train_with_config, count=8)

Create sweep with ID: aokmisqs
Sweep URL: https://wandb.ai/ahmed-el-shami-polytechnique-montr-al/translation-transformer-sweeps/sweeps/aokmisqs

wandb: Agent Starting Run: otb57we5 with config:
wandb: 	batch_size: 96
wandb: 	clip: 5
wandb: 	dim_embedding: 256
wandb: 	dim_hidden: 512
wandb: 	dropout: 0.1691546816370649
wandb: 	epochs: 5
wandb: 	log_every: 50
wandb: 	lr: 0.0001
wandb: 	max_sequence_length: 40
wandb: 	min_token_freq: 2
wandb: 	model_type: Transformer
wandb: 	n_heads: 4
wandb: 	n_layers: 4


Starting training for 5 epochs, using cuda.

Epoch 1
Train -   loss: 3.28     top-1: 0.45    top-5: 0.63    top-10: 0.69
Eval -    loss: 3.16     top-1: 0.45    top-5: 0.64    top-10: 0.70
Can you do bookkeeping?
pouvez-vous le français ?

Epoch 2
Train -   loss: 2.72     top-1: 0.51    top-5: 0.70    top-10: 0.76
Eval -    loss: 2.60     top-1: 0.51    top-5: 0.71    top-10: 0.77
Germany has produced many scientists.
quelqu'un a des enfants.

Epoch 3
Train -   loss: 2.38     top-1: 0.56    top-5: 0.75    top-10: 0.80
Eval -    loss: 2.25     top-1: 0.56    top-5: 0.76    top-10: 0.82
McDonald's is world-famous for its hamburgers.
c'est tout le monde du monde.

Epoch 4
Train -   loss: 2.12     top-1: 0.59    top-5: 0.79    top-10: 0.83
Eval -    loss: 2.02     top-1: 0.60    top-5: 0.80    top-10: 0.84
Where are we now?
où sommes-nous maintenant maintenant maintenant ?

Epoch 5
Train -   loss: 1.93     top-1: 0.62    top-5: 0.81    top-10: 0.86
Eval -    loss: 1.83     top-1: 0.63    t

Run history:
Train - loss,█▇▇▆▆▆▅▅▄▄▃▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁
Train - top-1,▁▁▂▂▃▄▄▄▄▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇█▇▇██████
Train - top-10,▁▂▃▄▄▅▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇███████████████
Train - top-5,▁▂▂▃▃▄▄▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇█████████
Validation - loss,█▅▃▂▁
Validation - top-1,▁▃▅▇█
Validation - top-10,▁▄▆▇█
Validation - top-5,▁▄▆▇█

Run summary:
Train - loss,1.93289
Train - top-1,0.6224
Train - top-10,0.85618
Train - top-5,0.81205
Validation - loss,1.8283
Validation - top-1,0.6265
Validation - top-10,0.86605
Validation - top-5,0.82276

wandb: Agent Starting Run: e3nykbow with config:
wandb: 	batch_size: 64
wandb: 	clip: 5
wandb: 	dim_embedding: 192
wandb: 	dim_hidden: 256
wandb: 	dropout: 0.1046703591761612
wandb: 	epochs: 5
wandb: 	log_every: 50
wandb: 	lr: 0.0003
wandb: 	max_sequence_length: 60
wandb: 	min_token_freq: 2
wandb: 	model_type: Transformer
wandb: 	n_heads: 4
wandb: 	n_layers: 2


Starting training for 5 epochs, using cuda.

Epoch 1
Train -   loss: 2.46     top-1: 0.56    top-5: 0.75    top-10: 0.79
Eval -    loss: 2.27     top-1: 0.58    top-5: 0.76    top-10: 0.81
This has shaken my faith in the institutions of government.
c'est mon oncle dans le jardin.

Epoch 2
Train -   loss: 1.92     top-1: 0.63    top-5: 0.82    top-10: 0.86
Eval -    loss: 1.78     top-1: 0.65    top-5: 0.83    top-10: 0.87
I like to play poker.
j'aime jouer jouer au poker.

Epoch 3
Train -   loss: 1.75     top-1: 0.66    top-5: 0.84    top-10: 0.88
Eval -    loss: 1.57     top-1: 0.68    top-5: 0.86    top-10: 0.89
Now things are different.
les choses sont différents.

Epoch 4
Train -   loss: 1.57     top-1: 0.69    top-5: 0.86    top-10: 0.90
Eval -    loss: 1.45     top-1: 0.70    top-5: 0.87    top-10: 0.90
The war had lasted four years.
la guerre avait trois ans.

Epoch 5
Train -   loss: 1.48     top-1: 0.70    top-5: 0.87    top-10: 0.90
Eval -    loss: 1.37     top-1: 0.71    top-

Run history:
Train - loss,█▇▅▅▅▄▄▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
Train - top-1,▁▃▄▄▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇██▇███████████████
Train - top-10,▁▂▃▃▄▅▅▅▅▅▆▆▆▆▇▇▇▇▇▇▇▇▇▇████████████████
Train - top-5,▁▃▃▄▄▅▅▅▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇███████████████
Validation - loss,█▄▃▂▁
Validation - top-1,▁▅▆▇█
Validation - top-10,▁▅▇▇█
Validation - top-5,▁▅▆▇█

Run summary:
Train - loss,1.48204
Train - top-1,0.70185
Train - top-10,0.90487
Train - top-5,0.87284
Validation - loss,1.37411
Validation - top-1,0.71429
Validation - top-10,0.91067
Validation - top-5,0.88168

wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: lm6lzf74 with config:
wandb: 	batch_size: 96
wandb: 	clip: 5
wandb: 	dim_embedding: 128
wandb: 	dim_hidden: 512
wandb: 	dropout: 0.1519673427893515
wandb: 	epochs: 5
wandb: 	log_every: 50
wandb: 	lr: 0.0003
wandb: 	max_sequence_length: 40
wandb: 	min_token_freq: 2
wandb: 	model_type: Transformer
wandb: 	n_heads: 4
wandb: 	n_layers: 4


Starting training for 5 epochs, using cuda.

Epoch 1
Train -   loss: 3.00     top-1: 0.48    top-5: 0.66    top-10: 0.72
Eval -    loss: 2.96     top-1: 0.46    top-5: 0.66    top-10: 0.72
Do you know how many people in the world starve to death every year?
sais-tu combien de temps à boston   ?

Epoch 2
Train -   loss: 2.44     top-1: 0.55    top-5: 0.74    top-10: 0.79
Eval -    loss: 2.37     top-1: 0.54    top-5: 0.75    top-10: 0.80
Don't bother to pick me up at the hotel.
n'oubliez pas de me rendre au lit.

Epoch 3
Train -   loss: 2.07     top-1: 0.60    top-5: 0.79    top-10: 0.84
Eval -    loss: 2.03     top-1: 0.58    top-5: 0.79    top-10: 0.84
From the bottom of my heart, thank you.
à cause de mon cœur, mon cœur vous plaît.

Epoch 4
Train -   loss: 1.87     top-1: 0.63    top-5: 0.82    top-10: 0.86
Eval -    loss: 1.84     top-1: 0.61    top-5: 0.82    top-10: 0.86
His behavior bothered me.
son comportement m'a cassé mon nom de moi.

Epoch 5
Train -   loss: 1.72     top-1: 0

Run history:
Train - loss,█▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
Train - top-1,▁▃▃▄▄▄▄▄▄▅▅▅▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇███████
Train - top-10,▁▃▃▄▄▄▅▅▅▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇█▇▇███████████
Train - top-5,▁▃▄▄▄▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇███████████████
Validation - loss,█▅▃▂▁
Validation - top-1,▁▄▆▇█
Validation - top-10,▁▄▆▇█
Validation - top-5,▁▄▆▇█

Run summary:
Train - loss,1.71625
Train - top-1,0.65561
Train - top-10,0.87935
Train - top-5,0.83933
Validation - loss,1.69688
Validation - top-1,0.63482
Validation - top-10,0.88302
Validation - top-5,0.83891

wandb: Agent Starting Run: nciv1n20 with config:
wandb: 	batch_size: 96
wandb: 	clip: 5
wandb: 	dim_embedding: 128
wandb: 	dim_hidden: 256
wandb: 	dropout: 0.11531959869674896
wandb: 	epochs: 5
wandb: 	log_every: 50
wandb: 	lr: 0.001
wandb: 	max_sequence_length: 60
wandb: 	min_token_freq: 2
wandb: 	model_type: Transformer
wandb: 	n_heads: 4
wandb: 	n_layers: 2


Starting training for 5 epochs, using cuda.

Epoch 1
Train -   loss: 2.25     top-1: 0.57    top-5: 0.77    top-10: 0.82
Eval -    loss: 2.13     top-1: 0.57    top-5: 0.78    top-10: 0.83
The toilet is upstairs.
le monde est tombé.

Epoch 2
Train -   loss: 1.81     top-1: 0.64    top-5: 0.83    top-10: 0.87
Eval -    loss: 1.64     top-1: 0.66    top-5: 0.85    top-10: 0.89
Is breakfast included?
le petit-déjeuner est blanc ?

Epoch 3
Train -   loss: 1.65     top-1: 0.66    top-5: 0.85    top-10: 0.89
Eval -    loss: 1.46     top-1: 0.69    top-5: 0.87    top-10: 0.90
Hello, girls.
salut, les filles.

Epoch 4
Train -   loss: 1.53     top-1: 0.68    top-5: 0.87    top-10: 0.90
Eval -    loss: 1.39     top-1: 0.70    top-5: 0.88    top-10: 0.91
We now know that the testimony he gave was coerced.
nous savons maintenant que le <unk> qu'il a donné.

Epoch 5
Train -   loss: 1.43     top-1: 0.70    top-5: 0.88    top-10: 0.91
Eval -    loss: 1.31     top-1: 0.71    top-5: 0.89    top-10: 0.9

0,1
Train - loss,█▇▇▅▅▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
Train - top-1,▁▂▂▃▄▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇█▇██████████████████
Train - top-10,▁▁▁▂▂▃▃▄▄▄▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇█▇▇▇███████████
Train - top-5,▁▂▃▄▄▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇██████████████████
Validation - loss,█▄▂▂▁
Validation - top-1,▁▅▇▇█
Validation - top-10,▁▅▇██
Validation - top-5,▁▅▇▇█

0,1
Train - loss,1.43476
Train - top-1,0.6984
Train - top-10,0.90981
Train - top-5,0.87616
Validation - loss,1.30828
Validation - top-1,0.71325
Validation - top-10,0.92006
Validation - top-5,0.88959


[34m[1mwandb[0m: Agent Starting Run: qrjtrp6b with config:
[34m[1mwandb[0m: 	batch_size: 128
[34m[1mwandb[0m: 	clip: 5
[34m[1mwandb[0m: 	dim_embedding: 128
[34m[1mwandb[0m: 	dim_hidden: 256
[34m[1mwandb[0m: 	dropout: 0.11412962804136462
[34m[1mwandb[0m: 	epochs: 5
[34m[1mwandb[0m: 	log_every: 50
[34m[1mwandb[0m: 	lr: 0.001
[34m[1mwandb[0m: 	max_sequence_length: 60
[34m[1mwandb[0m: 	min_token_freq: 2
[34m[1mwandb[0m: 	model_type: Transformer
[34m[1mwandb[0m: 	n_heads: 4
[34m[1mwandb[0m: 	n_layers: 2


Starting training for 5 epochs, using cuda.

Epoch 1
Train -   loss: 2.26     top-1: 0.57    top-5: 0.77    top-10: 0.82
Eval -    loss: 2.09     top-1: 0.60    top-5: 0.78    top-10: 0.83
Bring the key.
la clé.

Epoch 2
Train -   loss: 1.85     top-1: 0.63    top-5: 0.82    top-10: 0.87
Eval -    loss: 1.63     top-1: 0.66    top-5: 0.85    top-10: 0.88
Are you ready to fly?
es-tu prêt à voler ?

Epoch 3
Train -   loss: 1.61     top-1: 0.66    top-5: 0.85    top-10: 0.89
Eval -    loss: 1.45     top-1: 0.69    top-5: 0.87    top-10: 0.90
I was affected by the summer heat.
j'étais parti par la chaleur.

Epoch 4
Train -   loss: 1.48     top-1: 0.68    top-5: 0.87    top-10: 0.90
Eval -    loss: 1.34     top-1: 0.70    top-5: 0.88    top-10: 0.91
It's become commonplace.
c'est advenu.

Epoch 5
Train -   loss: 1.42     top-1: 0.69    top-5: 0.88    top-10: 0.91
Eval -    loss: 1.28     top-1: 0.72    top-5: 0.89    top-10: 0.92
Young children shouldn't watch so much television.
jeunes ne 

0,1
Train - loss,█▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
Train - top-1,▁▅▅▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇█▇████████████████████
Train - top-10,▁▂▄▄▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇█████████████████
Train - top-5,▁▅▅▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇████████████████████
Validation - loss,█▄▂▂▁
Validation - top-1,▁▅▆▇█
Validation - top-10,▁▅▇██
Validation - top-5,▁▅▇██

0,1
Train - loss,1.41632
Train - top-1,0.69244
Train - top-10,0.91036
Train - top-5,0.87586
Validation - loss,1.28044
Validation - top-1,0.71661
Validation - top-10,0.91923
Validation - top-5,0.88934


[34m[1mwandb[0m: Agent Starting Run: s83ehxoo with config:
[34m[1mwandb[0m: 	batch_size: 128
[34m[1mwandb[0m: 	clip: 5
[34m[1mwandb[0m: 	dim_embedding: 128
[34m[1mwandb[0m: 	dim_hidden: 256
[34m[1mwandb[0m: 	dropout: 0.11138223309465074
[34m[1mwandb[0m: 	epochs: 5
[34m[1mwandb[0m: 	log_every: 50
[34m[1mwandb[0m: 	lr: 0.0003
[34m[1mwandb[0m: 	max_sequence_length: 40
[34m[1mwandb[0m: 	min_token_freq: 2
[34m[1mwandb[0m: 	model_type: Transformer
[34m[1mwandb[0m: 	n_heads: 2
[34m[1mwandb[0m: 	n_layers: 2


Starting training for 5 epochs, using cuda.

Epoch 1
Train -   loss: 3.05     top-1: 0.47    top-5: 0.66    top-10: 0.72
Eval -    loss: 2.99     top-1: 0.46    top-5: 0.66    top-10: 0.72
They deserve more.
elles ont l'air plus.

Epoch 2
Train -   loss: 2.53     top-1: 0.54    top-5: 0.73    top-10: 0.78
Eval -    loss: 2.42     top-1: 0.53    top-5: 0.74    top-10: 0.80
This is the stupidest thing I've ever done.
c'est la maison que j'ai jamais fait.

Epoch 3
Train -   loss: 2.17     top-1: 0.58    top-5: 0.78    top-10: 0.83
Eval -    loss: 2.11     top-1: 0.57    top-5: 0.79    top-10: 0.84
I will give you this bicycle as a birthday present.
je te donnerai ce vélo.

Epoch 4
Train -   loss: 1.97     top-1: 0.61    top-5: 0.80    top-10: 0.85
Eval -    loss: 1.90     top-1: 0.60    top-5: 0.81    top-10: 0.86
May I see a menu, please?
je peux voir un moment, s'il vous plaît ?

Epoch 5
Train -   loss: 1.84     top-1: 0.63    top-5: 0.82    top-10: 0.86
Eval -    loss: 1.77     top-1: 

0,1
Train - loss,█▆▅▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
Train - top-1,▁▂▄▅▅▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇████████████
Train - top-10,▁▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇███████████
Train - top-5,▁▂▂▃▃▄▄▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇██████████
Validation - loss,█▅▃▂▁
Validation - top-1,▁▄▆▇█
Validation - top-10,▁▄▆▇█
Validation - top-5,▁▄▆▇█

0,1
Train - loss,1.84468
Train - top-1,0.6267
Train - top-10,0.86321
Train - top-5,0.81988
Validation - loss,1.76593
Validation - top-1,0.62425
Validation - top-10,0.87787
Validation - top-5,0.83451


[34m[1mwandb[0m: Agent Starting Run: 3wws6x8p with config:
[34m[1mwandb[0m: 	batch_size: 64
[34m[1mwandb[0m: 	clip: 5
[34m[1mwandb[0m: 	dim_embedding: 256
[34m[1mwandb[0m: 	dim_hidden: 256
[34m[1mwandb[0m: 	dropout: 0.11866481682891976
[34m[1mwandb[0m: 	epochs: 5
[34m[1mwandb[0m: 	log_every: 50
[34m[1mwandb[0m: 	lr: 0.001
[34m[1mwandb[0m: 	max_sequence_length: 60
[34m[1mwandb[0m: 	min_token_freq: 2
[34m[1mwandb[0m: 	model_type: Transformer
[34m[1mwandb[0m: 	n_heads: 8
[34m[1mwandb[0m: 	n_layers: 2


Starting training for 5 epochs, using cuda.

Epoch 1
Train -   loss: 1.94     top-1: 0.62    top-5: 0.82    top-10: 0.86
Eval -    loss: 1.75     top-1: 0.64    top-5: 0.84    top-10: 0.88
Do you understand what's going on?
comprenez-vous ce qui va se passe ?

Epoch 2
Train -   loss: 1.72     top-1: 0.66    top-5: 0.85    top-10: 0.88
Eval -    loss: 1.47     top-1: 0.69    top-5: 0.87    top-10: 0.91
I'll hang onto it for now.
je vais le faire maintenant.

Epoch 3
Train -   loss: 1.46     top-1: 0.69    top-5: 0.88    top-10: 0.91
Eval -    loss: 1.33     top-1: 0.72    top-5: 0.89    top-10: 0.92
I want you to tell me what you really think of me.
je veux que vous me disiez ce que vous pensez vraiment.

Epoch 4
Train -   loss: 1.35     top-1: 0.72    top-5: 0.89    top-10: 0.92
Eval -    loss: 1.26     top-1: 0.73    top-5: 0.90    top-10: 0.93
We didn't need to ask him to resign.
nous n'avons pas besoin de lui demander de démissionner.

Epoch 5
Train -   loss: 1.31     top-1: 0.73   

0,1
Train - loss,█▇▅▅▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
Train - top-1,▁▁▂▂▄▆▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇██████████▇████
Train - top-10,▁▅▅▅▅▆▇▇▇▇▇▇▇███████████████████████████
Train - top-5,▁▃▄▅▅▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇████████████████████
Validation - loss,█▄▃▂▁
Validation - top-1,▁▅▇▇█
Validation - top-10,▁▅▇▇█
Validation - top-5,▁▅▆▇█

0,1
Train - loss,1.31373
Train - top-1,0.72513
Train - top-10,0.92328
Train - top-5,0.89285
Validation - loss,1.21006
Validation - top-1,0.73876
Validation - top-10,0.93093
Validation - top-5,0.90547


[34m[1mwandb[0m: Agent Starting Run: jtytvh7c with config:
[34m[1mwandb[0m: 	batch_size: 64
[34m[1mwandb[0m: 	clip: 5
[34m[1mwandb[0m: 	dim_embedding: 256
[34m[1mwandb[0m: 	dim_hidden: 256
[34m[1mwandb[0m: 	dropout: 0.1770574133361865
[34m[1mwandb[0m: 	epochs: 5
[34m[1mwandb[0m: 	log_every: 50
[34m[1mwandb[0m: 	lr: 0.001
[34m[1mwandb[0m: 	max_sequence_length: 60
[34m[1mwandb[0m: 	min_token_freq: 2
[34m[1mwandb[0m: 	model_type: Transformer
[34m[1mwandb[0m: 	n_heads: 8
[34m[1mwandb[0m: 	n_layers: 3


Starting training for 5 epochs, using cuda.

Epoch 1
Train -   loss: 2.48     top-1: 0.54    top-5: 0.74    top-10: 0.79
Eval -    loss: 2.20     top-1: 0.57    top-5: 0.77    top-10: 0.82
This is my daughter.
c'est ma fille.

Epoch 2
Train -   loss: 2.06     top-1: 0.60    top-5: 0.80    top-10: 0.84
Eval -    loss: 1.77     top-1: 0.64    top-5: 0.83    top-10: 0.88
Be careful about what you eat.
sois prudente de quoi vous manger.

Epoch 3
Train -   loss: 1.80     top-1: 0.64    top-5: 0.83    top-10: 0.88
Eval -    loss: 1.57     top-1: 0.67    top-5: 0.86    top-10: 0.90
It's been three years since we got married.
il fait trois ans depuis que nous sommes mariés.

Epoch 4
Train -   loss: 1.61     top-1: 0.67    top-5: 0.86    top-10: 0.89
Eval -    loss: 1.46     top-1: 0.69    top-5: 0.88    top-10: 0.91
The CIA is watching you.
la porte te regarde.

Epoch 5
Train -   loss: 1.56     top-1: 0.69    top-5: 0.86    top-10: 0.90
Eval -    loss: 1.37     top-1: 0.71    top-5: 0.89    to

0,1
Train - loss,██▆▅▄▄▄▄▃▃▃▃▃▃▃▂▂▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
Train - top-1,▁▂▂▃▃▄▄▄▄▄▅▅▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇█▇█▇████████
Train - top-10,▁▃▃▃▃▄▅▅▅▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇███████████████
Train - top-5,▁▂▃▅▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇█▇██████████████
Validation - loss,█▄▃▂▁
Validation - top-1,▁▄▆▇█
Validation - top-10,▁▅▇▇█
Validation - top-5,▁▅▆▇█

0,1
Train - loss,1.56275
Train - top-1,0.68616
Train - top-10,0.89813
Train - top-5,0.86416
Validation - loss,1.36573
Validation - top-1,0.71022
Validation - top-10,0.9155
Validation - top-5,0.88597


# Grading:

# Implementations (50 points total)

10 Points for your implementation of the GRU

40 Points for your implementation of the Transformer components

# Questions (12 points, 1 point each)
1. Explain the differences between Vanilla RNN, GRU-RNN, Encoder-Decoder Transformer and Decoder-Only Transformer.

  *A vanilla RNN processes a sequence token by token, with a hidden state carrying information from previous steps. A GRU-RNN also processes the sequence token by token, but uses gates (an update gate z and a reset gate r) to control how much information is kept or overwritten. An encoder-decoder Transformer is based on attention rather than recurrence: self-attention inside the encoder and the decoder, plus cross-attention from the decoder to the encoder output. A decoder-only Transformer consists only of decoder layers, so it is like the encoder-decoder model but without the encoder and therefore without cross-attention.*
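
To make the role of the two gates concrete, here is a minimal sketch of one vanilla RNN step and one GRU step. It is written in plain Python with scalar inputs and states so that it runs without dependencies; it is not the notebook's PyTorch implementation. The parameter names in `p` are hypothetical, and the state update follows one common convention in which `z` close to 1 keeps the old state.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rnn_step(x, h, w_x, w_h, b):
    """Vanilla RNN: the new state is a squashed mix of the input and the old state."""
    return math.tanh(w_x * x + w_h * h + b)


def gru_step(x, h, p):
    """One GRU step with scalar input and state; `p` holds hypothetical weights."""
    z = sigmoid(p["w_xz"] * x + p["w_hz"] * h + p["b_z"])  # update gate
    r = sigmoid(p["w_xr"] * x + p["w_hr"] * h + p["b_r"])  # reset gate
    # Candidate state: the reset gate scales how much of the old state is used.
    n = math.tanh(p["w_xn"] * x + r * (p["w_hn"] * h) + p["b_n"])
    # Interpolate between the candidate and the old state.
    return (1.0 - z) * n + z * h


# Toy usage: run both cells over a short sequence of scalar inputs.
params = {"w_xz": 0.5, "w_hz": 0.5, "b_z": 0.0,
          "w_xr": 0.5, "w_hr": 0.5, "b_r": 0.0,
          "w_xn": 1.0, "w_hn": 1.0, "b_n": 0.0}
h_rnn, h_gru = 0.0, 0.0
for x in [0.2, -0.4, 0.9]:
    h_rnn = rnn_step(x, h_rnn, 1.0, 1.0, 0.0)
    h_gru = gru_step(x, h_gru, params)
```

With a large positive `b_z` the GRU simply copies its previous state, which is the mechanism that lets it carry information over many steps.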
  
2. Why is positional encoding necessary in Transformers and not in RNNs?

  *RNNs process inputs sequentially, as in `for t in range(seq_len):`, so the position of each token is implicit in the order of computation. Transformers process all tokens of the sequence simultaneously through self-attention, which by itself is insensitive to order, so we add positional encodings to the input embeddings. These encodings carry information about the position of each token in the sequence, which is then available to the self-attention mechanism, allowing it to take word order into account.*
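
As an illustration, the sketch below builds a sinusoidal positional encoding table and adds it to a list of embeddings. It assumes the fixed sinusoidal scheme, which may differ from the encoding used in this notebook (a learned encoding is also common), and it uses plain Python lists instead of tensors. The function names are hypothetical.

```python
import math


def sinusoidal_positions(seq_len, dim):
    """Return a seq_len x dim table: even columns use sin, odd columns use cos."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(dim):
            # Columns 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(embeddings):
    """Add the encoding element-wise to a seq_len x dim list of embeddings."""
    pe = sinusoidal_positions(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(embeddings, pe)]


# The same token embedding placed at three positions becomes three different vectors.
same_token = [[1.0, 1.0, 1.0, 1.0] for _ in range(3)]
encoded = add_positions(same_token)
```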

3. Describe the preprocessing process. Detail how the initial dataset is processed before being fed to the translation models.
4. What is teacher forcing, and how is it used in Transformer training? How does the decoder input differ?

  *The decoder processes sequences auto-regressively, so with teacher forcing its input at each step is not the model's own previous prediction but the true token from the target sequence. In the code, this relates to our `target tensor`. Teacher forcing prevents errors in earlier predictions from propagating to later steps during training.*
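
A minimal sketch of how the decoder input and the expected outputs can be derived from one target sentence under teacher forcing. The special tokens `<bos>` and `<eos>` and the helper name are assumptions for illustration; the notebook's vocabulary may use different names.

```python
BOS, EOS = "<bos>", "<eos>"  # hypothetical special tokens


def teacher_forcing_pair(target_tokens):
    """Return (decoder_input, labels): the same sequence shifted by one position."""
    full = [BOS] + list(target_tokens) + [EOS]
    decoder_input = full[:-1]  # what the decoder reads: ground-truth prefix
    labels = full[1:]          # what it must predict at each position
    return decoder_input, labels


decoder_input, labels = teacher_forcing_pair(["salut", ",", "les", "filles", "."])
# At position i the decoder sees decoder_input[: i + 1] and is trained to output labels[i].
```

At inference time there is no ground truth, so the decoder input is instead built from the model's own previous predictions.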

5. How are the two types of mask important to the attention mechanism (causal and padding) and how do they work? How do they differ between the encoder and decoder?

  *To make the sequences in a batch the same length, we use padding tokens `<pad>`. They carry no information, so the padding mask marks which positions contain a `<pad>` and prevents the model from attending to them. The causal mask makes sure auto-regressive generation only depends on previous tokens. The padding mask is used in both the encoder and the decoder, whereas the causal mask is not used in the encoder, only in the self-attention layers of the decoder.*
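
The two masks can be sketched with plain Python lists as follows. The convention that `True` means "blocked" is an assumption (it matches the boolean masks accepted by PyTorch attention layers, but the notebook's code may use the opposite convention), and the function names are hypothetical.

```python
PAD = "<pad>"


def padding_mask(tokens):
    """True where the position holds padding and must be ignored as a key."""
    return [tok == PAD for tok in tokens]


def causal_mask(size):
    """True where query position i would look at a future key position j > i."""
    return [[j > i for j in range(size)] for i in range(size)]


def decoder_self_attention_mask(tokens):
    """Combine both masks: a key is blocked if it is in the future or is padding."""
    pad = padding_mask(tokens)
    causal = causal_mask(len(tokens))
    n = len(tokens)
    return [[causal[i][j] or pad[j] for j in range(n)] for i in range(n)]


source = ["hello", ",", "girls", ".", PAD, PAD]
encoder_mask = padding_mask(source)  # the encoder only needs the padding mask
decoder_mask = decoder_self_attention_mask(["<bos>", "salut", PAD])
```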

6. What is a causal mask, and why is it only used in the decoder?

  *A causal mask is a boolean mask applied during self-attention that keeps each position from attending to future tokens, so that it only uses previous ones. It is only used in the decoder because it is essential for auto-regressive generation.*
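
To show what the causal mask does to the attention weights, the sketch below sets the scores of future positions to negative infinity before the softmax, so that their weights become exactly zero. It is a plain-Python illustration with hypothetical function names, not the notebook's attention code.

```python
import math


def masked_softmax(scores, blocked):
    """Softmax over `scores`, giving zero weight to positions where `blocked` is True.

    Assumes at least one position is not blocked (always true for a causal mask,
    since a position can attend to itself).
    """
    masked = [-math.inf if b else s for s, b in zip(scores, blocked)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(score_matrix):
    """Row i holds the attention weights of query i over keys 0..i only."""
    n = len(score_matrix)
    return [masked_softmax(row, [j > i for j in range(n)])
            for i, row in enumerate(score_matrix)]


weights = causal_attention_weights([[1.0, 2.0, 3.0],
                                    [1.0, 2.0, 3.0],
                                    [1.0, 2.0, 3.0]])
```

Even though every query has the same raw scores here, the first query can only attend to itself, while the last one spreads its weight over all three positions.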

7. Why does the decoder use both self-attention and encoder-decoder attention?

  *Self-attention lets the decoder use the tokens generated so far (the target sequence). This adds context to better predict the next token. Encoder-decoder attention lets the decoder combine information from the source sequence with the target sequence.*

8. Why is the Transformer model parallelizable, and how does this improve efficiency compared to RNNs?

  *The attention function relies on matrix multiplications such as `torch.einsum` or `torch.matmul`, which are easy to parallelize on GPUs. Also, since there are no sequential dependencies inside attention, all tokens can be computed simultaneously. This lets Transformers process long sequences much faster than RNNs, which are bound by their sequential nature.*

9. How does multi-head self-attention allow the model to capture different aspects of a sentence?

  *In multi-head attention, each head uses a different projection of Q, K and V. This allows different heads to capture different types of relationships within the sentence.*
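
One common way to obtain the per-head projections is to project once to the full model dimension and then split the result into equal chunks, one per head. The sketch below shows only that split and the final concatenation, on a single vector and in plain Python; the function names are hypothetical and the notebook's implementation may organize this differently.

```python
def split_heads(vector, n_heads):
    """Split a projected vector of size dim into n_heads chunks of size dim // n_heads."""
    dim = len(vector)
    if dim % n_heads != 0:
        raise ValueError("the embedding dimension must be divisible by n_heads")
    head_dim = dim // n_heads
    return [vector[h * head_dim:(h + 1) * head_dim] for h in range(n_heads)]


def merge_heads(heads):
    """Concatenate the per-head outputs back into one vector of size dim."""
    return [x for head in heads for x in head]


query = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # toy projected query, dim = 8
per_head = split_heads(query, n_heads=4)           # 4 heads of size 2
restored = merge_heads(per_head)
```

Each head then runs attention on its own chunk of Q, K and V, which is why the embedding dimension must be divisible by the number of heads.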

10. What does the decoder's final output represent before the projection layer? What does the encoder's final output represent?

  *The encoder's final output is a contextual embedding of each source token: it encodes information about the context of that token within the sentence. Similarly, the decoder's final output is an embedding of each target token that encodes information from the source tokens and the previous target tokens.*

11. What is the role of the final linear projection layer in the decoder?
How does the decoder output differ between training (parallel processing) and inference (sequential generation)?

  *The final linear layer projects each of the decoder's output vectors (see Q10) to a vector whose size is the target vocabulary size, i.e. one score (logit) per vocabulary token.*
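
A minimal sketch of this projection for a single decoder output vector, followed by a softmax and an argmax to pick a token. The tiny vocabulary, the weights and the function names are made up for illustration.

```python
import math


def project_to_vocab(hidden, weight, bias):
    """Linear layer: `weight` has one row per vocabulary token, each of size len(hidden)."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weight, bias)]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["<unk>", "salut", "filles"]         # toy target vocabulary
hidden = [1.0, 0.0]                          # decoder output for one position (dim = 2)
weight = [[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]

logits = project_to_vocab(hidden, weight, bias)  # one score per vocabulary token
probs = softmax(logits)
predicted = vocab[max(range(len(vocab)), key=lambda i: probs[i])]
```

During training the projection is applied to every target position at once, whereas during generation only the logits of the last position are needed to choose the next token.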

12. Why does the decoder recompute all outputs at each inference step instead of appending new outputs incrementally?

  *In my decoder's forward pass, the input is the whole sequence generated so far, so the outputs are recomputed for all positions at every step. To avoid this, we could explore key-value (KV) caching.*
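
The idea behind KV caching can be sketched on a single causal self-attention layer. As a simplifying assumption, the query, key and value projections are the identity here, and everything is plain Python; `KVCache` and the other names are hypothetical and not part of the notebook.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention of one query over the given keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def full_recompute(xs):
    """Causal self-attention for every position, recomputed from scratch."""
    return [attend(xs[i], xs[:i + 1], xs[:i + 1]) for i in range(len(xs))]


class KVCache:
    """Keep past keys and values so each new token needs only one attention call."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        self.keys.append(x)    # identity projection, for illustration only
        self.values.append(x)
        return attend(x, self.keys, self.values)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
incremental = [cache.step(x) for x in tokens]  # one new position per step
reference = full_recompute(tokens)             # all positions at once
```

Because of the causal mask, the outputs of earlier positions do not change when a new token is appended, so the incremental results match the full recomputation while doing far less work per step.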

# Small report - experiments of your own choice (15 points)
Once everything is working fine, you can explore aspects of these models and do some research of your own into how they behave.

For example, you can experiment with the hyperparameters.
What is the effect of the different hyperparameters on the final model performance? What about training time? If you decide to implement greedy search to compare with beam search, how much worse is it?

What are some other metrics you could have for machine translation? Can you compute them and add them to your WandB report?

Those are only examples, you can do whatever you think will be interesting.
This part accounts for many points, *feel free to go wild!*

---
*Make a concise report about your experiments here.*

*My experiment tested different hyperparameter settings to find the best performance. I looked for parameter combinations that minimize the model's validation loss, which serves as a good metric for the model's ability to generalize to unseen data.*

*To explore that, I ran a Bayesian optimization sweep with Weights and Biases, targeting the minimization of the validation loss. The sweep tests different configurations by adjusting several parameters, including the number of layers (2, 3, 4), the embedding dimension (128, 192, 256), the hidden dimension of the feedforward networks (256, 384, 512) and the number of attention heads (2, 4, 8). The values were chosen somewhat arbitrarily, keeping in mind that we are running on a single T4. I also varied the regularization through the dropout rate, sampled uniformly between 0.1 and 0.2, and finally included training-related parameters such as the batch size and the Adam optimizer learning rate. All of this helped me become more familiar with the different model and training parameters.*
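
A sweep like the one described above could be configured roughly as follows. This is a reconstruction from the description and the logged runs, not the exact configuration that was used: the metric name is taken from the logged key `Validation - loss`, and the candidate values for `batch_size` and `lr` are only those visible in the logs, so the real lists may have been longer.

```python
sweep_config = {
    "method": "bayes",  # Bayesian optimization over the search space
    "metric": {"name": "Validation - loss", "goal": "minimize"},
    "parameters": {
        "n_layers": {"values": [2, 3, 4]},
        "dim_embedding": {"values": [128, 192, 256]},
        "dim_hidden": {"values": [256, 384, 512]},
        "n_heads": {"values": [2, 4, 8]},
        "dropout": {"distribution": "uniform", "min": 0.1, "max": 0.2},
        # Assumption: only the values that appear in the logged runs.
        "batch_size": {"values": [64, 96, 128]},
        "lr": {"values": [0.0003, 0.001]},
    },
}


def heads_divide_embedding(config):
    """Check that every embedding size can be split evenly across every head count."""
    dims = config["parameters"]["dim_embedding"]["values"]
    heads = config["parameters"]["n_heads"]["values"]
    return all(d % h == 0 for d in dims for h in heads)


search_space_is_valid = heads_divide_embedding(sweep_config)

# With wandb installed, the sweep would then be registered and run with, e.g.:
#   sweep_id = wandb.sweep(sweep_config, project="my-project")  # hypothetical project
#   wandb.agent(sweep_id, function=train, count=10)             # `train` is hypothetical
```

Checking divisibility up front avoids runs that fail at model construction because the embedding dimension cannot be split across the heads.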

*During the experiment, the performance metrics were logged to Weights and Biases. You can view the full W&B report here: https://wandb.ai/ahmed-el-shami-polytechnique-montr-al/translation-transformer-sweeps/sweeps/aokmisqs?nw=nwuserahmedelshami*

*According to my results, I believe the best run was youthful-sweep-7, with the following configuration:*

```
batch_size: 64
clip: 5
dim_embedding: 256
dim_hidden: 256
dropout: 0.11866481682891976
epochs: 5
log_every: 50
lr: 0.001
max_sequence_length: 60
min_token_freq: 2
model_type: Transformer
n_heads: 8
n_layers: 2
```

*What I learned from this is that the best model is a small Transformer (2 layers and 256 dimensions for the embedding and hidden states), and that smaller models can do better in a limited time; in this experiment I used 5 epochs. The learning rate of 0.001 probably helped my small model learn more quickly, as it gave better results than other, lower learning rates. So this specific combination, even though it is a smaller model, looks like it was the best for getting the lowest error and highest accuracy on unseen validation data within my limited time.*




---
# Not Part of TP3, But A Potential Project Idea: Understanding the Architecture of a Decoder-Only Transformer

Step 1: In a project group of 3-4, create a high-level plan for how you would need to modify this code to implement a Decoder-Only Transformer. Key components of the implementation should be split up, and each member of the group should present the pseudo-code (or actual code) for one component of the full model to the others, and in a report. Then create the working model and perform experiments comparing it with your TP3 encoder-decoder model.

For more details on the Decoder-Only Transformer, see [this blog post](https://medium.com/international-school-of-ai-data-science/building-custom-gpt-with-pytorch-59e5ba8102d4), the [first "GPT" paper](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf), and the paper cited by this GPT-1 paper for the Decoder-Only architecture used for GPT, [i.e. this paper](https://arxiv.org/abs/1801.10198).