**All Rights Reserved**

**Copyright (c) 2025 IRT Saint-Exupery**

*Author & contact:* 
* mouhcine.mendil@irt-saintexupery.com 

# Natural Language Processing (NLP) to Large Language Models (LLM)

<div align="center">
    <h2>Lab Session 1: Part II</h2>
</div>

## Machine Translation

Our task is to automatically translate sentences from English to French. We will not perform a literal word-for-word translation, since that would amount to a straightforward lookup in a dictionary table. Instead, we will train a translation model by showing it many pairs of English sentences and their French translations.

Machine Translation is an example of sequence-to-sequence learning (Seq2Seq), which consists of training models to convert **sequences of variable length** from one domain (e.g. sentences in English) into **sequences of variable length** in another domain (e.g. the same sentences translated into French). This approach applies whenever text must be generated from text, as in machine translation or text summarization. There are multiple ways to handle this task; **we will focus on RNNs**, which you have learned about previously.

<div class="alert alert-block alert-warning">

⚠️⚠️⚠️ Even though Seq2Seq models can handle variable-length sequences, they need to be trained on batches in which all input sequences share the same length $l_{\text{input}}$ and all output sequences share the same length $l_{\text{output}}$. We will see how to make this possible later in this notebook. ⚠️⚠️⚠️
</div>
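
As a minimal sketch of that idea (not the notebook's final implementation), variable-length strings can be padded with a filler character up to a common length. The helper name `pad_to_length` and the use of a space as filler are assumptions made for illustration.

```python
def pad_to_length(sentences, pad_char=" "):
    """Right-pad every sentence so that all share the length of the longest one."""
    max_len = max(len(s) for s in sentences)
    return [s.ljust(max_len, pad_char) for s in sentences]


padded = pad_to_length(["go.", "hi.", "get up."])
print([len(s) for s in padded])
```

After padding, every sentence can be stored as one row of a fixed-size array.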

### 1. English-French MT Dataset

We want to train a model to learn English-to-French translation from a simple dataset hosted at http://www.manythings.org/anki/. The website also provides sentence pairs for many other languages, such as Spanish and Chinese.

We have already downloaded the dataset, which you can find in `data/eng_to_fr.txt`.

<div class='alert alert-info'>

<b> Exercise 1 </b>

- Read the first 20000 rows of the file <code>data/eng_to_fr.txt</code> into a dataframe. Make sure you are using the right separator.
- Get a general sense of what the dataset contains and describe it. Are the lengths of the input and target sequences similar?
- Keep only the first two columns and name them <code>english</code> and <code>french</code>.
</div>

In [1]:
import pandas as pd
import numpy as np

# limit number of lines read
df_mt = pd.read_csv(
    "data/eng_to_fr.txt",
    sep="\t",
    nrows=20000,
    header=None,
    names=["english", "french", "comment"],
)
df_mt = df_mt[["english", "french"]]
print(f"Number of samples: {len(df_mt)}")

Number of samples: 20000


In [None]:
df_mt.head(100)

Unnamed: 0,english,french
0,Go.,Va !
1,Go.,Marche.
2,Go.,En route !
3,Go.,Bouge !
4,Hi.,Salut !
...,...,...
95,Get up.,Lève-toi !
96,Get up.,Debout.
97,Go now.,"Va, maintenant."
98,Go now.,Allez-y maintenant.


### 2. Data Cleaning and preparation
<div class='alert alert-info'>
<b> Exercise 2.1 </b>

- To simplify the problem (smaller vocabulary), convert all text to lowercase.
- Can we apply other data cleaning operations? Explain your answer.
- Split your data into training (80%) and test (20%) subsets using <code>train_test_split</code>, random seed = 42 and <code>shuffle</code> set to True.

</div>

<div class="alert alert-block alert-warning">
⚠️⚠️⚠️ Note that the vocabulary size and the token indices will be based exclusively on the training set. Make sure you use the same random seed and the other arguments specified in the question. ⚠️⚠️⚠️
</div>


In [None]:
from sklearn.model_selection import train_test_split

# lower case
df_mt_clean = df_mt.map(lambda x: x.lower())

# Split data into train and test
df_mt_train, df_mt_test = train_test_split(
    df_mt_clean, test_size=0.2, shuffle=True, random_state=42
)

df_mt_train.reset_index(inplace=True, drop=True)
df_mt_test.reset_index(inplace=True, drop=True)

To implement Machine Translation, we will rely on sequence-to-sequence (Seq2Seq) models. 

### Seq2Seq: Encoder/Decoder

Seq2Seq was first introduced by researchers at [Google](https://arxiv.org/abs/1409.3215). It captures the information carried by the input sequence in a low-dimensional encoded form and generates the output sequence iteratively. Seq2Seq relies on two main blocks: one to read the input sequence (encoder) and another to generate the output sequence (decoder).

**Encoder**  
The encoder processes the input sequence. It reads the input data **one token at a time** and transforms it into a **context vector** (a fixed-size vector) that captures the data's essential information. The encoder (RNN, LSTM, or GRU) traverses the input sequence, updating its internal state at each step. By the end of this process, the internal state of the encoder is a compact representation of the entire input sequence.

**Decoder**  
The decoder is tasked with generating the output sequence. Starting from the **context vector** produced by the encoder, it generates the output elements one at a time. Like the encoder, the decoder is often implemented as an RNN, LSTM, or GRU. It uses the context vector and what it has generated so far to predict the next element in the output sequence. This process is iterative and continues until a special **end-of-sequence token** is generated (or some other stopping criterion).

At each decoding step, the output of the decoder is passed through a **dense layer followed by a softmax function**. The softmax produces a probability distribution over the entire **target vocabulary**, allowing the model to select the most likely next token. This softmax layer is crucial, as it transforms the decoder’s output into a meaningful prediction at every timestep.
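
To make the softmax step concrete, here is a small sketch in NumPy. The 4-character vocabulary and the logits are made up for illustration; in the real model the logits come from the dense layer.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


# Hypothetical 4-character target vocabulary and made-up dense-layer outputs.
vocab = ["\t", "\n", "a", "b"]
logits = np.array([0.1, -1.0, 2.0, 0.5])

probs = softmax(logits)
# Greedy choice: pick the character with the highest probability.
next_char = vocab[int(np.argmax(probs))]
print(probs.round(3), repr(next_char))
```

The probabilities sum to 1, and the largest logit yields the most likely next character.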

<div align="center">
  <img src="figures/seq2seqmodel.png" width="80%"/>
</div>

----

### Tokenization

Now that we know the type of model we will use, let's move on to tokenization. We will rely on character-to-character machine translation; therefore, sentences will be tokenized at <b>character-level</b>.

Note, however, that Seq2Seq models need two additional special tokens: the Start-of-Sequence (SOS) token and the End-of-Sequence (EOS) token.

- The SOS is essential for initializing the decoding process. During training, it serves as the first input to the decoder, signaling the model to begin generating the target sequence. This helps the model learn a consistent starting context across all examples. During inference (i.e., when generating a translation), the SOS token is explicitly fed as the first decoder input, which triggers the model to begin producing the output sequence.

- The End-of-Sequence (EOS) token, on the other hand, indicates when the generated sequence should stop. It is included in the target output during training so the model can learn to associate certain contexts with the end of a sentence. During inference, the model continues generating tokens until it predicts an EOS token, which signals that the output is complete. Without EOS, the model might continue generating irrelevant tokens or cut off the translation prematurely. Together, SOS and EOS provide clear boundaries for structured and meaningful sequence generation.


<div align="center">
  <img src="figures/tokenization.png" width="40%"/>
  <figcaption>Author: Shann Khosla</figcaption>
</div>

<div class='alert alert-info'>
<b> Exercise 2.2 </b>

- Add a Start-Of-Sentence (SOS) character `\t` and an End-Of-Sentence (EOS) character `\n` to <b>each French sentence (target)</b>. For example, "bonjour!" becomes "\tbonjour!\n".
- What is the maximum sequence length for the input text (English) and the target text (French)?
</div>

In [None]:
df_mt_train.french = df_mt_train.french.apply(lambda x: f"\t{x}\n")
df_mt_test.french = df_mt_test.french.apply(lambda x: f"\t{x}\n")

# Max sequence length for English and French
max_fr_seq_len = np.max(df_mt_train["french"].map(len))
max_en_seq_len = np.max(df_mt_train["english"].map(len))

print(f"Max number of characters per text for inputs (English): {max_en_seq_len}")
print(f"Max number of characters per text for outputs (French): {max_fr_seq_len}")

Max number of characters per text for inputs (English): 17
Max number of characters per text for outputs (French): 59


<div class='alert alert-info'>

<b> Exercise 2.3 </b>
- Build a vocabulary for the input text (English) and a vocabulary for the target text (French) using only the <b>training set</b>. What is the size of each vocabulary?
- Build an input token-to-index dictionary to map each character (key) from the previously constructed English vocabulary to a unique index (value). 
- Build a target token-to-index dictionary to map each character (key) from the previously constructed French vocabulary to a unique index (value).
Make sure that the character <code>\t</code> is associated with index 0 and <code>\n</code> is associated with index 1.     
</div>



In [5]:
# Vocabulary
vocab_en = np.unique(df_mt_train["english"].map(lambda x: set(x)).explode())
vocab_en.sort()
vocab_fr = np.unique(df_mt_train["french"].map(lambda x: set(x)).explode())
vocab_fr.sort()
print("Size of English vocabulary:", len(vocab_en))
print("Size of French vocabulary:", len(vocab_fr))

# Token to index lookup dicts for english and french
en_token2index_dict = {char: i for i, char in enumerate(vocab_en)}
fr_token2index_dict = {char: i for i, char in enumerate(vocab_fr)}

Size of English vocabulary: 50
Size of French vocabulary: 69


### Training vs. Inference in Seq2Seq Models

Although the architecture of a Seq2Seq model (encoder + decoder) stays the same, its **behavior during training and inference is very different**, and it’s important to understand why.

##### **Training Phase** – *Using the Right Answer*

During training, we use a method called **teacher forcing**. This means that at each time step, the decoder is given the **correct previous token** from the target sentence (French), not the one the model predicted.

<div align="center">
  <img src="figures/teacher_forcing.png" width="40%"/>
  <figcaption>Author: Wanshun Wong</figcaption>
</div>

At each step, the model learns to predict the **next token** given the **true previous one**. This makes training faster and more stable because the decoder always knows the "right context," even if it would have made a mistake on its own.
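
The pairing of "true previous token" and "next token" can be sketched at the string level. This is only an illustration on one made-up sentence; in the notebook's arrays both sequences are additionally padded to a common length.

```python
SOS, EOS = "\t", "\n"  # same special characters as in this notebook

target = SOS + "salut !" + EOS

# Decoder input: everything except the last character.
decoder_input = target[:-1]
# Decoder target: the same sequence shifted one step to the left.
decoder_target = target[1:]

for given, expected in zip(decoder_input, decoder_target):
    print(f"given {given!r:>5} -> predict {expected!r}")
```

The first decoder input is the SOS character and the last expected output is the EOS character.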

**The model is trained to minimize the difference between its predicted output sequence and the true output sequence, using a loss function like cross-entropy.**
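
As a rough numerical illustration, the categorical cross-entropy of one sequence can be computed by hand. The predicted distributions below are made up; only the formula (minus the log-probability assigned to the true token, averaged over time steps) matters.

```python
import numpy as np

# Made-up predicted distributions for 3 time steps over a 4-token vocabulary.
y_pred = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
])
# One-hot true tokens for the same 3 steps (indices 0, 1 and 3).
y_true = np.eye(4)[[0, 1, 3]]

# Cross-entropy per step: minus the log-probability given to the true token.
per_step_loss = -np.sum(y_true * np.log(y_pred), axis=-1)
mean_loss = per_step_loss.mean()
print(per_step_loss.round(3), round(float(mean_loss), 3))
```

The loss shrinks toward 0 as the probability assigned to the true token approaches 1.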

##### **Inference Phase** – *Using Its Own Predictions*

When we switch to inference (i.e., generating a translation), the model doesn’t have access to the ground-truth translation anymore. It has to **generate the output one token at a time**, feeding **its own previous prediction** back into the decoder.

It starts with the **start-of-sequence token (SOS)** and continues predicting the next token based on everything it has generated so far. It stops when it outputs the **end-of-sequence token (EOS)** or hits a maximum sequence length.

This process is called **autoregressive decoding**, and it's harder because:
- There is no ground truth to guide the model
- One small error early on can affect all the next predictions
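
The decoding loop described above can be sketched with a toy stand-in for the decoder. `toy_decoder_step` is hypothetical: it simply replays a fixed string, whereas a trained model would return a probability distribution from which the next character is chosen.

```python
SOS, EOS = "\t", "\n"

# Hypothetical "model": replays a fixed translation, one character per call.
TOY_TRANSLATION = "salut !" + EOS


def toy_decoder_step(previous_char, position):
    """Return the next character and the updated state (here, a position)."""
    return TOY_TRANSLATION[position], position + 1


def greedy_decode(step_fn, max_len=20):
    generated = ""
    previous_char, state = SOS, 0
    while len(generated) < max_len:
        # Feed the model's own previous output back in.
        previous_char, state = step_fn(previous_char, state)
        if previous_char == EOS:
            break
        generated += previous_char
    return generated


print(greedy_decode(toy_decoder_step))
```

The two stopping criteria are visible in the loop: the EOS character and the `max_len` cap.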

##### Summary

| Phase     | What is fed to the decoder?          | Purpose                            |
|-----------|--------------------------------------|------------------------------------|
| Training  | The correct previous token (teacher forcing) | To help the model learn faster and more accurately |
| Inference | The model’s own previous prediction  | To test whether the model can generate coherent sequences by itself |

Understanding this distinction is key: **teacher forcing helps the model learn**, while **inference checks whether it really has**.


---- 

Knowing this, let's proceed to the vectorization using one-hot encoding. 

<div class='alert alert-info'>
<b> Exercise 2.4 </b>

- Write a function `one_hot_input` that prepares:
    - the encoder's inputs `encoder_one_hot_inputs` as a 3D array of shape `(size_corpus, max_english_sentence_length, size_english_vocabulary)` containing the one-hot vectorization of the English sentences. The sentences shorter than `max_english_sentence_length` are to be filled with spaces `" "` (padding). Replace out-of-vocabulary tokens by a space `" "`.
    - the decoder's inputs `decoder_one_hot_inputs` as a 3D array of shape `(size_corpus, max_french_sentence_length, size_french_vocabulary)` containing the one-hot vectorization of the French sentences. The sentences shorter than `max_french_sentence_length` are to be filled with spaces `" "`. Replace out-of-vocabulary tokens by a space `" "`. 
</div>

In line with the teacher forcing approach, the function `one_hot_target` prepares the decoder's targets `decoder_one_hot_targets`. They are the same as `decoder_one_hot_inputs` but offset by one time step: `decoder_one_hot_targets[:, j, :]` $\leftarrow$ `decoder_one_hot_inputs[:, j + 1, :]`.
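
The one-step offset can be checked on a toy example. The 4-character vocabulary, the sentence and the helper `toy_one_hot` are assumptions made for illustration; they are not the functions used in the rest of the notebook.

```python
import numpy as np

# Hypothetical 4-character vocabulary; indices follow sorted order.
token2index = {"\t": 0, "\n": 1, " ": 2, "a": 3}
sentence = "\ta\n"  # SOS + "a" + EOS
max_len = 4


def toy_one_hot(text):
    """One-hot encode text, padding with spaces up to max_len."""
    padded = text.ljust(max_len, " ")
    out = np.zeros((max_len, len(token2index)), dtype="float32")
    for t, char in enumerate(padded):
        out[t, token2index[char]] = 1.0
    return out


decoder_input = toy_one_hot(sentence)       # encodes "\ta\n" plus one space
decoder_target = toy_one_hot(sentence[1:])  # encodes "a\n" plus two spaces

# The target at step j equals the input at step j + 1.
print(np.array_equal(decoder_target[:-1], decoder_input[1:]))
```

Each row contains exactly one 1, including the padded positions, which are encoded as spaces.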

In [6]:
def one_hot_input(corpus, token2index_dict, max_seq_len):
    # One-hot encode each document and pad with spaces up to max_seq_len.
    pad_index = token2index_dict[" "]
    result = np.zeros(
        (len(corpus), max_seq_len, len(token2index_dict)), dtype="float32"
    )
    for i, document in enumerate(corpus):
        # Truncate documents longer than max_seq_len (possible at test time).
        document = document[:max_seq_len]
        for t, char in enumerate(document):
            # Out-of-vocabulary characters are replaced by a space.
            result[i, t, token2index_dict.get(char, pad_index)] = 1.0
        result[i, len(document) :, pad_index] = 1.0
    return result


def one_hot_target(corpus, token2index_dict, max_seq_len):
    # Same encoding as one_hot_input, shifted one step to the left.
    pad_index = token2index_dict[" "]
    result = np.zeros(
        (len(corpus), max_seq_len, len(token2index_dict)), dtype="float32"
    )
    for i, document in enumerate(corpus):
        # Drop the first character (SOS) so that step t holds character t + 1.
        shifted = document[1 : max_seq_len + 1]
        for t, char in enumerate(shifted):
            result[i, t, token2index_dict.get(char, pad_index)] = 1.0
        result[i, len(shifted) :, pad_index] = 1.0
    return result


encoder_one_hot_inputs = one_hot_input(
    df_mt_train.english, en_token2index_dict, max_en_seq_len
)
decoder_one_hot_inputs = one_hot_input(
    df_mt_train.french, fr_token2index_dict, max_fr_seq_len
)
decoder_one_hot_targets = one_hot_target(
    df_mt_train.french, fr_token2index_dict, max_fr_seq_len
)

### 3. Model Training 


We will now implement the architecture shown in the figure below to train a Seq2Seq model for character-level machine translation.

On the left, the encoder processes a sequence of one-hot encoded input characters (from an English sentence) using a GRU layer, drawn in the figure as a chain of GRU cells unrolled over time. As the encoder reads each character, it updates its internal hidden state. Once the final character is processed, the last hidden state becomes the context vector summarizing the entire input sentence.

This context vector is then passed to the decoder on the right. The decoder is also composed of GRU cells, and at each time step, it receives:

* The one-hot encoded representation of the previous character (from the target sentence during training, or from its own prediction during inference)

* The current hidden state (initially, the final hidden state from the encoder)

The decoder's GRU outputs are passed through a Dense layer followed by a softmax activation, producing a probability distribution over the target vocabulary at each time step. This allows the model to predict the next character in the output sequence. The decoder continues this autoregressive process until it generates a special end-of-sequence (EOS) token.
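
For intuition, one GRU time step can be written in a few lines of NumPy. This is a sketch of the standard GRU equations with random toy weights, not the Keras implementation: gate conventions differ slightly between references, and Keras' default GRU applies the reset gate in a different place.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(x, h, params):
    """One GRU time step: returns the new hidden state."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(x @ Wz + h @ Uz + bz)  # update gate
    r = sigmoid(x @ Wr + h @ Ur + br)  # reset gate
    h_candidate = np.tanh(x @ Wh + (r * h) @ Uh + bh)
    return z * h + (1.0 - z) * h_candidate  # blend old state and candidate


rng = np.random.default_rng(0)
vocab_size, hidden_size = 5, 3  # toy sizes, not the notebook's 50 and 512
shapes = [(vocab_size, hidden_size), (hidden_size, hidden_size), (hidden_size,)] * 3
params = [rng.normal(scale=0.1, size=shape) for shape in shapes]

# Encode a toy sequence of 4 one-hot characters.
sequence = np.eye(vocab_size)[[0, 3, 1, 4]]
h = np.zeros(hidden_size)
for x in sequence:
    h = gru_step(x, h, params)

# After the last character, h plays the role of the context vector.
print(h.shape)
```

The hidden state keeps the same size whatever the sequence length, which is what allows it to serve as a fixed-size context vector.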

<div align="center">
  <img src="figures/seq2seqmodel.png" width="80%"/>
</div>


<div class='alert alert-info'>
<b> Exercise 3.1 </b>

- Briefly explain how a GRU layer works: What are its inputs, outputs, and why is it useful in the Seq2Seq context? 
- Examine and complete the implementation of the `seq2seq_model` function to build the Seq2Seq model.

</div>

In [7]:
import tensorflow as tf

batch_size = 32  # Batch size for training.
epochs = 100  # Number of epochs to train for.
latent_dim = 512  # dim of the latent space.


def seq2seq_model(latent_dim, input_tokenidx_dict, target_tokenidx_dict):
    # Define an input sequence and process it.
    x_encoder = tf.keras.Input(shape=(None, len(input_tokenidx_dict)))
    encoder = tf.keras.layers.GRU(latent_dim, return_state=True)
    y_encoder, h_encoder = encoder(x_encoder)

    # Set up the decoder input sequence
    x_decoder = tf.keras.Input(shape=(None, len(target_tokenidx_dict)))

    # We set up the decoder to return output sequences and hidden states
    # We don't use the return states in the training model, but we will use them in inference.
    decoder = tf.keras.layers.GRU(latent_dim, return_sequences=True, return_state=True)
    y_decoder, _ = decoder(x_decoder, initial_state=h_encoder)
    # Output layer: Dense + softmax activation over the target vocabulary
    dense_softmax = tf.keras.layers.Dense(
        len(target_tokenidx_dict), activation="softmax"
    )
    y_decoder = dense_softmax(y_decoder)

    # Define the model that will turn [`x_encoder`, `x_decoder`] into `y_decoder`
    model = tf.keras.Model([x_encoder, x_decoder], y_decoder)
    return model



<div class='alert alert-info'>
<b> Exercise 3.2 </b>

- Instantiate and train the Seq2Seq model using the teacher forcing approach.

</div>

In [8]:
# Instantiate the model and compile it
model = seq2seq_model(latent_dim, en_token2index_dict, fr_token2index_dict)
model.compile(
    optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"]
)

# Set early stopping if validation accuracy does not improve after `patience` epochs
callback = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy", patience=50, restore_best_weights=True
)

# Train
history = model.fit(
    [encoder_one_hot_inputs, decoder_one_hot_inputs],
    decoder_one_hot_targets,
    batch_size=batch_size,
    epochs=epochs,
    validation_split=0.2,
    callbacks=[callback],
)


In [9]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_1 (InputLayer)        [(None, None, 50)]           0         []                            
                                                                                                  
 input_2 (InputLayer)        [(None, None, 69)]           0         []                            
                                                                                                  
 gru (GRU)                   [(None, 512),                866304    ['input_1[0][0]']             
                              (None, 512)]                                                        
                                                                                                  
 gru_1 (GRU)                 [(None, None, 512),          895488    ['input_2[0][0]',         
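
The two GRU parameter counts in the summary above can be checked by hand. The sketch below assumes Keras' default GRU configuration (`reset_after=True`), in which each of the three gates has an input kernel, a recurrent kernel and two bias vectors; the helper name `gru_param_count` is hypothetical.

```python
def gru_param_count(input_dim, units):
    """Parameters of a GRU layer with separate input and recurrent biases."""
    per_gate = input_dim * units + units * units + 2 * units
    return 3 * per_gate  # update gate, reset gate, candidate state


print(gru_param_count(50, 512))  # encoder: English vocabulary of 50 characters
print(gru_param_count(69, 512))  # decoder: French vocabulary of 69 characters
```

The decoder has more parameters than the encoder only because its one-hot input is wider (69 versus 50).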

### 4. Model Inference 

Unlike training, inference has no access to the target sequence. The model generates the output character by character, starting from the SOS character and feeding each of its own predictions back in as the next decoder input. This autoregressive process continues until the model produces the EOS character or reaches a maximum length.

<div class='alert alert-info'>
<b> Exercise 4.1 </b>

- Examine the implementation of the `translate` function and analyse how the Seq2Seq model is used for inference.
- Run the inference on some test sentences one by one.
- What do you notice about errors with this autoregressive generation? Are mistakes equally distributed throughout a sequence (beginning, middle and end)?

</div>

In [10]:
def translate(input_seq, seq2seq_model=model):
    # You can check seq2seq_model.summary() to recall the input and outputs
    # of its layers

    # Encoder
    x_encoder = seq2seq_model.input[0]  # input_1
    y_encoder, h_encoder = seq2seq_model.layers[2].output  # gru (encoder)
    encoder = tf.keras.Model(inputs=x_encoder, outputs=h_encoder)

    # Decoder
    x_decoder = seq2seq_model.input[1]  # input_2
    h_in_decoder = tf.keras.Input(shape=(latent_dim,))
    gru_layer = seq2seq_model.layers[3]
    y_gru, h_out_gru = gru_layer(x_decoder, initial_state=h_in_decoder)
    dense_layer = seq2seq_model.layers[4]
    y_decoder = dense_layer(y_gru)
    decoder_model = tf.keras.Model([x_decoder, h_in_decoder], [y_decoder, h_out_gru])

    # Reverse dictionary to recover target vocabulary from token index
    token_idx_fr_dict = dict((i, char) for char, i in fr_token2index_dict.items())

    def decode_translation(input_seq):
        # Encode input as context vector.
        h_encoder = encoder.predict(input_seq, verbose=0)

        # Generate empty target sequence of length 1.
        translated_seq = np.zeros((1, 1, len(fr_token2index_dict)))
        # Initialize the first character of target sequence with the SOS.
        translated_seq[0, 0, fr_token2index_dict["\t"]] = 1.0

        # Sampling loop for a batch (=1) of sequences
        condition = False
        decoded_sentence = ""
        h_in_decoder = h_encoder  # init
        while not condition:
            y_decoder, h_out_decoder = decoder_model.predict(
                [translated_seq, h_in_decoder], verbose=0
            )

            # Get token
            token_index = np.argmax(y_decoder[0, -1, :])
            char = token_idx_fr_dict[token_index]
            decoded_sentence += char

            # Stop condition: find stop character or reach max length
            if char == "\n" or len(decoded_sentence) > max_fr_seq_len:
                condition = True

            # Update the translated sequence.
            translated_seq = np.zeros((1, 1, len(fr_token2index_dict)))
            translated_seq[0, 0, token_index] = 1.0

            # Update hidden state
            h_in_decoder = h_out_decoder
        return decoded_sentence

    return decode_translation(input_seq)


# test input sequences
input_sequences = one_hot_input(df_mt_test.english, en_token2index_dict, max_en_seq_len)


for seq_index in range(20):
    # Take one sequence from the test set and decode it.
    input_seq = input_sequences[seq_index : seq_index + 1]
    translated_sentence = translate(input_seq)

    print(f"-------------------- Sentence {seq_index} --------------------")
    print("Input sentence:", df_mt_test.english[seq_index])
    print("Translated sentence:", translated_sentence)

-------------------- Sentence 0 --------------------
Input sentence: do it tomorrow.
Translated sentence: arrête de faire ça !

-------------------- Sentence 1 --------------------
Input sentence: i felt ill.
Translated sentence: je me sentais malade.

-------------------- Sentence 2 --------------------
Input sentence: it'll be easy.
Translated sentence: ça sera facile.

-------------------- Sentence 3 --------------------
Input sentence: i got hot.
Translated sentence: j'ai compris.

-------------------- Sentence 4 --------------------
Input sentence: who broke this?
Translated sentence: qui l'a betion ?

-------------------- Sentence 5 --------------------
Input sentence: i'm no quitter.
Translated sentence: je ne suis pas une sainte.

-------------------- Sentence 6 --------------------
Input sentence: he's so stupid.
Translated sentence: il fur inutile.

-------------------- Sentence 7 --------------------
Input sentence: did tom eat?
Translated sentence: est-ce que tom a mangé ?


BLEU score is one of the most common automatic metrics used to evaluate the quality of machine translation models. It measures how closely the model’s output matches one or more reference translations.
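
The core ingredient of BLEU is the clipped (modified) n-gram precision. The sketch below computes it for a single reference in pure Python; the hypothesis sentence is made up, and the full BLEU score additionally combines the precisions for several n-gram orders with a geometric mean and applies a brevity penalty.

```python
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, hypothesis, n):
    """Clipped n-gram precision of a hypothesis against a single reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    # Each hypothesis n-gram is credited at most as often as it occurs
    # in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0


reference = "je me sentais malade .".split()
hypothesis = "je me sens malade .".split()  # made-up model output

print(modified_precision(reference, hypothesis, 1))  # unigram precision
print(modified_precision(reference, hypothesis, 2))  # bigram precision
```

A single wrong word lowers the unigram precision slightly but breaks two bigrams, which is why higher-order n-grams penalize local errors more strongly.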

<div class='alert alert-info'>
<b> Exercise 4.2 </b>

* Briefly explain how the BLEU score works. What does it measure, and how is it computed?

* Using `nltk.translate.bleu_score`, compute the average BLEU score over at least 100 sentence pairs (predicted vs. reference). Interpret the result: what does it say about the model's performance?

* Suggest some ways to improve the model. For example, you can consider data quality, preparation, vectorization or model architecture.

</div>