# **Classic Fine-Tuning**
In this hands-on exercise, we will be fine-tuning different models for various tasks using classical fine-tuning. Classical fine-tuning is a common approach to establish a solid baseline for model specialization performance.

The goal of fine-tuning is to take a pre-trained model and adapt it to a specific task or dataset. By leveraging the knowledge and representations learned from a large-scale pre-training task, we can achieve better performance on downstream tasks with less training data.

This notebook is divided into three parts:
- Classification: IMDB Dataset with RoBERTa
- Chatbot: Roleplay Dataset with Phi-2
- Summarization: SciTLDR dataset with T5

In [None]:
from pathlib import Path
import os
import datasets
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModelForCausalLM, AutoModelForSeq2SeqLM
from torch.utils.data import Dataset, DataLoader, IterableDataset
import torch
from tqdm.notebook import tqdm
import random
from utils import seed_everything
from jupyterquiz import display_quiz
import json


DSDIR = Path(os.environ['DSDIR'])
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

quiz_path = Path("./quiz/finetune.json")
quiz = json.loads(quiz_path.read_text())

seed_everything(53)

---

## **Classification: IMDB Dataset with RoBERTa**
The [IMDB dataset](https://huggingface.co/datasets/imdb) is a collection of movie reviews, each labeled with a binary sentiment (positive or negative). In this task, we will be using RoBERTa, an Encoder model pre-trained on English text with a masked language modeling (MLM) objective, similar to BERT. Our goal is to fine-tune the [RoBERTa base](https://huggingface.co/FacebookAI/roberta-base) model on the IMDB dataset, enabling it to accurately classify movie reviews based on sentiment.


### **Exploratory Data Analysis**
Before diving into the fine-tuning process, it's important to perform an exploratory data analysis (EDA) on the dataset. This will help us understand the structure and characteristics of the data, as well as determine the appropriate input and output variables for training our Auto-Encoding Model.

Here we will do a very quick EDA, but keep in mind that this step is essential and that you should spend more time on it than we will during this exercise.

In [None]:
# Load the dataset
imdb_dataset = datasets.load_from_disk(DSDIR / "HuggingFace/imdb/plain_text")

In [None]:
# Check the splits of the dataset
imdb_dataset

In [None]:
# Check the number of elements of the dataset and the features
imdb_dataset["train"]

In [None]:
# Check the format of the dataset elements
imdb_dataset["train"][0]

Feel free to explore the dataset in more depth:

### **Discover the Model and its Tokenizer**

In this section, we will explore the configurations of the RoBERTa model and its tokenizer. We will utilize the transformers library to import the model and examine the input and output formats of the model.

In [None]:
# Initialize the model and its tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    DSDIR / "HuggingFace_Models/FacebookAI/roberta-base", num_labels=2
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(DSDIR / "HuggingFace_Models/FacebookAI/roberta-base")

In [None]:
# Check the model configuration
model.config

In [None]:
# Run this cell to get a multiple choice question
display_quiz([quiz[0]])

In [None]:
# Test the tokenizer
text_input = "I am learning."
tokenizer(text_input)

In [None]:
# We can see that one word can be divided into several tokens
tokenizer("IDRIS", add_special_tokens=False)

The padding technique allows us to transform a list of vectors into a matrix, which is essential for efficiently using a Transformer model. Padding involves adding zeros (or any other designated padding token) to sequences of different lengths to make them equal in length.
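
To make the idea concrete, here is a minimal plain-Python sketch of padding, independent of the tokenizer. The token ids are made up for illustration, and `PAD_ID = 1` is an assumption (in practice, read the value from `tokenizer.pad_token_id`). The helper `pad_batch` is hypothetical and only mirrors what `padding=True` is expected to produce.

```python
# Plain-Python sketch of padding; the token ids below are made up for illustration.
PAD_ID = 1  # assumption: id of the padding token (check tokenizer.pad_token_id)


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that attention should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


batch = [[0, 100, 524, 2], [0, 7, 8, 9, 10, 11, 2]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

Every row now has the same length, so the batch can be stored as a matrix, and the attention mask records which positions hold real tokens.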

**Without padding:**<br>
![image](./images/without_padding.jpg)

In [None]:
# Test the tokenizer with a batch of several inputs, without padding
texts_input = ["I am learning.", "IDRIS is hosting the supercomputer Jean Zay"]
tokenizer(texts_input)

**With padding:**<br>
![image](./images/with_padding.jpg)

In [None]:
# Test the tokenizer with a batch of several inputs, converted to PyTorch Tensors
# We need to add padding to make a matrix
texts_input = ["I am learning.", "IDRIS is hosting the supercomputer Jean Zay"]
model_inp = tokenizer(texts_input, return_tensors="pt", padding=True)
model_inp

**Input and output of a Transformer for sequence classification:**<br>
![image](./images/in-out_encoder_classification.jpg)

In [None]:
# Test the truncation to set a maximum length of the Tensor
model_inp = tokenizer(
    texts_input, return_tensors="pt", padding=True, truncation=True, max_length=10
)
model_inp

In [None]:
# compute inference
model_inp = model_inp.to("cuda")
out = model(**model_inp)

In [None]:
print(f"Type of the HuggingFace output: {type(out)}")
print(f"List of elements of the output: {out.keys()}")
print(f"Type of the logits: {type(out['logits'])}")
print(f"Shape of the logits: {out['logits'].shape}")
print(f"Value: {out['logits']}")

In [None]:
# Run this cell to get a multiple choice question
display_quiz([quiz[1]])

### **Create the Data Pipeline**
In this section, we will create the PyTorch dataset and dataloader that will be used to feed the model during training. The data pipeline is an essential component of the training process as it handles the loading and preprocessing of the data, ensuring that it is in the appropriate format for the model.

The dataset class will provide the model with the input data and corresponding labels, while the dataloader will handle the batching and shuffling of the data.

In this section, we will implement the necessary code to create the data pipeline, allowing us to seamlessly integrate it into the training loop and train our model effectively.
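
As a rough mental model of what a dataloader does with a map-style dataset, the sketch below shuffles the indices, groups them into batches, and hands each list of samples to a collate function. It is plain Python, not the PyTorch implementation: `iterate_batches`, `toy_collate`, and the toy samples are hypothetical names and data used only for illustration.

```python
import random


def iterate_batches(dataset, batch_size, collate_fn, shuffle=True, seed=0):
    """Yield collated batches, mimicking the core of a map-style data loader."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        samples = [dataset[i] for i in indices[start:start + batch_size]]
        yield collate_fn(samples)


def toy_collate(samples):
    """Turn a list of (text, label) tuples into (list of texts, list of labels)."""
    texts, labels = zip(*samples)
    return list(texts), list(labels)


toy_dataset = [("great film", 1), ("boring", 0), ("loved it", 1), ("awful", 0), ("fine", 1)]
batches = list(iterate_batches(toy_dataset, batch_size=2, collate_fn=toy_collate))
for texts, labels in batches:
    print(texts, labels)
```

The last batch is smaller because five samples do not divide evenly into batches of two. A real collate function would additionally tokenize the texts and convert the labels to a Tensor.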


<hr style="border:1px solid red"> 

> <span style="color:red">**Task**:</span> Your task is to implement a custom PyTorch Dataset class, which we'll name `IMDBDataset`. This class should be designed to work with the IMDB dataset provided by HuggingFace's datasets library.<br><br>The `IMDBDataset` class should override the `__getitem__` method such that when an instance of the class is indexed at `i` (e.g., `instance[i]`), it returns a tuple containing the text and the label of the `i`-th sample in the underlying HuggingFace IMDB dataset. In other words, if `dataset` is an instance of `IMDBDataset`, then `dataset[i]` should yield `(text, label)` where `text` is the review text and `label` is the sentiment label (positive or negative) of the `i`-th sample in the IMDB dataset.<br><br>Remember to also implement the `__len__` method to return the total number of samples in the dataset. This is a requirement for PyTorch's Dataset interface.

**IMDB pytorch dataset representation:**<br>
![image](./images/imdb_dataset.jpg)

**Ease level 1:**

**Ease level 2:**

**Solution:**

**Test it here:**

In [None]:
dataset = IMDBDataset(imdb_dataset["train"])
dataset[0]

<hr style="border:1px solid red"> 

<hr style="border:1px solid red"> 

> <span style="color:red">**Task**:</span> Your task is to implement a PyTorch DataLoader along with a custom collate function. This DataLoader should be designed to accept an instance of the `IMDBDataset` class you previously created.<br><br>The DataLoader's purpose is to load data in batches from the `IMDBDataset` instance during the training process. It should handle batching, shuffling, and parallel data loading.<br><br>The collate function is a necessary component that you'll need to define and pass to the DataLoader. This function will be used to combine multiple data samples from your `IMDBDataset` into a single batch. It should take a list of samples (each being a tuple of text and label from `IMDBDataset`) and return a batch of token ids (with the attention mask) and a batch of labels.<br><br>The final output of the DataLoader (when iterated over) should be batches of inputs and labels ready to be fed into your model for training. Each input batch should correspond to a batch of labels.

**IMDB collate function of pytorch dataloader representation:**<br>
![image](./images/collate_imdb.jpg)

**Ease level 1:**

**Ease level 2:**

**Solution:**

**Test it here:**

In [None]:
for model_inp, labels in dataloader:
    print(model_inp)
    print(model_inp['input_ids'].shape)
    print(labels)
    break

<hr style="border:1px solid red"> 

### **Create the training loop**

In this section, we will implement the training loop for our model. The training loop is responsible for iterating over the dataset, making predictions, computing the loss, and updating the model's weights.

The steps of the training loop are as follows:

1. Sample a batch: We randomly select a batch of data from the dataset, which consists of both the model inputs and the corresponding labels.

2. Forward pass: We pass the batch through the model to obtain predictions. This step involves feeding the model inputs into the model and obtaining the output.

3. Compute the loss: We compare the model's predictions with the actual labels to compute the loss. The loss function measures the discrepancy between the predicted and actual values.

4. Backward pass: We compute the gradients of the loss with respect to the model's parameters. This step involves calculating the derivative of the loss with respect to each parameter of the model.

5. Update the weights: We update the model's weights using an optimization algorithm, such as stochastic gradient descent (SGD) or Adam. This step involves adjusting the parameters of the model in the direction that minimizes the loss.

By repeating these steps for multiple epochs, the model gradually learns to make better predictions and minimize the loss.
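
The five steps above can be sketched on a toy problem with a single weight, where the gradient is computed by hand instead of by PyTorch's autograd. This is only an illustration of the loop structure under stated assumptions: the data, the learning rate, and the squared-error loss are arbitrary choices and not the ones used for RoBERTa.

```python
import random

rng = random.Random(0)
# Toy data generated from y = 3 * x, so the ideal weight is 3.0
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]]
w = 0.0    # the single model parameter
lr = 0.05  # learning rate (arbitrary choice for this toy problem)

for step in range(200):
    # 1. Sample a batch
    batch = rng.sample(data, 2)
    # 2. Forward pass
    preds = [w * x for x, _ in batch]
    # 3. Compute the loss (mean squared error)
    loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / len(batch)
    # 4. Backward pass: derivative of the loss with respect to w
    grad = sum(2 * (p - y) * x for p, (x, y) in zip(preds, batch)) / len(batch)
    # 5. Update the weight in the direction that lowers the loss
    w -= lr * grad

print(f"learned w = {w:.3f}, last loss = {loss:.6f}")
```

In PyTorch, step 4 corresponds to `loss.backward()` and step 5 to `optimizer.step()`, followed by `optimizer.zero_grad()` to reset the accumulated gradients.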

In [None]:
# Initialize Optimizer and Criterion
# We choose the CrossEntropyLoss and Adam because they're the most used
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

<hr style="border:1px solid red"> 

> <span style="color:red">**Task**:</span> Your task is to define a function named `train_loop`. This function should accept four parameters: the HuggingFace model (`model`), the DataLoader instance (`dataloader`), the loss function (`criterion`), and the optimizer (`optimizer`). These parameters are assumed to have been initialized prior to calling this function.<br><br>The `train_loop` function should implement the training process for the model. This process involves iterating over the DataLoader, performing a forward pass of the model, computing the loss using the criterion, performing a backward pass, and then updating the model parameters using the optimizer. These steps are described at the beginning of this section.<br><br>After the training process is complete (only 1 epoch in this case), the `train_loop` function should return the trained model.<br><br>Additionally, you should incorporate a "test mode" into the `train_loop` function. When this mode is activated, the training loop should terminate after 50 iterations. This mode is useful for testing or debugging the function without having to wait for the entire training process to complete.

**Ease level 1:**

**Ease level 2:**

**Solution:**

**Test it here:**

In [None]:
model = train_loop(model, dataloader, criterion, optimizer, test=True)

<hr style="border:1px solid red"> 

### **Use the Model**
In this section, we will demonstrate how to use the trained model for making predictions.


In [None]:
def test_model(list_test, model):
    model.eval()
    
    model_inp = tokenizer(
        list_test, return_tensors="pt", padding=True, truncation=True, max_length=512
    ).to("cuda")
    out = model(**model_inp)
    predictions = out.logits.argmax(dim=1)
    
    for idx in range(len(list_test)):
        print(
            f'The review "{list_test[idx]}" is',
            f'{"negative" if predictions[idx] == 0 else "positive"}.'
        )

In [None]:
test_model(["I hated this movie", "I loved this movie"], model)

---

## **Chatbot: Roleplay Dataset with Phi-2**

The [Roleplay dataset](https://huggingface.co/datasets/hieunguyenminh/roleplay) is a collection of chats between a user and a character. Each chat is preceded by a system prompt that introduces the character. In this section, we will be using the [Phi-2](https://huggingface.co/microsoft/phi-2) Decoder model, which is a 2.7B model trained with a Causal Language Modeling objective.

Our goal is to fine-tune the Phi-2 model to create a chatbot that can mimic any character we define.


### **Exploratory Data Analysis**

In [None]:
# Load the dataset
roleplay_dataset = datasets.load_from_disk(DSDIR / "HuggingFace/hieunguyenminh/roleplay")

Feel free to explore the dataset in more depth:

In [None]:
display_quiz(quiz[2:4])

### **Discover the Model and its Tokenizer**

In [None]:
# Initialize the model and its tokenizer
model = AutoModelForCausalLM.from_pretrained(
    DSDIR / "HuggingFace_Models/microsoft/phi-2",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # Allow using code that was not written by HuggingFace
    attn_implementation="flash_attention_2"  # Optimize the model with Flash Attention
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(DSDIR / "HuggingFace_Models/microsoft/phi-2")

In [None]:
model.config

<hr style="border:1px solid red"> 

> <span style="color:red">**Task**:</span> Implement a function that counts the total number of tokens in a given HF dataset split. The function should take as input the dataset split and the tokenizer, and return the number of tokens in the dataset.

**Ease level 1:**

**Ease level 2:**

**Solution:**

**Test it here:**

In [None]:
count_tokens(roleplay_dataset['train'], tokenizer)

In [None]:
display_quiz([quiz[4]])

<hr style="border:1px solid red"> 

To test our model, which is a Decoder-style Transformer, we can generate text by using it in an auto-regressive manner.
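
Auto-regressive generation can be pictured as a loop that repeatedly asks the model for the next token and appends it to the input. The sketch below uses a hypothetical lookup table (`TOY_SCORES`) in place of a real language model, which would condition on the whole prefix rather than on the last token only. Keeping the best-scoring token at each step is greedy decoding, which is what `do_sample=False` selects in `generate()` when no beam search is used.

```python
# Hypothetical stand-in for a language model: maps the last token to scores
# for each candidate next token. A real model conditions on the whole prefix.
TOY_SCORES = {
    "<bos>": {"a": 2.0, "the": 1.0},
    "a": {"supercomputer": 3.0, "computer": 1.5},
    "supercomputer": {"computes": 2.5, "<eos>": 0.5},
    "computes": {"<eos>": 4.0, "a": 0.1},
}


def greedy_generate(prompt_tokens, max_new_tokens=10):
    """Extend the prompt one token at a time, always keeping the best-scoring token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = TOY_SCORES[tokens[-1]]           # one "forward pass"
        next_token = max(scores, key=scores.get)  # greedy choice
        tokens.append(next_token)                 # feed it back as input
        if next_token == "<eos>":
            break
    return tokens


print(greedy_generate(["<bos>"]))
```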

In [None]:
def generate_text(prompt):
    """Generate text from a prompt and print it."""
    model_inp = tokenizer(prompt, return_tensors="pt").to("cuda")
    # generate() runs a succession of forward passes (auto-regressive decoding)
    out = model.generate(input_ids=model_inp["input_ids"], do_sample=False, max_new_tokens=100)
    print(tokenizer.decode(out[0]))

In [None]:
prompt = "What is a supercomputer ?"
generate_text(prompt)

Now let's explore the capabilities of the model by using it as a roleplay chatbot, similar to the roleplay dataset.

In [None]:
prompt = """<|system|>Orphaned at age three, when he witnessed his mother's brutal murder, Dexter was adopted by Miami police officer Harry Morgan. Recognizing the boy's trauma and the subsequent development of his sociopathic tendencies, Harry trained Dexter to channel his gruesome bloodlust into vigilantism, killing only heinous criminals who slip through the criminal justice system.
<|user|>How do you approach a new case, Dexter?
<|assistant|>"""
generate_text(prompt)

The answer seems good, but the LLM does not impersonate the character, so we need to train it to do so.
<br>Let's examine the actual output of the model (without any post-processing):

In [None]:
prompt = "What is a supercomputer ?"
list_token_ids = tokenizer(prompt, return_tensors="pt")['input_ids'].to("cuda")
print(f"Shape of the model input: {list_token_ids.shape}")
out = model(list_token_ids)
print(f"Shape of the model output: {out.logits.shape}")

The output of the model contains one vector of logits per input token. After applying softmax, each vector gives a probability distribution over the vocabulary, indicating the likelihood of each token being the next token in the sequence. The higher the logit value for a token, the more likely it is to be the next token:
![image](./images/in-out_decoder.jpg)
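
As a small worked example of the softmax step, the plain-Python sketch below converts a vector of logits into probabilities and picks the most likely next token. The four-word vocabulary and the logit values are made up for illustration; a real model produces one such vector per input position over its full vocabulary.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["cat", "dog", "computer", "the"]      # toy vocabulary (assumption)
last_position_logits = [1.0, 0.5, 3.0, -1.0]   # made-up logits for the last position

probs = softmax(last_position_logits)
next_token = vocab[probs.index(max(probs))]
print([round(p, 3) for p in probs], next_token)
```

Softmax preserves the ordering of the logits, so the token with the highest logit is also the one with the highest probability.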

In [None]:
display_quiz([quiz[5]])

### **Create the Data Pipeline**

<hr style="border:1px solid red"> 

> <span style="color:red">**Task**:</span> Implement a custom PyTorch IterableDataset class named `RoleplayDataset` that works with the Roleplay dataset provided by HuggingFace's datasets library. The `RoleplayDataset` class should tokenize the texts of the HF dataset and concatenate the token IDs representing samples until it reaches a specified limit (`seq_length`). Different samples can be in the same sequence, separated by the end of sequence token (`eos_token`).<br><br>The RoleplayDataset should return the `model_input` and the `labels` from the sequence of tokens. The `model_input` is the sequence of tokens without the last token, and the `labels` is the same sequence of tokens without the first token.

**Token concatenation dataset pipeline:**<br>
![image](./images/concat_dataset.jpg)
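
The concatenation and the input/label shift can be illustrated on plain lists of made-up token ids before moving to an `IterableDataset`. This is a sketch of one possible reading of the task, not the reference solution: `pack_sequences` is a hypothetical helper, `EOS_ID = 0` is an assumption, and cutting chunks of `seq_length + 1` tokens (so that inputs and labels both have `seq_length` tokens) is a design choice.

```python
EOS_ID = 0  # assumption: id of the end-of-sequence token in this toy example


def pack_sequences(tokenized_samples, seq_length, eos_id=EOS_ID):
    """Concatenate samples (separated by EOS) and cut chunks of seq_length + 1 tokens."""
    buffer = []
    for sample in tokenized_samples:
        buffer.extend(sample + [eos_id])
        while len(buffer) >= seq_length + 1:
            chunk, buffer = buffer[:seq_length + 1], buffer[seq_length + 1:]
            # Inputs drop the last token, labels drop the first one
            yield chunk[:-1], chunk[1:]


samples = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
pairs = list(pack_sequences(samples, seq_length=4))
for model_input, labels in pairs:
    print(model_input, labels)
```

Each label is the token that follows the corresponding input token, which is exactly the target of next-token prediction. In this sketch, tokens left in the buffer at the end are simply dropped.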

**Ease level 1:**

**Ease level 2:**

**Solution:**

**Test it here (it should return `True`):**

In [None]:
dataset = RoleplayDataset(tokenizer, roleplay_dataset['train'], seq_length=512)
model_inp, labels = next(iter(dataset))
torch.equal(model_inp[1:], labels[:-1])

In [None]:
dataloader = DataLoader(
    dataset,
    batch_size=4,
    num_workers=1,
    prefetch_factor=2,
)

In [None]:
for model_inp, labels in dataloader:
    print(torch.equal(model_inp[:, 1:], labels[:, :-1]))
    break

<hr style="border:1px solid red"> 

### **Create the training loop**

In [None]:
# Initialize Optimizer and Criterion
# We choose the CrossEntropyLoss and Adam because they're the most used
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-6)

In the training process, the shape of the `labels` tensor is expected to be `[batch_size, seq_length]`, while the shape of the `logits` tensor is expected to be `[batch_size, seq_length, vocab_size]`. However, before passing these tensors to the `CrossEntropyLoss` function, we need to reshape them so that the function can use them. The `labels` tensor should have a shape of `[batch_size * seq_length]`, and the `logits` tensor should have a shape of `[batch_size * seq_length, vocab_size]`. To achieve this, we can use the following function:

In [None]:
def prepare_for_loss(logits, labels):
    """Unfold the Tensors to compute the CrossEntropyLoss correctly"""
    batch_size, seq_length, vocab_size = logits.shape
    logits = logits.view(batch_size * seq_length, vocab_size)
    labels = labels.view(batch_size * seq_length)
    return logits, labels


In [None]:
for model_inp, labels in dataloader:
    model_inp = model_inp.to("cuda")
    labels = labels.to("cuda")
    print(f"labels tensor shape: {labels.shape}")
    logits = model(model_inp.to("cuda")).logits
    print(f"logits tensor shape: {logits.shape}")
    break

In [None]:
logits, labels = prepare_for_loss(logits, labels)
print(f"labels tensor shape after preparation: {labels.shape}")
print(f"logits tensor shape after preparation: {logits.shape}")

In [None]:
# We can finally compute a loss
loss = criterion(logits, labels)
loss

<hr style="border:1px solid red"> 

<span style="color:red">**Task**:</span> Your task is to define a function named `train_loop` that will be used to train a model. This function should be similar to the train loop you created previously for a classification task. However, you need to incorporate the `prepare_for_loss` function that we defined earlier to ensure correct computation of the loss.


**Ease level 1:**

**Ease level 2:**

**Solution:**

**Test it here:**

In [None]:
model = train_loop(model, dataloader, criterion, optimizer, test=True)

In [None]:
prompt = """<|system|>Orphaned at age three, when he witnessed his mother's brutal murder, Dexter was adopted by Miami police officer Harry Morgan. Recognizing the boy's trauma and the subsequent development of his sociopathic tendencies, Harry trained Dexter to channel his gruesome bloodlust into vigilantism, killing only heinous criminals who slip through the criminal justice system.
<|user|>How do you approach a new case, Dexter?
<|assistant|>"""
generate_text(prompt)

<hr style="border:1px solid red"> 

---

## **Summarization: SciTLDR dataset with T5**

The [SciTLDR](https://huggingface.co/datasets/allenai/scitldr) dataset contains TLDR (Too Long, Didn't Read) summaries of research paper abstracts. In this section, we will be using the T5 model, which is an Encoder-Decoder model pre-trained with a span-corruption denoising objective (a variant of masked language modeling in which the decoder reconstructs masked spans of text). Our goal is to fine-tune the [T5-large](https://huggingface.co/google-t5/t5-large) model on the SciTLDR dataset. By doing so, we aim to generate concise and informative summaries for research papers automatically.

### **Exploratory Data Analysis**

In [None]:
# Load the dataset
scitldr_dataset = datasets.load_from_disk(DSDIR / "HuggingFace/allenai/scitldr")

Feel free to explore the dataset in more depth:

### **Discover the Model and its Tokenizer**

In [None]:
tokenizer = AutoTokenizer.from_pretrained(DSDIR / "HuggingFace_Models/google-t5/t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained(DSDIR / "HuggingFace_Models/google-t5/t5-large").to("cuda")
PAD_TOKEN_ID = tokenizer.pad_token_id  # Used later for padding and for the loss

We can utilize an Encoder-Decoder model in an auto-regressive manner to generate text, similar to how a Decoder model operates. To accomplish this, we can employ the `generate` method:
![image](./images/Encoder-Decoder_translation_example1.gif)

In [None]:
def generate_text(text):
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to("cuda")
    outputs = model.generate(input_ids, max_new_tokens=100)
    print(tokenizer.decode(outputs[0]))

In [None]:
input_text = "translate English to German: How old are you?"
generate_text(input_text)

In the auto-regressive generation process, the input to the Encoder model is the text provided in the `generate` method. The Decoder model then takes a 'beginning of sentence' token as input and generates text sequentially. This means that at each step, the model predicts the next token based on the previously generated tokens and output of the Encoder model. This process continues until a predefined maximum length is reached or an end token is generated.
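
The control flow described above can be sketched with toy stand-ins for the two halves of the model. `toy_encode` and `toy_decoder_step` are hypothetical functions that only copy the source in upper case; they are there to show that the encoder runs once, while the decoder is called once per generated token until an end token or the length limit is reached.

```python
def toy_encode(source_tokens):
    """Hypothetical encoder: here it only keeps the source tokens as 'memory'."""
    return list(source_tokens)


def toy_decoder_step(memory, generated):
    """Hypothetical decoder: copy the next source token in upper case, then stop."""
    position = len(generated) - 1  # number of tokens produced after the start token
    return memory[position].upper() if position < len(memory) else "<eos>"


def generate(source_tokens, max_new_tokens=10):
    memory = toy_encode(source_tokens)   # the encoder runs once
    generated = ["<start>"]              # the decoder begins with a start token
    for _ in range(max_new_tokens):      # stop at the length limit...
        token = toy_decoder_step(memory, generated)
        generated.append(token)
        if token == "<eos>":             # ...or when the end token appears
            break
    return generated


print(generate(["how", "old", "are", "you"]))
```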

Let's try to compute one forward pass (one step of the generation):<br>
![image](./images/04_Encoder-Decoder_auto-regressive_inference.jpg)

In [None]:
input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
# T5 has no dedicated BOS token: the decoder starts from the pad token, whose ID is 0
decoder_input_ids = torch.tensor([[0]], dtype=torch.int64).to("cuda")

out = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
out.logits.shape

Before we start the development, let's see what happens when we provide the model with an abstract of a research paper. For this purpose, we will use the abstract of the "Attention is All You Need" paper.

In [None]:
input_text = "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data."
print(input_text)

In [None]:
generate_text(input_text)

### **Create the Data Pipeline**

<hr style="border:1px solid red"> 

<span style="color:red">**Task**:</span> Your task is to implement a custom PyTorch Dataset class called `SciTLDRDataset`. This class should be designed to work with the SciTLDR dataset provided by HuggingFace's datasets library.

The `SciTLDRDataset` class should override the `__getitem__` method. When an instance of the class is indexed at `i` (e.g., `instance[i]`), it should return a tuple containing the following:
- A string that is the concatenation of all the elements of `source` of the HuggingFace dataset element.
- A Tensor (1 dim) of IDs corresponding to the `target` of the HuggingFace dataset element.
- The same Tensor of IDs without the first token and ending with the end-of-sequence token (`eos_token`).

Additionally, please implement the `__len__` method to return the total number of samples in the dataset. This is a requirement for PyTorch's Dataset interface.

**Ease level 1:**

**Ease level 2:**

**Solution:**

**Test it here:**

In [None]:
dataset = SciTLDRDataset(scitldr_dataset['train'], tokenizer)
dataset[0]

In [None]:
# It should return True
encoder_inp, decoder_inp, labels = dataset[0]
torch.equal(decoder_inp[1:], labels[:-1])

<hr style="border:1px solid red"> 

To facilitate the creation of a batch of padded tensors and attention masks from a list of Tensors, which are returned by the dataset we defined previously, we will create two functions. These functions will be used in the collate function of the dataloader.

In [None]:
def add_padding(list_ids: list[torch.Tensor]) -> torch.Tensor:
    """Add padding to a list of tensors and return a padded tensor (batch)"""
    padded_tensor = torch.nn.utils.rnn.pad_sequence(
        [sample.flip(dims=(0,)) for sample in list_ids],
        batch_first=True,
        padding_value=PAD_TOKEN_ID,
    ).flip(dims=(1,))
    return padded_tensor


def create_mask(padded_tensor: torch.Tensor) -> torch.Tensor:
    """Create an attention mask for HuggingFace models (1 = token, 0 = padding)"""
    # Compare against PAD_TOKEN_ID so the mask stays consistent with add_padding
    decoder_mask = (padded_tensor != PAD_TOKEN_ID).to(dtype=torch.int)
    return decoder_mask
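
The flip, pad, flip sequence in `add_padding` places the padding on the left of each row. The plain-Python sketch below shows the expected result on made-up ids, without PyTorch. `left_pad` and `mask_from_padded` are hypothetical helpers, and `PAD_ID = 0` is assumed to match the padding id of the T5 tokenizer.

```python
PAD_ID = 0  # assumption: padding id, as for the T5 tokenizer


def left_pad(sequences, pad_id=PAD_ID):
    """Left-pad lists of ids so that every row ends with real tokens."""
    max_len = max(len(seq) for seq in sequences)
    return [[pad_id] * (max_len - len(seq)) + seq for seq in sequences]


def mask_from_padded(padded, pad_id=PAD_ID):
    """Return 1 for real tokens and 0 for padding."""
    return [[int(tok != pad_id) for tok in row] for row in padded]


padded = left_pad([[5, 6, 7], [8, 9], [3]])
mask = mask_from_padded(padded)
print(padded)
print(mask)
```

Note that a mask built by comparing against the padding id also masks any real token that happens to share that id.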

<hr style="border:1px solid red"> 

<span style="color:red">**Task**:</span> Your task is to create the collate function of a Dataloader that will transform the elements of the `SciTLDRDataset` into batches.

To accomplish this, you will need to perform the following steps in the collate function:
1. Transform the list of `encoder_inp` (strings) into a padded tensor with its attention mask using the tokenizer of the model, as you did previously.
2. Transform the list of `decoder_inp` (Tensors) into a padded tensor with its attention mask using the `add_padding` and `create_mask` functions that were defined earlier.
3. Transform the list of `labels` (Tensors) into a padded tensor using the `add_padding` function.

The DataLoader should return the following:
- `encoder_batch_inp`: A dictionary containing the padded encoder input and its attention mask.
- `decoder_batch_inp`: The padded decoder input.
- `labels_batch`: The padded labels used to compute the loss.
- `decoder_mask`: The mask for the padded decoder input.

**Ease level 1:**

**Ease level 2:**

**Solution:**

**Test it here:**

In [None]:
dataloader = DataLoader(
    dataset,
    batch_size=4,
    num_workers=4,
    prefetch_factor=2,
    shuffle=True,
    collate_fn=collate_fn
)

for encoder_inp, decoder_inp, labels, decoder_mask in dataloader:
    print(encoder_inp)
    print("#" * 50)
    print(decoder_inp)
    print("#" * 50)
    print(labels)
    print("#" * 50)
    print(decoder_mask)
    break

<hr style="border:1px solid red"> 

Now, let's prepare the training loop. We will need the `prepare_for_loss` function again, which is responsible for reshaping the logits and labels to compute the loss using the `CrossEntropyLoss`.

In [None]:
def prepare_for_loss(logits, labels):
    """Unfold the Tensors to compute the CrossEntropyLoss correctly"""
    batch_size, seq_length, vocab_size = logits.shape
    logits = logits.view(batch_size * seq_length, vocab_size)
    labels = labels.view(batch_size * seq_length)
    return logits, labels

In [None]:
# Initialize Optimizer and Criterion
# We choose the CrossEntropyLoss and Adam because they're the most used
criterion = torch.nn.CrossEntropyLoss(ignore_index=PAD_TOKEN_ID)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
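
The `ignore_index` argument tells the loss to skip positions whose label equals the given value, so padded positions do not contribute to the loss or to its average. The plain-Python sketch below reproduces that behaviour on made-up numbers. It is a simplified illustration, not the PyTorch implementation, and `PAD_ID = 0` is an assumption.

```python
import math

PAD_ID = 0  # assumption: label value that marks padding


def cross_entropy(logits, labels, ignore_index=PAD_ID):
    """Mean negative log-likelihood over the positions whose label is not ignored."""
    losses = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padded position: contributes nothing to the loss
        log_norm = math.log(sum(math.exp(v) for v in row))
        losses.append(log_norm - row[label])
    return sum(losses) / len(losses)


# Three flattened positions over a 3-token vocabulary; the last one is padding
logits = [[0.1, 2.0, 0.3], [0.2, 0.1, 1.5], [5.0, 0.0, 0.0]]
labels = [1, 2, PAD_ID]
loss_value = cross_entropy(logits, labels)
print(loss_value)
```

Changing the logits of the padded position leaves the result unchanged, which is the property we want when batches contain sequences of different lengths.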

<hr style="border:1px solid red"> 

<span style="color:red">**Task**:</span> Your task is to define a function named `train_loop` that will be used to train a model. This function should be similar to the train loop you created previously.

**Ease level 1:**

**Ease level 2:**

**Solution:**

**Test it here:**

In [None]:
# We need 2 epochs to see the model improve
for epoch in range(2):
    model = train_loop(model, dataloader, criterion, optimizer, test=True)

In [None]:
input_text = "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))

<hr style="border:1px solid red"> 