# Finetuning
The purpose of this notebook is to learn how to finetune a transformer model for the task of machine translation. It is heavily inspired by the following Hugging Face tutorials:
- [Tutorial on Translation](https://huggingface.co/docs/transformers/tasks/translation)
- [Course on Translation](https://huggingface.co/learn/nlp-course/chapter7/4?fw=pt)

We start by installing a pinned version of the ``transformers`` library and importing the ``numpy`` library.

In [None]:
!pip uninstall -y transformers sentence-transformers
!pip install transformers==5.0.0

Found existing installation: transformers 4.36.0
Uninstalling transformers-4.36.0:
  Successfully uninstalled transformers-4.36.0
Collecting transformers==5.0.0
  Downloading transformers-5.0.0-py3-none-any.whl.metadata (37 kB)
Collecting huggingface-hub<2.0,>=1.3.0 (from transformers==5.0.0)
  Downloading huggingface_hub-1.4.0-py3-none-any.whl.metadata (13 kB)
Collecting tokenizers<=0.23.0,>=0.22.0 (from transformers==5.0.0)
  Downloading tokenizers-0.22.2-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.3 kB)
Downloading transformers-5.0.0-py3-none-any.whl (10.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.1/10.1 MB 95.6 MB/s eta 0:00:00
Downloading huggingface_hub-1.4.0-py3-none-any.whl (553 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 553.2/553.2 kB 46.9 MB/s eta 0:00:00
Downloading tokenizers-0.22.2-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3

In [None]:
import numpy as np

We are going to use the *T5 transformer* model that we already tested in the previous notebook. We will fine-tune the *T5 model* on the ``europarl`` dataset, which contains text from the European Parliament Proceedings. You can find more information about this dataset on its [Hugging Face dataset card](https://huggingface.co/datasets/Helsinki-NLP/europarl).

## Data preparation
We will first download the subset of the ``europarl`` dataset that contains English text and its French translations. In order to do so, make sure that you have the Hugging Face library ``datasets`` installed.

In [None]:
# !pip install datasets

In [None]:
from datasets import load_dataset

raw_dataset = load_dataset("Helsinki-NLP/europarl", "en-fr")
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 2051014
    })
})

**Questions.** Answer the following questions:
1. What type of object is the ``raw_dataset`` object?
2. How many elements are there in the ``raw_dataset`` object?
3. What type of object is the ``raw_dataset["train"]`` object?
4. Describe the ``raw_dataset["train"]`` object.

**Exercise.** Print one of the elements of ``raw_dataset["train"]``.

In [None]:
print(type(raw_dataset))

<class 'datasets.dataset_dict.DatasetDict'>


In [None]:
print(raw_dataset.num_rows)

{'train': 2051014}


In [None]:
print(raw_dataset.num_columns)

{'train': 1}


In [None]:
print(type(raw_dataset["train"]))

<class 'datasets.arrow_dataset.Dataset'>


In [None]:
# TODO: print one of the elements of raw_dataset["train"]
print(next(iter(raw_dataset["train"])))

{'translation': {'en': 'Resumption of the session', 'fr': 'Reprise de la session'}}


The next step is to split the data into a proper training set and a test/validation set on which we can monitor our finetuning process.

**Exercise.** Create a new ``split_dataset`` object containing a train and a test subset of the original dataset, by randomly splitting the original dataset with the following proportions: 90% training and 10% test. Print the new ``split_dataset`` object.

**Hint.** ``raw_dataset["train"]`` is a ``Dataset`` object, and dataset objects have a method called ``train_test_split`` which should come in handy.

In [None]:
# TODO: create the split_dataset object
split_dataset = raw_dataset["train"].train_test_split(test_size=0.1)

# TODO: print the split_dataset object
print(split_dataset)

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 1845912
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 205102
    })
})
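
To make the split concrete, here is a minimal pure-Python sketch of what a random 90%/10% split amounts to: shuffle the row indices with a fixed seed, then cut the shuffled list in two. The helper names ``split_sizes`` and ``split_indices`` are hypothetical, and rounding the test size up is an assumption chosen because it reproduces the sizes printed above; this is not the implementation used by ``datasets``.

```python
import math
import random


def split_sizes(n_rows, test_size=0.1):
    """Return (n_train, n_test) for a fractional test size, rounding the test part up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


def split_indices(n_rows, test_size=0.1, seed=0):
    """Shuffle the row indices with a fixed seed and cut them into train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded shuffle: same split on every run
    _, n_test = split_sizes(n_rows, test_size)
    return indices[n_test:], indices[:n_test]


# Sizes for the full en-fr subset (2,051,014 rows)
print(split_sizes(2_051_014))

# A tiny example with 20 rows
train_idx, test_idx = split_indices(20)
print(len(train_idx), len(test_idx))
```

In the notebook itself, passing a ``seed`` argument to ``train_test_split`` should make the split reproducible from one run to the next.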


## Model
As mentioned earlier, we are going to use the *T5 transformer* model provided by Google. Since finetuning takes quite a long time, we will use the *small* version of the *T5 model*, rather than the *base* version like last time.

**Question.** Search online and compare the number of parameters of the *T5 base* and the *T5 small* models.

**Exercise.** Check on the Hugging Face Hub for the name of the *T5 small* model checkpoint and store it in the variable ``model_checkpoint``.

In [None]:
# TODO: create the variable model_checkpoint with the appropriate model checkpoint string
model_checkpoint = "google-t5/t5-small"

## Tokenizer
The next step is to tokenize our inputs. In order to do so, we will instantiate a ``tokenizer`` which Hugging Face will guess by using the ``AutoTokenizer`` class. We only need to tell the ``AutoTokenizer`` what model checkpoint we will be using.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

**Exercise.** Let's find out more about the tokenizer we will be using. Check online for the different attributes of the ``Tokenizer`` class, and write code in order to answer the following questions:
1. What is the name of the tokenizer being used?
2. What is the size of the vocabulary?
3. What is the maximum model input length?
4. What special tokens does the tokenizer use? What are their IDs?

In [None]:
# TODO: print the necessary information about the automatically loaded tokenizer
print(tokenizer.name_or_path)
print(tokenizer.vocab_size)
print(tokenizer.model_max_length)
print(tokenizer.all_special_tokens)
print(tokenizer.all_special_ids)

google-t5/t5-small
32100
512
['</s>', '<unk>', '<pad>', '<extra_id_0>', '<extra_id_1>', '<extra_id_2>', '<extra_id_3>', '<extra_id_4>', '<extra_id_5>', '<extra_id_6>', '<extra_id_7>', '<extra_id_8>', '<extra_id_9>', '<extra_id_10>', '<extra_id_11>', '<extra_id_12>', '<extra_id_13>', '<extra_id_14>', '<extra_id_15>', '<extra_id_16>', '<extra_id_17>', '<extra_id_18>', '<extra_id_19>', '<extra_id_20>', '<extra_id_21>', '<extra_id_22>', '<extra_id_23>', '<extra_id_24>', '<extra_id_25>', '<extra_id_26>', '<extra_id_27>', '<extra_id_28>', '<extra_id_29>', '<extra_id_30>', '<extra_id_31>', '<extra_id_32>', '<extra_id_33>', '<extra_id_34>', '<extra_id_35>', '<extra_id_36>', '<extra_id_37>', '<extra_id_38>', '<extra_id_39>', '<extra_id_40>', '<extra_id_41>', '<extra_id_42>', '<extra_id_43>', '<extra_id_44>', '<extra_id_45>', '<extra_id_46>', '<extra_id_47>', '<extra_id_48>', '<extra_id_49>', '<extra_id_50>', '<extra_id_51>', '<extra_id_52>', '<extra_id_53>', '<extra_id_54>', '<extra_id_55>', '<

We next try out our tokenizer on a few input sentences.

In [None]:
raw_inputs = [
    "Hello, My name is John",
    "I love ice cream",
    "The grey cat slept on the chair.",
    "When is Rodrigo coming home?",
    "The quick brown dog jumps over the lazy dog."
    ]

In [None]:
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
print(inputs["input_ids"].shape)

{'input_ids': tensor([[ 8774,     6,   499,   564,    19,  1079,     1,     0,     0,     0,
             0,     0],
        [   27,   333,     3,   867,  3022,     1,     0,     0,     0,     0,
             0,     0],
        [   37,  7592,  1712,     3, 25726,    30,     8,  3533,     5,     1,
             0,     0],
        [  366,    19,  8222,  3380,    32,  1107,   234,    58,     1,     0,
             0,     0],
        [   37,  1704,  4216,  1782,  4418,     7,   147,     8, 19743,  1782,
             5,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
torch.Size([5, 12])


**Questions.** Think about the following questions, ignoring the ``attention_mask`` part for now (do not hesitate to discuss them with your classmates and with the teacher):
1. What type of structure is ``inputs``?
2. What is the size of the tensor ``inputs["input_ids"]``? What does this size represent?
3. What is the last non-zero integer of each row in ``inputs["input_ids"]``? Why?
4. What does the 0 element represent in the tensor ``inputs["input_ids"]``?

### Preprocessing function
Before tokenizing our inputs, we will preprocess them for easier use with the *T5 model*. In order to do so, we will define a function called ``preprocess_function`` that does the following:
1. Prepend the phrase "Translate from English to French: " to the source English text. Remember that otherwise T5 will try and translate to German!
2. Pass the target (French) text through the ``text_target`` parameter, so that the tokenizer processes it as the target and returns its token IDs as ``labels``.
3. Truncate the sequences to be no longer than the maximum length set by the ``max_length`` parameter. We will set it to 128.
4. Tokenize the inputs by taking into account all of the above.

In [None]:
source_lang = "en"
target_lang = "fr"
prefix = "Translate from English to French: "

def preprocess_function(examples):
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs
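
Because the function will be applied with ``batched=True``, its ``examples`` argument is a dictionary of lists rather than a single row. The sketch below reproduces only the string-building part of the function on a toy batch, without calling the tokenizer. The batch contents and the helper name ``build_inputs_and_targets`` are made up for illustration.

```python
toy_prefix = "Translate from English to French: "

# A toy batch in the same layout as the dataset: one "translation" column
# holding a list of {"en": ..., "fr": ...} dictionaries.
toy_batch = {
    "translation": [
        {"en": "Resumption of the session", "fr": "Reprise de la session"},
        {"en": "Thank you.", "fr": "Merci."},  # made-up pair
    ]
}


def build_inputs_and_targets(examples, source_lang="en", target_lang="fr"):
    """Return the prefixed source sentences and the raw target sentences."""
    inputs = [toy_prefix + pair[source_lang] for pair in examples["translation"]]
    targets = [pair[target_lang] for pair in examples["translation"]]
    return inputs, targets


toy_inputs, toy_targets = build_inputs_and_targets(toy_batch)
print(toy_inputs)
print(toy_targets)
```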

We next use our preprocessing function to tokenize the whole dataset:

In [None]:
tokenized_dataset = split_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/1845912 [00:00<?, ? examples/s]

Map:   0%|          | 0/205102 [00:00<?, ? examples/s]

Lastly, we use the ``DataCollatorForSeq2Seq`` class in order to *dynamically pad* the sentences to the longest length in a batch, instead of padding the whole dataset to the maximum length.

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model_checkpoint, return_tensors="pt")
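
As a rough picture of what dynamic padding means, the following sketch pads a toy batch to the length of its longest sequence only. The token ids are made up; the sketch assumes pad id 0 and end-of-sequence id 1, as in the tokenizer output printed earlier, and pads the labels with -100, the default label padding value of ``DataCollatorForSeq2Seq`` that the loss ignores. The helper ``pad_batch`` is hypothetical and much simpler than the real collator.

```python
def pad_batch(sequences, pad_value):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]


# Made-up token ids: every sequence ends with 1 (end of sequence)
batch_input_ids = [[27, 333, 1], [37, 7592, 1712, 5, 1]]
batch_labels = [[100, 1], [200, 300, 400, 1]]

padded_inputs = pad_batch(batch_input_ids, pad_value=0)
padded_labels = pad_batch(batch_labels, pad_value=-100)  # -100 is ignored by the loss
attention_mask = [[int(token != 0) for token in row] for row in padded_inputs]

print(padded_inputs)
print(attention_mask)
print(padded_labels)
```

The batch is padded to length 5 (inputs) and 4 (labels) instead of the global maximum of 128, which saves computation on short sentences.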

## Metrics
During the finetuning phase, the model parameters will be optimized by gradient descent on the Cross-Entropy Loss, as seen in class. However, we can monitor if our model is indeed learning/overfitting by using more interpretable metrics on the validation dataset. In this case, we will use the SacreBLEU metric. Make sure you have the ``evaluate`` and ``sacrebleu`` libraries installed.

In [None]:
!pip install evaluate
!pip install sacrebleu



In [None]:
import evaluate

metric = evaluate.load("sacrebleu")
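
To build some intuition for the kind of score this metric returns, here is a deliberately simplified, sentence-level BLEU sketch: clipped n-gram precisions combined by a geometric mean, multiplied by a brevity penalty. It is not SacreBLEU's implementation: it assumes whitespace tokenization, a single reference and no smoothing, and the function names ``ngrams`` and ``simple_bleu`` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(prediction, reference, max_n=4):
    """Simplified BLEU in [0, 100] for one prediction and one reference."""
    pred, ref = prediction.split(), reference.split()
    if not pred:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        pred_counts, ref_counts = ngrams(pred, n), ngrams(ref, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference
        matches = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
        total = sum(pred_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: predictions shorter than the reference are penalized
    brevity_penalty = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return 100 * brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "Reprise de la session"
print(simple_bleu("Reprise de la session", reference))
print(simple_bleu("Reprise de la séance", reference, max_n=2))
```

A higher score means that more of the predicted word sequences also appear in the reference translation.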

The following two functions process the model's predictions and the ground-truth labels, and compute the SacreBLEU score associated with them.

In [None]:
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # replace -100 (the label padding value ignored by the loss) with the pad token id so that the labels can be decoded
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [
        np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds
    ]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
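
The two NumPy steps inside ``compute_metrics`` can be tried in isolation on toy arrays. The ids below are made up, and the pad id of 0 is an assumption taken from the tokenizer output printed earlier.

```python
import numpy as np

pad_token_id = 0  # assumption: pad id of the T5 tokenizer used in this notebook

# Toy labels as produced by the data collator: -100 marks padded positions
toy_labels = np.array([[100, 1, -100, -100],
                       [200, 300, 400, 1]])

# -100 is not a valid vocabulary id, so it is swapped for the pad id before decoding
decodable_labels = np.where(toy_labels != -100, toy_labels, pad_token_id)
print(decodable_labels)

# Toy predictions padded with the pad id; the generated length counts non-pad tokens
toy_preds = np.array([[100, 1, 0, 0],
                      [200, 300, 400, 1]])
prediction_lens = [int(np.count_nonzero(pred != pad_token_id)) for pred in toy_preds]
print(prediction_lens, np.mean(prediction_lens))
```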

## Fine-tuning
We now proceed to the actual finetuning phase. In order to fine-tune our model we will need to:
1. Instantiate our model (recall that so far we have just stored the name of the model checkpoint in the ``model_checkpoint`` variable) with the help of the ``AutoModelForSeq2SeqLM`` class.
2. Set all the hyperparameters and other relevant arguments for the training phase with the help of the ``Seq2SeqTrainingArguments`` class.
3. Train the model with the help of the ``Seq2SeqTrainer`` class.

**Exercise.** Instantiate the model by using the ``AutoModelForSeq2SeqLM`` class of the ``transformers`` library and the previously defined ``model_checkpoint``.

In [None]:
from transformers import AutoModelForSeq2SeqLM

# TODO: instantiate the model
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
#hyperparameters = Seq2SeqTrainingArguments(do_train=True, do_eval=True, num_train_epochs=5.0)
#trainer= Seq2SeqTrainer(model, args=hyperparameters, train_dataset=split_dataset["train"], eval_dataset=split_dataset["test"], data_collator=data_collator, compute_metrics=compute_metrics)


Loading weights:   0%|          | 0/131 [00:00<?, ?it/s]

In [None]:
import transformers
print(transformers.__version__)

5.0.0


The model is already pretrained, which means that it is already able to translate from English to French.

**Exercise.** Test the model on the following sentences:


> The farmers took the cows up to the mountains.

> He finally kicked the bucket after years of saying he wasn't ready to.

> When the engineer spoke to the manager about the delay, he admitted it was his fault.

> You might want to reconsider how directly you addressed the committee.

> It's not so much that the plan failed as that it was never really tested.

> By the time she realized what had happened, the opportunity had already slipped away.



In [None]:
# TODO: test the model on the given sentences

sentences=[
    "The farmers took the cows up to the mountains.",
    "He finally kicked the bucket after years of saying he wasn't ready to",
    "When the engineer spoke to the manager about the delay, he admitted it was his fault",
    "You might want to reconsider how directly you addressed the committee",
    "It's not so much that the plan failed as that it was never really tested",
    "By the time she realized what had happened, the opportunity had already slipped away"
]

for sentence in sentences:
    input_text = tokenizer(prefix + sentence, return_tensors="pt")
    output = model.generate(**input_text, max_new_tokens=40)
    print(tokenizer.decode(output[0], skip_special_tokens=True))

Les agriculteurs ont emmené les vaches jusqu'aux montagnes.
Il a finalement chuté le sommet après des années de dissipation qu'il n'était pas prêt à
Lorsque l'ingénieur a parlé au gestionnaire du retard, il a admis que c'était sa faute.
Vous voudriez peut-être réexaminer comment vous avez directement pris la parole au comité.
Ce n'est pas tant que le plan a échoué que qu'il n'a jamais été vraiment testé.
 la fin de la période de transition, elle a eu l’occasion de s’en rendre compte.


**Exercise.** Complete the training arguments with the following hyperparameters:
- A learning rate of 0.00002
- A batch size of 32 for the training phase
- A batch size of 64 for the evaluation phase
- 1 epoch


In [None]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="europarlament_en_fr_translator",
    eval_strategy="epoch",
    # TODO: specify the learning rate
    learning_rate=2e-5,
    # TODO: specify the training batch size
    per_device_train_batch_size=32,
    # TODO: specify the evaluation batch size
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    save_total_limit=3,
    # TODO: specify the number of epochs
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=False,
    report_to="none"
)

The last step is to train the model. Since our training and validation datasets are quite large, the finetuning phase would take very long, even for a single epoch. Therefore, we will truncate the training and validation datasets to speed up the training.

In [None]:
from transformers import Seq2SeqTrainer

n_train = 100_000
n_val = 10_000

train_subset = tokenized_dataset["train"].select(range(n_train))
val_subset = tokenized_dataset["test"].select(range(n_val))

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_subset,
    eval_dataset=val_subset,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

metrics = trainer.evaluate()
print("Evaluation at Epoch 0:", metrics)

trainer.train()

Evaluation at Epoch 0: {'eval_loss': 0.9109315276145935, 'eval_model_preparation_time': 0.0094, 'eval_bleu': 9.5049, 'eval_gen_len': 19.1541, 'eval_runtime': 100.2313, 'eval_samples_per_second': 99.769, 'eval_steps_per_second': 1.566}


| Epoch | Training Loss | Validation Loss | Model Preparation Time | Bleu | Gen Len |
|---|---|---|---|---|---|
| 1 | 0.91774 | 0.800651 | 0.0094 | 9.8976 | 19.1324 |


Writing model shards:   0%|          | 0/1 [00:00<?, ?it/s]

TrainOutput(global_step=3125, training_loss=0.9308896020507812, metrics={'train_runtime': 827.8924, 'train_samples_per_second': 120.789, 'train_steps_per_second': 3.775, 'total_flos': 2378226135465984.0, 'train_loss': 0.9308896020507812, 'epoch': 1.0})
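
The numbers in this ``TrainOutput`` can be checked with a little arithmetic. The sketch below assumes training on a single device without gradient accumulation, so that one optimization step processes exactly one batch of 32 examples; the runtime is copied from the output above.

```python
import math

n_train = 100_000
per_device_train_batch_size = 32
num_train_epochs = 1
train_runtime = 827.8924  # seconds, copied from the TrainOutput above

# One optimization step per batch (assumes one device, no gradient accumulation)
steps_per_epoch = math.ceil(n_train / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

samples_per_second = n_train * num_train_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

Under these assumptions the step count matches the reported ``global_step`` of 3125, and the two throughput values agree with ``train_samples_per_second`` and ``train_steps_per_second``.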

## Inference
We can now use our finetuned model for translating English sentences into French.

**Exercise.** Translate the same sentences as above with the finetuned model.

In [None]:
# TODO: translate the same sentences again

for sentence in sentences:
    # Move the inputs to the device the model is on, instead of hard-coding 'cuda'
    input_text = tokenizer(prefix + sentence, return_tensors="pt").to(model.device)
    output = model.generate(**input_text, max_new_tokens=40)
    print(tokenizer.decode(output[0], skip_special_tokens=True))

Les agriculteurs ont emmené les vaches jusqu'aux montagnes.
Il a finalement chuté le seau après des années de dissipation qu'il n'était pas prêt à
Lorsque l'ingénieur a parlé au gestionnaire du retard, il a admis que c'était sa faute.
Vous voudriez peut-être réexaminer la manière dont vous avez directement pris la parole à la commission.
Ce n'est pas tant que le plan a échoué que qu'il n'a jamais été vraiment mis à l'essai.
 la fin de sa découverte, elle avait déjà eu l'occasion de s'en tirer.


We can change the generation type so as to randomize the model's output, and not always provide the same translation for a given input:

In [None]:
# input_text still holds the last sentence tokenized in the loop above
for _ in range(5):
    output = model.generate(**input_text, max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95)
    print(tokenizer.decode(output[0], skip_special_tokens=True))

Elle avait déjà saisi cette occasion.
Lorsque l’intéressé a pu prendre conscience de ce qui s’était passé, la possibilité a déjà échoué
Après avoir pris conscience des événements qui se sont produits, cette occasion avait déjà perdu.
Lors de la découverte de ce qui s’est passé, l’occasion a déjà perdu l’attention.
Au moment où elle s'est rendu compte du phénomène, la chance avait déjà glissé au-delà de l'horizon.
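
One way to picture what the ``top_k`` and ``top_p`` arguments do is to apply both filters to a toy next-token distribution. The sketch below is a simplification that works directly on probabilities; the library applies its filters to logits and may differ in details. The function ``top_k_top_p_filter`` and the toy distribution are hypothetical.

```python
import numpy as np


def top_k_top_p_filter(probs, top_k, top_p):
    """Zero out unlikely tokens, then renormalize the remaining probabilities."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]  # token ids, most likely first
    kept = order[:top_k]             # top-k: keep the k most likely tokens
    kept_probs = probs[kept] / probs[kept].sum()
    # top-p: keep the shortest prefix whose cumulative probability reaches top_p
    cutoff = int(np.searchsorted(np.cumsum(kept_probs), top_p)) + 1
    kept = kept[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()


toy_probs = [0.5, 0.3, 0.15, 0.05]  # made-up next-token distribution
filtered = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8)
print(filtered)

# Sampling from the filtered distribution can only return the surviving tokens
rng = np.random.default_rng(0)
samples = rng.choice(len(toy_probs), size=5, p=filtered)
print(samples)
```

Smaller values of ``top_k`` or ``top_p`` restrict sampling to fewer candidate tokens, which makes the outputs less varied.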


**Discussion.** Discuss with your classmates and with the teacher.
- What steps of the above notebook are clear?
- What steps of the above notebook are unclear?
- What is the BLEU metric measuring?