# Finetuning
The purpose of this notebook is to learn how to finetune a transformer model for the task of machine translation. This notebook is heavily inspired by the following Hugging Face tutorials:
- [Tutorial on Translation](https://huggingface.co/docs/transformers/tasks/translation)
- [Course on Translation](https://huggingface.co/learn/nlp-course/chapter7/4?fw=pt)

We start by importing the numpy library.

In [None]:
import numpy as np

We are going to use the *T5 transformer* model that we already tested in the previous notebook, and fine-tune it on the ``europarl`` dataset, which contains text from the European Parliament proceedings. You can find more information about this dataset on its [Hugging Face dataset card](https://huggingface.co/datasets/Helsinki-NLP/europarl).

## Data preparation
We will first download the subset of the ``europarl`` dataset that contains English text and its French translations. In order to do so, make sure that you have the Hugging Face library ``datasets`` installed.

In [None]:
from datasets import load_dataset

raw_dataset = load_dataset("Helsinki-NLP/europarl", "en-fr")
raw_dataset

**Questions.** Answer the following questions:
1. What type of object is the ``raw_dataset`` object?
2. How many elements are there in the ``raw_dataset`` object?
3. What type of object is the ``raw_dataset["train"]`` object?
4. Describe the ``raw_dataset["train"]`` object.

**Exercise.** Print one of the elements of ``raw_dataset["train"]``.

In [None]:
# TODO: print one of the elements of raw_dataset["train"]

The next step is to split the data into a proper training set and a test/validation set on which we can monitor our finetuning process.

**Exercise.** Create a new ``split_dataset`` object containing a train and a test subset of the original dataset, by randomly splitting the original dataset with the following proportions: 90% training and 10% test. Print the new ``split_dataset`` object.

**Hint.** ``raw_dataset["train"]`` is a ``Dataset`` object, and dataset objects have a method called ``train_test_split`` which should come in handy.

In [None]:
# TODO: create the split_dataset object

# TODO: print the split_dataset object

## Model
As mentioned earlier, we are going to use the *T5 transformer* model provided by Google. Since finetuning takes quite a long time, we will use the *small* version of the *T5 model*, rather than the *base* version like last time.

**Question.** Search online and compare the number of parameters of the *T5 base* and the *T5 small* models.

**Exercise.** Check on the Hugging Face Hub for the name of the *T5 small* model checkpoint and store it in the variable ``model_checkpoint``.

In [None]:
# TODO: create the variable model_checkpoint with the appropriate model checkpoint string

## Tokenizer
The next step is to tokenize our inputs. In order to do so, we will instantiate a ``tokenizer`` with the ``AutoTokenizer`` class, which picks the appropriate tokenizer for us. We only need to tell the ``AutoTokenizer`` which model checkpoint we will be using.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

**Exercise.** Let's find out more about the tokenizer we will be using. Check online for the different attributes of the ``Tokenizer`` class, and write code in order to answer the following questions:
1. What is the name of the tokenizer being used?
2. What is the size of the vocabulary?
3. What is the maximum model input length?
4. What special tokens does the tokenizer use? What are their IDs?

In [None]:
# TODO: print the necessary information about the automatically loaded tokenizer

We next try out our tokenizer on a few input sentences.

In [None]:
raw_inputs = [
    "My name is John",
    "I love ice cream",
    "The grey cat slept on the chair.",
    "When is Rodrigo coming home?"
]

In [None]:
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

**Questions.** Think about the following questions (do not hesitate to discuss them with your classmates and with the teacher) ignoring the ``attention_mask`` part for now:
1. What type of structure is ``inputs``?
2. What is the size of the tensor ``inputs["input_ids"]``? What does this size represent?
3. What is the last non-zero integer of each row in ``inputs["input_ids"]``? Why?
4. What does the 0 element represent in the tensor ``inputs["input_ids"]``?

### Preprocessing function
Before tokenizing our inputs, we will preprocess them for easier use with the *T5 model*. In order to do so, we will define a preprocessing function called ``preprocess_function`` that does the following:
1. Prepend the phrase "Translate from English to French:" to the source English text. Remember that otherwise T5 will try to translate to German!
2. Set the target (French) in the ``text_target`` parameter to ensure the tokenizer processes the target text correctly. Otherwise the tokenizer assumes the language is English.
3. Truncate the sequences to be no longer than the maximum length set by the ``max_length`` parameter. We will set it to 128.
4. Tokenize the inputs by taking into account all of the above.

In [None]:
source_lang = "en"
target_lang = "fr"
prefix = "Translate from English to French: "

def preprocess_function(examples):
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs
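With ``batched=True``, the function receives a dictionary of lists rather than a single example. The sketch below reproduces only the string-handling part of ``preprocess_function`` on a toy batch, without any tokenizer, so that the shape of ``examples["translation"]`` is easier to see. The two sentence pairs are made up for illustration.

```python
prefix = "Translate from English to French: "

# Toy batch with the layout that map(..., batched=True) passes to the function:
# the "translation" key holds a list of {"en": ..., "fr": ...} dictionaries.
examples = {
    "translation": [
        {"en": "The meeting starts at noon.", "fr": "La réunion commence à midi."},
        {"en": "Thank you.", "fr": "Merci."},
    ]
}

# Same list comprehensions as in preprocess_function, with the languages hard-coded.
inputs = [prefix + example["en"] for example in examples["translation"]]
targets = [example["fr"] for example in examples["translation"]]

for source, target in zip(inputs, targets):
    print(repr(source), "->", repr(target))
```

Every source sentence now starts with the prefix, while the targets are left untouched.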

We next use our preprocessing function to tokenize the whole dataset:

In [None]:
tokenized_dataset = split_dataset.map(preprocess_function, batched=True)

Lastly, we use the ``DataCollatorForSeq2Seq`` class in order to *dynamically pad* the sentences to the longest length in a batch, instead of padding the whole dataset to the maximum length.

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model_checkpoint, return_tensors="pt")
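To make the idea of dynamic padding concrete, here is a simplified pure-Python sketch of what happens to one batch. It is not the ``DataCollatorForSeq2Seq`` implementation: the function ``pad_batch`` and the token IDs are hypothetical, and the sketch assumes right-padding with a pad ID of 0 for the inputs and -100 for the labels (the value that the loss ignores).

```python
def pad_batch(batch, pad_token_id=0, label_pad_token_id=-100):
    """Pad every example to the longest sequence found in this batch only."""
    max_input = max(len(example["input_ids"]) for example in batch)
    max_label = max(len(example["labels"]) for example in batch)
    padded = {"input_ids": [], "attention_mask": [], "labels": []}
    for example in batch:
        n_pad = max_input - len(example["input_ids"])
        padded["input_ids"].append(example["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore.
        padded["attention_mask"].append([1] * len(example["input_ids"]) + [0] * n_pad)
        n_label_pad = max_label - len(example["labels"])
        padded["labels"].append(example["labels"] + [label_pad_token_id] * n_label_pad)
    return padded


# Two tokenized examples of different lengths (made-up token IDs).
toy_batch = [
    {"input_ids": [13, 7, 52, 1], "labels": [88, 1]},
    {"input_ids": [13, 1], "labels": [45, 9, 23, 1]},
]
padded_batch = pad_batch(toy_batch)
for key, rows in padded_batch.items():
    print(key, rows)
```

A batch whose longest sentence has 4 tokens is padded to length 4, not to the global maximum of 128, which saves memory and computation.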

## Metrics
During the finetuning phase, the model parameters will be optimized by gradient descent on the Cross-Entropy Loss, as seen in class. However, we can monitor if our model is indeed learning/overfitting by using more interpretable metrics on the validation dataset. In this case, we will use the SacreBLEU metric. Make sure you have the ``evaluate`` and ``sacrebleu`` libraries installed.

In [None]:
import evaluate

metric = evaluate.load("sacrebleu")
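BLEU is built from clipped n-gram precisions between a prediction and its reference, combined with a brevity penalty. The sketch below computes only the precision ingredient, with plain whitespace tokenization, so it will not reproduce SacreBLEU scores. The helper names ``ngram_counts`` and ``clipped_precision`` and the example sentences are illustrative assumptions.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(prediction, reference, n):
    """Fraction of predicted n-grams that also occur in the reference.

    Each predicted n-gram is credited at most as many times as it
    appears in the reference (this is the "clipping").
    """
    pred_counts = ngram_counts(prediction.split(), n)
    ref_counts = ngram_counts(reference.split(), n)
    overlap = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    total = sum(pred_counts.values())
    return overlap / total if total else 0.0


prediction = "le chat dort sur la chaise"
reference = "le chat gris dort sur la chaise"

unigram_precision = clipped_precision(prediction, reference, 1)
bigram_precision = clipped_precision(prediction, reference, 2)
print(unigram_precision, bigram_precision)
```

Here every predicted word appears in the reference, but the missing word "gris" breaks one of the five predicted bigrams, so the bigram precision is lower than the unigram precision.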

The following two functions process the predictions of the model as well as the ground truth labels, and compute the SacreBLEU score associated with them.

In [None]:
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # replace the -100 padding in the labels by the pad token id so that they can be decoded
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [
        np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds
    ]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
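The two NumPy steps inside ``compute_metrics`` can be tried in isolation. The sketch below uses made-up token IDs and assumes a pad token ID of 0, which is the value the T5 tokenizer uses.

```python
import numpy as np

pad_token_id = 0  # assumption: same value as tokenizer.pad_token_id for T5

# Made-up label IDs; -100 marks the padded positions that the loss ignores.
labels = np.array([
    [71, 12, 1, -100, -100],
    [45, 9, 88, 23, 1],
])

# -100 is not a valid token ID, so it is swapped for the pad ID before decoding.
restored_labels = np.where(labels != -100, labels, pad_token_id)
print(restored_labels)

# Made-up generated sequences, padded with the pad ID.
preds = np.array([
    [0, 71, 12, 1, 0, 0],
    [0, 45, 9, 88, 23, 1],
])

# Length of each prediction, counting only the non-pad tokens.
prediction_lens = [int(np.count_nonzero(pred != pad_token_id)) for pred in preds]
print(prediction_lens, np.mean(prediction_lens))
```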

## Fine-tuning
We now proceed to the actual finetuning phase. In order to fine-tune our model we will need to:
1. Instantiate our model (recall that so far we have just stored the name of the model checkpoint in the ``model_checkpoint`` variable) with the help of the ``AutoModelForSeq2SeqLM`` class.
2. Set all the hyperparameters and other relevant arguments for the training phase with the help of the ``Seq2SeqTrainingArguments`` class.
3. Train the model with the help of the ``Seq2SeqTrainer`` class.

**Exercise.** Instantiate the model by using the ``AutoModelForSeq2SeqLM`` class of the ``transformers`` library and the previously defined ``model_checkpoint``.

In [None]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

# TODO: instantiate the model

The model is already pretrained, which means that it is already able to translate from English to French.

**Exercise.** Test the model on the following sentence:


> The farmers take the cows up to the mountains.


In [None]:
# TODO: test the model on the given sentence

**Exercise.** Complete the training arguments with the following hyperparameters:
- A learning rate of 0.00002
- A batch size of 32 for the training phase
- A batch size of 64 for the evaluation phase
- 1 epoch


In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="europarlament_en_fr_translator",
    eval_strategy="epoch",
    # TODO: specify the learning rate
    # TODO: specify the training batch size
    # TODO: specify the evaluation batch size
    weight_decay=0.01,
    save_total_limit=3,
    # TODO: specify the number of epochs
    predict_with_generate=True,
    fp16=True,
    push_to_hub=False,
    report_to="none"
)

The last step is to train the model. Since our training and validation datasets are quite large, the finetuning phase would take a very long time, even if we train for 1 epoch only. Therefore, we will truncate the training and validation datasets to speed up the training.

In [None]:
n_train = 100_000
n_val = 10_000

train_subset = tokenized_dataset["train"].select(range(n_train))
val_subset = tokenized_dataset["test"].select(range(n_val))

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_subset,
    eval_dataset=val_subset,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

metrics = trainer.evaluate()
print("Evaluation at Epoch 0:", metrics)

trainer.train()
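It can help to estimate how much work one epoch represents. The sketch below is simple arithmetic on the subset sizes and the batch sizes from the exercise above. It assumes training on a single device, no gradient accumulation, and that the last incomplete batch is kept.

```python
import math

n_train = 100_000
n_val = 10_000
train_batch_size = 32  # batch size requested for the training phase
eval_batch_size = 64   # batch size requested for the evaluation phase

# One optimizer step per training batch (single device, no gradient accumulation).
steps_per_epoch = math.ceil(n_train / train_batch_size)

# Number of batches processed by each call to trainer.evaluate().
eval_batches = math.ceil(n_val / eval_batch_size)

print(steps_per_epoch, eval_batches)
```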

## Inference
We can now use our finetuned model for translating English sentences into French.

**Exercise.** Translate the same sentence as above with the finetuned model.

In [None]:
# TODO: translate the same sentence again

We can change the generation type so as to randomize the model's output, and not always provide the same translation for a given input:

In [None]:
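# Assumes `input_text` holds the tokenized inputs from the previous exercise
# (created with return_tensors="pt" and placed on the same device as the model)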

for _ in range(5):
    output = model.generate(**input_text, max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
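To see what ``top_k`` and ``top_p`` do to the next-token distribution, here is a simplified sketch on a made-up vocabulary of five tokens. It is not the ``transformers`` implementation, which operates on logits. The function name ``top_k_top_p_filter`` and the probabilities are illustrative assumptions.

```python
import numpy as np


def top_k_top_p_filter(probs, top_k, top_p):
    """Restrict a distribution with top-k, then top-p, and renormalize."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]  # token ids, most probable first

    # Top-k: keep only the k most probable tokens, then renormalize.
    filtered = np.zeros_like(probs)
    filtered[order[:top_k]] = probs[order[:top_k]]
    filtered /= filtered.sum()

    # Top-p: keep the smallest set of tokens whose cumulative
    # probability reaches top_p, then renormalize again.
    cumulative = np.cumsum(filtered[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    result = np.zeros_like(probs)
    result[order[:cutoff]] = filtered[order[:cutoff]]
    return result / result.sum()


# Made-up next-token probabilities for a vocabulary of five tokens.
next_token_probs = [0.5, 0.3, 0.1, 0.06, 0.04]
filtered_probs = top_k_top_p_filter(next_token_probs, top_k=3, top_p=0.8)
print(filtered_probs)

# Sampling from the filtered distribution: unlikely tokens can no longer be drawn.
rng = np.random.default_rng(0)
samples = [int(rng.choice(len(filtered_probs), p=filtered_probs)) for _ in range(5)]
print(samples)
```

With these values, top-k removes the two least likely tokens and top-p then removes a third one, so sampling only ever picks between the two most likely tokens.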

**Discussion.** Discuss with your classmates and with the teacher.
- What steps of the above notebook are clear?
- What steps of the above notebook are unclear?
- What is the BLEU metric measuring?