# Fine-tuning

Fine-tuning is the transfer-learning step in which the parameters of a model trained on a large dataset are further updated by continuing training on a smaller dataset (see [Kevin Murphy's book](https://probml.github.io/pml-book/book1.html) Section 19.2 for further details). The main motivation is to adapt a pre-trained model to a specific task, obtaining better performance than would be achieved by training only on the small task-specific dataset.

In [1]:
!pip install datasets evaluate transformers[sentencepiece] transformers[torch]



In this notebook, we fine-tune on a dataset that is already available in the [Datasets repository](https://huggingface.co/datasets) from Hugging Face. More generally, the [Datasets library](https://huggingface.co/docs/datasets) makes it easy to access and load datasets. For example, you can load your own dataset by following [this tutorial](https://huggingface.co/docs/datasets/loading#local-and-remote-files).

More precisely, we are going to explain how to fine-tune the [T5 model](https://huggingface.co/docs/transformers/model_doc/t5) on the [Europarl-ST dataset](https://huggingface.co/datasets/tj-solergibert/Europarl-ST), using only the [subset of Europarl-ST that contains the text data for machine translation (MT) from English](https://huggingface.co/datasets/tj-solergibert/Europarl-ST-processed-mt-en).

In [2]:
from datasets import load_dataset

raw_datasets = load_dataset("tj-solergibert/Europarl-ST-processed-mt-en")
print(raw_datasets)



DatasetDict({
    train: Dataset({
        features: ['source_text', 'dest_text', 'dest_lang'],
        num_rows: 602605
    })
    test: Dataset({
        features: ['source_text', 'dest_text', 'dest_lang'],
        num_rows: 86170
    })
    valid: Dataset({
        features: ['source_text', 'dest_text', 'dest_lang'],
        num_rows: 81968
    })
})


As shown, Europarl-ST already comes with a pre-defined partition into the three conventional sets: training, validation and test. Each set has three columns: the source sentences (source_text), the target sentences (dest_text) and the target language (dest_lang).

Let's take a closer look at the features of the training set:

In [3]:
raw_datasets["train"].features

{'source_text': Value(dtype='string', id=None),
 'dest_text': Value(dtype='string', id=None),
 'dest_lang': ClassLabel(names=['de', 'en', 'es', 'fr', 'it', 'nl', 'pl', 'pt', 'ro'], id=None)}

As you can see, the possible target languages are German, English, Spanish, French, Italian, Dutch, Polish, Portuguese and Romanian.

Let us take a look at the translations of the first two English sentences:

In [4]:
raw_datasets["train"][:14]["source_text"]

['Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.',
 'During this period, two problems have essentially arisen: the definition of the scope of the exemption and the impossibility of recovering VAT incurred in ord

In [5]:
raw_datasets["train"][:14]["dest_text"]

['Seit 1977 wurden die meisten Finanzdienstleistungen, einschließlich Versicherungen und Verwaltung von Investmentfunds, von der Mehrwertsteuer ausgenommen.',
 'La mayoría de los servicios financieros, incluidos los seguros y la gestión de fondos de inversión, están exentos de IVA desde 1977.',
 'Depuis 1977, la plupart des services financiers, dont les assurances et la gestion des fonds de placement, ne sont pas tenus d ’ appliquer une TVA.',
 'Dal 1997 la maggior parte dei servizi finanziari, compresi i servizi assicurativi e la gestione di fondi di investimento, sono esenti da IVA.',
 'Sinds 1977 zijn de meeste financiële diensten, met inbegrip van verzekeringen en het beheer van beleggingsfondsen, vrijgesteld van btw.',
 'większość usług finansowych, w tym usług w zakresie ubezpieczeń i zarządzania funduszami inwestycyjnymi, była zwolniona z opodatkowania podatkiem VAT.',
 'Desde 1977 que a maioria dos serviços financeiros, incluindo os seguros e a gestão de fundos de investimento,

In [6]:
raw_datasets["train"][:14]["dest_lang"]

[0, 2, 3, 4, 5, 6, 7, 0, 2, 3, 4, 5, 6, 7]

As shown, each English sentence is repeated for each of the seven target languages (0: 'de', 2: 'es', 3: 'fr', 4: 'it', 5: 'nl', 6: 'pl', 7: 'pt').
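
The integer codes in dest_lang index into the names list of the ClassLabel feature printed earlier. A minimal sketch of that mapping in plain Python, with the names and ids copied from the outputs above (the helper `decode_lang_ids` is a hypothetical name used only for illustration):

```python
# Language codes copied from the ClassLabel feature printed above.
LANG_NAMES = ['de', 'en', 'es', 'fr', 'it', 'nl', 'pl', 'pt', 'ro']


def decode_lang_ids(ids, names=LANG_NAMES):
    """Map integer class ids to language codes (hypothetical helper)."""
    return [names[i] for i in ids]


# Ids copied from the dest_lang output above.
dest_lang_ids = [0, 2, 3, 4, 5, 6, 7, 0, 2, 3, 4, 5, 6, 7]
decoded = decode_lang_ids(dest_lang_ids)
print(decoded[:7])
```

In the Datasets library itself, the ClassLabel feature provides int2str() and str2int() for the same conversion.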

Since the T5 model was pretrained on several tasks, one of them being translation from English into German, we filter Europarl-ST to keep only the English-German pairs, using a simple [lambda function](https://realpython.com/python-lambda/) with the [Dataset.filter() function](https://huggingface.co/docs/datasets/v2.9.0/en/package_reference/main_classes#datasets.Dataset.filter), and then take a small sample with the [Dataset.select() function](https://huggingface.co/docs/datasets/v2.9.0/en/package_reference/main_classes#datasets.Dataset.select). We take a small sample because of time and computational constraints.

In [7]:
lang="de"
lang_id = raw_datasets["train"].features["dest_lang"].names.index(lang)
raw_datasets["train"] = raw_datasets["train"].filter(lambda x: x["dest_lang"] == lang_id).select(range(1024))
raw_datasets["valid"] = raw_datasets["valid"].filter(lambda x: x["dest_lang"] == lang_id).select(range(128))
raw_datasets["test"] = raw_datasets["test"].filter(lambda x: x["dest_lang"] == lang_id).select(range(128))
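
To make the filter-then-select pattern concrete, here is a rough plain-Python equivalent on a handful of made-up rows. The helpers `filter_rows` and `select_rows` are hypothetical stand-ins for Dataset.filter() and Dataset.select(), and the toy sentences are not taken from Europarl-ST:

```python
# Toy rows standing in for the real dataset (values invented for illustration).
rows = [
    {"source_text": "Hello.", "dest_text": "Hallo.", "dest_lang": 0},
    {"source_text": "Hello.", "dest_text": "Hola.", "dest_lang": 2},
    {"source_text": "Thanks.", "dest_text": "Danke.", "dest_lang": 0},
    {"source_text": "Thanks.", "dest_text": "Merci.", "dest_lang": 3},
    {"source_text": "Goodbye.", "dest_text": "Auf Wiedersehen.", "dest_lang": 0},
]


def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def select_rows(rows, indices):
    """Pick rows by position, in the order given by indices."""
    return [rows[i] for i in indices]


lang_id = 0  # 'de' in the ClassLabel names
german_only = filter_rows(rows, lambda x: x["dest_lang"] == lang_id)
sample = select_rows(german_only, range(2))
print([row["dest_text"] for row in sample])
```

Note that selecting range(n) keeps the first n matching rows in dataset order, so the resulting subsample is not random.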

A good practice is to take a small random sample to get a quick feel for the type of data you’re working with. In Datasets, we can create a random sample by using the [Dataset.shuffle() function](https://huggingface.co/docs/datasets/v2.9.0/en/package_reference/main_classes#datasets.Dataset.shuffle):

In [8]:
raw_sample = raw_datasets["train"].shuffle(seed=13).select(range(5))
raw_sample[:5]

{'source_text': ['I agree that there needs to be monitoring of all the aspects, both from a legal point of view and from the point of view of aid.',
  'My third and last point is that right now budgetary policy has been shown to be playing the central, leading role far more than monetary policy is.',
  'Therefore, before it is too late, I want to say to the House that we urgently need armed military escorts on these boats.',
  'He talked about cutting compulsory declaration deadlines, enhancing cooperation among tax authorities, establishing shared liability when the purchaser of the',
  'However, once again, as in the recent floods in Romania, we find that the requirements of the regulation are so restrictive that in actual fact they prevent this disaster being considered severe, Commissioner.'],
 'dest_text': ['Ich stimme damit überein, dass eine Überwachung aller Aspekte, was sowohl die rein rechtliche als auch die Seite der Subventionen angeht, notwendig ist.',
  'Bei meinem dritte

Now we load the pre-trained tokenizer for the T5 model and apply it to a sample English-German pair:

In [9]:
from transformers import AutoTokenizer

checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

You can take a look at what the tokenizer function returns and how it can be converted back:

In [10]:
input_ids = tokenizer(raw_datasets["train"][0]["source_text"])
print(input_ids)
labels = tokenizer(raw_datasets["train"][0]["dest_text"]).input_ids
print(labels)

{'input_ids': [1541, 16433, 6, 167, 981, 364, 6, 379, 958, 11, 1729, 3069, 758, 6, 43, 118, 1215, 9045, 45, 19569, 5, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
[8709, 16433, 3163, 67, 7957, 15156, 7691, 24150, 6, 236, 13259, 3, 24591, 35, 64, 18622, 193, 9682, 7610, 7, 6, 193, 74, 29914, 11199, 403, 7026, 5, 1]


In [11]:
print(tokenizer.convert_ids_to_tokens(labels))

['▁Seit', '▁1977', '▁wurden', '▁die', '▁meisten', '▁Finanz', 'dienst', 'leistungen', ',', '▁ein', 'schließlich', '▁', 'Versicherung', 'en', '▁und', '▁Verwaltung', '▁von', '▁Investment', 'fund', 's', ',', '▁von', '▁der', '▁Mehrwert', 'steuer', '▁aus', 'genommen', '.', '</s>']


In [12]:
tokenizer.decode(labels)


'Seit 1977 wurden die meisten Finanzdienstleistungen, einschließlich Versicherungen und Verwaltung von Investmentfunds, von der Mehrwertsteuer ausgenommen.</s>'
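
The relation between the token list and the decoded string can be reproduced by hand: the '▁' symbol marks the start of a new word, so joining the pieces and turning '▁' into a space recovers the sentence. This is a simplified sketch of what decoding does for this particular example, not the tokenizer's actual implementation; `join_pieces` is a hypothetical helper:

```python
# Tokens copied from the convert_ids_to_tokens output above.
tokens = ['▁Seit', '▁1977', '▁wurden', '▁die', '▁meisten', '▁Finanz', 'dienst',
          'leistungen', ',', '▁ein', 'schließlich', '▁', 'Versicherung', 'en',
          '▁und', '▁Verwaltung', '▁von', '▁Investment', 'fund', 's', ',', '▁von',
          '▁der', '▁Mehrwert', 'steuer', '▁aus', 'genommen', '.', '</s>']


def join_pieces(pieces):
    """Concatenate subword pieces and turn the word-boundary marker into spaces."""
    return "".join(pieces).replace("▁", " ").strip()


text = join_pieces(tokens)
print(text)
```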

We can apply the tokenizer function to any dataset, taking advantage of the fact that Hugging Face Datasets are [Apache Arrow](https://arrow.apache.org) files stored on disk, so only the samples you ask for are loaded in memory.

To keep the data as a dataset, we will use the [Dataset.map() function](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.Dataset.map). This also allows us some extra flexibility, if we need more preprocessing done than just tokenization. The map() method works by applying a function on each element of the dataset.

In our case, each sample pair is preprocessed according to the training needs of the model to be fine-tuned. The T5 model requires the task prompt, "translate English to German: ", to be explicitly prepended to each source sentence. In addition, the source and target sentences are truncated to 40 tokens to reduce memory consumption:

In [13]:
max_input_length = 40
max_dest_length = 40

def tokenize_function(sample):
    task_prefix = "translate English to German: "
    inputs = [task_prefix + s for s in sample["source_text"]]
    model_inputs = tokenizer(inputs,max_length=max_input_length,truncation=True)
    model_inputs['labels'] = tokenizer(text_target = sample["dest_text"],max_length=max_dest_length,truncation=True).input_ids
    return model_inputs
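
The effect of the prefix and of truncation can be illustrated with a toy whitespace tokenizer. This is only a sketch under simplifying assumptions: T5 really uses SentencePiece subwords, and `toy_encode` is a hypothetical helper. What it mirrors is that the end-of-sequence token counts towards the maximum length and that the prefix consumes part of the budget:

```python
def toy_encode(text, max_length, eos="</s>"):
    """Whitespace 'tokenizer' for illustration only (T5 uses SentencePiece)."""
    tokens = text.split()
    # Reserve one position for the end-of-sequence marker, then cut the rest.
    return tokens[: max_length - 1] + [eos]


task_prefix = "translate English to German: "
source = "Since 1977, most financial services have been exempt from VAT."
encoded = toy_encode(task_prefix + source, max_length=8)
print(encoded)
```

With the real tokenizer the prefix takes five of the 40 input positions, as the ids 13959, 1566, 12, 2968, 10 at the start of every encoded input in this notebook show.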

In [14]:
tokenized_sample = tokenize_function(raw_datasets["train"][:5])
print(tokenized_sample)

{'input_ids': [[13959, 1566, 12, 2968, 10, 1541, 16433, 6, 167, 981, 364, 6, 379, 958, 11, 1729, 3069, 758, 6, 43, 118, 1215, 9045, 45, 19569, 5, 1], [13959, 1566, 12, 2968, 10, 3, 2092, 48, 1059, 6, 192, 982, 43, 3, 8317, 7931, 29, 10, 8, 4903, 13, 8, 7401, 13, 8, 20554, 11, 8, 256, 2748, 7, 11102, 13, 21827, 19569, 3, 20890, 16, 455, 1], [13959, 1566, 12, 2968, 10, 3699, 2121, 6, 1611, 981, 5660, 11, 512, 16690, 6, 84, 43, 4161, 8, 5102, 11, 25367, 13, 2673, 12, 370, 175, 364, 6, 43, 856, 974, 12, 8, 3, 30556, 5, 1], [13959, 1566, 12, 2968, 10, 100, 934, 19, 8, 166, 3332, 12, 2270, 3, 9, 21130, 84, 6, 16, 811, 12, 271, 15266, 3, 104, 1374, 12, 8, 7897, 13, 8, 1611, 2243, 13, 6923, 3, 104, 19, 29451, 1], [13959, 1566, 12, 2968, 10, 27, 241, 12, 3, 24615, 15, 8, 5346, 1238, 6, 1363, 6887, 2138, 6, 30, 112, 1287, 161, 16, 5874, 48, 934, 30, 46, 962, 84, 19, 78, 24367, 6280, 11, 18155, 1561, 5, 1]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

The way the Datasets library applies this processing is by adding new fields to the datasets, one for each key in the dictionary returned by the tokenize function, that is, *input_ids*, *attention_mask* and *labels*:

In [15]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['source_text', 'dest_text', 'dest_lang', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1024
    })
    test: Dataset({
        features: ['source_text', 'dest_text', 'dest_lang', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 128
    })
    valid: Dataset({
        features: ['source_text', 'dest_text', 'dest_lang', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 128
    })
})
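
Conceptually, a batched map slices the columns into batches, calls the function on each slice, and appends the returned columns to the table. The following plain-Python sketch imitates that behaviour on a tiny column-oriented table; `map_batched` and `count_chars` are hypothetical names, and the real library additionally handles caching, Arrow storage and more:

```python
def map_batched(columns, fn, batch_size=2):
    """Apply fn to slices of a column-oriented table and add the returned columns."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict of lists, like the `sample` argument of tokenize_function.
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Existing columns are kept; columns returned by fn are appended.
    return {**columns, **new_columns}


def count_chars(batch):
    """Toy processing function: one new value per row in the batch."""
    return {"n_chars": [len(s) for s in batch["source_text"]]}


table = {"source_text": ["a", "bb", "ccc"], "dest_text": ["x", "yy", "zzz"]}
mapped = map_batched(table, count_chars)
print(mapped)
```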

The function that is responsible for putting together samples inside a batch is called a collate function. It is an argument you can pass when you build a DataLoader, the default being a function that will just convert your samples to PyTorch tensors and concatenate them. This is not possible in our case since the inputs we have are not all of the same size. We have deliberately postponed the padding, to only apply it as necessary on each batch and avoid having over-long inputs with a lot of padding.

To do this in practice, we have to define a collate function that applies the correct amount of padding to the items of the dataset we want to batch together. Fortunately, the Transformers library provides such a function via DataCollatorForSeq2Seq. It takes a tokenizer when you instantiate it (to know which padding token to use, and whether the model expects padding on the left or on the right of the inputs). It can also take the model, so we instantiate the model first and pass it to the collator:

In [16]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [17]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

To see how this function works now that our dataset is encoded, let us keep only the keys needed to create the batch and discard the textual features. For demo purposes only, each sample is prepared as a dictionary with *input_ids*, *attention_mask* and *labels*:

In [18]:
samples = [tokenized_datasets["train"][i] for i in range(2)]
print(samples)
for i in range(len(samples)):
    samples[i] = {key: value for key, value in samples[i].items() if key not in ["source_text", "dest_text", "dest_lang"]}
print(samples)

[{'source_text': 'Since 1977, most financial services, including insurance and investment fund management, have been exempt from VAT.', 'dest_text': 'Seit 1977 wurden die meisten Finanzdienstleistungen, einschließlich Versicherungen und Verwaltung von Investmentfunds, von der Mehrwertsteuer ausgenommen.', 'dest_lang': 0, 'input_ids': [13959, 1566, 12, 2968, 10, 1541, 16433, 6, 167, 981, 364, 6, 379, 958, 11, 1729, 3069, 758, 6, 43, 118, 1215, 9045, 45, 19569, 5, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [8709, 16433, 3163, 67, 7957, 15156, 7691, 24150, 6, 236, 13259, 3, 24591, 35, 64, 18622, 193, 9682, 7610, 7, 6, 193, 74, 29914, 11199, 403, 7026, 5, 1]}, {'source_text': 'During this period, two problems have essentially arisen: the definition of the scope of the exemption and the impossibility of recovering VAT incurred in order to provide exempt services, giving rise to the phenomenon of hidden VAT.', 'dest_text

In [19]:
batch = data_collator(samples)
print(batch)

{'input_ids': tensor([[13959,  1566,    12,  2968,    10,  1541, 16433,     6,   167,   981,
           364,     6,   379,   958,    11,  1729,  3069,   758,     6,    43,
           118,  1215,  9045,    45, 19569,     5,     1,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
        [13959,  1566,    12,  2968,    10,     3,  2092,    48,  1059,     6,
           192,   982,    43,     3,  8317,  7931,    29,    10,     8,  4903,
            13,     8,  7401,    13,     8, 20554,    11,     8,   256,  2748,
             7, 11102,    13, 21827, 19569,     3, 20890,    16,   455,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[ 8709, 16433,  3163,    67,  7957
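
The padding visible in the batch above can be sketched in plain Python: every field is padded on the right up to the longest sequence in the batch, with 0 for input_ids and attention_mask and -100 for labels. This is a simplified illustration, not the library's implementation; `pad_batch` is a hypothetical helper, and the real collator also returns tensors and, when given the model, prepares the decoder inputs:

```python
def pad_batch(samples, pad_token_id=0, label_pad_token_id=-100):
    """Right-pad each field to the longest sequence in the batch."""
    max_in = max(len(s["input_ids"]) for s in samples)
    max_lab = max(len(s["labels"]) for s in samples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for s in samples:
        n_in = max_in - len(s["input_ids"])
        n_lab = max_lab - len(s["labels"])
        batch["input_ids"].append(s["input_ids"] + [pad_token_id] * n_in)
        # Padded positions get mask 0 so that attention ignores them.
        batch["attention_mask"].append(s["attention_mask"] + [0] * n_in)
        # Padded label positions get -100 so that the loss ignores them.
        batch["labels"].append(s["labels"] + [label_pad_token_id] * n_lab)
    return batch


toy_samples = [
    {"input_ids": [13959, 1566, 1], "attention_mask": [1, 1, 1], "labels": [8709, 1]},
    {"input_ids": [13959, 1566, 12, 2968, 1], "attention_mask": [1, 1, 1, 1, 1],
     "labels": [8709, 16433, 3163, 1]},
]
toy_batch = pad_batch(toy_samples)
print(toy_batch)
```

Padding per batch rather than to a global maximum keeps short batches short, which is the reason padding was postponed until this point.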

## Evaluation

The last thing to define for our Seq2SeqTrainer is how to compute the metrics to evaluate the predictions of our model with respect to references. To this purpose, we use the [Evaluate library](https://huggingface.co/docs/evaluate) which includes the definition of generic and task-specific metrics. In our case, we use the [BLEU metric](https://huggingface.co/spaces/evaluate-metric/bleu), or to be more precise, [sacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu). You can see a simple example of usage below:


In [20]:
!pip install sacrebleu





In [21]:
from evaluate import load

metric = load("sacrebleu")
dest_preds = ["Esto es un ejemplo de cálculo de BLEU.", "Esto es otro."]
dest_labels = [["Este es un ejemplo de cálculo de BLEU."], ["Este es otro."]]
metric.compute(predictions=dest_preds, references=dest_labels)

{'score': 78.75110621102682,
 'counts': [11, 9, 7, 5],
 'totals': [13, 11, 9, 7],
 'precisions': [84.61538461538461,
  81.81818181818181,
  77.77777777777777,
  71.42857142857143],
 'bp': 1.0,
 'sys_len': 13,
 'ref_len': 13}
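
To see where these numbers come from, the score can be recomputed by hand: count the clipped n-gram matches for n = 1 to 4, take the geometric mean of the four precisions, and multiply by the brevity penalty. The sketch below is a simplified re-implementation under stated assumptions: one reference per sentence, a regular-expression tokenizer that splits off punctuation (an approximation of sacreBLEU's default tokenization), and no smoothing. The function names are hypothetical:

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplifying assumption: words and punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(predictions, references, max_n=4):
    """Corpus-level BLEU with one reference per sentence and no smoothing."""
    counts, totals = [0] * max_n, [0] * max_n
    sys_len = ref_len = 0
    for pred, ref in zip(predictions, references):
        p, r = tokenize(pred), tokenize(ref)
        sys_len += len(p)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            p_ngrams, r_ngrams = ngrams(p, n), ngrams(r, n)
            # Counter intersection clips each match at the reference count.
            counts[n - 1] += sum((p_ngrams & r_ngrams).values())
            totals[n - 1] += sum(p_ngrams.values())
    # Brevity penalty: only penalizes outputs shorter than the reference.
    bp = 1.0 if sys_len >= ref_len else math.exp(1 - ref_len / sys_len)
    # Assumes every n-gram order has at least one match (no smoothing).
    log_precision = sum(math.log(c / t) for c, t in zip(counts, totals)) / max_n
    return 100 * bp * math.exp(log_precision), counts, totals


preds = ["Esto es un ejemplo de cálculo de BLEU.", "Esto es otro."]
refs = ["Este es un ejemplo de cálculo de BLEU.", "Este es otro."]
score, counts, totals = corpus_bleu(preds, refs)
print(score, counts, totals)
```

The counts and totals agree with the output above, and the score reduces to 100 · (5/13)^(1/4) because the intermediate numerators and denominators cancel in the product of precisions.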

We need to define a function compute_metrics to compute BLEU scores at each epoch. The example below performs a basic post-processing to decode the predictions into texts:

In [22]:
import numpy as np
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    #labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    for i in range(len(labels)):
        labels[i] = [tokenizer.pad_token_id if j==-100 else j for j in labels[i]]
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
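
The label clean-up step deserves a closer look. Padded label positions carry the value -100, which the loss ignores but which is not a valid token id, so it is swapped for the pad id before decoding (the pad token is then dropped by skip_special_tokens=True). A minimal sketch of that replacement, written so that it also works on ragged Python lists; unlike the loop in compute_metrics, this hypothetical helper returns a copy instead of modifying its argument:

```python
def replace_ignore_index(labels, pad_token_id=0, ignore_index=-100):
    """Return a copy of labels with the ignore index swapped for the pad id."""
    return [[pad_token_id if t == ignore_index else t for t in row]
            for row in labels]


# Toy label ids; the first row is padded with -100 as a collator would do.
toy_labels = [[8709, 16433, 1, -100, -100], [3163, 67, 7957, 5, 1]]
cleaned = replace_ignore_index(toy_labels)
print(cleaned)
```

The vectorized np.where variant left as a comment in compute_metrics does the same thing when the labels form a rectangular array, which is the case during training but not for the unpadded lists of the test set.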

We are going to evaluate the pretrained model, preparing the test set to be translated using the [generate function](https://huggingface.co/docs/transformers/main_classes/text_generation):

In [23]:
task_prefix = "translate English to German: "
inputs = tokenizer([task_prefix + sentence for sentence in tokenized_datasets["test"]["source_text"]], max_length=max_input_length, truncation=True, return_tensors="pt", padding=True)
output_sequences = model.generate(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
result = compute_metrics((output_sequences,tokenized_datasets["test"]["labels"]))
print(f'BLEU score: {result["bleu"]}')

BLEU score: 19.2762


## Training

The first step before we can define our [Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer#trainer) is to instantiate the [Seq2SeqTrainingArguments class](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Seq2SeqTrainingArguments), which contains all the hyperparameters the Trainer will use for training and evaluation. The only compulsory argument is the directory where the trained model, as well as the checkpoints along the way, will be saved. The remaining arguments can be set following the recommendations of the model developers:

In [24]:
from transformers import Seq2SeqTrainingArguments

batch_size = 16
model_name = checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-en-to-de",
    evaluation_strategy = "epoch",
    learning_rate=1e-4,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=5,
    predict_with_generate=True,
)



Once we have our model, we can define a Trainer by passing it all the objects constructed up to now: the model, the training arguments (args), the training and validation datasets, the tokenizer, the data collator and the compute_metrics function:

In [25]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["valid"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)




To fine-tune the model on our dataset, we just have to call the [train() function](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Trainer.train) of our Trainer:

In [26]:
trainer.train()

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mjorcisai[0m ([33mICLR-MLLP[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin




| Epoch | Training Loss | Validation Loss | Bleu | Gen Len |
|---|---|---|---|---|
| 1 | No log | 1.120504 | 18.338 | 19.6016 |
| 2 | No log | 1.101044 | 18.514 | 19.625 |
| 3 | No log | 1.095398 | 18.6702 | 19.625 |
| 4 | No log | 1.091563 | 18.5541 | 19.625 |
| 5 | No log | 1.090608 | 18.5806 | 19.6328 |


TrainOutput(global_step=320, training_loss=1.2117032051086425, metrics={'train_runtime': 30.4891, 'train_samples_per_second': 167.929, 'train_steps_per_second': 10.496, 'total_flos': 54136720588800.0, 'train_loss': 1.2117032051086425, 'epoch': 5.0})
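
The step count and throughput figures in this output follow from the sample size and the training arguments. A quick check, assuming a single device and no gradient accumulation (neither is stated explicitly in the notebook):

```python
import math

train_rows = 1024        # size of the training sample selected earlier
batch_size = 16          # per_device_train_batch_size
epochs = 5               # num_train_epochs
train_runtime = 30.4891  # seconds, copied from the TrainOutput above

steps_per_epoch = math.ceil(train_rows / batch_size)
global_step = steps_per_epoch * epochs
steps_per_second = global_step / train_runtime
samples_per_second = train_rows * epochs / train_runtime

print(global_step, round(steps_per_second, 3), round(samples_per_second, 3))
```

With only 320 optimization steps in total, a logging interval larger than that would explain the "No log" entries in the Training Loss column; the notebook does not set the interval, so this is an inference from the library defaults.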

## Inference

At inference time, it is recommended to use [generate()](https://huggingface.co/docs/transformers/v4.26.1/en/main_classes/text_generation#transformers.GenerationMixin.generate). This method takes care of encoding the input and feeding the encoded hidden states via cross-attention layers to the decoder and auto-regressively generates the decoder output. Check out [this blog post](https://huggingface.co/blog/how-to-generate) to know all the details about generating text with Transformers. There’s also [this blog post](https://huggingface.co/blog/encoder-decoder#encoder-decoder) which explains how generation works in general in encoder-decoder models.

In [27]:
task_prefix = "translate English to German: "
inputs = tokenizer([task_prefix + sentence for sentence in tokenized_datasets["test"]["source_text"]], max_length=max_input_length, truncation=True, return_tensors="pt", padding=True)
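# Move the inputs to the device that holds the fine-tuned model (the GPU after training, if one was used).
output_sequences = model.generate(input_ids=inputs["input_ids"].to(model.device), attention_mask=inputs["attention_mask"].to(model.device))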
result = compute_metrics((output_sequences.cpu(),tokenized_datasets["test"]["labels"]))
print(f'BLEU score: {result["bleu"]}')

BLEU score: 19.4068
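
The auto-regressive loop behind generation can be illustrated with a toy example. The sketch below implements greedy search, one of several decoding strategies, over a hand-written score table. Everything here is hypothetical: `greedy_decode`, `toy_scores` and the table are illustrative stand-ins, whereas a real decoder scores the whole vocabulary at each step while attending to the encoder states and to the full prefix:

```python
def greedy_decode(next_token_scores, start_token, eos_token, max_new_tokens=20):
    """Append the highest-scoring next token until EOS or the length cap."""
    output = [start_token]
    for _ in range(max_new_tokens):
        scores = next_token_scores(output)
        best = max(scores, key=scores.get)
        output.append(best)
        if best == eos_token:
            break
    # Drop the start token, which is not part of the generated text.
    return output[1:]


# Toy "decoder": scores depend only on the previous token.
TOY_TABLE = {
    "<pad>": {"Guten": 0.9, "Hallo": 0.1},
    "Guten": {"Morgen": 0.7, "Tag": 0.3},
    "Morgen": {"</s>": 0.8, ".": 0.2},
}


def toy_scores(prefix):
    """Look up the scores for the next token given the last generated token."""
    return TOY_TABLE[prefix[-1]]


generated = greedy_decode(toy_scores, start_token="<pad>", eos_token="</s>")
truncated = greedy_decode(toy_scores, "<pad>", "</s>", max_new_tokens=1)
print(generated, truncated)
```

The length cap matters in practice: the generation lengths close to 20 reported during training are consistent with such a cap being reached, although the notebook does not state which limit applies.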


**Exercise**: Finetune the pretrained [multilingual T5 (mT5) model](https://huggingface.co/docs/transformers/model_doc/mt5) for the translation from English into Spanish using the Europarl-ST dataset. [Solution](https://github.com/jorcisai/ARF/blob/master/HuggingFace/06-Finetuning-mT5.ipynb)