# Fine-tuning the multilingual T5 (mT5) model

Before fine-tuning a model, read about it to make sure it fits the task you are going to tackle, in this case translation from English into Spanish.

You can read about the multilingual T5 model in its paper:

[mT5: A massively multilingual pre-trained text-to-text transformer, by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel](https://arxiv.org/abs/2010.11934).

You can also find model-specific information on [the mT5 page of the Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/mt5).

In [1]:
!pip install datasets evaluate transformers[sentencepiece] transformers[torch]


The Europarl-ST dataset is loaded in the same way as before, but this time we keep only the pairs that have Spanish as the target language.

In [2]:
from datasets import load_dataset

raw_datasets = load_dataset("tj-solergibert/Europarl-ST-processed-mt-en")


Generating train split: 602605 examples
Generating test split: 86170 examples
Generating valid split: 81968 examples

In [3]:
lang = "es"
# dest_lang is stored as an integer class label, so look up the id of the language code
lang_id = raw_datasets["train"].features["dest_lang"].names.index(lang)
# Keep only the English-Spanish pairs, then take a small subset of each split to keep the run short
raw_datasets["train"] = raw_datasets["train"].filter(lambda x: x["dest_lang"] == lang_id).select(range(1024))
raw_datasets["valid"] = raw_datasets["valid"].filter(lambda x: x["dest_lang"] == lang_id).select(range(128))
raw_datasets["test"] = raw_datasets["test"].filter(lambda x: x["dest_lang"] == lang_id).select(range(128))


Now we load the pre-trained tokenizer for the mT5 model. Keep in mind that, since "mT5 was pre-trained unsupervisedly, there’s no real advantage to using a task prefix during single-task fine-tuning".

In [4]:
from transformers import AutoConfig, AutoTokenizer

checkpoint = "google/mt5-small"
config = AutoConfig.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)



Consistent with this, no task prefix is added when preparing the data. Source and target texts are each truncated to at most 40 tokens:

In [5]:
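def tokenize_function(sample):
    # Tokenize the English source text, truncating it to at most 40 tokens
    model_inputs = tokenizer(sample["source_text"], max_length=40, truncation=True)
    # Tokenize the Spanish target text in the same way and use its token ids as labels
    model_inputs["labels"] = tokenizer(sample["dest_text"], max_length=40, truncation=True).input_ids
    return model_inputs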

In [6]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets


DatasetDict({
    train: Dataset({
        features: ['source_text', 'dest_text', 'dest_lang', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1024
    })
    test: Dataset({
        features: ['source_text', 'dest_text', 'dest_lang', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 128
    })
    valid: Dataset({
        features: ['source_text', 'dest_text', 'dest_lang', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 128
    })
})

The model and the data collator are created in the same way as before:

In [7]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)


In [8]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
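
`DataCollatorForSeq2Seq` pads each batch dynamically to the length of its longest sequence. The following is a simplified pure-Python sketch of that idea, not the library's implementation. `pad_batch` is a hypothetical helper, the token ids are made up, and the sketch assumes a padding id of 0 for the inputs and -100 for the labels (the value the loss ignores).

```python
def pad_batch(examples, pad_token_id=0, label_pad_token_id=-100):
    """Pad a list of tokenized examples to the longest sequence in the batch."""
    max_input_len = max(len(ex["input_ids"]) for ex in examples)
    max_label_len = max(len(ex["labels"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        n_pad = max_input_len - len(ex["input_ids"])
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * n_pad)
        # The attention mask is 1 for real tokens and 0 for padding
        batch["attention_mask"].append([1] * len(ex["input_ids"]) + [0] * n_pad)
        n_label_pad = max_label_len - len(ex["labels"])
        # Labels are padded with -100 so that padded positions do not contribute to the loss
        batch["labels"].append(ex["labels"] + [label_pad_token_id] * n_label_pad)
    return batch


# Made-up token ids for two examples of different lengths
examples = [
    {"input_ids": [259, 1494, 1], "labels": [3105, 1]},
    {"input_ids": [787, 1], "labels": [259, 8870, 421, 1]},
]
batch = pad_batch(examples)
print(batch)
```

Padding per batch rather than to a fixed global length keeps short batches small, which is why the tokenization step above truncates but does not pad.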

Evaluation is also performed in the same way as before, using the BLEU score computed by sacrebleu:

In [9]:
!pip install sacrebleu
from evaluate import load
import numpy as np

# Load the metric once instead of on every evaluation call
metric = load("sacrebleu")

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    # sacrebleu expects a list of references for each prediction
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    # Average number of non-padding tokens generated per prediction
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
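
Label padding uses -100, which is not a valid token id and therefore cannot be decoded. The replacement step can be sketched with NumPy on a toy batch. The token ids are made up, and the padding id of 0 is an assumption standing in for `tokenizer.pad_token_id`.

```python
import numpy as np

pad_token_id = 0  # assumption: stands in for tokenizer.pad_token_id

# Toy label batch, padded with -100 as the data collator would do
labels = np.array([
    [3105, 1, -100, -100],
    [259, 8870, 421, 1],
])

# Swap every -100 for the padding id so the batch can be decoded
cleaned = np.where(labels != -100, labels, pad_token_id)

# Count the non-padding tokens per row, as done for gen_len
lengths = [int(np.count_nonzero(row != pad_token_id)) for row in cleaned]
print(cleaned.tolist(), lengths)
```

Because decoding uses `skip_special_tokens=True`, the substituted padding tokens disappear from the decoded text.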



We create a `Seq2SeqTrainingArguments` object that holds all the hyperparameters the Trainer will use for training and evaluation:

In [10]:
from transformers import Seq2SeqTrainingArguments

batch_size = 16
model_name = checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-en-to-es",
    evaluation_strategy="epoch",
    learning_rate=1e-4,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
)
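
The arguments above set a learning rate but no schedule. Assuming the `Trainer` defaults apply (linear decay to zero with no warmup), a single device, and no gradient accumulation, the learning rate at each optimizer step can be sketched as follows. `linear_lr` is a hypothetical helper written for illustration, not a library function.

```python
import math

num_examples = 1024  # size of the training subset selected earlier
batch_size = 16
num_epochs = 1
peak_lr = 1e-4

# One optimizer step per batch (assumes one device, no gradient accumulation)
total_steps = math.ceil(num_examples / batch_size) * num_epochs


def linear_lr(step, total_steps, peak_lr, warmup_steps=0):
    """Linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return peak_lr * remaining


schedule = [linear_lr(s, total_steps, peak_lr) for s in range(total_steps + 1)]
print(total_steps, schedule[0], schedule[total_steps // 2], schedule[-1])
```

With a single epoch over 1024 examples the rate decays over very few steps, which is worth keeping in mind when reading the training loss of such a short run.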

In [11]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["valid"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)


To fine-tune the model on our dataset, we just have to call the train() method of our Trainer:

In [12]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Bleu | Gen Len |
|-------|---------------|-----------------|------|---------|
| 1 | No log | 11.540648 | 0.0229 | 4.5547 |





TrainOutput(global_step=64, training_loss=18.007892608642578, metrics={'train_runtime': 35.2971, 'train_samples_per_second': 29.011, 'train_steps_per_second': 1.813, 'total_flos': 42299975270400.0, 'train_loss': 18.007892608642578, 'epoch': 1.0})
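
The figures reported in `TrainOutput` can be cross-checked with a little arithmetic. This sketch assumes a single device and no gradient accumulation, so that each batch of 16 examples corresponds to one optimizer step.

```python
import math

num_examples = 1024      # training subset size
batch_size = 16
num_epochs = 1
train_runtime = 35.2971  # seconds, copied from TrainOutput

# Number of optimizer steps in the whole run
global_step = math.ceil(num_examples / batch_size) * num_epochs

# Throughput figures, rounded to three decimals like the reported ones
samples_per_second = round(num_examples * num_epochs / train_runtime, 3)
steps_per_second = round(global_step / train_runtime, 3)

print(global_step, samples_per_second, steps_per_second)
```

These values agree with `global_step=64`, `train_samples_per_second` of 29.011 and `train_steps_per_second` of 1.813 in the output above.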