# Fine-tuning the multilingual T5 (mT5) model

Before fine-tuning a model, read up on it to make sure it suits the task you are going to tackle, in this case translation from English into Spanish.

You can read about the multilingual T5 model in its paper:

[mT5: A massively multilingual pre-trained text-to-text transformer, by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel.](https://arxiv.org/abs/2010.11934)

You can also find additional information about [the mT5 model in the Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/mt5).

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

The Europarl-ST dataset is loaded in the same way as before, but this time we keep only the examples whose target language is Spanish (and, to keep the run short, only a small subset of them):

In [2]:
from datasets import load_dataset

raw_datasets = load_dataset("tj-solergibert/Europarl-ST-processed-mt-en")

Downloading and preparing dataset None/None to /root/.cache/huggingface/datasets/tj-solergibert___parquet/tj-solergibert--Europarl-ST-processed-mt-en-b0ca63c3d52a064d/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...



Generating valid split:   0%|          | 0/81968 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/602605 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/86170 [00:00<?, ? examples/s]

Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/tj-solergibert___parquet/tj-solergibert--Europarl-ST-processed-mt-en-b0ca63c3d52a064d/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.



In [3]:
lang="es"
lang_id = raw_datasets["train"].features["dest_lang"].names.index(lang)
raw_datasets["train"] = raw_datasets["train"].filter(lambda x: x["dest_lang"] == lang_id).select(range(1024))
raw_datasets["valid"] = raw_datasets["valid"].filter(lambda x: x["dest_lang"] == lang_id).select(range(128))
raw_datasets["test"] = raw_datasets["test"].filter(lambda x: x["dest_lang"] == lang_id).select(range(128))

Filter:   0%|          | 0/602605 [00:00<?, ? examples/s]

Filter:   0%|          | 0/81968 [00:00<?, ? examples/s]

Filter:   0%|          | 0/86170 [00:00<?, ? examples/s]
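
The filtering step above relies on `dest_lang` being stored as an integer class label: the language code is first mapped to its integer id, and only rows with that id are kept before taking the first few of them. The following plain-Python sketch mimics that logic on toy data. The label order and the example rows are made up for illustration and are not taken from the dataset.

```python
# Hypothetical label names; the real order in the dataset may differ.
label_names = ["de", "es", "fr"]

# Hypothetical rows standing in for dataset examples.
rows = [
    {"source_text": "Hello", "dest_text": "Hallo", "dest_lang": 0},
    {"source_text": "Hello", "dest_text": "Hola", "dest_lang": 1},
    {"source_text": "Thanks", "dest_text": "Merci", "dest_lang": 2},
    {"source_text": "Thanks", "dest_text": "Gracias", "dest_lang": 1},
    {"source_text": "Goodbye", "dest_text": "Adiós", "dest_lang": 1},
]

# Map the language code to its integer id (same idea as names.index(lang)).
lang_id = label_names.index("es")

# Keep only the rows whose target language is Spanish (same idea as filter).
spanish_rows = [row for row in rows if row["dest_lang"] == lang_id]

# Keep only the first two matches (same idea as select(range(2))).
subset = spanish_rows[:2]

print(lang_id, [row["dest_text"] for row in subset])
```

Note that the subset is taken after filtering, so all selected rows are guaranteed to be Spanish pairs.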

Now we load the pre-trained tokenizer for the mT5 model, which we will apply to the English-Spanish pairs. Keep in mind that, since "mT5 was pre-trained unsupervisedly, there’s no real advantage to using a task prefix during single-task fine-tuning".

In [4]:
from transformers import AutoConfig, AutoTokenizer

checkpoint = "google/mt5-small"
config = AutoConfig.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


As you can see, no prefix is added when preparing the data, which is consistent with how the model was pre-trained:

In [5]:
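def tokenize_function(sample):
    # Tokenize the source sentences, truncating them to at most 40 tokens
    model_inputs = tokenizer(sample["source_text"], max_length=40, truncation=True)
    # Tokenize the target sentences and use their token ids as labels
    model_inputs["labels"] = tokenizer(sample["dest_text"], max_length=40, truncation=True).input_ids
    return model_inputs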

In [6]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

Map:   0%|          | 0/128 [00:00<?, ? examples/s]

Map:   0%|          | 0/1024 [00:00<?, ? examples/s]

Map:   0%|          | 0/128 [00:00<?, ? examples/s]

DatasetDict({
    valid: Dataset({
        features: ['source_text', 'dest_text', 'dest_lang', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 128
    })
    train: Dataset({
        features: ['source_text', 'dest_text', 'dest_lang', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1024
    })
    test: Dataset({
        features: ['source_text', 'dest_text', 'dest_lang', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 128
    })
})

The model and the data collator are set up in the same way as before:

In [7]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)


In [8]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
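
To get a feeling for what the data collator does with a batch, here is a simplified plain-Python sketch of dynamic padding. It assumes right padding, a pad token id of 0 (as in the T5/mT5 vocabulary) and a label padding value of -100 (the default that the loss ignores). The function `pad_batch` and the token ids are hypothetical, and the real collator does more than this (for example, it returns tensors and can prepare decoder inputs from the labels).

```python
PAD_TOKEN_ID = 0     # assumed pad id of the T5/mT5 vocabulary
LABEL_PAD_ID = -100  # assumed default label padding, ignored by the loss


def pad_batch(features):
    """Pad every sequence to the longest one in the batch (hypothetical helper)."""
    max_input_len = max(len(f["input_ids"]) for f in features)
    max_label_len = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_input_pad = max_input_len - len(f["input_ids"])
        n_label_pad = max_label_len - len(f["labels"])
        # Inputs are padded with the pad token and masked out in the attention mask.
        batch["input_ids"].append(f["input_ids"] + [PAD_TOKEN_ID] * n_input_pad)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_input_pad)
        # Labels are padded with -100 so that padded positions do not contribute to the loss.
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * n_label_pad)
    return batch


# Made-up token ids, for illustration only.
features = [
    {"input_ids": [259, 714, 1], "labels": [336, 1]},
    {"input_ids": [259, 1], "labels": [336, 728, 259, 1]},
]
batch = pad_batch(features)
print(batch)
```

Padding per batch rather than to a fixed global length keeps the tensors as small as the longest example in each batch allows.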

The evaluation is also performed in the same way as before, using the SacreBLEU metric:

In [None]:
!pip install sacrebleu
from evaluate import load
import numpy as np

# Load the metric once instead of reloading it at every evaluation
metric = load("sacrebleu")

def postprocess_text(preds, labels):
    # SacreBLEU expects a list of predictions and a list of reference lists
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    # Depending on the library version, predictions may also be padded with -100.
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    # Average number of non-padding tokens per generated sequence
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
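
Two steps of `compute_metrics` are easy to overlook: the -100 values used to pad the labels must be turned back into a real token id before decoding, and `gen_len` counts only non-padding tokens. The sketch below reproduces both steps in plain Python on made-up token ids. The helper names and the pad id of 0 are assumptions for illustration; the notebook itself does this with `np.where` and `np.count_nonzero`.

```python
PAD_TOKEN_ID = 0     # assumed pad id of the T5/mT5 vocabulary
IGNORE_INDEX = -100  # value used to pad the labels


def restore_padding(label_ids):
    """Replace -100 with the pad id so the sequences can be decoded (hypothetical helper)."""
    return [[PAD_TOKEN_ID if t == IGNORE_INDEX else t for t in seq] for seq in label_ids]


def mean_generated_length(pred_ids):
    """Average number of non-padding tokens per prediction (hypothetical helper)."""
    lengths = [sum(1 for t in seq if t != PAD_TOKEN_ID) for seq in pred_ids]
    return sum(lengths) / len(lengths)


# Made-up token ids, for illustration only.
labels = [[336, 1, -100, -100], [336, 728, 259, 1]]
preds = [[0, 336, 1, 0, 0], [0, 336, 728, 1, 0]]

clean_labels = restore_padding(labels)
gen_len = mean_generated_length(preds)
print(clean_labels, gen_len)
```

In this toy example the leading 0 in each prediction plays the role of the decoder start token, which for T5-style models is the pad token, so it is not counted in the generated length.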

We create a Seq2SeqTrainingArguments instance, which holds all the hyperparameters that the Trainer will use for training and evaluation:

In [11]:
from transformers import Seq2SeqTrainingArguments

batch_size = 16
model_name = checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-en-to-es",
    evaluation_strategy="epoch",
    learning_rate=1e-4,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
)
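
With these settings, the number of optimization steps can be estimated by hand, which is a useful sanity check against the training log. The sketch below is a simplified calculation that assumes a single device and no gradient accumulation; the variable names are chosen for illustration.

```python
import math

num_examples = 1024  # size of the training subset selected earlier
batch_size = 16      # per-device batch size, assuming a single device
num_epochs = 1
grad_accum_steps = 1  # assumed: no gradient accumulation

# Number of batches per epoch; the last batch may be smaller than batch_size.
batches_per_epoch = math.ceil(num_examples / batch_size)

# One optimizer update every grad_accum_steps batches.
updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
total_steps = updates_per_epoch * num_epochs

print(total_steps)
```

The result, 64, agrees with the "Total optimization steps" value reported when training starts.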

In [13]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["valid"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)


To fine-tune the model on our dataset, we just have to call the train() method of our Trainer:

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `MT5ForConditionalGeneration.forward` and have been ignored: dest_lang, dest_text, source_text. If dest_lang, dest_text, source_text are not expected by `MT5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1024
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 64
  Number of trainable parameters = 300176768
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss
