# T5 for Machine Translation (Pytorch)

Machine Translation tries to convert a text from one language to another. This can be formulated as a Seq2Seq problem.

In this notebook, we will learn how to fine-tune T5 on the English-French subset of the OPUS Books dataset to translate English text to French.

**the framework used will be Pytorch** in this notebook.

Source: https://huggingface.co/docs/transformers/tasks/translation

In [None]:
!pip install transformers datasets 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
[K     |████████████████████████████████| 5.5 MB 5.1 MB/s 
[?25hCollecting datasets
  Downloading datasets-2.6.1-py3-none-any.whl (441 kB)
[K     |████████████████████████████████| 441 kB 76.9 MB/s 
[?25hCollecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
[K     |████████████████████████████████| 163 kB 79.0 MB/s 
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
[K     |████████████████████████████████| 7.6 MB 69.3 MB/s 
Collecting dill<0.3.6
  Downloading dill-0.3.5.1-py2.py3-none-any.whl (95 kB)
[K     |████████████████████████████████| 95 kB 5.4 MB/s 
Collecting xxhash
  Downloading xxhash-3.1.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl 

If you want that warnings are not printed, please run this cell:

In [None]:
import os
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'
# ignore warning about deprecation
o_deprecation_warning=True


## Data
We use the  OPUS dataset, which is a  collection of translated texts from the web. In particular, we will use the subset

In [1]:
from datasets import load_dataset

dataset = load_dataset("opus_books", "en-es")
dataset

ModuleNotFoundError: ignored

In [None]:
import random as rand
for i in range(3):
    index=rand.randint(0,dataset['train'].num_rows)
    print(dataset['train'][index])
    print()
    # print(dataset['train'][index]['translation']['en'])
    # print(dataset['train'][index]['translation']['es'])
    # print('\n')

{'id': '59285', 'translation': {'en': "'She might bethink herself and only when she is already married find out that she does not and never could love me...'", 'es': '«¿Y si sólo se da cuenta después de casarse conmigo de que no me quiere ni me puede querer?»'}}

{'id': '12125', 'translation': {'en': 'I advanced my head with precaution, desirous to ascertain if any bedroom window-blinds were yet drawn up: battlements, windows, long front--all from this sheltered station were at my command.', 'es': 'Adelanté la cabeza con cautela, para comprobar si las ventanas de algún dormitorio estaban abiertas ya. Todo - fachada, ventanas, almenas-, quedaba desde allí al alcance de mis ojos.'}}

{'id': '22356', 'translation': {'en': '"Go on, Sancho my friend, and be not disheartened," said Don Quixote; "for I double the stakes as to price."', 'es': '-Prosigue, Sancho amigo, y no desmayes -le dijo don Quijote-, que yo doblo la parada del precio.'}}



### Spliting

We have to create the validation and test splits:

In [None]:
dataset = dataset["train"].train_test_split(test_size=0.2)
SIZE_TEST=10
dataset['validation']=dataset["test"].select(range(SIZE_TEST,dataset["test"].num_rows))
dataset['test']=dataset["test"].select(range(SIZE_TEST))

dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 74776
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 10
    })
    validation: Dataset({
        features: ['id', 'translation'],
        num_rows: 18684
    })
})

### Tokenization

In [None]:
PREFIX='translate English to Spanish: '
source_lang = "en"
target_lang = "es"


In [None]:
from transformers import AutoTokenizer

model_name = 't5-small'
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize(examples):
    inputs = [PREFIX + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs

# we apply the function to the dataset for encoding it
encoded_datasets = dataset.map(tokenize, batched=True)
encoded_datasets

Downloading:   0%|          | 0.00/1.20k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-small automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


  0%|          | 0/75 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/19 [00:00<?, ?ba/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'translation', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 74776
    })
    test: Dataset({
        features: ['id', 'translation', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 10
    })
    validation: Dataset({
        features: ['id', 'translation', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 18684
    })
})

In [None]:
encoded_datasets=encoded_datasets.remove_columns(['translation', 'id'])
encoded_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 74776
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 10
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 18684
    })
})

## Model (Pytorch)

Here is when the code is different to the previous notebook where we fine-tune a T5 for text summarization on tensorflow. 
Now we have to use differente classes:


### Defining model, arguments and data collator

In [None]:
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to('cuda')


Downloading:   0%|          | 0.00/242M [00:00<?, ?B/s]

We will use a trainer class for Seq2Seq, so we need to set its arguments:

In [None]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    fp16=True,
)

We also need to define a data collator, in particular, one for a Seq2Seq model:


In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

### Metrics for the trainer
We also have to define the function that will be used by the trainer to measure the model on the validation dataset:

In [None]:
import keras_nlp
rouge_L = keras_nlp.metrics.RougeL()

def compute_metrics(eval_predictions):
    #the predictions and the corresponding reference labels
    predictions, labels = eval_predictions

    # we have to decode the predictions
    decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    
    # we also have to decode the reference labels
    # first, we replace those labels <0 with the token id for padding
    for label in labels:
        label[label < 0] = tokenizer.pad_token_id  # Replace masked label tokens
    # we now decode the reference labels
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # we calculate rouge_L comparing the decoded labels and the decoded prediction
    result = rouge_L(decoded_labels, decoded_predictions)
    # We will print only the F1 score, you can use other aggregation metrics as well
    result = {"RougeL": result["f1_score"]}

    # return metric.compute(decoded_labels, decoded_predictions)
    return result

### Trainer 

Now, we can define the trainer object by using the *Seq2SeqTrainer* class:

In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=encoded_datasets["train"],
    eval_dataset=encoded_datasets["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

Using cuda_amp half precision backend
***** Running training *****
  Num examples = 74776
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 4674
  Number of trainable parameters = 60506624
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss
1,3.1219,3.018652


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-500/special_tokens_map.json
Saving model checkpoint to ./results/checkpoint-1000
Configuration saved in ./results/checkpoint-1000/config.json
Model weights saved in ./results/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1000/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1000/special_tokens_map.json
Saving model checkpoint to ./results/checkpoint-1500
Configuration saved in ./results/checkpoint-1500/config.json
Model weights saved in ./results/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1500/special_toke

TrainOutput(global_step=4674, training_loss=3.2459327132976306, metrics={'train_runtime': 450.8317, 'train_samples_per_second': 165.862, 'train_steps_per_second': 10.368, 'total_flos': 1963460611276800.0, 'train_loss': 3.2459327132976306, 'epoch': 1.0})

### Evaluation on the validation dataset
We evaluate eth

In [None]:
trainer.train()

***** Running training *****
  Num examples = 74776
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 4674
  Number of trainable parameters = 60506624


Epoch,Training Loss,Validation Loss
1,3.1115,3.018531


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-500/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-500/special_tokens_map.json
Deleting older checkpoint [results/checkpoint-3500] due to args.save_total_limit
Saving model checkpoint to ./results/checkpoint-1000
Configuration saved in ./results/checkpoint-1000/config.json
Model weights saved in ./results/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in ./results/checkpoint-1000/tokenizer_config.json
Special tokens file saved in ./results/checkpoint-1000/special_tokens_map.json
Deleting older checkpoint [results/checkpoint-4000] due to args.save_total_limit
Saving model checkpoint to ./results/checkpoint-1500
Configuration saved in ./results/checkpoint-1500/config.json
Model weights saved in ./results/checkpoint-1500/py

TrainOutput(global_step=4674, training_loss=3.11174822812617, metrics={'train_runtime': 443.5357, 'train_samples_per_second': 168.591, 'train_steps_per_second': 10.538, 'total_flos': 1966158988443648.0, 'train_loss': 3.11174822812617, 'epoch': 1.0})

In [None]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 398
  Batch size = 16


Trainer is attempting to log a value of "0.1073099821805954" of type <class 'tensorflow.python.framework.ops.EagerTensor'> for key "eval/RougeL" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.


{'eval_loss': 3.6256604194641113,
 'eval_RougeL': <tf.Tensor: shape=(), dtype=float32, numpy=0.10730998>,
 'eval_runtime': 11.5393,
 'eval_samples_per_second': 34.491,
 'eval_steps_per_second': 2.167,
 'epoch': 1.0}

## Evaluation


### Inference
You can directly use the model to generate the summary for some text from the test dataset (or any another text). To do this, we create a pipeline object containing the model and the tokenizer.

In [None]:
from transformers import pipeline
MIN_TARGET_LENGTH = 5
translater = pipeline("translation", model=model, tokenizer=tokenizer, framework="pt", device=0)

translater(dataset['test'][0]['translation']['en'])

[{'summary_text': "Virgil van Dijk's first goal for 18 months gave the hosts the lead . he doubled the lead with a header from Dusan Tadic's corner . the 28-year-old is the fourth englishman to score in six consecutive matches this season ."}]

In [None]:
dataset['test'][0]['translation']['es']

'Premier League top scorer Jamie Vardy scored twice as Leicester came from 2-0 down to draw at Southampton.'

### Results on the test dataset
We also want to provide some final scores about our model on the test dataset

In [None]:
generated_summaries =translater(dataset["test"]['translation']['es'], truncation=True)
generated_summaries=[example['es'] for example in generated_summaries]

result = rouge_L(dataset['test']['translation']['es'], generated_summaries)

Disabling tokenizer parallelism, we're using DataLoader multithreading already
Your max_length is set to 128, but you input_length is only 91. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=45)


In [None]:
import tensorflow as tf
#print("rouge-L:", result['precision'], result['recall'], result['f1_score'])
print("rouge-L -  Precision:", tf.get_static_value(result['precision']), ", Recal: ", tf.get_static_value(result['recall']), ", f1-score:", tf.get_static_value(result['f1_score']))

rouge-L -  Precision: 0.14080042 , Recal:  0.090049334 , f1-score: 0.10729898
