Translation converts a sequence of text from one language to another. It is one of several tasks you can formulate as a sequence-to-sequence problem, a powerful framework for producing an output sequence from an input sequence, as in translation or summarization. Translation systems are most commonly used to translate text between languages, but they can also be applied to speech, or to combinations of the two such as text-to-speech and speech-to-text.

This guide shows how to:
1. Finetune T5 on the English-French subset of the OPUS Books dataset to translate English text to French.
2. Use the finetuned model for inference.

# Libraries

In [None]:
pip install transformers datasets evaluate sacrebleu

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

# Data Load

In [None]:
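# Load the English-French subset of OPUS Books, then hold out 20% of its rows as a test set
books = load_dataset("opus_books", "en-fr")
books = books["train"].train_test_split(test_size=0.2)
# Inspect one example
books["train"][0]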
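
Each record in the split dataset is a dictionary with an `id` and a `translation` field that maps language codes to text. The sketch below mimics that layout with made-up sentences (not actual dataset rows) and a hand-rolled 80/20 split, to show the shape of the data the later steps rely on. `toy_split` is a hypothetical helper for illustration, not part of the `datasets` library.

```python
import random

# Made-up records in the assumed opus_books "en-fr" layout: an id plus a translation dict.
toy_records = [
    {"id": str(i), "translation": {"en": en, "fr": fr}}
    for i, (en, fr) in enumerate([
        ("Hello.", "Bonjour."),
        ("Thank you.", "Merci."),
        ("Good night.", "Bonne nuit."),
        ("See you soon.", "À bientôt."),
        ("I am reading a book.", "Je lis un livre."),
    ])
]


def toy_split(records, test_size=0.2, seed=0):
    """Hypothetical stand-in for train_test_split: shuffle, then hold out a fraction."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_size))
    return {"train": shuffled[n_test:], "test": shuffled[:n_test]}


toy_books = toy_split(toy_records)
example = toy_books["train"][0]
print(sorted(example.keys()))          # ['id', 'translation']
print(sorted(example["translation"]))  # ['en', 'fr']
```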

# Preprocessing

In [None]:
# load a T5 tokenizer to process the English-French language pairs
checkpoint = "google-t5/t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:

# T5 handles several NLP tasks, so each input is prefixed with a prompt
# that tells the model this is a translation task.
source_lang = "en"
target_lang = "fr"
prefix = "translate English to French: "


def preprocess_function(examples):
    # Build the prefixed English inputs and the French targets as separate lists
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]

    # Tokenize both sides, truncating each to at most max_length tokens.
    # Passing the French text as text_target stores its token IDs under "labels".
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs


In [None]:
tokenized_books = books.map(preprocess_function, batched=True)
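
With `batched=True`, `map` passes `preprocess_function` a dictionary of columns, so `examples["translation"]` is a list of per-row dictionaries. The following is a minimal sketch of the string-building step on a made-up batch, with the tokenizer call left out so it runs without downloading anything. `build_pairs` and `toy_batch` are hypothetical names used only for illustration.

```python
source_lang = "en"
target_lang = "fr"
prefix = "translate English to French: "

# Made-up batch in the column-oriented layout that a batched map passes to the function.
toy_batch = {
    "id": ["0", "1"],
    "translation": [
        {"en": "Hello.", "fr": "Bonjour."},
        {"en": "Thank you.", "fr": "Merci."},
    ],
}


def build_pairs(examples):
    """Hypothetical helper: the list-building part of preprocess_function, without tokenization."""
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    return inputs, targets


inputs, targets = build_pairs(toy_batch)
print(inputs)   # ['translate English to French: Hello.', 'translate English to French: Thank you.']
print(targets)  # ['Bonjour.', 'Merci.']
```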

In [None]:
# Create batches of examples with DataCollatorForSeq2Seq, which dynamically pads
# the inputs and labels to the longest sequence in each batch during collation
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)
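
Dynamic padding means each batch is padded only to its own longest sequence rather than to a global maximum. The pure-Python sketch below illustrates the idea on made-up token IDs; it is not the `DataCollatorForSeq2Seq` implementation. It assumes a pad token ID of 0 for inputs and -100 for labels, a value the loss function is assumed to ignore. `pad_batch` is a hypothetical helper.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Hypothetical helper: right-pad each field to the longest sequence in this batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        n_label_pad = max_label - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_label_pad)
    return batch


# Made-up token IDs of different lengths (assumed values, not real tokenizer output)
features = [
    {"input_ids": [13, 7, 1], "labels": [21, 1]},
    {"input_ids": [13, 7, 42, 9, 1], "labels": [21, 5, 8, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"][0])       # [13, 7, 1, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0]
print(batch["labels"][0])          # [21, 1, -100, -100]
```

Padding per batch keeps short batches short, which avoids spending compute on padding tokens that a fixed global length would introduce.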