In [None]:
!pip install -q transformers datasets evaluate sacrebleu

# Translation

Translation converts a sequence of text from one language to another. It is a sequence-to-sequence problem: the model maps an input sequence to an output sequence.

## Load OPUS Books dataset

In [None]:
from datasets import load_dataset

books = load_dataset('opus_books', 'en-fr')

In [None]:
books = books['train'].train_test_split(test_size=0.2)

In [None]:
books['train'][0]

{'id': '112860',
 'translation': {'en': '"What do I hear? Is it you, my dear master! you I behold in this piteous plight? What dreadful misfortune has befallen you? What has made you leave the most magnificent and delightful of all castles? What has become of Miss Cunegund, the mirror of young ladies, and Nature\'s masterpiece?"',
  'fr': "Qu'entends-je? vous, mon cher maître! vous, dans cet état horrible! quel malheur vous est-il donc arrivé? pourquoi n'êtes-vous plus dans le plus beau des châteaux? qu'est devenue mademoiselle Cunégonde, la perle des filles, le chef-d'oeuvre de la nature?"}}

## Preprocess

We need to load a T5 tokenizer to process the English-French language pairs:

In [None]:
from transformers import AutoTokenizer

checkpoint = 'google-t5/t5-small'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

The preprocessing function needs to:
* Prefix the input with a prompt so T5 knows this is a translation task. Some models capable of multiple NLP tasks require a task-specific prompt.
* Pass the target-language text through the `text_target` parameter so that the tokenizer processes it correctly. If `text_target` is not set, the tokenizer processes the target text as English.
* Truncate sequences so they are no longer than the length set by the `max_length` parameter.

In [None]:
source_lang = 'en'
target_lang = 'fr'
prefix = 'translate English to French: '


def preprocess_function(examples):
    inputs = [prefix + example[source_lang] for example in examples['translation']]
    targets = [example[target_lang] for example in examples['translation']]

    model_inputs = tokenizer(
        inputs,
        text_target=targets,
        max_length=128,
        truncation=True
    )
    return model_inputs
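With `batched=True`, `map` hands the function a dictionary of columns rather than a single row, so `examples['translation']` is a list of `{'en': ..., 'fr': ...}` dictionaries. The sketch below reproduces only the string-handling step on a hand-written batch. `toy_batch` and `build_pairs` are hypothetical names introduced for illustration, and no tokenizer is involved.

```python
# Hypothetical two-row batch in the column-oriented layout that
# `Dataset.map(..., batched=True)` passes to the mapped function.
toy_batch = {
    "id": ["0", "1"],
    "translation": [
        {"en": "Good morning.", "fr": "Bonjour."},
        {"en": "Thank you.", "fr": "Merci."},
    ],
}

PREFIX = "translate English to French: "


def build_pairs(examples, source_lang="en", target_lang="fr", prefix=PREFIX):
    """Return (prefixed source strings, target strings) for one batch."""
    inputs = [prefix + pair[source_lang] for pair in examples["translation"]]
    targets = [pair[target_lang] for pair in examples["translation"]]
    return inputs, targets


inputs, targets = build_pairs(toy_batch)
print(inputs[0])   # translate English to French: Good morning.
print(targets[0])  # Bonjour.
```

Only the source side receives the prefix; the targets are passed through unchanged.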

In [None]:
tokenized_books = books.map(preprocess_function, batched=True)

Map:   0%|          | 0/101668 [00:00<?, ? examples/s]

Map:   0%|          | 0/25417 [00:00<?, ? examples/s]

Create a batch of examples using `DataCollatorForSeq2Seq`. The collator *dynamically pads* the sentences to the longest length in each batch during collation, which is more efficient than padding the whole dataset to the maximum length.

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)
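To make dynamic padding concrete, here is a minimal pure-Python sketch of the idea. It is a simplified stand-in, not the library's implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and it assumes right-side padding with pad id `0` for inputs and `-100` for labels.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad every feature to the longest sequence in this batch only."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # The attention mask marks real tokens with 1 and padding with 0.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        # Labels are padded with -100 so the loss ignores those positions.
        l_pad = max_label - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_token_id] * l_pad)
    return batch


# Two tokenized examples of different lengths (made-up ids).
features = [
    {"input_ids": [13, 7, 1], "labels": [21, 1]},
    {"input_ids": [13, 7, 9, 4, 1], "labels": [21, 5, 8, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])  # [[13, 7, 1, 0, 0], [13, 7, 9, 4, 1]]
print(batch["labels"])     # [[21, 1, -100, -100], [21, 5, 8, 1]]
```

Because the padded length depends on the batch, a batch of short sentences stays short instead of being stretched to the dataset-wide maximum.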

## Evaluate

For this task, load the SacreBLEU metric with the `evaluate` library:

In [None]:
import evaluate

metric = evaluate.load('sacrebleu')

Create a function that passes the predictions and labels to the metric's `compute()` method to calculate the SacreBLEU score:

In [None]:
import numpy as np


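def postprocess_text(preds, labels):
    # Strip surrounding whitespace; SacreBLEU expects a list of references per prediction.
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]
    return preds, labels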


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {'bleu': result['score']}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result['gen_len'] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
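Three small steps in the metric function are easy to misread: swapping `-100` back to the pad id before decoding, wrapping each label in its own list, and counting non-padding tokens for `gen_len`. The sketch below shows them on plain Python lists. The helper names are hypothetical, the token ids are made up, and a pad id of `0` is assumed.

```python
PAD_TOKEN_ID = 0      # assumed pad id (T5 uses 0)
IGNORE_INDEX = -100   # value used to pad labels; not a real token id


def restore_pad_ids(label_rows):
    """Replace the ignore index with the pad id so the rows can be decoded."""
    return [
        [PAD_TOKEN_ID if tok == IGNORE_INDEX else tok for tok in row]
        for row in label_rows
    ]


def to_sacrebleu_format(preds, labels):
    """Strip whitespace and wrap each label in a list (one reference each)."""
    preds = [pred.strip() for pred in preds]
    references = [[label.strip()] for label in labels]
    return preds, references


def generated_length(row):
    """Count the non-padding tokens in one predicted sequence."""
    return sum(1 for tok in row if tok != PAD_TOKEN_ID)


print(restore_pad_ids([[21, 1, -100, -100]]))         # [[21, 1, 0, 0]]
print(to_sacrebleu_format([" Bonjour. "], ["Bonjour. "]))
# (['Bonjour.'], [['Bonjour.']])
print(generated_length([0, 21, 5, 1, 0, 0]))          # 3
```

The nested list for references exists because SacreBLEU allows several reference translations per prediction; this dataset supplies one.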

## Train

In [None]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir='my_opus_books_model',
    eval_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=2,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=False,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_books['train'],
    eval_dataset=tokenized_books['test'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
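Before starting a run, it can help to estimate how many optimizer steps these settings imply. The arithmetic below is a rough sketch that uses the 101,668 training examples reported by the `map` progress output and assumes a single device with no gradient accumulation; both assumptions are mine, not stated in the notebook.

```python
import math

num_train_examples = 101_668   # from the map progress output above
per_device_batch_size = 16     # per_device_train_batch_size
num_epochs = 2                 # num_train_epochs
num_devices = 1                # assumption: a single GPU
grad_accum_steps = 1           # assumption: no gradient accumulation

effective_batch = per_device_batch_size * num_devices * grad_accum_steps
# The last, partially filled batch still counts as a step.
steps_per_epoch = math.ceil(num_train_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)  # 6355 12710
```

With `eval_strategy='epoch'`, evaluation would then run after roughly every 6,355 steps under these assumptions.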

In [None]:
trainer.train()

## Inference

T5 expects a task-specific prefix on the input. For translation from English to French, use the same prefix as in preprocessing:

In [None]:
text = "translate English to French: Legumes share resources with nitrogen-fixing bacteria."

In [None]:
from transformers import pipeline

# The task name has the form `translation_xx_to_yy`, where `xx` is the language code
# of the input and `yy` the language code of the desired output
# (for example "en", "fr", "de", "es", "zh").
# `translation_en_to_fr` translates English to French.
translator = pipeline('translation_en_to_fr', model='stevhliu/my_awesome_opus_books_model')

In [None]:
translator(text)

[{'translation_text': "Legumes partagent des ressources avec des bactéries fixatrices d'azote."}]

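You can also run the steps that `pipeline` performs manually. Because the `generate` call below samples (`do_sample=True`), its output can differ from the pipeline's and from one run to the next: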

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained('stevhliu/my_awesome_opus_books_model')
model = AutoModelForSeq2SeqLM.from_pretrained('stevhliu/my_awesome_opus_books_model')
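The `generate` call that follows samples with `top_k` and `top_p`, which restrict the set of tokens the model may draw from at each step. The sketch below illustrates that filtering on a made-up next-token distribution. `filter_top_k_top_p` is a hypothetical helper that works on probabilities in a dictionary; it is a simplified illustration of the idea, not the library's implementation, which operates on logits.

```python
def filter_top_k_top_p(probs, top_k=30, top_p=0.95):
    """Return the renormalized distribution that sampling would draw from."""
    # Step 1: keep only the top_k most probable tokens, then renormalize.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(tok, p / total) for tok, p in ranked]

    # Step 2: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


# Made-up probabilities for the first generated token.
toy_probs = {"Les": 0.5, "Des": 0.3, "La": 0.15, "Le": 0.04, "Un": 0.01}
filtered = filter_top_k_top_p(toy_probs, top_k=4, top_p=0.9)
print(list(filtered))  # ['Les', 'Des', 'La']
```

Any of the surviving tokens can be drawn, so sampled translations vary between runs; setting `do_sample=False` would make decoding deterministic.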



In [None]:
inputs = tokenizer(text, return_tensors='pt').input_ids
outputs = model.generate(
    inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=30,
    top_p=0.95,
)

tokenizer.decode(outputs[0], skip_special_tokens=True)

"Les levures s'alimentent avec des bactéries contenant de l'azote."