The Jupyter notebook for this article is in the [Chapter 4 code base](https://github.com/datawhalechina/learn-nlp-with-transformers/tree/main/docs/%E7%AF%87%E7%AB%A04-%E4%BD%BF%E7%94%A8Transformers%E8%A7%A3%E5%86%B3NLP%E4%BB%BB%E5%8A%A1).

We recommend opening this tutorial directly in a Google Colab notebook, which makes it quick to download the relevant datasets and models.
If you are opening this notebook in Google Colab, you may need to install the 🤗 Transformers and 🤗 Datasets libraries. Run the following command to install them.

In [None]:
! pip install datasets transformers "sacrebleu>=1.4.12,<2.0.0" sentencepiece

If you are opening this notebook locally, please make sure you have carefully read the transformer-quick-start-zh readme file and installed all the dependencies it lists. You can also find the multi-GPU distributed training version of this notebook [here](https://github.com/huggingface/transformers/tree/master/examples/seq2seq).

# Fine-tune the transformer model to solve the translation task

In this notebook, we will show how to use models from the [🤗 Transformers](https://github.com/huggingface/transformers) repository to solve the translation task in natural language processing. We will use the [WMT dataset](http://www.statmt.org/wmt16/), one of the most commonly used datasets for translation tasks.

An example is shown below:

![Widget inference on a translation task](https://github.com/huggingface/notebooks/blob/master/examples/images/translation.png?raw=1)

For the translation task, we will show how to load the dataset with a few simple calls and how to fine-tune a model on it using the `Trainer` interface in 🤗 Transformers.

In [2]:
model_checkpoint = "Helsinki-NLP/opus-mt-en-ro" 
# Select a model checkpoint

As long as a pre-trained transformer model has a seq2seq head, this notebook can in principle use any such checkpoint from the [model hub](https://huggingface.co/models) to solve a translation task.

In this article, we use the pre-trained [`Helsinki-NLP/opus-mt-en-ro`](https://huggingface.co/Helsinki-NLP/opus-mt-en-ro) checkpoint for translation tasks.

## Download Data

We will use the 🤗 Datasets library to load the data and the corresponding metric. Loading them only requires `load_dataset` and `load_metric`. We use the English/Romanian language pair from the WMT16 dataset.

In [3]:
from datasets import load_dataset, load_metric

raw_datasets = load_dataset("wmt16", "ro-en")
metric = load_metric("sacrebleu")

Downloading: 2.81kB [00:00, 523kB/s]                    
Downloading: 3.19kB [00:00, 758kB/s]                    
Downloading: 41.0kB [00:00, 11.0MB/s]                   


Downloading and preparing dataset wmt16/ro-en (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /Users/niepig/.cache/huggingface/datasets/wmt16/ro-en/1.0.0/0d9fb3e814712c785176ad8cdb9f465fbe6479000ee6546725db30ad8a8b5f8a...


Downloading: 100%|██████████| 225M/225M [00:18<00:00, 12.2MB/s]
Downloading: 100%|██████████| 23.5M/23.5M [00:16<00:00, 1.44MB/s]
Downloading: 100%|██████████| 38.7M/38.7M [00:03<00:00, 9.82MB/s]


Dataset wmt16 downloaded and prepared to /Users/niepig/.cache/huggingface/datasets/wmt16/ro-en/1.0.0/0d9fb3e814712c785176ad8cdb9f465fbe6479000ee6546725db30ad8a8b5f8a. Subsequent calls will reuse this data.


Downloading: 5.40kB [00:00, 2.08MB/s]                   


The `raw_datasets` object is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict). To get the training, validation, or test set, index it with the corresponding key (`train`, `validation`, `test`).

In [4]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['translation'],
        num_rows: 610320
    })
    validation: Dataset({
        features: ['translation'],
        num_rows: 1999
    })
    test: Dataset({
        features: ['translation'],
        num_rows: 1999
    })
})

Given a split key (`train`, `validation`, or `test`) and an index, you can view a single example.

In [5]:
raw_datasets["train"][0]
# We can see that an English sentence en corresponds to a Romanian sentence ro

{'translation': {'en': 'Membership of Parliament: see Minutes',
  'ro': 'Componenţa Parlamentului: a se vedea procesul-verbal'}}
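The nested record layout shown above can be unpacked into plain sentence pairs. The following is a minimal sketch that uses ordinary Python dictionaries in place of the real dataset; the helper name `to_pairs` and the second record are made up for illustration.

```python
# Two records in the same nested layout as raw_datasets["train"][i].
records = [
    {"translation": {"en": "Membership of Parliament: see Minutes",
                     "ro": "Componenţa Parlamentului: a se vedea procesul-verbal"}},
    {"translation": {"en": "Hello", "ro": "Salut"}},  # made-up example
]

def to_pairs(records, source_lang="en", target_lang="ro"):
    """Return (source, target) sentence tuples from nested translation records."""
    return [
        (record["translation"][source_lang], record["translation"][target_lang])
        for record in records
    ]

pairs = to_pairs(records)
for source, target in pairs:
    print(f"{source!r} -> {target!r}")
```

Swapping the `source_lang` and `target_lang` arguments would give pairs for the opposite translation direction.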

To further understand what the data looks like, the following function will randomly select a few examples from the dataset and display them.

In [6]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [7]:
show_random_elements(raw_datasets["train"])

Unnamed: 0,translation
0,"{'en': 'I do not believe that this is the right course.', 'ro': 'Nu cred că acesta este varianta corectă.'}"
1,"{'en': 'A total of 104 new jobs were created at the European Chemicals Agency, which mainly supervises our REACH projects.', 'ro': 'Un total de 104 noi locuri de muncă au fost create la Agenția Europeană pentru Produse Chimice, care, în special, supraveghează proiectele noastre REACH.'}"
2,"{'en': 'In view of the above, will the Council say what stage discussions for Turkish participation in joint Frontex operations have reached?', 'ro': 'Care este stadiul negocierilor referitoare la participarea Turciei la operațiunile comune din cadrul Frontex?'}"
3,"{'en': 'We now fear that if the scope of this directive is expanded, the directive will suffer exactly the same fate as the last attempt at introducing 'Made in' origin marking - in other words, that it will once again be blocked by the Council.', 'ro': 'Acum ne temem că, dacă sfera de aplicare a directivei va fi extinsă, aceasta va avea exact aceeaşi soartă ca ultima încercare de introducere a marcajului de origine ""Made in”, cu alte cuvinte, că va fi din nou blocată la Consiliu.'}"
4,"{'en': 'The country dropped nine slots to 85th, with a score of 6.58.', 'ro': 'Ţara a coborât nouă poziţii, pe locul 85, cu un scor de 6,58.'}"
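The function above draws indices one at a time and retries on duplicates. As an alternative sketch (not the notebook's own code), the standard library's `random.sample` returns distinct indices in a single call; the helper name `pick_distinct_indices` and the `seed` argument are assumptions added for illustration.

```python
import random

def pick_distinct_indices(dataset_size, num_examples=5, seed=None):
    """Return num_examples distinct indices in the range [0, dataset_size)."""
    if num_examples > dataset_size:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # a local generator keeps the draw reproducible
    return rng.sample(range(dataset_size), num_examples)

# 610320 is the size of the training split shown earlier.
picks = pick_distinct_indices(610320, num_examples=5, seed=0)
print(picks)
```

The resulting list could then be used to index the dataset in the same way as `picks` in `show_random_elements`.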


`metric` is an instance of the [`datasets.Metric`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Metric) class. Displaying it shows the metric's description and usage examples:

In [8]:
metric

Metric(name: "sacrebleu", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id='references')}, usage: """
Produces BLEU scores along with its sufficient statistics
from a source against one or more references.

Args:
    predictions: The system stream (a sequence of segments)
    references: A list of one or more reference streams (each a sequence of segments)
    smooth: The smoothing method to use
    smooth_value: For 'floor' smoothing, the floor to use
    force: Ignore data that looks already tokenized
    lowercase: Lowercase the data
    tokenize: The tokenizer to use
Returns:
    'score': BLEU score,
    'counts': Counts,
    'totals': Totals,
    'precisions': Precisions,
    'bp': Brevity penalty,
    'sys_len': predictions length,
    'ref_len': reference length,
Examples:

    >>> predictions = ["hello there general kenobi", "foo bar foobar"]
    >>> references = [["hello there gen

We use the `compute` method to compare predictions with labels and calculate the score. `predictions` must be a list of sentences, and `references` must be a list of lists of sentences (one or more references per prediction). The format is shown in the example below:

In [9]:
fake_preds = ["hello there", "general kenobi"]
fake_labels = [["hello there"], ["general kenobi"]]
metric.compute(predictions=fake_preds, references=fake_labels)

{'score': 0.0,
 'counts': [4, 2, 0, 0],
 'totals': [4, 2, 0, 0],
 'precisions': [100.0, 100.0, 0.0, 0.0],
 'bp': 1.0,
 'sys_len': 4,
 'ref_len': 4}
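To see where the `counts` and `totals` entries come from, here is a simplified sketch of clipped n-gram matching over a corpus. It is not sacrebleu's implementation: it assumes plain whitespace tokenization, and the helper names `ngrams` and `ngram_stats` are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_stats(predictions, references, max_order=4):
    """Return (counts, totals): clipped matches and predicted n-grams per order."""
    counts = [0] * max_order
    totals = [0] * max_order
    for prediction, refs in zip(predictions, references):
        pred_tokens = prediction.split()
        for n in range(1, max_order + 1):
            pred_ngrams = ngrams(pred_tokens, n)
            # For each n-gram, the highest count seen in any single reference.
            max_ref = Counter()
            for ref in refs:
                for gram, count in ngrams(ref.split(), n).items():
                    max_ref[gram] = max(max_ref[gram], count)
            counts[n - 1] += sum(min(c, max_ref[g]) for g, c in pred_ngrams.items())
            totals[n - 1] += sum(pred_ngrams.values())
    return counts, totals

fake_preds = ["hello there", "general kenobi"]
fake_labels = [["hello there"], ["general kenobi"]]
counts, totals = ngram_stats(fake_preds, fake_labels)
print(counts, totals)
```

Each two-word sentence contributes two unigrams and one bigram but no 3-grams or 4-grams, so the higher orders have nothing to match. This is consistent with the zero precisions for those orders and the overall score of 0.0 in the output above, even though the predictions are identical to the references.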

## Data preprocessing

Before feeding the data into the model, we need to preprocess it. The preprocessing tool is called a tokenizer. The tokenizer first splits the input into tokens, then converts the tokens into the token IDs expected by the pre-trained model, and finally assembles them into the input format the model requires.

To do this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which ensures that:

- We get the tokenizer that matches the pre-trained model.
- We download the vocabulary that was used when the specified model checkpoint was pre-trained.

The downloaded vocabulary is cached, so it will not be downloaded again the next time it is used.

In [10]:
from transformers import AutoTokenizer
# Need to install `sentencepiece`: pip install sentencepiece
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Downloading: 100%|██████████| 1.13k/1.13k [00:00<00:00, 466kB/s]
Downloading: 100%|██████████| 789k/789k [00:00<00:00, 882kB/s]
Downloading: 100%|██████████| 817k/817k [00:00<00:00, 902kB/s]
Downloading: 100%|██████████| 1.39M/1.39M [00:01<00:00, 1.24MB/s]
Downloading: 100%|██████████| 42.0/42.0 [00:00<00:00, 14.6kB/s]


If you use an mBART checkpoint, you need to set the source and target languages correctly so that the texts are preprocessed properly. If you want to translate a different language pair, check the language codes [here](https://huggingface.co/facebook/mbart-large-cc25). The checkpoint used in this article is not an mBART model, so the cell below has no effect here:

In [11]:
if "mbart" in model_checkpoint:
    # mBART language codes use underscores, e.g. "en_XX" and "ro_RO"
    tokenizer.src_lang = "en_XX"
    tokenizer.tgt_lang = "ro_RO"

The tokenizer can preprocess a single text or a pair of texts. The data obtained after tokenizer preprocessing meets the input format of the pre-trained model.

In [12]:
tokenizer("Hello, this one sentence!")

{'input_ids': [125, 778, 3, 63, 141, 9191, 23, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

The token IDs (i.e. `input_ids`) shown above generally differ from one pre-trained model to another, because each model is pre-trained with its own vocabulary and tokenization rules. As long as the tokenizer and the model are loaded from the same checkpoint name, the tokenizer's output will match the input format the model expects. For more information about preprocessing, please refer to [this tutorial](https://huggingface.co/transformers/preprocessing.html).

In addition to tokenizing a sentence, we can also tokenize a list of sentences.

In [13]:
tokenizer(["Hello, this one sentence!", "This is another sentence."])

{'input_ids': [[125, 778, 3, 63, 141, 9191, 23, 0], [187, 32, 716, 9191, 2, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

Note: To prepare translation targets for the model, we use `as_target_tokenizer` to control the special tokens corresponding to the targets:

In [14]:
with tokenizer.as_target_tokenizer():
    model_input = tokenizer("Hello, this one sentence!")
    print(model_input)
    # Convert the IDs back to tokens to see the special token
    tokens = tokenizer.convert_ids_to_tokens(model_input['input_ids'])
    print('tokens: {}'.format(tokens))

{'input_ids': [10334, 1204, 3, 15, 8915, 27, 452, 59, 29579, 581, 23, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
tokens: ['▁Hel', 'lo', ',', '▁', 'this', '▁o', 'ne', '▁se', 'nten', 'ce', '!', '</s>']
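The `▁` character in the tokens above is the SentencePiece word-boundary marker, and `</s>` is the end-of-sequence special token. A minimal sketch of turning such pieces back into text follows; the helper `pieces_to_text` and its default list of special tokens are assumptions for illustration, and in practice the tokenizer's own decoding methods would be used instead.

```python
SPIECE_UNDERLINE = "\u2581"  # the word-boundary marker used by SentencePiece

def pieces_to_text(pieces, special_tokens=("</s>", "<pad>")):
    """Join subword pieces into a sentence, dropping special tokens."""
    kept = [piece for piece in pieces if piece not in special_tokens]
    # Concatenate the pieces, then turn each boundary marker into a space.
    return "".join(kept).replace(SPIECE_UNDERLINE, " ").strip()

pieces = ["\u2581Hel", "lo", ",", "\u2581", "this", "\u2581o", "ne",
          "\u2581se", "nten", "ce", "!", "</s>"]
text = pieces_to_text(pieces)
print(text)
```

The marker sits at the start of each new word, which is why pieces such as `lo` and `ne` attach directly to the piece before them.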


If you are using the checkpoints of the T5 pre-trained model, you need to check for special prefixes. T5 uses special prefixes to tell the model the specific tasks to be done. Examples of specific prefixes are as follows:

In [15]:
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "translate English to Romanian: "
else:
    prefix = ""

Now we can put everything together to form our preprocessing function. When preprocessing the samples, we pass `truncation=True` so that sentences longer than the maximum length are truncated. No padding is applied at this stage; shorter sentences will be padded later, batch by batch, by the data collator.

In [16]:
max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "ro"

def preprocess_function(examples):
    inputs = [prefix + ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

The preprocessing function above expects a batch of samples in the format returned by slicing the dataset (for example `raw_datasets['train'][:2]`), where `examples["translation"]` is a list of dictionaries. It returns a dictionary in which each key maps to a list with one entry per sample.

In [17]:
preprocess_function(raw_datasets['train'][:2])

{'input_ids': [[393, 4462, 14, 1137, 53, 216, 28636, 0], [24385, 14, 28636, 14, 4646, 4622, 53, 216, 28636, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[42140, 494, 1750, 53, 8, 59, 903, 3543, 9, 15202, 0], [36199, 6612, 9, 15202, 122, 568, 35788, 21549, 53, 8, 59, 903, 3543, 9, 15202, 0]]}
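The shape of the data flowing through the preprocessing function can be sketched without downloading a model. The example below is only an illustration: `toy_tokenize` is a hypothetical stand-in that assigns one ID per whitespace-separated word, not the real tokenizer, and the two sentence pairs are made up.

```python
def toy_tokenize(texts, max_length):
    """Hypothetical stand-in for a tokenizer: one ID per word, then truncate."""
    vocab = {}
    batch_ids = []
    for text in texts:
        ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]
        batch_ids.append(ids[:max_length])  # mimic truncation=True
    return {
        "input_ids": batch_ids,
        "attention_mask": [[1] * len(ids) for ids in batch_ids],
    }

def toy_preprocess(examples, prefix="", max_input_length=4, max_target_length=4):
    """Same structure as preprocess_function, using the toy tokenizer."""
    inputs = [prefix + ex["en"] for ex in examples["translation"]]
    targets = [ex["ro"] for ex in examples["translation"]]
    model_inputs = toy_tokenize(inputs, max_input_length)
    model_inputs["labels"] = toy_tokenize(targets, max_target_length)["input_ids"]
    return model_inputs

# A batch is a dict of columns; the "translation" column is a list of dicts.
batch = {"translation": [
    {"en": "I have a very good book", "ro": "Am o carte foarte bună"},
    {"en": "Hello there", "ro": "Salut"},
]}
out = toy_preprocess(batch)
print(out)
```

The long first pair is cut to four IDs on both sides, while the short second pair keeps its natural length, so the lists in a batch can differ in length until they are padded.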

Next, we preprocess all samples in `raw_datasets` by using the `map` method to apply `preprocess_function` to every split.

In [18]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

100%|██████████| 611/611 [02:32<00:00,  3.99ba/s]
100%|██████████| 2/2 [00:00<00:00,  3.76ba/s]
100%|██████████| 2/2 [00:00<00:00,  3.89ba/s]


Even better, the results are automatically cached, so they are not recomputed the next time you run this step. The 🤗 Datasets library checks the function and the arguments passed to `map` to detect changes: if nothing has changed, the cached data is reused; if something has changed, the data is reprocessed. Be aware that this detection can miss a change, in which case stale cached results are reused. To force the preprocessing to run again, pass `load_from_cache_file=False` to `map`. In addition, `batched=True` makes `map` pass the samples to the preprocessing function in batches, which lets the tokenizer process the texts in a batch in parallel using multiple threads.
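The batch counts in the progress bars above (611, 2, and 2) can be reproduced with a small sketch of how a batched map slices each split. It assumes a batch size of 1000 rows, which is the library default when `batch_size` is not passed to `map`; the helper name `iter_batches` is made up for illustration.

```python
def iter_batches(num_rows, batch_size=1000):
    """Yield (start, end) row ranges the way a batched map would slice a split."""
    for start in range(0, num_rows, batch_size):
        yield start, min(start + batch_size, num_rows)

# Split sizes taken from the DatasetDict printed earlier.
split_sizes = {"train": 610320, "validation": 1999, "test": 1999}
batches_per_split = {
    name: sum(1 for _ in iter_batches(num_rows))
    for name, num_rows in split_sizes.items()
}
print(batches_per_split)
```

The last batch of each split is simply shorter than the others, for example 320 rows for the training split.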

## Fine-tune the transformer model

Now that the data is ready, we need to download and load our pre-trained model, and then fine-tune it. Since this is a seq2seq task, we need a model class suited to it: `AutoModelForSeq2SeqLM`. As with the tokenizer, the `from_pretrained` method downloads and loads the model, and caches it so that it is not downloaded repeatedly.

In [19]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Downloading: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 301M/301M [00:19<00:00, 15.1MB/s]


Since our fine-tuning task is machine translation and we load a pre-trained seq2seq model, loading the model does not produce a warning that some mismatched weights were discarded. (Such a warning appears, for example, when the head of a pre-trained language model is thrown away and a new task-specific head is randomly initialized.)

To create a `Seq2SeqTrainer`, we need three more elements, the most important of which is the training configuration, [`Seq2SeqTrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Seq2SeqTrainingArguments). It contains all the attributes that define the training process.

In [20]:
batch_size = 16
args = Seq2SeqTrainingArguments(
    "test-translation",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=False,
)

The `evaluation_strategy = "epoch"` argument above tells the training code to run evaluation on the validation set once per epoch.

`batch_size` is defined at the top of the cell above.

Since our dataset is large and `Seq2SeqTrainer` saves checkpoints regularly, we set `save_total_limit=3` so that at most 3 checkpoints are kept.
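A quick back-of-the-envelope calculation shows why limiting the number of saved checkpoints matters for a dataset of this size. The sketch below assumes a single device, no gradient accumulation, and a checkpoint every 500 optimizer steps; that `save_steps` value is an assumption, since the notebook does not set it explicitly.

```python
import math

num_train_rows = 610320   # size of the training split shown earlier
batch_size = 16           # per-device batch size from the arguments above
num_train_epochs = 1
save_steps = 500          # assumption: a checkpoint every 500 steps
save_total_limit = 3

steps_per_epoch = math.ceil(num_train_rows / batch_size)
total_steps = steps_per_epoch * num_train_epochs
checkpoints_written = total_steps // save_steps
checkpoints_kept = min(checkpoints_written, save_total_limit)

print(f"steps per epoch:     {steps_per_epoch}")
print(f"checkpoints written: {checkpoints_written}")
print(f"checkpoints kept:    {checkpoints_kept}")
```

Under these assumptions a single epoch would write dozens of checkpoints, of which only the most recent three remain on disk.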

Finally, we need a data collator to feed our processed input to the model.

In [21]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
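Conceptually, the collator pads every sequence in a batch to the length of the longest one. The following is a simplified sketch in plain Python, not the library's implementation: it pads on the right, uses a hypothetical `pad_token_id` of 99, fills the labels with -100 so that padded positions can be ignored by the loss, and leaves out the other things the real collator does, such as returning tensors.

```python
def pad_batch(features, pad_token_id, label_pad_token_id=-100):
    """Right-pad input_ids, attention_mask and labels to the batch maximum."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        input_pad = max_input - len(f["input_ids"])
        label_pad = max_label - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * input_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * input_pad)
        batch["labels"].append(f["labels"] + [label_pad_token_id] * label_pad)
    return batch

# Toy features with made-up token IDs; 0 plays the role of the end-of-sequence ID.
features = [
    {"input_ids": [5, 6, 0], "attention_mask": [1, 1, 1], "labels": [11, 0]},
    {"input_ids": [7, 8, 9, 10, 0], "attention_mask": [1, 1, 1, 1, 1],
     "labels": [12, 13, 14, 0]},
]
padded = pad_batch(features, pad_token_id=99)
print(padded)
```

Padding per batch rather than to a fixed global length keeps short batches small, which is why the preprocessing step could leave the sequences unpadded.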

The last thing needed to set up the `Seq2SeqTrainer` is the evaluation method. We use the `metric` loaded earlier for this, and apply some post-processing to the model predictions before passing them to the metric:

In [22]:
import numpy as np

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
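Two steps in `compute_metrics` are easy to check by hand: replacing the -100 label fill value with the padding ID before decoding, and measuring the generated length as the number of non-padding tokens. Here is a minimal sketch with plain Python lists instead of NumPy arrays; the padding ID 99, the token IDs, and both helper names are made up for illustration.

```python
def restore_padding(labels, pad_token_id, ignore_index=-100):
    """Replace the ignore index with the pad ID so the labels can be decoded."""
    return [
        [pad_token_id if token == ignore_index else token for token in seq]
        for seq in labels
    ]

def mean_generated_length(preds, pad_token_id):
    """Average number of non-padding tokens per predicted sequence."""
    lengths = [sum(1 for token in seq if token != pad_token_id) for seq in preds]
    return sum(lengths) / len(lengths)

PAD_ID = 99  # hypothetical padding ID
labels = [[11, 0, -100, -100], [12, 13, 14, 0]]
preds = [[99, 11, 0, 99], [99, 12, 13, 0]]

restored = restore_padding(labels, PAD_ID)
gen_len = mean_generated_length(preds, PAD_ID)
print(restored, gen_len)
```

Once the fill value is replaced, decoding with `skip_special_tokens=True` drops the padding tokens, so they do not appear in the reference text.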

Finally, pass all the arguments, data, and the model to `Seq2SeqTrainer`:

In [23]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Call the `train` method to perform fine-tuning training.

In [None]:
trainer.train()

Finally, don't forget to check out how to upload your model to the [🤗 Model Hub](https://huggingface.co/models). You can then use your model simply by its name, just like the checkpoint at the beginning of this notebook.