# Translation


Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
%%capture
!pip install datasets evaluate transformers[sentencepiece]

Also log into Hugging Face.

In [53]:
from huggingface_hub import notebook_login

notebook_login()

Let's now dive into translation. This is another [sequence-to-sequence task](/course/chapter1/7), which means it's a problem that can be formulated as going from one sequence to another. In that sense the problem is pretty close to [summarization](/course/chapter7/6), and you could adapt what we will see here to other sequence-to-sequence problems such as:

- **Style transfer**: Creating a model that *translates* texts written in a certain style to another (e.g., formal to casual or Shakespearean English to modern English)
- **Generative question answering**: Creating a model that generates answers to questions, given a context

If you have a big enough corpus of texts in two (or more) languages, you can train a new translation model from scratch like we will in the section on [causal language modeling](/course/chapter7/6). It will be faster, however, to fine-tune an existing translation model, be it a multilingual one like mT5 or mBART that you want to fine-tune to a specific language pair, or even a model specialized for translation from one language to another that you want to fine-tune to your specific corpus.

In this section, we will fine-tune a Marian model pretrained to translate from English to French (since a lot of Hugging Face employees speak both those languages) on the [KDE4 dataset](https://huggingface.co/datasets/kde4), which is a dataset of localized files for the [KDE apps](https://apps.kde.org/). The model we will use has been pretrained on a large corpus of French and English texts taken from the [Opus dataset](https://opus.nlpl.eu/), which actually contains the KDE4 dataset. But even if the pretrained model we use has seen that data during its pretraining, we will see that we can get a better version of it after fine-tuning.

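Once we're finished, we will have a model able to translate sentences like "This plugin allows you to automatically translate web pages between several languages." from English to French.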

As in the previous sections, you can find the actual model that we'll train and upload to the Hub using the code below and double-check its predictions [here](https://huggingface.co/huggingface-course/marian-finetuned-kde4-en-to-fr?text=This+plugin+allows+you+to+automatically+translate+web+pages+between+several+languages.).

## Preparing the data

To fine-tune or train a translation model from scratch, we will need a dataset suitable for the task. As mentioned previously, we'll use the [KDE4 dataset](https://huggingface.co/datasets/kde4) in this section, but you can adapt the code to use your own data quite easily, as long as you have pairs of sentences in the two languages you want to translate from and into. Refer back to [Chapter 5](/course/chapter5) if you need a reminder of how to load your custom data in a `Dataset`.

### The KDE4 dataset

Here we download the English-French parallel corpus directly from OPUS and build a `Dataset` from the two line-aligned text files:

In [3]:
%%capture
# Download the parallel corpus
!wget https://object.pouta.csc.fi/OPUS-KDE4/v2/moses/en-fr.txt.zip
!unzip en-fr.txt.zip

In [4]:
from datasets import Dataset, DatasetDict

with open('KDE4.en-fr.en', 'r', encoding='utf-8') as f:
    en_lines = [line.strip() for line in f]

with open('KDE4.en-fr.fr', 'r', encoding='utf-8') as f:
    fr_lines = [line.strip() for line in f]

dataset = Dataset.from_dict({
    'id': list(range(len(en_lines))),
    'translation': [
        {'en': en, 'fr': fr}
        for en, fr in zip(en_lines, fr_lines)
    ]
})

# Wrap in DatasetDict
raw_datasets = DatasetDict({
    'train': dataset
})
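
Note that `zip()` stops silently at the shorter of its inputs, so a misaligned pair of files would go unnoticed in the cell above. A minimal sketch of a stricter loader, assuming two line-aligned text files (the helper name `read_parallel` and the tiny temporary files are hypothetical, used only for illustration):

```python
import tempfile
from pathlib import Path


def read_parallel(src_path, tgt_path, src_lang="en", tgt_lang="fr"):
    """Read two line-aligned files into a list of translation dicts."""
    with open(src_path, encoding="utf-8") as f:
        src_lines = [line.strip() for line in f]
    with open(tgt_path, encoding="utf-8") as f:
        tgt_lines = [line.strip() for line in f]
    # Fail loudly instead of letting zip() drop unmatched lines
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"Line count mismatch: {len(src_lines)} vs {len(tgt_lines)}"
        )
    return [{src_lang: s, tgt_lang: t} for s, t in zip(src_lines, tgt_lines)]


# Tiny stand-in files so the sketch runs without the real corpus
tmp_dir = Path(tempfile.mkdtemp())
(tmp_dir / "toy.en").write_text("Open file\nClose file\n", encoding="utf-8")
(tmp_dir / "toy.fr").write_text(
    "Ouvrir le fichier\nFermer le fichier\n", encoding="utf-8"
)

pairs = read_parallel(tmp_dir / "toy.en", tmp_dir / "toy.fr")
print(pairs[0])
```

The resulting list has the same shape as the `translation` column built above, so it could be passed to `Dataset.from_dict()` in the same way.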

If you want to work with a different pair of languages, you can specify them by their codes. A total of 92 languages are available for this dataset; you can see them all by expanding the language tags on its [dataset card](https://huggingface.co/datasets/kde4).

Let's have a look at the dataset:

In [5]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 210173
    })
})

We have 210,173 pairs of sentences, but in one single split, so we will need to create our own validation set. As we saw in [Chapter 5](/course/chapter5), a `Dataset` has a `train_test_split()` method that can help us. We'll provide a seed for reproducibility:

In [6]:
split_datasets = raw_datasets["train"].train_test_split(train_size=0.9, seed=20)
split_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 189155
    })
    test: Dataset({
        features: ['id', 'translation'],
        num_rows: 21018
    })
})
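
The split sizes above follow directly from the 90/10 ratio. As a rough illustration of what a seeded split does, here is a plain-Python sketch. It is not the 🤗 Datasets implementation, `seeded_split` is a hypothetical helper, and the rows it selects will differ from the ones the library picks:

```python
import random


def seeded_split(n_rows, train_fraction=0.9, seed=20):
    """Shuffle row indices reproducibly, then cut them into two parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same permutation
    n_train = int(n_rows * train_fraction)  # round the training share down
    return indices[:n_train], indices[n_train:]


train_idx, valid_idx = seeded_split(210173)
print(len(train_idx), len(valid_idx))  # 189155 21018
```

Because the shuffle is driven by a fixed seed, rerunning the cell gives the same two index lists, which is the property that makes the split reproducible.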

We can rename the `"test"` key to `"validation"` like this:

In [7]:
split_datasets["validation"] = split_datasets.pop("test")

Now let's take a look at one element of the dataset:

In [8]:
split_datasets["train"][1]["translation"]

{'en': 'Default to expanded threads',
 'fr': 'Par défaut, développer les fils de discussion'}

We get a dictionary with two sentences, one in each language of the pair. One particularity of this dataset, which is full of technical computer science terms, is that those terms are all fully translated into French. However, French engineers leave most computer science-specific words in English when they talk. Here, for instance, the word "threads" might well appear in a French sentence, especially in a technical conversation; but in this dataset it has been translated into the more correct "fils de discussion." The pretrained model we use, which has been pretrained on a larger corpus of French and English sentences, takes the easier option of leaving the word as is:

In [9]:
%%capture
!pip install sacremoses

In [11]:
from transformers import pipeline

model_checkpoint = "Helsinki-NLP/opus-mt-en-fr"
translator = pipeline("translation", model=model_checkpoint)
translator("Default to expanded threads")

Device set to use cuda:0


[{'translation_text': 'Par défaut pour les threads élargis'}]

Another example of this behavior can be seen with the word "plugin," which isn't officially a French word but which most native speakers will understand and not bother to translate. In the KDE4 dataset this word has been translated into the more official "module d'extension":

In [12]:
split_datasets["train"][172]["translation"]

{'en': 'Unable to import %1 using the OFX importer plugin. This file is not the correct format.',
 'fr': "Impossible d'importer %1 en utilisant le module d'extension d'importation OFX. Ce fichier n'a pas un format correct."}

Our pretrained model, however, sticks with the compact and familiar English word:

In [13]:
translator(
    "Unable to import %1 using the OFX importer plugin. This file is not the correct format."
)

[{'translation_text': "Impossible d'importer %1 en utilisant le plugin d'importateur OFX. Ce fichier n'est pas le bon format."}]

It will be interesting to see if our fine-tuned model picks up on those particularities of the dataset (spoiler alert: it will).

> **✏️ Your turn!** Another English word that is often used in French is "email." Find the first sample in the training dataset that uses this word. How is it translated? How does the pretrained model translate the same English sentence?

In [14]:
for idx, example in enumerate(split_datasets["train"]):
    if "email" in example["translation"]["en"].lower():
        print(f"Found at index {idx}")
        print(f"English: {example['translation']['en']}")
        print(f"French translation in dataset: {example['translation']['fr']}")
        first_email_sentence = example["translation"]["en"]
        break

print("\nPretrained model translation:")
print(translator(first_email_sentence))

Found at index 356
English: Sends the chart as an email attachment.
French translation in dataset: Envoie le diagramme comme pièce jointe.

Pretrained model translation:
[{'translation_text': 'Envoie le tableau comme pièce jointe par courriel.'}]


In [15]:
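# Note: this reloads the same pretrained checkpoint, so the output matches the one above.
# To test a fine-tuned model, pass its Hub repository ID as `model` instead.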
finetuned_translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
print(finetuned_translator(first_email_sentence))

Device set to use cuda:0


[{'translation_text': 'Envoie le tableau comme pièce jointe par courriel.'}]


### Processing the data

You should know the drill by now: all the texts need to be converted to token IDs so that the model can make sense of them. For this task, we will need to tokenize both the inputs and the targets. Our first task is to create a `tokenizer` object. As mentioned before, we will be using a pretrained Marian English to French model. If you are trying this code with another pair of languages, make sure to adapt the model checkpoint. The [Helsinki-NLP](https://huggingface.co/Helsinki-NLP) organization provides more than a thousand models in multiple languages.

In [16]:
from transformers import AutoTokenizer

model_checkpoint = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

You can replace the `model_checkpoint` with any other model you prefer from the [Hub](https://huggingface.co/models), or with a local folder where you've saved a pretrained model and a tokenizer.

One thing to note is that multilingual tokenizers, such as those of mBART, mBART-50, or M2M100, need to know the language of the inputs and outputs, which you set via the tokenizer's `src_lang` and `tgt_lang` attributes. The Marian checkpoint we use here covers a single language pair (English to French), so no language codes are needed and the tokenizer adds no language token to the sentences. Let's look at an example:

In [17]:
en_sentence = split_datasets["train"][1]["translation"]["en"]
fr_sentence = split_datasets["train"][1]["translation"]["fr"]

inputs = tokenizer(en_sentence, text_target=fr_sentence)
inputs

{'input_ids': [47591, 12, 9842, 19634, 9, 0], 'attention_mask': [1, 1, 1, 1, 1, 1], 'labels': [577, 5891, 2, 3184, 16, 2542, 5, 1710, 0]}

As we can see, the English sentence was converted to the familiar `input_ids` and `attention_mask`, and the French sentence was encoded in the `labels` key. Just to make sure we have the right input, let's decode these inputs and labels back:

In [18]:
print(tokenizer.decode(inputs["input_ids"]))
print(tokenizer.decode(inputs["labels"]))

Default to expanded threads</s>
Par défaut, développer les fils de discussion</s>


We can see our special token, `</s>`, at the end of each sentence, but no special token at the beginning of the French sentence. That's because our Marian model requires only the end of sentence special token, and not a starting one. If we needed to have a special start token for this particular model, we would have needed to pass `forced_bos_token_id` to the model's `generation_config` when initializing it.

Now let's look at the special case of the labels. During training, there is a subtlety: padded positions in the labels must not contribute to the loss. The data collator we will use below handles this by padding the labels with `-100`, which is the index ignored by default when computing the loss (see the [PyTorch documentation](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)). When we evaluate the model, we will have to replace those `-100` values with the padding token ID before decoding the labels back to text.

Don't worry about all of this if you don't understand it yet; it will become clear when we write our training loop!
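
To make the role of `-100` concrete, here is a plain-Python sketch of a cross-entropy loss that skips ignored positions. It mirrors the idea behind PyTorch's `ignore_index`, but it is an illustration with made-up logits rather than the library code, and `masked_cross_entropy` is a hypothetical helper:

```python
import math

IGNORE_INDEX = -100


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padded position: contributes nothing to the loss
        log_norm = math.log(sum(math.exp(s) for s in scores))
        total += log_norm - scores[label]  # -log softmax(scores)[label]
        count += 1
    return total / count


# Toy vocabulary of 3 tokens; the last two positions are padding
logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.0, 0.0, 9.0], [7.0, 0.0, 0.0]]
labels = [0, 1, IGNORE_INDEX, IGNORE_INDEX]

loss_padded = masked_cross_entropy(logits, labels)
loss_unpadded = masked_cross_entropy(logits[:2], labels[:2])
print(loss_padded == loss_unpadded)  # True
```

Whatever the model predicts at the padded positions, the loss is the same as if those positions did not exist.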

The last thing to know is that we can set a `max_length` for our inputs and labels. For translation, we might receive very long sentences, especially from our dataset. Here we truncate to 128 tokens, which is long enough to contain the majority of our inputs and outputs. We don't pad at this stage; padding is left to the data collator, which pads each batch dynamically:

In [19]:
max_length = 128


def preprocess_function(examples):
    inputs = [ex["en"] for ex in examples["translation"]]
    targets = [ex["fr"] for ex in examples["translation"]]
    model_inputs = tokenizer(
        inputs, text_target=targets, max_length=max_length, truncation=True
    )
    return model_inputs

Note that we set the `text_target` argument when calling the tokenizer; this ensures the targets are tokenized properly. Also note that `tokenizer` can process lists of sentences, so we can feed it all the inputs and targets at once. This will make the preprocessing much faster, and we can call it on our whole dataset with the `batched=True` option:

In [20]:
tokenized_datasets = split_datasets.map(
    preprocess_function,
    batched=True,
    remove_columns=split_datasets["train"].column_names,
)

Now that the data has been preprocessed, we are ready to fine-tune our pretrained model!

## Fine-tuning the model with the Trainer API

The actual code using the `Trainer` will be the same as before, with just one little change: we use a `Seq2SeqTrainer` here, which is a subclass of `Trainer` that will allow us to properly use the `generate()` method for evaluation. We'll go into this in more detail when we talk about the metric computation.

### Data collation

We will need a data collator to deal with the padding for dynamic batching. We can't just use a `DataCollatorWithPadding` like in [Chapter 3](/course/chapter3) here, since that only pads the inputs (input IDs, attention mask, and token type IDs). Our labels should also be padded to the maximum length encountered in the labels, and the padding value should be `-100` rather than the tokenizer's padding token ID, so that the corresponding predictions are ignored in the loss. This is all done by a `DataCollatorForSeq2Seq`. Like the `DataCollatorWithPadding`, it takes the `tokenizer` used to preprocess the inputs, but it also takes the `model`, because it is responsible for preparing the `decoder_input_ids` as well:

In [21]:
from transformers import DataCollatorForSeq2Seq, AutoModelForSeq2SeqLM

model_checkpoint = "Helsinki-NLP/opus-mt-en-fr"
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

To test this on a few samples, we just call it on a list of examples from our tokenized training set:

In [22]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(1, 3)])
batch.keys()

KeysView({'input_ids': tensor([[47591,    12,  9842, 19634,     9,     0, 59513, 59513, 59513, 59513,
         59513, 59513, 59513, 59513, 59513],
        [ 1211,     3,    49,  9409,  1211,     3, 29140,   817,  3124,   817,
         28149,   139, 33712, 25218,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'labels': tensor([[  577,  5891,     2,  3184,    16,  2542,     5,  1710,     0,  -100,
          -100,  -100,  -100,  -100,  -100,  -100],
        [ 1211,     3,    49,  9409,  1211,     3, 29140,   817,  3124,   817,
           550,  7032,  5821,  7907, 12649,     0]]), 'decoder_input_ids': tensor([[59513,   577,  5891,     2,  3184,    16,  2542,     5,  1710,     0,
         59513, 59513, 59513, 59513, 59513, 59513],
        [59513,  1211,     3,    49,  9409,  1211,     3, 29140,   817,  3124,
           817,   550,  7032,  5821,  7907, 12649]])})

We can check that the labels have been padded to the maximum length of the batch, using `-100`:

In [23]:
batch["labels"]

tensor([[  577,  5891,     2,  3184,    16,  2542,     5,  1710,     0,  -100,
          -100,  -100,  -100,  -100,  -100,  -100],
        [ 1211,     3,    49,  9409,  1211,     3, 29140,   817,  3124,   817,
           550,  7032,  5821,  7907, 12649,     0]])

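We can also have a look at the `decoder_input_ids` to see that they are the labels shifted one position to the right, starting with the decoder start token (here the padding token, ID 59513) and with `-100` replaced by the padding token ID: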

In [24]:
batch["decoder_input_ids"]

tensor([[59513,   577,  5891,     2,  3184,    16,  2542,     5,  1710,     0,
         59513, 59513, 59513, 59513, 59513, 59513],
        [59513,  1211,     3,    49,  9409,  1211,     3, 29140,   817,  3124,
           817,   550,  7032,  5821,  7907, 12649]])
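
The relationship between the two tensors can be reproduced by hand. The sketch below pads the labels with `-100` and then builds the decoder inputs by shifting the labels one position to the right. It assumes, as the outputs above suggest, that the decoder start token is the padding token (ID 59513). The helper names are hypothetical, and this is an illustration rather than the collator's actual code:

```python
PAD_TOKEN_ID = 59513  # padding ID seen in the batch above; assumed to start decoding too
LABEL_PAD_ID = -100


def pad_labels(labels, target_length, pad_value=LABEL_PAD_ID):
    """Right-pad a list of label IDs with the ignore value."""
    return labels + [pad_value] * (target_length - len(labels))


def shift_right(labels, start_token_id=PAD_TOKEN_ID, pad_token_id=PAD_TOKEN_ID):
    """Prepend the start token, drop the last label, and replace -100 by padding."""
    shifted = [start_token_id] + labels[:-1]
    return [pad_token_id if tok == LABEL_PAD_ID else tok for tok in shifted]


labels = [577, 5891, 2, 3184, 16, 2542, 5, 1710, 0]
padded = pad_labels(labels, target_length=16)
decoder_input_ids = shift_right(padded)
print(padded)
print(decoder_input_ids)
```

The two printed lists match the first rows of `batch["labels"]` and `batch["decoder_input_ids"]` shown above.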

Here are the decoded labels for the two examples in our batch:

In [25]:
for i in range(1, 3):
    print(tokenizer.decode(tokenized_datasets["train"][i]["labels"]))

Par défaut, développer les fils de discussion</s>
*. ui *. UI_BAR_Fichiers interface utilisateur</s>


### Metrics

The last thing to do before training is to define how to compute metrics for our model. For translation, the conventional metric is the [BLEU score](https://en.wikipedia.org/wiki/BLEU), which was introduced in a 2002 [article](https://aclanthology.org/P02-1040.pdf) by Kishore Papineni et al. The BLEU score evaluates how close translations are to their labels. It does not measure the intelligibility or grammatical correctness of the model's outputs, but uses statistical rules to ensure that all the words in the generated outputs also appear in the targets. In addition, there are rules that penalize repetitions of words that are not also repeated in the targets (to avoid the model outputting sentences like "the the the the the") and outputs that are shorter than the ones in the targets (to avoid the model outputting sentences like "the").

One weakness of BLEU is that it expects the text to already be tokenized, which makes it difficult to compare scores between models that use different tokenizers. So the most commonly used way to evaluate translation models today is [SacreBLEU](https://github.com/mjpost/sacrebleu), which addresses this weakness (and others) by standardizing the tokenization step. To use this metric, we first need to install the SacreBLEU library:

In [26]:
%%capture
!pip install sacrebleu

We can then load it via `evaluate.load()` like we did in [Chapter 3](/course/chapter3):

In [27]:
import evaluate

metric = evaluate.load("sacrebleu")

This metric takes texts as inputs and targets. It's designed to accept several acceptable targets, as there is often more than one acceptable translation; here we are only providing one, but nothing stops you from augmenting your dataset with several translations for each input sentence in the future. The predictions should be a list of decoded strings; the references should be a list of lists of decoded strings:

In [28]:
predictions = [
    "This plugin lets you translate web pages between several languages automatically."
]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)

{'score': 46.750469682990186,
 'counts': [11, 6, 4, 3],
 'totals': [12, 11, 10, 9],
 'precisions': [91.66666666666667,
  54.54545454545455,
  40.0,
  33.333333333333336],
 'bp': 0.9200444146293233,
 'sys_len': 12,
 'ref_len': 13}

This gets a BLEU score of 46.75, which is pretty good -- for reference, the original Transformer model in the ["Attention Is All You Need"](https://arxiv.org/pdf/1706.03762.pdf) paper achieved a BLEU score of 41.8 on a similar translation task from English to French! (For more information on the individual metrics, such as `counts` and `bp`, refer to the [SacreBLEU repository](https://github.com/mjpost/sacrebleu/blob/078c440168c6adc89ba75fe6d63f0d922d42bcfe/sacrebleu/metrics/bleu.py#L74).) On the other hand, if we try with the two bad predictions (many repetitions or too short), we will get rather bad BLEU scores:

In [29]:
predictions = ["This This This This"]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)

{'score': 1.683602693167689,
 'counts': [1, 0, 0, 0],
 'totals': [4, 3, 2, 1],
 'precisions': [25.0, 16.666666666666668, 12.5, 12.5],
 'bp': 0.10539922456186433,
 'sys_len': 4,
 'ref_len': 13}

In [30]:
predictions = ["This plugin"]
references = [
    [
        "This plugin allows you to automatically translate web pages between several languages."
    ]
]
metric.compute(predictions=predictions, references=references)

{'score': 0.0,
 'counts': [2, 1, 0, 0],
 'totals': [2, 1, 0, 0],
 'precisions': [100.0, 100.0, 0.0, 0.0],
 'bp': 0.004086771438464067,
 'sys_len': 2,
 'ref_len': 13}

The score can go from 0 to 100, and higher is better.
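
To make the `counts`, `totals`, and `bp` fields in the outputs above concrete, here is a simplified, plain-Python sketch of BLEU's clipped n-gram precision and brevity penalty. It is not SacreBLEU's implementation: the tokenizer below only splits off periods, no smoothing is applied, and all helper names are made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Crude stand-in for SacreBLEU's tokenizer: split off periods only
    return text.replace(".", " .").split()


def ngrams(tokens, n):
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def bleu_components(prediction, reference, max_n=4):
    pred, ref = tokenize(prediction), tokenize(reference)
    counts, totals = [], []
    for n in range(1, max_n + 1):
        pred_ngrams, ref_ngrams = ngrams(pred, n), ngrams(ref, n)
        # Clipping: an n-gram is credited at most as often as it occurs in the reference
        counts.append(sum(min(c, ref_ngrams[g]) for g, c in pred_ngrams.items()))
        totals.append(sum(pred_ngrams.values()))
    # Brevity penalty: only predictions shorter than the reference are penalized
    bp = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    return counts, totals, bp


def bleu_score(counts, totals, bp):
    if min(counts) == 0:
        return 0.0  # this sketch applies no smoothing
    log_precisions = [math.log(c / t) for c, t in zip(counts, totals)]
    return 100 * bp * math.exp(sum(log_precisions) / len(log_precisions))


reference = "This plugin allows you to automatically translate web pages between several languages."
prediction = "This plugin lets you translate web pages between several languages automatically."

counts, totals, bp = bleu_components(prediction, reference)
print(counts, totals)  # [11, 6, 4, 3] [12, 11, 10, 9]
print(round(bleu_score(counts, totals, bp), 2))  # 46.75
```

Running `bleu_components("This This This This", reference)` shows the clipping at work: only one of the four repeated unigrams is credited, and the short length drives the brevity penalty down to about 0.105.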

In the `Trainer`, the `compute_metrics` function needs to take a tuple with the predictions and the labels and should return a dictionary with string keys (the names of the metrics) and float values (the values of the metrics). Since we will evaluate with `predict_with_generate=True`, the predictions are token IDs produced by `generate()`, which we can decode directly with `tokenizer.batch_decode()`. If the predictions come back as a tuple with several arrays, we keep only the first one. Then, as usual, we need to convert `-100` in the labels to the padding token before decoding:

In [31]:
import numpy as np


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the predictions come back as a tuple, keep only the first element
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}
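
The reason for the `np.where()` step is that `-100` is not a valid token ID, so it cannot be looked up in the vocabulary. A toy sketch with a hypothetical four-token vocabulary and a made-up `toy_decode()` function (neither comes from the Transformers library) shows the idea:

```python
PAD_ID = 0
toy_vocab = {0: "<pad>", 1: "</s>", 2: "Par", 3: "défaut"}  # hypothetical vocabulary
SPECIAL_TOKENS = {"<pad>", "</s>"}


def toy_decode(ids, skip_special_tokens=True):
    tokens = [toy_vocab[i] for i in ids]  # raises KeyError for -100
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    return " ".join(tokens)


labels = [2, 3, 1, -100, -100]
# Same idea as np.where(labels != -100, labels, pad_token_id)
cleaned = [tok if tok != -100 else PAD_ID for tok in labels]
print(toy_decode(cleaned))  # Par défaut
```

Once the `-100` values are mapped to the padding token, skipping special tokens removes them from the decoded text.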

Now that this is done, we are almost ready to define our `Seq2SeqTrainer`. We just need a `model` to fine-tune!

### Fine-tuning the model

First, we need the actual model. We use the usual `AutoModelForXxx` API, and in this case, we want a model with a sequence-to-sequence language modeling head, so `AutoModelForSeq2SeqLM`:

In [32]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Note that this time, we can use the same checkpoint for the tokenizer and the model. After all, we are trying to fine-tune the model on a new dataset, not replace the tokenizer.

Next, we define the `Seq2SeqTrainingArguments`. This subclass of the `TrainingArguments` we saw before contains all the hyperparameters we will pass to our `Seq2SeqTrainer`. The main ones are:

- `save_strategy="epoch"`: We will save a checkpoint at the end of each epoch
- `learning_rate=2e-5`: The learning rate
- `per_device_train_batch_size=32`: The batch size for training
- `per_device_eval_batch_size=64`: The batch size for evaluation
- `weight_decay=0.01`: Weight decay
- `save_total_limit=3`: We'll keep only the last 3 checkpoints
- `num_train_epochs=3`: We'll train for 3 epochs
- `predict_with_generate=True`: Use generation for predictions during evaluation
- `fp16=True`: Use mixed-precision training on GPU
- `push_to_hub=True`: Push the model to the Hugging Face Hub
- `report_to="none"`: Disable reporting to external loggers such as Weights & Biases

We don't set an evaluation strategy, so no evaluation runs during training; instead, we will call `trainer.evaluate()` ourselves before and after training.

Note the new argument specific to `Seq2SeqTrainingArguments`:

- `predict_with_generate`: To evaluate properly and compute the BLEU metric, we will need to tell the `Seq2SeqTrainer` to use the model's `generate()` method during evaluation

We can optionally specify a `generation_max_length` to cap the length of the predictions generated during evaluation. Here we leave it unset and pass `max_length` directly to `trainer.evaluate()` later on:

In [33]:
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    "marian-finetuned-kde4-en-to-fr",
    # No evaluation strategy is set: we evaluate manually before and after training
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    predict_with_generate=True,
    fp16=True,
    push_to_hub=True,
    report_to="none",  # Don't rely on wandb
)

Then we pass them all to the `Seq2SeqTrainer`:

In [34]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Before training, let's look at the score our model gets, to double-check we aren't making things worse with our fine-tuning:

In [None]:
trainer.evaluate(max_length=max_length)

{'eval_loss': 1.6967031955718994,
 'eval_model_preparation_time': 0.003,
 'eval_bleu': 39.26927837507845,
 'eval_runtime': 1026.5062,
 'eval_samples_per_second': 20.475,
 'eval_steps_per_second': 0.321}

A BLEU score of 39.27 is pretty decent, which means our model is already doing a good job of translating English sentences to French ones.

Next is training, which may take some time:

In [None]:
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None}.


| Step | Training Loss |
|------|---------------|
| 500 | 1.3874 |
| 1000 | 1.2296 |
| 1500 | 1.196 |
| 2000 | 1.1373 |
| 2500 | 1.1102 |
| 3000 | 1.079 |
| 3500 | 1.0443 |
| 4000 | 1.0321 |
| 4500 | 1.0381 |
| 5000 | 1.0108 |




TrainOutput(global_step=17736, training_loss=0.9366313640093771, metrics={'train_runtime': 2856.8315, 'train_samples_per_second': 198.634, 'train_steps_per_second': 6.208, 'total_flos': 1.1353213104095232e+16, 'train_loss': 0.9366313640093771, 'epoch': 3.0})
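
The `global_step` reported above can be checked with a little arithmetic. This sketch assumes a single device and no gradient accumulation, so that each optimizer step consumes one batch of 32 examples:

```python
import math

num_train_examples = 189155
per_device_train_batch_size = 32
num_train_epochs = 3

# The last batch of each epoch is smaller, hence the ceiling
steps_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)  # 5912 17736
```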

Note that while the training happens, each time the model is saved (here, every epoch) it is uploaded to the Hub in the background. This way, you will be able to resume your training on another machine if necessary.

Once the training is done, we evaluate one last time:

In [None]:
trainer.evaluate(max_length=max_length)

{'eval_loss': 0.8554979562759399,
 'eval_model_preparation_time': 0.003,
 'eval_bleu': 52.950677422618696,
 'eval_runtime': 1018.7041,
 'eval_samples_per_second': 20.632,
 'eval_steps_per_second': 0.323,
 'epoch': 3.0}

That's a nice improvement: the BLEU score rises from 39.27 to 52.95, a gain of almost 14 points, which validates that we were able to fine-tune our model successfully.

Finally, we use the `push_to_hub()` method to make sure we upload the latest version of the model:

In [None]:
trainer.push_to_hub(tags="translation", commit_message="Training complete")

CommitInfo(commit_url='https://huggingface.co/Axion004/marian-finetuned-kde4-en-to-fr/commit/9225bd37b7b331ec7540eae62672adbc844097b6', commit_message='Training complete', commit_description='', oid='9225bd37b7b331ec7540eae62672adbc844097b6', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Axion004/marian-finetuned-kde4-en-to-fr', endpoint='https://huggingface.co', repo_type='model', repo_id='Axion004/marian-finetuned-kde4-en-to-fr'), pr_revision=None, pr_num=None)

This command returns a `CommitInfo` object containing the URL of the commit it just made, if you want to inspect it:

```python
'https://huggingface.co/Axion004/marian-finetuned-kde4-en-to-fr/commit/9225bd37b7b331ec7540eae62672adbc844097b6'
```

At this stage, you can use the inference widget on the Model Hub to test your model and share it with your friends. You have successfully fine-tuned a model on a translation task -- congratulations!

If you want to dive a little bit more deeply into the training loop, we will now show you how to do the same thing using 🤗 Accelerate.

## A custom training loop

Let's now take a look at the full training loop, so you can easily customize the parts you need. It will look a lot like what we did in [section 2](/course/chapter7/2) with a few changes for the evaluation.

### Preparing everything for training

First we need to build the `DataLoader`s from our datasets. We will reuse our `data_collator` as a `collate_fn` and shuffle the training set, but not the validation set:

In [35]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=64,
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], collate_fn=data_collator, batch_size=8
)

Next we reinstantiate our model, to make sure we're not continuing the fine-tuning from before but starting from the pretrained model again:

In [36]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

Then we will need an optimizer:

In [37]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

Once we have all those objects, we can send them to the `accelerator.prepare()` method:

In [38]:
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

Now that we have sent our `train_dataloader` to `accelerator.prepare()`, we can use its length to compute the number of training steps. Remember that we should always do this after preparing the dataloader, as that method will change the length of the `DataLoader`. We use a classic linear schedule from the learning rate to 0:

In [39]:
from transformers import get_scheduler

num_train_epochs = 2
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
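
As a rough picture of what the linear schedule does, the following plain-Python sketch computes the learning rate at a few steps. It is an illustration and not the Transformers implementation; `linear_lr` is a hypothetical helper, and the step count assumes a single process with the training batch size of 64 used above:

```python
import math


def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Linear warmup (if any) followed by a linear decay to 0."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)


steps_per_epoch = math.ceil(189155 / 64)  # assumes a single process
num_training_steps = 2 * steps_per_epoch

for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step, 2e-5, num_training_steps))
```

The learning rate starts at `2e-5`, reaches half that value midway through training, and ends at 0 on the last step.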

Lastly, to push our model to the Hub, we will need to create a `Repository` object in a working folder. First log in to the Hugging Face Hub, if you're not logged in already. We'll determine the repository name from the model ID we want to give our model (feel free to replace the `repo_name` with your own choice; it just needs to contain your username, which is what the function `get_full_repo_name()` does):

In [40]:
from huggingface_hub import Repository, get_full_repo_name

model_name = "marian-finetuned-kde4-en-to-fr-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name

'Axion004/marian-finetuned-kde4-en-to-fr-accelerate'

Then we create the repository on the Hub (if it doesn't exist yet) and clone it in a local folder. If the local folder already exists, it should be a clone of the repository we are working with:

In [41]:
from huggingface_hub import create_repo

create_repo(repo_name, exist_ok=True)

RepoUrl('https://huggingface.co/Axion004/marian-finetuned-kde4-en-to-fr-accelerate', endpoint='https://huggingface.co', repo_type='model', repo_id='Axion004/marian-finetuned-kde4-en-to-fr-accelerate')

In [42]:
from huggingface_hub import Repository, get_full_repo_name

model_name = "marian-finetuned-kde4-en-to-fr-accelerate"
repo_name = get_full_repo_name(model_name)
output_dir = "marian-finetuned-kde4-en-to-fr-accelerate"
repo = Repository(output_dir, clone_from=repo_name)

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
Cloning https://huggingface.co/Axion004/marian-finetuned-kde4-en-to-fr-accelerate into local empty directory.


We can now upload anything we save in `output_dir` by calling the `repo.push_to_hub()` method. This will help us upload the intermediate models at the end of each epoch.
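
Recent versions of `huggingface_hub` steer users away from the git-based `Repository` class and toward HTTP-based uploads (the link in the output above discusses the trade-offs). As a hedged sketch, the same end-of-epoch upload could go through `HfApi.upload_folder()`. The wrapper name `push_output_dir` and the repository ID are hypothetical, and the stub class exists only so the snippet runs without network access.

```python
def push_output_dir(api, output_dir, repo_id, epoch):
    """Upload everything in output_dir as one commit.

    `api` is assumed to be a huggingface_hub.HfApi instance.
    """
    return api.upload_folder(
        folder_path=output_dir,
        repo_id=repo_id,
        commit_message=f"Training in progress epoch {epoch}",
    )


class _RecordingApi:
    """Offline stand-in for HfApi that only records the arguments it receives."""

    def __init__(self):
        self.calls = []

    def upload_folder(self, **kwargs):
        self.calls.append(kwargs)
        return kwargs["commit_message"]


stub = _RecordingApi()
message = push_output_dir(
    stub,
    "marian-finetuned-kde4-en-to-fr-accelerate",
    "user/marian-finetuned-kde4-en-to-fr-accelerate",  # hypothetical repo ID
    epoch=0,
)
print(message)  # Training in progress epoch 0
```

With the real library, you would pass `HfApi()` instead of the stub. The rest of this section keeps using the `repo` object created above.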

### Training loop

We are now ready to write the full training loop. To simplify its evaluation part, we define this `postprocess()` function that takes predictions and labels and converts them to the lists of strings our `metric` object will expect:

In [43]:
def postprocess(predictions, labels):
    predictions = predictions.cpu().numpy()
    labels = labels.cpu().numpy()

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]
    return decoded_preds, decoded_labels
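
To see what the two less obvious steps do, here is a toy version that needs no tokenizer. Label positions set to -100 (which the loss ignores) are swapped for the pad ID before decoding, and each reference is wrapped in its own list because the metric accepts several references per prediction. The vocabulary, `PAD_ID`, and `toy_decode` are made up for illustration.

```python
PAD_ID = 0  # hypothetical pad token ID
VOCAB = {0: "", 1: "Bonjour", 2: "le", 3: "monde"}  # toy vocabulary


def toy_decode(ids):
    """Join the words for each ID, skipping padding (like skip_special_tokens)."""
    return " ".join(VOCAB[i] for i in ids if i != PAD_ID)


labels = [[1, 2, 3, -100, -100], [1, 3, -100, -100, -100]]

# Replace -100 with the pad ID so every position maps to a real token.
cleaned = [[tok if tok != -100 else PAD_ID for tok in row] for row in labels]

# One list of references per example, even when there is a single reference.
decoded_labels = [[toy_decode(row).strip()] for row in cleaned]
print(decoded_labels)  # [['Bonjour le monde'], ['Bonjour monde']]
```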

The training loop looks a lot like the ones in [section 2](/course/chapter7/2) and [Chapter 3](/course/chapter3), with a few differences in the evaluation part -- so let's focus on that!

The first thing to note is that we use the `generate()` method to compute predictions, but this is a method on our base model, not the wrapped model 🤗 Accelerate created in the `prepare()` method. That's why we unwrap the model first, then call this method.

The second thing is that, like with [token classification](/course/chapter7/2), two processes may have padded the inputs and labels to different shapes, so we use `accelerator.pad_across_processes()` to make the predictions and labels the same shape before calling the `gather()` method. If we don't do this, the evaluation will either error out or hang forever.
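
The padding step can be pictured with plain lists. In this sketch, two hypothetical processes hold generated batches of different widths; each is right-padded to the widest one so the rows can be stacked into a single rectangular batch. `pad_to_width` is an illustrative helper, not an 🤗 Accelerate function.

```python
def pad_to_width(batch, width, pad_index):
    """Right-pad every row of a batch to the requested width."""
    return [row + [pad_index] * (width - len(row)) for row in batch]


# Token IDs produced on two processes; the sequence lengths differ.
process_0 = [[5, 6, 7], [8, 9, 10]]
process_1 = [[11, 12, 13, 14, 15]]

# Find the longest sequence across all processes, then pad to it.
width = max(len(row) for batch in (process_0, process_1) for row in batch)
gathered = pad_to_width(process_0, width, 0) + pad_to_width(process_1, width, 0)
print(gathered)  # [[5, 6, 7, 0, 0], [8, 9, 10, 0, 0], [11, 12, 13, 14, 15]]
```

Without the padding, the rows would have different lengths and could not be combined into one tensor.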

In [44]:
from tqdm.auto import tqdm
import torch

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    for batch in tqdm(eval_dataloader):
        with torch.no_grad():
            generated_tokens = accelerator.unwrap_model(model).generate(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
                max_length=128,
            )
        labels = batch["labels"]

        # Pad predictions and labels to the same length so they can be gathered
        generated_tokens = accelerator.pad_across_processes(
            generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
        )
        labels = accelerator.pad_across_processes(labels, dim=1, pad_index=-100)

        predictions_gathered = accelerator.gather(generated_tokens)
        labels_gathered = accelerator.gather(labels)

        decoded_preds, decoded_labels = postprocess(predictions_gathered, labels_gathered)
        metric.add_batch(predictions=decoded_preds, references=decoded_labels)

    results = metric.compute()
    print(f"epoch {epoch}, BLEU score: {results['score']:.2f}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )

  0%|          | 0/5912 [00:00<?, ?it/s]

epoch 0, BLEU score: 50.41

epoch 1, BLEU score: 51.22

Several commits (2) will be pushed upstream.

Once this is done, you should have a model that has results pretty similar to the one trained with the `Seq2SeqTrainer`. You can check the one we trained using this code at [*huggingface-course/marian-finetuned-kde4-en-to-fr-accelerate*](https://huggingface.co/huggingface-course/marian-finetuned-kde4-en-to-fr-accelerate). And if you want to test out any tweaks to the training loop, you can directly implement them by editing the code shown above!

## Using the fine-tuned model

We've already shown you how you can use the model we fine-tuned on the Model Hub with the inference widget. To use it locally in a `pipeline`, we just have to specify the proper model identifier:

In [56]:
from transformers import pipeline

model_checkpoint = "Axion004/marian-finetuned-kde4-en-to-fr"
translator = pipeline("translation", model=model_checkpoint)
translator("Default to expanded threads")

Device set to use cuda:0


[{'translation_text': 'Par défaut, développer les fils de discussion'}]

As expected, our pretrained model adapted its knowledge to the corpus we fine-tuned it on: instead of leaving the English word "threads" alone, it now translates it to the official French version. Let's check the earlier "plugin" example too:

In [50]:
translator(
    "Unable to import %1 using the OFX importer plugin. This file is not the correct format."
)

[{'translation_text': "Impossible d'importer %1 en utilisant le plugin d'importateur OFX. Ce fichier n'est pas le bon format."}]

In this run, the model kept the English word "plugin" rather than replacing it with a French term, so the adaptation is less visible here than for "threads".
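
The pipeline returns a list with one dictionary per input, each holding the text under the `translation_text` key. If you only need the strings, a small helper such as the hypothetical `extract_translations` below does the job; the sample output is copied from the "threads" cell above.

```python
def extract_translations(pipeline_output):
    """Pull the translated strings out of a translation pipeline result."""
    return [item["translation_text"] for item in pipeline_output]


sample_output = [
    {"translation_text": "Par défaut, développer les fils de discussion"}
]
print(extract_translations(sample_output))
```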

> **✏️ Your turn!** What does the model return on the sample with the word "email" you identified earlier?

In [55]:
model_checkpoint = "Axion004/marian-finetuned-kde4-en-to-fr"
finetuned_translator = pipeline("translation", model=model_checkpoint)

print("Fine-tuned model translation:")
print(finetuned_translator(first_email_sentence))

Device set to use cuda:0


Fine-tuned model translation:
[{'translation_text': 'Envoie le diagramme comme pièce jointe.'}]
