# Fine-tuning

Fine-tuning refers to the process in transfer learning in which the parameter values of a model trained on a large dataset are modified as training continues on a small dataset (see [Kevin Murphy's book](https://probml.github.io/pml-book/book1.html) Section 19.2 for further details). The main motivation is to adapt a model pre-trained on a large amount of data to a specific task, obtaining better performance than would be achieved by training only on the small task-specific dataset.

In [1]:
!pip install datasets evaluate transformers==4.30 accelerate peft bitsandbytes
!pip install sacrebleu
!pip install huggingface_hub


In this notebook, we fine-tune on a dataset that is already available in the [Datasets repository](https://huggingface.co/datasets) from Hugging Face. More generally, the [Datasets library](https://huggingface.co/docs/datasets) makes it easy to access and load datasets. For example, you can easily load your own dataset following [this tutorial](https://huggingface.co/docs/datasets/loading#local-and-remote-files).

More precisely, we are going to explain how to fine-tune the [Llama 2 model](https://huggingface.co/meta-llama/Llama-2-7b-hf) on the [Europarl-ST dataset](https://huggingface.co/datasets/tj-solergibert/Europarl-ST), using only the [subset of Europarl-ST that contains the text data for MT from English](https://huggingface.co/datasets/tj-solergibert/Europarl-ST-processed-mt-en).

In [105]:
from datasets import load_dataset

raw_datasets = load_dataset("tj-solergibert/Europarl-ST-processed-mt-en")

Each English sentence is repeated for each of the seven target languages, which are encoded as class indices in the dest_lang field (0: 'de', 2: 'es', 3: 'fr', 4: 'it', 5: 'nl', 6: 'pl', 7: 'pt').

We keep only the English-to-Spanish pairs of Europarl-ST whose source sentence is shorter than 40 characters, using a simple [lambda function](https://realpython.com/python-lambda/) with the [Dataset.filter() function](https://huggingface.co/docs/datasets/v2.9.0/en/package_reference/main_classes#datasets.Dataset.filter), and then take a small sample with the [Dataset.select() function](https://huggingface.co/docs/datasets/v2.9.0/en/package_reference/main_classes#datasets.Dataset.select): 16 sentence pairs for training, 16 for validation and 10 for testing. The sample is kept small because of time and computational constraints.

In [106]:
lang="es"
lang_id = raw_datasets["train"].features["dest_lang"].names.index(lang)
raw_datasets["train"] = raw_datasets["train"].filter(lambda x: x["dest_lang"] == lang_id and len(x["source_text"])<40).select(range(16))
raw_datasets["valid"] = raw_datasets["valid"].filter(lambda x: x["dest_lang"] == lang_id and len(x["source_text"])<40).select(range(16))
raw_datasets["test"] = raw_datasets["test"].filter(lambda x: x["dest_lang"] == lang_id and len(x["source_text"])<40).select(range(10))

Llama 2 is a gated model, so we first log in to the Hugging Face Hub with an access token. Then we load the pre-trained tokenizer for the Llama 2 model and apply it to a sample English-Spanish pair:

In [5]:
!huggingface-cli login


    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
!huggingface-cli whoami

jorcisai


In [6]:
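from transformers import AutoTokenizer

max_input_length = 50  # maximum number of tokens kept per prompt
max_dest_length = 50   # maximum number of tokens kept per target sentence
checkpoint = "meta-llama/Llama-2-7b-hf"
# from flores200_codes import flores_codes
src_code = "eng_Latn"
tgt_code = "spa_Latn"
tokenizer = AutoTokenizer.from_pretrained(
    checkpoint, use_auth_token=True,  # gated checkpoint: use the token saved at login
    padding=True,
    pad_to_multiple_of=8,
    src_lang=src_code,
    tgt_lang=tgt_code,
    truncation=True,
    max_length=max_input_length,
    padding_side='left',  # a decoder-only model continues from the last token
    )
# The Llama 2 tokenizer has no padding token by default, so one is set here
tokenizer.pad_token = "[PAD]"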


We can apply the tokenizer function to any dataset, taking advantage of the fact that Hugging Face Datasets are [Apache Arrow](https://arrow.apache.org) files stored on disk, so only the samples you ask for are loaded in memory.

To keep the data as a dataset, we will use the [Dataset.map() function](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.Dataset.map). This also gives us some extra flexibility if we need more preprocessing than just tokenization. The map() method works by applying a function to each element of the dataset, or to each batch of elements when batched=True is passed.

In our case, each sample pair is preprocessed according to the training needs of the model to be fine-tuned. Llama 2 is a causal language model that simply continues the text it is given, so the task prompt, "Translate from en to es:", is explicitly stated before each source sentence. In addition, the prompts and the target sentences are truncated to 50 tokens to reduce memory consumption:

In [24]:
task_prefix = "Translate from en to es:"

def tokenize_function(sample):
    # Build one prompt per source sentence; the model should continue after "es: "
    inputs = [f"{task_prefix}\n en: {s} = es: " for s in sample["source_text"]]
    print(inputs)
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding=True)
    print(model_inputs)
    # Tokenize the Spanish references and store their ids under "labels"
    model_inputs["labels"] = tokenizer(
        text_target=sample["dest_text"], max_length=max_dest_length, truncation=True, padding=True
    ).input_ids
    return model_inputs

The way the Datasets library applies this processing is by adding new fields to the datasets, one for each key in the dictionary returned by the tokenize function, that is, *input_ids*, *attention_mask* and *labels*:

In [107]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)


In [11]:
sample = raw_datasets["train"].select(range(1))
model_input = tokenize_function(sample)
print(tokenizer.batch_decode(model_input.input_ids))


['Translate from en to es:\n en: That is not real coordination. = es: ']
{'input_ids': [[1, 4103, 9632, 515, 427, 304, 831, 29901, 13, 427, 29901, 2193, 338, 451, 1855, 29311, 3381, 29889, 353, 831, 29901, 29871]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
['<s> Translate from en to es:\n en: That is not real coordination. = es: ']
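When several prompts are tokenized together with padding=True and padding_side='left', the shorter ones are filled at the front with the padding id, and the attention mask marks those positions with 0. For a decoder-only model this keeps the last real token of every prompt at the right end, which is where generation continues from. The following is a minimal pure-Python sketch of that behaviour; the token ids and the `pad` helper are hypothetical and are not part of the tokenizer API.

```python
def pad(sequences, pad_id=0, side="left"):
    """Pad token-id lists to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = width - len(seq)
        if side == "left":
            padded.append([pad_id] * n_pad + list(seq))
            mask.append([0] * n_pad + [1] * len(seq))
        else:
            padded.append(list(seq) + [pad_id] * n_pad)
            mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask

# Hypothetical token ids (1 plays the role of the <s> start token)
batch = [[1, 11, 12, 13], [1, 21, 22]]

left_ids, left_mask = pad(batch, side="left")
right_ids, _ = pad(batch, side="right")
print(left_ids)   # [[1, 11, 12, 13], [0, 1, 21, 22]]
print(left_mask)  # [[1, 1, 1, 1], [0, 1, 1, 1]]
print(right_ids)  # [[1, 11, 12, 13], [1, 21, 22, 0]]
```

With right padding, the second prompt would end in a padding token, and the model would be asked to continue from it rather than from the prompt itself.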


bitsandbytes is a quantization library with a Transformers integration. With this integration, you can quantize a model to 8 or 4-bits and enable many other options by configuring the BitsAndBytesConfig class. For example, you can:

<ul>
<li>set load_in_4bit=True to quantize the model to 4-bits when you load it</li>
<li>set bnb_4bit_quant_type="nf4" to use a special 4-bit data type for weights initialized from a normal distribution</li>
<li>set bnb_4bit_use_double_quant=True to use a nested quantization scheme to quantize the already quantized weights</li>
<li>set bnb_4bit_compute_dtype=torch.bfloat16 to use bfloat16 for faster computation</li>
</ul>


In [12]:
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
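To give a rough idea of what 4-bit quantization buys, the sketch below quantizes one block of weights with plain absmax scaling and estimates the memory needed for the weights of the model. It is a simplified illustration and not the NF4 data type or the bitsandbytes implementation: the weight values are made up, the `quantize_block` and `dequantize_block` helpers are hypothetical, and the estimate ignores the stored scales and any layers that are kept in higher precision. The parameter count is taken from the total reported by print_trainable_parameters() later in this notebook, minus the LoRA weights.

```python
def quantize_block(block, levels=7):
    """Absmax quantization of one block to integer codes in [-levels, levels]."""
    absmax = max(abs(v) for v in block)
    scale = absmax / levels if absmax > 0 else 1.0
    return [round(v / scale) for v in block], scale

def dequantize_block(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]

weights = [0.12, -0.50, 0.33, 0.07]  # made-up values for illustration
codes, scale = quantize_block(weights)  # codes fit in 4 bits (15 of the 16 levels)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, max_error <= scale / 2 + 1e-12)

# Rough memory estimate for the base model weights
n_params = 6_746_804_224 - 8_388_608  # printed total minus LoRA weights
print(round(n_params * 2 / 1e9, 2))    # bfloat16, 2 bytes per weight: 13.48 (GB)
print(round(n_params * 0.5 / 1e9, 2))  # 4 bits per weight: 3.37 (GB)
```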

Pass the quantization_config to the from_pretrained method.

In [13]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    )





Next, you should call the prepare_model_for_kbit_training() function to preprocess the quantized model for training.

In [14]:
from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=False, gradient_checkpointing_kwargs={'use_reentrant':False})

[LoRA (Low-Rank Adaptation of Large Language Models)](https://huggingface.co/docs/peft/task_guides/lora_based_methods) is a [parameter-efficient fine-tuning (PEFT)](https://huggingface.co/docs/peft/index) technique that significantly reduces the number of trainable parameters. It works by inserting a smaller number of new weights into the model and only these are trained. This makes training with LoRA much faster, memory-efficient, and produces smaller model weights (a few hundred MBs), which are easier to store and share.

Each PEFT method is defined by a PeftConfig class that stores all the important parameters for building a PeftModel. For example, to train with LoRA, load and create a LoraConfig class and specify the following parameters:

<ul>
<li>task_type: the task to train for (causal language modeling in this case)</li>
<li>r: the dimension of the low-rank matrices</li>
<li>lora_alpha: the scaling factor for the low-rank matrices</li>
<li>target_modules: determines which set of parameters is adapted (not set below, so PEFT falls back to its default for the model architecture)</li>
<li>lora_dropout: the dropout probability of the LoRA layers</li>
</ul>

In [15]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
)

Once LoRA and the quantization are set up, create a quantized PeftModel with the get_peft_model() function. It takes the quantized model and the LoraConfig that describes how to configure the model for training with LoRA.

In [16]:
lora_model = get_peft_model(model, config)
lora_model.print_trainable_parameters()

trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.1243
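The number of trainable parameters can be reproduced by hand. The sketch below assumes that PEFT's default LoRA targets for Llama are the q_proj and v_proj projections, and that Llama-2-7b has 32 decoder layers with a hidden size of 4096; these three values are assumptions that do not appear in this notebook. It also includes a toy LoRA forward pass, y = W x + (lora_alpha / r) B A x, with dropout omitted; `matvec` and `lora_forward` are hypothetical helpers, not PEFT functions.

```python
hidden_size = 4096     # assumed hidden size of Llama-2-7b
num_layers = 32        # assumed number of decoder layers
adapted_per_layer = 2  # assumed PEFT default targets for Llama: q_proj and v_proj
r = 16                 # LoRA rank from the LoraConfig

# Each adapted d x d projection gets two matrices: A (r x d) and B (d x r)
params_per_module = r * hidden_size + hidden_size * r
trainable = params_per_module * adapted_per_layer * num_layers
all_params = 6_746_804_224  # total printed by print_trainable_parameters()

print(f"{trainable:,}")                       # 8,388,608
print(f"{100 * trainable / all_params:.4f}")  # 0.1243

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output plus the scaled low-rank update (no dropout)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

# Toy example with d = 2 and r = 1; B starts at zero, so the output is unchanged
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B_zero = [[0.0], [0.0]]
print(lora_forward(W, A, B_zero, [3.0, 4.0], alpha=2, r=1))  # [3.0, 4.0]
```

Only A and B receive gradients during training, which is why the trainable fraction stays close to 0.12% of the model.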


The function that is responsible for putting together samples inside a batch is called a collate function. It is an argument you can pass when you build a DataLoader, the default being a function that just converts your samples to PyTorch tensors and stacks them. This only works when all samples in a batch have the same length, so in general padding has to be applied, ideally only as much as each batch needs, to avoid over-long inputs with a lot of padding.

To do this in practice, we have to define a collate function that applies the correct amount of padding to the items of the dataset we want to batch together. Fortunately, the Transformers library provides us with such a function via DataCollatorForLanguageModeling, which takes a tokenizer when you instantiate it (to know which padding token to use, and whether the model expects padding to be on the left or on the right of the inputs). With mlm=False, it also builds the labels for causal language modeling as a copy of input_ids in which padding positions are set to -100 so that the loss ignores them; these replace any labels field produced during tokenization:

In [17]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, pad_to_multiple_of=8)
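What a causal language modeling collator does to one batch can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the Transformers implementation: `collate_causal_lm` is a hypothetical helper, the padding id is assumed to be 0, padding is applied on the left, and lists are returned instead of tensors.

```python
IGNORE_INDEX = -100  # label value that the PyTorch cross-entropy loss skips by default

def collate_causal_lm(batch, pad_id=0, multiple_of=8):
    """Left-pad a batch to a multiple of `multiple_of` and derive the labels."""
    longest = max(len(ids) for ids in batch)
    width = -(-longest // multiple_of) * multiple_of  # round up to the next multiple
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = width - len(ids)
        padded = [pad_id] * n_pad + list(ids)
        input_ids.append(padded)
        attention_mask.append([0] * n_pad + [1] * len(ids))
        # Labels are the inputs themselves, with padding masked out of the loss
        labels.append([IGNORE_INDEX if t == pad_id else t for t in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Hypothetical token ids
out = collate_causal_lm([[1, 5, 6], [1, 7]])
print(out["input_ids"][1])  # [0, 0, 0, 0, 0, 0, 1, 7]
print(out["labels"][1])     # [-100, -100, -100, -100, -100, -100, 1, 7]
```

The labels are not shifted here: causal language models in Transformers shift them internally, so that each position is trained to predict the next token.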

## Evaluation

The last thing to define is how to compute the metrics that evaluate the predictions of our model with respect to the references. For this purpose, we use the [Evaluate library](https://huggingface.co/docs/evaluate), which includes the definition of generic and task-specific metrics. In our case, we use the [BLEU metric](https://huggingface.co/spaces/evaluate-metric/bleu), or to be more precise, [sacreBLEU](https://huggingface.co/spaces/evaluate-metric/sacrebleu), which we load below:

In [18]:
from evaluate import load

metric = load("sacrebleu")
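
To make the metric less opaque, here is a simplified sketch of the idea behind BLEU: the geometric mean of the modified n-gram precisions (n = 1 to 4) multiplied by a brevity penalty. It is not sacreBLEU: it assumes a single reference per sentence, splits on whitespace and applies no smoothing, so its scores will generally differ from those reported by the library. `simple_bleu` is a hypothetical helper written for this illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis and no smoothing."""
    matches, totals = [0] * max_n, [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as often as in the reference
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)

print(simple_bleu(["Muchas gracias , eso es todo ."], ["Muchas gracias , eso es todo ."]))  # 100.0
print(simple_bleu([""], ["Muchas gracias , eso es todo ."]))  # 0.0
```

Because no smoothing is applied, a single n-gram order without any match drives the whole score to zero, which is one reason why very small test sets give unstable BLEU values.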


We define a function compute_metrics to compute the BLEU score of a set of generated translations. The example below performs a basic post-processing: it decodes the predictions into text, removes the prompt, and keeps only the first generated line as the translation:

In [121]:
import re

def compute_metrics(sample, output_sequences):
    # Rebuild the prompts so that they can be stripped from the decoded outputs
    inputs = [f"{task_prefix}\n en: {s} = es: " for s in sample["source_text"]]
    preds = tokenizer.batch_decode(output_sequences, skip_special_tokens=True)
    for i, (prompt, pred) in enumerate(zip(inputs, preds)):
        # Keep the first newline-terminated line generated after the prompt
        match = re.search(r'^.*\n', pred.removeprefix(prompt).strip())
        if match is not None:
            preds[i] = match.group()[:-1]
        else:
            # No complete line was generated: count it as an empty translation
            preds[i] = ""
    print(sample["source_text"])
    print(preds)
    result = metric.compute(predictions=preds, references=sample["dest_text"])
    result = {"bleu": result["score"]}
    return result
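
The post-processing step is easier to follow on a few hand-written strings. The sketch below isolates it in a hypothetical helper, `extract_translation`, and applies it to made-up model outputs. It shows that a translation is only kept when it is followed by more text on a new line; otherwise the result is an empty string.

```python
import re

def extract_translation(prompt, decoded):
    """Strip the prompt and keep the first newline-terminated line, if any."""
    continuation = decoded.removeprefix(prompt).strip()
    match = re.search(r"^.*\n", continuation)
    return match.group()[:-1] if match is not None else ""

prompt = "Translate from en to es:\n en: I will now start. = es: "

# Made-up outputs: a full line followed by more text, a cut-off line,
# and a line whose trailing newline is removed by strip()
print(repr(extract_translation(prompt, prompt + "Voy a empezar.\n en: Next")))  # 'Voy a empezar.'
print(repr(extract_translation(prompt, prompt + "Voy a empe")))                 # ''
print(repr(extract_translation(prompt, prompt + "Voy a empezar.\n")))           # ''
```

The last two cases explain how empty predictions can appear: generation may stop before the line is finished, or end exactly at the newline that strip() then removes.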

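We first evaluate the pretrained model, before any fine-tuning, by translating the test set with the [generate function](https://huggingface.co/docs/transformers/main_classes/text_generation). Since max_length counts the prompt tokens as well as the generated ones, max_length = max_dest_length leaves only the remainder of the 50-token budget for the translation: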

In [108]:
output_sequences = model.generate(
    input_ids=torch.tensor(tokenized_datasets["test"]["input_ids"]).cuda(),
    attention_mask=torch.tensor(tokenized_datasets["test"]["attention_mask"]).cuda(),
    max_length=max_dest_length,
)


In [122]:
result = compute_metrics(tokenized_datasets["test"],output_sequences.cpu())
print(f'BLEU score: {result["bleu"]}')

['Do we want to liberalise the markets?', 'I will now start.', 'We now know what he wanted it for.', 'This would not do.', 'It can already be seen, Mr Cappato.', 'I am pleased at that.', 'This is by no means the case.', 'Many thanks, that is all.', 'Perhaps you should do so now.', 'We really owe it to all the victims.']
['Queremos liberalizar los mercados?', '1. Voy a empezar.', '', '▻', '', 'Me alegro de que.', 'No es así en absoluto.', 'Gracias, eso es todo.', '\u200bPuede ser que deberías hacerlo ahora.', 'Debemos a todos los vctimas.']
BLEU score: 6.416816651924026


## Training

The first step before we can define our [Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer#trainer) is to instantiate the [TrainingArguments class](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments), which contains all the hyperparameters the Trainer will use for training and evaluation. The only compulsory argument is the directory where the trained model will be saved, together with the checkpoints along the way. The remaining arguments can be set following the recommendations of the model developers:

In [123]:
from transformers import TrainingArguments

batch_size = 16
model_name = checkpoint.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-en-to-es",
    evaluation_strategy = "epoch",
    learning_rate=1e-4,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=5,
)

Once we have our model, we can define a Trainer by passing it all the objects constructed up to now: the model, the training arguments, the training and validation datasets, the tokenizer and the data collator. The compute_metrics function is not passed to the Trainer, since it works on generated text and is called by hand:

In [124]:
from transformers import Trainer

trainer = Trainer(
    lora_model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["valid"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)


To fine-tune the model on our dataset, we just have to call the [train() function](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Trainer.train) of our Trainer:

In [126]:
trainer.train()


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 4.809488 |
| 2 | No log | 4.690376 |
| 3 | No log | 4.594098 |
| 4 | No log | 4.526439 |
| 5 | No log | 4.491308 |


TrainOutput(global_step=5, training_loss=4.499748229980469, metrics={'train_runtime': 58.4227, 'train_samples_per_second': 1.369, 'train_steps_per_second': 0.086, 'total_flos': 51881925672960.0, 'train_loss': 4.499748229980469, 'epoch': 5.0})
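
The figures in this output can be checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and the default logging_steps value of 500 in TrainingArguments; the last of these is an assumption not stated in this notebook, and it would explain why the training loss column shows "No log".

```python
import math

num_samples = 16              # training sentence pairs selected above
batch_size = 16               # per_device_train_batch_size
epochs = 5                    # num_train_epochs
train_runtime = 58.4227       # seconds, from TrainOutput
default_logging_steps = 500   # assumed TrainingArguments default

steps_per_epoch = math.ceil(num_samples / batch_size)
global_steps = steps_per_epoch * epochs

print(global_steps)                                    # 5
print(round(num_samples * epochs / train_runtime, 3))  # 1.369 samples per second
print(round(global_steps / train_runtime, 3))          # 0.086 steps per second
print(global_steps < default_logging_steps)            # True: no training loss is logged
```

With a single optimizer step per epoch, the model receives only five parameter updates in total.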

## Inference

At inference time, it is recommended to use [generate()](https://huggingface.co/docs/transformers/v4.26.1/en/main_classes/text_generation#transformers.GenerationMixin.generate). For a decoder-only model such as Llama 2, this method feeds the prompt to the model and auto-regressively generates its continuation, one token at a time. Check out [this blog post](https://huggingface.co/blog/how-to-generate) to know all the details about generating text with Transformers. There’s also [this blog post](https://huggingface.co/blog/encoder-decoder#encoder-decoder) which explains how generation works in encoder-decoder models.

In [131]:
output_sequences = lora_model.generate(
    input_ids=torch.tensor(tokenized_datasets["test"]["input_ids"]).cuda(),
    attention_mask=torch.tensor(tokenized_datasets["test"]["attention_mask"]).cuda(),
    max_length=max_dest_length,
)
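
The generation loop itself can be illustrated with a toy stand-in for the model. The sketch below implements greedy decoding only, which is one of several strategies supported by generate(); `greedy_generate` and `toy_scores` are hypothetical names, and the "model" is a fixed rule over a five-token vocabulary rather than a neural network.

```python
def greedy_generate(next_token_scores, prompt_ids, max_length, eos_id):
    """Append the highest-scoring token until EOS or max_length total tokens."""
    ids = list(prompt_ids)
    while len(ids) < max_length:  # max_length includes the prompt tokens
        scores = next_token_scores(ids)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids

def toy_scores(ids, vocab_size=5):
    """Toy model: always prefers the token (last + 1) modulo vocab_size."""
    target = (ids[-1] + 1) % vocab_size
    return [1.0 if t == target else 0.0 for t in range(vocab_size)]

print(greedy_generate(toy_scores, [1, 2], max_length=6, eos_id=0))  # [1, 2, 3, 4, 0]
print(greedy_generate(toy_scores, [1, 2], max_length=4, eos_id=0))  # [1, 2, 3, 4]
```

In the second call the budget runs out before the end-of-sequence token is produced, because the two prompt tokens already count towards max_length.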

In [132]:
result = compute_metrics(tokenized_datasets["test"],output_sequences.cpu())
print(f'BLEU score: {result["bleu"]}')

['Do we want to liberalise the markets?', 'I will now start.', 'We now know what he wanted it for.', 'This would not do.', 'It can already be seen, Mr Cappato.', 'I am pleased at that.', 'This is by no means the case.', 'Many thanks, that is all.', 'Perhaps you should do so now.', 'We really owe it to all the victims.']
['en: Do we want to liberalise the markets? = es: ', '', '', '↗ No se puede hacer eso.', 'El señor Cappato ya puede verlo.', '�Esto me alegra.', 'No es así.', 'Muchas gracias, eso es todo.', '### Other', 'Debemos a todos los víctimas.']
BLEU score: 3.0677866578045885
