# Fine-tuning

Fine-tuning refers to the process in transfer learning in which the parameter values of a model trained on a large dataset are modified when the training process continues on a small dataset (see [Kevin Murphy's book](https://probml.github.io/pml-book/book1.html) Section 19.2 for further details). The main motivation is to adapt a pre-trained model, trained on a large amount of data, to a specific task, obtaining better performance than would be achieved by training only on the small task-specific dataset.

In [1]:
!pip install datasets evaluate transformers==4.30 accelerate peft bitsandbytes
!pip install sacrebleu
!pip install huggingface_hub


In this notebook, we fine-tune on a dataset that is already available in the [Datasets repository](https://huggingface.co/datasets) from Hugging Face. The [Datasets library](https://huggingface.co/docs/datasets) also makes it easy to access and load other datasets. For example, you can load your own dataset following [this tutorial](https://huggingface.co/docs/datasets/loading#local-and-remote-files).

More precisely, we are going to explain how to fine-tune the [Llama2 model](https://huggingface.co/docs/transformers/model_doc/llama2) on the [Europarl-ST dataset](https://huggingface.co/datasets/tj-solergibert/Europarl-ST), using only the [subset of Europarl-ST that contains the text data for machine translation (MT) from English](https://huggingface.co/datasets/tj-solergibert/Europarl-ST-processed-mt-en).

In [1]:
from datasets import load_dataset

raw_datasets = load_dataset("tj-solergibert/Europarl-ST-processed-mt-en")



As shown, each English sentence is repeated for each of the seven target languages (0: 'de', 2: 'es', 3: 'fr', 4: 'it', 5: 'nl', 6: 'pl', 7: 'pt').

The Llama2 model is a pretrained Large Language Model (LLM) that can tackle several NLP tasks, one of them being translation from English into Spanish. Let us keep only the English-to-Spanish pairs of Europarl-ST using a simple [lambda function](https://realpython.com/python-lambda/) with the [Dataset.filter() function](https://huggingface.co/docs/datasets/v2.9.0/en/package_reference/main_classes#datasets.Dataset.filter), and take a small sample with the [Dataset.select() function](https://huggingface.co/docs/datasets/v2.9.0/en/package_reference/main_classes#datasets.Dataset.select). The filter also keeps only source sentences shorter than 40 characters. We take a small sample because of time and computational constraints.

In [2]:
lang="es"
random_seed = 23
lang_id = raw_datasets["train"].features["dest_lang"].names.index(lang)
raw_datasets["train"] = raw_datasets["train"].filter(lambda x: x["dest_lang"] == lang_id and len(x["source_text"])<40).shuffle(seed=random_seed).select(range(128))
raw_datasets["valid"] = raw_datasets["valid"].filter(lambda x: x["dest_lang"] == lang_id and len(x["source_text"])<40).shuffle(seed=random_seed).select(range(16))
raw_datasets["test"] = raw_datasets["test"].filter(lambda x: x["dest_lang"] == lang_id and len(x["source_text"])<40)

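The Llama2 checkpoints are gated on the Hugging Face Hub, so we first log in with an access token. Then we load the pre-trained tokenizer for the Llama2 model, which we will later apply to the English-Spanish pairs: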

In [4]:
!huggingface-cli login



    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [4]:
from transformers import AutoTokenizer

max_tok_length = 50
max_input_length = max_tok_length
max_dest_length = max_tok_length
checkpoint = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(
    checkpoint, use_auth_token=True,
    padding=True,
    pad_to_multiple_of=8,
    truncation=True,
    max_length=max_tok_length,
    padding_side='left',
    )
tokenizer.pad_token = "[PAD]"

We can apply the tokenizer to any dataset, taking advantage of the fact that Hugging Face Datasets are [Apache Arrow](https://arrow.apache.org) files stored on disk, so only the samples you ask for are loaded in memory.

To keep the data as a dataset, we will use the [Dataset.map() function](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.Dataset.map). This also gives us some extra flexibility if we need more preprocessing than just tokenization. The map() method works by applying a function to each element of the dataset, or to batches of elements when batched=True is set.

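In our case, each sample pair is preprocessed according to the training needs of the model that is to be fine-tuned. In the case of Llama2, it is recommended to explicitly state a task prompt for each source sentence. The function below builds that prompt, appends the target translation to it, sets the labels of the prompt positions to -100 so that they do not contribute to the loss, and left-pads every sequence to max_tok_length: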

In [10]:
import torch

src = "en"
tgt = lang
task_prefix = f"Translate from {src} to {tgt}:\n"

def preprocess_function(sample):
    text_column = "source_text"
    label_column = "dest_text"
    max_length = max_tok_length
    batch_size = len(sample[text_column])
    # Build one prompt per source sentence; each target ends with a newline
    inputs = [f"{task_prefix}{src}: {x} = {tgt}: " for x in sample[text_column]]
    targets = [str(x + "\n") for x in sample[label_column]]
    # No padding or truncation yet; note that the tokenizer also prepends a BOS token to each target
    model_inputs = tokenizer(inputs)
    labels = tokenizer(targets)

    # Concatenate prompt + target + EOS; prompt positions get label -100 so the loss ignores them
    for i in range(batch_size):
        sample_input_ids = model_inputs["input_ids"][i]
        label_input_ids = labels["input_ids"][i] + [tokenizer.eos_token_id]
        model_inputs["input_ids"][i] = sample_input_ids + label_input_ids
        labels["input_ids"][i] = [-100] * len(sample_input_ids) + label_input_ids
        model_inputs["attention_mask"][i] = [1] * len(model_inputs["input_ids"][i])

    # Left-pad every sequence up to max_length, then keep only the first max_length positions
    for i in range(batch_size):
        sample_input_ids = model_inputs["input_ids"][i]
        label_input_ids = labels["input_ids"][i]
        n_pad = max_length - len(sample_input_ids)
        model_inputs["input_ids"][i] = [tokenizer.pad_token_id] * n_pad + sample_input_ids
        model_inputs["attention_mask"][i] = [0] * n_pad + model_inputs["attention_mask"][i]
        labels["input_ids"][i] = [-100] * n_pad + label_input_ids
        model_inputs["input_ids"][i] = torch.tensor(model_inputs["input_ids"][i][:max_length])
        model_inputs["attention_mask"][i] = torch.tensor(model_inputs["attention_mask"][i][:max_length])
        labels["input_ids"][i] = torch.tensor(labels["input_ids"][i][:max_length])
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
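
The label masking and left padding performed by the preprocessing function can be easier to follow on toy data. The following is a minimal sketch in plain Python, not the notebook's actual code: the helper name `build_example` and the token IDs are made up for illustration, and no tokenizer is involved.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default

def build_example(prompt_ids, target_ids, eos_id, pad_id, max_length):
    """Concatenate prompt and target, mask the prompt in the labels, then left-pad."""
    input_ids = prompt_ids + target_ids + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + target_ids + [eos_id]
    attention_mask = [1] * len(input_ids)

    # Padding goes on the left; padded positions are masked out and ignored by the loss
    n_pad = max(0, max_length - len(input_ids))
    input_ids = [pad_id] * n_pad + input_ids
    attention_mask = [0] * n_pad + attention_mask
    labels = [IGNORE_INDEX] * n_pad + labels

    # Keep only the first max_length positions
    return {
        "input_ids": input_ids[:max_length],
        "attention_mask": attention_mask[:max_length],
        "labels": labels[:max_length],
    }

# Toy token IDs, chosen only for illustration
example = build_example(
    prompt_ids=[1, 11, 12, 13], target_ids=[21, 22], eos_id=2, pad_id=0, max_length=10
)
print(example["input_ids"])       # [0, 0, 0, 1, 11, 12, 13, 21, 22, 2]
print(example["attention_mask"])  # [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
print(example["labels"])          # [-100, -100, -100, -100, -100, -100, -100, 21, 22, 2]
```

Only the last three positions carry a label, so the loss would be computed on the translation and the end-of-sequence token, not on the prompt or the padding.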


In [15]:
task_prefix = "Translate from en to es:"
def tokenize_function(sample):
    inputs = [f"{task_prefix}\n en: {s} = es: "  for s in sample["source_text"]]
    model_inputs = tokenizer(inputs,max_length=max_input_length,truncation=True,padding=True,)
    outputs = [f"{s} {tokenizer.eos_token}" for s in sample["dest_text"]]
    model_inputs['labels'] = tokenizer(text_target = outputs,max_length=max_dest_length,truncation=True,padding=True,).input_ids
    return model_inputs

The Datasets library applies this processing by adding new fields to the datasets, one for each key in the dictionary returned by the preprocessing function, that is, *input_ids*, *attention_mask* and *labels*:

In [12]:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

Map: 100%|██████████| 128/128 [00:00<00:00, 2996.73 examples/s]
Map: 100%|██████████| 379/379 [00:00<00:00, 6633.70 examples/s]
Map: 100%|██████████| 16/16 [00:00<00:00, 1857.68 examples/s]


In [14]:
sample = raw_datasets["train"].select(range(3))
model_input = preprocess_function(sample)
print(model_input)
print(tokenizer.batch_decode(model_input.input_ids))
#print(tokenizer.batch_decode(model_input.labels))


{'input_ids': [tensor([    0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     1,  4103,  9632,
          515,   427,   304,   831, 29901,    13,   264, 29901,  1670,   526,
         5065,  5362,  4225, 29889,   353,   831, 29901, 29871,     1, 11389,
          443,   294, 16632,  7305,  5065, 29887,  5326, 29889,    13,     2]), tensor([    0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     1,  4103,  9632,   515,   427,   304,   831, 29901,    13,
          264, 29901,  1105,  3113,  8167, 15293, 10465,   381, 29889,   353,
          831, 29901, 29871,     1,  1094, 29983,   712, 29892,  1277,  7853,
        29892, 25264,   264,   263, 15293, 10465,   381, 29889,    13,     2]), tensor([    0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     1,  4103,  9632,   515,   427,   304,   831, 29901,    13,
          264, 29901, 21353,   466,   575,  4

bitsandbytes is a quantization library with a Transformers integration. With this integration, you can quantize a model to 8 or 4 bits and enable many other options by configuring the BitsAndBytesConfig class. For example, you can:

<ul>
<li>set load_in_4bit=True to quantize the model to 4-bits when you load it</li>
<li>set bnb_4bit_quant_type="nf4" to use a special 4-bit data type for weights initialized from a normal distribution</li>
<li>set bnb_4bit_use_double_quant=True to use a nested quantization scheme to quantize the already quantized weights</li>
<li>set bnb_4bit_compute_dtype=torch.bfloat16 to use bfloat16 for faster computation</li>
</ul>
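
A rough back-of-the-envelope estimate helps to see why 4-bit loading matters on a small GPU. The sketch below only counts the storage of the weights; it ignores the quantization constants, any layers kept in higher precision, activations and optimizer state, so the real footprint is larger. The parameter count is derived from the PEFT summary printed later in this notebook, and the helper name `weights_gib` is made up for this example.

```python
# Base model parameters: "all params" minus the LoRA "trainable params"
# reported by print_trainable_parameters() in this notebook
n_params = 6_746_804_224 - 8_388_608

def weights_gib(n_params, bits_per_param):
    """Size of the weights alone, in GiB, for a given storage width."""
    return n_params * bits_per_param / 8 / 2**30

bf16_gib = weights_gib(n_params, 16)  # weights stored in bfloat16
nf4_gib = weights_gib(n_params, 4)    # weights stored in 4 bits

print(f"bfloat16 weights: {bf16_gib:.2f} GiB")
print(f"4-bit weights:    {nf4_gib:.2f} GiB")
```

Under these assumptions, the bfloat16 weights alone would not fit in the 7.75 GiB of GPU memory reported in the error message of the inference section, whereas the 4-bit weights would.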


In [15]:
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

Pass the quantization_config to the from_pretrained method.

In [16]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)


Loading checkpoint shards: 100%|██████████| 2/2 [00:07<00:00,  3.92s/it]


Next, you should call the prepare_model_for_kbit_training() function to preprocess the quantized model for training.

In [17]:
from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=False, gradient_checkpointing_kwargs={'use_reentrant':False})

[LoRA (Low-Rank Adaptation of Large Language Models)](https://huggingface.co/docs/peft/task_guides/lora_based_methods) is a [parameter-efficient fine-tuning (PEFT)](https://huggingface.co/docs/peft/index) technique that significantly reduces the number of trainable parameters. It works by inserting a small number of new weights into the model, and only these are trained. This makes training with LoRA much faster and more memory-efficient, and it produces smaller model weights (a few hundred MBs), which are easier to store and share.

Each PEFT method is defined by a PeftConfig class that stores all the important parameters for building a PeftModel. For example, to train with LoRA, load and create a LoraConfig class and specify the following parameters:

<ul>
<li>task_type: the task to train for (causal language modeling in this case)</li>
<li>r: the dimension of the low-rank matrices</li>
<li>lora_alpha: the scaling factor for the low-rank matrices</li>
<li>target_modules: the modules whose weights are adapted (not set in the configuration below, so the library default for the model is used)</li>
<li>lora_dropout: the dropout probability of the LoRA layers</li>
</ul>
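
To make the roles of r and lora_alpha more concrete, here is a minimal sketch of a single LoRA-adapted linear layer in plain Python. It is an illustration of the idea, not PEFT's implementation: the helper names `matvec` and `lora_forward` and the tiny dimensions are made up, and dropout (lora_dropout) is left out.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, r, lora_alpha):
    """Frozen projection plus scaled low-rank update: W x + (lora_alpha / r) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = lora_alpha / r
    return [b + scale * u for b, u in zip(base, update)]

rng = random.Random(0)
d_in, d_out, r, lora_alpha = 6, 4, 2, 4
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen pre-trained weights
A = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]      # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                               # trainable, d_out x r, starts at zero
x = [rng.gauss(0, 1) for _ in range(d_in)]

# With B at zero, the adapted layer reproduces the frozen layer exactly
same_as_base = lora_forward(W, A, B, x, r, lora_alpha) == matvec(W, x)

# Once B moves away from zero (here set by hand), the output changes
B_trained = [[0.1] * r for _ in range(d_out)]
changed = lora_forward(W, A, B_trained, x, r, lora_alpha) != matvec(W, x)
print(same_as_base, changed)
```

Only A and B would receive gradients, which amounts to r * (d_in + d_out) values per adapted layer instead of d_in * d_out.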

In [18]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
)

Once LoRA and the quantization are set up, create a quantized PeftModel with the get_peft_model() function. It takes a quantized model and the LoraConfig containing the parameters for how to configure a model for training with LoRA.

In [19]:
lora_model = get_peft_model(model, config)
lora_model.print_trainable_parameters()

trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.1243
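
The number of trainable parameters reported above can be reproduced with simple arithmetic. The sketch below relies on assumptions that are not stated in this notebook: that Llama-2-7b has 32 transformer blocks with 4096-dimensional attention projections, and that PEFT adapts the q_proj and v_proj modules of each block when target_modules is not given.

```python
# Assumed Llama-2-7b dimensions and PEFT defaults (not stated in this notebook)
hidden_size = 4096       # input and output width of the attention projections
num_layers = 32          # number of transformer blocks
adapted_per_layer = 2    # q_proj and v_proj, the assumed default target modules
r = 16                   # rank set in the LoraConfig

# Each adapted d_out x d_in projection gains A (r x d_in) and B (d_out x r)
params_per_module = r * (hidden_size + hidden_size)
trainable = num_layers * adapted_per_layer * params_per_module
all_params = 6_746_804_224  # as printed by print_trainable_parameters()

print(f"trainable params: {trainable:,}")                 # trainable params: 8,388,608
print(f"trainable%: {100 * trainable / all_params:.4f}")  # trainable%: 0.1243
```

The result matches the printed summary, which supports the assumptions above but does not prove them.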


The function that is responsible for putting together samples inside a batch is called a collate function. It is an argument you can pass when you build a DataLoader, the default being a function that will just convert your samples to PyTorch tensors and concatenate them. That default only works when all the inputs in a batch have the same size. In this notebook the preprocessing function already pads every sample to max_tok_length, but using a padding-aware collate function keeps the pipeline valid if the padding is ever postponed to batching time, which avoids over-long inputs with a lot of padding.

To do this in practice, we need a collate function that applies the correct amount of padding to the items of the dataset we want to batch together. Fortunately, the Transformers library provides such a function via DataCollatorForLanguageModeling, which takes a tokenizer when you instantiate it (to know which padding token to use, and whether padding goes on the left or on the right of the inputs). Setting mlm=False selects causal rather than masked language modeling:

In [20]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, pad_to_multiple_of=8)
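
The following plain-Python sketch mimics what we understand this collator to do when mlm=False: pad the batch to a common length (rounded up to a multiple of pad_to_multiple_of) and build the labels as a copy of the padded inputs in which padding positions are set to -100. The function name `collate_causal_lm` and the token IDs are made up for illustration; this is not the library code.

```python
def collate_causal_lm(examples, pad_id, pad_to_multiple_of=8, padding_side="left"):
    """Pad token-ID lists to a common length and derive causal-LM labels."""
    longest = max(len(ids) for ids in examples)
    # Round the common length up to the next multiple of pad_to_multiple_of
    target = -(-longest // pad_to_multiple_of) * pad_to_multiple_of

    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in examples:
        n_pad = target - len(ids)
        if padding_side == "left":
            input_ids = [pad_id] * n_pad + ids
            mask = [0] * n_pad + [1] * len(ids)
        else:
            input_ids = ids + [pad_id] * n_pad
            mask = [1] * len(ids) + [0] * n_pad
        batch["input_ids"].append(input_ids)
        batch["attention_mask"].append(mask)
        # Labels are a copy of the inputs, with padding positions ignored by the loss
        batch["labels"].append([-100 if t == pad_id else t for t in input_ids])
    return batch

batch = collate_causal_lm([[1, 5, 6], [1, 7, 8, 9, 10]], pad_id=0)
print(batch["input_ids"])  # [[0, 0, 0, 0, 0, 1, 5, 6], [0, 0, 0, 1, 7, 8, 9, 10]]
print(batch["labels"][0])  # [-100, -100, -100, -100, -100, 1, 5, 6]
```

If the real collator behaves like this sketch, the labels are rebuilt from input_ids at batching time, so it is worth checking whether the -100 mask placed on the prompt during preprocessing survives in the batches that reach the model.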

## Training

The first step before we can define our [Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer) is to define a [TrainingArguments class](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments) that will contain all the hyperparameters the Trainer will use for training and evaluation. The only compulsory argument you have to provide is a directory where the trained model will be saved, as well as the checkpoints along the way. For all the rest, you can set them depending on the recommendations from the model developers:

In [27]:
from transformers import TrainingArguments

batch_size = 1
model_name = checkpoint.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-en-to-es",
    evaluation_strategy = "epoch",
    learning_rate=1e-4,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=3,
    warmup_steps=100,
    optim="adamw_bnb_8bit",
    prediction_loss_only=True,
)
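
To see how these hyperparameters interact, the sketch below computes the number of optimization steps and the learning rate at a few of them. It assumes a single device, no gradient accumulation, and a schedule with linear warmup followed by linear decay to zero, which we take to be the Trainer default; the function name `linear_schedule_lr` is made up for this example.

```python
import math

num_samples, batch_size, num_epochs = 128, 1, 3
learning_rate, warmup_steps = 1e-4, 100

steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 384

def linear_schedule_lr(step):
    """Linear warmup to the peak learning rate, then linear decay to zero."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return learning_rate * max(0.0, remaining)

# Start of training, mid-warmup, peak, halfway through the decay, last step
for step in (0, 50, 100, 242, 384):
    print(step, linear_schedule_lr(step))
```

With 384 steps in total, the 100 warmup steps cover roughly a quarter of the run, so the peak learning rate of 1e-4 is only reached near the end of the first epoch.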

Once we have our model, we can define a Trainer by passing it all the objects constructed up to now, namely the model, the training arguments (args), the training and validation datasets, the tokenizer and the data collator:

In [28]:
from transformers import Trainer

trainer = Trainer(
    lora_model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["valid"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)


To fine-tune the model on our dataset, we just have to call the [train() function](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Trainer.train) of our Trainer:

In [29]:
trainer.train()

100%|██████████| 384/384 [05:27<00:00,  1.17it/s]

{'eval_loss': 0.7643849849700928, 'eval_runtime': 6.3903, 'eval_samples_per_second': 2.504, 'eval_steps_per_second': 2.504, 'epoch': 1.0}
{'eval_loss': 0.6640907526016235, 'eval_runtime': 6.4172, 'eval_samples_per_second': 2.493, 'eval_steps_per_second': 2.493, 'epoch': 2.0}
{'eval_loss': 0.6803895235061646, 'eval_runtime': 6.3852, 'eval_samples_per_second': 2.506, 'eval_steps_per_second': 2.506, 'epoch': 3.0}
{'train_runtime': 327.5388, 'train_samples_per_second': 1.172, 'train_steps_per_second': 1.172, 'train_loss': 1.2882375717163086, 'epoch': 3.0}

TrainOutput(global_step=384, training_loss=1.2882375717163086, metrics={'train_runtime': 327.5388, 'train_samples_per_second': 1.172, 'train_steps_per_second': 1.172, 'train_loss': 1.2882375717163086, 'epoch': 3.0})

## Inference

At inference time, it is recommended to use [generate()](https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationMixin.generate). Llama2 is a decoder-only model, so this method takes the tokenized prompt as input_ids and auto-regressively generates the tokens that continue it. Check out [this blog post](https://huggingface.co/blog/how-to-generate) for all the details about generating text with Transformers.
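
For a decoder-only model, the sequences returned by generate() start with the prompt tokens, so the prompt has to be removed before the translation is decoded and scored. The sketch below shows one way to do that on plain lists of token IDs; the helper name `strip_prompt` and the IDs are made up for illustration.

```python
def strip_prompt(generated_ids, prompt_length, eos_id, pad_id):
    """Return only the newly generated tokens, cut at the first EOS token."""
    new_tokens = generated_ids[prompt_length:]
    if eos_id in new_tokens:
        new_tokens = new_tokens[: new_tokens.index(eos_id)]
    # Drop any padding that may remain among the new tokens
    return [t for t in new_tokens if t != pad_id]

prompt = [0, 0, 1, 11, 12, 13]           # left-padded prompt (toy IDs)
generated = prompt + [21, 22, 23, 2, 0]  # hypothetical generate() output for that prompt
print(strip_prompt(generated, len(prompt), eos_id=2, pad_id=0))  # [21, 22, 23]
```

The remaining IDs could then be passed to tokenizer.decode() or tokenizer.batch_decode() to obtain the text of the translation.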

In [33]:
from transformers import GenerationConfig

generation_config = GenerationConfig.from_pretrained(
    checkpoint,
    )

print(generation_config)



GenerationConfig {
  "bos_token_id": 1,
  "do_sample": true,
  "eos_token_id": 2,
  "max_length": 4096,
  "pad_token_id": 0,
  "temperature": 0.6,
  "top_p": 0.9,
  "transformers_version": "4.30.0"
}



In [40]:
test_batch_size = 1
batch_tokenized_test = tokenized_datasets["test"].batch(test_batch_size)

Batching examples: 100%|██████████| 379/379 [00:00<00:00, 1246.11 examples/s]


In [41]:
number_of_batches = len(batch_tokenized_test["input_ids"])
output_sequences = []
for i in range(number_of_batches):
    input_ids = torch.tensor(batch_tokenized_test["input_ids"][i]).cuda()
    attention_mask = torch.tensor(batch_tokenized_test["attention_mask"][i]).cuda()
    output_batch = model.generate(
        generation_config=generation_config, input_ids=input_ids,
        attention_mask=attention_mask, max_length=max_dest_length + 10,
    )
    output_sequences.extend(output_batch)

OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacity of 7.75 GiB of which 16.94 MiB is free. Including non-PyTorch memory, this process has 7.72 GiB memory in use. Of the allocated memory 6.92 GiB is allocated by PyTorch, and 695.70 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [44]:
result = compute_metrics(tokenized_datasets["test"],output_sequences.cpu())
print(f'BLEU score: {result["bleu"]}')

['That is our red line.', 'There are, however, a couple of issues.', 'measures of this kind.', 'These are weak parties.', 'I disagree with that.', 'My second proposal concerns energy.', 'It would have strengthened them.', 'We would like to hear that.', 'The initial draft was a bad one.', 'It is a disappointing outcome.', 'Can we accept responsibility for this?', 'The world is finally back to rights.', 'The first phase is complete.', 'How do you assess the programme?', 'Women for Zapatero!', 'The problems we face are European.']
['(es: es: que es nuestra línea roja.)', 'es: There are, however, a couple of issues.', 'es: .', 'es: Estas son partidos débiles.', 'es: I disagree with that.', '2. My second proposal concerns energy.', 'es:', 'We would like to hear that.', 'el primer borrador era malo.', 'es: It is a disappointing outcome.', '?', '.', '1. es: La primera fase está completada.', '', ':', '## WordNet']
BLEU score: 6.5166215179046585
