# Fine-tuning

Fine-tuning is the step in transfer learning in which the parameters of a model pre-trained on a large dataset are updated by continuing training on a smaller dataset (see [Kevin Murphy's book](https://probml.github.io/pml-book/book1.html) Section 19.2 for further details). The main motivation is to adapt a model pre-trained on a large amount of data to a specific task, obtaining better performance than would be achieved by training only on the small task-specific dataset.

In [5]:
!pip install datasets evaluate transformers==4.30 accelerate peft bitsandbytes
!pip install sacrebleu
!pip install huggingface_hub

In this notebook, we fine-tune on a dataset that is already available in the [Datasets repository](https://huggingface.co/datasets) from Hugging Face. The [Datasets library](https://huggingface.co/docs/datasets) also makes it easy to access and load other datasets. For example, you can load your own dataset following [this tutorial](https://huggingface.co/docs/datasets/loading#local-and-remote-files).

More precisely, we are going to explain how to fine-tune the [Llama2 model](https://huggingface.co/docs/transformers/model_doc/llama2) on the [Europarl-ST dataset](https://huggingface.co/datasets/tj-solergibert/Europarl-ST), using only the [subset of Europarl-ST that contains the text data for machine translation (MT) from English](https://huggingface.co/datasets/tj-solergibert/Europarl-ST-processed-mt-en).

In [1]:
from datasets import load_dataset

raw_datasets = load_dataset("tj-solergibert/Europarl-ST-processed-mt-en")



As shown, each English sentence is repeated for each of the seven target languages (0: 'de', 2: 'es', 3: 'fr', 4: 'it', 5: 'nl', 6: 'pl', 7: 'pt').

The Llama2 model is a pretrained Large Language Model (LLM) that can tackle several NLP tasks, one of them being translation from English into Spanish. Let us filter Europarl-ST to keep only English-to-Spanish pairs using a simple [lambda function](https://realpython.com/python-lambda/) with the [Dataset.filter() function](https://huggingface.co/docs/datasets/v2.9.0/en/package_reference/main_classes#datasets.Dataset.filter), and then take a small sample with the [Dataset.select() function](https://huggingface.co/docs/datasets/v2.9.0/en/package_reference/main_classes#datasets.Dataset.select). The filter also keeps only source sentences shorter than max_source_test_len (40) characters. We take a small sample because of time and computational constraints.

In [37]:
lang="es"
random_seed = 23
max_source_test_len = 40
lang_id = raw_datasets["train"].features["dest_lang"].names.index(lang)
train_dataset = raw_datasets["train"].filter(lambda x: x["dest_lang"] == lang_id and len(x["source_text"])<max_source_test_len).shuffle(seed=random_seed).select(range(1024))
dev_dataset   = raw_datasets["valid"].filter(lambda x: x["dest_lang"] == lang_id and len(x["source_text"])<max_source_test_len).shuffle(seed=random_seed).select(range(16))
test_dataset  = raw_datasets["test"].filter(lambda x: x["dest_lang"] == lang_id and len(x["source_text"])<max_source_test_len).shuffle(seed=random_seed).select(range(128))

[The Llama family](https://huggingface.co/meta-llama) of LLMs requires you to accept the license terms and acceptable use policy before downloading the weights. More precisely, we are going to use the LLM [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf).

Log in to Hugging Face to be granted access to Llama2 with 7B parameters:

In [8]:
#huggingface-cli login

Now we load the pre-trained tokenizer for the Llama2 model with left padding, as required for batched generation with causal LMs, and set the maximum sequence length to 50 tokens (enforced later in the preprocessing functions):

In [3]:
from transformers import AutoTokenizer

max_tok_length = 50
checkpoint = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(
    checkpoint, use_auth_token=True,
    #padding=True,
    #pad_to_multiple_of=8,
    #truncation=True,
    #max_length=max_tok_length,
    padding_side='left',
    )
# "[PAD]" is not in the Llama2 vocabulary, so pad_token_id falls back to the <unk> id (0)
tokenizer.pad_token = "[PAD]"



We can apply a preprocessing function to any dataset, taking advantage of the fact that Hugging Face Datasets are [Apache Arrow](https://arrow.apache.org) files stored on disk, so only the samples you ask for are loaded in memory.

To keep the data as a dataset, we will use the [Dataset.map() function](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.Dataset.map). The map() method applies a function to each element of the dataset; with batched=True, as used below, the function instead receives a batch of samples (a dictionary mapping each column name to a list of values).
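To make the batched calling convention concrete, here is a toy, library-free sketch. The helper `toy_batched_map` is hypothetical: it only mimics the shape of the data handed to the function when batching is enabled (a dictionary of lists) and is not how the Datasets library is implemented. The sentences are made up for illustration.

```python
def add_prompt(batch):
    # `batch` maps each column name to a list with one entry per sample.
    return {"prompt": [f"en: {s} = es: " for s in batch["source_text"]]}


def toy_batched_map(columns, function, batch_size=2):
    """Hypothetical stand-in that mimics a batched map over a dict of columns."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Slice every column to build one batch (a dict of lists).
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        # Every key returned by the function becomes a new (or replaced) column.
        for name, values in function(batch).items():
            new_columns.setdefault(name, []).extend(values)
    return {**columns, **new_columns}


toy_columns = {"source_text": ["Hello.", "Thank you.", "Good morning."]}
mapped = toy_batched_map(toy_columns, add_prompt)
print(mapped["prompt"])
```

The function is called once per batch rather than once per sample, which is why the preprocessing functions in this notebook loop over `range(batch_size)`.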

In our case, each sample pair is preprocessed according to the needs of training/dev and test for the model to be fine-tuned. In the case of Llama2, it is recommended to explicitly state a task prompt for each source sentence.

The processing adds new fields to the datasets, one for each key in the dictionary returned by the tokenize function, that is, *input_ids*, *attention_mask* and *labels*:

In [4]:
import torch

src = "en"
tgt = lang
task_prefix = f"Translate from {src} to {tgt}:\n"

def preprocess4training_function(batch):
    max_length=max_tok_length
    batch_size = len(batch["source_text"])

    # Creating the prompt with the task description and the source sentence for each sample in the batch
    inputs  = [f"{task_prefix}{src}: {s} = {tgt}: " for s in batch["source_text"]]

    # Appending new line after each sample in the batch
    targets = [f"{s}\n" for s in batch["dest_text"]]

    # Applying the Llama2 tokenizer to the inputs and targets 
    # to obtain "input_ids" (token_ids) and "attention mask" 
    model_inputs = tokenizer(inputs)
    labels = tokenizer(targets)
    
    # Each input is appended with its target
    # Each target is prepended with as many ignore-index labels (-100) as the original input length
    # Both input and target (label) have the same length
    # Attention mask is all 1s
    for i in range(batch_size):
        sample_input_ids = model_inputs["input_ids"][i]
        label_input_ids = labels["input_ids"][i] + [tokenizer.eos_token_id]
        model_inputs["input_ids"][i] = sample_input_ids + label_input_ids
        labels["input_ids"][i] = [-100] * len(sample_input_ids) + label_input_ids
        model_inputs["attention_mask"][i] = [1] * len(model_inputs["input_ids"][i])

    # Each input is applied left padding up to max_len
    # Attention mask is 0 for padding
    # Each target (label) is left filled with special token id (-100)
    # Finally inputs, attention_mask and targets (labels) are truncated to max_length
    for i in range(batch_size):
        sample_input_ids = model_inputs["input_ids"][i]
        label_input_ids = labels["input_ids"][i]
        model_inputs["input_ids"][i] = [tokenizer.pad_token_id] * (
            max_length - len(sample_input_ids)
        ) + sample_input_ids
        model_inputs["attention_mask"][i] = [0] * (max_length - len(sample_input_ids)) + model_inputs[
            "attention_mask"
        ][i]
        labels["input_ids"][i] = [-100] * (max_length - len(sample_input_ids)) + label_input_ids
        model_inputs["input_ids"][i] = torch.tensor(model_inputs["input_ids"][i][:max_length])
        model_inputs["attention_mask"][i] = torch.tensor(model_inputs["attention_mask"][i][:max_length])
        labels["input_ids"][i] = torch.tensor(labels["input_ids"][i][:max_length])
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


In the case of the test set, we just preprocess the inputs (source sentences):

In [18]:
def preprocess4test_function(batch):
    max_length=max_tok_length
    batch_size = len(batch["source_text"])

    # Creating the prompt with the task description and the source sentence for each sample in the batch
    inputs  = [f"{task_prefix}{src}: {s} = {tgt}: " for s in batch["source_text"]]

    # Applying the Llama2 tokenizer to the inputs 
    # to obtain "input_ids" (token_ids) and "attention mask" 
    model_inputs = tokenizer(inputs)
    
    # At test time there are no targets, so the inputs are left untouched
    # Attention mask is all 1s
    for i in range(batch_size):
        model_inputs["attention_mask"][i] = [1] * len(model_inputs["input_ids"][i])

    # Each input is applied left padding up to max_length
    # Attention mask is 0 for padding
    # Finally inputs and attention_mask are truncated to max_length
    for i in range(batch_size):
        sample_input_ids = model_inputs["input_ids"][i]
        model_inputs["input_ids"][i] = [tokenizer.pad_token_id] * (
            max_length - len(sample_input_ids)
        ) + sample_input_ids
        model_inputs["attention_mask"][i] = [0] * (max_length - len(sample_input_ids)) + model_inputs[
            "attention_mask"
        ][i]
        model_inputs["input_ids"][i] = torch.tensor(model_inputs["input_ids"][i][:max_length])
        model_inputs["attention_mask"][i] = torch.tensor(model_inputs["attention_mask"][i][:max_length])
    
    return model_inputs


We can check what the preprocess4training_function is doing:

In [11]:
sample = train_dataset.select(range(4))
model_input = preprocess4training_function(sample)
print(model_input)
print(tokenizer.batch_decode(model_input.input_ids))

{'input_ids': [tensor([    0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     1,  4103,  9632,
          515,   427,   304,   831, 29901,    13,   264, 29901,  1670,   526,
         5065,  5362,  4225, 29889,   353,   831, 29901, 29871,     1, 11389,
          443,   294, 16632,  7305,  5065, 29887,  5326, 29889,    13,     2]), tensor([    0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     1,  4103,  9632,   515,   427,   304,   831, 29901,    13,
          264, 29901,  1105,  3113,  8167, 15293, 10465,   381, 29889,   353,
          831, 29901, 29871,     1,  1094, 29983,   712, 29892,  1277,  7853,
        29892, 25264,   264,   263, 15293, 10465,   381, 29889,    13,     2]), tensor([    0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     1,  4103,  9632,   515,   427,   304,   831, 29901,    13,
          264, 29901, 21353,   466,   575,  4

batch_decode cannot handle the ignore index, so we replace -100 with the padding token id (0) before decoding:

In [13]:
import numpy as np
tokenizer.batch_decode([np.where(model_input.labels[0] < 0, tokenizer.pad_token_id, model_input.labels[0])])

['<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><s> Hay unas necesidades urgentes.\n</s>']
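The per-sample logic of preprocess4training_function (concatenate prompt and target, mask the prompt in the labels, left-pad, truncate) can be reproduced on toy data without a tokenizer. This is a minimal sketch: the helper `build_example` and all token ids below are made up for illustration and do not come from the Llama2 vocabulary.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def build_example(prompt_ids, target_ids, eos_id, pad_id, max_length):
    """Toy version of the per-sample steps: concatenate, mask prompt, left-pad, truncate."""
    target_ids = target_ids + [eos_id]
    input_ids = prompt_ids + target_ids
    # Prompt positions get IGNORE_INDEX so that only target tokens contribute to the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + target_ids
    n_pad = max(0, max_length - len(input_ids))
    return {
        "input_ids": ([pad_id] * n_pad + input_ids)[:max_length],
        "attention_mask": ([0] * n_pad + [1] * len(input_ids))[:max_length],
        "labels": ([IGNORE_INDEX] * n_pad + labels)[:max_length],
    }


example = build_example(
    prompt_ids=[1, 11, 12, 13], target_ids=[21, 22], eos_id=2, pad_id=0, max_length=10
)
for name, values in example.items():
    print(name, values)
```

The three lists always have the same length, and the positions where the labels differ from -100 are exactly the target tokens plus the end-of-sequence token.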

We can check what the preprocess4test_function is doing:

In [14]:
sample = test_dataset.select(range(1))
model_input = preprocess4test_function(sample)
print(model_input)
print(tokenizer.batch_decode(model_input.input_ids))

{'input_ids': [tensor([    0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            1,  4103,  9632,   515,   427,   304,   831, 29901,    13,   264,
        29901, 15366,   310,   445,  2924, 29889,   353,   831, 29901, 29871])], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1])]}
['<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><s> Translate from en to es:\nen: measures of this kind. = es: ']


Preprocessing train and dev sets:

In [5]:
preprocessed_train_dataset = train_dataset.map(preprocess4training_function, batched=True)
preprocessed_dev_dataset = dev_dataset.map(preprocess4training_function, batched=True)

In [18]:
for i in range(len(preprocessed_train_dataset['input_ids'])):
    print(preprocessed_train_dataset['input_ids'][i])
    print(preprocessed_train_dataset['attention_mask'][i])

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4103, 9632, 515, 427, 304, 831, 29901, 13, 264, 29901, 1670, 526, 5065, 5362, 4225, 29889, 353, 831, 29901, 29871, 1, 11389, 443, 294, 16632, 7305, 5065, 29887, 5326, 29889, 13, 2]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4103, 9632, 515, 427, 304, 831, 29901, 13, 264, 29901, 1105, 3113, 8167, 15293, 10465, 381, 29889, 353, 831, 29901, 29871, 1, 1094, 29983, 712, 29892, 1277, 7853, 29892, 25264, 264, 263, 15293, 10465, 381, 29889, 13, 2]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4103, 9632, 515, 427, 304, 831, 29901, 13, 264, 29901, 21353, 466, 575, 4034, 12185, 297, 19495, 3900, 29889, 353, 831, 29901, 29871, 1, 997, 9866, 273, 1553, 337, 1113

Preprocessing test set:

In [38]:
preprocessed_test_dataset = test_dataset.map(preprocess4test_function, batched=True)

Map: 100%|██████████| 128/128 [00:00<00:00, 4880.47 examples/s]


In [39]:
for i in range(len(preprocessed_test_dataset['input_ids'])):
    print(preprocessed_test_dataset['input_ids'][i])
    print(preprocessed_test_dataset['attention_mask'][i])

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4103, 9632, 515, 427, 304, 831, 29901, 13, 264, 29901, 15366, 310, 445, 2924, 29889, 353, 831, 29901, 29871]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4103, 9632, 515, 427, 304, 831, 29901, 13, 264, 29901, 910, 338, 263, 4472, 29889, 353, 831, 29901, 29871]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 4103, 9632, 515, 427, 304, 831, 29901, 13, 264, 29901, 2193, 338, 2020, 445, 2228, 338, 577, 8018, 29889, 353, 831, 29901, 29871]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 

bitsandbytes is a quantization library with a Transformers integration. With this integration, you can quantize a model to 8 or 4 bits and enable many other options by configuring the BitsAndBytesConfig class. For example, you can:

<ul>
<li>set load_in_4bit=True to quantize the model to 4 bits when you load it</li>
<li>set bnb_4bit_quant_type="nf4" to use a special 4-bit data type for weights initialized from a normal distribution</li>
<li>set bnb_4bit_use_double_quant=True to use a nested quantization scheme to quantize the already quantized weights</li>
<li>set bnb_4bit_compute_dtype=torch.bfloat16 to use bfloat16 for faster computation</li>
</ul>
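To get a feeling for why 4-bit loading matters, the following back-of-the-envelope sketch compares the storage needed for the weights at 16 and at 4 bits per parameter. The parameter count is derived from the numbers printed by print_trainable_parameters() later in this notebook; the helper `weight_gib` is hypothetical. The estimate is deliberately rough: it ignores quantization metadata and the fact that some layers are typically kept in higher precision, so the real footprint is somewhat larger.

```python
# Base model parameters: 6,746,804,224 in total minus 8,388,608 LoRA parameters.
n_params = 6_746_804_224 - 8_388_608


def weight_gib(n_params, bits_per_param):
    """Approximate weight storage in GiB, ignoring quantization metadata."""
    return n_params * bits_per_param / 8 / 2**30


gib_16bit = weight_gib(n_params, 16)
gib_4bit = weight_gib(n_params, 4)
print(f"16-bit weights: ~{gib_16bit:.1f} GiB")
print(f" 4-bit weights: ~{gib_4bit:.1f} GiB")
```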


In [6]:
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

Pass the quantization_config to the from_pretrained method.

In [7]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)

Loading checkpoint shards: 100%|██████████| 2/2 [00:07<00:00,  3.90s/it]


Next, you should call the prepare_model_for_kbit_training() function to preprocess the quantized model for training.

In [8]:
from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=False, gradient_checkpointing_kwargs={'use_reentrant':False})

[LoRA (Low-Rank Adaptation of Large Language Models)](https://huggingface.co/docs/peft/task_guides/lora_based_methods) is a [parameter-efficient fine-tuning (PEFT)](https://huggingface.co/docs/peft/index) technique that significantly reduces the number of trainable parameters. It works by inserting a small number of new weights into the model and training only these. This makes training with LoRA much faster and more memory-efficient, and it produces smaller model weights (a few hundred MBs), which are easier to store and share.

Each PEFT method is defined by a PeftConfig class that stores all the important parameters for building a PeftModel. For example, to train with LoRA, load and create a LoraConfig class and specify the following parameters:

<ul>
<li>task_type: the task to train for (causal language modeling in this case)</li>
<li>r: the dimension of the low-rank matrices</li>
<li>lora_alpha: the scaling factor for the low-rank matrices</li>
<li>target_modules: determines which modules are adapted (not set below, so the PEFT defaults for the model architecture apply)</li>
<li>lora_dropout: the dropout probability of the LoRA layers</li>
</ul>

In [9]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    inference_mode=False,
)

Once LoRA and the quantization are set up, create a quantized PeftModel with the get_peft_model() function. It takes a quantized model and the LoraConfig containing the parameters for how to configure a model for training with LoRA.

In [10]:
lora_model = get_peft_model(model, config)
lora_model.print_trainable_parameters()

trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.1243
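The number of trainable parameters can be checked by hand. The sketch below assumes the Llama-2-7b architecture (hidden size 4096, 32 decoder layers) and assumes that, since target_modules is not set, LoRA is attached to the query and value projections of every layer; these values are assumptions, not read from the model object. Under them, each adapted 4096 x 4096 linear layer receives two low-rank matrices with r x 4096 and 4096 x r entries.

```python
hidden_size = 4096      # assumed hidden size of Llama-2-7b
num_layers = 32         # assumed number of decoder layers
adapted_per_layer = 2   # assumed default targets: q_proj and v_proj
r = 16                  # rank used in the LoraConfig

# A has shape (r, d_in) and B has shape (d_out, r); here d_in == d_out == hidden_size.
params_per_module = r * (hidden_size + hidden_size)
lora_params = params_per_module * adapted_per_layer * num_layers

all_params = 6_746_804_224  # "all params" as printed by print_trainable_parameters()
trainable_pct = 100 * lora_params / all_params
print(lora_params)
print(f"{trainable_pct:.4f}%")
```

The result matches the figures reported by print_trainable_parameters(), which supports the assumption about the adapted modules.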


The function that is responsible for putting together samples inside a batch is called a collate function. It is an argument you can pass when you build a DataLoader, the default being a function that will just convert your samples to PyTorch tensors and concatenate them. That default only works when all samples have the same length. Here the preprocessing functions already pad every sample to max_tok_length, so the collate function mainly has to stack the samples and pad them up to a convenient multiple.

The Transformers library provides such a function via DataCollatorForLanguageModeling, which takes a tokenizer when you instantiate it (to know which padding token to use, and whether padding goes on the left or on the right of the inputs). With mlm=False it prepares batches for causal language modeling: according to the Transformers documentation, the labels are then the same as the inputs with the padding tokens set to -100, which means the labels built during preprocessing (with the prompt masked) are presumably overwritten at this point:

In [11]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, pad_to_multiple_of=8)
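What a collate function of this kind does can be sketched in plain Python. The function `toy_causal_lm_collate` below is a hypothetical re-implementation for illustration only, not the library's code. It assumes left padding, rounds the batch length up to a multiple of pad_to_multiple_of, and assumes (as the Transformers documentation describes for mlm=False) that the labels are a copy of the inputs with padding positions set to -100. The token ids are made up.

```python
IGNORE_INDEX = -100


def toy_causal_lm_collate(features, pad_id, pad_to_multiple_of=8):
    """Toy collate function: left-pad to a common length and derive labels from input_ids."""
    longest = max(len(f["input_ids"]) for f in features)
    # Round the batch length up to the next multiple of `pad_to_multiple_of`.
    target_len = -(-longest // pad_to_multiple_of) * pad_to_multiple_of
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = target_len - len(f["input_ids"])
        input_ids = [pad_id] * n_pad + f["input_ids"]
        batch["input_ids"].append(input_ids)
        batch["attention_mask"].append([0] * n_pad + f["attention_mask"])
        # Labels mirror the inputs; padding positions are ignored by the loss.
        batch["labels"].append([IGNORE_INDEX if t == pad_id else t for t in input_ids])
    return batch


features = [
    {"input_ids": [1, 5, 6, 2], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [1, 7, 2], "attention_mask": [1, 1, 1]},
]
batch = toy_causal_lm_collate(features, pad_id=0)
for name, rows in batch.items():
    print(name, rows)
```

With samples of 50 tokens, as produced by the preprocessing in this notebook, the same rounding would give batches of length 56.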


## Training

The first step before we can define our [Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer) is to define a [TrainingArguments class](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments) that will contain all the hyperparameters the Trainer will use for training and evaluation. The only compulsory argument you have to provide is a directory where the trained model will be saved, as well as the checkpoints along the way. For all the rest, you can set them depending on the recommendations from the model developers:

In [12]:
from transformers import TrainingArguments

batch_size = 1
gradient_accumulation_steps = 128
model_name = checkpoint.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-en-to-es",
    evaluation_strategy = "epoch",
    learning_rate=1e-4,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=3,
    warmup_steps=100,
    optim="adamw_bnb_8bit",
    prediction_loss_only=True,
    gradient_accumulation_steps = gradient_accumulation_steps,
    bf16=True,
    bf16_full_eval=True,
    group_by_length=True,
)

Once we have our model, we can define a Trainer by passing it all the objects constructed up to now: the model, the training arguments (args), the training and validation datasets, the tokenizer and the data collator:

In [13]:
from transformers import Trainer

trainer = Trainer(
    lora_model,
    args,
    train_dataset=preprocessed_train_dataset,
    eval_dataset=preprocessed_dev_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)


To fine-tune the model on our dataset, we just have to call the [train() function](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.Trainer.train) of our Trainer:

In [14]:
trainer.train()


{'eval_loss': 4.333461284637451, 'eval_runtime': 7.3059, 'eval_samples_per_second': 2.19, 'eval_steps_per_second': 2.19, 'epoch': 1.0}
{'eval_loss': 4.220178604125977, 'eval_runtime': 7.2655, 'eval_samples_per_second': 2.202, 'eval_steps_per_second': 2.202, 'epoch': 2.0}
{'eval_loss': 3.971428871154785, 'eval_runtime': 7.3012, 'eval_samples_per_second': 2.191, 'eval_steps_per_second': 2.191, 'epoch': 3.0}
{'train_runtime': 1912.9791, 'train_samples_per_second': 1.606, 'train_steps_per_second': 0.013, 'train_loss': 4.293796857198079, 'epoch': 3.0}





TrainOutput(global_step=24, training_loss=4.293796857198079, metrics={'train_runtime': 1912.9791, 'train_samples_per_second': 1.606, 'train_steps_per_second': 0.013, 'train_loss': 4.293796857198079, 'epoch': 3.0})
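The reported global_step of 24 follows from the training arguments, as the short calculation below shows. It assumes a single GPU and the Trainer's default linear learning-rate schedule with warmup; both are assumptions, since neither is stated explicitly in this notebook. Under them, the 100 warmup steps are never completed, so the learning rate is still ramping up when training ends.

```python
import math

num_train_samples = 1024
per_device_batch_size = 1
gradient_accumulation_steps = 128
num_epochs = 3
warmup_steps = 100
peak_lr = 1e-4
num_devices = 1  # assumption: training runs on a single GPU

# Samples consumed per optimizer step (the effective batch size).
samples_per_step = per_device_batch_size * gradient_accumulation_steps * num_devices
steps_per_epoch = math.ceil(num_train_samples / samples_per_step)
total_steps = steps_per_epoch * num_epochs

# Linear warmup: the learning rate grows proportionally to the step count.
final_lr = peak_lr * min(1.0, total_steps / warmup_steps)
print(f"optimizer steps: {total_steps}")
print(f"learning rate reached at the end: {final_lr:.1e}")
```

If this reading is right, lowering warmup_steps or training for more steps would be natural things to try when the evaluation loss decreases as slowly as above.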

## Inference

At inference time, it is recommended to use [generate()](https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationMixin.generate). For a decoder-only model such as Llama2, this method takes the tokenized prompt and auto-regressively generates its continuation. Check out [this blog post](https://huggingface.co/blog/how-to-generate) to know all the details about generating text with Transformers.

In [15]:
from transformers import GenerationConfig

generation_config = GenerationConfig.from_pretrained(
    checkpoint,
    )

print(generation_config)



GenerationConfig {
  "bos_token_id": 1,
  "do_sample": true,
  "eos_token_id": 2,
  "max_length": 4096,
  "pad_token_id": 0,
  "temperature": 0.6,
  "top_p": 0.9,
  "transformers_version": "4.30.0"
}



In [45]:
test_batch_size = 4
batch_tokenized_test = preprocessed_test_dataset.batch(test_batch_size)

Batching examples: 100%|██████████| 128/128 [00:00<00:00, 4266.43 examples/s]


In [46]:
number_of_batches = len(batch_tokenized_test["input_ids"])
output_sequences = []
for i in range(number_of_batches):
    output_batch = model.generate(
        generation_config=generation_config,
        input_ids=torch.tensor(batch_tokenized_test["input_ids"][i]).cuda(),
        attention_mask=torch.tensor(batch_tokenized_test["attention_mask"][i]).cuda(),
        max_length=max_tok_length + 20,  # prompt length plus up to 20 new tokens
    )
    output_sequences.extend(output_batch)
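For a decoder-only model, each generated sequence contains the (padded) prompt followed by the new tokens. When every prompt in a batch has the same padded length, the continuation can therefore be isolated by slicing at that length. The sketch below illustrates this on made-up token ids; the helper `strip_prompt` is hypothetical and is an alternative to removing the prompt as a string prefix after decoding, which is what this notebook does.

```python
def strip_prompt(output_ids, prompt_length, eos_id, pad_id):
    """Keep only the newly generated tokens, cut at the first EOS or padding token."""
    continuation = output_ids[prompt_length:]
    for position, token in enumerate(continuation):
        if token in (eos_id, pad_id):
            return continuation[:position]
    return continuation


prompt_length = 6                              # padded prompt length (toy value)
toy_output = [0, 0, 1, 11, 12, 13, 21, 22, 2, 0]  # left padding, prompt, new tokens, EOS, padding
new_tokens = strip_prompt(toy_output, prompt_length, eos_id=2, pad_id=0)
print(new_tokens)
```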

In [23]:
from evaluate import load

metric = load("sacrebleu")

In [24]:
def compute_metrics(sample, output_sequences):
    inputs = [f"{task_prefix}{src}: {x} = {tgt}: " for x in sample["source_text"]]
    preds = tokenizer.batch_decode(output_sequences, skip_special_tokens=True)
    print(inputs)
    print(preds)
    for i, (prompt, pred) in enumerate(zip(inputs, preds)):
        # Drop the echoed prompt and keep only the first line of the continuation
        # (empty string if nothing was generated)
        preds[i] = pred.removeprefix(prompt).strip().split("\n")[0]
    print(sample["source_text"])
    print(sample["dest_text"])
    print(preds)
    result = metric.compute(predictions=preds, references=sample["dest_text"])
    result = {"bleu": result["score"]}
    return result

In [47]:
result = compute_metrics(preprocessed_test_dataset,output_sequences)
print(f'BLEU score: {result["bleu"]}')

['Translate from en to es:\nen: measures of this kind. = es: ', 'Translate from en to es:\nen: This is a template. = es: ', 'Translate from en to es:\nen: That is why this issue is so relevant. = es: ', 'Translate from en to es:\nen: Now it is the turn of the constitution. = es: ', 'Translate from en to es:\nen: It must be flexible, but it must exist. = es: ', 'Translate from en to es:\nen: It has become STX of Korea. = es: ', 'Translate from en to es:\nen: Let us reinvigorate our values. = es: ', 'Translate from en to es:\nen: End of quotation. = es: ', 'Translate from en to es:\nen: This is a fundamental issue. = es: ', 'Translate from en to es:\nen: When is this madness going to stop? = es: ', 'Translate from en to es:\nen: I think it would be a good idea. = es: ', 'Translate from en to es:\nen: This is an insult to democracy. = es: ', 'Translate from en to es:\nen: Do you have an opinion on this? = es: ', 'Translate from en to es:\nen: He is a human rights defender. = es: ', 'Trans
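The post-processing idea behind compute_metrics, removing the echoed prompt and keeping only the first generated line, can be tried in isolation. This is a minimal sketch: the helper `extract_translation` is hypothetical, and the "model outputs" below are invented strings, not actual generations from the fine-tuned model.

```python
def extract_translation(decoded, prompt):
    """Drop the echoed prompt and keep the first line of what follows."""
    remainder = decoded.removeprefix(prompt).strip()
    return remainder.split("\n", 1)[0]


prompt = "Translate from en to es:\nen: This is a template. = es: "

# Invented outputs covering three cases: extra text after the first line,
# a single line ending in a newline, and nothing generated at all.
multi_line = prompt + "Esto es una plantilla.\nen: Another sentence. = es: Otra frase."
single_line = prompt + "Esto es una plantilla.\n"
empty = prompt

for decoded in (multi_line, single_line, empty):
    print(repr(extract_translation(decoded, prompt)))
```

Keeping only the first line matters because the model may keep generating further "en: ... = es: ..." pairs until the length limit is reached.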

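As an alternative to the manual padding above, the tokenizer can pad the test prompts itself (padding=True pads to the longest prompt in the batch):

In [26]: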
simpletokenizer = AutoTokenizer.from_pretrained(
    checkpoint, use_auth_token=True,
    #padding=True,
    #pad_to_multiple_of=8,
    #truncation=True,
    #max_length=max_tok_length,
    padding_side='left',
    )
simpletokenizer.pad_token = "[PAD]"

In [30]:
task_prefix = "Translate from en to es:\n"

def simplepreprocess4test_function(batch):
    # Build the prompts and let the tokenizer pad them to the longest one in the batch
    inputs = [f"{task_prefix}en: {s} = es: " for s in batch["source_text"]]
    model_inputs = simpletokenizer(inputs, padding=True)
    return model_inputs

In [43]:
preprocessed_test_dataset = test_dataset.map(simplepreprocess4test_function, batched=True)

Map: 100%|██████████| 128/128 [00:00<00:00, 15334.35 examples/s]


In [44]:
for i in range(len(preprocessed_test_dataset['input_ids'])):
    print(preprocessed_test_dataset['input_ids'][i])
    print(preprocessed_test_dataset['attention_mask'][i])

[0, 0, 0, 0, 0, 0, 0, 0, 1, 4103, 9632, 515, 427, 304, 831, 29901, 13, 264, 29901, 15366, 310, 445, 2924, 29889, 353, 831, 29901, 29871]
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[0, 0, 0, 0, 0, 0, 0, 0, 1, 4103, 9632, 515, 427, 304, 831, 29901, 13, 264, 29901, 910, 338, 263, 4472, 29889, 353, 831, 29901, 29871]
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[0, 0, 0, 0, 1, 4103, 9632, 515, 427, 304, 831, 29901, 13, 264, 29901, 2193, 338, 2020, 445, 2228, 338, 577, 8018, 29889, 353, 831, 29901, 29871]
[0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[0, 0, 0, 0, 1, 4103, 9632, 515, 427, 304, 831, 29901, 13, 264, 29901, 2567, 372, 338, 278, 2507, 310, 278, 16772, 29889, 353, 831, 29901, 29871]
[0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[0, 0, 0, 1, 4103, 9632, 515, 427, 304, 831, 29901, 13, 264, 29901, 739, 1818, 367, 25706, 29892

In [53]:
result = compute_metrics(preprocessed_test_dataset,output_sequences)
print(f'BLEU score: {result["bleu"]}')

['Translate from en to es:\nen: measures of this kind. = es: ', 'Translate from en to es:\nen: This is a template. = es: ', 'Translate from en to es:\nen: That is why this issue is so relevant. = es: ', 'Translate from en to es:\nen: Now it is the turn of the constitution. = es: ', 'Translate from en to es:\nen: It must be flexible, but it must exist. = es: ', 'Translate from en to es:\nen: It has become STX of Korea. = es: ', 'Translate from en to es:\nen: Let us reinvigorate our values. = es: ', 'Translate from en to es:\nen: End of quotation. = es: ', 'Translate from en to es:\nen: This is a fundamental issue. = es: ', 'Translate from en to es:\nen: When is this madness going to stop? = es: ', 'Translate from en to es:\nen: I think it would be a good idea. = es: ', 'Translate from en to es:\nen: This is an insult to democracy. = es: ', 'Translate from en to es:\nen: Do you have an opinion on this? = es: ', 'Translate from en to es:\nen: He is a human rights defender. = es: ', 'Trans