## Working with Transformers in the HuggingFace Ecosystem

In this laboratory exercise we will learn how to work with the HuggingFace ecosystem to adapt models to new tasks. As you will see, much of what is required is *investigation* into the inner workings of the HuggingFace abstractions. With a little work and a little trial and error, it is fairly easy to get a working adaptation pipeline up and running.

### Exercise 1: Sentiment Analysis (warm up)

In this first exercise we will start from a pre-trained BERT transformer and build up a model able to perform text sentiment analysis. Transformers are complex beasts, so we will build up our pipeline in several explorative and incremental steps.


#### Exercise 1.1: Dataset Splits and Pre-trained model
There are many sentiment analysis datasets, but we will use one of the smallest ones available: the [Cornell Rotten Tomatoes movie review dataset](https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes), which consists of 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews.

**Your first task**: Load the dataset and figure out what splits are available and how to get them. Spend some time exploring the dataset to see how it is organized. Note that we will be using the [HuggingFace Datasets](https://huggingface.co/docs/datasets/en/index) library for downloading, accessing, splitting, and batching data for training and evaluation.

In [1]:
from datasets import load_dataset, get_dataset_split_names

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes")

# Show the available splits
print("Available splits:", ds.keys())

# Or, alternatively
print("Split names:", get_dataset_split_names("cornell-movie-review-data/rotten_tomatoes"))

# Explore one example
print(ds["train"][0])

# Some basic statistics
for split in ds.keys():
    print(f"{split} -> {len(ds[split])} samples")

Available splits: dict_keys(['train', 'validation', 'test'])
Split names: ['train', 'validation', 'test']
{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1}
train -> 8530 samples
validation -> 1066 samples
test -> 1066 samples


In [2]:
train_data = ds["train"]
val_data = ds["validation"]
test_data = ds["test"]

In [3]:
sentence = train_data[0]["text"]
label = train_data[0]["label"]

In [4]:
print(sentence)
print(label)

the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .
1


Label `1` means positive, label `0` means negative.

#### Exercise 1.2: A Pre-trained BERT and Tokenizer

The model we will use is a *very* small BERT transformer called [Distilbert](https://huggingface.co/distilbert/distilbert-base-uncased). This model was trained (using self-supervised learning) on the same corpus as BERT, but using the full BERT base model as a *teacher*.

**Your next task**: Load the Distilbert model and corresponding tokenizer. Use the tokenizer on a few samples from the dataset and pass the tokens through the model to see what outputs are provided. I suggest you use the [`AutoModel`](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html) class (and the `from_pretrained()` method) to load the model and `AutoTokenizer` to load the tokenizer.

In [5]:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert/distilbert-base-uncased")

In [6]:
samples = ds["train"][:3]["text"]
print("Sample texts:", samples)

Sample texts: ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .', 'effective but too-tepid biopic']


In [7]:
inputs = tokenizer(samples, return_tensors="pt", padding=True, truncation=True)
print(inputs.keys())

KeysView({'input_ids': tensor([[  101,  1996,  2600,  2003, 16036,  2000,  2022,  1996,  7398,  2301,
          1005,  1055,  2047,  1000, 16608,  1000,  1998,  2008,  2002,  1005,
          1055,  2183,  2000,  2191,  1037, 17624,  2130,  3618,  2084,  7779,
         29058,  8625, 13327,  1010,  3744,  1011, 18856, 19513,  3158,  5477,
          4168,  2030,  7112, 16562,  2140,  1012,   102,     0,     0,     0,
             0,     0],
        [  101,  1996,  9882,  2135,  9603, 13633,  1997,  1000,  1996,  2935,
          1997,  1996,  7635,  1000, 11544,  2003,  2061,  4121,  2008,  1037,
          5930,  1997,  2616,  3685, 23613,  6235,  2522,  1011,  3213,  1013,
          2472,  2848,  4027,  1005,  1055,  4423,  4432,  1997,  1046,  1012,
          1054,  1012,  1054,  1012, 23602,  1005,  1055,  2690,  1011,  3011,
          1012,   102],
        [  101,  4621,  2021,  2205,  1011,  8915, 23267, 16012, 24330,   102,
             0,     0,     0,     0,     0,     0,     0,   




*   101: the `[CLS]` token (`cls_token`), used for the final classification; it is always placed at the start of the sentence;
*   102: the special `[SEP]` token marking the end of the sentence;
*   input_ids: a matrix of size (batch_size, seq_len) that encodes each sentence as a sequence of numeric ids; the 0s are padding tokens that bring every sequence to the same seq_len;
*   attention_mask: same size as input_ids; it tells us whether each token is "real" or just padding.

After tokenization, the tokens go through Distilbert's embedding layer, which has an embedding matrix of size (30522, 768): take a token (e.g. id=1996), go to the corresponding row of the embedding matrix, and get back a float vector with 768 dimensions (a lookup).

To recap:
1.   each sentence in this batch is encoded as 52 token ids (the length of the longest sentence in the batch, because `padding=True` pads to the longest sequence; it is not a Distilbert hyperparameter);
2.   each token is mapped to a vector of 768 floats through the embedding matrix and the lookup mechanism;
3.   at this point we have a matrix of size N_tokens x dim_emb (52x768) to pass to the transformer blocks, where the magic of the attention mechanism happens.
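
The lookup step can be made concrete with a toy, library-free sketch. The sizes (`VOCAB_SIZE = 10`, `EMB_DIM = 4`) and the token ids below are made-up stand-ins for the real 30522 x 768 matrix, and the mask is derived from the ids only for illustration (the real tokenizer returns it directly). This is a sketch of the idea, not how the library implements it.

```python
import random

# Toy stand-ins for the (30522, 768) embedding matrix (assumed sizes).
VOCAB_SIZE, EMB_DIM = 10, 4

random.seed(0)
# One row of floats per vocabulary id.
embedding_matrix = [[random.uniform(-1, 1) for _ in range(EMB_DIM)]
                    for _ in range(VOCAB_SIZE)]

def embed(input_ids):
    """Look up one embedding row per token id: (seq_len,) -> (seq_len, EMB_DIM)."""
    return [embedding_matrix[token_id] for token_id in input_ids]

# A padded toy sequence: id 0 plays the role of [PAD].
input_ids = [1, 7, 3, 2, 0, 0]
# Illustration only: 1 marks a real token, 0 marks padding.
attention_mask = [1 if token_id != 0 else 0 for token_id in input_ids]

vectors = embed(input_ids)
print(len(vectors), len(vectors[0]))  # 6 4
print(attention_mask)                 # [1, 1, 1, 1, 0, 0]
```

Padding positions still receive an embedding row; it is the attention mask that tells the transformer blocks to ignore them.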


In [8]:
outputs = model(input_ids=inputs["input_ids"],
                attention_mask=inputs["attention_mask"])

In [9]:
print("Output object:", outputs)
print("Last hidden state shape:", outputs.last_hidden_state.shape)

Output object: BaseModelOutput(last_hidden_state=tensor([[[-0.0332, -0.0168,  0.0194,  ...,  0.0476,  0.5834,  0.3036],
         [-0.0235, -0.0555, -0.3638,  ...,  0.1877,  0.5781, -0.1577],
         [-0.0516, -0.1014, -0.1511,  ...,  0.1503,  0.2649, -0.1575],
         ...,
         [ 0.3688, -0.1147,  0.8428,  ..., -0.0708, -0.0178, -0.2516],
         [ 0.0654, -0.0206,  0.1889,  ...,  0.1159,  0.2323, -0.2404],
         [ 0.0373, -0.0104,  0.1203,  ...,  0.1049,  0.2852, -0.3035]],

        [[-0.2062, -0.0490, -0.4036,  ..., -0.1186,  0.6141,  0.3919],
         [-0.4361, -0.1647, -0.3533,  ...,  0.1086,  0.9478, -0.0272],
         [-0.1164,  0.1690,  0.2698,  ..., -0.1971,  0.4372,  0.2527],
         ...,
         [-0.2341,  0.4810, -0.2634,  ..., -0.3397,  0.2567,  0.1274],
         [ 0.7139,  0.0574, -0.3260,  ...,  0.2041, -0.3800, -0.3343],
         [ 0.5649,  0.2806, -0.0295,  ...,  0.1297, -0.3160, -0.1874]],

        [[-0.2706, -0.1265, -0.0500,  ..., -0.3721,  0.2477,  0.330

Besides the internal representations themselves, we also print the shape of the tensor of contextualized representations (the one that feeds the linear layer where classification happens), which is 3 (number of sentences) x 52 (length of the tokenized, padded sentences) x 768 (embedding dimension).
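
As a hedged illustration of how a single vector per sentence can be read out of a tensor shaped (batch, seq_len, hidden), the sketch below uses plain nested lists with made-up numbers. The helper names `cls_vectors` and `masked_mean` are hypothetical; the second one shows a common alternative pooling and is not something the exercise requires.

```python
def cls_vectors(last_hidden_state):
    """Take the vector at position 0 ([CLS]) of every sequence.

    last_hidden_state: nested list shaped [batch][seq_len][hidden_dim].
    Returns a nested list shaped [batch][hidden_dim].
    """
    return [sequence[0] for sequence in last_hidden_state]

def masked_mean(last_hidden_state, attention_mask):
    """Alternative pooling: average only the non-padding token vectors."""
    pooled = []
    for sequence, mask in zip(last_hidden_state, attention_mask):
        real = [vec for vec, keep in zip(sequence, mask) if keep]
        hidden_dim = len(sequence[0])
        pooled.append([sum(vec[d] for vec in real) / len(real)
                       for d in range(hidden_dim)])
    return pooled

# Toy batch: 2 sequences, 3 token positions, hidden size 2 (made-up numbers).
hidden = [
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],   # last position is padding
    [[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]],
]
mask = [[1, 1, 0], [1, 1, 1]]

print(cls_vectors(hidden))        # [[1.0, 2.0], [5.0, 6.0]]
print(masked_mean(hidden, mask))  # [[2.0, 3.0], [7.0, 8.0]]
```

On the real PyTorch output the same selection is `outputs.last_hidden_state[:, 0, :]`, which for the batch above has shape (3, 768).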

#### Exercise 1.3: A Stable Baseline

In this exercise I want you to:
1. Use Distilbert as a *feature extractor* to extract representations of the text strings from the dataset splits;
2. Train a classifier (your choice, but an SVM from Scikit-learn is an easy choice);
3. Evaluate performance on the validation and test splits.

These results are our *stable baseline* -- the **starting** point on which we will (hopefully) improve in the next exercise.

**Hint**: There are a number of ways to implement the feature extractor, but probably the best is to use a [feature extraction `pipeline`](https://huggingface.co/tasks/feature-extraction). You will need to interpret the output of the pipeline and extract only the `[CLS]` token from the *last* transformer layer. *How can you figure out which output that is?*

In [10]:
from transformers import pipeline
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

In [11]:
feature_extractor = pipeline("feature-extraction", model=model, tokenizer=tokenizer)  # use the HF feature-extraction pipeline

Device set to use cuda:0


In [12]:
def get_cls_embeddings(texts):
    # For a single string the pipeline returns a nested list shaped [1, seq_len, hidden_dim]
    cls_tokens = []  # collects one [CLS] vector per text
    for text in texts:
        outputs = feature_extractor(text, truncation=True, padding=True)
        cls_tokens.append(np.array(outputs)[0][0])  # keep only the first token ([CLS]) of the sequence
    return np.vstack(cls_tokens)

In [13]:
X_train = get_cls_embeddings(ds["train"]["text"])
y_train = np.array(ds["train"]["label"])

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


In [14]:
X_val = get_cls_embeddings(ds["validation"]["text"])
y_val = np.array(ds["validation"]["label"])

In [15]:
X_test = get_cls_embeddings(ds["test"]["text"])
y_test = np.array(ds["test"]["label"])

In [16]:
clf = LinearSVC()
clf.fit(X_train, y_train)

In [17]:
print("Validation performance:")
print(classification_report(y_val, clf.predict(X_val)))

Validation performance:
              precision    recall  f1-score   support

           0       0.81      0.84      0.83       533
           1       0.84      0.80      0.82       533

    accuracy                           0.82      1066
   macro avg       0.82      0.82      0.82      1066
weighted avg       0.82      0.82      0.82      1066



In [18]:
print("Test performance:")
print(classification_report(y_test, clf.predict(X_test)))

Test performance:
              precision    recall  f1-score   support

           0       0.79      0.81      0.80       533
           1       0.81      0.78      0.79       533

    accuracy                           0.80      1066
   macro avg       0.80      0.80      0.80      1066
weighted avg       0.80      0.80      0.80      1066



-----
### Exercise 2: Fine-tuning Distilbert

In this exercise we will fine-tune the Distilbert model to (hopefully) improve sentiment analysis performance.

#### Exercise 2.1: Token Preprocessing

The first thing we need to do is *tokenize* our dataset splits. Our current datasets return a dictionary with *strings*, but we want *input token ids* (i.e. the output of the tokenizer). This is easy enough to do by hand, but the HuggingFace `Dataset` class provides convenient, efficient, and *lazy* methods. See the documentation for [`Dataset.map`](https://huggingface.co/docs/datasets/v3.5.0/en/package_reference/main_classes#datasets.Dataset.map).

**Tip**: Verify that your new datasets are returning for every element: `text`, `label`, `input_ids`, and `attention_mask`.

In [19]:
# Tokenization function
def tokenize_function(examples):
    # Note: padding="max_length" pads every example to the tokenizer's
    # model_max_length (512 for this tokenizer), not to the longest sequence.
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True
    )

# Apply the function to all splits (train, validation, test)
tokenized_datasets = ds.map(tokenize_function, batched=True)

# Check one example
print(tokenized_datasets["train"][0]['text'])
print(tokenized_datasets["train"][0]['label'])
print(tokenized_datasets["train"][0]['input_ids'])
print(tokenized_datasets["train"][0]['attention_mask'])

Map:   0%|          | 0/8530 [00:00<?, ? examples/s]

Map:   0%|          | 0/1066 [00:00<?, ? examples/s]

Map:   0%|          | 0/1066 [00:00<?, ? examples/s]

the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .
1
[101, 1996, 2600, 2003, 16036, 2000, 2022, 1996, 7398, 2301, 1005, 1055, 2047, 1000, 16608, 1000, 1998, 2008, 2002, 1005, 1055, 2183, 2000, 2191, 1037, 17624, 2130, 3618, 2084, 7779, 29058, 8625, 13327, 1010, 3744, 1011, 18856, 19513, 3158, 5477, 4168, 2030, 7112, 16562, 2140, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

#### Exercise 2.2: Setting up the Model to be Fine-tuned

In this exercise we need to prepare the base Distilbert model for fine-tuning for a *sequence classification task*. This means, at the very least, appending a new, randomly-initialized classification head connected to the `[CLS]` token of the last transformer layer. Luckily, HuggingFace already provides an `AutoModel` for just this type of instantiation: [`AutoModelForSequenceClassification`](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodelforsequenceclassification). You will want to instantiate one of these for fine-tuning.

In [20]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2  # Rotten Tomatoes has two sentiment classes: positive/negative
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### Exercise 2.3: Fine-tuning Distilbert

Finally. In this exercise you should use a HuggingFace [`Trainer`](https://huggingface.co/docs/transformers/main/en/trainer) to fine-tune your model on the Rotten Tomatoes training split. Setting up the trainer will involve (at least):


1. Instantiating a [`DataCollatorWithPadding`](https://huggingface.co/docs/transformers/en/main_classes/data_collator) object which is what *actually* does your batch construction (by padding all sequences to the same length).
2. Writing an *evaluation function* that will measure the classification accuracy. This function takes a single argument which is a tuple containing `(logits, labels)` which you should use to compute classification accuracy (and maybe other metrics like F1 score, precision, recall) and return a `dict` with these metrics.  
3. Instantiating a [`TrainingArguments`](https://huggingface.co/docs/transformers/v4.51.1/en/main_classes/trainer#transformers.TrainingArguments) object using some reasonable defaults.
4. Instantiating a `Trainer` object using your train and validation splits, your data collator, and function to compute performance metrics.
5. Calling `trainer.train()`, waiting, waiting some more, and then calling `trainer.evaluate()` to see how it did.

**Tip**: When prototyping this laboratory I discovered the HuggingFace [Evaluate library](https://huggingface.co/docs/evaluate/en/index) which provides evaluation metrics. However, I found it to have insufferable layers of abstraction that get in the way of actually computing metrics. I suggest just using the Scikit-learn metrics...

In [21]:
# !pip list

In [22]:
from transformers import DataCollatorWithPadding, TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

In [23]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [24]:
print(data_collator)

DataCollatorWithPadding(tokenizer=DistilBertTokenizerFast(name_or_path='distilbert/distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
), padding=True, max_length=None, pad_to_multiple_of=None, return_tenso

Although `model_max_length` is 512, earlier we obtained a 3x52x768 tensor. This happens because padding is added only so that all token sequences in a batch have the same length, namely the length of the longest sequence in the batch.
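
The difference between padding to the longest sequence in a batch and padding to a fixed maximum can be sketched without any library. The function `pad_batch` below is a hypothetical toy, not the implementation of `DataCollatorWithPadding`; the dummy token ids are invented, while the lengths 47, 52 and 10 are the token counts of the three sample sentences tokenized earlier.

```python
def pad_batch(batch_input_ids, pad_id=0, max_length=None):
    """Pad every sequence to a common length and build attention masks.

    With max_length=None the target is the longest sequence in the batch
    (dynamic padding); otherwise every sequence is padded to max_length.
    Truncation of longer sequences is not handled in this sketch.
    """
    target = max_length or max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token counts of the three sample sentences: 47, 52 and 10.
lengths = [47, 52, 10]
batch = [[101] + [1000] * (n - 2) + [102] for n in lengths]  # dummy ids

dynamic = pad_batch(batch)                # pad to the longest in the batch
fixed = pad_batch(batch, max_length=512)  # pad to a fixed maximum

print(len(dynamic["input_ids"][0]))  # 52
print(len(fixed["input_ids"][0]))    # 512

# Fraction of positions that are pure padding in each case.
total_real = sum(lengths)
dynamic_waste = 1 - total_real / (len(batch) * 52)
fixed_waste = 1 - total_real / (len(batch) * 512)
print(round(dynamic_waste, 3), round(fixed_waste, 3))
```

For this small batch, fixed-length padding spends most positions on `[PAD]` tokens, which is why per-batch padding is usually cheaper.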

In [25]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}
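
For readers who want to see what these metrics amount to, here is a spelled-out, library-free sketch on made-up logits. The function name `binary_metrics` and the toy data are assumptions for illustration; it mirrors the idea of argmax followed by binary precision, recall and F1 for the positive class, and is not Scikit-learn's implementation.

```python
def binary_metrics(logits, labels):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    # argmax over the logits of each example (first index wins on ties)
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up logits for 4 examples: columns are [negative, positive].
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
labels = [0, 1, 1, 1]

metrics = binary_metrics(logits, labels)
print(metrics)
```

Here the predictions are `[0, 1, 1, 0]`, so one positive example is missed: accuracy is 3/4, precision is 1, and recall is 2/3.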

In [26]:
for param in model.distilbert.parameters():
    param.requires_grad = False

# Quick check: print how many parameters are trainable
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Number of trainable parameters: {trainable_params}")

Number of trainable parameters: 592130
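
The figure of 592130 can be checked by hand. The sketch below assumes that the two newly initialized modules named in the loading warning are fully connected layers, `pre_classifier` mapping 768 to 768 and `classifier` mapping 768 to 2; those shapes are an assumption based on the hidden size and `num_labels=2`, although the matching total supports it.

```python
HIDDEN = 768      # hidden size (the 768 seen in the output shapes)
NUM_LABELS = 2    # num_labels passed to from_pretrained

def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

pre_classifier = linear_params(HIDDEN, HIDDEN)    # 590592
classifier = linear_params(HIDDEN, NUM_LABELS)    # 1538
head_total = pre_classifier + classifier
print(head_total)  # 592130
```

With the backbone frozen, only this classification head receives gradient updates.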


In [27]:
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
print(f"Using device: {device}")

Using device: cuda


In [28]:
training_args = TrainingArguments(
    output_dir="./results",          # directory for saved models and checkpoints
    do_eval=True,                    # enable evaluation
    eval_strategy="epoch",           # evaluate at the end of every epoch
    learning_rate=2e-5,              # learning rate
    per_device_train_batch_size=16,  # training batch size per GPU/CPU
    per_device_eval_batch_size=16,   # evaluation batch size per GPU/CPU
    num_train_epochs=3,              # total number of epochs
    weight_decay=0.01,               # weight decay regularization
    logging_dir="./logs",            # log directory (for TensorBoard)
    logging_steps=50,                # log every 50 steps
    load_best_model_at_end=True,     # reload the best checkpoint when training ends
    save_strategy="epoch"            # save a checkpoint every epoch (must match eval_strategy)
)

In [29]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)



In [30]:
trainer.train()


| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.608 | 0.586392 | 0.752345 | 0.737213 | 0.78424 | 0.76 |
| 2 | 0.5623 | 0.535779 | 0.77955 | 0.804082 | 0.739212 | 0.770283 |
| 3 | 0.5539 | 0.523806 | 0.781426 | 0.803644 | 0.744841 | 0.773126 |


TrainOutput(global_step=1602, training_loss=0.5868256051516563, metrics={'train_runtime': 512.6615, 'train_samples_per_second': 49.916, 'train_steps_per_second': 3.125, 'total_flos': 3389840731607040.0, 'train_loss': 0.5868256051516563, 'epoch': 3.0})
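
The `global_step=1602` in this output follows from the split size and the batch size. The arithmetic below assumes a single device and no gradient accumulation, which matches the arguments set above as far as they are shown.

```python
import math

train_samples = 8530      # size of the train split
batch_size = 16           # per_device_train_batch_size
epochs = 3                # num_train_epochs

# The last batch of each epoch is smaller but still counts as a step.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 534 1602

# Cross-check against the reported runtime of 512.6615 seconds;
# compare with train_samples_per_second in the output.
throughput = train_samples * epochs / 512.6615
print(round(throughput, 3))
```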

In [31]:
trainer.evaluate()

{'eval_loss': 0.5238064527511597,
 'eval_accuracy': 0.7814258911819888,
 'eval_precision': 0.8036437246963563,
 'eval_recall': 0.7448405253283302,
 'eval_f1': 0.7731256085686465,
 'eval_runtime': 15.9466,
 'eval_samples_per_second': 66.848,
 'eval_steps_per_second': 4.202,
 'epoch': 3.0}

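Now let's try full fine-tuning of Distilbert, with every parameter trainable. A new `Trainer` has to be built around the freshly loaded model: the `trainer` created above still holds the earlier model with the frozen backbone, so calling `trainer.train()` on it would simply train that model's head for three more epochs. The results shown below come from a run that reused the earlier trainer, which is why they improve only slightly on the head-only numbers.

In [32]:
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2  # Rotten Tomatoes has two sentiment classes: positive/negative
)
print("Model initialized!")
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Number of trainable parameters in full fine-tuning: {trainable_params}")
model.to(device)
print(f"Using device: {device}")
# Build a new Trainer: the previous `trainer` still holds the frozen model.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.evaluate()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model initialized!
Number of trainable parameters in full fine-tuning: 66955010
Using device: cuda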


| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.5069 | 0.481977 | 0.793621 | 0.793621 | 0.793621 | 0.793621 |
| 2 | 0.5121 | 0.463989 | 0.79925 | 0.818363 | 0.769231 | 0.793037 |
| 3 | 0.493 | 0.459878 | 0.803002 | 0.81854 | 0.778612 | 0.798077 |


{'eval_loss': 0.4598783850669861,
 'eval_accuracy': 0.8030018761726079,
 'eval_precision': 0.8185404339250493,
 'eval_recall': 0.7786116322701688,
 'eval_f1': 0.7980769230769231,
 'eval_runtime': 15.949,
 'eval_samples_per_second': 66.838,
 'eval_steps_per_second': 4.201,
 'epoch': 3.0}

-----
### Exercise 3: Choose at Least One


#### Exercise 3.1: Efficient Fine-tuning for Sentiment Analysis (easy)

In Exercise 2 we fine-tuned the *entire* Distilbert model on Rotten Tomatoes. This is expensive, even for a small model. Find an *efficient* way to fine-tune Distilbert on the Rotten Tomatoes dataset (or some other dataset).

**Hint**: You could check out the [HuggingFace PEFT library](https://huggingface.co/docs/peft/en/index) for some state-of-the-art approaches that should "just work". How else might you go about making fine-tuning more efficient without having to change your training pipeline from above?

In [33]:
from peft import LoraConfig, get_peft_model

# 1. Load DistilBERT with a classification head

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2
)
print("Model loaded!")

# 2. Configure LoRA (low-rank adaptation)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_lin", "v_lin"],  # attention query/value projections in DistilBERT
    lora_dropout=0.1,
    bias="none",
    task_type="SEQ_CLS"
)
print("LoRA configured!")


# Apply LoRA to the model
model = get_peft_model(model, lora_config)
print("LoRA successfully applied to the model")

# 3. Mixed precision: enable fp16 in TrainingArguments
training_args = TrainingArguments(
    output_dir="./results",
    do_eval=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=50,
    load_best_model_at_end=True,
    fp16=True   # <-- mixed precision enabled
)

# 4. Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# 5. Training
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
print(f"Using device: {device}")
print(f"Number of trainable parameters (LoRA adapters + classification head): {sum(p.numel() for p in model.parameters() if p.requires_grad)}")

trainer.train()
results = trainer.evaluate()
print(results)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded!
LoRA configured!
LoRA successfully applied to the model
Using device: cuda
Number of trainable parameters (LoRA adapters + classification head): 739586


| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.4351 | 0.43191 | 0.807692 | 0.822835 | 0.78424 | 0.803074 |
| 2 | 0.4406 | 0.412565 | 0.818011 | 0.838323 | 0.787992 | 0.812379 |
| 3 | 0.4225 | 0.408191 | 0.817073 | 0.836653 | 0.787992 | 0.811594 |


{'eval_loss': 0.40819114446640015, 'eval_accuracy': 0.8170731707317073, 'eval_precision': 0.8366533864541833, 'eval_recall': 0.7879924953095685, 'eval_f1': 0.8115942028985508, 'eval_runtime': 4.7043, 'eval_samples_per_second': 226.6, 'eval_steps_per_second': 14.242, 'epoch': 3.0}
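
The 739586 trainable parameters reported for this run can be reproduced with a short calculation. It assumes that `distilbert-base-uncased` has 6 transformer blocks, that `q_lin` and `v_lin` are 768 x 768 projections, that each adapted layer gets two bias-free low-rank matrices, and that the classification head stays trainable alongside the adapters; these are assumptions about the setup, although the matching total is consistent with them.

```python
HIDDEN = 768        # input and output size of q_lin / v_lin (assumed)
RANK = 8            # r in LoraConfig
N_LAYERS = 6        # transformer blocks in distilbert-base-uncased (assumed)
TARGETS = 2         # q_lin and v_lin

# Each adapted layer gets A (RANK x HIDDEN) and B (HIDDEN x RANK), no bias.
per_module = RANK * HIDDEN + HIDDEN * RANK
lora_total = per_module * TARGETS * N_LAYERS

# Classification head: pre_classifier (768 -> 768) plus classifier (768 -> 2).
head_total = (HIDDEN * HIDDEN + HIDDEN) + (HIDDEN * 2 + 2)

print(lora_total, head_total, lora_total + head_total)  # 147456 592130 739586
```

The adapters therefore add only 147456 parameters on top of the head, a small fraction of the 66955010 parameters of the full model.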


Let's try raising the learning rate (from 2e-5 to 2e-4).

In [34]:
from peft import LoraConfig, get_peft_model

# 1. Load DistilBERT with a classification head

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2
)
print("Model loaded!")

# 2. Configure LoRA (low-rank adaptation)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_lin", "v_lin"],  # attention query/value projections in DistilBERT
    lora_dropout=0.1,
    bias="none",
    task_type="SEQ_CLS"
)
print("LoRA configured!")


# Apply LoRA to the model
model = get_peft_model(model, lora_config)
print("LoRA successfully applied to the model")

# 3. Mixed precision: enable fp16 in TrainingArguments
training_args = TrainingArguments(
    output_dir="./results",
    do_eval=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=50,
    load_best_model_at_end=True,
    fp16=True,   # <-- mixed precision enabled
    report_to="wandb"
)

# 4. Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# 5. Training
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
print(f"Using device: {device}")
print(f"Number of trainable parameters (LoRA adapters + classification head): {sum(p.numel() for p in model.parameters() if p.requires_grad)}")

trainer.train()
results = trainer.evaluate()
print(results)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded!
LoRA configured!
LoRA successfully applied to the model
Using device: cuda
Number of trainable parameters (LoRA adapters + classification head): 739586


| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.3702 | 0.391986 | 0.829268 | 0.792013 | 0.893058 | 0.839506 |
| 2 | 0.3709 | 0.356364 | 0.841463 | 0.848659 | 0.831144 | 0.83981 |
| 3 | 0.3413 | 0.352974 | 0.840525 | 0.844402 | 0.834897 | 0.839623 |


{'eval_loss': 0.3529743254184723, 'eval_accuracy': 0.8405253283302064, 'eval_precision': 0.8444022770398482, 'eval_recall': 0.8348968105065666, 'eval_f1': 0.839622641509434, 'eval_runtime': 4.7866, 'eval_samples_per_second': 222.706, 'eval_steps_per_second': 13.997, 'epoch': 3.0}
