## Working with Transformers in the HuggingFace Ecosystem

In this laboratory exercise we will learn how to work with the HuggingFace ecosystem to adapt models to new tasks. As you will see, much of what is required is *investigation* into the inner workings of the HuggingFace abstractions. With a little work and a little trial and error, it is fairly easy to get a working adaptation pipeline up and running.

### Exercise 1: Sentiment Analysis (warm up)

In this first exercise we will start from a pre-trained BERT transformer and build up a model able to perform text sentiment analysis. Transformers are complex beasts, so we will build up our pipeline in several explorative and incremental steps.

#### Exercise 1.1: Dataset Splits and Pre-trained model
There are many sentiment analysis datasets, but we will use one of the smallest ones available: the [Cornell Rotten Tomatoes movie review dataset](cornell-movie-review-data/rotten_tomatoes), which consists of 5,331 positive and 5,331 negative processed sentences from the Rotten Tomatoes movie reviews.

**Your first task**: Load the dataset and figure out what splits are available and how to get them. Spend some time exploring the dataset to see how it is organized. Note that we will be using the [HuggingFace Datasets](https://huggingface.co/docs/datasets/en/index) library for downloading, accessing, splitting, and batching data for training and evaluation.

In [1]:
# Import necessary libraries

import os
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'
import time
import copy
import random

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

from datasets import load_dataset, get_dataset_split_names
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
from transformers import DataCollatorWithPadding
from transformers import TrainingArguments, Trainer
from peft import LoraConfig, TaskType, get_peft_model

import numpy as np
import math

from tqdm import tqdm
from sklearn import svm
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score



In [2]:
# Set the random seed for reproducibility

SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.deterministic = True

In [3]:
# Path configurations

data_dir = './data' 
checkpoints_dir = './checkpoints'
log_dir = './logs'

batch = 64
n_examples_tokenizer = 5
n_examples_test = 5

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda


In [4]:
# Clear directories for data, checkpoints, and logs

import shutil
import os

def clear(_dir):
    # Remove the directory (if present) and recreate it empty
    if os.path.isdir(_dir):
        shutil.rmtree(_dir)
    os.makedirs(_dir, exist_ok = True)

# Uncomment the following lines to clear the directories

#clear(data_dir)
#clear(checkpoints_dir)
#clear(log_dir)

In [5]:
# TensorBoard setup

%load_ext tensorboard
%tensorboard --logdir=./logs

Reusing TensorBoard on port 6006 (pid 59463), started 0:53:03 ago. (Use '!kill 59463' to kill it.)

In [6]:
# Rotten Tomatoes dataset loading
# Using the Cornell Movie Review Data for Rotten Tomatoes sentiment analysis
# This dataset is available in the Hugging Face datasets library

dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes")

print(dataset.shape)
print(dataset["train"][0])

data_loaders = []

for split in dataset.keys():
    data_loaders.append(DataLoader(dataset[split], batch_size = batch, shuffle = False))

train, val, test = data_loaders

{'train': (8530, 2), 'validation': (1066, 2), 'test': (1066, 2)}
{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1}


#### Exercise 1.2: A Pre-trained BERT and Tokenizer

The model we will use is a *very* small BERT transformer called [Distilbert](https://huggingface.co/distilbert/distilbert-base-uncased). This model was trained (using self-supervised learning) on the same corpus as BERT, but using the full BERT base model as a *teacher*.

**Your next task**: Load the Distilbert model and corresponding tokenizer. Use the tokenizer on a few samples from the dataset and pass the tokens through the model to see what outputs are provided. I suggest you use the [`AutoModel`](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html) class (and the `from_pretrained()` method) to load the model and `AutoTokenizer` to load the tokenizer.

## ⚙️ Testing the Tokenizer with Examples

### ✅ Explanation

- **`tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")`**:  
  Loads a pretrained tokenizer based on the DistilBERT model (uncased, meaning all text is lowercased).

- **Loop over `n_examples_tokenizer`:**  
  - Retrieves text samples from the training split of the `dataset`.  
  - Applies the tokenizer to each example with:  
    - `return_tensors="pt"`: returns PyTorch tensors for integration with PyTorch models.  
    - `truncation=True`: truncates inputs longer than the model’s max length.  
    - `padding=True`: pads shorter inputs to the length of the longest sequence in the batch (a no-op here, since each call tokenizes a single text).  
  - Converts token IDs back to their string token representation and prints them, showing how text is split into tokens.

- **Custom test string:**  
  `"Pippo was so silly at the beginning of the movie, he should had die since min. 1"`  
  Tokenized similarly, then prints:  
  - The keys of the tokenizer output (`input_ids`, `attention_mask`, etc.).  
  - The list of tokens after conversion from token IDs.

---

### 🧠 Theory & Why

- **Why use a pretrained tokenizer:**  
  Pretrained tokenizers have vocabulary and tokenization rules aligned with the pretrained language model, ensuring compatibility and optimal performance.

- **Truncation and padding:**  
  Truncation keeps inputs within the model's maximum length (512 positions for DistilBERT), and padding gives all sequences in a batch the same length so they can be stacked into one tensor.

- **Token IDs to tokens:**  
  Converting IDs back to tokens helps inspect and debug tokenization, verifying the model input representation.

- **Using PyTorch tensors:**  
  Facilitates direct use in PyTorch model pipelines without extra conversion.

- **Testing on custom text:**  
  Demonstrates tokenizer behavior on arbitrary inputs, including token splits and special tokens.
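
Tokens that start with `##` (for example `'schwarz', '##ene', '##gger'` for "schwarzenegger") are continuation pieces produced by WordPiece, which splits words that are not in the vocabulary into the longest pieces that are. The sketch below illustrates that greedy longest-match-first rule in plain Python. The helper `wordpiece_split` and the tiny `toy_vocab` are hypothetical, chosen only to mirror splits seen in this notebook; this is not the tokenizer's actual implementation or vocabulary.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary (an assumption for illustration only).
toy_vocab = {"schwarz", "##ene", "##gger", "gorgeous", "##ly", "the"}

print(wordpiece_split("schwarzenegger", toy_vocab))  # ['schwarz', '##ene', '##gger']
print(wordpiece_split("gorgeously", toy_vocab))      # ['gorgeous', '##ly']
print(wordpiece_split("zzz", toy_vocab))             # ['[UNK]']
```

With the real tokenizer the vocabulary has 30,522 entries, so most common words survive as a single token and only rare words are broken up.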


In [7]:
# Test the tokenizer with a few examples

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

for i in range(n_examples_tokenizer):
    example = dataset["train"][i]["text"]

    ttt = tokenizer(example, return_tensors = "pt", truncation = True, padding = True)
    tokens = tokenizer.convert_ids_to_tokens(ttt["input_ids"][0])
    print(tokens)

print()

test_text = "Pippo was so silly at the beginning of the movie, he should had die since min. 1" #custom text to test

ttt = tokenizer(test_text, return_tensors = "pt", truncation = True, padding = True) #test tokenized text
#pt = pytorch (tensors)


print(ttt.keys())

tokens_ids = tokenizer.convert_ids_to_tokens(ttt["input_ids"][0])
print(tokens_ids)

['[CLS]', 'the', 'rock', 'is', 'destined', 'to', 'be', 'the', '21st', 'century', "'", 's', 'new', '"', 'conan', '"', 'and', 'that', 'he', "'", 's', 'going', 'to', 'make', 'a', 'splash', 'even', 'greater', 'than', 'arnold', 'schwarz', '##ene', '##gger', ',', 'jean', '-', 'cl', '##aud', 'van', 'dam', '##me', 'or', 'steven', 'sega', '##l', '.', '[SEP]']
['[CLS]', 'the', 'gorgeous', '##ly', 'elaborate', 'continuation', 'of', '"', 'the', 'lord', 'of', 'the', 'rings', '"', 'trilogy', 'is', 'so', 'huge', 'that', 'a', 'column', 'of', 'words', 'cannot', 'adequately', 'describe', 'co', '-', 'writer', '/', 'director', 'peter', 'jackson', "'", 's', 'expanded', 'vision', 'of', 'j', '.', 'r', '.', 'r', '.', 'tolkien', "'", 's', 'middle', '-', 'earth', '.', '[SEP]']
['[CLS]', 'effective', 'but', 'too', '-', 'te', '##pid', 'bio', '##pic', '[SEP]']
['[CLS]', 'if', 'you', 'sometimes', 'like', 'to', 'go', 'to', 'the', 'movies', 'to', 'have', 'fun', ',', 'was', '##abi', 'is', 'a', 'good', 'place', 'to', '

## ⚙️ Loading the Pre-trained DistilBERT Base Model and Preparing Tokens

### ✅ Explanation

- **`model = AutoModel.from_pretrained("distilbert-base-uncased", num_labels=2)`**:  
  Loads the pretrained DistilBERT *base* model. `AutoModel` returns the bare transformer without a classification head, so its output is a tensor of hidden states (`last_hidden_state`), not class logits.  
  *Note:* `num_labels=2` is only stored in the model configuration here and has no effect on the outputs. A classification head is added by `AutoModelForSequenceClassification`, which is used in Exercise 2; `AutoModel` is the right choice for feature extraction.

- **`model.to(device)`**:  
  Moves the model’s parameters to the specified device (CPU or GPU) for efficient computation.

- **`tokens = {k: v.to(device) for k, v in ttt.items()}`**:  
  Moves all token tensors (input IDs, attention masks, etc.) to the same device as the model, ensuring compatibility during forward pass.

---

### 🧠 Theory & Why

- **Why load a pretrained model:**  
  Pretrained models have learned rich language representations that improve downstream task performance and reduce training time.

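- **Setting `num_labels=2`:**  
  Records the number of classes (negative vs. positive reviews) in the model configuration. With `AutoModel` this does not add a classification head; it only matters once a head is attached, as with `AutoModelForSequenceClassification`.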

- **Device management:**  
  Keeping model and input tensors on the same device avoids errors and accelerates inference/training.

- **Token preparation:**  
  Ensures that all inputs required by the model are ready and correctly formatted for processing.


In [8]:
# Load the pre-trained DistilBERT base model (no classification head) and move the tokens to the device

# num_labels is only stored in the config (0 = negative, 1 = positive); AutoModel adds no classifier
model = AutoModel.from_pretrained("distilbert-base-uncased", num_labels = 2)
model.to(device)

tokens = {k: v.to(device) for k, v in ttt.items()} #move token tensors to device

#### Exercise 1.3: A Stable Baseline

In this exercise I want you to:
1. Use Distilbert as a *feature extractor* to extract representations of the text strings from the dataset splits;
2. Train a classifier (your choice, but an SVM from Scikit-learn is an easy choice).
3. Evaluate performance on the validation and test splits.

These results are our *stable baseline* -- the **starting** point on which we will (hopefully) improve in the next exercise.

**Hint**: There are a number of ways to implement the feature extractor, but probably the best is to use a [feature extraction `pipeline`](https://huggingface.co/tasks/feature-extraction). You will need to interpret the output of the pipeline and extract only the `[CLS]` token from the *last* transformer layer. *How can you figure out which output that is?*

## ⚙️ Feature Extraction Function Using a Pretrained Transformer Model

### ✅ Explanation

- **`def compute_features(model, tokenizer, data_loader):`**  
  Defines a function to compute feature embeddings for a dataset by passing text through a pretrained model.

- **`model.eval()`**:  
  Sets the model to evaluation mode, disabling dropout and other training-specific layers for consistent outputs.

- **`features = []`**:  
  Initializes a list to collect feature tensors batch-wise.

- **`with torch.no_grad():`**  
  Disables gradient calculation to save memory and computation since we only want inference outputs.

- **Iterate over `data_loader`:**  
  - Tokenizes the batch of texts with padding and truncation for uniform input lengths.  
  - Moves tokens to the same device as the model.  
  - Runs the model forward to get outputs (hidden states).  
  - Extracts the embedding corresponding to the `[CLS]` token (first token) from the last hidden layer (`outputs.last_hidden_state[:, 0, :]`). This token typically summarizes the whole input sequence.  
  - Detaches from computation graph and moves tensor back to CPU for accumulation.  
  - Appends the batch’s embeddings to `features`.  
  - Explicitly deletes intermediate variables and clears CUDA cache to free GPU memory, helpful for large datasets.

- **Concatenate all batch features:**  
  Combines the list of tensors along the batch dimension into a single feature tensor.

- **Return `features`:**  
  Outputs a tensor containing feature embeddings for the entire dataset.

---

### 🧠 Theory & Why

- **Feature extraction from `[CLS]` token:**  
  The `[CLS]` embedding is commonly used as a fixed-size representation of the input sequence, useful for downstream tasks like classification or clustering.

- **Evaluation mode and no grad:**  
  Ensures deterministic outputs and improves efficiency by skipping gradient computations.

- **Batch processing:**  
  Efficiently handles large datasets in manageable chunks.

- **Memory management:**  
  Deleting intermediate variables and clearing cache helps prevent out-of-memory errors during processing on GPUs.

- **Use case:**  
  Extracted features can be used for training classifiers, similarity search, or as input to other ML algorithms.
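
The indexing `last_hidden_state[:, 0, :]` and the final concatenation can be checked on stand-in arrays without loading a model. The sketch below uses NumPy in place of PyTorch and random values in place of real hidden states; the batch sizes and sequence lengths are made-up assumptions, and only the hidden size of 768 matches DistilBERT.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical batches of hidden states: (batch, sequence_length, hidden_size).
# Sequence lengths differ between batches because padding is done per batch.
batch_a = rng.normal(size=(4, 12, 768))
batch_b = rng.normal(size=(3, 9, 768))

features = []
for hidden in (batch_a, batch_b):
    cls_embeds = hidden[:, 0, :]  # every example, position 0 ([CLS]), all hidden dims
    features.append(cls_embeds)

# Stack along the example axis; the sequence axis is already gone.
features = np.concatenate(features, axis=0)
print(features.shape)  # (7, 768)
```

Because only position 0 is kept, batches with different padded lengths can still be concatenated into one `(n_examples, 768)` matrix.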


In [9]:
# Feature extraction function

def compute_features(model, tokenizer, data_loader):
    model.eval()

    features = []

    with torch.no_grad(): 
        for batch in tqdm(data_loader):
            tokens = tokenizer(
                batch["text"],
                padding = True,
                truncation = True,
                return_tensors = "pt"
            )
            tokens = {k: v.to(device) for k, v in tokens.items()}

            outputs = model(**tokens)

            #extract the [CLS] token embedding (first token) from the last hidden state
            #we take all items in batch (:) , the first token (0), and all hidden dimensions (:)
            cls_embeds = outputs.last_hidden_state[:, 0, :].detach().cpu()

            features.append(cls_embeds)

            del tokens, outputs, cls_embeds 
            torch.cuda.empty_cache()

    features = torch.cat(features, dim = 0)

    return features

## ⚙️ Training and Evaluating an SVM Classifier on Extracted Features

### ✅ Explanation

- **`clf = svm.SVC()`**:  
  Initializes a Support Vector Machine (SVM) classifier with default settings.

- **`features = compute_features(model, tokenizer, train)`**:  
  Calls the previously defined function to extract feature embeddings from the training dataset using the pretrained model and tokenizer.

- **`X = features.numpy()`**:  
  Converts the extracted features from a PyTorch tensor to a NumPy array, as scikit-learn expects NumPy inputs.

- **`y = np.array(dataset["train"]["label"])`**:  
  Retrieves the ground-truth labels for the training data as a NumPy array.

- **`clf.fit(X, y)`**:  
  Trains the SVM classifier on the extracted features (`X`) and labels (`y`).

- **`print("acc:", clf.score(X, y))`**:  
  Prints the accuracy of the classifier on the training data by computing the fraction of correctly classified samples.

---

### 🧠 Theory & Why

- **Why use SVM:**  
  SVMs are effective classifiers on small-to-medium datasets. The default `svm.SVC()` uses an RBF kernel, so the features do not need to be linearly separable.

- **Using pretrained features:**  
  Instead of training a deep model end-to-end, pretrained transformer embeddings provide rich semantic features that simplify downstream classification tasks.

- **Conversion to NumPy:**  
  Scikit-learn requires NumPy arrays, so conversion from PyTorch tensors is necessary.

- **Training and evaluation on same data:**  
  Provides a quick sanity check on the classifier's fit, but ideally, evaluation should be done on a separate validation or test set to assess generalization.

- **Pipeline advantage:**  
  This approach leverages powerful pretrained representations with lightweight classical ML methods, often leading to efficient and effective solutions.


In [10]:
# Train and evaluate small SVM classifier on the extracted features
clf = svm.SVC()

features = compute_features(model, tokenizer, train)

X = features.numpy()
y = np.array(dataset["train"]["label"])

clf.fit(X, y)

print("acc:", clf.score(X, y))

100%|██████████| 134/134 [00:03<00:00, 44.38it/s]


acc: 0.816295427901524


## ⚙️ Testing the SVM Classifier on Validation Data

### ✅ Explanation

- **`features = compute_features(model, tokenizer, val)`**:  
  Extracts feature embeddings for the validation dataset using the pretrained model and tokenizer.

- **`X = features.numpy()`**:  
  Converts the PyTorch tensor of features to a NumPy array for compatibility with scikit-learn.

- **`y = np.array(dataset["validation"]["label"])`**:  
  Loads the true labels of the validation set as a NumPy array.

- **`print("acc:", clf.score(X, y))`**:  
  Prints the accuracy of the SVM classifier on the validation set, showing how well the model generalizes.

- **`y_pred = clf.predict(X)`**:  
  Predicts labels for the validation features using the trained classifier.

- **`mis_idxs = np.where(y_pred != y)[0]`**:  
  Identifies indices of misclassified examples where predictions differ from true labels.

- **Loop over misclassified examples (up to `n_examples_test`):**  
  - Prints the original text from the validation set for qualitative error analysis.  
  - Shows the predicted label for each misclassified example.

---

### 🧠 Theory & Why

- **Testing on unseen data:**  
  Evaluates the classifier's performance beyond the training set, measuring generalization.

- **Misclassification analysis:**  
  Reviewing misclassified samples helps diagnose model weaknesses and can guide improvements in preprocessing, feature extraction, or model selection.

- **Feature consistency:**  
  Using the same feature extraction process on validation data ensures fair comparison.

- **Accuracy metric:**  
  A straightforward measure of overall correct classification rate.

- **Combining deep embeddings and classical ML:**  
  Enables leveraging the strengths of pretrained models with simpler classifiers, often yielding efficient and interpretable pipelines.
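
The error-analysis step can be tried on toy arrays first. In this sketch the labels and predictions are invented for illustration; it also prints the true label next to the prediction, which makes each misclassified example easier to interpret.

```python
import numpy as np

# Made-up labels and predictions (assumptions for illustration only).
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0])

accuracy = (y_pred == y_true).mean()      # fraction of correct predictions
mis_idxs = np.where(y_pred != y_true)[0]  # positions where prediction and label differ

print(f"accuracy: {accuracy:.3f}")
for i in mis_idxs:
    print(f"index {i}: predicted {y_pred[i]}, true {y_true[i]}")
```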


In [11]:
#Test the classifier

features = compute_features(model, tokenizer, val)

X = features.numpy()
y = np.array(dataset["validation"]["label"])

print("acc:", clf.score(X, y))

y_pred  = clf.predict(X)
mis_idxs = np.where(y_pred != y)[0]
for i in mis_idxs[:n_examples_test]:
    i = int(i)
    print("Text :", dataset["validation"][i]["text"])
    print(f"Predicted  : { y_pred[i] }")

100%|██████████| 17/17 [00:00<00:00, 45.99it/s]


acc: 0.8142589118198874
Text : made for teens and reviewed as such , this is recommended only for those under 20 years of age . . . and then only as a very mild rental .
Predicted  : 0
Text : imagine o . henry's <b>the gift of the magi</b> relocated to the scuzzy underbelly of nyc's drug scene . merry friggin' christmas !
Predicted  : 0
Text : nothing short of wonderful with its ten-year-old female protagonist and its steadfast refusal to set up a dualistic battle between good and evil .
Predicted  : 0
Text : those moviegoers who would automatically bypass a hip-hop documentary should give " scratch " a second look .
Predicted  : 0
Text : baby-faced renner is eerily convincing as this bland blank of a man with unimaginable demons within .
Predicted  : 0


-----
### Exercise 2: Fine-tuning Distilbert

In this exercise we will fine-tune the Distilbert model to (hopefully) improve sentiment analysis performance.

#### Exercise 2.1: Token Preprocessing

The first thing we need to do is *tokenize* our dataset splits. Our current datasets return a dictionary with *strings*, but we want *input token ids* (i.e. the output of the tokenizer). This is easy enough to do by hand, but the HuggingFace `Dataset` class provides convenient, efficient, and *lazy* methods. See the documentation for [`Dataset.map`](https://huggingface.co/docs/datasets/v3.5.0/en/package_reference/main_classes#datasets.Dataset.map).

**Tip**: Verify that your new datasets are returning for every element: `text`, `label`, `input_ids`, and `attention_mask`.

In [12]:
# Formatting dataset for tokenization

ds = dataset.map(lambda example: tokenizer(example["text"]), batched = True)
print(ds["train"][0])

{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1, 'input_ids': [101, 1996, 2600, 2003, 16036, 2000, 2022, 1996, 7398, 2301, 1005, 1055, 2047, 1000, 16608, 1000, 1998, 2008, 2002, 1005, 1055, 2183, 2000, 2191, 1037, 17624, 2130, 3618, 2084, 7779, 29058, 8625, 13327, 1010, 3744, 1011, 18856, 19513, 3158, 5477, 4168, 2030, 7112, 16562, 2140, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


#### Exercise 2.2: Setting up the Model to be Fine-tuned

In this exercise we need to prepare the base Distilbert model for fine-tuning for a *sequence classification task*. This means, at the very least, appending a new, randomly-initialized classification head connected to the `[CLS]` token of the last transformer layer. Luckily, HuggingFace already provides an `AutoModel` for just this type of instantiation: [`AutoModelForSequenceClassification`](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodelforsequenceclassification). You will want to instantiate one of these for fine-tuning.

## ⚙️ Load Pretrained DistilBERT Model for Sequence Classification

### ✅ Explanation

- **`model_ds = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")`**:  
  Loads the pretrained DistilBERT weights and adds a new, randomly initialized classification head (`pre_classifier` and `classifier`), which is why the loading message recommends training the model before using it.  
  By default, it assumes 2 labels (e.g., binary classification).

- **`model_ds.to(device)`**:  
  Moves the model to the specified device (`cpu` or `cuda`) to enable efficient computation.

---

### 🧠 Theory & Why

- **Pretrained model usage:**  
  Utilizing a pretrained transformer like DistilBERT leverages learned language representations, accelerating training and improving accuracy.

- **Sequence classification head:**  
  This model includes a classification head on top of the transformer, designed to output logits corresponding to class labels.

- **Device placement:**  
  Moving the model to the correct device is essential to ensure compatibility with input tensors and to utilize GPU acceleration if available.


In [13]:

model_ds = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased") #(n labels is 2 by default)
model_ds.to(device)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


#### Exercise 2.3: Fine-tuning Distilbert

Finally. In this exercise you should use a HuggingFace [`Trainer`](https://huggingface.co/docs/transformers/main/en/trainer) to fine-tune your model on the Rotten Tomatoes training split. Setting up the trainer will involve (at least):


1. Instantiating a [`DataCollatorWithPadding`](https://huggingface.co/docs/transformers/en/main_classes/data_collator) object which is what *actually* does your batch construction (by padding all sequences to the same length).
2. Writing an *evaluation function* that will measure the classification accuracy. This function takes a single argument which is a tuple containing `(logits, labels)` which you should use to compute classification accuracy (and maybe other metrics like F1 score, precision, recall) and return a `dict` with these metrics.  
3. Instantiating a [`TrainingArguments`](https://huggingface.co/docs/transformers/v4.51.1/en/main_classes/trainer#transformers.TrainingArguments) object using some reasonable defaults.
4. Instantiating a `Trainer` object using your train and validation splits, your data collator, and function to compute performance metrics.
5. Calling `trainer.train()`, waiting, waiting some more, and then calling `trainer.evaluate()` to see how it did.
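
To make step 1 concrete, here is a plain-Python sketch of what dynamic padding does to a batch: each sequence is padded only up to the longest one in that batch, and the attention mask marks real tokens with 1 and padding with 0. The helper `pad_batch` is hypothetical and is not how `DataCollatorWithPadding` is implemented; the token ids are taken from the tokenized example printed earlier, and `pad_id=0` matches the `padding_idx=0` shown in the model printout.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the length of the longest sequence in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# 101 is [CLS] and 102 is [SEP]; the other ids come from the example shown above.
padded = pad_batch([
    [101, 1996, 2600, 102],
    [101, 1996, 2600, 2003, 16036, 102],
])

print(padded["input_ids"][0])       # [101, 1996, 2600, 102, 0, 0]
print(padded["attention_mask"][0])  # [1, 1, 1, 1, 0, 0]
```

Padding per batch rather than to a fixed global length keeps batches of short reviews small, which is where the savings in computation come from.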

**Tip**: When prototyping this laboratory I discovered the HuggingFace [Evaluate library](https://huggingface.co/docs/evaluate/en/index) which provides evaluation metrics. However, I found that its layers of abstraction make it insufferably hard to get actual metrics computed. I suggest just using the Scikit-learn metrics...

## ⚙️ Metrics and Training Setup for Fine-tuning DistilBERT

### ✅ Explanation

- **`dcwp = DataCollatorWithPadding(tokenizer=tokenizer)`**:  
  Creates a data collator that dynamically pads batches to the max sequence length in the batch, ensuring consistent input sizes without excessive padding.

- **`compute_metrics(eval_pred, metrics_dict)`**:  
  Function to calculate evaluation metrics from model predictions:
  - Extracts logits and true labels.
  - Computes predicted classes by taking the argmax over logits.
  - Computes each metric (accuracy, F1, precision, recall) using the passed metrics functions.
  - Returns a dictionary of metric results.

- **`metrics = {...}`**:  
  Dictionary defining the evaluation metrics. Accuracy is computed directly; F1, precision, and recall use the sklearn functions with `average="weighted"` (per-class scores weighted by class support).

- **`training_args = TrainingArguments(...)`**:  
  Configuration for the training process:
  - Specifies output directories for checkpoints and logs.
  - Sets learning rate (2e-5) suitable for fine-tuning transformers.
  - Defines batch sizes for training (32) and evaluation (16).
  - Runs for 5 epochs.
  - Applies weight decay (0.01) to regularize training.
  - Evaluates and saves the model at the end of each epoch.
  - Loads the best model after training (by lowest validation loss, the default when `metric_for_best_model` is not set).
  - Enables logging every 10 steps and reports logs to TensorBoard.

- **`trainer = Trainer(...)`**:  
  Initializes the Hugging Face `Trainer` with the model, arguments, datasets, data collator, and metric computation function.

- **`trainer.train()`**:  
  Starts the fine-tuning process.

---

### 🧠 Theory & Why

- **Data collator with padding:**  
  Efficient batching requires uniform input shapes; dynamic padding minimizes unnecessary computation and memory use.

- **Custom metrics function:**  
  Allows monitoring multiple metrics during evaluation for a richer assessment of model performance.

- **Weighted metrics:**  
  Weighted averaging accounts for class imbalance by weighting per-class scores by support.

- **Training arguments choices:**  
  Learning rate, batch size, and epochs are typical fine-tuning values that balance performance and computational resources.

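- **Evaluation strategy:**  
  Evaluating and saving at each epoch makes it possible to select the best checkpoint afterwards. In the run shown here, validation loss is lowest after epoch 2 and rises afterwards while training loss drops well below it, a sign of overfitting.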

- **Logging with TensorBoard:**  
  Enables visualization and monitoring of training progress.

- **Using Hugging Face Trainer:**  
  Provides an easy-to-use, standardized API for training transformer models with built-in support for metrics, logging, checkpointing, and device management.
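
One property of weighted averaging is worth checking by hand: support-weighted recall is always equal to accuracy, because the class sizes cancel out. This explains why the `Acc` and `Rec` columns in the training log are identical. The sketch below verifies it with NumPy only; the logits and labels are made up, and `weighted_recall` is a hypothetical helper written for this check rather than the sklearn function.

```python
import numpy as np

def weighted_recall(labels, preds):
    """Per-class recall averaged with class-support weights."""
    total = 0.0
    for c in np.unique(labels):
        mask = labels == c
        recall_c = (preds[mask] == c).mean()  # TP_c / n_c
        total += mask.mean() * recall_c       # weight n_c / N
    return total

# Made-up logits for five examples and two classes (assumption for illustration).
logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4], [0.0, 2.0]])
labels = np.array([0, 0, 1, 1, 1])

preds = np.argmax(logits, axis=-1)   # predicted class = index of the largest logit
accuracy = (preds == labels).mean()

print(preds.tolist())                                       # [0, 1, 1, 0, 1]
print(np.isclose(accuracy, weighted_recall(labels, preds)))  # True
```

Reporting both columns therefore adds no information; precision and F1 are the metrics that can differ from accuracy.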


In [14]:
# Metrics and Training setup

dcwp = DataCollatorWithPadding(tokenizer = tokenizer)

def compute_metrics(eval_pred, metrics_dict):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis = -1)

    metrics_results = {}
    for metric in metrics_dict.keys():
        metrics_results[metric] = metrics_dict[metric](preds, labels)

    return metrics_results

metrics = {
    "acc": lambda preds, labels: (preds == labels).mean(),
    "f1": lambda preds, labels: f1_score(labels, preds, average = "weighted"),
    "prec": lambda preds, labels: precision_score(labels, preds, average = "weighted"),
    "rec": lambda preds, labels: recall_score(labels, preds, average = "weighted")
}


training_args = TrainingArguments(
    output_dir = checkpoints_dir + "/fine_tuned_model",
    learning_rate = 2e-5,
    per_device_train_batch_size = 32,
    per_device_eval_batch_size = 16,
    num_train_epochs = 5,
    weight_decay = 0.01,
    eval_strategy = "epoch", #evaluate at the end of each epoch
    save_strategy = "epoch",
    load_best_model_at_end = True,
    logging_dir = log_dir + "/fine_tuned_model",
    logging_strategy = "steps",
    logging_steps = 10,
    report_to = "tensorboard"
)

trainer = Trainer(
    model = model_ds,
    args = training_args,
    train_dataset = ds["train"],
    eval_dataset = ds["validation"],
    data_collator = dcwp,
    compute_metrics = lambda x: compute_metrics(x, metrics)
)

trainer.train()

Epoch,Training Loss,Validation Loss,Acc,F1,Prec,Rec
1,0.4115,0.369271,0.836773,0.836254,0.841095,0.836773
2,0.2319,0.349599,0.851782,0.851749,0.8521,0.851782
3,0.1671,0.442669,0.845216,0.845101,0.846241,0.845216
4,0.052,0.510387,0.845216,0.845216,0.845217,0.845216
5,0.0687,0.557311,0.84334,0.843336,0.84337,0.84334


TrainOutput(global_step=1335, training_loss=0.20539348040627184, metrics={'train_runtime': 56.4881, 'train_samples_per_second': 755.026, 'train_steps_per_second': 23.633, 'total_flos': 580344848019696.0, 'train_loss': 0.20539348040627184, 'epoch': 5.0})
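
The numbers in this output can be sanity-checked with a little arithmetic. The sketch below assumes training on a single device without gradient accumulation, and assumes that the best checkpoint is the one with the lowest validation loss; the values are copied from the outputs above.

```python
import math

n_train = 8530          # size of the training split
train_batch_size = 32   # per_device_train_batch_size
epochs = 5

# The last, smaller batch still counts as one optimizer step.
steps_per_epoch = math.ceil(n_train / train_batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 267 1335

# Validation loss per epoch, copied from the training log.
val_loss = {1: 0.369271, 2: 0.349599, 3: 0.442669, 4: 0.510387, 5: 0.557311}
best_epoch = min(val_loss, key=val_loss.get)
print(best_epoch)  # 2
```

The total matches `global_step=1335`, and under the stated assumption the checkpoint restored at the end of training is the one from epoch 2, which also has the highest validation accuracy in the log.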

-----
### Exercise 3: Choose at Least One


#### Exercise 3.1: Efficient Fine-tuning for Sentiment Analysis (easy)

In Exercise 2 we fine-tuned the *entire* Distilbert model on Rotten Tomatoes. This is expensive, even for a small model. Find an *efficient* way to fine-tune Distilbert on the Rotten Tomatoes dataset (or some other dataset).

**Hint**: You could check out the [HuggingFace PEFT library](https://huggingface.co/docs/peft/en/index) for some state-of-the-art approaches that should "just work". How else might you go about making fine-tuning more efficient without having to change your training pipeline from above?

## ⚙️ LoRA Fine-tuning Setup

### ✅ Explanation

- **Reload model:**  
  Loads the pre-trained DistilBERT model for sequence classification with 2 output labels, then moves it to the device.

- **LoRA Configuration (`LoraConfig`):**  
  - `r = 8`: Rank of the low-rank decomposition (controls adaptation capacity).  
  - `lora_alpha = 16`: Scaling factor for the LoRA update, which is multiplied by `lora_alpha / r` (here 2).  
  - `target_modules = ["q_lin", "v_lin"]`: Applies LoRA only to the query and value linear layers in attention modules.  
  - `lora_dropout = 0.1`: Dropout probability applied to the input of the LoRA layers for regularization.  
  - `bias = "none"`: No bias parameters are trained during LoRA tuning.  
  - `task_type = TaskType.SEQ_CLS`: Indicates this is a sequence classification task.

- **Apply LoRA (`get_peft_model`):**  
  Wraps the model for parameter-efficient fine-tuning: the pretrained weights are frozen, and only the LoRA adapters and the classification head remain trainable.

- **Print trainable parameters:**  
  Reports how many parameters will be updated during training (739,586 of 67,694,596, about 1.09%).

---

### 🧠 Theory & Why

- **LoRA fine-tuning:**  
  Enables efficient adaptation of large pretrained models by training a small number of added low-rank parameters, reducing compute and memory cost.

- **Targeting `q_lin` and `v_lin`:**  
  These are key projection layers in the attention mechanism, where adaptation has high impact on model performance.

- **Using dropout in LoRA layers:**  
  Helps reduce overfitting by regularizing the small set of trainable parameters.

- **Freezing base model:**  
  Keeps pretrained knowledge intact and speeds up training, only tuning lightweight LoRA components.

- **Task-specific config:**  
  Helps the PEFT framework apply correct assumptions and optimizations for sequence classification.
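
To make the roles of `r` and `lora_alpha` concrete, here is a minimal pure-Python sketch of the usual LoRA formulation for a single linear layer: the frozen weight `W` is left untouched and a low-rank update `B @ A`, scaled by `lora_alpha / r`, is added to its output. The helper names `matvec` and `lora_linear` are hypothetical, the sizes are toy values, and dropout is omitted; this illustrates the idea rather than PEFT's actual implementation.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_linear(x, W, A, B, r, lora_alpha):
    """Frozen projection W x plus the scaled low-rank update (lora_alpha / r) * B (A x)."""
    scale = lora_alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d, r, lora_alpha = 6, 2, 4   # toy sizes; the notebook uses hidden size 768, r=8, lora_alpha=16

W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen pretrained weight (d x d)
A = [[random.gauss(0, 0.02) for _ in range(d)] for _ in range(r)]  # trainable (r x d)
B = [[0.0] * r for _ in range(d)]                                  # trainable (d x r), starts at zero
x = [random.gauss(0, 1) for _ in range(d)]

y_base = matvec(W, x)
y_init = lora_linear(x, W, A, B, r, lora_alpha)

# Parameter counts for this one layer: full weight vs. the two adapter matrices.
full_params = d * d
adapter_params = r * d + d * r
```

Because `B` starts at zero, the wrapped layer initially produces exactly the same output as the frozen layer, so training starts from the pre-trained behavior and only gradually moves away from it.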


In [15]:
# LoRA fine-tuning

# Reload a fresh DistilBERT classifier so LoRA starts from the pre-trained weights
model_ds = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels = 2)
model_ds.to(device)

lora_config = LoraConfig(
    r = 8, # rank of the low-rank decomposition
    lora_alpha = 16, # scaling factor: the adapter update is multiplied by lora_alpha / r = 2
    target_modules = ["q_lin", "v_lin"],
    lora_dropout = 0.1,
    bias = "none",
    task_type = TaskType.SEQ_CLS
)

model_ds = get_peft_model(model_ds, lora_config)
model_ds.print_trainable_parameters()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 739,586 || all params: 67,694,596 || trainable%: 1.0925


## ⚙️ Training the LoRA Fine-tuned Model

### ✅ Explanation

- **Training Arguments (`TrainingArguments`):**  
  - `output_dir`: Directory to save checkpoints (`checkpoints_dir + "/lora_model"`).  
  - `learning_rate`: Set to **2e-5**, a typical low learning rate for fine-tuning transformer models.  
  - `per_device_train_batch_size`: Batch size of **32** for training.  
  - `per_device_eval_batch_size`: Batch size of **16** for evaluation.  
  - `num_train_epochs`: Train for **3 epochs**, balancing training time and performance.  
  - `weight_decay`: Weight decay of **0.01**, applied by the Trainer's default AdamW optimizer, to reduce overfitting.  
  - `eval_strategy`: Evaluate the model at the end of each epoch.  
  - `save_strategy`: Save model checkpoints at the end of each epoch.  
  - `load_best_model_at_end`: Automatically reload the best checkpoint after training. Since `metric_for_best_model` is not set, "best" means the lowest validation loss.  
  - `logging_dir`: Directory for TensorBoard logs (`log_dir + "/lora_model"`).  
  - `logging_strategy` & `logging_steps`: Logs training metrics every **10 steps**.  
  - `report_to`: Send logs to TensorBoard for visualization.

- **Trainer Setup (`Trainer`):**  
  - `model`: The LoRA-adapted DistilBERT model.  
  - `args`: Training configuration from above.  
  - `train_dataset` & `eval_dataset`: Training and validation datasets.  
  - `data_collator`: Handles padding and batching (`dcwp` = DataCollatorWithPadding).  
  - `compute_metrics`: Function to calculate evaluation metrics during training.

- **Start training:**  
  `trainer.train()` runs the fine-tuning process using the above settings.
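
As a sanity check on these settings, the number of optimizer steps can be worked out in advance. The sketch below assumes a training split of 8,530 examples (the size of the standard Rotten Tomatoes train split, not stated in this section), a single device, and no gradient accumulation. The throughput and runtime figures are copied from the `TrainOutput` printed after training and are used only to cross-check that assumption.

```python
import math

train_examples = 8530   # assumption: standard rotten_tomatoes train split size
batch_size = 32         # per_device_train_batch_size
epochs = 3              # num_train_epochs

# The last batch of each epoch is partial, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

# Cross-check against the reported throughput: samples/s * runtime / epochs.
implied_examples = 1312.208 * 19.5015 / epochs

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"implied examples per epoch: {implied_examples:.0f}")
```

The total agrees with `global_step=801` in the training output, and with `logging_steps = 10` the training loss is logged roughly 80 times over the run.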

---

### 🧠 Theory & Why

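- **Low learning rate:**  
  `2e-5` is a conservative value typical of full fine-tuning, where it keeps updates to the pre-trained weights stable. LoRA adapters start from a zero update and often tolerate larger learning rates, so this setting may be slower than necessary.

- **Few epochs:**  
  Three epochs keep the run short. The validation loss is still falling at epoch 3 (0.468, 0.425, 0.420), so the adapters may not have fully converged.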

- **Evaluation & checkpointing each epoch:**  
  Enables monitoring progress and recovery of the best performing model.

- **Weight decay:**  
  Acts as regularization to prevent overfitting, improving generalization.

- **TensorBoard logging:**  
  Provides rich visual feedback on training metrics and helps diagnose issues.

- **Using Trainer API:**  
  Simplifies training loop management, evaluation, checkpointing, and integration with Hugging Face ecosystem.


In [16]:

training_args = TrainingArguments(
    output_dir = checkpoints_dir + "/lora_model",
    learning_rate = 2e-5,
    per_device_train_batch_size = 32,
    per_device_eval_batch_size = 16,
    num_train_epochs = 3,
    weight_decay = 0.01,
    eval_strategy = "epoch",
    save_strategy = "epoch",
    load_best_model_at_end = True,
    logging_dir = log_dir + "/lora_model",
    logging_strategy = "steps",
    logging_steps = 10,
    report_to = "tensorboard"
)

trainer = Trainer(
    model = model_ds,
    args = training_args,
    train_dataset = ds["train"],
    eval_dataset = ds["validation"],
    data_collator = dcwp,
    compute_metrics = lambda x: compute_metrics(x, metrics)
)

trainer.train()

No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Epoch,Training Loss,Validation Loss,Acc,F1,Prec,Rec
1,0.5084,0.468388,0.797373,0.797373,0.797378,0.797373
2,0.4505,0.424912,0.813321,0.813071,0.815007,0.813321
3,0.4422,0.420202,0.814259,0.814023,0.815864,0.814259


TrainOutput(global_step=801, training_loss=0.49919321578688985, metrics={'train_runtime': 19.5015, 'train_samples_per_second': 1312.208, 'train_steps_per_second': 41.074, 'total_flos': 354239374467936.0, 'train_loss': 0.49919321578688985, 'epoch': 3.0})
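
For a rough efficiency comparison, the `TrainOutput` of this LoRA run can be set against the one printed just before Exercise 3 (1335 steps over 5 epochs), which we assume here is the full fine-tuning run from Exercise 2. The sketch below only reuses the runtimes reported in those two outputs; it is a back-of-the-envelope comparison on a single machine, not a benchmark.

```python
# Figures copied from the two TrainOutput records in this notebook.
full = {"runtime_s": 56.4881, "epochs": 5}   # assumed: full fine-tuning (Exercise 2)
lora = {"runtime_s": 19.5015, "epochs": 3}   # LoRA run above

full_s_per_epoch = full["runtime_s"] / full["epochs"]
lora_s_per_epoch = lora["runtime_s"] / lora["epochs"]
speedup = full_s_per_epoch / lora_s_per_epoch

print(f"full fine-tuning: {full_s_per_epoch:.2f} s/epoch")
print(f"LoRA:             {lora_s_per_epoch:.2f} s/epoch")
print(f"speedup:          {speedup:.2f}x")
```

Under that assumption, a LoRA epoch takes roughly 6.5 s against about 11.3 s for full fine-tuning, a speedup of about 1.7x per epoch. The forward pass still runs through the whole frozen model, which is why the gain is far smaller than the reduction in trainable parameters.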

#### Exercise 3.2: Fine-tuning a CLIP Model (harder)

Use a (small) CLIP model like [`openai/clip-vit-base-patch16`](https://huggingface.co/openai/clip-vit-base-patch16) and evaluate its zero-shot performance on a small image classification dataset like ImageNette or TinyImageNet. Fine-tune (using a parameter-efficient method!) the CLIP model to see how much improvement you can squeeze out of it.

**Note**: There are several ways to adapt the CLIP model; you could fine-tune the image encoder, the text encoder, or both. Or, you could experiment with prompt learning.

**Tip**: CLIP probably already works very well on ImageNet and ImageNet-like images. For extra fun, look for an image classification dataset with different image types (e.g. *sketches*).

In [17]:
# Your code here.

#### Exercise 3.3: Choose your Own Adventure

There are a *ton* of interesting and fun models on the HuggingFace hub. Pick one that does something interesting and adapt it in some way to a new task. Or, combine two or more models into something more interesting or fun. The sky's the limit.

**Note**: Reach out to me by email or on the Discord if you are unsure about anything.

In [18]:
# Your code here.