# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

* PEFT technique: Low-Rank Adaptation (LoRA).
* Model: DistilBERT
* Evaluation approach: the evaluate method of a Hugging Face Trainer, reporting accuracy on the IMDB test split
* Fine-tuning dataset: https://huggingface.co/datasets/stanfordnlp/imdb

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
#This part is based on the solution to the exercise Create a BERT Sentiment Classifier.
#DistilBERT is adapted by linear probing: a classification head is trained on top of the frozen base model.

In [2]:
#Install datasets. You may need to restart the kernel after installation.
!pip install -U datasets
!pip install -q "datasets==2.15.0"

Defaulting to user installation because normal site-packages is not writeable
Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Collecting huggingface-hub>=0.23.0
  Downloading huggingface_hub-0.27.0-py3-none-any.whl (450 kB)
Collecting tqdm>=4.66.3
  Downloading tqdm-4.67.1-py3-none-any.whl (78 kB)
Collecting requests>=2.32.2
  Downloading requests-2.32.3-py3-none-any.whl (64 kB)
Installing collected packages: tqdm, requests, huggingface-hub, datasets
  Attempting uni

In [3]:
#Import the datasets.
from datasets import load_dataset

#Load the train and test splits of the imdb dataset and store them in a dictionary.
splits = ["train", "test"]
ds = {split: ds for split, ds in zip(splits, load_dataset("imdb", split=splits))}

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [4]:
#Import and set up a tokeniser.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

#Let's use a lambda function to tokenize both datasets (train and test).
tokenized_dataset = {}
for split in splits:
    tokenized_dataset[split] = ds[split].map(
        lambda x: tokenizer(x["text"], truncation=True), batched=True
    )

#Inspect the first three examples in the train dataset after tokenisation.
tokenized_dataset["train"][0:3]



Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

{'text': ['I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far b
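The tokenizer above is called with truncation=True but without padding, so the tokenized reviews keep different lengths. Padding is left to the DataCollatorWithPadding passed to the Trainer later, which pads each batch to the length of its longest sequence. The following is a minimal pure-Python sketch of that idea, not the library's implementation: the helper name `pad_batch` and the token IDs are made up for illustration, and a pad ID of 0 is assumed.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence in a batch to the length of the longest one.

    Hypothetical helper for illustration only; pad_id=0 is an assumption.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks a padding position.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up token ID sequences of different lengths.
batch = pad_batch([[101, 2023, 102], [101, 2023, 3185, 2001, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a fixed global length keeps short batches small, which is one reason to defer padding to the collator.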

In [5]:
#Import the DistilBERT pretrained model from HF and define it as our base model.
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label={0: "negative", 1: "positive"},
    label2id={"negative": 0, "positive": 1},
)



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
#Freeze DistilBERT's parameters. Note that the classification head's weights are still trainable.
for param in model.base_model.parameters():
    param.requires_grad = False

#Inspect the full model, including the classification head.
print(model)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
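Once the base model is frozen, only the classification head is trained. Its size can be worked out by hand. The sketch below assumes the head is a pre_classifier linear layer (768 to 768) followed by a classifier linear layer (768 to 2), which is consistent with the weight names in the loading warning; the layer sizes and the helper `linear_params` are assumptions for illustration.

```python
def linear_params(in_features, out_features, bias=True):
    """Number of parameters in a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


hidden_size = 768  # Matches the 768-dimensional layers printed above.
num_labels = 2

# Assumed head layout: pre_classifier (768 -> 768), classifier (768 -> 2).
pre_classifier = linear_params(hidden_size, hidden_size)
classifier = linear_params(hidden_size, num_labels)
head_total = pre_classifier + classifier

print(f"pre_classifier: {pre_classifier:,}")
print(f"classifier:     {classifier:,}")
print(f"head total:     {head_total:,}")
```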
 

In [7]:
#Import the libraries necessary for training the classification head.
import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

#This function calculates the classifier's performance with respect to a dataset in terms of the accuracy metric.
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

#The Hugging Face Trainer class handles the training and evaluation loops for us.
#Read more about it here https://huggingface.co/docs/transformers/main_classes/trainer.
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir=".",
        learning_rate=2e-3,
        # Reduce the batch size if you don't have enough memory.
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        num_train_epochs=1,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

#Evaluate the model before any training. The classification head is still randomly initialized, so this is the baseline.
trainer.evaluate()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'eval_loss': 0.6918345093727112,
 'eval_accuracy': 0.52436,
 'eval_runtime': 350.0246,
 'eval_samples_per_second': 71.424,
 'eval_steps_per_second': 17.856}
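An accuracy of about 0.52 is close to chance for a two-class task, as expected from an untrained head. To check what compute_metrics does, it can be run on a few made-up logits. The values below are invented for illustration; the function is the one defined in the cell above.

```python
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}


# Made-up logits for four examples (columns: negative, positive).
logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]])
labels = np.array([0, 1, 0, 0])

# argmax picks classes [0, 1, 1, 0], so three of four predictions match.
result = compute_metrics((logits, labels))
print(result)
```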

In [8]:
#Initiate the training loop defined above to train the classification head.
trainer.train()

#Save the trained classifier.
trainer.save_model("model_probing")

#Evaluate the trained classifier. 
trainer.evaluate()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.3516,0.352573,0.86072


Checkpoint destination directory ./checkpoint-6250 already exists and is non-empty.Saving will proceed but saved results may be invalid.


{'eval_loss': 0.35257336497306824,
 'eval_accuracy': 0.86072,
 'eval_runtime': 356.2403,
 'eval_samples_per_second': 70.177,
 'eval_steps_per_second': 17.544,
 'epoch': 1.0}
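The checkpoint name checkpoint-6250 follows from the training setup: 25,000 training examples with a per-device batch size of 4 for one epoch. A quick check, assuming a single device and no gradient accumulation (neither is stated explicitly in the notebook):

```python
import math

train_examples = 25_000  # Size of the IMDB train split.
batch_size = 4           # per_device_train_batch_size
num_epochs = 1           # num_train_epochs

# Assumes one device and no gradient accumulation.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)
```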

## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [9]:
#Import the DistilBERT pretrained model from HF and define it as our base model.
from transformers import AutoModelForSequenceClassification
model_base = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label={0: "negative", 1: "positive"},
    label2id={"negative": 0, "positive": 1},
)

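#List the module names to find the layers that LoRA can target.
for name, module in model_base.named_modules():
    print(name)

#Print each module to inspect its type and dimensions.
for name, module in model_base.named_modules():
    print(module)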

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



distilbert
distilbert.embeddings
distilbert.embeddings.word_embeddings
distilbert.embeddings.position_embeddings
distilbert.embeddings.LayerNorm
distilbert.embeddings.dropout
distilbert.transformer
distilbert.transformer.layer
distilbert.transformer.layer.0
distilbert.transformer.layer.0.attention
distilbert.transformer.layer.0.attention.dropout
distilbert.transformer.layer.0.attention.q_lin
distilbert.transformer.layer.0.attention.k_lin
distilbert.transformer.layer.0.attention.v_lin
distilbert.transformer.layer.0.attention.out_lin
distilbert.transformer.layer.0.sa_layer_norm
distilbert.transformer.layer.0.ffn
distilbert.transformer.layer.0.ffn.dropout
distilbert.transformer.layer.0.ffn.lin1
distilbert.transformer.layer.0.ffn.lin2
distilbert.transformer.layer.0.ffn.activation
distilbert.transformer.layer.0.output_layer_norm
distilbert.transformer.layer.1
distilbert.transformer.layer.1.attention
distilbert.transformer.layer.1.attention.dropout
distilbert.transformer.layer.1.attention.q

In [11]:
#Import and create a PEFT adapter configuration for low rank adaptation (LoRA).
from peft import LoraConfig, TaskType
config = LoraConfig(
    task_type=TaskType.SEQ_CLS, #Sequence classification task.
    inference_mode=False,
    r=8, #Rank of the adaptation matrices.
    lora_alpha=32, #Scaling factor: the LoRA update is scaled by lora_alpha / r.
    lora_dropout=0.1,
    target_modules=["distilbert.transformer.layer.5.attention.q_lin", "distilbert.transformer.layer.5.attention.k_lin", "distilbert.transformer.layer.5.attention.v_lin"]
)    

#Wrap the base model with the LoRA adapter to create the PEFT model.
from peft import get_peft_model
model_lora = get_peft_model(model_base, config)

#Print the number of trainable parameters.
model_lora.print_trainable_parameters()

trainable params: 36,864 || all params: 67,584,004 || trainable%: 0.054545451317148955
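The reported trainable count can be reproduced by hand. Each targeted 768-by-768 linear layer gets two low-rank matrices, one of shape r x 768 and one of shape 768 x r, and three layers are targeted (q_lin, k_lin and v_lin of layer 5). A short arithmetic check using only numbers from the configuration and the output above:

```python
hidden_size = 768   # in_features and out_features of q_lin, k_lin, v_lin
rank = 8            # r in LoraConfig
num_targets = 3     # q_lin, k_lin, v_lin in transformer layer 5

# One r x in matrix plus one out x r matrix per targeted layer.
params_per_module = rank * hidden_size + hidden_size * rank
trainable = num_targets * params_per_module

all_params = 67_584_004  # "all params" as printed above
percent = 100 * trainable / all_params

print(f"trainable params: {trainable:,}")
print(f"trainable%: {percent}")
```

The result matches the printed 36,864, so the reported number corresponds to the LoRA matrices alone.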


In [12]:
#Check that the trainable layers are correct.
for name, param in model_lora.named_parameters():
    if not param.requires_grad:
        print(f"Layer {name} is frozen.")
    else:
        print(f"Layer {name} is trainable.")

Layer base_model.model.distilbert.embeddings.word_embeddings.weight is frozen.
Layer base_model.model.distilbert.embeddings.position_embeddings.weight is frozen.
Layer base_model.model.distilbert.embeddings.LayerNorm.weight is frozen.
Layer base_model.model.distilbert.embeddings.LayerNorm.bias is frozen.
Layer base_model.model.distilbert.transformer.layer.0.attention.q_lin.weight is frozen.
Layer base_model.model.distilbert.transformer.layer.0.attention.q_lin.bias is frozen.
Layer base_model.model.distilbert.transformer.layer.0.attention.k_lin.weight is frozen.
Layer base_model.model.distilbert.transformer.layer.0.attention.k_lin.bias is frozen.
Layer base_model.model.distilbert.transformer.layer.0.attention.v_lin.weight is frozen.
Layer base_model.model.distilbert.transformer.layer.0.attention.v_lin.bias is frozen.
Layer base_model.model.distilbert.transformer.layer.0.attention.out_lin.weight is frozen.
Layer base_model.model.distilbert.transformer.layer.0.attention.out_lin.bias is fr
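To make concrete what the trainable LoRA matrices do inside a targeted layer, here is a minimal NumPy sketch of a LoRA-augmented linear map. It is an illustration of the technique rather than PEFT's implementation: bias and dropout are omitted, the weights are random, and the function name `lora_linear` is made up. The frozen weight W is left untouched and the low-rank product B A, scaled by lora_alpha / r, is added to it.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 768, 8, 32  # Hidden size, rank and lora_alpha as configured above.
scale = alpha / r

W = rng.normal(size=(d, d))         # Frozen pretrained weight (random stand-in).
A = rng.normal(size=(r, d)) * 0.01  # Trainable, small random initialization.
B = np.zeros((d, r))                # Trainable, zero initialization.


def lora_linear(x, W, A, B, scale):
    """Frozen linear map plus the scaled low-rank update (bias omitted)."""
    return x @ W.T + scale * (x @ A.T @ B.T)


x = rng.normal(size=(1, d))

# With B at zero the adapter changes nothing at the start of training.
y_init = lora_linear(x, W, A, B, scale)
print(np.allclose(y_init, x @ W.T))

# Once B moves away from zero, the output shifts while W stays frozen.
B_trained = rng.normal(size=(d, r)) * 0.01
y_trained = lora_linear(x, W, A, B_trained, scale)
print(np.allclose(y_trained, x @ W.T))
```

Starting B at zero means the adapted model initially behaves exactly like the pretrained one, and only the small matrices A and B need gradients.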

In [13]:
#Import the libraries necessary for training the classification head and the LoRA adapter.
import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

#This function calculates the classifier's performance with respect to a dataset in terms of the accuracy metric.
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

#The Hugging Face Trainer class handles the training and evaluation loops for us.
#Read more about it here https://huggingface.co/docs/transformers/main_classes/trainer.
trainer = Trainer(
    model=model_lora,
    args=TrainingArguments(
        output_dir=".",
        learning_rate=2e-3,
        # Reduce the batch size if you don't have enough memory.
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        num_train_epochs=1,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

#Initiate the training loop defined above.
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.3351,0.329078,0.87372


Checkpoint destination directory ./checkpoint-6250 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=6250, training_loss=0.3763320703125, metrics={'train_runtime': 812.7244, 'train_samples_per_second': 30.761, 'train_steps_per_second': 7.69, 'total_flos': 2812718356586688.0, 'train_loss': 0.3763320703125, 'epoch': 1.0})

In [14]:
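#Save the trained LoRA adapter with PEFT's save_pretrained.
model_lora.save_pretrained("model_lora")
#Save it again through the Trainer; this copy is the one loaded for inference below.
trainer.save_model("model_lora_LW")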

#Evaluate the trained classifier. 
trainer.evaluate()

{'eval_loss': 0.3290780782699585,
 'eval_accuracy': 0.87372,
 'eval_runtime': 358.5293,
 'eval_samples_per_second': 69.729,
 'eval_steps_per_second': 17.432,
 'epoch': 1.0}

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [17]:
#Load the PEFT (LoRA) model.
from peft import AutoPeftModelForSequenceClassification

model_lora_infer = AutoPeftModelForSequenceClassification.from_pretrained(
    "model_lora_LW",
    num_labels=2,
    id2label={0: "negative", 1: "positive"},
    label2id={"negative": 0, "positive": 1},
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [18]:
#Set the PEFT model's pad ID to that used by the tokeniser.
model_lora_infer.config.pad_token_id = tokenizer.pad_token_id

In [19]:
#Define the evaluation loop.
inference = Trainer(
    model=model_lora_infer, #Remember to use the PEFT model.
    args=TrainingArguments(
        output_dir=".",
        learning_rate=2e-3,
        # Reduce the batch size if you don't have enough memory.
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        num_train_epochs=1,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

In [20]:
evaluation = inference.evaluate()

In [21]:
print("Performance of the finetuned model:", evaluation)

Performance of the finetuned model: {'eval_loss': 0.3290780782699585, 'eval_accuracy': 0.87372, 'eval_runtime': 358.4737, 'eval_samples_per_second': 69.74, 'eval_steps_per_second': 17.435}


Out[7] shows the evaluation results of the pretrained model with a newly initialized, untrained classification head, that is, before any fine-tuning. They are reproduced here.

{'eval_loss': 0.6918345093727112,
 'eval_accuracy': 0.52436,
 'eval_runtime': 350.0246,
 'eval_samples_per_second': 71.424,
 'eval_steps_per_second': 17.856}
 
Out[8] shows the evaluation results after linear probing, in which only the classification head was trained while the base model's pretrained weights stayed frozen. They are reproduced here.

{'eval_loss': 0.35257336497306824,
 'eval_accuracy': 0.86072,
 'eval_runtime': 356.2403,
 'eval_samples_per_second': 70.177,
 'eval_steps_per_second': 17.544,
 'epoch': 1.0}
 
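Both the model with a trained classification head (accuracy 0.86072) and the model fine-tuned with LoRA (accuracy 0.87372) are far more accurate than the pretrained model with an untrained head (accuracy 0.52436, roughly chance level for binary classification). The LoRA model is about 1.3 percentage points more accurate than the linear-probing model.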
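The three accuracies reported in this notebook can be put side by side with a few lines of Python. The numbers are copied from the outputs above; the dictionary keys are labels chosen here for readability.

```python
# Accuracies copied from Out[7], Out[8] and the LoRA evaluation.
results = {
    "untrained head": 0.52436,
    "trained head (frozen base)": 0.86072,
    "LoRA": 0.87372,
}

baseline = results["untrained head"]
for name, accuracy in results.items():
    gain = accuracy - baseline
    print(f"{name:28s} accuracy={accuracy:.5f} gain={gain:+.5f}")

# Difference between LoRA and linear probing, in percentage points.
lora_vs_probe = 100 * (results["LoRA"] - results["trained head (frozen base)"])
print(f"LoRA vs. linear probing: {lora_vs_probe:+.2f} points")
```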