# Lightweight Fine-Tuning Project

Choices made for this project:

* PEFT technique: **LoRA**
* Model: **google-bert/bert-base-cased**
* Evaluation approach: **Using accuracy metric**
* Fine-tuning dataset: **stanfordnlp/imdb**

## Loading and Evaluating a Foundation Model

The cells below load the chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
!pip install transformers datasets evaluate scikit-learn -q


## Prepare the Foundation Model

### Load a pretrained HF model

In [2]:
from transformers import AutoTokenizer
model_id="google-bert/bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_id)

### Load and preprocess a dataset

In [3]:
from datasets import load_dataset
dataset = load_dataset("stanfordnlp/imdb")

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [5]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_train_datasets = dataset["train"].map(tokenize_function, batched=True)
tokenized_test_datasets = dataset["test"].map(tokenize_function, batched=True)
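
To make the effect of `padding="max_length"` and `truncation=True` more concrete, here is a simplified pure-Python sketch of what happens to a single sequence of token IDs. The helper `pad_or_truncate`, the tiny `max_length` of 5, and the example IDs are illustrative assumptions; the real tokenizer pads to the model's maximum length (512 positions for this checkpoint) and also takes care of special tokens when truncating, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long.

    Simplified illustration only: special-token handling is ignored.
    """
    kept = token_ids[:max_length]          # truncation: drop everything past max_length
    n_pad = max_length - len(kept)         # padding: fill the remainder with pad_id
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A short sequence is padded, a long one is cut (IDs are made up for the example).
short_ids, short_mask = pad_or_truncate([101, 1188, 102], max_length=5)
long_ids, long_mask = pad_or_truncate(list(range(1, 9)), max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding every review to the maximum length keeps all examples the same shape, at the cost of extra computation on padding positions.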

In [6]:
small_train_dataset = tokenized_train_datasets.shuffle(seed=42).select(range(3000))
small_eval_dataset = tokenized_test_datasets.shuffle(seed=42).select(range(1000))

In [7]:
print(small_eval_dataset)

Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 1000
})


In [8]:
#From: https://achimoraites.medium.com/lightweight-roberta-sequence-classification-fine-tuning-with-lora-using-the-hugging-face-peft-8dd9edf99d19

from transformers import DataCollatorWithPadding

# Extract the number of classes and their names
num_labels = dataset['train'].features['label'].num_classes
class_names = dataset["train"].features["label"].names
print(f"number of labels: {num_labels}")
print(f"the labels: {class_names}")

# Create an id2label mapping
# We will need this for our inference.
id2label = {i: label for i, label in enumerate(class_names)}
print("id2label=", id2label)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")

number of labels: 2
the labels: ['neg', 'pos']
id2label= {0: 'neg', 1: 'pos'}
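
For reference, the inverse mapping (label name to class id) can be derived from the same list of class names. The sketch below uses only the standard library and hard-codes the class names printed above; the variable `label2id_from_dataset` is a hypothetical name used for illustration. Note that the model configuration later in this notebook uses the longer names "negative" and "positive" for the same two ids.

```python
# Class names as printed by the notebook for the IMDB dataset.
class_names = ["neg", "pos"]

# id -> label, equivalent to the dict comprehension used in the cell above.
id2label = dict(enumerate(class_names))

# label -> id, the inverse mapping (hypothetical variable name).
label2id_from_dataset = {label: idx for idx, label in id2label.items()}

print(id2label)
print(label2id_from_dataset)
```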


### Evaluate the pretrained model

In [9]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    model_id, 
    num_labels=2,
    id2label={0: "negative", 1: "positive"},
    label2id={"negative": 0, "positive": 1}
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:
print(model)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [11]:
# Print a dataset sample
import random


# Pick a random valid index; random.randint(0, 1000) is inclusive and could return 1000, which is out of range
x = random.randrange(len(small_eval_dataset))

sample = small_eval_dataset[x]
print("text: {},\nlabel:{}".format(sample["text"], sample["label"]))

text: Fred Astaire is reteamed with Rita Hayworth one year after their big hit for Columbia, "You'll Never Get Rich". That was the movie which put Hayworth on the Hollywood map, yet her performance in this wan romantic musical hardly gives a suggestion why she was so suddenly popular. Down Buenos Aires way, a tyrannical hotel owner demands that his four daughters marry in order of age; one may think film takes place in the 18th century, but no, it's modern-day 1942. Astaire is an ex-hoofer-turned-gambler who goes back to dancing to earn some money, getting mixed up in impersonating a letter-writing admirer to Hayworth's stone-cold society beauty. Fred gazes at Rita with a brotherly smile, but she's so mannequin-like (lip-synching to her songs like a wide-eyed wind-up doll) that all romantic sparks quickly sputter. They do dance together quite comfortably, however, and the Jerome Kern score is unmemorable but not too bad. ** from ****,
label:0


In [12]:
#Use accuracy metric
#Function inspired from https://huggingface.co/learn/nlp-course/en/chapter3/3#evaluation
import numpy as np

import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
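
As a rough illustration of what `compute_metrics` calculates, the following standard-library sketch takes the argmax of each row of logits and compares it with the reference label. The helper names `argmax` and `accuracy` and the example logits are assumptions made for this illustration; the notebook itself relies on `np.argmax` and the `evaluate` accuracy metric.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Four made-up examples with two classes each.
example_logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.2], [-0.5, 0.5]]
example_labels = [1, 0, 1, 1]

example_accuracy = accuracy(example_logits, example_labels)
print(example_accuracy)  # 0.75: the third example is predicted as class 0 but labelled 1
```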


In [13]:
from transformers import TrainingArguments

# training_args = TrainingArguments(output_dir="model_evaluation")
training_args = TrainingArguments(
    "evaluate_foundational_model",
    evaluation_strategy="epoch",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32)



In [14]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)




In [15]:
%%time
# Evaluate the foundation model before any fine-tuning
trainer.evaluate(eval_dataset=small_eval_dataset)

CPU times: user 1.82 s, sys: 722 ms, total: 2.55 s
Wall time: 1min 6s


{'eval_loss': 0.74080491065979,
 'eval_model_preparation_time': 0.0031,
 'eval_accuracy': 0.512,
 'eval_runtime': 66.0,
 'eval_samples_per_second': 15.152,
 'eval_steps_per_second': 0.485}

## **Without any fine-tuning, the model "google-bert/bert-base-cased" has an _accuracy_ of _0.512_, roughly chance level for a two-class task**
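
To put the baseline numbers in context, the sketch below computes what a classifier that assigns equal probability to both classes would score. It uses only the standard library; the variable names are chosen for illustration. The reported `eval_accuracy` of 0.512 and `eval_loss` of about 0.741 sit close to these chance-level values, which is what one would expect from a freshly initialized classification head.

```python
import math

num_classes = 2

# A model that outputs a uniform distribution is right half of the time
# on a balanced two-class problem ...
chance_accuracy = 1 / num_classes

# ... and its cross-entropy loss is -ln(1/num_classes) = ln(2).
chance_loss = -math.log(1 / num_classes)

print(f"chance accuracy: {chance_accuracy:.3f}")
print(f"chance cross-entropy: {chance_loss:.4f}")
```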

### Saving the foundation model to local directory

In [16]:
# Save the foundational model to the local directory "foundational_model/" 
trainer.save_model("foundational_model/")

## Performing Parameter-Efficient Fine-Tuning

The cells below create a PEFT model from the loaded model, run a training loop, and save the PEFT model weights.

Two PEFT models are created with different `lora_config` values so that their results can be compared.

### PEFT model (Same foundational model for the two PEFT configurations)

In [17]:
peft_model_id = model_id 
model = AutoModelForSequenceClassification.from_pretrained(
    peft_model_id,
    num_labels=2,
    id2label={0: "negative", 1: "positive"},
    label2id={"negative": 0, "positive": 1}
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Create a PEFT model #1

In [18]:
# Create a dictionary with two sets of LoRA values, one per training run, to see their impact on model performance
peft_values= {
    "values1": {
        "r": 16,
        "lora_alpha": 16,
        "lora_dropout": 0.1,
        "bias": "none"
    },
    "values2": {
        "r": 64,
        "lora_alpha": 128,
        "lora_dropout": 0.05,
        "bias": "none"
    }
}

In [19]:
from peft import LoraConfig, TaskType

lora_config1 = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # sentiment classification labels whole sequences, not tokens
    r=peft_values["values1"]["r"],
    lora_alpha=peft_values["values1"]["lora_alpha"],
    lora_dropout=peft_values["values1"]["lora_dropout"],
    bias=peft_values["values1"]["bias"],
    target_modules=["query", "value"]
)

In [20]:
from peft import get_peft_model

lora_model1 = get_peft_model(model, lora_config1)
lora_model1.print_trainable_parameters()

trainable params: 591,362 || all params: 108,903,172 || trainable%: 0.5430
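
The trainable parameter counts reported by `print_trainable_parameters()` can be reproduced by hand. The sketch below is a back-of-the-envelope check using only the standard library. It assumes that each adapted `Linear(768, 768)` layer receives two low-rank matrices of shapes `r x 768` and `768 x r`, and that the classifier head (weight plus bias) is trained alongside the adapters; that second assumption is consistent with the totals printed in this notebook. The function name `lora_trainable_params` is made up for this illustration.

```python
# Ranks copied from the notebook's peft_values dictionary.
peft_values = {"values1": {"r": 16}, "values2": {"r": 64}}

HIDDEN_SIZE = 768        # in/out features of BERT's query and value projections
NUM_LAYERS = 12          # encoder layers in bert-base
TARGETS_PER_LAYER = 2    # target_modules=["query", "value"]
NUM_LABELS = 2


def lora_trainable_params(r):
    # Each adapted Linear gets A (r x in_features) and B (out_features x r).
    per_module = r * HIDDEN_SIZE + HIDDEN_SIZE * r
    adapters = per_module * TARGETS_PER_LAYER * NUM_LAYERS
    # Assumption: the classifier head (weight + bias) is trainable as well.
    classifier = HIDDEN_SIZE * NUM_LABELS + NUM_LABELS
    return adapters + classifier


counts = {name: lora_trainable_params(cfg["r"]) for name, cfg in peft_values.items()}
print(counts)  # {'values1': 591362, 'values2': 2360834}
```

Because the adapter size grows linearly with `r`, going from `r=16` to `r=64` roughly quadruples the number of trainable parameters.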


### Train the PEFT model #1

In [21]:
training_args_peft1 = TrainingArguments(
    "trainer_peft1_output",
    evaluation_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16)



In [22]:
%%time
trainer1 = Trainer(
    model=lora_model1,
    args=training_args_peft1,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)
trainer1.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.600658,0.718
2,No log,0.367216,0.842
3,0.524200,0.34237,0.849


CPU times: user 1min 46s, sys: 16min 24s, total: 18min 10s
Wall time: 43min 13s


TrainOutput(global_step=564, training_loss=0.5063010411905059, metrics={'train_runtime': 2592.6722, 'train_samples_per_second': 3.471, 'train_steps_per_second': 0.218, 'total_flos': 2384349474816000.0, 'train_loss': 0.5063010411905059, 'epoch': 3.0})
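
The `global_step=564` value and the "No log" entries in the training table can be explained with a little arithmetic. The sketch below assumes the `TrainingArguments` defaults of 3 training epochs and a logging interval of 500 steps (neither is set explicitly in this notebook), a single device, and no gradient accumulation.

```python
import math

train_examples = 3000     # size of small_train_dataset
batch_size = 16           # per_device_train_batch_size
num_epochs = 3            # assumption: TrainingArguments default
logging_steps = 500       # assumption: TrainingArguments default

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Step count reached at the end of each epoch, and whether a training-loss
# value has been logged by then.
epoch_end_steps = [steps_per_epoch * epoch for epoch in range(1, num_epochs + 1)]
loss_logged = [step >= logging_steps for step in epoch_end_steps]

print(steps_per_epoch, total_steps)  # 188 564
print(epoch_end_steps)               # [188, 376, 564]
print(loss_logged)                   # [False, False, True]
```

Under these assumptions the first training-loss value is only recorded at step 500, which falls in the third epoch and matches the table above.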

In [23]:
%%time
trainer1.evaluate()

CPU times: user 3.02 s, sys: 4.13 s, total: 7.15 s
Wall time: 1min 37s


{'eval_loss': 0.3423702120780945,
 'eval_accuracy': 0.849,
 'eval_runtime': 97.9669,
 'eval_samples_per_second': 10.208,
 'eval_steps_per_second': 0.643,
 'epoch': 3.0}

###### **After LoRA fine-tuning with configuration #1, "google-bert/bert-base-cased" reaches an _accuracy_ of _0.849_, well above the performance of the original foundation model.**

### Save the PEFT model #1

In [24]:
lora_model1.save_pretrained("trainer_peft_1")

In [25]:
!ls -ltra trainer_peft_1/

total 4664
drwxr-xr-x  12 mk  staff      384  4 Dec 13:43 ..
-rw-r--r--@  1 mk  staff     5101  4 Dec 13:43 README.md
-rw-r--r--@  1 mk  staff  2372416  4 Dec 13:43 adapter_model.safetensors
drwxr-xr-x@  5 mk  staff      160  4 Dec 13:43 .
-rw-r--r--@  1 mk  staff      681  4 Dec 13:43 adapter_config.json


### Create PEFT model #2

In [26]:
from peft import LoraConfig, TaskType

lora_config2 = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # sentiment classification labels whole sequences, not tokens
    r=peft_values["values2"]["r"],
    lora_alpha=peft_values["values2"]["lora_alpha"],
    lora_dropout=peft_values["values2"]["lora_dropout"],
    bias=peft_values["values2"]["bias"],
    target_modules=["query", "value"]
)

In [27]:
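from peft import get_peft_model

# get_peft_model injects adapters in place, so reload a clean base model for configuration #2
label_names = {0: "negative", 1: "positive"}
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2, id2label=label_names, label2id={v: k for k, v in label_names.items()})
lora_model2 = get_peft_model(model, lora_config2)
lora_model2.print_trainable_parameters()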

trainable params: 2,360,834 || all params: 110,672,644 || trainable%: 2.1332
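
Besides the rank, the two configurations differ in how strongly the low-rank update is weighted. In standard LoRA the adapted layer computes `W x + (lora_alpha / r) * B (A x)`. The following standard-library sketch illustrates that formula on tiny hand-written matrices; the helpers `matvec` and `lora_forward` and all matrix values are assumptions for illustration, not PEFT internals.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, r, lora_alpha):
    """Frozen base output plus the scaled low-rank update."""
    scale = lora_alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


# Scaling factors implied by the two configurations in peft_values.
scales = {"values1": 16 / 16, "values2": 128 / 64}
print(scales)  # {'values1': 1.0, 'values2': 2.0}

# Toy example with rank 1: W is frozen, only A and B would be trained.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]                 # shape r x in_features
x = [3.0, 4.0]

# With B all zeros (the usual LoRA starting point) the output equals the base output.
out_initial = lora_forward(W, A, [[0.0], [0.0]], x, r=1, lora_alpha=2)

# After B has changed, the update is added with weight lora_alpha / r.
out_updated = lora_forward(W, A, [[1.0], [0.0]], x, r=1, lora_alpha=2)

print(out_initial)  # [3.0, 4.0]
print(out_updated)  # [17.0, 4.0]
```

Configuration #2 therefore combines a larger adapter with an update that is weighted twice as heavily as in configuration #1.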


### Train PEFT model #2

In [28]:
training_args_peft2 = TrainingArguments(
    "trainer_peft2_output",
    evaluation_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16)



In [29]:
%%time
trainer2 = Trainer(
    model=lora_model2,
    args=training_args_peft2,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics
)
trainer2.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.321531,0.869
2,No log,0.29968,0.879
3,0.369100,0.304322,0.877


CPU times: user 1min 36s, sys: 10min 57s, total: 12min 33s
Wall time: 37min 35s


TrainOutput(global_step=564, training_loss=0.3582543244598605, metrics={'train_runtime': 2254.0444, 'train_samples_per_second': 3.993, 'train_steps_per_second': 0.25, 'total_flos': 2433271836672000.0, 'train_loss': 0.3582543244598605, 'epoch': 3.0})

In [30]:
%%time
trainer2.evaluate()

CPU times: user 3.45 s, sys: 4.83 s, total: 8.28 s
Wall time: 1min 14s


{'eval_loss': 0.3043220341205597,
 'eval_accuracy': 0.877,
 'eval_runtime': 74.5718,
 'eval_samples_per_second': 13.41,
 'eval_steps_per_second': 0.845,
 'epoch': 3.0}

**After LoRA fine-tuning with configuration #2, "google-bert/bert-base-cased" reaches an _accuracy_ of _0.877_, better than both the original foundation model and PEFT model #1 (_0.849_).**
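
When comparing the two PEFT runs, it helps to keep the size of the evaluation subset in mind. The sketch below estimates the sampling uncertainty of each reported accuracy on 1,000 examples with the usual binomial standard error. It treats the two measurements as independent, which is a simplifying assumption because both models were scored on the same examples; the variable names are chosen for illustration.

```python
import math

n_eval = 1000  # size of small_eval_dataset

# Accuracies reported by trainer.evaluate() in this notebook.
accuracies = {"foundation": 0.512, "peft_1": 0.849, "peft_2": 0.877}


def standard_error(p, n):
    """Binomial standard error of an accuracy p measured on n examples."""
    return math.sqrt(p * (1 - p) / n)


errors = {name: standard_error(p, n_eval) for name, p in accuracies.items()}
for name, p in accuracies.items():
    print(f"{name}: {p:.3f} +/- {errors[name]:.3f}")

# Gap between the two PEFT models and its approximate standard error
# (assumption: independent samples).
gap = accuracies["peft_2"] - accuracies["peft_1"]
gap_error = math.sqrt(errors["peft_1"] ** 2 + errors["peft_2"] ** 2)
print(f"gap: {gap:.3f} +/- {gap_error:.3f}")
```

Under this rough estimate, the improvement over the foundation model is far larger than the sampling noise, while the gap between the two PEFT configurations is under two standard errors and would benefit from a larger evaluation set before drawing firm conclusions.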

### Save the PEFT model #2

In [32]:
lora_model2.save_pretrained("trainer_peft_2")

In [33]:
!ls -ltra trainer_peft_2/

total 9256
drwxr-xr-x 10 student student    4096 Dec  4 16:13 ..
-rw-r--r--  1 student student      88 Dec  4 16:13 README.md
-rw-r--r--  1 student student 9461447 Dec  4 16:13 adapter_model.bin
-rw-r--r--  1 student student     449 Dec  4 16:13 adapter_config.json
drwxr-xr-x  2 student student    4096 Dec  4 16:13 .


## Performing Inference with a PEFT Model

The cells below load the saved PEFT model weights and evaluate the performance of the trained PEFT model, comparing the results with those obtained prior to fine-tuning.

## Perform Inference Using the Fine-Tuned Model

### Load the saved PEFT model

We load the better of the two PEFT models we created: "trainer_peft_2"

In [34]:
saved_model = AutoModelForSequenceClassification.from_pretrained("trainer_peft_2")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Evaluate the fine-tuned model

In [35]:
%%time

# classify function from URL: https://achimoraites.medium.com/lightweight-roberta-sequence-classification-fine-tuning-with-lora-using-the-hugging-face-peft-8dd9edf99d19

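# random.randint(0, 1000) is inclusive and could index past the end of the 1000-row dataset
x = random.randrange(len(small_eval_dataset))

text_to_classify = small_eval_dataset[x]["text"]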

def classify(text):
  inputs = tokenizer(text, truncation=True, padding=True, return_tensors="pt")
  output = saved_model(**inputs)

  prediction = output.logits.argmax(dim=-1).item()

  print(f'\n Class: {prediction}, Label: {id2label[prediction]},\nText: {text}')



classify(text_to_classify)

print("\nFrom the dataset, the text is classified as: {}: {}\n".format(small_eval_dataset["label"][x], id2label[small_eval_dataset["label"][x]]))


 Class: 0, Label: neg,
Text: This is the most frightening film ever made in Hollywood. It is a cautionary tale of how to take a European masterpiece and suck the life of of it until it is a dry husk like an insect carcass on the the windowsill. Frightening because it reveals how the world of Hollywood really works: ignorant money begetting dross. It makes me wonder how many great films could populate the corridors of my memory if the Hollywood process had not leveled them to forgettable mediocrity. Cry for the murdered children! See Spoorloos or read The Golden Egg, if you dare, because they will come back to you forever in the idle moments of your life: when you're walking along the street and you see a 'missing' poster; in ordinary-looking parking lots; when you hear the Tour De France on the radio; and, especially, when you you think "what's the harm?" in wearing a sock with a hole in it on a perfectly ordinary day.<br /><br />If only I could give this a zero.

From the dataset, th

**The model classified the text as negative, which matches the label in the dataset.**
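
The `classify` function reports only the winning class. If a confidence score is also of interest, the logits can be turned into probabilities with a softmax. The sketch below shows the idea with the standard library only; the logits are made-up values, not output from the fine-tuned model, and `softmax` is a helper written for this illustration.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shifted = [z - max(logits) for z in logits]   # subtract the max for numerical stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "neg", 1: "pos"}

# Hypothetical logits for a single review (class 0 scores higher).
example_logits = [2.0, -1.0]

probabilities = softmax(example_logits)
predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)

print(f"Class: {predicted_class}, Label: {id2label[predicted_class]}")
print(f"Probabilities: {[round(p, 3) for p in probabilities]}")
```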