# Lightweight Fine-Tuning Project

* Load a pre-trained model and evaluate its performance
* Perform parameter-efficient fine-tuning using the pre-trained model
* Perform inference using the fine-tuned model and compare its performance to the original model

TODO: In this cell, describe your choices for each of the following

* PEFT technique: LoRA
* Model: GPT-2
* Evaluation approach: accuracy on a held-out test split, computed with scikit-learn for the base model and with the Hugging Face `Trainer` for the PEFT model
* Fine-tuning dataset: `sms_spam` (https://huggingface.co/datasets/sms_spam)

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
!pip install scikit-learn


Defaulting to user installation because normal site-packages is not writeable
Collecting scikit-learn
  Downloading scikit_learn-1.7.2-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (9.7 MB)
Collecting joblib>=1.2.0
  Downloading joblib-1.5.2-py3-none-any.whl (308 kB)
Collecting threadpoolctl>=3.1.0
  Downloading threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Installing collected packages: threadpoolctl, joblib, scikit-learn
Successfully installed joblib-1.5.2 scikit-learn-1.7.2 threadpoolctl-3.6.0


In [2]:
# Load the sms_spam dataset
# See: https://huggingface.co/datasets/sms_spam

from datasets import load_dataset

# The sms_spam dataset only has a train split, so we use the train_test_split method to split it into train and test
dataset = load_dataset("sms_spam", split="train").train_test_split(
    test_size=0.2, shuffle=True, seed=23
)

splits = ["train", "test"]

Downloading readme: 0.00B [00:00, ?B/s]

Downloading data: 100%|██████████| 359k/359k [00:00<00:00, 1.07MB/s]


Generating train split:   0%|          | 0/5574 [00:00<?, ? examples/s]

In [3]:
# View the dataset characteristics
dataset["train"]


# Inspect the first example. Do you think this is spam or not?
dataset["train"][0]

{'sms': 'Had your mobile 10 mths? Update to the latest Camera/Video phones for FREE. KEEP UR SAME NUMBER, Get extra free mins/texts. Text YES for a call\n',
 'label': 1}

In [4]:
# Tokenizer - tokenize training and test sets
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Tokenize dataset
def tokenize(batch):
    return tokenizer(batch["sms"], padding=True, truncation=True)

# Tokenize train and test data sets
train_dataset = dataset["train"].map(tokenize, batched=True)
test_dataset = dataset["test"].map(tokenize, batched=True)

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Map:   0%|          | 0/4459 [00:00<?, ? examples/s]

Map:   0%|          | 0/1115 [00:00<?, ? examples/s]

In [5]:
# Understand train datasets
train_dataset

Dataset({
    features: ['sms', 'label', 'input_ids', 'attention_mask'],
    num_rows: 4459
})

In [6]:
# Understand test datasets
test_dataset

Dataset({
    features: ['sms', 'label', 'input_ids', 'attention_mask'],
    num_rows: 1115
})

In [7]:
# Load the pre-trained GPT-2 foundation model
from transformers import AutoModelForSequenceClassification

foundation_model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=2,
    id2label={0: "NOT SPAM", 1: "SPAM"},
    label2id={"NOT SPAM": 0, "SPAM": 1},
)
foundation_model.config.pad_token_id = tokenizer.pad_token_id

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
# Print and understand the foundation model
print(foundation_model)

GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=2, bias=False)
)


Evaluate the performance of the pre-trained GPT-2 foundation model

In [9]:
import numpy as np
import torch
from sklearn.metrics import accuracy_score
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

def compute_metrics(labels, preds):
    """Return accuracy given ground-truth labels and predicted class ids."""
    accuracy = accuracy_score(labels, preds)
    return {"accuracy": accuracy}

# Use the GPU if it is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Move the pre-trained GPT-2 model to the device once, outside the loop
foundation_model.to(device)
foundation_model.eval()

predictions = []
labels = []
for test_sample in test_dataset:
    # Tokenize the raw text and move the tensors to the device
    inputs = tokenizer(test_sample["sms"], return_tensors="pt").to(device)

    # Get predictions without tracking gradients
    with torch.no_grad():
        logits = foundation_model(**inputs).logits

    # argmax over the logits; softmax is monotonic, so it would not change the result
    predicted_class_id = logits.argmax(dim=1).item()

    # Build lists of the predicted output and the ground truth
    predictions.append(predicted_class_id)
    labels.append(test_sample["label"])

In [10]:
# Evaluate the GPT-2 foundation model before any fine-tuning
evaluation_metrics = compute_metrics(labels, predictions)
print(evaluation_metrics)

{'accuracy': 0.13901345291479822}
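
An accuracy figure on a spam dataset is easier to interpret next to a trivial baseline. The sketch below computes the accuracy of always predicting the most frequent class. The helper name `majority_baseline_accuracy` and the toy label list are illustrative assumptions, not part of the notebook; the `labels` list built in the evaluation loop could be passed in instead.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Toy labels for illustration only (0 = not spam, 1 = spam)
toy_labels = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
baseline = majority_baseline_accuracy(toy_labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

A fine-tuned classifier is only informative to the extent that it clears this baseline on the real test labels.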


## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

Create a PEFT model from your loaded model

In [11]:
# Create PEFT model from our pre-trained GPT2 foundation model
# https://huggingface.co/docs/peft/main/en/conceptual_guides/lora
# Attribution - Udacity class course 2 - section 5.2
from peft import LoraConfig, PeftModelForSequenceClassification, TaskType, AutoPeftModelForSequenceClassification

# Create a PEFT config
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=4,
    lora_alpha=16,  # LoRA scaling numerator; the update is scaled by lora_alpha / r
    lora_dropout=0.1
)
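
To make the roles of `r` and `lora_alpha` concrete, here is a minimal pure-Python sketch of the LoRA forward pass, h = W x + (lora_alpha / r) · B A x, on tiny made-up matrices. The dimensions, values, and helper names (`matvec`, `lora_forward`) are assumptions for illustration; PEFT itself operates on tensors and also applies dropout to the LoRA branch. The toy values use r=1 and lora_alpha=4, which gives the same scaling of 4 as r=4 and lora_alpha=16 in the config above.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, r, lora_alpha):
    """Compute W x + (lora_alpha / r) * B (A x)."""
    scaling = lora_alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scaling * u for b, u in zip(base, update)]

# Toy frozen weight: 3 outputs, 2 inputs
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
r, lora_alpha = 1, 4
A = [[0.5, -0.5]]                   # shape (r, in_features)
B_zero = [[0.0], [0.0], [0.0]]      # shape (out_features, r), zero at initialization
B_trained = [[1.0], [0.0], [2.0]]   # hypothetical values after training
x = [2.0, 1.0]

print(lora_forward(W, A, B_zero, x, r, lora_alpha))     # identical to W x
print(lora_forward(W, A, B_trained, x, r, lora_alpha))  # W x plus the scaled update
```

Because B starts at zero, the adapted model initially behaves exactly like the frozen base model; only A and B receive gradient updates.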

In [12]:
# Load the pre-trained foundation GPT-2 model
from transformers import AutoModelForSequenceClassification

foundation_model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=2,
    id2label={0: "NOT SPAM", 1: "SPAM"},
    label2id={"NOT SPAM": 0, "SPAM": 1},
)
foundation_model.config.pad_token_id = tokenizer.pad_token_id

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
# Build PEFT model
peft_model = PeftModelForSequenceClassification(foundation_model, peft_config)

# Report how many parameters remain trainable after applying LoRA
peft_model.print_trainable_parameters()



trainable params: 150,528 || all params: 124,590,336 || trainable%: 0.1208183594592762
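
The reported count can be sanity-checked by hand. The sketch below assumes LoRA is attached only to the `c_attn` projection (768 inputs, 2304 outputs) in each of the 12 GPT-2 blocks, which matches the module printout later in this notebook. Attributing the remainder to the classification head is an assumption about how this PEFT version counts parameters.

```python
# LoRA adds two small matrices per adapted layer: A (r x in) and B (out x r)
num_layers = 12
in_features, out_features = 768, 2304  # c_attn dimensions in GPT-2 small
r = 4

lora_params_per_layer = r * in_features + out_features * r
lora_params = num_layers * lora_params_per_layer

reported_trainable = 150_528
reported_total = 124_590_336
remainder = reported_trainable - lora_params

# The score head is Linear(768 -> 2, bias=False), i.e. 1,536 weights
score_head_params = 768 * 2

print(f"LoRA parameters: {lora_params:,}")
print(f"Remainder: {remainder:,} (= {remainder // score_head_params} x score head)")
print(f"Trainable share: {100 * reported_trainable / reported_total:.4f}%")
```

The remainder equals two copies of the score head, which would be consistent with PEFT keeping both the original head and a trainable copy for `SEQ_CLS` tasks, though that reading is an assumption rather than something the output confirms.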


In [14]:
# Source: Udacity Course 2 section 4.18
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

Train the PEFT LoRA model

In [15]:
# Train the peft model
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments
import numpy as np

# Source Course 2 Section 4.18
peft_training_args = TrainingArguments(
    output_dir="./results/peft_model",
    # Set the learning rate
    learning_rate=2e-5,
    # Set the per device train batch size and eval batch size
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    # Evaluate and save the model after each epoch
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=2,
    weight_decay=0.01,
    logging_dir='./logs/peft_model',
    load_best_model_at_end=True,
    logging_steps=100,
    warmup_ratio=0.1,
)

# Initialize the Trainer with compute_metrics
peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)

peft_trainer.train()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.6466        | 0.288175        | 0.884305 |
| 2     | 0.5716        | 0.25727         | 0.888789 |


Checkpoint destination directory ./results/peft_model/checkpoint-140 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./results/peft_model/checkpoint-280 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=280, training_loss=0.5832361630031041, metrics={'train_runtime': 369.6597, 'train_samples_per_second': 24.125, 'train_steps_per_second': 0.757, 'total_flos': 1175786692743168.0, 'train_loss': 0.5832361630031041, 'epoch': 2.0})
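
The `global_step=280` value and the `checkpoint-140` / `checkpoint-280` directory names follow from the dataset size and batch size. A quick check, assuming a single device and no gradient accumulation (neither is set in the arguments above):

```python
import math

num_train_rows = 4459   # rows in train_dataset
batch_size = 32         # per_device_train_batch_size
num_epochs = 2          # num_train_epochs

# The last batch of each epoch is smaller, so round up
steps_per_epoch = math.ceil(num_train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs

print(f"Steps per epoch: {steps_per_epoch}, total steps: {total_steps}")
```

With `save_strategy="epoch"`, a checkpoint is written at the end of each epoch, which is why the checkpoints are named after steps 140 and 280.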

In [16]:
# Evaluate PEFT model and print results
peft_evaluation_results = peft_trainer.evaluate()
print("Peft Evaluation Results:", peft_evaluation_results)

Peft Evaluation Results: {'eval_loss': 0.25727012753486633, 'eval_accuracy': 0.8887892376681614, 'eval_runtime': 15.3784, 'eval_samples_per_second': 72.504, 'eval_steps_per_second': 2.276, 'epoch': 2.0}


###  ⚠️ IMPORTANT ⚠️

Due to workspace storage constraints, do not store the model weights in the working directory. Always save them under `/tmp` instead to avoid workspace crashes, which are irrecoverable.

In [17]:
# Save the PEFT adapter weights under /tmp
peft_model.save_pretrained("/tmp/peft_model")

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [18]:
# Load the trained/saved PEFT model weights
from peft import TaskType, AutoPeftModelForSequenceClassification

inference_model = AutoPeftModelForSequenceClassification.from_pretrained(
    "/tmp/peft_model",
    num_labels=2,
    id2label={0: "NOT SPAM", 1: "SPAM"},
    label2id={"NOT SPAM": 0, "SPAM": 1},
)
inference_model.config.pad_token_id = inference_model.config.eos_token_id

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Wrap the reloaded PEFT model in a Trainer so it can be evaluated (no training is run here)
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments
import numpy as np

# Source Course 2 Section 4.18
inference_peft_training_args = TrainingArguments(
    output_dir="./results/inference_peft_model",
    # Set the learning rate
    learning_rate=2e-5,
    # Set the per device train batch size and eval batch size
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    # Evaluate and save the model after each epoch
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=2,
    weight_decay=0.01,
    logging_dir='./logs/inference_peft_model',
    load_best_model_at_end=True,
    logging_steps=100,
    warmup_ratio=0.1,
)

# Initialize the Trainer with compute_metrics
peft_trainer = Trainer(
    model=inference_model,
    args=inference_peft_training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)

In [19]:
# Evaluate the model
peft_evaluation_results = peft_trainer.evaluate()
print("PEFT Evaluation Results:", peft_evaluation_results)

PEFT Evaluation Results: {'eval_loss': 0.25727012753486633, 'eval_accuracy': 0.8887892376681614, 'eval_runtime': 15.6319, 'eval_samples_per_second': 71.329, 'eval_steps_per_second': 2.239, 'epoch': 2.0}


In [20]:
from peft import AutoPeftModelForSequenceClassification

inference_model = AutoPeftModelForSequenceClassification.from_pretrained(
    "/tmp/peft_model"
)

inference_model.eval()


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): GPT2ForSequenceClassification(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 768)
        (wpe): Embedding(1024, 768)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-11): 12 x GPT2Block(
            (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (attn): GPT2Attention(
              (c_attn): Linear(
                in_features=768, out_features=2304, bias=True
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=768, out_features=4, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=4, out_features=2304, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_

In [21]:
import torch

def predict(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

    with torch.no_grad():
        logits = inference_model(**inputs).logits

    class_id = logits.argmax(dim=1).item()
    return inference_model.config.id2label[class_id]


In [22]:
# Example
sample = "Update your phone to the latest Camera/Video phones."
predicted_label = predict(sample)
print(f"Prompt: '{sample}'\nPredicted label: {predicted_label}")

Prompt: 'Update your phone to the latest Camera/Video phones.'
Predicted label: LABEL_0
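
The prediction prints `LABEL_0` rather than a readable name because the model in cell 20 was reloaded without the `id2label` mapping that was passed in cell 18, so the configuration falls back to default `LABEL_<id>` names. A minimal way to translate such a default name back, using the mapping from this notebook (the helper `readable_label` is a hypothetical name):

```python
# Mapping used when the model was first configured in this notebook
ID2LABEL = {0: "NOT SPAM", 1: "SPAM"}

def readable_label(default_name, id2label=ID2LABEL):
    """Convert a default name such as 'LABEL_0' into a readable label."""
    prefix = "LABEL_"
    if not default_name.startswith(prefix):
        return default_name  # already a custom label
    class_id = int(default_name[len(prefix):])
    return id2label[class_id]

print(readable_label("LABEL_0"))
```

Alternatively, passing the same `id2label` and `label2id` arguments when reloading the model, as cell 18 does, should make `predict` return the readable names directly.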


## Conclusion

1. The evaluation results show that the PEFT model beats the base GPT-2 model: accuracy on the held-out test split rose from 0.139 before fine-tuning to 0.889 after two epochs of LoRA training, with a final evaluation loss of 0.257.
2. Classic fine-tuning requires a lot of compute and updates most of the weights, whereas PEFT is more efficient because it trains only a small number of parameters: here 150,528 of 124,590,336, or about 0.12%.
3. Reloading the saved adapter from `/tmp/peft_model` reproduced the same evaluation loss and accuracy at a comparable runtime (15.6 s versus 15.4 s). These results demonstrate the effectiveness of parameter-efficient fine-tuning for adapting GPT-2 to spam classification on the sms_spam dataset.