# Lightweight Fine-Tuning Project

In this cell, describe your choices for each of the following

* **PEFT technique**: LoRA (LoRA (Low-Rank Adaptation of Large Language Models) is a technique designed to fine-tune large pre-trained models, such as GPT or BERT, efficiently by freezing most of the model's parameters and training only a few added low-rank matrices._
* **Model**: gpt2 (GPT-2 (Generative Pretrained Transformer 2) is an advanced language model developed by OpenAI. It is the second version of the Generative Pretrained Transformer series, designed to generate human-like text based on a given prompt.)
* **Evaluation approach**: Evaluation before and after fine-tuning using the Trainer's evaluate() method. This approach provides a direct comparison of model performance before and after fine-tuning, ensuring the effectiveness of the fine-tuning process. By evaluating on the validation dataset using the same metrics and procedures, we can assess the impact of fine-tuning on model performance objectively.
* **Fine-tuning dataset**: sms_spam (https://huggingface.co/datasets/sms_spam)

## Loading and Evaluating a Foundation Model



In [2]:
# Install the required version of datasets in case you have an older version
# You will need to choose "Kernel > Restart Kernel" from the menu after executing this cell
#!pip install -q "datasets==2.15.0"
#!pip install transformers
#!pip install peft
!pip install datasets
#!pip install pandas
#!pip install numpy
#!pip install scikit-learn
#!pip install tqdm

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m8.2 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m7.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl (1

In [29]:
# Import the 'load_dataset' function from the 'datasets' library
# 'datasets' is a Hugging Face library that provides access to various public datasets for NLP tasks
from datasets import load_dataset

# Load the SMS spam dataset from Hugging Face datasets
# The 'sms_spam' dataset contains labeled SMS messages as either "ham" (not spam) or "spam"
# The 'split="train"' argument retrieves the training portion of the dataset by default
sms_spam_dataset = load_dataset("sms_spam", split="train")

# Split the dataset into training and test sets using an 80-20 split
# - test_size=0.2: 20% of the data will be used for testing
# - shuffle=True: The data will be shuffled randomly before splitting to ensure both train and test sets are well-mixed
# - seed=42: A fixed random seed ensures reproducibility of the data split, meaning the same shuffle will occur every time the code is run
sms_spam_split = sms_spam_dataset.train_test_split(test_size=0.2, shuffle=True, seed=42)

In [30]:
# Import the AutoTokenizer class from the transformers library
# The AutoTokenizer allows you to easily load pre-trained tokenizers for various transformer models.
from transformers import AutoTokenizer

# Load the tokenizer for GPT-2, which is a pre-trained model for natural language processing tasks
# The tokenizer will convert the input text into tokens that GPT-2 understands
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Set the padding token to be the same as the end-of-sequence (EOS) token for GPT-2
# GPT-2 does not have a dedicated padding token, so we use the EOS token for padding purposes
tokenizer.pad_token = tokenizer.eos_token

# Define a function to tokenize a batch of data
# The function takes a batch (which is a dictionary) as input and tokenizes the "sms" field
# - padding=True: Ensures all sequences are padded to the same length
# - truncation=True: Ensures sequences longer than the model's maximum input length are truncated
def tokenize(batch):
    return tokenizer(batch["sms"], padding=True, truncation=True)

# Tokenize the training set using the 'map' function to apply the 'tokenize' function to each batch
# - batched=True: Processes the data in batches for efficiency, instead of one example at a time
train_dataset = sms_spam_split["train"].map(tokenize, batched=True)

# Tokenize the test set in the same way as the training set
test_dataset = sms_spam_split["test"].map(tokenize, batched=True)

Map:   0%|          | 0/4459 [00:00<?, ? examples/s]

Map:   0%|          | 0/1115 [00:00<?, ? examples/s]

In [31]:
# Import the AutoModelForSequenceClassification class from the transformers library
# This class allows us to load pre-trained transformer models for sequence classification tasks.
from transformers import AutoModelForSequenceClassification

# Load the GPT-2 model pre-trained for sequence classification tasks
# - num_labels=2: The model is being adapted for a binary classification task (spam or not spam)
# - id2label: A mapping from label indices (0 and 1) to label names ("not spam" and "spam")
# - label2id: A reverse mapping from label names to label indices
# We are fine-tuning GPT-2 to classify SMS messages into spam or not spam
foundation_model = AutoModelForSequenceClassification.from_pretrained(
    "gpt2",
    num_labels=2,  # The dataset has two categories: "spam" and "not spam"
    id2label={0: "not spam", 1: "spam"},  # Mapping from numeric IDs to label names
    label2id={"not spam": 0, "spam": 1},  # Mapping from label names to numeric IDs
)

# Set the padding token ID for the model to match the tokenizer's padding token ID
# Since GPT-2 uses its eos_token as the padding token, we make sure that the model's config reflects this
foundation_model.config.pad_token_id = tokenizer.pad_token_id

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Evaluating gpt2 foundation model

In [32]:
# Import necessary libraries and modules
from transformers import Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import torch

# Initialize lists to store predictions and true labels for evaluation
predictions = []
labels = []

# Iterate over the test dataset to make predictions for each example
for example in test_dataset:
    # Choose the device (GPU if available, otherwise fallback to CPU)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Move the model to the selected device
    foundation_model.to(device)

    # Prepare the input text (SMS message) for the model
    # 'return_tensors="pt"' ensures the output is in PyTorch tensor format
    inputs = tokenizer(example["sms"], return_tensors="pt").to(device)

    # Disable gradient calculation to save memory and computation (inference mode)
    with torch.no_grad():
        # Forward pass through the model to get raw logits (predictions before softmax)
        outputs = foundation_model(**inputs)
        logits = outputs.logits

    # Apply the softmax function to convert logits into probabilities
    probabilities = torch.nn.functional.softmax(logits, dim=1)

    # Get the predicted class with the highest probability
    predicted_class_id = probabilities.argmax().item()

    # Append the predicted class and true label to the respective lists
    predictions.append(predicted_class_id)
    labels.append(example["label"])


In [33]:
# After collecting predictions and labels, you can compute the evaluation metrics
# Compute accuracy, precision, recall, and F1-score to evaluate model performance
accuracy = accuracy_score(labels, predictions)
precision = precision_score(labels, predictions, average='binary')
recall = recall_score(labels, predictions, average='binary')
f1 = f1_score(labels, predictions, average='binary')

# Print the evaluation metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")

Accuracy: 0.8619
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000


## Performing Parameter-Efficient Fine-Tuning

In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [34]:
# Import necessary classes from the PEFT (Parameter Efficient Fine-Tuning) library
# LoraConfig: Configuration for LoRA (Low-Rank Adaptation) fine-tuning
# PeftModelForSequenceClassification: Class to apply PEFT techniques to a sequence classification task
# TaskType: Specifies the task type for the PEFT model (in this case, sequence classification)
# AutoPeftModelForSequenceClassification: Automatically loads a pre-trained PEFT model for sequence classification
from peft import LoraConfig, PeftModelForSequenceClassification, TaskType, AutoPeftModelForSequenceClassification

# Define the PEFT model configuration
# LoRA (Low-Rank Adaptation) is a technique to efficiently fine-tune pre-trained models by introducing low-rank adapters
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # Define the task as sequence classification (SEQ_CLS)
    inference_mode=False,        # Set inference_mode to False, indicating we are in training mode
    r=4,                         # 'r' is the rank of the low-rank matrix in LoRA, controlling the complexity of the adaptation
    lora_alpha=16,               # 'lora_alpha' scales the low-rank updates, controlling the learning rate for LoRA adapters
    lora_dropout=0.1             # 'lora_dropout' applies dropout to the LoRA adapters to prevent overfitting
)

# Load the pre-trained GPT-2 model and configure it for binary sequence classification (spam vs. not spam)
# - num_labels=2: There are two possible labels (spam and not spam)
# - id2label and label2id: Map the numeric label indices (0 and 1) to the corresponding string labels ("not spam" and "spam")
model = AutoModelForSequenceClassification.from_pretrained(
    "gpt2",
    num_labels=2,  # Binary classification task
    id2label={0: "not spam", 1: "spam"},  # Mapping from label indices to label names
    label2id={"not spam": 0, "spam": 1},  # Mapping from label names to label indices
)

# Set the padding token to the eos_token for GPT-2 since it doesn’t have a dedicated padding token
model.config.pad_token_id = model.config.eos_token_id

# Apply PEFT (LoRA) to the model using the configuration defined above
# This applies LoRA adapters to the pre-trained GPT-2 model for efficient fine-tuning
peft_model = PeftModelForSequenceClassification(model, peft_config)

# Print the trainable parameters of the PEFT model
# This will give us a summary of the parameters that will be fine-tuned (LoRA adapters)
peft_model.print_trainable_parameters()

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 148,992 || all params: 124,590,336 || trainable%: 0.1196




## PEFT model Evaluation

In [36]:
import numpy as np

# Function to compute evaluation metrics
# The function receives `eval_pred` (a tuple containing predictions and labels) as input
def compute_evaluation_metrics(eval_pred):
    # Unpack the predictions and labels from the tuple
    predictions, labels = eval_pred

    # Convert the logits (raw model outputs) to predicted class labels by taking the argmax across each prediction
    # argmax(axis=1) selects the class with the highest probability (most likely class)
    predictions = np.argmax(predictions, axis=1)

    # Compute accuracy: The number of correct predictions divided by the total number of predictions
    # The comparison (predictions == labels) results in a boolean array, where True is 1 and False is 0
    accuracy = (predictions == labels).mean()

    # Return the dictionary with the computed accuracy metric
    return {"accuracy": accuracy}

In [37]:
# Import necessary classes from the Hugging Face Transformers library
# DataCollatorWithPadding: Automatically pads the inputs in each batch to the maximum length in the batch
# Trainer: Class responsible for training and evaluating the model
# TrainingArguments: Configuration for training such as batch size, learning rate, etc.
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments
import numpy as np

# Define the training arguments with a number of configurations to control the training process
peft_training_args = TrainingArguments(
    output_dir="./results/peft_model",  # Directory to save model checkpoints and results
    evaluation_strategy="epoch",        # Evaluate the model after every epoch
    learning_rate=2e-5,                 # Learning rate for the optimizer
    per_device_train_batch_size=32,     # Batch size used during training for each device (GPU/CPU)
    per_device_eval_batch_size=32,      # Batch size used during evaluation for each device
    num_train_epochs=5,                 # Number of training epochs
    weight_decay=0.01,                  # Weight decay to prevent overfitting (L2 regularization)
    logging_dir='./logs/peft_model',    # Directory for saving logs
    save_strategy="epoch",              # Save the model checkpoint after every epoch
    load_best_model_at_end=True,        # Load the best model (based on evaluation) at the end of training
    logging_steps=100,                  # Log training metrics every 100 steps
    warmup_ratio=0.1,                   # Proportion of training steps to perform learning rate warmup
)

# Initialize the Trainer class, which is responsible for training the model
# The Trainer will take care of running the training loop, logging, evaluation, and saving the model
peft_trainer = Trainer(
    model=peft_model,                  # The model being trained
    args=peft_training_args,           # The training configuration
    train_dataset=train_dataset,       # The dataset to be used for training
    eval_dataset=test_dataset,         # The dataset to be used for evaluation
    compute_metrics=compute_evaluation_metrics,   # Function to compute evaluation metrics after each evaluation
    tokenizer=tokenizer,               # The tokenizer used to encode/decode text inputs
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),  # Automatically pad inputs to the same length in a batch
)

# Train the model
# This will start the training loop with the provided configurations
peft_trainer.train()

  peft_trainer = Trainer(


Epoch,Training Loss,Validation Loss,Accuracy
1,2.0839,1.327786,0.86278
2,1.2861,0.583321,0.855605
3,0.5461,0.403312,0.868161
4,0.3928,0.330501,0.883408
5,0.3644,0.313403,0.886099


TrainOutput(global_step=700, training_loss=0.8245353807721819, metrics={'train_runtime': 844.1601, 'train_samples_per_second': 26.411, 'train_steps_per_second': 0.829, 'total_flos': 2940033184856064.0, 'train_loss': 0.8245353807721819, 'epoch': 5.0})

In [38]:
# Evaluate the model on the evaluation dataset (test dataset in this case)
# The `evaluate()` function computes the evaluation metrics defined in `compute_metrics`
# and returns the results (e.g., accuracy, loss, etc.) as a dictionary
evaluation_results_peft = peft_trainer.evaluate()

# Print the evaluation results, which will include the metrics calculated during evaluation (like accuracy)
evaluation_results_peft

{'eval_loss': 0.3134034276008606,
 'eval_accuracy': 0.8860986547085202,
 'eval_runtime': 11.1442,
 'eval_samples_per_second': 100.052,
 'eval_steps_per_second': 3.141,
 'epoch': 5.0}

## Save PEFT model

In [39]:
# Save the trained PEFT model to a specified directory
# This saves the model's weights, configuration, and tokenizer (if needed) to the directory
peft_model.save_pretrained('models/peft_model')

## Performing Inference with a PEFT Model

In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [40]:
# Import necessary classes for PEFT (LoRA) model loading
from peft import LoraConfig, PeftModelForSequenceClassification, TaskType, AutoPeftModelForSequenceClassification

# Load the PEFT model for inference (the model saved previously)
inference_model = AutoPeftModelForSequenceClassification.from_pretrained(
    "models/peft_model",  # Path to the saved model directory
    num_labels=2,         # Number of output labels (e.g., spam and not spam)
    id2label={0: "not spam", 1: "spam"},  # Mapping from label ids to label names
    label2id={"not spam": 0, "spam": 1},  # Mapping from label names to label ids
)

# Set pad_token_id to eos_token_id (this ensures that padding uses the same token as the end-of-sequence token)
inference_model.config.pad_token_id = inference_model.config.eos_token_id

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [42]:
import torch

# Define the prediction function
def predict_label(prompt: str) -> str:
    # Check if a GPU (CUDA) is available, otherwise use the CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Move the model to the selected device (GPU or CPU)
    inference_model.to(device)

    # Prepare the input text (tokenize the prompt)
    inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True).to(device)

    # Get predictions (disable gradient tracking during inference)
    with torch.no_grad():
        outputs = inference_model(**inputs)
        logits = outputs.logits

    # Apply softmax to convert logits to probabilities
    probabilities = torch.nn.functional.softmax(logits, dim=1)

    # Get the predicted class ID (class with the highest probability)
    predicted_class_id = probabilities.argmax(dim=1).item()

    # Mapping from class ID to label name (spam or not spam)
    id2label = {0: "spam", 1: "not spam"}

    # Get the predicted label based on the predicted class ID
    predicted_label = id2label.get(predicted_class_id, "Unknown")

    return predicted_label

In [43]:
# Test the prediction with a sample prompt
sample_prompt = "FREE!FREE!FREE Get yous eye checkd for free bya renowned eye specialist of the city"

# Output the result
print(f"Prompt: '{prompt}'\nPredicted label: {predict_label(sample_prompt)}")


Prompt: 'I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.'
Predicted label: spam


# Conclusion:
## Comparision the performance of the PEFT model with that of the original foundational GPT2 model.

A traditional approach to fine-tuning Large Language Models (LLMs) typically involves adjusting the majority of the model's weights, which demands significant computational resources. In contrast, LoRA-based fine-tuning offers a more efficient alternative by freezing the original weights and training only a small set of additional parameters, making the process much more resource-efficient. When comparing the performance of the original foundational model with the PEFT (Parameter-Efficient Fine-Tuning) model, the PEFT model demonstrates higher accuracy.

