# Lightweight Fine-Tuning Project

The choices made in this project, as reflected in the code below:

- **PEFT Technique**: LoRA (Low-Rank Adaptation)
- **Model**: DistilBERT (`distilbert-base-uncased`)
- **Evaluation Approach**: Hugging Face Trainer with accuracy metrics
- **Dataset**: SMS Spam dataset (from Hugging Face datasets library) for binary classification (spam vs. not spam)
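Because accuracy is the only metric used here, it helps to know what a trivial classifier would score on the same labels. The sketch below computes the accuracy of always predicting the most frequent class. The function name `majority_baseline_accuracy` and the toy labels are illustrative assumptions, not part of the project code or the real dataset.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    if not labels:
        raise ValueError("labels must be non-empty")
    # Count of the single most common label in the list
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Toy labels (hypothetical, not the real dataset): 0 = not spam, 1 = spam
toy_labels = [0, 0, 0, 1, 0, 0, 1, 0]
print(majority_baseline_accuracy(toy_labels))
```

On the real data, passing the `label` column of a split to this function would give a floor against which the model's accuracy can be read.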


## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [2]:
!pip install evaluate

In [3]:
import torch
import numpy as np
import pandas as pd
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)
from peft import get_peft_config, get_peft_model, LoraConfig, TaskType, PeftConfig, AutoPeftModelForSequenceClassification
import evaluate

In [4]:
# Install required packages
!pip install evaluate scikit-learn -q "datasets==2.15.0"

In [5]:
# Load dataset
data = load_dataset("sms_spam", split="train").train_test_split(test_size=0.2, shuffle=True, seed=23)
splits = ["train", "test"]

Generating train split:   0%|          | 0/5574 [00:00<?, ? examples/s]

In [6]:
# Load pre-trained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
tokenized_data = {}

In [7]:
# Tokenize dataset
for split in splits:
    tokenized_data[split] = data[split].map(lambda x: tokenizer(x['sms'], truncation=True), batched=True)

Map:   0%|          | 0/4459 [00:00<?, ? examples/s]

Map:   0%|          | 0/1115 [00:00<?, ? examples/s]
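The tokenization step above truncates but does not pad, so the examples keep different lengths and padding is left to `DataCollatorWithPadding` at batch time. The following is a minimal pure-Python sketch of that idea (pad each batch only to its own longest sequence). It is not the library's implementation; `pad_batch`, the toy token ids, and the default `pad_token_id=0` are assumptions made for illustration.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad every sequence to the longest one in this batch.

    Returns padded ids plus an attention mask (1 = real token, 0 = padding).
    """
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids = [
        ids + [pad_token_id] * (max_len - len(ids)) for ids in batch_input_ids
    ]
    attention_mask = [
        [1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch_input_ids
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two toy sequences of different lengths (hypothetical token ids)
toy_batch = [[101, 7592, 102], [101, 2748, 2017, 2180, 102]]
padded = pad_batch(toy_batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch rather than to a global maximum keeps short SMS messages from being padded to the length of the longest message in the dataset.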

In [8]:
# Load the pre-trained model with a new 2-class classification head; every parameter is left trainable (full fine-tuning)
full_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
for param in full_model.parameters():
    param.requires_grad = True

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [9]:
# Training configuration
train_args = TrainingArguments(
    output_dir="./data/spam_not_spam",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=2,
    weight_decay=0.01,
    load_best_model_at_end=True
)

trainer = Trainer(
    model=full_model,
    args=train_args,
    train_dataset=tokenized_data['train'],
    eval_dataset=tokenized_data['test'],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=lambda eval_pred: {"accuracy": np.mean(np.argmax(eval_pred[0], axis=1) == eval_pred[1])}
)
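The inline `compute_metrics` lambda can also be written as a named function, which is easier to read and to test on its own. This is a sketch with the same logic as the lambda; the toy logits and labels are made up for illustration.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Return accuracy for a (logits, labels) pair."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row
    predictions = np.argmax(logits, axis=1)
    return {"accuracy": float(np.mean(predictions == labels))}


# Toy check (hypothetical values): predictions are [0, 1, 1, 0]
toy_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]])
toy_labels = np.array([0, 1, 0, 0])
print(compute_metrics((toy_logits, toy_labels)))
```

The function could then be passed as `compute_metrics=compute_metrics` in place of the lambda.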

In [10]:

# Train the model
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.045662,0.988341
2,0.052100,0.050454,0.989238


Checkpoint destination directory ./data/spam_not_spam/checkpoint-279 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./data/spam_not_spam/checkpoint-558 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=558, training_loss=0.049513387423689645, metrics={'train_runtime': 69.644, 'train_samples_per_second': 128.051, 'train_steps_per_second': 8.012, 'total_flos': 144666559425588.0, 'train_loss': 0.049513387423689645, 'epoch': 2.0})
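The step counts in this output can be checked by hand. Assuming a single device, no gradient accumulation, and a final partial batch that is kept, the 4,459 training examples at batch size 16 give 279 steps per epoch, which matches the `checkpoint-279` directory and `global_step=558` after two epochs. The helper name `steps_per_epoch` is an illustrative assumption.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Optimizer steps per epoch on one device, without gradient accumulation."""
    return math.ceil(num_examples / batch_size)


train_examples = 4459  # size of the train split shown in the Map output
batch_size = 16        # per_device_train_batch_size
epochs = 2             # num_train_epochs

per_epoch = steps_per_epoch(train_examples, batch_size)
total_steps = per_epoch * epochs
print(per_epoch, total_steps)
```

This also likely explains the `No log` entry for epoch 1: the `Trainer` logs the training loss every 500 steps by default, and the first epoch ends at step 279, before any training loss has been logged.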

## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [11]:
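# LoRA configuration: rank-8 adapters on the attention query and value projections
peft_cfg = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=8,
    target_modules=["q_lin", "v_lin"],
    lora_alpha=32,
    lora_dropout=0.1,
)
# Note: full_model has already been fully fine-tuned above, so the adapters are added on top of fine-tuned weights
peft_model = get_peft_model(full_model, peft_cfg)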
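To make the configuration concrete, here is a small NumPy sketch of what a LoRA-adapted linear layer computes: the frozen weight `W` plus a low-rank update `B @ A` scaled by `lora_alpha / r`. It is an illustration under simplifying assumptions (no bias, no dropout), not the PEFT library's implementation, and `lora_forward` is a hypothetical name. The sizes assume DistilBERT's hidden dimension of 768 and its 6 transformer layers.

```python
import numpy as np


def lora_forward(x, W, A, B, alpha, r):
    """Frozen linear map plus a scaled low-rank update (bias and dropout omitted)."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)


d, r, alpha = 768, 8, 32  # hidden size, LoRA rank, lora_alpha
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))         # frozen pre-trained weight
A = rng.normal(size=(r, d)) * 0.01  # trainable, randomly initialized
B = np.zeros((d, r))                # trainable, zero-initialized
x = rng.normal(size=(2, d))

# With B at zero the adapter is a no-op, so training starts from the base model's output
assert np.allclose(lora_forward(x, W, A, B, alpha, r), x @ W.T)

# Trainable LoRA parameters: A and B for each targeted projection
params_per_module = r * d + d * r       # 12,288
total_lora_params = params_per_module * 2 * 6  # q_lin and v_lin in each of 6 layers
print(params_per_module, total_lora_params)
```

Under these assumptions the adapters add 147,456 trainable parameters. This count covers the LoRA matrices only and excludes any classification-head parameters that are also trained.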

In [12]:
# Train the PEFT model
trainer = Trainer(
    model=peft_model,
    args=TrainingArguments(
        output_dir="./data/spam_not_spam",
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        num_train_epochs=1,
        weight_decay=0.01,
        load_best_model_at_end=True
    ),
    train_dataset=tokenized_data['train'],
    eval_dataset=tokenized_data['test'],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=lambda eval_pred: {"accuracy": np.mean(np.argmax(eval_pred[0], axis=1) == eval_pred[1])}
)

trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.046053,0.989238


Checkpoint destination directory ./data/spam_not_spam/checkpoint-279 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=279, training_loss=0.02356579654105675, metrics={'train_runtime': 20.3212, 'train_samples_per_second': 219.425, 'train_steps_per_second': 13.729, 'total_flos': 73790878281600.0, 'train_loss': 0.02356579654105675, 'epoch': 1.0})

In [13]:
# Save the PEFT model
peft_model.save_pretrained("./peft_model")

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [14]:
# Reload the base model named in the saved adapter config and apply the adapter weights stored in ./peft_model
loaded_cfg = PeftConfig.from_pretrained("./peft_model")
model = AutoPeftModelForSequenceClassification.from_pretrained("./peft_model")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
tokenizer = AutoTokenizer.from_pretrained(loaded_cfg.base_model_name_or_path)
items = tokenized_data['test'].select([0, 1, 22, 31, 43, 292, 448, 487])

In [16]:
predictions = []
for item in items:
    tokens = tokenizer(item['sms'], return_tensors="pt")
    with torch.no_grad():
        logits = model(**tokens).logits
        predictions.append(logits.argmax().item())

# Create dataframe for results
results_df = pd.DataFrame({
    "sms": [item['sms'] for item in items],
    "predictions": predictions,
    "labels": [item['label'] for item in items]
})

In [17]:
# Set display options
pd.set_option("display.max_colwidth", None)
print(results_df)

   sms
0  Yup... Hey then one day on fri we can ask miwa and jiayin take leave go karaoke \n
1  Happy new years melody!\n
2  PRIVATE! Your 2003 Account Statement for shows 800 un-redeemed S. I. M. points. Call 08715203652 Identifier Code: 42810 Expires 29/10/0\n
3  URGENT! We are trying to contact U. Todays draw shows that you have won a £800 prize GUARANTEED. Call 09050003091 from land line. Claim C52. Valid 12hrs only\n
4  I had askd u a question some hours before. It

In [18]:
# Evaluate accuracy
labels = [item["label"] for item in items]
accuracy_metric = evaluate.load("accuracy")
evaluation_result = accuracy_metric.compute(references=labels, predictions=predictions)
print(evaluation_result)

{'accuracy': 0.625}
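An accuracy of 0.625 on 8 items means 5 of the 8 selected messages were classified correctly. The training runs above reported accuracy on the full 1,115-example test split, so the two numbers are not directly comparable. The sketch below shows how coarse an 8-item estimate is by attaching a binomial standard error to the accuracy. The function name and the toy predictions are assumptions chosen to reproduce 5 correct out of 8; they are not the notebook's actual predictions.

```python
import math


def accuracy_with_stderr(predictions, labels):
    """Accuracy and its binomial standard error, sqrt(p * (1 - p) / n)."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    n = len(labels)
    correct = sum(p == y for p, y in zip(predictions, labels))
    acc = correct / n
    stderr = math.sqrt(acc * (1 - acc) / n)
    return acc, stderr


# Hypothetical predictions and labels with 5 of 8 correct
toy_predictions = [0, 0, 1, 1, 0, 0, 0, 0]
toy_labels = [0, 0, 1, 1, 0, 1, 1, 1]
acc, stderr = accuracy_with_stderr(toy_predictions, toy_labels)
print(f"accuracy={acc:.3f} stderr={stderr:.3f}")
```

With a standard error of roughly 0.17, an 8-item sample cannot separate a strong model from a weak one. Running the same loop over the whole `tokenized_data['test']` split would give a figure that can be compared with the accuracy reported during training.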
