# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

* PEFT technique: Low-Rank Adaptation (LoRA), a popular method for fine-tuning an LLM by training a small number of added parameters while the pretrained weights stay frozen. This lowers the compute requirements and can help reduce training time.
* Model: GPT2ForSequenceClassification (gpt2) is used as a text classifier for spam detection. GPT-2 can be applied to a variety of NLP tasks because it captures patterns and context in sentences.
* Evaluation approach: The Hugging Face Trainer evaluation, with accuracy as the metric. The baseline is GPT-2 with all base-model weights frozen and only the newly initialized classification head trained. Its accuracy is compared with that of the LoRA fine-tuned model to select the better model for text classification.
* Fine-tuning dataset: The SMS Spam dataset from Hugging Face, which labels each message as spam or not spam. Here is the link to the dataset: https://huggingface.co/datasets/ucirvine/sms_spam
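
As a rough illustration of the LoRA idea described above, the sketch below applies a low-rank update to a single frozen weight matrix using NumPy. It is not the PEFT implementation: the dimensions, the `lora_forward` helper, and the random initialization are assumptions chosen for illustration, with the rank and scaling mirroring the `r = 8` and `lora_alpha = 32` values used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 768, 768, 8, 32  # illustrative sizes (assumed)

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init


def lora_forward(x):
    """Base projection plus the scaled low-rank update (alpha / r) * B @ A."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


x = rng.normal(size=(4, d_in))
base_out = x @ W.T
lora_out = lora_forward(x)

# Because B starts at zero, the adapted layer initially matches the base layer.
print(np.allclose(base_out, lora_out))

full_params = W.size           # parameters updated by full fine-tuning
lora_params = A.size + B.size  # parameters updated by LoRA
print(full_params, lora_params)  # 589824 12288
```

For this one matrix, LoRA trains about 2% of the parameters that full fine-tuning would update.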

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
import numpy as np
import pandas as pd
from datasets import load_dataset
import torch
from peft import AutoPeftModelForSequenceClassification, LoraConfig, get_peft_model, TaskType
from transformers import AutoTokenizer,AutoModelForSequenceClassification
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

In [2]:
# Split the dataset's only split ("train") into 80% train and 20% test sets
dataset = load_dataset("sms_spam", split = "train").train_test_split(
    test_size = 0.2, shuffle = True, seed = 23
)

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sms', 'label'],
        num_rows: 4459
    })
    test: Dataset({
        features: ['sms', 'label'],
        num_rows: 1115
    })
})

In [4]:
# Load the GPT-2 tokenizer; GPT-2 has no pad token, so reuse the EOS token for padding
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

In [5]:
splits = ["train", "test"]

In [6]:
tokenized_dataset = {}
for split in splits:
    tokenized_dataset[split] = dataset[split].map(
        lambda x: tokenizer(x["sms"], truncation = True), 
        batched = True)
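
The tokenization step above truncates but does not pad, so the sequences keep different lengths until `DataCollatorWithPadding` pads each batch later on. The pure-Python sketch below shows the general idea of per-batch padding. `pad_batch` is a hypothetical helper, the token ids are made up, and right-side padding is assumed; this is not the collator's actual code.

```python
PAD_ID = 50256  # GPT-2's EOS token id, reused as the pad token in this notebook


def pad_batch(batch_input_ids, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one in the batch."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[10, 11, 12], [20], [30, 31]])  # hypothetical token ids
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a global maximum keeps short SMS messages from being padded to the length of the longest message in the dataset.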

In [7]:
# Inspect some lines of training dataset
tokenized_dataset["train"]["sms"][0] # this seems to be a spam

'Had your mobile 10 mths? Update to the latest Camera/Video phones for FREE. KEEP UR SAME NUMBER, Get extra free mins/texts. Text YES for a call\n'

In [8]:
# Inspect the available columns in the dataset
tokenized_dataset["train"]

Dataset({
    features: ['sms', 'label', 'input_ids', 'attention_mask'],
    num_rows: 4459
})

In [9]:
# Load the pre-trained Hugging Face model (GPT-2) with a 2-class classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "gpt2",
    num_labels = 2,
    id2label = {0: "not spam", 1: "spam"},
    label2id = {"not spam": 0, "spam": 1}
    )

# Set the model pad token id to match the tokenizer pad token id
model.config.pad_token_id = tokenizer.pad_token_id

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
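
The pad token id is set on the model config because the GPT-2 sequence classification head scores one position per sequence: the last token that is not padding. A simplified pure-Python sketch of that selection follows. `last_non_pad_index` is a hypothetical helper rather than the library's code, and it assumes right-padded inputs with made-up token ids.

```python
PAD_ID = 50256  # GPT-2's EOS token id, reused as the pad token in this notebook


def last_non_pad_index(input_ids, pad_id=PAD_ID):
    """Return the index of the last token that is not padding, or -1 if none."""
    for i in range(len(input_ids) - 1, -1, -1):
        if input_ids[i] != pad_id:
            return i
    return -1


batch = [
    [10, 11, 12, 13],
    [20, 21, PAD_ID, PAD_ID],
    [30, PAD_ID, PAD_ID, PAD_ID],
]
positions = [last_non_pad_index(ids) for ids in batch]
print(positions)  # [3, 1, 0]
```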


In [10]:
# Freeze all the parameters of the base model (gpt2)
# so that only the classification head is updated during training
for param in model.base_model.parameters():
    param.requires_grad = False

In [11]:
print(model)

GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=2, bias=False)
)


### Model performance with the base model frozen

In [12]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}
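
To make the metric concrete, here is a small usage sketch of the same `compute_metrics` function on toy data. The logits and labels are invented for illustration and do not come from the dataset.

```python
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}


# Four made-up examples: one row of logits per example, one column per class
logits = np.array([[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.9]])
labels = np.array([0, 1, 1, 1])

# argmax picks classes [0, 1, 0, 1], so 3 of the 4 predictions match the labels
metrics = compute_metrics((logits, labels))
print(metrics)
```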

In [13]:
training_args = TrainingArguments(
    output_dir="./model_output",
    learning_rate = 2e-5, # Learning rate for the optimizer.
    per_device_train_batch_size = 16, # set the per device train batch size and eval batch size
    per_device_eval_batch_size = 16, 
    # Evaluate and save the model after each epoch
    evaluation_strategy="epoch", 
    save_strategy="epoch", 
    num_train_epochs = 3, 
    weight_decay = 0.01, 
    load_best_model_at_end=True, 
)

In [14]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = tokenized_dataset["train"], 
    eval_dataset = tokenized_dataset["test"], 
    tokenizer = tokenizer, 
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics = compute_metrics, 
)

In [15]:
trainer.train()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,1.02375,0.565022
2,1.271300,0.740054,0.804484
3,1.271300,0.667752,0.849327


Checkpoint destination directory ./model_output/checkpoint-279 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./model_output/checkpoint-558 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./model_output/checkpoint-837 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=837, training_loss=1.0634708199449765, metrics={'train_runtime': 78.3513, 'train_samples_per_second': 170.731, 'train_steps_per_second': 10.683, 'total_flos': 416955103543296.0, 'train_loss': 1.0634708199449765, 'epoch': 3.0})

In [16]:
# Evaluate
evaluation_results = trainer.evaluate()
print("Evaluation Results:", evaluation_results)

Evaluation Results: {'eval_loss': 0.6677523851394653, 'eval_accuracy': 0.8493273542600897, 'eval_runtime': 4.5795, 'eval_samples_per_second': 243.475, 'eval_steps_per_second': 15.285, 'epoch': 3.0}


The classification accuracy of the baseline model is 85%. This is a fairly good result given that the GPT-2 base weights are left untouched and only the classification head is trained. Let's see whether fine-tuning can improve the accuracy.

## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [17]:
model_ft = AutoModelForSequenceClassification.from_pretrained(
    "gpt2",
    num_labels = 2,
    id2label = {0: "not spam", 1: "spam"},
    label2id = {"not spam": 0, "spam": 1}
    )

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [18]:
# Set the model's pad token id to match the tokenizer's pad token id
model_ft.config.pad_token_id = tokenizer.pad_token_id

In [19]:
# Create a PEFT Config for LoRA
config = LoraConfig(r = 8, 
                    lora_alpha = 32,
                    target_modules = ['c_attn', 'c_proj'],
                    lora_dropout = 0.1,
                    bias = "none",
                    task_type=TaskType.SEQ_CLS
                )

peft_model = get_peft_model(model_ft, config)
peft_model.print_trainable_parameters()



trainable params: 814,080 || all params: 125,253,888 || trainable%: 0.6499438963523432
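
The trainable parameter count above can be roughly cross-checked by hand. The sketch below assumes the standard GPT-2 small shapes (hidden size 768, 12 blocks, `c_attn` mapping 768 to 2304, the attention `c_proj` mapping 768 to 768, and the MLP `c_proj` mapping 3072 to 768) and assumes that `target_modules = ['c_attn', 'c_proj']` matches both `c_proj` layers in each block. `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(in_features, out_features, r):
    """Parameters in one LoRA pair: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


hidden, r, n_blocks = 768, 8, 12  # assumed GPT-2 small config, r from LoraConfig

per_block = (
    lora_param_count(hidden, 3 * hidden, r)    # attn.c_attn
    + lora_param_count(hidden, hidden, r)      # attn.c_proj
    + lora_param_count(4 * hidden, hidden, r)  # mlp.c_proj
)
lora_total = n_blocks * per_block
print(lora_total)  # 811008

reported_trainable = 814_080  # from print_trainable_parameters() above
remainder = reported_trainable - lora_total
print(remainder)  # 3072
```

The remainder of 3,072 is consistent with the 768 x 2 `score` head being counted twice (the original plus a trainable copy kept by PEFT), though that is an inference from the numbers rather than something the output states.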


In [20]:
trainer_ft = Trainer(
                model = peft_model, 
                args = TrainingArguments(
                    output_dir = "./lora_model_output",
                    learning_rate = 2e-5,
                    per_device_train_batch_size = 32,
                    per_device_eval_batch_size = 32,
                    num_train_epochs = 3,
                    weight_decay = 0.01,
                    evaluation_strategy = "epoch",
                    save_strategy = "epoch",
                    load_best_model_at_end = True,
                    logging_dir='./logs',   
    ),
                train_dataset = tokenized_dataset["train"],
                eval_dataset = tokenized_dataset["test"],
                tokenizer = tokenizer,
                data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True, max_length=512),
                compute_metrics = compute_metrics,
)

In [21]:
trainer_ft.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.346394,0.870852
2,No log,0.238266,0.893274
3,No log,0.204363,0.910314


Checkpoint destination directory ./lora_model_output/checkpoint-140 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./lora_model_output/checkpoint-280 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./lora_model_output/checkpoint-420 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=420, training_loss=0.3604491824195499, metrics={'train_runtime': 167.8653, 'train_samples_per_second': 79.689, 'train_steps_per_second': 2.502, 'total_flos': 496984766330880.0, 'train_loss': 0.3604491824195499, 'epoch': 3.0})
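
The checkpoint numbers and the "No log" entries in both training tables follow from simple step arithmetic. The sketch below assumes a single device, no gradient accumulation, and the `Trainer` default of logging the training loss every 500 steps, since neither `TrainingArguments` sets `logging_steps`. `total_steps` is a hypothetical helper.

```python
import math


def total_steps(num_rows, batch_size, epochs):
    """Return (optimizer steps per epoch, total optimizer steps)."""
    steps_per_epoch = math.ceil(num_rows / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


train_rows = 4459  # size of the train split shown earlier
baseline = total_steps(train_rows, batch_size=16, epochs=3)
lora = total_steps(train_rows, batch_size=32, epochs=3)
print(baseline)  # (279, 837)
print(lora)      # (140, 420)

LOGGING_STEPS = 500  # assumed Trainer default
# The baseline run logs once (step 500 falls in epoch 2);
# the LoRA run ends at step 420, before the first log.
print(baseline[1] // LOGGING_STEPS, lora[1] // LOGGING_STEPS)  # 1 0
```

This matches the checkpoint directories (279/558/837 and 140/280/420) and explains why the LoRA run never reports a training loss. Setting a smaller `logging_steps` would make the training loss visible in every epoch.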

In [22]:

# Evaluate
evaluation_results_peft = trainer_ft.evaluate()
print("Evaluation Results:", evaluation_results_peft)



Evaluation Results: {'eval_loss': 0.20436277985572815, 'eval_accuracy': 0.9103139013452914, 'eval_runtime': 5.9964, 'eval_samples_per_second': 185.945, 'eval_steps_per_second': 5.837, 'epoch': 3.0}


The fine-tuned model reaches 91% accuracy, which is higher than the frozen-base baseline (85%).
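
To put the two accuracies in more concrete terms, the short sketch below converts them into misclassified message counts on the 1,115-row test split. The accuracy values are copied from the evaluation outputs above; the variable names are illustrative.

```python
test_rows = 1115
baseline_acc = 0.8493273542600897  # frozen-base evaluation result
lora_acc = 0.9103139013452914      # LoRA evaluation result

# Misclassified messages = total rows minus correctly classified rows
baseline_errors = test_rows - round(baseline_acc * test_rows)
lora_errors = test_rows - round(lora_acc * test_rows)
relative_reduction = (baseline_errors - lora_errors) / baseline_errors

print(baseline_errors, lora_errors)  # 168 100
print(f"{relative_reduction:.1%}")   # 40.5%
```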

In [23]:
# Save the fine-tuned PEFT model (adapter weights)
peft_model.save_pretrained("gpt-lora")

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [24]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
lora_model = AutoPeftModelForSequenceClassification.from_pretrained("gpt-lora", num_labels=2, ignore_mismatched_sizes=True).to(device)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [25]:
device

device(type='cuda')

In [26]:
# Set the model's pad token id to match the tokenizer's pad token id
lora_model.config.pad_token_id = tokenizer.pad_token_id

In [27]:
training_args = TrainingArguments(
    output_dir="./results/inference_model",
    learning_rate=2e-5, 
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1, 
    weight_decay=0.01, 
    evaluation_strategy="epoch", 
    save_strategy="epoch",
    load_best_model_at_end=True, 
)

finetuned_trainer = Trainer(
    model=lora_model, 
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"], 
    tokenizer=tokenizer, 
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics, 
)

In [28]:
# Evaluate the fine-tuned model on the test set
finetuned_results = finetuned_trainer.evaluate()
print("Evaluation results for the fine-tuned model:", finetuned_results)

Evaluation results for the fine-tuned model: {'eval_loss': 0.20436280965805054, 'eval_accuracy': 0.9103139013452914, 'eval_runtime': 5.0938, 'eval_samples_per_second': 218.893, 'eval_steps_per_second': 13.742}


In [29]:
# Select a few records from the test dataset for manual review
items_for_manual_review = tokenized_dataset["test"].select(
[0, 1, 22, 31, 43, 57, 93])

results = finetuned_trainer.predict(items_for_manual_review)

df = pd.DataFrame(
    {
        "sms": [item["sms"] for item in items_for_manual_review],
        "predictions": results.predictions.argmax(axis=1),
        "labels": results.label_ids
    }
)

# show the result
pd.set_option("display.max_colwidth", None)
df

,sms,predictions,labels
0,Yup... Hey then one day on fri we can ask miwa and jiayin take leave go karaoke \n,0,0
1,Happy new years melody!\n,0,0
2,PRIVATE! Your 2003 Account Statement for shows 800 un-redeemed S. I. M. points. Call 08715203652 Identifier Code: 42810 Expires 29/10/0\n,1,1
3,URGENT! We are trying to contact U. Todays draw shows that you have won a £800 prize GUARANTEED. Call 09050003091 from land line. Claim C52. Valid 12hrs only\n,1,1
4,I had askd u a question some hours before. Its answer\n,0,0
5,Where are you ? What do you do ? How can you stand to be away from me ? Doesn't your heart ache without me ? Don't you wonder of me ? Don't you crave me ?\n,0,0
6,IMPORTANT MESSAGE. This is a final contact attempt. You have important messages waiting out our customer claims dept. Expires 13/4/04. Call 08717507382 NOW!\n,0,1


Most of the predictions match the labels. Only the final record is misclassified: it is actually spam but was predicted as not spam. Overall, the model distinguishes spam from non-spam messages quite well.
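
The manual review can be summarized with a small tally of the seven rows in the table above. This is a minimal sketch using only the predictions and labels shown there; seven examples are far too few to draw conclusions about the model's recall in general.

```python
# Values copied from the manual review table (1 = spam, 0 = not spam)
predictions = [0, 0, 1, 1, 0, 0, 0]
labels = [0, 0, 1, 1, 0, 0, 1]

pairs = list(zip(predictions, labels))
tp = sum(1 for p, y in pairs if p == 1 and y == 1)  # spam caught
tn = sum(1 for p, y in pairs if p == 0 and y == 0)  # non-spam passed
fp = sum(1 for p, y in pairs if p == 1 and y == 0)  # non-spam flagged
fn = sum(1 for p, y in pairs if p == 0 and y == 1)  # spam missed

accuracy = (tp + tn) / len(pairs)
spam_recall = tp / (tp + fn)
print(tp, tn, fp, fn)  # 2 4 0 1
print(f"{accuracy:.3f} {spam_recall:.3f}")  # 0.857 0.667
```

The single error is a false negative, meaning a spam message that slipped through as not spam.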

Conclusion:
- Parameter-efficient fine-tuning adapts the LLM by training only a small number of parameters while remaining effective. Because the pretrained weights are frozen, training updates only the added LoRA parameters, which saves training resources.
- The fine-tuned model is more accurate than the frozen-base baseline (91% vs 85%).
