# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

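* PEFT technique: LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning
* Model: DistilBERT (`distilbert-base-uncased`) loaded with `AutoModelForSequenceClassification`
* Evaluation approach: cross-entropy loss on the held-out test split, as reported by `Trainer.evaluate()`
* Fine-tuning dataset: `sms_spam` (SMS messages labeled spam / not spam)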
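
For reference, the loss that `Trainer.evaluate()` reports for a single-label classifier such as this one is the mean cross-entropy between the predicted class distribution and the true label. A minimal pure-Python sketch follows; the helper name `cross_entropy` and the example logits are made up for illustration and are not taken from the notebook.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of one example: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Hypothetical logits for (not spam, spam) paired with the true label
batch = [([2.0, -1.0], 0), ([-0.5, 3.0], 1), ([0.0, 0.0], 1)]
mean_loss = sum(cross_entropy(l, y) for l, y in batch) / len(batch)
print(round(mean_loss, 4))
```

A confident, correct prediction contributes a loss near zero, while an uninformative prediction (equal logits) contributes log 2 for two classes.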

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
!pip install -r requirements.txt

Defaulting to user installation because normal site-packages is not writeable


### Instantiating a model and choosing AutoTokenizer

In [1]:
from transformers import AutoModelForSequenceClassification
# Load distilbert-base-uncased with a two-class classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label={0: "not spam", 1: "spam"},
    label2id={"not spam": 0, "spam": 1},
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'pre_classifier.weight', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [2]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

### Loading Dataset 

The `sms_spam` dataset of SMS messages labeled spam / not spam is used.

In [3]:
# Load the sms_spam dataset
# See: https://huggingface.co/datasets/sms_spam

from datasets import load_dataset

# The sms_spam dataset only has a train split, so we use the train_test_split method to split it into train and test
datasets = load_dataset("sms_spam", split="train").train_test_split(
    test_size=0.2, shuffle=True, seed=23
)

splits = ["train", "test"]

# View the dataset characteristics
print(datasets["train"].shape)
print(datasets["test"].shape)

(4459, 2)
(1115, 2)
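
The printed shapes follow from the split arithmetic. As a quick check, assuming the test size is rounded up and the train size rounded down (the convention used by scikit-learn-style splitters), the 5,574 messages with `test_size=0.2` give exactly the sizes above:

```python
import math

n_total = 4459 + 1115          # rows in the original train split
test_size = 0.2

n_test = math.ceil(test_size * n_total)          # 1114.8 rounds up
n_train = math.floor((1 - test_size) * n_total)  # 4459.2 rounds down
print(n_train, n_test)  # 4459 1115
```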


In [4]:
# Show an example from the training dataset
datasets["train"][0]

{'sms': 'Had your mobile 10 mths? Update to the latest Camera/Video phones for FREE. KEEP UR SAME NUMBER, Get extra free mins/texts. Text YES for a call\n',
 'label': 1}

In [5]:
# Tokenize the datasets

tokenized_ds = {}
for split in splits:
    tokenized_ds[split] = datasets[split].map(
        lambda x: tokenizer(x["sms"], truncation=True), batched=True
    )
# Inspect the available columns in the dataset
tokenized_ds["train"]

Dataset({
    features: ['sms', 'label', 'input_ids', 'attention_mask'],
    num_rows: 4459
})
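
The tokenizer is called with `truncation=True` but no padding, so each example keeps its own length; `DataCollatorWithPadding` later pads each batch to its longest sequence. The following is a simplified pure-Python sketch of that idea, not the library's implementation. `pad_batch` is a hypothetical helper and the token ids are invented (0 is used as the pad id, matching `padding_idx=0` in the model printout).

```python
def pad_batch(features, pad_id=0):
    """Pad input_ids and attention_mask to the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch

# Two made-up tokenized messages of different lengths
features = [
    {"input_ids": [101, 2489, 102], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 2489, 3042, 2085, 102], "attention_mask": [1, 1, 1, 1, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"][0])       # [101, 2489, 102, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0]
```

Padding per batch rather than to a global maximum keeps short SMS messages from being processed at the full model length.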

In [6]:
# Unfreeze all the model parameters.
for param in model.parameters():
    param.requires_grad = True
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [7]:
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments


# The HuggingFace Trainer class handles the training and eval loop for PyTorch for us.
# Read more about it here https://huggingface.co/docs/transformers/main_classes/trainer
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./data/spam_not_spam",
        learning_rate = 1e-5,
        per_device_train_batch_size = 4,
        per_device_eval_batch_size = 4,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        num_train_epochs=5,
        weight_decay=0.01,
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)

trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss
1,0.0459,0.054434
2,0.0256,0.06476
3,0.0127,0.069225
4,0.0054,0.069861
5,0.0074,0.065434


Checkpoint destination directory ./data/spam_not_spam/checkpoint-1115 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./data/spam_not_spam/checkpoint-2230 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./data/spam_not_spam/checkpoint-3345 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./data/spam_not_spam/checkpoint-4460 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./data/spam_not_spam/checkpoint-5575 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=5575, training_loss=0.025967844131114505, metrics={'train_runtime': 330.0509, 'train_samples_per_second': 67.55, 'train_steps_per_second': 16.891, 'total_flos': 252362813258304.0, 'train_loss': 0.025967844131114505, 'epoch': 5.0})
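
The checkpoint numbers and `global_step` above can be reproduced from the dataset size and batch size. This sketch assumes a single device and no gradient accumulation, which is consistent with the logged values:

```python
import math

n_train = 4459       # training rows
batch_size = 4       # per_device_train_batch_size
epochs = 5           # num_train_epochs

steps_per_epoch = math.ceil(n_train / batch_size)
checkpoints = [steps_per_epoch * e for e in range(1, epochs + 1)]
print(steps_per_epoch)  # 1115
print(checkpoints)      # [1115, 2230, 3345, 4460, 5575]
```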

In [8]:
# Evaluate the model
results = trainer.evaluate()
print(results)

{'eval_loss': 0.05443378537893295, 'eval_runtime': 2.6397, 'eval_samples_per_second': 422.394, 'eval_steps_per_second': 105.693, 'epoch': 5.0}


## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [12]:
from peft import get_peft_model, LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # Task type: Sequence Classification
    r=8,                         # Low-rank dimension (rank of the matrix)
    lora_alpha=32,               # Scaling factor
    lora_dropout=0.1,            # Dropout rate
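    target_modules=["q_lin", "v_lin", "k_lin"],  # Apply low-rank matrices to the attention query, value, and key projections
)
# Apply LoRA to the model. Note: if `model` is still the instance that `trainer`
# fine-tuned above, the adapters are added on top of those fine-tuned weights.
LoRA_model1 = get_peft_model(model, lora_config)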

The model does not appear to perform better than before, so the next step is to try a larger alpha and no LoRA dropout.
The adapted weight matrix in a LoRA configuration is given by:

$$
W_{\text{adapted}} = W_0 + \frac{\alpha}{r} \Delta W
$$

Where:
- $W_0$ is the original (frozen) weight matrix.
- $\Delta W = A \times B$ is the low-rank update, where $A$ and $B$ are trainable matrices with inner dimension $r$, so $\Delta W$ has rank at most $r$.
- $\alpha$ is the scaling factor (LoRA alpha); with `lora_alpha=32` and `r=8`, the update is scaled by $\alpha / r = 4$.
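
To make the formula concrete, here is a small pure-Python sketch of the adapted weight for a toy 3×3 layer with rank r = 1. The matrices and the helper `matmul` are illustrative only; the PEFT library stores and applies the factors differently in practice (for example, it keeps them separate from the original weights during training).

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

r, alpha = 1, 4                      # toy values; the notebook uses r=8, lora_alpha=32
W0 = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0],
      [0.0, 0.0, 1.0]]               # frozen original weights (3 x 3)
A = [[1.0], [2.0], [3.0]]            # 3 x r
B = [[0.5, 0.0, -0.5]]               # r x 3

delta_W = matmul(A, B)               # 3 x 3 update with rank at most r
scale = alpha / r                    # same ratio (4) as lora_alpha=32, r=8
W_adapted = [[w + scale * d for w, d in zip(w_row, d_row)]
             for w_row, d_row in zip(W0, delta_W)]

for row in W_adapted:
    print(row)
```

Only `A` and `B` would be trained here (6 numbers), while the 9 entries of `W0` stay fixed; the saving grows quickly with the layer size.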


In [13]:
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments


# Define training arguments
training_args = TrainingArguments(
    output_dir="./data/spam_not_spam/LORA",
    learning_rate=1e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    weight_decay=0.01,
    load_best_model_at_end=True,
    logging_dir="./logs"
)

# Initialize the Trainer
trainer2 = Trainer(
    model=LoRA_model1,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)

# Train the model
trainer2.train()

Epoch,Training Loss,Validation Loss
1,0.0215,0.052928
2,0.0183,0.051918
3,0.0179,0.051375
4,0.021,0.050917
5,0.0251,0.050874


Checkpoint destination directory ./data/spam_not_spam/LORA/checkpoint-1115 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./data/spam_not_spam/LORA/checkpoint-2230 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./data/spam_not_spam/LORA/checkpoint-3345 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./data/spam_not_spam/LORA/checkpoint-4460 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./data/spam_not_spam/LORA/checkpoint-5575 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=5575, training_loss=0.023010345737495764, metrics={'train_runtime': 228.3412, 'train_samples_per_second': 97.639, 'train_steps_per_second': 24.415, 'total_flos': 257122691150976.0, 'train_loss': 0.023010345737495764, 'epoch': 5.0})

In [15]:
# Evaluate the model
results_2 = trainer2.evaluate()
print(results_2)

{'eval_loss': 0.050874266773462296, 'eval_runtime': 3.8094, 'eval_samples_per_second': 292.7, 'eval_steps_per_second': 73.241, 'epoch': 5.0}


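With full fine-tuning, the model shows signs of overfitting: training loss fell from 0.0459 to below 0.01, while validation loss rose from 0.054434 after epoch 1 to between 0.065 and 0.070 in the later epochs. Because `load_best_model_at_end=True`, the reported evaluation loss is that of the epoch-1 checkpoint.

With LoRA fine-tuning, this overfitting pattern did not appear: validation loss decreased slightly every epoch, from 0.052928 to 0.050874. The LoRA dropout (`lora_dropout=0.1`) may also have helped keep the validation loss down.

Evaluation loss with LoRA: 0.050874
Evaluation loss with full-model fine-tuning: 0.05443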
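
The comparison can be read directly off the two loss tables. The sketch below copies the validation losses logged above and picks the lowest one per run, which is the checkpoint that `load_best_model_at_end=True` restores under its default loss-based selection. The helper `best_epoch` is introduced here for illustration.

```python
full_ft_val = [0.054434, 0.06476, 0.069225, 0.069861, 0.065434]
lora_val = [0.052928, 0.051918, 0.051375, 0.050917, 0.050874]

def best_epoch(val_losses):
    """Return (1-based epoch, loss) with the lowest validation loss."""
    idx = min(range(len(val_losses)), key=val_losses.__getitem__)
    return idx + 1, val_losses[idx]

print(best_epoch(full_ft_val))  # (1, 0.054434)
print(best_epoch(lora_val))     # (5, 0.050874)
```

Both values agree, up to rounding, with the `eval_loss` printed by the two `evaluate()` calls.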

In [16]:
# Print the number of trainable (LoRA) parameters
LoRA_model1.print_trainable_parameters()

trainable params: 221,184 || all params: 67,768,324 || trainable%: 0.32638257366376655
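
The trainable-parameter count can be reproduced by hand. Each adapted 768×768 projection gets two LoRA factors of shapes 768×8 and 8×768, and there are three target modules in each of the six transformer blocks:

```python
hidden = 768          # in_features = out_features of q_lin / k_lin / v_lin
r = 8                 # LoRA rank
n_layers = 6          # transformer blocks in DistilBERT
n_targets = 3         # q_lin, k_lin, v_lin

per_module = hidden * r + r * hidden
lora_params = n_layers * n_targets * per_module
print(lora_params)                       # 221184

all_params = 67_768_324                  # total reported above
print(100 * lora_params / all_params)    # roughly 0.326 (percent)
```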


In [17]:
# Save the LoRA adapter weights
LoRA_model1.save_pretrained("gpt-lora")

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [18]:
# Load the saved LoRA adapter on top of its base model
from peft import AutoPeftModelForSequenceClassification
LoRA_model1_uploaded = AutoPeftModelForSequenceClassification.from_pretrained("gpt-lora")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'pre_classifier.weight', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [19]:
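# Note: the loaded model is assigned to `trainer`, but the evaluation below runs
# `trainer2`, which still holds the in-memory `LoRA_model1`. The printed loss is
# therefore that of the in-memory model, not of the weights loaded from disk.
trainer.model = LoRA_model1_uploaded
# Run evaluation
results = trainer2.evaluate()

# Print results
print(results)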

{'eval_loss': 0.050874266773462296, 'eval_runtime': 4.6466, 'eval_samples_per_second': 239.96, 'eval_steps_per_second': 60.044, 'epoch': 5.0}
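
The cell above calls `trainer2.evaluate()`, and `trainer2` still wraps the in-memory `LoRA_model1`, so the loss is identical to the earlier value. One way to evaluate the weights actually loaded from disk is to build a fresh `Trainer` around the loaded model. The sketch below is an assumption-laden outline rather than a tested part of this notebook: `evaluate_saved_adapter` and its default `output_dir` are hypothetical, and the `tokenizer=` argument mirrors the usage in this notebook, which may differ in newer releases of transformers.

```python
def evaluate_saved_adapter(adapter_dir, eval_dataset, tokenizer,
                           output_dir="./data/spam_not_spam/eval_loaded"):
    """Load a saved PEFT adapter and return its evaluation metrics."""
    # Imports are local so the function can be defined without the libraries loaded.
    from peft import AutoPeftModelForSequenceClassification
    from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

    loaded_model = AutoPeftModelForSequenceClassification.from_pretrained(adapter_dir)
    eval_trainer = Trainer(
        model=loaded_model,
        args=TrainingArguments(output_dir=output_dir, per_device_eval_batch_size=4),
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    )
    return eval_trainer.evaluate()

# Example call with the objects defined earlier in this notebook:
# results_loaded = evaluate_saved_adapter("gpt-lora", tokenized_ds["test"], tokenizer)
```

Because the loader rebuilds the base model from the `distilbert-base-uncased` checkpoint (see the warning printed when loading), the loss obtained this way may differ from the in-memory result.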
