# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

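* PEFT technique: LoRA - I chose this technique because Hugging Face PEFT supports it for a wide range of base models, including DistilBERT, and it trains only a small set of added weights instead of changing all parameters.
* Model: DistilBERT - I chose DistilBERT because it is a lightweight model that trains and runs quickly, which suits a small fine-tuning experiment.
* Evaluation approach: For evaluation, I compared the test-set accuracy of the LoRA fine-tuned model to that of a fully fine-tuned baseline trained from the same checkpoint.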
* Fine-tuning dataset: https://huggingface.co/datasets/scholl99/spam_email_v0
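
As a rough illustration of why LoRA trains so few parameters, the sketch below applies a low-rank update to a tiny weight matrix in plain Python. The matrix sizes, the helper names (`matmul`, `lora_effective_weight`), and the values are made up for illustration; the PEFT library applies the same idea internally to the model's real layers.

```python
# Minimal LoRA sketch (illustrative only, not the PEFT implementation).
# A frozen weight W (d x d) is adapted as W + (alpha / r) * B @ A,
# where A is (r x d) and B is (d x r). Only A and B are trained.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def lora_effective_weight(w, a, b, alpha, r):
    """Return W + (alpha / r) * B @ A as a new list of rows."""
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

d, r, alpha = 8, 2, 4  # hypothetical sizes; the notebook uses r=8, lora_alpha=16

w = [[float(i == j) for j in range(d)] for i in range(d)]  # frozen 8x8 identity
a = [[0.1] * d for _ in range(r)]                          # trainable, r x d
b = [[0.0] * r for _ in range(d)]                          # trainable, d x r, zero init

# With B initialised to zero, the adapted layer starts out identical to the base layer.
w_start = lora_effective_weight(w, a, b, alpha, r)

full_params = d * d          # parameters updated by full fine-tuning of this layer
lora_params = r * d + d * r  # parameters updated by LoRA
print(full_params, lora_params)
```

The gap between the two counts widens quickly as the layer grows, since the full count scales with `d * d` while the LoRA count scales with `2 * r * d`.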

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
!pip install -U datasets
!pip install peft transformers
!pip install "transformers[torch]"
!pip install evaluate
!pip install scikit-learn
!pip install -U "accelerate>=0.21.0"
!pip install torch



Defaulting to user installation because normal site-packages is not writeable
Collecting datasets
  Downloading datasets-2.20.0-py3-none-any.whl (547 kB)
Collecting requests>=2.32.2
  Downloading requests-2.32.3-py3-none-any.whl (64 kB)
Collecting tqdm>=4.66.3
  Downloading tqdm-4.66.4-py3-none-any.whl (78 kB)
Installing collected packages: tqdm, requests, datasets
  Attempting uninstall: tqdm
    Found existing installation: tqdm 4.66.2
    Uninstalling tqdm-4.66.2:
      Successfully uninstalled tqdm-4.66.2
  Attempting uninstall: requests
    Found existing installation: requests 2.31.0
    Uninstal

Defaulting to user installation because normal site-packages is not writeable
Collecting evaluate
  Downloading evaluate-0.4.2-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.2
Defaulting to user installation because normal site-packages is not writeable
Collecting scikit-learn
  Downloading scikit_learn-1.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.4 MB)
Collecting threadpoolctl>=3.1.0
  Downloading threadpoolctl-3.5.0-py3-none-any.whl (18 kB)
Collecting joblib>=1.2.0
  Downloading joblib-1.4.2-py3-none-any.whl (301 kB)

Installing collected packages: accelerate
Successfully installed accelerate-0.32.1
Defaulting to user installation because normal site-packages is not writeable


In [2]:
# Import Packages and modules
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load Spam Email Dataset
# URL: https://huggingface.co/datasets/scholl99/spam_email_v0

from datasets import load_dataset

## RUBRIC ITEM: Load a Dataset
dataset = load_dataset("scholl99/spam_email_v0", split="train")

# Split dataset into training and testing datasets
# Use seed as a repeatable constant for Shuffle
# Use 25% of data for tests
# Attribution: https://huggingface.co/docs/datasets/en/process
train_test_split = dataset.train_test_split(test_size=0.25, shuffle=True, seed=17)

# Gain Access to Split Datasets
train_set = train_test_split["train"]
test_set = train_test_split["test"]

# View Sizes of Training Data
print(f"Train set size: {len(train_set)}")
print(f"Test set size: {len(test_set)}")


## RUBRIC ITEM: Load Pre-Trained Model
# Convert Model Predictions to Strings / Convert Labels to Integers
# Attribution: Udacity "Adapting Foundation Models"
model_name = 'distilbert-base-uncased'
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    # Map class ids to readable labels (0 = regular mail, 1 = spam)
    num_labels=2,
    id2label={0: "MAIL", 1: "SPAM"},
    label2id={"MAIL": 0, "SPAM": 1},
)

## RUBRIC ITEM: Pre-Process a Dataset
# Load Tokenizer from Pre-trained distilbert model
# Attribution: Udacity "Apply Lightweight Fine-Tuning"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenizer_function(examples):
    # Pad or truncate every email to the model's maximum input length
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Tokenize Train and Test Datasets
# Attribution: Udacity "Apply Lightweight Fine-Tuning"
tokenized_train = train_set.map(tokenizer_function, batched=True)
tokenized_test = test_set.map(tokenizer_function, batched=True)

Downloading readme:   0%|          | 0.00/618 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.03M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/496k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.47M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2291 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/573 [00:00<?, ? examples/s]

Generating rag split:   0%|          | 0/2864 [00:00<?, ? examples/s]

Train set size: 1718
Test set size: 573


config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Map:   0%|          | 0/1718 [00:00<?, ? examples/s]

Map:   0%|          | 0/573 [00:00<?, ? examples/s]
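
The split sizes printed above follow directly from the `test_size=0.25` fraction. The short check below reproduces them, assuming the test size is rounded up and the remainder goes to training. That rounding rule is an assumption; it is consistent with the 1718/573 counts in the output.

```python
import math

total_examples = 2291  # size of the "train" split reported by the loader above
test_fraction = 0.25   # same value passed to train_test_split

# Assumption: the test size is rounded up and the train set gets the remainder.
n_test = math.ceil(total_examples * test_fraction)
n_train = total_examples - n_test

print(f"Train set size: {n_train}")
print(f"Test set size: {n_test}")
```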

In [4]:
# View model
print("Model:\n")
print(model)

print("\n--------------------------------------------------\n")

# View Dataset
print(dataset)

print("\n--------------------------------------------------\n")

# View Training Datasets
print("Training Dataset:\n")
print(tokenized_train)

print("\n--------------------------------------------------\n")

print("Testing Dataset:\n")
# View Test Datasets
print(tokenized_test)

Model:

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=

In [5]:
# Select Range from Dataset
select_data = train_set.select(range(150))

# Print text from dataset range
print("First 150 emails: \n")
for entry in select_data:
    # Limit to first 150 characters of text
    text = entry["text"][:150]
    label = entry["label"]
    print(f"Text: {text}, Label: {label}")

print("\n--------------------------------------------------\n")

# View Spam Emails
count = 0
print("All spam emails: \n")
for spam in train_set:
    text = spam["text"][:150]
    label = spam["label"]
    if label == 1:
        print(f'SPAM: {text}')
        count += 1
print(f"\nNumber of Spam Emails: {count}")

First 150 emails: 

Text: Subject: request submitted : access request for  praveen . mellacheruvu @ enron . com  you have received this email because you are listed as an alter, Label: 0
Text: Subject: website : data _ research _ pub  please approve or reject this request .  thank you ,  information risk management ( et )  - - - - - - - - - , Label: 0
Text: Subject: alliance info alert  dear generation / power marketing executive :  the following is this week ' s alliance express newsletter , and a specia, Label: 0
Text: Subject: speech by chairman pat wood of puct - ctaee meeting - nov . 29 , 2000  dear colleague :  we are honored to have chairman pat wood of the publ, Label: 0
Text: Subject: re : greetings from garp  frank ,  looks good .  vince  enron north america corp .  from : frank hayden @ enron 12 / 12 / 2000 11 : 31 am  to, Label: 0
Text: Subject: returned mail : see transcript for details  the original message was received at tue , 19 jul 2005 12 : 57 : 50 + 0200  from [ 218

SPAM: Subject: all graphics software available , cheap oem versions .  good morning ,  we we offer latest oem packages of all graphics and publishinq softwa
SPAM: Subject: 75 % reduction in road accidents  august , 2002  dear sir / madam ,  in case you have received this mail earlier , kindly ignore this mail . 
SPAM: Subject: minimize your phone expenses  unlimited web conferencing  subscribe to the web conference center for only $ 40 . 00 per month !  ( connects u
SPAM: Subject: localized software , all languages available .  hello , we would like to offer localized software versions ( german , french , spanish , uk ,
SPAM: Subject: life - time upgrades for freeq 4 ili 6 p 8  below is the result of your feedback form . it was submitted by  ( blowdamovie @ atlas . cz ) on 
SPAM: Subject: sell advertising space on your website  did you know that selling advertising on your website is a great way to earn extra revenues with  abs
SPAM: Subject: kit torre empilhadeira savi  santos , junho

In [6]:
from transformers import TrainingArguments, Trainer, DataCollatorWithPadding
import evaluate
import torch
import numpy as np

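## RUBRIC ITEM: Evaluate the model
# Load the accuracy metric from the evaluate library
# (compute_metrics below computes the same quantity directly with NumPy)
eval_model = evaluate.load("accuracy")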

# Compute accuracy of model
# Attribution: https://huggingface.co/docs/transformers/main_classes/trainer
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

# Configure training arguments for the fully fine-tuned baseline
# Attribution: https://huggingface.co/docs/peft/en/quicktour
# Attribution: https://towardsdatascience.com/https-medium-com-dashingaditya-rakhecha-understanding-learning-rate-dd5da26bb6de
trainer_model = TrainingArguments(
    # Write checkpoints and logs to this directory
    output_dir="./data/results",
    # Overwrite saved data each time code runs
    overwrite_output_dir=True,
    # Step size for weight updates. Note: 2e-3 is large for full fine-tuning
    # (values near 2e-5 are more typical) and can destabilize training
    learning_rate=2e-3,
    # Number of examples per batch on each device
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    # Evaluate on the test set at the end of each epoch
    evaluation_strategy="epoch",
    # Save a checkpoint at the end of each epoch
    save_strategy="epoch",
    num_train_epochs=2,
    weight_decay=0.01,
)

# Configure training arguments for the PEFT (LoRA) model
peft_trainer_model = TrainingArguments(
    # Write checkpoints and logs to this directory
    output_dir="./data/results_peft",
    # Overwrite saved data each time code runs
    overwrite_output_dir=True,
    # Step size for weight updates. Note: 2e-6 is very small for LoRA
    # (values around 1e-4 to 1e-3 are often used), so the adapters change slowly
    learning_rate=2e-6,
    # Number of examples per batch on each device
    per_device_train_batch_size=12,
    per_device_eval_batch_size=12,
    # Evaluate on the test set at the end of each epoch
    evaluation_strategy="epoch",
    # Save a checkpoint at the end of each epoch
    save_strategy="epoch",
    # One more epoch than the baseline to allow more training iterations
    num_train_epochs=3,
    weight_decay=0.01,
)


# Alert when done
print("Ready to Configure Training")

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

Ready to Configure Training
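
To see what `compute_metrics` returns, here is a small standalone check on made-up logits. The logits and labels are invented for illustration; in the notebook the `Trainer` supplies them from the evaluation set.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Same logic as the notebook's function: argmax over logits, then mean agreement."""
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

# Hypothetical logits for four emails; columns are [MAIL, SPAM]
toy_logits = np.array([[2.0, -1.0], [0.3, 0.9], [1.5, 0.2], [-0.5, 0.1]])
toy_labels = np.array([0, 1, 1, 1])

# Predicted classes are [0, 1, 0, 1], so three of the four labels match.
toy_result = compute_metrics((toy_logits, toy_labels))
print(toy_result)
```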


In [8]:
# Build a Trainer for the baseline model
# Attribution: https://huggingface.co/docs/peft/en/quicktour
def trainer(model):
    return Trainer(
        model=model,
        args=trainer_model,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_test,
        tokenizer=tokenizer,
        data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
        compute_metrics=compute_metrics,
    )

# Attribution: https://medium.com/@achillesmoraites/lightweight-roberta-sequence-classification-fine-tuning-with-lora-using-the-hugging-face-peft-8dd9edf99d19
# Build a Trainer for the PEFT model
def peft_trainer(model):
    return Trainer(
        model=model,
        args=peft_trainer_model,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_test,
        tokenizer=tokenizer,
        data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
        compute_metrics=compute_metrics,
    )

# Alert when done
print("Ready to Start Training")

Ready to Start Training


In [9]:
# Load a fresh DistilBERT classifier for the baseline (num_labels defaults to 2)
baseline_training = trainer(
    AutoModelForSequenceClassification.from_pretrained(model_name)
)

# Alert when done
print("\nReady to train baseline model!")

baseline_training.train()

# Evaluate Full Fine Tuned Model
baseline_training_eval = baseline_training.evaluate()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



Ready to train baseline model!


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.573667,0.74171
2,No log,0.578633,0.74171


## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.


In [10]:
## RUBRIC ITEM: Create PEFT Model
from peft import LoraConfig, PeftModel, PeftConfig, get_peft_model

model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Attribution: https://towardsdatascience.com/fine-tuning-large-language-models-llms-23473d763b91
# Attribution: https://medium.com/@achillesmoraites/lightweight-roberta-sequence-classification-fine-tuning-with-lora-using-the-hugging-face-peft-8dd9edf99d19
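config = LoraConfig(
    task_type="SEQ_CLS",       # sequence classification
    inference_mode=False,      # the adapters will be trained
    r=8,                       # rank of the low-rank update matrices
    lora_alpha=16,             # scaling factor applied to the update
    lora_dropout=0.1,          # dropout applied inside the LoRA layers
    target_modules=['q_lin'],  # adapt the attention query projections
)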

# Attribution: Udacity "Apply Lightweight Fine-Tuning"
lora_model = get_peft_model(model, config)

# Display Trainable Parameters
lora_model.print_trainable_parameters()

# Alert when done
print("\nReady to train PEFT model!")

peft_training = peft_trainer(lora_model)

## RUBRIC ITEM: Train the PEFT Model
peft_training.train()

## RUBRIC ITEM: Evaluate the PEFT Model
peft_eval = peft_training.evaluate()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 1,257,988 || all params: 67,620,868 || trainable%: 1.8603547058875376

Ready to train PEFT model!


dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.60741,0.74171
2,No log,0.583799,0.74171
3,No log,0.577517,0.74171
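
The adapter's share of the trainable count reported above can be estimated by hand. The arithmetic below assumes the layer shapes shown in the model printout (six blocks, `q_lin` of size 768 x 768) and the `r=8` setting from `LoraConfig`. The rest of the reported trainable parameters presumably come from the classification head layers that PEFT keeps trainable for `SEQ_CLS` tasks, though the exact breakdown depends on the PEFT version.

```python
num_blocks = 6  # "(0-5): 6 x TransformerBlock" in the model printout
hidden = 768    # in_features and out_features of q_lin
rank = 8        # r in LoraConfig

# Each adapted q_lin gains A (rank x hidden) and B (hidden x rank).
lora_params_per_layer = rank * hidden + hidden * rank
lora_params_total = num_blocks * lora_params_per_layer

# Figures reported by print_trainable_parameters() above
reported_trainable = 1_257_988
reported_all = 67_620_868

print(f"LoRA adapter parameters: {lora_params_total:,}")
print(f"Trainable share: {100 * reported_trainable / reported_all:.2f}%")
```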


In [26]:
## RUBRIC ITEM: Save PEFT Model
lora_model.save_pretrained("./data/results_peft")

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.


In [31]:
## RUBRIC ITEM: Load PEFT Model
from peft import AutoPeftModelForSequenceClassification

lora_model = AutoPeftModelForSequenceClassification.from_pretrained("./data/results_peft")


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
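
Once the adapter is reloaded, using it for inference comes down to tokenizing an email, running a forward pass, and mapping the highest-scoring logit to a label. The sketch below covers only that last mapping step on made-up logits, so it runs without downloading a model. The `logits_to_label` helper and the example values are hypothetical, and the label map mirrors the `id2label` used when the model was first loaded.

```python
import math

id2label = {0: "MAIL", 1: "SPAM"}  # same mapping as in the model-loading cell

def logits_to_label(logits):
    """Convert one row of logits into (label, probability) using a softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]

label, prob = logits_to_label([-0.4, 1.3])  # hypothetical logits for one email
print(label, round(prob, 3))
```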


In [24]:
# Calculate and print PEFT Accuracy
peft_accuracy = peft_eval['eval_accuracy'] * 100
print("\nPEFT Accuracy: {:.2f}%\n".format(peft_accuracy))

# Print PEFT Evaluation Details
print("Eval Details: \n")
print(peft_eval)

# Calculate and print Base Model Accuracy
base_model_accuracy = baseline_training_eval['eval_accuracy'] * 100
print("\nBase Model Accuracy: {:.2f}%\n".format(base_model_accuracy))

# Print Base Model Evaluation Details
print("Eval Details: \n")
print(baseline_training_eval)



PEFT Accuracy: 74.17%

Eval Details: 

{'eval_loss': 0.5775173902511597, 'eval_accuracy': 0.7417102966841187, 'eval_runtime': 10.6141, 'eval_samples_per_second': 53.985, 'eval_steps_per_second': 4.522, 'epoch': 3.0}

Base Model Accuracy: 74.17%

Eval Details: 

{'eval_loss': 0.5786334276199341, 'eval_accuracy': 0.7417102966841187, 'eval_runtime': 9.2955, 'eval_samples_per_second': 61.643, 'eval_steps_per_second': 7.746, 'epoch': 2.0}
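
The two accuracies are identical to every digit, which is worth a sanity check. The snippet below shows that 0.7417102966841187 corresponds to 425 correct predictions out of 573, and sketches how to compare a model against always predicting the most common label. The `toy_labels` list is invented for illustration. If the real test set held 425 non-spam emails, both models could be reaching this score by labelling every email as class 0, but the notebook output does not confirm that.

```python
from collections import Counter

test_size = 573
reported_accuracy = 0.7417102966841187  # eval_accuracy for both models above

# Recover the number of correct predictions behind the reported accuracy.
correct = round(reported_accuracy * test_size)
matches = abs(correct / test_size - reported_accuracy) < 1e-12
print(correct, matches)

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

toy_labels = [0, 0, 0, 1]  # hypothetical labels, not the real test set
print(majority_baseline_accuracy(toy_labels))
```

Running `majority_baseline_accuracy` on `test_set["label"]` would show whether either model beats that trivial baseline.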


In [15]:
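# PEFT validation loss after the first and last epochs (from the training log above)
peft_loss_epoch_1 = 0.607410
peft_loss_epoch_3 = 0.577517
validation_loss_drop = peft_loss_epoch_1 - peft_loss_epoch_3

print('Parameter-efficient fine-tuning with LoRA matched the baseline accuracy, \
and its validation loss fell by {:.4f} between epoch 1 and epoch 3, \
while training only 1.86% of the parameters!'.format(validation_loss_drop))

print('\nNow that\'s efficient!')

Parameter-efficient fine-tuning with LoRA matched the baseline accuracy, and its validation loss fell by 0.0299 between epoch 1 and epoch 3, while training only 1.86% of the parameters!

Now that's efficient!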
