# Lightweight Fine-Tuning Project

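In this cell, describe your choices for each of the following

* PEFT technique: LoRA (Low-Rank Adaptation)
* Model: DistilBERT (`distilbert-base-uncased`)
* Evaluation approach: accuracy, computed with scikit-learn's `accuracy_score`
* Fine-tuning dataset: IMDB movie reviews (`imdb`), sampled to 500 training and 500 evaluation examples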

Parameter-Efficient Fine-Tuning (PEFT) techniques adapt pre-trained models to new tasks while updating only a small number of parameters. For this project, I chose LoRA (Low-Rank Adaptation). LoRA freezes the pre-trained weights and trains small low-rank matrices that are added to selected layers, so only a small fraction of the parameters is updated. Its main advantage is that large models can be fine-tuned at significantly lower computational cost and memory usage.
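To make the idea concrete, here is a minimal NumPy sketch of a LoRA update for a single 768 x 768 linear layer. It is an illustration under assumptions, not the PEFT library's implementation: the weight values are random stand-ins, and the names `W`, `A`, `B`, and `lora_forward` are made up for this example. The rank and scaling values match the ones used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, alpha = 768, 8, 32                   # hidden size, LoRA rank, LoRA scaling numerator

W = rng.normal(size=(d, d))                # frozen pre-trained weight (stand-in values)
A = rng.normal(scale=0.01, size=(r, d))    # trainable, small random init
B = np.zeros((d, r))                       # trainable, zero init so the update starts at 0

def lora_forward(x):
    """Apply the frozen weight plus the scaled low-rank update."""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(2, d))
full_params = W.size                       # parameters a full fine-tune would update
lora_params = A.size + B.size              # parameters LoRA trains for this layer

print(full_params, lora_params, lora_params / full_params)
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the frozen one, and training only has to learn the two small matrices.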

For my model, I chose DistilBERT, a distilled version of BERT (Bidirectional Encoder Representations from Transformers). According to its authors, it retains 97% of BERT's language understanding while being 60% faster and 40% smaller. This makes it a good fit when computational resources are limited or when faster inference is required. Despite being smaller, DistilBERT performs close to BERT on various NLP benchmarks, so it should be able to classify IMDB movie reviews by sentiment accurately once fine-tuned.

For my evaluation approach, I am using `accuracy_score` from the scikit-learn library. Accuracy is one of the simplest and most intuitive metrics: it is the proportion of correctly classified instances out of all instances. For binary classification tasks like sentiment analysis (positive vs. negative), it provides a clear measure of performance. However, if the dataset is not balanced, accuracy might not be the best metric to use.
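The caveat about unbalanced data can be illustrated with a small, self-contained sketch. The label counts below are hypothetical and are not taken from IMDB, and the `accuracy` helper is a plain-Python stand-in for the same proportion that `accuracy_score` computes.

```python
def accuracy(labels, preds):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    return correct / len(labels)

# Hypothetical imbalanced test set: 95 negative (0) and 5 positive (1) reviews.
labels = [0] * 95 + [1] * 5
always_negative = [0] * 100                # a classifier that never predicts positive

imbalanced_acc = accuracy(labels, always_negative)

# Recall on the positive class exposes the problem that accuracy hides.
positives_found = sum(1 for y, p in zip(labels, always_negative) if y == 1 and p == 1)
positive_recall = positives_found / labels.count(1)

print(imbalanced_acc, positive_recall)
```

A classifier that ignores its input still scores high accuracy here while finding none of the positive reviews, which is why accuracy is mainly trustworthy on balanced data.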

For my fine-tuning dataset, I am using the IMDB dataset. It contains a large number of movie reviews labeled by sentiment, so the classifier is trained on data that closely resembles its intended application. The dataset is also balanced between positive and negative reviews. Because of computational constraints, I am using a smaller sample of 500 training reviews and 500 evaluation reviews for faster training.
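The full dataset is balanced, but a random subsample is not guaranteed to split exactly evenly, so it can be worth checking. The helper below is a hypothetical sketch that works on any list of labels; the toy list stands in for a real label column such as `train_dataset['label']`.

```python
from collections import Counter

def label_fractions(labels):
    """Return the share of each label value in a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}

# Toy stand-in for a sampled label column (0 = negative, 1 = positive).
sample_labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
fractions = label_fractions(sample_labels)
print(fractions)
```

If the fractions drift far from 0.5 in a real sample, accuracy becomes harder to interpret and a different seed or a stratified sample may be preferable.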

## Loading and Evaluating a Foundation Model

In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
pip install transformers datasets scikit-learn peft

Defaulting to user installation because normal site-packages is not writeable
Collecting scikit-learn
  Downloading scikit_learn-1.5.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.3 MB)
Collecting threadpoolctl>=3.1.0
  Downloading threadpoolctl-3.5.0-py3-none-any.whl (18 kB)
Collecting joblib>=1.2.0
  Downloading joblib-1.4.2-py3-none-any.whl (301 kB)
Installing collected packages: threadpoolctl, joblib, scikit-learn
Successfully installed joblib-1.4.2 scikit-learn-1.5.2 threadpoolctl-3.5.0
Note: you may need to restart the kernel to use updated packages.


In [2]:
# Import all necessary libraries
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
import torch
import numpy as np
from sklearn.metrics import accuracy_score
from peft import PeftModel, get_peft_model, PeftConfig, PeftType, LoraConfig

In [3]:
# Load pre-trained DistilBERT tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [4]:
# Freeze everything except the classification head.
# The substring check keeps both 'pre_classifier' and 'classifier' trainable.
for name, param in model.named_parameters():
    if 'classifier' not in name:  # Freeze only the DistilBERT encoder parameters
        param.requires_grad = False
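One detail of the freezing loop is easy to miss: the test is a substring match, so `pre_classifier` parameters also count as containing `classifier`. The sketch below applies the same condition to the head parameter names from the loading warning above, plus one encoder parameter name inferred from the printed architecture for contrast.

```python
# Head parameter names from the model-loading warning, plus one encoder
# parameter name (inferred from the printed architecture) for contrast.
param_names = [
    "distilbert.transformer.layer.0.attention.q_lin.weight",
    "pre_classifier.weight",
    "pre_classifier.bias",
    "classifier.weight",
    "classifier.bias",
]

# Same condition as the freezing loop: freeze when 'classifier' is NOT in the name.
still_trainable = [name for name in param_names if "classifier" in name]
frozen = [name for name in param_names if "classifier" not in name]

print(still_trainable)
print(frozen)
```

So the loop leaves the whole two-layer classification head trainable, not only the final `classifier` layer.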

In [5]:
print(model)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [6]:
# Check if GPU is available and use it
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)


In [7]:
# Load the imdb dataset
dataset = load_dataset('imdb')

# Shuffle the dataset and select 500 for training and another 500 for evaluation
train_dataset = dataset['train'].shuffle(seed=42).select(range(500))
eval_dataset = dataset['test'].shuffle(seed=42).select(range(500))


In [8]:
# Function to tokenize the examples
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=512)

# Tokenize the dataset
tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_eval_dataset = eval_dataset.map(tokenize_function, batched=True)
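The `padding='max_length'` and `truncation=True` settings mean every review ends up with exactly `max_length` token ids. The following is a simplified sketch of that behavior, not the tokenizer's actual code: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and special tokens are ignored (the real tokenizer keeps its closing special token when it truncates).

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut a sequence to max_length, or right-pad it with pad_id, and build the mask."""
    ids = token_ids[:max_length]
    attention_mask = [1] * len(ids)        # 1 marks real tokens
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, attention_mask + [0] * padding

# Hypothetical token ids; real ids come from the tokenizer's vocabulary.
short_review = [101, 2307, 3185, 102]
long_review = list(range(1, 11))

short_ids, short_mask = pad_or_truncate(short_review, max_length=6)
long_ids, long_mask = pad_or_truncate(long_review, max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding everything to 512 keeps batch shapes uniform at the cost of extra computation on short reviews; the attention mask tells the model which positions to ignore.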


In [9]:
# Define the compute_metrics function to calculate accuracy
def compute_metrics(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc}

# Define TrainingArguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    logging_dir='./logs',
    logging_steps=10,
    num_train_epochs=5,
    learning_rate=5e-5,
    save_steps=500,
    save_total_limit=1,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_eval_dataset,
    compute_metrics=compute_metrics,
)

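# Training is intentionally skipped here so that the evaluation below
# measures the untuned model as a baseline
# trainer.train()

# Save the baseline model weights (no fine-tuning has been applied at this point)
model.save_pretrained('./finetuned_model')

# Evaluate the untuned model
untuned_eval_result = trainer.evaluate()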

print(f"Untuned model evaluation result: {untuned_eval_result}")

Untuned model evaluation result: {'eval_loss': 0.6980359554290771, 'eval_accuracy': 0.492, 'eval_runtime': 8.7028, 'eval_samples_per_second': 57.453, 'eval_steps_per_second': 3.677}
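To help read these numbers, the sketch below shows how logits turn into predictions and accuracy, as in `compute_metrics`, and what loss a completely uninformed two-class model would get. The logits and labels are toy values, and `cross_entropy` is a hand-written helper for illustration, not the Trainer's internal code.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy of integer labels under softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Toy logits for four reviews; columns are the two classes (negative, positive).
logits = np.array([[2.0, -1.0], [0.3, 0.8], [-0.5, 1.5], [1.0, 0.2]])
labels = np.array([0, 1, 0, 0])

preds = np.argmax(logits, axis=1)          # same step as compute_metrics
accuracy = float((preds == labels).mean())

# A model that outputs equal logits for both classes carries no information.
uninformed_loss = cross_entropy(np.zeros((4, 2)), labels)

print(preds, accuracy, uninformed_loss)
```

The uninformed loss equals ln 2, about 0.693, so the reported `eval_loss` of 0.698 and `eval_accuracy` of 0.492 are what one would expect from a randomly initialized classification head on a roughly balanced binary task.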


## Performing Parameter-Efficient Fine-Tuning

In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [10]:
# Load pre-trained DistilBERT tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [11]:
# Apply PEFT using LoRA
peft_config = LoraConfig(
    peft_type=PeftType.LORA, 
    task_type="SEQ_CLS", 
    r=8, 
    lora_alpha=32, 
    lora_dropout=0.1, 
    bias="none",
    target_modules=["attention.q_lin", "attention.v_lin"]
)
peft_model = get_peft_model(model, peft_config)
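As a rough check on how small the trainable part is, the arithmetic below counts the LoRA parameters implied by this configuration and the printed architecture. The head count rests on an assumption: that the classification head (`pre_classifier` and `classifier`) also stays trainable for a sequence-classification task with two labels.

```python
hidden_size = 768        # in_features/out_features of q_lin and v_lin in the printed model
rank = 8                 # r in the LoraConfig
num_layers = 6           # "6 x TransformerBlock" in the printed model
targets_per_layer = 2    # attention.q_lin and attention.v_lin

# Each adapted Linear gains lora_A (rank x hidden) and lora_B (hidden x rank), no bias.
params_per_module = rank * hidden_size + hidden_size * rank
adapter_params = num_layers * targets_per_layer * params_per_module

# Assumption: the classification head (pre_classifier and classifier, with biases)
# also stays trainable for a sequence-classification task with 2 labels.
head_params = (hidden_size * hidden_size + hidden_size) + (hidden_size * 2 + 2)

print(adapter_params, head_params, adapter_params + head_params)
```

Calling `peft_model.print_trainable_parameters()` reports the actual trainable and total counts and is the authoritative check; the figures here are only an estimate.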

In [12]:
# Check if GPU is available and use it
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
peft_model.to(device)

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): DistilBertForSequenceClassification(
      (distilbert): DistilBertModel(
        (embeddings): Embeddings(
          (word_embeddings): Embedding(30522, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (transformer): Transformer(
          (layer): ModuleList(
            (0-5): 6 x TransformerBlock(
              (attention): MultiHeadSelfAttention(
                (dropout): Dropout(p=0.1, inplace=False)
                (q_lin): Linear(
                  in_features=768, out_features=768, bias=True
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.1, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=768, out_features=8, bias=Fals

In [13]:
# Load the imdb dataset
dataset = load_dataset('imdb')

# Shuffle the dataset and select 500 for training and another 500 for evaluation
train_dataset = dataset['train'].shuffle(seed=42).select(range(500))
eval_dataset = dataset['test'].shuffle(seed=42).select(range(500))

In [14]:
# Function to tokenize the examples
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=512)

# Tokenize the dataset
tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_eval_dataset = eval_dataset.map(tokenize_function, batched=True)

In [15]:
# Define the compute_metrics function to calculate accuracy
def compute_metrics(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc}

# Define TrainingArguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    logging_dir='./logs',
    logging_steps=10,
    num_train_epochs=5,
    learning_rate=5e-5,
    save_steps=500,
    save_total_limit=1,
)

# Initialize the Trainer
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_eval_dataset,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

# Save the PEFT model weights
peft_model.save_pretrained('./peft_model')

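| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.6843 | 0.675464 | 0.664 |
| 2 | 0.6548 | 0.650994 | 0.792 |
| 3 | 0.6322 | 0.612234 | 0.81 |
| 4 | 0.5728 | 0.566448 | 0.83 |
| 5 | 0.5227 | 0.546278 | 0.844 |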
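It can help to translate the training settings into optimizer steps. The arithmetic below assumes a single device and no gradient accumulation, neither of which is stated in the notebook.

```python
import math

train_examples = 500     # size of the sampled training set
batch_size = 16          # per_device_train_batch_size
epochs = 5               # num_train_epochs
save_steps = 500         # save_steps

# Assumes one device and no gradient accumulation.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
checkpoint_reached = total_steps >= save_steps

print(steps_per_epoch, total_steps, checkpoint_reached)
```

Under these assumptions, training finishes well before step 500, so with step-based checkpointing no intermediate checkpoint would be written, and the explicit `peft_model.save_pretrained('./peft_model')` call is what persists the adapter.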


## Performing Inference with a PEFT Model

In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [16]:
# Load pre-trained DistilBERT tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [17]:
# Load the saved PEFT model weights
model.load_adapter('./peft_model', 'imdb_adapter')

In [18]:
# Freeze all layers to ensure weights are not changed during evaluation
for param in model.parameters():
    param.requires_grad = False

In [19]:
# Check if GPU is available and use it
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(
              in_features=768, out_features=768, bias=True
              (lora_dropout): ModuleDict(
                (imdb_adapter): Dropout(p=0.1, inplace=False)
              )
              (lora_A): ModuleDict(
                (imdb_adapter): Linear(in_features=768, out_features=8, bias=False)
              )
              (lora_B): ModuleDict(
                (imdb_adapter): Linear(in_features=8, out_features=768, bias=False)
   

In [20]:
# Load the imdb dataset
dataset = load_dataset('imdb')

# Shuffle the dataset and select 500 for evaluation
eval_dataset = dataset['test'].shuffle(seed=42).select(range(500))

In [21]:
# Function to tokenize the examples
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=512)

# Tokenize the dataset
tokenized_eval_dataset = eval_dataset.map(tokenize_function, batched=True)

In [22]:
# Define the compute_metrics function to calculate accuracy
def compute_metrics(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc}

# Define TrainingArguments for evaluation
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    per_device_eval_batch_size=16,
    logging_dir='./logs',
    logging_steps=10,
)

# Initialize the Trainer for evaluation
trainer = Trainer(
    model=model,
    args=training_args,
    eval_dataset=tokenized_eval_dataset,
    compute_metrics=compute_metrics,
)

In [23]:
# Get results for the PEFT model
peft_eval_result = trainer.evaluate()

In [24]:
# Compare results
print(f"Untuned model evaluation result: {untuned_eval_result}")
print(f"PEFT model evaluation result: {peft_eval_result}")

Untuned model evaluation result: {'eval_loss': 0.6980359554290771, 'eval_accuracy': 0.492, 'eval_runtime': 8.7028, 'eval_samples_per_second': 57.453, 'eval_steps_per_second': 3.677}
PEFT model evaluation result: {'eval_loss': 0.5462777018547058, 'eval_accuracy': 0.844, 'eval_runtime': 9.0812, 'eval_samples_per_second': 55.059, 'eval_steps_per_second': 3.524}


Awesome! We were able to significantly improve the accuracy using PEFT.

The untuned model, evaluated without any training, had an accuracy of 49.2%, which is roughly chance level for a balanced binary task.

For the PEFT model, the accuracy was 84.4% after 5 epochs.

This can likely be further improved by adjusting hyperparameters and increasing the number of epochs.
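Because the evaluation set has only 500 reviews, it is worth attaching a rough uncertainty to the accuracies printed in the evaluation output above (0.492 and 0.844). The sketch below uses the normal approximation for a proportion, which is a simplifying assumption; `margin_95` is a hypothetical helper name.

```python
import math

n = 500                                    # evaluation examples
untuned_acc, peft_acc = 0.492, 0.844       # eval_accuracy values printed above

def margin_95(acc, n):
    """Approximate 95% margin of error for an accuracy measured on n examples."""
    return 1.96 * math.sqrt(acc * (1 - acc) / n)

gain_points = (peft_acc - untuned_acc) * 100

print(f"gain: {gain_points:.1f} percentage points")
print(f"untuned: {untuned_acc:.3f} +/- {margin_95(untuned_acc, n):.3f}")
print(f"PEFT:    {peft_acc:.3f} +/- {margin_95(peft_acc, n):.3f}")
```

The margins are a few percentage points each, far smaller than the gap between the two models, so the improvement is unlikely to be an artifact of the small evaluation sample.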