# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

* PEFT technique: LoRA (Low-Rank Adaptation)
* Model: DistilBERT (distilbert-base-uncased), chosen for its small size and its support for sequence classification.
* Evaluation approach: accuracy, precision, recall, and F1 score, computed with sklearn.metrics
* Fine-tuning dataset: Amazon Polarity (fancyzhx/amazon_polarity), loaded with the Hugging Face datasets library

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
# Importing necessary libraries (Step 1)
from datasets import load_dataset
from transformers import DistilBertForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
import torch
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
import matplotlib.pyplot as plt

In [2]:
# Load the dataset (Step 2)
dataset = load_dataset("fancyzhx/amazon_polarity")

In [3]:
# Load the pre-trained model and tokenizer (Step 3)
model_name = "distilbert-base-uncased"
id2label = {0: "negative", 1: "positive"}
label2id = {"negative": 0, "positive": 1}

base_model = DistilBertForSequenceClassification.from_pretrained(
    model_name, num_labels=2, id2label=id2label, label2id=label2id
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [4]:
print(base_model)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [5]:
# Freeze all parameters in the base model (Step 4)
for param in base_model.parameters():
    param.requires_grad = False

In [6]:
# An example data row
print(dataset['train'][2])

{'label': 1, 'title': 'Amazing!', 'content': 'This soundtrack is my favorite music of all time, hands down. The intense sadness of "Prisoners of Fate" (which means all the more if you\'ve played the game) and the hope in "A Distant Promise" and "Girl who Stole the Star" have been an important inspiration to me personally throughout my teen years. The higher energy tracks like "Chrono Cross ~ Time\'s Scar~", "Time of the Dreamwatch", and "Chronomantique" (indefinably remeniscent of Chrono Trigger) are all absolutely superb as well.This soundtrack is amazing music, probably the best of this composer\'s work (I haven\'t heard the Xenogears soundtrack, so I can\'t say for sure), and even if you\'ve never played the game, it would be worth twice the price to buy it.I wish I could give it 6 stars.'}


In [7]:
print("Number of rows in the training dataset:", len(dataset['train']))
print("Number of rows in the test dataset:", len(dataset['test']))

Number of rows in the training dataset: 3600000
Number of rows in the test dataset: 400000


In [8]:
# Preprocess the dataset (Step 5)
def preprocess_function(examples):
    return tokenizer(examples['content'], truncation=True, padding=True, max_length=256)

tokenized_datasets = dataset.map(preprocess_function, batched=True)
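
To make the effect of truncation=True, padding=True, and max_length=256 more concrete, the following toy sketch applies the same two steps to hand-written token ID lists. It is not the tokenizer's actual implementation: the helper name pad_and_truncate and the sample IDs are hypothetical, the pad ID of 0 is taken from padding_idx=0 in the model printout, and a real tokenizer also preserves special tokens when truncating, which this plain slice does not.

```python
def pad_and_truncate(batch, max_length=256, pad_id=0):
    """Toy stand-in for truncation=True, padding=True on one batch of token ID lists."""
    # Truncation: keep at most max_length tokens per sequence.
    truncated = [ids[:max_length] for ids in batch]
    # padding=True pads to the longest sequence in the batch, not to max_length.
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token IDs; max_length=4 is used only to keep the example small.
example = pad_and_truncate([[11, 12, 13], [21, 22, 23, 24, 25]], max_length=4)
print(example["input_ids"])
print(example["attention_mask"])
```

Because dataset.map is called with batched=True, padding=True pads each mapped batch to its own longest row, so padded lengths can differ between batches.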

In [9]:
# Subsample the train and test splits (Step 6)
# Note: because the full dataset is very large, only 100,000 rows are used for training and 50,000 for testing
train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(100000))
test_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(50000))
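
The shuffle(seed=42).select(range(n)) pattern above draws a reproducible random subset. The sketch below illustrates the idea with the standard library only. The helper shuffled_subset is hypothetical, and because the datasets library uses its own random number generator, the indices here will not match the rows actually selected in the notebook.

```python
import random


def shuffled_subset(n_rows, n_keep, seed=42):
    """Return n_keep row indices obtained by shuffling range(n_rows) with a fixed seed."""
    indices = list(range(n_rows))
    # A dedicated Random instance makes the shuffle deterministic for a given seed.
    random.Random(seed).shuffle(indices)
    return indices[:n_keep]


first = shuffled_subset(1000, 10)
second = shuffled_subset(1000, 10)
print(first == second)  # the same seed yields the same subset on every run
```

Fixing the seed means the base model and the fine-tuned model are scored on the same test rows, which keeps the before/after comparison fair.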

In [10]:
# Determine the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


In [11]:
# Move the model to the device selected in the previous cell (Step 7)
base_model.to(device)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [12]:
# Check whether any parameters are still trainable (none should be listed)
print("Trainable Parameters:")
for name, param in base_model.named_parameters():
    if param.requires_grad:
        print(name)

Trainable Parameters:


In [13]:
# Evaluate the foundation model (Step 8)
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = torch.argmax(torch.tensor(predictions), dim=-1).numpy()
    labels = torch.tensor(labels).numpy()
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average="binary")
    accuracy = accuracy_score(labels, predictions)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

training_args = TrainingArguments(
    output_dir="./results_base_v1",
    eval_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_total_limit=1
)

trainer = Trainer(
    model=base_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    processing_class=tokenizer,
    compute_metrics=compute_metrics
)

print("Evaluating the base model...")
base_results = trainer.evaluate()
print("Base Model Results:", base_results)

Evaluating the base model...


  0%|          | 0/3125 [00:00<?, ?it/s]

Base Model Results: {'eval_loss': 0.693297266960144, 'eval_model_preparation_time': 0.001, 'eval_accuracy': 0.5036, 'eval_precision': 0.5036159515118122, 'eval_recall': 0.8727272727272727, 'eval_f1': 0.6386769929540558, 'eval_runtime': 75.1195, 'eval_samples_per_second': 665.607, 'eval_steps_per_second': 41.6}
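
The base results combine roughly 50% accuracy and precision with high recall. A minimal pure-Python sketch of the four metrics shows how that pattern can arise when a classifier labels most inputs as positive. The toy labels and predictions are hypothetical and are not drawn from the dataset; the helper binary_metrics is an illustration of the standard definitions, not the sklearn implementation.

```python
def binary_metrics(labels, predictions):
    """Accuracy, precision, recall, and F1 for the positive class (label 1)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy data: balanced labels, but 8 of 10 predictions are "positive".
toy_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
toy_predictions = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
toy = binary_metrics(toy_labels, toy_predictions)
print(toy)
```

In this toy case accuracy and precision are both 0.5 while recall is 0.8, so high recall alone does not indicate a useful classifier.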


## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [14]:
print(base_model)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [15]:
# Configure LoRA for PEFT (Step 9)
config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_lin", "v_lin"],
    lora_dropout=0.05,
    task_type=TaskType.SEQ_CLS
)
peft_model = get_peft_model(base_model, config)
peft_model.print_trainable_parameters()

trainable params: 1,181,954 || all params: 68,136,964 || trainable%: 1.7347
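
The trainable parameter count reported above can be reproduced by hand. The sketch below assumes that each adapted Linear layer gains two bias-free LoRA matrices, and that with TaskType.SEQ_CLS PEFT keeps trainable copies of the pre_classifier and classifier layers. The second point is an assumption that happens to reproduce the reported number, not something the notebook states.

```python
hidden = 768           # in_features and out_features of q_lin and v_lin in the printout
rank = 32              # r in LoraConfig
lora_alpha = 64        # lora_alpha in LoraConfig
layers = 6             # "6 x TransformerBlock" in the printout
targets_per_layer = 2  # q_lin and v_lin
num_labels = 2

# Each adapted Linear gains A (rank x hidden) and B (hidden x rank), without biases.
lora_per_module = rank * hidden + hidden * rank
lora_total = layers * targets_per_layer * lora_per_module

# Assumption: trainable copies of both head layers (weights plus biases).
pre_classifier = hidden * hidden + hidden
classifier = hidden * num_labels + num_labels
head_total = pre_classifier + classifier

trainable = lora_total + head_total
share = f"{100 * trainable / 68_136_964:.4f}%"  # total taken from the output above
scaling = lora_alpha / rank  # factor applied to the low-rank update

print(lora_total, head_total, trainable, share, scaling)
```

Under these assumptions the LoRA matrices and the classification head each contribute about half of the trainable parameters.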


In [16]:
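# Train the LoRA model (Step 10)
# The Trainer from Step 8 is reused with the PEFT-wrapped model swapped in,
# so train() updates only the parameters that PEFT left trainable.
trainer.model = peft_model
trainer.train()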

  0%|          | 0/6250 [00:00<?, ?it/s]

{'loss': 0.6823, 'grad_norm': 1.0510802268981934, 'learning_rate': 4.992e-05, 'epoch': 0.0}
{'loss': 0.673, 'grad_norm': 1.4424479007720947, 'learning_rate': 4.9840000000000004e-05, 'epoch': 0.0}
{'loss': 0.6633, 'grad_norm': 1.745947241783142, 'learning_rate': 4.976e-05, 'epoch': 0.0}
{'loss': 0.6414, 'grad_norm': 1.2828929424285889, 'learning_rate': 4.9680000000000005e-05, 'epoch': 0.01}
{'loss': 0.6389, 'grad_norm': 2.2771382331848145, 'learning_rate': 4.96e-05, 'epoch': 0.01}
{'loss': 0.5808, 'grad_norm': 1.2292017936706543, 'learning_rate': 4.952e-05, 'epoch': 0.01}
{'loss': 0.5011, 'grad_norm': 1.5200871229171753, 'learning_rate': 4.944e-05, 'epoch': 0.01}
{'loss': 0.3836, 'grad_norm': 2.2602458000183105, 'learning_rate': 4.936e-05, 'epoch': 0.01}
{'loss': 0.3464, 'grad_norm': 2.9825305938720703, 'learning_rate': 4.928e-05, 'epoch': 0.01}
{'loss': 0.3419, 'grad_norm': 3.7818498611450195, 'learning_rate': 4.92e-05, 'epoch': 0.02}
{'loss': 0.3288, 'grad_norm': 1.7554080486297607, '

  0%|          | 0/3125 [00:00<?, ?it/s]

{'eval_loss': 0.21493571996688843, 'eval_model_preparation_time': 0.003, 'eval_accuracy': 0.91912, 'eval_precision': 0.9257195914577531, 'eval_recall': 0.9123135070618659, 'eval_f1': 0.9189676591992947, 'eval_runtime': 87.2417, 'eval_samples_per_second': 573.12, 'eval_steps_per_second': 35.82, 'epoch': 1.0}
{'train_runtime': 419.4052, 'train_samples_per_second': 238.433, 'train_steps_per_second': 14.902, 'train_loss': 0.24306428087234497, 'epoch': 1.0}


TrainOutput(global_step=6250, training_loss=0.24306428087234497, metrics={'train_runtime': 419.4052, 'train_samples_per_second': 238.433, 'train_steps_per_second': 14.902, 'total_flos': 6804918067200000.0, 'train_loss': 0.24306428087234497, 'epoch': 1.0})
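
The step counts, throughput, and logged learning rates above follow from the batch size and subset sizes. The sketch below checks that arithmetic, assuming a single device, no gradient accumulation, and the Trainer defaults of a 5e-5 initial learning rate with linear decay and no warmup (none of these are set explicitly in TrainingArguments above). The helper linear_lr is hypothetical.

```python
import math

train_rows, eval_rows = 100_000, 50_000
batch_size, epochs = 16, 1

# One optimizer step per batch (assumes no gradient accumulation, one device).
train_steps = math.ceil(train_rows / batch_size) * epochs
eval_steps = math.ceil(eval_rows / batch_size)


def linear_lr(step, total_steps, base_lr=5e-5):
    """Linear decay from base_lr to zero with no warmup (assumed defaults)."""
    return base_lr * max(0.0, 1 - step / total_steps)


train_runtime = 419.4052  # seconds, from the training output above
samples_per_second = train_rows * epochs / train_runtime

print(train_steps, eval_steps)
print(linear_lr(10, train_steps), linear_lr(20, train_steps))
print(round(samples_per_second, 3))
```

With logging_steps=10, the first two logged learning rates correspond to steps 10 and 20 of this schedule.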

In [17]:
# Save the trained model (Step 11)
peft_model.save_pretrained("./lora_distilbert_amazon_polarity")

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [18]:
# Load the fine-tuned model for inference (Step 12)
from peft import AutoPeftModelForSequenceClassification
fine_tuned_model = AutoPeftModelForSequenceClassification.from_pretrained(
    "./lora_distilbert_amazon_polarity", config=config
)
fine_tuned_model.config.pad_token_id = tokenizer.pad_token_id
trainer.model = fine_tuned_model
trainer.model.to(device)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): DistilBertForSequenceClassification(
      (distilbert): DistilBertModel(
        (embeddings): Embeddings(
          (word_embeddings): Embedding(30522, 768, padding_idx=0)
          (position_embeddings): Embedding(512, 768)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (transformer): Transformer(
          (layer): ModuleList(
            (0-5): 6 x TransformerBlock(
              (attention): DistilBertSdpaAttention(
                (dropout): Dropout(p=0.1, inplace=False)
                (q_lin): lora.Linear(
                  (base_layer): Linear(in_features=768, out_features=768, bias=True)
                  (lora_dropout): ModuleDict(
                    (default): Dropout(p=0.05, inplace=False)
                  )
                  (lora_A): ModuleDict(
                    (default): Linear(in_features=7

In [19]:
# Evaluate the fine-tuned model (Step 13)
print("Evaluating the fine-tuned model...")
fine_tuned_results = trainer.evaluate()
print("Fine-Tuned Model Results:", fine_tuned_results)

Evaluating the fine-tuned model...


  0%|          | 0/3125 [00:00<?, ?it/s]

Fine-Tuned Model Results: {'eval_loss': 0.21493571996688843, 'eval_model_preparation_time': 0.003, 'eval_accuracy': 0.91912, 'eval_precision': 0.9257195914577531, 'eval_recall': 0.9123135070618659, 'eval_f1': 0.9189676591992947, 'eval_runtime': 86.3589, 'eval_samples_per_second': 578.979, 'eval_steps_per_second': 36.186, 'epoch': 1.0}
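
For single-review inference, the model's two output logits still have to be turned into a label. The sketch below shows that last step in plain Python, using the id2label mapping from Step 3. The logits are hypothetical values chosen for illustration and were not produced by the fine-tuned model.

```python
import math

id2label = {0: "negative", 1: "positive"}  # same mapping as in Step 3


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax, as in compute_metrics
    return id2label[best], probs[best]


label, confidence = predict_label([-1.2, 2.3])  # hypothetical logits
print(label, round(confidence, 3))
```

The argmax here is the same reduction that compute_metrics applies to a whole batch of predictions.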


In [20]:
# Comparison of Results (Step 14)
print("\nComparison:")
print(f"{'Metric':<15}{'Base Model':<20}{'Fine-Tuned Model':<20}")
print(f"{'-'*55}")
print(f"{'Accuracy':<15}{base_results['eval_accuracy']:<20}{fine_tuned_results['eval_accuracy']:<20}")
print(f"{'Precision':<15}{base_results['eval_precision']:<20}{fine_tuned_results['eval_precision']:<20}")
print(f"{'Recall':<15}{base_results['eval_recall']:<20}{fine_tuned_results['eval_recall']:<20}")
print(f"{'F1 Score':<15}{base_results['eval_f1']:<20}{fine_tuned_results['eval_f1']:<20}")


Comparison:
Metric         Base Model          Fine-Tuned Model    
-------------------------------------------------------
Accuracy       0.5036              0.91912             
Precision      0.5036159515118122  0.9257195914577531  
Recall         0.8727272727272727  0.9123135070618659  
F1 Score       0.6386769929540558  0.9189676591992947  


## Conclusion

Fine-tuning DistilBERT with LoRA for one epoch on 100,000 Amazon Polarity training reviews substantially improved every metric on the 50,000-review test subset:

Accuracy: The fine-tuned model reached 91.91% accuracy, up from the base model's 50.36%. The base figure is roughly chance level for a two-class task, which is expected because the classification head was randomly initialized, as the warning at load time notes.

Precision: Precision increased from 50.36% to 92.57%, so about 93 of every 100 reviews that the fine-tuned model labels positive are actually positive.

Recall: Recall improved from 87.27% to 91.23%. The base model's high recall should not be read as skill: paired with roughly 50% precision and accuracy, it most likely reflects the untrained head labeling most reviews as positive.

F1 Score: The F1 score, which balances precision and recall, rose from 63.87% to 91.90%, confirming the overall gain in classification quality.

These gains came from training only about 1.18 million of the 68.1 million parameters (1.73%), with one epoch taking about 7 minutes (419 seconds). LoRA therefore reaches strong performance at a low training cost, which makes it well suited to settings with limited compute.
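
As a quick consistency check, the reported F1 scores can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The helper f1_from is hypothetical; the input values are copied from the comparison table.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the comparison table above.
base_f1 = f1_from(0.5036159515118122, 0.8727272727272727)
tuned_f1 = f1_from(0.9257195914577531, 0.9123135070618659)

print(round(base_f1, 4), round(tuned_f1, 4))
```

Both values agree with the F1 scores in the table, so the reported metrics are internally consistent.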