
 # Notebook1B-LoRA: Full Fine-Tuning vs. LoRA

 In the previous notebook, we learned how to train only the top classification head of the GPT model to perform spam SMS classification. Now, we will further explore two more sophisticated fine-tuning strategies:
 - **Full Fine-Tuning**
 - **LoRA (Low-Rank Adaptation)**

 **Core Objectives:**
 - Compare the two fine-tuning strategies in terms of resource consumption and model performance.
 - Understand the unique advantages of LoRA in fine-tuning large models.


 **Contents:**
 1. [Preparations](#1-Preparations)
    - [1.1 Download and Balance SMS Spam Data](#11-Download-and-Balance-SMS-Spam-Data)
    - [1.2 Build Dataset and DataLoader](#12-Build-Dataset-and-DataLoader)
    - [1.3 Load Pretrained GPT2](#13-Load-Pretrained-GPT2)
 2. [Implementation and Training of Two Strategies](#2-Implementation-and-Training-of-Two-Strategies)
    - [Strategy A: Full Fine-Tuning](#Strategy-A-Full-Fine-Tuning)
    - [Strategy B: LoRA (Low-Rank Adaptation)](#Strategy-B-LoRA-Low-Rank-Adaptation)
 3. [Parallel Comparison Results](#3-Parallel-Comparison-Results)
 4. [Summary](#4-Summary)

 **References**
 - Build a Large Language Model (from scratch), pp. 322-336
 - LoRA paper: [https://arxiv.org/abs/2106.09685](https://arxiv.org/abs/2106.09685)
 - LoRA explanation video by Edward Hu: [https://www.youtube.com/watch?v=DhRoTONcyZE](https://www.youtube.com/watch?v=DhRoTONcyZE)
 - LoRA with diffusion models: [https://huggingface.co/blog/lora](https://huggingface.co/blog/lora)


 ## 1. Preparations

 Before diving into fine-tuning strategies, we need to:
 - Download the SMS Spam dataset
 - Balance it (ham vs. spam)
 - Set up a custom dataset and data loaders
 - Load a pretrained GPT-2 classification model to serve as our base


 ### 1.1 Download and Balance SMS Spam Data

In [2]:
%load_ext autoreload
%autoreload 2
    
import sklearn


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [9]:
 
DEBUG = True
import os
if DEBUG: 
    os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
data_path = "content"
import torch

import urllib.request
import zipfile
from pathlib import Path
import pandas as pd
import time, math

In [32]:
from src.dataset_loader import get_enc_dataset
# Set model name
from transformers import GPT2Tokenizer
model_name = "gpt2"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if DEBUG:
    device = "cpu"
print(f"Using device: {device}")


# Load GPT-2 Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
# GPT-2 ships without a padding token, so reuse the EOS token (ID 50256) for padding.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
batch_size = 16
max_length=128
train_dataset, val_dataset,train_loader,val_loader, pad_token_id = get_enc_dataset(data_path,
                                                                                   tokenizer,
                                                                                   batch_size=batch_size,
                                                                                   max_length = max_length
                                                                                  )

Using device: cpu
Max Length is:  128
Max Length is:  128
Max Length:  128
Number of training batches: 4209, Number of validation batches: 55


##### Old

In [3]:
# %%


url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"
zip_path = "sms_spam_collection.zip"
extract_dir = "sms_spam_collection"
data_file = Path(extract_dir) / "SMSSpamCollection.tsv"

def download_spam_data(url, zip_path, extract_dir, data_file):
    if data_file.exists():
        print(f"{data_file} already exists. Skipping download.")
        return
    print("Downloading dataset...")
    urllib.request.urlretrieve(url, zip_path)
    print("Extracting files...")
    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extractall(extract_dir)
    raw_file = Path(extract_dir) / "SMSSpamCollection"
    raw_file.rename(data_file)
    print("Data downloaded and saved as:", data_file)

if not data_file.exists():
    download_spam_data(url, zip_path, extract_dir, data_file)

df = pd.read_csv(data_file, sep="\t", header=None, names=["Label", "Text"])
print(f"Original dataset size: {df.shape}")

def balance_spam_dataset(df):
    """
    Balance the dataset by ensuring an equal number of 'spam' and 'ham' samples.
    """
    df_spam = df[df["Label"] == "spam"]
    df_ham = df[df["Label"] == "ham"].sample(len(df_spam), random_state=42)
    balanced_df = pd.concat([df_spam, df_ham], ignore_index=True)
    return balanced_df

balanced_df = balance_spam_dataset(df)
balanced_df["Label"] = balanced_df["Label"].map({"ham": 0, "spam": 1})
print(f"Balanced dataset size: {balanced_df.shape}")

def random_split(df, train_frac=0.7, val_frac=0.1):
    """
    Split the dataset into training, validation, and test sets.
    """
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)
    total = len(df)
    train_end = int(total * train_frac)
    val_end = train_end + int(total * val_frac)
    return df[:train_end], df[train_end:val_end], df[val_end:]

train_df, valid_df, test_df = random_split(balanced_df, 0.7, 0.1)
train_df.to_csv("train.csv", index=False)
valid_df.to_csv("validation.csv", index=False)
test_df.to_csv("test.csv", index=False)

print(f"Train size: {len(train_df)} | Validation size: {len(valid_df)} | Test size: {len(test_df)}")

Original dataset size: (5572, 2)
Balanced dataset size: (1494, 2)
Train size: 1045 | Validation size: 149 | Test size: 300


### 1.2 Build Dataset and DataLoader

In [7]:
# %%

import torch
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Tokenizer

class SpamDataset(Dataset):
    def __init__(self, csv_file, tokenizer, max_length=128):
        self.df = pd.read_csv(csv_file)
        self.tokenizer = tokenizer
        self.max_length = max_length

        self.encodings = self.tokenizer(
            self.df["Text"].tolist(),
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        input_ids = self.encodings['input_ids'][idx]
        attention_mask = self.encodings['attention_mask'][idx]
        label = torch.tensor(self.df.iloc[idx]["Label"], dtype=torch.long)
        return input_ids, attention_mask, label

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

#train_ds = SpamDataset(data_path+"/train.tsv", tokenizer, max_length=128)
#val_ds   = SpamDataset(data_path+"/dev.tsv", tokenizer, max_length=128)
train_ds = SpamDataset("train.csv", tokenizer, max_length=128)
val_ds   = SpamDataset("validation.csv", tokenizer, max_length=128)
test_ds  = SpamDataset("test.csv", tokenizer, max_length=128)


train_loader = DataLoader(
    train_ds,
    batch_size=batch_size,
    shuffle=True,
    drop_last=True,
    pin_memory=True
)
val_loader = DataLoader(
    val_ds,
    batch_size=batch_size,
    shuffle=False,
    drop_last=False,
    pin_memory=True
)

test_loader = DataLoader(
    test_ds,
    batch_size=batch_size,
    shuffle=False,
    drop_last=False,
    pin_memory=True
)

print(f"Train loader: {len(train_loader)} batches, Val loader: {len(val_loader)} batches, Test loader: {len(test_loader)} batches")

Train loader: 65 batches, Val loader: 10 batches, Test loader: 19 batches


### 1.3 Load Pretrained GPT2

In [31]:
for x in train_loader:
    print(x.keys())
    break
if tokenizer.pad_token is None:
    print("Error: tokenizer has no pad token")
print(tokenizer.encode(tokenizer.pad_token))
print(tokenizer.pad_token_id)
print(tokenizer.encode(tokenizer.eos_token))

dict_keys(['input_ids', 'attention_mask', 'labels', 'text'])
[50256]
50256
[50256]


In [33]:
# %%

import torch.nn as nn
from transformers import GPT2ForSequenceClassification, GPT2Config
#tokenizer.pad_token = tokenizer.eos_token
def forward_for_classification(model, input_ids, attention_mask, device):
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits
    return logits

@torch.no_grad()
def calc_accuracy(loader, model, device, max_batches=None):
    model.eval()
    correct, total = 0, 0
    #for i, instance in enumerate(loader):
        #print(instance)
    #for i, (input_ids, attention_mask, y_batch,text) in enumerate(loader):
    for i, instance in enumerate(loader):
        input_ids = instance['input_ids']
        attention_mask = instance['attention_mask']
        y_batch = instance['labels']
        print(input_ids.shape)
        print(torch.max(input_ids))
        print(attention_mask.shape)
        text = instance['text']
        if max_batches and (i+1) > max_batches:
            break

        input_ids = input_ids.to(device)
        #input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        y_batch = y_batch.to(device)

        logits = forward_for_classification(model, input_ids, attention_mask, device)
        preds = torch.argmax(logits, dim=-1)
        correct += (preds == y_batch).sum().item()
        total += y_batch.size(0)
    return correct / total if total > 0 else 0



model_name = "gpt2"
num_labels = 2
model_config = GPT2Config.from_pretrained(model_name, num_labels=num_labels)
model = GPT2ForSequenceClassification.from_pretrained(model_name, config=model_config)
model.config.pad_token_id = tokenizer.pad_token_id 
model.to(device)
model.eval()

init_train_acc = calc_accuracy(train_loader, model, device, max_batches=10)
init_val_acc   = calc_accuracy(val_loader, model, device, max_batches=10)
#init_test_acc  = calc_accuracy(test_loader, model, device, max_batches=10)
# The test accuracy is not computed above, so only train and validation are reported.
print(f"Initial Accuracies -> Train: {init_train_acc*100:.2f}%, Val: {init_val_acc*100:.2f}%")

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


torch.Size([16, 128])
tensor(50256)
torch.Size([16, 128])
torch.Size([16, 128])
tensor(50256)
torch.Size([16, 128])
torch.Size([16, 128])
tensor(50256)
torch.Size([16, 128])


KeyboardInterrupt: 


 ## 2. Implementation and Training of Two Strategies

 We now compare:
 1. **Full Fine-Tuning**: Updating all GPT-2 parameters + classifier
 2. **LoRA**: Introducing Low-Rank Adaptation layers to reduce the number of trainable parameters


 ### Strategy A: Full Fine-Tuning

In [None]:
# %%

import copy
import time

def train_model_full_finetune(model, train_loader, val_loader, device, epochs=3, lr=5e-5):
    """
    Fully fine-tune all parameters of GPT-2, including the classification head.
    """
    for param in model.parameters():
        param.requires_grad = True

    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    train_losses = []
    val_accs = []
    start_time = time.time()

    for epoch in range(epochs):
        model.train()
        total_loss = 0
        for batch in train_loader:
            # Batches are dicts, in the same format calc_accuracy reads.
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            y_batch = batch['labels'].to(device)

            optimizer.zero_grad()
            logits = forward_for_classification(model, input_ids, attention_mask, device)
            loss = loss_fn(logits, y_batch)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        avg_loss = total_loss / len(train_loader)
        val_acc = calc_accuracy(val_loader, model, device)
        train_losses.append(avg_loss)
        val_accs.append(val_acc)

        print(f"Epoch={epoch+1}, Loss={avg_loss:.4f}, ValAcc={val_acc*100:.2f}%")

    end_time = time.time()
    elapsed = end_time - start_time
    return elapsed, train_losses, val_accs

print("=== Strategy A: Full Fine-Tuning ===")
modelA = copy.deepcopy(model)
full_tune_params = sum(p.numel() for p in modelA.parameters() if p.requires_grad)
print(f"[Full Fine-Tuning] Trainable Params: {full_tune_params}")

init_accA = calc_accuracy(train_loader, modelA, device, max_batches=10)
print(f"[Full Fine-Tuning] Initial Train Acc (first 10 batches): {init_accA*100:.2f}%")

elapsedA, ft_train_losses, ft_val_accs = train_model_full_finetune(
    modelA, train_loader, val_loader, device, epochs=5, lr=5e-5
)

train_accA = calc_accuracy(train_loader, modelA, device)
val_accA   = calc_accuracy(val_loader, modelA, device)
test_accA  = calc_accuracy(test_loader, modelA, device)
print(f"[Full Fine-Tuning] Time: {elapsedA:.2f}s, TrainAcc={train_accA*100:.2f}%, ValAcc={val_accA*100:.2f}%, TestAcc={test_accA*100:.2f}%\n")


 ### Strategy B: LoRA (Low-Rank Adaptation)

 **Key Idea**:
 Insert low-rank matrices \(A\) and \(B\) into selected GPT-2 submodules (e.g., `c_fc`, `c_proj`), then train only these matrices (plus the classifier).

 This approach drastically reduces the number of trainable parameters and can still achieve performance close to full fine-tuning.
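
 To put rough numbers on that reduction, the sketch below counts weights for the `c_fc` and `c_proj` layers found in one GPT-2 block. It assumes the base `gpt2` shapes (hidden size 768, MLP width 3072) and ignores biases; `dense_params` and `lora_params` are helper names introduced here for illustration only.

```python
# Weight counts for the LoRA-wrapped layers in one GPT-2 block.
# Assumption: base "gpt2" shapes (hidden size 768, MLP width 3072); biases ignored.

def dense_params(in_dim, out_dim):
    """Number of weights in the original (frozen) projection."""
    return in_dim * out_dim

def lora_params(in_dim, out_dim, rank):
    """Number of weights in the factors A (in_dim x rank) and B (rank x out_dim)."""
    return in_dim * rank + rank * out_dim

layers = {
    "attn.c_proj": (768, 768),
    "mlp.c_fc": (768, 3072),
    "mlp.c_proj": (3072, 768),
}
rank = 16

for name, (in_dim, out_dim) in layers.items():
    dense = dense_params(in_dim, out_dim)
    lora = lora_params(in_dim, out_dim, rank)
    print(f"{name:12s} dense={dense:>9,d}  lora={lora:>6,d}  ratio={lora / dense:.2%}")
```

 At rank 16 each adapter holds only a few percent of the weights of the layer it wraps, and the frozen layer itself needs no gradients or optimizer state.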


 #### B.1: LoRA Implementation from Scratch

 Here we define:
 - `LoRALayer` to handle low-rank factors
 - `LinearWithLoRA` and `Conv1DWithLoRA` as wrappers
 - `replace_modules_with_lora` to insert LoRA
 - `freeze_original_parameters` so that only LoRA + classifier are trainable

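 Notice the shape logs:
 In base GPT-2 the wrapped layers do not all share one shape: the attention `c_proj` maps 768 to 768, the MLP `c_fc` maps 768 to 3072, and the MLP `c_proj` maps 3072 back to 768. You will therefore see `[LoRALayer] in_dim=768, out_dim=768, rank=16, alpha=32` for the attention projection, and 768/3072 pairs for the MLP layers.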
 
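 **Explanation**: GPT-2's hidden dimensionality is 768 in the base model (3072 inside the MLP), and `rank=16, alpha=32` are our chosen hyperparameters.
 - **Rank** sets the inner dimension shared by `A` and `B`, and therefore how many trainable parameters each adapter adds.
 - **Alpha** is a scalar that multiplies the LoRA update before it is added to the frozen layer's output.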
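
 The role of the two hyperparameters can be seen on a toy example. The sketch below is dependency-free and uses plain Python lists instead of tensors; all sizes and values are invented for illustration, and `matvec` and `lora_forward` are hypothetical helper names. It computes `x @ W + alpha * (x @ A @ B)` for a single input vector.

```python
# Toy LoRA forward pass on one input vector:
#   output = x @ W + alpha * (x @ A @ B)
# All sizes and values are made up for illustration.

def matvec(x, M):
    """Row vector x (length n) times matrix M (n x m), returned as a list of length m."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]

def lora_forward(x, W, A, B, alpha):
    base = matvec(x, W)                # frozen pretrained path
    update = matvec(matvec(x, A), B)   # low-rank path: (x @ A) @ B
    return [b + alpha * u for b, u in zip(base, update)]

x = [1.0, 2.0]                         # in_dim = 2
W = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]                  # in_dim x out_dim = 2 x 3
A = [[0.5], [0.5]]                     # in_dim x rank = 2 x 1
B_zero = [[0.0, 0.0, 0.0]]             # rank x out_dim, zero-initialised
B_trained = [[1.0, 0.0, -1.0]]         # pretend values after some training

print(lora_forward(x, W, A, B_zero, alpha=32))    # identical to x @ W
print(lora_forward(x, W, A, B_trained, alpha=1))
print(lora_forward(x, W, A, B_trained, alpha=2))  # same update direction, doubled
```

 Because `B` starts at zero, the wrapped layer reproduces the pretrained output exactly at the start of training, whatever `alpha` is. Note that the LoRA paper scales the update by `alpha / rank`; this notebook applies `alpha` directly.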

In [None]:
# %%

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

def advanced_metrics(loader, model, device):
    """
    Calculate precision, recall, F1-score, and accuracy.
    """
    model.eval()
    preds_list, labels_list = [], []
    for input_ids, attention_mask, y_batch in loader:
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        y_batch = y_batch.to(device)

        with torch.no_grad():
            logits = forward_for_classification(model, input_ids, attention_mask, device)
        preds = torch.argmax(logits, dim=-1)
        preds_list.extend(preds.cpu().numpy())
        labels_list.extend(y_batch.cpu().numpy())

    accuracy  = np.mean(np.array(preds_list) == np.array(labels_list))
    precision = precision_score(labels_list, preds_list)
    recall    = recall_score(labels_list, preds_list)
    f1        = f1_score(labels_list, preds_list)
    return accuracy, precision, recall, f1

In [None]:
# %%

import torch.cuda.amp as amp
from transformers.models.gpt2.modeling_gpt2 import Conv1D

class LoRALayer(nn.Module):
    """
    Low-Rank Adaptation layer to inject trainable parameters A and B into original weight update.
    """
    def __init__(self, in_dim, out_dim, rank, alpha=1.0):
        super().__init__()
        self.rank = rank
        self.alpha = alpha

        # Low-rank matrices
        self.A = nn.Parameter(torch.empty(in_dim, rank))
        self.B = nn.Parameter(torch.empty(rank, out_dim))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        nn.init.zeros_(self.B)

        # Shape log: in base GPT-2, in_dim/out_dim are 768 or 3072; rank and alpha are set by the caller.
        print(f"[LoRALayer] in_dim={in_dim}, out_dim={out_dim}, rank={rank}, alpha={alpha}")

    def forward(self, x):
        # Decomposition: alpha * (x @ A @ B)
        return self.alpha * (x @ self.A @ self.B)

class LinearWithLoRA(nn.Module):
    """
    Wrapper for nn.Linear that adds a LoRA output to the original linear output.
    """
    def __init__(self, linear_module, rank, alpha=1.0):
        super().__init__()
        self.linear = linear_module
        self.lora   = LoRALayer(linear_module.in_features, linear_module.out_features, rank, alpha)

    def forward(self, x):
        return self.linear(x) + self.lora(x)

class Conv1DWithLoRA(nn.Module):
    """
    Wrapper for Conv1D that adds a LoRA output to the original Conv1D output.
    """
    def __init__(self, conv1d_module: Conv1D, rank, alpha=1.0):
        super().__init__()
        self.conv = conv1d_module
        in_dim, out_dim = conv1d_module.weight.shape
        self.lora = LoRALayer(in_dim, out_dim, rank, alpha)

    def forward(self, x):
        out_normal = self.conv(x)
        B, S, hidden_dim = x.shape
        x_2d = x.view(B*S, hidden_dim)
        out_lora_2d = self.lora(x_2d)
        out_lora_3d = out_lora_2d.view(B, S, -1)
        return out_normal + out_lora_3d

def replace_modules_with_lora(module, rank=16, alpha=32):
    """
    Recursively replace GPT-2 submodules (c_fc, c_proj) with LoRA wrappers.
    """
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Linear) and name in ["c_fc", "c_proj"]:
            new_module = LinearWithLoRA(child, rank, alpha)
            setattr(module, name, new_module)
        elif isinstance(child, Conv1D) and name in ["c_fc", "c_proj"]:
            new_module = Conv1DWithLoRA(child, rank, alpha)
            setattr(module, name, new_module)
        else:
            replace_modules_with_lora(child, rank, alpha)

def freeze_original_parameters(model):
    """
    Freeze all parameters except LoRA layers and the classification head.
    GPT2ForSequenceClassification names its head `score` (see the load warning above).
    """
    for name, param in model.named_parameters():
        lname = name.lower()
        if not any(key in lname for key in ("lora", "score", "classifier")):
            param.requires_grad = False

def print_trainable_parameters(model):
    """
    Print the number of trainable parameters vs. total parameters.
    """
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total     = sum(p.numel() for p in model.parameters())
    print(f"Trainable params: {trainable} / Total params: {total}")
    return trainable

def show_gradient_norms(model):
    """
    Print gradient norms for LoRA layers to confirm only LoRA + classifier receive gradients.
    """
    for name, param in model.named_parameters():
        if param.requires_grad and param.grad is not None:
            print(f"Gradient Norm for {name}: {param.grad.norm():.4f}")


 #### B.2: LoRA Training Function

 Here, we only train the LoRA parameters and the classifier head. The rest of GPT-2 is frozen.

 **Optional**: We can log gradient norms (`log_grad_norms=True`) to confirm that only LoRA + classifier layers get updated.
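
 Which parameters stay trainable depends entirely on a name match, so it is worth checking the filter against real parameter names. The warning printed when the model was loaded lists `score.weight` as the newly initialized head, so the filter has to match `score`. The sketch below checks a name filter on a few illustrative parameter names without loading a model; `TRAINABLE_MARKERS` and `is_trainable` are hypothetical names used only here.

```python
# Name-based selection of trainable parameters, checked on a handful of
# illustrative parameter names (no model is loaded here).
# Assumptions: the LoRA wrappers store adapters under an attribute called "lora",
# and the Hugging Face GPT-2 classification head is called "score".

TRAINABLE_MARKERS = ("lora", "score")  # hypothetical constant for this sketch

def is_trainable(param_name):
    name = param_name.lower()
    return any(marker in name for marker in TRAINABLE_MARKERS)

example_names = [
    "transformer.wte.weight",
    "transformer.h.0.attn.c_attn.weight",
    "transformer.h.0.mlp.c_fc.conv.weight",
    "transformer.h.0.mlp.c_fc.lora.A",
    "transformer.h.0.mlp.c_fc.lora.B",
    "score.weight",
]

for name in example_names:
    status = "train" if is_trainable(name) else "frozen"
    print(f"{status:6s} {name}")
```

 Running the same check over `model.named_parameters()` after wrapping is a quick way to confirm that the head has not been frozen by accident.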

In [None]:
# %%

def train_model_lora(
    model, 
    train_loader, 
    val_loader, 
    device, 
    epochs=3, 
    lr=1e-4, 
    log_grad_norms=False
):
    freeze_original_parameters(model)
    print_trainable_parameters(model)

    optimizer = torch.optim.AdamW(filter(lambda p: p.requires_grad, model.parameters()), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    scaler = amp.GradScaler()

    train_losses, val_accs = [], []
    start_time = time.time()

    for epoch in range(epochs):
        model.train()
        total_loss = 0
        for step, batch in enumerate(train_loader):
            # Batches are dicts, in the same format calc_accuracy reads.
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            y_batch = batch['labels'].to(device)

            optimizer.zero_grad()
            with amp.autocast():
                logits = forward_for_classification(model, input_ids, attention_mask, device)
                loss = loss_fn(logits, y_batch)

            scaler.scale(loss).backward()

            # Optionally print gradient norms on the first step of each epoch.
            # Unscale first so the reported norms are not inflated by the loss scale.
            if log_grad_norms and step == 0:
                scaler.unscale_(optimizer)
                show_gradient_norms(model)
            scaler.step(optimizer)
            scaler.update()

            total_loss += loss.item()

        avg_loss = total_loss / len(train_loader)
        val_acc  = calc_accuracy(val_loader, model, device)
        train_losses.append(avg_loss)
        val_accs.append(val_acc)
        print(f"Epoch={epoch+1}, Loss={avg_loss:.4f}, ValAcc={val_acc*100:.2f}%")

    end_time = time.time()
    elapsed = end_time - start_time
    return elapsed, train_losses, val_accs


 Now, let's create a copy of our GPT-2 model and apply LoRA to it with the chosen `rank` and `alpha` hyperparameters.

In [None]:
# %%

print("=== Strategy B: LoRA ===")

modelB = copy.deepcopy(model)  # Duplicate the base model
replace_modules_with_lora(modelB, rank=16, alpha=32)  # Replace layers
modelB.to(device)

elapsedB, lora_train_losses, lora_val_accs = train_model_lora(
    modelB, train_loader, val_loader, device, epochs=5, lr=1e-4, log_grad_norms=True
)

def calc_accuracy_full(loader, model, device, max_batches=None):
    model.eval()
    correct, total = 0, 0
    for i, (input_ids, attention_mask, y_batch) in enumerate(loader):
        if max_batches and (i+1) > max_batches:
            break
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        y_batch = y_batch.to(device)
        with torch.no_grad():
            logits = forward_for_classification(model, input_ids, attention_mask, device)
        preds = torch.argmax(logits, dim=-1)
        correct += (preds == y_batch).sum().item()
        total   += y_batch.size(0)
    return correct / total if total > 0 else 0

train_accB = calc_accuracy_full(train_loader, modelB, device)
val_accB   = calc_accuracy_full(val_loader, modelB, device)
test_accB  = calc_accuracy_full(test_loader, modelB, device)
print(f"[LoRA] Training Time: {elapsedB:.2f}s")
print(f"[LoRA] TrainAcc={train_accB*100:.2f}%, ValAcc={val_accB*100:.2f}%, TestAcc={test_accB*100:.2f}%\n")


 #### B.3: Advanced Metrics (Precision, Recall, F1)

 We'll calculate a more comprehensive set of metrics on the LoRA model to evaluate performance beyond accuracy.
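
 As a reminder of what these metrics measure, the sketch below computes them by hand for the positive class (spam = 1). The label and prediction lists are invented for illustration, and `binary_metrics` is a hypothetical helper, not part of the notebook.

```python
# Accuracy, precision, recall and F1 for the positive class (spam = 1),
# computed from raw predictions. The lists below are invented for illustration.

def binary_metrics(labels, preds, positive=1):
    tp = sum(1 for y, p in zip(labels, preds) if y == positive and p == positive)
    fp = sum(1 for y, p in zip(labels, preds) if y != positive and p == positive)
    fn = sum(1 for y, p in zip(labels, preds) if y == positive and p != positive)
    correct = sum(1 for y, p in zip(labels, preds) if y == p)

    accuracy = correct / len(labels)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # of predicted spam, how much is spam
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # of actual spam, how much was caught
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
preds  = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

acc, prec, rec, f1 = binary_metrics(labels, preds)
print(f"Accuracy={acc:.2f}, Precision={prec:.2f}, Recall={rec:.2f}, F1={f1:.2f}")
```

 In this toy case accuracy (0.80) is higher than precision and recall (0.75 each), because the correctly classified ham messages raise accuracy without affecting the spam-specific metrics.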

In [None]:
# %%

accB, precB, recB, f1B = advanced_metrics(test_loader, modelB, device)
print(f"[LoRA Advanced Metrics on Test] Accuracy={accB*100:.2f}%, Precision={precB*100:.2f}%, Recall={recB*100:.2f}%, F1={f1B*100:.2f}%")


 #### B.4: Saving and Loading LoRA Parameters

 A practical advantage of LoRA is that only the adapter weights (and the classification head) need to be saved for each task; the frozen base model is stored once and shared.

In [None]:
# %%

def save_lora_params(model, save_path="lora_params.pt"):
    """
    Save only LoRA-related parameters (and the classifier) for demonstration.
    """
    # GPT2ForSequenceClassification names its classification head `score`.
    lora_dict = {
        k: v for k, v in model.state_dict().items()
        if any(key in k.lower() for key in ("lora", "score", "classifier"))
    }
    torch.save(lora_dict, save_path)
    print(f"LoRA params saved to {save_path}")

def load_lora_params(model, load_path="lora_params.pt"):
    """
    Load LoRA parameters into a GPT-2 model that already has LoRA layers.
    """
    loaded_dict = torch.load(load_path, map_location=device)
    model.load_state_dict(loaded_dict, strict=False)
    print(f"LoRA params loaded from {load_path}")

# Example usage (commented out):
# save_lora_params(modelB, "my_lora_params.pt")
# new_modelB = copy.deepcopy(model)
# replace_modules_with_lora(new_modelB, rank=16, alpha=32)
# new_modelB.to(device)
# load_lora_params(new_modelB, "my_lora_params.pt")

Compare the convergence behavior of full fine-tuning and LoRA by visualizing the training loss and validation accuracy.

In [None]:
import matplotlib.pyplot as plt

# Set the figure size
plt.figure(figsize=(12, 5))

# Left: Training Loss
plt.subplot(1, 2, 1)
plt.plot(ft_train_losses, label="Full Fine-Tuning - Loss")
plt.plot(lora_train_losses, label="LoRA - Loss")
plt.xlabel("Training Epochs")
plt.ylabel("Training Loss")
plt.title("Comparison of Training Loss")
plt.legend()

# Right: Validation Accuracy
plt.subplot(1, 2, 2)
plt.plot([acc * 100 for acc in ft_val_accs], label="Full Fine-Tuning - Accuracy")
plt.plot([acc * 100 for acc in lora_val_accs], label="LoRA - Accuracy")
plt.xlabel("Training Epochs")
plt.ylabel("Validation Accuracy (%)")
plt.title("Comparison of Validation Accuracy")
plt.legend()

# Adjust layout and show the plot
plt.tight_layout()
plt.show()



 ## 3. Parallel Comparison Results

In [None]:
# %%

print("=== Comparison of Two Fine-Tuning Strategies ===")

print(f"Full Fine-Tuning:")
print(f" Trainable Params={full_tune_params}, Time={elapsedA:.2f}s, TestAcc={test_accA*100:.2f}%\n")

lora_params_count = sum(p.numel() for p in modelB.parameters() if p.requires_grad)
print(f"LoRA:")
print(f" Trainable Params={lora_params_count}, Time={elapsedB:.2f}s, TestAcc={test_accB*100:.2f}%")
print(f" Precision={precB*100:.2f}%, Recall={recB*100:.2f}%, F1={f1B*100:.2f}%\n")


 ## 4. Summary

 1. **Full Fine-Tuning**: Updates all parameters of GPT-2 + the classification head.
    - **Pros**: Maximum adaptation, potentially highest performance.
    - **Cons**: Highest compute and memory cost, since every weight needs gradients and optimizer state; each task also needs its own full copy of the model.

 2. **LoRA**: Freezes original GPT-2 parameters, adds small low-rank matrices for adaptation.
    - **Pros**: Parameter-efficient, often faster to train, easy to save/load adapters.
    - **Cons**: Might need careful hyperparameter tuning (rank, alpha) to match full fine-tuning performance.

 **Key Takeaways**:
 - LoRA can achieve near-full fine-tuning performance with a small fraction of the trainable parameters.
 - It's an excellent choice for large language models in resource-constrained environments.

 ### Hyperparameter Exploration

 - **Rank and Alpha**: Try `rank=[4,8,16,32]` and `alpha=[8,16,32,64]` to see changes in parameter count and performance.
 - **Learning Rate**: Adjust `lr` to see if smaller ranks need a different learning rate.
 - **Epochs**: Extend or reduce the number of epochs to observe convergence speed and final performance.
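
 Before running a sweep, the effect of rank on adapter size can be estimated on paper. The sketch below assumes the base `gpt2` layout (12 blocks, with the attention `c_proj` and the MLP `c_fc` and `c_proj` wrapped in each) and float32 storage; `adapter_param_count` is a hypothetical helper. Alpha is a scalar and does not change the count.

```python
# Adapter size as a function of rank.
# Assumption: base "gpt2" layout with 12 blocks, each wrapping
# attn.c_proj (768 -> 768), mlp.c_fc (768 -> 3072) and mlp.c_proj (3072 -> 768).

N_BLOCKS = 12
WRAPPED_SHAPES = [(768, 768), (768, 3072), (3072, 768)]

def adapter_param_count(rank):
    """Total number of LoRA weights (A and B factors) across all wrapped layers."""
    per_block = sum(in_dim * rank + rank * out_dim for in_dim, out_dim in WRAPPED_SHAPES)
    return N_BLOCKS * per_block

for rank in [4, 8, 16, 32]:
    count = adapter_param_count(rank)
    size_mb = count * 4 / 1e6  # float32 uses 4 bytes per parameter
    print(f"rank={rank:>2d}  adapter params={count:>9,d}  ~{size_mb:.1f} MB in float32")
```

 The count grows linearly with rank, so doubling the rank doubles both the trainable parameters and the size of the saved adapter file.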