# 🦾 Full Fine-Tuning VS LoRA

First, we import the necessary libraries. The tokenizer and the sequence-classification wrapper both come from Hugging Face's transformers library.


In essence, AutoModelForSequenceClassification adds a classification head on top of the base model's outputs, and the head is trained together with the base model.

For example, here's Qwen3-0.6B, the model we'll be fine-tuning today, wrapped in SequenceClassification:

```
Qwen3ForSequenceClassification(
  (model): Qwen3Model(
    (embed_tokens): Embedding(151936, 1024)
    (layers): ModuleList(
      (0-27): 28 x Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
          (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear(in_features=1024, out_features=3072, bias=False)
          (up_proj): Linear(in_features=1024, out_features=3072, bias=False)
          (down_proj): Linear(in_features=3072, out_features=1024, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((1024,), eps=1e-06)
        (post_attention_layernorm): Qwen3RMSNorm((1024,), eps=1e-06)
      )
    )
    (norm): Qwen3RMSNorm((1024,), eps=1e-06)
    (rotary_emb): Qwen3RotaryEmbedding()
  )
  (score): Linear(in_features=1024, out_features=2, bias=False)
)
```
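
To make the role of the `score` layer more concrete, here is a toy sketch of what a classification head on a decoder-only model roughly does. It assumes the head scores the hidden state of the last non-padding token, which is why a pad token has to be configured before training. The helper names (`last_non_pad_index`, `score_head`), the tiny 2-dimensional hidden states, and right padding are all illustrative assumptions, not the library's actual code.

```python
# Toy sketch of a sequence-classification head on a decoder-only model.
# Assumption: the head scores the hidden state of the last non-padding token.

def last_non_pad_index(input_ids, pad_token_id):
    """Index of the last token that is not padding (right padding assumed)."""
    for i in range(len(input_ids) - 1, -1, -1):
        if input_ids[i] != pad_token_id:
            return i
    return 0  # sequence is all padding: fall back to the first position


def score_head(hidden_state, weight):
    """Bias-free linear layer: weight has shape [num_labels][hidden_size]."""
    return [sum(w * h for w, h in zip(row, hidden_state)) for row in weight]


PAD = 0  # hypothetical pad token id for this toy example
input_ids = [11, 42, 7, PAD, PAD]
# One hidden vector per token (hidden_size = 2 here instead of the real model's size)
hidden_states = [[0.1, 0.2], [0.3, -0.1], [1.0, 2.0], [9.0, 9.0], [9.0, 9.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # 2 labels x 2 hidden dims

idx = last_non_pad_index(input_ids, PAD)
logits = score_head(hidden_states[idx], weight)
predicted_label = max(range(len(logits)), key=logits.__getitem__)
print(idx, logits, predicted_label)
```

The hidden states at the padded positions never reach the head, so their values do not affect the prediction.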

In [1]:
!pip install -q -U datasets
!pip install -q -U transformers
!pip install -q -U peft

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
    EvalPrediction
)
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import numpy as np

In [3]:
from google.colab import userdata
import os

os.environ["KAGGLE_KEY"] = userdata.get('KAGGLE_KEY')
os.environ["KAGGLE_USERNAME"] = userdata.get('KAGGLE_USERNAME')

## 🛠 Dataset Preprocessing

Our dataset for this tutorial is a simple spam email classification dataset from Kaggle (URL: https://www.kaggle.com/datasets/sahideseker/spam-mail-classifier-dataset).

The labels are "ham", a non-spam email (encoded as 0), and "spam", a spam email (encoded as 1).

In [4]:
!kaggle datasets download -d sahideseker/spam-mail-classifier-dataset
!unzip spam-mail-classifier-dataset.zip

Dataset URL: https://www.kaggle.com/datasets/sahideseker/spam-mail-classifier-dataset
License(s): CC-BY-SA-4.0
spam-mail-classifier-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  spam-mail-classifier-dataset.zip
replace spam_mail_classifier.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n


In [5]:
import pandas as pd

df = pd.read_csv('spam_mail_classifier.csv')
df.head()

Unnamed: 0,email_text,label
0,Let's catch up sometime next week!,ham
1,Don't forget to submit your project by Friday.,ham
2,Win a free iPhone now!!! Click here.,spam
3,Can you send me the report when it's ready?,ham
4,Meeting has been rescheduled to next Monday.,ham


In [6]:
label_mapping = {"ham": 0, "spam": 1}
df['label'] = df['label'].map(label_mapping)
df.head()

Unnamed: 0,email_text,label
0,Let's catch up sometime next week!,0
1,Don't forget to submit your project by Friday.,0
2,Win a free iPhone now!!! Click here.,1
3,Can you send me the report when it's ready?,0
4,Meeting has been rescheduled to next Monday.,0


## ✂ Dataset Split
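*   ### Train: 81%
*   ### Val: 9%
*   ### Test: 10%

The validation set is carved out of the 90% that remains after the test split, so it makes up 9% of the full dataset.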

In [7]:
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df["email_text"].tolist(),
    df["label"].tolist(),
    test_size=0.1,
    random_state=1
)

train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_texts,
    train_labels,
    test_size=0.1,
    random_state=1
)
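
As a quick sanity check on the two chained splits, the resulting sizes can be worked out by hand. The sketch below assumes the dataset has 1000 rows (inferred from the 810 + 90 + 100 row counts shown in the outputs) and that the held-out count is rounded up, which is how scikit-learn treats a fractional `test_size` as far as we know. `split_sizes` is a hypothetical helper written for this illustration.

```python
def split_sizes(n_rows, test_pct=10, val_pct=10):
    """Sizes after two chained splits; percentages are integers to avoid float rounding."""
    # Assumption: the held-out portion is rounded up (ceil), the rest goes to training.
    n_test = -(-n_rows * test_pct // 100)
    n_remaining = n_rows - n_test
    n_val = -(-n_remaining * val_pct // 100)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


train_n, val_n, test_n = split_sizes(1000)
print(train_n, val_n, test_n)
```

Because the second split only sees the remaining 90%, the validation set ends up smaller than the test set.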

In [8]:
train_data = Dataset.from_dict({"text": train_texts, "label": train_labels})
val_data = Dataset.from_dict({"text": val_texts, "label": val_labels})

In [9]:
train_data

Dataset({
    features: ['text', 'label'],
    num_rows: 810
})

In [10]:
model_name = "Qwen/Qwen3-0.6B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.config.pad_token_id = model.config.eos_token_id

Some weights of Qwen3ForSequenceClassification were not initialized from the model checkpoint at Qwen/Qwen3-0.6B and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [11]:
def tokenize_func(example):
    return tokenizer(
        example['text'],
        padding='max_length',
        truncation=True,
        max_length=128
    )

train_data = train_data.map(tokenize_func, batched=True)
val_data = val_data.map(tokenize_func, batched=True)

Map:   0%|          | 0/810 [00:00<?, ? examples/s]

Map:   0%|          | 0/90 [00:00<?, ? examples/s]

In [12]:
train_data.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
val_data.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

In [13]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [14]:
def compute_metrics(p: EvalPrediction):
    preds = np.argmax(p.predictions, axis=1)
    acc = accuracy_score(p.label_ids, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(p.label_ids, preds, average="binary")
    return {
        "accuracy": acc,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }
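
To see what `average="binary"` reports, here is a small hand-rolled version of the same three metrics for the positive class (spam = 1). It is only a sketch for intuition; `binary_prf` and the toy labels are made up for this example, and the notebook itself keeps using scikit-learn.

```python
def binary_prf(labels, preds, positive=1):
    """Precision, recall, and F1 for the positive class only."""
    tp = sum(1 for y, p in zip(labels, preds) if y == positive and p == positive)
    fp = sum(1 for y, p in zip(labels, preds) if y != positive and p == positive)
    fn = sum(1 for y, p in zip(labels, preds) if y == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


toy_labels = [0, 0, 1, 1, 1, 0]
toy_preds = [0, 1, 1, 1, 0, 0]  # one false positive, one false negative
toy_precision, toy_recall, toy_f1 = binary_prf(toy_labels, toy_preds)
print(toy_precision, toy_recall, toy_f1)
```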

## 🏃 Training time

Here we define our training arguments and set up our trainer.

In [15]:
training_args = TrainingArguments(
    output_dir="./qwen_fft",
    eval_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=3e-5,
    weight_decay=0.01,
    logging_steps=10,
    logging_dir="./fft_logs",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    save_total_limit=1,
    report_to="none",
    fp16=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
    processing_class=tokenizer,  # replaces the deprecated `tokenizer=` argument
    data_collator=data_collator,
    compute_metrics=compute_metrics
)


In [16]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
0,0.0,0.0,1.0,1.0,1.0,1.0
1,0.0,0.0,1.0,1.0,1.0,1.0
2,0.0,0.0,1.0,1.0,1.0,1.0


TrainOutput(global_step=150, training_loss=0.10195042242606481, metrics={'train_runtime': 453.3016, 'train_samples_per_second': 5.361, 'train_steps_per_second': 0.331, 'total_flos': 817285879037952.0, 'train_loss': 0.10195042242606481, 'epoch': 2.9852216748768474})
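
The `global_step=150` in the output follows from the batch settings above. The arithmetic below is a sketch that assumes a single GPU and that a trailing partial accumulation window does not count as an optimizer step, which is what the logged step count suggests; other Trainer versions may round differently.

```python
train_rows = 810
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_train_epochs = 3

# Examples that contribute to one optimizer update (single GPU assumed)
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Mini-batches per epoch, rounding up for the final partial batch
batches_per_epoch = -(-train_rows // per_device_train_batch_size)

# Assumption: floor division, i.e. an incomplete accumulation window is not a step
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_updates = updates_per_epoch * num_train_epochs

print(effective_batch_size, batches_per_epoch, updates_per_epoch, total_updates)
```

So each weight update sees 16 examples, even though only 4 are on the GPU at a time.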

## 💾 Memory Usage During Full Fine-Tuning

Peak Allocated: ~11.4 GB

- This is the memory actually occupied by tensors at the peak: model parameters, activations, gradients, and optimizer states.

Peak Reserved: ~11.7 GB

- This is everything PyTorch's caching allocator holds, including cached blocks kept for future allocations.

## 🤓 Observations:

High peak allocated: ~11.4 GB
- Full fine-tuning is memory-intensive: all model weights, their gradients, and the optimizer states are stored and updated.

Reserved close to allocated: ~11.7 GB
- PyTorch caches freed blocks instead of returning them to the GPU, which speeds up later allocations.

GPU usage efficiency: High
- Almost all of the reserved memory is actively used, so little is lost to fragmentation.
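
A rough back-of-the-envelope estimate shows where most of that memory goes. The sketch below assumes fp32 weights, fp32 gradients, and an AdamW-style optimizer with two fp32 states per parameter (16 bytes per parameter in total). These are assumptions about the setup, not measurements. The parameter count is derived from the figures PEFT prints later in this notebook (all params minus trainable params).

```python
# Frozen (base) parameter count: 600,641,536 total - 4,589,568 trainable,
# taken from the print_trainable_parameters() output in the LoRA section.
base_params = 600_641_536 - 4_589_568

BYTES_WEIGHTS = 4    # assumption: weights kept in fp32
BYTES_GRADS = 4      # assumption: fp32 gradients
BYTES_OPTIMIZER = 8  # assumption: AdamW keeps two fp32 moments per parameter

bytes_per_param = BYTES_WEIGHTS + BYTES_GRADS + BYTES_OPTIMIZER
static_estimate_mb = base_params * bytes_per_param / 1024**2

measured_peak_mb = 11392.23  # peak allocated reported for full fine-tuning
remaining_mb = measured_peak_mb - static_estimate_mb  # activations, buffers, etc.

print(f"weights + grads + optimizer: {static_estimate_mb:.0f} MB")
print(f"left for activations and buffers: {remaining_mb:.0f} MB")
```

Under these assumptions, roughly four fifths of the peak comes from per-parameter state, which is exactly the part LoRA shrinks by training far fewer parameters.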



In [17]:
print(f"[Peak Allocated]: {torch.cuda.max_memory_allocated() / 1024**2:.2f} MB")
print(f"[Peak Reserved]:  {torch.cuda.max_memory_reserved() / 1024**2:.2f} MB")

[Peak Allocated]: 11392.23 MB
[Peak Reserved]:  11732.00 MB


In [18]:
test_dataset = Dataset.from_dict({
    "text": test_texts,
    "label": test_labels
})

def tokenize_fn_test(example):
    return tokenizer(example["text"], padding="max_length", truncation=True, max_length=128)

test_dataset = test_dataset.map(tokenize_fn_test, batched=True)
test_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

In [19]:
from sklearn.metrics import accuracy_score, classification_report

output = trainer.predict(test_dataset)
preds = output.predictions.argmax(axis=-1)
labels = output.label_ids

accuracy = accuracy_score(labels, preds)
precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")

In [20]:
print("Fine-tuned Model Performance:")
print(f"Accuracy : {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall   : {recall:.4f}")
print(f"F1 Score : {f1:.4f}")

print("Classification Report:")
print(classification_report(labels, preds, target_names=["ham", "spam"]))

Fine-tuned Model Performance:
Accuracy : 1.0000
Precision: 1.0000
Recall   : 1.0000
F1 Score : 1.0000
Classification Report:
              precision    recall  f1-score   support

         ham       1.00      1.00      1.00        55
        spam       1.00      1.00      1.00        45

    accuracy                           1.00       100
   macro avg       1.00      1.00      1.00       100
weighted avg       1.00      1.00      1.00       100



# 🤖 LoRA Finetuning Time

First, let's reload the model and tokenizer and redefine compute_metrics, exactly as in the full fine-tuning run.

In [10]:
model_name = "Qwen/Qwen3-0.6B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
model.config.pad_token_id = model.config.eos_token_id

Some weights of Qwen3ForSequenceClassification were not initialized from the model checkpoint at Qwen/Qwen3-0.6B and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [11]:
def tokenize_func(example):
    return tokenizer(
        example['text'],
        padding='max_length',
        truncation=True,
        max_length=128
    )

train_data = train_data.map(tokenize_func, batched=True)
val_data = val_data.map(tokenize_func, batched=True)

Map:   0%|          | 0/810 [00:00<?, ? examples/s]

Map:   0%|          | 0/90 [00:00<?, ? examples/s]

In [12]:
train_data.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
val_data.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

In [13]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [14]:
def compute_metrics(p: EvalPrediction):
    preds = np.argmax(p.predictions, axis=1)
    acc = accuracy_score(p.label_ids, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(p.label_ids, preds, average="binary")
    return {
        "accuracy": acc,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }

## 📎 LoRA's Rank and scaling factor Alpha

In LoRA, instead of updating the full weight matrix $ W \in \mathbb{R}^{d \times k} $, we add a low-rank update:


$ \Delta W = \frac{\alpha}{r} AB, \quad A \in \mathbb{R}^{d \times r}, \quad B \in \mathbb{R}^{r \times k} $

- r (rank): Controls the capacity of the low-rank adaptation. Higher r = more expressive adapter.

- α (alpha): A scaling factor applied to the low-rank product. A common rule of thumb is $ \alpha = 2r $, which tends to keep training stable and update magnitudes comparable across different ranks.
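
The formula can be sketched in a few lines of plain Python with tiny matrices. This is only an illustration of the math, not how peft implements it. The shapes (d = 3, k = 2, r = 1), the values, and the helpers `matmul` and `lora_delta` are made up for the example, and the zero initialization of B is the usual convention we assume here.

```python
def matmul(X, Y):
    """Plain-Python matrix product of X (m x n) and Y (n x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_delta(A, B, alpha, r):
    """Delta W = (alpha / r) * A @ B, with A of shape d x r and B of shape r x k."""
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(A, B)]


r, alpha = 1, 2
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # frozen pretrained weight, d x k = 3 x 2
A = [[0.5], [-0.5], [1.0]]                # trainable, d x r
B_init = [[0.0, 0.0]]                     # trainable, r x k; assumed zero at the start
B_trained = [[1.0, 2.0]]                  # hypothetical values after some training

delta_init = lora_delta(A, B_init, alpha, r)
delta_trained = lora_delta(A, B_trained, alpha, r)
W_adapted = [[w + dw for w, dw in zip(w_row, d_row)]
             for w_row, d_row in zip(W, delta_trained)]
print(delta_init)
print(W_adapted)
```

With B at zero the update vanishes, so the adapted model starts out identical to the pretrained one. Only A and B receive gradients while W stays frozen.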


## Impact on Performance
- Lower ranks (e.g. r = 8–64): May underperform full fine-tuning on more complex tasks.
  - Can produce "intruder dimensions" (reference: https://arxiv.org/pdf/2410.21228); this can be mitigated with a well-chosen alpha, commonly α = 2r.

- Higher ranks (e.g. r = 256–1024): More closely approximate full fine-tuning performance.
  - Reduce or eliminate intruder dimensions, better preserve the pretraining structure, and improve generalization.


## Memory and Time
Given a weight matrix $ W \in \mathbb{R}^{d \times k} $, let d = k = 4096.

Full fine-tuning trains the whole matrix:

- Full FT: 4096 × 4096 ≈ 16.8M parameters.

LoRA with r = 64 trains:
- 4096 × 64 + 64 × 4096 ≈ 0.5M parameters (~3% of full).
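
The same numbers, worked out explicitly (a tiny sketch of the arithmetic above, using the d, k, and r from this section):

```python
d = k = 4096
r = 64

full_ft_params = d * k       # every entry of W is trainable
lora_params = d * r + r * k  # only A (d x r) and B (r x k) are trainable
lora_fraction = lora_params / full_ft_params

print(f"full FT: {full_ft_params:,}")
print(f"LoRA   : {lora_params:,} ({lora_fraction:.2%} of full)")
```

For a square matrix the fraction simplifies to 2r / d, so doubling the rank doubles the trainable share.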


### TLDR:
- Lower ranks 'forget' less and act as an implicit regularizer.
- LoRA with commonly used low-rank settings can underperform full fine-tuning on harder tasks.
- Higher ranks perform closer to full fine-tuning, but use more resources.


In [15]:
from peft import (
    LoraConfig,
    get_peft_model
)

## 🤓 Observe the Model's architecture.

We have several modules we can target for LoRA, namely:

- q_proj: query projection
- k_proj: key projection
- v_proj: value projection
- o_proj: output projection

In [16]:
print(model)

Qwen3ForSequenceClassification(
  (model): Qwen3Model(
    (embed_tokens): Embedding(151936, 1024)
    (layers): ModuleList(
      (0-27): 28 x Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
          (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear(in_features=1024, out_features=3072, bias=False)
          (up_proj): Linear(in_features=1024, out_features=3072, bias=False)
          (down_proj): Linear(in_features=3072, out_features=1024, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((1024,), eps=1e-06)
        (post_attention_l

## 🧰 LoRA Config

Here we set up our LoRA configuration.

In the config we define the rank, lora_alpha, dropout, target_modules, etc.

A common recommendation is lora_alpha = 2r with all linear layers as target modules. To keep things simple, the config below targets only the four attention projections.

In [17]:
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="SEQ_CLS"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

trainable params: 4,589,568 || all params: 600,641,536 || trainable%: 0.7641
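
The trainable-parameter count can be reproduced by hand from the layer shapes printed earlier. The sketch below assumes that, besides the LoRA matrices, the classification head (`score`, 1024 × 2 weights) also stays trainable for the `SEQ_CLS` task type. That assumption makes the numbers match the output above, but treat it as our reading rather than a statement about peft internals. `lora_param_count` is a helper written for this example.

```python
RANK = 16
NUM_LAYERS = 28

# (in_features, out_features) of each targeted projection, read off print(model)
ATTENTION_SHAPES = {
    "q_proj": (1024, 2048),
    "k_proj": (1024, 1024),
    "v_proj": (1024, 1024),
    "o_proj": (2048, 1024),
}


def lora_param_count(in_features, out_features, rank):
    """lora_A has in_features x rank weights, lora_B has rank x out_features."""
    return in_features * rank + rank * out_features


per_layer = sum(lora_param_count(i, o, RANK) for i, o in ATTENTION_SHAPES.values())
adapter_params = per_layer * NUM_LAYERS

# Assumption: the score head (hidden size 1024, 2 labels, no bias) is also trainable
score_head_params = 1024 * 2

trainable = adapter_params + score_head_params
trainable_pct = 100 * trainable / 600_641_536  # total taken from the output above
print(f"{trainable:,} trainable ({trainable_pct:.4f}%)")
```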


## 🤓 Observe the Attention layers

Our model has been successfully wrapped with LoRA adapters using the peft library!

We can see that we have indeed injected LoRA adapters into the following attention modules:

- q_proj: query projection
- k_proj: key projection
- v_proj: value projection
- o_proj: output projection

Also observe the out_features of lora_A and the in_features of lora_B: both equal the rank r = 16, which matches the LoRA formulation


$ \Delta W = \frac{\alpha}{r} AB, \quad A \in \mathbb{R}^{d \times r}, \quad B \in \mathbb{R}^{r \times k} $

with d = 1024, k = 2048, r = 16, and α = 32 for q_proj.

In [18]:
print(model)

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): Qwen3ForSequenceClassification(
      (model): Qwen3Model(
        (embed_tokens): Embedding(151936, 1024)
        (layers): ModuleList(
          (0-27): 28 x Qwen3DecoderLayer(
            (self_attn): Qwen3Attention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=1024, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=1024, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
        

## 🏃 Training time

We'll be using the same training args to maintain consistency with our full fine-tuning.

In [19]:
training_args = TrainingArguments(
    output_dir="./qwen_lora",
    eval_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=3e-5,
    weight_decay=0.01,
    logging_steps=10,
    logging_dir="./lora_logs",
    load_best_model_at_end=True,
    metric_for_best_model="eval_f1",
    save_total_limit=1,
    report_to="none",
    fp16=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
    processing_class=tokenizer,  # replaces the deprecated `tokenizer=` argument
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [20]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
0,0.0002,0.000131,1.0,1.0,1.0,1.0
1,0.0,4.7e-05,1.0,1.0,1.0,1.0
2,0.0,4e-05,1.0,1.0,1.0,1.0


TrainOutput(global_step=150, training_loss=0.09600905820727348, metrics={'train_runtime': 142.5846, 'train_samples_per_second': 17.043, 'train_steps_per_second': 1.052, 'total_flos': 825801767387136.0, 'train_loss': 0.09600905820727348, 'epoch': 2.9852216748768474})

## 💾 Memory Usage Comparison: Full Fine-Tuning vs LoRA

| Metric         | Full Fine-Tuning | LoRA Fine-Tuning |
|----------------|------------------|------------------|
| Peak Allocated | ~11.4 GB         | ~4.3 GB          |
| Peak Reserved  | ~11.7 GB         | ~4.4 GB          |
| Reduction      | -                | ~62%             |


We saved quite a lot of memory! 😃

In [21]:
print(f"[Peak Allocated]: {torch.cuda.max_memory_allocated() / 1024**2:.2f} MB")
print(f"[Peak Reserved]:  {torch.cuda.max_memory_reserved() / 1024**2:.2f} MB")

[Peak Allocated]: 4319.50 MB
[Peak Reserved]:  4366.00 MB
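
The reduction can be computed directly from the numbers printed in the two runs. The snippet below copies the logged values by hand (peak memory in MB and `train_runtime` in seconds) and uses a hypothetical `reduction_pct` helper.

```python
# Values copied from the outputs of the two runs in this notebook
full_ft = {"allocated_mb": 11392.23, "reserved_mb": 11732.00, "runtime_s": 453.3016}
lora = {"allocated_mb": 4319.50, "reserved_mb": 4366.00, "runtime_s": 142.5846}


def reduction_pct(before, after):
    """Relative saving of `after` compared to `before`, in percent."""
    return 100 * (before - after) / before


allocated_reduction = reduction_pct(full_ft["allocated_mb"], lora["allocated_mb"])
reserved_reduction = reduction_pct(full_ft["reserved_mb"], lora["reserved_mb"])
speedup = full_ft["runtime_s"] / lora["runtime_s"]

print(f"Peak allocated reduction: {allocated_reduction:.1f}%")
print(f"Peak reserved reduction : {reserved_reduction:.1f}%")
print(f"Training speedup        : {speedup:.2f}x")
```

These figures are only comparable if each run started from a fresh process or with the peak counters reset, since the counters report the maximum since the process began. The restarted cell numbering suggests this was the case here.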


### 🔍 Accuracy on Test Data

In [22]:
test_dataset = Dataset.from_dict({
    "text": test_texts,
    "label": test_labels
})

def tokenize_fn_test(example):
    return tokenizer(example["text"], padding="max_length", truncation=True, max_length=128)

test_dataset = test_dataset.map(tokenize_fn_test, batched=True)
test_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

In [23]:
from sklearn.metrics import accuracy_score, classification_report

output = trainer.predict(test_dataset)
preds = output.predictions.argmax(axis=-1)
labels = output.label_ids

accuracy = accuracy_score(labels, preds)
precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")

## 🎯 Performance Comparison

Both full fine-tuning and LoRA achieved perfect accuracy on the test set. This is likely because spam classification on this dataset is a relatively simple task that Qwen3-0.6B handles with ease. Given the straightforward inputs and the small dataset (only 100 test examples), even a lightweight method like LoRA is enough to match full fine-tuning here.

In [24]:
print("Fine-tuned Model Performance:")
print(f"Accuracy : {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall   : {recall:.4f}")
print(f"F1 Score : {f1:.4f}")

print("Classification Report:")
print(classification_report(labels, preds, target_names=["ham", "spam"]))

Fine-tuned Model Performance:
Accuracy : 1.0000
Precision: 1.0000
Recall   : 1.0000
F1 Score : 1.0000
Classification Report:
              precision    recall  f1-score   support

         ham       1.00      1.00      1.00        55
        spam       1.00      1.00      1.00        45

    accuracy                           1.00       100
   macro avg       1.00      1.00      1.00       100
weighted avg       1.00      1.00      1.00       100

