# 🔬 LoRA vs QLoRA vs DoRA: Math & Comparison

This notebook explains the mathematical foundations of each adapter method and compares them experimentally.

---

## The Big Picture

All three methods solve the same problem: **How to fine-tune a large model efficiently?**

```
Full Fine-tuning:     Update ALL parameters (expensive!)
                      7B model = 14 GB of FP16 gradients alone, plus optimizer states

Adapter Methods:      Update SMALL adapter, freeze base model
                      7B model = ~50MB of trainable parameters
```

| Method | Year | Key Innovation | Memory | Quality |
|--------|------|----------------|--------|--------|
| LoRA | 2021 | Low-rank decomposition | Medium | Baseline |
| QLoRA | 2023 | 4-bit quantized base | Very Low | Close to LoRA |
| DoRA | 2024 | Magnitude/Direction split | Medium | Slightly above LoRA |

---

# Part 1: The Mathematics

---

## 1.1 LoRA: Low-Rank Adaptation

### The Problem

A weight matrix in a transformer might be:

$$W \in \mathbb{R}^{d \times d}$$

For a typical model with $d = 4096$:

$$\text{Parameters} = 4096 \times 4096 = 16,777,216 \text{ per layer!}$$

### The LoRA Insight

LoRA's hypothesis is that the weight **change** during fine-tuning ($\Delta W$) has low intrinsic rank.

Instead of learning full $\Delta W$, factorize it:

$$\Delta W = B \cdot A$$

Where:
- $A \in \mathbb{R}^{r \times d}$ (down-projection)
- $B \in \mathbb{R}^{d \times r}$ (up-projection)
- $r \ll d$ (typically $r = 8$ or $16$)

### Parameter Savings

$$\text{Full: } d \times d = 4096^2 = 16.7M$$
$$\text{LoRA: } r \times d + d \times r = 2 \times r \times d = 2 \times 8 \times 4096 = 65K$$

**Reduction: 256× fewer parameters!**
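
The arithmetic can be checked in a few lines of Python. This is a minimal sketch; `full_params` and `lora_params` are illustrative helper names, not part of any library.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters in a dense weight matrix of shape (d_out, d_in)."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the LoRA factors A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r


d, r = 4096, 8
full = full_params(d, d)
lora = lora_params(d, d, r)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")
```

Because the LoRA count grows linearly in $d$ while the full count grows quadratically, the ratio $d / (2r)$ improves as layers get wider.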

### Forward Pass

$$h = W_{frozen} \cdot x + \frac{\alpha}{r} \cdot B \cdot A \cdot x$$

```
Input x ─────┬────────────────────────────────┐
             │                                │
             ▼                                ▼
      ┌─────────────┐                  ┌─────────────┐
      │  W (frozen) │                  │    A        │
      │  d × d      │                  │    r × d    │
      └──────┬──────┘                  └──────┬──────┘
             │                                │
             │                                ▼
             │                         ┌─────────────┐
             │                         │    B        │
             │                         │    d × r    │
             │                         └──────┬──────┘
             │                                │
             │                                ▼
             │                         ┌─────────────┐
             │                         │   × α/r     │
             │                         └──────┬──────┘
             │                                │
             └───────────► + ◄────────────────┘
                           │
                           ▼
                       Output h
```

### Initialization

- $A$ ~ $\mathcal{N}(0, \sigma^2)$ (Gaussian)
- $B = 0$ (zero matrix)

This ensures $\Delta W = B \cdot A = 0$ at start, so the model begins identical to pretrained.
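
The forward pass and the zero initialization can be sketched in plain Python. The toy sizes, the helper names `matvec` and `lora_forward`, and the random values standing in for a trained $B$ are all assumptions made for illustration.

```python
import math
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x); the d x d product B A is never formed."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d, r, alpha = 6, 2, 4  # toy sizes, chosen only for illustration

W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen
A = [[random.gauss(0, 0.02) for _ in range(d)] for _ in range(r)]  # Gaussian init
B = [[0.0] * r for _ in range(d)]                                  # zero init
x = [random.gauss(0, 1) for _ in range(d)]

h_start = lora_forward(W, A, B, x, alpha, r)
same_as_base = all(math.isclose(a, b) for a, b in zip(h_start, matvec(W, x)))
print("identical to pretrained at init:", same_as_base)

# After training B is non-zero; emulate that with arbitrary values.
B = [[random.gauss(0, 0.1) for _ in range(r)] for _ in range(d)]
h_trained = lora_forward(W, A, B, x, alpha, r)

# Merging: W' = W + (alpha / r) * B A gives the same output with one matmul.
scale = alpha / r
W_merged = [
    [W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r)) for j in range(d)]
    for i in range(d)
]
h_merged = matvec(W_merged, x)
merge_matches = all(
    math.isclose(a, b, abs_tol=1e-9) for a, b in zip(h_trained, h_merged)
)
print("merged weights match adapter path:", merge_matches)
```

The second check shows why a LoRA adapter adds no inference cost once it is merged into the base weight.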

---

## 1.2 QLoRA: Quantized Low-Rank Adaptation

### The Problem

LoRA still requires the base model in memory:

$$\text{7B model in FP16} = 7 \times 10^9 \times 2 \text{ bytes} = 14 \text{ GB}$$

**On most consumer GPUs that leaves little or no room for activations and adapters!**

### The QLoRA Solution

Store base model in 4-bit precision:

$$\text{7B model in NF4} = 7 \times 10^9 \times 0.5 \text{ bytes} = 3.5 \text{ GB}$$

### What is Quantization?

Map continuous values to discrete levels:

$$Q(w) = \text{round}\left(\frac{w - \text{min}}{\text{max} - \text{min}} \times (2^b - 1)\right)$$

For 4-bit: $2^4 = 16$ possible values.
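
A minimal sketch of this min-max formula and its inverse, assuming the input is not constant (so that max > min). The function names and sample weights are illustrative.

```python
def quantize(values, bits=4):
    """Uniform (min-max) quantization to integer codes in [0, 2**bits - 1]."""
    lo, hi = min(values), max(values)  # assumes hi > lo
    levels = 2 ** bits - 1
    codes = [round((v - lo) / (hi - lo) * levels) for v in values]
    return codes, lo, hi


def dequantize(codes, lo, hi, bits=4):
    """Map integer codes back to approximate real values."""
    levels = 2 ** bits - 1
    return [lo + c / levels * (hi - lo) for c in codes]


weights = [-0.42, -0.10, 0.0, 0.03, 0.27, 0.61]
codes, lo, hi = quantize(weights)
restored = dequantize(codes, lo, hi)

step = (hi - lo) / (2 ** 4 - 1)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(f"max round-trip error {max_err:.4f} (half a step is {step / 2:.4f})")
```

The round-trip error is bounded by half a quantization step, which is why the spacing of the levels matters so much at 4 bits.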

### NF4: Normal Float 4-bit

Neural network weights approximately follow a normal distribution:

$$w \sim \mathcal{N}(0, \sigma^2)$$

NF4 places quantization levels at **quantiles** of the normal distribution. In simplified form:

$$q_i = \Phi^{-1}\left(\frac{i + 0.5}{16}\right)$$

Where $\Phi^{-1}$ is the inverse CDF of standard normal.

```
Standard 4-bit (uniform):        NF4 (normal-optimized):

│ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │    │││││  │  │  │  │  │││││
└─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘    └┴┴┴┴──┴──┴──┴──┴──┴┴┴┴┘
-8      0       +8                 -3σ  -σ   0   +σ  +3σ

Uniform spacing                   More levels near 0 (where most weights are)
```
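
The quantile formula can be evaluated with the standard library. This sketch follows the simplified expression $q_i = \Phi^{-1}((i + 0.5)/16)$ only; the NF4 codebook described in the QLoRA paper differs in detail (it is rescaled to $[-1, 1]$ and includes an exact zero), so the numbers printed here are not the real NF4 levels.

```python
from statistics import NormalDist


def normal_quantile_levels(bits=4):
    """Levels q_i = Phi^{-1}((i + 0.5) / 2**bits) for i = 0 .. 2**bits - 1."""
    n = 2 ** bits
    std_normal = NormalDist(mu=0.0, sigma=1.0)
    return [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]


levels = normal_quantile_levels()
gaps = [b - a for a, b in zip(levels, levels[1:])]
print([round(q, 3) for q in levels])
print(f"gap near zero: {gaps[7]:.3f}, gap at the tail: {gaps[0]:.3f}")
```

The gap between the two central levels is several times smaller than the gap at the tail, which is the "more levels near 0" effect shown in the diagram.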

### Block-wise Quantization

Weights are quantized in blocks of 64:

$$W_{quantized}[i] = Q\left(\frac{W[i]}{s_b}\right)$$

Where $s_b$ is the scale factor for block $b$ (in QLoRA, the block's absolute maximum, so the scaled values fall in $[-1, 1]$).
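
A sketch of block-wise quantization with one absmax scale per block of 64. To keep it short it uses uniform signed integer codes in $[-7, 7]$ instead of the NF4 codebook; that substitution and the helper names are assumptions of this example.

```python
import random

BLOCK_SIZE = 64
MAX_CODE = 7  # signed 4-bit style codes in [-7, 7]; an assumption for this sketch


def quantize_blockwise(weights, block_size=BLOCK_SIZE):
    """Return integer codes plus one absmax scale per block."""
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        scale = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
        scales.append(scale)
        codes.extend(round(w / scale * MAX_CODE) for w in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=BLOCK_SIZE):
    """Rebuild approximate weights from codes and per-block scales."""
    return [c / MAX_CODE * scales[i // block_size] for i, c in enumerate(codes)]


def block_error(b):
    """Largest absolute reconstruction error inside block b."""
    idx = range(b * BLOCK_SIZE, (b + 1) * BLOCK_SIZE)
    return max(abs(weights[i] - restored[i]) for i in idx)


random.seed(0)
weights = [random.gauss(0, 0.02) for _ in range(256)]
weights[10] = 1.5  # a single outlier in the first block

codes, scales = quantize_blockwise(weights)
restored = dequantize_blockwise(codes, scales)

print("scales:", [round(s, 3) for s in scales])
print("max error per block:", [round(block_error(b), 4) for b in range(4)])
```

Only the block containing the outlier gets a coarse grid. With a single scale for the whole tensor, every weight would inherit that coarse grid.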

### Double Quantization

QLoRA also quantizes the scale factors:

```
Level 1: Weights → 4-bit (NF4)
Level 2: Scale factors → 8-bit (FP8)
```
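
The saving from double quantization is easiest to see as overhead in bits per weight. The sketch below assumes 32-bit first-level scales, a block size of 64, and second-level blocks of 256 scales, which are the values described in the QLoRA paper; the function names are illustrative.

```python
def scale_overhead_bits(block_size=64, scale_bits=32):
    """Extra bits per weight when each block stores one scale factor."""
    return scale_bits / block_size


def double_quant_overhead_bits(block_size=64, second_block_size=256,
                               first_scale_bits=8, second_scale_bits=32):
    """Overhead when the scales are themselves quantized in blocks."""
    first_level = first_scale_bits / block_size
    second_level = second_scale_bits / (block_size * second_block_size)
    return first_level + second_level


plain = scale_overhead_bits()
double = double_quant_overhead_bits()
print(f"without double quantization: {plain:.3f} bits per weight")
print(f"with double quantization:    {double:.3f} bits per weight")
```

Under these assumptions the overhead drops from 0.500 to about 0.127 bits per weight, roughly 0.3 GB for a 7B model.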

### Forward Pass in QLoRA

$$h = \text{Dequant}(W_{4bit}) \cdot x + \frac{\alpha}{r} \cdot B \cdot A \cdot x$$

```
Input x ─────┬────────────────────────────────┐
             │                                │
             ▼                                ▼
      ┌─────────────┐                  ┌─────────────┐
      │ W (4-bit)   │                  │    A (FP16) │
      └──────┬──────┘                  └──────┬──────┘
             │                                │
             ▼                                ▼
      ┌─────────────┐                  ┌─────────────┐
      │ Dequantize  │                  │    B (FP16) │
      │ to FP16     │                  └──────┬──────┘
      └──────┬──────┘                         │
             │                                │
             └───────────► + ◄────────────────┘
                           │
                           ▼
                       Output h
```

### Memory Breakdown

| Component | LoRA (FP16) | QLoRA (NF4) |
|-----------|-------------|-------------|
| Base model (7B) | 14.0 GB | 3.5 GB |
| LoRA adapters | 0.05 GB | 0.05 GB |
| Optimizer states | 0.1 GB | 0.1 GB |
| Activations | ~2 GB | ~2 GB |
| **Total** | **~16 GB** | **~6 GB** |
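
The base-model rows follow directly from bytes per parameter. A small sketch that reproduces them, taking the adapter, optimizer, and activation figures from the table as given (they are rough estimates, not measurements):

```python
def weights_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 7e9
OTHER_GB = {"adapters": 0.05, "optimizer": 0.1, "activations": 2.0}  # from the table

for name, bytes_per_param in [("LoRA (FP16)", 2.0), ("QLoRA (NF4)", 0.5)]:
    base = weights_gb(N_PARAMS, bytes_per_param)
    total = base + sum(OTHER_GB.values())
    print(f"{name:<12} base {base:4.1f} GB  total {total:5.2f} GB")
```

The activation term depends on batch size and sequence length, so in practice it is the least predictable part of the estimate.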

---

## 1.3 DoRA: Weight-Decomposed Low-Rank Adaptation

### The Problem with LoRA

LoRA updates both **magnitude** and **direction** together:

$$W' = W + B \cdot A$$

But full fine-tuning can change them **independently**. This coupling limits LoRA's expressiveness.

### Vector Decomposition

Any vector can be written as:

$$\vec{v} = \|\vec{v}\| \cdot \frac{\vec{v}}{\|\vec{v}\|}$$

$$\vec{v} = \underbrace{m}_{\text{magnitude}} \cdot \underbrace{\hat{d}}_{\text{direction}}$$

### DoRA's Decomposition

Apply this to each **column** of the weight matrix. Right-multiplying by a diagonal matrix rescales columns, so the magnitudes go on the right:

$$W = D \cdot M$$

Where:
- $M = \text{diag}(\|W_{:,1}\|, \|W_{:,2}\|, ..., \|W_{:,d}\|)$ — magnitudes
- $D = W / \|W\|_{col}$ — directions (column-normalized)

### DoRA Update Rule

$$W' = m \cdot \frac{V + B \cdot A}{\|V + B \cdot A\|_c}$$

Where:
- $m \in \mathbb{R}^{d}$ — **learnable** magnitude vector
- $V = W / \|W\|_c$ — original direction (frozen)
- $B \cdot A$ — LoRA update for direction
- $\|\cdot\|_c$ — column-wise normalization
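
The update rule can be checked numerically on a toy matrix. This sketch follows the notation used here ($V$ as the column-normalized weight, $m$ as the column norms); the helper names and the 4×4 size are assumptions for illustration.

```python
import math
import random


def column_norms(W):
    """Euclidean norm of every column of W (list of rows)."""
    rows, cols = len(W), len(W[0])
    return [math.sqrt(sum(W[i][j] ** 2 for i in range(rows))) for j in range(cols)]


def dora_weight(m, V, delta):
    """W' = m * (V + delta) / ||V + delta||_c, applied column by column."""
    rows, cols = len(V), len(V[0])
    direction = [[V[i][j] + delta[i][j] for j in range(cols)] for i in range(rows)]
    norms = column_norms(direction)
    return [[m[j] * direction[i][j] / norms[j] for j in range(cols)]
            for i in range(rows)]


random.seed(0)
d = 4  # toy size for illustration
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

m = column_norms(W)                                          # magnitude vector
V = [[W[i][j] / m[j] for j in range(d)] for i in range(d)]   # unit-norm columns
zero = [[0.0] * d for _ in range(d)]                         # B A = 0 at init

W_init = dora_weight(m, V, zero)
recovers_w = all(math.isclose(W_init[i][j], W[i][j], abs_tol=1e-12)
                 for i in range(d) for j in range(d))
print("W' equals W at initialization:", recovers_w)

# Scaling one magnitude entry rescales that column without rotating it.
m_scaled = list(m)
m_scaled[0] *= 1.5
W_scaled = dora_weight(m_scaled, V, zero)
ratio = column_norms(W_scaled)[0] / column_norms(W)[0]
print(f"column 0 norm ratio: {ratio:.3f}")
```

As with LoRA, the model starts out identical to the pretrained one, and a change to a single entry of $m$ changes one column's length while leaving its direction untouched.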

### Visualization

```
LoRA:                              DoRA:

    Original W                         Original W
        │                                  │
        ▼                                  ▼
    ┌───────┐                      ┌──────────────┐
    │ + BA  │                      │ Decompose    │
    └───┬───┘                      │ m = ||W||    │
        │                          │ V = W/||W||  │
        ▼                          └──────┬───────┘
    W' = W + BA                           │
                                   ┌──────┴───────┐
    (magnitude &                   │              │
     direction                     ▼              ▼
     coupled)                  ┌───────┐    ┌──────────┐
                               │ m     │    │ V + BA   │
                               │(learn)│    │ ────────  │
                               └───┬───┘    │ ||V+BA|| │
                                   │        └────┬─────┘
                                   │             │
                                   └──────┬──────┘
                                          │
                                          ▼
                                   W' = m · normalized

                                   (magnitude & direction
                                    independent!)
```

### Why This Helps

**Scenario:** Fine-tune to make feature #1 stronger, feature #2 weaker.

| Method | How it achieves this |
|--------|---------------------|
| LoRA | BA must encode magnitude changes as direction shifts (inefficient) |
| DoRA | $m_1 = 1.5$, $m_2 = 0.5$ — direct! BA focuses on actual direction changes |

### Parameter Count

$$\text{DoRA} = \text{LoRA} + d_{out}$$

For $d = 4096$, $r = 8$:
- LoRA: $2 \times 8 \times 4096 = 65,536$
- DoRA: $65,536 + 4,096 = 69,632$

**Only 6% more parameters!**

---

## 1.4 Summary: The Three Methods

| | LoRA | QLoRA | DoRA |
|--|------|-------|------|
| **Update formula** | $W + BA$ | $\text{Dequant}(W_{4b}) + BA$ | $m \cdot \frac{V + BA}{\|V + BA\|}$ |
| **Base model** | FP16 | NF4 (4-bit) | FP16 |
| **Adapters** | FP16 | FP16 | FP16 + magnitude |
| **Magnitude/Direction** | Coupled | Coupled | Decoupled |
| **Base weights (7B)** | ~14 GB | ~3.5 GB | ~14 GB |
| **Quality** | Baseline | Close to LoRA | Slightly above LoRA |

### When to Use Each

```
                    ┌─────────────────────────────────┐
                    │      Do you have enough VRAM?   │
                    └───────────────┬─────────────────┘
                                    │
                    ┌───────────────┴───────────────┐
                    │                               │
                    ▼                               ▼
               ┌────────┐                     ┌────────┐
               │   NO   │                     │  YES   │
               └────┬───┘                     └────┬───┘
                    │                              │
                    ▼                              ▼
              ┌──────────┐              ┌─────────────────────┐
              │  QLoRA   │              │ Need best quality?  │
              │  (4-bit) │              └──────────┬──────────┘
              └──────────┘                         │
                                       ┌───────────┴───────────┐
                                       │                       │
                                       ▼                       ▼
                                  ┌────────┐              ┌────────┐
                                  │  YES   │              │   NO   │
                                  └────┬───┘              └────┬───┘
                                       │                       │
                                       ▼                       ▼
                                 ┌──────────┐            ┌──────────┐
                                 │   DoRA   │            │   LoRA   │
                                 └──────────┘            └──────────┘
```

---

# Part 2: Experimental Comparison

Now let's run all three methods and compare!

---

## 2.1 Setup

In [1]:
!pip install -q transformers datasets peft trl accelerate bitsandbytes


In [2]:
import torch
import gc
import time
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer

# Check GPU
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"✅ GPU: {gpu_name}")
    print(f"✅ VRAM: {gpu_memory:.1f} GB")
else:
    print("❌ No GPU!")

✅ GPU: Tesla T4
✅ VRAM: 15.8 GB


## 2.2 Configuration

In [3]:
# ============================================================
# CONFIGURATION
# ============================================================

MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
DATASET_ID = "mlabonne/guanaco-llama2-1k"

# Training settings (same for all methods)
MAX_SEQ_LEN = 512  # not applied below: max_seq_length is commented out in each SFTTrainer call
BATCH_SIZE = 2
GRAD_ACCUM = 8
MAX_STEPS = 100  # Short runs for comparison
LEARNING_RATE = 2e-4

# LoRA settings (same for all methods)
LORA_R = 8
LORA_ALPHA = 16
LORA_DROPOUT = 0.05
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]

print("✅ Configuration loaded")

✅ Configuration loaded


In [4]:
# Load dataset once
dataset = load_dataset(DATASET_ID, split="train")
print(f"✅ Dataset: {len(dataset)} samples")

# Load tokenizer once
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
print(f"✅ Tokenizer loaded")

✅ Dataset: 1000 samples


✅ Tokenizer loaded


## 2.3 Helper Functions

In [5]:
def get_gpu_memory():
    """Get current GPU memory usage in GB."""
    if torch.cuda.is_available():
        return torch.cuda.memory_allocated() / 1e9
    return 0

def clear_memory():
    """Clear GPU memory."""
    gc.collect()
    torch.cuda.empty_cache()

def count_parameters(model):
    """Count trainable and total parameters."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

print("✅ Helper functions defined")

✅ Helper functions defined


---

## 2.4 Method 1: Standard LoRA

$$W' = W_{frozen} + \frac{\alpha}{r} \cdot B \cdot A$$

Base model in **FP16**, adapters in **FP16**.

In [6]:
print("="*60)
print("🔵 METHOD 1: Standard LoRA")
print("="*60)

clear_memory()
mem_before = get_gpu_memory()

# Load model in FP16
model_lora = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)

mem_after_load = get_gpu_memory()
print(f"\n📊 Memory after loading base model: {mem_after_load:.2f} GB")

# Apply LoRA
lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=TARGET_MODULES,
    use_dora=False,  # Standard LoRA
)

model_lora = get_peft_model(model_lora, lora_config)

mem_after_lora = get_gpu_memory()
trainable, total = count_parameters(model_lora)

print(f"📊 Memory after LoRA: {mem_after_lora:.2f} GB")
print(f"📊 Trainable params: {trainable:,} ({100*trainable/total:.2f}%)")
print(f"📊 Total params: {total:,}")

🔵 METHOD 1: Standard LoRA


`torch_dtype` is deprecated! Use `dtype` instead!



📊 Memory after loading base model: 1.00 GB
📊 Memory after LoRA: 1.00 GB
📊 Trainable params: 1,081,344 (0.22%)
📊 Total params: 495,114,112


In [7]:
# Train LoRA
print("\n🚀 Training LoRA...")

training_args_lora = TrainingArguments(
    output_dir="./outputs/lora",
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACCUM,
    max_steps=MAX_STEPS,
    learning_rate=LEARNING_RATE,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    fp16=True,
    logging_steps=20,
    save_strategy="no",
    report_to="none",
    optim="adamw_torch",
    seed=42,
)

trainer_lora = SFTTrainer(
    model=model_lora,
    args=training_args_lora,
    train_dataset=dataset,
    processing_class=tokenizer,
    # max_seq_length=MAX_SEQ_LEN,
)

start_time = time.time()
trainer_lora.train()
lora_time = time.time() - start_time

# The last log entry is the run summary; its 'train_loss' is the mean loss over all steps
lora_loss = trainer_lora.state.log_history[-1].get('train_loss', 'N/A')
print(f"\n✅ LoRA complete!")
print(f"   Time: {lora_time:.1f}s")
print(f"   Final loss: {lora_loss}")


🚀 Training LoRA...




The model is already on multiple devices. Skipping the move to device specified in `args`.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.


Step,Training Loss
20,1.8387
40,1.7184
60,1.7171
80,1.6606
100,1.6788



✅ LoRA complete!
   Time: 279.2s
   Final loss: 1.7227166748046876


In [8]:
# Store results and clean up
lora_results = {
    'method': 'LoRA',
    'memory_gb': mem_after_lora,
    'trainable_params': trainable,
    'time_seconds': lora_time,
    'final_loss': lora_loss,
}

del model_lora, trainer_lora
clear_memory()
print("🧹 Memory cleared")

🧹 Memory cleared


---

## 2.5 Method 2: QLoRA

$$W' = \text{Dequant}(W_{4bit}) + \frac{\alpha}{r} \cdot B \cdot A$$

Base model in **NF4 (4-bit)**, adapters in **FP16**.

In [9]:
print("="*60)
print("🟢 METHOD 2: QLoRA (4-bit)")
print("="*60)

clear_memory()
mem_before = get_gpu_memory()

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,              # Enable 4-bit
    bnb_4bit_quant_type="nf4",      # Use NF4 (normal float)
    bnb_4bit_compute_dtype=torch.float16,  # Compute in FP16
    bnb_4bit_use_double_quant=True, # Double quantization
)

# Load model in 4-bit
model_qlora = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)

mem_after_load = get_gpu_memory()
print(f"\n📊 Memory after loading 4-bit model: {mem_after_load:.2f} GB")

# Prepare for k-bit training
model_qlora = prepare_model_for_kbit_training(model_qlora)

# Apply LoRA
qlora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=TARGET_MODULES,
    use_dora=False,
)

model_qlora = get_peft_model(model_qlora, qlora_config)

mem_after_qlora = get_gpu_memory()
trainable, total = count_parameters(model_qlora)

print(f"📊 Memory after QLoRA: {mem_after_qlora:.2f} GB")
print(f"📊 Trainable params: {trainable:,} ({100*trainable/total:.2f}%)")
print(f"📊 Total params: {total:,}")

🟢 METHOD 2: QLoRA (4-bit)

📊 Memory after loading 4-bit model: 0.48 GB
📊 Memory after QLoRA: 0.75 GB
📊 Trainable params: 1,081,344 (0.34%)
📊 Total params: 316,200,832


In [12]:
# Train QLoRA
print("\n🚀 Training QLoRA...")

training_args_qlora = TrainingArguments(
    output_dir="./outputs/qlora",
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACCUM,
    max_steps=MAX_STEPS,
    learning_rate=LEARNING_RATE,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,  # fp16=True gives an error here
    logging_steps=20,
    save_strategy="no",
    report_to="none",
    optim="adamw_torch",
    seed=42,
)

trainer_qlora = SFTTrainer(
    model=model_qlora,
    args=training_args_qlora,
    train_dataset=dataset,
    processing_class=tokenizer,
    # max_seq_length=MAX_SEQ_LEN,
)

start_time = time.time()
trainer_qlora.train()
qlora_time = time.time() - start_time

# The last log entry is the run summary; its 'train_loss' is the mean loss over all steps
qlora_loss = trainer_qlora.state.log_history[-1].get('train_loss', 'N/A')
print(f"\n✅ QLoRA complete!")
print(f"   Time: {qlora_time:.1f}s")
print(f"   Final loss: {qlora_loss}")


🚀 Training QLoRA...



Step,Training Loss
20,1.9764
40,1.8735
60,1.8854
80,1.8188
100,1.8352



✅ QLoRA complete!
   Time: 413.6s
   Final loss: 1.877843017578125


In [13]:
# Store results and clean up
qlora_results = {
    'method': 'QLoRA',
    'memory_gb': mem_after_qlora,
    'trainable_params': trainable,
    'time_seconds': qlora_time,
    'final_loss': qlora_loss,
}

del model_qlora, trainer_qlora
clear_memory()
print("🧹 Memory cleared")

🧹 Memory cleared


---

## 2.6 Method 3: DoRA

$$W' = m \cdot \frac{V + B \cdot A}{\|V + B \cdot A\|_c}$$

Base model in **FP16**, adapters in **FP16**, plus **magnitude vector**.

In [14]:
print("="*60)
print("🟣 METHOD 3: DoRA")
print("="*60)

clear_memory()
mem_before = get_gpu_memory()

# Load model in FP16
model_dora = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)

mem_after_load = get_gpu_memory()
print(f"\n📊 Memory after loading base model: {mem_after_load:.2f} GB")

# Apply DoRA (just flip the flag!)
dora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=TARGET_MODULES,
    use_dora=True,  # ← Enable DoRA!
)

model_dora = get_peft_model(model_dora, dora_config)

mem_after_dora = get_gpu_memory()
trainable, total = count_parameters(model_dora)

print(f"📊 Memory after DoRA: {mem_after_dora:.2f} GB")
print(f"📊 Trainable params: {trainable:,} ({100*trainable/total:.2f}%)")
print(f"📊 Total params: {total:,}")

🟣 METHOD 3: DoRA

📊 Memory after loading base model: 1.75 GB
📊 Memory after DoRA: 1.76 GB
📊 Trainable params: 1,130,496 (0.23%)
📊 Total params: 495,163,264


In [16]:
# Train DoRA
print("\n🚀 Training DoRA...")

training_args_dora = TrainingArguments(
    output_dir="./outputs/dora",
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACCUM,
    max_steps=MAX_STEPS,
    learning_rate=LEARNING_RATE,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    fp16=True,
    logging_steps=20,
    save_strategy="no",
    report_to="none",
    optim="adamw_torch",
    seed=42,
)

trainer_dora = SFTTrainer(
    model=model_dora,
    args=training_args_dora,
    train_dataset=dataset,
    processing_class=tokenizer,
    # max_seq_length=MAX_SEQ_LEN,
)

start_time = time.time()
trainer_dora.train()
dora_time = time.time() - start_time

# The last log entry is the run summary; its 'train_loss' is the mean loss over all steps
dora_loss = trainer_dora.state.log_history[-1].get('train_loss', 'N/A')
print(f"\n✅ DoRA complete!")
print(f"   Time: {dora_time:.1f}s")
print(f"   Final loss: {dora_loss}")


🚀 Training DoRA...




Step,Training Loss
20,1.8371
40,1.7178
60,1.7168
80,1.6592
100,1.6772



✅ DoRA complete!
   Time: 293.7s
   Final loss: 1.7216070556640626


In [17]:
# Store results and clean up
dora_results = {
    'method': 'DoRA',
    'memory_gb': mem_after_dora,
    'trainable_params': trainable,
    'time_seconds': dora_time,
    'final_loss': dora_loss,
}

del model_dora, trainer_dora
clear_memory()
print("🧹 Memory cleared")

🧹 Memory cleared


---

# Part 3: Results Comparison

In [18]:
print("\n" + "="*70)
print("📊 COMPARISON RESULTS")
print("="*70)

results = [lora_results, qlora_results, dora_results]

print(f"\n{'Method':<10} {'Memory (GB)':<15} {'Trainable':<15} {'Time (s)':<12} {'Loss':<10}")
print("-"*70)

for r in results:
    print(f"{r['method']:<10} {r['memory_gb']:<15.2f} {r['trainable_params']:<15,} {r['time_seconds']:<12.1f} {r['final_loss']:<10}")


📊 COMPARISON RESULTS

Method     Memory (GB)     Trainable       Time (s)     Loss      
----------------------------------------------------------------------
LoRA       1.00            1,081,344       279.2        1.7227166748046876
QLoRA      0.75            1,081,344       413.6        1.877843017578125
DoRA       1.76            1,130,496       293.7        1.7216070556640626


In [19]:
# Visual comparison
print("\n" + "="*70)
print("📊 VISUAL COMPARISON")
print("="*70)

# Memory comparison
print("\n🧠 Memory Usage:")
max_mem = max(r['memory_gb'] for r in results)
for r in results:
    bar_len = int(30 * r['memory_gb'] / max_mem)
    bar = '█' * bar_len + '░' * (30 - bar_len)
    print(f"  {r['method']:<8} [{bar}] {r['memory_gb']:.2f} GB")

# Time comparison
print("\n⏱️ Training Time:")
max_time = max(r['time_seconds'] for r in results)
for r in results:
    bar_len = int(30 * r['time_seconds'] / max_time)
    bar = '█' * bar_len + '░' * (30 - bar_len)
    print(f"  {r['method']:<8} [{bar}] {r['time_seconds']:.1f}s")

# Parameters comparison
print("\n📦 Trainable Parameters:")
max_params = max(r['trainable_params'] for r in results)
for r in results:
    bar_len = int(30 * r['trainable_params'] / max_params)
    bar = '█' * bar_len + '░' * (30 - bar_len)
    print(f"  {r['method']:<8} [{bar}] {r['trainable_params']:,}")


📊 VISUAL COMPARISON

🧠 Memory Usage:
  LoRA     [█████████████████░░░░░░░░░░░░░] 1.00 GB
  QLoRA    [████████████░░░░░░░░░░░░░░░░░░] 0.75 GB
  DoRA     [██████████████████████████████] 1.76 GB

⏱️ Training Time:
  LoRA     [████████████████████░░░░░░░░░░] 279.2s
  QLoRA    [██████████████████████████████] 413.6s
  DoRA     [█████████████████████░░░░░░░░░] 293.7s

📦 Trainable Parameters:
  LoRA     [████████████████████████████░░] 1,081,344
  QLoRA    [████████████████████████████░░] 1,081,344
  DoRA     [██████████████████████████████] 1,130,496


## Analysis

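QLoRA's measured footprint is 25% below LoRA's (0.75 GB vs 1.00 GB). The saving is smaller than the 4× reduction of the quantized weights because the embeddings stay unquantized, and the jump from 0.48 GB to 0.75 GB is consistent with `prepare_model_for_kbit_training` upcasting the remaining non-quantized parameters to FP32.

DoRA's higher reading is not explained by the magnitude vectors: 49,152 extra FP16 values take about 0.1 MB. The DoRA base model already measured 1.75 GB right after loading, against 1.00 GB for the same model in the LoRA run, so the extra ~0.75 GB is most likely memory still allocated from the earlier QLoRA run. `mem_before` is recorded in each cell but never subtracted, and `memory_allocated()` reports the current allocation rather than the training peak.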

QLoRA is slower — this is expected!
The 4-bit weights must be dequantized on every forward pass:


    QLoRA Forward Pass:
    ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
    │ W (4-bit)   │ ──► │ Dequantize  │ ──► │ Compute     │
    │ in memory   │     │ to FP16     │     │ W × x       │
    └─────────────┘     └─────────────┘     └─────────────┘
                              ↑
                        Extra overhead!
      
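Trade-off: QLoRA saves memory but costs time. Dequantization is not the only difference in this run, though: `prepare_model_for_kbit_training` enables gradient checkpointing by default, and the QLoRA cell trains with `bf16=True` while the other two use `fp16=True`, so part of the 413.6s vs 279.2s gap may come from those settings.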

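    LoRA:  1,081,344
    QLoRA: 1,081,344  ← Same (just different base precision)
    DoRA:  1,130,496  ← +49,152 extra (magnitude vectors)

QLoRA's trainable share prints as 0.34% instead of 0.22% only because bitsandbytes packs two 4-bit weights into each stored byte, so `numel()` reports 316,200,832 total parameters for the same 495M-parameter model.

DoRA's extra params:

    Extra = 1,130,496 - 1,081,344 = 49,152

This is one magnitude entry per output feature of every target module.

    For Qwen2.5-0.5B with QKVO targets (output sizes 896, 128, 128, 896):
      24 layers × (896 + 128 + 128 + 896) = 24 × 2,048 = 49,152 ✓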
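
Both trainable-parameter counts reported above can be reproduced from the layer shapes. The shapes below are an assumption taken from the Qwen2.5-0.5B model config (hidden size 896, 2 key/value heads of size 64, 24 layers); the helper names are illustrative.

```python
# Input/output sizes of the attention projections in Qwen2.5-0.5B.
# Assumed from the model config (hidden size 896, 2 KV heads of size 64).
PROJ_SHAPES = {            # name: (d_in, d_out)
    "q_proj": (896, 896),
    "k_proj": (896, 128),
    "v_proj": (896, 128),
    "o_proj": (896, 896),
}
NUM_LAYERS = 24
R = 8


def lora_count(shapes, r, layers):
    """A is (r x d_in) and B is (d_out x r) for every target module."""
    return layers * sum(r * (d_in + d_out) for d_in, d_out in shapes.values())


def dora_extra(shapes, layers):
    """One magnitude entry per output feature of every target module."""
    return layers * sum(d_out for _, d_out in shapes.values())


lora = lora_count(PROJ_SHAPES, R, NUM_LAYERS)
extra = dora_extra(PROJ_SHAPES, NUM_LAYERS)
print(f"LoRA: {lora:,}  DoRA extra: {extra:,}  DoRA total: {lora + extra:,}")
```

The key and value projections are narrower than the query and output projections because the model uses grouped-query attention, so the four modules do not contribute equally.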

  

⚠️ Important Note:

These results are from 100 training steps on a small dataset, with a single seed.

With longer training:
- DoRA's advantage may grow
- QLoRA's gap may narrow (or widen)

The LoRA and DoRA losses (1.7227 vs 1.7216) are too close to rank from one run. QLoRA's higher loss (1.8778) is a clearer gap, but exact numbers will vary with more training.

---

# Part 4: Key Takeaways

## Mathematical Summary

| Method | Formula | Key Innovation |
|--------|---------|----------------|
| **LoRA** | $W' = W + BA$ | Low-rank factorization |
| **QLoRA** | $W' = \text{Dequant}(W_{4b}) + BA$ | 4-bit base + FP16 adapters |
| **DoRA** | $W' = m \cdot \frac{V+BA}{\|V+BA\|}$ | Magnitude/direction split |

## When to Use Each

| Situation | Recommendation |
|-----------|----------------|
| Plenty of VRAM, quick experiments | **LoRA** |
| Limited VRAM (consumer GPU) | **QLoRA** |
| Need best quality, have VRAM | **DoRA** |
| Production with memory constraints | **QLoRA** |

## Code Summary

```python
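from peft import LoraConfig
from transformers import BitsAndBytesConfig

# LoRA
LoraConfig(r=8, use_dora=False)

# QLoRA: 4-bit base model config plus a standard LoRA config
BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
LoraConfig(r=8, use_dora=False)

# DoRA
LoraConfig(r=8, use_dora=True)  # Just flip the flag!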
```

---

## References

- **LoRA**: [Hu et al., 2021](https://arxiv.org/abs/2106.09685) - "LoRA: Low-Rank Adaptation of Large Language Models"
- **QLoRA**: [Dettmers et al., 2023](https://arxiv.org/abs/2305.14314) - "QLoRA: Efficient Finetuning of Quantized LLMs"
- **DoRA**: [Liu et al., 2024](https://arxiv.org/abs/2402.09353) - "DoRA: Weight-Decomposed Low-Rank Adaptation"