# Tutorial 6: QLoRA - 4-bit Quantization for Efficient Fine-Tuning

## Introduction

Welcome to **Tutorial 6** on **QLoRA (Quantized LoRA)**! This tutorial explores how 4-bit quantization enables training large language models (7B+ parameters) on consumer GPUs with limited VRAM.

### What is QLoRA?

**QLoRA** combines two powerful techniques:
1. **4-bit Quantization**: Compress the frozen base model weights from 16- or 32-bit floats to 4-bit codes
2. **LoRA Adapters**: Train small adapter layers in 16-bit precision (FP16 in this tutorial; the QLoRA paper computes in BF16)

This allows you to:
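- Fine-tune 7B models on GPUs with about 6 GB of VRAM (the FP32 weights alone take 28 GB, and full fine-tuning needs several times that once gradients and optimizer state are counted)
- Stay close to full-precision quality (the QLoRA paper reports matching 16-bit fine-tuning on its benchmarks)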
- Enable training on consumer hardware (RTX 3060, 3090, 4090)

### Memory Comparison (7B Model)

| Method | Approx. Training Memory (excl. activations) | Trainable Params | Relative Quality |
|--------|----------------|------------------|-------------|
| **Full Fine-Tuning** | ~112 GB (FP32 weights + gradients + Adam states; 28 GB for weights alone) | 7B (100%) | 100% (baseline) |
| **LoRA** | ~14 GB | ~16M (0.2%) | ~99% |
| **QLoRA** | ~5 GB | ~16M (0.2%) | ~99% |

### Learning Objectives

By the end of this tutorial, you will understand:

1. **Quantization Theory**: How 4-bit quantization works (NF4, FP4)
2. **QLoRA Architecture**: Combining quantized base + FP16 adapters
3. **Memory Analysis**: Why QLoRA saves so much memory
4. **Implementation**: Build a simple quantizer from scratch
5. **Production Use**: Using Unsloth for real QLoRA training

### Prerequisites

- Tutorial 1 (LoRA theory)
- Tutorial 2 (LoRA implementation)
- Basic understanding of floating-point representation

In [None]:
# Google Colab Setup
import sys
import os

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print("Running in Google Colab - Setting up environment...")
    if not os.path.exists('transformer_from_scratch'):
        print("Cloning repository...")
        !git clone https://github.com/melhzy/transformer_from_scratch.git
        print("Repository cloned!")
    os.chdir('transformer_from_scratch')
    print("Installing dependencies...")
    !pip install -q torch matplotlib seaborn numpy pandas
    print("Dependencies installed!")
    if '/content/transformer_from_scratch' not in sys.path:
        sys.path.insert(0, '/content/transformer_from_scratch')
    print("Setup complete! Ready to run the tutorial.")
else:
    print("Running locally - no setup needed.")

## 1. Import Required Libraries

Let's import the necessary libraries for implementing QLoRA from scratch.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Optional, Tuple
import math

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")
print(f"PyTorch version: {torch.__version__}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

## 2. Understanding Quantization Basics

### What is Quantization?

Quantization reduces the precision of numbers to save memory:

- **FP32 (Full Precision)**: 32 bits per parameter (4 bytes)
- **FP16 (Half Precision)**: 16 bits per parameter (2 bytes) - 50% memory savings
- **INT8**: 8 bits per parameter (1 byte) - 75% memory savings
- **4-bit**: 4 bits per parameter (0.5 bytes) - 87.5% memory savings!
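
To make these percentages concrete, the arithmetic below turns bits per parameter into weight storage for a 7B-parameter model. It is a minimal sketch: `weight_memory_gb` is a hypothetical helper name, and the figures cover the weights only (no gradients, optimizer state, or quantization constants).

```python
# Bits per parameter for each format listed above
BITS_PER_PARAM = {"FP32": 32, "FP16": 16, "INT8": 8, "4-bit": 4}

def weight_memory_gb(num_params, bits):
    """Storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9

num_params = 7e9  # a 7B-parameter model
for name, bits in BITS_PER_PARAM.items():
    gb = weight_memory_gb(num_params, bits)
    savings = (1 - bits / 32) * 100  # relative to FP32
    print(f"{name:>5}: {gb:5.1f} GB  ({savings:.1f}% smaller than FP32)")
```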

### How Does it Work?

Quantization maps floating-point values to a smaller set of discrete values:

```
Original (FP32): [0.1, 0.5, 0.9, 1.2, 1.8]
Quantized (4-bit): [0, 4, 7, 10, 15]  (16 possible values: 0-15)
```

The formula, where `zero_point` is the minimum value and `scale = (max - min) / 15`:
```
quantized = round((value - zero_point) / scale)
dequantized = (quantized * scale) + zero_point
```
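
As a worked example, the sketch below applies these two formulas to the five values above in plain Python, taking `zero_point` as the minimum and `scale` as the range divided by 15. The function names are illustrative and not taken from any library.

```python
def quantize(values, n_levels=16):
    """Min-max quantization of a list of floats to integer codes 0..n_levels-1."""
    zero_point = min(values)
    scale = (max(values) - zero_point) / (n_levels - 1)
    if scale == 0:  # all values identical: avoid division by zero
        scale = 1.0
    codes = [
        min(n_levels - 1, max(0, round((v - zero_point) / scale)))
        for v in values
    ]
    return codes, scale, zero_point

def dequantize(codes, scale, zero_point):
    """Map integer codes back to approximate float values."""
    return [c * scale + zero_point for c in codes]

values = [0.1, 0.5, 0.9, 1.2, 1.8]
codes, scale, zero_point = quantize(values)
restored = dequantize(codes, scale, zero_point)

print(codes)  # [0, 4, 7, 10, 15]
print([round(v, 3) for v in restored])
print(f"max abs error: {max(abs(a - b) for a, b in zip(values, restored)):.4f}")
```

The rounding error of each value is at most half a quantization step (`scale / 2`), which is why a wider value range means a coarser reconstruction.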

### Why 4-bit for QLoRA?

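- **Memory**: The weights of a 7B model shrink from 28 GB in FP32 to 3.5 GB (8x reduction)
- **Accuracy**: The NF4 format is designed for normally distributed weights and preserves most of the model's quality (around 99% in the comparisons quoted here)
- **Training**: Only the LoRA adapters train, in 16-bit precision; the base model stays quantized and frozen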
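
The NF4 format mentioned above places its 16 levels according to a normal distribution rather than uniformly, since pretrained weights are roughly normally distributed. As a rough illustration of that idea, the sketch below builds a 16-entry codebook from quantiles of a standard normal and quantizes one block by scaling with its absolute maximum. This is an approximation in the spirit of NF4, not the exact codebook used by bitsandbytes (which, among other differences, includes an exact zero); all function names are hypothetical.

```python
from statistics import NormalDist

def normal_quantile_codebook(n_levels=16):
    """Quantiles of N(0, 1) at the midpoints of equal-probability bins, rescaled to [-1, 1]."""
    dist = NormalDist()
    quantiles = [dist.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]

def quantize_block(block, codebook):
    """Normalize by the block's absolute maximum, then pick the nearest codebook entry."""
    absmax = max(abs(v) for v in block) or 1.0  # fall back to 1.0 for an all-zero block
    codes = [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - v / absmax))
        for v in block
    ]
    return codes, absmax

def dequantize_block(codes, absmax, codebook):
    """Look up each code and undo the absmax normalization."""
    return [codebook[c] * absmax for c in codes]

codebook = normal_quantile_codebook()
block = [0.02, -0.01, 0.005, -0.03, 0.0, 0.015, -0.007, 0.011]
codes, absmax = quantize_block(block, codebook)
restored = dequantize_block(codes, absmax, codebook)

print("codes:   ", codes)
print("restored:", [round(v, 4) for v in restored])
```

The levels in this codebook are packed more densely near zero than near the extremes, so small weights, which are the most common, are reconstructed with finer resolution than a uniform grid would give.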

## 3. Implementing 4-bit Quantization

Let's build a simple 4-bit quantizer from scratch to understand the process.

In [None]:
class Simple4BitQuantizer:
    """
    Simple 4-bit asymmetric (min-max) quantization.
    Maps FP32 values to 16 discrete levels (0-15) spanning [min, max].
    """
    def __init__(self):
        self.n_levels = 16  # 2^4 = 16 possible values
    
    def quantize(self, tensor):
        """Quantize FP32 tensor to 4-bit representation."""
        # Find min and max for scale calculation
        min_val = tensor.min()
        max_val = tensor.max()
        
        # Calculate scale (range divided by number of levels)
        scale = (max_val - min_val) / (self.n_levels - 1)
        
        # Avoid division by zero
        if scale == 0:
            scale = 1.0
        
        # Quantize: map to 0-15 range
        quantized = torch.round((tensor - min_val) / scale)
        quantized = torch.clamp(quantized, 0, self.n_levels - 1).to(torch.uint8)
        
        return quantized, scale, min_val
    
    def dequantize(self, quantized, scale, min_val):
        """Dequantize back to FP32 for computation."""
        return (quantized.float() * scale) + min_val


# Example: Quantize a weight matrix
print("=== Simple 4-bit Quantization Demo ===\n")

# Create a sample weight matrix (similar to transformer layer weights)
original_weights = torch.randn(4, 4) * 0.5

print("Original weights (FP32):")
print(original_weights)
print(f"Memory: {original_weights.element_size() * original_weights.nelement()} bytes\n")

# Quantize
quantizer = Simple4BitQuantizer()
quant_weights, scale, min_val = quantizer.quantize(original_weights)

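print("Quantized weights (4-bit codes, one per uint8 byte in this demo):")
print(quant_weights)
# uint8 is the smallest dtype used here, so each code occupies a full byte.
# Packing two 4-bit codes per byte would halve that (scale/min overhead ignored).
stored_bytes = quant_weights.element_size() * quant_weights.nelement()
packed_bytes = (quant_weights.nelement() + 1) // 2
original_bytes = original_weights.element_size() * original_weights.nelement()
print(f"Memory as stored (uint8): {stored_bytes} bytes")
print(f"Memory if packed (4-bit): {packed_bytes} bytes")
print(f"Memory savings if packed: {(1 - packed_bytes / original_bytes) * 100:.1f}%\n")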

# Dequantize
dequant_weights = quantizer.dequantize(quant_weights, scale, min_val)

print("Dequantized weights (back to FP32):")
print(dequant_weights)

# Calculate quantization error
error = (original_weights - dequant_weights).abs().mean()
print(f"\nQuantization error (MAE): {error:.6f}")
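
The demo above keeps each 4-bit code in its own `uint8`, so it uses one byte per weight. To reach 0.5 bytes per weight, two codes have to share a byte. The sketch below shows one way to do this in plain Python; `pack_4bit` and `unpack_4bit` are hypothetical helpers, not part of PyTorch or bitsandbytes.

```python
def pack_4bit(codes):
    """Pack 4-bit codes (ints in 0..15) two per byte, first code in the high nibble."""
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("codes must be in the range 0..15")
    padded = list(codes) + [0] * (len(codes) % 2)  # pad odd lengths with a zero code
    return bytes((hi << 4) | lo for hi, lo in zip(padded[0::2], padded[1::2]))

def unpack_4bit(packed, n_codes):
    """Recover the original codes; n_codes drops any padding nibble."""
    codes = []
    for byte in packed:
        codes.append(byte >> 4)    # high nibble
        codes.append(byte & 0x0F)  # low nibble
    return codes[:n_codes]

codes = [0, 4, 7, 10, 15]
packed = pack_4bit(codes)
print(len(codes), "codes ->", len(packed), "bytes")
print(unpack_4bit(packed, len(codes)) == codes)
```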

## 4. QLoRA Architecture: Combining Quantized Base + LoRA Adapters

Now let's implement the core idea of QLoRA: keep the base model weights in 4-bit, but add LoRA adapters in FP16.

**Key Insight**: During forward pass:
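1. Dequantize the 4-bit weights to the compute dtype on the fly (FP32 in this demo)
2. Compute base output: `base_output = input @ dequantized_weights.T` (weights have shape `(out, in)`)
3. Compute LoRA output: `lora_output = (input @ A.T @ B.T) * scaling`, with `A` of shape `(rank, in)` and `B` of shape `(out, rank)`
4. Combine: `final_output = base_output + lora_output`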

Only LoRA parameters (A and B) get gradients during training!
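
The four steps can be sketched on tiny matrices in plain Python, assuming the shape convention used in this tutorial's code: `W` is `(out, in)`, `A` is `(rank, in)`, and `B` is `(out, rank)`. The `matmul` and `transpose` helpers are illustrative stand-ins for tensor operations, and the numbers are made up.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def qlora_forward(x, w_dequant, lora_a, lora_b, scaling):
    # Step 2: frozen base path, x @ W^T
    base = matmul(x, transpose(w_dequant))
    # Step 3: low-rank path, x @ A^T @ B^T
    delta = matmul(matmul(x, transpose(lora_a)), transpose(lora_b))
    # Step 4: add the scaled LoRA output to the base output
    return [[b + scaling * d for b, d in zip(b_row, d_row)]
            for b_row, d_row in zip(base, delta)]

x = [[1.0, 2.0, 3.0]]                     # (batch=1, in=3)
w = [[0.5, 0.0, -0.5], [1.0, 1.0, 1.0]]   # (out=2, in=3), already dequantized (step 1)
a = [[0.5, 0.5, 0.0]]                     # (rank=1, in=3)
b_zero = [[0.0], [0.0]]                   # (out=2, rank=1), zero-initialized
b_trained = [[2.0], [-1.0]]               # hypothetical values after some training

print(qlora_forward(x, w, a, b_zero, scaling=1.0))     # [[-1.0, 6.0]]
print(qlora_forward(x, w, a, b_trained, scaling=1.0))  # [[2.0, 4.5]]
```

With `B` initialized to zeros, the layer reproduces the base model's output exactly, so training starts from the quantized model's behavior and only drifts as `A` and `B` are updated.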

In [None]:
class QLoRALayer(nn.Module):
    """
    QLoRA Layer: Quantized base weights + FP16 LoRA adapters
    
    Memory breakdown:
    - Base weights: 4-bit (frozen, quantized)
    - LoRA A, B: FP16 (trainable)
    """
    def __init__(self, in_features, out_features, rank=8):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.rank = rank
        
        # Base weights (will be quantized and frozen)
        base_weight = torch.randn(out_features, in_features) * 0.02
        
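        # Quantize base weights and register them as buffers: they are not
        # trainable, but they follow .to(device) and are saved in state_dict
        quantizer = Simple4BitQuantizer()
        quantized, scale, min_val = quantizer.quantize(base_weight)
        self.register_buffer('quantized_weight', quantized)
        self.register_buffer('scale', torch.as_tensor(scale))
        self.register_buffer('min_val', torch.as_tensor(min_val))
        
        # LoRA adapters (trainable). Created in FP32 here for simplicity;
        # real QLoRA keeps them in 16-bit precision.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        # Zero init for B means the adapter starts as a no-op
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        
        # Scaling factor for LoRA (alpha / rank, with alpha = 1 here)
        self.scaling = 1.0 / rank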
        
    def forward(self, x):
        """
        Forward pass: dequantize base + add LoRA
        
        Args:
            x: Input tensor of shape (batch, seq_len, in_features)
        
        Returns:
            Output tensor of shape (batch, seq_len, out_features)
        """
        # Dequantize base weights on-the-fly
        quantizer = Simple4BitQuantizer()
        base_weight = quantizer.dequantize(self.quantized_weight, self.scale, self.min_val)
        
        # Base output (frozen weights)
        base_output = F.linear(x, base_weight)
        
        # LoRA output (trainable adapters)
        lora_output = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        
        # Combine both
        return base_output + lora_output
    
    def memory_usage(self):
        """Calculate memory usage in MB."""
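        # Quantized base: 4 bits = 0.5 bytes per parameter
        # (assumes two codes packed per byte; this demo stores one per uint8)
        base_mem = (self.quantized_weight.nelement() * 0.5) / 1e6
        
        # LoRA adapters: 2 bytes per parameter, assuming FP16 storage
        # (the parameters in this demo are actually created in FP32)
        lora_mem = ((self.lora_A.nelement() + self.lora_B.nelement()) * 2) / 1e6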
        
        return {
            'base_4bit_MB': base_mem,
            'lora_fp16_MB': lora_mem,
            'total_MB': base_mem + lora_mem
        }


# Example: Create a QLoRA layer
print("=== QLoRA Layer Demo ===\n")

d_model = 512
qlora_layer = QLoRALayer(d_model, d_model, rank=8)

# Test forward pass
x = torch.randn(2, 10, d_model)  # (batch=2, seq_len=10, d_model=512)
output = qlora_layer(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}\n")

# Memory analysis
mem = qlora_layer.memory_usage()
print("Memory Usage:")
print(f"  Base weights (4-bit): {mem['base_4bit_MB']:.2f} MB")
print(f"  LoRA adapters (FP16): {mem['lora_fp16_MB']:.2f} MB")
print(f"  Total: {mem['total_MB']:.2f} MB\n")

# Compare to full precision
full_precision_mem = (d_model * d_model * 4) / 1e6  # FP32 = 4 bytes
print(f"Full FP32 layer would be: {full_precision_mem:.2f} MB")
print(f"Memory savings: {(1 - mem['total_MB'] / full_precision_mem) * 100:.1f}%")

## 5. Memory Comparison: Full vs LoRA vs QLoRA

Let's calculate real memory requirements for a 7B parameter model using different fine-tuning approaches.

In [None]:
import pandas as pd

def calculate_memory(num_params_billions, method='full', lora_rank=16, lora_percentage=0.01):
    """
    Estimate memory for weights, gradients, and optimizer states under
    different fine-tuning methods. Activations are not included.
    
    Args:
        num_params_billions: Model size in billions (e.g., 7 for 7B model)
        method: 'full', 'lora', or 'qlora'
        lora_rank: Rank for LoRA adapters (currently unused by this estimate)
        lora_percentage: Adapter parameters as a fraction of base parameters
            (default 0.01, i.e. 1% of the model)
    
    Returns:
        Dictionary with memory breakdown in GB
    """
    num_params = num_params_billions * 1e9
    
    if method == 'full':
        # Full fine-tuning: all params in FP32 + gradients + optimizer states
        model_mem = (num_params * 4) / 1e9  # FP32 = 4 bytes
        gradients_mem = model_mem  # Same size as model
        optimizer_mem = model_mem * 2  # Adam: 2x model size (momentum + variance)
        total = model_mem + gradients_mem + optimizer_mem
        
        return {
            'Model': model_mem,
            'Gradients': gradients_mem,
            'Optimizer': optimizer_mem,
            'Total': total
        }
    
    elif method == 'lora':
        # LoRA: frozen base in FP16 + trainable adapters in FP16
        base_mem = (num_params * 2) / 1e9  # FP16 = 2 bytes
        lora_params = num_params * lora_percentage  # ~1% of model
        lora_mem = (lora_params * 2) / 1e9  # FP16
        lora_gradients = lora_mem
        lora_optimizer = lora_mem * 2
        total = base_mem + lora_mem + lora_gradients + lora_optimizer
        
        return {
            'Base (FP16)': base_mem,
            'LoRA params': lora_mem,
            'LoRA gradients': lora_gradients,
            'LoRA optimizer': lora_optimizer,
            'Total': total
        }
    
    elif method == 'qlora':
        # QLoRA: frozen base in 4-bit + trainable adapters in FP16
        base_mem = (num_params * 0.5) / 1e9  # 4-bit = 0.5 bytes
        lora_params = num_params * lora_percentage
        lora_mem = (lora_params * 2) / 1e9  # FP16
        lora_gradients = lora_mem
        lora_optimizer = lora_mem * 2
        total = base_mem + lora_mem + lora_gradients + lora_optimizer
        
        return {
            'Base (4-bit)': base_mem,
            'LoRA params': lora_mem,
            'LoRA gradients': lora_gradients,
            'LoRA optimizer': lora_optimizer,
            'Total': total
        }
    
    raise ValueError(f"Unknown method: {method!r}")


# Compare memory for 7B model
print("=== Memory Comparison for 7B Model ===\n")

methods = ['full', 'lora', 'qlora']
results = []

for method in methods:
    mem = calculate_memory(7, method=method)
    results.append({'Method': method.upper(), **mem})

df = pd.DataFrame(results)
print(df.to_string(index=False))

print("\n=== Key Insights ===")
full_mem = df[df['Method'] == 'FULL']['Total'].values[0]
lora_mem = df[df['Method'] == 'LORA']['Total'].values[0]
qlora_mem = df[df['Method'] == 'QLORA']['Total'].values[0]

print(f"LoRA saves: {(1 - lora_mem / full_mem) * 100:.1f}% vs Full Fine-Tuning")
print(f"QLoRA saves: {(1 - qlora_mem / full_mem) * 100:.1f}% vs Full Fine-Tuning")
print(f"QLoRA saves: {(1 - qlora_mem / lora_mem) * 100:.1f}% vs LoRA")

print("\n=== GPU Requirements (7B Model) ===")
print(f"Full Fine-Tuning: {full_mem:.1f} GB - Exceeds any single GPU (even an 80GB A100); needs multi-GPU sharding or offloading")
print(f"LoRA: {lora_mem:.1f} GB - Fits on a single RTX 3090 (24GB), leaving room for activations")
print(f"QLoRA: {qlora_mem:.1f} GB - Fits on RTX 3060 (12GB), and possibly an 8GB card with short sequences")
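
`calculate_memory` treats the adapter size as a fixed fraction of the base model. For a more concrete figure, the adapter parameters can be counted directly: a LoRA adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` parameters. The sketch below assumes Llama-7B-like dimensions (hidden size 4096, MLP size 11008, 32 layers); these values and the helper name are assumptions for illustration, not taken from this tutorial's code.

```python
HIDDEN, MLP, LAYERS = 4096, 11008, 32  # assumed Llama-7B-like dimensions

# (d_in, d_out) of each linear layer in one transformer block
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_param_count(rank, target_modules):
    """Total LoRA parameters: rank * (d_in + d_out) per adapted layer, over all blocks."""
    per_block = sum(rank * sum(MODULE_SHAPES[name]) for name in target_modules)
    return per_block * LAYERS

attention_only = lora_param_count(16, ["q_proj", "k_proj", "v_proj", "o_proj"])
all_linear = lora_param_count(16, list(MODULE_SHAPES))

print(f"r=16, attention projections: {attention_only / 1e6:.1f}M")
print(f"r=16, all linear layers:     {all_linear / 1e6:.1f}M")
```

Under these assumptions, rank 16 on the attention projections gives about 16.8M parameters, in line with the ~16M figure quoted in the tables, while the 1% default in `calculate_memory` (70M parameters for a 7B model) is a more conservative estimate.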

## 6. Production QLoRA with Unsloth

Now that you understand how QLoRA works internally, let's see how to use it in production with **Unsloth** - a library optimized for efficient fine-tuning.

### Why Unsloth?

- **Up to about 2x faster** than standard implementations, according to the Unsloth project
- **Around 30% less memory** through optimized kernels, also as reported by the project
- **Pre-configured** 4-bit quantized models
- **Easy to use** - just a few lines of code!

### Installation

```bash
pip install unsloth
```

### Example: Fine-tune Llama 3.2 1B with QLoRA

In [None]:
# Production QLoRA Example (illustrative code, printed rather than executed - requires unsloth installation and a CUDA GPU)

print("=== Production QLoRA with Unsloth ===\n")
print("This is example code showing how to use Unsloth for QLoRA fine-tuning.")
print("To run this, install Unsloth: pip install unsloth\n")

example_code = '''
from unsloth import FastLanguageModel
import torch

# Load model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    max_seq_length = 2048,
    dtype = None,  # Auto-detect
    load_in_4bit = True,  # Enable 4-bit quantization
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # LoRA rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0.05,
    bias = "none",
    use_gradient_checkpointing = True,
)

# Prepare dataset
from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split="train[:1000]")

def format_prompts(examples):
    texts = []
    for instruction, input_text, output in zip(
        examples["instruction"], examples["input"], examples["output"]
    ):
        text = f"""### Instruction:
{instruction}

### Input:
{input_text}

### Response:
{output}"""
        texts.append(text)
    return {"text": texts}

dataset = dataset.map(format_prompts, batched=True)

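# Training arguments
# Note: the SFTTrainer arguments below follow older TRL releases; newer
# versions move dataset_text_field and max_seq_length into SFTConfig.
from transformers import TrainingArguments
from trl import SFTTrainer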

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 100,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 10,
        output_dir = "outputs",
    ),
)

# Train!
trainer.train()

# Save LoRA adapters
model.save_pretrained("lora_model")

# Inference
FastLanguageModel.for_inference(model)
inputs = tokenizer("What is machine learning?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
'''

print(example_code)

print("\n=== Key Configuration Parameters ===")
print("- load_in_4bit=True: Loads model in 4-bit quantization")
print("- r=16: LoRA rank (higher = more capacity, more memory)")
print("- lora_alpha=16: LoRA scaling numerator (effective scale = lora_alpha / r)")
print("- target_modules: Which layers get LoRA adapters")
print("- use_gradient_checkpointing: Save memory during backprop")
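
Two derived quantities in this configuration are easy to overlook: the effective LoRA scale (`lora_alpha / r`) and the effective batch size (per-device batch size times gradient accumulation steps, times the number of GPUs). A minimal sketch of the arithmetic, using the values from the example above and assuming a single GPU; `summarize_run` is a hypothetical helper:

```python
def summarize_run(r, lora_alpha, per_device_batch, grad_accum, max_steps, num_gpus=1):
    """Derive scale, effective batch size, and examples processed from the config."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    return {
        "lora_scale": lora_alpha / r,
        "effective_batch_size": effective_batch,
        "examples_seen": effective_batch * max_steps,
    }

summary = summarize_run(r=16, lora_alpha=16, per_device_batch=2,
                        grad_accum=4, max_steps=100)
print(summary)
# {'lora_scale': 1.0, 'effective_batch_size': 8, 'examples_seen': 800}
```

With the 1,000-example subset loaded in the example, 100 steps at an effective batch size of 8 cover 800 examples, slightly less than one epoch.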

## Summary

### What You Learned

1. **Quantization Fundamentals**
   - How to reduce precision from FP32 → 4-bit
   - Trade-offs between memory and accuracy
   - Simple quantization implementation

2. **QLoRA Architecture**
   - Combining 4-bit base weights with FP16 LoRA adapters
   - Memory savings (7B estimate from Section 5): roughly 3.5x versus LoRA and over 25x versus full fine-tuning; for the base weights alone, 4x versus FP16 and 8x versus FP32
   - Only LoRA adapters train, base stays frozen and quantized

3. **Practical Benefits**
   - Train 7B models on consumer GPUs (6-12GB VRAM)
   - Stay close to full-precision quality (around 99% in the comparisons quoted here)
   - Enable AI research on affordable hardware

4. **Production Usage**
   - Unsloth library for optimized QLoRA
   - Reported speedups of around 2x over standard implementations
   - Pre-configured 4-bit quantized models

### Key Takeaways

| Aspect | Full Fine-Tuning | LoRA | QLoRA |
|--------|------------------|------|-------|
| **Memory (7B, excl. activations)** | ~112 GB | ~14 GB | ~5 GB |
| **Trainable Params** | 7B (100%) | 16M (0.2%) | 16M (0.2%) |
| **Relative Quality** | 100% (baseline) | ~99% | ~99% |
| **GPU Required** | Multiple 80GB GPUs or offloading | RTX 3090 24GB | RTX 3060 12GB |
| **Training Speed** | Baseline | Similar or faster per step | Somewhat slower than LoRA (dequantization overhead) |

### When to Use QLoRA?

✅ **Use QLoRA when:**
- Limited GPU memory (< 24GB)
- Training large models (7B+)
- Budget constraints (consumer GPUs)
- Quick experimentation

❌ **Consider alternatives when:**
- Memory is not a constraint (A100 available)
- Training small models (< 1B parameters)
- Need absolute maximum performance
- Have access to multiple GPUs

### Next Steps

1. **Practice**: Try fine-tuning with Unsloth on your own dataset
2. **Experiment**: Test different ranks (r=8, 16, 32, 64)
3. **Optimize**: Use gradient checkpointing for even lower memory
4. **Deploy**: Merge LoRA weights and quantize for inference

### Resources

- **QLoRA Paper**: [Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
- **Unsloth**: [GitHub Repository](https://github.com/unslothai/unsloth)
- **Unsloth Notebooks**: [100+ Examples](https://github.com/unslothai/notebooks)
- **bitsandbytes**: [Quantization Library](https://github.com/TimDettmers/bitsandbytes)

**Related Tutorials:**
- [Tutorial 1: Introduction to Fine-Tuning](01_introduction_to_fine_tuning.ipynb)
- [Tutorial 2: LoRA Implementation](02_lora_implementation.ipynb)
- [Tutorial 4: Instruction Tuning](04_instruction_tuning.ipynb)

---

**Congratulations!** You now understand QLoRA and can train large models on consumer GPUs!