# Tutorial 6: QLoRA - 4-bit Quantization for Efficient Fine-Tuning

## Introduction

Welcome to Tutorial 6 on QLoRA (Quantized LoRA)! This tutorial explores how 4-bit quantization enables training large language models (7B+ parameters) on consumer GPUs.

### What is QLoRA?

**QLoRA** combines:
1. **4-bit Quantization**: Compress model weights from 32-bit to 4-bit
2. **LoRA Adapters**: Train small adapter layers in FP16 precision

### Memory Comparison (7B Model)

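| Method | Memory (approx.) | Trainable Params | Relative Performance |
|--------|------------------|------------------|----------------------|
| Full Fine-Tuning | 28 GB (FP32 weights only) | 7B (100%) | 100% |
| LoRA | 14 GB | 16M (0.2%) | 99% |
| QLoRA | 5 GB | 16M (0.2%) | 99% |

The full fine-tuning figure counts FP32 weights only. Gradients and Adam optimizer states raise the training footprint to roughly 112 GB (see Section 5).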
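
The 16M trainable-parameter figure can be sanity-checked with a quick count. The sketch below assumes a LLaMA-7B-like shape (32 layers, hidden size 4096) and rank-16 adapters on the four attention projections, as in the Unsloth configuration in Section 6. The dimensions and the `lora_param_count` helper are illustrative assumptions, not values taken from the table.

```python
def lora_param_count(hidden_size, n_layers, rank, n_target_matrices):
    """Count LoRA parameters when every target matrix is hidden_size x hidden_size."""
    # Each adapted matrix adds A (rank x in_features) and B (out_features x rank)
    per_matrix = rank * (hidden_size + hidden_size)
    return per_matrix * n_target_matrices * n_layers

# Assumed dimensions (hypothetical, LLaMA-7B-like)
trainable = lora_param_count(hidden_size=4096, n_layers=32, rank=16, n_target_matrices=4)
fraction = trainable / 7e9

print(f'Trainable LoRA params: {trainable:,}')
print(f'Fraction of a 7B model: {fraction:.2%}')
```

Under these assumptions the count lands close to the 16M (0.2%) row, which suggests the table refers to a configuration of roughly this size.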

In [None]:
# Google Colab Setup
import sys
import os

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print('Running in Google Colab')
    if not os.path.exists('transformer_from_scratch'):
        !git clone https://github.com/melhzy/transformer_from_scratch.git
    os.chdir('transformer_from_scratch')
    !pip install -q torch matplotlib seaborn numpy pandas
    sys.path.insert(0, '/content/transformer_from_scratch')
else:
    print('Running locally')

## 1. Import Required Libraries

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Device: {device}')
print(f'PyTorch version: {torch.__version__}')

## 2. Understanding Quantization

Quantization stores each weight in fewer bits to save memory. Savings below are relative to FP32:

- **FP32**: 32 bits (4 bytes) - baseline
- **FP16**: 16 bits (2 bytes) - 50% savings
- **INT8**: 8 bits (1 byte) - 75% savings
- **4-bit**: 4 bits (0.5 bytes) - 87.5% savings!

### Why 4-bit for QLoRA?

- 7B model weights: 28 GB (FP32) → 3.5 GB (4-bit), an 8x reduction
- Maintains 99%+ of full fine-tuning performance
- Only the LoRA adapters are trained (in FP16); the 4-bit base weights stay frozen
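
The figures in the two lists above follow from simple arithmetic on bits per weight. Here is a minimal sketch that counts weights only and ignores the small overhead of quantization constants (scales and offsets) that a real 4-bit format also stores. `weight_memory_gb` is an illustrative helper, not part of any library.

```python
BITS_PER_WEIGHT = {'FP32': 32, 'FP16': 16, 'INT8': 8, '4-bit': 4}

def weight_memory_gb(n_params, bits):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9

n_params = 7e9  # 7B model
fp32_gb = weight_memory_gb(n_params, BITS_PER_WEIGHT['FP32'])

for name, bits in BITS_PER_WEIGHT.items():
    gb = weight_memory_gb(n_params, bits)
    savings = (1 - gb / fp32_gb) * 100
    print(f'{name:>5}: {gb:5.1f} GB ({savings:.1f}% savings vs FP32)')
```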

## 3. Simple 4-bit Quantizer

In [None]:
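class Simple4BitQuantizer:
    """Per-tensor min/max (affine) quantizer with 16 levels."""

    def __init__(self):
        self.n_levels = 16  # 2^4

    def quantize(self, tensor):
        min_val = tensor.min()
        max_val = tensor.max()
        scale = (max_val - min_val) / (self.n_levels - 1)
        if scale == 0:
            # Constant tensor: avoid division by zero, keep scale a tensor
            scale = torch.ones_like(scale)
        quantized = torch.round((tensor - min_val) / scale)
        # Codes are 0..15, but each one is stored in a full uint8 byte (no bit packing)
        quantized = torch.clamp(quantized, 0, self.n_levels - 1).to(torch.uint8)
        return quantized, scale, min_val

    def dequantize(self, quantized, scale, min_val):
        return (quantized.float() * scale) + min_val

# Demo
weights = torch.randn(4, 4) * 0.5
orig_bytes = weights.element_size() * weights.nelement()
print('Original:', orig_bytes, 'bytes')

quantizer = Simple4BitQuantizer()
quant, scale, min_val = quantizer.quantize(weights)
stored_bytes = quant.element_size() * quant.nelement()
packed_bytes = quant.nelement() * 4 / 8  # two 4-bit codes per byte
print('Quantized (one code per uint8):', stored_bytes, 'bytes')
print('Quantized (if packed two per byte):', packed_bytes, 'bytes')
print(f'Savings as stored: {(1 - stored_bytes / orig_bytes) * 100:.1f}%')
print(f'Savings if packed: {(1 - packed_bytes / orig_bytes) * 100:.1f}%')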
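
The quantizer above keeps each 4-bit code in its own `uint8`, so every code still occupies a full byte. Reaching 0.5 bytes per weight requires packing two codes into one byte. The sketch below shows one way this could be done. The helper names `pack_4bit` and `unpack_4bit` and the nibble order are illustrative choices, not the layout used by any particular library.

```python
import torch

def pack_4bit(codes):
    """Pack a uint8 tensor of 4-bit codes (values 0..15) into half as many bytes."""
    flat = codes.flatten()
    if flat.numel() % 2 == 1:
        # Pad with one zero code so the codes pair up evenly
        flat = torch.cat([flat, flat.new_zeros(1)])
    # Even positions go to the high nibble, odd positions to the low nibble
    return (flat[0::2] << 4) | flat[1::2]

def unpack_4bit(packed, numel):
    """Inverse of pack_4bit; numel is the original number of codes."""
    high = packed >> 4
    low = packed & 0x0F
    # Interleave high and low nibbles back into the original order
    return torch.stack([high, low], dim=1).flatten()[:numel]

codes = torch.randint(0, 16, (4, 4), dtype=torch.uint8)
packed = pack_4bit(codes)
restored = unpack_4bit(packed, codes.numel()).reshape(codes.shape)

print('Unpacked storage:', codes.numel() * codes.element_size(), 'bytes')
print('Packed storage:', packed.numel() * packed.element_size(), 'bytes')
print('Round trip exact:', torch.equal(codes, restored))
```

Packing halves the storage of the codes without losing information. The trade-off is an extra unpack step before dequantization.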

## 4. QLoRA Layer Implementation

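A QLoRA layer combines frozen, quantized base weights (4-bit) with trainable LoRA adapters (FP16 in practice; the demo below keeps them in PyTorch's default FP32 for simplicity).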

In [None]:
class QLoRALayer(nn.Module):
    def __init__(self, in_features, out_features, rank=8):
        super().__init__()
        self.rank = rank
        self.quantizer = Simple4BitQuantizer()

        # Quantize base weights (random values stand in for pretrained weights)
        base_weight = torch.randn(out_features, in_features) * 0.02
        quantized, scale, min_val = self.quantizer.quantize(base_weight)
        # Buffers get no gradients but follow .to(device) and appear in state_dict()
        self.register_buffer('quantized_weight', quantized)
        self.register_buffer('scale', torch.as_tensor(scale, dtype=torch.float32))
        self.register_buffer('min_val', torch.as_tensor(min_val, dtype=torch.float32))

        # LoRA adapters (trainable); B starts at zero so the initial update is zero
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = 1.0 / rank  # alpha / rank with alpha = 1

    def forward(self, x):
        # Dequantize base weights on the fly
        base_weight = self.quantizer.dequantize(self.quantized_weight, self.scale, self.min_val)
        base_output = F.linear(x, base_weight)

        # Add the low-rank update
        lora_output = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        return base_output + lora_output

# Demo
layer = QLoRALayer(512, 512, rank=8)
x = torch.randn(2, 10, 512)
output = layer(x)
print(f'Input: {x.shape}')
print(f'Output: {output.shape}')
print(f'Trainable params: {sum(p.numel() for p in layer.parameters()):,}')
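
Two properties of this design are worth checking directly: the layer starts out identical to the base model (because `lora_B` is initialized to zero), and an optimizer step changes only the adapters. The sketch below uses a hypothetical `FrozenBaseLoRA` class, in which a frozen dense buffer stands in for the dequantized 4-bit weights so the example stays self-contained. It illustrates the idea and is not the layer defined above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenBaseLoRA(nn.Module):
    """Hypothetical stand-in: a frozen dense base weight plus trainable LoRA factors."""

    def __init__(self, in_features, out_features, rank=8):
        super().__init__()
        # A buffer receives no gradient; it stands in for dequantized 4-bit weights
        self.register_buffer('base_weight', torch.randn(out_features, in_features) * 0.02)
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = 1.0 / rank

    def forward(self, x):
        base = F.linear(x, self.base_weight)
        return base + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

torch.manual_seed(0)
layer = FrozenBaseLoRA(64, 64, rank=4)
x = torch.randn(8, 64)

# With B = 0 the adapter contributes nothing, so the output equals the base output
starts_at_base = torch.allclose(layer(x), F.linear(x, layer.base_weight))

# One SGD step on a random regression target
base_before = layer.base_weight.clone()
optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)
loss = (layer(x) - torch.randn(8, 64)).pow(2).mean()
loss.backward()
optimizer.step()

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
base_unchanged = torch.equal(base_before, layer.base_weight)
b_updated = bool(layer.lora_B.abs().sum() > 0)

print('Initial output equals base output:', starts_at_base)
print('Trainable parameters:', trainable)  # rank * (in_features + out_features)
print('Base weights unchanged after a step:', base_unchanged)
print('lora_B moved away from zero:', b_updated)
```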

## 5. Memory Comparison

In [None]:
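import pandas as pd

def calc_memory(params_billions, method, lora_fraction=0.01):
    """Rough training memory in GB: weights, gradients, Adam states (no activations)."""
    params = params_billions * 1e9
    if method == 'full':
        model = (params * 4) / 1e9  # FP32 weights
        # Gradients match the weights; Adam keeps two moment buffers per weight
        return {'Model': model, 'Gradients': model, 'Optimizer': model * 2, 'Total': model * 4}

    # FP16 adapters; lora_fraction = adapter params / base params (an assumed 1% here)
    lora = (params * lora_fraction * 2) / 1e9
    if method == 'lora':
        base = (params * 2) / 1e9    # frozen FP16 base weights
    elif method == 'qlora':
        base = (params * 0.5) / 1e9  # frozen 4-bit base weights
    else:
        raise ValueError(f'Unknown method: {method}')
    # x4 = adapter weights + gradients + two Adam moment buffers, as in the full case
    return {'Base': base, 'LoRA': lora * 4, 'Total': base + lora * 4}

results = [{'Method': m.upper(), **calc_memory(7, m)} for m in ['full', 'lora', 'qlora']]
df = pd.DataFrame(results)
# Columns that do not apply to a method are shown as '-'
print(df.to_string(index=False, na_rep='-'))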

## 6. Production QLoRA with Unsloth

For production use, Unsloth provides optimized QLoRA:

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name='unsloth/Llama-3.2-1B-Instruct-bnb-4bit',
    load_in_4bit=True
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj']
)
```

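**Benefits** (as reported by the Unsloth project; actual gains depend on model and hardware):
- About 2x faster training
- About 30% less memory
- Pre-quantized 4-bit checkpoints, such as the `bnb-4bit` model above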

## Summary

### Key Takeaways

| Method | Memory (7B) | GPU Required |
|--------|-------------|--------------|
| Full | 28 GB weights (about 112 GB with gradients and Adam states) | Multiple A100-class GPUs |
| LoRA | 14 GB | RTX 3090 24GB |
| QLoRA | 5 GB | RTX 3060 12GB |

### When to Use QLoRA

- ✅ Limited GPU memory (< 24GB)
- ✅ Training large models (7B+)
- ✅ Consumer GPUs

### Resources

- [QLoRA Paper](https://arxiv.org/abs/2305.14314)
- [Unsloth GitHub](https://github.com/unslothai/unsloth)
- [Tutorial 1: Introduction](01_introduction_to_fine_tuning.ipynb)
- [Tutorial 2: LoRA Implementation](02_lora_implementation.ipynb)

**Congratulations!** You can now train large models on consumer GPUs!