# QLoRA Training Demo

This notebook demonstrates QLoRA (Quantized Low-Rank Adaptation) fine-tuning for our e-commerce LLM.

## What is QLoRA?

QLoRA enables fine-tuning of large language models on consumer hardware by combining:
1. **4-bit Quantization**: Cuts weight memory roughly 4x relative to FP16
2. **LoRA Adapters**: Trains small adapter layers instead of full model weights
3. **Paged Optimizers**: Handles memory spikes during training

### Memory Savings:
- Full Mistral-7B weights: ~28GB (FP32) or ~14GB (FP16/BF16), before gradients and optimizer state
- QLoRA Mistral-7B: ~5-6GB (4-bit base weights + LoRA adapters)
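
The weight figures above come from bytes-per-parameter arithmetic. A rough sketch, assuming a round 7 billion parameters and counting weights only (activations, gradients, optimizer state, and quantization constants are ignored). `weight_memory_gb` is an illustrative helper, not part of any library:

```python
# Weights-only memory estimate; ignores activations, gradients, and optimizer state.
BITS_PER_PARAM = {"fp32": 32, "fp16": 16, "bf16": 16, "int8": 8, "nf4": 4}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_PARAM[dtype] / 8 / 1e9

NUM_PARAMS = 7e9  # assumed round figure for a "7B" model

for dtype in ("fp32", "fp16", "nf4"):
    print(f"{dtype:>5}: {weight_memory_gb(NUM_PARAMS, dtype):.1f} GB")
```

The gap between the 4-bit weight estimate and the ~5-6GB quoted for QLoRA is taken up by adapters, activations, optimizer state, and CUDA overhead.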

### Training Parameters:
- Base model: Mistral-7B-Instruct-v0.3
- Quantization: 4-bit NF4 with double quantization
- LoRA rank: 32, alpha: 64
- Target: all linear layers

In [None]:
# Install required packages if needed
# !pip install torch transformers peft trl bitsandbytes datasets accelerate

In [None]:
import json
import os
from datetime import datetime

import torch
from datasets import load_dataset
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"CUDA memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## 1. Understanding QLoRA Components

### 1.1 4-bit Quantization with NF4

NF4 (NormalFloat4) is a 4-bit quantization format optimized for normally distributed weights:

```
Standard FP16:  [sign(1)] [exponent(5)] [mantissa(10)] = 16 bits
NF4:            [quantized_value(4)]                   = 4 bits

Memory reduction: 16/4 = 4x smaller
```
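
To make the idea concrete, here is a toy blockwise 4-bit quantizer in pure Python. It is a sketch of the principle only: the 16-level codebook is built from evenly spaced quantiles of a standard normal distribution and is not the exact NF4 table that bitsandbytes uses. All function names are illustrative.

```python
from statistics import NormalDist

def normal_quantile_codebook(levels: int = 16) -> list[float]:
    """Illustrative codebook: evenly spaced N(0, 1) quantiles rescaled to [-1, 1]."""
    nd = NormalDist()
    quantiles = [nd.inv_cdf((i + 0.5) / levels) for i in range(levels)]
    top = max(abs(q) for q in quantiles)
    return [q / top for q in quantiles]

def quantize_block(block, codebook):
    """Absmax-scale a block, then store the index of the nearest codebook level."""
    scale = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
    indices = [
        min(range(len(codebook)), key=lambda k: abs(codebook[k] - w / scale))
        for w in block
    ]
    return indices, scale

def dequantize_block(indices, scale, codebook):
    """Map 4-bit indices back to approximate weights."""
    return [codebook[i] * scale for i in indices]

codebook = normal_quantile_codebook()
weights = [0.02, -0.15, 0.31, -0.07, 0.0, 0.11, -0.29, 0.05]
indices, scale = quantize_block(weights, codebook)
restored = dequantize_block(indices, scale, codebook)

print("indices:", indices)  # every index is in 0..15, so it fits in 4 bits
print("max abs error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

Each block keeps one scale constant in addition to its 4-bit indices.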

### 1.2 Double Quantization

Quantizes the quantization constants themselves for additional memory savings:
- First quantization: FP16 weights -> 4-bit, with one FP32 scale constant per block of weights
- Second quantization: those FP32 scale constants -> 8-bit
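
The saving can be estimated as storage overhead per weight. This sketch assumes the block sizes reported in the QLoRA paper (64 weights per first-level block, 256 first-level constants per second-level block) and a round 7 billion parameters. The constants and helper name are illustrative.

```python
BLOCK_SIZE = 64          # assumed: weights per first-level quantization block
SECOND_BLOCK_SIZE = 256  # assumed: first-level constants per second-level block

def overhead_bits_per_param(double_quant: bool) -> float:
    """Extra bits stored per weight for the quantization constants."""
    if not double_quant:
        # One 32-bit constant shared by BLOCK_SIZE weights
        return 32 / BLOCK_SIZE
    # One 8-bit constant per block, plus one 32-bit constant per group of blocks
    return 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_BLOCK_SIZE)

single = overhead_bits_per_param(double_quant=False)
double = overhead_bits_per_param(double_quant=True)
saved_gb = (single - double) * 7e9 / 8 / 1e9  # assumed 7e9 parameters

print(f"Overhead without double quantization: {single:.3f} bits/param")
print(f"Overhead with double quantization:    {double:.3f} bits/param")
print(f"Approximate saving for 7e9 params:    {saved_gb:.2f} GB")
```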

### 1.3 LoRA (Low-Rank Adaptation)

Instead of updating all model weights, LoRA adds small trainable matrices:

```
Original:  W_new = W_original + delta_W
LoRA:      W_new = W_original + (A @ B)  where A: [d, r], B: [r, d]

With rank r=32 and dimension d=4096:
- Full fine-tuning: 4096 * 4096 = 16.7M parameters
- LoRA: 4096 * 32 + 32 * 4096 = 262K parameters (64x fewer!)
```
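
The identity can be checked on a tiny example. The sketch below uses plain Python lists and the same row-vector convention as the formula above (`x @ W`), with made-up dimensions `d=6`, `r=2`. The helper names are illustrative.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two matrices of equal shape."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(X, Y)]

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def lora_params(d_in, d_out, r):
    """Number of trainable values in one LoRA pair: A is d_in x r, B is r x d_out."""
    return r * (d_in + d_out)

random.seed(0)
d, r = 6, 2
W = rand_matrix(d, d)  # frozen base weight
A = rand_matrix(d, r)  # trainable down-projection
B = rand_matrix(r, d)  # trainable up-projection (random here for the check)
x = rand_matrix(1, d)  # one input row vector

merged = matmul(x, add(W, matmul(A, B)))               # x @ (W + A @ B)
adapter = add(matmul(x, W), matmul(matmul(x, A), B))   # x @ W + (x @ A) @ B
max_diff = max(abs(m - a) for m, a in zip(merged[0], adapter[0]))

print("max difference between the two forms:", max_diff)
print("LoRA params for d=4096, r=32:", lora_params(4096, 4096, 32))
```

The second form is what runs during training: the base weight is never modified, and only `A` and `B` receive gradients. PEFT additionally multiplies the adapter term by `lora_alpha / r`.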

In [None]:
# Visual representation of LoRA
print("""
LoRA Architecture:
==================

                    Input (x)
                       |
           +-----------+-----------+
           |                       |
           v                       v
    +-------------+         +-------------+
    |   W_frozen  |         |    A (down) |  <- Trainable (d x r)
    | (d x d)     |         +-------------+
    | 4-bit quant |                |
    +-------------+                v
           |              +-------------+
           |              |    B (up)   |  <- Trainable (r x d)
           |              +-------------+
           |                       |
           +-----------+-----------+
                       | (add)
                       v
                    Output

Where:
  - d = hidden dimension (4096 for 7B models)
  - r = LoRA rank (typically 8-64)
  - W_frozen = original weights (quantized to 4-bit)
  - A, B = LoRA adapter matrices (trained in FP16/BF16)
""")

## 2. Load Model with 4-bit Quantization

We'll load Mistral-7B with 4-bit quantization and compare VRAM usage.

In [None]:
# Configuration
MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.3"
OUTPUT_DIR = "./checkpoints/ecommerce-qlora"

# Create output directory
os.makedirs(OUTPUT_DIR, exist_ok=True)

print(f"Model: {MODEL_NAME}")
print(f"Output directory: {OUTPUT_DIR}")

In [None]:
def get_gpu_memory():
    """Get current GPU memory usage in GB."""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        return allocated, reserved
    return 0, 0

def print_gpu_memory(label=""):
    """Print current GPU memory usage."""
    allocated, reserved = get_gpu_memory()
    print(f"{label}")
    print(f"  GPU Memory Allocated: {allocated:.2f} GB")
    print(f"  GPU Memory Reserved:  {reserved:.2f} GB")

# Initial memory
print_gpu_memory("Initial GPU Memory:")

In [None]:
# Define 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # Enable 4-bit quantization
    bnb_4bit_quant_type="nf4",           # Use NormalFloat4 format
    bnb_4bit_compute_dtype=torch.bfloat16, # Compute in BF16 for stability
    bnb_4bit_use_double_quant=True,       # Enable double quantization
)

print("BitsAndBytes Configuration:")
print(f"  load_in_4bit: {bnb_config.load_in_4bit}")
print(f"  bnb_4bit_quant_type: {bnb_config.bnb_4bit_quant_type}")
print(f"  bnb_4bit_compute_dtype: {bnb_config.bnb_4bit_compute_dtype}")
print(f"  bnb_4bit_use_double_quant: {bnb_config.bnb_4bit_use_double_quant}")

In [None]:
# Load tokenizer
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)

# Set padding token (required for training)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

tokenizer.padding_side = "right"  # Required for causal LM training

print(f"Vocabulary size: {tokenizer.vocab_size:,}")
print(f"Pad token: {tokenizer.pad_token}")
print(f"EOS token: {tokenizer.eos_token}")

In [None]:
# Load model with 4-bit quantization
print("\nLoading model with 4-bit quantization...")
print("(This may take a few minutes on first download)")

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

print("\nModel loaded successfully!")
print_gpu_memory("GPU Memory After Loading 4-bit Model:")

In [None]:
# Show model architecture summary
print("\nModel Architecture Summary:")
print(f"  Model type: {model.config.model_type}")
print(f"  Hidden size: {model.config.hidden_size}")
print(f"  Num layers: {model.config.num_hidden_layers}")
print(f"  Num attention heads: {model.config.num_attention_heads}")
print(f"  Vocab size: {model.config.vocab_size:,}")

# Count parameters (4-bit weights are stored packed, so this total undercounts the true parameter count)
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"\nParameter Count:")
print(f"  Total parameters: {total_params:,}")
print(f"  Trainable parameters: {trainable_params:,}")
print(f"  Trainable %: {100 * trainable_params / total_params:.2f}%")

## 3. Configure LoRA Adapters

We'll add LoRA adapters to all linear layers for comprehensive fine-tuning.
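
A back-of-the-envelope count of the trainable parameters this adds, as a sketch. The layer shapes are assumptions taken from the published Mistral-7B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 decoder layers). The names are illustrative.

```python
R = 32  # LoRA rank used in this notebook
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 4096, 14336, 1024, 32  # assumed shapes

# (in_features, out_features) of each linear layer in one decoder block
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank):
    """A is in_features x rank, B is rank x out_features."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in LINEAR_SHAPES.values())
total = per_layer * NUM_LAYERS

print(f"LoRA parameters per decoder layer: {per_layer:,}")
print(f"LoRA parameters in total:          {total:,}")
```

Under these assumptions the adapters hold about 84M parameters, a little over 1% of the base model.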

In [None]:
# Prepare model for k-bit training (enables gradient checkpointing)
model = prepare_model_for_kbit_training(
    model,
    use_gradient_checkpointing=True,
)

print("Model prepared for k-bit training with gradient checkpointing")

In [None]:
# Find all linear layer names for LoRA targeting
def find_all_linear_names(model):
    """Find all linear layer names in the model."""
    cls = torch.nn.Linear
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[-1])
    
    # Remove output layer if present
    if 'lm_head' in lora_module_names:
        lora_module_names.remove('lm_head')
    
    return list(lora_module_names)

target_modules = find_all_linear_names(model)
print(f"Target modules for LoRA: {target_modules}")

In [None]:
# Define LoRA configuration
lora_config = LoraConfig(
    r=32,                          # LoRA rank (higher = more capacity, more memory)
    lora_alpha=64,                 # LoRA scaling factor (typically 2*r)
    target_modules=target_modules,  # Apply to all linear layers
    lora_dropout=0.05,             # Dropout for regularization
    bias="none",                   # Don't train biases
    task_type="CAUSAL_LM",         # Causal language modeling
)

print("\nLoRA Configuration:")
print(f"  Rank (r): {lora_config.r}")
print(f"  Alpha: {lora_config.lora_alpha}")
print(f"  Scaling factor: {lora_config.lora_alpha / lora_config.r}")
print(f"  Dropout: {lora_config.lora_dropout}")
print(f"  Target modules: {lora_config.target_modules}")
print(f"  Task type: {lora_config.task_type}")

In [None]:
# Apply LoRA adapters to model
print("\nApplying LoRA adapters...")
model = get_peft_model(model, lora_config)

# Print trainable parameters
model.print_trainable_parameters()

print_gpu_memory("\nGPU Memory After Adding LoRA Adapters:")

In [None]:
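# Visual comparison of parameter efficiency
# Note: bitsandbytes stores 4-bit weights packed two per byte, so p.numel()
# undercounts the frozen weights; print_trainable_parameters() above adjusts for this.
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen_params = total_params - trainable_params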

print("""
Parameter Efficiency Comparison:
================================

Full Fine-tuning:    [##############################] ~7B params (100%)
                      All parameters updated every step
                      Weights alone: ~14GB (FP16) or ~28GB (FP32)

QLoRA Fine-tuning:   [#]                              ~84M params (~1.1%)
                      Only LoRA adapters updated
                      VRAM: ~5-6GB (4-bit base + LoRA adapters)
""")

print(f"\nActual numbers for this model:")
print(f"  Total parameters:     {total_params:>15,}")
print(f"  Trainable parameters: {trainable_params:>15,}")
print(f"  Frozen parameters:    {frozen_params:>15,}")
print(f"  Training efficiency:  {100 * trainable_params / total_params:.4f}%")

## 4. Prepare Training Data

We'll use a small subset of ECInstruct for this demo (100 examples).

In [None]:
# Load ECInstruct dataset
print("Loading ECInstruct dataset...")
ecinstruct = load_dataset("NingLab/ECInstruct", split="train")
print(f"Total examples: {len(ecinstruct):,}")

# Take a small sample for demo
DEMO_SIZE = 100
demo_dataset = ecinstruct.shuffle(seed=42).select(range(DEMO_SIZE))
print(f"Demo dataset size: {len(demo_dataset)}")

In [None]:
# Define prompt formatting function
def format_instruction_prompt(example):
    """Format an example into the training prompt format."""
    instruction = example.get('instruction', '')
    input_text = example.get('input', '')
    output = example.get('output', '')
    
    # Determine task type and add prefix
    instruction_lower = instruction.lower()
    if any(kw in instruction_lower for kw in ['classify', 'category', 'categorize']):
        task_prefix = "[CLASSIFY] "
    elif any(kw in instruction_lower for kw in ['extract', 'attribute', 'specification']):
        task_prefix = "[EXTRACT] "
    elif any(kw in instruction_lower for kw in ['question', 'answer', 'what', 'how']):
        task_prefix = "[QA] "
    else:
        task_prefix = ""
    
    # Format prompt using Mistral's chat template style
    if input_text and input_text.strip():
        prompt = f"""<s>[INST] {task_prefix}{instruction}

Input: {input_text} [/INST] {output}</s>"""
    else:
        prompt = f"""<s>[INST] {task_prefix}{instruction} [/INST] {output}</s>"""
    
    return prompt

# Test formatting
sample = demo_dataset[0]
formatted = format_instruction_prompt(sample)
print("Sample formatted prompt:")
print("=" * 60)
print(formatted[:1000])
print("..." if len(formatted) > 1000 else "")

In [None]:
# Format dataset
def formatting_prompts_func(examples):
    """Format a batch of examples."""
    texts = []
    for i in range(len(examples['instruction'])):
        example = {
            'instruction': examples['instruction'][i],
            'input': examples['input'][i] if 'input' in examples else '',
            'output': examples['output'][i],
        }
        texts.append(format_instruction_prompt(example))
    return {'text': texts}

# Apply formatting
demo_dataset = demo_dataset.map(
    formatting_prompts_func,
    batched=True,
    remove_columns=demo_dataset.column_names,
)

print(f"Formatted dataset columns: {demo_dataset.column_names}")
print(f"Sample text length: {len(demo_dataset[0]['text'])} characters")

## 5. Run Mini Training

We'll train for 25 optimizer steps to demonstrate the process.
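
With the settings used in this demo (batch size 1, gradient accumulation 4, 25 steps, 10 warmup steps, learning rate 2e-4, cosine schedule), the learning rate and data consumption can be sketched as follows. The schedule is a plain reimplementation of linear warmup followed by half-cosine decay, assuming a single GPU. It is meant to mirror the shape of the schedule, not the library's code.

```python
import math

def lr_at(step, base_lr=2e-4, warmup_steps=10, total_steps=25):
    """Linear warmup followed by half-cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

per_device_batch, grad_accum, max_steps = 1, 4, 25
examples_seen = per_device_batch * grad_accum * max_steps

print(f"Examples consumed: {examples_seen}")
for step in (0, 5, 10, 18, 25):
    print(f"step {step:>2}: lr = {lr_at(step):.2e}")
```

On one GPU, 25 steps at an effective batch size of 4 consume 100 examples, which is one pass over the 100-example demo subset.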

In [None]:
# Define training arguments
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=1,  # Ignored while max_steps is set below
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    warmup_steps=10,
    max_steps=25,  # Limit steps for demo
    learning_rate=2e-4,
    fp16=False,
    bf16=True,  # Use BF16 for training stability
    logging_steps=5,
    save_steps=25,
    save_total_limit=2,
    optim="paged_adamw_8bit",  # Memory-efficient optimizer
    lr_scheduler_type="cosine",
    report_to="none",  # Disable wandb for demo
    gradient_checkpointing=True,
    max_grad_norm=0.3,
)

print("Training Arguments:")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Gradient accumulation: {training_args.gradient_accumulation_steps}")
print(f"  Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Max steps: {training_args.max_steps}")
print(f"  Optimizer: {training_args.optim}")

In [None]:
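# Create SFT trainer
# (These keyword arguments follow older TRL releases; newer releases move
# options such as dataset_text_field and packing into SFTConfig.)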
trainer = SFTTrainer(
    model=model,
    train_dataset=demo_dataset,
    args=training_args,
    tokenizer=tokenizer,
    max_seq_length=2048,
    dataset_text_field="text",
    packing=False,  # Disable packing for simplicity
)

print("SFT Trainer created successfully!")
print_gpu_memory("GPU Memory Before Training:")

In [None]:
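# Test generation before training
def generate_response(prompt, max_new_tokens=256):
    """Generate a response for a prompt and return only the newly generated text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    prompt_length = inputs["input_ids"].shape[1]

    model.eval()  # Disable LoRA dropout while generating
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.1,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
        )

    # Decode only the tokens produced after the prompt; the decoded prompt
    # would not match test_prompt once special tokens are stripped.
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    return response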

# Test prompt
test_prompt = """<s>[INST] [CLASSIFY] Classify this product into Google Product Taxonomy categories:

Input: Apple MacBook Pro 14-inch with M3 Pro chip, 18GB RAM, 512GB SSD, Space Gray [/INST]"""

print("Output BEFORE training:")
print("=" * 60)
before_response = generate_response(test_prompt)
print(before_response)

In [None]:
# Run training
print("\nStarting training...")
print("=" * 60)

start_time = datetime.now()
train_result = trainer.train()
end_time = datetime.now()

print("\nTraining completed!")
print(f"Duration: {end_time - start_time}")
print(f"Final loss: {train_result.training_loss:.4f}")

print_gpu_memory("\nGPU Memory After Training:")

In [None]:
# Test generation after training
print("\nOutput AFTER training:")
print("=" * 60)
after_response = generate_response(test_prompt)
print(after_response)

In [None]:
# Compare before and after
print("\n" + "=" * 80)
print("COMPARISON: Before vs After Training")
print("=" * 80)

print("\n[BEFORE TRAINING]")
print("-" * 40)
print(before_response[len(test_prompt):] if test_prompt in before_response else before_response)

print("\n[AFTER TRAINING]")
print("-" * 40)
print(after_response[len(test_prompt):] if test_prompt in after_response else after_response)

print("\n" + "=" * 80)
print("Note: With only 25 training steps, improvements may be subtle.")
print("Full training (3 epochs) produces more significant improvements.")
print("=" * 80)

## 6. Save and Load Checkpoints

One advantage of LoRA is that we only need to save the small adapter weights, not the full model.
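
A rough size estimate, as a sketch. The adapter parameter count is an assumption (rank 32 on all seven linear projections of a Mistral-7B-shaped model), and the base model is taken as a round 7 billion parameters in FP16. `size_mb` is an illustrative helper.

```python
def size_mb(num_params: float, bytes_per_param: int) -> float:
    """Approximate on-disk size in MB (1 MB = 1e6 bytes), ignoring file metadata."""
    return num_params * bytes_per_param / 1e6

ADAPTER_PARAMS = 83_886_080  # assumed: rank-32 adapters on every linear projection
BASE_PARAMS = 7e9            # assumed round figure for the base model

adapter_fp32 = size_mb(ADAPTER_PARAMS, 4)
adapter_16bit = size_mb(ADAPTER_PARAMS, 2)
base_fp16 = size_mb(BASE_PARAMS, 2)

print(f"Adapter saved in FP32:   {adapter_fp32:,.0f} MB")
print(f"Adapter saved in 16-bit: {adapter_16bit:,.0f} MB")
print(f"Base model in FP16:      {base_fp16:,.0f} MB")
```

The actual file size depends on the dtype the adapter is saved in, so treat these as order-of-magnitude figures.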

In [None]:
# Save LoRA adapters
adapter_path = os.path.join(OUTPUT_DIR, "final_adapter")
model.save_pretrained(adapter_path)
tokenizer.save_pretrained(adapter_path)

print(f"\nAdapter saved to: {adapter_path}")

# Check adapter size (portable alternative to `du -sh`)
adapter_bytes = sum(
    os.path.getsize(os.path.join(root, name))
    for root, _, files in os.walk(adapter_path)
    for name in files
)
print(f"Adapter size: {adapter_bytes / 1e6:.1f} MB")

# List saved files
print("\nSaved files:")
for f in os.listdir(adapter_path):
    file_path = os.path.join(adapter_path, f)
    size = os.path.getsize(file_path) / 1e6  # Size in MB
    print(f"  {f}: {size:.2f} MB")

In [None]:
# Record training metadata alongside the adapter
print("\nSaving training metadata...")

# Save adapter config for reference
config_info = {
    "base_model": MODEL_NAME,
    "lora_rank": lora_config.r,
    "lora_alpha": lora_config.lora_alpha,
    # PEFT may store target_modules as a set, which json cannot serialize
    "target_modules": sorted(lora_config.target_modules),
    "training_steps": training_args.max_steps,
    "final_loss": train_result.training_loss,
    "timestamp": datetime.now().isoformat(),
}

with open(os.path.join(adapter_path, "training_info.json"), "w") as f:
    json.dump(config_info, f, indent=2)

print("\nTraining info saved:")
print(json.dumps(config_info, indent=2))

In [None]:
# Code to load adapter in a new session (for reference)
print("""
To load the trained adapter in a new session:
==============================================

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load base model with quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    quantization_config=bnb_config,
    device_map="auto",
)

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "./checkpoints/ecommerce-qlora/final_adapter",
)

tokenizer = AutoTokenizer.from_pretrained(
    "./checkpoints/ecommerce-qlora/final_adapter"
)

# Model is ready for inference!
""")

## 7. Summary and Next Steps

### What We Learned:

1. **QLoRA Architecture**:
   - 4-bit NF4 quantization reduces weight memory roughly 4x relative to FP16
   - LoRA adapters train only ~1% of parameters at rank 32
   - Combined savings enable 7B model training on ~6GB VRAM

2. **Configuration**:
   - LoRA rank=32, alpha=64 for good quality-efficiency tradeoff
   - Target all linear layers for comprehensive adaptation
   - Use paged_adamw_8bit optimizer for memory efficiency

3. **Adapter Management**:
   - Adapters are small (a few hundred MB at rank 32) vs. the full FP16 model (~14GB)
   - Can maintain multiple task-specific adapters
   - Easy to share and version control

### Production Training Recommendations:

```python
# Full training configuration
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,              # Few epochs to limit overfitting
    per_device_train_batch_size=4,   # Adjust based on VRAM
    gradient_accumulation_steps=4,   # Effective batch size = 16 per GPU
    learning_rate=2e-4,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    max_grad_norm=0.3,
    bf16=True,
    optim="paged_adamw_8bit",
    save_strategy="epoch",
    # Requires an eval_dataset; renamed eval_strategy in newer transformers releases
    evaluation_strategy="epoch",
)
```

### Next Steps:
1. Run full training on complete ECInstruct + 10% Alpaca mixture
2. Evaluate on held-out test set
3. Deploy with vLLM for production inference

In [None]:
# Clean up GPU memory
print("Cleaning up...")

import gc

del model
del trainer
gc.collect()  # Drop lingering Python references before releasing cached GPU blocks
torch.cuda.empty_cache()

print_gpu_memory("Final GPU Memory:")
print("\nNotebook complete!")