<!-- Banner Image -->
<img src="https://uohmivykqgnnbiouffke.supabase.co/storage/v1/object/public/landingpage/brevdevnotebooks.png" width="100%">

<!-- Links -->
<center>
  <a href="https://console.brev.dev" style="color: #06b6d4;">Console</a> •
  <a href="https://brev.dev" style="color: #06b6d4;">Docs</a> •
  <a href="/" style="color: #06b6d4;">Templates</a> •
  <a href="https://discord.gg/NVDyv7TUgJ" style="color: #06b6d4;">Discord</a>
</center>

# ⚡ Fine-Tuning Showdown: Unsloth vs HuggingFace

## Real Benchmarks. Same Model. Same Dataset. YOUR GPU.

<div style="background: linear-gradient(135deg, #f093fb 0%, #f5576c 100%); padding: 30px; border-radius: 10px; color: white; margin-bottom: 20px;">
  <h2 style="margin-top: 0; color: white;">🎯 What You'll Discover</h2>
  <p style="font-size: 18px; line-height: 1.6;">
    <strong>Stop guessing. Start measuring.</strong><br/><br/>
    We'll train <strong>the same model</strong> with <strong>the same dataset</strong> using two frameworks:<br/>
    ⚡ <strong>Unsloth</strong> - Claims 2× speed, let's verify<br/>
    🤗 <strong>HuggingFace</strong> - Vanilla baseline (PEFT + Trainer)<br/>
    <br/>
    <strong>🔥 GPU starts training in 60 seconds. Side-by-side results in 10 minutes.</strong>
  </p>
</div>

## 📋 Prerequisites

- **GPU**: NVIDIA GPU with 16GB+ VRAM (A100, H100, L40S, RTX 4090, RTX 3090)
- **CUDA**: 11.8+ or 12.1+
- **Python**: 3.10+
- **Disk Space**: 20GB free (for models + datasets)
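
The Python and disk-space requirements above can be checked before any GPU work starts. The following is a minimal sketch using only the standard library; the helper names (`check_python`, `free_disk_gb`, `preflight`) are illustrative and not part of this notebook's cells.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)     # minimum version from the prerequisites list
MIN_FREE_DISK_GB = 20    # free space from the prerequisites list

def check_python(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the interpreter is at least `minimum` (major, minor)."""
    return tuple(version_info[:2]) >= tuple(minimum)

def free_disk_gb(path="."):
    """Free space on the filesystem containing `path`, in decimal GB."""
    return shutil.disk_usage(path).free / 1e9

def preflight(path="."):
    """Collect pass/fail results for the non-GPU prerequisites."""
    return {
        "python_ok": check_python(),
        "disk_ok": free_disk_gb(path) >= MIN_FREE_DISK_GB,
    }

print(preflight())
```

The GPU and CUDA requirements are verified separately in the first code cell.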

---

## 🎬 Quick Start: What Happens Next

The next few cells will:
1. **Install Unsloth** and its dependencies (~1-2 min)
2. **Load the Qwen2.5-1.5B-Instruct model** (~15 sec)
3. **Start training** immediately (2-3 min)
4. **Train with vanilla HF** (same config)
5. **Compare ALL metrics** side-by-side

**Total time: ~10 minutes for complete comparison** ⚡

---

#### 💬 Questions? Join us on [Discord](https://discord.gg/NVDyv7TUgJ) or reach out on [X/Twitter](https://x.com/brevdev)

**📝 Notebook Tips**: Press `Shift + Enter` to run cells. A `*` means running, a number means complete.

---

## 1. Verify GPU Setup 🎮

Let's make sure your NVIDIA GPU is ready for fine-tuning!


In [None]:
# Cell 1: GPU Verification
# =========================
# Quick check that GPU is available and has enough memory

import subprocess
import sys
import os

print("="*80)
print("🎮 GPU STATUS CHECK")
print("="*80 + "\n")

# Check nvidia-smi
try:
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total,driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        timeout=5
    )
    
    if result.returncode == 0:
        # nvidia-smi prints one CSV line per GPU; report the first one
        first_gpu = result.stdout.strip().splitlines()[0]
        gpu_info = [field.strip() for field in first_gpu.split(",")]
        print(f"✅ GPU Detected: {gpu_info[0]}")
        print(f"✅ VRAM: {gpu_info[1]}")
        print(f"✅ Driver: {gpu_info[2]}")
        
        # Check if enough memory (nvidia-smi reports MiB)
        vram_str = gpu_info[1].replace('MiB', '').strip()
        try:
            vram_gb = float(vram_str) / 1024
            if vram_gb < 16:
                print(f"\n⚠️  Warning: Your GPU has {vram_gb:.1f}GB VRAM.")
                print("   This notebook recommends 16GB+ for full comparison.")
                print("   Training may still work with smaller models or reduced batch sizes.")
        except ValueError:
            print("   (Could not parse VRAM size)")
    else:
        print("❌ nvidia-smi failed. Is NVIDIA driver installed?")
        sys.exit(1)
        
except FileNotFoundError:
    print("❌ nvidia-smi not found. Please install NVIDIA drivers.")
    sys.exit(1)
except Exception as e:
    print(f"❌ Error checking GPU: {e}")
    sys.exit(1)

# Check PyTorch CUDA
print("\n" + "-"*80)
print("Checking PyTorch CUDA support...\n")

try:
    import torch
    if torch.cuda.is_available():
        print(f"✅ PyTorch {torch.__version__}")
        print(f"✅ CUDA {torch.version.cuda}")
        print(f"✅ {torch.cuda.device_count()} GPU(s) available")
    else:
        print("❌ PyTorch installed but CUDA not available")
        sys.exit(1)
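except ImportError:
    print("⚠️  PyTorch not found. Installing...")
    subprocess.run([sys.executable, "-m", "pip", "install", "torch", "-q"], check=True)
    import torch
    print(f"✅ PyTorch {torch.__version__} installed")
    # A fresh install still needs a CUDA-enabled build to be usable here
    if not torch.cuda.is_available():
        print("❌ PyTorch installed but CUDA not available")
        sys.exit(1)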

print("\n" + "="*80)
print("✅ GPU READY FOR TRAINING!")
print("="*80 + "\n")


## 2. Install Unsloth (Fastest Method) ⚡

Unsloth claims 2× faster training with 60% less memory. Let's verify!

We'll train with Unsloth first, then compare with vanilla HuggingFace.
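
Once both runs have finished, the 2× speed and 60% memory claims reduce to two ratios. Here is a minimal sketch of that arithmetic, assuming each run is summarized as a dict with `time_seconds` and `memory_gb` keys; `compare_runs` is a hypothetical helper, and the numbers are placeholders chosen to match the claim, not measurements.

```python
def compare_runs(baseline, candidate):
    """Compare two result dicts that carry `time_seconds` and `memory_gb`."""
    speedup = baseline["time_seconds"] / candidate["time_seconds"]
    memory_saved_pct = 100 * (1 - candidate["memory_gb"] / baseline["memory_gb"])
    return {"speedup": speedup, "memory_saved_pct": memory_saved_pct}

# Placeholder values for illustration only, NOT benchmark results
example_hf = {"time_seconds": 200.0, "memory_gb": 10.0}
example_unsloth = {"time_seconds": 100.0, "memory_gb": 4.0}

summary = compare_runs(example_hf, example_unsloth)
print(f"Speedup: {summary['speedup']:.2f}x")
print(f"Memory saved: {summary['memory_saved_pct']:.0f}%")
```

A speedup above 1.0 means the candidate finished faster than the baseline; a positive `memory_saved_pct` means it peaked at less GPU memory.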


In [None]:
# Cell 2: Install Unsloth & Dependencies
# =======================================

import time

print("="*80)
print("⚡ INSTALLING UNSLOTH & DEPENDENCIES")
print("="*80 + "\n")

install_start = time.time()

packages = [
    "unsloth",
    "transformers",
    "datasets",
    "peft",
    "trl",
    "accelerate",
    "bitsandbytes",
    "matplotlib",
    "seaborn",
    "pandas"
]

print("📦 Installing packages (this may take 1-2 minutes)...\n")

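for package in packages:
    print(f"   Installing {package}... ", end="", flush=True)
    try:
        result = subprocess.run(
            [sys.executable, "-m", "pip", "install", package, "-q"],
            capture_output=True,
            text=True,
            timeout=180,
            check=False
        )
        # pip reports failure through its exit code, so check it explicitly
        if result.returncode == 0:
            print("✅")
        else:
            print("⚠️  (pip exited with an error)")
    except Exception as e:
        print(f"⚠️  ({e})")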

install_time = time.time() - install_start

print(f"\n✅ Installation complete in {install_time:.1f}s")
print("="*80 + "\n")


## 3. Load Model & Dataset 📦

**Model:** Qwen2.5-1.5B-Instruct (fast to train, production-quality)  
**Dataset:** OpenHermes-2.5 (5K samples, high-quality instruction data)
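
For a sense of why a model this size fits comfortably on a 16GB card, the weight footprint can be estimated from the parameter count and the bytes stored per parameter. This is a rough sketch: it assumes the nominal 1.5B count from the model name and ignores quantization metadata, activations, optimizer state, LoRA adapters, and the CUDA context, so the memory actually reported after loading will be higher.

```python
# Approximate storage cost per parameter at common precisions
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp16": 2.0, "int8": 1.0, "4bit": 0.5}

def weight_memory_gb(num_params, precision):
    """Approximate weight-only memory in decimal GB."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

APPROX_PARAMS = 1.5e9  # nominal size taken from the model name, not an exact count

for precision in ("fp32", "bf16", "4bit"):
    print(f"{precision:>5}: ~{weight_memory_gb(APPROX_PARAMS, precision):.2f} GB")
```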


In [None]:
# Cell 3: Load Model + Dataset
# =============================

from unsloth import FastLanguageModel
from datasets import load_dataset

print("="*80)
print("📦 LOADING MODEL & DATASET")
print("="*80 + "\n")

MODEL_NAME = "unsloth/Qwen2.5-1.5B-Instruct"
MAX_SEQ_LENGTH = 2048

print(f"[1/2] Loading model: {MODEL_NAME}...\n")
model_start = time.time()

try:
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=MODEL_NAME,
        max_seq_length=MAX_SEQ_LENGTH,
        dtype=None,
        load_in_4bit=True,
    )
    
    model_time = time.time() - model_start
    print(f"   ✅ Model loaded in {model_time:.1f}s")
    
    if torch.cuda.is_available():
        memory_allocated = torch.cuda.memory_allocated(0) / 1e9
        print(f"   📊 GPU Memory: {memory_allocated:.2f} GB\n")
        
except Exception as e:
    print(f"   ❌ Model loading failed: {e}")
    raise

print(f"[2/2] Loading dataset: OpenHermes-2.5 (5,000 samples)...\n")
dataset_start = time.time()

try:
    dataset = load_dataset("teknium/OpenHermes-2.5", split="train[:5000]")
    dataset_time = time.time() - dataset_start
    print(f"   ✅ Dataset loaded in {dataset_time:.1f}s")
    print(f"   📝 {len(dataset)} training samples\n")
except Exception as e:
    print(f"   ⚠️  Primary dataset failed ({e}), using backup...")
    dataset = load_dataset("mlabonne/guanaco-llama2-1k", split="train")
    print(f"   ✅ Backup dataset: {len(dataset)} samples\n")

print("="*80)
print("✅ READY TO START TRAINING!")
print("="*80 + "\n")


## 4. 🔥 START TRAINING: Method 1 - Unsloth

### GPU ACTIVE NOW!

**Configuration** (identical for both methods):
- LoRA rank: 16, alpha: 32
- Batch size: 2 × 4 = 8 effective
- Learning rate: 2e-4, Steps: 60
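
The configuration above implies how much data the run actually sees. A short sketch of the arithmetic follows, assuming a single GPU and the 5,000-sample dataset; `training_budget` is an illustrative helper, not a library function.

```python
def training_budget(per_device_batch, grad_accum_steps, max_steps,
                    dataset_size, num_devices=1):
    """Derive effective batch size and data coverage from trainer settings."""
    effective_batch = per_device_batch * grad_accum_steps * num_devices
    samples_seen = effective_batch * max_steps
    return {
        "effective_batch": effective_batch,
        "samples_seen": samples_seen,
        "epochs": samples_seen / dataset_size,
    }

budget = training_budget(per_device_batch=2, grad_accum_steps=4,
                         max_steps=60, dataset_size=5000)
print(budget)

# Standard LoRA scales the adapter update by alpha / rank
lora_scaling = 32 / 16
print(f"LoRA scaling factor: {lora_scaling}")
```

With these settings the run covers 480 samples, or roughly a tenth of one epoch, so the comparison measures throughput and memory rather than converged model quality.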


In [None]:
# Cell 4: Train with Unsloth
# ===========================

from unsloth import is_bfloat16_supported
from trl import SFTTrainer
from transformers import TrainingArguments

print("="*80)
print("⚡ METHOD 1: UNSLOTH")
print("="*80)
print("\n🔥 GPU TRAINING STARTING...\n")

# Configure LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    lora_dropout=0.05,  # kept identical across both methods; Unsloth's fast path is tuned for dropout=0
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

print("✅ LoRA configured\n")

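# Format dataset
# OpenHermes-2.5 stores ShareGPT-style turns ({"from": ..., "value": ...}),
# while chat templates expect {"role": ..., "content": ...} messages.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def format_prompts(examples):
    # The backup dataset already ships a pre-formatted "text" column
    if "conversations" not in examples:
        return {"text": examples["text"]}
    texts = []
    for convs in examples["conversations"]:
        messages = [
            {"role": ROLE_MAP.get(turn["from"], turn["from"]), "content": turn["value"]}
            for turn in convs
        ]
        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
        texts.append(text)
    return {"text": texts}

dataset = dataset.map(format_prompts, batched=True)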
print("✅ Dataset formatted\n")

# Reset GPU stats
torch.cuda.reset_peak_memory_stats()
training_start = time.time()

# Training!
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs_unsloth",
        report_to="none",
    ),
)

trainer_stats = trainer.train()

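# Collect metrics
# Note: the timer started before SFTTrainer was constructed, so this wall-clock
# figure includes dataset preprocessing as well as the training loop itself.
unsloth_time = time.time() - training_start
unsloth_memory = torch.cuda.max_memory_allocated(0) / 1e9  # peak allocated, decimal GB
unsloth_loss = trainer_stats.training_loss  # average training loss over the run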

print("\n" + "="*80)
print("✅ UNSLOTH TRAINING COMPLETE!")
print("="*80)
print(f"\n⏱️  Time: {unsloth_time/60:.2f} min ({unsloth_time:.1f}s)")
print(f"💾 Peak Memory: {unsloth_memory:.2f} GB")
print(f"📉 Avg Training Loss: {unsloth_loss:.4f}")

# Save checkpoint
model.save_pretrained("unsloth_lora_model")
tokenizer.save_pretrained("unsloth_lora_model")
print(f"💾 Checkpoint saved\n")

# Store results
unsloth_results = {
    "method": "Unsloth",
    "time_seconds": unsloth_time,
    "memory_gb": unsloth_memory,
    "loss": unsloth_loss,
}

print("="*80 + "\n")
