## Step 1: Install llcuda and Check Environment

Installs llcuda v2.2.0 with a forced reinstall so the correct binaries are used, then verifies that a GPU is available by printing each GPU's index, name, and total memory.

In [None]:
%%time
# Install llcuda v2.2.0 (force fresh install to ensure correct binaries)
!pip install -q --no-cache-dir --force-reinstall git+https://github.com/llcuda/llcuda.git@v2.2.0

import llcuda
print(f"✅ llcuda {llcuda.__version__} installed")

# Check GPU
print("\n📊 GPU Info:")
!nvidia-smi --query-gpu=index,name,memory.total --format=csv
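If the GPU details are needed as Python values rather than printed text, the CSV output of the `nvidia-smi` query above can be parsed. The following is a minimal sketch: `parse_gpu_csv` is a hypothetical helper, not part of llcuda, and `SAMPLE` is illustrative text in the shape that query normally produces, not output captured from a real session.

```python
import csv
import io


def parse_gpu_csv(text):
    """Parse `nvidia-smi --query-gpu=index,name,memory.total --format=csv` output."""
    reader = csv.reader(io.StringIO(text.strip()), skipinitialspace=True)
    next(reader)  # skip the header row
    gpus = []
    for row in reader:
        if len(row) != 3:
            continue  # ignore blank or malformed lines
        index, name, memory = row
        mib = int(memory.split()[0])  # "15360 MiB" -> 15360
        gpus.append({"index": int(index), "name": name, "memory_gib": mib / 1024})
    return gpus


# Illustrative sample only (assumed shape of the CSV output)
SAMPLE = """index, name, memory.total [MiB]
0, Tesla T4, 15360 MiB
1, Tesla T4, 15360 MiB
"""

gpus = parse_gpu_csv(SAMPLE)
for gpu in gpus:
    print(f"GPU {gpu['index']}: {gpu['name']} ({gpu['memory_gib']:.1f} GiB)")
```

In a notebook, the real text could be obtained with `subprocess.run([...], capture_output=True, text=True).stdout` and passed to the same function.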

## Step 2: Understanding Quantization Types

GGUF supports multiple quantization types, organized into families.

The cell below prints a reference table of GGUF quantization types grouped by family (Legacy, K-Quants, I-Quants, Full Precision), showing bits per weight, a quality score, and whether an importance matrix (imatrix) is required.

In [None]:
from llcuda.api.gguf import QUANT_TYPE_INFO

print("="*80)
print("📋 GGUF QUANTIZATION TYPES")
print("="*80)

# Group by family
families = {
    "Legacy (Basic)": ["Q4_0", "Q4_1", "Q5_0", "Q5_1", "Q8_0"],
    "K-Quants (Recommended)": ["Q2_K", "Q3_K_S", "Q3_K_M", "Q3_K_L", "Q4_K_S", "Q4_K_M", "Q5_K_S", "Q5_K_M", "Q6_K"],
    "I-Quants (Ultra-Low)": ["IQ2_XXS", "IQ2_XS", "IQ2_S", "IQ3_XXS", "IQ3_XS", "IQ3_S", "IQ3_M", "IQ4_XS", "IQ4_NL"],
    "Full Precision": ["F16", "F32", "BF16"],
}

for family_name, types in families.items():
    print(f"\n🔹 {family_name}:")
    print(f"   {'Type':<12} {'Bits/Weight':<12} {'Quality':<10} {'Notes'}")
    print(f"   {'-'*60}")
    
    for qtype in types:
        if qtype in QUANT_TYPE_INFO:
            info = QUANT_TYPE_INFO[qtype]
            quality_stars = "★" * int(info.quality_score * 5)
            notes = "Needs imatrix" if info.requires_imatrix else ""
            print(f"   {qtype:<12} {info.bits_per_weight:<12.2f} {quality_stars:<10} {notes}")
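The fractional bits/weight values come from per-block metadata stored next to the quantized weights. As a rough sketch for the legacy family, assuming the llama.cpp block layout of 32 weights per block with an fp16 scale (and an extra fp16 minimum for the `_1` variants), the effective bits per weight can be derived as follows. `block_bits_per_weight` and `LEGACY_LAYOUTS` are illustrative names, and the values stored in `QUANT_TYPE_INFO` may differ slightly.

```python
def block_bits_per_weight(weight_bits, extra_bytes, block_size=32):
    """Effective bits per weight for a block that carries per-block metadata."""
    total_bits = block_size * weight_bits + extra_bytes * 8
    return total_bits / block_size


# (bits per quantized weight, metadata bytes per block): assumed llama.cpp layouts
LEGACY_LAYOUTS = {
    "Q4_0": (4, 2),  # fp16 scale
    "Q4_1": (4, 4),  # fp16 scale + fp16 minimum
    "Q5_0": (5, 2),
    "Q5_1": (5, 4),
    "Q8_0": (8, 2),
}

legacy_bpw = {
    name: block_bits_per_weight(bits, extra)
    for name, (bits, extra) in LEGACY_LAYOUTS.items()
}

for name, bpw in legacy_bpw.items():
    print(f"{name:<6} {bpw:.2f} bits/weight")
```

K-Quants and I-Quants use larger super-blocks with nested scales, so their bits/weight do not follow this simple formula.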

## Step 3: Quantization Size Calculator

Calculates estimated GGUF file sizes for common models (1B to 70B parameters) across different quantization types to help predict VRAM requirements and storage needs.

In [None]:
from llcuda.api.gguf import estimate_gguf_size

print("="*80)
print("📊 GGUF SIZE CALCULATOR")
print("="*80)

# Common model sizes
model_sizes = {
    "Gemma-3 1B": 1,
    "Gemma-3 4B": 4,
    "Llama-3.2 3B": 3,
    "Llama-3.1 8B": 8,
    "Qwen2.5 14B": 14,
    "Llama-3.1 70B": 70,
}

# Quantization types to compare
quant_types = ["Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0", "IQ3_XS", "F16"]

print(f"\n{'Model':<18} | ", end="")
for qt in quant_types:
    print(f"{qt:<10}", end="")
print()
print("-" * 80)

for model_name, params_b in model_sizes.items():
    print(f"{model_name:<18} | ", end="")
    for qt in quant_types:
        size_gb = estimate_gguf_size(params_b, qt)
        print(f"{size_gb:<10.1f}", end="")
    print(" GB")
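A back-of-the-envelope version of this estimate is parameters × bits per weight ÷ 8. The sketch below assumes that simple formula; it is not necessarily how `estimate_gguf_size` is implemented, and `approx_gguf_size_gb` is a hypothetical name. The bits/weight figures are the ones quoted in this notebook's K-Quants guide, plus 16 for F16.

```python
def approx_gguf_size_gb(n_params, bits_per_weight):
    """Approximate file size in GB (10^9 bytes) from parameter count and bits/weight."""
    return n_params * bits_per_weight / 8 / 1e9


# Bits per weight as quoted in this notebook
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q6_K": 6.59, "F16": 16.0}

for label, n_params in [("1B", 1e9), ("8B", 8e9)]:
    sizes = ", ".join(
        f"{quant}={approx_gguf_size_gb(n_params, bpw):.1f} GB"
        for quant, bpw in BITS_PER_WEIGHT.items()
    )
    print(f"{label}: {sizes}")
```

Real files usually deviate somewhat from this figure because some tensors are stored at a different precision and the file also carries metadata and the tokenizer.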

## Step 4: Kaggle T4 Recommendations

Provides quantization recommendations for various model sizes on single T4 (15GB) and dual T4 (30GB) configurations, indicating which models fit and optimal quantization types.

In [None]:
from llcuda.api.gguf import recommend_quant_for_kaggle

print("="*80)
print("🎯 KAGGLE T4 QUANTIZATION RECOMMENDATIONS")
print("="*80)

# Test various model sizes (absolute parameter counts)
test_models = [
    ("Gemma-3 1B", 1_000_000_000),
    ("Llama-3.2 3B", 3_000_000_000),
    ("Gemma-3 4B", 4_000_000_000),
    ("Llama-3.1 8B", 8_000_000_000),
    ("Qwen2.5 14B", 14_000_000_000),
    ("Llama-3.1 70B", 70_000_000_000),
]

print("\n🔹 Single T4 (15GB VRAM):")
print(f"   {'Model':<18} {'Recommended':<12} {'Est. Size':<10} {'Fits?'}")
print(f"   {'-'*55}")

for model_name, params in test_models:
    rec = recommend_quant_for_kaggle(params, dual_t4=False)
    if rec['fits']:
        print(f"   {model_name:<18} {rec['quant_type']:<12} {rec['estimated_size_gb']:.1f} GB     ✅")
    else:
        print(f"   {model_name:<18} {rec['quant_type']:<12} {rec['estimated_size_gb']:.1f} GB     ❌ Too large")

print("\n🔹 Dual T4 (30GB VRAM):")
print(f"   {'Model':<18} {'Recommended':<12} {'Est. Size':<10} {'Fits?'}")
print(f"   {'-'*55}")

for model_name, params in test_models:
    rec = recommend_quant_for_kaggle(params, dual_t4=True)
    if rec['fits']:
        print(f"   {model_name:<18} {rec['quant_type']:<12} {rec['estimated_size_gb']:.1f} GB     ✅")
    else:
        print(f"   {model_name:<18} {rec['quant_type']:<12} {rec['estimated_size_gb']:.1f} GB     ❌ Too large")
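The fit decision can be reasoned about by hand: weights must fit in VRAM with room left for the KV cache and CUDA buffers. The sketch below is an assumption about the general idea, not llcuda's actual logic; `fits_in_vram` is a hypothetical helper and the 2 GB `headroom_gb` default is an arbitrary illustrative value.

```python
def fits_in_vram(n_params, bits_per_weight, vram_gb, headroom_gb=2.0):
    """Return (fits, weights_gb) for a model at a given quantization.

    headroom_gb is an assumed allowance for KV cache and runtime buffers.
    """
    weights_gb = n_params * bits_per_weight / 8 / 1e9
    return weights_gb + headroom_gb <= vram_gb, weights_gb


Q4_K_M_BITS = 4.85  # bits/weight as quoted in this notebook

checks = [
    ("8B on single T4", 8e9, 15),
    ("70B on dual T4", 70e9, 30),
]
for label, n_params, vram_gb in checks:
    fits, weights_gb = fits_in_vram(n_params, Q4_K_M_BITS, vram_gb)
    status = "fits" if fits else "too large"
    print(f"{label}: Q4_K_M needs ~{weights_gb:.1f} GB of weights -> {status}")
```

Longer contexts need more headroom, so the allowance should be raised when `ctx_size` grows.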

## Step 5: K-Quants Deep Dive

K-Quants are the recommended choice for most use cases.

Explains the K-Quants mixed-precision approach, with guidance on the Q4_K_M, Q5_K_M, Q6_K, and Q3_K_M variants covering use cases, quality trade-offs, and size comparisons.

In [None]:
print("="*80)
print("📘 K-QUANTS DETAILED GUIDE")
print("="*80)

k_quant_guide = """
K-Quants use a sophisticated mixed-precision approach:
- Attention layers: Higher precision (more important for quality)
- Feed-forward layers: Lower precision (less sensitive)

🔹 Naming Convention:
   Q{bits}_K_{size}
   └─ bits: Base quantization (2,3,4,5,6)
      └─ K: K-quant family marker
         └─ size: S=Small, M=Medium, L=Large

🔹 Recommended K-Quants:

   Q4_K_M (4.85 bits/weight) ⭐ RECOMMENDED
   ├── Best balance of size and quality
   ├── ~70% smaller than FP16 (about 30% of its size)
   └── Minimal quality loss

   Q5_K_M (5.69 bits/weight) - HIGH QUALITY
   ├── Better quality than Q4_K_M
   ├── Good for creative writing
   └── ~20% larger than Q4_K_M

   Q6_K (6.59 bits/weight) - NEAR LOSSLESS
   ├── Almost FP16 quality
   ├── Good for technical tasks
   └── ~35% larger than Q4_K_M

   Q3_K_M (3.89 bits/weight) - MEMORY SAVER
   ├── For larger models on limited VRAM
   ├── Some quality degradation
   └── ~20% smaller than Q4_K_M
"""
print(k_quant_guide)
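The relative-size figures in the guide follow directly from the ratio of bits per weight. As a quick check, using only the bits/weight values quoted above (`percent_change` is an illustrative helper name):

```python
# Bits per weight as quoted in the guide above; F16 stores 16 bits per weight
BITS = {"Q3_K_M": 3.89, "Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q6_K": 6.59, "F16": 16.0}


def percent_change(quant, baseline):
    """Size of `quant` relative to `baseline`, as a signed percentage."""
    return (BITS[quant] / BITS[baseline] - 1.0) * 100.0


for quant, baseline in [
    ("Q4_K_M", "F16"),
    ("Q5_K_M", "Q4_K_M"),
    ("Q6_K", "Q4_K_M"),
    ("Q3_K_M", "Q4_K_M"),
]:
    print(f"{quant} vs {baseline}: {percent_change(quant, baseline):+.0f}%")
```

Because Q4_K_M uses roughly 4.85 of FP16's 16 bits per weight, it comes out at about 30% of the FP16 size.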

## Step 6: I-Quants for Large Models

I-Quants enable running 70B+ models on limited hardware.

Describes I-Quants (Importance-Matrix Quants) for extreme compression, focusing on IQ3_XS for running 70B models on dual T4 GPUs with importance matrix requirements and quality considerations.

In [None]:
print("="*80)
print("📘 I-QUANTS FOR LARGE MODELS")
print("="*80)

i_quant_guide = """
I-Quants (Importance-Matrix Quants) use importance matrices
to determine which weights are most critical for quality.

🔹 Key Requirements:
   ⚠️  Require importance matrix (imatrix) for good quality
   ⚠️  Without imatrix, quality suffers significantly
   ✅ Pre-made imatrix GGUFs are available on HuggingFace

🔹 I-Quant Types:

   IQ3_XS (3.30 bits/weight) ⭐ BEST FOR 70B
   ├── Fits 70B models in ~25GB VRAM
   ├── Surprisingly good quality with imatrix
   └── Ideal for Kaggle dual T4 (30GB)

   IQ4_XS (4.25 bits/weight) - HIGH QUALITY LOW SIZE
   ├── Better than Q4_K_M in some benchmarks
   ├── Slightly smaller file size
   └── Great for 13B-34B models

   IQ2_XS (2.31 bits/weight) - EXTREME COMPRESSION
   ├── For 70B+ when VRAM is very limited
   ├── Noticeable quality degradation
   └── Use only when necessary

🔹 70B Model on Kaggle Dual T4:
   Model: Llama-3.1-70B-Instruct-GGUF
   Quant: IQ3_XS (~25GB)
   Config: tensor-split 0.5,0.5
   Context: 2048 (limited by VRAM)
"""
print(i_quant_guide)
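For reference, the dual-GPU configuration listed above maps onto `llama-server` command-line flags roughly as sketched here. `build_server_args` is a hypothetical helper and the model path is a placeholder; llcuda's `ServerManager` may expose these settings under different parameter names.

```python
def build_server_args(model_path, tensor_split=(0.5, 0.5), ctx_size=2048,
                      gpu_layers=99, port=8080):
    """Build an argv list for llama-server with an even split across two GPUs."""
    split = ",".join(str(fraction) for fraction in tensor_split)
    return [
        "llama-server",
        "-m", model_path,
        "--tensor-split", split,            # share of the model placed on each GPU
        "--ctx-size", str(ctx_size),        # small context to save VRAM
        "--n-gpu-layers", str(gpu_layers),  # a large value offloads all layers
        "--host", "127.0.0.1",
        "--port", str(port),
    ]


# Placeholder path, for illustration only
args = build_server_args("/path/to/model-IQ3_XS.gguf")
print(" ".join(args))
```

The list can be passed to `subprocess.Popen(args)` when starting the server manually.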

## Step 7: Interactive Quant Selector

Prints llcuda's built-in quantization guide, a reference for choosing a quantization type based on model size, VRAM constraints, and quality requirements.

In [None]:
from llcuda.api.gguf import print_quant_guide

print("="*80)
print("🎯 INTERACTIVE QUANTIZATION GUIDE")
print("="*80)

# Print the full guide
print_quant_guide()

## Step 8: Download and Test Different Quantizations

Downloads multiple quantization variants (Q4_K_M, Q5_K_M, Q8_0) of Gemma-3 1B from HuggingFace for side-by-side comparison testing and benchmarking.

In [None]:
from huggingface_hub import hf_hub_download
import os

print("="*80)
print("📥 DOWNLOAD GGUF MODELS FOR COMPARISON")
print("="*80)

# Available Gemma-3 1B quantizations from Unsloth
models_to_test = {
    "Q4_K_M": "gemma-3-1b-it-Q4_K_M.gguf",
    "Q5_K_M": "gemma-3-1b-it-Q5_K_M.gguf",
    "Q8_0": "gemma-3-1b-it-Q8_0.gguf",
}

REPO = "unsloth/gemma-3-1b-it-GGUF"
MODEL_DIR = "/kaggle/working/models"

print(f"\n📥 Downloading from {REPO}...\n")

downloaded = {}
for quant, filename in models_to_test.items():
    print(f"   Downloading {quant}...", end=" ")
    try:
        path = hf_hub_download(
            repo_id=REPO,
            filename=filename,
            local_dir=MODEL_DIR
        )
        size_mb = os.path.getsize(path) / (1024**2)
        downloaded[quant] = path
        print(f"✅ {size_mb:.0f} MB")
    except Exception as e:
        print(f"❌ {e}")

print(f"\n✅ Downloaded {len(downloaded)} models for comparison")
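A cheap sanity check after downloading is to confirm that each file really is a GGUF file. GGUF files begin with the four ASCII bytes `GGUF` followed by a little-endian 32-bit version number. The sketch below checks just that; `read_gguf_header` is a hypothetical helper, and the demo writes a tiny fake header to a temporary file rather than reading a real model.

```python
import os
import struct
import tempfile


def read_gguf_header(path):
    """Return the GGUF version, or raise ValueError if the magic bytes are wrong."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} is not a GGUF file")
    (version,) = struct.unpack("<I", header[4:8])
    return version


# Demo on a fake 8-byte header (not a usable model file)
with tempfile.TemporaryDirectory() as tmp:
    fake = os.path.join(tmp, "fake.gguf")
    with open(fake, "wb") as f:
        f.write(b"GGUF" + struct.pack("<I", 3))
    version = read_gguf_header(fake)

print(f"GGUF version: {version}")
```

Applied to the `downloaded` dictionary, this would catch truncated downloads or HTML error pages saved under a `.gguf` name before a server is started on them.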

## Step 9: Benchmark Different Quantizations

Benchmarks inference speed for each downloaded quantization: starts a separate server per variant, sends one warm-up request, then measures tokens/second on a fixed prompt. The first 100 characters of each response are stored so output quality can be spot-checked alongside speed.

In [None]:
from llcuda.server import ServerManager
from llcuda.api.client import LlamaCppClient
import time
import socket

def get_free_port():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        return s.getsockname()[1]

print("="*80)
print("📊 QUANTIZATION BENCHMARK")
print("="*80)

test_prompt = "Explain what CUDA is in exactly 3 sentences."
results = {}

for quant, model_path in downloaded.items():
    print(f"\n🔹 Testing {quant}...")
    port = get_free_port()
    server = ServerManager(server_url=f"http://127.0.0.1:{port}")
    try:
        server.start_server(
            model_path=model_path,
            host="127.0.0.1",
            port=port,
            gpu_layers=32,
            ctx_size=2048,
            timeout=120,
        )
    except Exception as e:
        print(f"   ❌ Failed to start: {e}")
        if hasattr(server, 'server_process') and server.server_process:
            try:
                _, err = server.server_process.communicate(timeout=5)
                if err:
                    print('   [Server stderr]:', err.decode(errors='ignore'))
            except Exception:
                pass
        continue
    if not server.check_server_health(timeout=120):
        print(f"   ❌ Server not healthy")
        server.stop_server()
        continue
    client = LlamaCppClient(base_url=f"http://127.0.0.1:{port}")
    try:
        client.chat.create(
            messages=[{"role": "user", "content": "Hi"}],
            max_tokens=10
        )
        start = time.time()
        response = client.chat.create(
            messages=[{"role": "user", "content": test_prompt}],
            max_tokens=100,
            temperature=0.7
        )
        elapsed = time.time() - start
        tokens = response.usage.completion_tokens
        tok_per_sec = tokens / elapsed
        results[quant] = {
            "tokens": tokens,
            "time": elapsed,
            "tok_per_sec": tok_per_sec,
            "response": response.choices[0].message.content[:100]
        }
        print(f"   ✓ Tokens: {tokens}, Time: {elapsed:.2f}s, Speed: {tok_per_sec:.1f} tok/s")
    except Exception as e:
        print(f"   ❌ Inference failed: {e}")
    server.stop_server()
    time.sleep(2)

print("\n" + "="*80)
print("📊 BENCHMARK SUMMARY")
print("="*80)
if results:
    for quant, data in results.items():
        print(f"   {quant:<12}: {data['tok_per_sec']:.1f} tok/s")
else:
    print("   No successful benchmarks completed.")
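A single timed request is noisy. One way to get steadier numbers is to repeat the measurement and report the median. The sketch below shows the idea with a stand-in generator; `measure_throughput` and `fake_generate` are hypothetical names, and in the notebook the callable would wrap a chat request and return `response.usage.completion_tokens`.

```python
import statistics
import time


def measure_throughput(generate, runs=3, warmup=1):
    """Median tokens/second over several runs of `generate`.

    `generate` is any callable that performs one request and returns
    the number of tokens it produced.
    """
    for _ in range(warmup):
        generate()  # untimed warm-up
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = generate()
        elapsed = time.perf_counter() - start
        rates.append(n_tokens / elapsed)
    return statistics.median(rates)


def fake_generate():
    """Stand-in for a real request: pretends to emit 100 tokens in ~10 ms."""
    time.sleep(0.01)
    return 100


rate = measure_throughput(fake_generate)
print(f"Median throughput: {rate:.0f} tok/s")
```

The median is less sensitive than the mean to a single slow run caused by other activity on the shared GPU.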

## Step 10: Creating Custom Quantizations

Use llama-quantize to create your own quantized models.

Provides a reference guide for creating custom GGUF quantizations with llama-quantize, including importance matrix generation for I-quants and llcuda API usage examples.

In [None]:
print("="*80)
print("🔧 CREATING CUSTOM QUANTIZATIONS")
print("="*80)

quantize_guide = """
To quantize a model yourself, use llama-quantize:

🔹 Basic Usage:
   llama-quantize input.gguf output.gguf Q4_K_M

🔹 With Importance Matrix (for I-quants):
   # First, generate importance matrix
   llama-imatrix -m model.gguf -f calibration.txt -o imatrix.dat
   
   # Then quantize with imatrix
   llama-quantize --imatrix imatrix.dat input.gguf output.gguf IQ3_XS

🔹 Available in llcuda:
   from llcuda.quantization import quantize_model
   
   quantize_model(
       input_path="model-f16.gguf",
       output_path="model-q4km.gguf",
       quant_type="Q4_K_M"
   )

🔹 From Unsloth/HuggingFace:
   Many popular models already have pre-quantized GGUF builds on HuggingFace.
   Look for repos with '-GGUF' suffix:
   - unsloth/gemma-3-4b-it-GGUF
   - unsloth/Llama-3.2-3B-Instruct-GGUF
   - bartowski/Qwen2.5-14B-Instruct-GGUF
"""
print(quantize_guide)
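The two command lines from the guide can also be assembled programmatically, for example to script several quantizations in a row. This is a minimal sketch: `build_imatrix_cmd` and `build_quantize_cmd` are hypothetical helpers (not llcuda functions), and refusing I-quants without an imatrix is an assumption that follows this notebook's guidance rather than the exact rules enforced by llama-quantize.

```python
def build_imatrix_cmd(model_path, calibration_path, output_path="imatrix.dat"):
    """argv for generating an importance matrix from calibration text."""
    return ["llama-imatrix", "-m", model_path, "-f", calibration_path, "-o", output_path]


def build_quantize_cmd(input_path, output_path, quant_type, imatrix_path=None):
    """argv for llama-quantize, optionally with an importance matrix."""
    # Assumption: treat every IQ* type as needing an imatrix, per the guide above
    if quant_type.upper().startswith("IQ") and imatrix_path is None:
        raise ValueError(f"{quant_type} should be quantized with an imatrix")
    cmd = ["llama-quantize"]
    if imatrix_path is not None:
        cmd += ["--imatrix", imatrix_path]
    cmd += [input_path, output_path, quant_type]
    return cmd


basic_cmd = build_quantize_cmd("input.gguf", "output.gguf", "Q4_K_M")
imatrix_cmd = build_imatrix_cmd("model.gguf", "calibration.txt")
iq_cmd = build_quantize_cmd("input.gguf", "output.gguf", "IQ3_XS", "imatrix.dat")

for cmd in (basic_cmd, imatrix_cmd, iq_cmd):
    print(" ".join(cmd))
```

Each list can be executed with `subprocess.run(cmd, check=True)` once the llama.cpp tools are on the `PATH`.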

## Step 11: Quick Reference Table

Displays a quick reference table mapping model sizes and quantization types to Kaggle T4 feasibility (single or dual GPU), with size estimates and practical deployment notes.

In [None]:
print("="*80)
print("📋 QUICK REFERENCE: MODEL + QUANT → KAGGLE FEASIBILITY")
print("="*80)

reference = """
┌─────────────┬───────────┬───────────┬────────────┬────────────────────────┐
│ Model       │ Quant     │ Size (GB) │ Kaggle T4  │ Notes                  │
├─────────────┼───────────┼───────────┼────────────┼────────────────────────┤
│ 1B params   │ Q4_K_M    │ 0.6       │ ✅ Single  │ Fast, high quality     │
│ 1B params   │ Q8_0      │ 1.1       │ ✅ Single  │ Best quality for 1B    │
├─────────────┼───────────┼───────────┼────────────┼────────────────────────┤
│ 3B params   │ Q4_K_M    │ 1.8       │ ✅ Single  │ Recommended            │
│ 3B params   │ Q5_K_M    │ 2.1       │ ✅ Single  │ Higher quality         │
├─────────────┼───────────┼───────────┼────────────┼────────────────────────┤
│ 4B params   │ Q4_K_M    │ 2.4       │ ✅ Single  │ ⭐ Sweet spot          │
│ 4B params   │ Q6_K      │ 3.3       │ ✅ Single  │ Near lossless          │
├─────────────┼───────────┼───────────┼────────────┼────────────────────────┤
│ 7-8B params │ Q4_K_M    │ 4.5       │ ✅ Single  │ Popular choice         │
│ 7-8B params │ Q5_K_M    │ 5.3       │ ✅ Single  │ Better for coding      │
│ 7-8B params │ Q6_K      │ 6.0       │ ⚠️ Single  │ Tight fit, 4K ctx      │
├─────────────┼───────────┼───────────┼────────────┼────────────────────────┤
│ 13-14B      │ Q4_K_M    │ 8.0       │ ⚠️ Single  │ 2K context max         │
│ 13-14B      │ Q4_K_M    │ 8.0       │ ✅ Dual    │ Split across GPUs      │
│ 13-14B      │ IQ3_XS    │ 5.5       │ ✅ Single  │ Quality trade-off      │
├─────────────┼───────────┼───────────┼────────────┼────────────────────────┤
│ 30-34B      │ Q4_K_M    │ 19        │ ✅ Dual    │ tensor-split 0.5,0.5   │
│ 30-34B      │ IQ3_XS    │ 12        │ ⚠️ Single  │ Low context            │
├─────────────┼───────────┼───────────┼────────────┼────────────────────────┤
│ 70B params  │ IQ3_XS    │ 25        │ ✅ Dual    │ ⭐ 70B on Kaggle!      │
│ 70B params  │ IQ2_XS    │ 19        │ ✅ Dual    │ Lower quality          │
│ 70B params  │ Q4_K_M    │ 40        │ ❌         │ Too large              │
└─────────────┴───────────┴───────────┴────────────┴────────────────────────┘

Legend:
  ✅ Works well    ⚠️ Possible with limits    ❌ Won't fit
"""
print(reference)

## 📚 Summary

### Key Takeaways:

1. **Q4_K_M** is the recommended default - best balance of size and quality
2. **Q5_K_M** for better quality when VRAM allows
3. **IQ3_XS** for fitting large models (70B) on limited hardware
4. Always check HuggingFace for pre-quantized models (faster than doing it yourself)

### Kaggle T4 Guidelines:
- Single T4 (15GB): Up to 8B Q4_K_M comfortably
- Dual T4 (30GB): Up to 70B IQ3_XS or 34B Q4_K_M

---

**Next:** [05-unsloth-integration](05-unsloth-integration-llcuda-v2.2.0.ipynb)