# llcuda + Unsloth Tutorial - Tesla T4

**Tutorial**: Using llcuda as the CUDA backend for Unsloth GGUF models

**What This Demonstrates**:
1. Install llcuda v2.0.1 on Google Colab
2. Use llcuda with Unsloth GGUF models
3. Run fast inference on Tesla T4 GPU
4. Compare performance across quantization levels (Q4_K_M vs. Q8_0)

**Requirements**:
- Google Colab with the runtime type set to GPU (T4)

---

## References

Based on research from:
- [Unsloth GGUF Documentation](https://unsloth.ai/docs/basics/inference-and-deployment/saving-to-gguf)
- [Unsloth GitHub Repository](https://github.com/unslothai/unsloth)
- [llcuda GitHub Repository](https://github.com/waqasm86/llcuda)


## Step 1: Verify GPU (Must be Tesla T4)

In [None]:
# Check GPU
!nvidia-smi --query-gpu=name,compute_cap,memory.total --format=csv

import subprocess
result = subprocess.run(
    ['nvidia-smi', '--query-gpu=name,compute_cap', '--format=csv,noheader'],
    capture_output=True, text=True
)

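# nvidia-smi prints one line per GPU; inspect the first one only
first_gpu_line = result.stdout.strip().splitlines()[0]
gpu_info = first_gpu_line.split(',')
gpu_name = gpu_info[0].strip()
compute_cap = gpu_info[1].strip()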

print(f"\n{'='*60}")
print(f"GPU: {gpu_name}")
print(f"Compute Capability: SM {compute_cap}")
print(f"{'='*60}")

if 'T4' in gpu_name and compute_cap == '7.5':
    print("\n✅ Tesla T4 detected - Perfect for llcuda!")
else:
    print(f"\n⚠️  {gpu_name} detected - llcuda v2.0.1 is optimized for T4")
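
The check above reads a single GPU line. As a minimal sketch, assuming the same `name, compute_cap` CSV layout that `nvidia-smi` prints with `--format=csv,noheader`, the parsing can be pulled into a small function that also copes with several GPUs. The names `parse_gpu_csv` and `is_tesla_t4` are hypothetical helpers for this notebook, not part of llcuda.

```python
def parse_gpu_csv(csv_text):
    """Parse 'name, compute_cap' lines, one per GPU, into a list of tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so a comma inside the GPU name cannot break parsing.
        name, _, compute_cap = line.rpartition(',')
        if name:
            gpus.append((name.strip(), compute_cap.strip()))
    return gpus


def is_tesla_t4(name, compute_cap):
    """True when the GPU reports a T4 name and compute capability 7.5."""
    return 'T4' in name and compute_cap == '7.5'


# Sample text standing in for result.stdout from the cell above.
sample = "Tesla T4, 7.5\n"
for name, cap in parse_gpu_csv(sample):
    print(name, cap, is_tesla_t4(name, cap))
```

In the notebook, `parse_gpu_csv(result.stdout)` could replace the manual `split(',')` calls.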

## Step 2: Install llcuda v2.0.1

In [None]:
# Install llcuda (will auto-download CUDA binaries on first import)
!pip install -q llcuda

print("✅ llcuda installed")
print("\nNote: CUDA binaries (~140 MB) will be downloaded on first import")

In [None]:
# Import llcuda (triggers binary download)
import llcuda

print(f"\n✅ llcuda version: {llcuda.__version__}")
print("\nIf this is first run, binaries were just downloaded")
print("Subsequent runs will use cached binaries (instant)")

In [None]:
# Verify GPU compatibility
compat = llcuda.check_gpu_compatibility()

print(f"\n{'='*60}")
print("GPU Compatibility Check")
print(f"{'='*60}")
print(f"GPU: {compat['gpu_name']}")
print(f"Compute Capability: SM {compat['compute_capability']}")
print(f"Compatible: {compat['compatible']}")
print(f"Platform: {compat['platform']}")
print(f"{'='*60}")
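
Printing the report is informative, but a notebook keeps running even when the GPU is unsupported. The sketch below shows one way to fail early. It assumes only the four dictionary keys printed above; `require_compatible_gpu` is a hypothetical helper, and the `example` dict is a stand-in for the real return value of `llcuda.check_gpu_compatibility()`.

```python
def require_compatible_gpu(compat):
    """Raise early if the compatibility report marks the GPU as unsupported.

    `compat` is assumed to be a dict with the keys printed above:
    'gpu_name', 'compute_capability', 'compatible', 'platform'.
    """
    if not compat.get('compatible', False):
        raise RuntimeError(
            f"{compat.get('gpu_name', 'Unknown GPU')} "
            f"(SM {compat.get('compute_capability', '?')}) "
            "is not reported as compatible"
        )
    return compat


# Stand-in report for illustration only; the values are made up.
example = {
    'gpu_name': 'Tesla T4',
    'compute_capability': '7.5',
    'compatible': True,
    'platform': 'colab',
}
require_compatible_gpu(example)
print("GPU check passed")
```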

## Step 3: Using llcuda with Unsloth GGUF Models

### Method 1: Model Registry (Easiest)

In [None]:
# Initialize inference engine
engine = llcuda.InferenceEngine()

# Load Gemma 3-1B from registry (auto-downloads from HuggingFace)
print("📥 Loading Gemma 3-1B Q4_K_M...")
print("   This may take 2-3 minutes on first run (downloads model)\n")

engine.load_model(
    "gemma-3-1b-Q4_K_M",  # Registry name
    silent=True,           # Suppress server output
    auto_start=True        # Start server automatically
)

print("\n✅ Model loaded and ready!")

In [None]:
# Run inference
prompt = "Explain quantum computing in simple terms."

print(f"\nPrompt: {prompt}")
print("\nGenerating...\n")

result = engine.infer(
    prompt,
    max_tokens=150,
    temperature=0.7
)

print("="*70)
print("RESPONSE:")
print("="*70)
print(result.text)
print("="*70)
print(f"\nPerformance:")
print(f"  Tokens: {result.tokens_generated}")
print(f"  Latency: {result.latency_ms:.1f} ms")
print(f"  Speed: {result.tokens_per_sec:.1f} tokens/sec")
print(f"\nExpected on T4: ~45 tokens/sec")
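
The three numbers in the performance block are related: throughput is the token count divided by latency in seconds. The sketch below shows that arithmetic. Whether llcuda derives `tokens_per_sec` exactly this way is an assumption; `tokens_per_second` is a hypothetical helper.

```python
def tokens_per_second(tokens_generated, latency_ms):
    """Throughput implied by a token count and a wall-clock latency in milliseconds."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return tokens_generated / (latency_ms / 1000.0)


# 150 tokens in 3 seconds works out to 50 tokens/sec.
print(f"{tokens_per_second(150, 3000.0):.1f} tokens/sec")
```

At the ~45 tokens/sec quoted above, a 150-token reply would take a little over three seconds.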

### Method 2: Direct HuggingFace Repository

In [None]:
# Use HuggingFace syntax: repo:filename
engine2 = llcuda.InferenceEngine(server_url="http://127.0.0.1:8091")

print("📥 Loading from Unsloth HuggingFace repository...\n")

engine2.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
    silent=True
)

print("\n✅ Model loaded from Unsloth repo!")

In [None]:
# Test with different prompt
prompt2 = "Write a Python function to calculate fibonacci numbers."

print(f"\nPrompt: {prompt2}")
print("\nGenerating...\n")

result2 = engine2.infer(
    prompt2,
    max_tokens=200,
    temperature=0.3  # Lower temperature for code
)

print("="*70)
print(result2.text)
print("="*70)
print(f"\nSpeed: {result2.tokens_per_sec:.1f} tokens/sec")
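
Method 2 passes a single string in `repo:filename` form. To make the convention concrete, here is a minimal sketch of how such a string could be split. It is not llcuda's own parser; `split_model_spec` is a hypothetical helper, and treating a string without `:` as a registry name or local path is an assumption.

```python
def split_model_spec(spec):
    """Split 'owner/repo:filename.gguf' into (repo_id, filename).

    Returns (None, spec) when there is no ':' separator, which this
    sketch takes to mean a registry name or a local path.
    """
    repo_id, sep, filename = spec.partition(':')
    if not sep:
        return None, spec
    if '/' not in repo_id or not filename.endswith('.gguf'):
        raise ValueError(f"Expected 'owner/repo:file.gguf', got {spec!r}")
    return repo_id, filename


print(split_model_spec("unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf"))
print(split_model_spec("gemma-3-1b-Q4_K_M"))
```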

## Step 4: Batch Inference

In [None]:
# Run multiple prompts
prompts = [
    "What is machine learning?",
    "Explain neural networks briefly.",
    "What is deep learning?"
]

print("\n🚀 Running batch inference...\n")

results = engine.batch_infer(prompts, max_tokens=80)

# Use fresh loop names so the `prompt` and `result` from Step 3 are not overwritten
for i, (query, answer) in enumerate(zip(prompts, results), 1):
    print(f"\n{'='*70}")
    print(f"Query {i}: {query}")
    print(f"{'='*70}")
    print(answer.text)
    print(f"\nSpeed: {answer.tokens_per_sec:.1f} tok/s")

## Step 5: Performance Metrics

In [None]:
# Get aggregated metrics
metrics = engine.get_metrics()

print("\n" + "="*70)
print("PERFORMANCE METRICS")
print("="*70)

print(f"\n📊 Throughput:")
print(f"  Total requests: {metrics['throughput']['total_requests']}")
print(f"  Total tokens: {metrics['throughput']['total_tokens']}")
print(f"  Avg speed: {metrics['throughput']['tokens_per_sec']:.1f} tokens/sec")

print(f"\n⏱️  Latency:")
print(f"  Mean: {metrics['latency']['mean_ms']:.1f} ms")
print(f"  P50: {metrics['latency']['p50_ms']:.1f} ms")
print(f"  P95: {metrics['latency']['p95_ms']:.1f} ms")
print(f"  P99: {metrics['latency']['p99_ms']:.1f} ms")
print(f"  Min: {metrics['latency']['min_ms']:.1f} ms")
print(f"  Max: {metrics['latency']['max_ms']:.1f} ms")

print(f"\n📈 Sample count: {metrics['latency']['sample_count']}")
print("="*70)
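
For readers unfamiliar with P50/P95/P99, the sketch below computes the same set of latency fields from a list of raw samples using the nearest-rank percentile definition. How llcuda computes its percentiles internally is not documented here, so treat this as one plausible definition rather than the library's method. `latency_summary` is a hypothetical helper.

```python
import math
import statistics


def latency_summary(samples_ms):
    """Summarize latency samples (ms) with nearest-rank percentiles."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)

    def percentile(p):
        # Nearest rank: smallest sample with at least p% of samples at or below it.
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    return {
        'mean_ms': statistics.fmean(ordered),
        'p50_ms': percentile(50),
        'p95_ms': percentile(95),
        'p99_ms': percentile(99),
        'min_ms': ordered[0],
        'max_ms': ordered[-1],
        'sample_count': len(ordered),
    }


# Ten made-up samples: 100 ms, 200 ms, ..., 1000 ms.
summary = latency_summary([100.0 * i for i in range(1, 11)])
print(summary)
```

With only a handful of requests, P95 and P99 collapse onto the slowest sample, so tail percentiles become meaningful only after many requests.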

## Step 6: Testing Different Models

Try larger models if you have enough VRAM.

In [None]:
# List available models in registry
from llcuda._internal.registry import list_registry_models

models = list_registry_models()

print("\n" + "="*70)
print("AVAILABLE MODELS IN REGISTRY")
print("="*70)

for name, info in models.items():
    print(f"\n{name}")
    print(f"  Size: {info['size_mb']} MB")
    print(f"  Min VRAM: {info['min_vram_gb']} GB")
    print(f"  {info['description']}")

In [None]:
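# Try Llama 3.2-3B (if you have enough memory)
# Uncomment to try:

# engine3 = llcuda.InferenceEngine(server_url="http://127.0.0.1:8092")
# engine3.load_model(
#     # repo:filename syntax from Method 2; confirm the exact filename in the repository
#     "unsloth/Llama-3.2-3B-Instruct-GGUF:Llama-3.2-3B-Instruct-Q4_K_M.gguf",
#     silent=True
# )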
# result = engine3.infer("What is AI?", max_tokens=100)
# print(result.text)
# print(f"Speed: {result.tokens_per_sec:.1f} tok/s")

print("\nTip: Tesla T4 has 16GB VRAM")
print("     Can run models up to ~8B parameters with Q4 quantization")
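
The "~8B with Q4" rule of thumb follows from simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by eight. The sketch below works this out. The 4.8 bits-per-weight figure for Q4_K_M is an approximate average and an assumption here; Q8_0 stores 32 int8 weights plus one fp16 scale per block, which gives 8.5 bits per weight. The KV cache and runtime buffers come on top of these numbers.

```python
def estimate_weights_gb(n_params_billion, bits_per_weight):
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * bits_per_weight / 8


Q4_K_M_BPW = 4.8   # assumption: rough average, varies by architecture
Q8_0_BPW = 8.5     # 32 int8 weights + one fp16 scale per block

for size_b in (1, 3, 8):
    q4 = estimate_weights_gb(size_b, Q4_K_M_BPW)
    q8 = estimate_weights_gb(size_b, Q8_0_BPW)
    print(f"{size_b}B params: Q4_K_M ~{q4:.1f} GB, Q8_0 ~{q8:.1f} GB (weights only)")
```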

## Step 7: Integration with Unsloth Training

### Scenario: Fine-tune with Unsloth → Export GGUF → Inference with llcuda

In [None]:
# Example workflow (conceptual - not running full training)

print("""
UNSLOTH + llcuda WORKFLOW
=========================

1. Fine-tune with Unsloth:
   ```python
   from unsloth import FastLanguageModel
   
   model, tokenizer = FastLanguageModel.from_pretrained(
       "unsloth/gemma-3-1b-it",
       max_seq_length=2048,
       load_in_4bit=True
   )
   
   # Add LoRA adapters and train...
   model = FastLanguageModel.get_peft_model(model, ...)
   trainer.train()
   ```

2. Export to GGUF:
   ```python
   # After training
   model.save_pretrained_gguf(
       "my_model",
       tokenizer,
       quantization_method="q4_k_m"
   )
   ```

3. Deploy with llcuda:
   ```python
   import llcuda
   
   engine = llcuda.InferenceEngine()
   engine.load_model("my_model/unsloth.Q4_K_M.gguf")
   
   result = engine.infer("Your prompt", max_tokens=100)
   print(result.text)
   ```

Benefits:
- Fast training with Unsloth (2x faster, 70% less VRAM)
- Fast inference with llcuda (FlashAttention, T4-optimized)
- Easy deployment (GGUF format, single file)
- Compatible with llama.cpp ecosystem
""")

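The exported filename in step 3 of the workflow can differ between Unsloth versions, so it may be safer to look for the `.gguf` file than to hard-code its name. The sketch below does this with `pathlib`. `find_gguf_files` is a hypothetical helper, and the demo creates an empty placeholder file in a temporary directory instead of running a real export.

```python
import tempfile
from pathlib import Path


def find_gguf_files(output_dir):
    """Return all .gguf files under output_dir (recursively), sorted by path."""
    return sorted(Path(output_dir).rglob('*.gguf'))


# Demo with a placeholder file; a real export directory would be used instead.
with tempfile.TemporaryDirectory() as tmp:
    export_dir = Path(tmp) / "my_model"
    export_dir.mkdir()
    (export_dir / "unsloth.Q4_K_M.gguf").write_bytes(b"")  # empty stand-in file
    for path in find_gguf_files(export_dir):
        print(path.name)
```

The first path returned could then be passed as a string to `engine.load_model(...)`.
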
## Step 8: Advanced Features

In [None]:
# Custom generation parameters
result = engine.infer(
    "Write a short poem about AI.",
    max_tokens=100,
    temperature=0.9,      # Higher = more creative
    top_p=0.95,           # Nucleus sampling
    top_k=50,             # Top-k sampling
    stop_sequences=["\n\n"]  # Stop at double newline
)

print("\n" + "="*70)
print("CREATIVE GENERATION (temp=0.9)")
print("="*70)
print(result.text)
print("="*70)
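
To show what `temperature`, `top_k`, and `top_p` do to the candidate tokens, here is a small pure-Python sketch over a toy set of scores. It illustrates the general technique only; the order in which a real backend applies these filters may differ, and `filter_candidates` is a hypothetical function, not an llcuda API.

```python
import math


def filter_candidates(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, top-k, and top-p filtering.

    logits: dict mapping token -> raw score.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda item: item[1], reverse=True)
    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


toy_logits = {'cat': 2.0, 'dog': 1.0, 'fish': 0.0, 'rock': -1.0}
print(filter_candidates(toy_logits, temperature=0.9, top_k=3, top_p=0.95))
```

A higher temperature flattens the distribution before filtering, which is why `temperature=0.9` reads as "more creative" than the `0.3` used for code earlier.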

In [None]:
# Context manager usage (auto-cleanup)
with llcuda.InferenceEngine(server_url="http://127.0.0.1:8093") as temp_engine:
    temp_engine.load_model("gemma-3-1b-Q4_K_M", silent=True)
    result = temp_engine.infer("Quick test", max_tokens=20)
    print(f"\nQuick test: {result.text}")
    print(f"Speed: {result.tokens_per_sec:.1f} tok/s")

print("\n✅ Engine auto-cleaned up after context exit")
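
The `with` block works because the object implements Python's context-manager protocol. The toy class below sketches that pattern: set up in `__enter__`, tear down in `__exit__`, even if the body raises. `ManagedServer` is a made-up stand-in and says nothing about how llcuda implements its own cleanup.

```python
class ManagedServer:
    """Toy resource that starts on entry and always stops on exit."""

    def __init__(self, url):
        self.url = url
        self.running = False

    def __enter__(self):
        self.running = True      # stand-in for starting a server process
        return self

    def __exit__(self, exc_type, exc, tb):
        self.running = False     # stand-in for stopping the server
        return False             # do not swallow exceptions from the body


with ManagedServer("http://127.0.0.1:8093") as server:
    print("inside:", server.running)
print("after:", server.running)
```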

## Step 9: Benchmark Different Quantizations

In [None]:
# Compare Q4_K_M vs Q8_0 (if you have time)
# Note: Q8_0 is slower but higher quality

test_prompt = "Explain the theory of relativity."

print("\n" + "="*70)
print("QUANTIZATION COMPARISON")
print("="*70)

# Q4_K_M (already loaded)
result_q4 = engine.infer(test_prompt, max_tokens=50)
print(f"\nQ4_K_M:")
print(f"  Speed: {result_q4.tokens_per_sec:.1f} tok/s")
print(f"  Latency: {result_q4.latency_ms:.1f} ms")

# To test Q8_0, you would load a Q8_0 model:
# engine_q8 = llcuda.InferenceEngine(server_url="http://127.0.0.1:8094")
# engine_q8.load_model("gemma-3-1b-Q8_0", silent=True)
# result_q8 = engine_q8.infer(test_prompt, max_tokens=50)
# print(f"\nQ8_0:")
# print(f"  Speed: {result_q8.tokens_per_sec:.1f} tok/s")
# print(f"  Latency: {result_q8.latency_ms:.1f} ms")

print("\nTypical results on T4:")
print("  Q4_K_M: ~45 tok/s (smaller, faster)")
print("  Q8_0: ~35 tok/s (larger, higher quality)")
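
A single call is a noisy basis for comparing quantizations. The sketch below averages several runs with `time.perf_counter`. `benchmark` and `fake_generate` are hypothetical; the fake generator sleeps briefly and reports a fixed token count, so the demo runs without a GPU. In the notebook, `generate` would wrap `engine.infer(...)` and return `tokens_generated`.

```python
import time


def benchmark(generate, prompt, runs=3):
    """Call generate(prompt) `runs` times; return mean tokens/sec and mean latency (ms).

    generate must return the number of tokens it produced.
    """
    speeds, latencies = [], []
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        latencies.append(elapsed * 1000.0)
        speeds.append(n_tokens / elapsed if elapsed > 0 else float('inf'))
    return {
        'tokens_per_sec': sum(speeds) / runs,
        'latency_ms': sum(latencies) / runs,
        'runs': runs,
    }


def fake_generate(prompt):
    """Stand-in for a real engine call; sleeps briefly and reports 50 tokens."""
    time.sleep(0.01)
    return 50


stats = benchmark(fake_generate, "Explain the theory of relativity.", runs=3)
print(f"{stats['tokens_per_sec']:.1f} tok/s over {stats['runs']} runs")
```

Discarding a first warm-up run before measuring is a common refinement, since the first request often includes one-time setup cost.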

## Summary

### What We Covered

- ✅ **Installation**: llcuda v2.0.1 with automatic binary download
- ✅ **Model Loading**: From registry and HuggingFace
- ✅ **Inference**: Single and batch prompts
- ✅ **Performance**: ~45 tok/s on Tesla T4 with Q4_K_M
- ✅ **Integration**: Unsloth fine-tuning → llcuda deployment workflow

### Performance Summary (Tesla T4)

| Model | Quantization | Speed | VRAM |
|-------|--------------|-------|------|
| Gemma 3-1B | Q4_K_M | ~45 tok/s | 1.2 GB |
| Llama 3.2-3B | Q4_K_M | ~30 tok/s | 2.0 GB |
| Qwen 2.5-7B | Q4_K_M | ~18 tok/s | 5.0 GB |
| Llama 3.1-8B | Q4_K_M | ~15 tok/s | 5.5 GB |

### Key Features

- ✅ FlashAttention (2-3x faster for long contexts)
- ✅ Tensor Core optimization
- ✅ CUDA Graphs (reduced overhead)
- ✅ All quantization formats
- ✅ Seamless Unsloth integration

### Resources

- **llcuda**: https://github.com/waqasm86/llcuda
- **Unsloth**: https://github.com/unslothai/unsloth
- **Unsloth GGUF Docs**: https://unsloth.ai/docs/basics/inference-and-deployment/saving-to-gguf
- **llama.cpp**: https://github.com/ggerganov/llama.cpp

---

**Built with**: llcuda v2.0.1 | Tesla T4 | CUDA 12 | Unsloth Integration
