# llcuda v1.0.0 Quick Start Guide

This notebook demonstrates the **PyTorch-style zero-configuration API** in llcuda v1.0.0.

**Requirements:**
- Python 3.11+
- NVIDIA GPU with CUDA support (tested on GeForce 940M)
- `pip install llcuda` (includes all CUDA binaries)

---

## 1. Import and Auto-Configuration

Importing llcuda automatically configures all paths and libraries.

In [None]:
import llcuda
import sys

print(f"llcuda version: {llcuda.__version__}")
print(f"Python version: {sys.version}")
print("\n" + "="*60)

# Print comprehensive system information
llcuda.print_system_info()

## 2. Basic Inference - Zero Configuration

The simplest way to use llcuda: create an engine, load a model, and run inference.

In [None]:
# Create inference engine
engine = llcuda.InferenceEngine()

# Load model - auto-downloads from HuggingFace with user confirmation
engine.load_model("gemma-3-1b-Q4_K_M")

print("\n✓ Model loaded and ready!")

## 3. Run Simple Inference

In [None]:
# Run inference
result = engine.infer(
    prompt="What is artificial intelligence?",
    max_tokens=100,
    temperature=0.7
)

# Display results
print("Generated Text:")
print("="*60)
print(result.text)
print("="*60)
print("\nPerformance Metrics:")
print(f"  Tokens Generated: {result.tokens_generated}")
print(f"  Speed: {result.tokens_per_sec:.1f} tok/s")
print(f"  Latency: {result.latency_ms:.0f} ms")

## 4. List Available Models

llcuda v1.0.0 includes 11 curated models in the registry.

In [None]:
from llcuda.models import list_registry_models

models = list_registry_models()

print(f"Available Models in Registry: {len(models)}\n")
for i, (name, info) in enumerate(models.items(), 1):
    print(f"{i}. {name}")
    print(f"   {info['description']}")
    print(f"   Size: {info['size_mb']} MB")
    print(f"   Minimum VRAM: {info['min_vram_gb']} GB\n")

## 5. Try Different Prompts

In [None]:
prompts = [
    "Explain quantum computing in simple terms.",
    "Write a haiku about CUDA programming.",
    "What are the benefits of GPU acceleration?"
]

for i, prompt in enumerate(prompts, 1):
    print(f"\n{'='*60}")
    print(f"Prompt {i}: {prompt}")
    print('='*60)
    
    result = engine.infer(prompt, max_tokens=80, temperature=0.7)
    print(result.text)
    print(f"\n⚡ {result.tokens_per_sec:.1f} tok/s | {result.latency_ms:.0f}ms")

## 6. Batch Inference

Process multiple prompts efficiently.

In [None]:
batch_prompts = [
    "What is machine learning?",
    "What is deep learning?",
    "What is natural language processing?"
]

print("Running batch inference...\n")
results = engine.batch_infer(batch_prompts, max_tokens=50)

for i, (prompt, result) in enumerate(zip(batch_prompts, results), 1):
    print(f"{i}. {prompt}")
    print(f"   → {result.text[:100]}...")
    print(f"   ⚡ {result.tokens_per_sec:.1f} tok/s\n")

## 7. Performance Metrics

Get detailed P50/P95/P99 latency statistics.

In [None]:
metrics = engine.get_metrics()

print("Performance Metrics")
print("="*60)

print("\nLatency Statistics:")
latency = metrics['latency']
print(f"  Mean: {latency['mean_ms']:.2f} ms")
print(f"  p50:  {latency['p50_ms']:.2f} ms")
print(f"  p95:  {latency['p95_ms']:.2f} ms")
print(f"  p99:  {latency['p99_ms']:.2f} ms")

print("\nThroughput Statistics:")
throughput = metrics['throughput']
print(f"  Total Tokens: {throughput['total_tokens']}")
print(f"  Total Requests: {throughput['total_requests']}")
print(f"  Tokens/sec: {throughput['tokens_per_sec']:.2f}")
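
For readers who want to see what the percentile figures mean, here is a minimal sketch that computes the same kind of summary from a plain list of latencies using only the standard library. The function name `latency_summary` is hypothetical, and the inclusive interpolation method is an assumption; llcuda may compute its percentiles differently.

```python
import statistics


def latency_summary(latencies_ms):
    """Summarize latencies (in ms) as mean, p50, p95, and p99.

    Hypothetical helper for illustration; not part of llcuda.
    """
    if len(latencies_ms) < 2:
        raise ValueError("need at least two samples to estimate percentiles")

    # quantiles(n=100) returns the 99 cut points p1..p99,
    # so index k-1 holds the k-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "mean_ms": statistics.fmean(latencies_ms),
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }
```

With only a handful of requests, p95 and p99 are interpolated from very few samples, so treat them as rough indicators until more inferences have been run.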

## 8. Hardware Auto-Configuration

llcuda automatically detects your GPU and configures optimal settings.

In [None]:
# Check CUDA availability and GPU info
if llcuda.check_cuda_available():
    print("✓ CUDA is available!\n")
    
    gpu_info = llcuda.get_cuda_device_info()
    if gpu_info:
        print(f"CUDA Version: {gpu_info['cuda_version']}")
        print(f"Number of GPUs: {len(gpu_info['gpus'])}\n")
        
        for i, gpu in enumerate(gpu_info['gpus']):
            print(f"GPU {i}:")
            print(f"  Name: {gpu['name']}")
            print(f"  Memory: {gpu['memory']}")
            print(f"  Driver: {gpu['driver_version']}")
else:
    print("❌ CUDA not available")
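
As a cross-check outside llcuda, similar GPU details can be read from the `nvidia-smi` command-line tool that ships with the NVIDIA driver. The sketch below is an assumption about how one might do that, not a description of how llcuda detects hardware; the helper names `parse_gpu_csv` and `query_gpus` are hypothetical.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi, in output order.
QUERY_FIELDS = "name,memory.total,driver_version"


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader` output into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        # Assumes GPU names contain no commas.
        name, memory, driver = [field.strip() for field in line.split(",")]
        gpus.append({"name": name, "memory": memory, "driver_version": driver})
    return gpus


def query_gpus():
    """Return GPU info from nvidia-smi, or an empty list if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    proc = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_gpu_csv(proc.stdout) if proc.returncode == 0 else []
```

Calling `query_gpus()` returns one dictionary per GPU, which makes it easy to compare against what `llcuda.get_cuda_device_info()` reports.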

## 9. Using Local GGUF Files

You can also use local GGUF model files.

In [None]:
# Find local GGUF models
models = llcuda.find_gguf_models()

if models:
    print(f"Found {len(models)} local GGUF models:\n")
    for i, model in enumerate(models[:5], 1):  # Show first 5
        size_mb = model.stat().st_size / (1024 * 1024)
        print(f"{i}. {model.name}")
        print(f"   Size: {size_mb:.1f} MB\n")
else:
    print("No local GGUF models found. Use registry models instead.")
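
If you keep models in a specific folder, a plain `pathlib` search gives the same kind of listing. This is a minimal sketch under the assumption that models are stored as `*.gguf` files somewhere below a directory you choose; `find_gguf_files` is a hypothetical helper, and the locations that `llcuda.find_gguf_models()` actually scans are not shown here.

```python
from pathlib import Path


def find_gguf_files(root):
    """Recursively collect *.gguf files under root, largest first.

    Hypothetical helper for illustration; not part of llcuda.
    """
    root = Path(root)
    if not root.is_dir():
        return []
    files = [p for p in root.rglob("*.gguf") if p.is_file()]
    # Sort by file size so the biggest models are listed first.
    return sorted(files, key=lambda p: p.stat().st_size, reverse=True)
```

The returned `Path` objects support `.name` and `.stat().st_size`, so they can be printed with the same loop used in the cell above.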

## 10. Temperature Comparison

Compare outputs with different temperature settings.

In [None]:
prompt = "Write a creative opening sentence for a science fiction story."
temperatures = [0.3, 0.7, 1.2]

print("Comparing Different Temperatures\n")
print("="*60)

for temp in temperatures:
    print(f"\nTemperature: {temp}")
    print("-" * 60)
    
    result = engine.infer(
        prompt=prompt,
        max_tokens=60,
        temperature=temp
    )
    print(result.text)
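
To build intuition for why lower temperatures give more predictable text, the sketch below shows the usual way temperature is applied in sampling: the model's raw scores (logits) are divided by the temperature before the softmax. This is a generic illustration with made-up logits, assuming the standard formulation; it does not show llcuda's internal sampler.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative logits for three candidate tokens (not from a real model).
example_logits = [2.0, 1.0, 0.5]
for temp in (0.3, 0.7, 1.2):
    probs = softmax_with_temperature(example_logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

At low temperature the highest-scoring token takes almost all of the probability mass, while higher temperatures flatten the distribution and let less likely tokens through.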

## 11. Visualize Performance (Optional)

Create a simple plot of latencies if matplotlib is installed.

In [None]:
try:
    import matplotlib.pyplot as plt
    
    # NOTE: _metrics is a private attribute (leading underscore) and may change between releases
    latencies = engine._metrics['latencies']
    
    if latencies:
        plt.figure(figsize=(12, 4))
        
        # Latency over time
        plt.subplot(1, 2, 1)
        plt.plot(latencies, marker='o', linewidth=2)
        plt.xlabel('Request Number')
        plt.ylabel('Latency (ms)')
        plt.title('Inference Latency Over Time')
        plt.grid(True, alpha=0.3)
        
        # Latency distribution
        plt.subplot(1, 2, 2)
        plt.hist(latencies, bins=20, edgecolor='black', alpha=0.7)
        plt.xlabel('Latency (ms)')
        plt.ylabel('Frequency')
        plt.title('Latency Distribution')
        plt.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
    else:
        print("No metrics available yet. Run some inferences first.")
        
except ImportError:
    print("matplotlib not installed. Install with: pip install matplotlib")

## 12. Context Manager Usage

Use llcuda with Python context managers for automatic cleanup.

In [None]:
# Context manager automatically handles cleanup.
# A separate name (ctx_engine) keeps the engine from section 2 available to later cells.
with llcuda.InferenceEngine() as ctx_engine:
    ctx_engine.load_model("gemma-3-1b-Q4_K_M")
    
    result = ctx_engine.infer(
        "Explain the benefits of context managers in Python.",
        max_tokens=80
    )
    
    print(result.text)
    print(f"\n⚡ {result.tokens_per_sec:.1f} tok/s")

# ctx_engine is automatically cleaned up after the context exits
print("\n✓ Resources automatically cleaned up")
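
The `with` statement works through Python's context manager protocol: `__enter__` runs on entry and `__exit__` runs on exit, even if the body raises. The sketch below uses a hypothetical stand-in class, `ManagedResource`, to show the pattern; it is not llcuda's implementation.

```python
class ManagedResource:
    """Stand-in for an engine-like object that owns something needing cleanup."""

    def __init__(self):
        self.loaded = False
        self.closed = False

    def load(self):
        self.loaded = True

    def close(self):
        # Safe to call more than once.
        self.loaded = False
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions from the with-body


with ManagedResource() as resource:
    resource.load()
    print("inside block, loaded:", resource.loaded)

print("after block, closed:", resource.closed)
```

Because `__exit__` returns `False`, any exception raised inside the block still propagates after cleanup has run.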

## 13. Cleanup

When you're done, unload the model to free resources.

In [None]:
# Unload model and stop server
engine.unload_model()
print("✓ Server stopped and resources cleaned up.")

---

## Summary - llcuda v1.0.0 Features

This notebook covered:
- ✅ **Zero-configuration setup** - just `pip install llcuda`
- ✅ **Smart model loading** - auto-download from HuggingFace registry
- ✅ **Hardware auto-configuration** - automatic VRAM detection
- ✅ **Single and batch inference** - efficient processing
- ✅ **Performance metrics** - P50/P95/P99 latency tracking
- ✅ **11 curated models** - ready to use out of the box
- ✅ **Context manager support** - automatic cleanup

### What's New in v1.0.0:
- **Bundled CUDA binaries** - No manual llama-server setup
- **Auto-configuration on import** - No LLAMA_SERVER_PATH needed
- **Model registry** - 11 pre-configured models
- **PyTorch-style API** - Familiar interface for ML engineers

### Next Steps:
1. Try different models from the registry
2. Experiment with temperature and other parameters
3. Build applications using llcuda
4. Check out the documentation: [PyPI](https://pypi.org/project/llcuda/) | [GitHub](https://github.com/waqasm86/llcuda)

---

**Happy Inferencing! 🚀**