# llcuda v1.1.9 - Google Colab Test

This notebook tests all the fixes in v1.1.9:
1. ✅ llama-server detection from package binaries directory
2. ✅ Silent mode to suppress llama-server warnings
3. ✅ Binary auto-download on first import
4. ✅ Model download only when explicitly called

**Recommended Runtime**: Python 3.11+ with T4 GPU

---

## Step 1: Install llcuda v1.1.9

Install the latest version from PyPI.

In [None]:
!pip install llcuda==1.1.9 -q

## Step 2: Import llcuda

**Expected behavior:**
- First run: Downloads binaries (~161MB) - takes 30-60 seconds
- Subsequent runs: Instant (binaries cached)
- **NO MODEL DOWNLOAD** on import (fixed in v1.1.8)
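
To compare first-run and cached behavior, it can help to time the import. The following is a minimal sketch; `timed_import` is a hypothetical helper (not part of llcuda), and it imports the standard-library `json` module as a stand-in so it runs anywhere.

```python
import importlib
import time


def timed_import(module_name):
    """Import a module by name and return (module, elapsed_seconds)."""
    start = time.perf_counter()
    module = importlib.import_module(module_name)
    return module, time.perf_counter() - start


# "json" is a stand-in so the sketch runs anywhere; in this notebook you
# would pass "llcuda" instead.
module, elapsed = timed_import("json")
print(f"Imported {module.__name__} in {elapsed:.3f}s")
```

Python caches imported modules in `sys.modules`, so a repeated import within the same session is always fast. Restart the runtime before timing a second run if you want to measure the effect of the cached binaries.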

In [None]:
import llcuda

print(f"✅ llcuda version: {llcuda.__version__}")

## Step 3: System Information

Check GPU and CUDA availability.
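
As an independent cross-check, the GPU can also be queried through `nvidia-smi` directly. This is a sketch that assumes `nvidia-smi` is on `PATH` when a GPU runtime is selected; `query_gpus` is a hypothetical helper, not an llcuda function.

```python
import shutil
import subprocess


def query_gpus():
    """Return a list of (name, total_memory) tuples, or None without nvidia-smi."""
    if shutil.which("nvidia-smi") is None:
        return None
    proc = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if proc.returncode != 0:
        return None
    gpus = []
    for line in proc.stdout.strip().splitlines():
        # Each CSV row looks like "<name>, <memory>"; split on the first comma.
        name, _, memory = line.partition(",")
        gpus.append((name.strip(), memory.strip()))
    return gpus


gpus = query_gpus()
print(gpus if gpus else "No NVIDIA GPU visible (is a GPU runtime selected?)")
```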

In [None]:
llcuda.print_system_info()

## Step 4: Verify llama-server Detection

**This is the critical fix in v1.1.9!**

The server manager now checks the package binaries directory as the second location in its search order.
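
A priority-ordered lookup of this kind can be sketched as follows. This is not llcuda's actual implementation: `find_binary`, the `EXAMPLE_BINARY_DIR` variable, and the `binaries` directory are hypothetical names chosen for illustration.

```python
import os
import shutil
from pathlib import Path


def find_binary(name, search_dirs):
    """Return the first executable called `name` in search_dirs, else fall back to PATH."""
    for directory in search_dirs:
        candidate = Path(directory) / name
        if candidate.is_file() and os.access(candidate, os.X_OK):
            return candidate
    on_path = shutil.which(name)
    return Path(on_path) if on_path else None


# Hypothetical search order: an env-var override first, then a package-local directory.
search_dirs = [
    os.environ.get("EXAMPLE_BINARY_DIR", ""),
    Path.cwd() / "binaries",
]
search_dirs = [d for d in search_dirs if d]  # drop unset entries
print(find_binary("llama-server", search_dirs))
```

Earlier entries win, so the order of the list is the priority order.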

In [None]:
from llcuda import ServerManager
import os

server = ServerManager()
llama_server_path = server.find_llama_server()

if llama_server_path:
    print(f"✅ llama-server found at: {llama_server_path}")
    print(f"   Exists: {llama_server_path.exists()}")
    print(f"   Executable: {os.access(llama_server_path, os.X_OK)}")
    
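    # Truncate long values; print "Not set" (without a trailing ellipsis) when unset
    ld_path = os.environ.get("LD_LIBRARY_PATH")
    print(f"   LD_LIBRARY_PATH: {ld_path[:100] + '...' if ld_path else 'Not set'}")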
else:
    print("❌ llama-server NOT FOUND - this is a bug!")

## Step 5: Load Model with Silent Mode

**New in v1.1.9:** `silent=True` parameter suppresses all llama-server output.

**Expected behavior:**
- First run: Downloads model (~800MB for gemma-3-1b-Q4_K_M) - takes 1-2 minutes
- Subsequent runs: Instant (model cached)
- **NO LLAMA-SERVER WARNINGS** in output
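
One common way to silence a child process is to redirect its output streams to `subprocess.DEVNULL`. The sketch below illustrates that general technique with a throwaway Python child process. How llcuda implements `silent=True` internally is not shown here, and `run_child` is a hypothetical helper.

```python
import subprocess
import sys


def run_child(silent):
    """Run a noisy child process, discarding its output when silent is True."""
    sink = subprocess.DEVNULL if silent else None  # None inherits the parent's streams
    noisy = "import sys; print('hello'); print('warning: noisy', file=sys.stderr)"
    return subprocess.run([sys.executable, "-c", noisy], stdout=sink, stderr=sink).returncode


print("exit code:", run_child(silent=True))
```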

In [None]:
engine = llcuda.InferenceEngine()

# Load model with silent mode enabled
engine.load_model(
    "gemma-3-1b-Q4_K_M",
    gpu_layers=20,  # Conservative for a T4 GPU (16GB VRAM)
    ctx_size=2048,
    auto_start=True,
    silent=True,  # ← NEW: Suppress llama-server warnings
)

print("✅ Model loaded successfully!")

## Step 6: Run Inference

Test the model with a simple prompt.

In [None]:
prompt = "What is artificial intelligence? Answer in one sentence."

result = engine.infer(prompt, max_tokens=50)

print(f"Prompt: {prompt}")
print(f"\nResponse: {result.text}")
print(f"\n✅ Tokens generated: {result.tokens_generated}")
print(f"✅ Time: {result.generation_time_ms:.2f}ms")

## Step 7: Batch Inference

Test with multiple prompts.

In [None]:
prompts = [
    "What is machine learning?",
    "Explain neural networks.",
    "What is deep learning?"
]

results = engine.batch_infer(prompts, max_tokens=30)

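# Pair each prompt with its result instead of indexing back into the list
for i, (prompt_text, result) in enumerate(zip(prompts, results), 1):
    print(f"\n{i}. {prompt_text}")
    print(f"   → {result.text}")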

## Step 8: Performance Metrics

Get throughput and latency statistics.
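
For reference, statistics like these can be derived from per-request samples as sketched below. The formulas are assumptions (nearest-rank P95, throughput computed over sequential requests), not a description of how llcuda calculates its metrics. `summarize` is a hypothetical helper that mirrors the dictionary keys used in this step.

```python
import math
import statistics


def summarize(latencies_ms, tokens_generated):
    """Compute mean/p95 latency and overall throughput from per-request samples."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank percentile, 1-based
    total_seconds = sum(latencies_ms) / 1000.0  # assumes requests ran one after another
    return {
        "latency": {"mean_ms": statistics.fmean(ordered), "p95_ms": ordered[rank - 1]},
        "throughput": {"tokens_per_sec": sum(tokens_generated) / total_seconds},
    }


# Made-up sample values, purely for illustration.
summary = summarize([100.0, 200.0, 300.0, 400.0], [10, 20, 30, 40])
print(summary)
```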

In [None]:
metrics = engine.get_metrics()

if metrics and 'throughput' in metrics:
    print(f"✅ Throughput: {metrics['throughput']['tokens_per_sec']:.2f} tokens/sec")
    
    if 'latency' in metrics:
        print(f"✅ Mean latency: {metrics['latency']['mean_ms']:.2f}ms")
        if 'p95_ms' in metrics['latency']:
            print(f"✅ P95 latency: {metrics['latency']['p95_ms']:.2f}ms")
else:
    print("⚠️  Metrics not available yet (need more inferences)")

## Step 9: Cleanup

Stop the server when done.
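
If an earlier cell raises, the cleanup cell may never run. One option is to wrap the work in a context manager that always stops the server, as in this sketch. `stopping` and `FakeServer` are hypothetical; the stand-in class only mimics the `is_running()` and `stop_server()` methods used in this step.

```python
from contextlib import contextmanager


class FakeServer:
    """Stand-in exposing the two methods used in the cleanup cell."""

    def __init__(self):
        self.running = True

    def is_running(self):
        return self.running

    def stop_server(self):
        self.running = False


@contextmanager
def stopping(server):
    """Yield the server and stop it on exit, even if the body raises."""
    try:
        yield server
    finally:
        if server.is_running():
            server.stop_server()


fake = FakeServer()
with stopping(fake):
    pass  # inference calls would go here
print("running after block:", fake.is_running())
```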

In [None]:
if engine.server.is_running():
    engine.server.stop_server()
    print("✅ Server stopped")
else:
    print("Server already stopped")

---

## ✅ Test Summary

If all cells ran successfully, v1.1.9 is working correctly!

**Key Features Verified:**
- ✅ llama-server detected from package binaries directory
- ✅ Silent mode - no llama-server warnings
- ✅ Binary auto-download on first import
- ✅ Model download only when `load_model()` called
- ✅ Inference working correctly
- ✅ Performance metrics available

### Links
- **PyPI**: https://pypi.org/project/llcuda/1.1.9/
- **GitHub**: https://github.com/waqasm86/llcuda
- **Documentation**: https://waqasm86.github.io/
- **Issues**: https://github.com/waqasm86/llcuda/issues

---