## Step 1: Environment Verification

Checks the Kaggle environment by querying available GPUs, their memory capacity, compute capability, and CUDA version to ensure readiness for llama-server configuration.

In [None]:
import subprocess
import os

print("="*70)
print("🔍 ENVIRONMENT CHECK")
print("="*70)

# GPU check
result = subprocess.run(["nvidia-smi", "--query-gpu=index,name,memory.total,compute_cap", 
                         "--format=csv,noheader"], capture_output=True, text=True)
print("\n📊 GPUs Available:")
for line in result.stdout.strip().split('\n'):
    print(f"   {line}")

# CUDA version
print("\n📊 CUDA Version:")
!nvcc --version | grep release

print("\n✅ Environment ready for llama-server configuration")
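
For later steps it can help to have the GPU list as structured data rather than printed text. The following is a minimal sketch, assuming the CSV layout produced by the `nvidia-smi` query above; `parse_gpu_csv` and the sample string are hypothetical illustrations, not part of llcuda and not captured output.

```python
import csv
import io

def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader` rows into a list of dicts.

    Field names are chosen here for illustration and follow the query order:
    index, name, memory.total, compute_cap.
    """
    fields = ["index", "name", "memory_total", "compute_cap"]
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(fields, (cell.strip() for cell in row))) for row in rows if row]

# Illustrative sample in the same shape as the query output (not a real capture)
sample = "0, Tesla T4, 15360 MiB, 7.5\n1, Tesla T4, 15360 MiB, 7.5\n"
gpus = parse_gpu_csv(sample)
print(len(gpus), gpus[0]["name"], gpus[0]["compute_cap"])
```

In the notebook, `parse_gpu_csv(result.stdout)` could replace the plain print loop when the GPU count or compute capability needs to be checked programmatically.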

## Step 2: Install llcuda and Dependencies

Installs llcuda v2.2.0 from GitHub and the supporting packages (huggingface_hub, sseclient-py) from PyPI, then verifies the installation by printing the version number.

In [None]:
%%time
# Install llcuda v2.2.0 (force fresh install to ensure correct binaries)
!pip install -q --no-cache-dir --force-reinstall git+https://github.com/llcuda/llcuda.git@v2.2.0
!pip install -q huggingface_hub sseclient-py

import llcuda
print(f"✅ llcuda {llcuda.__version__} installed")

## Step 3: Understanding Server Configuration Options

llama-server exposes many configuration options. The cell below prints a reference of the most commonly used flags, grouped by category (Model, GPU, Performance, Server), as a guide for customizing server behavior.

In [None]:
from llcuda.server import ServerManager
from llcuda.api.multigpu import MultiGPUConfig, SplitMode

# Display all configuration options
print("="*70)
print("📋 LLAMA-SERVER CONFIGURATION OPTIONS")
print("="*70)

config_options = {
    "Model Settings": {
        "--model, -m": "Path to GGUF model file",
        "--alias, -a": "Model alias for API responses",
        "--ctx-size, -c": "Context size (default: 4096)",
        "--batch-size, -b": "Batch size for prompt processing",
        "--ubatch-size": "Physical batch size (default: 512)",
    },
    "GPU Settings": {
        "--n-gpu-layers, -ngl": "Layers to offload to GPU (99 = all)",
        "--main-gpu, -mg": "Main GPU for computations",
        "--tensor-split, -ts": "VRAM distribution across GPUs",
        "--split-mode, -sm": "Split mode: layer, row, none",
    },
    "Performance": {
        "--flash-attn, -fa": "Enable FlashAttention (faster)",
        "--threads, -t": "CPU threads for generation",
        "--threads-batch, -tb": "CPU threads for batch processing",
        "--cont-batching": "Enable continuous batching",
        "--parallel, -np": "Number of parallel sequences",
    },
    "Server Settings": {
        "--host": "Host address (default: 127.0.0.1)",
        "--port": "Port number (default: 8080)",
        "--timeout": "Server timeout in seconds",
        "--embeddings": "Enable embeddings endpoint",
    },
}

for category, options in config_options.items():
    print(f"\n📌 {category}:")
    for flag, desc in options.items():
        print(f"   {flag:25} {desc}")
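
To make the mapping from settings to flags concrete, here is a small sketch that turns keyword options into a llama-server argument list. It assumes only the long flag names listed above; `build_llama_server_args` and the model path are hypothetical and are not part of llcuda's API.

```python
import shlex

def build_llama_server_args(model_path, **options):
    """Translate keyword options into llama-server long flags.

    Underscores become hyphens. True emits a bare flag; False or None
    omits the flag; any other value is passed as the flag's argument.
    """
    args = ["llama-server", "--model", str(model_path)]
    for key, value in options.items():
        flag = "--" + key.replace("_", "-")
        if value is None or value is False:
            continue
        if value is True:
            args.append(flag)
        else:
            args.extend([flag, str(value)])
    return args

args = build_llama_server_args(
    "/kaggle/working/models/model.gguf",  # hypothetical path
    n_gpu_layers=99,
    ctx_size=4096,
    cont_batching=True,
    embeddings=False,
)
print(shlex.join(args))
```

Keeping the arguments as a list (rather than one string) avoids shell-quoting problems if the list is later passed to `subprocess`.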

## Step 4: Configuration Presets for Kaggle T4

Demonstrates llcuda's built-in configuration presets optimized for Kaggle T4 GPUs, including single GPU (15GB), dual GPU (30GB), and split-GPU setups for LLM+visualization workflows.

In [None]:
from llcuda.api.multigpu import kaggle_t4_dual_config, colab_t4_single_config

print("="*70)
print("📋 KAGGLE T4 CONFIGURATION PRESETS")
print("="*70)

# Single GPU configuration (use GPU 0 only)
# Note: colab_t4_single_config works for Kaggle single T4 as well (same GPU)
print("\n🔹 Single T4 Configuration (15GB VRAM):")
single_config = colab_t4_single_config()
print(f"   GPU Layers: {single_config.n_gpu_layers}")
print(f"   Context Size: {single_config.ctx_size}")
print(f"   Batch Size: {single_config.batch_size}")
print(f"   Flash Attention: {single_config.flash_attention}")
print(f"   Best for: Models up to ~7B Q4_K_M")

# Dual GPU configuration (split across both)
print("\n🔹 Dual T4 Configuration (30GB VRAM total):")
dual_config = kaggle_t4_dual_config()
print(f"   GPU Layers: {dual_config.n_gpu_layers}")
print(f"   Context Size: {dual_config.ctx_size}")
print(f"   Tensor Split: {dual_config.tensor_split}")
print(f"   Split Mode: {dual_config.split_mode}")
print(f"   Flash Attention: {dual_config.flash_attention}")
print(f"   Best for: Models up to ~13B Q4_K_M")

# Split GPU configuration (LLM on GPU 0, RAPIDS on GPU 1)
print("\n🔹 Split-GPU Configuration (Recommended):")
print("   GPU 0: llama-server (LLM inference)")
print("   GPU 1: RAPIDS/Graphistry (graph processing)")
print("   Best for: Combined LLM + visualization workflows")
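
The tensor split in the dual-GPU preset is a set of per-GPU proportions. As a rough sketch of how such proportions could be derived from the memory available on each GPU, the hypothetical helpers below (not part of llcuda) normalize a list of MiB values and format them for `--tensor-split`. The memory figures are illustrative inputs.

```python
def tensor_split_from_vram(free_mib):
    """Return per-GPU fractions proportional to the given free memory values."""
    total = sum(free_mib)
    if total <= 0:
        raise ValueError("at least one GPU must report free memory")
    return [round(m / total, 2) for m in free_mib]

def format_tensor_split(fractions):
    """Format fractions as the comma-separated value used by --tensor-split."""
    return ",".join(f"{f:g}" for f in fractions)

even = tensor_split_from_vram([15360, 15360])    # two idle T4s
uneven = tensor_split_from_vram([15360, 7680])   # second GPU half occupied
print(format_tensor_split(even))
print(format_tensor_split(uneven))
```

An uneven split of this kind may be useful when the second GPU is shared with another workload, such as the graph-processing setup described above.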

## Step 5: Download Test Model

Downloads a small model, Gemma 3 1B instruction-tuned in Q4_K_M GGUF format, for testing the server configurations, and reports the download location and file size.

In [None]:
%%time
from huggingface_hub import hf_hub_download
import os

# Download a small model for testing configurations
MODEL_REPO = "unsloth/gemma-3-1b-it-GGUF"
MODEL_FILE = "gemma-3-1b-it-Q4_K_M.gguf"

print(f"📥 Downloading {MODEL_FILE} for configuration testing...")

model_path = hf_hub_download(
    repo_id=MODEL_REPO,
    filename=MODEL_FILE,
    local_dir="/kaggle/working/models"
)

size_gb = os.path.getsize(model_path) / (1024**3)
print(f"\n✅ Model downloaded: {model_path}")
print(f"   Size: {size_gb:.2f} GB")
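
A truncated or failed download can leave a file that is not a valid model. GGUF files begin with the four ASCII bytes `GGUF`, so a quick sanity check is possible before starting the server. The sketch below is illustrative; `looks_like_gguf` is a hypothetical helper, and it checks only the magic bytes, not the integrity of the rest of the file.

```python
import os
import tempfile

def looks_like_gguf(path):
    """Return True if the file starts with the 4-byte GGUF magic."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Demonstrate on two throwaway files instead of a real model
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.gguf")
    bad = os.path.join(tmp, "bad.gguf")
    with open(good, "wb") as f:
        f.write(b"GGUF" + b"\x00" * 16)
    with open(bad, "wb") as f:
        f.write(b"<html>not a model</html>")
    good_ok = looks_like_gguf(good)
    bad_ok = looks_like_gguf(bad)

print(good_ok, bad_ok)
```

In the notebook, `looks_like_gguf(model_path)` could be called right after the download completes.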

## Step 6: Basic Server Configuration

Starts llama-server with basic configuration settings including GPU layer offloading, context size, and batch parameters using ServerManager's API.

In [None]:
from llcuda.server import ServerManager

# Create basic configuration settings (used as parameters to start_server)
print("="*70)
print("🔧 BASIC SERVER CONFIGURATION")
print("="*70)

# Configuration parameters
config = {
    "model_path": model_path,
    "host": "127.0.0.1",
    "port": 8080,
    "gpu_layers": 99,       # Offload all layers to GPU
    "ctx_size": 4096,       # 4K context
    "batch_size": 512,      # Batch size for prompt processing
}

print(f"\n📋 Configuration:")
print(f"   Model: {config['model_path']}")
print(f"   Host: {config['host']}:{config['port']}")
print(f"   GPU Layers: {config['gpu_layers']}")
print(f"   Context: {config['ctx_size']}")

# Start server using ServerManager.start_server() API
server = ServerManager(server_url=f"http://{config['host']}:{config['port']}")
print("\n🚀 Starting server with basic configuration...")

try:
    server.start_server(
        model_path=config['model_path'],
        host=config['host'],
        port=config['port'],
        gpu_layers=config['gpu_layers'],
        ctx_size=config['ctx_size'],
        batch_size=config['batch_size'],  # defined in config above; pass it so it takes effect
        timeout=60,
        verbose=True
    )
    print("\n✅ Server started successfully!")
except Exception as e:
    print(f"\n❌ Server failed to start: {e}")
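
Model loading takes time, so code that talks to the server usually needs to wait until it reports ready. The following is a generic polling sketch under stated assumptions: `wait_until_ready` is a hypothetical helper, not llcuda's API, and the probe is injected so the loop can be exercised without a running server.

```python
import time

def wait_until_ready(probe, timeout=60.0, interval=0.5,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() repeatedly until it returns True or the timeout expires.

    Exceptions from probe() (for example, connection refused while the
    server is still starting) are treated as "not ready yet".
    """
    deadline = clock() + timeout
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass
        if clock() >= deadline:
            return False
        sleep(interval)

# Stand-in probe that succeeds on the third attempt
attempts = {"n": 0}

def fake_probe():
    attempts["n"] += 1
    return attempts["n"] >= 3

ready = wait_until_ready(fake_probe, timeout=5, interval=0, sleep=lambda s: None)
print(ready, attempts["n"])
```

Against a live server, the probe could be something like `lambda: requests.get("http://127.0.0.1:8080/health", timeout=2).status_code == 200`, assuming the `/health` endpoint answers with HTTP 200 once the model is loaded.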

## Step 7: Server Health Monitoring

Monitors server health by querying the /health, /slots, and /props endpoints to check server status, available slots, and loaded model properties.

In [None]:
import requests
import json

print("="*70)
print("🏥 SERVER HEALTH MONITORING")
print("="*70)

# Health check
try:
    health = requests.get("http://127.0.0.1:8080/health", timeout=5)
    print(f"\n📊 Health Status: {health.json()}")
except Exception as e:
    print(f"❌ Health check failed: {e}")

# Get server slots info
try:
    slots = requests.get("http://127.0.0.1:8080/slots", timeout=5)
    print(f"\n📊 Server Slots:")
    for slot in slots.json():
        print(f"   Slot {slot.get('id', 'N/A')}: {slot.get('state', 'unknown')}")
except Exception as e:
    print(f"   Slots endpoint not available: {e}")

# Get model info
try:
    props = requests.get("http://127.0.0.1:8080/props", timeout=5)
    print(f"\n📊 Model Properties:")
    data = props.json()
    print(f"   Model: {data.get('default_generation_settings', {}).get('model', 'N/A')}")
    print(f"   Context: {data.get('default_generation_settings', {}).get('n_ctx', 'N/A')}")
except Exception as e:
    print(f"   Props endpoint not available: {e}")

## Step 8: Stop the Server Before Reconfiguring

Gracefully stops the running llama-server and waits 2 seconds for the port to be released before starting a new configuration.

In [None]:
import time

# Stop current server
print("🛑 Stopping current server...")
server.stop_server()

time.sleep(2)  # Give the OS a moment to release the port

print("\n✅ Server stopped")
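
A fixed sleep is a guess about how long the port takes to free up. As an alternative sketch, the port can be polled until nothing is listening on it. `port_is_free` and `wait_for_port_release` are hypothetical helpers built on the standard `socket` module, not part of llcuda.

```python
import socket
import time

def port_is_free(host, port):
    """Return True if no process accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds (port in use)
        return s.connect_ex((host, port)) != 0

def wait_for_port_release(host, port, timeout=10.0, interval=0.2):
    """Poll until the port is free or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if port_is_free(host, port):
            return True
        time.sleep(interval)
    return port_is_free(host, port)

if __name__ == "__main__":
    print(wait_for_port_release("127.0.0.1", 8080, timeout=1.0))
```

Note that this checks only whether something is listening; it does not guarantee that a new bind on the port will succeed.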

## Step 9: High-Performance Configuration

Configures llama-server for higher throughput on a Kaggle T4: a larger context (8K tokens), a larger batch size (1024), and four parallel request slots.

In [None]:
print("="*70)
print("⚡ HIGH-PERFORMANCE CONFIGURATION")
print("="*70)

# High-performance configuration parameters
hp_config = {
    "model_path": model_path,
    "host": "127.0.0.1",
    "port": 8080,
    
    # GPU settings - maximize GPU utilization
    "gpu_layers": 99,
    
    # Context and batching
    "ctx_size": 8192,      # Larger context
    "batch_size": 1024,    # Larger batch for prompt processing
    "ubatch_size": 512,    # Physical batch size
    
    # Parallelism
    "n_parallel": 4,       # 4 parallel request slots
}

print(f"\n📋 High-Performance Settings:")
print(f"   Context Size: {hp_config['ctx_size']} tokens")
print(f"   Batch Size: {hp_config['batch_size']}")
print(f"   Parallel Slots: {hp_config['n_parallel']}")

# Create new server manager
server = ServerManager(server_url=f"http://{hp_config['host']}:{hp_config['port']}")

# Start with high-performance config
print("\n🚀 Starting server with high-performance configuration...")
try:
    server.start_server(
        model_path=hp_config['model_path'],
        host=hp_config['host'],
        port=hp_config['port'],
        gpu_layers=hp_config['gpu_layers'],
        ctx_size=hp_config['ctx_size'],
        batch_size=hp_config['batch_size'],
        ubatch_size=hp_config['ubatch_size'],
        n_parallel=hp_config['n_parallel'],
        timeout=60,
        verbose=True
    )
    print("\n✅ High-performance server started!")
except Exception as e:
    print(f"\n❌ Server failed to start: {e}")
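
Parallel slots share the configured context. The arithmetic below is a sketch that assumes the server divides the total context evenly among slots, which is how many llama.cpp builds behave; newer builds may allocate context differently, so treat the result as an estimate. Both helper names are hypothetical.

```python
def per_slot_context(ctx_size, n_parallel):
    """Tokens available to each slot, assuming an even split of the context."""
    if n_parallel < 1:
        raise ValueError("n_parallel must be at least 1")
    return ctx_size // n_parallel

def request_fits(prompt_tokens, max_tokens, ctx_size, n_parallel):
    """Check whether prompt plus generation fits within one slot's share."""
    return prompt_tokens + max_tokens <= per_slot_context(ctx_size, n_parallel)

print(per_slot_context(8192, 4))          # settings from hp_config
print(request_fits(1500, 100, 8192, 4))
print(request_fits(2000, 100, 8192, 4))
```

Under this assumption, raising `n_parallel` without raising `ctx_size` shrinks the context each request can use.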

## Step 10: Benchmark Inference Performance

Benchmarks inference performance by running multiple prompts and measuring tokens per second, total time, and average generation speed to evaluate configuration effectiveness.

In [None]:
import time
from llcuda.api.client import LlamaCppClient

print("="*70)
print("📊 INFERENCE PERFORMANCE BENCHMARK")
print("="*70)

client = LlamaCppClient(base_url="http://127.0.0.1:8080")

# Benchmark parameters
prompts = [
    "Explain quantum computing in simple terms.",
    "Write a haiku about machine learning.",
    "What are the benefits of GPU acceleration?",
]

print(f"\n🏃 Running benchmark with {len(prompts)} prompts...\n")

total_tokens = 0
total_time = 0

for i, prompt in enumerate(prompts, 1):
    start = time.time()
    
    response = client.chat.create(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
        temperature=0.7
    )
    
    elapsed = time.time() - start
    tokens = response.usage.completion_tokens
    
    total_tokens += tokens
    total_time += elapsed
    
    print(f"   Prompt {i}: {tokens} tokens in {elapsed:.2f}s ({tokens/elapsed:.1f} tok/s)")

print(f"\n📊 Benchmark Results:")
print(f"   Total Tokens: {total_tokens}")
print(f"   Total Time: {total_time:.2f}s")
print(f"   Average Speed: {total_tokens/total_time:.1f} tokens/second")
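
The average above is an aggregate (total tokens divided by total time), which weights long generations more heavily than short ones. The sketch below contrasts it with the mean of per-prompt speeds. `summarize_runs` is a hypothetical helper, and the sample numbers are invented to illustrate the arithmetic; they are not measurements.

```python
from statistics import mean

def summarize_runs(runs):
    """Summarize a list of (tokens, seconds) pairs.

    Returns the aggregate throughput (total tokens / total time) and the
    unweighted mean of per-run throughputs. Runs with zero elapsed time
    are skipped when computing per-run speeds.
    """
    total_tokens = sum(tokens for tokens, _ in runs)
    total_time = sum(seconds for _, seconds in runs)
    per_run = [tokens / seconds for tokens, seconds in runs if seconds > 0]
    return {
        "total_tokens": total_tokens,
        "total_time": total_time,
        "aggregate_tok_s": total_tokens / total_time if total_time > 0 else 0.0,
        "mean_tok_s": mean(per_run) if per_run else 0.0,
    }

# Invented sample data: (completion tokens, elapsed seconds)
stats = summarize_runs([(100, 2.0), (20, 1.0), (100, 2.5)])
print(f"aggregate: {stats['aggregate_tok_s']:.1f} tok/s, "
      f"per-prompt mean: {stats['mean_tok_s']:.1f} tok/s")
```

The two figures diverge when prompts produce very different output lengths, for example a short haiku next to a long explanation. Both include prompt-processing time, because the timer wraps the whole request.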

## Step 11: GPU Memory Monitoring

Monitors current GPU memory usage, available VRAM, and GPU utilization across all GPUs to track resource consumption during inference operations.

In [None]:
print("="*70)
print("📊 GPU MEMORY MONITORING")
print("="*70)

# Current memory usage
print("\n🔹 Current GPU Memory Usage:")
!nvidia-smi --query-gpu=index,name,memory.used,memory.total,memory.free --format=csv

# Single snapshot of memory use and utilization (re-run the cell to sample again)
import subprocess
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,memory.used,utilization.gpu", "--format=csv,noheader"],
    capture_output=True, text=True
)

print("\n🔹 GPU Utilization:")
for line in result.stdout.strip().split('\n'):
    parts = line.split(', ')
    if len(parts) >= 3:
        print(f"   GPU {parts[0]}: {parts[1]} used, {parts[2]} utilization")
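
To track usage over time rather than at a single instant, several snapshots can be collected and reduced to a per-GPU peak. The sketch below assumes rows in the `index, memory.used, utilization.gpu` layout queried above, with units such as `MiB` and `%` attached. The helper names and sample strings are hypothetical, and rows with non-numeric values would need extra handling.

```python
def parse_usage(text):
    """Parse `index, memory.used, utilization.gpu` rows (csv,noheader) into dicts."""
    rows = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3:
            continue  # skip blank or malformed rows
        rows.append({
            "index": int(parts[0]),
            "memory_used_mib": int(parts[1].split()[0]),  # "1234 MiB" -> 1234
            "utilization_pct": int(parts[2].split()[0]),  # "45 %" -> 45
        })
    return rows

def peak_memory(snapshots):
    """Return {gpu_index: highest memory_used_mib seen across all snapshots}."""
    peaks = {}
    for text in snapshots:
        for row in parse_usage(text):
            idx = row["index"]
            peaks[idx] = max(peaks.get(idx, 0), row["memory_used_mib"])
    return peaks

# Invented snapshots in the expected layout (not real measurements)
snapshots = [
    "0, 1200 MiB, 10 %\n1, 3 MiB, 0 %",
    "0, 1450 MiB, 87 %\n1, 3 MiB, 0 %",
]
print(peak_memory(snapshots))
```

In the notebook, each snapshot could come from `result.stdout` of a repeated `subprocess.run` call made while the benchmark is running.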

## Step 12: Command-Line Reference

Reference commands for starting llama-server directly from a terminal in several configurations (basic, single GPU, dual GPU, high-performance, embeddings).

In [None]:
print("="*70)
print("📋 COMMAND-LINE REFERENCE")
print("="*70)

cli_examples = f"""
🔹 Basic Start:
   llama-server -m {model_path} --host 0.0.0.0 --port 8080

🔹 Single GPU (GPU 0, 15GB):
   llama-server -m {model_path} \\
       --host 0.0.0.0 --port 8080 \\
       --n-gpu-layers 99 --main-gpu 0 \\
       --ctx-size 4096 --flash-attn

🔹 Dual GPU (30GB total):
   llama-server -m {model_path} \\
       --host 0.0.0.0 --port 8080 \\
       --n-gpu-layers 99 \\
       --tensor-split 0.5,0.5 \\
       --split-mode layer \\
       --ctx-size 8192 --flash-attn

🔹 High-Performance:
   llama-server -m {model_path} \\
       --host 0.0.0.0 --port 8080 \\
       --n-gpu-layers 99 --flash-attn \\
       --ctx-size 8192 --batch-size 1024 \\
       --parallel 4 --cont-batching \\
       --threads 4 --threads-batch 4

🔹 With Embeddings:
   llama-server -m {model_path} \\
       --host 0.0.0.0 --port 8080 \\
       --n-gpu-layers 99 --flash-attn \\
       --embeddings
"""

print(cli_examples)

## Step 13: Cleanup

Performs final cleanup by stopping the server, freeing GPU resources, and displaying the final memory status to confirm all resources have been released.

In [None]:
# Stop server
print("🛑 Stopping server...")
server.stop_server()

print("\n✅ Server stopped. Resources freed.")

# Final GPU status
print("\n📊 Final GPU Memory Status:")
!nvidia-smi --query-gpu=index,memory.used,memory.free --format=csv

## 📚 Summary

You've learned:
1. ✅ Server configuration options
2. ✅ Kaggle T4 presets (single/dual GPU)
3. ✅ High-performance tuning
4. ✅ Health monitoring
5. ✅ Command-line reference

## Configuration Tips for Kaggle T4

| Model Size | Quantization | VRAM | Context | Config |
|------------|--------------|------|---------|--------|
| 1-3B | Q4_K_M | ~2GB | 8192 | Single T4 |
| 4-7B | Q4_K_M | ~5GB | 4096 | Single T4 |
| 8-13B | Q4_K_M | ~8GB | 4096 | Dual T4 |
| 13-30B | IQ3_XS | ~12GB | 2048 | Dual T4 |
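
The VRAM column can be sanity-checked with a back-of-the-envelope estimate: weight size is roughly parameter count times bits per weight, divided by 8. The sketch below assumes an average of about 4.85 bits per weight for Q4_K_M, which is an approximate figure that varies by model. It covers weights only; the KV cache and runtime buffers add to the total. `estimate_weight_gb` is a hypothetical helper.

```python
def estimate_weight_gb(params_billion, bits_per_weight):
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8

Q4_K_M_BPW = 4.85  # assumed approximate average, not an exact figure

for size in (1, 7, 13):
    gb = estimate_weight_gb(size, Q4_K_M_BPW)
    print(f"{size}B at Q4_K_M: roughly {gb:.1f} GB of weights")
```

The gap between these estimates and the table values is the headroom needed for context and runtime overhead, which grows with the context size.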

---

**Next:** [03-multi-gpu-inference](03-multi-gpu-inference-llcuda-v2.2.0.ipynb)