## Step 1: Verify Kaggle GPU Environment

First, let's confirm we have 2× Tesla T4 GPUs available.

Verifies the Kaggle dual T4 GPU environment by listing the detected GPUs, the CUDA toolkit version, and each GPU's total VRAM, to confirm the setup is ready for llcuda v2.2.0 multi-GPU inference.

In [None]:
# Verify we have 2× T4 GPUs
import subprocess
import os

print("="*70)
print("🔍 KAGGLE GPU ENVIRONMENT CHECK")
print("="*70)

# Check nvidia-smi
result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
gpu_lines = [l for l in result.stdout.strip().split("\n") if l.startswith("GPU")]
print(f"\n📊 Detected GPUs: {len(gpu_lines)}")
for line in gpu_lines:
    print(f"   {line}")

# Check CUDA version
print("\n📊 CUDA Version:")
!nvcc --version | grep release

# Check total VRAM
print("\n📊 VRAM Summary:")
!nvidia-smi --query-gpu=index,name,memory.total --format=csv

# Verify we have 2 GPUs
if len(gpu_lines) >= 2:
    print("\n✅ Multi-GPU environment confirmed! Ready for llcuda v2.2.0.")
else:
    print("\n⚠️ WARNING: Fewer than 2 GPUs detected!")
    print("   Enable 'GPU T4 x2' in Kaggle notebook settings.")
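
The `nvidia-smi` query above prints one row per GPU and leaves the totalling to the reader. As a minimal sketch, the same CSV can be parsed in plain Python to get a combined VRAM figure. The helper name `parse_gpu_memory` is illustrative, and the sample string only mimics what `nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader,nounits` might print on a dual T4 machine; it is not captured from a real run.

```python
import csv
import io

def parse_gpu_memory(csv_text):
    """Parse `index,name,memory.total` rows (MiB, no header, no units)."""
    gpus = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        index, name, total_mib = (field.strip() for field in row)
        gpus.append({"index": int(index), "name": name, "total_mib": int(total_mib)})
    return gpus

# Illustrative sample, not real output
sample = "0, Tesla T4, 15360\n1, Tesla T4, 15360\n"
gpus = parse_gpu_memory(sample)
total_gib = sum(g["total_mib"] for g in gpus) / 1024
print(f"{len(gpus)} GPUs, {total_gib:.1f} GiB total VRAM")
```

To use live data, pass the `stdout` of `subprocess.run([...], capture_output=True, text=True)` for that `nvidia-smi` command in place of `sample`.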

## Step 2: Install llcuda v2.2.0

Install from GitHub with pre-built binaries for Kaggle T4×2.

Installs llcuda v2.2.0 from GitHub and verifies the installation by checking CUDA availability, GPU count, and version information using llcuda's API.

In [None]:
%%time
# Install llcuda v2.2.0 from GitHub (force fresh install, no cache)
print("📦 Installing llcuda v2.2.0...")
!pip install -q --no-cache-dir --force-reinstall git+https://github.com/llcuda/llcuda.git@v2.2.0

# Verify installation
import llcuda
print(f"\n✅ llcuda {llcuda.__version__} installed!")

# Check llcuda status using available APIs
from llcuda import check_cuda_available, get_cuda_device_info
from llcuda.api.multigpu import gpu_count

cuda_info = get_cuda_device_info()
print(f"\n📊 llcuda Status:")
print(f"   CUDA Available: {check_cuda_available()}")
print(f"   GPUs: {gpu_count()}")
if cuda_info:
    print(f"   CUDA Version: {cuda_info.get('cuda_version', 'N/A')}")

## Step 3: Download a GGUF Model

We'll use Gemma 3 4B (instruction-tuned) in Q4_K_M quantization, which fits comfortably on a Kaggle T4.

Downloads the Gemma 3 4B instruction-tuned model in Q4_K_M GGUF format from Hugging Face and reports its size on disk. At roughly 2.5 GB, it fits well within a T4's 15 GB of VRAM.

In [None]:
%%time
from huggingface_hub import hf_hub_download
import os

# Model selection - optimized for 15GB VRAM
MODEL_REPO = "unsloth/gemma-3-4b-it-GGUF"
MODEL_FILE = "gemma-3-4b-it-Q4_K_M.gguf"

print(f"📥 Downloading {MODEL_FILE}...")
print(f"   Repository: {MODEL_REPO}")
print(f"   Expected size: ~2.5GB")

# Download to Kaggle working directory
model_path = hf_hub_download(
    repo_id=MODEL_REPO,
    filename=MODEL_FILE,
    local_dir="/kaggle/working/models"
)

print(f"\n✅ Model downloaded: {model_path}")

# Show model size
size_gb = os.path.getsize(model_path) / (1024**3)
print(f"   Size: {size_gb:.2f} GB")
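
A download that was interrupted or redirected to an error page can still leave a file at `model_path`. One cheap sanity check, sketched here, is to read the file header: GGUF files begin with the four ASCII bytes `GGUF`, followed by a format version stored as an unsigned 32-bit integer (assumed little-endian here, which is the common case). The helper name `read_gguf_header` is illustrative, and the demo writes a tiny stand-in file rather than touching the real model.

```python
import os
import struct
import tempfile

def read_gguf_header(path):
    """Return the GGUF version if the file starts with the GGUF magic, else None."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        return None
    (version,) = struct.unpack("<I", header[4:8])
    return version

# Demo on a stand-in file containing only a header (not a usable model)
with tempfile.NamedTemporaryFile(delete=False, suffix=".gguf") as tmp:
    tmp.write(b"GGUF" + struct.pack("<I", 3))
print(read_gguf_header(tmp.name))
os.remove(tmp.name)
```

In the notebook, calling `read_gguf_header(model_path)` right after the download should return a version number; `None` suggests the file is not a GGUF model.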

## Step 4: Start llama-server

Start the inference server across both T4 GPUs using llcuda's Kaggle preset.

Starts the llama-server with multi-GPU configuration using optimized settings for dual T4 GPUs, including tensor splitting, flash attention, and layer distribution across both GPUs.

In [None]:
from llcuda.server import ServerManager
from llcuda.api.multigpu import kaggle_t4_dual_config

# Get optimized configuration for Kaggle T4×2
config = kaggle_t4_dual_config()

print("🚀 Starting llama-server with Multi-GPU configuration...")
print(f"   Model: {model_path}")
print(f"   GPU Layers: {config.n_gpu_layers} (all layers)")
print(f"   Context Size: {config.ctx_size}")
print(f"   Tensor Split: {config.tensor_split} (equal across 2 GPUs)")
print(f"   Flash Attention: {config.flash_attention}")

# Create server manager
server = ServerManager(server_url="http://127.0.0.1:8080")

# Start server with multi-GPU configuration
# Pass tensor_split as comma-separated string for --tensor-split flag
tensor_split_str = ",".join(str(x) for x in config.tensor_split) if config.tensor_split else None

try:
    server.start_server(
        model_path=model_path,
        host="127.0.0.1",
        port=8080,
        gpu_layers=config.n_gpu_layers,
        ctx_size=config.ctx_size,
        timeout=120,
        verbose=True,
        # Multi-GPU parameters (passed via **kwargs)
        flash_attn=1 if config.flash_attention else 0,
        split_mode="layer",
        tensor_split=tensor_split_str,
    )
    print("\n✅ llama-server is ready with dual T4 GPUs!")
    print(f"   API endpoint: http://127.0.0.1:8080")
except Exception as e:
    print(f"\n❌ Server failed to start: {e}")
    raise  # Stop here so later cells do not run against a server that never started
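
If you want to confirm readiness independently of `ServerManager`, upstream llama.cpp's `llama-server` exposes a `GET /health` endpoint that returns HTTP 200 once the model is loaded. The sketch below polls it with the standard library only. It assumes the server bundled with llcuda behaves like upstream in this respect, and `wait_for_health` is a hypothetical helper name, not part of llcuda.

```python
import time
import urllib.error
import urllib.request

def wait_for_health(base_url, timeout=120.0, interval=1.0):
    """Poll <base_url>/health until it returns HTTP 200 or the timeout expires."""
    url = base_url.rstrip("/") + "/health"
    # Bypass any configured HTTP proxy, since the server is local
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with opener.open(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not listening yet, or still loading the model
        time.sleep(interval)
    return False

# With the server from this step running:
# ready = wait_for_health("http://127.0.0.1:8080")
```

A non-200 response (for example while the model is still loading) is treated the same as a refused connection: the loop simply tries again until the deadline.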

## Step 5: Run Your First Inference

Use the OpenAI-compatible API to chat with the model.

Performs the first inference test using llcuda's OpenAI-compatible chat API, sending a simple question about CUDA and displaying the response with token usage statistics.

In [None]:
from llcuda.api.client import LlamaCppClient

# Create client
client = LlamaCppClient(base_url="http://127.0.0.1:8080")

# Test simple completion using OpenAI-compatible API
print("💬 Testing inference...\n")

response = client.chat.create(
    messages=[
        {"role": "user", "content": "What is CUDA? Explain in 2 sentences."}
    ],
    max_tokens=100,
    temperature=0.7
)

print("📝 Response:")
print(response.choices[0].message.content)

print(f"\n📊 Stats:")
print(f"   Tokens generated: {response.usage.completion_tokens}")
print(f"   Total tokens: {response.usage.total_tokens}")
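
Because the endpoint is OpenAI-compatible, the same request can be made without the llcuda client. The sketch below builds the equivalent HTTP request with the standard library, assuming the server exposes the usual `/v1/chat/completions` route as upstream `llama-server` does. `build_chat_request` is an illustrative helper name, and the actual network call is left commented out so the cell runs even without a server.

```python
import json
import urllib.request

def build_chat_request(base_url, prompt, max_tokens=100, temperature=0.7):
    """Build a POST request for an OpenAI-compatible chat completion."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("http://127.0.0.1:8080", "What is CUDA? Explain in 2 sentences.")
print(req.get_method(), req.full_url)
print(json.loads(req.data))

# With the server from Step 4 running, send it like this:
# with urllib.request.urlopen(req, timeout=60) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```

Seeing the raw payload makes it easier to reuse the server from other tools that speak the same protocol.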

## Step 6: Streaming Response Example

Stream responses for real-time output.

Demonstrates streaming inference where the model's response is generated and displayed token-by-token in real-time, useful for interactive applications and better user experience.

In [None]:
# Streaming example using OpenAI-compatible API
print("💬 Streaming response...\n")

for chunk in client.chat.create(
    messages=[
        {"role": "user", "content": "Write a Python function to calculate factorial."}
    ],
    max_tokens=200,
    temperature=0.3,
    stream=True  # Enable streaming
):
    if hasattr(chunk, 'choices') and chunk.choices:
        delta = chunk.choices[0].delta
        if hasattr(delta, 'content') and delta.content:
            print(delta.content, end="", flush=True)

print("\n\n✅ Streaming complete!")
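
Under the hood, OpenAI-compatible servers stream responses as server-sent events: each event is a `data: {json}` line carrying a partial `delta`, and the stream ends with `data: [DONE]`. The sketch below shows how those lines turn into text. `iter_stream_text` is an illustrative name, and `sample_stream` is hand-written to show the shape of the protocol; it is not captured server output.

```python
import json

def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text

# Hand-written example events
sample_stream = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    '',
    'data: {"choices": [{"delta": {"content": "def "}}]}',
    '',
    'data: {"choices": [{"delta": {"content": "factorial(n):"}}]}',
    '',
    'data: [DONE]',
]
print("".join(iter_stream_text(sample_stream)))
```

The first event carries only the role, which is why the loop above checks that `content` is present before printing.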

## Step 7: Check GPU Memory Usage

Monitor VRAM usage to understand resource consumption.

Reports memory usage and utilization for both T4 GPUs. Because Step 4 splits the model across the two GPUs, expect llama-server memory on both GPU 0 and GPU 1; GPU 1 stays free for RAPIDS or Graphistry workloads only if the server is started on GPU 0 alone.

In [None]:
# Check GPU memory usage
print("📊 GPU Memory Usage:")
print("="*60)
!nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv

print("\n💡 Note:")
print("   With the tensor split from Step 4, llama-server uses both GPU 0 and GPU 1.")
print("   To keep GPU 1 free for RAPIDS/Graphistry, run the server on GPU 0 only.")
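
To see at a glance which GPUs actually hold model weights, the CSV from `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits` can be turned into per-GPU percentages. This is a minimal sketch: `gpu_usage` is an illustrative name, the 500 MiB "active" threshold is an arbitrary assumption, and the sample values are made up rather than measured.

```python
def gpu_usage(csv_text, active_threshold_mib=500):
    """Parse `index,memory.used,memory.total` rows (MiB, no header, no units)."""
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue  # skip malformed lines
        index, used, total = (int(p) for p in parts)
        rows.append({
            "index": index,
            "used_mib": used,
            "total_mib": total,
            "percent": round(100 * used / total, 1),
            "active": used >= active_threshold_mib,  # rough "something is loaded" flag
        })
    return rows

# Made-up sample values for illustration
sample = "0, 1536, 15360\n1, 3, 15360\n"
for row in gpu_usage(sample):
    state = "in use" if row["active"] else "idle"
    print(f"GPU {row['index']}: {row['used_mib']} / {row['total_mib']} MiB ({row['percent']}%) {state}")
```

Running this against live output after Step 4 shows whether the model landed on one GPU or both.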

## Step 8: Cleanup

Stop the server when done.

Stops the llama-server gracefully and verifies that GPU memory has been released, freeing up resources for other tasks or subsequent notebook runs.

In [None]:
# Stop the server
print("🛑 Stopping llama-server...")
server.stop_server()
print("\n✅ Server stopped. Resources freed.")

# Verify GPU memory is released
print("\n📊 GPU Memory After Cleanup:")
!nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
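
In a notebook it is easy to skip the cleanup cell after an error, which leaves the server holding GPU memory. One way to make cleanup automatic, sketched here, is to wrap the `start_server` / `stop_server` pair from this tutorial in a context manager. `running_server` is a hypothetical helper, and `FakeServer` is a stand-in used only so the example runs without a GPU; neither is part of llcuda.

```python
from contextlib import contextmanager

@contextmanager
def running_server(server, **start_kwargs):
    """Start the server, then always stop it, even if the body raises."""
    server.start_server(**start_kwargs)
    try:
        yield server
    finally:
        server.stop_server()

class FakeServer:
    """Stand-in exposing the same two method names as the manager used above."""
    def __init__(self):
        self.events = []
    def start_server(self, **kwargs):
        self.events.append("start")
    def stop_server(self):
        self.events.append("stop")

fake = FakeServer()
try:
    with running_server(fake, model_path="model.gguf"):
        raise RuntimeError("inference failed")
except RuntimeError:
    pass
print(fake.events)
```

With a real `ServerManager`, the inference code would go inside the `with` block and the server would be stopped when the block exits.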

## 🎉 Quick Start Complete!

You've successfully:
1. ✅ Verified Kaggle GPU environment
2. ✅ Installed llcuda v2.2.0
3. ✅ Downloaded a GGUF model
4. ✅ Started llama-server
5. ✅ Ran inference with chat completion
6. ✅ Used streaming responses
7. ✅ Checked GPU memory usage
8. ✅ Stopped the server and freed GPU memory

## Next Steps

Explore more tutorials:
- 📘 [02-llama-server-setup](02-llama-server-setup-llcuda-v2.2.0.ipynb) - Advanced server configuration
- 📘 [03-multi-gpu-inference](03-multi-gpu-inference-llcuda-v2.2.0.ipynb) - Dual T4 inference
- 📘 [04-gguf-quantization](04-gguf-quantization-llcuda-v2.2.0.ipynb) - Quantization guide
- 📘 [05-unsloth-integration](05-unsloth-integration-llcuda-v2.2.0.ipynb) - Unsloth training → llcuda
- 📘 [06-split-gpu-graphistry](06-split-gpu-graphistry-llcuda-v2.2.0.ipynb) - LLM + Graph visualization

---

**llcuda v2.2.0** | CUDA 12 Inference Backend for Unsloth