# Lab 9: Production LLM Deployment with Tool Calling (vLLM + FastAPI)

**Goal:**  
Build a production-grade LLM service with tool calling capabilities using vLLM for inference and FastAPI for orchestration.

## Architecture Overview

```
Client Request
     ↓
FastAPI Orchestration Layer (handles tool calling logic)
     ↓
vLLM Server (high-performance inference)
     ↓
Tool Execution (weather API, calculator, database, etc.)
     ↓
Response with tool results
```

## Why This Architecture?

- **vLLM**: Production-grade inference (continuous batching, PagedAttention)
- **FastAPI**: Lightweight orchestration for tool calling business logic
- **Separation of concerns**: Inference engine vs application logic
- **Scalability**: Scale vLLM and FastAPI independently

Keeping orchestration separate from inference in this way is a common pattern for production tool-calling systems.

**Time:** ~40 minutes


In [None]:
# Install Unsloth for model preparation, vLLM for inference, FastAPI for orchestration
!wget -qO- https://raw.githubusercontent.com/unslothai/unsloth/main/unsloth/_auto_install.py | python -

# Install production serving stack
%pip install vllm fastapi uvicorn openai httpx

print("✅ Production stack installed!")
print("📦 Components: Unsloth + vLLM + FastAPI + OpenAI SDK")
print("⚠️ IMPORTANT: Requires GPU runtime (T4 or better)")
print("🔄 Now restart runtime before proceeding.")

In [None]:
# 1️⃣ Prepare model for vLLM deployment (same as Lab 8)

from unsloth import FastLanguageModel
import torch

# Load and prepare your optimized model
model_name = "unsloth/Qwen2.5-1.5B-Instruct"  # Using smaller model for Colab
save_directory = "./vllm_model"

print("📦 Loading optimized model...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    dtype=torch.float16,
    load_in_4bit=False,  # Load 16-bit weights so the merged export is a plain fp16 checkpoint for vLLM
)

FastLanguageModel.for_inference(model)

# Export to HuggingFace format for vLLM
print(f"\n💾 Exporting model to {save_directory}...")
model.save_pretrained_merged(save_directory, tokenizer, save_method="merged_16bit")

print(f"""
✅ Model exported for vLLM!
📁 Location: {save_directory}
🎯 Next: Start vLLM server, then build FastAPI orchestration layer
""")

In [None]:
# 2️⃣ Start vLLM server

import subprocess
import sys
import time

print("🚀 Starting vLLM server (v1 engine - auto-optimized)...")
print("⏱️  Server startup takes ~30-60 seconds...\n")

# Write server logs to a file. An unread subprocess.PIPE fills up
# and can block the server once the OS pipe buffer is full.
vllm_log = open("vllm_server.log", "w")

vllm_process = subprocess.Popen([
    sys.executable, "-m", "vllm.entrypoints.openai.api_server",
    "--model", save_directory,
    "--host", "0.0.0.0",
    "--port", "8000",
    "--dtype", "float16",
    "--max-model-len", "2048",
    "--gpu-memory-utilization", "0.8",
    # Note: v1 engine auto-handles optimization
    # Don't add --enable-chunked-prefill (unnecessary: v1 enables chunked prefill by default)
],
    stdout=vllm_log,
    stderr=subprocess.STDOUT,
)

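# Wait for server to start
import requests

print("⏳ Waiting for server to become ready...")
max_wait = 90
start = time.time()

while time.time() - start < max_wait:
    # Stop polling if the server process has already died
    if vllm_process.poll() is not None:
        print(f"❌ vLLM process exited early (code {vllm_process.returncode})")
        break
    try:
        response = requests.get("http://localhost:8000/health", timeout=1)
        if response.status_code == 200:
            print(f"✅ vLLM server ready! (took {time.time() - start:.1f}s)")
            break
    except requests.exceptions.RequestException:
        pass  # Server not accepting connections yet
    time.sleep(2)
else:
    # Runs only if the loop finished without a break
    print("⚠️  Server startup timeout - may need more time")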

## vLLM Server Configuration

**Running at:** `http://localhost:8000` (v1 engine)

**This server handles high-performance inference with auto-optimization:**

- Continuous batching (automatically batches concurrent requests)
- PagedAttention (efficient KV cache management)
- Streaming support
- Token usage tracking

**Next step:** Build FastAPI orchestration layer for tool calling
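
To make the streaming bullet concrete, here is a minimal sketch of how a client might reassemble text from an OpenAI-compatible stream, which arrives as server-sent `data:` lines ending with `data: [DONE]`. The helper name `collect_stream_text` is hypothetical, and the sample lines are hand-written for illustration rather than captured from the server in this lab.

```python
import json
from typing import Iterable


def collect_stream_text(sse_lines: Iterable[str]) -> str:
    """Concatenate the delta.content pieces from OpenAI-style SSE lines."""
    parts = []
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data event
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            piece = (choice.get("delta") or {}).get("content")
            if piece:
                parts.append(piece)
    return "".join(parts)


# Hand-written sample events (illustrative only)
sample = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": ", world"}}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample))
```

With the OpenAI SDK this parsing is done for you: passing `stream=True` to `client.chat.completions.create(...)` yields chunk objects whose `choices[0].delta.content` can be concatenated the same way.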

In [None]:
# 3️⃣ Copy FastAPI orchestration layer

import shutil
import os

# Copy the orchestrator file from deployment_files to current directory
source = "../deployment_files/tool_orchestrator.py"
dest = "./tool_orchestrator.py"

if os.path.exists(source):
    shutil.copy(source, dest)
    print("✅ FastAPI orchestration layer copied!")
    print(f"📁 Copied from: {source}")
    print(f"📁 Saved to: {dest}")
else:
    print("⚠️  Source file not found. Make sure deployment_files/tool_orchestrator.py exists.")
    print("   You can find it in the course repo at: deployment_files/tool_orchestrator.py")

## Architecture Overview

```
Client → FastAPI (port 8001) → vLLM (port 8000) → Tools
```

**FastAPI Orchestrator handles:**
- Tool calling logic
- Tool execution
- Conversation management

**vLLM handles:**
- High-performance inference only
- Continuous batching
- GPU optimization

**The `tool_orchestrator.py` file contains:**
- FastAPI app with `/v1/chat/completions` endpoint
- Tool definitions (weather, calculator)
- Tool execution logic
- OpenAI-compatible response formatting
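
The contents of `tool_orchestrator.py` are not reproduced in this notebook, so the following is only a sketch of the kind of loop such a file might contain: call the model, run any requested tools, append the results, and call the model again. Everything here is an assumption for illustration: `run_tool_loop`, `TOOL_REGISTRY`, the stubbed weather data, and `fake_llm`, which stands in for the HTTP call to vLLM so the example runs without a server.

```python
import json
from typing import Callable, Dict, List


# Hypothetical tool implementation; a real one would call an external weather API.
def get_current_weather(location: str, unit: str = "celsius") -> dict:
    return {"location": location, "temperature": 18, "unit": unit}  # stubbed data


TOOL_REGISTRY: Dict[str, Callable[..., dict]] = {
    "get_current_weather": get_current_weather,
}


def run_tool_loop(messages: List[dict],
                  call_llm: Callable[[List[dict]], dict],
                  max_rounds: int = 3) -> dict:
    """Call the model, execute requested tools, and repeat until it answers in text."""
    messages = list(messages)  # do not mutate the caller's list
    for _ in range(max_rounds):
        assistant = call_llm(messages)  # OpenAI-style assistant message dict
        tool_calls = assistant.get("tool_calls") or []
        if not tool_calls:
            return assistant  # plain answer, no tools requested
        messages.append(assistant)
        for call in tool_calls:
            name = call["function"]["name"]
            try:
                args = json.loads(call["function"]["arguments"])
                result = TOOL_REGISTRY[name](**args)
            except Exception as exc:  # report failures to the model instead of crashing
                result = {"error": f"{type(exc).__name__}: {exc}"}
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(result),
            })
    return {"role": "assistant", "content": "Stopped: too many tool rounds."}


# Stand-in for the model: asks for the weather tool once, then answers from its result.
def fake_llm(messages: List[dict]) -> dict:
    tool_msgs = [m for m in messages if m["role"] == "tool"]
    if not tool_msgs:
        return {"role": "assistant", "content": None, "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {"name": "get_current_weather",
                         "arguments": json.dumps({"location": "Paris"})},
        }]}
    data = json.loads(tool_msgs[-1]["content"])
    return {"role": "assistant",
            "content": f"It is {data['temperature']} degrees {data['unit']} in {data['location']}."}


reply = run_tool_loop(
    [{"role": "user", "content": "What's the weather like in Paris?"}], fake_llm
)
print(reply["content"])
```

In a real orchestrator, `call_llm` would send the messages to vLLM's `/v1/chat/completions` endpoint. vLLM documents the `--enable-auto-tool-choice` and `--tool-call-parser` options for server-side parsing of tool calls. The launch command in this lab does not set them, so how the actual file extracts tool requests is not shown here.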

In [None]:
# 4️⃣ Start FastAPI orchestrator in background

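import socket
import subprocess
import sys
import time

print("🚀 Starting FastAPI orchestration layer...")

# Start FastAPI server on port 8001.
# Logs go to a file so an unread pipe cannot stall the process.
fastapi_log = open("orchestrator.log", "w")
fastapi_process = subprocess.Popen(
    [sys.executable, "tool_orchestrator.py"],
    stdout=fastapi_log,
    stderr=subprocess.STDOUT,
)

# Wait until port 8001 accepts connections instead of assuming a fixed delay
deadline = time.time() + 30
while time.time() < deadline:
    if fastapi_process.poll() is not None:
        print("❌ Orchestrator exited early - check orchestrator.log")
        break
    try:
        socket.create_connection(("localhost", 8001), timeout=1).close()
        print("✅ FastAPI orchestrator running on port 8001")
        break
    except OSError:
        time.sleep(1)  # Not listening yet
else:
    print("⚠️  Port 8001 did not open within 30s - check orchestrator.log")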

## Production Architecture Ready!

**Two-tier architecture:**
- **FastAPI (port 8001)**: Orchestration + tool calling
- **vLLM (port 8000)**: High-performance inference

**Why this architecture?**

This mirrors a common production layout:
- **Separation of concerns**: Business logic vs GPU inference
- **Independent scaling**: Scale each tier based on needs
- **FastAPI handles**: Tool calling, conversation management, business logic
- **vLLM handles**: GPU-intensive inference only

Separating the orchestration tier from the inference tier is a widely used pattern for tool-calling systems.

In [None]:
# 5️⃣ Test production tool-calling system

from openai import OpenAI
import time

# Point client to our FastAPI orchestrator (not directly to vLLM)
client = OpenAI(
    base_url="http://localhost:8001/v1",
    api_key="none"
)

print("🧪 Testing Production Tool-Calling System...\n")
print("="*70)

# Test 1: Weather tool
print("\n📋 TEST 1: Weather Tool Call")
print("-"*70)

start_time = time.time()
response = client.chat.completions.create(
    model=save_directory,
    messages=[
        {"role": "user", "content": "What's the weather like in Paris?"}
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    }],
    temperature=0.3
)
elapsed = time.time() - start_time

print(f"⏱️  Total time: {elapsed:.2f}s")
print(f"\n💬 User: What's the weather like in Paris?")
print(f"🤖 Assistant: {response.choices[0].message.content}")
print(f"\n📊 Token usage:")
print(f"  - Prompt: {response.usage.prompt_tokens}")
print(f"  - Completion: {response.usage.completion_tokens}")
print(f"  - Total: {response.usage.total_tokens}")

# Test 2: Math calculator
print("\n" + "="*70)
print("📋 TEST 2: Math Calculator Tool")
print("-"*70)

start_time = time.time()
response = client.chat.completions.create(
    model=save_directory,
    messages=[
        {"role": "user", "content": "Calculate 234 * 156 + 789"}
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "calculate_math",
            "description": "Evaluate a mathematical expression",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {"type": "string"}
                },
                "required": ["expression"]
            }
        }
    }],
    temperature=0.1
)
elapsed = time.time() - start_time

print(f"⏱️  Total time: {elapsed:.2f}s")
print(f"\n💬 User: Calculate 234 * 156 + 789")
print(f"🤖 Assistant: {response.choices[0].message.content}")

# Test 3: No tools (regular chat)
print("\n" + "="*70)
print("📋 TEST 3: Regular Chat (No Tools)")
print("-"*70)

start_time = time.time()
response = client.chat.completions.create(
    model=save_directory,
    messages=[
        {"role": "user", "content": "Explain tool calling in production systems."}
    ],
    temperature=0.7
)
elapsed = time.time() - start_time

print(f"⏱️  Total time: {elapsed:.2f}s")
print(f"\n💬 User: Explain tool calling in production systems.")
print(f"🤖 Assistant: {response.choices[0].message.content}")

print("\n" + "="*70)
print("✅ All tests complete!")
print("""
🎯 Key Production Architecture Benefits Demonstrated:

1. **Separation of Concerns**: FastAPI (orchestration) + vLLM (inference)
2. **Scalability**: Scale each tier independently
3. **Maintainability**: Tool logic separate from inference engine
4. **Performance**: vLLM handles GPU optimization, FastAPI handles business logic

Many production LLM services separate orchestration from inference in this way.
""")


## Production Deployment: Multi-Service Architecture

### docker-compose.yml

```yaml
version: '3.8'

services:
  # vLLM inference engine (v1 engine with auto-optimization)
  vllm:
    image: vllm/vllm-openai:latest
    # The vllm/vllm-openai image's entrypoint already starts the API server,
    # so `command` lists only the server arguments
    command:
      - --model
      - /app/model
      - --host
      - 0.0.0.0
      - --port
      - '8000'
      - --dtype
      - float16
      - --gpu-memory-utilization
      - '0.9'
      # Note: v1 engine auto-handles chunked prefill and scheduling
      # Don't add --enable-chunked-prefill or --num-scheduler-steps
    volumes:
      - ./vllm_model:/app/model
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    networks:
      - llm-network

  # FastAPI orchestration layer
  orchestrator:
    build:
      context: .
      dockerfile: Dockerfile.orchestrator
    ports:
      - '8001:8001'
    environment:
      - VLLM_URL=http://vllm:8000
    depends_on:
      - vllm
    networks:
      - llm-network

networks:
  llm-network:
    driver: bridge
```

### Dockerfile.orchestrator

```dockerfile
FROM python:3.11-slim

WORKDIR /app

RUN pip install fastapi uvicorn httpx pydantic

COPY tool_orchestrator.py .

EXPOSE 8001

CMD ["uvicorn", "tool_orchestrator:app", "--host", "0.0.0.0", "--port", "8001"]
```

### AWS ECS Deployment

Deploy as two separate services:

1. **vLLM Service**: GPU-enabled tasks (g5.xlarge+) with v1 engine
2. **Orchestrator Service**: CPU tasks (can scale independently)

Benefits:
- Scale orchestrator horizontally for high request volumes
- Scale vLLM vertically for larger models
- Cost optimization: Only GPU for inference, CPU for orchestration
- Independent deployment cycles
- v1 engine can raise throughput with no config changes (roughly 1.5-2x over v0, depending on model and workload)

## Reflection: Production Tool-Calling Architecture

### Why This Architecture?

**Two-Tier Design** (FastAPI + vLLM) vs **Monolithic** (Direct Model Wrapping):

| Aspect | Production (This Lab) | Prototype (Old Approach) |
|--------|----------------------|-------------------------|
| **Inference** | vLLM v1 (up to 24x throughput vs. Hugging Face Transformers, per vLLM's published benchmarks) | Direct `model.generate()` call |
| **Batching** | Continuous batching | None |
| **Scalability** | Scale tiers independently | Scale entire app |
| **Latency** | Optimized CUDA kernels | Naive PyTorch |
| **Tool Logic** | FastAPI (easy to modify) | Coupled with model |
| **Cost** | GPU only for inference | GPU for everything |
| **Optimization** | Auto-tuned (v1 engine) | Manual tuning required |

### vLLM v1 Engine Benefits (2025)

vLLM v1 engine (default since v0.8.0) provides significant improvements:
- **Higher throughput**: Roughly 1.5-2x over v0 with no config changes, depending on model and workload
- **Automatic optimization**: Chunked prefill and scheduling are handled by default, so manual `--enable-chunked-prefill` or `--num-scheduler-steps` is not needed
- **Better memory efficiency**: Smarter KV cache management
- **Lower latency**: Optimized scheduling algorithms

**Important**: Don't add v0-era flags like `--enable-chunked-prefill` or `--num-scheduler-steps`. They are unnecessary on v1 and, depending on the vLLM version, may be rejected or trigger a fallback to the v0 engine with lower performance.

### Real-World Tool Calling Patterns

1. **Weather/External APIs**: Call external services, cache results
2. **Database Queries**: Execute SQL, return results to model
3. **Code Execution**: Sandboxed Python/JS execution
4. **Multi-tool chains**: Tool A → result feeds Tool B → final answer

### Production Considerations

- **Tool Timeouts**: External APIs can fail/timeout - set proper timeouts
- **Rate Limiting**: Protect external services from abuse
- **Caching**: Cache tool results for identical inputs (Redis/Memcached)
- **Monitoring**: Track which tools are called, success rates, latencies
- **Security**: Validate tool inputs, sandbox execution, prevent SQL injection
- **Multi-tenancy**: Isolate tool execution per customer/org
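
The security point applies directly to a tool like `calculate_math`: passing a model-written expression to `eval()` would let it run arbitrary Python. As a minimal sketch (not necessarily what `tool_orchestrator.py` does), the expression can be parsed with `ast` and evaluated against a whitelist of arithmetic nodes. The name `safe_calculate` and the length limit are illustrative assumptions.

```python
import ast
import operator

# Whitelisted operators; any other syntax is rejected.
# Exponentiation is left out on purpose to avoid very expensive computations.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}

MAX_EXPRESSION_LENGTH = 200  # arbitrary illustrative limit


def _evaluate(node):
    """Recursively evaluate a whitelisted arithmetic AST node."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate(node.operand))
    raise ValueError(f"unsupported expression element: {type(node).__name__}")


def safe_calculate(expression: str):
    """Evaluate basic arithmetic without eval(); raises on anything else."""
    if len(expression) > MAX_EXPRESSION_LENGTH:
        raise ValueError("expression too long")
    tree = ast.parse(expression, mode="eval")  # raises SyntaxError on malformed input
    return _evaluate(tree.body)


print(safe_calculate("234 * 156 + 789"))

try:
    safe_calculate("__import__('os').system('ls')")
except ValueError as exc:
    print("rejected:", exc)
```

Function calls, attribute access, and names all fall through to the final `raise`, so the injection attempt is rejected before anything executes.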

### Performance Metrics from Tests

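- **With tool calling**: Expect roughly 2-3x the latency of a plain request (2 inference calls + tool execution); compare the timings printed by Tests 1-3
- **Without tools**: Standard vLLM latency (typically ~100-500ms for short completions, depending on GPU and output length)
- **Optimization strategies**: 
  - Batch tool executions when possible
  - Cache tool results (can cut latency by roughly 30-50% when inputs repeat; the gain depends on hit rate)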
  - Run tools in parallel for multi-tool requests
  - Use vLLM v1 engine for automatic optimization
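
Two of the strategies above, caching and parallel execution, can be combined in a few lines of `asyncio`. The sketch below is illustrative only: the in-process dictionary stands in for Redis/Memcached, the TTL and timeout values are arbitrary, and `fetch_weather` and `fetch_time` are hypothetical tools that simulate network latency with `asyncio.sleep`.

```python
import asyncio
import json
import time
from typing import Any, Awaitable, Callable, Dict, List, Tuple

CACHE_TTL_SECONDS = 60.0     # illustrative values
TOOL_TIMEOUT_SECONDS = 2.0

_cache: Dict[str, Tuple[float, Any]] = {}  # key -> (stored_at, result)


async def call_tool(name: str, func: Callable[..., Awaitable[Any]], args: dict) -> Any:
    """Run one tool with a timeout, reusing a cached result for identical inputs."""
    key = name + ":" + json.dumps(args, sort_keys=True)
    hit = _cache.get(key)
    if hit is not None and time.monotonic() - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]
    try:
        result = await asyncio.wait_for(func(**args), timeout=TOOL_TIMEOUT_SECONDS)
    except asyncio.TimeoutError:
        return {"error": f"{name} timed out"}  # errors are not cached
    _cache[key] = (time.monotonic(), result)
    return result


async def run_tools_in_parallel(calls: List[Tuple[str, Callable, dict]]) -> List[Any]:
    """Execute independent tool calls concurrently; results keep the input order."""
    return await asyncio.gather(*(call_tool(n, f, a) for n, f, a in calls))


# Hypothetical tools that simulate slow external services
async def fetch_weather(location: str) -> dict:
    await asyncio.sleep(0.2)
    return {"location": location, "temperature": 18}


async def fetch_time(timezone: str) -> dict:
    await asyncio.sleep(0.2)
    return {"timezone": timezone, "time": "12:00"}


async def main():
    calls = [("weather", fetch_weather, {"location": "Paris"}),
             ("time", fetch_time, {"timezone": "Europe/Paris"})]
    start = time.perf_counter()
    results = await run_tools_in_parallel(calls)   # both tools run concurrently
    first = time.perf_counter() - start
    start = time.perf_counter()
    await run_tools_in_parallel(calls)             # served from the cache
    second = time.perf_counter() - start
    return results, first, second


results, first_elapsed, second_elapsed = asyncio.run(main())
print(results)
print(f"first run: {first_elapsed:.2f}s, cached run: {second_elapsed:.2f}s")
```

Inside a notebook, where an event loop is already running, use `await main()` in place of `asyncio.run(main())`.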

### When to Use Tool Calling vs RAG

- **Tool Calling**: Dynamic data (weather, stock prices, live data), actions (send email, update DB)
- **RAG**: Static knowledge bases, documents, historical data
- **Hybrid**: RAG for context retrieval + Tools for real-time actions (most powerful)

### Enterprise Scaling

For enterprise deployments (>1k QPS):
- **Ray Serve** for distributed orchestration across multiple vLLM instances
- **LoRA multi-tenancy** for serving different fine-tuned models per customer
- **Multi-region** deployment for global low-latency access
- **Tiered routing** (simple → 8B model, complex → 70B model) for cost optimization
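
Tiered routing can start as a plain heuristic in the orchestrator. The sketch below assumes two vLLM deployments behind hypothetical URLs and uses invented thresholds and keyword hints. A production router would more likely rely on a trained classifier or on measured quality per route.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelTier:
    name: str
    base_url: str


# Hypothetical deployments; names and URLs are placeholders
SMALL = ModelTier("small-8b", "http://vllm-small:8000/v1")
LARGE = ModelTier("large-70b", "http://vllm-large:8000/v1")

# Illustrative signals that a request may need the larger model
COMPLEX_HINTS = ("step by step", "analyze", "prove", "refactor", "compare")


def choose_tier(messages: List[dict],
                tools: Optional[List[dict]] = None,
                long_prompt_chars: int = 2000) -> ModelTier:
    """Pick a model tier from simple request features (assumes string message content)."""
    text = " ".join(m.get("content") or "" for m in messages if m.get("role") == "user")
    if len(text) > long_prompt_chars:
        return LARGE  # long prompts go to the larger model
    if tools and len(tools) > 3:
        return LARGE  # many candidate tools make selection harder
    lowered = text.lower()
    if any(hint in lowered for hint in COMPLEX_HINTS):
        return LARGE
    return SMALL


simple = choose_tier([{"role": "user", "content": "What's the weather like in Paris?"}])
hard = choose_tier([{"role": "user", "content": "Analyze this contract step by step."}])
print(simple.name, hard.name)
```

The returned `base_url` can be passed to the OpenAI client or an HTTP call, so the rest of the orchestrator stays unchanged regardless of which tier serves the request.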

This two-tier design is a solid starting point for production and follows the widely used pattern of keeping tool orchestration separate from the inference engine.