# 📓 The GenAI Revolution Cookbook

**Title:** How to Self-Host a Fast Llama 3 API Locally with vLLM and FastAPI

**Description:** Ship a production-grade Llama 3 API: vLLM throughput with batching/KV cache, FastAPI gateway for auth, rate limits, streaming, and benchmarks.

---

*This Jupyter notebook contains executable code examples. Run the cells below to try out the code yourself!*



## What You'll Build

A production-ready API gateway that sits in front of vLLM, exposing an OpenAI-compatible `/v1/chat/completions` endpoint with per-key rate limiting, Prometheus metrics, and streaming support. By the end, you'll be able to swap your OpenAI base URL, authenticate with a custom key, and stream tokens from a locally hosted Llama 3 model—while controlling cost and observability.

**Prerequisites:**
- CUDA-capable GPU (24GB+ VRAM recommended for Llama 3 8B in FP16; the weights alone occupy about 16GB, before the KV cache)
- Access to Meta Llama 3 on Hugging Face (gated model; request access at [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct))
- Hugging Face token with read permissions (set as `HUGGING_FACE_HUB_TOKEN`)
- Python 3.10+

**Acceptance Criteria:**
- `/v1/chat/completions` returns streaming or non-streaming responses matching OpenAI's schema
- API keys enforce per-key rate limits (e.g., 10 req/min)
- Prometheus `/metrics` endpoint exposes request counts, latencies, and token throughput
- Clients can point their OpenAI base URL (`base_url`, or `openai.api_base` in pre-1.0 SDKs) and API key at your gateway with no other code changes

---

## How It Works (High-Level Overview)

**vLLM** serves the model with continuous batching and PagedAttention for high throughput. **FastAPI** wraps vLLM's HTTP API, adding:
- **Authentication** via `Authorization: Bearer <key>` (with optional `x-api-key` fallback)
- **Rate limiting** per key using Redis and SlowAPI
- **Metrics** via Prometheus for request counts, latencies, and tokens/sec
- **OpenAI schema compatibility** so existing clients work without modification
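
To make the per-key rate limiting item concrete, here is a minimal in-memory sketch of a fixed-window counter, one common strategy for limits such as `10/minute`. It is not SlowAPI's implementation, and `FixedWindowLimiter` and its `clock` parameter are hypothetical names used only for illustration; the gateway itself keeps its counters in Redis.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """Allow at most `limit` requests per key in each `window`-second window."""

    def __init__(self, limit: int = 10, window: int = 60,
                 clock: Callable[[], float] = time.time) -> None:
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the behaviour can be checked without sleeping
        self.counts: Dict[Tuple[str, int], int] = defaultdict(int)

    def allow(self, api_key: str) -> bool:
        # All requests in the same window share one counter per key.
        bucket = (api_key, int(self.clock() // self.window))
        if self.counts[bucket] >= self.limit:
            return False
        self.counts[bucket] += 1
        return True


fake_now = [0.0]  # hypothetical fake clock for the demo
limiter_demo = FixedWindowLimiter(limit=10, window=60, clock=lambda: fake_now[0])
results = [limiter_demo.allow("sk-test-key-1") for _ in range(11)]
print(results.count(True), results[-1])      # 10 False
print(limiter_demo.allow("sk-test-key-2"))   # True: other keys have their own budget
fake_now[0] = 61.0
print(limiter_demo.allow("sk-test-key-1"))   # True: a new window has started
```

A fixed window is simple but lets a client burst at the boundary between two windows; sliding-window strategies trade a little extra bookkeeping for smoother limits.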

**Flow:**
1. Client sends POST to `/v1/chat/completions` with `Authorization: Bearer sk-abc123`
2. Gateway validates key, checks rate limit in Redis
3. If allowed, forwards request to vLLM's `/v1/chat/completions`
4. Streams or returns JSON response, increments Prometheus counters
5. Logs request ID, hashed key, and latency for traceability
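
The logging step in the flow above could look roughly like the following sketch. It assumes a JSON log line is acceptable; `hash_key` and `build_log_record` are hypothetical helper names, not part of the gateway code in this notebook.

```python
import hashlib
import json
import time
import uuid


def hash_key(api_key: str) -> str:
    """Return a short, non-reversible fingerprint that is safe to log."""
    return hashlib.sha256(api_key.encode()).hexdigest()[:8]


def build_log_record(api_key: str, latency_s: float, status: int) -> str:
    """Serialize one request's trace fields as a JSON log line."""
    return json.dumps({
        "request_id": str(uuid.uuid4()),
        "key": hash_key(api_key),
        "latency_ms": round(latency_s * 1000, 1),
        "status": status,
    })


started = time.perf_counter()
# ... the upstream call to vLLM would happen here ...
record = build_log_record("sk-abc123", time.perf_counter() - started, 200)
print(record)
```

Logging only a hash prefix keeps raw keys out of log files while still letting you group requests by caller.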

---

## Setup & Installation

### 1. Install Dependencies

In [None]:
!pip install vllm fastapi "uvicorn[standard]" httpx redis slowapi prometheus-client pydantic

### 2. Start Redis (for rate limiting)

In a notebook or local environment, start Redis in the background:

In [None]:
import subprocess
import time

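# Discard output: a subprocess.PIPE that is never read can fill up and block the child process
redis_proc = subprocess.Popen(["redis-server", "--port", "6379"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)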
time.sleep(2)  # wait for Redis to start
print("Redis started on port 6379")

### 3. Seed API Keys

In [None]:
import redis
r = redis.Redis(host="localhost", port=6379, decode_responses=True)
r.sadd("valid_api_keys", "sk-test-key-1")
r.sadd("valid_api_keys", "sk-test-key-2")
print("Seeded keys: sk-test-key-1, sk-test-key-2")

### 4. Start vLLM Server

Ensure `HUGGING_FACE_HUB_TOKEN` is set in your environment, then start vLLM:

In [None]:
import os
os.environ["HUGGING_FACE_HUB_TOKEN"] = "hf_..."  # replace with your token

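# Log to a file (name is arbitrary): an unread subprocess.PIPE fills up and stalls the server
vllm_log = open("vllm.log", "w")
vllm_proc = subprocess.Popen([
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "meta-llama/Meta-Llama-3-8B-Instruct",
    "--port", "8000",
    "--max-model-len", "4096"
], stdout=vllm_log, stderr=subprocess.STDOUT)

# Model loading can take minutes, so poll for readiness instead of sleeping a fixed 30s
import urllib.request
for _ in range(120):
    try:
        urllib.request.urlopen("http://localhost:8000/v1/models", timeout=2)
        break
    except OSError:
        time.sleep(5)
else:
    raise RuntimeError("vLLM did not become ready; see vllm.log")
print("vLLM server started on port 8000")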

---

## Step-by-Step Implementation

### Step 1: Define Pydantic Schemas

Create `app/schemas.py` to validate incoming requests and allow unknown fields for full OpenAI compatibility:

In [None]:
from pydantic import BaseModel, Field, ConfigDict
from typing import List, Optional, Literal
from enum import Enum

class Role(str, Enum):
    system = "system"
    user = "user"
    assistant = "assistant"
    tool = "tool"

class Message(BaseModel):
    role: Role
    content: str = Field(..., max_length=32000)

class ChatCompletionRequest(BaseModel):
    model_config = ConfigDict(extra="allow")  # pass unknown fields to vLLM
    model: str
    messages: List[Message]
    temperature: Optional[float] = 0.7
    max_tokens: Optional[int] = 512
    stream: Optional[bool] = False

### Step 2: Build the FastAPI Gateway

Create `app/main.py`:

In [None]:
import hashlib
import uuid
from typing import Optional
import httpx
import redis
from fastapi import FastAPI, Request, Header, HTTPException, Depends
from fastapi.responses import StreamingResponse, JSONResponse
from slowapi import Limiter
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
from app.schemas import ChatCompletionRequest

app = FastAPI()

# Redis client
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

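# Rate limiter keyed on the caller's API key (hashed), falling back to the client IP
def rate_limit_key(request: Request) -> str:
    auth = request.headers.get("authorization", "")
    key = auth[7:] if auth.startswith("Bearer ") else request.headers.get("x-api-key")
    return hashlib.sha256(key.encode()).hexdigest() if key else get_remote_address(request)

limiter = Limiter(key_func=rate_limit_key, storage_uri="redis://localhost:6379")
app.state.limiter = limiter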

@app.exception_handler(RateLimitExceeded)
async def rate_limit_handler(request: Request, exc: RateLimitExceeded):
    return JSONResponse(
        status_code=429,
        content={"error": {"message": "Rate limit exceeded", "type": "rate_limit_error"}},
        headers={"Retry-After": "60"}
    )

# Prometheus metrics
REQUEST_COUNT = Counter("gateway_requests_total", "Total requests", ["endpoint", "status"])
REQUEST_LATENCY = Histogram("gateway_request_duration_seconds", "Request latency", ["endpoint"])
TOKENS_GENERATED = Counter("gateway_tokens_generated_total", "Total tokens generated")

# Auth dependency
async def verify_api_key(authorization: Optional[str] = Header(None), x_api_key: Optional[str] = Header(None)):
    key = None
    if authorization and authorization.startswith("Bearer "):
        key = authorization.split(" ", 1)[1]
    elif x_api_key:
        key = x_api_key
    if not key or not r.sismember("valid_api_keys", key):
        raise HTTPException(status_code=401, detail="Invalid API key")
    return key

@app.post("/v1/chat/completions")
@limiter.limit("10/minute")
async def chat_completions(
    request: Request,
    body: ChatCompletionRequest,
    api_key: str = Depends(verify_api_key)
):
    request_id = str(uuid.uuid4())
    key_hash = hashlib.sha256(api_key.encode()).hexdigest()[:8]
    
    # mode="json" serializes the Role enum as plain strings; exclude_none drops unset fields
    payload = body.model_dump(mode="json", exclude_none=True)
    url = "http://localhost:8000/v1/chat/completions"

    # For streaming requests this timer covers time to response headers only
    with REQUEST_LATENCY.labels(endpoint="/v1/chat/completions").time():
        try:
            if body.stream:
                # client.post() would buffer the whole body, so send with stream=True.
                # The client must outlive this handler, so the generator closes it.
                client = httpx.AsyncClient(timeout=60.0)
                try:
                    upstream = await client.send(
                        client.build_request("POST", url, json=payload), stream=True
                    )
                except httpx.RequestError:
                    await client.aclose()
                    raise
                if upstream.status_code >= 400:
                    await upstream.aread()
                    await client.aclose()
                    upstream.raise_for_status()

                async def stream_response():
                    seen_done = False
                    try:
                        async for line in upstream.aiter_lines():
                            if line.startswith("data: "):
                                if line[6:].strip() == "[DONE]":
                                    seen_done = True
                                yield line + "\n\n"
                        if not seen_done:
                            yield "data: [DONE]\n\n"
                    finally:
                        await upstream.aclose()
                        await client.aclose()

                REQUEST_COUNT.labels(endpoint="/v1/chat/completions", status=200).inc()
                return StreamingResponse(
                    stream_response(),
                    media_type="text/event-stream",
                    headers={
                        "Cache-Control": "no-cache",
                        "Connection": "keep-alive",
                        "X-Accel-Buffering": "no"
                    }
                )

            async with httpx.AsyncClient(timeout=60.0) as client:
                resp = await client.post(url, json=payload)
            resp.raise_for_status()
            data = resp.json()
            tokens = data.get("usage", {}).get("completion_tokens", 0)
            TOKENS_GENERATED.inc(tokens)
            REQUEST_COUNT.labels(endpoint="/v1/chat/completions", status=200).inc()
            print(f"[{request_id}] key={key_hash} tokens={tokens}")
            return data
        except httpx.HTTPStatusError as e:
            REQUEST_COUNT.labels(endpoint="/v1/chat/completions", status=e.response.status_code).inc()
            raise HTTPException(status_code=e.response.status_code, detail=e.response.text)

@app.get("/v1/models")
async def list_models(api_key: str = Depends(verify_api_key)):
    async with httpx.AsyncClient() as client:
        resp = await client.get("http://localhost:8000/v1/models")
        resp.raise_for_status()
        return resp.json()

@app.get("/healthz")
async def health():
    try:
        async with httpx.AsyncClient(timeout=5.0) as client:
            resp = await client.get("http://localhost:8000/v1/models")
            resp.raise_for_status()
        r.ping()
        return {"status": "ok"}
    except Exception as e:
        raise HTTPException(status_code=503, detail=str(e))

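from fastapi import Response  # needed for the raw-bytes response below

@app.get("/metrics")
async def metrics():
    # FastAPI does not unpack Flask-style (body, status, headers) tuples
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)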

### Step 3: Start the Gateway

In [None]:
import uvicorn
from threading import Thread

def run_gateway():
    uvicorn.run("app.main:app", host="0.0.0.0", port=8001, workers=1)

gateway_thread = Thread(target=run_gateway, daemon=True)
gateway_thread.start()
time.sleep(3)
print("Gateway running on port 8001")

---

## Validation & Testing

### Test Non-Streaming Request

In [None]:
import httpx

async def test_chat():
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            "http://localhost:8001/v1/chat/completions",
            json={
                "model": "meta-llama/Meta-Llama-3-8B-Instruct",
                "messages": [{"role": "user", "content": "Say hello"}],
                "max_tokens": 50
            },
            headers={"Authorization": "Bearer sk-test-key-1"}
        )
        print(resp.status_code, resp.json())

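import asyncio

# Jupyter already runs an event loop, so asyncio.run() raises RuntimeError here.
# Await directly in a notebook; use asyncio.run(test_chat()) in a plain script.
await test_chat()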

### Test Streaming Request

In [None]:
async def test_stream():
    async with httpx.AsyncClient() as client:
        async with client.stream(
            "POST",
            "http://localhost:8001/v1/chat/completions",
            json={
                "model": "meta-llama/Meta-Llama-3-8B-Instruct",
                "messages": [{"role": "user", "content": "Count to 5"}],
                "stream": True
            },
            headers={"Authorization": "Bearer sk-test-key-1"}
        ) as resp:
            async for line in resp.aiter_lines():
                print(line)

# Await directly in a notebook; use asyncio.run(test_stream()) in a plain script
await test_stream()
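
The streaming test prints raw server-sent event lines. As a rough sketch of how a client might turn them back into text, the helper below concatenates the `delta.content` fields. `assemble_stream` is a hypothetical name, and the sample lines are hand-written in the OpenAI chunk shape rather than captured from a real run.

```python
import json
from typing import Iterable


def assemble_stream(lines: Iterable[str]) -> str:
    """Concatenate the `delta.content` pieces from OpenAI-style SSE lines."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip the blank separator lines between events
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            # Role-only and final chunks carry no content, hence the fallback.
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts)


# Hand-written sample lines (not real model output).
sample = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "1, 2"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": ", 3"}}]}',
    "",
    "data: [DONE]",
]
print(assemble_stream(sample))  # 1, 2, 3
```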

### Check Prometheus Metrics

In [None]:
async def check_metrics():
    async with httpx.AsyncClient() as client:
        resp = await client.get("http://localhost:8001/metrics")
        print(resp.text)

# Await directly in a notebook; use asyncio.run(check_metrics()) in a plain script
await check_metrics()
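
If you want to assert on specific counters instead of reading the raw dump, a small parser for the Prometheus text format is enough. The sketch below assumes sample lines of the form `name{labels} value` with no trailing timestamp; `parse_metrics` is a hypothetical helper and the sample values are illustrative, not measured.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Map each sample line ('name{labels} value') to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


# Hand-written sample in the Prometheus text format (values are illustrative).
sample_metrics = """\
# HELP gateway_requests_total Total requests
# TYPE gateway_requests_total counter
gateway_requests_total{endpoint="/v1/chat/completions",status="200"} 3.0
gateway_requests_total{endpoint="/v1/chat/completions",status="429"} 1.0
# HELP gateway_tokens_generated_total Total tokens generated
# TYPE gateway_tokens_generated_total counter
gateway_tokens_generated_total 150.0
"""

parsed = parse_metrics(sample_metrics)
print(parsed["gateway_tokens_generated_total"])  # 150.0
```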

---

## Benchmarking Throughput

Create `bench.py` to measure time-to-first-token (TTFT) and tokens/sec:

In [None]:
import httpx
import time
import asyncio

async def bench():
    async with httpx.AsyncClient(timeout=60.0) as client:
        start = time.time()
        first_token_time = None
        token_count = 0
        
        async with client.stream(
            "POST",
            "http://localhost:8001/v1/chat/completions",
            json={
                "model": "meta-llama/Meta-Llama-3-8B-Instruct",
                "messages": [{"role": "user", "content": "Write a 200-word essay on AI"}],
                "stream": True,
                "max_tokens": 300
            },
            headers={"Authorization": "Bearer sk-test-key-1"}
        ) as resp:
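            async for line in resp.aiter_lines():
                if line.startswith("data: ") and line[6:].strip() != "[DONE]":
                    # Start the TTFT clock on the first data chunk, not on a blank line
                    if first_token_time is None:
                        first_token_time = time.time() - start
                    token_count += 1  # approximation: one SSE chunk is roughly one token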
        
        total_time = time.time() - start
        print(f"TTFT: {first_token_time:.2f}s | Total: {total_time:.2f}s | Tokens/sec: {token_count/total_time:.1f}")

asyncio.run(bench())

**Example output:**

TTFT: 0.32s | Total: 12.45s | Tokens/sec: 24.1
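
The benchmark divides the chunk count by total wall time, so TTFT is folded into the throughput figure. One way to separate the two, sketched here under the assumption that you record an arrival timestamp per chunk, is to compute the rate over the decode phase only. `summarize` is a hypothetical helper and the timestamps are made up for illustration.

```python
from typing import List, Optional, Tuple


def summarize(start: float, chunk_times: List[float]) -> Tuple[Optional[float], Optional[float]]:
    """Return (TTFT in seconds, decode chunks/sec) from chunk arrival timestamps."""
    if not chunk_times:
        return None, None
    ttft = chunk_times[0] - start
    decode_time = chunk_times[-1] - chunk_times[0]
    # Rate over the decode phase only: chunks after the first / time after the first.
    rate = (len(chunk_times) - 1) / decode_time if decode_time > 0 else None
    return ttft, rate


# Illustrative timestamps: first chunk at 0.5s, then one chunk every 0.25s.
arrivals = [0.5 + 0.25 * i for i in range(9)]
ttft, rate = summarize(0.0, arrivals)
print(f"TTFT: {ttft:.2f}s | decode rate: {rate:.1f} chunks/sec")
# TTFT: 0.50s | decode rate: 4.0 chunks/sec
```

Running the request several times and reporting the median of each figure gives a more stable picture than a single run.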

---

## Next Steps

- **Multi-GPU scaling:** Use `--tensor-parallel-size 2` in vLLM for larger models
- **Quantization:** Add `--quantization awq` to reduce memory and increase throughput
- **Key management API:** Build a `/admin/keys` endpoint to create/revoke keys programmatically
- **CORS for web clients:** Add `fastapi.middleware.cors.CORSMiddleware` with an allowlist
- **TLS termination:** Deploy behind NGINX or Traefik with Let's Encrypt certificates
- **Multiprocess metrics:** If using `workers > 1`, configure `prometheus_client` multiprocess mode
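
For the key management item above, the core of a create-key operation could be as small as the sketch below. `generate_api_key` is a hypothetical helper; it only produces the key, and the commented lines show how it might be stored in the same Redis set the gateway already checks.

```python
import secrets


def generate_api_key(prefix: str = "sk-") -> str:
    """Return a random, URL-safe API key built from 32 random bytes."""
    return prefix + secrets.token_urlsafe(32)


new_key = generate_api_key()
print(new_key[:3], len(new_key))

# Assumed usage with the Redis client from the setup section:
# r.sadd("valid_api_keys", new_key)   # create
# r.srem("valid_api_keys", new_key)   # revoke
```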

You now have a production-ready gateway that lets you replace OpenAI with a self-hosted model, control costs via rate limits, and monitor performance with Prometheus—all while maintaining drop-in compatibility with existing clients.