# 📓 The GenAI Revolution Cookbook

**Title:** How to Build LLM Serving for Llama 3 with vLLM and FastAPI

**Description:** Deploy production-grade LLM serving fast: self-host Llama 3 on vLLM with KV cache batching, FastAPI auth, rate limits, streaming, benchmarks.

---

*This Jupyter notebook contains executable code examples. Run the cells below to try out the code yourself!*



## What You'll Build

A production-grade FastAPI gateway that sits in front of vLLM, adding authentication, rate limiting, streaming, and Prometheus metrics. By the end, you'll have a working endpoint on port 8080 proxying to vLLM on 8001, validated with curl and exposing metrics on `/metrics`. You'll measure time-to-first-token (TTFT) and tokens/sec under load, giving you a baseline for optimization.

This setup takes hours, not weeks, and unlocks the control you need to scale LLM inference in production.

## Prerequisites

- **Hardware**: NVIDIA GPU with 24+ GB VRAM (e.g., A100, L4, RTX 4090) for Llama 3 8B; 70B requires multi-GPU with NVLink
- **Software**: CUDA 12.1+, Docker, Python 3.10+, Redis
- **Model**: Meta Llama 3 8B Instruct (or 70B with tensor parallelism)
- **Knowledge**: Basic Python, async/await, REST APIs, and familiarity with Docker

## Why This Approach Works

vLLM handles batching, paged attention, and GPU optimization. Your gateway adds the production layer: auth, rate limits, observability, and streaming. This separation keeps vLLM focused on inference speed while your gateway enforces policy and collects metrics. The result is a scalable, observable LLM service that integrates cleanly with existing infrastructure.

## How It Works (High-Level Overview)

In [None]:
Client → FastAPI Gateway (auth, rate limit, validation, metrics) → vLLM (OpenAI-compatible API) → GPU
                ↓
         Prometheus scrapes /metrics
         Redis tracks rate limits

The gateway validates API keys, enforces per-key rate limits via Redis, proxies requests to vLLM's OpenAI-compatible endpoint, streams responses as Server-Sent Events (SSE), and exposes Prometheus metrics for TTFT, tokens/sec, and GPU utilization.

## Step 1: Install vLLM and Launch the Server

Install vLLM with CUDA support:

In [None]:
pip install vllm==0.4.2

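Launch vLLM's OpenAI-compatible server with Llama 3 8B Instruct on a single GPU, capping context at 8192 tokens and letting vLLM use up to 90% of GPU memory for weights and KV cache. The module entrypoint below works with the pinned 0.4.2 release:

In [None]:
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \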
  --host 0.0.0.0 \
  --port 8001 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9

**Hardware tuning presets**:
- **A100 40GB**: `--gpu-memory-utilization 0.9 --max-model-len 8192` (supports ~16 concurrent requests)
- **L4 24GB**: `--gpu-memory-utilization 0.85 --max-model-len 4096` (tighter memory, shorter context)
- **RTX 4090 24GB**: `--gpu-memory-utilization 0.85 --max-model-len 4096` (similar to L4)
- **70B multi-GPU**: `--tensor-parallel-size 4 --gpu-memory-utilization 0.9 --max-model-len 8192` (requires 4x A100 80GB with NVLink)

Verify the server is running:

In [None]:
curl http://localhost:8001/v1/models

You should see `meta-llama/Meta-Llama-3-8B-Instruct` in the response.

For a deeper dive into how to evaluate and select the right LLM for your application—including considerations like context length, hardware efficiency, and cost—see our guide on [how to pick an LLM](/article/how-to-choose-an-ai-model-for-your-app-speed-cost-reliability).

## Step 2: Set Up Redis for Rate Limiting

Start Redis via Docker:

In [None]:
docker run -d -p 6379:6379 redis:7-alpine

Verify Redis is reachable:

In [None]:
redis-cli ping

You should see `PONG`.
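
For intuition about what a per-key limit of 60 requests per minute means in practice, the sketch below models a fixed-window counter in plain Python. It is an in-memory illustration only: `FixedWindowLimiter` and its parameters are hypothetical names, not part of any library used in this guide, and a Redis-backed limiter would typically keep a comparable counter under a key that expires with the window.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """Allow at most `limit` hits per `window_s` seconds for each key (in-memory sketch)."""

    def __init__(self, limit: int, window_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock  # injectable so behaviour can be checked without sleeping
        self._state: Dict[str, Tuple[int, int]] = {}  # key -> (window index, hit count)

    def allow(self, key: str) -> bool:
        window = int(self.clock() // self.window_s)
        seen_window, count = self._state.get(key, (window, 0))
        if seen_window != window:
            count = 0  # a new window started, so the counter resets
        if count >= self.limit:
            self._state[key] = (window, count)
            return False
        self._state[key] = (window, count + 1)
        return True


# A fake clock keeps the demo deterministic.
now = [0.0]
limiter = FixedWindowLimiter(limit=60, window_s=60, clock=lambda: now[0])
results = [limiter.allow("dev-key-1") for _ in range(61)]
print(results.count(True), results.count(False))  # 60 1

now[0] = 60.0  # the next window begins
print(limiter.allow("dev-key-1"))  # True
```

A fixed window is simple but lets a client burst up to twice the limit across a window boundary; sliding-window variants trade extra bookkeeping for smoother enforcement.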

## Step 3: Build the FastAPI Gateway

Install dependencies:

In [None]:
pip install fastapi==0.110.0 uvicorn==0.27.0 httpx==0.27.0 \
  slowapi==0.1.9 redis==5.0.1 prometheus-client==0.20.0 pynvml==11.5.0

Create `app/main.py`. This file defines the FastAPI gateway with auth, rate limiting (via slowapi, with counters stored in Redis), streaming, and metrics:

In [None]:
import os
import time
import json
import logging
from typing import AsyncGenerator, Dict, Any, Optional

import httpx
from fastapi import FastAPI, HTTPException, Depends, Header, Request
from fastapi.responses import StreamingResponse, PlainTextResponse
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from starlette.middleware.cors import CORSMiddleware

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm-gateway")

VLLM_URL = os.getenv("VLLM_URL", "http://localhost:8001")
ALLOWED_KEYS = set(os.getenv("API_KEYS", "dev-key-1,dev-key-2").split(","))
REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379/0")

app = FastAPI(title="LLM Gateway", version="1.0.0")
# Permissive CORS is convenient for local testing; restrict origins before going live.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"]
)

# Rate-limit per API key; slowapi keeps the counters in Redis.
limiter = Limiter(
    key_func=lambda request: request.headers.get("x-api-key", "unknown"),
    storage_uri=REDIS_URL
)
app.state.limiter = limiter
# Convert RateLimitExceeded into an HTTP 429 response.
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

REQS = Counter("llm_requests_total", "Total LLM requests", ["route"])
TTFT = Histogram("llm_ttft_seconds", "Time to first token", buckets=(0.05, 0.1, 0.2, 0.5, 1, 2, 5))
TOKENS = Counter("llm_output_tokens_total", "Output tokens total")
TPS = Histogram("llm_tps", "Tokens per second", buckets=(5, 10, 20, 40, 80, 160, 320))
GPU_UTIL = Gauge("gpu_utilization_percent", "GPU utilization percent")
GPU_MEM = Gauge("gpu_memory_used_mb", "GPU memory used MB")

async def require_key(x_api_key: Optional[str] = Header(default=None)):
    if x_api_key is None or x_api_key not in ALLOWED_KEYS:
        # Do not write the submitted key to the logs.
        logger.warning("Unauthorized access attempt: missing or invalid API key")
        raise HTTPException(status_code=401, detail="Invalid API key")

async def stream_vllm(payload: Dict[str, Any]) -> AsyncGenerator[bytes, None]:
    url = f"{VLLM_URL}/v1/chat/completions"
    payload = {**payload, "stream": True}  # the gateway always streams
    t0 = time.perf_counter()
    first = True
    token_count = 0
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", url, json=payload) as r:
            if r.status_code != 200:
                # Response headers are already sent, so report upstream errors as an SSE event.
                detail = (await r.aread()).decode("utf-8", "replace")
                logger.error("vLLM returned HTTP %d: %s", r.status_code, detail)
                event = json.dumps({"error": detail})
                yield f"data: {event}\n\n".encode("utf-8")
                return
            async for line in r.aiter_lines():
                if not line.startswith("data:"):
                    continue  # skip the blank lines that separate events
                data = line[5:].strip()
                if data == "[DONE]":
                    # Forward the end-of-stream sentinel that OpenAI-style clients wait for.
                    yield b"data: [DONE]\n\n"
                    break
                if first:
                    ttft = time.perf_counter() - t0
                    TTFT.observe(ttft)
                    logger.info("TTFT observed: %.3fs", ttft)
                    first = False
                try:
                    obj = json.loads(data)
                    delta = obj["choices"][0].get("delta", {}).get("content") or ""
                    # Approximation: count each non-empty chunk as one token.
                    if delta:
                        token_count += 1
                except Exception as e:
                    logger.debug("Failed to parse SSE data: %s", e)
                # SSE events must end with a blank line.
                yield f"{line}\n\n".encode("utf-8")
    elapsed = max(time.perf_counter() - t0, 1e-6)
    TOKENS.inc(token_count)
    TPS.observe(token_count / elapsed)
    logger.info("Streamed %d tokens in %.2fs (%.2f TPS)", token_count, elapsed, token_count / elapsed)

@app.post("/v1/chat/completions")
@limiter.limit("60/minute")
async def chat_completions(request: Request, key=Depends(require_key)):
    # slowapi requires this parameter to be named `request`.
    body = await request.json()
    REQS.labels(route="/v1/chat/completions").inc()
    return StreamingResponse(stream_vllm(body), media_type="text/event-stream")

@app.get("/healthz")
async def health():
    return {"status": "ok"}

@app.get("/metrics")
async def metrics():
    try:
        import pynvml
        h = pynvml.nvmlDeviceGetHandleByIndex(0)
        util = pynvml.nvmlDeviceGetUtilizationRates(h)
        mem = pynvml.nvmlDeviceGetMemoryInfo(h)
        GPU_UTIL.set(util.gpu)
        GPU_MEM.set(mem.used / (1024 * 1024))
    except Exception as e:
        logger.debug("GPU metrics update failed: %s", e)
    return PlainTextResponse(generate_latest(), media_type=CONTENT_TYPE_LATEST)

@app.on_event("startup")
async def startup():
    try:
        import pynvml
        pynvml.nvmlInit()
        logger.info("NVML initialized for GPU metrics.")
    except Exception as e:
        logger.warning("NVML initialization failed: %s", e)

## Step 4: Run the Gateway

Start the gateway with Uvicorn:

In [None]:
uvicorn app.main:app --host 0.0.0.0 --port 8080 --workers 1

**Note**: For production with multiple workers, enable Prometheus multiprocess mode by setting `PROMETHEUS_MULTIPROC_DIR` and using `MultiProcessCollector`. Single-worker mode is sufficient for initial testing.

## Step 5: Test the Gateway

Send a streaming request with a valid API key:

In [None]:
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "x-api-key: dev-key-1" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Explain vector databases in one paragraph."}],
    "max_tokens": 256,
    "temperature": 0.2,
    "stream": true
  }'

You should see a stream of `data:` lines with incremental tokens.
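
Each `data:` line carries one JSON chunk in the OpenAI streaming format, with the newly generated text under `choices[0].delta.content`. As a minimal sketch, the hypothetical helper `extract_text` below reassembles a reply from captured lines; the sample lines are made up for illustration and trimmed to the fields the helper reads.

```python
import json
from typing import Iterable


def extract_text(sse_lines: Iterable[str]) -> str:
    """Concatenate the content deltas from a sequence of SSE lines."""
    parts = []
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # skip blank separators and anything that is not a data field
        data = line[5:].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        parts.append(delta.get("content") or "")  # role-only chunks carry no content
    return "".join(parts)


sample = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Vector"}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": " databases"}}]}',
    "data: [DONE]",
]
print(extract_text(sample))  # Vector databases
```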

Test authentication by omitting the API key:

In [None]:
curl -i -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 50
  }'

The `-i` flag prints the response headers, so you should see a `401 Unauthorized` status line followed by the body `{"detail":"Invalid API key"}`.

Test rate limiting by sending 61 requests in quick succession:

In [None]:
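for i in {1..61}; do
  curl -s -o /dev/null -w "%{http_code}\n" -X POST http://localhost:8080/v1/chat/completions \
    -H "x-api-key: dev-key-1" \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "Hi"}], "max_tokens": 10}' &
done
wait

The loop prints one HTTP status code per request. Once 60 requests have been counted for this key within the minute, further requests return `429` (rate limit exceeded). Requests you sent earlier in the same minute with `dev-key-1` count toward the limit, so you may see more than one `429`.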

Check Prometheus metrics:

In [None]:
curl http://localhost:8080/metrics

Look for `llm_requests_total`, `llm_ttft_seconds`, `llm_output_tokens_total`, `llm_tps`, `gpu_utilization_percent`, and `gpu_memory_used_mb`. Verify that `llm_requests_total` increments with each request and `llm_ttft_seconds` records TTFT observations.
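
To check a counter programmatically rather than by eye, the text returned by `/metrics` can be scanned line by line. The sketch below uses a hypothetical helper, `sample_values`, and a hand-written sample of the exposition text; it handles only simple `name{labels} value` lines without timestamps, not the full format.

```python
from typing import Dict


def sample_values(metrics_text: str, name: str) -> Dict[str, float]:
    """Map each sample line for `name` (labels included) to its value."""
    values = {}
    for line in metrics_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        series, _, value = line.rpartition(" ")
        metric = series.split("{", 1)[0]  # strip the label set, if any
        if metric == name:
            values[series] = float(value)
    return values


text = """# HELP llm_requests_total Total LLM requests
# TYPE llm_requests_total counter
llm_requests_total{route="/v1/chat/completions"} 3.0
llm_output_tokens_total 412.0
"""
print(sample_values(text, "llm_requests_total"))
```

Fetching `/metrics` before and after a test request and comparing the two values is a quick way to confirm that the counter increments.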

If you're working with especially long prompts, be aware of position bias and how models may miss critical details—our article on [placing critical info in long prompts](/article/lost-in-the-middle-placing-critical-info-in-long-prompts) offers practical advice to mitigate these issues.

## Step 6: Benchmark Throughput and Latency

Create `tools/quick_bench.py` to measure concurrent throughput and TTFT:

In [None]:
import asyncio
import json
import time
import httpx
import os
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("quick_bench")

GATEWAY = os.getenv("GATEWAY", "http://localhost:8080")
HEADERS = {
    "x-api-key": os.getenv("API_KEY", "dev-key-1"),
    "Content-Type": "application/json"
}
PAYLOAD = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "In one paragraph, explain vector databases."}],
    "max_tokens": 256,
    "temperature": 0.2,
    "stream": True
}

async def run_once():
    t0 = time.perf_counter()
    first = True
    tokens = 0
    async with httpx.AsyncClient(timeout=None) as c:
        async with c.stream("POST", f"{GATEWAY}/v1/chat/completions", json=PAYLOAD, headers=HEADERS) as r:
            if r.status_code != 200:
                # A 401 or 429 would otherwise be reported as a run with zero tokens.
                logger.error("Request failed with HTTP %d", r.status_code)
                return
            async for line in r.aiter_lines():
                if line and line.startswith("data:"):
                    d = line[5:].strip()
                    if d == "[DONE]":
                        break
                    if first:
                        ttft = time.perf_counter() - t0
                        logger.info("TTFT: %.3fs", ttft)
                        first = False
                    try:
                        content = json.loads(d)["choices"][0].get("delta", {}).get("content")
                    except (ValueError, KeyError, IndexError):
                        continue  # not a completion chunk
                    # Approximation: count each non-empty chunk as one token.
                    if content:
                        tokens += 1
    elapsed = time.perf_counter() - t0
    logger.info("Tokens: %d, TPS: %.2f", tokens, tokens / elapsed)

async def main(iters: int = 8):
    tasks = [run_once() for _ in range(iters)]
    await asyncio.gather(*tasks)

if __name__ == "__main__":
    asyncio.run(main())

Run the benchmark:

In [None]:
python tools/quick_bench.py

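You should see TTFT and tokens/sec logged for each request. As a rough guide, on an A100 40GB with Llama 3 8B, expect TTFT under 0.2s and more than 100 tokens/sec per request under light load. Actual numbers depend on prompt length, concurrency, and software versions, so treat your own measurements as the baseline.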
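
With several concurrent requests, per-request log lines are easier to compare across runs once they are reduced to a few summary numbers. The sketch below is one way to do that; `summarize` is a hypothetical helper, and the TTFT values are made-up placeholders rather than measurements.

```python
import statistics
from typing import Dict, List


def summarize(samples: List[float]) -> Dict[str, float]:
    """Return the median, a nearest-rank p95, and the maximum of the samples."""
    ordered = sorted(samples)
    # Nearest-rank p95: ceil(0.95 * n), computed with integer arithmetic.
    rank = max(1, -(-95 * len(ordered) // 100))
    return {
        "p50": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }


ttft_seconds = [0.11, 0.12, 0.12, 0.13, 0.15, 0.16, 0.18, 0.42]  # placeholder values
print(summarize(ttft_seconds))
```

With only a handful of samples the p95 coincides with the slowest request, so larger runs are needed before tail latency means much.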

Remember, LLM context is not infinite—if you notice increased hallucinations or inconsistent responses as context grows, you may be running into [context rot](/article/context-rot-why-llms-forget-as-their-memory-grows-3), where models "forget" earlier information as their memory window expands.

## Step 7: Tune vLLM for Your Workload

Adjust `--max-model-len` based on your prompt and output length requirements. Increase cautiously: KV cache grows roughly linearly with sequence length and batch size. Start with `--gpu-memory-utilization 0.9` and adjust based on OOM events and throughput measurements.
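
The linear growth can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes Llama 3 8B's published architecture (32 layers, 8 key-value heads, head dimension 128), a 16-bit KV cache, and roughly 16 GB of FP16 weights. It ignores activation memory and allocator overhead, so read the result as an upper bound on what fits, not a prediction of what vLLM will report. Both function names are hypothetical.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer and KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_full_length_sequences(gpu_gb: float, utilization: float, weights_gb: float,
                              max_model_len: int, per_token_bytes: int) -> int:
    """How many sequences of max_model_len tokens fit in the memory left after weights."""
    budget_bytes = (gpu_gb * utilization - weights_gb) * 1e9
    return int(budget_bytes // (max_model_len * per_token_bytes))


per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token

# Assumed figures: 40 GB card, 0.9 utilization, about 16 GB of weights.
print(max_full_length_sequences(40, 0.9, 16, 8192, per_token))  # 18
print(max_full_length_sequences(40, 0.9, 16, 4096, per_token))  # 37
```

Halving `--max-model-len` roughly doubles the number of full-length sequences that fit. In practice most requests are shorter than the maximum, and paged attention allocates cache in blocks as sequences grow, so real concurrency is usually higher than this worst case.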

For multi-GPU setups with 70B models, add `--tensor-parallel-size N` where N is the number of GPUs. Ensure NVLink is enabled for optimal inter-GPU bandwidth.

Re-run benchmarks after each change to measure the impact on TTFT and tokens/sec.

## Conclusion

You've built a production-grade LLM gateway with authentication, rate limiting, streaming, and observability. You can now measure TTFT, tokens/sec, and GPU utilization, giving you the data to optimize for your workload. Key trade-offs: single-worker mode simplifies metrics but limits concurrency; FP8 quantization can boost throughput but may reduce quality; longer context increases memory pressure.

**Next steps**:
- Enable Prometheus multiprocess mode for multi-worker deployments
- Add input validation with Pydantic models to enforce max_tokens and temperature ranges
- Integrate TLS and rotate API keys via a secret manager
- Deploy behind NGINX or Envoy with SSE-compatible config (disable buffering, set `Cache-Control: no-cache`)
- Explore FP8 quantization with `--quantization fp8` for higher throughput on supported GPUs
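
As a starting point for the input-validation step, the sketch below shows the kind of checks involved. It is written in plain Python so it runs without extra dependencies; a Pydantic model would express the same constraints declaratively. The function name `validate_chat_request`, the defaults, and the bounds (`MAX_TOKENS_LIMIT = 1024`, temperature between 0 and 2) are assumptions for illustration, not values taken from vLLM or from this gateway.

```python
from typing import Any, Dict, List

MAX_TOKENS_LIMIT = 1024         # assumed cap; pick one that fits your context budget
TEMPERATURE_RANGE = (0.0, 2.0)  # assumed allowed range


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_chat_request(body: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the body is acceptable."""
    errors = []
    messages = body.get("messages")
    if not isinstance(messages, list) or not messages:
        errors.append("messages must be a non-empty list")

    max_tokens = body.get("max_tokens", 256)  # assumed default
    if not (_is_number(max_tokens) and isinstance(max_tokens, int)
            and 1 <= max_tokens <= MAX_TOKENS_LIMIT):
        errors.append(f"max_tokens must be an integer between 1 and {MAX_TOKENS_LIMIT}")

    low, high = TEMPERATURE_RANGE
    temperature = body.get("temperature", 1.0)  # assumed default
    if not (_is_number(temperature) and low <= temperature <= high):
        errors.append(f"temperature must be a number between {low} and {high}")
    return errors


ok = {"messages": [{"role": "user", "content": "Hi"}], "max_tokens": 50}
bad = {"messages": [], "max_tokens": 100000, "temperature": 3}
print(validate_chat_request(ok))        # []
print(len(validate_chat_request(bad)))  # 3
```

In the gateway, a non-empty result would be returned as a 4xx response before the request is proxied, so malformed or oversized requests never reach the GPU.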