Benchmark and observe a production-like single-node LLM serving setup while keeping the comparison narrow:
- `FastAPI + transformers` as the naive baseline
- `FastAPI + vLLM` as the optimized serving path
- one GPU host
- one local load generator
- shared FastAPI surface for both backends
- SSE streaming endpoint for TTFT and inter-token latency measurement
- SQLite request and token-event persistence
- Prometheus metrics endpoint
- Grafana + Prometheus + host/GPU exporters
- Docker Compose profiles for `naive` and `vllm`
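The request and token-event persistence can be sketched roughly as below. This is an illustrative assumption, not the repo's actual schema: the table and column names (`requests`, `token_events`, `emitted_at`, and so on) are invented for the example.

```python
import sqlite3
import time

# Illustrative schema sketch; the repo's real table/column names may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS requests (
    id TEXT PRIMARY KEY,
    backend TEXT NOT NULL,          -- 'naive' or 'vllm'
    prompt TEXT NOT NULL,
    started_at REAL NOT NULL
);
CREATE TABLE IF NOT EXISTS token_events (
    request_id TEXT NOT NULL REFERENCES requests(id),
    token_index INTEGER NOT NULL,
    emitted_at REAL NOT NULL        -- wall-clock time the token was streamed
);
"""

def record_token(conn: sqlite3.Connection, request_id: str, index: int) -> None:
    """Persist one streamed token event with its emission timestamp."""
    conn.execute(
        "INSERT INTO token_events (request_id, token_index, emitted_at) VALUES (?, ?, ?)",
        (request_id, index, time.time()),
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO requests (id, backend, prompt, started_at) VALUES (?, ?, ?, ?)",
    ("req-1", "naive", "hello", time.time()),
)
for i in range(3):
    record_token(conn, "req-1", i)
```

Persisting one row per token is what makes TTFT and inter-token latency recoverable after the run, rather than only aggregate request latency.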
```
app/              FastAPI app and backend adapters
gpu_exporter/     Minimal Prometheus exporter backed by nvidia-smi
loadgen/          Local streaming benchmark driver
observability/    Prometheus and Grafana provisioning
infra/terraform/  AWS infrastructure for the single host
compose.yaml      Stack entrypoint
```
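The `gpu_exporter/` idea, polling `nvidia-smi` and rendering Prometheus exposition format, can be sketched as follows. This is not the repo's implementation; the metric names and the queried fields are assumptions for the example.

```python
import subprocess

# Fields queried from nvidia-smi; the repo's exporter may use a different set.
QUERY = "utilization.gpu,memory.used,power.draw"

def parse_nvidia_smi_csv(output: str) -> list[dict]:
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output,
    one CSV line per GPU."""
    gpus = []
    for line in output.strip().splitlines():
        util, mem, power = [field.strip() for field in line.split(",")]
        gpus.append({"util_pct": float(util), "mem_used_mib": float(mem), "power_w": float(power)})
    return gpus

def to_prometheus(gpus: list[dict]) -> str:
    """Render one Prometheus exposition-format line per metric per GPU."""
    lines = []
    for idx, g in enumerate(gpus):
        lines.append(f'gpu_utilization_percent{{gpu="{idx}"}} {g["util_pct"]}')
        lines.append(f'gpu_memory_used_mib{{gpu="{idx}"}} {g["mem_used_mib"]}')
        lines.append(f'gpu_power_draw_watts{{gpu="{idx}"}} {g["power_w"]}')
    return "\n".join(lines) + "\n"

def scrape() -> str:
    """One scrape: shell out to nvidia-smi and render the metrics page body."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        text=True,
    )
    return to_prometheus(parse_nvidia_smi_csv(out))
```

Serving `scrape()` from a tiny HTTP handler at `/metrics` is enough for Prometheus to pick the host's GPUs up as ordinary time series.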
- `api-naive`: shared API running the `transformers` baseline in-process
- `api-vllm`: shared API running the `vLLM` engine in-process
- `prometheus`
- `grafana`
- `node-exporter`
- `gpu-exporter`
Only one API backend should be active during a benchmark run.
- `GET /healthz`
- `GET /readyz`
- `POST /v1/runs`
- `POST /v1/generate`
- `GET /metrics`
`POST /v1/generate` defaults to SSE streaming and emits `meta`, `token`, `done`, and `error` events.
1. Copy `.env.example` to `.env` and adjust model/runtime settings.
2. Start observability plus one backend:

   ```
   docker compose --profile obs --profile naive up --build
   ```

   or:

   ```
   docker compose --profile obs --profile vllm up --build
   ```

3. From your laptop, run the load generator:

   ```
   python3 -m venv .venv
   source .venv/bin/activate
   pip install -r loadgen/requirements.txt
   python loadgen/stream_bench.py \
     --base-url http://<host>:8000 \
     --prompt "Explain how continuous batching works." \
     --concurrency 1,2,4,8 \
     --requests-per-level 4
   ```

- The repo is structured to keep API-layer instrumentation identical across both backends.
- TTFT and inter-token latency are captured server-side from the same streaming path that the client sees.
- The vLLM image and model settings may need tuning for the exact Qwen 14B variant and GPU memory budget.
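Given per-token timestamps like the ones the server persists, both metrics reduce to simple differences. A sketch with illustrative function names:

```python
import statistics

def ttft(request_start: float, token_times: list[float]) -> float:
    """Time to first token: first token timestamp minus request start."""
    return token_times[0] - request_start

def inter_token_latencies(token_times: list[float]) -> list[float]:
    """Gaps between consecutive token timestamps."""
    return [b - a for a, b in zip(token_times, token_times[1:])]

# Example: request started at t=0.0; tokens streamed at 0.5s, then every 0.1s.
times = [0.5, 0.6, 0.7, 0.8]
first_token_latency = ttft(0.0, times)
mean_gap = statistics.mean(inter_token_latencies(times))
```

Because both backends emit through the same streaming path, these numbers stay directly comparable between the naive and vLLM runs.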
The repo can store benchmark snapshots under results/ and render blog-ready figures from them.
For the first full naive sweep:
- data snapshot: `results/naive_full_sweep_2026-03-17.json`
- prometheus series: `results/naive_full_sweep_2026-03-17_prometheus_series.json`
- chart script: `analysis/plot_naive_sweep.py`
To render the charts locally:
```
python3 -m venv .venv-chart
source .venv-chart/bin/activate
pip install -r analysis/requirements.txt
python analysis/plot_naive_sweep.py
```

That writes PNG and SVG figures into `results/charts/`, including:

- `naive_sweep_overview`
- `naive_latency_vs_concurrency`
- `naive_ttft_vs_concurrency`
- `naive_intertoken_vs_concurrency`
- `naive_throughput_vs_concurrency`
- `naive_sweep_resource_summary`
- `naive_timeline_gpu_vs_inflight`
- `naive_timeline_gpu_memory`
- `naive_timeline_gpu_power_cpu`
- `naive_timeline_host_memory`
The repo includes a small Terraform slice under `infra/terraform` for provisioning a single AWS GPU host.
What it creates:
- uses the default VPC
- uses the first default subnet unless you override `subnet_id`
- creates a dedicated security group
- imports your SSH public key as an AWS key pair
- creates an EC2 IAM role and instance profile with `AmazonSSMManagedInstanceCore`
- launches one GPU EC2 instance with a `gp3` root volume
- bootstraps Docker, Docker Compose, and the NVIDIA container runtime with `user_data`
Usage:
```
cd infra/terraform
cp terraform.tfvars.example terraform.tfvars
terraform init
terraform plan
terraform apply
```

The main input you must set is `ami_id`. Use a GPU-capable AMI that already has NVIDIA drivers or is known to work with the bootstrap path.