Benchmark and observe a production-like single-node LLM serving setup while keeping the comparison narrow:
- `FastAPI + transformers` as the naive baseline
- `FastAPI + vLLM` as the optimized serving path
- one GPU host
- one local load generator
- shared FastAPI surface for both backends
- SSE streaming endpoint for TTFT and inter-token latency measurement
- SQLite request and token-event persistence
- Prometheus metrics endpoint
- Grafana + Prometheus + host/GPU exporters
- Docker Compose profiles for `naive` and `vllm`
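The request and token-event persistence can be sketched roughly as below. This is an illustrative assumption, not the repo's actual schema: the table and column names (`requests`, `token_events`, `emitted_at`, and so on) are invented for the example.

```python
import sqlite3
import time

# Illustrative schema sketch; the repo's real table/column names may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS requests (
    id TEXT PRIMARY KEY,
    backend TEXT NOT NULL,          -- 'naive' or 'vllm'
    prompt TEXT NOT NULL,
    started_at REAL NOT NULL
);
CREATE TABLE IF NOT EXISTS token_events (
    request_id TEXT NOT NULL REFERENCES requests(id),
    token_index INTEGER NOT NULL,
    emitted_at REAL NOT NULL        -- wall-clock time the token was streamed
);
"""

def record_token(conn: sqlite3.Connection, request_id: str, index: int) -> None:
    """Persist one streamed token event with its emission timestamp."""
    conn.execute(
        "INSERT INTO token_events (request_id, token_index, emitted_at) VALUES (?, ?, ?)",
        (request_id, index, time.time()),
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO requests (id, backend, prompt, started_at) VALUES (?, ?, ?, ?)",
    ("req-1", "naive", "hello", time.time()),
)
for i in range(3):
    record_token(conn, "req-1", i)
```

Persisting one row per token is what makes TTFT and inter-token latency recoverable after the run, rather than only aggregate request latency.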
```
app/              FastAPI app and backend adapters
gpu_exporter/     Minimal Prometheus exporter backed by nvidia-smi
loadgen/          Local streaming benchmark driver
observability/    Prometheus and Grafana provisioning
infra/terraform/  AWS infrastructure for the single host
compose.yaml      Stack entrypoint
```
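The `gpu_exporter/` idea, polling `nvidia-smi` and rendering Prometheus exposition format, can be sketched as follows. This is not the repo's implementation; the metric names and the queried fields are assumptions for the example.

```python
import subprocess

# Fields queried from nvidia-smi; the repo's exporter may use a different set.
QUERY = "utilization.gpu,memory.used,power.draw"

def parse_nvidia_smi_csv(output: str) -> list[dict]:
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output,
    one CSV line per GPU."""
    gpus = []
    for line in output.strip().splitlines():
        util, mem, power = [field.strip() for field in line.split(",")]
        gpus.append({"util_pct": float(util), "mem_used_mib": float(mem), "power_w": float(power)})
    return gpus

def to_prometheus(gpus: list[dict]) -> str:
    """Render one Prometheus exposition-format line per metric per GPU."""
    lines = []
    for idx, g in enumerate(gpus):
        lines.append(f'gpu_utilization_percent{{gpu="{idx}"}} {g["util_pct"]}')
        lines.append(f'gpu_memory_used_mib{{gpu="{idx}"}} {g["mem_used_mib"]}')
        lines.append(f'gpu_power_draw_watts{{gpu="{idx}"}} {g["power_w"]}')
    return "\n".join(lines) + "\n"

def scrape() -> str:
    """One scrape: shell out to nvidia-smi and render the metrics page body."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        text=True,
    )
    return to_prometheus(parse_nvidia_smi_csv(out))
```

Serving `scrape()` from a tiny HTTP handler at `/metrics` is enough for Prometheus to pick the host's GPUs up as ordinary time series.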
- `api-naive`: shared API running the `transformers` baseline in-process
- `api-vllm`: shared API running the `vLLM` engine in-process
- `prometheus`
- `grafana`
- `node-exporter`
- `gpu-exporter`
Only one API backend should be active during a benchmark run.
- `GET /healthz`
- `GET /readyz`
- `POST /v1/runs`
- `POST /v1/generate`
- `GET /metrics`
`POST /v1/generate` defaults to SSE streaming and emits `meta`, `token`, `done`, and `error` events.
1. Copy `.env.example` to `.env` and adjust model/runtime settings.
2. Start observability plus one backend:

   ```
   docker compose --profile obs --profile naive up --build
   ```

   or:

   ```
   docker compose --profile obs --profile vllm up --build
   ```

3. From your laptop, run the load generator:

   ```
   python3 -m venv .venv
   source .venv/bin/activate
   pip install -r loadgen/requirements.txt
   python loadgen/stream_bench.py \
     --base-url http://<host>:8000 \
     --prompt "Explain how continuous batching works." \
     --concurrency 1,2,4,8 \
     --requests-per-level 4
   ```

- The repo is structured to keep API-layer instrumentation identical across both backends.
- TTFT and inter-token latency are captured server-side from the same streaming path that the client sees.
- The vLLM image and model settings may need tuning for the exact Qwen 14B variant and GPU memory budget.
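Given per-token timestamps like the ones the server persists, both metrics reduce to simple differences. A sketch with illustrative function names:

```python
import statistics

def ttft(request_start: float, token_times: list[float]) -> float:
    """Time to first token: first token timestamp minus request start."""
    return token_times[0] - request_start

def inter_token_latencies(token_times: list[float]) -> list[float]:
    """Gaps between consecutive token timestamps."""
    return [b - a for a, b in zip(token_times, token_times[1:])]

# Example: request started at t=0.0; tokens streamed at 0.5s, then every 0.1s.
times = [0.5, 0.6, 0.7, 0.8]
first_token_latency = ttft(0.0, times)
mean_gap = statistics.mean(inter_token_latencies(times))
```

Because both backends emit through the same streaming path, these numbers stay directly comparable between the naive and vLLM runs.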
The repo can store benchmark snapshots under results/ and render blog-ready figures from them.
For the first full naive sweep:
- data snapshot: `results/naive_full_sweep_2026-03-17.json`
- prometheus series: `results/naive_full_sweep_2026-03-17_prometheus_series.json`
- chart script: `analysis/plot_naive_sweep.py`
To render the charts locally:
```
python3 -m venv .venv-chart
source .venv-chart/bin/activate
pip install -r analysis/requirements.txt
python analysis/plot_naive_sweep.py
```

That writes PNG and SVG figures into `results/charts/`, including:

- `naive_sweep_overview`
- `naive_latency_vs_concurrency`
- `naive_ttft_vs_concurrency`
- `naive_intertoken_vs_concurrency`
- `naive_throughput_vs_concurrency`
- `naive_sweep_resource_summary`
- `naive_timeline_gpu_vs_inflight`
- `naive_timeline_gpu_memory`
- `naive_timeline_gpu_power_cpu`
- `naive_timeline_host_memory`
The repo includes a small Terraform slice under `infra/terraform` for provisioning a single AWS GPU host.
What it creates:
- uses the default VPC
- uses the first default subnet unless you override `subnet_id`
- creates a dedicated security group
- imports your SSH public key as an AWS key pair
- creates an EC2 IAM role and instance profile with `AmazonSSMManagedInstanceCore`
- launches one GPU EC2 instance with a `gp3` root volume
- bootstraps Docker, Docker Compose, and the NVIDIA container runtime with `user_data`
Usage:
```
cd infra/terraform
cp terraform.tfvars.example terraform.tfvars
terraform init
terraform plan
terraform apply
```

The main input you must set is `ami_id`. Use a GPU-capable AMI that already has NVIDIA drivers or is known to work with the bootstrap path.