pparmar15/single-node-inference-lab

Single Node Inference Lab

Benchmark and observe a production-like single-node LLM serving setup while keeping the comparison narrow:

  • FastAPI + transformers as the naive baseline
  • FastAPI + vLLM as the optimized serving path
  • one GPU host
  • one local load generator

What is included

  • shared FastAPI surface for both backends
  • SSE streaming endpoint for TTFT and inter-token latency measurement
  • SQLite request and token-event persistence
  • Prometheus metrics endpoint
  • Grafana + Prometheus + host/GPU exporters
  • Docker Compose profiles for naive and vllm
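The SQLite request and token-event persistence can be pictured as a small schema along these lines. This is an illustrative sketch only; the table and column names are my own, not the repo's actual schema:

```python
# Hypothetical sketch of per-request / per-token persistence.
# One row per request, one row per streamed token event, so that
# TTFT and inter-token gaps can be recomputed offline from timestamps.
import sqlite3


def init_db(conn):
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS requests (
            id         INTEGER PRIMARY KEY,
            prompt     TEXT NOT NULL,
            started_at REAL NOT NULL          -- seconds, monotonic or epoch
        );
        CREATE TABLE IF NOT EXISTS token_events (
            request_id INTEGER REFERENCES requests(id),
            seq        INTEGER NOT NULL,      -- token index within the request
            emitted_at REAL NOT NULL          -- timestamp when the token was sent
        );
    """)


def record_request(conn, prompt, started_at):
    cur = conn.execute(
        "INSERT INTO requests (prompt, started_at) VALUES (?, ?)",
        (prompt, started_at),
    )
    return cur.lastrowid


def record_token(conn, request_id, seq, emitted_at):
    conn.execute(
        "INSERT INTO token_events (request_id, seq, emitted_at) VALUES (?, ?, ?)",
        (request_id, seq, emitted_at),
    )
```

With timestamps stored per token, TTFT is simply the first token's `emitted_at` minus the request's `started_at`, and inter-token latency falls out of consecutive `emitted_at` differences.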

Repo Layout

app/                    FastAPI app and backend adapters
gpu_exporter/           Minimal Prometheus exporter backed by nvidia-smi
loadgen/                Local streaming benchmark driver
observability/          Prometheus and Grafana provisioning
infra/terraform/        AWS infrastructure for the single host
compose.yaml            Stack entrypoint

Services

  • api-naive: shared API running the transformers baseline in-process
  • api-vllm: shared API running the vLLM engine in-process
  • prometheus
  • grafana
  • node-exporter
  • gpu-exporter

Only one API backend should be active during a benchmark run.

Endpoints

  • GET /healthz
  • GET /readyz
  • POST /v1/runs
  • POST /v1/generate
  • GET /metrics
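Before starting a benchmark it is natural to block until GET /readyz reports ready. A minimal, transport-agnostic polling sketch; the `fetch` callable (returning True once the backend is ready) is an assumption of this example, not part of the repo's API:

```python
# Poll a readiness check until it passes or a deadline expires.
# `fetch` is any zero-argument callable, e.g. one that issues
# GET /readyz and returns True on a 200 response.
import time


def wait_ready(fetch, timeout_s=120.0, interval_s=2.0,
               clock=time.monotonic, sleep=time.sleep):
    deadline = clock() + timeout_s
    while clock() < deadline:
        if fetch():
            return True
        sleep(interval_s)
    raise TimeoutError("backend never became ready")
```

The `clock` and `sleep` parameters are injectable purely so the helper can be exercised without real waiting.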

POST /v1/generate defaults to SSE streaming and emits these event types:

  • meta
  • token
  • done
  • error
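A client can recover TTFT and inter-token latency from this stream by timestamping each `token` event as it arrives. Below is a minimal sketch of an SSE line parser plus the timing logic, assuming the standard `event:` / `data:` wire format with events separated by blank lines; the event payloads are ignored here because only arrival times matter for latency:

```python
# Parse SSE text lines into (event_name, data) pairs and time token arrivals.
import time


def iter_sse_events(lines):
    """Yield (event_name, data) tuples from an iterable of SSE text lines."""
    event, data = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "" and event is not None:
            yield event, "\n".join(data)
            event, data = None, []
    if event is not None:  # stream ended without a trailing blank line
        yield event, "\n".join(data)


def measure_stream(lines, clock=time.monotonic):
    """Return (ttft, inter_token_gaps) in seconds across token events."""
    start = clock()
    ttft, gaps, last = None, [], None
    for name, _payload in iter_sse_events(lines):
        now = clock()
        if name == "token":
            if ttft is None:
                ttft = now - start
            else:
                gaps.append(now - last)
            last = now
    return ttft, gaps
```

In a real run, `lines` would be the decoded line iterator of the streaming HTTP response; the repo's own measurement happens server-side on the same path, per the notes below.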

Quick Start

  1. Copy .env.example to .env and adjust model/runtime settings.
  2. Start observability plus one backend:
docker compose --profile obs --profile naive up --build

or:

docker compose --profile obs --profile vllm up --build
  3. From your laptop, run the load generator:
python3 -m venv .venv
source .venv/bin/activate
pip install -r loadgen/requirements.txt
python loadgen/stream_bench.py \
  --base-url http://<host>:8000 \
  --prompt "Explain how continuous batching works." \
  --concurrency 1,2,4,8 \
  --requests-per-level 4
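Per-level latency samples from a sweep like this are usually reduced to percentiles before plotting. A stdlib-only sketch of that reduction; the returned field names are my own, not stream_bench's actual output format:

```python
# Reduce a list of latency samples (ms) to a percentile summary.
import statistics


def summarize_ms(latencies_ms):
    """Return mean/p50/p95/p99 for one concurrency level's samples."""
    qs = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "mean": statistics.mean(latencies_ms),
        "p50": qs[49],   # quantiles() returns the 99 cut points p1..p99
        "p95": qs[94],
        "p99": qs[98],
    }
```

Comparing p95/p99 rather than means across concurrency levels is what makes the naive-vs-vLLM queueing behavior visible.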

Notes

  • The repo is structured to keep API-layer instrumentation identical across both backends.
  • TTFT and inter-token latency are captured server-side from the same streaming path that the client sees.
  • The vLLM image and model settings may need tuning for the exact Qwen 14B variant and GPU memory budget.

Benchmark Artifacts

The repo can store benchmark snapshots under results/ and render blog-ready figures from them.

For the first full naive sweep:

  • data snapshot: results/naive_full_sweep_2026-03-17.json
  • prometheus series: results/naive_full_sweep_2026-03-17_prometheus_series.json
  • chart script: analysis/plot_naive_sweep.py

To render the charts locally:

python3 -m venv .venv-chart
source .venv-chart/bin/activate
pip install -r analysis/requirements.txt
python analysis/plot_naive_sweep.py

That writes PNG and SVG figures into results/charts/, including:

  • naive_sweep_overview
  • naive_latency_vs_concurrency
  • naive_ttft_vs_concurrency
  • naive_intertoken_vs_concurrency
  • naive_throughput_vs_concurrency
  • naive_sweep_resource_summary
  • naive_timeline_gpu_vs_inflight
  • naive_timeline_gpu_memory
  • naive_timeline_gpu_power_cpu
  • naive_timeline_host_memory

Terraform

The repo includes a small Terraform slice under infra/terraform for a single AWS GPU host.

What it does:

  • uses the default VPC
  • uses the first default subnet unless you override subnet_id
  • creates a dedicated security group
  • imports your SSH public key as an AWS key pair
  • creates an EC2 IAM role and instance profile with AmazonSSMManagedInstanceCore
  • launches one GPU EC2 instance with a gp3 root volume
  • bootstraps Docker, Docker Compose, and the NVIDIA container runtime with user_data

Usage:

cd infra/terraform
cp terraform.tfvars.example terraform.tfvars
terraform init
terraform plan
terraform apply

The main input you must set is ami_id. Use a GPU-capable AMI that already has NVIDIA drivers or is known to work with the bootstrap path.
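A terraform.tfvars along these lines would be typical. The AMI ID below is a placeholder, and only ami_id and subnet_id are variables named in this README; check terraform.tfvars.example for the real variable set:

```hcl
ami_id    = "ami-0123456789abcdef0"  # placeholder: use a GPU-capable AMI with NVIDIA drivers
# subnet_id = "subnet-..."           # optional: override the first default subnet
```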
