Docker-based LLM inference and image generation with NVIDIA GPU acceleration. Supports three backends via Docker Compose profiles:
- llama.cpp (`--profile llamacpp`) — compiled from source, supports GGUF models, interactive CLI
- vLLM (`--profile vllm`) — official pre-built image, better multi-GPU tensor parallelism, HuggingFace model format
- ComfyUI (`--profile imageai`) — image generation with visual workflow editor, OpenAI-compatible image API
The host provides only the NVIDIA driver and Docker; all CUDA and inference software lives in containers.
Features:
- OpenAI-compatible APIs (text chat and image generation)
- Multi-GPU support (layer splitting for llama.cpp, tensor parallelism for vLLM)
- Visual workflow editor for image generation pipelines (ComfyUI)
- Browser chat interface (Open WebUI)
- GPU monitoring dashboard (Grafana + Prometheus + DCGM) — separate per backend
- Fully configurable via `.env` — one file to tune models, context size, GPU split, CUDA version
Requires Ubuntu 24.04 LTS Server with an NVIDIA GPU. See docs/prerequisites.md for the full setup guide (NVIDIA driver, Docker, Container Toolkit).
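Before starting any profile, it is worth confirming GPU passthrough works end to end. A minimal smoke test (the CUDA image tag is illustrative; any tag matching your driver works):

```bash
# Driver visible on the host
nvidia-smi

# GPU reachable from inside a container (illustrative image tag)
docker run --rm --gpus all nvidia/cuda:12.8.1-base-ubuntu24.04 nvidia-smi
```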
# 1. Copy and edit config
cp .env.example .env
# Edit .env — set LLAMACPP_MODEL, LLAMACPP_TENSOR_SPLIT, LLAMACPP_CUDA_DOCKER_ARCH for your hardware
# 2. Build llama.cpp image (required for llamacpp profile)
docker compose build
# 3. Start with llama.cpp
docker compose --profile llamacpp up -d
# — or start with vLLM (no build needed, uses official image) —
docker compose --profile vllm up -d

This starts: API server (port 8080), Web UI (port 3000, text backends only), Grafana dashboard (port 4000).
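Once the stack is up, a quick way to confirm the API is live; both text backends expose the standard OpenAI-compatible model listing:

```bash
# List the loaded model(s); works for llama.cpp and vLLM alike
curl http://localhost:8080/v1/models

# Follow server logs while the model loads (pass the active profile
# so compose can resolve the profiled services)
docker compose --profile llamacpp logs -f
```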
# 1. Download model files (see Models section below)
# 2. Build and start
docker compose --profile imageai build
docker compose --profile imageai up -d

This starts: API server (port 8080), ComfyUI visual editor (port 8188), Grafana dashboard (port 4000).
To switch backends, stop one and start the other:
docker compose --profile llamacpp down
docker compose --profile vllm up -d
# or switch to image generation
docker compose --profile vllm down
docker compose --profile imageai up -d

All config lives in `.env`. Copy `.env.example` and adjust for your hardware.
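Only one backend should own the GPU at a time. A hypothetical helper script for switching, using the profile names above:

```bash
#!/usr/bin/env bash
# switch-backend.sh (hypothetical): stop other profiles, start the requested one.
# Usage: ./switch-backend.sh vllm
set -euo pipefail
target="${1:?usage: $0 <llamacpp|vllm|imageai>}"

for p in llamacpp vllm imageai; do
  # "down" on an inactive profile is a harmless no-op
  [ "$p" = "$target" ] || docker compose --profile "$p" down
done
docker compose --profile "$target" up -d
```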
| Variable | Default | Description |
|---|---|---|
| `LLAMACPP_MODEL` | `tinyllama.gguf` | GGUF filename in `../models/` |
| `LLAMACPP_GPU_LAYERS` | `99` | Layers to offload to GPU (99 = all) |
| `LLAMACPP_CTX_SIZE` | `2048` | Context window size |
| `LLAMACPP_FLASH_ATTN` | `on` | Flash attention (on/off) |
| `LLAMACPP_CACHE_TYPE_K` | `q8_0` | KV cache type (key) |
| `LLAMACPP_CACHE_TYPE_V` | `q8_0` | KV cache type (value) |
| `LLAMACPP_REASONING` | `off` | Reasoning/thinking mode (on/off) |
| `LLAMACPP_EXTRA_ARGS` | `--jinja` | Additional llama-server flags (space-separated) |
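As an illustration, a `.env` fragment that trades a larger context window against KV-cache memory (values are examples, not recommendations):

```bash
# Larger context; quantized KV cache keeps the memory cost manageable
LLAMACPP_CTX_SIZE=16384
LLAMACPP_FLASH_ATTN=on
LLAMACPP_CACHE_TYPE_K=q8_0
LLAMACPP_CACHE_TYPE_V=q8_0
```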
| Variable | Default | Description |
|---|---|---|
| `LLAMACPP_TENSOR_SPLIT` | `1` | GPU memory ratio. See examples below |
# Single GPU
LLAMACPP_TENSOR_SPLIT=1
# 2 equal GPUs (e.g. 2x RTX 3090)
LLAMACPP_TENSOR_SPLIT=1,1
# 4 equal GPUs
LLAMACPP_TENSOR_SPLIT=1,1,1,1
# Unequal VRAM (24GB + 16GB)
LLAMACPP_TENSOR_SPLIT=3,2
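To pick a ratio for mixed cards, the per-GPU VRAM totals are all you need:

```bash
# Show each GPU's index, name, and total VRAM; use the rough ratio as the split
nvidia-smi --query-gpu=index,name,memory.total --format=csv
```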
| Variable | Default | Description |
|---|---|---|
| `VLLM_MODEL` | `Qwen/Qwen3.6-27B-FP8` | HuggingFace model ID or local path |
| `VLLM_TENSOR_PARALLEL` | `2` | Number of GPUs for tensor parallelism |
| `VLLM_MAX_MODEL_LEN` | `131072` | Maximum context length |
| `VLLM_GPU_MEM_UTIL` | `0.92` | Fraction of VRAM vLLM may allocate (weights + KV cache) |
| `VLLM_MAX_BATCH_TOKENS` | `8192` | Max tokens per prefill batch |
| `VLLM_EXTRA_ARGS` | (empty) | Model-specific CLI flags (see below) |
| `HF_TOKEN` | (empty) | HuggingFace token for gated models |
vLLM uses official pre-built Docker images — no build step required. For best results, use HuggingFace model IDs (e.g., Qwen/Qwen3-27B); models are downloaded and cached automatically. The stack applies performance optimizations automatically (CUDA allocator tuning, PCIe multi-GPU fixes, prefix caching, chunked prefill). See docs/vllm.md for per-model VLLM_EXTRA_ARGS examples and tuning details.
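For a gated model the relevant `.env` lines might look like this (model ID and token are placeholders):

```bash
# Placeholder values; substitute your own model and token
VLLM_MODEL=meta-llama/Llama-3.1-8B-Instruct
HF_TOKEN=hf_xxxxxxxxxxxxxxxx
```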
| Variable | Default | Description |
|---|---|---|
| `IMAGEAI_CHECKPOINT` | `flux1-dev-fp8.safetensors` | Checkpoint filename in `../models/comfyui/checkpoints/` |
| `IMAGEAI_DEFAULT_WORKFLOW` | `flux1-dev-fp8.json` | Workflow template filename in `./imageai/workflows/` |
| `IMAGEAI_DEFAULT_WIDTH` | `1024` | Default image width |
| `IMAGEAI_DEFAULT_HEIGHT` | `1024` | Default image height |
| `COMFYUI_PORT` | `8188` | Host port for ComfyUI visual editor |
| `COMFYUI_REF` | `latest` | ComfyUI git ref to build (tag, branch, or commit) |
| `COMFYUI_ARGS` | (empty) | ComfyUI startup flags (use `--lowvram` only if OOM — causes blur) |
See docs/comfyui.md for model download instructions, workflow customization, and tuning details.
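For example, a hypothetical `.env` fragment pinning a ComfyUI release and enabling low-VRAM mode (the tag shown is illustrative):

```bash
# Illustrative values: pin a known-good tag, add flags only when needed
COMFYUI_REF=v0.3.10
COMFYUI_ARGS=--lowvram
```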
| Variable | Default | Description |
|---|---|---|
| `LLAMACPP_CUDA_DOCKER_ARCH` | `default` | GPU compute capability. See table below |
| `LLAMACPP_CUDA_VERSION` | `12.8.1` | CUDA toolkit version (must match driver) |
| GPU | `LLAMACPP_CUDA_DOCKER_ARCH` | Min `LLAMACPP_CUDA_VERSION` | Min driver |
|---|---|---|---|
| Universal (all GPUs, slower build) | `default` | 12.4.0 | 525+ |
| RTX 50 series (Blackwell) | `120` | 12.8.1 | 570+ |
| RTX 40 series (Ada Lovelace) | `89` | 12.0 | 525+ |
| RTX 30 series (Ampere) | `86` | 11.1 | 455+ |
# Example: rebuild for RTX 4090 with CUDA 12.8
LLAMACPP_CUDA_DOCKER_ARCH=89 LLAMACPP_CUDA_VERSION=12.8.1 docker compose build

| Variable | Default | Description |
|---|---|---|
| `API_KEY` | (empty) | API key for authentication. Leave empty for no auth |
| `PORT` | `8080` | API server port |
| `WEBUI_PORT` | `3000` | Open WebUI port (only available with text backends) |
| `GRAFANA_PORT` | `4000` | Grafana dashboard port |
| `GRAFANA_USER` | `admin` | Grafana username |
| `GRAFANA_PASSWORD` | `admin` | Grafana password |
| `PROMETHEUS_RETENTION` | `7d` | Metrics retention |
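Any opaque string works as `API_KEY`; generating a random one is a sensible default (a sketch, assuming `openssl` is available):

```bash
# Generate a random key, then restart so the containers pick it up
echo "API_KEY=$(openssl rand -hex 32)" >> .env
docker compose --profile llamacpp up -d
```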
docker compose --profile llamacpp run --rm llamacpp-cli -m /models/YOUR_MODEL.gguf \
--n-gpu-layers 99 --ctx-size 8192 --flash-attn on \
--cache-type-k q8_0 --cache-type-v q8_0 \
--split-mode layer --tensor-split 1,1

Both backends expose the same OpenAI-compatible API on port 8080.
# Without auth (API_KEY empty)
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"YOUR_MODEL","messages":[{"role":"user","content":"Hello"}],"max_tokens":256}'
# With auth (API_KEY set in .env)
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{"model":"YOUR_MODEL","messages":[{"role":"user","content":"Hello"}],"max_tokens":256}'Open http://<host-ip>:3000 in your browser. Works with either text-generation backend. First visit creates a local admin account.
The imageai profile exposes an OpenAI-compatible image generation API on port 8080.
# Generate an image
curl -X POST http://localhost:8080/v1/images/generations \
-H 'Content-Type: application/json' \
-d '{"prompt":"a cat sitting on a windowsill, sunlight streaming in","size":"1024x1024"}'
# Custom resolution
curl -X POST http://localhost:8080/v1/images/generations \
-H 'Content-Type: application/json' \
-d '{"prompt":"hero header for a dental website","size":"1600x900"}'
# Save response and extract image
curl -X POST http://localhost:8080/v1/images/generations \
-H 'Content-Type: application/json' \
-d '{"prompt":"a mountain landscape at sunset"}' \
-o response.json
python3 -c "import json,base64;d=json.load(open('response.json'));open('output.png','wb').write(base64.b64decode(d['data'][0]['b64_json']))"

Request parameters: `prompt` (required), `size` (WIDTHxHEIGHT), `n` (1-10 images), `negative_prompt`, `model` (workflow template name).
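As an alternative to the Python one-liner above, `jq` handles extraction too, including when `n` > 1 (assumes `jq` is installed and the response shape shown above):

```bash
# Decode every returned image to its own file
n=$(jq '.data | length' response.json)
for i in $(seq 0 $((n - 1))); do
  jq -r ".data[$i].b64_json" response.json | base64 -d > "output_$i.png"
done
```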
Open http://<host-ip>:4000 for the Grafana dashboard (login from .env). Each backend has its own dashboard with appropriate metrics:
- llama.cpp: Active requests, total tokens, last request performance (prompt/gen tok/s)
- vLLM: Active requests, total tokens, throughput rates (prompt/gen tok/s)
- ComfyUI: Active generations, total images generated, average generation time
Both show GPU panels: utilization, VRAM, temperature, power draw, clock speeds.
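For a quick CLI view of the same GPU signals without opening Grafana:

```bash
# Rolling per-GPU stats: power/temp (p), utilization (u), clocks (c), memory (m)
nvidia-smi dmon -s pucm
```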
Benchmarks work with either backend. Activate the backend profile alongside bench. Model names are auto-detected from the server — no manual configuration needed when switching backends.
# Rebuild after code changes in benchmark directories
docker compose --profile bench build
# LLM benchmarks (GSM8K, MMLU, etc.) — with llama.cpp
docker compose --profile llamacpp --profile bench run --rm benchmark
# Same benchmarks — with vLLM
docker compose --profile vllm --profile bench run --rm benchmark
# Token throughput at different context sizes
docker compose --profile llamacpp --profile bench run --rm benchmark-throughput
docker compose --profile vllm --profile bench run --rm benchmark-throughput
# SWE-bench coding agent benchmark
SWE_SLICE="0:5" docker compose --profile llamacpp --profile bench run --rm benchmark-swe
# SWE-bench against a frontier API (backend profile doesn't matter)
SWE_API_MODE=frontier SWE_API_URL=https://api.openai.com/v1 API_KEY=sk-... \
SWE_MODEL_NAME=openai/gpt-5 \
SWE_SLICE="0:5" docker compose --profile bench run --rm benchmark-sweFull benchmarking documentation: docs/benchmarking.md
# Single RTX 40-series GPU (arch 89)
LLAMACPP_MODEL=Qwen_Qwen3.6-27B-Q4_K_M.gguf
LLAMACPP_CTX_SIZE=32768
LLAMACPP_TENSOR_SPLIT=1
LLAMACPP_CUDA_DOCKER_ARCH=89
LLAMACPP_CUDA_VERSION=12.8.1

# 2x RTX 30-series GPUs (arch 86)
LLAMACPP_MODEL=Qwen_Qwen3.6-27B-Q5_K_M.gguf
LLAMACPP_CTX_SIZE=131072
LLAMACPP_TENSOR_SPLIT=1,1
LLAMACPP_CUDA_DOCKER_ARCH=86
LLAMACPP_CUDA_VERSION=12.8.1

# 2x RTX 50-series GPUs (arch 120)
# Dense 27B — best quality
LLAMACPP_MODEL=Qwen_Qwen3.6-27B-Q5_K_M.gguf
LLAMACPP_CTX_SIZE=131072
LLAMACPP_TENSOR_SPLIT=1,1
LLAMACPP_CUDA_DOCKER_ARCH=120
LLAMACPP_CUDA_VERSION=13.0.0
LLAMACPP_EXTRA_ARGS=--jinja
# — or MoE 35B-A3B (35B total / 3B active) — much faster, similar quality —
# LLAMACPP_MODEL=Qwen_Qwen3.6-35B-A3B-Q5_K_M.gguf

# vLLM example: 2-GPU tensor parallelism with an FP8 KV cache
VLLM_MODEL=Qwen/Qwen3-14B-AWQ
VLLM_TENSOR_PARALLEL=2
VLLM_MAX_MODEL_LEN=40960
VLLM_EXTRA_ARGS=--kv-cache-dtype fp8

- llama.cpp: Place `.gguf` files in `../models/` (sibling directory). Mounted read-only at `/models/`.
- vLLM: Uses HuggingFace model IDs. Models are downloaded to `~/.cache/huggingface/` on first run. For gated models, set `HF_TOKEN` in `.env`.
- ComfyUI: Place model files in `../models/comfyui/` with the expected subdirectory structure:

models/comfyui/
  checkpoints/ — diffusion model (e.g., flux1-dev.safetensors)
  vae/ — VAE (e.g., ae.safetensors)
  clip/ — text encoders (e.g., clip_l.safetensors, t5xxl_fp16.safetensors)
  loras/ — LoRA fine-tunes
Flux.1-dev is a gated model — accept the license at black-forest-labs/FLUX.1-dev, then:
# Install HF CLI (one-time)
pipx install huggingface_hub
huggingface-cli login # paste your token
# Create directories
mkdir -p ../models/comfyui/{checkpoints,vae,clip}
# Download checkpoint + VAE (gated, requires license acceptance)
huggingface-cli download black-forest-labs/FLUX.1-dev flux1-dev.safetensors --local-dir ../models/comfyui/checkpoints
huggingface-cli download black-forest-labs/FLUX.1-dev ae.safetensors --local-dir ../models/comfyui/vae
# Download text encoders (ungated)
huggingface-cli download comfyanonymous/flux_text_encoders clip_l.safetensors t5xxl_fp16.safetensors --local-dir ../models/comfyui/clip

Total download: ~34 GB. For 16 GB GPUs, use t5xxl_fp8_e4m3fn.safetensors instead of t5xxl_fp16.safetensors to save ~5 GB.
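A quick sanity check that everything landed where ComfyUI expects it:

```bash
# File sizes per subdirectory should roughly match the figures above
du -sh ../models/comfyui/*
find ../models/comfyui -name '*.safetensors' -exec ls -lh {} +
```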
imageai/ — ComfyUI image generation profile
comfyui/ — ComfyUI Dockerfile (pytorch + ComfyUI from git)
wrapper/ — FastAPI wrapper (OpenAI-compatible image API → ComfyUI)
default_workflows/ — pre-seeded workflow templates (API-format JSON)
workflows/ — user workflow templates (persisted, customize via ComfyUI UI)
benchmark/ — lm-evaluation-harness runner (knowledge/reasoning benchmarks)
benchmark-swe/ — mini-swe-agent runner (SWE-bench coding agent benchmarks)
benchmark-throughput/ — token throughput at multiple context sizes
monitoring/
prometheus-llamacpp.yml — scrape config (llama-server metrics + DCGM)
prometheus-vllm.yml — scrape config (vLLM metrics + DCGM)
prometheus-imageai.yml — scrape config (wrapper metrics + DCGM)
grafana/
provisioning/ — shared datasource + dashboard auto-discovery
dashboards/
llamacpp/llama-stack.json — llama.cpp dashboard (tokens, GPU stats)
vllm/vllm-stack.json — vLLM dashboard (tokens, GPU stats)
imageai/imageai-stack.json — ComfyUI dashboard (generations, GPU stats)
- docs/prerequisites.md — Ubuntu 24.04 server setup (NVIDIA driver, Docker, Container Toolkit)
- docs/vllm.md — vLLM tuning: automatic optimizations, per-model examples, KV cache types, switching models
- docs/comfyui.md — ComfyUI image generation: model download, workflow customization, VRAM tuning
- docs/benchmarking.md — LLM benchmarks, throughput benchmark, and SWE-bench coding agent benchmarks
docker container prune -f # remove stopped containers
docker image prune -f # remove dangling images
docker builder prune -f # remove build cache
docker system prune -a -f # remove everything unused (nuclear)