LLM Host — llama.cpp / vLLM / ComfyUI with CUDA

Docker-based LLM inference and image generation with NVIDIA GPU acceleration. Supports three backends via Docker Compose profiles:

  • llama.cpp (--profile llamacpp) — compiled from source, supports GGUF models, interactive CLI
  • vLLM (--profile vllm) — official pre-built image, better multi-GPU tensor parallelism, HuggingFace model format
  • ComfyUI (--profile imageai) — image generation with visual workflow editor, OpenAI-compatible image API

The host provides only the NVIDIA driver + Docker; all CUDA and inference software lives in containers.

Features:

  • OpenAI-compatible APIs (text chat and image generation)
  • Multi-GPU support (layer splitting for llama.cpp, tensor parallelism for vLLM)
  • Visual workflow editor for image generation pipelines (ComfyUI)
  • Browser chat interface (Open WebUI)
  • GPU monitoring dashboard (Grafana + Prometheus + DCGM) — separate per backend
  • Fully configurable via .env — one file to tune models, context size, GPU split, CUDA version

Prerequisites

Ubuntu 24.04 LTS Server with an NVIDIA GPU. See docs/prerequisites.md for the full setup guide (NVIDIA driver, Docker, Container Toolkit).
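
Before building anything, it is worth checking that Docker can see the GPU through the Container Toolkit. A minimal smoke test (the CUDA image tag is only an example; any recent nvidia/cuda base image will do):

docker run --rm --gpus all nvidia/cuda:12.8.1-base-ubuntu24.04 nvidia-smi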

Quick Start

# 1. Copy and edit config
cp .env.example .env
# Edit .env — set LLAMACPP_MODEL, LLAMACPP_TENSOR_SPLIT, LLAMACPP_CUDA_DOCKER_ARCH for your hardware

# 2. Build llama.cpp image (required for llamacpp profile)
docker compose build

# 3. Start with llama.cpp
docker compose --profile llamacpp up -d

# — or start with vLLM (no build needed, uses official image) —
docker compose --profile vllm up -d

This starts: API server (port 8080), Web UI (port 3000, text backends only), Grafana dashboard (port 4000).
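
A quick way to confirm the stack is up: both text backends are expected to expose the standard OpenAI model-listing endpoint, and the container logs show model-loading progress (substitute the profile you started):

# List the loaded model(s); add -H "Authorization: Bearer YOUR_API_KEY" if API_KEY is set
curl http://localhost:8080/v1/models

# Follow startup logs
docker compose --profile llamacpp logs -f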

Image Generation (ComfyUI)

# 1. Download model files (see Models section below)
# 2. Build and start
docker compose --profile imageai build
docker compose --profile imageai up -d

This starts: API server (port 8080), ComfyUI visual editor (port 8188), Grafana dashboard (port 4000).

To switch backends, stop one and start the other:

docker compose --profile llamacpp down
docker compose --profile vllm up -d

# or switch to image generation
docker compose --profile vllm down
docker compose --profile imageai up -d
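
To check what is currently running before switching (depending on your Compose version you may need to pass the active profile):

docker compose ps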

Configuration

All config lives in .env. Copy .env.example and adjust for your hardware.

Model (llama.cpp)

Variable                Default         Description
LLAMACPP_MODEL          tinyllama.gguf  GGUF filename in ../models/
LLAMACPP_GPU_LAYERS     99              Layers to offload to GPU (99 = all)
LLAMACPP_CTX_SIZE       2048            Context window size
LLAMACPP_FLASH_ATTN     on              Flash attention
LLAMACPP_CACHE_TYPE_K   q8_0            KV cache type (key)
LLAMACPP_CACHE_TYPE_V   q8_0            KV cache type (value)
LLAMACPP_REASONING      off             Reasoning/thinking mode (on/off)
LLAMACPP_EXTRA_ARGS     --jinja         Additional llama-server flags (space-separated)

Multi-GPU (llama.cpp)

Variable                Default  Description
LLAMACPP_TENSOR_SPLIT   1        GPU memory ratio. See examples below

# Single GPU
LLAMACPP_TENSOR_SPLIT=1

# 2 equal GPUs (e.g. 2x RTX 3090)
LLAMACPP_TENSOR_SPLIT=1,1

# 4 equal GPUs
LLAMACPP_TENSOR_SPLIT=1,1,1,1

# Unequal VRAM (24GB + 16GB)
LLAMACPP_TENSOR_SPLIT=3,2

vLLM

Variable                Default               Description
VLLM_MODEL              Qwen/Qwen3.6-27B-FP8  HuggingFace model ID or local path
VLLM_TENSOR_PARALLEL    2                     Number of GPUs for tensor parallelism
VLLM_MAX_MODEL_LEN      131072                Maximum context length
VLLM_GPU_MEM_UTIL       0.92                  Fraction of GPU memory vLLM may use (weights + KV cache)
VLLM_MAX_BATCH_TOKENS   8192                  Max tokens per prefill batch
VLLM_EXTRA_ARGS         (empty)               Model-specific CLI flags (see below)
HF_TOKEN                (empty)               HuggingFace token for gated models

vLLM uses official pre-built Docker images — no build step required. For best results, use HuggingFace model IDs (e.g., Qwen/Qwen3-27B); models are downloaded and cached automatically. The stack applies performance optimizations automatically (CUDA allocator tuning, PCIe multi-GPU fixes, prefix caching, chunked prefill). See docs/vllm.md for per-model VLLM_EXTRA_ARGS examples and tuning details.
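
Compose also picks these variables up from the shell, and shell values take precedence over .env, so you can trial a different model without editing the file (the model ID and context length below are only illustrative):

VLLM_MODEL=Qwen/Qwen3-14B-AWQ VLLM_MAX_MODEL_LEN=32768 docker compose --profile vllm up -d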

ImageAI (ComfyUI)

Variable                  Default                    Description
IMAGEAI_CHECKPOINT        flux1-dev-fp8.safetensors  Checkpoint filename in ../models/comfyui/checkpoints/
IMAGEAI_DEFAULT_WORKFLOW  flux1-dev-fp8.json         Workflow template filename in ./imageai/workflows/
IMAGEAI_DEFAULT_WIDTH     1024                       Default image width
IMAGEAI_DEFAULT_HEIGHT    1024                       Default image height
COMFYUI_PORT              8188                       Host port for ComfyUI visual editor
COMFYUI_REF               latest                     ComfyUI git ref to build (tag, branch, or commit)
COMFYUI_ARGS              (empty)                    ComfyUI startup flags (use --lowvram only if OOM; can blur output)

See docs/comfyui.md for model download instructions, workflow customization, and tuning details.

Build (GPU architecture, llama.cpp only)

Variable                   Default  Description
LLAMACPP_CUDA_DOCKER_ARCH  default  GPU compute capability. See table below
LLAMACPP_CUDA_VERSION      12.8.1   CUDA toolkit version (must match driver)

GPU                                 LLAMACPP_CUDA_DOCKER_ARCH  Min LLAMACPP_CUDA_VERSION  Min driver
Universal (all GPUs, slower build)  default                    12.4.0                     525+
RTX 50 series (Blackwell)           120                        12.8.1                     550+
RTX 40 series (Ada Lovelace)        89                         12.0                       525+
RTX 30 series (Ampere)              86                         11.1                       455+

# Example: rebuild for RTX 4090 with CUDA 12.8
LLAMACPP_CUDA_DOCKER_ARCH=89 LLAMACPP_CUDA_VERSION=12.8.1 docker compose build

Ports and services

Variable              Default  Description
API_KEY               (empty)  API key for authentication. Leave empty for no auth
PORT                  8080     API server port
WEBUI_PORT            3000     Open WebUI port (only available with text backends)
GRAFANA_PORT          4000     Grafana dashboard port
GRAFANA_USER          admin    Grafana username
GRAFANA_PASSWORD      admin    Grafana password
PROMETHEUS_RETENTION  7d       Metrics retention

Usage

CLI — Interactive Testing (llama.cpp only)

docker compose --profile llamacpp run --rm llamacpp-cli -m /models/YOUR_MODEL.gguf \
  --n-gpu-layers 99 --ctx-size 8192 --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --split-mode layer --tensor-split 1,1

API — OpenAI-Compatible

Both text backends expose the same OpenAI-compatible API on port 8080.

# Without auth (API_KEY empty)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"YOUR_MODEL","messages":[{"role":"user","content":"Hello"}],"max_tokens":256}'

# With auth (API_KEY set in .env)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{"model":"YOUR_MODEL","messages":[{"role":"user","content":"Hello"}],"max_tokens":256}'

Web UI

Open http://<host-ip>:3000 in your browser. Works with either text-generation backend. First visit creates a local admin account.

Image Generation API

The imageai profile exposes an OpenAI-compatible image generation API on port 8080.

# Generate an image
curl -X POST http://localhost:8080/v1/images/generations \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"a cat sitting on a windowsill, sunlight streaming in","size":"1024x1024"}'

# Custom resolution
curl -X POST http://localhost:8080/v1/images/generations \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"hero header for a dental website","size":"1600x900"}'

# Save response and extract image
curl -X POST http://localhost:8080/v1/images/generations \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"a mountain landscape at sunset"}' \
  -o response.json
python3 -c "import json,base64;d=json.load(open('response.json'));open('output.png','wb').write(base64.b64decode(d['data'][0]['b64_json']))"

Request parameters: prompt (required), size (WIDTHxHEIGHT), n (1-10 images), negative_prompt, model (workflow template name).
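
When n is greater than 1, each image is expected to arrive as a separate entry in data with its own b64_json field (response shape assumed from the OpenAI images API); a sketch that saves all of them:

# Two variations with a negative prompt, saved as out_0.png and out_1.png
curl -X POST http://localhost:8080/v1/images/generations \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"a ceramic mug on a wooden table","negative_prompt":"text, watermark","n":2,"size":"1024x1024"}' \
  -o response.json
python3 -c "import json,base64;d=json.load(open('response.json'));[open(f'out_{i}.png','wb').write(base64.b64decode(x['b64_json'])) for i,x in enumerate(d['data'])]"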

Monitoring

Open http://<host-ip>:4000 for the Grafana dashboard (login from .env). Each backend has its own dashboard with appropriate metrics:

  • llama.cpp: Active requests, total tokens, last request performance (prompt/gen tok/s)
  • vLLM: Active requests, total tokens, throughput rates (prompt/gen tok/s)
  • ComfyUI: Active generations, total images generated, average generation time

All three dashboards show GPU panels: utilization, VRAM, temperature, power draw, clock speeds.

Benchmarking

Benchmarks work with either backend. Activate the backend profile alongside bench. Model names are auto-detected from the server — no manual configuration needed when switching backends.

# Rebuild after code changes in benchmark directories
docker compose --profile bench build

# LLM benchmarks (GSM8K, MMLU, etc.) — with llama.cpp
docker compose --profile llamacpp --profile bench run --rm benchmark

# Same benchmarks — with vLLM
docker compose --profile vllm --profile bench run --rm benchmark

# Token throughput at different context sizes
docker compose --profile llamacpp --profile bench run --rm benchmark-throughput
docker compose --profile vllm --profile bench run --rm benchmark-throughput

# SWE-bench coding agent benchmark
SWE_SLICE="0:5" docker compose --profile llamacpp --profile bench run --rm benchmark-swe

# SWE-bench against a frontier API (backend profile doesn't matter)
SWE_API_MODE=frontier SWE_API_URL=https://api.openai.com/v1 API_KEY=sk-... \
  SWE_MODEL_NAME=openai/gpt-5 \
  SWE_SLICE="0:5" docker compose --profile bench run --rm benchmark-swe

Full benchmarking documentation: docs/benchmarking.md

Example Setups

Single RTX 4090 (24 GB) — llama.cpp

LLAMACPP_MODEL=Qwen_Qwen3.6-27B-Q4_K_M.gguf
LLAMACPP_CTX_SIZE=32768
LLAMACPP_TENSOR_SPLIT=1
LLAMACPP_CUDA_DOCKER_ARCH=89
LLAMACPP_CUDA_VERSION=12.8.1

Dual RTX 3090 (2x 24 GB) — llama.cpp

LLAMACPP_MODEL=Qwen_Qwen3.6-27B-Q5_K_M.gguf
LLAMACPP_CTX_SIZE=131072
LLAMACPP_TENSOR_SPLIT=1,1
LLAMACPP_CUDA_DOCKER_ARCH=86
LLAMACPP_CUDA_VERSION=12.8.1

Dual RTX 5080 + 5060 Ti (2x 16 GB, Blackwell) — llama.cpp

# Dense 27B — best quality
LLAMACPP_MODEL=Qwen_Qwen3.6-27B-Q5_K_M.gguf
LLAMACPP_CTX_SIZE=131072
LLAMACPP_TENSOR_SPLIT=1,1
LLAMACPP_CUDA_DOCKER_ARCH=120
LLAMACPP_CUDA_VERSION=13.0.0
LLAMACPP_EXTRA_ARGS=--jinja

# — or MoE 35B-A3B (35B total / 3B active) — much faster, similar quality —
# LLAMACPP_MODEL=Qwen_Qwen3.6-35B-A3B-Q5_K_M.gguf

Dual RTX 5080 + 5060 Ti (2x 16 GB, Blackwell) — vLLM

VLLM_MODEL=Qwen/Qwen3-14B-AWQ
VLLM_TENSOR_PARALLEL=2
VLLM_MAX_MODEL_LEN=40960
VLLM_EXTRA_ARGS=--kv-cache-dtype fp8

Models

  • llama.cpp: Place .gguf files in ../models/ (sibling directory). Mounted read-only at /models/.
  • vLLM: Uses HuggingFace model IDs. Models are downloaded to ~/.cache/huggingface/ on first run. For gated models, set HF_TOKEN in .env.
  • ComfyUI: Place model files in ../models/comfyui/ with the expected subdirectory structure:
    models/comfyui/
      checkpoints/         — diffusion model (e.g., flux1-dev.safetensors)
      vae/                 — VAE (e.g., ae.safetensors)
      clip/                — text encoders (e.g., clip_l.safetensors, t5xxl_fp16.safetensors)
      loras/               — LoRA fine-tunes
    

Downloading Flux.1-dev

Flux.1-dev is a gated model — accept the license at black-forest-labs/FLUX.1-dev, then:

# Install HF CLI (one-time)
pipx install huggingface_hub
huggingface-cli login    # paste your token

# Create directories
mkdir -p ../models/comfyui/{checkpoints,vae,clip}

# Download checkpoint + VAE (gated, requires license acceptance)
huggingface-cli download black-forest-labs/FLUX.1-dev flux1-dev.safetensors --local-dir ../models/comfyui/checkpoints
huggingface-cli download black-forest-labs/FLUX.1-dev ae.safetensors --local-dir ../models/comfyui/vae

# Download text encoders (ungated)
huggingface-cli download comfyanonymous/flux_text_encoders clip_l.safetensors t5xxl_fp16.safetensors --local-dir ../models/comfyui/clip

Total download: ~34 GB. For 16 GB GPUs, use t5xxl_fp8_e4m3fn.safetensors instead of t5xxl_fp16.safetensors to save ~5 GB.
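
If you go with the fp8 text encoder instead, the download mirrors the fp16 command (the filename is assumed to be present in the same text-encoder repo):

# fp8 T5 encoder (smaller, for 16 GB GPUs)
huggingface-cli download comfyanonymous/flux_text_encoders t5xxl_fp8_e4m3fn.safetensors --local-dir ../models/comfyui/clip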

Architecture

imageai/                    — ComfyUI image generation profile
  comfyui/                    — ComfyUI Dockerfile (pytorch + ComfyUI from git)
  wrapper/                    — FastAPI wrapper (OpenAI-compatible image API → ComfyUI)
    default_workflows/          — pre-seeded workflow templates (API-format JSON)
  workflows/                  — user workflow templates (persisted, customize via ComfyUI UI)
benchmark/                  — lm-evaluation-harness runner (knowledge/reasoning benchmarks)
benchmark-swe/              — mini-swe-agent runner (SWE-bench coding agent benchmarks)
benchmark-throughput/       — token throughput at multiple context sizes
monitoring/
  prometheus-llamacpp.yml   — scrape config (llama-server metrics + DCGM)
  prometheus-vllm.yml       — scrape config (vLLM metrics + DCGM)
  prometheus-imageai.yml    — scrape config (wrapper metrics + DCGM)
  grafana/
    provisioning/           — shared datasource + dashboard auto-discovery
    dashboards/
      llamacpp/llama-stack.json   — llama.cpp dashboard (tokens, GPU stats)
      vllm/vllm-stack.json        — vLLM dashboard (tokens, GPU stats)
      imageai/imageai-stack.json  — ComfyUI dashboard (generations, GPU stats)

Documentation

  • docs/prerequisites.md — Ubuntu 24.04 server setup (NVIDIA driver, Docker, Container Toolkit)
  • docs/vllm.md — vLLM tuning: automatic optimizations, per-model examples, KV cache types, switching models
  • docs/comfyui.md — ComfyUI image generation: model download, workflow customization, VRAM tuning
  • docs/benchmarking.md — LLM benchmarks, throughput benchmark, and SWE-bench coding agent benchmarks

Cleanup

docker container prune -f       # remove stopped containers
docker image prune -f           # remove dangling images
docker builder prune -f         # remove build cache
docker system prune -a -f       # remove everything unused (nuclear)
