LLM Host — llama.cpp / vLLM / ComfyUI with CUDA

Docker-based LLM inference and image generation with NVIDIA GPU acceleration. Supports three backends via Docker Compose profiles:

  • llama.cpp (--profile llamacpp) — compiled from source, supports GGUF models, interactive CLI
  • vLLM (--profile vllm) — official pre-built image, better multi-GPU tensor parallelism, HuggingFace model format
  • ComfyUI (--profile imageai) — image generation with visual workflow editor, OpenAI-compatible image API

The host provides only the NVIDIA driver + Docker; all CUDA and inference software lives in containers.

Features:

  • OpenAI-compatible APIs (text chat and image generation)
  • Multi-GPU support (layer splitting for llama.cpp, tensor parallelism for vLLM)
  • Visual workflow editor for image generation pipelines (ComfyUI)
  • Browser chat interface (Open WebUI)
  • GPU monitoring dashboard (Grafana + Prometheus + DCGM) — separate per backend
  • Fully configurable via .env — one file to tune models, context size, GPU split, CUDA version

Prerequisites

Ubuntu 24.04 LTS Server with an NVIDIA GPU. See docs/prerequisites.md for the full setup guide (NVIDIA driver, Docker, Container Toolkit).
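
Before building anything, it is worth checking that Docker can see the GPU through the Container Toolkit. A minimal smoke test (the CUDA image tag is only an example; any recent nvidia/cuda base image will do):

docker run --rm --gpus all nvidia/cuda:12.8.1-base-ubuntu24.04 nvidia-smi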

Quick Start

# 1. Copy and edit config
cp .env.example .env
# Edit .env — set LLAMACPP_MODEL, LLAMACPP_TENSOR_SPLIT, LLAMACPP_CUDA_DOCKER_ARCH for your hardware

# 2. Build llama.cpp image (required for llamacpp profile)
docker compose build

# 3. Start with llama.cpp
docker compose --profile llamacpp up -d

# — or start with vLLM (no build needed, uses official image) —
docker compose --profile vllm up -d

This starts: API server (port 8080), Web UI (port 3000, text backends only), Grafana dashboard (port 4000).
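
A quick way to confirm the stack is up: both text backends are expected to expose the standard OpenAI model-listing endpoint, and the container logs show model-loading progress (substitute the profile you started):

# List the loaded model(s); add -H "Authorization: Bearer YOUR_API_KEY" if API_KEY is set
curl http://localhost:8080/v1/models

# Follow startup logs
docker compose --profile llamacpp logs -f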

Image Generation (ComfyUI)

# 1. Download model files (see Models section below)
# 2. Build and start
docker compose --profile imageai build
docker compose --profile imageai up -d

This starts: API server (port 8080), ComfyUI visual editor (port 8188), Grafana dashboard (port 4000).

To switch backends, stop one and start the other:

docker compose --profile llamacpp down
docker compose --profile vllm up -d

# or switch to image generation
docker compose --profile vllm down
docker compose --profile imageai up -d
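
To check what is currently running before switching (depending on your Compose version you may need to pass the active profile):

docker compose ps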

Configuration

All config lives in .env. Copy .env.example and adjust for your hardware.

Model (llama.cpp)

Variable                Default         Description
LLAMACPP_MODEL          tinyllama.gguf  GGUF filename in ../models/
LLAMACPP_GPU_LAYERS     99              Layers to offload to GPU (99 = all)
LLAMACPP_CTX_SIZE       2048            Context window size
LLAMACPP_FLASH_ATTN     on              Flash attention
LLAMACPP_CACHE_TYPE_K   q8_0            KV cache type (key)
LLAMACPP_CACHE_TYPE_V   q8_0            KV cache type (value)
LLAMACPP_REASONING      off             Reasoning/thinking mode (on/off)
LLAMACPP_EXTRA_ARGS     --jinja         Additional llama-server flags (space-separated)

Multi-GPU (llama.cpp)

Variable                Default  Description
LLAMACPP_TENSOR_SPLIT   1        GPU memory ratio. See examples below

# Single GPU
LLAMACPP_TENSOR_SPLIT=1

# 2 equal GPUs (e.g. 2x RTX 3090)
LLAMACPP_TENSOR_SPLIT=1,1

# 4 equal GPUs
LLAMACPP_TENSOR_SPLIT=1,1,1,1

# Unequal VRAM (24GB + 16GB)
LLAMACPP_TENSOR_SPLIT=3,2

vLLM

Variable                Default               Description
VLLM_MODEL              Qwen/Qwen3.6-27B-FP8  HuggingFace model ID or local path
VLLM_TENSOR_PARALLEL    2                     Number of GPUs for tensor parallelism
VLLM_MAX_MODEL_LEN      131072                Maximum context length
VLLM_GPU_MEM_UTIL       0.92                  Fraction of GPU memory vLLM may use (weights + KV cache)
VLLM_MAX_BATCH_TOKENS   8192                  Max tokens per prefill batch
VLLM_EXTRA_ARGS         (empty)               Model-specific CLI flags (see below)
HF_TOKEN                (empty)               HuggingFace token for gated models

vLLM uses official pre-built Docker images — no build step required. For best results, use HuggingFace model IDs (e.g., Qwen/Qwen3-27B); models are downloaded and cached automatically. The stack applies performance optimizations automatically (CUDA allocator tuning, PCIe multi-GPU fixes, prefix caching, chunked prefill). See docs/vllm.md for per-model VLLM_EXTRA_ARGS examples and tuning details.
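
Compose also picks these variables up from the shell, and shell values take precedence over .env, so you can trial a different model without editing the file (the model ID and context length below are only illustrative):

VLLM_MODEL=Qwen/Qwen3-14B-AWQ VLLM_MAX_MODEL_LEN=32768 docker compose --profile vllm up -d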

ImageAI (ComfyUI)

Variable                  Default                    Description
IMAGEAI_CHECKPOINT        flux1-dev-fp8.safetensors  Checkpoint filename in ../models/comfyui/checkpoints/
IMAGEAI_DEFAULT_WORKFLOW  flux1-dev-fp8.json         Workflow template filename in ./imageai/workflows/
IMAGEAI_DEFAULT_WIDTH     1024                       Default image width
IMAGEAI_DEFAULT_HEIGHT    1024                       Default image height
COMFYUI_PORT              8188                       Host port for ComfyUI visual editor
COMFYUI_REF               latest                     ComfyUI git ref to build (tag, branch, or commit)
COMFYUI_ARGS              (empty)                    ComfyUI startup flags (use --lowvram only if OOM; can blur output)

See docs/comfyui.md for model download instructions, workflow customization, and tuning details.

Build (GPU architecture, llama.cpp only)

Variable                   Default  Description
LLAMACPP_CUDA_DOCKER_ARCH  default  GPU compute capability. See table below
LLAMACPP_CUDA_VERSION      12.8.1   CUDA toolkit version (must match driver)

GPU                                 LLAMACPP_CUDA_DOCKER_ARCH  Min LLAMACPP_CUDA_VERSION  Min driver
Universal (all GPUs, slower build)  default                    12.4.0                     525+
RTX 50 series (Blackwell)           120                        12.8.1                     550+
RTX 40 series (Ada Lovelace)        89                         12.0                       525+
RTX 30 series (Ampere)              86                         11.1                       455+

# Example: rebuild for RTX 4090 with CUDA 12.8
LLAMACPP_CUDA_DOCKER_ARCH=89 LLAMACPP_CUDA_VERSION=12.8.1 docker compose build

Ports and services

Variable              Default  Description
API_KEY               (empty)  API key for authentication. Leave empty for no auth
PORT                  8080     API server port
WEBUI_PORT            3000     Open WebUI port (only available with text backends)
GRAFANA_PORT          4000     Grafana dashboard port
GRAFANA_USER          admin    Grafana username
GRAFANA_PASSWORD      admin    Grafana password
PROMETHEUS_RETENTION  7d       Metrics retention

Usage

CLI — Interactive Testing (llama.cpp only)

docker compose --profile llamacpp run --rm llamacpp-cli -m /models/YOUR_MODEL.gguf \
  --n-gpu-layers 99 --ctx-size 8192 --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --split-mode layer --tensor-split 1,1

API — OpenAI-Compatible

Both text backends expose the same OpenAI-compatible API on port 8080.

# Without auth (API_KEY empty)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"YOUR_MODEL","messages":[{"role":"user","content":"Hello"}],"max_tokens":256}'

# With auth (API_KEY set in .env)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{"model":"YOUR_MODEL","messages":[{"role":"user","content":"Hello"}],"max_tokens":256}'

Web UI

Open http://<host-ip>:3000 in your browser. Works with either text-generation backend. First visit creates a local admin account.

Image Generation API

The imageai profile exposes an OpenAI-compatible image generation API on port 8080.

# Generate an image
curl -X POST http://localhost:8080/v1/images/generations \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"a cat sitting on a windowsill, sunlight streaming in","size":"1024x1024"}'

# Custom resolution
curl -X POST http://localhost:8080/v1/images/generations \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"hero header for a dental website","size":"1600x900"}'

# Save response and extract image
curl -X POST http://localhost:8080/v1/images/generations \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"a mountain landscape at sunset"}' \
  -o response.json
python3 -c "import json,base64;d=json.load(open('response.json'));open('output.png','wb').write(base64.b64decode(d['data'][0]['b64_json']))"

Request parameters: prompt (required), size (WIDTHxHEIGHT), n (1-10 images), negative_prompt, model (workflow template name).
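
When n is greater than 1, each image is expected to arrive as a separate entry in data with its own b64_json field (response shape assumed from the OpenAI images API); a sketch that saves all of them:

# Two variations with a negative prompt, saved as out_0.png and out_1.png
curl -X POST http://localhost:8080/v1/images/generations \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"a ceramic mug on a wooden table","negative_prompt":"text, watermark","n":2,"size":"1024x1024"}' \
  -o response.json
python3 -c "import json,base64;d=json.load(open('response.json'));[open(f'out_{i}.png','wb').write(base64.b64decode(x['b64_json'])) for i,x in enumerate(d['data'])]"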

Monitoring

Open http://<host-ip>:4000 for the Grafana dashboard (login from .env). Each backend has its own dashboard with appropriate metrics:

  • llama.cpp: Active requests, total tokens, last request performance (prompt/gen tok/s)
  • vLLM: Active requests, total tokens, throughput rates (prompt/gen tok/s)
  • ComfyUI: Active generations, total images generated, average generation time

All three dashboards show GPU panels: utilization, VRAM, temperature, power draw, clock speeds.

Benchmarking

Benchmarks work with either backend. Activate the backend profile alongside bench. Model names are auto-detected from the server — no manual configuration needed when switching backends.

# Rebuild after code changes in benchmark directories
docker compose --profile bench build

# LLM benchmarks (GSM8K, MMLU, etc.) — with llama.cpp
docker compose --profile llamacpp --profile bench run --rm benchmark

# Same benchmarks — with vLLM
docker compose --profile vllm --profile bench run --rm benchmark

# Token throughput at different context sizes
docker compose --profile llamacpp --profile bench run --rm benchmark-throughput
docker compose --profile vllm --profile bench run --rm benchmark-throughput

# SWE-bench coding agent benchmark
SWE_SLICE="0:5" docker compose --profile llamacpp --profile bench run --rm benchmark-swe

# SWE-bench against a frontier API (backend profile doesn't matter)
SWE_API_MODE=frontier SWE_API_URL=https://api.openai.com/v1 API_KEY=sk-... \
  SWE_MODEL_NAME=openai/gpt-5 \
  SWE_SLICE="0:5" docker compose --profile bench run --rm benchmark-swe

Full benchmarking documentation: docs/benchmarking.md

Example Setups

Single RTX 4090 (24 GB) — llama.cpp

LLAMACPP_MODEL=Qwen_Qwen3.6-27B-Q4_K_M.gguf
LLAMACPP_CTX_SIZE=32768
LLAMACPP_TENSOR_SPLIT=1
LLAMACPP_CUDA_DOCKER_ARCH=89
LLAMACPP_CUDA_VERSION=12.8.1

Dual RTX 3090 (2x 24 GB) — llama.cpp

LLAMACPP_MODEL=Qwen_Qwen3.6-27B-Q5_K_M.gguf
LLAMACPP_CTX_SIZE=131072
LLAMACPP_TENSOR_SPLIT=1,1
LLAMACPP_CUDA_DOCKER_ARCH=86
LLAMACPP_CUDA_VERSION=12.8.1

Dual RTX 5080 + 5060 Ti (2x 16 GB, Blackwell) — llama.cpp

# Dense 27B — best quality
LLAMACPP_MODEL=Qwen_Qwen3.6-27B-Q5_K_M.gguf
LLAMACPP_CTX_SIZE=131072
LLAMACPP_TENSOR_SPLIT=1,1
LLAMACPP_CUDA_DOCKER_ARCH=120
LLAMACPP_CUDA_VERSION=13.0.0
LLAMACPP_EXTRA_ARGS=--jinja

# — or MoE 35B-A3B (35B total / 3B active) — much faster, similar quality —
# LLAMACPP_MODEL=Qwen_Qwen3.6-35B-A3B-Q5_K_M.gguf

Dual RTX 5080 + 5060 Ti (2x 16 GB, Blackwell) — vLLM

VLLM_MODEL=Qwen/Qwen3-14B-AWQ
VLLM_TENSOR_PARALLEL=2
VLLM_MAX_MODEL_LEN=40960
VLLM_EXTRA_ARGS=--kv-cache-dtype fp8

Models

  • llama.cpp: Place .gguf files in ../models/ (sibling directory). Mounted read-only at /models/.
  • vLLM: Uses HuggingFace model IDs. Models are downloaded to ~/.cache/huggingface/ on first run. For gated models, set HF_TOKEN in .env.
  • ComfyUI: Place model files in ../models/comfyui/ with the expected subdirectory structure:
    models/comfyui/
      checkpoints/         — diffusion model (e.g., flux1-dev.safetensors)
      vae/                 — VAE (e.g., ae.safetensors)
      clip/                — text encoders (e.g., clip_l.safetensors, t5xxl_fp16.safetensors)
      loras/               — LoRA fine-tunes
    

Downloading Flux.1-dev

Flux.1-dev is a gated model — accept the license at black-forest-labs/FLUX.1-dev, then:

# Install HF CLI (one-time)
pipx install huggingface_hub
huggingface-cli login    # paste your token

# Create directories
mkdir -p ../models/comfyui/{checkpoints,vae,clip}

# Download checkpoint + VAE (gated, requires license acceptance)
huggingface-cli download black-forest-labs/FLUX.1-dev flux1-dev.safetensors --local-dir ../models/comfyui/checkpoints
huggingface-cli download black-forest-labs/FLUX.1-dev ae.safetensors --local-dir ../models/comfyui/vae

# Download text encoders (ungated)
huggingface-cli download comfyanonymous/flux_text_encoders clip_l.safetensors t5xxl_fp16.safetensors --local-dir ../models/comfyui/clip

Total download: ~34 GB. For 16 GB GPUs, use t5xxl_fp8_e4m3fn.safetensors instead of t5xxl_fp16.safetensors to save ~5 GB.
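
If you go with the fp8 text encoder instead, the download mirrors the fp16 command (the filename is assumed to be present in the same text-encoder repo):

# fp8 T5 encoder (smaller, for 16 GB GPUs)
huggingface-cli download comfyanonymous/flux_text_encoders t5xxl_fp8_e4m3fn.safetensors --local-dir ../models/comfyui/clip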

Architecture

imageai/                    — ComfyUI image generation profile
  comfyui/                    — ComfyUI Dockerfile (pytorch + ComfyUI from git)
  wrapper/                    — FastAPI wrapper (OpenAI-compatible image API → ComfyUI)
    default_workflows/          — pre-seeded workflow templates (API-format JSON)
  workflows/                  — user workflow templates (persisted, customize via ComfyUI UI)
benchmark/                  — lm-evaluation-harness runner (knowledge/reasoning benchmarks)
benchmark-swe/              — mini-swe-agent runner (SWE-bench coding agent benchmarks)
benchmark-throughput/       — token throughput at multiple context sizes
monitoring/
  prometheus-llamacpp.yml   — scrape config (llama-server metrics + DCGM)
  prometheus-vllm.yml       — scrape config (vLLM metrics + DCGM)
  prometheus-imageai.yml    — scrape config (wrapper metrics + DCGM)
  grafana/
    provisioning/           — shared datasource + dashboard auto-discovery
    dashboards/
      llamacpp/llama-stack.json   — llama.cpp dashboard (tokens, GPU stats)
      vllm/vllm-stack.json        — vLLM dashboard (tokens, GPU stats)
      imageai/imageai-stack.json  — ComfyUI dashboard (generations, GPU stats)

Documentation

  • docs/prerequisites.md — Ubuntu 24.04 server setup (NVIDIA driver, Docker, Container Toolkit)
  • docs/vllm.md — vLLM tuning: automatic optimizations, per-model examples, KV cache types, switching models
  • docs/comfyui.md — ComfyUI image generation: model download, workflow customization, VRAM tuning
  • docs/benchmarking.md — LLM benchmarks, throughput benchmark, and SWE-bench coding agent benchmarks

Cleanup

docker container prune -f       # remove stopped containers
docker image prune -f           # remove dangling images
docker builder prune -f         # remove build cache
docker system prune -a -f       # remove everything unused (nuclear)
