Configuration
VulkanForge is configured through CLI flags and environment variables. Most behaviour is automatic;
the flags below are the user-facing runtime options. (VulkanForge also has many internal
diagnostic/debug env vars — VF_*_DUMP, VF_TRACE_*, GEMM micro-tuning, etc. — which are for
development only and are intentionally not documented here; never set them in production.)
Defaults below are taken from the v0.8.0 source.

| Flag | Default | Effect |
|---|---|---|
| `VULKANFORGE_KV_FP8=1` | off | FP8 KV-cache: −50 % KV-cache VRAM (Gemma-4-26B-A4B: 880 → 440 MB). Required for the Gemma-4-26B-A4B MoE (its non-FP8 KV path is known-broken → the engine aborts at load without it). Optional for everything else. |
| `VULKANFORGE_ALLOW_BROKEN_KV=1` | off | Debug-only override of the KV-FP8 guard: forces the known-broken non-FP8 KV path for the Gemma-4 MoE (output will be invalid). |
| `VF_CPU_LM_HEAD=1` | off | Offload the vocabulary projection to the CPU (AVX-512 Q6_K GEMV, Zen 4 / Ice Lake+). Frees ~970 MB VRAM; on 14B FP8 it is +32 % decode, on 8B it trades ~32 % decode for the VRAM saving. |
| `VF_QUANTIZE_ON_LOAD=1` | off | Quantize FP32/BF16 SafeTensors weights to Q4_K_M at load (~7× compression on quantized tensors). |
| `VF_VRAM_HEADROOM_GIB=<f>` | 1.0 | Warn when free VRAM drops below this (diagnostic). |
| `VF_LOADER_STAGING_GIB=<f>` | adaptive | Override the upload staging-buffer size (advanced; the default adapts). |

On serve, if you omit --ctx-size the server computes the largest safe KV-cache context from
live free VRAM (VK_EXT_memory_budget), the model's weights and training context, and the active KV
dtype — and prints a one-line, itemized rationale at startup:
auto ctx-size = 16384 (free 15.0G − weights 8.4G − reserve 1.5G = 5.1G for KV / 0.156 MiB/tok; bound: hardware LDS limit 16384)
No more guessing a value that's too small (answers truncated) or too large (out-of-memory at load).
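
The arithmetic behind that rationale line can be reproduced with a short sketch. This is not VulkanForge's code: the function name `auto_ctx_size`, the reading of "G" as GiB, the hypothetical training context of 131072, and the way the bounds are combined (a plain minimum over the VRAM budget, the training context, and the hardware limit) are all assumptions made for illustration, and the real planner may round differently.

```python
def auto_ctx_size(free_gib, weights_gib, reserve_gib, kv_mib_per_token,
                  train_ctx, hw_limit=16384):
    """Return (ctx, bound): the largest context that fits, and what limited it.

    Illustrative only; the parameter names and the min() rule are assumptions.
    """
    # VRAM left for the KV cache after weights and the scratch reserve, in MiB.
    kv_budget_mib = (free_gib - weights_gib - reserve_gib) * 1024.0
    if kv_budget_mib <= 0:
        return 0, "vram"
    candidates = {
        "vram": int(kv_budget_mib / kv_mib_per_token),
        "training context": train_ctx,
        "hardware LDS limit": hw_limit,
    }
    # The tightest of the three bounds wins.
    bound = min(candidates, key=candidates.get)
    return candidates[bound], bound


# Numbers from the example line above; train_ctx is a hypothetical value.
ctx, bound = auto_ctx_size(15.0, 8.4, 1.5, 0.156, train_ctx=131072)
print(ctx, bound)  # 16384 hardware LDS limit
```

With these inputs the 5.1 GiB KV budget would hold roughly 33 000 tokens at 0.156 MiB per token, so the hardware limit, not VRAM, is the binding constraint, which matches the `bound:` field in the printed rationale.
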

| Flag / var | Default | Effect |
|---|---|---|
| `--ctx-size <N>` | auto | Verbatim override of the auto value (also the chat context size). |
| `VF_AUTO_CTX_RESERVE_MIB=<N>` | 1536 | VRAM (MiB) held back beyond weights + KV for scratch/prefill buffers. Conservative by design. |

Hardware ceiling: 16384 tokens on RDNA4. The context is capped by the per-workgroup shared-memory (LDS) budget. Auto-ctx stays at or below it; an explicit `--ctx-size` above 16384 aborts at pipeline creation (it is not silently clamped).

| Flag | Effect |
|---|---|
| `VF_FP8=auto` | Auto-detect FP8 model + native-WMMA capability + CPU-lm_head heuristics; auto-load tokenizer/chat-template from the model dir. The recommended way to run FP8. |
| `VULKANFORGE_ENABLE_FP8=1` | Explicit FP8 SafeTensors loading (legacy override). |

Native FP8 WMMA itself is capability-driven (no flag): VulkanForge uses the native path iff the
driver advertises shaderFloat8CooperativeMatrix (Mesa 26.1+), else the BF16 fallback.

| Flag | Default | Effect |
|---|---|---|
| `VF_FA_COOPMAT_HD128=0` | on | Dense head_dim=128 prefill attention uses coopmat flash-attention (parity with llama). `=0` falls back to the older tiled kernel. |
| `VF_MOE_ROUTER_BATCHED=0` | on | Gemma-MoE router gate-projection batched through the dense GEMM (per-token → ~6 ms @p2048). `=0` falls back to the per-token GEMV. |
| `VF_MOE_FUSED_ROUTER=1` / `=0` | decode-only fused | Default: fused router for decode, separate path for prefill (faster). `=1` forces fused everywhere; `=0` forces the separate path. |
| `VF_MOE_GROUPED=0` | on (since v0.5.3) | Expert-grouped MoE prefill. `=0` = legacy GPU-direct GEMV prefill. |

Behaviour note. The batched MoE router is value-preserving on factual/structural output, but a borderline top-k expert flip can make a free-form generation tail diverge from the per-token router. This is a deliberate, documented numerical change; `VF_MOE_ROUTER_BATCHED=0` restores the exact pre-v0.7.0 routing.
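
To see why a "borderline top-k expert flip" is possible at all, consider a toy router. The sketch below is not the engine's router: the logits are made-up values, and attributing the tiny difference to a different GEMM accumulation order is an assumption used only to motivate the example.

```python
def top_k_experts(logits, k):
    """Indices of the k largest router logits, highest first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


# Hypothetical router logits for one token over four experts.
# Experts 1 and 2 are nearly tied for the second slot.
per_token = [2.10, 0.5000, 0.5001, -1.00]

# The same logits with a 2e-4 difference on expert 1, standing in for
# the rounding differences a batched GEMM might introduce (assumption).
batched = [2.10, 0.5002, 0.5001, -1.00]

print(top_k_experts(per_token, k=2))  # [0, 2]
print(top_k_experts(batched, k=2))    # [0, 1]
```

Both logit vectors agree to three decimal places, yet the second selected expert differs. Once a different expert processes a token, later tokens in a free-form generation can drift apart, which is the divergence the note describes.
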

| Flag | Default | Effect |
|---|---|---|
| `VF_USE_GRAPH=1` | off | Opt-in. Dispatch each layer's steps through a topologically-sorted dependency graph instead of the default imperative dispatch loop. The imperative path is the production default and fallback. |

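
As a rough illustration of what "topologically-sorted dependency graph" means here, the following sketch orders a handful of per-layer steps so that each one runs only after its inputs are ready. The step names and their dependencies are hypothetical, and the standard-library `graphlib` sorter stands in for whatever scheduler VulkanForge actually uses.

```python
from graphlib import TopologicalSorter

# Hypothetical steps of one transformer layer. Each key maps to the set
# of steps that must finish before it can be dispatched.
layer_steps = {
    "attn_norm": set(),
    "q_proj": {"attn_norm"},
    "k_proj": {"attn_norm"},
    "v_proj": {"attn_norm"},
    "attention": {"q_proj", "k_proj", "v_proj"},
    "attn_out": {"attention"},
    "ffn_norm": {"attn_out"},
    "ffn": {"ffn_norm"},
}


def dispatch_order(deps):
    """Return one valid dispatch order: every step after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = dispatch_order(layer_steps)
print(order)
```

The three projections depend only on `attn_norm`, so any relative order among them is valid. An imperative loop fixes one order by hand, whereas a graph makes that freedom explicit.
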
`VF_PROMPT` (single-shot turn) · `VF_SYSTEM` (system prompt) · `VF_MAX_TOKENS` · `VF_TEMPERATURE` (0 = greedy) · `VF_TOP_K` · `VF_TOP_P` · `VF_SEED` · `VF_REPETITION_PENALTY` · `VF_NO_THINK_FILTER=1` (show `<think>` content for reasoning models) · `VF_MODEL_PATH` (default model path).

| Flag | Default | Effect |
|---|---|---|
| `VF_KV_PREFIX_REUSE=1` | off | Cross-request KV reuse on serve: re-prefills only the suffix after the longest common token prefix (multi-turn/agentic). Byte-identical to full re-prefill at temp 0; single retained session. |

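
The "longest common token prefix" rule can be sketched in a few lines. The function name `split_for_reuse` and the token IDs are hypothetical, and the sketch covers only the bookkeeping, not the KV cache itself.

```python
def split_for_reuse(cached_tokens, new_tokens):
    """Return (reused_len, suffix).

    reused_len: number of leading tokens shared with the retained session,
                whose KV entries could be kept.
    suffix:     the remaining tokens that still need to be prefilled.
    """
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n, new_tokens[n:]


# Hypothetical token IDs: the previous request and a follow-up that
# shares the first three tokens.
previous = [1, 15, 27, 99, 4]
follow_up = [1, 15, 27, 42, 8, 3]

print(split_for_reuse(previous, follow_up))  # (3, [42, 8, 3])
```

In a multi-turn chat the shared prefix is usually the whole earlier conversation, so only the newest turn is prefilled. A real server presumably still has to evaluate at least one token to obtain logits when the new request matches the cache exactly; this sketch ignores that case.
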
The server-side memory subsystem is gated twice — a compile-time Cargo feature and a runtime flag — both off
by default. Build it in with cargo build --release --features memory (Rust 1.89+; see Installation), then:

| Flag / var | Default | Effect |
|---|---|---|
| `--memory` (serve flag) | off | Activate the memory subsystem on serve. Without it `/memory/*` returns 503 and no embedder/DB is loaded. On a lean binary (built without `--features memory`) it aborts with a "rebuild with `--features memory`" message before the model loads. |
| `VULKANFORGE_MEMORY=1` | off | Env alias for `--memory`. |
| `VF_MEMORY_DB=<path>` | `~/.vulkanforge/memory.db` | Path to the memory SQLite store (the embedding model is cached in a sibling `embed-cache/`). |

The embedder runs on the CPU and never takes the GPU permit (no VRAM). What it is and how it works: Memory.
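
The two-level gate amounts to a small decision table, sketched below. The function name and the returned strings are illustrative, not actual server output, and `memory_flag` stands for either `--memory` or `VULKANFORGE_MEMORY=1`. The page does not spell out what a lean binary does when the flag is absent; the sketch assumes it behaves like any other run with memory off.

```python
def memory_startup(built_with_memory_feature, memory_flag):
    """Hypothetical outcome of the compile-time + runtime memory gate."""
    if memory_flag and not built_with_memory_feature:
        # Runtime flag requested on a binary that lacks the feature.
        return "abort before model load: rebuild with --features memory"
    if memory_flag:
        return "memory active: embedder and DB loaded"
    # Flag absent: subsystem stays off (assumed the same for both builds).
    return "memory off: /memory/* returns 503"


for built in (False, True):
    for flag in (False, True):
        print(f"feature={built!s:5} flag={flag!s:5} -> "
              f"{memory_startup(built, flag)}")
```
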
See Usage for examples and Benchmarks for what the prefill flags buy.
VulkanForge v1.0.4 · single-user RDNA 4 / gfx1201 Vulkan inference · GPL-3.0 ·
Repository · Releases