Configuration

Configuration & Flags

VulkanForge is configured through CLI flags and environment variables. Most behaviour is automatic; the flags below are the user-facing runtime options. VulkanForge also has many internal diagnostic/debug env vars (VF_*_DUMP, VF_TRACE_*, GEMM micro-tuning, etc.). These are for development only and are intentionally not documented here; never set them in production.

Defaults below are taken from the v0.8.0 source.

Precision & memory

Flag	Default	Effect
`VULKANFORGE_KV_FP8=1`	off	FP8 KV-cache — −50 % KV-cache VRAM (Gemma-4-26B-A4B: 880 → 440 MB). Required for the Gemma-4-26B-A4B MoE (its non-FP8 KV path is known-broken → the engine aborts at load without it). Optional for everything else.
`VULKANFORGE_ALLOW_BROKEN_KV=1`	off	Debug-only override of the KV-FP8 guard — forces the known-broken non-FP8 KV path for the Gemma-4 MoE (output will be invalid).
`VF_CPU_LM_HEAD=1`	off	Offload the vocabulary projection to the CPU (AVX-512 Q6_K GEMV, Zen 4 / Ice Lake+). Frees ~970 MB VRAM. On 14B FP8 this speeds decode up by 32 %; on 8B it costs ~32 % decode speed in exchange for the VRAM saving.
`VF_QUANTIZE_ON_LOAD=1`	off	Quantize FP32/BF16 SafeTensors weights to Q4_K_M at load (~7× compression on quantized tensors).
`VF_VRAM_HEADROOM_GIB=<f>`	1.0	Warn when free VRAM drops below this (diagnostic).
`VF_LOADER_STAGING_GIB=<f>`	adaptive	Override the upload staging-buffer size (advanced; the default adapts).
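
The −50 % figure for the FP8 KV-cache follows from the element width alone. The sketch below is a back-of-the-envelope estimate, not VulkanForge's allocator: it assumes the non-FP8 cache stores 16-bit elements, ignores any scale or padding overhead, and uses a hypothetical model shape (`layers`, `kv_heads`, `head_dim` and `tokens` are illustrative values, not those of any model named here).

```python
def kv_cache_mib(layers, kv_heads, head_dim, tokens, bytes_per_elem):
    """Size of the K and V caches in MiB, ignoring any per-block scale data."""
    elems_per_token = 2 * layers * kv_heads * head_dim  # 2 = one K + one V
    return elems_per_token * bytes_per_elem * tokens / (1024 * 1024)


# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
shape = dict(layers=32, kv_heads=8, head_dim=128, tokens=4096)

fp16_mib = kv_cache_mib(**shape, bytes_per_elem=2)  # assumed 16-bit baseline
fp8_mib = kv_cache_mib(**shape, bytes_per_elem=1)   # one byte per element
print(f"16-bit KV: {fp16_mib:.0f} MiB, FP8 KV: {fp8_mib:.0f} MiB")
```

Halving the bytes per element halves the cache at any context length, which is the same ratio as the 880 → 440 MB figure quoted in the table.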

Context size (automatic since v0.8.0)

On serve, if you omit --ctx-size the server computes the largest safe KV-cache context from live free VRAM (VK_EXT_memory_budget), the model's weights and training context, and the active KV dtype — and prints a one-line, itemized rationale at startup:

auto ctx-size = 16384 (free 15.0G − weights 8.4G − reserve 1.5G = 5.1G for KV / 0.156 MiB/tok; bound: hardware LDS limit 16384)

This removes the guesswork of picking a value that is too small (answers truncated) or too large (out-of-memory at load).
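
The arithmetic in the example startup line can be reproduced by hand. The sketch below is only a reading of that line, under stated assumptions: "G" is taken to mean GiB, the result is taken to be the smaller of the VRAM-derived token count and the hardware bound, and the model's training context (which the server also considers) is left out. The function name `auto_ctx_size` and its parameters are hypothetical.

```python
# Worked arithmetic for the example startup line.
# Assumptions (not confirmed by the VulkanForge source): "G" means GiB, the
# final value is min(VRAM-derived tokens, hardware bound), no extra rounding.

def auto_ctx_size(free_gib, weights_gib, reserve_gib, kv_mib_per_token, hard_limit):
    """Return (kv_budget_gib, vram_tokens, ctx_size) for the given inputs."""
    kv_budget_gib = free_gib - weights_gib - reserve_gib
    vram_tokens = int(kv_budget_gib * 1024 / kv_mib_per_token)
    ctx_size = max(0, min(vram_tokens, hard_limit))
    return kv_budget_gib, vram_tokens, ctx_size


budget, vram_tokens, ctx = auto_ctx_size(
    free_gib=15.0, weights_gib=8.4, reserve_gib=1.5,
    kv_mib_per_token=0.156, hard_limit=16384,
)
print(f"KV budget {budget:.1f} GiB, fits ~{vram_tokens} tokens, ctx-size {ctx}")
```

With these inputs the VRAM budget alone would allow roughly twice the reported context, so the hardware limit is the binding constraint, matching the `bound:` field in the log line.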

Flag / var	Default	Effect
`--ctx-size <N>`	auto	Verbatim override of the auto value (also the `chat` context size).
`VF_AUTO_CTX_RESERVE_MIB=<N>`	1536	VRAM (MiB) held back beyond weights + KV for scratch/prefill buffers. Conservative by design.

Hardware ceiling: 16384 tokens on RDNA4. The context is capped by the per-workgroup shared-memory (LDS) budget. Auto-ctx stays at or below it; an explicit --ctx-size above 16384 aborts at pipeline creation (it is not silently clamped).

FP8 SafeTensors

Flag	Effect
`VF_FP8=auto`	Auto-detect FP8 model + native-WMMA capability + CPU-lm_head heuristics; auto-load tokenizer/chat-template from the model dir. The recommended way to run FP8.
`VULKANFORGE_ENABLE_FP8=1`	Explicit FP8 SafeTensors loading (legacy override).

Native FP8 WMMA itself is capability-driven (no flag): VulkanForge uses the native path iff the driver advertises shaderFloat8CooperativeMatrix (Mesa 26.1+), else the BF16 fallback.

Prefill performance (v0.7.0, all default-ON — opt-outs for A/B)

Flag	Default	Effect
`VF_FA_COOPMAT_HD128=0`	on	Dense `head_dim=128` prefill attention uses coopmat flash-attention (parity with llama). `=0` falls back to the older tiled kernel.
`VF_MOE_ROUTER_BATCHED=0`	on	Gemma-MoE router gate-projection batched through the dense GEMM (per-token → ~6 ms @p2048). `=0` falls back to the per-token GEMV.
`VF_MOE_FUSED_ROUTER=1` / `=0`	decode-only fused	Default: fused router for decode, separate path for prefill (faster). `=1` forces fused everywhere; `=0` forces the separate path.
`VF_MOE_GROUPED=0`	on (since v0.5.3)	Expert-grouped MoE prefill. `=0` = legacy GPU-direct GEMV prefill.

Behaviour note. The batched MoE router is value-preserving on factual/structural output, but a borderline top-k expert flip can make a free-form generation tail diverge from the per-token router. This is a deliberate, documented numerical change; VF_MOE_ROUTER_BATCHED=0 restores the exact pre-v0.7.0 routing.
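
To see how a borderline top-k flip can arise, consider two experts whose router logits are nearly tied. The sketch below is an illustration with made-up logit values, assuming the batched path produces logits that differ from the per-token path by rounding-sized amounts; it does not reproduce VulkanForge's router, and `top_k_experts` is a hypothetical helper.

```python
# Illustration only: a tiny numerical difference in router logits can change
# which experts land in the top-k. All logit values below are made up.

def top_k_experts(logits, k):
    """Return the indices of the k largest logits, highest first."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return ranked[:k]


# Hypothetical router logits for one token from the per-token path.
per_token = [2.310, 0.875, 1.5002, 1.5001, -0.40]

# The same token through the batched path: experts 2 and 3 are nearly tied,
# and a shift of 1e-4 in each (assumed rounding noise) swaps their order.
batched = [2.310, 0.875, 1.5001, 1.5002, -0.40]

print(top_k_experts(per_token, k=2))  # [0, 2]
print(top_k_experts(batched, k=2))    # [0, 3]
```

Both selections are equally defensible numerically, but the token is then processed by a different expert, which is why long free-form output can drift while factual or structural output stays the same.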

Advanced

Flag	Default	Effect
`VF_USE_GRAPH=1`	off	Opt-in. Dispatch each layer's steps through a topologically-sorted dependency graph instead of the default imperative dispatch loop. The imperative path is the production default and fallback.
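
As a rough picture of what dispatching through a topologically-sorted dependency graph means, the sketch below orders a set of per-layer steps so that each runs only after the steps it depends on. It uses Python's standard `graphlib` module; the step names and their dependencies are hypothetical and are not VulkanForge's actual step list or scheduler.

```python
from graphlib import TopologicalSorter

# Hypothetical per-layer steps; each maps to the set of steps it depends on.
# These names are illustrative, not VulkanForge's actual step list.
layer_steps = {
    "attn_norm": set(),
    "qkv_proj": {"attn_norm"},
    "attention": {"qkv_proj"},
    "attn_out": {"attention"},
    "ffn_norm": {"attn_out"},
    "ffn_gate": {"ffn_norm"},
    "ffn_up": {"ffn_norm"},
    "ffn_down": {"ffn_gate", "ffn_up"},
}


def dispatch_order(steps):
    """Return one valid execution order: every step after its dependencies."""
    return list(TopologicalSorter(steps).static_order())


order = dispatch_order(layer_steps)
print(order)
```

An imperative loop hard-codes one such order. A graph makes the dependencies explicit, so independent steps (here `ffn_gate` and `ffn_up`) are visibly free to be reordered or overlapped.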

Sampling & input (env forms; most also have CLI flags)

VF_PROMPT (single-shot turn) · VF_SYSTEM (system prompt) · VF_MAX_TOKENS · VF_TEMPERATURE (0 = greedy) · VF_TOP_K · VF_TOP_P · VF_SEED · VF_REPETITION_PENALTY · VF_NO_THINK_FILTER=1 (show <think> content for reasoning models) · VF_MODEL_PATH (default model path).

Serve-only

Flag	Default	Effect
`VF_KV_PREFIX_REUSE=1`	off	Cross-request KV reuse on `serve`: re-prefills only the suffix after the longest common token prefix (multi-turn/agentic). Byte-identical to full re-prefill at temp 0; single retained session.
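
The prefix-reuse idea can be sketched in a few lines: compare the new request's tokens with those of the retained session, keep the KV entries for the shared prefix, and prefill only what follows. This is a minimal illustration under that description, not the server's implementation; `common_prefix_len`, `plan_prefill` and the token ids are hypothetical.

```python
def common_prefix_len(cached_tokens, new_tokens):
    """Length of the longest common prefix of two token-id sequences."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n


def plan_prefill(cached_tokens, new_tokens):
    """Split a request into (reused_count, suffix_to_prefill)."""
    keep = common_prefix_len(cached_tokens, new_tokens)
    return keep, new_tokens[keep:]


# Hypothetical token ids: turn 2 extends the conversation from turn 1.
turn1 = [1, 15, 27, 99, 4]
turn2 = [1, 15, 27, 99, 4, 63, 8]
reused, suffix = plan_prefill(turn1, turn2)
print(reused, suffix)  # 5 [63, 8]
```

In a multi-turn chat each request usually repeats the whole earlier conversation, so the reusable prefix grows with every turn and only the newest message needs prefilling.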

Memory (optional, default off)

The server-side memory subsystem is gated twice — a compile-time Cargo feature and a runtime flag — both off by default. Build it in with cargo build --release --features memory (Rust 1.89+; see Installation), then:

Flag / var	Default	Effect
`--memory` (serve flag)	off	Activate the memory subsystem on `serve`. Without it `/memory/*` returns 503 and no embedder/DB is loaded. On a lean (non-`--features memory`) binary it aborts with a `rebuild with --features memory` message before the model loads.
`VULKANFORGE_MEMORY=1`	off	Env alias for `--memory`.
`VF_MEMORY_DB=<path>`	`~/.vulkanforge/memory.db`	Path to the memory SQLite store (the embedding model is cached in a sibling `embed-cache/`).

The embedder runs on the CPU and never takes the GPU permit (no VRAM). See Memory for what it is and how it works.

See Usage for examples and Benchmarks for what the prefill flags buy.

VulkanForge v1.0.4 · single-user RDNA 4 / gfx1201 Vulkan inference · GPL-3.0 · Repository · Releases
