Configuration
VulkanForge is configured through CLI flags and environment variables. Most behaviour is automatic;
the flags below are the user-facing runtime options. (VulkanForge also has many internal
diagnostic/debug env vars — VF_*_DUMP, VF_TRACE_*, GEMM micro-tuning, etc. — which are for
development only and are intentionally not documented here; never set them in production.)
Defaults below are taken from the v0.7.0 source.
| Flag | Default | Effect |
|---|---|---|
| VULKANFORGE_KV_FP8=1 | off | FP8 KV-cache: −50 % KV-cache VRAM (Gemma-4-26B-A4B: 880 → 440 MB). Recommended/needed to fit the 26B MoE in 16 GB. |
| VF_CPU_LM_HEAD=1 | off | Offload the vocabulary projection to the CPU (AVX-512 Q6_K GEMV, Zen 4 / Ice Lake+). Frees ~970 MB VRAM; on 14B FP8 it is +32 % decode, on 8B it trades ~32 % decode for the VRAM saving. |
| VF_QUANTIZE_ON_LOAD=1 | off | Quantize FP32/BF16 SafeTensors weights to Q4_K_M at load (~7× compression on quantized tensors). |
| VF_VRAM_HEADROOM_GIB=<f> | 1.0 | Warn when free VRAM drops below this (diagnostic). |
| VF_LOADER_STAGING_GIB=<f> | adaptive | Override the upload staging-buffer size (advanced; the default adapts). |
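
The −50 % figure for the FP8 KV-cache follows from storing one byte per cached element instead of two. A minimal sketch of that arithmetic, assuming a 16-bit baseline cache; the function name and the layer/head dimensions are hypothetical placeholders, not the real shape of Gemma-4-26B-A4B:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_elem):
    """Size of a K+V cache: two tensors (K and V) per layer."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem

# Hypothetical model shape, chosen only to make the ratio visible.
shape = dict(layers=32, kv_heads=8, head_dim=128, tokens=4096)

fp16 = kv_cache_bytes(**shape, bytes_per_elem=2)  # assumed 16-bit baseline
fp8 = kv_cache_bytes(**shape, bytes_per_elem=1)   # FP8: one byte per element

print(fp8 / fp16)  # 0.5, the same ratio as the documented 880 -> 440 MB
```

The ratio is independent of the model shape, so the saving scales with context length and layer count exactly as the unquantized cache does.
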
| Flag | Effect |
|---|---|
| VF_FP8=auto | Auto-detect FP8 model + native-WMMA capability + CPU-lm_head heuristics; auto-load tokenizer/chat-template from the model dir. The recommended way to run FP8. |
| VULKANFORGE_ENABLE_FP8=1 | Explicit FP8 SafeTensors loading (legacy override). |
Native FP8 WMMA itself is capability-driven (no flag): VulkanForge uses the native path iff the
driver advertises shaderFloat8CooperativeMatrix (Mesa 26.1+), else the BF16 fallback.
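
The capability-driven choice described above can be pictured as a single feature test. This is only an illustrative sketch: the function name, the return strings, and the idea of passing the advertised features as a set are assumptions made for the example, and a real implementation would query the Vulkan physical-device features instead.

```python
def select_fp8_path(device_features):
    """Pick the FP8 matmul path from the set of advertised feature names."""
    # Native path only if the driver reports the FP8 cooperative-matrix feature.
    if "shaderFloat8CooperativeMatrix" in device_features:
        return "native_fp8_wmma"
    # Otherwise fall back to the BF16 path, as the text describes.
    return "bf16_fallback"

print(select_fp8_path({"shaderFloat8CooperativeMatrix"}))  # native_fp8_wmma
print(select_fp8_path(set()))                              # bf16_fallback
```
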
| Flag | Default | Effect |
|---|---|---|
| VF_FA_COOPMAT_HD128=0 | on | Dense head_dim=128 prefill attention uses coopmat flash-attention (parity with llama). =0 falls back to the older tiled kernel. |
| VF_MOE_ROUTER_BATCHED=0 | on | Gemma-MoE router gate-projection batched through the dense GEMM (per-token → ~6 ms @p2048). =0 falls back to the per-token GEMV. |
| VF_MOE_FUSED_ROUTER=1 / =0 | decode-only fused | Default: fused router for decode, separate path for prefill (faster). =1 forces fused everywhere; =0 forces the separate path. |
| VF_MOE_GROUPED=0 | on (since v0.5.3) | Expert-grouped MoE prefill. =0 = legacy GPU-direct GEMV prefill. |
Behaviour note. The batched MoE router is value-preserving on factual/structural output, but a borderline top-k expert flip can make a free-form generation tail diverge from the per-token router. This is a deliberate, documented numerical change; VF_MOE_ROUTER_BATCHED=0 restores the exact pre-v0.7.0 routing.
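
To see why a borderline case can flip, consider two router logits that are almost tied. The sketch below is a generic top-k illustration, not VulkanForge code: the logit values are made up, and the tiny difference between the two vectors stands in for the rounding differences that two numerically different kernels (a per-token GEMV and a batched GEMM) can produce.

```python
def top_k_experts(logits, k):
    """Indices of the k largest logits (ties broken by lower index)."""
    order = sorted(range(len(logits)), key=lambda i: (-logits[i], i))
    return sorted(order[:k])

# Made-up router logits for four experts; experts 1 and 2 are nearly tied.
per_token = [2.10, 0.7500001, 0.7500000, -1.3]
batched = [2.10, 0.7500000, 0.7500001, -1.3]  # same values up to rounding noise

print(top_k_experts(per_token, 2))  # [0, 1]
print(top_k_experts(batched, 2))    # [0, 2]
```

Once a different expert is selected for one token, every later token is conditioned on a slightly different hidden state, which is how a single flip can change the tail of a free-form generation.
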
VF_PROMPT (single-shot turn) · VF_SYSTEM (system prompt) · VF_MAX_TOKENS · VF_TEMPERATURE
(0 = greedy) · VF_TOP_K · VF_TOP_P · VF_SEED · VF_REPETITION_PENALTY · VF_NO_THINK_FILTER=1
(show <think> content for reasoning models) · VF_MODEL_PATH (default model path).
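
As a rough picture of how the sampling options interact, here is a generic sampler sketch. It is not VulkanForge's implementation: the function name, the parameter defaults, and the order in which top-k, temperature, and top-p are applied are assumptions for illustration. It does reflect the one rule stated above, that a temperature of 0 means greedy decoding.

```python
import math
import random

def sample_token(logits, temperature=0.0, top_k=0, top_p=1.0, seed=None):
    """Pick a token index from raw logits (illustrative sampler)."""
    if temperature == 0:
        # Greedy: always take the highest logit.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Candidates sorted by logit, highest first; top_k=0 means "no limit".
    order = sorted(range(len(logits)), key=lambda i: -logits[i])
    if top_k > 0:
        order = order[:top_k]
    # Temperature-scaled softmax over the remaining candidates.
    peak = logits[order[0]]
    weights = [math.exp((logits[i] - peak) / temperature) for i in order]
    total = sum(weights)
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept_ids, kept_probs, cumulative = [], [], 0.0
    for i, w in zip(order, weights):
        kept_ids.append(i)
        kept_probs.append(w / total)
        cumulative += w / total
        if cumulative >= top_p:
            break
    # A fixed seed makes the draw reproducible.
    return random.Random(seed).choices(kept_ids, weights=kept_probs, k=1)[0]

print(sample_token([0.1, 2.0, 0.5]))  # 1 (greedy)
```
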
| Flag | Default | Effect |
|---|---|---|
| VF_KV_PREFIX_REUSE=1 | off | Cross-request KV reuse on serve: re-prefills only the suffix after the longest common token prefix (multi-turn/agentic). Byte-identical to full re-prefill at temp 0; single retained session. |
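
The prefix-reuse idea can be sketched in a few lines: compare the token IDs of the retained session with those of the new request, keep the KV entries for the shared prefix, and prefill only what follows. The function names and token values below are hypothetical and only illustrate the technique.

```python
def common_prefix_len(cached_tokens, new_tokens):
    """Number of leading tokens the two sequences share."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

def plan_prefill(cached_tokens, new_tokens):
    """Return (KV entries to keep, tokens that still need a prefill pass)."""
    keep = common_prefix_len(cached_tokens, new_tokens)
    return keep, new_tokens[keep:]

# Made-up token IDs: the new request shares its first three tokens
# with the retained session and then diverges.
cached = [1, 5, 9, 12, 7]
request = [1, 5, 9, 30, 31, 32]

print(plan_prefill(cached, request))  # (3, [30, 31, 32])
```

In a multi-turn chat the shared prefix is usually the whole earlier conversation, so only the newest turn needs prefill.
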
See Usage for examples and Benchmarks for what the prefill flags buy.
VulkanForge v1.0.4 · single-user RDNA 4 / gfx1201 Vulkan inference · GPL-3.0