maeddesg edited this page Jun 9, 2026 · 4 revisions

Configuration & Flags

VulkanForge is configured through CLI flags and environment variables. Most behaviour is automatic; the flags below are the user-facing runtime options. VulkanForge also has many internal diagnostic and debug environment variables (VF_*_DUMP, VF_TRACE_*, GEMM micro-tuning, and so on). These are for development only, are intentionally not documented here, and should never be set in production.

Defaults below are taken from the v0.7.0 source.

Precision & memory

| Flag | Default | Effect |
|---|---|---|
| `VULKANFORGE_KV_FP8=1` | off | FP8 KV cache: −50 % KV-cache VRAM (Gemma-4-26B-A4B: 880 → 440 MB). Recommended; needed to fit the 26B MoE in 16 GB. |
| `VF_CPU_LM_HEAD=1` | off | Offload the vocabulary projection to the CPU (AVX-512 Q6_K GEMV, Zen 4 / Ice Lake+). Frees ~970 MB VRAM. On 14B FP8 it gives +32 % decode; on 8B it costs ~32 % decode in exchange for the VRAM saving. |
| `VF_QUANTIZE_ON_LOAD=1` | off | Quantize FP32/BF16 SafeTensors weights to Q4_K_M at load (~7× compression on quantized tensors). |
| `VF_VRAM_HEADROOM_GIB=<f>` | 1.0 | Warn when free VRAM drops below this value, in GiB (diagnostic). |
| `VF_LOADER_STAGING_GIB=<f>` | adaptive | Override the upload staging-buffer size, in GiB (advanced; the default adapts). |
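
As a sketch of how the memory flags can be combined from a launcher script, the snippet below builds an environment mapping for a VRAM-constrained run. The helper name `memory_env` is hypothetical and not part of VulkanForge; only the variable names and values are taken from the table above. Launching the binary itself is left out, since the command line is covered in Usage.

```python
import os


def memory_env(base=None, kv_fp8=True, cpu_lm_head=False, headroom_gib=None):
    """Return a copy of `base` (default: os.environ) with memory flags set.

    Hypothetical helper: it only sets environment variables documented above.
    """
    env = dict(os.environ if base is None else base)
    if kv_fp8:
        env["VULKANFORGE_KV_FP8"] = "1"      # FP8 KV cache
    if cpu_lm_head:
        env["VF_CPU_LM_HEAD"] = "1"          # vocabulary projection on the CPU
    if headroom_gib is not None:
        env["VF_VRAM_HEADROOM_GIB"] = str(headroom_gib)  # warning threshold
    return env


# Start from an empty base so the result shows only the flags that were added.
env = memory_env(base={}, kv_fp8=True, headroom_gib=1.5)
print(env)  # {'VULKANFORGE_KV_FP8': '1', 'VF_VRAM_HEADROOM_GIB': '1.5'}
```

The resulting mapping can be passed as the `env` argument of `subprocess.run` when starting the process.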

FP8 SafeTensors

| Flag | Effect |
|---|---|
| `VF_FP8=auto` | Auto-detect FP8 model + native-WMMA capability + CPU-lm_head heuristics; auto-load tokenizer/chat-template from the model dir. The recommended way to run FP8. |
| `VULKANFORGE_ENABLE_FP8=1` | Explicit FP8 SafeTensors loading (legacy override). |

Native FP8 WMMA itself is capability-driven and has no flag: VulkanForge uses the native path if and only if the driver advertises shaderFloat8CooperativeMatrix (Mesa 26.1+); otherwise it uses the BF16 fallback.
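
The capability check described above amounts to a single branch on an advertised feature. The following is a minimal sketch of that decision, not VulkanForge's actual code: the function name and the two path labels are hypothetical, and the feature set is assumed to have been queried from the driver elsewhere.

```python
def select_fp8_matmul_path(device_features):
    """Pick the FP8 matmul path from a set of advertised feature names.

    Hypothetical sketch: the returned labels are illustrative only.
    """
    if "shaderFloat8CooperativeMatrix" in device_features:
        return "native-fp8-wmma"   # driver exposes FP8 cooperative matrices
    return "bf16-fallback"         # no native FP8 support advertised


print(select_fp8_matmul_path({"shaderFloat8CooperativeMatrix"}))  # native-fp8-wmma
print(select_fp8_matmul_path(set()))                              # bf16-fallback
```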

Prefill performance (v0.7.0, all default-ON — opt-outs for A/B)

| Flag | Default | Effect |
|---|---|---|
| `VF_FA_COOPMAT_HD128=0` | on | Dense head_dim=128 prefill attention uses coopmat flash-attention (parity with llama). `=0` falls back to the older tiled kernel. |
| `VF_MOE_ROUTER_BATCHED=0` | on | Gemma-MoE router gate projection is batched through the dense GEMM (per-token → ~6 ms @p2048). `=0` falls back to the per-token GEMV. |
| `VF_MOE_FUSED_ROUTER=1` / `=0` | decode-only fused | Default: fused router for decode, separate path for prefill (faster). `=1` forces fused everywhere; `=0` forces the separate path. |
| `VF_MOE_GROUPED=0` | on (since v0.5.3) | Expert-grouped MoE prefill. `=0` selects the legacy GPU-direct GEMV prefill. |
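
Because the on-by-default flags exist mainly as A/B opt-outs, one way to organise a comparison is to generate one environment override per flag, plus a baseline with nothing set. The sketch below only builds those overrides; the variant names and the `ab_variants` helper are hypothetical, and running or timing the actual workload is left to the caller.

```python
# Opt-out values for the on-by-default prefill flags listed above.
PREFILL_OPT_OUTS = {
    "VF_FA_COOPMAT_HD128": "0",
    "VF_MOE_ROUTER_BATCHED": "0",
    "VF_MOE_GROUPED": "0",
}


def ab_variants():
    """Return {variant_name: env_overrides}, disabling one flag per variant."""
    variants = {"baseline": {}}  # all defaults, nothing overridden
    for name, value in PREFILL_OPT_OUTS.items():
        variants[f"no-{name}"] = {name: value}
    return variants


for label, overrides in ab_variants().items():
    print(label, overrides)
```

Disabling one flag at a time keeps each measured difference attributable to a single code path.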

Behaviour note. The batched MoE router is value-preserving on factual/structural output, but a borderline top-k expert flip can make a free-form generation tail diverge from the per-token router. This is a deliberate, documented numerical change; VF_MOE_ROUTER_BATCHED=0 restores the exact pre-v0.7.0 routing.
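
To illustrate what a borderline top-k flip looks like, the sketch below applies a plain top-k selection to two sets of router logits that differ only in the last decimal places. The logit values are made up for illustration and the selection function is a generic one, not VulkanForge's router.

```python
def top_k_experts(logits, k):
    """Return the indices of the k largest logits, highest first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


# Hypothetical logits for four experts. Experts 2 and 3 are nearly tied, so a
# rounding-level difference between two GEMM paths can swap their order.
per_token_path = [2.10, 0.35, 1.4000001, 1.4000000]
batched_path = [2.10, 0.35, 1.4000000, 1.4000001]

print(top_k_experts(per_token_path, 2))  # [0, 2]
print(top_k_experts(batched_path, 2))    # [0, 3]
```

Both results are numerically defensible, but a different expert processes the token, and in free-form generation that difference can compound over later tokens.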

Sampling & input (env forms; most also have CLI flags)

VF_PROMPT (single-shot turn) · VF_SYSTEM (system prompt) · VF_MAX_TOKENS · VF_TEMPERATURE (0 = greedy) · VF_TOP_K · VF_TOP_P · VF_SEED · VF_REPETITION_PENALTY · VF_NO_THINK_FILTER=1 (show <think> content for reasoning models) · VF_MODEL_PATH (default model path).

Serve-only

| Flag | Default | Effect |
|---|---|---|
| `VF_KV_PREFIX_REUSE=1` | off | Cross-request KV reuse on serve: re-prefills only the suffix after the longest common token prefix (multi-turn/agentic). Byte-identical to full re-prefill at temp 0; single retained session. |
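
The idea behind prefix reuse can be sketched in a few lines: compare the retained session's tokens with the incoming request, keep the KV entries for the shared prefix, and prefill only what follows. This is a conceptual sketch with hypothetical function names and made-up token IDs, not the server's implementation.

```python
def common_prefix_len(cached_tokens, incoming_tokens):
    """Length of the longest common token prefix of the two sequences."""
    n = 0
    for a, b in zip(cached_tokens, incoming_tokens):
        if a != b:
            break
        n += 1
    return n


def split_for_reuse(cached_tokens, incoming_tokens):
    """Return (reusable KV positions, tokens that still need prefill)."""
    keep = common_prefix_len(cached_tokens, incoming_tokens)
    return keep, incoming_tokens[keep:]


cached = [1, 15, 27, 42, 99]          # tokens of the retained session
incoming = [1, 15, 27, 42, 7, 8, 9]   # next request sharing a 4-token prefix
keep, suffix = split_for_reuse(cached, incoming)
print(keep, suffix)  # 4 [7, 8, 9]
```

In a multi-turn conversation the shared prefix is usually the whole previous exchange, so only the newest turn has to be prefilled.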

See Usage for examples and Benchmarks for what the prefill flags buy.
