Configuration

Configuration & Flags

VulkanForge is configured through CLI flags and environment variables. Most behaviour is automatic; the flags below are the user-facing runtime options. VulkanForge also has many internal diagnostic and debug environment variables (`VF_*_DUMP`, `VF_TRACE_*`, GEMM micro-tuning, and so on). These are for development only and are intentionally not documented here; never set them in production.

Defaults below are taken from the v0.7.0 source.

Precision & memory

| Flag | Default | Effect |
|---|---|---|
| `VULKANFORGE_KV_FP8=1` | off | FP8 KV cache: halves KV-cache VRAM (Gemma-4-26B-A4B: 880 → 440 MB). Recommended/needed to fit the 26B MoE in 16 GB. |
| `VF_CPU_LM_HEAD=1` | off | Offload the vocabulary projection (lm_head) to the CPU (AVX-512 Q6_K GEMV; Zen 4 / Ice Lake+). Frees ~970 MB VRAM. On 14B FP8 it gives +32 % decode; on 8B it trades ~32 % decode for the VRAM saving. |
| `VF_QUANTIZE_ON_LOAD=1` | off | Quantize FP32/BF16 SafeTensors weights to Q4_K_M at load (~7× compression on quantized tensors). |
| `VF_VRAM_HEADROOM_GIB=<f>` | 1.0 | Warn when free VRAM drops below this many GiB (diagnostic). |
| `VF_LOADER_STAGING_GIB=<f>` | adaptive | Override the upload staging-buffer size in GiB (advanced; the default adapts). |
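
The 50 % saving from `VULKANFORGE_KV_FP8=1` follows directly from the element width. The sketch below is a rough sizing aid, not VulkanForge's allocator: it assumes the default cache stores 16-bit elements (which the halving implies), and the model shape is hypothetical and chosen only for illustration. Real models with sliding-window or shared-KV layers will deviate from this simple formula.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int) -> int:
    """Bytes needed for keys plus values across all layers."""
    # The factor 2 covers the separate K and V tensors.
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem


# Hypothetical model shape, for illustration only.
shape = dict(layers=32, kv_heads=8, head_dim=128, context_len=4096)

kv_16bit = kv_cache_bytes(**shape, bytes_per_elem=2)  # assumed default width
kv_fp8 = kv_cache_bytes(**shape, bytes_per_elem=1)    # VULKANFORGE_KV_FP8=1

print(f"16-bit KV cache: {kv_16bit / 2**20:.0f} MiB")
print(f"FP8 KV cache:    {kv_fp8 / 2**20:.0f} MiB")
```

Because only `bytes_per_elem` changes, the FP8 figure is always exactly half of the 16-bit figure, independent of model shape or context length.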

FP8 SafeTensors

| Flag | Effect |
|---|---|
| `VF_FP8=auto` | Auto-detects the FP8 model, native-WMMA capability, and CPU-lm_head heuristics; auto-loads the tokenizer and chat template from the model directory. The recommended way to run FP8. |
| `VULKANFORGE_ENABLE_FP8=1` | Explicit FP8 SafeTensors loading (legacy override). |

Native FP8 WMMA itself is capability-driven and has no flag: VulkanForge uses the native path if and only if the driver advertises `shaderFloat8CooperativeMatrix` (Mesa 26.1+); otherwise it uses the BF16 fallback.
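
The capability check can be pictured as a single branch on the advertised device features. This is a minimal sketch of that decision only: the function name, the feature-set representation, and the returned labels are hypothetical and do not come from the VulkanForge source.

```python
def select_fp8_matmul_path(device_features: set[str]) -> str:
    """Pick the FP8 matmul path from the features the driver advertises."""
    # Native path only when the driver reports the FP8 cooperative-matrix feature.
    if "shaderFloat8CooperativeMatrix" in device_features:
        return "native-fp8-wmma"   # hypothetical label
    return "bf16-fallback"         # hypothetical label


# Feature sets below are made up for the example.
print(select_fp8_matmul_path({"shaderFloat8", "shaderFloat8CooperativeMatrix"}))
print(select_fp8_matmul_path({"shaderFloat16"}))
```

Since the choice depends only on what the driver reports, upgrading the driver is what switches paths; no environment variable is involved.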

Prefill performance (v0.7.0, all default-ON — opt-outs for A/B)

| Flag | Default | Effect |
|---|---|---|
| `VF_FA_COOPMAT_HD128=0` | on | Dense `head_dim=128` prefill attention uses coopmat flash-attention (parity with llama). `=0` falls back to the older tiled kernel. |
| `VF_MOE_ROUTER_BATCHED=0` | on | Gemma-MoE router gate-projection batched through the dense GEMM (per-token → ~6 ms @p2048). `=0` falls back to the per-token GEMV. |
| `VF_MOE_FUSED_ROUTER=1` / `=0` | decode-only fused | Default: fused router for decode, separate path for prefill (faster). `=1` forces fused everywhere; `=0` forces the separate path. |
| `VF_MOE_GROUPED=0` | on (since v0.5.3) | Expert-grouped MoE prefill. `=0` = legacy GPU-direct GEMV prefill. |
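
To make the opt-out semantics concrete, the sketch below resolves the flags in the table above from an environment mapping. It is an illustration, not VulkanForge's parser: the function names and mode labels are hypothetical, and treating any value other than `0` or `1` as "unset" is an assumption, since the source does not say how other values are handled.

```python
import os
from typing import Mapping

# Default-ON prefill flags named in the table above.
DEFAULT_ON_FLAGS = ("VF_FA_COOPMAT_HD128", "VF_MOE_ROUTER_BATCHED", "VF_MOE_GROUPED")


def default_on(env: Mapping[str, str], name: str) -> bool:
    """A default-ON flag stays enabled unless it is explicitly set to "0"."""
    return env.get(name) != "0"


def fused_router_mode(env: Mapping[str, str]) -> str:
    """Tri-state VF_MOE_FUSED_ROUTER; the returned labels are hypothetical."""
    value = env.get("VF_MOE_FUSED_ROUTER")
    if value == "1":
        return "fused-everywhere"
    if value == "0":
        return "separate-everywhere"
    # Unset (or, by assumption, any other value): fused decode, separate prefill.
    return "fused-decode-only"


def resolve_prefill_flags(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the effective prefill settings for one run."""
    flags = {name: default_on(env, name) for name in DEFAULT_ON_FLAGS}
    flags["VF_MOE_FUSED_ROUTER"] = fused_router_mode(env)
    return flags


# A/B example: disable only the batched router, keep every other default.
print(resolve_prefill_flags({"VF_MOE_ROUTER_BATCHED": "0"}))
```

For an A/B comparison, change one flag per run so that any difference in timing or output can be attributed to that flag alone.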

Behaviour note. The batched MoE router is value-preserving on factual/structural output, but a borderline top-k expert flip can make a free-form generation tail diverge from the per-token router. This is a deliberate, documented numerical change; VF_MOE_ROUTER_BATCHED=0 restores the exact pre-v0.7.0 routing.
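
The expert flip described above is easy to reproduce in miniature. In this sketch the router logits are invented: the two vectors differ only by a rounding-sized perturbation on two nearly tied experts, standing in for the small numerical difference between a per-token GEMV and a batched GEMM. The helper name is hypothetical.

```python
def top_k_experts(logits: list[float], k: int) -> list[int]:
    """Indices of the k largest router logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


# Made-up router logits for five experts; experts 2 and 3 are nearly tied.
per_token_logits = [2.10, 0.70, 1.3000001, 1.3000000, -0.40]
batched_logits = [2.10, 0.70, 1.3000000, 1.3000001, -0.40]

print(top_k_experts(per_token_logits, 2))
print(top_k_experts(batched_logits, 2))
```

Expert 0 is selected in both cases, but the second slot goes to a different expert. When the margin between candidates is large, as it usually is, both paths pick the same experts, which is why only occasional free-form tails diverge.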

Sampling & input (env forms; most also have CLI flags)

`VF_PROMPT` (single-shot turn) · `VF_SYSTEM` (system prompt) · `VF_MAX_TOKENS` · `VF_TEMPERATURE` (0 = greedy) · `VF_TOP_K` · `VF_TOP_P` · `VF_SEED` · `VF_REPETITION_PENALTY` · `VF_NO_THINK_FILTER=1` (show `<think>` content for reasoning models) · `VF_MODEL_PATH` (default model path).
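
For readers unfamiliar with how these knobs interact, here is a generic sampler sketch showing temperature, top-k, top-p, and a seed working together. It is a textbook formulation, not VulkanForge's sampler: the filter order, the treatment of `top_k <= 0` as "no limit", and the function name are assumptions, and the repetition penalty is left out for brevity.

```python
import math
import random


def sample_token(logits: list[float], temperature: float, top_k: int,
                 top_p: float, rng: random.Random) -> int:
    """Pick one token index from raw logits."""
    # Temperature 0 means greedy decoding: always take the argmax.
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])

    scaled = [x / temperature for x in logits]
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k > 0:  # assumption: non-positive top_k disables the limit
        order = order[:top_k]

    # Softmax over the surviving candidates (shifted by the max for stability).
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)

    # Top-p: keep the smallest high-probability prefix whose mass reaches top_p.
    kept_ids, kept_weights, cumulative = [], [], 0.0
    for i, w in zip(order, weights):
        kept_ids.append(i)
        kept_weights.append(w)
        cumulative += w / total
        if cumulative >= top_p:
            break

    return rng.choices(kept_ids, weights=kept_weights, k=1)[0]


logits = [0.1, 2.0, 0.5, 1.7]          # made-up values
rng = random.Random(1234)              # plays the role of VF_SEED
print(sample_token(logits, 0.0, 40, 0.95, rng))   # greedy
print(sample_token(logits, 0.8, 40, 0.95, rng))   # stochastic, seeded
```

With a fixed seed and identical inputs the stochastic path is repeatable, which is what makes seeded runs useful for comparing flag settings.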

Serve-only

| Flag | Default | Effect |
|---|---|---|
| `VF_KV_PREFIX_REUSE=1` | off | Cross-request KV reuse on `serve`: re-prefills only the suffix after the longest common token prefix (multi-turn/agentic). Byte-identical to full re-prefill at temp 0; single retained session. |
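
The prefix-reuse idea can be sketched as a comparison between the token sequence of the retained session and the tokens of the incoming request. The function names and token IDs below are hypothetical, and the sketch ignores practical details such as the need to process at least one token to obtain fresh logits when the new request matches the cache exactly.

```python
def common_prefix_len(cached: list[int], request: list[int]) -> int:
    """Length of the longest shared token prefix."""
    n = 0
    for a, b in zip(cached, request):
        if a != b:
            break
        n += 1
    return n


def plan_prefill(cached: list[int], request: list[int]) -> tuple[int, list[int]]:
    """Return (KV positions to keep, tokens that still need prefill)."""
    keep = common_prefix_len(cached, request)
    return keep, request[keep:]


# Made-up token IDs: turn 2 extends turn 1 with new user input.
turn_1 = [1, 15, 27, 42, 8]
turn_2 = [1, 15, 27, 42, 8, 99, 100, 7]

keep, suffix = plan_prefill(turn_1, turn_2)
print(keep, suffix)
```

In a multi-turn chat each request usually extends the previous one, so the suffix is short relative to the full history. With a single retained session, a request that shares no prefix with the cached one falls back to a full prefill.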

See Usage for examples and Benchmarks for what the prefill flags buy.

VulkanForge v1.0.4 · single-user RDNA 4 / gfx1201 Vulkan inference · GPL-3.0 · Repository · Releases
