Troubleshooting
The default amdgpu compute timeout (2 s) is too short for long prefill submits. Set
amdgpu.lockup_timeout=10000,10000 (values are in milliseconds, so 10 s) on the kernel command line
in your bootloader config, regenerate that config, and reboot. See Installation.
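After the reboot it is worth confirming that the parameter actually reached the kernel. The following is a minimal sketch, assuming a Linux system where /proc/cmdline is readable; has_lockup_timeout is a hypothetical helper name, not part of VulkanForge.

```shell
# Hypothetical helper: succeed if a kernel command line carries the
# amdgpu.lockup_timeout value recommended above.
has_lockup_timeout() {
    # $1: command line text; defaults to the running kernel's command line
    cmdline="${1:-$(cat /proc/cmdline 2>/dev/null)}"
    case " $cmdline " in
        *" amdgpu.lockup_timeout=10000,10000 "*) return 0 ;;
        *) return 1 ;;
    esac
}

if has_lockup_timeout; then
    echo "amdgpu.lockup_timeout is set"
else
    echo "amdgpu.lockup_timeout is missing: edit the bootloader config and reboot"
fi
```

The check matches the whole parameter between spaces, so a different value (for example the 2 s default spelled out explicitly) is reported as missing.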
The card has 16 GB; the usable budget is roughly 14.5 GB after overhead. VulkanForge prints a VRAM
budget and warns when free VRAM drops below the headroom threshold (VF_VRAM_HEADROOM_GIB, default
1.0). Options when a model is tight:
- Gemma-4-26B-A4B: set VULKANFORGE_KV_FP8=1. This is required for this MoE (it halves KV-cache VRAM; the engine aborts at load without it, see below).
- 14B FP8 / multiple sessions: VF_CPU_LM_HEAD=1 frees ~970 MB by moving the vocab projection to the CPU (on 14B FP8 it also gives +32 % decode).
- Use a smaller quant (Q3_K_M instead of Q4_K_M); see Supported Models.
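The headroom rule described above is a simple comparison. The sketch below only illustrates that rule and is not VulkanForge's own code; below_headroom is a hypothetical helper, and the free-VRAM figure is whatever number you read from the budget printout.

```shell
# Hypothetical helper: succeed (exit 0) when free VRAM is below the headroom.
below_headroom() {
    # $1: free VRAM in GiB
    # $2: headroom in GiB; defaults to VF_VRAM_HEADROOM_GIB, then to 1.0
    free_gib="$1"
    headroom="${2:-${VF_VRAM_HEADROOM_GIB:-1.0}}"
    # awk handles the floating-point comparison that plain sh cannot
    awk -v f="$free_gib" -v h="$headroom" 'BEGIN { exit !(f < h) }'
}

if below_headroom 0.8 1.0; then
    echo "free VRAM is under the headroom threshold: consider the options above"
fi
```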
The Gemma-4-26B-A4B MoE (both Q3_K_M and QAT-Q4_0) requires VULKANFORGE_KV_FP8=1. Its
non-FP8 KV path (F16/F32) is a known-broken code path: layer-0 attention produces NaN, which leads to degenerate/<pad>
output. Since v0.7.2 the engine aborts loudly instead of generating garbage. Fix: restart with
VULKANFORGE_KV_FP8=1 vulkanforge serve … (or chat …). Debug-only override (output will be
invalid): VULKANFORGE_ALLOW_BROKEN_KV=1. FP8 KV also halves KV-cache VRAM, which is what lets the
26B fit in 16 GB.
Since v0.8.0 the vf-clide client makes both cases visible instead of silent:
- Cut off at the token limit → [truncated at the token limit (N) …] → raise --max-tokens.
- A reasoning model spent the whole budget in its <think> block, leaving no visible answer → [empty answer …] → raise --max-tokens or pass --no-think.
The server sizes the context automatically; you only tune the generation budget (--max-tokens).
The KV context is hardware-capped at 16384 tokens on RDNA4 (per-workgroup LDS budget). Auto-ctx
(omit --ctx-size) always stays at or below it; an explicit --ctx-size above 16384 aborts at
pipeline creation rather than clamping silently. Use ≤ 16384, or drop the flag and let auto-ctx
choose. See Configuration.
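If you generate launch commands from a script, you can reject an oversized context before the engine does. This is a minimal sketch, assuming the 16384-token cap stated above; check_ctx_size and MAX_CTX are hypothetical names used only for illustration.

```shell
# RDNA4 KV context cap, taken from the text above.
MAX_CTX=16384

# Hypothetical helper: succeed only for an integer at or below the cap.
check_ctx_size() {
    case "$1" in
        ''|*[!0-9]*)
            echo "ctx size must be an unsigned integer" >&2
            return 2
            ;;
    esac
    if [ "$1" -gt "$MAX_CTX" ]; then
        echo "--ctx-size $1 exceeds the $MAX_CTX cap: use <= $MAX_CTX or omit the flag" >&2
        return 1
    fi
}

if check_ctx_size 16384; then
    echo "ok to pass --ctx-size 16384"
fi
```

Omitting --ctx-size entirely remains the simpler choice, since auto-ctx never exceeds the cap.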
DeepSeek-R1-Distill emits <think>…</think> reasoning before its answer. With a small --max-tokens,
the visible output can still be inside the <think> block (the answer comes after). Raise
--max-tokens, or use --no-think-filter / VF_NO_THINK_FILTER=1 to see the raw stream. This is a
prompting/harness consideration, not a bug.
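When you inspect the raw stream, a quick way to tell whether generation stopped mid-reasoning is to look for an unclosed <think> tag. The sketch below is illustrative only; think_state is a hypothetical helper and operates on text you have already captured.

```shell
# Hypothetical helper: classify captured model output by its <think> tags.
think_state() {
    case "$1" in
        *"</think>"*) echo "reasoning-closed" ;;  # the answer follows the block
        *"<think>"*)  echo "still-thinking" ;;    # budget ran out inside the block
        *)            echo "no-reasoning" ;;      # no think block at all
    esac
}

think_state '<think>Let me work through this'
```

A "still-thinking" result is the cue to raise --max-tokens.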
v0.7.0's batched MoE router is llama-aligned and value-preserving on factual/structural output, but a
borderline top-k expert flip can make a free-form generation tail phrase differently than the
pre-v0.7.0 per-token router. To reproduce the exact older routing, set VF_MOE_ROUTER_BATCHED=0.
See Configuration.
Native FP8 WMMA is capability-driven. Check:
vulkaninfo 2>/dev/null | grep shaderFloat8CooperativeMatrix
If absent (e.g. Mesa 26.0.x), VulkanForge uses the BF16 conversion fallback, which is correct but slower on FP8 prefill. Upgrade to Mesa 26.1+ for the native path.
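The same capability check can be wrapped so that a script reports which path to expect. This is a sketch under the assumption, taken from the text above, that the mere presence of the shaderFloat8CooperativeMatrix line indicates the native path; fp8_path is a hypothetical helper name.

```shell
# Hypothetical helper: read vulkaninfo-style text on stdin and name the FP8 path.
fp8_path() {
    if grep -q shaderFloat8CooperativeMatrix; then
        echo "native"
    else
        echo "bf16-fallback"
    fi
}

# Only query the driver when vulkaninfo is installed.
if command -v vulkaninfo >/dev/null 2>&1; then
    vulkaninfo 2>/dev/null | fp8_path
fi
```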
vulkanforge bench accepts Q4_K_M GGUF. Q8_0 loads in chat but is rejected by bench. Use a
Q4_K_M GGUF for benchmarking.
Memory is opt-in and off by default. Either the binary was built without the feature, or you started serve
without activating it. Build with cargo build --release --features memory, then run
vulkanforge serve --model … --memory (or VULKANFORGE_MEMORY=1). See Memory · Installation.
The binary was built lean (without the memory feature), so the flag has nothing to activate. Rebuild with
cargo build --release --features memory (needs Rust 1.89+) and re-run. See Installation.
The first activated start downloads the Nomic embedding model once into ~/.vulkanforge/embed-cache/ (a sibling
of the DB). It needs network that one time; every start afterwards is offline. If the model can't be fetched,
/memory/* returns 503 and the inference server still runs.
One SQLite file at ~/.vulkanforge/memory.db (override with VF_MEMORY_DB), with the embedding model cached in the
sibling ~/.vulkanforge/embed-cache/. It's local, single-user, and survives restarts. See Memory.
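To see where the memory files live on your machine, you can resolve the paths the same way the text describes: VF_MEMORY_DB if set, otherwise the default under ~/.vulkanforge. This is a minimal sketch; memory_db_path is a hypothetical helper, and the script only inspects the filesystem.

```shell
# Hypothetical helper: print the memory DB path, honouring the VF_MEMORY_DB override.
memory_db_path() {
    echo "${VF_MEMORY_DB:-$HOME/.vulkanforge/memory.db}"
}

db="$(memory_db_path)"
cache="$HOME/.vulkanforge/embed-cache"

if [ -f "$db" ]; then
    echo "memory DB found: $db"
else
    echo "no memory DB yet at: $db"
fi

if [ -d "$cache" ]; then
    echo "embedding cache found: $cache"
else
    echo "no embedding cache yet at: $cache"
fi
```

Both files are absent until the first activated start has completed.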
See also Installation · Hardware and Compatibility · Configuration · Memory.
VulkanForge v1.0.4 · single-user RDNA 4 / gfx1201 Vulkan inference · GPL-3.0 ·
Repository · Releases