Skip to content
maeddesg edited this page Jun 9, 2026 · 3 revisions

Troubleshooting / FAQ

The GPU resets / TDR during long prefill (14B+ models)

The default amdgpu compute timeout (2 s) is too short for long prefill submits. Set amdgpu.lockup_timeout=10000,10000 on the kernel command line (bootloader), regenerate config, reboot. See Installation.

Out of VRAM / model doesn't fit

The card has 16 GB; usable budget is roughly ~14.5 GB after overhead. VulkanForge prints a VRAM budget and warns when free VRAM drops below the headroom threshold (VF_VRAM_HEADROOM_GIB, default 1.0). Options when a model is tight:

  • Gemma-4-26B-A4B: set VULKANFORGE_KV_FP8=1 (halves KV-cache VRAM; recommended for the 26B MoE).
  • 14B FP8 / multiple sessions: VF_CPU_LM_HEAD=1 frees ~970 MB by moving the vocab projection to the CPU (on 14B FP8 it's also +32 % decode).
  • Use a smaller quant (Q3_K_M vs Q4_K_M) — see Supported Models.

"KV-FP8 needed at 26B"

The Gemma-4-26B-A4B MoE only fits comfortably in 16 GB with VULKANFORGE_KV_FP8=1. Without it the KV cache may push you over the budget at larger context sizes. It is value-preserving.

A reasoning model (DeepSeek-R1) "doesn't answer" with a short token cap

DeepSeek-R1-Distill emits <think>…</think> reasoning before its answer. With a small --max-tokens, the visible output can still be inside the <think> block (the answer comes after). Raise --max-tokens, or use --no-think-filter / VF_NO_THINK_FILTER=1 to see the raw stream. This is a prompting/harness consideration, not a bug.

A Gemma-MoE free-form answer changed vs an older build

v0.7.0's batched MoE router is llama-aligned and value-preserving on factual/structural output, but a borderline top-k expert flip can make a free-form generation tail phrase differently than the pre-v0.7.0 per-token router. To reproduce the exact older routing, set VF_MOE_ROUTER_BATCHED=0. See Configuration.

Native FP8 not engaging

Native FP8 WMMA is capability-driven. Check:

vulkaninfo 2>/dev/null | grep shaderFloat8CooperativeMatrix

If absent (e.g. Mesa 26.0.x), VulkanForge uses the BF16 conversion fallback — correct, just slower on FP8 prefill. Upgrade to Mesa 26.1+ for the native path.

bench rejects my model

vulkanforge bench accepts Q4_K_M GGUF. Q8_0 loads in chat but is rejected by bench. Use a Q4_K_M GGUF for benchmarking.

See also Installation · Hardware and Compatibility · Configuration.

Clone this wiki locally