Troubleshooting

Troubleshooting / FAQ

The GPU resets / TDR during long prefill (14B+ models)

The default amdgpu compute timeout (2 s) is too short for long prefill submits. Add amdgpu.lockup_timeout=10000,10000 (values in milliseconds, so 10 s) to the kernel command line in your bootloader configuration, regenerate the bootloader config, and reboot. See Installation.
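
After rebooting, it can help to confirm that the parameter actually reached the kernel. The following is a minimal sketch, not part of VulkanForge: the helper name lockup_timeout_from_cmdline is made up for illustration, and it assumes a Linux system where /proc/cmdline and the amdgpu module's sysfs parameters are readable.

```shell
# Hypothetical helper: extract the amdgpu.lockup_timeout value from a
# kernel command line string. Prints nothing if the parameter is absent.
lockup_timeout_from_cmdline() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^amdgpu\.lockup_timeout=//p' | tail -n 1
}

# Command line the running kernel was booted with.
if [ -r /proc/cmdline ]; then
    value=$(lockup_timeout_from_cmdline "$(cat /proc/cmdline)")
    echo "amdgpu.lockup_timeout on cmdline: ${value:-not set}"
fi

# Value the loaded module reports (only present when amdgpu is loaded).
if [ -r /sys/module/amdgpu/parameters/lockup_timeout ]; then
    echo "amdgpu module reports: $(cat /sys/module/amdgpu/parameters/lockup_timeout)"
fi
```

If the first line reports "not set", the bootloader config was probably not regenerated or the wrong boot entry was used.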

Out of VRAM / model doesn't fit

The card has 16 GB; the usable budget is roughly 14.5 GB after overhead. VulkanForge prints a VRAM budget and warns when free VRAM drops below the headroom threshold (VF_VRAM_HEADROOM_GIB, default 1.0). Options when a model is tight:

- Gemma-4-26B-A4B: set VULKANFORGE_KV_FP8=1 (halves KV-cache VRAM; recommended for the 26B MoE).
- 14B FP8 / multiple sessions: set VF_CPU_LM_HEAD=1 to free ~970 MB by moving the vocab projection to the CPU (on 14B FP8 it also improves decode throughput by 32%).
- Use a smaller quant (Q3_K_M instead of Q4_K_M); see Supported Models.
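
The variables named above are ordinary environment variables, so one way to apply them is to export them in the shell session before launching VulkanForge. This is only a sketch: enable the lines that match your case, and note that the launch command itself is left out because it depends on your setup.

```shell
# Halve KV-cache VRAM (the option recommended above for Gemma-4-26B-A4B).
export VULKANFORGE_KV_FP8=1

# Alternative for 14B FP8 or multiple sessions: move the vocab projection
# to the CPU. Left commented out here; uncomment if it applies to you.
# export VF_CPU_LM_HEAD=1

# Headroom threshold in GiB for the low-VRAM warning (1.0 is the stated default).
export VF_VRAM_HEADROOM_GIB=1.0
```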

"KV-FP8 needed at 26B"

The Gemma-4-26B-A4B MoE only fits comfortably in 16 GB with VULKANFORGE_KV_FP8=1. Without it, the KV cache may push you over the budget at larger context sizes. Enabling KV-FP8 is value-preserving.

A reasoning model (DeepSeek-R1) "doesn't answer" with a short token cap

DeepSeek-R1-Distill emits <think>…</think> reasoning before its answer. With a small --max-tokens, generation can hit the cap while still inside the <think> block, so the answer (which comes after the block) is never reached. Raise --max-tokens, or use --no-think-filter / VF_NO_THINK_FILTER=1 to see the raw stream. This is a prompting/harness consideration, not a bug.
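
To tell a truncated reasoning block apart from a genuinely empty answer, you can inspect the raw stream for an unclosed <think> tag. The sketch below is not part of VulkanForge; think_status is a hypothetical helper, and it assumes you have captured the unfiltered output (for example with the think filter disabled) and pipe it in on stdin.

```shell
# Hypothetical helper: classify captured model output read from stdin.
#   "truncated" - a <think> block was opened but never closed
#   "complete"  - no <think> block, or the block was closed
think_status() {
    awk '
        /<think>/   { opened = 1 }
        /<\/think>/ { closed = 1 }
        END {
            if (opened && !closed) print "truncated"
            else print "complete"
        }
    '
}

# Example: a stream that was cut off mid-reasoning.
printf '%s\n' '<think>Let me work through this step by step' | think_status
```

A "truncated" result suggests the token cap was reached before the answer, so raising --max-tokens is the first thing to try.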

A Gemma-MoE free-form answer changed vs an older build

v0.7.0's batched MoE router is llama-aligned and value-preserving on factual/structural output, but a borderline top-k expert flip can make a free-form generation tail phrase differently than the pre-v0.7.0 per-token router. To reproduce the exact older routing, set VF_MOE_ROUTER_BATCHED=0. See Configuration.

Native FP8 not engaging

Native FP8 WMMA is capability-driven. Check:

vulkaninfo 2>/dev/null | grep shaderFloat8CooperativeMatrix

If absent (e.g. Mesa 26.0.x), VulkanForge uses the BF16 conversion fallback — correct, just slower on FP8 prefill. Upgrade to Mesa 26.1+ for the native path.
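
The check above can be wrapped in a small function that also distinguishes a feature reported as false from one reported as true. This is a sketch under assumptions: fp8_path is a hypothetical helper, and it assumes vulkaninfo prints features in its usual "name = true" / "name = false" text form.

```shell
# Hypothetical helper: read vulkaninfo output on stdin and report which
# FP8 path to expect. A missing line or "= false" both mean fallback.
fp8_path() {
    if grep -q 'shaderFloat8CooperativeMatrix[[:space:]]*=[[:space:]]*true'; then
        echo "native"
    else
        echo "fallback"
    fi
}

# Run against the local driver only if vulkaninfo is installed.
if command -v vulkaninfo >/dev/null 2>&1; then
    echo "FP8 path: $(vulkaninfo 2>/dev/null | fp8_path)"
fi
```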

`bench` rejects my model

vulkanforge bench accepts only Q4_K_M GGUF files. Q8_0 loads in chat but is rejected by bench. Use a Q4_K_M GGUF for benchmarking.

See also Installation · Hardware and Compatibility · Configuration.

VulkanForge v1.0.4 · single-user RDNA 4 / gfx1201 Vulkan inference · GPL-3.0 · Repository · Releases
