Skip to content
maeddesg edited this page Jun 15, 2026 · 3 revisions

Troubleshooting / FAQ

The GPU resets / TDR during long prefill (14B+ models)

The default amdgpu compute timeout (2 s) is too short for long prefill submits. Set amdgpu.lockup_timeout=10000,10000 on the kernel command line (bootloader), regenerate config, reboot. See Installation.

Out of VRAM / model doesn't fit

The card has 16 GB; usable budget is roughly ~14.5 GB after overhead. VulkanForge prints a VRAM budget and warns when free VRAM drops below the headroom threshold (VF_VRAM_HEADROOM_GIB, default 1.0). Options when a model is tight:

  • Gemma-4-26B-A4B: set VULKANFORGE_KV_FP8=1required for this MoE (halves KV-cache VRAM; the engine aborts at load without it, see below).
  • 14B FP8 / multiple sessions: VF_CPU_LM_HEAD=1 frees ~970 MB by moving the vocab projection to the CPU (on 14B FP8 it's also +32 % decode).
  • Use a smaller quant (Q3_K_M vs Q4_K_M) — see Supported Models.

Server aborts at load: "…only FP8 (E4M3) KV is correct. Set VULKANFORGE_KV_FP8=1…"

The Gemma-4-26B-A4B MoE (both Q3_K_M and QAT-Q4_0) requires VULKANFORGE_KV_FP8=1. Its non-FP8 KV path (F16/F32) is a known-broken code path — Layer-0 attention NaN → degenerate/<pad> output. Since v0.7.2 the engine fail-loud aborts instead of generating garbage. Fix: restart with VULKANFORGE_KV_FP8=1 vulkanforge serve … (or chat …). Debug-only override (output will be invalid): VULKANFORGE_ALLOW_BROKEN_KV=1. FP8 KV also halves KV-cache VRAM, which is what lets the 26B fit in 16 GB.

Answer is cut off, or empty (thinking models)

Since v0.8.0 the vf-clide client makes both cases visible instead of silent:

  • Cut off at the token limit → [truncated at the token limit (N) …] → raise --max-tokens.
  • A reasoning model spent the whole budget in its <think> block, leaving no visible answer → [empty answer …] → raise --max-tokens or pass --no-think.

The server sizes the context automatically; you only tune the generation budget (--max-tokens).

serve aborts at pipeline creation with a large --ctx-size

The KV context is hardware-capped at 16384 tokens on RDNA4 (per-workgroup LDS budget). Auto-ctx (omit --ctx-size) always stays at or below it; an explicit --ctx-size above 16384 aborts at pipeline creation rather than clamping silently. Use ≤ 16384, or drop the flag and let auto-ctx choose. See Configuration.

A reasoning model (DeepSeek-R1) "doesn't answer" with a short token cap

DeepSeek-R1-Distill emits <think>…</think> reasoning before its answer. With a small --max-tokens, the visible output can still be inside the <think> block (the answer comes after). Raise --max-tokens, or use --no-think-filter / VF_NO_THINK_FILTER=1 to see the raw stream. This is a prompting/harness consideration, not a bug.

A Gemma-MoE free-form answer changed vs an older build

v0.7.0's batched MoE router is llama-aligned and value-preserving on factual/structural output, but a borderline top-k expert flip can make a free-form generation tail phrase differently than the pre-v0.7.0 per-token router. To reproduce the exact older routing, set VF_MOE_ROUTER_BATCHED=0. See Configuration.

Native FP8 not engaging

Native FP8 WMMA is capability-driven. Check:

vulkaninfo 2>/dev/null | grep shaderFloat8CooperativeMatrix

If absent (e.g. Mesa 26.0.x), VulkanForge uses the BF16 conversion fallback — correct, just slower on FP8 prefill. Upgrade to Mesa 26.1+ for the native path.

bench rejects my model

vulkanforge bench accepts Q4_K_M GGUF. Q8_0 loads in chat but is rejected by bench. Use a Q4_K_M GGUF for benchmarking.

/memory/* returns 503

Memory is opt-in and off by default. Either the binary was built without the feature, or you started serve without activating it. Build with cargo build --release --features memory, then run vulkanforge serve --model … --memory (or VULKANFORGE_MEMORY=1). See Memory · Installation.

serve --memory aborts: "rebuild with --features memory"

The binary was built lean (without the memory feature), so the flag has nothing to activate. Rebuild with cargo build --release --features memory (needs Rust 1.89+) and re-run. See Installation.

First --memory start is slow / wants network

The first activated start downloads the Nomic embedding model once into ~/.vulkanforge/embed-cache/ (a sibling of the DB). It needs network that one time; every start afterwards is offline. If the model can't be fetched, /memory/* returns 503 and the inference server still runs.

Where is my memory stored?

One SQLite file at ~/.vulkanforge/memory.db (override with VF_MEMORY_DB), with the embedding model cached in the sibling ~/.vulkanforge/embed-cache/. It's local, single-user, and survives restarts. See Memory.

See also Installation · Hardware and Compatibility · Configuration · Memory.

Clone this wiki locally