Troubleshooting

Troubleshooting / FAQ

The GPU resets / TDR during long prefill (14B+ models)

The default amdgpu compute timeout (2 s) is too short for long prefill submits. Add amdgpu.lockup_timeout=10000,10000 to the kernel command line in your bootloader configuration, regenerate the bootloader config, and reboot. See Installation.
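
To confirm after the reboot that the parameter actually reached the kernel, a check along these lines may help. It is a minimal sketch: the helper name lockup_timeout_from_cmdline is hypothetical and not part of VulkanForge, and it only parses the command-line text it is given.

```shell
# Hypothetical helper: extract the amdgpu.lockup_timeout value from a
# kernel command line string. Prints nothing if the parameter is absent;
# if it appears more than once, the last occurrence is printed.
lockup_timeout_from_cmdline() {
    printf '%s\n' "$1" | tr ' ' '\n' \
        | sed -n 's/^amdgpu\.lockup_timeout=//p' \
        | tail -n 1
}
```

On a running system, pass it the live command line: `lockup_timeout_from_cmdline "$(cat /proc/cmdline)"`. An empty result suggests the bootloader config was not regenerated or the machine was not rebooted.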

Out of VRAM / model doesn't fit

The card has 16 GB; the usable budget is roughly 14.5 GB after overhead. VulkanForge prints a VRAM budget and warns when free VRAM drops below the headroom threshold (VF_VRAM_HEADROOM_GIB, default 1.0). Options when a model is tight:

- Gemma-4-26B-A4B: set VULKANFORGE_KV_FP8=1. This is required for this MoE: it halves KV-cache VRAM, and the engine aborts at load without it (see below).
- 14B FP8 / multiple sessions: VF_CPU_LM_HEAD=1 frees ~970 MB by moving the vocab projection to the CPU; on 14B FP8 it also speeds up decode by 32 %.
- Use a smaller quant (Q3_K_M vs Q4_K_M); see Supported Models.
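
To see how close the card is to the headroom threshold before loading a model, the arithmetic can be sketched in shell. This is an illustration, not VulkanForge's own check: both helper names are hypothetical, and the headroom is taken in whole GiB to keep the shell arithmetic integral (the real VF_VRAM_HEADROOM_GIB accepts values such as 1.0).

```shell
# Hypothetical helper: free VRAM in MiB from total and used byte counts.
vram_free_mib() {
    echo $(( ($1 - $2) / 1048576 ))
}

# Hypothetical helper: succeed when free VRAM (in MiB) is below the
# headroom (in whole GiB, default 1).
below_headroom() {
    free_mib=$1
    headroom_gib=${2:-1}
    [ "$free_mib" -lt $(( headroom_gib * 1024 )) ]
}
```

The amdgpu driver exposes the byte counts in sysfs as mem_info_vram_total and mem_info_vram_used under /sys/class/drm/card0/device/ (the card index is an assumption and may differ on your system), so a call could look like `vram_free_mib "$(cat …/mem_info_vram_total)" "$(cat …/mem_info_vram_used)"`.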

Server aborts at load: "…only FP8 (E4M3) KV is correct. Set `VULKANFORGE_KV_FP8=1`…"

The Gemma-4-26B-A4B MoE (both Q3_K_M and QAT-Q4_0) requires VULKANFORGE_KV_FP8=1. Its non-FP8 KV path (F16/F32) is known to be broken: layer-0 attention produces NaNs, which leads to degenerate or <pad> output. Since v0.7.2 the engine fails loudly and aborts at load instead of generating garbage. Fix: restart with VULKANFORGE_KV_FP8=1 vulkanforge serve … (or chat …). Debug-only override (output will be invalid): VULKANFORGE_ALLOW_BROKEN_KV=1. FP8 KV also halves KV-cache VRAM, which is what lets the 26B fit in 16 GB.

Answer is cut off, or empty (thinking models)

Since v0.8.0 the vf-clide client makes both cases visible instead of silent:

- Cut off at the token limit → [truncated at the token limit (N) …] → raise --max-tokens.
- A reasoning model spent the whole budget in its <think> block, leaving no visible answer → [empty answer …] → raise --max-tokens or pass --no-think.

The server sizes the context automatically; you only tune the generation budget (--max-tokens).

`serve` aborts at pipeline creation with a large `--ctx-size`

The KV context is hardware-capped at 16384 tokens on RDNA4 (per-workgroup LDS budget). Auto-ctx (omit --ctx-size) always stays at or below it; an explicit --ctx-size above 16384 aborts at pipeline creation rather than clamping silently. Use ≤ 16384, or drop the flag and let auto-ctx choose. See Configuration.
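
If a launch script passes --ctx-size explicitly, a pre-flight check can catch an oversized value before the server aborts. The sketch below is an assumption-laden illustration: ctx_size_ok and MAX_CTX are hypothetical names, and the cap is simply the 16384 figure quoted above.

```shell
# Cap taken from the text above: 16384 tokens on RDNA4.
MAX_CTX=16384

# Hypothetical pre-flight check: succeed only for a positive integer
# context size that does not exceed the cap.
ctx_size_ok() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric input
    esac
    [ "$1" -gt 0 ] && [ "$1" -le "$MAX_CTX" ]
}
```

A wrapper might call `ctx_size_ok "$CTX" || unset CTX` and omit the flag when the variable is unset, which leaves the choice to auto-ctx.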

A reasoning model (DeepSeek-R1) "doesn't answer" with a short token cap

DeepSeek-R1-Distill emits <think>…</think> reasoning before its answer. With a small --max-tokens, generation can hit the cap while the model is still inside the <think> block, so the answer that would follow is never produced. Raise --max-tokens, or use --no-think-filter / VF_NO_THINK_FILTER=1 to see the raw stream. This is a prompting/harness consideration, not a bug.
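
When inspecting the raw stream, the situation described above can be recognised mechanically. The following is a minimal sketch with a hypothetical helper name; it assumes the raw output contains the literal <think> and </think> tags, which may not hold for every chat template.

```shell
# Hypothetical helper: succeed when the text opens a <think> block that
# is never closed, i.e. generation stopped during the reasoning phase.
stopped_inside_think() {
    case "$1" in
        *'</think>'*) return 1 ;;  # reasoning finished; an answer may follow
        *'<think>'*)  return 0 ;;  # opened but never closed
        *)            return 1 ;;  # no think block at all
    esac
}
```

If the check succeeds on a captured response, the token budget ran out before the answer started, and raising --max-tokens is the appropriate fix.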

A Gemma-MoE free-form answer changed vs an older build

v0.7.0's batched MoE router is llama-aligned and value-preserving on factual/structural output, but a borderline top-k expert flip can make a free-form generation tail phrase differently than the pre-v0.7.0 per-token router. To reproduce the exact older routing, set VF_MOE_ROUTER_BATCHED=0. See Configuration.

Native FP8 not engaging

Native FP8 WMMA is capability-driven. Check:

vulkaninfo 2>/dev/null | grep shaderFloat8CooperativeMatrix

If absent (e.g. Mesa 26.0.x), VulkanForge falls back to BF16 conversion, which is correct but slower on FP8 prefill. Upgrade to Mesa 26.1+ for the native path.
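
The same check can be wrapped so that a script reports which path to expect. This is a sketch under a stated assumption: it presumes vulkaninfo prints the feature as "shaderFloat8CooperativeMatrix = true" when supported, and fp8_path is a hypothetical helper name.

```shell
# Hypothetical helper: read vulkaninfo output on stdin and report which
# FP8 path to expect. Assumes a supported feature is printed as
# "shaderFloat8CooperativeMatrix = true"; anything else counts as fallback.
fp8_path() {
    if grep -Eq 'shaderFloat8CooperativeMatrix[[:space:]]*=[[:space:]]*true'; then
        echo native
    else
        echo fallback
    fi
}
```

Typical use is `vulkaninfo 2>/dev/null | fp8_path`. Unlike the plain grep, this treats a feature that is listed but reported as false the same as one that is missing.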

`bench` rejects my model

vulkanforge bench accepts Q4_K_M GGUF. Q8_0 loads in chat but is rejected by bench. Use a Q4_K_M GGUF for benchmarking.

`/memory/*` returns 503

Memory is opt-in and off by default. Either the binary was built without the feature, or you started serve without activating it. Build with cargo build --release --features memory, then run vulkanforge serve --model … --memory (or VULKANFORGE_MEMORY=1). See Memory · Installation.

`serve --memory` aborts: "rebuild with --features memory"

The binary was built lean (without the memory feature), so the flag has nothing to activate. Rebuild with cargo build --release --features memory (needs Rust 1.89+) and re-run. See Installation.
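
Because the rebuild needs Rust 1.89+, it can be worth checking the toolchain first. The helper below is a hypothetical sketch that compares only the major and minor numbers of a "rustc X.Y.Z …" version string.

```shell
# Hypothetical helper: succeed when a "rustc X.Y.Z ..." version string
# is at least the given MAJOR.MINOR (1.89 per the text above).
rustc_at_least() {
    want_major=${1%%.*}
    want_minor=${1#*.}
    ver=${2#rustc }          # drop the leading "rustc "
    ver=${ver%% *}           # keep only "X.Y.Z"
    have_major=${ver%%.*}
    rest=${ver#*.}
    have_minor=${rest%%.*}
    [ "$have_major" -gt "$want_major" ] ||
        { [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]; }
}
```

For example, `rustc_at_least 1.89 "$(rustc --version)" && cargo build --release --features memory` only starts the build when the installed compiler is new enough.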

First `--memory` start is slow / wants network

The first activated start downloads the Nomic embedding model once into ~/.vulkanforge/embed-cache/ (a sibling of the DB). It needs network that one time; every start afterwards is offline. If the model can't be fetched, /memory/* returns 503 and the inference server still runs.

Where is my memory stored?

One SQLite file at ~/.vulkanforge/memory.db (override with VF_MEMORY_DB), with the embedding model cached in the sibling ~/.vulkanforge/embed-cache/. It's local, single-user, and survives restarts. See Memory.
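
To find out which file a given shell environment would point at, the lookup can be sketched as follows. The helper names are hypothetical, and treating an empty VF_MEMORY_DB the same as an unset one is an assumption of this sketch, not documented VulkanForge behaviour.

```shell
# Resolve the memory DB path: VF_MEMORY_DB if set (and non-empty, by
# assumption), otherwise the default location from the text above.
memory_db_path() {
    echo "${VF_MEMORY_DB:-$HOME/.vulkanforge/memory.db}"
}

# Report whether the resolved DB file exists yet.
memory_db_status() {
    db=$(memory_db_path)
    if [ -f "$db" ]; then
        echo "found: $db"
    else
        echo "missing: $db"
    fi
}
```

Running `memory_db_status` before and after the first activated start shows whether the database has been created at the expected location.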

See also Installation · Hardware and Compatibility · Configuration · Memory.

VulkanForge v1.0.4 · single-user RDNA 4 / gfx1201 Vulkan inference · GPL-3.0 · Repository · Releases
