Skip to content
maeddesg edited this page Jun 12, 2026 · 3 revisions

Supported Models

VulkanForge loads GGUF (llama.cpp-compatible quantization) and HuggingFace FP8 SafeTensors. The GGUF loader covers the llama, qwen2 / qwen3 / qwen35, and gemma4 architectures; the tokenizer is read from the GGUF (no extra setup). FP8 SafeTensors need a tokenizer via --tokenizer-from <gguf> from the same family (auto-loaded from the model dir with VF_FP8=auto).

Model families

For a side-by-side of the coding-capable models (quality / speed / context trade-offs), see Choosing a Model for Coding.

Family Notes
Llama-3.1-8B-Instruct Dense, head_dim=128. GGUF Q4_K_M, and FP8 (per-tensor).
Qwen3-8B Dense, hd128. GGUF Q4_K_M, and FP8 (block-wise [128,128]).
Qwen3-14B Dense, hd128. GGUF Q4_K_M.
Mistral-7B-Instruct-v0.3 Dense, hd128, full-causal. GGUF Q4_K_M.
DeepSeek-R1-Distill-Llama-8B Dense, hd128 (Llama arch). Reasoning model — emits <think>…</think> (see Troubleshooting). GGUF Q4_K_M.
Gemma-4-26B-A4B MoE (128 experts, top-8), heterogeneous head dims (hd256 sliding + hd512 full). GGUF Q3_K_M and the QAT-Q4_0 line. ⚠️ Requires VULKANFORGE_KV_FP8=1 — the non-FP8 KV path is known-broken for this MoE (the engine aborts at load without it; see note below).
Gemma-4-E2B / E4B Smaller Gemma-4. SafeTensors path (on-the-fly Q4_K via VF_QUANTIZE_ON_LOAD=1).
Qwen3.6-27B (qwen35) GDN hybrid (gated-delta-net), dense. GGUF Q3_K_S.

GGUF quant formats

Format Status Notes
Q4_K_M Primary GGUF format; all benchmarks use it.
Q3_K_M Smaller VRAM, slight quality drop (one of the Gemma-4-26B MoE options — see Choosing a Model for Coding).
Q5_K
Q4_0 The Gemma-4 QAT GGUF line (E2B/E4B/12B/26B/31B) since v0.6.1. Qwen2.5-Q4_0 stays gated (missing arch features, not the quant).
Q6_K Chat works.
Q8_0 ⚠️ Loadable via chat, but rejected by bench.

Native FP8 E4M3 (SafeTensors)

All three FP8 scaling strategies are auto-detected from config.json + the SafeTensors header:

Scale type Example model VRAM 15-prompt coherence
Per-tensor Meta-Llama-3.1-8B-Instruct-FP8 7.5 GiB 13/15
Per-channel Qwen2.5-14B-Instruct-FP8 13.8 GiB 15/15
Block-wise [128,128] Qwen3-8B-FP8 8.5 GiB 15/15

FP8 quirks: the tokenizer is not in the FP8 SafeTensors model (use --tokenizer-from); block-wise needs a block size divisible by 64/16 (the [128,128] calibration satisfies both). FP8 SafeTensors loading currently supports the gpt2 tokenizer family (Mistral / Llama-2 SPM not yet wired for FP8).

Gemma-4-26B-A4B requires KV-FP8 (VULKANFORGE_KV_FP8=1)

For the Gemma-4-26B-A4B MoE models (both Q3_K_M and the QAT-Q4_0 line) FP8 KV-cache is mandatory, not optional: the non-FP8 KV path (F16/F32) is a known-broken code path for this MoE — it produces a Layer-0 attention NaN and degenerate/<pad> output. Since v0.7.2 the engine aborts at load with an actionable error if VULKANFORGE_KV_FP8=1 is missing, rather than silently generating garbage (debug-only override: VULKANFORGE_ALLOW_BROKEN_KV=1). FP8 KV also halves KV-cache VRAM, which is what lets the 26B fit in 16 GB. Dense models and Qwen3.5/3.6 are unaffected.

Tool / function calling

The OpenAI-compatible tools API works with Qwen3 / Hermes (ChatML) and, since v0.8.0, Gemma-4 (its own native tool-call format, handled by a permissive parser). See Usage.

VRAM fit (16 GB)

8B Q4_K_M ≈ 4.6 GiB; 8B FP8 ≈ 7.5–8.5 GiB; 14B FP8 ≈ 13.8 GiB; Gemma-4-26B-A4B Q3_K_M ≈ 13 GiB resident (+ KV — VULKANFORGE_KV_FP8=1, required, see above). With auto context sizing (v0.8.0) the server picks a KV context that fits the remaining VRAM automatically; see Configuration and Troubleshooting.

Full detail: docs/MODELS.md.

Clone this wiki locally