Supported Models

VulkanForge loads GGUF (llama.cpp-compatible quantization) and HuggingFace FP8 SafeTensors. The GGUF loader covers the llama, qwen2 / qwen3 / qwen35, and gemma4 architectures; the tokenizer is read from the GGUF (no extra setup). FP8 SafeTensors need a tokenizer via --tokenizer-from <gguf> from the same family (auto-loaded from the model dir with VF_FP8=auto).

Model families

For a side-by-side of the coding-capable models (quality / speed / context trade-offs), see Choosing a Model for Coding.

Family	Notes
Llama-3.1-8B-Instruct	Dense, `head_dim=128`. GGUF Q4_K_M, and FP8 (per-tensor).
Qwen3-8B	Dense, hd128. GGUF Q4_K_M, and FP8 (block-wise `[128,128]`).
Qwen3-14B	Dense, hd128. GGUF Q4_K_M.
Mistral-7B-Instruct-v0.3	Dense, hd128, full-causal. GGUF Q4_K_M.
DeepSeek-R1-Distill-Llama-8B	Dense, hd128 (Llama arch). Reasoning model — emits `<think>…</think>` (see Troubleshooting). GGUF Q4_K_M.
Gemma-4-26B-A4B	MoE (128 experts, top-8), heterogeneous head dims (hd256 sliding + hd512 full). GGUF Q3_K_M and the QAT-Q4_0 line. ⚠️ Requires `VULKANFORGE_KV_FP8=1` — the non-FP8 KV path is known-broken for this MoE (the engine aborts at load without it; see note below).
Gemma-4-E2B / E4B	Smaller Gemma-4. SafeTensors path (on-the-fly Q4_K via `VF_QUANTIZE_ON_LOAD=1`).
Qwen3.6-27B (qwen35)	GDN hybrid (gated-delta-net), dense. GGUF Q3_K_S.

GGUF quant formats

Format	Status	Notes
Q4_K_M	✅	Primary GGUF format; all benchmarks use it.
Q3_K_M	✅	Smaller VRAM, slight quality drop (one of the Gemma-4-26B MoE options — see Choosing a Model for Coding).
Q5_K	✅
Q4_0	✅	The Gemma-4 QAT GGUF line (E2B/E4B/12B/26B/31B) since v0.6.1. Qwen2.5-Q4_0 stays gated (missing arch features, not the quant).
Q6_K	✅	Chat works.
Q8_0	⚠️	Loadable via `chat`, but rejected by `bench`.

Native FP8 E4M3 (SafeTensors)

All three FP8 scaling strategies are auto-detected from config.json + the SafeTensors header:

Scale type	Example model	VRAM	15-prompt coherence
Per-tensor	`Meta-Llama-3.1-8B-Instruct-FP8`	7.5 GiB	13/15
Per-channel	`Qwen2.5-14B-Instruct-FP8`	13.8 GiB	15/15
Block-wise `[128,128]`	`Qwen3-8B-FP8`	8.5 GiB	15/15

FP8 quirks: the tokenizer is not in the FP8 SafeTensors model (use --tokenizer-from); block-wise needs a block size divisible by 64/16 (the [128,128] calibration satisfies both). FP8 SafeTensors loading currently supports the gpt2 tokenizer family (Mistral / Llama-2 SPM not yet wired for FP8).

Gemma-4-26B-A4B requires KV-FP8 (`VULKANFORGE_KV_FP8=1`)

For the Gemma-4-26B-A4B MoE models (both Q3_K_M and the QAT-Q4_0 line) FP8 KV-cache is mandatory, not optional: the non-FP8 KV path (F16/F32) is a known-broken code path for this MoE — it produces a Layer-0 attention NaN and degenerate/<pad> output. Since v0.7.2 the engine aborts at load with an actionable error if VULKANFORGE_KV_FP8=1 is missing, rather than silently generating garbage (debug-only override: VULKANFORGE_ALLOW_BROKEN_KV=1). FP8 KV also halves KV-cache VRAM, which is what lets the 26B fit in 16 GB. Dense models and Qwen3.5/3.6 are unaffected.

Tool / function calling

The OpenAI-compatible tools API works with Qwen3 / Hermes (ChatML) and, since v0.8.0, Gemma-4 (its own native tool-call format, handled by a permissive parser). See Usage.

VRAM fit (16 GB)

8B Q4_K_M ≈ 4.6 GiB; 8B FP8 ≈ 7.5–8.5 GiB; 14B FP8 ≈ 13.8 GiB; Gemma-4-26B-A4B Q3_K_M ≈ 13 GiB resident (+ KV — VULKANFORGE_KV_FP8=1, required, see above). With auto context sizing (v0.8.0) the server picks a KV context that fits the remaining VRAM automatically; see Configuration and Troubleshooting.

Full detail: docs/MODELS.md.

VulkanForge v1.0.4 · single-user RDNA 4 / gfx1201 Vulkan inference · GPL-3.0 · Repository · Releases

VulkanForge Wiki

Get Started

Use VulkanForge

Reference

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Supported Models

Supported Models

Model families

GGUF quant formats

Native FP8 E4M3 (SafeTensors)

Gemma-4-26B-A4B requires KV-FP8 (`VULKANFORGE_KV_FP8=1`)

Tool / function calling

VRAM fit (16 GB)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

VulkanForge Wiki

Clone this wiki locally

Supported Models

Supported Models

Model families

GGUF quant formats

Native FP8 E4M3 (SafeTensors)

Gemma-4-26B-A4B requires KV-FP8 (VULKANFORGE_KV_FP8=1)

Tool / function calling

VRAM fit (16 GB)

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

VulkanForge Wiki

Clone this wiki locally

Gemma-4-26B-A4B requires KV-FP8 (`VULKANFORGE_KV_FP8=1`)