maeddesg edited this page Jun 10, 2026 · 2 revisions

Benchmarks

VulkanForge vs llama.cpp's Vulkan backend — same hardware, same GGUF file, same quant, ctx, FA and KV precision per row, measured in one run (no cherry-picking). This is the authoritative v0.7.0 matrix.

Conditions

  • HW: AMD Radeon RX 9070 XT, gfx1201 (RDNA 4).
  • Driver: RADV / Mesa 26.1.2-arch2.1 (Vulkan 1.4.303).
  • VulkanForge: v0.7.0 — coopmat flash-attention (dense hd128 + Gemma-4 hd256/512) + batched MoE-router gate-proj, all default-on.
  • llama.cpp: Vulkan build (GGML_VULKAN, KHR_coopmat; not HIP), b9174-g0253fb21f, -fa 1 -ngl 99.
  • ctx 4096, greedy, validation off, warm (first run discarded), median of 3–5 timed runs (never fewer than 3); runs strictly sequential (never both engines at once).
  • Prefill method: dense = run_pp_bench (compute is value-independent, so this is fair); Gemma-MoE = varied-token prompts (a single-token-repeat micro-bench degenerates MoE routing and overstates MoE prefill, so it is not used for MoE).
  • KV precision: 8B dense = f16 both engines; Gemma-4-26B = VulkanForge FP8 vs llama.cpp q8_0 (both 8-bit, nearest equivalent).
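The warm-up and median rule from the conditions above can be sketched as follows. This is a minimal illustration, not VulkanForge's harness: `run_once` is a hypothetical callable standing in for one benchmark pass, and both function names are assumptions made for this example.

```python
import statistics
import time
from typing import Callable, Sequence


def median_after_warmup(tok_s_samples: Sequence[float]) -> float:
    """Drop the first (warm-up) sample and return the median of the rest."""
    timed = list(tok_s_samples[1:])
    if len(timed) < 3:
        raise ValueError("need at least 3 timed runs after the warm-up")
    return statistics.median(timed)


def warm_median_tok_s(run_once: Callable[[], int], runs: int = 5) -> float:
    """Time `runs` passes after one discarded warm-up pass; return median tok/s.

    `run_once` (hypothetical) executes one benchmark pass and returns the
    number of tokens it processed.
    """
    samples = []
    for _ in range(runs + 1):  # +1 for the warm-up pass
        start = time.perf_counter()
        tokens = run_once()
        elapsed = time.perf_counter() - start
        samples.append(tokens / elapsed)
    return median_after_warmup(samples)
```

Using the median rather than the mean keeps a single slow outlier (for example a clock ramp or a background task) from shifting the reported figure.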

Matrix (tok/s: prompt tok/s for prefill, generated tok/s for decode)

| Model | Quant | KV | VF p≈512 | llama p≈512 | VF/llama | VF p≈2048 | llama p≈2048 | VF/llama | VF decode | llama decode | VF/llama |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-8B | Q4_K_M | f16 | 4291 | 4472 | 0.96 | 4280 | 4479 | 0.96 | 109.4 | 114.9 | 0.95 |
| Llama-3.1-8B-Instruct | Q4_K_M | f16 | 4474 | 4802 | 0.93 | 4470 | 4644 | 0.96 | 114.5 | 119.6 | 0.96 |
| Mistral-7B-v0.3 | Q4_K_M | f16 | 4845 | 4826 | 1.00 | 4825 | 4654 | 1.04 | 124.2 | 127.5 | 0.97 |
| DeepSeek-R1-Distill-Llama-8B | Q4_K_M | f16 | 4461 | 4785 | 0.93 | 4464 | 4628 | 0.96 | 114.7 | 118.6 | 0.97 |
| Gemma-4-26B-A4B (MoE) | Q3_K_M | FP8/q8_0 | 2085 | 3251 | 0.64 | 2862 | 3219 | 0.89 | 110.2 | 127.1 | 0.87 |
| Gemma-4-26B-A4B (MoE) | Q4_0 ⁑ | FP8/q8_0 | 2653 | 4140 | 0.64 | 3436 | 4119 | 0.83 | 121.2 | 132.7 | 0.91 |

Prefill lengths (matched per row): dense exact 512 / 2048; MoE 513 / 2049.

⁑ QAT row: both engines load the same GGUF (gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf); VulkanForge labels it Q4_K_XL by filename, while llama.cpp reads the tensors as Q4_0.
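The VF/llama columns are plain throughput ratios rounded to two decimals. A short sketch that recomputes them from the tok/s values in the matrix above; the dictionary layout and function names are assumptions for this example, while the numbers are copied from the table.

```python
# (VulkanForge, llama.cpp) tok/s pairs copied from the matrix:
# prefill p≈512, prefill p≈2048, decode.
MATRIX = {
    "Qwen3-8B Q4_K_M": [(4291, 4472), (4280, 4479), (109.4, 114.9)],
    "Llama-3.1-8B-Instruct Q4_K_M": [(4474, 4802), (4470, 4644), (114.5, 119.6)],
    "Mistral-7B-v0.3 Q4_K_M": [(4845, 4826), (4825, 4654), (124.2, 127.5)],
    "DeepSeek-R1-Distill-Llama-8B Q4_K_M": [(4461, 4785), (4464, 4628), (114.7, 118.6)],
    "Gemma-4-26B-A4B Q3_K_M": [(2085, 3251), (2862, 3219), (110.2, 127.1)],
    "Gemma-4-26B-A4B Q4_0 (QAT)": [(2653, 4140), (3436, 4119), (121.2, 132.7)],
}


def vf_over_llama(vf_tok_s: float, llama_tok_s: float) -> float:
    """VulkanForge throughput as a fraction of llama.cpp's, to two decimals."""
    return round(vf_tok_s / llama_tok_s, 2)


def ratios(matrix):
    """Map each model to its [p≈512, p≈2048, decode] VF/llama ratios."""
    return {
        model: [vf_over_llama(vf, ll) for vf, ll in pairs]
        for model, pairs in matrix.items()
    }


if __name__ == "__main__":
    for model, row in ratios(MATRIX).items():
        print(f"{model}: " + "  ".join(f"{r:.2f}" for r in row))
```

A ratio below 1.00 means VulkanForge is slower than llama.cpp's Vulkan backend for that cell; above 1.00 means it is faster.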

Reading the matrix (honest framing)

  • Dense prefill is at parity: 0.93–1.04× llama.cpp Vulkan across both lengths (Mistral ahead). Coopmat flash-attention on dense head_dim=128 (v0.7.0) is what closed this — dense attention is no longer the bottleneck.
  • Gemma-4 MoE prefill is 0.64× @512 → 0.83–0.89× @2048. The router/gate-proj are no longer the bottleneck (gate-proj batched in v0.7.0); the remaining gap is the grouped expert-GEMM.
  • Decode is 0.87–0.97× llama (unchanged in v0.7.0; all v0.7.0 changes are prefill-only).
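The MoE prefill figures depend on the prompt activating a realistic spread of experts, which is why the MoE rows use varied-token prompts. The toy model below illustrates the effect under loud simplifications: the expert count, top-k, vocabulary size, and embedding width are all hypothetical (not Gemma-4's real values), and routing here depends only on the token embedding, whereas a real router sees the full hidden state.

```python
import random

NUM_EXPERTS = 16  # hypothetical, not Gemma-4's real expert count
TOP_K = 2         # hypothetical experts selected per token
VOCAB = 1000      # hypothetical vocabulary size
DIM = 8           # hypothetical embedding width

rng = random.Random(0)
# Random toy embeddings and router weights, for illustration only.
embed = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
router = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]


def top_k_experts(token_id):
    """Score every expert for one token and return the TOP_K best indices."""
    x = embed[token_id]
    scores = [sum(w * v for w, v in zip(row, x)) for row in router]
    ranked = sorted(range(NUM_EXPERTS), key=lambda e: scores[e], reverse=True)
    return ranked[:TOP_K]


def experts_touched(prompt):
    """Set of distinct experts activated anywhere in the prompt."""
    touched = set()
    for token_id in prompt:
        touched.update(top_k_experts(token_id))
    return touched


repeat_prompt = [42] * 512
varied_prompt = [rng.randrange(VOCAB) for _ in range(512)]

print("repeat:", len(experts_touched(repeat_prompt)), "experts")
print("varied:", len(experts_touched(varied_prompt)), "experts")
```

In this toy, the repeated-token prompt always lands on the same TOP_K experts, so the expert GEMMs see one dense, ideally shaped batch. The varied prompt spreads tokens across many more experts, which is the workload the grouped expert-GEMM actually faces.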

For choosing between the coding-capable models (the Gemma Q3_K_M / QAT variants and Qwen3.6-27B) by quality / speed / context rather than vs llama.cpp, see Choosing a Model for Coding.

Notes

  • This is the same-backend Vulkan axis. llama.cpp's ROCm/HIP backend, power-efficiency (tok/s/W), and vLLM (FP8 batch) are separate axes — see docs/BENCHMARKS.md for those.
  • All six GGUF/FP8 production paths pass the deterministic 15-prompt coherence suite.

See Architecture for how the prefill paths work, and Configuration for the opt-out flags.
