Benchmarks
VulkanForge vs llama.cpp's Vulkan backend — same hardware, same GGUF file, same quant, ctx, FA and KV precision per row, measured in one run (no cherry-picking). This is the authoritative v0.7.0 matrix.
- HW: AMD Radeon RX 9070 XT, `gfx1201` (RDNA 4).
- Driver: RADV / Mesa 26.1.2-arch2.1 (Vulkan 1.4.303).
- VulkanForge: v0.7.0, with coopmat flash-attention (dense hd128 + Gemma-4 hd256/512) and batched MoE-router gate-proj, all default-on.
- llama.cpp: Vulkan build (`GGML_VULKAN`, `KHR_coopmat`; not HIP), `b9174-g0253fb21f`, `-fa 1 -ngl 99`.
- ctx 4096, greedy, validation off, warm (first run discarded), median of 3–5 runs; runs strictly sequential (never both engines at once).
- Prefill method: dense = `run_pp_bench` (compute is value-independent, so this is fair); Gemma-MoE = varied-token prompts (a single-token-repeat micro-bench degenerates MoE routing and overstates MoE prefill, so it is not used for MoE).
- KV precision: 8B dense = f16 on both engines; Gemma-4-26B = VulkanForge FP8 vs llama.cpp `q8_0` (both 8-bit, nearest equivalent).
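The warm-up and median rule in the list above can be sketched as follows. This is a minimal illustration and not VulkanForge's actual harness: the function name `warm_median`, its parameters, and the sample values are hypothetical.

```python
from statistics import median


def warm_median(samples, warmup=1, min_runs=3):
    """Return the median throughput after dropping the first `warmup` runs.

    Hypothetical helper: `warmup=1` mirrors "first run discarded" and
    `min_runs=3` mirrors the lower bound on measured runs.
    """
    kept = list(samples)[warmup:]
    if len(kept) < min_runs:
        raise ValueError(f"need at least {min_runs} warm runs, got {len(kept)}")
    return median(kept)


# Made-up tok/s samples; the first (cold) run is discarded before the median.
runs = [90.0, 101.0, 100.0, 102.0]
print(warm_median(runs))  # 101.0
```

Using the median instead of the mean keeps a single slow or fast outlier run from shifting the reported figure.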
| Model | Quant | KV | VF p≈512 | llama p≈512 | VF/ll | VF p≈2048 | llama p≈2048 | VF/ll | VF decode | llama decode | VF/ll |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-8B | Q4_K_M | f16 | 4291 | 4472 | 0.96 | 4280 | 4479 | 0.96 | 109.4 | 114.9 | 0.95 |
| Llama-3.1-8B-Instruct | Q4_K_M | f16 | 4474 | 4802 | 0.93 | 4470 | 4644 | 0.96 | 114.5 | 119.6 | 0.96 |
| Mistral-7B-v0.3 | Q4_K_M | f16 | 4845 | 4826 | 1.00 | 4825 | 4654 | 1.04 | 124.2 | 127.5 | 0.97 |
| DeepSeek-R1-Distill-Llama-8B | Q4_K_M | f16 | 4461 | 4785 | 0.93 | 4464 | 4628 | 0.96 | 114.7 | 118.6 | 0.97 |
| Gemma-4-26B-A4B (MoE) | Q3_K_M | FP8/q8_0 | 2085 | 3251 | 0.64 | 2862 | 3219 | 0.89 | 110.2 | 127.1 | 0.87 |
| Gemma-4-26B-A4B (MoE) | Q4_0 ⁑ | FP8/q8_0 | 2653 | 4140 | 0.64 | 3436 | 4119 | 0.83 | 121.2 | 132.7 | 0.91 |
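Prefill and decode figures are throughput in tokens/s; VF/ll is VulkanForge divided by llama.cpp (above 1.00 means VulkanForge is faster).

Prefill lengths (matched per row): dense exact 512 / 2048; MoE 513 / 2049.

⁑ QAT row = the same GGUF (`gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf`) for both engines; VulkanForge labels it Q4_K_XL by filename, llama.cpp reads the tensors as Q4_0.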
- Dense prefill is at parity: 0.93–1.04× llama.cpp Vulkan across both lengths (Mistral ahead). Coopmat flash-attention on dense `head_dim=128` (v0.7.0) is what closed this gap; dense attention is no longer the bottleneck.
- Gemma-4 MoE prefill is 0.64× @512 → 0.83–0.89× @2048. The router/gate-proj are no longer the bottleneck (gate-proj batched in v0.7.0); the remaining gap is the grouped expert-GEMM.
- Decode is 0.87–0.97× llama.cpp (unchanged in v0.7.0; all v0.7.0 changes are prefill-only).
- For choosing between the coding-capable models (the Gemma Q3_K_M / QAT variants and Qwen3.6-27B) by quality / speed / context rather than by comparison with llama.cpp, see Choosing a Model for Coding.
- This is the same-backend Vulkan axis. llama.cpp's ROCm/HIP backend, power-efficiency (tok/s/W), and vLLM (FP8 batch) are separate axes; see docs/BENCHMARKS.md for those.
- All six GGUF/FP8 production paths pass the deterministic 15-prompt coherence suite.
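The VF/ll columns and the ranges quoted above can be re-derived from the raw table values. The sketch below assumes each ratio is VulkanForge throughput divided by llama.cpp throughput, rounded to two decimals; the names `TABLE`, `ratio`, and `ratio_range` are illustrative only.

```python
# (VulkanForge, llama.cpp) throughput pairs copied from the benchmark table.
TABLE = {
    "Qwen3-8B": {"p512": (4291, 4472), "p2048": (4280, 4479), "decode": (109.4, 114.9)},
    "Llama-3.1-8B-Instruct": {"p512": (4474, 4802), "p2048": (4470, 4644), "decode": (114.5, 119.6)},
    "Mistral-7B-v0.3": {"p512": (4845, 4826), "p2048": (4825, 4654), "decode": (124.2, 127.5)},
    "DeepSeek-R1-Distill-Llama-8B": {"p512": (4461, 4785), "p2048": (4464, 4628), "decode": (114.7, 118.6)},
    "Gemma-4-26B-A4B Q3_K_M": {"p512": (2085, 3251), "p2048": (2862, 3219), "decode": (110.2, 127.1)},
    "Gemma-4-26B-A4B Q4_0": {"p512": (2653, 4140), "p2048": (3436, 4119), "decode": (121.2, 132.7)},
}
DENSE = ("Qwen3-8B", "Llama-3.1-8B-Instruct", "Mistral-7B-v0.3", "DeepSeek-R1-Distill-Llama-8B")


def ratio(pair):
    """VulkanForge / llama.cpp, rounded to two decimals as in the table."""
    vf, ll = pair
    return round(vf / ll, 2)


def ratio_range(models, columns):
    """Smallest and largest ratio over the given rows and columns."""
    values = [ratio(TABLE[m][c]) for m in models for c in columns]
    return min(values), max(values)


print(ratio_range(DENSE, ("p512", "p2048")))  # (0.93, 1.04) dense prefill
print(ratio_range(TABLE, ("decode",)))        # (0.87, 0.97) decode, all rows
```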
See Architecture for how the prefill paths work, and Configuration for the opt-out flags.
VulkanForge v1.0.4 · single-user RDNA 4 / gfx1201 Vulkan inference · GPL-3.0