Benchmarks

VulkanForge vs llama.cpp's Vulkan backend: same hardware, same GGUF file, and the same quant, ctx, FA and KV precision per row, all measured in a single benchmark session (no cherry-picking). This is the authoritative v0.7.0 matrix.

Conditions

HW: AMD Radeon RX 9070 XT, gfx1201 (RDNA 4).
Driver: RADV / Mesa 26.1.2-arch2.1 (Vulkan 1.4.303).
VulkanForge: v0.7.0 — coopmat flash-attention (dense hd128 + Gemma-4 hd256/512) + batched MoE-router gate-proj, all default-on.
llama.cpp: Vulkan build (GGML_VULKAN, KHR_coopmat — not HIP), b9174-g0253fb21f, -fa 1 -ngl 99.
ctx 4096, greedy sampling, validation off, warm (first run discarded), median of ≥3–5 runs; runs strictly sequential (never both engines at once).
Prefill method: dense = run_pp_bench (compute is value-independent, so this is fair); Gemma-MoE = varied-token prompts (a single-token-repeat micro-bench degenerates MoE routing and overstates MoE prefill, so it is not used for MoE).
KV precision: 8B dense = f16 both engines; Gemma-4-26B = VulkanForge FP8 vs llama.cpp q8_0 (both 8-bit, nearest equivalent).
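
The warm-up and median protocol listed above can be sketched in a few lines of Python. This is a minimal illustration, not VulkanForge's actual harness: `bench_tok_per_s`, `run_once` and `fake_pass` are hypothetical names, and the sleeping workload merely stands in for one prefill or decode pass.

```python
import statistics
import time
from typing import Callable


def bench_tok_per_s(run_once: Callable[[], int], runs: int = 5, warmup: int = 1) -> float:
    """Return the median tokens/s over `runs` timed passes.

    `run_once` is a hypothetical callable that performs one pass and
    returns the number of tokens it processed. The first `warmup`
    passes are executed but not timed (warm state, first run discarded).
    """
    for _ in range(warmup):
        run_once()  # warm-up pass: result discarded

    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        tokens = run_once()
        elapsed = time.perf_counter() - start
        rates.append(tokens / elapsed)

    # The median is less sensitive to a single slow or fast outlier than the mean.
    return statistics.median(rates)


def fake_pass() -> int:
    """Stand-in workload (assumption): pretends to process 512 tokens."""
    time.sleep(0.01)
    return 512


if __name__ == "__main__":
    print(f"{bench_tok_per_s(fake_pass, runs=3):.0f} tok/s")
```

Running the two engines strictly one after the other, as the conditions require, would simply mean calling such a function once per engine in sequence rather than in parallel.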

Matrix (tok/s; prefill, and generated tok/s for decode)

Model	Quant	KV	VF p≈512	llama p≈512	VF/ll p≈512	VF p≈2048	llama p≈2048	VF/ll p≈2048	VF decode	llama decode	VF/ll decode
Qwen3-8B	Q4_K_M	f16	4291	4472	0.96	4280	4479	0.96	109.4	114.9	0.95
Llama-3.1-8B-Instruct	Q4_K_M	f16	4474	4802	0.93	4470	4644	0.96	114.5	119.6	0.96
Mistral-7B-v0.3	Q4_K_M	f16	4845	4826	1.00	4825	4654	1.04	124.2	127.5	0.97
DeepSeek-R1-Distill-Llama-8B	Q4_K_M	f16	4461	4785	0.93	4464	4628	0.96	114.7	118.6	0.97
Gemma-4-26B-A4B (MoE)	Q3_K_M	FP8/q8_0	2085	3251	0.64	2862	3219	0.89	110.2	127.1	0.87
Gemma-4-26B-A4B (MoE)	Q4_0 ⁑	FP8/q8_0	2653	4140	0.64	3436	4119	0.83	121.2	132.7	0.91

Prefill lengths (matched per row): dense exact 512 / 2048; MoE 513 / 2049. ⁑ QAT row = the same GGUF (gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf) for both engines; VulkanForge labels it Q4_K_XL by filename, llama.cpp reads the tensors as Q4_0.
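
Each VF/ll cell is the VulkanForge throughput divided by the llama.cpp throughput in the same row and column, rounded to two decimals. The following Python sketch recomputes the ratios from the numbers in the matrix; the `ROWS` layout and the `ratios` helper are illustrative names and not part of VulkanForge.

```python
# (model, quant) -> ((VF, llama) at p≈512, (VF, llama) at p≈2048, (VF, llama) decode), all in tok/s.
# Values are copied from the matrix above.
ROWS = {
    ("Qwen3-8B", "Q4_K_M"): ((4291, 4472), (4280, 4479), (109.4, 114.9)),
    ("Llama-3.1-8B-Instruct", "Q4_K_M"): ((4474, 4802), (4470, 4644), (114.5, 119.6)),
    ("Mistral-7B-v0.3", "Q4_K_M"): ((4845, 4826), (4825, 4654), (124.2, 127.5)),
    ("DeepSeek-R1-Distill-Llama-8B", "Q4_K_M"): ((4461, 4785), (4464, 4628), (114.7, 118.6)),
    ("Gemma-4-26B-A4B", "Q3_K_M"): ((2085, 3251), (2862, 3219), (110.2, 127.1)),
    ("Gemma-4-26B-A4B", "Q4_0"): ((2653, 4140), (3436, 4119), (121.2, 132.7)),
}


def ratios(row):
    """VulkanForge / llama.cpp throughput ratio for each column pair, rounded to 2 decimals."""
    return tuple(round(vf / ll, 2) for vf, ll in row)


if __name__ == "__main__":
    for (model, quant), row in ROWS.items():
        print(model, quant, ratios(row))
```

A ratio above 1.00 means VulkanForge is faster in that cell; below 1.00 means llama.cpp is faster.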

Reading the matrix (honest framing)

Dense prefill is at parity: 0.93–1.04× llama.cpp Vulkan across both lengths (Mistral ahead). Coopmat flash-attention on dense head_dim=128 (v0.7.0) is what closed this — dense attention is no longer the bottleneck.
Gemma-4 MoE prefill is 0.64× @512 → 0.83–0.89× @2048. The router/gate-proj are no longer the bottleneck (gate-proj batched in v0.7.0); the remaining gap is the grouped expert-GEMM.
Decode is 0.87–0.97× llama (unchanged in v0.7.0; all v0.7.0 changes are prefill-only).
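
The ranges quoted in this section are simply the minimum and maximum of the corresponding VF/ll cells. A small Python sketch, assuming the ratios are copied by hand from the matrix (the `DENSE`, `MOE` and `span` names are illustrative):

```python
# VF/ll ratios copied from the matrix, one tuple per row: (p≈512, p≈2048, decode).
DENSE = [
    (0.96, 0.96, 0.95),  # Qwen3-8B
    (0.93, 0.96, 0.96),  # Llama-3.1-8B-Instruct
    (1.00, 1.04, 0.97),  # Mistral-7B-v0.3
    (0.93, 0.96, 0.97),  # DeepSeek-R1-Distill-Llama-8B
]
MOE = [
    (0.64, 0.89, 0.87),  # Gemma-4-26B-A4B Q3_K_M
    (0.64, 0.83, 0.91),  # Gemma-4-26B-A4B Q4_0 (QAT)
]


def span(values):
    """Return (min, max) of an iterable of ratios."""
    values = list(values)
    return min(values), max(values)


dense_prefill = span(r for row in DENSE for r in row[:2])  # both prefill lengths
moe_p512 = span(row[0] for row in MOE)
moe_p2048 = span(row[1] for row in MOE)
decode = span(row[2] for row in DENSE + MOE)

if __name__ == "__main__":
    print("dense prefill:", dense_prefill)
    print("MoE prefill @512:", moe_p512)
    print("MoE prefill @2048:", moe_p2048)
    print("decode:", decode)
```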

To choose between the coding-capable models (the Gemma Q3_K_M / QAT variants and Qwen3.6-27B) by quality, speed and context rather than by comparison with llama.cpp, see Choosing a Model for Coding.

Notes

This is the same-backend Vulkan axis. llama.cpp's ROCm/HIP backend, power-efficiency (tok/s/W), and vLLM (FP8 batch) are separate axes — see docs/BENCHMARKS.md for those.
All six GGUF/FP8 production paths pass the deterministic 15-prompt coherence suite.

See Architecture for how the prefill paths work, and Configuration for the opt-out flags.

VulkanForge v1.0.4 · single-user RDNA 4 / gfx1201 Vulkan inference · GPL-3.0 · Repository · Releases
