Architecture

Architecture Overview

A high-level map of how VulkanForge runs a model. (Implementation detail lives in the code; this page is the orientation.)

Backend

Vulkan compute-only. No swapchain, no graphics queues. Built directly on ash (raw Vulkan 1.4 bindings), with gpu-allocator for VRAM suballocation.
GLSL compute shaders → SPIR-V. The kernels are GLSL .comp shaders compiled to SPIR-V at build time (via shaderc) and embedded in the ~14 MB static binary.
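
As a rough illustration of the embed-and-load step described above, the sketch below turns embedded shader bytes into the 32-bit words a Vulkan shader module expects and checks the SPIR-V magic number first. It uses only the standard library and is not VulkanForge's actual loader: the function name `spirv_words` and the file name in the comment are hypothetical, and little-endian modules are assumed.

```rust
/// First 32-bit word of every SPIR-V module (per the SPIR-V specification).
pub const SPIRV_MAGIC: u32 = 0x0723_0203;

// In a real build the bytes would typically come from something like
//   static FLASH_ATTN_SPV: &[u8] =
//       include_bytes!(concat!(env!("OUT_DIR"), "/flash_attn.spv"));
// where the file name is hypothetical and build.rs writes the compiled shader.

/// Convert embedded shader bytes into SPIR-V words.
/// Returns `None` if the input is empty, its length is not a multiple of 4,
/// or the first word is not the SPIR-V magic number.
/// Assumption: the module is stored little-endian.
pub fn spirv_words(bytes: &[u8]) -> Option<Vec<u32>> {
    if bytes.is_empty() || bytes.len() % 4 != 0 {
        return None;
    }
    let words: Vec<u32> = bytes
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect();
    (words[0] == SPIRV_MAGIC).then_some(words)
}
```

The resulting word slice is the form that shader-module creation consumes. Validating the magic number up front turns a corrupted or mis-embedded shader into an early, readable failure rather than a driver-level error.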

Key kernels & paths

CoopMat flash-attention. A KHR-cooperative-matrix flash-attention kernel (16×16×16 coopmat QK + PV, online-softmax, row-split). In v0.7.0 it covers both dense head_dim=128 and Gemma-4's heterogeneous head_dim 256 (sliding) / 512 (full). This is what brings dense prefill to parity with llama.cpp (see Benchmarks).
Native FP8 WMMA. Uses a V_WMMA_F32_16X16X16_FP8_FP8 GEMM when the driver advertises shaderFloat8CooperativeMatrix (Mesa 26.1+), and falls back to BF16 conversion otherwise. All three FP8 scaling strategies (per-tensor / per-channel / block-wise) are auto-detected.
Tier-based VRAM bucketing. Model tensors are grouped into tiers (lm_head / attention / dense-FFN / MoE-experts) and packed into VRAM buckets to manage residency and bandwidth on the 16 GB card.
MoE (Gemma-4-26B-A4B). Expert-grouped prefill dispatch (one grouped GEMM per expert instead of a per-token scatter loop), a batched router with the gate-projection run through the dense GEMM (v0.7.0), and a parallel top-K. The router selects top-8 of 128 experts per token.
KV-cache. F32 / F16 / FP8 KV. FP8 halves KV-cache VRAM relative to F16 and is required for the Gemma-4-26B MoE: its non-FP8 KV path is known to be broken, so the engine fails loudly and aborts unless VULKANFORGE_KV_FP8=1 is set.
GDN hybrid (Qwen3.6 / qwen35). Gated-delta-net + full-attention hybrid layers are supported (dense), with a batched prefill path.
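
The online-softmax step named in the flash-attention entry can be sketched on the CPU as follows. This is a scalar reference for a single query row, not the GLSL kernel: it assumes the QK scores are already computed, scaled, and finite, and the names `attend_online` and `attend_reference` are hypothetical. The point it shows is that keys and values can be consumed block by block while keeping only a running maximum, a running sum, and a rescaled accumulator.

```rust
/// Attention output for one query row, consuming keys/values in blocks.
/// `scores[i]` is the (already scaled) dot product of the query with key i;
/// `values[i]` is the matching value vector. Scores are assumed finite.
pub fn attend_online(scores: &[f32], values: &[Vec<f32>], block: usize) -> Vec<f32> {
    assert_eq!(scores.len(), values.len());
    assert!(block > 0 && !scores.is_empty());
    let dim = values[0].len();
    let mut m = f32::NEG_INFINITY; // running maximum of the scores seen so far
    let mut l = 0.0f32; // running sum of exp(score - m)
    let mut acc = vec![0.0f32; dim]; // running sum of exp(score - m) * value

    for (s_blk, v_blk) in scores.chunks(block).zip(values.chunks(block)) {
        let blk_max = s_blk.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let m_new = m.max(blk_max);
        // Rescale the old state to the new maximum; exp(-inf) = 0 on the first block.
        let scale = (m - m_new).exp();
        l *= scale;
        for a in acc.iter_mut() {
            *a *= scale;
        }
        for (&s, v) in s_blk.iter().zip(v_blk) {
            let p = (s - m_new).exp();
            l += p;
            for (a, &x) in acc.iter_mut().zip(v) {
                *a += p * x;
            }
        }
        m = m_new;
    }
    acc.iter().map(|a| a / l).collect()
}

/// Two-pass softmax-weighted sum, used only to check the online version.
pub fn attend_reference(scores: &[f32], values: &[Vec<f32>]) -> Vec<f32> {
    let m = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let w: Vec<f32> = scores.iter().map(|s| (s - m).exp()).collect();
    let l: f32 = w.iter().sum();
    let mut out = vec![0.0f32; values[0].len()];
    for (p, v) in w.iter().zip(values) {
        for (o, &x) in out.iter_mut().zip(v) {
            *o += p * x / l;
        }
    }
    out
}
```

Both functions should agree up to floating-point rounding for any block size. The online form never materializes the full score row, which is what lets a tiled kernel keep its working set in registers or shared memory.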

Decode vs prefill

The v0.7.0 changes are prefill-only. Decode runs a separate scalar-attention path and is unchanged in v0.7.0; the MoE router stays on its per-token fused path for decode (seq_len == 1).
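
The split described here amounts to a choice keyed on sequence length. The sketch below makes that explicit; every type and function name in it is hypothetical, and it assumes (following the text above) that seq_len == 1 identifies decode and anything longer is prefill. It is an orientation aid, not the engine's dispatch code.

```rust
/// Hypothetical labels for the two attention paths described on this page.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AttentionPath {
    /// Cooperative-matrix flash-attention (prefill).
    CoopMatFlash,
    /// Scalar attention (decode).
    Scalar,
}

/// Hypothetical labels for the two MoE router paths.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RouterPath {
    /// Batched router with the gate projection run through the dense GEMM (prefill).
    BatchedDenseGemm,
    /// Per-token fused router (decode).
    PerTokenFused,
}

/// Pick the paths for a forward pass over `seq_len` tokens.
/// Assumption: `seq_len >= 1`, and `seq_len == 1` means decode.
pub fn select_paths(seq_len: usize) -> (AttentionPath, RouterPath) {
    if seq_len == 1 {
        (AttentionPath::Scalar, RouterPath::PerTokenFused)
    } else {
        (AttentionPath::CoopMatFlash, RouterPath::BatchedDenseGemm)
    }
}
```

For example, `select_paths(1)` yields the scalar attention and per-token router pair, while a 512-token prompt selects the flash-attention and batched-router pair.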

Attribution & license

VulkanForge is licensed under GPL-3.0.
It builds on oldnordic/ROCmForge, which provided the original model loader, GGUF parser, CPU inference path, and overall architecture.
The GLSL compute shaders are faithful ports of llama.cpp's Vulkan kernels (e.g. flash_attn_cm1, the mul_mm GEMM family); llama.cpp is MIT-licensed.
Rust dependencies (permissive): ash (Vulkan), gpu-allocator (VRAM), bytemuck, half, memmap2, clap, rayon, rustyline, and the API-server stack axum / tokio / tower-http.

For shipped numbers see Benchmarks; for what you can tune see Configuration.

VulkanForge v1.0.4 · single-user RDNA 4 / gfx1201 Vulkan inference · GPL-3.0 · Repository · Releases
