Architecture
maeddesg edited this page Jun 12, 2026 · 2 revisions
A high-level map of how VulkanForge runs a model. (Implementation detail lives in the code; this page is the orientation.)
- Vulkan compute-only. No swapchain, no graphics queues. Built directly on `ash` (raw Vulkan 1.4 bindings), with `gpu-allocator` for VRAM suballocation.
- GLSL compute shaders → SPIR-V. The kernels are GLSL `.comp` shaders compiled to SPIR-V at build time (via `shaderc`) and embedded in the ~14 MB static binary.
- CoopMat flash-attention. A KHR-cooperative-matrix flash-attention kernel (16×16×16 coopmat QK + PV, online-softmax, row-split). In v0.7.0 it covers both dense `head_dim=128` and Gemma-4's heterogeneous `head_dim` 256 (sliding) / 512 (full). This is what brings dense prefill to llama.cpp parity (see Benchmarks).
- Native FP8 WMMA. `V_WMMA_F32_16X16X16_FP8_FP8` GEMM when the driver advertises `shaderFloat8CooperativeMatrix` (Mesa 26.1+); BF16 conversion fallback otherwise. All three FP8 scaling strategies (per-tensor / per-channel / block-wise) are auto-detected.
- Tier-based VRAM bucketing. Model tensors are grouped into tiers (lm_head / attention / dense-FFN / MoE-experts) and packed into VRAM buckets to manage residency and bandwidth on the 16 GB card.
- MoE (Gemma-4-26B-A4B). Expert-grouped prefill dispatch (one grouped GEMM per expert instead of a per-token scatter loop), a batched router with the gate-projection run through the dense GEMM (v0.7.0), and a parallel top-K. The router selects top-8 of 128 experts per token.
- KV-cache. F32 / F16 / FP8 KV. FP8 halves KV-cache VRAM relative to F16 and is required for the Gemma-4-26B MoE (its non-FP8 KV path is known to be broken; the engine fails loudly and aborts unless `VULKANFORGE_KV_FP8=1` is set).
- GDN hybrid (Qwen3.6 / `qwen35`). Gated-delta-net + full-attention hybrid layers are supported (dense), with a batched prefill path.
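The online-softmax step named in the flash-attention item can be sketched on the CPU. The Rust below is a minimal scalar illustration for a single query row, not VulkanForge's kernel: the function names `attend_online` and `attend_two_pass` are hypothetical, and the cooperative-matrix tiling and row-split scheduling are left out entirely. It shows how a running maximum and a running denominator let the softmax-weighted sum over V be accumulated in one pass over the keys, next to a conventional two-pass version for comparison.

```rust
/// Attention output for one query row, computed in a single pass over the keys
/// with the online-softmax recurrence (running max + running denominator).
/// `keys` and `values` hold one vector per cached position.
/// Hypothetical helper for illustration only.
pub fn attend_online(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    assert_eq!(keys.len(), values.len());
    let scale = 1.0 / (q.len() as f32).sqrt();
    let dim_v = values.first().map_or(0, |v| v.len());

    let mut running_max = f32::NEG_INFINITY;
    let mut denom = 0.0f32;
    let mut acc = vec![0.0f32; dim_v];

    for (k, v) in keys.iter().zip(values) {
        let score = scale * q.iter().zip(k).map(|(a, b)| a * b).sum::<f32>();
        let new_max = running_max.max(score);
        // Rescale everything accumulated so far to the new maximum.
        let correction = (running_max - new_max).exp();
        let weight = (score - new_max).exp();
        denom = denom * correction + weight;
        for (a, x) in acc.iter_mut().zip(v) {
            *a = *a * correction + weight * x;
        }
        running_max = new_max;
    }

    if denom > 0.0 {
        for a in acc.iter_mut() {
            *a /= denom;
        }
    }
    acc
}

/// Reference version: materialize all scores, then apply a standard softmax.
pub fn attend_two_pass(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let scale = 1.0 / (q.len() as f32).sqrt();
    let scores: Vec<f32> = keys
        .iter()
        .map(|k| scale * q.iter().zip(k).map(|(a, b)| a * b).sum::<f32>())
        .collect();
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let weights: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let denom: f32 = weights.iter().sum();

    let dim_v = values.first().map_or(0, |v| v.len());
    let mut out = vec![0.0f32; dim_v];
    for (w, v) in weights.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += w * x / denom;
        }
    }
    out
}
```

For finite inputs the two functions should agree to within floating-point rounding. The single-pass form is what allows a kernel to stream over K/V tiles without storing the full row of scores.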
The v0.7.0 changes are prefill-only. Decode runs a separate scalar-attention path and is unchanged
across v0.7.0; the MoE router stays on its per-token fused path for decode (seq_len == 1).
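The prefill/decode split for the MoE router can be illustrated with a small CPU sketch. Everything in it is hypothetical: the names `route_token` and `group_by_expert` are made up for this example, and the choice to renormalize with a softmax over only the selected logits is an assumption that this page does not specify. `route_token` does the per-token work that decode needs for its single token; `group_by_expert` inverts the routes of many tokens into per-expert token lists, which is the input shape an expert-grouped prefill dispatch works from.

```rust
/// Pick the `k` highest-scoring experts for one token and return
/// (expert index, normalized weight) pairs, best first.
/// Hypothetical helper; the normalization scheme is an assumption.
pub fn route_token(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut order: Vec<usize> = (0..logits.len()).collect();
    // Stable sort, descending by logit; ties keep the lower expert index first.
    order.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    order.truncate(k);

    // Softmax over the selected logits only (assumed, not taken from the source).
    let max = order
        .iter()
        .map(|&i| logits[i])
        .fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = order.iter().map(|&i| (logits[i] - max).exp()).collect();
    let sum: f32 = exps.iter().sum();

    order
        .into_iter()
        .zip(exps)
        .map(|(i, e)| (i, e / sum))
        .collect()
}

/// Prefill-style grouping: turn per-token routes into per-expert token lists,
/// so each expert can process all of its tokens in one batched call.
pub fn group_by_expert(routes: &[Vec<(usize, f32)>], num_experts: usize) -> Vec<Vec<usize>> {
    let mut groups = vec![Vec::new(); num_experts];
    for (token, route) in routes.iter().enumerate() {
        for &(expert, _) in route {
            groups[expert].push(token);
        }
    }
    groups
}
```

With 128 experts and `k = 8`, `route_token(&logits, 8)` returns eight `(expert, weight)` pairs whose weights sum to 1 up to rounding. For decode there is only one token, so the grouping step has nothing to batch, which is consistent with decode keeping a per-token path.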
- VulkanForge is licensed under GPL-3.0.
- It builds on oldnordic/ROCmForge — the original model loader, GGUF parser, CPU inference path, and overall architecture.
- The GLSL compute shaders are faithful ports of llama.cpp's Vulkan kernels (e.g. `flash_attn_cm1`, the `mul_mm` GEMM family); llama.cpp is MIT-licensed.
- Rust dependencies (permissive): `ash` (Vulkan), `gpu-allocator` (VRAM), `bytemuck`, `half`, `memmap2`, `clap`, `rayon`, `rustyline`, and the API-server stack `axum` / `tokio` / `tower-http`.
For shipped numbers see Benchmarks; for what you can tune see Configuration.
VulkanForge v1.0.4 · single-user RDNA 4 / gfx1201 Vulkan inference · GPL-3.0