maeddesg edited this page Jun 12, 2026 · 2 revisions

Architecture Overview

A high-level map of how VulkanForge runs a model. (Implementation detail lives in the code; this page is the orientation.)

Backend

  • Vulkan compute-only. No swapchain, no graphics queues. Built directly on ash (raw Vulkan 1.4 bindings), with gpu-allocator for VRAM suballocation.
  • GLSL compute shaders → SPIR-V. The kernels are GLSL .comp shaders compiled to SPIR-V at build time (via shaderc) and embedded in the ~14 MB static binary.
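As a rough illustration of what embedding SPIR-V in the binary implies at load time: Vulkan consumes shader code as 32-bit words, and every SPIR-V module starts with the magic number 0x07230203. The sketch below is not VulkanForge's loader. `spirv_words` is a hypothetical helper, and it assumes the embedded blob is little-endian (in a real build the bytes would come from something like `include_bytes!` on a build-script output).

```rust
/// SPIR-V magic number: the first 32-bit word of every valid module.
const SPIRV_MAGIC: u32 = 0x0723_0203;

/// Hypothetical helper: turn an embedded byte blob into the u32 words that
/// shader-module creation expects. Assumes a little-endian blob.
fn spirv_words(bytes: &[u8]) -> Result<Vec<u32>, String> {
    // The code size must be a non-zero multiple of four bytes.
    if bytes.is_empty() || bytes.len() % 4 != 0 {
        return Err(format!(
            "SPIR-V length {} is not a non-zero multiple of 4",
            bytes.len()
        ));
    }
    // Reassemble each group of four bytes into one little-endian word.
    let words: Vec<u32> = bytes
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect();
    // Reject blobs that do not start with the SPIR-V magic number.
    if words[0] != SPIRV_MAGIC {
        return Err(format!("bad SPIR-V magic: {:#010x}", words[0]));
    }
    Ok(words)
}
```

Validating once at startup keeps a corrupted or mis-embedded shader from surfacing later as an opaque driver error.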

Key kernels & paths

  • CoopMat flash-attention. A KHR-cooperative-matrix flash-attention kernel (16×16×16 coopmat QK + PV, online-softmax, row-split). In v0.7.0 it covers both dense head_dim=128 and Gemma-4's heterogeneous head_dim 256 (sliding) / 512 (full). This is what brings dense prefill to parity with llama.cpp (see Benchmarks).
  • Native FP8 WMMA. GEMM uses V_WMMA_F32_16X16X16_FP8_FP8 when the driver advertises shaderFloat8CooperativeMatrix (Mesa 26.1+) and falls back to BF16 conversion otherwise. All three FP8 scaling strategies (per-tensor / per-channel / block-wise) are auto-detected.
  • Tier-based VRAM bucketing. Model tensors are grouped into tiers (lm_head / attention / dense-FFN / MoE-experts) and packed into VRAM buckets to manage residency and bandwidth on the 16 GB card.
  • MoE (Gemma-4-26B-A4B). Expert-grouped prefill dispatch (one grouped GEMM per expert instead of a per-token scatter loop), a batched router with the gate-projection run through the dense GEMM (v0.7.0), and a parallel top-K. The router selects top-8 of 128 experts per token.
  • KV-cache. F32 / F16 / FP8 KV. FP8 halves KV-cache VRAM relative to F16 and is required for the Gemma-4-26B MoE (its non-FP8 KV path is known to be broken; the engine fails loudly and aborts unless VULKANFORGE_KV_FP8=1 is set).
  • GDN hybrid (Qwen3.6 / qwen35). Gated-delta-net + full-attention hybrid layers are supported (dense), with a batched prefill path.
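The online-softmax idea mentioned for the flash-attention kernel can be sketched on the CPU for a single query row. This is a scalar illustration of the general technique, not a port of the GLSL kernel: the function names and the block size are hypothetical, and it assumes non-empty `keys` / `values` of equal length and `block > 0`. Keys are streamed in blocks while a running maximum, a running sum, and a rescaled accumulator are maintained, so the full score row never has to be materialized.

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Attention output for one query row, streaming over keys in blocks.
/// Assumes keys/values are non-empty, equally long, and block > 0.
fn online_softmax_attention(
    q: &[f32],
    keys: &[Vec<f32>],
    values: &[Vec<f32>],
    block: usize,
) -> Vec<f32> {
    let scale = 1.0 / (q.len() as f32).sqrt();
    let mut running_max = f32::NEG_INFINITY;
    let mut running_sum = 0.0f32;
    let mut acc = vec![0.0f32; values[0].len()];

    for (k_blk, v_blk) in keys.chunks(block).zip(values.chunks(block)) {
        // Scaled QK scores for this block only.
        let scores: Vec<f32> = k_blk.iter().map(|k| dot(q, k) * scale).collect();
        let blk_max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let new_max = running_max.max(blk_max);

        // Rescale what was accumulated under the old maximum.
        // On the first block this is exp(-inf) = 0, which is harmless.
        let correction = (running_max - new_max).exp();
        running_sum *= correction;
        for a in acc.iter_mut() {
            *a *= correction;
        }

        // Accumulate this block's probabilities and the PV product.
        for (s, v) in scores.iter().zip(v_blk) {
            let p = (s - new_max).exp();
            running_sum += p;
            for (a, x) in acc.iter_mut().zip(v) {
                *a += p * x;
            }
        }
        running_max = new_max;
    }
    // Normalize once at the end.
    acc.iter().map(|a| a / running_sum).collect()
}

/// Reference: materialize all scores, then softmax, then weight the values.
fn naive_attention(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let scale = 1.0 / (q.len() as f32).sqrt();
    let scores: Vec<f32> = keys.iter().map(|k| dot(q, k) * scale).collect();
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut out = vec![0.0f32; values[0].len()];
    for (p, v) in exps.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += p / sum * x;
        }
    }
    out
}
```

Both functions should agree up to floating-point rounding for any block size; the streaming version simply never holds more than one block of scores at a time.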

Decode vs prefill

The v0.7.0 changes are prefill-only. Decode runs a separate scalar-attention path and is unchanged in v0.7.0; the MoE router stays on its per-token fused path for decode (seq_len == 1).
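A minimal CPU sketch of the routing split described here, under stated assumptions: `RouterPath`, `select_router_path`, `top_k`, and `group_by_expert` are hypothetical names, not VulkanForge's API, and the real router runs on the GPU. It shows a path choice keyed on seq_len, and how a prefill batch could be regrouped so that each selected expert sees all of its tokens at once rather than being visited token by token.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
enum RouterPath {
    /// Decode: one token, per-token fused routing.
    FusedPerToken,
    /// Prefill: many tokens, batched routing and expert-grouped dispatch.
    BatchedGrouped,
}

/// Hypothetical path selection keyed on sequence length.
fn select_router_path(seq_len: usize) -> RouterPath {
    if seq_len == 1 {
        RouterPath::FusedPerToken
    } else {
        RouterPath::BatchedGrouped
    }
}

/// Indices of the k largest logits, highest first (ties go to the lower index).
fn top_k(logits: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]).then(a.cmp(&b)));
    idx.truncate(k);
    idx
}

/// Group token indices by selected expert, so each expert can process its
/// tokens in one batch instead of a per-token scatter loop.
fn group_by_expert(router_logits: &[Vec<f32>], k: usize) -> BTreeMap<usize, Vec<usize>> {
    let mut groups = BTreeMap::new();
    for (token, logits) in router_logits.iter().enumerate() {
        for expert in top_k(logits, k) {
            groups.entry(expert).or_insert_with(Vec::new).push(token);
        }
    }
    groups
}
```

For the model named on this page, `k` would be 8 and each logits row would have 128 entries; experts that no token selects simply never appear in the map.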

Attribution & license

  • VulkanForge is licensed under GPL-3.0.
  • It builds on oldnordic/ROCmForge, which provided the original model loader, GGUF parser, CPU inference path, and overall architecture.
  • The GLSL compute shaders are faithful ports of llama.cpp's Vulkan kernels (e.g. flash_attn_cm1, the mul_mm GEMM family); llama.cpp is MIT-licensed.
  • Rust dependencies (permissive): ash (Vulkan), gpu-allocator (VRAM), bytemuck, half, memmap2, clap, rayon, rustyline, and the API-server stack axum / tokio / tower-http.

For shipped numbers see Benchmarks; for what you can tune see Configuration.
