Home
LLM inference engine for AMD RDNA 4 GPUs. Pure Rust + Vulkan compute shaders, ~14 MB static
binary, no runtime dependencies beyond the system Vulkan loader. It is the first engine doing
native FP8 WMMA over Vulkan on consumer AMD hardware (V_WMMA_F32_16X16X16_FP8_FP8 via Mesa 26.1+
shaderFloat8CooperativeMatrix).
This wiki documents the shipped v0.7.0 reality. It complements — does not replace — the README and CHANGELOG.
VulkanForge is a single-user, RDNA 4 / gfx1201-specific Vulkan inference engine. It targets one
GPU (Radeon RX 9070 XT) running one request at a time, and it is tuned for that case.
- A good fit if you own an RDNA 4 card, run single-user chat / single-stream inference locally on Linux + Mesa RADV, and want a tiny self-contained binary with native FP8.
- Not a fit if you need batch serving / concurrent sessions, multi-GPU, NVIDIA/CUDA, or a general cross-hardware llama.cpp replacement. For batch throughput, vLLM is the right tool.
As of v0.7.0, prefill reaches parity with llama.cpp's Vulkan backend on dense models, and the Gemma-4 MoE prefill gap is largely closed; decode is unchanged. The figures below were measured in the same run against llama.cpp Vulkan (RX 9070 XT, RADV Mesa 26.1.2) and are VulkanForge throughput relative to llama.cpp:
- Dense prefill (Qwen3-8B / Llama-3.1-8B / Mistral-7B / DeepSeek-R1-8B) @p2048: 0.96–1.04× llama.cpp (parity, with Mistral ahead).
- Gemma-4-26B-A4B MoE prefill @p2048: Q3_K_M 0.89× · QAT-Q4_0 0.83×.
- Decode: 0.87–0.97× llama.cpp (unchanged).
Full table + conditions on Benchmarks.
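For readers unsure how to read the "×" figures above, here is a minimal sketch of the arithmetic, assuming each ratio is one engine's tokens/s divided by the baseline's tokens/s under identical conditions. The function name `relative_throughput` and the numbers in `main` are hypothetical and chosen for illustration; they are not VulkanForge code or measured results.

```rust
/// Relative throughput of an engine versus a baseline, both in tokens/s.
/// 1.0 means parity; above 1.0 means the engine is faster than the baseline.
fn relative_throughput(engine_tps: f64, baseline_tps: f64) -> f64 {
    assert!(baseline_tps > 0.0, "baseline throughput must be positive");
    engine_tps / baseline_tps
}

fn main() {
    // Hypothetical tokens/s values for illustration only; not measured results.
    let engine_tps = 2400.0;
    let baseline_tps = 2500.0;

    let ratio = relative_throughput(engine_tps, baseline_tps);
    println!("{:.2}x of baseline", ratio);
}
```

Read this way, a value below 1.0 means the baseline is faster, and a value above 1.0 means the engine is ahead.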
- Get started: Installation · Hardware and Compatibility
- Use it: Supported Models · Usage · Configuration
- Reference: Benchmarks · Architecture · Troubleshooting
GPL-3.0. VulkanForge builds on the foundational work of oldnordic/ROCmForge (model loader, GGUF parser, CPU path, overall architecture). See Architecture for full attribution.
VulkanForge v1.0.4 · single-user RDNA 4 / gfx1201 Vulkan inference · GPL-3.0 · Repository · Releases