Home
LLM inference engine for AMD RDNA 4 GPUs. Pure Rust + Vulkan compute shaders, ~14 MB static
binary, no runtime dependencies beyond the system Vulkan loader. It is the first engine doing
native FP8 WMMA over Vulkan on consumer AMD hardware (V_WMMA_F32_16X16X16_FP8_FP8 via Mesa 26.1+
shaderFloat8CooperativeMatrix).
This wiki documents what v1.0 actually ships. It complements the README and CHANGELOG; it does not replace them.
VulkanForge is a single-user, RDNA 4 / gfx1201-specific Vulkan inference engine. It targets one
GPU (Radeon RX 9070 XT) running one request at a time, and it is tuned for that case.
- A good fit if you own an RDNA 4 card, run single-user chat / single-stream inference locally on Linux + Mesa RADV, and want a tiny self-contained binary with native FP8.
- Not a fit if you need batch serving / concurrent sessions, multi-GPU, NVIDIA/CUDA, or a general cross-hardware llama.cpp replacement. For batch throughput, vLLM is the right tool.
- A persistent, project-scoped, semantic memory embedded in vulkanforge serve. Write notes on purpose (POST /memory/remember) and read them back by meaning (POST /memory/recall); the record survives restarts and model swaps, and recall in one project cannot return another's notes. Local, single-user, CPU-embedded (Nomic-Embed v1.5-Q, 768-dim, AVX-512/VNNI); the memory path never takes the GPU permit. What it is, what it isn't, how it works, and the roadmap: Memory.
- Cost, honestly: the two native deps (SQLiteGraph + fastembed/ONNX Runtime) add ~34 MB to the binary, and the first start downloads the embedding model into ~/.vulkanforge/embed-cache.
- Engine 0.9.2 → 1.0.0; vf-clide unchanged at 0.3.1.
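The idea of project-scoped recall by meaning can be illustrated with a toy in-memory sketch. Everything here is hypothetical (the `Note` struct, `cosine`, and `recall` are illustration names, not the engine's API), and the real store uses SQLiteGraph and 768-dim Nomic embeddings rather than hand-written vectors. The point it shows is that filtering by project happens before ranking, so another project's notes can never appear in the result.

```rust
// Toy sketch: project-scoped recall ranked by cosine similarity.
// All names are hypothetical; the real engine's storage and embedder differ.

#[derive(Debug, Clone)]
pub struct Note {
    pub project: String,
    pub text: String,
    pub embedding: Vec<f32>, // 768-dim in the real engine; any length in this sketch
}

/// Cosine similarity of two vectors; returns 0.0 if either has zero norm.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Return up to `k` notes belonging to `project`, best match first.
pub fn recall<'a>(notes: &'a [Note], project: &str, query: &[f32], k: usize) -> Vec<&'a Note> {
    let mut scored: Vec<(f32, &Note)> = notes
        .iter()
        .filter(|n| n.project == project) // scope first, then rank
        .map(|n| (cosine(&n.embedding, query), n))
        .collect();
    // Sort descending by similarity.
    scored.sort_by(|a, b| b.0.total_cmp(&a.0));
    scored.into_iter().take(k).map(|(_, n)| n).collect()
}
```

Calling `recall(&notes, "a", &query, 5)` in this sketch only ever returns notes whose `project` field is `"a"`, regardless of how well notes from other projects match the query.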
- vf-clide REPL honors the permission ceiling. Tool calls at or below the active --yes / --allow-mutating / --allow-shell ceiling are auto-approved in the REPL too (still printed); only calls above it prompt y/N. Earlier versions prompted for every call interactively. Headless -p is unchanged (deny above the ceiling). See vf-clide.
- Clearer denials. The agent constitution separates a permission denial (lifted by re-running with --allow-*) from an absolute workspace-confinement denial, so the model stops asking for OS rights.
- vf-clide 0.3.0 → 0.3.1; engine unchanged (0.9.2).
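The ceiling rule described above can be sketched as a small decision function. This is a minimal model under stated assumptions, not vf-clide's actual code: the `Tier`, `Decision`, and `decide` names are hypothetical, the tool-to-tier mapping in the comments is a guess, and treating "no flag given" as `None` (nothing auto-approved) is also an assumption.

```rust
// Hypothetical model of the cumulative permission tiers and the ceiling check.
// Variant order matters: the derived Ord gives ReadOnly < Mutating < Exec.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Tier {
    ReadOnly, // assumed: read_file, search (ceiling set by --yes)
    Mutating, // assumed: write_file (ceiling set by --allow-mutating)
    Exec,     // assumed: shell (ceiling set by --allow-shell)
}

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    AutoApprove, // at or below the ceiling: run it (and still print it)
    Prompt,      // above the ceiling in the REPL: ask y/N
    Deny,        // above the ceiling in headless mode
}

/// `ceiling` is the highest tier granted on the command line, or `None`
/// when no permission flag was passed (an assumption of this sketch).
pub fn decide(call: Tier, ceiling: Option<Tier>, interactive: bool) -> Decision {
    match ceiling {
        Some(c) if call <= c => Decision::AutoApprove,
        _ if interactive => Decision::Prompt,
        _ => Decision::Deny,
    }
}
```

With a `Mutating` ceiling, a `ReadOnly` call is auto-approved in both modes, while an `Exec` call prompts in the REPL and is denied headless.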
- vf-clide token meter + pinned status line. The REPL shows live, server-reported token usage (↑prompt ↓completion (total) · session) and the current action; it's a no-op off-TTY, so headless -p output is byte-for-byte unchanged. See vf-clide.
- Clean serve shutdown (engine bugfix). Ctrl+C / SIGTERM on vulkanforge serve now frees all GPU resources in order and exits cleanly (0 leaked objects) instead of leaking and crashing with a segfault. Shutdown-path only; decode is unchanged. See Usage.
- (v0.9.1 was a vf-clide-only search symlink-confinement security patch.)
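As a rough illustration of the token meter's "no-op off-TTY" behavior, the sketch below formats a status line and returns nothing when drawing is disabled. The function names are hypothetical, and two details are assumptions: that "(total)" means prompt plus completion for the current request, and that the meter is drawn on stderr.

```rust
use std::io::IsTerminal;

/// Render the meter in the shape the text describes:
/// ↑prompt ↓completion (total) · session.
/// Assumption: total = prompt + completion for the current request.
pub fn format_meter(prompt: u64, completion: u64, session_total: u64) -> String {
    format!(
        "↑{} ↓{} ({}) · {}",
        prompt,
        completion,
        prompt + completion,
        session_total
    )
}

/// Return the status line only when drawing is enabled; otherwise `None`,
/// so nothing extra is written and piped output stays unchanged.
pub fn status_line(enabled: bool, prompt: u64, completion: u64, session_total: u64) -> Option<String> {
    enabled.then(|| format_meter(prompt, completion, session_total))
}

/// True when stderr is an interactive terminal (assumed target stream).
pub fn meter_enabled() -> bool {
    std::io::stderr().is_terminal()
}
```

A caller would pass `meter_enabled()` as the `enabled` argument, which keeps the TTY check in one place and makes the formatting itself easy to test.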
- vf-clide becomes an agentic coding client. An opt-in --agent tool loop lets the model call read_file / write_file / search / shell over the OpenAI API, with a three-tier permission model (ReadOnly / Mutating / Exec, opt-in via --yes → --allow-mutating → --allow-shell, cumulative), workspace confinement for the file tools (../ and symlink escapes rejected), and a constitution (built-in system prompt + project AGENTS.md). The shell tool is deliberately not confined; --allow-shell is the explicit opt-in.
- Engine test-infra hardening. The end-to-end regression and per-shader correctness suites are reactivated and guarded against drift. No decode/behavior change: inference output is unchanged from v0.7.0.
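Workspace confinement of the kind described above is commonly built from two checks: a lexical pass that rejects `../` escapes, and a canonicalization pass that catches symlinks pointing outside the root. The sketch below shows one way to do that with the Rust standard library. It is not vf-clide's implementation; `confine` and `confine_canonical` are hypothetical names, and refusing absolute paths outright is an assumption of this sketch.

```rust
use std::path::{Component, Path, PathBuf};

/// Lexically resolve `requested` against `root`, rejecting any path that
/// would climb above the root. Does not touch the filesystem.
pub fn confine(root: &Path, requested: &str) -> Result<PathBuf, String> {
    let mut resolved = root.to_path_buf();
    let mut depth = 0usize; // number of components currently below root
    for part in Path::new(requested).components() {
        match part {
            Component::Normal(name) => {
                resolved.push(name);
                depth += 1;
            }
            Component::CurDir => {}
            Component::ParentDir => {
                if depth == 0 {
                    return Err(format!("escapes workspace: {}", requested));
                }
                resolved.pop();
                depth -= 1;
            }
            // Absolute paths and Windows prefixes are refused in this sketch.
            Component::RootDir | Component::Prefix(_) => {
                return Err(format!("absolute path not allowed: {}", requested));
            }
        }
    }
    Ok(resolved)
}

/// Symlink-aware variant: canonicalize both sides and require that the
/// target still lives under the root. Note that `canonicalize` fails for
/// paths that do not exist yet, so creating new files needs extra care.
pub fn confine_canonical(root: &Path, requested: &str) -> Result<PathBuf, String> {
    let lexical = confine(root, requested)?;
    let canon_root = std::fs::canonicalize(root).map_err(|e| e.to_string())?;
    let canon = std::fs::canonicalize(&lexical).map_err(|e| e.to_string())?;
    if canon.starts_with(&canon_root) {
        Ok(canon)
    } else {
        Err(format!("symlink escapes workspace: {}", requested))
    }
}
```

The lexical pass alone is not sufficient, because a symlink inside the workspace can still point elsewhere; that is why the second function compares canonical paths.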
- Automatic context sizing. serve without --ctx-size computes the largest safe KV context from live free VRAM + the model and prints what it chose and why. No more guessing a value that's too small (truncated answers) or too large (OOM at load). Explicit --ctx-size still overrides; hardware-capped at 16384 on RDNA4. See Configuration.
- Gemma-4 native tool / function calling. The OpenAI tools API now works with Gemma-4's own native tool-call format (Qwen3/Hermes path unchanged). See Usage.
- New: vf-clide, a lean standalone CLI chat client (its own crate, no engine dependencies): streaming REPL + headless, with visible markers for truncated/empty answers.
Inference output is unchanged from v0.7.0 (auto-ctx is allocation-time only, decode-neutral).
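To make the automatic context sizing concrete, here is a back-of-the-envelope sketch of how a largest-safe KV context could be derived from free VRAM. It is not the engine's actual formula: the `ModelShape` fields, the `reserve` headroom parameter, and the choice to apply the 16384 cap to an explicit value as well are all assumptions made for illustration.

```rust
/// Hypothetical description of the parts of a model that determine KV size.
pub struct ModelShape {
    pub n_layers: u64,
    pub n_kv_heads: u64,
    pub head_dim: u64,
    pub kv_bytes_per_elem: u64, // e.g. 2 for an f16 cache
}

/// Context cap on RDNA4 as stated in the text.
pub const HARD_CAP: u64 = 16_384;

/// Bytes of KV cache per token: one K and one V vector for every layer.
pub fn kv_bytes_per_token(m: &ModelShape) -> u64 {
    2 * m.n_layers * m.n_kv_heads * m.head_dim * m.kv_bytes_per_elem
}

/// Pick a context size. An explicit request wins (still capped in this
/// sketch); otherwise use whatever fits after weights and a safety reserve.
/// Assumes a non-degenerate model shape (no zero dimensions).
pub fn pick_ctx(
    free_vram: u64,
    weights: u64,
    reserve: u64,
    m: &ModelShape,
    explicit: Option<u64>,
) -> u64 {
    if let Some(n) = explicit {
        return n.min(HARD_CAP);
    }
    // saturating_sub keeps the budget at 0 when the model does not fit at all.
    let budget = free_vram.saturating_sub(weights).saturating_sub(reserve);
    (budget / kv_bytes_per_token(m)).min(HARD_CAP)
}
```

With illustrative numbers (36 layers, 8 KV heads, head dimension 128, 2-byte elements), each token costs 147,456 bytes of KV cache, so a 10 GiB budget would allow far more than 16384 tokens and the cap becomes the binding limit.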
As of v0.7.0, prefill reaches parity with llama.cpp's Vulkan backend on dense models, and the Gemma-4 MoE prefill gap is largely closed — decode is unchanged. Measured same-run vs llama.cpp Vulkan (RX 9070 XT, RADV Mesa 26.1.2):
- Dense prefill (Qwen3-8B / Llama-3.1-8B / Mistral-7B / DeepSeek-R1-8B) @p2048: 0.96–1.04× llama (parity — Mistral ahead).
- Gemma-4-26B-A4B MoE prefill @p2048: Q3_K_M 0.89× · QAT-Q4_0 0.83×.
- Decode: 0.87–0.97× llama (unchanged).
Full table + conditions on Benchmarks.
- Get started: Installation · Hardware and Compatibility
- Use it: Supported Models · Usage · Configuration · vf-clide
- Reference: Benchmarks · Architecture · Troubleshooting
GPL-3.0. VulkanForge builds on the foundational work of oldnordic/ROCmForge (model loader, GGUF parser, CPU path, overall architecture). See Architecture for full attribution.
VulkanForge v1.0.4 · single-user RDNA 4 / gfx1201 Vulkan inference · GPL-3.0 · Repository · Releases