Skip to content
maeddesg edited this page Jun 15, 2026 · 13 revisions

VulkanForge

LLM inference engine for AMD RDNA 4 GPUs. Pure Rust + Vulkan compute shaders, ~14 MB static binary, no runtime dependencies beyond the system Vulkan loader. It is the first engine doing native FP8 WMMA over Vulkan on consumer AMD hardware (V_WMMA_F32_16X16X16_FP8_FP8 via Mesa 26.1+ shaderFloat8CooperativeMatrix).

This wiki documents the shipped v1.0.1 reality. It complements — does not replace — the README and CHANGELOG.

Who it is for — and who it is not for

VulkanForge is a single-user, RDNA 4 / gfx1201-specific Vulkan inference engine. It targets one GPU (Radeon RX 9070 XT) running one request at a time, and it is tuned for that case.

  • A good fit if you own an RDNA 4 card, run single-user chat / single-stream inference locally on Linux + Mesa RADV, and want a tiny self-contained binary with native FP8.
  • Not a fit if you need batch serving / concurrent sessions, multi-GPU, NVIDIA/CUDA, or a general cross-hardware llama.cpp replacement. For batch throughput, vLLM is the right tool.

v1.0.1 — server-side memory (opt-in)

  • A persistent, project-scoped, semantic memory embedded in vulkanforge serveopt-in, off by default. Write notes on purpose (POST /memory/remember) and read them back by meaning (POST /memory/recall); the record survives restarts and model swaps, and recall in one project cannot return another's notes. Local, single-user, CPU-embedded (Nomic-Embed v1.5-Q, 768-dim, AVX-512/VNNI) — the memory path never takes the GPU permit. What it is, what it isn't, how it works, and the roadmap: Memory.
  • Two gates, both off by default: build with cargo build --release --features memory (Rust 1.89+), then activate per run with serve --memory (or VULKANFORGE_MEMORY=1). Without it /memory/* returns 503 and the default build stays lean.
  • Cost, honestly — only with the feature: the two native deps (SQLiteGraph + fastembed/ONNX Runtime) add ~34 MB to the binary (lean default ~25 MB), and an activated store downloads the embedding model into ~/.vulkanforge/embed-cache on first start.
  • Engine 0.9.2 → 1.0.0; vf-clide unchanged at 0.3.1.

v0.9.4 — vf-clide REPL permission ceiling + denial wording

  • vf-clide REPL honors the permission ceiling. Tool calls at or below the active --yes/--allow-mutating/--allow-shell ceiling are auto-approved in the REPL too (still printed); only calls above it prompt y/N. Earlier versions prompted for every call interactively. Headless -p is unchanged (deny above the ceiling). See vf-clide.
  • Clearer denials. The agent constitution separates a permission denial (lifted by re-running with --allow-*) from an absolute workspace-confinement denial — so the model stops asking for OS rights.
  • vf-clide 0.3.0 → 0.3.1; engine unchanged (0.9.2).

v0.9.2 — vf-clide token meter + clean server shutdown

  • vf-clide token meter + pinned status line. The REPL shows live, server-real token usage (↑prompt ↓completion (total) · session) and the current action; it's a no-op off-TTY, so headless -p output is byte-for-byte unchanged. See vf-clide.
  • Clean serve shutdown (engine bugfix). Ctrl+C / SIGTERM on vulkanforge serve now frees all GPU resources in order and exits cleanly (0 leaked objects) instead of leaking and crashing with a segfault. Shutdown-path only — decode is unchanged. See Usage.
  • (v0.9.1 was a vf-clide-only search symlink-confinement security patch.)

v0.9.0 — agentic vf-clide

  • vf-clide becomes an agentic coding client. An opt-in --agent tool loop lets the model call read_file / write_file / search / shell over the OpenAI API, with a three-tier permission model (ReadOnly / Mutating / Exec, opt-in via --yes--allow-mutating--allow-shell, cumulative), workspace confinement for the file tools (../ and symlink escapes rejected), and a constitution (built-in system prompt + project AGENTS.md). shell is deliberately not confined — --allow-shell is the explicit opt-in.
  • Engine test-infra hardening. The end-to-end regression and per-shader correctness suites are reactivated and guarded against drift. No decode/behavior change — inference output is unchanged from v0.7.0.

v0.8.0 — automatic context sizing + Gemma-4 tool-calling + vf-clide

  • Automatic context sizing. serve without --ctx-size computes the largest safe KV context from live free VRAM + the model and prints what it chose and why. No more guessing a value that's too small (truncated answers) or too large (OOM at load). Explicit --ctx-size still overrides; hardware-capped at 16384 on RDNA4. See Configuration.
  • Gemma-4 native tool / function calling. The OpenAI tools API now works with Gemma-4's own native tool-call format (Qwen3/Hermes path unchanged). See Usage.
  • New: vf-clide — a lean standalone CLI chat client (its own crate, no engine dependencies): streaming REPL + headless, with visible markers for truncated/empty answers.

Inference output is unchanged from v0.7.0 (auto-ctx is allocation-time only, decode-neutral).

v0.7.0 — Prefill Parity

As of v0.7.0, prefill reaches parity with llama.cpp's Vulkan backend on dense models, and the Gemma-4 MoE prefill gap is largely closed — decode is unchanged. Measured same-run vs llama.cpp Vulkan (RX 9070 XT, RADV Mesa 26.1.2):

  • Dense prefill (Qwen3-8B / Llama-3.1-8B / Mistral-7B / DeepSeek-R1-8B) @p2048: 0.96–1.04× llama (parity — Mistral ahead).
  • Gemma-4-26B-A4B MoE prefill @p2048: Q3_K_M 0.89× · QAT-Q4_0 0.83×.
  • Decode: 0.87–0.97× llama (unchanged).

Full table + conditions on Benchmarks.

Quick links

License

GPL-3.0. VulkanForge builds on the foundational work of oldnordic/ROCmForge (model loader, GGUF parser, CPU path, overall architecture). See Architecture for full attribution.

Clone this wiki locally