Home

VulkanForge

LLM inference engine for AMD RDNA 4 GPUs. Pure Rust + Vulkan compute shaders, ~14 MB static binary, no runtime dependencies beyond the system Vulkan loader. It is the first engine doing native FP8 WMMA over Vulkan on consumer AMD hardware (V_WMMA_F32_16X16X16_FP8_FP8 via Mesa 26.1+ shaderFloat8CooperativeMatrix).

This wiki documents the shipped v1.0.1 reality. It complements — does not replace — the README and CHANGELOG.

Who it is for — and who it is not for

VulkanForge is a single-user, RDNA 4 / gfx1201-specific Vulkan inference engine. It targets one GPU (Radeon RX 9070 XT) running one request at a time, and it is tuned for that case.

A good fit if you own an RDNA 4 card, run single-user chat / single-stream inference locally on Linux + Mesa RADV, and want a tiny self-contained binary with native FP8.
Not a fit if you need batch serving / concurrent sessions, multi-GPU, NVIDIA/CUDA, or a general cross-hardware llama.cpp replacement. For batch throughput, vLLM is the right tool.

v1.0.1 — server-side memory (opt-in)

A persistent, project-scoped, semantic memory embedded in vulkanforge serve. It is opt-in and off by default. Write notes on purpose (POST /memory/remember) and read them back by meaning (POST /memory/recall); the record survives restarts and model swaps, and recall in one project cannot return another project's notes. It is local, single-user, and CPU-embedded (Nomic-Embed v1.5-Q, 768-dim, AVX-512/VNNI); the memory path never takes the GPU permit. See Memory for what it is, what it isn't, how it works, and the roadmap.
Two gates, both off by default: build with cargo build --release --features memory (Rust 1.89+), then activate per run with serve --memory (or VULKANFORGE_MEMORY=1). Without it /memory/* returns 503 and the default build stays lean.
Cost, honestly — only with the feature: the two native deps (SQLiteGraph + fastembed/ONNX Runtime) add ~34 MB to the binary (lean default ~25 MB), and an activated store downloads the embedding model into ~/.vulkanforge/embed-cache on first start.
Engine 0.9.2 → 1.0.1; vf-clide unchanged at 0.3.1.
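
To make the project scoping concrete, here is a toy Python sketch of a store where recall ranks notes by cosine similarity but only ever searches one project's bucket. It is not VulkanForge's implementation: the real store uses SQLiteGraph and a Nomic-Embed model, while the `embed` function below is a hypothetical bag-of-words stand-in, and the `ProjectMemory` class and its method names are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def embed(text):
    """Hypothetical stand-in for a real embedding model: word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ProjectMemory:
    """Toy note store keyed by project; all names are illustrative only."""

    def __init__(self):
        self._notes = defaultdict(list)  # project -> [(text, vector)]

    def remember(self, project, text):
        self._notes[project].append((text, embed(text)))

    def recall(self, project, query, top_k=3):
        # Only this project's bucket is searched, so notes from other
        # projects can never appear in the result.
        q = embed(query)
        scored = [
            (cosine(q, vec), text) for text, vec in self._notes.get(project, [])
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for score, text in scored[:top_k] if score > 0]


memory = ProjectMemory()
memory.remember("alpha", "the build uses cargo with the memory feature")
memory.remember("beta", "the build uses make and a custom linker script")
hits_alpha = memory.recall("alpha", "how is the build done")
hits_gamma = memory.recall("gamma", "how is the build done")
```

Here `hits_alpha` contains only the note stored under `alpha`, and `hits_gamma` is empty because that project has no notes. Scoping by bucket, rather than filtering after a global search, is one simple way to guarantee that isolation.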

v0.9.4 — vf-clide REPL permission ceiling + denial wording

vf-clide REPL honors the permission ceiling. Tool calls at or below the active --yes/--allow-mutating/--allow-shell ceiling are auto-approved in the REPL too (still printed); only calls above it prompt y/N. Earlier versions prompted for every call interactively. Headless -p is unchanged (deny above the ceiling). See vf-clide.
Clearer denials. The agent constitution separates a permission denial (lifted by re-running with --allow-*) from an absolute workspace-confinement denial — so the model stops asking for OS rights.
vf-clide 0.3.0 → 0.3.1; engine unchanged (0.9.2).
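
The ceiling rule can be sketched as a small decision function. This is a minimal Python illustration, not vf-clide's code: the tier names come from the v0.9.0 notes, but the `Tier` enum, the `decide` function, and the treatment of "no flag given" as a missing ceiling are assumptions made for the example.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Cumulative tiers: each level includes the ones below it."""
    READ_ONLY = 1  # assumed to correspond to --yes
    MUTATING = 2   # assumed to correspond to --allow-mutating
    EXEC = 3       # assumed to correspond to --allow-shell


def decide(required, ceiling, interactive):
    """Return 'approve', 'prompt', or 'deny' for one tool call.

    `ceiling` is None when no permission flag was given (an assumption
    of this sketch, not a documented vf-clide behavior).
    """
    if ceiling is not None and required <= ceiling:
        return "approve"  # at or below the ceiling: auto-approved
    # Above the ceiling: the REPL asks y/N, headless mode denies.
    return "prompt" if interactive else "deny"


in_repl = decide(Tier.EXEC, Tier.MUTATING, interactive=True)
headless = decide(Tier.EXEC, Tier.MUTATING, interactive=False)
```

With a Mutating ceiling, a shell call yields `"prompt"` in the REPL and `"deny"` in headless mode, while read and write calls are approved in both.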

v0.9.2 — vf-clide token meter + clean server shutdown

vf-clide token meter + pinned status line. The REPL shows live, server-real token usage (↑prompt ↓completion (total) · session) and the current action; it's a no-op off-TTY, so headless -p output is byte-for-byte unchanged. See vf-clide.
Clean serve shutdown (engine bugfix). Ctrl+C / SIGTERM on vulkanforge serve now frees all GPU resources in order and exits cleanly (0 leaked objects) instead of leaking and crashing with a segfault. Shutdown-path only — decode is unchanged. See Usage.
(v0.9.1 was a vf-clide-only search symlink-confinement security patch.)

v0.9.0 — agentic vf-clide

vf-clide becomes an agentic coding client. An opt-in --agent tool loop lets the model call read_file / write_file / search / shell over the OpenAI API, with a three-tier permission model (ReadOnly / Mutating / Exec, opt-in via --yes → --allow-mutating → --allow-shell, cumulative), workspace confinement for the file tools (../ and symlink escapes rejected), and a constitution (built-in system prompt + project AGENTS.md). shell is deliberately not confined — --allow-shell is the explicit opt-in.
Engine test-infra hardening. The end-to-end regression and per-shader correctness suites are reactivated and guarded against drift. No decode/behavior change — inference output is unchanged from v0.7.0.
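
The workspace confinement described above for the file tools can be illustrated with a common path-resolution check. The Python sketch below is one standard way to reject ../ and symlink escapes; it is not taken from vf-clide, and the function names `resolve_in_workspace` and `is_allowed` are hypothetical.

```python
import os
import tempfile


def resolve_in_workspace(workspace, user_path):
    """Return the real path of `user_path` if it stays inside `workspace`.

    Raises PermissionError for ../ traversal and symlink escapes.
    """
    root = os.path.realpath(workspace)
    # realpath collapses ".." and follows symlinks before the check.
    candidate = os.path.realpath(os.path.join(root, user_path))
    if os.path.commonpath([root, candidate]) != root:
        raise PermissionError(f"path escapes workspace: {user_path}")
    return candidate


def is_allowed(workspace, user_path):
    try:
        resolve_in_workspace(workspace, user_path)
        return True
    except PermissionError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    workspace = os.path.join(tmp, "project")
    outside = os.path.join(tmp, "secrets")
    os.makedirs(os.path.join(workspace, "src"))
    os.makedirs(outside)
    # A symlink inside the workspace that points outside of it.
    os.symlink(outside, os.path.join(workspace, "link"))

    results = {
        "src/main.rs": is_allowed(workspace, "src/main.rs"),
        "../secrets/key": is_allowed(workspace, "../secrets/key"),
        "link/key": is_allowed(workspace, "link/key"),
    }
```

Resolving the path first and comparing afterwards is the important design choice: a plain string-prefix check on the unresolved path would accept `link/key` even though it lands outside the workspace.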

v0.8.0 — automatic context sizing + Gemma-4 tool-calling + vf-clide

Automatic context sizing. serve without --ctx-size computes the largest safe KV context from live free VRAM + the model and prints what it chose and why. No more guessing a value that's too small (truncated answers) or too large (OOM at load). Explicit --ctx-size still overrides; hardware-capped at 16384 on RDNA4. See Configuration.
Gemma-4 native tool / function calling. The OpenAI tools API now works with Gemma-4's own native tool-call format (Qwen3/Hermes path unchanged). See Usage.
New: vf-clide — a lean standalone CLI chat client (its own crate, no engine dependencies): streaming REPL + headless, with visible markers for truncated/empty answers.

Inference output is unchanged from v0.7.0 (auto-ctx is allocation-time only, decode-neutral).
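
As a rough illustration of how a context size can be derived from free VRAM, the Python sketch below uses the standard KV-cache size formula and the 16384 cap mentioned above. It is not VulkanForge's actual algorithm: the function names, the fixed reserve, the rounding step, the FP16 cache assumption, and the example model shape and sizes are all hypothetical.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """K and V caches: one vector per layer per KV head, for each token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def auto_ctx_size(free_vram, weights, reserve, per_token, cap=16384, step=256):
    """Largest context (a multiple of `step`) whose KV cache fits the budget."""
    budget = free_vram - weights - reserve
    if budget <= 0:
        return 0  # the model does not fit at all
    tokens = budget // per_token
    return min(cap, (tokens // step) * step)


# Illustrative shape: 32 layers, 8 KV heads, head_dim 128, FP16 cache.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

# Plenty of free VRAM: the hardware cap is the limit.
ctx_roomy = auto_ctx_size(16 * GIB, 5 * GIB, 1 * GIB, per_token)

# Only 0.5 GiB left after weights and reserve: VRAM is the limit.
ctx_tight = auto_ctx_size(13 * GIB // 2, 5 * GIB, 1 * GIB, per_token)
```

With these made-up numbers each token costs 128 KiB of cache, so `ctx_roomy` is capped at 16384 and `ctx_tight` comes out at 4096. Because the computation only decides how much to allocate, it has no effect on decode output.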

v0.7.0 — Prefill Parity

As of v0.7.0, prefill reaches parity with llama.cpp's Vulkan backend on dense models, and the Gemma-4 MoE prefill gap is largely closed — decode is unchanged. Measured same-run vs llama.cpp Vulkan (RX 9070 XT, RADV Mesa 26.1.2):

Dense prefill (Qwen3-8B / Llama-3.1-8B / Mistral-7B / DeepSeek-R1-8B) @p2048: 0.96–1.04× llama (parity — Mistral ahead).
Gemma-4-26B-A4B MoE prefill @p2048: Q3_K_M 0.89× · QAT-Q4_0 0.83×.
Decode: 0.87–0.97× llama (unchanged).

Full table + conditions on Benchmarks.

Quick links

Get started: Installation · Hardware and Compatibility
Use it: Supported Models · Usage · Configuration · vf-clide
Reference: Benchmarks · Architecture · Troubleshooting

License

GPL-3.0. VulkanForge builds on the foundational work of oldnordic/ROCmForge (model loader, GGUF parser, CPU path, overall architecture). See Architecture for full attribution.

VulkanForge v1.0.1 · single-user RDNA 4 / gfx1201 Vulkan inference · GPL-3.0 · Repository · Releases
