Home

VulkanForge

LLM inference engine for AMD RDNA 4 GPUs. Pure Rust + Vulkan compute shaders, ~14 MB static binary, no runtime dependencies beyond the system Vulkan loader. It is the first engine doing native FP8 WMMA over Vulkan on consumer AMD hardware (V_WMMA_F32_16X16X16_FP8_FP8 via Mesa 26.1+ shaderFloat8CooperativeMatrix).

This wiki documents the shipped v1.0.5 reality. It complements — does not replace — the README and CHANGELOG.

Who it is for — and who it is not for

VulkanForge is a single-user, RDNA 4 / gfx1201-specific Vulkan inference engine. It targets one GPU (Radeon RX 9070 XT) running one request at a time, and it is tuned for that case.

A good fit if you own an RDNA 4 card, run single-user chat / single-stream inference locally on Linux + Mesa RADV, and want a tiny self-contained binary with native FP8.
Not a fit if you need batch serving / concurrent sessions, multi-GPU, NVIDIA/CUDA, or a general cross-hardware llama.cpp replacement. For batch throughput, vLLM is the right tool.

v1.0.5 — conflict edges, opt-in frontier, edge-type priors, and cross-process determinism

CONTRADICTS edge. /contradict <id> <id> (and /uncontradict) flags two notes as in conflict — symmetric, awareness-only (no suppression, no winner), surfaced in --explain; you resolve it with /supersede. See Memory.
Opt-in frontier retrieval (--frontier, off by default). Reserves a few slots for a top hit's DERIVES_FROM evidence (one hop), pulling a supporting premise up next to it; --explain labels seed vs. frontier.
Edge-type priors. A frontier candidate that CONTRADICTS a seed is held back (categorical roles, no scalar weights) — the frontier never amplifies disputed evidence; shown transparently in --explain.
Cross-process recall determinism. A pinned HNSW seed (VF_HNSW_SEED, on SQLiteGraph 3.3.1) makes recall reproducible across process restarts; enforced by a committed test. --features memory now needs Rust 1.89 (the lean default stays 1.85).
Recall stays byte-identical with no edges and no opt-ins active. Engine 1.0.4 → 1.0.5; vf-clide 0.3.3 → 0.3.4.
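
The interaction between the opt-in frontier and the CONTRADICTS prior can be sketched as below. This is a minimal illustration, not the engine's code: the function name, the in-memory edge dictionaries, the slot count, and the backfill of unused slots in plain rank order are all assumptions made for the example.

```python
def recall_with_frontier(ranked, derives_from, contradicts, k=4, slots=1):
    """Hypothetical sketch of frontier retrieval with a categorical prior.

    ranked:       note ids, best match first
    derives_from: {note_id: [evidence ids]} (DERIVES_FROM, one hop only)
    contradicts:  set of frozenset({a, b}) pairs (symmetric CONTRADICTS)
    Returns (hits, held_back); hits are (note_id, "seed" | "frontier").
    """
    def in_conflict(a, b):
        return frozenset((a, b)) in contradicts

    seeds = ranked[: max(k - slots, 0)]   # reserve `slots` for the frontier
    hits, used, held_back = [], set(seeds), []
    budget = slots
    for seed in seeds:
        hits.append((seed, "seed"))
        for evidence in derives_from.get(seed, []):
            if budget == 0 or evidence in used:
                continue
            if any(in_conflict(evidence, s) for s in seeds):
                held_back.append(evidence)  # disputed evidence is never pulled up
                continue
            hits.append((evidence, "frontier"))  # placed next to its seed
            used.add(evidence)
            budget -= 1
    # Assumption: unused frontier slots fall back to plain rank order.
    for note_id in ranked:
        if len(hits) >= k:
            break
        if note_id not in used:
            hits.append((note_id, "seed"))
            used.add(note_id)
    return hits, held_back


ranked = ["n1", "n2", "n3", "n4", "n5"]
hits, held_back = recall_with_frontier(
    ranked,
    derives_from={"n1": ["p9"], "n2": ["p7"]},
    contradicts={frozenset(("p9", "n2"))},
)
```

With no edges at all, the sketch degenerates to a plain top-k in rank order, which mirrors the "byte-identical with no edges" guarantee stated above.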

v1.0.4 — recall diagnostics, note typing, and memory edges

recall --explain. A diagnostic view of why recall returned what it did: returned hits, near-misses, the cut reason per near-miss (superseded / type / threshold / top-k), and the score separation. See Memory.
Note typing + an opt-in relevance threshold. Notes carry a layer type (invariant/working/episodic/decision/failure, default untyped); --type on remember, /retype, and a --type recall filter give an explicit, non-embedding signal that disambiguates where similarity can't. VF_RECALL_MARGIN (off by default) trims recall to the top band.
SUPERSEDES edges. /supersede <new> <old> (and /unsupersede) marks a note stale. A stale note is suppressed from recall: chains resolve to the current head, and recall backfills to k. --include-superseded shows it again; stale notes are suppressed, never deleted.
DERIVES_FROM edges + /why. /derive <A> from <B> … records what a note is anchored in; /why <id> walks the why-graph (cycle-guarded, depth-capped). It never changes recall results — additive awareness only.
KV prefix-reuse is on by default (VF_KV_PREFIX_REUSE=0 to disable) — removes the within-turn double-prefill on memory-augmented turns; logit byte-identical to a fresh prefill. See Configuration.
Recall stays byte-identical with no edges and no opt-ins active. Engine 1.0.3 → 1.0.4; vf-clide 0.3.2 → 0.3.3.
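
The SUPERSEDES and /why behaviour described above can be sketched with plain dictionaries. This is a hedged illustration only: the helper names, the edge representation, and the default depth cap are assumptions, and the real store (SQLiteGraph) is not modelled.

```python
def resolve_head(note_id, superseded_by):
    """Follow SUPERSEDES links (old -> new) to the current head of a chain."""
    seen = {note_id}
    while note_id in superseded_by:
        note_id = superseded_by[note_id]
        if note_id in seen:  # defensive cycle guard
            break
        seen.add(note_id)
    return note_id


def recall_top_k(ranked, superseded_by, k, include_superseded=False):
    """Skip stale notes and keep walking the ranking until k hits are found."""
    hits = []
    for note_id in ranked:
        if not include_superseded and note_id in superseded_by:
            continue  # suppressed from recall, but still stored
        hits.append(note_id)
        if len(hits) == k:
            break
    return hits


def why(note_id, derives_from, max_depth=3):
    """Walk DERIVES_FROM edges; returns [(depth, ancestor_id), ...]."""
    trail, seen = [], {note_id}

    def walk(current, depth):
        if depth >= max_depth:  # depth cap
            return
        for parent in derives_from.get(current, []):
            if parent in seen:  # cycle guard
                continue
            seen.add(parent)
            trail.append((depth + 1, parent))
            walk(parent, depth + 1)

    walk(note_id, 0)
    return trail
```

Note that `why` only reads the graph. It has no effect on what `recall_top_k` returns, matching the "additive awareness only" rule.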

v1.0.3 — agent-side curation, un-archive, and a 404 for missing notes

The agent can archive — safely. In --agent mode the model may archive a note, but only one it recalled this session, and only behind an always-on confirmation that renders the note's real stored text (never the model's claim) plus a required reason. It's on the memory axis, so even --allow-shell doesn't auto-approve it; headless denies it. forget (hard delete) stays user-only. See vf-clide.
Archiving is reversible — /unarchive <id>. Archive drops a note's vector from recall but keeps the record; /unarchive restores it. Because archive removes the vector, un-archive re-embeds the stored text (the embedder is deterministic, so the original vector comes back) and re-inserts it with the node-id link intact — idempotent, and it survives a restart. Like /forget, it's a user-only command. See Memory.
API — a missing id is a 404. The curation endpoints (POST /memory/archive · /unarchive · /delete) now return 404 Not Found for an id that doesn't exist, instead of 500 — distinguishing a client mistake from a server fault. Real faults still return 500.
No inference-path change — decode is byte-identical, nothing new to benchmark.
Engine 1.0.2 → 1.0.3; vf-clide 0.3.1 → 0.3.2.
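
A client can use the status codes above to tell its own mistakes from server faults. The helper below is a sketch with hypothetical names and labels; only the 404/500 split and the 503 returned by /memory/* when memory is not active come from these release notes.

```python
def classify_curation_status(status):
    """Map an HTTP status from a /memory/* curation call to a coarse outcome."""
    if 200 <= status < 300:
        return "ok"
    if status == 404:
        return "unknown-id"        # client mistake: the note id does not exist
    if status == 503:
        return "memory-disabled"   # memory not built in or not activated
    if status >= 500:
        return "server-fault"      # real fault on the server side
    return "client-error"          # any other 4xx
```

A caller would typically retry only on "server-fault" and surface "unknown-id" to the user unchanged.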

v1.0.2 — vf-clide reaches the memory + curation

vf-clide can now use the memory. What v1.0.1 built server-side, the client reaches: in --agent mode the model calls recall and remember, and the REPL gains /project, /recall, /remember. The memory tools run on their own axis (direct /memory/* calls, not the file/shell gate) — visible on every call, available whenever the server has memory on. Recall stays explicit; nothing is auto-injected. See vf-clide and Memory.
Curation — archive and forget. Notes are no longer write-only: near-duplicate remembers are de-duplicated, /archive <id> drops a note from recall while keeping the trace, /forget <id> removes it. Curation is a user action — the agent points you to /forget <id> but never deletes on its own.
Accurate agent self-state. The agent knows its real tools, live permissions, and memory boundaries (from the actual gate, not guessed) — recalls instead of file-searching for a remembered fact, cites the real note id, and offers no rights it lacks. shell is described as un-confined; write_file as confirm-gated without --allow-mutating.
Engine — rust-1.96 warning cleanup. Lib warnings 114 → 0 after the toolchain bump; decode output bit-identical (greedy logits OLD == NEW).
Engine 1.0.1 → 1.0.2; vf-clide unchanged at 0.3.1.

v1.0.1 — server-side memory (opt-in)

A persistent, project-scoped, semantic memory embedded in vulkanforge serve — opt-in, off by default. Write notes on purpose (POST /memory/remember) and read them back by meaning (POST /memory/recall); the record survives restarts and model swaps, and recall in one project cannot return another's notes. Local, single-user, CPU-embedded (Nomic-Embed v1.5-Q, 768-dim, AVX-512/VNNI) — the memory path never takes the GPU permit. What it is, what it isn't, how it works, and the roadmap: Memory.
Two gates, both off by default: build with cargo build --release --features memory (Rust 1.89+), then activate per run with serve --memory (or VULKANFORGE_MEMORY=1). Without it /memory/* returns 503 and the default build stays lean.
Cost, honestly — only with the feature: the two native deps (SQLiteGraph + fastembed/ONNX Runtime) add ~34 MB to the binary (lean default ~25 MB), and an activated store downloads the embedding model into ~/.vulkanforge/embed-cache on first start.
Engine 0.9.2 → 1.0.0; vf-clide unchanged at 0.3.1.
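
As a rough sketch of what calling the two endpoints might look like from a client, the snippet below only builds the requests and does not send them. The base URL, port, and every JSON field name (`project`, `text`, `query`, `k`) are assumptions for illustration; the Memory page is the authority on the real request schema.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # assumption: wherever `vulkanforge serve` listens


def memory_request(endpoint, payload, base_url=BASE_URL):
    """Build a POST request for /memory/<endpoint> with a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/memory/{endpoint}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical payloads: the field names are placeholders, not the documented schema.
remember = memory_request("remember", {"project": "demo", "text": "CI uses Mesa 26.1"})
recall = memory_request("recall", {"project": "demo", "query": "which Mesa in CI?", "k": 3})

# Sending would be: urllib.request.urlopen(remember)
# Expect 503 if the server was not started with --memory.
```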

v0.9.4 — vf-clide REPL permission ceiling + denial wording

vf-clide REPL honors the permission ceiling. Tool calls at or below the active --yes/--allow-mutating/--allow-shell ceiling are auto-approved in the REPL too (still printed); only calls above it prompt y/N. Earlier versions prompted for every call interactively. Headless -p is unchanged (deny above the ceiling). See vf-clide.
Clearer denials. The agent constitution separates a permission denial (lifted by re-running with --allow-*) from an absolute workspace-confinement denial — so the model stops asking for OS rights.
vf-clide 0.3.0 → 0.3.1; engine unchanged (0.9.2).
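
The ceiling rule can be sketched as a comparison between a tool's tier and the highest tier the flags unlock. The tier names and flags follow these release notes; the tool-to-tier mapping, the function names, and the treatment of --allow-shell as implying the lower tiers are assumptions of this sketch.

```python
from enum import IntEnum


class Tier(IntEnum):
    READ_ONLY = 1
    MUTATING = 2
    EXEC = 3


# Assumed mapping of the agent tools onto the three tiers.
TOOL_TIER = {
    "read_file": Tier.READ_ONLY,
    "search": Tier.READ_ONLY,
    "write_file": Tier.MUTATING,
    "shell": Tier.EXEC,
}


def ceiling_from_flags(yes=False, allow_mutating=False, allow_shell=False):
    """Highest auto-approved tier; 0 means nothing is auto-approved."""
    if allow_shell:
        return Tier.EXEC
    if allow_mutating:
        return Tier.MUTATING
    if yes:
        return Tier.READ_ONLY
    return 0


def decide(tool, ceiling, interactive):
    """REPL prompts above the ceiling; headless (-p) denies above it."""
    if TOOL_TIER[tool] <= ceiling:
        return "auto-approve"  # still printed to the user
    return "prompt" if interactive else "deny"
```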

v0.9.2 — vf-clide token meter + clean server shutdown

vf-clide token meter + pinned status line. The REPL shows live, server-real token usage (↑prompt ↓completion (total) · session) and the current action; it's a no-op off-TTY, so headless -p output is byte-for-byte unchanged. See vf-clide.
Clean serve shutdown (engine bugfix). Ctrl+C / SIGTERM on vulkanforge serve now frees all GPU resources in order and exits cleanly (0 leaked objects) instead of leaking and crashing with a segfault. Shutdown-path only — decode is unchanged. See Usage.
(v0.9.1 was a vf-clide-only search symlink-confinement security patch.)

v0.9.0 — agentic vf-clide

vf-clide becomes an agentic coding client. An opt-in --agent tool loop lets the model call read_file / write_file / search / shell over the OpenAI API, with a three-tier permission model (ReadOnly / Mutating / Exec, opt-in via --yes → --allow-mutating → --allow-shell, cumulative), workspace confinement for the file tools (../ and symlink escapes rejected), and a constitution (built-in system prompt + project AGENTS.md). shell is deliberately not confined — --allow-shell is the explicit opt-in.
Engine test-infra hardening. The end-to-end regression and per-shader correctness suites are reactivated and guarded against drift. No decode/behavior change — inference output is unchanged from v0.7.0.
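
Workspace confinement for the file tools can be illustrated with a resolve-then-check pattern. This is a generic sketch of the technique, not vf-clide's implementation (vf-clide is a Rust crate); the function name is hypothetical and `Path.is_relative_to` needs Python 3.9+.

```python
from pathlib import Path


def confine(workspace, requested):
    """Return the resolved path if it stays inside the workspace, else raise."""
    root = Path(workspace).resolve()
    # resolve() collapses ".." segments and follows symlinks, so both
    # escape routes are normalised before the containment check.
    target = (root / requested).resolve()
    if not target.is_relative_to(root):
        raise PermissionError(f"outside workspace: {requested}")
    return target
```

Checking the resolved path, instead of scanning the raw string for "..", is what lets the same test also catch a symlink inside the workspace that points outside it.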

v0.8.0 — automatic context sizing + Gemma-4 tool-calling + vf-clide

Automatic context sizing. serve without --ctx-size computes the largest safe KV context from live free VRAM + the model and prints what it chose and why. No more guessing a value that's too small (truncated answers) or too large (OOM at load). Explicit --ctx-size still overrides; hardware-capped at 16384 on RDNA4. See Configuration.
Gemma-4 native tool / function calling. The OpenAI tools API now works with Gemma-4's own native tool-call format (Qwen3/Hermes path unchanged). See Usage.
New: vf-clide — a lean standalone CLI chat client (its own crate, no engine dependencies): streaming REPL + headless, with visible markers for truncated/empty answers.

Inference output is unchanged from v0.7.0 (auto-ctx is allocation-time only, decode-neutral).
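
The idea behind automatic context sizing can be sketched with the standard KV-cache arithmetic. This is not the engine's formula: the reserve, the rounding granularity, the 2-byte cache element, and the assumption that free VRAM is measured before the weights are loaded are all illustrative. Only the 16384 hardware cap comes from the text above.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """K and V caches: two tensors per layer, one row per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def auto_ctx_size(free_vram, weights, per_token,
                  reserve=512 * 2**20, hw_cap=16384, granularity=256):
    """Largest context (in tokens) that fits the remaining VRAM budget."""
    budget = free_vram - weights - reserve  # all values in bytes
    if budget <= 0:
        return 0
    tokens = budget // per_token
    tokens -= tokens % granularity          # round down to a tidy multiple
    return min(tokens, hw_cap)


# Hypothetical model shape: 36 layers, 8 KV heads, head_dim 128, fp16 cache.
per_token = kv_bytes_per_token(36, 8, 128)             # 147456 bytes per token
roomy = auto_ctx_size(15 * 2**30, 5 * 2**30, per_token)  # capped by hardware
tight = auto_ctx_size(6 * 2**30, 5 * 2**30, per_token)   # limited by VRAM
```

In the roomy case the hardware cap is the binding limit; in the tight case the VRAM budget is, which is the situation where a hand-picked --ctx-size would previously have caused an OOM at load.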

v0.7.0 — Prefill Parity

As of v0.7.0, prefill reaches parity with llama.cpp's Vulkan backend on dense models, and the Gemma-4 MoE prefill gap is largely closed — decode is unchanged. Measured same-run vs llama.cpp Vulkan (RX 9070 XT, RADV Mesa 26.1.2):

Dense prefill (Qwen3-8B / Llama-3.1-8B / Mistral-7B / DeepSeek-R1-8B) @p2048: 0.96–1.04× llama (parity — Mistral ahead).
Gemma-4-26B-A4B MoE prefill @p2048: Q3_K_M 0.89× · QAT-Q4_0 0.83×.
Decode: 0.87–0.97× llama (unchanged).

Full table + conditions on Benchmarks.

Quick links

Get started: Installation · Hardware and Compatibility
Use it: Supported Models · Usage · Configuration · vf-clide
Reference: Benchmarks · Architecture · Troubleshooting

License

GPL-3.0. VulkanForge builds on the foundational work of oldnordic/ROCmForge (model loader, GGUF parser, CPU path, overall architecture). See Architecture for full attribution.

VulkanForge v1.0.5 · single-user RDNA 4 / gfx1201 Vulkan inference · GPL-3.0 · Repository · Releases
