Home
LLM inference engine for AMD RDNA 4 GPUs. Pure Rust + Vulkan compute shaders, ~14 MB static
binary, no runtime dependencies beyond the system Vulkan loader. It is the first engine doing
native FP8 WMMA over Vulkan on consumer AMD hardware (V_WMMA_F32_16X16X16_FP8_FP8 via Mesa 26.1+
shaderFloat8CooperativeMatrix).
This wiki documents the shipped v1.0.5 reality. It complements — does not replace — the README and CHANGELOG.
VulkanForge is a single-user, RDNA 4 / gfx1201-specific Vulkan inference engine. It targets one
GPU (Radeon RX 9070 XT) running one request at a time, and it is tuned for that case.
- A good fit if you own an RDNA 4 card, run single-user chat / single-stream inference locally on Linux + Mesa RADV, and want a tiny self-contained binary with native FP8.
- Not a fit if you need batch serving / concurrent sessions, multi-GPU, NVIDIA/CUDA, or a general cross-hardware llama.cpp replacement. For batch throughput, vLLM is the right tool.
- CONTRADICTS edge. /contradict <id> <id> (and /uncontradict) flags two notes as in conflict: symmetric, awareness-only (no suppression, no winner), surfaced in --explain; you resolve it with /supersede. See Memory.
- Opt-in frontier retrieval (--frontier, off by default). Reserves a few slots for a top hit's DERIVES_FROM evidence (one hop), pulling a supporting premise up next to it; --explain labels seed vs. frontier.
- Edge-type priors. A frontier candidate that CONTRADICTS a seed is held back (categorical roles, no scalar weights), so the frontier never amplifies disputed evidence; shown transparently in --explain.
- Cross-process recall determinism. A pinned HNSW seed (VF_HNSW_SEED, on SQLiteGraph 3.3.1) makes recall reproducible across process restarts; enforced by a committed test. --features memory now needs Rust 1.89 (the lean default stays 1.85).
- Recall stays byte-identical with no edges and no opt-ins active. Engine 1.0.4 → 1.0.5; vf-clide 0.3.3 → 0.3.4.
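The frontier and edge-prior bullets above can be pictured with a small sketch. It is not the engine's code: the `Edges` struct, the `NoteId` type, the function name and the slot arithmetic are all assumptions made for illustration. It only shows the stated behaviour: one hop of DERIVES_FROM evidence for the top hit, with any candidate that CONTRADICTS a seed held back.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical note id; the real store's id type is not shown on this page.
type NoteId = u32;

/// Illustrative edge store (assumed layout, not the SQLiteGraph schema).
struct Edges {
    /// `derives_from[a]` lists the notes that `a` is anchored in.
    derives_from: HashMap<NoteId, Vec<NoteId>>,
    /// Symmetric CONTRADICTS pairs, stored as (smaller id, larger id).
    contradicts: HashSet<(NoteId, NoteId)>,
}

impl Edges {
    fn in_conflict(&self, a: NoteId, b: NoteId) -> bool {
        self.contradicts.contains(&(a.min(b), a.max(b)))
    }
}

/// Returns (seeds, frontier). `ranked` is the similarity-ordered hit list,
/// `k` the total result budget, `reserved` the slots set aside for evidence.
fn recall_with_frontier(
    ranked: &[NoteId],
    edges: &Edges,
    k: usize,
    reserved: usize,
) -> (Vec<NoteId>, Vec<NoteId>) {
    if k == 0 || ranked.is_empty() {
        return (Vec::new(), Vec::new());
    }
    // Always keep at least one seed, even if `reserved >= k`.
    let seed_slots = k.saturating_sub(reserved).max(1);
    let seeds: Vec<NoteId> = ranked.iter().copied().take(seed_slots).collect();
    let top = seeds[0];

    let mut frontier = Vec::new();
    let room = k - seeds.len();
    // One hop only: direct evidence of the top hit, nothing transitive.
    if let Some(evidence) = edges.derives_from.get(&top) {
        for &cand in evidence {
            if frontier.len() == room {
                break;
            }
            let duplicate = seeds.contains(&cand) || frontier.contains(&cand);
            // Categorical rule, no scalar weight: a disputed candidate is skipped.
            let disputed = seeds.iter().any(|&s| edges.in_conflict(s, cand));
            if !duplicate && !disputed {
                frontier.push(cand);
            }
        }
    }
    (seeds, frontier)
}
```

As a simplification, unused reserved slots are left empty here rather than handed back to further seeds; how the engine treats them is not stated on this page.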
- recall --explain. A diagnostic view of why recall returned what it did: returned hits, near-misses, the cut reason per near-miss (superseded/type/threshold/top-k), and the score separation. See Memory.
- Note typing + an opt-in relevance threshold. Notes carry a layer type (invariant/working/episodic/decision/failure, default untyped); --type on remember, /retype, and a --type recall filter give an explicit, non-embedding signal that disambiguates where similarity can't. VF_RECALL_MARGIN (off by default) trims recall to the top band.
- SUPERSEDES edges. /supersede <new> <old> (and /unsupersede) marks a note stale; it's suppressed from recall (chains resolve to the current head, recall backfills to k), and --include-superseded shows it: suppressed, never deleted.
- DERIVES_FROM edges + /why. /derive <A> from <B> … records what a note is anchored in; /why <id> walks the why-graph (cycle-guarded, depth-capped). It never changes recall results; it is additive awareness only.
- KV prefix-reuse is on by default (VF_KV_PREFIX_REUSE=0 to disable). It removes the within-turn double-prefill on memory-augmented turns; logits are byte-identical to a fresh prefill. See Configuration.
- Recall stays byte-identical with no edges and no opt-ins active. Engine 1.0.3 → 1.0.4; vf-clide 0.3.2 → 0.3.3.
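A cycle-guarded, depth-capped walk like the one /why is described as doing might look roughly like the sketch below. The map layout, the `u32` ids, the function name and the visit order are assumptions; only the two guards come from the text.

```rust
use std::collections::{HashMap, HashSet};

/// Walks DERIVES_FROM edges upward from `start` and returns (note, depth)
/// pairs in depth-first visit order. `derives_from` is a hypothetical
/// adjacency map: note id -> the notes it is anchored in.
fn why(
    derives_from: &HashMap<u32, Vec<u32>>,
    start: u32,
    max_depth: usize,
) -> Vec<(u32, usize)> {
    // Cycle guard: a note is expanded at most once.
    let mut seen = HashSet::from([start]);
    let mut out = Vec::new();
    let mut stack = vec![(start, 0usize)];

    while let Some((id, depth)) = stack.pop() {
        if depth > 0 {
            out.push((id, depth));
        }
        // Depth cap: do not expand past `max_depth` hops.
        if depth == max_depth {
            continue;
        }
        if let Some(parents) = derives_from.get(&id) {
            // Push in reverse so parents are visited in their stored order.
            for &p in parents.iter().rev() {
                if seen.insert(p) {
                    stack.push((p, depth + 1));
                }
            }
        }
    }
    out
}
```

The walk only reads the graph, which matches the claim that /why never changes recall results.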
- The agent can archive, safely. In --agent mode the model may archive a note, but only one it recalled this session, and only behind an always-on confirmation that renders the note's real stored text (never the model's claim) plus a required reason. It's on the memory axis, so even --allow-shell doesn't auto-approve it; headless denies it. forget (hard delete) stays user-only. See vf-clide.
- Archiving is reversible: /unarchive <id>. Archive drops a note's vector from recall but keeps the record; /unarchive restores it. Because archive removes the vector, un-archive re-embeds the stored text (the embedder is deterministic, so the original vector comes back) and re-inserts it with the node-id link intact; the operation is idempotent and survives a restart. Like /forget, it's a user-only command. See Memory.
- API: a missing id is a 404. The curation endpoints (POST /memory/archive · /unarchive · /delete) now return 404 Not Found for an id that doesn't exist, instead of 500, distinguishing a client mistake from a server fault. Real faults still return 500.
- No inference-path change: decode is byte-identical, nothing new to benchmark.
- Engine 1.0.2 → 1.0.3; vf-clide 0.3.1 → 0.3.2.
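The 404-versus-500 split for the curation endpoints comes down to telling the two failure kinds apart before choosing a status. A minimal sketch, where the `CurationError` enum, its variants and the 200 success code are assumptions rather than the server's actual types:

```rust
/// Hypothetical error type for the curation handlers.
#[derive(Debug)]
enum CurationError {
    /// The client named an id that is not in the store.
    UnknownId(u64),
    /// Something failed on the server side (I/O, store corruption, ...).
    Store(String),
}

/// Maps a handler outcome to an HTTP status code.
fn status_code(result: &Result<(), CurationError>) -> u16 {
    match result {
        Ok(()) => 200, // assumed success code
        Err(CurationError::UnknownId(_)) => 404, // client mistake
        Err(CurationError::Store(_)) => 500,     // real server fault
    }
}
```

Keeping the distinction in the error type, rather than inspecting message strings, lets a client retry on 500 and stop on 404.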
- vf-clide can now use the memory. What v1.0.1 built server-side, the client reaches: in --agent mode the model calls recall and remember, and the REPL gains /project, /recall, /remember. The memory tools run on their own axis (direct /memory/* calls, not the file/shell gate): visible on every call, available whenever the server has memory on. Recall stays explicit; nothing is auto-injected. See vf-clide and Memory.
- Curation: archive and forget. Notes are no longer write-only: near-duplicate remembers are de-duplicated, /archive <id> drops a note from recall while keeping the trace, /forget <id> removes it. Curation is a user action; the agent points you to /forget <id> but never deletes on its own.
- Accurate agent self-state. The agent knows its real tools, live permissions, and memory boundaries (from the actual gate, not guessed): it recalls instead of file-searching for a remembered fact, cites the real note id, and offers no rights it lacks. shell is described as un-confined; write_file as confirm-gated without --allow-mutating.
- Engine: rust-1.96 warning cleanup. Lib warnings 114 → 0 after the toolchain bump; decode output bit-identical (greedy logits OLD == NEW).
- Engine 1.0.1 → 1.0.2; vf-clide unchanged at 0.3.1.
- A persistent, project-scoped, semantic memory embedded in vulkanforge serve: opt-in, off by default. Write notes on purpose (POST /memory/remember) and read them back by meaning (POST /memory/recall); the record survives restarts and model swaps, and recall in one project cannot return another's notes. Local, single-user, CPU-embedded (Nomic-Embed v1.5-Q, 768-dim, AVX-512/VNNI); the memory path never takes the GPU permit. What it is, what it isn't, how it works, and the roadmap: Memory.
- Two gates, both off by default: build with cargo build --release --features memory (Rust 1.89+), then activate per run with serve --memory (or VULKANFORGE_MEMORY=1). Without it /memory/* returns 503 and the default build stays lean.
- Cost, honestly, and only with the feature: the two native deps (SQLiteGraph + fastembed/ONNX Runtime) add ~34 MB to the binary (lean default ~25 MB), and an activated store downloads the embedding model into ~/.vulkanforge/embed-cache on first start.
- Engine 0.9.2 → 1.0.0; vf-clide unchanged at 0.3.1.
- vf-clide REPL honors the permission ceiling. Tool calls at or below the active --yes / --allow-mutating / --allow-shell ceiling are auto-approved in the REPL too (still printed); only calls above it prompt y/N. Earlier versions prompted for every call interactively. Headless -p is unchanged (deny above the ceiling). See vf-clide.
- Clearer denials. The agent constitution separates a permission denial (lifted by re-running with --allow-*) from an absolute workspace-confinement denial, so the model stops asking for OS rights.
- vf-clide 0.3.0 → 0.3.1; engine unchanged (0.9.2).
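The ceiling rule can be read as a comparison on an ordered tier. The sketch below uses the three tier names and the cumulative flag order that this page gives for the agent loop (ReadOnly / Mutating / Exec via --yes → --allow-mutating → --allow-shell). The tool-to-tier table, the type names and the handling of a run with no flag at all are assumptions, and the separate memory axis is not modelled.

```rust
/// Declaration order gives ReadOnly < Mutating < Exec through the derived `Ord`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Tier {
    ReadOnly,
    Mutating,
    Exec,
}

#[derive(Debug, PartialEq, Eq)]
enum Decision {
    AutoApprove,
    Prompt,
    Deny,
}

/// Assumed mapping from tool name to tier.
fn tier_of(tool: &str) -> Option<Tier> {
    match tool {
        "read_file" | "search" => Some(Tier::ReadOnly),
        "write_file" => Some(Tier::Mutating),
        "shell" => Some(Tier::Exec),
        _ => None,
    }
}

/// Flags are cumulative: the highest one set defines the ceiling.
fn ceiling(yes: bool, allow_mutating: bool, allow_shell: bool) -> Option<Tier> {
    if allow_shell {
        Some(Tier::Exec)
    } else if allow_mutating {
        Some(Tier::Mutating)
    } else if yes {
        Some(Tier::ReadOnly)
    } else {
        None // assumed: with no flag, nothing is auto-approved
    }
}

/// At or below the ceiling: auto-approve. Above it: prompt in the REPL,
/// deny in headless mode.
fn decide(tier: Tier, ceiling: Option<Tier>, interactive: bool) -> Decision {
    match ceiling {
        Some(c) if tier <= c => Decision::AutoApprove,
        _ if interactive => Decision::Prompt,
        _ => Decision::Deny,
    }
}
```

For example, under --yes a write_file call sits above the ceiling, so the REPL prompts and a headless run denies it.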
- vf-clide token meter + pinned status line. The REPL shows live, server-real token usage (↑prompt ↓completion (total) · session) and the current action; it's a no-op off-TTY, so headless -p output is byte-for-byte unchanged. See vf-clide.
- Clean serve shutdown (engine bugfix). Ctrl+C / SIGTERM on vulkanforge serve now frees all GPU resources in order and exits cleanly (0 leaked objects) instead of leaking and crashing with a segfault. Shutdown-path only; decode is unchanged. See Usage.
- (v0.9.1 was a vf-clide-only search symlink-confinement security patch.)
- vf-clide becomes an agentic coding client. An opt-in --agent tool loop lets the model call read_file / write_file / search / shell over the OpenAI API, with a three-tier permission model (ReadOnly / Mutating / Exec, opt-in via --yes → --allow-mutating → --allow-shell, cumulative), workspace confinement for the file tools (../ and symlink escapes rejected), and a constitution (built-in system prompt + project AGENTS.md). shell is deliberately not confined; --allow-shell is the explicit opt-in.
- Engine test-infra hardening. The end-to-end regression and per-shader correctness suites are reactivated and guarded against drift. No decode/behavior change: inference output is unchanged from v0.7.0.
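Workspace confinement that rejects both ../ and symlink escapes can be built on path canonicalization, since resolving the real path handles both cases with one prefix check. This is a sketch under that assumption, not vf-clide's implementation; the `confine` name and error choice are illustrative.

```rust
use std::io;
use std::path::{Path, PathBuf};

/// Resolves `requested` against `workspace` and rejects any path whose real
/// location ends up outside the workspace root.
fn confine(workspace: &Path, requested: &Path) -> io::Result<PathBuf> {
    let root = workspace.canonicalize()?;
    // canonicalize() resolves `..` components and follows symlinks, so both
    // escape routes reduce to a prefix check on the resolved path.
    // An absolute `requested` replaces `root` in join() and is caught below.
    let real = root.join(requested).canonicalize()?;
    if real.starts_with(&root) {
        Ok(real)
    } else {
        Err(io::Error::new(
            io::ErrorKind::PermissionDenied,
            "path escapes the workspace",
        ))
    }
}
```

One limitation of this approach: canonicalize() fails for paths that do not exist yet, so a tool that creates new files would have to check the parent directory instead.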
- Automatic context sizing. serve without --ctx-size computes the largest safe KV context from live free VRAM + the model and prints what it chose and why. No more guessing a value that's too small (truncated answers) or too large (OOM at load). Explicit --ctx-size still overrides; hardware-capped at 16384 on RDNA4. See Configuration.
- Gemma-4 native tool / function calling. The OpenAI tools API now works with Gemma-4's own native tool-call format (Qwen3/Hermes path unchanged). See Usage.
- New: vf-clide, a lean standalone CLI chat client (its own crate, no engine dependencies): streaming REPL + headless, with visible markers for truncated/empty answers.
Inference output is unchanged from v0.7.0 (auto-ctx is allocation-time only, decode-neutral).
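To make the automatic context sizing concrete, here is a rough sketch of how a largest-safe context could be derived. It uses the common KV-cache size estimate (one K and one V vector per layer per token); the struct, its field names, the `reserve` headroom and the exact inputs are assumptions. Only the 16384 cap comes from this page, and the engine's real formula may differ.

```rust
/// Hypothetical model shape; field names are illustrative.
struct ModelShape {
    n_layers: u64,
    n_kv_heads: u64,
    head_dim: u64,
    /// Bytes per stored K or V element (2 for f16).
    kv_elem_bytes: u64,
}

/// Hardware cap on RDNA4 stated in the text.
const HW_CTX_CAP: u64 = 16_384;

/// KV-cache bytes needed per token: one K and one V vector in every layer.
fn kv_bytes_per_token(m: &ModelShape) -> u64 {
    2 * m.n_layers * m.n_kv_heads * m.head_dim * m.kv_elem_bytes
}

/// Largest context that fits in what is left after the weights and a safety
/// reserve, clamped to the hardware cap. All sizes are in bytes.
fn auto_ctx(free_vram: u64, weights: u64, reserve: u64, m: &ModelShape) -> u64 {
    let budget = free_vram.saturating_sub(weights).saturating_sub(reserve);
    // max(1) avoids a division by zero on a degenerate shape.
    (budget / kv_bytes_per_token(m).max(1)).min(HW_CTX_CAP)
}
```

With an illustrative shape of 36 layers, 8 KV heads, head dimension 128 and f16 elements, each token costs 147456 bytes of KV cache, so roughly 2.25 GiB of budget is enough to reach the 16384 cap.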
As of v0.7.0, prefill reaches parity with llama.cpp's Vulkan backend on dense models, and the Gemma-4 MoE prefill gap is largely closed — decode is unchanged. Measured same-run vs llama.cpp Vulkan (RX 9070 XT, RADV Mesa 26.1.2):
- Dense prefill (Qwen3-8B / Llama-3.1-8B / Mistral-7B / DeepSeek-R1-8B) @p2048: 0.96–1.04× llama (parity — Mistral ahead).
- Gemma-4-26B-A4B MoE prefill @p2048: Q3_K_M 0.89× · QAT-Q4_0 0.83×.
- Decode: 0.87–0.97× llama (unchanged).
Full table + conditions on Benchmarks.
- Get started: Installation · Hardware and Compatibility
- Use it: Supported Models · Usage · Configuration · vf-clide
- Reference: Benchmarks · Architecture · Troubleshooting
GPL-3.0. VulkanForge builds on the foundational work of oldnordic/ROCmForge (model loader, GGUF parser, CPU path, overall architecture). See Architecture for full attribution.
VulkanForge v1.0.4 · single-user RDNA 4 / gfx1201 Vulkan inference · GPL-3.0 · Repository · Releases