maeddesg edited this page Jun 20, 2026 · 13 revisions

VulkanForge

LLM inference engine for AMD RDNA 4 GPUs. Pure Rust + Vulkan compute shaders, ~14 MB static binary, no runtime dependencies beyond the system Vulkan loader. It is the first engine doing native FP8 WMMA over Vulkan on consumer AMD hardware (V_WMMA_F32_16X16X16_FP8_FP8 via Mesa 26.1+ shaderFloat8CooperativeMatrix).

This wiki documents the shipped v1.0.5 reality. It complements — does not replace — the README and CHANGELOG.

Who it is for — and who it is not for

VulkanForge is a single-user, RDNA 4 / gfx1201-specific Vulkan inference engine. It targets one GPU (Radeon RX 9070 XT) running one request at a time, and it is tuned for that case.

  • A good fit if you own an RDNA 4 card, run single-user chat / single-stream inference locally on Linux + Mesa RADV, and want a tiny self-contained binary with native FP8.
  • Not a fit if you need batch serving / concurrent sessions, multi-GPU, NVIDIA/CUDA, or a general cross-hardware llama.cpp replacement. For batch throughput, vLLM is the right tool.

v1.0.5 — conflict edges, opt-in frontier, edge-type priors, and cross-process determinism

  • CONTRADICTS edge. /contradict <id> <id> (and /uncontradict) flags two notes as in conflict — symmetric, awareness-only (no suppression, no winner), surfaced in --explain; you resolve it with /supersede. See Memory.
  • Opt-in frontier retrieval (--frontier, off by default). Reserves a few slots for a top hit's DERIVES_FROM evidence (one hop), pulling a supporting premise up next to it; --explain labels seed vs. frontier.
  • Edge-type priors. A frontier candidate that CONTRADICTS a seed is held back (categorical roles, no scalar weights) — the frontier never amplifies disputed evidence; shown transparently in --explain.
  • Cross-process recall determinism. A pinned HNSW seed (VF_HNSW_SEED, on SQLiteGraph 3.3.1) makes recall reproducible across process restarts; enforced by a committed test. --features memory now needs Rust 1.89 (the lean default stays 1.85).
  • Recall stays byte-identical with no edges and no opt-ins active. Engine 1.0.4 → 1.0.5; vf-clide 0.3.3 → 0.3.4.
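
The frontier and edge-type-prior rules above can be illustrated with a small sketch. This is not VulkanForge's implementation: `EdgeKind`, `Edge`, `select_with_frontier`, and the slot arithmetic are hypothetical names and choices, and edges are modelled as plain in-memory tuples rather than a graph store.

```rust
/// Hypothetical edge kinds, named after the edges described in the notes above.
#[derive(Clone, Copy, PartialEq, Eq)]
enum EdgeKind {
    DerivesFrom,
    Contradicts,
}

/// (from, kind, to). CONTRADICTS is queried symmetrically.
type Edge = (u32, EdgeKind, u32);

fn contradicts(edges: &[Edge], a: u32, b: u32) -> bool {
    edges.iter().any(|&(x, kind, y)| {
        kind == EdgeKind::Contradicts && ((x == a && y == b) || (x == b && y == a))
    })
}

/// Returns (seeds, frontier). `ranked` is the similarity-ordered hit list.
/// Assumption: `frontier_slots` of the `k` result slots are reserved for evidence.
fn select_with_frontier(
    ranked: &[u32],
    edges: &[Edge],
    k: usize,
    frontier_slots: usize,
) -> (Vec<u32>, Vec<u32>) {
    let seed_count = k.saturating_sub(frontier_slots);
    let seeds: Vec<u32> = ranked.iter().copied().take(seed_count).collect();
    let mut frontier: Vec<u32> = Vec::new();

    'outer: for &seed in &seeds {
        for &(from, kind, to) in edges {
            if frontier.len() >= frontier_slots {
                break 'outer;
            }
            // One hop only: evidence the seed directly derives from.
            let is_evidence = from == seed && kind == EdgeKind::DerivesFrom;
            if !is_evidence || seeds.contains(&to) || frontier.contains(&to) {
                continue;
            }
            // Categorical prior: evidence disputed by any seed is held back.
            if seeds.iter().any(|&s| contradicts(edges, s, to)) {
                continue;
            }
            frontier.push(to);
        }
    }
    (seeds, frontier)
}
```

With `frontier_slots` set to 0 the function returns the plain top-k list and an empty frontier, which mirrors the "off by default" behaviour described above.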

v1.0.4 — recall diagnostics, note typing, and memory edges

  • recall --explain. A diagnostic view of why recall returned what it did: returned hits, near-misses, the cut reason per near-miss (superseded / type / threshold / top-k), and the score separation. See Memory.
  • Note typing + an opt-in relevance threshold. Notes carry a layer type (invariant/working/episodic/decision/failure, default untyped); --type on remember, /retype, and a --type recall filter: an explicit, non-embedding signal that disambiguates where similarity can't. VF_RECALL_MARGIN (off by default) trims recall to the top band.
  • SUPERSEDES edges. /supersede <new> <old> (and /unsupersede) marks a note stale; it's suppressed from recall (chains resolve to the current head, recall backfills to k), --include-superseded shows it — suppressed, never deleted.
  • DERIVES_FROM edges + /why. /derive <A> from <B> … records what a note is anchored in; /why <id> walks the why-graph (cycle-guarded, depth-capped). It never changes recall results — additive awareness only.
  • KV prefix-reuse is on by default (VF_KV_PREFIX_REUSE=0 to disable) — removes the within-turn double-prefill on memory-augmented turns; logit byte-identical to a fresh prefill. See Configuration.
  • Recall stays byte-identical with no edges and no opt-ins active. Engine 1.0.3 → 1.0.4; vf-clide 0.3.2 → 0.3.3.
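
The SUPERSEDES behaviour described above (chains resolve to the current head, stale notes are suppressed, recall backfills to k) can be sketched roughly as follows. The map shape and the names `resolve_head` and `recall_current` are assumptions made for illustration; the real store is not shown in this page.

```rust
use std::collections::{HashMap, HashSet};

/// Follow SUPERSEDES links from `id` to the current head of its chain.
/// `superseded_by` maps an old note id to the note that replaced it (hypothetical shape).
fn resolve_head(id: u32, superseded_by: &HashMap<u32, u32>) -> u32 {
    let mut current = id;
    let mut visited = HashSet::new();
    // The visited set stops the walk if the map accidentally contains a cycle.
    while visited.insert(current) {
        match superseded_by.get(&current) {
            Some(&next) => current = next,
            None => break,
        }
    }
    current
}

/// Take the top `k` ids from a similarity-ranked list, skipping superseded notes.
/// Filtering before `take` means a suppressed note does not consume a slot,
/// so the result backfills to `k`. Nothing is deleted; the flag shows stale notes again.
fn recall_current(
    ranked: &[u32],
    superseded_by: &HashMap<u32, u32>,
    k: usize,
    include_superseded: bool,
) -> Vec<u32> {
    ranked
        .iter()
        .copied()
        .filter(|id| include_superseded || !superseded_by.contains_key(id))
        .take(k)
        .collect()
}
```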

v1.0.3 — agent-side curation, un-archive, and a 404 for missing notes

  • The agent can archive — safely. In --agent mode the model may archive a note, but only one it recalled this session, and only behind an always-on confirmation that renders the note's real stored text (never the model's claim) plus a required reason. It's on the memory axis, so even --allow-shell doesn't auto-approve it; headless denies it. forget (hard delete) stays user-only. See vf-clide.
  • Archiving is reversible — /unarchive <id>. Archive drops a note's vector from recall but keeps the record; /unarchive restores it. Because archive removes the vector, un-archive re-embeds the stored text (the embedder is deterministic, so the original vector comes back) and re-inserts it with the node-id link intact — idempotent, and it survives a restart. Like /forget, it's a user-only command. See Memory.
  • API — a missing id is a 404. The curation endpoints (POST /memory/archive · /unarchive · /delete) now return 404 Not Found for an id that doesn't exist, instead of 500 — distinguishing a client mistake from a server fault. Real faults still return 500.
  • No inference-path change — decode is byte-identical, nothing new to benchmark.
  • Engine 1.0.2 → 1.0.3; vf-clide 0.3.1 → 0.3.2.
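
The 404-versus-500 split for the curation endpoints amounts to mapping error kinds to status codes. A minimal sketch, assuming a hypothetical `CurationError` type; the 200 success code is also an assumption, since the notes above only specify the error codes.

```rust
/// Hypothetical error type for a curation call (archive / unarchive / delete).
#[derive(Debug)]
enum CurationError {
    /// The client named a note id that does not exist.
    NotFound(u64),
    /// A genuine server-side fault, e.g. the store failed.
    Store(String),
}

/// Map the outcome of a curation call to an HTTP status code.
fn status_code(result: &Result<(), CurationError>) -> u16 {
    match result {
        Ok(()) => 200, // assumed success code
        Err(CurationError::NotFound(_)) => 404, // client mistake
        Err(CurationError::Store(_)) => 500,    // server fault
    }
}
```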

v1.0.2 — vf-clide reaches the memory + curation

  • vf-clide can now use the memory. What v1.0.1 built server-side, the client reaches: in --agent mode the model calls recall and remember, and the REPL gains /project, /recall, /remember. The memory tools run on their own axis (direct /memory/* calls, not the file/shell gate) — visible on every call, available whenever the server has memory on. Recall stays explicit; nothing is auto-injected. See vf-clide and Memory.
  • Curation — archive and forget. Notes are no longer write-only: near-duplicate remembers are de-duplicated, /archive <id> drops a note from recall while keeping the trace, /forget <id> removes it. Curation is a user action — the agent points you to /forget <id> but never deletes on its own.
  • Accurate agent self-state. The agent knows its real tools, live permissions, and memory boundaries (from the actual gate, not guessed) — recalls instead of file-searching for a remembered fact, cites the real note id, and offers no rights it lacks. shell is described as un-confined; write_file as confirm-gated without --allow-mutating.
  • Engine — rust-1.96 warning cleanup. Lib warnings 114 → 0 after the toolchain bump; decode output bit-identical (greedy logits OLD == NEW).
  • Engine 1.0.1 → 1.0.2; vf-clide unchanged at 0.3.1.

v1.0.1 — server-side memory (opt-in)

  • A persistent, project-scoped, semantic memory embedded in vulkanforge serve (opt-in, off by default). Write notes on purpose (POST /memory/remember) and read them back by meaning (POST /memory/recall); the record survives restarts and model swaps, and recall in one project cannot return another's notes. Local, single-user, CPU-embedded (Nomic-Embed v1.5-Q, 768-dim, AVX-512/VNNI); the memory path never takes the GPU permit. What it is, what it isn't, how it works, and the roadmap: Memory.
  • Two gates, both off by default: build with cargo build --release --features memory (Rust 1.89+), then activate per run with serve --memory (or VULKANFORGE_MEMORY=1). Without it /memory/* returns 503 and the default build stays lean.
  • Cost, honestly — only with the feature: the two native deps (SQLiteGraph + fastembed/ONNX Runtime) add ~34 MB to the binary (lean default ~25 MB), and an activated store downloads the embedding model into ~/.vulkanforge/embed-cache on first start.
  • Engine 0.9.2 → 1.0.0; vf-clide unchanged at 0.3.1.
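
Project-scoped recall "by meaning" can be pictured as a similarity ranking that only ever looks at one project's notes. The sketch below uses a brute-force cosine scan as a stand-in; the real store is indexed differently, and `Note`, `cosine`, and `recall_in_project` are hypothetical names. Real embeddings would be 768-dimensional, while the usage here would typically pass short toy vectors.

```rust
/// Hypothetical in-memory note record.
struct Note {
    id: u32,
    project: String,
    embedding: Vec<f32>,
}

/// Cosine similarity; returns 0.0 if either vector has zero length.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Return up to `k` note ids from `project`, most similar to `query` first.
/// Notes from other projects are filtered out before any scoring happens.
fn recall_in_project(notes: &[Note], project: &str, query: &[f32], k: usize) -> Vec<u32> {
    let mut scored: Vec<(f32, u32)> = notes
        .iter()
        .filter(|note| note.project == project)
        .map(|note| (cosine(&note.embedding, query), note.id))
        .collect();
    // Sort by descending similarity; total_cmp gives a total order on f32.
    scored.sort_by(|a, b| b.0.total_cmp(&a.0));
    scored.into_iter().take(k).map(|(_, id)| id).collect()
}
```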

v0.9.4 — vf-clide REPL permission ceiling + denial wording

  • vf-clide REPL honors the permission ceiling. Tool calls at or below the active --yes/--allow-mutating/--allow-shell ceiling are auto-approved in the REPL too (still printed); only calls above it prompt y/N. Earlier versions prompted for every call interactively. Headless -p is unchanged (deny above the ceiling). See vf-clide.
  • Clearer denials. The agent constitution separates a permission denial (lifted by re-running with --allow-*) from an absolute workspace-confinement denial — so the model stops asking for OS rights.
  • vf-clide 0.3.0 → 0.3.1; engine unchanged (0.9.2).
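
The ceiling rule for the file/shell gate can be expressed as an ordered comparison. This is a sketch under assumptions: the tier names come from the v0.9.0 notes, the mapping of flags to a ceiling (--yes to ReadOnly, --allow-mutating to Mutating, --allow-shell to Exec) is inferred from "cumulative", and `decide` is a hypothetical function. The separate memory axis is not modelled.

```rust
/// Permission tiers in ascending order; derive(PartialOrd) follows declaration order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Tier {
    ReadOnly,
    Mutating,
    Exec,
}

#[derive(Debug, PartialEq, Eq)]
enum Decision {
    AutoApprove,
    Prompt,
    Deny,
}

/// `ceiling` is None when no permission flag was passed (assumption).
/// `interactive` is true in the REPL and false in headless -p mode.
fn decide(required: Tier, ceiling: Option<Tier>, interactive: bool) -> Decision {
    match ceiling {
        // At or below the ceiling: approved without a prompt.
        Some(limit) if required <= limit => Decision::AutoApprove,
        // Above the ceiling: the REPL asks y/N, headless denies.
        _ if interactive => Decision::Prompt,
        _ => Decision::Deny,
    }
}
```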

v0.9.2 — vf-clide token meter + clean server shutdown

  • vf-clide token meter + pinned status line. The REPL shows live, server-real token usage (↑prompt ↓completion (total) · session) and the current action; it's a no-op off-TTY, so headless -p output is byte-for-byte unchanged. See vf-clide.
  • Clean serve shutdown (engine bugfix). Ctrl+C / SIGTERM on vulkanforge serve now frees all GPU resources in order and exits cleanly (0 leaked objects) instead of leaking and crashing with a segfault. Shutdown-path only — decode is unchanged. See Usage.
  • (v0.9.1 was a vf-clide-only search symlink-confinement security patch.)

v0.9.0 — agentic vf-clide

  • vf-clide becomes an agentic coding client. An opt-in --agent tool loop lets the model call read_file / write_file / search / shell over the OpenAI API, with a three-tier permission model (ReadOnly / Mutating / Exec, opt-in via --yes/--allow-mutating/--allow-shell, cumulative), workspace confinement for the file tools (../ and symlink escapes rejected), and a constitution (built-in system prompt + project AGENTS.md). shell is deliberately not confined; --allow-shell is the explicit opt-in.
  • Engine test-infra hardening. The end-to-end regression and per-shader correctness suites are reactivated and guarded against drift. No decode/behavior change — inference output is unchanged from v0.7.0.
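
Workspace confinement for the file tools can be approximated with a lexical path check. The sketch below only handles the ../ case; rejecting symlink escapes would additionally need the filesystem (for example, canonicalizing the result and checking it still starts with the workspace root), which is left out here. `confine` is a hypothetical helper, not the client's actual function.

```rust
use std::path::{Component, Path, PathBuf};

/// Lexically resolve `requested` against `workspace`.
/// Returns None if the path is absolute or would climb above the workspace root.
fn confine(workspace: &Path, requested: &str) -> Option<PathBuf> {
    let mut resolved = workspace.to_path_buf();
    // Number of components pushed below the workspace root so far.
    let mut depth: usize = 0;
    for component in Path::new(requested).components() {
        match component {
            Component::Normal(part) => {
                resolved.push(part);
                depth += 1;
            }
            Component::CurDir => {}
            Component::ParentDir => {
                if depth == 0 {
                    return None; // "../" would leave the workspace
                }
                resolved.pop();
                depth -= 1;
            }
            // Absolute paths and Windows drive prefixes are rejected outright.
            Component::RootDir | Component::Prefix(_) => return None,
        }
    }
    Some(resolved)
}
```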

v0.8.0 — automatic context sizing + Gemma-4 tool-calling + vf-clide

  • Automatic context sizing. serve without --ctx-size computes the largest safe KV context from live free VRAM + the model and prints what it chose and why. No more guessing a value that's too small (truncated answers) or too large (OOM at load). Explicit --ctx-size still overrides; hardware-capped at 16384 on RDNA4. See Configuration.
  • Gemma-4 native tool / function calling. The OpenAI tools API now works with Gemma-4's own native tool-call format (Qwen3/Hermes path unchanged). See Usage.
  • New: vf-clide — a lean standalone CLI chat client (its own crate, no engine dependencies): streaming REPL + headless, with visible markers for truncated/empty answers.

Inference output is unchanged from v0.7.0 (auto-ctx is allocation-time only, decode-neutral).
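
To make the automatic context sizing concrete, here is a rough sketch of how a KV context could be derived from free VRAM. The formula, the headroom parameter, and the function names are assumptions for illustration and are not taken from the engine; only the 16384 cap comes from the notes above. The per-token KV size uses the usual accounting of one K and one V vector per layer.

```rust
/// Hardware cap on RDNA4 stated in the release notes.
const HARDWARE_CTX_CAP: u64 = 16_384;

/// Bytes of KV cache needed per token: K and V, for every layer and KV head.
fn kv_bytes_per_token(layers: u64, kv_heads: u64, head_dim: u64, bytes_per_elem: u64) -> u64 {
    2 * layers * kv_heads * head_dim * bytes_per_elem
}

/// Largest context that fits after the model weights and a safety headroom
/// (hypothetical formula), clamped to the hardware cap.
fn auto_ctx_size(free_vram: u64, model_bytes: u64, headroom: u64, kv_per_token: u64) -> u64 {
    if kv_per_token == 0 {
        return 0;
    }
    // saturating_sub keeps the budget at 0 when the model does not fit at all.
    let budget = free_vram
        .saturating_sub(model_bytes)
        .saturating_sub(headroom);
    (budget / kv_per_token).min(HARDWARE_CTX_CAP)
}
```

For example, with 36 layers, 8 KV heads, a head dimension of 128, and 2-byte elements, each token costs 147,456 bytes of KV cache, so a 10 GiB budget would allow far more than the cap and the result is clamped to 16384.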

v0.7.0 — Prefill Parity

As of v0.7.0, prefill reaches parity with llama.cpp's Vulkan backend on dense models, and the Gemma-4 MoE prefill gap is largely closed — decode is unchanged. Measured same-run vs llama.cpp Vulkan (RX 9070 XT, RADV Mesa 26.1.2):

  • Dense prefill (Qwen3-8B / Llama-3.1-8B / Mistral-7B / DeepSeek-R1-8B) @p2048: 0.96–1.04× llama (parity — Mistral ahead).
  • Gemma-4-26B-A4B MoE prefill @p2048: Q3_K_M 0.89× · QAT-Q4_0 0.83×.
  • Decode: 0.87–0.97× llama (unchanged).

Full table + conditions on Benchmarks.


License

GPL-3.0. VulkanForge builds on the foundational work of oldnordic/ROCmForge (model loader, GGUF parser, CPU path, overall architecture). See Architecture for full attribution.
