maeddesg edited this page Jun 16, 2026 · 7 revisions

Memory

Server-side, persistent, project-scoped semantic memory — shipped in v1.0, opt-in (off by default); client access & curation in v1.0.2.

What this memory is — and what it isn't

VulkanForge's memory is a deliberate notebook, not a recording device.

It exists so that a decision you made three weeks ago, a benchmark you ran last sprint, or a bug you finally cornered doesn't evaporate when the session ends or the model is swapped out. You write to it on purpose — remember — and you read from it by meaning — recall. What comes back is what was put in, surfaced because it is relevant to what you asked, not because something guessed you might want it.

The full design this builds toward (sleep consolidation, the six layers, watermark-based trust) is described on Memory Design.

A few things it deliberately is:

  • Yours, and your project's — not the model's. The store lives on your disk, in your process, on your hardware. Any model that connects reads the same memory; swap one model for another and the record is unchanged. The model is the lens; the memory is what's written down. When a model's context window closes, this doesn't.
  • Local and single-user, all the way down. No cloud, no telemetry, nothing leaving your machine. The embeddings are computed on your own CPU; the vectors and the graph sit in one SQLite file. That's the whole surface.
  • Scoped by construction. Each coding project gets its own index, and recall in one project physically cannot return another's notes — the partitions never touch. The general, non-project things live in a shared global scope. Isolation isn't a filter that has to be applied correctly; it's the shape of the thing.
  • Curated, not hoarded. It holds what's worth keeping — decisions, learnings, benchmarks, bugs — written as deliberate notes. Quality is something you tend, not something assumed.

And, just as deliberately, a few things it is not:

  • Not a transcript. It does not sit in the background recording everything you say. There is no conversation log, no ambient capture, no surveillance. Silence is the default; you choose what to keep.
  • Not auto-injected. Recall is something you (or the agent) invoke, and its results are visible. The memory never quietly stuffs your context behind your back — control over what comes back stays with you.
  • Not an oracle. It stores what was written. A stale or wrong note stays wrong until someone curates it. It will remember your mistakes as faithfully as your insights — that's a memory's honesty, not a flaw to paper over.
  • Not magic. Underneath it's vector search over embeddings and a SQLite graph — understood, inspectable mechanics, not a black box. If a recall surprises you, you can trace why.
  • Not a stand-in for thinking. It surfaces relevant past context so you don't have to hold everything in your head. It doesn't decide for you, and it isn't a substitute for judgment — yours or the agent's.

The short version: it's the project's long-term memory — kept on purpose, owned by you, and honest about being a tool rather than a mind.

Enabling memory

Memory is opt-in and off by default — fitting for a deliberate notebook: you switch it on, it doesn't switch itself on. There are two gates.

1. Build it in (compile-time). The subsystem and its two native deps (SQLiteGraph + the ONNX embedder) sit behind a Cargo feature, so the default build stays lean and free of them:

cargo build --release                    # lean default: no memory, ~25 MB binary
cargo build --release --features memory  # memory compiled in, ~58 MB binary

The memory build needs a newer toolchain — Rust 1.89+ (the sqlitegraph dep is edition-2024 and declares rust-version = 1.89; ort declares 1.88). The lean default still builds on Rust 1.85+. See Installation.

2. Turn it on (runtime). Even a memory-enabled binary starts with memory off. Activate it per run with the flag or the env alias:

vulkanforge serve --model ~/models/Qwen3-8B-Q4_K_M.gguf --memory
# or:
VULKANFORGE_MEMORY=1 vulkanforge serve --model ~/models/Qwen3-8B-Q4_K_M.gguf

Without --memory, /memory/* returns 503 and the server runs inference only: no embedder is loaded and no database is opened, so an inference-only run carries none of the memory subsystem's overhead. Pass --memory to a lean binary (built without the feature) and it stops immediately, before the model loads, with a clear message telling you to rebuild with --features memory.
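The two gates can be sketched in Rust as follows. This is a minimal illustration under stated assumptions, not VulkanForge's actual startup code. The module memory_backend, the MemoryMode type, and the functions below are hypothetical. Treating only VULKANFORGE_MEMORY=1 as "on" is an assumption based on the example above.

```rust
use std::env;

// Compile-time gate (hypothetical module). With --features memory the real
// subsystem and its native deps would live here; without it only the stub
// below is compiled.
#[cfg(feature = "memory")]
mod memory_backend {
    pub const COMPILED_IN: bool = true;
}
#[cfg(not(feature = "memory"))]
mod memory_backend {
    pub const COMPILED_IN: bool = false;
}

#[derive(Debug, PartialEq)]
pub enum MemoryMode {
    /// No embedder loaded, no database opened; /memory/* answers 503.
    Off,
    /// Open the store and load the embedder.
    On,
}

/// Runtime gate: the --memory flag or the env alias.
/// Assumption: the alias counts as "on" only when set to exactly "1".
pub fn memory_requested(flag: bool, env_value: Option<&str>) -> bool {
    flag || env_value == Some("1")
}

/// Combines both gates. A lean binary asked for memory fails early,
/// before any model is loaded. The message wording is hypothetical.
pub fn resolve_memory_mode(requested: bool, compiled_in: bool) -> Result<MemoryMode, String> {
    match (requested, compiled_in) {
        (false, _) => Ok(MemoryMode::Off),
        (true, true) => Ok(MemoryMode::On),
        (true, false) => Err(
            "memory is not compiled into this binary; rebuild with --features memory".to_string(),
        ),
    }
}

/// Startup entry point: reads the real environment plus the compile-time gate.
pub fn memory_mode_at_startup(flag: bool) -> Result<MemoryMode, String> {
    let env_value = env::var("VULKANFORGE_MEMORY").ok();
    resolve_memory_mode(
        memory_requested(flag, env_value.as_deref()),
        memory_backend::COMPILED_IN,
    )
}
```

Resolving both gates before anything heavy happens is what lets a lean binary fail fast instead of after loading the model.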

Where it lives. One SQLite file at ~/.vulkanforge/memory.db (override with VF_MEMORY_DB), with the embedding model cached in the sibling ~/.vulkanforge/embed-cache/. The first --memory start downloads the Nomic model there once (needs network that one time); every start after is offline. The embedder runs on the CPU — it never touches VRAM (see Hardware and Compatibility). Flags are summarized in Configuration.
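A sketch of the path resolution described above. It assumes that an unset or empty VF_MEMORY_DB falls back to the default, and that the embedding cache stays under ~/.vulkanforge even when the database is overridden; the text doesn't say either way. The home directory is passed in rather than looked up, so the logic stays testable. Both function names are hypothetical.

```rust
use std::path::{Path, PathBuf};

/// Database location: VF_MEMORY_DB wins when set and non-empty (assumption
/// for the empty case), otherwise ~/.vulkanforge/memory.db.
pub fn memory_db_path(vf_memory_db: Option<&str>, home: &Path) -> PathBuf {
    match vf_memory_db {
        Some(p) if !p.is_empty() => PathBuf::from(p),
        _ => home.join(".vulkanforge").join("memory.db"),
    }
}

/// Embedding-model cache, filled once on the first --memory start.
/// Assumption: it does not move when VF_MEMORY_DB is overridden.
pub fn embed_cache_dir(home: &Path) -> PathBuf {
    home.join(".vulkanforge").join("embed-cache")
}
```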

How it works

The memory subsystem is embedded in the vulkanforge serve process (opt-in — see Enabling memory above), with no separate daemon. When activated it exposes VF-native endpoints under /memory/* (a namespace separate from the OpenAI-compatible /v1/*).

Endpoints

POST /memory/remember
  { "project_key": "vf", "kind": "Learning",
    "text": "Dispatch reduction doesn't help on gfx1201",
    "name": "optional label", "metadata": { ... } }
  → { "id": 1, "deduped": false }

POST /memory/recall
  { "project_key": "vf", "query": "do fewer barriers help performance?", "k": 3 }
  → { "hits": [ { "id": 1, "kind": "Learning", "name": "...", "text": "...",
                  "status": "active", "score": 0.72 } ] }

POST /memory/projects   { "project_key": "vf", "name": "optional" }   → { "id", "project_key" }
GET  /memory/projects                                                 → { "projects": [ ... ] }

POST /memory/archive    { "project_key": "vf", "id": 1 }   → { "id", "status": "archived" }
POST /memory/delete     { "project_key": "vf", "id": 1 }   → { "id", "deleted": true }

project_key is optional everywhere — omit it and the call uses a shared global scope (__global__). score is cosine similarity in [0,1] (higher = closer).
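A client-side sketch of the two rules above: the global-scope fallback and the shape of a /memory/recall body. It builds the JSON by hand using only the standard library, so it carries its own minimal string escaper; a real client would more likely use a JSON library. GLOBAL_SCOPE, effective_scope, json_string, and recall_body are hypothetical names.

```rust
/// Scope a call lands in when project_key is omitted.
pub const GLOBAL_SCOPE: &str = "__global__";

/// The index a call targets: the given project_key, or the shared global scope.
pub fn effective_scope(project_key: Option<&str>) -> &str {
    project_key.unwrap_or(GLOBAL_SCOPE)
}

/// Minimal JSON string encoder: quotes plus escapes for `"`, `\` and control chars.
fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Body for POST /memory/recall. Leaving project_key out lets the
/// server fall back to the global scope.
pub fn recall_body(project_key: Option<&str>, query: &str, k: u32) -> String {
    let mut fields = Vec::new();
    if let Some(key) = project_key {
        fields.push(format!("\"project_key\": {}", json_string(key)));
    }
    fields.push(format!("\"query\": {}", json_string(query)));
    fields.push(format!("\"k\": {}", k));
    format!("{{ {} }}", fields.join(", "))
}
```

With Some("vf"), the query from the listing, and k = 3, recall_body produces the same body as the /memory/recall example; any HTTP client can then POST it.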

Curation (v1.0.2). remember de-duplicates a near-identical note instead of storing it twice — it returns the existing id with "deduped": true. /memory/archive drops a note from recall but keeps the node and its trace; /memory/delete removes it outright (node + vector). Both are user-driven — surfaced in vf-clide as /archive and /forget; the agent never calls them.
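The curation semantics can be modelled with a small in-memory store. This is a sketch only. The source says near-identical notes are de-duplicated but not how, so the cosine-similarity threshold here is an assumption, as is de-duplicating against archived notes too. Embeddings are passed in directly instead of being computed, ranking is left out, and one store stands for one project. CurationSketch and its methods are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Status {
    Active,
    Archived,
}

struct Note {
    id: u64,
    vector: Vec<f32>,
    status: Status,
}

pub struct CurationSketch {
    notes: Vec<Note>,
    next_id: u64,
    dedup_threshold: f32, // assumed similarity cut-off for "near-identical"
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

impl CurationSketch {
    pub fn new(dedup_threshold: f32) -> Self {
        Self { notes: Vec::new(), next_id: 1, dedup_threshold }
    }

    /// remember: returns (id, deduped). A near-duplicate yields the existing id.
    pub fn remember(&mut self, vector: Vec<f32>) -> (u64, bool) {
        if let Some(existing) = self
            .notes
            .iter()
            .find(|n| cosine(&n.vector, &vector) >= self.dedup_threshold)
        {
            return (existing.id, true);
        }
        let id = self.next_id;
        self.next_id += 1;
        self.notes.push(Note { id, vector, status: Status::Active });
        (id, false)
    }

    /// archive: hide from recall, keep the note itself.
    pub fn archive(&mut self, id: u64) -> bool {
        match self.notes.iter_mut().find(|n| n.id == id) {
            Some(n) => {
                n.status = Status::Archived;
                true
            }
            None => false,
        }
    }

    /// delete: remove the note and its vector outright.
    pub fn delete(&mut self, id: u64) -> bool {
        let before = self.notes.len();
        self.notes.retain(|n| n.id != id);
        self.notes.len() != before
    }

    /// Ids recall may return: active notes only, in insertion order.
    pub fn recallable(&self) -> Vec<u64> {
        self.notes.iter().filter(|n| n.status == Status::Active).map(|n| n.id).collect()
    }

    pub fn status(&self, id: u64) -> Option<Status> {
        self.notes.iter().find(|n| n.id == id).map(|n| n.status)
    }
}
```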

Under the hood

  • One store. SQLiteGraph (3.2.5, GPL-3.0) holds the nodes (projects + content), the CONTAINS edges, and the per-project HNSW vector indexes — all in one SQLite file (default ~/.vulkanforge/memory.db, override with VF_MEMORY_DB).
  • CPU embedder. fastembed (5.16.2, ONNX Runtime) runs Nomic-Embed v1.5-Q (768-dim, INT8-quantized, using AVX-512/VNNI on Zen4). The task prefix the model needs (search_document: for writes, search_query: for reads) is applied for you. Embedding runs on the CPU and never takes the GPU permit, so a recall never waits behind a generation.
  • Per-project index. Each project_key gets its own persistent HNSW index (768-dim, cosine, m=16, ef_construction=200). A recall searches only that index, so project isolation is structural, not a filter.
  • Persistence. The store survives restarts: on reopen, the vectors are restored from the SQLite store with no re-embedding. On shutdown it flushes the HNSW topology and checkpoints the WAL.
  • First start. The first start downloads the Nomic ONNX model from HuggingFace into ~/.vulkanforge/embed-cache; after that it runs offline. The two native deps (static ONNX Runtime + bundled SQLite) add ~34 MB to the vulkanforge binary.
  • Honest mechanics. The HNSW index assigns its own vector ids, so the graph node id is carried in the vector's metadata and recovered on recall (search → get_vector → node_id → get_entity); project lookups use plain SQL on the graph's graph_entities table. If the embedder can't be loaded, /memory/* returns 503 and the inference server still runs.
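The retrieval path above can be illustrated with a toy per-project index. Two substitutions are deliberate: a brute-force cosine scan stands in for HNSW, and a vector's id is simply its position in the project's list. The prefix strings follow the Nomic convention of a space before the text, which is an assumption about exact formatting. ProjectIndexes and its methods are hypothetical.

```rust
use std::collections::HashMap;

/// Task prefixes: documents on write, queries on read.
pub fn document_input(text: &str) -> String {
    format!("search_document: {text}")
}
pub fn query_input(text: &str) -> String {
    format!("search_query: {text}")
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// A stored vector carries its graph node id as metadata.
struct StoredVector {
    vector: Vec<f32>,
    node_id: u64,
}

/// One index per project_key; a search only ever touches one partition.
#[derive(Default)]
pub struct ProjectIndexes {
    indexes: HashMap<String, Vec<StoredVector>>,
}

impl ProjectIndexes {
    /// Inserts into the project's own index and returns the index-assigned vector id.
    pub fn insert(&mut self, project_key: &str, vector: Vec<f32>, node_id: u64) -> usize {
        let idx = self.indexes.entry(project_key.to_string()).or_default();
        idx.push(StoredVector { vector, node_id });
        idx.len() - 1
    }

    /// Returns (node_id, score), best first. An unknown project yields nothing.
    pub fn search(&self, project_key: &str, query: &[f32], k: usize) -> Vec<(u64, f32)> {
        let Some(idx) = self.indexes.get(project_key) else {
            return Vec::new();
        };
        let mut hits: Vec<(usize, f32)> = idx
            .iter()
            .enumerate()
            .map(|(vid, sv)| (vid, cosine(&sv.vector, query)))
            .collect();
        hits.sort_by(|a, b| b.1.total_cmp(&a.1));
        hits.truncate(k);
        // vector id -> metadata -> graph node id
        hits.into_iter().map(|(vid, score)| (idx[vid].node_id, score)).collect()
    }
}
```

Because each project_key owns a separate map entry, a search in "vf" cannot return a vector inserted under another key. Isolation comes from the data layout, not from a filter.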

The mechanics above are the foundation; the fuller design they're heading toward — six memory layers, sleep-style consolidation, watermark-based trust — is laid out on Memory Design.

Today vs. roadmap

v1.0.2 reads, writes, and curates, and the client can now reach it. Since v1.0.1 the store has supported writes and reads; v1.0.2 adds curation and the vf-clide client layer. Shipped on top of that foundation:

  • Curation (v1.0.2). Notes are no longer write-only — a near-duplicate remember is de-duplicated on write, /archive <id> drops a note from recall while keeping the trace, and /forget <id> removes it outright. Curation is a user action: the agent can point you to /forget <id> but never deletes on its own.
  • Client integration (v1.0.2). vf-clide exposes /project, /recall, and /remember REPL commands and, in the agent loop, recall and remember tools. They run on their own axis: they call the server's /memory/* directly and touch neither files nor the shell. So they stay visible (every call prints a marker) but sit outside the file/shell permission ceiling, available whenever the server has memory on.
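How the REPL commands might map onto the endpoints, as a sketch. The argument syntax for /project, /recall, and /remember is assumed; the text only gives /archive <id> and /forget <id>. /project is treated as a client-side switch of the active project_key, and all names are hypothetical. The agent_may_call check encodes the rule that curation stays a user action.

```rust
#[derive(Debug, PartialEq)]
pub enum MemoryCommand {
    Project(String),
    Recall(String),
    Remember(String),
    Archive(u64),
    Forget(u64),
}

/// Parses one REPL line such as "/recall <query>" or "/forget <id>".
pub fn parse_command(line: &str) -> Result<MemoryCommand, String> {
    let line = line.trim();
    let (cmd, rest) = match line.split_once(char::is_whitespace) {
        Some((c, r)) => (c, r.trim()),
        None => (line, ""),
    };
    let need_text = |name: &str| -> Result<String, String> {
        if rest.is_empty() {
            Err(format!("{name} needs an argument"))
        } else {
            Ok(rest.to_string())
        }
    };
    let need_id = |name: &str| -> Result<u64, String> {
        rest.parse::<u64>().map_err(|_| format!("{name} needs a numeric note id"))
    };
    match cmd {
        "/project" => need_text(cmd).map(MemoryCommand::Project),
        "/recall" => need_text(cmd).map(MemoryCommand::Recall),
        "/remember" => need_text(cmd).map(MemoryCommand::Remember),
        "/archive" => need_id(cmd).map(MemoryCommand::Archive),
        "/forget" => need_id(cmd).map(MemoryCommand::Forget),
        other => Err(format!("unknown command: {other}")),
    }
}

/// Server endpoint a command calls; None for the client-side /project switch.
pub fn endpoint(cmd: &MemoryCommand) -> Option<&'static str> {
    match cmd {
        MemoryCommand::Project(_) => None,
        MemoryCommand::Recall(_) => Some("/memory/recall"),
        MemoryCommand::Remember(_) => Some("/memory/remember"),
        MemoryCommand::Archive(_) => Some("/memory/archive"),
        MemoryCommand::Forget(_) => Some("/memory/delete"),
    }
}

/// The agent loop only gets recall and remember; archive and forget stay user-driven.
pub fn agent_may_call(cmd: &MemoryCommand) -> bool {
    matches!(cmd, MemoryCommand::Recall(_) | MemoryCommand::Remember(_))
}
```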

What's deliberately still not here:

  • Lifecycle — notes are written active and stay there. No draft → confirmed → active → stale → archived transitions, no touch-on-read aging.
  • Flat graph — only Project and CONTAINS; no SUPERSEDES / CONTRADICTS / DERIVES_FROM edges yet.
  • No auto-injection — by design, not a gap. recall is always an explicit call; the memory never loads itself into a prompt behind your back. This one stays.
