Graph-Based Memory Layer for Small Models #2858

Amir0234-afk · 2026-06-05T09:14:59Z

One thing I've noticed is that Odysseus memory is fundamentally flat. Memories are stored as independent pieces of information, and retrieval is largely based on similarity search.

This works reasonably well with larger models because they can infer relationships between retrieved memories. Smaller local models often struggle. They either retrieve too much context and get distracted, or retrieve too little and miss important connections.

What if memory were stored as a graph instead?

Instead of only storing memory chunks, the system could extract entities and relationships and maintain them as a lightweight knowledge graph.

Example:

User → works on → Project A
Project A → uses → Python
User → interested in → Cybersecurity
Project A → related to → Cybersecurity

At query time, retrieval would work differently:

1. Find relevant seed nodes using embeddings or keyword search.
2. Traverse N hops through connected nodes.
3. Return only the relevant subgraph as context.
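
A minimal sketch of the traversal step, assuming the graph is held as plain (subject, relation, object) triples and that seed nodes have already been chosen by an embedding or keyword search. The names `build_adjacency` and `k_hop_subgraph` are hypothetical, and edges are followed in both directions, which is an assumption rather than something stated above.

```python
from collections import defaultdict, deque

# Triples taken from the example above: (subject, relation, object).
TRIPLES = [
    ("User", "works on", "Project A"),
    ("Project A", "uses", "Python"),
    ("User", "interested in", "Cybersecurity"),
    ("Project A", "related to", "Cybersecurity"),
]

def build_adjacency(triples):
    """Index triples by both endpoints so traversal can follow edges either way."""
    adjacency = defaultdict(list)
    for subject, relation, obj in triples:
        adjacency[subject].append((subject, relation, obj))
        adjacency[obj].append((subject, relation, obj))
    return adjacency

def k_hop_subgraph(triples, seeds, max_hops):
    """Breadth-first walk from the seeds, collecting edges at most max_hops away."""
    adjacency = build_adjacency(triples)
    visited = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    collected = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop budget
        for triple in adjacency[node]:
            if triple not in collected:
                collected.append(triple)
            subject, _, obj = triple
            neighbour = obj if subject == node else subject
            if neighbour not in visited:
                visited.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return collected

# One hop from "Python" only reaches the edge that mentions it.
context = k_hop_subgraph(TRIPLES, seeds=["Python"], max_hops=1)
```

The payload size depends on the hop budget and the local degree of the seed nodes, not on how many triples the store holds in total.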

The benefit is that retrieval becomes bounded by graph structure rather than conversation history size. A user with three years of memory could potentially return a context payload similar in size to a user with three days of memory.

For ingestion, conversation turns could be converted into graph operations:

New fact → create node/edge
Updated fact → modify relationship
Contradiction → reduce confidence or supersede previous relationship
Forgotten/stale information → decay over time

To avoid unbounded growth, nodes and edges could have recency, frequency, and confidence scores. Rarely accessed information would gradually decay while frequently referenced information remains active.
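
One way the three scores could combine into a single pruning signal is sketched below. The formula, the 30-day half-life, and the 0.2 threshold are all illustrative assumptions, and `EdgeStats`, `activity_score`, and `prune` are hypothetical names.

```python
import math
from dataclasses import dataclass

@dataclass
class EdgeStats:
    last_access_day: float  # day index of the most recent access
    access_count: int       # how often the edge has been retrieved
    confidence: float       # 0.0 to 1.0, lowered when a fact is contradicted

def activity_score(stats, today, half_life_days=30.0):
    """Combine recency, frequency, and confidence (hypothetical formula)."""
    age = max(0.0, today - stats.last_access_day)
    recency = 0.5 ** (age / half_life_days)     # halves every half_life_days
    frequency = math.log1p(stats.access_count)  # diminishing returns on repeats
    return stats.confidence * recency * (1.0 + frequency)

def prune(edges, today, threshold=0.2):
    """Keep only the edges whose score is still at or above the threshold."""
    return {
        name: stats
        for name, stats in edges.items()
        if activity_score(stats, today) >= threshold
    }
```

With these assumed numbers, an edge that has not been touched for 100 days and was never retrieved falls below the threshold, while an equally old edge that was retrieved 50 times stays active.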

My interest is primarily in helping smaller local models. The goal is not to make memory "smarter", but to move part of the reasoning workload from inference time into storage time so that 7B-12B models can retrieve more useful context with fewer tokens.

I'm considering building this as a standalone Python library that could integrate with Odysseus but also work independently:

SQLite for local deployments
PostgreSQL + pgvector for larger deployments
Neo4j as an optional backend
Swappable embedding providers
Fully local by default

The main question I have for the community:

Would a lightweight extraction step (entity/relationship extraction per conversation turn) be acceptable overhead for self-hosted users, or would that cost outweigh the retrieval benefits for the smaller-model use case this is trying to solve?

Michiel-VandeVelde · 2026-06-05T09:24:09Z

Collaborator

As someone who works at imec/IDLab and specifically researches AI, knowledge graphs, and memory/retrieval systems: conceptually, absolutely yes. This is exactly the kind of direction that could make smaller local models much more useful.

Flat vector memory works, but it has obvious limits. Once memories grow large, retrieval becomes noisy, and small models often struggle to infer the missing relationships from loosely related chunks. A graph layer can move part of that reasoning burden into the storage/retrieval system instead of forcing the model to reconstruct everything at inference time.

So from a research perspective, I think the idea is very good.

As a maintainer though, I’d be careful about adding something this large directly into Odysseus core too quickly. A graph-memory system touches a lot of surface area. So while I'm very fond of the idea, I would keep this as future work for now, until the project has settled more :)

3 replies

Michiel-VandeVelde Jun 5, 2026
Collaborator

Do feel free to @ me in anything you create on this issue!

Amir0234-afk Jun 5, 2026
Author

Thanks for the feedback.

That makes sense. This is still a very early idea on my side, and I'm currently working through the architecture and trade-offs before building anything substantial.

My plan is to develop it as a standalone library first, experiment with different approaches, and see whether it actually delivers meaningful gains for smaller local models. If it proves useful and the project reaches a point where a graph-based memory layer makes sense, it could be explored as an integration later on rather than something added directly to Odysseus today.

For now, I'm mostly interested in validating the concept and learning where the practical bottlenecks are.

CoolJohn-lab Jun 6, 2026

Concept tracks logically. Perfect fit for this project. Has the capability to totally transform the usefulness for smaller models and larger models alike.

Implementation will be difficult. I would start by forking Oddy and making your own version where you implement this and work through the implementation challenges.

WGlynn · 2026-06-06T22:35:24Z

The flat-vector noise problem is real, and you've identified the right shape of solution. One data point from production: I've been running a graph-structured memory layer on top of Claude (small relative to GPT-4-class but not 7B-small) for about a year, and the graph-vs-flat-vector tradeoffs you describe match what we measured.

Three specific moves that helped beyond the obvious "add edges":

Entity-graph cross-reference at the write boundary, not just the read boundary. We run a hook that fires on every memory-write, extracts named entities, greps existing memory for matches, and surfaces the matches as additional context. The graph builds incrementally on writes without requiring offline ingestion passes. For self-hosted users on a 7B model the per-write cost is bounded by the local regex / embedding pass and stays well under the retrieval-time benefit.

Hierarchical sub-index structure beats flat retrieval at small-model scale. We split memory into a small always-loaded core (kept under ~200 lines) and several situation-matched sub-indexes that are auto-loaded based on the current task. The core indexes the indexes. Retrieval seeds from the core, then expands into the relevant sub-indexes via the entity matches. Small models read substantially less context per retrieval because the routing layer is tiny and the expansion is targeted.

Semantic and graph compose, they don't replace each other. We keep both the vector similarity (for "things I haven't connected explicitly") and the entity-graph (for "things I know are related"). Retrieval seeds from one and expands through the other. Pure-graph misses unstated similarity; pure-vector loses the explicit structure. The composition is what makes it tractable on a small model.

One honest note on compression: we experimented with a denser logical-operator notation (replacing prose with implication / and / or / not symbols) to fit more memory into context windows. Empirically the token count went up rather than down on most tokenizers, so it didn't pay off as a context-saving mechanism. It did help with cell reusability and acted as a forcing function for rigor, but I'd skip it if context-efficiency is the goal.

Worth seconding Michiel's standalone-library recommendation strongly. We have a clean separation between the substrate and the application that's let us evolve each independently. If you're building this for Odysseus, shipping it as a Python library with a stable interface (write_event, query, traverse) and letting integrators wire it in pays off the moment the second downstream consumer appears.
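
A rough sketch of what such a stable interface might look like as a Python `Protocol`. Only the three method names come from the comment above. The parameter names, defaults, and return types are assumptions made for illustration.

```python
from typing import Protocol, runtime_checkable

# A relationship as (subject, relation, object); the shape is an assumption.
Triple = tuple[str, str, str]

@runtime_checkable
class GraphMemory(Protocol):
    """Hypothetical contract an integrator would code against."""

    def write_event(self, text: str) -> list[Triple]:
        """Ingest one conversation turn and return the triples it produced."""
        ...

    def query(self, text: str, limit: int = 5) -> list[str]:
        """Return the names of seed nodes relevant to the text."""
        ...

    def traverse(self, seeds: list[str], max_hops: int = 2) -> list[Triple]:
        """Return the bounded subgraph around the seed nodes."""
        ...
```

Any backend (SQLite, PostgreSQL, Neo4j) that provides these three methods would satisfy the protocol structurally, so integrators would not need to import a shared base class.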

Happy to share specifics on the cross-reference-hook implementation or the hierarchical sub-index structure if useful. We've been running it long enough to have honest data on the tradeoffs you're asking about. Extraction overhead is real but bounded; the noise reduction at small-model scale is the dominant factor by a wide margin.

2 replies

Amir0234-afk Jun 9, 2026
Author

Thanks for the detailed feedback. It's extremely useful hearing from people who have actually run graph-based memory systems in production rather than only discussing the concept academically.

One thing that stood out to me from your comment is the idea that semantic retrieval and graph retrieval should compose rather than compete. My initial thinking was already leaning toward a hybrid approach where embeddings identify seed nodes and graph traversal expands context, but your experience suggests the interaction between those layers is probably more important than the graph itself.

The write-boundary cross-referencing is also particularly interesting. I had been focusing mostly on retrieval architecture, but shifting part of the work into ingestion seems aligned with my original goal of reducing the reasoning burden on smaller local models.

I'm currently sketching a standalone Python library around these ideas rather than an Odysseus integration. The rough architecture is:

Conversation → entity/relation extraction
Explicit graph CRUD operations
Embedding-based seed retrieval
Graph traversal for bounded context expansion
Backend-agnostic storage (SQLite first, PostgreSQL/Neo4j later)

One thing I'm still trying to understand is where the biggest practical bottlenecks appear in real deployments.

You offered to share specifics on the cross-reference hook and the hierarchical sub-index structure — I'd genuinely like to hear more about both. On the cross-reference side, I'm especially curious how you're handling entity resolution at write time. Are you matching against existing nodes primarily through embedding similarity, string matching, graph heuristics, or some combination of approaches?

And regarding the hierarchical sub-index structure, I'm trying to understand what determines which sub-index gets loaded. Is that selection made by a model at query time, or is it driven by a deterministic classifier or routing mechanism?

Those implementation details sound very close to some of the problems I'm currently thinking through, particularly around bounded retrieval and maintaining graph quality as memory grows over time.

Appreciate the insights.

WGlynn Jun 9, 2026

For the cross-reference hook, write-time matching is string-based against a maintained entity registry. The hook fires PreToolUse on every Write/Edit, runs a regex pass over the new content for known entity tokens (people, projects, repos), and surfaces the existing primitives that mention that entity. False positives on common tokens (the word "Will" matching the verb vs the name) are handled by a negation-window check around the match. Embedding similarity at the write boundary is not in the current implementation; it is a candidate I have not shipped because adding it doubles per-write latency without obvious gain over the read-side pass below.
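
A simplified sketch of a registry-driven regex pass of this kind. It is not the implementation described above: the registry contents, the alias handling, and the function names are hypothetical, and the negation-window check for ambiguous tokens is left out.

```python
import re

# Hypothetical registry: canonical entity name -> known aliases.
ENTITY_REGISTRY = {
    "Project A": ["Project A", "proj-a"],
    "Odysseus": ["Odysseus"],
}

def compile_registry(registry):
    """Build one case-insensitive, word-bounded pattern per canonical entity."""
    compiled = {}
    for canonical, aliases in registry.items():
        # Longest alias first so longer names win over shorter overlapping ones.
        ordered = sorted(aliases, key=len, reverse=True)
        alternatives = "|".join(re.escape(alias) for alias in ordered)
        compiled[canonical] = re.compile(
            rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE
        )
    return compiled

def cross_reference(new_text, existing_memories, compiled):
    """For each entity found in new_text, list the stored memories mentioning it."""
    matches = {}
    for canonical, pattern in compiled.items():
        if pattern.search(new_text):
            matches[canonical] = [
                memory for memory in existing_memories if pattern.search(memory)
            ]
    return matches
```

Because both the new content and the stored memories are scanned with the same compiled patterns, the per-write cost grows with the registry size and memory count, with no model call involved.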

Semantic retrieval does run, but on the read side rather than the write side. A cosine-similarity pass fires on every user prompt and surfaces primitives above a threshold. That covers the unanticipated-overlap case from the opposite direction: at query time rather than at ingestion.

For sub-index selection, it is a hybrid. A small deterministic router fires at SessionStart and maps explicit situation tags (file-path, project, work-type) to sub-indexes. The read-side semantic pass catches anything the deterministic router missed. Pure model-judged routing was too brittle at small scale.
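
The hybrid selection could be sketched roughly as follows, assuming situation tags arrive as a dictionary and the read-side semantic pass supplies extra sub-index names. The routing table, the sub-index names, and the `route` function are hypothetical.

```python
from fnmatch import fnmatch

# Hypothetical routing table: (tag kind, glob pattern, sub-index name).
ROUTES = [
    ("file-path", "*/tests/*", "testing-notes"),
    ("project", "odysseus", "odysseus-core"),
    ("work-type", "review", "review-checklists"),
]

def route(tags, semantic_hits=()):
    """Select sub-indexes from explicit tags first, then add semantic fallbacks."""
    selected = []
    for kind, pattern, sub_index in ROUTES:
        value = tags.get(kind)
        if value is not None and fnmatch(value, pattern) and sub_index not in selected:
            selected.append(sub_index)
    # Anything the deterministic rules missed can still be added by the semantic pass.
    for sub_index in semantic_hits:
        if sub_index not in selected:
            selected.append(sub_index)
    return selected
```

The deterministic part is a table lookup, so it behaves the same on every session; only the fallback list depends on similarity scores.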

Biggest practical bottleneck has been entity-registry maintenance. Auto-deriving the registry from observed write patterns is the next move I have not shipped.

alvaroperricone · 2026-06-07T15:43:03Z

I hit the same problem and built a small implementation along these lines, in case it's a useful reference.

It's an owner-scoped, bi-temporal knowledge-graph store that augments (not replaces) the flat memory store, designed so a small local model can't corrupt it:

Closed ontology + entity resolution in code (not model-judged), so the model can't invent nodes/edges.
Non-lossy supersession: a new value for the same (subject, predicate) retires the old one via event-time + transaction-time, so nothing is deleted and "X then, Y now" stays auditable.
Salience decay (45-day half-life); retrieval is a bounded k-hop graph walk — no graph DB, no LLM at query time.
Plain SQLite, no new infra. Store-only (extraction kept separate), ~700 LOC + 21 tests, no API keys.
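
For readers who have not used bi-temporal storage, here is a minimal sketch of non-lossy supersession in plain SQLite. It is not taken from the reference branch: the table layout, column names, and function names are assumptions, and only the 45-day half-life comes from the list above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE facts (
    subject     TEXT NOT NULL,
    predicate   TEXT NOT NULL,
    object      TEXT NOT NULL,
    event_time  TEXT NOT NULL,   -- when the fact became true in the world
    recorded_at TEXT NOT NULL,   -- when the store learned it (transaction time)
    retired_at  TEXT             -- NULL while the row is the current value
);
"""

def assert_fact(conn, subject, predicate, obj, event_time, recorded_at):
    """Retire the current value for (subject, predicate), then append the new one."""
    conn.execute(
        "UPDATE facts SET retired_at = ? "
        "WHERE subject = ? AND predicate = ? AND retired_at IS NULL",
        (recorded_at, subject, predicate),
    )
    conn.execute(
        "INSERT INTO facts VALUES (?, ?, ?, ?, ?, NULL)",
        (subject, predicate, obj, event_time, recorded_at),
    )

def current_value(conn, subject, predicate):
    row = conn.execute(
        "SELECT object FROM facts "
        "WHERE subject = ? AND predicate = ? AND retired_at IS NULL",
        (subject, predicate),
    ).fetchone()
    return row[0] if row else None

def value_as_of(conn, subject, predicate, known_at):
    """Return what the store believed at transaction time known_at."""
    row = conn.execute(
        "SELECT object FROM facts WHERE subject = ? AND predicate = ? "
        "AND recorded_at <= ? AND (retired_at IS NULL OR retired_at > ?) "
        "ORDER BY recorded_at DESC LIMIT 1",
        (subject, predicate, known_at, known_at),
    ).fetchone()
    return row[0] if row else None

def salience(initial, age_days, half_life_days=45.0):
    """Exponential decay: the value halves every half_life_days."""
    return initial * 0.5 ** (age_days / half_life_days)

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
assert_fact(conn, "User", "works_on", "Project A", "2026-01-10", "2026-01-12")
assert_fact(conn, "User", "works_on", "Project B", "2026-05-01", "2026-05-03")
```

After the second write, the current value is "Project B", an as-of query for early February still returns "Project A", and both rows remain in the table. ISO-8601 date strings are used so that plain string comparison orders them correctly.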

Reference branch: alvaroperricone@e0fc50e

Sharing it as a standalone starting point rather than a merge request — agree with keeping this as future/standalone work for now. (Different from #1893, which targets in-session context; this is durable user facts + persona.)

1 reply

Amir0234-afk Jun 9, 2026
Author

Thanks for sharing the implementation details and the reference branch.

The bi-temporal approach is particularly interesting. My original thought was closer to CRUD semantics where facts are updated as conversations evolve, but preserving historical state while still presenting a current view may be the better design. The bi-temporal supersession aspect in particular is making me reconsider the DELETE semantics I had sketched for contradicted facts. Keeping the retired value with both event-time and transaction-time is strictly more information, and the auditability argument is compelling. The question for me is whether that adds meaningful complexity for a v1, or whether it's the kind of thing that's much harder to retrofit later once the data model is established.

I also found the closed-ontology versus open-ontology tradeoff interesting. My current thinking leans toward an open-world design because I'd like the system to work across arbitrary domains, but I can definitely see the appeal of a constrained ontology when the priority is reliability and preventing graph corruption from extraction errors.

The fact that you were able to build a bounded k-hop retrieval system on top of SQLite without requiring a dedicated graph database is encouraging as well. One of my goals is keeping the local deployment story simple.

I'll definitely take a look at the branch you linked.

WGlynn · 2026-06-09T15:33:09Z

Sharp design call on the closed-ontology side. We bet the other way (model-judged entity extraction at the write boundary), and the divergence is fair: your scope is bounded user facts plus persona, ours is an open-ended life/project graph where the schema keeps expanding. Closed makes sense when the schema is knowable; model-judged eats the hallucination risk in exchange for not needing schema updates.

Bi-temporal supersession is the move I want to steal. We currently lean on git history for "what did the agent know when X happened?", which is fine for forensics but unusable at query time. Event-time plus transaction-time in-band is a much cleaner answer, especially for the audit trail when a memory turns out wrong and you want to know which decisions were downstream of it.

Bounded k-hop with no LLM at query is also the right ceiling. Will look at e0fc50e.

1 reply

alvaroperricone Jun 10, 2026

For context, this started as a personal fork I wanted to run on a 5060 Ti with 8GB, so it’s all built around small local models. The closed ontology and code-side resolution are mostly there so a 7B can’t invent nodes or edges at the write boundary. Once you’re not stuck on a small model and the schema stops being knowable, I agree model-judged extraction is the better call.

The bi-temporal part is independent of that, so feel free to lift it: supersession and as-of don’t care whether an edge came from a regex or a model. The one place it touches your side is entity resolution, since it keys on (subject, predicate), so a wrong subject supersedes the wrong fact. Glad the no-LLM-at-query bit landed.

pewdiepie-archdaemon · 2026-06-09T23:18:31Z

Maintainer

Very cool idea, I'd love to help testing

2 replies

alvaroperricone Jun 9, 2026

Very cool idea, I'd love to help testing

If it’s useful to try, the version above is store-only, so the fastest way in is the test suite: python -m pytest tests/test_memory_graph.py on e0fc50e (21 tests, plain SQLite, no deps or keys). The parts most worth stressing are entity resolution on aliased names, supersession on functional predicates (one current value, full history retained), and whether the bounded k-hop walk beats the flat store on recall.

Scope as before: it augments the flat store rather than replacing it, it’s owner-scoped, and extraction lives separately. Happy to keep it standalone for now, and glad to put up a short demo or a review-only draft PR if that would make it easier to poke at.

WGlynn Jun 11, 2026

I'd love to help testing

If you also want to stress the other design point in this thread (model-judged extraction, open schema), the system I referenced upthread is public: github.com/WGlynn/JARVIS. Fastest way in: clone, then python verify/verify_primitive_corpus.py validates the graph corpus and python -m pytest substrate/hooks/tests/ runs the write-boundary gates (276 tests, plain python, no keys needed).

The parts worth stressing are the inverse of alvaro's. His suite proves corruption can't happen at the write boundary. Ours bets a frontier model can be trusted there and tests whether drift gets detected after the fact, via the entity cross-reference hook and the hindsight CLI that surfaces candidate contradictions. The cost is real: there's a window where a bad write is live before detection catches it, and detection hands you a triage queue instead of a guarantee. What we get for it is an open schema, and a failure mode that's visible instead of silent. Closed ontologies reject what they can't map, and a rejected write leaves nothing behind to detect. Two failure philosophies, same graph problem.

ferhimedamine · 2026-06-13T10:56:38Z

@Michiel-VandeVelde's point about the compositionality gap is the key insight — flat vector retrieval gives you "memories about X" but not "how X relates to Y relates to Z," which is exactly what small models need because they can't infer those connections from independent chunks.

The approach that works in production: run BOTH layers simultaneously, not one or the other. Vector memory handles semantic recall ("find memories similar to this query"), while the knowledge graph handles structural traversal ("what entities are connected to this entity within 2 hops"). At query time, merge the results — vector similarity surfaces relevant context, graph traversal surfaces structurally-related context that vector search would miss.

For small models specifically, the graph layer acts as a pre-filtering stage: instead of dumping 10 semantically similar chunks into context (half of which are noise), traverse from the query's entities through 1-2 graph hops and THEN rank by vector similarity within that subgraph. This gives smaller models a focused, relationship-aware context window instead of a flat similarity-ranked list.
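
A small sketch of that pre-filtering order, assuming each memory already carries its extracted entities and an embedding vector. The data layout and the function names are hypothetical, and the toy vectors in the usage note stand in for real embeddings.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def neighbours_within(edges, seeds, max_hops):
    """Entities reachable from the seeds in at most max_hops undirected hops."""
    reached = set(seeds)
    frontier = set(seeds)
    for _ in range(max_hops):
        frontier = {
            other
            for a, b in edges
            for node, other in ((a, b), (b, a))
            if node in frontier and other not in reached
        }
        reached |= frontier
    return reached

def retrieve(query_vec, query_entities, edges, memories, max_hops=2, top_k=3):
    """Filter memories to the k-hop subgraph, then rank survivors by similarity."""
    allowed = neighbours_within(edges, query_entities, max_hops)
    candidates = [m for m in memories if allowed & set(m["entities"])]
    candidates.sort(key=lambda m: cosine(query_vec, m["vector"]), reverse=True)
    return [m["text"] for m in candidates[:top_k]]
```

For example, with edges `[("User", "Project A"), ("Project A", "Python"), ("Tea", "Kettle")]` and a query seeded at "User", a memory about the kettle is dropped before ranking even if its vector is identical to the query vector, because "Kettle" is not within two hops of "User".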

On the closed vs open ontology question (@alvaroperricone vs @WGlynn): the pragmatic middle ground is entity extraction with type constraints at write time + permissive linking at query time. Extract entities with types (person, project, technology, concept) but allow the graph edges to be free-form. This prevents schema explosion while still capturing arbitrary relationships.
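
In code, the middle ground might look something like the sketch below: node types are checked against a fixed vocabulary, while edge labels are accepted as free text. The four type names come from the paragraph above. The `TypedGraph` class and its methods are hypothetical.

```python
from dataclasses import dataclass, field

# Constrained node-type vocabulary; edge labels are deliberately unconstrained.
ALLOWED_NODE_TYPES = {"person", "project", "technology", "concept"}

@dataclass
class TypedGraph:
    nodes: dict = field(default_factory=dict)  # name -> node type
    edges: list = field(default_factory=list)  # (source, label, target)

    def add_node(self, name, node_type):
        """Reject node types outside the fixed vocabulary."""
        if node_type not in ALLOWED_NODE_TYPES:
            raise ValueError(f"unknown node type: {node_type!r}")
        self.nodes[name] = node_type

    def add_edge(self, source, label, target):
        """Accept any label, but only between nodes that already exist."""
        for endpoint in (source, target):
            if endpoint not in self.nodes:
                raise KeyError(f"unknown node: {endpoint!r}")
        self.edges.append((source, label, target))
```

A write that tries to create a node of type "mood" is rejected, whereas an edge labelled "mentored by" or "blocked on" goes through as long as both endpoints are typed nodes.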

Knowledge graph operations example (entity extraction, linking, traversal): https://github.com/Dakera-AI/dakera-py/blob/main/examples/knowledge_graph.py

1 reply

WGlynn Jun 13, 2026

This is the right cut, and the honest version is that the architecture was already this shape on our side: nodes typed at write, edges as free-form wikilinks resolved at read. I called it "open schema" upthread, which under-sold it. But you named the part we were actually missing, so thank you for that. Our entities were typed only by the file they lived in, not as nodes in their own right. I took "extract entities with types, keep the edges free-form" literally and added a constrained node-type vocabulary this morning (person, project, technology, concept), typed the existing nodes, left every edge untouched. The schema didn't explode and the relationships didn't get poorer. Exactly the trade you described.

One thing running it surfaced: the type set wants a slow-open bottom. Start with a fixed handful and let new types earn their way in over months, while the edges stay fast and cheap. Two different clocks on the same graph.

And the meta point is the one we keep landing on past memory: find the axis that has to be reliable, lock it, leave the other free. Closed everything is brittle, open everything is noise, and the work is deciding which axis is which. You made that call in one sentence. Appreciate it.

Replies: 6 comments · 10 replies