Graph-Based Memory Layer for Small Models #2858
As someone who works at imec/IDLab and specifically researches AI, knowledge graphs, and memory/retrieval systems: conceptually, absolutely yes. This is exactly the kind of direction that could make smaller local models much more useful.
Flat vector memory works, but it has obvious limits. Once memories grow large, retrieval becomes noisy, and small models often struggle to infer the missing relationships from loosely related chunks. A graph layer can move part of that reasoning burden into the storage/retrieval system instead of forcing the model to reconstruct everything at inference time. So from a research perspective, I think the idea is very good.
As a maintainer, though, I’d be careful about adding something this large directly into Odysseus core too quickly. A graph-memory system touches a lot of surface area. So while I'm very fond of the idea, I would keep this as future work for now, until the project is more settled :)
The flat-vector noise problem is real, and you've identified the right shape of solution. One data point from production: I've been running a graph-structured memory layer on top of Claude (small relative to GPT-4-class but not 7B-small) for about a year, and the graph-vs-flat-vector tradeoffs you describe match what we measured. Three specific moves that helped beyond the obvious "add edges":
Entity-graph cross-reference at the write boundary, not just the read boundary. We run a hook that fires on every memory write, extracts named entities, greps existing memory for matches, and surfaces the matches as additional context. The graph builds incrementally on writes without requiring offline ingestion passes. For self-hosted users on a 7B model, the per-write cost is bounded by the local regex / embedding pass and stays well under the retrieval-time benefit.
Hierarchical sub-index structure beats flat retrieval at small-model scale. We split memory into a small always-loaded core (kept under ~200 lines) and several situation-matched sub-indexes that are auto-loaded based on the current task. The core indexes the indexes. Retrieval seeds from the core, then expands into the relevant sub-indexes via the entity matches. Small models read substantially less context per retrieval because the routing layer is tiny and the expansion is targeted.
Semantic and graph compose; they don't replace each other. We keep both the vector similarity (for "things I haven't connected explicitly") and the entity graph (for "things I know are related"). Retrieval seeds from one and expands through the other. Pure graph misses unstated similarity; pure vector loses the explicit structure. The composition is what makes it tractable on a small model.
One honest note on compression: we experimented with a denser logical-operator notation (replacing prose with implication / and / or / not symbols) to fit more memory into context windows. Empirically, the token count went up rather than down on most tokenizers, so it didn't pay off as a context-saving mechanism. It did help with cell reusability and acted as a forcing function for rigor, but I'd skip it if context efficiency is the goal.
Worth strongly seconding Michiel's standalone-library recommendation. We have a clean separation between the substrate and the application, which has let us evolve each independently. If you're building this for Odysseus, shipping it as a Python library with a stable interface (write_event, query, traverse) and letting integrators wire it in pays off the moment the second downstream consumer appears.
Happy to share specifics on the cross-reference-hook implementation or the hierarchical sub-index structure if useful. We've been running it long enough to have honest data on the tradeoffs you're asking about. Extraction overhead is real but bounded; the noise reduction at small-model scale is the dominant factor by a wide margin.
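To make the write-boundary cross-reference idea concrete, here is a minimal sketch. It is not the commenter's implementation: the `CrossRefIndex` class, the capitalized-word regex used as an entity extractor, and the in-memory inverted index are all assumptions chosen to keep the example small.

```python
import re
from collections import defaultdict

# Hypothetical entity pattern: runs of capitalized words. It will also catch
# capitalized sentence openers, so a real system would likely use an NER
# model or a curated alias list instead.
ENTITY_RE = re.compile(r"\b[A-Z][a-zA-Z0-9]+(?:\s[A-Z][a-zA-Z0-9]+)*\b")


class CrossRefIndex:
    """Inverted index from entity name to the memories that mention it."""

    def __init__(self):
        self.memories = []                 # position in list = memory id
        self.by_entity = defaultdict(set)  # entity -> {memory ids}

    def write(self, text):
        """Store a memory; return its id and ids of earlier related memories."""
        entities = {m.group(0).lower() for m in ENTITY_RE.finditer(text)}
        # Look up matches BEFORE indexing the new memory, so it never
        # cross-references itself.
        related = set()
        for entity in entities:
            related |= self.by_entity[entity]
        mem_id = len(self.memories)
        self.memories.append(text)
        for entity in entities:
            self.by_entity[entity].add(mem_id)
        return mem_id, sorted(related)


idx = CrossRefIndex()
first = idx.write("Alice started the Odysseus migration")
second = idx.write("Bob reviewed the Odysseus config")
third = idx.write("the weather is nice")
```

Each write costs one regex pass plus a few dictionary lookups, which is the sense in which the per-write overhead stays bounded.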
I hit the same problem and built a small implementation along these lines, in case it's a useful reference. It's an owner-scoped, bi-temporal knowledge-graph store that augments (not replaces) the flat memory store, designed so a small local model can't corrupt it:
Reference branch: alvaroperricone@e0fc50e
Sharing it as a standalone starting point rather than a merge request; I agree with keeping this as future/standalone work for now. (Different from #1893, which targets in-session context; this is durable user facts + persona.)
Sharp design call on the closed-ontology side. We bet the other way (model-judged entity extraction at the write boundary), and the divergence is fair: your scope is bounded user facts plus persona, ours is an open-ended life/project graph where the schema keeps expanding. Closed makes sense when the schema is knowable; model-judged eats the hallucination risk in exchange for not needing schema updates.
Bi-temporal supersession is the move I want to steal. We currently lean on git history for "what did the agent know when X happened?", which is fine for forensics but unusable at query time. Event-time plus transaction-time in-band is a much cleaner answer, especially for the audit trail when a memory turns out wrong and you want to know which decisions were downstream of it.
Bounded k-hop with no LLM at query is also the right ceiling. Will look at e0fc50e.
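For readers unfamiliar with the term, bi-temporal supersession can be sketched roughly as follows. This is not the code in the referenced branch: the `Fact` and `BiTemporalStore` names, the integer timestamps, and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fact:
    subject: str
    predicate: str
    value: str
    event_time: int                      # when the fact became true in the world
    recorded_at: int                     # transaction time: when the store learned it
    superseded_at: Optional[int] = None  # transaction time at which it was replaced


class BiTemporalStore:
    def __init__(self):
        self.facts = []

    def assert_fact(self, subject, predicate, value, event_time, now):
        # Supersede the open fact for this key instead of deleting it,
        # so earlier beliefs stay queryable.
        for fact in self.facts:
            if (fact.subject, fact.predicate) == (subject, predicate) \
                    and fact.superseded_at is None:
                fact.superseded_at = now
        self.facts.append(Fact(subject, predicate, value, event_time, now))

    def known_at(self, subject, predicate, tx_time):
        """Return what the store believed at transaction time tx_time."""
        for fact in self.facts:
            if (fact.subject, fact.predicate) != (subject, predicate):
                continue
            still_open = fact.superseded_at is None or tx_time < fact.superseded_at
            if fact.recorded_at <= tx_time and still_open:
                return fact.value
        return None


store = BiTemporalStore()
store.assert_fact("user", "city", "Ghent", event_time=1, now=10)
store.assert_fact("user", "city", "Leuven", event_time=5, now=20)
```

With this layout, `known_at("user", "city", 15)` answers "what did the agent know at time 15?" directly from the store, without consulting external history.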
Very cool idea. I'd love to help with testing.
@Michiel-VandeVelde's point about the compositionality gap is the key insight: flat vector retrieval gives you "memories about X" but not "how X relates to Y relates to Z," which is exactly what small models need because they can't infer those connections from independent chunks.
The approach that works in production: run BOTH layers simultaneously, not one or the other. Vector memory handles semantic recall ("find memories similar to this query"), while the knowledge graph handles structural traversal ("what entities are connected to this entity within 2 hops"). At query time, merge the results: vector similarity surfaces relevant context, and graph traversal surfaces structurally related context that vector search would miss.
For small models specifically, the graph layer acts as a pre-filtering stage: instead of dumping 10 semantically similar chunks into context (half of which are noise), traverse from the query's entities through 1-2 graph hops and THEN rank by vector similarity within that subgraph. This gives smaller models a focused, relationship-aware context window instead of a flat similarity-ranked list.
On the closed vs open ontology question (@alvaroperricone vs @WGlynn): the pragmatic middle ground is entity extraction with type constraints at write time + permissive linking at query time. Extract entities with types (person, project, technology, concept) but allow the graph edges to be free-form. This prevents schema explosion while still capturing arbitrary relationships.
Knowledge graph operations example (entity extraction, linking, traversal): https://github.com/Dakera-AI/dakera-py/blob/main/examples/knowledge_graph.py
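The "traverse first, then rank by similarity" ordering can be sketched as below. This is an illustration under stated assumptions, not code from the linked example: the adjacency-dict graph, the `(text, entity, vector)` chunk layout, and the function names `k_hop`, `cosine`, and `hybrid_retrieve` are all hypothetical.

```python
import math
from collections import deque


def k_hop(graph, seeds, k):
    """Return every node within k hops of the seeds (breadth-first search)."""
    seen = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_retrieve(graph, chunks, query_entities, query_vec, hops=2, top_n=3):
    """chunks: list of (text, entity, vector). Graph pre-filter, then vector rank."""
    allowed = k_hop(graph, query_entities, hops)
    candidates = [chunk for chunk in chunks if chunk[1] in allowed]
    candidates.sort(key=lambda chunk: cosine(query_vec, chunk[2]), reverse=True)
    return [chunk[0] for chunk in candidates[:top_n]]


graph = {
    "alice": ["odysseus"],
    "odysseus": ["alice", "sqlite"],
    "sqlite": ["odysseus"],
    "bob": [],
}
chunks = [
    ("alice leads odysseus", "alice", [1.0, 0.0]),
    ("odysseus uses sqlite", "sqlite", [0.9, 0.1]),
    ("bob likes tea", "bob", [1.0, 0.0]),
]
two_hop = hybrid_retrieve(graph, chunks, ["alice"], [1.0, 0.0], hops=2)
one_hop = hybrid_retrieve(graph, chunks, ["alice"], [1.0, 0.0], hops=1)
```

Note that the "bob likes tea" chunk is a perfect vector match for the query yet never reaches the context, because "bob" is not reachable from the query's entities. That is the pre-filtering effect in miniature.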
One thing I've noticed is that Odysseus memory is fundamentally flat. Memories are stored as independent pieces of information, and retrieval is largely based on similarity search.
This works reasonably well with larger models because they can infer relationships between retrieved memories. Smaller local models often struggle. They either retrieve too much context and get distracted, or retrieve too little and miss important connections.
What if memory were stored as a graph instead?
Instead of only storing memory chunks, the system could extract entities and relationships and maintain them as a lightweight knowledge graph.
Example:
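One hypothetical way to picture this is as a set of subject-relation-object triples. The `TinyGraph` class and every fact below (`works_on`, `project_x`, and so on) are invented for illustration and are not taken from Odysseus.

```python
from collections import defaultdict


class TinyGraph:
    """Edges stored as (relation, object) pairs, indexed by subject."""

    def __init__(self):
        self.out = defaultdict(set)

    def add(self, subject, relation, obj):
        self.out[subject].add((relation, obj))

    def neighbors(self, subject):
        # .get avoids creating empty entries for unknown subjects
        return sorted(self.out.get(subject, ()))


g = TinyGraph()
g.add("user", "works_on", "project_x")
g.add("project_x", "uses", "postgres")
g.add("user", "prefers", "dark_mode")
g.add("user", "prefers", "dark_mode")  # duplicates collapse: edges are a set
```

Here a memory chunk such as "I'm working on project X, which runs on Postgres" would become two edges rather than one opaque blob of text.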
At query time, retrieval would work differently:
The benefit is that retrieval becomes bounded by graph structure rather than conversation history size. A query for a user with three years of memory could return a context payload similar in size to one for a user with three days of memory.
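A rough sketch of that bound, assuming edges are plain `(subject, relation, object)` tuples and the query's entities are found by naive word matching (both assumptions; `retrieve`, `max_hops`, and `max_facts` are hypothetical names):

```python
def retrieve(edges, query, max_hops=2, max_facts=5):
    """Return at most max_facts triples reachable from entities in the query."""
    words = set(query.lower().split())
    # Seed nodes: any node whose name appears as a word in the query.
    frontier = {n for s, _, o in edges for n in (s, o) if n in words}
    seen = set(frontier)
    facts = []
    for _ in range(max_hops):
        next_frontier = set()
        # Linear scan for brevity; a real store would index edges by node.
        for edge in edges:
            s, _rel, o = edge
            if s not in frontier and o not in frontier:
                continue
            if edge not in facts:
                facts.append(edge)
                if len(facts) >= max_facts:
                    return facts  # hard cap on payload size
            for node in (s, o):
                if node not in seen:
                    seen.add(node)
                    next_frontier.add(node)
        frontier = next_frontier
    return facts


small = [
    ("alice", "works_on", "odysseus"),
    ("odysseus", "uses", "sqlite"),
    ("bob", "likes", "tea"),
]
# Simulate a much longer history with 1000 unrelated edges.
big = small + [(f"n{i}", "rel", f"m{i}") for i in range(1000)]
query = "what does alice work on"
```

Both `retrieve(small, query)` and `retrieve(big, query)` return the same two triples, so the payload depends on the neighborhood of the query's entities and the `max_facts` cap, not on how much history has accumulated.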
For ingestion, conversation turns could be converted into graph operations:
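A minimal sketch of that conversion, assuming a tiny regex pattern table stands in for the extraction step and that operations are plain tuples. The names `PATTERNS`, `turn_to_ops`, `apply_ops`, `upsert_node`, and `upsert_edge` are all hypothetical.

```python
import re

# Hypothetical pattern table standing in for the extraction step.
# A small extraction model could emit the same operations instead.
PATTERNS = [
    (re.compile(r"\bi work on (\w+)", re.IGNORECASE), "user", "works_on"),
    (re.compile(r"\bi prefer (\w+)", re.IGNORECASE), "user", "prefers"),
]


def turn_to_ops(turn):
    """Convert one conversation turn into a list of graph operations."""
    ops = []
    for pattern, subject, relation in PATTERNS:
        for match in pattern.finditer(turn):
            obj = match.group(1).lower()
            ops.append(("upsert_node", subject))
            ops.append(("upsert_node", obj))
            ops.append(("upsert_edge", subject, relation, obj))
    return ops


def apply_ops(graph, ops):
    """graph: {'nodes': set, 'edges': set}. Upserts are idempotent."""
    for op in ops:
        if op[0] == "upsert_node":
            graph["nodes"].add(op[1])
        elif op[0] == "upsert_edge":
            graph["edges"].add(op[1:])  # (subject, relation, object)
    return graph


graph = {"nodes": set(), "edges": set()}
ops = turn_to_ops("I work on Odysseus and I prefer Python")
apply_ops(graph, ops)
apply_ops(graph, ops)  # replaying the same turn changes nothing
```

Keeping the operations idempotent means a turn can be re-ingested after a crash or a re-extraction pass without duplicating nodes or edges.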
To avoid unbounded growth, nodes and edges could have recency, frequency, and confidence scores. Rarely accessed information would gradually decay while frequently referenced information remains active.
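One way such scoring could look is sketched below. The formula (confidence times an exponential recency decay times a logarithmic frequency boost), the 30-unit half-life, and the pruning threshold are all assumptions chosen for illustration, not tuned values.

```python
import math


def activation(last_access, access_count, confidence, now, half_life=30.0):
    """Score an edge; higher means more worth keeping in the active graph."""
    age = now - last_access
    recency = 0.5 ** (age / half_life)          # halves every half_life time units
    frequency = 1.0 + math.log1p(access_count)  # diminishing returns on repeats
    return confidence * recency * frequency


def prune(edges, now, threshold=0.1):
    """Keep only edges whose activation is at or above the threshold."""
    return {
        key: meta
        for key, meta in edges.items()
        if activation(now=now, **meta) >= threshold
    }


edges = {
    "hot": {"last_access": 95, "access_count": 20, "confidence": 0.9},
    "stale": {"last_access": 0, "access_count": 1, "confidence": 0.5},
}
kept = prune(edges, now=100)
```

In this toy run the rarely touched, low-confidence "stale" edge falls below the threshold while the frequently referenced "hot" edge survives. A gentler variant could archive pruned edges instead of dropping them.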
My interest is primarily in helping smaller local models. The goal is not to make memory "smarter", but to move part of the reasoning workload from inference time into storage time so that 7B-12B models can retrieve more useful context with fewer tokens.
I'm considering building this as a standalone Python library that could integrate with Odysseus but also work independently:
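A rough sketch of what such an interface might look like, borrowing the method names suggested elsewhere in this thread (write_event, query, traverse). The `GraphMemory` protocol, the `InMemoryGraph` class, and the pluggable `extractor` callable are assumptions for illustration, not a proposed final API.

```python
from typing import Callable, Iterable, List, Protocol, Set, Tuple

Triple = Tuple[str, str, str]


class GraphMemory(Protocol):
    """Hypothetical stable interface an integrator would code against."""

    def write_event(self, text: str) -> None: ...
    def query(self, text: str) -> List[Triple]: ...
    def traverse(self, entity: str, hops: int = 1) -> Set[str]: ...


class InMemoryGraph:
    """Toy backend; satisfies GraphMemory structurally."""

    def __init__(self, extractor: Callable[[str], Iterable[Triple]]):
        self.extractor = extractor  # text -> triples; swappable per deployment
        self.triples: Set[Triple] = set()

    def write_event(self, text: str) -> None:
        self.triples.update(self.extractor(text))

    def query(self, text: str) -> List[Triple]:
        words = set(text.lower().split())
        return sorted(t for t in self.triples if t[0] in words or t[2] in words)

    def traverse(self, entity: str, hops: int = 1) -> Set[str]:
        reached = {entity}
        for _ in range(hops):
            outgoing = {o for s, r, o in self.triples if s in reached}
            incoming = {s for s, r, o in self.triples if o in reached}
            reached |= outgoing | incoming
        return reached


def toy_extractor(text: str) -> List[Triple]:
    # Stand-in extractor: treats any three-word input as one triple.
    parts = text.lower().split()
    return [(parts[0], parts[1], parts[2])] if len(parts) == 3 else []


mem: GraphMemory = InMemoryGraph(toy_extractor)
mem.write_event("alice uses odysseus")
mem.write_event("odysseus needs sqlite")
```

Keeping the extractor injectable means the same library could run with a regex pass on constrained hardware and a model-based extractor elsewhere, without changing the interface integrators depend on.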
The main question I have for the community:
Would a lightweight extraction step (entity/relationship extraction per conversation turn) be acceptable overhead for self-hosted users, or would that cost outweigh the retrieval benefits for the smaller-model use case this is trying to solve?