Structured memory retrieval for AI agents that returns evidence paths, not just top-k chunks. Think navigable memory in a memory-palace-style sense: a graph you can walk, not only a flat similarity list.
Memory Path Engine is a research-first prototype for moving beyond flat retrieval. Instead of treating memory as an unordered vector index, it models memory as typed nodes, edges, weights, and replayable paths so a system can retrieve, traverse, and explain how it reached an answer.
This repository is aimed at people exploring agent memory, graph-aware retrieval, and explainable evidence chains across more than one document shape.
Bundled markdown packs are ingested into an in-memory graph (`MemoryNode` / `MemoryEdge`, with weights). Retrievers return a `MemoryPath`: a composed answer plus ordered steps you can inspect. The CLI demo exercises exactly this path end to end.
```
examples/*_pack ──▶ ingest ──▶ MemoryStore (typed graph)
                                      │
                       ┌──────────────┼──────────────┐
                       ▼              ▼              ▼
                 BaselineTopK    other modes    WeightedGraph
                (flat answers)  in `retrieve`  (path + scores)
                                      │
                                      ▼
                  stitched answer + replayable step list
```
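The ingest step above could look roughly like the following sketch. The class and function names (`MemoryStore`, `ingest_pack`) and their signatures are illustrative assumptions, not the repo's actual API; only the `MemoryNode` / `MemoryEdge` names come from the project.

```python
from dataclasses import dataclass

@dataclass
class MemoryNode:
    node_id: str
    text: str
    node_type: str = "unit"

@dataclass
class MemoryEdge:
    src: str
    dst: str
    edge_type: str       # e.g. "next_unit", "exception"
    weight: float = 1.0

class MemoryStore:
    """Minimal in-memory typed graph (hypothetical stand-in for the real store)."""
    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, edge):
        self.edges.append(edge)

def ingest_pack(store, doc_name, units):
    """Chain an ordered list of text units into the graph with next_unit edges."""
    prev = None
    for i, text in enumerate(units):
        nid = f"{doc_name}:{i}"
        store.add_node(MemoryNode(nid, text))
        if prev is not None:
            store.add_edge(MemoryEdge(prev, nid, "next_unit"))
        prev = nid
```

The point of the sketch is the shape of the data: ingestion produces typed nodes and typed, weighted edges, which is what later lets a retriever walk and explain a path.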
Most RAG systems still look like this:
- Split documents into chunks.
- Embed chunks.
- Return top-k matches.
- Ask the LLM to improvise the reasoning.
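That conventional pipeline fits in a few lines. Here is a minimal sketch using keyword overlap in place of embeddings (every name here is illustrative, not from this repo):

```python
def top_k(query, chunks, k=3):
    """Flat retrieval: score each chunk by keyword overlap, return the k best."""
    q = set(query.lower().split())
    scored = [(len(q & set(c.lower().split())), c) for c in chunks]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [c for s, c in scored[:k] if s > 0]
```

Notice what is missing: the result is an unordered bag of matches with no relationship between them, which is exactly the gap the rest of this README is about.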
This repo explores a different question:
Can we retrieve a memory path instead of only retrieving similar chunks?
The prototype is built around three ideas:
- structure: memory is not flat; it has typed nodes and edges
- weight: not every memory should be treated equally
- path: retrieval should expose the chain of evidence, not hide it
- compare multiple retrieval modes in one codebase
- inspect replayable evidence paths instead of only final answers
- test graph-aware retrieval on contract-like and operational documents
- run repository-owned structured benchmarks instead of toy snippets
Maintainers: configure the GitHub link-card image using docs/social-preview.md (docs/assets/open-graph-cover.png).
Install the project in editable mode:
```shell
python -m pip install --no-build-isolation -e .
```

Run the test suite:

```shell
python -m unittest discover -s tests -v
```

Run the runbook demo:

```shell
python -m memory_engine.demo --scenario runbook
```

Terminal-style capture of real stdout (refresh with `python scripts/generate_runbook_demo_terminal_svg.py`; `latency_ms` may differ run to run):

Run the contract comparison demo:

```shell
python -m memory_engine.demo --scenario contract
```

Run the HotpotQA tiny benchmark smoke check:

```shell
python scripts/run_hotpotqa_benchmark.py
```

Download the official HotpotQA dev distractor file for local benchmark runs:

```shell
python scripts/download_hotpotqa.py
```

`python -m memory_engine.demo` prints a small banner, the query, then path-aware output: a BEST ANSWER line built from the winning walk, and a REPLAY PATH with one line per hop (node id, score, `via=<edge type>`) plus short scoring reasons on the following lines. With `--scenario contract`, a BASELINE block (flat top-k answers) appears above the path-aware section for the same query.
Representative runbook excerpt (answer line shortened; latency and hop scores can vary slightly between runs):
```
========================================================================
Memory Path Engine | demo
scenario: runbook
========================================================================
-------------------------------- QUERY ---------------------------------
What should we do if rollback does not recover the API after a
deployment incident?
----------------- PATH-AWARE weighted graph retrieval -----------------
BEST ANSWER
  … stitched runbook units … [latency_ms=…]
REPLAY PATH
  1. 01_api_incident_runbook:5 | score=0.500 | via=seed
     seed hit semantic=0.501
  2. 01_api_incident_runbook:4 | score=0.299 | via=next_unit
     expanded at hop 1 total=0.299 exception=0.450 contradiction=0.000
========================================================================
```
The runbook demo loads incident and recovery procedures, then asks a multi-step operational question:
What should we do if rollback does not recover the API after a deployment incident?
The output includes:
- a BEST ANSWER line composed from the graph walk
- a REPLAY PATH with per-step scores, `via` edge types, and short reasons
For a representative stdout excerpt, see What you will see (under Quick start).
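A replayable path of that shape could be represented and rendered as in the sketch below. The field names and the free-standing `replay` helper are illustrative assumptions; the repo's actual `MemoryPath` schema may differ.

```python
from dataclasses import dataclass, field

@dataclass
class PathStep:
    node_id: str
    score: float       # weight-aware score for this hop
    via: str           # typed edge that led here, e.g. "seed", "next_unit"
    reason: str = ""   # short human-readable scoring explanation

@dataclass
class MemoryPath:
    query: str
    answer: str                                # stitched from the visited units
    steps: list = field(default_factory=list)  # ordered, replayable chain

def replay(path):
    """Render one numbered line per hop, mirroring the demo's REPLAY PATH block."""
    lines = []
    for i, s in enumerate(path.steps, 1):
        lines.append(f"{i}. {s.node_id} | score={s.score:.3f} | via={s.via}")
        if s.reason:
            lines.append(f"   {s.reason}")
    return "\n".join(lines)
```

Because the steps are plain data rather than a single opaque answer string, the same path can be rendered for a human, diffed between runs, or fed into evaluation.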
The contract demo runs the same query through a baseline retriever and the weighted graph retriever. Stdout shows flat top-k answers first, then the path-aware best answer and replay steps, so you can compare shapes of evidence without relying on a single aggregate metric.
| Retriever | What it emphasizes | Useful for |
|---|---|---|
| lexical baseline | keyword overlap | simple lookups and sanity checks |
| embedding baseline | semantic similarity | paraphrases and fuzzy matches |
| structure-only traversal | graph connectivity | linked evidence exploration |
| weighted graph retrieval | structure plus importance weighting | multi-hop retrieval with replayable evidence |
| activation spreading v1 | explicit propagation with decay | graph diffusion experiments |
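The "activation spreading v1" row can be approximated by a generic spreading-activation sketch with multiplicative per-hop decay. This is the textbook technique the row names, not the repo's implementation; function and parameter names are illustrative.

```python
def spread_activation(seeds, neighbors, decay=0.6, hops=2):
    """Propagate activation outward from seed nodes, decaying at each hop.

    seeds:     {node_id: initial_activation}
    neighbors: {node_id: [node_id, ...]} adjacency lists
    Returns {node_id: activation}, keeping the strongest signal per node.
    """
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(hops):
        nxt = {}
        for node, act in frontier.items():
            for nb in neighbors.get(node, []):
                nxt[nb] = max(nxt.get(nb, 0.0), act * decay)
        for node, act in nxt.items():
            if act > activation.get(node, 0.0):
                activation[node] = act
        frontier = nxt
    return activation
```

Typed edges slot in naturally here: a real implementation could use a per-edge-type decay instead of a single scalar, so an `exception` edge bleeds less activation than a generic `next_unit` edge.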
The core is meant to stay domain-agnostic. The current examples use both contract-like documents and runbooks because together they stress:
- hierarchical structure
- exception and dependency chains
- critical risk-bearing units
- procedural and operational steps
- strong need for evidence-backed reasoning
If the retrieval and replay ideas cannot survive across these document types, they are unlikely to generalize well to other structured knowledge domains.
- `src/memory_engine`: schema, storage, ingestion, retrieval, scoring, replay
- `examples/contract_pack`: contract-like demo pack with dense dependencies and exceptions
- `examples/runbook_pack`: operational runbook pack for procedural retrieval
- `benchmarks/structured_memory`: typed benchmark fixtures and evaluation assets
- `docs`: architecture, evaluation, hypotheses, and project vision
- `tests`: unit tests for schema, retrieval behavior, and benchmark support
- `docs/vision.md`: why this project exists and where it is heading
- `docs/architecture.md`: how the current system is structured
- `docs/evaluation.md`: how retrieval modes are compared
- `docs/benchmark-strategy.md`: how public, repo-owned, and private benchmarks should be used
- `docs/private-contract-dataset-guide.md`: how to build and annotate a private contract golden set
- `docs/hypotheses.md`: milestone hypotheses and success criteria
The first milestone tests three claims:
- H1: graph-aware retrieval beats vanilla top-k retrieval on multi-hop questions
- H2: anomaly and importance weighting improve recall of critical evidence
- H3: replayable memory paths improve explainability without unacceptable latency
The retrieval stack separates:
- candidate generation
- semantic similarity backend
- scoring strategy
- path replay
That separation makes it possible to compare lexical baseline, embedding baseline, structure-only traversal, and weighted graph retrieval without rewriting the main search loop.
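That separation amounts to one search loop with injected stages. A minimal sketch, assuming simple callables for generation and scoring (the repo's real interfaces are likely richer):

```python
from typing import Callable

def search(query: str,
           generate: Callable[[str], list],       # candidate generation
           score: Callable[[str, str], float],    # scoring strategy
           k: int = 3):
    """One search loop for every mode: swap the injected stages, not the loop."""
    ranked = sorted(((c, score(query, c)) for c in generate(query)),
                    key=lambda t: t[1], reverse=True)
    return ranked[:k]

# A lexical scorer plus a trivial generator gives the lexical-baseline mode:
lexical = lambda q, c: float(len(set(q.split()) & set(c.split())))
```

Swapping `lexical` for an embedding-similarity scorer, or replacing `generate` with a graph traversal, changes the retrieval mode without touching `search` itself.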
The evaluation layer can emit detailed per-question reports, which is useful for miss analysis and ablation debugging instead of relying only on a single aggregate score.
The repository also includes a dedicated structured benchmark bounded context with:
- strong pydantic dataset models
- a JSON repository for benchmark fixtures
- application services that load datasets, build stores, and run retrievers end to end
- minimal `MemoryNode`, `MemoryEdge`, `MemoryPath`, and `EvidenceRef` schema
- an in-memory store for fast iteration
- simple ingestion paths for multiple example document styles
- multiple retrieval modes in one research harness
- a small synthetic contract evaluation set for end-to-end experiments
- production infrastructure
- MCP integration
- multi-modal memory encoding
- online reinforcement and forgetting policies
- large-scale benchmarks
- full UI
- add explicit anomaly detectors and contradiction edges
- expand the evaluation runner with ablation reports and latency summaries
- extend the `domain_pack` interface for more domains such as code, research notes, and policy-like documents
- add stronger embedding backends behind the same `EmbeddingProvider` interface
For suggested GitHub topic tags (About section), see docs/github-topics.md.
MIT. See LICENSE.