Framework-agnostic AI memory system. 97.0% Recall@5 on LongMemEval-S.
Beats MemPalace (96.6%) and agentmemory (95.2%) — running entirely on a £170 Orange Pi 5 Plus with zero cloud dependencies. Part of the taOS ecosystem.
Most memory systems try to recreate human thinking. They embed, they index, they retrieve, and they call it "cognition" because that sounds better than "we built a vector database". The brain is hard, so they reach for it as a metaphor and hope nobody asks where the reasoning is supposed to come from.
A few years in, MemPalace stepped sideways. Instead of a brain, a building — a palace of rooms where memories sit on shelves you can walk past. That's a real improvement. The metaphor is concrete. You can picture the kitchen and remember what you cooked.
But a building is still one person's mind, just dressed up. When a human needs to remember something they didn't personally experience, they don't walk through their own house. They go outside. They go to the library.
The library is the biggest thing humans ever built for memory. Not the brain, not the palace — the library. One species figured out that putting verbatim records on shelves, organised by subject, indexed by a card catalogue, maintained by a librarian who actually knows where everything is, beats any individual brain by orders of magnitude. The library is how we got from "I remember my grandmother's recipe" to "I can read what Marcus Aurelius wrote on a Tuesday in 175 AD".
taosmd is the library.
There is a librarian. She sits at the desk and watches every conversation that passes through. She takes it down word for word — no paraphrasing, no summary that loses the joke, no compression that flattens the nuance. The transcript is the truth, and the truth is what gets shelved.
Then she does the work nobody wants to do. She breaks the day into chapters, stories, articles, recurring serials. She logs the date, the participants, the subject, the cross-references to earlier conversations on the same theme. She writes it all down in her directory so she knows where to put her hand on any of it.
When you ask the agent something, the librarian helps. Vector search picks the candidate shelves, keyword search confirms the title, the temporal graph tells her which version is the current one, and the archive proves what was actually said. No single component is doing magic. They're all doing one job each, the way a real library does: stacks, catalogue, reference desk, archive.
Uncertainty is her specialty. If the agent isn't sure, it asks her, and she'll either find the source, find an earlier conversation that contradicts the claim, or admit nobody's said anything about it before. She doesn't make things up. She points at the page.
Everything is time-stamped. Everything is on a shelf. Nothing is ever lost.
What about dreaming? A few systems have started calling their consolidation pass "dreaming" — OpenClaw's dreaming is the cleanest example. The idea is good: take the day's signals, score them, promote the durable ones to long-term memory. It's their version of the librarian shelving the day's events.
The catch is the dream rewrites itself. Snippets get scored, gated, redacted, summarised into a MEMORY.md. What didn't make the cut, and what the original wording actually was, is gone. The bit that survives is the bit the dreamer thought worth keeping at 3am.
I don't know about you, but I can never remember my dreams. So I built a robot librarian who never sleeps instead. The verbatim transcript goes into the zero-loss archive first. The librarian crystallises whatever's worth crystallising — but the original is still on the shelf, byte for byte, never overwritten. Disagree with how she summarised today? Walk over to the archive and read what was actually said. The dream and the source are both there.
That's the difference. We didn't dress up a vector database as a brain. We built a library.
The cleanest way to install taosmd is to ask your agent to do it. Paste this message into Claude Code, Cursor, your OpenClaw shell, whatever:
Please install taosmd as my memory system. The repo is github.com/jaylfc/taosmd.
- Read the README so you understand what you're installing.
- Run the install script: `curl -fsSL https://raw.githubusercontent.com/jaylfc/taosmd/master/scripts/setup.sh | bash`. Report any errors and stop if it fails.
- Register yourself as an agent so you have your own isolated index. Pick a stable agent name (lowercase, no spaces), the same name you'll use every time you call the librarian. If I have multiple agents in this framework, ask me what to name this one before registering.
- Verify the install: call `taosmd.search("hello", agent="<your-name>")`. It should return an empty result, not an error.
- Append the "Memory — taosmd" rules block from `docs/agent-rules.md` in the repo to my agent file (CLAUDE.md / system prompt / AGENTS.md, whatever your framework reads every turn). Replace `<your-agent-name>` with the name you registered as.
- Confirm it's installed and tell me your agent name so I know how to refer to your memory.
Don't summarise the repo or paraphrase the rules. Copy them verbatim — the wording is the contract.
The agent will pull the repo, run the install, register itself, append the per-turn rules block to its own instruction file, and verify everything works. After that, on every turn it checks with the librarian whenever it is uncertain. See docs/agent-rules.md for the rules block it installs.
Multiple agents in one framework? Same install message works. The agent will ask you to name it before registering, so each agent gets its own shelf. The taosmd service itself stays as one process; only the per-agent indexes are separate. See docs/multi-agent.md for the full naming convention, cross-agent reads, migration scenarios, and a five-agent worked example.
Inside taOS? Don't use this — taOS provisions taosmd automatically when you deploy an agent, and the rules block is baked into the agent template. This install path is for standalone framework users.
Note: Install scripts are new and awaiting full testing on clean environments. Please report issues.
curl -fsSL https://raw.githubusercontent.com/jaylfc/taosmd/master/scripts/setup.sh | bash
This will:
- Clone the repo and install Python dependencies
- Download the all-MiniLM-L6-v2 ONNX embedding model (90MB)
- Install Ollama and pull Qwen3-4B for fact extraction + answering (2.6GB)
- On RK3588: download the NPU-optimised Qwen3-4B RKLLM model instead (4.6GB)
- Create the data directory and run a self-test
git clone https://github.com/jaylfc/taosmd.git
cd taosmd
pip install -e .
# 1. Embedding model (required)
huggingface-cli download onnx-models/all-MiniLM-L6-v2-onnx --local-dir models/minilm-onnx
# 2. LLM for fact extraction + answering (required)
# Option A: x86 / ARM without NPU — use Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3:4b
# Option B: Orange Pi / RK3588 with NPU — use rkllama
# Install rkllama: https://github.com/NotPunchnox/rkllama
huggingface-cli download dulimov/Qwen3-4B-rk3588-1.2.1-base \
Qwen3-4B-rk3588-w8a8-opt-1-hybrid-ratio-0.0.rkllm \
--local-dir ~/.rkllama/models/qwen3-4b-chat
If you're using Claude Code, OpenClaw, Cursor, or any AI coding agent, paste this into your conversation or CLAUDE.md / AGENTS.md:
Set up taOSmd as my persistent memory system.
Clone https://github.com/jaylfc/taosmd and run `scripts/setup.sh`. It installs everything (embedding model, Qwen3-4B LLM, data stores, daily archive compression cron).
After setup, integrate taOSmd into your workflow:
Store facts you learn about me:
from taosmd import KnowledgeGraph
kg = KnowledgeGraph("~/.taosmd/knowledge-graph.db")
await kg.init()
await kg.add_triple("User", "prefers", "local models")
Archive every conversation turn (zero-loss: never summarise, never delete):
from taosmd import Archive
archive = Archive(archive_dir="~/.taosmd/archive", index_path="~/.taosmd/archive-index.db")
await archive.init()
await archive.record("conversation", {"role": "user", "content": message}, summary=message[:80])
Search my history when I ask about something:
from taosmd import VectorMemory
vmem = VectorMemory("~/.taosmd/vector-memory.db", embed_mode="onnx", onnx_path="<taosmd-dir>/models/minilm-onnx")
await vmem.init()
results = await vmem.search("What did I say about Docker?", hybrid=True)
Extract facts automatically from our conversations:
from taosmd.memory_extractor import process_conversation_turn
await process_conversation_turn(message, agent_name="assistant", kg=kg, archive=archive)
Before answering, check what you know about me:
facts = await kg.query_entity("User")
history = await archive.search_fts("relevant topic")
similar = await vmem.search("the question", hybrid=True)
The archive is append-only and kept forever. Every conversation, tool call, decision, and error should be recorded. Old archives are compressed to gzip daily at 3 AM. The knowledge graph tracks structured facts with temporal validity: update facts when they change, don't delete them.
| System | Recall@5 | Method | Cloud |
|---|---|---|---|
| taOSmd | 97.0% | Hybrid + query expansion | None |
| MemPalace | 96.6% | Raw semantic (ChromaDB) | None |
| agentmemory | 95.2% | BM25 + vector | None |
| SuperMemory | 81.6% | Cloud embeddings | Yes |
All systems tested on the same benchmark (LongMemEval-S, 500 questions) with the same embedding model (all-MiniLM-L6-v2, 384-dim).
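For readers unfamiliar with the metric, here is a minimal sketch of how a Recall@5 figure like the ones above can be computed. It assumes a question counts as a hit when at least one gold (relevant) session id appears in the top five retrieved ids. The function names are illustrative and not part of taosmd, and the benchmark scripts in the repo may score differently.

```python
from typing import Iterable, Sequence


def recall_at_k(retrieved: Sequence[str], relevant: Iterable[str], k: int = 5) -> bool:
    """Return True if any relevant id appears among the first k retrieved ids."""
    top_k = set(retrieved[:k])
    return any(doc_id in top_k for doc_id in relevant)


def mean_recall_at_k(runs: Sequence[tuple[Sequence[str], Iterable[str]]], k: int = 5) -> float:
    """Fraction of questions with at least one relevant id in the top k.

    Each run is a (retrieved_ids, relevant_ids) pair for one question.
    """
    if not runs:
        raise ValueError("need at least one question to score")
    hits = sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs)
    return hits / len(runs)
```

Under this definition, 485 hits out of 500 questions gives 0.97.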
| Category | taOSmd (hybrid+expand) | taOSmd (raw semantic) | MemPalace |
|---|---|---|---|
| knowledge-update | 100.0% (78/78) | 100.0% | — |
| multi-session | 98.5% (131/133) | 95.5% | — |
| single-session-user | 97.1% (68/70) | 90.0% | — |
| single-session-assistant | 96.4% (54/56) | 96.4% | — |
| temporal-reasoning | 94.0% (125/133) | 94.0% | — |
| single-session-preference | 90.0% (27/30) | 93.3% | — |
| Overall | 97.0% (485/500) | 95.0% (475/500) | 96.6% |
| Strategy | Recall@5 | Delta |
|---|---|---|
| Raw cosine (MemPalace-equivalent) | 95.0% | — |
| Additive keyword boost | 96.6% | +1.6 |
| Hybrid + query expansion (default) | 97.0% | +2.0 |
| All-turns hybrid (harder test) | 93.2% | -1.8 |
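The "additive keyword boost" row can be pictured with a small sketch. This is an assumption about the general technique, not taosmd's actual scoring code: the tokenizer, the `weight` value, and the function names are all hypothetical. The idea is to add a small bonus to the cosine score in proportion to how many query terms appear verbatim in the passage.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def boosted_score(cosine: float, query: str, passage: str, weight: float = 0.1) -> float:
    """Add weight * (fraction of query tokens found in the passage) to the cosine score.

    The weight of 0.1 is an illustrative value, not one taken from the project.
    """
    query_tokens = tokenize(query)
    if not query_tokens:
        return cosine
    overlap = len(query_tokens & tokenize(passage)) / len(query_tokens)
    return cosine + weight * overlap
```

A passage that shares half of the query's terms gets half of the maximum bonus, so exact keyword matches can break ties between passages with similar embeddings.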
taOSmd Memory Stack (v0.2):
├── Temporal Knowledge Graph — structured facts with validity windows
├── Vector Memory — hybrid search with RRF fusion (ONNX MiniLM)
├── Zero-Loss Archive — append-only JSONL, FTS5 full-text search
├── Session Catalog — LLM-derived timeline directory over archives
├── Memory Extractor — regex (15ms) + LLM (17s on NPU)
├── Query Expansion — entity extraction + temporal resolution
├── Intent Classifier — route queries to optimal memory layer
├── Context Assembler — core/archival split, token-budgeted L0-L3
├── Graph Expansion — BFS traversal from search results through KG
├── Retention Scoring — Ebbinghaus decay with hot/warm/cold tiers
├── Session Crystallization — LLM session digests with lesson extraction
├── Cross-Memory Reflection — cluster-then-synthesize insights from KG
├── Secret Filtering — 17 regex patterns, auto-redact on ingest
├── Multi-Agent Leases — TTL exclusive locks for memory operations
└── Mesh Sync — LWW delta replication across workers
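The Vector Memory layer above is described as using RRF (reciprocal rank fusion) to merge results. The following is a generic sketch of that technique, assuming two or more ranked lists of document ids (for example one from vector search, one from keyword search). The constant `k = 60` is the value commonly used in the literature, and whether taosmd uses the same constant or tie-breaking is not stated in this README.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken alphabetically by id.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    vector_hits = ["a", "b", "c"]
    keyword_hits = ["b", "d", "a"]
    print(rrf_fuse([vector_hits, keyword_hits]))
```

Because only ranks are used, the two retrievers do not need comparable score scales, which is the usual reason to choose RRF over a weighted sum.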
from taosmd import KnowledgeGraph, VectorMemory, Archive
# Temporal Knowledge Graph
kg = KnowledgeGraph("data/kg.db")
await kg.init()
await kg.add_triple("Jay", "created", "taOS")
facts = await kg.query_entity("Jay")
# Vector Memory (hybrid search)
vmem = VectorMemory("data/vectors.db", embed_mode="onnx", onnx_path="models/minilm-onnx")
await vmem.init()
await vmem.add("Jay created taOS, a personal AI operating system")
results = await vmem.search("What is taOS?", hybrid=True)
# Zero-Loss Archive
archive = Archive("data/archive")
await archive.init()
await archive.record("conversation", {"content": "Hello"}, summary="User greeted agent")
events = await archive.search_fts("hello")
- 97.0% Recall@5 on LongMemEval-S benchmark (SOTA)
- Zero cloud dependencies — runs entirely on local hardware
- Framework-agnostic — HTTP API works with any agent framework
- Hybrid search — semantic similarity + keyword overlap boosting
- Temporal facts — validity windows, point-in-time queries
- Contradiction detection — auto-resolve conflicting facts
- Zero-loss archive — append-only, never modified, gzip compressed
- Intent-aware retrieval — routes queries to optimal memory layer
- 0.3ms embeddings — ONNX Runtime on ARM CPU
- Opt-in user tracking — browsing history, app usage, search queries
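The temporal facts item in the list above (validity windows, point-in-time queries) can be illustrated with plain SQLite. This is a sketch under assumptions: the table layout, column names, and helper functions are hypothetical and do not reflect the actual `KnowledgeGraph` schema. It shows the general pattern of closing the old fact's window instead of deleting it.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory table of triples with a validity window (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE triples ("
        "subject TEXT, predicate TEXT, object TEXT, "
        "valid_from TEXT NOT NULL, valid_to TEXT)"
    )
    return conn


def set_fact(conn: sqlite3.Connection, subject: str, predicate: str, obj: str, at: str) -> None:
    """Close the currently open fact (if any) at `at`, then insert the new one."""
    conn.execute(
        "UPDATE triples SET valid_to = ? "
        "WHERE subject = ? AND predicate = ? AND valid_to IS NULL",
        (at, subject, predicate),
    )
    conn.execute(
        "INSERT INTO triples VALUES (?, ?, ?, ?, NULL)",
        (subject, predicate, obj, at),
    )


def fact_at(conn: sqlite3.Connection, subject: str, predicate: str, at: str):
    """Return the object that was valid at ISO date `at`, or None if nothing was."""
    row = conn.execute(
        "SELECT object FROM triples "
        "WHERE subject = ? AND predicate = ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to > ?)",
        (subject, predicate, at, at),
    ).fetchone()
    return row[0] if row else None
```

ISO 8601 date strings sort lexicographically, so plain string comparison is enough for the window check here. Nothing is deleted: a superseded fact keeps its row with a closed window.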
The ONNX model (models/minilm-onnx/model.onnx) is not included in this repo due to size (90MB). Download it:
pip install sentence-transformers
python -c "
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
model.save('models/minilm-onnx')
"
Or download directly from HuggingFace.
| Backend | Speed | Notes |
|---|---|---|
| ONNX Runtime (CPU) | 0.3ms | Fastest, recommended |
| RKNN (NPU) | 16.7ms | Works but CPU is faster for small models |
| PyTorch (CPU) | 64ms | Heaviest, most compatible |
- Orange Pi 5 Plus (RK3588, 16GB RAM) — primary target
- Fedora x86_64 (RTX 3060) — GPU worker for LLM extraction
- Should work on any Linux system with Python 3.10+
The 97.0% benchmark was achieved on this exact stack:
| Component | Model | Purpose | Runtime |
|---|---|---|---|
| Embedding | all-MiniLM-L6-v2 (22M params) | Semantic vector search | ONNX Runtime on ARM CPU (0.3ms/embed) |
| Embedding (alt) | Qwen3-Embedding-0.6B | NPU-accelerated embedding | rkllama on RK3588 NPU |
| Reranker | Qwen3-Reranker-0.6B | Result reranking | rkllama on RK3588 NPU |
| Query Expansion | qmd-query-expansion 1.7B | Search query enrichment | rkllama on RK3588 NPU |
| LLM (extraction + answering) | Qwen3-4B | Fact extraction (72% recall) + QA from context | rkllama on RK3588 NPU (17s/turn) |
| Vector Store | SQLite + numpy | Cosine similarity search | CPU |
| Full-Text Search | SQLite FTS5 | Keyword search over archive | CPU |
| Knowledge Graph | SQLite | Temporal entity-relationship triples | CPU |
Everything runs on the Pi. No external server needed. The Qwen3-4B handles both fact extraction and question answering on the NPU. The ONNX embedding model runs in-process on the CPU. An optional GPU worker (e.g. Fedora with RTX 3060) can accelerate LLM tasks ~10x but is not required — the Pi is fully self-contained.
| Model | Size | Source |
|---|---|---|
| all-MiniLM-L6-v2 ONNX | 90MB | onnx-models/all-MiniLM-L6-v2-onnx |
| Qwen3-Embedding-0.6B RKLLM | 935MB | Pre-installed with rkllama |
| Qwen3-Reranker-0.6B RKLLM | 935MB | Pre-installed with rkllama |
| qmd-query-expansion 1.7B RKLLM | 2.4GB | Custom conversion |
| Qwen3-4B RKLLM | 4.6GB | dulimov/Qwen3-4B-rk3588-1.2.1-base |
# Install rkllama (serves models on the NPU)
# See: https://github.com/NotPunchnox/rkllama
# The setup script handles this automatically, or manually:
huggingface-cli download dulimov/Qwen3-4B-rk3588-1.2.1-base \
Qwen3-4B-rk3588-w8a8-opt-1-hybrid-ratio-0.0.rkllm \
--local-dir ~/.rkllama/models/qwen3-4b-chat
Not required, since the Pi is fully self-contained. A GPU worker gives ~10x speed on LLM tasks:
# On your GPU machine
ollama pull qwen3:4b # Same model as the Pi — same quality
# Point taOSmd at the GPU worker
export TAOSMD_LLM_URL=http://<gpu-machine>:11434
All components expose HTTP endpoints when used with the taOS server:
| Endpoint | Description |
|---|---|
| POST /api/kg/triples | Add a fact |
| GET /api/kg/query/{entity} | Query facts about an entity |
| POST /api/archive/record | Archive an event |
| GET /api/archive/events | Search archived events |
| POST /api/kg/classify | Classify memory type |
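As a rough illustration of calling two of the endpoints in the table from any framework, here is a standard-library sketch. Only the paths come from the table. The base URL, the port, and the JSON field names (`subject`, `predicate`, `object`) are assumptions made for this example, so check the taOS server for the real request shapes before relying on it.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:8000"  # hypothetical address of the taOS server


def build_add_triple(subject: str, predicate: str, obj: str,
                     base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST /api/kg/triples request. The body field names are assumed."""
    body = json.dumps({"subject": subject, "predicate": predicate, "object": obj})
    return urllib.request.Request(
        f"{base_url}/api/kg/triples",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_query_entity(entity: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a GET /api/kg/query/{entity} request with the entity URL-encoded."""
    return urllib.request.Request(
        f"{base_url}/api/kg/query/{quote(entity, safe='')}",
        method="GET",
    )


def send(request: urllib.request.Request, timeout: float = 10.0):
    """Send a request and decode the response, assuming the server replies with JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Building the request separately from sending it keeps the sketch easy to inspect: `build_add_triple("User", "prefers", "local models")` can be examined without a running server, and `send(...)` is only called once the server address is known.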
# Full LongMemEval-S benchmark (500 questions)
python benchmarks/longmemeval_runner.py
# Recall@5 only
python benchmarks/longmemeval_recall.py
# Per-category breakdown
python benchmarks/longmemeval_granularity.py
If taOSmd is useful to you:
- Star this repo — it helps others find it
- Donate: Buy Me a Coffee
- Contact: jaylfc25@gmail.com
- Hardware donations/loans: We test on real hardware. If you have spare SBCs, GPUs, or dev boards and want to help expand compatibility, reach out.
MIT
Core taOSmd (the 97.0% benchmark) is fully self-contained: it uses only standard packages (SQLite, numpy, ONNX Runtime) plus the MiniLM embedding model. No external servers or forked repos needed.
Optional integrations for the full taOS stack:
| Component | Source | Notes |
|---|---|---|
| QMD (reranking + query expansion) | jaylfc/qmd (fork) | Adds rkllama NPU backend and qmd serve mode. Upstream tobi/qmd doesn't have NPU support yet. |
| rkllama (NPU model serving) | NotPunchnox/rkllama | Upstream with minor patches for rerank endpoint |
| ONNX MiniLM | onnx-models/all-MiniLM-L6-v2-onnx | Standard pre-exported model |
| Qwen3-4B RKLLM | dulimov/Qwen3-4B-rk3588-1.2.1-base | Community RK3588 conversion |
Built by jaylfc. Part of the taOS ecosystem.
Benchmark dataset: LongMemEval (ICLR 2025)
Embedding model: all-MiniLM-L6-v2
