Hierarchical, background tree organizing memory for long-running LLM agents.
Reduce context loss. Reduce broken agent tasks. Stop paying for tokens on chat history you don't need.
This repo has two parts — they are completely independent:
- 🧠 trace_memory/: the lightweight memory engine. Install it, import it, and integrate it into your own app. Zero UI, zero bloat.
- 🖥️ nexus_terminal/: an optional demo chatbot built on top of the engine. Use it to test TRACE live, explore how it works, and run experiments. You do not need it to use TRACE.
TRACE is a Python library that gives your LLM agent structured, searchable, self-organizing long-term memory.
Instead of naïvely stuffing an ever-growing chat log into every prompt, TRACE organises every conversation exchange into a hierarchical B+Tree of named topic branches. When the agent needs context, TRACE performs a fast cosine similarity search across topic summaries and retrieves only the surgically relevant branches — not the entire history.
At rest (while the agent is not actively chatting), TRACE's background reorganizer evaluates the entire tree against four strict axioms and merges semantically related branches under shared parents — inspired by memory consolidation processes.
The result: an agent designed to preserve cross-session constraints through hierarchical retrieval, reduce hallucination of stale context, and operate at a fraction of the token cost of sliding-window or full-history approaches.
Standard RAG works extremely well for:
- documentation search
- knowledge bases
- code retrieval
- enterprise search
The problem isn't RAG. The problem is using RAG as persistent memory.
| Failure Mode | What Happens | Real Impact |
|---|---|---|
| Temporal Blindness | RAG retrieves semantically similar chunks regardless of when they occurred. Old, overridden decisions surface alongside current ones. | Agent contradicts itself, repeats resolved problems, or reinstates abandoned plans. |
| Context Rot | Sliding windows drop early messages as the conversation grows. | Constraints set in message 3 are gone by message 50. The agent forgets that Sarah is allergic to peanuts. |
| Lossy Summarization | Compressing history into a single paragraph erases detail and nuance. | Agent loses track of branching plans, multi-hop constraints, and edge-case handling agreed upon earlier. |
A fixed-size sliding window is the most common approach — and it works well for on-demand, single-session queries where you need detailed, verbatim access to recent exchanges. But it is fundamentally unsuited for long-term agent memory:
- No semantic awareness: message #1 and message #200 are weighted equally as long as they fit.
- Guaranteed forgetting: anything outside the window is permanently gone from the agent's context.
- No structure: a flat list tells the LLM nothing about which topics are related or which branch an earlier constraint belongs to.
For a simple chatbot or a document Q&A tool, a sliding window is perfectly fine. For a long-running agent performing multi-step tasks across sessions, it is catastrophic.
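The forgetting behaviour is easy to reproduce. The toy sketch below is not part of TRACE: it models a sliding window with `collections.deque`, and the 50-message history and window size of 10 are illustrative assumptions.

```python
from collections import deque


def build_window(messages, window_size):
    """Keep only the most recent `window_size` messages, as a sliding window would."""
    return list(deque(messages, maxlen=window_size))


# Hypothetical 50-message history with a constraint stated in message 3.
history = [f"message {i}" for i in range(1, 51)]
history[2] = "Sarah is allergic to peanuts."

window = build_window(history, window_size=10)
constraint_visible = "Sarah is allergic to peanuts." in window
print(constraint_visible)  # False: the constraint fell out of the window
```

However large the window is, some message count exists beyond which the earliest constraints are dropped.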
MemGPT is incredibly powerful, but it functions like a full operating system with tiered memory (RAM, disk) that the LLM must explicitly learn to manage via function calls. This introduces significant overhead and requires highly capable models.
TRACE is meant to be a lightweight, drop-in component, not a full runtime environment. It focuses specifically on modelling conversations as a hierarchical tree to natively surface multi-hop constraints without forcing the LLM to actively manage its own memory banks.
TRACE builds on the open-source ChatIndex architecture (credit: Mingtian Zhang, Ray, VectifyAI). A modified version of ChatIndex's core logic is bundled directly within TRACE, which models conversation history as a B+Tree:
- Leaf nodes (MessageNodes) — raw user/assistant exchanges.
- Internal nodes (TopicNodes) — LLM-generated topic labels and summaries for each branch.
- Root — a virtual anchor node.
Every time an exchange is added (tree.add()), an LLM call classifies whether the exchange continues the current topic or starts a new branch. If a new branch, the LLM selects the most appropriate parent in the ancestry chain.
This gives TRACE a structured map of the entire conversation history — not a flat log — with topic metadata at every branch.
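As a rough mental model, the structure described above can be sketched with two small classes. This is an illustration only, not TRACE's or ChatIndex's actual implementation: the field names and the `add_child` and `ancestry` helpers are assumptions, and the LLM classification step is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class MessageNode:
    """Leaf: one raw message from a user/assistant exchange."""
    role: str
    content: str


@dataclass(eq=False)
class TopicNode:
    """Internal node: a named topic branch with an optional summary."""
    topic_name: str
    summary: str = ""
    parent: Optional["TopicNode"] = field(default=None, repr=False)
    children: List["TopicNode"] = field(default_factory=list)
    messages: List[MessageNode] = field(default_factory=list)

    def add_child(self, topic_name: str) -> "TopicNode":
        child = TopicNode(topic_name=topic_name, parent=self)
        self.children.append(child)
        return child


def ancestry(node: TopicNode, exclude_root: bool = True) -> List[TopicNode]:
    """Return the chain of topics from the root down to `node`."""
    chain: List[TopicNode] = []
    current: Optional[TopicNode] = node
    while current is not None:
        chain.append(current)
        current = current.parent
    chain.reverse()
    # The first element is always the virtual root (the only node without a parent).
    return chain[1:] if exclude_root else chain


root = TopicNode("ROOT")
physics = root.add_child("Physics Discussions")
holes = physics.add_child("Black Holes")
holes.messages.append(MessageNode("user", "What is Hawking radiation?"))

names = [n.topic_name for n in ancestry(holes)]
print(names)  # ['Physics Discussions', 'Black Holes']
```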
What TRACE adds to ChatIndex: ChatIndex primarily retrieves context through a single traversal path. While effective for hierarchical exploration, information spread across multiple semantically related branches may require multiple retrieval steps. TRACE augments this with vector-based retrieval across topic summaries, allowing context from multiple branches to be surfaced simultaneously.
The problem: A single ancestry path only captures the current conversational thread. Cross-branch constraints (e.g., "Sarah's allergy" in Branch 1, "party cake" in Branch 3) are invisible unless both branches are active.
The solution: Every time the agent needs to respond, PromptSynthesizer runs a cosine similarity search against the VectorDatabase of embedded topic summaries:
User query → embed() → query vector
↓
VDB: cosine search across ALL topic summaries
↓
Filter: keep all nodes above base cosine threshold
↓
Walk full ancestry of each qualifying node
↓
Deduplicate shared ancestor nodes
↓
Rank by similarity → take Top-3 paths
↓
Format compact multi-path context block
This means the LLM receives context from multiple relevant branches simultaneously, not just the current thread — saving thousands of tokens compared to full-history injection, while synthesising information across branches that were never explicitly connected.
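The pipeline above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the PromptSynthesizer source: the toy tree (`PARENT`), the 3-dimensional vectors, the `retrieve` function, and the 0.35 threshold are all hypothetical, and ranking happens before the ancestry walk for brevity.

```python
import math
from typing import Dict, List, Optional

# Hypothetical toy data: node_id -> parent_id (None marks a top-level branch).
PARENT: Dict[str, Optional[str]] = {
    "health": None, "allergies": "health",
    "party": None, "cake": "party", "recipe": "party",
    "weather": None,
}
# Hypothetical embeddings of topic summaries.
TOPIC_VECTORS: Dict[str, List[float]] = {
    "allergies": [1.0, 0.2, 0.0],
    "cake": [0.8, 0.6, 0.0],
    "recipe": [0.6, 0.8, 0.0],
    "weather": [0.0, 0.0, 1.0],
}


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ancestry_path(node_id: str) -> List[str]:
    """Walk parents upward, then return the path ordered from the top down."""
    path: List[str] = []
    current: Optional[str] = node_id
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return list(reversed(path))


def retrieve(query_vector: List[float], threshold: float = 0.35, top_paths: int = 3):
    # Score every topic summary and keep those above the base threshold.
    hits = [(cosine(query_vector, vec), node_id) for node_id, vec in TOPIC_VECTORS.items()]
    hits = [hit for hit in hits if hit[0] >= threshold]
    hits.sort(reverse=True)  # rank by similarity, best first

    seen = set()
    results = []
    for score, node_id in hits[:top_paths]:
        path = ancestry_path(node_id)
        new_nodes = [n for n in path if n not in seen]  # deduplicate shared ancestors
        seen.update(path)
        results.append({"leaf": node_id, "similarity": score,
                        "path": path, "new_nodes": new_nodes})
    return results


paths = retrieve([1.0, 0.3, 0.0])
for p in paths:
    print(p["leaf"], p["new_nodes"])
```

In this toy run the "weather" branch is filtered out, and the shared "party" ancestor is emitted only once even though two of its children qualify.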
The cross-branch synthesis stress test (validated with gpt-oss-20b via NVIDIA NIM):
Branch 1: "Sarah is allergic to peanuts."
Branch 2: "Weather in Tokyo?" (noise)
Branch 3: "Planning Sarah's surprise party, baking a cake."
Branch 4: "Fixing a bike tire?" (noise)
Branch 5: "Found a Thai Peanut Butter Cake recipe."
Prompt: "I'm making the Peanut Butter Cake for Sarah's party. Good idea?"
Result: The AI aggressively stopped the user.
The VDB surgically bypassed Branches 2 & 4 (noise),
retrieved Branches 1, 3, and 5,
and synthesised them to catch the life-threatening allergy conflict
— without any explicit link between them ever being made.
The problem: Long conversations naturally fragment. Topics discussed in different sessions may be semantically identical but live in separate branches, causing redundant retrieval and diluted context.
The solution: tree.reorganize() runs a conservative, four-rule-guarded merge pass:
Phase 1 — Collect all frozen (inactive) TopicNodes
Phase 2 — Generate missing summaries; embed all candidates
Phase 3 — Compute pairwise cosine similarity
Phase 4 — For each pair above threshold, apply 4 axioms:
Axiom 1 — Chronological Guard
Axiom 2 — Frozen State check
Axiom 3 — Similarity threshold (default 0.55)
Axiom 4 — LLM Veto
If all 4 pass → merge (newer becomes child of older)
Phase 5 — Optional: prune trivial leaf messages
| Axiom | Rule | Why |
|---|---|---|
| 1. Chronological Guard | The older node absorbs the newer one — never the reverse. | Preserves temporal ordering. The past cannot be restructured to appear after the present. |
| 2. Frozen State | Only nodes outside the currently active ancestry path may be merged. | The live conversation thread is never touched. Zero risk of corrupting the active agent state. |
| 3. Similarity Threshold | Cosine similarity between embeddings must exceed the threshold (default: 0.55). | Pre-filters pairs using pure math before wasting an LLM call. |
| 4. LLM Veto | The LLM independently confirms the merge makes semantic sense. | Catches false positives (e.g., two topics both mentioning "Python" — one about snakes, one about code). If vetoed, the merge is aborted. |
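The four axioms can be read as an ordered guard chain in which the LLM is consulted last. The sketch below illustrates that ordering only. It is not the reorganizer's actual code: `Candidate`, `check_merge`, and the `llm_approves` callback are hypothetical names, and the veto is faked with a keyword rule instead of a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Candidate:
    topic_name: str
    created_at: float  # unix timestamp
    is_frozen: bool    # True when outside the active ancestry path


def check_merge(
    older: Candidate,
    newer: Candidate,
    similarity: float,
    llm_approves: Callable[[Candidate, Candidate], bool],
    threshold: float = 0.55,
) -> Tuple[bool, str]:
    """Apply the axioms in order; the LLM is only called if the cheap checks pass."""
    if older.created_at >= newer.created_at:
        return False, "axiom 1: chronological guard"
    if not (older.is_frozen and newer.is_frozen):
        return False, "axiom 2: frozen state"
    if similarity <= threshold:
        return False, "axiom 3: similarity threshold"
    if not llm_approves(older, newer):
        return False, "axiom 4: LLM veto"
    return True, "merge: newer becomes child of older"


def fake_llm(older: Candidate, newer: Candidate) -> bool:
    # Stand-in for a real LLM call: veto when the newer topic is about snakes.
    return "snake" not in newer.topic_name.lower()


code = Candidate("Python packaging", created_at=100.0, is_frozen=True)
more_code = Candidate("Python virtualenvs", created_at=200.0, is_frozen=True)
snakes = Candidate("Python snake care", created_at=300.0, is_frozen=True)

print(check_merge(code, more_code, 0.72, fake_llm))
print(check_merge(code, snakes, 0.61, fake_llm))
```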
Analogy: Just like a human brain during sleep — when the body is at rest, the brain doesn't switch off. It replays the day's events, consolidates important memories into long-term storage, and prunes connections that are no longer relevant. TRACE does exactly this: when the agent is idle, it reorganizes its memory graph, surfaces hidden connections across branches, and prunes redundancy — all without corrupting the live agent state.
Short, throwaway exchanges ("ok", "thanks", "got it") pollute the tree with noise that wastes tokens and dilutes retrieval quality.
When prune_trivial_leaves=True is passed to reorganize(), TRACE detects MessageNodes where both the user and assistant messages are under 20 words, and soft-archives them — moving them to tree._archived_nodes instead of hard-deleting them. They are persisted to disk in case you ever need them, but they are excluded from all future retrieval and prompt synthesis.
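The rule itself is small enough to sketch. The code below is an approximation of the behaviour described above, not TRACE's implementation: the exchange dictionaries, `is_trivial`, and `prune_trivial` are hypothetical, and word counting by whitespace split is an assumption.

```python
from typing import Dict, List, Tuple

Exchange = Dict[str, str]  # hypothetical shape: {"user": ..., "assistant": ...}


def is_trivial(exchange: Exchange, max_words: int = 20) -> bool:
    """An exchange is trivial only when BOTH sides are under `max_words` words."""
    return (len(exchange["user"].split()) < max_words
            and len(exchange["assistant"].split()) < max_words)


def prune_trivial(exchanges: List[Exchange],
                  max_words: int = 20) -> Tuple[List[Exchange], List[Exchange]]:
    """Split exchanges into (kept, archived); nothing is deleted."""
    kept: List[Exchange] = []
    archived: List[Exchange] = []
    for exchange in exchanges:
        (archived if is_trivial(exchange, max_words) else kept).append(exchange)
    return kept, archived


long_answer = " ".join(["detail"] * 25)  # a 25-word assistant reply
exchanges = [
    {"user": "ok", "assistant": "Got it."},
    {"user": "Is the cake safe for Sarah?", "assistant": long_answer},
]
kept, archived = prune_trivial(exchanges)
print(len(kept), len(archived))  # 1 1
```

Note that a short question with a long answer is kept, because only one side falls under the word limit.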
git clone https://github.com/husain34/TRACE.git
cd TRACE
pip install -e .

| Package | Purpose |
|---|---|
| openai>=1.0.0 | LLM API client (works with any OpenAI-compatible endpoint) |
| python-dotenv>=1.0.0 | .env file support |
Note: TRACE requires the bundled trace._llm_utils module (which provides ChatGPT_API and extract_json) for internal LLM calls.

Note: TRACE's VectorDatabase uses only Python's built-in sqlite3 and struct modules, so no external vector DB dependency is required.
If you already have a chat loop and just want to plug TRACE in, this is all you need:
import openai
from trace_memory import CTree, VectorDatabase, PromptSynthesizer
# 1. Boot
client = openai.OpenAI(api_key="sk-...", base_url="http://127.0.0.1:1234/v1")
def embed(text): return client.embeddings.create(input=[text], model="nomic-embed-text").data[0].embedding
tree = CTree(api_key="sk-...", model="gpt-4o-mini", embed_fn=embed)
tree.vdb = VectorDatabase("session.db")
synth = PromptSynthesizer(ctree=tree, vector_db=tree.vdb)
# 2. Each turn: build prompt → call LLM → store exchange
while True:
    user_input = input("You: ")
    system_prompt = synth.synthesize_prompt(
        user_query=user_input,
        query_vector=embed(user_input),
        active_node=tree.current_node,
        recent_messages=tree.conversation[-6:],
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": system_prompt},
                  *tree.conversation[-10:],
                  {"role": "user", "content": user_input}],
    )
    reply = response.choices[0].message.content
    tree.add([{"role": "user", "content": user_input},
              {"role": "assistant", "content": reply}])
    print(f"AI: {reply}")
# 3. When the agent is idle, consolidate memory
stats = tree.reorganize(similarity_threshold=0.55, prune_trivial_leaves=True)

TRACE comes with a fully featured Terminal UI chatbot out of the box. It is designed as a lightweight sandbox for testing TRACE: seeing how it works, running tests, and exploring the engine without heavy frontend overhead.
Key Features included in the terminal:
- Dynamic VRAM Swapping: Hot-swaps models on the fly (unloading Text, loading Vision) to prevent local GPU crashes.
- Live Web Search: Pre-generation routing secretly checks DuckDuckGo to prevent hallucinations on current events.
- Visualizing the B+Tree: Use /tree to instantly print and inspect the live hierarchical memory map.
- Multimodal Ingestion: Drop images in the folder and use /ingest to extract rich descriptions directly into long-term memory.
- Gorgeous TUI: Threaded background spinners and ANSI colors so the UI never freezes while testing.
To run it immediately:
- Navigate to the terminal folder:
  cd nexus_terminal
- Install the UI dependencies:
  pip install -r requirements.txt
- Configure your models: rename .env.example to .env and adjust the models/URLs to point to your local LM Studio or OpenAI endpoints.
- Boot the engine:
  python terminal.py
The hierarchical conversation memory tree.
from trace_memory import CTree

CTree(
    max_children: int = 5,
    api_key: str = None,        # falls back to OPENAI_API_KEY env var
    model: str = "gpt-4o-mini",
    auto_save_path: str = None, # auto-saves tree structure (not VDB) after every add() if set
    embed_fn: Callable = None,  # optional: inject your embed function at construction time
)

tree.vdb = VectorDatabase("session.db")  # VDB for semantic retrieval
tree.embed_fn = embed                    # callable(text: str) -> List[float]

Ingest one completed exchange into the tree.
tree.add([
{"role": "user", "content": "What is quantum entanglement?"},
{"role": "assistant", "content": "Quantum entanglement is ..."},
])
# With an optional system/context message
tree.add([
{"role": "system", "content": "[Tool result]: 42.3°C"},
{"role": "user", "content": "Is that dangerous?"},
{"role": "assistant", "content": "Yes, 42.3°C is critically high ..."},
])

Run one self-healing reorganization pass.
stats = tree.reorganize(
embed_fn = embed, # optional: overrides tree.embed_fn
similarity_threshold = 0.60, # raise for more conservative merges
prune_trivial_leaves = True, # archive short throwaway messages
)
print(stats)
# {'merged': 3, 'pruned': 7, 'skipped': 12, 'duration_secs': 4.2}

When to call: Periodically when the agent is idle, not after every message.
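One way to honour that guidance is a small idle gate around the reorganize call. The sketch below is an assumption about how an application might schedule it, not something TRACE ships: `IdleReorganizer`, `touch`, `maybe_run`, and the 300-second default are all hypothetical.

```python
import time
from typing import Any, Callable, Optional


class IdleReorganizer:
    """Run a reorganize callback once the agent has been idle long enough."""

    def __init__(self, reorganize: Callable[[], Any],
                 idle_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._reorganize = reorganize
        self._idle_seconds = idle_seconds
        self._clock = clock
        self._last_activity = clock()
        self._dirty = False  # True when new exchanges arrived since the last pass

    def touch(self) -> None:
        """Call after every stored exchange to reset the idle timer."""
        self._last_activity = self._clock()
        self._dirty = True

    def maybe_run(self) -> Optional[Any]:
        """Run the callback if idle and dirty; otherwise return None."""
        idle_for = self._clock() - self._last_activity
        if self._dirty and idle_for >= self._idle_seconds:
            self._dirty = False
            return self._reorganize()
        return None
```

In an application this could wrap the real call, for example `IdleReorganizer(lambda: tree.reorganize(prune_trivial_leaves=True))`, with `touch()` invoked after each `tree.add()` and `maybe_run()` polled from a timer or background loop.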
Persist the tree to JSON.
tree.save("sessions/chat_001.json", save_conversation=True)

save_conversation=True embeds the raw message list so the session can be fully restored later.
Restore a tree from a JSON file.
tree = CTree.load("sessions/chat_001.json", api_key="sk-...", embed_fn=embed)
tree.vdb = VectorDatabase("sessions/chat_001.db")

Return the ordered ancestry chain from root down to node.
path = tree.get_ancestors(tree.current_node, include_self=True, exclude_root=True)
for node in path:
    print(f"  {node.topic_name}: {node.summary}")

Manually trigger LLM summarisation of all frozen branches.
Called automatically during save() and internally during reorganize(). Can be used to manually pre-warm summaries if desired.
Pretty-print the tree to stdout.
tree.print_tree(show_messages=True)

Example output:
ROOT (sub-nodes: 3)
├─ Physics Discussions [0:12] (6 msgs)
Covered quantum entanglement and black hole thermodynamics.
├─ Quantum Entanglement [0:6] (3 msgs)
├─ Black Holes [6:12] (3 msgs)
├─ Party Planning [12:20] (4 msgs)
Planning Sarah's surprise birthday party logistics.
| Attribute | Type | Description |
|---|---|---|
| tree.conversation | List[dict] | Flat list of all raw messages in chronological order. |
| tree.current_node | TopicNode | The currently active topic branch. |
| tree.root | TopicNode | The virtual root of the tree. |
| tree._archived_nodes | List[MessageNode] | Soft-archived trivial leaf messages. |
| tree.auto_save_path | str \| None | If set, auto-saves the tree structure (not the VDB) after every add(). |
A local SQLite vector store with two active tables: conversation vectors and topic summaries.
Note on Scaling: TRACE uses pure Python (sqlite3 and struct) to avoid heavy dependencies and run instantly on any machine. This is sufficient for local agents and moderate conversation histories. For massive scale (millions of vectors), however, you will need to swap this module for a dedicated vector store such as FAISS or Chroma.
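To show why no external dependency is needed, here is a minimal sketch of storing and searching embeddings with only `sqlite3` and `struct`. It is not the VectorDatabase source: the table name, schema, float32 packing, and the brute-force scan are assumptions chosen for illustration.

```python
import math
import sqlite3
import struct
from typing import Dict, List


def pack_vector(vec: List[float]) -> bytes:
    """Serialise a vector as little-endian float32 values."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack_vector(blob: bytes) -> List[float]:
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


conn = sqlite3.connect(":memory:")  # use a file path to persist
conn.execute(
    "CREATE TABLE topic_summaries ("
    "node_id TEXT PRIMARY KEY, summary TEXT, embedding BLOB)"
)


def upsert(node_id: str, summary: str, vec: List[float]) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO topic_summaries VALUES (?, ?, ?)",
        (node_id, summary, pack_vector(vec)),
    )


def search(query: List[float], top_k: int = 3,
           min_similarity: float = 0.35) -> List[Dict]:
    """Brute-force scan: score every row, filter, then keep the best top_k."""
    rows = conn.execute(
        "SELECT node_id, summary, embedding FROM topic_summaries"
    ).fetchall()
    scored = [
        {"node_id": node_id, "summary": summary,
         "similarity": cosine(query, unpack_vector(blob))}
        for node_id, summary, blob in rows
    ]
    scored = [s for s in scored if s["similarity"] >= min_similarity]
    scored.sort(key=lambda s: s["similarity"], reverse=True)
    return scored[:top_k]


upsert("n1", "Sarah is allergic to peanuts.", [1.0, 0.0])
upsert("n2", "Weather in Tokyo.", [0.0, 1.0])
hits = search([1.0, 0.0])
print([h["node_id"] for h in hits])  # ['n1']
```

A linear scan like this costs one similarity computation per stored row, which is why the approach suits moderate histories and not millions of vectors.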
from trace_memory import VectorDatabase

vdb = VectorDatabase("path/to/session.db")  # creates the DB if it doesn't exist

Store an embedded past conversation message for cross-thread recall.
from trace_memory import ConversationVector
import time, uuid
msg = ConversationVector(
message_id = str(uuid.uuid4()),
message_index = 0,
role = "user",
text = "Sarah is allergic to peanuts.",
embedding = embed("Sarah is allergic to peanuts."),
timestamp = time.time(),
thread_path = "ROOT → Health Constraints → Allergies",
)
vdb.add_conversation_message(msg)

Retrieve semantically similar past messages from any branch.
recalls = vdb.search_conversation(
query_vector = embed("Is the cake safe for Sarah?"),
top_k = 3,
min_similarity = 0.45,
)
for r in recalls:
    print(f"[{r.similarity:.2f}] {r.thread_path} | {r.role}: {r.text}")

Insert or update a topic node's embedding. Called automatically by CTree when a node is frozen and summarised.
Used internally by PromptSynthesizer for surgical multi-path retrieval.
hits = vdb.search_topic_summaries(
query_vector = embed("peanut allergy constraint"),
top_k = 3,
min_similarity = 0.35,
)
# Returns: [{"node_id": "...", "topic_name": "...", "summary": "...", "similarity": 0.82}, ...]

Remove a topic embedding by node ID. Called automatically during reorganize() when a node is moved.
Assembles the full RAG-enriched system prompt.
from trace_memory import PromptSynthesizer
synth = PromptSynthesizer(ctree=tree, vector_db=vdb)

system_prompt = synth.synthesize_prompt(
user_query = "Is the cake safe for Sarah?",
query_vector = embed("Is the cake safe for Sarah?"),
active_node = tree.current_node,
recent_messages = tree.conversation[-6:],
top_k_history = 2, # max past messages to recall
min_history_similarity = 0.50, # min cosine score for conversation recall
)

The returned string is a complete system prompt. Pass it directly as the system role message to your LLM.
ConversationVector(
message_id: str,
message_index: int,
role: str, # "user" | "assistant" | "system"
text: str,
embedding: List[float],
timestamp: float, # unix timestamp
thread_path: str, # e.g. "ROOT → Physics → Black Holes"
similarity: float = 0.0,
)

import os
os.environ["OPENAI_BASE_URL"] = "http://127.0.0.1:1234/v1"
os.environ["OPENAI_API_KEY"] = "lm-studio"
from trace_memory import CTree
tree = CTree(model="meta-llama-3.1-8b-instruct")

Or use a .env file in your project root:
OPENAI_BASE_URL=http://127.0.0.1:1234/v1
OPENAI_API_KEY=lm-studio
from trace_memory import CTree
tree = CTree(api_key="sk-...", model="gpt-4o-mini")

import os
os.environ["OPENAI_BASE_URL"] = "https://integrate.api.nvidia.com/v1"
os.environ["OPENAI_API_KEY"] = "nvapi-..."
from trace_memory import CTree
tree = CTree(model="meta/llama-3.1-70b-instruct")

TRACE was validated against five adversarial test scenarios:
| Test | Description | Result |
|---|---|---|
| Needle in a Haystack | A critical constraint buried deep in a 200-message session | ✅ Retrieved with >0.7 cosine score |
| Memory Overwrites | User updates a constraint ("actually, Sarah can eat nuts now") | ✅ Newer node supersedes older via Chronological Guard |
| Semantic Drift (Veto Test) | Two topics share keywords but are in different domains (Python the snake vs. Python the language) | ✅ LLM Veto correctly aborted the merge |
| Ship of Theseus | Gradual topic drift — same entity discussed across 10 different branches | ✅ Reorganizer correctly consolidated into a shared parent |
| Multi-Hop Reasoning | Answer requires synthesising info from 3 non-adjacent branches (allergy + party + recipe) | ✅ Surgical retrieval surfaced all 3; LLM synthesised the conflict |
| Variable | Default | Description |
|---|---|---|
| OPENAI_BASE_URL | http://127.0.0.1:1234/v1 | LLM API endpoint |
| OPENAI_API_KEY | lm-studio | API key |
Set these in a .env file or directly in your shell.
TRACE loads .env automatically if python-dotenv is installed.
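For application code that wants the same fallback behaviour, a resolver might look like the sketch below. It is an assumption about how such resolution could be written, not TRACE's internal logic: `resolve_llm_config` is a hypothetical helper, and the defaults are copied from the table above.

```python
import os
from typing import Dict, Mapping, Optional

DEFAULT_BASE_URL = "http://127.0.0.1:1234/v1"
DEFAULT_API_KEY = "lm-studio"


def resolve_llm_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read endpoint settings from the environment, falling back to the defaults."""
    if env is None:
        try:
            # Optional: populate os.environ from a .env file when python-dotenv is present.
            from dotenv import load_dotenv
            load_dotenv()
        except ImportError:
            pass
        env = os.environ
    return {
        "base_url": env.get("OPENAI_BASE_URL", DEFAULT_BASE_URL),
        "api_key": env.get("OPENAI_API_KEY", DEFAULT_API_KEY),
    }


# Passing an explicit mapping makes the behaviour easy to check without touching os.environ.
local = resolve_llm_config({})
nim = resolve_llm_config({"OPENAI_BASE_URL": "https://integrate.api.nvidia.com/v1",
                          "OPENAI_API_KEY": "nvapi-..."})
print(local["base_url"])
```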
| Package | Version | Required |
|---|---|---|
| openai | ≥ 1.0.0 | ✅ Yes |
| python-dotenv | ≥ 1.0.0 | ✅ Yes |
| sqlite3 | built-in | ✅ Yes (no install needed) |
| struct | built-in | ✅ Yes (no install needed) |
Pull requests are welcome. For major changes, please open an issue first.
Areas where contributions are especially valuable:
- Additional embedding model adapters (Sentence Transformers, Cohere, etc.)
- Async support for add() and reorganize()
- Web UI for tree visualisation
- Benchmarks against other memory architectures (MemGPT, Zep, etc.)
This project is licensed under the Apache License 2.0. See LICENSE for details.
Note: This project includes code from ChatIndex, licensed under Apache 2.0. See NOTICE for details.
Built by Husain Ghulam.



