When long-running AI agents hit context limits, the standard approach is summarization -- an LLM rewrites the conversation as a condensed summary. This works, but it's lossy, and the loss compounds. By the third compaction you're working with a summary of a summary of a summary.
This repo tests an alternative: replace conversation content with lightweight pointers (chunk IDs with previews) and give the agent a tool to retrieve the originals on demand. Like a memory hierarchy -- hot data in context, cold data in storage, all addressable.
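The pointer mechanism can be sketched in a few lines. This is an illustrative sketch only, not the repo's actual API: the names `Ref`, `compactToPointers`, and `getRef` are assumptions.

```typescript
// Hypothetical sketch of pointer-based compaction. Originals go to a
// lossless ref store; the context keeps only IDs with short previews.
type Message = { id: string; role: string; content: string };
type Ref = { id: string; preview: string };

const refStore = new Map<string, string>();

// Replace each message body with a lightweight pointer.
function compactToPointers(messages: Message[], previewLen = 40): Ref[] {
  return messages.map((m) => {
    refStore.set(m.id, m.content); // full content kept, nothing lost
    return { id: m.id, preview: m.content.slice(0, previewLen) };
  });
}

// The retrieval tool the agent calls to dereference a pointer on demand.
function getRef(id: string): string | undefined {
  return refStore.get(id);
}
```

Because the store is lossless, re-compacting the pointers is a no-op for the underlying data, which is the property the cascade experiment exercises.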
Five experiments, run against Claude Sonnet 4.6 with eight synthetic session fixtures. Full writeup in FINDINGS.md, but the short version:
- Models reliably dereference pointers. 100% dereference rate with a prescriptive prompt and a `list_refs` tool.
- Pointers beat summarization on cascaded compaction. At one compaction the difference is 1 point (85% vs 84% grounding). At two compactions it's 18 points (92% vs 74%). Summaries degrade on re-compaction; pointers don't, because the ref store is lossless.
- Hybrid (summary + pointers) is a trap. The summary anchors the model on a lossy interpretation, undermining the retrieval. This has implications for any RAG system that prepends context summaries before retrieved chunks.
- Good ref descriptions matter more than good prompts. `"File: src/config.ts"` beats `"Tool result for tool_05"`. The model decides what to retrieve based on what it sees in `list_refs`, not what you tell it in the system prompt.
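The description point above can be made concrete with a small sketch of how a `list_refs` entry might be labeled. Field names here (`kind`, `path`, `toolName`) are assumptions for illustration, not the repo's actual schema.

```typescript
// Illustrative: derive a retrieval-friendly label from chunk metadata.
type Chunk = {
  id: string;
  kind: "file" | "tool_result";
  path?: string;     // set for file chunks
  toolName?: string; // set for tool-result chunks
};

function describeRef(chunk: Chunk): string {
  // A label grounded in content ("File: src/config.ts") gives the model
  // something to match a question against; an opaque label like
  // "Tool result for tool_05" does not.
  if (chunk.kind === "file" && chunk.path) return `File: ${chunk.path}`;
  if (chunk.kind === "tool_result" && chunk.toolName) {
    return `Result of ${chunk.toolName}`;
  }
  return `Chunk ${chunk.id}`; // opaque fallback: avoid when possible
}
```

The design point is that the description is computed at compaction time, when the original content is still available, rather than relying on prompt instructions at retrieval time.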
See PLAN.md for the hypothesis and experiment designs.
```sh
bun install
cp .env.example .env
# Add your Anthropic API key to .env

bun run exp:format     # Experiment 1: pointer format comparison
bun run exp:interface  # Experiment 2: retrieval interface comparison
bun run exp:strategy   # Experiment 3: compaction strategy comparison
bun run exp:cascade    # Experiment 4: cascaded compaction
bun test               # OpenCode grounding test
```

Experiments need session fixtures in `fixtures/sessions/`. Results go to `results/`.