A coding agent that builds and queries a structured, persistent memory of a codebase instead of re-reading files at every step. It indexes file purposes, symbol signatures, cross-file dependencies, and design decisions into a SQLite store, then queries that store before touching the filesystem. Stale entries are detected via content hashing and invalidated automatically.
The result: fewer redundant file reads, consistent decisions across long multi-step tasks, and an explicit trace proving both.
The system has five components, each with a single responsibility:
| Component | File | Role |
|---|---|---|
| Executor | src/executor.py | Orchestrates the per-step loop: decompose task → query memory → check staleness → read only if needed → act → log decisions |
| Memory Store | src/memory_store.py | SQLite-backed persistence with three tables (files, symbols, decisions), WAL journaling, indexed symbol lookups, and foreign-key cascades |
| Query Interface | src/query.py | Two-tier retrieval: fast path-and-symbol matching first, LLM-based semantic ranking only when the fast path finds too little. Staleness is a hard gate: stale entries are never returned |
| Staleness Checker | src/staleness.py | SHA-256 content hashing. Compares the stored hash against the current file on disk. Enforced as a non-bypassable gate inside the query path |
| Indexer | src/indexer.py | Extracts structured facts from source files via LLM. Uses file-type-aware prompts (code vs config vs test) because a single generic prompt produces inconsistent results |
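The two-tier retrieval in the Query Interface row can be sketched as follows. This is a minimal illustration, not the code in src/query.py: the `Index` shape, `fast_match`, `query`, and the `semantic_rank` callable are hypothetical names, the ranker stands in for the LLM call, and the staleness gate is left out to keep the example short. The threshold of 2 matches the "fewer than 2 results" rule stated in the design decisions.

```python
from typing import Callable

# Hypothetical in-memory index: file path -> symbol names defined in that file.
Index = dict[str, list[str]]
# Hypothetical ranker signature: (task, candidate paths) -> ranked paths.
Ranker = Callable[[str, list[str]], list[str]]


def fast_match(task: str, index: Index) -> list[str]:
    """Tier 1: files whose file name or symbol names appear in the task text."""
    text = task.lower()
    hits = []
    for path, symbols in index.items():
        filename = path.rsplit("/", 1)[-1].lower()
        if filename in text or any(sym.lower() in text for sym in symbols):
            hits.append(path)
    return hits


def query(task: str, index: Index, semantic_rank: Ranker,
          min_fast_hits: int = 2) -> list[str]:
    """Use the cheap tier first; call the ranker only if it finds too little."""
    hits = fast_match(task, index)
    if len(hits) >= min_fast_hits:
        return hits
    # Tier 2: the expensive ranker (an LLM call in the real system).
    ranked = semantic_rank(task, list(index))
    return hits + [p for p in ranked if p in index and p not in hits]


index = {
    "src/auth/session.py": ["create_session", "validate_session"],
    "src/auth/tokens.py": ["issue_token"],
    "src/db/models.py": ["UserModel"],
}


def stub_ranker(task: str, paths: list[str]) -> list[str]:
    # Stand-in for LLM ranking: prefer anything under auth/.
    return [p for p in paths if "/auth/" in p]


print(query("update create_session and issue_token", index, stub_ranker))
print(query("fix session expiry handling", index, stub_ranker))
```

The first call is resolved entirely by the fast path; the second mentions no known file name or symbol, so it falls through to the ranker.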
┌─────────────────────────────────────────────────────────────────┐
│ For each step in the decomposed task: │
│ │
│ 1. Query memory for files/symbols relevant to this step │
│ 2. Staleness gate: check content hash for every match │
│ ├─ Fresh → serve from memory (no file I/O) │
│ └─ Stale → invalidate, add to "needs fresh read" set │
│ 3. Read only files that memory couldn't cover │
│ 4. Index freshly read files into SQLite for future steps │
│ 5. Execute the step via LLM, with prior decisions as context │
│ 6. If a naming/pattern decision was made, log it to the │
│ decision table so later steps follow it consistently │
└─────────────────────────────────────────────────────────────────┘
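Steps 1 to 4 of the loop above can be sketched in a few lines. This is an illustration under stated assumptions rather than the executor's actual code: `run_step`, the dict-based `memory`, and the injected `is_stale` and `read_and_index` callables are all hypothetical stand-ins for the memory store, the staleness checker, and the indexer.

```python
from typing import Callable


def run_step(relevant: list[str],
             memory: dict[str, str],
             is_stale: Callable[[str], bool],
             read_and_index: Callable[[str], str]) -> tuple[dict[str, str], list[str]]:
    """Serve fresh entries from memory; invalidate and re-read everything else."""
    served: dict[str, str] = {}
    fresh_reads: list[str] = []
    for path in relevant:
        if path in memory and not is_stale(path):
            served[path] = memory[path]          # fresh: reuse stored facts
        else:
            memory.pop(path, None)               # stale or unknown: invalidate
            memory[path] = read_and_index(path)  # read the file and index it
            fresh_reads.append(path)
    return served, fresh_reads


# Usage with stubs: a set of stale paths and a fake reader.
memory: dict[str, str] = {}
stale_paths: set[str] = set()
reads: list[str] = []


def fake_read(path: str) -> str:
    reads.append(path)
    return f"facts about {path}"


print(run_step(["_api.py"], memory, stale_paths.__contains__, fake_read))  # cold start
print(run_step(["_api.py"], memory, stale_paths.__contains__, fake_read))  # memory hit
stale_paths.add("_api.py")
print(run_step(["_api.py"], memory, stale_paths.__contains__, fake_read))  # re-read
```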
files (file_path PK, content_hash, purpose, depends_on JSON, indexed_at)
symbols (id PK, file_path FK→files CASCADE, name, signature, line, kind)
decisions (id PK, decision, reasoning, step_number, related_files JSON, timestamp)
-- Indexed for fast lookups
CREATE INDEX idx_symbols_name ON symbols(name);
CREATE INDEX idx_symbols_file_path ON symbols(file_path);
CREATE INDEX idx_decisions_step ON decisions(step_number);

{
"file_path": "src/auth/session.py",
"content_hash": "a3f9...",
"purpose": "Manages user session creation, validation, and expiry",
"key_symbols": [
{"name": "create_session", "signature": "create_session(user_id: str) -> Session", "line": 24, "kind": "function"},
{"name": "validate_session", "signature": "validate_session(token: str) -> bool", "line": 41, "kind": "function"}
],
"depends_on": ["src/auth/tokens.py", "src/db/models.py"],
"indexed_at": "2026-06-22T18:45:11Z"
}

Evaluated on encode/httpx (23 source files), a production HTTP client library.
Task: Add a request_id parameter to all public API functions and propagate it through the Client and AsyncClient call chains for distributed tracing, following a consistent naming convention.
| Metric | Memory Agent | Naive Baseline |
|---|---|---|
| File reads | 4 | 7 |
| Memory queries | 8 | 0 |
| Memory hits | 3 | 0 |
| Reads avoided | 3 | — |
| Reduction | 42.9% | — |
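The reduction figure follows from the two read counts in the table. A small check, assuming reduction is defined as reads avoided divided by baseline reads (the function name `read_reduction` is illustrative):

```python
def read_reduction(baseline_reads: int, agent_reads: int) -> float:
    """Percentage of baseline file reads that the agent avoided."""
    if baseline_reads <= 0:
        raise ValueError("baseline_reads must be positive")
    return 100.0 * (baseline_reads - agent_reads) / baseline_reads


# 7 baseline reads, 4 agent reads: 3 reads avoided out of 7.
print(f"{read_reduction(7, 4):.1f}%")  # 42.9%
```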
The agent logged a naming convention (request_id: str | None = None) at step 1. Across the remaining 7 steps, it produced 19 explicit decision-reuse events — each one a later step retrieving and following the convention from the decision table rather than re-deciding it independently.
DECIDED step 1: parameter naming and default value convention
DECIDED step 2: request_id parameter naming and default value convention
REUSED step 3: followed → "request_id: default None, backward compatible"
REUSED step 5: followed → "request_id: default None, backward compatible"
REUSED step 6: followed → "request_id: default None, backward compatible"
REUSED step 7: followed → "request_id: default None, backward compatible"
REUSED step 8: followed → "request_id: default None, backward compatible"
... (19 total reuse events across steps 2-8)
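The log-then-reuse pattern in this trace can be sketched against the decisions table. The column names come from the schema above; the column types and the helper functions `log_decision` and `prior_decisions` are assumptions made for illustration, not the project's API.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE decisions (
        id            INTEGER PRIMARY KEY,
        decision      TEXT NOT NULL,
        reasoning     TEXT,
        step_number   INTEGER NOT NULL,
        related_files TEXT,                        -- JSON-encoded list
        timestamp     TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
conn.execute("CREATE INDEX idx_decisions_step ON decisions(step_number)")


def log_decision(conn: sqlite3.Connection, step: int, decision: str,
                 reasoning: str, related_files: list[str]) -> None:
    """Record a convention so that later steps can retrieve it."""
    conn.execute(
        "INSERT INTO decisions (decision, reasoning, step_number, related_files) "
        "VALUES (?, ?, ?, ?)",
        (decision, reasoning, step, json.dumps(related_files)),
    )


def prior_decisions(conn: sqlite3.Connection, step: int) -> list[str]:
    """Decisions logged by earlier steps, oldest first."""
    rows = conn.execute(
        "SELECT decision FROM decisions WHERE step_number < ? "
        "ORDER BY step_number, id",
        (step,),
    ).fetchall()
    return [row[0] for row in rows]


log_decision(conn, 1, "request_id: default None, backward compatible",
             "existing callers must keep working", ["_api.py"])
for decision in prior_decisions(conn, 3):
    print(f"REUSED step 3: followed -> {decision!r}")
```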
| Step | Description | Memory | Fresh Reads | Decisions Reused |
|---|---|---|---|---|
| 1 | Review public API in _api.py | MISS | 1 | — |
| 2 | Establish naming convention | MISS | 0 | 1 |
| 3 | Modify _api.py functions | HIT | 0 | 2 |
| 4 | Review Client/AsyncClient in _client.py | MISS | 1 | 2 |
| 5 | Modify Client class | HIT | 0 | 3 |
| 6 | Modify AsyncClient class | HIT | 0 | 4 |
| 7 | Validate parameter propagation | MISS | 0 | 5 |
| 8 | Update documentation | MISS | 2 | 2 |
Steps 3, 5, and 6 were memory hits that required zero file reads: the agent served everything from its indexed memory of _api.py and _client.py. Steps 1 and 4 were the cold starts for _api.py and _client.py respectively, and each modification step that followed them was served from memory.
# Clone and set up
cd code_memory_agent
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env # add your OPENROUTER_API_KEY
# Run against any codebase
python -m code_memory_agent.run \
--codebase /path/to/repo \
--task "rename all helper functions to snake_case consistently" \
-v
# Run tests
pytest tests/ -v

| Flag | Description |
|---|---|
| --codebase | Path to the target repository root |
| --task | Natural language description of the multi-step task |
| --memory-file | Path to SQLite DB (default: <codebase>/.code_memory.db) |
| --output | Path to save the JSON efficiency report |
| -v | Verbose logging; shows per-step memory/staleness decisions |
17 unit tests covering the staleness and storage layer:
test_staleness.py::TestComputeHash::test_same_content_same_hash PASSED
test_staleness.py::TestComputeHash::test_different_content_different_hash PASSED
test_staleness.py::TestComputeHash::test_nonexistent_file_returns_empty PASSED
test_staleness.py::TestIsStaleFresh::test_freshly_indexed_file_is_not_stale PASSED
test_staleness.py::TestIsStaleModified::test_modified_file_is_stale PASSED
test_staleness.py::TestIsStaleModified::test_unmodified_sibling_stays_fresh PASSED
test_staleness.py::TestIsStaleEdgeCases::test_never_indexed_file_is_stale PASSED
test_staleness.py::TestIsStaleEdgeCases::test_deleted_file_is_stale PASSED
test_staleness.py::TestInvalidation::test_invalidation_removes_only_stale_file PASSED
test_staleness.py::TestInvalidation::test_invalidation_of_fresh_file_does_nothing PASSED
test_staleness.py::TestInvalidation::test_invalidation_clears_symbols PASSED
test_staleness.py::TestBatchStaleness::test_batch_check PASSED
test_staleness.py::TestMemoryStoreSQL::test_set_and_get_file PASSED
test_staleness.py::TestMemoryStoreSQL::test_overwrite_file_replaces_symbols PASSED
test_staleness.py::TestMemoryStoreSQL::test_decision_log PASSED
test_staleness.py::TestMemoryStoreSQL::test_summary_stats PASSED
test_staleness.py::TestMemoryStoreSQL::test_persistence_across_reopen PASSED
Key properties tested:
- Isolation: modifying file A does not invalidate file B's memory
- Cascade: invalidating a file removes its symbol index entries
- Persistence: data survives store close and reopen
- Correctness: deleted files, never-indexed files, and modified files are all correctly identified as stale
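The properties above can be illustrated with a minimal content-hash check. The names `compute_hash` and `is_stale` echo the test class names, but the signatures and the dict of stored hashes are assumptions for this sketch; the real checker reads stored hashes from SQLite.

```python
import hashlib
import tempfile
from pathlib import Path


def compute_hash(path: Path) -> str:
    """SHA-256 hex digest of the file's bytes, or "" if it cannot be read."""
    try:
        return hashlib.sha256(path.read_bytes()).hexdigest()
    except OSError:
        return ""


def is_stale(path: Path, stored_hashes: dict[Path, str]) -> bool:
    """Stale if never indexed, deleted, or changed since it was indexed."""
    stored = stored_hashes.get(path)
    if stored is None:
        return True                      # never indexed
    current = compute_hash(path)
    return current == "" or current != stored


with tempfile.TemporaryDirectory() as tmp:
    a, b = Path(tmp) / "a.py", Path(tmp) / "b.py"
    a.write_text("x = 1\n")
    b.write_text("y = 2\n")
    stored = {a: compute_hash(a), b: compute_hash(b)}

    a.write_text("x = 99\n")             # modify only a
    print(is_stale(a, stored))           # True: content changed
    print(is_stale(b, stored))           # False: sibling is unaffected
    b.unlink()
    print(is_stale(b, stored))           # True: file was deleted
```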
code_memory_agent/
├── src/
│ ├── memory_store.py # SQLite schema, CRUD, symbol index, decision log
│ ├── indexer.py # LLM-powered structured extraction (code/config/test)
│ ├── staleness.py # SHA-256 content-hash invalidation
│ ├── query.py # Two-tier memory lookup with staleness hard gate
│ ├── executor.py # Task decomposition and per-step orchestration
│ └── llm_client.py # OpenRouter client with retry logic
├── benchmark/
│ └── efficiency_report.py # Naive-baseline comparison, rich terminal output
├── tests/
│ └── test_staleness.py # 17 unit tests for staleness + storage correctness
├── results/
│ └── httpx_run.json # Full trace from the httpx evaluation
├── assets/
│ └── architecture.svg # System architecture diagram
├── run.py # CLI entry point
├── requirements.txt
└── .env.example
| Decision | Rationale |
|---|---|
| SQLite over JSON files | ACID transactions, indexed symbol lookups via SQL, WAL for read concurrency, foreign-key cascades for cleanup — without requiring a running server |
| SHA-256 staleness as a hard gate | A memory system that serves stale facts as current is worse than no memory. The check lives inside the query path and cannot be bypassed by the executor |
| File-type-aware indexer prompts | Config files have settings, test files have test functions, code files have signatures. A single generic extraction prompt produces inconsistent results across these |
| Two-tier query (path match → LLM ranking) | Most lookups can be resolved by matching file paths or symbol names mentioned in the task description. LLM-based semantic ranking is expensive and only fires when the fast path finds fewer than 2 results |
| Executor plans but does not modify files | Keeps the benchmark clean (measuring reads, not write correctness) and the demo safe. The action trace describes exactly what changes would be made |
| Decision log with step attribution | Enables the consistency proof: step N logs a convention, step N+3 retrieves it and follows it. The trace captures both the logging and the reuse explicitly |
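The SQLite properties named in the first row (WAL journaling, indexed lookups, foreign-key cascades) can be sketched with the standard `sqlite3` module. Table and column names follow the schema shown earlier; the column types and the `open_store` helper are assumptions for this example, not the contents of src/memory_store.py.

```python
import sqlite3
import tempfile
from pathlib import Path

SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    file_path    TEXT PRIMARY KEY,
    content_hash TEXT NOT NULL,
    purpose      TEXT,
    depends_on   TEXT,      -- JSON-encoded list
    indexed_at   TEXT
);
CREATE TABLE IF NOT EXISTS symbols (
    id        INTEGER PRIMARY KEY,
    file_path TEXT NOT NULL REFERENCES files(file_path) ON DELETE CASCADE,
    name      TEXT NOT NULL,
    signature TEXT,
    line      INTEGER,
    kind      TEXT
);
CREATE INDEX IF NOT EXISTS idx_symbols_name ON symbols(name);
CREATE INDEX IF NOT EXISTS idx_symbols_file_path ON symbols(file_path);
"""


def open_store(db_path: Path) -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")   # readers do not block the writer
    conn.execute("PRAGMA foreign_keys=ON")    # off by default, set per connection
    conn.executescript(SCHEMA)
    return conn


with tempfile.TemporaryDirectory() as tmp:
    conn = open_store(Path(tmp) / "memory.db")
    path = "src/auth/session.py"
    conn.execute("INSERT INTO files (file_path, content_hash) VALUES (?, ?)",
                 (path, "a3f9"))
    conn.execute("INSERT INTO symbols (file_path, name, kind) VALUES (?, ?, ?)",
                 (path, "create_session", "function"))
    # Invalidating the file row removes its symbols through the cascade.
    conn.execute("DELETE FROM files WHERE file_path = ?", (path,))
    remaining = conn.execute("SELECT COUNT(*) FROM symbols").fetchone()[0]
    conn.commit()
    conn.close()

print(remaining)  # 0
```

Note that SQLite ignores foreign-key constraints unless `PRAGMA foreign_keys=ON` is issued on each connection, so the cascade depends on that line.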
Built with:
- Python 3.13
- SQLite with WAL journaling and foreign keys
- OpenAI SDK via OpenRouter (model-agnostic; works with GPT-4o-mini, Claude, etc.)
- Rich for terminal output
- pytest for unit tests

Requirements:
- Python 3.11+
- An OpenRouter API key (or any OpenAI-compatible endpoint)
- No GPU required; this is an orchestration project using API calls
