Summary
TrigramIndex.id_to_path grows unboundedly when files are re-indexed. Each call to indexFile for an already-indexed file appends a new doc_id to id_to_path without reclaiming the old slot, causing monotonic memory growth proportional to the number of re-index cycles.
On a project with active file watching, this causes 425 MB/min private memory growth (observed on a ~22K file repo).
Root Cause
In src/index.zig, the indexFile → removeFile → getOrCreateDocId sequence runs as follows:
1. indexFile (line 583) calls self.removeFile(path).
2. removeFile (line 580) removes path from path_to_id but does NOT touch id_to_path, so the old slot remains allocated.
3. indexFile (line 586) calls getOrCreateDocId(path).
4. getOrCreateDocId (line 554) checks path_to_id. Because path was just removed, it falls through to lines 555-557, which append a new entry to id_to_path with a new doc_id.
The old id_to_path[old_doc_id] slot is never cleared and never reused. After K re-indexes of the same file, id_to_path holds K+1 entries for it, of which only the last is reachable via path_to_id.
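The sequence above can be reproduced in a few lines of Zig. This is a simplified stand-in, not the code in src/index.zig: the Model struct, its fixed capacity, the live flags that stand in for path_to_id membership, and the linear-scan lookup are all assumptions made for illustration.

```zig
const std = @import("std");

// Hypothetical, simplified stand-in for TrigramIndex. Method names follow the
// issue text; the bodies are assumptions, not the real implementation.
const Model = struct {
    const capacity = 64; // fixed size and no bounds handling, for brevity

    id_to_path: [capacity][]const u8 = undefined, // append-only slots
    len: u32 = 0, // number of slots ever appended
    live: [capacity]bool = [_]bool{false} ** capacity, // stands in for path_to_id

    // Linear-scan stand-in for path_to_id.get(path).
    fn lookup(self: *const Model, path: []const u8) ?u32 {
        var i: u32 = 0;
        while (i < self.len) : (i += 1) {
            if (self.live[i] and std.mem.eql(u8, self.id_to_path[i], path)) return i;
        }
        return null;
    }

    // Drops the path_to_id mapping but leaves the id_to_path slot in place.
    fn removeFile(self: *Model, path: []const u8) void {
        if (self.lookup(path)) |old_id| self.live[old_id] = false;
    }

    // The mapping was just removed, so a re-index always takes the append path.
    fn getOrCreateDocId(self: *Model, path: []const u8) u32 {
        if (self.lookup(path)) |existing| return existing;
        const new_id = self.len;
        self.id_to_path[new_id] = path;
        self.live[new_id] = true;
        self.len += 1;
        return new_id;
    }

    fn indexFile(self: *Model, path: []const u8) u32 {
        self.removeFile(path);
        return self.getOrCreateDocId(path);
    }
};

test "each indexFile call on the same path appends a slot" {
    var m = Model{};
    var last: u32 = 0;
    var k: u32 = 0;
    while (k < 5) : (k += 1) last = m.indexFile("src/main.zig");
    try std.testing.expect(m.len == 5); // five slots for one file
    try std.testing.expect(last == 4); // only the newest id is reachable
    try std.testing.expect(m.lookup("src/main.zig").? == 4);
}
```

Run with zig test. The slot count tracks the number of indexFile calls rather than the number of distinct files, which is the growth pattern described above.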
Compounding triggers
Two watcher paths amplify this:
- git HEAD change (watcher.zig line 466): a branch switch re-indexes every file in the project. A single checkout on a 5000-file project appends 5000 stale doc_id entries.
- drainNotifyFile (watcher.zig line 693): every notified path triggers a full re-index with no dedup against current state.
Both paths call indexFileContent → explorer.indexFile → TrigramIndex.indexFile, hitting this same accumulation.
Secondary effect: incorrect search results
Stale id_to_path entries cause TrigramIndex.candidates() (line 747) to yield doc_ids that are no longer in path_to_id, producing phantom search results for files that no longer match the query.
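One way to picture what "stale" means here is a liveness check: a doc_id is current only if its path still maps back to that same id. The sketch below is a hypothetical guard, not part of the existing TrigramIndex API. The isLive and filterLive helpers are invented for illustration, and the u32 id type is an assumption.

```zig
const std = @import("std");

// Hypothetical guard: a doc_id is live only if path_to_id still maps its
// path back to that same id. Deleted or re-indexed paths fail the check.
fn isLive(
    id_to_path: []const []const u8,
    path_to_id: *const std.StringHashMap(u32),
    doc_id: u32,
) bool {
    if (doc_id >= id_to_path.len) return false;
    const current = path_to_id.get(id_to_path[doc_id]) orelse return false;
    return current == doc_id;
}

// Compacts `ids` in place, dropping stale entries; returns the live prefix.
fn filterLive(
    id_to_path: []const []const u8,
    path_to_id: *const std.StringHashMap(u32),
    ids: []u32,
) []u32 {
    var kept: usize = 0;
    for (ids) |id| {
        if (isLive(id_to_path, path_to_id, id)) {
            ids[kept] = id;
            kept += 1;
        }
    }
    return ids[0..kept];
}

test "stale doc_ids are dropped from a candidate list" {
    // Slots 0 and 1 are stale copies of a.zig; slot 2 is its current id.
    const id_to_path = [_][]const u8{ "a.zig", "a.zig", "a.zig", "b.zig" };
    var path_to_id = std.StringHashMap(u32).init(std.testing.allocator);
    defer path_to_id.deinit();
    try path_to_id.put("a.zig", 2);
    try path_to_id.put("b.zig", 3);

    var candidates = [_]u32{ 0, 1, 2, 3 };
    const live = filterLive(&id_to_path, &path_to_id, &candidates);
    try std.testing.expect(live.len == 2);
    try std.testing.expect(live[0] == 2 and live[1] == 3);
}
```

A filter like this would only mask the symptom at query time. It does nothing for the memory growth, so it is at best a stopgap next to the fixes listed under Fix options.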
Reproduction
# Start MCP server on a repo with file watching active
codedb --mcp
# In another terminal, touch files rapidly to trigger re-indexing
for i in $(seq 1 100); do
  touch src/*.zig
  sleep 2
done
# Monitor RSS — it should stay flat, but grows ~N_files * sizeof(entry) per cycle
Alternatively, switch git branches repeatedly on a large repo:
for i in $(seq 1 20); do
  git checkout main
  git checkout feature-branch
  sleep 3
done
Measurements
| Scenario | id_to_path growth | Private bytes growth |
|---|---|---|
| Startup, no re-indexing | 0 | Stable |
| 2s poll cycle, 1000 files with mtime changes | +1000 entries/cycle | ~425 MB/min observed |
| Single git branch switch, 5000 files | +5000 entries | Instant spike |
Fix options
1. Reuse doc_id: In removeFile, keep the path_to_id mapping but clear the postings. Then getOrCreateDocId returns the existing id instead of appending. Requires adjusting removeFile to mark the slot as "cleared" rather than removing from path_to_id.
2. Free-list: When removeFile removes a doc_id, push it onto a free-list. getOrCreateDocId pops from the free-list before appending. Minimal changes, no behavioral difference.
3. Periodic compaction: Rebuild id_to_path when stale ratio exceeds a threshold. Higher latency but simplest correctness argument.
Option 2 is likely the best balance of simplicity and efficiency.
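A minimal sketch of the free-list option, under the same simplifications as the issue's description rather than the real src/index.zig. The FreeListModel struct, its fixed capacity, the free_ids array, and the use of a null slot to mean "not in path_to_id" are all assumptions for illustration.

```zig
const std = @import("std");

// Hypothetical free-list variant of a simplified index model.
// Fixed capacity and no bounds handling, for brevity.
const FreeListModel = struct {
    const capacity = 64;

    // A null slot means the id is currently unassigned.
    id_to_path: [capacity]?[]const u8 = [_]?[]const u8{null} ** capacity,
    len: u32 = 0, // number of slots ever appended
    free_ids: [capacity]u32 = undefined, // stack of reusable doc_ids
    free_len: u32 = 0,

    // Linear-scan stand-in for path_to_id.get(path).
    fn lookup(self: *const FreeListModel, path: []const u8) ?u32 {
        var i: u32 = 0;
        while (i < self.len) : (i += 1) {
            if (self.id_to_path[i]) |p| {
                if (std.mem.eql(u8, p, path)) return i;
            }
        }
        return null;
    }

    fn removeFile(self: *FreeListModel, path: []const u8) void {
        const id = self.lookup(path) orelse return;
        self.id_to_path[id] = null; // clear the slot
        self.free_ids[self.free_len] = id; // remember it for reuse
        self.free_len += 1;
    }

    fn getOrCreateDocId(self: *FreeListModel, path: []const u8) u32 {
        if (self.lookup(path)) |existing| return existing;
        var id: u32 = undefined;
        if (self.free_len > 0) {
            // Pop a previously freed id before growing id_to_path.
            self.free_len -= 1;
            id = self.free_ids[self.free_len];
        } else {
            id = self.len;
            self.len += 1;
        }
        self.id_to_path[id] = path;
        return id;
    }

    fn indexFile(self: *FreeListModel, path: []const u8) u32 {
        self.removeFile(path);
        return self.getOrCreateDocId(path);
    }
};

test "free-list keeps id_to_path bounded across re-indexes" {
    var m = FreeListModel{};
    const a = m.indexFile("a.zig");
    const b = m.indexFile("b.zig");
    var k: u32 = 0;
    while (k < 5) : (k += 1) {
        // Each re-index frees the slot for a.zig and immediately reuses it.
        try std.testing.expect(m.indexFile("a.zig") == a);
    }
    try std.testing.expect(m.len == 2);
    try std.testing.expect(m.lookup("b.zig").? == b);
}
```

The sketch leaves out postings entirely. Reusing an id is presumably only safe if removeFile also drops every posting that refers to it, since any posting left behind would then resolve to whichever path inherits the id.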