bug: TrigramIndex.id_to_path grows unboundedly on re-index — 425 MB/min memory growth #227

@JF10R

Summary

TrigramIndex.id_to_path grows unboundedly when files are re-indexed. Each call to indexFile for an already-indexed file appends a new doc_id to id_to_path without reclaiming the old slot, causing monotonic memory growth proportional to the number of re-index cycles.

On a project with active file watching, this causes 425 MB/min private memory growth (observed on a ~22K file repo).

Root Cause

In src/index.zig, the indexFile → removeFile → getOrCreateDocId sequence:

  1. indexFile (line 583) calls self.removeFile(path)
  2. removeFile (line 580) removes path from path_to_id but does NOT touch id_to_path — the old slot remains allocated
  3. indexFile (line 586) calls getOrCreateDocId(path)
  4. getOrCreateDocId (line 554) checks path_to_id. Because the path was just removed, the lookup misses and execution falls through to lines 555-557, which append a new entry to id_to_path with a new doc_id

The old id_to_path[old_doc_id] slot is never cleared, never reused. After K re-indexes of the same file, id_to_path has K entries for it — only the last is reachable via path_to_id.
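
The sequence above can be reproduced in a few lines. This is a toy Python model of the bookkeeping as described in this issue, not the Zig code from src/index.zig; the class and method names are illustrative and postings are left out.

```python
class LeakyTrigramIndex:
    """Toy model of the doc-id bookkeeping described above (not the Zig code)."""

    def __init__(self):
        self.path_to_id = {}   # path -> current doc_id
        self.id_to_path = []   # doc_id -> path; only ever appended to

    def remove_file(self, path):
        # Mirrors the reported behavior: forget the mapping, keep the slot.
        self.path_to_id.pop(path, None)

    def get_or_create_doc_id(self, path):
        if path in self.path_to_id:
            return self.path_to_id[path]
        doc_id = len(self.id_to_path)
        self.id_to_path.append(path)
        self.path_to_id[path] = doc_id
        return doc_id

    def index_file(self, path):
        self.remove_file(path)
        return self.get_or_create_doc_id(path)


index = LeakyTrigramIndex()
for _ in range(5):
    index.index_file("src/index.zig")

print(len(index.path_to_id))  # 1
print(len(index.id_to_path))  # 5
```

One path, five slots: only the last doc_id is reachable, and the other four are never reclaimed.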

Compounding triggers

Two watcher paths amplify this:

  • git HEAD change (watcher.zig line 466): branch switch re-indexes every file in the project. A single checkout on a 5000-file project appends 5000 stale doc_id entries.
  • drainNotifyFile (watcher.zig line 693): every notified path triggers a full re-index with no dedup against current state.

Both paths call indexFileContent → explorer.indexFile → TrigramIndex.indexFile, hitting this same accumulation.

Secondary effect: incorrect search results

Stale id_to_path entries cause TrigramIndex.candidates() (line 747) to yield doc_ids that are no longer in path_to_id, producing phantom search results for files that no longer match the query.
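
A stale doc_id can be recognized from the two maps alone: it is live only if its path still maps back to the same id. The sketch below shows that check as a diagnostic, in Python rather than Zig. `is_live` and `stale_count` are hypothetical helpers, and the candidate list is made up for illustration; it is not output from TrigramIndex.candidates().

```python
def is_live(path_to_id, id_to_path, doc_id):
    """A doc_id is live only if its path still maps back to that same id."""
    if not 0 <= doc_id < len(id_to_path):
        return False
    return path_to_id.get(id_to_path[doc_id]) == doc_id


def stale_count(path_to_id, id_to_path):
    """Number of id_to_path slots that path_to_id can no longer reach."""
    return sum(
        1 for d in range(len(id_to_path))
        if not is_live(path_to_id, id_to_path, d)
    )


# State after re-indexing "a.zig" three times and "b.zig" once.
id_to_path = ["a.zig", "b.zig", "a.zig", "a.zig"]
path_to_id = {"a.zig": 3, "b.zig": 1}

candidates = [0, 1, 2, 3]  # hypothetical raw candidate ids
live = [d for d in candidates if is_live(path_to_id, id_to_path, d)]
print(live)                                 # [1, 3]
print(stale_count(path_to_id, id_to_path))  # 2
```

Filtering like this would only hide the phantom results. It does not stop the growth.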

Reproduction

# Start MCP server on a repo with file watching active
codedb --mcp

# In another terminal, touch files rapidly to trigger re-indexing
for i in $(seq 1 100); do
  touch src/*.zig
  sleep 2
done

# Monitor RSS — it should stay flat, but grows ~N_files * sizeof(entry) per cycle

Alternatively, switch git branches repeatedly on a large repo:

for i in $(seq 1 20); do
  git checkout main
  git checkout feature-branch
  sleep 3
done
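
To put a number on the growth while either loop runs, something like the following could sample the server process. It is a rough sketch that assumes a Linux or macOS `ps` that reports `rss` in KiB; RSS is only a proxy for the private bytes figure quoted above, and `watch` and its parameters are made up for this example.

```python
import subprocess
import time


def rss_bytes(pid):
    """Resident set size of `pid` in bytes, read from `ps -o rss=` (KiB)."""
    out = subprocess.run(
        ["ps", "-o", "rss=", "-p", str(pid)],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.strip()) * 1024


def growth_mb_per_min(samples):
    """Average growth between the first and last (seconds, bytes) sample."""
    (t0, b0), (t1, b1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need samples spanning a positive time interval")
    return (b1 - b0) / 1e6 / ((t1 - t0) / 60.0)


def watch(pid, interval_s=10, count=6):
    """Sample RSS `count` times, `interval_s` apart, and return MB/min."""
    samples = []
    for _ in range(count):
        samples.append((time.monotonic(), rss_bytes(pid)))
        time.sleep(interval_s)
    return growth_mb_per_min(samples)
```

Calling `watch(pid)` with the pid of the `codedb --mcp` process should return a value near zero on a healthy build.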

Measurements

| Scenario | id_to_path growth | Private bytes growth |
|---|---|---|
| Startup, no re-indexing | 0 | Stable |
| 2s poll cycle, 1000 files with mtime changes | +1000 entries/cycle | ~425 MB/min observed |
| Single git branch switch, 5000 files | +5000 entries | Instant spike |
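
As a sanity check on the poll-cycle row, the arithmetic below converts the figures into a per-entry cost. It assumes both numbers in that row come from the same run and that MB means 10^6 bytes; neither is stated explicitly here.

```python
POLL_INTERVAL_S = 2        # from the poll-cycle scenario
FILES_PER_CYCLE = 1000     # files with mtime changes per cycle
GROWTH_MB_PER_MIN = 425    # observed private bytes growth

cycles_per_min = 60 // POLL_INTERVAL_S
entries_per_min = cycles_per_min * FILES_PER_CYCLE
bytes_per_entry = GROWTH_MB_PER_MIN * 1_000_000 / entries_per_min

print(entries_per_min)         # 30000
print(round(bytes_per_entry))  # 14167
```

Roughly 14 KB per re-indexed file per cycle is well above what a single path slot would occupy, so each stale entry presumably keeps other per-document memory alive as well. What that memory is has not been measured here.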

Fix options

  1. Reuse doc_id: In removeFile, keep the path_to_id mapping but clear the postings. Then getOrCreateDocId returns the existing id instead of appending. Requires adjusting removeFile to mark the slot as "cleared" rather than removing from path_to_id.

  2. Free-list: When removeFile removes a doc_id, push it onto a free-list. getOrCreateDocId pops from the free-list before appending. Minimal changes, no behavioral difference.

  3. Periodic compaction: Rebuild id_to_path when stale ratio exceeds a threshold. Higher latency but simplest correctness argument.
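
For comparison, option 3 might look like the sketch below, again as a Python model rather than Zig. It assumes postings map a trigram to a set of doc_ids, which may not match the real layout; `compact` and `max_stale_ratio` are hypothetical names.

```python
def compact(path_to_id, id_to_path, postings, max_stale_ratio=0.5):
    """Rebuild the id tables if too many slots are stale; True if rebuilt."""
    total = len(id_to_path)
    stale = total - len(path_to_id)
    if total == 0 or stale / total <= max_stale_ratio:
        return False

    # Assign dense new ids to live documents, preserving relative order.
    remap = {}
    new_id_to_path = []
    for old_id, path in enumerate(id_to_path):
        if path_to_id.get(path) == old_id:
            remap[old_id] = len(new_id_to_path)
            new_id_to_path.append(path)

    # Rewrite every posting list, dropping ids that did not survive.
    for trigram in list(postings):
        kept = {remap[d] for d in postings[trigram] if d in remap}
        if kept:
            postings[trigram] = kept
        else:
            del postings[trigram]

    id_to_path[:] = new_id_to_path
    for path, old_id in list(path_to_id.items()):
        path_to_id[path] = remap[old_id]
    return True


id_to_path = ["a.zig", "b.zig", "a.zig", "a.zig"]
path_to_id = {"a.zig": 3, "b.zig": 1}
postings = {"abc": {0, 3}, "xyz": {1}, "old": {2}}

rebuilt = compact(path_to_id, id_to_path, postings, max_stale_ratio=0.25)
print(rebuilt)      # True
print(id_to_path)   # ['b.zig', 'a.zig']
print(path_to_id)   # {'a.zig': 1, 'b.zig': 0}
```

Every doc_id changes during a rebuild, so all posting lists have to be rewritten in the same pass. That is where the extra latency comes from.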

Option 2 is likely the best balance of simplicity and efficiency.
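
A minimal sketch of the free-list idea, as a Python model with illustrative names rather than a patch against src/index.zig. One assumption matters: a doc_id is only safe to reuse if removeFile has already purged every posting for it, otherwise the new file would inherit the old file's trigram matches. The model purges by scanning all posting sets, which the real index may do differently.

```python
class FreeListIndex:
    """Toy model of option 2; names are illustrative, not the Zig API."""

    def __init__(self):
        self.path_to_id = {}
        self.id_to_path = []   # None marks a reclaimed slot
        self.free_ids = []     # doc_ids available for reuse
        self.postings = {}     # trigram -> set of doc_ids

    def remove_file(self, path):
        doc_id = self.path_to_id.pop(path, None)
        if doc_id is None:
            return
        # Purge postings first so a reused id cannot inherit old matches.
        for ids in self.postings.values():
            ids.discard(doc_id)
        self.id_to_path[doc_id] = None
        self.free_ids.append(doc_id)

    def get_or_create_doc_id(self, path):
        if path in self.path_to_id:
            return self.path_to_id[path]
        if self.free_ids:
            doc_id = self.free_ids.pop()
            self.id_to_path[doc_id] = path
        else:
            doc_id = len(self.id_to_path)
            self.id_to_path.append(path)
        self.path_to_id[path] = doc_id
        return doc_id

    def index_file(self, path, content):
        self.remove_file(path)
        doc_id = self.get_or_create_doc_id(path)
        for i in range(len(content) - 2):
            self.postings.setdefault(content[i:i + 3], set()).add(doc_id)
        return doc_id


index = FreeListIndex()
for _ in range(100):
    index.index_file("src/index.zig", "const std")
index.index_file("src/watcher.zig", "const fs")

print(len(index.id_to_path))  # 2
```

After 100 re-indexes of one file plus one new file, id_to_path holds two slots instead of 101.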
