Skip to content

fix: reuse freed doc_ids in TrigramIndex to prevent unbounded growth (#227)#229

Open
JF10R wants to merge 3 commits intojustrach:mainfrom
JF10R:fix/227-trigram-id-reuse
Open

fix: reuse freed doc_ids in TrigramIndex to prevent unbounded growth (#227)#229
JF10R wants to merge 3 commits intojustrach:mainfrom
JF10R:fix/227-trigram-id-reuse

Conversation

@JF10R
Copy link
Copy Markdown

@JF10R JF10R commented Apr 9, 2026

Closes #227

Summary

TrigramIndex.id_to_path grows unboundedly when files are re-indexed. Each indexFile call for an already-indexed file appends a new doc_id slot without reclaiming the old one, causing monotonic memory growth proportional to re-index frequency.

Exact change

  • Changed id_to_path from ArrayList([]const u8) to ArrayList(?[]const u8)
  • Added free_ids: ArrayList(u32) free-list
  • removeFile: nulls out the stale id_to_path slot and pushes the doc_id onto free_ids
  • getOrCreateDocId: pops from free_ids before appending a new entry
  • candidates, candidatesRegex, writeToDisk: skip null slots (also fixes phantom search results from stale doc_ids)

Files touched

  • src/index.zigTrigramIndex struct and its methods only

Red-to-green

Before

Re-indexing the same file N times causes id_to_path.items.len to grow by N:

// Pseudocode trace of the bug:
indexFile("src/main.zig", content)  // id_to_path: ["src/main.zig"]         len=1
indexFile("src/main.zig", content)  // id_to_path: ["src/main.zig", "src/main.zig"]  len=2
indexFile("src/main.zig", content)  // id_to_path: ["src/main.zig", "src/main.zig", "src/main.zig"]  len=3
// ... grows without bound

Observed: 425 MB/min private memory growth on a ~22K file repo with active file watching.

After

Re-indexing the same file reuses the freed doc_id slot:

indexFile("src/main.zig", content)  // id_to_path: ["src/main.zig"]  len=1, free_ids: []
indexFile("src/main.zig", content)  // id_to_path: ["src/main.zig"]  len=1, free_ids: []
indexFile("src/main.zig", content)  // id_to_path: ["src/main.zig"]  len=1, free_ids: []
// ... stays bounded

Nearby non-regression

All existing trigram index tests pass unchanged — the fix is internal to the doc_id allocation strategy and does not change the public API or query semantics.

Note: upstream main does not compile on Windows (4 pre-existing errors: std.posix.mmap, std.posix.getenv, SOCKET type, libc dependency). The only new code in this PR (src/index.zig) has zero Windows-specific paths and compiles cleanly on all platforms. Tests were verified to the extent possible on Windows; full test suite requires Linux/macOS.

Rebased

Yes — single commit on top of current main (268f9c9).

Generated files / lockfiles / benchmarks

None changed.

CONTRIBUTING.md compliance

Confirmed.

…ath growth

TrigramIndex.removeFile deletes from path_to_id but never clears the
corresponding id_to_path slot.  The next indexFile call for the same
path appends a new entry, so id_to_path grows by one slot per re-index
cycle per file — causing ~425 MB/min memory growth on actively watched
repos.

Change id_to_path from ArrayList([]const u8) to ArrayList(?[]const u8)
and add a free_ids list.  removeFile now nulls out the stale slot and
pushes the doc_id onto free_ids.  getOrCreateDocId pops from free_ids
before appending, keeping id_to_path bounded.

Query paths (candidates, candidatesRegex, writeToDisk) now skip null
slots, which also fixes phantom search results from stale doc_ids.

Closes justrach#227
Copilot AI review requested due to automatic review settings April 9, 2026 21:26
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aa36d97ba5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes unbounded memory growth in TrigramIndex by reusing doc_id slots when files are re-indexed, preventing id_to_path from growing monotonically with watcher-driven reindex cycles.

Changes:

  • Change id_to_path to allow null “freed” slots and skip them in query/persistence paths.
  • Add a free_ids free-list, populated on removeFile and consumed by getOrCreateDocId.
  • Update candidate collection and disk serialization to ignore freed (null) doc IDs.

Reused doc_ids from the free-list can be non-monotonic, breaking the
sorted-order invariant that PostingList.getByDocId and candidates
rely on for binary search and merge intersection.

Replace the raw append with getOrAddPosting which does a binary
search and inserts at the correct sorted position.

Addresses review feedback from Codex and Copilot on justrach#229.
@justrach
Copy link
Copy Markdown
Owner

Looked through the current head (16ae8b0), including the earlier cloud review thread.

The cloud/Copilot finding about breaking posting-list sort order after doc_id reuse was valid on the first commit, but it looks addressed on the current head: indexFile() now uses getOrAddPosting() instead of raw append, so the binary-search paths (getByDocId / containsDocId) and the merge-intersection logic in candidates() keep their sorted-order invariant even when a lower doc_id is recycled.

I also don’t see anything suspicious or malicious in the patch. This is a narrow in-memory bookkeeping fix plus null-guards in query/persistence paths, and zig build test --summary all passed locally for me on this branch.

The one gap I still see is regression coverage for the exact free-list reuse path. There are existing trigram remove/reindex tests in src/tests.zig, but this PR doesn’t add a test that actually forces doc_id reuse and then proves search correctness stays intact. A small regression test that removes/reindexes the same file or recycles a low doc_id after higher ones exist would make this much safer to keep.

@justrach
Copy link
Copy Markdown
Owner

justrach commented Apr 10, 2026

For anyone other people/agents (including you @codex) else reviewing this, these are the parts of src/index.zig that matter most on the current head:

  • getOrCreateDocId() / removeFile() around lines 555-593
    Review the free-list invariants here: every removed live doc_id should be nulled exactly once, pushed to free_ids exactly once, and then reused only after all old postings for that doc_id have been removed.

  • indexFile() around lines 643-650
    This is the critical correctness fix for the earlier cloud review. Because reused doc_ids are no longer monotonic, this path must preserve posting-list sort order. The important thing to verify is that getOrAddPosting() is sufficient for all trigram insertions and cannot accidentally leave duplicate / out-of-order postings behind.

  • candidates() around lines 758-760 and candidatesRegex() around lines 845-848
    These are the query paths that now skip null id_to_path slots. Reviewers should sanity-check that this removes phantom results without masking any still-live docs.

  • writeToDisk() around lines 925-933
    This is the persistence boundary. Reviewers should verify that freed/null doc slots are excluded from serialization, and that the disk image only reflects currently live files.

  • Tests
    The current suite passes, but I still think the main thing missing is an explicit regression test that forces freed-doc_id reuse and then proves query correctness after reuse. That is the most valuable extra review/test coverage for this change.

Copy link
Copy Markdown
Owner

@justrach justrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working through this. The free-list reuse plus the sorted-insert follow-up on the current head looks like the right fix for the unbounded id_to_path growth and the posting-order issue.

The one change I still want before merge is explicit regression coverage for the exact reuse path. Please add a test that forces a freed doc_id to be reused and then proves query correctness still holds afterward — ideally covering search/candidate behavior so we know we do not reintroduce stale or phantom paths.

Once that is in, this looks close.

Add regression coverage for freed doc_id reuse in the
trigram index. The test forces reuse after higher ids
exist and verifies candidates exclude stale paths while
still returning the live reused path.

Co-Authored-By: Codex <noreply@openai.com>
@JF10R
Copy link
Copy Markdown
Author

JF10R commented Apr 12, 2026

Added the requested regression test in 9f8b0e4.

It forces freed-doc_id reuse and verifies candidate lookup still returns the live paths while excluding the removed stale path.

Checks run:

  • zig test src/tests.zig --test-filter "trigram index: reuses freed doc_id without stale candidates"
  • zig test src/tests.zig --test-filter "trigram index: re-index removes old trigrams"

Let me know if anything's missing or wrong.

@JF10R JF10R requested a review from justrach April 12, 2026 01:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: TrigramIndex.id_to_path grows unboundedly on re-index — 425 MB/min memory growth

3 participants