fix: reuse freed doc_ids in TrigramIndex to prevent unbounded growth (#227) #229
JF10R wants to merge 3 commits into justrach:main
…ath growth
TrigramIndex.removeFile deletes from path_to_id but never clears the corresponding id_to_path slot. The next indexFile call for the same path appends a new entry, so id_to_path grows by one slot per re-index cycle per file, causing ~425 MB/min memory growth on actively watched repos.
Change id_to_path from ArrayList([]const u8) to ArrayList(?[]const u8) and add a free_ids list. removeFile now nulls out the stale slot and pushes the doc_id onto free_ids. getOrCreateDocId pops from free_ids before appending, keeping id_to_path bounded.
Query paths (candidates, candidatesRegex, writeToDisk) now skip null slots, which also fixes phantom search results from stale doc_ids.
Closes justrach#227
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aa36d97ba5
Pull request overview
This PR fixes unbounded memory growth in TrigramIndex by reusing doc_id slots when files are re-indexed, preventing id_to_path from growing monotonically with watcher-driven reindex cycles.
Changes:
- Change `id_to_path` to allow null "freed" slots and skip them in query/persistence paths.
- Add a `free_ids` free-list, populated on `removeFile` and consumed by `getOrCreateDocId`.
- Update candidate collection and disk serialization to ignore freed (null) doc IDs.
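To make the null-skipping concrete, here is a minimal Zig sketch of a query pass that ignores freed slots. `resolveLive` and its parameters are hypothetical names for illustration only, not the PR's `candidates` implementation, and a caller-provided output buffer is assumed to keep the example allocation-free.

```zig
const std = @import("std");

// Hypothetical helper, not the PR's `candidates`: resolves posting-list
// doc_ids to paths and drops any id whose slot has been freed (null).
fn resolveLive(
    id_to_path: []const ?[]const u8,
    doc_ids: []const u32,
    out: [][]const u8,
) usize {
    std.debug.assert(out.len >= doc_ids.len);
    var n: usize = 0;
    for (doc_ids) |id| {
        if (id >= id_to_path.len) continue; // out-of-range id: ignore
        if (id_to_path[id]) |path| {
            out[n] = path;
            n += 1;
        }
    }
    return n;
}

test "freed slots never surface as results" {
    const table = [_]?[]const u8{ "a.zig", null, "c.zig" };
    const ids = [_]u32{ 0, 1, 2 };
    var out: [3][]const u8 = undefined;
    const n = resolveLive(&table, &ids, &out);
    try std.testing.expectEqual(@as(usize, 2), n);
    try std.testing.expectEqualStrings("c.zig", out[1]);
}
```

Without the null check, doc_id 1 would still resolve to whatever path last occupied that slot, which is the phantom-result case the description mentions.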
Reused doc_ids from the free-list can be non-monotonic, breaking the sorted-order invariant that PostingList.getByDocId and candidates rely on for binary search and merge intersection. Replace the raw append with getOrAddPosting which does a binary search and inserts at the correct sorted position. Addresses review feedback from Codex and Copilot on justrach#229.
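A rough sketch of the sorted-insert idea follows, assuming a posting list that stores plain `u32` doc_ids in a fixed-capacity array. The `Postings` type and its `getOrAdd` method are hypothetical stand-ins; the real `getOrAddPosting` signature is not shown in this thread.

```zig
const std = @import("std");

// Hypothetical posting list: doc_ids kept in ascending order.
const Postings = struct {
    doc_ids: [16]u32 = undefined,
    len: usize = 0,

    // Lower bound: first index whose doc_id is >= target.
    fn lowerBound(self: *const Postings, target: u32) usize {
        var lo: usize = 0;
        var hi: usize = self.len;
        while (lo < hi) {
            const mid = lo + (hi - lo) / 2;
            if (self.doc_ids[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // Returns the index of doc_id, inserting it at its sorted position
    // if it is not already present.
    fn getOrAdd(self: *Postings, doc_id: u32) usize {
        const pos = self.lowerBound(doc_id);
        if (pos < self.len and self.doc_ids[pos] == doc_id) return pos;
        std.debug.assert(self.len < self.doc_ids.len);
        var i = self.len;
        while (i > pos) : (i -= 1) {
            self.doc_ids[i] = self.doc_ids[i - 1]; // shift the tail right
        }
        self.doc_ids[pos] = doc_id;
        self.len += 1;
        return pos;
    }
};

test "a reused low doc_id lands before higher ids" {
    var p = Postings{};
    _ = p.getOrAdd(1);
    _ = p.getOrAdd(2);
    _ = p.getOrAdd(0); // reused id from the free-list
    try std.testing.expectEqualSlices(u32, &[_]u32{ 0, 1, 2 }, p.doc_ids[0..p.len]);
}
```

A raw append in the same scenario would leave the list as 1, 2, 0, which is the ordering that breaks binary search and merge intersection.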
Looked through the current head ( The cloud/Copilot finding about breaking posting-list sort order after I also don’t see anything suspicious or malicious in the patch. This is a narrow in-memory bookkeeping fix plus null-guards in query/persistence paths, and The one gap I still see is regression coverage for the exact free-list reuse path. There are existing trigram remove/reindex tests in
For anyone else (people or agents, including you @codex) reviewing this, these are the parts of
justrach left a comment
Thanks for working through this. The free-list reuse plus the sorted-insert follow-up on the current head looks like the right fix for the unbounded id_to_path growth and the posting-order issue.
The one change I still want before merge is explicit regression coverage for the exact reuse path. Please add a test that forces a freed doc_id to be reused and then proves query correctness still holds afterward — ideally covering search/candidate behavior so we know we do not reintroduce stale or phantom paths.
Once that is in, this looks close.
Add regression coverage for freed doc_id reuse in the trigram index. The test forces reuse after higher ids exist and verifies candidates exclude stale paths while still returning the live reused path. Co-Authored-By: Codex <noreply@openai.com>
Added the requested regression test in 9f8b0e4. It forces freed- Checks run:
Let me know if anything's missing or wrong.
Closes #227
Summary
`TrigramIndex.id_to_path` grows unboundedly when files are re-indexed. Each `indexFile` call for an already-indexed file appends a new doc_id slot without reclaiming the old one, causing monotonic memory growth proportional to re-index frequency.
Exact change
- `id_to_path` from `ArrayList([]const u8)` to `ArrayList(?[]const u8)`
- `free_ids: ArrayList(u32)` free-list
- `removeFile`: nulls out the stale `id_to_path` slot and pushes the doc_id onto `free_ids`
- `getOrCreateDocId`: pops from `free_ids` before appending a new entry
- `candidates`, `candidatesRegex`, `writeToDisk`: skip null slots (also fixes phantom search results from stale doc_ids)
Files touched
- `src/index.zig`: `TrigramIndex` struct and its methods only
Red-to-green
Before
Re-indexing the same file N times causes `id_to_path.items.len` to grow by N:
Observed: 425 MB/min private memory growth on a ~22K file repo with active file watching.
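For illustration, the append-only pattern behind that growth can be modelled in a few lines of Zig. `NaiveIds` is a hypothetical stand-in, not the code in `src/index.zig`: it tracks a single path, uses a fixed-capacity array instead of an `ArrayList`, and its `len` field plays the role of `id_to_path.items.len`.

```zig
const std = @import("std");

// Hypothetical model of the pre-fix behaviour for one watched file.
const NaiveIds = struct {
    id_to_path: [16][]const u8 = undefined,
    len: u32 = 0, // stand-in for id_to_path.items.len
    path_to_id_entry: ?u32 = null, // stand-in for the path_to_id map

    fn removeFile(self: *NaiveIds) void {
        // Drops the mapping but leaves the id_to_path slot behind.
        self.path_to_id_entry = null;
    }

    fn indexFile(self: *NaiveIds, path: []const u8) u32 {
        self.removeFile();
        std.debug.assert(self.len < self.id_to_path.len);
        const id = self.len;
        self.id_to_path[id] = path; // always appends, never reclaims
        self.len += 1;
        self.path_to_id_entry = id;
        return id;
    }
};

test "append-only ids grow by one per re-index" {
    var ids = NaiveIds{};
    var i: u32 = 0;
    while (i < 5) : (i += 1) {
        _ = ids.indexFile("src/main.zig");
    }
    try std.testing.expectEqual(@as(u32, 5), ids.len);
}
```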
After
Re-indexing the same file reuses the freed doc_id slot:
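A minimal sketch of that reuse, under stated assumptions: `DocIds` is a hypothetical model rather than the actual `TrigramIndex`, fixed-capacity arrays replace the `ArrayList` fields, and a linear scan stands in for the `path_to_id` map. Only the method names `removeFile`, `getOrCreateDocId` and `indexFile` come from the PR description.

```zig
const std = @import("std");

const cap = 16; // fixed capacity keeps the sketch allocation-free

const DocIds = struct {
    id_to_path: [cap]?[]const u8 = [_]?[]const u8{null} ** cap,
    len: u32 = 0, // stand-in for id_to_path.items.len
    free_ids: [cap]u32 = undefined,
    free_len: u32 = 0,

    // Linear scan standing in for the path_to_id map.
    fn find(self: *const DocIds, path: []const u8) ?u32 {
        var i: u32 = 0;
        while (i < self.len) : (i += 1) {
            if (self.id_to_path[i]) |p| {
                if (std.mem.eql(u8, p, path)) return i;
            }
        }
        return null;
    }

    fn removeFile(self: *DocIds, path: []const u8) void {
        if (self.find(path)) |id| {
            self.id_to_path[id] = null; // clear the stale slot
            self.free_ids[self.free_len] = id; // remember it for reuse
            self.free_len += 1;
        }
    }

    fn getOrCreateDocId(self: *DocIds, path: []const u8) u32 {
        if (self.find(path)) |existing| return existing;
        var id = self.len;
        if (self.free_len > 0) {
            self.free_len -= 1;
            id = self.free_ids[self.free_len]; // pop a freed id first
        } else {
            std.debug.assert(self.len < cap);
            self.len += 1; // no freed id: append a new slot
        }
        self.id_to_path[id] = path;
        return id;
    }

    fn indexFile(self: *DocIds, path: []const u8) u32 {
        self.removeFile(path);
        return self.getOrCreateDocId(path);
    }
};

test "re-indexing reuses the freed slot" {
    var ids = DocIds{};
    var i: u32 = 0;
    while (i < 5) : (i += 1) {
        _ = ids.indexFile("src/main.zig");
    }
    try std.testing.expectEqual(@as(u32, 1), ids.len);
    try std.testing.expectEqual(@as(u32, 0), ids.free_len);
}
```

The slot count stays at one however many times the file is re-indexed, because each removal feeds the id straight back to the next allocation.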
Nearby non-regression
All existing trigram index tests pass unchanged — the fix is internal to the doc_id allocation strategy and does not change the public API or query semantics.
Note: upstream `main` does not compile on Windows (4 pre-existing errors: `std.posix.mmap`, `std.posix.getenv`, `SOCKET` type, libc dependency). The only new code in this PR (`src/index.zig`) has zero Windows-specific paths and compiles cleanly on all platforms. Tests were verified to the extent possible on Windows; full test suite requires Linux/macOS.
Rebased
Yes: single commit on top of current `main` (268f9c9).
Generated files / lockfiles / benchmarks
None changed.
CONTRIBUTING.md compliance
Confirmed.