fix: reuse freed doc_ids in TrigramIndex to prevent unbounded growth (#227) by JF10R · Pull Request #229 · justrach/codedb

JF10R · 2026-04-09T21:26:12Z

Closes #227

Summary

TrigramIndex.id_to_path grows unboundedly when files are re-indexed. Each indexFile call for an already-indexed file appends a new doc_id slot without reclaiming the old one, causing monotonic memory growth proportional to re-index frequency.

Exact change

Changed id_to_path from ArrayList([]const u8) to ArrayList(?[]const u8)
Added free_ids: ArrayList(u32) free-list
removeFile: nulls out the stale id_to_path slot and pushes the doc_id onto free_ids
getOrCreateDocId: pops from free_ids before appending a new entry
candidates, candidatesRegex, writeToDisk: skip null slots (also fixes phantom search results from stale doc_ids)

Files touched

src/index.zig — TrigramIndex struct and its methods only

Red-to-green

Before

Re-indexing the same file N times causes id_to_path.items.len to grow by N:

// Pseudocode trace of the bug:
indexFile("src/main.zig", content)  // id_to_path: ["src/main.zig"]         len=1
indexFile("src/main.zig", content)  // id_to_path: ["src/main.zig", "src/main.zig"]  len=2
indexFile("src/main.zig", content)  // id_to_path: ["src/main.zig", "src/main.zig", "src/main.zig"]  len=3
// ... grows without bound

Observed: 425 MB/min private memory growth on a ~22K file repo with active file watching.

After

Re-indexing the same file reuses the freed doc_id slot:

indexFile("src/main.zig", content)  // id_to_path: ["src/main.zig"]  len=1, free_ids: []
indexFile("src/main.zig", content)  // id_to_path: ["src/main.zig"]  len=1, free_ids: []
indexFile("src/main.zig", content)  // id_to_path: ["src/main.zig"]  len=1, free_ids: []
// ... stays bounded

Nearby non-regression

All existing trigram index tests pass unchanged — the fix is internal to the doc_id allocation strategy and does not change the public API or query semantics.

Note: upstream main does not compile on Windows (4 pre-existing errors: std.posix.mmap, std.posix.getenv, SOCKET type, libc dependency). The only new code in this PR (src/index.zig) has zero Windows-specific paths and compiles cleanly on all platforms. Tests were verified to the extent possible on Windows; full test suite requires Linux/macOS.

Rebased

Yes — single commit on top of current main (268f9c9).

Generated files / lockfiles / benchmarks

None changed.

CONTRIBUTING.md compliance

Confirmed.

…ath growth TrigramIndex.removeFile deletes from path_to_id but never clears the corresponding id_to_path slot. The next indexFile call for the same path appends a new entry, so id_to_path grows by one slot per re-index cycle per file — causing ~425 MB/min memory growth on actively watched repos. Change id_to_path from ArrayList([]const u8) to ArrayList(?[]const u8) and add a free_ids list. removeFile now nulls out the stale slot and pushes the doc_id onto free_ids. getOrCreateDocId pops from free_ids before appending, keeping id_to_path bounded. Query paths (candidates, candidatesRegex, writeToDisk) now skip null slots, which also fixes phantom search results from stale doc_ids. Closes justrach#227

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aa36d97ba5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

src/index.zig

Copilot

Pull request overview

This PR fixes unbounded memory growth in TrigramIndex by reusing doc_id slots when files are re-indexed, preventing id_to_path from growing monotonically with watcher-driven reindex cycles.

Changes:

Change id_to_path to allow null “freed” slots and skip them in query/persistence paths.
Add a free_ids free-list, populated on removeFile and consumed by getOrCreateDocId.
Update candidate collection and disk serialization to ignore freed (null) doc IDs.

src/index.zig

Reused doc_ids from the free-list can be non-monotonic, breaking the sorted-order invariant that PostingList.getByDocId and candidates rely on for binary search and merge intersection. Replace the raw append with getOrAddPosting which does a binary search and inserts at the correct sorted position. Addresses review feedback from Codex and Copilot on justrach#229.

justrach · 2026-04-10T06:49:44Z

Looked through the current head (16ae8b0), including the earlier cloud review thread.

The cloud/Copilot finding about breaking posting-list sort order after doc_id reuse was valid on the first commit, but it looks addressed on the current head: indexFile() now uses getOrAddPosting() instead of raw append, so the binary-search paths (getByDocId / containsDocId) and the merge-intersection logic in candidates() keep their sorted-order invariant even when a lower doc_id is recycled.

I also don’t see anything suspicious or malicious in the patch. This is a narrow in-memory bookkeeping fix plus null-guards in query/persistence paths, and zig build test --summary all passed locally for me on this branch.

The one gap I still see is regression coverage for the exact free-list reuse path. There are existing trigram remove/reindex tests in src/tests.zig, but this PR doesn’t add a test that actually forces doc_id reuse and then proves search correctness stays intact. A small regression test that removes/reindexes the same file or recycles a low doc_id after higher ones exist would make this much safer to keep.

justrach · 2026-04-10T06:54:19Z

For anyone other people/agents (including you @codex) else reviewing this, these are the parts of src/index.zig that matter most on the current head:

getOrCreateDocId() / removeFile() around lines 555-593
Review the free-list invariants here: every removed live doc_id should be nulled exactly once, pushed to free_ids exactly once, and then reused only after all old postings for that doc_id have been removed.
indexFile() around lines 643-650
This is the critical correctness fix for the earlier cloud review. Because reused doc_ids are no longer monotonic, this path must preserve posting-list sort order. The important thing to verify is that getOrAddPosting() is sufficient for all trigram insertions and cannot accidentally leave duplicate / out-of-order postings behind.
candidates() around lines 758-760 and candidatesRegex() around lines 845-848
These are the query paths that now skip null id_to_path slots. Reviewers should sanity-check that this removes phantom results without masking any still-live docs.
writeToDisk() around lines 925-933
This is the persistence boundary. Reviewers should verify that freed/null doc slots are excluded from serialization, and that the disk image only reflects currently live files.
Tests
The current suite passes, but I still think the main thing missing is an explicit regression test that forces freed-doc_id reuse and then proves query correctness after reuse. That is the most valuable extra review/test coverage for this change.

justrach

Thanks for working through this. The free-list reuse plus the sorted-insert follow-up on the current head looks like the right fix for the unbounded id_to_path growth and the posting-order issue.

The one change I still want before merge is explicit regression coverage for the exact reuse path. Please add a test that forces a freed doc_id to be reused and then proves query correctness still holds afterward — ideally covering search/candidate behavior so we know we do not reintroduce stale or phantom paths.

Once that is in, this looks close.

Add regression coverage for freed doc_id reuse in the trigram index. The test forces reuse after higher ids exist and verifies candidates exclude stale paths while still returning the live reused path. Co-Authored-By: Codex <noreply@openai.com>

JF10R · 2026-04-12T01:46:31Z

Added the requested regression test in 9f8b0e4.

It forces freed-doc_id reuse and verifies candidate lookup still returns the live paths while excluding the removed stale path.

Checks run:

zig test src/tests.zig --test-filter "trigram index: reuses freed doc_id without stale candidates"
zig test src/tests.zig --test-filter "trigram index: re-index removes old trigrams"

Let me know if anything's missing or wrong.

Copilot AI review requested due to automatic review settings April 9, 2026 21:26

Copilot started reviewing on behalf of JF10R April 9, 2026 21:26 View session

chatgpt-codex-connector bot reviewed Apr 9, 2026

View reviewed changes

src/index.zig Show resolved Hide resolved

Copilot AI reviewed Apr 9, 2026

View reviewed changes

src/index.zig Show resolved Hide resolved

justrach requested changes Apr 11, 2026

View reviewed changes

test(index): cover reused trigram doc ids

9f8b0e4

Add regression coverage for freed doc_id reuse in the trigram index. The test forces reuse after higher ids exist and verifies candidates exclude stale paths while still returning the live reused path. Co-Authored-By: Codex <noreply@openai.com>

JF10R requested a review from justrach April 12, 2026 01:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: reuse freed doc_ids in TrigramIndex to prevent unbounded growth (#227)#229

fix: reuse freed doc_ids in TrigramIndex to prevent unbounded growth (#227)#229
JF10R wants to merge 3 commits intojustrach:mainfrom
JF10R:fix/227-trigram-id-reuse

JF10R commented Apr 9, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

justrach commented Apr 10, 2026

Uh oh!

justrach commented Apr 10, 2026 •

edited

Loading

Uh oh!

justrach left a comment

Uh oh!

JF10R commented Apr 12, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

JF10R commented Apr 9, 2026

Summary

Exact change

Files touched

Red-to-green

Before

After

Nearby non-regression

Rebased

Generated files / lockfiles / benchmarks

CONTRIBUTING.md compliance

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

justrach commented Apr 10, 2026

Uh oh!

justrach commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

justrach left a comment

Choose a reason for hiding this comment

Uh oh!

JF10R commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

justrach commented Apr 10, 2026 •

edited

Loading

JF10R commented Apr 12, 2026 •

edited

Loading