release: v0.2.5813 — search quality + nanoregex + CLOCK content cache (#459)

Merged: justrach merged 6 commits into main from release/v0.2.5813 on May 11, 2026

@justrach (Owner) commented on May 11, 2026

Bumps codedb to v0.2.5813.

Highlights

Explore — Tier 0 search-quality rewrite

  • #447, #451: canonical definition sites in files >64KB are no longer invisible (Tier 0 now builds candidates from word_index.search directly, capturing skip-trigram files).
  • #449: popular identifiers keep code-first / doc-second diversity instead of being skipped by a word_hits.len <= max_results * 2 gate.
  • #450: Tier 0.5 prefix expansion no longer overshoots max_results by one.
  • #448: rerank uses symmetric stem matching + case-insensitive symbol-definition equality.

Regex — nanoregex integration (#454)

  • Replaced the ~300-line homegrown matcher with justrach/nanoregex.
  • 2.7-4.3x faster end-to-end on common codedb_search regex=true shapes (literal, alternation, dot-star, char-class).
  • Correctness wins: \b actually works, {n,m} works, *?/+? work, ReDoS-safe.
  • Bonus upstream patch: fixed false-negative in nanoregex's extractLiteralPrefix where hel+o was silently missing helllo.
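The correctness items above can be illustrated with Python's `re` module, used here only as a stand-in: a commit message in this PR describes nanoregex as having Python-re-compatible semantics, so the expected match results should agree. This is not the nanoregex API, and it says nothing about speed or ReDoS safety (Python's engine is a backtracking one).

```python
import re

# \b is a word-boundary assertion, not a literal "b".
word = re.compile(r"\bfoo\b")
boundary_hit = word.search("call foo here") is not None
boundary_miss = word.search("foobar") is None

# {n,m} bounds the repetition count.
counted_ok = re.fullmatch(r"a{2,3}", "aaa") is not None
counted_too_long = re.fullmatch(r"a{2,3}", "aaaa") is None

# +? is lazy: it stops at the first position that lets the match succeed.
lazy = re.match(r"<.+?>", "<a><b>").group()
greedy = re.match(r"<.+>", "<a><b>").group()

# hel+o must match any number of l's, including "helllo".
plus_hit = re.search(r"hel+o", "say helllo") is not None
```

Each variable names the behaviour it checks, so the same cases could be ported to a Zig test against the real engine.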

Explore — bounded-memory content cache (#208)

  • New src/hot_cache.zig — fixed-capacity CLOCK (second-chance) eviction cache replaces the unbounded Explorer.contents: StringHashMap.
  • MaxRSS 623MB → 225MB on the 1002-file snapshot-writer test. Hot files stay resident; cold files fall through to disk on next access.
  • Design adapted from justrach/turbodb.
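A minimal sketch of CLOCK (second-chance) eviction, written in Python rather than the project's Zig and not taken from src/hot_cache.zig. The class name `ClockCache`, its fields, and the choice to insert new entries with the reference bit cleared are all assumptions made for illustration.

```python
class ClockCache:
    """Fixed-capacity cache with CLOCK (second-chance) eviction. Illustrative only."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity  # each slot: [key, value, referenced]
        self.index = {}                 # key -> slot number
        self.hand = 0                   # position of the clock hand
        self.evictions = 0

    def get(self, key):
        i = self.index.get(key)
        if i is None:
            return None                 # miss: caller falls back to disk
        self.slots[i][2] = True         # mark as recently used
        return self.slots[i][1]

    def put(self, key, value):
        i = self.index.get(key)
        if i is not None:               # overwrite in place
            self.slots[i][1] = value
            self.slots[i][2] = True
            return
        while True:
            slot = self.slots[self.hand]
            if slot is None:
                break                   # free slot
            if slot[2]:
                slot[2] = False         # second chance: clear bit, move on
                self.hand = (self.hand + 1) % self.capacity
                continue
            del self.index[slot[0]]     # cold entry: evict it
            self.evictions += 1
            break
        self.slots[self.hand] = [key, value, False]
        self.index[key] = self.hand
        self.hand = (self.hand + 1) % self.capacity
```

With capacity 2, putting `a.zig` and `b.zig`, reading `a.zig`, then putting `c.zig` evicts `b.zig`: the hand clears the reference bit on `a.zig`, moves on, and reclaims the first slot whose bit is already clear. Memory stays bounded by the slot count regardless of how many files are indexed.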

Test plan

  • zig build test --summary all → 536/536 passing (530 pre-existing + 5 new ContentCache unit tests + 1 issue-208 integration test).
  • CI bench-regression green on each upstream branch before merging into the release branch.
  • CI green on this release PR.

Commits

🤖 Generated with Claude Code

justrach and others added 3 commits May 10, 2026 01:12
…458)

* fix(explore): integrate nanoregex for correct regex matching (#454)

Replace the homegrown regex matcher with nanoregex (justrach/nanoregex),
a pure-Zig Thompson-NFA/DFA engine with Python-re-compatible semantics.

Key correctness fix: the old matcher silently treated \b as literal 'b'
instead of a word-boundary assertion, causing false matches (issue #454).
nanoregex correctly handles \b, \B, {n,m} quantifiers, and is immune to
catastrophic backtracking on patterns like (a+)+b.

Also patches a false-negative bug in nanoregex's extractLiteralPrefix
prefilter: patterns like hel+o incorrectly computed "helo" as the literal
prefix (skipping matches in haystacks like "helllo"). Fixed by making
collectPrefix return a stop-signal bool so the concat loop halts after any
quantified node even when that node extended the prefix.

Performance: nanoregex is 4-6x faster than the homegrown matcher on common
codedb_search shapes (literal, alternation, dot-star, char-class) due to
DFA table-lookup hot path after warmup.

Changes:
- build.zig.zon: add nanoregex dependency
- build.zig: wire nanoregex module into exe, tests, adversarial_tests
- src/explore.zig: replace regexMatch + 7 helper functions (~300 lines)
  with nanoregex-backed implementations; swap two call sites to compile
  once per file rather than per line
- src/tests.zig: add failing test for issue-454 word-boundary behaviour
- zig-pkg/nanoregex-*/src/prefilter.zig: fix extractLiteralPrefix bug

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
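
The prefilter fix described in this commit message can be sketched as follows. This is a Python illustration, not the code in prefilter.zig: the tuple-based node encoding and the function names `collect_prefix` and `extract_literal_prefix` are hypothetical stand-ins for nanoregex's `collectPrefix` and `extractLiteralPrefix`.

```python
def collect_prefix(node, out):
    """Append guaranteed literal characters to out.

    Returns True when the caller must stop extending the prefix.
    """
    kind = node[0]
    if kind == "lit":
        out.append(node[1])
        return False              # plain literal: keep going
    if kind == "plus":
        # One repetition is guaranteed, so the prefix may grow,
        # but whatever follows a quantified node is not at a fixed offset.
        collect_prefix(node[1], out)
        return True               # stop signal, even though out was extended
    return True                   # star, optional, etc.: nothing guaranteed


def extract_literal_prefix(nodes):
    out = []
    for node in nodes:
        if collect_prefix(node, out):
            break                 # halt the concat loop after a quantified node
    return "".join(out)


# hel+o as a concatenation of nodes
HEL_PLUS_O = [("lit", "h"), ("lit", "e"), ("plus", ("lit", "l")), ("lit", "o")]
prefix = extract_literal_prefix(HEL_PLUS_O)
```

Under these assumptions the prefix for `hel+o` is `hel`, which does occur in `helllo`. Without the stop signal the loop would append `o` and produce `helo`, which does not occur in `helllo`, so the prefilter would reject a haystack that the full pattern matches.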

* fix(build): wire nanoregex into bench and benchmark targets

src/bench.zig and src/benchmark.zig both import Explorer from
explore.zig, which now requires the nanoregex module. CI's
bench-regression workflow caught this. Also wire nanoregex into the
wasm target for completeness (wasm build is broken on main anyway
due to unrelated std API drift, but the import is consistent).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…ces)

ab8f7cd's Tier 0 rewrite already fixes #447 implicitly — Tier 0 now
builds candidates directly from word_index.search, which captures hits
in skip_trigram_files alongside fully-indexed files. The new test pins
this behavior so a future Tier 0 refactor cannot silently regress.

(PR #456's structural fix on top of the old code is obviated; closing.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

| Tool | Base (ns) | Head (ns) | Delta | Abs Delta (ns) | Status |
| --- | --- | --- | --- | --- | --- |
| codedb_bundle | 538840 | 471382 | -12.52% | -67458 | OK |
| codedb_changes | 56609 | 54440 | -3.83% | -2169 | OK |
| codedb_deps | 9111 | 9042 | -0.76% | -69 | OK |
| codedb_edit | 5178 | 5199 | +0.41% | +21 | OK |
| codedb_find | 61012 | 61935 | +1.51% | +923 | OK |
| codedb_hot | 101244 | 100513 | -0.72% | -731 | OK |
| codedb_outline | 288618 | 297963 | +3.24% | +9345 | OK |
| codedb_read | 91860 | 92139 | +0.30% | +279 | OK |
| codedb_search | 200480 | 142800 | -28.77% | -57680 | OK |
| codedb_snapshot | 281574 | 286786 | +1.85% | +5212 | OK |
| codedb_status | 209006 | 199180 | -4.70% | -9826 | OK |
| codedb_symbol | 61326 | 61296 | -0.05% | -30 | OK |
| codedb_tree | 66278 | 70795 | +6.82% | +4517 | OK |
| codedb_word | 70042 | 74608 | +6.52% | +4566 | OK |
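
One plausible reading of how the two thresholds in this report combine, sketched in Python. The CI script itself is not shown in this PR, so the function name `classify`, the `FAIL` label, the strictness of the comparisons, and the assumption that only slowdowns count are all guesses consistent with the rows above (large improvements such as codedb_search are reported as OK).

```python
PCT_THRESHOLD = 10.00       # percent, from the report header
ABS_THRESHOLD_NS = 50_000   # nanoseconds, from the report header


def classify(base_ns, head_ns):
    """Classify one benchmark row. Positive deltas are slowdowns."""
    delta_ns = head_ns - base_ns
    pct = 100.0 * delta_ns / base_ns
    if pct <= PCT_THRESHOLD:
        return "OK"         # within the percentage budget, or faster
    if delta_ns <= ABS_THRESHOLD_NS:
        return "NOISE"      # large percentage, but tiny in absolute terms
    return "FAIL"           # hypothetical label for a real regression
```

Under this reading, a tool that goes from 1,000 ns to 1,200 ns (+20%, +200 ns) would be NOISE, while 200,000 ns to 300,000 ns (+50%, +100,000 ns) would fail CI. Requiring both thresholds keeps very fast tools such as codedb_edit from failing on timer jitter.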

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e84fa8ff3b

Comment thread build.zig
Comment on lines +36 to +37
const nanoregex_dep = b.dependency("nanoregex", .{});
exe.root_module.addImport("nanoregex", nanoregex_dep.module("nanoregex"));

[P1] Wire nanoregex into the exported module

Because src/lib.zig re-exports Explorer from explore.zig, and explore.zig now imports @import("nanoregex"), the module created by b.addModule("codedb", ...) and the lib_tests target also need the same addImport. Right now only the CLI/test executables are wired, so zig build test reaches the library test without a nanoregex module, and downstream users who import codedb.Explorer will fail to compile.

justrach and others added 3 commits May 12, 2026 03:07
Add hot_cache.zig ContentCache module with CLOCK eviction and
test "issue-208: content cache evicts cold entries under pressure"
that verifies bounded capacity and eviction firing under pressure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace explorer.contents StringHashMap with a fixed-capacity CLOCK
(second-chance) eviction cache. Before this change, indexing large repos
(e.g. openclaw) held up to 1.7 GB of raw file text in a StringHashMap
with no eviction — or dropped it all via releaseContents with no middle
ground.

Design follows justrach/turbodb src/hot_cache.zig: FNV-1a hash, probe
limit of 4, CLOCK hand sweeps to evict cold (unreferenced) entries, atomic
hit/miss/eviction counters.

Changes:
- src/hot_cache.zig: new ContentCache with get/put/remove/clear/iterator,
  len/count/contains helpers, and Stats struct
- src/explore.zig: replace contents field; releaseContents now calls
  cache.clear(); indexFileInner puts unconditionally (CLOCK handles bounds)
- src/snapshot.zig: insertRestoredFile uses cache.put (copies content,
  always frees the snapshot-read buffer)
- src/index.zig: buildFrequencyTableFromMap accepts *ContentCache
- src/watcher.zig: buildTrigramsFromCache accepts *ContentCache
- src/tests.zig: update tests that relied on the old threshold behaviour

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

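The hashing and bounded probing named in this commit message can be sketched in Python as below. The commit only says "FNV-1a hash, probe limit of 4"; the 64-bit variant, linear probing with wraparound, and the names `fnv1a_64` and `probe_slots` are assumptions, not details confirmed by src/hot_cache.zig.

```python
FNV_OFFSET_BASIS = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV_PRIME = 0x100000001B3              # standard 64-bit FNV prime
PROBE_LIMIT = 4                        # from the commit message


def fnv1a_64(data: bytes) -> int:
    """FNV-1a: XOR each byte into the hash, then multiply by the prime."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF  # wrap to 64 bits
    return h


def probe_slots(key: str, capacity: int) -> list[int]:
    """Slots a lookup may inspect for key: a short window, never a full scan."""
    start = fnv1a_64(key.encode("utf-8")) % capacity
    return [(start + i) % capacity for i in range(PROBE_LIMIT)]
```

Capping the probe window keeps both lookups and insertions at a constant number of slot checks. When every slot in the window is occupied, an eviction policy such as the CLOCK sweep has to free one of them.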
Bump build.zig.zon to 0.2.5813. CHANGELOG entry covers:
- Tier 0 rewrite for code-first / skip-trigram-aware search (#447, #449, #451)
- Tier 0.5 prefix-tier max_results contract (#450)
- Symmetric stem + case-insensitive symbol rerank (#448)
- nanoregex integration: 2.7-4.3x speedup + \b/{n,m}/ReDoS correctness (#454)
- CLOCK eviction cache for file contents: 623MB → 225MB MaxRSS on 1002-file workload (#208)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@justrach force-pushed the release/v0.2.5813 branch from e84fa8f to ce15930 on May 11, 2026 19:08
@justrach changed the title from "release: v0.2.5813 — search quality rewrite + nanoregex (#447, #448, #449, #450, #451, #454)" to "release: v0.2.5813 — search quality + nanoregex + CLOCK content cache" on May 11, 2026
@github-actions

Benchmark Regression Report

Thresholds: 10.00% and 50,000 ns absolute delta

NOISE means the percentage threshold was exceeded, but the absolute delta was too small to fail CI.

| Tool | Base (ns) | Head (ns) | Delta | Abs Delta (ns) | Status |
| --- | --- | --- | --- | --- | --- |
| codedb_bundle | 591039 | 519388 | -12.12% | -71651 | OK |
| codedb_changes | 66255 | 60682 | -8.41% | -5573 | OK |
| codedb_deps | 10522 | 11130 | +5.78% | +608 | OK |
| codedb_edit | 6170 | 6723 | +8.96% | +553 | OK |
| codedb_find | 68199 | 70424 | +3.26% | +2225 | OK |
| codedb_hot | 105756 | 114515 | +8.28% | +8759 | OK |
| codedb_outline | 323279 | 329038 | +1.78% | +5759 | OK |
| codedb_read | 97470 | 105703 | +8.45% | +8233 | OK |
| codedb_search | 202418 | 158472 | -21.71% | -43946 | OK |
| codedb_snapshot | 309933 | 299812 | -3.27% | -10121 | OK |
| codedb_status | 145267 | 131252 | -9.65% | -14015 | OK |
| codedb_symbol | 63863 | 65852 | +3.11% | +1989 | OK |
| codedb_tree | 71249 | 74129 | +4.04% | +2880 | OK |
| codedb_word | 74291 | 80583 | +8.47% | +6292 | OK |

@justrach merged commit f8ed07f into main on May 11, 2026
1 check passed