release: v0.2.5813 — search quality + nanoregex + CLOCK content cache #459
…458)

* fix(explore): integrate nanoregex for correct regex matching (#454)

Replace the homegrown regex matcher with nanoregex (justrach/nanoregex), a pure-Zig Thompson-NFA/DFA engine with Python-re-compatible semantics.

Key correctness fix: the old matcher silently treated \b as literal 'b' instead of a word-boundary assertion, causing false matches (issue #454). nanoregex correctly handles \b, \B, {n,m} quantifiers, and is immune to catastrophic backtracking on patterns like (a+)+b.

Also patches a false-negative bug in nanoregex's extractLiteralPrefix prefilter: patterns like hel+o incorrectly computed "helo" as the literal prefix (skipping matches in haystacks like "helllo"). Fixed by making collectPrefix return a stop-signal bool so the concat loop halts after any quantified node even when that node extended the prefix.

Performance: nanoregex is 4-6x faster than the homegrown matcher on common codedb_search shapes (literal, alternation, dot-star, char-class) due to DFA table-lookup hot path after warmup.

Changes:
- build.zig.zon: add nanoregex dependency
- build.zig: wire nanoregex module into exe, tests, adversarial_tests
- src/explore.zig: replace regexMatch + 7 helper functions (~300 lines) with nanoregex-backed implementations; swap two call sites to compile once per file rather than per line
- src/tests.zig: add failing test for issue-454 word-boundary behaviour
- zig-pkg/nanoregex-*/src/prefilter.zig: fix extractLiteralPrefix bug

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(build): wire nanoregex into bench and benchmark targets

src/bench.zig and src/benchmark.zig both import Explorer from explore.zig, which now requires the nanoregex module. CI's bench-regression workflow caught this. Also wire nanoregex into the wasm target for completeness (wasm build is broken on main anyway due to unrelated std API drift, but the import is consistent).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
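The extractLiteralPrefix fix described in the nanoregex commit above can be illustrated with a minimal sketch. This is not nanoregex's code: the Lit and Plus node types, collect_prefix, literal_prefix, and buggy_literal_prefix are hypothetical names for a toy pattern representation, written in Python only because the commit cites Python-re-compatible semantics. It assumes the prefilter skips any haystack that does not contain the computed literal prefix.

```python
import re
from dataclasses import dataclass


@dataclass
class Lit:
    ch: str  # a single literal character


@dataclass
class Plus:
    ch: str  # one-or-more repetitions of a literal character, e.g. l+


def collect_prefix(node, out):
    """Append the text this node guarantees; return True to stop the scan."""
    if isinstance(node, Lit):
        out.append(node.ch)
        return False  # exact literal: later nodes may still extend the prefix
    # Plus: one copy is guaranteed, but more copies may follow, so the
    # prefix is extended AND the caller is told to stop.
    out.append(node.ch)
    return True


def literal_prefix(nodes):
    out = []
    for node in nodes:
        if collect_prefix(node, out):
            break  # halt after any quantified node
    return "".join(out)


def buggy_literal_prefix(nodes):
    # Reproduces the described bug: the concat loop keeps going after a
    # quantified node that extended the prefix.
    out = []
    for node in nodes:
        collect_prefix(node, out)
    return "".join(out)


hel_plus_o = [Lit("h"), Lit("e"), Plus("l"), Lit("o")]  # the pattern hel+o
haystack = "helllo"

fixed = literal_prefix(hel_plus_o)
buggy = buggy_literal_prefix(hel_plus_o)
print(fixed, fixed in haystack)
print(buggy, buggy in haystack)
print(bool(re.search(r"hel+o", haystack)))
```

With the stop signal the prefix is "hel", which occurs in "helllo", so the haystack reaches the full matcher. Without it the prefix is "helo", which does not occur, and a prefilter based on it would skip a haystack that the pattern actually matches.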
…ces) ab8f7cd's Tier 0 rewrite already fixes #447 implicitly — Tier 0 now builds candidates directly from word_index.search, which captures hits in skip_trigram_files alongside fully-indexed files. The new test pins this behavior so a future Tier 0 refactor cannot silently regress. (PR #456's structural fix on top of the old code is obviated; closing.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Benchmark Regression Report
Thresholds: 10.00% and 50,000 ns absolute delta
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e84fa8ff3b
const nanoregex_dep = b.dependency("nanoregex", .{});
exe.root_module.addImport("nanoregex", nanoregex_dep.module("nanoregex"));
Wire nanoregex into the exported module
Because src/lib.zig re-exports Explorer from explore.zig, and explore.zig now imports @import("nanoregex"), the module created by b.addModule("codedb", ...) and the lib_tests target also need the same addImport. Right now only the CLI/test executables are wired, so zig build test reaches the library test without a nanoregex module, and downstream users who import codedb.Explorer will fail to compile.
Add hot_cache.zig ContentCache module with CLOCK eviction and test "issue-208: content cache evicts cold entries under pressure" that verifies bounded capacity and eviction firing under pressure. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace explorer.contents StringHashMap with a fixed-capacity CLOCK (second-chance) eviction cache.

Before this change, indexing large repos (e.g. openclaw) held up to 1.7 GB of raw file text in a StringHashMap with no eviction — or dropped it all via releaseContents with no middle ground.

Design follows justrach/turbodb src/hot_cache.zig: FNV-1a hash, probe limit of 4, CLOCK hand sweeps to evict cold (unreferenced) entries, atomic hit/miss/eviction counters.

Changes:
- src/hot_cache.zig: new ContentCache with get/put/remove/clear/iterator, len/count/contains helpers, and Stats struct
- src/explore.zig: replace contents field; releaseContents now calls cache.clear(); indexFileInner puts unconditionally (CLOCK handles bounds)
- src/snapshot.zig: insertRestoredFile uses cache.put (copies content, always frees the snapshot-read buffer)
- src/index.zig: buildFrequencyTableFromMap accepts *ContentCache
- src/watcher.zig: buildTrigramsFromCache accepts *ContentCache
- src/tests.zig: update tests that relied on the old threshold behaviour

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
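The CLOCK (second-chance) policy named in the commit above can be sketched as follows. This is a simplified illustration, not the ContentCache in src/hot_cache.zig: the ClockCache class and its fields are hypothetical, a Python dict stands in for the FNV-1a open-addressed table with its probe limit of 4, and plain integers stand in for the atomic counters. Only the eviction logic is meant to correspond.

```python
class ClockCache:
    """Fixed-capacity cache with CLOCK (second-chance) eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.keys = [None] * capacity      # slot -> key (None = empty slot)
        self.values = [None] * capacity    # slot -> cached content
        self.ref = [False] * capacity      # slot -> "referenced since last sweep"
        self.index = {}                    # key -> slot (stand-in for hashing)
        self.hand = 0                      # CLOCK hand position
        self.hits = self.misses = self.evictions = 0

    def __len__(self):
        return len(self.index)

    def get(self, key):
        slot = self.index.get(key)
        if slot is None:
            self.misses += 1
            return None
        self.ref[slot] = True              # mark hot so the hand spares it once
        self.hits += 1
        return self.values[slot]

    def put(self, key, value):
        slot = self.index.get(key)
        if slot is None:
            slot = self._claim_slot()
            self.keys[slot] = key
            self.index[key] = slot
            self.ref[slot] = False         # new entries start cold
        else:
            self.ref[slot] = True          # overwrite counts as a reference
        self.values[slot] = value

    def _claim_slot(self):
        # Sweep the hand: empty slots are taken immediately, referenced slots
        # get a second chance (bit cleared), unreferenced slots are evicted.
        while True:
            slot = self.hand
            self.hand = (self.hand + 1) % self.capacity
            if self.keys[slot] is None:
                return slot
            if self.ref[slot]:
                self.ref[slot] = False
                continue
            del self.index[self.keys[slot]]
            self.evictions += 1
            return slot


cache = ClockCache(capacity=2)
cache.put("a.zig", "contents of a")
cache.put("b.zig", "contents of b")
cache.get("a.zig")                         # a.zig is now referenced (hot)
cache.put("c.zig", "contents of c")        # the hand spares a.zig, evicts b.zig
print(sorted(cache.index), cache.evictions)
```

The sweep always terminates: one full pass clears every reference bit, so the second pass finds a victim at the latest. That bound is what lets a caller put unconditionally and leave capacity handling to the cache.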
Bump build.zig.zon to 0.2.5813.

CHANGELOG entry covers:
- Tier 0 rewrite for code-first / skip-trigram-aware search (#447, #449, #451)
- Tier 0.5 prefix-tier max_results contract (#450)
- Symmetric stem + case-insensitive symbol rerank (#448)
- nanoregex integration: 2.7-4.3x speedup + \b/{n,m}/ReDoS correctness (#454)
- CLOCK eviction cache for file contents: 623MB → 225MB MaxRSS on 1002-file workload (#208)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
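The regex behaviours listed in the CHANGELOG summary above (\b, {n,m}, lazy quantifiers) can be checked against Python's re module, which the nanoregex commit names as its compatibility target. The snippet below is a reference-behaviour sketch using only the standard library; it does not call nanoregex or codedb, and the sample strings are made up for illustration. The ReDoS case is deliberately left out, because Python's re is a backtracking engine and is not a safe place to try patterns like (a+)+b.

```python
import re

# \b is a zero-width word-boundary assertion, not the letter b.
word = re.compile(r"\bcat\b")
boundary_hit = word.search("a cat sat") is not None        # standalone word
boundary_miss = word.search("bobcatbird") is not None      # cat inside a word

# The misreading described for the old matcher: \b taken as a literal 'b'.
misread = re.compile("bcatb")
misread_hit = misread.search("bobcatbird") is not None     # false match

# {n,m} bounds the repetition count.
two_to_three = re.compile(r"a{2,3}")
bounded_ok = two_to_three.fullmatch("aaa") is not None
bounded_too_long = two_to_three.fullmatch("aaaa") is not None

# Lazy quantifiers stop at the first possible end; greedy ones at the last.
lazy = re.search(r"<.+?>", "<a><b>").group()
greedy = re.search(r"<.+>", "<a><b>").group()

print(boundary_hit, boundary_miss, misread_hit)
print(bounded_ok, bounded_too_long)
print(lazy, greedy)
```

The bobcatbird case shows the kind of false match the commit describes: read as a literal, the pattern finds bcatb inside the string, while the word-boundary reading correctly finds nothing.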
e84fa8f to ce15930
Bumps codedb to v0.2.5813.
Highlights
Explore — Tier 0 search-quality rewrite
- … word_index.search directly, capturing skip-trigram files).
- … word_hits.len <= max_results * 2 gate.
- … max_results by one.

Regex — nanoregex integration (#454)
- … justrach/nanoregex.
- … codedb_search regex=true shapes (literal, alternation, dot-star, char-class).
- \b actually works, {n,m} works, *?/+? work, ReDoS-safe.
- … extractLiteralPrefix where hel+o was silently missing helllo.

Explore — bounded-memory content cache (#208)
- src/hot_cache.zig — fixed-capacity CLOCK (second-chance) eviction cache replaces the unbounded Explorer.contents: StringHashMap.

Test plan
- zig build test --summary all → 536/536 passing (530 pre-existing + 5 new ContentCache unit tests + 1 issue-208 integration test).

Commits
- ab8f7cd fix(explore): resolve search quality regressions
- 3d7381b fix(explore): integrate nanoregex for correct regex matching (#454)
- c75b574 test: regression coverage for #447 (skip-trigram canonical file surfaces)
- 6b1341c test: failing test for #208 (CLOCK eviction cache)
- e066242 feat(explore): CLOCK eviction cache for file contents (#208)
- ce15930 release: v0.2.5813 — version bump + CHANGELOG

🤖 Generated with Claude Code