Problem
explore.regexMatch is a homegrown backtracking matcher (src/explore.zig:4298) that:
- Silently treats unknown escapes as literals: \b, \B, \A, \z are not word boundaries / anchors, they are literal b, B, A, z. Users typing \bfoo\b to find the exact word foo get zero matches.
- Has no {n,m} bounded quantifier: patterns like [a-z]{3,5} fail.
- Has no lazy quantifiers: .*? parses as .* followed by literal ?.
- Uses recursive backtracking with no linear-time guarantee, so an adversarial pattern like (a+)+b run against aaaaaaaaaaaaaaaaaaaaaaaaaaX exhibits classic catastrophic backtracking. (Mitigated for top-level alternation only.)
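To put a rough number on the last bullet, here is a toy model (not a measurement of explore.regexMatch) that counts how often a textbook backtracker retries the trailing b of (a+)+b on a run of n a's followed by a non-matching character. The function names are hypothetical, and the model assumes the matcher tries every way of splitting the run into groups before giving up.

```zig
const std = @import("std");

/// Toy model, hypothetical name: number of times the trailing `b` is attempted
/// once at least one `a+` group has been consumed and `remaining` a's are left.
fn attemptsAfterGroup(remaining: usize) u64 {
    var total: u64 = 1; // try `b` at this position; it fails on 'a' or 'X'
    var k: usize = 1;
    // Otherwise start another group of k a's and recurse on what is left.
    while (k <= remaining) : (k += 1) {
        total += attemptsAfterGroup(remaining - k);
    }
    return total;
}

/// Total failed `b` attempts for one starting position on a run of n a's.
fn attemptsFromStart(n: usize) u64 {
    var total: u64 = 0;
    var k: usize = 1;
    // The first group must consume at least one 'a'.
    while (k <= n) : (k += 1) {
        total += attemptsAfterGroup(n - k);
    }
    return total;
}

test "toy model grows as 2^n - 1" {
    try std.testing.expectEqual(@as(u64, 7), attemptsFromStart(3));
    try std.testing.expectEqual(@as(u64, 1023), attemptsFromStart(10));
}
```

Under this model the count is 2^n - 1 per starting position, so a run of 26 a's already means 67,108,863 failed attempts. A Pike VM avoids this by advancing a bounded set of states once per input byte instead of re-exploring each split.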
Why this matters
codedb_search regex=true is the only regex surface, and the matcher's silently-wrong escape handling is the worst kind of bug: the search returns "no results" instead of failing loudly, so callers (and LLM agents) assume the term is genuinely absent.
Proposal
Adopt justrach/nanoregex: pure Zig, no external deps, minimum_zig_version = 0.16.0 (matches ours), Pike VM with five-tier literal/prefix dispatch. The license is expected to be compatible (confirmation is listed under Risks). Adds one dependency to build.zig.zon (currently has only mcp_zig).
Swap targets (only two call sites):
- src/explore.zig:4030 (searchInContentRegexWithScope)
- src/explore.zig:4280 (searchInContentRegex)
The trigram prefilter at index.zig:2259 (decomposeRegex) is independent and stays — it parses regex syntax to extract trigrams, not to match.
After the swap, delete the ~200-line homegrown matcher (regexMatch, regexMatchSingle, matchHere, matchGroupBranch, helpers).
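The reason the trigram prefilter can stay untouched is that it only answers "could this file possibly match?", while the matcher answers "does it match?". The sketch below illustrates that split for a single literal run such as bar; it is not the decomposeRegex implementation, the names are hypothetical, and a real index would look trigrams up in a prebuilt table rather than rescanning content.

```zig
const std = @import("std");

/// Hypothetical helper: does the 3-byte window `tri` occur anywhere in `content`?
fn containsWindow(content: []const u8, tri: []const u8) bool {
    var j: usize = 0;
    while (j + tri.len <= content.len) : (j += 1) {
        if (std.mem.eql(u8, content[j .. j + tri.len], tri)) return true;
    }
    return false;
}

/// Necessary-but-not-sufficient check: every trigram of `literal` must occur
/// in `content`. A `true` result still needs confirmation by the real matcher.
fn passesTrigramPrefilter(content: []const u8, literal: []const u8) bool {
    if (literal.len < 3) return true; // no trigram available, so nothing can be ruled out
    var i: usize = 0;
    while (i + 3 <= literal.len) : (i += 1) {
        if (!containsWindow(content, literal[i .. i + 3])) return false;
    }
    return true;
}

test "prefilter rules files out but never confirms a match" {
    try std.testing.expect(passesTrigramPrefilter("foo bar baz", "bar"));
    try std.testing.expect(!passesTrigramPrefilter("foo baz", "bar"));
    // False positive: both trigrams of "abcd" are present, the literal is not.
    try std.testing.expect(passesTrigramPrefilter("abc bcd", "abcd"));
}
```

Because the prefilter only ever narrows the candidate set, replacing the matcher behind it should not require changes to the trigram logic, provided both agree on which parts of a pattern are literal.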
Failing Test
test "issue-454: regex \\b word boundary matches whole-word, not literal 'b'" {
    // Today regexMatch treats \b as escaped literal 'b', so the pattern
    // collapses to the search string "bbarb" which is not in the haystack.
    // After nanoregex integration, \b is a word-boundary anchor and this
    // matches the standalone word "bar".
    try testing.expect(regexMatch("foo bar baz", "\\bbar\\b"));
    // And the inverse direction: \b should NOT match inside a word.
    try testing.expect(!regexMatch("foobarbaz", "\\bbar\\b"));
}
Fails on current main at the first expect (pattern \bbar\b parsed as the literal string bbarb).
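For reference, the behavior the test expects can be sketched without any regex engine. This is a minimal illustration assuming the common ASCII definition of a word character ([A-Za-z0-9_]); isWordChar, isWordBoundary and containsWholeWord are hypothetical names, and nanoregex's exact \b definition should be checked against its own documentation.

```zig
const std = @import("std");

/// ASCII word character, as assumed for this sketch.
fn isWordChar(c: u8) bool {
    return switch (c) {
        'a'...'z', 'A'...'Z', '0'...'9', '_' => true,
        else => false,
    };
}

/// True when `pos` sits between a word char and a non-word char (or a string edge).
fn isWordBoundary(haystack: []const u8, pos: usize) bool {
    const before = pos > 0 and isWordChar(haystack[pos - 1]);
    const after = pos < haystack.len and isWordChar(haystack[pos]);
    return before != after;
}

/// Equivalent of searching for \bword\b with a literal `word`.
fn containsWholeWord(haystack: []const u8, word: []const u8) bool {
    var i: usize = 0;
    while (i + word.len <= haystack.len) : (i += 1) {
        if (!std.mem.eql(u8, haystack[i .. i + word.len], word)) continue;
        if (isWordBoundary(haystack, i) and isWordBoundary(haystack, i + word.len)) return true;
    }
    return false;
}

test "whole-word semantics expected from \\b" {
    try std.testing.expect(containsWholeWord("foo bar baz", "bar"));
    try std.testing.expect(!containsWholeWord("foobarbaz", "bar"));
}
```

The two assertions mirror the failing test: a boundary exists on both sides of bar in "foo bar baz", and on neither side in "foobarbaz".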
Risks / open questions
- Schema-payload size: none, this is internal.
- Behavior diff: any client that was accidentally relying on \b being a literal b will break. Auditing codedb_search callers should show whether this is a real concern; in practice nobody types \b meaning literal b.
- License compatibility: confirm nanoregex license matches codedb's (MIT-like) before vendoring.
- 0.16 compatibility: nanoregex declares a 0.16.0 minimum; needs a smoke build.
Acceptance criteria
- Existing regexMatch: tests continue to pass (parity for the supported subset).
- regexMatch (and its three helpers) deleted from src/explore.zig.
- nanoregex added as a dependency in build.zig.zon.
- codedb_search schema description bumped to reflect the supported regex feature set.