regex: integrate nanoregex to add \b, {n,m}, lazy quants, and ReDoS-safe matching #454

@justrach

Problem

explore.regexMatch is a homegrown backtracking matcher (src/explore.zig:4298) that:

  1. Silently treats unknown escapes as literals: \b, \B, \A, and \z are parsed as the literal characters b, B, A, and z rather than as word boundaries or anchors. A user typing \bfoo\b to find the exact word foo gets zero matches.
  2. Has no {n,m} bounded quantifier — patterns like [a-z]{3,5} fail.
  3. Has no lazy quantifiers — .*? parses as .* followed by literal ?.
  4. Uses recursive backtracking with no linear-time guarantee, so an adversarial pattern like (a+)+b run against the input aaaaaaaaaaaaaaaaaaaaaaaaaaX exhibits classic catastrophic backtracking. (Mitigated for top-level alternation only.)
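
For reference, the whole-word behavior that item 1 expects from \bfoo\b can be sketched without any regex engine. This is only an illustration of the semantics, assuming the usual ASCII definition of a word byte ([A-Za-z0-9_]); the names isWordByte, atWordBoundary, and containsWholeWord are hypothetical and are not taken from explore.zig or nanoregex.

```zig
const std = @import("std");

/// ASCII word byte, i.e. [A-Za-z0-9_] (assumed definition of \w for this sketch).
fn isWordByte(c: u8) bool {
    return switch (c) {
        'a'...'z', 'A'...'Z', '0'...'9', '_' => true,
        else => false,
    };
}

/// True when offset `i` lies between a word byte and a non-word byte,
/// treating the start and end of the text as non-word.
fn atWordBoundary(haystack: []const u8, i: usize) bool {
    const before = i > 0 and isWordByte(haystack[i - 1]);
    const after = i < haystack.len and isWordByte(haystack[i]);
    return before != after;
}

/// Hypothetical helper: reports whether `word` occurs in `haystack` with a
/// word boundary on both sides, which is what \bword\b asks for.
fn containsWholeWord(haystack: []const u8, word: []const u8) bool {
    if (word.len == 0) return false;
    var pos: usize = 0;
    while (pos + word.len <= haystack.len) : (pos += 1) {
        if (std.mem.eql(u8, haystack[pos .. pos + word.len], word) and
            atWordBoundary(haystack, pos) and
            atWordBoundary(haystack, pos + word.len))
        {
            return true;
        }
    }
    return false;
}
```

Under these assumptions, containsWholeWord("foo bar baz", "bar") is true and containsWholeWord("foobarbaz", "bar") is false, whereas a matcher that reads \b as a literal b would instead be looking for the string bbarb in both inputs.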

Why this matters

codedb_search regex=true is the only regex surface, and the matcher's silently-wrong escape handling is the worst kind of bug: the search returns "no results" instead of failing loudly, so callers (and LLM agents) assume the term is genuinely absent.

Proposal

Adopt justrach/nanoregex: pure Zig, no external deps, minimum_zig_version = 0.16.0 (matches ours), Pike VM with five-tier literal/prefix dispatch. The license is expected to be compatible, pending the confirmation listed under Risks. Adds one dependency to build.zig.zon (which currently has only mcp_zig).

Swap targets — only two call sites:

  • src/explore.zig:4030 (searchInContentRegexWithScope)
  • src/explore.zig:4280 (searchInContentRegex)

The trigram prefilter at index.zig:2259 (decomposeRegex) is independent and stays — it parses regex syntax to extract trigrams, not to match.

After the swap, delete the ~200-line homegrown matcher (regexMatch, regexMatchSingle, matchHere, matchGroupBranch, helpers).

Failing Test

test "issue-454: regex \\b word boundary matches whole-word, not literal 'b'" {
    // Today regexMatch treats \b as escaped literal 'b', so the pattern
    // collapses to the search string "bbarb" which is not in the haystack.
    // After nanoregex integration, \b is a word-boundary anchor and this
    // matches the standalone word "bar".
    try testing.expect(regexMatch("foo bar baz", "\\bbar\\b"));

    // And the inverse direction — \b should NOT match inside a word:
    try testing.expect(!regexMatch("foobarbaz", "\\bbar\\b"));
}

Fails on current main at the first expect (pattern \bbar\b parsed as the literal string bbarb).

Risks / open questions

  • Schema-payload size: none, this is internal.
  • Behavior diff: any client that was accidentally relying on \b being a literal b will break. Auditing codedb_search callers should show whether this is a real concern; in practice nobody types \b meaning literal b.
  • License compatibility: confirm nanoregex license matches codedb's (MIT-like) before vendoring.
  • 0.16 compatibility: nanoregex declares 0.16.0 minimum; needs a smoke build.

Acceptance criteria

  • Failing test above passes.
  • All existing regexMatch: tests continue to pass (parity for the supported subset).
  • regexMatch (and its three helpers) deleted from src/explore.zig.
  • nanoregex added as a dependency in build.zig.zon.
  • Doc note in the codedb_search schema description listing the expanded regex feature set.

Labels: bug (Something isn't working), priority:p2 (Medium priority)
