
rgxall: multi-match capture-group extraction for HTML scraping #224

Merged: danieljohnmorris merged 4 commits into main from fix/rgxall-with-groups on May 13, 2026
Conversation

@danieljohnmorris
Collaborator

Summary

rgx pattern text has asymmetric semantics that bite the moment you reach for it on real-world text. Without a capture group it returns every match (good); with a capture group it silently returns only the first match (bad). The HTML-scrape case where you want <h2>([^<]+)</h2> to pull every title gets stuck on title number one, and the workaround is either dropping the group and post-stripping or chunking the input by hand. Both are token-heavy and the kind of thing a model has to rediscover every session.
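
To make the asymmetry concrete, here is a minimal Python model of the rgx behaviour as described in this PR. It is a sketch, not the ilo source: `rgx_model` is a hypothetical name, and Python's `re` stands in for the Rust regex engine.

```python
import re


def rgx_model(pattern: str, text: str) -> list[str]:
    """Illustrative model of rgx as described in this PR (not the real code)."""
    compiled = re.compile(pattern)
    if compiled.groups == 0:
        # No capture group: every non-overlapping whole match is returned.
        return [m.group(0) for m in compiled.finditer(text)]
    # With a capture group: only the first match is inspected.
    first = compiled.search(text)
    if first is None:
        return []
    # Keep only the groups that took part in the match.
    return [g for g in first.groups() if g is not None]


html = "<h2>One</h2> <h2>Two</h2> <h2>Three</h2>"
print(rgx_model("<h2>[^<]+</h2>", html))    # three whole matches
print(rgx_model("<h2>([^<]+)</h2>", html))  # ['One'] only
```

Adding the parentheses changes how many matches come back, which is the surprise this PR works around.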

rgxall pattern text -> L (L t) is the additive fix. Every match is wrapped as a list of capture-group strings; no-group patterns wrap the whole match in a single-element inner list so the outer shape stays uniform regardless of group count. The verifier table sees a predictable nested type and the agent doesn't have to branch on group count to figure out the shape.
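
The uniform shape can be sketched in Python as follows. This is an assumption-laden model rather than the interpreter code: `rgxall_model` is a hypothetical name, and Python's `re` is used in place of the Rust engine.

```python
import re


def rgxall_model(pattern: str, text: str) -> list[list[str]]:
    """Illustrative model of rgxall: one inner list per match."""
    compiled = re.compile(pattern)
    rows: list[list[str]] = []
    for m in compiled.finditer(text):
        # Python's .groups counts explicit groups only, so "> 0" here plays
        # the role of the "captures_len() > 1" check mentioned in the diff.
        if compiled.groups > 0:
            rows.append([g for g in m.groups() if g is not None])
        else:
            # No groups: wrap the whole match so the outer shape is the same.
            rows.append([m.group(0)])
    return rows


html = "<h2>One</h2> <h2>Two</h2> <h2>Three</h2>"
print(rgxall_model("<h2>([^<]+)</h2>", html))  # [['One'], ['Two'], ['Three']]
print(rgxall_model("<h2>[^<]+</h2>", html))    # whole matches, each in its own list
```

Either way the caller receives a list of lists, so no branching on group count is needed.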

I picked the uniform shape over saving a level of nesting for the no-group case because variadic output is a verifier nightmare and the friction of one flat call is far cheaper than every consumer guessing the shape from context. The manifesto cares about deterministic behaviour; that won.
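
As a rough illustration of that trade-off, the Python sketch below shows a consumer collapsing the nested result with a single one-level flatten. It assumes "flat" above means a one-level flatten; `flatten_one_level` is a hypothetical helper, not an ilo builtin.

```python
from itertools import chain


def flatten_one_level(rows: list[list[str]]) -> list[str]:
    """Collapse exactly one level of nesting, preserving order."""
    return list(chain.from_iterable(rows))


# One capture group per match: flattening yields the plain list of titles.
titles = flatten_one_level([["One"], ["Two"], ["Three"]])
print(titles)  # ['One', 'Two', 'Three']

# Two capture groups per match: the same call still works, no shape check needed.
fields = flatten_one_level([["a", "1"], ["b", "2"]])
print(fields)  # ['a', '1', 'b', '2']
```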

I also looked at changing rgx to always return all matches with groups, but that would silently reshape the output of every program that currently relies on first-match semantics. Additive only.

Repro on main (before):

ilo 'f>L t;rgx "<h2>([^<]+)</h2>" "<h2>One</h2> <h2>Two</h2> <h2>Three</h2>"' --run-tree f
[One]

With rgxall (after):

ilo 'f>L (L t);rgxall "<h2>([^<]+)</h2>" "<h2>One</h2> <h2>Two</h2> <h2>Three</h2>"' --run-tree f
[[One], [Two], [Three]]

What's in the diff

  • builtins: add rgxall pattern text in name table and type sigs (4e43e41) — Builtin::Rgxall variant, from_name/name() round-trip, ast alias ("regex_all", "rgxall"), parser arity ("rgxall", 2, &[]), verifier signature ("rgxall", &["t", "t"], "L (L t)"). No explicit verify handler needed, matches rgx's pattern of signature-table-only.
  • interpreter: dispatch rgxall over captures_iter for all matches (9bee71a) — re.captures_iter walks every non-overlapping match. re.captures_len() > 1 branches to capture-group mode; otherwise wraps each whole match in [whole]. ILO-R009 error family matches rgx/rgxsub.
  • tests + example: pin rgxall multi-match capture extraction (17bed89) — seven regression tests + examples/rgxall.ilo with -- run: / -- out: assertions.

Engine coverage

Tree-only at the engine level, with -- engine-skip: vm and -- engine-skip: cranelift on the example.

Reason: rgx itself is currently a compile error under --run-vm and --run-cranelift on main. The // Builtins that fall through to OP_CALL comment at src/vm/mod.rs:2394-2397 is aspirational, not current. The OP_CALL dispatcher only routes user functions via func_names; there's no builtin bridge yet. rgxall inherits the same limitation. Wiring it through (touches verify, VM compile, JIT compile, plus a uniform builtin-call lowering pass) is a separate, larger fix and out of scope here. Logged in the assessment doc as a parked adjacent finding.

When that bridge lands, both rgx and rgxall should switch on together and the test file's tree-only restriction can lift in the same PR.

Test plan

  • cargo test --release --features cranelift clean (2861 + 199 + ...; all suites green)
  • tests/regression_rgxall.rs 7/7 passing
  • examples/rgxall.ilo runs through tests/examples.rs and tests/examples_engines.rs with tree engine
  • cargo fmt --check clean after format pass
  • cargo clippy --release --features cranelift --all-targets -- -D warnings clean
  • Manual smoke test on the HTML-scrape repro confirms [[One], [Two], [Three]] instead of [One]

Follow-ups

  • The VM/cranelift bridge for the rgx family (and fmt variadic, rd path fmt, rdb, ...) is logged in ilo_assessment_feedback.md for a future PR. Once it lands, lift the tree-only restriction in tests/regression_rgxall.rs and the engine-skip lines in examples/rgxall.ilo.

rgxall is the multi-match sibling of rgx. It returns L (L t): every
match wrapped as a list of capture groups, with no-group patterns
wrapping the whole match in a single-element inner list so the outer
shape stays predictable.

This commit only wires the name, alias, arity, and type signature.
The interpreter dispatch lands in the next commit.

The bug: rgx pattern text returns only the first match's groups when
the pattern has a capture group (re.captures singular). Personas
hitting this on HTML scraping (30 titles, 1 returned) end up either
dropping the group and post-stripping or chunking the input.

rgxall uses re.captures_iter to walk every non-overlapping match.
Shape is uniform L (L t): no-group patterns wrap the whole match in
a single-element inner list, group patterns return the capture
strings. ILO-R009 error family matches rgx and rgxsub.

VM and cranelift dispatch still go through the interpreter for the
whole rgx family today (the OP_CALL builtin bridge those engines
need is logged separately), so rgxall is tree-only at the engine
level until that lands.

Seven regression tests covering: no matches, single match no groups,
multiple matches no groups, multiple matches with one capture group
(the HTML-scrape case), multiple matches with two capture groups,
unicode input, invalid regex pattern as runtime error.
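
The seven categories above can be mirrored against a small Python model. The inputs below are invented for illustration and are not the contents of tests/regression_rgxall.rs. `rgxall_model` is a hypothetical stand-in, and the `RuntimeError` only approximates the ILO-R009 family, whose message format is not shown here.

```python
import re


def rgxall_model(pattern: str, text: str) -> list[list[str]]:
    """Illustrative model of rgxall, with invalid patterns as runtime errors."""
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        # Stand-in for the ILO-R009 error family (assumed mapping).
        raise RuntimeError(f"invalid regex: {exc}") from exc
    rows = []
    for m in compiled.finditer(text):
        if compiled.groups > 0:
            rows.append([g for g in m.groups() if g is not None])
        else:
            rows.append([m.group(0)])
    return rows


# (label, pattern, text, expected): one row per category in the list above.
CASES = [
    ("no matches", r"\d+", "abc", []),
    ("single match, no groups", "b", "abc", [["b"]]),
    ("multiple matches, no groups", r"\d+", "1 22 333", [["1"], ["22"], ["333"]]),
    ("one capture group", "<h2>([^<]+)</h2>", "<h2>One</h2> <h2>Two</h2>", [["One"], ["Two"]]),
    ("two capture groups", r"(\w+)=(\d+)", "a=1 b=2", [["a", "1"], ["b", "2"]]),
    ("unicode input", r"\w+", "日本語 中国語", [["日本語"], ["中国語"]]),
]
for label, pattern, text, expected in CASES:
    assert rgxall_model(pattern, text) == expected, label

# Seventh category: an invalid pattern surfaces as a runtime error.
try:
    rgxall_model("(", "abc")
except RuntimeError:
    invalid_rejected = True
else:
    invalid_rejected = False
assert invalid_rejected
```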

Tree-only for now, matching rgx's actual reality on main (both
rgx and rgxall are compile-errors under --run-vm and --run-cranelift
because the OP_CALL builtin bridge isn't wired yet). When that gap
closes, both switch on together and this test should grow to check
all three engines.

examples/rgxall.ilo demonstrates the four canonical shapes with
engine-skip annotations matching the same limitation.
@codecov

codecov Bot commented May 13, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 13 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/interpreter/mod.rs 62.85% 13 Missing ⚠️

📢 Thoughts on this report? Let us know!

Code review surfaced that the rgxall doc comment claimed a uniform
shape that doesn't hold for alternation patterns like (a)|(b): the
filter_map drops absent groups, so inner-list length tracks
*participating* groups, not declared groups. This matches rgx's
existing behaviour at line 3225 and is the right call (the
alternative, emitting empty-string sentinels for absent groups,
collides with groups that legitimately match the empty string).

Tightening the comment to say so explicitly, and adding a regression
test for the alternation case so the behaviour is pinned.
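
The alternation behaviour and the sentinel collision can be reproduced with a short Python sketch. Python's `re` stands in for the Rust engine here, and both helper names are hypothetical.

```python
import re


def drop_absent(m: re.Match) -> list[str]:
    """Keep only groups that participated in the match (the chosen behaviour)."""
    return [g for g in m.groups() if g is not None]


def with_sentinel(m: re.Match) -> list[str]:
    """Rejected alternative: emit "" for groups that did not participate."""
    return [g if g is not None else "" for g in m.groups()]


# (a)|(b) declares two groups, but only one participates in each match,
# so every inner list has length 1.
alternation = [drop_absent(m) for m in re.finditer("(a)|(b)", "ab")]
print(alternation)  # [['a'], ['b']]

# Here group 1 genuinely matches the empty string and group 2 is absent.
m = re.search("(a*)|(b)", "b")
print(drop_absent(m))    # ['']
print(with_sentinel(m))  # ['', ''] - real empty match and absent group look alike
```

With sentinels, a consumer cannot tell a group that matched nothing from one that never took part in the match.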
@danieljohnmorris danieljohnmorris merged commit 4a4b247 into main May 13, 2026
4 of 5 checks passed
@danieljohnmorris danieljohnmorris deleted the fix/rgxall-with-groups branch May 13, 2026 12:45