rgxall: multi-match capture-group extraction for HTML scraping#224
Merged
Conversation
rgxall is the multi-match sibling of rgx. It returns L (L t): every match wrapped as a list of capture groups, with no-group patterns wrapping the whole match in a single-element inner list so the outer shape stays predictable. This commit only wires the name, alias, arity, and type signature. The interpreter dispatch lands in the next commit.
The bug: rgx pattern text returns only the first match's groups when the pattern has a capture group (re.captures singular). Personas hitting this on HTML scraping (30 titles, 1 returned) end up either dropping the group and post-stripping or chunking the input. rgxall uses re.captures_iter to walk every non-overlapping match. Shape is uniform L (L t): no-group patterns wrap the whole match in a single-element inner list, group patterns return the capture strings. ILO-R009 error family matches rgx and rgxsub. VM and cranelift dispatch still go through the interpreter for the whole rgx family today (the OP_CALL builtin bridge those engines need is logged separately), so rgxall is tree-only at the engine level until that lands.
Seven regression tests covering: no matches, single match no groups, multiple matches no groups, multiple matches with one capture group (the HTML-scrape case), multiple matches with two capture groups, unicode input, invalid regex pattern as runtime error. Tree-only for now, matching rgx's actual reality on main (both rgx and rgxall are compile-errors under --run-vm and --run-cranelift because the OP_CALL builtin bridge isn't wired yet). When that gap closes, both switch on together and this test should grow to check all three engines. examples/rgxall.ilo demonstrates the four canonical shapes with engine-skip annotations matching the same limitation.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Code review surfaced that the rgxall doc comment claimed a uniform shape that doesn't hold for alternation patterns like (a)|(b): the filter_map drops absent groups, so inner-list length tracks *participating* groups, not declared groups. This matches rgx's existing behaviour at line 3225 and is the right call (the alternative, emitting empty-string sentinels for absent groups, collides with groups that legitimately match the empty string). Tightening the comment to say so explicitly, and adding a regression test for the alternation case so the behaviour is pinned.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
rgx pattern texthas an asymmetric semantics that bites the moment you reach for it on real-world text. Without a capture group it returns every match (good); with a capture group it silently returns only the first match (bad). The HTML-scrape case where you want<h2>([^<]+)</h2>to pull every title gets stuck on title number one, and the workaround is either dropping the group and post-stripping or chunking the input by hand. Both are token-heavy and the kind of thing a model has to rediscover every session.rgxall pattern text -> L (L t)is the additive fix. Every match is wrapped as a list of capture-group strings; no-group patterns wrap the whole match in a single-element inner list so the outer shape stays uniform regardless of group count. The verifier table sees a predictable nested type and the agent doesn't have to branch on group count to figure out the shape.I picked the uniform shape over saving a level of nesting for the no-group case because variadic output is a verifier nightmare and the friction of one
flatcall is far cheaper than every consumer guessing the shape from context. The manifesto cares about deterministic behaviour; that won.I also looked at changing
rgxto always return all matches with groups, but that would silently reshape the output of every program that currently relies on first-match semantics. Additive only.Repro on main (before)
With
rgxall(after):What's in the diff
builtins: add rgxall pattern text in name table and type sigs(4e43e41) —Builtin::Rgxallvariant,from_name/name()round-trip, ast alias("regex_all", "rgxall"), parser arity("rgxall", 2, &[]), verifier signature("rgxall", &["t", "t"], "L (L t)"). No explicit verify handler needed, matchesrgx's pattern of signature-table-only.interpreter: dispatch rgxall over captures_iter for all matches(9bee71a) —re.captures_iterwalks every non-overlapping match.re.captures_len() > 1branches to capture-group mode; otherwise wraps each whole match in[whole].ILO-R009error family matchesrgx/rgxsub.tests + example: pin rgxall multi-match capture extraction(17bed89) — seven regression tests +examples/rgxall.ilowith-- run:/-- out:assertions.Engine coverage
Tree-only at the engine level, with
-- engine-skip: vmand-- engine-skip: cranelifton the example.Reason:
rgxitself is currently a compile error under--run-vmand--run-cranelifton main. The// Builtins that fall through to OP_CALLcomment atsrc/vm/mod.rs:2394-2397is aspirational, not current. The OP_CALL dispatcher only routes user functions viafunc_names; there's no builtin bridge yet.rgxallinherits the same limitation. Wiring it through (touches verify, VM compile, JIT compile, plus a uniform builtin-call lowering pass) is a separate, larger fix and out of scope here. Logged in the assessment doc as a parked adjacent finding.When that bridge lands, both
rgxandrgxallshould switch on together and the test file's tree-only restriction can lift in the same PR.Test plan
cargo test --release --features craneliftclean (2861 + 199 + ...; all suites green)tests/regression_rgxall.rs7/7 passingexamples/rgxall.iloruns throughtests/examples.rsandtests/examples_engines.rswith tree enginecargo fmt --checkclean after format passcargo clippy --release --features cranelift --all-targets -- -D warningsclean[One, Two, Three]instead of[One]Follow-ups
rgxfamily (andfmtvariadic,rd path fmt,rdb, ...) is logged inilo_assessment_feedback.mdfor a future PR. Once it lands, lift the tree-only restriction intests/regression_rgxall.rsand the engine-skip lines inexamples/rgxall.ilo.