builtins: rgxall1 flat-capture + ct count-by-predicate (rerun6 batch) #333

Merged
danieljohnmorris merged 5 commits into main from feature/rerun6-additions
May 17, 2026

@danieljohnmorris
Collaborator

Summary

Two builtin additions from the rerun6 persona round, both manifesto-aligned token-economy wins on multi-rerun-old shapes:

  • rgxall1 pat s -> L t: flat first-capture-group convenience over rgxall. Closes the ext xs:L t>t;hd xs helper that the html-scraper persona kept paying ~30 tokens to re-declare across five reruns (rerun5, line 4988).
  • ct fn xs -> n: count-by-predicate, parallel to flt's 2/3-arg shape. Avoids the intermediate L b allocation that len (flt p xs) pays on every call. Originating ask: bioinformatics rerun6, tm=cnt has-tm seqs over 20k proteins (line 5028).

Named ct rather than the persona's wished-for cnt because cnt is reserved as the loop-continue keyword (src/parser/mod.rs:3507). Two chars beat three, there is no parser surgery, and the loop-control reservation stays intact (regression-covered).

Repro before/after

rgxall1. html-scraper persona, "extract every match of one capture group":

-- Before (rerun5, 40 LOC; ext is the only helper that pays no semantic rent):
ext xs:L t>t;hd xs
titles=map ext (rgxall "<a class=\"titleline\">([^<]+)</a>" html)

-- After (PR):
titles=rgxall1 "<a class=\"titleline\">([^<]+)</a>" html

ct. bioinformatics persona, "how many proteins have a TM helix":

-- Before:
tm=len (flt has-tm seqs)   -- allocates L b sized to the keepers

-- After:
tm=ct has-tm seqs           -- straight counter, zero intermediate alloc
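
The allocation difference between the two ct shapes above can be sketched in Rust. This is only a model of the idea, not the PR's implementation: `has_tm` here is a hypothetical stand-in for the persona's predicate, and `len_of_filtered` / `count_by` are illustrative names for the `len (flt p xs)` and `ct p xs` shapes.

```rust
/// Hypothetical stand-in for the persona's `has-tm` predicate: a sequence
/// "has a TM helix" if it contains a run of 5+ hydrophobic residues.
/// The residue set and threshold are made up for illustration.
fn has_tm(seq: &str) -> bool {
    let mut run = 0;
    for c in seq.chars() {
        if "AILMFVW".contains(c) {
            run += 1;
            if run >= 5 {
                return true;
            }
        } else {
            run = 0;
        }
    }
    false
}

/// Shape of `len (flt p xs)`: materialise the keepers, then measure them.
fn len_of_filtered<T>(pred: impl Fn(&T) -> bool, xs: &[T]) -> usize {
    let kept: Vec<&T> = xs.iter().filter(|x| pred(*x)).collect();
    kept.len()
}

/// Shape of `ct p xs`: bump a counter and never build the filtered list.
fn count_by<T>(pred: impl Fn(&T) -> bool, xs: &[T]) -> usize {
    let mut n = 0;
    for x in xs {
        if pred(x) {
            n += 1;
        }
    }
    n
}

fn main() {
    let seqs = ["MKAILVFWGG", "MKKDDEE", "AAAA", "GGLLVVIIAAG"];
    let via_filter = len_of_filtered(|s: &&str| has_tm(s), &seqs);
    let via_count = count_by(|s: &&str| has_tm(s), &seqs);
    // Both shapes agree on the answer; only the intermediate Vec differs.
    assert_eq!(via_filter, via_count);
    println!("tm = {via_count}");
}
```

Both functions return the same number; the second simply skips the temporary list, which is the saving the After line claims.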

What's in the diff (per commit)

  • f3aed43 builtins: rgxall1 pat s for flat first-capture-group extraction — adds Builtin::Rgxall1, tree-walker arm, tree-bridge eligibility, propagate-error allow-list entry. 0 groups → flat L t of whole matches; 1 group → flat L t of capture-1 strings; 2+ groups → runtime ILO-R009 pointing at rgxall. 8 cross-engine regression tests + examples/rgxall1-flat-captures.ilo.
  • e86dba9 builtins: ct fn xs for count-by-predicate without the filtered-list alloc — adds Builtin::Ct with predicate signature identical to flt's, 2-arg and 3-arg closure-bind shapes. Tree-bridge wired for VM/Cranelift parity; runtime non-bool predicate errors join the propagate allow-list. No count long-form alias on purpose (caught a real false-positive that would have trampled the user-fn name in examples/unq-numbers.ilo). 10 cross-engine regression tests including a cnt-as-continue coexistence regression guard + examples/ct-count-by-predicate.ilo.
  • a0a1ee2 docs: sync SPEC.md, ai.txt, SKILL.md for rgxall1 and ct builtins — every keep-in-sync surface gets new rows in the builtin tables and the new names in the cheatsheet/HOF lists. Site builtins.md lands in a follow-up ilo-lang/site PR.
  • 4f91c38 tests: tighten ct non-bool predicate assertion to anchor on "ct:" — rust-review caught that bare stderr.contains("ct") would match the substring inside "expected" and let unrelated wording satisfy the assertion. Anchor on the literal ct: prefix from the ILO-R009 message instead.

Manifesto framing

These two compose: any persona that today writes len (flt … rgxall …) or map (hd …) rgxall … is paying both a helper-decl tax and a redundant allocation. With rgxall1 + ct, the canonical "scrape one HTML page, count items matching a predicate" pipeline shaves ~50 tokens and one O(n) alloc per occurrence. Summed over every agent that hits this task shape, that is real compression.

Both ride the tree-bridge — same dispatch contract as rgxall, flt 3, grp 2. No backend codegen changes, no JIT lowering complexity, full cross-engine parity for free.

Test plan

  • cargo test --release --features cranelift — full suite green post-rebase
  • cargo clippy --features cranelift --all-targets -- -W clippy::all — clean
  • cargo test --features cranelift --test regression_rgxall1 — 8/8 pass on tree/VM/Cranelift
  • cargo test --features cranelift --test regression_ct — 10/10 pass on tree/VM/Cranelift, plus cnt-as-continue coexistence regression guard
  • cargo test --features cranelift --test examples_engines — both new examples run on every engine
  • Builtin::ALL tag round-trip preserved (entries appended after Sleep, no reordering)
  • Self rust-review pass complete; one false-positive test assertion caught and tightened
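
The "appended after Sleep, no reordering" item in the test plan matters whenever a tag is derived from position in Builtin::ALL. A minimal Rust sketch of that property follows. The PR does not show the real enum or its tag encoding, so the variant list, the u8 tag, and the `tag` / `from_tag` helpers are assumptions for illustration only.

```rust
// Hypothetical, heavily shortened builtin list. Only the relative order
// matters here: existing entries first, new ones appended at the end.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Builtin {
    Len,
    Flt,
    Rgxall,
    Sleep,
    // Appended by this PR (in this sketch), after Sleep.
    Rgxall1,
    Ct,
}

impl Builtin {
    const ALL: [Builtin; 6] = [
        Builtin::Len,
        Builtin::Flt,
        Builtin::Rgxall,
        Builtin::Sleep,
        Builtin::Rgxall1,
        Builtin::Ct,
    ];

    /// Assumed encoding: a builtin's tag is its index in ALL.
    fn tag(self) -> u8 {
        Self::ALL
            .iter()
            .position(|b| *b == self)
            .expect("every variant is listed in ALL") as u8
    }

    /// Inverse of `tag`; returns None for an out-of-range tag.
    fn from_tag(tag: u8) -> Option<Builtin> {
        Self::ALL.get(tag as usize).copied()
    }
}

fn main() {
    // Round-trip: every builtin survives tag -> from_tag unchanged.
    for b in Builtin::ALL {
        assert_eq!(Builtin::from_tag(b.tag()), Some(b));
    }
    // Appending leaves the pre-existing tags where they were.
    println!("Sleep tag = {}", Builtin::Sleep.tag());
}
```

Under this assumed encoding, inserting the new variants anywhere before Sleep would shift every later tag, which is why appending is the safe choice.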

Follow-ups

  • PR B (ILO-W002 unused-binding warning, security-researcher rerun5 ask) branches off main after this merges, per the two-PR split agreed at the gate.
  • Site doc sync (ilo-lang/site builtins.md) lands in a follow-up PR against the site repo once this merges.

rgxall returns L (L t) — every match wrapped in a list of capture groups
(or a 1-element whole-match list when there are no groups). The common
'extract every match of one capture group' shape pays a flatten/hd helper
per use site, which the html-scraper persona has been writing as 'ext
xs:L t>t;hd xs' across five reruns.

rgxall1 collapses that to a single call:
- 0 groups: flat L t of whole matches
- 1 group : flat L t of capture-1 strings (skipping non-participating
            captures under alternation, parallel to rgxall semantics)
- 2+ groups: runtime ILO-R009 pointing back at rgxall
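
The three cases above can be modelled in Rust over already-matched data, without a regex engine. This is a sketch under assumptions: `RawMatch`, `raw`, and `rgxall1_flatten` are hypothetical names, and the error text after the ILO-R009 code is invented (the PR only says the error points back at rgxall).

```rust
/// Hypothetical pre-flattening view of one regex match: the whole match
/// plus one slot per capture group (None = group did not participate).
struct RawMatch {
    whole: String,
    groups: Vec<Option<String>>,
}

/// Small constructor to keep the examples short.
fn raw(whole: &str, groups: &[Option<&str>]) -> RawMatch {
    RawMatch {
        whole: whole.to_string(),
        groups: groups.iter().map(|g| g.map(str::to_string)).collect(),
    }
}

/// Models the rgxall1 contract by capture-group count.
fn rgxall1_flatten(group_count: usize, matches: &[RawMatch]) -> Result<Vec<String>, String> {
    match group_count {
        // 0 groups: flat list of whole matches.
        0 => Ok(matches.iter().map(|m| m.whole.clone()).collect()),
        // 1 group: flat list of capture-1 strings, skipping matches
        // where the group did not participate.
        1 => Ok(matches
            .iter()
            .filter_map(|m| m.groups.first().cloned().flatten())
            .collect()),
        // 2+ groups: error that points the caller at rgxall.
        // The message wording is made up for this sketch.
        n => Err(format!("ILO-R009 rgxall1: {n} capture groups; use rgxall")),
    }
}

fn main() {
    let ms = vec![
        raw("<a>x</a>", &[Some("x")]),
        raw("<b/>", &[None]), // capture 1 did not participate
        raw("<a>y</a>", &[Some("y")]),
    ];
    println!("{:?}", rgxall1_flatten(1, &ms));
    println!("{:?}", rgxall1_flatten(0, &ms));
    println!("{:?}", rgxall1_flatten(2, &ms));
}
```

The 1-group arm is the one that replaces the persona's `map ext (rgxall …)` helper; the other two arms cover the edge cases the commit message spells out.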

Cross-engine: tree-bridge eligible (parallel to Rgxall). Multi-group
error joins the propagate list so Cranelift surfaces the diagnostic
instead of silently returning nil.

8 cross-engine regression tests (tree/VM/Cranelift) plus an examples/
demo for the engine harness. Closes the originating ext-helper friction
flagged in ilo_assessment_feedback.md html-scraper rerun5 (line 4988).
…lloc

`len (flt pred xs)` is a five-rerun-old shape that costs a full L b
allocation on every call just to ask 'how many'. Bioinformatics rerun6
wanted `tm=cnt has-tm seqs` over 20k proteins; `cnt` is reserved as
the loop-continue keyword, so the builtin is `ct` instead. Two chars
beats the three-char ask in token economy, with zero parser surgery.

ct fn xs      -- count xs where fn returns true              n
ct fn ctx xs  -- closure-bind variant, parallel to flt 3     n

Predicate signature is identical to flt's; predicate must return b or
the runtime raises ILO-R009 on every engine (Builtin::Ct joins the
tree_bridge_propagates_error allow-list so Cranelift surfaces the
diagnostic in lockstep with tree and VM).
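
A rough Rust model of the two call shapes and the non-bool error described above. Everything here is a sketch: `Value`, `ct`, `ct_ctx`, and `gt` are hypothetical, the argument order of the 3-arg predicate is assumed to be (element, ctx), and only the ILO-R009 code and the `ct:` prefix come from this PR; the rest of the message is invented.

```rust
/// Hypothetical dynamically typed runtime value.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Bool(bool),
    Num(f64),
}

/// 2-arg shape: count elements for which the predicate returns true.
/// A non-bool result aborts with an error instead of being coerced.
fn ct(pred: impl Fn(&Value) -> Value, xs: &[Value]) -> Result<usize, String> {
    let mut n = 0;
    for x in xs {
        match pred(x) {
            Value::Bool(true) => n += 1,
            Value::Bool(false) => {}
            other => {
                return Err(format!(
                    "ILO-R009 ct: predicate must return b, got {other:?}"
                ))
            }
        }
    }
    Ok(n)
}

/// 3-arg closure-bind shape: the extra ctx value is passed to the
/// predicate on every call (argument order is an assumption).
fn ct_ctx(
    pred: impl Fn(&Value, &Value) -> Value,
    ctx: &Value,
    xs: &[Value],
) -> Result<usize, String> {
    ct(|x| pred(x, ctx), xs)
}

/// Example 2-argument predicate: is x greater than ctx?
fn gt(x: &Value, ctx: &Value) -> Value {
    match (x, ctx) {
        (Value::Num(a), Value::Num(b)) => Value::Bool(a > b),
        _ => Value::Bool(false),
    }
}

fn main() {
    let xs = vec![Value::Num(1.0), Value::Num(5.0), Value::Num(9.0)];
    println!("{:?}", ct_ctx(gt, &Value::Num(3.0), &xs));
    // A predicate that returns its input (a number) trips the error path.
    println!("{:?}", ct(|v| v.clone(), &xs));
}
```

Returning an error rather than treating a non-bool as false mirrors the lockstep-diagnostic goal: every engine reports the same failure instead of silently counting zero.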

Cross-engine: rides the tree-bridge alongside Flt 3, Grp 2, Uniqby 2.
Tree interpreter owns the predicate dispatch; VM and Cranelift both
route through OP_CALL_BUILTIN_TREE.

No long-form alias. `count` is a common user-fn name (see
examples/unq-numbers.ilo, which would otherwise re-dispatch through the
alias resolver into the builtin). Users wanting a long form can keep
their own count helper and call ct directly when they want the builtin.

10 cross-engine regression tests + examples/ct-count-by-predicate.ilo.
Originating ask: ilo_assessment_feedback.md line 5028.

Adds rgxall1 to the regex section and ct to the higher-order section
of each of the three keep-in-sync surfaces:

- SPEC.md: builtin tables get one row each for rgxall1 and ct.
- ai.txt: builtin cheatsheet adds rgxall1 next to rgxall and ct next
  to flt.
- skills/ilo/SKILL.md: same long-form table plus the Text and HOF
  cheatsheet lines and the cross-engine HOF list (rgxall1 in Text, ct
  in HOF, both in the all-HOFs-work-cross-engine sentence).

Site builtins.md ships in a follow-up PR against ilo-lang/site (the
site is a separate repo).

Bare "ct" matches "expected" (which contains the substring "ct")
and would let unrelated error wording satisfy the assertion. Anchor on
the literal "ct:" prefix that ILO-R009 emits from the runtime path,
which is unambiguous.
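
The false positive is easy to reproduce in isolation. In this Rust sketch the two stderr strings are invented examples, not real ilo output; only the `ct:` prefix is taken from the commit message.

```rust
/// The original, too-loose assertion: any "ct" substring passes.
fn loose_check(stderr: &str) -> bool {
    stderr.contains("ct")
}

/// The tightened assertion: anchor on the literal "ct:" prefix.
fn anchored_check(stderr: &str) -> bool {
    stderr.contains("ct:")
}

fn main() {
    // Hypothetical messages, made up for illustration.
    let unrelated = "error: expected a list, got n";
    let genuine = "ILO-R009 ct: predicate must return b";

    // "expected" contains "ct", so the loose check passes on the wrong error.
    assert!(loose_check(unrelated));
    // The anchored check rejects it and still accepts the real diagnostic.
    assert!(!anchored_check(unrelated));
    assert!(anchored_check(genuine));
    println!("anchored check distinguishes the two messages");
}
```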
@codecov

codecov Bot commented May 17, 2026

Codecov Report

❌ Patch coverage is 71.66667% with 34 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/interpreter/mod.rs 70.66% 22 Missing ⚠️
src/verify.rs 62.50% 12 Missing ⚠️

Minor phrasing harmonisation - the three keep-in-sync surfaces had drifted
slightly during the initial doc-sync commit (one used 'avoids the
intermediate list alloc of len(flt fn xs)', the others wrote it as 'avoids
len(flt fn xs)'s intermediate list alloc'). Pick the shorter form from
SPEC.md for the other two surfaces. Same applies to the rgxall1 row.
@danieljohnmorris danieljohnmorris merged commit d731e47 into main May 17, 2026
4 of 5 checks passed
@danieljohnmorris danieljohnmorris deleted the feature/rerun6-additions branch May 17, 2026 08:25
danieljohnmorris added a commit that referenced this pull request May 17, 2026
fifteen fixes since 0.11.5, all from rerun5/rerun6 personas plus standing asks: ListView foundation (#334), window-text-perf reshape via ListView (#336), inner-flt predicate inlining (#340), double-minus trap ILO-P021 (#331), bare-ident bang silent-nil regression (#324), Cranelift JIT span plumbing (#335), bool-prefix ternary (#330), wh prefix-cond reparse (#332), --run-engine auto-pick main (#329), subcommand helper hyphens+non-ident (#328), runtime error spans (#335), persona-diagnostic batch 3 (#327), rgxall1+ct (#333), single-line body diagnostic (#322 carry), lambda type-var defensive test (#326), N-deep prefix arity error (#339), prefix-minus span column drift (#338), doc-sync (#337).