feat(opt): span-based CSE replacement infrastructure + verifier-gap finding (v0.9.0 PR-K2)#111
Merged
Merged
Conversation
PR-K (v0.8.0) added IPA-aware RECOGNITION of pure+no-trap+single-result
Call exprs in CSE: they're hashed, cost-gated, and tracked in
hash_to_positions, but Phase 4's replacement loop only substituted
single-instruction Const exprs. PR-K2 wires the REPLACEMENT side
with span-based substitution so a duplicate Call's full
`[arg_start ..= call_pos]` range collapses into one `local.get N`.
## What lands
1. New CSEAction::ReplaceSpanWithLoad { local_idx, end_pos } variant.
The first occurrence still gets SaveToLocal (LocalTee after the
Call); each subsequent occurrence gets ReplaceSpanWithLoad — the
apply loop emits one local.get and skips through end_pos.
2. Phase 4 dispatch on Expr variant:
- Const32 / Const64: pre-existing tee+load (unchanged behavior),
now also runs through an `occupied[]` span-overlap check so it
never clashes with a Call dedup planned in the same iteration.
- Call { args, .. }: span dedup gated by FIVE safety checks
(defense-in-depth on top of PR-K's recognition guards):
a. Every arg must be a single-instruction pure pusher (Const
or LocalGet). Binary/Unary/nested-Call args are deferred
to a follow-up — they need a wider span-overlap analysis.
b. Each occurrence's measured span length must equal
args.len()+1. A shorter span would mean PR-K's
leftmost-arg-start tracking missed an instruction, and we
can't prove the span is exactly the call subtree.
c. No span may overlap any already-planned transform (the
`occupied[]` bitmap is union of every prior span).
d. No LocalSet / LocalTee of any LocalGet-arg local may
appear between the first call and the last duplicate's
arg-start. (Conservative: caching across mutation would
fold a stale value.)
e. The existing PR-K cost gate `worth_dedup` still runs in
Phase 2 — same +4 fixed overhead, same break-even at
cost>=5 N=3. No change to that calculation.
3. Apply loop honors an outstanding `skip_until: Option<usize>`
marker so a ReplaceSpanWithLoad action absorbs every instruction
in the duplicate's span.
## Known gap (PR-K3 target)
The two positive tests for Call CSE remain `#[ignore]`'d, with
updated reasons. The mechanism is in place and exercises cleanly,
but the existing Z3 translation validator models every
Instruction::Call as a FRESH symbolic constant (verify.rs:4035
falls through to `BV::new_const("call_unknown_result", 32)` for the
empty signature-context the CSE pass uses today). Two identical
pure+no-trap calls thus translate to two INDEPENDENT BVs, and the
optimization `R1 + R2 -> 2 * R3` is treated as unproven —
`verify_or_revert` reverts every Call dedup attempt. Per the
project's proof-first philosophy ("if we cannot prove it, we must
not do it"), this is the correct behavior.
Unblocking the test is PR-K3: teach the verifier to model
pure+no-trap call results as `f(args)` uninterpreted-function
applications when the IPA proves purity. With that, the dedup
mechanism shipped here becomes immediately useful with no further
CSE-side changes.
The three existing safety tests (impure / may-trap / different-args)
continue to pass — those rely on PR-K's RECOGNITION gates which
this PR didn't touch.
## Narrowing applied
The PR-K2 spec contemplated a NARROWER ship if the full design
couldn't be made confident within an hour. The narrowing here is
two-fold:
- Arg shapes: only Const / LocalGet args participate (point 2a).
Binary / Unary / nested-Call arg subtrees would need a wider
span-overlap analysis that the verifier wouldn't accept anyway
today; deferring keeps soundness and reduces risk surface.
- Tests `#[ignore]`'d on the verifier blocker, with rationale.
## Measurement
gale_in_baseline (1941 B) optimizes to 2826 B total wasm, with a
code-section delta of 0 bytes vs v0.8.0 baseline (still 795 B code
section). Expected — gale lacks the duplicate-pure-call pattern,
and even if it had one the verifier would revert.
Trace: REQ-3, REQ-5, REQ-14
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Merged
avrabe
added a commit
that referenced
this pull request
May 14, 2026
Bump workspace version 0.8.0 → 0.9.0. Measurement and harvest release. Three PRs landed in parallel via worktree-isolated agents: #110 PR-P corpus-wide LOOM vs wasm-opt -O3 measurement harness #111 PR-K2 span-based CSE replacement infrastructure + verifier-gap #112 PR-L2 Souper rule set: 3 → 12 identities, wire into pipeline First objective measured wins: | Workload | LOOM Δ% | wasm-opt Δ% | |----------|--------:|------------:| | gale | -4.9% | -0.8% (LOOM wins by 4.1 pts) | simple_component | -18.8% | n/a (wasm-opt errors) | calc_component | -11.3% | n/a (wasm-opt errors) | calculator_root | -0.4% | n/a (wasm-opt errors) LOOM beats wasm-opt on gale; PR-M's adapter spec delivers double- digit wins on adapter-heavy components. wasm-opt cannot process Component-Model components at all — LOOM's strategic moat. Critical finding (PR-K2): the Z3 verifier models every Call as a fresh symbolic constant, so cross-call CSE dedup is rejected even when the replacement is correct. PR-K3 (verifier-side) will model pure+no-trap calls as uninterpreted-function applications.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements the span-based replacement mechanism for
Expr::Callthat PR-K (v0.8.0) couldn't ship because the existing CSE replacement loop only handled single-instruction Const exprs.Critical finding during implementation: the Z3 translation validator models every
Instruction::Callas a fresh symbolic constant (verify.rs:4035, conservative empty-sig-context path). When CSE tries to replace twocall $pure(x)withlocal.tee + local.get, Z3 produces a counterexample (R1 + R2 → 2*R3where R1 and R2 are independent BVs) and the per-functionverify_or_revertreverts every dedup attempt. Per LOOM's proof-first policy, bypassing is not allowed. PR-K3 will fix this inverify.rsby modeling pure+no-trap calls as uninterpreted-function applicationsf(args)so that two identical calls produce equal BVs.What lands
CSEAction::ReplaceSpanWithLoadvariant for span-based replacement.skip_untilwindow to handle overlapping spans.args.len() + 1(proves the span IS the call subtree).occupied[]bitmap rejects spans overlapping another planned transform.LocalSet/LocalTeeof an arg-local may appear between the first call and last duplicate.worth_dedupcost gate still runs unchanged.The
TranslationValidator::verify_or_revertis the downstream safety net — any unsound transform reverts cleanly, which is exactly what happens today (until PR-K3).Tests
test_cse_dedupes_pure_clamp_calls_via_span_replacement— pins the EXPECTED end state. Marked#[ignore]with a comment explaining the PR-K3 verifier gap.test_cse_dedupes_repeated_pure_calls— also#[ignore]'d for the same reason (updated comment).Narrowing applied
loom-core/src/lib.rs(336 inserts / 27 deletes).Soundness reasoning
The mechanism is dominated by
TranslationValidator::verify_or_revert. Any unsound transform reverts. The non-Call Const path is preserved bit-identically except for the new overlap check, which is conservative-pessimistic. PR-K3 will enable the dedup; this PR ships the mechanism.Gale measurement
Zero bytes (gale lacks the duplicate-pure-call pattern even if the verifier permitted it). Infrastructure for PR-K3.
🤖 Generated with Claude Code