Skip to content

feat(opt): span-based CSE replacement infrastructure + verifier-gap finding (v0.9.0 PR-K2)#111

Merged
avrabe merged 1 commit into
mainfrom
release/v0.9.0-pr-k2-span-cse
May 14, 2026
Merged

feat(opt): span-based CSE replacement infrastructure + verifier-gap finding (v0.9.0 PR-K2)#111
avrabe merged 1 commit into
mainfrom
release/v0.9.0-pr-k2-span-cse

Conversation

@avrabe
Copy link
Copy Markdown
Contributor

@avrabe avrabe commented May 14, 2026

Summary

Implements the span-based replacement mechanism for Expr::Call that PR-K (v0.8.0) couldn't ship because the existing CSE replacement loop only handled single-instruction Const exprs.

Critical finding during implementation: the Z3 translation validator models every Instruction::Call as a fresh symbolic constant (verify.rs:4035, conservative empty-sig-context path). When CSE tries to replace two call $pure(x) with local.tee + local.get, Z3 produces a counterexample (R1 + R2 → 2*R3 where R1 and R2 are independent BVs) and the per-function verify_or_revert reverts every dedup attempt. Per LOOM's proof-first policy, bypassing is not allowed. PR-K3 will fix this in verify.rs by modeling pure+no-trap calls as uninterpreted-function applications f(args) so that two identical calls produce equal BVs.

What lands

  • CSEAction::ReplaceSpanWithLoad variant for span-based replacement.
  • Phase 4 dispatch in the replacement loop, with skip_until window to handle overlapping spans.
  • Five defense-in-depth gates on top of PR-K's recognition guards:
    1. Each arg must be a single-instruction pure pusher (Const / LocalGet).
    2. Measured span length must equal args.len() + 1 (proves the span IS the call subtree).
    3. occupied[] bitmap rejects spans overlapping another planned transform.
    4. No LocalSet / LocalTee of an arg-local may appear between the first call and last duplicate.
    5. PR-K's worth_dedup cost gate still runs unchanged.

The TranslationValidator::verify_or_revert is the downstream safety net — any unsound transform reverts cleanly, which is exactly what happens today (until PR-K3).

Tests

  • test_cse_dedupes_pure_clamp_calls_via_span_replacement — pins the EXPECTED end state. Marked #[ignore] with a comment explaining the PR-K3 verifier gap.
  • test_cse_dedupes_repeated_pure_calls — also #[ignore]'d for the same reason (updated comment).
  • Existing 3 safety tests (don't dedupe impure / may-trap / different-args) continue to pass.

Narrowing applied

  • Arg shapes: only single-instruction pure pushers (Const / LocalGet). Binary / Unary / nested-Call arg subtrees deferred — they need wider span-overlap analysis AND the verifier wouldn't accept them today anyway.
  • Touched only loom-core/src/lib.rs (336 inserts / 27 deletes).

Soundness reasoning

The mechanism is dominated by TranslationValidator::verify_or_revert. Any unsound transform reverts. The non-Call Const path is preserved bit-identically except for the new overlap check, which is conservative-pessimistic. PR-K3 will enable the dedup; this PR ships the mechanism.

Gale measurement

Zero bytes (gale lacks the duplicate-pure-call pattern even if the verifier permitted it). Infrastructure for PR-K3.

🤖 Generated with Claude Code

PR-K (v0.8.0) added IPA-aware RECOGNITION of pure+no-trap+single-result
Call exprs in CSE: they're hashed, cost-gated, and tracked in
hash_to_positions, but Phase 4's replacement loop only substituted
single-instruction Const exprs. PR-K2 wires the REPLACEMENT side
with span-based substitution so a duplicate Call's full
`[arg_start ..= call_pos]` range collapses into one `local.get N`.

## What lands

1. New CSEAction::ReplaceSpanWithLoad { local_idx, end_pos } variant.
   The first occurrence still gets SaveToLocal (LocalTee after the
   Call); each subsequent occurrence gets ReplaceSpanWithLoad — the
   apply loop emits one local.get and skips through end_pos.

2. Phase 4 dispatch on Expr variant:
   - Const32 / Const64: pre-existing tee+load (unchanged behavior),
     now also runs through an `occupied[]` span-overlap check so it
     never clashes with a Call dedup planned in the same iteration.
   - Call { args, .. }: span dedup gated by FIVE safety checks
     (defense-in-depth on top of PR-K's recognition guards):
       a. Every arg must be a single-instruction pure pusher (Const
          or LocalGet). Binary/Unary/nested-Call args are deferred
          to a follow-up — they need a wider span-overlap analysis.
       b. Each occurrence's measured span length must equal
          args.len()+1. A shorter span would mean PR-K's
          leftmost-arg-start tracking missed an instruction, and we
          can't prove the span is exactly the call subtree.
       c. No span may overlap any already-planned transform (the
          `occupied[]` bitmap is union of every prior span).
       d. No LocalSet / LocalTee of any LocalGet-arg local may
          appear between the first call and the last duplicate's
          arg-start. (Conservative: caching across mutation would
          fold a stale value.)
       e. The existing PR-K cost gate `worth_dedup` still runs in
          Phase 2 — same +4 fixed overhead, same break-even at
          cost>=5 N=3. No change to that calculation.

3. Apply loop honors an outstanding `skip_until: Option<usize>`
   marker so a ReplaceSpanWithLoad action absorbs every instruction
   in the duplicate's span.

## Known gap (PR-K3 target)

The two positive tests for Call CSE remain `#[ignore]`'d, with
updated reasons. The mechanism is in place and exercises cleanly,
but the existing Z3 translation validator models every
Instruction::Call as a FRESH symbolic constant (verify.rs:4035
falls through to `BV::new_const("call_unknown_result", 32)` for the
empty signature-context the CSE pass uses today). Two identical
pure+no-trap calls thus translate to two INDEPENDENT BVs, and the
optimization `R1 + R2 -> 2 * R3` is treated as unproven —
`verify_or_revert` reverts every Call dedup attempt. Per the
project's proof-first philosophy ("if we cannot prove it, we must
not do it"), this is the correct behavior.

Unblocking the test is PR-K3: teach the verifier to model
pure+no-trap call results as `f(args)` uninterpreted-function
applications when the IPA proves purity. With that, the dedup
mechanism shipped here becomes immediately useful with no further
CSE-side changes.

The three existing safety tests (impure / may-trap / different-args)
continue to pass — those rely on PR-K's RECOGNITION gates which
this PR didn't touch.

## Narrowing applied

The PR-K2 spec contemplated a NARROWER ship if the full design
couldn't be made confident within an hour. The narrowing here is
two-fold:
  - Arg shapes: only Const / LocalGet args participate (point 2a).
    Binary / Unary / nested-Call arg subtrees would need a wider
    span-overlap analysis that the verifier wouldn't accept anyway
    today; deferring keeps soundness and reduces risk surface.
  - Tests `#[ignore]`'d on the verifier blocker, with rationale.

## Measurement

gale_in_baseline (1941 B) optimizes to 2826 B total wasm, with a
code-section delta of 0 bytes vs v0.8.0 baseline (still 795 B code
section). Expected — gale lacks the duplicate-pure-call pattern,
and even if it had one the verifier would revert.

Trace: REQ-3, REQ-5, REQ-14

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@avrabe avrabe merged commit 62eeb21 into main May 14, 2026
9 of 18 checks passed
@avrabe avrabe deleted the release/v0.9.0-pr-k2-span-cse branch May 14, 2026 17:01
@avrabe avrabe mentioned this pull request May 14, 2026
avrabe added a commit that referenced this pull request May 14, 2026
Bump workspace version 0.8.0 → 0.9.0.

Measurement and harvest release. Three PRs landed in parallel via
worktree-isolated agents:

  #110  PR-P   corpus-wide LOOM vs wasm-opt -O3 measurement harness
  #111  PR-K2  span-based CSE replacement infrastructure + verifier-gap
  #112  PR-L2  Souper rule set: 3 → 12 identities, wire into pipeline

First objective measured wins:

| Workload | LOOM Δ% | wasm-opt Δ% |
|----------|--------:|------------:|
| gale | -4.9% | -0.8% (LOOM wins by 4.1 pts)
| simple_component | -18.8% | n/a (wasm-opt errors)
| calc_component | -11.3% | n/a (wasm-opt errors)
| calculator_root | -0.4% | n/a (wasm-opt errors)

LOOM beats wasm-opt on gale; PR-M's adapter spec delivers double-
digit wins on adapter-heavy components. wasm-opt cannot process
Component-Model components at all — LOOM's strategic moat.

Critical finding (PR-K2): the Z3 verifier models every Call as a
fresh symbolic constant, so cross-call CSE dedup is rejected even
when the replacement is correct. PR-K3 (verifier-side) will model
pure+no-trap calls as uninterpreted-function applications.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant