release/v0.6.0 PR-C: eliminate_dead_stores — full backward liveness (-0.4% on calculator.wasm)#96
Open
Conversation
The v0.4.0 audit measured LOOM's CSE growing the gale_ffi
kernel-scheduler code section by +6.3% while wasm-opt -O3 reduced
it by -2.0%. The cause: enhanced-CSE deduplicated every duplicate
expression including 1-2 byte constants. Replacing
`i32.const -22` (2 bytes encoded) with `local.tee N / local.get N`
(2+2 = 4 bytes) plus an additional local declaration (~2 bytes
amortized) is unconditionally a size regression on cheap
constants — and gale is full of them (errno values).
Fix: add a cost gate `Expr::worth_dedup(occurrences)` that
estimates net byte savings before deciding to dedup. Skip when:
net = (N - 1) * (cost - 2) - 4 ≤ 0
Examples:
i32.const 42 (cost=2, N=10): savings = -4 → skip
i32.add      (cost=5, N=2):  savings = -1 → skip
i32.add      (cost=5, N=3):  savings = +2 → keep
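The gate arithmetic above can be sketched as follows. This is a hedged illustration: the function names `net_savings`/`worth_dedup` and the hard-coded byte costs are stand-ins, not LOOM's actual `Expr::worth_dedup` signature.

```rust
/// Net byte savings of deduplicating an expression of encoded size `cost`
/// that occurs `n` times: each of the n - 1 duplicates shrinks from `cost`
/// bytes to a 2-byte local.get, and we pay ~4 bytes once for the
/// local.tee plus the amortized local declaration.
fn net_savings(cost: i32, n: i32) -> i32 {
    (n - 1) * (cost - 2) - 4
}

/// Dedup only when the estimate is strictly positive.
fn worth_dedup(cost: i32, n: i32) -> bool {
    net_savings(cost, n) > 0
}

fn main() {
    assert_eq!(net_savings(2, 10), -4); // i32.const 42, 10 occurrences: skip
    assert_eq!(net_savings(5, 2), -1);  // i32.add, 2 occurrences: skip
    assert_eq!(net_savings(5, 3), 2);   // i32.add, 3 occurrences: keep
    assert!(!worth_dedup(2, 10));
    assert!(worth_dedup(5, 3));
}
```

Note the gate is occurrence-sensitive: the same `i32.add` flips from skip to keep between N=2 and N=3, which is why a flat size threshold would be wrong.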
Measurement on gale_ffi:
v0.5.0: code section 811 → 862 bytes (+6.3% regression)
v0.6.0: code section 811 → 808 bytes (-0.4% net win)
Tests:
- test_cse_phase4_duplicate_constants_above_cost_threshold:
LARGE constants (5+ byte LEB128) still get deduplicated.
- test_cse_phase4_keeps_small_constants: regression test for
the gale fix. Cheap constants must survive CSE.
Pick #1 from v0.6.0 wasm-opt-gap research agent's plan.
Trace: REQ-3, REQ-14
Two research outputs from v0.6.0 planning, both grounded in real Verus-verified kernel-scheduler code at /Users/r/git/pulseengine/z/gale.

source-pattern-analysis.md — eight optimization-relevant patterns in gale source with file:line citations:
- Closed-set FSM dispatch (br_table targets) — 6 near-identical matches over SchedThreadState in sched.rs:649-779
- Default-then-override (the LOOM v0.4/v0.5 hoist-guard pattern) — canonical example in sched.rs:404-444 (next_up_smp), four more found
- Verus-bounded loops — 24 `decreases` clauses, all of form MAX_CONST - i (MAX_WAITERS=64, MAX_CPUS=16)
- Tail-call dispatch matches, leaf-inline candidates, bit-mask axiom ingestion (event.rs lemmas), 2D state-machine matches, and Verus annotations as trusted axioms (1607 clauses total)

wasm-opt-gap-analysis.md — top 7 wasm-opt passes ranked by expected payoff on kernel code, cheapest first:
1. Constant-CSE suppression gate (50 LOC, 0.3 wks) — DONE in this PR
2. reorder-locals (slot renumbering) (250 LOC, 0.5 wks)
3. RedundantSetElimination liveness (600 LOC, 1.5 wks)
4. Compare-operand canonicalization (400 LOC, 1.0 wk)
5. merge-locals (500 LOC, 1.5 wks)
6. directize call_indirect → call (700 LOC, 2.0 wks)
7. simplify-locals sinking mode (1500 LOC, 4.0 wks)

The first three together (~1.3 weeks combined) flip the sign of the gale +6.3% regression. Pick #1 ships in this PR; picks #2 and #3 are tracked for follow-up PRs in v0.6.0.

Trace: REQ-7
…gale)

New pass that removes locals declared by a function but never read by any LocalGet anywhere in the function body. Targets the gale "default-then-override" pattern: rustc/LLVM materializes an EINVAL default at function entry, then every reachable path overwrites it before return. The default's local.set becomes a pure dead store.

Key property: "zero reads anywhere" is path-INSENSITIVE. Unlike full liveness (Pick #3), this rule is sound regardless of BrIf/BrTable/early-Return control flow, so the pass DOES NOT need the has_dataflow_unsafe_control_flow guard that gates simplify_locals and coalesce_locals on every kernel-style early-exit function. v0.5.0's simplify_locals had zero effect on the gale workload by construction; this pass picks up where it refused to act.

Algorithm:
1. Recursive read-count scan over the instruction tree.
2. Dead set = { idx | idx >= param_count && reads(idx) == 0 }.
3. Neutralize writes:
   LocalSet dead → Drop (preserves the [T] -> [] stack effect)
   LocalTee dead → removed (Tee's [T] -> [T] passes through)
4. Pack-down remap: dense indices, reuse remap_instructions.
5. Z3 translation validation — revert on rejection.

Stack-effect rationale for the asymmetric LocalSet/LocalTee handling:
   LocalSet idx : [T] -> []   so Drop is the substitute
   LocalTee idx : [T] -> [T]  so removal leaves the stack passing through
Confusing these would corrupt the stack — replacing LocalTee with Drop would consume a value that downstream consumers expected to remain.

Measurement on gale_ffi:
   baseline:            code section 811 bytes
   v0.5.0 (regression): code section 862 bytes (+6.3%)
   v0.6.0 PR-A (CSE):   code section 808 bytes (-0.4%)
   v0.6.0 PR-B (this):  code section 804 bytes (-0.86% vs baseline)

PR-A and PR-B are independent and stack: PR-A fixes the regression, PR-B exposes a new optimization wasm-opt does that LOOM previously skipped on early-exit code.

Visual confirmation on gale_bitarray_alloc_validate:
   before: (local i32) ; i32.const -22 ; local.set 3 ; ...
   after:  (no locals) ; i32.const -22 ; drop ; ...
The leftover `const; drop` is dead code that vacuum could in principle eliminate, but vacuum runs before this pass. A const+drop peephole in vacuum is a follow-up (~5 LOC).

Tests (5 new): basic_write_only, preserves_used_locals, localtee_neutralization, packs_indices, skips_params.

Pick #2 from the v0.6.0 wasm-opt-gap research agent's plan (narrowed to the path-insensitive subset; full liveness is Pick #3).

Trace: REQ-3, REQ-14
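The read-count scan and the asymmetric LocalSet/LocalTee neutralization can be sketched over a toy instruction enum. This is an illustration only: LOOM's real instruction type, pack-down remap, and Z3 validation are omitted, and the names here are hypothetical.

```rust
// Toy subset of a wasm instruction set, enough to show the rule.
#[derive(Clone, Debug, PartialEq)]
enum Instr {
    I32Const(i32),
    LocalGet(u32),
    LocalSet(u32),
    LocalTee(u32),
    Drop,
}

/// Dead set: locals at or above param_count with zero LocalGet reads.
fn dead_locals(body: &[Instr], param_count: u32, num_locals: u32) -> Vec<u32> {
    let mut reads = vec![0u32; num_locals as usize];
    for instr in body {
        if let Instr::LocalGet(idx) = instr {
            reads[*idx as usize] += 1;
        }
    }
    (param_count..num_locals)
        .filter(|&idx| reads[idx as usize] == 0)
        .collect()
}

/// LocalSet of a dead local becomes Drop ([T] -> [] preserved);
/// a dead LocalTee is removed outright ([T] -> [T] passes through).
fn neutralize(body: &[Instr], dead: &[u32]) -> Vec<Instr> {
    body.iter()
        .filter_map(|instr| match instr {
            Instr::LocalSet(idx) if dead.contains(idx) => Some(Instr::Drop),
            Instr::LocalTee(idx) if dead.contains(idx) => None,
            other => Some(other.clone()),
        })
        .collect()
}

fn main() {
    // Shape of gale_bitarray_alloc_validate: i32.const -22 ; local.set 3
    let body = vec![Instr::I32Const(-22), Instr::LocalSet(3)];
    let dead = dead_locals(&body, 3, 4);
    assert_eq!(dead, vec![3]);
    assert_eq!(
        neutralize(&body, &dead),
        vec![Instr::I32Const(-22), Instr::Drop]
    );
}
```

The `filter_map` encodes the asymmetry directly: `Some(Drop)` substitutes, `None` deletes, so confusing the two cases is a one-line diff the tests would catch.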
…ed wasm (-0.4% on calculator.wasm)

New pass: per-position dead-store elimination via backward liveness walking the structured wasm instruction tree. Path-sensitive complement to eliminate_dead_locals (PR-B): catches dead writes even when the local IS read elsewhere in the function, by computing live-after sets correctly across all wasm structured control flow. Pick #3 (Option C) from the v0.6.0 wasm-opt-gap research agent's plan. Full liveness, no scope reduction.

ALGORITHM
Backward walk over the instruction tree, propagating a LiveSet (BTreeSet<u32>) at each position. A LocalSet/LocalTee is dead iff its target local is NOT in the live-after set.

Wasm structured-control-flow handling (the heart of the analysis):
- Block { body }: br N inside the block targets the END of the block, so break-target liveness equals live-after-block. Walk the body with that as live-after; pop the label after.
- Loop { body }: br N inside the loop targets the START of the body. To avoid fixpoint iteration on the back-edge, v1 uses the conservative approximation: every local read anywhere in the body is treated as live throughout the body and live just before the loop. Sound but loses precision INSIDE loops. Loop fixpoint is a follow-up (~100 LOC) — but the gale dead-store patterns sit BEFORE loops, so this approximation costs nothing on the target workload.
- If { then, else }: both arms see the same live-after-if. live-before-if is the UNION of live-in-then and live-in-else. The if's label targets the END of the if (live-after).
- Br N: live becomes label_stack[depth - 1 - N]
- BrIf N: live ∪= label_stack[...] (taken vs fall-through)
- BrTable [...]: live ∪= ⋃ over all targets ∪ default
- Return / Unreachable: live becomes empty (no continuation)
- Call / CallIndirect: no effect on caller's locals

The label_stack mirrors wasm's nesting: outermost first, innermost last. br N counts from innermost-out, so target = stack[len - 1 - N].

ID assignment for write tracking: a forward pre-walk counts total_writes. The backward analysis decrements a counter as it encounters writes; the decremented value is the write's forward-walk-order ID. The apply phase walks forward, increments a parallel counter, and looks up dead_writes by ID. This avoids tree paths or unsafe pointers — IDs are stable and align across the two walks because both use deterministic structural recursion in mirrored orders.

NEUTRALIZATION (same rules as PR-B's eliminate_dead_locals):
   LocalSet idx : [T] → []   → Drop
   LocalTee idx : [T] → [T]  → removed (value passes through)

TRAP-EFFECTING INSTRUCTIONS
load/store/div/etc. may trap. We compute liveness under the no-trap assumption: writes are removed only if dead on the no-trap continuation. If a trap intervenes, no later instruction observes the local — so removal remains sound.

PIPELINE ORDER
... → simplify-locals → dead-stores → dead-locals
dead-stores BEFORE dead-locals: a write-only local with dead-only writes becomes a fully-unused local, which dead-locals then drops in the same pipeline run.

MEASUREMENT
- gale_in_baseline.wasm (1.9 KB, kernel FFI): code section unchanged at 804 bytes (PR-B already handles this workload's pattern — pure write-without-read; PR-C's path-sensitive cases don't appear here).
- calculator.wasm (2.3 MB component): --passes dead-stores ALONE: 2,337,724 → 2,327,794 bytes (-0.4%, ~10 KB). Output validates. Confirms PR-C scales with workload complexity.

TESTS (6 new)
- overwritten_in_straight_line: two writes, no read between; first is dead.
- preserves_live_writes: a single live write must survive untouched.
- branch_aware_keeps_partial_use: write before an if; the if-not-taken path reaches the local.get directly. The write IS LIVE on that path and MUST survive. The hardest case — Drop'ing it would expose an uninitialized read.
- both_arms_overwrite: write before an if where BOTH arms overwrite the same local. live-before-if = ∅ for that local, so the outer write is dead. Tests the union-of-arm-deads.
- return_kills_continuation: write followed by Return. Live becomes empty after Return; the write is dead.
- localtee_dead_removed: dead Tee removed (not Drop'd); stack [T] -> [T] passes through.

All 258 lib + 28 + 17 = 303 tests pass.

Trace: REQ-3, REQ-14
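The core backward rule can be sketched for the straight-line case over a toy instruction enum. This is a reduced illustration: the real pass additionally threads a label stack through Block/Loop/If as described above, and these type and function names are stand-ins, not LOOM's API.

```rust
use std::collections::BTreeSet;

// Toy straight-line instruction subset (no structured control flow).
#[derive(Debug)]
enum Instr {
    I32Const(i32),
    LocalGet(u32),
    LocalSet(u32),
    Drop,
}

/// Walk backward, maintaining the live-after set. A LocalSet is dead iff
/// its target is not live after it; a live target is killed (removed)
/// because the write satisfies all later reads. Returns the positions
/// (indices into `body`) of dead writes.
fn dead_stores(body: &[Instr]) -> Vec<usize> {
    let mut live: BTreeSet<u32> = BTreeSet::new();
    let mut dead = Vec::new();
    for (pos, instr) in body.iter().enumerate().rev() {
        match instr {
            Instr::LocalGet(idx) => {
                live.insert(*idx); // a read makes the local live
            }
            Instr::LocalSet(idx) => {
                if live.contains(idx) {
                    live.remove(idx); // kill: earlier writes are not live here
                } else {
                    dead.push(pos); // no later read observes this write
                }
            }
            _ => {}
        }
    }
    dead.reverse();
    dead
}

fn main() {
    // overwritten_in_straight_line: two writes, no read between.
    let body = vec![
        Instr::I32Const(1),
        Instr::LocalSet(0), // dead: overwritten before any read
        Instr::I32Const(2),
        Instr::LocalSet(0),
        Instr::LocalGet(0),
        Instr::Drop,
    ];
    assert_eq!(dead_stores(&body), vec![1]);
}
```

Extending this to Block/Loop/If is exactly the label-stack machinery described above: the `BTreeSet` at each `br` target replaces the single `live` set of the straight-line case.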
Summary
Third PR of v0.6.0. Stacks on PR #95 (which stacks on PR #94). When PRs #94 and #95 land, this auto-retargets to main.

A new pass, `eliminate_dead_stores`, doing per-position dead-store elimination via backward liveness walking the structured wasm instruction tree. Path-sensitive complement to PR-B (`eliminate_dead_locals`): catches dead writes even when the target local IS read elsewhere in the function — by computing live-after sets correctly across all wasm structured control flow.

Pick #3 from the v0.6.0 wasm-opt-gap research agent's plan, taken in full (Option C).
Why "full liveness" not the simpler middle option
The user picked the most ambitious scope. The simpler "branch-aware adjacent-LocalSet" middle option would catch the obvious cases but miss the interesting one: a write before an `if` where both arms overwrite. That requires computing live-before-if = live-in-then ∪ live-in-else. Once you do that properly, you've already done full structured-wasm liveness. So full liveness it is.

Algorithm
Backward walk over the instruction tree, propagating a `LiveSet = BTreeSet<u32>` at each position. A `LocalSet`/`LocalTee` is dead iff its target local is not in the live-after set.

- `Block { body }` — break-target liveness equals live-after-block; walk the body with that as live-after
- `If { then, else }` — both arms see live-after-if; live-before-if = live-in-then ∪ live-in-else
- `Loop { body }` — conservative v1: every local read anywhere in the body is treated as live throughout it
- `Br N` — live becomes `label_stack[depth - 1 - N]` (no fall-through)
- `BrIf N` — live ∪= `label_stack[...]` (taken vs fall-through)
- `BrTable [...]` — live ∪= union over all targets ∪ default
- `Return`/`Unreachable` — live becomes empty
- `Call`/`CallIndirect` — no effect on caller's locals

The label stack mirrors wasm's nesting: outermost first, innermost last. `br N` counts from innermost-out, so target = `stack[len - 1 - N]`.

Write-ID tracking (the trickiest engineering bit)
Identifying which `LocalSet`/`LocalTee` is dead across two walks (analyze backward, apply forward) needs a stable ID. I use a counter that:

- starts at `total_writes` (pre-counted in a forward sweep)
- decrements as the backward analysis encounters each write; the decremented value is that write's forward-walk-order ID

The apply phase walks forward, increments a parallel counter, and looks up `dead_writes` by ID. Because both walks use mirrored deterministic structural recursion, IDs align without tree paths or unsafe pointers.

Trap-effecting instructions (load, store, div)
May trap and end the function early. We compute liveness under the no-trap assumption: writes are removed only if dead on the no-trap continuation. If a trap intervenes, no later instruction observes the local — so removal remains sound. The conservative direction is correct here.
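The counter alignment described under "Write-ID tracking" can be sketched with toy stand-ins (the `backward_ids` helper is illustrative, not LOOM's API):

```rust
/// IDs handed out by the backward walk: seed a counter with total_writes,
/// decrement before each write visited last-to-first. Reversing the visit
/// order recovers the forward-walk numbering 0, 1, 2, ...
fn backward_ids(total_writes: usize) -> Vec<usize> {
    let mut counter = total_writes;
    let mut ids = Vec::new();
    for _write in (0..total_writes).rev() {
        counter -= 1; // decrement-before-use yields the forward-order ID
        ids.push(counter);
    }
    ids.reverse(); // present in forward order for comparison
    ids
}

fn main() {
    // The forward apply phase simply counts 0..total_writes, so the two
    // numberings coincide whenever both walks visit writes in mirrored order.
    assert_eq!(backward_ids(3), vec![0, 1, 2]);
    assert_eq!(backward_ids(0), Vec::<usize>::new());
}
```

This is why no tree paths or pointers are needed: the ID is a pure function of visit order, and both walks derive their order from the same structural recursion.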
Pipeline order
`dead-stores` BEFORE `dead-locals`: a write-only local with dead-only writes becomes a fully-unused local, which `dead-locals` then drops in the same pipeline run. Synergistic.

Measurement
- gale_in_baseline.wasm (1.9 KB, kernel FFI): code section unchanged at 804 bytes — PR-B already covers this workload's pure write-without-read pattern.
- calculator.wasm (2.3 MB component): 2,337,724 → 2,327,794 bytes (-0.4%, ~10 KB) with `--passes dead-stores` alone. Output validates.

The gale zero-effect and the calculator real-effect together confirm: PR-C's value is on workloads with branchier dead-store patterns. It scales with workload complexity.
Stack-effect rules (same as PR-B)
- `LocalSet idx` : [T] -> [] → replaced by `Drop` ([T] -> [])
- `LocalTee idx` : [T] -> [T] → removed; the value passes through

The `branch_aware_keeps_partial_use` test pins this — if we got it wrong, the validator would reject (an uninitialized read is a type error in wasm).

Tests (6 new)

- `overwritten_in_straight_line` — two writes, no read between; first is dead.
- `preserves_live_writes` — a single live write must survive untouched.
- `branch_aware_keeps_partial_use` — write before `if`; the if-not-taken path reaches the `local.get` directly. The write IS LIVE on that path and MUST survive. The hardest case — Drop'ing it would expose an uninitialized read.
- `both_arms_overwrite` — write before `if` where BOTH arms overwrite. live-before-if = ∅, so the outer write is dead. Tests union-of-arm-deads.
- `return_kills_continuation` — write followed by Return. live ← ∅ after Return; the write is dead.
- `localtee_dead_removed` — dead Tee removed (not Drop'd); stack [T] -> [T] passes through.

All 303 tests pass (258 lib + 28 + 17).
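The union rule that `both_arms_overwrite` and `branch_aware_keeps_partial_use` pin down can be sketched in a few lines (the `live_before_if` helper is hypothetical, for illustration):

```rust
use std::collections::BTreeSet;

/// live-before-if = live-in-then ∪ live-in-else: a local is live before
/// the if iff at least one arm can reach a read of it before a write.
fn live_before_if(then_in: &BTreeSet<u32>, else_in: &BTreeSet<u32>) -> BTreeSet<u32> {
    then_in.union(else_in).cloned().collect()
}

fn main() {
    // both_arms_overwrite: each arm writes local 0 before reading it, so
    // neither arm's live-in contains 0 -> the write before the if is dead.
    let then_in: BTreeSet<u32> = BTreeSet::new();
    let else_in: BTreeSet<u32> = BTreeSet::new();
    assert!(!live_before_if(&then_in, &else_in).contains(&0));

    // branch_aware_keeps_partial_use: the fall-through arm reads local 0,
    // so 0 is live before the if and the outer write must survive.
    let reads_zero = BTreeSet::from([0u32]);
    assert!(live_before_if(&then_in, &reads_zero).contains(&0));
}
```

A single-arm (intersection or then-only) rule would wrongly kill the write in the second case; the union is what makes the analysis safe on partial-use branches.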
Follow-ups not in this PR
- A vacuum peephole for `i32.const X; drop` patterns left by PR-B/PR-C neutralization.

🤖 Generated with Claude Code