release/v0.6.0 PR-A: CSE cost-aware dedup gate (eliminates gale +6.3% regression) #94
Open
The v0.4.0 audit measured LOOM's CSE growing the gale_ffi
kernel-scheduler code section by +6.3% while wasm-opt -O3 reduced
it by -2.0%. The cause: enhanced-CSE deduplicated every duplicate
expression including 1-2 byte constants. Replacing
`i32.const -22` (2 bytes encoded) with `local.tee N / local.get N`
(2+2 = 4 bytes) plus an additional local declaration (~2 bytes
amortized) is unconditionally a size regression on cheap
constants — and gale is full of them (errno values).
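The byte counts above follow from how wasm encodes immediates as signed LEB128. A minimal sketch of that size calculation, assuming local indices below 128 (so `local.get` / `local.tee` take 2 bytes each); the helper names here are hypothetical and are not LOOM's own `signed_leb128_bytes_*` helpers:

```rust
/// Number of bytes needed to encode `value` as signed LEB128.
fn signed_leb128_len(mut value: i64) -> usize {
    let mut len = 0;
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7; // arithmetic shift, so the sign is preserved
        len += 1;
        let sign_bit_clear = byte & 0x40 == 0;
        // Stop once the remaining bits are pure sign extension of this byte.
        if (value == 0 && sign_bit_clear) || (value == -1 && !sign_bit_clear) {
            return len;
        }
    }
}

/// Encoded size of `i32.const value`: one opcode byte plus the immediate.
fn i32_const_cost(value: i32) -> usize {
    1 + signed_leb128_len(i64::from(value))
}

/// Assumed size of `local.get N` or `local.tee N` when N < 128.
const LOCAL_ACCESS_COST: usize = 2;

fn main() {
    for (name, value) in [("-EINVAL", -22), ("-EBUSY", -16), ("K_FOREVER", -1)] {
        println!(
            "i32.const {value:>3} ({name}): {} bytes; local.tee + local.get: {} bytes",
            i32_const_cost(value),
            2 * LOCAL_ACCESS_COST
        );
    }
}
```

Every errno-style constant in this range fits in a single LEB128 byte, so the original instruction is already as small as one `local.get`.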
Fix: add a cost gate `Expr::worth_dedup(occurrences)` that
estimates net byte savings before deciding to dedup. With N the
number of occurrences and cost the wasm-encoded byte cost of
materializing the expression once, skip when:
    net = (N - 1) * (cost - 2) - 4 ≤ 0
Examples:
    i32.const 42 (cost=2, N=10): net = -4 → skip
    i32.add (cost=5, N=2): net = -1 → skip
    i32.add (cost=5, N=3): net = +2 → keep
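The gate can be sketched as a pair of free functions. This is an illustration of the formula only, not LOOM's actual `Expr::worth_dedup` (which takes just `occurrences` and derives the cost from the expression). The split of the constants into "2 bytes per local access" and "4 bytes fixed overhead" is one plausible reading of the formula, and the names are hypothetical:

```rust
/// Assumed bytes for one `local.get N` that replaces a later occurrence.
const LOCAL_ACCESS_BYTES: i64 = 2;
/// The constant 4 from the formula. One plausible reading: `local.tee` on the
/// first occurrence (2 bytes) plus the amortized local declaration (~2 bytes).
const FIXED_OVERHEAD_BYTES: i64 = 4;

/// net = (N - 1) * (cost - 2) - 4, in bytes.
fn net_savings(cost: u32, occurrences: u32) -> i64 {
    let n = i64::from(occurrences);
    let cost = i64::from(cost);
    (n - 1) * (cost - LOCAL_ACCESS_BYTES) - FIXED_OVERHEAD_BYTES
}

/// Dedup only when the estimate is strictly positive (skip when net <= 0).
fn worth_dedup(cost: u32, occurrences: u32) -> bool {
    net_savings(cost, occurrences) > 0
}

fn main() {
    // The three examples from the commit message.
    for (label, cost, n) in [("i32.const 42", 2, 10), ("i32.add", 5, 2), ("i32.add", 5, 3)] {
        let verdict = if worth_dedup(cost, n) { "keep" } else { "skip" };
        println!("{label} (cost={cost}, N={n}): net = {:+} -> {verdict}", net_savings(cost, n));
    }
}
```

Using a signed result keeps the negative cases visible in logs instead of saturating at zero.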
Measurement on gale_ffi:
v0.5.0: code section 811 → 862 bytes (+6.3% regression)
v0.6.0: code section 811 → 808 bytes (-0.4% net win)
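The percentages can be re-derived from the raw byte counts. A small sketch, using only the three sizes quoted above (the function name is hypothetical):

```rust
/// Relative size change in percent, from `before` bytes to `after` bytes.
fn percent_change(before: u32, after: u32) -> f64 {
    (f64::from(after) - f64::from(before)) / f64::from(before) * 100.0
}

fn main() {
    let v050 = percent_change(811, 862);
    let v060 = percent_change(811, 808);
    println!("v0.5.0: {v050:+.1}%"); // prints "v0.5.0: +6.3%"
    println!("v0.6.0: {v060:+.1}%"); // prints "v0.6.0: -0.4%"
    println!("swing: {:.1} points", v050 - v060); // prints "swing: 6.7 points"
}
```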
Tests:
- test_cse_phase4_duplicate_constants_above_cost_threshold:
LARGE constants (5+ byte LEB128) still get deduplicated.
- test_cse_phase4_keeps_small_constants: regression test for
the gale fix. Cheap constants must survive CSE.
Pick #1 from v0.6.0 wasm-opt-gap research agent's plan.
Trace: REQ-3, REQ-14
Two research outputs from v0.6.0 planning, both grounded in real Verus-verified kernel-scheduler code at /Users/r/git/pulseengine/z/gale.

source-pattern-analysis.md: eight optimization-relevant patterns in gale source with file:line citations:
- Closed-set FSM dispatch (br_table targets): 6 near-identical matches over SchedThreadState in sched.rs:649-779
- Default-then-override (the LOOM v0.4/v0.5 hoist guard pattern): canonical example in sched.rs:404-444 (next_up_smp), four more found
- Verus-bounded loops: 24 `decreases` clauses, all of form MAX_CONST - i (MAX_WAITERS=64, MAX_CPUS=16)
- Tail-call dispatch matches, leaf-inline candidates, bit-mask axiom ingestion (event.rs lemmas), 2D state-machine matches, and Verus annotations as trusted axioms (1607 clauses total)

wasm-opt-gap-analysis.md: top 7 wasm-opt passes ranked by expected payoff on kernel code, cheapest-first:
1. Constant-CSE suppression gate (50 LOC, 0.3 wks): DONE in this PR
2. reorder-locals (slot renumbering) (250 LOC, 0.5 wks)
3. RedundantSetElimination liveness (600 LOC, 1.5 wks)
4. Compare-operand canonicalization (400 LOC, 1.0 wk)
5. merge-locals (500 LOC, 1.5 wks)
6. directize call_indirect → call (700 LOC, 2.0 wks)
7. simplify-locals sinking mode (1500 LOC, 4.0 wks)

The first three together (~1.3 weeks combined) flip the sign of the gale +6.3% regression. Pick #1 is shipped in this PR; picks #2 and #3 are tracked for follow-up PRs in v0.6.0.
Trace: REQ-7
This was referenced May 3, 2026
Summary
First PR of v0.6.0. Closes the +6.3% code-size regression that LOOM v0.5.0 produced on the gale kernel-scheduler FFI (Verus-verified Rust). Pick #1 from the v0.6.0 wasm-opt-gap research agent's plan: 50 LOC fix, 0.3 weeks effort, no risk.
Also lands two research docs in `docs/research/gale-v0.5.0/`: source-pattern analysis and wasm-opt pass-gap analysis. Both feed v0.6.0 planning.
The bug
LOOM's enhanced-CSE deduplicated every duplicate expression, including 1-2 byte constants. Replacing `i32.const -22` (2 bytes encoded) with `local.tee N / local.get N` (4 bytes) plus a new local declaration (~2 bytes amortized) is unconditionally a size regression. Gale is full of cheap constants (errno values: -EINVAL = -22, -EBUSY = -16, K_FOREVER = -1).
The fix
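A cost gate `Expr::worth_dedup(occurrences)` that estimates net byte savings before deciding to dedup:
    net = (N - 1) * (cost - 2) - 4
where `N` = number of occurrences, `cost` = wasm-encoded byte cost of materializing the expression once. Skip when `net ≤ 0`.
Examples (all measured against actual wasm encoding via `signed_leb128_bytes_*` helpers):
    i32.const 42 (cost=2, N=10): net = -4 → skip
    i32.add (cost=5, N=2): net = -1 → skip
    i32.add (cost=5, N=3): net = +2 → keep
Measurement on gale_ffi
    Baseline (gale_in_baseline.wasm): code section 811 bytes
    v0.5.0: 811 → 862 bytes (+6.3% regression)
    v0.6.0: 811 → 808 bytes (-0.4% net win)
A 6.7-point swing on real kernel-scheduler code.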
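For reference, the gate formula net = (N - 1) * (cost - 2) - 4 can be rearranged into a break-even occurrence count. This is a derivation from the formula as stated, not a figure taken from the PR:

```latex
\begin{align*}
\text{net} = (N - 1)(\text{cost} - 2) - 4 &> 0 \\
(N - 1)(\text{cost} - 2) &> 4 \\
N &> 1 + \frac{4}{\text{cost} - 2} \qquad (\text{cost} > 2)
\end{align*}
```

For cost ≤ 2 the left-hand side is never positive, so such expressions are never deduplicated regardless of N. For cost = 3 the gate needs N ≥ 6, for cost = 5 it needs N ≥ 3 (matching the `i32.add` examples), and from cost = 7 upward a single duplicate pair (N = 2) is enough.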
Tests
- `test_cse_phase4_duplicate_constants_above_cost_threshold`: pins that LARGE constants (5+ byte LEB128) still get deduplicated. CSE remains useful where it pays off.
- `test_cse_phase4_keeps_small_constants`: regression test for the gale fix. Cheap constants (2-byte LEB128) must survive CSE; neither locals nor instruction count grow.
v0.6.0 plan
The wasm-opt-gap research agent ranked 7 picks. The first three (1.3 weeks combined) flip the sign of the gale gap entirely:
1. Constant-CSE suppression gate (50 LOC, 0.3 wks), done in this PR
2. reorder-locals (slot renumbering) (250 LOC, 0.5 wks)
3. RedundantSetElimination liveness (600 LOC, 1.5 wks)
🤖 Generated with Claude Code