release/v0.6.0 PR-A: CSE cost-aware dedup gate (eliminates gale +6.3% regression) by avrabe · Pull Request #94 · pulseengine/loom

avrabe · 2026-05-03T12:45:27Z

Summary

First PR of v0.6.0. Closes the +6.3% code-size regression that LOOM v0.5.0 produced on the gale kernel-scheduler FFI (Verus-verified Rust). Pick #1 from the v0.6.0 wasm-opt-gap research agent's plan: 50 LOC fix, 0.3 weeks effort, no risk.

Also lands two research docs in docs/research/gale-v0.5.0/: source-pattern analysis and wasm-opt pass-gap analysis. Both feed v0.6.0 planning.

The bug

LOOM's enhanced-CSE deduplicated every duplicate expression — including 1-2 byte constants. Replacing i32.const -22 (2 bytes encoded) with local.tee N / local.get N (4 bytes) plus a new local declaration (~2 bytes amortized) is unconditionally a size regression. Gale is full of cheap constants (errno values: -EINVAL = -22, -EBUSY = -16, K_FOREVER = -1).
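The byte counts above can be checked with a small size calculator. This is a minimal sketch, not LOOM's code: `signed_leb128_len` and `i32_const_cost` are hypothetical names chosen for illustration, and the only assumption is the standard wasm encoding of `i32.const` (one opcode byte followed by a signed LEB128 immediate).

```rust
/// Number of bytes needed to encode `value` as signed LEB128.
/// (Hypothetical helper for illustration; not LOOM's actual API.)
fn signed_leb128_len(mut value: i64) -> usize {
    let mut len = 0;
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7; // arithmetic shift, so the sign is preserved
        len += 1;
        let sign_bit_clear = byte & 0x40 == 0;
        // Stop once the remaining bits are pure sign extension.
        if (value == 0 && sign_bit_clear) || (value == -1 && !sign_bit_clear) {
            return len;
        }
    }
}

/// Encoded size of `i32.const value`: one opcode byte plus the immediate.
fn i32_const_cost(value: i32) -> usize {
    1 + signed_leb128_len(value as i64)
}
```

Under these assumptions, `i32_const_cost(-22)`, `i32_const_cost(-16)` and `i32_const_cost(-1)` each come to 2 bytes, which is why the errno-style constants are never cheaper to route through a local.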

The fix

A cost gate Expr::worth_dedup(occurrences) that estimates net byte savings before deciding to dedup:

net = (N - 1) * (cost - 2) - 4

where N = number of occurrences, cost = wasm-encoded byte cost of materializing the expression once. Skip when net ≤ 0.
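One way to read the formula, sketched in Rust. The function names and the free-function shape are assumptions for illustration (the PR's real gate is a method, `Expr::worth_dedup(occurrences)`), and the split of the constant 4 into a 2-byte `local.tee` plus a ~2-byte amortized local declaration is an interpretation of the description above, assuming local indices below 128.

```rust
// Assumed sizes: local.get / local.tee with an index < 128 take 2 bytes each,
// and a new local declaration costs about 2 bytes amortized (per the PR text).
const LOCAL_ACCESS_BYTES: i64 = 2;
const LOCAL_DECL_BYTES: i64 = 2;

/// The PR's closed form: net = (N - 1) * (cost - 2) - 4.
fn net_savings(cost: i64, occurrences: i64) -> i64 {
    (occurrences - 1) * (cost - LOCAL_ACCESS_BYTES)
        - (LOCAL_ACCESS_BYTES + LOCAL_DECL_BYTES)
}

/// The same quantity derived by counting bytes before and after dedup.
fn net_savings_by_accounting(cost: i64, occurrences: i64) -> i64 {
    let before = occurrences * cost;
    let after = cost + LOCAL_ACCESS_BYTES          // first use: expression + local.tee
        + (occurrences - 1) * LOCAL_ACCESS_BYTES   // later uses: local.get
        + LOCAL_DECL_BYTES;                        // the new local
    before - after
}

/// Dedup only when it strictly shrinks the encoding.
fn worth_dedup(cost: i64, occurrences: i64) -> bool {
    net_savings(cost, occurrences) > 0
}
```

Both functions agree for every input, so the closed form is just the before/after byte count simplified. For a 2-byte constant the per-use saving is zero, so the fixed overhead is never recovered regardless of N.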

Examples (all measured against actual wasm encoding via signed_leb128_bytes_* helpers):

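| Expression | Cost (bytes) | N | Net (bytes) | Decision |
|---|---|---|---|---|
| i32.const 42 | 2 | 2 | -4 | skip |
| i32.const 42 | 2 | 10 | -4 | skip (cheap) |
| i32.add of LocalGet+Const42 | 5 | 2 | -1 | skip |
| i32.add of LocalGet+Const42 | 5 | 3 | +2 | keep |
| i32.const 0x12345678 | 6 | 3 | +4 | keep |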
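The examples imply a break-even occurrence count for each expression cost. The sketch below solves net = (N - 1) * (cost - 2) - 4 > 0 for the smallest N; `min_occurrences_to_dedup` is a hypothetical helper written for illustration and is not part of LOOM.

```rust
/// Smallest occurrence count N for which (N - 1) * (cost - 2) - 4 > 0,
/// or `None` when no N can ever pay off (cost of 2 bytes or less).
/// (Hypothetical helper for illustration; not LOOM's actual API.)
fn min_occurrences_to_dedup(cost: u32) -> Option<u32> {
    if cost <= 2 {
        return None; // per-use saving is zero or negative
    }
    let per_use_saving = cost - 2;
    // Need (N - 1) * per_use_saving > 4, i.e. N - 1 >= floor(4 / per_use_saving) + 1.
    Some(4 / per_use_saving + 2)
}
```

Under this formula a 5-byte or 6-byte expression needs at least three occurrences, a 7-byte expression pays off at two, and a 2-byte constant never does.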

Measurement on gale_ffi

| Build | Code section | Δ vs baseline |
|---|---|---|
| baseline (`gale_in_baseline.wasm`) | 811 B | n/a |
| LOOM v0.5.0 | 862 B | +6.3% |
| LOOM v0.6.0 (this PR) | 808 B | -0.4% |

A 6.7-point swing on real kernel-scheduler code.
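The percentages follow directly from the byte counts in the table. A minimal sketch of the arithmetic, with `percent_delta` as a hypothetical helper name:

```rust
/// Relative change of `measured` against `baseline`, in percent.
/// (Hypothetical helper for illustration.)
fn percent_delta(baseline: u32, measured: u32) -> f64 {
    (measured as f64 - baseline as f64) / baseline as f64 * 100.0
}

/// Formats a delta with an explicit sign and one decimal place.
fn format_delta(delta: f64) -> String {
    format!("{:+.1}%", delta)
}
```

With the table's numbers, `percent_delta(811, 862)` is about +6.29 and `percent_delta(811, 808)` is about -0.37, so the difference between the two builds is roughly 6.66 percentage points, which rounds to the 6.7 quoted above.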

Tests

- test_cse_phase4_duplicate_constants_above_cost_threshold: pins that LARGE constants (5+ byte LEB128) still get deduplicated. CSE remains useful where it pays off.
- test_cse_phase4_keeps_small_constants: regression test for the gale fix. Cheap constants (2-byte LEB128) must survive CSE; neither locals nor instruction count grow.
- All 247 existing lib tests pass.

v0.6.0 plan

The wasm-opt-gap research agent ranked 7 picks. The first three (2.3 weeks combined, summing the per-pick estimates below) flip the sign of the gale gap entirely:

| Pick | Status | LOC | Wks |
|---|---|---|---|
| 1. Constant-CSE suppression gate | this PR | 50 | 0.3 |
| 2. reorder-locals (slot renumbering) | pending | 250 | 0.5 |
| 3. RedundantSetElimination liveness | pending | 600 | 1.5 |

🤖 Generated with Claude Code

The v0.4.0 audit measured LOOM's CSE growing the gale_ffi kernel-scheduler code section by 6.3% while wasm-opt -O3 shrank it by 2.0%.

The cause: enhanced-CSE deduplicated every duplicate expression, including 1-2 byte constants. Replacing `i32.const -22` (2 bytes encoded) with `local.tee N / local.get N` (2+2 = 4 bytes) plus an additional local declaration (~2 bytes amortized) is unconditionally a size regression on cheap constants, and gale is full of them (errno values).

Fix: add a cost gate `Expr::worth_dedup(occurrences)` that estimates net byte savings before deciding to dedup. Skip when:

net = (N - 1) * (cost - 2) - 4 ≤ 0

Examples:
- i32.const 42 (cost=2, N=10): savings = -4 → skip
- i32.add (cost=5, N=2): savings = -1 → skip
- i32.add (cost=5, N=3): savings = +2 → keep

Measurement on gale_ffi:
- v0.5.0: code section 811 → 862 bytes (+6.3% regression)
- v0.6.0: code section 811 → 808 bytes (-0.4% net win)

Tests:
- test_cse_phase4_duplicate_constants_above_cost_threshold: LARGE constants (5+ byte LEB128) still get deduplicated.
- test_cse_phase4_keeps_small_constants: regression test for the gale fix. Cheap constants must survive CSE.

Pick #1 from v0.6.0 wasm-opt-gap research agent's plan.

Trace: REQ-3, REQ-14

Two research outputs from v0.6.0 planning, both grounded in real Verus-verified kernel-scheduler code at /Users/r/git/pulseengine/z/gale.

source-pattern-analysis.md: eight optimization-relevant patterns in gale source with file:line citations:
- Closed-set FSM dispatch (br_table targets): 6 near-identical matches over SchedThreadState in sched.rs:649-779
- Default-then-override (the LOOM v0.4/v0.5 hoist guard pattern): canonical example in sched.rs:404-444 (next_up_smp), four more found
- Verus-bounded loops: 24 `decreases` clauses, all of form MAX_CONST - i (MAX_WAITERS=64, MAX_CPUS=16)
- Tail-call dispatch matches, leaf-inline candidates, bit-mask axiom ingestion (event.rs lemmas), 2D state-machine matches, and Verus annotations as trusted axioms (1607 clauses total)

wasm-opt-gap-analysis.md: top 7 wasm-opt passes ranked by expected payoff on kernel code, cheapest-first:
1. Constant-CSE suppression gate (50 LOC, 0.3 wks), DONE in this PR
2. reorder-locals (slot renumbering) (250 LOC, 0.5 wks)
3. RedundantSetElimination liveness (600 LOC, 1.5 wks)
4. Compare-operand canonicalization (400 LOC, 1.0 wk)
5. merge-locals (500 LOC, 1.5 wks)
6. directize call_indirect → call (700 LOC, 2.0 wks)
7. simplify-locals sinking mode (1500 LOC, 4.0 wks)

The first three together (~2.3 weeks combined, summing the estimates above) flip the sign of the gale +6.3% regression. Pick #1 is shipped in this PR; picks #2 and #3 are tracked for follow-up PRs in v0.6.0.

Trace: REQ-7

avrabe added 2 commits May 3, 2026 14:40

This was referenced May 3, 2026

release/v0.6.0 PR-B: eliminate_dead_locals (-0.86% on gale) #95

Open

release/v0.6.0 PR-C: eliminate_dead_stores — full backward liveness (-0.4% on calculator.wasm) #96

Open

avrabe wants to merge 2 commits into main from release/v0.6.0-pr-a-cse-cost-threshold
