
feat(opt): direct hi/lo extraction for u64-packed values (#94) #98

Merged
avrabe merged 1 commit into main from feat/issue-94-u64-packed-extract on May 11, 2026

Conversation

avrabe (Contributor) commented May 10, 2026

Closes #94.

Summary

When wasm extracts the upper 32 bits of a u64-packed return value via `i64.shr_u 32; i32.wrap_i64` (the canonical Rust → wasm pattern for gale-ffi-style FFI shims), synth was emitting a 38-byte runtime 64-bit shift sequence even though the answer was already sitting in a register.

This PR teaches the optimizer-bridge to detect compile-time-constant shift / mask operands on i64 ops and lower the canonical packed-struct extraction patterns to direct register-rename + zero/sign-extend, eliminating the entire generic 64-bit shift dance.
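For context, the Rust below is a minimal sketch of the kind of source that produces these patterns; the function names are hypothetical, not code from this repository:

```rust
/// Upper 32 bits: rustc lowers this to `i64.shr_u 32; i32.wrap_i64`.
#[no_mangle]
pub extern "C" fn extract_hi(packed: u64) -> u32 {
    (packed >> 32) as u32
}

/// Upper 32 bits, sign-extended: lowers to `i64.shr_s 32; i32.wrap_i64`.
#[no_mangle]
pub extern "C" fn extract_hi_signed(packed: u64) -> i32 {
    ((packed as i64) >> 32) as i32
}

/// Lower 32 bits: lowers to `i64.and 0xFFFFFFFF; i32.wrap_i64`
/// (optimized builds may emit the bare `i32.wrap_i64` instead).
#[no_mangle]
pub extern "C" fn extract_lo(packed: u64) -> u32 {
    (packed & 0xFFFF_FFFF) as u32
}
```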

Patterns now caught

| Pattern | Before | After |
|---|---|---|
| `i64.shr_u 32; i32.wrap_i64` | 38-byte runtime shift | vreg rename + 1× MOV #0 |
| `i64.shr_s 32; i32.wrap_i64` | 40-byte runtime shift | vreg rename + 1× ASR #31 |
| `i64.and 0xFFFFFFFF; i32.wrap_i64` | 2× AND.W | vreg rename + 1× MOV #0 |

A pre-pass tracks single-use `i64.const` values that are folded by these handlers and elides the MOV(s) that would otherwise have loaded the constant into a register.
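For orientation, a minimal sketch of that pre-pass shape, assuming a simplified IR; `Op`, `ValueId`, and `skip_eligible_consts` are hypothetical stand-ins, not the actual optimizer_bridge.rs types:

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical, simplified IR for illustration; the real pre-pass also has
// to count uses across every opcode, not just the shift shown here.
type ValueId = u32;

enum Op {
    I64Const { dst: ValueId, value: u64 },
    I64ShrU { amount: ValueId },
}

/// Collect i64 constants whose only use is a shift amount the fast path
/// folds, so the MOVs that would materialize them can be elided.
fn skip_eligible_consts(ops: &[Op]) -> HashSet<ValueId> {
    let mut consts: HashMap<ValueId, u64> = HashMap::new();
    let mut uses: HashMap<ValueId, u32> = HashMap::new();
    let mut foldable: HashSet<ValueId> = HashSet::new();

    for op in ops {
        match op {
            Op::I64Const { dst, value } => {
                consts.insert(*dst, *value);
            }
            Op::I64ShrU { amount } => {
                *uses.entry(*amount).or_insert(0) += 1;
                // WASM takes 64-bit shift amounts mod 64, so any constant
                // whose low 6 bits equal 32 selects the hi-to-lo fold.
                if consts.get(amount).is_some_and(|&c| (c & 63) == 32) {
                    foldable.insert(*amount);
                }
            }
        }
    }

    // A constant is skip-eligible only when the fold site is its sole use.
    foldable.retain(|v| uses.get(v) == Some(&1));
    foldable
}
```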

Before / after disassembly (canonical pattern)

WAT (Cortex-M4F, exported as `(param i64) (result i32)`):

```wat
local.get 0
i64.const 32
i64.shr_u
i32.wrap_i64
```

Before (compile_function byte length = 48 bytes):

```asm
20 f0 3f 00      and.w r0, r0, #63           ; mask shift to <64
b2 f1 20 03      subs.w r3, r2, #32          ; n - 32
0a d5            bpl.n  +20                  ; if >=32 take large branch
c2 f1 20 03      rsb.w  r3, r2, #32
01 fa 03 f3      lsl.w  r3, r1, r3
20 fa 02 f0      lsr.w  r0, r0, r2
40 ea 03 00      orr.w  r0, r0, r3
21 fa 02 f1      lsr.w  r1, r1, r2
02 e0            b.n    +4
21 fa 03 f0      lsr.w  r0, r1, r3
00 21            movs   r1, #0
70 47            bx     lr
```

After (compile_function byte length = 8 bytes):

```asm
00 27            movs   r7, #0     ; rd_hi = 0
39 46            mov    r1, r7     ; epilogue: hi → R1
28 46            mov    r0, r5     ; epilogue: lo → R0 (was the source's hi)
70 47            bx     lr
```

Net savings: 40 bytes per call site (~83%).

The 8-byte form matches LLVM-LTO's gold-standard shape modulo register choice: the result already lives in a register, and we just shuffle it into the AAPCS return slot.

Test plan

  • `cargo test -p synth-synthesis --lib optimizer_bridge::tests::test_issue94` — six new tests:
    • `test_issue94_size_demo` (informational, prints before/after sizes)
    • `test_issue94_shr_u_32_lowers_to_register_rename`
    • `test_issue94_shr_s_32_lowers_to_register_rename_with_sign_extend`
    • `test_issue94_and_mask_low32_lowers_to_register_rename`
    • `test_issue94_shr_u_non_32_still_emits_runtime_shift` (regression guard)
    • `test_issue94_ir_level_shr_u_32` (IR-level direct test)
  • `cargo test -p synth-backend --lib arm_backend::tests::test_issue94_hi32_extract_is_smaller_than_generic_shift` — end-to-end byte-size assertion through the full encoder; asserts a ≥ 30-byte improvement.
  • `cargo test --workspace` — full suite passes (no regressions).
  • `cargo clippy --workspace --all-targets -- -D warnings` — clean.
  • `cargo fmt --check` — clean.

Files changed

  • `crates/synth-synthesis/src/optimizer_bridge.rs` — pre-pass for skip-eligible i64 constants; constant-tracker map; fast-path branches in the I64ShrU / I64ShrS / I64And handlers; six unit tests.
  • `crates/synth-backend/src/arm_backend.rs` — one end-to-end byte-size test asserting the canonical pattern compiles to ≥ 30 bytes less than the generic-shift baseline.

Notes

  • The optimization is purely target-specific — it doesn't change WASM-level semantics, IR opcodes, or the Backend trait. It lives in `ir_to_arm`, where we already know the operand pair structure.
  • For shr_u / shr_s, we match any constant whose low 6 bits == 32 (per WASM's "shift-amount mod 64" semantics), so `i64.const 0x40000000_00000020` would also fire correctly if anyone wrote that; see the check sketched after this list. The common case is a literal `i64.const 32`.
  • Doesn't touch the synthesis-time / instruction_selector path; that is covered by another in-flight PR for issue #93 (memset/memcpy/memmove i64-codegen produces non-terminating loop on Cortex-M, silicon-blocking).
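On the mod-64 point above, a quick standalone check of the claim in plain Rust (nothing repo-specific):

```rust
fn main() {
    // WASM masks 64-bit shift amounts to their low 6 bits (amount mod 64),
    // so any constant whose low 6 bits equal 32 triggers the fold.
    assert_eq!(32u64 & 63, 32);                    // the common literal
    assert_eq!(0x4000_0000_0000_0020u64 & 63, 32); // exotic, still fires
    assert_ne!(33u64 & 63, 32);                    // non-32: generic path
}
```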

🤖 Generated with Claude Code

When wasm extracts the upper 32 bits of a u64-packed return value via
`i64.shr_u 32; i32.wrap_i64` (the canonical Rust → wasm pattern for
gale-ffi-style FFI shims), synth was emitting a 38-byte runtime 64-bit
shift sequence even though the answer was already sitting in a register.

The optimizer-bridge now:

- Tracks the value of every `i64.const` it lowers (`known_i64_consts`).
- In the `i64.shr_u` / `i64.shr_s` handler, when the shift amount is a
  known constant whose low 6 bits == 32, skips the runtime shift entirely
  and renames `dest_lo` onto the source's high register. For shr_u the
  hi half is zeroed via a single 16-bit MOV; for shr_s it's
  sign-extended via an ASR #31.
- In the `i64.and` handler, when the right operand is the constant
  `0xFFFFFFFF`, collapses the lo-half AND to a vreg rename and the
  hi-half AND to MOV #0.
- Pre-scans the IR to find constants whose only use is one of the
  folded patterns above (with use-count == 1) and elides the MOV(s)
  that would have loaded the constant into a register.

End-to-end byte-size impact (Cortex-M4F, `(param i64) (result i32)`,
`local.get 0; i64.const 32; i64.shr_u; i32.wrap_i64`):

  before: 48 bytes (38 for the shift + 10 prologue/epilogue)
  after:   8 bytes (`mov r7, #0; mov r1, r7; mov r0, r5; bx lr`)
  saved:  40 bytes per call site (~83%)

Patterns now caught:
- `i64.shr_u 32` → register rename (hi half) + MOV #0
- `i64.shr_s 32` → register rename + ASR #31 (sign-extend)
- `i64.and 0xFFFFFFFF` → register rename (lo half) + MOV #0

Tests:
- `optimizer_bridge::tests::test_issue94_*` — six unit tests covering
  the shr_u / shr_s / and patterns, an IR-level direct test, a non-32
  sanity check that asserts the generic path is preserved, and an
  informational size demo.
- `arm_backend::tests::test_issue94_hi32_extract_is_smaller_than_generic_shift`
  — end-to-end byte-size assertion through the full encoder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
codecov bot commented May 10, 2026

Codecov Report

❌ Patch coverage is 96.01140% with 14 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| `crates/synth-synthesis/src/optimizer_bridge.rs` | 95.55% | 14 Missing ⚠️ |


avrabe merged commit 2a0bb01 into main on May 11, 2026 (9 checks passed).
avrabe deleted the feat/issue-94-u64-packed-extract branch on May 11, 2026 at 03:52.
Closes #94 (u64-packed FFI return: emit register-direct field access instead of generic 64-bit shift extraction).
1 participant