feat(opt): direct hi/lo extraction for u64-packed values (#94) #98
Merged
Conversation
When wasm extracts the upper 32 bits of a u64-packed return value via `i64.shr_u 32; i32.wrap_i64` (the canonical Rust → wasm pattern for gale-ffi-style FFI shims), synth was emitting a 38-byte runtime 64-bit shift sequence even though the answer was already sitting in a register.

The optimizer-bridge now:

- Tracks the value of every `i64.const` it lowers (`known_i64_consts`).
- In the `i64.shr_u` / `i64.shr_s` handler, when the shift amount is a known constant whose low 6 bits == 32, skips the runtime shift entirely and renames `dest_lo` onto the source's high register. For `shr_u` the hi half is zeroed via a single 16-bit MOV; for `shr_s` it's sign-extended via an ASR #31.
- In the `i64.and` handler, when the right operand is the constant `0xFFFFFFFF`, collapses the lo-half AND to a vreg rename and the hi-half AND to a MOV #0.
- Pre-scans the IR to find constants whose only use is one of the folded patterns above (with use-count == 1) and elides the MOV(s) that would have loaded the constant into a register.

End-to-end byte-size impact (Cortex-M4F, `(param i64) (result i32)`, `local.get 0; i64.const 32; i64.shr_u; i32.wrap_i64`):

- before: 48 bytes (38 for the shift + 10 prologue/epilogue)
- after: 8 bytes (`mov r7, #0; mov r1, r7; mov r0, r5; bx lr`)
- saved: 40 bytes per call site (~83%)

Patterns now caught:

- `i64.shr_u 32` → register rename (hi half) + MOV #0
- `i64.shr_s 32` → register rename + ASR #31 (sign-extend)
- `i64.and 0xFFFFFFFF` → register rename (lo half) + MOV #0

Tests:

- `optimizer_bridge::tests::test_issue94_*` — five unit tests covering the shr_u / shr_s / and patterns and a non-32 sanity check that asserts the generic path is preserved.
- `arm_backend::tests::test_issue94_hi32_extract_is_smaller_than_generic_shift` — end-to-end byte-size assertion through the full encoder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
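For orientation, here is a minimal sketch of that shift fast path, using hypothetical stand-in types (`ValueId`, `Ctx`, `ShrLowering`) rather than synth's real optimizer-bridge structures:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for synth's optimizer-bridge types; the real
// structures in optimizer_bridge.rs differ.
type ValueId = u32;

struct Ctx {
    // Filled in whenever an `i64.const` is lowered (the `known_i64_consts` map).
    known_i64_consts: HashMap<ValueId, u64>,
}

enum ShrLowering {
    // Rename dest_lo onto the source's hi register; zero dest_hi with one MOV.
    AliasHiZeroTop,
    // Rename dest_lo onto the source's hi register; sign-extend dest_hi via ASR #31.
    AliasHiSignExtendTop,
    // No folding possible: emit the generic runtime 64-bit shift.
    GenericRuntimeShift,
}

fn classify_i64_shr(ctx: &Ctx, shift_amount: ValueId, signed: bool) -> ShrLowering {
    match ctx.known_i64_consts.get(&shift_amount) {
        // wasm masks i64 shift counts to the low 6 bits, so any constant
        // whose low 6 bits are 32 qualifies for the fast path.
        Some(k) if k & 0x3f == 32 => {
            if signed {
                ShrLowering::AliasHiSignExtendTop
            } else {
                ShrLowering::AliasHiZeroTop
            }
        }
        _ => ShrLowering::GenericRuntimeShift,
    }
}
```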
Closes #94.
Summary
When wasm extracts the upper 32 bits of a u64-packed return value via `i64.shr_u 32; i32.wrap_i64` — the canonical Rust → wasm pattern for gale-ffi-style FFI shims — synth was emitting a 38-byte runtime 64-bit shift sequence even though the answer was already sitting in a register.

This PR teaches the optimizer-bridge to detect compile-time-constant shift / mask operands on i64 ops and lower the canonical packed-struct extraction patterns to direct register-rename + zero/sign-extend, eliminating the entire generic 64-bit shift dance.
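For context, the source-side shape that produces this wasm looks roughly like the following sketch (`pack_hi_lo` / `unpack_hi` are hypothetical names, not code from this repo):

```rust
/// Packs two u32 halves into a single u64 return value, gale-ffi style.
pub fn pack_hi_lo(hi: u32, lo: u32) -> u64 {
    ((hi as u64) << 32) | lo as u64
}

/// Recovers the high half. rustc lowers the shift-and-truncate below to
/// `i64.const 32; i64.shr_u; i32.wrap_i64` in wasm, which is exactly the
/// pattern this PR folds.
pub fn unpack_hi(packed: u64) -> u32 {
    (packed >> 32) as u32
}
```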
Patterns now caught
- `i64.shr_u 32; i32.wrap_i64`
- `i64.shr_s 32; i32.wrap_i64`
- `i64.and 0xFFFFFFFF; i32.wrap_i64`

A pre-pass tracks single-use `i64.const` values that are folded by these handlers and elides the MOV(s) that would have loaded the constant into a register at all.
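A minimal sketch of that pre-pass over a hypothetical mini-IR (the `Op` enum and `ValueId` here are illustrative, not synth's real IR):

```rust
use std::collections::HashMap;

type ValueId = u32;

// Illustrative mini-IR; synth's real IR ops differ.
enum Op {
    I64Const { dest: ValueId, value: u64 },
    I64ShrU { amount: ValueId },
    I64ShrS { amount: ValueId },
    I64And { rhs: ValueId },
    Other { uses: Vec<ValueId> },
}

/// Collects constants whose only use is a foldable shift/mask operand,
/// so lowering can skip the MOV(s) that would have materialized them.
fn skip_eligible_consts(ops: &[Op]) -> Vec<ValueId> {
    let mut consts: HashMap<ValueId, u64> = HashMap::new();
    let mut total_uses: HashMap<ValueId, u32> = HashMap::new();
    let mut foldable_uses: HashMap<ValueId, u32> = HashMap::new();

    for op in ops {
        match op {
            Op::I64Const { dest, value } => {
                consts.insert(*dest, *value);
            }
            Op::I64ShrU { amount } | Op::I64ShrS { amount } => {
                *total_uses.entry(*amount).or_default() += 1;
                // Low 6 bits == 32 is the shr fast-path condition.
                if consts.get(amount).map_or(false, |k| k & 0x3f == 32) {
                    *foldable_uses.entry(*amount).or_default() += 1;
                }
            }
            Op::I64And { rhs } => {
                *total_uses.entry(*rhs).or_default() += 1;
                if consts.get(rhs) == Some(&0xFFFF_FFFF) {
                    *foldable_uses.entry(*rhs).or_default() += 1;
                }
            }
            Op::Other { uses } => {
                for u in uses {
                    *total_uses.entry(*u).or_default() += 1;
                }
            }
        }
    }

    // Eligible: defined as a const, used exactly once, and that one use folds.
    consts
        .keys()
        .copied()
        .filter(|v| total_uses.get(v) == Some(&1) && foldable_uses.get(v) == Some(&1))
        .collect()
}
```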
Before / after disassembly (canonical pattern)

WAT (cortex-m4f, exported as `(param i64) (result i32)`):

```wat
(func (param i64) (result i32)
  local.get 0
  i64.const 32
  i64.shr_u
  i32.wrap_i64)
```

Before (`compile_function` byte length = 48 bytes): 38 bytes of generic runtime 64-bit shift plus 10 bytes of prologue/epilogue.

After (`compile_function` byte length = 8 bytes):

```asm
mov r7, #0
mov r1, r7
mov r0, r5
bx lr
```

Net savings: 40 bytes per call site (~83%).
The 8-byte form is exactly LLVM-LTO's gold-standard shape modulo register choice: under AAPCS the i64 argument arrives split across r0 (low word) and r1 (high word) and the i32 result returns in r0, so the result already lives in a register; we just shuffle it into the return slot.
Test plan
- `cargo test -p synth-synthesis --lib optimizer_bridge::tests::test_issue94` — six new tests:
  - `test_issue94_size_demo` (informational, prints before/after sizes)
  - `test_issue94_shr_u_32_lowers_to_register_rename`
  - `test_issue94_shr_s_32_lowers_to_register_rename_with_sign_extend`
  - `test_issue94_and_mask_low32_lowers_to_register_rename`
  - `test_issue94_shr_u_non_32_still_emits_runtime_shift` (regression guard)
  - `test_issue94_ir_level_shr_u_32` (IR-level direct test)
- `cargo test -p synth-backend --lib arm_backend::tests::test_issue94_hi32_extract_is_smaller_than_generic_shift` — end-to-end byte-size assertion through the full encoder; asserts a ≥ 30 byte improvement.
- `cargo test --workspace` — full suite passes (no regressions).
- `cargo clippy --workspace --all-targets -- -D warnings` — clean.
- `cargo fmt --check` — clean.
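For intuition, a pure-Rust model of the property the non-32 regression guard protects (illustrative; not the actual test code in `optimizer_bridge.rs`):

```rust
/// Shifting a packed u64 right by exactly 32 selects the high half verbatim,
/// which is why the fold can be a pure register rename plus a zeroed top word.
fn hi32_extract(packed: u64) -> u32 {
    (packed >> 32) as u32
}

#[test]
fn shr_by_32_is_just_the_high_half() {
    let packed = (0xDEAD_BEEFu64 << 32) | 0x1234_5678;
    // The folded pattern: the answer is the high word, untouched.
    assert_eq!(hi32_extract(packed), 0xDEAD_BEEF);
    // Any other amount mixes bits across the halves, so the generic
    // runtime shift really is required there.
    assert_ne!((packed >> 7) as u32, hi32_extract(packed));
}
```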
Files changed

- `crates/synth-synthesis/src/optimizer_bridge.rs` — pre-pass for skip-eligible i64 constants; constant-tracker map; fast-path branches in the `I64ShrU` / `I64ShrS` / `I64And` handlers; six unit tests.
- `crates/synth-backend/src/arm_backend.rs` — one end-to-end byte-size test asserting the canonical pattern compiles to ≥ 30 bytes less than the generic-shift baseline.
Notes

- No changes to the `Backend` trait; the fold lives in `ir_to_arm`, where we already know the operand pair structure.
- Because only the low 6 bits of the shift amount matter (wasm masks shift counts), an exotic constant like `i64.const 0x40000000_00000020` would also fire correctly if anyone wrote that. The common case is a literal `i64.const 32`.

🤖 Generated with Claude Code