
feat(opt): direct hi/lo extraction for u64-packed values (#94) #98

Merged
avrabe merged 1 commit into main from feat/issue-94-u64-packed-extract on May 11, 2026

Conversation

avrabe (Contributor) commented May 10, 2026

Closes #94.

Summary

When wasm extracts the upper 32 bits of a u64-packed return value via `i64.shr_u 32; i32.wrap_i64` (the canonical Rust → wasm pattern for gale-ffi-style FFI shims), synth was emitting a 38-byte runtime 64-bit shift sequence even though the answer was already sitting in a register.

This PR teaches the optimizer-bridge to detect compile-time-constant shift / mask operands on i64 ops and lower the canonical packed-struct extraction patterns to direct register-rename + zero/sign-extend, eliminating the entire generic 64-bit shift dance.
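For context, the Rust below is a minimal sketch of the kind of source that produces these patterns; the function names are hypothetical, not code from this repository:

```rust
/// Upper 32 bits: rustc lowers this to `i64.shr_u 32; i32.wrap_i64`.
#[no_mangle]
pub extern "C" fn extract_hi(packed: u64) -> u32 {
    (packed >> 32) as u32
}

/// Upper 32 bits, sign-extended: lowers to `i64.shr_s 32; i32.wrap_i64`.
#[no_mangle]
pub extern "C" fn extract_hi_signed(packed: u64) -> i32 {
    ((packed as i64) >> 32) as i32
}

/// Lower 32 bits: lowers to `i64.and 0xFFFFFFFF; i32.wrap_i64`
/// (optimized builds may emit the bare `i32.wrap_i64` instead).
#[no_mangle]
pub extern "C" fn extract_lo(packed: u64) -> u32 {
    (packed & 0xFFFF_FFFF) as u32
}
```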

Patterns now caught

| Pattern | Before | After |
|---|---|---|
| `i64.shr_u 32; i32.wrap_i64` | 38-byte runtime shift | vreg rename + 1× MOV #0 |
| `i64.shr_s 32; i32.wrap_i64` | 40-byte runtime shift | vreg rename + 1× ASR #31 |
| `i64.and 0xFFFFFFFF; i32.wrap_i64` | 2× AND.W | vreg rename + 1× MOV #0 |

A pre-pass tracks single-use `i64.const` values that are folded by these handlers and elides the MOV(s) that would otherwise have loaded the constant into a register.
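For orientation, a minimal sketch of that pre-pass shape, assuming a simplified IR; `Op`, `ValueId`, and `skip_eligible_consts` are hypothetical stand-ins, not the actual optimizer_bridge.rs types:

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical, simplified IR for illustration; the real pre-pass also has
// to count uses across every opcode, not just the shift shown here.
type ValueId = u32;

enum Op {
    I64Const { dst: ValueId, value: u64 },
    I64ShrU { amount: ValueId },
}

/// Collect i64 constants whose only use is a shift amount the fast path
/// folds, so the MOVs that would materialize them can be elided.
fn skip_eligible_consts(ops: &[Op]) -> HashSet<ValueId> {
    let mut consts: HashMap<ValueId, u64> = HashMap::new();
    let mut uses: HashMap<ValueId, u32> = HashMap::new();
    let mut foldable: HashSet<ValueId> = HashSet::new();

    for op in ops {
        match op {
            Op::I64Const { dst, value } => {
                consts.insert(*dst, *value);
            }
            Op::I64ShrU { amount } => {
                *uses.entry(*amount).or_insert(0) += 1;
                // WASM takes 64-bit shift amounts mod 64, so any constant
                // whose low 6 bits equal 32 selects the hi-to-lo fold.
                if consts.get(amount).is_some_and(|&c| (c & 63) == 32) {
                    foldable.insert(*amount);
                }
            }
        }
    }

    // A constant is skip-eligible only when the fold site is its sole use.
    foldable.retain(|v| uses.get(v) == Some(&1));
    foldable
}
```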

Before / after disassembly (canonical pattern)

WAT (Cortex-M4F, exported as `(param i64) (result i32)`):

```wat
local.get 0
i64.const 32
i64.shr_u
i32.wrap_i64
```

Before (compile_function byte length = 48 bytes):

```asm
20 f0 3f 00      and.w r0, r0, #63           ; mask shift to <64
b2 f1 20 03      subs.w r3, r2, #32          ; n - 32
0a d5            bpl.n  +20                  ; if >=32 take large branch
c2 f1 20 03      rsb.w  r3, r2, #32
01 fa 03 f3      lsl.w  r3, r1, r3
20 fa 02 f0      lsr.w  r0, r0, r2
40 ea 03 00      orr.w  r0, r0, r3
21 fa 02 f1      lsr.w  r1, r1, r2
02 e0            b.n    +4
21 fa 03 f0      lsr.w  r0, r1, r3
00 21            movs   r1, #0
70 47            bx     lr
```

After (compile_function byte length = 8 bytes):

```asm
00 27            movs   r7, #0     ; rd_hi = 0
39 46            mov    r1, r7     ; epilogue: hi → R1
28 46            mov    r0, r5     ; epilogue: lo → R0 (was the source's hi)
70 47            bx     lr
```

Net savings: 40 bytes per call site (~83%).

The 8-byte form matches LLVM-LTO's gold-standard shape modulo register choice: the result already lives in a register, and we just shuffle it into the AAPCS return slot.

Test plan

  • `cargo test -p synth-synthesis --lib optimizer_bridge::tests::test_issue94` — six new tests:
    • `test_issue94_size_demo` (informational, prints before/after sizes)
    • `test_issue94_shr_u_32_lowers_to_register_rename`
    • `test_issue94_shr_s_32_lowers_to_register_rename_with_sign_extend`
    • `test_issue94_and_mask_low32_lowers_to_register_rename`
    • `test_issue94_shr_u_non_32_still_emits_runtime_shift` (regression guard)
    • `test_issue94_ir_level_shr_u_32` (IR-level direct test)
  • `cargo test -p synth-backend --lib arm_backend::tests::test_issue94_hi32_extract_is_smaller_than_generic_shift` — end-to-end byte-size assertion through the full encoder; asserts a ≥ 30-byte improvement.
  • `cargo test --workspace` — full suite passes (no regressions).
  • `cargo clippy --workspace --all-targets -- -D warnings` — clean.
  • `cargo fmt --check` — clean.

Files changed

  • `crates/synth-synthesis/src/optimizer_bridge.rs` — pre-pass for skip-eligible i64 constants; constant-tracker map; fast-path branches in the I64ShrU / I64ShrS / I64And handlers; six unit tests.
  • `crates/synth-backend/src/arm_backend.rs` — one end-to-end byte-size test asserting the canonical pattern compiles to ≥ 30 bytes less than the generic-shift baseline.

Notes

  • The optimization is purely target-specific — it doesn't change WASM-level semantics, IR opcodes, or the Backend trait. It lives in `ir_to_arm`, where we already know the operand pair structure.
  • For shr_u / shr_s, we match any constant whose low 6 bits == 32 (per WASM's "shift-amount mod 64" semantics), so `i64.const 0x40000000_00000020` would also fire correctly if anyone wrote that; see the check sketched after this list. The common case is a literal `i64.const 32`.
  • Doesn't touch the synthesis-time / instruction_selector path; that is covered by another in-flight PR for issue #93 (memset/memcpy/memmove i64-codegen produces non-terminating loop on Cortex-M, silicon-blocking).
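On the mod-64 point above, a quick standalone check of the claim in plain Rust (nothing repo-specific):

```rust
fn main() {
    // WASM masks 64-bit shift amounts to their low 6 bits (amount mod 64),
    // so any constant whose low 6 bits equal 32 triggers the fold.
    assert_eq!(32u64 & 63, 32);                    // the common literal
    assert_eq!(0x4000_0000_0000_0020u64 & 63, 32); // exotic, still fires
    assert_ne!(33u64 & 63, 32);                    // non-32: generic path
}
```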

🤖 Generated with Claude Code

When wasm extracts the upper 32 bits of a u64-packed return value via
`i64.shr_u 32; i32.wrap_i64` (the canonical Rust → wasm pattern for
gale-ffi-style FFI shims), synth was emitting a 38-byte runtime 64-bit
shift sequence even though the answer was already sitting in a register.

The optimizer-bridge now:

- Tracks the value of every `i64.const` it lowers (`known_i64_consts`).
- In the `i64.shr_u` / `i64.shr_s` handler, when the shift amount is a
  known constant whose low 6 bits == 32, skips the runtime shift entirely
  and renames `dest_lo` onto the source's high register. For shr_u the
  hi half is zeroed via a single 16-bit MOV; for shr_s it's
  sign-extended via an ASR #31.
- In the `i64.and` handler, when the right operand is the constant
  `0xFFFFFFFF`, collapses the lo-half AND to a vreg rename and the
  hi-half AND to MOV #0.
- Pre-scans the IR to find constants whose only use is one of the
  folded patterns above (with use-count == 1) and elides the MOV(s)
  that would have loaded the constant into a register.

End-to-end byte-size impact (Cortex-M4F, `(param i64) (result i32)`,
`local.get 0; i64.const 32; i64.shr_u; i32.wrap_i64`):

  before: 48 bytes (38 for the shift + 10 prologue/epilogue)
  after:   8 bytes (`mov r7, #0; mov r1, r7; mov r0, r5; bx lr`)
  saved:  40 bytes per call site (~83%)

Patterns now caught:
- `i64.shr_u 32` → register rename (hi half) + MOV #0
- `i64.shr_s 32` → register rename + ASR #31 (sign-extend)
- `i64.and 0xFFFFFFFF` → register rename (lo half) + MOV #0

Tests:
- `optimizer_bridge::tests::test_issue94_*` — six unit tests covering
  the shr_u / shr_s / and patterns, an IR-level direct test, a non-32
  sanity check that asserts the generic path is preserved, and an
  informational size demo.
- `arm_backend::tests::test_issue94_hi32_extract_is_smaller_than_generic_shift`
  — end-to-end byte-size assertion through the full encoder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
codecov bot commented May 10, 2026

Codecov Report

❌ Patch coverage is 96.01140% with 14 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| `crates/synth-synthesis/src/optimizer_bridge.rs` | 95.55% | 14 Missing ⚠️ |


avrabe merged commit 2a0bb01 into main on May 11, 2026 (9 checks passed).
avrabe deleted the feat/issue-94-u64-packed-extract branch on May 11, 2026 at 03:52.
Closes #94 (u64-packed FFI return: emit register-direct field access instead of generic 64-bit shift extraction).
1 participant