memset/memcpy/memmove i64-codegen produces non-terminating loop on Cortex-M (silicon-blocking) #93

@avrabe

Description

Strategic context — why this matters for ASIL-D

PulseEngine's wasm-IR toolchain (meld → loom → synth) is positioned in the gale documentation, the recent v0.5.0 release notes, and the "Three Quiet Barriers" blog post as the formally-verified construction equivalent of LLVM cross-language LTO. The pitch is concrete: where LLVM-LTO trusts a non-verified C++ optimizer to inline across language boundaries (incompatible with the kind of supply-chain audit ASIL-D / DO-178C want), the meld-loom-synth chain achieves the same inlining property with each pass mechanized in Rocq and the artifact attestable through Sigil.

That pitch needs to clear one bar before any of the higher claims hold: the resulting binary has to boot on the target silicon.

This issue is the smallest concrete instance where it doesn't.

What surfaced this

The gale silicon-anchor protocol (PR pulseengine/gale#40, branch feat/silicon-smart-data, commit a4beaeb) measures the engine_control bench's per-event handoff cycles on real NUCLEO-G474RE silicon, comparing against published Renode CI numbers at the same gale source SHA. The matrix that prompted this issue:

                                          silicon (cyc handoff median, ADC=y)
  baseline (no Gale, native)                 506
  rustc-direct gale (FFI bl preserved)       582  (+76 = the FFI seam cost)
  LLVM cross-language LTO                    558  (recovers 24 of the 76 cyc)
  wasm→synth (gale-ffi only, native shim)    582  (identical to rustc-direct,
                                                   because the C shim never
                                                   enters the wasm bundle —
                                                   the seam stays native)
  wasm-ld merge → synth (BOTH shim+rust)     ?    (this is what we want to
                                                   measure; THIS bug blocks)

The fourth row is the point of the protocol: put both the C shim (zephyr/gale_sem.c::z_impl_k_sem_give) AND the verified Rust (gale-ffi/src/lib.rs::gale_k_sem_give_decide) through wasm-ld, hand the merged module to synth as a single translation unit, and expect synth to inline the call. The disassembly evidence that this structurally works is captured in the silicon-anchor notes — synth's emitted z_impl_k_sem_give body has zero bl gale_k_sem_give_decide instructions, the verified Rust decision logic has been folded into the C function's body. Same property LLVM-LTO delivers.

The synth-emitted body is 138 bytes vs LLVM-LTO's 82 bytes (1.68× larger) for the same inlined logic, but that's a separate and lesser issue — call it #94 below. First the binary has to boot.

It doesn't. PC stays in synth's memset+0x4c for >10s, bouncing between offsets 0x668 and 0x67e in the inner loop, observed via openocd halt; reg pc; resume sampling on the live chip.

What synth produces, in detail

Reproducer (commands run from /tmp/synth-memset-repro, on macOS arm64 with the pulseengine/synth fix/synth-i64-locals-and-frame branch installed):

# 1. gale-ffi compiled to wasm32; brings in compiler_builtins for memset
( cd /Volumes/Home/git/pulseengine/gale-smart-data/ffi && \
  cargo rustc --release --target wasm32-unknown-unknown --crate-type=staticlib )

# 2. minimal wasm shim that forces memset into the link
cat > shim.c <<'C'
#include <stdint.h>
extern void *memset(void *, int, uint32_t);
void test_memset(uint8_t *p, uint32_t n) { memset(p, 0, n); }
C
clang --target=wasm32-unknown-unknown -O2 -nostdlib -c shim.c -o shim.wasm.o

# 3. wasm-ld merge (this works fine, single 1MB core wasm)
wasm-ld --no-entry --export-all --allow-undefined \
  --whole-archive .../libgale_ffi.a --no-whole-archive shim.wasm.o \
  -o merged.wasm

# 4. synth → ARM ET_REL — produces the broken memset
synth compile merged.wasm --target cortex-m4f --all-exports --relocatable -o merged.o

# 5. inspect — note how synth emits memset
arm-zephyr-eabi-objdump -d merged.o | awk '/<memset>:/,/^$/'

The synth-emitted memset is 454 bytes (vs picolibc's ~80) and contains, at offset +0x38..+0x4e, a sequence that does not look like a byte-write loop:

2c648:  d100      bne.n   2c64c
2c64a:  f240 0900 movw    r9, #0
2c64c:  f1b2 0320 subs.w  r3, r2, #32
2c650:  d50a      bpl.n   2c668
2c652:  f1c2 0320 rsb     r3, r2, #32       ; r3 = 32 - r2
2c656:  fa01 f303 lsl.w   r3, r1, r3        ; r3 = r1 << (32 - r2)
2c65a:  fa20 f002 lsr.w   r0, r0, r2        ; r0 = r0 >> r2
2c65e:  ea40 0003 orr.w   r0, r0, r3        ; r0 |= r3
2c662:  fa21 f102 lsr.w   r1, r1, r2        ; r1 = r1 >> r2
2c666:  e002      b.n     2c66c

That's the canonical ARM Thumb-2 implementation of r1:r0 >>= r2 (a 64-bit logical right shift, where r2 holds the shift amount in bits, not a byte counter). A standard Cortex-M memset has none of those instructions — it's strb / stmia against a byte-counter loop. The fact that this code-shape appears at all suggests synth's wasm→ARM lowering is mistakenly applying its i64-shift template to memset's inner loop.

Why I think this is i64-codegen-related

compiler_builtins::mem::memset (Rust) compiles to wasm as something like:

(func $memset (param $dst i32) (param $val i32) (param $len i32)
  (local $i i64)              ;; the fast-path uses 8-byte writes
  (local.set $i (i64.const 0))
  (loop $L
    (if (i64.lt_u (local.get $i) (i64.extend_i32_u (local.get $len)))
      (then ... 8-byte i64.store ... i64.add counter increment ... (br $L)))))

The fast-path uses i64 stores (8-byte aligned writes). The termination predicate compares an i64 counter against i64.extend_i32_u (len). When synth lowers this, it has to emit:

  • An i64 add for the counter increment
  • An i64 less-than comparison
  • A branch on the result

If synth's i64 codegen template is shared between u64-data-shifts and i64-counter-loops, it would emit the shift sequence above for what should be a 64-bit comparison. The branch would then resolve incorrectly and the loop would never terminate.

This is the same root pattern as the u64-packed-return codegen issue (#94 in this tracker after this one is filed) — synth's i64 lowering is incomplete in a way that's invisible to structural tests but devastating at runtime.

Workarounds attempted on the gale silicon-anchor branch (none stick)

  1. objcopy --weaken-symbol=memset on merged.o, hoping libc's strong memset wins → didn't work: the weak resolution didn't override; ld picked synth's bytes.
  2. objcopy --redefine-sym memset=__synth_memset → didn't work: the C symbol got renamed, but the mangled Rust symbol _ZN17compiler_builtins3mem6memset17h...E remained, and the final ELF still resolves memset to synth's broken code at the same address.
  3. objcopy --strip-symbol=memset → didn't work: this removes the symbol-table entry but leaves the bytes; ld places the broken code anyway, and it is still reached through another reference.
  4. ld --allow-multiple-definition with libc.a's correct memset added → didn't work: ld picks the first definition, which is synth's because merged.o appears earlier in the link order.

So the only real path forward is fixing synth's i64-codegen-applied-to-byte-counter-loops in the backend. None of the post-link tools can paper over a compiler bug that produces non-terminating code.

Recommended fix and proof obligation

Two parts. Fix the codegen, and add the test that would have caught this.

Codegen. Synth's i64 lowering needs to distinguish between three cases:

  1. i64 as data (e.g. u64-packed return values) — current template applies
  2. i64 as counter (e.g. memset's len extended to i64) — needs different lowering, probably zero-extension into r0:r1 with a 64-bit subtract or i32-only loop if len <= UINT32_MAX
  3. i64 as offset (e.g. memory64 addressing) — separate concern

The simplest fix is probably to detect the "i64 derived from i64.extend_i32_u" pattern at the wasm level and lower the loop using the original i32 counter, never materializing the i64 form. This matches what LLVM does in this position.

Proof obligation. This bug exists because there's no end-to-end test that would have caught it (per #74 — zero semantic tests verify compiled output). The right framing per #76 is the validator pattern: the codegen can be heuristic, but the validator checks each emitted memset invocation against the algebraic property ∀ p, v, n: ∀ i ∈ [0, n): (memset(p, v, n))[i] = v. That's tractable in Rocq even for i64-counter loops, and the validator would have rejected synth's emitted bytes.

Without semantic tests + a validator, the silicon-anchor protocol is the only thing standing between this bug and a customer running into it on real hardware. That's not where we want this kind of regression to surface.

Cross-references

Severity rationale

I marked this bug rather than enhancement because it produces a binary that does not satisfy the implementation's stated semantics (memset's contract). On real silicon it manifests as a chip that boots into a never-ending loop and silently fails — which is the worst class of failure for a tool that's supposed to be the verified-construction path.
