memset/memcpy/memmove i64-codegen produces non-terminating loop on Cortex-M (silicon-blocking) #93

@avrabe

Description

Strategic context — why this matters for ASIL-D

PulseEngine's wasm-IR toolchain (meld → loom → synth) is positioned in the gale documentation, the recent v0.5.0 release notes, and the "Three Quiet Barriers" blog post as the formally-verified construction equivalent of LLVM cross-language LTO. The pitch is concrete: where LLVM-LTO trusts a non-verified C++ optimizer to inline across language boundaries (incompatible with the kind of supply-chain audit ASIL-D / DO-178C want), the meld-loom-synth chain achieves the same inlining property with each pass mechanized in Rocq and the artifact attestable through Sigil.

That pitch needs to clear one bar before any of the higher claims hold: the resulting binary has to boot on the target silicon.

This issue is the smallest concrete instance where it doesn't.

What surfaced this

The gale silicon-anchor protocol (PR pulseengine/gale#40, branch feat/silicon-smart-data, commit a4beaeb) measures the engine_control bench's per-event handoff cycles on real NUCLEO-G474RE silicon, comparing against published Renode CI numbers at the same gale source SHA. The matrix that prompted this issue:

                                          silicon (cyc handoff median, ADC=y)
  baseline (no Gale, native)                 506
  rustc-direct gale (FFI bl preserved)       582  (+76 = the FFI seam cost)
  LLVM cross-language LTO                    558  (recovers 24 of the 76 cyc)
  wasm→synth (gale-ffi only, native shim)    582  (identical to rustc-direct,
                                                   because the C shim never
                                                   enters the wasm bundle —
                                                   the seam stays native)
  wasm-ld merge → synth (BOTH shim+rust)     ?    (this is what we want to
                                                   measure; THIS bug blocks)

The fourth row is the point of the protocol: put both the C shim (zephyr/gale_sem.c::z_impl_k_sem_give) AND the verified Rust (gale-ffi/src/lib.rs::gale_k_sem_give_decide) through wasm-ld, hand the merged module to synth as a single translation unit, and expect synth to inline the call. The disassembly evidence that this structurally works is captured in the silicon-anchor notes — synth's emitted z_impl_k_sem_give body has zero bl gale_k_sem_give_decide instructions, the verified Rust decision logic has been folded into the C function's body. Same property LLVM-LTO delivers.

The synth-emitted body is 138 bytes vs LLVM-LTO's 82 bytes (1.68× larger) for the same inlined logic, but that's a separate and lesser issue — call it #94 below. First the binary has to boot.

It doesn't. PC stays in synth's memset+0x4c for >10s, bouncing between offsets 0x668 and 0x67e in the inner loop, observed via openocd halt; reg pc; resume sampling on the live chip.

What synth produces, in detail

Reproducer (commands run from /tmp/synth-memset-repro, on macOS arm64 with the pulseengine/synth fix/synth-i64-locals-and-frame branch installed):

# 1. gale-ffi compiled to wasm32; brings in compiler_builtins for memset
( cd /Volumes/Home/git/pulseengine/gale-smart-data/ffi && \
  cargo rustc --release --target wasm32-unknown-unknown --crate-type=staticlib )

# 2. minimal wasm shim that forces memset into the link
cat > shim.c <<'C'
#include <stdint.h>
extern void *memset(void *, int, uint32_t);
void test_memset(uint8_t *p, uint32_t n) { memset(p, 0, n); }
C
clang --target=wasm32-unknown-unknown -O2 -nostdlib -c shim.c -o shim.wasm.o

# 3. wasm-ld merge (this works fine, single 1MB core wasm)
wasm-ld --no-entry --export-all --allow-undefined \
  --whole-archive .../libgale_ffi.a --no-whole-archive shim.wasm.o \
  -o merged.wasm

# 4. synth → ARM ET_REL — produces the broken memset
synth compile merged.wasm --target cortex-m4f --all-exports --relocatable -o merged.o

# 5. inspect — note how synth emits memset
arm-zephyr-eabi-objdump -d merged.o | awk '/<memset>:/,/^$/'

The synth-emitted memset is 454 bytes (vs picolibc's ~80) and contains, at offset +0x38..+0x4e, a sequence that does not look like a byte-write loop:

2c648:  d100      bne.n   2c64c
2c64a:  f240 0900 movw    r9, #0
2c64c:  f1b2 0320 subs.w  r3, r2, #32
2c650:  d50a      bpl.n   2c668
2c652:  f1c2 0320 rsb     r3, r2, #32       ; r3 = 32 - r2
2c656:  fa01 f303 lsl.w   r3, r1, r3        ; r3 = r1 << (32 - r2)
2c65a:  fa20 f002 lsr.w   r0, r0, r2        ; r0 = r0 >> r2
2c65e:  ea40 0003 orr.w   r0, r0, r3        ; r0 |= r3
2c662:  fa21 f102 lsr.w   r1, r1, r2        ; r1 = r1 >> r2
2c666:  e002      b.n     2c66c

That's the canonical ARM Thumb-2 implementation of r1:r0 >>= r2 (a 64-bit logical right shift, where r2 holds the shift amount in bits, not a byte counter). A standard Cortex-M memset has none of those instructions — it's strb / stmia against a byte-counter loop. The fact that this code-shape appears at all suggests synth's wasm→ARM lowering is mistakenly applying its i64-shift template to memset's inner loop.

Why I think this is i64-codegen-related

compiler_builtins::mem::memset (Rust) compiles to wasm as something like:

(func $memset (param $dst i32) (param $val i32) (param $len i32)
  (local $i i64)              ;; the fast-path uses 8-byte writes
  (local.set $i (i64.const 0))
  (loop $L
    (if (i64.lt_u (local.get $i) (i64.extend_i32_u (local.get $len)))
      (then ... 8-byte i64.store ... i64.add counter increment ... (br $L)))))

The fast-path uses i64 stores (8-byte aligned writes). The termination predicate compares an i64 counter against i64.extend_i32_u (len). When synth lowers this, it has to emit:

  • An i64 add for the counter increment
  • An i64 less-than comparison
  • A branch on the result

If synth's i64 codegen template is shared between u64-data-shifts and i64-counter-loops, it would emit the shift sequence above for what should be a 64-bit comparison. The branch would then resolve incorrectly and the loop would never terminate.

This is the same root pattern as the u64-packed-return codegen issue (#94 in this tracker after this one is filed) — synth's i64 lowering is incomplete in a way that's invisible to structural tests but devastating at runtime.

Workarounds attempted on the gale silicon-anchor branch (none stick)

  1. objcopy --weaken-symbol=memset on merged.o, hoping libc's strong memset wins → didn't work: the weak resolution didn't override; ld picked synth's bytes.
  2. objcopy --redefine-sym memset=__synth_memset → didn't work: the C symbol got renamed, but the mangled Rust symbol _ZN17compiler_builtins3mem6memset17h...E remained, and the final ELF still resolves memset to synth's broken code at the same address.
  3. objcopy --strip-symbol=memset → didn't work: this removes the symbol-table entry but leaves the bytes; ld places the broken code anyway, and it is still reached through another reference.
  4. ld --allow-multiple-definition with libc.a's correct memset added → didn't work: ld picks the first definition, which is synth's because merged.o appears earlier in the link order.

So the only real path forward is fixing synth's i64-codegen-applied-to-byte-counter-loops in the backend. None of the post-link tools can paper over a compiler bug that produces non-terminating code.

Recommended fix and proof obligation

Two parts. Fix the codegen, and add the test that would have caught this.

Codegen. Synth's i64 lowering needs to distinguish between three cases:

  1. i64 as data (e.g. u64-packed return values) — current template applies
  2. i64 as counter (e.g. memset's len extended to i64) — needs different lowering, probably zero-extension into r0:r1 with a 64-bit subtract or i32-only loop if len <= UINT32_MAX
  3. i64 as offset (e.g. memory64 addressing) — separate concern

The simplest fix is probably to detect the "i64 derived from i64.extend_i32_u" pattern at the wasm level and lower the loop using the original i32 counter, never materializing the i64 form. This matches what LLVM does in this position.

Proof obligation. This bug exists because there's no end-to-end test that would have caught it (per #74 — zero semantic tests verify compiled output). The right framing per #76 is the validator pattern: the codegen can be heuristic, but the validator checks each emitted memset invocation against the algebraic property ∀ p, v, n: ∀ i ∈ [0, n): (memset(p, v, n))[i] = v. That's tractable in Rocq even for i64-counter loops, and the validator would have rejected synth's emitted bytes.

Without semantic tests + a validator, the silicon-anchor protocol is the only thing standing between this bug and a customer running into it on real hardware. That's not where we want this kind of regression to surface.

Cross-references

Severity rationale

I marked this bug rather than enhancement because it produces a binary that does not satisfy the implementation's stated semantics (memset's contract). On real silicon it manifests as a chip that boots into a never-ending loop and silently fails — which is the worst class of failure for a tool that's supposed to be the verified-construction path.
