silicon: smart-data emission — DWT counters + STM32G4 MCU health #40

Open
avrabe wants to merge 21 commits into main from feat/silicon-smart-data

Contributor

@avrabe avrabe commented May 9, 2026

Stacked on #37

Base branch is `feat/silicon-anchor-nucleo-g474re` (PR #37). Once #37 lands, GitHub will retarget this to main.

Summary

Extends `benches/engine_control` to emit two new CSV row types so silicon-anchor captures carry the smart-data the protocol now requires:

```
D,<at>,<cyccnt>,<cpicnt>,<exccnt>,<sleepcnt>,<lsucnt>,<foldcnt>
H,<at>,<temp_mC>,<vref_mV>,<vbat_mV>
```

…with `<at>` ∈ {`boot`, `step_<N>`, `end`}. Snapshots happen at boot, at each RPM-step boundary (after the existing drain wait, when the ISR is quiescent, so there is no interleaving with E rows), and at sweep end.

Why

A 4-run anchor matrix (variant × tick_source from #37) measures the silicon / renode multiplier. Smart data lets the analyzer further discriminate why a particular cycle count is what it is:

  • DWT counters — CPI overhead, exception cycles, sleep cycles, load/store stalls, folded-instruction count. Tells you whether the cost is "real arithmetic" or "the pipeline stalled / serviced exceptions / went to sleep."
  • MCU health — die temperature + VREFINT (corrected supply estimate). Tells you whether a measurement was taken at +25 °C on a fresh chip or at +85 °C with a drooping regulator. If you don't record this you can't rule it out.

What's in the PR

| File | What |
| --- | --- |
| `src/smart_dwt.{h,c}` | ARMv7-M DWT counter API. Direct MMIO at architecture-defined addresses so it's not coupled to Zephyr's CMSIS bundle. Works on M3/M4/M7/M33; on QEMU/Renode without DWT modelling, all reads = 0 and the analyzer treats it as "not modelled here." |
| `src/smart_mcu.h` | Vendor-neutral MCU-health interface. |
| `src/smart_mcu_g4.c` | STM32G4 backend via Zephyr ADC API. Reads ADC1 ch16 (TS) + ch18 (VREFINT), converts via factory calibration ROM at `0x1FFF75A8`/`0x1FFF75CA`/`0x1FFF75AA` per RM0440 §3.7.1, applies VREFINT correction to the TS reading per RM0440 §21.4.32. VBAT not wired on Nucleo → reported as 0. |
| `src/smart_mcu_stub.c` | Selected on every non-G4 target. All-zero readings + a one-time `# H ...: not available` banner so the CSV format stays uniform. |
| `boards/nucleo_g474re.overlay` | DT overlay enabling ADC1 with the two internal channel nodes. Auto-picked up by Zephyr's `west build -b nucleo_g474re` convention. |
| `CMakeLists.txt` | Adds `smart_dwt.c` unconditionally; conditional `smart_mcu_{g4,stub}.c` selection on `CONFIG_SOC_SERIES_STM32G4X`. |
| `src/main.c` | DWT enable + MCU init in `main()`; boot snapshot at end of `print_csv_header()`; per-step snapshots after the step-completion marker; end snapshot at start of `print_csv_footer()`. |
| `tag_events.py` | Passes D/H rows through with the same `R<run>,<variant>` prefix as E rows so a future analyzer extension can join them per-run. |

`analyze.py` is intentionally unchanged: D and H rows are silently skipped by its existing E-row-only ingest (`R<run>,<variant>,E`), so reports are unaffected. Per-step CPI / exception aggregation and a thermal sanity check are a follow-up once first captures exist to test against.
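
The `tag_events.py` pass-through can be pictured roughly as follows. This is a minimal sketch, not the script's actual code: the function name `tag_line`, the choice to drop `#` comment lines, and the exact `R<run>,<variant>,` prefix layout are assumptions drawn from the description in this PR.

```python
from typing import Optional

# Row types forwarded to events.csv (assumption: everything else is dropped).
TAGGED_KINDS = ("E", "D", "H")


def tag_line(line: str, run: int, variant: str) -> Optional[str]:
    """Prefix E, D and H rows with R<run>,<variant>; return None otherwise."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # banners and markers are not data rows
    kind = line.split(",", 1)[0]
    if kind not in TAGGED_KINDS:
        return None
    return f"R{run},{variant},{line}"


# Hypothetical usage with a made-up D row:
print(tag_line("D,boot,1,2,3,4,5,6", run=1, variant="gale"))
```

Because D and H rows carry the same prefix as E rows, an ingest that only accepts `R<run>,<variant>,E` lines keeps ignoring them until it is extended.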

Test plan

  • Smoke build for `qemu_cortex_m3` (stub backend path) and confirm CSV row format unchanged for E rows; D rows emit as zeros; H stub banner appears.
  • Smoke build for `nucleo_g474re` once Zephyr SDK is set up locally; confirm board-overlay picks up cleanly and no Kconfig surprises.
  • Real-board smoke: `bash benches/engine_control/silicon/capture.sh --board nucleo_g474re --variant baseline --tick-source systick` produces a manifest with non-zero `temp_mC` and `vref_mV` near 3000 mV.
  • First publication-grade 4-run capture: confirm D-row CYCCNT delta (boot → end) is consistent with the wall-clock `captured_at` window at 170 MHz.
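
For the last item, one way to sanity-check a boot → end CYCCNT delta is sketched below. It relies on one architectural fact, that DWT_CYCCNT is a 32-bit counter, so at 170 MHz it wraps about every 25 s and a long capture may wrap more than once. The helper names and the 1 s tolerance are assumptions for illustration, not part of the bench.

```python
CPU_HZ = 170_000_000      # core clock quoted in the test plan
CYCCNT_MODULUS = 1 << 32  # DWT_CYCCNT is a 32-bit free-running counter


def cyccnt_delta(start: int, end: int) -> int:
    """Cycles elapsed between two CYCCNT reads, modulo one counter wrap."""
    return (end - start) % CYCCNT_MODULUS


def candidate_durations_s(start, end, max_wraps, cpu_hz=CPU_HZ):
    """Durations the two reads could represent for 0..max_wraps wraps."""
    base = cyccnt_delta(start, end)
    return [(base + k * CYCCNT_MODULUS) / cpu_hz for k in range(max_wraps + 1)]


def consistent_with_wall_clock(start, end, wall_s, tol_s=1.0, cpu_hz=CPU_HZ):
    """True if some wrap count puts the CYCCNT delta within tol_s of wall_s."""
    max_wraps = int(wall_s * cpu_hz // CYCCNT_MODULUS) + 1
    return any(abs(d - wall_s) <= tol_s
               for d in candidate_durations_s(start, end, max_wraps, cpu_hz))


# Made-up example: a 30 s run wraps once, so the raw delta looks like ~4.7 s.
end = (30 * CPU_HZ) % CYCCNT_MODULUS
print(cyccnt_delta(0, end) / CPU_HZ, consistent_with_wall_clock(0, end, 30.0))
```

The check is only as tight as the tolerance: any wall-clock value that differs from the true one by a whole number of wrap periods would also pass.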

🤖 Generated with Claude Code

avrabe and others added 4 commits May 3, 2026 21:03
CI = Renode (deterministic, parallel-safe). Silicon captures are
manual, periodic, and shared across one board per architecture.
Recorded captures live in the repo as immutable evidence, citeable
from any blog post via stable git URLs. This commit is the
scaffolding — protocol doc, build wrapper, board overlay, capture
script — that makes a silicon capture a flash-and-go operation
the moment hardware is in hand.

Files:

  silicon/README.md
    Protocol: why we silicon-anchor, the recorded-run-in-git
    convention, the capture procedure for the NUCLEO-G474RE, the
    comparison workflow against Renode CI, anchor cadence, and
    the don't-do-this list (overwriting, mixing pre/post-overhead-
    compensation captures, claiming WCET).

  silicon/capture.sh
    Build + flash + capture + tag + manifest, in one invocation.
    --board nucleo_g474re --variant {baseline,gale} [--sweep ...].
    Auto-detects the serial port on macOS / Linux. Refuses to
    overwrite an existing dated dir.

  silicon/capture.py
    Cross-platform pyserial UART capture. Reads until '=== END ===',
    times out at the wall clock, writes the raw stream to a file.

  silicon/boards/nucleo_g474re/{README.md,prj.conf}
    Board notes + (currently empty) Kconfig overlay. Cortex-M4F + FPU
    @ 170 MHz, ST-Link/V3E with VCP at 115200, DWT_CYCCNT works
    identically to stm32f4_disco. Closest production-shape silicon
    to our existing Renode target.

  silicon/runs/.gitkeep
    Placeholder; first dated capture goes in here.
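
The read-until-sentinel behaviour described for silicon/capture.py can be sketched as below. This is an illustration under assumptions, not the script itself: the function name and signature are hypothetical, and `port` is any object with a `readline()` method (a pyserial `Serial` opened with a read timeout qualifies; `io.BytesIO` stands in for it here).

```python
import io
import time

SENTINEL = b"=== END ==="


def capture_until_sentinel(port, out, timeout_s: float) -> bool:
    """Copy lines from port to out until the sentinel or the wall-clock deadline.

    Returns True if the sentinel was seen, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        line = port.readline()  # b"" when nothing arrived (read timeout / EOF)
        if not line:
            continue
        out.write(line)         # raw stream is kept verbatim
        if SENTINEL in line:
            return True
    return False


# Stand-in for a UART: three lines, the second one is the sentinel.
fake_uart = io.BytesIO(b"E,1,2\n=== END ===\nignored\n")
raw = io.BytesIO()
print(capture_until_sentinel(fake_uart, raw, timeout_s=1.0), raw.getvalue())
```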

Each captured run will commit:
  - output.csv (raw firmware UART)
  - events.csv (tagged through tag_events.py)
  - firmware.elf + firmware.elf.sha256
  - manifest.txt (board, MCU, gale_sha, rustc, west, zephyr_sha,
    ELF sha256, capture timestamp, port, timeout)

Manual flow only — no CI changes. README updated to point at
silicon/ from the methodology section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-on fixups on the silicon-anchor capture wrapper, surfaced while
preparing first-capture for the NUCLEO-G474RE on macOS:

1. --help printed nothing on macOS. The sed extractor used GNU-only `\?`
   for "0 or 1 space"; on BSD sed the pattern is treated as a literal `?`
   and never matches. Replaced with a portable awk one-liner that also
   skips the shebang line.

2. The manifest's `csv_sha256:` line had `| awk '{print $1}'` outside
   the `$(...)` command-substitution, so the manifest got the literal
   pipeline text instead of the hash. Wrapped the `||` group in
   `{ ...; }` so the pipe applies to either branch.

Both are cosmetic, but they block automated parsing of the manifest and
discovery of the script's own usage.
A publication-grade silicon-anchor capture is the matrix
  variant ∈ {baseline, gale} × tick_source ∈ {systick, lptim}
not just two variants — LPTIM has different jitter and ISR-overhead
characteristics than the Cortex-M default SysTick, so the silicon /
renode multiplier must be reported per tick_source to be meaningful.

Changes:

- capture.sh
  - new --tick-source {systick,lptim} flag (default: systick)
  - OVERLAY_CONFIG composed from up to 3 ordered layers:
      1. gale overlay (when --variant gale)
      2. board silicon overlay (silicon/boards/<board>/prj.conf)
      3. tick-source overlay (silicon/boards/<board>/prj-tick-<src>.conf)
  - tick_source embedded in BUILD_DIR and RUN_DIR so 4 runs don't collide
  - manifest gains `tick_source:` field
  - summary block + post-capture commit hint reflect the 4-run protocol

- silicon/boards/nucleo_g474re/prj-tick-lptim.conf
  - new overlay enabling STM32_LPTIM_TIMER and disabling CORTEX_M_SYSTICK
  - documented clock-source caveat: LSE-clocked LPTIM cannot sustain
    the bench's 100 kHz tick; a DT overlay layering LPTIM1 onto PCLK1
    is needed for apples-to-apples vs SysTick — flagged in the board
    README as a follow-up

- silicon/README.md
  - run-dir naming now includes tick_source
  - capture procedure shows the 4-run loop
  - smoke-run instruction added (drop --sweep long, omit --tick-source)
  - commit hint updated to grab all 4 dirs at once

- silicon/boards/nucleo_g474re/README.md
  - new "Kernel tick sources" section with the per-source overlay table
    and the LPTIM clock-source caveat

No firmware code touched (still consistent with PR #37's stated scope).
Smart-data emission (DWT counters + STM32 self-monitoring) is the
follow-up PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends the engine_control bench's CSV stream with two new row types
so silicon-anchor captures record *why* a measured cycle count is
what it is, not just the cycle count alone:

  D,<at>,<cyccnt>,<cpicnt>,<exccnt>,<sleepcnt>,<lsucnt>,<foldcnt>
  H,<at>,<temp_mC>,<vref_mV>,<vbat_mV>

…where <at> ∈ {boot, step_<N>, end}. Snapshots are taken at boot,
at each RPM-step boundary (after the existing drain wait, while the
ISR is quiescent, so they cannot interleave with E rows), and at
the end of the sweep.
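
A host-side parser for the two row shapes might look like the sketch below. The field names follow the row templates above; the type names, the function name, and the assumption that every counter is printed as a decimal integer are illustrative rather than taken from the firmware.

```python
from typing import NamedTuple, Optional, Union


class DwtRow(NamedTuple):
    at: str
    cyccnt: int
    cpicnt: int
    exccnt: int
    sleepcnt: int
    lsucnt: int
    foldcnt: int


class HealthRow(NamedTuple):
    at: str
    temp_mC: int
    vref_mV: int
    vbat_mV: int


def parse_smart_row(line: str) -> Optional[Union[DwtRow, HealthRow]]:
    """Parse one untagged D or H row; return None for any other line."""
    fields = line.strip().split(",")
    if fields[0] == "D" and len(fields) == 8:
        return DwtRow(fields[1], *(int(f) for f in fields[2:]))
    if fields[0] == "H" and len(fields) == 5:
        return HealthRow(fields[1], *(int(f) for f in fields[2:]))
    return None


# Made-up rows, for shape only:
print(parse_smart_row("D,step_3,100,1,2,3,4,5"))
print(parse_smart_row("H,boot,25000,3260,0"))
```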

Why this matters for the anchor.

Renode is per-translated-block instruction-cost simulation, not
microarchitectural simulation. The silicon / renode multiplier
established by the silicon anchor isolates "Renode is X% off
real silicon"; the smart-data rows let the analyzer further
discriminate "the runtime cost is real" from "the runtime cost is
a microarchitectural artefact" (CPI overhead, exception cost,
load/store stalls, sleep cycles, fold overhead) and from an
environmental anomaly (thermal, supply-voltage drift).

Files added.

  src/smart_dwt.h, src/smart_dwt.c
    ARMv7-M DWT-counter API. Direct MMIO at architecture-defined
    addresses (0xE0001000…) so we don't depend on which CMSIS
    bundle Zephyr ships. Works on M3/M4/M7/M33; on simulators
    that don't model DWT (qemu_cortex_m3) reads return 0 and the
    analyzer treats all-zero D rows as "DWT not modelled here".

  src/smart_mcu.h
    Vendor-neutral MCU-health interface (init / snapshot / emit).
    Each backend reports temp / VREFINT / VBAT in fixed-shape rows.
    Backends emit a one-time `# H ...: not available on this target`
    banner at boot for any unavailable field so a captured CSV is
    self-documenting.

  src/smart_mcu_g4.c
    STM32G4 backend using Zephyr's ADC API. Reads ADC1 channels
    16 (temperature sensor) and 18 (VREFINT), then converts using
    factory calibration ROM at 0x1FFF75A8 / 0x1FFF75CA / 0x1FFF75AA
    per RM0440 §3.7.1. VREFINT-corrected temperature formula per
    RM0440 §21.4.32. VBAT pin not wired on Nucleo, reported as 0.

  src/smart_mcu_stub.c
    Returns zeros + emits the "not available on this target" comment.
    Selected by CMakeLists for any non-G4 target so the CSV row
    format stays uniform across boards.

  boards/nucleo_g474re.overlay
    Enables ADC1 with the two internal channels; otherwise the G4
    backend can't open the device. Auto-picked up by Zephyr's
    `west build -b nucleo_g474re` board-overlay convention.
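
The VREFINT-corrected temperature conversion that smart_mcu_g4.c is described as doing has roughly the shape below. This is a sketch of the usual two-point STM32 calibration scheme, not the backend's code: the 3000 mV calibration supply and the 30 °C / 130 °C calibration temperatures are assumptions that should be checked against the part's datasheet, and the calibration words in the example are made up.

```python
CAL_VDDA_MV = 3000    # supply voltage during factory calibration (assumed)
TS_CAL1_TEMP_C = 30   # temperature of the first calibration point (assumed)
TS_CAL2_TEMP_C = 130  # temperature of the second calibration point (assumed)


def vdda_mv(vrefint_raw: int, vrefint_cal: int) -> int:
    """Estimate the analog supply from the VREFINT reading and its cal word."""
    return CAL_VDDA_MV * vrefint_cal // vrefint_raw


def temp_mc(ts_raw: int, vdda: int, ts_cal1: int, ts_cal2: int) -> int:
    """Die temperature in milli-degrees C from a raw temperature-sensor count."""
    # Rescale the reading to what it would have been at the calibration supply.
    ts_at_cal_supply = ts_raw * vdda / CAL_VDDA_MV
    slope_mc_per_count = ((TS_CAL2_TEMP_C - TS_CAL1_TEMP_C) * 1000
                          / (ts_cal2 - ts_cal1))
    return round(slope_mc_per_count * (ts_at_cal_supply - ts_cal1)
                 + TS_CAL1_TEMP_C * 1000)


# Made-up calibration words and readings:
supply = vdda_mv(vrefint_raw=1500, vrefint_cal=1650)
print(supply, temp_mc(ts_raw=1000, vdda=supply, ts_cal1=1000, ts_cal2=1400))
```

Without the supply correction, a reading taken at a VDDA other than the calibration voltage would be interpreted against the wrong scale.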

Files changed.

  CMakeLists.txt — adds smart_dwt.c unconditionally; conditional
    smart_mcu_{g4,stub}.c selection on CONFIG_SOC_SERIES_STM32G4X.

  src/main.c — bring up DWT + MCU at start of main(); emit boot
    snapshot at end of print_csv_header; per-step snapshots after
    the existing step-completion marker; end snapshot at start
    of print_csv_footer. tag string allocated on stack
    (snprintf into 16-byte buf, "step_NN" fits).

  tag_events.py — passes through D and H rows with the same
    R<run>,<variant> prefix as E rows so analyze.py can join
    smart-data against per-run samples in a future extension.

analyze.py is intentionally unchanged in this PR — D and H rows
are silently skipped by the existing R<run>,<variant>,E-only
ingest, so existing reports are unaffected. A follow-up will add
per-step CPI / exception-cycle aggregation and a temperature
sanity check ("did the chip get hotter than expected?").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@avrabe avrabe changed the base branch from feat/silicon-anchor-nucleo-g474re to main May 9, 2026 17:57
avrabe and others added 2 commits May 9, 2026 20:17
Empty commit to fire pull_request:synchronize so the zephyr-tests +
LLVM-LTO + Verus pipelines run against this branch. Retargeting
the PR base from feat/silicon-anchor-nucleo-g474re → main on
GitHub doesn't emit a synchronize event, so CI stayed dark
despite the diff being valid against main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Smoke build of the smart-data branch on real Zephyr SDK 1.0.1 +
arm-zephyr-eabi-gcc 14.3.0 fails at link time:

  /tmp/.../smart_mcu_g4.c:106:(.text.smart_mcu_init+0x8c):
    undefined reference to `__device_dts_ord_10'

The DT overlay (boards/nucleo_g474re.overlay) correctly sets
adc1 status=okay and adds the two internal channels (TS=ch16,
VREFINT=ch18). But `# CONFIG_ADC is not set` in the generated
.config — the stm32-adc driver isn't compiled, so DEVICE_DT_GET
on adc1 doesn't resolve.

Zephyr's `boards/<board>.conf` is the standard place to layer
Kconfig the same way `boards/<board>.overlay` layers DT. Adding
CONFIG_ADC=y here fixes the link without disturbing other targets
(the file is only picked up when BOARD=nucleo_g474re).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented May 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


avrabe and others added 15 commits May 9, 2026 20:47
…CK=n alone

Local smoke build of the silicon-anchor scaffolding on real Zephyr SDK
1.0.1 + arm-zephyr-eabi-gcc 14.3.0 against the actual Zephyr workspace
revealed the original `prj-tick-lptim.conf` doesn't actually switch the
kernel tick to LPTIM. Both `baseline/lptim` and `gale/lptim` built
configurations failed to link with:

  zephyr/kernel/libkernel.a(timeout.c.obj): in function `elapsed':
    timeout.c:70: undefined reference to `sys_clock_elapsed'
  zephyr/kernel/libkernel.a(busy_wait.c.obj):
    misc.h:26: undefined reference to `sys_clock_cycle_get_32'

…meaning *no* tick driver was being compiled in. Setting
`CONFIG_STM32_LPTIM_TIMER=y` was being silently ignored by Kconfig
because of unmet dependencies in
`zephyr/drivers/timer/Kconfig.stm32_lptim`:

  depends on dt_nodelabel_exists(stm32_lp_tick_source)  ← OK on G4
  depends on DT_HAS_ST_STM32_LPTIM_ENABLED              ← OK on G4
  depends on CLOCK_CONTROL && PM                        ← MISSING
  select TICKLESS_CAPABLE

Upstream `nucleo_g474re.dts` already labels `&lptim1` as the
`stm32_lp_tick_source` and sets `status="okay"` with LSI clocks, so the
DT side is fine — the only piece missing was `CONFIG_PM=y`, which lets
`STM32_LPTIM_TIMER`'s `default y` fire and the driver source actually
compile.

Replaces `CONFIG_STM32_LPTIM_TIMER=y` (redundant once PM enables it via
default) with `CONFIG_PM=y`. Keeps `CONFIG_CORTEX_M_SYSTICK=n` so the
SysTick driver doesn't compile in parallel and race with LPTIM for the
system-clock-driver init slot. Comment block reframed to explain the
real Kconfig dependency chain rather than the speculative DT-overlay
caveat.

Verified locally: all 4 variants (baseline/gale × systick/lptim) now
link cleanly. The lptim variant carries the PM subsystem (~120 KB
ELF growth, 1% extra flash, ~600 B extra RAM) — that's the cost of
using LPTIM as the kernel tick on this part.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… samples

Smoking out the bench on real STM32G474RE hardware exposed two bugs in
the long-sweep code path that QEMU and Renode never hit because the
simulators don't have UART back-pressure:

1. reader_loop's `K_FOREVER` hangs at end-of-sweep when count plateaus
   below TOTAL_SAMPLES due to ISR-side ring drops at high RPM. The
   sweep_driver thread runs to completion (all 13 RPM steps fire), but
   reader_loop is stuck in `k_sem_take(&data_ready, K_FOREVER)` waiting
   for samples that will never arrive — print_csv_footer is never
   called, "=== END ===" sentinel never emitted, capture.py times out.

2. Per-step drain `while (count < target && count < g_interrupts)`
   tries to wait for `count` (UART-emitted events) to reach `target`
   (sweep_step's expected sample count). At 8000-10000 RPM the ring
   fills faster than the reader can drain it, ISRs drop samples,
   `count` plateaus below target, drain hangs 30s and bails. Cumulative
   wasted wall time on a long sweep: 13 steps × 30s = 6.5 minutes.

Fixes:

  * New `static volatile bool g_sweep_done` flag. sweep_driver sets it
    after its (also-fixed) final drain. reader_loop polls it via 500 ms
    `k_sem_take` timeout and exits cleanly even with drops.

  * Both per-step drain and final drain switch from
    `count < target && count < g_interrupts` to
    `ring_buf_size_get(&sample_ring) >= sizeof(struct crank_sample)` —
    the only thing actually relevant is "is the ring drained" (i.e.
    has the reader caught up with what was queued); `count` can never
    reach target once drops happen.

Verified on hardware (STM32G474RE @ 170 MHz): long sweep with ~58%
drops at 10kHz tick now finishes in ~10 seconds wall time, emits
"=== END ===" cleanly, capture.py terminates with exit 0. drops
counter (g_drops) records the actual loss for the analyzer to use.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eset

Two fixes surfaced when running capture.sh on the bench for real:

1. west flash on nucleo_g474re defaults to the stm32cubeprogrammer
   runner, which requires ST's proprietary STM32CubeProgrammer.app
   that most Linux/macOS dev setups don't have installed. The board
   also configures the openocd runner (which is brew-installable on
   macOS, package-managed on Linux), but it's not the default.
   Add a --runner flag to capture.sh, default openocd, with
   pass-through to `west flash`. Include the choice in the manifest.

2. Even with the openocd runner, west flash via Zephyr 4.4.0-rc3 on
   STM32G4 + CONFIG_PM=y leaves the chip *halted* after writing the
   image — no implicit reset+run is issued, so the firmware never
   starts and the UART stays silent. Add an explicit
     openocd init reset run sleep 200 exit
   step between flash and the serial capture. NB: do NOT pipe openocd
   through head/grep — SIGPIPE on early close kills openocd before
   it processes `reset run`, leaving the chip halted just the same.
   Capture full openocd output to /tmp/silicon-reset-<board>.log
   instead, with a 0.5s grace before opening the serial port so the
   sentinel-search window aligns cleanly with the bench's CSV stream.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First publication-grade silicon-anchor capture for engine_control on
real STM32G474RE hardware (170 MHz Cortex-M4F).

Captured at gale@06515098. Two of the protocol's planned 4-run matrix:
  variant=baseline, tick_source=systick — 3331 events, 4393 drops, 7724 ISR fires
  variant=gale,     tick_source=systick — 3279 events, 4393 drops, 7724 ISR fires

Each run carries:
  output.csv          raw firmware UART (E rows, D rows, H rows, # markers)
  events.csv          run-id-tagged through tag_events.py
  manifest.txt        board/MCU/clock/sha/sdk + ELF/CSV sha256s
  firmware.elf        the exact binary that produced this capture
  firmware.elf.sha256 verification

Why systick only — the lptim tick-source variant is currently
degraded on this build:
  - With CONFIG_CORTEX_M_SYSTICK=n + CONFIG_PM=y, the kernel's
    k_cycle_get_32() falls back from DWT_CYCCNT (170 MHz) to the
    LPTIM-based system-clock cycle counter (~32 kHz LSI). Two ISR
    timestamp reads in the same firing return the same value, so
    every E-row reports algo_cycles=0, handoff_cycles=0 — useless.
  - The flashed lptim firmware also experiences mid-capture chip
    resets we haven't root-caused yet (two boot banners visible
    in the partial output before reader stalls at ~20 events).
  - Tracked as a follow-up: instrument the bench to read DWT_CYCCNT
    directly instead of via k_cycle_get_32, and figure out why the
    PM=y build hits resets under load.
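
The resolution problem in the first bullet is plain arithmetic and can be sketched as below. The 32 kHz figure is the nominal LSI rate quoted above; the 500-cycle interval and the helper name are illustrative assumptions.

```python
CPU_HZ = 170_000_000
LSI_HZ = 32_000  # nominal, per the "~32 kHz LSI" above


def counter_ticks_between(start_cycle: int, duration_cycles: int,
                          counter_hz: int, cpu_hz: int = CPU_HZ) -> int:
    """Ticks a counter_hz counter advances over an interval given in CPU cycles."""
    t0 = start_cycle * counter_hz // cpu_hz
    t1 = (start_cycle + duration_cycles) * counter_hz // cpu_hz
    return t1 - t0


print(CPU_HZ / LSI_HZ)                              # CPU cycles per LSI tick
print(counter_ticks_between(1000, 500, LSI_HZ))     # usually 0 ticks
print(counter_ticks_between(5300, 500, LSI_HZ))     # 1 when a tick edge is crossed
print(counter_ticks_between(1000, 500, CPU_HZ))     # cycle-accurate counter
```

One LSI tick spans several thousand CPU cycles, so an interval of a few hundred cycles reads as 0 unless it happens to straddle a tick edge.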

Renode comparison reference: stm32f4_disco Cortex-M4F numbers from the
Renode CI. Architectural delta is small (M4F + FPU at 168 MHz vs
170 MHz, both DWT_CYCCNT, both ARMv7E-M); the silicon/renode multiplier
this anchor establishes is the calibration data.

Don't overwrite. Anchor cadence: ~1 capture per board per major
bench-relevant gale commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first NUCLEO-G474RE silicon anchor (commits af7778f / 9da0cbb /
63de330) shipped with 4393/7750 = 56% sample drops on both the
baseline and gale variants. At that drop rate, per-RPM-step medians
are computed from a biased subsample — drops cluster at high RPM,
and within each step the surviving samples skew toward step-start
(cold cache, before the ring fills). The silicon/Renode multiplier
the protocol is supposed to establish is statistically meaningless
under that bias.

Root cause: long-sweep emits ~7,750 events × ~30 bytes = 232 KB
of UART traffic; at 115200 baud (~11.5 KB/s) that's 20 s of pure
UART throughput needed. At the 10000 RPM step the bench fires every
16 µs (~62 kHz), filling the 256-sample ring in 4 ms while the
reader can only drain ~360 events/sec. Most of the step gets
dropped.
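
The budget above can be reproduced with a few lines. This is a back-of-envelope sketch: it assumes 8N1 framing (10 bits on the wire per byte) and a flat 30 bytes per event, and the function name is hypothetical.

```python
def uart_budget(events: int, bytes_per_event: int, baud: int,
                ring_samples: int, isr_period_s: float) -> dict:
    """Rough UART and ring-buffer budget for one sweep."""
    bytes_per_s = baud / 10  # assumption: 8N1, i.e. 10 bits per byte
    total_bytes = events * bytes_per_event
    return {
        "total_bytes": total_bytes,
        "tx_seconds": total_bytes / bytes_per_s,
        # Time for the ISR alone to fill the ring (drain is negligible here).
        "ring_fill_ms": ring_samples * isr_period_s * 1e3,
        "drain_events_per_s": bytes_per_s / bytes_per_event,
    }


print(uart_budget(events=7750, bytes_per_event=30, baud=115200,
                  ring_samples=256, isr_period_s=16e-6))
```

Under these idealised assumptions the drain rate comes out at 384 events/s, slightly above the ~360 events/s quoted above; rows longer than 30 bytes would account for a gap of that size.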

Two coupled changes:

  - boards/nucleo_g474re.overlay: `&lpuart1 { current-speed = <921600>; }`
    8x headroom over the original 115200. Within the ST-LINK V3J9M3
    VCP's tested range (V3 supports up to 12 Mbps theoretically;
    921600 is the conventional STM32 high-baud setting).

  - silicon/capture.sh: pyserial baud bumped to 921600 to match.

No code-level bench changes — `algo_cycles` and `handoff_cycles`
arithmetic is byte-identical, so per-RPM medians from the
post-fix captures are directly comparable to Renode CI medians
at the same gale_sha.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first NUCLEO-G474RE silicon anchor (commit 63de330) dropped 56% of
samples at 115200-baud UART throughput — the ring filled faster than
the reader could drain it on every high-RPM step, biasing per-RPM
medians toward step-start (cold cache).

Tested baud-side fixes first (see reverted commit a9075a3): bumping
LPUART1 to 460800 / 921600 reduces *chip*-side drops but introduces
*host*-side losses dominated by macOS pyserial readline()'s per-byte
syscall overhead at >500 kbit/s. Net captured events drop further. Not
the right axis.

Chip-side fix: enlarge the bench's per-ISR ring buffer 256 → 2048
(RAM cost: ~50 KB additional, 12.4% → 50.7% of the G4's 128 KB SRAM —
well within budget). Even the largest single-step burst (1000 samples
at steps 3-7) now fits with headroom; the ring no longer overflows
during a step's ISR-firing phase, and the existing per-step drain
gives the reader plenty of UART time to empty it before the next step.

Adds CONFIG_RING_BUFFER_LARGE=y to prj.conf so RING_BUF_DECLARE's
static assertion accepts the larger backing array (default cap is
~16-bit indexable; 2048 × 24-byte sample = 48 KB exceeds that).

Inert for the QEMU/Renode CI lanes — their drops were 0 already at
ring=256. The Renode-cited per-step medians at the previous gale_sha
(0651509) remain valid as historical reference; new captures at this
sha are directly comparable to a fresh Renode run on the same sha.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-capture of baseline + gale × systick on the same NUCLEO-G474RE
hardware after the ring-buffer fix (commit 8a4d817, 256 → 2048
samples). Both runs at gale_sha 8a4d817; chip-side drops = 0;
host-side loss ≈5%; 7,353 / 7,363 events received per variant out
of 7,750 expected.

Headline result on real silicon (Cortex-M4F @ 170 MHz, DWT_CYCCNT,
3.26 V VDDA, room temp):

  algo (control_step):    bl 253 cyc / ga 253 cyc — identical
  handoff (ring_buf_put + k_sem_give):
                          bl 506 cyc / ga 582 cyc — gale +15.0%

The handoff distribution is publication-clean: 99.7% of all 1,000
events per RPM step land at the exact same cycle count (506 / 582),
with a single cold-start outlier per step (1283 / 1345). The +76-cycle
penalty for the gale variant is rock-solid across all 13 RPM steps
500..10000.
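
A per-step check of that kind of distribution could be sketched as follows. The sample below is synthetic (999 identical values plus one outlier) and only shows the mechanics; the function names are illustrative, not taken from analyze.py.

```python
from collections import Counter
from statistics import median


def handoff_summary(cycles: list) -> dict:
    """Median, most common value, and the share of events at that value."""
    mode, n_at_mode = Counter(cycles).most_common(1)[0]
    return {
        "median": median(cycles),
        "mode": mode,
        "mode_share": n_at_mode / len(cycles),
        "max": max(cycles),
    }


def pct_delta(variant: float, baseline: float) -> float:
    """Relative change of variant against baseline, in percent."""
    return 100.0 * (variant - baseline) / baseline


synthetic_step = [1283] + [506] * 999  # one cold-start outlier, made up
print(handoff_summary(synthetic_step))
print(round(pct_delta(582, 506), 1))
```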

This contradicts Renode CI's published numbers (taken at gale_sha
0651509; a Renode CI re-run on this sha has yet to land), where Gale
was reported 2.0% faster on handoff and 2.9% faster on algo. The
silicon anchor exposes that as a Renode TB-cost-model artefact: real
microarchitectural behavior (Cortex-M4 pipeline, flash prefetch, DWT
measurement granularity) gives Gale a 15% per-call penalty for the FFI
handoff into the gale_k_sem_give_decide path, a relative cost that
Renode under-estimates by ~17 percentage points.

The previous capture set at gale_sha 0651509 (commit 63de330)
remains in git history but is statistically void: 4393/7750 = 56%
chip-side drops biased the per-RPM medians toward step-start cold
samples. The new captures at 8a4d817 are the citable anchor.

Methodology integrity:
  - Same physical board, same VDDA / temp window
  - Single capture per variant, --sweep long, --tick-source systick
  - Ring buffer = 2048, baud = 115200 (host-stable; tested 460800 and
    921600, both worse net-loss due to pyserial / macOS limitations)
  - Bench source byte-identical across variants; only OVERLAY_CONFIG
    layers the gale primitive Kconfigs differently
  - DWT_CYCCNT for cycle measurement (CORTEX_M_SYSTICK=y, no PM
    fallback to LPTIM-LSI)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rtefact

Phase C of the silicon-anchor matrix: gale-ffi compiled to
wasm32-unknown-unknown, then run through pulseengine/synth (b8da214 on
fix/synth-i64-locals-and-frame branch) to produce a Cortex-M ET_REL
relocatable, wrapped into libgale_ffi.a, and linked into the engine
bench. Same chip, same gale_sha, same toolchain otherwise — only the
gale-ffi compile path differs from the rustc-direct gale variant.

Capture quality: 7315/7750 events received, drops=0, sentinel ✅.
Distribution at every RPM step: 99.7% of 1000 events at exactly 582
cycles, identical to the rustc-direct gale run.

The headline finding contradicts the published Renode CI numbers at
the same gale source:

  Renode (stm32f4_disco @ 168 MHz, sha 0651509):
    baseline:        354 cyc handoff
    gale-rustc:      347 cyc handoff (−2.0%)
    gale-synth:      232 cyc handoff (−34.5%)   ← cited in the
                                                "Three Quiet Barriers"
                                                blog post

  Silicon (nucleo_g474re @ 170 MHz, sha 8a4d817, drops=0):
    baseline:        506 cyc handoff
    gale-rustc:      582 cyc handoff (+15.0%)
    gale-synth:      582 cyc handoff (+15.0%)   ← bit-equivalent
                                                  to gale-rustc

The Renode-reported 34.5% advantage of the wasm→synth pipeline does
not exist on real Cortex-M4 silicon. On silicon, synth and rustc-direct
produce per-event handoff timings that agree to the cycle (582 / 582,
rock-stable across all 13 RPM steps 500..10000). Whatever Renode's
TB-cost model was reporting as a 122-cycle advantage for the synth
codegen is simulator-fictional.
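
Put as a silicon / renode multiplier per variant, the handoff figures quoted above look like this. The sketch only divides the numbers already listed; note that the two sides come from different boards and different shas, so the ratios are indicative rather than a calibrated result.

```python
# Handoff medians in cycles, copied from the two tables above.
RENODE = {"baseline": 354, "gale-rustc": 347, "gale-synth": 232}
SILICON = {"baseline": 506, "gale-rustc": 582, "gale-synth": 582}


def multipliers(silicon: dict, renode: dict) -> dict:
    """Silicon / renode ratio per variant."""
    return {name: silicon[name] / renode[name] for name in renode}


for name, ratio in multipliers(SILICON, RENODE).items():
    print(f"{name:12s} {ratio:.2f}x")
```

A single multiplier would let Renode numbers be rescaled to silicon; the spread between variants here is what makes the simulated ranking unreliable.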

This validates the silicon-anchor protocol's purpose: to expose
simulator-only deltas that wouldn't survive a real-hardware sanity
check. Any "Three Quiet Barriers"-style headline citing the 34.5%
advantage now has to be retracted — or qualified as a Renode-only
result.

Build pipeline (replicates engine-bench-renode-synth.yml):
  - rustup target add wasm32-unknown-unknown
  - cargo install --git https://github.com/pulseengine/synth.git \
                  --branch fix/synth-i64-locals-and-frame synth-cli
  - cargo install --git https://github.com/pulseengine/loom.git loom-cli
  - brew install binaryen   (wasm-opt 129)
  - west build -DGALE_USE_SYNTH=ON ...

Build artefacts pinned in the manifest (synth 0.1.0, loom 0.5.0,
wasm-opt 129, rustc 1.94.1, gale + zephyr SHAs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase B of the silicon-anchor protocol surfaced a silicon-specific
crash in the LLVM-LTO + Gale build path that Renode's stm32f4_disco
CI lane does not reproduce.

Build succeeds at FLASH 26,592 B with 1 surviving gale_ symbol
(meaningful LTO inlining); flash succeeds; chip emits ~67 bytes of
print_csv_header to UART then halts permanently in arch_system_halt
(zephyr/kernel/fatal.c:30) reached via z_irq_spurious.

Cortex-M state at halt:

  PC:    0x08004d9a (arch_system_halt)
  xPSR:  0x21000022 (IPSR=34 = External IRQ 18)
  CFSR:  0x00000000 (no fault flags — not a hardfault)
  HFSR:  0x00000000
  ICSR:  0x0400f822 (VECTACTIVE=34, RETTOBASE=1, VECTPENDING=15 with PENDSTSET: SysTick pending)

External IRQ 18 on STM32G474 = ADC1_2. The bench's smart-data
emission uses Zephyr's ADC API to read the on-die temperature sensor
+ VREFINT for H-row entries; the LLVM linker plugin is plausibly
either reordering the ADC driver's static initializer relative to the
ADC IRQ handler registration, or eliding the IRQ-table slot for IRQ
18 via aggressive inlining that gen_isr_tables.py doesn't track.
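
The register values in the halt dump can be decoded with the ARMv7-M bit layouts, as in the sketch below. The field positions are architectural; the helper names are illustrative. For the ICSR value captured above this decoding yields VECTACTIVE=34, VECTPENDING=15 and PENDSTSET set, i.e. a pending SysTick.

```python
def decode_xpsr(xpsr: int) -> dict:
    """Extract the exception number (IPSR) and Thumb bit from xPSR."""
    ipsr = xpsr & 0x1FF
    return {
        "ipsr": ipsr,
        # Exception numbers 16 and up are external interrupts (IRQ n = IPSR - 16).
        "external_irq": ipsr - 16 if ipsr >= 16 else None,
        "thumb": bool(xpsr & (1 << 24)),
    }


def decode_icsr(icsr: int) -> dict:
    """Decode the SCB->ICSR fields relevant to a halt post-mortem."""
    return {
        "vectactive": icsr & 0x1FF,
        "rettobase": bool(icsr & (1 << 11)),
        "vectpending": (icsr >> 12) & 0x1FF,
        "isrpending": bool(icsr & (1 << 22)),
        "pendstset": bool(icsr & (1 << 26)),
        "pendsvset": bool(icsr & (1 << 28)),
        "nmipendset": bool(icsr & (1 << 31)),
    }


print(decode_xpsr(0x21000022))
print(decode_icsr(0x0400F822))
```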

Renode's TB simulation does not model the ADC peripheral with the
fidelity needed to reproduce this. The published "LLVM cross-language
LTO works for Gale" claim is consequently silicon-untested for the
STM32G4 MCU family — only validated on the F4 (and even there only
under simulation, not real silicon).

The notes file documents the full crash signature, hypothesis, and
discriminating tests for next session. The silicon-anchor protocol
intentionally does NOT include an LTO run-dir for nucleo_g474re
until the crash is root-caused — committing a crashed firmware as
"the LTO anchor" would mislead anyone citing the directory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…seline

Phase B revisited with the LTO crash root-cause workaround in place.
The ADC IRQ-table interaction (documented in
silicon/boards/nucleo_g474re/NOTES-llvm-lto-crash.md) was sidestepped
by disabling CONFIG_ADC + DT-disabling adc1 for the LTO build. To
make the comparison apples-to-apples, two control captures were taken
under the same ADC=n config. Bench CMakeLists.txt updated to make
smart_mcu_g4.c conditional on CONFIG_ADC (was previously only on
SOC_SERIES_STM32G4X) so the stub backend is used when ADC is off.

Three new captures at gale_sha b48a81a, all systick, all sweep=long,
drops=0, 7300+ events received per variant out of 7,750:

  baseline-noadc:   7,304 events, handoff median 528 cyc (all 13 RPM steps)
  gale-noadc:       7,336 events, handoff median 574 cyc (all 13 RPM steps)
  gale-lto-noadc:   7,318 events, handoff median 471 cyc (all 13 RPM steps)

Distribution per RPM step is publication-clean across all three:
99.7% of events at the median value, single cold-start outlier per
step at startup.

Findings vs. previously published Renode CI (sha 0651509):

  Renode (stm32f4_disco @ 168 MHz, ADC presumably enabled):
    baseline:        354 cyc handoff
    gale-rustc:      347 cyc (-2.0%)
    gale-LTO:        347 cyc (-2.0%, "same as rustc-direct" was the claim)

  Silicon (nucleo_g474re @ 170 MHz, ADC=y, sha 8a4d817 / 418c6b8):
    baseline:        506 cyc
    gale-rustc:      582 cyc (+15.0%)
    gale-synth:      582 cyc (bit-identical to rustc-direct)
    gale-LTO:        crash — silicon-specific ADC IRQ-table bug

  Silicon (nucleo_g474re, ADC=n, sha b48a81a, this commit):
    baseline:        528 cyc
    gale-rustc:      574 cyc (+8.7% vs baseline-noadc)  ← FFI seam = +46 cyc
    gale-LTO:        471 cyc (-10.8% vs baseline-noadc) ← LTO eliminates seam
                                                          AND beats baseline
                                                          by 57 cyc

The same-axis (ADC=n) comparison settles two questions:

  1. Is the +76-cycle FFI seam observed at ADC=y a real cost, or
     ADC-amplified? Answer: real but layout-sensitive. Without ADC the
     seam is +46 cyc (574 vs 528), with ADC it's +76 cyc (582 vs 506).
     Cache/code-locality matters; the seam is genuine in either case.

  2. Does LLVM cross-language LTO erase it on real silicon? Yes,
     completely, plus 57 cycles more. The verified Rust path's
     decision logic (Verus-proven correct, then rustc-compiled) once
     inlined into z_impl_k_sem_give beats the equivalent stock-Zephyr
     C path. The "Gale's overhead is the FFI seam, not the verified
     Rust" claim is settled by the disassembly accounting (see
     PR/commit chain) and confirmed by silicon LTO.

Renode's claim that LTO ≈ rustc-direct (both -2.0%) does not hold on
silicon: silicon LTO is -10.8% (not -2.0%) and silicon rustc-direct
is +8.7% (not -2.0%). The TB cost model under-counts the
cross-language inlining win by roughly 5x (-2.0% predicted vs -10.8%
measured) on this MCU family.
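
The percentages quoted above follow directly from the handoff medians. A minimal sketch of the arithmetic, using a hypothetical helper `pct_delta_tenths` (not part of the bench or the analyzer) that works in integer tenths of a percent:

```c
#include <assert.h>

/* Hypothetical helper: signed change of `variant_cyc` relative to
 * `baseline_cyc`, in tenths of a percent, rounded to nearest
 * (half away from zero). Integer-only on purpose. */
int pct_delta_tenths(int baseline_cyc, int variant_cyc)
{
    long diff;
    long scaled;
    long half;

    assert(baseline_cyc > 0);
    diff = (long)variant_cyc - (long)baseline_cyc;
    scaled = diff * 1000L;          /* tenths of a percent, unscaled */
    half = baseline_cyc / 2;        /* rounding term */
    if (scaled >= 0) {
        return (int)((scaled + half) / baseline_cyc);
    }
    return (int)((scaled - half) / baseline_cyc);
}
```

With the ADC=n medians, `pct_delta_tenths(528, 574)` gives 87 (+8.7%) and `pct_delta_tenths(528, 471)` gives -108 (-10.8%); with the Renode figures, `pct_delta_tenths(354, 347)` gives -20 (-2.0%).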

Methodology integrity:
  - Same physical board, same VDDA, same temp window
  - Single capture per variant; --sweep long; --tick-source systick
  - Bench source byte-identical across all three (only Kconfig
    overlay + DT overlay differ — ADC on or off)
  - Ring=2048, baud=115200, drops=0
  - DWT_CYCCNT cycle source (CORTEX_M_SYSTICK=y)
  - Manifests pin gale_sha, zephyr_sha, ELF/CSV sha256s

The LTO firmware artefact in this PR is the citable form: anyone
reproducing this study should build with the exact CONFIG_ADC=n +
DT-disabled-adc1 overlay layered alongside prj-gale.conf and
gale_lto_overlay.conf, with matching LLVM 21.1.8 + lld 21.1.8 +
arm-zephyr-eabi-gcc 14.3.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…CAL_DECLARATION

LTO crash root-caused (NOTES-llvm-lto-crash.md, b48a81a) and fixed.
The LLVM linker plugin's whole-program LTO partitioning evicts the
ADC1_2 IRQ-18 vector handler when CONFIG_ISR_TABLES_LOCAL_DECLARATION=y;
the chip then takes a spurious IRQ 18 during boot and halts. Removing
the LOCAL_DECLARATION bit restores conventional Zephyr ISR-table layout
that LLVM handles cleanly.

This commit adds:

  - benches/engine_control/silicon/boards/nucleo_g474re/prj-lto-no-isr-local.conf
    (Kconfig overlay enabling LTO without the LOCAL_DECLARATION trigger)

  - benches/engine_control/silicon/runs/2026-05-10-nucleo_g474re-f6f61281-gale-lto-systick/
    (the publication-grade LTO+ADC=y capture: 7321 events received,
    drops=0, sentinel ✅, handoff median 558 cyc with 99.7% stability)

Build invocation:

  west build -p always -b nucleo_g474re -d /tmp/silicon-lto-adc \
    -s gale-smart-data/benches/engine_control -- \
      -DZEPHYR_TOOLCHAIN_VARIANT=llvm \
      -DCMAKE_TRY_COMPILE_TARGET_TYPE=STATIC_LIBRARY \
      -DZEPHYR_EXTRA_MODULES=<gale-smart-data> \
      -DOVERLAY_CONFIG="<gale-smart-data>/benches/engine_control/prj-gale.conf;<gale-smart-data>/benches/engine_control/silicon/boards/nucleo_g474re/prj-lto-no-isr-local.conf" \
      -DCMAKE_EXE_LINKER_FLAGS="-L<arm-zephyr-eabi libgcc> -L<picolibc>" \
      -DENGINE_BENCH_SWEEP=long

  PATH must include /opt/homebrew/opt/llvm@21/bin and /opt/homebrew/opt/lld@21/bin
  (LLVM 21.1.8 + lld 21.1.8 to match rustc 1.94.1's LLVM major version).

Final silicon timing matrix (handoff median in cycles; ADC=y column
from gale@f6f61281 / 8a4d817 / 418c6b8, ADC=n column from b48a81a;
NUCLEO-G474RE @ 170 MHz, drops=0, 99.7% per-step stability):

                        ADC=y     ADC=n
  baseline (no Gale)     506       528
  gale rustc-direct      582       574    ← FFI seam: +46 to +76 cyc
  gale wasm-synth        582       n/a    (bit-identical to rustc-direct)
  gale LLVM-LTO          558       471    ← LTO recovers part / all of seam
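
The seam and recovery figures in the summary that follows are plain differences between rows of this matrix. A small sketch of that accounting; the struct and function names are hypothetical, and only the six numbers are taken from the matrix:

```c
#include <stdint.h>

/* One column of the matrix: handoff medians in cycles. */
struct handoff_medians {
    int32_t baseline;      /* stock Zephyr, no Gale */
    int32_t rustc_direct;  /* Gale, FFI call left in place */
    int32_t lto;           /* Gale, LLVM cross-language LTO */
};

/* Cost of the C-to-Rust call boundary relative to baseline. */
int32_t ffi_seam(const struct handoff_medians *m)
{
    return m->rustc_direct - m->baseline;
}

/* Cycles LTO wins back relative to rustc-direct. */
int32_t lto_recovered(const struct handoff_medians *m)
{
    return m->rustc_direct - m->lto;
}

/* Where LTO ends up relative to baseline (negative = faster). */
int32_t lto_vs_baseline(const struct handoff_medians *m)
{
    return m->lto - m->baseline;
}

/* Values copied from the matrix above. */
const struct handoff_medians adc_on  = { 506, 582, 558 };
const struct handoff_medians adc_off = { 528, 574, 471 };
```

For ADC=y this yields a seam of 76, 24 recovered, and +52 against baseline; for ADC=n a seam of 46, 103 recovered, and -57 against baseline.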

LTO impact, summarized:
  - With ADC=y: LTO recovers 24 of the 76 cyc FFI penalty (582→558).
                Still +52 above baseline; the ADC subsystem in the
                LTO partition apparently affects code layout in ways
                that prevent full inlining recovery.
  - With ADC=n: LTO recovers ALL 46 cyc FFI penalty AND beats baseline
                by 57 cyc (574→471, baseline 528). The verified Rust
                decision logic, once inlined and dedup'd against the C
                bound-check, is measurably tighter than stock Zephyr.
  - LLVM fully inlined gale_k_sem_give_decide: the symbol is absent from
    the LTO ELF, and per disassembly the decision logic became the
    3-instruction sequence `cmp r2,r1; it cc; addcc r2,#1` inside
    z_impl_k_sem_give.
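
For readers less used to Thumb-2 conditional execution, the three-instruction sequence reads as a bounded increment. The sketch below is a hypothetical C rendering, not Gale's source; it assumes r2 holds the semaphore count and r1 the limit, which the commit message does not state.

```c
#include <stdint.h>

/* Hypothetical C equivalent of:
 *     cmp   r2, r1      ; compare count (r2) with limit (r1), assumed roles
 *     it    cc          ; next instruction runs only if carry clear
 *     addcc r2, #1      ; count += 1
 * "cc" (carry clear) is the unsigned lower-than condition, so the
 * increment happens only while count < limit. */
uint32_t sem_give_decide_sketch(uint32_t count, uint32_t limit)
{
    if (count < limit) {
        count += 1u;
    }
    return count;
}
```

Because the condition is unsigned, a count already at the limit is returned unchanged, including at `UINT32_MAX`.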

Renode CI's claim that LTO ≈ rustc-direct (both -2.0%) is inverted
on silicon: at ADC=n, silicon LTO is -10.8% vs baseline (vs +8.7%
for rustc-direct at the same axis); at ADC=y, silicon LTO is +10.3%
above baseline (vs +15.0% for rustc-direct). The TB cost model
under-counts cross-language inlining, and the ADC peripheral's
interaction with LTO partitioning is a real silicon-only effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cumented

Spike on the meld→loom→synth (or wasm-ld→synth, the simpler variant
of the same idea) pipeline as a verified-construction equivalent of
LLVM cross-language LTO for the C↔Rust FFI seam.

The pipeline:

  gale_sem.c (shim hot path)  ──clang -target wasm32──┐
                                                       ├──→ wasm-ld (static link)
  ffi/src/lib.rs (verified Rust) ──cargo wasm32──────┘     │
                                                            ▼
                                                       merged.wasm (1MB, both symbols)
                                                            │
                                                            ▼  loom optimize  (FAILS — see below)
                                                            ▼  synth compile --relocatable
                                                            ▼
                                                  ARM ET_REL (.o) where
                                                  z_impl_k_sem_give body
                                                  contains the inlined Rust
                                                  decision (no `bl
                                                  gale_k_sem_give_decide`).

Empirical result on the spike (see NOTES-wasm-cross-lto-spike.md):

  - wasm-ld merging works (single core wasm module, both symbols
    present, 1MB output, 193 gale_ symbols + z_impl_k_sem_give).
  - synth INLINES the FFI seam in its emitted ARM (verified by
    arm-zephyr-eabi-objdump: no `bl` to gale_k_sem_give_decide
    inside z_impl_k_sem_give's body).
  - synth's emitted ARM body is 138 bytes vs LLVM-LTO's 82 bytes
    for the same inlined logic — 1.68× larger because synth doesn't
    recognize the u64-packed FFI return pattern (falls back to
    generic 64-bit shift-and-mask).
  - loom's `inline_functions` pass panics with Z3
    `SortDiffers { left: (_ BitVec 64), right: (_ BitVec 32) }`
    on every gale-ffi function — the verified inliner is currently
    blocked on i64 sort handling. Without loom we lose the
    verification-by-construction angle that distinguishes the wasm
    pipeline from LLVM-LTO.
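
The "u64-packed FFI return pattern" is not spelled out in this message. As an illustration only, assuming the decision function returns two 32-bit fields packed into one 64-bit value (the field names and which half holds which field are hypothetical):

```c
#include <stdint.h>

/* Hypothetical layout: high word = action code, low word = new count. */
uint64_t pack_decision(uint32_t action, uint32_t new_count)
{
    return ((uint64_t)action << 32) | (uint64_t)new_count;
}

/* Generic extraction: a 64-bit shift followed by truncation. */
uint32_t unpack_action(uint64_t packed)
{
    return (uint32_t)(packed >> 32);
}

/* Generic extraction: mask off the low 32 bits. */
uint32_t unpack_new_count(uint64_t packed)
{
    return (uint32_t)(packed & 0xFFFFFFFFu);
}
```

Under the 32-bit ARM procedure call standard a 64-bit return value arrives in r0 and r1, so a backend that recognizes this pattern can read each field straight from one register; the generic shift-and-mask lowering is the larger code that the size comparison above points at.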

Filed as actionable upstream gaps:

  pulseengine/synth: u64-packed FFI return pattern recognition;
                     wasm linear-memory absolute-address lowering
                     to base+offset.
  pulseengine/loom:  Z3 i64 sort handling in inline_functions pass.

With the synth and loom fixes above, wasm-cross-LTO should reach
LLVM-LTO parity (~471 cyc handoff at ADC=n on silicon, the current
LLVM-LTO measurement in
2026-05-10-nucleo_g474re-b48a81ac-gale-lto-noadc-systick) and also
provide the verification-by-construction property LLVM-LTO does not.

Two artefacts committed:
  - wasm_host_shim_poc.c — minimal wasm-portable host of
    z_impl_k_sem_give that mirrors the bench's gale_sem.c hot path
    with kernel APIs as externs (wasm imports). 75 lines.
  - NOTES-wasm-cross-lto-spike.md — full reproduction commands,
    side-by-side disassembly comparison vs LLVM-LTO, and the two
    upstream codegen action items.

For the publication, this lets us claim:

  "Cross-language LTO via wasm IR is feasible end-to-end with the
  existing PulseEngine pipeline. The C↔Rust seam dissolves at wasm
  level. The remaining gap to LLVM-LTO parity is two specific codegen
  patterns in synth and a Z3 sort fix in loom — both well-scoped
  engineering work, neither a fundamental architectural barrier."

That's stronger than "wasm + synth = same as rustc-direct" (which is
what the current GALE_USE_SYNTH=ON path delivers, since it doesn't
merge the C shim into the wasm bundle and the FFI seam stays native).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pushed the wasm-cross-LTO experiment all the way to a buildable bench
ELF integrated via wasm-ld+arm-ar+linker-substitute. Discovered an
additional synth backend bug while attempting silicon measurement:

  synth's emitted memset/memcpy/memmove don't terminate correctly on
  Zephyr's startup `memset(bss, 0, sizeof(bss))` invocation. The chip
  hangs in memset+0x4c forever, bouncing between offsets 0x668 and
  0x67e in a tight inner loop. The synth disassembly reveals i64
  shift instructions (`subs.w r3, r2, #32; rsb r3, r2, #32;
  lsl.w r3, r1, r3`) lowered into what should be a byte-counter loop
  — same root cause as the u64-packed FFI return codegen issue
  documented earlier: synth's i64 codegen is incomplete.

End-to-end status:

  - wasm-ld static-merging: WORKS. shim.wasm.o + libgale_ffi.a → 1MB
    merged.wasm with z_impl_k_sem_give and gale_k_sem_give_decide both
    present.

  - synth inlining at merged-module scope: STRUCTURALLY WORKS. The
    output `z_impl_k_sem_give` body has zero
    `bl gale_k_sem_give_decide` instructions. Verified by disassembly.
    138 bytes vs LLVM-LTO's 82 bytes: 1.68x larger, but inlined.

  - Bench integration: BUILDS. CMake bench builds with
    -DGALE_WASM_LTO_OVERRIDE_SEM_GIVE=1 + custom libgale_ffi.a +
    --allow-multiple-definition. Final ELF 219 KB FLASH, 66 KB RAM.

  - Chip boot: BLOCKED. PC stuck in synth-emitted memset. Workarounds
    via objcopy --weaken-symbol, --strip-symbol, --redefine-sym all
    failed to evict synth's broken memset bytes from the final ELF.

Three synth backend issues filed against pulseengine/synth, ordered by
priority:

  1. (blocker) memset/memcpy/memmove i64-codegen non-termination:
     prevents the merged-wasm bench from booting at all.
  2. u64-packed FFI return unpacking: ~50% of the LTO-parity size
     delta. Same i64-codegen root cause as item 1.
  3. wasm linear-memory access lowering: ~20% of the size delta.
     Cosmetic compared to items 1 and 2.

Plus one issue against pulseengine/loom:

  - Z3 SortDiffers panic in inline_functions pass on i64-heavy
    wasm modules. Without loom, the verified-LTO claim doesn't hold.

The structural claim, "wasm-cross-LTO via PulseEngine pipeline
dissolves the C↔Rust seam at wasm IR level", is **proven by
disassembly**. The timing claim, "silicon timing matches LLVM-LTO",
is **blocked on synth's memset codegen**. Neither is a fundamental
architectural barrier; both are well-scoped engineering work.

This commit only updates the NOTES with the integration findings.
The bench source is restored to a clean state (the gale_sem.c
#ifndef edit was transient) and verified to build unchanged, at
27 KB FLASH, on the canonical rustc-direct path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…+ loom

Filed the four bugs surfaced by the wasm-cross-LTO spike against the
upstream PulseEngine repos. Notes file now carries direct links and a
priority table:

  - pulseengine/synth#93 (BLOCKER): memset/memcpy/memmove i64-codegen
    non-termination. Chip hangs on Zephyr z_bss_zero. Until fixed,
    no merged-wasm integration can boot on real silicon.

  - pulseengine/synth#94: u64-packed FFI return unpacking. Generic
    64-bit shift extraction instead of register-direct field access.
    ~50% of the LLVM-LTO size-parity gap.

  - pulseengine/synth#95: wasm linear-memory access lowering.
    movw+movt+ldr triplet instead of base+offset. ~20% of the gap.

  - pulseengine/loom#98 (BUG): Z3 SortDiffers panic in inline_functions
    pass on i64-heavy modules. Every gale-ffi function reverts; the
    verified inliner is effectively a no-op for our use case.

Each upstream issue carries a self-contained reproducer, the
silicon-anchor evidence chain, and disassembly evidence. Once
synth#93 lands, the merged-wasm bench will boot and we can take the
silicon cycle measurement that closes the wasm-cross-LTO data point.
Once synth#94 + #95 land, the route should approach LLVM-LTO parity.
Once loom#98 lands, the route delivers the verification-by-construction
property that distinguishes it from LLVM-LTO.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>