test(e2e): golden behavioural-equivalence harness (Tier A + B)#213

Merged
avrabe merged 1 commit into main from test/golden-e2e
Jun 2, 2026

Conversation

@avrabe
Contributor

@avrabe avrabe commented May 31, 2026

Answers "how do I know it really works" with a differential end-to-end test of meld's central claim — fusion preserves observable behaviour:

  1. Run a real component unfused under deterministic wasmtime → that result is the golden (no hand-authored expected values).
  2. meld fuse it.
  3. Run the fused output the same way.
  4. Assert identical observable behaviour (run-ok + stdout / typed return).
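
The four steps above can be sketched as a small generic check. This is a minimal illustration, not the harness in this PR: `Observed`, `check_equivalence`, and the `run` / `fuse` closures are hypothetical names standing in for the deterministic wasmtime run and the meld fuse step.

```rust
/// What the harness compares: whether the run succeeded and what it printed.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Observed {
    pub run_ok: bool,
    pub stdout: Vec<u8>,
}

/// Differential check. `run` and `fuse` are assumed stand-ins for the
/// wasmtime runner and for meld; neither is a real API from this repo.
pub fn check_equivalence<R, F>(component: &[u8], run: R, fuse: F) -> Result<(), String>
where
    R: Fn(&[u8]) -> Observed,
    F: Fn(&[u8]) -> Result<Vec<u8>, String>,
{
    // Step 1: the unfused run is the golden; nothing is hand-authored.
    let golden = run(component);
    // Step 2: fuse the same component.
    let fused = fuse(component)?;
    // Step 3: run the fused output the same way.
    let actual = run(&fused);
    // Step 4: observable behaviour must be identical.
    if actual == golden {
        Ok(())
    } else {
        Err(format!("divergence: golden={golden:?}, fused={actual:?}"))
    }
}
```

Because the golden is derived from the unfused run, the check needs no expected values per fixture; adding a fixture only means adding its bytes.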

Tier A — active, green (12 checks)

Single-component round-trip equivalence over the wit-bindgen ABI fixtures and real cross-language command components (hello_rust / hello_c_cli / hello_cpp_cli), both memory strategies. The hello_* cases assert byte-identical stdout — a real behavioural diff, not just "didn't trap." SharedMemory+rebasing correctly declines memory.grow fixtures (logged skip — meld refusing is not a divergence).
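
The distinction between a logged skip and a divergence can be modelled as a three-way verdict. The sketch below is an assumption about the shape of that logic, with hypothetical names (`FuseAttempt`, `Verdict`, `classify`); it is not taken from the harness source.

```rust
/// Outcome of asking the fuser to process a fixture (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum FuseAttempt {
    /// Fusion produced output bytes.
    Fused(Vec<u8>),
    /// The fuser refused the input, e.g. a memory.grow fixture under
    /// SharedMemory+rebasing.
    Declined(String),
}

/// Verdict for one fuse-and-run check (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Equivalent,
    Skipped(String),
    Divergent,
}

/// A refusal is recorded as a skip; only a fused output whose stdout
/// differs from the golden counts as a divergence.
pub fn classify(
    golden_stdout: &[u8],
    attempt: FuseAttempt,
    run_fused: impl Fn(&[u8]) -> Vec<u8>,
) -> Verdict {
    match attempt {
        FuseAttempt::Declined(reason) => Verdict::Skipped(reason),
        FuseAttempt::Fused(bytes) => {
            if run_fused(&bytes).as_slice() == golden_stdout {
                Verdict::Equivalent
            } else {
                Verdict::Divergent
            }
        }
    }
}
```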

Tier B — discovery oracle (#[ignore] on #212)

Fuses a real two-component composition built offline with wasm-tools + wac (compose/build.sh): consumer.runner.compute() calls provider.add(20,22) = 42. The body asserts the meld-fused output computes 42 standalone — the acceptance test for #212, un-ignore when it lands.
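
For readers unfamiliar with the pattern, an ignored acceptance test in Rust looks roughly like the sketch below. The composition is modelled in plain Rust rather than run as a fused component; `provider_add`, `runner_compute`, and the test name are hypothetical stand-ins, and the ignore message is illustrative.

```rust
/// Stand-in for provider.add.
fn provider_add(a: u32, b: u32) -> u32 {
    a + b
}

/// Stand-in for consumer.runner.compute(), which calls provider.add(20, 22).
fn runner_compute(add: impl Fn(u32, u32) -> u32) -> u32 {
    add(20, 22)
}

#[test]
#[ignore = "blocked on #212: multi-component fusion gaps"]
fn fused_composition_computes_42_standalone() {
    // The real test would run the meld-fused component under wasmtime;
    // here the call chain is only modelled.
    assert_eq!(runner_compute(provider_add), 42);
}
```

Ignored tests are skipped by a plain `cargo test` and can be run on demand with `cargo test -- --ignored`, which is what makes the body usable as an acceptance test before the attribute is removed.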

Building Tier B surfaced three real multi-component fusion gaps (filed as #212):

  1. Separate-input cross-component interface links are not internalised (fused output still imports the dependency).
  2. wac-composed inputs lose their top-level export (empty world root {}).
  3. Bare world-level func exports drop their result type → invalid component.

Tier A proves real wit-bindgen compositions fuse + run with identical behaviour today; Tier B marks the boundary of what multi-component fusion doesn't yet handle.

Honest boundary

Equivalence is proven under wasmtime (the reference runtime), not the synth/kiln MCU target — a module passing here can still break after synth transcodes it. That cross-repo hardware smoke is tracked separately (owner: you).

Fixtures are committed (*.wasm is gitignored; force-added like the existing fixtures) and regenerable via compose/build.sh.

🤖 Generated with Claude Code

meld's central claim is "fusion preserves observable behaviour." This
harness falsifies it differentially: run a real component unfused under
deterministic wasmtime (the result IS the golden), meld-fuse it, run the
fused output the same way, assert identical observable behaviour. No
hand-authored expected values.

Tier A (active, green): single-component round-trip equivalence over the
wit-bindgen ABI fixtures + real cross-language command components
(hello_rust/c/cpp), both memory strategies. 12 fuse-and-run checks; the
hello_* ones assert byte-identical stdout. SharedMemory+rebasing
correctly declines memory.grow fixtures (logged skip, not a divergence).

Tier B (discovery oracle, #[ignore] on meld#212): fuse a real
two-component composition (consumer.runner.compute -> provider.add(20,22)
= 42, built offline via wasm-tools + wac, see compose/build.sh) and
assert the fused output computes 42 standalone. Building it surfaced
three real multi-component fusion gaps (meld#212): separate-input
cross-component links not internalised, wac-composed exports dropped,
bare-world func-export result type dropped. The test body is the fix's
acceptance test — un-ignore when #212 lands.

Honest boundary: equivalence under wasmtime (reference runtime), NOT the
synth/kiln MCU target — that hardware smoke is tracked separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

LS-N verification gate

⚠️ 36/38 verified — 2 missing regression tests

Status                                Count
Passed (≥1 test, all green)           36
Failed (≥1 test failure)              0
Missing (no ls_*_NN_* test found)     2

Approved loss-scenarios.yaml entries are expected to have a
regression test named ls_<letter>_<num>_* (e.g. LS-A-11 maps to
ls_a_11_*). The gate runs each prefix via cargo test --lib --no-fail-fast and aggregates pass/fail/missing.
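
The naming convention and the three-way aggregation can be illustrated as follows. This is a sketch under the stated convention only; the gate itself is driven by tools/post_verification_comment.py, and `test_prefix`, `Status`, and `entry_status` are hypothetical names.

```rust
/// Map a loss-scenario id such as "LS-A-11" to its expected test-name
/// prefix "ls_a_11_". Returns None if the id does not fit LS-<letter>-<num>.
pub fn test_prefix(entry_id: &str) -> Option<String> {
    let mut parts = entry_id.split('-');
    let (ls, letter, num) = (parts.next()?, parts.next()?, parts.next()?);
    if parts.next().is_some() || ls != "LS" {
        return None;
    }
    if letter.len() != 1 || !letter.chars().all(|c| c.is_ascii_uppercase()) {
        return None;
    }
    if num.is_empty() || !num.chars().all(|c| c.is_ascii_digit()) {
        return None;
    }
    Some(format!("ls_{}_{}_", letter.to_ascii_lowercase(), num))
}

/// Status of one loss-scenario entry in the gate's summary.
#[derive(Debug, PartialEq, Eq)]
pub enum Status {
    Passed,
    Failed,
    Missing,
}

/// Classify an entry from the pass/fail results of the tests that match
/// its prefix: no tests is Missing, any failure is Failed, else Passed.
pub fn entry_status(results: &[bool]) -> Status {
    if results.is_empty() {
        Status::Missing
    } else if results.iter().all(|ok| *ok) {
        Status::Passed
    } else {
        Status::Failed
    }
}
```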

Failed LS entries

(none)

Missing regression tests
  • LS-R-13
  • LS-M-6

Updated automatically by tools/post_verification_comment.py.
Source of truth: safety/stpa/loss-scenarios.yaml.

@avrabe avrabe merged commit 45c6f42 into main Jun 2, 2026
14 checks passed
@avrabe avrabe deleted the test/golden-e2e branch June 2, 2026 05:58