Skip to content

fix: AOT embeds CompiledProgram so HOF / closure dispatch works#424

Merged
danieljohnmorris merged 8 commits into
mainfrom
fix/aot-closures-program-embed
May 19, 2026
Merged

fix: AOT embeds CompiledProgram so HOF / closure dispatch works#424
danieljohnmorris merged 8 commits into
mainfrom
fix/aot-closures-program-embed

Conversation

@danieljohnmorris
Copy link
Copy Markdown
Collaborator

Summary

Engine audit PR #413 found AOT silently returns nil (or [nil, nil, ...]) for every program that emits OP_CALL_DYN or an OP_*_BY_KEY finalizer whose helper has to re-enter the VM on a user-fn callback. Tree, VM, and Cranelift JIT are all correct on the same shapes. Targeting 0.12.1 since this is a bug fix (silent miscompute -> correct answer), no new API surface, no breaking change.

Root cause: AOT never publishes ACTIVE_PROGRAM / ACTIVE_AST_PROGRAM, so jit_call_dyn and jit_call_builtin_tree hit their null-program guards and silently return TAG_NIL. Fix shape: serialise the full CompiledProgram (chunks + AST + func_names + is_tool + type_registry) with postcard, embed the blob into a .rodata data section in the AOT binary, and add ilo_aot_publish_program(ptr, len) to deserialise and publish the four TLS pointers with_active_registry already sets for the in-process JIT path.

Repro

$ cat /tmp/m.ilo
main>L n;map (x:n>n;*x 2) [1,2,3]

$ ilo --run-tree /tmp/m.ilo main
[2, 4, 6]
$ ilo --run-vm /tmp/m.ilo main
[2, 4, 6]
$ ilo --jit /tmp/m.ilo main
[2, 4, 6]

# Before this PR:
$ ilo compile /tmp/m.ilo -o /tmp/m main && /tmp/m
[nil, nil, nil]

# After this PR:
$ ilo compile /tmp/m.ilo -o /tmp/m main && /tmp/m
[2, 4, 6]

Every shape from the audit's failing rows now matches tree / VM / JIT byte-for-byte under AOT: map over an inline lambda (no capture and with capture), 3-arg closure-bind map fn ctx xs, fld with a user-fn, grp / uniqby with a key fn, and fn-ref return-and-call.

What's in the diff

Five commits:

  1. aot: add postcard CompiledProgram wire format - new src/vm/aot_blob.rs module with ProgramBlob / WireChunk / WireConst. Schema_version gated. Chunk constants encode only the variants the compiler emits today (Nil / Number / Text / Bool / List); a future variant trips the From<&Value> guard in the test suite before any binary ships. AST is serialised as JSON inside the postcard envelope because Program::serialize_decls uses serialize_seq(None) which postcard rejects.

  2. aot: ilo_aot_publish_program runtime helper - new C ABI helper deserialises the blob, leaks it for the process lifetime (same pattern as ilo_aot_set_registry), and publishes the four TLS pointers. Malformed blob exits 1 with a JSON diagnostic, no silent fallback. ilo_aot_fini now also clears the three new pointers.

  3. aot: emit publish_program call in generate_main and the bench harness - declare the helper, serialise the program in compile_to_binary, emit a call from generate_main after the registry setup and before the user-fn call. Same wiring for compile_to_bench_binary's C harness.

  4. tests: cross-engine regression for AOT HOF / closure dispatch - eight cases in tests/regression_aot_closures.rs covering each audit failure shape, each compiled to a real AOT binary and compared byte-for-byte against --run-tree, --run-vm, and --jit. Plus examples/aot-closures.ilo as the in-context learning artefact, exercised by the examples_engines harness.

  5. docs: AOT now runs Phase 2 closures and HOF / fn-ref dispatch - SPEC.md, ai.txt (regenerated), skills/ilo/SKILL.md, and skills/ilo/ilo-engines.md. Pre-PR these claimed Phase 2 was tree-only with VM/JIT raising ILO-R012, which has been false since feature: VM closure support (Phase 2 PR1) #384 landed; now they describe the actual state including the AOT embed.

Binary size impact

Measured against main on the same machine, same flags:

program pre-fix post-fix delta
main>n;42 9,486,728 9,696,616 +205 KB
main>L n;map (x:n>n;*x 2) [1,2,3] 9,486,744 9,696,648 +205 KB
5-fn program with lambdas + grp + fld 9,486,952 9,696,872 +205 KB

The one-time +205 KB is the libilo.a side: linking in the postcard + serde_json deserialiser code paths reachable from ilo_aot_publish_program. Per-program growth (the actual blob in .rodata) is a few hundred bytes for trivial programs, a few KB for larger ones; in the szhello vs szbig comparison the per-program delta is 256 bytes. Honest trade for fixing the silent-miscompute class.

Test plan

  • cargo build --release --features cranelift clean
  • cargo test --release --features cranelift clean (192 test files, no failures)
  • cargo fmt --check clean
  • cargo clippy --release --features cranelift --all-targets clean
  • tests/regression_aot_closures.rs exercises the seven audit failure shapes + one regression baseline, all green
  • aot_blob module has its own roundtrip + schema-version unit tests
  • examples/aot-closures.ilo runs green through tree / VM / JIT via the examples_engines harness
  • Manual binary-size measurement before / after

Follow-ups

  • Re-run the engine audit harness from PR engine feature-support audit: 49-row corpus + cross-engine harness #413 once that lands to confirm the 9 AOT-fail rows drop to whatever's left after this + the in-flight sum-type-strconst fix.
  • AST is currently embedded as JSON inside the postcard envelope. A future PR could ship a native postcard AST by replacing the serialize_seq(None) impl on Program with a Vec-backed one, shrinking the blob a little. Not urgent.

@danieljohnmorris danieljohnmorris force-pushed the fix/aot-closures-program-embed branch 2 times, most recently from 3a232a0 to dcf9b98 Compare May 19, 2026 00:27
@codecov
Copy link
Copy Markdown

codecov Bot commented May 19, 2026

Codecov Report

❌ Patch coverage is 92.25092% with 21 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/vm/mod.rs 83.75% 13 Missing ⚠️
src/vm/aot_blob.rs 95.29% 8 Missing ⚠️

📢 Thoughts on this report? Let us know!

Engine audit PR #413 found AOT silently returns nil for HOF / closure
dispatch (map over a lambda, fld, grp, uniqby, fn-ref return). Root cause:
AOT never publishes ACTIVE_PROGRAM, so jit_call_dyn and jit_call_builtin_tree
hit their null-program guards and return TAG_NIL for every user-fn callback.

To fix, the AOT binary needs a CompiledProgram at runtime. Add a postcard
wire format (chunks + func_names + is_tool + type_registry + ast) gated by
schema_version. Chunk constants encode only the variants the compiler
emits today (Nil / Number / Text / Bool / List); a future variant trips
the From<&Value> guard in the test suite before any binary ships.

The AST is serialised as JSON inside the postcard envelope because the
Program::serialize_decls custom impl uses serialize_seq(None) which postcard
rejects with "The length of a sequence must be known". serde_json handles
unsized seqs and the AST already serialises cleanly via JSON for --ast.

Round-trip unit tests cover the empty program, a map-lambda program, an
fld user-fn program, type-registry preservation, and schema-version
mismatch detection.
New C ABI helper takes the embedded blob, deserialises via aot_blob, and
publishes the program into ACTIVE_PROGRAM / ACTIVE_AST_PROGRAM /
ACTIVE_FUNC_NAMES / ACTIVE_REGISTRY. Mirrors what with_active_registry does
for the in-process JIT entry path, but stays for the process lifetime
since AOT has no smaller scope.

On a malformed blob (schema mismatch or postcard parse failure) we write
a JSON diagnostic to stderr and exit 1. No silent fallback. ilo_aot_fini
now also clears the three new pointers alongside the existing registry
clear so the test suite sees clean state between AOT runs in the same
process.
Declare ilo_aot_publish_program in HelperFuncs, serialise the program in
compile_to_binary, and emit a call to the helper from generate_main after
the type-registry setup and before the user-fn invocation. Same wiring
goes through the C bench harness (compile_to_bench_binary) so --bench
binaries see the same dispatch contract as plain compiles.

The blob lands in a .rodata data section named ilo_program_blob via the
existing create_data_section helper. Binary size grows by ~210 KB once
(libilo.a now pulls in the postcard + serde_json deserialiser code paths)
and by a few hundred bytes to a few KB per program for the blob itself.

This closes the silent-nil class flagged by the engine audit. map over a
lambda, fld with a user-fn, grp / uniqby with a key fn, closure-bind ctx
args, and fn-ref return-and-call all now match tree / VM / JIT under AOT.
Eight cases, each compiled to a real AOT binary and compared byte-for-byte
against --run-tree, --run-vm, and --jit. The first seven are the audit
failure shapes (map lambda no-cap, map lambda with capture, map closure-bind
ctx, fld user-fn, grp user-fn, uniqby user-fn, fn-ref return-and-call); the
eighth pins the already-passing top-level fn-ref-to-map case so a future
refactor that breaks it while fixing the others gets caught immediately.

examples/aot-closures.ilo demonstrates every fixed shape at the language
level and is exercised by the examples_engines harness so the same
contract is enforced for an agent reading the in-context example.
SPEC.md and the generated ai.txt previously claimed Phase 2 closure capture
was tree-only with VM / JIT raising ILO-R012 and the default runner falling
through. Empirically VM and JIT have handled Phase 2 natively for a while
(PR1 #384 onwards); with this PR AOT joins them. Update the closure-capture
paragraph and the default-engine paragraph in SPEC.md so they describe
post-fix reality, and rewrite the manually-maintained line in SKILL.md to
match. ilo-engines.md gets the same treatment plus the AOT specifics note
about the embedded CompiledProgram blob.

The site cli.md is updated in a follow-up commit in the separate ilo-lang/site
repo since the site has no CI and lives outside this monorepo.
The post-rebase SPEC.md text is slightly different from the version
that produced the dcf9b98 ai.txt - build.rs regenerates ai.txt from
SPEC.md on every cargo build, and CI fails if ai.txt drifts. Re-run
of cargo build produced this diff.
The original source 'parity n:n>t;=mod n 2 0{"even"};"odd"' parses
as a sequence of two statements - '=mod n 2 0{"even"}' (a comparison
followed by a list literal, discarded) and '"odd"' (the return). The
function therefore always returned "odd" regardless of input, so grp
and uniqby produced singleton outputs across every engine - not just
AOT. The fix is to use the standard ternary form '?(=mod n 2 0){"even"}{"odd"}'
so the function actually branches on parity.
The cross-engine AOT regression tests exercise publish/fini end-to-end
via a compiled binary, but the parent test process's coverage
instrumentation doesn't track lines executed in the child. Add three
in-process unit tests that call ilo_aot_publish_program and
ilo_aot_fini directly so cargo llvm-cov picks up the publish path,
the TLS slot wiring, and the schema-version constant.

Codecov was failing patch coverage at 86.7% because the entire
ilo_aot_publish_program body (20 lines in vm/mod.rs) was reported as
uncovered. These tests cover the happy path and the fini-clears-TLS
path; the malformed-blob branch exits the process so remains
uncovered by design (matches how Rust's coverage tooling treats other
process::exit branches in the codebase).
@danieljohnmorris danieljohnmorris force-pushed the fix/aot-closures-program-embed branch from dcf9b98 to 69a1899 Compare May 19, 2026 17:40
@danieljohnmorris danieljohnmorris merged commit 97c6e6f into main May 19, 2026
5 checks passed
@danieljohnmorris danieljohnmorris deleted the fix/aot-closures-program-embed branch May 19, 2026 17:51
danieljohnmorris added a commit that referenced this pull request May 19, 2026
#424 landed regression_aot_closures.rs after this branch's first
sweep, so its in-proc engine loop included the now-removed flag.
VM and JIT cover the same matrix; AOT remains its own arm.
danieljohnmorris added a commit that referenced this pull request May 19, 2026
danieljohnmorris added a commit that referenced this pull request May 19, 2026
…miscompile note

Merge from main re-introduced ~30 tokens that earlier trim pass had
removed. Also dropped the line claiming AOT miscompiles HOFs taking
fn-values, since PR #424 (AOT closures program-embed) fixed that.
Aggregate now 4982 of 5000.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant