fix: Cranelift aarch64 reloc-overflow fallback is now visible (no silent VM) #354

Merged
danieljohnmorris merged 2 commits into main from fix/cl-aarch64-reloc-overflow
May 17, 2026
@danieljohnmorris
Collaborator

Summary

Upstream cranelift-jit 0.116 has an aarch64 BL relocation assertion (compiled_blob.rs:90) that fires non-deterministically when the JIT code cache and the helper symbols it calls end up more than 128 MB apart, outside the ±128 MB reach of a direct BL. We already catch the panic and fall back to the bytecode VM (--run-cranelift) or the tree interpreter (default tiered). The catch wasn't the problem; the breadcrumb was.
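
For reference, the ±128 MB limit comes from the BL encoding itself: a signed 26-bit immediate counted in 4-byte instruction words. The sketch below is an illustration of that architectural range check only, not the cranelift-jit code; `bl_can_reach` and `BL_REACH_BYTES` are hypothetical names, and both addresses are assumed to be 4-byte aligned.

```rust
/// Byte reach of an aarch64 `BL`: a signed 26-bit immediate counted in
/// 4-byte words, so the byte displacement must lie in [-2^27, 2^27).
/// (Hypothetical constant, not taken from cranelift-jit.)
const BL_REACH_BYTES: i64 = 1 << 27;

/// Returns true if a direct `BL` placed at `call_site` can reach `target`.
/// Assumes both addresses are 4-byte aligned.
fn bl_can_reach(call_site: u64, target: u64) -> bool {
    // Signed byte displacement from the call instruction to its target.
    let diff = target.wrapping_sub(call_site) as i64;
    (-BL_REACH_BYTES..BL_REACH_BYTES).contains(&diff)
}
```

Whether a given run panics then depends on where the allocator happens to place the code cache relative to the helpers, which is consistent with the non-deterministic behaviour described above.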

The old breadcrumb was a free-form eprintln!("ilo: Cranelift JIT panicked ({}); falling back to ...") emitted on every panic, with two real problems:

  1. Persona harnesses that buffer or merge stderr could drop it entirely, making the engine swap invisible in user-facing "Cranelift" timing columns (the ml-engineer rerun8 report that triggered this fix).
  2. Hot loops that hit the assertion repeatedly spam stderr.

This PR centralises the visibility contract in vm::jit_cranelift::note_jit_panic_fallback:

  • Stable [ilo:jit-fallback] grep anchor so harnesses can detect the fallback without parsing free-form text.
  • Once per process: first call prints, subsequent calls bump a silent atomic counter callers can read on exit.
  • Explicit stderr flush so the line reaches the harness even if the process exits seconds later under a buffered stream.
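
A minimal sketch of what such a helper could look like, for readers who want the shape without opening the diff. The function and constant names match the PR; the parameter list, the static's name, and the exact message formatting are assumptions, and the real code in src/vm/jit_cranelift.rs may differ.

```rust
use std::io::Write;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Stable grep anchor (value taken from the PR description).
pub const JIT_FALLBACK_TAG: &str = "[ilo:jit-fallback]";

/// Process-global count of JIT panics that led to a fallback.
/// (The static's name is an assumption.)
static JIT_PANIC_FALLBACKS: AtomicUsize = AtomicUsize::new(0);

/// Records one fallback. Only the call that moves the counter off zero
/// prints. `panic_msg` and `engine` are assumed parameters.
pub fn note_jit_panic_fallback(panic_msg: &str, engine: &str) {
    // fetch_add returns the previous value, so exactly one caller sees 0.
    if JIT_PANIC_FALLBACKS.fetch_add(1, Ordering::Relaxed) == 0 {
        let mut err = std::io::stderr().lock();
        // Write errors are ignored: a closed stderr must not break the fallback.
        let _ = writeln!(
            err,
            "{} Cranelift JIT panicked ({}); falling back to {}. \
             Subsequent fallbacks in this process will be silent (count reported on exit).",
            JIT_FALLBACK_TAG, panic_msg, engine
        );
        let _ = err.flush();
    }
}

/// Number of fallbacks recorded so far in this process.
pub fn jit_panic_fallback_count() -> usize {
    JIT_PANIC_FALLBACKS.load(Ordering::Relaxed)
}

/// Test-only reset. In this sketch it also re-arms the one-time print.
pub fn reset_jit_panic_fallback_count() {
    JIT_PANIC_FALLBACKS.store(0, Ordering::Relaxed);
}
```

Tying the print to the counter's zero-to-one transition keeps the sketch to a single atomic; a separate `Once` would be the alternative if the reset should not re-enable printing.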

Both fallback sites in main.rs route through the helper. The existing fallback-path tests now assert the counter increments; a new test locks down the per-panic increment contract.
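
Roughly, each fallback site follows a catch-then-report pattern like the one below. This is a sketch built on `std::panic::catch_unwind`, not the code in main.rs: `run_with_fallback` and `panic_message` are hypothetical names, and the `note` closure stands in for a call to `note_jit_panic_fallback`.

```rust
use std::any::Any;
use std::panic::{catch_unwind, UnwindSafe};

/// Extracts a readable message from a panic payload, which is usually
/// a `&'static str` or a `String`.
fn panic_message(payload: &(dyn Any + Send)) -> String {
    if let Some(s) = payload.downcast_ref::<&str>() {
        (*s).to_string()
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.clone()
    } else {
        "unknown panic".to_string()
    }
}

/// Runs `jit`; if it panics, reports the message through `note` and then
/// runs `fallback` instead. (Hypothetical helper for illustration.)
fn run_with_fallback<T>(
    jit: impl FnOnce() -> T + UnwindSafe,
    fallback: impl FnOnce() -> T,
    note: impl FnOnce(&str),
) -> T {
    match catch_unwind(jit) {
        Ok(value) => value,
        Err(payload) => {
            note(&panic_message(&*payload));
            fallback()
        }
    }
}
```

Note that `catch_unwind` only works when panics unwind; a build with `panic = "abort"` would terminate before any fallback could run.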

The real codegen fix (forcing indirect calls when displacement may overflow, or chunked trampolines) is parked as a follow-up. It is a Cranelift-backend rewrite measured in days, and the panic is already recoverable on every path. Shipping visibility now is the higher-leverage win.

Repro before/after

Titanic logreg from /tmp/ilo-persona-ml-engineer-rerun8/main.ilo, 16 runs of ilo --run-cranelift:

Before (v0.11.7): ~5/16 panics, breadcrumb printed but unprefixed, no flush, no counter, silent in any harness that merged stderr.

After:

[ilo:jit-fallback] Cranelift JIT panicked (assertion failed: (diff >> 26 == -1) || (diff >> 26 == 0)); falling back to bytecode VM. Subsequent fallbacks in this process will be silent (count reported on exit).

One tagged line per process, flushed, greppable.
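
On the harness side, detection can then be a plain substring match on captured output. A minimal sketch, assuming the harness has already collected stderr (or merged stdout and stderr) into a string; `saw_jit_fallback` is a hypothetical name.

```rust
/// Grep anchor emitted by the fallback helper (value from the PR description).
const JIT_FALLBACK_TAG: &str = "[ilo:jit-fallback]";

/// Returns true if any captured line carries the fallback tag.
/// Uses `contains` rather than `starts_with` so that a merged stream
/// with the tag appearing mid-line is still detected.
fn saw_jit_fallback(captured: &str) -> bool {
    captured.lines().any(|line| line.contains(JIT_FALLBACK_TAG))
}
```

A harness could use the result to relabel the run's timing column, so a fallback run is not reported as a Cranelift measurement.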

What's in the diff

  • add tagged once-per-process JIT panic fallback helper (src/vm/jit_cranelift.rs): note_jit_panic_fallback, jit_panic_fallback_count, reset_jit_panic_fallback_count, JIT_FALLBACK_TAG, plus a process-global atomic counter.
  • route Cranelift panic-fallback sites through the shared helper (src/main.rs): both fallback sites now call the helper. Existing tests strengthened to assert the counter increments; new jit_panic_fallback_counter_increments_per_panic test added; a test-scoped mutex serialises the three tests because the counter is process-global.
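
The test-scoped mutex can be sketched as follows. This is an illustration under assumptions, not the code in the diff: `COUNTER` stands in for the real process-global counter, and `COUNTER_TEST_LOCK`, `lock_and_reset`, and `simulate_fallbacks` are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Mutex, MutexGuard};

/// Stand-in for the process-global fallback counter.
static COUNTER: AtomicUsize = AtomicUsize::new(0);

/// Held for the whole body of every test that reads or resets the counter.
static COUNTER_TEST_LOCK: Mutex<()> = Mutex::new(());

/// Takes the lock and zeroes the counter. Recovers from poisoning so that
/// one failed test does not cascade into the others.
fn lock_and_reset() -> MutexGuard<'static, ()> {
    let guard = COUNTER_TEST_LOCK
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner());
    COUNTER.store(0, Ordering::Relaxed);
    guard
}

/// Shape of a counter test: bump the counter `n` times while holding the
/// lock and return what the counter reads afterwards.
fn simulate_fallbacks(n: usize) -> usize {
    let _guard = lock_and_reset(); // released when this function returns
    for _ in 0..n {
        COUNTER.fetch_add(1, Ordering::Relaxed);
    }
    COUNTER.load(Ordering::Relaxed)
}
```

Without the guard, two tests running in parallel could interleave their resets and increments and read each other's counts.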

Test plan

  • cargo test --features cranelift --bin ilo panic — 17 passed, breadcrumb printed for both fallback paths
  • Full suite cargo test --features cranelift — all passing
  • End-to-end: 16 runs of the titanic repro on the release binary, every panic now emits the tagged breadcrumb
  • cargo clippy --features cranelift --all-targets — clean
  • cargo fmt --all — clean

Follow-ups

  • Real codegen fix for the aarch64 BL displacement overflow: either force indirect calls on cross-function dispatch in the JIT module, or split very large generated modules into chunks. Multi-day work in the Cranelift backend. Tracked separately; this PR makes the symptom non-silent in the meantime.

The Cranelift aarch64 BL relocation assertion (cranelift-jit 0.116
compiled_blob.rs:90) fires non-deterministically on programs whose
JIT code-cache and helpers end up more than 128 MB apart. We already
catch the panic and fall back to a non-JIT engine, but the existing
breadcrumb was a free-form 'ilo: Cranelift JIT panicked ...' eprintln
emitted on every panic, which had two problems:

- Persona harnesses that buffer or merge stderr could drop it entirely,
  making the engine swap invisible in 'Cranelift' timing columns.
- Hot loops that hit the assertion repeatedly would spam stderr.

note_jit_panic_fallback centralises the visibility contract:

- Stable [ilo:jit-fallback] grep anchor so harnesses can detect the
  fallback without parsing free-form text.
- Once per process: first call prints, subsequent calls bump a
  silent atomic counter callers can read on exit.
- Explicit stderr flush so the line reaches the harness even if the
  process exits seconds later under a buffered stream.

jit_panic_fallback_count exposes the counter for test assertions and
end-of-run diagnostics; reset_jit_panic_fallback_count is debug-only
for tests.

Both run_cranelift_engine (explicit --run-cranelift) and run_default
(tiered execution) previously eprintln'd their own one-line breadcrumb
when the JIT panicked, with slightly different wording and no flush.
Replace both with note_jit_panic_fallback so the visibility contract
(tagged prefix, once-per-process, flushed stderr) is uniform.

Strengthen the existing fallback-path tests to assert the panic
counter increments, and add jit_panic_fallback_counter_increments_per_panic
to lock down the per-panic increment contract. A test-scoped
mutex serialises the three tests because the counter is process-
global and would otherwise flake under cargo test's parallel
runner.
@codecov

codecov Bot commented May 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

@danieljohnmorris danieljohnmorris merged commit 35b1cb7 into main May 17, 2026
5 checks passed
@danieljohnmorris danieljohnmorris deleted the fix/cl-aarch64-reloc-overflow branch May 17, 2026 19:47