fix: catch Cranelift JIT panics and fall back to non-JIT engines#319

Merged
danieljohnmorris merged 3 commits into
mainfrom
fix/cranelift-panic-fallback
May 16, 2026
@danieljohnmorris
Collaborator

Summary

Upstream cranelift-jit 0.116 has a non-deterministic AArch64 near-call relocation assertion (compiled_blob.rs:90 - (diff >> 26 == -1) || (diff >> 26 == 0)) that fires when the JIT code-cache and runtime memory end up more than ±64 MB apart. On macOS arm64 this surfaces as a hard process crash on the first run after a fresh cargo build, with no recovery path for the user. Five subsequent invocations of the same command line typically run clean, but the one that crashed killed the pipeline.
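
The quoted assertion can be read as a range check: after an arithmetic shift right by 26, only 0 or -1 may remain, so `diff` must lie in [-2^26, 2^26). The sketch below only reproduces that predicate on a plain `isize`. The function name is hypothetical, and treating `diff` as a byte offset (which would give the ±64 MB figure above) is an assumption, not something confirmed against the upstream source.

```rust
/// Same predicate as the quoted upstream assertion, lifted into a function.
/// `passes_near_call_check` is a hypothetical name used for illustration.
fn passes_near_call_check(diff: isize) -> bool {
    // Arithmetic shift: non-negative values below 2^26 become 0,
    // negative values down to -2^26 become -1.
    (diff >> 26 == -1) || (diff >> 26 == 0)
}

fn main() {
    let limit: isize = 1 << 26; // 67_108_864

    // Inside the accepted half-open range [-2^26, 2^26).
    assert!(passes_near_call_check(0));
    assert!(passes_near_call_check(limit - 1));
    assert!(passes_near_call_check(-limit));

    // One step outside the range on either side trips the check.
    assert!(!passes_near_call_check(limit));
    assert!(!passes_near_call_check(-limit - 1));

    println!("accepted range: [{}, {})", -limit, limit);
}
```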

Wrapping the Cranelift JIT dispatch in std::panic::catch_unwind converts the crash into a measurable diagnostic plus a slower-but-correct run on the next engine down. The panic payload is surfaced verbatim in a single-line stderr breadcrumb so the upstream issue stays searchable rather than degrading silently into the fallback engine.
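
A minimal sketch of the wrapping idea, not the code in this PR: `guarded_call` and `payload_to_string` are hypothetical names, and the enum is reduced to the one variant that matters here. It shows how `std::panic::catch_unwind` turns an unwinding panic into an error value that still carries the payload text. It has no effect when the binary is built with `panic = "abort"`.

```rust
use std::any::Any;
use std::panic::{self, AssertUnwindSafe};

// Reduced stand-in for the PR's error type; only the Panic variant is shown.
#[derive(Debug, PartialEq)]
enum JitCallError {
    Panic { msg: String },
}

// Panic payloads are usually `&'static str` or `String`; anything else
// gets a generic placeholder (assumption: the PR may handle this differently).
fn payload_to_string(payload: Box<dyn Any + Send>) -> String {
    if let Some(s) = payload.downcast_ref::<&'static str>() {
        (*s).to_string()
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.clone()
    } else {
        "non-string panic payload".to_string()
    }
}

// Hypothetical wrapper: run `f`, converting a panic into JitCallError::Panic.
fn guarded_call<F: FnOnce() -> f64>(f: F) -> Result<f64, JitCallError> {
    panic::catch_unwind(AssertUnwindSafe(f)).map_err(|payload| JitCallError::Panic {
        msg: payload_to_string(payload),
    })
}

fn main() {
    // Normal path: the value passes straight through.
    assert_eq!(guarded_call(|| 2.0 * 5.0), Ok(10.0));

    // Panic path: the process survives and the payload is preserved verbatim.
    match guarded_call(|| panic!("synthetic cranelift panic")) {
        Err(JitCallError::Panic { msg }) => println!("caught: {msg}"),
        other => println!("unexpected: {other:?}"),
    }
}
```

Without a custom hook, the default panic message is still printed to stderr before the error is returned; suppressing that is a separate concern from catching the unwind.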

Repro before/after

Before this PR, on macOS arm64 (intermittent, first run after fresh build):

$ ilo main.ilo main
thread 'main' panicked at .../cranelift-jit-0.116.1/src/compiled_blob.rs:90:21:
assertion failed: (diff >> 26 == -1) || (diff >> 26 == 0)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[exit 134]

After this PR, same panic forced via the debug-build env-var hook:

$ ILO_FORCE_JIT_PANIC=1 ilo 'f x:n>n;*x 2' f 5
ilo: Cranelift JIT panicked (synthetic cranelift panic (ILO_FORCE_JIT_PANIC)); falling back to interpreter
10
[exit 0]

One clean breadcrumb on stderr, correct program output on stdout, exit 0.

What's in the diff

Commit 1 - jit: wrap Cranelift dispatch in catch_unwind, add Panic variant

  • JitCallError::Panic { msg } joins NotEligible and Runtime in src/vm/jit_cranelift.rs.
  • compile_and_call now wraps the whole dispatch in std::panic::catch_unwind with a thread-scoped panic-hook chain installed once via Once. The hook reads IN_JIT_DISPATCH (a thread-local) and suppresses the default backtrace only when the calling thread is inside the JIT entry. Concurrent JIT dispatches on other threads do not race on set_hook, and panics outside the JIT keep their default rendering.
  • Defensively resets the JIT bump arena and drains the runtime-error TLS cell on the panic arm so leftover state from a mid-call panic does not leak into the next invocation.
  • Saves and restores the IN_JIT_DISPATCH flag around each dispatch, so a nested compile_and_call (none today, but a JIT helper could legitimately re-enter) does not clobber the outer call's flag on return.
  • Debug-build-only test hooks: FORCE_PANIC_FOR_TEST thread-local and ILO_FORCE_JIT_PANIC env var, both gated on cfg(debug_assertions) so release binaries do not carry them.
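
The hook scoping and the save-restore behaviour described in the bullets above can be sketched as follows. This is an illustration under assumptions rather than the PR's implementation: `IN_JIT_DISPATCH` is the name used in the PR, while `INSTALL_HOOK`, `DispatchGuard`, `install_hook_once`, `in_dispatch` and `guarded` are hypothetical.

```rust
use std::cell::Cell;
use std::panic::{self, AssertUnwindSafe};
use std::sync::Once;

thread_local! {
    // True only while the current thread is inside the JIT entry.
    static IN_JIT_DISPATCH: Cell<bool> = Cell::new(false);
}

static INSTALL_HOOK: Once = Once::new();

// Install the chained hook exactly once per process, so concurrent callers
// never interleave take_hook/set_hook pairs.
fn install_hook_once() {
    INSTALL_HOOK.call_once(|| {
        let previous = panic::take_hook();
        panic::set_hook(Box::new(move |info| {
            // Stay silent only for panics raised on a thread inside the JIT;
            // every other panic keeps the previously installed rendering.
            if !IN_JIT_DISPATCH.with(|flag| flag.get()) {
                previous(info);
            }
        }));
    });
}

// RAII guard: remembers the previous flag value and restores it on drop,
// so a nested dispatch cannot clear the outer dispatch's flag.
struct DispatchGuard {
    saved: bool,
}

impl DispatchGuard {
    fn enter() -> Self {
        let saved = IN_JIT_DISPATCH.with(|flag| flag.replace(true));
        DispatchGuard { saved }
    }
}

impl Drop for DispatchGuard {
    fn drop(&mut self) {
        IN_JIT_DISPATCH.with(|flag| flag.set(self.saved));
    }
}

fn in_dispatch() -> bool {
    IN_JIT_DISPATCH.with(|flag| flag.get())
}

// Hypothetical entry point combining the hook, the guard and catch_unwind.
fn guarded<R, F: FnOnce() -> R>(f: F) -> std::thread::Result<R> {
    install_hook_once();
    let _guard = DispatchGuard::enter();
    panic::catch_unwind(AssertUnwindSafe(f))
}

fn main() {
    let outcome = guarded(|| -> i32 { panic!("synthetic panic inside dispatch") });
    assert!(outcome.is_err());
    assert!(!in_dispatch()); // flag restored after the call returns
    println!("panic contained; flag restored");
}
```

The guard restores the saved value instead of writing `false`, which is what keeps re-entrant calls from clobbering the outer flag.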

Commit 2 - cli: fall back to VM/tree on Cranelift panic, with stderr breadcrumb

  • run_default (the default file/inline path) falls through to the tree interpreter, matching the existing NotEligible arm.
  • run_cranelift_engine (explicit --run-cranelift) falls back to the bytecode VM via vm::run, since the user opted into a JIT engine and VM is the closest non-JIT tier.
  • Both paths emit a single-line breadcrumb including the panic payload.
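
The two dispatch paths share one shape, sketched below with closures standing in for the engines. `dispatch_with_fallback` and `breadcrumb` are hypothetical names. The "interpreter" breadcrumb wording follows the sample output in the repro section, while the label used for the VM path, the `String` payload of `Runtime`, and the handling of the `Runtime` arm are assumptions.

```rust
// Variant names follow the PR description; the Runtime payload type is assumed.
#[derive(Debug)]
enum JitCallError {
    NotEligible,
    Runtime(String),
    Panic { msg: String },
}

// Single-line breadcrumb carrying the panic payload verbatim.
fn breadcrumb(msg: &str, fallback: &str) -> String {
    format!("ilo: Cranelift JIT panicked ({msg}); falling back to {fallback}")
}

// Try the JIT first; on NotEligible or Panic, run the next engine down.
fn dispatch_with_fallback(
    jit: impl FnOnce() -> Result<f64, JitCallError>,
    fallback_name: &str,
    fallback: impl FnOnce() -> f64,
) -> Result<f64, String> {
    match jit() {
        Ok(value) => Ok(value),
        // Quiet fallback: the program simply was not JIT-eligible.
        Err(JitCallError::NotEligible) => Ok(fallback()),
        // Loud fallback: keep the upstream panic visible on stderr.
        Err(JitCallError::Panic { msg }) => {
            eprintln!("{}", breadcrumb(&msg, fallback_name));
            Ok(fallback())
        }
        // Assumption: genuine runtime errors are reported, not retried.
        Err(JitCallError::Runtime(err)) => Err(err),
    }
}

fn main() {
    // Default path: a panic in the JIT falls through to the tree interpreter.
    let panicked = || Err(JitCallError::Panic { msg: "synthetic".to_string() });
    assert_eq!(dispatch_with_fallback(panicked, "interpreter", || 10.0), Ok(10.0));

    // Not eligible: same fallback, no breadcrumb.
    let skipped = || Err(JitCallError::NotEligible);
    assert_eq!(dispatch_with_fallback(skipped, "interpreter", || 10.0), Ok(10.0));

    // Runtime errors propagate to the caller.
    let failed = || Err(JitCallError::Runtime("division by zero".to_string()));
    assert!(dispatch_with_fallback(failed, "interpreter", || 0.0).is_err());
}
```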

Commit 3 - test: cover Cranelift panic fallback across engines

  • Three integration tests in tests/regression_cranelift_panic_fallback.rs driving the binary with ILO_FORCE_JIT_PANIC=1:
    • default engine falls through to tree, exit 0, breadcrumb mentions interpreter
    • --run-cranelift falls back to VM, exit 0, breadcrumb mentions VM
    • breadcrumb includes the panic payload
  • Three Rust-level tests using the FORCE_PANIC_FOR_TEST thread-local: one in src/vm/jit_cranelift.rs (asserts compile_and_call returns Panic and the next call recovers normally), two in src/main.rs (assert run_default and run_cranelift_engine exit 0 on fallback).
  • examples/cranelift-panic-fallback.ilo pins the same numeric pipeline shape across tree and VM via the existing examples_engines harness.

Decisions

  • catch_unwind rather than bumping cranelift-jit to 0.131. Bulletproof regardless of upstream state, no API churn, contained blast radius. A version bump can come later as a separate PR if 0.117+ ships the fix.
  • Panic hook scoped by thread-local, installed via Once. Avoids the global set_hook/take_hook race that the obvious "save and restore around the catch" approach would have introduced. cargo test runs multi-threaded by default; embedding ilo as a library is a foreseeable use case.
  • Different fallback per dispatch path. Default engine falls to tree (consistent with NotEligible shape, same user expectation - JIT is an optimisation). --run-cranelift falls to VM (the user asked for an optimised engine; tree would be a much larger surprise).
  • Test hooks gated on debug_assertions. Release binaries never carry the synthetic-panic path; the only behaviour change in release is the catch_unwind wrapper itself.
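
One way the debug-only gating could look, as a sketch: `FORCE_PANIC_FOR_TEST` and `ILO_FORCE_JIT_PANIC` are the names from this PR, while `force_panic_for_test` and `synthetic_panic_requested` are hypothetical helpers. Treating any set value of the env var as "on" is also an assumption.

```rust
// Debug builds only: the thread-local used by in-process tests.
#[cfg(debug_assertions)]
thread_local! {
    static FORCE_PANIC_FOR_TEST: std::cell::Cell<bool> = std::cell::Cell::new(false);
}

// Debug builds only: toggle the thread-local hook (hypothetical helper).
#[cfg(debug_assertions)]
fn force_panic_for_test(on: bool) {
    FORCE_PANIC_FOR_TEST.with(|flag| flag.set(on));
}

// Debug builds: either hook requests a synthetic panic.
// Assumption: presence of the env var is enough, whatever its value.
#[cfg(debug_assertions)]
fn synthetic_panic_requested() -> bool {
    FORCE_PANIC_FOR_TEST.with(|flag| flag.get())
        || std::env::var_os("ILO_FORCE_JIT_PANIC").is_some()
}

// Release builds: the hooks are compiled out and the answer is constant.
#[cfg(not(debug_assertions))]
fn synthetic_panic_requested() -> bool {
    false
}

fn main() {
    #[cfg(debug_assertions)]
    {
        force_panic_for_test(true);
        assert!(synthetic_panic_requested());
        force_panic_for_test(false);
    }
    println!("synthetic panic requested: {}", synthetic_panic_requested());
}
```

Using `#[cfg(...)]` attributes rather than a runtime `cfg!` check means the thread-local and the env-var lookup are absent from release binaries, not merely unreachable.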

Test plan

  • cargo test --release --features cranelift - full suite green
  • cargo clippy --features cranelift --tests --release -- -D warnings - clean
  • cargo fmt --all -- --check - clean
  • New tests pass: cranelift_compile_and_call_catches_panic, run_default_cranelift_panic_falls_back_to_interpreter, run_cranelift_engine_panic_falls_back_to_vm, cranelift_panic_default_falls_back_to_interpreter, cranelift_panic_explicit_engine_falls_back_to_vm, cranelift_panic_breadcrumb_includes_payload
  • Manual: ILO_FORCE_JIT_PANIC=1 ilo 'f x:n>n;*x 2' f 5 prints breadcrumb on stderr and 10 on stdout, exits 0

Follow-ups

  • Evaluate bumping cranelift-jit to 0.117+ in a separate PR if the upstream changelog confirms the AArch64 reloc fix. The catch_unwind wrapper remains valuable regardless as a defence-in-depth measure.

Upstream cranelift-jit 0.116 has an AArch64 near-call relocation
assertion (compiled_blob.rs:90) that fires non-deterministically when
the JIT code-cache and runtime memory end up more than ±64 MB apart.
On a first run after a fresh cargo build that surfaces as a hard
process crash on macOS arm64, with no recovery path for the user.

Add JitCallError::Panic { msg } and wrap compile_and_call in
std::panic::catch_unwind so the panic becomes a recoverable error
the dispatcher can fall back from. The panic-hook chain is installed
once via Once and scoped by a thread-local IN_JIT_DISPATCH flag, so
concurrent JIT dispatches on other threads do not race on set_hook
and panics outside the JIT keep their default rendering.

Defensively resets the JIT bump arena and drains the runtime-error
TLS cell in the panic arm: if the panic fired mid-call() after some
helper allocations but before the normal tail reset, those would
otherwise leak into the next invocation.
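
The cleanup on the panic arm can be illustrated with stand-in state. Everything named here (`ARENA`, `RUNTIME_ERROR`, `reset_call_state`, `helper_alloc`, `call_with_cleanup`) is hypothetical; the real bump arena and error cell are not shown in this PR description.

```rust
use std::cell::RefCell;
use std::panic::{self, AssertUnwindSafe};

thread_local! {
    // Hypothetical stand-ins for the JIT bump arena and the runtime-error cell.
    static ARENA: RefCell<Vec<u8>> = RefCell::new(Vec::new());
    static RUNTIME_ERROR: RefCell<Option<String>> = RefCell::new(None);
}

// Simulates a JIT helper allocating scratch memory mid-call.
fn helper_alloc(bytes: usize) {
    ARENA.with(|arena| arena.borrow_mut().resize(bytes, 0));
}

fn arena_len() -> usize {
    ARENA.with(|arena| arena.borrow().len())
}

// Clear per-call state: empty the arena and drain any pending error.
fn reset_call_state() {
    ARENA.with(|arena| arena.borrow_mut().clear());
    RUNTIME_ERROR.with(|cell| {
        cell.borrow_mut().take();
    });
}

// Reset on both arms, so a panic between a helper allocation and the
// normal tail reset cannot leak state into the next call.
fn call_with_cleanup<F: FnOnce() -> f64>(f: F) -> Result<f64, String> {
    let outcome = panic::catch_unwind(AssertUnwindSafe(f));
    reset_call_state();
    outcome.map_err(|_| "jit panicked".to_string())
}

fn main() {
    // Panic after an allocation: the arena is still empty afterwards.
    let result = call_with_cleanup(|| {
        helper_alloc(16);
        panic!("mid-call panic")
    });
    assert!(result.is_err());
    assert_eq!(arena_len(), 0);

    // The next invocation starts from clean state and succeeds.
    assert_eq!(call_with_cleanup(|| 10.0), Ok(10.0));
}
```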

Debug-build-only test hooks (FORCE_PANIC_FOR_TEST thread-local and
ILO_FORCE_JIT_PANIC env var) raise a synthetic panic at the same
call site so tests can exercise the fallback path without depending
on the AArch64-specific upstream bug. Both gated on debug_assertions
so release binaries do not carry them.

Handle JitCallError::Panic in both engine dispatch paths.

run_default (the default file/inline path) falls through to the tree
interpreter, matching the existing NotEligible arm. run_cranelift_engine
(explicit --run-cranelift) falls back to the bytecode VM, since the
user opted into a JIT engine and VM is the closest non-JIT tier.

Both paths emit a single-line stderr breadcrumb including the panic
payload so the upstream cranelift issue stays measurable in production
logs rather than collapsing into a generic message or a silent engine
swap.

Three integration tests using ILO_FORCE_JIT_PANIC=1 against the
ilo binary:

- default engine falls through to the tree interpreter and produces
  the expected program output
- --run-cranelift falls back to the bytecode VM
- the stderr breadcrumb includes the panic payload so the upstream
  issue stays searchable in production logs

Plus a cross-engine examples/*.ilo pin so the broader regression
harness exercises the same numeric pipeline shape on tree and VM.

Integration tests gated on cfg(debug_assertions) since the env-var
hook only exists in debug builds.
@codecov

codecov Bot commented May 16, 2026

Codecov Report

❌ Patch coverage is 92.56198% with 9 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines    Patch %    Lines
src/main.rs                 89.58%     5 Missing ⚠️
src/vm/jit_cranelift.rs     94.52%     4 Missing ⚠️

@danieljohnmorris danieljohnmorris merged commit 5207f7e into main May 16, 2026
5 checks passed
@danieljohnmorris danieljohnmorris deleted the fix/cranelift-panic-fallback branch May 16, 2026 22:42
danieljohnmorris added a commit that referenced this pull request May 16, 2026
eight fixes since 0.11.4, all from rerun5 personas: bare-bang silent-nil regression (#324), Cranelift AArch64 panic catch_unwind fallback (#319), multi-line body span drift (#318), HOF tree-bridge error parity on Cranelift (#321), bool-ternary brace sugar (#323), single-line body diagnostic with brace-block bodies (#322), unknown-subcommand error in multi-fn files (#320), window perf cliff fused flt/map (#325).