fix: Cranelift aarch64 reloc-overflow fallback is now visible (no silent VM)#354
Merged
Merged
Conversation
The Cranelift aarch64 BL relocation assertion (cranelift-jit 0.116 compiled_blob.rs:90) fires non-deterministically on programs whose JIT code-cache and helpers end up >±128 MB apart. We already catch the panic and fall back to a non-JIT engine, but the existing breadcrumb was a free-form 'ilo: Cranelift JIT panicked ...' eprintln emitted on every panic, which: - Persona harnesses that buffer or merge stderr could drop entirely, making the engine swap invisible in 'Cranelift' timing columns. - Hot loops that hit the assertion repeatedly would spam stderr. note_jit_panic_fallback centralises the visibility contract: - Stable [ilo:jit-fallback] grep anchor so harnesses can detect the fallback without parsing free-form text. - Once per process: first call prints, subsequent calls bump a silent atomic counter callers can read on exit. - Explicit stderr flush so the line reaches the harness even if the process exits seconds later under a buffered stream. jit_panic_fallback_count exposes the counter for test assertions and end-of-run diagnostics; reset_jit_panic_fallback_count is debug-only for tests.
Both run_cranelift_engine (explicit --run-cranelift) and run_default (tiered execution) previously eprintln'd their own one-line breadcrumb when the JIT panicked, with slightly different wording and no flush. Replace both with note_jit_panic_fallback so the visibility contract (tagged prefix, once-per-process, flushed stderr) is uniform. Strengthen the existing fallback-path tests to assert the panic counter increments, and add jit_panic_fallback_counter_increments_per_panic to lock down the per-panic increment contract. A test-scoped mutex serialises the three tests because the counter is process- global and would otherwise flake under cargo test's parallel runner.
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Upstream cranelift-jit 0.116 has an aarch64 BL relocation assertion (
compiled_blob.rs:90) that fires non-deterministically when the JIT code-cache and helper symbols end up >±128 MB apart. We already catch the panic and fall back to the bytecode VM (--run-cranelift) or the tree interpreter (default tiered). The catch isn't the problem; the breadcrumb was.The old breadcrumb was a free-form
eprintln!("ilo: Cranelift JIT panicked ({}); falling back to ...")emitted on every panic, with two real problems:This PR centralises the visibility contract in
vm::jit_cranelift::note_jit_panic_fallback:[ilo:jit-fallback]grep anchor so harnesses can detect the fallback without parsing free-form text.Both fallback sites in
main.rsroute through the helper. The existing fallback-path tests now assert the counter increments; a new test locks down the per-panic increment contract.The real codegen fix (forcing indirect calls when displacement may overflow, or chunked trampolines) is parked as a follow-up. It is a Cranelift-backend rewrite measured in days, and the panic is already recoverable on every path. Shipping visibility now is the higher-leverage win.
Repro before/after
Titanic logreg from
/tmp/ilo-persona-ml-engineer-rerun8/main.ilo, 16 runs ofilo --run-cranelift:Before (v0.11.7): ~5/16 panics, breadcrumb printed but unprefixed, no flush, no counter, silent in any harness that merged stderr.
After:
One tagged line per process, flushed, greppable.
What's in the diff
add tagged once-per-process JIT panic fallback helper(src/vm/jit_cranelift.rs):note_jit_panic_fallback,jit_panic_fallback_count,reset_jit_panic_fallback_count,JIT_FALLBACK_TAG, plus a process-global atomic counter.route Cranelift panic-fallback sites through the shared helper(src/main.rs): both fallback sites now call the helper. Existing tests strengthened to assert the counter increments; newjit_panic_fallback_counter_increments_per_panictest added; a test-scoped mutex serialises the three tests because the counter is process-global.Test plan
cargo test --features cranelift --bin ilo panic— 17 passed, breadcrumb printed for both fallback pathscargo test --features cranelift— all passingcargo clippy --features cranelift --all-targets— cleancargo fmt --all— cleanFollow-ups