fix: cranelift JIT srt-after-map TLS desync silent miscompile #306

Merged: danieljohnmorris merged 2 commits into main from fix/srt-cranelift-nil on May 16, 2026
@danieljohnmorris
Collaborator

Summary

Cranelift JIT silently returned nil from srt fn xs (and grp / uniqby / partition) when another HOF with an inline-lambda callback had run earlier in the same function. Tree and VM produced the correct sorted list. Caught by pdf-analyst rerun4 via the frq + mget + srt pattern, then minimised down to two inline lambdas in the same function body.

Repro

main>_;a=map (x:n>n;+x 1) [1 2 3];sk=srt (p:L _>n;p.0) [[2 "a"] [1 "b"]];sk

Before:

  • --run-tree: [[1, b], [2, a]]
  • --run-vm: [[1, b], [2, a]]
  • --run-cranelift: nil

After: all three engines agree on [[1, b], [2, a]].

Root cause

The four TLS slots ACTIVE_REGISTRY, ACTIVE_FUNC_NAMES, ACTIVE_PROGRAM, ACTIVE_AST_PROGRAM had drop guards that NULL'd the slot unconditionally on scope exit. That is safe at the top-level Cranelift entry, but a per-element HOF callback re-enters the VM via jit_call_dyn -> VM::new(program).call(...); the inner execute()'s guards then NULL the slots when the inner returns, leaving the outer JIT with no TLS state.

The next tree-bridge HOF in the same Cranelift entry saw null ACTIVE_FUNC_NAMES, deserialised its FnRef arg to a synthetic <user_fn:N> placeholder Text, failed to dispatch it through the tree interpreter, and the bridge swallowed the failure as TAG_NIL (legacy nil-sweep gap, filed separately).

Regressed when the HOF dispatch chain (#277 / #278 / #279 / #280 / #283) introduced the sub-VM re-entry path. Pre-#277, map was tree-bridge-only and the inner-VM re-entry didn't exist, so the TLS desync never happened.
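
The failure mode described above can be reproduced in miniature. The following is a minimal sketch, not the project's code: `ACTIVE_SLOT`, `ClearingGuard`, and `outer_with_reentry` are hypothetical names, and the slot is modelled as an `Option<u32>` rather than a raw pointer. It shows how a guard that clears unconditionally on drop wipes the outer scope's state once a nested scope has installed and dropped its own guard.

```rust
use std::cell::Cell;

thread_local! {
    // Hypothetical stand-in for one slot such as ACTIVE_FUNC_NAMES.
    // The real slots hold pointers; an optional id is enough to show the bug.
    static ACTIVE_SLOT: Cell<Option<u32>> = Cell::new(None);
}

/// Guard that installs a value and clears the slot unconditionally on drop
/// (the pre-fix behaviour, as described in the root-cause notes).
struct ClearingGuard;

impl ClearingGuard {
    fn install(value: u32) -> Self {
        ACTIVE_SLOT.with(|slot| slot.set(Some(value)));
        ClearingGuard
    }
}

impl Drop for ClearingGuard {
    fn drop(&mut self) {
        // No memory of what was there before: always reset to "null".
        ACTIVE_SLOT.with(|slot| slot.set(None));
    }
}

fn current() -> Option<u32> {
    ACTIVE_SLOT.with(|slot| slot.get())
}

/// Simulates an outer entry whose callback re-enters an inner execute().
/// Returns what the outer scope observes before and after the inner call.
fn outer_with_reentry() -> (Option<u32>, Option<u32>) {
    let _outer = ClearingGuard::install(1);
    let before = current();
    {
        // Nested re-entry installs its own state...
        let _inner = ClearingGuard::install(2);
    } // ...and its drop wipes the slot, although the outer scope is still live.
    let after = current();
    (before, after)
}
```

In this model the outer scope sees its own value before the nested call and an empty slot afterwards, which is the state the next tree-bridge HOF would then observe.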

What is in the diff

Commit 1 — vm: save-restore TLS guards. The four guards now snapshot the previous pointer at construction and restore it on drop. Linear stack discipline across arbitrary nesting; the outermost restore still ends at null. clear_active_registry() retained for the AOT runtime tear-down path where there is no prior pointer to restore.
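
As an illustration of the save-restore discipline in Commit 1, here is a minimal sketch under simplifying assumptions: `ACTIVE_SLOT`, `RestoringGuard`, and `nested_trace` are hypothetical names, and the slot holds an `Option<u32>` instead of a raw pointer. The guard snapshots the previous value at construction and writes it back on drop, so nested scopes unwind in stack order.

```rust
use std::cell::Cell;

thread_local! {
    // Hypothetical stand-in for one of the four TLS slots.
    static ACTIVE_SLOT: Cell<Option<u32>> = Cell::new(None);
}

/// Guard that remembers the previous slot value and restores it on drop.
struct RestoringGuard {
    prev: Option<u32>,
}

impl RestoringGuard {
    fn install(value: u32) -> Self {
        // `replace` stores the new value and hands back the old one in one step.
        let prev = ACTIVE_SLOT.with(|slot| slot.replace(Some(value)));
        RestoringGuard { prev }
    }
}

impl Drop for RestoringGuard {
    fn drop(&mut self) {
        // Restore whatever the enclosing scope had installed (None at top level).
        ACTIVE_SLOT.with(|slot| slot.set(self.prev));
    }
}

fn current() -> Option<u32> {
    ACTIVE_SLOT.with(|slot| slot.get())
}

/// Records the slot value at each step of a two-level nesting.
fn nested_trace() -> Vec<Option<u32>> {
    let mut seen = vec![current()]; // before any guard
    {
        let _outer = RestoringGuard::install(1);
        seen.push(current()); // inside outer
        {
            let _inner = RestoringGuard::install(2);
            seen.push(current()); // inside inner
        }
        seen.push(current()); // back in outer: value restored
    }
    seen.push(current()); // after outermost drop: back to None
    seen
}
```

On a thread where the slot starts empty, `nested_trace()` yields `None, Some(1), Some(2), Some(1), None`: the outer value survives the inner scope, and the outermost restore still ends at the empty state.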

Commit 2 — test: cross-engine regression. Nine new tests in regression_lambdas_cross_engine.rs cover every native HOF (map/flt/fld/flatmap) followed by every tree-bridge HOF (srt/grp/uniqby/partition), plus the original pdf-analyst frq + mget shape and a native->bridge->native sandwich. New examples/srt-after-map-inline-lambda.ilo exercises the same pattern through tests/examples_engines.rs so the higher-level harness catches future drift too.

Test plan

  • Repro from pdf-analyst rerun4 (min2.ilo) now matches tree/VM on Cranelift
  • Nine new cross-engine regression tests pass on tree / VM / Cranelift
  • New examples/srt-after-map-inline-lambda.ilo runs identically on every engine via tests/examples_engines.rs
  • Full cargo test --release --features cranelift green (3071 + 24 + 30 + ... passes, 0 failures)
  • cargo fmt --check and cargo clippy --release --features cranelift -- -D warnings clean

Follow-ups (filed, out of scope here)

  • Promote tree-bridge errors from silent TAG_NIL to runtime errors for HOFs (srt, grp, uniqby, partition) so a callback failure surfaces as ILO-R instead of nil. Independent of this PR; logged in ilo_assessment_feedback.md parked section.
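
To make the difference between the two behaviours concrete, the sketch below contrasts a bridge that swallows a callback failure as nil with one that propagates it. This is only an illustration under stated assumptions: `Value`, `BridgeError`, `sort_by_key_checked`, and `sort_by_key_legacy` are hypothetical names, not the project's types or API.

```rust
/// Hypothetical runtime value: just enough to show nil versus a real result.
#[derive(Debug, PartialEq)]
enum Value {
    Nil,
    List(Vec<i64>),
}

/// Hypothetical error raised when a callback cannot be dispatched.
#[derive(Debug, PartialEq)]
struct BridgeError(String);

/// Sorts by a fallible key callback and propagates the first failure.
fn sort_by_key_checked(
    xs: &[i64],
    key: impl Fn(i64) -> Result<i64, BridgeError>,
) -> Result<Value, BridgeError> {
    let mut keyed = Vec::with_capacity(xs.len());
    for &x in xs {
        // `?` surfaces the callback failure instead of hiding it.
        keyed.push((key(x)?, x));
    }
    keyed.sort_by_key(|&(k, _)| k);
    Ok(Value::List(keyed.into_iter().map(|(_, x)| x).collect()))
}

/// Legacy-style wrapper: any failure collapses to nil, so the caller
/// cannot tell a broken callback from a genuine nil result.
fn sort_by_key_legacy(
    xs: &[i64],
    key: impl Fn(i64) -> Result<i64, BridgeError>,
) -> Value {
    sort_by_key_checked(xs, key).unwrap_or(Value::Nil)
}
```

With a callback that always fails, the legacy wrapper returns `Value::Nil` while the checked version returns the error, which is the distinction this follow-up is about.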

… JIT state

The four TLS slots (ACTIVE_REGISTRY, ACTIVE_FUNC_NAMES, ACTIVE_PROGRAM,
ACTIVE_AST_PROGRAM) were cleared to null on guard drop. That's fine at
the top-level Cranelift entry, but a per-element HOF callback re-enters
the VM via jit_call_dyn -> VM::new(program).call(...); the inner
execute()'s guards then null the slots when the inner returns, leaving
the outer JIT with no TLS state.

The next tree-bridge HOF in the same Cranelift entry would see null
ACTIVE_FUNC_NAMES, deserialise its FnRef arg to a synthetic
<user_fn:N> placeholder Text, fail to dispatch it through the tree
interpreter, and the bridge would swallow the failure as TAG_NIL.
Manifested as a silent miscompile: srt-after-map returned nil on
Cranelift, correct on tree/VM. pdf-analyst rerun4 caught it via the
frq + mget + srt pattern.

Guards now snapshot the previous pointer at construction and restore
it on drop. Linear stack discipline across arbitrary nesting depth;
top-level restore still ends at null since the outer caller's prev
was null.

clear_active_registry() retained for the AOT runtime tear-down path
where there is no prior pointer to restore.

Nine new tests in regression_lambdas_cross_engine.rs cover every
combination of a native-dispatch HOF (map/flt/fld/flatmap) followed
by a tree-bridge HOF (srt/grp/uniqby/partition) in the same function
body. Pre-fix, the second HOF returned nil on Cranelift while tree
and VM produced the correct result; post-fix all three engines
agree.

Includes the original pdf-analyst rerun4 shape (frq + mget loop
building [count word] pairs, then srt-by-count) and a
native->bridge->native sandwich to confirm the fix doesn't only
help the bridge call.

New example examples/srt-after-map-inline-lambda.ilo exercises the
same pattern through tests/examples_engines.rs so any future
regression breaks at the higher-level harness too.
@codecov
codecov Bot commented May 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

@danieljohnmorris danieljohnmorris merged commit 684189c into main May 16, 2026
5 checks passed
@danieljohnmorris danieljohnmorris deleted the fix/srt-cranelift-nil branch May 16, 2026 15:38
danieljohnmorris added a commit that referenced this pull request May 16, 2026
three P0 fixes since 0.11.3, all surfaced by rerun4 personas: srt-cranelift TLS desync silent miscompile (#306), CLI auto-run for main and inline programs restored (#307), OP_LISTAPPEND O(n^2) memory regression in in-process Cranelift JIT (#308).
danieljohnmorris added a commit that referenced this pull request May 16, 2026
twelve fixes since 0.11.3, surfaced by rerun4 personas plus standing asks: srt-Cranelift TLS desync (#306), CLI auto-run restoration (#307), OP_LISTAPPEND O(n^2) JIT memory regression (#308), precedence-pair hint false-positive on parens (#309), prefix ?? accepts call expression (#310), += pure-shape docs (#311), bare-mutation silent no-op verifier warning ILO-T033 (#312), asin/acos/atan inverse trig builtins (#313), flat cross-engine (#314), cond{~v} discard hint multi-stmt false-positive (#315), rsrt fn xs key-fn overloads (#316), xs.(expr) paren-after-dot diagnostic hint (#317).