fix(perf): close helper-fn mset O(N·K) cliff with OP_CALL_OWN1 + tail-mset rewrite#440
Merged
Merged
Conversation
Adds the two opcode constants for the move-not-clone fast path that closes the helper-fn mset perf cliff. No dispatch wiring yet; the constants are referenced in the next commit when the compiler peephole starts emitting them.
OP_MOVE_OWN copies the NanVal bit pattern from R[B] to R[A] without touching any RC, then nils the source slot so frame teardown doesn't double-drop. Guarded against the a==b degenerate case. OP_CALL_OWN1 is a structural copy of OP_CALL except the first arg is pushed onto the callee's stack without clone_rc, and its caller-side register is cleared to Nil. The other args still pay the normal clone_rc on push. Encoding matches OP_CALL exactly (ABx with the func_idx in the high byte and argc in the low byte) so the compiler can swap one for the other without other changes. No emitter yet, so neither arm is reached at runtime; this is a no-op behaviour change.
The let-stmt peephole now detects the canonical helper-call accumulator pattern - name = fn(name, ...) for a static user fn - and emits the move-not-clone call path. The first arg threads into the callee at the caller's RC (no clone_rc bump), so an accumulator that lives at RC=1 in the caller stays at RC=1 inside the helper. Audited skip cases: builtins (caught by is_builtin), local fn refs and closures (caught by resolve_local, same dispatch shunt OP_CALL_DYN uses), and the !/!! unwrap modes (left on the normal OP_CALL path). OP_MSET's in-place fast path inside the callee still requires the a == b register form, so this commit alone doesn't close the cliff - the next commit relaxes the mset guard to allow a != b under RC=1.
The PR #249 in-place fast path required both a == b AND rc_count == 1. The a == b half is essentially a proxy for 'no other live reference' - useful when the compiler emitted name = mset name k v at the top level, where the result lands in the same register as the source. But under move-semantics (OP_CALL_OWN1 / OP_MOVE_OWN), it's safe to mutate in place whenever RC == 1, regardless of register form: we own the only reference. The a != b path now transfers the heap pointer from R[b] to R[a] (drop_rc the previous R[a], install map_v, nil R[b]). This is the second half of the helper-fn mset cliff fix. Inside the callee, mset m k v as a tail expression compiles to OP_MSET with a != b (b is the param register, a is a fresh result). With OP_CALL_OWN1 threading the map in at RC=1, this commit lets the callee actually take the in-place path.
…guard" This reverts commit 60893b0.
When mset m k v sits at the tail of a function body AND args[0] is a direct Ref to a local register, the source map dies at OP_RET. We can safely reuse the local's register as both result and source of OP_MSET, which fires the existing a == b && rc_count == 1 in-place fast path. This is the second half of the helper-fn perf fix. OP_CALL_OWN1 threads the accumulator into the helper at RC=1; this commit makes the helper's tail mset actually take the in-place path instead of cloning the whole HashMap. Liveness is conservative - only direct refs at tail position, only when resolve_local hits. Non-tail mset (m2 = mset m k v then read m) stays on the slow path because the source isn't statically dead. The in_tail_position flag propagates through compile_body so only the final stmt of the function root sees it set.
Both opcodes lower to the same IR as their OP_MOVE / OP_CALL siblings. In the Cranelift model registers are SSA Variables and args flow as values - there's no per-push clone_rc on a runtime stack like the VM has - so the move-vs-clone-on-push distinction collapses to the same code. The compiler's tail-position rewrite (mset m k v -> a==b form) is what wins on Cranelift, and the existing OP_MSET handler already picks mset_inplace when a_idx == b_idx. Also extends the type-classification pre-scan and the MOVE-propagation fixpoint so OP_MOVE_OWN and OP_CALL_OWN1 don't silently bail. Without this commit any chunk containing the new opcodes would either return an unsupported-opcode error (AOT) or bail to the VM interpreter (JIT). With it, JIT/AOT execute the helper-fn accumulator pattern as fast as VM does.
…nflation First pass at the Cranelift wiring mirrored OP_MOVE's clone-rc-on-heap path for OP_MOVE_OWN too. That's wrong: SSA Variable redefinition in Cranelift doesn't drop_rc the previous value, so bumping RC on every move (twice per loop iteration in the helper-call peephole) inflates RC monotonically. By 100k rows the map's RC hits 200k+, defeats the in-place mset fast path inside the helper, and leaks every owner reference until process exit. Caught by a /tmp/mset_bench 100k helper-form JIT run that hit 70GB resident memory. OP_MOVE_OWN now just assigns the destination Variable from the source without calling jit_move. This is sound because the peephole emits the in/out OP_MOVE_OWN pair around an OP_CALL_OWN1 — the source Variable isn't used again after the move, so the net heap RC is preserved (no clone, no drop) exactly as the VM intends. f64 shadow propagation is preserved for the numeric fast path.
17 tests in tests/regression_mset_helper_perf.rs covering the moves landed by the perf fix: - addto helper with all-distinct keys (the headline cliff shape) - addto helper with overwriting keys (exercises displaced-value drop_rc) - helper with extra args after the map (multi-arg OP_CALL_OWN1) - non-tail mset preserves the source map (semantics gate on tail flag) - non-rebinding helper call preserves caller's map (RC>1 fallback) Each shape runs on tree / VM / JIT. Correctness-only; perf is verified manually with /tmp/mset_bench.ilo and tracked in the assessment doc. examples/mset-helper-perf.ilo demonstrates the now-fast helper-fn accumulator shape so agents see the working idiom directly (the example also acts as a cross-engine regression test via the engine harness).
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes the ~1000x perf cliff agents hit when they factor a map accumulator into a helper fn. The canonical DRY refactor
used to take 25.8s for 40k rows; now it's 0.01s. That's the headline. The manifesto-level frame: until this PR, the "obvious thing" (extract the per-row update into a helper for readability) silently made the program 1000x slower with no error or warning. Worst-case footgun for an AI-first language. Now factor or don't, the perf is the same.
Repro before / after (
/tmp/mset_bench.ilo, helper form)Inline form (
m=mset m k vdirectly in the loop) is also unchanged, so the existing accumulator peephole is untouched.Root cause
PR #249's OP_MSET RC=1 in-place fast path was gated on
a == b && rc_count == 1. When the accumulating map crossed an OP_CALL boundary, RC bumped to >=2 (one from the caller MOVE-to-args_base clone, one from the OP_CALL push clone). Inside the callee the in-place path declined, so OP_MSET cloned the whole HashMap on every row - O(N²) on K distinct keys.What's in the diff
Eleven commits, broken up so each step compiles and the bisect story stays clean:
clone_rcand nils the source slot.!/!!unwrap modes.msetimmutability semantics (m2 = mset m k v with m still live). Reverted.in_tail_positionthroughcompile_body. Whenmset m k vsits at the tail of a fn body AND args[0] is a direct Ref to a local, emit OP_MSET with a = b = local_reg. Non-tail use stays on the cloning slow path.tests/regression_mset_helper_perf.rs+examples/mset-helper-perf.iloexercised by the engine harness.Test plan
cargo test --release --features cranelift --lib- 3196 passedcargo test --release --features cranelift(full suite) - all integration tests greencargo fmt- cleancargo clippy --release --features cranelift -- -D warnings- cleanexamples/mset-helper-perf.iloexercised across tree / VM / JIT / AOT via the engine harnessFollow-ups
mdel,setfield, etc) - left out of this PR to keep the change minimal.UnwrapMode::Noneonce we have a clear story for how the result lives across!/!!handling.