fix(perf): close helper-fn mset O(N·K) cliff with OP_CALL_OWN1 + tail-mset rewrite #440

Merged: danieljohnmorris merged 13 commits into main from fix/mset-high-cardinality-perf on May 19, 2026

@danieljohnmorris
Collaborator

Summary

Closes the ~1000x perf cliff agents hit when they factor a map accumulator into a helper fn. The canonical DRY refactor

addto m:M t n k:t v:n>M t n;mset m k v
go n:n>n;m=mmap;@i 1..n{k=str i;m=addto m k i};len (mkeys m)

used to take 25.8 s for 40k rows on the VM; it now takes 0.01 s. That's the headline. The manifesto-level frame: until this PR, the "obvious thing" (extracting the per-row update into a helper for readability) silently made the program ~1000x slower, with no error or warning. That is a worst-case footgun for an AI-first language. Now the perf is the same whether or not you factor out the helper.

Repro before / after (/tmp/mset_bench.ilo, helper form)

addto m:M t n k:t v:n>M t n;mset m k v
build n:n>n;m=mmap;@i 1..n{k=str i;m=addto m k i};len (mkeys m)
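| N | VM before | VM after | JIT after | Ratio (VM before / VM after) |
|---|---|---|---|---|
| 5 000 | 0.45 s | <0.01 s | <0.01 s | >45x |
| 10 000 | 1.00 s | <0.01 s | <0.01 s | >100x |
| 40 000 | 25.8 s | 0.01 s | <0.01 s | ~2580x |
| 100 000 | (extrap. 160 s) | 0.04 s | 0.03 s | ~4000x |
| 1 000 000 | n/a (would OOM) | 0.53 s | 0.61 s | linear |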

The inline form (m=mset m k v directly in the loop) is unchanged: the existing accumulator peephole is untouched.
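As a rough check on the extrapolated 100 000-row figure in the table above, assume the pre-fix VM time grows quadratically in N when every key is distinct. That is an assumption for illustration only: the measured step from 10 000 to 40 000 rows actually grew faster than quadratic (25.8x for 4x the rows). Scaling the 40 000-row measurement under that assumption gives:

$$
t(100\,000) \approx t(40\,000) \cdot \left(\frac{100\,000}{40\,000}\right)^{2} = 25.8\ \text{s} \times 6.25 \approx 161\ \text{s}
$$

This is consistent with the ~160 s estimate in the table, and should be read as a lower bound if the super-quadratic trend continues.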

Root cause

PR #249's OP_MSET RC=1 in-place fast path was gated on a == b && rc_count == 1. When the accumulating map crossed an OP_CALL boundary, its RC was bumped to >=2 (one from the caller MOVE-to-args_base clone, one from the OP_CALL push clone). Inside the callee the in-place path therefore declined, so OP_MSET cloned the whole HashMap on every row: O(N·K) work for N rows and K distinct keys, which is O(N²) when every key is distinct.
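The mechanism can be sketched in plain Rust, using `Rc` as a stand-in for the VM's reference counts. This is an analogy, not the VM's code: `mset`, `build_sharing` and `build_moving` are hypothetical names, and `Rc::make_mut` plays the role of the "in place if unique, otherwise copy" decision.

```rust
use std::collections::HashMap;
use std::rc::Rc;

type Map = HashMap<String, i64>;

/// Hypothetical stand-in for OP_MSET: mutate in place when this is the only
/// reference, otherwise copy the whole map first. Returns the result and
/// whether a full copy was needed.
fn mset(mut map: Rc<Map>, key: &str, value: i64) -> (Rc<Map>, bool) {
    let copied = Rc::strong_count(&map) > 1;
    // Rc::make_mut clones the inner map only when the Rc is shared.
    Rc::make_mut(&mut map).insert(key.to_string(), value);
    (map, copied)
}

/// The caller keeps its own reference alive across the call (like a
/// clone-on-push), so the helper sees a count of 2 and copies on every row.
fn build_sharing(n: i64) -> (usize, usize) {
    let mut m: Rc<Map> = Rc::new(HashMap::new());
    let mut copies = 0;
    for i in 1..=n {
        let (next, copied) = mset(Rc::clone(&m), &i.to_string(), i);
        copies += copied as usize;
        m = next;
    }
    (m.len(), copies)
}

/// Ownership moves into the helper, so the count stays at 1 and every
/// insert happens in place.
fn build_moving(n: i64) -> (usize, usize) {
    let mut m: Rc<Map> = Rc::new(HashMap::new());
    let mut copies = 0;
    for i in 1..=n {
        let (next, copied) = mset(m, &i.to_string(), i);
        copies += copied as usize;
        m = next;
    }
    (m.len(), copies)
}
```

Both builders return the final map size and the number of full-map copies. The sharing variant copies once per row, and each copy touches every entry inserted so far, which is where the quadratic cost comes from. The moving variant never copies.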

What's in the diff

Eleven commits, broken up so each step compiles and the bisect story stays clean:

  1. vm: reserve opcode slots OP_MOVE_OWN (182) and OP_CALL_OWN1 (183) - constants only.
  2. vm: wire OP_MOVE_OWN and OP_CALL_OWN1 dispatch arms - move-not-clone slot semantics for OP_MOVE_OWN; OP_CALL_OWN1 mirrors OP_CALL but pushes the first arg without clone_rc and nils the source slot.
  3. vm: emit OP_CALL_OWN1 for name = fn(name, ...) helper-call shape - the let-stmt peephole. Audited skip cases: builtins, local fn refs (closures / dynamic FnRef), ! / !! unwrap modes.
  4. vm: relax OP_MSET fast path - RC=1 alone suffices, drop a==b guard, plus its Revert (two commits) - first attempt at the in-callee half. It broke mset immutability semantics (m2 = mset m k v with m still live), so it was reverted.
  5. vm: tail-position mset reuses local register, fires in-place fast path - correct version. Tracks in_tail_position through compile_body. When mset m k v sits at the tail of a fn body AND args[0] is a direct Ref to a local, emit OP_MSET with a = b = local_reg. Non-tail use stays on the cloning slow path.
  6. cranelift: wire OP_MOVE_OWN and OP_CALL_OWN1 in JIT and AOT pipelines - same SSA Variable lowering as OP_MOVE / OP_CALL plus the type-classification pre-scan and MOVE-propagation fixpoint.
  7. cranelift: OP_MOVE_OWN must skip jit_move to avoid per-iteration RC inflation - caught by a 100k JIT run that hit 70GB resident memory. Without the skip, each loop iteration bumped the map's RC twice (in + out move) and never dropped it, so the count grew monotonically. Fix: OP_MOVE_OWN does pure Variable assignment, no jit_move call. Sound because the peephole emits the in/out pair around an OP_CALL_OWN1 - the source Variable isn't used again.
  8. tests: cross-engine regression coverage for helper-fn mset perf fix - 17 tests in tests/regression_mset_helper_perf.rs + examples/mset-helper-perf.ilo exercised by the engine harness.
  9. changelog: 0.12.1 entry.
  10. fmt + clippy - cargo fmt drift and one useless_concat.

Test plan

  • cargo test --release --features cranelift --lib - 3196 passed
  • cargo test --release --features cranelift (full suite) - all integration tests green
  • cargo fmt - clean
  • cargo clippy --release --features cranelift -- -D warnings - clean
  • examples/mset-helper-perf.ilo exercised across tree / VM / JIT / AOT via the engine harness
  • 1M-row JIT bench (helper form): 0.61s, linear scaling, no memory blowup
  • CI green

Follow-ups

  • Propagate the tail-mset rewrite to other in-place builtins where applicable (mdel, setfield, etc) - left out of this PR to keep the change minimal.
  • Consider widening OP_CALL_OWN1 emission past UnwrapMode::None once we have a clear story for how the result lives across ! / !! handling.

Adds the two opcode constants for the move-not-clone fast path that
closes the helper-fn mset perf cliff. No dispatch wiring yet; the
constants are referenced in the next commit when the compiler peephole
starts emitting them.
OP_MOVE_OWN copies the NanVal bit pattern from R[B] to R[A] without
touching any RC, then nils the source slot so frame teardown doesn't
double-drop. Guarded against the a==b degenerate case.

OP_CALL_OWN1 is a structural copy of OP_CALL except the first arg is
pushed onto the callee's stack without clone_rc, and its caller-side
register is cleared to Nil. The other args still pay the normal
clone_rc on push. Encoding matches OP_CALL exactly (ABx with the
func_idx in the high byte and argc in the low byte) so the compiler
can swap one for the other without other changes.

No emitter yet, so neither arm is reached at runtime; this commit
does not change behaviour.
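The two dispatch arms described above can be modelled roughly as follows. This is a sketch under stated assumptions, not the VM's implementation: registers are modelled as `Option<Rc<...>>` slots (with `None` standing in for Nil), and `move_own`, `push_args_own1` and `rc_of` are hypothetical helper names.

```rust
use std::rc::Rc;

/// A register slot: None models Nil, Some holds a reference-counted heap value.
type Slot = Option<Rc<Vec<i64>>>;

/// Sketch of OP_MOVE_OWN: transfer R[b] to R[a] without touching the
/// reference count, leaving R[b] as Nil.
fn move_own(regs: &mut [Slot], a: usize, b: usize) {
    if a == b {
        return; // degenerate case: a raw copy-then-nil would wipe the value
    }
    let value = regs[b].take(); // source slot becomes None (Nil)
    regs[a] = value; // any previous occupant of R[a] is dropped here
}

/// Sketch of the OP_CALL_OWN1 argument push: the first argument is moved
/// out of the caller's register; the remaining arguments are cloned.
fn push_args_own1(regs: &mut [Slot], arg_regs: &[usize]) -> Vec<Slot> {
    let mut args = Vec::with_capacity(arg_regs.len());
    for (i, &r) in arg_regs.iter().enumerate() {
        if i == 0 {
            args.push(regs[r].take()); // no count bump, caller slot is Nil
        } else {
            args.push(regs[r].clone()); // normal clone-on-push
        }
    }
    args
}

/// Reference count of a slot, or 0 for Nil.
fn rc_of(slot: &Slot) -> usize {
    slot.as_ref().map_or(0, |rc| Rc::strong_count(rc))
}
```

After `push_args_own1`, the first argument still has a count of 1 and the caller's register is empty, while every other argument has a count of 2 for the duration of the call.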
The let-stmt peephole now detects the canonical helper-call accumulator
pattern - name = fn(name, ...) for a static user fn - and emits the
move-not-clone call path. The first arg threads into the callee at
the caller's RC (no clone_rc bump), so an accumulator that lives at
RC=1 in the caller stays at RC=1 inside the helper.

Audited skip cases: builtins (caught by is_builtin), local fn refs
and closures (caught by resolve_local, same dispatch shunt OP_CALL_DYN
uses), and the !/!! unwrap modes (left on the normal OP_CALL path).

OP_MSET's in-place fast path inside the callee still requires the
a == b register form, so this commit alone doesn't close the cliff -
the next commit relaxes the mset guard to allow a != b under RC=1.
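A minimal sketch of the peephole's eligibility check, assuming a heavily simplified AST. The `Expr`, `UnwrapMode` and `CallOp` types and the `choose_call_op` function are hypothetical; the two predicate parameters stand in for the is_builtin and resolve_local checks named above.

```rust
/// Hypothetical unwrap modes: plain call, `!`, and `!!`.
#[derive(Debug, PartialEq)]
enum UnwrapMode {
    None,
    Bang,
    BangBang,
}

/// Hypothetical, heavily simplified expression type.
#[derive(Debug, PartialEq)]
enum Expr {
    Ref(String),
    Call { func: String, args: Vec<Expr>, unwrap: UnwrapMode },
}

#[derive(Debug, PartialEq)]
enum CallOp {
    Call,
    CallOwn1,
}

/// Decide which call opcode a `target = expr` let-statement should use.
/// Returns None when `expr` is not a call at all.
fn choose_call_op(
    target: &str,
    expr: &Expr,
    is_builtin: impl Fn(&str) -> bool,
    is_local_fn: impl Fn(&str) -> bool,
) -> Option<CallOp> {
    match expr {
        Expr::Call { func, args, unwrap } => {
            // The accumulator shape: the first argument is the rebound name.
            let rebinds_first_arg =
                matches!(args.first(), Some(Expr::Ref(name)) if name == target);
            let eligible = rebinds_first_arg
                && *unwrap == UnwrapMode::None
                && !is_builtin(func.as_str())
                && !is_local_fn(func.as_str());
            Some(if eligible { CallOp::CallOwn1 } else { CallOp::Call })
        }
        Expr::Ref(_) => None,
    }
}
```

Every skip case falls back to the ordinary call opcode, so a missed match costs performance only, never correctness.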
The PR #249 in-place fast path required both a == b AND rc_count == 1.
The a == b half is essentially a proxy for 'no other live reference' -
useful when the compiler emitted name = mset name k v at the top
level, where the result lands in the same register as the source.

But under move-semantics (OP_CALL_OWN1 / OP_MOVE_OWN), it's safe to
mutate in place whenever RC == 1, regardless of register form: we
own the only reference. The a != b path now transfers the heap
pointer from R[b] to R[a] (drop_rc the previous R[a], install map_v,
nil R[b]).

This is the second half of the helper-fn mset cliff fix. Inside the
callee, mset m k v as a tail expression compiles to OP_MSET with
a != b (b is the param register, a is a fresh result). With OP_CALL_OWN1
threading the map in at RC=1, this commit lets the callee actually
take the in-place path.
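One plausible reading of why this relaxation was later reverted can be modelled as below. It assumes that a local read only through its register has RC=1, so `m2 = mset m k v` would take the relaxed path even though `m` is read again afterwards. The slot model and the names `mset_relaxed` and `m_after_relaxed_mset` are hypothetical.

```rust
use std::collections::HashMap;
use std::rc::Rc;

type Map = HashMap<String, i64>;
type Slot = Option<Rc<Map>>;

/// Model of the relaxed OP_MSET: with RC == 1 it transfers the map from
/// R[b] to R[a] and mutates it in place, leaving R[b] as Nil. With RC > 1
/// it copies the map and leaves R[b] alone.
fn mset_relaxed(regs: &mut [Slot], a: usize, b: usize, key: &str, value: i64) {
    let unique = regs[b].as_ref().map_or(false, |m| Rc::strong_count(m) == 1);
    let mut map = if unique {
        regs[b].take().expect("R[b] must hold a map") // move pointer, nil R[b]
    } else {
        let shared = regs[b].as_ref().expect("R[b] must hold a map");
        Rc::new((**shared).clone()) // slow path: full copy
    };
    Rc::make_mut(&mut map).insert(key.to_string(), value);
    regs[a] = Some(map); // any previous R[a] is dropped here
}

/// `m2 = mset m "k" 1` followed by reads of both names.
/// Returns (len of m, len of m2), with None meaning the register is Nil.
fn m_after_relaxed_mset() -> (Option<usize>, Option<usize>) {
    // R0 holds m, R1 will hold m2.
    let mut regs: Vec<Slot> = vec![Some(Rc::new(HashMap::new())), None];
    mset_relaxed(&mut regs, 1, 0, "k", 1);
    (
        regs[0].as_ref().map(|m| m.len()),
        regs[1].as_ref().map(|m| m.len()),
    )
}
```

Under immutable semantics the caller would expect `m` to still be an empty map. In this model it has become Nil, which is the kind of breakage the RC check alone cannot rule out: RC=1 says nothing about whether the source name is read later.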
When mset m k v sits at the tail of a function body AND args[0] is a
direct Ref to a local register, the source map dies at OP_RET. We can
safely reuse the local's register as both result and source of OP_MSET,
which fires the existing a == b && rc_count == 1 in-place fast path.

This is the second half of the helper-fn perf fix. OP_CALL_OWN1 threads
the accumulator into the helper at RC=1; this commit makes the helper's
tail mset actually take the in-place path instead of cloning the whole
HashMap.

Liveness is conservative - only direct refs at tail position, only when
resolve_local hits. Non-tail mset (m2 = mset m k v then read m) stays
on the slow path because the source isn't statically dead. The
in_tail_position flag propagates through compile_body so only the
final stmt of the function root sees it set.
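The register choice can be sketched with a toy statement list. Everything here is hypothetical (`MapArg`, `Stmt`, `Op` and this `compile_body` are illustrative stand-ins, not the compiler's types). The sketch only shows how a tail-position flag selects the a == b form.

```rust
/// Where the map argument (args[0]) of an mset comes from.
#[derive(Debug, Clone, Copy, PartialEq)]
enum MapArg {
    /// A direct reference to a local held in this register.
    Local(usize),
    /// Some other expression, already evaluated into this register.
    Temp(usize),
}

#[derive(Debug, PartialEq)]
enum Stmt {
    Mset(MapArg),
    Other,
}

#[derive(Debug, PartialEq)]
enum Op {
    /// a is the result register, b is the source register.
    Mset { a: usize, b: usize },
    Other,
}

/// Compile a function body, reusing the local's register for an mset that
/// sits in tail position; every other mset gets a fresh result register.
fn compile_body(body: &[Stmt], first_free_reg: usize) -> Vec<Op> {
    let mut next_free = first_free_reg;
    let mut ops = Vec::new();
    for (i, stmt) in body.iter().enumerate() {
        // Only the final statement of the function root is in tail position.
        let in_tail_position = i + 1 == body.len();
        match stmt {
            Stmt::Mset(MapArg::Local(reg)) if in_tail_position => {
                // The source dies at return, so a == b is safe.
                ops.push(Op::Mset { a: *reg, b: *reg });
            }
            Stmt::Mset(MapArg::Local(reg)) | Stmt::Mset(MapArg::Temp(reg)) => {
                let a = next_free;
                next_free += 1;
                ops.push(Op::Mset { a, b: *reg });
            }
            Stmt::Other => ops.push(Op::Other),
        }
    }
    ops
}
```

A non-tail mset on the same local keeps a != b and therefore stays on the cloning path, which preserves the source map for later reads.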
Both opcodes lower to the same IR as their OP_MOVE / OP_CALL siblings.
In the Cranelift model registers are SSA Variables and args flow as
values - there's no per-push clone_rc on a runtime stack like the VM
has - so the move-vs-clone-on-push distinction collapses to the same
code. The compiler's tail-position rewrite (mset m k v -> a==b form)
is what wins on Cranelift, and the existing OP_MSET handler already
picks mset_inplace when a_idx == b_idx.

Also extends the type-classification pre-scan and the MOVE-propagation
fixpoint so OP_MOVE_OWN and OP_CALL_OWN1 don't silently bail.

Without this commit any chunk containing the new opcodes would either
return an unsupported-opcode error (AOT) or bail to the VM
interpreter (JIT). With it, JIT/AOT execute the helper-fn accumulator
pattern as fast as VM does.
…nflation

First pass at the Cranelift wiring mirrored OP_MOVE's clone-rc-on-heap
path for OP_MOVE_OWN too. That's wrong: SSA Variable redefinition in
Cranelift doesn't drop_rc the previous value, so bumping RC on every
move (twice per loop iteration in the helper-call peephole) inflates
RC monotonically. By 100k rows the map's RC hits 200k+, defeats the
in-place mset fast path inside the helper, and leaks every owner
reference until process exit. Caught by a /tmp/mset_bench 100k
helper-form JIT run that hit 70GB resident memory.

OP_MOVE_OWN now just assigns the destination Variable from the source
without calling jit_move. This is sound because the peephole emits
the in/out OP_MOVE_OWN pair around an OP_CALL_OWN1 — the source
Variable isn't used again after the move, so the net heap RC is
preserved (no clone, no drop) exactly as the VM intends. f64 shadow
propagation is preserved for the numeric fast path.
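The inflation can be reproduced in miniature with `Rc`, using `mem::forget` to model a count bump that is never paired with a drop. This is an analogy for the JIT behaviour described above, not the JIT's code, and `rc_after_loop` is a hypothetical name.

```rust
use std::mem;
use std::rc::Rc;

/// Run `rows` loop iterations, each performing the in/out pair of moves
/// around the helper call, and return the final strong count.
fn rc_after_loop(rows: usize, bump_on_move: bool) -> usize {
    let map = Rc::new(Vec::<i64>::new());
    for _ in 0..rows {
        // Two moves per iteration: one into the call, one out of it.
        for _ in 0..2 {
            if bump_on_move {
                // A clone that is never dropped: the count only goes up.
                mem::forget(Rc::clone(&map));
            }
            // Otherwise the move is a pure assignment and the count is untouched.
        }
    }
    Rc::strong_count(&map)
}
```

With the bump enabled, 100 000 rows leave the count at 200 001, in line with the "200k+" figure above; with pure assignment it stays at 1, so the in-place path remains available.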
17 tests in tests/regression_mset_helper_perf.rs covering the moves
landed by the perf fix:
- addto helper with all-distinct keys (the headline cliff shape)
- addto helper with overwriting keys (exercises displaced-value drop_rc)
- helper with extra args after the map (multi-arg OP_CALL_OWN1)
- non-tail mset preserves the source map (semantics gate on tail flag)
- non-rebinding helper call preserves caller's map (RC>1 fallback)

Each shape runs on tree / VM / JIT. Correctness-only; perf is verified
manually with /tmp/mset_bench.ilo and tracked in the assessment doc.

examples/mset-helper-perf.ilo demonstrates the now-fast helper-fn
accumulator shape so agents see the working idiom directly (the
example also acts as a cross-engine regression test via the engine
harness).
@codecov

codecov Bot commented May 19, 2026

Codecov Report

❌ Patch coverage is 88.75000% with 18 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/vm/mod.rs | 89.60% | 13 Missing ⚠️ |
| src/vm/jit_cranelift.rs | 72.72% | 3 Missing ⚠️ |
| src/vm/compile_cranelift.rs | 91.66% | 2 Missing ⚠️ |

#439 removed --run-tree from the public CLI; this test was written
against pre-#439 main so its engine sweep still referenced it.
@danieljohnmorris danieljohnmorris merged commit 8c199a2 into main May 19, 2026
4 checks passed
@danieljohnmorris danieljohnmorris deleted the fix/mset-high-cardinality-perf branch May 19, 2026 18:58