vm: window-text-perf PR-2 - reshape OP_WINDOW to emit ListView #336

Merged
danieljohnmorris merged 2 commits into main from fix/window-listview-perf
May 17, 2026

@danieljohnmorris
Collaborator

Summary

This is PR-2 of the staged window-text-perf P0 fix from the bioinformatics rerun6 persona. PR-1 (#334) landed the HeapObj::ListView { src, start, len } foundation; this PR is where the perf win actually shows up.

The eager OP_WINDOW arm in the VM dispatcher now produces L (ListView) instead of L (L t). Each window stride was previously a fresh Vec<NanVal> containing n clone_rc'd element NanVals; it's now a single Rc<HeapObj::ListView> of ~24 bytes plus one src.clone_rc. For an 11.4M-residue, 15-wide window walk that's the difference between ~170M RC touches at window construction and ~10K, plus the matching memory traffic for the small Vec allocations.
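To make the cost difference concrete, here is a minimal sketch of the two window shapes. It is not the VM's code: `ListView<T>`, `windows_copied` and `windows_viewed` are hypothetical names, and a plain `Rc<Vec<T>>` stands in for the NanVal-backed source list. The copied shape clones `n` elements per window (11.4M positions times 15 elements is roughly 171M element clones), while the viewed shape bumps one refcount per window regardless of width.

```rust
use std::rc::Rc;

/// Hypothetical stand-in for the VM's `HeapObj::ListView { src, start, len }`.
#[derive(Clone, Debug)]
struct ListView<T> {
    src: Rc<Vec<T>>, // shared source list; cloning bumps one refcount
    start: usize,
    len: usize,
}

impl<T> ListView<T> {
    /// Borrow the window's elements straight out of the shared source.
    fn as_slice(&self) -> &[T] {
        &self.src[self.start..self.start + self.len]
    }
}

/// Old shape: every window is a fresh Vec holding `n` cloned elements.
fn windows_copied<T: Clone>(src: &[T], n: usize) -> Vec<Vec<T>> {
    if n == 0 || n > src.len() {
        return Vec::new(); // n > len yields an empty result
    }
    src.windows(n).map(|w| w.to_vec()).collect()
}

/// New shape: every window is a view, costing one `Rc` clone of the source.
fn windows_viewed<T>(src: &Rc<Vec<T>>, n: usize) -> Vec<ListView<T>> {
    if n == 0 || n > src.len() {
        return Vec::new();
    }
    (0..=src.len() - n)
        .map(|start| ListView { src: Rc::clone(src), start, len: n })
        .collect()
}

fn main() {
    let src: Rc<Vec<char>> = Rc::new("MKTLLVAG".chars().collect());
    let copied = windows_copied(src.as_slice(), 3);
    let viewed = windows_viewed(&src, 3);

    // Both shapes expose the same elements.
    assert_eq!(copied.len(), viewed.len());
    for (c, v) in copied.iter().zip(&viewed) {
        assert_eq!(c.as_slice(), v.as_slice());
    }
    // One refcount bump per window, independent of the window width.
    assert_eq!(Rc::strong_count(&src), 1 + viewed.len());
}
```

The views keep the source alive through their `src` handle, so the outer list of windows can outlive the binding that produced it.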

OP_WINDOW_VIEW (the fused-emitter scratch buffer used by flt fn (window n xs) and map fn (window n xs)) gets a parallel reshape: when R[A] already holds a sole-owned ListView, the stride update is now an in-place rewrite of (src, start, len). Per-stride RC traffic drops from 15 element drop_rc + 15 element clone_rc to 1 src.drop_rc + 1 src.clone_rc (zero RC ops when the source NanVal hasn't changed).
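The sole-owner reuse path can be sketched with `Rc::get_mut`, which only hands out a mutable reference when no other strong or weak reference exists. This is an illustration under assumptions, not the dispatcher's code: `restride` and the `u8` element type are hypothetical, and the real VM checks ownership on its own `HeapObj` representation.

```rust
use std::rc::Rc;

/// Hypothetical stand-in for `HeapObj::ListView { src, start, len }`.
struct ListView {
    src: Rc<Vec<u8>>,
    start: usize,
    len: usize,
}

/// Point the register's view at a new stride.
/// Returns true when the existing allocation was rewritten in place.
fn restride(reg: &mut Rc<ListView>, src: &Rc<Vec<u8>>, start: usize, len: usize) -> bool {
    // `Rc::get_mut` yields Some only when we are the sole owner,
    // which plays the role of the strong_count == 1 check.
    if let Some(view) = Rc::get_mut(reg) {
        if !Rc::ptr_eq(&view.src, src) {
            // Source changed: one drop of the old src, one clone of the new one.
            view.src = Rc::clone(src);
        }
        // Same source: no refcount traffic, just two integer stores.
        view.start = start;
        view.len = len;
        true
    } else {
        // Another owner still holds the old view: allocate a fresh one.
        *reg = Rc::new(ListView { src: Rc::clone(src), start, len });
        false
    }
}

fn main() {
    let src = Rc::new(b"MKTLLVAGHH".to_vec());
    let mut reg = Rc::new(ListView { src: Rc::clone(&src), start: 0, len: 3 });
    let before = Rc::strong_count(&src);

    assert!(restride(&mut reg, &src, 1, 3)); // sole owner: rewritten in place
    assert_eq!(Rc::strong_count(&src), before); // zero RC ops on the source
    assert_eq!(&reg.src[reg.start..reg.start + reg.len], b"KTL");

    let escaped = Rc::clone(&reg); // the view now has a second owner
    assert!(!restride(&mut reg, &src, 2, 3)); // must not mutate under the alias
    assert_eq!(escaped.start, 1);
    assert_eq!(reg.start, 2);
}
```

The fallback branch matters for soundness: once a view has escaped the register, rewriting it in place would change what the other holder sees.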

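The text-typed repro microbenches that surfaced the original cliff (microbench3, microbench4) drop in time by roughly an order of magnitude; the numeric microbench holds its existing win and the full bio pipeline moves only slightly: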

| Bench | Before (main) | After (this PR) |
| --- | --- | --- |
| microbench (1M L n, sum windows) | ~0.2s | 0.18s (no regression on the #325 win) |
| microbench3 (3.2M L t, hd windows) | 4-6s | 0.49s |
| microbench4 (3.2M L t, flt over windows) | 4-6s | 0.45s |
| bio opt4.ilo (11.4M residues, full pipeline) | 11.09s | 10.8s |

The bio bench moves less than the microbenches because window construction is a smaller slice of its total time than the persona repro suggested. After this PR the remaining bio time is dominated by the inner flt is-hydro xs closure-call loop (165M closure invocations over the dataset) rather than window construction; further wins there are a separate optimisation.

Cranelift trade-off

Both the in-process JIT and the AOT pipeline inline FOREACHPREP / LISTGET / LEN as direct [ptr+24] Vec metadata reads (see the audit comment in jit_cranelift.rs:OP_FOREACHPREP). A ListView's struct layout doesn't match Vec<NanVal> at those offsets, so a view reaching those inlines would be read through the wrong layout, which is undefined behaviour (UB).

This PR takes the design-of-record path: Cranelift bails compilation on OP_WINDOW / OP_WINDOW_VIEW (returns None from compile_function_body). Functions touching window opcodes fall through to the bytecode VM dispatcher, which is the closest non-JIT performance tier and is where the new fast path lives. run_default and the explicit --run-cranelift engine path in main.rs both fall back to vm::run on JitCallError::NotEligible. AOT keeps calling the unchanged jit_window / jit_window_view helpers, which still emit real HeapObj::Lists so AOT-inlined read paths stay sound at the cost of the perf win on window workloads.
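The bail-and-fall-back control flow can be sketched as follows. The names `compile_function_body`, `run_default` and `JitCallError::NotEligible` come from the description above, but their signatures and bodies here are guesses for illustration; `Op`, `Engine`, `Compiled` and `run_jit` are hypothetical.

```rust
/// Hypothetical opcode subset; the real VM has many more.
#[derive(Clone, Copy, Debug)]
enum Op {
    Add,
    ForeachPrep,
    Window,
    WindowView,
    Ret,
}

/// Modelled on `JitCallError::NotEligible`; the real type may carry more variants.
#[derive(Debug, PartialEq)]
enum JitCallError {
    NotEligible,
}

/// Stand-in for compiled machine code.
struct Compiled;

/// Which tier ended up executing the function (hypothetical, for the demo).
#[derive(Debug, PartialEq)]
enum Engine {
    Jit,
    Vm,
}

/// Bail (return None) on any function that touches a window opcode,
/// because the inlined list reads assume a Vec layout.
fn compile_function_body(ops: &[Op]) -> Option<Compiled> {
    if ops.iter().any(|op| matches!(op, Op::Window | Op::WindowView)) {
        return None;
    }
    Some(Compiled)
}

/// Try the JIT; report NotEligible when compilation bailed.
fn run_jit(ops: &[Op]) -> Result<Engine, JitCallError> {
    compile_function_body(ops)
        .map(|_compiled| Engine::Jit)
        .ok_or(JitCallError::NotEligible)
}

/// Default engine selection: JIT first, bytecode VM on NotEligible.
fn run_default(ops: &[Op]) -> Engine {
    match run_jit(ops) {
        Ok(engine) => engine,
        Err(JitCallError::NotEligible) => Engine::Vm,
    }
}

fn main() {
    assert_eq!(run_default(&[Op::Add, Op::ForeachPrep, Op::Ret]), Engine::Jit);
    assert_eq!(run_default(&[Op::Window, Op::Ret]), Engine::Vm);
    assert_eq!(run_default(&[Op::WindowView, Op::Ret]), Engine::Vm);
}
```

Bailing per function rather than per program keeps non-window functions on the JIT while window-heavy ones land on the tier that has the new fast path.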

A future PR can add a [ptr+0] == 1 discriminant guard to those inlines and let Cranelift emit and consume ListViews directly.

What's in the diff (per-commit)

  1. vm: reshape OP_WINDOW / OP_WINDOW_VIEW to emit HeapObj::ListView - the producer-side reshape plus the read-side audit. Every HeapObj::List(items) => ... _ => err site reachable from a window result now goes through slice_of so both List and ListView variants flow through the same code (hd, tl, rev, srt, rsrt, slc, lst, at, take, drop, zip, enumerate, chunks, frq, has, cat, unq, transpose, matmul, dot, cumsum, getmany, fft, ifft, setunion/inter/diff, sum, avg, min, max, median, stdev, variance, quantile). Wildcard _ => arms become explicit non-list variants so rustc enforces the audit on any future HeapObj addition. nanval_truthy learns the ListView arm. A new vm_list_slice<'a>(&'a NanVal, err) helper consolidates the v -> &[NanVal] | err pattern for stats helpers. The Cranelift bail + main fallback are bundled in here so the tree stays buildable at each commit.

  2. examples: window-listview-perf - mirrors the bio has-tm predicate (cs = chars s; ws = window 15 cs; flt all-h ws) at CI scale so the new ListView producer path runs through tree / VM / Cranelift on every test invocation via tests/examples_engines.rs.
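The read-side audit in commit 1 can be pictured with a reduced model. This is a sketch under assumptions: `NanVal` is aliased to `f64` here, `HeapObj` carries only three variants, and `list_slice` is a hypothetical helper modelled on the `v -> &[NanVal] | err` pattern rather than the real `vm_list_slice` signature.

```rust
use std::rc::Rc;

/// Hypothetical stand-in; the real NanVal is a NaN-boxed value.
type NanVal = f64;

/// Reduced model of `HeapObj` with only the variants needed for the demo.
enum HeapObj {
    List(Vec<NanVal>),
    ListView { src: Rc<Vec<NanVal>>, start: usize, len: usize },
    Str(String),
}

impl HeapObj {
    /// Borrow the elements of either list shape; None for non-list variants.
    fn slice_of(&self) -> Option<&[NanVal]> {
        match self {
            HeapObj::List(items) => Some(items.as_slice()),
            // `get` returns None instead of panicking on an out-of-range view.
            HeapObj::ListView { src, start, len } => src.get(*start..*start + *len),
            // Listed explicitly (no `_ =>`), so adding a variant to the enum
            // makes this match fail to compile until it is handled.
            HeapObj::Str(_) => None,
        }
    }
}

/// Hypothetical helper: turn "not a list" into the caller's error value.
fn list_slice<'a>(v: &'a HeapObj, err: &'static str) -> Result<&'a [NanVal], &'static str> {
    v.slice_of().ok_or(err)
}

/// A stats helper written once against slices, so it serves both variants.
fn sum(v: &HeapObj) -> Result<NanVal, &'static str> {
    let xs = list_slice(v, "sum: expected a list")?;
    Ok(xs.iter().sum::<NanVal>())
}

fn main() {
    let owned = HeapObj::List(vec![1.0, 2.0, 3.0, 4.0]);
    let view = HeapObj::ListView {
        src: Rc::new(vec![1.0, 2.0, 3.0, 4.0]),
        start: 1,
        len: 2,
    };
    assert_eq!(sum(&owned), Ok(10.0));
    assert_eq!(sum(&view), Ok(5.0)); // elements 2.0 and 3.0
    assert!(sum(&HeapObj::Str("x".into())).is_err());
}
```

Routing every reader through one slice accessor means a builtin never needs to know which list shape it was handed.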

Test plan

  • cargo test --release --features cranelift --lib — 3110 passed, 0 failed
  • cargo test --release --features cranelift (all 152 integration test binaries) — 0 failures
  • cargo fmt --check clean
  • cargo clippy --release --features cranelift -- -D warnings clean
  • microbench{,2,3,4} produce identical output to pre-PR main and run in <1s
  • bio opt4.ilo produces bit-identical output (proteins=22202, residues=11423140, tm-proteins=400, full aa-freq table)
  • examples_engines harness runs examples/window-listview-perf.ilo cleanly on tree + VM
  • In-source PR-2 regression tests cover: numeric output shape, text output shape, n>len empty result, source-rebind survival via view's src RC bump, srt escape path via materialise, window-of-window via source-side materialise
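Two of the regression properties listed above (the empty result when n > len, and source-rebind survival) can be illustrated in isolation. This is a minimal sketch, not one of the in-source tests: `ListView` and `first_window` are hypothetical, with `Rc<Vec<i64>>` standing in for the VM's source list.

```rust
use std::rc::Rc;

/// Hypothetical stand-in for `HeapObj::ListView { src, start, len }`.
struct ListView {
    src: Rc<Vec<i64>>,
    start: usize,
    len: usize,
}

/// Build the first window of width `n`, or None when no window fits.
fn first_window(src: &Rc<Vec<i64>>, n: usize) -> Option<ListView> {
    if n == 0 || n > src.len() {
        return None;
    }
    Some(ListView { src: Rc::clone(src), start: 0, len: n })
}

fn main() {
    let mut xs = Rc::new(vec![10, 20, 30]);
    let w = first_window(&xs, 2).expect("window fits");

    // Rebind the source name to a different list. The old list stays alive
    // because the view's `src` handle bumped its refcount.
    xs = Rc::new(vec![99]);

    assert_eq!(&w.src[w.start..w.start + w.len], &[10, 20]);
    assert_eq!(xs.len(), 1);
    // n > len: no window fits, so the result is empty.
    assert!(first_window(&xs, 2).is_none());
}
```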

Follow-ups

  • Cranelift inlined FOREACHPREP / LISTGET / LEN can gain a [ptr+0] == 1 discriminant guard so views are consumed directly through the JIT (which would remove the per-function bail and reclaim the perf win on that path).
  • The bio bench's remaining ~10s is closure-call overhead in the inner flt is-hydro xs loop. Separate optimisation, separate PR.
  • The remaining handful of HeapObj::List(items) arms inside jit_* extern helpers and the OP_LISTAPPEND copy path stay variant-narrow on purpose - those are reached only after the materialise-on-publish hooks (PR-1) have already swapped any view to an owned List.

…2 of window-text-perf P0)

The eager OP_WINDOW arm in the VM dispatcher now produces `L (ListView)`:
the outer is still an owning `HeapObj::List` of NanVals, each inner is a
ListView aliasing the source list (one src.clone_rc per view, no
per-element clone_rcs). Window construction drops from O(n*k) to O(n)
RC traffic.

OP_WINDOW_VIEW gets a parallel reshape on the reuse arm: when R[A] already
holds a ListView and we are its sole owner (strong_count == 1), rewrite
the (src, start, len) fields in place instead of clearing+repopulating
a Vec. Cost per stride drops from 15 element clone_rcs + 15 element
drop_rcs to a single src.drop_rc + src.clone_rc (or zero RC ops when
the source NanVal hasn't changed). The cold path where R[A] holds an
owned List with no view falls through to a fresh ListView allocation.

The bio microbenches that surfaced the cliff (3.2M `L t` window-walks
on Text-typed inputs) drop from ~4-6s to 0.2-0.5s on the same machine.
Numeric `L n` workloads keep the existing #325 win (the 1M-int window
microbench stays at ~0.2s).

Cranelift (both in-process JIT and AOT) bails on OP_WINDOW / OP_WINDOW_VIEW
because the inlined FOREACHPREP / LISTGET fast paths read Vec metadata
at fixed offsets (ptr+24 etc.), which would misread a ListView's struct
layout and be UB. A bail returns None from the JIT compiler; `run_default`
and the explicit `--run-cranelift` dispatch in main both fall back to
the bytecode VM (the closest non-JIT performance tier), where the new
fast path lives. AOT keeps calling the unchanged jit_window /
jit_window_view helpers, which still emit real Lists, so AOT-inlined
read paths stay sound at the cost of the perf win on window workloads.

Read-path audit: every HeapObj::List(items) match site in the VM
dispatcher that was reachable from a let-bound `window` result now
goes through `slice_of` so both List and ListView variants flow
through the same code (hd, tl, rev, srt, rsrt, slc, lst, at, take,
drop, zip, enumerate, chunks, frq, has, cat, unq, transpose, matmul,
dot, cumsum, getmany, fft, ifft, setunion/inter/diff, all reduction
helpers). Wildcard `_ =>` arms become explicit
HeapObj::Str/Map/Record/OkVal/ErrVal so rustc enforces the audit on
any future HeapObj addition. `nanval_truthy` learns the ListView arm.

A new `vm_list_slice` helper bridges the most common pattern
(`v -> &[NanVal] | err`) used by the stats helpers (sum, avg,
collect_numbers, fft pairs).

The materialise-on-publish hooks PR-1 added (OP_LISTAPPEND, OP_MSET,
OP_RECNEW, OP_RET) keep doing their job: views never leak into
long-lived containers, so AOT-compiled callers and `to_value`
boundaries still see real Lists.

A future PR can add a `[ptr+0] == 1` discriminant guard in Cranelift's
inlined FOREACHPREP / LISTGET to let the JIT consume views directly
and reclaim the perf win for the JIT path.
Mirrors the bioinformatics has-tm predicate (`cs = chars s; ws = window
15 cs; flt all-h ws`) at a CI-sized scale so the new ListView producer
path runs across all three engines on every test invocation. Pairs with
the in-source PR-2 regression tests in vm/mod.rs (numeric + text shape,
empty result, source-rebind survival, srt escape, window-of-window
materialisation).
@danieljohnmorris danieljohnmorris merged commit 53f9d09 into main May 17, 2026
4 checks passed
@danieljohnmorris danieljohnmorris deleted the fix/window-listview-perf branch May 17, 2026 09:37
@codecov

codecov Bot commented May 17, 2026

Codecov Report

❌ Patch coverage is 84.34164% with 44 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/vm/mod.rs | 86.10% | 36 Missing ⚠️ |
| src/main.rs | 61.90% | 8 Missing ⚠️ |


danieljohnmorris added a commit that referenced this pull request May 17, 2026
fifteen fixes since 0.11.5, all from rerun5/rerun6 personas plus standing asks: ListView foundation (#334), window-text-perf reshape via ListView (#336), inner-flt predicate inlining (#340), double-minus trap ILO-P021 (#331), bare-ident bang silent-nil regression (#324), Cranelift JIT span plumbing (#335), bool-prefix ternary (#330), wh prefix-cond reparse (#332), --run-engine auto-pick main (#329), subcommand helper hyphens+non-ident (#328), runtime error spans (#335), persona-diagnostic batch 3 (#327), rgxall1+ct (#333), single-line body diagnostic (#322 carry), lambda type-var defensive test (#326), N-deep prefix arity error (#339), prefix-minus span column drift (#338), doc-sync (#337).