vm: window-text-perf PR-2 - reshape OP_WINDOW to emit ListView #336
Merged
…2 of window-text-perf P0) The eager OP_WINDOW arm in the VM dispatcher now produces `L (ListView)`: the outer is still an owning `HeapObj::List` of NanVals, each inner is a ListView aliasing the source list (one src.clone_rc per view, no per-element clone_rcs). Window construction drops from O(n*k) to O(n) RC traffic.

OP_WINDOW_VIEW gets a parallel reshape on the reuse arm: when R[A] already holds a ListView and we are its sole owner (strong_count == 1), rewrite the (src, start, len) fields in place instead of clearing+repopulating a Vec. Cost per stride drops from 15 element clone_rcs + 15 element drop_rcs to a single src.drop_rc + src.clone_rc (or zero RC ops when the source NanVal hasn't changed). The cold path where R[A] holds an owned List with no view falls through to a fresh ListView allocation.

The bio microbenches that surfaced the cliff (3.2M `L t` window-walks on Text-typed inputs) drop from ~4-6s to 0.2-0.5s on the same machine. Numeric `L n` workloads keep the existing #325 win (the 1M-int window microbench stays at ~0.2s).

Cranelift (both in-process JIT and AOT) bails on OP_WINDOW / OP_WINDOW_VIEW because the inlined FOREACHPREP / LISTGET fast paths read Vec metadata at fixed offsets (ptr+24 etc.), which would misread a ListView's struct layout and UB. A bail returns None from the JIT compiler; `run_default` and the explicit `--run-cranelift` dispatch in main both fall back to the bytecode VM (the closest non-JIT performance tier), where the new fast path lives. AOT keeps calling the unchanged jit_window / jit_window_view helpers, which still emit real Lists, so AOT-inlined read paths stay sound at the cost of the perf win on window workloads.
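The view-based window construction and the sole-owner reuse path described above can be sketched in plain Rust. This is a minimal illustration, not the VM's code: `ListView`, `windows`, and `retarget` are hypothetical names, `Rc<Vec<T>>` stands in for the source NanVal, and `Rc::get_mut` plays the role of the `strong_count == 1` check.

```rust
use std::rc::Rc;

/// Hypothetical stand-in for the PR's `HeapObj::ListView { src, start, len }`.
struct ListView<T> {
    src: Rc<Vec<T>>, // shared handle to the source list
    start: usize,
    len: usize,
}

impl<T> ListView<T> {
    /// Borrow the aliased elements without copying them.
    fn as_slice(&self) -> &[T] {
        &self.src[self.start..self.start + self.len]
    }
}

/// Build every width-`k` window as a view: one `Rc::clone` of the source
/// per window and no per-element clones.
fn windows<T>(src: &Rc<Vec<T>>, k: usize) -> Vec<Rc<ListView<T>>> {
    if k == 0 || k > src.len() {
        return Vec::new();
    }
    (0..=src.len() - k)
        .map(|start| Rc::new(ListView { src: Rc::clone(src), start, len: k }))
        .collect()
}

/// Reuse path: when the slot's view has no other owners, rewrite its fields
/// in place; otherwise fall back to allocating a fresh view.
fn retarget<T>(slot: &mut Rc<ListView<T>>, src: &Rc<Vec<T>>, start: usize, len: usize) {
    match Rc::get_mut(slot) {
        Some(view) => {
            // Only touch the refcount when the source actually changed.
            if !Rc::ptr_eq(&view.src, src) {
                view.src = Rc::clone(src);
            }
            view.start = start;
            view.len = len;
        }
        None => *slot = Rc::new(ListView { src: Rc::clone(src), start, len }),
    }
}
```

With this shape, refcount traffic scales with the number of windows rather than with windows times width, which is the O(n*k) to O(n) change the commit message describes.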
Read-path audit: every HeapObj::List(items) match site in the VM dispatcher that was reachable from a let-bound `window` result now goes through `slice_of` so both List and ListView variants flow through the same code (hd, tl, rev, srt, rsrt, slc, lst, at, take, drop, zip, enumerate, chunks, frq, has, cat, unq, transpose, matmul, dot, cumsum, getmany, fft, ifft, setunion/inter/diff, all reduction helpers). Wildcard `_ =>` arms become explicit HeapObj::Str/Map/Record/OkVal/ErrVal so rustc enforces the audit on any future HeapObj addition. `nanval_truthy` learns the ListView arm. A new `vm_list_slice` helper bridges the most common pattern (`v -> &[NanVal] | err`) used by the stats helpers (sum, avg, collect_numbers, fft pairs).

The materialise-on-publish hooks PR-1 added (OP_LISTAPPEND, OP_MSET, OP_RECNEW, OP_RET) keep doing their job: views never leak into long-lived containers, so AOT-compiled callers and `to_value` boundaries still see real Lists.

Future PR can add a `[ptr+0] == 1` discriminant guard in Cranelift's inlined FOREACHPREP / LISTGET to let the JIT consume views directly and reclaim the perf win for the JIT path.
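The read-path pattern in the audit above (one accessor that yields a slice for both variants, with explicit non-list arms instead of `_ =>`) could look roughly like this. The enum is a cut-down, hypothetical `HeapObj` over `i64` rather than NanVal, and the signatures of `slice_of`, `hd`, and `sum` here are assumptions made for illustration only.

```rust
use std::rc::Rc;

// Hypothetical, cut-down heap object; the real `HeapObj` has more variants.
enum HeapObj {
    List(Vec<i64>),
    ListView { src: Rc<Vec<i64>>, start: usize, len: usize },
    Str(String),
}

/// Borrow any list-like object as a slice; non-lists yield `None`.
fn slice_of(obj: &HeapObj) -> Option<&[i64]> {
    match obj {
        HeapObj::List(items) => Some(items.as_slice()),
        HeapObj::ListView { src, start, len } => Some(&src[*start..*start + *len]),
        // Explicit arm instead of `_ =>`: adding a variant to `HeapObj`
        // makes this match fail to compile until it is handled here.
        HeapObj::Str(_) => None,
    }
}

/// First element of a list or view.
fn hd(obj: &HeapObj) -> Result<i64, String> {
    slice_of(obj)
        .ok_or_else(|| "hd: expected a list".to_string())?
        .first()
        .copied()
        .ok_or_else(|| "hd: empty list".to_string())
}

/// Sum of a list or view, sharing the same accessor as `hd`.
fn sum(obj: &HeapObj) -> Result<i64, String> {
    let items = slice_of(obj).ok_or_else(|| "sum: expected a list".to_string())?;
    Ok(items.iter().sum())
}
```

Because every reader goes through the accessor, a helper written against `&[i64]` never needs to know which variant produced the slice.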
Mirrors the bioinformatics has-tm predicate (`cs = chars s; ws = window 15 cs; flt all-h ws`) at a CI-sized scale so the new ListView producer path runs across all three engines on every test invocation. Pairs with the in-source PR-2 regression tests in vm/mod.rs (numeric + text shape, empty result, source-rebind survival, srt escape, window-of-window materialisation).
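For readers unfamiliar with the ilo pipeline `cs = chars s; ws = window 15 cs; flt all-h ws`, a rough Rust analogue is sketched below. The hydrophobic residue set in `is_hydro` and the function names are assumptions made for illustration; the example's actual predicate may differ.

```rust
/// Assumed hydrophobic residue set; illustrative only.
fn is_hydro(c: char) -> bool {
    matches!(c, 'A' | 'I' | 'L' | 'M' | 'F' | 'V' | 'W')
}

/// Count the width-`k` windows of `s` whose residues are all hydrophobic.
fn hydro_windows(s: &str, k: usize) -> usize {
    if k == 0 {
        return 0; // `slice::windows` panics on a zero width
    }
    let cs: Vec<char> = s.chars().collect(); // cs = chars s
    cs.windows(k) // ws = window k cs (borrowed sub-slices, no copies)
        .filter(|w| w.iter().all(|&c| is_hydro(c))) // flt all-h ws
        .count()
}
```

Rust's `slice::windows` already hands out borrowed sub-slices, which is the same aliasing idea the ListView producer path brings to the VM.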
danieljohnmorris added a commit that referenced this pull request May 17, 2026
fifteen fixes since 0.11.5, all from rerun5/rerun6 personas plus standing asks: ListView foundation (#334), window-text-perf reshape via ListView (#336), inner-flt predicate inlining (#340), double-minus trap ILO-P021 (#331), bare-ident bang silent-nil regression (#324), Cranelift JIT span plumbing (#335), bool-prefix ternary (#330), wh prefix-cond reparse (#332), --run-engine auto-pick main (#329), subcommand helper hyphens+non-ident (#328), runtime error spans (#335), persona-diagnostic batch 3 (#327), rgxall1+ct (#333), single-line body diagnostic (#322 carry), lambda type-var defensive test (#326), N-deep prefix arity error (#339), prefix-minus span column drift (#338), doc-sync (#337).
This was referenced May 17, 2026
Summary
This is PR-2 of the staged window-text-perf P0 fix from the bioinformatics rerun6 persona. PR-1 (#334) landed the `HeapObj::ListView { src, start, len }` foundation; this PR is where the perf win actually shows up.

The eager `OP_WINDOW` arm in the VM dispatcher now produces `L (ListView)` instead of `L (L t)`. Each window stride was previously a fresh `Vec<NanVal>` containing n clone_rc'd element NanVals; it's now a single `Rc<HeapObj::ListView>` of ~24 bytes plus one `src.clone_rc`. For an 11.4M-residue, 15-wide window walk that's the difference between ~170M RC touches at window construction and ~10K, plus the matching memory traffic for the small Vec allocations.

`OP_WINDOW_VIEW` (the fused-emitter scratch buffer used by `flt fn (window n xs)` and `map fn (window n xs)`) gets a parallel reshape: when `R[A]` already holds a sole-owned ListView, the stride update is now an in-place rewrite of `(src, start, len)`. Per-stride RC traffic drops from `15 element drop_rc + 15 element clone_rc` to `1 src.drop_rc + 1 src.clone_rc` (zero RC ops when the source NanVal hasn't changed).

The repro microbenches that surfaced the original cliff drop in time by an order of magnitude:

(Timing table not recovered; the surviving row labels are `L n` sum windows, `L t` hd windows, and `L t` flt over windows.)

The bio bench moves less than the microbenches because window construction is a smaller slice of its total time than the persona repro suggested. After this PR the remaining bio time is dominated by the inner `flt is-hydro xs` closure-call loop (165M closure invocations over the dataset) rather than window construction; further wins there are a separate optimisation.

Cranelift trade-off
Both the in-process JIT and the AOT pipeline inline FOREACHPREP / LISTGET / LEN as direct `[ptr+24]` Vec metadata reads (see the audit comment in `jit_cranelift.rs:OP_FOREACHPREP`). A ListView's struct layout doesn't match `Vec<NanVal>` at those offsets, so a view reaching those inlines would UB.

This PR takes the design-of-record path: Cranelift bails compilation on `OP_WINDOW` / `OP_WINDOW_VIEW` (returns `None` from `compile_function_body`). Functions touching window opcodes fall through to the bytecode VM dispatcher, which is the closest non-JIT performance tier and is where the new fast path lives. `run_default` and the explicit `--run-cranelift` engine path in `main.rs` both fall back to `vm::run` on `JitCallError::NotEligible`. AOT keeps calling the unchanged `jit_window` / `jit_window_view` helpers, which still emit real `HeapObj::List`s so AOT-inlined read paths stay sound at the cost of the perf win on window workloads.

A future PR can add a `[ptr+0] == 1` discriminant guard to those inlines and let Cranelift emit and consume ListViews directly.

What's in the diff (per-commit)
- vm: reshape `OP_WINDOW` / `OP_WINDOW_VIEW` to emit `HeapObj::ListView` - the producer-side reshape plus the read-side audit. Every `HeapObj::List(items) => ... _ => err` site reachable from a window result now goes through `slice_of` so both List and ListView variants flow through the same code (hd, tl, rev, srt, rsrt, slc, lst, at, take, drop, zip, enumerate, chunks, frq, has, cat, unq, transpose, matmul, dot, cumsum, getmany, fft, ifft, setunion/inter/diff, sum, avg, min, max, median, stdev, variance, quantile). Wildcard `_ =>` arms become explicit non-list variants so rustc enforces the audit on any future HeapObj addition. `nanval_truthy` learns the ListView arm. A new `vm_list_slice<'a>(&'a NanVal, err)` helper consolidates the `v -> &[NanVal] | err` pattern for stats helpers. The Cranelift bail + main fallback are bundled in here so the tree stays buildable at each commit.
- examples: window-listview-perf - mirrors the bio `has-tm` predicate (`cs = chars s; ws = window 15 cs; flt all-h ws`) at CI scale so the new ListView producer path runs through tree / VM / Cranelift on every test invocation via `tests/examples_engines.rs`.

Test plan
- `cargo test --release --features cranelift --lib`: 3110 passed, 0 failed
- `cargo test --release --features cranelift` (all 152 integration test binaries): 0 failures
- `cargo fmt --check` clean
- `cargo clippy --release --features cranelift -- -D warnings` clean
- `opt4.ilo` produces bit-identical output (proteins=22202, residues=11423140, tm-proteins=400, full aa-freq table)
- `examples/window-listview-perf.ilo` runs cleanly on tree + VM

Follow-ups
- `[ptr+0] == 1` discriminant guard so views consume directly through the JIT (would reclaim the per-function bail).
- `flt is-hydro xs` loop. Separate optimisation, separate PR.
- `HeapObj::List(items)` arms inside `jit_*` extern helpers and the OP_LISTAPPEND copy path stay variant-narrow on purpose - those are reached only after the materialise-on-publish hooks (PR-1) have already swapped any view to an owned List.
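The materialise-on-publish idea that the last follow-up relies on can be sketched as follows. This is an assumption-laden illustration: `Val`, `materialise`, and `list_append` are hypothetical names standing in for the VM's value type and its OP_LISTAPPEND hook, not the code PR-1 added.

```rust
use std::rc::Rc;

// Hypothetical value type; the real VM uses NanVal / HeapObj.
#[derive(Debug, Clone, PartialEq)]
enum Val {
    Int(i64),
    List(Vec<Val>),
    ListView { src: Rc<Vec<Val>>, start: usize, len: usize },
}

/// Copy a view's elements into an owned list; pass other values through.
fn materialise(v: Val) -> Val {
    match v {
        Val::ListView { src, start, len } => Val::List(src[start..start + len].to_vec()),
        owned @ (Val::Int(_) | Val::List(_)) => owned,
    }
}

/// "Publish" hook: anything stored in a long-lived container is
/// materialised first, so downstream readers only ever see owned lists.
fn list_append(dst: &mut Vec<Val>, item: Val) {
    dst.push(materialise(item));
}
```

Under this scheme a reader that only understands the owned `List` layout stays correct as long as every publish point runs the hook, which is why the variant-narrow arms can remain as they are.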