vm: window-text-perf PR-2 - reshape OP_WINDOW to emit ListView by danieljohnmorris · Pull Request #336 · ilo-lang/ilo

danieljohnmorris · 2026-05-17T09:33:57Z

Summary

This is PR-2 of the staged window-text-perf P0 fix from the bioinformatics rerun6 persona. PR-1 (#334) landed the HeapObj::ListView { src, start, len } foundation; this PR is where the perf win actually shows up.

The eager OP_WINDOW arm in the VM dispatcher now produces L (ListView) instead of L (L t). Each window stride was previously a fresh Vec<NanVal> containing n clone_rc'd element NanVals; it's now a single Rc<HeapObj::ListView> of ~24 bytes plus one src.clone_rc. For an 11.4M-residue, 15-wide window walk that's the difference between ~170M RC touches at window construction and ~10K, plus the matching memory traffic for the small Vec allocations.
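A minimal sketch of the two shapes, assuming a simplified `ListView` over `Rc<Vec<i64>>` rather than the real `NanVal`-based `HeapObj::ListView`. The struct and the function names `windows_eager` and `windows_view` are illustrative and do not come from the diff. The eager form copies `n` elements per window; the view form bumps one refcount on the source per window.

```rust
use std::rc::Rc;

/// Hypothetical stand-in for HeapObj::ListView { src, start, len }.
#[derive(Debug, Clone)]
struct ListView {
    src: Rc<Vec<i64>>, // shared source list; cloning bumps one refcount
    start: usize,
    len: usize,
}

impl ListView {
    /// Borrow the window's elements without copying them.
    fn as_slice(&self) -> &[i64] {
        &self.src[self.start..self.start + self.len]
    }
}

/// Old shape: every window owns a fresh Vec holding `n` copied elements.
fn windows_eager(src: &[i64], n: usize) -> Vec<Vec<i64>> {
    if n == 0 {
        return Vec::new(); // slice::windows panics on a zero width
    }
    src.windows(n).map(|w| w.to_vec()).collect()
}

/// New shape: every window is a small view that aliases the source.
fn windows_view(src: &Rc<Vec<i64>>, n: usize) -> Vec<ListView> {
    if n == 0 || n > src.len() {
        return Vec::new(); // n > len yields an empty result
    }
    (0..=src.len() - n)
        .map(|start| ListView { src: Rc::clone(src), start, len: n })
        .collect()
}
```

In this toy model the per-window cost of the view form is independent of the window width, which is the property the reshape relies on.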

OP_WINDOW_VIEW (the fused-emitter scratch buffer used by flt fn (window n xs) and map fn (window n xs)) gets a parallel reshape: when R[A] already holds a sole-owned ListView, the stride update is now an in-place rewrite of (src, start, len). Per-stride RC traffic drops from 15 element drop_rc + 15 element clone_rc to 1 src.drop_rc + 1 src.clone_rc (zero RC ops when the source NanVal hasn't changed).
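The sole-owner reuse can be sketched with `Rc::get_mut`, which returns `Some` only when no other strong or weak reference exists. This is an illustration under assumptions: `restride` and the simplified `ListView` are hypothetical names, and the real dispatcher works on `NanVal` registers rather than `Rc<ListView>` directly.

```rust
use std::rc::Rc;

/// Hypothetical simplified view; the real one holds a NanVal source.
struct ListView {
    src: Rc<Vec<i64>>,
    start: usize,
    len: usize,
}

/// Move the scratch view held in a register slot to the next stride.
/// Returns true when the existing allocation was rewritten in place.
fn restride(slot: &mut Rc<ListView>, src: &Rc<Vec<i64>>, start: usize, len: usize) -> bool {
    match Rc::get_mut(slot) {
        // Sole owner: rewrite (src, start, len) without allocating.
        Some(view) => {
            if !Rc::ptr_eq(&view.src, src) {
                // One drop of the old source plus one clone of the new one.
                view.src = Rc::clone(src);
            }
            view.start = start;
            view.len = len;
            true
        }
        // Shared: another holder still sees the old view, so allocate a fresh one.
        None => {
            *slot = Rc::new(ListView { src: Rc::clone(src), start, len });
            false
        }
    }
}
```

When the source is unchanged, the in-place branch touches no refcounts at all, matching the "zero RC ops" case described above.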

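The two text-typed repro microbenches that surfaced the original cliff (microbench3 and microbench4) drop in time by roughly an order of magnitude; the numeric microbench and the full bio pipeline stay close to flat: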

| Bench | Before (main) | After (this PR) |
|---|---|---|
| microbench (1M `L n`, sum windows) | ~0.2s | 0.18s (no regression on the #325 win) |
| microbench3 (3.2M `L t`, hd windows) | 4-6s | 0.49s |
| microbench4 (3.2M `L t`, flt over windows) | 4-6s | 0.45s |
| bio opt4.ilo (11.4M residues, full pipeline) | 11.09s | 10.8s |

The bio bench moves less than the microbenches because window construction is a smaller slice of its total time than the persona repro suggested. After this PR the remaining bio time is dominated by the inner flt is-hydro xs closure-call loop (165M closure invocations over the dataset) rather than window construction; further wins there are a separate optimisation.

Cranelift trade-off

Both the in-process JIT and the AOT pipeline inline FOREACHPREP / LISTGET / LEN as direct [ptr+24] Vec metadata reads (see the audit comment in jit_cranelift.rs:OP_FOREACHPREP). A ListView's struct layout doesn't match Vec<NanVal> at those offsets, so a view reaching those inlines would be read with the wrong layout, which is undefined behaviour.

This PR takes the design-of-record path: Cranelift bails compilation on OP_WINDOW / OP_WINDOW_VIEW (returns None from compile_function_body). Functions touching window opcodes fall through to the bytecode VM dispatcher, which is the closest non-JIT performance tier and is where the new fast path lives. run_default and the explicit --run-cranelift engine path in main.rs both fall back to vm::run on JitCallError::NotEligible. AOT keeps calling the unchanged jit_window / jit_window_view helpers, which still emit real HeapObj::Lists so AOT-inlined read paths stay sound at the cost of the perf win on window workloads.
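The bail-and-fall-back control flow can be sketched as follows. Everything here is a hypothetical model: the `Op` and `Engine` enums, `try_compile` and `pick_engine` are illustrative stand-ins, not the signatures of `compile_function_body`, `run_default` or `vm::run`.

```rust
/// Hypothetical opcode subset, for illustration only.
#[derive(Debug, PartialEq)]
enum Op {
    LoadK,
    Add,
    Window,
    WindowView,
    Ret,
}

#[derive(Debug, PartialEq)]
enum Engine {
    Jit,
    Vm,
}

/// Stand-in for the JIT compiler: None means "bail, this body is not eligible".
fn try_compile(body: &[Op]) -> Option<usize> {
    if body.iter().any(|op| matches!(op, Op::Window | Op::WindowView)) {
        return None; // window opcodes may produce ListViews the inlines cannot read
    }
    Some(body.len()) // pretend handle to the compiled function
}

/// Stand-in for the dispatch in main: use the JIT if it compiled, else the VM.
fn pick_engine(body: &[Op]) -> Engine {
    match try_compile(body) {
        Some(_) => Engine::Jit,
        None => Engine::Vm,
    }
}
```

The bail is per function in this model, so bodies without window opcodes keep the JIT path.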

A future PR can add a [ptr+0] == 1 discriminant guard to those inlines and let Cranelift emit and consume ListViews directly.
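A toy model of such a guard, with the heap object represented as four 8-byte words so that word 3 corresponds to [ptr+24]. The word layouts are assumptions made for this illustration (in particular, where a view keeps its length and which tag a List carries); only the `[ptr+0] == 1` check comes from the text above.

```rust
/// Tag the proposed guard tests for at [ptr+0].
const TAG_LIST_VIEW: u64 = 1;

/// Unguarded fast path: always treats word 3 ([ptr+24]) as the Vec length.
fn len_unguarded(obj: &[u64; 4]) -> u64 {
    obj[3]
}

/// Hypothetical slow path: in this toy layout a view keeps its len in word 2.
fn view_len_slow(obj: &[u64; 4]) -> u64 {
    obj[2]
}

/// Guarded read: check the discriminant first, then pick the matching path.
fn len_guarded(obj: &[u64; 4]) -> u64 {
    if obj[0] == TAG_LIST_VIEW {
        view_len_slow(obj)
    } else {
        len_unguarded(obj)
    }
}
```

In safe Rust the unguarded read merely returns the wrong number for a view; in the inlined machine code the same mistake would read unrelated memory.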

What's in the diff (per-commit)

vm: reshape OP_WINDOW / OP_WINDOW_VIEW to emit HeapObj::ListView - the producer-side reshape plus the read-side audit. Every HeapObj::List(items) => ... _ => err site reachable from a window result now goes through slice_of so both List and ListView variants flow through the same code (hd, tl, rev, srt, rsrt, slc, lst, at, take, drop, zip, enumerate, chunks, frq, has, cat, unq, transpose, matmul, dot, cumsum, getmany, fft, ifft, setunion/inter/diff, sum, avg, min, max, median, stdev, variance, quantile). Wildcard _ => arms become explicit non-list variants so rustc enforces the audit on any future HeapObj addition. nanval_truthy learns the ListView arm. A new vm_list_slice<'a>(&'a NanVal, err) helper consolidates the v -> &[NanVal] | err pattern for stats helpers. The Cranelift bail + main fallback are bundled in here so the tree stays buildable at each commit.
examples: window-listview-perf - mirrors the bio has-tm predicate (cs = chars s; ws = window 15 cs; flt all-h ws) at CI scale so the new ListView producer path runs through tree / VM / Cranelift on every test invocation via tests/examples_engines.rs.
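The read-side audit pattern can be sketched as below, assuming a cut-down `HeapObj` with only three variants and `i64` elements. The real `slice_of` and `vm_list_slice` signatures may differ; `hd` here is an illustrative consumer, not the VM's implementation.

```rust
use std::rc::Rc;

/// Cut-down stand-in for the VM's HeapObj.
enum HeapObj {
    List(Vec<i64>),
    ListView { src: Rc<HeapObj>, start: usize, len: usize },
    Str(String),
}

/// Resolve either list variant to a borrowed slice; None for non-lists.
fn slice_of(obj: &HeapObj) -> Option<&[i64]> {
    match obj {
        HeapObj::List(items) => Some(items.as_slice()),
        HeapObj::ListView { src, start, len } => {
            // Resolve the source first, then narrow to the window.
            let base = slice_of(src)?;
            base.get(*start..*start + *len)
        }
        // Explicit arm instead of `_`: adding a variant forces a compile error here.
        HeapObj::Str(_) => None,
    }
}

/// Example consumer: first element of any list-like value.
fn hd(obj: &HeapObj) -> Result<i64, String> {
    let items = slice_of(obj).ok_or_else(|| "hd: expected a list".to_string())?;
    items.first().copied().ok_or_else(|| "hd: empty list".to_string())
}
```

Because consumers only see `&[i64]`, they need no per-variant code, and the exhaustive match is what lets the compiler flag unaudited sites later.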

Test plan

cargo test --release --features cranelift --lib — 3110 passed, 0 failed
cargo test --release --features cranelift (all 152 integration test binaries) — 0 failures
cargo fmt --check clean
cargo clippy --release --features cranelift -- -D warnings clean
microbench{,2,3,4} produce identical output to pre-PR main and run in <1s
bio opt4.ilo produces bit-identical output (proteins=22202, residues=11423140, tm-proteins=400, full aa-freq table)
examples_engines harness runs examples/window-listview-perf.ilo cleanly on tree + VM
In-source PR-2 regression tests cover: numeric output shape, text output shape, n>len empty result, source-rebind survival via view's src RC bump, srt escape path via materialise, window-of-window via source-side materialise

Follow-ups

Cranelift inlined FOREACHPREP / LISTGET / LEN can gain a [ptr+0] == 1 discriminant guard so views consume directly through the JIT (would reclaim the per-function bail).
The bio bench's remaining ~10s is closure-call overhead in the inner flt is-hydro xs loop. Separate optimisation, separate PR.
The remaining handful of HeapObj::List(items) arms inside jit_* extern helpers and the OP_LISTAPPEND copy path stay variant-narrow on purpose - those are reached only after the materialise-on-publish hooks (PR-1) have already swapped any view to an owned List.
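A rough sketch of the materialise-on-publish idea that lets those sites stay variant-narrow. `Val`, `materialise` and `publish` are hypothetical names for this illustration; the actual hooks live in the opcode handlers PR-1 touched.

```rust
use std::rc::Rc;

/// Simplified value: either an owned list or a view aliasing a shared source.
#[derive(Debug, Clone, PartialEq)]
enum Val {
    List(Vec<i64>),
    View { src: Rc<Vec<i64>>, start: usize, len: usize },
}

/// Turn a view into an owned list; owned lists pass through unchanged.
fn materialise(v: Val) -> Val {
    match v {
        Val::View { src, start, len } => Val::List(src[start..start + len].to_vec()),
        owned @ Val::List(_) => owned,
    }
}

/// Hypothetical publish point: values entering a long-lived container are
/// materialised first, so readers of the container only ever see Val::List.
fn publish(container: &mut Vec<Val>, v: Val) {
    container.push(materialise(v));
}
```

After publishing, the view's hold on the source is released, so code downstream of the container can match on the owned variant alone.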

…2 of window-text-perf P0)

The eager OP_WINDOW arm in the VM dispatcher now produces `L (ListView)`: the outer is still an owning `HeapObj::List` of NanVals, each inner is a ListView aliasing the source list (one src.clone_rc per view, no per-element clone_rcs). Window construction drops from O(n*k) to O(n) RC traffic.

OP_WINDOW_VIEW gets a parallel reshape on the reuse arm: when R[A] already holds a ListView and we are its sole owner (strong_count == 1), rewrite the (src, start, len) fields in place instead of clearing+repopulating a Vec. Cost per stride drops from 15 element clone_rcs + 15 element drop_rcs to a single src.drop_rc + src.clone_rc (or zero RC ops when the source NanVal hasn't changed). The cold path where R[A] holds an owned List with no view falls through to a fresh ListView allocation.

The bio microbenches that surfaced the cliff (3.2M `L t` window-walks on Text-typed inputs) drop from ~4-6s to 0.2-0.5s on the same machine. Numeric `L n` workloads keep the existing #325 win (the 1M-int window microbench stays at ~0.2s).

Cranelift (both in-process JIT and AOT) bails on OP_WINDOW / OP_WINDOW_VIEW because the inlined FOREACHPREP / LISTGET fast paths read Vec metadata at fixed offsets (ptr+24 etc.), which would misread a ListView's struct layout and UB. A bail returns None from the JIT compiler; `run_default` and the explicit `--run-cranelift` dispatch in main both fall back to the bytecode VM (the closest non-JIT performance tier), where the new fast path lives. AOT keeps calling the unchanged jit_window / jit_window_view helpers, which still emit real Lists, so AOT-inlined read paths stay sound at the cost of the perf win on window workloads.

Read-path audit: every HeapObj::List(items) match site in the VM dispatcher that was reachable from a let-bound `window` result now goes through `slice_of` so both List and ListView variants flow through the same code (hd, tl, rev, srt, rsrt, slc, lst, at, take, drop, zip, enumerate, chunks, frq, has, cat, unq, transpose, matmul, dot, cumsum, getmany, fft, ifft, setunion/inter/diff, all reduction helpers). Wildcard `_ =>` arms become explicit HeapObj::Str/Map/Record/OkVal/ErrVal so rustc enforces the audit on any future HeapObj addition. `nanval_truthy` learns the ListView arm. A new `vm_list_slice` helper bridges the most common pattern (`v -> &[NanVal] | err`) used by the stats helpers (sum, avg, collect_numbers, fft pairs).

The materialise-on-publish hooks PR-1 added (OP_LISTAPPEND, OP_MSET, OP_RECNEW, OP_RET) keep doing their job: views never leak into long-lived containers, so AOT-compiled callers and `to_value` boundaries still see real Lists.

Future PR can add a `[ptr+0] == 1` discriminant guard in Cranelift's inlined FOREACHPREP / LISTGET to let the JIT consume views directly and reclaim the perf win for the JIT path.

Mirrors the bioinformatics has-tm predicate (`cs = chars s; ws = window 15 cs; flt all-h ws`) at a CI-sized scale so the new ListView producer path runs across all three engines on every test invocation. Pairs with the in-source PR-2 regression tests in vm/mod.rs (numeric + text shape, empty result, source-rebind survival, srt escape, window-of-window materialisation).

codecov · 2026-05-17T09:37:27Z

Codecov Report

❌ Patch coverage is 84.34164% with 44 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/vm/mod.rs | 86.10% | 36 Missing ⚠️ |
| src/main.rs | 61.90% | 8 Missing ⚠️ |


fifteen fixes since 0.11.5, all from rerun5/rerun6 personas plus standing asks: ListView foundation (#334), window-text-perf reshape via ListView (#336), inner-flt predicate inlining (#340), double-minus trap ILO-P021 (#331), bare-ident bang silent-nil regression (#324), Cranelift JIT span plumbing (#335), bool-prefix ternary (#330), wh prefix-cond reparse (#332), --run-engine auto-pick main (#329), subcommand helper hyphens+non-ident (#328), runtime error spans (#335), persona-diagnostic batch 3 (#327), rgxall1+ct (#333), single-line body diagnostic (#322 carry), lambda type-var defensive test (#326), N-deep prefix arity error (#339), prefix-minus span column drift (#338), doc-sync (#337).

danieljohnmorris added 2 commits May 17, 2026 10:26

danieljohnmorris merged commit 53f9d09 into main May 17, 2026
4 checks passed

danieljohnmorris deleted the fix/window-listview-perf branch May 17, 2026 09:37

This was referenced May 17, 2026

fix: CLI-boundary arity guard at every engine entry #344

Merged

fuse len (flt p xs) + inline FOREACH-in-body predicates (bio canonical, 17% win) #346

Merged
