fuse flt/map over window into stride-1 in-place loop #325

Merged: danieljohnmorris merged 4 commits into main from fix/window-fused-opcodes on May 16, 2026

@danieljohnmorris
Collaborator

Summary

Manifesto framing: idiomatic `flt all-hydro (window 15 (chars seq))` was
~6x slower than the hand-rolled imperative form on the bioinformatics
rerun5 corpus (11.4M residues). Root cause: the unfused emitter
materialised an `L (L t)` of n-k+1 small inner lists (n the input
length, k the window size), allocating ~170M fresh Vecs total. AI agents
reaching for the idiomatic shape paid a real latency tax; the imperative
escape hatch takes more tokens to emit and defeats the point of having
higher-order list builtins.

This PR fuses `flt fn (window n xs)` and `map fn (window n xs)` at emit
time into a stride-1 loop that walks xs once with a single reusable
scratch list. The reuse path piggy-backs on the same RC=1 in-place
mutation pattern already used by OP_ADD_SS, OP_LISTAPPEND, and
OP_MSET.
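
To make the difference concrete, here is a minimal Rust sketch of the two shapes for the filter case. It is an illustration only, not the emitter's output: the function names are hypothetical, plain `Vec`s stand in for ilo lists, and returning an empty result for a window size of 0 is an assumption rather than documented ilo behaviour.

```rust
/// Eager shape: materialise every window as its own Vec, then filter.
/// Allocates one inner Vec per window, i.e. len - n + 1 of them.
fn filter_windows_eager<T: Clone>(
    xs: &[T],
    n: usize,
    pred: impl Fn(&[T]) -> bool,
) -> Vec<Vec<T>> {
    if n == 0 {
        return Vec::new(); // assumption: size-0 windows yield nothing
    }
    let all: Vec<Vec<T>> = xs.windows(n).map(|w| w.to_vec()).collect();
    all.into_iter().filter(|w| pred(w.as_slice())).collect()
}

/// Fused shape: walk xs once with stride 1, refilling one scratch buffer.
/// Only windows that pass the predicate are copied out.
fn filter_windows_fused<T: Clone>(
    xs: &[T],
    n: usize,
    pred: impl Fn(&[T]) -> bool,
) -> Vec<Vec<T>> {
    let mut kept = Vec::new();
    if n == 0 || n > xs.len() {
        return kept; // no full window fits, so the loop never runs
    }
    let mut scratch: Vec<T> = Vec::with_capacity(n);
    for idx in 0..=(xs.len() - n) {
        scratch.clear();
        scratch.extend_from_slice(&xs[idx..idx + n]);
        if pred(scratch.as_slice()) {
            kept.push(scratch.clone());
        }
    }
    kept
}
```

Both functions return the same windows; the fused one simply avoids allocating a list for every window that the predicate rejects.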

Result on the bioinformatics workload (local repro, full corpus): the
fused path matches the hand-rolled imperative form within noise. Agents
can now write the idiomatic shape and pay no penalty.

Repro

Before (unfused, on main):

$ time ilo bench.ilo --run-vm        # flt all-hydro (window 15 chars-list)
real    0m58s

After (fused, this PR):

$ time ilo bench.ilo --run-vm
real    0m10s

What's in the diff

Four commits, one coherent change per commit:

  1. vm: add OP_WINDOW_VIEW for stride-1 window views. New 2-word
    opcode, VM dispatcher arm with RC=1 in-place reuse fast path,
    find_block_leaders 2-word skip, chunk_is_all_numeric exclusion.
  2. cranelift: dispatch OP_WINDOW_VIEW to jit_window_view helper.
    Adds jit_window_view (mirrors the VM dispatcher's reuse logic),
    wired into both JIT and AOT compile paths, plus non-numeric /
    non-bool writer flag updates.
  3. emit: fuse flt/map over window into stride-1 in-place loop.
    compile_fused_window_hof, with a pattern-match in the
    (Builtin::Flt, 2) and (Builtin::Map, 2) arms (only when the inner
    call is exactly window n xs with no unwrap mode). Tree-walker keeps
    the eager path as the reference impl.
  4. tests: cross-engine coverage for fused flt/map over window. 11
    regression tests across tree / VM / Cranelift plus an
    examples/window-stream.ilo exercised by the examples_engines
    harness.

Test plan

  • cargo test --release --features cranelift --test regression_window_fused (11/11)
  • cargo test --release --features cranelift --test examples_engines (1/1)
  • cargo test --release --features cranelift (full suite green)
  • cargo fmt --check
  • cargo clippy --release --features cranelift --all-targets -- -D warnings

Follow-ups

None blocking. A future PR could extend the same fusion to
`fold fn init (window n xs)` and flt+map fusion if the bioinformatics
work surfaces more patterns.

vm: add OP_WINDOW_VIEW for stride-1 window views

New 2-word opcode that materialises xs[idx..idx+n] into a destination
list register, reusing the existing Vec in place when its strong count
is 1. Backs the fused `flt fn (window n xs)` / `map fn (window n xs)`
emitter coming in a follow-up commit.

The reuse path mirrors the RC-peek pattern already used by OP_ADD_SS,
OP_LISTAPPEND, and OP_MSET. When the destination doesn't hold a uniquely
owned list, the opcode falls back to allocating a fresh list and
drop_rc'ing the previous value.
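
The reuse-or-allocate decision can be sketched in Rust with `std::rc::Rc`, whose `Rc::get_mut` only succeeds when the handle is the sole owner. This is a hedged analogy, not the dispatcher's code: `window_view` and its signature are hypothetical, and the VM's own refcounted values and `drop_rc` are replaced here by `Rc` and ordinary drop semantics.

```rust
use std::rc::Rc;

/// Writes xs[idx..idx + n] into `dst`.
/// Returns true when the existing allocation was reused in place,
/// false when a fresh list had to be allocated.
/// The caller is assumed to guarantee idx + n <= xs.len().
fn window_view<T: Clone>(dst: &mut Rc<Vec<T>>, xs: &[T], idx: usize, n: usize) -> bool {
    let window = &xs[idx..idx + n];
    match Rc::get_mut(dst) {
        // Sole owner: overwrite the existing buffer.
        Some(buf) => {
            buf.clear();
            buf.extend_from_slice(window);
            true
        }
        // Shared: allocate fresh. The assignment drops this handle's
        // reference to the previous list; other holders keep it alive.
        None => {
            *dst = Rc::new(window.to_vec());
            false
        }
    }
}
```

A holder that cloned the handle before the next call keeps seeing the old window contents, which is why the fast path has to be gated on unique ownership.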

Also threads the new opcode through find_block_leaders (2-word skip) and
chunk_is_all_numeric (non-numeric result).

cranelift: dispatch OP_WINDOW_VIEW to jit_window_view helper

Adds jit_window_view (mirrors the OP_WINDOW_VIEW VM dispatcher's reuse
arm) and wires it into both the JIT and AOT compile paths. Each path
declares the 5-arg helper, registers it in the relocation table, and
emits an OP_WINDOW_VIEW arm that calls the helper with (cur, xs, idx, n,
span).

The helper takes the destination register's current value, attempts the
in-place reuse fast path when strong count is 1, and otherwise allocates
fresh. Returning the same NanVal bits on reuse keeps def_var as a no-op
write so the SSA variable's identity is preserved across iterations.
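
The "same bits on reuse" property can be illustrated with a by-value variant in Rust. This is a sketch under assumptions: `window_view_helper` is a hypothetical stand-in with a simplified signature (the real helper takes five arguments and works on NanVal bits), and pointer identity of an `Rc` stands in for identical NanVal bits.

```rust
use std::rc::Rc;

/// Takes the destination's current handle and returns the handle to
/// store back. On the reuse path the returned handle points at the very
/// same allocation, so writing it back changes nothing.
fn window_view_helper(cur: Rc<Vec<u8>>, xs: &[u8], idx: usize, n: usize) -> Rc<Vec<u8>> {
    let mut cur = cur;
    if let Some(buf) = Rc::get_mut(&mut cur) {
        buf.clear();
        buf.extend_from_slice(&xs[idx..idx + n]);
        return cur; // same allocation, same pointer
    }
    // Shared: hand back a fresh list; `cur` is dropped on return,
    // releasing this caller's reference to the old one.
    Rc::new(xs[idx..idx + n].to_vec())
}
```

When the helper is called in a loop with its own previous result, the handle stays stable for as long as nothing else holds a reference to the list.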

Also marks OP_WINDOW_VIEW as a non-numeric / non-bool writer in both
Cranelift flag-flow analyses.

emit: fuse flt/map over window into stride-1 in-place loop

Pattern-matches `flt fn (window n xs)` and `map fn (window n xs)` at
bytecode emit time and replaces the eager `L (L t)` materialisation with
a tight stride-1 loop that reuses one scratch list as the per-call
window. The reuse hinges on the new OP_WINDOW_VIEW arm's RC=1 fast path.

Why this matters: bioinformatics rerun5 hit a 5.8x slowdown on
`flt all-hydro (window 15 (chars seq))` over an 11.4M-residue input —
the unfused form allocated ~170M small lists. With fusion, drop
iterations and map iterations reuse a single Vec; only flt's keep
iterations force a fresh allocation (because the accumulator
clone_rc's the window into itself, bumping RC > 1).
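
The allocation behaviour described above can be checked with a small Rust model that counts window buffers. It is a sketch, not the emitted loop: `fused_filter` is a hypothetical name, `Rc::clone` plays the role of `clone_rc`, and the allocation counter exists only for illustration.

```rust
use std::rc::Rc;

/// Fused filter over stride-1 windows with a refcounted scratch list.
/// Returns the kept windows and the number of window buffers allocated.
fn fused_filter<T: Clone>(
    xs: &[T],
    n: usize,
    pred: impl Fn(&[T]) -> bool,
) -> (Vec<Rc<Vec<T>>>, usize) {
    let mut kept = Vec::new();
    let mut allocs = 0;
    if n == 0 || n > xs.len() {
        return (kept, allocs);
    }
    let mut scratch: Rc<Vec<T>> = Rc::new(Vec::with_capacity(n));
    allocs += 1;
    for idx in 0..=(xs.len() - n) {
        let window = &xs[idx..idx + n];
        match Rc::get_mut(&mut scratch) {
            // Previous window was dropped: reuse the buffer in place.
            Some(buf) => {
                buf.clear();
                buf.extend_from_slice(window);
            }
            // Previous window was kept: the accumulator still holds it.
            None => {
                scratch = Rc::new(window.to_vec());
                allocs += 1;
            }
        }
        if pred(scratch.as_slice()) {
            // Keeping shares the list, so the next iteration cannot reuse it.
            kept.push(Rc::clone(&scratch));
        }
    }
    (kept, allocs)
}
```

In this model an all-drop run allocates once, an all-pass run allocates once per window, and mixed runs fall in between, which matches the keep/drop asymmetry described in the commit message.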

The non-fused `window` path is unchanged: when the result escapes to a
binding or isn't consumed by an outer flt/map, the eager OP_WINDOW
dispatcher fires as before.

Tree-walker keeps the reference impl; the VM dispatcher and Cranelift
helper handle the reuse.

tests: cross-engine coverage for fused flt/map over window

Pin the fused emitter's output against tree, VM, and Cranelift engines:
  - flt/map over window across happy paths
  - all-pass case (every iter clone_rc's, forcing fresh allocation)
  - empty source / n > len(xs) edge cases (limit reg <= 0, loop body
    never runs)
  - size-1 windows (smallest non-trivial case)
  - non-bool predicate error path (matches the unfused error message)
  - negative case: window result bound to a variable, fusion must not
    fire and the eager OP_WINDOW path produces the expected list-of-
    lists
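
The empty-source and n > len(xs) cases come down to the loop limit being non-positive. A minimal Rust sketch of that arithmetic, assuming the limit is computed as len - n + 1 in signed arithmetic with n >= 1 (the function names are hypothetical, and the emitter's actual register layout is not shown in this PR):

```rust
/// Number of stride-1 windows of size n over a list of length len.
/// Signed arithmetic keeps n > len from wrapping around: the result is
/// simply zero or negative.
fn window_limit(len: usize, n: usize) -> i64 {
    len as i64 - n as i64 + 1
}

/// Start indices the fused loop would visit.
fn window_starts(len: usize, n: usize) -> Vec<usize> {
    let limit = window_limit(len, n);
    let mut starts = Vec::new();
    let mut idx: i64 = 0;
    while idx < limit {
        starts.push(idx as usize);
        idx += 1;
    }
    starts
}
```

With a limit of zero or below the loop body never runs, so both edge cases produce an empty result without any special-casing inside the loop.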

Adds an examples/window-stream.ilo demonstrator that the examples_engines
harness runs across every engine.

@danieljohnmorris merged commit 14140eb into main on May 16, 2026 (4 checks passed)
@danieljohnmorris deleted the fix/window-fused-opcodes branch on May 16, 2026 23:19

codecov Bot commented May 16, 2026

Codecov Report

❌ Patch coverage is 77.77778% with 72 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/vm/mod.rs | 81.58% | 51 Missing ⚠️ |
| src/vm/compile_cranelift.rs | 8.69% | 21 Missing ⚠️ |


danieljohnmorris added a commit that referenced this pull request May 16, 2026
eight fixes since 0.11.4, all from rerun5 personas: bare-bang silent-nil regression (#324), Cranelift AArch64 panic catch_unwind fallback (#319), multi-line body span drift (#318), HOF tree-bridge error parity on Cranelift (#321), bool-ternary brace sugar (#323), single-line body diagnostic with brace-block bodies (#322), unknown-subcommand error in multi-fn files (#320), window perf cliff fused flt/map (#325).