perf: page-gather decode microbench and paged-attention ADR (#117) by inureyes · Pull Request #145 · lablup/mlxcel

inureyes · 2026-05-31T03:21:16Z

Summary

Phase 0 feasibility spike for epic #116 (unified paged KV cache). Adds a synthetic op-level microbench that measures the decode-time cost of gathering scattered physical KV blocks on MLX, plus an ADR that uses those measurements to choose the paged-attention strategy and the KV pool tensor layout for the downstream phases.

This PR delivers the code, the docs, and the ADR with empirical-value sentinels. The bench is run on the spike machine and the measured numbers fill the sentinels in a follow-up (the MLX C++ link is a long cold build, so compiling and running the bench is handled separately from authoring).

What changed

examples/page_gather_microbench.rs: new synthetic, op-level decode microbench (no model load). Times contiguous SDPA (lower bound, the effective cost of the current paged_decode_attention_dense_compat path), gather-then-SDPA for two candidate pool layouts, gather-only per layout (to isolate gather cost from attention), and the per-step slice_update block append per layout. Sweeps context lengths 1024/4096/16384/32768, batch 1/4, block 16/32/64 at D=128, Hq=32, Hkv=8, f16. Forces scattered reads via reverse-order block ids over a 2x-slack pool, pads context to a block-aligned ctx_pad so all paths compare apples to apples, and reports block-size internal fragmentation. Emits an aligned human-readable table and a CSV:-prefixed machine-readable block. Timing harness mirrors examples/bridge_overhead_microbench.rs.
docs/adr/0001-paged-attention-gather-vs-fused-kernel.md: new ADR. Decides (A) gather-then-SDPA for Phases 1-5 (#118 "Phase 1: Global block-pool tensor storage" through #122 "Phase 5: Block-budget admission, eviction, and preemption"), defers the (B) fused Metal paged-attention kernel to Phase 6 (#123 "Phase 6: Fused Metal paged-attention kernel"), keeps the existing default block size, and selects the pool layout from the measured take/slice_update numbers. Empirical values are left as sentinels.
docs/adr/README.md: new ADR index, establishing docs/adr/.
docs/README.md: links the new ADR directory from the docs layout.
scripts/run_page_gather_microbench.sh: wrapper that runs the bench under caffeinate -i.
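The block-aligned padding, the internal-fragmentation figure, and the reverse-order block ids described above can be sketched in plain Rust without touching MLX. This is a minimal sketch, not the code from examples/page_gather_microbench.rs: the function names and signatures are hypothetical, fragmentation is assumed to be measured against the padded length, and the "2x-slack pool" is assumed to mean a pool with twice as many blocks as the sequence needs.

```rust
/// Round `ctx` up to the next multiple of `block` (hypothetical helper).
pub fn ctx_pad(ctx: usize, block: usize) -> usize {
    assert!(block > 0, "block size must be non-zero");
    (ctx + block - 1) / block * block
}

/// Internal fragmentation in percent: padded slots that hold no real token,
/// relative to the padded length (assumed definition).
pub fn frag_pct(ctx: usize, block: usize) -> f64 {
    let pad = ctx_pad(ctx, block);
    if pad == 0 {
        return 0.0;
    }
    (pad - ctx) as f64 / pad as f64 * 100.0
}

/// Block ids for one sequence, assigned in reverse pool order over a pool
/// that holds twice the blocks the sequence needs (assumed meaning of
/// "2x slack"). Logical block 0 maps to the last physical block.
pub fn reverse_block_ids(ctx: usize, block: usize) -> Vec<i32> {
    let needed = ctx_pad(ctx, block) / block;
    let pool_blocks = 2 * needed;
    (0..needed).map(|i| (pool_blocks - 1 - i) as i32).collect()
}
```

With these definitions a context that is an exact multiple of the block size has zero fragmentation, and the ids never form an ascending run, which is what defeats any contiguous-read fast path.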

Test plan

cargo run --release --features metal,accelerate --example page_gather_microbench (run on the spike machine; numbers fill the ADR sentinels).
cargo build --release --features metal,accelerate --example page_gather_microbench compiles.

Closes #117

Phase 0 feasibility spike for epic #116 (unified paged KV cache). Measures the decode-time cost of gathering scattered physical KV blocks on MLX so the downstream phases can choose between (A) gather-then-SDPA and (B) a fused Metal paged-attention kernel, and pick the pool tensor layout.

`examples/page_gather_microbench.rs` is a synthetic, op-level bench (no model load). It allocates fake K/V with `zeros` and times four decode-step paths across a context/batch/block-size sweep: contiguous SDPA (the lower bound and the effective cost of the current `paged_decode_attention_dense_compat` path), gather-then-SDPA for two pool layouts (`[num_blocks, block_size, n_kv_heads, head_dim]` and the head-split `[n_kv_heads, num_blocks, block_size, head_dim]`), gather-only for each layout to isolate gather cost, and the per-step `slice_update` block append for each layout. Block ids are assigned in reverse pool order over a 2x-slack pool to force genuinely scattered reads, and every path attends over a block-aligned `ctx_pad` so the comparison is apples to apples; the reported `frag%` is the block-size internal fragmentation. Output is both an aligned human-readable table and a `CSV:`-prefixed machine-readable block. The timing harness mirrors `examples/bridge_overhead_microbench.rs` (warmup, then eval-per-iter bracketed by `synchronize_default`).

`docs/adr/0001-paged-attention-gather-vs-fused-kernel.md` records the decision: adopt (A) gather-then-SDPA for Phases 1-5 (#118-#122) and defer the (B) fused Metal kernel to Phase 6 (#123), keep the existing default block size, and pick the pool layout from the measured `take`/`slice_update` numbers. The empirical values (crossover context length, layout choice, hardware, results table) are left as sentinels for the spike machine to fill after running the bench.
Also establishes `docs/adr/` with an index `README.md`, links it from `docs/README.md`, and adds `scripts/run_page_gather_microbench.sh` (runs the bench under `caffeinate -i`).
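The harness shape named in the commit message (warmup, then eval-per-iter bracketed by a device synchronize) can be sketched generically. This is only an illustration under assumptions: `time_op` is a hypothetical name, and the `op` and `sync` closures stand in for "build and eval one decode step" and `synchronize_default`, whose real signatures are not shown in this PR page.

```rust
use std::time::Instant;

/// Time `op` and return the mean cost per call in microseconds.
/// `op` is expected to force evaluation of its result on every call;
/// `sync` is expected to block until all queued device work has finished.
/// Both closures are placeholders for the MLX calls used by the real bench.
pub fn time_op<F, S>(warmup: usize, iters: usize, mut op: F, mut sync: S) -> f64
where
    F: FnMut(),
    S: FnMut(),
{
    assert!(iters > 0, "need at least one timed iteration");
    // Warmup runs are excluded from the measurement.
    for _ in 0..warmup {
        op();
    }
    // Drain outstanding work so it is not charged to the timed region.
    sync();
    let start = Instant::now();
    for _ in 0..iters {
        op();
    }
    // Wait for the last iteration to actually complete before stopping the clock.
    sync();
    start.elapsed().as_secs_f64() * 1e6 / iters as f64
}
```

Bracketing with a synchronize on both sides matters on an asynchronous backend: without the first call, leftover warmup work leaks into the measurement, and without the second, the clock can stop while the final kernel is still running.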

Ran examples/page_gather_microbench.rs on the spike machine (Apple M1 Ultra, 128 GB, macOS 26.5, --release --features metal,accelerate) and filled the ADR 0001 sentinels with the measured results: the 24-row table (per-cell minimum of two sweeps), the hardware line, the gather-overhead crossover, the pool layout decision, and the block-size note. Findings: layout A ([num_blocks, block_size, n_kv_heads, head_dim]) is on average 2.1x faster on gather-then-SDPA than the head-split layout, so it is the chosen pool layout, and slice_update block-append cost is layout-insensitive. Single-sequence gather overhead stays under ~15% below 4096 tokens, rising to ~56% at 16384 and ~67% at 32768, while batched decode (batch 4) is already ~48% at 1024 tokens and 2x to 3x the contiguous SDPA cost past 4096. This confirms (A) gather-then-SDPA for Phases 1-5 and keeps the fused Metal kernel (B, #123) deferred to the long-context or batched regime. Also applies a rustfmt fix to the example.
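For readers who want to reproduce the derived figures from the raw table, the arithmetic can be sketched as follows. This assumes that "gather overhead" means the extra time of gather-then-SDPA relative to contiguous SDPA, and that the per-cell minimum is taken across the timings of the same cell in each sweep; both function names are hypothetical and the PR page does not show the ADR's exact formula.

```rust
/// Smallest timing observed for one table cell across sweeps.
/// Returns `None` when no samples were recorded.
pub fn cell_min(samples: &[f64]) -> Option<f64> {
    samples.iter().copied().fold(None, |acc, x| match acc {
        None => Some(x),
        Some(m) => Some(m.min(x)),
    })
}

/// Overhead of gather-then-SDPA over contiguous SDPA, in percent
/// (assumed definition: extra time divided by the contiguous baseline).
pub fn overhead_pct(gather_sdpa_us: f64, contig_us: f64) -> f64 {
    assert!(contig_us > 0.0, "baseline must be positive");
    (gather_sdpa_us - contig_us) / contig_us * 100.0
}
```

Under this definition, a path that costs 2x to 3x the contiguous baseline corresponds to an overhead of 100% to 200%.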

Add a note clarifying that gatherA_only can exceed gatherA_sdpa at short context because timing the gather alone forces a full K/V materialization, whereas the gather-then-SDPA path fuses take/reshape/transpose into the fused-SDPA read. This makes gatherA_sdpa the decode-relevant number and explains why strategy (A) stays cheap at common context lengths.

cla-assistant · 2026-05-31T03:44:23Z

All committers have signed the CLA.

inureyes · 2026-05-31T04:08:07Z

Implementation Review Summary

Intent

Phase 0 feasibility spike for epic #116: a synthetic op-level decode microbench (gather-then-SDPA vs contiguous SDPA, two pool layouts) plus an ADR that uses the measured numbers to choose the paged-attention strategy and KV pool tensor layout.

Findings Addressed

None. No findings at MEDIUM or above; nothing to auto-fix.

Verification

All stated requirements implemented — microbench (context 1k/4k/16k/32k, batch 1/4, block 16/32/64) + block-size sensitivity + fragmentation + ADR with a layout decision and a kernel-strategy decision. Acceptance criteria met (numbers committed and reproducible).
No placeholder/mock code remaining: all sentinels filled with measured M1 Ultra data; no TODO/TBD/stub.
Methodology valid — eval-per-iter harness with warmup + synchronize_default() bracketing mirrors examples/bridge_overhead_microbench.rs; every timed path forces materialization (gather_only evals both K and V via eval_pair + the returned array; SDPA/slice_update outputs are evaled), so no dead-code elision; inputs pre-evaled outside the timed region; scattered reads forced via reverse-order ids over a 2x-slack pool.
Shape/transpose math correct — layout A [nb,bs,Hkv,D] take(0)→reshape [B,ctx_pad,Hkv,D]→transpose[0,2,1,3] and layout B [Hkv,nb,bs,D] take(1)→reshape→transpose[1,0,2,3] both land at SDPA's [B,Hkv,ctx_pad,D]; reshapes are valid row-major splits (total_blocks=B*nb); slice_update shapes valid; GQA Hq=32/Hkv=8 standard. FFI signatures (take, slice_update, transpose_axes, fast_scaled_dot_product_attention null-mask, zeros, from_slice_i32) all match the crate-root re-exports; F16=9 dtype id correct.
ADR consistent with committed data. Every prose claim verified against the Results table: layout A avg 2.1x faster than B (computed 2.10x, range 1.21x–3.25x); single-seq overhead ≤4k under ~15% (10–16%), ~56% @16k (52/57/58), ~67% @32k (66/69/65); batch-4 ~48% @1k (52/51/42) and 2x–3x contig past 4k; layout B ~2x @4k to >4x @16k; slice_update A ~680us / B ~700us; overheadA/B columns internally consistent (recomputed, 0 mismatches); block-size 12/15/16 @4k matches.
Integrated / conventions: root examples/ is auto-discovered (no [[example]] entry needed, matching bridge_overhead_microbench.rs); wrapper script is +x like the other scripts/*.sh; ADR linked from both docs/README.md and docs/adr/README.md; Apache header present; Closes #117 in PR body; no em-dashes, straight quotes only, no AI-slop words/negation-couplets; no internal issue-number leak (all refs #116–#123 in public range).
No unintended structural changes — additive only (examples/docs/script); no existing code touched.
Tests pass — orchestrator confirmed clean build / clippy -D warnings / fmt; bench ran on M1 Ultra (two sweeps), table filled.
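The shape bookkeeping in the "Shape/transpose math" item can be checked with plain integer arithmetic. The sketch below tracks shapes only and calls no MLX function. The names are hypothetical, and `blocks_per_seq` is assumed to be the number of blocks gathered for each sequence, so that `total_blocks = B * blocks_per_seq` as stated in the review.

```rust
/// Apply an axis permutation to a 4-D shape (shape-only stand-in for a transpose).
pub fn permute(shape: [usize; 4], axes: [usize; 4]) -> [usize; 4] {
    [shape[axes[0]], shape[axes[1]], shape[axes[2]], shape[axes[3]]]
}

fn numel(shape: [usize; 4]) -> usize {
    shape.iter().product()
}

/// Layout A pool [num_blocks, bs, Hkv, D]: take on axis 0, reshape, transpose [0,2,1,3].
pub fn layout_a_sdpa_shape(b: usize, blocks_per_seq: usize, bs: usize, hkv: usize, d: usize) -> [usize; 4] {
    let gathered = [b * blocks_per_seq, bs, hkv, d]; // after take(axis 0)
    let reshaped = [b, blocks_per_seq * bs, hkv, d]; // [B, ctx_pad, Hkv, D]
    assert_eq!(numel(gathered), numel(reshaped)); // reshape keeps the element count
    permute(reshaped, [0, 2, 1, 3])
}

/// Layout B pool [Hkv, num_blocks, bs, D]: take on axis 1, reshape, transpose [1,0,2,3].
pub fn layout_b_sdpa_shape(b: usize, blocks_per_seq: usize, bs: usize, hkv: usize, d: usize) -> [usize; 4] {
    let gathered = [hkv, b * blocks_per_seq, bs, d]; // after take(axis 1)
    let reshaped = [hkv, b, blocks_per_seq * bs, d]; // [Hkv, B, ctx_pad, D]
    assert_eq!(numel(gathered), numel(reshaped));
    permute(reshaped, [1, 0, 2, 3])
}
```

For batch 4, block size 16, and a 1024-token padded context (64 blocks per sequence) with Hkv=8 and D=128, both functions return [4, 8, 1024, 128], the [B, Hkv, ctx_pad, D] shape the review says SDPA expects.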

Adds a #[cfg(test)] mod to examples/page_gather_microbench.rs covering the three pure, non-GPU helpers: parse_usize_list (happy path, whitespace tolerance, trailing comma, empty input), per_call_us (round and fractional durations), and the ctx_pad/frag_pct math from run_config (exact-multiple → 0% fragmentation, non-multiple → correct pad and frag, block=1 degenerate case, ctx==block edge case). All 11 tests pass under `cargo test --example page_gather_microbench --features metal,accelerate`. No GPU arrays are constructed in the tests.

inureyes · 2026-05-31T04:55:43Z

PR Finalization Complete

Tests

Added #[cfg(test)] mod tests to examples/page_gather_microbench.rs covering the three pure, GPU-free helpers:

parse_usize_list: 5 cases (single value, multi, whitespace tolerance, trailing comma, empty string)
per_call_us: 2 cases (round and fractional durations)
ctx_pad / frag_pct math (inlined from run_config): 4 cases (exact-multiple → 0%, non-multiple → correct pad and frag%, block=1, ctx==block edge)

All 11 tests pass under cargo test --example page_gather_microbench --features metal,accelerate. No GPU arrays are constructed; the timed paths requiring MLX/GPU are not covered and are not intended to be.
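The two named helpers are not reproduced on this page, so the following is only a guess at their shape, written to satisfy the cases listed above (whitespace tolerance, trailing comma, empty input, round and fractional durations). The signatures are assumptions, and silently skipping unparseable tokens is a choice made for this sketch, not confirmed behavior of the real `parse_usize_list`.

```rust
use std::time::Duration;

/// Parse a comma-separated list such as "1024, 4096," into integers.
/// Surrounding whitespace and empty fields (including a trailing comma) are ignored.
/// Tokens that fail to parse are skipped in this sketch.
pub fn parse_usize_list(s: &str) -> Vec<usize> {
    s.split(',')
        .map(str::trim)
        .filter(|t| !t.is_empty())
        .filter_map(|t| t.parse::<usize>().ok())
        .collect()
}

/// Mean time per call in microseconds for `iters` calls taking `total` overall.
pub fn per_call_us(total: Duration, iters: usize) -> f64 {
    assert!(iters > 0, "iters must be non-zero");
    total.as_secs_f64() * 1e6 / iters as f64
}
```

Keeping helpers like these free of GPU state is what lets them run under `cargo test` without constructing any MLX arrays.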

Documentation

All cross-links verified:

docs/adr/README.md → 0001-paged-attention-gather-vs-fused-kernel.md: target exists
docs/README.md → adr/README.md: paragraph present
ADR 0001 file references (.rs, .metal paths): all 5 paths resolve

No docs/ko/ directory exists in this repo, so no translations are needed or expected.

CHANGELOG

Skipped. The CHANGELOG has no [Unreleased] section; every entry is under a version tag, so entries are added at release time. A dev bench + ADR spike with no user-facing runtime change does not warrant a versioned CHANGELOG entry.

Lint / Format

cargo fmt --check: clean. cargo clippy --example page_gather_microbench --features metal,accelerate -- -D warnings: 0 warnings, exit 0.

inureyes added 2 commits May 31, 2026 12:20

inureyes added the type:performance (Performance improvements), priority:high (High priority), area:core (mlxcel-core: MLX FFI, primitives, KV cache, layers), and platform:macos (macOS, Apple Silicon specific) labels May 31, 2026

inureyes added the status:done (Completed) label May 31, 2026

inureyes merged commit 889bd84 into main May 31, 2026
5 checks passed

inureyes deleted the feature/issue-117-page-gather-microbench branch May 31, 2026 05:02

inureyes mentioned this pull request Jun 2, 2026

[Epic] Unified KV cache: radix-tree-indexed paged attention with shared physical blocks #116

Open

10 tasks

inureyes merged 4 commits into main from feature/issue-117-page-gather-microbench
