perf: page-gather decode microbench and paged-attention ADR (#117) #145

Merged
inureyes merged 4 commits into main from feature/issue-117-page-gather-microbench
May 31, 2026

Conversation

@inureyes
Member

Summary

Phase 0 feasibility spike for epic #116 (unified paged KV cache). Adds a synthetic op-level microbench that measures the decode-time cost of gathering scattered physical KV blocks on MLX, plus an ADR that uses those measurements to choose the paged-attention strategy and the KV pool tensor layout for the downstream phases.

This PR delivers the code, the docs, and the ADR with empirical-value sentinels. The bench is run on the spike machine and the measured numbers fill the sentinels in a follow-up (the MLX C++ link is a long cold build, so compiling and running the bench is handled separately from authoring).

What changed

  • examples/page_gather_microbench.rs: new synthetic, op-level decode microbench (no model load). Times contiguous SDPA (lower bound, the effective cost of the current paged_decode_attention_dense_compat path), gather-then-SDPA for two candidate pool layouts, gather-only per layout (to isolate gather cost from attention), and the per-step slice_update block append per layout. Sweeps context lengths 1024/4096/16384/32768, batch 1/4, block 16/32/64 at D=128, Hq=32, Hkv=8, f16. Forces scattered reads via reverse-order block ids over a 2x-slack pool, pads context to a block-aligned ctx_pad so all paths compare apples to apples, and reports block-size internal fragmentation. Emits an aligned human-readable table and a CSV:-prefixed machine-readable block. Timing harness mirrors examples/bridge_overhead_microbench.rs.
  • docs/adr/0001-paged-attention-gather-vs-fused-kernel.md: new ADR. Decides (A) gather-then-SDPA for Phases 1-5 (#118, Phase 1: Global block-pool tensor storage, through #122, Phase 5: Block-budget admission, eviction, and preemption), defers the (B) fused Metal paged-attention kernel to Phase 6 (#123, Phase 6: Fused Metal paged-attention kernel), keeps the existing default block size, and selects the pool layout from the measured take/slice_update numbers. Empirical values are <!--FILL_...--> sentinels.
  • docs/adr/README.md: new ADR index, establishing docs/adr/.
  • docs/README.md: links the new ADR directory from the docs layout.
  • scripts/run_page_gather_microbench.sh: wrapper that runs the bench under caffeinate -i.
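
The reverse-order block assignment described in the first bullet can be sketched as follows. This is a minimal illustration, not the code from examples/page_gather_microbench.rs: the function name `reverse_block_ids` and its return shape are hypothetical, and the only facts taken from the PR are that ids are handed out in reverse pool order and that the pool has 2x slack.

```rust
/// Hypothetical sketch: builds a flat block table for `batch` sequences of
/// `blocks_per_seq` logical blocks each. Physical ids start at the top of a
/// pool that is twice as large as needed and walk downward, so consecutive
/// logical blocks never map to ascending physical addresses.
///
/// Returns `(pool_blocks, ids)`. Ids are `i32` on the assumption that they
/// are later uploaded as a 32-bit index array.
fn reverse_block_ids(batch: usize, blocks_per_seq: usize) -> (usize, Vec<i32>) {
    let total_blocks = batch * blocks_per_seq;
    // 2x slack: only the upper half of the pool is ever referenced.
    let pool_blocks = 2 * total_blocks;
    let ids = (0..total_blocks)
        .map(|i| (pool_blocks - 1 - i) as i32)
        .collect();
    (pool_blocks, ids)
}
```

With one sequence of four blocks, the pool holds eight blocks and the table reads 7, 6, 5, 4, so a gather along the block axis walks the pool backwards instead of reading one contiguous run.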

Test plan

  • cargo run --release --features metal,accelerate --example page_gather_microbench (run on the spike machine; numbers fill the ADR sentinels).
  • cargo build --release --features metal,accelerate --example page_gather_microbench compiles.

Closes #117

inureyes added 2 commits May 31, 2026 12:20
Phase 0 feasibility spike for epic #116 (unified paged KV cache). Measures the decode-time cost of gathering scattered physical KV blocks on MLX so the downstream phases can choose between (A) gather-then-SDPA and (B) a fused Metal paged-attention kernel, and pick the pool tensor layout.

`examples/page_gather_microbench.rs` is a synthetic, op-level bench (no model load). It allocates fake K/V with `zeros` and times four decode-step paths across a context/batch/block-size sweep: contiguous SDPA (the lower bound and the effective cost of the current `paged_decode_attention_dense_compat` path), gather-then-SDPA for two pool layouts (`[num_blocks, block_size, n_kv_heads, head_dim]` and the head-split `[n_kv_heads, num_blocks, block_size, head_dim]`), gather-only for each layout to isolate gather cost, and the per-step `slice_update` block append for each layout. Block ids are assigned in reverse pool order over a 2x-slack pool to force genuinely scattered reads, and every path attends over a block-aligned `ctx_pad` so the comparison is apples to apples; the reported `frag%` is the block-size internal fragmentation. Output is both an aligned human-readable table and a `CSV:`-prefixed machine-readable block. The timing harness mirrors `examples/bridge_overhead_microbench.rs` (warmup, then eval-per-iter bracketed by `synchronize_default`).

`docs/adr/0001-paged-attention-gather-vs-fused-kernel.md` records the decision: adopt (A) gather-then-SDPA for Phases 1-5 (#118-#122) and defer the (B) fused Metal kernel to Phase 6 (#123), keep the existing default block size, and pick the pool layout from the measured `take`/`slice_update` numbers. The empirical values (crossover context length, layout choice, hardware, results table) are left as `<!--FILL_...-->` sentinels for the spike machine to fill after running the bench.

Also establishes `docs/adr/` with an index `README.md`, links it from `docs/README.md`, and adds `scripts/run_page_gather_microbench.sh` (runs the bench under `caffeinate -i`).
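
The harness shape described above (warmup, then eval-per-iter bracketed by a synchronize call) can be sketched in plain Rust. This is an assumption-laden illustration: `time_step` is a hypothetical name, the `sync` closure stands in for `synchronize_default`, the `step` closure is expected to build and evaluate one decode-step output, and the signature given to `per_call_us` here is a guess rather than the one in the example file. No MLX calls are made.

```rust
use std::time::{Duration, Instant};

/// Average microseconds per call for `iters` calls that took `total`.
/// (Signature is an assumption; the real helper may differ.)
fn per_call_us(total: Duration, iters: u32) -> f64 {
    total.as_secs_f64() * 1e6 / f64::from(iters)
}

/// Hypothetical harness: runs `warmup` untimed calls of `step`, drains
/// pending work with `sync`, then times `iters` calls and drains again
/// before stopping the clock so queued device work is included.
fn time_step<F, S>(warmup: u32, iters: u32, mut step: F, mut sync: S) -> f64
where
    F: FnMut(),
    S: FnMut(),
{
    for _ in 0..warmup {
        step();
    }
    sync(); // nothing from warmup leaks into the timed region

    let start = Instant::now();
    for _ in 0..iters {
        step(); // each call must force evaluation of its own output
    }
    sync(); // wait for the last iteration before reading the clock
    per_call_us(start.elapsed(), iters)
}
```

Keeping the evaluation inside `step` matters on a lazy framework: without it the loop would only build graphs, and the final `sync` would absorb the whole cost in one lump.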

Ran examples/page_gather_microbench.rs on the spike machine (Apple M1 Ultra, 128 GB, macOS 26.5, --release --features metal,accelerate) and filled the ADR 0001 sentinels with the measured results: the 24-row table (per-cell minimum of two sweeps), the hardware line, the gather-overhead crossover, the pool layout decision, and the block-size note.

Findings: layout A ([num_blocks, block_size, n_kv_heads, head_dim]) is on average 2.1x faster on gather-then-SDPA than the head-split layout, so it is the chosen pool layout, and slice_update block-append cost is layout-insensitive. Single-sequence gather overhead stays under ~15% at or below 4096 tokens, rising to ~56% at 16384 and ~67% at 32768, while batched decode (batch 4) already carries ~48% overhead at 1024 tokens and costs 2x to 3x the contiguous SDPA path past 4096. This confirms (A) gather-then-SDPA for Phases 1-5 and keeps the fused Metal kernel (B, #123) deferred to the long-context or batched regime.

Also applies a rustfmt fix to the example.
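
For readers relating the percentage figures to the "2x to 3x" figures, the arithmetic can be sketched as below. The definition of overhead used here (extra time relative to the contiguous SDPA baseline) is an assumption about how the ADR columns are computed, and both function names are hypothetical; the per-cell minimum mirrors the "minimum of two sweeps" rule stated in the commit message.

```rust
/// Smallest timing observed for one table cell across sweeps.
fn cell_min(samples_us: &[f64]) -> f64 {
    samples_us.iter().copied().fold(f64::INFINITY, f64::min)
}

/// Assumed definition: extra cost of a gathered path over the contiguous
/// baseline, as a percentage of the baseline.
fn overhead_pct(path_us: f64, contig_us: f64) -> f64 {
    (path_us - contig_us) / contig_us * 100.0
}
```

Under this definition a path that costs 2x the contiguous baseline has 100% overhead and one that costs 3x has 200%, so the batched figures past 4096 tokens sit well above the single-sequence percentages.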

@inureyes added the type:performance (Performance improvements), priority:high (High priority), area:core (mlxcel-core: MLX FFI, primitives, KV cache, layers), and platform:macos (macOS (Apple Silicon) specific) labels May 31, 2026

Add a note clarifying that gatherA_only can exceed gatherA_sdpa at short context because timing the gather alone forces a full K/V materialization, whereas the gather-then-SDPA path fuses take/reshape/transpose into the fused-SDPA read. This makes gatherA_sdpa the decode-relevant number and explains why strategy (A) stays cheap at common context lengths.
@cla-assistant

cla-assistant Bot commented May 31, 2026

CLA assistant check
All committers have signed the CLA.

@inureyes
Member Author

Implementation Review Summary

Intent

Phase 0 feasibility spike for epic #116: a synthetic op-level decode microbench (gather-then-SDPA vs contiguous SDPA, two pool layouts) plus an ADR that uses the measured numbers to choose the paged-attention strategy and KV pool tensor layout.

Findings Addressed

None. No findings at MEDIUM or above; nothing to auto-fix.

Verification

  • All stated requirements implemented: microbench (context 1k/4k/16k/32k, batch 1/4, block 16/32/64) + block-size sensitivity + fragmentation + ADR with a layout decision and a kernel-strategy decision. Acceptance criteria met (numbers committed and reproducible).
  • No placeholder/mock code remaining: all <!--FILL_...--> sentinels filled with measured M1 Ultra data; no TODO/TBD/stub.
  • Methodology valid: eval-per-iter harness with warmup + synchronize_default() bracketing mirrors examples/bridge_overhead_microbench.rs; every timed path forces materialization (gather_only evals both K and V via eval_pair + the returned array; SDPA/slice_update outputs are evaled), so no dead-code elision; inputs pre-evaled outside the timed region; scattered reads forced via reverse-order ids over a 2x-slack pool.
  • Shape/transpose math correct: layout A [nb,bs,Hkv,D] take(0)→reshape [B,ctx_pad,Hkv,D]→transpose[0,2,1,3] and layout B [Hkv,nb,bs,D] take(1)→reshape→transpose[1,0,2,3] both land at SDPA's [B,Hkv,ctx_pad,D]; reshapes are valid row-major splits (total_blocks=B*nb); slice_update shapes valid; GQA Hq=32/Hkv=8 standard. FFI signatures (take, slice_update, transpose_axes, fast_scaled_dot_product_attention null-mask, zeros, from_slice_i32) all match the crate-root re-exports; F16=9 dtype id correct.
  • ADR consistent with committed data: every prose claim verified against the Results table: layout A avg 2.1x faster than B (computed 2.10x, range 1.21x–3.25x); single-seq overhead ≤4k under ~15% (10–16%), ~56% @16k (52/57/58), ~67% @32k (66/69/65); batch-4 48% @1k (52/51/42) and 2x–3x contig past 4k; layout B 2x @4k to >4x @16k; slice_update A 680us / B 700us; overheadA/B columns internally consistent (recomputed, 0 mismatches); block-size 12/15/16 @4k matches.
  • Integrated / conventions: root examples/ is auto-discovered (no [[example]] entry needed, matching bridge_overhead_microbench.rs); wrapper script is +x like the other scripts/*.sh; ADR linked from both docs/README.md and docs/adr/README.md; Apache header present; Closes #117 in PR body; no em-dashes, straight quotes only, no AI-slop words/negation-couplets; no internal issue-number leak (all refs #116, [Epic] Unified KV cache: radix-tree-indexed paged attention with shared physical blocks, through #123, Phase 6: Fused Metal paged-attention kernel, in public range).
  • No unintended structural changes: additive only (examples/docs/script); no existing code touched.
  • Tests pass: orchestrator confirmed clean build / clippy -D warnings / fmt; bench ran on M1 Ultra (two sweeps), table filled.
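
The shape check in the "Shape/transpose math correct" bullet can be replayed without a GPU by tracking shapes only. The sketch below is illustrative: the function names are hypothetical, no MLX arrays are created, and the intermediate shape for layout B ([Hkv, B, ctx_pad, D]) is inferred from the transpose axes and the stated final shape rather than quoted from the source.

```rust
/// Reorders a 4-D shape by `axes`, as a transpose with that axis list would.
fn permute(shape: [usize; 4], axes: [usize; 4]) -> [usize; 4] {
    axes.map(|a| shape[a])
}

fn numel(shape: [usize; 4]) -> usize {
    shape.iter().product()
}

/// Layout A: pool is [num_blocks, bs, hkv, d]; gather B*nb blocks on axis 0.
fn layout_a_sdpa_shape(b: usize, nb: usize, bs: usize, hkv: usize, d: usize) -> [usize; 4] {
    let taken = [b * nb, bs, hkv, d];
    let reshaped = [b, nb * bs, hkv, d]; // row-major split of the block axis
    assert_eq!(numel(taken), numel(reshaped));
    permute(reshaped, [0, 2, 1, 3])
}

/// Layout B (head-split): pool is [hkv, num_blocks, bs, d]; gather on axis 1.
fn layout_b_sdpa_shape(b: usize, nb: usize, bs: usize, hkv: usize, d: usize) -> [usize; 4] {
    let taken = [hkv, b * nb, bs, d];
    let reshaped = [hkv, b, nb * bs, d]; // inferred intermediate shape
    assert_eq!(numel(taken), numel(reshaped));
    permute(reshaped, [1, 0, 2, 3])
}
```

For batch 4, 64 blocks of 16 tokens, Hkv=8 and D=128, both functions return [4, 8, 1024, 128], which is the [B, Hkv, ctx_pad, D] layout the bullet says SDPA expects.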

Adds a #[cfg(test)] mod to examples/page_gather_microbench.rs covering the three pure, non-GPU helpers: parse_usize_list (happy path, whitespace tolerance, trailing comma, empty input), per_call_us (round and fractional durations), and the ctx_pad/frag_pct math from run_config (exact-multiple → 0% fragmentation, non-multiple → correct pad and frag, block=1 degenerate case, ctx==block edge case). All 11 tests pass under `cargo test --example page_gather_microbench --features metal,accelerate`. No GPU arrays are constructed in the tests.
@inureyes
Member Author

PR Finalization Complete

Tests

Added #[cfg(test)] mod tests to examples/page_gather_microbench.rs covering the three pure, GPU-free helpers:

  • parse_usize_list: 5 cases (single value, multi, whitespace tolerance, trailing comma, empty string)
  • per_call_us: 2 cases (round and fractional durations)
  • ctx_pad / frag_pct math (inlined from run_config): 4 cases (exact-multiple → 0%, non-multiple → correct pad and frag%, block=1, ctx==block edge)

All 11 tests pass under cargo test --example page_gather_microbench --features metal,accelerate. No GPU arrays are constructed; the timed paths requiring MLX/GPU are not covered and are not intended to be.
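
As a rough picture of what the GPU-free helpers compute, here is a hedged reconstruction. It is not the code from the example file: `pad_and_frag` is a hypothetical stand-in for the `ctx_pad` / `frag_pct` math inlined from `run_config`, fragmentation is assumed to be measured against the padded length, and this `parse_usize_list` silently skips empty or unparsable tokens, which may differ from the real helper's error handling.

```rust
/// Rounds `ctx` up to a multiple of `block` and reports the padded share
/// as a percentage of the padded length (assumed definition).
fn pad_and_frag(ctx: usize, block: usize) -> (usize, f64) {
    let ctx_pad = ctx.div_ceil(block) * block;
    let frag_pct = (ctx_pad - ctx) as f64 / ctx_pad as f64 * 100.0;
    (ctx_pad, frag_pct)
}

/// Parses a comma-separated list such as "1024, 4096,16384," into numbers,
/// tolerating whitespace, trailing commas, and empty input.
fn parse_usize_list(s: &str) -> Vec<usize> {
    s.split(',')
        .map(str::trim)
        .filter(|tok| !tok.is_empty())
        .filter_map(|tok| tok.parse().ok())
        .collect()
}
```

An exact multiple such as 1024 tokens at block 16 pads to itself with 0% fragmentation, while 1000 tokens at block 16 pads to 1008, leaving 8 unused slots in the last block.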

Documentation

All cross-links verified:

  • docs/adr/README.md → 0001-paged-attention-gather-vs-fused-kernel.md: target exists
  • docs/README.md → adr/README.md: paragraph present
  • ADR 0001 file references (.rs, .metal paths): all 5 paths resolve

No docs/ko/ directory exists in this repo, so no translations are needed or expected.

CHANGELOG

Skipped. The CHANGELOG has no [Unreleased] section; every entry is under a version tag, so entries are added at release time. A dev bench + ADR spike with no user-facing runtime change does not warrant a versioned CHANGELOG entry.

Lint / Format

cargo fmt --check: clean. cargo clippy --example page_gather_microbench --features metal,accelerate -- -D warnings: 0 warnings, exit 0.

@inureyes added the status:done (Completed) label May 31, 2026
@inureyes merged commit 889bd84 into main May 31, 2026
5 checks passed
@inureyes deleted the feature/issue-117-page-gather-microbench branch May 31, 2026 05:02
Labels

area:core (mlxcel-core: MLX FFI, primitives, KV cache, layers), platform:macos (macOS (Apple Silicon) specific), priority:high (High priority), status:done (Completed), type:performance (Performance improvements)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Phase 0: Feasibility spike — page-gather microbench and paged-attention kernel ADR

1 participant