refactor(forward): lift attention_layer_with_cache helper (M32d Day 1 prep, #1830 PR-1 of 4) by noahgift · Pull Request #1831 · paiml/aprender

noahgift · 2026-05-20T05:27:18Z

Summary

Pure refactor of the dense KV-cache path's per-layer attention sub-block (steps 2a–2g) into a private helper method on `OwnedQuantizedModel`. Behavior is bit-identical to the previous inline version.

First of 4 PRs implementing #1830 (M32d KV cache for qwen3_moe inference path) per `docs/specifications/m32d-moe-kv-cache-scope.md`.

Why

M32d needs to add `forward_single_qwen3_moe_with_cache` that mirrors `forward_single_with_cache`'s structure but swaps the dense FFN block for MoE expert dispatch. The attention block is reusable as-is — extracting it into a shared helper eliminates the future copy-paste.

PR layout

#	What	Risk
1 (this)	Extract `attention_layer_with_cache` helper from dense path	Medium — must not regress dense KV cache
2	Extract `moe_ffn_layer` helper from `forward_qwen3_moe.rs`	Medium — per-expert routing logic
3	Add `forward_single_qwen3_moe_with_cache` + wire `run_qwen3_moe_generate`	Low — composition of helpers
4	`moe_kv_cache_equivalence` integration test (env-gated, `#[ignore]`)	High — float-equivalence is hard

PR 1 is the lowest-risk prep step. The 34 existing `single_tests` must remain green.

What changed

`crates/aprender-serve/src/gguf/inference/forward/debug.rs` lines 477-589 (the per-layer attention sub-block inside `forward_single_with_cache`'s for-loop) lifted verbatim into a new private method `attention_layer_with_cache`.
Inline block replaced with a single call to the helper.

The helper signature follows the spec doc exactly:

```rust
fn attention_layer_with_cache(
&self,
hidden: &mut Vec,
layer: &OwnedQuantizedLayer,
layer_idx: usize,
cache: &mut OwnedQuantizedKVCache,
position: usize,
attn_out_buffer: &mut [f32],
use_rmsnorm: bool,
) -> Result<()>
```

ALL existing behaviors preserved: PMAT-260 debug traces, GH-278 LayerNorm bias branch, GH-479 Qwen3 per-head QK RMSNorm, RoPE skip for absolute-position models, GQA expansion for empty-cache first-token, CORRECTNESS-013 attention output trace, fused_matmul output projection, attn_output_bias add, residual.

Test plan

`cargo check -p aprender-serve --lib --features cuda` → clean
`cargo test -p aprender-serve --lib --features cuda gguf::inference::forward::single` → 34 passed / 0 failed / 1 ignored including:
- `test_forward_single_with_cache_q8k_path`
- `test_forward_single_with_cache_gqa_256`
- `test_forward_single_sequential_decode_10_tokens`
- `test_forward_single_with_scratch_multi_token`
- `test_forward_single_with_scratch_q8k_multi_position`

The multi-position + sequential decode tests are the load-bearing regression gate — they exercise the same attention path repeatedly with growing cache state.

Cross-refs

Issue: M32d: KV cache for qwen3_moe inference path (engineer-driven, 1-2 week) #1830 (M32d KV cache for qwen3_moe inference path)
Spec: `docs/specifications/m32d-moe-kv-cache-scope.md`
Contract gate (downstream, blocked by full M32d): FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004 in `contracts/qwen3-moe-serve-dispatch-v1.yaml`
Expected speedup once full M32d ships: ~30× (0.5 tok/s → 5-15 tok/s on Qwen3-Coder-30B-A3B)

🤖 Generated with Claude Code

… prep, #1830 PR-1 of 4) Pure refactor of the dense KV-cache path's per-layer attention sub-block (steps 2a–2g) into a private helper method on OwnedQuantizedModel. Behavior is bit-identical to the previous inline version. ## Why M32d (KV cache for qwen3_moe inference path, #1830) needs to add a new forward function forward_single_qwen3_moe_with_cache that mirrors forward_single_with_cache's structure but swaps the dense FFN block for the MoE expert dispatch. The attention block is reusable as-is. This PR is the first of 4 in the M32d cascade per docs/specifications/m32d-moe-kv-cache-scope.md: PR 1 (this) — extract attention_layer_with_cache from dense path PR 2 — extract moe_ffn_layer helper from forward_qwen3_moe PR 3 — add forward_single_qwen3_moe_with_cache + wire generate PR 4 — add moe_kv_cache_equivalence integration test PR 1 is the lowest-risk prep step. The 34 existing single_tests must remain green (regression gate — the spec calls this out as 'medium risk: must not regress dense KV cache'). ## What changed - debug.rs lines 477-589 (the attention sub-block inside the per-layer for loop of forward_single_with_cache) lifted verbatim into a new private method attention_layer_with_cache. - Inline block replaced with a single call to the helper. - Helper signature follows the spec doc: hidden: &mut Vec<f32> layer: &OwnedQuantizedLayer layer_idx: usize cache: &mut OwnedQuantizedKVCache position: usize attn_out_buffer: &mut [f32] use_rmsnorm: bool The helper preserves ALL behavior of the inline version: PMAT-260 debug trace calls, GH-278 LayerNorm bias branch, GH-479 Qwen3 per-head QK RMSNorm, RoPE skip for absolute-position models, GQA expansion for empty cache first-token path, CORRECTNESS-013 attention output trace, fused_matmul output projection, attn_output_bias add, residual. ## Test plan - [x] cargo check -p aprender-serve --lib --features cuda: clean - [x] cargo test -p aprender-serve --lib --features cuda gguf::inference::forward::single: 34 passed / 0 failed / 1 ignored — covers test_forward_single_with_cache_*, GQA variants, multi-token sequential decode, Q8K path ## Cross-refs - Issue: #1830 (M32d KV cache for qwen3_moe inference path) - Spec: docs/specifications/m32d-moe-kv-cache-scope.md - Contract gate (downstream): FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004 - Sibling future PR: forward_qwen3_moe.rs lift of moe_ffn_layer helper Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> EOF

noahgift · 2026-05-20T06:04:41Z

Superseded by #1832 — operator flipped from Option (b) engineer-driven to Option (a) in-session and shipped the full M32d KV cache feature monolithically (without the helper extraction this PR proposed). #1832 ships forward_single_qwen3_moe_with_cache + cache-aware run_qwen3_moe_generate + moe_kv_cache_equivalence test + m32d_perf bench with empirical 19× speedup. Closing this prep refactor PR; not needed once #1832 lands.

This is the standard chain-PR squash-leapfrog cleanup pattern (memory: feedback_chain_pr_squash_leapfrog.md).

noahgift enabled auto-merge (squash) May 20, 2026 05:27

noahgift mentioned this pull request May 20, 2026

refactor(forward): lift moe_ffn_layer helper (M32d Day 1 prep, #1830 PR-2 of 4) #1834

Closed

3 tasks

noahgift closed this May 20, 2026

auto-merge was automatically disabled May 20, 2026 06:04
Pull request was closed

noahgift deleted the feat/m32d-moe-kv-cache-prep-attention-helper branch May 20, 2026 06:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(forward): lift attention_layer_with_cache helper (M32d Day 1 prep, #1830 PR-1 of 4)#1831

refactor(forward): lift attention_layer_with_cache helper (M32d Day 1 prep, #1830 PR-1 of 4)#1831
noahgift wants to merge 1 commit into
mainfrom
feat/m32d-moe-kv-cache-prep-attention-helper

noahgift commented May 20, 2026

Uh oh!

noahgift commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 20, 2026

Summary

Why

PR layout

What changed

Test plan

Cross-refs

Uh oh!

noahgift commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant