refactor(forward): lift attention_layer_with_cache helper (M32d Day 1 prep, #1830 PR-1 of 4) #1831

Closed
noahgift wants to merge 1 commit into `main` from `feat/m32d-moe-kv-cache-prep-attention-helper`

Conversation

@noahgift

Summary

Pure refactor of the dense KV-cache path's per-layer attention sub-block (steps 2a–2g) into a private helper method on `OwnedQuantizedModel`. Behavior is bit-identical to the previous inline version.

First of 4 PRs implementing #1830 (M32d KV cache for qwen3_moe inference path) per `docs/specifications/m32d-moe-kv-cache-scope.md`.

Why

M32d needs to add `forward_single_qwen3_moe_with_cache` that mirrors `forward_single_with_cache`'s structure but swaps the dense FFN block for MoE expert dispatch. The attention block is reusable as-is — extracting it into a shared helper eliminates the future copy-paste.
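The sharing argument can be sketched with toy code. This is a minimal illustration, not the aprender-serve implementation: `ToyLayer`, `ToyCache`, `dense_ffn`, `moe_ffn`, and the toy router are hypothetical stand-ins, and only the helper and forward-function names echo the ones discussed in this PR. Both forward paths call the same attention helper and differ only in the FFN step.

```rust
// Hypothetical toy types; not the real OwnedQuantizedLayer / OwnedQuantizedKVCache.
struct ToyLayer {
    attn_scale: f32,
    ffn_scale: f32,
    expert_scales: Vec<f32>,
}

struct ToyCache {
    entries: Vec<Vec<f32>>, // one cached vector per decoded position
}

// Shared attention stand-in: cache the current state, then add a scaled
// mean over all cached positions as the residual.
fn attention_layer_with_cache(hidden: &mut [f32], layer: &ToyLayer, cache: &mut ToyCache) {
    cache.entries.push(hidden.to_vec());
    let n = cache.entries.len() as f32;
    for (i, h) in hidden.iter_mut().enumerate() {
        let mean = cache.entries.iter().map(|e| e[i]).sum::<f32>() / n;
        *h += layer.attn_scale * mean;
    }
}

// Dense FFN stand-in: one scale applied to every element, plus residual.
fn dense_ffn(hidden: &mut [f32], layer: &ToyLayer) {
    for h in hidden.iter_mut() {
        let v = *h;
        *h = v + layer.ffn_scale * v;
    }
}

// MoE FFN stand-in: a toy top-1 router picks one expert scale.
fn moe_ffn(hidden: &mut [f32], layer: &ToyLayer) {
    if layer.expert_scales.is_empty() {
        return;
    }
    let mut best = 0;
    for (i, v) in hidden.iter().enumerate() {
        if *v > hidden[best] {
            best = i;
        }
    }
    let scale = layer.expert_scales[best % layer.expert_scales.len()];
    for h in hidden.iter_mut() {
        let v = *h;
        *h = v + scale * v;
    }
}

// Dense path: shared attention helper followed by the dense FFN.
fn forward_single_with_cache(hidden: &mut [f32], layers: &[ToyLayer], caches: &mut [ToyCache]) {
    for (layer, cache) in layers.iter().zip(caches.iter_mut()) {
        attention_layer_with_cache(hidden, layer, cache);
        dense_ffn(hidden, layer);
    }
}

// MoE path: same attention helper, only the FFN step is swapped.
fn forward_single_qwen3_moe_with_cache(
    hidden: &mut [f32],
    layers: &[ToyLayer],
    caches: &mut [ToyCache],
) {
    for (layer, cache) in layers.iter().zip(caches.iter_mut()) {
        attention_layer_with_cache(hidden, layer, cache);
        moe_ffn(hidden, layer);
    }
}
```

With both FFN stand-ins neutralized (all scales zero), the two paths produce identical hidden states and cache contents, which is the property that makes a single shared helper attractive.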

PR layout

| # | What | Risk |
|---|------|------|
| 1 (this) | Extract `attention_layer_with_cache` helper from dense path | Medium: must not regress dense KV cache |
| 2 | Extract `moe_ffn_layer` helper from `forward_qwen3_moe.rs` | Medium: per-expert routing logic |
| 3 | Add `forward_single_qwen3_moe_with_cache` + wire `run_qwen3_moe_generate` | Low: composition of helpers |
| 4 | `moe_kv_cache_equivalence` integration test (env-gated, `#[ignore]`) | High: float-equivalence is hard |

PR 1 is the lowest-risk prep step. The 34 existing `single_tests` must remain green.

What changed

  • `crates/aprender-serve/src/gguf/inference/forward/debug.rs` lines 477-589 (the per-layer attention sub-block inside `forward_single_with_cache`'s for-loop) lifted verbatim into a new private method `attention_layer_with_cache`.
  • Inline block replaced with a single call to the helper.

The helper signature follows the spec doc exactly:

```rust
fn attention_layer_with_cache(
    &self,
    hidden: &mut Vec<f32>,
    layer: &OwnedQuantizedLayer,
    layer_idx: usize,
    cache: &mut OwnedQuantizedKVCache,
    position: usize,
    attn_out_buffer: &mut [f32],
    use_rmsnorm: bool,
) -> Result<()>
```

ALL existing behaviors preserved: PMAT-260 debug traces, GH-278 LayerNorm bias branch, GH-479 Qwen3 per-head QK RMSNorm, RoPE skip for absolute-position models, GQA expansion for empty-cache first-token, CORRECTNESS-013 attention output trace, fused_matmul output projection, attn_output_bias add, residual.
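To make the shape of that signature concrete, here is a hedged sketch of a helper with the same parameter list operating on toy data. Everything inside it is an assumption for illustration: `ToyModel`, `ToyLayer`, `ToyKvCache`, the `String` error type, and the uniform-average "attention" are hypothetical, and none of the real steps (QK norm, RoPE, GQA expansion, fused matmul) are reproduced. It only shows how the `use_rmsnorm` branch, the per-layer cache append, the scratch buffer, and the residual add fit behind one call.

```rust
// Hypothetical stand-ins; not the aprender-serve types.
struct ToyLayer {
    norm_weight: Vec<f32>,
    norm_bias: Option<Vec<f32>>, // only read on the LayerNorm branch
}

struct ToyKvCache {
    per_layer: Vec<Vec<Vec<f32>>>, // [layer][position] -> cached vector
}

struct ToyModel {
    eps: f32,
}

impl ToyModel {
    #[allow(clippy::too_many_arguments)]
    fn attention_layer_with_cache(
        &self,
        hidden: &mut Vec<f32>,
        layer: &ToyLayer,
        layer_idx: usize,
        cache: &mut ToyKvCache,
        position: usize,
        attn_out_buffer: &mut [f32],
        use_rmsnorm: bool,
    ) -> Result<(), String> {
        if hidden.is_empty()
            || attn_out_buffer.len() != hidden.len()
            || layer.norm_weight.len() != hidden.len()
        {
            return Err("dimension mismatch".to_string());
        }
        let n = hidden.len() as f32;

        // Normalization branch selected by the flag.
        let normed: Vec<f32> = if use_rmsnorm {
            let ms = hidden.iter().map(|x| x * x).sum::<f32>() / n;
            let inv = 1.0 / (ms + self.eps).sqrt();
            hidden.iter().zip(&layer.norm_weight).map(|(x, w)| x * inv * w).collect()
        } else {
            let mean = hidden.iter().sum::<f32>() / n;
            let var = hidden.iter().map(|x| (x - mean) * (x - mean)).sum::<f32>() / n;
            let inv = 1.0 / (var + self.eps).sqrt();
            hidden
                .iter()
                .enumerate()
                .map(|(i, x)| {
                    let bias = layer.norm_bias.as_ref().map_or(0.0, |b| b[i]);
                    (x - mean) * inv * layer.norm_weight[i] + bias
                })
                .collect()
        };

        // Append this position to the layer's cache; positions must be contiguous.
        let entries = cache
            .per_layer
            .get_mut(layer_idx)
            .ok_or_else(|| format!("no cache slot for layer {layer_idx}"))?;
        if entries.len() != position {
            return Err(format!("expected position {}, got {position}", entries.len()));
        }
        entries.push(normed);

        // Toy attention: uniform average over cached positions, written to scratch.
        for (i, out) in attn_out_buffer.iter_mut().enumerate() {
            *out = entries.iter().map(|e| e[i]).sum::<f32>() / entries.len() as f32;
        }

        // Residual add back into the hidden state.
        for (h, a) in hidden.iter_mut().zip(attn_out_buffer.iter()) {
            *h += *a;
        }
        Ok(())
    }
}
```

Passing the cache and scratch buffer as explicit `&mut` parameters while the model itself is borrowed as `&self` is what lets the caller's per-layer loop reduce to a single fallible call followed by `?`.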

Test plan

  • `cargo check -p aprender-serve --lib --features cuda` → clean
  • `cargo test -p aprender-serve --lib --features cuda gguf::inference::forward::single` → 34 passed / 0 failed / 1 ignored including:
    • `test_forward_single_with_cache_q8k_path`
    • `test_forward_single_with_cache_gqa_256`
    • `test_forward_single_sequential_decode_10_tokens`
    • `test_forward_single_with_scratch_multi_token`
    • `test_forward_single_with_scratch_q8k_multi_position`

The multi-position + sequential decode tests are the load-bearing regression gate — they exercise the same attention path repeatedly with growing cache state.
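A bit-identity check over a growing cache could look roughly like the following. This is a sketch under stated assumptions, not one of the 34 existing tests: `step_inline`, `toy_attention`, `step_with_helper`, `first_bit_mismatch`, and `decode_and_compare` are hypothetical names, and the arithmetic is a toy stand-in for attention. The comparison uses `f32::to_bits` rather than `==`, so that differences such as `0.0` versus `-0.0` are not silently accepted.

```rust
/// Index of the first element whose bit pattern differs, if any.
/// A length mismatch is reported at the shorter length.
fn first_bit_mismatch(a: &[f32], b: &[f32]) -> Option<usize> {
    if a.len() != b.len() {
        return Some(a.len().min(b.len()));
    }
    a.iter().zip(b).position(|(x, y)| x.to_bits() != y.to_bits())
}

/// "Before" shape: the attention arithmetic written inline in the step.
fn step_inline(hidden: &mut [f32], cache: &mut Vec<Vec<f32>>) {
    cache.push(hidden.to_vec());
    let n = cache.len() as f32;
    for i in 0..hidden.len() {
        let mut acc = 0.0_f32;
        for entry in cache.iter() {
            acc += entry[i];
        }
        hidden[i] = hidden[i] * 0.5 + acc / n;
    }
}

/// "After" shape: the same arithmetic lifted verbatim into a helper.
fn toy_attention(hidden: &mut [f32], cache: &mut Vec<Vec<f32>>) {
    cache.push(hidden.to_vec());
    let n = cache.len() as f32;
    for i in 0..hidden.len() {
        let mut acc = 0.0_f32;
        for entry in cache.iter() {
            acc += entry[i];
        }
        hidden[i] = hidden[i] * 0.5 + acc / n;
    }
}

fn step_with_helper(hidden: &mut [f32], cache: &mut Vec<Vec<f32>>) {
    toy_attention(hidden, cache);
}

/// Decode `tokens` positions with both variants and report the first
/// (position, element) where the hidden states stop being bit-identical.
fn decode_and_compare(tokens: usize, dim: usize) -> Result<(), (usize, usize)> {
    let mut h_a: Vec<f32> = (0..dim).map(|i| 0.1 * (i as f32 + 1.0)).collect();
    let mut h_b = h_a.clone();
    let (mut c_a, mut c_b) = (Vec::new(), Vec::new());
    for pos in 0..tokens {
        step_inline(&mut h_a, &mut c_a);
        step_with_helper(&mut h_b, &mut c_b);
        if let Some(i) = first_bit_mismatch(&h_a, &h_b) {
            return Err((pos, i));
        }
    }
    Ok(())
}
```

Running the comparison after every position, rather than only at the end, localizes a regression to the first cache length at which the two variants diverge.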

Cross-refs

  • Issue: #1830 (M32d: KV cache for qwen3_moe inference path; engineer-driven, 1-2 week)
  • Spec: `docs/specifications/m32d-moe-kv-cache-scope.md`
  • Contract gate (downstream, blocked by full M32d): FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004 in `contracts/qwen3-moe-serve-dispatch-v1.yaml`
  • Expected speedup once full M32d ships: roughly 10-30× (0.5 tok/s → 5-15 tok/s on Qwen3-Coder-30B-A3B)

🤖 Generated with Claude Code

… prep, #1830 PR-1 of 4)

Pure refactor of the dense KV-cache path's per-layer attention sub-block
(steps 2a–2g) into a private helper method on OwnedQuantizedModel.
Behavior is bit-identical to the previous inline version.

## Why

M32d (KV cache for qwen3_moe inference path, #1830) needs to add a new
forward function forward_single_qwen3_moe_with_cache that mirrors
forward_single_with_cache's structure but swaps the dense FFN block for
the MoE expert dispatch. The attention block is reusable as-is.

This PR is the first of 4 in the M32d cascade per
docs/specifications/m32d-moe-kv-cache-scope.md:

  PR 1 (this) — extract attention_layer_with_cache from dense path
  PR 2       — extract moe_ffn_layer helper from forward_qwen3_moe
  PR 3       — add forward_single_qwen3_moe_with_cache + wire generate
  PR 4       — add moe_kv_cache_equivalence integration test

PR 1 is the lowest-risk prep step. The 34 existing single_tests must
remain green (regression gate — the spec calls this out as 'medium
risk: must not regress dense KV cache').

## What changed

- debug.rs lines 477-589 (the attention sub-block inside the per-layer
  for loop of forward_single_with_cache) lifted verbatim into a new
  private method attention_layer_with_cache.
- Inline block replaced with a single call to the helper.
- Helper signature follows the spec doc:
    hidden: &mut Vec<f32>
    layer: &OwnedQuantizedLayer
    layer_idx: usize
    cache: &mut OwnedQuantizedKVCache
    position: usize
    attn_out_buffer: &mut [f32]
    use_rmsnorm: bool

The helper preserves ALL behavior of the inline version: PMAT-260 debug
trace calls, GH-278 LayerNorm bias branch, GH-479 Qwen3 per-head QK
RMSNorm, RoPE skip for absolute-position models, GQA expansion for
empty cache first-token path, CORRECTNESS-013 attention output trace,
fused_matmul output projection, attn_output_bias add, residual.

## Test plan

- [x] cargo check -p aprender-serve --lib --features cuda: clean
- [x] cargo test -p aprender-serve --lib --features cuda gguf::inference::forward::single:
       34 passed / 0 failed / 1 ignored
       — covers test_forward_single_with_cache_*, GQA variants,
         multi-token sequential decode, Q8K path

## Cross-refs

- Issue: #1830 (M32d KV cache for qwen3_moe inference path)
- Spec: docs/specifications/m32d-moe-kv-cache-scope.md
- Contract gate (downstream): FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004
- Sibling future PR: forward_qwen3_moe.rs lift of moe_ffn_layer helper

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift

Superseded by #1832. The operator switched from Option (b), engineer-driven, to Option (a), in-session, and shipped the full M32d KV cache feature monolithically, without the helper extraction this PR proposed. #1832 ships `forward_single_qwen3_moe_with_cache`, a cache-aware `run_qwen3_moe_generate`, the `moe_kv_cache_equivalence` test, and an `m32d_perf` bench showing an empirical 19× speedup. Closing this prep refactor PR, which is not needed once #1832 lands.

This is the standard chain-PR squash-leapfrog cleanup pattern (memory: feedback_chain_pr_squash_leapfrog.md).

@noahgift noahgift closed this May 20, 2026
auto-merge was automatically disabled May 20, 2026 06:04

Pull request was closed

@noahgift noahgift deleted the feat/m32d-moe-kv-cache-prep-attention-helper branch May 20, 2026 06:04