refactor(forward): lift attention_layer_with_cache helper (M32d Day 1 prep, #1830 PR-1 of 4)#1831
Closed
noahgift wants to merge 1 commit into
Closed
refactor(forward): lift attention_layer_with_cache helper (M32d Day 1 prep, #1830 PR-1 of 4)#1831noahgift wants to merge 1 commit into
noahgift wants to merge 1 commit into
Conversation
… prep, #1830 PR-1 of 4) Pure refactor of the dense KV-cache path's per-layer attention sub-block (steps 2a–2g) into a private helper method on OwnedQuantizedModel. Behavior is bit-identical to the previous inline version. ## Why M32d (KV cache for qwen3_moe inference path, #1830) needs to add a new forward function forward_single_qwen3_moe_with_cache that mirrors forward_single_with_cache's structure but swaps the dense FFN block for the MoE expert dispatch. The attention block is reusable as-is. This PR is the first of 4 in the M32d cascade per docs/specifications/m32d-moe-kv-cache-scope.md: PR 1 (this) — extract attention_layer_with_cache from dense path PR 2 — extract moe_ffn_layer helper from forward_qwen3_moe PR 3 — add forward_single_qwen3_moe_with_cache + wire generate PR 4 — add moe_kv_cache_equivalence integration test PR 1 is the lowest-risk prep step. The 34 existing single_tests must remain green (regression gate — the spec calls this out as 'medium risk: must not regress dense KV cache'). ## What changed - debug.rs lines 477-589 (the attention sub-block inside the per-layer for loop of forward_single_with_cache) lifted verbatim into a new private method attention_layer_with_cache. - Inline block replaced with a single call to the helper. - Helper signature follows the spec doc: hidden: &mut Vec<f32> layer: &OwnedQuantizedLayer layer_idx: usize cache: &mut OwnedQuantizedKVCache position: usize attn_out_buffer: &mut [f32] use_rmsnorm: bool The helper preserves ALL behavior of the inline version: PMAT-260 debug trace calls, GH-278 LayerNorm bias branch, GH-479 Qwen3 per-head QK RMSNorm, RoPE skip for absolute-position models, GQA expansion for empty cache first-token path, CORRECTNESS-013 attention output trace, fused_matmul output projection, attn_output_bias add, residual. ## Test plan - [x] cargo check -p aprender-serve --lib --features cuda: clean - [x] cargo test -p aprender-serve --lib --features cuda gguf::inference::forward::single: 34 passed / 0 failed / 1 ignored — covers test_forward_single_with_cache_*, GQA variants, multi-token sequential decode, Q8K path ## Cross-refs - Issue: #1830 (M32d KV cache for qwen3_moe inference path) - Spec: docs/specifications/m32d-moe-kv-cache-scope.md - Contract gate (downstream): FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004 - Sibling future PR: forward_qwen3_moe.rs lift of moe_ffn_layer helper Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> EOF
3 tasks
Contributor
Author
|
Superseded by #1832 — operator flipped from Option (b) engineer-driven to Option (a) in-session and shipped the full M32d KV cache feature monolithically (without the helper extraction this PR proposed). #1832 ships forward_single_qwen3_moe_with_cache + cache-aware run_qwen3_moe_generate + moe_kv_cache_equivalence test + m32d_perf bench with empirical 19× speedup. Closing this prep refactor PR; not needed once #1832 lands. This is the standard chain-PR squash-leapfrog cleanup pattern (memory: feedback_chain_pr_squash_leapfrog.md). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Pure refactor of the dense KV-cache path's per-layer attention sub-block (steps 2a–2g) into a private helper method on `OwnedQuantizedModel`. Behavior is bit-identical to the previous inline version.
First of 4 PRs implementing #1830 (M32d KV cache for qwen3_moe inference path) per `docs/specifications/m32d-moe-kv-cache-scope.md`.
Why
M32d needs to add `forward_single_qwen3_moe_with_cache` that mirrors `forward_single_with_cache`'s structure but swaps the dense FFN block for MoE expert dispatch. The attention block is reusable as-is — extracting it into a shared helper eliminates the future copy-paste.
PR layout
PR 1 is the lowest-risk prep step. The 34 existing `single_tests` must remain green.
What changed
The helper signature follows the spec doc exactly:
```rust
fn attention_layer_with_cache(
&self,
hidden: &mut Vec,
layer: &OwnedQuantizedLayer,
layer_idx: usize,
cache: &mut OwnedQuantizedKVCache,
position: usize,
attn_out_buffer: &mut [f32],
use_rmsnorm: bool,
) -> Result<()>
```
ALL existing behaviors preserved: PMAT-260 debug traces, GH-278 LayerNorm bias branch, GH-479 Qwen3 per-head QK RMSNorm, RoPE skip for absolute-position models, GQA expansion for empty-cache first-token, CORRECTNESS-013 attention output trace, fused_matmul output projection, attn_output_bias add, residual.
Test plan
The multi-position + sequential decode tests are the load-bearing regression gate — they exercise the same attention path repeatedly with growing cache state.
Cross-refs
🤖 Generated with Claude Code