Scope
Implement KV cache support on the qwen3_moe inference path, mirroring the existing dense KV cache infrastructure. Unblocks contract gate FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004 in `contracts/qwen3-moe-serve-dispatch-v1.yaml` v1.2.0.
The operator chose Option (b), an engineer-driven follow-up (see #1829 for the spec / contract bump). Anyone familiar with the aprender inference stack (or willing to ramp up via the dense-path reference) can claim this issue.
Full playbook
See `docs/specifications/m32d-moe-kv-cache-scope.md` — covers:
- Audience + calendar target (1-2 weeks)
- Hand-off criteria (6 closeable items)
- Day-by-day plan (Day 1 ramp → Day 8-10 dispatch)
- PR layout (4 PRs: 2 prep refactors + 1 core + 1 test)
- Risk gates between PRs
- 4 open questions for Day 1
- Cross-team coordination
Quick context
- Why it matters: 30B-MoE without KV cache runs at ~0.5 tok/s, too slow for the CCPA Phase 6 bench to fit within a single per-turn budget. Empirical evidence is in paiml/claude-code-parity-apr `30b-moe-empirical-2026-05-19.md`. With KV cache, the expected rate is ~5-15 tok/s.
- Dense reference: `crates/aprender-serve/src/gguf/inference/forward/debug.rs:441` (`forward_single_with_cache`) is the existing pattern to mirror.
- Cache API: `crates/aprender-serve/src/gguf/runtime.rs:123` (`OwnedQuantizedKVCache`) — sufficient as-is, no struct changes needed.
- MoE forward to extend: `crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs` (the full-prefill primitive that lacks cache hooks).
- Generate-loop to wire: `crates/aprender-serve/src/infer/qwen3_moe_generate.rs::run_qwen3_moe_generate`.
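The property the cached path has to preserve can be shown on a toy single-head attention layer. This is a minimal sketch, not the aprender code: `ToyKvCache`, `project_k`, `project_v`, `forward_full`, and `forward_cached` are hypothetical names invented for illustration, and the projections are fixed elementwise transforms rather than real weight matrices. It demonstrates only that appending one token's K/V to a cache and attending over it gives the same last-position output as recomputing K/V for the whole prefix.

```rust
/// Toy KV cache: one key and one value vector per past token.
/// Hypothetical type for illustration; this is NOT `OwnedQuantizedKVCache`.
#[derive(Default)]
struct ToyKvCache {
    keys: Vec<Vec<f32>>,
    values: Vec<Vec<f32>>,
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Scaled dot-product attention for a single query over all keys/values.
fn attend(query: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let scale = 1.0 / (query.len() as f32).sqrt();
    let scores: Vec<f32> = keys.iter().map(|k| dot(query, k) * scale).collect();
    // Subtract the max score before exponentiating for numerical stability.
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut out = vec![0.0f32; values[0].len()];
    for (w, v) in exps.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += (w / sum) * x;
        }
    }
    out
}

/// Stand-ins for the K and V projections (assumed, not the real weights).
fn project_k(x: &[f32]) -> Vec<f32> {
    x.iter().map(|v| 0.5 * v).collect()
}
fn project_v(x: &[f32]) -> Vec<f32> {
    x.iter().map(|v| v + 1.0).collect()
}

/// Full-prefill path: recompute K/V for every token, return the last position.
fn forward_full(tokens: &[Vec<f32>]) -> Vec<f32> {
    let keys: Vec<Vec<f32>> = tokens.iter().map(|t| project_k(t)).collect();
    let values: Vec<Vec<f32>> = tokens.iter().map(|t| project_v(t)).collect();
    let query = tokens.last().expect("at least one token");
    attend(query, &keys, &values)
}

/// Cached path: project only the new token, append it, attend over the cache.
fn forward_cached(token: &[f32], cache: &mut ToyKvCache) -> Vec<f32> {
    cache.keys.push(project_k(token));
    cache.values.push(project_v(token));
    attend(token, &cache.keys, &cache.values)
}

fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}

fn main() {
    let tokens: Vec<Vec<f32>> = vec![
        vec![0.1, 0.2, 0.3],
        vec![-0.4, 0.5, 0.0],
        vec![0.7, -0.1, 0.2],
    ];
    let mut cache = ToyKvCache::default();
    for (i, tok) in tokens.iter().enumerate() {
        let cached = forward_cached(tok, &mut cache);
        let full = forward_full(&tokens[..=i]);
        assert!(max_abs_diff(&cached, &full) < 1e-6);
    }
    println!("cached decode matches full prefill at every position");
}
```

On the real path the comparison would presumably be made on logits and with a tolerance suited to quantized weights; the tolerance here is arbitrary.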
Acceptance test
# Numerical equivalence
QWEN3_MOE_GGUF_PATH=/path/to/qwen3-moe.gguf \
  cargo test --test moe_kv_cache_equivalence \
  -p aprender-serve --features cuda --release -- --ignored --nocapture
# Dense regression
cargo test -p aprender-serve --lib --features cuda gguf::inference::forward::single_tests
# V1_001 regression (existing test from #1819)
QWEN3_MOE_GGUF_PATH=/path/to/qwen3-moe.gguf \
  cargo test --test qwen3_moe_serve_dispatch_v1 \
  -p aprender-serve --features cuda --release -- --ignored --nocapture
# Perf measurement: ≥ 5 tok/s sustained on 30B-MoE
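The perf line above names a threshold but no command. As a hedged sketch of how the number could be checked, the snippet below computes tokens per second from a token count and a wall-clock duration and compares it to the 5 tok/s floor. The helper names (`tokens_per_second`, `meets_perf_gate`, `measure`) are hypothetical, and what counts as "sustained" (window length, warm-up handling) is an assumption left to the playbook.

```rust
use std::time::{Duration, Instant};

/// Floor taken from the acceptance criteria: at least 5 tok/s.
const MIN_TOK_PER_SEC: f64 = 5.0;

/// Tokens per second over a measured interval; 0.0 if no time elapsed.
fn tokens_per_second(tokens: usize, elapsed: Duration) -> f64 {
    let secs = elapsed.as_secs_f64();
    if secs == 0.0 {
        return 0.0;
    }
    tokens as f64 / secs
}

/// True when the measured rate reaches the floor.
fn meets_perf_gate(tokens: usize, elapsed: Duration) -> bool {
    tokens_per_second(tokens, elapsed) >= MIN_TOK_PER_SEC
}

/// Times `steps` calls of a closure standing in for one decode step.
fn measure<F: FnMut()>(steps: usize, mut decode_step: F) -> Duration {
    let start = Instant::now();
    for _ in 0..steps {
        decode_step();
    }
    start.elapsed()
}

fn main() {
    // 64 tokens in 128 s is 0.5 tok/s (the no-cache figure): fails the gate.
    assert!(!meets_perf_gate(64, Duration::from_secs(128)));
    // 64 tokens in 10 s is 6.4 tok/s: passes the gate.
    assert!(meets_perf_gate(64, Duration::from_secs(10)));

    // Timing a placeholder closure; a real run would call the decode step.
    let elapsed = measure(64, || {
        std::hint::black_box(0u32);
    });
    println!("{:.1} tok/s", tokens_per_second(64, elapsed));
}
```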
Companion-side downstream
Once M32d merges, the paiml/claude-code-parity-apr operator dispatches the Phase 6 bench against the new binary. Expect ~10 hours of wall-clock time on the full 20-fixture corpus. An `evidence/under-contract/scores.json` with `student_pass_rate > 0` closes V1_004.
Predecessor PRs (context only)