M32d: KV cache for qwen3_moe inference path (engineer-driven, 1-2 week)

## Scope

Implement KV cache support on the qwen3_moe inference path, mirroring the existing dense KV cache infrastructure. Unblocks contract gate **FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004** in [`contracts/qwen3-moe-serve-dispatch-v1.yaml`](https://github.com/paiml/aprender/blob/main/contracts/qwen3-moe-serve-dispatch-v1.yaml) v1.2.0.

**Operator chose Option (b), an engineer-driven follow-up** (see paiml/aprender#1829 for the spec and contract bump). Anyone familiar with the aprender inference stack (or willing to ramp up via the dense-path reference) can claim this issue.

## Full playbook

See [`docs/specifications/m32d-moe-kv-cache-scope.md`](https://github.com/paiml/aprender/blob/main/docs/specifications/m32d-moe-kv-cache-scope.md), which covers:

- Audience + calendar target (1-2 weeks)
- Hand-off criteria (6 closeable items)
- Day-by-day plan (Day 1 ramp → Day 8-10 dispatch)
- PR layout (4 PRs: 2 prep refactors + 1 core + 1 test)
- Risk gates between PRs
- 4 open questions for Day 1
- Cross-team coordination

## Quick context

- **Why it matters**: 30B-MoE without a KV cache runs at ~0.5 tok/s, too slow for the CCPA Phase 6 bench to fit even a single turn within its per-turn budget. Empirical evidence is in [paiml/claude-code-parity-apr 30b-moe-empirical-2026-05-19.md](https://github.com/paiml/claude-code-parity-apr/blob/main/evidence/phase-6/30b-moe-empirical-2026-05-19.md). With a KV cache, the expected rate is ~5-15 tok/s.
- **Dense reference**: `crates/aprender-serve/src/gguf/inference/forward/debug.rs:441` (`forward_single_with_cache`) is the existing pattern to mirror.
- **Cache API**: `crates/aprender-serve/src/gguf/runtime.rs:123` (`OwnedQuantizedKVCache`) is sufficient as-is; no struct changes needed.
- **MoE forward to extend**: `crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs` (the full-prefill primitive that lacks cache hooks).
- **Generate-loop to wire**: `crates/aprender-serve/src/infer/qwen3_moe_generate.rs::run_qwen3_moe_generate`.
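
For readers new to the pattern, here is a minimal toy sketch of what "mirroring the dense cache path" buys: a cached decode step appends only the new token's K/V rows and attends over the cache, while the full-prefill path recomputes over the whole prefix, and the two should agree numerically. Everything in it is hypothetical (`ToyKvCache`, `step_with_cache`, `last_token_full_recompute`, single-head f32 attention with identity projections); it does not reflect the real `OwnedQuantizedKVCache` or `forward_single_with_cache` signatures.

```rust
/// Hypothetical toy cache: one row of keys and one row of values per past token.
/// NOT aprender's `OwnedQuantizedKVCache`; illustration only.
#[derive(Default)]
pub struct ToyKvCache {
    keys: Vec<Vec<f32>>,
    values: Vec<Vec<f32>>,
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Scaled dot-product attention for one query over the given key/value rows.
/// Causality is enforced by the caller passing only past and current rows.
fn attend(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let scale = 1.0 / (q.len() as f32).sqrt();
    let scores: Vec<f32> = keys.iter().map(|k| dot(q, k) * scale).collect();
    // Subtract the max score before exponentiating for numerical stability.
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let denom: f32 = exps.iter().sum();
    let mut out = vec![0.0f32; values[0].len()];
    for (w, v) in exps.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += (w / denom) * x;
        }
    }
    out
}

/// Full-prefill path: attention for the last position, recomputed over the whole prefix.
/// Assumption for brevity: Q, K and V are all the raw token vector.
pub fn last_token_full_recompute(tokens: &[Vec<f32>]) -> Vec<f32> {
    let q = tokens.last().expect("at least one token");
    attend(q, tokens, tokens)
}

/// Cached path: append only the new token's K/V rows, then attend over the cache.
pub fn step_with_cache(cache: &mut ToyKvCache, token: &[f32]) -> Vec<f32> {
    cache.keys.push(token.to_vec());
    cache.values.push(token.to_vec());
    attend(token, &cache.keys, &cache.values)
}

/// Largest element-wise gap between the cached and full-recompute outputs
/// across every decode position.
pub fn max_abs_diff(tokens: &[Vec<f32>]) -> f32 {
    let mut cache = ToyKvCache::default();
    let mut worst = 0.0f32;
    for (t, token) in tokens.iter().enumerate() {
        let cached = step_with_cache(&mut cache, token);
        let full = last_token_full_recompute(&tokens[..=t]);
        for (a, b) in cached.iter().zip(&full) {
            worst = worst.max((a - b).abs());
        }
    }
    worst
}
```

The cached step does work proportional to the current sequence length, whereas calling the full-prefill path once per generated token repeats the whole prefix each time. A real equivalence test would compare logits from the two MoE paths under a tolerance rather than this toy output.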

## Acceptance test

```bash
# Numerical equivalence
QWEN3_MOE_GGUF_PATH=/path/to/qwen3-moe.gguf \
  cargo test --test moe_kv_cache_equivalence \\
  -p aprender-serve --features cuda --release -- --ignored --nocapture

# Dense regression
cargo test -p aprender-serve --lib --features cuda gguf::inference::forward::single_tests

# V1_001 regression (existing test from #1819)
QWEN3_MOE_GGUF_PATH=/path/to/qwen3-moe.gguf \
  cargo test --test qwen3_moe_serve_dispatch_v1 \\
  -p aprender-serve --features cuda --release -- --ignored --nocapture

# Perf measurement: ≥ 5 tok/s sustained on 30B-MoE
```
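
The perf line above has no command attached, so as a rough sketch only, the snippet below shows one way a sustained tok/s figure could be computed and checked against the 5 tok/s floor. The helper names (`tokens_per_second`, `meets_floor`, `time_decode`) and the closure-based decode step are assumptions for illustration, not existing aprender APIs.

```rust
use std::time::{Duration, Instant};

/// Tokens generated divided by wall-clock seconds; returns 0.0 for a zero-length interval.
pub fn tokens_per_second(tokens: usize, elapsed: Duration) -> f64 {
    let secs = elapsed.as_secs_f64();
    if secs == 0.0 {
        0.0
    } else {
        tokens as f64 / secs
    }
}

/// True when the measured rate reaches the required floor (e.g. 5.0 tok/s).
pub fn meets_floor(tokens: usize, elapsed: Duration, floor_tok_s: f64) -> bool {
    tokens_per_second(tokens, elapsed) >= floor_tok_s
}

/// Runs `step` (a hypothetical single-token decode closure) `steps` times
/// and returns how many tokens were produced plus the elapsed wall time.
pub fn time_decode<F: FnMut()>(mut step: F, steps: usize) -> (usize, Duration) {
    let start = Instant::now();
    for _ in 0..steps {
        step();
    }
    (steps, start.elapsed())
}
```

For example, 50 tokens in 10 seconds is exactly 5 tok/s and would just meet the floor, while the uncached ~0.5 tok/s baseline would not. Timing should start after prefill if the goal is sustained decode throughput.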

## Companion-side downstream

Once M32d merges, the paiml/claude-code-parity-apr operator dispatches the Phase 6 bench against the new binary. Expect ~10 hours of wall-clock time on the full 20-fixture corpus. An `evidence/under-contract/scores.json` with `student_pass_rate > 0` closes V1_004.

## Predecessor PRs (context only)

- #1806 (Option A: arch guard) + #1807 (Option B: full MoE dispatch) — squashed
- #1812: apr-cli serve mapped_gguf_model wire + APR_AGENT_HTTP_TIMEOUT_S env
- #1814: APR_AGENT_MAX_TOKENS_CAP env
- #1819: V1_001 + V1_003 integration test
- #1826: M32d scope (initial)
- #1829: M32d playbook + Option (b) formalization (this issue's spec PR)
