M32d: KV cache for qwen3_moe inference path (engineer-driven, 1-2 week) #1830

@noahgift

Scope

Implement KV cache support on the qwen3_moe inference path, mirroring the existing dense KV cache infrastructure. Unblocks contract gate FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004 in `contracts/qwen3-moe-serve-dispatch-v1.yaml` v1.2.0.

The operator chose Option (b), the engineer-driven follow-up (see #1829 for the spec / contract bump). Anyone familiar with the aprender inference stack (or willing to ramp up via the dense-path reference) can claim this issue.

Full playbook

See `docs/specifications/m32d-moe-kv-cache-scope.md` — covers:

  • Audience + calendar target (1-2 weeks)
  • Hand-off criteria (6 closeable items)
  • Day-by-day plan (Day 1 ramp → Day 8-10 dispatch)
  • PR layout (4 PRs: 2 prep refactors + 1 core + 1 test)
  • Risk gates between PRs
  • 4 open questions for Day 1
  • Cross-team coordination

Quick context

  • Why it matters: 30B-MoE without KV cache runs at ~0.5 tok/s, so the CCPA Phase 6 bench cannot complete even a single turn within its per-turn budget. Empirical evidence is in paiml/claude-code-parity-apr, 30b-moe-empirical-2026-05-19.md. With KV cache, the expected rate is ~5-15 tok/s.
  • Dense reference: `crates/aprender-serve/src/gguf/inference/forward/debug.rs:441` (`forward_single_with_cache`) is the existing pattern to mirror.
  • Cache API: `crates/aprender-serve/src/gguf/runtime.rs:123` (`OwnedQuantizedKVCache`) — sufficient as-is, no struct changes needed.
  • MoE forward to extend: `crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs` (the full-prefill primitive that lacks cache hooks).
  • Generate-loop to wire: `crates/aprender-serve/src/infer/qwen3_moe_generate.rs::run_qwen3_moe_generate`.
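As a rough illustration of what the numerical-equivalence requirement means, here is a minimal toy sketch of single-head attention with and without a cache. It is not the aprender-serve code: `ToyKvCache`, `project`, `forward_full`, and `forward_single_cached` are hypothetical names invented for this example, and the "projections" are fixed arithmetic stand-ins for real weight matrices. The point is only that appending one K/V row per step and attending over the cache should reproduce the last-position output of a full prefill.

```rust
// Toy single-head attention. Every name here is hypothetical and unrelated
// to the real aprender-serve API.

/// Minimal cache: one key row and one value row per token seen so far.
#[derive(Default)]
struct ToyKvCache {
    keys: Vec<Vec<f32>>,
    values: Vec<Vec<f32>>,
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Softmax-weighted sum of `values`, scored by `q` against `keys`.
fn attend(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let scale = 1.0 / (q.len() as f32).sqrt();
    let scores: Vec<f32> = keys.iter().map(|k| dot(q, k) * scale).collect();
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut out = vec![0.0; values[0].len()];
    for (w, v) in exps.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += (w / sum) * x;
        }
    }
    out
}

/// Stand-in for the Q/K/V projections: fixed, deterministic maps of the input.
fn project(x: &[f32]) -> (Vec<f32>, Vec<f32>, Vec<f32>) {
    let q = x.iter().map(|v| v * 0.5).collect();
    let k = x.iter().map(|v| v + 0.1).collect();
    let v = x.iter().map(|v| v * 2.0).collect();
    (q, k, v)
}

/// Full prefill: recompute K/V for every token, return the last position's output.
fn forward_full(tokens: &[Vec<f32>]) -> Vec<f32> {
    let mut keys = Vec::new();
    let mut values = Vec::new();
    let mut last_q = Vec::new();
    for x in tokens {
        let (q, k, v) = project(x);
        keys.push(k);
        values.push(v);
        last_q = q;
    }
    attend(&last_q, &keys, &values)
}

/// Cached step: project only the new token, append to the cache, attend.
fn forward_single_cached(x: &[f32], cache: &mut ToyKvCache) -> Vec<f32> {
    let (q, k, v) = project(x);
    cache.keys.push(k);
    cache.values.push(v);
    attend(&q, &cache.keys, &cache.values)
}

fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}

fn main() {
    // Six deterministic 4-dimensional "token embeddings".
    let tokens: Vec<Vec<f32>> = (0..6)
        .map(|t| (0..4).map(|d| ((t * 4 + d) as f32 * 0.37).sin()).collect())
        .collect();
    let mut cache = ToyKvCache::default();
    for (i, x) in tokens.iter().enumerate() {
        let cached = forward_single_cached(x, &mut cache);
        let full = forward_full(&tokens[..=i]);
        assert!(max_abs_diff(&cached, &full) < 1e-6);
    }
    println!("cached and full-prefill outputs agree at all {} positions", tokens.len());
}
```

In this toy, the full path redoes the K/V work for every earlier token at each step, while the cached path does it once per token; that difference is the source of the expected speedup. The real MoE path adds expert routing and quantized storage, which this sketch deliberately ignores.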

Acceptance test

# Numerical equivalence
QWEN3_MOE_GGUF_PATH=/path/to/qwen3-moe.gguf \
  cargo test --test moe_kv_cache_equivalence \
  -p aprender-serve --features cuda --release -- --ignored --nocapture

# Dense regression
cargo test -p aprender-serve --lib --features cuda gguf::inference::forward::single_tests

# V1_001 regression (existing test from #1819)
QWEN3_MOE_GGUF_PATH=/path/to/qwen3-moe.gguf \
  cargo test --test qwen3_moe_serve_dispatch_v1 \
  -p aprender-serve --features cuda --release -- --ignored --nocapture

# Perf measurement: ≥ 5 tok/s sustained on 30B-MoE
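
The perf item has no command attached, so one possible shape for the check is sketched below. This is an assumption, not an existing harness: `measure` and `tokens_per_second` are hypothetical helpers, and the closure sleeps for 1 ms per "token" as a placeholder where a real run would call one cached decode step. Only the 5 tok/s threshold comes from this issue.

```rust
use std::time::{Duration, Instant};

/// Tokens per second over a measured interval.
fn tokens_per_second(tokens: usize, elapsed: Duration) -> f64 {
    tokens as f64 / elapsed.as_secs_f64()
}

/// Times `step` for `n_tokens` decode steps and returns the sustained rate.
/// `step` is a hypothetical stand-in for one cached decode step.
fn measure<F: FnMut()>(n_tokens: usize, mut step: F) -> f64 {
    let start = Instant::now();
    for _ in 0..n_tokens {
        step();
    }
    tokens_per_second(n_tokens, start.elapsed())
}

fn main() {
    // Threshold from the acceptance criteria in this issue.
    const MIN_TOK_PER_S: f64 = 5.0;
    // Placeholder workload: sleep 1 ms per "token" instead of running a model.
    let rate = measure(32, || std::thread::sleep(Duration::from_millis(1)));
    println!("measured {:.1} tok/s (gate: >= {} tok/s)", rate, MIN_TOK_PER_S);
    assert!(rate >= MIN_TOK_PER_S);
}
```

For a sustained figure, the token count should be large enough that prefill and warm-up do not dominate; the 32 used here is arbitrary.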

Companion-side downstream

Once M32d merges, the paiml/claude-code-parity-apr operator dispatches the Phase 6 bench against the new binary; expect ~10 hours of wall time on the full 20-fixture corpus. An `evidence/under-contract/scores.json` with `student_pass_rate > 0` closes V1_004.

Predecessor PRs (context only)
