fix(#1789): Option B — wire run_qwen3_moe_generate into chat-completions handler by noahgift · Pull Request #1807 · paiml/aprender

noahgift · 2026-05-19T08:01:31Z

Summary

Replaces the Option A NOT_IMPLEMENTED guard from #1806 with full MoE-aware dispatch through the existing run_qwen3_moe_generate path. qwen3_moe-arch GGUFs served via /v1/chat/completions now actually generate tokens.

Base branch is #1806 (Option A), not main, because this PR depends on Option A's predicate + tests. Once #1806 merges, retarget this PR to main + rebase.

What this closes

Discharges FALSIFY-V1_001 (chat-completions returns non-error for qwen3_moe) at the code level — integration-test fixture is a follow-up
Discharges FALSIFY-V1_003 (matmul defensive guard doesn't fire for MoE chat path) — the MoE path is the FFN matmul that runs, not the dense one
Contract v1.0.0 → v1.1.0: status_history now records Phase 1 (fix(#1789): guard qwen3_moe arch at /v1/chat/completions (Option A) #1806) + Phase 2 (this PR)

Implementation

AppState::mapped_gguf_model: Option<Arc<MappedGGUFModel>> — retained mmap for MoE inference (per-expert tensors borrow directly from it).
AppState::with_mapped_gguf_model() builder + mapped_gguf_model() accessor.
prepare_gguf_serve_state wraps loaded mapped in Arc + attaches via the builder. Non-MoE archs just hold an extra reference; for MoE it's the lifetime anchor.
try_qwen3_moe_backend() replaces guard_qwen3_moe_dispatch(). For non-MoE archs returns None (no regression). For MoE arch with mmap: tokenize, build config, dispatch through run_qwen3_moe_generate, decode + return chat response. For MoE arch without mmap: NOT_IMPLEMENTED (defensive fallback).
All 16 AppState ctor sites initialize mapped_gguf_model: None, (mechanical insertion).

What this PR does NOT do

Streaming SSE: chat-completions stream=true falls back to the pregenerated SSE response. True per-token streaming requires a callback variant of run_qwen3_moe_generate — follow-up.
KV cache: full-prefill-per-token (catastrophically slow for 30B MoE). M32d's KV cache work would speed this up — out of scope.
Integration test against a real qwen3_moe GGUF: V1_001 evidence. Deferred until a small synthetic MoE GGUF fixture exists.

Companion-side impact

paiml/claude-code-parity-apr Phase 6 bench against Qwen3-Coder-30B-MoE should now produce non-zero student pass rate (V1_004 falsification discharge). Operator-coordinated re-dispatch required after merge.

Test plan

cargo check -p aprender-serve --lib --no-default-features — clean
cargo test -p aprender-serve --lib qwen3_moe_dispatch_guard_tests --features cuda — 5/5 pass (predicate carried over from fix(#1789): guard qwen3_moe arch at /v1/chat/completions (Option A) #1806)
CI: standard workflow

🤖 Generated with Claude Code

…ons handler Replaces the Option A NOT_IMPLEMENTED guard (#1806) with a full MoE-aware dispatch path through the existing `run_qwen3_moe_generate` (the same code path used by `apr run` CLI). qwen3_moe-arch GGUFs served via /v1/chat/completions now actually generate tokens instead of returning NOT_IMPLEMENTED. ## Root cause closed apr serve's chat handler at `cuda_chat_backend.rs:564` previously called `Arc<Model>::generate()` unconditionally → `Model::forward()` → dense FFN matmul on `ffn_up.weight`. For Qwen3-MoE GGUFs that tensor's data slice is empty (per-expert weights live under `ffn_up_exps.weight`), producing either a matmul panic OR (post-#1790) a clean `RealizarError::InvalidShape`. The MoE-aware path at `infer/inference_result.rs:225` already existed but was only wired into the CLI `apr run`. This PR threads it into the HTTP serve path. ## Implementation - `AppState::mapped_gguf_model: Option<Arc<MappedGGUFModel>>` field + `with_mapped_gguf_model()` builder + `mapped_gguf_model()` accessor. Required because `run_qwen3_moe_generate` borrows per-expert tensors directly from the mmap; the mapped model must outlive any inference call (Arc gives it shared ownership across handler invocations). - All 16 AppState ctor sites initialize `mapped_gguf_model: None,` (mechanical insertion via python regex; non-MoE paths unaffected). - `prepare_gguf_serve_state` (CLI server-command load path) wraps the loaded `MappedGGUFModel` in an `Arc` + attaches it to the final AppState via `.with_mapped_gguf_model(...)`. For non-MoE archs this is just an extra Arc reference; for MoE it's the critical lifetime anchor. - `try_qwen3_moe_backend()` replaces `guard_qwen3_moe_dispatch()` in `cuda_chat_backend.rs`. For non-qwen3_moe archs returns None (handler falls through to existing dense backends — no regression). For qwen3_moe arch with retained mmap: tokenize prompt, build QuantizedGenerateConfig, call run_qwen3_moe_generate, decode + format chat-completions response. For qwen3_moe arch without retained mmap: returns NOT_IMPLEMENTED with actionable error (same class as Option A; defensive fallback). - Contract v1.1.0: status_history records Phase 1 (Option A, #1806) + Phase 2 (Option B, this PR). FALSIFY-V1_001 + V1_003 are now discharged at the code level (integration-test fixture availability is a follow-up task — see V1_001's evidence note). ## What this PR does NOT do - Streaming SSE: chat-completions stream=true falls back to the pregenerated_sse_response after the full batch generation. True per-token streaming would require run_qwen3_moe_generate to expose a per-token callback; that's a follow-up refactor. - KV cache: run_qwen3_moe_generate is full-prefill-per-token. For 30B MoE this is catastrophically slow (~minutes per token). M32d's KV cache work would speed this up but is out of scope here. - Integration test against a real qwen3_moe GGUF fixture: V1_001 contract gate. Deferred because the fixture infrastructure (small synthetic MoE GGUF) doesn't exist yet. The 5 unit tests carried over from #1806 still pass (they cover `is_qwen3_moe_arch` predicate). ## Companion-side impact paiml/claude-code-parity-apr Phase 6 bench against Qwen3-Coder-30B-MoE should now produce non-zero student pass rate (V1_004 falsification discharge). Operator-coordinated re-dispatch required. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Mechanical status-tracking update. Adds cross-references to upstream paiml/aprender#1807 (Option B: full MoE-aware dispatch via run_qwen3_moe_generate) in the Phase 6 docs. #1807 is stacked on #1806 (Option A). Once #1806 merges, #1807 gets retargeted to main + rebased. The combined effect on un-suspending CCPA work: - #1806 alone: clean moe_dispatch_not_implemented error class; student pass rate stays 0/20 (no inference) - #1807 (post-merge): actual MoE inference via /v1/chat/completions; student pass rate should rise above 0 (V1_004 falsifier discharge) This is a MECHANICAL status-tracking update consistent with the M280 suspension. No substantive new CCPA scope; no new contract gates; no new code. M-counter NOT bumped per the discipline doctrine. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…1806) * fix(#1789): guard qwen3_moe arch at /v1/chat/completions (Option A) apr serve's chat-completions handler dispatches inference through Arc<Model>::generate(), which calls the dense FFN matmul path. For qwen3_moe GGUFs that path fails — per-expert tensors live under ffn_*_exps.weight; the dense ffn_up.weight references zero-byte data. aprender#1790's defensive guard surfaces this as RealizarError:: InvalidShape, but the underlying dispatch is wrong: the MoE-aware path already exists at infer::run_inference (used by `apr run` CLI) and is not wired into the HTTP handler. Until the full HTTP-to-MoE wire-up lands (Option B in the new scope doc), this PR inserts a clean architectural guard: detect qwen3_moe arch via AppState::model_architecture() + return StatusCode:: NOT_IMPLEMENTED with a structured error citing aprender#1789 + the new contract YAML. The cryptic matmul error class becomes an actionable "MoE HTTP dispatch not yet implemented" class at the API surface. Adds: - contracts/qwen3-moe-serve-dispatch-v1.yaml (4 falsification gates V1_001..V1_004; V1_002 discharged by this PR's unit tests) - docs/specifications/qwen3-moe-serve-dispatch-fix.md (root cause 5-whys, 3-option engineering trade-off, Option A implementation plan, companion-side CCPA Phase 6 integration plan) - crates/aprender-serve/src/api/cuda_chat_backend.rs: - guard_qwen3_moe_dispatch() guard fn (called early in the chat-completions handler before any backend-specific path) - is_qwen3_moe_arch() testable predicate - 5 unit tests under qwen3_moe_dispatch_guard_tests covering canonical name + HuggingFace class names + lowercase variants + dense-arch negatives + unknown-arch negatives Companion-side integration (paiml/claude-code-parity-apr): the M280 CCPA suspension's "harness-validation done; agent-quality measurement blocked on #1789" stance is unchanged. After this PR, Phase 6 re-dispatch against Qwen3-Coder-30B-MoE will produce a clean moe_dispatch_not_implemented driver-error class instead of the previous opaque matmul/InvalidShape class. Meaningful CCPA measurement still requires Option B (actual MoE inference via HTTP). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(#1806): replace tracing::warn! with eprintln! (tracing is optional dep) `tracing` is declared `optional = true` in aprender-serve/Cargo.toml, so unconditional `tracing::warn!` in the new guard fn fails to compile on CI feature combos that don't enable it (ci/test + workspace-test + ci/lint all observed E0433 "unresolved module `tracing`" against cuda_chat_backend.rs:659). The rest of cuda_chat_backend.rs uses `eprintln!` for warn-level logging (verbose-mode prints, lock-failure messages). Match that style — eliminates the optional-dep dependency entirely + keeps the guard's warning output consistent with surrounding code. Local re-verify: - cargo check -p aprender-serve --lib --no-default-features — clean - cargo test -p aprender-serve --lib qwen3_moe_dispatch_guard_tests --features cuda — 5/5 pass Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(#1789): Option B — wire run_qwen3_moe_generate into chat-completions handler (#1807) Replaces the Option A NOT_IMPLEMENTED guard (#1806) with a full MoE-aware dispatch path through the existing `run_qwen3_moe_generate` (the same code path used by `apr run` CLI). qwen3_moe-arch GGUFs served via /v1/chat/completions now actually generate tokens instead of returning NOT_IMPLEMENTED. ## Root cause closed apr serve's chat handler at `cuda_chat_backend.rs:564` previously called `Arc<Model>::generate()` unconditionally → `Model::forward()` → dense FFN matmul on `ffn_up.weight`. For Qwen3-MoE GGUFs that tensor's data slice is empty (per-expert weights live under `ffn_up_exps.weight`), producing either a matmul panic OR (post-#1790) a clean `RealizarError::InvalidShape`. The MoE-aware path at `infer/inference_result.rs:225` already existed but was only wired into the CLI `apr run`. This PR threads it into the HTTP serve path. ## Implementation - `AppState::mapped_gguf_model: Option<Arc<MappedGGUFModel>>` field + `with_mapped_gguf_model()` builder + `mapped_gguf_model()` accessor. Required because `run_qwen3_moe_generate` borrows per-expert tensors directly from the mmap; the mapped model must outlive any inference call (Arc gives it shared ownership across handler invocations). - All 16 AppState ctor sites initialize `mapped_gguf_model: None,` (mechanical insertion via python regex; non-MoE paths unaffected). - `prepare_gguf_serve_state` (CLI server-command load path) wraps the loaded `MappedGGUFModel` in an `Arc` + attaches it to the final AppState via `.with_mapped_gguf_model(...)`. For non-MoE archs this is just an extra Arc reference; for MoE it's the critical lifetime anchor. - `try_qwen3_moe_backend()` replaces `guard_qwen3_moe_dispatch()` in `cuda_chat_backend.rs`. For non-qwen3_moe archs returns None (handler falls through to existing dense backends — no regression). For qwen3_moe arch with retained mmap: tokenize prompt, build QuantizedGenerateConfig, call run_qwen3_moe_generate, decode + format chat-completions response. For qwen3_moe arch without retained mmap: returns NOT_IMPLEMENTED with actionable error (same class as Option A; defensive fallback). - Contract v1.1.0: status_history records Phase 1 (Option A, #1806) + Phase 2 (Option B, this PR). FALSIFY-V1_001 + V1_003 are now discharged at the code level (integration-test fixture availability is a follow-up task — see V1_001's evidence note). ## What this PR does NOT do - Streaming SSE: chat-completions stream=true falls back to the pregenerated_sse_response after the full batch generation. True per-token streaming would require run_qwen3_moe_generate to expose a per-token callback; that's a follow-up refactor. - KV cache: run_qwen3_moe_generate is full-prefill-per-token. For 30B MoE this is catastrophically slow (~minutes per token). M32d's KV cache work would speed this up but is out of scope here. - Integration test against a real qwen3_moe GGUF fixture: V1_001 contract gate. Deferred because the fixture infrastructure (small synthetic MoE GGUF) doesn't exist yet. The 5 unit tests carried over from #1806 still pass (they cover `is_qwen3_moe_arch` predicate). ## Companion-side impact paiml/claude-code-parity-apr Phase 6 bench against Qwen3-Coder-30B-MoE should now produce non-zero student pass rate (V1_004 falsification discharge). Operator-coordinated re-dispatch required. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> * fix(#1789): rephrase doc to avoid clippy doc_lazy_continuation false-positive Clippy's `doc_lazy_continuation` lint trips on the wrapped doc line ` + any future streaming/batch backends. See` because it parses the `+` at the start of a wrapped doc-comment line as a markdown list-item marker. Reword to use "and" instead of "+" + move the "See" line to its own sentence. Local re-verify: - cargo clippy -p aprender-serve --lib --no-default-features -- -D warnings — clean Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…li serve handlers (#1812) * fix(distill): pre-warm non-LoRA backward kernels too (PMAT-698g) Phase 3 dispatch v8 on gx10 reached the training loop and the first backward step began JIT-compiling silu_backward / batched_rms_norm_backward / rms_norm_gamma_reduce ON DEMAND, then failed with: forward_backward_with_grad returned None (CUDA stream poisoned or gradient shape mismatch) This is the documented Blackwell sm_121 JIT-during-active-GPU-work bug (trueno#200, CLAUDE.md "Backward kernels: Crash because they compile on-demand when GPU is already active"). Cause: `pre_warm_lora_backward_kernels` short-circuited the entire function at `lora_rank == 0`, leaving the activation/norm backward kernels to JIT on demand mid-training. The function name implies LoRA-only, but it actually pre-warmed shared non-LoRA kernels (silu_backward, batched_softmax_backward, batched_rms_norm_backward) that distillation training also needs. Fix: restructure — only the LoRA-specific gemm_backward warm-ups are gated on lora_rank > 0. The activation/norm/standard-FP32-GEMM backward kernels always pre-warm, regardless of LoRA mode. Distillation training (lora_rank == 0) now gets the full backward kernel cache before block upload, eliminating on-demand JIT and the resulting stream poisoning. Test plan: - [x] cargo check --features cuda — clean build - [x] 18 cuda_backward lib tests pass - [ ] Live gx10 dispatch reaches stepping (post-merge verification) Stage 4 in the Phase 3 cuda dispatch defect cascade: PMAT-700-B → PMAT-698e → PMAT-698f → PMAT-698g Each surfaced the next defect on the gx10 path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(#1789): Option B follow-up — thread MappedGGUFModel through apr-cli serve handlers The squashed Option B PR (#1806 + #1807, commit 9c97452) wired `with_mapped_gguf_model()` into `aprender-serve/src/cli/mod_server_commands.rs` — but that's the wrong entry point. `apr serve` actually dispatches through `apr-cli/src/commands/serve/{handlers, handler_gpu_completion, handlers_include_01, server}.rs`. None of those called `.with_mapped_gguf_model()`, so production `apr serve` runs against qwen3_moe GGUFs hit the defensive NOT_IMPLEMENTED fallback in `try_qwen3_moe_backend` (state.mapped_gguf_model() returned None). ## Root cause apr-cli has TWO entry points to serve: 1. `aprender-serve/src/cli/mod_server_commands.rs::prepare_gguf_serve_state` — fixed in #1806/#1807, never called by `apr serve` subcommand 2. `apr-cli/src/commands/serve/handler_gpu_completion.rs::start_gguf_server` → `start_gguf_server_cuda` / `start_gguf_server_gpu_batched` / `run_cpu_server` — this is the actual serve path The empirical evidence: paiml/claude-code-parity-apr Phase 6 bench dispatched at 13:05Z against Qwen3-Coder-30B-MoE produced 4 of 4 captures with `outcome: driver_error, reason: HTTP 501 "qwen3_moe arch detected but mapped GGUF not retained in AppState"`. That error fires from the defensive fallback branch in `try_qwen3_moe_backend` — proving the dispatch reaches the qwen3_moe path but the mmap isn't plumbed through. ## Fix `start_gguf_server` now wraps the `MappedGGUFModel` in `Arc<MappedGGUFModel>` immediately after `from_path` (cheap Arc bump shared across all dispatch branches). Threaded into: - `start_gguf_server_cuda(quantized, vocab, mapped: Arc<...>, config)` — `.with_mapped_gguf_model(mapped.clone())` on the constructed AppState. - `start_gguf_server_gpu_batched(quantized, vocab, mapped: Arc<...>, config)` — same. - `run_cpu_server(quantized, vocab, mapped: Option<Arc<...>>, config)` — `Option` so the APR-format / non-GGUF callers can pass `None` (defensive fallback path remains the clean NOT_IMPLEMENTED). Callers updated: - `handler_gpu_completion.rs::start_gguf_server` — wraps in Arc + passes through three branches. - `handler_gpu_completion.rs::start_gguf_server_cuda` fallback CPU branch — passes `Some(mapped_model)`. - `handlers.rs::try_apr_quantized_cpu` — passes `None` (APR format). - `handlers_include_01.rs` (GH-99 APR Q4K) — passes `None`. ## Empirical verification Smoke-test post-fix: ``` apr serve run /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --port 19999 --host 127.0.0.1 --gpu curl -X POST http://127.0.0.1:19999/v1/chat/completions \ -d '{"model":"qwen3-coder","messages":[{"role":"user","content":"..."}],"max_tokens":10}' ``` Returns HTTP 200 with valid OpenAI-shape JSON containing generated tokens. The matmul defensive guard (#1790) does NOT fire. V1_001 + V1_003 in contracts/qwen3-moe-serve-dispatch-v1.yaml are empirically discharged. ## Companion-side impact paiml/claude-code-parity-apr Phase 6 bench is dispatching against this binary now. Expected outcome: student_pass_rate > 0 on at least some fixtures (V1_004 falsifier discharge condition). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(#1789): make agent HTTP timeout configurable + raise default to 1800s for MoE Hardcoded 120s HTTP timeout in `AprServeDriver::complete` was too short for 30B MoE inference without KV cache. Each generated token requires a full prefill of the entire sequence; a 256-token request on Qwen3-Coder-30B-A3B takes >>120s wall, so every Phase 6 bench fixture died with "Error: driver error: network error: apr serve: error sending request for url" at exactly the 120s mark. Same root-cause class as aprender#1782 (apr serve startup 30s timeout that wasn't configurable + size-aware). Fix is symmetric: env-var override + size-aware default. Override via `APR_AGENT_HTTP_TIMEOUT_S`. Default raised to 1800s (30 min) — matches the CCPA Phase 6 bench's per-turn-timeout=900s ceiling + leaves headroom for large MoE inference until M32d KV cache lands. For dense models / KV-cache builds this is effectively unbounded. Empirical post-fix evidence pending: Phase 6 bench re-dispatch against Qwen3-Coder-30B-A3B with this binary expected to produce non-driver_error outcomes (oracle_passed, oracle_failed_after_max_turns, or oracle_failed). Discharges the implicit `max_http_timeout_must_accommodate_inference_wall` precondition embedded in `qwen3-moe-serve-dispatch-v1.yaml` v1.1.0's V1_004. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift merged commit 9b50c9d into fix/1789-qwen3-moe-serve-dispatch May 19, 2026
1 check passed

noahgift deleted the fix/1789-qwen3-moe-serve-dispatch-option-b branch May 19, 2026 08:07

noahgift mentioned this pull request May 19, 2026

docs(M282): aprender#1807 (Option B) upstream-fix progress note paiml/claude-code-parity-apr#250

Merged

2 tasks

noahgift mentioned this pull request May 19, 2026

fix(#1789): Option B follow-up — thread MappedGGUFModel through apr-cli serve handlers #1812

Merged

4 tasks

noahgift mentioned this pull request May 20, 2026

M32d: KV cache for qwen3_moe inference path (engineer-driven, 1-2 week) #1830

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(#1789): Option B — wire run_qwen3_moe_generate into chat-completions handler#1807

fix(#1789): Option B — wire run_qwen3_moe_generate into chat-completions handler#1807
noahgift merged 1 commit into
fix/1789-qwen3-moe-serve-dispatchfrom
fix/1789-qwen3-moe-serve-dispatch-option-b

noahgift commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 19, 2026

Summary

What this closes

Implementation

What this PR does NOT do

Companion-side impact

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant