
fix(#1789): Option B — wire run_qwen3_moe_generate into chat-completions handler#1807

Merged
noahgift merged 1 commit into fix/1789-qwen3-moe-serve-dispatch from fix/1789-qwen3-moe-serve-dispatch-option-b on May 19, 2026

@noahgift

Summary

Replaces the Option A NOT_IMPLEMENTED guard from #1806 with full MoE-aware dispatch through the existing run_qwen3_moe_generate path. qwen3_moe-arch GGUFs served via /v1/chat/completions now actually generate tokens.

Base branch is #1806 (Option A), not main, because this PR depends on Option A's predicate + tests. Once #1806 merges, retarget this PR to main + rebase.

What this closes

  • Discharges FALSIFY-V1_001 (chat-completions returns non-error for qwen3_moe) at the code level — integration-test fixture is a follow-up
  • Discharges FALSIFY-V1_003 (matmul defensive guard doesn't fire for MoE chat path) — the MoE path is the FFN matmul that runs, not the dense one
  • Contract v1.0.0 → v1.1.0: status_history now records Phase 1 (Option A, #1806) + Phase 2 (Option B, this PR)

Implementation

  1. AppState::mapped_gguf_model: Option<Arc<MappedGGUFModel>> — retained mmap for MoE inference (per-expert tensors borrow directly from it).
  2. AppState::with_mapped_gguf_model() builder + mapped_gguf_model() accessor.
  3. prepare_gguf_serve_state wraps the loaded MappedGGUFModel in an Arc and attaches it via the builder. Non-MoE archs just hold an extra reference; for MoE it's the lifetime anchor.
  4. try_qwen3_moe_backend() replaces guard_qwen3_moe_dispatch(). For non-MoE archs returns None (no regression). For MoE arch with mmap: tokenize, build config, dispatch through run_qwen3_moe_generate, decode + return chat response. For MoE arch without mmap: NOT_IMPLEMENTED (defensive fallback).
  5. All 16 AppState ctor sites initialize mapped_gguf_model: None, (mechanical insertion).
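
The state plumbing and three-way dispatch in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `MappedGGUFModel` and `AppState` are trimmed stand-ins, `MoeDispatch` and `try_moe_backend` are hypothetical names, and the "generation" is a placeholder string instead of a call to `run_qwen3_moe_generate`.

```rust
use std::sync::Arc;

/// Stand-in for the real memory-mapped GGUF type (hypothetical field).
pub struct MappedGGUFModel {
    pub path: String,
}

/// Trimmed stand-in for the server state; the real AppState has more fields.
#[derive(Default)]
pub struct AppState {
    architecture: Option<String>,
    mapped_gguf_model: Option<Arc<MappedGGUFModel>>,
}

impl AppState {
    /// Hypothetical helper used only to set up the sketch.
    pub fn with_architecture(mut self, arch: &str) -> Self {
        self.architecture = Some(arch.to_string());
        self
    }

    /// Builder: retain the mmap so per-expert tensors can borrow from it.
    pub fn with_mapped_gguf_model(mut self, mapped: Arc<MappedGGUFModel>) -> Self {
        self.mapped_gguf_model = Some(mapped);
        self
    }

    /// Accessor: a cheap Arc clone, or None if nothing was retained.
    pub fn mapped_gguf_model(&self) -> Option<Arc<MappedGGUFModel>> {
        self.mapped_gguf_model.clone()
    }
}

#[derive(Debug, PartialEq)]
pub enum MoeDispatch {
    /// MoE arch with retained mmap: the handler would generate here.
    Generated(String),
    /// MoE arch without mmap: defensive fallback (HTTP 501 in the PR).
    NotImplemented,
}

/// Returns None for non-MoE archs so the caller falls through to the
/// existing dense backends.
pub fn try_moe_backend(state: &AppState, prompt: &str) -> Option<MoeDispatch> {
    if state.architecture.as_deref() != Some("qwen3_moe") {
        return None;
    }
    match state.mapped_gguf_model() {
        // Placeholder for tokenize -> generate -> decode.
        Some(mapped) => Some(MoeDispatch::Generated(format!(
            "[{}] echo: {}",
            mapped.path, prompt
        ))),
        None => Some(MoeDispatch::NotImplemented),
    }
}
```

Returning `Option` keeps the dense path untouched: only a `Some(..)` result short-circuits the handler.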

What this PR does NOT do

  • Streaming SSE: chat-completions stream=true falls back to the pregenerated SSE response. True per-token streaming requires a callback variant of run_qwen3_moe_generate — follow-up.
  • KV cache: full-prefill-per-token (catastrophically slow for 30B MoE). M32d's KV cache work would speed this up — out of scope.
  • Integration test against a real qwen3_moe GGUF: V1_001 evidence. Deferred until a small synthetic MoE GGUF fixture exists.
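
For the streaming follow-up, one possible shape of a callback variant is sketched below. Everything here is an assumption for illustration: `generate_with_callback`, its signature, and the `next_token` closure (which stands in for one MoE decode step) are hypothetical and do not reflect the real `run_qwen3_moe_generate` API.

```rust
/// Hypothetical streaming-friendly generate loop. `next_token` models one
/// decode step over the sequence so far; `on_token` is invoked as soon as
/// each token exists and returns false to stop early.
pub fn generate_with_callback<N, F>(
    prompt: &[u32],
    max_tokens: usize,
    mut next_token: N,
    mut on_token: F,
) -> Vec<u32>
where
    N: FnMut(&[u32]) -> Option<u32>,
    F: FnMut(u32) -> bool,
{
    let mut sequence = prompt.to_vec();
    let mut generated = Vec::new();
    for _ in 0..max_tokens {
        // None models an end-of-sequence decision.
        let Some(token) = next_token(&sequence) else {
            break;
        };
        sequence.push(token);
        generated.push(token);
        // An SSE handler could emit a chunk here; false means the
        // consumer (e.g. a disconnected client) wants no more tokens.
        if !on_token(token) {
            break;
        }
    }
    generated
}
```

With a hook like this, an SSE handler could forward each token as it is produced instead of waiting for the full batch.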

Companion-side impact

paiml/claude-code-parity-apr Phase 6 bench against Qwen3-Coder-30B-MoE should now produce non-zero student pass rate (V1_004 falsification discharge). Operator-coordinated re-dispatch required after merge.

Test plan

🤖 Generated with Claude Code

…ons handler

Replaces the Option A NOT_IMPLEMENTED guard (#1806) with a
full MoE-aware dispatch path through the existing
`run_qwen3_moe_generate` (the same code path used by `apr run` CLI).
qwen3_moe-arch GGUFs served via /v1/chat/completions now actually
generate tokens instead of returning NOT_IMPLEMENTED.

## Root cause closed

apr serve's chat handler at `cuda_chat_backend.rs:564` previously called
`Arc<Model>::generate()` unconditionally → `Model::forward()` → dense
FFN matmul on `ffn_up.weight`. For Qwen3-MoE GGUFs that tensor's data
slice is empty (per-expert weights live under `ffn_up_exps.weight`),
producing either a matmul panic OR (post-#1790) a clean
`RealizarError::InvalidShape`. The MoE-aware path at
`infer/inference_result.rs:225` already existed but was only wired into
the CLI `apr run`. This PR threads it into the HTTP serve path.
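
To illustrate the failure mode, here is a minimal sketch of a shape guard in front of a dense mat-vec. It assumes plain row-major `f32` weights for simplicity (the real tensors are quantized), and `ShapeError` and `guarded_matvec` are hypothetical names, not the `RealizarError` API.

```rust
#[derive(Debug, PartialEq)]
pub enum ShapeError {
    InvalidShape { expected: usize, actual: usize },
}

/// Dense mat-vec over a row-major weight. Rejects a weight whose backing
/// slice does not match the declared shape instead of indexing past it.
pub fn guarded_matvec(
    weight: &[f32],
    rows: usize,
    cols: usize,
    x: &[f32],
) -> Result<Vec<f32>, ShapeError> {
    let expected = rows * cols;
    if weight.len() != expected {
        // An empty `ffn_up.weight` slice for a MoE model lands here.
        return Err(ShapeError::InvalidShape { expected, actual: weight.len() });
    }
    if x.len() != cols {
        return Err(ShapeError::InvalidShape { expected: cols, actual: x.len() });
    }
    if cols == 0 {
        // Degenerate shape: every row is an empty dot product.
        return Ok(vec![0.0; rows]);
    }
    Ok(weight
        .chunks_exact(cols)
        .map(|row| row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>())
        .collect())
}
```

The guard only turns the crash into a clean error; the request still fails until dispatch reaches the MoE path.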

## Implementation

- `AppState::mapped_gguf_model: Option<Arc<MappedGGUFModel>>` field +
  `with_mapped_gguf_model()` builder + `mapped_gguf_model()` accessor.
  Required because `run_qwen3_moe_generate` borrows per-expert tensors
  directly from the mmap; the mapped model must outlive any inference
  call (Arc gives it shared ownership across handler invocations).

- All 16 AppState ctor sites initialize `mapped_gguf_model: None,`
  (mechanical insertion via python regex; non-MoE paths unaffected).

- `prepare_gguf_serve_state` (CLI server-command load path) wraps the
  loaded `MappedGGUFModel` in an `Arc` + attaches it to the final
  AppState via `.with_mapped_gguf_model(...)`. For non-MoE archs this
  is just an extra Arc reference; for MoE it's the critical lifetime
  anchor.

- `try_qwen3_moe_backend()` replaces `guard_qwen3_moe_dispatch()` in
  `cuda_chat_backend.rs`. For non-qwen3_moe archs returns None (handler
  falls through to existing dense backends — no regression). For
  qwen3_moe arch with retained mmap: tokenize prompt, build
  QuantizedGenerateConfig, call run_qwen3_moe_generate, decode +
  format chat-completions response. For qwen3_moe arch without retained
  mmap: returns NOT_IMPLEMENTED with actionable error (same class as
  Option A; defensive fallback).

- Contract v1.1.0: status_history records Phase 1 (Option A, #1806) +
  Phase 2 (Option B, this PR). FALSIFY-V1_001 + V1_003 are now
  discharged at the code level (integration-test fixture availability
  is a follow-up task — see V1_001's evidence note).

## What this PR does NOT do

- Streaming SSE: chat-completions stream=true falls back to the
  pregenerated_sse_response after the full batch generation. True
  per-token streaming would require run_qwen3_moe_generate to expose
  a per-token callback; that's a follow-up refactor.

- KV cache: run_qwen3_moe_generate is full-prefill-per-token. For 30B
  MoE this is catastrophically slow (~minutes per token). M32d's KV
  cache work would speed this up but is out of scope here.

- Integration test against a real qwen3_moe GGUF fixture: V1_001
  contract gate. Deferred because the fixture infrastructure (small
  synthetic MoE GGUF) doesn't exist yet. The 5 unit tests carried over
  from #1806 still pass (they cover `is_qwen3_moe_arch` predicate).
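
The cost of full-prefill-per-token can be made concrete with a small counting sketch. It counts token positions pushed through the model, a rough proxy for compute; both function names are hypothetical and the model is deliberately simplified.

```rust
/// Positions processed when every decode step re-runs the whole sequence
/// (no KV cache): step i (0-based) sees prompt_len + i positions.
pub fn positions_without_kv_cache(prompt_len: u64, new_tokens: u64) -> u64 {
    (0..new_tokens).map(|i| prompt_len + i).sum()
}

/// Positions processed with a KV cache: one prefill over the prompt, then
/// one position per subsequent decode step.
pub fn positions_with_kv_cache(prompt_len: u64, new_tokens: u64) -> u64 {
    if new_tokens == 0 {
        0
    } else {
        prompt_len + (new_tokens - 1)
    }
}
```

For an assumed 100-token prompt and 256 generated tokens this gives 58240 positions without a cache versus 355 with one, which is why work grows quadratically with output length on this path.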

## Companion-side impact

paiml/claude-code-parity-apr Phase 6 bench against Qwen3-Coder-30B-MoE
should now produce non-zero student pass rate (V1_004 falsification
discharge). Operator-coordinated re-dispatch required.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift merged commit 9b50c9d into fix/1789-qwen3-moe-serve-dispatch on May 19, 2026
1 check passed
@noahgift deleted the fix/1789-qwen3-moe-serve-dispatch-option-b branch on May 19, 2026 08:07
noahgift added a commit to paiml/claude-code-parity-apr that referenced this pull request May 19, 2026
Mechanical status-tracking update. Adds cross-references to upstream
paiml/aprender#1807 (Option B: full MoE-aware dispatch via
run_qwen3_moe_generate) in the Phase 6 docs.

#1807 is stacked on #1806 (Option A). Once #1806 merges, #1807 gets
retargeted to main + rebased. The combined effect on un-suspending
CCPA work:
- #1806 alone: clean moe_dispatch_not_implemented error class; student
  pass rate stays 0/20 (no inference)
- #1807 (post-merge): actual MoE inference via /v1/chat/completions;
  student pass rate should rise above 0 (V1_004 falsifier discharge)

This is a MECHANICAL status-tracking update consistent with the M280
suspension. No substantive new CCPA scope; no new contract gates; no
new code. M-counter NOT bumped per the discipline doctrine.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 19, 2026
…1806)

* fix(#1789): guard qwen3_moe arch at /v1/chat/completions (Option A)

apr serve's chat-completions handler dispatches inference through
Arc<Model>::generate(), which calls the dense FFN matmul path. For
qwen3_moe GGUFs that path fails — per-expert tensors live under
ffn_*_exps.weight; the dense ffn_up.weight references zero-byte data.
aprender#1790's defensive guard surfaces this as RealizarError::
InvalidShape, but the underlying dispatch is wrong: the MoE-aware path
already exists at infer::run_inference (used by `apr run` CLI) and is
not wired into the HTTP handler.

Until the full HTTP-to-MoE wire-up lands (Option B in the new scope
doc), this PR inserts a clean architectural guard: detect qwen3_moe
arch via AppState::model_architecture() + return StatusCode::
NOT_IMPLEMENTED with a structured error citing aprender#1789 + the
new contract YAML. The cryptic matmul error class becomes an
actionable "MoE HTTP dispatch not yet implemented" class at the API
surface.

Adds:
- contracts/qwen3-moe-serve-dispatch-v1.yaml (4 falsification gates
  V1_001..V1_004; V1_002 discharged by this PR's unit tests)
- docs/specifications/qwen3-moe-serve-dispatch-fix.md (root cause
  5-whys, 3-option engineering trade-off, Option A implementation
  plan, companion-side CCPA Phase 6 integration plan)
- crates/aprender-serve/src/api/cuda_chat_backend.rs:
  - guard_qwen3_moe_dispatch() guard fn (called early in the
    chat-completions handler before any backend-specific path)
  - is_qwen3_moe_arch() testable predicate
  - 5 unit tests under qwen3_moe_dispatch_guard_tests covering
    canonical name + HuggingFace class names + lowercase variants +
    dense-arch negatives + unknown-arch negatives
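
A predicate of the kind described above might look like the sketch below. The accepted spellings are assumptions: the commit message only says it covers the canonical name, HuggingFace class names, and lowercase variants, so the exact strings in the real `is_qwen3_moe_arch` may differ.

```rust
/// Sketch of an architecture predicate. The match list is an assumption,
/// not the list used in the PR.
pub fn is_qwen3_moe_arch(arch: &str) -> bool {
    let normalized = arch.trim().to_ascii_lowercase();
    matches!(
        normalized.as_str(),
        "qwen3_moe" | "qwen3moe" | "qwen3moeforcausallm"
    )
}
```

Keeping the check as a pure function on `&str` is what makes it unit-testable without constructing an AppState.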

Companion-side integration (paiml/claude-code-parity-apr): the M280
CCPA suspension's "harness-validation done; agent-quality measurement
blocked on #1789" stance is unchanged. After this PR, Phase 6
re-dispatch against Qwen3-Coder-30B-MoE will produce a clean
moe_dispatch_not_implemented driver-error class instead of the
previous opaque matmul/InvalidShape class. Meaningful CCPA
measurement still requires Option B (actual MoE inference via HTTP).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#1806): replace tracing::warn! with eprintln! (tracing is optional dep)

`tracing` is declared `optional = true` in aprender-serve/Cargo.toml,
so unconditional `tracing::warn!` in the new guard fn fails to
compile on CI feature combos that don't enable it (ci/test +
workspace-test + ci/lint all observed E0433 "unresolved module
`tracing`" against cuda_chat_backend.rs:659).

The rest of cuda_chat_backend.rs uses `eprintln!` for warn-level
logging (verbose-mode prints, lock-failure messages). Match that
style — eliminates the optional-dep dependency entirely + keeps the
guard's warning output consistent with surrounding code.

Local re-verify:
- cargo check -p aprender-serve --lib --no-default-features — clean
- cargo test -p aprender-serve --lib qwen3_moe_dispatch_guard_tests --features cuda — 5/5 pass

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#1789): Option B — wire run_qwen3_moe_generate into chat-completions handler (#1807)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#1789): rephrase doc to avoid clippy doc_lazy_continuation false-positive

Clippy's `doc_lazy_continuation` lint trips on the wrapped doc line
` + any future streaming/batch backends. See` because it parses the
`+` at the start of a wrapped doc-comment line as a markdown list-item
marker. Reword to use "and" instead of "+" + move the "See" line to
its own sentence.

Local re-verify:
- cargo clippy -p aprender-serve --lib --no-default-features -- -D warnings — clean

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 19, 2026
…li serve handlers (#1812)

* fix(distill): pre-warm non-LoRA backward kernels too (PMAT-698g)

Phase 3 dispatch v8 on gx10 reached the training loop and the first
backward step began JIT-compiling silu_backward / batched_rms_norm_backward
/ rms_norm_gamma_reduce ON DEMAND, then failed with:

  forward_backward_with_grad returned None (CUDA stream poisoned or
  gradient shape mismatch)

This is the documented Blackwell sm_121 JIT-during-active-GPU-work bug
(trueno#200, CLAUDE.md "Backward kernels: Crash because they compile
on-demand when GPU is already active").

Cause: `pre_warm_lora_backward_kernels` short-circuited the entire
function at `lora_rank == 0`, leaving the activation/norm backward
kernels to JIT on demand mid-training. The function name implies
LoRA-only, but it actually pre-warmed shared non-LoRA kernels
(silu_backward, batched_softmax_backward, batched_rms_norm_backward)
that distillation training also needs.

Fix: restructure — only the LoRA-specific gemm_backward warm-ups are
gated on lora_rank > 0. The activation/norm/standard-FP32-GEMM backward
kernels always pre-warm, regardless of LoRA mode. Distillation training
(lora_rank == 0) now gets the full backward kernel cache before block
upload, eliminating on-demand JIT and the resulting stream poisoning.
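
The restructuring can be sketched as a function that decides which kernels to warm. This is an illustration only: the real function compiles kernels instead of returning names, and `gemm_backward_fp32` and `lora_gemm_backward` are hypothetical labels for the standard and LoRA-specific GEMM warm-ups.

```rust
/// Returns the backward kernels to pre-warm for a given LoRA rank.
/// Shared kernels are always included; only LoRA-specific ones are gated.
pub fn backward_kernels_to_prewarm(lora_rank: usize) -> Vec<&'static str> {
    // Activation/norm/standard GEMM kernels: needed by distillation
    // training (lora_rank == 0) as well as LoRA fine-tuning.
    let mut kernels = vec![
        "silu_backward",
        "batched_softmax_backward",
        "batched_rms_norm_backward",
        "rms_norm_gamma_reduce",
        "gemm_backward_fp32", // hypothetical label
    ];
    // Before the fix, an early return at lora_rank == 0 skipped everything.
    if lora_rank > 0 {
        kernels.push("lora_gemm_backward"); // hypothetical label
    }
    kernels
}
```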

Test plan:
- [x] cargo check --features cuda — clean build
- [x] 18 cuda_backward lib tests pass
- [ ] Live gx10 dispatch reaches stepping (post-merge verification)

Stage 4 in the Phase 3 cuda dispatch defect cascade:
  PMAT-700-B → PMAT-698e → PMAT-698f → PMAT-698g
Each surfaced the next defect on the gx10 path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#1789): Option B follow-up — thread MappedGGUFModel through apr-cli serve handlers

The squashed Option B PR (#1806 + #1807, commit 9c97452) wired
`with_mapped_gguf_model()` into `aprender-serve/src/cli/mod_server_commands.rs`
— but that's the wrong entry point. `apr serve` actually dispatches
through `apr-cli/src/commands/serve/{handlers, handler_gpu_completion,
handlers_include_01, server}.rs`. None of those called
`.with_mapped_gguf_model()`, so production `apr serve` runs against
qwen3_moe GGUFs hit the defensive NOT_IMPLEMENTED fallback in
`try_qwen3_moe_backend` (state.mapped_gguf_model() returned None).

## Root cause

apr-cli has TWO entry points to serve:
1. `aprender-serve/src/cli/mod_server_commands.rs::prepare_gguf_serve_state`
   — fixed in #1806/#1807, never called by `apr serve` subcommand
2. `apr-cli/src/commands/serve/handler_gpu_completion.rs::start_gguf_server`
   → `start_gguf_server_cuda` / `start_gguf_server_gpu_batched` /
   `run_cpu_server` — this is the actual serve path

The empirical evidence: paiml/claude-code-parity-apr Phase 6 bench
dispatched at 13:05Z against Qwen3-Coder-30B-MoE produced 4 of 4
captures with `outcome: driver_error, reason: HTTP 501 "qwen3_moe arch
detected but mapped GGUF not retained in AppState"`. That error fires
from the defensive fallback branch in `try_qwen3_moe_backend` —
proving the dispatch reaches the qwen3_moe path but the mmap isn't
plumbed through.

## Fix

`start_gguf_server` now wraps the `MappedGGUFModel` in
`Arc<MappedGGUFModel>` immediately after `from_path` (cheap Arc bump
shared across all dispatch branches). Threaded into:

- `start_gguf_server_cuda(quantized, vocab, mapped: Arc<...>, config)` —
  `.with_mapped_gguf_model(mapped.clone())` on the constructed AppState.
- `start_gguf_server_gpu_batched(quantized, vocab, mapped: Arc<...>,
  config)` — same.
- `run_cpu_server(quantized, vocab, mapped: Option<Arc<...>>, config)`
  — `Option` so the APR-format / non-GGUF callers can pass `None`
  (defensive fallback path remains the clean NOT_IMPLEMENTED).

Callers updated:
- `handler_gpu_completion.rs::start_gguf_server` — wraps in Arc + passes
  through three branches.
- `handler_gpu_completion.rs::start_gguf_server_cuda` fallback CPU
  branch — passes `Some(mapped_model)`.
- `handlers.rs::try_apr_quantized_cpu` — passes `None` (APR format).
- `handlers_include_01.rs` (GH-99 APR Q4K) — passes `None`.
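
The threading pattern above, reduced to a sketch: wrap once, clone the `Arc` into the GPU branch, and let the CPU entry point take an `Option` so non-GGUF callers can pass `None`. The types and signatures are simplified stand-ins (the real functions also take `quantized`, `vocab`, and `config`), and `use_gpu` is a hypothetical flag.

```rust
use std::sync::Arc;

/// Stand-in for the real mmap-backed model.
pub struct MappedGGUFModel {
    pub bytes: Vec<u8>,
}

/// Stand-in for the AppState built by each server entry point.
pub struct ServerState {
    pub mapped: Option<Arc<MappedGGUFModel>>,
}

pub fn start_gguf_server_cuda(mapped: Arc<MappedGGUFModel>) -> ServerState {
    ServerState { mapped: Some(mapped) }
}

/// `Option` so APR-format callers can pass None and keep the 501 fallback.
pub fn run_cpu_server(mapped: Option<Arc<MappedGGUFModel>>) -> ServerState {
    ServerState { mapped }
}

pub fn start_gguf_server(mapped: MappedGGUFModel, use_gpu: bool) -> ServerState {
    // Wrap once right after loading; each branch gets a cheap clone.
    let mapped = Arc::new(mapped);
    if use_gpu {
        start_gguf_server_cuda(Arc::clone(&mapped))
    } else {
        run_cpu_server(Some(mapped))
    }
}
```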

## Empirical verification

Smoke-test post-fix:
```
apr serve run /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  --port 19999 --host 127.0.0.1 --gpu
curl -X POST http://127.0.0.1:19999/v1/chat/completions \
  -d '{"model":"qwen3-coder","messages":[{"role":"user","content":"..."}],"max_tokens":10}'
```
Returns HTTP 200 with valid OpenAI-shape JSON containing generated
tokens. The matmul defensive guard (#1790) does NOT fire. V1_001 +
V1_003 in contracts/qwen3-moe-serve-dispatch-v1.yaml are empirically
discharged.

## Companion-side impact

paiml/claude-code-parity-apr Phase 6 bench is dispatching against
this binary now. Expected outcome: student_pass_rate > 0 on at least
some fixtures (V1_004 falsifier discharge condition).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#1789): make agent HTTP timeout configurable + raise default to 1800s for MoE

Hardcoded 120s HTTP timeout in `AprServeDriver::complete` was too
short for 30B MoE inference without KV cache. Each generated token
requires a full prefill of the entire sequence; a 256-token request
on Qwen3-Coder-30B-A3B takes >>120s wall, so every Phase 6 bench
fixture died with "Error: driver error: network error: apr serve:
error sending request for url" at exactly the 120s mark.

Same root-cause class as aprender#1782 (apr serve startup 30s timeout
that wasn't configurable + size-aware). Fix is symmetric: env-var
override + size-aware default.

Override via `APR_AGENT_HTTP_TIMEOUT_S`. Default raised to 1800s
(30 min), which covers the CCPA Phase 6 bench's per-turn-timeout=900s
ceiling and leaves headroom for large MoE inference until M32d KV
cache lands. For dense models / KV-cache builds this is effectively
unbounded.
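
A minimal sketch of the override logic, assuming the value is a whole number of seconds. Only the variable name and the 1800s default come from the commit message; the function names are hypothetical, and falling back to the default on unparsable or zero values is an assumption about behavior, not something the commit states.

```rust
use std::time::Duration;

pub const DEFAULT_HTTP_TIMEOUT_S: u64 = 1800;

/// Pure helper so the parsing rule is testable without touching the
/// process environment.
pub fn resolve_http_timeout(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        // Assumption: a zero timeout is treated as invalid.
        .filter(|&s| s > 0)
        .unwrap_or(DEFAULT_HTTP_TIMEOUT_S);
    Duration::from_secs(secs)
}

/// Reads the override from the environment, falling back to the default.
pub fn http_timeout_from_env() -> Duration {
    resolve_http_timeout(std::env::var("APR_AGENT_HTTP_TIMEOUT_S").ok().as_deref())
}
```

Under these assumptions, a bench run that wants a tighter bound could export `APR_AGENT_HTTP_TIMEOUT_S=900` before launching the agent.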

Empirical post-fix evidence pending: Phase 6 bench re-dispatch
against Qwen3-Coder-30B-A3B with this binary expected to produce
non-driver_error outcomes (oracle_passed, oracle_failed_after_max_turns,
or oracle_failed). Discharges the implicit
`max_http_timeout_must_accommodate_inference_wall` precondition
embedded in `qwen3-moe-serve-dispatch-v1.yaml` v1.1.0's V1_004.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>