fix(serve): #1789 matmul defensive guard against empty / undersized weights #1790
Merged
…eights

Empty or undersized `weight.data` would cause a cryptic panic deep in `fused_matmul_f32`:

    thread '<unnamed>' panicked at matmul_fused.rs:211:54:
    index out of bounds: the len is 0 but the index is 56311808

Stack traces fire on every rayon worker simultaneously, with no indication that the root cause is an upstream tensor-loading bug.

Most-likely root cause (per #1789): Qwen3-MoE-style models where the parent FFN tensor is registered with an empty data buffer because the actual weights live in per-expert slices (`ffn_up_exps`, `ffn_gate_exps`, `ffn_down_exps`) that the GGUF loader hasn't wired in.

This PR ships the DEFENSIVE GUARD only. It does NOT fix the underlying MoE F32 routing path (the deeper issue tracked in #1789). Instead it converts the cryptic panic into an actionable `RealizarError::InvalidShape` so the next investigator sees:

    matmul weight has EMPTY data buffer (in_dim=N, out_dim=M, qtype=0);
    likely a MoE per-expert tensor was registered with len-0 data — see aprender#1789

Two guards:

1. `weight.data.is_empty()` → InvalidShape with the empty-data hint
2. `weight.qtype == F32 && weight.data.len() < out_dim*in_dim*4` → InvalidShape with concrete have/need byte counts

Guard logic is extracted to a free `fn validate_matmul_weight_shape(...)` so it's unit-testable without constructing a full `OwnedQuantizedModel`.

6 new unit tests cover empty data, undersized F32, correctly-sized F32, oversized F32 (padding allowed), non-F32 (only emptiness is checked), and usize-overflow protection. matmul_fused module: 0 → 6 tests GREEN. `cargo check -p aprender-serve` clean; clippy clean on lib.

Empirical evidence: paiml/claude-code-parity-apr M260 dispatch + the post-#1782 re-dispatch both hit this panic. The timeout fix in #1782 unblocked startup but exposed this downstream MoE-weight bug. Filed as #1789 for the deeper MoE F32 routing fix.

Does NOT fix Qwen3-Coder-30B inference yet; that needs the MoE per-expert weight slicing fix tracked in #1789. This PR only stops the cryptic panic and gives actionable diagnostics.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
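The two guards described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the merged code: `GuardError`, `QType`, the flattened `(data_len, in_dim, out_dim, qtype)` signature, and the exact message wording are assumptions made so the example is self-contained; the real function reports `RealizarError::InvalidShape` and its parameter list is not shown in the PR text.

```rust
/// Hypothetical error type standing in for `RealizarError::InvalidShape`.
#[derive(Debug, PartialEq)]
pub enum GuardError {
    InvalidShape(String),
}

/// Hypothetical stand-in for the weight's quantization tag.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum QType {
    F32,
    Other,
}

/// Bytes per F32 element.
const F32_BYTES: usize = 4;

/// Sketch of the guard: reject empty buffers for every qtype, and reject
/// F32 buffers that are smaller than `out_dim * in_dim * 4` bytes.
pub fn validate_matmul_weight_shape(
    data_len: usize,
    in_dim: usize,
    out_dim: usize,
    qtype: QType,
) -> Result<(), GuardError> {
    // Guard 1: an empty buffer can never be indexed, whatever the qtype.
    if data_len == 0 {
        return Err(GuardError::InvalidShape(format!(
            "matmul weight has EMPTY data buffer (in_dim={in_dim}, out_dim={out_dim}, qtype={qtype:?})"
        )));
    }
    // Guard 2: only F32 has a byte size derivable from the dims alone here.
    if qtype == QType::F32 {
        // checked_mul keeps absurd dims from wrapping around to a small number.
        let need = out_dim
            .checked_mul(in_dim)
            .and_then(|n| n.checked_mul(F32_BYTES))
            .ok_or_else(|| {
                GuardError::InvalidShape(format!(
                    "matmul weight size overflows usize (in_dim={in_dim}, out_dim={out_dim})"
                ))
            })?;
        // Oversized buffers are accepted (padding allowed); undersized are not.
        if data_len < need {
            return Err(GuardError::InvalidShape(format!(
                "matmul F32 weight undersized: have {data_len} bytes, need {need}"
            )));
        }
    }
    Ok(())
}
```

Taking plain lengths and dims rather than a model struct is what makes a check like this testable without building a full model, which is the motivation the commit message gives for extracting the guard into a free function.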
noahgift added a commit that referenced this pull request on May 18, 2026
… PROPOSED (#1794)

Two-axis bump: catch up to companion-led v1.31.0 + ship Phase 6 gate in one PR. Gate registry: 18 → 20 entries. v1.31.0 SKIPPED (companion-led at companion-repo M236 / PR #221 squash 188a328 without aprender-side authoring); v1.30.0 → v1.32.0 directly, the same SKIP pattern v1.28.0 → v1.30.0 used for the auto-closed aprender#1705 PR.

## FALSIFY-CCPA-019 calibration_required_before_verdict (PROPOSED)

Codifies the M196-M224 4-bug-stack lesson. Any future verdict on CCPA-016/017/018 (promotion PROPOSED → ACTIVE_RUNTIME, or treating an evidence file as discharging the gate) requires a fresh calibration record (identity_pass + regression_fail, ≤30 days old) at evidence/calibration/calibration-runs.json.

Bidirectional sensitivity: a meter that ALWAYS passes would pass identity but also pass regression (caught); a meter that ALWAYS fails would fail regression correctly but also fail identity (caught). The freshness window catches infrastructure drift (rustc bumps, apr CLI changes, claude CLI changes) without weekly runs.

Test scaffold: companion-repo crates/ccpa-differ/tests/falsify_ccpa_019_calibration.rs (7 active synthetic + 1 #[ignore]'d live-evidence). The M234 calibration evidence (evidence/calibration/calibration-runs.json) records both the trivial in-house identity fixture + the decy#39 regression dispatch; it currently discharges the gate.

## FALSIFY-CCPA-020 contract_compliance_per_turn (PROPOSED)

Codifies the Phase 6 operator-directive (companion-repo M250+): the right experiment for paiml-org is claude-bound-by-pmat-comply-and-pv vs apr-bound-by-pmat-comply-and-pv, NOT raw-vs-raw. Every paiml commit must pass pmat comply + pv validate to merge.

Per-turn `pmat comply check --strict` + `pv validate` fire on every Write/Edit in the under-contract regime (`ArenaSession::with_compliance(N)`). A compound oracle (cargo test + pmat comply + pv validate) gates OraclePassed.

Bidirectional sensitivity:

- Identity: clean-history-with-pass MUST satisfy
- Regression: pass-with-failing-compliance-turn MUST be falsified

Test scaffold: companion-repo crates/ccpa-arena/tests/falsify_ccpa_020_contract_compliance.rs (7 active synthetic + 1 #[ignore]'d live-evidence).

## Companion-side ship trail (M250-M264)

M250 plan + n=20 corpus; M252 schema; M254 dispatch hook + trap; M256 compound oracle; M258 CCPA-020 gate; M260 first valid n=15 calibration evidence; M262 Toyota-Way root-cause + upstream fixes (#1782 timeout + #1790 matmul guard, both MERGED); M264 P6.6 bench runner (operator-dispatchable end-to-end).

## Activation path

CCPA-019 + CCPA-020 stay PROPOSED until the first operator-dispatched Phase 6 bench produces evidence/under-contract/scores.json AND a fresh calibration record. The ACTIVE_RUNTIME flip awaits both.

`pv validate contracts/claude-code-parity-apr-v1.yaml` clean.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
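The CCPA-019 rule (identity must pass, regression must fail, record no older than 30 days) can be expressed as a small predicate. The sketch below is only an illustration of that logic: the `CalibrationRecord` struct, its field names, and the `age_days` representation are assumptions, since the actual schema of calibration-runs.json is not shown here.

```rust
/// Hypothetical calibration record; field names are illustrative only and
/// do not reflect the real calibration-runs.json schema.
pub struct CalibrationRecord {
    /// The meter passed a fixture that is known to be good.
    pub identity_pass: bool,
    /// The meter failed a fixture that is known to be bad.
    pub regression_fail: bool,
    /// Age of the record in whole days.
    pub age_days: u32,
}

/// Freshness window from the commit message (≤30 days).
pub const MAX_AGE_DAYS: u32 = 30;

/// A record discharges the gate only if the meter is sensitive in both
/// directions and the record is still fresh.
pub fn calibration_discharges_gate(rec: &CalibrationRecord) -> bool {
    rec.identity_pass && rec.regression_fail && rec.age_days <= MAX_AGE_DAYS
}
```

An always-pass meter yields `regression_fail == false` and an always-fail meter yields `identity_pass == false`, so both degenerate meters are rejected by the same conjunction.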
6 tasks
noahgift added a commit that referenced this pull request on May 19, 2026
…ons handler (#1807)

Replaces the Option A NOT_IMPLEMENTED guard (#1806) with a full MoE-aware dispatch path through the existing `run_qwen3_moe_generate` (the same code path used by the `apr run` CLI). qwen3_moe-arch GGUFs served via /v1/chat/completions now actually generate tokens instead of returning NOT_IMPLEMENTED.

## Root cause closed

apr serve's chat handler at `cuda_chat_backend.rs:564` previously called `Arc<Model>::generate()` unconditionally → `Model::forward()` → dense FFN matmul on `ffn_up.weight`. For Qwen3-MoE GGUFs that tensor's data slice is empty (per-expert weights live under `ffn_up_exps.weight`), producing either a matmul panic OR (post-#1790) a clean `RealizarError::InvalidShape`. The MoE-aware path at `infer/inference_result.rs:225` already existed but was only wired into the CLI `apr run`. This PR threads it into the HTTP serve path.

## Implementation

- `AppState::mapped_gguf_model: Option<Arc<MappedGGUFModel>>` field + `with_mapped_gguf_model()` builder + `mapped_gguf_model()` accessor. Required because `run_qwen3_moe_generate` borrows per-expert tensors directly from the mmap; the mapped model must outlive any inference call (Arc gives it shared ownership across handler invocations).
- All 16 AppState ctor sites initialize `mapped_gguf_model: None,` (mechanical insertion via python regex; non-MoE paths unaffected).
- `prepare_gguf_serve_state` (CLI server-command load path) wraps the loaded `MappedGGUFModel` in an `Arc` + attaches it to the final AppState via `.with_mapped_gguf_model(...)`. For non-MoE archs this is just an extra Arc reference; for MoE it's the critical lifetime anchor.
- `try_qwen3_moe_backend()` replaces `guard_qwen3_moe_dispatch()` in `cuda_chat_backend.rs`.
  - For non-qwen3_moe archs: returns None (handler falls through to existing dense backends, so no regression).
  - For qwen3_moe arch with retained mmap: tokenize prompt, build QuantizedGenerateConfig, call run_qwen3_moe_generate, decode + format chat-completions response.
  - For qwen3_moe arch without retained mmap: returns NOT_IMPLEMENTED with actionable error (same class as Option A; defensive fallback).
- Contract v1.1.0: status_history records Phase 1 (Option A, #1806) + Phase 2 (Option B, this PR). FALSIFY-V1_001 + V1_003 are now discharged at the code level (integration-test fixture availability is a follow-up task; see V1_001's evidence note).

## What this PR does NOT do

- Streaming SSE: chat-completions stream=true falls back to the pregenerated_sse_response after the full batch generation. True per-token streaming would require run_qwen3_moe_generate to expose a per-token callback; that's a follow-up refactor.
- KV cache: run_qwen3_moe_generate is full-prefill-per-token. For 30B MoE this is catastrophically slow (~minutes per token). M32d's KV cache work would speed this up but is out of scope here.
- Integration test against a real qwen3_moe GGUF fixture: V1_001 contract gate. Deferred because the fixture infrastructure (small synthetic MoE GGUF) doesn't exist yet. The 5 unit tests carried over from #1806 still pass (they cover the `is_qwen3_moe_arch` predicate).

## Companion-side impact

paiml/claude-code-parity-apr Phase 6 bench against Qwen3-Coder-30B-MoE should now produce a non-zero student pass rate (V1_004 falsification discharge). Operator-coordinated re-dispatch required.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
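The three-way dispatch and the `Option<Arc<...>>` builder described in the Implementation list can be sketched as below. This is a simplified model under stated assumptions, not the crate's code: `MappedModel`, `Dispatch`, `with_architecture`, `try_moe_backend`, and the placeholder "generation" are all hypothetical stand-ins, and the real handler tokenizes, calls `run_qwen3_moe_generate`, and builds an HTTP response instead.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the memory-mapped GGUF model.
pub struct MappedModel {
    pub path: String,
}

/// Simplified application state: only the two fields this sketch needs.
#[derive(Default)]
pub struct AppState {
    architecture: Option<String>,
    mapped_gguf_model: Option<Arc<MappedModel>>,
}

impl AppState {
    /// Hypothetical builder for the architecture tag.
    pub fn with_architecture(mut self, arch: &str) -> Self {
        self.architecture = Some(arch.to_string());
        self
    }

    /// Builder that retains shared ownership of the mapped model so it
    /// outlives every handler invocation that borrows from it.
    pub fn with_mapped_gguf_model(mut self, mapped: Arc<MappedModel>) -> Self {
        self.mapped_gguf_model = Some(mapped);
        self
    }

    /// Accessor returning the retained mapped model, if any.
    pub fn mapped_gguf_model(&self) -> Option<&Arc<MappedModel>> {
        self.mapped_gguf_model.as_ref()
    }
}

/// Outcome of the MoE branch in this sketch.
#[derive(Debug, PartialEq)]
pub enum Dispatch {
    Generated(String),
    NotImplemented(&'static str),
}

/// Returns `None` for non-MoE archs so the caller can fall through to the
/// dense backends; otherwise generates, or reports a clean error when the
/// mapped model was never attached.
pub fn try_moe_backend(state: &AppState, prompt: &str) -> Option<Dispatch> {
    if state.architecture.as_deref() != Some("qwen3_moe") {
        return None;
    }
    match state.mapped_gguf_model() {
        // Placeholder for tokenize → generate → decode.
        Some(mapped) => Some(Dispatch::Generated(format!("[{}] {}", mapped.path, prompt))),
        // Illustrative message, not the real error text.
        None => Some(Dispatch::NotImplemented("MoE arch detected but mapped model not retained")),
    }
}
```

Returning `Option` rather than an error for the non-MoE case is what keeps the dense path untouched: the handler only acts on `Some(...)`.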
noahgift added a commit that referenced this pull request on May 19, 2026
…1806)

* fix(#1789): guard qwen3_moe arch at /v1/chat/completions (Option A)

apr serve's chat-completions handler dispatches inference through `Arc<Model>::generate()`, which calls the dense FFN matmul path. For qwen3_moe GGUFs that path fails: per-expert tensors live under `ffn_*_exps.weight`, and the dense `ffn_up.weight` references zero-byte data. aprender#1790's defensive guard surfaces this as `RealizarError::InvalidShape`, but the underlying dispatch is wrong: the MoE-aware path already exists at `infer::run_inference` (used by the `apr run` CLI) and is not wired into the HTTP handler.

Until the full HTTP-to-MoE wire-up lands (Option B in the new scope doc), this PR inserts a clean architectural guard: detect qwen3_moe arch via `AppState::model_architecture()` + return `StatusCode::NOT_IMPLEMENTED` with a structured error citing aprender#1789 + the new contract YAML. The cryptic matmul error class becomes an actionable "MoE HTTP dispatch not yet implemented" class at the API surface.

Adds:

- contracts/qwen3-moe-serve-dispatch-v1.yaml (4 falsification gates V1_001..V1_004; V1_002 discharged by this PR's unit tests)
- docs/specifications/qwen3-moe-serve-dispatch-fix.md (root cause 5-whys, 3-option engineering trade-off, Option A implementation plan, companion-side CCPA Phase 6 integration plan)
- crates/aprender-serve/src/api/cuda_chat_backend.rs:
  - `guard_qwen3_moe_dispatch()` guard fn (called early in the chat-completions handler before any backend-specific path)
  - `is_qwen3_moe_arch()` testable predicate
  - 5 unit tests under `qwen3_moe_dispatch_guard_tests` covering canonical name + HuggingFace class names + lowercase variants + dense-arch negatives + unknown-arch negatives

Companion-side integration (paiml/claude-code-parity-apr): the M280 CCPA suspension's "harness-validation done; agent-quality measurement blocked on #1789" stance is unchanged. After this PR, Phase 6 re-dispatch against Qwen3-Coder-30B-MoE will produce a clean moe_dispatch_not_implemented driver-error class instead of the previous opaque matmul/InvalidShape class. Meaningful CCPA measurement still requires Option B (actual MoE inference via HTTP).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#1806): replace tracing::warn! with eprintln! (tracing is optional dep)

`tracing` is declared `optional = true` in aprender-serve/Cargo.toml, so an unconditional `tracing::warn!` in the new guard fn fails to compile on CI feature combos that don't enable it (ci/test + workspace-test + ci/lint all observed E0433 "unresolved module `tracing`" against cuda_chat_backend.rs:659).

The rest of cuda_chat_backend.rs uses `eprintln!` for warn-level logging (verbose-mode prints, lock-failure messages). Matching that style eliminates the optional-dep dependency entirely + keeps the guard's warning output consistent with surrounding code.

Local re-verify:

- `cargo check -p aprender-serve --lib --no-default-features`: clean
- `cargo test -p aprender-serve --lib qwen3_moe_dispatch_guard_tests --features cuda`: 5/5 pass

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#1789): Option B — wire run_qwen3_moe_generate into chat-completions handler (#1807)

* fix(#1789): rephrase doc to avoid clippy doc_lazy_continuation false-positive

Clippy's `doc_lazy_continuation` lint trips on the wrapped doc line ` + any future streaming/batch backends. See` because it parses the `+` at the start of a wrapped doc-comment line as a markdown list-item marker. Reword to use "and" instead of "+" + move the "See" line to its own sentence.

Local re-verify:

- `cargo clippy -p aprender-serve --lib --no-default-features -- -D warnings`: clean

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
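A testable predicate of the kind the Option A commit describes (canonical name, HuggingFace class name, lowercase variants, dense and unknown negatives) might look like the sketch below. The accepted strings are an assumption: the PR text names the categories but not the literal values, so the list here is illustrative rather than a copy of the real `is_qwen3_moe_arch`.

```rust
/// Sketch of an architecture predicate. The accepted names below are
/// assumed for illustration; the real list is not shown in the PR text.
pub fn is_qwen3_moe_arch(arch: &str) -> bool {
    // Compared after lowercasing, so each entry is written in lowercase.
    const KNOWN: [&str; 3] = ["qwen3_moe", "qwen3moe", "qwen3moeforcausallm"];
    let normalized = arch.trim().to_ascii_lowercase();
    KNOWN.contains(&normalized.as_str())
}
```

Keeping the predicate free of any `AppState` dependency is what allows it to be unit-tested on plain strings.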
4 tasks
noahgift added a commit that referenced this pull request on May 19, 2026
…li serve handlers (#1812)

* fix(distill): pre-warm non-LoRA backward kernels too (PMAT-698g)

Phase 3 dispatch v8 on gx10 reached the training loop and the first backward step began JIT-compiling silu_backward / batched_rms_norm_backward / rms_norm_gamma_reduce ON DEMAND, then failed with:

    forward_backward_with_grad returned None (CUDA stream poisoned or gradient shape mismatch)

This is the documented Blackwell sm_121 JIT-during-active-GPU-work bug (trueno#200, CLAUDE.md "Backward kernels: Crash because they compile on-demand when GPU is already active").

Cause: `pre_warm_lora_backward_kernels` short-circuited the entire function at `lora_rank == 0`, leaving the activation/norm backward kernels to JIT on demand mid-training. The function name implies LoRA-only, but it actually pre-warmed shared non-LoRA kernels (silu_backward, batched_softmax_backward, batched_rms_norm_backward) that distillation training also needs.

Fix: restructure so that only the LoRA-specific gemm_backward warm-ups are gated on `lora_rank > 0`. The activation/norm/standard-FP32-GEMM backward kernels always pre-warm, regardless of LoRA mode. Distillation training (`lora_rank == 0`) now gets the full backward kernel cache before block upload, eliminating on-demand JIT and the resulting stream poisoning.

Test plan:

- [x] cargo check --features cuda: clean build
- [x] 18 cuda_backward lib tests pass
- [ ] Live gx10 dispatch reaches stepping (post-merge verification)

Stage 4 in the Phase 3 cuda dispatch defect cascade: PMAT-700-B → PMAT-698e → PMAT-698f → PMAT-698g. Each surfaced the next defect on the gx10 path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#1789): Option B follow-up — thread MappedGGUFModel through apr-cli serve handlers

The squashed Option B PR (#1806 + #1807, commit 9c97452) wired `with_mapped_gguf_model()` into `aprender-serve/src/cli/mod_server_commands.rs`, but that's the wrong entry point. `apr serve` actually dispatches through `apr-cli/src/commands/serve/{handlers, handler_gpu_completion, handlers_include_01, server}.rs`. None of those called `.with_mapped_gguf_model()`, so production `apr serve` runs against qwen3_moe GGUFs hit the defensive NOT_IMPLEMENTED fallback in `try_qwen3_moe_backend` (`state.mapped_gguf_model()` returned None).

## Root cause

apr-cli has TWO entry points to serve:

1. `aprender-serve/src/cli/mod_server_commands.rs::prepare_gguf_serve_state`: fixed in #1806/#1807, never called by the `apr serve` subcommand
2. `apr-cli/src/commands/serve/handler_gpu_completion.rs::start_gguf_server` → `start_gguf_server_cuda` / `start_gguf_server_gpu_batched` / `run_cpu_server`: this is the actual serve path

The empirical evidence: paiml/claude-code-parity-apr Phase 6 bench dispatched at 13:05Z against Qwen3-Coder-30B-MoE produced 4 of 4 captures with `outcome: driver_error, reason: HTTP 501 "qwen3_moe arch detected but mapped GGUF not retained in AppState"`. That error fires from the defensive fallback branch in `try_qwen3_moe_backend`, proving the dispatch reaches the qwen3_moe path but the mmap isn't plumbed through.

## Fix

`start_gguf_server` now wraps the `MappedGGUFModel` in `Arc<MappedGGUFModel>` immediately after `from_path` (cheap Arc bump shared across all dispatch branches). Threaded into:

- `start_gguf_server_cuda(quantized, vocab, mapped: Arc<...>, config)`: `.with_mapped_gguf_model(mapped.clone())` on the constructed AppState.
- `start_gguf_server_gpu_batched(quantized, vocab, mapped: Arc<...>, config)`: same.
- `run_cpu_server(quantized, vocab, mapped: Option<Arc<...>>, config)`: `Option` so the APR-format / non-GGUF callers can pass `None` (the defensive fallback path remains the clean NOT_IMPLEMENTED).

Callers updated:

- `handler_gpu_completion.rs::start_gguf_server`: wraps in Arc + passes through three branches.
- `handler_gpu_completion.rs::start_gguf_server_cuda` fallback CPU branch: passes `Some(mapped_model)`.
- `handlers.rs::try_apr_quantized_cpu`: passes `None` (APR format).
- `handlers_include_01.rs` (GH-99 APR Q4K): passes `None`.

## Empirical verification

Smoke-test post-fix:

```
apr serve run /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --port 19999 --host 127.0.0.1 --gpu
curl -X POST http://127.0.0.1:19999/v1/chat/completions \
  -d '{"model":"qwen3-coder","messages":[{"role":"user","content":"..."}],"max_tokens":10}'
```

Returns HTTP 200 with valid OpenAI-shape JSON containing generated tokens. The matmul defensive guard (#1790) does NOT fire. V1_001 + V1_003 in contracts/qwen3-moe-serve-dispatch-v1.yaml are empirically discharged.

## Companion-side impact

paiml/claude-code-parity-apr Phase 6 bench is dispatching against this binary now. Expected outcome: student_pass_rate > 0 on at least some fixtures (V1_004 falsifier discharge condition).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#1789): make agent HTTP timeout configurable + raise default to 1800s for MoE

The hardcoded 120s HTTP timeout in `AprServeDriver::complete` was too short for 30B MoE inference without KV cache. Each generated token requires a full prefill of the entire sequence; a 256-token request on Qwen3-Coder-30B-A3B takes >>120s wall, so every Phase 6 bench fixture died with "Error: driver error: network error: apr serve: error sending request for url" at exactly the 120s mark.

Same root-cause class as aprender#1782 (apr serve startup 30s timeout that wasn't configurable + size-aware). The fix is symmetric: env-var override + size-aware default.

Override via `APR_AGENT_HTTP_TIMEOUT_S`. Default raised to 1800s (30 min), which matches the CCPA Phase 6 bench's per-turn-timeout=900s ceiling + leaves headroom for large MoE inference until the M32d KV cache lands. For dense models / KV-cache builds this is effectively unbounded.

Empirical post-fix evidence pending: Phase 6 bench re-dispatch against Qwen3-Coder-30B-A3B with this binary is expected to produce non-driver_error outcomes (oracle_passed, oracle_failed_after_max_turns, or oracle_failed). Discharges the implicit `max_http_timeout_must_accommodate_inference_wall` precondition embedded in `qwen3-moe-serve-dispatch-v1.yaml` v1.1.0's V1_004.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
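The env-var override with a raised default, as described in the timeout commit, can be sketched like this. Only the variable name `APR_AGENT_HTTP_TIMEOUT_S` and the 1800s default come from the commit message; the function names and the choice to fall back to the default on unparsable or zero values are assumptions, since the real handling of invalid input is not described.

```rust
use std::time::Duration;

/// Default from the commit message: 1800 seconds (30 minutes).
pub const DEFAULT_TIMEOUT_SECS: u64 = 1800;

/// Pure helper: turn an optional raw string into a timeout.
/// Assumption: unparsable or zero values fall back to the default.
pub fn resolve_http_timeout(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&s| s > 0)
        .unwrap_or(DEFAULT_TIMEOUT_SECS);
    Duration::from_secs(secs)
}

/// Reads the override from the environment and delegates to the pure helper.
pub fn http_timeout_from_env() -> Duration {
    resolve_http_timeout(std::env::var("APR_AGENT_HTTP_TIMEOUT_S").ok().as_deref())
}
```

Splitting the parsing from the environment read keeps the logic testable without mutating process-wide state.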
4 tasks
noahgift added a commit that referenced this pull request on May 19, 2026
…GUF (#1819)

* fix(distill): one-char fix — warm! macro used hardcoded "silu_forward" key (PMAT-698j)

THE root-cause bug behind the entire Phase 3 cuda dispatch cascade (PMAT-698e..i, 6 prior PRs). Discovered by PMAT-698i's [FWD-CACHE] diagnostic logging.

The `warm!` macro in pre_warm_for_model:

    macro_rules! warm {
        ($key:expr, $kernel:expr) => {{
            let ptx = $kernel.emit_ptx_for_target(&target);
            self.get_or_compile("silu_forward", &ptx)?; // <-- HARDCODED
            count += 1;
        }};
    }

Every single `warm!()` call stored its compiled module under the hashmap key "silu_forward", colliding on the first call:

1. warm!("batched_rmsnorm_fwd_896", BatchedVectorizedRmsNormKernel...) → cache["silu_forward"] = BatchedVectorizedRmsNorm PTX
2. warm!("gemm_forward_...", ...) → cache["silu_forward"] already Occupied → returns existing entry, new PTX silently discarded
3-23. same: all subsequent kernels never actually pre-warm.

At runtime, every kernel looks up its real cache key:

    let key = format!("batched_rmsnorm_fwd_{hidden_size}_eps{eps_bits:08x}");
    match cache.get_cached(&key) { Some(m) => m, None => JIT }

and cache-MISSES, because the cache contains exactly one entry under "silu_forward". JIT fires for every "pre-warmed" kernel during the first forward pass, which is exactly when Blackwell sm_121's CUDA driver crashes on cuModuleLoadData during active GPU work.

PMAT-698i's [FWD-CACHE] logging surfaced this: every kernel that was "supposed to be pre-warmed" emitted [FWD-CACHE] Compiling at runtime, proving the cache had nothing in it under those keys.

Fix: pass $key through to get_or_compile. One-character change ("silu_forward" → &key).

This explains the entire PMAT-698e..i cascade:

- PMAT-698e (workspace cap): legit independent bug
- PMAT-698f (APR magic): legit independent bug
- PMAT-698g (non-LoRA backward pre-warm): would have been fine IF forward pre-warm worked; the backward kernels were correctly stored under their real keys (backward macro doesn't have the typo). Defense-in-depth, still valuable.
- PMAT-698h (rms_norm_gamma_reduce): same defense-in-depth.
- PMAT-698j (THIS): the root cause.

The previous PMAT-698g/h fixes are still correct (they covered backward gaps that exist independently). This PR addresses the forward cache, which was the dominant source of post-pre-warm JIT events.

Test plan:

- [x] cargo check --features cuda: clean build
- [x] 366 autograd lib tests pass
- [ ] Live gx10 dispatch (post-merge) shows ZERO [FWD-CACHE] Compiling events post-pre-warm (all 23 forward kernels now actually cached)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#1789): V1_001 + V1_003 integration test against real Qwen3-MoE GGUF

Formal cargo-test discharge of FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_001 + V1_003 from `contracts/qwen3-moe-serve-dispatch-v1.yaml` (v1.1.0 → v1.1.1). Pins the chat-completions MoE dispatch invariant into CI as an opt-in integration test.

## What the test does

`crates/aprender-serve/tests/qwen3_moe_serve_dispatch_v1.rs`:

- Loads a real Qwen3-MoE GGUF (via the QWEN3_MOE_GGUF_PATH env var)
- Builds AppState with `with_quantized_model_and_vocab` + attaches the retained mmap via `with_mapped_gguf_model` (Option B path)
- Creates the router via `realizar::api::create_router`
- POSTs `/v1/chat/completions` with max_tokens=4, temperature=0
- Asserts:
  - HTTP 200 (V1_001: dispatch returns non-error)
  - Non-empty `choices[0].message.content` (V1_001: actual generation)
  - Body does NOT contain "InvalidShape" or "matmul weight has EMPTY data buffer" (V1_003: the #1790 defensive guard did not fire, which proves the MoE path was taken, not the dense one)

Gated `#[ignore]` by default. Activated by:

```
QWEN3_MOE_GGUF_PATH=/path/to/qwen3-moe.gguf \
  cargo test --test qwen3_moe_serve_dispatch_v1 \
  -p aprender-serve --features cuda --release -- --ignored --nocapture
```

If `QWEN3_MOE_GGUF_PATH` is unset, the test prints a SKIP message and passes, so it does not block CI on hosts without a real qwen3_moe GGUF.

## Empirical evidence (this PR)

Test passed against `/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf` in 7.84s wall. Response body:

```json
{"id":"chatcmpl-q4k-1779208970299","object":"chat.completion","model":"qwen3-moe-v1-001",
 "choices":[{"message":{"role":"assistant","content":"Human: What"}}],
 "usage":{"prompt_tokens":13,"completion_tokens":4,"total_tokens":17}}
```

Non-empty `content` + no `InvalidShape` → V1_001 + V1_003 cargo-test discharged.

## Contract bump

`qwen3-moe-serve-dispatch-v1.yaml` v1.1.0 → v1.1.1:

- V1_001 evidence updated with new cargo-test command + empirical run record
- V1_003 evidence updated to same
- status_history appends v1.1.1 entry noting formal discharge

V1_004 (companion-side CCPA Phase 6 bench non-zero pass rate) remains BLOCKED on M32d KV cache work, which is an independent contract gate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
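The key-collision failure mode in the `warm!` macro commit can be reproduced in miniature with a plain `HashMap`. The sketch below is an assumption-laden toy: `KernelCache`, the string "PTX" payloads, and the `warm_buggy!` / `warm_fixed!` macros are hypothetical stand-ins for the real CUDA module cache, kept only to show why an insert-if-absent cache with a hardcoded key ends up holding a single entry.

```rust
use std::collections::HashMap;

/// Toy module cache: maps a kernel key to its "compiled" payload.
#[derive(Default)]
pub struct KernelCache {
    modules: HashMap<String, String>,
}

impl KernelCache {
    /// Insert-if-absent: an occupied key returns the existing entry and
    /// silently drops the new payload.
    pub fn get_or_compile(&mut self, key: &str, ptx: &str) -> &String {
        self.modules
            .entry(key.to_string())
            .or_insert_with(|| ptx.to_string())
    }

    pub fn len(&self) -> usize {
        self.modules.len()
    }

    pub fn contains(&self, key: &str) -> bool {
        self.modules.contains_key(key)
    }
}

/// Buggy variant: accepts `$key` but stores under a hardcoded key.
macro_rules! warm_buggy {
    ($cache:expr, $key:expr, $ptx:expr) => {{
        let _ = $key; // key is accepted but ignored
        $cache.get_or_compile("silu_forward", $ptx);
    }};
}

/// Fixed variant: passes `$key` through to the cache.
macro_rules! warm_fixed {
    ($cache:expr, $key:expr, $ptx:expr) => {{
        $cache.get_or_compile($key, $ptx);
    }};
}

/// Pre-warms every key with the buggy macro.
pub fn prewarm_buggy(keys: &[&str]) -> KernelCache {
    let mut cache = KernelCache::default();
    for k in keys {
        warm_buggy!(cache, k, &format!("ptx for {k}"));
    }
    cache
}

/// Pre-warms every key with the fixed macro.
pub fn prewarm_fixed(keys: &[&str]) -> KernelCache {
    let mut cache = KernelCache::default();
    for k in keys {
        warm_fixed!(cache, k, &format!("ptx for {k}"));
    }
    cache
}
```

With the buggy macro, any number of warm-ups leaves one entry in the cache, so every later lookup under a real key misses and would trigger on-demand compilation.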
Summary
Closes the defensive-guard half of #1789. Ships ONLY the early-error guard; the deeper Qwen3-MoE F32 routing fix remains tracked in the parent issue.
Bug
Inference panics in `fused_matmul_f32` at `matmul_fused.rs:211` with:
    thread '<unnamed>' panicked at matmul_fused.rs:211:54:
    index out of bounds: the len is 0 but the index is 56311808
Stack traces fire on every rayon worker simultaneously, with no indication that the root cause is upstream tensor loading.
Root cause hypothesis (per #1789)
Qwen3-MoE models register parent FFN tensors with empty data buffers because the actual weights live in per-expert slices (`ffn_up_exps` / `ffn_gate_exps` / `ffn_down_exps`) that the GGUF loader hasn't wired in.
Fix (this PR)
Defensive guard at the top of `fused_matmul`. Converts the cryptic panic into:
    matmul weight has EMPTY data buffer (in_dim=N, out_dim=M, qtype=0);
    likely a MoE per-expert tensor was registered with len-0 data — see aprender#1789
Two guards via the new free fn `validate_matmul_weight_shape`:
- `weight.data.is_empty()` → `InvalidShape` with empty-data hint + #1789 reference
- `weight.qtype == F32 && weight.data.len() < out_dim*in_dim*4` → `InvalidShape` with concrete have/need byte counts
What this does NOT do
Does NOT fix Qwen3-Coder-30B inference. The deeper MoE F32 routing path bug stays in #1789.
Test plan
- `cargo check -p aprender-serve` clean
- `cargo clippy -p aprender-serve --lib -- -D warnings` clean
- `cargo fmt --check` clean (the pre-existing helpers.rs over-indented-doc-list errors are not from this PR and are also present on main)
Empirical evidence
paiml/claude-code-parity-apr M260 dispatch + the post-#1782 re-dispatch both hit this panic. The #1782 timeout fix unblocked startup; this PR stops the cryptic panic and gives actionable diagnostics for the next investigator.