fix(serve): #1789 matmul defensive guard against empty / undersized weights by noahgift · Pull Request #1790 · paiml/aprender

noahgift · 2026-05-18T11:34:42Z

Summary

Closes the defensive-guard half of #1789. Ships ONLY the early-error guard; the deeper Qwen3-MoE F32 routing fix remains tracked in the parent issue.

Bug

Inference panics in fused_matmul_f32 at matmul_fused.rs:211 with:

thread '<unnamed>' panicked at matmul_fused.rs:211:54:
index out of bounds: the len is 0 but the index is 56311808

Stack traces fire on every rayon worker simultaneously, with no indication the root cause is upstream tensor-loading.

Root cause hypothesis (per #1789)

Qwen3-MoE models register parent FFN tensors with empty data buffers because actual weights live in per-expert slices (ffn_up_exps/ffn_gate_exps/ffn_down_exps) the GGUF loader hasn't wired in.

Fix (this PR)

Defensive guard at the top of fused_matmul. Converts the cryptic panic into:

matmul weight has EMPTY data buffer (in_dim=N, out_dim=M, qtype=0);
likely a MoE per-expert tensor was registered with len-0 data — see aprender#1789

Two guards via new free fn validate_matmul_weight_shape:

weight.data.is_empty() → InvalidShape with empty-data hint + #1789 reference
weight.qtype == F32 && weight.data.len() < out_dim*in_dim*4 → InvalidShape with concrete have/need byte counts
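The two guards above can be sketched as a free function. This is a minimal illustration, not the code merged in this PR: the error enum is a stand-in that borrows only the `InvalidShape` variant name, the parameter list is hypothetical, and `QTYPE_F32 = 0` is an assumption based on the `qtype=0` in the sample message.

```rust
/// Stand-in error type (hypothetical). Only the `InvalidShape` variant
/// name is taken from the PR text; the field layout is an assumption.
#[derive(Debug, PartialEq)]
pub enum RealizarError {
    InvalidShape { reason: String },
}

/// Assumed qtype code for F32 (the PR's sample message prints `qtype=0`).
pub const QTYPE_F32: u32 = 0;

/// Sketch of the guard: reject empty buffers for every qtype, and reject
/// F32 buffers shorter than out_dim * in_dim * 4 bytes. Oversized buffers
/// (padding) are accepted.
pub fn validate_matmul_weight_shape(
    data_len: usize,
    in_dim: usize,
    out_dim: usize,
    qtype: u32,
) -> Result<(), RealizarError> {
    if data_len == 0 {
        return Err(RealizarError::InvalidShape {
            reason: format!(
                "matmul weight has EMPTY data buffer (in_dim={in_dim}, out_dim={out_dim}, qtype={qtype})"
            ),
        });
    }
    if qtype == QTYPE_F32 {
        // checked_mul reports a usize overflow as an error instead of
        // silently wrapping to a small "needed" size.
        let need = out_dim
            .checked_mul(in_dim)
            .and_then(|n| n.checked_mul(4))
            .ok_or_else(|| RealizarError::InvalidShape {
                reason: format!(
                    "F32 weight size overflows usize (in_dim={in_dim}, out_dim={out_dim})"
                ),
            })?;
        if data_len < need {
            return Err(RealizarError::InvalidShape {
                reason: format!(
                    "F32 weight undersized: have {data_len} bytes, need {need} bytes"
                ),
            });
        }
    }
    Ok(())
}
```

Taking plain lengths and dimensions rather than a model struct keeps the check testable without building a full model, which is the motivation the commit message gives for extracting a free function.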

What this does NOT do

Does NOT fix Qwen3-Coder-30B inference. The deeper MoE F32 routing path bug stays in #1789.

Test plan

6 new unit tests on the free function (empty / undersized F32 / sized correctly / oversized / non-F32 / usize overflow)
matmul_fused module: 0 → 6 tests GREEN
cargo check -p aprender-serve clean
cargo clippy -p aprender-serve --lib -- -D warnings clean
cargo fmt --check clean (the pre-existing helpers.rs over-indented-doc-list errors are not introduced by this PR; they are also present on main)

Empirical evidence

paiml/claude-code-parity-apr M260 dispatch + the post-#1782 re-dispatch both hit this panic. #1782 timeout fix unblocked startup; this PR stops the cryptic panic + gives actionable diagnostics for the next investigator.

…eights

Empty or undersized `weight.data` would cause a cryptic panic deep in `fused_matmul_f32`:

    thread '<unnamed>' panicked at matmul_fused.rs:211:54:
    index out of bounds: the len is 0 but the index is 56311808

Stack traces fire on every rayon worker simultaneously, with no indication that the root cause is an upstream tensor-loading bug.

Most-likely root cause (per #1789): Qwen3-MoE-style models where the parent FFN tensor is registered with an empty data buffer because the actual weights live in per-expert slices (`ffn_up_exps`, `ffn_gate_exps`, `ffn_down_exps`) the GGUF loader hasn't wired in.

This PR ships the DEFENSIVE GUARD only. It does NOT fix the underlying MoE F32 routing path (which is the deeper issue tracked in #1789). Instead it converts the cryptic panic into an actionable `RealizarError::InvalidShape` so the next investigator sees:

    matmul weight has EMPTY data buffer (in_dim=N, out_dim=M, qtype=0);
    likely a MoE per-expert tensor was registered with len-0 data — see aprender#1789

Two guards:

1. `weight.data.is_empty()` → InvalidShape with the empty-data hint
2. `weight.qtype == F32 && weight.data.len() < out_dim*in_dim*4` → InvalidShape with concrete have/need byte counts

Guard logic extracted to free `fn validate_matmul_weight_shape(...)` so it's unit-testable without constructing a full `OwnedQuantizedModel`.

6 new unit tests covering empty data, undersized F32, correctly-sized F32, oversized F32 (padding allowed), non-F32 only-checks-emptiness, and usize-overflow protection.

matmul_fused module: 0 → 6 tests GREEN. `cargo check -p aprender-serve` clean; clippy clean on lib.

Empirical evidence: paiml/claude-code-parity-apr M260 dispatch + the post-#1782 re-dispatch both hit this panic. The timeout fix in #1782 unblocked startup but exposed this downstream MoE-weight bug. Filed as #1789 for the deeper MoE F32 routing fix.

Does NOT fix Qwen3-Coder-30B inference yet; that needs the MoE per-expert weight slicing fix tracked in #1789. This PR only stops the cryptic panic and gives actionable diagnostics.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… PROPOSED (#1794)

Two-axis bump: catch up to companion-led v1.31.0 + ship Phase 6 gate in one PR. Gate registry: 18 → 20 entries.

v1.31.0 SKIPPED (companion-led at companion-repo M236 / PR #221 squash 188a328 without aprender-side authoring); v1.30.0 → v1.32.0 directly, same SKIP pattern v1.28.0 → v1.30.0 used for the auto-closed aprender#1705 PR.

## FALSIFY-CCPA-019 calibration_required_before_verdict (PROPOSED)

Codifies the M196-M224 4-bug-stack lesson. Any future verdict on CCPA-016/017/018 (promotion PROPOSED → ACTIVE_RUNTIME OR treating an evidence file as discharging the gate) requires a fresh calibration record (identity_pass + regression_fail, ≤30 days old) at evidence/calibration/calibration-runs.json.

Bidirectional-sensitivity: a meter that ALWAYS-passes would pass identity but also pass regression (caught); a meter that ALWAYS-fails would fail regression correctly but also fail identity (caught). Freshness window catches infrastructure drift (rustc bumps, apr CLI changes, claude CLI changes) without weekly runs.

Test scaffold: companion-repo crates/ccpa-differ/tests/falsify_ccpa_019_calibration.rs (7 active synthetic + 1 #[ignore]'d live-evidence).

The M234 calibration evidence (evidence/calibration/calibration-runs.json) records both the trivial in-house identity fixture + decy#39 regression dispatch; discharges the gate currently.

## FALSIFY-CCPA-020 contract_compliance_per_turn (PROPOSED)

Codifies the Phase 6 operator-directive (companion-repo M250+): the right experiment for paiml-org is claude-bound-by-pmat-comply-and-pv vs apr-bound-by-pmat-comply-and-pv, NOT raw-vs-raw. Every paiml commit must pass pmat comply + pv validate to merge.

Per-turn pmat comply check --strict + pv validate fire on every Write/Edit in the under-contract regime (ArenaSession::with_compliance(N)). Compound oracle (cargo test + pmat comply + pv validate) gates OraclePassed.

Bidirectional sensitivity:

- Identity: clean-history-with-pass MUST satisfy
- Regression: pass-with-failing-compliance-turn MUST be falsified

Test scaffold: companion-repo crates/ccpa-arena/tests/falsify_ccpa_020_contract_compliance.rs (7 active synthetic + 1 #[ignore]'d live-evidence).

## Companion-side ship trail (M250-M264)

M250 plan + n=20 corpus; M252 schema; M254 dispatch hook + trap; M256 compound oracle; M258 CCPA-020 gate; M260 first valid n=15 calibration evidence; M262 Toyota-Way root-cause + upstream fixes (#1782 timeout + #1790 matmul guard, both MERGED); M264 P6.6 bench runner (operator-dispatchable end-to-end).

## Activation path

CCPA-019 + CCPA-020 stay PROPOSED until first operator-dispatched Phase 6 bench produces evidence/under-contract/scores.json AND a fresh calibration record. ACTIVE_RUNTIME flip awaits both.

`pv validate contracts/claude-code-parity-apr-v1.yaml` clean.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
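The CCPA-019 freshness and bidirectional-sensitivity rule described above could be checked roughly as follows. This is a sketch under stated assumptions: the struct, its field names, and the use of Unix seconds are hypothetical and do not reflect the real schema of calibration-runs.json.

```rust
/// Hypothetical calibration record. Field names echo the gate wording
/// (identity_pass, regression_fail); the real JSON schema is not shown
/// in the source, so this layout is an assumption.
pub struct CalibrationRecord {
    pub identity_pass: bool,
    pub regression_fail: bool,
    pub recorded_at_unix_s: u64,
}

/// 30-day freshness window, expressed in seconds.
pub const FRESHNESS_WINDOW_S: u64 = 30 * 24 * 60 * 60;

/// A record discharges the gate only if the meter passed the identity
/// fixture, failed the regression fixture, and is at most 30 days old.
pub fn calibration_is_valid(rec: &CalibrationRecord, now_unix_s: u64) -> bool {
    // saturating_sub treats a timestamp slightly in the future as age 0
    // instead of underflowing.
    let age_s = now_unix_s.saturating_sub(rec.recorded_at_unix_s);
    rec.identity_pass && rec.regression_fail && age_s <= FRESHNESS_WINDOW_S
}
```

Requiring both flags is what catches the two degenerate meters: one that always passes sets `regression_fail` to false, and one that always fails sets `identity_pass` to false.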

…ons handler (#1807)

Replaces the Option A NOT_IMPLEMENTED guard (#1806) with a full MoE-aware dispatch path through the existing `run_qwen3_moe_generate` (the same code path used by `apr run` CLI). qwen3_moe-arch GGUFs served via /v1/chat/completions now actually generate tokens instead of returning NOT_IMPLEMENTED.

## Root cause closed

apr serve's chat handler at `cuda_chat_backend.rs:564` previously called `Arc<Model>::generate()` unconditionally → `Model::forward()` → dense FFN matmul on `ffn_up.weight`. For Qwen3-MoE GGUFs that tensor's data slice is empty (per-expert weights live under `ffn_up_exps.weight`), producing either a matmul panic OR (post-#1790) a clean `RealizarError::InvalidShape`.

The MoE-aware path at `infer/inference_result.rs:225` already existed but was only wired into the CLI `apr run`. This PR threads it into the HTTP serve path.

## Implementation

- `AppState::mapped_gguf_model: Option<Arc<MappedGGUFModel>>` field + `with_mapped_gguf_model()` builder + `mapped_gguf_model()` accessor. Required because `run_qwen3_moe_generate` borrows per-expert tensors directly from the mmap; the mapped model must outlive any inference call (Arc gives it shared ownership across handler invocations).
- All 16 AppState ctor sites initialize `mapped_gguf_model: None,` (mechanical insertion via python regex; non-MoE paths unaffected).
- `prepare_gguf_serve_state` (CLI server-command load path) wraps the loaded `MappedGGUFModel` in an `Arc` + attaches it to the final AppState via `.with_mapped_gguf_model(...)`. For non-MoE archs this is just an extra Arc reference; for MoE it's the critical lifetime anchor.
- `try_qwen3_moe_backend()` replaces `guard_qwen3_moe_dispatch()` in `cuda_chat_backend.rs`. For non-qwen3_moe archs returns None (handler falls through to existing dense backends, no regression). For qwen3_moe arch with retained mmap: tokenize prompt, build QuantizedGenerateConfig, call run_qwen3_moe_generate, decode + format chat-completions response. For qwen3_moe arch without retained mmap: returns NOT_IMPLEMENTED with actionable error (same class as Option A; defensive fallback).
- Contract v1.1.0: status_history records Phase 1 (Option A, #1806) + Phase 2 (Option B, this PR). FALSIFY-V1_001 + V1_003 are now discharged at the code level (integration-test fixture availability is a follow-up task; see V1_001's evidence note).

## What this PR does NOT do

- Streaming SSE: chat-completions stream=true falls back to the pregenerated_sse_response after the full batch generation. True per-token streaming would require run_qwen3_moe_generate to expose a per-token callback; that's a follow-up refactor.
- KV cache: run_qwen3_moe_generate is full-prefill-per-token. For 30B MoE this is catastrophically slow (~minutes per token). M32d's KV cache work would speed this up but is out of scope here.
- Integration test against a real qwen3_moe GGUF fixture: V1_001 contract gate. Deferred because the fixture infrastructure (small synthetic MoE GGUF) doesn't exist yet. The 5 unit tests carried over from #1806 still pass (they cover `is_qwen3_moe_arch` predicate).

## Companion-side impact

paiml/claude-code-parity-apr Phase 6 bench against Qwen3-Coder-30B-MoE should now produce non-zero student pass rate (V1_004 falsification discharge). Operator-coordinated re-dispatch required.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
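The three-way dispatch described for `try_qwen3_moe_backend()` can be illustrated with a small sketch. Everything here is a simplified stand-in: the `AppState` fields, the `MoeDispatch` enum, the `with_architecture` builder, and the fake "generation" string are hypothetical, and the real function tokenizes, calls `run_qwen3_moe_generate`, and builds an HTTP response.

```rust
use std::sync::Arc;

/// Placeholder for the real memory-mapped model type (assumption).
pub struct MappedGGUFModel {
    pub path: String,
}

/// Simplified stand-in for the server state; only the
/// `mapped_gguf_model` field, builder, and accessor names come from the PR.
#[derive(Default)]
pub struct AppState {
    architecture: Option<String>,
    mapped_gguf_model: Option<Arc<MappedGGUFModel>>,
}

impl AppState {
    /// Hypothetical builder used here to set the architecture string.
    pub fn with_architecture(mut self, arch: &str) -> Self {
        self.architecture = Some(arch.to_string());
        self
    }
    /// Retains the mmap so per-expert tensors outlive each request.
    pub fn with_mapped_gguf_model(mut self, model: Arc<MappedGGUFModel>) -> Self {
        self.mapped_gguf_model = Some(model);
        self
    }
    pub fn mapped_gguf_model(&self) -> Option<&Arc<MappedGGUFModel>> {
        self.mapped_gguf_model.as_ref()
    }
}

/// Hypothetical outcome type standing in for an HTTP response.
#[derive(Debug, PartialEq)]
pub enum MoeDispatch {
    Generated(String),
    NotImplemented(&'static str),
}

/// Returns None for non-MoE architectures so the caller falls through
/// to the dense backends; otherwise either generates or reports 501.
pub fn try_qwen3_moe_backend(state: &AppState, prompt: &str) -> Option<MoeDispatch> {
    if state.architecture.as_deref() != Some("qwen3_moe") {
        return None;
    }
    match state.mapped_gguf_model() {
        // Real code would run MoE generation against the mmap here.
        Some(mapped) => Some(MoeDispatch::Generated(format!(
            "[output from {} for prompt: {prompt}]",
            mapped.path
        ))),
        None => Some(MoeDispatch::NotImplemented(
            "qwen3_moe arch detected but mapped GGUF not retained in AppState",
        )),
    }
}
```

Returning `Option` keeps the dense path untouched: a handler can call this first and continue with its existing backends whenever it gets `None`.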

…1806)

* fix(#1789): guard qwen3_moe arch at /v1/chat/completions (Option A)

apr serve's chat-completions handler dispatches inference through Arc<Model>::generate(), which calls the dense FFN matmul path. For qwen3_moe GGUFs that path fails: per-expert tensors live under ffn_*_exps.weight; the dense ffn_up.weight references zero-byte data. aprender#1790's defensive guard surfaces this as RealizarError::InvalidShape, but the underlying dispatch is wrong: the MoE-aware path already exists at infer::run_inference (used by `apr run` CLI) and is not wired into the HTTP handler.

Until the full HTTP-to-MoE wire-up lands (Option B in the new scope doc), this PR inserts a clean architectural guard: detect qwen3_moe arch via AppState::model_architecture() + return StatusCode::NOT_IMPLEMENTED with a structured error citing aprender#1789 + the new contract YAML. The cryptic matmul error class becomes an actionable "MoE HTTP dispatch not yet implemented" class at the API surface.

Adds:

- contracts/qwen3-moe-serve-dispatch-v1.yaml (4 falsification gates V1_001..V1_004; V1_002 discharged by this PR's unit tests)
- docs/specifications/qwen3-moe-serve-dispatch-fix.md (root cause 5-whys, 3-option engineering trade-off, Option A implementation plan, companion-side CCPA Phase 6 integration plan)
- crates/aprender-serve/src/api/cuda_chat_backend.rs:
  - guard_qwen3_moe_dispatch() guard fn (called early in the chat-completions handler before any backend-specific path)
  - is_qwen3_moe_arch() testable predicate
  - 5 unit tests under qwen3_moe_dispatch_guard_tests covering canonical name + HuggingFace class names + lowercase variants + dense-arch negatives + unknown-arch negatives

Companion-side integration (paiml/claude-code-parity-apr): the M280 CCPA suspension's "harness-validation done; agent-quality measurement blocked on #1789" stance is unchanged. After this PR, Phase 6 re-dispatch against Qwen3-Coder-30B-MoE will produce a clean moe_dispatch_not_implemented driver-error class instead of the previous opaque matmul/InvalidShape class. Meaningful CCPA measurement still requires Option B (actual MoE inference via HTTP).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#1806): replace tracing::warn! with eprintln! (tracing is optional dep)

`tracing` is declared `optional = true` in aprender-serve/Cargo.toml, so unconditional `tracing::warn!` in the new guard fn fails to compile on CI feature combos that don't enable it (ci/test + workspace-test + ci/lint all observed E0433 "unresolved module `tracing`" against cuda_chat_backend.rs:659).

The rest of cuda_chat_backend.rs uses `eprintln!` for warn-level logging (verbose-mode prints, lock-failure messages). Matching that style eliminates the optional-dep dependency entirely + keeps the guard's warning output consistent with surrounding code.

Local re-verify:

- cargo check -p aprender-serve --lib --no-default-features: clean
- cargo test -p aprender-serve --lib qwen3_moe_dispatch_guard_tests --features cuda: 5/5 pass

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#1789): Option B — wire run_qwen3_moe_generate into chat-completions handler (#1807)

* fix(#1789): rephrase doc to avoid clippy doc_lazy_continuation false-positive

Clippy's `doc_lazy_continuation` lint trips on the wrapped doc line ` + any future streaming/batch backends. See` because it parses the `+` at the start of a wrapped doc-comment line as a markdown list-item marker. Reword to use "and" instead of "+" + move the "See" line to its own sentence.

Local re-verify:

- cargo clippy -p aprender-serve --lib --no-default-features -- -D warnings: clean

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
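The Option A predicate and guard from the first commit above might look roughly like this. It is a sketch only: the accepted spellings are an assumed set (the source says the tests cover the canonical name, HuggingFace class names, and lowercase variants without listing them), and the `(u16, String)` error tuple is a hypothetical stand-in for the real HTTP response type.

```rust
/// Testable predicate. The exact list of accepted names is an
/// assumption made for illustration.
pub fn is_qwen3_moe_arch(arch: &str) -> bool {
    // Lowercasing lets "Qwen3MoeForCausalLM" and its lowercase variant
    // match the same pattern.
    let normalized = arch.trim().to_ascii_lowercase();
    matches!(
        normalized.as_str(),
        "qwen3_moe" | "qwen3moe" | "qwen3moeforcausallm"
    )
}

/// Guard called before any backend-specific path. Returns an
/// (HTTP status, message) pair for MoE architectures, Ok otherwise.
pub fn guard_qwen3_moe_dispatch(arch: Option<&str>) -> Result<(), (u16, String)> {
    match arch {
        Some(a) if is_qwen3_moe_arch(a) => Err((
            501, // NOT_IMPLEMENTED
            format!("MoE HTTP dispatch not yet implemented for arch '{a}'; see aprender#1789"),
        )),
        // Dense and unknown architectures fall through untouched.
        _ => Ok(()),
    }
}
```

Splitting the string match out of the guard is what makes the negative cases (dense and unknown architectures) cheap to unit-test without any server state.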

…li serve handlers (#1812)

* fix(distill): pre-warm non-LoRA backward kernels too (PMAT-698g)

Phase 3 dispatch v8 on gx10 reached the training loop and the first backward step began JIT-compiling silu_backward / batched_rms_norm_backward / rms_norm_gamma_reduce ON DEMAND, then failed with:

    forward_backward_with_grad returned None (CUDA stream poisoned or gradient shape mismatch)

This is the documented Blackwell sm_121 JIT-during-active-GPU-work bug (trueno#200, CLAUDE.md "Backward kernels: Crash because they compile on-demand when GPU is already active").

Cause: `pre_warm_lora_backward_kernels` short-circuited the entire function at `lora_rank == 0`, leaving the activation/norm backward kernels to JIT on demand mid-training. The function name implies LoRA-only, but it actually pre-warmed shared non-LoRA kernels (silu_backward, batched_softmax_backward, batched_rms_norm_backward) that distillation training also needs.

Fix: restructure so that only the LoRA-specific gemm_backward warm-ups are gated on lora_rank > 0. The activation/norm/standard-FP32-GEMM backward kernels always pre-warm, regardless of LoRA mode. Distillation training (lora_rank == 0) now gets the full backward kernel cache before block upload, eliminating on-demand JIT and the resulting stream poisoning.

Test plan:

- [x] cargo check --features cuda: clean build
- [x] 18 cuda_backward lib tests pass
- [ ] Live gx10 dispatch reaches stepping (post-merge verification)

Stage 4 in the Phase 3 cuda dispatch defect cascade: PMAT-700-B → PMAT-698e → PMAT-698f → PMAT-698g. Each surfaced the next defect on the gx10 path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#1789): Option B follow-up — thread MappedGGUFModel through apr-cli serve handlers

The squashed Option B PR (#1806 + #1807, commit 9c97452) wired `with_mapped_gguf_model()` into `aprender-serve/src/cli/mod_server_commands.rs`, but that's the wrong entry point. `apr serve` actually dispatches through `apr-cli/src/commands/serve/{handlers, handler_gpu_completion, handlers_include_01, server}.rs`. None of those called `.with_mapped_gguf_model()`, so production `apr serve` runs against qwen3_moe GGUFs hit the defensive NOT_IMPLEMENTED fallback in `try_qwen3_moe_backend` (state.mapped_gguf_model() returned None).

## Root cause

apr-cli has TWO entry points to serve:

1. `aprender-serve/src/cli/mod_server_commands.rs::prepare_gguf_serve_state`: fixed in #1806/#1807, never called by `apr serve` subcommand
2. `apr-cli/src/commands/serve/handler_gpu_completion.rs::start_gguf_server` → `start_gguf_server_cuda` / `start_gguf_server_gpu_batched` / `run_cpu_server`: this is the actual serve path

The empirical evidence: paiml/claude-code-parity-apr Phase 6 bench dispatched at 13:05Z against Qwen3-Coder-30B-MoE produced 4 of 4 captures with `outcome: driver_error, reason: HTTP 501 "qwen3_moe arch detected but mapped GGUF not retained in AppState"`. That error fires from the defensive fallback branch in `try_qwen3_moe_backend`, proving the dispatch reaches the qwen3_moe path but the mmap isn't plumbed through.

## Fix

`start_gguf_server` now wraps the `MappedGGUFModel` in `Arc<MappedGGUFModel>` immediately after `from_path` (cheap Arc bump shared across all dispatch branches). Threaded into:

- `start_gguf_server_cuda(quantized, vocab, mapped: Arc<...>, config)`: `.with_mapped_gguf_model(mapped.clone())` on the constructed AppState.
- `start_gguf_server_gpu_batched(quantized, vocab, mapped: Arc<...>, config)`: same.
- `run_cpu_server(quantized, vocab, mapped: Option<Arc<...>>, config)`: `Option` so the APR-format / non-GGUF callers can pass `None` (defensive fallback path remains the clean NOT_IMPLEMENTED).

Callers updated:

- `handler_gpu_completion.rs::start_gguf_server`: wraps in Arc + passes through three branches.
- `handler_gpu_completion.rs::start_gguf_server_cuda` fallback CPU branch: passes `Some(mapped_model)`.
- `handlers.rs::try_apr_quantized_cpu`: passes `None` (APR format).
- `handlers_include_01.rs` (GH-99 APR Q4K): passes `None`.

## Empirical verification

Smoke-test post-fix:

```
apr serve run /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --port 19999 --host 127.0.0.1 --gpu
curl -X POST http://127.0.0.1:19999/v1/chat/completions \
  -d '{"model":"qwen3-coder","messages":[{"role":"user","content":"..."}],"max_tokens":10}'
```

Returns HTTP 200 with valid OpenAI-shape JSON containing generated tokens. The matmul defensive guard (#1790) does NOT fire. V1_001 + V1_003 in contracts/qwen3-moe-serve-dispatch-v1.yaml are empirically discharged.

## Companion-side impact

paiml/claude-code-parity-apr Phase 6 bench is dispatching against this binary now. Expected outcome: student_pass_rate > 0 on at least some fixtures (V1_004 falsifier discharge condition).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#1789): make agent HTTP timeout configurable + raise default to 1800s for MoE

Hardcoded 120s HTTP timeout in `AprServeDriver::complete` was too short for 30B MoE inference without KV cache. Each generated token requires a full prefill of the entire sequence; a 256-token request on Qwen3-Coder-30B-A3B takes >>120s wall, so every Phase 6 bench fixture died with "Error: driver error: network error: apr serve: error sending request for url" at exactly the 120s mark.

Same root-cause class as aprender#1782 (apr serve startup 30s timeout that wasn't configurable + size-aware). Fix is symmetric: env-var override + size-aware default.

Override via `APR_AGENT_HTTP_TIMEOUT_S`. Default raised to 1800s (30 min), which matches the CCPA Phase 6 bench's per-turn-timeout=900s ceiling + leaves headroom for large MoE inference until M32d KV cache lands. For dense models / KV-cache builds this is effectively unbounded.

Empirical post-fix evidence pending: Phase 6 bench re-dispatch against Qwen3-Coder-30B-A3B with this binary expected to produce non-driver_error outcomes (oracle_passed, oracle_failed_after_max_turns, or oracle_failed).

Discharges the implicit `max_http_timeout_must_accommodate_inference_wall` precondition embedded in `qwen3-moe-serve-dispatch-v1.yaml` v1.1.0's V1_004.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
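The env-var override with a raised default, as described in the timeout commit above, can be sketched like this. Only the variable name `APR_AGENT_HTTP_TIMEOUT_S` and the 1800s default come from the source; the function names and the choice to fall back to the default on unparsable or zero values are assumptions.

```rust
use std::time::Duration;

/// Default from the commit message: 1800 seconds (30 minutes).
pub const DEFAULT_AGENT_HTTP_TIMEOUT_S: u64 = 1800;

/// Pure parsing step, split out so it can be tested without mutating
/// the process environment. Invalid or zero values fall back to the
/// default (an assumption; the real code may handle them differently).
pub fn parse_timeout_secs(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&s| s > 0)
        .unwrap_or(DEFAULT_AGENT_HTTP_TIMEOUT_S);
    Duration::from_secs(secs)
}

/// Reads the override from the environment, if present.
pub fn agent_http_timeout() -> Duration {
    let raw = std::env::var("APR_AGENT_HTTP_TIMEOUT_S").ok();
    parse_timeout_secs(raw.as_deref())
}
```

An HTTP client would then be built with `agent_http_timeout()` in place of the hardcoded 120s value.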

…GUF (#1819)

* fix(distill): one-char fix — warm! macro used hardcoded "silu_forward" key (PMAT-698j)

THE root-cause bug behind the entire Phase 3 cuda dispatch cascade (PMAT-698e..i, 6 prior PRs). Discovered by PMAT-698i's [FWD-CACHE] diagnostic logging.

The `warm!` macro in pre_warm_for_model:

    macro_rules! warm {
        ($key:expr, $kernel:expr) => {{
            let ptx = $kernel.emit_ptx_for_target(&target);
            self.get_or_compile("silu_forward", &ptx)?; // <-- HARDCODED
            count += 1;
        }};
    }

Every single `warm!()` call stored its compiled module under the hashmap key "silu_forward", colliding on the first call:

1. warm!("batched_rmsnorm_fwd_896", BatchedVectorizedRmsNormKernel...) → cache["silu_forward"] = BatchedVectorizedRmsNorm PTX
2. warm!("gemm_forward_...", ...) → cache["silu_forward"] already Occupied → returns existing entry, new PTX silently discarded

3-23. same: all subsequent kernels never actually pre-warm.

At runtime, every kernel looks up its real cache key:

    let key = format!("batched_rmsnorm_fwd_{hidden_size}_eps{eps_bits:08x}");
    match cache.get_cached(&key) { Some(m) => m, None => JIT }

and cache-MISSES because the cache contains exactly one entry under "silu_forward". JIT fires for every "pre-warmed" kernel during the first forward pass, exactly when Blackwell sm_121's CUDA driver crashes on cuModuleLoadData during active GPU work.

PMAT-698i's [FWD-CACHE] logging surfaced this: every kernel that was "supposed to be pre-warmed" emitted [FWD-CACHE] Compiling at runtime, proving the cache had nothing in it under those keys.

Fix: pass $key through to get_or_compile. One-character change ("silu_forward" → &key).

This explains the entire PMAT-698e..i cascade:

- PMAT-698e (workspace cap): legit independent bug
- PMAT-698f (APR magic): legit independent bug
- PMAT-698g (non-LoRA backward pre-warm): would have been fine IF forward pre-warm worked; the backward kernels were correctly stored under their real keys (backward macro doesn't have the typo). Defense-in-depth, still valuable.
- PMAT-698h (rms_norm_gamma_reduce): same defense-in-depth.
- PMAT-698j (THIS): the root cause.

The previous PMAT-698g/h fixes are still correct (they covered backward gaps that exist independently). This PR addresses the forward cache, which was the dominant source of post-pre-warm JIT events.

Test plan:

- [x] cargo check --features cuda: clean build
- [x] 366 autograd lib tests pass
- [ ] Live gx10 dispatch (post-merge) shows ZERO [FWD-CACHE] Compiling events post-pre-warm (all 23 forward kernels now actually cached)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#1789): V1_001 + V1_003 integration test against real Qwen3-MoE GGUF

Formal cargo-test discharge of FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_001 + V1_003 from `contracts/qwen3-moe-serve-dispatch-v1.yaml` (v1.1.0 → v1.1.1). Pins the chat-completions MoE dispatch invariant into CI as an opt-in integration test.

## What the test does

`crates/aprender-serve/tests/qwen3_moe_serve_dispatch_v1.rs`:

- Loads a real Qwen3-MoE GGUF (via QWEN3_MOE_GGUF_PATH env var)
- Builds AppState with `with_quantized_model_and_vocab` + attaches retained mmap via `with_mapped_gguf_model` (Option B path)
- Creates the router via `realizar::api::create_router`
- POSTs `/v1/chat/completions` with max_tokens=4, temperature=0
- Asserts:
  - HTTP 200 (V1_001: dispatch returns non-error)
  - Non-empty `choices[0].message.content` (V1_001: actual generation)
  - Body does NOT contain "InvalidShape" or "matmul weight has EMPTY data buffer" (V1_003: #1790 defensive guard did not fire, which proves the MoE path was taken, not dense)

Gated `#[ignore]` by default. Activated by:

```
QWEN3_MOE_GGUF_PATH=/path/to/qwen3-moe.gguf \
cargo test --test qwen3_moe_serve_dispatch_v1 \
  -p aprender-serve --features cuda --release -- --ignored --nocapture
```

If `QWEN3_MOE_GGUF_PATH` is unset, test prints a SKIP message and passes, so it does not block CI on hosts without a real qwen3_moe GGUF.

## Empirical evidence (this PR)

Test passed against `/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf` in 7.84s wall. Response body:

```json
{"id":"chatcmpl-q4k-1779208970299","object":"chat.completion","model":"qwen3-moe-v1-001",
 "choices":[{"message":{"role":"assistant","content":"Human: What"}}],
 "usage":{"prompt_tokens":13,"completion_tokens":4,"total_tokens":17}}
```

Non-empty `content` + no `InvalidShape` → V1_001 + V1_003 cargo-test discharged.

## Contract bump

`qwen3-moe-serve-dispatch-v1.yaml` v1.1.0 → v1.1.1:

- V1_001 evidence updated with new cargo-test command + empirical run record
- V1_003 evidence updated to same
- status_history appends v1.1.1 entry noting formal discharge

V1_004 (companion-side CCPA Phase 6 bench non-zero pass rate) remains BLOCKED on M32d KV cache work, an independent contract gate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
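The `warm!` key collision described in the first commit above can be reproduced in isolation. This is a toy model, not the project's cache: `KernelCache`, its string-valued "modules", and the two macro names are hypothetical, and the only behaviour carried over from the source is "insert only if the key is vacant" plus the hardcoded "silu_forward" key.

```rust
use std::collections::HashMap;

/// Toy cache mapping a kernel key to its "compiled" PTX text.
#[derive(Default)]
pub struct KernelCache {
    modules: HashMap<String, String>,
}

impl KernelCache {
    /// Returns the module stored under `key`, inserting `ptx` only when
    /// the key is vacant; an occupied key silently discards the new PTX.
    pub fn get_or_compile(&mut self, key: &str, ptx: &str) -> &String {
        self.modules
            .entry(key.to_string())
            .or_insert_with(|| ptx.to_string())
    }

    pub fn get_cached(&self, key: &str) -> Option<&String> {
        self.modules.get(key)
    }

    pub fn entry_count(&self) -> usize {
        self.modules.len()
    }
}

/// Buggy shape: accepts `$key` but always stores under one fixed key.
macro_rules! warm_buggy {
    ($cache:expr, $key:expr, $ptx:expr) => {{
        let _ = $key; // the caller's key is never used
        $cache.get_or_compile("silu_forward", $ptx);
    }};
}

/// Fixed shape: the caller's key is passed through.
macro_rules! warm_fixed {
    ($cache:expr, $key:expr, $ptx:expr) => {{
        $cache.get_or_compile($key, $ptx);
    }};
}
```

With `warm_buggy!`, any number of warm-up calls leaves exactly one entry in the cache, so later lookups by the real key miss. With `warm_fixed!`, each kernel lands under its own key.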

noahgift enabled auto-merge (squash) May 18, 2026 11:34

Merge branch 'main' into fix/matmul-fused-empty-data-guard-1789

918c08a

noahgift merged commit 2a95205 into main May 18, 2026
10 checks passed

noahgift deleted the fix/matmul-fused-empty-data-guard-1789 branch May 18, 2026 12:30

This was referenced May 18, 2026

contracts(ccpa): v1.32.0 — add FALSIFY-CCPA-019 + FALSIFY-CCPA-020 at PROPOSED #1794

Merged

apr serve: matmul_fused.rs:211 panics with 'index out of bounds: len 0' on Qwen3-Coder-30B-MoE F32 weight #1789

Closed

noahgift mentioned this pull request May 19, 2026

fix(#1789): guard qwen3_moe arch at /v1/chat/completions (Option A) #1806

Merged

6 tasks

noahgift mentioned this pull request May 19, 2026

fix(#1789): Option B follow-up — thread MappedGGUFModel through apr-cli serve handlers #1812

Merged

4 tasks

noahgift mentioned this pull request May 19, 2026

fix(#1789): V1_001 + V1_003 integration test against real Qwen3-MoE GGUF #1819

Merged

4 tasks

noahgift merged 2 commits into main from fix/matmul-fused-empty-data-guard-1789
