fix(serve): #1789 matmul defensive guard against empty / undersized weights #1790

Merged
noahgift merged 2 commits into main from fix/matmul-fused-empty-data-guard-1789
May 18, 2026

@noahgift
Contributor

Summary

Closes the defensive-guard half of #1789. Ships ONLY the early-error guard; the deeper Qwen3-MoE F32 routing fix remains tracked in the parent issue.

Bug

Inference panics in fused_matmul_f32 at matmul_fused.rs:211 with:

thread '<unnamed>' panicked at matmul_fused.rs:211:54:
index out of bounds: the len is 0 but the index is 56311808

Stack traces fire on every rayon worker simultaneously, with no indication that the root cause is an upstream tensor-loading bug.

Root cause hypothesis (per #1789)

Qwen3-MoE models register parent FFN tensors with empty data buffers because actual weights live in per-expert slices (ffn_up_exps/ffn_gate_exps/ffn_down_exps) the GGUF loader hasn't wired in.

Fix (this PR)

Defensive guard at the top of fused_matmul. Converts the cryptic panic into:

matmul weight has EMPTY data buffer (in_dim=N, out_dim=M, qtype=0);
likely a MoE per-expert tensor was registered with len-0 data — see aprender#1789

Two guards via new free fn validate_matmul_weight_shape:

  1. weight.data.is_empty() → InvalidShape with empty-data hint + #1789 reference
  2. weight.qtype == F32 && weight.data.len() < out_dim*in_dim*4 → InvalidShape with concrete have/need byte counts
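
A minimal sketch of how the two guards could look as a free function. This is not the merged code: `GuardError`, `QTYPE_F32`, the flat argument list, and the error wording are assumptions made for illustration (the real function returns `RealizarError::InvalidShape` and its signature is not shown in this PR text). The `checked_mul` chain is one way to get the usize-overflow protection the test plan mentions.

```rust
/// Hypothetical stand-in for the crate's error type (assumption).
#[derive(Debug, PartialEq)]
pub enum GuardError {
    InvalidShape(String),
}

/// Assumed qtype code for F32; the error text in this PR shows `qtype=0`.
pub const QTYPE_F32: u32 = 0;

/// Rejects weights the fused matmul kernel could not index safely.
pub fn validate_matmul_weight_shape(
    data_len: usize,
    in_dim: usize,
    out_dim: usize,
    qtype: u32,
) -> Result<(), GuardError> {
    // Guard 1: an empty buffer is never valid, whatever the quantization.
    if data_len == 0 {
        return Err(GuardError::InvalidShape(format!(
            "matmul weight has EMPTY data buffer \
             (in_dim={in_dim}, out_dim={out_dim}, qtype={qtype}); see aprender#1789"
        )));
    }
    // Guard 2: only F32 has a byte requirement that is trivial to compute.
    if qtype == QTYPE_F32 {
        // checked_mul turns a usize overflow into an error instead of a wrap.
        let need = out_dim
            .checked_mul(in_dim)
            .and_then(|n| n.checked_mul(4))
            .ok_or_else(|| {
                GuardError::InvalidShape(format!(
                    "matmul weight size overflows usize \
                     (in_dim={in_dim}, out_dim={out_dim})"
                ))
            })?;
        if data_len < need {
            return Err(GuardError::InvalidShape(format!(
                "matmul F32 weight undersized: have {data_len} bytes, need {need}"
            )));
        }
    }
    Ok(())
}
```

In this sketch an oversized F32 buffer is accepted (padding allowed) and non-F32 weights are only checked for emptiness, matching the six cases listed in the test plan.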

What this does NOT do

Does NOT fix Qwen3-Coder-30B inference. The deeper MoE F32 routing path bug stays in #1789.

Test plan

  • 6 new unit tests on the free function (empty / undersized F32 / sized correctly / oversized / non-F32 / usize overflow)
  • matmul_fused module: 0 → 6 tests GREEN
  • cargo check -p aprender-serve clean
  • cargo clippy -p aprender-serve --lib -- -D warnings clean
  • cargo fmt --check clean (pre-existing helpers.rs over-indented-doc-list errors not mine, also present on main)

Empirical evidence

paiml/claude-code-parity-apr M260 dispatch + the post-#1782 re-dispatch both hit this panic. #1782 timeout fix unblocked startup; this PR stops the cryptic panic + gives actionable diagnostics for the next investigator.

…eights

Empty or undersized `weight.data` would cause a cryptic panic deep
in `fused_matmul_f32`:

  thread '<unnamed>' panicked at matmul_fused.rs:211:54:
  index out of bounds: the len is 0 but the index is 56311808

Stack traces fire on every rayon worker simultaneously, with no
indication that the root cause is an upstream tensor-loading bug.

Most-likely root cause (per #1789): Qwen3-MoE-style models where the
parent FFN tensor is registered with an empty data buffer because the
actual weights live in per-expert slices (`ffn_up_exps`,
`ffn_gate_exps`, `ffn_down_exps`) the GGUF loader hasn't wired in.

This PR ships the DEFENSIVE GUARD only — it does NOT fix the
underlying MoE F32 routing path (which is the deeper issue tracked
in #1789). Instead it converts the cryptic panic into an actionable
`RealizarError::InvalidShape` so the next investigator sees:

  matmul weight has EMPTY data buffer (in_dim=N, out_dim=M, qtype=0);
  likely a MoE per-expert tensor was registered with len-0 data —
  see aprender#1789

Two guards:
1. `weight.data.is_empty()` → InvalidShape with the empty-data hint
2. `weight.qtype == F32 && weight.data.len() < out_dim*in_dim*4` →
   InvalidShape with concrete have/need byte counts

Guard logic extracted to free `fn validate_matmul_weight_shape(...)`
so it's unit-testable without constructing a full
`OwnedQuantizedModel`. 6 new unit tests covering empty data,
undersized F32, correctly-sized F32, oversized F32 (padding allowed),
non-F32 only-checks-emptiness, and usize-overflow protection.
matmul_fused module: 0 → 6 tests GREEN. `cargo check -p aprender-serve`
clean; clippy clean on lib.

Empirical evidence: paiml/claude-code-parity-apr M260 dispatch +
the post-#1782 re-dispatch both hit this panic. The timeout fix
in #1782 unblocked startup but exposed this downstream MoE-weight
bug. Filed as #1789 for the deeper MoE F32 routing fix.

Does NOT fix Qwen3-Coder-30B inference yet — needs the MoE per-expert
weight slicing fix tracked in #1789. This PR only stops the cryptic
panic and gives actionable diagnostics.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 18, 2026 11:34
@noahgift noahgift merged commit 2a95205 into main May 18, 2026
10 checks passed
@noahgift noahgift deleted the fix/matmul-fused-empty-data-guard-1789 branch May 18, 2026 12:30
noahgift added a commit that referenced this pull request May 18, 2026
… PROPOSED (#1794)

Two-axis bump: catch up to companion-led v1.31.0 + ship Phase 6
gate in one PR. Gate registry: 18 → 20 entries.

v1.31.0 SKIPPED (companion-led at companion-repo M236 / PR #221
squash 188a328 without aprender-side authoring); v1.30.0 → v1.32.0
directly, same SKIP pattern v1.28.0 → v1.30.0 used for the
auto-closed aprender#1705 PR.

## FALSIFY-CCPA-019 calibration_required_before_verdict (PROPOSED)

Codifies the M196-M224 4-bug-stack lesson. Any future verdict on
CCPA-016/017/018 — promotion PROPOSED → ACTIVE_RUNTIME OR treating
an evidence file as discharging the gate — requires a fresh
calibration record (identity_pass + regression_fail, ≤30 days old)
at evidence/calibration/calibration-runs.json.

Bidirectional-sensitivity: a meter that ALWAYS-passes would pass
identity but also pass regression (caught); a meter that
ALWAYS-fails would fail regression correctly but also fail identity
(caught). Freshness window catches infrastructure drift (rustc
bumps, apr CLI changes, claude CLI changes) without weekly runs.
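
A hedged sketch of the bidirectional-sensitivity check described above. The struct, field names, and `age_days` representation are assumptions; the actual schema of calibration-runs.json is not shown in this commit message.

```rust
/// Hypothetical shape of one calibration record (field names are assumptions).
pub struct CalibrationRecord {
    /// The meter passed the known-good identity fixture.
    pub identity_pass: bool,
    /// The meter failed the known-bad regression fixture, as it should.
    pub regression_fail: bool,
    /// Age of the record in whole days.
    pub age_days: u32,
}

/// Freshness window from the gate description (30 days or newer).
pub const MAX_AGE_DAYS: u32 = 30;

/// A record discharges the gate only if the meter is sensitive in both
/// directions and the record is fresh enough.
pub fn calibration_discharges_gate(rec: &CalibrationRecord) -> bool {
    rec.identity_pass && rec.regression_fail && rec.age_days <= MAX_AGE_DAYS
}
```

An always-pass meter yields `regression_fail == false` and an always-fail meter yields `identity_pass == false`, so both are rejected by the same conjunction.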

Test scaffold: companion-repo crates/ccpa-differ/tests/
falsify_ccpa_019_calibration.rs (7 active synthetic + 1 #[ignore]'d
live-evidence).

The M234 calibration evidence (evidence/calibration/calibration-
runs.json) records both the trivial in-house identity fixture +
decy#39 regression dispatch; discharges the gate currently.

## FALSIFY-CCPA-020 contract_compliance_per_turn (PROPOSED)

Codifies the Phase 6 operator-directive (companion-repo M250+):
the right experiment for paiml-org is claude-bound-by-pmat-comply-
and-pv vs apr-bound-by-pmat-comply-and-pv, NOT raw-vs-raw. Every
paiml commit must pass pmat comply + pv validate to merge.

Per-turn pmat comply check --strict + pv validate fire on every
Write/Edit in the under-contract regime (ArenaSession::with_compliance
(N)). Compound oracle (cargo test + pmat comply + pv validate)
gates OraclePassed.

Bidirectional sensitivity:
- Identity: clean-history-with-pass MUST satisfy
- Regression: pass-with-failing-compliance-turn MUST be falsified
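
The compound oracle can be sketched as a conjunction over the final test result and every per-turn compliance check. `Turn` and `oracle_passed` are hypothetical names, and this only models the gating logic, not how ccpa-arena actually records turns.

```rust
/// Hypothetical per-turn compliance record (names are assumptions).
pub struct Turn {
    pub pmat_comply_ok: bool,
    pub pv_validate_ok: bool,
}

/// OraclePassed requires cargo test to pass AND every turn to be compliant.
pub fn oracle_passed(cargo_test_ok: bool, turns: &[Turn]) -> bool {
    cargo_test_ok
        && turns
            .iter()
            .all(|t| t.pmat_comply_ok && t.pv_validate_ok)
}
```

A single non-compliant turn falsifies the run even when `cargo test` passes, which is the regression case listed above.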

Test scaffold: companion-repo crates/ccpa-arena/tests/
falsify_ccpa_020_contract_compliance.rs (7 active synthetic + 1
#[ignore]'d live-evidence).

## Companion-side ship trail (M250-M264)

M250 plan + n=20 corpus; M252 schema; M254 dispatch hook + trap;
M256 compound oracle; M258 CCPA-020 gate; M260 first valid n=15
calibration evidence; M262 Toyota-Way root-cause + upstream fixes
(#1782 timeout + #1790 matmul guard, both MERGED); M264 P6.6 bench
runner (operator-dispatchable end-to-end).

## Activation path

CCPA-019 + CCPA-020 stay PROPOSED until first operator-dispatched
Phase 6 bench produces evidence/under-contract/scores.json AND a
fresh calibration record. ACTIVE_RUNTIME flip awaits both.

`pv validate contracts/claude-code-parity-apr-v1.yaml` clean.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 19, 2026
…ons handler (#1807)

Replaces the Option A NOT_IMPLEMENTED guard (#1806) with a
full MoE-aware dispatch path through the existing
`run_qwen3_moe_generate` (the same code path used by `apr run` CLI).
qwen3_moe-arch GGUFs served via /v1/chat/completions now actually
generate tokens instead of returning NOT_IMPLEMENTED.

## Root cause closed

apr serve's chat handler at `cuda_chat_backend.rs:564` previously called
`Arc<Model>::generate()` unconditionally → `Model::forward()` → dense
FFN matmul on `ffn_up.weight`. For Qwen3-MoE GGUFs that tensor's data
slice is empty (per-expert weights live under `ffn_up_exps.weight`),
producing either a matmul panic OR (post-#1790) a clean
`RealizarError::InvalidShape`. The MoE-aware path at
`infer/inference_result.rs:225` already existed but was only wired into
the CLI `apr run`. This PR threads it into the HTTP serve path.

## Implementation

- `AppState::mapped_gguf_model: Option<Arc<MappedGGUFModel>>` field +
  `with_mapped_gguf_model()` builder + `mapped_gguf_model()` accessor.
  Required because `run_qwen3_moe_generate` borrows per-expert tensors
  directly from the mmap; the mapped model must outlive any inference
  call (Arc gives it shared ownership across handler invocations).

- All 16 AppState ctor sites initialize `mapped_gguf_model: None,`
  (mechanical insertion via python regex; non-MoE paths unaffected).

- `prepare_gguf_serve_state` (CLI server-command load path) wraps the
  loaded `MappedGGUFModel` in an `Arc` + attaches it to the final
  AppState via `.with_mapped_gguf_model(...)`. For non-MoE archs this
  is just an extra Arc reference; for MoE it's the critical lifetime
  anchor.

- `try_qwen3_moe_backend()` replaces `guard_qwen3_moe_dispatch()` in
  `cuda_chat_backend.rs`. For non-qwen3_moe archs returns None (handler
  falls through to existing dense backends — no regression). For
  qwen3_moe arch with retained mmap: tokenize prompt, build
  QuantizedGenerateConfig, call run_qwen3_moe_generate, decode +
  format chat-completions response. For qwen3_moe arch without retained
  mmap: returns NOT_IMPLEMENTED with actionable error (same class as
  Option A; defensive fallback).

- Contract v1.1.0: status_history records Phase 1 (Option A, #1806) +
  Phase 2 (Option B, this PR). FALSIFY-V1_001 + V1_003 are now
  discharged at the code level (integration-test fixture availability
  is a follow-up task — see V1_001's evidence note).
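
The three-way behaviour of `try_qwen3_moe_backend()` described above can be sketched as follows. Everything here is a simplified stand-in: `MappedModel`, `Dispatch`, and `try_moe_backend` are hypothetical, and the real function tokenizes, calls `run_qwen3_moe_generate`, and builds an HTTP response rather than returning a string.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the retained mmap'd GGUF model.
pub struct MappedModel;

#[derive(Debug, PartialEq)]
pub enum Dispatch {
    /// MoE path taken; the payload stands in for the generated response.
    Generated(String),
    /// MoE arch detected but no mmap was retained (defensive fallback).
    NotImplemented(String),
}

/// Returns None for non-MoE archs so the caller falls through to the
/// existing dense backends.
pub fn try_moe_backend(
    arch: &str,
    mapped: Option<&Arc<MappedModel>>,
    prompt: &str,
) -> Option<Dispatch> {
    if arch != "qwen3_moe" {
        return None;
    }
    match mapped {
        // Placeholder for tokenize -> generate -> decode.
        Some(_model) => Some(Dispatch::Generated(format!("<generated for: {prompt}>"))),
        None => Some(Dispatch::NotImplemented(
            "qwen3_moe arch detected but mapped GGUF not retained in AppState".to_string(),
        )),
    }
}
```

Returning `Option` keeps the non-MoE path untouched: the handler only acts on `Some(..)` and otherwise continues as before.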

## What this PR does NOT do

- Streaming SSE: chat-completions stream=true falls back to the
  pregenerated_sse_response after the full batch generation. True
  per-token streaming would require run_qwen3_moe_generate to expose
  a per-token callback; that's a follow-up refactor.

- KV cache: run_qwen3_moe_generate is full-prefill-per-token. For 30B
  MoE this is catastrophically slow (~minutes per token). M32d's KV
  cache work would speed this up but is out of scope here.

- Integration test against a real qwen3_moe GGUF fixture: V1_001
  contract gate. Deferred because the fixture infrastructure (small
  synthetic MoE GGUF) doesn't exist yet. The 5 unit tests carried over
  from #1806 still pass (they cover `is_qwen3_moe_arch` predicate).

## Companion-side impact

paiml/claude-code-parity-apr Phase 6 bench against Qwen3-Coder-30B-MoE
should now produce non-zero student pass rate (V1_004 falsification
discharge). Operator-coordinated re-dispatch required.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 19, 2026
…1806)

* fix(#1789): guard qwen3_moe arch at /v1/chat/completions (Option A)

apr serve's chat-completions handler dispatches inference through
Arc<Model>::generate(), which calls the dense FFN matmul path. For
qwen3_moe GGUFs that path fails — per-expert tensors live under
ffn_*_exps.weight; the dense ffn_up.weight references zero-byte data.
aprender#1790's defensive guard surfaces this as RealizarError::
InvalidShape, but the underlying dispatch is wrong: the MoE-aware path
already exists at infer::run_inference (used by `apr run` CLI) and is
not wired into the HTTP handler.

Until the full HTTP-to-MoE wire-up lands (Option B in the new scope
doc), this PR inserts a clean architectural guard: detect qwen3_moe
arch via AppState::model_architecture() + return StatusCode::
NOT_IMPLEMENTED with a structured error citing aprender#1789 + the
new contract YAML. The cryptic matmul error class becomes an
actionable "MoE HTTP dispatch not yet implemented" class at the API
surface.

Adds:
- contracts/qwen3-moe-serve-dispatch-v1.yaml (4 falsification gates
  V1_001..V1_004; V1_002 discharged by this PR's unit tests)
- docs/specifications/qwen3-moe-serve-dispatch-fix.md (root cause
  5-whys, 3-option engineering trade-off, Option A implementation
  plan, companion-side CCPA Phase 6 integration plan)
- crates/aprender-serve/src/api/cuda_chat_backend.rs:
  - guard_qwen3_moe_dispatch() guard fn (called early in the
    chat-completions handler before any backend-specific path)
  - is_qwen3_moe_arch() testable predicate
  - 5 unit tests under qwen3_moe_dispatch_guard_tests covering
    canonical name + HuggingFace class names + lowercase variants +
    dense-arch negatives + unknown-arch negatives
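
A minimal sketch of what a testable architecture predicate might look like. The accepted strings are assumptions: `qwen3_moe` as the canonical name and `Qwen3MoeForCausalLM` as the HuggingFace class name; the exact set matched by the real `is_qwen3_moe_arch()` is not listed in this commit message.

```rust
/// Case-insensitive match on assumed qwen3_moe architecture names.
pub fn is_qwen3_moe_arch(arch: &str) -> bool {
    let lower = arch.trim().to_ascii_lowercase();
    matches!(
        lower.as_str(),
        "qwen3_moe" | "qwen3moe" | "qwen3moeforcausallm"
    )
}
```

Exact matching (rather than a substring test) keeps dense `qwen3` and unknown architectures on the negative side.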

Companion-side integration (paiml/claude-code-parity-apr): the M280
CCPA suspension's "harness-validation done; agent-quality measurement
blocked on #1789" stance is unchanged. After this PR, Phase 6
re-dispatch against Qwen3-Coder-30B-MoE will produce a clean
moe_dispatch_not_implemented driver-error class instead of the
previous opaque matmul/InvalidShape class. Meaningful CCPA
measurement still requires Option B (actual MoE inference via HTTP).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#1806): replace tracing::warn! with eprintln! (tracing is optional dep)

`tracing` is declared `optional = true` in aprender-serve/Cargo.toml,
so unconditional `tracing::warn!` in the new guard fn fails to
compile on CI feature combos that don't enable it (ci/test +
workspace-test + ci/lint all observed E0433 "unresolved module
`tracing`" against cuda_chat_backend.rs:659).

The rest of cuda_chat_backend.rs uses `eprintln!` for warn-level
logging (verbose-mode prints, lock-failure messages). Match that
style — eliminates the optional-dep dependency entirely + keeps the
guard's warning output consistent with surrounding code.

Local re-verify:
- cargo check -p aprender-serve --lib --no-default-features — clean
- cargo test -p aprender-serve --lib qwen3_moe_dispatch_guard_tests --features cuda — 5/5 pass

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#1789): Option B — wire run_qwen3_moe_generate into chat-completions handler (#1807)

* fix(#1789): rephrase doc to avoid clippy doc_lazy_continuation false-positive

Clippy's `doc_lazy_continuation` lint trips on the wrapped doc line
` + any future streaming/batch backends. See` because it parses the
`+` at the start of a wrapped doc-comment line as a markdown list-item
marker. Reword to use "and" instead of "+" + move the "See" line to
its own sentence.

Local re-verify:
- cargo clippy -p aprender-serve --lib --no-default-features -- -D warnings — clean

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 19, 2026
…li serve handlers (#1812)

* fix(distill): pre-warm non-LoRA backward kernels too (PMAT-698g)

Phase 3 dispatch v8 on gx10 reached the training loop and the first
backward step began JIT-compiling silu_backward / batched_rms_norm_backward
/ rms_norm_gamma_reduce ON DEMAND, then failed with:

  forward_backward_with_grad returned None (CUDA stream poisoned or
  gradient shape mismatch)

This is the documented Blackwell sm_121 JIT-during-active-GPU-work bug
(trueno#200, CLAUDE.md "Backward kernels: Crash because they compile
on-demand when GPU is already active").

Cause: `pre_warm_lora_backward_kernels` short-circuited the entire
function at `lora_rank == 0`, leaving the activation/norm backward
kernels to JIT on demand mid-training. The function name implies
LoRA-only, but it actually pre-warmed shared non-LoRA kernels
(silu_backward, batched_softmax_backward, batched_rms_norm_backward)
that distillation training also needs.

Fix: restructure — only the LoRA-specific gemm_backward warm-ups are
gated on lora_rank > 0. The activation/norm/standard-FP32-GEMM backward
kernels always pre-warm, regardless of LoRA mode. Distillation training
(lora_rank == 0) now gets the full backward kernel cache before block
upload, eliminating on-demand JIT and the resulting stream poisoning.
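
The restructured gating can be sketched as below. This only models which kernels get selected for pre-warming; the function name and the `lora_gemm_backward` label are hypothetical, and the shared kernel names are the ones quoted in this message.

```rust
/// Returns the backward kernels to pre-warm for a given LoRA rank.
pub fn backward_kernels_to_prewarm(lora_rank: usize) -> Vec<&'static str> {
    // Shared activation/norm kernels: always warmed, even when lora_rank == 0.
    let mut kernels = vec![
        "silu_backward",
        "batched_softmax_backward",
        "batched_rms_norm_backward",
    ];
    // Only the LoRA-specific warm-ups stay gated on the rank.
    if lora_rank > 0 {
        kernels.push("lora_gemm_backward"); // hypothetical kernel label
    }
    kernels
}
```

The pre-fix behaviour corresponds to returning an empty list whenever `lora_rank == 0`, which is what left distillation runs to JIT on demand.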

Test plan:
- [x] cargo check --features cuda — clean build
- [x] 18 cuda_backward lib tests pass
- [ ] Live gx10 dispatch reaches stepping (post-merge verification)

Stage 4 in the Phase 3 cuda dispatch defect cascade:
  PMAT-700-B → PMAT-698e → PMAT-698f → PMAT-698g
Each surfaced the next defect on the gx10 path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#1789): Option B follow-up — thread MappedGGUFModel through apr-cli serve handlers

The squashed Option B PR (#1806 + #1807, commit 9c97452) wired
`with_mapped_gguf_model()` into `aprender-serve/src/cli/mod_server_commands.rs`
— but that's the wrong entry point. `apr serve` actually dispatches
through `apr-cli/src/commands/serve/{handlers, handler_gpu_completion,
handlers_include_01, server}.rs`. None of those called
`.with_mapped_gguf_model()`, so production `apr serve` runs against
qwen3_moe GGUFs hit the defensive NOT_IMPLEMENTED fallback in
`try_qwen3_moe_backend` (state.mapped_gguf_model() returned None).

## Root cause

apr-cli has TWO entry points to serve:
1. `aprender-serve/src/cli/mod_server_commands.rs::prepare_gguf_serve_state`
   — fixed in #1806/#1807, never called by `apr serve` subcommand
2. `apr-cli/src/commands/serve/handler_gpu_completion.rs::start_gguf_server`
   → `start_gguf_server_cuda` / `start_gguf_server_gpu_batched` /
   `run_cpu_server` — this is the actual serve path

The empirical evidence: paiml/claude-code-parity-apr Phase 6 bench
dispatched at 13:05Z against Qwen3-Coder-30B-MoE produced 4 of 4
captures with `outcome: driver_error, reason: HTTP 501 "qwen3_moe arch
detected but mapped GGUF not retained in AppState"`. That error fires
from the defensive fallback branch in `try_qwen3_moe_backend` —
proving the dispatch reaches the qwen3_moe path but the mmap isn't
plumbed through.

## Fix

`start_gguf_server` now wraps the `MappedGGUFModel` in
`Arc<MappedGGUFModel>` immediately after `from_path` (cheap Arc bump
shared across all dispatch branches). Threaded into:

- `start_gguf_server_cuda(quantized, vocab, mapped: Arc<...>, config)` —
  `.with_mapped_gguf_model(mapped.clone())` on the constructed AppState.
- `start_gguf_server_gpu_batched(quantized, vocab, mapped: Arc<...>,
  config)` — same.
- `run_cpu_server(quantized, vocab, mapped: Option<Arc<...>>, config)`
  — `Option` so the APR-format / non-GGUF callers can pass `None`
  (defensive fallback path remains the clean NOT_IMPLEMENTED).

Callers updated:
- `handler_gpu_completion.rs::start_gguf_server` — wraps in Arc + passes
  through three branches.
- `handler_gpu_completion.rs::start_gguf_server_cuda` fallback CPU
  branch — passes `Some(mapped_model)`.
- `handlers.rs::try_apr_quantized_cpu` — passes `None` (APR format).
- `handlers_include_01.rs` (GH-99 APR Q4K) — passes `None`.

## Empirical verification

Smoke-test post-fix:
```
apr serve run /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  --port 19999 --host 127.0.0.1 --gpu
curl -X POST http://127.0.0.1:19999/v1/chat/completions \
  -d '{"model":"qwen3-coder","messages":[{"role":"user","content":"..."}],"max_tokens":10}'
```
Returns HTTP 200 with valid OpenAI-shape JSON containing generated
tokens. The matmul defensive guard (#1790) does NOT fire. V1_001 +
V1_003 in contracts/qwen3-moe-serve-dispatch-v1.yaml are empirically
discharged.

## Companion-side impact

paiml/claude-code-parity-apr Phase 6 bench is dispatching against
this binary now. Expected outcome: student_pass_rate > 0 on at least
some fixtures (V1_004 falsifier discharge condition).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#1789): make agent HTTP timeout configurable + raise default to 1800s for MoE

Hardcoded 120s HTTP timeout in `AprServeDriver::complete` was too
short for 30B MoE inference without KV cache. Each generated token
requires a full prefill of the entire sequence; a 256-token request
on Qwen3-Coder-30B-A3B takes >>120s wall, so every Phase 6 bench
fixture died with "Error: driver error: network error: apr serve:
error sending request for url" at exactly the 120s mark.

Same root-cause class as aprender#1782 (apr serve startup 30s timeout
that wasn't configurable + size-aware). Fix is symmetric: env-var
override + size-aware default.

Override via `APR_AGENT_HTTP_TIMEOUT_S`. Default raised to 1800s
(30 min) — matches the CCPA Phase 6 bench's per-turn-timeout=900s
ceiling + leaves headroom for large MoE inference until M32d KV
cache lands. For dense models / KV-cache builds this is effectively
unbounded.
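
One way the env-var override with a default could be written, as a sketch. Only the variable name `APR_AGENT_HTTP_TIMEOUT_S` and the 1800s default come from this message; the function names and the choice to fall back to the default on unparsable or zero values are assumptions.

```rust
use std::time::Duration;

/// Default from this commit: 1800 seconds (30 minutes).
pub const DEFAULT_TIMEOUT_S: u64 = 1800;

/// Parses an optional raw value; invalid or zero input falls back to the
/// default (fallback policy is an assumption of this sketch).
pub fn agent_http_timeout(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&s| s > 0)
        .unwrap_or(DEFAULT_TIMEOUT_S);
    Duration::from_secs(secs)
}

/// Reads the override from the process environment.
pub fn agent_http_timeout_from_env() -> Duration {
    agent_http_timeout(std::env::var("APR_AGENT_HTTP_TIMEOUT_S").ok().as_deref())
}
```

Splitting the parsing from the environment read keeps the logic unit-testable without mutating process-wide state.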

Empirical post-fix evidence pending: Phase 6 bench re-dispatch
against Qwen3-Coder-30B-A3B with this binary expected to produce
non-driver_error outcomes (oracle_passed, oracle_failed_after_max_turns,
or oracle_failed). Discharges the implicit
`max_http_timeout_must_accommodate_inference_wall` precondition
embedded in `qwen3-moe-serve-dispatch-v1.yaml` v1.1.0's V1_004.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 19, 2026
…GUF (#1819)

* fix(distill): one-char fix — warm! macro used hardcoded "silu_forward" key (PMAT-698j)

THE root-cause bug behind the entire Phase 3 cuda dispatch cascade
(PMAT-698e..i, 6 prior PRs). Discovered by PMAT-698i's [FWD-CACHE]
diagnostic logging.

The `warm!` macro in pre_warm_for_model:

  macro_rules! warm {
      ($key:expr, $kernel:expr) => {{
          let ptx = $kernel.emit_ptx_for_target(&target);
          self.get_or_compile("silu_forward", &ptx)?;  // <-- HARDCODED
          count += 1;
      }};
  }

Every single `warm!()` call stored its compiled module under the
hashmap key "silu_forward", colliding on the first call:

  1. warm!("batched_rmsnorm_fwd_896", BatchedVectorizedRmsNormKernel...)
     → cache["silu_forward"] = BatchedVectorizedRmsNorm PTX
  2. warm!("gemm_forward_...", ...)
     → cache["silu_forward"] already Occupied → returns existing entry,
       new PTX silently discarded
  3-23. same — all subsequent kernels never actually pre-warm.

At runtime, every kernel looks up its real cache key:

  let key = format!("batched_rmsnorm_fwd_{hidden_size}_eps{eps_bits:08x}");
  match cache.get_cached(&key) { Some(m) => m, None => JIT }

— and cache-MISSES because the cache contains exactly one entry
under "silu_forward". JIT fires for every "pre-warmed" kernel during
the first forward pass — exactly when Blackwell sm_121's CUDA driver
crashes on cuModuleLoadData during active GPU work.

PMAT-698i's [FWD-CACHE] logging surfaced this: every kernel that was
"supposed to be pre-warmed" emitted [FWD-CACHE] Compiling at runtime,
proving the cache had nothing in it under those keys.

Fix: pass $key through to get_or_compile. One-character change
("silu_forward" → &key).
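
The collision can be reproduced with a plain `HashMap` standing in for the module cache. This is an illustrative model, not the crate's code: `KernelCache`, `warm_buggy`, and `warm_fixed` are hypothetical, and strings stand in for compiled modules.

```rust
use std::collections::HashMap;

/// Toy cache: key -> PTX text (a stand-in for a compiled module).
pub struct KernelCache {
    modules: HashMap<String, String>,
}

impl KernelCache {
    pub fn new() -> Self {
        Self { modules: HashMap::new() }
    }

    /// Compile-once semantics: an occupied key keeps its existing entry
    /// and the new PTX is silently discarded.
    pub fn get_or_compile(&mut self, key: &str, ptx: &str) -> &String {
        self.modules
            .entry(key.to_string())
            .or_insert_with(|| ptx.to_string())
    }

    pub fn get_cached(&self, key: &str) -> Option<&String> {
        self.modules.get(key)
    }

    pub fn len(&self) -> usize {
        self.modules.len()
    }
}

/// Mirrors the bug: every kernel is stored under the same hardcoded key.
pub fn warm_buggy(cache: &mut KernelCache, kernels: &[(&str, &str)]) {
    for (_key, ptx) in kernels {
        cache.get_or_compile("silu_forward", ptx);
    }
}

/// Mirrors the fix: each kernel is stored under its own key.
pub fn warm_fixed(cache: &mut KernelCache, kernels: &[(&str, &str)]) {
    for (key, ptx) in kernels {
        cache.get_or_compile(key, ptx);
    }
}
```

After `warm_buggy` the cache holds a single entry containing the first kernel's PTX, so every later lookup by real key misses, which is the runtime behaviour described above.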

This explains the entire PMAT-698e..i cascade:
- PMAT-698e (workspace cap) — legit independent bug
- PMAT-698f (APR magic) — legit independent bug
- PMAT-698g (non-LoRA backward pre-warm) — would have been fine IF
  forward pre-warm worked; the backward kernels were correctly stored
  under their real keys (backward macro doesn't have the typo).
  Defense-in-depth, still valuable.
- PMAT-698h (rms_norm_gamma_reduce) — same defense-in-depth.
- PMAT-698j (THIS) — the root cause.

The previous PMAT-698g/h fixes are still correct (they covered backward
gaps that exist independently). This PR addresses the forward cache,
which was the dominant source of post-pre-warm JIT events.

Test plan:
- [x] cargo check --features cuda — clean build
- [x] 366 autograd lib tests pass
- [ ] Live gx10 dispatch (post-merge) shows ZERO [FWD-CACHE] Compiling
      events post-pre-warm (all 23 forward kernels now actually cached)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#1789): V1_001 + V1_003 integration test against real Qwen3-MoE GGUF

Formal cargo-test discharge of FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_001
+ V1_003 from `contracts/qwen3-moe-serve-dispatch-v1.yaml` (v1.1.0 →
v1.1.1). Pins the chat-completions MoE dispatch invariant into CI as
an opt-in integration test.

## What the test does

`crates/aprender-serve/tests/qwen3_moe_serve_dispatch_v1.rs`:
- Loads a real Qwen3-MoE GGUF (via QWEN3_MOE_GGUF_PATH env var)
- Builds AppState with `with_quantized_model_and_vocab` + attaches
  retained mmap via `with_mapped_gguf_model` (Option B path)
- Creates the router via `realizar::api::create_router`
- POSTs `/v1/chat/completions` with max_tokens=4, temperature=0
- Asserts:
  - HTTP 200 (V1_001: dispatch returns non-error)
  - Non-empty `choices[0].message.content` (V1_001: actual generation)
  - Body does NOT contain "InvalidShape" or "matmul weight has EMPTY
    data buffer" (V1_003: #1790 defensive guard did not fire — proves
    MoE path was taken, not dense)

Gated `#[ignore]` by default. Activated by:
```
QWEN3_MOE_GGUF_PATH=/path/to/qwen3-moe.gguf \
  cargo test --test qwen3_moe_serve_dispatch_v1 \
  -p aprender-serve --features cuda --release -- --ignored --nocapture
```

If `QWEN3_MOE_GGUF_PATH` is unset, test prints a SKIP message and
passes — does not block CI on hosts without a real qwen3_moe GGUF.
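
The skip-when-unset behaviour might be structured like this. It is a sketch under assumptions: `fixture_path` and `run_gated` are hypothetical helpers, and the real test may read the variable inline instead.

```rust
use std::path::{Path, PathBuf};

/// Treats an unset or blank value as "no fixture available".
pub fn fixture_path(raw: Option<String>) -> Option<PathBuf> {
    raw.filter(|s| !s.trim().is_empty()).map(PathBuf::from)
}

/// Runs `body` only when a fixture path is present; otherwise prints a
/// SKIP message and reports that nothing ran.
pub fn run_gated<F: FnOnce(&Path)>(raw: Option<String>, body: F) -> bool {
    match fixture_path(raw) {
        Some(path) => {
            body(&path);
            true
        }
        None => {
            eprintln!("SKIP: QWEN3_MOE_GGUF_PATH unset");
            false
        }
    }
}
```

Inside the real test this would be called with `std::env::var("QWEN3_MOE_GGUF_PATH").ok()`, so hosts without the model return early and the test still passes.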

## Empirical evidence (this PR)

Test passed against `/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf`
in 7.84s wall. Response body:
```json
{"id":"chatcmpl-q4k-1779208970299","object":"chat.completion","model":"qwen3-moe-v1-001",
 "choices":[{"message":{"role":"assistant","content":"Human: What"}}],
 "usage":{"prompt_tokens":13,"completion_tokens":4,"total_tokens":17}}
```

Non-empty `content` + no `InvalidShape` → V1_001 + V1_003 cargo-test
discharged.

## Contract bump

`qwen3-moe-serve-dispatch-v1.yaml` v1.1.0 → v1.1.1:
- V1_001 evidence updated with new cargo-test command + empirical run record
- V1_003 evidence updated to same
- status_history appends v1.1.1 entry noting formal discharge

V1_004 (companion-side CCPA Phase 6 bench non-zero pass rate) remains
BLOCKED on M32d KV cache work — independent contract gate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>