Bug
Inference panics in fused_matmul_f32 at crates/aprender-serve/src/gguf/inference/matmul_fused.rs:211 when serving Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf. Multiple rayon worker threads panic simultaneously:
thread '<unnamed>' (3854672) panicked at crates/aprender-serve/src/gguf/inference/matmul_fused.rs:211:54:
index out of bounds: the len is 0 but the index is 56311808
thread '<unnamed>' (3854680) panicked at crates/aprender-serve/src/gguf/inference/matmul_fused.rs:211:54:
index out of bounds: the len is 0 but the index is 55238656
(Similar panics occur on 5+ other worker threads, with indices in the 55M-56M range.)
On the client side, the symptom is curl: (52) Empty reply from server or, for apr code / apr serve callers, Error: driver error: network error: apr serve: error sending request for url (http://127.0.0.1:N/v1/chat/completions).
Reproducer
apr from latest main (squash ff9d0c996 or b50b7cf21):
apr serve run /path/to/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --port 19900 --host 127.0.0.1 &
# wait for ready (apr serve ready (4.0s))
curl -sS -X POST http://127.0.0.1:19900/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"local","messages":[{"role":"user","content":"hi"}],"max_tokens":10}'
# → curl: (52) Empty reply from server; serve panics on first inference call
Root cause hypothesis
Looking at the matmul kernel:
// crates/aprender-serve/src/gguf/inference/matmul_fused.rs:101-104
if weight.qtype == GGUF_TYPE_F32 {
    return Ok(self.fused_matmul_f32(input, &weight.data, in_dim, out_dim, seq_len));
}
fused_matmul_f32 indexes into data: &[u8] with base = row * in_dim * 4. The panic says data.len() == 0 but the computed index reaches ~56M. So weight.data is empty for a tensor marked GGUF_TYPE_F32.
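The index arithmetic can be inverted to see which row a panicking thread was working on. This is a minimal sketch, not code from the crate: row_from_byte_index is a hypothetical helper, and the in_dim value of 2048 used in the note below is an illustrative assumption, not something this report confirms.

```rust
/// Size of one f32 element in bytes.
const F32_BYTES: usize = 4;

/// Invert `base = row * in_dim * 4`: split a byte index into
/// (row, leftover bytes within that row).
/// Returns None when `in_dim` is zero or the row size overflows.
/// Hypothetical helper for triage; not part of aprender-serve.
fn row_from_byte_index(index: usize, in_dim: usize) -> Option<(usize, usize)> {
    let row_bytes = in_dim.checked_mul(F32_BYTES)?;
    if row_bytes == 0 {
        return None;
    }
    Some((index / row_bytes, index % row_bytes))
}
```

If in_dim were 2048, the two logged indices (56311808 and 55238656) would land exactly on the starts of rows 6874 and 6743. That matches the base = row * in_dim * 4 pattern and would suggest a tensor with several thousand output rows.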
Qwen3-MoE uses per-expert FFN weights: ffn_up_exps, ffn_gate_exps, ffn_down_exps stored as 3D tensors of shape (n_experts, hidden_dim, ff_dim). My hypothesis: the GGUF loader is registering the parent MoE-tensor as F32 with empty data while the actual data lives in per-expert slices that the matmul caller isn't aware of.
This is distinct from the GPU-side issues #1582 and #1583 (M-GPU-MOE-2.x and M-GPU-MOE-3) — those track GPU throughput + parity tests; this is a CPU apr serve correctness bug.
Empirical evidence
paiml/claude-code-parity-apr M260 dispatched the calibration-and-scale bench against this model; ALL 15 student-side dispatches failed with the same panic. Previous failure mode was the 30s startup-readiness timeout (fixed by #1782); after that fix landed, the next-layer bug surfaced — this one.
Immediate mitigations
- Defensive guard in fused_matmul_f32: check data.is_empty() and the size invariant data.len() >= row * in_dim * 4 + remainder BEFORE indexing. Return a clear RealizarError::InvalidShape with the tensor name + expected vs actual byte count. Turns the cryptic panic into actionable diagnostics for the next investigator.
- Don't recommend Qwen3-Coder-30B with CPU apr serve in docs until the MoE F32 path is wired.
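A minimal sketch of the defensive guard described above. SketchError is a hypothetical stand-in for RealizarError, whose real InvalidShape variant may carry different fields, and check_f32_weight_len is an assumed helper name. The check validates the whole out_dim * in_dim * 4 extent up front instead of checking each row.

```rust
/// Hypothetical stand-in for the crate's error type; the real
/// `RealizarError::InvalidShape` may have a different shape.
#[derive(Debug, PartialEq)]
enum SketchError {
    InvalidShape {
        tensor: String,
        expected_bytes: usize,
        actual_bytes: usize,
    },
}

/// Verify that `data` can hold `out_dim * in_dim` f32 values before
/// any indexing happens. An empty buffer fails whenever the shape is
/// non-empty, so no separate `is_empty()` branch is needed.
fn check_f32_weight_len(
    tensor: &str,
    data: &[u8],
    in_dim: usize,
    out_dim: usize,
) -> Result<(), SketchError> {
    // Saturate on overflow so an absurd shape is reported, not wrapped.
    let expected_bytes = out_dim
        .saturating_mul(in_dim)
        .saturating_mul(std::mem::size_of::<f32>());
    if data.len() < expected_bytes {
        return Err(SketchError::InvalidShape {
            tensor: tensor.to_string(),
            expected_bytes,
            actual_bytes: data.len(),
        });
    }
    Ok(())
}
```

Calling this once at the top of the F32 path would turn the multi-thread panic into a single error naming the tensor and both byte counts.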
Long-term fix
Wire the Qwen3-MoE per-expert weight slicing through OwnedQuantizedTensor / matmul caller so the F32 path receives the correct expert slice for the currently-routed token.
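One way the per-expert slicing could look, as a sketch under stated assumptions: the 3D tensor is a single contiguous buffer, the expert index is the slowest-varying dimension, and the elements are F32. None of this is confirmed against the loader, and expert_slice is a hypothetical helper rather than an existing API. Quantized expert tensors would need block-aware sizes in place of the 4-byte element size.

```rust
/// Return the byte range belonging to one expert in a flat buffer
/// laid out as (n_experts, rows, in_dim) with f32 elements.
/// Assumed layout; hypothetical helper, not part of aprender-serve.
/// Returns None for an out-of-range expert, an overflowing shape,
/// or a buffer too short to contain the requested slice.
fn expert_slice(
    data: &[u8],
    n_experts: usize,
    rows: usize,
    in_dim: usize,
    expert: usize,
) -> Option<&[u8]> {
    if expert >= n_experts {
        return None;
    }
    let bytes_per_expert = rows
        .checked_mul(in_dim)?
        .checked_mul(std::mem::size_of::<f32>())?;
    let start = expert.checked_mul(bytes_per_expert)?;
    let end = start.checked_add(bytes_per_expert)?;
    // `get` returns None instead of panicking on a short buffer.
    data.get(start..end)
}
```

The slice returned for the routed expert could then be passed to the F32 matmul with out_dim = rows, so the kernel never sees the parent tensor.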
Cross-references
evidence/calibration-and-scale/scores.json