apr serve: matmul_fused.rs:211 panics with 'index out of bounds: len 0' on Qwen3-Coder-30B-MoE F32 weight #1789

@noahgift

Bug

Inference panics in fused_matmul_f32 at crates/aprender-serve/src/gguf/inference/matmul_fused.rs:211 when serving Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf. Multiple rayon worker threads panic simultaneously:

thread '<unnamed>' (3854672) panicked at crates/aprender-serve/src/gguf/inference/matmul_fused.rs:211:54:
index out of bounds: the len is 0 but the index is 56311808

thread '<unnamed>' (3854680) panicked at crates/aprender-serve/src/gguf/inference/matmul_fused.rs:211:54:
index out of bounds: the len is 0 but the index is 55238656

(Similar panics occur on 5+ other worker threads, with indices in the 55-56M range.)

On the client side, the symptom is curl: (52) Empty reply from server or, for apr code / apr serve callers, Error: driver error: network error: apr serve: error sending request for url (http://127.0.0.1:N/v1/chat/completions).

Reproducer

apr from latest main (squash ff9d0c996 or b50b7cf21):

apr serve run /path/to/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --port 19900 --host 127.0.0.1 &
# wait for ready (apr serve ready (4.0s))
curl -sS -X POST http://127.0.0.1:19900/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"local","messages":[{"role":"user","content":"hi"}],"max_tokens":10}'
# → curl: (52) Empty reply from server; serve panics on first inference call

Root cause hypothesis

Looking at the matmul kernel:

// crates/aprender-serve/src/gguf/inference/matmul_fused.rs:101-104
if weight.qtype == GGUF_TYPE_F32 {
    return Ok(self.fused_matmul_f32(input, &weight.data, in_dim, out_dim, seq_len));
}

fused_matmul_f32 indexes into data: &[u8] with base = row * in_dim * 4. The panic reports data.len() == 0 while the computed index reaches ~56M, so weight.data is empty for a tensor marked GGUF_TYPE_F32.
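As a quick sanity check on the two indices quoted in the panic output, the sketch below works backwards from base = row * in_dim * 4. It assumes each panicking index is the first byte of a row (so the index is an exact multiple of the row stride), which the trace does not confirm. The helper names are hypothetical and not part of the crate.

```rust
/// Greatest common divisor (Euclid's algorithm).
fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// Largest row stride in bytes that is consistent with every panic index
/// being a row start (index = row * in_dim * 4). ASSUMPTION: the reported
/// indices are row bases, not base + column offset.
fn max_row_stride_bytes(panic_indices: &[u64]) -> u64 {
    panic_indices.iter().copied().fold(0, gcd)
}

/// Row implied by a panic index for a hypothetical `in_dim`, or `None`
/// if the index is not a whole multiple of the F32 row stride.
fn implied_row(index: u64, in_dim: u64) -> Option<u64> {
    let stride = in_dim.checked_mul(4)?; // 4 bytes per f32
    if stride != 0 && index % stride == 0 {
        Some(index / stride)
    } else {
        None
    }
}
```

For the indices 56311808 and 55238656 the common stride comes out at 8192 bytes, so under this assumption in_dim is at most 2048. With a hypothetical in_dim of 2048 the two threads would have been working on rows 6874 and 6743, which may help narrow down which tensor reached the F32 path.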

Qwen3-MoE uses per-expert FFN weights: ffn_up_exps, ffn_gate_exps, and ffn_down_exps, stored as 3D tensors of shape (n_experts, hidden_dim, ff_dim). My hypothesis: the GGUF loader registers the parent MoE tensor as F32 with empty data, while the actual data lives in per-expert slices that the matmul caller isn't aware of.

This is distinct from the GPU-side issues #1582 and #1583 (M-GPU-MOE-2.x and M-GPU-MOE-3): those track GPU throughput and parity tests, whereas this is a CPU apr serve correctness bug.

Empirical evidence

paiml/claude-code-parity-apr M260 dispatched the calibration-and-scale bench against this model; all 15 student-side dispatches failed with the same panic. The previous failure mode was the 30s startup-readiness timeout (fixed by #1782); once that fix landed, this bug surfaced as the next failure.

Immediate mitigations

  1. Defensive guard in fused_matmul_f32: check data.is_empty() and the size invariant data.len() >= row * in_dim * 4 + remainder before indexing. Return a clear RealizarError::InvalidShape with the tensor name and the expected vs. actual byte count. This turns the cryptic panic into an actionable diagnostic for the next investigator.
  2. Don't recommend Qwen3-Coder-30B with CPU apr serve in docs until the MoE F32 path is wired.
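The guard in mitigation 1 could look roughly like the sketch below. It is a minimal illustration, not the crate's code: InvalidShape here is a local stand-in for RealizarError::InvalidShape (whose real fields are not shown in this issue), check_f32_weight is a hypothetical helper name, and the expected size is taken to be out_dim rows of in_dim 4-byte floats.

```rust
/// Hypothetical stand-in for the `RealizarError::InvalidShape` variant;
/// the real variant's fields may differ.
#[derive(Debug, PartialEq)]
struct InvalidShape {
    tensor: String,
    expected_bytes: usize,
    actual_bytes: usize,
}

/// Checks that an F32 weight buffer covers `out_dim` rows of `in_dim`
/// 4-byte floats before the kernel indexes into it.
fn check_f32_weight(
    tensor: &str,
    data: &[u8],
    in_dim: usize,
    out_dim: usize,
) -> Result<(), InvalidShape> {
    // Saturate on overflow so an absurd shape always fails the check
    // instead of wrapping around to a small byte count.
    let expected_bytes = in_dim
        .checked_mul(out_dim)
        .and_then(|n| n.checked_mul(std::mem::size_of::<f32>()))
        .unwrap_or(usize::MAX);

    // An empty buffer is covered by this comparison whenever the
    // shape is non-empty.
    if data.len() < expected_bytes {
        return Err(InvalidShape {
            tensor: tensor.to_string(),
            expected_bytes,
            actual_bytes: data.len(),
        });
    }
    Ok(())
}
```

Calling such a check once at the top of the F32 branch, before the rayon workers start, would surface a single error carrying the tensor name rather than several simultaneous worker panics.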

Long-term fix

Wire the Qwen3-MoE per-expert weight slicing through OwnedQuantizedTensor / matmul caller so the F32 path receives the correct expert slice for the currently-routed token.
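To make the slicing idea concrete, here is a small sketch of selecting one expert's matrix out of a stacked F32 buffer. It assumes the experts are laid out contiguously, one rows * cols matrix after another, which should be verified against the actual GGUF tensor layout. expert_slice_f32 is a hypothetical helper, not an existing function in OwnedQuantizedTensor.

```rust
/// Returns the bytes of one expert's F32 weight matrix from a stacked
/// MoE tensor, or `None` if the expert index is out of range or the
/// buffer is too short.
///
/// ASSUMPTION: experts are stored back to back, each occupying
/// `rows * cols * 4` bytes.
fn expert_slice_f32(
    data: &[u8],
    n_experts: usize,
    rows: usize,
    cols: usize,
    expert: usize,
) -> Option<&[u8]> {
    if expert >= n_experts {
        return None;
    }
    // Checked arithmetic so a corrupt shape cannot wrap around.
    let per_expert = rows.checked_mul(cols)?.checked_mul(4)?;
    let start = expert.checked_mul(per_expert)?;
    let end = start.checked_add(per_expert)?;
    // `get` returns None instead of panicking when the range is
    // outside the buffer (for example, when `data` is empty).
    data.get(start..end)
}
```

The returned slice could then be passed to the F32 kernel with the per-expert dimensions, so the matmul only ever sees a buffer whose length matches the shape it is given.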
