apr serve: matmul_fused.rs:211 panics with 'index out of bounds: len 0' on Qwen3-Coder-30B-MoE F32 weight #1789

## Bug

Inference panics in `fused_matmul_f32` at `crates/aprender-serve/src/gguf/inference/matmul_fused.rs:211` when serving Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf. Multiple rayon worker threads panic simultaneously:

```
thread '<unnamed>' (3854672) panicked at crates/aprender-serve/src/gguf/inference/matmul_fused.rs:211:54:
index out of bounds: the len is 0 but the index is 56311808

thread '<unnamed>' (3854680) panicked at crates/aprender-serve/src/gguf/inference/matmul_fused.rs:211:54:
index out of bounds: the len is 0 but the index is 55238656
```

(Similar panics appear on 5+ other worker threads, all with indices in the 55M to 56M range.)

On the client side, the symptom is `curl: (52) Empty reply from server` or, for `apr code` / `apr serve` callers, `Error: driver error: network error: apr serve: error sending request for url (http://127.0.0.1:N/v1/chat/completions)`.

## Reproducer

Reproduced with apr built from latest main (squash commit `ff9d0c996` or `b50b7cf21`):

```bash
apr serve run /path/to/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --port 19900 --host 127.0.0.1 &
# wait for ready (apr serve ready (4.0s))
curl -sS -X POST http://127.0.0.1:19900/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"local","messages":[{"role":"user","content":"hi"}],"max_tokens":10}'
# → curl: (52) Empty reply from server; serve panics on first inference call
```

## Root cause hypothesis

Looking at the matmul kernel:

```rust
// crates/aprender-serve/src/gguf/inference/matmul_fused.rs:101-104
if weight.qtype == GGUF_TYPE_F32 {
    return Ok(self.fused_matmul_f32(input, &weight.data, in_dim, out_dim, seq_len));
}
```

`fused_matmul_f32` indexes into `data: &[u8]` with `base = row * in_dim * 4`. The panic reports `data.len() == 0` while the computed index reaches ~56M, so `weight.data` is **empty** for a tensor marked `GGUF_TYPE_F32`.
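A minimal sketch of that indexing pattern, assuming `data` stores `out_dim` rows of `in_dim` little-endian f32 values in row-major order. `matmul_f32_sketch` and `decode_byte_index` are hypothetical names, not the real `fused_matmul_f32` signature, and the loop is sequential rather than rayon-parallel. Because every read of an empty buffer panics, each worker panics on the first byte it touches, i.e. at `row * in_dim * 4` for its first row. Assuming `in_dim = 2048` (the Qwen3-30B-A3B hidden size), both reported indices decode to column 0 of a row (6874 and 6743):

```rust
/// Hypothetical sequential stand-in for the F32 path (seq_len = 1):
/// `data` holds `out_dim` rows of `in_dim` little-endian f32 values.
fn matmul_f32_sketch(input: &[f32], data: &[u8], in_dim: usize, out_dim: usize) -> Vec<f32> {
    (0..out_dim)
        .map(|row| {
            let base = row * in_dim * 4; // byte offset of this row
            (0..in_dim)
                .map(|col| {
                    let off = base + col * 4;
                    // On an empty `data` this indexing panics with
                    // "index out of bounds: the len is 0 but the index is <off>".
                    let w = f32::from_le_bytes([data[off], data[off + 1], data[off + 2], data[off + 3]]);
                    w * input[col]
                })
                .sum::<f32>()
        })
        .collect()
}

/// Map a byte index from the panic message back to (row, col),
/// assuming the row-major f32 layout above.
fn decode_byte_index(index: usize, in_dim: usize) -> (usize, usize) {
    let row_bytes = in_dim * 4;
    (index / row_bytes, (index % row_bytes) / 4)
}

fn main() {
    // Assumption: in_dim = 2048 (Qwen3-30B-A3B hidden size).
    for idx in [56_311_808usize, 55_238_656] {
        let (row, col) = decode_byte_index(idx, 2048);
        println!("byte {idx} -> row {row}, col {col}");
    }
}
```

If `in_dim` is in fact 2048, the reported indices say nothing about the offending tensor beyond "its rows start at byte multiples of 8192", which is why the tensor name in the error (see mitigations) matters.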

Qwen3-MoE uses per-expert FFN weights (`ffn_up_exps`, `ffn_gate_exps`, `ffn_down_exps`) stored as 3D tensors of shape (n_experts, hidden_dim, ff_dim). My hypothesis: the GGUF loader registers the parent MoE tensor as F32 with empty data, while the actual data lives in per-expert slices that the matmul caller isn't aware of.
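If the hypothesis above holds, the caller has to carve one expert's matrix out of the 3D tensor before calling the F32 kernel. A minimal sketch, assuming F32 data with the expert index as the outermost (slowest-varying) dimension, so each expert occupies one contiguous block of `rows * cols * 4` bytes. `expert_slice` is a hypothetical helper, and the real GGUF dimension order for `ffn_*_exps` should be confirmed against the loader:

```rust
/// Hypothetical helper: return the raw bytes of expert `expert` from a 3D F32
/// tensor laid out as [n_experts][rows][cols], or None if the buffer is too
/// short (for example, empty) or the offset arithmetic overflows.
fn expert_slice(data: &[u8], expert: usize, rows: usize, cols: usize) -> Option<&[u8]> {
    let expert_bytes = rows.checked_mul(cols)?.checked_mul(4)?;
    let start = expert.checked_mul(expert_bytes)?;
    let end = start.checked_add(expert_bytes)?;
    data.get(start..end) // never panics: out-of-range yields None
}

fn main() {
    // 3 experts, each a 2x2 f32 matrix filled with its own expert index.
    let mut data = Vec::new();
    for e in 0..3u8 {
        for _ in 0..4 {
            data.extend_from_slice(&f32::from(e).to_le_bytes());
        }
    }
    let s = expert_slice(&data, 1, 2, 2).expect("expert 1 exists");
    println!("expert 1: {} bytes, first value {}", s.len(), f32::from_le_bytes([s[0], s[1], s[2], s[3]]));
    println!("expert 3 present: {}", expert_slice(&data, 3, 2, 2).is_some());
    let empty: &[u8] = &[];
    println!("expert 0 of an empty tensor present: {}", expert_slice(empty, 0, 2, 2).is_some());
}
```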

This is distinct from the GPU-side issues #1582 and #1583 (M-GPU-MOE-2.x and M-GPU-MOE-3): those track GPU throughput and parity tests, whereas this is a CPU `apr serve` correctness bug.

## Empirical evidence

paiml/claude-code-parity-apr M260 dispatched the calibration-and-scale bench against this model; ALL 15 student-side dispatches failed with the same panic. The previous failure mode was the 30s startup-readiness timeout (fixed by #1782); once that fix landed, the next bug down the stack surfaced: this one.

## Immediate mitigations

1. **Defensive guard in `fused_matmul_f32`**: check `data.is_empty()` and the size invariant `data.len() >= out_dim * in_dim * 4` BEFORE indexing, and return a clear `RealizarError::InvalidShape` with the tensor name plus the expected vs. actual byte count. This turns the cryptic panic into actionable diagnostics for the next investigator.
2. **Don't recommend Qwen3-Coder-30B with CPU `apr serve` in docs** until the MoE F32 path is wired.
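A minimal sketch of the guard from mitigation 1. `ShapeError` is a hypothetical stand-in for `RealizarError::InvalidShape`, whose real fields are not shown in this issue, and `check_f32_weight` is a hypothetical name. The expected size uses checked arithmetic so that bogus dimensions cannot overflow the check itself; an empty buffer fails whenever `out_dim * in_dim > 0`:

```rust
use std::fmt;

/// Hypothetical stand-in for `RealizarError::InvalidShape`.
#[derive(Debug, PartialEq)]
struct ShapeError {
    tensor: String,
    expected_bytes: Option<usize>, // None if out_dim * in_dim * 4 overflows usize
    actual_bytes: usize,
}

impl fmt::Display for ShapeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self.expected_bytes {
            Some(n) => write!(
                f,
                "F32 weight '{}' has {} bytes, expected at least {}",
                self.tensor, self.actual_bytes, n
            ),
            None => write!(f, "F32 weight '{}': expected byte size overflows usize", self.tensor),
        }
    }
}

/// Validate an F32 weight buffer before the kernel indexes into it.
fn check_f32_weight(tensor: &str, data: &[u8], in_dim: usize, out_dim: usize) -> Result<(), ShapeError> {
    let expected = out_dim.checked_mul(in_dim).and_then(|n| n.checked_mul(4));
    match expected {
        Some(n) if data.len() >= n => Ok(()),
        _ => Err(ShapeError {
            tensor: tensor.to_string(),
            expected_bytes: expected,
            actual_bytes: data.len(),
        }),
    }
}

fn main() {
    // Mirrors the reported failure: an empty buffer for a non-empty matrix.
    // The tensor name and dimensions are placeholders, not read from the model.
    let empty: &[u8] = &[];
    match check_f32_weight("blk.0.placeholder_weight", empty, 2048, 768) {
        Ok(()) => println!("ok"),
        Err(e) => println!("error: {e}"),
    }
}
```

Since the current call site passes only `&weight.data` and the dimensions, the tensor name would have to be threaded through or the check done by the caller.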

## Long-term fix

Wire the Qwen3-MoE per-expert weight slicing through `OwnedQuantizedTensor` and the matmul caller so that the F32 path receives the correct expert slice for the currently routed token.
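A minimal sketch of that shape, assuming routing has already produced the selected expert ids and their gate weights for one token, and assuming a contiguous `[n_experts][out_dim][in_dim]` little-endian F32 layout. It covers a single projection only (a real Qwen3-MoE FFN applies gate, up and down projections with an activation), and `routed_expert_matvec` is a hypothetical name, not an existing aprender API:

```rust
/// Hypothetical: sum over routed experts e of gate_e * (W_e * input), where
/// each W_e is an out_dim x in_dim f32 matrix stored contiguously inside
/// `exps_data` as [n_experts][out_dim][in_dim].
fn routed_expert_matvec(
    input: &[f32],
    exps_data: &[u8],
    routed: &[(usize, f32)], // (expert id, gate weight) pairs for this token
    in_dim: usize,
    out_dim: usize,
) -> Result<Vec<f32>, String> {
    if in_dim == 0 || out_dim == 0 {
        return Err("zero-sized expert matrix".into());
    }
    if input.len() != in_dim {
        return Err(format!("input has {} values, expected {}", input.len(), in_dim));
    }
    let expert_bytes = out_dim
        .checked_mul(in_dim)
        .and_then(|n| n.checked_mul(4))
        .ok_or("expert size overflows usize")?;
    let mut out = vec![0.0f32; out_dim];
    for &(expert, gate) in routed {
        // Slice out only this expert's matrix; report an error instead of panicking.
        let start = expert.checked_mul(expert_bytes).ok_or("expert offset overflows usize")?;
        let end = start.checked_add(expert_bytes).ok_or("expert end overflows usize")?;
        let w = exps_data
            .get(start..end)
            .ok_or_else(|| format!("expert {expert} is out of range for {} bytes", exps_data.len()))?;
        for (row, row_bytes) in w.chunks_exact(in_dim * 4).enumerate() {
            let dot: f32 = row_bytes
                .chunks_exact(4)
                .zip(input)
                .map(|(b, x)| f32::from_le_bytes([b[0], b[1], b[2], b[3]]) * x)
                .sum();
            out[row] += gate * dot;
        }
    }
    Ok(out)
}

fn main() {
    // Two experts with 1x2 matrices: expert 0 = [1, 2], expert 1 = [10, 20].
    let mut exps = Vec::new();
    for v in [1.0f32, 2.0, 10.0, 20.0] {
        exps.extend_from_slice(&v.to_le_bytes());
    }
    let y = routed_expert_matvec(&[1.0, 1.0], &exps, &[(0, 0.25), (1, 0.75)], 2, 1);
    println!("{y:?}");
}
```

Returning `Result` rather than indexing directly means a mis-registered MoE tensor surfaces as an error on the request instead of killing the rayon workers.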

## Cross-references

- paiml/claude-code-parity-apr M260 calibration evidence: `evidence/calibration-and-scale/scores.json`
- aprender PR #1782 (timeout fix that surfaced this downstream bug)
- aprender#1583 / #1582 — sibling Qwen3-MoE GPU work
