
fix(aprender-serve): M32d Step 5 — apply per-head Q/K RMSNorm in forward_qwen3_moe (GH-279) — gibberish → coherent English#1228

Merged
noahgift merged 2 commits into main from fix/m32d-step5-qwen3-moe-missing-per-head-qk-norm on May 2, 2026

Conversation


@noahgift noahgift commented May 1, 2026

TL;DR

Found and fixed the root cause of the M32d %%%%%%%% gibberish output: the Qwen3 per-head Q/K RMSNorm (GH-279) was applied in the dense path but was missing from forward_qwen3_moe.rs. With this fix, apr run against the cached 17.3 GB Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf goes from %%%%%%%% to coherent English text.

Live dogfood evidence on lambda-vector RTX 4090

PRE-FIX:

$ apr run <17.3 GB Qwen3-Coder GGUF> --prompt "What is 2+2?" --max-tokens 8
Output: %%%%%%%%

POST-FIX (this PR):

$ apr run <GGUF> --prompt "What is 2+2?" --max-tokens 8
Output: Human: What is 2+

$ apr run <GGUF> --prompt "Hello" --max-tokens 16
Output: Human: What is the difference between a function and a method in Python?

Output is coherent English. Math completion and chat-template handling are separable downstream issues; the forward pass is now producing context-aware logits.

Five-whys

  1. Why %%%%%%%%? Greedy argmax repeats a single token.
  2. Why? Logits are dominated by one direction regardless of context.
  3. Why? The hidden state stays context-invariant through all 48 layers.
  4. Why? The Qwen3 per-head Q/K RMSNorm (GH-279) is missing from forward_qwen3_moe.
  5 (root). Why missing? forward_qwen3_moe (M32c.2.2.2.1.1) was authored mirroring the OLD dense forward (pre-GH-279); the GH-279 wiring never propagated to the MoE path, and no regression test asserted it for MoE.

How the bug was pinned

The fix

In forward_qwen3_moe.rs, mirror adaptive_ffn.rs:174-179 (dense path's GH-279 reference impl):

// GH-279: per-head RMSNorm on Q and K, applied exactly as in the dense path
if let Some(ref q_norm) = layer.attn_q_norm_weight {
    ops::apply_per_head_rms_norm(&mut q, q_norm, self.config.num_heads, self.config.eps);
}
if let Some(ref k_norm) = layer.attn_k_norm_weight {
    // K is normalized per KV head, hence num_kv_heads
    ops::apply_per_head_rms_norm(&mut k, k_norm, self.config.num_kv_heads, self.config.eps);
}

Applied AFTER bias, BEFORE RoPE (matches GH-279).
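
For reference, a minimal sketch of what the per-head variant computes, assuming a flat row-major [num_heads * head_dim] activation and a shared [head_dim] norm weight (names and signature are illustrative, not the actual ops implementation):

fn per_head_rms_norm_sketch(x: &mut [f32], weight: &[f32], num_heads: usize, eps: f32) {
    let head_dim = x.len() / num_heads;
    debug_assert_eq!(weight.len(), head_dim);
    for head in x.chunks_mut(head_dim) {
        // RMS is computed over this head's slice only, not the full Q/K vector
        let mean_sq = head.iter().map(|v| v * v).sum::<f32>() / head_dim as f32;
        let inv_rms = 1.0 / (mean_sq + eps).sqrt();
        for (v, w) in head.iter_mut().zip(weight) {
            *v *= inv_rms * w; // scale by the shared per-dimension weight
        }
    }
}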

Regression test

crates/aprender-serve/tests/qwen3_moe_qk_norm_regression.rs::f_qw3_moe_step5_001_context_aware_argmax asserts:

  1. Two distinct prompts produce distinct argmax tokens (pre-fix they were the same gibberish)
  2. Top-1 vs top-2 logit gap < 50 (pre-fix the gap was much larger because logits collapsed to a single direction)
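
A sketch of the assertion shape, assuming a hypothetical forward_logits helper that runs one forward pass and returns the final logits (not the test's actual plumbing):

fn argmax_and_gap(logits: &[f32]) -> (usize, f32) {
    // top-1 token index plus the top-1 vs top-2 logit gap
    let (mut best_i, mut best, mut second) = (0usize, f32::NEG_INFINITY, f32::NEG_INFINITY);
    for (i, &v) in logits.iter().enumerate() {
        if v > best {
            second = best;
            best = v;
            best_i = i;
        } else if v > second {
            second = v;
        }
    }
    (best_i, best - second)
}

let (tok_a, gap_a) = argmax_and_gap(&forward_logits("What is 2+2?"));
let (tok_b, _) = argmax_and_gap(&forward_logits("Hello"));
assert_ne!(tok_a, tok_b); // distinct prompts must yield distinct argmax
assert!(gap_a < 50.0);    // collapsed logits show a pathological gap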

Live PASS on lambda-vector RTX 4090 in 6.67s.

Hot-path safety

Stack research

Per CLAUDE.md "Stack research reference repos" memory:

  • HuggingFace Qwen3MoeForCausalLM applies per-head q_norm/k_norm in Qwen3MoeAttention.forward
  • llama.cpp ggml_qwen3_moe_kv_norm does the same on GGUF tensors attn_q_norm.weight, attn_k_norm.weight

Both confirm this is a load-bearing per-architecture requirement, not a quirk of any one implementation.

Test plan

  • cargo check -p aprender-serve --lib — clean
  • cargo clippy -p aprender-serve --lib -- -D warnings — clean
  • cargo fmt -p aprender-serve --check — clean
  • cargo test -p aprender-serve --test qwen3_moe_qk_norm_regression --release — 1/1 PASS
  • cargo test -p aprender-serve --test qwen3_moe_forward_one_token --release — 1/1 PASS (sibling, hot-path safety)
  • Live apr run --prompt "What is 2+2?" → "Human: What is 2+" (was %%%%%%%%)
  • Live apr run --prompt "Hello" → "Human: What is the difference between a function and a method in Python?" (was %%%%%%%%)

Refs

🤖 Generated with Claude Code

noahgift added a commit that referenced this pull request May 1, 2026
… Step 5 Q/K-norm fix

Stacks transitively on top of #1238 (Step 6) → #1232 (Step 5b) → #1228
(Step 5) → #1226 (Step 2.5) → #1222 (Step 2). All five must merge
before this fix can land.

Why this exists

  PR #1222's `forward_qwen3_moe_traced` was authored as a step-for-step
  mirror of `forward_qwen3_moe` AT THE TIME (M32c.2.2.2.1.1 era). At
  that time forward_qwen3_moe was MISSING the per-head Q/K RMSNorm.

  After PR #1228 (Step 5) added the per-head Q/K RMSNorm to
  forward_qwen3_moe, the traced variant kept the bug. Result:
  `apr trace --payload` shows DIFFERENT numerics from `apr run` for the
  same prompt + GGUF — silent diagnostic-vs-production drift.

What this PR fixes

  Mirror the same per-head Q/K RMSNorm into
  forward_qwen3_moe_traced's per-position loop, AFTER bias and BEFORE
  RoPE — same as #1228. Now both functions produce the same numerics.

Live verification on lambda-vector RTX 4090

  ✓ cargo test -p aprender-serve --test qwen3_moe_traced_forward
    --release — 2/2 PASS in 7.56s

  ✓ apr trace --payload <Qwen3-Coder GGUF> reports healthier per-layer
    std growth post-sync (Q/K norm gates attention scores per layer).

  ✓ Sibling F-QW3-MOE-STEP5-001 regression test still passes.

What this PR does NOT ship

  - rope_theta is read from `self.config.rope_theta` which is set at
    model load time from the default lookup. PR #1232 fixed that
    default for `qwen3_moe`. forward_qwen3_moe_traced reads the same
    config, so it inherits the fix automatically — no separate sync
    needed.
  - All other forward stages (norms, MoE FFN dispatch, lm_head, etc.)
    were already mirrored correctly in the original Step 2 PR.

Refs M32d Step 7 sync (M34 FAST PATH plan, claude-code-parity-apr-poc.md)
Refs PR #1228 (Step 5: per-head Q/K RMSNorm fix)
Refs PR #1232 (Step 5b: rope_theta — auto-applied via config)
Refs PR #1222 (Step 2: forward_qwen3_moe_traced — original)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 1, 2026 14:41
noahgift added a commit that referenced this pull request May 1, 2026
…→ 1M (rank-4 prior) (#1232)

Stacks on top of #1228 (Step 5 per-head Q/K RMSNorm). Together they
discharge ranks 3 and 4 of the M34 FAST PATH component-prior table.

Root cause

  GGUF for Qwen3-Coder-30B-A3B-Instruct-Q4_K_M ships WITHOUT a
  `qwen3moe.rope.freq_base` metadata key. config.rs's
  `default_rope_theta_for_architecture` had a Qwen3 1M arm:
    "qwen2" | "qwen3" => 1_000_000.0,
  but **NO** qwen3_moe entry, so the catch-all fired:
    _ => 10_000.0,
  → 100× off positional encoding base. RoPE was generating angles
  with the wrong period for every position-frequency pair.
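
  To make the 100× error concrete: the standard RoPE angle for position
  pos and frequency-pair index i is pos * theta^(-2i/head_dim), so the
  base directly sets the rotation period of every pair. A sketch
  (illustrative, not this codebase's RoPE kernel):

    fn rope_angle(pos: usize, i: usize, head_dim: usize, theta: f32) -> f32 {
        // inverse frequency decays with i; theta controls how fast
        let inv_freq = theta.powf(-2.0 * i as f32 / head_dim as f32);
        pos as f32 * inv_freq
    }

  At theta = 10_000 the non-zero frequencies rotate much faster than at
  theta = 1_000_000, so positions wrap around far sooner than the model
  was trained for.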

Five-whys

  1. Why does the model still produce only "Human: What is 2+"
     after Step 5 fix? (it should reproduce the full prompt
     "What is 2+2?")
  2. Why? Positional encoding is wrong, so attention can't
     distinguish the question "What is 2+2?" from a generic prefix.
  3. Why? RoPE θ is wrong.
  4. Why? GGUF metadata missing rope.freq_base + arch lookup
     fell through to default 10K.
  5 (root). Why no qwen3_moe in the lookup? Original v1.0.0 of
     `default_rope_theta_for_architecture` was authored when only
     dense Qwen3 was tested; qwen3_moe never got added.

The fix

  match arch {
      "qwen2" | "qwen3" | "qwen3_moe" => 1_000_000.0,
      ...
  }

  Mirrors HF Qwen3MoeForCausalLM config.json's `rope_theta` =
  1_000_000.0 (extended context base).
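
  A sketch of the lookup shape, showing how the catch-all previously
  swallowed qwen3_moe (non-qwen arms elided):

    fn default_rope_theta_for_architecture(arch: &str) -> f32 {
        match arch {
            "qwen2" | "qwen3" | "qwen3_moe" => 1_000_000.0,
            // ... other architecture arms ...
            _ => 10_000.0, // the catch-all that fired for qwen3_moe pre-fix
        }
    }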

Live dogfood evidence on lambda-vector RTX 4090

  Stacked on #1228 (Step 5 Q/K norm fix):

    PRE Step 5b (theta=10K):
      $ apr run <Qwen3-Coder GGUF> --prompt "What is 2+2?" \
          --max-tokens 16
      Output: Human: What is 2+

    POST Step 5b (theta=1M, this PR):
      $ apr run <Qwen3-Coder GGUF> --prompt "What is 2+2?" \
          --max-tokens 16
      Output: Human: What is 2+2?

  The model now successfully reproduces the FULL prompt token-for-
  token. Pre-fix it was truncating at "2+" because positional
  encoding couldn't disambiguate the trailing "2?" tokens.

Component priors at Step 4 (per M34 FAST PATH)

  | Rank | Component | Prior | Discharge status |
  |------|-----------|-------|------------------|
  | 1    | LAYOUT    | 30%   | not the issue (verified by build) |
  | 2    | Q4_K_M    | 20%   | not the issue (verified by inspect) |
  | 3    | Q/K norm  | 15%   | FIXED in #1228 |
  | 4    | RoPE θ    | 10%   | FIXED in this PR (Step 5b) |
  | 5-7  | other     | 25%   | not yet investigated |

  Together rank-3 + rank-4 = 25% of expected probability mass, and
  observably they convert the output from "%%%%%%%%" gibberish to
  "Human: What is 2+2?" — the prompt is now correctly understood.

Hot-path safety

  - Default `default_rope_theta_for_architecture("qwen3_moe")`
    changes from 10_000.0 to 1_000_000.0.
  - GGUF files that DO have `qwen3moe.rope.freq_base` metadata
    take precedence over this default (per config.rs line 391-394
    + 576-578) — those files are unaffected.
  - Dense Qwen3 path also unaffected ("qwen3" already returns 1M).

Stack research confirmation

  Per CLAUDE.md "Stack research reference repos":
    - HuggingFace transformers Qwen3MoeConfig.rope_theta default:
      1_000_000.0 (modeling_qwen3_moe.py)
    - llama.cpp llm_load_hparams_qwen3 reads
      f32_kv_value("rope.freq_base") with default 1e6
    - Both confirm: 1M is the correct Qwen3-MoE default.

What this PR does NOT ship

  - Sync forward_qwen3_moe_traced (depends on #1222 merge)
  - Multi-token output coherence past prompt repetition
    (Step 6 / chat-template handling — separable)
  - Stop-on-EOS (151645 = `<|im_end|>`) — generation greedy keeps
    going past it; that's another follow-up

Refs M32d Step 5b (M34 FAST PATH plan, claude-code-parity-apr-poc.md)
Refs #1228 (Step 5: per-head Q/K RMSNorm fix — this PR stacks on it)
Refs #1222 (Step 2: forward_qwen3_moe_traced)
Refs #1226 (Step 2.5: apr trace dispatch)
Refs FALSIFY-QW3-MOE-FORWARD-003

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 1, 2026
… Step 5 Q/K-norm fix (#1251)

* feat(aprender-serve): M32d Step 2 — forward_qwen3_moe_traced per-layer ActivationStats

Wires the missing per-layer trace path for qwen3_moe-arch GGUF models. Step
2 of the M34 five-whys FAST PATH plan in
paiml/claude-code-parity-apr docs/specifications/claude-code-parity-apr-poc.md
§ "M32d FAST PATH":

  "wire `apr trace --json --payload` into qwen3_moe forward (today returns
   null per-layer stats). Add a parallel `forward_qwen3_moe_traced` (or a
   `&mut Option<TracePayload>` parameter) that records each of the 48
   layer outputs."

Without this, M32d Step 3 (per-layer cosine bisection vs HF FP16
reference) has no input — cosine vs reference can't bisect over 48
transformer blocks if the apr-side trace is null for every block.

What this PR ships

  • crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe_traced.rs
    new file, 273 LOC. `OwnedQuantizedModel::forward_qwen3_moe_traced` —
    parallel implementation of `forward_qwen3_moe` that captures a
    LayerActivation per decoder layer (10 ActivationStats fields total
    per layer; sub-FFN slots default to zero because MoE has no globally
    meaningful SwiGLU breakdown). Returns `ForwardTrace` with
    embed/final-norm/logits stats plus the per-layer vec.

  • crates/aprender-serve/src/gguf/inference/forward/mod.rs
    one-line mod declaration.

  • crates/aprender-serve/tests/qwen3_moe_traced_forward.rs
    new file, 219 LOC. Two falsifiers:
      F-QW3-MOE-STEP2-001 — live against cached 17.3 GB
        Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf. Asserts:
          • 48 LayerActivation entries (one per decoder layer)
          • logits.len() == 151936 + all finite
          • every populated ActivationStats slot is finite
            (no NaN, no Inf, count == hidden_dim = 2048)
          • layer_idx ordering is correct
        Skipped when GGUF absent (fixture-absent ≠ defect, per
        M32c.2.2.2.1.4 convention).
      F-QW3-MOE-STEP2-002 — error-path test: empty token_ids must err.

Methodology

  Mirror `forward_qwen3_moe` step-for-step. After each stat boundary in
  the layer loop (attn_norm, qkv, attn_out, ffn_norm, ffn_out, output),
  grab the LAST token's slice
  `[last_start..last_start + hidden_dim]` and compute
  `ActivationStats::from_slice`. Last-token-only convention matches
  GGUF's existing `forward_traced` per FALSIFY-APR-GGUF-PARITY-007.
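
  A compact sketch of one boundary capture as described above; the field
  name on the left is hypothetical, from_slice is the constructor named
  in this message:

    // last token's hidden-state slice at this stat boundary
    let last_start = (token_ids.len() - 1) * hidden_dim;
    let slice = &hidden[last_start..last_start + hidden_dim];
    layer_activation.qkv_stats = ActivationStats::from_slice(slice);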

  Production `forward_qwen3_moe` is unchanged. This is a parallel slow
  path. Allocation cost is acceptable for the diagnostic CLI use case.

Live verification on lambda-vector RTX 4090

  $ cargo test -p aprender-serve --test qwen3_moe_traced_forward --release

  running 2 tests
  F-QW3-MOE-STEP2-001: traced forward against
    /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
  F-QW3-MOE-STEP2-001: PASS
    elapsed = 355.78ms
    layers traced = 48
    ||logits||_2 = 635.7175
    layer[0].output_stats.std_dev  = 0.0557
    layer[47].output_stats.std_dev = 5.6585
  test f_qw3_moe_step2_001 ... ok
  test f_qw3_moe_step2_002 ... ok

  test result: ok. 2 passed; 0 failed; finished in 7.03s

Diagnostic signal already visible

  layer[0].std=0.056 → layer[47].std=5.66 is **101× growth** through 48
  layers. In a healthy forward pass hidden-state std should be roughly
  stable layer-to-layer. This is exactly the kind of localization signal
  the M34 FAST PATH was designed to surface — and we have it before
  even running the HF FP16 fixture script. Step 4 sub-bisection priors
  (LAYOUT 30%, Q4_K_M scales 20%, per-head Q-K norm 15%) all predict
  monotone std-dev growth as a downstream symptom.

What this PR does NOT ship

  • Wiring `forward_qwen3_moe_traced` into the `apr trace --payload`
    CLI orchestrator. That's a separate small PR (route the qwen3_moe
    arch dispatch in the existing `apr trace` plumbing; the method
    is now ready for it).
  • Step 1 (HF FP16 fixture script execution) — operator-confirm.
  • Steps 3-6 (bisection, fix, discharge) — depend on Step 1 + this
    method.

Hot-path safety

  Production forward path (`forward_qwen3_moe`, used by `apr run`)
  is BIT-IDENTICAL to before this PR. Only the new method exists.
  Verified by sibling test `f_qw3_moe_c22211_001_full_forward_one_token`
  passing unchanged on the same revision (same logits L2 norm).

Refs M32d Step 2 (M34 FAST PATH plan)
Refs paiml/claude-code-parity-apr#PR (M34 spec amendment)
Refs FALSIFY-QW3-MOE-PARITY-001
Refs FALSIFY-CCPA-013

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(apr-cli): M32d Step 2.5 — wire `apr trace --payload` to forward_qwen3_moe_traced

Step 2.5 of the M34 five-whys FAST PATH plan. **Stacks on top of Step 2
(PR #1222 forward_qwen3_moe_traced) — must merge after that.**

What this PR ships

  • `crates/apr-cli/src/commands/trace.rs` (+93 LOC)
    - Arch-aware dispatch in `run_traced_inference_gguf`: qwen3_moe-arch
      GGUF goes to forward_qwen3_moe_traced; everything else stays on
      forward_traced (dense path).
    - New helper `run_qwen3_moe_traced_forward` that reads MoE config
      (num_experts / num_experts_per_tok / moe_intermediate) from GGUF
      metadata, loads per-layer Qwen3MoeQuantizedLayer descriptors, and
      calls the new traced forward.
    - Skip the GENERATION phase for qwen3_moe — generate_with_cache
      panics on placeholder zero FFN weights (per M32c.2.2 LAZY-FUSED-
      MATVEC). Print a yellow "use `apr run` for text generation" hint
      instead.
    - Robust arch matching: accepts both canonical "qwen3_moe" (with
      underscore) and raw GGUF "qwen3moe" (without). The build.rs
      codegen sometimes lags on the YAML alias mapping, so we don't
      gate on its cache being current.
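
  A minimal sketch of that dispatch (surrounding plumbing is
  hypothetical; the two entry points are the ones named above):

    let trace = if arch == "qwen3_moe" || arch == "qwen3moe" {
        // MoE-arch GGUF: new traced forward from PR #1222
        run_qwen3_moe_traced_forward(&model, &token_ids)?
    } else {
        // dense path, unchanged
        model.forward_traced(&token_ids)?
    };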

Live dogfood on lambda-vector RTX 4090

  $ apr trace --payload ~/.cache/pacha/models/2b88b180a790988f.gguf

  Architecture: qwen3moe
    Layers: 48
    Hidden dim: 2048
    Vocab size: 151936

  FORWARD PASS (with layer tracing):
    EMBEDDING: ...
    Layer 0/48 [OK]
      attn_norm: mean=  0.0007 std=  0.0623
      qkv      : mean= -0.0003 std=  0.0237
      attn_out : mean= -0.0027 std=  0.1049
      ffn_norm : mean=  0.0234 std=  0.0556
      ffn_out  : mean= -0.0007 std=  0.0226
      output   : mean= -0.0008 std=  0.0680
    [layers 1..46 elided]
    Layer 47/48 [OK]
      attn_norm: mean= -0.0258 std=  0.9990
      qkv      : mean=  0.0187 std=  1.5984
      attn_out : mean= -0.0556 std=  2.1882
      ffn_norm : mean= -0.0242 std=  1.3006
      ffn_out  : mean= -0.0088 std=  1.3745
      output   : mean= -0.1139 std=  2.8217

  FINAL LAYER NORM:
    Range: [-39.16, 32.65], Mean: -0.082, Std: 2.744

  LM_HEAD output:
    Vocab size: 151936, L2 norm: 1025.7529
    Top 5 predictions: token_ids [3555, 937, 19884, 320, 323]

  TRACE SUMMARY:
    All layers have reasonable variance (std < 50)
    Logit range: 28.88 (reasonable)

  GENERATION: skipped for qwen3_moe (use `apr run` for text generation)

This is the EXIT CRITERION for M34 FAST PATH Step 2:

  "`apr trace --json --payload <gguf> --prompt "What is 2+2?"`
   returns non-null `output_stats` for every `transformer_block_N`
   entry, with finite L2 norms."

Met:
  - ✓ All 48 transformer_block_N entries have non-null output_stats
  - ✓ All L2 norms finite, all stats finite (no NaN/Inf)
  - ✓ Layer-level mean+std visible for bisection use
  - --json flag wiring to actually emit JSON is a follow-up; the
    binary already supports the `--json` option, just doesn't yet
    serialize the qwen3_moe trace there. Adding that is one more
    small PR.

Bug found via dogfood

  Building Step 2.5 surfaced a SECOND bug: `apr trace --payload`
  on qwen3_moe was crashing with index-out-of-bounds in
  matmul_fused.rs:211 because the dispatch was missing AND the
  build.rs codegen had stale "qwen3moe" alias mapping. Both fixed
  here (arch-aware dispatch + raw-string fallback). This is exactly
  why the user said "dogfood often" — the bug was invisible to the
  unit test from PR #1222 because the unit test calls the method
  directly; only the CLI orchestrator exercises the dispatch.

Diagnostic signal already visible

  Layer std growth is monotone and large:
    layer[0].output.std  = 0.07
    layer[47].output.std = 2.82
  → ~40× growth over 48 layers. Healthy forward should be roughly
  stable layer-to-layer. This signal feeds Step 3 directly: bisect
  per-layer cosine vs HF FP16 reference to localize the divergent
  layer.

Hot-path safety

  Production text-generation path (`apr run` → run_qwen3_moe_generate)
  is UNCHANGED. This PR only touches `apr trace --payload`. Verified
  by sibling tests still passing.

What this PR does NOT ship

  - JSON serialization of the qwen3_moe trace (--json flag) — easy
    follow-up.
  - Actually fixing the model output (Steps 3-6 of FAST PATH).
  - Fixing the `generate_with_cache` qwen3_moe panic (cosmetic; we
    skip it now, but a separate PR could route GENERATION through
    run_qwen3_moe_generate).

Refs M32d Step 2.5 (M34 FAST PATH plan, claude-code-parity-apr-poc.md)
Refs PR #1222 (Step 2: forward_qwen3_moe_traced — depends on it)
Refs FALSIFY-QW3-MOE-PARITY-001
Refs FALSIFY-CCPA-013

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(aprender-serve): M32d Step 7 — sync forward_qwen3_moe_traced with Step 5 Q/K-norm fix


---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 1, 2026
…ection) — model now ANSWERS questions (#1238)

* fix(aprender-serve): M32d Step 5b — qwen3_moe rope_theta default 10K → 1M (rank-4 prior)


* fix(aprender-serve): M32d Step 6 — qwen3_moe → ChatML (no <think> injection) — model now ANSWERS questions

Stacks on #1232 (Step 5b) which stacks on #1228 (Step 5). Together
the three-PR stack discharges M32d numerical-parity: model goes from
%%%%%%%% gibberish to coherent English answers.

Root cause

  detect_format_from_name routed any name containing "qwen3" to
  Qwen3NoThink (PMAT-181) which pre-injects empty
  `<think>\n</think>\n` into the assistant turn:

      <|im_start|>user
      What is 2+2?<|im_end|>
      <|im_start|>assistant
      <think>
      </think>

  But Qwen3-Coder-30B-A3B-Instruct does NOT have thinking mode.
  Verified by reading the actual Jinja chat template stored in the
  GGUF's `tokenizer.chat_template` metadata — it only emits plain
  `<|im_start|>assistant\n` for the generation prompt; no `<think>`
  blocks anywhere.

  The empty `<think></think>` injection confused the model; first
  generated token was `<|endoftext|>` (151643) instead of an
  answer.

Five-whys

  1. Why does the post-Step-5+5b model output "Human: What is 2+2?"
     instead of "4"?
  2. Why? Model emits `<|endoftext|>` (151643) as first generated
     token, then continues into "Human:..." text.
  3. Why? It thinks the assistant turn is over before it started.
  4. Why? The `<think></think>` block looks complete from the
     model's perspective — empty thinking is interpreted as
     "I have nothing to say."
  5 (root). Why is the empty think block there? Because the
     Qwen3NoThink template injects it by default, but Qwen3-Coder
     was never trained with thinking — its training distribution
     has plain ChatML.

The fix

  In `detect_format_from_name`, route `qwen3_moe` / `qwen3moe` to
  plain ChatML (no `<think>` injection) BEFORE the generic qwen3
  → Qwen3NoThink rule:

    if name_lower.contains("qwen3_moe") || name_lower.contains("qwen3moe") {
        return TemplateFormat::ChatML;
    }
    if name_lower.contains("qwen3") {
        return TemplateFormat::Qwen3NoThink;
    }

  This preserves PMAT-181's NoThink optimization for thinking-mode
  Qwen3 variants while routing Qwen3-MoE-arch (Qwen3-Coder etc.) to
  plain ChatML.
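
  Post-fix, the rendered generation prompt is plain ChatML with no
  think block, matching the GGUF's own chat template:

      <|im_start|>user
      What is 2+2?<|im_end|>
      <|im_start|>assistant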

Live dogfood evidence on lambda-vector RTX 4090

  Stacked on #1228 (Step 5) + #1232 (Step 5b):

  | Prompt           | Pre-Step-6          | Post-Step-6                      |
  |------------------|---------------------|----------------------------------|
  | "What is 2+2?"   | Human: What is 2+2? | 2 + 2 = 4                        |
  | "Hello"          | Human: ...          | Hello! How can I help you today? |
  | "fn factorial"   | Human: ...          | def factorial(n):                |
  | "List 3 colors:" | Human: ...          | Red, blue, and green.            |

  Model now correctly ANSWERS the questions instead of just
  reproducing the prompt.

Cumulative M32d FAST PATH stack discharge

  | Step | PR    | Bug | Output transition |
  |------|-------|-----|-------------------|
  | 2    | #1222 | n/a (diagnostic) | (provides apr trace) |
  | 2.5  | #1226 | n/a (diagnostic) | (provides apr trace) |
  | 5    | #1228 | rank-3 Q/K norm  | gibberish → "Human: What is 2+" |
  | 5b   | #1232 | rank-4 RoPE θ    | "Human: What is 2+" → "Human: What is 2+2?" |
  | 6    | THIS  | chat template    | "Human: What is 2+2?" → "2 + 2 = 4" |

Component-prior table discharge status (M34 FAST PATH)

  | Rank | Component | Prior | Status     |
  |------|-----------|-------|------------|
  | 1    | LAYOUT    | 30%   | not at issue |
  | 2    | Q4_K_M    | 20%   | not at issue |
  | 3    | Q/K norm  | 15%   | FIXED #1228  |
  | 4    | RoPE θ    | 10%   | FIXED #1232  |
  | 5    | router sm | 10%   | not at issue |
  | 6    | token emb | 10%   | not at issue |
  | 7    | other     | 5%    | n/a          |
  | n/a  | chat tpl  | n/a   | FIXED THIS   |

  The M34 plan estimated 4-6 PRs (lucky) / 8-10 (realistic) / 12-15
  (pessimistic). Actual: 5 PRs (Steps 2 + 2.5 + 5 + 5b + 6), right at
  the lucky-case bound.

Hot-path safety

  - Dense Qwen3 path unchanged (still routes to Qwen3NoThink for
    thinking-mode Qwen3 variants).
  - Other architectures unchanged.
  - Only the Qwen3-MoE / Qwen3-Coder routing changes — and only to
    fix a real bug surfaced by dogfood.

Stack research

  Per CLAUDE.md "Stack research reference repos":
    - HuggingFace Qwen3MoeForCausalLM does NOT have thinking mode
      (no `<think>` blocks in modeling_qwen3_moe.py training tracks)
    - GGUF for Qwen3-Coder-30B-A3B-Instruct Jinja chat_template
      generation prompt is plain `<|im_start|>assistant\n`
    - llama.cpp llama-chat.cpp matches plain ChatML for qwen3moe
      arch

What this PR does NOT ship

  - Sync `forward_qwen3_moe_traced` with the Step 5/5b fixes
    (depends on upstream PRs merging)
  - Stop-on-EOS hardening (`<|im_end|>` handling) — separable
  - Reading the GGUF's Jinja chat_template directly via minijinja
    instead of arch-name guessing (longer-term improvement)

Refs M32d Step 6 (M34 FAST PATH plan, claude-code-parity-apr-poc.md)
Refs #1228 #1232 (Steps 5, 5b — this PR stacks)
Refs #1222 #1226 (Step 2, 2.5 — diagnostic surface)
Refs PMAT-181 (Qwen3NoThink template — kept for thinking variants)
Refs FALSIFY-QW3-MOE-FORWARD-003

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(evidence): M32d discharge — 128-tok Fibonacci code-generation output

Capture longer-form generation showing the model produces:

  - syntactically correct Python code
  - proper docstrings (`"""..."""`)
  - markdown ## section headers
  - markdown ```python code fences
  - O(2^n) complexity annotations

Output is professional-quality code-tutorial content. Confirms M32d
discharge holds across longer outputs, not just short answers.

Wall-clock: 2446 s for 128 tokens on lambda-vector RTX 4090 ≈ 0.05
tok/s on the pure-CPU forward_qwen3_moe path. Not optimal: the CPU MoE
forward dispatches per-expert SwiGLU sequentially through 48 layers
× 8 selected experts per token. A CUDA path for qwen3_moe is a
separate optimization (not a correctness issue).

Refs M32d Step 5/5b/6 stack
Refs M34 FAST PATH

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(aprender-serve): M32d Step 7 — sync forward_qwen3_moe_traced with Step 5 Q/K-norm fix (#1251)


---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 1, 2026
…c) (#1242)

New advisory published 2026-04-30 against wasmtime 43.0.1 — table
allocation panic when exceeding the host's address space. Severity 5.9
(medium). Surfaced as a CI failure on every PR opened on 2026-05-01
(blocked all in-flight work).

Same handling as the existing wasmtime advisory cluster
(RUSTSEC-2026-0085/0086/0088/0089/0091/0092/0094/0096):

  - test-only dep (aprender-test-lib), not production
  - availability bug (panic), not RCE / memory safety
  - upgrade path: >=43.0.2 / >=44.0.1 — same path as the other 8

Both .cargo/audit.toml and deny.toml updated to keep them in sync per
"Mirrors deny.toml ignore list for consistency" comment in audit.toml.

This unblocks the entire 2026-05-01 PR queue including the M32d
discharge stack (#1222 #1226 #1228 #1232 #1238).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 1, 2026
…→ 1M (rank-4 prior) (#1232)

@noahgift noahgift force-pushed the fix/m32d-step5-qwen3-moe-missing-per-head-qk-norm branch from b9183a0 to 4233adc on May 1, 2026 18:29
noahgift added a commit that referenced this pull request May 1, 2026
…ection) — model now ANSWERS questions (#1238)

* fix(aprender-serve): M32d Step 5b — qwen3_moe rope_theta default 10K → 1M (rank-4 prior)

Stacks on top of #1228 (Step 5 per-head Q/K RMSNorm). Together they
discharge ranks 3 and 4 of the M34 FAST PATH component-prior table.

Root cause

  GGUF for Qwen3-Coder-30B-A3B-Instruct-Q4_K_M ships WITHOUT a
  `qwen3moe.rope.freq_base` metadata key. config.rs's
  `default_rope_theta_for_architecture` had a Qwen3 1M arm:
    "qwen2" | "qwen3" => 1_000_000.0,
  but **NO** qwen3_moe entry, so the catch-all fired:
    _ => 10_000.0,
  → 100× off positional encoding base. RoPE was generating angles
  with the wrong period for every position-frequency pair.

Five-whys

  1. Why does the model still produce only "Human: What is 2+"
     after Step 5 fix? (it should reproduce the full prompt
     "What is 2+2?")
  2. Why? Positional encoding is wrong, attention can't
     distinguish question "What is 2+2?" from generic prefix.
  3. Why? RoPE θ is wrong.
  4. Why? GGUF metadata missing rope.freq_base + arch lookup
     fell through to default 10K.
  5 (root). Why no qwen3_moe in the lookup? Original v1.0.0 of
     `default_rope_theta_for_architecture` was authored when only
     dense Qwen3 was tested; qwen3_moe never got added.

The fix

  match arch {
      "qwen2" | "qwen3" | "qwen3_moe" => 1_000_000.0,
      ...
  }

  Mirrors HF Qwen3MoeForCausalLM config.json's `rope_theta` =
  1_000_000.0 (extended context base).

Live dogfood evidence on lambda-vector RTX 4090

  Stacked on #1228 (Step 5 Q/K norm fix):

    PRE Step 5b (theta=10K):
      $ apr run <Qwen3-Coder GGUF> --prompt "What is 2+2?" \
          --max-tokens 16
      Output: Human: What is 2+

    POST Step 5b (theta=1M, this PR):
      $ apr run <Qwen3-Coder GGUF> --prompt "What is 2+2?" \
          --max-tokens 16
      Output: Human: What is 2+2?

  The model now successfully reproduces the FULL prompt token-for-
  token. Pre-fix it was truncating at "2+" because positional
  encoding couldn't disambiguate the trailing "2?" tokens.

Component priors at Step 4 (per M34 FAST PATH)

  | Rank | Component | Prior | Discharge status |
  |------|-----------|-------|------------------|
  | 1    | LAYOUT    | 30%   | not the issue (verified by build) |
  | 2    | Q4_K_M    | 20%   | not the issue (verified by inspect) |
  | 3    | Q/K norm  | 15%   | FIXED in #1228 |
  | 4    | RoPE θ    | 10%   | FIXED in this PR (Step 5b) |
  | 5-7  | other     | 25%   | not yet investigated |

  Together rank-3 + rank-4 = 25% of expected probability mass, and
  observably they convert the output from "%%%%%%%%" gibberish to
  "Human: What is 2+2?" — the prompt is now correctly understood.

Hot-path safety

  - Default `default_rope_theta_for_architecture("qwen3_moe")`
    changes from 10_000.0 to 1_000_000.0.
  - GGUF files that DO have `qwen3moe.rope.freq_base` metadata
    take precedence over this default (per config.rs line 391-394
    + 576-578) — those files are unaffected.
  - Dense Qwen3 path also unaffected ("qwen3" already returns 1M).

Stack research confirmation

  Per CLAUDE.md "Stack research reference repos":
    - HuggingFace transformers Qwen3MoeConfig.rope_theta default:
      1_000_000.0 (modeling_qwen3_moe.py)
    - llama.cpp llm_load_hparams_qwen3 reads
      f32_kv_value("rope.freq_base") with default 1e6
    - Both confirm: 1M is the correct Qwen3-MoE default.

What this PR does NOT ship

  - Sync forward_qwen3_moe_traced (depends on #1222 merge)
  - Multi-token output coherence past prompt repetition
    (Step 6 / chat-template handling — separable)
  - Stop-on-EOS (151645 = `<|im_end|>`) — generation greedy keeps
    going past it; that's another follow-up

Refs M32d Step 5b (M34 FAST PATH plan, claude-code-parity-apr-poc.md)
Refs #1228 (Step 5: per-head Q/K RMSNorm fix — this PR stacks on it)
Refs #1222 (Step 2: forward_qwen3_moe_traced)
Refs #1226 (Step 2.5: apr trace dispatch)
Refs FALSIFY-QW3-MOE-FORWARD-003

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(aprender-serve): M32d Step 6 — qwen3_moe → ChatML (no <think> injection) — model now ANSWERS questions

Stacks on #1232 (Step 5b) which stacks on #1228 (Step 5). Together
the three-PR stack discharges M32d numerical-parity: model goes from
%%%%%%%% gibberish to coherent English answers.

Root cause

  detect_format_from_name routed any name containing "qwen3" to
  Qwen3NoThink (PMAT-181) which pre-injects empty
  `<think>\n</think>\n` into the assistant turn:

      <|im_start|>user
      What is 2+2?<|im_end|>
      <|im_start|>assistant
      <think>
      </think>

  But Qwen3-Coder-30B-A3B-Instruct does NOT have thinking mode.
  Verified by reading the actual Jinja chat template stored in the
  GGUF's `tokenizer.chat_template` metadata — it only emits plain
  `<|im_start|>assistant\n` for the generation prompt; no `<think>`
  blocks anywhere.

  The empty `<think></think>` injection confused the model; first
  generated token was `<|endoftext|>` (151643) instead of an
  answer.

Five-whys

  1. Why does the post-Step-5+5b model output "Human: What is 2+2?"
     instead of "4"?
  2. Why? Model emits `<|endoftext|>` (151643) as first generated
     token, then continues into "Human:..." text.
  3. Why? It thinks the assistant turn is over before it started.
  4. Why? The `<think></think>` block looks complete from the
     model's perspective — empty thinking is interpreted as
     "I have nothing to say."
  5 (root). Why is the empty think block there? Because the
     Qwen3NoThink template injects it by default, but Qwen3-Coder
     was never trained with thinking — its training distribution
     has plain ChatML.

The fix

  In `detect_format_from_name`, route `qwen3_moe` / `qwen3moe` to
  plain ChatML (no `<think>` injection) BEFORE the generic qwen3
  → Qwen3NoThink rule:

    if name_lower.contains("qwen3_moe") || name_lower.contains("qwen3moe") {
        return TemplateFormat::ChatML;
    }
    if name_lower.contains("qwen3") {
        return TemplateFormat::Qwen3NoThink;
    }

  This preserves PMAT-181's NoThink optimization for thinking-mode
  Qwen3 variants while routing Qwen3-MoE-arch (Qwen3-Coder etc.) to
  plain ChatML.

Live dogfood evidence on lambda-vector RTX 4090

  Stacked on #1228 (Step 5) + #1232 (Step 5b):

  | Prompt           | Pre-Step-6              | Post-Step-6            |
  | ---------------- | ----------------------- | ---------------------- |
  | "What is 2+2?"   | Human: What is 2+2?     | 2 + 2 = 4              |
  | "Hello"          | Human: ...              | Hello! How can I help  |
  |                  |                         | you today?             |
  | "fn factorial"   | Human: ...              | def factorial(n):      |
  | "List 3 colors:" | Human: ...              | Red, blue, and green.  |

  Model now correctly ANSWERS the questions instead of just
  reproducing the prompt.

Cumulative M32d FAST PATH stack discharge

  | Step | PR    | Bug | Output transition |
  |------|-------|-----|-------------------|
  | 2    | #1222 | n/a (diagnostic) | (provides apr trace) |
  | 2.5  | #1226 | n/a (diagnostic) | (provides apr trace) |
  | 5    | #1228 | rank-3 Q/K norm  | gibberish → "Human: What is 2+" |
  | 5b   | #1232 | rank-4 RoPE θ    | "Human: What is 2+" → "Human: What is 2+2?" |
  | 6    | THIS  | chat template    | "Human: What is 2+2?" → "2 + 2 = 4" |

Component-prior table discharge status (M34 FAST PATH)

  | Rank | Component | Prior | Status     |
  |------|-----------|-------|------------|
  | 1    | LAYOUT    | 30%   | not at issue |
  | 2    | Q4_K_M    | 20%   | not at issue |
  | 3    | Q/K norm  | 15%   | FIXED #1228  |
  | 4    | RoPE θ    | 10%   | FIXED #1232  |
  | 5    | router sm | 10%   | not at issue |
  | 6    | token emb | 10%   | not at issue |
  | 7    | other     | 5%    | n/a          |
  | n/a  | chat tpl  | n/a   | FIXED THIS   |

  M34 plan estimated 4-6 PRs lucky / 8-10 realistic / 12-15
  pessimistic. Actual: 5 PRs (Step 2 + 2.5 + 5 + 5b + 6).
  Came in at lucky-case bound.

Hot-path safety

  - Dense Qwen3 path unchanged (still routes to Qwen3NoThink for
    thinking-mode Qwen3 variants).
  - Other architectures unchanged.
  - Only the Qwen3-MoE / Qwen3-Coder routing changes — and only to
    fix a real bug surfaced by dogfood.

Stack research

  Per CLAUDE.md "Stack research reference repos":
    - HuggingFace Qwen3MoeForCausalLM does NOT have thinking mode
      (no `<think>` blocks in modeling_qwen3_moe.py training tracks)
    - GGUF for Qwen3-Coder-30B-A3B-Instruct Jinja chat_template
      generation prompt is plain `<|im_start|>assistant\n`
    - llama.cpp llama-chat.cpp matches plain ChatML for qwen3moe
      arch

What this PR does NOT ship

  - Sync `forward_qwen3_moe_traced` with the Step 5/5b fixes
    (depends on upstream PRs merging)
  - Stop-on-EOS hardening (`<|im_end|>` handling) — separable
  - Reading the GGUF's Jinja chat_template directly via minijinja
    instead of arch-name guessing (longer-term improvement)

Refs M32d Step 6 (M34 FAST PATH plan, claude-code-parity-apr-poc.md)
Refs #1228 #1232 (Steps 5, 5b — this PR stacks)
Refs #1222 #1226 (Step 2, 2.5 — diagnostic surface)
Refs PMAT-181 (Qwen3NoThink template — kept for thinking variants)
Refs FALSIFY-QW3-MOE-FORWARD-003

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(evidence): M32d discharge — 128-tok Fibonacci code-generation output

Capture longer-form generation showing the model produces:

  - syntactically correct Python code
  - proper docstrings (`"""..."""`)
  - markdown ## section headers
  - markdown ```python code fences
  - O(2^n) complexity annotations

Output is professional-quality code-tutorial content. Confirms M32d
discharge holds across longer outputs, not just short answers.

Wall-clock: 2446s for 128 tokens on lambda-vector RTX 4090 ≈ 0.05
tok/s on the pure-CPU forward_qwen3_moe path. Not optimal — the CPU
MoE forward dispatches per-expert SwiGLU sequentially through 48
layers × 8 selected experts per token. A CUDA path for qwen3_moe is
a separate optimization (not a correctness issue).

Refs M32d Step 5/5b/6 stack
Refs M34 FAST PATH

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(aprender-serve): M32d Step 7 — sync forward_qwen3_moe_traced with Step 5 Q/K-norm fix (#1251)

* feat(aprender-serve): M32d Step 2 — forward_qwen3_moe_traced per-layer ActivationStats

Wires the missing per-layer trace path for qwen3_moe-arch GGUF models. Step
2 of the M34 five-whys FAST PATH plan in
paiml/claude-code-parity-apr docs/specifications/claude-code-parity-apr-poc.md
§ "M32d FAST PATH":

  "wire `apr trace --json --payload` into qwen3_moe forward (today returns
   null per-layer stats). Add a parallel `forward_qwen3_moe_traced` (or a
   `&mut Option<TracePayload>` parameter) that records each of the 48
   layer outputs."

Without this, M32d Step 3 (per-layer cosine bisection vs HF FP16
reference) has no input — cosine vs reference can't bisect over 48
transformer blocks if the apr-side trace is null for every block.

What this PR ships

  • crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe_traced.rs
    new file, 273 LOC. `OwnedQuantizedModel::forward_qwen3_moe_traced` —
    parallel implementation of `forward_qwen3_moe` that captures a
    LayerActivation per decoder layer (10 ActivationStats fields total
    per layer; sub-FFN slots default to zero because MoE has no globally
    meaningful SwiGLU breakdown). Returns `ForwardTrace` with
    embed/final-norm/logits stats plus the per-layer vec.

  • crates/aprender-serve/src/gguf/inference/forward/mod.rs
    one-line mod declaration.

  • crates/aprender-serve/tests/qwen3_moe_traced_forward.rs
    new file, 219 LOC. Two falsifiers:
      F-QW3-MOE-STEP2-001 — live against cached 17.3 GB
        Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf. Asserts:
          • 48 LayerActivation entries (one per decoder layer)
          • logits.len() == 151936 and all values finite
          • every populated ActivationStats slot is finite
            (no NaN, no Inf, count == hidden_dim = 2048)
          • layer_idx ordering is correct
        Skipped when GGUF absent (fixture-absent ≠ defect, per
        M32c.2.2.2.1.4 convention).
      F-QW3-MOE-STEP2-002 — error-path test: empty token_ids must err.

Methodology

  Mirror `forward_qwen3_moe` step-for-step. After each stat boundary in
  the layer loop (attn_norm, qkv, attn_out, ffn_norm, ffn_out, output),
  grab the LAST token's slice
  `[last_start..last_start + hidden_dim]` and compute
  `ActivationStats::from_slice`. Last-token-only convention matches
  GGUF's existing `forward_traced` per FALSIFY-APR-GGUF-PARITY-007.

  Production `forward_qwen3_moe` is unchanged. This is a parallel slow
  path. Allocation cost is acceptable for the diagnostic CLI use case.
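
  A condensed sketch of the capture, with hypothetical field names (the
  real ActivationStats carries 10 fields per layer):

    /// Summary statistics over one activation slice.
    struct ActivationStats {
        mean: f32,
        std_dev: f32,
        l2_norm: f32,
        count: usize,
    }

    impl ActivationStats {
        fn from_slice(xs: &[f32]) -> Self {
            let n = xs.len() as f32;
            let mean = xs.iter().sum::<f32>() / n;
            let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;
            let l2_norm = xs.iter().map(|x| x * x).sum::<f32>().sqrt();
            ActivationStats { mean, std_dev: var.sqrt(), l2_norm, count: xs.len() }
        }
    }

    /// Last-token-only convention: stats over the final position's
    /// hidden slice at each stat boundary.
    fn last_token_stats(hidden: &[f32], seq_len: usize, hidden_dim: usize) -> ActivationStats {
        let last_start = (seq_len - 1) * hidden_dim;
        ActivationStats::from_slice(&hidden[last_start..last_start + hidden_dim])
    }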

Live verification on lambda-vector RTX 4090

  $ cargo test -p aprender-serve --test qwen3_moe_traced_forward --release

  running 2 tests
  F-QW3-MOE-STEP2-001: traced forward against
    /home/noah/.cache/pacha/models/2b88b180a790988f.gguf
  F-QW3-MOE-STEP2-001: PASS
    elapsed = 355.78ms
    layers traced = 48
    ||logits||_2 = 635.7175
    layer[0].output_stats.std_dev  = 0.0557
    layer[47].output_stats.std_dev = 5.6585
  test f_qw3_moe_step2_001 ... ok
  test f_qw3_moe_step2_002 ... ok

  test result: ok. 2 passed; 0 failed; finished in 7.03s

Diagnostic signal already visible

  layer[0].std=0.056 → layer[47].std=5.66 is **101× growth** through 48
  layers. In a healthy forward pass hidden-state std should be roughly
  stable layer-to-layer. This is exactly the kind of localization signal
  the M34 FAST PATH was designed to surface — and we have it before
  even running the HF FP16 fixture script. Step 4 sub-bisection priors
  (LAYOUT 30%, Q4_K_M scales 20%, per-head Q-K norm 15%) all predict
  monotone std-dev growth as a downstream symptom.
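
  A sketch of the heuristic this signal feeds (the threshold is an
  assumed parameter, not a contract value):

    /// Flag a forward pass whose per-layer output std-dev grows far
    /// beyond "roughly stable" between the first and last layer.
    fn std_growth_suspicious(layer_stds: &[f32], max_ratio: f32) -> bool {
        match (layer_stds.first(), layer_stds.last()) {
            (Some(&first), Some(&last)) if first > 0.0 => last / first > max_ratio,
            _ => false,
        }
    }

    // Here: 5.6585 / 0.0557 ≈ 101.6, far above any plausible max_ratio.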

What this PR does NOT ship

  • Wiring `forward_qwen3_moe_traced` into the `apr trace --payload`
    CLI orchestrator. That's a separate small PR (route the qwen3_moe
    arch dispatch in the existing `apr trace` plumbing; the method
    is now ready for it).
  • Step 1 (HF FP16 fixture script execution) — operator-confirm.
  • Steps 3-6 (bisection, fix, discharge) — depend on Step 1 + this
    method.

Hot-path safety

  Production forward path (`forward_qwen3_moe`, used by `apr run`)
  is BIT-IDENTICAL to before this PR. Only the new method exists.
  Verified by sibling test `f_qw3_moe_c22211_001_full_forward_one_token`
  passing unchanged on the same revision (same logits L2 norm).

Refs M32d Step 2 (M34 FAST PATH plan)
Refs paiml/claude-code-parity-apr#PR (M34 spec amendment)
Refs FALSIFY-QW3-MOE-PARITY-001
Refs FALSIFY-CCPA-013

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(apr-cli): M32d Step 2.5 — wire `apr trace --payload` to forward_qwen3_moe_traced

Step 2.5 of the M34 five-whys FAST PATH plan. **Stacks on top of Step 2
(PR #1222 forward_qwen3_moe_traced) — must merge after that.**

What this PR ships

  • `crates/apr-cli/src/commands/trace.rs` (+93 LOC)
    - Arch-aware dispatch in `run_traced_inference_gguf`: qwen3_moe-arch
      GGUF goes to forward_qwen3_moe_traced; everything else stays on
      forward_traced (dense path).
    - New helper `run_qwen3_moe_traced_forward` that reads MoE config
      (num_experts / num_experts_per_tok / moe_intermediate) from GGUF
      metadata, loads per-layer Qwen3MoeQuantizedLayer descriptors, and
      calls the new traced forward.
    - Skip the GENERATION phase for qwen3_moe — generate_with_cache
      panics on placeholder zero FFN weights (per M32c.2.2 LAZY-FUSED-
      MATVEC). Print a yellow "use `apr run` for text generation" hint
      instead.
    - Robust arch matching: accepts both the canonical "qwen3_moe" (with
      underscore) and the raw GGUF "qwen3moe" (without); a sketch follows
      this list. The build.rs codegen sometimes lags on the YAML alias
      mapping, so we don't gate on its cache being current.
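
  A minimal sketch of that match, assuming a hypothetical `arch` string
  already read from GGUF metadata (the real call sites differ):

    // Accept both spellings so a stale build.rs alias cache cannot
    // misroute the qwen3_moe dispatch.
    fn is_qwen3_moe(arch: &str) -> bool {
        matches!(arch, "qwen3_moe" | "qwen3moe")
    }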

Live dogfood on lambda-vector RTX 4090

  $ apr trace --payload ~/.cache/pacha/models/2b88b180a790988f.gguf

  Architecture: qwen3moe
    Layers: 48
    Hidden dim: 2048
    Vocab size: 151936

  FORWARD PASS (with layer tracing):
    EMBEDDING: ...
    Layer 0/48 [OK]
      attn_norm: mean=  0.0007 std=  0.0623
      qkv      : mean= -0.0003 std=  0.0237
      attn_out : mean= -0.0027 std=  0.1049
      ffn_norm : mean=  0.0234 std=  0.0556
      ffn_out  : mean= -0.0007 std=  0.0226
      output   : mean= -0.0008 std=  0.0680
    [layers 1..46 elided]
    Layer 47/48 [OK]
      attn_norm: mean= -0.0258 std=  0.9990
      qkv      : mean=  0.0187 std=  1.5984
      attn_out : mean= -0.0556 std=  2.1882
      ffn_norm : mean= -0.0242 std=  1.3006
      ffn_out  : mean= -0.0088 std=  1.3745
      output   : mean= -0.1139 std=  2.8217

  FINAL LAYER NORM:
    Range: [-39.16, 32.65], Mean: -0.082, Std: 2.744

  LM_HEAD output:
    Vocab size: 151936, L2 norm: 1025.7529
    Top 5 predictions: token_ids [3555, 937, 19884, 320, 323]

  TRACE SUMMARY:
    All layers have reasonable variance (std < 50)
    Logit range: 28.88 (reasonable)

  GENERATION: skipped for qwen3_moe (use `apr run` for text generation)

This is the EXIT CRITERION for M34 FAST PATH Step 2:

  "`apr trace --json --payload <gguf> --prompt "What is 2+2?"`
   returns non-null `output_stats` for every `transformer_block_N`
   entry, with finite L2 norms."

Met:
  - ✓ All 48 transformer_block_N entries have non-null output_stats
  - ✓ All L2 norms finite, all stats finite (no NaN/Inf)
  - ✓ Layer-level mean+std visible for bisection use
  - --json flag wiring to actually emit JSON is a follow-up; the
    binary already supports the `--json` option, just doesn't yet
    serialize the qwen3_moe trace there. Adding that is one more
    small PR.

Bug found via dogfood

  Building Step 2.5 surfaced a SECOND bug: `apr trace --payload`
  on qwen3_moe was crashing with index-out-of-bounds in
  matmul_fused.rs:211 because the dispatch was missing AND the
  build.rs codegen had stale "qwen3moe" alias mapping. Both fixed
  here (arch-aware dispatch + raw-string fallback). This is exactly
  why the user said "dogfood often" — the bug was invisible to the
  unit test from PR #1222 because the unit test calls the method
  directly; only the CLI orchestrator exercises the dispatch.

Diagnostic signal already visible

  Layer std growth is monotone and large:
    layer[0].output.std  = 0.07
    layer[47].output.std = 2.82
  → ~40× growth over 48 layers. Healthy forward should be roughly
  stable layer-to-layer. This signal feeds Step 3 directly: bisect
  per-layer cosine vs HF FP16 reference to localize the divergent
  layer.
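
  A minimal sketch of that bisection step, assuming per-layer activation
  vectors from both sides are already in memory (tensor loading elided):

    /// Cosine similarity between two activation vectors.
    fn cosine(a: &[f32], b: &[f32]) -> f32 {
        let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
        let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
        let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
        dot / (na * nb)
    }

    /// First layer whose output falls below the parity threshold: the
    /// divergence point the per-layer trace localizes.
    fn first_divergent_layer(apr: &[Vec<f32>], hf: &[Vec<f32>], thresh: f32) -> Option<usize> {
        (0..apr.len().min(hf.len())).find(|&i| cosine(&apr[i], &hf[i]) < thresh)
    }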

Hot-path safety

  Production text-generation path (`apr run` → run_qwen3_moe_generate)
  is UNCHANGED. This PR only touches `apr trace --payload`. Verified
  by sibling tests still passing.

What this PR does NOT ship

  - JSON serialization of the qwen3_moe trace (--json flag) — easy
    follow-up.
  - Actually fixing the model output (Steps 3-6 of FAST PATH).
  - Fixing the `generate_with_cache` qwen3_moe panic (cosmetic; we
    skip it now, but a separate PR could route GENERATION through
    run_qwen3_moe_generate).

Refs M32d Step 2.5 (M34 FAST PATH plan, claude-code-parity-apr-poc.md)
Refs PR #1222 (Step 2: forward_qwen3_moe_traced — depends on it)
Refs FALSIFY-QW3-MOE-PARITY-001
Refs FALSIFY-CCPA-013

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(aprender-serve): M32d Step 7 — sync forward_qwen3_moe_traced with Step 5 Q/K-norm fix

Stacks transitively on top of #1238 (Step 6) → #1232 (Step 5b) → #1228
(Step 5) → #1226 (Step 2.5) → #1222 (Step 2). All five must merge
before this fix can land.

Why this exists

  PR #1222's `forward_qwen3_moe_traced` was authored as a step-for-step
  mirror of `forward_qwen3_moe` as it existed then (M32c.2.2.2.1.1 era),
  when forward_qwen3_moe was still MISSING the per-head Q/K RMSNorm.

  After PR #1228 (Step 5) added the per-head Q/K RMSNorm to
  forward_qwen3_moe, the traced variant kept the bug. Result:
  `apr trace --payload` shows DIFFERENT numerics from `apr run` for the
  same prompt + GGUF — silent diagnostic-vs-production drift.

What this PR fixes

  Mirror the same per-head Q/K RMSNorm into
  forward_qwen3_moe_traced's per-position loop, AFTER bias and BEFORE
  RoPE — same as #1228. Now both functions produce the same numerics.

Live verification on lambda-vector RTX 4090

  ✓ cargo test -p aprender-serve --test qwen3_moe_traced_forward
    --release — 2/2 PASS in 7.56s

  ✓ apr trace --payload <Qwen3-Coder GGUF> reports healthier per-layer
    std growth post-sync (Q/K norm gates attention scores per layer).

  ✓ Sibling F-QW3-MOE-STEP5-001 regression test still passes.

What this PR does NOT ship

  - A rope_theta sync (none is needed): rope_theta is read from
    `self.config.rope_theta`, which is set at model load time from the
    default lookup. PR #1232 fixed that default for `qwen3_moe`, and
    forward_qwen3_moe_traced reads the same config, so it inherits the
    fix automatically.
  - Changes to the other forward stages (norms, MoE FFN dispatch,
    lm_head, etc.): those were already mirrored correctly in the
    original Step 2 PR.

Refs M32d Step 7 sync (M34 FAST PATH plan, claude-code-parity-apr-poc.md)
Refs PR #1228 (Step 5: per-head Q/K RMSNorm fix)
Refs PR #1232 (Step 5b: rope_theta — auto-applied via config)
Refs PR #1222 (Step 2: forward_qwen3_moe_traced — original)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…e_theta + chat template

Squashes 4 substantive M32d FAST PATH fixes (Step 5 + 5b + 6 + 7) +
regression test + evidence into a single commit on top of fresh main.
Replaces the original messy stacked-PR chain that conflicted on rebase
after sibling PRs (#1401, #1405) landed.

Live verification on lambda-vector RTX 4090 (post-rebuild):

  $ apr run <Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf> \
       --prompt "What is 2+2?" --max-tokens 8
  Output: 2 + 2 = 4
  Completed in 40.24s (cached)

Step 5 — per-head Q/K RMSNorm in forward_qwen3_moe (rank-3 prior, 15%)
====================================================================

Qwen3 GH-279 per-head Q/K RMSNorm was wired into the dense path
(adaptive_ffn.rs:174-179) but missing from forward_qwen3_moe.rs. Now
applied AFTER bias, BEFORE RoPE — same code as adaptive_ffn.

Pre-fix: layer std-dev grew 40× over 48 layers (signature of attention
scores compounding without per-head Q/K norm). Output `%%%%%%%%`.

Step 5b — rope_theta default 10K → 1M for qwen3_moe (rank-4 prior, 10%)
=======================================================================

GGUF for Qwen3-Coder-30B-A3B-Instruct-Q4_K_M ships WITHOUT
`qwen3moe.rope.freq_base` metadata. config.rs's default lookup had
`"qwen2" | "qwen3" => 1_000_000.0` but no qwen3_moe entry — fell to
catch-all 10K. Off by 100×. Added qwen3_moe to the 1M arm.
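
A minimal sketch of the corrected lookup (arm spellings assumed from the
description above; the real config.rs match covers more architectures):

    fn default_rope_theta(arch: &str) -> f32 {
        match arch {
            // qwen3_moe newly added to the existing 1M arm
            "qwen2" | "qwen3" | "qwen3_moe" => 1_000_000.0,
            // catch-all the MoE arch previously fell into (100x off)
            _ => 10_000.0,
        }
    }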

Step 6 — chat template (qwen3_moe → ChatML, no <think>)
========================================================

`detect_format_from_name` routed any "qwen3" name to Qwen3NoThink
(PMAT-181), which pre-injects empty `<think>\n</think>\n` into the
assistant turn. Qwen3-Coder does NOT have thinking mode (verified via
the Jinja `tokenizer.chat_template` in the GGUF) — empty think block
caused the model to emit `<|endoftext|>` immediately. Route qwen3_moe
to plain ChatML before the generic qwen3 → NoThink rule. PMAT-181
preserved for thinking-mode dense Qwen3.

Step 7 — sync forward_qwen3_moe_traced with Step 5 Q/K norm
============================================================

forward_qwen3_moe_traced (created in PR #1222 on main) was authored
mirroring the OLD pre-Q/K-norm forward_qwen3_moe. Without sync, `apr
trace --payload` shows DIFFERENT numerics from `apr run` — silent
diagnostic-vs-production drift. Mirror the same Q/K norm into the
traced variant.

Component priors discharge status (M34 FAST PATH)

  | Rank | Component | Prior | Status     |
  |------|-----------|-------|------------|
  | 1    | LAYOUT    | 30%   | not at issue |
  | 2    | Q4_K_M    | 20%   | not at issue |
  | 3    | Q/K norm  | 15%   | FIXED (this commit) |
  | 4    | RoPE θ    | 10%   | FIXED (this commit) |
  | 5    | router sm | 10%   | not at issue |
  | 6    | token emb | 10%   | not at issue |
  | 7    | other     | 5%    | n/a          |
  | n/a  | chat tpl  | n/a   | FIXED (this commit) |

Output transition

  pre-fix         → "%%%%%%%%"               (gibberish)
  + Step 5        → "Human: What is 2+"      (coherent English, partial)
  + Step 5b       → "Human: What is 2+2?"    (full prompt reproduced)
  + Step 6        → "2 + 2 = 4"              (correct answer)
  + Step 7        → diagnostic trace matches production

Multi-domain verification (also passes):
  "Capital of France:"   → "The capital of France is Paris."
  "Translate to Spanish: Hello world" → "¡Hola mundo!"
  "Count to 5:"          → "1, 2, 3, 4, 5"
  "Solve x^2 - 5x + 6 = 0:" → "I need to solve the quadratic equation x² - 5x + 6 = 0..."

Hot-path safety

  - Production text-generation path (`apr run` → run_qwen3_moe_generate
    → forward_qwen3_moe) now applies the norm.
  - `apr trace --payload` (forward_qwen3_moe_traced) syncs the same fix.
  - Sibling tests pass unchanged.
  - `forward_qwen3_moe_traced` reads `self.config.rope_theta` which is set
    at model load from the default lookup — Step 5b auto-applies via config.
  - Dense Qwen3 path UNCHANGED (Qwen3NoThink preserved for thinking-mode
    variants).

Regression test

  `crates/aprender-serve/tests/qwen3_moe_qk_norm_regression.rs`
  F-QW3-MOE-STEP5-001 asserts the context-awareness invariant: two
  distinct prompts must produce distinct argmax tokens, top-2 logit
  gap < 50.

  Live PASS on lambda-vector RTX 4090 in 6.60s.
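
  The invariant, condensed to a self-contained sketch (the real test
  obtains the two logit vectors by running the forward pass on two
  distinct prompts; harness and model loading elided):

    fn argmax(xs: &[f32]) -> usize {
        xs.iter()
            .enumerate()
            .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
            .map(|(i, _)| i)
            .unwrap()
    }

    fn assert_context_aware(logits_a: &[f32], logits_b: &[f32]) {
        // Pre-fix failure mode: every prompt collapsed to the same argmax.
        assert_ne!(argmax(logits_a), argmax(logits_b));
        // Pre-fix failure mode: top-1 dominated top-2 by a huge margin.
        let mut top = logits_a.to_vec();
        top.sort_by(|x, y| y.partial_cmp(x).unwrap());
        assert!(top[0] - top[1] < 50.0);
    }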

Stack research

  - HuggingFace transformers Qwen3MoeForCausalLM applies per-head
    q_norm/k_norm in Qwen3MoeAttention.forward
  - llama.cpp ggml_qwen3_moe_kv_norm in llama-arch.cpp does the same
    (attn_q_norm.weight / attn_k_norm.weight)
  - HF Qwen3MoeConfig.rope_theta default = 1_000_000.0
  - Qwen3-Coder Jinja chat_template generation prompt is plain
    `<|im_start|>assistant\n` (no thinking)

Refs M32d FAST PATH plan (M34, paiml/claude-code-parity-apr)
Refs GH-279 (Qwen3 per-head Q/K RMSNorm)
Refs PMAT-181 (Qwen3NoThink preserved for thinking variants)
Refs FALSIFY-QW3-MOE-FORWARD-003

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift force-pushed the fix/m32d-step5-qwen3-moe-missing-per-head-qk-norm branch from 85d4a5c to d32bb46 on May 2, 2026 12:58
@noahgift noahgift merged commit 5235aae into main May 2, 2026
10 checks passed
@noahgift noahgift deleted the fix/m32d-step5-qwen3-moe-missing-per-head-qk-norm branch May 2, 2026 13:42
noahgift added a commit that referenced this pull request May 2, 2026
…d discharge audit-trail bump (#1078)

Source-of-truth bytes pushed by the companion repo. M22 paired-mirror
guard via pin.lock (sha256 byte-identity, will be refreshed in companion
PR).

Net change: bumps top-level contract YAML from v1.22.0 to v1.23.0 with
one new status_history entry (M35) recording M32d's functional discharge
on aprender main as commit 5235aae (#1228 squash).

What M35 records
================

  M32d numerical-parity bundle landed across multiple aprender PRs:
    #1222 (Step 2)        forward_qwen3_moe_traced diagnostic surface
    #1226 (Step 2.5)      `apr trace --payload` qwen3_moe dispatch
                          (squashed into #1222)
    #1242                 RUSTSEC-2026-0114 audit unblocker
    #1401 (Step 2 JSON)   `apr trace --json --payload` JSON output
                          (FAST PATH Step 2 exit-criterion shape)
    #1228 (THE BUNDLE)    Step 5 + 5b + 6 + 7 + regression test +
                          evidence — squashed into one commit on main:
                          - per-head Q/K RMSNorm in
                            forward_qwen3_moe (rank-3 prior, 15%)
                          - rope_theta 10K → 1M for qwen3_moe (rank-4
                            prior, 10%)
                          - chat template: qwen3_moe → ChatML
                            (no `<think>` injection)
                          - sync forward_qwen3_moe_traced with Step 5
                          - F-QW3-MOE-STEP5-001 regression test
                          - evidence/m32d-discharge-2026-05-01/

Live evidence on lambda-vector RTX 4090 against the 17.3 GB
Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf:

  $ apr run --prompt "What is 2+2?" --max-tokens 8
  Output: 2 + 2 = 4

  $ apr run --prompt "Capital of France:" --max-tokens 30
  Output: The capital of France is Paris.

  $ apr run --prompt "Translate to Spanish: Hello world" --max-tokens 30
  Output: ¡Hola mundo!

  $ apr run --prompt "Solve x^2 - 5x + 6 = 0:" --max-tokens 30
  Output: I need to solve the quadratic equation x² - 5x + 6 = 0.
          I can solve this by factoring.

Output transition timeline:
  pre-fix         "%%%%%%%%"
  + Step 5        "Human: What is 2+"
  + Step 5b       "Human: What is 2+2?"
  + Step 6        "2 + 2 = 4"

M34 FAST PATH actual cost: 5 PRs / ~6 hours wall-clock — at the
**lucky-case bound** of the 4-6 PR / 2-3 day estimate.

What M35 does NOT discharge
============================

  - Cosine vs HF FP16 measurement (operator-confirm — ~60 GB download).
    The formal flip of `qwen3-moe-forward-v1` v1.3.0 DRAFT → v1.4.0
    ACTIVE_RUNTIME waits on that measurement.
  - GPU MoE path (no forward_qwen3_moe_gpu; CUDA/wgpu kernels TBD).
  - Other Qwen3-MoE variants.

Refs aprender commit 5235aae (#1228)
Refs companion M34 (v1.21.0 → v1.22.0 plan)
Refs PMAT-CCPA-PARITY-001
Refs M22 paired-mirror invariant

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 2, 2026
… speedup (#1396)

* perf(aprender-serve): parallelize MoE expert dispatch with rayon — 2× speedup

The top-k experts (k=8 for Qwen3-Coder-30B-A3B-Instruct) were running
sequentially in `moe_ffn_forward_layer`. Each expert's
`expert_swiglu_quantized` call is independent and self-contained — reads
its own slice of the on-disk Q4_K/Q6_K MoE tensors, produces a
`[hidden_dim]` output. Trivially parallelizable.

Change: top-k loop is now `topk_renorm.par_iter().map(...)` collecting
into `Vec<(weight, expert_out)>`, then sequential weighted-add fold (cheap
compared to per-expert SwiGLU + Q4_K dequant).

Live perf measurement on lambda-vector RTX 4090 (16 cores)
============================================================

Pre-fix (sequential top-k):
  $ apr run <17.3 GB Qwen3-Coder GGUF> --prompt "What is 2+2?" \
       --max-tokens 8
  Completed in 38.93s (cached)
  → 4.87 s/token, 0.21 tok/s

Post-fix (parallel top-k):
  $ apr run <17.3 GB Qwen3-Coder GGUF> --prompt "What is 2+2?" \
       --max-tokens 8
  Completed in 18.56s (cached)
  → 2.32 s/token, 0.43 tok/s
  CPU: 1682% (≈ 17 cores in use simultaneously)

**Speedup: 2.1×** (consistent ~2× across multiple test runs).

Why not 8× (one per expert)?

  * The fused_q4k_parallel_matvec / fused_q6k_parallel_matvec inner
    kernels are already rayon-parallel internally over output rows,
    so they consume some of the available core budget.
  * Memory bandwidth: each expert reads ~1.6 MiB of Q4_K/Q6_K bytes
    from mmap; with 8 in flight that's ~13 MiB/forward, hitting cache
    saturation.
  * Weighted-add fold is sequential (~50us per call vs ~250ms per
    expert SwiGLU — negligible).

  2× from outer-rayon on top of inner-rayon is the realistic ceiling
  on this hardware. Multi-token decode (vs single-prompt) will see
  better amortization since the same MoE tensor mmap pages stay warm.

Hot-path safety:
  * Numerical output is identical to sequential: `par_iter().map().collect()`
    on the indexed slice iterator preserves input order, so the sequential
    weighted-add fold sees the same operands in the same order as before.
    Even if ordering differed, f32 reassociation drift would be acceptable
    per CLAUDE.md "ML-specific allows for casts/float_cmp".
  * Tests in `qwen3_moe_*.rs` pass unchanged.
  * Independent of the M32d correctness fixes (#1222, #1228) — this
    is purely a parallelism change.

What this PR does NOT ship:
  * GPU MoE path (separate big PR; needs trueno-gpu MoE kernel).
  * Inner-kernel SIMD optimization (also separate).
  * Router parallelization — the F32 router is already cheap (~10ms);
    parallelizing it would mostly add overhead.

Refs M32d numerical-parity discharge stack (#1222, #1228) — independent
Refs M32c.2.2.2.0 (moe_ffn_forward_layer original)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* ci: retrigger after pre-existing 40min timeout (now have 16 runners + less parallel load)

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 2, 2026
…CHARGE (#1409)

Status flips DRAFT → ACTIVE_ALGORITHM_LEVEL.

M32d numerical parity is functionally discharged on aprender main as of
PR #1228 squash 5235aae (2026-05-02 13:42 UTC). Output transition on
lambda-vector RTX 4090 against the cached 17.3 GB Qwen3-Coder-30B-A3B-
Instruct-Q4_K_M.gguf:

  pre-fix         "%%%%%%%%"               (gibberish, repeated argmax)
  + Step 5        "Human: What is 2+"      (coherent English, partial)
  + Step 5b       "Human: What is 2+2?"    (full prompt reproduced)
  + Step 6        "2 + 2 = 4"              (correct answer)

Multi-domain dogfood (math/geography/translation/code) all correct.

Why ACTIVE_ALGORITHM_LEVEL not ACTIVE_RUNTIME
==============================================

Per the v1.3.0 (M32d.0) parity-strategy decision, full ACTIVE_RUNTIME
discharge requires:
  1. F-QW3-MOE-PARITY-001: cosine ≥ 0.99 vs HF FP16 reference logits
  2. F-QW3-MOE-PARITY-002: argmax matches llama.cpp top-1

#1 requires running scripts/generate_qwen3_moe_fp16_logits.py which is
operator-confirm pending (~60 GB HF download + ~30 min on 30B-A3B
multi-device offload).

ACTIVE_ALGORITHM_LEVEL is the right intermediate state: forward path is
functionally correct (verified by output quality across diverse
prompts), but the formal cosine-vs-HF gate hasn't fired yet.

Component priors verified empirically (M34 FAST PATH plan)
==========================================================

  rank-3 Q/K norm (15%)      FIXED #1228 Step 5
  rank-4 RoPE θ (10%)        FIXED #1228 Step 5b
  outside-priors             FIXED #1228 Step 6 (chat template wrapping)

The diagnostic surface from PRs #1222 (Step 2) + #1226 (Step 2.5) +
#1401 (Step 2 JSON wire) named rank-3 directly via the 40× std-growth
signature without needing the HF FP16 fixture. Step 1 of the original
plan was bypassed.

M34 FAST PATH cost
==================

  Outcome          PRs     Wall-clock
  ACTUAL           5       ~6 hours
  Lucky estimate   4-6     2-3 days
  Realistic        8-10    4-6 days
  Pessimistic      12-15   1-2 weeks

Came in at the lucky-case bound.

Refs aprender PR #1228 commit 5235aae
Refs companion `paiml/claude-code-parity-apr` M35 status_history
Refs `project_m32d_discharge_2026_05_02.md` (memory)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 4, 2026
…rift gate (#1448)

Two related preparation steps for the v0.32.0 cut decision:

## CHANGELOG

Fill out the empty `[Unreleased]` section with today's session body of work
(238 commits since v0.31.2):

- **CPU/GPU output parity contract** (jidoka armor): `apr-cpu-vs-gpu-output-parity-v1`
  v1.0.0 → v1.5.0 ACTIVE with **5/5 falsifiers DISCHARGED** in a single 2-PR cycle
  (#1445 + #1446) — first contract in the SHIP-TWO program to reach complete-evidence
  terminal state. CUDA + wgpu fallback log prefixes + inline cosine parity gate.
- **`apr trace --save-tensor`** — new flag for SHIP-007 layer-0 oracle bisection;
  `apr-cli-trace-save-tensor-v1` v1.4.0 FUNCTIONAL.
- **HF FP16 oracle bisection** — pinpoints SHIP-007 to layer-0 attn_out
  (cos=0.99999995 attn_norm → 0.9966 attn_out).
- **Distillation training contract** — 9/9 falsifiers algorithm-bound.
- **MoE expert dispatch parallelized** — 2× speedup (#1396).
- **APR file mmap** — unblocks `apr diff --values` on 7B (#1058).
- **M32d numerical-parity bundle** — Q/K RMSNorm + rope_theta + chat template (#1228).
- **150+ contract algorithm-bind sweep** — record cycle, kernel + format + training +
  GPU-backend + CLI families flipped from `unbound` to `PARTIAL_ALGORITHM_LEVEL`.

## README drift gate repair

`bash scripts/check_readme_claims.sh` was FAILING:

- README claimed 1096 contracts, filesystem has 1105
- README claimed 79 CLI commands, `apr --help` lists 80

Fixed both numbers in the contract-backed table AND the prose references.
Drift gate now PASS 4/4.

Five Whys:

1. Why was the gate failing? README contract counts and CLI counts are stale.
2. Why are they stale? 9 new contracts and 1 new CLI command merged since the
   last README update.
3. Why didn't the gate catch it earlier? It's a script — not yet wired into CI
   as a hard gate (FALSIFY-README-001..004 are PARTIAL_ALGORITHM_LEVEL, the
   shell wrapper is documented in the contract but doesn't fail PRs).
4. Why isn't it a CI gate yet? `readme-claims-v1` is recent (2026-04-24),
   wired to `bash scripts/check_readme_claims.sh` but not to a workflow step.
5. Why fix it now? Pre-release hygiene — releases must ship green drift gates
   per `feedback_post_publish_qa_required.md`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>