
feat(aprender-serve): heavy CPU-vs-GPU per-stage diff harness — M-MOE-SUB-3#1524

Merged
noahgift merged 1 commit into main from feat/m-moe-sub-3-heavy-parity-diff
May 6, 2026

Conversation


@noahgift noahgift commented May 6, 2026

Summary

M-MOE-SUB-3 heavy diagnostic harness that runs CPU-traced + GPU-traced forward bodies (M-MOE-SUB-2 step (a) PR #1516 + step (b) PR #1523) with a SaveTensorPlan capturing MoeRouter and MoeFfnOut for every layer, then computes per-layer per-stage cosine similarity to identify the first layer where the GPU diverges from the CPU. Per contracts/trace-moe-gpu-sub-stages-v1.yaml v1.2.0 step M-MOE-SUB-3.
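The per-stage comparison described above reduces to a cosine over two captured tensors, with non-finite values short-circuiting to NaN so "diverged" and "went NaN" stay distinguishable. A minimal sketch (the function name and flat-`f32`-slice representation are illustrative assumptions, not the harness's actual API):

```rust
/// Cosine similarity between a CPU-captured and a GPU-captured tensor,
/// treated as flat f32 slices. Returns NaN on length mismatch or if
/// either side contains a non-finite value, so callers can tell a
/// genuine divergence apart from a NaN-producing stage.
/// (Sketch only; names and representation are assumptions.)
fn cosine(cpu: &[f32], gpu: &[f32]) -> f32 {
    if cpu.len() != gpu.len() {
        return f32::NAN;
    }
    if cpu.iter().chain(gpu.iter()).any(|v| !v.is_finite()) {
        return f32::NAN;
    }
    // Accumulate in f64 to keep the reduction stable over large tensors.
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&a, &b) in cpu.iter().zip(gpu.iter()) {
        dot += a as f64 * b as f64;
        na += a as f64 * a as f64;
        nb += b as f64 * b as f64;
    }
    (dot / (na.sqrt() * nb.sqrt())) as f32
}

fn main() {
    let a = [1.0f32, 2.0, 3.0];
    println!("cos(a, a) = {}", cosine(&a, &a)); // identical tensors → 1.0
}
```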

Why

falsify_qw3_moe_gpu_parity_001_cosine_vs_cpu observes 100% NaN at lm_head from forward_qwen3_moe_cuda while CPU forward_qwen3_moe produces finite output on the same input. The first 9 stages of the GPU forward path are CPU-identical; the only GPU-specific stage is the MoE FFN. This harness pinpoints WHICH layer of the MoE FFN first diverges → that layer is the M-GPU-MOE-1.4 bug-origin candidate.

What ships

  • New heavy test falsify_moe_sub_002_cpu_gpu_traced_per_stage_diff in crates/aprender-serve/tests/qwen3_moe_gpu_per_stage_diff.rs.
  • 6 light unit tests for the verdict classifier (Match, Diverge, NanGpu, NanCpu, NanBoth, Missing) — all 6 pass.
  • Skip-if-not-present pattern for the canonical 17.3 GB Qwen3-Coder GGUF (matches qwen3_moe_gpu_parity convention).
  • #[ignore] + #[cfg(feature = "cuda")] gating per heavy-test convention.
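
The six verdict variants listed above can be sketched as a small classifier. This is an illustrative reconstruction, not the shipped code: the 0.99 threshold is borrowed from the FALSIFY discharge note below, and the `Option<&[f32]>` capture representation is an assumption.

```rust
/// Per-stage verdicts, mirroring the six classifier cases the light
/// unit tests cover. (Sketch; the shipped enum may differ in detail.)
#[derive(Debug, PartialEq)]
enum Verdict {
    Match,
    Diverge,
    NanGpu,
    NanCpu,
    NanBoth,
    Missing,
}

// Assumed threshold, per the `cosine ≥ 0.99` discharge criterion.
const COS_THRESHOLD: f32 = 0.99;

fn classify(cpu: Option<&[f32]>, gpu: Option<&[f32]>) -> Verdict {
    // A tensor missing on either side means the capture plan didn't
    // record it for this layer.
    let (cpu, gpu) = match (cpu, gpu) {
        (Some(c), Some(g)) => (c, g),
        _ => return Verdict::Missing,
    };
    let cpu_nan = cpu.iter().any(|v| !v.is_finite());
    let gpu_nan = gpu.iter().any(|v| !v.is_finite());
    match (cpu_nan, gpu_nan) {
        (true, true) => Verdict::NanBoth,
        (true, false) => Verdict::NanCpu,
        (false, true) => Verdict::NanGpu,
        (false, false) => {
            // Both finite: fall back to cosine similarity.
            let dot: f32 = cpu.iter().zip(gpu).map(|(a, b)| a * b).sum();
            let na: f32 = cpu.iter().map(|a| a * a).sum::<f32>().sqrt();
            let nb: f32 = gpu.iter().map(|b| b * b).sum::<f32>().sqrt();
            if dot / (na * nb) >= COS_THRESHOLD {
                Verdict::Match
            } else {
                Verdict::Diverge
            }
        }
    }
}

fn main() {
    println!("{:?}", classify(Some(&[1.0]), Some(&[f32::NAN]))); // NanGpu
}
```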

Invocation

cargo test -p aprender-serve --features cuda \
  --test qwen3_moe_gpu_per_stage_diff \
  -- --include-ignored --nocapture

Sample output (when the heavy test runs)

layer | moe_router (cos / verdict)        | moe_ffn_out (cos / verdict)
------|-----------------------------------|-----------------------------------
L00   | 1.000000 MATCH                    | 1.000000 MATCH
L01   | 0.998765 MATCH                    | 0.997123 MATCH
...
L17   | 0.998123 MATCH                    |             NanGpu      ← first divergence!

M-MOE-SUB-3 bisection summary:
  first DIVERGE on moe_router  : None
  first DIVERGE on moe_ffn_out : None
  first NaN_GPU on moe_router  : None
  first NaN_GPU on moe_ffn_out : Some(17)

If first_NaN_GPU(moe_router) == 0: bug is in the F32 router @ hidden CPU dot product (unlikely).
If first_NaN_GPU(moe_ffn_out) == 0 and moe_router @ L0 is finite: bug is in expert_swiglu_cuda.
If first_NaN_GPU(moe_ffn_out) > 0 and earlier layers MATCH: bug is layer-N specific (rare).
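
The `Some(17)` / `None` summary lines above are just a first-index scan over per-layer verdicts. A minimal sketch, with a reduced verdict enum and a hypothetical `first_layer` helper (neither name is from the shipped harness):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Verdict {
    Match,
    Diverge,
    NanGpu,
}

/// First layer index carrying `wanted`, or None if no layer does —
/// the shape of the `first NaN_GPU on moe_ffn_out : Some(17)` line.
fn first_layer(verdicts: &[Verdict], wanted: Verdict) -> Option<usize> {
    verdicts.iter().position(|&v| v == wanted)
}

fn main() {
    // Layers L00..L16 match, then the GPU goes NaN at L17.
    let mut ffn = vec![Verdict::Match; 17];
    ffn.push(Verdict::NanGpu);
    println!("first NaN_GPU : {:?}", first_layer(&ffn, Verdict::NanGpu));
    println!("first DIVERGE : {:?}", first_layer(&ffn, Verdict::Diverge));
}
```

`Iterator::position` makes the bisection summary a one-liner per stage; the interpretation rules above then key off whether the returned index is 0, positive, or absent.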

What the harness does NOT do (yet)

  • Does NOT assert pass criteria. The full FALSIFY-QW3-MOE-GPU-PARITY-001 cosine threshold lives in qwen3_moe_gpu_parity. This is a diagnostic harness for the M-GPU-MOE-1.4 NaN/Inf bisection.
  • Does NOT clean up /tmp/moe-sub-{cpu,gpu}-<pid>/ dirs (operator inspects them for raw bytes if cosine is ambiguous).

Falsifier

FALSIFY-MOE-SUB-002 (byte-identity preservation for existing stages). This PR ships the harness; an operator-dispatched run on lambda-vector RTX 4090 + cached 17.3 GB Qwen3-Coder GGUF (~30-60 min wall) will produce the divergence table. Threshold-based discharge becomes meaningful AFTER the M-GPU-MOE-1.4 bug is fixed.

M-MOE-SUB-3 status after this PR

Step                                      | Status
------------------------------------------|----------------------------
Test harness authored                     | SHIPPED (this PR)
Run on lambda-vector + interpret table    | operator-dispatched
Promote to FALSIFY-MOE-SUB-002 DISCHARGED | gated on M-GPU-MOE-1.4 fix

Test plan

  • cargo test -p aprender-serve --features cuda --test qwen3_moe_gpu_per_stage_diff → 6/6 light tests pass; heavy test ignored as expected
  • cargo fmt --check (file-level)
  • cargo build -p aprender-serve --features cuda --tests --test qwen3_moe_gpu_per_stage_diff → compiles
  • Auto-merge once required checks pass
  • Operator dispatches --include-ignored on lambda-vector → divergence table

🤖 Generated with Claude Code

…-SUB-3

Per `contracts/trace-moe-gpu-sub-stages-v1.yaml` v1.2.0 step
M-MOE-SUB-3: heavy diagnostic test that runs CPU-traced + GPU-traced
forward bodies (M-MOE-SUB-2 step (a) PR #1516 + step (b) PR #1523)
with a SaveTensorPlan capturing MoeRouter and MoeFfnOut for every
layer, then computes per-layer per-stage cosine similarity to identify
the first layer where the GPU diverges from the CPU.

What ships
==========

- New heavy test `falsify_moe_sub_002_cpu_gpu_traced_per_stage_diff`
  in `crates/aprender-serve/tests/qwen3_moe_gpu_per_stage_diff.rs`.
- 6 light unit tests for the verdict classifier (`Match`, `Diverge`,
  `NanGpu`, `NanCpu`, `NanBoth`, `Missing`) — all 6 pass.
- Skip-if-not-present pattern for the canonical 17.3 GB Qwen3-Coder
  GGUF (matches the existing `qwen3_moe_gpu_parity` test convention).
- `#[ignore]` + `#[cfg(feature = "cuda")]` gating per the repo's
  heavy-test convention.

Invocation
==========

    cargo test -p aprender-serve --features cuda \
      --test qwen3_moe_gpu_per_stage_diff \
      -- --include-ignored --nocapture

What the harness does NOT do (yet)
==================================

- Does NOT assert pass criteria. The full FALSIFY-QW3-MOE-GPU-PARITY-001
  cosine threshold lives in the existing qwen3_moe_gpu_parity test;
  this is a diagnostic harness for the M-GPU-MOE-1.4 NaN/Inf bisection
  per qwen3-moe-forward-gpu-v1 v1.4.0 amendment_history block.
- Does NOT clean up `/tmp/moe-sub-{cpu,gpu}-<pid>/` dirs (operator
  inspects them for raw bytes if cosine is ambiguous).

Falsifier
=========

FALSIFY-MOE-SUB-002 (byte-identity preservation for existing stages).
This PR ships the harness; an operator-dispatched run on
lambda-vector RTX 4090 + cached 17.3 GB Qwen3-Coder GGUF (~30-60 min wall)
will produce the layer-by-layer divergence table and pinpoint the
first M-GPU-MOE-1.4 bug-origin candidate. Threshold-based discharge
(`cosine ≥ 0.99`) becomes meaningful AFTER the bug is fixed.

M-MOE-SUB-3 status after this PR
================================

- Test harness: SHIPPED (this PR)
- Run on lambda-vector + interpret table: operator-dispatched
- Promote to FALSIFY-MOE-SUB-002 DISCHARGED: gated on M-GPU-MOE-1.4 fix

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml v1.2.0
Refs: M-MOE-SUB-2 step (a) PR #1516, step (b) PR #1523
Refs: qwen3-moe-forward-gpu-v1 v1.4.0 M-GPU-MOE-1.4 NaN/Inf bisection

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 6, 2026 00:52
@noahgift noahgift merged commit 8865e3b into main May 6, 2026
11 checks passed
@noahgift noahgift deleted the feat/m-moe-sub-3-heavy-parity-diff branch May 6, 2026 01:15
noahgift added a commit that referenced this pull request May 6, 2026
…scade complete on main (#1525)

Promotes status PROPOSED → ACTIVE_ALGORITHM_LEVEL after all 5 cascade
PRs land. M-MOE-SUB-1, M-MOE-SUB-2 (a + b + c + c.gpu), M-MOE-SUB-3
(harness) status: PENDING → SHIPPED. M-MOE-SUB-4 stays PENDING
(optional, only needed if M-MOE-SUB-3's diff doesn't pinpoint at
MoeRouter / MoeFfnOut granularity).

Cited PRs (chronological)
=========================

- #1507 — moe_ffn_forward_layer_with_router (CPU helper, step c)
- #1516 — forward_qwen3_moe_traced_with_plan (CPU body, step a)
- #1521 — apr trace --save-tensor GGUF MoE CLI wireup (step a CLI)
- #1522 — moe_ffn_forward_layer_cuda_with_router (GPU helper, step c.gpu)
- #1523 — forward_qwen3_moe_cuda_traced (GPU body, step b)
- #1524 — heavy diff harness (M-MOE-SUB-3)

What's left
===========

- Operator-dispatched run of `falsify_moe_sub_002_cpu_gpu_traced_per_stage_diff`
  on lambda-vector RTX 4090 + cached 17.3 GB Qwen3-Coder GGUF
  (~30-60 min wall) → produces layer-by-layer divergence table.
- M-MOE-SUB-3 ALGORITHM_LEVEL → FUNCTIONAL upon operator run.
- FALSIFY-MOE-SUB-003 → DISCHARGED gated on M-GPU-MOE-1.4 root-cause fix.

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml
Refs: qwen3-moe-forward-gpu-v1 v1.4.0 M-GPU-MOE-1.4

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 6, 2026
…giene amendment (#1526)

Realigns FALSIFY-MOE-SUB-001/002 `test:` invocation strings with their
live test bindings in code, and promotes statuses to match the
implementation_stages they discharge.

Drift caught: at v1.3.0 the contract cited test names that didn't
match the actual code:
- SUB-001 cited `falsify_moe_sub_001_new_stages_parse` (singular)
  but live binding is a 5-test suite under prefix `falsify_moe_sub_001_*`
  (round_trip × 2 + canonical_order + parse_list × 2) in
  aprender-serve --lib.
- SUB-002 cited `cargo test -p apr-cli --test falsify_moe_sub_002_byte_identity`
  but live binding is `falsify_moe_sub_002_cpu_gpu_traced_per_stage_diff`
  in `aprender-serve --features cuda --test qwen3_moe_gpu_per_stage_diff`
  (heavy harness from M80 PR #1524, `#![cfg(feature = "cuda")]` +
  `#[ignore]`-gated).

This is the same drift class M71 closed mechanically via PV-VER-002
(`pv lint --strict-test-binding`) for §50.4 cascade contracts —
trace-moe-gpu-sub-stages-v1 was authored before that gate landed and
needs manual realignment.

Status promotions:
- FALSIFY-MOE-SUB-001 PROPOSED → DISCHARGED (5 lib tests pass in <1s,
  verified live)
- FALSIFY-MOE-SUB-002 PROPOSED → ALGORITHM_LEVEL_DISCHARGED (heavy
  harness exists + listed; full DISCHARGED blocks on RTX 4090
  --include-ignored dispatch)
- FALSIFY-MOE-SUB-003/004 unchanged (LIVE bisection + fix PR pending)

YAML-only — production hot paths byte-unchanged. Contract version
v1.3.0 → v1.4.0; status stays ACTIVE_ALGORITHM_LEVEL.

`pv validate` 0/0.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
