
feat(aprender-serve): heavy CPU-vs-GPU per-stage diff harness — M-MOE-SUB-3#1524

Merged
noahgift merged 1 commit into main from feat/m-moe-sub-3-heavy-parity-diff
May 6, 2026

Conversation


@noahgift noahgift commented May 6, 2026

Summary

M-MOE-SUB-3 heavy diagnostic harness that runs CPU-traced + GPU-traced forward bodies (M-MOE-SUB-2 step (a) PR #1516 + step (b) PR #1523) with a SaveTensorPlan capturing MoeRouter and MoeFfnOut for every layer, then computes per-layer per-stage cosine similarity to identify the first layer where the GPU diverges from the CPU. Per contracts/trace-moe-gpu-sub-stages-v1.yaml v1.2.0 step M-MOE-SUB-3.
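The per-stage comparison described above reduces to a cosine over two captured tensors, with non-finite values short-circuiting to NaN so "diverged" and "went NaN" stay distinguishable. A minimal sketch (the function name and flat-`f32`-slice representation are illustrative assumptions, not the harness's actual API):

```rust
/// Cosine similarity between a CPU-captured and a GPU-captured tensor,
/// treated as flat f32 slices. Returns NaN on length mismatch or if
/// either side contains a non-finite value, so callers can tell a
/// genuine divergence apart from a NaN-producing stage.
/// (Sketch only; names and representation are assumptions.)
fn cosine(cpu: &[f32], gpu: &[f32]) -> f32 {
    if cpu.len() != gpu.len() {
        return f32::NAN;
    }
    if cpu.iter().chain(gpu.iter()).any(|v| !v.is_finite()) {
        return f32::NAN;
    }
    // Accumulate in f64 to keep the reduction stable over large tensors.
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&a, &b) in cpu.iter().zip(gpu.iter()) {
        dot += a as f64 * b as f64;
        na += a as f64 * a as f64;
        nb += b as f64 * b as f64;
    }
    (dot / (na.sqrt() * nb.sqrt())) as f32
}

fn main() {
    let a = [1.0f32, 2.0, 3.0];
    println!("cos(a, a) = {}", cosine(&a, &a)); // identical tensors → 1.0
}
```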

Why

falsify_qw3_moe_gpu_parity_001_cosine_vs_cpu observes 100% NaN at lm_head from forward_qwen3_moe_cuda while CPU forward_qwen3_moe produces finite output on the same input. The first 9 stages of the GPU forward path are CPU-identical; the only GPU-specific stage is the MoE FFN. This harness pinpoints WHICH layer of the MoE FFN first diverges → that layer is the M-GPU-MOE-1.4 bug-origin candidate.

What ships

  • New heavy test falsify_moe_sub_002_cpu_gpu_traced_per_stage_diff in crates/aprender-serve/tests/qwen3_moe_gpu_per_stage_diff.rs.
  • 6 light unit tests for the verdict classifier (Match, Diverge, NanGpu, NanCpu, NanBoth, Missing) — all 6 pass.
  • Skip-if-not-present pattern for the canonical 17.3 GB Qwen3-Coder GGUF (matches qwen3_moe_gpu_parity convention).
  • #[ignore] + #[cfg(feature = "cuda")] gating per heavy-test convention.
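
The six verdict variants listed above can be sketched as a small classifier. This is an illustrative reconstruction, not the shipped code: the 0.99 threshold is borrowed from the FALSIFY discharge note below, and the `Option<&[f32]>` capture representation is an assumption.

```rust
/// Per-stage verdicts, mirroring the six classifier cases the light
/// unit tests cover. (Sketch; the shipped enum may differ in detail.)
#[derive(Debug, PartialEq)]
enum Verdict {
    Match,
    Diverge,
    NanGpu,
    NanCpu,
    NanBoth,
    Missing,
}

// Assumed threshold, per the `cosine ≥ 0.99` discharge criterion.
const COS_THRESHOLD: f32 = 0.99;

fn classify(cpu: Option<&[f32]>, gpu: Option<&[f32]>) -> Verdict {
    // A tensor missing on either side means the capture plan didn't
    // record it for this layer.
    let (cpu, gpu) = match (cpu, gpu) {
        (Some(c), Some(g)) => (c, g),
        _ => return Verdict::Missing,
    };
    let cpu_nan = cpu.iter().any(|v| !v.is_finite());
    let gpu_nan = gpu.iter().any(|v| !v.is_finite());
    match (cpu_nan, gpu_nan) {
        (true, true) => Verdict::NanBoth,
        (true, false) => Verdict::NanCpu,
        (false, true) => Verdict::NanGpu,
        (false, false) => {
            // Both finite: fall back to cosine similarity.
            let dot: f32 = cpu.iter().zip(gpu).map(|(a, b)| a * b).sum();
            let na: f32 = cpu.iter().map(|a| a * a).sum::<f32>().sqrt();
            let nb: f32 = gpu.iter().map(|b| b * b).sum::<f32>().sqrt();
            if dot / (na * nb) >= COS_THRESHOLD {
                Verdict::Match
            } else {
                Verdict::Diverge
            }
        }
    }
}

fn main() {
    println!("{:?}", classify(Some(&[1.0]), Some(&[f32::NAN]))); // NanGpu
}
```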

Invocation

cargo test -p aprender-serve --features cuda \
  --test qwen3_moe_gpu_per_stage_diff \
  -- --include-ignored --nocapture

Sample output (when the heavy test runs)

layer | moe_router (cos / verdict)        | moe_ffn_out (cos / verdict)
------|-----------------------------------|-----------------------------------
L00   | 1.000000 MATCH                    | 1.000000 MATCH
L01   | 0.998765 MATCH                    | 0.997123 MATCH
...
L17   | 0.998123 MATCH                    |             NanGpu      ← first divergence!

M-MOE-SUB-3 bisection summary:
  first DIVERGE on moe_router  : None
  first DIVERGE on moe_ffn_out : None
  first NaN_GPU on moe_router  : None
  first NaN_GPU on moe_ffn_out : Some(17)

If first_NaN_GPU(moe_router) == 0: bug is in the F32 router @ hidden CPU dot product (unlikely).
If first_NaN_GPU(moe_ffn_out) == 0 and moe_router @ L0 is finite: bug is in expert_swiglu_cuda.
If first_NaN_GPU(moe_ffn_out) > 0 and earlier layers MATCH: bug is layer-N specific (rare).
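
The `Some(17)` / `None` summary lines above are just a first-index scan over per-layer verdicts. A minimal sketch, with a reduced verdict enum and a hypothetical `first_layer` helper (neither name is from the shipped harness):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Verdict {
    Match,
    Diverge,
    NanGpu,
}

/// First layer index carrying `wanted`, or None if no layer does —
/// the shape of the `first NaN_GPU on moe_ffn_out : Some(17)` line.
fn first_layer(verdicts: &[Verdict], wanted: Verdict) -> Option<usize> {
    verdicts.iter().position(|&v| v == wanted)
}

fn main() {
    // Layers L00..L16 match, then the GPU goes NaN at L17.
    let mut ffn = vec![Verdict::Match; 17];
    ffn.push(Verdict::NanGpu);
    println!("first NaN_GPU : {:?}", first_layer(&ffn, Verdict::NanGpu));
    println!("first DIVERGE : {:?}", first_layer(&ffn, Verdict::Diverge));
}
```

`Iterator::position` makes the bisection summary a one-liner per stage; the interpretation rules above then key off whether the returned index is 0, positive, or absent.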

What the harness does NOT do (yet)

  • Does NOT assert pass criteria. The full FALSIFY-QW3-MOE-GPU-PARITY-001 cosine threshold lives in qwen3_moe_gpu_parity. This is a diagnostic harness for the M-GPU-MOE-1.4 NaN/Inf bisection.
  • Does NOT clean up /tmp/moe-sub-{cpu,gpu}-<pid>/ dirs (operator inspects them for raw bytes if cosine is ambiguous).

Falsifier

FALSIFY-MOE-SUB-002 (byte-identity preservation for existing stages). This PR ships the harness; an operator-dispatched run on lambda-vector RTX 4090 + cached 17.3 GB Qwen3-Coder GGUF (~30-60 min wall) will produce the divergence table. Threshold-based discharge becomes meaningful AFTER the M-GPU-MOE-1.4 bug is fixed.

M-MOE-SUB-3 status after this PR

Step                                      | Status
------------------------------------------|----------------------------
Test harness authored                     | SHIPPED (this PR)
Run on lambda-vector + interpret table    | operator-dispatched
Promote to FALSIFY-MOE-SUB-002 DISCHARGED | gated on M-GPU-MOE-1.4 fix

Test plan

  • cargo test -p aprender-serve --features cuda --test qwen3_moe_gpu_per_stage_diff → 6/6 light tests pass; heavy test ignored as expected
  • cargo fmt --check (file-level)
  • cargo build -p aprender-serve --features cuda --tests --test qwen3_moe_gpu_per_stage_diff → compiles
  • Auto-merge once required checks pass
  • Operator dispatches --include-ignored on lambda-vector → divergence table

🤖 Generated with Claude Code

…-SUB-3

Per `contracts/trace-moe-gpu-sub-stages-v1.yaml` v1.2.0 step
M-MOE-SUB-3: heavy diagnostic test that runs CPU-traced + GPU-traced
forward bodies (M-MOE-SUB-2 step (a) PR #1516 + step (b) PR #1523)
with a SaveTensorPlan capturing MoeRouter and MoeFfnOut for every
layer, then computes per-layer per-stage cosine similarity to identify
the first layer where the GPU diverges from the CPU.

What ships
==========

- New heavy test `falsify_moe_sub_002_cpu_gpu_traced_per_stage_diff`
  in `crates/aprender-serve/tests/qwen3_moe_gpu_per_stage_diff.rs`.
- 6 light unit tests for the verdict classifier (`Match`, `Diverge`,
  `NanGpu`, `NanCpu`, `NanBoth`, `Missing`) — all 6 pass.
- Skip-if-not-present pattern for the canonical 17.3 GB Qwen3-Coder
  GGUF (matches the existing `qwen3_moe_gpu_parity` test convention).
- `#[ignore]` + `#[cfg(feature = "cuda")]` gating per the repo's
  heavy-test convention.

Invocation
==========

    cargo test -p aprender-serve --features cuda \
      --test qwen3_moe_gpu_per_stage_diff \
      -- --include-ignored --nocapture

What the harness does NOT do (yet)
==================================

- Does NOT assert pass criteria. The full FALSIFY-QW3-MOE-GPU-PARITY-001
  cosine threshold lives in the existing qwen3_moe_gpu_parity test;
  this is a diagnostic harness for the M-GPU-MOE-1.4 NaN/Inf bisection
  per qwen3-moe-forward-gpu-v1 v1.4.0 amendment_history block.
- Does NOT clean up `/tmp/moe-sub-{cpu,gpu}-<pid>/` dirs (operator
  inspects them for raw bytes if cosine is ambiguous).

Falsifier
=========

FALSIFY-MOE-SUB-002 (byte-identity preservation for existing stages).
This PR ships the harness; an operator-dispatched run on
lambda-vector RTX 4090 + cached 17.3 GB Qwen3-Coder GGUF (~30-60 min wall)
will produce the layer-by-layer divergence table and pinpoint the
first M-GPU-MOE-1.4 bug-origin candidate. Threshold-based discharge
(`cosine ≥ 0.99`) becomes meaningful AFTER the bug is fixed.

M-MOE-SUB-3 status after this PR
================================

- Test harness: SHIPPED (this PR)
- Run on lambda-vector + interpret table: operator-dispatched
- Promote to FALSIFY-MOE-SUB-002 DISCHARGED: gated on M-GPU-MOE-1.4 fix

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml v1.2.0
Refs: M-MOE-SUB-2 step (a) PR #1516, step (b) PR #1523
Refs: qwen3-moe-forward-gpu-v1 v1.4.0 M-GPU-MOE-1.4 NaN/Inf bisection

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 6, 2026 00:52
@noahgift noahgift merged commit 8865e3b into main May 6, 2026
11 checks passed
@noahgift noahgift deleted the feat/m-moe-sub-3-heavy-parity-diff branch May 6, 2026 01:15
noahgift added a commit that referenced this pull request May 6, 2026
…scade complete on main (#1525)

Promotes status PROPOSED → ACTIVE_ALGORITHM_LEVEL after all 5 cascade
PRs land. M-MOE-SUB-1, M-MOE-SUB-2 (a + b + c + c.gpu), M-MOE-SUB-3
(harness) status: PENDING → SHIPPED. M-MOE-SUB-4 stays PENDING
(optional, only needed if M-MOE-SUB-3's diff doesn't pinpoint at
MoeRouter / MoeFfnOut granularity).

Cited PRs (chronological)
=========================

- #1507 — moe_ffn_forward_layer_with_router (CPU helper, step c)
- #1516 — forward_qwen3_moe_traced_with_plan (CPU body, step a)
- #1521 — apr trace --save-tensor GGUF MoE CLI wireup (step a CLI)
- #1522 — moe_ffn_forward_layer_cuda_with_router (GPU helper, step c.gpu)
- #1523 — forward_qwen3_moe_cuda_traced (GPU body, step b)
- #1524 — heavy diff harness (M-MOE-SUB-3)

What's left
===========

- Operator-dispatched run of `falsify_moe_sub_002_cpu_gpu_traced_per_stage_diff`
  on lambda-vector RTX 4090 + cached 17.3 GB Qwen3-Coder GGUF
  (~30-60 min wall) → produces layer-by-layer divergence table.
- M-MOE-SUB-3 ALGORITHM_LEVEL → FUNCTIONAL upon operator run.
- FALSIFY-MOE-SUB-003 → DISCHARGED gated on M-GPU-MOE-1.4 root-cause fix.

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml
Refs: qwen3-moe-forward-gpu-v1 v1.4.0 M-GPU-MOE-1.4

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 6, 2026
…giene amendment (#1526)

Realigns FALSIFY-MOE-SUB-001/002 `test:` invocation strings with their
live test bindings in code, and promotes statuses to match the
implementation_stages they discharge.

Drift caught: at v1.3.0 the contract cited test names that didn't
match the actual code:
- SUB-001 cited `falsify_moe_sub_001_new_stages_parse` (singular)
  but live binding is a 5-test suite under prefix `falsify_moe_sub_001_*`
  (round_trip × 2 + canonical_order + parse_list × 2) in
  aprender-serve --lib.
- SUB-002 cited `cargo test -p apr-cli --test falsify_moe_sub_002_byte_identity`
  but live binding is `falsify_moe_sub_002_cpu_gpu_traced_per_stage_diff`
  in `aprender-serve --features cuda --test qwen3_moe_gpu_per_stage_diff`
  (heavy harness from M80 PR #1524, `#![cfg(feature = "cuda")]` +
  `#[ignore]`-gated).

This is the same drift class M71 closed mechanically via PV-VER-002
(`pv lint --strict-test-binding`) for §50.4 cascade contracts —
trace-moe-gpu-sub-stages-v1 was authored before that gate landed and
needs manual realignment.

Status promotions:
- FALSIFY-MOE-SUB-001 PROPOSED → DISCHARGED (5 lib tests pass in <1s,
  verified live)
- FALSIFY-MOE-SUB-002 PROPOSED → ALGORITHM_LEVEL_DISCHARGED (heavy
  harness exists + listed; full DISCHARGED blocks on RTX 4090
  --include-ignored dispatch)
- FALSIFY-MOE-SUB-003/004 unchanged (LIVE bisection + fix PR pending)

YAML-only — production hot paths byte-unchanged. Contract version
v1.3.0 → v1.4.0; status stays ACTIVE_ALGORITHM_LEVEL.

`pv validate` 0/0.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
