feat(aprender-serve): heavy CPU-vs-GPU per-stage diff harness — M-MOE-SUB-3#1524
Merged
…-SUB-3 Per `contracts/trace-moe-gpu-sub-stages-v1.yaml` v1.2.0 step M-MOE-SUB-3: heavy diagnostic test that runs the CPU-traced + GPU-traced forward bodies (M-MOE-SUB-2 step (a) PR #1516 + step (b) PR #1523) with a SaveTensorPlan capturing MoeRouter and MoeFfnOut for every layer, then computes per-layer per-stage cosine similarity to identify the first layer where the GPU diverges from the CPU.

What ships
==========
- New heavy test `falsify_moe_sub_002_cpu_gpu_traced_per_stage_diff` in `crates/aprender-serve/tests/qwen3_moe_gpu_per_stage_diff.rs`.
- 6 light unit tests for the verdict classifier (`Match`, `Diverge`, `NanGpu`, `NanCpu`, `NanBoth`, `Missing`) — all 6 pass.
- Skip-if-not-present pattern for the canonical 17.3 GB Qwen3-Coder GGUF (matches the existing `qwen3_moe_gpu_parity` test convention).
- `#[ignore]` + `#[cfg(feature = "cuda")]` gating per the repo's heavy-test convention.

Invocation
==========
cargo test -p aprender-serve --features cuda \
    --test qwen3_moe_gpu_per_stage_diff \
    -- --include-ignored --nocapture

What the harness does NOT do (yet)
==================================
- Does NOT assert pass criteria. The full FALSIFY-QW3-MOE-GPU-PARITY-001 cosine threshold lives in the existing qwen3_moe_gpu_parity test; this is a diagnostic harness for the M-GPU-MOE-1.4 NaN/Inf bisection per the qwen3-moe-forward-gpu-v1 v1.4.0 amendment_history block.
- Does NOT clean up `/tmp/moe-sub-{cpu,gpu}-<pid>/` dirs (operator inspects them for raw bytes if cosine is ambiguous).

Falsifier
=========
FALSIFY-MOE-SUB-002 (byte-identity preservation for existing stages). This PR ships the harness; an operator-dispatched run on lambda-vector RTX 4090 + cached 17.3 GB Qwen3-Coder GGUF (~30-60 min wall) will produce the layer-by-layer divergence table and pinpoint the first M-GPU-MOE-1.4 bug-origin candidate. Threshold-based discharge (`cosine ≥ 0.99`) becomes meaningful AFTER the bug is fixed.
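The per-stage metric described above can be sketched as follows. This is an illustrative reconstruction, not the shipped code: the function name `stage_cosine` and its NaN-guarding behavior are assumptions based on the description of the harness.

```rust
/// Cosine similarity between two dumped f32 stage tensors (illustrative).
/// Returns None on length mismatch, non-finite values, or a zero-norm
/// side, so NaN poisoning is surfaced before the metric is taken.
fn stage_cosine(cpu: &[f32], gpu: &[f32]) -> Option<f32> {
    if cpu.len() != gpu.len() {
        return None;
    }
    // Refuse to compute a metric over poisoned tensors.
    if cpu.iter().chain(gpu.iter()).any(|v| !v.is_finite()) {
        return None;
    }
    let dot: f32 = cpu.iter().zip(gpu).map(|(a, b)| a * b).sum();
    let na: f32 = cpu.iter().map(|a| a * a).sum::<f32>().sqrt();
    let nb: f32 = gpu.iter().map(|b| b * b).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        return None;
    }
    Some(dot / (na * nb))
}

fn main() {
    let cpu = [1.0f32, 2.0, 3.0];
    // Identical tensors give a cosine of approximately 1.0.
    println!("{:?}", stage_cosine(&cpu, &[1.0, 2.0, 3.0]));
    // A NaN on either side yields None rather than a misleading score.
    println!("{:?}", stage_cosine(&cpu, &[f32::NAN, 2.0, 3.0]));
}
```

Running the real harness produces one such score per layer per stage (MoeRouter, MoeFfnOut), and the first layer whose score drops or goes to None is the bisection candidate.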
M-MOE-SUB-3 status after this PR
================================
- Test harness: SHIPPED (this PR)
- Run on lambda-vector + interpret table: operator-dispatched
- Promote to FALSIFY-MOE-SUB-002 DISCHARGED: gated on M-GPU-MOE-1.4 fix

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml v1.2.0
Refs: M-MOE-SUB-2 step (a) PR #1516, step (b) PR #1523
Refs: qwen3-moe-forward-gpu-v1 v1.4.0 M-GPU-MOE-1.4 NaN/Inf bisection

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
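The skip-if-not-present gating listed under "What ships" can be sketched as below. This is a minimal sketch, assuming a helper that probes a candidate path; the helper name and the example path are hypothetical, not the repo's actual API.

```rust
use std::path::Path;

/// Return the model path if the canonical GGUF is cached locally,
/// otherwise None so the heavy test can skip instead of failing
/// (illustrative; the real test follows the qwen3_moe_gpu_parity
/// convention, whose exact helper is not shown in this PR text).
fn cached_gguf(candidate: &str) -> Option<String> {
    Path::new(candidate).exists().then(|| candidate.to_string())
}

fn main() {
    // Hypothetical cache location for the 17.3 GB Qwen3-Coder GGUF.
    match cached_gguf("/models/qwen3-coder.gguf") {
        Some(p) => println!("running heavy per-stage diff against {p}"),
        None => println!("SKIP: canonical GGUF not cached on this host"),
    }
}
```

The point of the pattern is that a cold CI host prints a SKIP line and passes, while lambda-vector with the cached GGUF runs the full 30-60 min diff.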
noahgift added a commit that referenced this pull request on May 6, 2026
…scade complete on main (#1525)

Promotes status PROPOSED → ACTIVE_ALGORITHM_LEVEL after all 5 cascade PRs land. M-MOE-SUB-1, M-MOE-SUB-2 (a + b + c + c.gpu), M-MOE-SUB-3 (harness) status: PENDING → SHIPPED. M-MOE-SUB-4 stays PENDING (optional, only needed if M-MOE-SUB-3's diff doesn't pinpoint at MoeRouter / MoeFfnOut granularity).

Cited PRs (chronological)
=========================
- #1507 — moe_ffn_forward_layer_with_router (CPU helper, step c)
- #1516 — forward_qwen3_moe_traced_with_plan (CPU body, step a)
- #1521 — apr trace --save-tensor GGUF MoE CLI wireup (step a CLI)
- #1522 — moe_ffn_forward_layer_cuda_with_router (GPU helper, step c.gpu)
- #1523 — forward_qwen3_moe_cuda_traced (GPU body, step b)
- #1524 — heavy diff harness (M-MOE-SUB-3)

What's left
===========
- Operator-dispatched run of `falsify_moe_sub_002_cpu_gpu_traced_per_stage_diff` on lambda-vector RTX 4090 + cached 17.3 GB Qwen3-Coder GGUF (~30-60 min wall) → produces the layer-by-layer divergence table.
- M-MOE-SUB-3 ALGORITHM_LEVEL → FUNCTIONAL upon operator run.
- FALSIFY-MOE-SUB-003 → DISCHARGED gated on M-GPU-MOE-1.4 root-cause fix.

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml
Refs: qwen3-moe-forward-gpu-v1 v1.4.0 M-GPU-MOE-1.4

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 6, 2026
…giene amendment (#1526)

Realigns FALSIFY-MOE-SUB-001/002 `test:` invocation strings with their live test bindings in code, and promotes statuses to match the implementation_stages they discharge.

Drift caught: at v1.3.0 the contract cited test names that didn't match the actual code:
- SUB-001 cited `falsify_moe_sub_001_new_stages_parse` (singular) but the live binding is a 5-test suite under prefix `falsify_moe_sub_001_*` (round_trip × 2 + canonical_order + parse_list × 2) in aprender-serve --lib.
- SUB-002 cited `cargo test -p apr-cli --test falsify_moe_sub_002_byte_identity` but the live binding is `falsify_moe_sub_002_cpu_gpu_traced_per_stage_diff` in `aprender-serve --features cuda --test qwen3_moe_gpu_per_stage_diff` (heavy harness from M80 PR #1524, `#![cfg(feature = "cuda")]` + `#[ignore]`-gated).

This is the same drift class M71 closed mechanically via PV-VER-002 (`pv lint --strict-test-binding`) for §50.4 cascade contracts — trace-moe-gpu-sub-stages-v1 was authored before that gate landed and needs manual realignment.

Status promotions:
- FALSIFY-MOE-SUB-001 PROPOSED → DISCHARGED (5 lib tests pass in <1s, verified live)
- FALSIFY-MOE-SUB-002 PROPOSED → ALGORITHM_LEVEL_DISCHARGED (heavy harness exists + listed; full DISCHARGED blocks on RTX 4090 --include-ignored dispatch)
- FALSIFY-MOE-SUB-003/004 unchanged (LIVE bisection + fix PR pending)

YAML-only — production hot paths byte-unchanged. Contract version v1.3.0 → v1.4.0; status stays ACTIVE_ALGORITHM_LEVEL. `pv validate` 0/0.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
M-MOE-SUB-3 heavy diagnostic harness that runs CPU-traced + GPU-traced forward bodies (M-MOE-SUB-2 step (a) PR #1516 + step (b) PR #1523) with a `SaveTensorPlan` capturing MoeRouter and MoeFfnOut for every layer, then computes per-layer per-stage cosine similarity to identify the first layer where the GPU diverges from the CPU. Per `contracts/trace-moe-gpu-sub-stages-v1.yaml` v1.2.0 step M-MOE-SUB-3.

Why
`falsify_qw3_moe_gpu_parity_001_cosine_vs_cpu` observes 100% NaN at lm_head from `forward_qwen3_moe_cuda`, while the CPU `forward_qwen3_moe` produces finite output on the same input. The first 9 stages of the GPU forward path are CPU-identical; the only GPU-specific stage is the MoE FFN. This harness pinpoints WHICH layer of the MoE FFN first diverges → that layer is the M-GPU-MOE-1.4 bug-origin candidate.

What ships
- `falsify_moe_sub_002_cpu_gpu_traced_per_stage_diff` in `crates/aprender-serve/tests/qwen3_moe_gpu_per_stage_diff.rs`.
- 6 light unit tests for the verdict classifier (`Match`, `Diverge`, `NanGpu`, `NanCpu`, `NanBoth`, `Missing`) — all 6 pass.
- Skip-if-not-present pattern for the canonical 17.3 GB Qwen3-Coder GGUF (matches the existing `qwen3_moe_gpu_parity` convention).
- `#[ignore]` + `#[cfg(feature = "cuda")]` gating per heavy-test convention.

Invocation
cargo test -p aprender-serve --features cuda \
    --test qwen3_moe_gpu_per_stage_diff \
    -- --include-ignored --nocapture

Sample output (when the heavy test runs)
What the harness does NOT do (yet)
- Does NOT assert pass criteria. The full FALSIFY-QW3-MOE-GPU-PARITY-001 cosine threshold lives in the existing `qwen3_moe_gpu_parity` test. This is a diagnostic harness for the M-GPU-MOE-1.4 NaN/Inf bisection.
- Does NOT clean up `/tmp/moe-sub-{cpu,gpu}-<pid>/` dirs (operator inspects them for raw bytes if cosine is ambiguous).

Falsifier
FALSIFY-MOE-SUB-002 (byte-identity preservation for existing stages). This PR ships the harness; an operator-dispatched run on lambda-vector RTX 4090 + cached 17.3 GB Qwen3-Coder GGUF (~30-60 min wall) will produce the divergence table. Threshold-based discharge becomes meaningful AFTER the M-GPU-MOE-1.4 bug is fixed.
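A verdict classifier covering the six variants the light unit tests exercise could look roughly like this. The exact signature, the placement of the `cosine ≥ 0.99` threshold, and the handling of length mismatches are assumptions drawn from this PR's description, not the shipped implementation.

```rust
/// Per-stage comparison verdict (variant names from this PR's test list).
#[derive(Debug, PartialEq)]
enum Verdict { Match, Diverge, NanGpu, NanCpu, NanBoth, Missing }

/// Classify one (cpu, gpu) stage-tensor pair. Either side may be absent
/// if its trace never dumped the stage. Sketch only: a length mismatch
/// is folded into Missing here for simplicity.
fn classify(cpu: Option<&[f32]>, gpu: Option<&[f32]>) -> Verdict {
    let (cpu, gpu) = match (cpu, gpu) {
        (Some(c), Some(g)) if c.len() == g.len() => (c, g),
        _ => return Verdict::Missing,
    };
    let cpu_nan = cpu.iter().any(|v| !v.is_finite());
    let gpu_nan = gpu.iter().any(|v| !v.is_finite());
    match (cpu_nan, gpu_nan) {
        (true, true) => return Verdict::NanBoth,
        (true, false) => return Verdict::NanCpu,
        (false, true) => return Verdict::NanGpu,
        (false, false) => {}
    }
    // Finite on both sides: fall back to the cosine threshold.
    let dot: f32 = cpu.iter().zip(gpu).map(|(a, b)| a * b).sum();
    let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
    let cos = dot / (norm(cpu) * norm(gpu));
    if cos >= 0.99 { Verdict::Match } else { Verdict::Diverge }
}

fn main() {
    let a = [1.0f32, 0.0];
    assert_eq!(classify(Some(&a), Some(&a)), Verdict::Match);
    assert_eq!(classify(Some(&a), None), Verdict::Missing);
    assert_eq!(classify(Some(&a), Some(&[f32::NAN, 0.0])), Verdict::NanGpu);
    println!("classifier sketch ok");
}
```

Under this shape, `NanGpu` at some layer N with `Match` at all layers below N is exactly the bisection signal the harness is after.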
M-MOE-SUB-3 status after this PR
Test plan
- `cargo test -p aprender-serve --features cuda --test qwen3_moe_gpu_per_stage_diff` → 6/6 light tests pass; heavy test ignored as expected
- `cargo fmt --check` (file-level)
- `cargo build -p aprender-serve --features cuda --tests --test qwen3_moe_gpu_per_stage_diff` → compiles
- `--include-ignored` on lambda-vector → divergence table

🤖 Generated with Claude Code