feat(aprender-serve): forward_qwen3_moe_cuda_traced — M-MOE-SUB-2 step (b) #1523
Merged
Conversation
…p (b)

GPU traced sibling of forward_qwen3_moe_cuda. Mirrors the CPU traced
sibling (M32d Step 2 + PR #1516 _with_plan extension) but routes
per-layer MoE FFN through the GPU dispatch so `apr trace --gpu --json
--payload --save-tensor` can run the same SaveTensorPlan against both
CPU and GPU forward paths, capture per-stage activations at MoeRouter
and MoeFfnOut, and bisect the M-GPU-MOE-1.4 NaN/Inf poisoning to its
first divergence point.

What ships
==========

- `OwnedQuantizedModelCuda::forward_qwen3_moe_cuda_traced(token_ids,
  moe_layers, num_experts, num_experts_per_tok, moe_intermediate,
  data)` → Result<ForwardTrace>. No-plan delegate.
- `forward_qwen3_moe_cuda_traced_with_plan(..., plan:
  Option<&SaveTensorPlan>)` → Result<ForwardTrace>. The plan-aware body.
- New file
  `crates/aprender-serve/src/gguf/cuda/forward_qwen3_moe_cuda_traced.rs`
  (~430 LOC including doc-comments).
- `include!()` registered in `cuda/uses.rs`.
- Lib-only signature drift gate test
  `forward_qwen3_moe_cuda_traced_signature_drift_gate`. End-to-end
  byte-identity vs the production sibling is exercised by the heavy
  `qwen3_moe_gpu_parity` test on lambda-vector RTX 4090 against the
  cached 17.3 GB Qwen3-Coder GGUF (M-MOE-SUB-3).

Hot path safety
===============

Production `forward_qwen3_moe_cuda` is unchanged byte-for-byte. This is
a parallel slow path used only by `apr trace --gpu`. The per-token loop
dispatches the GPU MoE FFN identically to production for non-capture
positions; the LAST sequence position uses
`moe_ffn_forward_layer_cuda_with_router` (PR #1522) when the plan
selects MoeRouter or MoeFfnOut so the router weights can be emitted
without recomputation.

Closes (a)+(b)+(c.gpu) of step (b)
==================================

The triplet for M-MOE-SUB-2 is now complete:

- step (a) CPU body: PR #1516 (forward_qwen3_moe_traced_with_plan)
- step (a) CLI wireup: PR #1521 (apr trace --save-tensor for GGUF MoE)
- step (b) GPU body: THIS PR
- step (c) CPU helper: PR #1507 (moe_ffn_forward_layer_with_router)
- step (c.gpu) GPU helper: PR #1522 (moe_ffn_forward_layer_cuda_with_router)

M-MOE-SUB-3 next: heavy parity test on lambda-vector RTX 4090, diff CPU
vs GPU at MoeRouter and MoeFfnOut, identify first divergence.

Falsifier
=========

FALSIFY-MOE-SUB-002 (byte-identity preservation for existing stages).
This PR DOES NOT discharge it (heavy parity test required at
M-MOE-SUB-3); it provides the GPU traced sibling that will run the
test.

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml v1.2.0
Refs: M-MOE-SUB-2 step (b)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
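The commit gives only the two entry-point signatures; a minimal sketch
of the no-plan delegate pattern, assuming the parameter types (the
commit states parameter names and return type only) and using stub
types in place of the crate's real ForwardTrace and SaveTensorPlan,
might look like:

    // Sketch only: stub types; the real definitions live in aprender-serve.
    pub struct ForwardTrace;
    pub struct SaveTensorPlan;
    pub struct OwnedQuantizedModelCuda;
    type Result<T> = std::result::Result<T, String>;

    impl OwnedQuantizedModelCuda {
        // No-plan delegate: same traced path, no tensor capture.
        pub fn forward_qwen3_moe_cuda_traced(
            &mut self,
            token_ids: &[u32],
            moe_layers: &[usize],
            num_experts: usize,
            num_experts_per_tok: usize,
            moe_intermediate: usize,
            data: &[u8],
        ) -> Result<ForwardTrace> {
            // One shared body; the no-plan variant just passes None.
            self.forward_qwen3_moe_cuda_traced_with_plan(
                token_ids, moe_layers, num_experts,
                num_experts_per_tok, moe_intermediate, data, None,
            )
        }

        // Plan-aware body: captures MoeRouter / MoeFfnOut when the plan
        // selects them; the real per-layer loop is elided here.
        pub fn forward_qwen3_moe_cuda_traced_with_plan(
            &mut self,
            token_ids: &[u32],
            moe_layers: &[usize],
            num_experts: usize,
            num_experts_per_tok: usize,
            moe_intermediate: usize,
            data: &[u8],
            plan: Option<&SaveTensorPlan>,
        ) -> Result<ForwardTrace> {
            let _ = (token_ids, moe_layers, num_experts,
                     num_experts_per_tok, moe_intermediate, data, plan);
            Ok(ForwardTrace)
        }
    }

A single plan-aware body keeps the no-plan entry point from drifting
out of sync with the traced path.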
noahgift added a commit that referenced this pull request on May 6, 2026
…-SUB-3 (#1524)

Per `contracts/trace-moe-gpu-sub-stages-v1.yaml` v1.2.0 step
M-MOE-SUB-3: heavy diagnostic test that runs the CPU-traced +
GPU-traced forward bodies (M-MOE-SUB-2 step (a) PR #1516 + step (b)
PR #1523) with a SaveTensorPlan capturing MoeRouter and MoeFfnOut for
every layer, then computes per-layer per-stage cosine similarity to
identify the first layer where the GPU diverges from the CPU.

What ships
==========

- New heavy test `falsify_moe_sub_002_cpu_gpu_traced_per_stage_diff` in
  `crates/aprender-serve/tests/qwen3_moe_gpu_per_stage_diff.rs`.
- 6 light unit tests for the verdict classifier (`Match`, `Diverge`,
  `NanGpu`, `NanCpu`, `NanBoth`, `Missing`) — all 6 pass.
- Skip-if-not-present pattern for the canonical 17.3 GB Qwen3-Coder
  GGUF (matches the existing `qwen3_moe_gpu_parity` test convention).
- `#[ignore]` + `#[cfg(feature = "cuda")]` gating per the repo's
  heavy-test convention.

Invocation
==========

    cargo test -p aprender-serve --features cuda \
      --test qwen3_moe_gpu_per_stage_diff \
      -- --include-ignored --nocapture

What the harness does NOT do (yet)
==================================

- Does NOT assert pass criteria. The full
  FALSIFY-QW3-MOE-GPU-PARITY-001 cosine threshold lives in the existing
  qwen3_moe_gpu_parity test; this is a diagnostic harness for the
  M-GPU-MOE-1.4 NaN/Inf bisection per the qwen3-moe-forward-gpu-v1
  v1.4.0 amendment_history block.
- Does NOT clean up `/tmp/moe-sub-{cpu,gpu}-<pid>/` dirs (the operator
  inspects them for raw bytes if cosine is ambiguous).

Falsifier
=========

FALSIFY-MOE-SUB-002 (byte-identity preservation for existing stages).
This PR ships the harness; an operator-dispatched run on lambda-vector
RTX 4090 + cached 17.3 GB Qwen3-Coder GGUF (~30-60 min wall) will
produce the layer-by-layer divergence table and pinpoint the first
M-GPU-MOE-1.4 bug-origin candidate. Threshold-based discharge
(`cosine ≥ 0.99`) becomes meaningful AFTER the bug is fixed.

M-MOE-SUB-3 status after this PR
================================

- Test harness: SHIPPED (this PR)
- Run on lambda-vector + interpret table: operator-dispatched
- Promote to FALSIFY-MOE-SUB-002 DISCHARGED: gated on M-GPU-MOE-1.4 fix

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml v1.2.0
Refs: M-MOE-SUB-2 step (a) PR #1516, step (b) PR #1523
Refs: qwen3-moe-forward-gpu-v1 v1.4.0 M-GPU-MOE-1.4 NaN/Inf bisection

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
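A verdict classifier over those six variants might be structured like
the following minimal sketch; the classification rules and where the
0.99 threshold applies are assumptions (the commit names only the
variants and the cosine metric):

    #[derive(Debug, PartialEq)]
    enum Verdict { Match, Diverge, NanGpu, NanCpu, NanBoth, Missing }

    // Cosine similarity between two activation vectors
    // (zero-norm guard omitted in this sketch).
    fn cosine(a: &[f32], b: &[f32]) -> f32 {
        let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
        let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
        let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
        dot / (na * nb)
    }

    // One verdict per (layer, stage) pair; `None` means the tensor was
    // not captured on that side.
    fn classify(cpu: Option<&[f32]>, gpu: Option<&[f32]>, threshold: f32) -> Verdict {
        let (cpu, gpu) = match (cpu, gpu) {
            (Some(c), Some(g)) => (c, g),
            _ => return Verdict::Missing,
        };
        let cpu_nan = cpu.iter().any(|x| !x.is_finite());
        let gpu_nan = gpu.iter().any(|x| !x.is_finite());
        match (cpu_nan, gpu_nan) {
            (true, true) => Verdict::NanBoth,
            (true, false) => Verdict::NanCpu,
            (false, true) => Verdict::NanGpu, // the M-GPU-MOE-1.4 signature
            (false, false) if cosine(cpu, gpu) >= threshold => Verdict::Match,
            _ => Verdict::Diverge,
        }
    }

Walking layers in order and reporting the first non-Match verdict
yields the first-divergence layer the harness is meant to surface.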
noahgift added a commit that referenced this pull request on May 6, 2026
…scade complete on main (#1525)

Promotes status PROPOSED → ACTIVE_ALGORITHM_LEVEL after all 5 cascade
PRs land. M-MOE-SUB-1, M-MOE-SUB-2 (a + b + c + c.gpu), M-MOE-SUB-3
(harness) status: PENDING → SHIPPED. M-MOE-SUB-4 stays PENDING
(optional, only needed if M-MOE-SUB-3's diff doesn't pinpoint at
MoeRouter / MoeFfnOut granularity).

Cited PRs (chronological)
=========================

- #1507 — moe_ffn_forward_layer_with_router (CPU helper, step c)
- #1516 — forward_qwen3_moe_traced_with_plan (CPU body, step a)
- #1521 — apr trace --save-tensor GGUF MoE CLI wireup (step a CLI)
- #1522 — moe_ffn_forward_layer_cuda_with_router (GPU helper, step c.gpu)
- #1523 — forward_qwen3_moe_cuda_traced (GPU body, step b)
- #1524 — heavy diff harness (M-MOE-SUB-3)

What's left
===========

- Operator-dispatched run of
  `falsify_moe_sub_002_cpu_gpu_traced_per_stage_diff` on lambda-vector
  RTX 4090 + cached 17.3 GB Qwen3-Coder GGUF (~30-60 min wall) →
  produces the layer-by-layer divergence table.
- M-MOE-SUB-3 ALGORITHM_LEVEL → FUNCTIONAL upon operator run.
- FALSIFY-MOE-SUB-003 → DISCHARGED gated on the M-GPU-MOE-1.4
  root-cause fix.

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml
Refs: qwen3-moe-forward-gpu-v1 v1.4.0 M-GPU-MOE-1.4

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
GPU traced sibling of `forward_qwen3_moe_cuda`. Mirrors the CPU traced
sibling (PR #1516, `forward_qwen3_moe_traced_with_plan`) but routes
per-layer MoE FFN through the GPU dispatch so `apr trace --gpu --json
--payload --save-tensor` can run the same SaveTensorPlan against both
CPU and GPU forward paths, capture per-stage activations at MoeRouter +
MoeFfnOut, and bisect the M-GPU-MOE-1.4 NaN/Inf poisoning to its first
divergence point.

Why
§40 + qwen3-moe-forward-gpu-v1 v1.4.0 amendment_history: the heavy
`qwen3_moe_gpu_parity` test on RTX 4090 against the cached 17.3 GB
Qwen3-Coder GGUF produces ALL 151936 logits as NaN at lm_head. CPU
forward on the same input produces finite output. Steps 1-9 of the GPU
forward are CPU-identical (precondition checks, embedding, attention
norm, QKV, RoPE, attention, output projection, FFN norm). Step 10 is
the only GPU-specific stage — the MoE FFN. With this PR + #1522
(helper) + the existing CPU traced sibling, M-MOE-SUB-3 can run
`apr trace --gpu` with the same plan against both forward bodies and
pinpoint where the GPU NaN enters.

What ships
- `OwnedQuantizedModelCuda::forward_qwen3_moe_cuda_traced(token_ids, moe_layers, num_experts, num_experts_per_tok, moe_intermediate, data) -> Result<ForwardTrace>` — no-plan delegate.
- `forward_qwen3_moe_cuda_traced_with_plan(..., plan: Option<&SaveTensorPlan>) -> Result<ForwardTrace>` — plan-aware body.
- `crates/aprender-serve/src/gguf/cuda/forward_qwen3_moe_cuda_traced.rs` (~430 LOC including doc-comments).
- `include!()` registered in `cuda/uses.rs`.

Hot path safety
Production `forward_qwen3_moe_cuda` is unchanged byte-for-byte. The
per-token loop dispatches the GPU MoE FFN identically to production for
non-capture positions; the LAST sequence position uses
`moe_ffn_forward_layer_cuda_with_router` (PR #1522) when the plan
selects MoeRouter or MoeFfnOut so router weights can be emitted without
recomputation, as the sketch below illustrates.
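A minimal sketch of that branch, assuming a hypothetical Stage enum,
plan.wants() helper, and non-capture dispatch function (only the
last-position rule and the `_with_router` helper come from this PR):

    #[derive(Clone, Copy, PartialEq)]
    enum Stage { MoeRouter, MoeFfnOut }

    struct SaveTensorPlan { stages: Vec<Stage> }

    impl SaveTensorPlan {
        fn wants(&self, s: Stage) -> bool { self.stages.contains(&s) }
    }

    // Stubs standing in for the production and router-emitting dispatches.
    fn moe_ffn_layer_cuda(hidden: &[f32]) -> Vec<f32> { hidden.to_vec() }
    fn moe_ffn_layer_cuda_with_router(hidden: &[f32]) -> (Vec<f32>, Vec<f32>) {
        (hidden.to_vec(), Vec::new())
    }

    fn moe_ffn_per_token(tokens: &[Vec<f32>], plan: Option<&SaveTensorPlan>) {
        for (pos, hidden) in tokens.iter().enumerate() {
            // Capture only at the LAST sequence position, and only when
            // the plan selects a MoE stage.
            let capture = pos + 1 == tokens.len()
                && plan.is_some_and(|p| {
                    p.wants(Stage::MoeRouter) || p.wants(Stage::MoeFfnOut)
                });
            if capture {
                // The _with_router variant returns router weights alongside
                // the FFN output, so MoeRouter is emitted without
                // recomputing the layer.
                let (_ffn_out, _router_weights) =
                    moe_ffn_layer_cuda_with_router(hidden);
            } else {
                // Non-capture positions dispatch identically to production.
                let _ffn_out = moe_ffn_layer_cuda(hidden);
            }
        }
    }

Keeping capture off the non-last positions is what preserves
byte-identity with the production dispatch on the stages that
FALSIFY-MOE-SUB-002 watches.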
M-MOE-SUB-2 status after this PR

- step (a) CPU body: `forward_qwen3_moe_traced_with_plan` (PR #1516)
- step (a) CLI wireup: `apr trace --save-tensor` for GGUF MoE (PR #1521)
- step (b) GPU body: `forward_qwen3_moe_cuda_traced[_with_plan]` (this PR)
- step (c) CPU helper: `moe_ffn_forward_layer_with_router` (PR #1507)
- step (c.gpu) GPU helper: `moe_ffn_forward_layer_cuda_with_router` (PR #1522)

M-MOE-SUB-2 will be COMPLETE when this lands. M-MOE-SUB-3 (heavy parity test) is next.
Falsifier
FALSIFY-MOE-SUB-002 (byte-identity preservation for existing stages). This PR does not discharge it (heavy parity test required at M-MOE-SUB-3); it provides the GPU traced sibling that will run the test.
Test plan

- `cargo test -p aprender-serve --features cuda --lib forward_qwen3_moe_cuda_traced_signature` → 1 passed (signature drift gate; see the sketch after this list)
- `cargo fmt --check` (file-level)
- `cargo check -p aprender-serve --features cuda` → finishes

🤖 Generated with Claude Code
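One way such a drift gate can work (only the test name comes from this
PR; the parameter types reuse the assumptions from the delegate sketch
above) is to bind the method to an explicitly typed fn pointer, so any
signature change becomes a compile error:

    #[test]
    fn forward_qwen3_moe_cuda_traced_signature_drift_gate() {
        // Fails to compile if the signature drifts from the contract.
        type Sig = fn(
            &mut OwnedQuantizedModelCuda,
            &[u32],   // token_ids
            &[usize], // moe_layers
            usize,    // num_experts
            usize,    // num_experts_per_tok
            usize,    // moe_intermediate
            &[u8],    // data
        ) -> Result<ForwardTrace>;
        let _gate: Sig = OwnedQuantizedModelCuda::forward_qwen3_moe_cuda_traced;
    }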