
feat(aprender-serve): forward_qwen3_moe_cuda_traced — M-MOE-SUB-2 step (b)#1523

Merged
noahgift merged 1 commit into main from feat/forward-qwen3-moe-cuda-traced on May 6, 2026

Conversation


@noahgift noahgift commented May 6, 2026

Summary

GPU traced sibling of forward_qwen3_moe_cuda. Mirrors the CPU traced sibling (PR #1516, forward_qwen3_moe_traced_with_plan) but routes the per-layer MoE FFN through the GPU dispatch, so apr trace --gpu --json --payload --save-tensor can run the same SaveTensorPlan against both the CPU and GPU forward paths, capture per-stage activations at MoeRouter and MoeFfnOut, and bisect the M-GPU-MOE-1.4 NaN/Inf poisoning to its first divergence point.

Why

Per §40 and the qwen3-moe-forward-gpu-v1 v1.4.0 amendment_history: the heavy qwen3_moe_gpu_parity test on an RTX 4090 against the cached 17.3 GB Qwen3-Coder GGUF produces all 151936 logits as NaN at lm_head, while the CPU forward on the same input produces finite output. Steps 1-9 of the GPU forward are CPU-identical to the CPU forward (precondition checks, embedding, attention norm, QKV, RoPE, attention, output projection, FFN norm); step 10, the MoE FFN, is the only GPU-specific stage. With this PR, #1522 (helper), and the existing CPU traced sibling, M-MOE-SUB-3 can apr trace --gpu the same plan against both forward bodies and pinpoint where the GPU NaN enters.
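The failure signature above ("all 151936 logits NaN at lm_head") is the kind of check a trace harness makes before diffing stages. A minimal sketch, assuming nothing about the crate's real API (`count_non_finite` is a hypothetical helper):

```rust
// Count NaN/Inf entries in a logits buffer. If the count equals the
// buffer length, the whole head output is poisoned, as described above.
fn count_non_finite(logits: &[f32]) -> usize {
    logits.iter().filter(|x| !x.is_finite()).count()
}
```

A fully poisoned buffer returns its own length; a finite CPU output returns 0, which is what makes the CPU-vs-GPU bisection meaningful.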

What ships

  • OwnedQuantizedModelCuda::forward_qwen3_moe_cuda_traced(token_ids, moe_layers, num_experts, num_experts_per_tok, moe_intermediate, data) -> Result<ForwardTrace> — no-plan delegate.
  • forward_qwen3_moe_cuda_traced_with_plan(..., plan: Option<&SaveTensorPlan>) -> Result<ForwardTrace> — plan-aware body.
  • New file crates/aprender-serve/src/gguf/cuda/forward_qwen3_moe_cuda_traced.rs (~430 LOC including doc-comments).
  • include!() registered in cuda/uses.rs.
  • Lib-only signature drift gate test.
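A signature drift gate can be as small as a fn-pointer coercion: if the method's signature ever changes, the coercion stops compiling. The sketch below uses stand-in types (`Model`, `ForwardTrace`, `forward_traced` are all illustrative, not the crate's real names):

```rust
// Illustrative stand-ins for the crate's real types.
struct ForwardTrace;
struct Model;

impl Model {
    fn forward_traced(&self, token_ids: &[u32]) -> Result<ForwardTrace, String> {
        let _ = token_ids;
        Ok(ForwardTrace)
    }
}

// The gate: coercing the method to an explicit fn-pointer type turns any
// future signature drift into a compile error instead of silent skew.
fn signature_gate() -> bool {
    let f: fn(&Model, &[u32]) -> Result<ForwardTrace, String> = Model::forward_traced;
    let _ = f;
    true
}
```

Because the check is resolved at compile time, the test body itself is trivial; it exists so the gate runs under `cargo test --lib`.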

Hot path safety

Production forward_qwen3_moe_cuda is unchanged byte-for-byte. The per-token loop dispatches the GPU MoE FFN identically to production for non-capture positions; the LAST sequence position uses moe_ffn_forward_layer_cuda_with_router (PR #1522) when the plan selects MoeRouter or MoeFfnOut so router weights can be emitted without recomputation.
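The dispatch decision described above can be sketched as a small predicate. All names here (`Stage`, `SaveTensorPlan`, `use_capture_path`) are hypothetical stand-ins for the crate's types; only the last sequence position with a MoE-selecting plan takes the router-capturing path:

```rust
#[derive(PartialEq)]
enum Stage { MoeRouter, MoeFfnOut }

struct SaveTensorPlan { stages: Vec<Stage> }

// Non-capture positions dispatch identically to production; only the
// last position, when the plan selects a MoE stage, takes the
// router-capturing helper so router weights are emitted for free.
fn use_capture_path(pos: usize, seq_len: usize, plan: Option<&SaveTensorPlan>) -> bool {
    let is_last = pos + 1 == seq_len;
    let wants_moe = plan.map_or(false, |p| {
        p.stages.iter().any(|s| *s == Stage::MoeRouter || *s == Stage::MoeFfnOut)
    });
    is_last && wants_moe
}
```

This shape is why the hot path stays byte-identical: with `plan == None` the predicate is always false and every position takes the production dispatch.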

M-MOE-SUB-2 status after this PR

| Sub-step | What | Status |
| --- | --- | --- |
| (a) CPU body | forward_qwen3_moe_traced_with_plan | SHIPPED (PR #1516) |
| (a) CLI wireup | apr trace --save-tensor for GGUF MoE | SHIPPED (PR #1521) |
| (b) GPU body | forward_qwen3_moe_cuda_traced[_with_plan] | THIS PR |
| (c) CPU helper | moe_ffn_forward_layer_with_router | SHIPPED (PR #1507) |
| (c.gpu) GPU helper | moe_ffn_forward_layer_cuda_with_router | SHIPPED (PR #1522) |

M-MOE-SUB-2 will be COMPLETE when this lands. M-MOE-SUB-3 (heavy parity test) is next.

Falsifier

FALSIFY-MOE-SUB-002 (byte-identity preservation for existing stages). This PR does not discharge it (heavy parity test required at M-MOE-SUB-3); it provides the GPU traced sibling that will run the test.

Test plan

  • cargo test -p aprender-serve --features cuda --lib forward_qwen3_moe_cuda_traced_signature → 1 passed (signature drift gate)
  • cargo fmt --check (file-level)
  • cargo check -p aprender-serve --features cuda → finishes
  • Auto-merge once required checks pass
  • M-MOE-SUB-3 heavy parity test will exercise byte-identity vs production on lambda-vector RTX 4090

🤖 Generated with Claude Code

…p (b)

GPU traced sibling of forward_qwen3_moe_cuda. Mirrors the CPU traced
sibling (M32d Step 2 + PR #1516 _with_plan extension) but routes per-
layer MoE FFN through the GPU dispatch so `apr trace --gpu --json
--payload --save-tensor` can run the same SaveTensorPlan against both
CPU and GPU forward paths, capture per-stage activations at MoeRouter
and MoeFfnOut, and bisect the M-GPU-MOE-1.4 NaN/Inf poisoning to its
first divergence point.

What ships
==========

- `OwnedQuantizedModelCuda::forward_qwen3_moe_cuda_traced(token_ids,
  moe_layers, num_experts, num_experts_per_tok, moe_intermediate, data)`
  → Result<ForwardTrace>. No-plan delegate.
- `forward_qwen3_moe_cuda_traced_with_plan(..., plan: Option<&SaveTensorPlan>)`
  → Result<ForwardTrace>. The plan-aware body.
- New file `crates/aprender-serve/src/gguf/cuda/forward_qwen3_moe_cuda_traced.rs`
  (~430 LOC including doc-comments).
- `include!()` registered in `cuda/uses.rs`.
- Lib-only signature drift gate test
  `forward_qwen3_moe_cuda_traced_signature_drift_gate`. End-to-end
  byte-identity vs production sibling exercised by the heavy
  `qwen3_moe_gpu_parity` test on lambda-vector RTX 4090 against the
  cached 17.3 GB Qwen3-Coder GGUF (M-MOE-SUB-3).

Hot path safety
===============

Production `forward_qwen3_moe_cuda` is unchanged byte-for-byte. This
is a parallel slow path used only by `apr trace --gpu`. The per-token
loop dispatches the GPU MoE FFN identically to production for non-
capture positions; the LAST sequence position uses
`moe_ffn_forward_layer_cuda_with_router` (PR #1522) when the plan
selects MoeRouter or MoeFfnOut so the router weights can be emitted
without recomputation.

Closes M-MOE-SUB-2: (a) + (b) + (c) + (c.gpu)
=============================================

The triplet for M-MOE-SUB-2 is now complete:
- step (a) CPU body: PR #1516 (forward_qwen3_moe_traced_with_plan)
- step (a) CLI wireup: PR #1521 (apr trace --save-tensor for GGUF MoE)
- step (b) GPU body: THIS PR
- step (c) CPU helper: PR #1507 (moe_ffn_forward_layer_with_router)
- step (c.gpu) GPU helper: PR #1522 (moe_ffn_forward_layer_cuda_with_router)

M-MOE-SUB-3 next: heavy parity test on lambda-vector RTX 4090, diff
CPU vs GPU at MoeRouter and MoeFfnOut, identify first divergence.

Falsifier
=========

FALSIFY-MOE-SUB-002 (byte-identity preservation for existing stages).
This PR DOES NOT discharge it (heavy parity test required at M-MOE-
SUB-3); it provides the GPU traced sibling that will run the test.

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml v1.2.0
Refs: M-MOE-SUB-2 step (b)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 6, 2026 00:03
@noahgift noahgift merged commit 690a835 into main May 6, 2026
11 checks passed
@noahgift noahgift deleted the feat/forward-qwen3-moe-cuda-traced branch May 6, 2026 00:27
noahgift added a commit that referenced this pull request May 6, 2026
…-SUB-3 (#1524)

Per `contracts/trace-moe-gpu-sub-stages-v1.yaml` v1.2.0 step
M-MOE-SUB-3: heavy diagnostic test that runs CPU-traced + GPU-traced
forward bodies (M-MOE-SUB-2 step (a) PR #1516 + step (b) PR #1523)
with a SaveTensorPlan capturing MoeRouter and MoeFfnOut for every
layer, then computes per-layer per-stage cosine similarity to identify
the first layer where the GPU diverges from the CPU.
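The per-layer per-stage metric described above is plain cosine similarity over the captured activations. A minimal sketch (the function name is illustrative; the harness's real implementation may differ in dtype handling):

```rust
// Cosine similarity between a CPU-captured and GPU-captured activation.
// Returns 0.0 for a zero-norm vector rather than dividing by zero.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}
```

Scanning layers in order, the first layer whose MoeRouter or MoeFfnOut cosine drops (or goes NaN) is the divergence candidate.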

What ships
==========

- New heavy test `falsify_moe_sub_002_cpu_gpu_traced_per_stage_diff`
  in `crates/aprender-serve/tests/qwen3_moe_gpu_per_stage_diff.rs`.
- 6 light unit tests for the verdict classifier (`Match`, `Diverge`,
  `NanGpu`, `NanCpu`, `NanBoth`, `Missing`) — all 6 pass.
- Skip-if-not-present pattern for the canonical 17.3 GB Qwen3-Coder
  GGUF (matches the existing `qwen3_moe_gpu_parity` test convention).
- `#[ignore]` + `#[cfg(feature = "cuda")]` gating per the repo's
  heavy-test convention.
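The six verdict variants named above suggest a classifier of roughly this shape. This is a sketch under assumptions, not the harness's actual code; the variant names come from the PR, everything else (`classify`, the `Option<&[f32]>` inputs, the threshold parameter) is illustrative:

```rust
#[derive(Debug, PartialEq)]
enum Verdict { Match, Diverge, NanGpu, NanCpu, NanBoth, Missing }

// Classify one (layer, stage) pair: missing capture, NaN on either
// side, or a cosine-threshold match/diverge decision on finite data.
fn classify(cpu: Option<&[f32]>, gpu: Option<&[f32]>, cos: f32, threshold: f32) -> Verdict {
    match (cpu, gpu) {
        (None, _) | (_, None) => Verdict::Missing,
        (Some(c), Some(g)) => {
            let c_nan = c.iter().any(|x| !x.is_finite());
            let g_nan = g.iter().any(|x| !x.is_finite());
            match (c_nan, g_nan) {
                (true, true) => Verdict::NanBoth,
                (true, false) => Verdict::NanCpu,
                (false, true) => Verdict::NanGpu,
                (false, false) => {
                    if cos >= threshold { Verdict::Match } else { Verdict::Diverge }
                }
            }
        }
    }
}
```

Checking NaN before the cosine threshold matters: a cosine against NaN data is itself NaN and would otherwise silently classify as Diverge.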

Invocation
==========

    cargo test -p aprender-serve --features cuda \
      --test qwen3_moe_gpu_per_stage_diff \
      -- --include-ignored --nocapture
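The skip-if-not-present convention mentioned in "What ships" amounts to an early-return guard before the heavy body runs. A hedged sketch (`should_run` and the path are illustrative; the real test follows the existing qwen3_moe_gpu_parity convention):

```rust
use std::path::Path;

// Heavy tests skip rather than fail when the cached GGUF is absent,
// so CI machines without the 17.3 GB artifact stay green.
fn should_run(model_path: &str) -> bool {
    if !Path::new(model_path).exists() {
        eprintln!("SKIP: cached model not found at {model_path}");
        return false;
    }
    true
}
```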

What the harness does NOT do (yet)
==================================

- Does NOT assert pass criteria. The full FALSIFY-QW3-MOE-GPU-PARITY-001
  cosine threshold lives in the existing qwen3_moe_gpu_parity test;
  this is a diagnostic harness for the M-GPU-MOE-1.4 NaN/Inf bisection
  per qwen3-moe-forward-gpu-v1 v1.4.0 amendment_history block.
- Does NOT clean up `/tmp/moe-sub-{cpu,gpu}-<pid>/` dirs (operator
  inspects them for raw bytes if cosine is ambiguous).

Falsifier
=========

FALSIFY-MOE-SUB-002 (byte-identity preservation for existing stages).
This PR ships the harness; an operator-dispatched run on lambda-
vector RTX 4090 + cached 17.3 GB Qwen3-Coder GGUF (~30-60 min wall)
will produce the layer-by-layer divergence table and pinpoint the
first M-GPU-MOE-1.4 bug-origin candidate. Threshold-based discharge
(`cosine ≥ 0.99`) becomes meaningful AFTER the bug is fixed.

M-MOE-SUB-3 status after this PR
================================

- Test harness: SHIPPED (this PR)
- Run on lambda-vector + interpret table: operator-dispatched
- Promote to FALSIFY-MOE-SUB-002 DISCHARGED: gated on M-GPU-MOE-1.4 fix

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml v1.2.0
Refs: M-MOE-SUB-2 step (a) PR #1516, step (b) PR #1523
Refs: qwen3-moe-forward-gpu-v1 v1.4.0 M-GPU-MOE-1.4 NaN/Inf bisection

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 6, 2026
…scade complete on main (#1525)

Promotes status PROPOSED → ACTIVE_ALGORITHM_LEVEL after all 5 cascade
PRs land. M-MOE-SUB-1, M-MOE-SUB-2 (a + b + c + c.gpu), M-MOE-SUB-3
(harness) status: PENDING → SHIPPED. M-MOE-SUB-4 stays PENDING
(optional, only needed if M-MOE-SUB-3's diff doesn't pinpoint at
MoeRouter / MoeFfnOut granularity).

Cited PRs (chronological)
=========================

- #1507 — moe_ffn_forward_layer_with_router (CPU helper, step c)
- #1516 — forward_qwen3_moe_traced_with_plan (CPU body, step a)
- #1521 — apr trace --save-tensor GGUF MoE CLI wireup (step a CLI)
- #1522 — moe_ffn_forward_layer_cuda_with_router (GPU helper, step c.gpu)
- #1523 — forward_qwen3_moe_cuda_traced (GPU body, step b)
- #1524 — heavy diff harness (M-MOE-SUB-3)

What's left
===========

- Operator-dispatched run of `falsify_moe_sub_002_cpu_gpu_traced_per_stage_diff`
  on lambda-vector RTX 4090 + cached 17.3 GB Qwen3-Coder GGUF
  (~30-60 min wall) → produces layer-by-layer divergence table.
- M-MOE-SUB-3 ALGORITHM_LEVEL → FUNCTIONAL upon operator run.
- FALSIFY-MOE-SUB-003 → DISCHARGED gated on M-GPU-MOE-1.4 root-cause fix.

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml
Refs: qwen3-moe-forward-gpu-v1 v1.4.0 M-GPU-MOE-1.4

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
