
feat(aprender-serve): moe_ffn_forward_layer_cuda_with_router — M-MOE-SUB-2 GPU step (c)#1522

Merged
noahgift merged 2 commits into main from feat/moe-ffn-cuda-with-router-v2 on May 5, 2026

Conversation

noahgift (Contributor) commented May 5, 2026

Summary

GPU parallel of M-MOE-SUB-2 step (c) — adds the sibling helper moe_ffn_forward_layer_cuda_with_router that returns both the FFN output AND the post-renormalize top-k router weights.

This unblocks step (b) — the forward_qwen3_moe_cuda_traced sibling (next PR) needs a router-returning GPU helper to capture MoeRouter for the last token without:

  • recomputing the router (drift risk vs production), or
  • falling back to the CPU moe_ffn_forward_layer_with_router (would not measure GPU FFN behavior — the whole point of the GPU traced path is per-stage GPU vs CPU bisection).
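
For context, here is a minimal CPU sketch of the router recipe behind those weights: softmax over expert logits, top-k selection, then renormalization. The function name and shapes are illustrative assumptions, not the aprender-serve code, which fuses this with the per-expert SwiGLU dispatch on the GPU.

```rust
// Illustrative sketch only: softmax -> top-k -> renormalize.
// `router_topk_weights` is a hypothetical name for this writeup.
fn router_topk_weights(logits: &[f32], k: usize) -> (Vec<usize>, Vec<f32>) {
    // Numerically stable softmax over the expert logits.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let probs: Vec<f32> = exps.iter().map(|e| e / sum).collect();

    // Pick the top-k experts by probability.
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    idx.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));
    idx.truncate(k);

    // Renormalize the selected weights to sum to 1: these are the
    // "post-renormalize top-k router weights" the helper returns
    // alongside the FFN output.
    let topk_sum: f32 = idx.iter().map(|&i| probs[i]).sum();
    let weights: Vec<f32> = idx.iter().map(|&i| probs[i] / topk_sum).collect();
    (idx, weights)
}
```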

What ships

  • moe_ffn_forward_layer_cuda_with_router (~140 LOC) — sibling of moe_ffn_forward_layer_cuda. Same router / softmax / top-k / per-expert SwiGLU logic, returns (Vec<f32>, Vec<f32>) instead of Vec<f32>. Mirrors the CPU moe_ffn_forward_layer_with_router (PR #1507) and the production moe_ffn_forward_layer_cuda body byte-for-byte except for the additional return value.
  • Lib-only signature drift gate test moe_ffn_forward_layer_cuda_with_router_signature_drift_gate — a compilation gate (pattern sketched after this list). End-to-end byte-identity vs the production sibling is exercised by the heavy qwen3_moe_gpu_parity test on lambda-vector RTX 4090 against the cached 17.3 GB Qwen3-Coder GGUF (out of scope for unit tests).
  • Contract trace-moe-gpu-sub-stages-v1 v1.1.0 → v1.2.0 records the (c.gpu) sub-step + its blocker relationship to step (b).
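
The drift gate pattern, sketched with an invented parameter list (the real test pins the actual helper, whose signature is not reproduced here):

```rust
// Hypothetical sketch of a signature drift gate. Pinning the helper to an
// explicit fn-pointer type turns any future signature change into a
// compile error, with no GPU, model weights, or runtime needed.
fn moe_ffn_helper(hidden: &[f32], layer: usize) -> (Vec<f32>, Vec<f32>) {
    let _ = (hidden, layer);
    (Vec::new(), Vec::new()) // stub body for the sketch
}

#[test]
fn signature_drift_gate() {
    let _pinned: fn(&[f32], usize) -> (Vec<f32>, Vec<f32>) = moe_ffn_helper;
}
```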

Hot path safety

Production moe_ffn_forward_layer_cuda is unchanged byte-for-byte. Additive-purity invariant holds.

Falsifier

FALSIFY-MOE-SUB-002 (byte-identity preservation for existing stages) remains the contract's nominal acceptance gate. This PR does not discharge it (heavy parity test required); it provides the helper needed to author the GPU traced sibling that will run that test.

Test plan

  • cargo test -p aprender-serve --features cuda --lib moe_ffn_forward_layer_cuda → 2 passed (production + new helper signature drift gates)
  • cargo fmt --check (file-level)
  • pv validate contracts/trace-moe-gpu-sub-stages-v1.yaml → 0 errors, 0 warnings
  • Auto-merge once required checks pass
  • Heavy qwen3_moe_gpu_parity test will exercise byte-identity in M-MOE-SUB-3

🤖 Generated with Claude Code

…SUB-2 GPU step (c)

GPU parallel of M-MOE-SUB-2 step (c) — adds the sibling helper
`moe_ffn_forward_layer_cuda_with_router` that returns both the FFN output
AND the post-renormalize top-k router weights. Unblocks step (b)'s
forward_qwen3_moe_cuda_traced sibling (next PR), which needs a router-
returning GPU helper to capture MoeRouter for the last token without
recomputing the router or falling back to the CPU helper (which would
not measure GPU FFN behavior — the whole point of the GPU traced path
is per-stage GPU vs CPU bisection).

What ships
==========

- `moe_ffn_forward_layer_cuda_with_router` (sibling of
  `moe_ffn_forward_layer_cuda`) — same router/softmax/top-k/per-expert-
  SwiGLU logic, returns `(Vec<f32>, Vec<f32>)` instead of `Vec<f32>`.
  ~140 LOC mirroring the CPU `moe_ffn_forward_layer_with_router`
  helper (PR #1507) and the production `moe_ffn_forward_layer_cuda`
  body byte-for-byte except for the additional return value.
- Lib-only signature drift gate test
  `moe_ffn_forward_layer_cuda_with_router_signature_drift_gate`.
  Compilation gate; end-to-end byte-identity vs production sibling
  is exercised by the heavy `qwen3_moe_gpu_parity` test on lambda-
  vector RTX 4090 against the cached 17.3 GB Qwen3-Coder GGUF
  (out of scope for unit tests).
- Contract `trace-moe-gpu-sub-stages-v1` v1.1.0 → v1.2.0 records
  the (c.gpu) sub-step + its blocker relationship to step (b).

Hot path safety
===============

Production `moe_ffn_forward_layer_cuda` is unchanged byte-for-byte.
Additive-purity invariant holds.

Falsifier
=========

FALSIFY-MOE-SUB-002 (byte-identity preservation for existing stages)
remains the contract's nominal acceptance gate. This PR DOES NOT
discharge it (heavy parity test required); it provides the helper
needed to author the GPU traced sibling that will run that test.

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml v1.2.0
Refs: M-MOE-SUB-2 step (b)/(c.gpu)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift enabled auto-merge (squash) May 5, 2026 21:34
noahgift merged commit 7e20919 into main May 5, 2026 (10 checks passed)
noahgift deleted the feat/moe-ffn-cuda-with-router-v2 branch May 5, 2026 23:20
noahgift added a commit that referenced this pull request May 6, 2026
…p (b) (#1523)

GPU traced sibling of forward_qwen3_moe_cuda. Mirrors the CPU traced
sibling (M32d Step 2 + PR #1516 _with_plan extension) but routes per-
layer MoE FFN through the GPU dispatch so `apr trace --gpu --json
--payload --save-tensor` can run the same SaveTensorPlan against both
CPU and GPU forward paths, capture per-stage activations at MoeRouter
and MoeFfnOut, and bisect the M-GPU-MOE-1.4 NaN/Inf poisoning to its
first divergence point.

What ships
==========

- `OwnedQuantizedModelCuda::forward_qwen3_moe_cuda_traced(token_ids,
  moe_layers, num_experts, num_experts_per_tok, moe_intermediate, data)`
  → Result<ForwardTrace>. No-plan delegate.
- `forward_qwen3_moe_cuda_traced_with_plan(..., plan: Option<&SaveTensorPlan>)`
  → Result<ForwardTrace>. The plan-aware body (call shape sketched after
  this list).
- New file `crates/aprender-serve/src/gguf/cuda/forward_qwen3_moe_cuda_traced.rs`
  (~430 LOC including doc-comments).
- `include!()` registered in `cuda/uses.rs`.
- Lib-only signature drift gate test
  `forward_qwen3_moe_cuda_traced_signature_drift_gate`. End-to-end
  byte-identity vs production sibling exercised by the heavy
  `qwen3_moe_gpu_parity` test on lambda-vector RTX 4090 against the
  cached 17.3 GB Qwen3-Coder GGUF (M-MOE-SUB-3).
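
As promised above, the call shape of the plan-aware entry point, for orientation only. Variable bindings are placeholders, and whether each argument is borrowed or owned here is a guess rather than the checked-in signature:

```rust
// Sketch of the call shape; setup of `model`, `plan`, and `data` is elided.
let trace: ForwardTrace = model.forward_qwen3_moe_cuda_traced_with_plan(
    &token_ids,          // prompt token ids
    &moe_layers,         // indices of layers carrying MoE FFNs
    num_experts,
    num_experts_per_tok,
    moe_intermediate,    // expert FFN hidden size
    &data,
    Some(&plan),         // SaveTensorPlan selecting MoeRouter / MoeFfnOut
)?;
```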

Hot path safety
===============

Production `forward_qwen3_moe_cuda` is unchanged byte-for-byte. This
is a parallel slow path used only by `apr trace --gpu`. The per-token
loop dispatches the GPU MoE FFN identically to production for non-
capture positions; the LAST sequence position uses
`moe_ffn_forward_layer_cuda_with_router` (PR #1522) when the plan
selects MoeRouter or MoeFfnOut so the router weights can be emitted
without recomputation.
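
A shape-only sketch of that dispatch. The `plan.selects(...)` and `trace.record(...)` helpers and the loop scaffolding are invented for illustration; only the two FFN entry points are the real names:

```rust
// Sketch of the per-position MoE FFN dispatch described above.
for pos in 0..seq_len {
    let capture = pos == seq_len - 1
        && plan.map_or(false, |p| {
            p.selects(Stage::MoeRouter) || p.selects(Stage::MoeFfnOut)
        });
    hidden = if capture {
        // Dual-return sibling (PR #1522): emit the router weights
        // for the last position without recomputing the router.
        let (out, router_w) =
            self.moe_ffn_forward_layer_cuda_with_router(&hidden, layer)?;
        trace.record(Stage::MoeRouter, layer, router_w);
        out
    } else {
        // Identical to the production dispatch for all other positions.
        self.moe_ffn_forward_layer_cuda(&hidden, layer)?
    };
}
```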

Closes (a)+(b)+(c.gpu) of M-MOE-SUB-2
=====================================

The full step set for M-MOE-SUB-2 is now complete:
- step (a) CPU body: PR #1516 (forward_qwen3_moe_traced_with_plan)
- step (a) CLI wireup: PR #1521 (apr trace --save-tensor for GGUF MoE)
- step (b) GPU body: THIS PR
- step (c) CPU helper: PR #1507 (moe_ffn_forward_layer_with_router)
- step (c.gpu) GPU helper: PR #1522 (moe_ffn_forward_layer_cuda_with_router)

M-MOE-SUB-3 next: heavy parity test on lambda-vector RTX 4090, diff
CPU vs GPU at MoeRouter and MoeFfnOut, identify first divergence.
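
A minimal sketch of the per-stage comparison that diff could run; the tolerance parameter and the NaN/Inf short-circuit are assumptions about the harness, not its actual code:

```rust
// Given CPU and GPU dumps of the same stage tensor, return the first index
// where they diverge beyond `tol`, or where the GPU value is NaN/Inf.
fn first_divergence(cpu: &[f32], gpu: &[f32], tol: f32) -> Option<(usize, f32, f32)> {
    for (i, (&c, &g)) in cpu.iter().zip(gpu.iter()).enumerate() {
        if !g.is_finite() || (c - g).abs() > tol {
            return Some((i, c, g)); // (index, cpu value, gpu value)
        }
    }
    None
}
```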

Falsifier
=========

FALSIFY-MOE-SUB-002 (byte-identity preservation for existing stages).
This PR DOES NOT discharge it (heavy parity test required at M-MOE-
SUB-3); it provides the GPU traced sibling that will run the test.

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml v1.2.0
Refs: M-MOE-SUB-2 step (b)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 6, 2026
…scade complete on main (#1525)

Promotes status PROPOSED → ACTIVE_ALGORITHM_LEVEL after all 5 cascade
PRs land. M-MOE-SUB-1, M-MOE-SUB-2 (a + b + c + c.gpu), M-MOE-SUB-3
(harness) status: PENDING → SHIPPED. M-MOE-SUB-4 stays PENDING
(optional, only needed if M-MOE-SUB-3's diff doesn't pinpoint at
MoeRouter / MoeFfnOut granularity).

Cited PRs (chronological)
=========================

- #1507 — moe_ffn_forward_layer_with_router (CPU helper, step c)
- #1516 — forward_qwen3_moe_traced_with_plan (CPU body, step a)
- #1521 — apr trace --save-tensor GGUF MoE CLI wireup (step a CLI)
- #1522 — moe_ffn_forward_layer_cuda_with_router (GPU helper, step c.gpu)
- #1523 — forward_qwen3_moe_cuda_traced (GPU body, step b)
- #1524 — heavy diff harness (M-MOE-SUB-3)

What's left
===========

- Operator-dispatched run of `falsify_moe_sub_002_cpu_gpu_traced_per_stage_diff`
  on lambda-vector RTX 4090 + cached 17.3 GB Qwen3-Coder GGUF
  (~30-60 min wall) → produces layer-by-layer divergence table.
- M-MOE-SUB-3 ALGORITHM_LEVEL → FUNCTIONAL upon operator run.
- FALSIFY-MOE-SUB-003 → DISCHARGED gated on M-GPU-MOE-1.4 root-cause fix.

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml
Refs: qwen3-moe-forward-gpu-v1 v1.4.0 M-GPU-MOE-1.4

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>