
feat(aprender-serve): moe_ffn_forward_layer_cuda_with_router — M-MOE-SUB-2 GPU step (c)#1522

Merged
noahgift merged 2 commits into main from feat/moe-ffn-cuda-with-router-v2 on May 5, 2026

Conversation

noahgift (Contributor) commented May 5, 2026

Summary

GPU parallel of M-MOE-SUB-2 step (c) — adds the sibling helper moe_ffn_forward_layer_cuda_with_router that returns both the FFN output AND the post-renormalize top-k router weights.

This unblocks step (b) — the forward_qwen3_moe_cuda_traced sibling (next PR) needs a router-returning GPU helper to capture MoeRouter for the last token without:

  • recomputing the router (drift risk vs production), or
  • falling back to the CPU moe_ffn_forward_layer_with_router (would not measure GPU FFN behavior — the whole point of the GPU traced path is per-stage GPU vs CPU bisection).
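
For context, here is a minimal CPU sketch of the router recipe behind those weights: softmax over expert logits, top-k selection, then renormalization. The function name and shapes are illustrative assumptions, not the aprender-serve code, which fuses this with the per-expert SwiGLU dispatch on the GPU.

```rust
// Illustrative sketch only: softmax -> top-k -> renormalize.
// `router_topk_weights` is a hypothetical name for this writeup.
fn router_topk_weights(logits: &[f32], k: usize) -> (Vec<usize>, Vec<f32>) {
    // Numerically stable softmax over the expert logits.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let probs: Vec<f32> = exps.iter().map(|e| e / sum).collect();

    // Pick the top-k experts by probability.
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    idx.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));
    idx.truncate(k);

    // Renormalize the selected weights to sum to 1: these are the
    // "post-renormalize top-k router weights" the helper returns
    // alongside the FFN output.
    let topk_sum: f32 = idx.iter().map(|&i| probs[i]).sum();
    let weights: Vec<f32> = idx.iter().map(|&i| probs[i] / topk_sum).collect();
    (idx, weights)
}
```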

What ships

  • moe_ffn_forward_layer_cuda_with_router (~140 LOC) — sibling of moe_ffn_forward_layer_cuda. Same router / softmax / top-k / per-expert SwiGLU logic, returns (Vec<f32>, Vec<f32>) instead of Vec<f32>. Mirrors the CPU moe_ffn_forward_layer_with_router (PR #1507) and the production moe_ffn_forward_layer_cuda body byte-for-byte except for the additional return value.
  • Lib-only signature drift gate test moe_ffn_forward_layer_cuda_with_router_signature_drift_gate — a compilation gate (pattern sketched after this list). End-to-end byte-identity vs the production sibling is exercised by the heavy qwen3_moe_gpu_parity test on lambda-vector RTX 4090 against the cached 17.3 GB Qwen3-Coder GGUF (out of scope for unit tests).
  • Contract trace-moe-gpu-sub-stages-v1 v1.1.0 → v1.2.0 records the (c.gpu) sub-step + its blocker relationship to step (b).
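
The drift gate pattern, sketched with an invented parameter list (the real test pins the actual helper, whose signature is not reproduced here):

```rust
// Hypothetical sketch of a signature drift gate. Pinning the helper to an
// explicit fn-pointer type turns any future signature change into a
// compile error, with no GPU, model weights, or runtime needed.
fn moe_ffn_helper(hidden: &[f32], layer: usize) -> (Vec<f32>, Vec<f32>) {
    let _ = (hidden, layer);
    (Vec::new(), Vec::new()) // stub body for the sketch
}

#[test]
fn signature_drift_gate() {
    let _pinned: fn(&[f32], usize) -> (Vec<f32>, Vec<f32>) = moe_ffn_helper;
}
```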

Hot path safety

Production moe_ffn_forward_layer_cuda is unchanged byte-for-byte. Additive-purity invariant holds.

Falsifier

FALSIFY-MOE-SUB-002 (byte-identity preservation for existing stages) remains the contract's nominal acceptance gate. This PR does not discharge it (heavy parity test required); it provides the helper needed to author the GPU traced sibling that will run that test.

Test plan

  • cargo test -p aprender-serve --features cuda --lib moe_ffn_forward_layer_cuda → 2 passed (production + new helper signature drift gates)
  • cargo fmt --check (file-level)
  • pv validate contracts/trace-moe-gpu-sub-stages-v1.yaml → 0 errors, 0 warnings
  • Auto-merge once required checks pass
  • Heavy qwen3_moe_gpu_parity test will exercise byte-identity in M-MOE-SUB-3

🤖 Generated with Claude Code

…SUB-2 GPU step (c)

GPU parallel of M-MOE-SUB-2 step (c) — adds the sibling helper
`moe_ffn_forward_layer_cuda_with_router` that returns both the FFN output
AND the post-renormalize top-k router weights. Unblocks step (b)'s
forward_qwen3_moe_cuda_traced sibling (next PR), which needs a router-
returning GPU helper to capture MoeRouter for the last token without
recomputing the router or falling back to the CPU helper (which would
not measure GPU FFN behavior — the whole point of the GPU traced path
is per-stage GPU vs CPU bisection).

What ships
==========

- `moe_ffn_forward_layer_cuda_with_router` (sibling of
  `moe_ffn_forward_layer_cuda`) — same router/softmax/top-k/per-expert-
  SwiGLU logic, returns `(Vec<f32>, Vec<f32>)` instead of `Vec<f32>`.
  ~140 LOC mirroring the CPU `moe_ffn_forward_layer_with_router`
  helper (PR #1507) and the production `moe_ffn_forward_layer_cuda`
  body byte-for-byte except for the additional return value.
- Lib-only signature drift gate test
  `moe_ffn_forward_layer_cuda_with_router_signature_drift_gate`.
  Compilation gate; end-to-end byte-identity vs production sibling
  is exercised by the heavy `qwen3_moe_gpu_parity` test on lambda-
  vector RTX 4090 against the cached 17.3 GB Qwen3-Coder GGUF
  (out of scope for unit tests).
- Contract `trace-moe-gpu-sub-stages-v1` v1.1.0 → v1.2.0 records
  the (c.gpu) sub-step + its blocker relationship to step (b).

Hot path safety
===============

Production `moe_ffn_forward_layer_cuda` is unchanged byte-for-byte.
Additive-purity invariant holds.

Falsifier
=========

FALSIFY-MOE-SUB-002 (byte-identity preservation for existing stages)
remains the contract's nominal acceptance gate. This PR DOES NOT
discharge it (heavy parity test required); it provides the helper
needed to author the GPU traced sibling that will run that test.

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml v1.2.0
Refs: M-MOE-SUB-2 step (b)/(c.gpu)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift enabled auto-merge (squash) May 5, 2026 21:34
noahgift merged commit 7e20919 into main May 5, 2026 (10 checks passed)
noahgift deleted the feat/moe-ffn-cuda-with-router-v2 branch May 5, 2026 23:20
noahgift added a commit that referenced this pull request May 6, 2026
…p (b) (#1523)

GPU traced sibling of forward_qwen3_moe_cuda. Mirrors the CPU traced
sibling (M32d Step 2 + PR #1516 _with_plan extension) but routes per-
layer MoE FFN through the GPU dispatch so `apr trace --gpu --json
--payload --save-tensor` can run the same SaveTensorPlan against both
CPU and GPU forward paths, capture per-stage activations at MoeRouter
and MoeFfnOut, and bisect the M-GPU-MOE-1.4 NaN/Inf poisoning to its
first divergence point.

What ships
==========

- `OwnedQuantizedModelCuda::forward_qwen3_moe_cuda_traced(token_ids,
  moe_layers, num_experts, num_experts_per_tok, moe_intermediate, data)`
  → Result<ForwardTrace>. No-plan delegate.
- `forward_qwen3_moe_cuda_traced_with_plan(..., plan: Option<&SaveTensorPlan>)`
  → Result<ForwardTrace>. The plan-aware body (call shape sketched after
  this list).
- New file `crates/aprender-serve/src/gguf/cuda/forward_qwen3_moe_cuda_traced.rs`
  (~430 LOC including doc-comments).
- `include!()` registered in `cuda/uses.rs`.
- Lib-only signature drift gate test
  `forward_qwen3_moe_cuda_traced_signature_drift_gate`. End-to-end
  byte-identity vs production sibling exercised by the heavy
  `qwen3_moe_gpu_parity` test on lambda-vector RTX 4090 against the
  cached 17.3 GB Qwen3-Coder GGUF (M-MOE-SUB-3).
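
As promised above, the call shape of the plan-aware entry point, for orientation only. Variable bindings are placeholders, and whether each argument is borrowed or owned here is a guess rather than the checked-in signature:

```rust
// Sketch of the call shape; setup of `model`, `plan`, and `data` is elided.
let trace: ForwardTrace = model.forward_qwen3_moe_cuda_traced_with_plan(
    &token_ids,          // prompt token ids
    &moe_layers,         // indices of layers carrying MoE FFNs
    num_experts,
    num_experts_per_tok,
    moe_intermediate,    // expert FFN hidden size
    &data,
    Some(&plan),         // SaveTensorPlan selecting MoeRouter / MoeFfnOut
)?;
```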

Hot path safety
===============

Production `forward_qwen3_moe_cuda` is unchanged byte-for-byte. This
is a parallel slow path used only by `apr trace --gpu`. The per-token
loop dispatches the GPU MoE FFN identically to production for non-
capture positions; the LAST sequence position uses
`moe_ffn_forward_layer_cuda_with_router` (PR #1522) when the plan
selects MoeRouter or MoeFfnOut so the router weights can be emitted
without recomputation.
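
A shape-only sketch of that dispatch. The `plan.selects(...)` and `trace.record(...)` helpers and the loop scaffolding are invented for illustration; only the two FFN entry points are the real names:

```rust
// Sketch of the per-position MoE FFN dispatch described above.
for pos in 0..seq_len {
    let capture = pos == seq_len - 1
        && plan.map_or(false, |p| {
            p.selects(Stage::MoeRouter) || p.selects(Stage::MoeFfnOut)
        });
    hidden = if capture {
        // Dual-return sibling (PR #1522): emit the router weights
        // for the last position without recomputing the router.
        let (out, router_w) =
            self.moe_ffn_forward_layer_cuda_with_router(&hidden, layer)?;
        trace.record(Stage::MoeRouter, layer, router_w);
        out
    } else {
        // Identical to the production dispatch for all other positions.
        self.moe_ffn_forward_layer_cuda(&hidden, layer)?
    };
}
```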

Closes (a)+(b)+(c.gpu) of M-MOE-SUB-2
=====================================

The full step set for M-MOE-SUB-2 is now complete:
- step (a) CPU body: PR #1516 (forward_qwen3_moe_traced_with_plan)
- step (a) CLI wireup: PR #1521 (apr trace --save-tensor for GGUF MoE)
- step (b) GPU body: THIS PR
- step (c) CPU helper: PR #1507 (moe_ffn_forward_layer_with_router)
- step (c.gpu) GPU helper: PR #1522 (moe_ffn_forward_layer_cuda_with_router)

M-MOE-SUB-3 next: heavy parity test on lambda-vector RTX 4090, diff
CPU vs GPU at MoeRouter and MoeFfnOut, identify first divergence.
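
A minimal sketch of the per-stage comparison that diff could run; the tolerance parameter and the NaN/Inf short-circuit are assumptions about the harness, not its actual code:

```rust
// Given CPU and GPU dumps of the same stage tensor, return the first index
// where they diverge beyond `tol`, or where the GPU value is NaN/Inf.
fn first_divergence(cpu: &[f32], gpu: &[f32], tol: f32) -> Option<(usize, f32, f32)> {
    for (i, (&c, &g)) in cpu.iter().zip(gpu.iter()).enumerate() {
        if !g.is_finite() || (c - g).abs() > tol {
            return Some((i, c, g)); // (index, cpu value, gpu value)
        }
    }
    None
}
```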

Falsifier
=========

FALSIFY-MOE-SUB-002 (byte-identity preservation for existing stages).
This PR DOES NOT discharge it (heavy parity test required at M-MOE-
SUB-3); it provides the GPU traced sibling that will run the test.

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml v1.2.0
Refs: M-MOE-SUB-2 step (b)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 6, 2026
…scade complete on main (#1525)

Promotes status PROPOSED → ACTIVE_ALGORITHM_LEVEL after all 5 cascade
PRs land. M-MOE-SUB-1, M-MOE-SUB-2 (a + b + c + c.gpu), M-MOE-SUB-3
(harness) status: PENDING → SHIPPED. M-MOE-SUB-4 stays PENDING
(optional, only needed if M-MOE-SUB-3's diff doesn't pinpoint at
MoeRouter / MoeFfnOut granularity).

Cited PRs (chronological)
=========================

- #1507 — moe_ffn_forward_layer_with_router (CPU helper, step c)
- #1516 — forward_qwen3_moe_traced_with_plan (CPU body, step a)
- #1521 — apr trace --save-tensor GGUF MoE CLI wireup (step a CLI)
- #1522 — moe_ffn_forward_layer_cuda_with_router (GPU helper, step c.gpu)
- #1523 — forward_qwen3_moe_cuda_traced (GPU body, step b)
- #1524 — heavy diff harness (M-MOE-SUB-3)

What's left
===========

- Operator-dispatched run of `falsify_moe_sub_002_cpu_gpu_traced_per_stage_diff`
  on lambda-vector RTX 4090 + cached 17.3 GB Qwen3-Coder GGUF
  (~30-60 min wall) → produces layer-by-layer divergence table.
- M-MOE-SUB-3 ALGORITHM_LEVEL → FUNCTIONAL upon operator run.
- FALSIFY-MOE-SUB-003 → DISCHARGED gated on M-GPU-MOE-1.4 root-cause fix.

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml
Refs: qwen3-moe-forward-gpu-v1 v1.4.0 M-GPU-MOE-1.4

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>