
feat(aprender-serve): moe_ffn_forward_layer_with_router — M-MOE-SUB-2 step (c) #1507

Merged

noahgift merged 1 commit into main from
feat/moe-ffn-forward-layer-with-router-helper
May 5, 2026

Conversation


noahgift (Contributor) commented May 5, 2026

Summary

Authors `moe_ffn_forward_layer_with_router` as an additive-pure sibling of `moe_ffn_forward_layer` in `crates/aprender-serve/src/gguf/qwen3_moe_load.rs`. It returns `(output, router_top_k_weights)`, enabling traced forward bodies to capture both the MoeFfnOut and MoeRouter SaveTensorStage stages in one pass.

Per `contracts/trace-moe-gpu-sub-stages-v1.yaml` v1.1.0 (M64 companion-spec record) step (c).

What ships

  • New 145-line function `moe_ffn_forward_layer_with_router` (mirrors the production logic; captures router weights as a `Vec` from the internal `topk_renorm`).
  • Two unit tests validating input-shape/qtype error symmetry against the production sibling.

Hot path safety

Production `moe_ffn_forward_layer` is unchanged byte-for-byte. The helper is a parallel slow path consumed only by traced `apr trace --save-tensor` runs. The code duplication is intentional per the v1.1.0 additive-purity invariant: "MUST NOT modify production forward_qwen3_moe / forward_qwen3_moe_cuda hot paths".

Verification

```
$ cargo test -p aprender-serve --release --lib gguf::qwen3_moe_load::tests
8 passed (including 2 new helper tests)
$ cargo clippy -p aprender-serve --lib --release -- -D warnings
clean
$ rustfmt --check crates/aprender-serve/src/gguf/qwen3_moe_load.rs
clean
```

What this does NOT ship

  • M-MOE-SUB-2 step (a): wire helper into `forward_qwen3_moe_traced` (separate PR).
  • M-MOE-SUB-2 step (b): NEW `forward_qwen3_moe_cuda_traced.rs` GPU sibling (separate PR).
  • Real-GGUF byte-identity test (exercised via heavy parity tests on lambda-vector RTX 4090).

Test plan

  • 2 new unit tests pass
  • All 8 `qwen3_moe_load::tests` pass
  • Clippy clean
  • rustfmt clean
  • CI gate green
  • workspace-test green
  • Auto-merge fires on green CI

Refs: `contracts/trace-moe-gpu-sub-stages-v1.yaml` v1.1.0 (PR #1503 squash 8c4c6d5),
M-MOE-SUB-2 step (c) implementation,
MEMORY.md M-GPU-MOE cascade

🤖 Generated with Claude Code

… step (c)

Author an additive-pure sibling of `moe_ffn_forward_layer` per
`contracts/trace-moe-gpu-sub-stages-v1.yaml` v1.1.0 step (c).

## What ships

`moe_ffn_forward_layer_with_router` returns `(output, router_top_k_weights)`:

  - `output: Vec<f32>` — `[hidden_dim]` aggregated MoE FFN output
    (the MoeFfnOut SaveTensorStage capture target).
  - `router_top_k_weights: Vec<f32>` — `[num_experts_per_tok]`
    top-k expert weights after softmax and renormalization (the
    MoeRouter SaveTensorStage capture target).

The helper enables traced forward bodies (M-MOE-SUB-2 steps a/b
upcoming) to capture both `MoeRouter` and `MoeFfnOut` stages without a
second router computation.
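
For intuition, here is a minimal self-contained sketch of the router
math being captured (softmax over expert logits, top-k selection,
renormalization). Illustrative only; the crate's actual `topk_renorm`
may differ in detail, and the pairing with expert indices is an
assumption for readability:

```
fn router_top_k_weights(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    // Numerically stable softmax over all expert logits.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();

    // Pair each expert index with its probability, keep the top k.
    let mut probs: Vec<(usize, f32)> = exps
        .iter()
        .enumerate()
        .map(|(i, &e)| (i, e / sum))
        .collect();
    probs.sort_by(|a, b| b.1.total_cmp(&a.1));
    probs.truncate(k);

    // Renormalize the kept weights to sum to 1: these are the
    // `router_top_k_weights` half of the returned tuple.
    let kept: f32 = probs.iter().map(|&(_, w)| w).sum();
    for p in probs.iter_mut() {
        p.1 /= kept;
    }
    probs
}
```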

## Hot path safety

Production `moe_ffn_forward_layer` is unchanged byte-for-byte. The
helper duplicates the router/softmax/top-k logic to satisfy the
v1.1.0 amendment's additive-purity invariant: "MUST NOT modify
production forward_qwen3_moe / forward_qwen3_moe_cuda hot paths".

Drift between sibling functions is mitigated by:
1. Two new unit tests asserting that the helper's input-validation
   error messages match the production sibling's error class for the
   same shape/qtype boundary violations (`hidden.len() != hidden_dim`
   and `router qtype != F32`); see the sketch after this list.
2. End-to-end byte-identity for realistic GGUF inputs, exercised by
   the heavy parity tests at `qwen3_moe_gpu_parity.rs` (out of scope
   for unit tests since they require the cached 17.3 GB
   Qwen3-Coder-30B-A3B-Instruct GGUF on lambda-vector RTX 4090).
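
A toy sketch of that error-symmetry test shape, with stand-in
functions whose duplicated validation mirrors the intentional sibling
duplication (the real signatures and error types are not reproduced
here):

```
fn production_toy(hidden: &[f32], dim: usize) -> Result<Vec<f32>, String> {
    if hidden.len() != dim {
        return Err(format!("hidden.len() {} != hidden_dim {}", hidden.len(), dim));
    }
    Ok(hidden.to_vec())
}

fn with_router_toy(hidden: &[f32], dim: usize) -> Result<(Vec<f32>, Vec<f32>), String> {
    // Duplicated validation, mirroring the additive-pure sibling.
    if hidden.len() != dim {
        return Err(format!("hidden.len() {} != hidden_dim {}", hidden.len(), dim));
    }
    Ok((hidden.to_vec(), vec![1.0]))
}

#[test]
fn siblings_reject_bad_hidden_len_identically() {
    let bad = vec![0.0_f32; 7]; // dim is 8 below, so both must fail
    assert_eq!(
        production_toy(&bad, 8).unwrap_err(),
        with_router_toy(&bad, 8).unwrap_err()
    );
}
```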

## What this discharges

- FALSIFY-MOE-SUB-002 (byte-identity preservation): partial — the
  helper exists and validates inputs symmetrically with production.
  Full discharge needs M-MOE-SUB-2 steps (a)+(b) to wire it into
  traced CPU + GPU forward paths.

## Verification

  $ cargo test -p aprender-serve --release --lib gguf::qwen3_moe_load::tests
    8 passed (including 2 new helper tests)
  $ cargo clippy -p aprender-serve --lib --release -- -D warnings
    clean
  $ rustfmt --check crates/aprender-serve/src/gguf/qwen3_moe_load.rs
    clean

## What this does NOT ship

- M-MOE-SUB-2 step (a): extending `forward_qwen3_moe_traced` to call
  this helper (CPU-side traced wireup) — separate PR.
- M-MOE-SUB-2 step (b): NEW `forward_qwen3_moe_cuda_traced.rs` GPU
  sibling — separate PR.
- Real-GGUF byte-identity test — exercised end-to-end via heavy
  parity tests, not unit tests.

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml v1.1.0
      (M64 companion-spec record, aprender PR #1503)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift enabled auto-merge (squash) May 5, 2026 08:45
noahgift merged commit 0f22c78 into main May 5, 2026
11 checks passed
noahgift deleted the feat/moe-ffn-forward-layer-with-router-helper branch May 5, 2026 09:22
noahgift added a commit that referenced this pull request May 5, 2026
…2 step (a) (#1516)

Wires SaveTensorStage::MoeRouter + MoeFfnOut emission into the CPU
traced MoE forward path per
`contracts/trace-moe-gpu-sub-stages-v1.yaml` v1.1.0 step (a).

## Design

Adds a new method `forward_qwen3_moe_traced_with_plan` accepting
`Option<&SaveTensorPlan>`. The existing `forward_qwen3_moe_traced`
becomes a thin one-line delegate passing `None` — public API
unchanged, zero-cost when no plan is set (a single Option
discriminant check).

When the plan selects MoeRouter or MoeFfnOut for a given layer, the
last sequence position's MoE forward is dispatched through
`moe_ffn_forward_layer_with_router` (M68 helper) to obtain top-k
router weights without re-running the MoE forward. All other
sequence positions (and the last position when neither stage is
selected) continue using the production `moe_ffn_forward_layer` so
trace cost stays minimal.
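
A minimal sketch of that delegate pattern, with toy types standing in
for the crate's `SaveTensorPlan` / `ForwardTrace` and the bodies
reduced to the dispatch decision (illustrative, not the actual code):

```
struct SaveTensorPlan;
struct ForwardTrace;

// Public API unchanged: thin one-line delegate passing `None`.
fn forward_qwen3_moe_traced(token_ids: &[u32]) -> ForwardTrace {
    forward_qwen3_moe_traced_with_plan(token_ids, None)
}

fn forward_qwen3_moe_traced_with_plan(
    token_ids: &[u32],
    plan: Option<&SaveTensorPlan>,
) -> ForwardTrace {
    // Zero-cost when no plan is set: a single Option discriminant
    // check decides whether the plan-aware capture path runs at all.
    if plan.is_some() {
        // Plan-aware path: the LAST position's MoE forward would be
        // dispatched through moe_ffn_forward_layer_with_router here.
    }
    let _ = token_ids;
    ForwardTrace
}
```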

## What this discharges

Per FALSIFY-MOE-SUB-002 contract:

- Helper byte-identity preserved: `moe_ffn_forward_layer_with_router`
  produces the same `output` Vec as production (asserted by step c
  M68 unit tests).
- Production `forward_qwen3_moe` / `forward_qwen3_moe_cuda` hot paths
  unchanged byte-for-byte.
- `forward_qwen3_moe_traced` public API unchanged (delegate
  pattern).
- Plan-aware code path emits MoeRouter as `[num_experts_per_tok]`
  + MoeFfnOut as `[hidden_dim]` to disk via existing
  `maybe_save_stage` machinery (same machinery used by
  `forward_traced_with_plan` for SHIP-007 SaveTensor).

Full discharge of FALSIFY-MOE-SUB-002 needs M-MOE-SUB-2 step (b)
(GPU sibling) + M-MOE-SUB-3 (live bisection on lambda-vector RTX
4090) + M-GPU-MOE-1.4 (fix at bisected stage).

## Verification

  $ cargo build -p aprender-serve --release
    clean
  $ cargo test -p aprender-serve --release --lib gguf::qwen3_moe_load
    8 passed
  $ cargo clippy -p aprender-serve --lib --release -- -D warnings
    clean
  $ rustfmt --check forward_qwen3_moe_traced.rs
    clean

## What this does NOT ship

- M-MOE-SUB-2 step (b): NEW `forward_qwen3_moe_cuda_traced.rs`
  GPU sibling — separate PR.
- Wiring in `apr trace` CLI dispatch site to actually pass a plan
  through to `forward_qwen3_moe_traced_with_plan` — separate PR
  (current `apr trace` for MoE still calls `forward_qwen3_moe_traced`
  with no plan).
- End-to-end SaveTensor verification on lambda-vector RTX 4090 —
  exercised via M-MOE-SUB-3.

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml v1.1.0 step (a),
      M68 helper PR #1507 (squash 0f22c78),
      M-GPU-MOE-1.4 NaN/Inf bisection plan

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 5, 2026
… (#1521)

M-MOE-SUB-2 step (a) CLI completion: connects the clap surface
(--save-tensor / --save-tensor-layers / --save-tensor-dir, PR-A
#1405) through to forward_qwen3_moe_traced_with_plan (M74, PR
#1516 squash 3138d13) for Qwen3-MoE-arch GGUF models.

## What ships

1. New pub fn `run_save_tensor_gguf_moe(path, stages, dir,
   layers)` in `crates/apr-cli/src/commands/trace_save_tensor.rs`.
   Mirrors the structure of the existing `run_save_tensor_apr` for
   APR models, but loads via `MappedGGUFModel` /
   `OwnedQuantizedModel`, validates `qwen3_moe` arch (rejects
   dense GGUF with a clear error), reads MoE config from GGUF
   metadata (`expert_count`, `expert_used_count`,
   `expert_feed_forward_length`), loads per-layer
   `Qwen3MoeQuantizedLayer` descriptors, then dispatches to
   `forward_qwen3_moe_traced_with_plan` with the plan derived from
   the CLI args.

2. Dispatch wireup in `dispatch.rs::dispatch_diagnostic_commands`
   under the `Commands::Trace` arm. The previous code dispatched
   `--save-tensor` for `.apr` only and printed a stub for other
   extensions; now `.gguf` dispatches to the new
   `run_save_tensor_gguf_moe` function. Other extensions
   (.safetensors) still print the stub pending SHIP-007 PR-E.
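
   Illustratively, the extension dispatch rule reads roughly like
   this toy sketch (the real arm lives under `Commands::Trace` in
   `dispatch.rs` and calls the run_save_tensor_* functions rather
   than printing):

```
use std::path::Path;

fn dispatch_save_tensor(path: &Path) {
    match path.extension().and_then(|e| e.to_str()) {
        // .apr: existing run_save_tensor_apr path.
        Some("apr") => println!("-> run_save_tensor_apr"),
        // .gguf: NEW, dispatches to run_save_tensor_gguf_moe.
        Some("gguf") => println!("-> run_save_tensor_gguf_moe"),
        // .safetensors and everything else still print the stub
        // pending SHIP-007 PR-E.
        _ => println!("--save-tensor: not yet supported (stub)"),
    }
}
```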

## What this does NOT ship

- Dense GGUF SaveTensor wireup (still falls through to stub).
- M-MOE-SUB-2 step (b) GPU sibling
  `forward_qwen3_moe_cuda_traced.rs` — separate PR.
- Live verification on lambda-vector RTX 4090 against the cached
  17.3 GB Qwen3-Coder GGUF — exercised in M-MOE-SUB-3.

## Hot path safety

Production `forward_qwen3_moe` / `forward_qwen3_moe_cuda` hot
paths are unchanged byte-for-byte (additive-purity invariant pinned
in v1.1.0). Production `forward_qwen3_moe_traced` (no plan) is also
unchanged — the new wireup uses the M74 sibling
`forward_qwen3_moe_traced_with_plan`.

## Verification

  $ cargo build -p apr-cli --release --features inference
    clean
  $ cargo clippy -p apr-cli --lib --release --features inference \
      -- -D warnings
    clean
  $ rustfmt --check trace_save_tensor.rs dispatch.rs
    clean
  $ cargo test -p apr-cli --release --lib commands::trace_save_tensor
    5 passed (existing tests preserved)

## Falsifier impact

- FALSIFY-MOE-SUB-002 (byte-identity preservation): still
  partial — needs M-MOE-SUB-2 step (b) GPU sibling for full
  discharge.
- M-MOE-SUB-3 live bisection: now unblocked operationally —
  invoking `apr trace --save-tensor moe_router,moe_ffn_out
  --save-tensor-layers 0..48 --save-tensor-dir <dir> <gguf>` on
  lambda-vector RTX 4090 will produce per-layer MoeRouter
  + MoeFfnOut tensor files for the cached Qwen3-Coder GGUF,
  ready to diff against the GPU sibling output once step (b) ships.

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml v1.1.0 step (a)
      CLI completion,
      M68 helper PR #1507 (squash 0f22c78),
      M74 forward_qwen3_moe_traced_with_plan PR #1516
      (squash 3138d13)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 5, 2026
…SUB-2 GPU step (c) (#1522)

GPU parallel of M-MOE-SUB-2 step (c): adds the sibling helper
`moe_ffn_forward_layer_cuda_with_router`, which returns both the FFN
output AND the post-renormalization top-k router weights. This
unblocks step (b)'s forward_qwen3_moe_cuda_traced sibling (next PR),
which needs a router-returning GPU helper to capture MoeRouter for
the last token without recomputing the router or falling back to the
CPU helper. A CPU fallback would not measure GPU FFN behavior, and
the whole point of the GPU traced path is per-stage GPU vs CPU
bisection.

What ships
==========

- `moe_ffn_forward_layer_cuda_with_router` (sibling of
  `moe_ffn_forward_layer_cuda`) — same router/softmax/top-k/per-expert-
  SwiGLU logic, returns `(Vec<f32>, Vec<f32>)` instead of `Vec<f32>`.
  ~140 LOC mirroring the CPU `moe_ffn_forward_layer_with_router`
  helper (PR #1507) and the production `moe_ffn_forward_layer_cuda`
  body byte-for-byte except for the additional return value.
- Lib-only signature drift gate test
  `moe_ffn_forward_layer_cuda_with_router_signature_drift_gate`, a
  compilation gate (see the sketch after this list); end-to-end
  byte-identity against the production sibling is exercised by the
  heavy `qwen3_moe_gpu_parity` test on lambda-vector RTX 4090
  against the cached 17.3 GB Qwen3-Coder GGUF (out of scope for
  unit tests).
- Contract `trace-moe-gpu-sub-stages-v1` v1.1.0 → v1.2.0 records
  the (c.gpu) sub-step + its blocker relationship to step (b).
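
The drift-gate pattern, sketched with a toy signature (the real
helper's parameter list is longer and not reproduced here): coerce
the function to an explicit fn-pointer type so any signature change
breaks compilation.

```
// Toy stand-in for moe_ffn_forward_layer_cuda_with_router.
fn moe_helper_toy(hidden: &[f32], k: usize) -> (Vec<f32>, Vec<f32>) {
    (hidden.to_vec(), vec![1.0 / k as f32; k])
}

#[test]
fn moe_helper_toy_signature_drift_gate() {
    // If the signature drifts, this coercion stops compiling.
    let _gate: fn(&[f32], usize) -> (Vec<f32>, Vec<f32>) = moe_helper_toy;
}
```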

Hot path safety
===============

Production `moe_ffn_forward_layer_cuda` is unchanged byte-for-byte.
Additive-purity invariant holds.

Falsifier
=========

FALSIFY-MOE-SUB-002 (byte-identity preservation for existing stages)
remains the contract's nominal acceptance gate. This PR DOES NOT
discharge it (heavy parity test required); it provides the helper
needed to author the GPU traced sibling that will run that test.

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml v1.2.0
Refs: M-MOE-SUB-2 step (b)/(c.gpu)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 6, 2026
…p (b) (#1523)

GPU traced sibling of forward_qwen3_moe_cuda. Mirrors the CPU traced
sibling (M32d Step 2 + PR #1516 _with_plan extension) but routes per-
layer MoE FFN through the GPU dispatch so `apr trace --gpu --json
--payload --save-tensor` can run the same SaveTensorPlan against both
CPU and GPU forward paths, capture per-stage activations at MoeRouter
and MoeFfnOut, and bisect the M-GPU-MOE-1.4 NaN/Inf poisoning to its
first divergence point.

What ships
==========

- `OwnedQuantizedModelCuda::forward_qwen3_moe_cuda_traced(token_ids,
  moe_layers, num_experts, num_experts_per_tok, moe_intermediate, data)`
  → Result<ForwardTrace>. No-plan delegate.
- `forward_qwen3_moe_cuda_traced_with_plan(..., plan: Option<&SaveTensorPlan>)`
  → Result<ForwardTrace>. The plan-aware body.
- New file `crates/aprender-serve/src/gguf/cuda/forward_qwen3_moe_cuda_traced.rs`
  (~430 LOC including doc-comments).
- `include!()` registered in `cuda/uses.rs`.
- Lib-only signature drift gate test
  `forward_qwen3_moe_cuda_traced_signature_drift_gate`. End-to-end
  byte-identity vs production sibling exercised by the heavy
  `qwen3_moe_gpu_parity` test on lambda-vector RTX 4090 against the
  cached 17.3 GB Qwen3-Coder GGUF (M-MOE-SUB-3).

Hot path safety
===============

Production `forward_qwen3_moe_cuda` is unchanged byte-for-byte. This
is a parallel slow path used only by `apr trace --gpu`. The per-token
loop dispatches the GPU MoE FFN identically to production for non-
capture positions; the LAST sequence position uses
`moe_ffn_forward_layer_cuda_with_router` (PR #1522) when the plan
selects MoeRouter or MoeFfnOut so the router weights can be emitted
without recomputation.
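
Sketching that per-position dispatch with toy stand-ins (the loop
shape and all names beyond those in this message are assumptions):

```
fn ffn_cuda(hidden: &[f32]) -> Vec<f32> {
    hidden.to_vec() // stand-in for the production GPU MoE FFN dispatch
}

fn ffn_cuda_with_router(hidden: &[f32], k: usize) -> (Vec<f32>, Vec<f32>) {
    (hidden.to_vec(), vec![1.0 / k as f32; k]) // stand-in for PR #1522's helper
}

fn traced_moe_layer(positions: &[Vec<f32>], plan_wants_capture: bool, k: usize) {
    for (pos, hidden) in positions.iter().enumerate() {
        let last = pos + 1 == positions.len();
        if last && plan_wants_capture {
            // LAST position only: the router-returning sibling yields
            // the MoeFfnOut output and MoeRouter weights in one pass.
            let (_out, _router_weights) = ffn_cuda_with_router(hidden, k);
        } else {
            // Every other position dispatches identically to production.
            let _out = ffn_cuda(hidden);
        }
    }
}
```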

M-MOE-SUB-2 steps (a)+(b)+(c)+(c.gpu) closed
============================================

The full set for M-MOE-SUB-2 is now complete:
- step (a) CPU body: PR #1516 (forward_qwen3_moe_traced_with_plan)
- step (a) CLI wireup: PR #1521 (apr trace --save-tensor for GGUF MoE)
- step (b) GPU body: THIS PR
- step (c) CPU helper: PR #1507 (moe_ffn_forward_layer_with_router)
- step (c.gpu) GPU helper: PR #1522 (moe_ffn_forward_layer_cuda_with_router)

M-MOE-SUB-3 next: heavy parity test on lambda-vector RTX 4090, diff
CPU vs GPU at MoeRouter and MoeFfnOut, identify first divergence.

Falsifier
=========

FALSIFY-MOE-SUB-002 (byte-identity preservation for existing stages).
This PR DOES NOT discharge it (heavy parity test required at M-MOE-
SUB-3); it provides the GPU traced sibling that will run the test.

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml v1.2.0
Refs: M-MOE-SUB-2 step (b)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 6, 2026
…scade complete on main (#1525)

Promotes status PROPOSED → ACTIVE_ALGORITHM_LEVEL after all 5 cascade
PRs land. M-MOE-SUB-1, M-MOE-SUB-2 (a + b + c + c.gpu), M-MOE-SUB-3
(harness) status: PENDING → SHIPPED. M-MOE-SUB-4 stays PENDING
(optional, only needed if M-MOE-SUB-3's diff doesn't pinpoint at
MoeRouter / MoeFfnOut granularity).

Cited PRs (chronological)
=========================

- #1507 — moe_ffn_forward_layer_with_router (CPU helper, step c)
- #1516 — forward_qwen3_moe_traced_with_plan (CPU body, step a)
- #1521 — apr trace --save-tensor GGUF MoE CLI wireup (step a CLI)
- #1522 — moe_ffn_forward_layer_cuda_with_router (GPU helper, step c.gpu)
- #1523 — forward_qwen3_moe_cuda_traced (GPU body, step b)
- #1524 — heavy diff harness (M-MOE-SUB-3)

What's left
===========

- Operator-dispatched run of `falsify_moe_sub_002_cpu_gpu_traced_per_stage_diff`
  on lambda-vector RTX 4090 + cached 17.3 GB Qwen3-Coder GGUF
  (~30-60 min wall) → produces layer-by-layer divergence table.
- M-MOE-SUB-3 ALGORITHM_LEVEL → FUNCTIONAL upon operator run.
- FALSIFY-MOE-SUB-003 → DISCHARGED gated on M-GPU-MOE-1.4 root-cause fix.

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml
Refs: qwen3-moe-forward-gpu-v1 v1.4.0 M-GPU-MOE-1.4

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>