
feat(apr-cli): wire --save-tensor through GGUF MoE trace dispatch — M-MOE-SUB-2 step (a) CLI #1521

Merged
noahgift merged 2 commits into main from feat/m-moe-sub-2-cli-dispatch-wireup
May 5, 2026

Conversation


noahgift commented May 5, 2026

Summary

M-MOE-SUB-2 step (a) CLI completion: connects the clap surface (`--save-tensor` / `--save-tensor-layers` / `--save-tensor-dir`, PR-A #1405) through to `forward_qwen3_moe_traced_with_plan` (M74, PR #1516, squash `3138d134d`) for Qwen3-MoE-arch GGUF models.

What ships

  • New pub fn `run_save_tensor_gguf_moe(path, stages, dir, layers)` in `crates/apr-cli/src/commands/trace_save_tensor.rs`. Loads via `MappedGGUFModel` / `OwnedQuantizedModel`, validates `qwen3_moe` arch, reads MoE config from GGUF metadata, dispatches to `forward_qwen3_moe_traced_with_plan`.
  • Dispatch wireup in `dispatch.rs` under the `Commands::Trace` arm: a `.gguf` extension now dispatches to the MoE wireup, `.apr` continues down the existing dense path, and `.safetensors` still prints the stub (see the sketch below).
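
For orientation, a minimal sketch of that extension dispatch. `run_save_tensor_gguf_moe` and the stub behavior are from this PR; the surrounding function shape, the argument types, and the `run_save_tensor_apr` signature are assumptions, not the actual `dispatch.rs` code:

```rust
use std::path::Path;

// Hypothetical stand-ins so the sketch compiles; the real functions
// live in crates/apr-cli/src/commands/trace_save_tensor.rs.
fn run_save_tensor_gguf_moe(_p: &Path, _s: &str, _d: &Path, _l: &str) -> Result<(), String> { Ok(()) }
fn run_save_tensor_apr(_p: &Path, _s: &str, _d: &Path, _l: &str) -> Result<(), String> { Ok(()) }

/// Routes `--save-tensor` by model-file extension (sketch of the
/// `Commands::Trace` arm described above).
fn dispatch_save_tensor(path: &Path, stages: &str, dir: &Path, layers: &str) -> Result<(), String> {
    match path.extension().and_then(|e| e.to_str()) {
        Some("gguf") => run_save_tensor_gguf_moe(path, stages, dir, layers), // new: MoE wireup
        Some("apr")  => run_save_tensor_apr(path, stages, dir, layers),      // existing dense path
        _ => {
            // .safetensors (and anything else) still prints the stub.
            eprintln!("--save-tensor: not yet wired for this format");
            Ok(())
        }
    }
}
```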

What this does NOT ship

  • Dense GGUF SaveTensor wireup (still falls through to stub).
  • M-MOE-SUB-2 step (b) GPU sibling `forward_qwen3_moe_cuda_traced.rs` — separate PR.
  • Live verification on RTX 4090 — exercised in M-MOE-SUB-3.

Falsifier impact

  • M-MOE-SUB-3 live bisection is now unblocked operationally — invoking `apr trace --save-tensor moe_router,moe_ffn_out --save-tensor-layers 0..48 --save-tensor-dir <dir> <qwen3_moe_gguf>` on the lambda-vector RTX 4090 will produce per-layer MoeRouter + MoeFfnOut tensor files, ready for diff vs the GPU sibling output (once step (b) ships).

Verification

  • `cargo build -p apr-cli --release --features inference` clean
  • `cargo clippy -p apr-cli --lib --release --features inference -- -D warnings` clean
  • `rustfmt --check` clean
  • `cargo test -p apr-cli --release --lib commands::trace_save_tensor` 5 passed (existing tests preserved)

Test plan

  • Build clean
  • Clippy + rustfmt clean
  • Existing unit tests pass
  • CI gate green
  • Auto-merge fires on green CI

Refs: `contracts/trace-moe-gpu-sub-stages-v1.yaml` v1.1.0 step (a) CLI completion,
M68 helper PR #1507 (squash 0f22c78),
M74 `forward_qwen3_moe_traced_with_plan` PR #1516 (squash 3138d13)

🤖 Generated with Claude Code

M-MOE-SUB-2 step (a) CLI completion: connects the clap surface
(--save-tensor / --save-tensor-layers / --save-tensor-dir, PR-A
#1405) through to forward_qwen3_moe_traced_with_plan (M74, PR
#1516, squash 3138d13) for Qwen3-MoE-arch GGUF models.

## What ships

1. New pub fn `run_save_tensor_gguf_moe(path, stages, dir,
   layers)` in `crates/apr-cli/src/commands/trace_save_tensor.rs`.
   Mirrors the structure of the existing `run_save_tensor_apr` for
   APR models, but loads via `MappedGGUFModel` /
   `OwnedQuantizedModel`, validates `qwen3_moe` arch (rejects
   dense GGUF with a clear error), reads MoE config from GGUF
   metadata (`expert_count`, `expert_used_count`,
   `expert_feed_forward_length`), loads per-layer
   `Qwen3MoeQuantizedLayer` descriptors, then dispatches to
   `forward_qwen3_moe_traced_with_plan` with the plan derived from
   the CLI args (a condensed sketch follows this list).

2. Dispatch wireup in `dispatch.rs::dispatch_diagnostic_commands`
   under the `Commands::Trace` arm. The previous code dispatched
   `--save-tensor` for `.apr` only and printed a stub for other
   extensions; now `.gguf` dispatches to the new
   `run_save_tensor_gguf_moe` function. Other extensions
   (.safetensors) still print the stub pending SHIP-007 PR-E.
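
A condensed sketch of the flow in (1), as promised above. The
function name, the `qwen3_moe` arch check, and the metadata keys are
from this PR; the loader/plan method names, the parameter types, and
the `_with_plan` argument order are assumptions:

  // Sketch only — MappedGGUFModel / OwnedQuantizedModel /
  // SaveTensorPlan APIs are assumed, not the real crate surface.
  pub fn run_save_tensor_gguf_moe(
      path: &Path,
      stages: &str,      // e.g. "moe_router,moe_ffn_out"
      dir: &Path,
      layer_spec: &str,  // e.g. "0..48"
  ) -> Result<()> {
      let mapped = MappedGGUFModel::load(path)?;
      if mapped.architecture() != "qwen3_moe" {
          bail!("--save-tensor (GGUF) requires qwen3_moe; dense GGUF still stubbed");
      }
      // MoE shape comes from GGUF metadata, not from the CLI.
      let num_experts     = mapped.metadata_u32("expert_count")? as usize;
      let experts_per_tok = mapped.metadata_u32("expert_used_count")? as usize;
      let moe_ffn_len     = mapped.metadata_u32("expert_feed_forward_length")? as usize;
      let model      = OwnedQuantizedModel::from_mapped(&mapped)?;
      let moe_layers = model.qwen3_moe_layers()?; // Qwen3MoeQuantizedLayer descriptors
      let plan       = SaveTensorPlan::parse(stages, layer_spec, dir)?;
      let token_ids: Vec<u32> = vec![1]; // placeholder trace prompt
      model.forward_qwen3_moe_traced_with_plan(
          &token_ids, &moe_layers, num_experts, experts_per_tok,
          moe_ffn_len, Some(&plan),
      )?;
      Ok(())
  }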

## What this does NOT ship

- Dense GGUF SaveTensor wireup (still falls through to stub).
- M-MOE-SUB-2 step (b) GPU sibling
  `forward_qwen3_moe_cuda_traced.rs` — separate PR.
- Live verification on lambda-vector RTX 4090 against the cached
  17.3 GB Qwen3-Coder GGUF — exercised in M-MOE-SUB-3.

## Hot path safety

Production `forward_qwen3_moe` / `forward_qwen3_moe_cuda` hot
paths are byte-unchanged (the additive-purity invariant pinned in
v1.1.0). Production `forward_qwen3_moe_traced` (no plan) is also
unchanged — the new wireup uses the M74 sibling
`forward_qwen3_moe_traced_with_plan` (delegate pattern sketched
below).
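
The mechanism is a plain no-plan delegate (sketch — only the two
function names and the optional plan parameter are from the contract;
the receiver and the rest of the parameter list are assumed):

  // Production traced entry point: unchanged, no plan.
  pub fn forward_qwen3_moe_traced(&self, token_ids: &[u32]) -> Result<ForwardTrace> {
      self.forward_qwen3_moe_traced_with_plan(token_ids, None)
  }

  // M74 sibling: same forward pass, but emits a tensor file for
  // every (layer, stage) pair the plan selects.
  pub fn forward_qwen3_moe_traced_with_plan(
      &self,
      token_ids: &[u32],
      plan: Option<&SaveTensorPlan>,
  ) -> Result<ForwardTrace> {
      // ... per-layer forward; capture where plan.wants(layer, stage) ...
      unimplemented!()
  }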

## Verification

  $ cargo build -p apr-cli --release --features inference
    clean
  $ cargo clippy -p apr-cli --lib --release --features inference \
      -- -D warnings
    clean
  $ rustfmt --check trace_save_tensor.rs dispatch.rs
    clean
  $ cargo test -p apr-cli --release --lib commands::trace_save_tensor
    5 passed (existing tests preserved)

## Falsifier impact

- FALSIFY-MOE-SUB-002 (byte-identity preservation): still
  partial — needs M-MOE-SUB-2 step (b) GPU sibling for full
  discharge.
- M-MOE-SUB-3 live bisection: now unblocked operationally —
  invoking `apr trace --save-tensor moe_router,moe_ffn_out
  --save-tensor-layers 0..48 --save-tensor-dir <dir> <gguf>` on
  lambda-vector RTX 4090 will produce per-layer MoeRouter
  + MoeFfnOut tensor files for the cached Qwen3-Coder GGUF,
  ready for diff vs the GPU sibling output once step (b) ships.

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml v1.1.0 step (a)
      CLI completion,
      M68 helper PR #1507 (squash 0f22c78),
      M74 forward_qwen3_moe_traced_with_plan PR #1516
      (squash 3138d13)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift enabled auto-merge (squash) May 5, 2026 21:28
noahgift merged commit c63a8dd into main May 5, 2026
10 checks passed
noahgift deleted the feat/m-moe-sub-2-cli-dispatch-wireup branch May 5, 2026 22:19
noahgift added a commit that referenced this pull request May 6, 2026
…p (b) (#1523)

GPU traced sibling of forward_qwen3_moe_cuda. Mirrors the CPU traced
sibling (M32d Step 2 + PR #1516 _with_plan extension) but routes per-
layer MoE FFN through the GPU dispatch so `apr trace --gpu --json
--payload --save-tensor` can run the same SaveTensorPlan against both
CPU and GPU forward paths, capture per-stage activations at MoeRouter
and MoeFfnOut, and bisect the M-GPU-MOE-1.4 NaN/Inf poisoning to its
first divergence point.

What ships
==========

- `OwnedQuantizedModelCuda::forward_qwen3_moe_cuda_traced(token_ids,
  moe_layers, num_experts, num_experts_per_tok, moe_intermediate, data)`
  → Result<ForwardTrace>. No-plan delegate.
- `forward_qwen3_moe_cuda_traced_with_plan(..., plan: Option<&SaveTensorPlan>)`
  → Result<ForwardTrace>. The plan-aware body.
- New file `crates/aprender-serve/src/gguf/cuda/forward_qwen3_moe_cuda_traced.rs`
  (~430 LOC including doc-comments).
- `include!()` registered in `cuda/uses.rs`.
- Lib-only signature drift gate test
  `forward_qwen3_moe_cuda_traced_signature_drift_gate` (sketched
  after this list). End-to-end byte-identity vs the production
  sibling is exercised by the heavy `qwen3_moe_gpu_parity` test on
  lambda-vector RTX 4090 against the cached 17.3 GB Qwen3-Coder
  GGUF (M-MOE-SUB-3).
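
The drift gate can be pictured as a compile-time fn-pointer pin
(sketch — arity follows the signature quoted above; the type of
`data` is assumed):

  #[test]
  fn forward_qwen3_moe_cuda_traced_signature_drift_gate() {
      // Coercing the method to a fn pointer pins its signature:
      // any drift in arity or types breaks the (lib-only) build.
      let _pinned: fn(
          &OwnedQuantizedModelCuda,
          &[u32],                    // token_ids
          &[Qwen3MoeQuantizedLayer], // moe_layers
          usize,                     // num_experts
          usize,                     // num_experts_per_tok
          usize,                     // moe_intermediate
          &MappedGGUFModel,          // data (type assumed)
      ) -> Result<ForwardTrace> =
          OwnedQuantizedModelCuda::forward_qwen3_moe_cuda_traced;
  }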

Hot path safety
===============

Production `forward_qwen3_moe_cuda` is unchanged byte-for-byte. This
is a parallel slow path used only by `apr trace --gpu`. The per-token
loop dispatches the GPU MoE FFN identically to production for
non-capture positions; the LAST sequence position uses
`moe_ffn_forward_layer_cuda_with_router` (PR #1522) when the plan
selects MoeRouter or MoeFfnOut, so the router weights can be emitted
without recomputation (see the sketch below).
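
In sketch form (the helper names are from PRs #1507/#1522; the loop
shape, the `Stage` enum, and the `capture` writer are assumptions):

  for (pos, hidden) in hidden_states.iter_mut().enumerate() {
      let capture_here = pos == seq_len - 1
          && plan.map_or(false, |p| {
              p.wants(layer_idx, Stage::MoeRouter)
                  || p.wants(layer_idx, Stage::MoeFfnOut)
          });
      if capture_here {
          // PR #1522 helper: FFN output plus router weights in one
          // pass, so MoeRouter is emitted without recomputation.
          let (out, router) =
              moe_ffn_forward_layer_cuda_with_router(layer, hidden)?;
          capture(plan, layer_idx, Stage::MoeRouter, &router);
          capture(plan, layer_idx, Stage::MoeFfnOut, &out);
          *hidden = out;
      } else {
          // Identical dispatch to the production GPU hot path.
          *hidden = moe_ffn_forward_layer_cuda(layer, hidden)?;
      }
  }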

Closes (a)+(b)+(c.gpu) of M-MOE-SUB-2
=====================================

The step map for M-MOE-SUB-2 is now complete:
- step (a) CPU body: PR #1516 (forward_qwen3_moe_traced_with_plan)
- step (a) CLI wireup: PR #1521 (apr trace --save-tensor for GGUF MoE)
- step (b) GPU body: THIS PR
- step (c) CPU helper: PR #1507 (moe_ffn_forward_layer_with_router)
- step (c.gpu) GPU helper: PR #1522 (moe_ffn_forward_layer_cuda_with_router)

M-MOE-SUB-3 next: heavy parity test on lambda-vector RTX 4090, diff
CPU vs GPU at MoeRouter and MoeFfnOut, identify the first divergence
(sketched below).
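
The bisection itself reduces to a first-mismatch scan over the
captured stages (sketch; the `ForwardTrace` accessor and `Stage`
names are assumptions):

  fn first_divergence(
      cpu: &ForwardTrace, gpu: &ForwardTrace, n_layers: usize,
  ) -> Option<(usize, Stage)> {
      for layer in 0..n_layers {
          for stage in [Stage::MoeRouter, Stage::MoeFfnOut] {
              // Byte-level compare: NaN/Inf poisoning upstream shows
              // up at the first stage whose bytes differ.
              if cpu.tensor_bytes(layer, stage)
                  != gpu.tensor_bytes(layer, stage)
              {
                  return Some((layer, stage));
              }
          }
      }
      None
  }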

Falsifier
=========

FALSIFY-MOE-SUB-002 (byte-identity preservation for existing
stages). This PR DOES NOT discharge it (heavy parity test required
at M-MOE-SUB-3); it provides the GPU traced sibling that will run
the test.

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml v1.2.0
Refs: M-MOE-SUB-2 step (b)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 6, 2026
…scade complete on main (#1525)

Promotes status PROPOSED → ACTIVE_ALGORITHM_LEVEL after all 5 cascade
PRs land. M-MOE-SUB-1, M-MOE-SUB-2 (a + b + c + c.gpu), M-MOE-SUB-3
(harness) status: PENDING → SHIPPED. M-MOE-SUB-4 stays PENDING
(optional, only needed if M-MOE-SUB-3's diff doesn't pinpoint at
MoeRouter / MoeFfnOut granularity).

Cited PRs (chronological)
=========================

- #1507 — moe_ffn_forward_layer_with_router (CPU helper, step c)
- #1516 — forward_qwen3_moe_traced_with_plan (CPU body, step a)
- #1521 — apr trace --save-tensor GGUF MoE CLI wireup (step a CLI)
- #1522 — moe_ffn_forward_layer_cuda_with_router (GPU helper, step c.gpu)
- #1523 — forward_qwen3_moe_cuda_traced (GPU body, step b)
- #1524 — heavy diff harness (M-MOE-SUB-3)

What's left
===========

- Operator-dispatched run of `falsify_moe_sub_002_cpu_gpu_traced_per_stage_diff`
  on lambda-vector RTX 4090 + cached 17.3 GB Qwen3-Coder GGUF
  (~30-60 min wall) → produces layer-by-layer divergence table.
- M-MOE-SUB-3 ALGORITHM_LEVEL → FUNCTIONAL upon operator run.
- FALSIFY-MOE-SUB-003 → DISCHARGED gated on M-GPU-MOE-1.4 root-cause fix.

Refs: contracts/trace-moe-gpu-sub-stages-v1.yaml
Refs: qwen3-moe-forward-gpu-v1 v1.4.0 M-GPU-MOE-1.4

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>