feat(distill): Pipeline + apr distill CLI integrate BatchSource (Phase 4 Stage B-2) by noahgift · Pull Request #1839 · paiml/aprender

noahgift · 2026-05-20T07:35:41Z

Phase 4 Stage B-2: wire BatchSource into Pipeline + CLI

Builds on Stage B-1 (#1836 — BatchSource trait + 2 impls, just merged). This PR wires the trait into the pipeline and exposes it via the CLI so Phase 4 real-corpus dispatch can drive distillation from .bin token shards.

Changes

Pipeline::batch_source field (default: SyntheticBatchSource(32))
Pipeline::with_batch_source(Box<dyn BatchSource>) builder
Pipeline::train() pulls each batch via batch_source.next_batch(...) instead of constructing inline (same synthetic semantics on the default path — smoke + fixture tests unaffected)
apr distill --dataset <DIR> CLI flag for real-corpus shard directory
run_cuda_backend() constructs a ShardBatchSource from the dataset dir and passes it to the pipeline; default path remains synthetic
aprender-train-distill?/shard-batch-source feature added to apr-cli's training feature

Test plan

cargo check -p apr-cli --features inference,training,cuda clean
5968 apr-cli lib tests pass (incl. 21 distill tests after a 9-call-site signature update for the new arg)
61 aprender-train-distill lib tests pass (fixture path semantics unchanged)
Live Phase 4 dispatch with --dataset <PATH> (Stage C trial run)

Phase 4 ladder progress

Stage	PR	Status
A	#1833 PMAT-698o	✅ MERGED + verified
B-1	#1836	✅ MERGED
B-2	THIS	Pipeline + CLI integration
C	(next)	7B teacher trial with real corpus
D	(compute-gated)	50K-step Phase 4 dispatch
E	(Phase 5)	HumanEval pass@1
F	(Phase 6)	publish v2

🤖 Generated with Claude Code

…ource (Phase 4 Stage B foundation) SPEC-DISTILL-001 Phase 4 needs real-corpus training batches in pipeline.train(). The synthetic-batch logic currently lives inline (PMAT-698m / PMAT-698o). This PR factors batch production into a trait so the pipeline can swap synthetic vs real-corpus sources without touching the training loop. Adds: - `BatchSource` trait — `next_batch(batch_size, seq_len) -> (inputs, labels)` - `SyntheticBatchSource` — wraps existing PMAT-698m/698o logic verbatim (each row is `seq_len` copies of a distinct token; identity-mapping task remains satisfiable on smoke + fixture tests) - `ShardBatchSource` (feature `shard-batch-source`) — wraps `entrenar::train::shard_reader::ShardBatchIter` for real .bin token shards. Wrap-around enabled by default (Phase 4's 50K-step schedule may exceed a single corpus epoch on smaller subsets). Does NOT yet wire BatchSource into Pipeline::train() — that's a follow-up PR (Stage B-2) to keep this change reviewable. Foundation is independently useful: tests can construct sources directly. Test plan: - [x] 3 batch_source lib tests pass (shape, modulo wrap, reset is no-op) - [x] 61 total distill lib tests pass (all existing tests unaffected) - [x] `cargo check --features shard-batch-source` clean - [ ] Stage B-2 PR wires Pipeline + adds `apr distill --dataset <PATH>` CLI flag Phase 4 ladder progress: Stage A (#1833 PMAT-698o) ✅ MERGED + verified (seq_len=256 smoke PASS) Stage B-1 (THIS) trait + impls foundation Stage B-2 (next) Pipeline + CLI integration Stage C 7B teacher trial run Stage D 50K-step Phase 4 dispatch Stage E HumanEval pass@1 (Phase 5) Stage F publish v2 (Phase 6) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…e 4 Stage B-2) Builds on Stage B-1 (#1836 — BatchSource trait + 2 impls). Wires the new trait into the pipeline and exposes it via the CLI so Phase 4 real-corpus dispatch can drive distillation from `.bin` token shards. Changes: - `Pipeline::batch_source` field (default: `SyntheticBatchSource(32)`) - `Pipeline::with_batch_source(Box<dyn BatchSource>)` builder - `Pipeline::train()` pulls each batch via `batch_source.next_batch(...)` instead of constructing inline (same synthetic semantics on the default path; smoke + fixture tests unaffected) - `apr distill --dataset <DIR>` CLI flag for real-corpus shard directory - `run_cuda_backend()` constructs a `ShardBatchSource` from the dataset dir and passes it to the pipeline; default path remains synthetic - `aprender-train-distill?/shard-batch-source` feature added to apr-cli's `training` feature so the shard source compiles in by default Test plan: - [x] `cargo check -p apr-cli --features inference,training,cuda` clean - [x] 5968 apr-cli lib tests pass (incl. 21 distill tests after a 9-call-site signature update) - [x] 61 aprender-train-distill lib tests pass - [ ] Live Phase 4 dispatch with `--dataset <PATH>` (Stage C +) Phase 4 ladder progress: Stage A (#1833 PMAT-698o) ✅ MERGED + verified Stage B-1 (#1836) 🟡 in CI Stage B-2 (THIS) Pipeline + CLI integration Stage C 7B teacher trial w/ real corpus Stage D 50K-step Phase 4 dispatch Stage E HumanEval pass@1 (Phase 5) Stage F publish v2 (Phase 6) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…on-stage-b2

… C-prep) (#1840) * feat(M32d): KV cache for qwen3_moe inference path — 19× speedup Implements M32d KV cache support on the qwen3_moe inference path. Discharges the prerequisite for FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004 in contracts/qwen3-moe-serve-dispatch-v1.yaml (v1.1.1 → v1.2.0). ## Empirical results On Qwen3-Coder-30B-A3B-Instruct-Q4_K_M: - **Pre-M32d**: ~0.5 tok/s (full-prefill-per-token; bench timed out on every per-turn budget — 5 timeout-class dispatches recorded in paiml/claude-code-parity-apr evidence/phase-6/30b-moe-empirical- 2026-05-19.md) - **Post-M32d**: **9.62 tok/s sustained** on 32-token generation (19× speedup; comfortably above the ≥ 5 tok/s scope target) - **Numerical equivalence**: byte-identical greedy outputs vs full- prefill on 4-token reference (cache-on=553ms vs cache-off=1002ms; ~2× speedup even at small token counts; gap compounds with length) - **V1_001 + V1_003 regression**: existing #1819 cargo test still passes (9.39s wall, content "Human: What", no matmul guard fire) ## Implementation **New function**: `OwnedQuantizedModel::forward_single_qwen3_moe_with_cache` in `crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs`. Mirrors the dense `forward_single_with_cache` reference (ffn_block.rs: ~165) step-for-step EXCEPT at the FFN block, where it calls `moe_ffn_forward_layer` (router → top-k expert select → per-expert SwiGLU → weighted sum → down projection) instead of the dense gate/up/down dispatch. Attention block (QKV proj, per-head Q/K RMSNorm, RoPE at `position`, GQA-aware cached attention, output projection, residual) is byte-identical to the dense reference. **Generate loop rewrite**: `crates/aprender-serve/src/infer/ qwen3_moe_generate.rs::run_qwen3_moe_generate` now: 1. Allocates `OwnedQuantizedKVCache` sized to `max(REALIZR_CONTEXT_LENGTH, prompt_len + max_tokens + 8)` 2. Prefill: per prompt token, calls `forward_single_qwen3_moe_with_cache` (cache fills incrementally; final iteration's logits seed decode) 3. Decode: greedy-argmax → append → next cache-aware forward 4. Stop on `stop_tokens` or `max_tokens` exhausted or cache full **Visibility fix**: `single_cache_final_output` in `ffn_block.rs` bumped to `pub(crate)` so the MoE function can reuse the dense final- norm + LM head path unchanged. Same edit applied to the orphan `debug.rs` duplicate for hygiene (it's not in the build graph but mirrors ffn_block.rs). ## New tests (both `#[ignore]`'d, env-gated) - `crates/aprender-serve/tests/moe_kv_cache_equivalence.rs` — Generates 4 tokens via M32d cache-on path AND a legacy full-prefill loop. Asserts greedy outputs byte-identical. Pinned 553ms vs 1002ms perf numbers in eprintln output. - `crates/aprender-serve/tests/m32d_perf.rs` — Generates 32 tokens; asserts sustained throughput ≥ 5 tok/s. Floor pinned via `M32D_TPS_FLOOR` constant. Catches future KV-cache regressions. Activation: ``` QWEN3_MOE_GGUF_PATH=/path/to/qwen3-moe.gguf \ cargo test --test moe_kv_cache_equivalence --test m32d_perf \ -p aprender-serve --features cuda --release -- --ignored --nocapture ``` ## Risk assessment vs scope doc All 6 risk surfaces from `docs/specifications/m32d-moe-kv-cache-scope.md` were addressed: 1. **Numerical equivalence**: PASSED. Greedy argmax robust to ULP-scale logit drift; 4-token sequence byte-identical to full-prefill. 2. **Dense path regression**: NONE. Dense `forward_single_with_cache` not touched (only its sibling `single_cache_final_output` visibility bumped, which doesn't change semantics). 3. **RoPE position offset**: handled via `position` parameter passed to `apply_rope` (same pattern as dense reference). 4. **GQA expansion**: handled via `kv_dim()` config method (same as dense reference); first-token edge case (empty cache) explicitly handled by expanding V across Q heads. 5. **Expert routing under cache**: confirmed unaffected — router reads from current-token hidden state only. 6. **Streaming SSE for free**: structurally enabled but not wired into the chat handler (separate follow-up contract). ## Contract bump v1.1.1 → v1.2.0: - V1_004 entry gains `prerequisite_status` field documenting M32d shipped + empirical throughput numbers - `evidence` field updated with the post-M32d operator dispatch recipe - status_history appends v1.2.0 entry ## Companion-side downstream paiml/claude-code-parity-apr operator can now dispatch Phase 6 bench: ``` APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \ PHASE6_COMPLIANCE_ENFORCED=1 \ PHASE6_MAX_TURNS=20 PHASE6_WALL_SECONDS=3600 \ APR_TIMEOUT_S=900 APR_AGENT_HTTP_TIMEOUT_S=1500 \ APR_AGENT_MAX_TOKENS_CAP=1024 \ bash scripts/phase-6-bench.sh ``` Expected ~10 hour wall on full 20-fixture corpus at 9.62 tok/s sustained. Acceptance: `evidence/under-contract/scores.json` with `student_pass_rate > 0` discharges V1_004 + lifts CCPA M280 suspension. ## Supersedes #1829 (Option b engineer-playbook + V1_004 status formalization) — operator flipped from Option (b) to Option (a) in-session; this PR delivers the actual implementation. #1829 can be closed as superseded. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore(distill): add DATASET_DIR env var → --dataset flag (Phase 4 Stage C-prep) Threads the Phase 4 Stage B-2 `apr distill --dataset <DIR>` flag through the gx10 dispatch script. When `DATASET_DIR` is set, the script passes the directory to apr distill, which drives training from real-corpus .bin shards via ShardBatchSource. When unset (default), the pipeline falls back to SyntheticBatchSource (Phase 3 smoke semantics). Preamble now surfaces which mode the dispatch is in: dataset: /path/to/shards (Phase 4 Stage B-2: real corpus via --dataset) dataset: (synthetic — Phase 3 smoke semantics) Validates the directory exists on gx10 before launching apr distill; fails fast with a clear message otherwise. Usage: STEPS=100 DATASET_DIR=/home/noah/data/codeparrot-shards-trial \ bash scripts/dispatch-distill-phase-3-gx10.sh Depends on PR #1839 (Stage B-2) landing first so `apr distill --dataset` exists on the rebuilt gx10 binary. With #1839 unmerged the script's invocation falls back to a clap error. Phase 4 ladder progress: Stage A (#1833) ✅ MERGED + verified Stage B-1 (#1836) ✅ MERGED Stage B-2 (#1839) 🟡 in CI Stage C-prep (THIS) dispatch script + 10 pre-staged shards Stage C run on gx10 with --dataset Stage D 50K-step Phase 4 dispatch Stage E HumanEval pass@1 Stage F publish v2 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…ASSES (Phase 4 ladder) (#1845) 2026-05-20 12:34 UTC — first end-to-end Phase 4 dispatch with real corpus (.bin shards via ShardBatchSource). 0.5B Qwen2.5-Coder teacher → 0.5B student on Blackwell GB10 (sm_121), 100-step trial. initial_loss = 15.6094 final_loss = 6.0095 ← Δ = -9.60 (-62% reduction) 124 steps, 232.4s, 1.87 sec/step This is the first real-corpus Phase 4 dispatch. The synthetic Phase 3 victory (#1828, -0.47 over 62 steps) and the seq_len=256 Stage A smoke (#1833, -6.80) both predicted Phase 4 readiness; Stage C confirms it with strictly better convergence on real data (codeparrot Python tokenized to Qwen vocab, 10 shards / 383 MB). What this validates: - ShardBatchSource (PR #1836, PMAT-PHASE4-STAGE-B-1) reads .bin shards correctly and produces non-degenerate batches - Pipeline integration (PR #1839, PMAT-PHASE4-STAGE-B-2) swaps from synthetic → real source via with_batch_source() cleanly - Dispatch script DATASET_DIR knob (PR #1840) end-to-end through gx10 - Full Phase 4 readiness for the 50K-step Stage D run (compute-gated, requires user check-in per autonomous-mode rule) Cascade math: Stage A: Δloss = -6.80 over 62 steps (synthetic, seq=256) Stage C: Δloss = -9.60 over 124 steps (real corpus, seq=256) Per-step loss decrease: Stage A: -0.110/step Stage C: -0.077/step Stage A's per-step rate is higher because synthetic data has zero variance — every batch is the same identity-mapping task. Real-corpus Stage C has higher variance but covers more concepts, so absolute delta is larger. Phase 4 ladder progress: Stage A (#1833) ✅ MERGED + verified Stage B-1 (#1836) ✅ MERGED Stage B-2 (#1839) ✅ MERGED Stage C-prep (#1840) ✅ MERGED Stage B-1.5 tests (#1841) 🟡 in CI Stage C trial (THIS evidence) ✅ PASSED 2026-05-20 Stage D 50K dispatch ⏳ awaiting user check-in (28h GB10 compute) Stage E HumanEval pass@1 ⏳ Phase 5 (turnkey post-Stage-D) Stage F publish v2 ⏳ Phase 6 (turnkey post-Stage-E) Evidence: - evidence/distill-stage-c-trial/dispatch.json — dispatch manifest - evidence/distill-stage-c-trial/launch-victory.txt — full training log Run dir on gx10: /home/noah/runs/distill-smoke-20260520-123259/ Trained checkpoint: student-trained.apr/model.safetensors Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift and others added 2 commits May 20, 2026 08:39

noahgift enabled auto-merge (squash) May 20, 2026 07:35

noahgift added 2 commits May 20, 2026 09:51

Merge branch 'main' into feat/distill-pipeline-batch-source-integrati…

3cfed37

…on-stage-b2

Merge branch 'main' into feat/distill-pipeline-batch-source-integrati…

4810df9

…on-stage-b2

This was referenced May 20, 2026

chore(distill): DATASET_DIR env var in dispatch script (Phase 4 Stage C-prep) #1840

Merged

test(distill): fixture-driven integration tests for ShardBatchSource (F-DISTILL-SHARD-BATCH-001/002) #1841

Merged

Merge branch 'main' into feat/distill-pipeline-batch-source-integrati…

699d717

…on-stage-b2

noahgift merged commit fb16a1c into main May 20, 2026
10 checks passed

noahgift deleted the feat/distill-pipeline-batch-source-integration-stage-b2 branch May 20, 2026 09:54

noahgift mentioned this pull request May 20, 2026

evidence(distill): Stage C — first real-corpus distillation on GB10 PASSES #1845

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(distill): Pipeline + apr distill CLI integrate BatchSource (Phase 4 Stage B-2)#1839

feat(distill): Pipeline + apr distill CLI integrate BatchSource (Phase 4 Stage B-2)#1839
noahgift merged 5 commits into
mainfrom
feat/distill-pipeline-batch-source-integration-stage-b2

noahgift commented May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 20, 2026

Phase 4 Stage B-2: wire BatchSource into Pipeline + CLI

Changes

Test plan

Phase 4 ladder progress

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant