feat(distill): BatchSource trait + impls (Phase 4 Stage B-1 foundation) #1836
Merged
…ource (Phase 4 Stage B foundation)

SPEC-DISTILL-001 Phase 4 needs real-corpus training batches in pipeline.train(). The synthetic-batch logic currently lives inline (PMAT-698m / PMAT-698o). This PR factors batch production into a trait so the pipeline can swap synthetic vs real-corpus sources without touching the training loop.

Adds:
- `BatchSource` trait: `next_batch(batch_size, seq_len) -> (inputs, labels)`
- `SyntheticBatchSource`: wraps existing PMAT-698m/698o logic verbatim (each row is `seq_len` copies of a distinct token; identity-mapping task remains satisfiable on smoke + fixture tests)
- `ShardBatchSource` (feature `shard-batch-source`): wraps `entrenar::train::shard_reader::ShardBatchIter` for real .bin token shards. Wrap-around enabled by default (Phase 4's 50K-step schedule may exceed a single corpus epoch on smaller subsets).

Does NOT yet wire BatchSource into Pipeline::train(); that's a follow-up PR (Stage B-2) to keep this change reviewable. Foundation is independently useful: tests can construct sources directly.

Test plan:
- [x] 3 batch_source lib tests pass (shape, modulo wrap, reset is no-op)
- [x] 61 total distill lib tests pass (all existing tests unaffected)
- [x] `cargo check --features shard-batch-source` clean
- [ ] Stage B-2 PR wires Pipeline + adds `apr distill --dataset <PATH>` CLI flag

Phase 4 ladder progress:
  Stage A (#1833 PMAT-698o)  ✅ MERGED + verified (seq_len=256 smoke PASS)
  Stage B-1 (THIS)           trait + impls foundation
  Stage B-2 (next)           Pipeline + CLI integration
  Stage C                    7B teacher trial run
  Stage D                    50K-step Phase 4 dispatch
  Stage E                    HumanEval pass@1 (Phase 5)
  Stage F                    publish v2 (Phase 6)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
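For readers who want a concrete picture of the trait described in this commit, here is a minimal sketch. Only the `next_batch(batch_size, seq_len) -> (inputs, labels)` shape, the "each row is `seq_len` copies of a distinct token" behaviour, and the "reset is no-op" test name come from the commit message. The token type (`u32`), the nested-`Vec` batch layout, the `reset` method signature, and the `modulus` field are assumptions made for illustration and may not match the merged code.

```rust
/// Hypothetical trait shape: the commit only states
/// `next_batch(batch_size, seq_len) -> (inputs, labels)`.
pub trait BatchSource {
    fn next_batch(&mut self, batch_size: usize, seq_len: usize) -> (Vec<Vec<u32>>, Vec<Vec<u32>>);
    /// Assumed to exist because the test plan mentions "reset is no-op".
    fn reset(&mut self);
}

/// Synthetic source: row `r` is `seq_len` copies of token `r % modulus`.
/// The modulus is an assumption suggested by the "modulo wrap" test name.
pub struct SyntheticBatchSource {
    modulus: usize,
}

impl SyntheticBatchSource {
    pub fn new(modulus: usize) -> Self {
        assert!(modulus > 0, "modulus must be non-zero");
        Self { modulus }
    }
}

impl BatchSource for SyntheticBatchSource {
    fn next_batch(&mut self, batch_size: usize, seq_len: usize) -> (Vec<Vec<u32>>, Vec<Vec<u32>>) {
        let inputs: Vec<Vec<u32>> = (0..batch_size)
            .map(|row| vec![(row % self.modulus) as u32; seq_len])
            .collect();
        // Identity-mapping task: the labels equal the inputs.
        let labels = inputs.clone();
        (inputs, labels)
    }

    fn reset(&mut self) {
        // Stateless source, so there is nothing to rewind.
    }
}
```

Because the synthetic source keeps no cursor, every call with the same arguments returns the same batch, which is why a no-op `reset` is sufficient in this sketch.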
noahgift added a commit that referenced this pull request on May 20, 2026
…e 4 Stage B-2)

Builds on Stage B-1 (#1836: BatchSource trait + 2 impls). Wires the new trait into the pipeline and exposes it via the CLI so Phase 4 real-corpus dispatch can drive distillation from `.bin` token shards.

Changes:
- `Pipeline::batch_source` field (default: `SyntheticBatchSource(32)`)
- `Pipeline::with_batch_source(Box<dyn BatchSource>)` builder
- `Pipeline::train()` pulls each batch via `batch_source.next_batch(...)` instead of constructing inline (same synthetic semantics on the default path; smoke + fixture tests unaffected)
- `apr distill --dataset <DIR>` CLI flag for real-corpus shard directory
- `run_cuda_backend()` constructs a `ShardBatchSource` from the dataset dir and passes it to the pipeline; default path remains synthetic
- `aprender-train-distill?/shard-batch-source` feature added to apr-cli's `training` feature so the shard source compiles in by default

Test plan:
- [x] `cargo check -p apr-cli --features inference,training,cuda` clean
- [x] 5968 apr-cli lib tests pass (incl. 21 distill tests after a 9-call-site signature update)
- [x] 61 aprender-train-distill lib tests pass
- [ ] Live Phase 4 dispatch with `--dataset <PATH>` (Stage C +)

Phase 4 ladder progress:
  Stage A (#1833 PMAT-698o)  ✅ MERGED + verified
  Stage B-1 (#1836)          🟡 in CI
  Stage B-2 (THIS)           Pipeline + CLI integration
  Stage C                    7B teacher trial w/ real corpus
  Stage D                    50K-step Phase 4 dispatch
  Stage E                    HumanEval pass@1 (Phase 5)
  Stage F                    publish v2 (Phase 6)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
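The builder pattern listed under "Changes" can be sketched roughly as follows. The names `Pipeline::with_batch_source`, `Pipeline::train`, and the `SyntheticBatchSource(32)` default come from the commit message. Everything else is a hypothetical stand-in: the `batch_size` and `seq_len` fields, the meaning of the `32` argument (treated here as a token modulus), the `ConstantSource` type, and a `train` that merely counts tokens instead of training a model.

```rust
pub trait BatchSource {
    fn next_batch(&mut self, batch_size: usize, seq_len: usize) -> (Vec<Vec<u32>>, Vec<Vec<u32>>);
}

/// Default source; the tuple field is assumed to be a token modulus.
pub struct SyntheticBatchSource(pub usize);

impl BatchSource for SyntheticBatchSource {
    fn next_batch(&mut self, batch_size: usize, seq_len: usize) -> (Vec<Vec<u32>>, Vec<Vec<u32>>) {
        let modulus = self.0.max(1);
        let inputs: Vec<Vec<u32>> = (0..batch_size)
            .map(|row| vec![(row % modulus) as u32; seq_len])
            .collect();
        (inputs.clone(), inputs)
    }
}

/// Hypothetical alternative source, standing in for a real-corpus one.
pub struct ConstantSource(pub u32);

impl BatchSource for ConstantSource {
    fn next_batch(&mut self, batch_size: usize, seq_len: usize) -> (Vec<Vec<u32>>, Vec<Vec<u32>>) {
        let inputs = vec![vec![self.0; seq_len]; batch_size];
        (inputs.clone(), inputs)
    }
}

pub struct Pipeline {
    batch_source: Box<dyn BatchSource>,
    batch_size: usize,
    seq_len: usize,
}

impl Pipeline {
    pub fn new(batch_size: usize, seq_len: usize) -> Self {
        Self { batch_source: Box::new(SyntheticBatchSource(32)), batch_size, seq_len }
    }

    /// Swap the batch source without touching the training loop.
    pub fn with_batch_source(mut self, source: Box<dyn BatchSource>) -> Self {
        self.batch_source = source;
        self
    }

    /// Stand-in for training: pulls one batch per step and returns the
    /// number of input tokens consumed.
    pub fn train(&mut self, steps: usize) -> usize {
        let mut tokens_seen = 0;
        for _ in 0..steps {
            let (inputs, labels) = self.batch_source.next_batch(self.batch_size, self.seq_len);
            debug_assert_eq!(inputs.len(), labels.len());
            tokens_seen += inputs.iter().map(Vec::len).sum::<usize>();
        }
        tokens_seen
    }
}
```

A caller would write something like `Pipeline::new(2, 4).with_batch_source(Box::new(ConstantSource(7)))`. Using a boxed trait object keeps the loop body identical for the synthetic and real-corpus paths, at the cost of one dynamic dispatch per batch.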
This was referenced May 20, 2026
noahgift added a commit that referenced this pull request on May 20, 2026
…e 4 Stage B-2) (#1839)

* feat(distill): BatchSource trait + SyntheticBatchSource + ShardBatchSource (Phase 4 Stage B foundation)

* feat(distill): Pipeline + apr distill CLI integrate BatchSource (Phase 4 Stage B-2)

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 20, 2026
… C-prep) (#1840)

* feat(M32d): KV cache for qwen3_moe inference path - 19× speedup

Implements M32d KV cache support on the qwen3_moe inference path. Discharges the prerequisite for FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004 in contracts/qwen3-moe-serve-dispatch-v1.yaml (v1.1.1 → v1.2.0).

## Empirical results

On Qwen3-Coder-30B-A3B-Instruct-Q4_K_M:
- **Pre-M32d**: ~0.5 tok/s (full-prefill-per-token; bench timed out on every per-turn budget; 5 timeout-class dispatches recorded in paiml/claude-code-parity-apr evidence/phase-6/30b-moe-empirical-2026-05-19.md)
- **Post-M32d**: **9.62 tok/s sustained** on 32-token generation (19× speedup; comfortably above the ≥ 5 tok/s scope target)
- **Numerical equivalence**: byte-identical greedy outputs vs full-prefill on 4-token reference (cache-on=553ms vs cache-off=1002ms; ~2× speedup even at small token counts; gap compounds with length)
- **V1_001 + V1_003 regression**: existing #1819 cargo test still passes (9.39s wall, content "Human: What", no matmul guard fire)

## Implementation

**New function**: `OwnedQuantizedModel::forward_single_qwen3_moe_with_cache` in `crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs`. Mirrors the dense `forward_single_with_cache` reference (ffn_block.rs:~165) step-for-step EXCEPT at the FFN block, where it calls `moe_ffn_forward_layer` (router → top-k expert select → per-expert SwiGLU → weighted sum → down projection) instead of the dense gate/up/down dispatch. Attention block (QKV proj, per-head Q/K RMSNorm, RoPE at `position`, GQA-aware cached attention, output projection, residual) is byte-identical to the dense reference.

**Generate loop rewrite**: `crates/aprender-serve/src/infer/qwen3_moe_generate.rs::run_qwen3_moe_generate` now:
1. Allocates `OwnedQuantizedKVCache` sized to `max(REALIZR_CONTEXT_LENGTH, prompt_len + max_tokens + 8)`
2. Prefill: per prompt token, calls `forward_single_qwen3_moe_with_cache` (cache fills incrementally; final iteration's logits seed decode)
3. Decode: greedy-argmax → append → next cache-aware forward
4. Stop on `stop_tokens` or `max_tokens` exhausted or cache full

**Visibility fix**: `single_cache_final_output` in `ffn_block.rs` bumped to `pub(crate)` so the MoE function can reuse the dense final-norm + LM head path unchanged. Same edit applied to the orphan `debug.rs` duplicate for hygiene (it's not in the build graph but mirrors ffn_block.rs).

## New tests (both `#[ignore]`'d, env-gated)

- `crates/aprender-serve/tests/moe_kv_cache_equivalence.rs`: Generates 4 tokens via M32d cache-on path AND a legacy full-prefill loop. Asserts greedy outputs byte-identical. Pinned 553ms vs 1002ms perf numbers in eprintln output.
- `crates/aprender-serve/tests/m32d_perf.rs`: Generates 32 tokens; asserts sustained throughput ≥ 5 tok/s. Floor pinned via `M32D_TPS_FLOOR` constant. Catches future KV-cache regressions.

Activation:

```
QWEN3_MOE_GGUF_PATH=/path/to/qwen3-moe.gguf \
  cargo test --test moe_kv_cache_equivalence --test m32d_perf \
  -p aprender-serve --features cuda --release -- --ignored --nocapture
```

## Risk assessment vs scope doc

All 6 risk surfaces from `docs/specifications/m32d-moe-kv-cache-scope.md` were addressed:

1. **Numerical equivalence**: PASSED. Greedy argmax robust to ULP-scale logit drift; 4-token sequence byte-identical to full-prefill.
2. **Dense path regression**: NONE. Dense `forward_single_with_cache` not touched (only its sibling `single_cache_final_output` visibility bumped, which doesn't change semantics).
3. **RoPE position offset**: handled via `position` parameter passed to `apply_rope` (same pattern as dense reference).
4. **GQA expansion**: handled via `kv_dim()` config method (same as dense reference); first-token edge case (empty cache) explicitly handled by expanding V across Q heads.
5. **Expert routing under cache**: confirmed unaffected; router reads from current-token hidden state only.
6. **Streaming SSE for free**: structurally enabled but not wired into the chat handler (separate follow-up contract).

## Contract bump

v1.1.1 → v1.2.0:
- V1_004 entry gains `prerequisite_status` field documenting M32d shipped + empirical throughput numbers
- `evidence` field updated with the post-M32d operator dispatch recipe
- status_history appends v1.2.0 entry

## Companion-side downstream

paiml/claude-code-parity-apr operator can now dispatch Phase 6 bench:

```
APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  PHASE6_COMPLIANCE_ENFORCED=1 \
  PHASE6_MAX_TURNS=20 PHASE6_WALL_SECONDS=3600 \
  APR_TIMEOUT_S=900 APR_AGENT_HTTP_TIMEOUT_S=1500 \
  APR_AGENT_MAX_TOKENS_CAP=1024 \
  bash scripts/phase-6-bench.sh
```

Expected ~10 hour wall on full 20-fixture corpus at 9.62 tok/s sustained. Acceptance: `evidence/under-contract/scores.json` with `student_pass_rate > 0` discharges V1_004 + lifts CCPA M280 suspension.

## Supersedes

#1829 (Option b engineer-playbook + V1_004 status formalization): operator flipped from Option (b) to Option (a) in-session; this PR delivers the actual implementation. #1829 can be closed as superseded.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(distill): add DATASET_DIR env var → --dataset flag (Phase 4 Stage C-prep)

Threads the Phase 4 Stage B-2 `apr distill --dataset <DIR>` flag through the gx10 dispatch script. When `DATASET_DIR` is set, the script passes the directory to apr distill, which drives training from real-corpus .bin shards via ShardBatchSource. When unset (default), the pipeline falls back to SyntheticBatchSource (Phase 3 smoke semantics).

Preamble now surfaces which mode the dispatch is in:

    dataset: /path/to/shards (Phase 4 Stage B-2: real corpus via --dataset)
    dataset: (synthetic — Phase 3 smoke semantics)

Validates the directory exists on gx10 before launching apr distill; fails fast with a clear message otherwise.

Usage:

    STEPS=100 DATASET_DIR=/home/noah/data/codeparrot-shards-trial \
      bash scripts/dispatch-distill-phase-3-gx10.sh

Depends on PR #1839 (Stage B-2) landing first so `apr distill --dataset` exists on the rebuilt gx10 binary. With #1839 unmerged the script's invocation falls back to a clap error.

Phase 4 ladder progress:
  Stage A (#1833)      ✅ MERGED + verified
  Stage B-1 (#1836)    ✅ MERGED
  Stage B-2 (#1839)    🟡 in CI
  Stage C-prep (THIS)  dispatch script + 10 pre-staged shards
  Stage C              run on gx10 with --dataset
  Stage D              50K-step Phase 4 dispatch
  Stage E              HumanEval pass@1
  Stage F              publish v2

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
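The four-step generate loop from the M32d commit (size the cache, prefill token by token, greedy decode, stop on stop token / budget / full cache) can be illustrated with a toy model. This is not the aprender-serve code: `KvCache`, `forward_single_with_cache`, `generate`, and the `VOCAB` constant are hypothetical, the "model" just peaks at the sum of cached tokens modulo the vocabulary, and whether the real loop emits the stop token is not stated in the commit (this sketch does not emit it). Only the cache-sizing formula and the loop structure follow the description.

```rust
const VOCAB: usize = 10;

/// Toy cache: stores token ids where a real cache would store K/V tensors.
pub struct KvCache {
    entries: Vec<u32>,
    capacity: usize,
}

impl KvCache {
    pub fn with_capacity(capacity: usize) -> Self {
        Self { entries: Vec::with_capacity(capacity), capacity }
    }
    pub fn len(&self) -> usize {
        self.entries.len()
    }
    pub fn is_full(&self) -> bool {
        self.entries.len() >= self.capacity
    }
}

/// Toy forward pass: appends the token at `position` and returns logits
/// that peak at (sum of cached tokens) % VOCAB.
pub fn forward_single_with_cache(token: u32, position: usize, cache: &mut KvCache) -> Vec<f32> {
    debug_assert_eq!(position, cache.len());
    cache.entries.push(token);
    let peak = cache.entries.iter().sum::<u32>() as usize % VOCAB;
    let mut logits = vec![0.0; VOCAB];
    logits[peak] = 1.0;
    logits
}

fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0;
    for (i, &value) in logits.iter().enumerate() {
        if value > logits[best] {
            best = i;
        }
    }
    best as u32
}

pub fn generate(prompt: &[u32], max_tokens: usize, stop_tokens: &[u32], context_length: usize) -> Vec<u32> {
    // 1. Size the cache as max(context_length, prompt_len + max_tokens + 8).
    let capacity = context_length.max(prompt.len() + max_tokens + 8);
    let mut cache = KvCache::with_capacity(capacity);

    // 2. Prefill: one cache-aware forward per prompt token; the last logits seed decode.
    let mut logits = Vec::new();
    for (position, &token) in prompt.iter().enumerate() {
        logits = forward_single_with_cache(token, position, &mut cache);
    }

    let mut generated = Vec::new();
    if logits.is_empty() {
        return generated; // empty prompt: nothing to decode from
    }

    // 3. Decode: greedy argmax, append, next cache-aware forward.
    loop {
        let next = argmax(&logits);
        // 4. Stop on a stop token, an exhausted budget, or a full cache.
        if stop_tokens.contains(&next) {
            break;
        }
        generated.push(next);
        if generated.len() >= max_tokens || cache.is_full() {
            break;
        }
        let position = cache.len();
        logits = forward_single_with_cache(next, position, &mut cache);
    }
    generated
}
```

Each decode step runs one forward over a single token rather than re-running the whole prefix, which is where the speedup over full-prefill-per-token comes from.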
noahgift added a commit that referenced this pull request on May 20, 2026
…ASSES (Phase 4 ladder) (#1845)

2026-05-20 12:34 UTC: first end-to-end Phase 4 dispatch with real corpus (.bin shards via ShardBatchSource). 0.5B Qwen2.5-Coder teacher → 0.5B student on Blackwell GB10 (sm_121), 100-step trial.

  initial_loss = 15.6094
  final_loss   = 6.0095   ← Δ = -9.60 (-62% reduction)
  124 steps, 232.4s, 1.87 sec/step

This is the first real-corpus Phase 4 dispatch. The synthetic Phase 3 victory (#1828, -0.47 over 62 steps) and the seq_len=256 Stage A smoke (#1833, -6.80) both predicted Phase 4 readiness; Stage C confirms it with strictly better convergence on real data (codeparrot Python tokenized to Qwen vocab, 10 shards / 383 MB).

What this validates:
- ShardBatchSource (PR #1836, PMAT-PHASE4-STAGE-B-1) reads .bin shards correctly and produces non-degenerate batches
- Pipeline integration (PR #1839, PMAT-PHASE4-STAGE-B-2) swaps from synthetic → real source via with_batch_source() cleanly
- Dispatch script DATASET_DIR knob (PR #1840) end-to-end through gx10
- Full Phase 4 readiness for the 50K-step Stage D run (compute-gated, requires user check-in per autonomous-mode rule)

Cascade math:
  Stage A: Δloss = -6.80 over 62 steps (synthetic, seq=256)
  Stage C: Δloss = -9.60 over 124 steps (real corpus, seq=256)

Per-step loss decrease:
  Stage A: -0.110/step
  Stage C: -0.077/step

Stage A's per-step rate is higher because synthetic data has zero variance: every batch is the same identity-mapping task. Real-corpus Stage C has higher variance but covers more concepts, so absolute delta is larger.

Phase 4 ladder progress:
  Stage A (#1833)                ✅ MERGED + verified
  Stage B-1 (#1836)              ✅ MERGED
  Stage B-2 (#1839)              ✅ MERGED
  Stage C-prep (#1840)           ✅ MERGED
  Stage B-1.5 tests (#1841)      🟡 in CI
  Stage C trial (THIS evidence)  ✅ PASSED 2026-05-20
  Stage D 50K dispatch           ⏳ awaiting user check-in (28h GB10 compute)
  Stage E HumanEval pass@1       ⏳ Phase 5 (turnkey post-Stage-D)
  Stage F publish v2             ⏳ Phase 6 (turnkey post-Stage-E)

Evidence:
- evidence/distill-stage-c-trial/dispatch.json: dispatch manifest
- evidence/distill-stage-c-trial/launch-victory.txt: full training log

Run dir on gx10: /home/noah/runs/distill-smoke-20260520-123259/
Trained checkpoint: student-trained.apr/model.safetensors

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
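The "cascade math" figures in this commit can be re-derived from the reported numbers. The helper below is a hypothetical sketch (the `Run` struct and `per_step_from_delta` are illustrative names, not part of the repository); the losses, deltas, and step counts are the ones quoted above.

```rust
/// Loss endpoints and step count of one training run.
pub struct Run {
    pub initial_loss: f64,
    pub final_loss: f64,
    pub steps: u32,
}

impl Run {
    /// Absolute change in loss (negative when the loss went down).
    pub fn delta(&self) -> f64 {
        self.final_loss - self.initial_loss
    }

    /// Fraction of the initial loss that was removed.
    pub fn relative_reduction(&self) -> f64 {
        -self.delta() / self.initial_loss
    }

    /// Average loss change per step.
    pub fn per_step(&self) -> f64 {
        per_step_from_delta(self.delta(), self.steps)
    }
}

/// Per-step rate when only the total delta is known (as for Stage A).
pub fn per_step_from_delta(delta: f64, steps: u32) -> f64 {
    delta / f64::from(steps)
}

/// Stage C numbers as reported in the commit message.
pub fn stage_c() -> Run {
    Run { initial_loss: 15.6094, final_loss: 6.0095, steps: 124 }
}
```

With these inputs, Stage C gives a delta of about -9.60, a relative reduction of about 0.615, and about -0.077 per step, while `per_step_from_delta(-6.80, 62)` gives about -0.110 for Stage A, matching the rounded values in the commit.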
Phase 4 Stage B-1: BatchSource trait foundation
SPEC-DISTILL-001 Phase 4 needs real-corpus training batches in `pipeline.train()`. The synthetic-batch logic currently lives inline (PMAT-698m / PMAT-698o). This PR factors batch production into a trait so the pipeline can swap synthetic vs real-corpus sources without touching the training loop.

What lands
- `BatchSource` trait: `next_batch(batch_size, seq_len) -> (inputs, labels)`
- `SyntheticBatchSource`: wraps existing PMAT-698m/698o logic verbatim
- `ShardBatchSource` (feature `shard-batch-source`): wraps `entrenar::train::shard_reader::ShardBatchIter` for real `.bin` token shards (u32 LE), wrap-around enabled by default

What does NOT land
Pipeline integration + CLI flag. Deferred to the Stage B-2 PR to keep this change reviewable. The foundation is independently useful: tests can construct sources directly.
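To make the "u32 LE shards with wrap-around" idea concrete, here is a small self-contained sketch. It does not use or reproduce `entrenar`'s `ShardBatchIter`, whose API is not shown in this PR; `decode_u32_le` and `WrappingShardSource` are hypothetical names. The next-token label shift (labels are the inputs moved one position ahead) and the non-overlapping stride of `seq_len` are assumptions, since the PR does not say how labels are derived or how windows advance.

```rust
/// Decode a byte buffer of little-endian u32 token ids.
/// Trailing bytes that do not fill a whole u32 are ignored.
pub fn decode_u32_le(bytes: &[u8]) -> Vec<u32> {
    bytes
        .chunks_exact(4)
        .map(|chunk| u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]))
        .collect()
}

/// Hypothetical in-memory stand-in for a shard-backed batch source.
pub struct WrappingShardSource {
    tokens: Vec<u32>,
    cursor: usize,
}

impl WrappingShardSource {
    pub fn new(tokens: Vec<u32>) -> Self {
        assert!(!tokens.is_empty(), "shard must contain at least one token");
        Self { tokens, cursor: 0 }
    }

    /// Returns `(inputs, labels)`, each `batch_size` rows of `seq_len` tokens.
    /// Reads wrap to the start of the corpus instead of stopping at the end.
    pub fn next_batch(&mut self, batch_size: usize, seq_len: usize) -> (Vec<Vec<u32>>, Vec<Vec<u32>>) {
        let n = self.tokens.len();
        let mut inputs = Vec::with_capacity(batch_size);
        let mut labels = Vec::with_capacity(batch_size);
        for _ in 0..batch_size {
            // Take seq_len + 1 tokens so the labels can be the inputs shifted by one.
            let window: Vec<u32> = (0..=seq_len)
                .map(|i| self.tokens[(self.cursor + i) % n])
                .collect();
            inputs.push(window[..seq_len].to_vec());
            labels.push(window[1..].to_vec());
            self.cursor = (self.cursor + seq_len) % n;
        }
        (inputs, labels)
    }
}
```

With wrap-around, a schedule longer than one pass over the corpus keeps producing full-sized batches, which matches the stated motivation that the 50K-step schedule may exceed a single epoch on smaller subsets.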
Test plan
- `batch_source` lib tests (shape, modulo wrap, reset is no-op)
- `cargo check --features shard-batch-source` clean

Phase 4 ladder progress
`apr distill --dataset <PATH>`

🤖 Generated with Claude Code