chore(distill): DATASET_DIR env var in dispatch script (Phase 4 Stage C-prep) by noahgift · Pull Request #1840 · paiml/aprender

noahgift · 2026-05-20T08:55:48Z

Phase 4 Stage C-prep

Threads the Phase 4 Stage B-2 apr distill --dataset <DIR> flag through the gx10 dispatch script. When DATASET_DIR is set, the script passes the directory to apr distill, which drives training from real-corpus .bin shards via ShardBatchSource. When unset (default), the pipeline falls back to SyntheticBatchSource (Phase 3 smoke semantics).

Preamble surfaces the mode

dataset: /path/to/shards (Phase 4 Stage B-2: real corpus via --dataset)
# or
dataset: (synthetic — Phase 3 smoke semantics)
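As a rough illustration of the mode selection described above, the following shell sketch prints the preamble line and builds the extra `apr distill` arguments from a `DATASET_DIR` value. The function names `dataset_preamble` and `dataset_args` are hypothetical. The actual contents of `scripts/dispatch-distill-phase-3-gx10.sh` are not shown in this description, so treat this as one plausible shape rather than the script itself.

```shell
# Hypothetical helpers; not taken from the real dispatch script.

# Print the preamble line reporting which batch source will be used.
# $1 = value of DATASET_DIR (may be empty or omitted).
# The two message texts are copied from the preamble shown in this PR.
dataset_preamble() {
  if [ -n "${1:-}" ]; then
    printf 'dataset: %s (Phase 4 Stage B-2: real corpus via --dataset)\n' "$1"
  else
    printf 'dataset: (synthetic — Phase 3 smoke semantics)\n'
  fi
}

# Emit the extra apr distill arguments, one per line.
# Prints nothing when DATASET_DIR is empty, which leaves the synthetic default.
dataset_args() {
  if [ -n "${1:-}" ]; then
    printf '%s\n' --dataset "$1"
  fi
}

dataset_preamble "${DATASET_DIR:-}"
```

With `DATASET_DIR` unset, `dataset_args` prints nothing, so the command line stays the same as in the Phase 3 smoke dispatch.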

Usage

STEPS=100 DATASET_DIR=/home/noah/data/codeparrot-shards-trial \
  bash scripts/dispatch-distill-phase-3-gx10.sh

Test plan

DRY_RUN=1 with and without DATASET_DIR — preamble surfaces the correct mode
Validates directory existence on gx10 before launching
Stage C live trial run (after feat(distill): Pipeline + apr distill CLI integrate BatchSource (Phase 4 Stage B-2) #1839 lands and gx10 rebuilds the apr binary)
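The first two test-plan items could be exercised with helpers along these lines. This is a minimal sketch: `validate_dataset_dir` and `run_or_echo` are hypothetical names, the error text is invented for illustration, and the check is a plain local `test -d` (the real script validates the directory on gx10, possibly by other means).

```shell
# Hypothetical: fail fast when DATASET_DIR is set but is not a directory.
# $1 = value of DATASET_DIR. Returns 0 when empty (synthetic mode) or valid.
validate_dataset_dir() {
  if [ -z "${1:-}" ]; then
    return 0   # unset: synthetic mode, nothing to validate
  fi
  if [ ! -d "$1" ]; then
    echo "error: DATASET_DIR does not exist: $1" >&2
    return 1
  fi
}

# Hypothetical: under DRY_RUN=1, print the command instead of running it.
run_or_echo() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "DRY_RUN: $*"
  else
    "$@"
  fi
}
```

Running the dispatch once with `DRY_RUN=1` and once with `DRY_RUN=1 DATASET_DIR=...` would then show both preamble modes without launching a training run.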

Pre-staged data

10 shards already copied to gx10:/home/noah/data/codeparrot-shards-trial/ (383 MB). Trial is ready to fire as soon as #1839 lands.

Phase 4 ladder progress

Stage	PR	Status
A	#1833	✅ MERGED + verified
B-1	#1836	✅ MERGED
B-2	#1839	🟡 in CI
C-prep	THIS	dispatch script
C	(next)	live trial w/ --dataset
D	(compute-gated)	50K-step Phase 4
E	(Phase 5)	HumanEval pass@1
F	(Phase 6)	publish v2

🤖 Generated with Claude Code

Implements M32d KV cache support on the qwen3_moe inference path. Discharges the prerequisite for FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004 in contracts/qwen3-moe-serve-dispatch-v1.yaml (v1.1.1 → v1.2.0).

## Empirical results

On Qwen3-Coder-30B-A3B-Instruct-Q4_K_M:

- **Pre-M32d**: ~0.5 tok/s (full-prefill-per-token; bench timed out on every per-turn budget; 5 timeout-class dispatches recorded in paiml/claude-code-parity-apr evidence/phase-6/30b-moe-empirical-2026-05-19.md)
- **Post-M32d**: **9.62 tok/s sustained** on 32-token generation (19× speedup; comfortably above the ≥ 5 tok/s scope target)
- **Numerical equivalence**: byte-identical greedy outputs vs full-prefill on 4-token reference (cache-on=553ms vs cache-off=1002ms; ~2× speedup even at small token counts; gap compounds with length)
- **V1_001 + V1_003 regression**: existing #1819 cargo test still passes (9.39s wall, content "Human: What", no matmul guard fire)

## Implementation

**New function**: `OwnedQuantizedModel::forward_single_qwen3_moe_with_cache` in `crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs`. Mirrors the dense `forward_single_with_cache` reference (ffn_block.rs:~165) step-for-step EXCEPT at the FFN block, where it calls `moe_ffn_forward_layer` (router → top-k expert select → per-expert SwiGLU → weighted sum → down projection) instead of the dense gate/up/down dispatch. Attention block (QKV proj, per-head Q/K RMSNorm, RoPE at `position`, GQA-aware cached attention, output projection, residual) is byte-identical to the dense reference.

**Generate loop rewrite**: `crates/aprender-serve/src/infer/qwen3_moe_generate.rs::run_qwen3_moe_generate` now:

1. Allocates `OwnedQuantizedKVCache` sized to `max(REALIZR_CONTEXT_LENGTH, prompt_len + max_tokens + 8)`
2. Prefill: per prompt token, calls `forward_single_qwen3_moe_with_cache` (cache fills incrementally; final iteration's logits seed decode)
3. Decode: greedy-argmax → append → next cache-aware forward
4. Stop on `stop_tokens` or `max_tokens` exhausted or cache full

**Visibility fix**: `single_cache_final_output` in `ffn_block.rs` bumped to `pub(crate)` so the MoE function can reuse the dense final-norm + LM head path unchanged. Same edit applied to the orphan `debug.rs` duplicate for hygiene (it's not in the build graph but mirrors ffn_block.rs).

## New tests (both `#[ignore]`'d, env-gated)

- `crates/aprender-serve/tests/moe_kv_cache_equivalence.rs`: Generates 4 tokens via M32d cache-on path AND a legacy full-prefill loop. Asserts greedy outputs byte-identical. Pinned 553ms vs 1002ms perf numbers in eprintln output.
- `crates/aprender-serve/tests/m32d_perf.rs`: Generates 32 tokens; asserts sustained throughput ≥ 5 tok/s. Floor pinned via `M32D_TPS_FLOOR` constant. Catches future KV-cache regressions.

Activation:

```
QWEN3_MOE_GGUF_PATH=/path/to/qwen3-moe.gguf \
cargo test --test moe_kv_cache_equivalence --test m32d_perf \
  -p aprender-serve --features cuda --release -- --ignored --nocapture
```

## Risk assessment vs scope doc

All 6 risk surfaces from `docs/specifications/m32d-moe-kv-cache-scope.md` were addressed:

1. **Numerical equivalence**: PASSED. Greedy argmax robust to ULP-scale logit drift; 4-token sequence byte-identical to full-prefill.
2. **Dense path regression**: NONE. Dense `forward_single_with_cache` not touched (only its sibling `single_cache_final_output` visibility bumped, which doesn't change semantics).
3. **RoPE position offset**: handled via `position` parameter passed to `apply_rope` (same pattern as dense reference).
4. **GQA expansion**: handled via `kv_dim()` config method (same as dense reference); first-token edge case (empty cache) explicitly handled by expanding V across Q heads.
5. **Expert routing under cache**: confirmed unaffected; router reads from current-token hidden state only.
6. **Streaming SSE for free**: structurally enabled but not wired into the chat handler (separate follow-up contract).
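The cache-sizing rule and the throughput figures quoted above reduce to simple arithmetic, sketched here in shell for quick sanity checks. The real logic lives in Rust inside aprender-serve. The function names below are hypothetical and are not part of the crate or its tests.

```shell
# Illustrative arithmetic only; not the aprender-serve implementation.

# Cache capacity: max(context_length, prompt_len + max_tokens + 8).
# $1 = context length (the REALIZR_CONTEXT_LENGTH value)
# $2 = prompt_len, $3 = max_tokens
kv_cache_capacity() {
  needed=$(( $2 + $3 + 8 ))
  if [ "$1" -gt "$needed" ]; then
    echo "$1"
  else
    echo "$needed"
  fi
}

# Throughput in tok/s from a token count ($1) and elapsed milliseconds ($2).
tokens_per_second() {
  awk -v n="$1" -v ms="$2" 'BEGIN { printf "%.2f\n", n * 1000 / ms }'
}

# Exit 0 when measured tok/s ($1) is at or above the floor ($2), else exit 1.
meets_tps_floor() {
  awk -v tps="$1" -v floor="$2" 'BEGIN { exit !(tps >= floor) }'
}

tokens_per_second 4 553    # cache-on reference run, prints 7.23
tokens_per_second 4 1002   # cache-off reference run, prints 3.99
```

For example, `meets_tps_floor 9.62 5` succeeds while `meets_tps_floor 0.5 5` fails, which matches the post-M32d and pre-M32d figures against the 5 tok/s scope target.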
## Contract bump

v1.1.1 → v1.2.0:

- V1_004 entry gains `prerequisite_status` field documenting M32d shipped + empirical throughput numbers
- `evidence` field updated with the post-M32d operator dispatch recipe
- status_history appends v1.2.0 entry

## Companion-side downstream

paiml/claude-code-parity-apr operator can now dispatch Phase 6 bench:

```
APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
PHASE6_COMPLIANCE_ENFORCED=1 \
PHASE6_MAX_TURNS=20 PHASE6_WALL_SECONDS=3600 \
APR_TIMEOUT_S=900 APR_AGENT_HTTP_TIMEOUT_S=1500 \
APR_AGENT_MAX_TOKENS_CAP=1024 \
bash scripts/phase-6-bench.sh
```

Expected ~10 hour wall on full 20-fixture corpus at 9.62 tok/s sustained. Acceptance: `evidence/under-contract/scores.json` with `student_pass_rate > 0` discharges V1_004 + lifts CCPA M280 suspension.

## Supersedes

#1829 (Option b engineer-playbook + V1_004 status formalization): operator flipped from Option (b) to Option (a) in-session; this PR delivers the actual implementation. #1829 can be closed as superseded.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ge C-prep)

Threads the Phase 4 Stage B-2 `apr distill --dataset <DIR>` flag through the gx10 dispatch script. When `DATASET_DIR` is set, the script passes the directory to apr distill, which drives training from real-corpus .bin shards via ShardBatchSource. When unset (default), the pipeline falls back to SyntheticBatchSource (Phase 3 smoke semantics).

Preamble now surfaces which mode the dispatch is in:

    dataset: /path/to/shards (Phase 4 Stage B-2: real corpus via --dataset)
    dataset: (synthetic — Phase 3 smoke semantics)

Validates the directory exists on gx10 before launching apr distill; fails fast with a clear message otherwise.

Usage:

    STEPS=100 DATASET_DIR=/home/noah/data/codeparrot-shards-trial \
      bash scripts/dispatch-distill-phase-3-gx10.sh

Depends on PR #1839 (Stage B-2) landing first so `apr distill --dataset` exists on the rebuilt gx10 binary. With #1839 unmerged, the script's invocation fails with a clap error.

Phase 4 ladder progress:

    Stage A      (#1833) ✅ MERGED + verified
    Stage B-1    (#1836) ✅ MERGED
    Stage B-2    (#1839) 🟡 in CI
    Stage C-prep (THIS)  dispatch script + 10 pre-staged shards
    Stage C      run on gx10 with --dataset
    Stage D      50K-step Phase 4 dispatch
    Stage E      HumanEval pass@1
    Stage F      publish v2

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
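One way to surface the #1839 dependency earlier than a clap error is a preflight that looks for the flag in the binary's help text. The sketch below is an assumption, not something the dispatch script is known to do: `has_dataset_flag` is a hypothetical helper, and the commented wiring assumes `apr distill --help` prints clap-generated help.

```shell
# Hypothetical preflight: reads help text on stdin and succeeds (exit 0)
# only if the text mentions a --dataset flag.
has_dataset_flag() {
  grep -q -e '--dataset'
}

# Example wiring (assumption: `apr distill --help` prints the flag list):
#   if ! apr distill --help | has_dataset_flag; then
#     echo "apr binary lacks --dataset; rebuild after #1839 lands" >&2
#     exit 1
#   fi
```

Reading from stdin keeps the helper testable without the `apr` binary installed.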

…ASSES (Phase 4 ladder) (#1845)

2026-05-20 12:34 UTC: first end-to-end Phase 4 dispatch with real corpus (.bin shards via ShardBatchSource). 0.5B Qwen2.5-Coder teacher → 0.5B student on Blackwell GB10 (sm_121), 100-step trial.

    initial_loss = 15.6094
    final_loss   = 6.0095   ← Δ = -9.60 (-62% reduction)
    124 steps, 232.4s, 1.87 sec/step

This is the first real-corpus Phase 4 dispatch. The synthetic Phase 3 victory (#1828, -0.47 over 62 steps) and the seq_len=256 Stage A smoke (#1833, -6.80) both predicted Phase 4 readiness; Stage C confirms it with strictly better convergence on real data (codeparrot Python tokenized to Qwen vocab, 10 shards / 383 MB).

What this validates:

- ShardBatchSource (PR #1836, PMAT-PHASE4-STAGE-B-1) reads .bin shards correctly and produces non-degenerate batches
- Pipeline integration (PR #1839, PMAT-PHASE4-STAGE-B-2) swaps from synthetic → real source via with_batch_source() cleanly
- Dispatch script DATASET_DIR knob (PR #1840) end-to-end through gx10
- Full Phase 4 readiness for the 50K-step Stage D run (compute-gated, requires user check-in per autonomous-mode rule)

Cascade math:

    Stage A: Δloss = -6.80 over 62 steps (synthetic, seq=256)
    Stage C: Δloss = -9.60 over 124 steps (real corpus, seq=256)

Per-step loss decrease:

    Stage A: -0.110/step
    Stage C: -0.077/step

Stage A's per-step rate is higher because synthetic data has zero variance: every batch is the same identity-mapping task. Real-corpus Stage C has higher variance but covers more concepts, so absolute delta is larger.
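The cascade math above can be rechecked from the reported numbers with a few lines of shell and awk. The helper names `loss_per_step` and `loss_change_pct` are hypothetical and exist only for this illustration. The inputs are the figures quoted in the commit message.

```shell
# Per-step loss change. $1 = total loss delta, $2 = number of steps.
loss_per_step() {
  awk -v d="$1" -v n="$2" 'BEGIN { printf "%.3f\n", d / n }'
}

# Relative loss change in percent. $1 = initial loss, $2 = final loss.
loss_change_pct() {
  awk -v i="$1" -v f="$2" 'BEGIN { printf "%.1f\n", (f - i) / i * 100 }'
}

loss_per_step -6.80 62           # Stage A, prints -0.110
loss_per_step -9.60 124          # Stage C, prints -0.077
loss_change_pct 15.6094 6.0095   # Stage C, prints -61.5
```

The last figure rounds to the -62% reduction reported for the Stage C trial.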
Phase 4 ladder progress:

    Stage A (#1833)               ✅ MERGED + verified
    Stage B-1 (#1836)             ✅ MERGED
    Stage B-2 (#1839)             ✅ MERGED
    Stage C-prep (#1840)          ✅ MERGED
    Stage B-1.5 tests (#1841)     🟡 in CI
    Stage C trial (THIS evidence) ✅ PASSED 2026-05-20
    Stage D 50K dispatch          ⏳ awaiting user check-in (28h GB10 compute)
    Stage E HumanEval pass@1      ⏳ Phase 5 (turnkey post-Stage-D)
    Stage F publish v2            ⏳ Phase 6 (turnkey post-Stage-E)

Evidence:

- evidence/distill-stage-c-trial/dispatch.json: dispatch manifest
- evidence/distill-stage-c-trial/launch-victory.txt: full training log

Run dir on gx10: /home/noah/runs/distill-smoke-20260520-123259/
Trained checkpoint: student-trained.apr/model.safetensors

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift and others added 2 commits May 20, 2026 07:53

noahgift enabled auto-merge (squash) May 20, 2026 08:55

noahgift mentioned this pull request May 20, 2026

test(distill): fixture-driven integration tests for ShardBatchSource (F-DISTILL-SHARD-BATCH-001/002) #1841

Merged

2 tasks

noahgift added 2 commits May 20, 2026 11:22

Merge branch 'main' into chore/distill-phase-3-dispatch-dataset-flag

03b479a

Merge branch 'main' into chore/distill-phase-3-dispatch-dataset-flag

5fc9fe5

noahgift merged commit fc814f2 into main May 20, 2026
10 checks passed

noahgift deleted the chore/distill-phase-3-dispatch-dataset-flag branch May 20, 2026 10:32

noahgift mentioned this pull request May 20, 2026

evidence(distill): Stage C — first real-corpus distillation on GB10 PASSES #1845

Merged

noahgift merged 4 commits into main from chore/distill-phase-3-dispatch-dataset-flag
