
feat(distill): Pipeline + apr distill CLI integrate BatchSource (Phase 4 Stage B-2)#1839

Merged
noahgift merged 5 commits into main from feat/distill-pipeline-batch-source-integration-stage-b2
May 20, 2026

Conversation

@noahgift
Contributor

Phase 4 Stage B-2: wire BatchSource into Pipeline + CLI

Builds on Stage B-1 (#1836: BatchSource trait + 2 impls, just merged). This PR wires the trait into the pipeline and exposes it via the CLI so Phase 4 real-corpus dispatch can drive distillation from .bin token shards.

Changes

  • Pipeline::batch_source field (default: SyntheticBatchSource(32))
  • Pipeline::with_batch_source(Box<dyn BatchSource>) builder
  • Pipeline::train() pulls each batch via batch_source.next_batch(...) instead of constructing inline (same synthetic semantics on the default path — smoke + fixture tests unaffected)
  • apr distill --dataset <DIR> CLI flag for real-corpus shard directory
  • run_cuda_backend() constructs a ShardBatchSource from the dataset dir and passes it to the pipeline; default path remains synthetic
  • aprender-train-distill?/shard-batch-source feature added to apr-cli's training feature
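
The wiring described above can be sketched roughly as follows. This is a minimal illustration, not the crate's actual code: the `Batch` alias, the `ConstantSource` stand-in, the `Pipeline::new` constructor, and the checksum returned by `train` are all assumptions made so the example is self-contained. Only the `batch_source` field, the `with_batch_source(Box<dyn BatchSource>)` builder, and the `next_batch(...)` call inside `train()` come from the list above.

```rust
/// Hypothetical batch layout: (inputs, labels), each `batch_size` rows of `seq_len` tokens.
type Batch = (Vec<Vec<u32>>, Vec<Vec<u32>>);

/// Minimal version of the trait the pipeline depends on.
trait BatchSource {
    fn next_batch(&mut self, batch_size: usize, seq_len: usize) -> Batch;
}

/// Hypothetical stand-in source: every position holds the same token.
struct ConstantSource(u32);

impl BatchSource for ConstantSource {
    fn next_batch(&mut self, batch_size: usize, seq_len: usize) -> Batch {
        let rows = vec![vec![self.0; seq_len]; batch_size];
        (rows.clone(), rows)
    }
}

struct Pipeline {
    batch_source: Box<dyn BatchSource>,
    batch_size: usize,
    seq_len: usize,
}

impl Pipeline {
    /// Assumed constructor: starts with a default source, as the PR describes.
    fn new(batch_size: usize, seq_len: usize) -> Self {
        Self { batch_source: Box::new(ConstantSource(1)), batch_size, seq_len }
    }

    /// Builder that swaps in any other source (e.g. a shard-backed one).
    fn with_batch_source(mut self, source: Box<dyn BatchSource>) -> Self {
        self.batch_source = source;
        self
    }

    /// Placeholder training loop: pulls one batch per step from the source
    /// and returns the sum of all input token ids instead of doing real work.
    fn train(&mut self, steps: usize) -> u64 {
        let mut checksum = 0u64;
        for _ in 0..steps {
            let (inputs, _labels) = self.batch_source.next_batch(self.batch_size, self.seq_len);
            for row in &inputs {
                checksum += row.iter().map(|&t| u64::from(t)).sum::<u64>();
            }
        }
        checksum
    }
}
```

Because the loop only talks to `dyn BatchSource`, the default path and the swapped-in path run the same `train()` body, which is why the existing smoke and fixture tests can stay unchanged.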

Test plan

  • cargo check -p apr-cli --features inference,training,cuda clean
  • 5968 apr-cli lib tests pass (incl. 21 distill tests after a 9-call-site signature update for the new arg)
  • 61 aprender-train-distill lib tests pass (fixture path semantics unchanged)
  • Live Phase 4 dispatch with --dataset <PATH> (Stage C trial run)

Phase 4 ladder progress

Stage  PR               Status
A      #1833 PMAT-698o  ✅ MERGED + verified
B-1    #1836            ✅ MERGED
B-2    THIS             Pipeline + CLI integration
C      (next)           7B teacher trial with real corpus
D      (compute-gated)  50K-step Phase 4 dispatch
E      (Phase 5)        HumanEval pass@1
F      (Phase 6)        publish v2

🤖 Generated with Claude Code

noahgift and others added 2 commits May 20, 2026 08:39
…ource (Phase 4 Stage B foundation)

SPEC-DISTILL-001 Phase 4 needs real-corpus training batches in
pipeline.train(). The synthetic-batch logic currently lives inline
(PMAT-698m / PMAT-698o). This PR factors batch production into a
trait so the pipeline can swap synthetic vs real-corpus sources
without touching the training loop.

Adds:
- `BatchSource` trait — `next_batch(batch_size, seq_len) -> (inputs, labels)`
- `SyntheticBatchSource` — wraps existing PMAT-698m/698o logic verbatim
  (each row is `seq_len` copies of a distinct token; identity-mapping
  task remains satisfiable on smoke + fixture tests)
- `ShardBatchSource` (feature `shard-batch-source`) — wraps
  `entrenar::train::shard_reader::ShardBatchIter` for real .bin
  token shards. Wrap-around enabled by default (Phase 4's 50K-step
  schedule may exceed a single corpus epoch on smaller subsets).
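
As a rough illustration of the trait and the synthetic source listed above, here is a minimal sketch. The `Token` and `Batch` aliases, the `vocab` field name, and the choice of labels equal to inputs are assumptions; the commit only states the `next_batch(batch_size, seq_len) -> (inputs, labels)` shape, that each row is `seq_len` copies of a distinct token, and that the tests cover shape, modulo wrap, and a no-op reset.

```rust
/// Hypothetical token type; the real crate may use a different integer width.
type Token = u32;

/// (inputs, labels), each `batch_size` rows of `seq_len` tokens.
type Batch = (Vec<Vec<Token>>, Vec<Vec<Token>>);

/// Produces training batches so the training loop need not know their origin.
trait BatchSource {
    fn next_batch(&mut self, batch_size: usize, seq_len: usize) -> Batch;

    /// Rewind to the start of the data; a no-op for stateless sources.
    fn reset(&mut self) {}
}

/// Synthetic source: row `i` is `seq_len` copies of token `i % vocab`.
struct SyntheticBatchSource {
    /// Assumed meaning of the constructor argument: the token-id modulus.
    vocab: Token,
}

impl SyntheticBatchSource {
    fn new(vocab: Token) -> Self {
        assert!(vocab > 0, "vocab must be non-zero");
        Self { vocab }
    }
}

impl BatchSource for SyntheticBatchSource {
    fn next_batch(&mut self, batch_size: usize, seq_len: usize) -> Batch {
        let inputs: Vec<Vec<Token>> = (0..batch_size)
            .map(|row| vec![(row as Token) % self.vocab; seq_len])
            .collect();
        // Identity-mapping task (assumed): labels are a copy of the inputs.
        let labels = inputs.clone();
        (inputs, labels)
    }
}
```

With `SyntheticBatchSource::new(32)` and a batch of 40 rows, row 33 wraps around to token 1, which is the kind of behaviour a modulo-wrap test would pin down.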

Does NOT yet wire BatchSource into Pipeline::train() — that's a
follow-up PR (Stage B-2) to keep this change reviewable. Foundation
is independently useful: tests can construct sources directly.

Test plan:
- [x] 3 batch_source lib tests pass (shape, modulo wrap, reset is no-op)
- [x] 61 total distill lib tests pass (all existing tests unaffected)
- [x] `cargo check --features shard-batch-source` clean
- [ ] Stage B-2 PR wires Pipeline + adds `apr distill --dataset <PATH>` CLI flag

Phase 4 ladder progress:
  Stage A (#1833 PMAT-698o)         ✅ MERGED + verified (seq_len=256 smoke PASS)
  Stage B-1 (THIS)                  trait + impls foundation
  Stage B-2 (next)                  Pipeline + CLI integration
  Stage C                            7B teacher trial run
  Stage D                            50K-step Phase 4 dispatch
  Stage E                            HumanEval pass@1 (Phase 5)
  Stage F                            publish v2 (Phase 6)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…e 4 Stage B-2)

Builds on Stage B-1 (#1836 — BatchSource trait + 2 impls). Wires the
new trait into the pipeline and exposes it via the CLI so Phase 4
real-corpus dispatch can drive distillation from `.bin` token shards.

Changes:
- `Pipeline::batch_source` field (default: `SyntheticBatchSource(32)`)
- `Pipeline::with_batch_source(Box<dyn BatchSource>)` builder
- `Pipeline::train()` pulls each batch via `batch_source.next_batch(...)`
  instead of constructing inline (same synthetic semantics on the default
  path; smoke + fixture tests unaffected)
- `apr distill --dataset <DIR>` CLI flag for real-corpus shard directory
- `run_cuda_backend()` constructs a `ShardBatchSource` from the dataset
  dir and passes it to the pipeline; default path remains synthetic
- `aprender-train-distill?/shard-batch-source` feature added to apr-cli's
  `training` feature so the shard source compiles in by default

Test plan:
- [x] `cargo check -p apr-cli --features inference,training,cuda` clean
- [x] 5968 apr-cli lib tests pass (incl. 21 distill tests after a
      9-call-site signature update)
- [x] 61 aprender-train-distill lib tests pass
- [ ] Live Phase 4 dispatch with `--dataset <PATH>` (Stage C +)

Phase 4 ladder progress:
  Stage A (#1833 PMAT-698o)       ✅ MERGED + verified
  Stage B-1 (#1836)               🟡 in CI
  Stage B-2 (THIS)                Pipeline + CLI integration
  Stage C                          7B teacher trial w/ real corpus
  Stage D                          50K-step Phase 4 dispatch
  Stage E                          HumanEval pass@1 (Phase 5)
  Stage F                          publish v2 (Phase 6)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 20, 2026 07:35
@noahgift noahgift merged commit fb16a1c into main May 20, 2026
10 checks passed
@noahgift noahgift deleted the feat/distill-pipeline-batch-source-integration-stage-b2 branch May 20, 2026 09:54
noahgift added a commit that referenced this pull request May 20, 2026
… C-prep) (#1840)

* feat(M32d): KV cache for qwen3_moe inference path — 19× speedup

Implements M32d KV cache support on the qwen3_moe inference path.
Discharges the prerequisite for FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004
in contracts/qwen3-moe-serve-dispatch-v1.yaml (v1.1.1 → v1.2.0).

## Empirical results

On Qwen3-Coder-30B-A3B-Instruct-Q4_K_M:

- **Pre-M32d**: ~0.5 tok/s (full-prefill-per-token; bench timed out on
  every per-turn budget; 5 timeout-class dispatches recorded in
  paiml/claude-code-parity-apr
  evidence/phase-6/30b-moe-empirical-2026-05-19.md)
- **Post-M32d**: **9.62 tok/s sustained** on 32-token generation
  (19× speedup; comfortably above the ≥ 5 tok/s scope target)
- **Numerical equivalence**: byte-identical greedy outputs vs full-prefill
  on 4-token reference (cache-on=553ms vs cache-off=1002ms; ~1.8× speedup
  even at small token counts; gap compounds with length)
- **V1_001 + V1_003 regression**: existing #1819 cargo test still
  passes (9.39s wall, content "Human: What", no matmul guard fire)

## Implementation

**New function**: `OwnedQuantizedModel::forward_single_qwen3_moe_with_cache`
in `crates/aprender-serve/src/gguf/inference/forward/forward_qwen3_moe.rs`.
Mirrors the dense `forward_single_with_cache` reference (ffn_block.rs:
~165) step-for-step EXCEPT at the FFN block, where it calls
`moe_ffn_forward_layer` (router → top-k expert select → per-expert
SwiGLU → weighted sum → down projection) instead of the dense
gate/up/down dispatch. Attention block (QKV proj, per-head Q/K RMSNorm,
RoPE at `position`, GQA-aware cached attention, output projection,
residual) is byte-identical to the dense reference.
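
The routing step named above (router, top-k expert select, weighted sum) can be sketched as follows. This is not the body of `moe_ffn_forward_layer`: the function names are hypothetical, each expert is an opaque closure standing in for the per-expert SwiGLU, and renormalizing the selected softmax weights so they sum to 1 is an assumption about the mixing rule.

```rust
/// Numerically stable softmax over the router logits.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Pick the `k` most probable experts and renormalize their weights to sum to 1.
fn route_top_k(router_logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    let probs = softmax(router_logits);
    let mut ranked: Vec<(usize, f32)> = probs.into_iter().enumerate().collect();
    // Stable sort, descending by probability; ties keep the lower expert index first.
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    ranked.truncate(k);
    let total: f32 = ranked.iter().map(|&(_, p)| p).sum();
    ranked.into_iter().map(|(i, p)| (i, p / total)).collect()
}

/// Weighted sum of the selected experts' outputs for one token.
/// `expert(idx, hidden)` is a placeholder for the real per-expert FFN.
fn moe_mix(
    hidden: &[f32],
    router_logits: &[f32],
    k: usize,
    expert: impl Fn(usize, &[f32]) -> Vec<f32>,
) -> Vec<f32> {
    let mut out = vec![0.0f32; hidden.len()];
    for (idx, weight) in route_top_k(router_logits, k) {
        let y = expert(idx, hidden);
        for (o, v) in out.iter_mut().zip(y) {
            *o += weight * v;
        }
    }
    out
}
```

Note that the router input here is only the current token's hidden state, which is consistent with the observation in the risk assessment that routing is unaffected by the cache.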

**Generate loop rewrite**: `run_qwen3_moe_generate` in
`crates/aprender-serve/src/infer/qwen3_moe_generate.rs` now:
  1. Allocates `OwnedQuantizedKVCache` sized to
     `max(REALIZR_CONTEXT_LENGTH, prompt_len + max_tokens + 8)`
  2. Prefill: per prompt token, calls
     `forward_single_qwen3_moe_with_cache` (cache fills incrementally;
     final iteration's logits seed decode)
  3. Decode: greedy-argmax → append → next cache-aware forward
  4. Stop on `stop_tokens` or `max_tokens` exhausted or cache full
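
The four steps above follow a common prefill-then-decode pattern, sketched here under stated assumptions. `KvCache`, `CachedModel`, `forward_with_cache`, and `generate` are hypothetical stand-ins, not the aprender-serve types; the cache is reduced to a position counter, and emitting the stop token before halting is a choice made for this sketch only.

```rust
/// Hypothetical stand-in for a KV cache: tracks only how many positions are filled.
struct KvCache {
    capacity: usize,
    len: usize,
}

/// Hypothetical single-token forward pass: appends to the cache, returns logits.
trait CachedModel {
    fn forward_with_cache(&self, token: u32, position: usize, cache: &mut KvCache) -> Vec<f32>;
}

/// Index of the largest logit (first one wins on ties).
fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0usize;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best as u32
}

fn generate<M: CachedModel>(
    model: &M,
    prompt: &[u32],
    max_tokens: usize,
    stop_tokens: &[u32],
    context_len: usize,
) -> Vec<u32> {
    // 1. Size the cache to cover the prompt, the generation budget, and some slack.
    let capacity = context_len.max(prompt.len() + max_tokens + 8);
    let mut cache = KvCache { capacity, len: 0 };

    // 2. Prefill: one cached forward per prompt token; the last logits seed decode.
    let mut logits = Vec::new();
    for (pos, &tok) in prompt.iter().enumerate() {
        logits = model.forward_with_cache(tok, pos, &mut cache);
    }
    let mut out = Vec::new();
    if logits.is_empty() {
        return out; // empty prompt: nothing to decode from
    }

    // 3 + 4. Decode greedily until a stop token, the budget, or a full cache.
    while out.len() < max_tokens && cache.len < cache.capacity {
        let next = argmax(&logits);
        out.push(next);
        if stop_tokens.contains(&next) || out.len() == max_tokens {
            break;
        }
        logits = model.forward_with_cache(next, cache.len, &mut cache);
    }
    out
}

/// Toy model for illustration: always predicts `(token + 1) % vocab`.
struct NextTokenToy {
    vocab: usize,
}

impl CachedModel for NextTokenToy {
    fn forward_with_cache(&self, token: u32, position: usize, cache: &mut KvCache) -> Vec<f32> {
        assert_eq!(position, cache.len, "positions must be appended in order");
        cache.len += 1;
        let mut logits = vec![0.0; self.vocab];
        logits[(token as usize + 1) % self.vocab] = 1.0;
        logits
    }
}
```

Each decode step costs one single-token forward instead of a full re-prefill of the growing sequence, which is where the speedup over the pre-M32d path comes from.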

**Visibility fix**: `single_cache_final_output` in `ffn_block.rs`
bumped to `pub(crate)` so the MoE function can reuse the dense final-
norm + LM head path unchanged. Same edit applied to the orphan
`debug.rs` duplicate for hygiene (it's not in the build graph but mirrors
ffn_block.rs).

## New tests (both `#[ignore]`'d, env-gated)

- `crates/aprender-serve/tests/moe_kv_cache_equivalence.rs` —
  Generates 4 tokens via M32d cache-on path AND a legacy full-prefill
  loop. Asserts greedy outputs byte-identical. Pinned 553ms vs 1002ms
  perf numbers in eprintln output.
- `crates/aprender-serve/tests/m32d_perf.rs` —
  Generates 32 tokens; asserts sustained throughput ≥ 5 tok/s.
  Floor pinned via `M32D_TPS_FLOOR` constant. Catches future
  KV-cache regressions.

Activation:
```
QWEN3_MOE_GGUF_PATH=/path/to/qwen3-moe.gguf \
  cargo test --test moe_kv_cache_equivalence --test m32d_perf \
  -p aprender-serve --features cuda --release -- --ignored --nocapture
```

## Risk assessment vs scope doc

All 6 risk surfaces from `docs/specifications/m32d-moe-kv-cache-scope.md`
were addressed:

1. **Numerical equivalence**: PASSED. Greedy argmax robust to ULP-scale
   logit drift; 4-token sequence byte-identical to full-prefill.
2. **Dense path regression**: NONE. Dense `forward_single_with_cache`
   not touched (only its sibling `single_cache_final_output` visibility
   bumped, which doesn't change semantics).
3. **RoPE position offset**: handled via `position` parameter passed
   to `apply_rope` (same pattern as dense reference).
4. **GQA expansion**: handled via `kv_dim()` config method (same as
   dense reference); first-token edge case (empty cache) explicitly
   handled by expanding V across Q heads.
5. **Expert routing under cache**: confirmed unaffected — router reads
   from current-token hidden state only.
6. **Streaming SSE for free**: structurally enabled but not wired into
   the chat handler (separate follow-up contract).

## Contract bump

v1.1.1 → v1.2.0:
- V1_004 entry gains `prerequisite_status` field documenting M32d
  shipped + empirical throughput numbers
- `evidence` field updated with the post-M32d operator dispatch recipe
- status_history appends v1.2.0 entry

## Companion-side downstream

paiml/claude-code-parity-apr operator can now dispatch Phase 6 bench:

```
APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
PHASE6_COMPLIANCE_ENFORCED=1 \
PHASE6_MAX_TURNS=20 PHASE6_WALL_SECONDS=3600 \
APR_TIMEOUT_S=900 APR_AGENT_HTTP_TIMEOUT_S=1500 \
APR_AGENT_MAX_TOKENS_CAP=1024 \
bash scripts/phase-6-bench.sh
```

Expected ~10 hour wall on full 20-fixture corpus at 9.62 tok/s
sustained. Acceptance: `evidence/under-contract/scores.json` with
`student_pass_rate > 0` discharges V1_004 + lifts CCPA M280
suspension.

## Supersedes

#1829 (Option b engineer-playbook + V1_004 status
formalization) — operator flipped from Option (b) to Option (a)
in-session; this PR delivers the actual implementation. #1829 can be
closed as superseded.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(distill): add DATASET_DIR env var → --dataset flag (Phase 4 Stage C-prep)

Threads the Phase 4 Stage B-2 `apr distill --dataset <DIR>` flag through
the gx10 dispatch script. When `DATASET_DIR` is set, the script passes
the directory to apr distill, which drives training from real-corpus
.bin shards via ShardBatchSource. When unset (default), the pipeline
falls back to SyntheticBatchSource (Phase 3 smoke semantics).

Preamble now surfaces which mode the dispatch is in:
  dataset: /path/to/shards (Phase 4 Stage B-2: real corpus via --dataset)
  dataset: (synthetic — Phase 3 smoke semantics)

Validates the directory exists on gx10 before launching apr distill;
fails fast with a clear message otherwise.

Usage:
  STEPS=100 DATASET_DIR=/home/noah/data/codeparrot-shards-trial \
    bash scripts/dispatch-distill-phase-3-gx10.sh

Depends on PR #1839 (Stage B-2) landing first so `apr distill --dataset`
exists on the rebuilt gx10 binary. Until #1839 is merged, the script's
invocation fails with a clap error.

Phase 4 ladder progress:
  Stage A (#1833)                 ✅ MERGED + verified
  Stage B-1 (#1836)               ✅ MERGED
  Stage B-2 (#1839)               🟡 in CI
  Stage C-prep (THIS)             dispatch script + 10 pre-staged shards
  Stage C                          run on gx10 with --dataset
  Stage D                          50K-step Phase 4 dispatch
  Stage E                          HumanEval pass@1
  Stage F                          publish v2

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 20, 2026
…ASSES (Phase 4 ladder) (#1845)

2026-05-20 12:34 UTC — first end-to-end Phase 4 dispatch with real corpus
(.bin shards via ShardBatchSource). 0.5B Qwen2.5-Coder teacher → 0.5B
student on Blackwell GB10 (sm_121), 100-step trial.

  initial_loss = 15.6094
  final_loss   =  6.0095   ← Δ = -9.60 (62% reduction)
  124 steps, 232.4s, 1.87 sec/step

This is the first real-corpus Phase 4 dispatch. The synthetic Phase 3
victory (#1828, -0.47 over 62 steps) and the seq_len=256 Stage A smoke
(#1833, -6.80) both predicted Phase 4 readiness; Stage C confirms it
with strictly better convergence on real data (codeparrot Python
tokenized to Qwen vocab, 10 shards / 383 MB).

What this validates:
- ShardBatchSource (PR #1836, PMAT-PHASE4-STAGE-B-1) reads .bin shards
  correctly and produces non-degenerate batches
- Pipeline integration (PR #1839, PMAT-PHASE4-STAGE-B-2) swaps from
  synthetic → real source via with_batch_source() cleanly
- Dispatch script DATASET_DIR knob (PR #1840) end-to-end through gx10
- Full Phase 4 readiness for the 50K-step Stage D run (compute-gated,
  requires user check-in per autonomous-mode rule)

Cascade math:
  Stage A:  Δloss = -6.80 over 62 steps  (synthetic, seq=256)
  Stage C:  Δloss = -9.60 over 124 steps (real corpus, seq=256)
  Per-step loss decrease:
    Stage A: -0.110/step
    Stage C: -0.077/step
  Stage A's per-step rate is higher because synthetic data has zero
  variance — every batch is the same identity-mapping task. Real-corpus
  Stage C has higher variance but covers more concepts, so absolute
  delta is larger.
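
The per-step figures above can be reproduced with a few lines of arithmetic. The helper names are hypothetical; the inputs are the loss values and step counts quoted in this message (Stage A is given only as a delta, so it is divided directly).

```rust
/// Average loss change per optimizer step (negative means the loss fell).
fn per_step_delta(initial_loss: f64, final_loss: f64, steps: u32) -> f64 {
    (final_loss - initial_loss) / f64::from(steps)
}

/// Loss change relative to the starting loss (e.g. -0.615 is a 61.5% drop).
fn relative_change(initial_loss: f64, final_loss: f64) -> f64 {
    (final_loss - initial_loss) / initial_loss
}

/// Stage C: 15.6094 -> 6.0095 over 124 steps, i.e. roughly -0.077 per step.
fn stage_c_per_step() -> f64 {
    per_step_delta(15.6094, 6.0095, 124)
}

/// Stage A: a total delta of -6.80 over 62 steps, i.e. roughly -0.110 per step.
fn stage_a_per_step() -> f64 {
    -6.80 / 62.0
}
```

The same inputs give a relative drop of about 61.5% for Stage C, which is the figure rounded to 62% in the summary.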

Phase 4 ladder progress:
  Stage A (#1833)              ✅ MERGED + verified
  Stage B-1 (#1836)            ✅ MERGED
  Stage B-2 (#1839)            ✅ MERGED
  Stage C-prep (#1840)         ✅ MERGED
  Stage B-1.5 tests (#1841)    🟡 in CI
  Stage C trial (THIS evidence) ✅ PASSED 2026-05-20
  Stage D 50K dispatch          ⏳ awaiting user check-in (28h GB10 compute)
  Stage E HumanEval pass@1      ⏳ Phase 5 (turnkey post-Stage-D)
  Stage F publish v2            ⏳ Phase 6 (turnkey post-Stage-E)

Evidence:
- evidence/distill-stage-c-trial/dispatch.json — dispatch manifest
- evidence/distill-stage-c-trial/launch-victory.txt — full training log

Run dir on gx10: /home/noah/runs/distill-smoke-20260520-123259/
Trained checkpoint: student-trained.apr/model.safetensors

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>