Added
Per-position knowledge distillation (full-sequence KD) for the aprender-train-distill pipeline. The existing per-row path trains on ONE
target per window (the next token after the window); per-position KD trains
on EVERY position (position p predicts token p+1), giving up to seq_len× more distillation signal per forward pass. Adds kd_step_per_position and the additive trait methods logits_per_position / apply_kd_gradient_per_position (teacher + student) / next_batch_per_position (BatchSource), whose defaults wrap the per-row
methods, so existing providers, including the CUDA backend, compile and
behave unchanged. Opt-in via APR_DISTILL_PER_POSITION (default off, which leaves the
production loop byte-identical). Contract: contracts/distill-per-position-kd-v1.yaml (5 falsifiers + 2 Kani
harnesses, all passing on the CPU/fixture path).
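To illustrate the difference in distillation signal, here is a minimal sketch that compares a per-row loss (last position only) with a per-position loss (mean over all positions). The function names, the plain KL(teacher || student) objective, the mean reduction, and the absence of a temperature are all assumptions made for illustration; they are not taken from the aprender-train-distill source.

```rust
/// Numerically stable softmax over one position's logits.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// KL(teacher || student) over the vocabulary at a single position.
fn kl_divergence(teacher_logits: &[f32], student_logits: &[f32]) -> f32 {
    let p = softmax(teacher_logits);
    let q = softmax(student_logits);
    let mut kl = 0.0f32;
    for (&pi, &qi) in p.iter().zip(q.iter()) {
        if pi > 0.0 {
            // Clamp q away from zero so an underflowed probability cannot divide by zero.
            kl += pi * (pi / qi.max(f32::MIN_POSITIVE)).ln();
        }
    }
    kl
}

/// Hypothetical per-row loss: only the final position of the window contributes.
fn kd_loss_per_row(teacher: &[Vec<f32>], student: &[Vec<f32>]) -> f32 {
    match (teacher.last(), student.last()) {
        (Some(t), Some(s)) => kl_divergence(t, s),
        _ => 0.0,
    }
}

/// Hypothetical per-position loss: every position contributes; mean over positions.
fn kd_loss_per_position(teacher: &[Vec<f32>], student: &[Vec<f32>]) -> f32 {
    let n = teacher.len().min(student.len());
    if n == 0 {
        return 0.0;
    }
    let total: f32 = teacher
        .iter()
        .zip(student.iter())
        .map(|(t, s)| kl_divergence(t, s))
        .sum();
    total / n as f32
}
```

In this sketch, a student that disagrees with the teacher only at an early position produces zero per-row loss but a non-zero per-position loss, which is the extra signal the per-position path is meant to capture.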
Scope: the CPU/fixture path is fully verified. The real throughput
benefit needs the CUDA teacher/student to emit all-position logits (a GPU
forward change) — until then CUDA falls back to one position via the
defaults. That GPU per-position forward is a documented follow-up.
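The fallback described above can be sketched with a trait default method. Only the method name logits_per_position comes from these notes; the trait name, the signatures, the provider types, and the toy model below are hypothetical and exist only to show how a default that wraps the per-row method yields a single position until a backend overrides it.

```rust
/// Toy "model" (hypothetical): a 4-entry vocabulary with a spike after the last token.
fn toy_logits(prefix: &[u32]) -> Vec<f32> {
    let mut out = vec![0.0f32; 4];
    if let Some(&last) = prefix.last() {
        out[(last as usize + 1) % 4] = 1.0;
    }
    out
}

/// Hypothetical provider trait; the real trait and its signatures may differ.
trait LogitsProvider {
    /// Per-row path: logits for the single token that follows the window.
    fn logits(&self, window: &[u32]) -> Vec<f32>;

    /// Additive method with a default. The default wraps the per-row method,
    /// so it reports exactly one (position, logits) pair: the final position.
    fn logits_per_position(&self, window: &[u32]) -> Vec<(usize, Vec<f32>)> {
        if window.is_empty() {
            return Vec::new();
        }
        vec![(window.len() - 1, self.logits(window))]
    }
}

/// Stands in for a backend that has not yet implemented all-position logits.
struct LegacyProvider;

impl LogitsProvider for LegacyProvider {
    fn logits(&self, window: &[u32]) -> Vec<f32> {
        toy_logits(window)
    }
}

/// Stands in for a backend that overrides the default and emits every position.
struct FullProvider;

impl LogitsProvider for FullProvider {
    fn logits(&self, window: &[u32]) -> Vec<f32> {
        toy_logits(window)
    }

    fn logits_per_position(&self, window: &[u32]) -> Vec<(usize, Vec<f32>)> {
        (0..window.len())
            .map(|p| (p, toy_logits(&window[..=p])))
            .collect()
    }
}
```

Because the new method has a default body, LegacyProvider compiles without changes and keeps its one-position behaviour, while FullProvider opts in to the full sequence.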
While implementing this, corrected a misleading comment in ShardBatchSource that claimed "identity-mapping semantics": the per-row
labels were always genuine next-token labels (the LMBatch causal-shifted target),
not identity. A falsifier now pins this so the comment cannot mislead again.
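The distinction between causal-shifted and identity labels can be shown in a few lines. The helper names below are hypothetical and do not reflect the actual LMBatch or ShardBatchSource API; the sketch only demonstrates the two labelling schemes on a raw token slice.

```rust
/// Causal-shifted labels: the label at position p is the token at p + 1.
/// Returns (inputs, labels); both are empty when fewer than two tokens exist.
fn causal_shifted(tokens: &[u32]) -> (&[u32], &[u32]) {
    // n is the number of (input, label) pairs available.
    let n = tokens.len().saturating_sub(1);
    (&tokens[..n], &tokens[tokens.len() - n..])
}

/// Identity labels: the label at position p is the token at p itself.
fn identity_labels(tokens: &[u32]) -> (&[u32], &[u32]) {
    (tokens, tokens)
}
```

A falsifier-style check in this spirit would assert that, for a sequence of distinct tokens, the labels differ from the inputs at every position, which holds for the causal shift and fails for identity.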
Fixed
PMAT-706 re-land: the APR_DISTILL_MAX_STEPS=N smoke-validation mode
announced in #1888 (v0.35.2) was never actually in pipeline.rs — commit 52650c60c squash-dropped the implementation and shipped only the apr-distill-smoke-validation-v1.yaml contract. The early-break, [SMOKE]
summary, 0-steps guard, and no-export side-effect are now implemented in crates/aprender-train-distill/src/pipeline.rs and bound to the contract's
four falsifiers (pipeline::tests::pmat_706_{smoke,no_regression,summary_format,no_output_in_smoke},
all passing). scripts/dispatch-distill-stage-d.sh now forwards APR_DISTILL_MAX_STEPS across the ssh/env boundary, so the documented invocation APR_DISTILL_MAX_STEPS=10 ./scripts/dispatch-distill-stage-d.sh actually
triggers smoke mode (previously a silent no-op).
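As a rough illustration of the smoke-mode behaviours listed above, the sketch below parses a step limit, breaks out of the loop early, and skips export when the limit is set. Everything except the APR_DISTILL_MAX_STEPS name and the [SMOKE] prefix is assumed: the function names, the RunOutcome type, the summary wording, and the reading of the "0-steps guard" as "a smoke run that executed no steps is an error" are guesses, not the code in pipeline.rs.

```rust
/// Parse the raw value of APR_DISTILL_MAX_STEPS.
/// None (unset or unparsable) means a normal, non-smoke run in this sketch.
fn parse_max_steps(raw: Option<&str>) -> Option<usize> {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
}

/// Hypothetical result of one training run.
#[derive(Debug, PartialEq)]
struct RunOutcome {
    steps_run: usize,
    exported: bool,
}

/// Hypothetical training loop with an optional smoke-mode step limit.
fn run_loop(total_batches: usize, max_steps: Option<usize>) -> Result<RunOutcome, String> {
    let mut steps_run = 0;
    for _batch in 0..total_batches {
        if let Some(limit) = max_steps {
            if steps_run >= limit {
                break; // early break: smoke mode stops after `limit` steps
            }
        }
        // A real loop would run one distillation step here.
        steps_run += 1;
    }
    match max_steps {
        // Assumed guard: a smoke run that executed nothing validated nothing.
        Some(_) if steps_run == 0 => Err("[SMOKE] 0 steps executed".to_string()),
        Some(limit) => {
            // Illustrative summary text; the real format is defined by the contract.
            println!("[SMOKE] ran {steps_run} of at most {limit} steps; export skipped");
            Ok(RunOutcome { steps_run, exported: false })
        }
        None => Ok(RunOutcome { steps_run, exported: true }),
    }
}
```

A caller could feed the environment into this sketch with parse_max_steps(std::env::var("APR_DISTILL_MAX_STEPS").ok().as_deref()). Taking the raw value as a parameter keeps the parsing logic testable without mutating the process environment.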