v0.36.0

@noahgift noahgift released this 11 Jun 04:50
· 66 commits to main since this release
37d98a7

Added

  • Per-position knowledge distillation (full-sequence KD) for the
    aprender-train-distill pipeline. The existing per-row path trains on ONE
    target per window (the next token after the window); per-position KD trains
    on EVERY position (position p predicts token p+1), giving up to
    seq_len× more distillation signal per forward pass. This release adds
    kd_step_per_position and the additive trait methods logits_per_position /
    apply_kd_gradient_per_position (teacher + student) and
    next_batch_per_position (BatchSource). Their defaults wrap the per-row
    methods, so existing providers, including the CUDA backend, compile and
    behave unchanged. Opt in via APR_DISTILL_PER_POSITION (default off, so the
    production loop is byte-identical). Contract:
    contracts/distill-per-position-kd-v1.yaml (5 falsifiers + 2 kani
    harnesses, all passing on the CPU/fixture path).
    • Scope: the CPU/fixture path is fully verified. The real throughput
      benefit needs the CUDA teacher/student to emit all-position logits (a GPU
      forward change); until then, CUDA falls back to one position via the
      defaults. That GPU per-position forward is a documented follow-up.
    • While implementing, corrected a misleading comment in
      ShardBatchSource that claimed "identity-mapping semantics": the per-row
      labels were always genuine next-token (LMBatch causal-shifted target),
      not identity. Pinned by a falsifier so it can't mislead again.
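
The difference between per-row and per-position labels, and the way a default trait method can wrap the per-row path, can be sketched as follows. This is a minimal illustration, not the crate's code: WindowSource, PerRowSource, PerPositionSource, and every signature here are hypothetical stand-ins for the real BatchSource API, which these notes do not show.

```rust
/// Hypothetical stand-in for the crate's `BatchSource`; names and types are assumed.
pub trait WindowSource {
    /// Per-row path: a window and the one token that follows it.
    fn next_batch(&mut self) -> Option<(Vec<u32>, u32)>;

    /// Per-position path: (position, target) pairs for the window.
    /// The default wraps `next_batch`, so it supervises only the last position.
    fn next_batch_per_position(&mut self) -> Option<(Vec<u32>, Vec<(usize, u32)>)> {
        let (window, target) = self.next_batch()?;
        let last = window.len().checked_sub(1)?;
        Some((window, vec![(last, target)]))
    }
}

/// A source that only implements the per-row path (like an unmodified provider).
pub struct PerRowSource {
    pub tokens: Vec<u32>,
    pub seq_len: usize,
    pub cursor: usize,
}

impl WindowSource for PerRowSource {
    fn next_batch(&mut self) -> Option<(Vec<u32>, u32)> {
        let end = self.cursor.checked_add(self.seq_len)?;
        // The target is the token right after the window (causal shift).
        let target = *self.tokens.get(end)?;
        let window = self.tokens[self.cursor..end].to_vec();
        self.cursor = end;
        Some((window, target))
    }
}

/// A source that overrides the default to label every position.
pub struct PerPositionSource(pub PerRowSource);

impl WindowSource for PerPositionSource {
    fn next_batch(&mut self) -> Option<(Vec<u32>, u32)> {
        self.0.next_batch()
    }

    fn next_batch_per_position(&mut self) -> Option<(Vec<u32>, Vec<(usize, u32)>)> {
        let start = self.0.cursor;
        let (window, _) = self.0.next_batch()?;
        // Position p predicts token p + 1 of the underlying stream.
        let targets: Vec<(usize, u32)> = (0..window.len())
            .map(|p| (p, self.0.tokens[start + p + 1]))
            .collect();
        Some((window, targets))
    }
}
```

With a window of seq_len tokens, the default yields one labelled position while the override yields seq_len of them, which is where the "up to seq_len×" figure comes from. A provider that never overrides the new method keeps its per-row behaviour.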

Fixed

  • PMAT-706 re-land: the APR_DISTILL_MAX_STEPS=N smoke-validation mode
    announced in #1888 (v0.35.2) was never actually in pipeline.rs — commit
    52650c60c squash-dropped the implementation and shipped only the
    apr-distill-smoke-validation-v1.yaml contract. The early-break, [SMOKE]
    summary, 0-steps guard, and no-export side-effect are now implemented in
    crates/aprender-train-distill/src/pipeline.rs and bound to the contract's
    four falsifiers (pipeline::tests::pmat_706_{smoke,no_regression,summary_format,no_output_in_smoke},
    all passing). scripts/dispatch-distill-stage-d.sh now forwards
    APR_DISTILL_MAX_STEPS across the ssh/env boundary so the documented
    APR_DISTILL_MAX_STEPS=10 ./scripts/dispatch-distill-stage-d.sh actually
    triggers smoke mode (previously a silent no-op).
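
The four smoke-mode behaviours (early break, [SMOKE] summary, 0-steps guard, no export) could be sketched as below. This is an assumption-laden illustration rather than the code in pipeline.rs: parse_max_steps, run_loop, and RunReport are hypothetical names, the summary wording is invented apart from the [SMOKE] prefix, and treating 0 as an error is only one plausible reading of the "0-steps guard".

```rust
/// Parse a raw `APR_DISTILL_MAX_STEPS` value.
/// Assumption: unset or empty means a normal run; 0 or a non-number is rejected.
pub fn parse_max_steps(raw: Option<&str>) -> Result<Option<usize>, String> {
    match raw.map(str::trim) {
        None | Some("") => Ok(None),
        Some(s) => match s.parse::<usize>() {
            Ok(0) => Err("APR_DISTILL_MAX_STEPS must be at least 1".to_string()),
            Ok(n) => Ok(Some(n)),
            Err(e) => Err(format!("invalid APR_DISTILL_MAX_STEPS {s:?}: {e}")),
        },
    }
}

/// Hypothetical outcome of one run of the training loop.
pub struct RunReport {
    pub steps_run: usize,
    pub exported: bool,
    pub summary: Option<String>,
}

/// Run up to `total_steps`, stopping early when a smoke cap is set.
pub fn run_loop(total_steps: usize, max_steps: Option<usize>) -> RunReport {
    let mut steps_run = 0;
    for _step in 0..total_steps {
        // Early break: stop as soon as the smoke cap is reached.
        if max_steps.is_some_and(|cap| steps_run >= cap) {
            break;
        }
        // A real loop would run one distillation step here.
        steps_run += 1;
    }
    let smoke = max_steps.is_some();
    RunReport {
        steps_run,
        // Smoke mode must not write an exported model.
        exported: !smoke,
        summary: smoke.then(|| format!("[SMOKE] ran {steps_run} steps, export skipped")),
    }
}
```

In a real binary the raw value would come from std::env::var("APR_DISTILL_MAX_STEPS").ok(), passed in via as_deref(). Keeping the parser free of environment access makes each behaviour easy to pin with a unit test.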