feat(apr-publish): auto-discover companion files + model.safetensors alias (PMAT-690 defect 6) #1783
Merged
…alias (PMAT-690 P3-C-prep defect 6)

Closes the file-selection gap surfaced during the paiml/albor-370m-v1 ship: `apr publish` previously only picked .apr/.safetensors/.gguf extensions, leaving the operator to manually NDJSON-commit every companion file (README, LICENSE, config.json, tokenizer.json, tokenizer_config.json, vocab.json, merges.txt, generation_config.json, special_tokens_map.json, chat_template.jinja).

What this PR adds
=================

1. `find_companion_files(directory)`: case-sensitive exact-filename allowlist of HF-standard integration files. Returns paths to whichever are present in the staging directory. Decoys (arbitrary .json or .txt files outside the allowlist) and binary artifacts (.apr/.safetensors/.gguf) are NOT picked up.

2. User-provided README.md preference: when one is present in the companion set, its content is used verbatim as the model card instead of the auto-generated stub. The auto-generated stub was consistently weaker than what model authors hand-craft (observed on the albor-370m-v1 publish: 164-byte auto-stub vs 11.6KB hand-crafted card).

3. `model.safetensors` LFS alias auto-emit: when a `.safetensors` file is uploaded under a descriptive name (e.g., `albor-370m-v1.safetensors`), a second NDJSON `lfsFile` commit emits the alias `model.safetensors` pointing at the same OID. HF deduplicates LFS blobs by OID so the alias is storage-free. Required for HF Transformers `AutoModelForCausalLM.from_pretrained` to auto-discover the weights without an explicit weights_file argument.

4. New public method `HfHubClient::commit_lfs_alias` in aprender-core: wraps the existing NDJSON commit-lfs-pointer path so the apr-cli publish command can emit the alias commit.
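The allowlist lookup in item 1 can be illustrated with a minimal, standard-library-only sketch. The twelve names are the ones listed in the PR description; the return type, the sorting, and the decision to skip non-regular files are assumptions made for illustration and may not match the actual apr-cli code.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Exact, case-sensitive file names taken from the PR description's allowlist.
const COMPANION_ALLOWLIST: [&str; 12] = [
    "README.md",
    "LICENSE",
    "LICENSE.md",
    "LICENSE.txt",
    "config.json",
    "generation_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "vocab.json",
    "merges.txt",
    "special_tokens_map.json",
    "chat_template.jinja",
];

/// Hypothetical signature: the real function may return a different type.
/// Returns the allowlisted regular files found directly inside `directory`.
fn find_companion_files(directory: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    for entry in fs::read_dir(directory)? {
        let entry = entry?;
        // Skip subdirectories and other non-regular entries.
        if !entry.file_type()?.is_file() {
            continue;
        }
        let file_name = entry.file_name();
        // Non-UTF-8 names cannot equal any allowlist entry, so they are skipped.
        if let Some(name) = file_name.to_str() {
            // Whole-name comparison: `notes.json` or `License` do not match.
            if COMPANION_ALLOWLIST.contains(&name) {
                found.push(entry.path());
            }
        }
    }
    // Sorting is an assumption here, added so the result is deterministic.
    found.sort();
    Ok(found)
}
```

Matching on the whole file name rather than on the extension is what keeps decoys and the binary artifacts (`.apr`, `.safetensors`, `.gguf`) out of the companion set.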
Reference implementation
========================

Follows SPEC-HF-PUBLISH-001 (committed 2026-05-18 in #1780):

- §"Required artifacts (12 files minimum)": companion files list
- §"Publishing the `model.safetensors` alias": alias protocol

Removes the manual NDJSON commit pattern documented in the spec's §"Manual companion-file upload until publish CLI is fixed"; that section can now be marked stale and linked to this PR.

Tests
=====

4 new unit tests in publish_tests.rs:

- find_companion_files picks all 10 allowlist entries when present
- find_companion_files skips decoys + binary artifacts
- find_companion_files empty dir returns empty
- safetensors_needing_alias triggers on descriptive names
- safetensors_needing_alias skips canonical model.safetensors
- safetensors_needing_alias skips .apr/.gguf-only publishes

All 35 commands::publish::tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 18, 2026
…83/684) (#1785)

* docs(spec): SPEC-DISTILL-001 - distillation epic 6-phase plan (PMAT-683/684)

Opens the distillation track that picks up MODEL-2 v2 (paiml/albor-370m-v2) from where the §88 stack-existence-proof ship left off.

What this spec scopes
=====================

Current state audit:

- DistillationLoss is REAL (KD = α·CE + (1-α)·T²·KL)
- CudaTransformerTrainer forward+backward is REAL (proven by §82 P2-A)
- realizar teacher inference is REAL (proven by SHIP-005 86.59% HumanEval)
- Pipeline orchestrator at aprender-train-distill/src/pipeline.rs:115 is STUB: it uses build_synthetic_logits() instead of teacher.forward() and never calls CudaTransformerTrainer for the student. Closing this gap is the epic.

6-phase plan
============

Phase 1 (3 days, 16-24h eng): `apr distill prepare` - realizar runs MODEL-1 teacher over the corpus, caches top-K=64 logits to disk. 100-batch test asserts cosine sim ≥ 0.999 against online realizar recompute.

Phase 2 (2 days, 16-24h eng): wire CudaTransformerTrainer to KD loss via new forward_backward_kd_batch(); replace synthetic-logits stub in pipeline.rs::train(). Unit test on toy student verifies loss monotone.

Phase 3 (1 day + 4h compute): 500-step E2E smoke on a 10K-batch slice. Falsifier F-DISTILL-SMOKE-001: val_loss at step 500 < step 0.

Phase 4 (4h dispatch + 30h unattended compute): the v2 training run. 50K steps × 8192 tok/step = 1.6B tokens. Init from Qwen2.5-Coder-0.5B (matched tokenizer + arch family). Falsifiers F-DISTILL-V2-001/002: val_loss < 3.0 AND HumanEval pass@1 ≥ 15%.

Phase 5 (5-8h gx10 compute): full 164-problem HumanEval discharge of PMAT-684. Acceptance threshold pass@1 ≥ 15%; ship-goal ≥ 25%.

Phase 6 (3h staging + 1h compute): publish v2 per SPEC-HF-PUBLISH-001. With v0.34.0+#1783 binary, companion files + model.safetensors alias are auto-emitted. Three-path verification (apr run + HF Transformers + llama-cli).

Total: ~70h eng + ~45h compute, ~10 days calendar.
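The KD formula quoted in the state audit, KD = α·CE + (1-α)·T²·KL, can be made concrete with a small single-example sketch. Several details below are assumptions rather than facts about DistillationLoss: the cross-entropy is taken at temperature 1 against a hard label, the KL term is KL(teacher ‖ student) over temperature-softened distributions, and dense `f64` logits are used purely for readability.

```rust
/// Numerically stable softmax over `logits / temperature`.
fn softmax(logits: &[f64], temperature: f64) -> Vec<f64> {
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits
        .iter()
        .map(|&z| ((z - max) / temperature).exp())
        .collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Single-example KD loss: alpha * CE + (1 - alpha) * T^2 * KL.
/// `label` is the index of the ground-truth token (hypothetical parameter name).
fn kd_loss(student: &[f64], teacher: &[f64], label: usize, alpha: f64, t: f64) -> f64 {
    // Hard-label cross-entropy on the student's unsoftened distribution.
    let ce = -softmax(student, 1.0)[label].ln();

    // KL(teacher || student) on distributions softened by temperature T.
    let p = softmax(teacher, t);
    let q = softmax(student, t);
    let kl: f64 = p
        .iter()
        .zip(&q)
        .filter(|(pi, _)| **pi > 0.0) // terms with p = 0 contribute nothing
        .map(|(pi, qi)| pi * (pi / qi).ln())
        .sum();

    alpha * ce + (1.0 - alpha) * t * t * kl
}
```

The T² factor compensates for the 1/T² scaling that temperature softening applies to the gradients of the KL term, which keeps the two loss components on a comparable scale when T changes.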
Risk register
=============

- Cache size: top-K=64 sparsification → ~4GB instead of ~100GB
- KD numerical stability: Phase 2 unit test compares against PyTorch nn.KLDivLoss within 1e-4 absolute
- Teacher inference cost: Phase 1 cache amortizes one-time ~6h prep to <100ms/batch reads during training
- HumanEval miss: two-path fallback, either widen corpus OR drop T from 4.0 to 2.0 (each adds ~1 week)

Architectural decisions
=======================

1. Top-K=64 cache (NOT full logits): DistilBERT/Distil-Qwen precedent
2. Cached teacher (NOT online): hyperparameter sweep cost < cache regen
3. Vanilla KD (NOT MiniLM intermediate-layer matching): teacher is Q4_K, intermediate activations aren't recoverable post-quantization
4. Matched tokenizer (Qwen2 151,936 vocab): strongest argument for Qwen2.5-Coder-0.5B-Instruct as init

5 AC-DISTILL-* criteria authored; cross-linked to SPEC-HF-PUBLISH-001 (used in Phase 6) and AUDIT-Q4K-SHAPE-001 (confirms teacher Q4_K is bit-correct, no re-export needed before distillation).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): SPEC-DISTILL-001 v1.1.0 - priority HIGH; Phase 1 → online teacher (Refs PMAT-691)

Revision driven by storage-math sanity check + pmat work priority promotion:

1. Priority promoted to HIGH (pmat work edit on PMAT-683 + PMAT-684, plus new PMAT-691 for Phase 1 implementation).

2. Phase 1 redesigned from on-disk top-K=64 cache to online teacher logits provider. Storage math: 1.24B tokens × 64 entries × 6 bytes ≈ 476 GB, exceeds available NVMe budget. Top-K cache approach moves to Phase 1.5 as an optional in-memory ring-buffer optimization that hides teacher latency under student compute.

3. Effort totals: Phase 1 compute drops from 4-8h to <1h. Total epic eng stays ~70h but compute drops 45h → 40h.

4. New falsifier F-DISTILL-TEACHER-001: RealizarTeacher.logits_for_batch matches realizar's apr trace logits output within 1e-3 absolute error on a frozen 3-layer fixture.
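The storage estimate in item 2 is a single product, and a short sketch makes it easy to re-run with other values of K or entry size. The function name is hypothetical, and the 6 bytes per entry is taken from the commit message as stated, without assuming how those bytes are laid out.

```rust
/// Size in bytes of an on-disk top-K logit cache:
/// one K-entry record per token, each entry `bytes_per_entry` wide.
fn topk_cache_bytes(tokens: u64, k: u64, bytes_per_entry: u64) -> u64 {
    tokens * k * bytes_per_entry
}

/// Decimal gigabytes (1 GB = 10^9 bytes), matching the "≈ 476 GB" figure.
fn bytes_to_gb(bytes: u64) -> f64 {
    bytes as f64 / 1e9
}
```

With the figures from the commit message, `topk_cache_bytes(1_240_000_000, 64, 6)` gives 476,160,000,000 bytes, which is the ≈ 476 GB cited as exceeding the NVMe budget.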
Implementation: PMAT-691 work session started 2026-05-18. Phase 1 deliverables are: teacher_provider.rs module, RealizarTeacher wrapper, pipeline.rs::train() rewrite to use it, unit test against golden fixture.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Closes the file-selection gap surfaced during the paiml/albor-370m-v1 ship: `apr publish` previously only picked `.apr`/`.safetensors`/`.gguf` extensions, leaving the operator to manually NDJSON-commit every companion file (README, LICENSE, config.json, tokenizer*.json, vocab.json, merges.txt, generation_config.json, etc.).

What this PR adds

1. `find_companion_files(directory)`: case-sensitive exact-filename allowlist of HF-standard integration files. Returns paths to whichever are present in the staging directory. Decoys (arbitrary `.json`/`.txt` files outside the allowlist) and binary artifacts (`.apr`/`.safetensors`/`.gguf`) are NOT picked up.
   Allowlist: `README.md`, `LICENSE`, `LICENSE.md`, `LICENSE.txt`, `config.json`, `generation_config.json`, `tokenizer.json`, `tokenizer_config.json`, `vocab.json`, `merges.txt`, `special_tokens_map.json`, `chat_template.jinja`.
2. User-provided README.md preference: when one is present in the companion set, its content is used verbatim as the model card instead of the auto-generated stub. The auto-generated stub was consistently weaker than what model authors hand-craft (observed on the albor-370m-v1 publish: 164-byte auto-stub vs 11.6KB hand-crafted card).
3. `model.safetensors` LFS alias auto-emit: when a `.safetensors` file is uploaded under a descriptive name (e.g., `albor-370m-v1.safetensors`), a second NDJSON `lfsFile` commit emits the alias `model.safetensors` pointing at the same OID. HF deduplicates LFS blobs by OID so the alias is storage-free. Required for HF Transformers `AutoModelForCausalLM.from_pretrained` to auto-discover the weights without an explicit `weights_file` argument.
4. New public method `HfHubClient::commit_lfs_alias` in aprender-core: wraps the existing NDJSON commit-lfs-pointer path so the apr-cli publish command can emit the alias commit.

Spec compliance

Follows SPEC-HF-PUBLISH-001:

- §"Required artifacts (12 files minimum)": companion files list
- §"Publishing the `model.safetensors` alias": alias protocol

Removes the manual NDJSON commit pattern documented in the spec's §"Manual companion-file upload"; that section can now be marked stale and linked to this PR.
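To make the alias commit from item 3 more tangible, here is a sketch of the NDJSON body such a commit might carry. The two-line layout (a `header` record followed by an `lfsFile` record with `path`, `algo`, `oid`, and `size`) follows the payload shape used by the Hugging Face Hub commit endpoint as far as I am aware; treat it as an assumption. The function names are hypothetical and this is not the body of `HfHubClient::commit_lfs_alias`.

```rust
/// Minimal JSON string escaping, enough for paths and commit summaries.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Builds a two-line NDJSON commit body that registers `alias_path` as an
/// LFS pointer to a blob that has already been uploaded under another name.
/// `sha256_oid` and `size` must be those of the existing blob.
fn lfs_alias_commit_ndjson(summary: &str, alias_path: &str, sha256_oid: &str, size: u64) -> String {
    let header = format!(
        r#"{{"key":"header","value":{{"summary":"{}","description":""}}}}"#,
        json_escape(summary)
    );
    let lfs_file = format!(
        r#"{{"key":"lfsFile","value":{{"path":"{}","algo":"sha256","oid":"{}","size":{}}}}}"#,
        json_escape(alias_path),
        sha256_oid,
        size
    );
    // NDJSON: one JSON object per line.
    format!("{}\n{}\n", header, lfs_file)
}
```

Because the record carries only the OID and size, no weight bytes are re-uploaded; this is the property the PR relies on when it calls the alias storage-free.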
Tests

4 new unit tests in `publish_tests.rs`:

- `test_find_companion_files_picks_all_hf_integration_files`: picks all 10 allowlist entries, rejects decoys + binary artifacts
- `test_find_companion_files_empty_dir_returns_empty`
- `test_safetensors_needing_alias_descriptive_name_triggers_alias`
- `test_safetensors_needing_alias_canonical_name_skips_alias`: no alias when already named `model.safetensors`
- `test_safetensors_needing_alias_no_safetensors_skips_alias`: no alias on `.apr`/`.gguf`-only publishes

All 35 `commands::publish::tests` pass.

Test plan

- `cargo check -p apr-cli --features hf-hub` clean

🤖 Generated with Claude Code
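The three `safetensors_needing_alias` cases named in the tests (descriptive name triggers, canonical name skips, no `.safetensors` skips) suggest a decision rule along the following lines. The signature and the choice to return the first matching file are assumptions for illustration; the real helper may take different inputs.

```rust
use std::path::{Path, PathBuf};

/// The file name HF Transformers looks for by default.
const CANONICAL_WEIGHTS: &str = "model.safetensors";

/// Hypothetical shape: returns the `.safetensors` upload that should receive
/// a `model.safetensors` alias, or `None` when no alias is needed.
fn safetensors_needing_alias(files: &[PathBuf]) -> Option<&Path> {
    let mut candidate = None;
    for file in files {
        // A canonical file already in the upload set makes an alias redundant.
        if file.file_name().and_then(|n| n.to_str()) == Some(CANONICAL_WEIGHTS) {
            return None;
        }
        // Remember the first descriptively named `.safetensors` file.
        let is_safetensors = file.extension().and_then(|e| e.to_str()) == Some("safetensors");
        if candidate.is_none() && is_safetensors {
            candidate = Some(file.as_path());
        }
    }
    // Stays `None` for `.apr`/`.gguf`-only publishes.
    candidate
}
```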