feat(apr-publish): auto-discover companion files + model.safetensors alias (PMAT-690 defect 6) by noahgift · Pull Request #1783 · paiml/aprender

noahgift · 2026-05-18T07:24:59Z

Summary

Closes the file-selection gap surfaced during the paiml/albor-370m-v1 ship: apr publish previously picked up only files with .apr/.safetensors/.gguf extensions, leaving the operator to manually NDJSON-commit every companion file (README, LICENSE, config.json, tokenizer*.json, vocab.json, merges.txt, generation_config.json, etc.).

What this PR adds

find_companion_files(directory) — case-sensitive exact-filename allowlist of HF-standard integration files. Returns paths to whichever are present in the staging directory. Decoys (arbitrary .json / .txt files outside the allowlist) and binary artifacts (.apr/.safetensors/.gguf) are NOT picked up.

Allowlist: README.md, LICENSE, LICENSE.md, LICENSE.txt, config.json, generation_config.json, tokenizer.json, tokenizer_config.json, vocab.json, merges.txt, special_tokens_map.json, chat_template.jinja.
User-provided README.md preference — when one is present in the companion set, its content is used verbatim as the model card instead of the auto-generated stub. The auto-generated stub was consistently weaker than what model authors hand-craft (observed on the albor-370m-v1 publish: 164-byte auto-stub vs 11.6KB hand-crafted card).
model.safetensors LFS alias auto-emit — when a .safetensors file is uploaded under a descriptive name (e.g., albor-370m-v1.safetensors), a second NDJSON lfsFile commit emits the alias model.safetensors pointing at the same OID. HF deduplicates LFS blobs by OID so the alias is storage-free. Required for HF Transformers AutoModelForCausalLM.from_pretrained to auto-discover the weights without an explicit weights_file argument.
New public method HfHubClient::commit_lfs_alias in aprender-core — wraps the existing NDJSON commit-lfs-pointer path so the apr-cli publish command can emit the alias commit.
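
For reviewers, a minimal sketch of the allowlist lookup described above. This is not the code in this PR: the signature, the `io::Result` return type, and the sorted output are assumptions made for illustration. Only the filenames come from the allowlist in this description. It walks the directory and compares names exactly, so the match stays case-sensitive even on case-insensitive filesystems.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Exact, case-sensitive filenames copied from the allowlist in this PR description.
const COMPANION_ALLOWLIST: [&str; 12] = [
    "README.md",
    "LICENSE",
    "LICENSE.md",
    "LICENSE.txt",
    "config.json",
    "generation_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "vocab.json",
    "merges.txt",
    "special_tokens_map.json",
    "chat_template.jinja",
];

/// Hypothetical sketch: return the allowlisted regular files present in `directory`.
/// Anything else (decoy .json/.txt files, .apr/.safetensors/.gguf artifacts) is skipped.
pub fn find_companion_files(directory: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    for entry in fs::read_dir(directory)? {
        let entry = entry?;
        let name = entry.file_name();
        // Compare the on-disk name byte for byte against the allowlist.
        let is_allowed = name
            .to_str()
            .map_or(false, |n| COMPANION_ALLOWLIST.contains(&n));
        if is_allowed && entry.file_type()?.is_file() {
            found.push(entry.path());
        }
    }
    // read_dir order is platform dependent; sort for a stable upload plan.
    found.sort();
    Ok(found)
}
```

Matching on the directory listing rather than probing `directory.join(name)` is a design choice of this sketch, not something the PR states.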

Spec compliance

Follows SPEC-HF-PUBLISH-001:

§"Required artifacts (12 files minimum)" — companion files list
§"Publishing the model.safetensors alias" — alias protocol

Removes the manual NDJSON commit pattern documented in the spec's §"Manual companion-file upload" — that section can now be marked stale + linked to this PR.
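
To make the alias protocol concrete, here is a rough sketch of the two NDJSON lines an alias commit might carry. The function name `lfs_alias_ndjson`, the hand-rolled escaper, and the exact field layout (`header`/`summary`, `lfsFile` with `path`, `algo`, `oid`, `size`) are assumptions based on the general shape of Hub commit payloads. The PR does not show the payload, and `HfHubClient::commit_lfs_alias` may build it differently.

```rust
/// Minimal JSON string escaper (sketch; enough for paths and short summaries).
fn json_escape(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Hypothetical helper: build an NDJSON body with a header line and one
/// `lfsFile` line that points `alias_path` at an already-uploaded LFS blob.
/// Field names are assumed, not taken from this PR.
pub fn lfs_alias_ndjson(summary: &str, alias_path: &str, sha256_hex: &str, size: u64) -> String {
    let header = format!(
        r#"{{"key":"header","value":{{"summary":"{}"}}}}"#,
        json_escape(summary)
    );
    let lfs_file = format!(
        r#"{{"key":"lfsFile","value":{{"path":"{}","algo":"sha256","oid":"{}","size":{}}}}}"#,
        json_escape(alias_path),
        sha256_hex,
        size
    );
    // NDJSON: one JSON object per line, newline terminated.
    format!("{header}\n{lfs_file}\n")
}
```

Because the alias line reuses the OID of the descriptive upload, no second blob upload is involved, which matches the storage-free claim above.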

Tests

4 new unit tests in publish_tests.rs:

test_find_companion_files_picks_all_hf_integration_files — picks all 10 allowlist entries, rejects decoys + binary artifacts
test_find_companion_files_empty_dir_returns_empty
test_safetensors_needing_alias_descriptive_name_triggers_alias
test_safetensors_needing_alias_canonical_name_skips_alias — no alias when already named model.safetensors
test_safetensors_needing_alias_no_safetensors_skips_alias — no alias on .apr/.gguf-only publishes

All 35 commands::publish::tests pass.
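
The three alias cases covered by the tests can be sketched as a small decision function. The name matches the test prefix, but the signature and return type here are assumptions for illustration, not the PR's actual code.

```rust
use std::path::{Path, PathBuf};

/// Filename HF Transformers looks for by default, per this PR description.
const CANONICAL_WEIGHTS_NAME: &str = "model.safetensors";

/// Hypothetical sketch: return the .safetensors file that needs a
/// `model.safetensors` alias, or `None` when no alias should be emitted.
pub fn safetensors_needing_alias(files: &[PathBuf]) -> Option<&Path> {
    // Canonical name already present: nothing to alias.
    let has_canonical = files
        .iter()
        .any(|p| p.file_name().map_or(false, |n| n == CANONICAL_WEIGHTS_NAME));
    if has_canonical {
        return None;
    }
    // Otherwise alias the first descriptive .safetensors file, if any.
    // .apr/.gguf-only publishes fall through to `None`.
    files
        .iter()
        .find(|p| p.extension().map_or(false, |e| e == "safetensors"))
        .map(PathBuf::as_path)
}
```

Picking the first match when several descriptive .safetensors files are staged is a simplification; the PR does not say how that case is handled.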

Test plan

4 new unit tests pass
All 35 commands::publish::tests pass
cargo check -p apr-cli --features hf-hub clean
Integration test: dry-run on a staging dir containing all 12 SPEC-HF-PUBLISH-001 files — verify all are listed in the upload plan

🤖 Generated with Claude Code

…alias (PMAT-690 P3-C-prep defect 6)

Closes the file-selection gap surfaced during the paiml/albor-370m-v1 ship: `apr publish` previously only picked .apr/.safetensors/.gguf extensions, leaving the operator to manually NDJSON-commit every companion file (README, LICENSE, config.json, tokenizer.json, tokenizer_config.json, vocab.json, merges.txt, generation_config.json, special_tokens_map.json, chat_template.jinja).

What this PR adds
================

1. `find_companion_files(directory)`: case-sensitive exact-filename allowlist of HF-standard integration files. Returns paths to whichever are present in the staging directory. Decoys (arbitrary .json or .txt files outside the allowlist) and binary artifacts (.apr/.safetensors/.gguf) are NOT picked up.

2. User-provided README.md preference: when one is present in the companion set, its content is used verbatim as the model card instead of the auto-generated stub. The auto-generated stub was consistently weaker than what model authors hand-craft (observed on the albor-370m-v1 publish: 164-byte auto-stub vs 11.6KB hand-crafted card).

3. `model.safetensors` LFS alias auto-emit: when a `.safetensors` file is uploaded under a descriptive name (e.g., `albor-370m-v1.safetensors`), a second NDJSON `lfsFile` commit emits the alias `model.safetensors` pointing at the same OID. HF deduplicates LFS blobs by OID so the alias is storage-free. Required for HF Transformers `AutoModelForCausalLM.from_pretrained` to auto-discover the weights without an explicit weights_file argument.

4. New public method `HfHubClient::commit_lfs_alias` in aprender-core: wraps the existing NDJSON commit-lfs-pointer path so the apr-cli publish command can emit the alias commit.

Reference implementation
========================

Follows SPEC-HF-PUBLISH-001 (committed 2026-05-18 in #1780):
- §"Required artifacts (12 files minimum)": companion files list
- §"Publishing the `model.safetensors` alias": alias protocol

Removes the manual NDJSON commit pattern documented in the spec's §"Manual companion-file upload until publish CLI is fixed". That section can now be marked stale + linked to this PR.

Tests
=====

4 new unit tests in publish_tests.rs:
- find_companion_files picks all 10 allowlist entries when present
- find_companion_files skips decoys + binary artifacts
- find_companion_files empty dir returns empty
- safetensors_needing_alias triggers on descriptive names
- safetensors_needing_alias skips canonical model.safetensors
- safetensors_needing_alias skips .apr/.gguf-only publishes

All 35 commands::publish::tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…83/684) (#1785)

* docs(spec): SPEC-DISTILL-001 - distillation epic 6-phase plan (PMAT-683/684)

Opens the distillation track that picks up MODEL-2 v2 (paiml/albor-370m-v2) from where the §88 stack-existence-proof ship left off.

What this spec scopes
=====================

Current state audit:
- DistillationLoss is REAL (KD = α·CE + (1-α)·T²·KL)
- CudaTransformerTrainer forward+backward is REAL (proven by §82 P2-A)
- realizar teacher inference is REAL (proven by SHIP-005 86.59% HumanEval)
- Pipeline orchestrator at aprender-train-distill/src/pipeline.rs:115 is STUB: uses build_synthetic_logits() instead of teacher.forward(); never calls CudaTransformerTrainer for the student. Closing this gap is the epic.

6-phase plan
============

Phase 1 (3 days, 16-24h eng): `apr distill prepare`. realizar runs MODEL-1 teacher over the corpus, caches top-K=64 logits to disk. 100-batch test asserts cosine sim ≥ 0.999 against online realizar recompute.

Phase 2 (2 days, 16-24h eng): wire CudaTransformerTrainer to KD loss via new forward_backward_kd_batch(); replace synthetic-logits stub in pipeline.rs::train(). Unit test on toy student verifies loss monotone.

Phase 3 (1 day + 4h compute): 500-step E2E smoke on a 10K-batch slice. Falsifier F-DISTILL-SMOKE-001: val_loss at step 500 < step 0.

Phase 4 (4h dispatch + 30h unattended compute): the v2 training run. 50K steps × 8192 tok/step = 1.6B tokens. Init from Qwen2.5-Coder-0.5B (matched tokenizer + arch family). Falsifiers F-DISTILL-V2-001/002: val_loss < 3.0 AND HumanEval pass@1 ≥ 15%.

Phase 5 (5-8h gx10 compute): full 164-problem HumanEval discharge of PMAT-684. Acceptance threshold pass@1 ≥ 15%; ship-goal ≥ 25%.

Phase 6 (3h staging + 1h compute): publish v2 per SPEC-HF-PUBLISH-001. With v0.34.0+#1783 binary, companion files + model.safetensors alias are auto-emitted. Three-path verification (apr run + HF Transformers + llama-cli).

Total: ~70h eng + ~45h compute, ~10 days calendar.

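
The KD objective quoted in the state audit above, KD = α·CE + (1-α)·T²·KL, can be sketched for a single token position as follows. This is an illustration only, not the `DistillationLoss` implementation: the function names, the `f64` slices, and the choice of KL(teacher ‖ student) over temperature-softened distributions (the usual convention) are assumptions.

```rust
/// Numerically stable log-softmax of `logits / temperature`.
fn log_softmax(logits: &[f64], temperature: f64) -> Vec<f64> {
    let scaled: Vec<f64> = logits.iter().map(|z| z / temperature).collect();
    let max = scaled.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let log_sum = scaled.iter().map(|z| (z - max).exp()).sum::<f64>().ln() + max;
    scaled.iter().map(|z| z - log_sum).collect()
}

/// Hypothetical sketch of KD = alpha * CE + (1 - alpha) * T^2 * KL for one position.
/// CE uses the hard label at temperature 1; KL is taken as KL(teacher || student)
/// at temperature T (an assumed convention, not stated in the spec summary).
pub fn kd_loss(student: &[f64], teacher: &[f64], label: usize, alpha: f64, temperature: f64) -> f64 {
    let ce = -log_softmax(student, 1.0)[label];
    let log_p_student = log_softmax(student, temperature);
    let log_p_teacher = log_softmax(teacher, temperature);
    let kl: f64 = log_p_teacher
        .iter()
        .zip(&log_p_student)
        .map(|(lt, ls)| lt.exp() * (lt - ls))
        .sum();
    alpha * ce + (1.0 - alpha) * temperature * temperature * kl
}
```

When student and teacher logits agree, the KL term vanishes and the loss reduces to α·CE, which is a convenient sanity check for any real implementation.
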
Risk register
=============

- Cache size: top-K=64 sparsification → ~4GB instead of ~100GB
- KD numerical stability: Phase 2 unit test compares against PyTorch nn.KLDivLoss within 1e-4 absolute
- Teacher inference cost: Phase 1 cache amortizes one-time ~6h prep to <100ms/batch reads during training
- HumanEval miss: two-path fallback, widen corpus OR drop T from 4.0 to 2.0 (each adds ~1 week)

Architectural decisions
=======================

1. Top-K=64 cache (NOT full logits): DistilBERT/Distil-Qwen precedent
2. Cached teacher (NOT online): hyperparameter sweep cost < cache regen
3. Vanilla KD (NOT MiniLM intermediate-layer matching): teacher is Q4_K, intermediate activations aren't recoverable post-quantization
4. Matched tokenizer (Qwen2 151,936 vocab): strongest argument for Qwen2.5-Coder-0.5B-Instruct as init

5 AC-DISTILL-* criteria authored; cross-linked to SPEC-HF-PUBLISH-001 (used in Phase 6) and AUDIT-Q4K-SHAPE-001 (confirms teacher Q4_K is bit-correct, no re-export needed before distillation).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): SPEC-DISTILL-001 v1.1.0 - priority HIGH; Phase 1 → online teacher (Refs PMAT-691)

Revision driven by storage-math sanity check + pmat work priority promotion:

1. Priority promoted to HIGH (pmat work edit on PMAT-683 + PMAT-684, plus new PMAT-691 for Phase 1 implementation).

2. Phase 1 redesigned from on-disk top-K=64 cache to online teacher logits provider. Storage math: 1.24B tokens × 64 entries × 6 bytes ≈ 476 GB, exceeds available NVMe budget. Top-K cache approach moves to Phase 1.5 as an optional in-memory ring-buffer optimization that hides teacher latency under student compute.

3. Effort totals: Phase 1 compute drops from 4-8h to <1h. Total epic eng stays ~70h but compute drops 45h → 40h.

4. New falsifier F-DISTILL-TEACHER-001: RealizarTeacher.logits_for_batch matches realizar's apr trace logits output within 1e-3 absolute error on a frozen 3-layer fixture.

Implementation: PMAT-691 work session started 2026-05-18. Phase 1 deliverables are: teacher_provider.rs module, RealizarTeacher wrapper, pipeline.rs::train() rewrite to use it, unit test against golden fixture.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
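
The storage estimate in the v1.1.0 revision (1.24B tokens × 64 entries × 6 bytes ≈ 476 GB) is easy to reproduce. The helper names below are hypothetical. How the 6 bytes per entry are split is not stated in the commit message; a 4-byte token index plus a 2-byte half-precision logit would be one plausible reading, and that is only an assumption.

```rust
/// Hypothetical helper: bytes needed to cache `top_k` teacher entries per token.
pub fn topk_cache_bytes(tokens: u64, top_k: u64, bytes_per_entry: u64) -> u64 {
    tokens * top_k * bytes_per_entry
}

/// Convert bytes to decimal gigabytes (1 GB = 10^9 bytes).
pub fn bytes_to_gb(bytes: u64) -> f64 {
    bytes as f64 / 1e9
}
```

With the figures from the commit message, `topk_cache_bytes(1_240_000_000, 64, 6)` gives 476,160,000,000 bytes, i.e. about 476 GB, which matches the stated estimate.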

noahgift enabled auto-merge (squash) May 18, 2026 07:25

noahgift merged commit 353e741 into main May 18, 2026
11 checks passed

noahgift deleted the feat/apr-publish-companion-files-defect-6 branch May 18, 2026 07:56
