feat(apr-publish): auto-discover companion files + model.safetensors alias (PMAT-690 defect 6) #1783

Merged: noahgift merged 1 commit into main from feat/apr-publish-companion-files-defect-6 on May 18, 2026

Summary

Closes the file-selection gap surfaced during the paiml/albor-370m-v1 ship — apr publish previously only picked .apr/.safetensors/.gguf extensions, leaving the operator to manually NDJSON-commit every companion file (README, LICENSE, config.json, tokenizer*.json, vocab.json, merges.txt, generation_config.json, etc.).

What this PR adds

  1. find_companion_files(directory) — case-sensitive exact-filename allowlist of HF-standard integration files. Returns paths to whichever are present in the staging directory. Decoys (arbitrary .json / .txt files outside the allowlist) and binary artifacts (.apr/.safetensors/.gguf) are NOT picked up.

    Allowlist: README.md, LICENSE, LICENSE.md, LICENSE.txt, config.json, generation_config.json, tokenizer.json, tokenizer_config.json, vocab.json, merges.txt, special_tokens_map.json, chat_template.jinja.

  2. User-provided README.md preference — when one is present in the companion set, its content is used verbatim as the model card instead of the auto-generated stub. The auto-generated stub was consistently weaker than what model authors hand-craft (observed on the albor-370m-v1 publish: 164-byte auto-stub vs 11.6KB hand-crafted card).

  3. model.safetensors LFS alias auto-emit — when a .safetensors file is uploaded under a descriptive name (e.g., albor-370m-v1.safetensors), a second NDJSON lfsFile commit emits the alias model.safetensors pointing at the same OID. HF deduplicates LFS blobs by OID so the alias is storage-free. Required for HF Transformers AutoModelForCausalLM.from_pretrained to auto-discover the weights without an explicit weights_file argument.

  4. New public method HfHubClient::commit_lfs_alias in aprender-core — wraps the existing NDJSON commit-lfs-pointer path so the apr-cli publish command can emit the alias commit.
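
The allowlist lookup in item 1 can be sketched as follows. This is a minimal illustration rather than the code merged in this PR: the `io::Result` return type, the sorted output, and the choice to skip symlinks are assumptions. It compares directory entry names directly so that the match stays case-sensitive even on case-insensitive filesystems.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Exact, case-sensitive filenames copied from the allowlist above.
const COMPANION_ALLOWLIST: [&str; 12] = [
    "README.md",
    "LICENSE",
    "LICENSE.md",
    "LICENSE.txt",
    "config.json",
    "generation_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "vocab.json",
    "merges.txt",
    "special_tokens_map.json",
    "chat_template.jinja",
];

/// Sketch only: returns the allowlisted regular files found directly in
/// `directory`. The signature and error handling are assumptions, not the
/// apr-cli implementation.
pub fn find_companion_files(directory: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    for entry in fs::read_dir(directory)? {
        let entry = entry?;
        let name = entry.file_name();
        // Compare the on-disk name itself, so "readme.md" never matches "README.md".
        let allowed = name
            .to_str()
            .map_or(false, |n| COMPANION_ALLOWLIST.contains(&n));
        // `file_type` does not follow symlinks, so symlinked files are skipped here.
        if allowed && entry.file_type()?.is_file() {
            found.push(entry.path());
        }
    }
    found.sort(); // deterministic order for upload plans and tests
    Ok(found)
}
```

Because the filter works on exact names, decoys such as `notes.json` and binary artifacts such as `model.safetensors` fall through without any extension-based special casing.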

Spec compliance

Follows SPEC-HF-PUBLISH-001:

  • §"Required artifacts (12 files minimum)" — companion files list
  • §"Publishing the model.safetensors alias" — alias protocol

Removes the manual NDJSON commit pattern documented in the spec's §"Manual companion-file upload" — that section can now be marked stale + linked to this PR.
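
For context, the alias commit amounts to a two-line NDJSON payload: a header line followed by an `lfsFile` line that reuses the OID of the blob already uploaded. The sketch below builds that payload by hand. The function name `lfs_alias_commit_ndjson` and the helper `json_escape` are hypothetical, and the field layout (`key`/`value`, `algo`, `oid`, `size`) is an assumption based on the payload shape used by the Hugging Face Hub commit endpoint, so it should be checked against SPEC-HF-PUBLISH-001 before being relied on.

```rust
/// Minimal JSON string escaping for the illustration below (hypothetical helper).
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Sketch only: builds the NDJSON body for an alias commit that points
/// `alias_path` at an LFS blob that has already been uploaded.
/// Field names are an assumption, not taken from aprender-core.
pub fn lfs_alias_commit_ndjson(
    summary: &str,
    alias_path: &str,
    sha256_oid: &str,
    size_bytes: u64,
) -> String {
    let header = format!(
        r#"{{"key":"header","value":{{"summary":"{}","description":""}}}}"#,
        json_escape(summary)
    );
    let lfs_file = format!(
        r#"{{"key":"lfsFile","value":{{"path":"{}","algo":"sha256","oid":"{}","size":{}}}}}"#,
        json_escape(alias_path),
        json_escape(sha256_oid),
        size_bytes
    );
    // NDJSON: one JSON object per line, each terminated by a newline.
    format!("{header}\n{lfs_file}\n")
}
```

Since the second line carries only a pointer (path, OID, size), no weight bytes are re-sent, which is why the alias costs no extra storage.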

Tests

5 new unit tests in publish_tests.rs:

  • test_find_companion_files_picks_all_hf_integration_files — picks all 10 allowlist entries, rejects decoys + binary artifacts
  • test_find_companion_files_empty_dir_returns_empty
  • test_safetensors_needing_alias_descriptive_name_triggers_alias
  • test_safetensors_needing_alias_canonical_name_skips_alias — no alias when already named model.safetensors
  • test_safetensors_needing_alias_no_safetensors_skips_alias — no alias on .apr/.gguf-only publishes
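
The three alias cases exercised above can be sketched as a single decision function. The name `safetensors_needing_alias` is inferred from the test names; its signature here is a guess, and declining to alias when more than one `.safetensors` file is present (for example, sharded weights) is an assumption of this sketch, not documented behaviour.

```rust
use std::path::{Path, PathBuf};

/// Name that HF Transformers looks for by default.
const CANONICAL_WEIGHTS_NAME: &str = "model.safetensors";

/// Sketch only: returns the single descriptively named `.safetensors` file
/// that should also be published under `model.safetensors`, if any.
/// Hypothetical signature; multi-file handling is an assumption.
pub fn safetensors_needing_alias(upload_files: &[PathBuf]) -> Option<&Path> {
    let safetensors: Vec<&Path> = upload_files
        .iter()
        .map(PathBuf::as_path)
        .filter(|p| p.extension().and_then(|e| e.to_str()) == Some("safetensors"))
        .collect();

    // Already canonical: nothing to alias.
    let has_canonical = safetensors
        .iter()
        .any(|p| p.file_name().and_then(|n| n.to_str()) == Some(CANONICAL_WEIGHTS_NAME));

    match safetensors.as_slice() {
        [only] if !has_canonical => Some(*only),
        _ => None, // no .safetensors at all, canonical name present, or ambiguous
    }
}
```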

All 35 commands::publish::tests pass.

Test plan

  • 5 new unit tests pass
  • All 35 commands::publish::tests pass
  • cargo check -p apr-cli --features hf-hub clean
  • Integration test: dry-run on a staging dir containing all 12 SPEC-HF-PUBLISH-001 files — verify all are listed in the upload plan

🤖 Generated with Claude Code

…alias (PMAT-690 P3-C-prep defect 6)

Closes the file-selection gap surfaced during the paiml/albor-370m-v1
ship — `apr publish` previously only picked .apr/.safetensors/.gguf
extensions, leaving the operator to manually NDJSON-commit every
companion file (README, LICENSE, config.json, tokenizer.json,
tokenizer_config.json, vocab.json, merges.txt, generation_config.json,
special_tokens_map.json, chat_template.jinja).

What this PR adds
================

1. `find_companion_files(directory)` — case-sensitive exact-filename
   allowlist of HF-standard integration files. Returns paths to
   whichever are present in the staging directory. Decoys (arbitrary
   .json or .txt files outside the allowlist) and binary artifacts
   (.apr/.safetensors/.gguf) are NOT picked up.

2. User-provided README.md preference — when one is present in the
   companion set, its content is used verbatim as the model card
   instead of the auto-generated stub. The auto-generated stub was
   consistently weaker than what model authors hand-craft (observed
   on the albor-370m-v1 publish: 164-byte auto-stub vs 11.6KB hand-
   crafted card).

3. `model.safetensors` LFS alias auto-emit — when a `.safetensors`
   file is uploaded under a descriptive name (e.g.,
   `albor-370m-v1.safetensors`), a second NDJSON `lfsFile` commit
   emits the alias `model.safetensors` pointing at the same OID.
   HF deduplicates LFS blobs by OID so the alias is storage-free.
   Required for HF Transformers `AutoModelForCausalLM.from_pretrained`
   to auto-discover the weights without an explicit weights_file
   argument.

4. New public method `HfHubClient::commit_lfs_alias` in aprender-core
   — wraps the existing NDJSON commit-lfs-pointer path so the
   apr-cli publish command can emit the alias commit.

Reference implementation
========================

Follows SPEC-HF-PUBLISH-001 (committed 2026-05-18 in #1780):
- §"Required artifacts (12 files minimum)" — companion files list
- §"Publishing the `model.safetensors` alias" — alias protocol

Removes the manual NDJSON commit pattern documented in the spec's
§"Manual companion-file upload until publish CLI is fixed" — that
section can now be marked stale + linked to this PR.

Tests
=====

5 new unit tests in publish_tests.rs, covering:
- find_companion_files picks all 10 allowlist entries when present
- find_companion_files skips decoys + binary artifacts
- find_companion_files empty dir returns empty
- safetensors_needing_alias triggers on descriptive names
- safetensors_needing_alias skips canonical model.safetensors
- safetensors_needing_alias skips .apr/.gguf-only publishes

All 35 commands::publish::tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 18, 2026 07:25
@noahgift noahgift merged commit 353e741 into main May 18, 2026
11 checks passed
@noahgift noahgift deleted the feat/apr-publish-companion-files-defect-6 branch May 18, 2026 07:56
noahgift added a commit that referenced this pull request May 18, 2026
…83/684) (#1785)

* docs(spec): SPEC-DISTILL-001 — distillation epic 6-phase plan (PMAT-683/684)

Opens the distillation track that picks up MODEL-2 v2 (paiml/albor-370m-v2)
from where the §88 stack-existence-proof ship left off.

What this spec scopes
=====================

Current state audit:
- DistillationLoss is REAL (KD = α·CE + (1-α)·T²·KL)
- CudaTransformerTrainer forward+backward is REAL (proven by §82 P2-A)
- realizar teacher inference is REAL (proven by SHIP-005 86.59% HumanEval)
- Pipeline orchestrator at aprender-train-distill/src/pipeline.rs:115 is STUB
  — uses build_synthetic_logits() instead of teacher.forward(); never calls
  CudaTransformerTrainer for the student. Closing this gap is the epic.
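
As a reference point for the KD formula quoted above, here is a minimal single-position sketch of `α·CE + (1-α)·T²·KL`. It is not the `DistillationLoss` code from the repository: the function names are hypothetical, and the KL direction (teacher relative to student) and the use of temperature-softened distributions on both sides follow the common formulation, which is an assumption about what the project implements.

```rust
/// Numerically stable softmax over `logits / temperature`.
fn softmax_with_temperature(logits: &[f64], temperature: f64) -> Vec<f64> {
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits
        .iter()
        .map(|&z| ((z - max) / temperature).exp())
        .collect();
    let sum: f64 = exps.iter().sum();
    exps.into_iter().map(|e| e / sum).collect()
}

/// Sketch only: KD = alpha * CE + (1 - alpha) * T^2 * KL(teacher || student)
/// for one token position. Hypothetical helper, not the aprender API.
pub fn kd_loss(
    student_logits: &[f64],
    teacher_logits: &[f64],
    target: usize,
    alpha: f64,
    temperature: f64,
) -> f64 {
    // Hard-label cross-entropy uses the student distribution at T = 1.
    let student_hard = softmax_with_temperature(student_logits, 1.0);
    let ce = -student_hard[target].ln();

    // Soft-label term compares temperature-softened distributions.
    let teacher_soft = softmax_with_temperature(teacher_logits, temperature);
    let student_soft = softmax_with_temperature(student_logits, temperature);
    let kl: f64 = teacher_soft
        .iter()
        .zip(&student_soft)
        .filter(|(&p, _)| p > 0.0)
        .map(|(&p, &q)| p * (p.ln() - q.ln()))
        .sum();

    alpha * ce + (1.0 - alpha) * temperature * temperature * kl
}
```

When student and teacher logits coincide, the KL term vanishes and the loss reduces to `α·CE`, which makes a convenient sanity check for any real implementation.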

6-phase plan
============

Phase 1 (3 days, 16-24h eng): `apr distill prepare` — realizar runs MODEL-1
  teacher over the corpus, caches top-K=64 logits to disk. 100-batch test
  asserts cosine sim ≥ 0.999 against online realizar recompute.

Phase 2 (2 days, 16-24h eng): wire CudaTransformerTrainer to KD loss via
  new forward_backward_kd_batch(); replace synthetic-logits stub in
  pipeline.rs::train(). Unit test on toy student verifies loss monotone.

Phase 3 (1 day + 4h compute): 500-step E2E smoke on a 10K-batch slice.
  Falsifier F-DISTILL-SMOKE-001 — val_loss at step 500 < step 0.

Phase 4 (4h dispatch + 30h unattended compute): the v2 training run.
  50K steps × 8192 tok/step = 1.6B tokens. Init from Qwen2.5-Coder-0.5B
  (matched tokenizer + arch family). Falsifiers F-DISTILL-V2-001/002 —
  val_loss < 3.0 AND HumanEval pass@1 ≥ 15%.

Phase 5 (5-8h gx10 compute): full 164-problem HumanEval discharge of
  PMAT-684. Acceptance threshold pass@1 ≥ 15%; ship-goal ≥ 25%.

Phase 6 (3h staging + 1h compute): publish v2 per SPEC-HF-PUBLISH-001.
  With v0.34.0+#1783 binary, companion files + model.safetensors alias
  are auto-emitted. Three-path verification (apr run + HF Transformers
  + llama-cli).

Total: ~70h eng + ~45h compute, ~10 days calendar.

Risk register
=============

- Cache size: top-K=64 sparsification → ~4GB instead of ~100GB
- KD numerical stability: Phase 2 unit test compares against PyTorch
  nn.KLDivLoss within 1e-4 absolute
- Teacher inference cost: Phase 1 cache amortizes one-time ~6h prep
  to <100ms/batch reads during training
- HumanEval miss: two-path fallback — widen corpus OR drop T from
  4.0 to 2.0 (each adds ~1 week)

Architectural decisions
=======================

1. Top-K=64 cache (NOT full logits) — DistilBERT/Distil-Qwen precedent
2. Cached teacher (NOT online) — hyperparameter sweep cost < cache regen
3. Vanilla KD (NOT MiniLM intermediate-layer matching) — teacher is Q4_K,
   intermediate activations aren't recoverable post-quantization
4. Matched tokenizer (Qwen2 151,936 vocab) — strongest argument for
   Qwen2.5-Coder-0.5B-Instruct as init

5 AC-DISTILL-* criteria authored; cross-linked to SPEC-HF-PUBLISH-001
(used in Phase 6) and AUDIT-Q4K-SHAPE-001 (confirms teacher Q4_K is
bit-correct, no re-export needed before distillation).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(spec): SPEC-DISTILL-001 v1.1.0 — priority HIGH; Phase 1 → online teacher (Refs PMAT-691)

Revision driven by storage-math sanity check + pmat work priority
promotion:

1. Priority promoted to HIGH (pmat work edit on PMAT-683 + PMAT-684,
   plus new PMAT-691 for Phase 1 implementation).

2. Phase 1 redesigned from on-disk top-K=64 cache to online teacher
   logits provider. Storage math: 1.24B tokens × 64 entries × 6 bytes
   ≈ 476 GB, exceeds available NVMe budget. Top-K cache approach moves
   to Phase 1.5 as an optional in-memory ring-buffer optimization that
   hides teacher latency under student compute.

3. Effort totals: Phase 1 compute drops from 4-8h to <1h. Total epic
   eng stays ~70h but compute drops 45h → 40h.

4. New falsifier F-DISTILL-TEACHER-001 — RealizarTeacher.logits_for_batch
   matches realizar's apr trace logits output within 1e-3 absolute error
   on a frozen 3-layer fixture.
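
The storage estimate in item 2 can be reproduced with a few lines of arithmetic. The helper names are hypothetical; the token count, top-K width, and 6-byte entry size are the figures quoted above, and decimal gigabytes (10^9 bytes) are assumed.

```rust
/// Sketch only: size in bytes of a top-K logits cache, one entry per
/// (token, rank) pair. Hypothetical helper for checking the estimate.
pub fn topk_cache_bytes(tokens: u64, top_k: u64, bytes_per_entry: u64) -> u64 {
    tokens * top_k * bytes_per_entry
}

/// Converts bytes to decimal gigabytes (1 GB = 10^9 bytes).
pub fn bytes_to_gb(bytes: u64) -> f64 {
    bytes as f64 / 1e9
}
```

With 1.24B tokens, K = 64, and 6 bytes per entry, this gives 476,160,000,000 bytes, matching the ≈ 476 GB figure in the revision note.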

Implementation: PMAT-691 work session started 2026-05-18. Phase 1
deliverables are: teacher_provider.rs module, RealizarTeacher wrapper,
pipeline.rs::train() rewrite to use it, unit test against golden fixture.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>