feat(apr-cli): #1547 — `apr tokenize encode-corpus --estimate-only` pre-flight pass by noahgift · Pull Request #1553 · paiml/aprender

noahgift · 2026-05-07T06:41:47Z

Summary

GH-1547 piece 3 of 3. Operators dispatched a 47-hour encode run blind in SHIP-TWO-001 5g.1 — a 5-second pre-flight could have shown the projected wall, total tokens, and shard count up front. This PR adds that pre-flight pass.

New flags: --estimate-only (bool), --estimate-sample-docs <N> (default 1000)
When --estimate-only is set: NO shards, NO manifest, NO output dir created (short-circuit lives BEFORE create_dir_all)

Operator-facing stderr block:

[estimate] input_docs=N
[estimate] sample_size=K sample_tokens=T sample_wall=Ws
[estimate] estimated_total_tokens=NNN
[estimate] estimated_shards=NNN (at shard_tokens=NNN)
[estimate] estimated_wall=NNN seconds (at --num-workers=N)

AC4 extrapolation formula respects --num-workers:
estimated_wall = (sample_wall / sample_size) × total_docs / max(num_workers, 1)
Pure-function extrapolate_estimate kernel — testable without invoking BPE tokenizer or filesystem
Total-docs counting: JSONL = wc -l style; parquet = sum of row-group metadata footers (no row-group decode)
Edge cases: 0 sample_size → all-zero; 0 shard_tokens → 0 shards; 0 num_workers → clamp to 1
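
The formula and edge cases above can be sketched as a pure function. This is a minimal illustration, not the PR's actual `extrapolate_estimate`: the `Estimate` struct, the parameter list, and the total-token extrapolation (sample tokens scaled by the document ratio, rounded down) are assumptions. Only the wall formula, the shard ceiling, and the three edge cases are taken from the PR text.

```rust
/// Hypothetical result type; field names mirror the stderr block,
/// not necessarily the struct used in the PR.
#[derive(Debug, Clone, PartialEq)]
pub struct Estimate {
    pub total_tokens: u64,
    pub shards: u64,
    pub wall_secs: f64,
}

/// Pure extrapolation kernel: touches neither the tokenizer nor the filesystem.
pub fn extrapolate_estimate(
    total_docs: u64,
    sample_size: u64,
    sample_tokens: u64,
    sample_wall_secs: f64,
    shard_tokens: u64,
    num_workers: usize,
) -> Estimate {
    // Edge case: an empty sample gives nothing to extrapolate from.
    if sample_size == 0 {
        return Estimate { total_tokens: 0, shards: 0, wall_secs: 0.0 };
    }

    // Assumed token extrapolation: scale sample tokens by the doc ratio.
    // The u128 intermediate avoids overflow in the product.
    let total_tokens =
        ((sample_tokens as u128 * total_docs as u128) / sample_size as u128) as u64;

    // Edge case: shard_tokens == 0 yields 0 shards; otherwise ceiling division.
    let shards = if shard_tokens == 0 {
        0
    } else {
        total_tokens / shard_tokens + u64::from(total_tokens % shard_tokens != 0)
    };

    // Edge case: num_workers == 0 is clamped to 1.
    let workers = num_workers.max(1) as f64;
    let wall_secs = (sample_wall_secs / sample_size as f64) * total_docs as f64 / workers;

    Estimate { total_tokens, shards, wall_secs }
}
```

Under these assumptions, a 1,000-document sample that took 5 seconds, extrapolated to a 100,000-document corpus at 4 workers, projects to 125 seconds of wall time.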

Tests

5 new unit tests, all passing:

estimate_only_extrapolation_formula_correct — AC4 math + edge cases (FALSIFY-APR-TOK-PAR-011)
estimate_only_no_shards_written — AC1 (FALSIFY-APR-TOK-PAR-012)
estimate_only_no_manifest_written — AC2 (FALSIFY-APR-TOK-PAR-013)
estimate_only_emits_estimate_lines_to_stderr — wiring smoke (FALSIFY-APR-TOK-PAR-014)
estimate_only_rejects_zero_sample_size — fail-fast typo guard
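
The AC1/AC2 invariants (no shards, no manifest, no output directory) follow from placing the short-circuit before `create_dir_all`. A rough sketch of that shape, with a helper a test could assert against, might look like the following. The function names, the `shard-00000.bin` file name, and the placeholder file contents are hypothetical; the real `run_encode_corpus` takes more arguments and does real work.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical stand-in for the encode entry point.
pub fn run_encode_corpus_sketch(output_dir: &Path, estimate_only: bool) -> io::Result<()> {
    if estimate_only {
        // The pre-flight pass would run here. Returning before
        // create_dir_all means the output directory is never touched.
        return Ok(());
    }
    fs::create_dir_all(output_dir)?;
    // Placeholder outputs standing in for real shards and manifest.
    fs::write(output_dir.join("shard-00000.bin"), [0u8; 4])?;
    fs::write(output_dir.join("manifest.json"), "{}")?;
    Ok(())
}

/// Returns true when the directory holds no `.bin` files and no
/// `manifest.json` (a missing directory also counts as untouched).
pub fn output_is_untouched(output_dir: &Path) -> io::Result<bool> {
    if !output_dir.exists() {
        return Ok(true);
    }
    for entry in fs::read_dir(output_dir)? {
        let path = entry?.path();
        let is_shard = path.extension().map_or(false, |e| e == "bin");
        let is_manifest = path.file_name().map_or(false, |n| n == "manifest.json");
        if is_shard || is_manifest {
            return Ok(false);
        }
    }
    Ok(true)
}
```

A test in this style would call the entry point with `estimate_only = true` against a fresh temporary path and assert that `output_is_untouched` holds and that the path still does not exist.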

Contract

contracts/apr-tokenize-parallel-bpe-v1.yaml v1.2.0 → v1.3.0. Adds estimate_extrapolation equation, four new falsifiers (011/012/013/014), two new proof obligations. pv validate passes.

Sequencing

Branched from PR #1552 (feat/tokenize-encode-corpus-progress-emission) since both PRs touch the same run_encode_corpus signature. Will rebase onto main cleanly once #1552 lands. Auto-merge armed; will queue behind #1552.

Test plan

cargo build -p apr-cli --features training
cargo clippy -p apr-cli --features training -- -D warnings
cargo fmt --package apr-cli -- --check
cargo test -p apr-cli --features training --lib commands::tokenize::tests (24/24 pass)
cargo test -p apr-cli --features training --test cli_commands (8/8 pass)
pv validate contracts/apr-tokenize-parallel-bpe-v1.yaml

🤖 Generated with Claude Code

noahgift · 2026-05-11T15:03:58Z

Triaged by autonomous rebase pass (2026-05-11): rebase onto main hits a structural conflict that needs manual review.

Specifically, this PR's run_estimate_only_path is typed against &entrenar::tokenizer::BPETokenizer, but main (post #1585 / #1596) has run_encode_corpus wrap the tokenizer in a local EncodeTokenizer enum (Hex / ByteLevel). Three options to unblock:

Lift EncodeTokenizer to module scope and retype run_estimate_only_path against &EncodeTokenizer (preferred — single source of truth for the two-format dispatch).
Make run_estimate_only_path generic over a closure FnMut(&str) -> Result<Vec<u32>, String> and pass |t| tokenizer.encode(t) at the call site.
Take the byte-level branch only when the format is detected; keep the original BPETokenizer arg and skip estimate-only on byte-level.
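
For comparison, the second option (generic over an encode closure) could look roughly like this. It is only a sketch: `sample_token_count` and `ToyTokenizer` are hypothetical stand-ins, and `ToyTokenizer` is not the real `EncodeTokenizer` or `BPETokenizer`. Only the closure signature `FnMut(&str) -> Result<Vec<u32>, String>` comes from the note above.

```rust
/// Sampling pass that is generic over the encode step, so it does not
/// need to know which tokenizer format is in use.
/// Returns (documents sampled, tokens counted).
pub fn sample_token_count<F>(
    docs: &[&str],
    sample_docs: usize,
    mut encode: F,
) -> Result<(usize, u64), String>
where
    F: FnMut(&str) -> Result<Vec<u32>, String>,
{
    let mut sampled = 0usize;
    let mut tokens = 0u64;
    for doc in docs.iter().copied().take(sample_docs) {
        // Any encode error aborts the sample and is passed to the caller.
        tokens += encode(doc)?.len() as u64;
        sampled += 1;
    }
    Ok((sampled, tokens))
}

/// Toy two-variant tokenizer used only to exercise the closure;
/// it does not implement BPE.
pub enum ToyTokenizer {
    Whitespace,
    Bytes,
}

impl ToyTokenizer {
    pub fn encode(&self, text: &str) -> Result<Vec<u32>, String> {
        match self {
            // One pseudo-token per whitespace-separated word.
            ToyTokenizer::Whitespace => {
                Ok(text.split_whitespace().map(|w| w.len() as u32).collect())
            }
            // One pseudo-token per byte.
            ToyTokenizer::Bytes => Ok(text.bytes().map(u32::from).collect()),
        }
    }
}
```

The call site would then pass `|t| tokenizer.encode(t)`, as the note suggests. The trade-off against the first option is that the dispatch stays local to `run_encode_corpus` rather than being lifted to module scope.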

The conflict also drops one of two test suites (PR's estimate_only_* vs main's repair_manifest_*); whichever resolution path lands needs to keep both. Leaving as-is for the author to pick a direction.

…re-flight pass

GH-1547 piece 3 of 3. Operators dispatched a 47-hour encode run blind in SHIP-TWO-001 5g.1; a 5-second pre-flight could have shown the projected wall, total tokens, and shard count up front. This PR adds that pre-flight pass.

What ships:

* `--estimate-only` clap flag (default: false). When set, the encode pipeline runs the pre-flight extrapolation path and returns without writing ANY output (no shards, no manifest, no output directory created).
* `--estimate-sample-docs <N>` clap flag (default: 1000). Controls how many documents the pre-flight samples to derive per-doc rate averages.
* Operator-facing stderr block:

  [estimate] input_docs=N
  [estimate] sample_size=K sample_tokens=T sample_wall=Ws
  [estimate] estimated_total_tokens=NNN
  [estimate] estimated_shards=NNN (at shard_tokens=NNN)
  [estimate] estimated_wall=NNN seconds (at --num-workers=N)

* Extrapolation formula (AC4): respects --num-workers.

  estimated_wall = (sample_wall / sample_size) × total_docs / max(num_workers, 1)

  Plus: estimated_shards = ceil(estimated_total_tokens / shard_tokens).
  Edge cases: sample_size=0 → all-zero; shard_tokens=0 → 0 shards; num_workers=0 → clamp to 1.
* Total-docs counting: JSONL = `wc -l` style (skip empty lines, match `iter_corpus_texts` behavior); parquet = sum of row-group metadata footers (no row-group decode).

Implementation:

* `EstimateConfig { enabled, sample_docs }` struct threaded through `run_encode_corpus`.
* Pure-function `extrapolate_estimate` kernel: testable on synthetic input without invoking the BPE tokenizer or the filesystem.
* `count_corpus_docs_fast` helper for total-docs counting.
* `run_estimate_only_path` body splits out so the AC1+AC2 invariants (no shards / no manifest) can be unit-tested in isolation.
* Short-circuit lives BEFORE `create_dir_all` so a `--estimate-only` invocation never even touches the output directory.

Tests (5 new):

* `estimate_only_extrapolation_formula_correct`: AC4 math (FALSIFY-APR-TOK-PAR-011), including 0-sample, 0-workers, 0-shard-tokens edge cases.
* `estimate_only_no_shards_written`: AC1 invariant (FALSIFY-APR-TOK-PAR-012). 50-doc corpus with sample_docs=10 → zero `.bin` files in output dir.
* `estimate_only_no_manifest_written`: AC2 invariant (FALSIFY-APR-TOK-PAR-013). manifest.json absent.
* `estimate_only_emits_estimate_lines_to_stderr`: wiring smoke (FALSIFY-APR-TOK-PAR-014).
* `estimate_only_rejects_zero_sample_size`: fail-fast on operator typo `--estimate-sample-docs 0`.

Contract: `contracts/apr-tokenize-parallel-bpe-v1.yaml` v1.2.0 → v1.3.0. Adds `estimate_extrapolation` equation, four new falsifiers (011/012/013/014), two new proof obligations. `pv validate` passes.

Closes piece 3 of 3 of #1547.

Branched from PR A (feat/tokenize-encode-corpus-progress-emission) since both PRs touch the same `run_encode_corpus` signature; rebases onto main cleanly once PR A lands.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
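
The JSONL half of the total-docs count ("`wc -l` style, skip empty lines") can be illustrated with a few lines of Rust. This is a sketch under assumptions, not the PR's `count_corpus_docs_fast`: the function name is hypothetical, and treating whitespace-only lines as empty is a guess at what "empty" means here. The parquet half, which reads row counts from footer metadata, is not sketched.

```rust
use std::io::{self, BufRead};

/// Counts non-empty lines from any buffered reader without parsing JSON.
/// Assumption: a line containing only whitespace counts as empty.
pub fn count_jsonl_docs<R: BufRead>(reader: R) -> io::Result<u64> {
    let mut count = 0u64;
    for line in reader.lines() {
        // Read errors (including invalid UTF-8) are passed to the caller.
        if !line?.trim().is_empty() {
            count += 1;
        }
    }
    Ok(count)
}
```

Because the function accepts any `BufRead`, it can be exercised on an in-memory byte slice in a unit test and on a `BufReader<File>` in real use.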

noahgift enabled auto-merge (squash) May 7, 2026 06:41

noahgift force-pushed the feat/tokenize-encode-corpus-estimate-only branch from eb1b5b2 to 0b55a55 Compare May 7, 2026 10:29

noahgift force-pushed the feat/tokenize-encode-corpus-estimate-only branch from 0b55a55 to 5ae4452 Compare May 13, 2026 06:54

noahgift added 19 commits May 13, 2026 09:14

Merge branch 'main' into feat/tokenize-encode-corpus-estimate-only

dfef367

Merge branch 'main' into feat/tokenize-encode-corpus-estimate-only

2d12985

Merge branch 'main' into feat/tokenize-encode-corpus-estimate-only

ca92cf8

Merge branch 'main' into feat/tokenize-encode-corpus-estimate-only

96f0751

Merge branch 'main' into feat/tokenize-encode-corpus-estimate-only

1d17757

Merge branch 'main' into feat/tokenize-encode-corpus-estimate-only

65b6384

Merge branch 'main' into feat/tokenize-encode-corpus-estimate-only

3ca7766

Merge branch 'main' into feat/tokenize-encode-corpus-estimate-only

da33543

Merge branch 'main' into feat/tokenize-encode-corpus-estimate-only

5824d14

Merge branch 'main' into feat/tokenize-encode-corpus-estimate-only

a4fe34c

Merge branch 'main' into feat/tokenize-encode-corpus-estimate-only

dd38007

Merge branch 'main' into feat/tokenize-encode-corpus-estimate-only

76cf234

Merge branch 'main' into feat/tokenize-encode-corpus-estimate-only

c67b8cf

Merge branch 'main' into feat/tokenize-encode-corpus-estimate-only

47b1f74

Merge branch 'main' into feat/tokenize-encode-corpus-estimate-only

7b726ee

Merge branch 'main' into feat/tokenize-encode-corpus-estimate-only

07d8c07

Merge branch 'main' into feat/tokenize-encode-corpus-estimate-only

d63e685

Merge branch 'main' into feat/tokenize-encode-corpus-estimate-only

e4b5f71

Merge branch 'main' into feat/tokenize-encode-corpus-estimate-only

db4df29

noahgift merged commit a2617e2 into main May 14, 2026
10 checks passed

noahgift deleted the feat/tokenize-encode-corpus-estimate-only branch May 14, 2026 03:05

noahgift merged 20 commits into main from feat/tokenize-encode-corpus-estimate-only
