feat(apr-cli): #1547 — apr tokenize encode-corpus --estimate-only pre-flight pass #1553
Merged
Triaged by autonomous rebase pass (2026-05-11): rebase onto main hits a structural conflict that needs manual review. Specifically, this PR's
The conflict also drops one of two test suites (PR's |
…re-flight pass

GH-1547 piece 3 of 3. Operators dispatched a 47-hour encode run blind in SHIP-TWO-001 5g.1; a 5-second pre-flight could have shown the projected wall, total tokens, and shard count up front. This PR adds that pre-flight pass.

What ships:

* `--estimate-only` clap flag (default: false). When set, the encode pipeline runs the pre-flight extrapolation path and returns without writing ANY output (no shards, no manifest, no output directory created).
* `--estimate-sample-docs <N>` clap flag (default: 1000). Controls how many documents the pre-flight samples to derive per-doc rate averages.
* Operator-facing stderr block:

    [estimate] input_docs=N
    [estimate] sample_size=K sample_tokens=T sample_wall=Ws
    [estimate] estimated_total_tokens=NNN
    [estimate] estimated_shards=NNN (at shard_tokens=NNN)
    [estimate] estimated_wall=NNN seconds (at --num-workers=N)

* Extrapolation formula (AC4): respects --num-workers.

    estimated_wall = (sample_wall / sample_size) × total_docs / max(num_workers, 1)

  Plus: estimated_shards = ceil(estimated_total_tokens / shard_tokens).
  Edge cases: sample_size=0 → all-zero; shard_tokens=0 → 0 shards; num_workers=0 → clamp to 1.
* Total-docs counting: JSONL = `wc -l` style (skip empty lines, match `iter_corpus_texts` behavior); parquet = sum of row-group metadata footers (no row-group decode).

Implementation:

* `EstimateConfig { enabled, sample_docs }` struct threaded through `run_encode_corpus`.
* Pure-function `extrapolate_estimate` kernel, testable on synthetic input without invoking the BPE tokenizer or the filesystem.
* `count_corpus_docs_fast` helper for total-docs counting.
* `run_estimate_only_path` body is split out so the AC1+AC2 invariants (no shards / no manifest) can be unit-tested in isolation.
* Short-circuit lives BEFORE `create_dir_all` so a `--estimate-only` invocation never even touches the output directory.
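The extrapolation formula and its edge cases can be illustrated with a minimal sketch. This is not the PR's actual `extrapolate_estimate`: the signature, the `Estimate` struct, and its field names are assumptions made for illustration. The commit message also does not spell out how `estimated_total_tokens` is derived, so the sketch assumes simple proportional scaling of the sample's tokens-per-document average.

```rust
/// Hypothetical result type; the real kernel's return shape is not shown in the PR.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Estimate {
    total_tokens: u64,
    shards: u64,
    wall_secs: f64,
}

/// Sketch of a pure extrapolation kernel (no tokenizer, no filesystem).
/// Parameter names and order are assumptions.
fn extrapolate_estimate(
    total_docs: u64,
    sample_size: u64,
    sample_tokens: u64,
    sample_wall_secs: f64,
    shard_tokens: u64,
    num_workers: usize,
) -> Estimate {
    // Edge case: an empty sample yields an all-zero estimate.
    if sample_size == 0 {
        return Estimate { total_tokens: 0, shards: 0, wall_secs: 0.0 };
    }

    // Edge case: num_workers = 0 is clamped to 1.
    let workers = num_workers.max(1) as f64;

    // Assumption: total tokens scale linearly with document count.
    let tokens_per_doc = sample_tokens as f64 / sample_size as f64;
    let total_tokens = (tokens_per_doc * total_docs as f64).round() as u64;

    // estimated_shards = ceil(estimated_total_tokens / shard_tokens);
    // shard_tokens = 0 yields 0 shards.
    let shards = if shard_tokens == 0 {
        0
    } else {
        (total_tokens + shard_tokens - 1) / shard_tokens
    };

    // estimated_wall = (sample_wall / sample_size) * total_docs / max(num_workers, 1)
    let wall_secs = (sample_wall_secs / sample_size as f64) * total_docs as f64 / workers;

    Estimate { total_tokens, shards, wall_secs }
}
```

Under these assumptions, a 1000-document sample that took 2 seconds and produced 500,000 tokens extrapolates, for a 1,000,000-document corpus on 4 workers, to about 500 seconds of wall time and 500,000,000 tokens.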
Tests (5 new):

* `estimate_only_extrapolation_formula_correct`: AC4 math (FALSIFY-APR-TOK-PAR-011), including 0-sample, 0-workers, 0-shard-tokens edge cases.
* `estimate_only_no_shards_written`: AC1 invariant (FALSIFY-APR-TOK-PAR-012). 50-doc corpus with sample_docs=10 → zero `.bin` files in output dir.
* `estimate_only_no_manifest_written`: AC2 invariant (FALSIFY-APR-TOK-PAR-013). manifest.json absent.
* `estimate_only_emits_estimate_lines_to_stderr`: wiring smoke (FALSIFY-APR-TOK-PAR-014).
* `estimate_only_rejects_zero_sample_size`: fail-fast on operator typo `--estimate-sample-docs 0`.

Contract: `contracts/apr-tokenize-parallel-bpe-v1.yaml` v1.2.0 → v1.3.0. Adds `estimate_extrapolation` equation, four new falsifiers (011/012/013/014), two new proof obligations. `pv validate` passes.

Closes piece 3 of 3 of #1547. Branched from PR A (feat/tokenize-encode-corpus-progress-emission) since both PRs touch the same `run_encode_corpus` signature; rebases onto main cleanly once PR A lands.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
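The AC1/AC2 invariants and the zero-sample guard listed above depend on where the short-circuit sits relative to `create_dir_all`. The following is a rough sketch of that ordering only, not the PR's code: `run_encode_corpus_sketch`, the `Outcome` enum, and the error message are hypothetical, and it is an assumption that the zero-sample check lives on the estimate path.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Mirrors the `EstimateConfig { enabled, sample_docs }` shape named in the commit;
/// the field types are assumptions.
struct EstimateConfig {
    enabled: bool,
    sample_docs: usize,
}

/// Hypothetical outcome type, used only to make the control flow observable.
#[derive(Debug, PartialEq)]
enum Outcome {
    EstimatedOnly,
    Encoded,
}

/// Sketch of the ordering: the estimate-only branch returns before any
/// filesystem write, so the output directory is never created.
fn run_encode_corpus_sketch(output_dir: &Path, estimate: &EstimateConfig) -> io::Result<Outcome> {
    if estimate.enabled {
        // Fail fast on `--estimate-sample-docs 0` (message text is hypothetical).
        if estimate.sample_docs == 0 {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "--estimate-sample-docs must be greater than 0",
            ));
        }
        // The real pre-flight would sample documents and report to stderr here.
        return Ok(Outcome::EstimatedOnly);
    }

    // Only the full encode path reaches the first filesystem side effect.
    fs::create_dir_all(output_dir)?;
    // Shard and manifest writing would follow in the real pipeline.
    Ok(Outcome::Encoded)
}
```

Placing the guard ahead of `create_dir_all` means a test can assert on the absence of the directory itself, which is a stronger check than counting `.bin` files inside it.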
Summary
GH-1547 piece 3 of 3. Operators dispatched a 47-hour encode run blind in SHIP-TWO-001 5g.1 — a 5-second pre-flight could have shown the projected wall, total tokens, and shard count up front. This PR adds that pre-flight pass.
* `--estimate-only` (bool), `--estimate-sample-docs <N>` (default 1000)
* When `--estimate-only` is set: NO shards, NO manifest, NO output dir created (short-circuit lives BEFORE `create_dir_all`)
* Extrapolation respects `--num-workers`: `estimated_wall = (sample_wall / sample_size) × total_docs / max(num_workers, 1)`
* Pure-function `extrapolate_estimate` kernel, testable without invoking the BPE tokenizer or the filesystem
* Total-docs counting: JSONL = `wc -l` style; parquet = sum of row-group metadata footers (no row-group decode)

Tests
5 new unit tests, all passing:
* `estimate_only_extrapolation_formula_correct`: AC4 math + edge cases (FALSIFY-APR-TOK-PAR-011)
* `estimate_only_no_shards_written`: AC1 (FALSIFY-APR-TOK-PAR-012)
* `estimate_only_no_manifest_written`: AC2 (FALSIFY-APR-TOK-PAR-013)
* `estimate_only_emits_estimate_lines_to_stderr`: wiring smoke (FALSIFY-APR-TOK-PAR-014)
* `estimate_only_rejects_zero_sample_size`: fail-fast typo guard

Contract
`contracts/apr-tokenize-parallel-bpe-v1.yaml` v1.2.0 → v1.3.0. Adds `estimate_extrapolation` equation, four new falsifiers (011/012/013/014), two new proof obligations. `pv validate` passes.

Sequencing
Branched from PR #1552 (feat/tokenize-encode-corpus-progress-emission) since both PRs touch the same `run_encode_corpus` signature. Will rebase onto main cleanly once #1552 lands. Auto-merge armed; will queue behind #1552.

Test plan
* `cargo build -p apr-cli --features training`
* `cargo clippy -p apr-cli --features training -- -D warnings`
* `cargo fmt --package apr-cli -- --check`
* `cargo test -p apr-cli --features training --lib commands::tokenize::tests` (24/24 pass)
* `cargo test -p apr-cli --features training --test cli_commands` (8/8 pass)
* `pv validate contracts/apr-tokenize-parallel-bpe-v1.yaml`

🤖 Generated with Claude Code
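For the `wc -l` style JSONL counting mentioned in the Summary, a minimal sketch of a line counter that skips empty lines might look like the following. The name `count_jsonl_docs` is hypothetical (the PR's helper is `count_corpus_docs_fast`, whose body is not shown here), and treating whitespace-only lines as empty is an assumption about what `iter_corpus_texts` does. The parquet path, which reads row-group metadata, is not covered.

```rust
use std::io::{self, BufRead};

/// Counts documents in a JSONL stream as the number of non-empty lines.
/// Assumption: lines containing only whitespace are treated as empty.
fn count_jsonl_docs<R: BufRead>(reader: R) -> io::Result<u64> {
    let mut count = 0u64;
    for line in reader.lines() {
        // Propagate I/O and UTF-8 errors instead of silently skipping the line.
        if !line?.trim().is_empty() {
            count += 1;
        }
    }
    Ok(count)
}
```

Because the function accepts any `BufRead`, it can be exercised with an in-memory `std::io::Cursor` in tests and with a `BufReader<File>` on a real corpus. No line is parsed as JSON, which is what keeps the pass cheap compared with a full encode.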