feat(apr-cli): #1547 — apr tokenize encode-corpus --estimate-only pre-flight pass #1553

Merged: noahgift merged 20 commits into main from feat/tokenize-encode-corpus-estimate-only on May 14, 2026

@noahgift noahgift commented May 7, 2026

Summary

GH-1547, piece 3 of 3. In SHIP-TWO-001 5g.1, operators dispatched a 47-hour encode run blind; a 5-second pre-flight could have shown the projected wall time, total tokens, and shard count up front. This PR adds that pre-flight pass.

  • New flags: --estimate-only (bool), --estimate-sample-docs <N> (default 1000)
  • When --estimate-only is set: NO shards, NO manifest, NO output dir created (short-circuit lives BEFORE create_dir_all)
  • Operator-facing stderr block:
    [estimate] input_docs=N
    [estimate] sample_size=K sample_tokens=T sample_wall=Ws
    [estimate] estimated_total_tokens=NNN
    [estimate] estimated_shards=NNN (at shard_tokens=NNN)
    [estimate] estimated_wall=NNN seconds (at --num-workers=N)
    
  • AC4 extrapolation formula respects --num-workers:
    estimated_wall = (sample_wall / sample_size) × total_docs / max(num_workers, 1)
  • Pure-function extrapolate_estimate kernel — testable without invoking BPE tokenizer or filesystem
  • Total-docs counting: JSONL = wc -l style; parquet = sum of row-group metadata footers (no row-group decode)
  • Edge cases: 0 sample_size → all-zero; 0 shard_tokens → 0 shards; 0 num_workers → clamp to 1
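The "short-circuit lives BEFORE create_dir_all" invariant can be illustrated with a minimal sketch. The function name `run_encode_sketch` and its boolean return value are hypothetical and are not taken from the PR; only the ordering (return before any filesystem write when `--estimate-only` is set) reflects the description above.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical stand-in for the encode entry point.
/// Returns Ok(false) when only the estimate ran, Ok(true) when the
/// full (output-writing) path ran.
pub fn run_encode_sketch(output_dir: &Path, estimate_only: bool) -> io::Result<bool> {
    if estimate_only {
        // Pre-flight path: the estimate would be computed and reported here.
        // Returning at this point means the output directory is never created.
        return Ok(false);
    }
    // Only the real encode path touches the filesystem.
    fs::create_dir_all(output_dir)?;
    // Shard and manifest writing would follow here.
    Ok(true)
}
```

Placing the early return ahead of `create_dir_all` is what makes the "no output dir created" property easy to assert in a test: the directory simply must not exist afterwards.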

Tests

5 new unit tests, all passing:

  • estimate_only_extrapolation_formula_correct — AC4 math + edge cases (FALSIFY-APR-TOK-PAR-011)
  • estimate_only_no_shards_written — AC1 (FALSIFY-APR-TOK-PAR-012)
  • estimate_only_no_manifest_written — AC2 (FALSIFY-APR-TOK-PAR-013)
  • estimate_only_emits_estimate_lines_to_stderr — wiring smoke (FALSIFY-APR-TOK-PAR-014)
  • estimate_only_rejects_zero_sample_size — fail-fast typo guard

Contract

contracts/apr-tokenize-parallel-bpe-v1.yaml v1.2.0 → v1.3.0. Adds estimate_extrapolation equation, four new falsifiers (011/012/013/014), two new proof obligations. pv validate passes.

Sequencing

Branched from PR #1552 (feat/tokenize-encode-corpus-progress-emission) since both PRs touch the same run_encode_corpus signature. Will rebase onto main cleanly once #1552 lands. Auto-merge armed; will queue behind #1552.

Test plan

  • cargo build -p apr-cli --features training
  • cargo clippy -p apr-cli --features training -- -D warnings
  • cargo fmt --package apr-cli -- --check
  • cargo test -p apr-cli --features training --lib commands::tokenize::tests (24/24 pass)
  • cargo test -p apr-cli --features training --test cli_commands (8/8 pass)
  • pv validate contracts/apr-tokenize-parallel-bpe-v1.yaml

🤖 Generated with Claude Code

@noahgift noahgift enabled auto-merge (squash) May 7, 2026 06:41
@noahgift noahgift force-pushed the feat/tokenize-encode-corpus-estimate-only branch from eb1b5b2 to 0b55a55 Compare May 7, 2026 10:29
@noahgift noahgift commented

Triaged by autonomous rebase pass (2026-05-11): rebase onto main hits a structural conflict that needs manual review.

Specifically, this PR's run_estimate_only_path is typed against &entrenar::tokenizer::BPETokenizer, but main (post #1585 / #1596) has run_encode_corpus wrap the tokenizer in a local EncodeTokenizer enum (Hex / ByteLevel). Three options to unblock:

  1. Lift EncodeTokenizer to module scope and retype run_estimate_only_path against &EncodeTokenizer (preferred — single source of truth for the two-format dispatch).
  2. Make run_estimate_only_path generic over a closure FnMut(&str) -> Result<Vec<u32>, String> and pass |t| tokenizer.encode(t) at the call site.
  3. Take the byte-level branch only when the format is detected; keep the original BPETokenizer arg and skip estimate-only on byte-level.

The conflict also drops one of two test suites (PR's estimate_only_* vs main's repair_manifest_*); whichever resolution path lands needs to keep both. Leaving as-is for the author to pick a direction.
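For reference, option 2 could look roughly like the sketch below. The helper name `sample_with_encoder` and its return tuple are hypothetical; only the closure bound `FnMut(&str) -> Result<Vec<u32>, String>` comes from the comment above. The point is that the sampling loop no longer names a concrete tokenizer type, so either the Hex or the ByteLevel variant could be passed in at the call site.

```rust
/// Tokenize up to `sample_docs` documents through a caller-supplied encoder.
/// Returns (documents_sampled, tokens_seen). Hypothetical helper, not the
/// PR's actual `run_estimate_only_path`.
pub fn sample_with_encoder<'a, I, F>(
    docs: I,
    sample_docs: usize,
    mut encode: F,
) -> Result<(usize, u64), String>
where
    I: IntoIterator<Item = &'a str>,
    F: FnMut(&str) -> Result<Vec<u32>, String>,
{
    let mut sampled = 0usize;
    let mut tokens = 0u64;
    for text in docs.into_iter().take(sample_docs) {
        // Any encoder error aborts the sample and is returned to the caller.
        tokens += encode(text)?.len() as u64;
        sampled += 1;
    }
    Ok((sampled, tokens))
}
```

The trade-off against option 1 is that the dispatch between tokenizer formats stays at every call site instead of living in one shared type.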

…re-flight pass

GH-1547 piece 3 of 3. Operators dispatched a 47-hour encode run blind in
SHIP-TWO-001 5g.1 — a 5-second pre-flight could have shown the
projected wall, total tokens, and shard count up front. This PR adds
that pre-flight pass.

What ships:

  * `--estimate-only` clap flag (default: false). When set, the encode
    pipeline runs the pre-flight extrapolation path and returns
    without writing ANY output (no shards, no manifest, no output
    directory created).
  * `--estimate-sample-docs <N>` clap flag (default: 1000). Controls
    how many documents the pre-flight samples to derive per-doc rate
    averages.
  * Operator-facing stderr block:

      [estimate] input_docs=N
      [estimate] sample_size=K sample_tokens=T sample_wall=Ws
      [estimate] estimated_total_tokens=NNN
      [estimate] estimated_shards=NNN (at shard_tokens=NNN)
      [estimate] estimated_wall=NNN seconds (at --num-workers=N)

  * Extrapolation formula (AC4): respects --num-workers.
        estimated_wall = (sample_wall / sample_size)
                       × total_docs
                       / max(num_workers, 1)
    Plus: estimated_shards = ceil(estimated_total_tokens / shard_tokens).
    Edge cases: sample_size=0 → all-zero; shard_tokens=0 → 0 shards;
    num_workers=0 → clamp to 1.
  * Total-docs counting: JSONL = `wc -l` style (skip empty lines, match
    `iter_corpus_texts` behavior); parquet = sum of row-group metadata
    footers (no row-group decode).
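The extrapolation above can be sketched as a pure function. This is an illustration under stated assumptions, not the PR's code: the `Estimate` struct, its field names, and the parameter order are hypothetical, and the token extrapolation (per-document average times total documents, rounded to the nearest integer) is an assumption, since the commit message spells out only the wall-time and shard formulas.

```rust
/// Hypothetical result type for the pre-flight estimate.
#[derive(Debug, Clone, PartialEq)]
pub struct Estimate {
    pub total_tokens: u64,
    pub shards: u64,
    pub wall_secs: f64,
}

/// Sketch of the AC4 extrapolation. Signature and rounding are assumptions.
pub fn extrapolate_estimate(
    sample_size: u64,
    sample_tokens: u64,
    sample_wall_secs: f64,
    total_docs: u64,
    shard_tokens: u64,
    num_workers: usize,
) -> Estimate {
    // Edge case: an empty sample yields an all-zero estimate.
    if sample_size == 0 {
        return Estimate { total_tokens: 0, shards: 0, wall_secs: 0.0 };
    }
    // Assumed: total tokens = per-doc average tokens * total docs.
    let per_doc_tokens = sample_tokens as f64 / sample_size as f64;
    let total_tokens = (per_doc_tokens * total_docs as f64).round() as u64;
    // estimated_shards = ceil(total_tokens / shard_tokens); 0 shard_tokens -> 0 shards.
    let shards = if shard_tokens == 0 {
        0
    } else {
        total_tokens / shard_tokens + u64::from(total_tokens % shard_tokens != 0)
    };
    // estimated_wall = (sample_wall / sample_size) * total_docs / max(num_workers, 1).
    let workers = num_workers.max(1) as f64;
    let wall_secs = sample_wall_secs / sample_size as f64 * total_docs as f64 / workers;
    Estimate { total_tokens, shards, wall_secs }
}
```

With illustrative inputs of a 1000-document sample producing 500,000 tokens in 5 seconds, a 1,000,000-document corpus, `shard_tokens` of 100,000,000, and 8 workers, this sketch projects 500,000,000 tokens, 5 shards, and a wall time of about 625 seconds.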

Implementation:

  * `EstimateConfig { enabled, sample_docs }` struct threaded through
    `run_encode_corpus`.
  * Pure-function `extrapolate_estimate` kernel — testable on synthetic
    input without invoking the BPE tokenizer or the filesystem.
  * `count_corpus_docs_fast` helper for total-docs counting.
  * `run_estimate_only_path` body splits out so the AC1+AC2 invariants
    (no shards / no manifest) can be unit-tested in isolation.
  * Short-circuit lives BEFORE `create_dir_all` so a `--estimate-only`
    invocation never even touches the output directory.
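The JSONL half of the total-docs count can be sketched as a line counter that skips empty lines. The name `count_jsonl_docs` is hypothetical (the PR's helper is `count_corpus_docs_fast`, whose body is not shown here), and treating whitespace-only lines as empty is an assumption about how `iter_corpus_texts` behaves.

```rust
use std::io::{self, BufRead};

/// Count non-empty lines in a JSONL stream without parsing any JSON.
/// Assumption: whitespace-only lines are treated as empty and skipped.
pub fn count_jsonl_docs<R: BufRead>(reader: R) -> io::Result<u64> {
    let mut count = 0u64;
    for line in reader.lines() {
        // Propagate I/O errors; count only lines with visible content.
        if !line?.trim().is_empty() {
            count += 1;
        }
    }
    Ok(count)
}
```

Taking any `BufRead` rather than a path keeps the counter testable on in-memory input, in the same spirit as the pure `extrapolate_estimate` kernel.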

Tests (5 new):

  * `estimate_only_extrapolation_formula_correct` — AC4 math
    (FALSIFY-APR-TOK-PAR-011), including 0-sample, 0-workers,
    0-shard-tokens edge cases.
  * `estimate_only_no_shards_written` — AC1 invariant
    (FALSIFY-APR-TOK-PAR-012). 50-doc corpus with sample_docs=10 →
    zero `.bin` files in output dir.
  * `estimate_only_no_manifest_written` — AC2 invariant
    (FALSIFY-APR-TOK-PAR-013). manifest.json absent.
  * `estimate_only_emits_estimate_lines_to_stderr` — wiring smoke
    (FALSIFY-APR-TOK-PAR-014).
  * `estimate_only_rejects_zero_sample_size` — fail-fast on operator
    typo `--estimate-sample-docs 0`.

Contract: `contracts/apr-tokenize-parallel-bpe-v1.yaml` v1.2.0 → v1.3.0.
Adds `estimate_extrapolation` equation, four new falsifiers
(011/012/013/014), two new proof obligations. `pv validate` passes.

Closes piece 3 of 3 of #1547. Branched from PR A
(feat/tokenize-encode-corpus-progress-emission) since both PRs touch
the same `run_encode_corpus` signature; rebases onto main cleanly once
PR A lands.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift force-pushed the feat/tokenize-encode-corpus-estimate-only branch from 0b55a55 to 5ae4452 Compare May 13, 2026 06:54
@noahgift noahgift merged commit a2617e2 into main May 14, 2026
10 checks passed
@noahgift noahgift deleted the feat/tokenize-encode-corpus-estimate-only branch May 14, 2026 03:05