feat(apr-cli): #1547 — apr tokenize encode-corpus per-doc progress emission#1552

Merged: noahgift merged 1 commit into main from feat/tokenize-encode-corpus-progress-emission on May 7, 2026
@noahgift noahgift commented May 7, 2026

Summary

GH-1547 piece 2 of 3. Operators ran the prior single-threaded encode for 47 hours blind — there was no in-flight signal to tell whether the process was healthy or near completion. This PR adds operator-facing progress emission to stderr at a configurable cadence.

  • New flags: --quiet, --progress-interval-docs <N> (default 1000), --progress-interval-seconds <S> (default 60)
  • OR-cadence: emit when EITHER N docs OR S seconds have elapsed since last tick; both clocks reset on emit
  • Final completion line [progress] done docs=N tokens=K elapsed=Es rate=X.X docs/s always emitted (unless --quiet)
  • Per-tick format: [progress] doc=N/T tokens=K rate=X.X docs/s eta=YYYY-MM-DDTHH:MM:SSZ; /T and eta= fragments omitted when total unknown
  • Pure-functional ProgressEmitter (should_emit, format_line) so unit tests pin invariants without scraping stderr
  • Both single-threaded and chunked-rayon paths emit identically
  • No chrono dep — ETA via Howard Hinnant civil_from_days (public domain)
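The two line formats above can be sketched as plain string builders. This is a minimal illustration, not the PR's actual `format_line`: the function names `format_tick` and `format_done`, the argument types, and the choice to print `elapsed` in whole seconds are all assumptions, since the real signatures are not shown here.

```rust
/// Hypothetical per-tick formatter (name and signature are assumptions).
/// `total` and `eta` are optional: when the total is unknown, both the
/// `/T` and `eta=` fragments are left out of the line.
fn format_tick(docs: u64, total: Option<u64>, tokens: u64, rate: f64, eta: Option<&str>) -> String {
    let mut line = format!("[progress] doc={docs}");
    if let Some(t) = total {
        line.push_str(&format!("/{t}"));
    }
    line.push_str(&format!(" tokens={tokens} rate={rate:.1} docs/s"));
    // An ETA is only emitted when the total is known as well.
    if let (Some(_), Some(eta)) = (total, eta) {
        line.push_str(&format!(" eta={eta}"));
    }
    line
}

/// Hypothetical completion-line formatter. Whole-second `elapsed` is an
/// assumption; the source only shows the placeholder `elapsed=Es`.
fn format_done(docs: u64, tokens: u64, elapsed_secs: u64) -> String {
    let rate = if elapsed_secs > 0 {
        docs as f64 / elapsed_secs as f64
    } else {
        0.0
    };
    format!("[progress] done docs={docs} tokens={tokens} elapsed={elapsed_secs}s rate={rate:.1} docs/s")
}
```

Keeping the formatters free of I/O is what lets a unit test compare strings directly instead of capturing stderr.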

Tests

6 new unit tests, all passing:

  • progress_emit_every_n_docs_when_under_seconds_window — FALSIFY-APR-TOK-PAR-007
  • progress_emit_every_n_seconds_when_under_docs_window — FALSIFY-APR-TOK-PAR-008
  • progress_quiet_flag_suppresses_emission — FALSIFY-APR-TOK-PAR-009
  • progress_format_line_no_total_omits_eta_fragment — FALSIFY-APR-TOK-PAR-010 part 1
  • progress_format_line_with_total_includes_eta_fragment — FALSIFY-APR-TOK-PAR-010 part 2
  • progress_mark_emitted_resets_both_clocks — OR-cadence reset correctness

Contract

contracts/apr-tokenize-parallel-bpe-v1.yaml v1.1.0 → v1.2.0. Adds progress_or_cadence equation, four new falsifiers (007/008/009/010), two new proof obligations. pv validate passes.

Test plan

  • cargo build -p apr-cli --features training
  • cargo clippy -p apr-cli --features training -- -D warnings
  • cargo fmt --package apr-cli -- --check
  • cargo test -p apr-cli --features training --lib commands::tokenize::tests (19/19 pass)
  • cargo test -p apr-cli --features training --test cli_commands (8/8 pass)
  • cargo test -p apr-cli --features training --test falsification_apr_tok_par_004 (2/2 pass)
  • pv validate contracts/apr-tokenize-parallel-bpe-v1.yaml

Closes piece 2 of 3 of #1547. Piece 3 (--estimate-only) is the follow-up PR and gates on this landing first.

🤖 Generated with Claude Code

…emission

GH-1547 piece 2 of 3. Operators ran the prior single-threaded encode for
47 hours blind — there was no in-flight signal to tell whether the
process was healthy or near completion. This PR adds operator-facing
progress emission to stderr at a configurable cadence.

What ships:

  * `--quiet` clap flag (default: false) suppresses all stderr emission.
    The JSON manifest and stdout summary still emit; only the operator-
    facing progress lines go silent. Useful for CI / log-scraping
    callers that prefer pure JSON.
  * `--progress-interval-docs <N>` clap flag (default: 1000). Emit a
    progress line once N docs have been processed since the last tick.
  * `--progress-interval-seconds <S>` clap flag (default: 60). Emit a
    progress line once S seconds have passed since the last tick.
  * OR-cadence: emit when EITHER N docs OR S seconds have elapsed
    since the last tick. After an emit, BOTH clocks reset.
  * Final completion line at end of run:
    `[progress] done docs=N tokens=K elapsed=Es rate=X.X docs/s`.
  * Per-tick line format: `[progress] doc=N/T tokens=K rate=X.X docs/s
    eta=YYYY-MM-DDTHH:MM:SSZ`. When total `T` is unknown (the common
    case — counting up-front would double-walk the corpus), the `/T`
    and `eta=` fragments are omitted.
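The OR-cadence rule described in this list can be sketched as a small state machine. The field names `quiet`, `interval_docs`, and `interval_seconds` come from the commit message; everything else (the integer types, passing `now` explicitly, the private field names, the exact method signatures) is an assumption made so the sketch can be tested without sleeping.

```rust
use std::time::{Duration, Instant};

/// Field names follow the commit message; the types are assumptions.
#[derive(Debug, Clone, Copy)]
pub struct ProgressConfig {
    pub quiet: bool,
    pub interval_docs: u64,
    pub interval_seconds: u64,
}

/// Hypothetical emitter state: one doc counter and one clock,
/// both measured from the last emitted tick.
pub struct ProgressEmitter {
    cfg: ProgressConfig,
    docs_at_last_emit: u64,
    time_of_last_emit: Instant,
}

impl ProgressEmitter {
    pub fn new(cfg: ProgressConfig, now: Instant) -> Self {
        Self { cfg, docs_at_last_emit: 0, time_of_last_emit: now }
    }

    /// Pure predicate: true when EITHER the doc window OR the time
    /// window has elapsed since the last tick. Never true under --quiet.
    pub fn should_emit(&self, docs_done: u64, now: Instant) -> bool {
        if self.cfg.quiet {
            return false;
        }
        let docs_due =
            docs_done.saturating_sub(self.docs_at_last_emit) >= self.cfg.interval_docs;
        let time_due = now.saturating_duration_since(self.time_of_last_emit)
            >= Duration::from_secs(self.cfg.interval_seconds);
        docs_due || time_due
    }

    /// After an emit, BOTH clocks reset, whichever one triggered it.
    pub fn mark_emitted(&mut self, docs_done: u64, now: Instant) {
        self.docs_at_last_emit = docs_done;
        self.time_of_last_emit = now;
    }
}
```

Taking `now` as a parameter rather than calling `Instant::now()` inside the predicate is one way to keep `should_emit` deterministic under test; whether the PR does the same is not stated.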

Implementation:

  * `ProgressConfig { quiet, interval_docs, interval_seconds }` struct
    threaded through `run_encode_corpus`.
  * `ProgressEmitter` with pure-functional `should_emit` predicate and
    `format_line` so unit tests can pin OR-cadence + format invariants
    without scraping stderr.
  * Both single-threaded and chunked-rayon paths emit identically (the
    `should_emit` check sits inside the per-doc loop in both branches).
  * ETA computation uses Howard Hinnant's civil_from_days algorithm
    (public domain) — no chrono dependency added.
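For readers unfamiliar with the chrono-free ETA path, here is a sketch of how a Unix timestamp can be turned into the `YYYY-MM-DDTHH:MM:SSZ` form using Howard Hinnant's `civil_from_days`. The date algorithm itself is the published one; the helper names `format_utc` and `eta_unix_secs`, their signatures, and the rounding of the remaining time are assumptions, not the PR's code.

```rust
/// Howard Hinnant's civil_from_days: days since 1970-01-01 to
/// (year, month, day) in the proleptic Gregorian calendar.
fn civil_from_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468; // shift epoch to 0000-03-01
    let era = z.div_euclid(146_097); // 400-year eras
    let doe = z.rem_euclid(146_097); // day of era, [0, 146096]
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // [0, 399]
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // [0, 365], year starts in March
    let mp = (5 * doy + 2) / 153; // month index from March, [0, 11]
    let day = (doy - (153 * mp + 2) / 5 + 1) as u32; // [1, 31]
    let month = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32; // [1, 12]
    let year = yoe + era * 400 + i64::from(month <= 2); // Jan/Feb belong to the next year
    (year, month, day)
}

/// Hypothetical helper: Unix seconds to an ISO-8601 UTC string.
fn format_utc(unix_secs: i64) -> String {
    let (y, m, d) = civil_from_days(unix_secs.div_euclid(86_400));
    let sod = unix_secs.rem_euclid(86_400); // seconds of day
    let (hh, mm, ss) = (sod / 3_600, (sod % 3_600) / 60, sod % 60);
    format!("{y:04}-{m:02}-{d:02}T{hh:02}:{mm:02}:{ss:02}Z")
}

/// Hypothetical ETA estimate: now plus remaining docs divided by the
/// current rate, rounded up. Returns None when no rate is available.
fn eta_unix_secs(now_unix: i64, docs_done: u64, total: u64, rate_docs_per_sec: f64) -> Option<i64> {
    if rate_docs_per_sec <= 0.0 {
        return None;
    }
    let remaining = total.saturating_sub(docs_done) as f64;
    Some(now_unix + (remaining / rate_docs_per_sec).ceil() as i64)
}
```

The current Unix time would typically come from `std::time::SystemTime::now().duration_since(std::time::UNIX_EPOCH)`, which keeps the whole path inside the standard library.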

Tests (6 new):

  * `progress_emit_every_n_docs_when_under_seconds_window` — doc-tick
    branch fires (FALSIFY-APR-TOK-PAR-007).
  * `progress_emit_every_n_seconds_when_under_docs_window` — time-tick
    branch fires (FALSIFY-APR-TOK-PAR-008).
  * `progress_quiet_flag_suppresses_emission` — flag wiring
    (FALSIFY-APR-TOK-PAR-009).
  * `progress_format_line_no_total_omits_eta_fragment` — wire format
    invariant when total unknown (FALSIFY-APR-TOK-PAR-010 part 1).
  * `progress_format_line_with_total_includes_eta_fragment` — wire
    format invariant when total known (FALSIFY-APR-TOK-PAR-010 part 2).
  * `progress_mark_emitted_resets_both_clocks` — OR-cadence reset
    correctness.

Contract: `contracts/apr-tokenize-parallel-bpe-v1.yaml`
v1.1.0 → v1.2.0. Adds `progress_or_cadence` equation, four new
falsifiers (007/008/009/010), and two new proof obligations. `pv
validate` passes.

Closes piece 2 of 3 of #1547. Piece 3 (`--estimate-only`) is the
follow-up PR and gates on this landing first.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 7, 2026 06:31
@noahgift noahgift merged commit e21f242 into main May 7, 2026
11 checks passed
@noahgift noahgift deleted the feat/tokenize-encode-corpus-progress-emission branch May 7, 2026 07:11