feat(apr-cli): #1547 — apr tokenize encode-corpus per-doc progress emission#1552
Merged
…emission

GH-1547 piece 2 of 3. Operators ran the prior single-threaded encode for 47 hours blind: there was no in-flight signal to tell whether the process was healthy or near completion. This PR adds operator-facing progress emission to stderr at a configurable cadence.

What ships:

* `--quiet` clap flag (default: false) suppresses all stderr emission. The JSON manifest and stdout summary still emit; only the operator-facing progress lines go silent. Useful for CI / log-scraping callers that prefer pure JSON.
* `--progress-interval-docs <N>` clap flag (default: 1000). Emit a progress line at most every N docs.
* `--progress-interval-seconds <S>` clap flag (default: 60). Emit a progress line at most every S seconds.
* OR-cadence: emit when EITHER N docs OR S seconds have elapsed since the last tick. After an emit, BOTH clocks reset.
* Final completion line at end of run: `[progress] done docs=N tokens=K elapsed=Es rate=X.X docs/s`.
* Per-tick line format: `[progress] doc=N/T tokens=K rate=X.X docs/s eta=YYYY-MM-DDTHH:MM:SSZ`. When total `T` is unknown (the common case, since counting up-front would double-walk the corpus), the `/T` and `eta=` fragments are omitted.

Implementation:

* `ProgressConfig { quiet, interval_docs, interval_seconds }` struct threaded through `run_encode_corpus`.
* `ProgressEmitter` with pure-functional `should_emit` predicate and `format_line` so unit tests can pin OR-cadence + format invariants without scraping stderr.
* Both single-threaded and chunked-rayon paths emit identically (the `should_emit` check sits inside the per-doc loop in both branches).
* ETA computation uses Howard Hinnant's civil_from_days algorithm (public domain); no chrono dependency added.

Tests (6 new):

* `progress_emit_every_n_docs_when_under_seconds_window`: doc-tick branch fires (FALSIFY-APR-TOK-PAR-007).
* `progress_emit_every_n_seconds_when_under_docs_window`: time-tick branch fires (FALSIFY-APR-TOK-PAR-008).
* `progress_quiet_flag_suppresses_emission`: flag wiring (FALSIFY-APR-TOK-PAR-009).
* `progress_format_line_no_total_omits_eta_fragment`: wire format invariant when total unknown (FALSIFY-APR-TOK-PAR-010 part 1).
* `progress_format_line_with_total_includes_eta_fragment`: wire format invariant when total known (FALSIFY-APR-TOK-PAR-010 part 2).
* `progress_mark_emitted_resets_both_clocks`: OR-cadence reset correctness.

Contract: `contracts/apr-tokenize-parallel-bpe-v1.yaml` v1.1.0 → v1.2.0. Adds `progress_or_cadence` equation, four new falsifiers (007/008/009/010), and two new proof obligations. `pv validate` passes.

Closes piece 2 of 3 of #1547. Piece 3 (`--estimate-only`) is the follow-up PR and gates on this landing first.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
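For reviewers who want the OR-cadence rule in code form, here is a minimal sketch of how a pure `should_emit` predicate and a `mark_emitted` reset could look. It is not the code in this PR: the field names `docs_at_last_emit` and `elapsed_at_last_emit`, the method signatures, and the choice to pass elapsed time in as a `Duration` are all assumptions made for illustration. Only the struct names, the three config fields, and the OR/reset semantics come from the description above.

```rust
use std::time::Duration;

/// Hypothetical mirror of the `ProgressConfig` named in the commit message.
#[derive(Debug, Clone, Copy)]
pub struct ProgressConfig {
    pub quiet: bool,
    pub interval_docs: u64,
    pub interval_seconds: u64,
}

/// Hypothetical emitter state: doc count and clock reading at the last emit.
#[derive(Debug, Clone, Copy)]
pub struct ProgressEmitter {
    cfg: ProgressConfig,
    docs_at_last_emit: u64,
    elapsed_at_last_emit: Duration,
}

impl ProgressEmitter {
    pub fn new(cfg: ProgressConfig) -> Self {
        Self {
            cfg,
            docs_at_last_emit: 0,
            elapsed_at_last_emit: Duration::ZERO,
        }
    }

    /// Pure predicate: true when EITHER the doc window OR the time window
    /// has elapsed since the last emit. Always false when `quiet` is set.
    pub fn should_emit(&self, docs_done: u64, elapsed: Duration) -> bool {
        if self.cfg.quiet {
            return false;
        }
        let docs_since = docs_done.saturating_sub(self.docs_at_last_emit);
        let time_since = elapsed.saturating_sub(self.elapsed_at_last_emit);
        docs_since >= self.cfg.interval_docs
            || time_since >= Duration::from_secs(self.cfg.interval_seconds)
    }

    /// Reset BOTH clocks after a line has been written.
    pub fn mark_emitted(&mut self, docs_done: u64, elapsed: Duration) {
        self.docs_at_last_emit = docs_done;
        self.elapsed_at_last_emit = elapsed;
    }
}
```

Because the predicate takes the doc count and elapsed time as arguments instead of reading a clock, a unit test can drive it with fixed values, which is the property the `progress_emit_every_n_*` tests rely on.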
Summary
GH-1547 piece 2 of 3. Operators ran the prior single-threaded encode for 47 hours blind — there was no in-flight signal to tell whether the process was healthy or near completion. This PR adds operator-facing progress emission to stderr at a configurable cadence.
* `--quiet`, `--progress-interval-docs <N>` (default 1000), `--progress-interval-seconds <S>` (default 60)
* `[progress] done docs=N tokens=K elapsed=Es rate=X.X docs/s` always emitted (unless `--quiet`)
* `[progress] doc=N/T tokens=K rate=X.X docs/s eta=YYYY-MM-DDTHH:MM:SSZ`; `/T` and `eta=` fragments omitted when total unknown
* `ProgressEmitter` (`should_emit`, `format_line`) so unit tests pin invariants without scraping stderr
* `civil_from_days` (public domain)

Tests
6 new unit tests, all passing:
* `progress_emit_every_n_docs_when_under_seconds_window`: FALSIFY-APR-TOK-PAR-007
* `progress_emit_every_n_seconds_when_under_docs_window`: FALSIFY-APR-TOK-PAR-008
* `progress_quiet_flag_suppresses_emission`: FALSIFY-APR-TOK-PAR-009
* `progress_format_line_no_total_omits_eta_fragment`: FALSIFY-APR-TOK-PAR-010 part 1
* `progress_format_line_with_total_includes_eta_fragment`: FALSIFY-APR-TOK-PAR-010 part 2
* `progress_mark_emitted_resets_both_clocks`: OR-cadence reset correctness

Contract
`contracts/apr-tokenize-parallel-bpe-v1.yaml` v1.1.0 → v1.2.0. Adds `progress_or_cadence` equation, four new falsifiers (007/008/009/010), two new proof obligations. `pv validate` passes.

Test plan
* `cargo build -p apr-cli --features training`
* `cargo clippy -p apr-cli --features training -- -D warnings`
* `cargo fmt --package apr-cli -- --check`
* `cargo test -p apr-cli --features training --lib commands::tokenize::tests` (19/19 pass)
* `cargo test -p apr-cli --features training --test cli_commands` (8/8 pass)
* `cargo test -p apr-cli --features training --test falsification_apr_tok_par_004` (2/2 pass)
* `pv validate contracts/apr-tokenize-parallel-bpe-v1.yaml`

Closes piece 2 of 3 of #1547. Piece 3 (`--estimate-only`) is the follow-up PR and gates on this landing first.

🤖 Generated with Claude Code
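As a reference for the per-tick wire format and the chrono-free ETA timestamp mentioned in the summary, the following is a rough sketch of how the two could fit together. `civil_from_days` follows Howard Hinnant's published algorithm; everything else is an assumption for illustration and not the PR's actual code, including the helper name `format_utc`, the `format_line` signature, passing the current time as `now_unix`, and estimating the ETA as remaining docs divided by the current rate.

```rust
/// Hinnant's civil_from_days: days since 1970-01-01 -> (year, month, day)
/// in the proleptic Gregorian calendar.
pub fn civil_from_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468; // shift epoch to 0000-03-01
    let era = z.div_euclid(146_097); // 400-year cycles
    let doe = z.rem_euclid(146_097); // day of era, [0, 146096]
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // [0, 399]
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // [0, 365], March-based
    let mp = (5 * doy + 2) / 153; // March-based month, [0, 11]
    let day = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let month = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    let year = yoe + era * 400 + i64::from(month <= 2);
    (year, month, day)
}

/// Hypothetical helper: Unix seconds -> `YYYY-MM-DDTHH:MM:SSZ`.
pub fn format_utc(unix_secs: i64) -> String {
    let days = unix_secs.div_euclid(86_400);
    let sod = unix_secs.rem_euclid(86_400); // second of day
    let (y, m, d) = civil_from_days(days);
    format!(
        "{:04}-{:02}-{:02}T{:02}:{:02}:{:02}Z",
        y, m, d, sod / 3_600, (sod % 3_600) / 60, sod % 60
    )
}

/// Hypothetical `format_line`: the `/T` and `eta=` fragments appear only
/// when a total is supplied (and, for `eta=`, the rate is non-zero).
pub fn format_line(
    docs: u64,
    total: Option<u64>,
    tokens: u64,
    elapsed_secs: f64,
    now_unix: i64,
) -> String {
    let rate = if elapsed_secs > 0.0 { docs as f64 / elapsed_secs } else { 0.0 };
    let mut line = match total {
        Some(t) => format!("[progress] doc={docs}/{t}"),
        None => format!("[progress] doc={docs}"),
    };
    line.push_str(&format!(" tokens={tokens} rate={rate:.1} docs/s"));
    if let Some(t) = total {
        if rate > 0.0 {
            // Assumed ETA model: remaining docs at the current average rate.
            let remaining_secs = (t.saturating_sub(docs) as f64 / rate).round() as i64;
            line.push_str(&format!(" eta={}", format_utc(now_unix + remaining_secs)));
        }
    }
    line
}
```

With these definitions, `format_line(500, None, 12000, 10.0, 0)` yields `[progress] doc=500 tokens=12000 rate=50.0 docs/s`, and supplying `Some(1000)` with `now_unix = 1_700_000_000` appends `/1000` and `eta=2023-11-14T22:13:30Z`.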