Skip to content

feat(benchmark): add benchmark-history capture and initial record#71

Merged
skuenzli merged 9 commits intomainfrom
feat/issue-69-benchmark-history
May 9, 2026
Merged

feat(benchmark): add benchmark-history capture and initial record#71
skuenzli merged 9 commits intomainfrom
feat/issue-69-benchmark-history

Conversation

@skuenzli
Copy link
Copy Markdown
Contributor

@skuenzli skuenzli commented May 9, 2026

Summary

Foundation for issue #69 goal 2 (historical record) and goal 3 (multi-run aggregation primitives the gating refactor will build on).

Out of scope (separate follow-ups)

Notes for reviewers

  • The 25 native pytest-benchmark JSONs are large (~25k lines diff); git diff --stat is the friendlier view. The substantive changes for review are in tests/benchmark/capture_history.sh, tests/benchmark/aggregate.py, tests/benchmark/README.md, the Makefile benchmark-history target, and tests/benchmark/results/HISTORY.md.
  • Δ max column in HISTORY.md is intentionally noisy — pytest-benchmark tail times vary run-to-run on the same machine. Useful for spotting consistent tail regressions, but expect ±5–10% noise on individual rows. The Δ median column is the trustworthy signal.
  • Branch is based on ca5d83c (current main HEAD). Once the perf: rename_from_id_annotation regresses is_authorized by up to 31% on policy-heavy workloads #68 fix lands on main, this branch can be rebased and a follow-up PR can append the post-fix state.

Test plan

  • Read tests/benchmark/README.md and verify the documented workflow makes sense from a clean main checkout.
  • Inspect tests/benchmark/results/HISTORY.md — verify the table shape (Commit | Date | Median | Δ median | Max | Δ max) and that the ~+15% regression on test_complex_policy at 500a8d0/ca5d83c is visible.
  • Spot-check one per-commit summary JSON (e.g., tests/benchmark/results/history/v4.8.0.json) — schema, source_runs listing, n_runs count.
  • Optional fast path: python3 tests/benchmark/aggregate.py --phase b regenerates HISTORY.md from the committed per-commit JSONs (no capture needed; ~1 second).
  • Optional full reproduction (~30–40 min): make benchmark-history end-to-end. Runs release-mode 5×5 capture, regenerates HISTORY.md, restores starting branch on exit. Should produce identical statistical results modulo run-to-run variance.

skuenzli and others added 9 commits May 9, 2026 08:48
Adds tests/benchmark/capture_history.sh (release-mode multi-state runner)
and tests/benchmark/aggregate.py (per-commit summary JSONs + HISTORY.md
renderer). Wires `make benchmark-history` into the Makefile.

Run count is parameterized via BENCHMARK_RUNS (default 5). The aggregator
discovers states from native JSON filenames, so adding a new state is
"append to STATES in capture_history.sh + re-run".

Foundation for issue #69 goal 3 (productize multi-run aggregation). Will
populate tests/benchmark/results/history/ in a follow-up commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aggregate.py's filename regex required the vcs+timestamp suffix that
pytest-benchmark 5.x's --benchmark-save no longer produces. Make those
segments optional so both legacy and 5.x formats match.

Makefile target invoked bare `python`; switch to `python3` consistent
with the rest of the project.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captured 5 release-mode runs at each of v4.8.0, v4.8.1, 6f601df (PR #65
merge), 500a8d0 (PR #66 merge), and ca5d83c (current main HEAD). Output:
25 native pytest-benchmark JSONs + 5 per-commit summary JSONs + rendered
HISTORY.md.

The record captures the @id-annotation perf regression introduced by PR
#66: test_complex_policy median +14.9% at ca5d83c vs v4.8.0 (matches the
release-mode bisect for #68). test_medium_policy / test_small_entity_set
similarly affected (~+7%). The #68 fix that resolves this regression has
not yet landed on main.

Establishes the historical record referenced in issue #69 goal 2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reorder per-benchmark tables to: Commit | Date | Median (μs) | Δ median |
Max (μs) | Δ max. The new Δ max column is computed identically to Δ median
(percentage delta vs the earliest state in the record).

Useful for spotting tail-time regressions that don't show up in medians,
e.g., test_batch_complex_policy at ca5d83c: Δ median +2.3% but Δ max +15.1%.
Tail signal is intrinsically noisier than median, but having both columns
side-by-side makes it easy to dismiss the noise vs flag the rare cases
where they diverge consistently.

Regenerated HISTORY.md from the existing per-commit summary JSONs (no
re-capture needed; phase B only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…abels

The aggregator was deriving the descriptor portion of merge-commit labels
from the subject ("Merge pull request #N from <branch>"), producing
uninformative labels like "PR #65 merge". GitHub's merge UI puts the PR
title in the commit body; read it and use that instead.

Result, applied to the existing record:
  6f601df:  PR #65 merge
            -> PR #65: fix: surface invalid schema as NoDecision...
  500a8d0:  PR #66 merge
            -> PR #66: feat: honor @id annotation as the policy id
  ca5d83c:  PR #67 merge
            -> PR #67: docs: add CHANGELOG.md and reference it in...

Falls back to the prior "PR #N merge" wording when the body is empty.

Regenerated all 5 per-commit summary JSONs and HISTORY.md from the
existing native pytest-benchmark JSONs; no re-capture needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tested N=7 vs the N=5 default: medians moved <1.3 percentage points
(already converged at N=5), and max became noisier as N grew (more
samples = more chances of capturing rare tail outliers that dominate
the max statistic). Captured the rationale in tests/benchmark/README.md
so a future maintainer doesn't have to re-run the experiment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…via manifest

Each STATES entry in capture_history.sh now requires a third field — a
human-readable description used as the row label in HISTORY.md. The
runner errors out if any state is missing the description.

Plumbing: the runner serializes STATES to
tests/benchmark/results/history/states-manifest.json before the capture
loop. aggregate.py reads the manifest in Phase A and uses the description
as the per-commit label and note. Phase B skips non-summary files in
history/ so the manifest is not mistaken for a per-commit JSON.

The aggregator retains the prior auto-derivation (tag, then PR title from
the merge commit body, then short SHA) as a fallback for save_prefixes
not listed in the manifest. Defensive behavior, not the canonical path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ated labels

Re-rendered from the existing native pytest-benchmark JSONs using the
descriptions in states-manifest.json. No re-capture; only the label and
note fields change.

Row labels now read:
  v4.8.0
  v4.8.1
  PR65 surface invalid schema
  PR66 honor id annotation
  main_ca5d83c docs update

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…AUDE.md

Adds pointers to the new benchmark-history tooling (make target, runner,
aggregator, output locations) and a follow-on-work entry for issue #69
that captures what landed in PR #71, what remains (goals 1 and 3,
including the pytest-benchmark-5.x cross-invocation-aggregation gap),
the sequencing constraint between goal 1 and the baseline refresh, and
the empirical finding that medians are robust at N=5 while max-based
gating is misleading.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant