feat(benchmark): add benchmark-history capture and initial record #71
Merged
Conversation
Adds tests/benchmark/capture_history.sh (release-mode multi-state runner) and tests/benchmark/aggregate.py (per-commit summary JSONs + HISTORY.md renderer). Wires `make benchmark-history` into the Makefile. Run count is parameterized via BENCHMARK_RUNS (default 5). The aggregator discovers states from native JSON filenames, so adding a new state is "append to STATES in capture_history.sh + re-run". Foundation for issue #69 goal 3 (productize multi-run aggregation). Will populate tests/benchmark/results/history/ in a follow-up commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
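The "append to STATES + re-run" workflow can be sketched as below. This is a hypothetical reconstruction of the STATES layout in capture_history.sh; the field format, save-prefix names, and echo wording are assumptions for illustration, not the script's actual contents.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: each STATES entry pairs a git ref with a save prefix
# used to name the native pytest-benchmark JSONs for that state.
BENCHMARK_RUNS="${BENCHMARK_RUNS:-5}"   # parameterized run count, default 5
STATES=(
  "v4.8.0  v4.8.0"
  "v4.8.1  v4.8.1"
  "6f601df pr65_merge"
)
# Adding a new state is just appending another entry and re-running:
STATES+=("ca5d83c main_ca5d83c")

for entry in "${STATES[@]}"; do
  read -r ref prefix <<<"$entry"
  echo "capture ${BENCHMARK_RUNS} runs at ${ref} -> ${prefix}"
done
```

Because the aggregator discovers states from the JSON filenames, no aggregator change is needed when a state is appended.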
aggregate.py's filename regex required the vcs+timestamp suffix that pytest-benchmark 5.x's --benchmark-save no longer produces. Make those segments optional so both legacy and 5.x formats match. Makefile target invoked bare `python`; switch to `python3` consistent with the rest of the project. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
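The fix described above amounts to making the vcs and timestamp segments optional in the filename pattern. A minimal sketch of that idea follows; the exact segment formats (hex commit id, `YYYYMMDD_HHMMSS` timestamp) and group names are assumptions, not the actual regex in aggregate.py.

```python
import re

# Legacy pytest-benchmark saves embed a vcs id and a timestamp between the
# counter and the user-supplied name; 5.x's --benchmark-save omits them.
# Making both segments optional lets one pattern match either format.
FILENAME_RE = re.compile(
    r"^(?P<counter>\d{4})"               # zero-padded save counter
    r"(?:_(?P<vcs>[0-9a-f]{7,40}))?"     # optional vcs commit id (legacy)
    r"(?:_(?P<ts>\d{8}_\d{6}))?"         # optional timestamp (legacy)
    r"_(?P<name>.+)\.json$"              # user-supplied save name
)

assert FILENAME_RE.match("0001_v4.8.0.json")                           # 5.x
assert FILENAME_RE.match("0001_6f601df00_20260509_120000_v4.8.0.json") # legacy-style
```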
Captured 5 release-mode runs at each of v4.8.0, v4.8.1, 6f601df (PR #65 merge), 500a8d0 (PR #66 merge), and ca5d83c (current main HEAD). Output: 25 native pytest-benchmark JSONs + 5 per-commit summary JSONs + rendered HISTORY.md. The record captures the @id-annotation perf regression introduced by PR #66: test_complex_policy median +14.9% at ca5d83c vs v4.8.0 (matches the release-mode bisect for #68). test_medium_policy / test_small_entity_set similarly affected (~+7%). The #68 fix that resolves this regression has not yet landed on main. Establishes the historical record referenced in issue #69 goal 2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reorder per-benchmark tables to: Commit | Date | Median (μs) | Δ median | Max (μs) | Δ max. The new Δ max column is computed identically to Δ median (percentage delta vs the earliest state in the record). Useful for spotting tail-time regressions that don't show up in medians, e.g., test_batch_complex_policy at ca5d83c: Δ median +2.3% but Δ max +15.1%. Tail signal is intrinsically noisier than median, but having both columns side-by-side makes it easy to dismiss the noise vs flag the rare cases where they diverge consistently. Regenerated HISTORY.md from the existing per-commit summary JSONs (no re-capture needed; phase B only). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
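The Δ columns described above reduce to one formula: percentage delta of a statistic against the earliest state in the record. A minimal sketch, with illustrative numbers chosen to mirror the median-vs-max divergence noted above (not the actual recorded timings):

```python
def pct_delta(current: float, baseline: float) -> float:
    """Percentage delta of a statistic vs the earliest state in the record."""
    return (current - baseline) / baseline * 100.0

# Illustrative values only: a benchmark whose median barely moves while
# its max regresses, the pattern Δ max is meant to surface.
baseline = {"median": 100.0, "max": 120.0}   # earliest state, in μs
current  = {"median": 102.3, "max": 138.1}   # later state, in μs

d_median = pct_delta(current["median"], baseline["median"])  # ~ +2.3%
d_max    = pct_delta(current["max"], baseline["max"])        # ~ +15.1%
```

Computing both from the same baseline keeps the two columns directly comparable row by row.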
…abels
The aggregator was deriving the descriptor portion of merge-commit labels
from the subject ("Merge pull request #N from <branch>"), producing
uninformative labels like "PR #65 merge". GitHub's merge UI puts the PR
title in the commit body; read it and use that instead.
Result, applied to the existing record:
6f601df: PR #65 merge
-> PR #65: fix: surface invalid schema as NoDecision...
500a8d0: PR #66 merge
-> PR #66: feat: honor @id annotation as the policy id
ca5d83c: PR #67 merge
-> PR #67: docs: add CHANGELOG.md and reference it in...
Falls back to the prior "PR #N merge" wording when the body is empty.
Regenerated all 5 per-commit summary JSONs and HISTORY.md from the
existing native pytest-benchmark JSONs; no re-capture needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tested N=7 against the N=5 default: medians moved by less than 1.3 percentage points (already converged at N=5), while max became noisier as N grew, since more samples mean more chances of capturing the rare tail outliers that dominate the max statistic. Captured the rationale in tests/benchmark/README.md so a future maintainer doesn't have to re-run the experiment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…via manifest

Each STATES entry in capture_history.sh now requires a third field: a
human-readable description used as the row label in HISTORY.md. The runner
errors out if any state is missing the description.

Plumbing: the runner serializes STATES to
tests/benchmark/results/history/states-manifest.json before the capture
loop. aggregate.py reads the manifest in Phase A and uses the description
as the per-commit label and note. Phase B skips non-summary files in
history/ so the manifest is not mistaken for a per-commit JSON.

The aggregator retains the prior auto-derivation (tag, then PR title from
the merge commit body, then short SHA) as a fallback for save_prefixes not
listed in the manifest. Defensive behavior, not the canonical path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
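The manifest-first lookup with auto-derivation fallback can be sketched as below. The manifest entry shape (`save_prefix`/`description` keys) is an assumption for illustration; the real schema may differ, and the fallback chain is stubbed rather than reimplemented.

```python
import json
from pathlib import Path

def load_manifest(history_dir: str) -> dict:
    """Read states-manifest.json written by the runner before the capture
    loop. Returns a save_prefix -> description mapping; empty if absent."""
    path = Path(history_dir) / "states-manifest.json"
    if not path.exists():
        return {}
    return {e["save_prefix"]: e["description"]
            for e in json.loads(path.read_text())}

def label_for(save_prefix: str, manifest: dict) -> str:
    # The manifest description is the canonical label; the tag / PR-title /
    # short-SHA auto-derivation survives only as a fallback, stubbed here
    # as the bare prefix.
    return manifest.get(save_prefix, save_prefix)
```

Keeping the fallback makes the aggregator tolerant of stray native JSONs whose prefixes never appeared in a manifest.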
…ated labels

Re-rendered from the existing native pytest-benchmark JSONs using the
descriptions in states-manifest.json. No re-capture; only the label and
note fields change. Row labels now read:

v4.8.0
v4.8.1
PR65 surface invalid schema
PR66 honor id annotation
main_ca5d83c docs update

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…AUDE.md

Adds pointers to the new benchmark-history tooling (make target, runner,
aggregator, output locations) and a follow-on-work entry for issue #69
that captures: what landed in PR #71; what remains (goals 1 and 3,
including the pytest-benchmark-5.x cross-invocation-aggregation gap); the
sequencing constraint between goal 1 and the baseline refresh; and the
empirical finding that medians are robust at N=5 while max-based gating
is misleading.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- `make benchmark-history`: a release-mode multi-state runner (`tests/benchmark/capture_history.sh`) and aggregator (`tests/benchmark/aggregate.py`) that produce per-commit summary JSONs and a rendered `HISTORY.md`. Run count is parameterized via `BENCHMARK_RUNS` (default 5).
- States captured: `v4.8.0`, `v4.8.1`, `6f601df` (merge of PR #65, "fix: surface invalid schema as NoDecision with diagnostic error (#27)"), `500a8d0` (merge of PR #66, "feat: honor @id(\"...\") annotation as the policy id"), and `ca5d83c` (current main HEAD). Output: 25 native pytest-benchmark JSONs + 5 per-commit summaries + the rendered `HISTORY.md`.
- The record captures the regression: `test_complex_policy` median +14.9% at `ca5d83c` vs `v4.8.0`, matching the original release-mode bisect for #68 ("perf: rename_from_id_annotation regresses is_authorized by up to 31% on policy-heavy workloads") exactly. The #68 fix that resolves this regression lives on `fix/optional-resolve-policy-ids` and is not yet on main.
- Foundation for issue #69 goal 2 (historical record) and goal 3 (multi-run aggregation primitives the gating refactor will build on).
Out of scope (separate follow-ups)
- Switch `make benchmark` / `benchmark-save` / `benchmark-compare` to release mode (issue #69, "benchmark process improvements: release builds, refreshed baseline, multi-run signal", goal 1).
- Refresh `tests/benchmark/results/baseline.json` to a release-mode capture (#69 goal 2 proper). Gated on the #68 fix landing on main.
- Switch `make benchmark-compare` to median-of-N gating (#69 goal 3).

Notes for reviewers
- `git diff --stat` is the friendlier view. The substantive changes for review are in `tests/benchmark/capture_history.sh`, `tests/benchmark/aggregate.py`, `tests/benchmark/README.md`, the `Makefile` benchmark-history target, and `tests/benchmark/results/HISTORY.md`.
- The `Δ max` column in `HISTORY.md` is intentionally noisy: pytest-benchmark tail times vary run-to-run on the same machine. It is useful for spotting consistent tail regressions, but expect ±5–10% noise on individual rows. The `Δ median` column is the trustworthy signal.
- The last state is `ca5d83c` (current main HEAD). Once the #68 fix lands on main, this branch can be rebased and a follow-up PR can append the post-fix state.

Test plan
- Read `tests/benchmark/README.md` and verify the documented workflow makes sense from a clean main checkout.
- Read `tests/benchmark/results/HISTORY.md`: verify the table shape (Commit | Date | Median | Δ median | Max | Δ max) and that the ~+15% regression on `test_complex_policy` at `500a8d0` / `ca5d83c` is visible.
- Spot-check a per-commit summary JSON (e.g. `tests/benchmark/results/history/v4.8.0.json`): schema, `source_runs` listing, `n_runs` count.
- Run `python3 tests/benchmark/aggregate.py --phase b` and confirm it regenerates `HISTORY.md` from the committed per-commit JSONs (no capture needed; ~1 second).
- Optionally run `make benchmark-history` end-to-end. It performs the release-mode 5×5 capture, regenerates `HISTORY.md`, and restores the starting branch on exit. It should produce identical statistical results modulo run-to-run variance.