feat(benchmark): add benchmark-history capture and initial record #71
Merged
Conversation
Adds tests/benchmark/capture_history.sh (release-mode multi-state runner) and tests/benchmark/aggregate.py (per-commit summary JSONs + HISTORY.md renderer). Wires `make benchmark-history` into the Makefile. Run count is parameterized via BENCHMARK_RUNS (default 5). The aggregator discovers states from native JSON filenames, so adding a new state is "append to STATES in capture_history.sh + re-run". Foundation for issue #69 goal 3 (productize multi-run aggregation). Will populate tests/benchmark/results/history/ in a follow-up commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
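The "append to STATES + re-run" workflow can be sketched as below. This is a hypothetical reconstruction of the STATES layout in capture_history.sh; the field format, save-prefix names, and echo wording are assumptions for illustration, not the script's actual contents.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: each STATES entry pairs a git ref with a save prefix
# used to name the native pytest-benchmark JSONs for that state.
BENCHMARK_RUNS="${BENCHMARK_RUNS:-5}"   # parameterized run count, default 5
STATES=(
  "v4.8.0  v4.8.0"
  "v4.8.1  v4.8.1"
  "6f601df pr65_merge"
)
# Adding a new state is just appending another entry and re-running:
STATES+=("ca5d83c main_ca5d83c")

for entry in "${STATES[@]}"; do
  read -r ref prefix <<<"$entry"
  echo "capture ${BENCHMARK_RUNS} runs at ${ref} -> ${prefix}"
done
```

Because the aggregator discovers states from the JSON filenames, no aggregator change is needed when a state is appended.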
aggregate.py's filename regex required the vcs+timestamp suffix that pytest-benchmark 5.x's --benchmark-save no longer produces. Make those segments optional so both legacy and 5.x formats match. Makefile target invoked bare `python`; switch to `python3` consistent with the rest of the project. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
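The fix described above amounts to making the vcs and timestamp segments optional in the filename pattern. A minimal sketch of that idea follows; the exact segment formats (hex commit id, `YYYYMMDD_HHMMSS` timestamp) and group names are assumptions, not the actual regex in aggregate.py.

```python
import re

# Legacy pytest-benchmark saves embed a vcs id and a timestamp between the
# counter and the user-supplied name; 5.x's --benchmark-save omits them.
# Making both segments optional lets one pattern match either format.
FILENAME_RE = re.compile(
    r"^(?P<counter>\d{4})"               # zero-padded save counter
    r"(?:_(?P<vcs>[0-9a-f]{7,40}))?"     # optional vcs commit id (legacy)
    r"(?:_(?P<ts>\d{8}_\d{6}))?"         # optional timestamp (legacy)
    r"_(?P<name>.+)\.json$"              # user-supplied save name
)

assert FILENAME_RE.match("0001_v4.8.0.json")                           # 5.x
assert FILENAME_RE.match("0001_6f601df00_20260509_120000_v4.8.0.json") # legacy-style
```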
Captured 5 release-mode runs at each of v4.8.0, v4.8.1, 6f601df (PR #65 merge), 500a8d0 (PR #66 merge), and ca5d83c (current main HEAD). Output: 25 native pytest-benchmark JSONs + 5 per-commit summary JSONs + rendered HISTORY.md. The record captures the @id-annotation perf regression introduced by PR #66: test_complex_policy median +14.9% at ca5d83c vs v4.8.0 (matches the release-mode bisect for #68). test_medium_policy / test_small_entity_set similarly affected (~+7%). The #68 fix that resolves this regression has not yet landed on main. Establishes the historical record referenced in issue #69 goal 2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reorder per-benchmark tables to: Commit | Date | Median (μs) | Δ median | Max (μs) | Δ max. The new Δ max column is computed identically to Δ median (percentage delta vs the earliest state in the record). Useful for spotting tail-time regressions that don't show up in medians, e.g., test_batch_complex_policy at ca5d83c: Δ median +2.3% but Δ max +15.1%. Tail signal is intrinsically noisier than median, but having both columns side-by-side makes it easy to dismiss the noise vs flag the rare cases where they diverge consistently. Regenerated HISTORY.md from the existing per-commit summary JSONs (no re-capture needed; phase B only). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
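The Δ columns described above reduce to one formula: percentage delta of a statistic against the earliest state in the record. A minimal sketch, with illustrative numbers chosen to mirror the median-vs-max divergence noted above (not the actual recorded timings):

```python
def pct_delta(current: float, baseline: float) -> float:
    """Percentage delta of a statistic vs the earliest state in the record."""
    return (current - baseline) / baseline * 100.0

# Illustrative values only: a benchmark whose median barely moves while
# its max regresses, the pattern Δ max is meant to surface.
baseline = {"median": 100.0, "max": 120.0}   # earliest state, in μs
current  = {"median": 102.3, "max": 138.1}   # later state, in μs

d_median = pct_delta(current["median"], baseline["median"])  # ~ +2.3%
d_max    = pct_delta(current["max"], baseline["max"])        # ~ +15.1%
```

Computing both from the same baseline keeps the two columns directly comparable row by row.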
…abels
The aggregator was deriving the descriptor portion of merge-commit labels
from the subject ("Merge pull request #N from <branch>"), producing
uninformative labels like "PR #65 merge". GitHub's merge UI puts the PR
title in the commit body; read it and use that instead.
Result, applied to the existing record:
6f601df: PR #65 merge
-> PR #65: fix: surface invalid schema as NoDecision...
500a8d0: PR #66 merge
-> PR #66: feat: honor @id annotation as the policy id
ca5d83c: PR #67 merge
-> PR #67: docs: add CHANGELOG.md and reference it in...
Falls back to the prior "PR #N merge" wording when the body is empty.
Regenerated all 5 per-commit summary JSONs and HISTORY.md from the
existing native pytest-benchmark JSONs; no re-capture needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tested N=7 against the N=5 default: medians moved by less than 1.3 percentage points (already converged at N=5), while max became noisier as N grew, since more samples mean more chances of capturing the rare tail outliers that dominate the max statistic. Captured the rationale in tests/benchmark/README.md so a future maintainer doesn't have to re-run the experiment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…via manifest

Each STATES entry in capture_history.sh now requires a third field: a
human-readable description used as the row label in HISTORY.md. The runner
errors out if any state is missing the description.

Plumbing: the runner serializes STATES to
tests/benchmark/results/history/states-manifest.json before the capture
loop. aggregate.py reads the manifest in Phase A and uses the description
as the per-commit label and note. Phase B skips non-summary files in
history/ so the manifest is not mistaken for a per-commit JSON.

The aggregator retains the prior auto-derivation (tag, then PR title from
the merge commit body, then short SHA) as a fallback for save_prefixes not
listed in the manifest. Defensive behavior, not the canonical path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
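The manifest-first lookup with auto-derivation fallback can be sketched as below. The manifest entry shape (`save_prefix`/`description` keys) is an assumption for illustration; the real schema may differ, and the fallback chain is stubbed rather than reimplemented.

```python
import json
from pathlib import Path

def load_manifest(history_dir: str) -> dict:
    """Read states-manifest.json written by the runner before the capture
    loop. Returns a save_prefix -> description mapping; empty if absent."""
    path = Path(history_dir) / "states-manifest.json"
    if not path.exists():
        return {}
    return {e["save_prefix"]: e["description"]
            for e in json.loads(path.read_text())}

def label_for(save_prefix: str, manifest: dict) -> str:
    # The manifest description is the canonical label; the tag / PR-title /
    # short-SHA auto-derivation survives only as a fallback, stubbed here
    # as the bare prefix.
    return manifest.get(save_prefix, save_prefix)
```

Keeping the fallback makes the aggregator tolerant of stray native JSONs whose prefixes never appeared in a manifest.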
…ated labels

Re-rendered from the existing native pytest-benchmark JSONs using the
descriptions in states-manifest.json. No re-capture; only the label and
note fields change. Row labels now read:

v4.8.0
v4.8.1
PR65 surface invalid schema
PR66 honor id annotation
main_ca5d83c docs update

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…AUDE.md

Adds pointers to the new benchmark-history tooling (make target, runner,
aggregator, output locations) and a follow-on-work entry for issue #69
that captures: what landed in PR #71; what remains (goals 1 and 3,
including the pytest-benchmark-5.x cross-invocation-aggregation gap); the
sequencing constraint between goal 1 and the baseline refresh; and the
empirical finding that medians are robust at N=5 while max-based gating
is misleading.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- `make benchmark-history`: a release-mode multi-state runner (`tests/benchmark/capture_history.sh`) and aggregator (`tests/benchmark/aggregate.py`) that produce per-commit summary JSONs and a rendered `HISTORY.md`. Run count is parameterized via `BENCHMARK_RUNS` (default 5).
- States captured: `v4.8.0`, `v4.8.1`, `6f601df` (merge of PR #65, "fix: surface invalid schema as NoDecision with diagnostic error (#27)"), `500a8d0` (merge of PR #66, "feat: honor @id(\"...\") annotation as the policy id"), and `ca5d83c` (current main HEAD). Output: 25 native pytest-benchmark JSONs + 5 per-commit summaries + the rendered `HISTORY.md`.
- The record captures the regression: `test_complex_policy` median +14.9% at `ca5d83c` vs `v4.8.0`, matching the original release-mode bisect for #68 ("perf: rename_from_id_annotation regresses is_authorized by up to 31% on policy-heavy workloads") exactly. The #68 fix that resolves this regression lives on `fix/optional-resolve-policy-ids` and is not yet on main.
- Foundation for issue #69 goal 2 (historical record) and goal 3 (multi-run aggregation primitives the gating refactor will build on).
Out of scope (separate follow-ups)
- Switch `make benchmark` / `benchmark-save` / `benchmark-compare` to release mode (issue #69, "benchmark process improvements: release builds, refreshed baseline, multi-run signal", goal 1).
- Refresh `tests/benchmark/results/baseline.json` to a release-mode capture (#69 goal 2 proper). Gated on the #68 fix landing on main.
- Switch `make benchmark-compare` to median-of-N gating (#69 goal 3).

Notes for reviewers
- `git diff --stat` is the friendlier view. The substantive changes for review are in `tests/benchmark/capture_history.sh`, `tests/benchmark/aggregate.py`, `tests/benchmark/README.md`, the `Makefile` benchmark-history target, and `tests/benchmark/results/HISTORY.md`.
- The `Δ max` column in `HISTORY.md` is intentionally noisy: pytest-benchmark tail times vary run-to-run on the same machine. It is useful for spotting consistent tail regressions, but expect ±5–10% noise on individual rows. The `Δ median` column is the trustworthy signal.
- The last state is `ca5d83c` (current main HEAD). Once the #68 fix lands on main, this branch can be rebased and a follow-up PR can append the post-fix state.

Test plan
- Read `tests/benchmark/README.md` and verify the documented workflow makes sense from a clean main checkout.
- Read `tests/benchmark/results/HISTORY.md`: verify the table shape (Commit | Date | Median | Δ median | Max | Δ max) and that the ~+15% regression on `test_complex_policy` at `500a8d0` / `ca5d83c` is visible.
- Spot-check a per-commit summary JSON (e.g. `tests/benchmark/results/history/v4.8.0.json`): schema, `source_runs` listing, `n_runs` count.
- Run `python3 tests/benchmark/aggregate.py --phase b` and confirm it regenerates `HISTORY.md` from the committed per-commit JSONs (no capture needed; ~1 second).
- Optionally run `make benchmark-history` end-to-end. It performs the release-mode 5×5 capture, regenerates `HISTORY.md`, and restores the starting branch on exit. It should produce identical statistical results modulo run-to-run variance.