
ci: expand benchmark reporting #823

Merged
mldangelo-oai merged 5 commits into main from mdangelo/codex/improve-benchmark-ci on Mar 31, 2026

Conversation

@mldangelo-oai
Contributor

@mldangelo-oai mldangelo-oai commented Mar 31, 2026

Summary

  • make the benchmark workflow report-only by default so regressions and missing benchmarks are surfaced in the PR comment without blocking the PR
  • expand the benchmark summary with aggregate medians, top regressions and improvements, input target/size/file context, a workflow run link, and richer benchmark metadata for directory inputs
  • sync the branch with main and address follow-up review feedback by moving runner temp path resolution into a setup step, asserting the actual Markdown table order in tests, and stabilizing the merged cache micro-benchmark threshold under xdist
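The opt-in failure semantics described above can be sketched as follows. The flag names (`--fail-on-regression`, `--fail-on-missing`) come from this PR, but the parser wiring and exit-code logic below are illustrative assumptions, not the script's actual implementation:

```python
# Sketch of the report-only default with opt-in failure flags.
# Flag names are from the PR; the exit-code logic is an assumption.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Benchmark comparison (sketch)")
    parser.add_argument("--fail-on-regression", action="store_true",
                        help="Exit non-zero when a benchmark regresses past the threshold")
    parser.add_argument("--fail-on-missing", action="store_true",
                        help="Exit non-zero when a baseline benchmark is missing")
    return parser


def exit_code(args: argparse.Namespace, regressions: int, missing: int) -> int:
    # Report-only by default: findings appear in the summary but do not block the PR.
    if args.fail_on_regression and regressions > 0:
        return 1
    if args.fail_on_missing and missing > 0:
        return 1
    return 0
```

With no flags passed, `exit_code` returns 0 even when regressions are present, matching the non-blocking default described in the summary.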

Validation

  • uv run ruff format --check modelaudit/ tests/ scripts/benchmark_report.py
  • uv run ruff check modelaudit/ tests/ scripts/benchmark_report.py
  • uv run mypy modelaudit/ tests/
  • uv run pytest tests/test_benchmark_report.py -q
  • uv run --with pytest-benchmark pytest tests/benchmarks/test_scan_benchmarks.py -q --benchmark-json=/tmp/modelaudit-bench.json
  • uv run pytest -n auto -m "not slow and not integration" --maxfail=1

Summary by CodeRabbit

  • New Features

    • New CLI flags to optionally fail on regressions and/or on missing baseline benchmarks.
  • Improvements

    • Reports include aggregate medians, counts of new/missing benchmarks, richer size/files context, and separate Top regressions/improvements sections.
    • CI now uploads both benchmark JSON and Markdown summary artifacts; reporting is produced but does not block PRs by default.
  • Tests

    • Added tests for regression/missing-baseline reporting and optional failure modes; updated fixtures and helpers to include per-benchmark metadata.
  • Documentation

    • Updated workflow docs to reflect artifact and reporting behavior.

@coderabbitai
Contributor

coderabbitai bot commented Mar 31, 2026

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

Updated the perf GitHub Actions workflow to use a configurable artifact directory and detached git worktree for baseline checkout; benchmark outputs (JSON + Markdown) are written to and uploaded from that artifact dir. scripts/benchmark_report.py gained richer comparison output and flags --fail-on-regression / --fail-on-missing; tests updated accordingly.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Workflow Documentation**<br>`.github/workflows/README.md` | Documented that the workflow uploads both benchmark JSON and Markdown artifacts, reports regressions/missing benchmarks, and that such reports are non-blocking by default. |
| **Benchmark Workflow**<br>`.github/workflows/perf.yml` | Narrowed trigger path filters to benchmark-related files; added BENCHMARK_ARTIFACT_DIR and BENCHMARK_BASE_WORKTREE envs; prepare/cleanup steps for the artifact dir and detached base worktree; use `git worktree add --detach` for the baseline; write current/base JSON and Markdown under the artifact dir; assemble the comment body from artifact files and append it to the step summary; upload the entire artifact directory; removed prior exit-code propagation and failure step; always-run cleanup removes the detached worktree. |
| **Benchmark Reporting Script**<br>`scripts/benchmark_report.py` | Added an immutable `ComparisonRow` and helper formatters; include Target/Size/Files in tables; compute aggregate shared medians; split top regressions and improvements; added `--fail-on-regression` and `--fail-on-missing` flags and corresponding `_build_summary` params to control the returned failure; updated threshold semantics. |
| **Benchmark Tests & Scanning**<br>`tests/test_benchmark_report.py`, `tests/benchmarks/test_scan_benchmarks.py`, `tests/test_cache_optimizations.py` | Extended the test helper to accept per-benchmark `extra_info` (path/size/files) and updated fixtures/assertions for the new table schema and deterministic ordering; renamed the regression test to reflect the non-failing default; added tests for failing and non-failing `--fail-on-regression` and `--fail-on-missing`; scan benchmarks use a `_benchmark_context` helper to populate `extra_info`; adjusted performance tolerance under pytest-xdist. |

Sequence Diagram(s)

sequenceDiagram
    participant GH as GitHub Actions
    participant Runner as Runner
    participant Git as Git (worktree)
    participant Script as benchmark_report.py
    participant Art as Artifact Store
    participant PR as PR Comment Action

    GH->>Runner: trigger perf workflow (bench-related push/PR)
    Runner->>Runner: create BENCHMARK_ARTIFACT_DIR & BENCHMARK_BASE_WORKTREE
    Runner->>Git: git worktree add --detach -> BENCHMARK_BASE_WORKTREE
    Runner->>Runner: run baseline tests in base worktree -> write benchmark-base.json to artifact dir
    Runner->>Runner: run current benchmarks -> write benchmark-head.json and markdowns to artifact dir
    Runner->>Script: run benchmark_report.py (compare, flags)
    Script-->>Runner: emit comparison markdown(s) and exit status
    Runner->>Art: upload BENCHMARK_ARTIFACT_DIR
    alt run is PR
      Runner->>PR: post/refresh sticky benchmark comment from artifact markdown
    else non-PR run
      Runner->>Runner: write benchmark-summary.md and append to $GITHUB_STEP_SUMMARY
    end
    Runner->>Git: cleanup detached worktree (always)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I hop through worktrees, neat and spry,
Writing JSON and Markdown side by side.
Flags to fail or simply to show,
Artifacts gathered, set to go.
A rabbit-approved CI sigh! 🐇📈

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 8.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'ci: expand benchmark reporting' is a clear, concise description that directly reflects the main objective of the pull request: expanding and improving the benchmark reporting functionality in the CI workflow. |



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/perf.yml:
- Around line 43-45: Job-level env currently uses runner.temp expressions which
are invalid there; move the BENCHMARK_ARTIFACT_DIR and BENCHMARK_BASE_WORKTREE
assignments into the first step of the job (e.g., at the top step of the
benchmarks job) and export them to subsequent steps with echo
"BENCHMARK_ARTIFACT_DIR=$RUNNER_TEMP/modelaudit-benchmarks" >> $GITHUB_ENV and
echo "BENCHMARK_BASE_WORKTREE=$RUNNER_TEMP/modelaudit-base" >> $GITHUB_ENV so
all later steps that reference BENCHMARK_ARTIFACT_DIR and
BENCHMARK_BASE_WORKTREE get the resolved values; remove the runner.temp uses
from the job-level env block.

In `@tests/test_benchmark_report.py`:
- Around line 81-83: The current assertions use summary_text.index(...) which
can be satisfied by mentions outside the results table; instead extract the
benchmark table rows from summary_text (e.g., parse the "Slowest
median"/"Fastest median" table section or use the same row-extraction helper
used elsewhere in this test file) and assert that the row list contains
"test_scan_pytorch_zip" and "test_scan_safe_pickle" and that the index of
"test_scan_pytorch_zip" in that extracted rows list is less than the index of
"test_scan_safe_pickle"; update the assertions around summary_text,
summary_file.read_text, and the two benchmark names to validate order against
the parsed table rows.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3ffd15f9-d30a-47cc-9745-c59e76575128

📥 Commits

Reviewing files that changed from the base of the PR and between 2179de8 and 8b5ab56.

📒 Files selected for processing (4)
  • .github/workflows/README.md
  • .github/workflows/perf.yml
  • scripts/benchmark_report.py
  • tests/test_benchmark_report.py

@mldangelo-oai mldangelo-oai changed the title from "ci: tighten benchmark workflow gating" to "ci: expand benchmark reporting" on Mar 31, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
tests/test_benchmark_report.py (1)

101-106: ⚠️ Potential issue | 🟡 Minor

Assert ordering from the table rows, not the whole markdown blob.

Both benchmark names also appear in the Slowest benchmarks: bullets, so summary_text.index(...) can still pass even if the table rows regress. Parse the benchmark rows and assert their order directly.

More targeted assertion
     summary_text = summary_file.read_text(encoding="utf-8")
     assert "| Benchmark | Target | Size | Files | Median | Mean | Rounds |" in summary_text
     assert "safe_model.pkl" in summary_text
     assert "2.0 KiB" in summary_text
-    assert "test_scan_safe_pickle" in summary_text
-    assert summary_text.index("test_scan_pytorch_zip") < summary_text.index("test_scan_safe_pickle")
+    table_rows = [
+        line for line in summary_text.splitlines() if line.startswith("| `tests/benchmarks/test_scan_benchmarks.py::")
+    ]
+    assert len(table_rows) == 2
+    assert table_rows[0].startswith(
+        "| `tests/benchmarks/test_scan_benchmarks.py::test_scan_pytorch_zip` |"
+    )
+    assert table_rows[1].startswith(
+        "| `tests/benchmarks/test_scan_benchmarks.py::test_scan_safe_pickle` |"
+    )

As per coding guidelines, "Assert the actual signal in tests — never assert result is not None alone; verify specific check names, issue messages, or detail fields."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_benchmark_report.py` around lines 101 - 106, The current test uses
summary_text.index(...) which can be fooled by non-table occurrences; instead
parse the Markdown table rows from summary_text (e.g., split by newline and
locate the table header line `"| Benchmark | Target | Size | Files | Median |
Mean | Rounds |"`, then collect subsequent lines that start with `"|"` as table
rows) and assert the order of benchmark names within those rows (check that the
row containing "test_scan_pytorch_zip" appears before the row containing
"test_scan_safe_pickle"). Use the existing variables summary_text/summary_file
and replace the global index assertion with this table-row-based order check.
.github/workflows/perf.yml (1)

44-45: ⚠️ Potential issue | 🔴 Critical

runner.temp is invalid at job scope.

GitHub Actions does not allow the runner context in jobs.<job_id>.env, and actionlint is already failing on Lines 44-45. That means BENCHMARK_ARTIFACT_DIR and BENCHMARK_BASE_WORKTREE never resolve before the later steps consume them. Move these assignments into an early run step and export them via $GITHUB_ENV.

Minimal fix
     env:
-      BENCHMARK_ARTIFACT_DIR: ${{ runner.temp }}/modelaudit-benchmarks
-      BENCHMARK_BASE_WORKTREE: ${{ runner.temp }}/modelaudit-base
       BENCHMARK_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
@@
       - name: Prepare benchmark temp directories
         run: |
-          rm -rf "$BENCHMARK_ARTIFACT_DIR" "$BENCHMARK_BASE_WORKTREE"
-          mkdir -p "$BENCHMARK_ARTIFACT_DIR"
+          echo "BENCHMARK_ARTIFACT_DIR=$RUNNER_TEMP/modelaudit-benchmarks" >> "$GITHUB_ENV"
+          echo "BENCHMARK_BASE_WORKTREE=$RUNNER_TEMP/modelaudit-base" >> "$GITHUB_ENV"
+          rm -rf "$RUNNER_TEMP/modelaudit-benchmarks" "$RUNNER_TEMP/modelaudit-base"
+          mkdir -p "$RUNNER_TEMP/modelaudit-benchmarks"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/perf.yml around lines 44 - 45, The env assignments using
runner.temp (BENCHMARK_ARTIFACT_DIR and BENCHMARK_BASE_WORKTREE) are invalid at
job scope; remove them from jobs.<job_id>.env and instead create an early step
(e.g., a setup step) that computes those paths from runner.temp and writes them
to the job environment via $GITHUB_ENV (export lines for BENCHMARK_ARTIFACT_DIR
and BENCHMARK_BASE_WORKTREE), so subsequent steps can consume them; reference
the variable names BENCHMARK_ARTIFACT_DIR and BENCHMARK_BASE_WORKTREE and the
runner.temp context when locating where to move the logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/perf.yml:
- Around line 100-104: The workflow invocation of benchmark_report.py should
pass the --fail-on-missing flag and capture its exit code so missing baseline
benchmarks cause the job to fail only after the markdown summary is appended;
update the uv run command that calls benchmark_report.py to include
--fail-on-missing and store the script's exit status, then continue to append
"$BENCHMARK_ARTIFACT_DIR/benchmark-compare.md" to the PR summary and finally
exit non-zero if the captured exit code indicates missing benchmarks. Target the
invocation that builds the compare (the uv run ... benchmark_report.py call and
its use of "$BENCHMARK_ARTIFACT_DIR/benchmark-compare.md") and ensure the
script’s exit code is propagated after writing the report.

---

Duplicate comments:
In @.github/workflows/perf.yml:
- Around line 44-45: The env assignments using runner.temp
(BENCHMARK_ARTIFACT_DIR and BENCHMARK_BASE_WORKTREE) are invalid at job scope;
remove them from jobs.<job_id>.env and instead create an early step (e.g., a
setup step) that computes those paths from runner.temp and writes them to the
job environment via $GITHUB_ENV (export lines for BENCHMARK_ARTIFACT_DIR and
BENCHMARK_BASE_WORKTREE), so subsequent steps can consume them; reference the
variable names BENCHMARK_ARTIFACT_DIR and BENCHMARK_BASE_WORKTREE and the
runner.temp context when locating where to move the logic.

In `@tests/test_benchmark_report.py`:
- Around line 101-106: The current test uses summary_text.index(...) which can
be fooled by non-table occurrences; instead parse the Markdown table rows from
summary_text (e.g., split by newline and locate the table header line `"|
Benchmark | Target | Size | Files | Median | Mean | Rounds |"`, then collect
subsequent lines that start with `"|"` as table rows) and assert the order of
benchmark names within those rows (check that the row containing
"test_scan_pytorch_zip" appears before the row containing
"test_scan_safe_pickle"). Use the existing variables summary_text/summary_file
and replace the global index assertion with this table-row-based order check.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: a5e237e5-27dd-42cd-9809-1d4427b5a2e8

📥 Commits

Reviewing files that changed from the base of the PR and between 8b5ab56 and c5a3064.

📒 Files selected for processing (5)
  • .github/workflows/README.md
  • .github/workflows/perf.yml
  • scripts/benchmark_report.py
  • tests/benchmarks/test_scan_benchmarks.py
  • tests/test_benchmark_report.py

Comment on lines 100 to +104 (.github/workflows/perf.yml):

   uv run --locked python scripts/benchmark_report.py \
-    --current /tmp/benchmark-head.json \
-    --baseline /tmp/benchmark-base.json \
+    --current "$BENCHMARK_ARTIFACT_DIR/benchmark-head.json" \
+    --baseline "$BENCHMARK_ARTIFACT_DIR/benchmark-base.json" \
     --threshold 0.15 \
-    --summary-file /tmp/benchmark-comment.md
-  compare_exit=$?
-  set -e
+    --summary-file "$BENCHMARK_ARTIFACT_DIR/benchmark-compare.md"

⚠️ Potential issue | 🟠 Major

Wire --fail-on-missing into the PR comparison.

benchmark_report.py only returns a non-zero exit code for dropped baseline benchmarks when --fail-on-missing is passed. This invocation never sets it, so the compare step stays green on exactly the case the PR is supposed to block. Capture the exit code and fail after the markdown has been appended so the PR comment and step summary still include the report.

One way to restore the missing-benchmark gate without losing the summary output
       - name: Compare against base
         id: compare
         if: github.event_name == 'pull_request'
         run: |
+          compare_status=0
           {
             echo "[Workflow run and artifacts]($BENCHMARK_RUN_URL)"
             echo
           } > "$BENCHMARK_ARTIFACT_DIR/benchmark-comment.md"
           if [ -f "$BENCHMARK_ARTIFACT_DIR/benchmark-base.json" ]; then
             uv run --locked python scripts/benchmark_report.py \
               --current "$BENCHMARK_ARTIFACT_DIR/benchmark-head.json" \
               --baseline "$BENCHMARK_ARTIFACT_DIR/benchmark-base.json" \
               --threshold 0.15 \
-              --summary-file "$BENCHMARK_ARTIFACT_DIR/benchmark-compare.md"
+              --summary-file "$BENCHMARK_ARTIFACT_DIR/benchmark-compare.md" \
+              --fail-on-missing || compare_status=$?
           else
             {
               echo "Base branch does not include the benchmark suite yet; showing current results only."
               echo
@@
           if [ -f "$BENCHMARK_ARTIFACT_DIR/benchmark-compare.md" ]; then
             cat "$BENCHMARK_ARTIFACT_DIR/benchmark-compare.md" >> "$BENCHMARK_ARTIFACT_DIR/benchmark-comment.md"
           fi
           cat "$BENCHMARK_ARTIFACT_DIR/benchmark-comment.md" >> "$GITHUB_STEP_SUMMARY"
+          exit "$compare_status"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/perf.yml around lines 100 - 104, The workflow invocation
of benchmark_report.py should pass the --fail-on-missing flag and capture its
exit code so missing baseline benchmarks cause the job to fail only after the
markdown summary is appended; update the uv run command that calls
benchmark_report.py to include --fail-on-missing and store the script's exit
status, then continue to append "$BENCHMARK_ARTIFACT_DIR/benchmark-compare.md"
to the PR summary and finally exit non-zero if the captured exit code indicates
missing benchmarks. Target the invocation that builds the compare (the uv run
... benchmark_report.py call and its use of
"$BENCHMARK_ARTIFACT_DIR/benchmark-compare.md") and ensure the script’s exit
code is propagated after writing the report.

@github-actions
Contributor

Workflow run and artifacts

Performance Benchmarks

Compared 6 shared benchmarks with a regression threshold of 15%.
Status: 0 regressions, 0 improved, 6 stable, 0 new, 0 missing.
Aggregate shared-benchmark median: 680.40ms -> 684.58ms (+0.6%).

| Benchmark | Target | Size | Files | Baseline | Current | Change | Status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| tests/benchmarks/test_scan_benchmarks.py::test_detect_file_format_safe_pickle | safe_model.pkl | 49.4 KiB | 1 | 119.3us | 117.0us | -1.9% | stable |
| tests/benchmarks/test_scan_benchmarks.py::test_validate_file_type_pytorch_zip | state_dict.pt | 1.5 MiB | 1 | 33.8us | 33.2us | -1.7% | stable |
| tests/benchmarks/test_scan_benchmarks.py::test_scan_mixed_directory | mixed-corpus | 1.7 MiB | 54 | 300.37ms | 302.89ms | +0.8% | stable |
| tests/benchmarks/test_scan_benchmarks.py::test_scan_pytorch_zip | state_dict.pt | 1.5 MiB | 1 | 261.39ms | 262.79ms | +0.5% | stable |
| tests/benchmarks/test_scan_benchmarks.py::test_scan_safe_pickle | safe_model.pkl | 49.4 KiB | 1 | 5.63ms | 5.60ms | -0.5% | stable |
| tests/benchmarks/test_scan_benchmarks.py::test_scan_duplicate_directory | duplicate-corpus | 840.0 KiB | 81 | 112.87ms | 113.14ms | +0.2% | stable |
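For reference, the aggregate-median delta quoted above follows from simple percent-change arithmetic. The helper below is an illustrative assumption, not code from the script, but it reproduces the +0.6% figure:

```python
# Reproduce the aggregate shared-benchmark median delta from the report above.
# pct_change is an assumed helper, not the script's actual implementation.
def pct_change(baseline_ms: float, current_ms: float) -> float:
    return (current_ms - baseline_ms) / baseline_ms * 100.0


line = f"{680.40:.2f}ms -> {684.58:.2f}ms ({pct_change(680.40, 684.58):+.1f}%)"
print(line)  # 680.40ms -> 684.58ms (+0.6%)
```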

@mldangelo-oai mldangelo-oai merged commit d0d4a2d into main Mar 31, 2026
24 of 25 checks passed
@mldangelo-oai mldangelo-oai deleted the mdangelo/codex/improve-benchmark-ci branch March 31, 2026 05:49
