
ci: expand benchmark reporting #823

Merged
mldangelo-oai merged 5 commits into main from mdangelo/codex/improve-benchmark-ci on Mar 31, 2026

Conversation

@mldangelo-oai
Contributor

@mldangelo-oai mldangelo-oai commented Mar 31, 2026

Summary

  • make the benchmark workflow report-only by default so regressions and missing benchmarks are surfaced in the PR comment without blocking the PR
  • expand the benchmark summary with aggregate medians, top regressions and improvements, input target/size/file context, a workflow run link, and richer benchmark metadata for directory inputs
  • sync the branch with main and address follow-up review feedback by moving runner temp path resolution into a setup step, asserting the actual Markdown table order in tests, and stabilizing the merged cache micro-benchmark threshold under xdist
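The opt-in failure semantics described above can be sketched as follows. The flag names (`--fail-on-regression`, `--fail-on-missing`) come from this PR, but the parser wiring and exit-code logic below are illustrative assumptions, not the script's actual implementation:

```python
# Sketch of the report-only default with opt-in failure flags.
# Flag names are from the PR; the exit-code logic is an assumption.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Benchmark comparison (sketch)")
    parser.add_argument("--fail-on-regression", action="store_true",
                        help="Exit non-zero when a benchmark regresses past the threshold")
    parser.add_argument("--fail-on-missing", action="store_true",
                        help="Exit non-zero when a baseline benchmark is missing")
    return parser


def exit_code(args: argparse.Namespace, regressions: int, missing: int) -> int:
    # Report-only by default: findings appear in the summary but do not block the PR.
    if args.fail_on_regression and regressions > 0:
        return 1
    if args.fail_on_missing and missing > 0:
        return 1
    return 0
```

With no flags passed, `exit_code` returns 0 even when regressions are present, matching the non-blocking default described in the summary.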

Validation

  • uv run ruff format --check modelaudit/ tests/ scripts/benchmark_report.py
  • uv run ruff check modelaudit/ tests/ scripts/benchmark_report.py
  • uv run mypy modelaudit/ tests/
  • uv run pytest tests/test_benchmark_report.py -q
  • uv run --with pytest-benchmark pytest tests/benchmarks/test_scan_benchmarks.py -q --benchmark-json=/tmp/modelaudit-bench.json
  • uv run pytest -n auto -m "not slow and not integration" --maxfail=1

Summary by CodeRabbit

  • New Features

    • New CLI flags to optionally fail on regressions and/or on missing baseline benchmarks.
  • Improvements

    • Reports include aggregate medians, counts of new/missing benchmarks, richer size/files context, and separate Top regressions/improvements sections.
    • CI now uploads both benchmark JSON and Markdown summary artifacts; reporting is produced but does not block PRs by default.
  • Tests

    • Added tests for regression/missing-baseline reporting and optional failure modes; updated fixtures and helpers to include per-benchmark metadata.
  • Documentation

    • Updated workflow docs to reflect artifact and reporting behavior.

@coderabbitai
Contributor

coderabbitai bot commented Mar 31, 2026

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

Updated the perf GitHub Actions workflow to use a configurable artifact directory and detached git worktree for baseline checkout; benchmark outputs (JSON + Markdown) are written to and uploaded from that artifact dir. scripts/benchmark_report.py gained richer comparison output and flags --fail-on-regression / --fail-on-missing; tests updated accordingly.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Workflow Documentation**<br>`.github/workflows/README.md` | Documented that the workflow uploads both benchmark JSON and Markdown artifacts, reports regressions/missing benchmarks, and that such reports are non-blocking by default. |
| **Benchmark Workflow**<br>`.github/workflows/perf.yml` | Narrowed trigger path filters to benchmark-related files; added BENCHMARK_ARTIFACT_DIR and BENCHMARK_BASE_WORKTREE envs; prepare/cleanup steps for the artifact dir and detached base worktree; use `git worktree add --detach` for the baseline; write current/base JSON and Markdown under the artifact dir; assemble the comment body from artifact files and append it to the step summary; upload the entire artifact directory; removed prior exit-code propagation and failure step; always-run cleanup removes the detached worktree. |
| **Benchmark Reporting Script**<br>`scripts/benchmark_report.py` | Added an immutable `ComparisonRow` and helper formatters; include Target/Size/Files in tables; compute aggregate shared medians; split top regressions and improvements; added `--fail-on-regression` and `--fail-on-missing` flags and corresponding `_build_summary` params to control the returned failure; updated threshold semantics. |
| **Benchmark Tests & Scanning**<br>`tests/test_benchmark_report.py`, `tests/benchmarks/test_scan_benchmarks.py`, `tests/test_cache_optimizations.py` | Extended the test helper to accept per-benchmark `extra_info` (path/size/files) and updated fixtures/assertions for the new table schema and deterministic ordering; renamed the regression test to reflect the non-failing default; added tests for failing and non-failing `--fail-on-regression` and `--fail-on-missing`; scan benchmarks use a `_benchmark_context` helper to populate `extra_info`; adjusted performance tolerance under pytest-xdist. |

Sequence Diagram(s)

sequenceDiagram
    participant GH as GitHub Actions
    participant Runner as Runner
    participant Git as Git (worktree)
    participant Script as benchmark_report.py
    participant Art as Artifact Store
    participant PR as PR Comment Action

    GH->>Runner: trigger perf workflow (bench-related push/PR)
    Runner->>Runner: create BENCHMARK_ARTIFACT_DIR & BENCHMARK_BASE_WORKTREE
    Runner->>Git: git worktree add --detach -> BENCHMARK_BASE_WORKTREE
    Runner->>Runner: run baseline tests in base worktree -> write benchmark-base.json to artifact dir
    Runner->>Runner: run current benchmarks -> write benchmark-head.json and markdowns to artifact dir
    Runner->>Script: run benchmark_report.py (compare, flags)
    Script-->>Runner: emit comparison markdown(s) and exit status
    Runner->>Art: upload BENCHMARK_ARTIFACT_DIR
    alt run is PR
      Runner->>PR: post/refresh sticky benchmark comment from artifact markdown
    else non-PR run
      Runner->>Runner: write benchmark-summary.md and append to $GITHUB_STEP_SUMMARY
    end
    Runner->>Git: cleanup detached worktree (always)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I hop through worktrees, neat and spry,
Writing JSON and Markdown side by side.
Flags to fail or simply to show,
Artifacts gathered, set to go.
A rabbit-approved CI sigh! 🐇📈

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 8.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'ci: expand benchmark reporting' is a clear, concise description that directly reflects the main objective of the pull request: expanding and improving the benchmark reporting functionality in the CI workflow. |



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/perf.yml:
- Around line 43-45: Job-level env currently uses runner.temp expressions which
are invalid there; move the BENCHMARK_ARTIFACT_DIR and BENCHMARK_BASE_WORKTREE
assignments into the first step of the job (e.g., at the top step of the
benchmarks job) and export them to subsequent steps with echo
"BENCHMARK_ARTIFACT_DIR=$RUNNER_TEMP/modelaudit-benchmarks" >> $GITHUB_ENV and
echo "BENCHMARK_BASE_WORKTREE=$RUNNER_TEMP/modelaudit-base" >> $GITHUB_ENV so
all later steps that reference BENCHMARK_ARTIFACT_DIR and
BENCHMARK_BASE_WORKTREE get the resolved values; remove the runner.temp uses
from the job-level env block.

In `@tests/test_benchmark_report.py`:
- Around line 81-83: The current assertions use summary_text.index(...) which
can be satisfied by mentions outside the results table; instead extract the
benchmark table rows from summary_text (e.g., parse the "Slowest
median"/"Fastest median" table section or use the same row-extraction helper
used elsewhere in this test file) and assert that the row list contains
"test_scan_pytorch_zip" and "test_scan_safe_pickle" and that the index of
"test_scan_pytorch_zip" in that extracted rows list is less than the index of
"test_scan_safe_pickle"; update the assertions around summary_text,
summary_file.read_text, and the two benchmark names to validate order against
the parsed table rows.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3ffd15f9-d30a-47cc-9745-c59e76575128

📥 Commits

Reviewing files that changed from the base of the PR and between 2179de8 and 8b5ab56.

📒 Files selected for processing (4)
  • .github/workflows/README.md
  • .github/workflows/perf.yml
  • scripts/benchmark_report.py
  • tests/test_benchmark_report.py

@mldangelo-oai mldangelo-oai changed the title from "ci: tighten benchmark workflow gating" to "ci: expand benchmark reporting" on Mar 31, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
tests/test_benchmark_report.py (1)

101-106: ⚠️ Potential issue | 🟡 Minor

Assert ordering from the table rows, not the whole markdown blob.

Both benchmark names also appear in the Slowest benchmarks: bullets, so summary_text.index(...) can still pass even if the table rows regress. Parse the benchmark rows and assert their order directly.

More targeted assertion
     summary_text = summary_file.read_text(encoding="utf-8")
     assert "| Benchmark | Target | Size | Files | Median | Mean | Rounds |" in summary_text
     assert "safe_model.pkl" in summary_text
     assert "2.0 KiB" in summary_text
-    assert "test_scan_safe_pickle" in summary_text
-    assert summary_text.index("test_scan_pytorch_zip") < summary_text.index("test_scan_safe_pickle")
+    table_rows = [
+        line for line in summary_text.splitlines() if line.startswith("| `tests/benchmarks/test_scan_benchmarks.py::")
+    ]
+    assert len(table_rows) == 2
+    assert table_rows[0].startswith(
+        "| `tests/benchmarks/test_scan_benchmarks.py::test_scan_pytorch_zip` |"
+    )
+    assert table_rows[1].startswith(
+        "| `tests/benchmarks/test_scan_benchmarks.py::test_scan_safe_pickle` |"
+    )

As per coding guidelines, "Assert the actual signal in tests — never assert result is not None alone; verify specific check names, issue messages, or detail fields."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_benchmark_report.py` around lines 101 - 106, The current test uses
summary_text.index(...) which can be fooled by non-table occurrences; instead
parse the Markdown table rows from summary_text (e.g., split by newline and
locate the table header line `"| Benchmark | Target | Size | Files | Median |
Mean | Rounds |"`, then collect subsequent lines that start with `"|"` as table
rows) and assert the order of benchmark names within those rows (check that the
row containing "test_scan_pytorch_zip" appears before the row containing
"test_scan_safe_pickle"). Use the existing variables summary_text/summary_file
and replace the global index assertion with this table-row-based order check.
.github/workflows/perf.yml (1)

44-45: ⚠️ Potential issue | 🔴 Critical

runner.temp is invalid at job scope.

GitHub Actions does not allow the runner context in jobs.<job_id>.env, and actionlint is already failing on Lines 44-45. That means BENCHMARK_ARTIFACT_DIR and BENCHMARK_BASE_WORKTREE never resolve before the later steps consume them. Move these assignments into an early run step and export them via $GITHUB_ENV.

Minimal fix
     env:
-      BENCHMARK_ARTIFACT_DIR: ${{ runner.temp }}/modelaudit-benchmarks
-      BENCHMARK_BASE_WORKTREE: ${{ runner.temp }}/modelaudit-base
       BENCHMARK_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
@@
       - name: Prepare benchmark temp directories
         run: |
-          rm -rf "$BENCHMARK_ARTIFACT_DIR" "$BENCHMARK_BASE_WORKTREE"
-          mkdir -p "$BENCHMARK_ARTIFACT_DIR"
+          echo "BENCHMARK_ARTIFACT_DIR=$RUNNER_TEMP/modelaudit-benchmarks" >> "$GITHUB_ENV"
+          echo "BENCHMARK_BASE_WORKTREE=$RUNNER_TEMP/modelaudit-base" >> "$GITHUB_ENV"
+          rm -rf "$RUNNER_TEMP/modelaudit-benchmarks" "$RUNNER_TEMP/modelaudit-base"
+          mkdir -p "$RUNNER_TEMP/modelaudit-benchmarks"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/perf.yml around lines 44 - 45, The env assignments using
runner.temp (BENCHMARK_ARTIFACT_DIR and BENCHMARK_BASE_WORKTREE) are invalid at
job scope; remove them from jobs.<job_id>.env and instead create an early step
(e.g., a setup step) that computes those paths from runner.temp and writes them
to the job environment via $GITHUB_ENV (export lines for BENCHMARK_ARTIFACT_DIR
and BENCHMARK_BASE_WORKTREE), so subsequent steps can consume them; reference
the variable names BENCHMARK_ARTIFACT_DIR and BENCHMARK_BASE_WORKTREE and the
runner.temp context when locating where to move the logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/perf.yml:
- Around line 100-104: The workflow invocation of benchmark_report.py should
pass the --fail-on-missing flag and capture its exit code so missing baseline
benchmarks cause the job to fail only after the markdown summary is appended;
update the uv run command that calls benchmark_report.py to include
--fail-on-missing and store the script's exit status, then continue to append
"$BENCHMARK_ARTIFACT_DIR/benchmark-compare.md" to the PR summary and finally
exit non-zero if the captured exit code indicates missing benchmarks. Target the
invocation that builds the compare (the uv run ... benchmark_report.py call and
its use of "$BENCHMARK_ARTIFACT_DIR/benchmark-compare.md") and ensure the
script’s exit code is propagated after writing the report.

---

Duplicate comments:
In @.github/workflows/perf.yml:
- Around line 44-45: The env assignments using runner.temp
(BENCHMARK_ARTIFACT_DIR and BENCHMARK_BASE_WORKTREE) are invalid at job scope;
remove them from jobs.<job_id>.env and instead create an early step (e.g., a
setup step) that computes those paths from runner.temp and writes them to the
job environment via $GITHUB_ENV (export lines for BENCHMARK_ARTIFACT_DIR and
BENCHMARK_BASE_WORKTREE), so subsequent steps can consume them; reference the
variable names BENCHMARK_ARTIFACT_DIR and BENCHMARK_BASE_WORKTREE and the
runner.temp context when locating where to move the logic.

In `@tests/test_benchmark_report.py`:
- Around line 101-106: The current test uses summary_text.index(...) which can
be fooled by non-table occurrences; instead parse the Markdown table rows from
summary_text (e.g., split by newline and locate the table header line `"|
Benchmark | Target | Size | Files | Median | Mean | Rounds |"`, then collect
subsequent lines that start with `"|"` as table rows) and assert the order of
benchmark names within those rows (check that the row containing
"test_scan_pytorch_zip" appears before the row containing
"test_scan_safe_pickle"). Use the existing variables summary_text/summary_file
and replace the global index assertion with this table-row-based order check.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: a5e237e5-27dd-42cd-9809-1d4427b5a2e8

📥 Commits

Reviewing files that changed from the base of the PR and between 8b5ab56 and c5a3064.

📒 Files selected for processing (5)
  • .github/workflows/README.md
  • .github/workflows/perf.yml
  • scripts/benchmark_report.py
  • tests/benchmarks/test_scan_benchmarks.py
  • tests/test_benchmark_report.py

Comment on lines 100 to +104 (.github/workflows/perf.yml):

   uv run --locked python scripts/benchmark_report.py \
-    --current /tmp/benchmark-head.json \
-    --baseline /tmp/benchmark-base.json \
+    --current "$BENCHMARK_ARTIFACT_DIR/benchmark-head.json" \
+    --baseline "$BENCHMARK_ARTIFACT_DIR/benchmark-base.json" \
     --threshold 0.15 \
-    --summary-file /tmp/benchmark-comment.md
-  compare_exit=$?
-  set -e
+    --summary-file "$BENCHMARK_ARTIFACT_DIR/benchmark-compare.md"

⚠️ Potential issue | 🟠 Major

Wire --fail-on-missing into the PR comparison.

benchmark_report.py only returns a non-zero exit code for dropped baseline benchmarks when --fail-on-missing is passed. This invocation never sets it, so the compare step stays green on exactly the case the PR is supposed to block. Capture the exit code and fail after the markdown has been appended so the PR comment and step summary still include the report.

One way to restore the missing-benchmark gate without losing the summary output
       - name: Compare against base
         id: compare
         if: github.event_name == 'pull_request'
         run: |
+          compare_status=0
           {
             echo "[Workflow run and artifacts]($BENCHMARK_RUN_URL)"
             echo
           } > "$BENCHMARK_ARTIFACT_DIR/benchmark-comment.md"
           if [ -f "$BENCHMARK_ARTIFACT_DIR/benchmark-base.json" ]; then
             uv run --locked python scripts/benchmark_report.py \
               --current "$BENCHMARK_ARTIFACT_DIR/benchmark-head.json" \
               --baseline "$BENCHMARK_ARTIFACT_DIR/benchmark-base.json" \
               --threshold 0.15 \
-              --summary-file "$BENCHMARK_ARTIFACT_DIR/benchmark-compare.md"
+              --summary-file "$BENCHMARK_ARTIFACT_DIR/benchmark-compare.md" \
+              --fail-on-missing || compare_status=$?
           else
             {
               echo "Base branch does not include the benchmark suite yet; showing current results only."
               echo
@@
           if [ -f "$BENCHMARK_ARTIFACT_DIR/benchmark-compare.md" ]; then
             cat "$BENCHMARK_ARTIFACT_DIR/benchmark-compare.md" >> "$BENCHMARK_ARTIFACT_DIR/benchmark-comment.md"
           fi
           cat "$BENCHMARK_ARTIFACT_DIR/benchmark-comment.md" >> "$GITHUB_STEP_SUMMARY"
+          exit "$compare_status"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/perf.yml around lines 100 - 104, The workflow invocation
of benchmark_report.py should pass the --fail-on-missing flag and capture its
exit code so missing baseline benchmarks cause the job to fail only after the
markdown summary is appended; update the uv run command that calls
benchmark_report.py to include --fail-on-missing and store the script's exit
status, then continue to append "$BENCHMARK_ARTIFACT_DIR/benchmark-compare.md"
to the PR summary and finally exit non-zero if the captured exit code indicates
missing benchmarks. Target the invocation that builds the compare (the uv run
... benchmark_report.py call and its use of
"$BENCHMARK_ARTIFACT_DIR/benchmark-compare.md") and ensure the script’s exit
code is propagated after writing the report.

@github-actions
Contributor

Workflow run and artifacts

Performance Benchmarks

Compared 6 shared benchmarks with a regression threshold of 15%.
Status: 0 regressions, 0 improved, 6 stable, 0 new, 0 missing.
Aggregate shared-benchmark median: 680.40ms -> 684.58ms (+0.6%).

| Benchmark | Target | Size | Files | Baseline | Current | Change | Status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| tests/benchmarks/test_scan_benchmarks.py::test_detect_file_format_safe_pickle | safe_model.pkl | 49.4 KiB | 1 | 119.3us | 117.0us | -1.9% | stable |
| tests/benchmarks/test_scan_benchmarks.py::test_validate_file_type_pytorch_zip | state_dict.pt | 1.5 MiB | 1 | 33.8us | 33.2us | -1.7% | stable |
| tests/benchmarks/test_scan_benchmarks.py::test_scan_mixed_directory | mixed-corpus | 1.7 MiB | 54 | 300.37ms | 302.89ms | +0.8% | stable |
| tests/benchmarks/test_scan_benchmarks.py::test_scan_pytorch_zip | state_dict.pt | 1.5 MiB | 1 | 261.39ms | 262.79ms | +0.5% | stable |
| tests/benchmarks/test_scan_benchmarks.py::test_scan_safe_pickle | safe_model.pkl | 49.4 KiB | 1 | 5.63ms | 5.60ms | -0.5% | stable |
| tests/benchmarks/test_scan_benchmarks.py::test_scan_duplicate_directory | duplicate-corpus | 840.0 KiB | 81 | 112.87ms | 113.14ms | +0.2% | stable |
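For reference, the aggregate-median delta quoted above follows from simple percent-change arithmetic. The helper below is an illustrative assumption, not code from the script, but it reproduces the +0.6% figure:

```python
# Reproduce the aggregate shared-benchmark median delta from the report above.
# pct_change is an assumed helper, not the script's actual implementation.
def pct_change(baseline_ms: float, current_ms: float) -> float:
    return (current_ms - baseline_ms) / baseline_ms * 100.0


line = f"{680.40:.2f}ms -> {684.58:.2f}ms ({pct_change(680.40, 684.58):+.1f}%)"
print(line)  # 680.40ms -> 684.58ms (+0.6%)
```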

@mldangelo-oai mldangelo-oai merged commit d0d4a2d into main Mar 31, 2026
24 of 25 checks passed
@mldangelo-oai mldangelo-oai deleted the mdangelo/codex/improve-benchmark-ci branch March 31, 2026 05:49
