feat: add benchmark dataset with 13 reference repos #33
Curated set of popular Python repos with expected score ranges for regression testing. Detects calibration drift when analyzers or scoring weights change. Includes repos from 1K to 923K LOC across grades A through D.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5d7e8fda15
    from dataclasses import dataclass
    from pathlib import Path

    _BENCHMARK_PATH = Path(__file__).parent.parent.parent / "benchmarks" / "reference_repos.json"
Avoid repo-root path for benchmark data
_BENCHMARK_PATH assumes a source checkout layout (src/arbiter/... -> ../../../benchmarks), but in an installed package this resolves to something like .../lib/python3.x/benchmarks/reference_repos.json, which does not exist. Because BENCHMARK_REPOS = load_benchmark() runs at import time, importing arbiter.benchmark will raise FileNotFoundError for packaged installs, making the new benchmark API unusable outside this repo checkout.
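To make the failure mode concrete, here is a minimal sketch that replays the same three-parents-up computation on two illustrative module locations, followed by one possible lazy loader. The paths `/work/arbiter/...` and `/venv/...` and the helper names `repo_root_style_path` and `load_benchmark_lazily` are hypothetical and not part of this PR; the real fix may look different.

```python
import json
from pathlib import Path, PurePosixPath
from typing import Any


def repo_root_style_path(module_file: str) -> PurePosixPath:
    """Mimic Path(__file__).parent.parent.parent / "benchmarks" / "reference_repos.json"."""
    module = PurePosixPath(module_file)
    return module.parent.parent.parent / "benchmarks" / "reference_repos.json"


# Source checkout (illustrative path): three parents up is the repo root.
checkout = repo_root_style_path("/work/arbiter/src/arbiter/benchmark.py")

# Installed package (illustrative path): three parents up lands above site-packages.
installed = repo_root_style_path(
    "/venv/lib/python3.12/site-packages/arbiter/benchmark.py"
)

print(checkout)
print(installed)


def load_benchmark_lazily(data_path: Path) -> Any:
    """Hypothetical loader: reads the JSON only when called, never at import time."""
    if not data_path.is_file():
        raise FileNotFoundError(f"benchmark data not found: {data_path}")
    return json.loads(data_path.read_text(encoding="utf-8"))
```

Deferring the read to call time keeps `import arbiter.benchmark` from raising even when the data file is absent. If the JSON were shipped inside the package as package data (an assumption; the PR keeps it under the repo-level benchmarks directory), `importlib.resources.files()` would be another way to locate it independent of the checkout layout.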
Summary
- benchmarks/reference_repos.json with 13 repos spanning grades A-D
- benchmark.py module with BenchmarkEntry dataclass and load_benchmark()
Purpose
Regression testing for Arbiter's scoring engine. If analyzer changes or weight adjustments cause a benchmark repo to fall outside its expected range, tests catch it.
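A range check of this kind could be sketched as follows. The `ExpectedRange` fields (`name`, `min_score`, `max_score`), the `find_drift` helper, and the sample values are all hypothetical; the actual `BenchmarkEntry` schema and Arbiter's scoring API are not shown in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedRange:
    # Hypothetical stand-in for BenchmarkEntry; the real fields may differ.
    name: str
    min_score: float
    max_score: float


def find_drift(expected, actual_scores):
    """Return (name, score) pairs whose score falls outside the expected range."""
    drifted = []
    for entry in expected:
        score = actual_scores[entry.name]
        if not (entry.min_score <= score <= entry.max_score):
            drifted.append((entry.name, score))
    return drifted


# Made-up data for illustration only; not taken from reference_repos.json.
expected = [
    ExpectedRange("repo-a", 85.0, 95.0),
    ExpectedRange("repo-b", 40.0, 55.0),
]
scores = {"repo-a": 90.0, "repo-b": 61.5}

drift = find_drift(expected, scores)
```

A regression test would then assert that `find_drift(...)` returns an empty list, so any analyzer or weight change that pushes a repo out of range fails with the offending names and scores.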
Test plan
🤖 Generated with Claude Code