Problem
cr review can now run through multiple LLM runtimes and profiles. That makes it possible to compare different model/provider setups, but there is no repeatable way to do that today.
Right now a comparison is mostly manual:
- pick a review target;
- run one profile, then another;
- save the dry-run output somewhere;
- compare findings, runtime, token usage, and cost by hand;
- remember which base/head SHA pair was tested.
That is easy to get wrong and hard to repeat after the CLI, prompts, agent profiles, or model choices change.
We should have a small benchmark harness in this repo that exercises the real cr review command in dry-run mode and writes local artifacts that can be compared across models.
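A minimal sketch of how the harness could shell out to the real command, written in Python purely for illustration (this issue does not prescribe an implementation language). Only `cr review --dry-run --json` comes from this issue. Passing the review URL positionally, the `profile_args` placeholder for profile selection, the timeout default, and all function names are assumptions.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class ReviewResult:
    exit_code: int
    stdout: str  # raw review JSON, kept verbatim as an artifact
    stderr: str


def build_review_argv(cr_bin: str, review_url: str, profile_args: list[str]) -> list[str]:
    """Build the argv for one dry-run review.

    Assumptions: the review URL is passed positionally, and the caller
    supplies whatever arguments select a profile (`profile_args` is a
    hypothetical placeholder; this issue does not name that flag).
    """
    return [cr_bin, "review", "--dry-run", "--json", *profile_args, review_url]


def run_review(argv: list[str], timeout_s: float = 1800.0) -> ReviewResult:
    # Refuse to run anything that is not an explicit dry run, so the
    # harness cannot post comments or mutate GitHub state by accident.
    if "--dry-run" not in argv:
        raise ValueError("benchmark runs must pass --dry-run")
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    return ReviewResult(proc.returncode, proc.stdout, proc.stderr)
```

Keeping the argv builder separate from the runner makes the dry-run invocation arguments easy to assert on in tests without starting a process.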
Scope
Add a cr benchmark command for code review benchmarks:
- Validate a benchmark suite file before running it.
- Preflight a benchmark suite before longer runs, including selected models/cases, configured profiles, agent directories, optional cr binary path, and result directory selection.
- Run one or more benchmark cases against one or more configured cr profiles.
- Invoke the normal review flow through cr review --dry-run --json.
- Never post comments, resolve threads, request reviews, or mutate GitHub state.
- Write raw review output, stderr, normalized summary records, aggregate JSON, a run manifest, and a compact Markdown report to a local results directory.
- Support benchmark cases that pin the review URL plus base/head SHAs so old reviews can be reused as stable evaluation anchors.
- Support expected-result metadata, such as known findings or an expected clean run, so results can be compared against a baseline.
- Include grading and efficiency fields that are useful for downstream human or agent synthesis, including grade status/reason, expected-anchor counts, false-positive counts, duration, token, and cost metrics.
- Keep private benchmark cases and run outputs out of git by default.
This should stay focused on benchmarking codereview-cli behavior. It should not become a separate review daemon, a hosted evaluation service, or a checked-in leaderboard.
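To make the validation and pinning requirements above concrete, here is a rough sketch of suite checks that return readable errors instead of failing on the first problem. It works on an already-parsed suite mapping, so no YAML library is needed. The schema is hypothetical: the field names `cases`, `id`, `url`, `base_sha`, `head_sha`, and `expected` (with `clean` or `findings`) are assumptions, as is the requirement for full 40-character SHA-1 values.

```python
import re

SHA_RE = re.compile(r"^[0-9a-f]{40}$")  # assumption: full-length Git SHA-1


def validate_case(case: dict, index: int) -> list[str]:
    """Return human-readable errors for one case (empty list means valid)."""
    errors = []
    where = f"cases[{index}]"
    for key in ("id", "url", "base_sha", "head_sha"):
        if not isinstance(case.get(key), str) or not case[key]:
            errors.append(f"{where}.{key}: required non-empty string")
    for key in ("base_sha", "head_sha"):
        value = case.get(key)
        if isinstance(value, str) and value and not SHA_RE.match(value):
            errors.append(f"{where}.{key}: expected a 40-character lowercase hex SHA")
    expected = case.get("expected")
    if expected is not None:
        if not isinstance(expected, dict):
            errors.append(f"{where}.expected: must be a mapping")
        elif expected.get("clean") and expected.get("findings"):
            errors.append(f"{where}.expected: 'clean' and 'findings' are mutually exclusive")
    return errors


def validate_suite(suite: dict) -> list[str]:
    """Validate the whole suite and collect every error found."""
    cases = suite.get("cases")
    if not isinstance(cases, list) or not cases:
        return ["cases: required non-empty list"]
    errors = []
    seen_ids = set()
    for i, case in enumerate(cases):
        if not isinstance(case, dict):
            errors.append(f"cases[{i}]: must be a mapping")
            continue
        errors.extend(validate_case(case, i))
        case_id = case.get("id")
        if isinstance(case_id, str):
            if case_id in seen_ids:
                errors.append(f"cases[{i}].id: duplicate id {case_id!r}")
            seen_ids.add(case_id)
    return errors
```

Collecting all errors with their path (for example `cases[0].head_sha`) is one way to satisfy the "useful errors" requirement, since a suite author can fix everything in one pass.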
Acceptance Criteria
- cr benchmark validate <suite.yml> loads and validates a benchmark suite with useful errors.
- cr benchmark doctor <suite.yml> checks suite readiness without invoking model providers or running reviews.
- cr benchmark doctor supports the same selection and environment flags as runs where they are relevant: --model, --case, --results-dir, --cr-bin, and --json.
- cr benchmark run <suite.yml> runs selected cases/models through cr review --dry-run --json.
- --model, --case, --results-dir, and --cr-bin allow targeted local runs.
- Benchmark output includes enough data to compare runs: case id, model/profile, review URL, exit status, finding count, duration, token/cost metrics when available, and artifact paths.
- Run artifacts include raw review JSON, stderr, optional metrics JSON, summary.jsonl, suite-summary.json, manifest.json, and report.md.
- manifest.json records enough provenance for later comparison: tool name/version, suite path/hash, selected models/cases, run timings, and artifact paths.
- Reports and summaries include expected-baseline fields and conservative grading fields for clean cases, expected anchors, false positives, duration, token, and cost efficiency.
- Result directories and files use private permissions by default.
- Local benchmark cases and results are ignored by git, while public examples and schema/docs are safe to commit.
- README or benchmark docs explain how to define a suite, validate it, preflight it, run it, and interpret the output.
- Tests cover suite validation, command wiring, dry-run invocation arguments, preflight checks, result artifact writing, manifest generation, and report/summary rendering.
- Existing cr review behavior for normal users is unchanged.
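The private-permissions and manifest criteria above could look roughly like the following on a POSIX system. This is a sketch under assumptions, not the intended implementation: the manifest field names, the choice of SHA-256 for the suite hash, and the helper names are all hypothetical, and run timings are left out to keep the example short.

```python
import hashlib
import json
import os
from pathlib import Path


def make_private_dir(path: Path) -> Path:
    # 0o700: only the owning user may list or enter the directory.
    path.mkdir(mode=0o700, parents=True, exist_ok=True)
    # Enforce the mode even if the directory already existed or the
    # umask altered it. Parent directories keep their default mode.
    os.chmod(path, 0o700)
    return path


def write_private(path: Path, text: str) -> None:
    # Request 0o600 at creation time so a new file is never briefly
    # readable by other users. An existing file keeps its current mode.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(text)


def write_manifest(results_dir: Path, suite_path: Path, tool_version: str,
                   models: list[str], cases: list[str],
                   artifacts: dict[str, str]) -> dict:
    """Write manifest.json with provenance for later comparison."""
    manifest = {
        "tool": {"name": "cr", "version": tool_version},
        "suite": {
            "path": str(suite_path),
            # Hash the suite bytes so later runs can tell whether the
            # suite definition changed between comparisons.
            "sha256": hashlib.sha256(suite_path.read_bytes()).hexdigest(),
        },
        "models": models,
        "cases": cases,
        "artifacts": artifacts,
    }
    body = json.dumps(manifest, indent=2, sort_keys=True) + "\n"
    write_private(results_dir / "manifest.json", body)
    return manifest
```

Sorting keys keeps the manifest byte-stable for identical inputs, which makes it easier to diff two runs.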