Add code review benchmark harness #92

## Problem

`cr review` can now run through multiple LLM runtimes and profiles. That makes it possible to compare different model/provider setups, but there is no repeatable way to do that today.

Right now a comparison is mostly manual:

- pick a review target;
- run one profile, then another;
- save the dry-run output somewhere;
- compare findings, duration, token usage, and cost by hand;
- remember which base/head SHA pair was tested.

That is easy to get wrong and hard to repeat after the CLI, prompts, agent profiles, or model choices change.

We should have a small benchmark harness in this repo that exercises the real `cr review` command in dry-run mode and writes local artifacts that can be compared across models.

## Scope

Add a `cr benchmark` command for code review benchmarks:

- Validate a benchmark suite file before running it.
- Preflight a benchmark suite before longer runs, including selected models/cases, configured profiles, agent directories, optional `cr` binary path, and result directory selection.
- Run one or more benchmark cases against one or more configured `cr` profiles.
- Invoke the normal review flow through `cr review --dry-run --json`.
- Never post comments, resolve threads, request reviews, or mutate GitHub state.
- Write raw review output, stderr, normalized summary records, aggregate JSON, a run manifest, and a compact Markdown report to a local results directory.
- Support benchmark cases that pin the review URL plus base/head SHAs so old reviews can be reused as stable evaluation anchors.
- Support expected-result metadata, such as known findings or an expected clean run, so results can be compared against a baseline.
- Include grading and efficiency fields that are useful for downstream human or agent synthesis, including grade status/reason, expected-anchor counts, false-positive counts, duration, token, and cost metrics.
- Keep private benchmark cases and run outputs out of git by default.

This should stay focused on benchmarking `codereview-cli` behavior. It should not become a separate review daemon, a hosted evaluation service, or a checked-in leaderboard.
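
To make the dry-run invocation concrete, here is a minimal sketch of how the harness might shell out to the real command for one case/profile pair. Only `cr review --dry-run --json` comes from this issue. The `--profile` flag, the positional review URL, the `findings` key in the JSON output, and the names `build_review_argv`, `run_case`, and `CaseResult` are assumptions for illustration and would need to match the actual CLI.

```python
import json
import subprocess
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CaseResult:
    """Hypothetical normalized record for one case/profile run."""
    case_id: str
    profile: str
    exit_status: int
    finding_count: Optional[int]
    stderr: str


def build_review_argv(cr_bin: str, review_url: str, profile: str) -> List[str]:
    # `--dry-run` and `--json` are taken from this issue. `--profile` and the
    # positional URL are ASSUMPTIONS about the existing `cr review` interface.
    return [cr_bin, "review", "--dry-run", "--json", "--profile", profile, review_url]


def run_case(
    case_id: str,
    review_url: str,
    profile: str,
    cr_bin: str = "cr",
    runner: Callable = subprocess.run,
) -> CaseResult:
    # check=False: a non-zero exit is recorded as data, not raised.
    proc = runner(
        build_review_argv(cr_bin, review_url, profile),
        capture_output=True,
        text=True,
        check=False,
    )
    finding_count = None
    try:
        payload = json.loads(proc.stdout)
        # ASSUMPTION: the review JSON is an object with a "findings" list.
        if isinstance(payload, dict) and isinstance(payload.get("findings"), list):
            finding_count = len(payload["findings"])
    except json.JSONDecodeError:
        pass  # keep finding_count as None when output is not valid JSON
    return CaseResult(case_id, profile, proc.returncode, finding_count, proc.stderr)
```

Passing `runner` as a parameter keeps the argument-building logic testable without invoking any model provider.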

## Acceptance Criteria

- `cr benchmark validate <suite.yml>` loads and validates a benchmark suite with useful errors.
- `cr benchmark doctor <suite.yml>` checks suite readiness without invoking model providers or running reviews.
- `cr benchmark doctor` supports the same selection and environment flags as `cr benchmark run`, where they are relevant: `--model`, `--case`, `--results-dir`, `--cr-bin`, and `--json`.
- `cr benchmark run <suite.yml>` runs selected cases/models through `cr review --dry-run --json`.
- `--model`, `--case`, `--results-dir`, and `--cr-bin` allow targeted local runs.
- Benchmark output includes enough data to compare runs: case id, model/profile, review URL, exit status, finding count, duration, token/cost metrics when available, and artifact paths.
- Run artifacts include raw review JSON, stderr, optional metrics JSON, `summary.jsonl`, `suite-summary.json`, `manifest.json`, and `report.md`.
- `manifest.json` records enough provenance for later comparison: tool name/version, suite path/hash, selected models/cases, run timings, and artifact paths.
- Reports and summaries include expected-baseline fields and conservative grading fields covering clean cases, expected anchors, false positives, and duration, token, and cost efficiency.
- Result directories and files use private permissions by default.
- Local benchmark cases and results are ignored by git, while public examples and schema/docs are safe to commit.
- README or benchmark docs explain how to define a suite, validate it, preflight it, run it, and interpret the output.
- Tests cover suite validation, command wiring, dry-run invocation arguments, preflight checks, result artifact writing, manifest generation, and report/summary rendering.
- Existing `cr review` behavior for normal users is unchanged.
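
The artifact and permission criteria above could be met along these lines. This is a rough sketch assuming a POSIX filesystem. The file names `summary.jsonl` and `manifest.json` come from this issue, but the manifest keys, the record fields `case_id` and `model`, and the helpers `write_private` and `write_run_artifacts` are hypothetical. Tool version and run timings are left out for brevity.

```python
import hashlib
import json
import os
from pathlib import Path
from typing import Dict, List


def write_private(path: Path, text: str) -> None:
    # Create the file owner-only; chmod also covers a pre-existing file
    # whose mode would otherwise be left unchanged by os.open.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(text)
    os.chmod(path, 0o600)


def write_run_artifacts(results_dir: Path, suite_path: Path, records: List[Dict]) -> Dict:
    results_dir.mkdir(parents=True, exist_ok=True)
    os.chmod(results_dir, 0o700)  # explicit, because mkdir's mode is subject to umask

    # One JSON object per line, so later runs can be diffed or concatenated.
    summary = "".join(json.dumps(record, sort_keys=True) + "\n" for record in records)
    write_private(results_dir / "summary.jsonl", summary)

    manifest = {
        # All keys below are illustrative, not a fixed schema.
        "suite_path": str(suite_path),
        "suite_sha256": hashlib.sha256(suite_path.read_bytes()).hexdigest(),
        "cases": sorted({record["case_id"] for record in records}),
        "models": sorted({record["model"] for record in records}),
        "artifacts": ["summary.jsonl", "manifest.json"],
    }
    write_private(results_dir / "manifest.json", json.dumps(manifest, indent=2) + "\n")
    return manifest
```

Hashing the suite file lets a later comparison detect that two runs used different suite definitions even when the path is the same.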

