Add code review benchmark harness #92

@zzwong

Problem

cr review can now run through multiple LLM runtimes and profiles. That makes it possible to compare different model/provider setups, but there is no repeatable way to do that today.

Right now a comparison is mostly manual:

  • pick a review target;
  • run one profile, then another;
  • save the dry-run output somewhere;
  • compare findings, runtime, token usage, and cost by hand;
  • remember which base/head SHA pair was tested.

That is easy to get wrong and hard to repeat after the CLI, prompts, agent profiles, or model choices change.

We should have a small benchmark harness in this repo that exercises the real cr review command in dry-run mode and writes local artifacts that can be compared across models.
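A minimal sketch of how such a harness might shell out to the real command and capture its output. It assumes the review URL is passed as a positional argument, which this issue does not specify. The function names and the artifact file names `review.json` and `stderr.txt` are hypothetical, and profile selection is left out because the flag for it is not stated here.

```python
import subprocess
import time
from pathlib import Path


def build_review_argv(review_url, cr_bin="cr"):
    """Build the argv for one dry-run review.

    Assumption: the review URL is positional. Only `--dry-run` and
    `--json` are taken from the issue text.
    """
    return [cr_bin, "review", review_url, "--dry-run", "--json"]


def run_case(review_url, out_dir, cr_bin="cr", runner=subprocess.run):
    """Run one review in dry-run mode and save stdout/stderr as artifacts.

    `runner` is injectable so tests can avoid calling a real binary.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    argv = build_review_argv(review_url, cr_bin)
    started = time.monotonic()
    proc = runner(argv, capture_output=True, text=True)
    duration = time.monotonic() - started

    # Hypothetical artifact names; the issue only says raw output and
    # stderr are written to a local results directory.
    (out_dir / "review.json").write_text(proc.stdout)
    (out_dir / "stderr.txt").write_text(proc.stderr)

    return {
        "argv": argv,
        "exit_status": proc.returncode,
        "duration_seconds": duration,
    }
```

Keeping argv construction in its own function makes it easy to assert that every benchmark invocation carries `--dry-run`, which is the property that keeps the harness from mutating GitHub state.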

Scope

Add a cr benchmark command for code review benchmarks:

  • Validate a benchmark suite file before running it.
  • Preflight a benchmark suite before longer runs, including selected models/cases, configured profiles, agent directories, optional cr binary path, and result directory selection.
  • Run one or more benchmark cases against one or more configured cr profiles.
  • Invoke the normal review flow through cr review --dry-run --json.
  • Never post comments, resolve threads, request reviews, or mutate GitHub state.
  • Write raw review output, stderr, normalized summary records, aggregate JSON, a run manifest, and a compact Markdown report to a local results directory.
  • Support benchmark cases that pin the review URL plus base/head SHAs so old reviews can be reused as stable evaluation anchors.
  • Support expected-result metadata, such as known findings or an expected clean run, so results can be compared against a baseline.
  • Include grading and efficiency fields that are useful for downstream human or agent synthesis, including grade status/reason, expected-anchor counts, false-positive counts, duration, token, and cost metrics.
  • Keep private benchmark cases and run outputs out of git by default.
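To make the suite and validation items above concrete, here is a rough sketch of a validator over an already-parsed suite. The schema is an assumption made for illustration only: the keys `models`, `cases`, `id`, `review_url`, `base_sha`, `head_sha`, and `expected` are hypothetical, and the real suite format may differ.

```python
import re

# Accept abbreviated or full lowercase hex SHAs (illustrative rule).
SHA_RE = re.compile(r"^[0-9a-f]{7,40}$")


def validate_suite(suite):
    """Return a list of human-readable errors; an empty list means valid.

    `suite` is a dict, e.g. the result of parsing a suite YAML file.
    """
    errors = []

    models = suite.get("models")
    if not isinstance(models, list) or not models:
        errors.append("models: must be a non-empty list of profile names")

    cases = suite.get("cases")
    if not isinstance(cases, list) or not cases:
        errors.append("cases: must be a non-empty list")
        return errors

    seen_ids = set()
    for i, case in enumerate(cases):
        where = f"cases[{i}]"
        if not isinstance(case, dict):
            errors.append(f"{where}: must be a mapping")
            continue

        case_id = case.get("id")
        if not isinstance(case_id, str) or not case_id:
            errors.append(f"{where}.id: required string")
        elif case_id in seen_ids:
            errors.append(f"{where}.id: duplicate id {case_id!r}")
        else:
            seen_ids.add(case_id)

        if not str(case.get("review_url", "")).startswith("https://"):
            errors.append(f"{where}.review_url: must be an https URL")

        # Pinned SHAs are what make an old review a stable anchor.
        for key in ("base_sha", "head_sha"):
            if not SHA_RE.match(str(case.get(key, ""))):
                errors.append(f"{where}.{key}: must be a hex commit SHA")

        expected = case.get("expected", {})
        if not isinstance(expected, dict):
            errors.append(f"{where}.expected: must be a mapping")
        elif expected.get("clean") and expected.get("findings"):
            errors.append(f"{where}.expected: a clean run cannot list findings")

    return errors
```

Collecting all errors instead of failing on the first one lets a single validation pass report every problem in a suite, each prefixed with the path of the offending field.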

This should stay focused on benchmarking codereview-cli behavior. It should not become a separate review daemon, a hosted evaluation service, or a checked-in leaderboard.

Acceptance Criteria

  • cr benchmark validate <suite.yml> loads and validates a benchmark suite with useful errors.
  • cr benchmark doctor <suite.yml> checks suite readiness without invoking model providers or running reviews.
  • cr benchmark doctor supports the same selection and environment flags as cr benchmark run, where they are relevant: --model, --case, --results-dir, --cr-bin, and --json.
  • cr benchmark run <suite.yml> runs selected cases/models through cr review --dry-run --json.
  • --model, --case, --results-dir, and --cr-bin allow targeted local runs.
  • Benchmark output includes enough data to compare runs: case id, model/profile, review URL, exit status, finding count, duration, token/cost metrics when available, and artifact paths.
  • Run artifacts include raw review JSON, stderr, optional metrics JSON, summary.jsonl, suite-summary.json, manifest.json, and report.md.
  • manifest.json records enough provenance for later comparison: tool name/version, suite path/hash, selected models/cases, run timings, and artifact paths.
  • Reports and summaries include expected-baseline fields and conservative grading fields covering clean cases, expected anchors, and false positives, plus duration, token, and cost efficiency metrics.
  • Result directories and files use private permissions by default.
  • Local benchmark cases and results are ignored by git, while public examples and schema/docs are safe to commit.
  • README or benchmark docs explain how to define a suite, validate it, preflight it, run it, and interpret the output.
  • Tests cover suite validation, command wiring, dry-run invocation arguments, preflight checks, result artifact writing, manifest generation, and report/summary rendering.
  • Existing cr review behavior for normal users is unchanged.
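The manifest and private-permission criteria above could be approached roughly as follows. This is a sketch under stated assumptions: the manifest field names, the tool name `cr`, and the placeholder version string are hypothetical, and the permission handling assumes a POSIX filesystem.

```python
import hashlib
import json
import os
from pathlib import Path


def write_private(path, text):
    """Create or truncate `path` and write `text`.

    New files are created with mode 0600. The mode of a file that
    already exists is not changed by os.open.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(text)


def write_manifest(results_dir, suite_path, models, cases,
                   started_at, finished_at,
                   tool_version="0.0.0-example"):
    """Write manifest.json with provenance for later comparison."""
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    # chmod explicitly, because mkdir's mode is filtered by the umask.
    os.chmod(results_dir, 0o700)

    # Hash the suite file so later runs can tell whether it changed.
    suite_hash = hashlib.sha256(Path(suite_path).read_bytes()).hexdigest()

    manifest = {
        "tool": {"name": "cr", "version": tool_version},  # placeholder values
        "suite": {"path": str(suite_path), "sha256": suite_hash},
        "selected": {"models": list(models), "cases": list(cases)},
        "timing": {"started_at": started_at, "finished_at": finished_at},
        "artifacts": {"manifest": "manifest.json"},
    }
    write_private(results_dir / "manifest.json",
                  json.dumps(manifest, indent=2) + "\n")
    return manifest
```

Hashing the suite file, instead of recording only its path, lets two runs be compared even if the suite was edited in place between them.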

Labels

  • area:llm (LLM adapters, contracts, review planning, agents)
  • area:pipeline (End-to-end review pipeline)
  • area:surface (User-visible command surface and lifecycle commands)
  • enhancement (New feature or request)
