SWE-Review-Bench

A pilot benchmark for cold code-review bug finding on SWE-bench Lite.

Motivation

SWE-bench measures whether a system can resolve a known issue when given the issue text, the failing tests, and the surrounding repo. That is a fix-given-issue task: the system already knows where to look. A different and arguably harder question is whether a system can locate a defect from the source alone, without an issue report or failing tests to anchor on. The skill exercised in a real pre-commit code review, "is something wrong with this file?", is not what SWE-bench fix-resolved rate measures.

SWE-Review-Bench evaluates LLM-based and static reviewers under the cold-review setup: the reviewer is shown only the buggy file at its pre-fix commit and asked to flag concrete correctness, reliability, or maintainability issues in a structured JSON format. The oracle (the fix patch's line ranges) is read only by the scorer, after the reviewer has emitted its output.

Key finding (pilot, n=20)

Claude Sonnet 4.5 emits comments on the correct oracle file in 16 of 20 instances (80%, Wilson 95% CI [0.584, 0.919]) but locates the actual fix region in 0 of 20 (0%, [0.000, 0.161]). GPT-4o-mini shows the inverse pattern: file-level detection on 13 of 20 (65%, [0.433, 0.819]) yet a nonzero line-level hit rate of 3 of 20 (15%, [0.052, 0.360]); see Preliminary results. The pilot therefore suggests a decoupling between detection and localization under cold review, rather than a flat ordering of one model below another. All numbers carry wide Wilson intervals at n = 20, so the section should be read as direction-of-effect, not as a ranking.

Task definition

For each instance in the dataset, the reviewer's input is exactly:

file_path: relative path of one file touched by the fix patch.
file_content: full pre-fix source content of that file, at the instance's base_commit.
A generic review instruction (the prompt template body).
The output JSON schema.

The reviewer's output is a JSON array of comments matching:

{
  "file": "django/http/response.py",
  "line_start": 173,
  "line_end": 173,
  "severity": "low | medium | high",
  "message": "one short sentence"
}
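
A scorer has to decide which emitted comments count as valid before any matching happens. The sketch below is one way to parse and filter reviewer output against the schema above; the function name `parse_comments` and the exact rejection rules (1-based lines, `line_end >= line_start`) are assumptions for illustration, not the repository's implementation.

```python
import json

ALLOWED_SEVERITIES = {"low", "medium", "high"}
REQUIRED_KEYS = {"file", "line_start", "line_end", "severity", "message"}


def parse_comments(raw: str) -> list:
    """Parse reviewer output and keep only comments that match the schema.

    Hypothetical helper: malformed JSON or a non-array payload yields an
    empty list instead of raising.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []

    valid = []
    for item in data:
        if not isinstance(item, dict) or not REQUIRED_KEYS.issubset(item):
            continue
        start, end = item["line_start"], item["line_end"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not all(isinstance(v, int) and not isinstance(v, bool) for v in (start, end)):
            continue
        # Assumption: line numbers are 1-based and the range is not inverted.
        if start < 1 or end < start:
            continue
        if item["severity"] not in ALLOWED_SEVERITIES:
            continue
        if not isinstance(item["file"], str) or not isinstance(item["message"], str):
            continue
        valid.append(item)
    return valid
```

Dropping invalid comments rather than failing the whole instance keeps one malformed entry from zeroing out an otherwise usable review; a stricter scorer could reject the entire output instead.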

The scorer then matches each comment's (file, line_start, line_end) against oracle hunks recovered from the fix patch, under a line tolerance N (default N = 3). An instance counts as a hit if at least one comment matches any oracle hunk on that instance.
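
The matching rule can be sketched as a line-gap test between inclusive ranges. This is a minimal illustration assuming the gap is measured between the nearest edges of the comment range and the hunk range (0 when they overlap); `line_gap`, `comment_hits`, and `instance_is_hit` are hypothetical names, not functions from the repository.

```python
from typing import List, Tuple

# (file, start_line, end_line), both ends inclusive. Hypothetical shape.
Hunk = Tuple[str, int, int]


def line_gap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Distance in lines between two inclusive ranges; 0 if they overlap."""
    if a_end < b_start:
        return b_start - a_end
    if b_end < a_start:
        return a_start - b_end
    return 0


def comment_hits(comment: dict, hunks: List[Hunk], tolerance: int = 3) -> bool:
    """True if the comment is within `tolerance` lines of a hunk in the same file."""
    return any(
        comment["file"] == path
        and line_gap(comment["line_start"], comment["line_end"], start, end) <= tolerance
        for path, start, end in hunks
    )


def instance_is_hit(comments: List[dict], hunks: List[Hunk], tolerance: int = 3) -> bool:
    """An instance is a hit if at least one comment matches any oracle hunk."""
    return any(comment_hits(c, hunks, tolerance) for c in comments)
```

For example, with a hunk covering lines 100 to 110, a comment on lines 96 to 97 has a gap of 3 and counts as a hit at the default tolerance, while a comment on lines 60 to 62 has a gap of 38 and misses even at N = 10.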

The cold-review input policy and the corresponding pytest assertions are documented in docs/leakage_statement.md and tests/test_no_leakage.py. The reviewer's input never contains problem_statement, hints_text, patch, test_patch, test names extracted from test_patch, or oracle line numbers.

Dataset

Source: princeton-nlp/SWE-bench_Lite, split=test.
Sampling: random.Random(42).sample(range(len(ds)), 20), a deterministic 20-instance pilot.
Dataset revision: not pinned at Round 1 load time. A post-hoc snapshot of the HuggingFace dataset state, recorded at outputs/round2/h_lite/dataset_revision.json, captures the dataset commit hash (hf_commit_sha: 6ec7bb89b9342f664a54a6e0a6ea6501d3437cc2); this snapshot reflects the dataset state at snapshot time, not necessarily the exact revision loaded during the Round 1 run on 2026-05-11. The Round 1 load timestamp and the litellm and datasets library versions remain in outputs/run_meta.json.
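
The sampling expression above can be wrapped in a small helper so the index selection is testable without network access. The helper name `pilot_indices` is hypothetical; the dataset size of 300 comes from the test split size stated in this section.

```python
import random


def pilot_indices(dataset_size: int, n: int = 20, seed: int = 42) -> list:
    """Return the deterministic sample of row indices used for the pilot."""
    return random.Random(seed).sample(range(dataset_size), n)


# SWE-bench Lite's test split has 300 instances.
indices = pilot_indices(300)

# Sketch of how the indices would be applied with the `datasets` library
# (not executed here because it needs network access):
#
#   from datasets import load_dataset
#   ds = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
#   pilot = ds.select(indices)
#
# Passing revision="<commit sha>" to load_dataset would pin the dataset
# state for future runs; Round 1 did not do this.
```

The same seed gives the same indices on a given Python version, but the instances behind those indices are only stable if the dataset revision is also fixed.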

This is a 20-instance pilot drawn from SWE-bench Lite's 300-instance test split. All headline numbers below carry Wilson 95% confidence intervals to make the small-sample uncertainty visible.

Reviewers

Round 1 evaluated three reviewers:

reviewer	id (resolved)	source
Claude Sonnet	`claude-sonnet-4-5`	Anthropic API via `litellm` 1.83.9
GPT-4o-mini	`gpt-4o-mini`	OpenAI API via `litellm` 1.83.9
Static union	`static`	Ruff (`F,E9,B,A`) ∪ Pylint (default minus `C,R,I,import-error,no-name-in-module`)

The static reviewer is intentionally Python-only and runs locally; both LLM reviewers are accessed through litellm so the same client code handles both providers. Static comments are capped at --max-comments-per-file 20 to keep the false-positive count bounded. Round 2 prompt-variant experiments evaluate only the two LLM reviewers; the static baseline is variant-agnostic.
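
The static baseline's rule selection can be expressed as two command lines. The sketch below only builds the argument lists; the rule sets come from the table above, while the JSON output flags, the helper names, and the "keep the first 20" capping rule are assumptions rather than what swe_review_bench/reviewers/static.py actually does.

```python
from typing import List

RUFF_SELECT = "F,E9,B,A"
PYLINT_DISABLE = "C,R,I,import-error,no-name-in-module"


def ruff_command(path: str) -> List[str]:
    """Ruff invocation limited to the selected rule families."""
    # Assumption: JSON output is requested so findings can be parsed.
    return ["ruff", "check", "--select", RUFF_SELECT, "--output-format", "json", path]


def pylint_command(path: str) -> List[str]:
    """Pylint invocation with convention, refactor, info and import checks disabled."""
    # Assumption: JSON output is requested so findings can be parsed.
    return ["pylint", f"--disable={PYLINT_DISABLE}", "--output-format=json", path]


def cap_comments(comments: list, max_comments_per_file: int = 20) -> list:
    """Assumed capping rule: keep the first N comments in emitted order."""
    return comments[:max_comments_per_file]
```

Both tools exit non-zero when they report findings, so a caller using `subprocess.run` would typically pass `check=False` and parse stdout instead of treating the exit code as a failure.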

Metrics

Reviewer outputs are scored under --tolerance N (default 3) against oracle hunks built with swe_review_bench.data.oracle.build_oracle_sites in strict_mode=False (one site per hunk, line range covering the full hunk source range).

metric	definition
`instance_hit_rate`	instances where the reviewer hit ≥1 oracle hunk under tolerance N, over total instances scored (20).
`site_recall`	oracle sites hit, over total oracle sites across scored instances (32 in the pilot).
`file_level_hit_rate`	instances where the reviewer emitted ≥1 valid comment on any oracle file, ignoring line numbers.
`false_positives_per_instance_mean`	mean of `n_comments - n_hits` across instances.
`precision@k`	mean over instances of `(#hits within top-k by severity-desc then line-asc) / min(k, n_comments)`.
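
The first four metrics in the table reduce to simple ratios over per-instance counts. The following is a rough sketch under the assumption that each instance has already been scored into the counts shown; the `InstanceScore` record and `aggregate` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class InstanceScore:
    """Per-instance counts after matching (hypothetical record layout)."""
    n_comments: int        # valid comments emitted by the reviewer
    n_hits: int            # comments that matched an oracle hunk
    n_sites: int           # oracle sites on this instance
    n_sites_hit: int       # oracle sites matched by at least one comment
    file_level_hit: bool   # at least one valid comment on any oracle file


def aggregate(scores: List[InstanceScore]) -> Dict[str, float]:
    """Compute the rate metrics over all scored instances."""
    n = len(scores)
    total_sites = sum(s.n_sites for s in scores)
    return {
        "instance_hit_rate": sum(s.n_sites_hit > 0 for s in scores) / n,
        "site_recall": sum(s.n_sites_hit for s in scores) / total_sites,
        "file_level_hit_rate": sum(s.file_level_hit for s in scores) / n,
        "false_positives_per_instance_mean": sum(s.n_comments - s.n_hits for s in scores) / n,
    }
```

Note that `site_recall` pools sites across instances (denominator 32 in the pilot) rather than averaging per-instance recall, so instances with more hunks carry more weight.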

precision@k confidence intervals are not reported in this pilot because the denominator differs per instance (it is clipped to min(k, n_comments)); a CI would require either bootstrapping over instances from the per-comment data or a different aggregation. The CSV at outputs/round2/h_lite/round1_with_ci.csv marks these CI columns as unavailable.
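
To make the clipped denominator concrete, here is a minimal sketch of precision@k as defined in the table. The ranking key follows the stated order (severity descending, then line ascending); skipping instances with zero comments is an assumption, since the definition does not say how they are handled, and the function names are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# Lower rank sorts first, so "high" severity leads.
SEVERITY_RANK = {"high": 0, "medium": 1, "low": 2}


def precision_at_k(comments: List[dict], is_hit: Callable[[dict], bool], k: int) -> Optional[float]:
    """Hits among the top-k comments, divided by min(k, n_comments)."""
    if not comments:
        return None  # assumption: undefined when the reviewer emitted nothing
    ranked = sorted(comments, key=lambda c: (SEVERITY_RANK[c["severity"]], c["line_start"]))
    top = ranked[:k]
    return sum(1 for c in top if is_hit(c)) / min(k, len(comments))


def mean_precision_at_k(per_instance: List[Tuple[List[dict], Callable[[dict], bool]]], k: int) -> float:
    """Mean over instances, skipping instances with no comments (assumption)."""
    values = [
        p for p in (precision_at_k(comments, is_hit, k) for comments, is_hit in per_instance)
        if p is not None
    ]
    return sum(values) / len(values) if values else 0.0
```

Because an instance with 2 comments contributes a ratio over 2 while one with 20 contributes a ratio over k, the per-instance values are not draws from a single binomial, which is the reason a Wilson interval does not apply directly.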

Tolerance sensitivity

Hit-rate under three tolerance values, derived post-hoc from the Round 1 per-comment distances (numerator: instances with ≥1 comment whose minimum line-gap to an oracle hunk in the same file is ≤ N; denominator: 20). Wilson 95% CIs in brackets:

reviewer	t = 0	t = 3 (default)	t = 10
`claude-sonnet-4-5`	0/20 = 0% [0.000, 0.161]	0/20 = 0% [0.000, 0.161]	1/20 = 5% [0.009, 0.236]
`gpt-4o-mini`	2/20 = 10% [0.028, 0.301]	3/20 = 15% [0.052, 0.360]	4/20 = 20% [0.081, 0.416]
`static`	2/20 = 10% [0.028, 0.301]	3/20 = 15% [0.052, 0.360]	5/20 = 25% [0.112, 0.469]

t = 3 is the default; t = 10 raises every reviewer by at most a couple of instances. The dominant failure mode for Claude in Round 1 is not "right region, just outside tolerance"; it is "right file, wrong region by tens of lines". See §Diagnostic Round.
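
The post-hoc sweep only needs each instance's minimum line-gap to an oracle hunk. A small sketch, assuming one value per instance with `None` meaning the reviewer emitted no comment on an oracle file; the function name and input layout are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def hits_by_tolerance(
    min_gaps: List[Optional[int]],
    tolerances: Sequence[int] = (0, 3, 10),
) -> Dict[int, int]:
    """Count instances whose minimum line-gap is within each tolerance."""
    return {
        t: sum(1 for gap in min_gaps if gap is not None and gap <= t)
        for t in tolerances
    }


# Toy input, not pilot data: one exact hit, one near miss, one moderate
# miss, one "right file, wrong region" miss, and one instance with no
# comment on any oracle file.
example = hits_by_tolerance([0, 2, 7, 40, None])
```

An instance with a gap in the tens of lines, like the fourth toy value, stays a miss at every tolerance in the sweep, which is the failure mode described above for Claude in Round 1.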

Preliminary results

Round 1 baseline, all three reviewers, default prompt v1, tolerance = 3. Rate cells show count / 20 (or / 32 for site_recall) followed by the Wilson 95% interval.

reviewer	instance hit rate	file-level hit rate	site recall	FP / instance
`claude-sonnet-4-5`	0 / 20 = 0% [0.000, 0.161]	16 / 20 = 80% [0.584, 0.919]	0 / 32 = 0% [0.000, 0.107]	1.50
`gpt-4o-mini`	3 / 20 = 15% [0.052, 0.360]	13 / 20 = 65% [0.433, 0.819]	3 / 32 = 9% [0.032, 0.242]	2.20
`static`	3 / 20 = 15% [0.052, 0.360]	15 / 20 = 75% [0.531, 0.888]	4 / 32 = 13% [0.050, 0.281]	11.75

Because this is a 20-instance pilot, confidence intervals are wide and the results should not be interpreted as a conclusive model ranking.

The numerator and denominator counts above are emitted in machine-readable form at outputs/round2/h_lite/round1_with_ci.csv. The Wilson formula, denominators, and the deliberate omission of continuity correction are documented in outputs/round2/h_lite/ci_methodology.md.
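
For readers who want to check the bracketed intervals by hand, the Wilson score interval without continuity correction can be computed as below. This is a standalone sketch of the standard formula, not the code behind ci_methodology.md; the function name is an assumption.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.959963984540054) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion, no continuity correction.

    The default z is the two-sided 95% normal quantile.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

Under this formula `wilson_interval(16, 20)` gives approximately (0.584, 0.919) and `wilson_interval(0, 20)` gives approximately (0.000, 0.161), matching the Claude file-level and instance-level cells in the table above.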

Diagnostic round: prompt sensitivity

Round 2 swept three prompt variants on the same 20-instance pilot for the two LLM reviewers only:

Variant A: Round 1 baseline (v1); byte-identical, reused Round 1's cache so cost was $0.
Variant B: same body with the no-speculation clause removed (v1b). The deleted sentence is "Do not invent issues. If the code looks correct to you, return an empty list."
Variant C: Variant B plus "Return at least one comment per file, even if it is a minor observation." (v1c). Variant C is a diagnostic-only probe that forces ≥1 comment per file; it is not used as a headline result.
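
The relationship between the three templates can be sketched as string operations on the Variant A body. The two quoted sentences are taken from the list above; the placeholder body and the `build_variants` helper are hypothetical, since the actual v1 template text is not reproduced in this README.

```python
NO_SPECULATION = "Do not invent issues. If the code looks correct to you, return an empty list."
FORCE_COMMENT = "Return at least one comment per file, even if it is a minor observation."

# Placeholder only: the real v1 body lives in the repository's prompt template.
PLACEHOLDER_V1 = "Review the file below for concrete issues. " + NO_SPECULATION


def build_variants(v1_body: str) -> dict:
    """Derive v1b and v1c from the v1 body as described for Variants B and C."""
    if NO_SPECULATION not in v1_body:
        raise ValueError("v1 body does not contain the no-speculation clause")
    v1b = v1_body.replace(NO_SPECULATION, "").strip()
    v1c = f"{v1b} {FORCE_COMMENT}"
    return {"v1": v1_body, "v1b": v1b, "v1c": v1c}


variants = build_variants(PLACEHOLDER_V1)
```

Deriving B and C mechanically from A keeps the variants different only in the clause under test, which is what makes the A versus B comparison interpretable as a prompt-sensitivity probe.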

Comparison of Variant A and Variant B for the two LLM reviewers:

reviewer	A: instance hit rate	B: instance hit rate	B vs A
`claude-sonnet-4-5`	0 / 20 = 0% [0.000, 0.161]	3 / 20 = 15% [0.052, 0.360]	+15 pp
`gpt-4o-mini`	3 / 20 = 15% [0.052, 0.360]	6 / 20 = 30% [0.145, 0.519]	+15 pp

Under the Round 1 prompt, Claude's output rate is more sensitive to hedge/no-speculation instructions than GPT-4o-mini's. On instance hit rate, relaxing the no-speculation clause produces the same +15 pp point-estimate change for both reviewers: Claude rises from 0% to 15% and GPT-4o-mini from 15% to 30%, with the Wilson intervals shown above. Those intervals overlap between A and B at n = 20, so the deltas are direction-of-effect only, not a statistically resolved comparison. The companion artefact for these numbers is outputs/round2/h_lite/variant_summary_with_ci.csv; the qualitative analysis is in outputs/round2/variant_analysis.md.

Variant C and an extended per-bucket discussion live in outputs/round2/diagnostic_summary.md and outputs/round2/variant_analysis.md.

Reproducibility

One-command reproduction:

bash repro/run.sh

Key fingerprint values:

field	value
dataset	`princeton-nlp/SWE-bench_Lite`, split `test`
sampling seed	`42`
`n_requested`	20
`tolerance`	3
`max_comments_per_file`	20
`prompt_template_id` (Variant A)	`v1`
Variant B / C template ids	`v1b` / `v1c`
`strict_oracle_mode`	`false`
`litellm` version	`1.83.9`
Python	`3.9.12`

Cache behaviour: Round 1 LLM responses live under .cache/llm/ and are read-only for Round 2. Round 2 writes to .cache/round2/llm/. Variant A's cache key matches Round 1's (same template_id); a re-run under Variant A is a 100% cache hit. Variants B and C miss the cache on first run and cost about $1.9 in aggregate for the 20-instance pilot. The full run_meta.json records the resolved model ids, wall time, and run timestamp.
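
The cache-hit behaviour follows from the key including the template id. The sketch below shows one plausible key construction; the fields that go into the key and the `cache_key` helper are assumptions for illustration, and the repository's actual key may include other inputs.

```python
import hashlib
import json


def cache_key(model_id: str, template_id: str, file_path: str, file_content: str) -> str:
    """Stable key over the request inputs (hypothetical field choice)."""
    payload = json.dumps(
        {
            "model_id": model_id,
            "template_id": template_id,
            "file_path": file_path,
            # Hash the content so the key stays short for large files.
            "content_sha256": hashlib.sha256(file_content.encode("utf-8")).hexdigest(),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With a key of this shape, a Variant A request (`template_id="v1"`) reproduces the Round 1 key exactly, while `v1b` and `v1c` produce new keys and therefore trigger fresh API calls.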

Leakage prevention

The cold-review input policy is documented in docs/leakage_statement.md. The corresponding pytest assertions are in tests/test_no_leakage.py (60 parametrised cells = 20 instances × 3 variants); the latest pass/fail summary is at outputs/round2/h_lite/leakage_audit_report.md.

To re-run the leakage tests:

pytest -v tests/test_no_leakage.py
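
The kind of check such a suite performs can be sketched as follows. This is an illustrative stand-in rather than the contents of tests/test_no_leakage.py: `build_reviewer_input` and `assert_no_leakage` are hypothetical, and the check covers only the dataset fields named in the task definition, not extracted test names or oracle line numbers.

```python
FORBIDDEN_FIELDS = ("problem_statement", "hints_text", "patch", "test_patch")
ALLOWED_KEYS = {"file_path", "file_content"}


def build_reviewer_input(instance: dict, file_path: str, file_content: str) -> dict:
    """Hypothetical builder: only the path and pre-fix content are exposed."""
    return {"file_path": file_path, "file_content": file_content}


def assert_no_leakage(reviewer_input: dict, instance: dict) -> None:
    """Fail if the input carries extra keys or embeds any forbidden field's text."""
    extra = set(reviewer_input) - ALLOWED_KEYS
    assert not extra, f"unexpected keys in reviewer input: {sorted(extra)}"
    rendered = "\n".join(str(value) for value in reviewer_input.values())
    for field in FORBIDDEN_FIELDS:
        value = instance.get(field) or ""
        # Empty fields (for example a blank hints_text) cannot leak anything.
        if value.strip():
            assert value not in rendered, f"{field} leaked into reviewer input"
```

In a pytest suite this check would typically be parametrised over instances and prompt variants, which is how 20 instances and 3 variants yield 60 cells.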

Limitations

Small sample. n = 20 is a pilot; Wilson 95% intervals are wide and pairwise deltas overlap at this size. Headline numbers should be read as direction-of-effect, not as a conclusive ranking.
Single-file review only. Each reviewer is shown one file per instance (one of the files touched by the fix patch). Cross-file reasoning, retrieval, and multi-file context are out of scope for the pilot.
Tolerance sensitivity not exhaustively swept. Only t ∈ {0, 3, 10} was evaluated post-hoc.
Dataset composition. SWE-bench Lite is itself a curated subset of SWE-bench and skews toward a handful of large Python projects; the 20-instance pilot inherits any such bias.
Static baseline filtering. Ruff is invoked with --select F,E9,B,A; Pylint is invoked with --disable=C,R,I,import-error,no-name-in-module. These choices are documented in swe_review_bench/reviewers/static.py and were picked to keep the FP count tractable while preserving most correctness-flavoured warnings.
Prompt sensitivity. The Diagnostic Round shows the Round 1 prompt suppresses LLM output rates; reported headline numbers depend on the prompt variant chosen as canonical.
No formal model ranking. This benchmark does not claim "model X is better than model Y at code review". The pilot metric comparisons surface prompt-sensitivity and FP/recall trade-offs only.

License

MIT, see LICENSE.

About

SWE-Review-Bench is built and maintained by an independent CS master's student. The pilot in this repository is the current contribution: an evaluation pipeline, a 20-instance run with frozen artefacts, a prompt-variant probe, and a pytest leakage suite.

Contact: lmnstzz@gmail.com
GitHub: github.com/lmnst
Repository: github.com/lmnst/SWE-Review-Bench

Reference this work by citing the repository.
