RepoScale is an alpha benchmark for repo continuation intelligence.
Measure how well models work on existing software projects instead of toy coding tasks.
RepoScale ships a reproducible CLI pipeline, a 12-case curated corpus, judge stability scoring, and baseline-ready workflows for early adopters and contributors.
Alpha release. Runnable, inspectable, and useful today. Not a final neutral benchmark yet.
Corpus • Baseline • Case Authoring • Launch Kit • Changelog
RepoScale evaluates whether a model can:
- read an existing codebase
- infer what the project was trying to become
- identify what is missing, broken, or risky
- propose a continuation path that fits the repo
It is built for repo-shaped work: partial implementations, stale docs, historical baggage, conflicting intent, and unfinished ideas.
```sh
pip install -e .
find cases -name case.yaml -exec dirname {} \; | xargs reposcale validate
reposcale batch cases/ --model gpt-4o
reposcale score results/<run-id>/ --judge-model gpt-4o --repeat 3
reposcale compare results/<run-a>/ results/<run-b>/
```

Want a one-command baseline? Use `scripts/run-baselines.sh`.
```
results/<run-id>/
├── diagnose-001/
│   ├── response.json
│   └── evaluation.json
├── intent-001/
│   ├── response.json
│   └── evaluation.json
└── plan-001/
    ├── response.json
    └── evaluation.json
```
And then:
- `reposcale summary` for single-run metrics
- `reposcale compare` for side-by-side model comparison
- stability metadata when scoring with `--repeat`
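Stability metadata from `--repeat` boils down to dispersion statistics over repeated judge scores. A hedged sketch of that idea (the metric names here are illustrative, not RepoScale's actual output):

```python
from statistics import mean, stdev

def stability(scores: list[float]) -> dict:
    """Summarize repeated judge scores (e.g. from --repeat 3).

    Metric names are illustrative; RepoScale's actual stability
    metadata may differ.
    """
    return {
        "mean": mean(scores),
        "stdev": stdev(scores) if len(scores) > 1 else 0.0,
        "spread": max(scores) - min(scores),
    }
```

A tight `stdev` and `spread` across repeats suggests the judge is scoring the case consistently; wide dispersion flags a case or rubric worth reviewing.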
Most software work does not start from a blank file.
It starts with an existing repo, carrying everything that has accumulated there. RepoScale focuses on that environment instead of isolated algorithmic tasks or greenfield generation.
| Area | Current alpha scope |
|---|---|
| Corpus | 12 curated cases across 3 tracks and 3 difficulty levels |
| Tracks | Diagnose, Intent, Plan |
| Pipeline | validate -> batch/run -> score -> summary -> compare |
| Scoring | Structural, heuristic, and LLM-judge scoring with stability repeats |
| Tooling | CLI, CI, case authoring guide, corpus manifest, baseline script |
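The `validate` step checks case files against the JSON schemas in `schemas/`. As a minimal sketch of that idea (the required keys and track names below are assumptions, not the real schema, which lives in `schemas/`):

```python
# Hypothetical required keys for a case file; the real schema
# in schemas/ is authoritative.
REQUIRED_KEYS = {"id", "track", "difficulty", "prompt"}
KNOWN_TRACKS = {"diagnose", "intent", "plan"}

def validate_case(case: dict) -> list[str]:
    """Return a list of validation errors (empty means valid)."""
    errors = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - case.keys())]
    track = case.get("track")
    if track is not None and track not in KNOWN_TRACKS:
        errors.append(f"unknown track: {track}")
    return errors
```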
Primary references are collected in the resource table further below.
RepoScale is public as an alpha, not a benchmark-grade final release.
That means:
- the pipeline is runnable and reproducible
- the corpus is large enough to be useful, but still small
- methodology is evolving, especially around judge neutrality and cross-provider baselines
- the best next contributions are new cases, baseline runs, and scoring improvements
| Track | Core question | Output |
|---|---|---|
| Diagnose | What exists, what is missing, and what is broken? | Structured diagnosis with evidence |
| Intent | What was this project trying to become? | Intent reading, fulfillment, deviations |
| Plan | What is the realistic path to scale it? | Phased roadmap, risks, first sprint |
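The Diagnose track's "structured diagnosis with evidence" can be pictured as a small data shape. A hedged sketch, assuming illustrative field names (the real response schema is in `schemas/`):

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    """One Diagnose-track finding; shape is illustrative, not the schema."""
    kind: str          # e.g. "missing", "broken", or "risky"
    claim: str
    evidence: list = field(default_factory=list)  # file paths, commits, docs

@dataclass
class Diagnosis:
    case_id: str
    findings: list = field(default_factory=list)

    def grounded(self) -> bool:
        """Evidence-first: every claim must cite at least one source."""
        return all(f.evidence for f in self.findings)
```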
Planned later tracks:
- Extend
- Implement
- Agent
- model eval builders who want repo-shaped tasks instead of toy prompts
- agent-tool teams testing repo understanding and continuation quality
- OSS contributors who want to add cases, baselines, or scoring layers
- researchers exploring intent reconstruction, gap detection, and architectural coherence
| Mode | Description |
|---|---|
| `prompt_only` | No tools, only packed textual context |
| `read_only_repo` | Read access to repo, no modifications |
| `history_aware` | Includes commit history, changelogs, harness logs |
| `tool_augmented` | Can search files, read modules, inspect metadata |
| `agentic_budgeted` | Coding agent access with step/time/token budget |
| `full_continuation` | End-to-end analyze, propose, implement, validate |
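The `agentic_budgeted` mode caps an agent by steps, tokens, and time. A minimal sketch of such a budget check, with assumed field names and limits (not RepoScale's actual configuration):

```python
from dataclasses import dataclass

@dataclass
class AgentBudget:
    """Illustrative budget for the agentic_budgeted mode; field names
    and defaults are assumptions, not RepoScale's configuration."""
    max_steps: int = 25
    max_tokens: int = 200_000
    max_seconds: int = 600

    def allows(self, steps: int, tokens: int, seconds: float) -> bool:
        """True while the agent is within every budget dimension."""
        return (steps <= self.max_steps
                and tokens <= self.max_tokens
                and seconds <= self.max_seconds)
```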
- Evidence first — claims must be grounded in the repo, docs, or history
- Continuity over rewrite — respect the project that exists
- Creativity with constraints — contextual ideas beat generic advice
- Multi-layer evaluation — structure, semantics, stability, and judgment
- Open and reproducible — cases, prompts, schemas, and scripts stay inspectable
```
src/reposcale/   core Python package (validate, run, score, summary, compare)
cases/           benchmark case packs organized by track
docs/            baselines, authoring guide, and reference material
schemas/         JSON schemas for cases, responses, and evaluations
prompts/         task prompts and judge protocol
scripts/         wrappers and baseline runner
tests/           pytest suite
results/         evaluation outputs (gitignored)
```
| Resource | What it gives you |
|---|---|
| cases/CORPUS.md | Full case inventory and coverage |
| docs/baseline-v0.md | First baseline and calibration notes |
| docs/case-authoring.md | How to create new cases |
| docs/scoring.md | Scoring model and dimensions |
| CONTRIBUTING.md | Contribution workflow |
| CHANGELOG.md | Release history |
The most useful contributions right now are:
- case curation
- baseline runs across additional providers
- judge calibration and methodology review
- CLI and scoring improvements
- failure-mode analysis
Start with CONTRIBUTING.md, then use docs/case-authoring.md if you're adding cases.
For public launch copy, suggested repo topics, and a release checklist, see docs/launch-kit.md.
- Issues: GitHub Issues
- Discussions: use pull requests and release notes for now
- Baseline script: scripts/run-baselines.sh
