Turn any Git repository into a local SWE-bench-style coding-agent benchmark.
PatchGym mines real Git history, creates hidden-test coding-agent tasks, runs agents against those tasks, and reports whether their patches actually fixed the code.
PatchGym is alpha software: local-first, practical, research-quality, and designed to be read. It is not a hosted leaderboard, not a cloud service, and not a claim that one model or agent wins everywhere.
PatchGym is not published to PyPI. Install it from a source checkout:
git clone https://github.com/nripankadas07/patchgym
cd patchgym
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e ".[dev]"
bash scripts/demo.sh
The demo creates a tiny Git repository, mines one historical bug fix, verifies the hidden-test/oracle split, runs a toy shell agent, grades the patch, and writes:
- .patchgym/reports/report.json
- .patchgym/reports/report.md
- .patchgym/reports/index.html
Expected shape:
mined 1 task(s)
built 1/1 valid task(s)
agent 'bash .../examples/custom_agent/agent.sh' solved 1/1 task(s)
PatchGym demo complete
For each selected historical commit, PatchGym splits the change into:
- base commit,
- hidden test patch,
- oracle solution patch,
- task prompt,
- validation command.
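As a rough illustration of the split described above, a commit's diff can be partitioned file by file into a test patch and a solution patch. This is a minimal sketch, not PatchGym's implementation: `split_patch` and the `is_test_path` predicate are hypothetical names, and the header parsing assumes plain `git diff` output with unquoted paths that contain no spaces.

```python
from typing import Callable, Tuple


def split_patch(diff_text: str, is_test_path: Callable[[str], bool]) -> Tuple[str, str]:
    """Partition `git diff` output into (test_patch, solution_patch) by file section."""
    test_parts, solution_parts = [], []
    current = None  # the list that receives lines of the file section being read
    for line in diff_text.splitlines(keepends=True):
        if line.startswith("diff --git "):
            # Header form: "diff --git a/<path> b/<path>"; use the b/ side.
            path = line.rstrip("\n").split(" b/", 1)[-1]
            current = test_parts if is_test_path(path) else solution_parts
        if current is not None:
            current.append(line)
    return "".join(test_parts), "".join(solution_parts)
```

Each file section stays intact, so both halves remain valid patches that can be applied independently with `git apply`.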
A task is valid only when:
base + hidden tests fails
base + hidden tests + oracle patch passes
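The rule can be written as a small predicate over the two validation exit codes. This is a sketch under the assumption that the validation command follows the usual convention of exit code 0 for success; the function name is hypothetical.

```python
def task_is_valid(rc_base_plus_tests: int, rc_base_plus_tests_plus_oracle: int) -> bool:
    """Return True only when the hidden tests fail on base and pass with the oracle patch."""
    # The hidden tests must detect the bug: a passing run on base means they test nothing new.
    tests_fail_on_base = rc_base_plus_tests != 0
    # The oracle patch must fix it: a failing run means the task cannot be solved as recorded.
    oracle_passes = rc_base_plus_tests_plus_oracle == 0
    return tests_fail_on_base and oracle_passes
```

Both conditions matter: the first rules out tasks whose tests already pass, and the second rules out tasks whose recorded fix does not satisfy the tests.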
During an agent run, PatchGym exports the base commit into a temporary workspace, runs the agent command there, captures the agent diff, applies hidden tests, runs the validation command, and records the result.
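The ordering of those steps is the important part: the agent's diff is captured before the hidden tests enter the workspace. The sketch below models only that ordering, with each step passed in as a callable. All names (`run_task`, `RunResult`, and the five step parameters) are hypothetical and do not reflect PatchGym's internal API.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class RunResult:
    agent_diff: str
    passed: bool


def run_task(
    export_base: Callable[[Path], None],
    run_agent: Callable[[Path], None],
    capture_diff: Callable[[Path], str],
    apply_hidden_tests: Callable[[Path], None],
    validate: Callable[[Path], bool],
) -> RunResult:
    """Run one task in a throwaway workspace, keeping hidden tests away from the agent."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        export_base(workspace)          # 1. materialize the base commit
        run_agent(workspace)            # 2. the agent edits files; no hidden tests present
        diff = capture_diff(workspace)  # 3. record only the agent's changes
        apply_hidden_tests(workspace)   # 4. hidden tests appear after the agent is done
        passed = validate(workspace)    # 5. run the validation command
    return RunResult(agent_diff=diff, passed=passed)
```

Because the workspace is a temporary directory, nothing the agent writes survives the run except the captured diff and the pass/fail result.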
Deeper docs:
- PatchGym from scratch
- How it works
- Mining Git history
- Hidden tests
- Agent adapters
- Sandboxing
- Evaluation metrics
- Comparisons
- Limitations
patchgym init
patchgym mine .
patchgym build .
patchgym list
patchgym show <task-id>
patchgym verify <task-id>
patchgym context <task-id>
patchgym run <task-id> --agent "bash examples/custom_agent/agent.sh"
patchgym grade
patchgym report
patchgym replay <task-id>
Older path-oriented usage also works:
patchgym mine /path/to/repo --out .patchgym/tasks --validation "python -m pytest -q"
patchgym verify .patchgym/tasks --repo /path/to/repo
patchgym run .patchgym/tasks --repo /path/to/repo --agent noop
A task directory looks like:
.patchgym/tasks/<task-id>/
  task.json
  hidden_tests.patch
  oracle_solution.patch
  context/
    CODEX_TASK.md
    AGENTS.md
The agent receives the prompt/context, not the hidden tests or oracle patch. Maintainers can inspect the oracle patch to audit task quality.
patchgym report writes JSON, Markdown, and HTML. The Markdown report includes:
- tasks generated,
- pass/fail result,
- changed files,
- validation command,
- duration,
- local execution safety note.
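To make the report contents concrete, here is a minimal sketch of rendering per-task results as JSON and as a Markdown table. The record fields (`task_id`, `passed`, `changed_files`, `validation`, `duration_s`), the function names, and the exact layout are assumptions for illustration; the real report schema may differ.

```python
import json
from typing import Dict, List


def render_json_report(results: List[Dict]) -> str:
    """Serialize results as indented JSON under a top-level 'tasks' key."""
    return json.dumps({"tasks": results}, indent=2)


def render_markdown_report(results: List[Dict]) -> str:
    """Render one Markdown table row per task, followed by a safety note."""
    lines = [
        f"Tasks generated: {len(results)}",
        "",
        "| Task | Result | Changed files | Validation command | Duration (s) |",
        "| --- | --- | --- | --- | --- |",
    ]
    for r in results:
        status = "pass" if r["passed"] else "fail"
        files = ", ".join(r["changed_files"]) or "(none)"
        lines.append(
            f"| {r['task_id']} | {status} | {files} | {r['validation']} | {r['duration_s']:.1f} |"
        )
    lines += ["", "Safety note: commands ran locally without strong isolation."]
    return "\n".join(lines) + "\n"
```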
PatchGym runs local Git commands, validation commands, tests, and explicit user-provided agent shell commands. Do not run it on untrusted repositories or with untrusted agents unless you use a disposable container, VM, or machine.
shell=True is only used for the explicit agent command. Validation commands are split and executed without a shell. Agent and validation commands both have timeouts.
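The difference between the two execution modes can be sketched with the standard library alone. The helper names below are hypothetical; only `subprocess.run` and `shlex.split` are real APIs. The sketch assumes validation commands are split with POSIX shell quoting rules.

```python
import shlex
import subprocess
from typing import List


def validation_argv(command: str) -> List[str]:
    """Split a validation command into argv; shell operators stay literal text."""
    return shlex.split(command)


def run_agent(command: str, cwd: str, timeout_s: float) -> subprocess.CompletedProcess:
    """Run the user-provided agent command through the shell, with a timeout."""
    # shell=True hands the whole string to the shell, so pipes and && work as typed.
    return subprocess.run(
        command, shell=True, cwd=cwd, timeout=timeout_s, capture_output=True, text=True
    )


def run_validation(command: str, cwd: str, timeout_s: float) -> subprocess.CompletedProcess:
    """Run the validation command as an argv list, with no shell involved."""
    # Without a shell, characters such as ; | > are passed to the program as plain arguments.
    return subprocess.run(
        validation_argv(command), cwd=cwd, timeout=timeout_s, capture_output=True, text=True
    )
```

In both helpers, `subprocess.run` raises `subprocess.TimeoutExpired` when the timeout elapses, which a caller can catch and record as a failed run.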
PatchGym works most reliably when tests and fixes land in the same commit. It uses path heuristics to identify test files. It does not provide strong isolation by default. It does not claim public leaderboard readiness or a complete public-benchmark export format.
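A path heuristic of this kind might look like the sketch below. The specific rules (directory names, `test_` prefix, `_test` suffix, `conftest.py`) are assumptions chosen for illustration and are not PatchGym's actual rule set; `commit_is_minable` is likewise a hypothetical helper expressing the "tests and fix in the same commit" condition.

```python
from pathlib import PurePosixPath
from typing import Iterable

# Assumed directory names that mark everything beneath them as test code.
TEST_DIR_NAMES = {"test", "tests", "testing"}


def looks_like_test_path(path: str) -> bool:
    """Guess from the path alone whether a file is a test file."""
    p = PurePosixPath(path)
    if any(part in TEST_DIR_NAMES for part in p.parts[:-1]):
        return True
    return p.name.startswith("test_") or p.stem.endswith("_test") or p.name == "conftest.py"


def commit_is_minable(changed_paths: Iterable[str]) -> bool:
    """A commit is a candidate only if it touches both test and non-test files."""
    flags = [looks_like_test_path(p) for p in changed_paths]
    return any(flags) and not all(flags)
```

Heuristics like these misclassify unusual layouts, which is one reason generated tasks still need the fail-then-pass verification step.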
More limitations are documented in docs/limitations.md.
SWE-bench-style public benchmarks are valuable for broad comparison. PatchGym asks a narrower local question: can an agent fix tasks mined from your repository, under your tests, using your project history?
PatchGym is smaller and less comprehensive than public benchmark infrastructure. That is intentional: it is a readable reference harness and a practical local evaluation loop.
python -m pip install --upgrade pip
python -m pip install -e ".[dev]"
ruff check .
pytest -q
python -m build
bash scripts/demo.sh
CI runs the same core gates across Python 3.9 through 3.13: CLI smoke tests, Ruff, pytest, build, wheel install, and demo.
See ROADMAP.md.
MIT. See LICENSE.