Skip to content

nripankadas07/patchgym

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

PatchGym

Turn any Git repository into a local SWE-bench-style coding-agent benchmark.

PatchGym mines real Git history, creates hidden-test coding-agent tasks, runs agents against those tasks, and reports whether their patches actually fixed the code.

PatchGym is alpha software: local-first, practical, research-quality, and designed to be read. It is not a hosted leaderboard, not a cloud service, and not a claim that one model or agent wins everywhere.

Install From Source

PatchGym is not published to PyPI. Install it from a source checkout:

git clone https://github.com/nripankadas07/patchgym
cd patchgym
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e ".[dev]"

60-Second Demo

bash scripts/demo.sh

The demo creates a tiny Git repository, mines one historical bug fix, verifies the hidden-test/oracle split, runs a toy shell agent, grades the patch, and writes:

  • .patchgym/reports/report.json
  • .patchgym/reports/report.md
  • .patchgym/reports/index.html

Expected shape:

mined 1 task(s)
built 1/1 valid task(s)
agent 'bash .../examples/custom_agent/agent.sh' solved 1/1 task(s)
PatchGym demo complete

How It Works

For each selected historical commit, PatchGym splits the change into:

  • base commit,
  • hidden test patch,
  • oracle solution patch,
  • task prompt,
  • validation command.

A task is valid only when:

base + hidden tests fails
base + hidden tests + oracle patch passes

During an agent run, PatchGym exports the base commit into a temporary workspace, runs the agent command there, captures the agent diff, applies hidden tests, runs the validation command, and records the result.

Deeper docs:

CLI

patchgym init
patchgym mine .
patchgym build .
patchgym list
patchgym show <task-id>
patchgym verify <task-id>
patchgym context <task-id>
patchgym run <task-id> --agent "bash examples/custom_agent/agent.sh"
patchgym grade
patchgym report
patchgym replay <task-id>

Older path-oriented usage also works:

patchgym mine /path/to/repo --out .patchgym/tasks --validation "python -m pytest -q"
patchgym verify .patchgym/tasks --repo /path/to/repo
patchgym run .patchgym/tasks --repo /path/to/repo --agent noop

Example Task

A task directory looks like:

.patchgym/tasks/<task-id>/
  task.json
  hidden_tests.patch
  oracle_solution.patch
  context/
    CODEX_TASK.md
    AGENTS.md

The agent receives the prompt/context, not the hidden tests or oracle patch. Maintainers can inspect the oracle patch to audit task quality.

Example Report

patchgym report writes JSON, Markdown, and HTML. The Markdown report includes:

  • tasks generated,
  • pass/fail result,
  • changed files,
  • validation command,
  • duration,
  • local execution safety note.

Safety Warning

PatchGym runs local Git commands, validation commands, tests, and explicit user-provided agent shell commands. Do not run it on untrusted repositories or with untrusted agents unless you use a disposable container, VM, or machine.

shell=True is only used for the explicit agent command. Validation commands are split and executed without a shell. Agent and validation commands both have timeouts.

Limitations

PatchGym works most reliably when tests and fixes land in the same commit. It uses path heuristics to identify test files. It does not provide strong isolation by default. It does not claim public leaderboard readiness or a complete public-benchmark export format.

More limitations are documented in docs/limitations.md.

Comparison

SWE-bench-style public benchmarks are valuable for broad comparison. PatchGym asks a narrower local question: can an agent fix tasks mined from your repository, under your tests, using your project history?

PatchGym is smaller and less comprehensive than public benchmark infrastructure. That is intentional: it is a readable reference harness and a practical local evaluation loop.

Development

python -m pip install --upgrade pip
python -m pip install -e ".[dev]"
ruff check .
pytest -q
python -m build
bash scripts/demo.sh

CI runs the same core gates across Python 3.9 through 3.13: CLI smoke tests, Ruff, pytest, build, wheel install, and demo.

Roadmap

See ROADMAP.md.

License

MIT. See LICENSE.

About

Turn any Git repository into a local SWE-bench-style coding-agent benchmark.

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors