Highlights — Arc 1 / 2 / 3 reference datasets shipped
Three oracle-verified RL-environment datasets now live on the Hub, in the Verifiable RL Environments collection:
- AdithyaSK/repo2rlenv-pr-diff: 100 envs · text-only diff-similarity tasks with a 6-component verifier (Arc 1).
- AdithyaSK/repo2rlenv-pr-runtime: 100 envs · SWE-bench-style FAIL_TO_PASS / PASS_TO_PASS oracles with the new graded reward (Arc 2).
- AdithyaSK/repo2rlenv-commit-runtime: 52 envs · commit-history mining (Arc 3, still flagged experimental).
What's new
pr_diff — Harbor-runnable env + 6-component reward (#40)
Replaces the single difflib.ratio score with format + size + file-targeting + region-overlap + changes-only similarity + LLM-as-judge. Every task ships a thin python:3.12-slim image, no per-repo bootstrap.
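As a rough illustration of how a multi-component score like this could be assembled, the sketch below combines six component scores with a weighted mean and shows one way to compute a changes-only similarity. The component keys, the weights, and the function names are hypothetical; the shipped verifier's actual weights and component definitions are not reproduced here.

```python
import difflib

# Hypothetical weights, for illustration only. The real weights are
# configurable and are not taken from the release.
WEIGHTS = {
    "format": 0.10,
    "size": 0.10,
    "file_targeting": 0.20,
    "region_overlap": 0.20,
    "changes_similarity": 0.25,
    "llm_judge": 0.15,
}


def changed_lines(diff_text: str) -> list[str]:
    """Keep only added/removed lines, skipping the +++/--- file headers."""
    return [
        line
        for line in diff_text.splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


def changes_only_similarity(predicted: str, gold: str) -> float:
    """Similarity over changed lines only, ignoring diff context lines."""
    matcher = difflib.SequenceMatcher(None, changed_lines(predicted), changed_lines(gold))
    return matcher.ratio()


def combine_reward(components: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted mean of component scores, each clamped to [0, 1]."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    score = 0.0
    for name, weight in weights.items():
        value = min(1.0, max(0.0, components.get(name, 0.0)))
        score += weight * value
    return score / total_weight
```

Compared with a single whole-text ratio, splitting the score this way lets a patch earn partial credit for touching the right files or regions even when the exact edit differs.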
pr_runtime — graded F2P/P2P + tracked vs command_resolved (#45)
- New in-container verifier (_pr_runtime_verifier.py, pure stdlib) computes reward = f2p_rate × p2p_rate for a dense RL gradient.
- Split eval signal: resolved (SWE-bench tracked; the gold patch always satisfies it) vs command_resolved (stricter: zero untracked failures + exit 0). exit_code and untracked_failed_count are now in every reward.json.
- Recipe robustness: PEP 735 pip install --group tests + bare-pytest fallback so repos that declare test deps via [dependency-groups] (werkzeug etc.) actually have a working pytest.
- Plain task artifacts: tests/{verifier.py,f2p.json,p2p.json} ship as inspectable files (Harbor mounts tests/ at /tests).
- Enriched manifest.json: per-task build_status / oracle_reward / resolved / command_resolved / eval_grade / exit_code / parse_status / tests_parsed / runtime_s, plus sha256 checksums of all six task artifacts, the dataset commit, and repo_distribution. New eval_grade flag (command_resolved AND p2p_count > 0) identifies the benchmark-grade subset.
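The relationship between the graded reward and the two resolution flags can be sketched as follows. This is a minimal stand-in, not the contents of _pr_runtime_verifier.py: the function names, the "passed"/"failed" status strings, the treatment of an empty P2P list as a rate of 1.0, and the choice to make command_resolved imply resolved are all assumptions made for illustration.

```python
def rate(passed: int, total: int, empty_value: float) -> float:
    """Fraction of tests passed, with an explicit value for an empty list."""
    return passed / total if total else empty_value


def score_run(results: dict[str, str], f2p: list[str], p2p: list[str], exit_code: int) -> dict:
    """Score one test run.

    results maps test id -> "passed" or "failed" (assumed status strings).
    f2p / p2p are the tracked FAIL_TO_PASS and PASS_TO_PASS test ids.
    """
    tracked = set(f2p) | set(p2p)
    f2p_passed = sum(results.get(t) == "passed" for t in f2p)
    p2p_passed = sum(results.get(t) == "passed" for t in p2p)
    untracked_failed = sum(
        1 for test, status in results.items() if test not in tracked and status == "failed"
    )

    f2p_rate = rate(f2p_passed, len(f2p), 0.0)
    # Assumption: with no P2P tests there is nothing to regress, so the rate is 1.0.
    p2p_rate = rate(p2p_passed, len(p2p), 1.0)

    # Tracked-only signal: every tracked test passes.
    resolved = bool(f2p) and f2p_passed == len(f2p) and p2p_passed == len(p2p)
    # Stricter signal (assumed to build on resolved): the whole command is clean.
    command_resolved = resolved and untracked_failed == 0 and exit_code == 0

    return {
        "reward": f2p_rate * p2p_rate,
        "resolved": resolved,
        "command_resolved": command_resolved,
        "eval_grade": command_resolved and len(p2p) > 0,
        "exit_code": exit_code,
        "untracked_failed_count": untracked_failed,
    }
```

Because the reward is a product, fixing half of the F2P tests yields partial credit, while any P2P regression scales the whole reward down.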
commit_runtime — filter + leak + artifact fixes (#47)
- Non-bugfix conventional-commit-type rejection (chore: / docs: / feat: / refactor: / style: / test: / ci: / build: / perf: / revert:).
- Bugfix positive-signal filter: require fix: prefix OR Closes #N OR a bugfix keyword in the subject.
- Issue-fetch fallback: when a commit has Closes #N, source the problem statement from the GitHub issue body (less leak-prone than the commit message).
- _strip_info_leak extended for trailing (#NNNN) squash trailers + cross-repo repo#N refs (shared with pr_runtime).
- Fixed a critical Arc 2 inheritance miss: commit_runtime was emitting binary-only test.sh and missing the plain tests/ artifacts; both corrected.
- reward_calibration metadata parity with pr_runtime.
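The two commit filters and the leak stripping described above might look roughly like the sketch below. The regular expressions, the keyword list, and the function names are hypothetical and are not the project's _strip_info_leak implementation; they only illustrate the reject-then-require ordering.

```python
import re

# Conventional-commit types treated as non-bugfix (list taken from the notes above).
NON_BUGFIX_TYPES = {
    "chore", "docs", "feat", "refactor", "style",
    "test", "ci", "build", "perf", "revert",
}

# Matches "type:", "type(scope):" and "type!:" at the start of the subject.
_TYPE_RE = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?!?:")
_CLOSES_RE = re.compile(r"\bcloses\s+#\d+", re.IGNORECASE)
# Hypothetical keyword list, for illustration only.
_BUGFIX_KEYWORD_RE = re.compile(r"\b(fix|bug|regression|crash)", re.IGNORECASE)

_SQUASH_TRAILER_RE = re.compile(r"\s*\(#\d+\)\s*$")
_CROSS_REPO_REF_RE = re.compile(r"\b[\w.-]+(?:/[\w.-]+)?#\d+\b")


def is_bugfix_candidate(message: str) -> bool:
    """Reject non-bugfix types first, then require one positive bugfix signal."""
    lines = message.splitlines()
    subject = lines[0] if lines else ""
    match = _TYPE_RE.match(subject.lower())
    commit_type = match.group("type") if match else None

    if commit_type in NON_BUGFIX_TYPES:
        return False
    if commit_type == "fix":
        return True
    if _CLOSES_RE.search(message):
        return True
    return bool(_BUGFIX_KEYWORD_RE.search(subject))


def strip_leaky_refs(subject: str) -> str:
    """Drop a trailing "(#1234)" squash trailer and cross-repo "repo#N" refs."""
    cleaned = _SQUASH_TRAILER_RE.sub("", subject)
    cleaned = _CROSS_REPO_REF_RE.sub("", cleaned)
    return re.sub(r"[ \t]{2,}", " ", cleaned).strip()
```

Running the rejection check before the positive-signal check means a subject such as "docs: fix typo" is dropped even though it contains a bugfix keyword.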
Experimental pipeline flag
New Pipeline.experimental: ClassVar[bool]. repo2rlenv generate prints a warning before running any pipeline whose experimental = True. Stable today: pr_diff, pr_runtime. Experimental: commit_runtime, cve_patches, mutation_bugs, code_instruct, equivalence_tests, refactor_synthesis.
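A minimal sketch of how a class-level flag like this can drive a pre-run warning is shown below. The Pipeline base class here is a plain stand-in rather than the project's actual Pipeline definition, and warn_if_experimental is a hypothetical helper, not the code behind repo2rlenv generate.

```python
import sys
from typing import ClassVar


class Pipeline:
    """Stand-in base class; only the attributes needed for the example."""

    name: ClassVar[str] = "base"
    experimental: ClassVar[bool] = False


class PrDiff(Pipeline):
    name = "pr_diff"  # stable: inherits experimental = False


class CommitRuntime(Pipeline):
    name = "commit_runtime"
    experimental = True


def warn_if_experimental(pipeline_cls: type[Pipeline]) -> bool:
    """Print a warning to stderr for experimental pipelines; return whether it fired."""
    if pipeline_cls.experimental:
        print(
            f"warning: pipeline '{pipeline_cls.name}' is experimental; "
            "output format and quality may change",
            file=sys.stderr,
        )
        return True
    return False
```

Keeping the flag on the class rather than in a separate registry table means a new pipeline is stable or experimental by declaration, and the default stays False unless a subclass opts in.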
Env-var standardization
- New R2E_CACHE_DIR env var overrides the bootstrap cache root (default ./envs).
- New docs/reference/ENV.md cataloguing every env var the tool reads: storage paths, GitHub/HF/LLM auth, container registry, pr_diff reward weights, UI, and Visualiser.
Container-registry push
- Multi-repo datasets push each distinct bootstrap image and rewrite each task to its own digest.
- Docker Hub credentials resolve from explicit DOCKER_USERNAME + DOCKER_TOKEN env vars before the credstore (the credstore's OAuth identity token is often pull-only).
- push preserves an enriched manifest.json if the source dataset carries one (it doesn't regenerate, because it can't run an oracle).
- Inline-mode fast path for self-contained Dockerfiles (python:3.12-slim / golang:1.23 etc.): preserves the validated bytes so manifest checksums stay authoritative.
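The credential precedence described in the list above (explicit env vars first, credstore second) could be sketched like this. The function name, the return shape, and the credstore_lookup callback are hypothetical; the callback stands in for whatever reads the Docker credential store.

```python
from collections.abc import Callable, Mapping
from typing import Optional


def resolve_docker_credentials(
    env: Mapping[str, str],
    credstore_lookup: Callable[[], Optional[tuple[str, str]]],
) -> Optional[tuple[str, str, str]]:
    """Return (username, secret, source), or None if nothing is available.

    Explicit env vars win because a credstore identity token may be pull-only.
    """
    username = env.get("DOCKER_USERNAME")
    token = env.get("DOCKER_TOKEN")
    if username and token:
        return (username, token, "env")

    stored = credstore_lookup()
    if stored is not None:
        stored_user, stored_secret = stored
        return (stored_user, stored_secret, "credstore")
    return None
```

Requiring both variables before taking the env path avoids pairing a username from the environment with a secret from a different source.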
README revamp
Centered header + badges, code-first "How it works" with Pipeline Protocol snippet, agent CLI examples (claude-code / openhands / codex / hermes), Stable / Experimental split, contribution section, minimal metadata table with column legend.
Breaking changes
- pr_stream removed. Was experimental. Its watermark + --since orchestration was scope creep; if continuous mining ever becomes a hard requirement, it will be flags on pr_runtime, not a separate pipeline. Anyone with --pipeline pr_stream in scripts gets a clear error from the registry.
Acknowledgements
Built on top of Harbor (task format + runtime), SWE-bench (test-oracle methodology), SWE-RL (diff-similarity reward), and R2E-Gym / SWE-GEN (commit-level mining).
Full per-arc audits + findings: