Release v0.8.3 — Arc 1/2/3 datasets + graded reward + experimental flag · huggingface/Repo2RLEnv

Highlights — Arc 1 / 2 / 3 reference datasets shipped

Three oracle-verified RL-environment datasets now live on the Hub, in the Verifiable RL Environments collection:

AdithyaSK/repo2rlenv-pr-diff — 100 envs · text-only diff-similarity tasks with a 6-component verifier (Arc 1).
AdithyaSK/repo2rlenv-pr-runtime — 100 envs · SWE-bench-style FAIL_TO_PASS / PASS_TO_PASS oracles with the new graded reward (Arc 2).
AdithyaSK/repo2rlenv-commit-runtime — 52 envs · commit-history mining (Arc 3, still flagged experimental).

What's new

`pr_diff` — Harbor-runnable env + 6-component reward (#40)

Replaces the single difflib ratio score with six components: format, size, file targeting, region overlap, changes-only similarity, and LLM-as-judge. Every task ships a thin python:3.12-slim image with no per-repo bootstrap.
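A minimal sketch of how a multi-component diff reward could be assembled. The component names follow the list above, but the weights, the weighted-average combination, and the reading of "changes-only similarity" as a difflib ratio over added/removed lines are all assumptions made for illustration, not the shipped verifier.

```python
import difflib
from typing import List, Mapping

# Illustrative weights only; the real defaults are not stated in these notes.
WEIGHTS = {
    "format": 0.10,
    "size": 0.10,
    "file_targeting": 0.20,
    "region_overlap": 0.20,
    "changes_similarity": 0.30,
    "llm_judge": 0.10,
}


def changed_lines(diff_text: str) -> List[str]:
    """Keep only added/removed lines, skipping the ---/+++ file headers."""
    return [
        line
        for line in diff_text.splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


def changes_similarity(pred_diff: str, gold_diff: str) -> float:
    """Assumed interpretation: sequence similarity over changed lines only."""
    matcher = difflib.SequenceMatcher(
        None, changed_lines(pred_diff), changed_lines(gold_diff)
    )
    return matcher.ratio()


def combine_reward(
    scores: Mapping[str, float], weights: Mapping[str, float] = WEIGHTS
) -> float:
    """Weighted average of component scores, each clamped to [0, 1]."""
    total_weight = sum(weights.values())
    weighted = sum(
        weights[name] * min(1.0, max(0.0, scores.get(name, 0.0)))
        for name in weights
    )
    return weighted / total_weight
```

Missing components score 0 in this sketch, so a malformed submission cannot earn credit from components that were never computed.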

`pr_runtime` — graded F2P/P2P + tracked vs command_resolved (#45)

New in-container verifier (_pr_runtime_verifier.py, pure stdlib) computes reward = f2p_rate × p2p_rate for a dense RL gradient.
Split eval signal: resolved (SWE-bench tracked; the gold patch always satisfies it) vs command_resolved (stricter: zero untracked failures and exit code 0). exit_code and untracked_failed_count are now recorded in every reward.json.
Recipe robustness: PEP 735 pip install --group tests + bare-pytest fallback so repos that declare test deps via [dependency-groups] (werkzeug etc.) actually have a working pytest.
Plain task artifacts: tests/{verifier.py,f2p.json,p2p.json} ship as inspectable files (Harbor mounts tests/ at /tests).
Enriched manifest.json: per-task build_status / oracle_reward / resolved / command_resolved / eval_grade / exit_code / parse_status / tests_parsed / runtime_s, sha256 checksums of all six task artifacts, the dataset commit, and repo_distribution. The new eval_grade flag (command_resolved AND p2p_count > 0) identifies the benchmark-grade subset.
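The graded reward and the two resolution flags described above can be sketched as follows. The formulas for reward and eval_grade come from these notes; the `RunResult` type, the treatment of an empty test set as rate 1.0, and the assumption that command_resolved also requires resolved are hypothetical choices for this example, not the contents of _pr_runtime_verifier.py.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Hypothetical summary of one test run inside the container."""
    f2p_passed: int
    f2p_total: int
    p2p_passed: int
    p2p_total: int
    untracked_failed_count: int
    exit_code: int


def _rate(passed: int, total: int) -> float:
    # Assumption: an empty test set counts as fully satisfied.
    return passed / total if total else 1.0


def grade(r: RunResult) -> dict:
    f2p_rate = _rate(r.f2p_passed, r.f2p_total)
    p2p_rate = _rate(r.p2p_passed, r.p2p_total)
    # SWE-bench-style check: only the tracked F2P/P2P tests matter.
    resolved = r.f2p_passed == r.f2p_total and r.p2p_passed == r.p2p_total
    # Stricter check: no untracked failures and a clean exit code
    # (assumed here to build on top of `resolved`).
    command_resolved = (
        resolved and r.untracked_failed_count == 0 and r.exit_code == 0
    )
    return {
        "reward": f2p_rate * p2p_rate,
        "resolved": resolved,
        "command_resolved": command_resolved,
        "eval_grade": command_resolved and r.p2p_total > 0,
        "exit_code": r.exit_code,
        "untracked_failed_count": r.untracked_failed_count,
    }
```

For example, a run that passes 1 of 2 F2P tests and all P2P tests gets a reward of 0.5 rather than 0, which is the partial credit the graded reward is meant to provide.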

`commit_runtime` — filter + leak + artifact fixes (#47)

Rejects commits with a non-bugfix conventional-commit type (chore: / docs: / feat: / refactor: / style: / test: / ci: / build: / perf: / revert:).
Bugfix positive-signal filter: require fix: prefix OR Closes #N OR a bugfix keyword in the subject.
Issue-fetch fallback: when a commit has Closes #N, source the problem statement from the GitHub issue body (less leak-prone than the commit message).
_strip_info_leak extended for trailing (#NNNN) squash trailers + cross-repo repo#N refs (shared with pr_runtime).
Fixed a critical Arc 2 inheritance miss — commit_runtime was emitting binary-only test.sh and missing the plain tests/ artifacts; both corrected.
reward_calibration metadata parity with pr_runtime.
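A rough sketch of the two commit filters above, applied in order: type rejection first, then the positive bugfix signal. The rejected types and the fix: / Closes #N rules are taken from the list; the function name, the regexes, and the keyword list are hypothetical, since the notes do not say which keywords the pipeline checks.

```python
import re

REJECTED_TYPES = (
    "chore", "docs", "feat", "refactor", "style",
    "test", "ci", "build", "perf", "revert",
)
# Conventional-commit prefix: type, optional (scope), optional "!", then ":".
_TYPE_RE = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?!?:", re.IGNORECASE)
_CLOSES_RE = re.compile(r"\bcloses\s+#\d+", re.IGNORECASE)
# Hypothetical keyword list, chosen for illustration only.
BUGFIX_KEYWORDS = ("fix", "bug", "regression", "crash")


def is_bugfix_candidate(message: str) -> bool:
    """Return True if a commit message passes both filters."""
    lines = message.splitlines()
    subject = lines[0] if lines else ""
    match = _TYPE_RE.match(subject)
    commit_type = match.group("type").lower() if match else None

    # Filter 1: reject explicit non-bugfix types outright.
    if commit_type in REJECTED_TYPES:
        return False
    # Filter 2: require at least one positive bugfix signal.
    if commit_type == "fix":
        return True
    if _CLOSES_RE.search(message):
        return True
    return any(
        re.search(rf"\b{keyword}", subject, re.IGNORECASE)
        for keyword in BUGFIX_KEYWORDS
    )
```

Running the rejection first means a subject such as "feat: add fix for parser" is dropped even though it contains a bugfix keyword.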

Experimental pipeline flag

New Pipeline.experimental: ClassVar[bool]. repo2rlenv generate prints a warning before running any pipeline whose experimental = True. Stable today: pr_diff, pr_runtime. Experimental: commit_runtime, cve_patches, mutation_bugs, code_instruct, equivalence_tests, refactor_synthesis.
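A minimal sketch of how a class-level experimental flag can gate a warning. Only the `experimental: ClassVar[bool]` attribute is taken from the notes; the base class shape, the `name` attribute, the subclasses, and the `experimental_warning` helper with its message text are hypothetical stand-ins for whatever repo2rlenv generate does internally.

```python
import sys
from typing import ClassVar, Optional, Type


class Pipeline:
    """Stand-in base class; the real Pipeline may be a Protocol."""
    name: ClassVar[str] = "base"
    experimental: ClassVar[bool] = False  # stable unless a subclass opts in


class PrRuntime(Pipeline):
    name = "pr_runtime"


class CommitRuntime(Pipeline):
    name = "commit_runtime"
    experimental = True


def experimental_warning(pipeline_cls: Type[Pipeline]) -> Optional[str]:
    """Return a warning message for experimental pipelines, else None."""
    if pipeline_cls.experimental:
        return (
            f"warning: pipeline {pipeline_cls.name!r} is experimental; "
            "output format and quality may change"
        )
    return None


if __name__ == "__main__":
    # Print the warning (if any) before the pipeline would run.
    for cls in (PrRuntime, CommitRuntime):
        message = experimental_warning(cls)
        if message:
            print(message, file=sys.stderr)
```

Keeping the flag on the class rather than the instance lets a CLI check it before constructing or configuring the pipeline.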

Env-var standardization

New R2E_CACHE_DIR env var overrides the bootstrap cache root (default ./envs).
New docs/reference/ENV.md cataloguing every env var the tool reads: storage paths, GitHub/HF/LLM auth, container registry, pr_diff reward weights, UI, and Visualiser.
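The R2E_CACHE_DIR override can be sketched as below. The variable name and the ./envs default come from these notes; the `cache_root` helper and the choice to treat an empty value as unset are assumptions for the example.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def cache_root(environ: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the bootstrap cache root, falling back to ./envs."""
    env = os.environ if environ is None else environ
    # Assumption: an empty R2E_CACHE_DIR is treated the same as unset.
    return Path(env.get("R2E_CACHE_DIR") or "./envs")
```

Passing the environment as a parameter keeps the helper easy to test without mutating os.environ.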

Container-registry push

Multi-repo datasets push each distinct bootstrap image and rewrite each task to its own digest.
Docker Hub credentials resolve from explicit DOCKER_USERNAME + DOCKER_TOKEN env vars before the credstore (the credstore's OAuth identity token is often pull-only).
push preserves an enriched manifest.json if the source dataset carries one (it doesn't regenerate, because it can't run an oracle).
Inline-mode fast path for self-contained Dockerfiles (python:3.12-slim / golang:1.23 etc.) — preserves the validated bytes so manifest checksums stay authoritative.
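The credential precedence described in this section (explicit env vars first, credstore second) could look roughly like this. DOCKER_USERNAME and DOCKER_TOKEN are named in the notes; the function name and the `credstore_lookup` callback are hypothetical, and how the real tool reads the Docker credential store is not shown here.

```python
from typing import Callable, Mapping, Optional, Tuple

Credentials = Tuple[str, str]  # (username, secret)


def resolve_docker_credentials(
    environ: Mapping[str, str],
    credstore_lookup: Callable[[], Optional[Credentials]],
) -> Optional[Credentials]:
    """Prefer explicit env vars; fall back to the credstore otherwise."""
    username = environ.get("DOCKER_USERNAME")
    token = environ.get("DOCKER_TOKEN")
    # Both values must be present; a lone username or token is ignored.
    if username and token:
        return username, token
    # The credstore token may be pull-only, so it is only a fallback.
    return credstore_lookup()
```

The credstore is consulted lazily through the callback, so it is never touched when the explicit variables are set.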

README revamp

Centered header + badges, code-first "How it works" with Pipeline Protocol snippet, agent CLI examples (claude-code / openhands / codex / hermes), Stable / Experimental split, contribution section, minimal metadata table with column legend.

Breaking changes

pr_stream removed. Was experimental. Its watermark + --since orchestration was scope-creep — if continuous mining ever becomes a hard requirement it'll be flags on pr_runtime, not a separate pipeline. Anyone with --pipeline pr_stream in scripts gets a clear error from the registry.

Acknowledgements

Built on top of Harbor (task format + runtime), SWE-bench (test-oracle methodology), SWE-RL (diff-similarity reward), and R2E-Gym / SWE-GEN (commit-level mining).

Full per-arc audits + findings:
