v0.8.3 — Arc 1/2/3 datasets + graded reward + experimental flag

@adithya-s-k adithya-s-k released this 28 May 13:19

Highlights — Arc 1 / 2 / 3 reference datasets shipped

Three oracle-verified RL-environment datasets are now live on the Hub, in the Verifiable RL Environments collection.

What's new

pr_diff — Harbor-runnable env + 6-component reward (#40)

Replaces the single difflib SequenceMatcher.ratio() score with six components: format, size, file targeting, region overlap, changes-only similarity, and LLM-as-judge. Every task ships on a thin python:3.12-slim image, with no per-repo bootstrap.
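
To make the "changes-only similarity" idea concrete, here is a minimal sketch, not the pipeline's actual code. It compares only the added and removed lines of two unified diffs, then folds component scores into one number. The function names, the weighted-average combination, and the weights are all assumptions for illustration.

```python
import difflib


def changed_lines(diff_text: str) -> list[str]:
    """Keep only added/removed lines of a unified diff, dropping context."""
    kept = []
    for line in diff_text.splitlines():
        # Simplification: file headers are detected by prefix, so a removed
        # line whose content itself starts with "--" would also be skipped.
        if line.startswith(("+++", "---")):
            continue
        if line.startswith(("+", "-")):
            kept.append(line)
    return kept


def changes_only_similarity(pred_diff: str, gold_diff: str) -> float:
    """Similarity in [0, 1] computed over changed lines only."""
    pred = "\n".join(changed_lines(pred_diff))
    gold = "\n".join(changed_lines(gold_diff))
    return difflib.SequenceMatcher(None, pred, gold).ratio()


def combine(components: dict[str, float], weights: dict[str, float]) -> float:
    """Hypothetical weighted average of per-component scores."""
    total = sum(weights[name] for name in components)
    return sum(components[name] * weights[name] for name in components) / total
```

Under this sketch, two patches that make the same edit but carry different context lines score as identical, which a whole-diff ratio would not guarantee.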

pr_runtime — graded F2P/P2P + tracked vs command_resolved (#45)

  • New in-container verifier (_pr_runtime_verifier.py, pure stdlib) computes reward = f2p_rate × p2p_rate for a dense RL gradient.
  • Split eval signal: resolved (SWE-bench tracked — gold patch always satisfies it) vs command_resolved (stricter: zero untracked failures + exit 0). exit_code and untracked_failed_count now in every reward.json.
  • Recipe robustness: PEP 735 pip install --group tests + bare-pytest fallback so repos that declare test deps via [dependency-groups] (werkzeug etc.) actually have a working pytest.
  • Plain task artifacts: tests/{verifier.py,f2p.json,p2p.json} ship as inspectable files (Harbor mounts tests/ at /tests).
  • Enriched manifest.json: per-task build_status / oracle_reward / resolved / command_resolved / eval_grade / exit_code / parse_status / tests_parsed / runtime_s, sha256 checksums of all six task artifacts, the dataset commit, and repo_distribution. New eval_grade flag (command_resolved AND p2p_count > 0) identifies the benchmark-grade subset.
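
The signals above can be sketched as follows. This is an illustrative reading of the notes, not the contents of _pr_runtime_verifier.py. The RunOutcome type and function names are hypothetical, and two points are assumptions: an empty test set counts as rate 1.0, and command_resolved implies resolved.

```python
from dataclasses import dataclass


@dataclass
class RunOutcome:
    f2p_passed: int              # fail-to-pass tests that pass after the patch
    f2p_total: int
    p2p_passed: int              # pass-to-pass tests that still pass
    p2p_total: int
    untracked_failed_count: int  # failures outside the tracked F2P/P2P sets
    exit_code: int               # exit code of the test command


def _rate(passed: int, total: int) -> float:
    # Assumption: an empty test set is treated as fully satisfied.
    return passed / total if total else 1.0


def graded_reward(o: RunOutcome) -> float:
    """reward = f2p_rate x p2p_rate, giving partial credit."""
    return _rate(o.f2p_passed, o.f2p_total) * _rate(o.p2p_passed, o.p2p_total)


def resolved(o: RunOutcome) -> bool:
    """All tracked tests pass (the SWE-bench style signal)."""
    return o.f2p_passed == o.f2p_total and o.p2p_passed == o.p2p_total


def command_resolved(o: RunOutcome) -> bool:
    """Stricter: also zero untracked failures and exit code 0 (assumed to imply resolved)."""
    return resolved(o) and o.untracked_failed_count == 0 and o.exit_code == 0


def eval_grade(o: RunOutcome) -> bool:
    """command_resolved AND p2p_count > 0."""
    return command_resolved(o) and o.p2p_total > 0
```

For example, a run that passes 1 of 2 F2P tests and 3 of 4 P2P tests gets reward 0.5 × 0.75 = 0.375 rather than a flat zero.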

commit_runtime — filter + leak + artifact fixes (#47)

  • Non-bugfix conventional-commit-type rejection (chore: / docs: / feat: / refactor: / style: / test: / ci: / build: / perf: / revert:).
  • Bugfix positive-signal filter: require fix: prefix OR Closes #N OR a bugfix keyword in the subject.
  • Issue-fetch fallback: when a commit has Closes #N, source the problem statement from the GitHub issue body (less leak-prone than the commit message).
  • _strip_info_leak extended for trailing (#NNNN) squash trailers + cross-repo repo#N refs (shared with pr_runtime).
  • Fixed a critical Arc 2 inheritance miss — commit_runtime was emitting binary-only test.sh and missing the plain tests/ artifacts; both corrected.
  • reward_calibration metadata parity with pr_runtime.
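
The two commit filters can be read as a reject-then-accept check. The sketch below is one possible rendering under stated assumptions: the function name, the regular expressions, and the keyword set are hypothetical, and the real pipeline may order or define these checks differently.

```python
import re

# Conventional-commit types rejected outright, as listed above.
NON_BUGFIX_TYPES = {
    "chore", "docs", "feat", "refactor", "style",
    "test", "ci", "build", "perf", "revert",
}

# Hypothetical keyword set; the actual list is not given in these notes.
BUGFIX_KEYWORDS = {"fix", "fixes", "fixed", "bug", "bugfix", "regression", "crash"}

# Matches "type:", "type(scope):", and "type!:" at the start of a subject.
_TYPE_RE = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?!?:")
_CLOSES_RE = re.compile(r"\bcloses\s+#\d+", re.IGNORECASE)


def is_bugfix_candidate(subject: str, body: str = "") -> bool:
    """Return True if the commit looks like a bug fix worth mining."""
    match = _TYPE_RE.match(subject.strip().lower())
    commit_type = match.group("type") if match else None

    # Negative filter: non-bugfix conventional-commit types are rejected first.
    if commit_type in NON_BUGFIX_TYPES:
        return False

    # Positive signals: fix: prefix, OR Closes #N, OR a bugfix keyword.
    if commit_type == "fix":
        return True
    if _CLOSES_RE.search(subject) or _CLOSES_RE.search(body):
        return True
    words = set(re.findall(r"[a-z]+", subject.lower()))
    return bool(words & BUGFIX_KEYWORDS)
```

Running the rejection first means a subject such as "docs: fix typo in README" is dropped even though it contains a bugfix keyword.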

Experimental pipeline flag

New Pipeline.experimental: ClassVar[bool]. repo2rlenv generate prints a warning before running any pipeline whose experimental flag is True. Stable today: pr_diff, pr_runtime. Experimental: commit_runtime, cve_patches, mutation_bugs, code_instruct, equivalence_tests, refactor_synthesis.
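
A class-level flag of this kind might look like the sketch below. Only the experimental: ClassVar[bool] attribute comes from the notes. The plain base class (rather than a Protocol), the name attribute, the warning text, and warn_if_experimental are assumptions for illustration.

```python
import warnings
from typing import ClassVar


class Pipeline:
    """Simplified stand-in for the project's pipeline base."""
    name: ClassVar[str] = "base"
    experimental: ClassVar[bool] = False  # stable unless a subclass opts in


class PrDiff(Pipeline):
    name = "pr_diff"


class CommitRuntime(Pipeline):
    name = "commit_runtime"
    experimental = True


def warn_if_experimental(pipeline_cls: type[Pipeline]) -> bool:
    """Emit a warning before running an experimental pipeline."""
    if pipeline_cls.experimental:
        warnings.warn(
            f"pipeline '{pipeline_cls.name}' is experimental; "
            "output format and quality may change",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False
```

Keeping the flag on the class lets a CLI check it before instantiating the pipeline.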

Env-var standardization

  • New R2E_CACHE_DIR env var overrides the bootstrap cache root (default ./envs).
  • New docs/reference/ENV.md cataloguing every env var the tool reads: storage paths, GitHub/HF/LLM auth, container registry, pr_diff reward weights, UI, and Visualiser.
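
The override behaviour can be pictured with a few lines of Python. The function name is hypothetical, and treating an empty value as unset is an assumption. Only the variable name R2E_CACHE_DIR and the ./envs default come from the notes.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_cache_root(environ: Optional[Mapping[str, str]] = None) -> Path:
    """Return the bootstrap cache root, honouring R2E_CACHE_DIR if set."""
    env = os.environ if environ is None else environ
    # Assumption: an empty string falls back to the default, like an unset var.
    return Path(env.get("R2E_CACHE_DIR") or "./envs")
```

Passing the environment as a parameter keeps the lookup easy to test without mutating os.environ.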

Container-registry push

  • Multi-repo datasets push each distinct bootstrap image and rewrite each task to its own digest.
  • Docker Hub credentials resolve from explicit DOCKER_USERNAME + DOCKER_TOKEN env vars before the credstore (the credstore's OAuth identity token is often pull-only).
  • push preserves an enriched manifest.json if the source dataset carries one (it doesn't regenerate, because it can't run an oracle).
  • Inline-mode fast path for self-contained Dockerfiles (python:3.12-slim / golang:1.23 etc.) — preserves the validated bytes so manifest checksums stay authoritative.
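
The credential precedence described above (explicit env vars first, credstore second) could be sketched like this. The function and the credstore_lookup callable are hypothetical stand-ins, and requiring both variables before using either is an assumption.

```python
from typing import Callable, Mapping, Optional, Tuple

Credentials = Tuple[str, str]  # (username, secret)


def resolve_docker_credentials(
    environ: Mapping[str, str],
    credstore_lookup: Callable[[], Optional[Credentials]],
) -> Optional[Credentials]:
    """Prefer DOCKER_USERNAME + DOCKER_TOKEN, then fall back to the credstore."""
    user = environ.get("DOCKER_USERNAME")
    token = environ.get("DOCKER_TOKEN")
    # Assumption: both variables must be set for the explicit pair to win.
    if user and token:
        return (user, token)
    # The credstore may hold only a pull-scoped identity token, so it is last.
    return credstore_lookup()
```

In real use, the first argument would typically be os.environ and the second a function that queries the Docker credential store.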

README revamp

Centered header + badges, code-first "How it works" with Pipeline Protocol snippet, agent CLI examples (claude-code / openhands / codex / hermes), Stable / Experimental split, contribution section, minimal metadata table with column legend.

Breaking changes

  • pr_stream removed. Was experimental. Its watermark + --since orchestration was scope-creep — if continuous mining ever becomes a hard requirement it'll be flags on pr_runtime, not a separate pipeline. Anyone with --pipeline pr_stream in scripts gets a clear error from the registry.

Acknowledgements

Built on top of Harbor (task format + runtime), SWE-bench (test-oracle methodology), SWE-RL (diff-similarity reward), and R2E-Gym / SWE-GEN (commit-level mining).

Full per-arc audits + findings: