A pre-publish, fail-closed adversarial gate for RL-verifier Environments.
A falsifier of reward-hackability — not a prover of safety.
Apache-2.0 · pre-alpha (
v0.1.0a1)
envfuzz drives an RL verifier environment with deterministic adversarial policies
before you publish it, and fails closed (non-zero exit) if any policy can
inflate reward without actually satisfying the task. Wire it into CI as the last
gate before prime env push so a reward-hackable environment never ships.
pip install "git+https://github.com/hinanohart/envfuzz@v0.1.0a1" # core: numpy + rich, CPU, offline
envfuzz audit corpus --fail-on-hackability # exit 1 if any env is hackable
envfuzz report corpus --html scorecard.html # self-contained HTML scorecardOr drop it into a CI workflow as a GitHub Action:
- uses: hinanohart/envfuzz@v0.1.0a1
with:
target: corpus # a corpus env id, or 'corpus'/'all'
fail-on-hackability: "true"
html: scorecard.html- A gate, run by the environment author before publishing. The unit of analysis is the environment's rubric + parser, not a trained model.
- A falsifier: it searches for an exploit. Finding one is a definite "this is hackable." Finding none is "no exploit in this catalog" — never a safety proof.
- CPU-only and deterministic. Every randomized routine takes an explicit seed.
The claimed path imports only
numpy; it does not require a GPU, network, API key, or evenverifiers.
- Not a calibrated probability of hackability. Scores are order-only and are combined with noisy-OR; the reward-inflation interval is a percentile bootstrap.
- Not an exhaustiveness proof, not a training-time reward-hacking detector, not a model red-teaming tool.
- Not a full OS sandbox. The live driver isolates untrusted environments at the Python process level (see Threat model), not with seccomp/namespaces.
For each environment, envfuzz drives a set of reference policies (honest_correct,
honest_wrong, null) and a catalog of blind adversarial policies, then runs a
suite of detectors over the resulting rollouts. An environment is flagged
hackable only when both:
- the noisy-OR composite of fired detectors ≥ 0.5; and
- the reward-inflation bootstrap CI lower bound > 0 — never report inflation the interval cannot separate from zero.
On the deterministic fixture corpus the composite is the sole operative discriminator. Because the inflation sample keeps only strictly-positive exploit gains over the null floor, condition 2 collapses to 0 exactly when no exploit beats the floor, so on fixtures it tracks "some exploit inflated reward" rather than independently gating the verdict. The CI guard becomes load-bearing on the stochastic live path (live detector wiring deferred to v0.1.1), where sampling noise can drag the lower bound to 0.
| class | name | what it exploits |
|---|---|---|
| A1 | early-termination / empty-work | reward for participation/termination signals on their own |
| A2 | parser-injection | format/well-formedness credit earnable without solving |
| A3 | rubric-weight interaction | keyword/length/auxiliary reward dominating correctness |
| A4 | state side-channel | grader-trust / echoing information the env exposes |
RewardGap, CeilingBreach, EmptyWorkHighReward, ParserOnly, and
MonotoneInflation are active. SideEffect is a v0.1.1 placeholder: its
interface is stable but it is inert in every v0.1.0a1 path (it depends on the
sandbox-backed by-name live driver — see Roadmap). A4 is still exercised in
fixture mode through the self_certify / prompt_leak rubric attacks.
CLAIMED (verified by CI): on the bundled synthetic corpus, the four attack classes above are falsified deterministically; detectors separate gameable from robust fixtures with the numbers quoted below; the subprocess sandbox contains the behaviors listed under Threat model; the CLI exits non-zero on a hackable env.
NON-CLAIM (shipped capability, not covered by CI, hardening in v0.1.1):
driving real, live verifiers environments end-to-end (the [vf] extra; see
envfuzz.drivers.vf_env), tool-call (ToolEnv) environments, and loading
environments by name through the sandbox.
Out of scope: browser / side-effecting environments, verifiers framework
internals (report those upstream), training-dynamics hacking, learned attackers,
and any exhaustiveness guarantee.
These boundaries are fixed and intentional:
- NC1 — envfuzz does not prove the absence of exploits; it only falsifies.
- NC2 — training-time reward hacking is not addressed.
- NC3 — attacks on a trained model (rather than the environment) are not addressed.
- NC4 — hackability scores are not calibrated probabilities.
- NC5 — corpus numbers describe the bundled synthetic corpus, not any real model.
Produced by envfuzz bench --quick and asserted in CI (tests/check_bench.py),
so the code and this table cannot drift apart:
| metric | value |
|---|---|
| environments | 12 (7 gameable, 5 robust) |
| precision | 1.0 |
| recall | 1.0 |
| accuracy | 1.0 |
| attack classes exercised | A1, A2, A3, A4 |
Per NC5, these are properties of a small synthetic corpus designed to exercise each attack class with robust negative controls — not a measurement against a real reward model.
Untrusted verifiers environments are third-party code; importing one can execute
arbitrary code at module load. envfuzz therefore runs untrusted execution in a
separate process with:
resourcelimits (CPU seconds, address space, file size, no core dumps);- a scrubbed environment (host secrets are not forwarded to the child);
- a Python-level network guard (
socketraises) and filesystem write containment (writes outside the sandbox working directory are denied); - fail-closed semantics: any failure to obtain a clean result is treated as "did not clear" — i.e., blocking.
This is process-level Python isolation, not OS-level isolation. There is no
seccomp or namespace confinement, so determined native/syscall-level code can still
escape; that hardening is planned for v0.1.1. The escape test
(tests/test_sandbox.py) asserts exactly the guarantees above and nothing more.
The current VerifiersDriver drives an Environment object you construct
(which you therefore already trust); loading arbitrary environments by name through
the sandbox is the v0.1.1 item.
v0.1.0a1 is distributed via GitHub (a PyPI release is planned):
pip install "git+https://github.com/hinanohart/envfuzz@v0.1.0a1" # core (numpy, rich)
pip install "envfuzz[vf] @ git+https://github.com/hinanohart/envfuzz@v0.1.0a1" # + verifiers (live, NON-CLAIM)
pip install "envfuzz[dev] @ git+https://github.com/hinanohart/envfuzz@v0.1.0a1" # + test/lint toolchainPython 3.10–3.13.
These are deliberately deferred from v0.1.0a1 (the subprocess sandbox primitive already ships and is escape-tested; the items below are about wiring it):
- Load environments by name and drive them inside the subprocess sandbox
(the current live driver runs a user-constructed
Environmentin-process). - Make the
SideEffectdetector live: have the sandbox observe and report an environment's host side-effect attempts. - OS-level sandbox hardening (seccomp / namespaces) beyond Python-level guards.
- Tool-call (
ToolEnv) driving; an optional PyPI release.
Apache-2.0. See LICENSE and THIRD_PARTY_NOTICES.md.