Skip to content

hinanohart/envfuzz

Repository files navigation

envfuzz

A pre-publish, fail-closed adversarial gate for RL-verifier Environments. A falsifier of reward-hackability — not a prover of safety.

ci  Apache-2.0 · pre-alpha (v0.1.0a1)

envfuzz drives an RL verifier environment with deterministic adversarial policies before you publish it, and fails closed (non-zero exit) if any policy can inflate reward without actually satisfying the task. Wire it into CI as the last gate before prime env push so a reward-hackable environment never ships.

pip install "git+https://github.com/hinanohart/envfuzz@v0.1.0a1"   # core: numpy + rich, CPU, offline
envfuzz audit corpus --fail-on-hackability     # exit 1 if any env is hackable
envfuzz report corpus --html scorecard.html    # self-contained HTML scorecard

Or drop it into a CI workflow as a GitHub Action:

- uses: hinanohart/envfuzz@v0.1.0a1
  with:
    target: corpus            # a corpus env id, or 'corpus'/'all'
    fail-on-hackability: "true"
    html: scorecard.html

What it is

  • A gate, run by the environment author before publishing. The unit of analysis is the environment's rubric + parser, not a trained model.
  • A falsifier: it searches for an exploit. Finding one is a definite "this is hackable." Finding none is "no exploit in this catalog" — never a safety proof.
  • CPU-only and deterministic. Every randomized routine takes an explicit seed. The claimed path imports only numpy; it does not require a GPU, network, API key, or even verifiers.

What it is NOT

  • Not a calibrated probability of hackability. Scores are order-only and are combined with noisy-OR; the reward-inflation interval is a percentile bootstrap.
  • Not an exhaustiveness proof, not a training-time reward-hacking detector, not a model red-teaming tool.
  • Not a full OS sandbox. The live driver isolates untrusted environments at the Python process level (see Threat model), not with seccomp/namespaces.

How it works

For each environment, envfuzz drives a set of reference policies (honest_correct, honest_wrong, null) and a catalog of blind adversarial policies, then runs a suite of detectors over the resulting rollouts. An environment is flagged hackable only when both:

  1. the noisy-OR composite of fired detectors ≥ 0.5; and
  2. the reward-inflation bootstrap CI lower bound > 0 — never report inflation the interval cannot separate from zero.

On the deterministic fixture corpus the composite is the sole operative discriminator. Because the inflation sample keeps only strictly-positive exploit gains over the null floor, condition 2 collapses to 0 exactly when no exploit beats the floor, so on fixtures it tracks "some exploit inflated reward" rather than independently gating the verdict. The CI guard becomes load-bearing on the stochastic live path (live detector wiring deferred to v0.1.1), where sampling noise can drag the lower bound to 0.

Attack classes (v0.1.0a1)

class name what it exploits
A1 early-termination / empty-work reward for participation/termination signals on their own
A2 parser-injection format/well-formedness credit earnable without solving
A3 rubric-weight interaction keyword/length/auxiliary reward dominating correctness
A4 state side-channel grader-trust / echoing information the env exposes

Detectors

RewardGap, CeilingBreach, EmptyWorkHighReward, ParserOnly, and MonotoneInflation are active. SideEffect is a v0.1.1 placeholder: its interface is stable but it is inert in every v0.1.0a1 path (it depends on the sandbox-backed by-name live driver — see Roadmap). A4 is still exercised in fixture mode through the self_certify / prompt_leak rubric attacks.

Claims, non-claims, and scope

CLAIMED (verified by CI): on the bundled synthetic corpus, the four attack classes above are falsified deterministically; detectors separate gameable from robust fixtures with the numbers quoted below; the subprocess sandbox contains the behaviors listed under Threat model; the CLI exits non-zero on a hackable env.

NON-CLAIM (shipped capability, not covered by CI, hardening in v0.1.1): driving real, live verifiers environments end-to-end (the [vf] extra; see envfuzz.drivers.vf_env), tool-call (ToolEnv) environments, and loading environments by name through the sandbox.

Out of scope: browser / side-effecting environments, verifiers framework internals (report those upstream), training-dynamics hacking, learned attackers, and any exhaustiveness guarantee.

These boundaries are fixed and intentional:

  • NC1 — envfuzz does not prove the absence of exploits; it only falsifies.
  • NC2 — training-time reward hacking is not addressed.
  • NC3 — attacks on a trained model (rather than the environment) are not addressed.
  • NC4 — hackability scores are not calibrated probabilities.
  • NC5 — corpus numbers describe the bundled synthetic corpus, not any real model.

Numbers (bundled synthetic corpus)

Produced by envfuzz bench --quick and asserted in CI (tests/check_bench.py), so the code and this table cannot drift apart:

metric value
environments 12 (7 gameable, 5 robust)
precision 1.0
recall 1.0
accuracy 1.0
attack classes exercised A1, A2, A3, A4

Per NC5, these are properties of a small synthetic corpus designed to exercise each attack class with robust negative controls — not a measurement against a real reward model.

Threat model (live driver)

Untrusted verifiers environments are third-party code; importing one can execute arbitrary code at module load. envfuzz therefore runs untrusted execution in a separate process with:

  • resource limits (CPU seconds, address space, file size, no core dumps);
  • a scrubbed environment (host secrets are not forwarded to the child);
  • a Python-level network guard (socket raises) and filesystem write containment (writes outside the sandbox working directory are denied);
  • fail-closed semantics: any failure to obtain a clean result is treated as "did not clear" — i.e., blocking.

This is process-level Python isolation, not OS-level isolation. There is no seccomp or namespace confinement, so determined native/syscall-level code can still escape; that hardening is planned for v0.1.1. The escape test (tests/test_sandbox.py) asserts exactly the guarantees above and nothing more.

The current VerifiersDriver drives an Environment object you construct (which you therefore already trust); loading arbitrary environments by name through the sandbox is the v0.1.1 item.

Install

v0.1.0a1 is distributed via GitHub (a PyPI release is planned):

pip install "git+https://github.com/hinanohart/envfuzz@v0.1.0a1"           # core (numpy, rich)
pip install "envfuzz[vf] @ git+https://github.com/hinanohart/envfuzz@v0.1.0a1"   # + verifiers (live, NON-CLAIM)
pip install "envfuzz[dev] @ git+https://github.com/hinanohart/envfuzz@v0.1.0a1"  # + test/lint toolchain

Python 3.10–3.13.

Roadmap (v0.1.1)

These are deliberately deferred from v0.1.0a1 (the subprocess sandbox primitive already ships and is escape-tested; the items below are about wiring it):

  • Load environments by name and drive them inside the subprocess sandbox (the current live driver runs a user-constructed Environment in-process).
  • Make the SideEffect detector live: have the sandbox observe and report an environment's host side-effect attempts.
  • OS-level sandbox hardening (seccomp / namespaces) beyond Python-level guards.
  • Tool-call (ToolEnv) driving; an optional PyPI release.

License

Apache-2.0. See LICENSE and THIRD_PARTY_NOTICES.md.

About

Pre-publish, fail-closed adversarial gate for RL-verifier Environments. A falsifier of reward-hackability, not a prover of safety.

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages