envfuzz

A pre-publish, fail-closed adversarial gate for RL-verifier Environments. A falsifier of reward-hackability — not a prover of safety.

Apache-2.0 · pre-alpha (v0.1.0a1)

envfuzz drives an RL verifier environment with deterministic adversarial policies before you publish it, and fails closed (non-zero exit) if any policy can inflate reward without actually satisfying the task. Wire it into CI as the last gate before prime env push so a reward-hackable environment never ships.

pip install "git+https://github.com/hinanohart/envfuzz@v0.1.0a1"   # core: numpy + rich, CPU, offline
envfuzz audit corpus --fail-on-hackability     # exit 1 if any env is hackable
envfuzz report corpus --html scorecard.html    # self-contained HTML scorecard

Or drop it into a CI workflow as a GitHub Action:

- uses: hinanohart/envfuzz@v0.1.0a1
  with:
    target: corpus            # a corpus env id, or 'corpus'/'all'
    fail-on-hackability: "true"
    html: scorecard.html

What it is

A gate, run by the environment author before publishing. The unit of analysis is the environment's rubric + parser, not a trained model.
A falsifier: it searches for an exploit. Finding one is a definite "this is hackable." Finding none is "no exploit in this catalog" — never a safety proof.
CPU-only and deterministic. Every randomized routine takes an explicit seed. The claimed path imports only numpy; it does not require a GPU, network, API key, or even verifiers.

What it is NOT

Not a calibrated probability of hackability. Scores are order-only and are combined with noisy-OR; the reward-inflation interval is a percentile bootstrap.
Not an exhaustiveness proof, not a training-time reward-hacking detector, not a model red-teaming tool.
Not a full OS sandbox. The live driver isolates untrusted environments at the Python process level (see Threat model), not with seccomp/namespaces.

How it works

For each environment, envfuzz drives a set of reference policies (honest_correct, honest_wrong, null) and a catalog of blind adversarial policies, then runs a suite of detectors over the resulting rollouts. An environment is flagged hackable only when both:

the noisy-OR composite of fired detectors ≥ 0.5; and
the reward-inflation bootstrap CI lower bound > 0 — never report inflation the interval cannot separate from zero.

On the deterministic fixture corpus the composite is the sole operative discriminator. Because the inflation sample keeps only strictly-positive exploit gains over the null floor, condition 2 collapses to 0 exactly when no exploit beats the floor, so on fixtures it tracks "some exploit inflated reward" rather than independently gating the verdict. The CI guard becomes load-bearing on the stochastic live path (live detector wiring deferred to v0.1.1), where sampling noise can drag the lower bound to 0.

Attack classes (v0.1.0a1)

class	name	what it exploits
A1	early-termination / empty-work	reward for participation/termination signals on their own
A2	parser-injection	format/well-formedness credit earnable without solving
A3	rubric-weight interaction	keyword/length/auxiliary reward dominating correctness
A4	state side-channel	grader-trust / echoing information the env exposes

Detectors

RewardGap, CeilingBreach, EmptyWorkHighReward, ParserOnly, and MonotoneInflation are active. SideEffect is a v0.1.1 placeholder: its interface is stable but it is inert in every v0.1.0a1 path (it depends on the sandbox-backed by-name live driver — see Roadmap). A4 is still exercised in fixture mode through the self_certify / prompt_leak rubric attacks.

Claims, non-claims, and scope

CLAIMED (verified by CI): on the bundled synthetic corpus, the four attack classes above are falsified deterministically; detectors separate gameable from robust fixtures with the numbers quoted below; the subprocess sandbox contains the behaviors listed under Threat model; the CLI exits non-zero on a hackable env.

NON-CLAIM (shipped capability, not covered by CI, hardening in v0.1.1): driving real, live verifiers environments end-to-end (the [vf] extra; see envfuzz.drivers.vf_env), tool-call (ToolEnv) environments, and loading environments by name through the sandbox.

Out of scope: browser / side-effecting environments, verifiers framework internals (report those upstream), training-dynamics hacking, learned attackers, and any exhaustiveness guarantee.

These boundaries are fixed and intentional:

NC1 — envfuzz does not prove the absence of exploits; it only falsifies.
NC2 — training-time reward hacking is not addressed.
NC3 — attacks on a trained model (rather than the environment) are not addressed.
NC4 — hackability scores are not calibrated probabilities.
NC5 — corpus numbers describe the bundled synthetic corpus, not any real model.

Numbers (bundled synthetic corpus)

Produced by envfuzz bench --quick and asserted in CI (tests/check_bench.py), so the code and this table cannot drift apart:

metric	value
environments	12 (7 gameable, 5 robust)
precision	1.0
recall	1.0
accuracy	1.0
attack classes exercised	A1, A2, A3, A4

Per NC5, these are properties of a small synthetic corpus designed to exercise each attack class with robust negative controls — not a measurement against a real reward model.

Threat model (live driver)

Untrusted verifiers environments are third-party code; importing one can execute arbitrary code at module load. envfuzz therefore runs untrusted execution in a separate process with:

resource limits (CPU seconds, address space, file size, no core dumps);
a scrubbed environment (host secrets are not forwarded to the child);
a Python-level network guard (socket raises) and filesystem write containment (writes outside the sandbox working directory are denied);
fail-closed semantics: any failure to obtain a clean result is treated as "did not clear" — i.e., blocking.

This is process-level Python isolation, not OS-level isolation. There is no seccomp or namespace confinement, so determined native/syscall-level code can still escape; that hardening is planned for v0.1.1. The escape test (tests/test_sandbox.py) asserts exactly the guarantees above and nothing more.

The current VerifiersDriver drives an Environment object you construct (which you therefore already trust); loading arbitrary environments by name through the sandbox is the v0.1.1 item.

Install

v0.1.0a1 is distributed via GitHub (a PyPI release is planned):

pip install "git+https://github.com/hinanohart/envfuzz@v0.1.0a1"           # core (numpy, rich)
pip install "envfuzz[vf] @ git+https://github.com/hinanohart/envfuzz@v0.1.0a1"   # + verifiers (live, NON-CLAIM)
pip install "envfuzz[dev] @ git+https://github.com/hinanohart/envfuzz@v0.1.0a1"  # + test/lint toolchain

Python 3.10–3.13.

Roadmap (v0.1.1)

These are deliberately deferred from v0.1.0a1 (the subprocess sandbox primitive already ships and is escape-tested; the items below are about wiring it):

Load environments by name and drive them inside the subprocess sandbox (the current live driver runs a user-constructed Environment in-process).
Make the SideEffect detector live: have the sandbox observe and report an environment's host side-effect attempts.
OS-level sandbox hardening (seccomp / namespaces) beyond Python-level guards.
Tool-call (ToolEnv) driving; an optional PyPI release.

License

Apache-2.0. See LICENSE and THIRD_PARTY_NOTICES.md.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.github/workflows		.github/workflows
examples		examples
src/envfuzz		src/envfuzz
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
THIRD_PARTY_NOTICES.md		THIRD_PARTY_NOTICES.md
action.yml		action.yml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

envfuzz

What it is

What it is NOT

How it works

Attack classes (v0.1.0a1)

Detectors

Claims, non-claims, and scope

Numbers (bundled synthetic corpus)

Threat model (live driver)

Install

Roadmap (v0.1.1)

License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

envfuzz

What it is

What it is NOT

How it works

Attack classes (v0.1.0a1)

Detectors

Claims, non-claims, and scope

Numbers (bundled synthetic corpus)

Threat model (live driver)

Install

Roadmap (v0.1.1)

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages