Verity Benchmark

Measuring AI agents at formally verifying smart contracts.

What is this?

Verity Benchmark measures the ability of AI agents to prove that a smart contract implementation matches its formal specification, using Lean 4 and the Verity framework.

Verity is a formally verified smart contract compiler. It lets you write contracts, state what they should do, prove correctness, and compile to EVM bytecode with machine-checked proofs that compilation preserves semantics. The benchmark is developed in partnership with the Ethereum Foundation and several protocols in the ecosystem.
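
To make the implementation / specification / proof pattern concrete, here is a toy sketch in plain Lean 4. It deliberately does not use the Verity framework's own definitions: `State`, `deposit`, `depositSpec`, and `deposit_meets_spec` are hypothetical names invented for illustration, and real benchmark tasks are considerably more involved.

```lean
-- Hypothetical toy model: NOT the Verity API, only plain Lean 4.
structure State where
  count : Nat

-- "Implementation": a state transition that records one deposit.
def deposit (s : State) : State :=
  { s with count := s.count + 1 }

-- "Specification": the counter grows by exactly one.
def depositSpec (pre post : State) : Prop :=
  post.count = pre.count + 1

-- Proof obligation: the implementation satisfies the specification.
theorem deposit_meets_spec (s : State) : depositSpec s (deposit s) := by
  simp [depositSpec, deposit]
```

The theorem statement plays the role of the single required theorem in a task, and the tactic block after `by` is the part an agent would have to supply.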

Each benchmark task gives an agent:

A fixed contract implementation
A fixed formal specification
One editable proof file with a single theorem to prove

The agent must produce a valid Lean proof. Placeholders (`sorry`, `admit`, `axiom`) are not allowed.
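
The placeholder rule can be approximated with a purely textual pre-screen. The Python sketch below is an assumption about how one might do that; it is not the benchmark's actual validation logic, and `strip_comments` and `find_placeholders` are hypothetical helpers.

```python
import re

# Placeholder tokens named in the task rules above.
FORBIDDEN = ("sorry", "admit", "axiom")
_PATTERN = re.compile(r"\b(" + "|".join(FORBIDDEN) + r")\b")


def strip_comments(source: str) -> str:
    """Remove Lean block comments (/- ... -/) and line comments (-- ...).

    Rough approximation: nested block comments and comment markers inside
    string literals are not handled.
    """
    source = re.sub(r"/-.*?-/", "", source, flags=re.DOTALL)
    return re.sub(r"--.*", "", source)


def find_placeholders(source: str) -> list[str]:
    """Return the sorted set of forbidden tokens found outside comments."""
    return sorted(set(_PATTERN.findall(strip_comments(source))))


print(find_placeholders("theorem t : 1 = 1 := by sorry"))
print(find_placeholders("theorem t : 1 = 1 := rfl -- no sorry here"))
```

A textual scan can miss placeholders that enter indirectly, so a check performed by Lean itself (for example `#print axioms` on the finished theorem) is the stronger guarantee.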

Benchmark suite

6 cases, 30 tasks, drawn from real-world contracts:

Case	Source	Tasks
`ethereum/deposit_contract_minimal`	Ethereum deposit contract	5
`lido/vaulthub_locked`	Lido VaultHub	5
`nexus_mutual/ramm_price_band`	Nexus Mutual RAMM	4
`kleros/sortition_trees`	Kleros sortition module	6
`paladin_votes/stream_recovery_claim_usdc`	Paladin Votes	5
`damn_vulnerable_defi/side_entrance`	Damn Vulnerable DeFi	5
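
The headline totals can be cross-checked against the table. The short Python sketch below copies the per-case task counts from the table and sums them; the variable names are illustrative only.

```python
# Task counts per case, copied from the table above.
TASKS_PER_CASE = {
    "ethereum/deposit_contract_minimal": 5,
    "lido/vaulthub_locked": 5,
    "nexus_mutual/ramm_price_band": 4,
    "kleros/sortition_trees": 6,
    "paladin_votes/stream_recovery_claim_usdc": 5,
    "damn_vulnerable_defi/side_entrance": 5,
}

num_cases = len(TASKS_PER_CASE)
num_tasks = sum(TASKS_PER_CASE.values())
print(f"{num_cases} cases, {num_tasks} tasks")  # prints "6 cases, 30 tasks"
```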

Most cases include a reference proof (hidden from the agent during benchmarking) that confirms the task is solvable.

Running the benchmark

Verify reference proofs

# Single task
./scripts/run_task.sh ethereum/deposit_contract_minimal/deposit_count

# All tasks in a case
./scripts/run_case.sh ethereum/deposit_contract_minimal

# Full suite
./scripts/run_all.sh

Run with the built-in harness

The benchmark ships with an agent harness that supports any OpenAI-compatible API. In interactive mode, it exposes Lean-specific tools to the agent:

Tool	Purpose
`read_public_file`	Read implementation and spec files
`write_editable_proof`	Write the proof file
`run_lean_check`	Type-check the proof, with structured error feedback and repair hints
`inspect_lean_goals`	Inspect open proof goals at hole sites
`try_tactic_at_hole`	Test a tactic at proof holes without committing
`search_public_defs`	Search definitions across implementation and spec files
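
Because the harness targets OpenAI-compatible APIs, tools like these are typically advertised to the model as function definitions. The Python sketch below shows one plausible shape using the chat-completions `tools` format. The tool names and descriptions come from the table; the parameter names (`path`, `content`, `tactic`, `query`) are assumptions for illustration and may not match what the harness actually declares.

```python
import json

# name -> (description from the table, ASSUMED parameters as name -> JSON type)
TOOLS = {
    "read_public_file": ("Read implementation and spec files", {"path": "string"}),
    "write_editable_proof": ("Write the proof file", {"content": "string"}),
    "run_lean_check": ("Type-check the proof", {}),
    "inspect_lean_goals": ("Inspect open proof goals at hole sites", {}),
    "try_tactic_at_hole": (
        "Test a tactic at proof holes without committing",
        {"tactic": "string"},
    ),
    "search_public_defs": (
        "Search definitions across implementation and spec files",
        {"query": "string"},
    ),
}


def as_openai_tool(name, description, params):
    """Build one entry for the `tools` list of a chat-completions request."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {p: {"type": t} for p, t in params.items()},
                "required": list(params),
            },
        },
    }


tool_specs = [as_openai_tool(n, d, p) for n, (d, p) in TOOLS.items()]
print(json.dumps(tool_specs[0], indent=2))
```

See harness/README.md for the definitions the harness really uses.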

# Run a single task with the built-in agent
./scripts/run_default_agent.sh lido/vaulthub_locked/locked_funds_solvency

# Run a full case
./scripts/run_default_agent_case.sh lido/vaulthub_locked

# Run the full suite
./scripts/run_default_agent_all.sh

To test a different model, configure the agent profile:

python3 harness/default_agent.py profiles
python3 harness/default_agent.py describe --profile openai-compatible

Use a custom harness

The benchmark also supports custom agent harnesses via an external command adapter. The evaluation contract stays the same: fixed input files, one editable proof, one required theorem.

./scripts/run_custom_agent.sh ethereum/deposit_contract_minimal/deposit_count
./scripts/run_custom_agent_case.sh ethereum/deposit_contract_minimal
./scripts/run_custom_agent_all.sh

Benchmark matrix

Run multiple models/harnesses in parallel and compare results:

python3 scripts/run_benchmark_matrix.py start
python3 scripts/run_benchmark_matrix.py status
python3 scripts/run_benchmark_matrix.py wait
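
For unattended runs, the subcommands can be chained from a small driver. The Python sketch below is a hypothetical wrapper, not part of the repository. It assumes that `wait` blocks until the matrix has finished and that each subcommand exits with a non-zero status on failure; neither behavior is confirmed here. The `render` step is the one named in the Benchmark Results section.

```python
import subprocess

MATRIX_SCRIPT = "scripts/run_benchmark_matrix.py"
# Subcommands that appear in this README.
SUBCOMMANDS = ("start", "status", "wait", "render")


def matrix_command(subcommand: str) -> list[str]:
    """Return the argv list for one matrix subcommand."""
    if subcommand not in SUBCOMMANDS:
        raise ValueError(f"unknown matrix subcommand: {subcommand!r}")
    return ["python3", MATRIX_SCRIPT, subcommand]


def run_matrix_and_render() -> None:
    """Start the matrix, wait for it, then render results.

    check=True stops the sequence at the first failing step
    (assuming failures are reported through the exit status).
    """
    for step in ("start", "wait", "render"):
        subprocess.run(matrix_command(step), check=True)
```

Call `run_matrix_and_render()` from the repository root so that the relative script path resolves.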

Project structure

verity-benchmark/
├── Benchmark/
│   ├── Cases/           # Reference proofs (hidden from agents)
│   └── Generated/       # Public proof templates
├── cases/               # Task manifests and contract sources
├── harness/             # Agent runner, tools, and evaluation
├── scripts/             # CLI entry points
├── schemas/             # JSON schemas for results
└── results/             # Run artifacts

Documentation

Document	Description
`harness/README.md`	Harness internals and agent integration
`docs/architecture/task-api.md`	Task contract and manifest format
`docs/architecture/runtime-modes.md`	Runtime modes (strict, interactive, custom)

Benchmark Results

Run `python3 scripts/run_benchmark_matrix.py render` after the matrix finishes to refresh this section.

