Agent Bench

A benchmarking framework for AI coding agents on enterprise Java tasks. It defines benchmarks in YAML, launches any CLI agent, and grades the results with cascaded judge tiers from Agent Judge.

Quick Start

git clone https://github.com/markpollack/agent-bench.git
cd agent-bench
./mvnw clean install -DskipTests

List available benchmarks:

$ bench list

Available benchmarks:

  code-coverage                  v1.0      (1 tasks)
  hello-world                    v1.0      (1 tasks)

Run a benchmark with an agent:

bench run --benchmark hello-world --agent agents/claude-code.yaml

How It Works

The bench orchestrates a per-task lifecycle:

provide → setup scripts → agent → post scripts → grade → result.json

1. Provide copies the workspace template and writes INSTRUCTION.md
2. Setup scripts run in the workspace (clone repo, compile, measure baseline)
3. Agent executes: any CLI tool that reads INSTRUCTION.md and modifies the workspace
4. Post scripts run (build, test, generate reports)
5. Grade evaluates the workspace with a cascaded jury from Agent Judge
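
The provide step can be pictured as a plain directory copy followed by one file write. The sketch below is illustrative only and is not Agent Bench's actual code: the class name `ProvideSketch` and its method signature are hypothetical, and it assumes the template is an ordinary directory tree.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of the "provide" step; not taken from Agent Bench.
class ProvideSketch {

    /** Copies every entry under template into workspace, then writes INSTRUCTION.md. */
    static void provide(Path template, Path workspace, String instruction) throws IOException {
        Files.createDirectories(workspace);

        // Files.walk lists parents before children, so directories exist before their files.
        List<Path> sources;
        try (Stream<Path> walk = Files.walk(template)) {
            sources = walk.collect(Collectors.toList());
        }

        for (Path source : sources) {
            Path target = workspace.resolve(template.relativize(source).toString());
            if (Files.isDirectory(source)) {
                Files.createDirectories(target);
            } else {
                Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
            }
        }

        // The instruction file is the only channel between the bench and the agent.
        Files.writeString(workspace.resolve("INSTRUCTION.md"), instruction);
    }
}
```

After this step the workspace holds the template files plus INSTRUCTION.md, which is all the later stages rely on.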

Benchmark Format

Each benchmark lives in its own directory under benchmarks/ and is described by YAML files:

benchmarks/code-coverage/
├── benchmark.yaml
├── prompts/
│   └── judge-practice-adherence.txt
└── tasks/
    └── spring-petclinic/
        └── task.yaml

benchmark.yaml

Declares the benchmark's metadata and its jury, a cascaded sequence of judge tiers:

schema: bench.benchmark.v1
name: code-coverage
version: "1.0"
description: "Write JUnit tests to maximize JaCoCo instruction coverage."
default-timeout: PT45M

jury:
  tiers:
    - name: build
      policy: REJECT_ON_ANY_FAIL
      checks:
        - type: maven-build
          goals: [clean, test]
    - name: coverage-preservation
      policy: REJECT_ON_ANY_FAIL
      checks:
        - type: coverage-preservation
    - name: coverage-improvement
      policy: ACCEPT_ON_ALL_PASS
      checks:
        - type: coverage-improvement
          min: 50.0
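
To make the cascade concrete, here is a minimal sketch of how tiers with these two policies could be evaluated in order. It is a model under stated assumptions, not Agent Judge's implementation: the types `Tier`, `Policy`, and `Verdict` are hypothetical, a tier that reaches no decision is assumed to hand over to the next tier, and a cascade that ends undecided is assumed to reject.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical model of cascaded tiers; the real logic lives in Agent Judge.
class CascadeSketch {

    enum Policy { REJECT_ON_ANY_FAIL, ACCEPT_ON_ALL_PASS }

    enum Verdict { ACCEPT, REJECT }

    /** One tier: a name, a policy, and the checks it runs. */
    record Tier(String name, Policy policy, List<BooleanSupplier> checks) {}

    /** Walks the tiers in order and stops at the first tier that decides. */
    static Verdict evaluate(List<Tier> tiers) {
        for (Tier tier : tiers) {
            // allMatch stops at the first failing check in the tier.
            boolean allPass = tier.checks().stream().allMatch(BooleanSupplier::getAsBoolean);

            if (tier.policy() == Policy.REJECT_ON_ANY_FAIL && !allPass) {
                return Verdict.REJECT; // a cheap gate failed, later tiers never run
            }
            if (tier.policy() == Policy.ACCEPT_ON_ALL_PASS && allPass) {
                return Verdict.ACCEPT; // the goal is met, nothing more to check
            }
            // Otherwise this tier is undecided and the next one gets a turn.
        }
        return Verdict.REJECT; // assumption: no tier accepted, so the attempt is rejected
    }
}
```

Under this reading, the `build` and `coverage-preservation` tiers act as gates that can only reject, while `coverage-improvement` is the only tier that can accept.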

task.yaml

Defines a single task — the problem, setup, and post-processing:

schema: bench.task.v1
id: spring-petclinic
instruction: |
  Write JUnit tests for this Spring Boot project to maximize code coverage.
  Run ./mvnw clean test jacoco:report to measure coverage.
  Focus on behavioral code — skip Application main classes, records, and config.
  Use narrow test slices (@WebMvcTest, @DataJpaTest) over @SpringBootTest.
timeout: PT45M
metadata:
  baselineCoverage: 0.0
setup:
  - "git init && git remote add origin https://github.com/spring-projects/spring-petclinic.git && git fetch --depth 1 origin edf4db28affc && git checkout FETCH_HEAD"
  - "./mvnw clean compile -q -Dspring-javaformat.skip=true -Dcheckstyle.skip=true"
post:
  - "./mvnw clean test jacoco:report -q -Dspring-javaformat.skip=true -Dcheckstyle.skip=true"

Agent Config

Agent configs are minimal — just a command and timeout:

# agents/claude-code.yaml
command: claude --print --dangerously-skip-permissions 'Read INSTRUCTION.md and follow the instructions precisely.'
timeout: PT45M

The bench launches the command via bash -c in the workspace directory. Any CLI tool works.
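
As a rough illustration of that launch, the sketch below runs a command through `bash -c` in a given directory and enforces an ISO-8601 timeout such as `PT45M`. It uses only standard JDK calls. The class name `LaunchSketch` and the choice to return `-1` on timeout are assumptions made for the example; `ExecAgentInvoker` may behave differently.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.time.Duration;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not the ExecAgentInvoker implementation.
class LaunchSketch {

    /** Runs command via bash -c in workspace; returns its exit code, or -1 on timeout. */
    static int launch(String command, Path workspace, Duration timeout)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder("bash", "-c", command)
                .directory(workspace.toFile()) // the agent sees the workspace as its cwd
                .inheritIO()                   // forward the agent's output to this console
                .start();

        if (!process.waitFor(timeout.toMillis(), TimeUnit.MILLISECONDS)) {
            process.destroyForcibly();         // timed out: stop the launched shell
            return -1;
        }
        return process.exitValue();
    }

    public static void main(String[] args) throws Exception {
        Duration timeout = Duration.parse("PT45M"); // same format as the YAML timeout field
        int exit = launch("echo agent finished", Path.of("."), timeout);
        System.out.println("exit code: " + exit);
    }
}
```

Because the command string is handed to the shell as-is, quoting inside the agent config (as in the `claude` example) is interpreted by bash, not by the bench.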

Bring Your Own Agent

The filesystem is the contract. The bench writes INSTRUCTION.md to the workspace; the agent reads it and modifies files. You can also run the provide/grade steps separately:

# Set up workspace
bench provide --benchmark code-coverage --task spring-petclinic --workspace /tmp/petclinic

# Run your agent (any tool that reads INSTRUCTION.md)
cd /tmp/petclinic && your-agent "$(cat INSTRUCTION.md)"

# Grade the result
bench grade --benchmark code-coverage --task spring-petclinic --workspace /tmp/petclinic
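
To show how small the contract is, here is a toy agent written in Java. It is a hypothetical stand-in, not a useful agent: it reads INSTRUCTION.md from the workspace and writes a file named `AGENT_NOTES.md`, a name invented for this example. A real agent would edit source files instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for a real agent; it only demonstrates the filesystem contract.
class EchoAgent {

    /** Reads INSTRUCTION.md and records a short note in AGENT_NOTES.md. */
    static Path run(Path workspace) throws IOException {
        String instruction = Files.readString(workspace.resolve("INSTRUCTION.md"));

        // A real agent would act on the instruction; this one just counts its lines.
        Path notes = workspace.resolve("AGENT_NOTES.md");
        Files.writeString(notes,
                "Received " + instruction.lines().count() + " instruction line(s).\n");
        return notes;
    }

    public static void main(String[] args) throws IOException {
        // The workspace is the first argument, or the current directory if none is given.
        Path workspace = Path.of(args.length > 0 ? args[0] : ".");
        System.out.println("wrote " + run(workspace));
    }
}
```

Anything that can be started from a shell in the workspace directory and follows this read-then-modify pattern fits between the provide and grade steps.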

CLI Reference

| Command | Purpose |
| --- | --- |
| `bench list` | List available benchmarks |
| `bench tasks --benchmark <name>` | List tasks in a benchmark |
| `bench provide --benchmark <name> --task <id> --workspace <dir>` | Set up workspace with instruction |
| `bench grade --benchmark <name> --task <id> --workspace <dir>` | Evaluate agent's work |
| `bench run --benchmark <name> --agent <config> [--task <id>]` | Full pipeline: provide + agent + grade |

Architecture

Two modules:

agent-bench-core — CLI, benchmark catalog, run orchestration, result model, judge factory
agent-bench-agents — Agent-specific judge implementations (e.g., LLM-based test quality judge)

Key classes:

| Class | Role |
| --- | --- |
| `BenchmarkCatalog` | Discovers benchmarks from `benchmarks/` directory |
| `BenchmarkTask` | A single task: instruction, setup/post scripts, metadata |
| `RunCommand` | Orchestrates the full lifecycle per task |
| `JudgeFactory` | Materializes YAML jury config into Judge instances |
| `TrialResult` | Per-attempt result with timestamps and scores |
| `BenchmarkResult` | Aggregate result with accuracy and pass@k |
| `ExecAgentInvoker` | Loads agent config and launches the command |

Module layering is enforced by ArchUnit — core does not depend on agents.
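
For readers unfamiliar with the pass@k figure reported by `BenchmarkResult`, the sketch below computes the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of attempts and c the number that passed. This README does not say which definition Agent Bench uses, so treat the formula choice and the helper `PassAtK.passAtK` as assumptions.

```java
// Hypothetical helper; not part of Agent Bench.
class PassAtK {

    /**
     * Estimates the probability that at least one of k attempts, drawn from
     * n recorded attempts of which c passed, is a passing attempt.
     */
    static double passAtK(int n, int c, int k) {
        if (k < 1 || k > n || c < 0 || c > n) {
            throw new IllegalArgumentException("require 1 <= k <= n and 0 <= c <= n");
        }
        if (n - c < k) {
            return 1.0; // fewer than k failures exist, so any k attempts include a pass
        }
        // C(n-c, k) / C(n, k) rewritten as a product to avoid huge binomials.
        double allFail = 1.0;
        for (int i = n - c + 1; i <= n; i++) {
            allFail *= 1.0 - (double) k / i;
        }
        return 1.0 - allFail;
    }
}
```

With k = 1 the estimate reduces to the plain pass rate c / n, which matches the usual notion of accuracy over attempts.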

Built-in Judge Types

| Type | What it checks |
| --- | --- |
| `file-exists` | File exists at path |
| `file-content` | File content matches expected (EXACT or CONTAINS) |
| `maven-build` | Maven build succeeds with specified goals |
| `coverage-preservation` | JaCoCo coverage has not dropped below the baseline |
| `coverage-improvement` | JaCoCo coverage exceeds threshold |

Custom judge types can be registered via JudgeFactory.register().
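
The two coverage checks come down to simple arithmetic on JaCoCo's instruction counters, which report `missed` and `covered` counts. The sketch below shows that arithmetic only. The class `CoverageSketch` is hypothetical, and the exact comparisons (`>=` against the baseline and against `min`) are assumptions, since the README does not state whether the built-in judges treat the boundary as a pass.

```java
// Hypothetical arithmetic behind the coverage checks; not the built-in judges.
class CoverageSketch {

    /** Instruction coverage in percent from JaCoCo's missed and covered counts. */
    static double percent(long missed, long covered) {
        long total = missed + covered;
        return total == 0 ? 0.0 : 100.0 * covered / total;
    }

    /** coverage-preservation: coverage must not fall below the task's baseline. */
    static boolean preserved(double baselinePercent, double currentPercent) {
        return currentPercent >= baselinePercent; // assumption: equal counts as preserved
    }

    /** coverage-improvement: coverage must reach the configured minimum. */
    static boolean improved(double currentPercent, double minPercent) {
        return currentPercent >= minPercent; // assumption: reaching min counts as a pass
    }
}
```

For the spring-petclinic task, with `baselineCoverage: 0.0` and `min: 50.0`, preservation would hold for any report and improvement would require at least half of the instructions to be covered under these assumptions.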

Programmatic Usage

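import java.nio.file.Path;
import java.util.List;
// Imports for the Agent Bench and Agent Judge types are omitted; their packages are not listed in this README.

// Discover benchmarks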
BenchmarkCatalog catalog = new BenchmarkCatalog(Path.of("benchmarks"));
List<Benchmark> benchmarks = catalog.discover();

// Find a specific benchmark
Benchmark benchmark = benchmarks.stream()
    .filter(b -> b.name().equals("code-coverage"))
    .findFirst()
    .orElseThrow();

// Access tasks
BenchmarkTask task = benchmark.tasks().get(0);
assert task.id().equals("spring-petclinic");
assert task.instruction().contains("JUnit tests");

// Wire judges from YAML config
JudgeFactory factory = new JudgeFactory();
Judge judge = factory.createFromConfig(benchmark.juryConfig());

Available Benchmarks

| Benchmark | Tasks | Status |
| --- | --- | --- |
| `hello-world` | 1 | Working; validates file creation |
| `code-coverage` | 1 (spring-petclinic) | Judges validated, end-to-end run pending |

Related Projects

Agent Judge — Cascaded judge framework (core dependency)
Agent Client — CLI agent integrations (Claude, Gemini)

Contributing

1. Fork the repository
2. Create a feature branch
3. Write tests for new features
4. Ensure ./mvnw clean test passes
5. Open a Pull Request

Licensing

This project originated from earlier Apache-licensed work in the Spring AI Community.

Beginning with version 0.3.0, new development is licensed under the Business Source License 1.1 (BSL).

Historical Apache-licensed portions remain available under their original terms. See LICENSE and LICENSE-APACHE.txt for details.
