Benchmark autonomous AI agents on task completion, tool use, goal adherence, and safety. Works with any agent — just provide a callable.
Agents are hard to evaluate. Unlike single LLM calls, agents take multiple steps, call tools, and can drift from their purpose. Most evaluation frameworks require you to restructure your agent. agent-bench doesn't. Wrap your agent in a callable and pass it in.
Five evaluation dimensions:
| Dimension | Weight | What it measures |
|---|---|---|
| Task completion | 35% | Did it satisfy success criteria? |
| Tool use | 20% | Did it call the right tools? |
| Goal adherence | 20% | Did it stay on task? |
| Safety | 15% | Was the output safe? |
| Efficiency | 10% | Did it complete within step budget? |
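A plausible reading of the table is that the overall score is a weighted sum of per-dimension scores. The sketch below illustrates that arithmetic in plain Python; the exact combination formula used by agent-bench is an assumption here, and `overall_score` is a hypothetical helper, not part of the library API.

```python
# Weights mirror the dimension table above (they sum to 1.0).
WEIGHTS = {
    "task_completion": 0.35,
    "tool_use": 0.20,
    "goal_adherence": 0.20,
    "safety": 0.15,
    "efficiency": 0.10,
}

def overall_score(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each in [0, 1]."""
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

# An agent that nails the task but is slow and drifts slightly:
scores = {
    "task_completion": 1.0,
    "tool_use": 1.0,
    "goal_adherence": 0.8,
    "safety": 1.0,
    "efficiency": 0.5,
}
print(overall_score(scores))  # ≈ 0.91
```

With weights like these, a hard safety failure (safety = 0) still leaves up to 0.85 of the score on the table, which is why a separate `pass_threshold` matters.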
Install:

```shell
pip install agent-bench
```

Wrap your agent in a callable that returns an `AgentResponse`:

```python
from agent_bench import AgentBench, Task, AgentResponse

def my_agent(instruction: str) -> AgentResponse:
    result = run_my_agent(instruction)
    return AgentResponse(
        output=result.text,
        tools_called=result.tools_used,
        steps=result.step_count,
    )
```
Then run it against a set of tasks:

```python
bench = AgentBench(pass_threshold=0.7)

report = bench.run(
    agent=my_agent,
    tasks=[
        Task(
            id="research_task",
            instruction="Find the current UK base interest rate",
            expected_tools=["search"],
            success_criteria=["base rate", "Bank of England", "%"],
            max_steps=5,
        ),
    ],
)

print(report.summary())
print(f"Pass rate: {report.pass_rate:.0%}")
print(f"Weakest dimension: {report.weakest_dimension.value}")
```

You can also score a single response against a single task:

```python
result = bench.evaluate_single(task, response)
print(result.overall_score)
print(result.score_by_dimension)
```
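The `success_criteria` in the task above read like substrings to be found in the agent's output. That semantics is an assumption; a minimal sketch of such a check (with a hypothetical `meets_criteria` helper, not a library function) would be:

```python
def meets_criteria(output: str, criteria: list[str]) -> bool:
    # Assumed semantics: every criterion must appear in the output,
    # case-insensitively.
    text = output.lower()
    return all(c.lower() in text for c in criteria)

print(meets_criteria(
    "The Bank of England base rate is 5.25%.",
    ["base rate", "Bank of England", "%"],
))  # True
```

If agent-bench matches this way, criteria should be short, distinctive phrases: an over-specific criterion like an exact rate value would fail the moment the rate changes.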