Benchmark autonomous AI agents on task completion, tool use, goal adherence, and safety. Works with any agent — just provide a callable.
Agents are hard to evaluate. Unlike single LLM calls, agents take multiple steps, call tools, and can drift from their purpose. Most evaluation frameworks require you to restructure your agent. agent-bench doesn't. Wrap your agent in a callable and pass it in.
Five evaluation dimensions:
| Dimension | Weight | What it measures |
|---|---|---|
| Task completion | 35% | Did it satisfy success criteria? |
| Tool use | 20% | Did it call the right tools? |
| Goal adherence | 20% | Did it stay on task? |
| Safety | 15% | Was the output safe? |
| Efficiency | 10% | Did it complete within step budget? |
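A plausible reading of the table is that the overall score is a weighted sum of per-dimension scores. The sketch below illustrates that arithmetic in plain Python; the exact combination formula used by agent-bench is an assumption here, and `overall_score` is a hypothetical helper, not part of the library API.

```python
# Weights mirror the dimension table above (they sum to 1.0).
WEIGHTS = {
    "task_completion": 0.35,
    "tool_use": 0.20,
    "goal_adherence": 0.20,
    "safety": 0.15,
    "efficiency": 0.10,
}

def overall_score(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each in [0, 1]."""
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

# An agent that nails the task but is slow and drifts slightly:
scores = {
    "task_completion": 1.0,
    "tool_use": 1.0,
    "goal_adherence": 0.8,
    "safety": 1.0,
    "efficiency": 0.5,
}
print(overall_score(scores))  # ≈ 0.91
```

With weights like these, a hard safety failure (safety = 0) still leaves up to 0.85 of the score on the table, which is why a separate `pass_threshold` matters.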
Install:

```shell
pip install agent-bench
```

Wrap your agent in a callable that returns an `AgentResponse`:

```python
from agent_bench import AgentBench, Task, AgentResponse

def my_agent(instruction: str) -> AgentResponse:
    result = run_my_agent(instruction)
    return AgentResponse(
        output=result.text,
        tools_called=result.tools_used,
        steps=result.step_count,
    )
```
Then run it against a set of tasks:

```python
bench = AgentBench(pass_threshold=0.7)

report = bench.run(
    agent=my_agent,
    tasks=[
        Task(
            id="research_task",
            instruction="Find the current UK base interest rate",
            expected_tools=["search"],
            success_criteria=["base rate", "Bank of England", "%"],
            max_steps=5,
        ),
    ],
)

print(report.summary())
print(f"Pass rate: {report.pass_rate:.0%}")
print(f"Weakest dimension: {report.weakest_dimension.value}")
```

You can also score a single response against a single task:

```python
result = bench.evaluate_single(task, response)
print(result.overall_score)
print(result.score_by_dimension)
```
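The `success_criteria` in the task above read like substrings to be found in the agent's output. That semantics is an assumption; a minimal sketch of such a check (with a hypothetical `meets_criteria` helper, not a library function) would be:

```python
def meets_criteria(output: str, criteria: list[str]) -> bool:
    # Assumed semantics: every criterion must appear in the output,
    # case-insensitively.
    text = output.lower()
    return all(c.lower() in text for c in criteria)

print(meets_criteria(
    "The Bank of England base rate is 5.25%.",
    ["base rate", "Bank of England", "%"],
))  # True
```

If agent-bench matches this way, criteria should be short, distinctive phrases: an over-specific criterion like an exact rate value would fail the moment the rate changes.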