obielin/agent-bench

agent-bench

Benchmark autonomous AI agents on task completion, tool use, goal adherence, and safety. Works with any agent — just provide a callable.


Why agent-bench?

Agents are hard to evaluate. Unlike single LLM calls, agents take multiple steps, call tools, and can drift from their purpose. Most evaluation frameworks require you to restructure your agent. agent-bench doesn't. Wrap your agent in a callable and pass it in.

Five evaluation dimensions:

| Dimension       | Weight | What it measures                    |
|-----------------|--------|-------------------------------------|
| Task completion | 35%    | Did it satisfy success criteria?    |
| Tool use        | 20%    | Did it call the right tools?        |
| Goal adherence  | 20%    | Did it stay on task?                |
| Safety          | 15%    | Was the output safe?                |
| Efficiency      | 10%    | Did it complete within step budget? |
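
To make the weights concrete, here is a minimal sketch of how five per-dimension scores could combine into one overall score. It assumes each dimension is scored in [0, 1] and that the overall score is a plain weighted sum; this README does not show agent-bench's actual aggregation, and the names `WEIGHTS` and `weighted_overall` are hypothetical.

```python
# Weights copied from the table above. The aggregation rule itself is an
# assumption (a plain weighted sum), not agent-bench's confirmed behaviour.
WEIGHTS = {
    "task_completion": 0.35,
    "tool_use": 0.20,
    "goal_adherence": 0.20,
    "safety": 0.15,
    "efficiency": 0.10,
}


def weighted_overall(scores: dict[str, float]) -> float:
    """Combine per-dimension scores in [0, 1] into one weighted score."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Illustrative scores: perfect on three dimensions, half credit on tool use,
# and no credit for efficiency (step budget exceeded).
example = {
    "task_completion": 1.0,
    "tool_use": 0.5,
    "goal_adherence": 1.0,
    "safety": 1.0,
    "efficiency": 0.0,
}
overall = weighted_overall(example)
print(f"{overall:.2f}")  # 0.80
```

Under this assumption, an agent that blows its step budget and calls only half of the expected tools would still clear a threshold of 0.7, because task completion carries the most weight.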

Install

pip install agent-bench

Quick start

from agent_bench import AgentBench, Task, AgentResponse

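def my_agent(instruction: str) -> AgentResponse:
    # run_my_agent is a placeholder for your own agent's entry point;
    # adapt the attribute names below to whatever your agent returns.
    result = run_my_agent(instruction)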
    return AgentResponse(
        output=result.text,
        tools_called=result.tools_used,
        steps=result.step_count,
    )

bench = AgentBench(pass_threshold=0.7)

report = bench.run(
    agent=my_agent,
    tasks=[
        Task(
            id="research_task",
            instruction="Find the current UK base interest rate",
            expected_tools=["search"],
            success_criteria=["base rate", "Bank of England", "%"],
            max_steps=5,
        ),
    ],
)
print(report.summary())
print(f"Pass rate: {report.pass_rate:.0%}")
print(f"Weakest dimension: {report.weakest_dimension.value}")
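
The `Task` fields in the quick start suggest simple checks against the agent's response. The sketch below shows one plausible reading, assuming `success_criteria` are matched as case-insensitive substrings, `expected_tools` are compared against `tools_called`, and `max_steps` is an upper bound on `steps`. The helper names (`criteria_score`, `tool_score`, `within_budget`) are hypothetical and are not part of agent-bench; the library's real scoring may differ.

```python
def criteria_score(output: str, success_criteria: list[str]) -> float:
    """Fraction of criteria found in the output (case-insensitive substring match)."""
    if not success_criteria:
        return 1.0
    text = output.lower()
    hits = sum(1 for criterion in success_criteria if criterion.lower() in text)
    return hits / len(success_criteria)


def tool_score(tools_called: list[str], expected_tools: list[str]) -> float:
    """Fraction of expected tools that the agent actually called."""
    if not expected_tools:
        return 1.0
    called = set(tools_called)
    return sum(1 for tool in expected_tools if tool in called) / len(expected_tools)


def within_budget(steps: int, max_steps: int) -> bool:
    """True when the agent finished in at most max_steps steps."""
    return steps <= max_steps


# Values mirror the quick-start task; the output string is placeholder text.
criteria = ["base rate", "Bank of England", "%"]
full = criteria_score("Bank of England base rate: N% (placeholder)", criteria)
partial = criteria_score("The base rate was unchanged.", criteria)
tools = tool_score(["search", "calculator"], ["search"])
on_budget = within_budget(steps=3, max_steps=5)
print(full, round(partial, 2), tools, on_budget)
```

Substring matching is easy to satisfy by accident (a lone `%` matches almost any numeric answer), so under this reading it pays to choose criteria that are specific to the task.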

Evaluate a single response

# task is a Task and response is an AgentResponse, built as in the quick start
result = bench.evaluate_single(task, response)
print(result.overall_score)
print(result.score_by_dimension)

Linda Oraegbunam | LinkedIn | GitHub
