We present agent-corp, a benchmark for evaluating autonomous AI agents in a simulated software company environment. The benchmark consists of 50 realistic tasks spanning code review, bug triage, documentation writing, feature implementation, and project planning. Each task provides a structured workspace with relevant files, issue trackers, and code repositories. We evaluate agents on task completion accuracy, code quality, and adherence to company policies. Initial experiments with GPT-4 and Claude-2 show that current agents achieve 34% and 28% task completion rates respectively, highlighting significant room for improvement in multi-step reasoning and tool use within constrained organizational contexts.
Existing agent benchmarks often focus on isolated capabilities such as coding (HumanEval), web browsing (WebArena), or general reasoning (GAIA). However, real-world software development involves navigating organizational structure, understanding context from multiple sources, and coordinating between different tools and stakeholders. Engineers must read bug reports, review code, update documentation, and communicate decisions within company guidelines.
agent-corp addresses this gap by creating a controlled simulation of a software company with realistic artifacts: codebases in multiple languages, issue tracking systems, documentation sites, and style guides. Tasks require agents to combine information retrieval, code understanding, generation, and decision-making skills while respecting organizational constraints.
The benchmark simulates a mid-sized software company with three product teams (backend services, web frontend, mobile app). Each task instance includes:
- A starting state with Git repositories, issue tracker exports, and documentation
- A natural language task description (e.g., "Review PR #423 and leave comments on any issues")
- Success criteria evaluated programmatically where possible, with human evaluation for subjective tasks
- Access to standard development tools (Git, linters, test runners, documentation search)
Tasks are categorized into five types:
- Code Review: Identify bugs, style violations, or design issues in pull requests
- Bug Triage: Reproduce issues, identify root causes, and suggest fixes
- Documentation: Write or update technical documentation based on code changes
- Implementation: Complete partially implemented features given specifications
- Planning: Break down feature requests into actionable tasks with effort estimates
Agents interact through a command-line interface with access to file operations, shell commands, and simulated company tools (issue tracker API, code search). Each task has a timeout of 30 minutes and a maximum token budget to reflect real-world constraints.
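The per-task limits can be enforced by a thin wrapper around the agent's tool calls. The sketch below is an assumption about how such a harness might look (the `ToolSandbox` class, the token-counting scheme, and the 100k budget are all hypothetical); it uses `subprocess` timeouts for shell commands and a manual counter for the token budget:

```python
import subprocess

TASK_TIMEOUT_S = 30 * 60  # 30-minute wall-clock limit per task
MAX_TOKENS = 100_000      # illustrative token budget


class BudgetExceeded(Exception):
    """Raised when an agent exhausts its per-task token budget."""


class ToolSandbox:
    """Runs shell commands on behalf of the agent within task limits."""

    def __init__(self, token_budget: int = MAX_TOKENS):
        self.tokens_used = 0
        self.token_budget = token_budget

    def charge(self, n_tokens: int) -> None:
        # Called after each model request/response to account for usage.
        self.tokens_used += n_tokens
        if self.tokens_used > self.token_budget:
            raise BudgetExceeded(f"token budget of {self.token_budget} exhausted")

    def run_shell(self, cmd: str, timeout_s: float = 60.0) -> str:
        # Per-command timeout; the harness would also enforce TASK_TIMEOUT_S
        # at the task level.
        proc = subprocess.run(
            cmd, shell=True, capture_output=True, text=True, timeout=timeout_s
        )
        return proc.stdout


sandbox = ToolSandbox(token_budget=10)
print(sandbox.run_shell("echo hello"))
```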
We evaluate agents on three metrics: completion rate (the percentage of tasks meeting their success criteria), code quality (average linter score and test coverage, for implementation tasks), and policy adherence (the fraction of tasks free of company style-guide or security-policy violations).
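Given per-task results, the three aggregate metrics can be computed as below. This is a minimal sketch; the per-task field names (`success`, `lint_score`, `test_coverage`, `policy_ok`) are assumptions for illustration, not the benchmark's result format:

```python
def summarize(results: list[dict]) -> dict:
    """Aggregate per-task results into the three benchmark metrics.

    Each result dict is assumed to carry: success (bool), policy_ok (bool),
    and, for implementation tasks only, lint_score and test_coverage (0-100).
    """
    n = len(results)
    completion_rate = sum(r["success"] for r in results) / n

    # Code quality is only defined for implementation tasks.
    impl = [r for r in results if "lint_score" in r]
    code_quality = (
        sum((r["lint_score"] + r["test_coverage"]) / 2 for r in impl) / len(impl)
        if impl
        else None
    )
    policy_adherence = sum(r["policy_ok"] for r in results) / n
    return {
        "completion_rate": completion_rate,
        "code_quality": code_quality,
        "policy_adherence": policy_adherence,
    }


print(summarize([
    {"success": True, "policy_ok": True, "lint_score": 80, "test_coverage": 60},
    {"success": False, "policy_ok": True},
]))
```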
```shell
pip install agent-corp
```

```python
from agent_corp import Benchmark, Task
from agent_corp.agents import GPT4Agent

# Load the benchmark
benchmark = Benchmark.load("agent-corp-v1")

# Initialize your agent
agent = GPT4Agent(
    model="gpt-4-turbo",
    tools=["bash", "file_editor", "issue_tracker", "git"],
)

# Run a single task
task = benchmark.get_task("code_review_003")
result = task.run(agent, timeout=1800)  # 30-minute timeout, in seconds
print(f"Success: {result.success}")
print(f"Score: {result.score}/100")
print(f"Feedback: {result.feedback}")

# Run the full benchmark
results = benchmark.evaluate(agent, num_workers=4)
results.save("results/gpt4_run1.json")
print(results.summary())
```

We evaluated four agent configurations on agent-corp v1 (50 tasks):
| Agent | Completion Rate | Code Quality | Policy Adherence | Avg. Time (min) |
|---|---|---|---|---|
| GPT-4-turbo | 34% | 72/100 | 81% | 12.3 |
| Claude-2 | 28% | 68/100 | 76% | 14.1 |
| GPT-3.5-turbo | 16% | 54/100 | 69% | 8.7 |
| ReAct baseline | 12% | 51/100 | 64% | 18.2 |
Common failure modes include:
- Incomplete context gathering (42% of failures): agents miss relevant files or documentation
- Incorrect tool use (23%): malformed commands or API calls
- Policy violations (19%): code that violates style guides or security rules
- Reasoning errors (16%): logical mistakes in bug diagnosis or design decisions
Code review tasks had the highest completion rate (48%), while planning tasks were the most challenging (18%). Agents frequently struggled with multi-step tasks that require coordinating several tools.
Detailed results and error analyses are available in the `results/` directory.
- Chen, M. et al. "Evaluating Large Language Models Trained on Code." arXiv:2107.03374 (2021).
- Zhou, S. et al. "WebArena: A Realistic Web Environment for Building Autonomous Agents." arXiv:2307.13854 (2023).
- Yao, S. et al. "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR (2023).
- Jimenez, C. et al. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" arXiv:2310.06770 (2023).
- Liu, X. et al. "AgentBench: Evaluating LLMs as Agents." arXiv:2308.03688 (2023).
MIT