We present agent-corp, a benchmark for evaluating autonomous AI agents in a simulated software company environment. The benchmark consists of 50 realistic tasks spanning code review, bug triage, documentation writing, feature implementation, and project planning. Each task provides a structured workspace with relevant files, issue trackers, and code repositories. We evaluate agents on task completion accuracy, code quality, and adherence to company policies. Initial experiments with GPT-4 and Claude-2 show that current agents achieve 34% and 28% task completion rates respectively, highlighting significant room for improvement in multi-step reasoning and tool use within constrained organizational contexts.
Existing agent benchmarks often focus on isolated capabilities such as coding (HumanEval), web browsing (WebArena), or general reasoning (GAIA). However, real-world software development involves navigating organizational structure, understanding context from multiple sources, and coordinating between different tools and stakeholders. Engineers must read bug reports, review code, update documentation, and communicate decisions within company guidelines.
agent-corp addresses this gap by creating a controlled simulation of a software company with realistic artifacts: codebases in multiple languages, issue tracking systems, documentation sites, and style guides. Tasks require agents to combine information retrieval, code understanding, generation, and decision-making skills while respecting organizational constraints.
The benchmark simulates a mid-sized software company with three product teams (backend services, web frontend, mobile app). Each task instance includes:
- A starting state with Git repositories, issue tracker exports, and documentation
- A natural language task description (e.g., "Review PR #423 and leave comments on any issues")
- Success criteria evaluated programmatically where possible, with human evaluation for subjective tasks
- Access to standard development tools (Git, linters, test runners, documentation search)
Tasks are categorized into five types:
- Code Review: Identify bugs, style violations, or design issues in pull requests
- Bug Triage: Reproduce issues, identify root causes, and suggest fixes
- Documentation: Write or update technical documentation based on code changes
- Implementation: Complete partially implemented features given specifications
- Planning: Break down feature requests into actionable tasks with effort estimates
Agents interact through a command-line interface with access to file operations, shell commands, and simulated company tools (issue tracker API, code search). Each task has a timeout of 30 minutes and a maximum token budget to reflect real-world constraints.
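The per-task limits can be enforced by a thin wrapper around the agent's tool calls. The sketch below is an assumption about how such a harness might look (the `ToolSandbox` class, the token-counting scheme, and the 100k budget are all hypothetical); it uses `subprocess` timeouts for shell commands and a manual counter for the token budget:

```python
import subprocess

TASK_TIMEOUT_S = 30 * 60  # 30-minute wall-clock limit per task
MAX_TOKENS = 100_000      # illustrative token budget


class BudgetExceeded(Exception):
    """Raised when an agent exhausts its per-task token budget."""


class ToolSandbox:
    """Runs shell commands on behalf of the agent within task limits."""

    def __init__(self, token_budget: int = MAX_TOKENS):
        self.tokens_used = 0
        self.token_budget = token_budget

    def charge(self, n_tokens: int) -> None:
        # Called after each model request/response to account for usage.
        self.tokens_used += n_tokens
        if self.tokens_used > self.token_budget:
            raise BudgetExceeded(f"token budget of {self.token_budget} exhausted")

    def run_shell(self, cmd: str, timeout_s: float = 60.0) -> str:
        # Per-command timeout; the harness would also enforce TASK_TIMEOUT_S
        # at the task level.
        proc = subprocess.run(
            cmd, shell=True, capture_output=True, text=True, timeout=timeout_s
        )
        return proc.stdout


sandbox = ToolSandbox(token_budget=10)
print(sandbox.run_shell("echo hello"))
```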
We evaluate agents on three metrics: completion rate (the percentage of tasks meeting their success criteria), code quality (average linter score and test coverage, for implementation tasks), and policy adherence (the fraction of tasks free of company style-guide or security-policy violations).
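Given per-task results, the three aggregate metrics can be computed as below. This is a minimal sketch; the per-task field names (`success`, `lint_score`, `test_coverage`, `policy_ok`) are assumptions for illustration, not the benchmark's result format:

```python
def summarize(results: list[dict]) -> dict:
    """Aggregate per-task results into the three benchmark metrics.

    Each result dict is assumed to carry: success (bool), policy_ok (bool),
    and, for implementation tasks only, lint_score and test_coverage (0-100).
    """
    n = len(results)
    completion_rate = sum(r["success"] for r in results) / n

    # Code quality is only defined for implementation tasks.
    impl = [r for r in results if "lint_score" in r]
    code_quality = (
        sum((r["lint_score"] + r["test_coverage"]) / 2 for r in impl) / len(impl)
        if impl
        else None
    )
    policy_adherence = sum(r["policy_ok"] for r in results) / n
    return {
        "completion_rate": completion_rate,
        "code_quality": code_quality,
        "policy_adherence": policy_adherence,
    }


print(summarize([
    {"success": True, "policy_ok": True, "lint_score": 80, "test_coverage": 60},
    {"success": False, "policy_ok": True},
]))
```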
```shell
pip install agent-corp
```

```python
from agent_corp import Benchmark, Task
from agent_corp.agents import GPT4Agent

# Load the benchmark
benchmark = Benchmark.load("agent-corp-v1")

# Initialize your agent
agent = GPT4Agent(
    model="gpt-4-turbo",
    tools=["bash", "file_editor", "issue_tracker", "git"],
)

# Run a single task
task = benchmark.get_task("code_review_003")
result = task.run(agent, timeout=1800)  # 30-minute timeout, in seconds
print(f"Success: {result.success}")
print(f"Score: {result.score}/100")
print(f"Feedback: {result.feedback}")

# Run the full benchmark
results = benchmark.evaluate(agent, num_workers=4)
results.save("results/gpt4_run1.json")
print(results.summary())
```

We evaluated four agent configurations on agent-corp v1 (50 tasks):
| Agent | Completion Rate | Code Quality | Policy Adherence | Avg. Time (min) |
|---|---|---|---|---|
| GPT-4-turbo | 34% | 72/100 | 81% | 12.3 |
| Claude-2 | 28% | 68/100 | 76% | 14.1 |
| GPT-3.5-turbo | 16% | 54/100 | 69% | 8.7 |
| ReAct baseline | 12% | 51/100 | 64% | 18.2 |
Common failure modes include:
- Incomplete context gathering (42% of failures): agents miss relevant files or documentation
- Incorrect tool use (23%): malformed commands or API calls
- Policy violations (19%): code that violates style guides or security rules
- Reasoning errors (16%): logical mistakes in bug diagnosis or design decisions
Code review tasks had the highest completion rate (48%), while planning tasks were the most challenging (18%). Agents frequently struggled with multi-step tasks that require coordinating several tools.
Detailed results and error analyses are available in the `results/` directory.
- Chen, M. et al. "Evaluating Large Language Models Trained on Code." arXiv:2107.03374 (2021).
- Zhou, S. et al. "WebArena: A Realistic Web Environment for Building Autonomous Agents." arXiv:2307.13854 (2023).
- Yao, S. et al. "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR (2023).
- Jimenez, C. et al. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" arXiv:2310.06770 (2023).
- Liu, X. et al. "AgentBench: Evaluating LLMs as Agents." arXiv:2308.03688 (2023).
MIT