agent-corp

abstract

We present agent-corp, a benchmark for evaluating autonomous AI agents in a simulated software company environment. The benchmark consists of 50 realistic tasks spanning code review, bug triage, documentation writing, feature implementation, and project planning. Each task provides a structured workspace with relevant files, issue trackers, and code repositories. We evaluate agents on task completion accuracy, code quality, and adherence to company policies. Initial experiments with GPT-4-turbo and Claude-2 show that current agents achieve 34% and 28% task completion rates, respectively, highlighting significant room for improvement in multi-step reasoning and tool use within constrained organizational contexts.

background

Existing agent benchmarks often focus on isolated capabilities such as coding (HumanEval), web browsing (WebArena), or general reasoning (GAIA). However, real-world software development involves navigating organizational structure, understanding context from multiple sources, and coordinating between different tools and stakeholders. Engineers must read bug reports, review code, update documentation, and communicate decisions within company guidelines.

agent-corp addresses this gap by creating a controlled simulation of a software company with realistic artifacts: codebases in multiple languages, issue tracking systems, documentation sites, and style guides. Tasks require agents to combine information retrieval, code understanding, generation, and decision-making skills while respecting organizational constraints.

method

The benchmark simulates a mid-sized software company with three product teams (backend services, web frontend, mobile app). Each task instance includes the following (a short inspection sketch follows the list):

  • A starting state with Git repositories, issue tracker exports, and documentation
  • A natural language task description (e.g., "Review PR #423 and leave comments on any issues")
  • Success criteria evaluated programmatically where possible, with human evaluation for subjective tasks
  • Access to standard development tools (Git, linters, test runners, documentation search)
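
For illustration, the snippet below loads one task and inspects these pieces. The task ID and the attribute names (description, repo_path, tools, success_criteria) are assumptions made for this sketch, not necessarily the package's real fields.

from agent_corp import Benchmark

benchmark = Benchmark.load("agent-corp-v1")
task = benchmark.get_task("bug_triage_001")   # illustrative task ID

# The attributes below are assumed names for this sketch.
print(task.description)        # natural language task description
print(task.repo_path)          # starting state: Git repos, tracker export, docs
print(task.tools)              # development tools available to the agent
print(task.success_criteria)   # checks applied once the agent finishes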

Tasks are categorized into five types (a sketch of running a single category follows the list):

  1. Code Review: Identify bugs, style violations, or design issues in pull requests
  2. Bug Triage: Reproduce issues, identify root causes, and suggest fixes
  3. Documentation: Write or update technical documentation based on code changes
  4. Implementation: Complete partially implemented features given specifications
  5. Planning: Break down feature requests into actionable tasks with effort estimates
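
As a sketch, the loop below runs only code-review tasks; benchmark.tasks and task.category are assumed attributes used for illustration and may not match the actual API.

from agent_corp import Benchmark
from agent_corp.agents import GPT4Agent

benchmark = Benchmark.load("agent-corp-v1")
agent = GPT4Agent(
    model="gpt-4-turbo",
    tools=["bash", "file_editor", "issue_tracker", "git"]
)

# `benchmark.tasks` and `task.category` are assumed attributes for this sketch.
for task in (t for t in benchmark.tasks if t.category == "code_review"):
    result = task.run(agent, timeout=1800)
    print(task, result.success)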

Agents interact through a command-line interface with access to file operations, shell commands, and simulated company tools (issue tracker API, code search). Each task has a timeout of 30 minutes and a maximum token budget to reflect real-world constraints.
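
A minimal sketch of a custom agent is shown below. The BaseAgent class and the act(observation) -> command hook are assumptions about the interface, made for illustration only; the real adapter may look different.

from agent_corp.agents import BaseAgent   # assumed base class for this sketch

class ListOnlyAgent(BaseAgent):
    """Toy agent: lists the workspace once through the CLI, then stops."""

    def act(self, observation: str) -> str:
        # `observation` is assumed to carry the previous command's output.
        if not getattr(self, "_listed", False):
            self._listed = True
            return "ls"    # shell command issued through the benchmark CLI
        return "exit"      # assumed sentinel telling the harness the agent is done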

We evaluate on three metrics: completion rate (percentage of tasks meeting success criteria), code quality (average linter score and test coverage for implementation tasks), and policy adherence (violations of company style guides or security policies).
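
The sketch below shows one way to aggregate these metrics from per-task results; the result fields used (success, category, lint_score, policy_violations) are assumptions for illustration rather than the package's API.

def summarize(results):
    """Aggregate completion rate, code quality, and policy adherence."""
    n = len(results)
    completion_rate = sum(r.success for r in results) / n
    impl = [r for r in results if r.category == "implementation"]
    code_quality = sum(r.lint_score for r in impl) / len(impl) if impl else None
    policy_adherence = sum(1 for r in results if not r.policy_violations) / n
    return {
        "completion_rate": completion_rate,
        "code_quality": code_quality,        # averaged over implementation tasks only
        "policy_adherence": policy_adherence,
    }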

install

pip install agent-corp

example

from agent_corp import Benchmark
from agent_corp.agents import GPT4Agent

# Load the benchmark
benchmark = Benchmark.load("agent-corp-v1")

# Initialize your agent
agent = GPT4Agent(
    model="gpt-4-turbo",
    tools=["bash", "file_editor", "issue_tracker", "git"]
)

# Run a single task
task = benchmark.get_task("code_review_003")
result = task.run(agent, timeout=1800)

print(f"Success: {result.success}")
print(f"Score: {result.score}/100")
print(f"Feedback: {result.feedback}")

# Run full benchmark
results = benchmark.evaluate(agent, num_workers=4)
results.save("results/gpt4_run1.json")
print(results.summary())

results

We evaluated four agent configurations on agent-corp v1 (50 tasks):

| Agent | Completion Rate | Code Quality | Policy Adherence | Avg. Time (min) |
|-------|-----------------|--------------|------------------|-----------------|
| GPT-4-turbo | 34% | 72/100 | 81% | 12.3 |
| Claude-2 | 28% | 68/100 | 76% | 14.1 |
| GPT-3.5-turbo | 16% | 54/100 | 69% | 8.7 |
| ReAct baseline | 12% | 51/100 | 64% | 18.2 |

Common failure modes include:

  • Incomplete context gathering (42% of failures): agents miss relevant files or documentation
  • Incorrect tool use (23%): malformed commands or API calls
  • Policy violations (19%): code that violates style guides or security rules
  • Reasoning errors (16%): logical mistakes in bug diagnosis or design decisions

Code review tasks showed the highest completion rate (48%), while planning tasks were the most challenging (18%). Agents frequently struggled with multi-step tasks requiring coordination between tools.

Detailed results and error analysis are available in the results/ directory.

references

  1. Chen, M. et al. "Evaluating Large Language Models Trained on Code." arXiv:2107.03374 (2021).
  2. Zhou, S. et al. "WebArena: A Realistic Web Environment for Building Autonomous Agents." arXiv:2307.13854 (2023).
  3. Yao, S. et al. "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR (2023).
  4. Jimenez, C. et al. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" arXiv:2310.06770 (2023).
  5. Liu, X. et al. "AgentBench: Evaluating LLMs as Agents." arXiv:2308.03688 (2023).

license

MIT
