NOTE: This is an agent-first library. Everything in here is built by coding agents and verified by humans through extensive unit, integration, and full benchmark tests. Individual lines of code in the runtime have not been individually verified; rather, the goal is to ensure proper functionality and to optimize execution efficiency over time where necessary.
A modular research system for modern generalist agents, offering multi-modal LLM support and a comprehensive benchmark framework, designed with clean interfaces and reusability in mind.
- Multi-modal LLM Support: OpenAI, Anthropic, Ollama, HuggingFace, and Z.ai backends
- Advanced Context Compression: Selective, aggressive, summarization, and priority-based compression strategies
- Benchmark Framework: Ready-to-run GAIA and OfficeBench benchmarks with support for Terminal-Bench, SWE-bench, METR, AgentBench, τ-Bench, LongBench v2, and Context-Bench
- Tool Calling: Automatic tool-calling loop with an extensible tool registry (see the sketch after this list)
- File-based Office Operations: Calendar (.ics), Email (.eml), Word/Excel/PDF document operations
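To make the tool-calling loop concrete, here is a minimal sketch. Message, Agent, and Tool are real core types (see agent_runtime/core/), but the constructor signatures and the OllamaEngine class name are assumptions, not the verified API; the scripts under examples/ show working usage.

# Hedged sketch: exact class names and signatures are assumptions.
from agent_runtime.core import Agent, Message, Tool   # real core types, assumed import path
from agent_runtime.engines import OllamaEngine        # assumed engine class name

def get_weather(city: str) -> str:
    """Toy tool; a real tool would call an API or read a file."""
    return f"It is sunny in {city}."

# Register the tool and let the agent drive the loop: the engine may emit
# tool calls, the agent executes them and feeds the results back to the
# model until a final answer is produced.
agent = Agent(
    engine=OllamaEngine(model="llama3"),              # assumed signature
    tools=[Tool(name="get_weather", fn=get_weather)],
)
reply = agent.run([Message(role="user", content="What's the weather in Paris?")])
print(reply.content)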
# Install core dependencies
pip install httpx pyyaml
# Install benchmark dependencies
pip install pytest pytest-asyncio pytest-cov
# Install OfficeBench dependencies (optional, for document operations)
pip install -e ".[officebench]"
# Or: pip install icalendar openpyxl python-docx pypdf

For full OfficeBench support:
- Linux: sudo apt install tesseract-ocr libreoffice
- macOS: brew install tesseract libreoffice
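The OfficeBench document tools sit on top of these libraries. As a quick sanity check that the optional dependencies installed correctly, the following snippet uses icalendar directly (plain library usage, not this repo's API) to write a minimal .ics file:

# Plain icalendar usage; no agent_runtime code involved.
from datetime import datetime
from icalendar import Calendar, Event

cal = Calendar()
event = Event()
event.add("summary", "Standup")
event.add("dtstart", datetime(2025, 1, 6, 9, 0))
cal.add_component(event)

with open("standup.ics", "wb") as f:
    f.write(cal.to_ical())   # to_ical() serializes the calendar to bytes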
# Using Ollama (default)
python -m benchmarks.runners.unified_runner --benchmark-name gaia --num-tasks 10
# Using OpenAI
python -m benchmarks.runners.unified_runner --benchmark-name gaia --provider openai --model gpt-4 --api-key $OPENAI_API_KEY

# Single-app tasks
python -m benchmarks.runners.unified_runner --benchmark-name officebench --app-filter 1 --num-tasks 10
# Two-app tasks
python -m benchmarks.runners.unified_runner --benchmark-name officebench --app-filter 2 --num-tasks 10
# Three-app tasks
python -m benchmarks.runners.unified_runner --benchmark-name officebench --app-filter 3 --num-tasks 10

# Run with native execution (recommended)
python -m benchmarks.runners.unified_runner \
--benchmark-name swebench \
--model qwen3-coder-next:cloud \
--num-tasks 10
# Run lite variant (300 tasks)
python -m benchmarks.runners.unified_runner \
--benchmark-name swebench \
--variant lite \
--model glm-4.7-flash
# Filter by repository
python -m benchmarks.runners.unified_runner \
--benchmark-name swebench \
--repos pytest-dev__pytest \
--num-tasks 10

Ollama cloud models require authentication:
# Sign in to Ollama
ollama signin
# Pull a cloud model
ollama pull qwen3-coder-next:cloud
# Run benchmark with cloud model
python -m benchmarks.runners.unified_runner --benchmark-name officebench --model qwen3-coder-next:cloud

Or set the API key via environment variable:
export OLLAMA_API_KEY=your_api_key
python -m benchmarks.runners.unified_runner --benchmark-name officebench --model qwen3-coder-next:cloud

agent_runtime/ # Core library
├── core/ # Core types (Message, Agent, Tool, etc.)
├── compression/ # Compression strategies
├── config/ # Configuration management
├── engines/ # LLM backends (OpenAI, Ollama, Anthropic, etc.)
└── storage/ # Storage types
benchmarks/ # Benchmark framework
├── gaia/ # GAIA benchmark (agent, adapter, tools)
├── officebench/ # OfficeBench benchmark (agent, adapter, tools)
├── core/ # Shared base classes
├── configs/ # Configuration files
├── runners/ # UnifiedBenchmarkRunner
├── runtime/ # Container runtimes
└── data/ # Bundled task data
tests/ # Test suite
├── test_agent_runtime.py # Core runtime tests
└── benchmarks/ # Benchmark framework tests
examples/ # Example scripts
├── gaia/ # GAIA benchmark example
├── officebench/ # OfficeBench example
└── scientific_research_agent.py
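The CLI used throughout the quick start is a thin wrapper over UnifiedBenchmarkRunner in benchmarks/runners/. Driving it from Python might look like the hedged sketch below; the keyword arguments mirror the CLI flags but are assumptions, so check the module for the real signature.

# Hedged sketch: argument names mirror the CLI flags and are assumptions.
from benchmarks.runners.unified_runner import UnifiedBenchmarkRunner

runner = UnifiedBenchmarkRunner(
    benchmark_name="gaia",   # or "officebench", "swebench", ...
    provider="ollama",
    model="llama3",
    num_tasks=10,
)
results = runner.run()       # assumed entry point
print(results)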
- AGENTS.md - Detailed codebase documentation for AI assistants
- benchmarks/README.md - Benchmark framework guide
- examples/README.md - Example usage
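The suite runs on pytest (with pytest-asyncio for async paths). For reference, a new unit test follows the usual conventions; a hedged sketch, where Message is a real core type but its constructor fields are assumed:

# Hedged sketch: Message fields are assumptions; see tests/test_agent_runtime.py for real examples.
from agent_runtime.core import Message

def test_message_roundtrip():
    msg = Message(role="user", content="hello")
    assert msg.role == "user"
    assert msg.content == "hello"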
# Run agent_runtime tests
python3 -m pytest tests/ -v
# Run benchmark tests
python3 -m pytest tests/benchmarks/ -v
# Run all tests with coverage
python3 -m pytest tests/ tests/benchmarks/ -v --cov=agent_runtime --cov=benchmarks --cov-report=term-missing

MIT License