NOTE: This is an agent-first library. Everything in here is built by coding agents and verified by humans through extensive unit, integration, and full benchmark tests. Individual lines of code in the runtime have not been individually verified; rather, the goal is to ensure proper functionality and to optimize execution efficiency over time where necessary.
A modular research system for modern generalist agents, offering multi-modal LLM support and a comprehensive benchmark framework, designed with clean interfaces and reusability in mind.
- Multi-modal LLM Support: OpenAI, Anthropic, Ollama, HuggingFace, and Z.ai backends
- Advanced Context Compression: Selective, aggressive, summarization, and priority-based compression strategies
- Benchmark Framework: Ready-to-run GAIA and OfficeBench benchmarks with support for Terminal-Bench, SWE-bench, METR, AgentBench, τ-Bench, LongBench v2, and Context-Bench
- Tool Calling: Automatic tool-calling loop with an extensible tool registry (see the sketch after this list)
- File-based Office Operations: Calendar (.ics), Email (.eml), Word/Excel/PDF document operations
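To make the tool-calling loop concrete, here is a minimal sketch. Message, Agent, and Tool are real core types (see agent_runtime/core/), but the constructor signatures and the OllamaEngine class name are assumptions, not the verified API; the scripts under examples/ show working usage.

# Hedged sketch: exact class names and signatures are assumptions.
from agent_runtime.core import Agent, Message, Tool   # real core types, assumed import path
from agent_runtime.engines import OllamaEngine        # assumed engine class name

def get_weather(city: str) -> str:
    """Toy tool; a real tool would call an API or read a file."""
    return f"It is sunny in {city}."

# Register the tool and let the agent drive the loop: the engine may emit
# tool calls, the agent executes them and feeds the results back to the
# model until a final answer is produced.
agent = Agent(
    engine=OllamaEngine(model="llama3"),              # assumed signature
    tools=[Tool(name="get_weather", fn=get_weather)],
)
reply = agent.run([Message(role="user", content="What's the weather in Paris?")])
print(reply.content)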
# Install core dependencies
pip install httpx pyyaml
# Install benchmark dependencies
pip install pytest pytest-asyncio pytest-cov
# Install OfficeBench dependencies (optional, for document operations)
pip install -e ".[officebench]"
# Or: pip install icalendar openpyxl python-docx pypdf

For full OfficeBench support:
- Linux: sudo apt install tesseract-ocr libreoffice
- macOS: brew install tesseract libreoffice
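The OfficeBench document tools sit on top of these libraries. As a quick sanity check that the optional dependencies installed correctly, the following snippet uses icalendar directly (plain library usage, not this repo's API) to write a minimal .ics file:

# Plain icalendar usage; no agent_runtime code involved.
from datetime import datetime
from icalendar import Calendar, Event

cal = Calendar()
event = Event()
event.add("summary", "Standup")
event.add("dtstart", datetime(2025, 1, 6, 9, 0))
cal.add_component(event)

with open("standup.ics", "wb") as f:
    f.write(cal.to_ical())   # to_ical() serializes the calendar to bytes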
# Using Ollama (default)
python -m benchmarks.runners.unified_runner --benchmark-name gaia --num-tasks 10
# Using OpenAI
python -m benchmarks.runners.unified_runner --benchmark-name gaia --provider openai --model gpt-4 --api-key $OPENAI_API_KEY

# Single-app tasks
python -m benchmarks.runners.unified_runner --benchmark-name officebench --app-filter 1 --num-tasks 10
# Two-app tasks
python -m benchmarks.runners.unified_runner --benchmark-name officebench --app-filter 2 --num-tasks 10
# Three-app tasks
python -m benchmarks.runners.unified_runner --benchmark-name officebench --app-filter 3 --num-tasks 10

# Run with native execution (recommended)
python -m benchmarks.runners.unified_runner \
--benchmark-name swebench \
--model qwen3-coder-next:cloud \
--num-tasks 10
# Run lite variant (300 tasks)
python -m benchmarks.runners.unified_runner \
--benchmark-name swebench \
--variant lite \
--model glm-4.7-flash
# Filter by repository
python -m benchmarks.runners.unified_runner \
--benchmark-name swebench \
--repos pytest-dev__pytest \
--num-tasks 10

Ollama cloud models require authentication:
# Sign in to Ollama
ollama signin
# Pull a cloud model
ollama pull qwen3-coder-next:cloud
# Run benchmark with cloud model
python -m benchmarks.runners.unified_runner --benchmark-name officebench --model qwen3-coder-next:cloud

Or set the API key via environment variable:
export OLLAMA_API_KEY=your_api_key
python -m benchmarks.runners.unified_runner --benchmark-name officebench --model qwen3-coder-next:cloud

agent_runtime/ # Core library
├── core/ # Core types (Message, Agent, Tool, etc.)
├── compression/ # Compression strategies
├── config/ # Configuration management
├── engines/ # LLM backends (OpenAI, Ollama, Anthropic, etc.)
└── storage/ # Storage types
benchmarks/ # Benchmark framework
├── gaia/ # GAIA benchmark (agent, adapter, tools)
├── officebench/ # OfficeBench benchmark (agent, adapter, tools)
├── core/ # Shared base classes
├── configs/ # Configuration files
├── runners/ # UnifiedBenchmarkRunner
├── runtime/ # Container runtimes
└── data/ # Bundled task data
tests/ # Test suite
├── test_agent_runtime.py # Core runtime tests
└── benchmarks/ # Benchmark framework tests
examples/ # Example scripts
├── gaia/ # GAIA benchmark example
├── officebench/ # OfficeBench example
└── scientific_research_agent.py
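The CLI used throughout the quick start is a thin wrapper over UnifiedBenchmarkRunner in benchmarks/runners/. Driving it from Python might look like the hedged sketch below; the keyword arguments mirror the CLI flags but are assumptions, so check the module for the real signature.

# Hedged sketch: argument names mirror the CLI flags and are assumptions.
from benchmarks.runners.unified_runner import UnifiedBenchmarkRunner

runner = UnifiedBenchmarkRunner(
    benchmark_name="gaia",   # or "officebench", "swebench", ...
    provider="ollama",
    model="llama3",
    num_tasks=10,
)
results = runner.run()       # assumed entry point
print(results)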
- AGENTS.md - Detailed codebase documentation for AI assistants
- benchmarks/README.md - Benchmark framework guide
- examples/README.md - Example usage
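The suite runs on pytest (with pytest-asyncio for async paths). For reference, a new unit test follows the usual conventions; a hedged sketch, where Message is a real core type but its constructor fields are assumed:

# Hedged sketch: Message fields are assumptions; see tests/test_agent_runtime.py for real examples.
from agent_runtime.core import Message

def test_message_roundtrip():
    msg = Message(role="user", content="hello")
    assert msg.role == "user"
    assert msg.content == "hello"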
# Run agent_runtime tests
python3 -m pytest tests/ -v
# Run benchmark tests
python3 -m pytest tests/benchmarks/ -v
# Run all tests with coverage
python3 -m pytest tests/ tests/benchmarks/ -v --cov=agent_runtime --cov=benchmarks --cov-report=term-missing

MIT License