laminair/agent-runtime

Agent Runtime Library

NOTE: This is an agent-first library. Everything here is built by coding agents and verified by humans through extensive unit, integration, and full benchmark tests. Individual lines of code in the runtime have not been reviewed line by line; rather, the goal is to ensure proper functionality and to optimize execution efficiency over time where necessary.

A modular research system for modern generalist agents, combining multi-modal LLM support with a comprehensive benchmark framework, designed around clean interfaces and reusability.

Features

  • Multi-modal LLM Support: Backends for OpenAI, Anthropic, Ollama, HuggingFace, and Z.ai
  • Advanced Context Compression: Selective, aggressive, summarization, and priority-based compression strategies
  • Benchmark Framework: Ready-to-run GAIA and OfficeBench benchmarks with support for Terminal-Bench, SWE-bench, METR, AgentBench, τ-Bench, LongBench v2, and Context-Bench
  • Tool Calling: Automatic tool calling loop with extensible tool registry
  • File-based Office Operations: Calendar (.ics), Email (.eml), Word/Excel/PDF document operations
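The automatic tool-calling loop pairs an extensible registry with a dispatch step: tools register themselves under a name, and the agent resolves each tool call emitted by the LLM against the registry. The sketch below illustrates the general pattern only; the class and method names (`ToolRegistry`, `register`, `call`) are illustrative, not the actual agent_runtime API.

```python
# Illustrative sketch of a tool registry with automatic dispatch.
# Names here (ToolRegistry, register, call) are NOT the real
# agent_runtime API; they only demonstrate the pattern.
from typing import Callable


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, Callable] = {}

    def register(self, name: str):
        """Decorator that registers a function under a tool name."""
        def decorator(fn: Callable) -> Callable:
            self._tools[name] = fn
            return fn
        return decorator

    def call(self, name: str, **kwargs):
        """Dispatch a tool call parsed from an LLM response."""
        if name not in self._tools:
            raise KeyError(f"Unknown tool: {name}")
        return self._tools[name](**kwargs)


registry = ToolRegistry()


@registry.register("add")
def add(a: int, b: int) -> int:
    return a + b


# An agent loop would parse tool calls from each LLM response and
# dispatch them through the registry until no calls remain.
print(registry.call("add", a=2, b=3))  # 5
```

In the real loop, the `name` and `kwargs` would come from the model's structured tool-call output rather than a direct call.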

Installation

# Install core dependencies
pip install httpx pyyaml

# Install benchmark dependencies
pip install pytest pytest-asyncio pytest-cov

# Install OfficeBench dependencies (optional, for document operations)
pip install -e ".[officebench]"
# Or: pip install icalendar openpyxl python-docx pypdf
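Before running OfficeBench, it can be useful to confirm the optional dependencies are importable. This is a generic sanity check, not part of the library; note that python-docx is imported as `docx`.

```python
# Sanity check for the optional OfficeBench dependencies.
# Not part of agent_runtime; a standalone helper snippet.
import importlib.util

# python-docx installs the "docx" import name.
OPTIONAL_DEPS = ["icalendar", "openpyxl", "docx", "pypdf"]

missing = [m for m in OPTIONAL_DEPS if importlib.util.find_spec(m) is None]
if missing:
    print(f"Missing optional dependencies: {', '.join(missing)}")
else:
    print("All OfficeBench dependencies available.")
```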

System Requirements

For full OfficeBench support:

  • Linux: sudo apt install tesseract-ocr libreoffice
  • macOS: brew install tesseract libreoffice
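A quick way to confirm the system tools are on your PATH (an illustrative helper, not part of the library):

```python
# Check that the system binaries OfficeBench relies on are installed.
import shutil

for binary in ("tesseract", "libreoffice"):
    path = shutil.which(binary)
    status = path if path else "NOT FOUND"
    print(f"{binary}: {status}")
```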

Quick Start

Run GAIA Benchmark

# Using Ollama (default)
python -m benchmarks.runners.unified_runner --benchmark-name gaia --num-tasks 10

# Using OpenAI
python -m benchmarks.runners.unified_runner --benchmark-name gaia --provider openai --model gpt-4 --api-key $OPENAI_API_KEY

Run OfficeBench

# Single-app tasks
python -m benchmarks.runners.unified_runner --benchmark-name officebench --app-filter 1 --num-tasks 10

# Two-app tasks
python -m benchmarks.runners.unified_runner --benchmark-name officebench --app-filter 2 --num-tasks 10

# Three-app tasks
python -m benchmarks.runners.unified_runner --benchmark-name officebench --app-filter 3 --num-tasks 10

Run SWE-Bench

# Run with native execution (recommended)
python -m benchmarks.runners.unified_runner \
    --benchmark-name swebench \
    --model qwen3-coder-next:cloud \
    --num-tasks 10

# Run lite variant (300 tasks)
python -m benchmarks.runners.unified_runner \
    --benchmark-name swebench \
    --variant lite \
    --model glm-4.7-flash

# Filter by repository
python -m benchmarks.runners.unified_runner \
    --benchmark-name swebench \
    --repos pytest-dev__pytest \
    --num-tasks 10

Using Ollama Cloud Models

Ollama cloud models require authentication:

# Sign in to Ollama
ollama signin

# Pull a cloud model
ollama pull qwen3-coder-next:cloud

# Run benchmark with cloud model
python -m benchmarks.runners.unified_runner --benchmark-name officebench --model qwen3-coder-next:cloud

Or set the API key via environment variable:

export OLLAMA_API_KEY=your_api_key
python -m benchmarks.runners.unified_runner --benchmark-name officebench --model qwen3-coder-next:cloud

Project Structure

agent_runtime/          # Core library
├── core/               # Core types (Message, Agent, Tool, etc.)
├── compression/        # Compression strategies
├── config/             # Configuration management
├── engines/            # LLM backends (OpenAI, Ollama, Anthropic, etc.)
└── storage/            # Storage types

benchmarks/             # Benchmark framework
├── gaia/               # GAIA benchmark (agent, adapter, tools)
├── officebench/        # OfficeBench benchmark (agent, adapter, tools)
├── core/               # Shared base classes
├── configs/            # Configuration files
├── runners/            # UnifiedBenchmarkRunner
├── runtime/            # Container runtimes
└── data/               # Bundled task data

tests/                  # Test suite
├── test_agent_runtime.py    # Core runtime tests
└── benchmarks/              # Benchmark framework tests

examples/               # Example scripts
├── gaia/               # GAIA benchmark example
├── officebench/        # OfficeBench example
└── scientific_research_agent.py

Documentation

Running Tests

# Run agent_runtime tests
python3 -m pytest tests/ -v

# Run benchmark tests
python3 -m pytest tests/benchmarks/ -v

# Run all tests with coverage
python3 -m pytest tests/ tests/benchmarks/ -v --cov=agent_runtime --cov=benchmarks --cov-report=term-missing

License

MIT License

About

A runtime with clean interfaces to quickly build and test agents. Enables message routing between LLMs, context compression, and standardized benchmark testing.
