An interactive benchmark for evaluating LLM abstract reasoning on ARC-AGI tasks.
Instead of producing output grids directly, models construct solutions step-by-step using tool calls. Every action is recorded, producing interpretable reasoning traces that reveal how a model solves a task, not just whether it does.
- Interactive evaluation: models build solutions incrementally using grid-editing tools
- Full action traces: every tool call, grid state, and token count is recorded
- 1,120 public tasks: ships with ARC-AGI-1 (800) and ARC-AGI-2 (320)
- Multiple providers: supports Anthropic, Amazon Bedrock, and any OpenAI-compatible endpoint (including vLLM)
- Concurrent execution: evaluates tasks in parallel with configurable concurrency
- Checkpointing: interrupted runs resume from where they left off
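The last two features, concurrent execution and checkpointing, can be illustrated with a minimal sketch. It assumes an `asyncio`-based runner and a single JSON checkpoint file; `solve_task`, `run_all`, and the checkpoint layout are hypothetical names invented for this example, not the package's actual runner.

```python
import asyncio
import json
import tempfile
from pathlib import Path

CALLS: list[str] = []  # records which tasks were actually evaluated


async def solve_task(task_id: str) -> dict:
    """Hypothetical stand-in for one task evaluation (no LLM call here)."""
    CALLS.append(task_id)
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {"task_id": task_id, "success": len(task_id) % 2 == 0}


async def run_all(task_ids: list[str], checkpoint: Path, concurrency: int = 4) -> dict:
    """Evaluate tasks in parallel, skipping any already in the checkpoint."""
    done: dict[str, dict] = {}
    if checkpoint.exists():
        done = json.loads(checkpoint.read_text())

    limit = asyncio.Semaphore(concurrency)  # caps the number of in-flight tasks

    async def worker(task_id: str) -> None:
        async with limit:
            result = await solve_task(task_id)
        done[task_id] = result
        # Persist after every task so an interrupted run loses little work.
        checkpoint.write_text(json.dumps(done))

    pending = [t for t in task_ids if t not in done]
    await asyncio.gather(*(worker(t) for t in pending))
    return done


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "checkpoint.json"
    first = asyncio.run(run_all(["a1", "b22"], path))
    # A second run with one extra task only evaluates the new task.
    second = asyncio.run(run_all(["a1", "b22", "c333"], path))
```

The semaphore bounds parallelism, and writing the checkpoint after each task is what lets a later run pick up only the unfinished work.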
pip install interactive-arc

Requires Python 3.12+.
# Anthropic
export ANTHROPIC_API_KEY=your-key
interactive-arc run --provider anthropic --model claude-sonnet-4-20250514
# Amazon Bedrock (uses default AWS credentials)
interactive-arc run --provider bedrock --model anthropic.claude-sonnet-4-20250514-v1:0

interactive-arc run \
--provider openai \
--base-url http://localhost:8000/v1 \
--model Qwen/Qwen3.6-27B \
--dataset arc-agi-1 \
--split training \
--sample 50 --seed 42

interactive-arc task --task-id 08ed6ac7 --provider anthropic --model claude-sonnet-4-20250514

interactive-arc run [OPTIONS]
| Option | Default | Description |
|---|---|---|
| `--dataset` | `arc-agi-1` | Dataset (`arc-agi-1` or `arc-agi-2`) |
| `--split` | `training` | Split (`training` or `evaluation`) |
| `--provider` | `bedrock` | LLM provider (`anthropic`, `bedrock`, `openai`) |
| `--model` | | Model identifier |
| `--base-url` | | Base URL for OpenAI-compatible endpoints |
| `--renderer` | `text` | Grid format sent to model (`text`, `json`, `markdown`) |
| `--sample` | all | Number of tasks to sample |
| `--seed` | | Random seed for reproducible sampling |
| `--output` | | Path for summary statistics JSON |
| `--traces` | `./traces` | Directory for full trace files |
| `--max-attempts` | `2` | Submission attempts per task (1-10) |
| `--enabled-tools` | all | Comma-separated subset of tools to enable |
| `--grid-feedback` | `both` | Grid state shown after actions (`both`, `output`, `none`) |
Models interact with the grid through these tools:
| Tool | Description |
|---|---|
| `set_cell(x, y, color)` | Set a single cell |
| `set_width(width)` | Resize grid width |
| `set_height(height)` | Resize grid height |
| `flood_fill(x, y, color)` | Fill connected region |
| `copy_input()` | Copy test input to output grid |
| `copy_region(x, y, w, h)` | Copy a rectangular region to clipboard |
| `paste_region(x, y)` | Paste clipboard at position |
| `undo()` | Undo last operation |
| `reset()` | Reset grid to initial state |
| `submit(explanation)` | Submit current grid as answer |
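To make the semantics of a few of these tools concrete, here is a minimal sketch of `set_cell`, `flood_fill`, and `undo` on a plain list-of-lists grid. It is not the package's `ToolExecutor`: the `GridEditor` class is hypothetical, and it assumes that `x` is the column, `y` is the row, colors are integers, and "connected" means 4-connected (no diagonals).

```python
from collections import deque

Grid = list[list[int]]


class GridEditor:
    """Hypothetical minimal editor used only for illustration."""

    def __init__(self, width: int, height: int) -> None:
        self.grid: Grid = [[0] * width for _ in range(height)]
        self._history: list[Grid] = []

    def _snapshot(self) -> None:
        # Copy every row so later edits cannot mutate the saved state.
        self._history.append([row[:] for row in self.grid])

    def set_cell(self, x: int, y: int, color: int) -> None:
        self._snapshot()
        self.grid[y][x] = color

    def flood_fill(self, x: int, y: int, color: int) -> None:
        target = self.grid[y][x]
        if target == color:
            return  # nothing to change; also avoids an endless loop
        self._snapshot()
        queue = deque([(x, y)])
        while queue:
            cx, cy = queue.popleft()
            if not (0 <= cy < len(self.grid) and 0 <= cx < len(self.grid[0])):
                continue  # outside the grid
            if self.grid[cy][cx] != target:
                continue  # different color, or already filled
            self.grid[cy][cx] = color
            queue.extend([(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)])

    def undo(self) -> None:
        if self._history:
            self.grid = self._history.pop()


editor = GridEditor(3, 3)
for row in range(3):
    editor.set_cell(1, row, 5)  # vertical wall in the middle column
editor.flood_fill(0, 0, 2)      # fills only the left column
filled = [row[:] for row in editor.grid]
editor.undo()                   # restores the grid from before the fill
```

Snapshotting the whole grid before each mutation is the simplest way to support `undo()`; ARC grids are small, so the copying cost is negligible.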
from interactive_arc.environment.loader import TaskLoader
from interactive_arc.environment.tools import ToolExecutor
from interactive_arc.agent.loop import AgentLoop
from interactive_arc.agent.providers.anthropic import AnthropicLLM
from interactive_arc.agent.renderers.text_renderer import TextRenderer
# Load a task
loader = TaskLoader("arc-agi-1", "training")
task = loader.load_task("08ed6ac7")
# Create an agent and solve
llm = AnthropicLLM(model="claude-sonnet-4-20250514")
loop = AgentLoop(task=task, llm=llm, renderer=TextRenderer())
result = loop.run()
print(f"Solved: {result.success}")
print(f"Actions: {result.total_tool_calls}")

Each run produces:
- Summary JSON: success rate, action efficiency, token usage, cost estimates
- Trace files: one JSON per task with the full interaction history (every tool call, grid state, LLM response, and timestamps)
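Because traces are one JSON file per task, they are easy to post-process. The sketch below aggregates a directory of trace files into a small summary. The trace schema is not documented here, so the `success` and `tool_calls` keys, the file names, and the `summarize` helper are all assumptions made for illustration; adjust them to the fields your trace files actually contain.

```python
import json
import tempfile
from pathlib import Path


def summarize(trace_dir: Path) -> dict:
    """Aggregate per-task trace files (hypothetical schema) into summary stats."""
    traces = [json.loads(p.read_text()) for p in sorted(trace_dir.glob("*.json"))]
    solved = [t for t in traces if t["success"]]
    return {
        "tasks": len(traces),
        "success_rate": len(solved) / len(traces) if traces else 0.0,
        "mean_actions_when_solved": (
            sum(len(t["tool_calls"]) for t in solved) / len(solved) if solved else 0.0
        ),
    }


with tempfile.TemporaryDirectory() as tmp:
    trace_dir = Path(tmp)
    # Two made-up traces: one solved in 3 actions, one failed after 5.
    (trace_dir / "task_a.json").write_text(
        json.dumps({"success": True, "tool_calls": ["set_cell", "flood_fill", "submit"]})
    )
    (trace_dir / "task_b.json").write_text(
        json.dumps({"success": False, "tool_calls": ["reset"] * 5})
    )
    summary = summarize(trace_dir)
```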
The codebase follows a three-layer architecture with strict one-directional dependencies:
- Environment: grid state machine, tool execution, task loading
- Agent: multi-turn LLM interaction loop, provider adapters, grid renderers
- Runner: concurrent orchestration, checkpointing, metrics, CLI
Components are swappable via Protocol classes. Adding a new LLM provider or grid renderer requires implementing a single interface with no changes to other layers.
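The Protocol-based design can be sketched as follows. This is an illustration of structural typing with `typing.Protocol`, not the package's real interface: the `GridRenderer` protocol, its `render` method, and both renderer classes are hypothetical names.

```python
import json
from typing import Protocol

Grid = list[list[int]]


class GridRenderer(Protocol):
    """Hypothetical interface: any class with a matching render() qualifies."""

    def render(self, grid: Grid) -> str: ...


class PlainTextRenderer:
    """Renders rows as space-separated digits, one row per line."""

    def render(self, grid: Grid) -> str:
        return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


class CompactJsonRenderer:
    """Renders the grid as a JSON array of arrays."""

    def render(self, grid: Grid) -> str:
        return json.dumps(grid)


def describe(grid: Grid, renderer: GridRenderer) -> str:
    # The caller depends only on the protocol, never on a concrete class.
    return f"Current grid:\n{renderer.render(grid)}"
```

Neither renderer inherits from `GridRenderer`; a static type checker accepts both because their method signatures match, which is what allows a new component to be added without touching the other layers.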
git clone https://github.com/interactive-arc/interactive-arc.git
cd interactive-arc
uv sync --dev
uv run pytest tests/
uv run ruff check src/ tests/

License: MIT