Interactive-ARC

An interactive benchmark for evaluating LLM abstract reasoning on ARC-AGI tasks.

Instead of producing output grids directly, models construct solutions step-by-step using tool calls. Every action is recorded, producing interpretable reasoning traces that reveal how a model solves a task, not just whether it does.

Features

Interactive evaluation: models build solutions incrementally using grid-editing tools
Full action traces: every tool call, grid state, and token count is recorded
1,120 public tasks: ships with ARC-AGI-1 (800) and ARC-AGI-2 (320)
Multiple providers: supports Anthropic, Amazon Bedrock, and any OpenAI-compatible endpoint (including vLLM)
Concurrent execution: evaluates tasks in parallel with configurable concurrency
Checkpointing: interrupted runs resume from where they left off
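The concurrency and checkpointing features above can be pictured with a small, self-contained sketch. It is not the package's runner: `evaluate`, `run_all`, and the JSON checkpoint layout are hypothetical names chosen for illustration, and the real implementation may differ.

```python
import asyncio
import json
import tempfile
from pathlib import Path


async def evaluate(task_id: str) -> bool:
    """Hypothetical stand-in for evaluating one task with a model."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return True


async def run_all(task_ids: list[str], checkpoint: Path, concurrency: int = 4) -> dict:
    # Load results from a previous (possibly interrupted) run, if any.
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    sem = asyncio.Semaphore(concurrency)  # caps the number of tasks in flight

    async def worker(task_id: str) -> None:
        async with sem:
            done[task_id] = await evaluate(task_id)
            # Persist after every task so an interruption loses little work.
            checkpoint.write_text(json.dumps(done))

    # Only schedule tasks that are not already in the checkpoint.
    await asyncio.gather(*(worker(t) for t in task_ids if t not in done))
    return done


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "checkpoint.json"
    ckpt.write_text(json.dumps({"a": False}))  # pretend "a" finished earlier
    results = asyncio.run(run_all(["a", "b", "c"], ckpt))
    saved = json.loads(ckpt.read_text())
```

In this sketch task `a` is skipped because it already appears in the checkpoint, so its earlier result is kept while `b` and `c` are evaluated.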

Installation

pip install interactive-arc

Requires Python 3.12+.

Quick Start

With a cloud provider

# Anthropic
export ANTHROPIC_API_KEY=your-key
interactive-arc run --provider anthropic --model claude-sonnet-4-20250514

# Amazon Bedrock (uses default AWS credentials)
interactive-arc run --provider bedrock --model anthropic.claude-sonnet-4-20250514-v1:0

With a local model (vLLM, Ollama, etc.)

interactive-arc run \
    --provider openai \
    --base-url http://localhost:8000/v1 \
    --model Qwen/Qwen3.6-27B \
    --dataset arc-agi-1 \
    --split training \
    --sample 50 --seed 42

Inspect a single task

interactive-arc task --task-id 08ed6ac7 --provider anthropic --model claude-sonnet-4-20250514

CLI Reference

interactive-arc run [OPTIONS]

Option	Default	Description
`--dataset`	`arc-agi-1`	Dataset (`arc-agi-1` or `arc-agi-2`)
`--split`	`training`	Split (`training` or `evaluation`)
`--provider`	`bedrock`	LLM provider (`anthropic`, `bedrock`, `openai`)
`--model`		Model identifier
`--base-url`		Base URL for OpenAI-compatible endpoints
`--renderer`	`text`	Grid format sent to model (`text`, `json`, `markdown`)
`--sample`	all	Number of tasks to sample
`--seed`		Random seed for reproducible sampling
`--output`		Path for summary statistics JSON
`--traces`	`./traces`	Directory for full trace files
`--max-attempts`	`2`	Submission attempts per task (1-10)
`--enabled-tools`	all	Comma-separated subset of tools to enable
`--grid-feedback`	`both`	Grid state shown after actions (`both`, `output`, `none`)
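As an illustration of what `--sample` combined with `--seed` is meant to guarantee, the sketch below draws a reproducible subset of task IDs. The helper `sample_task_ids` is hypothetical and only assumes that sampling uses a seeded random generator; the CLI's actual selection logic may differ.

```python
from __future__ import annotations

import random


def sample_task_ids(task_ids: list[str], sample: int | None, seed: int | None) -> list[str]:
    """Return a reproducible subset of task IDs (hypothetical helper)."""
    if sample is None or sample >= len(task_ids):
        return list(task_ids)  # no sampling requested: keep every task
    rng = random.Random(seed)  # dedicated generator; global random state is untouched
    # Sort first so the result does not depend on the order tasks were listed in.
    return rng.sample(sorted(task_ids), sample)


ids = [f"task{i:03d}" for i in range(400)]
first = sample_task_ids(ids, 50, 42)
second = sample_task_ids(ids, 50, 42)
```

Running the helper twice with the same seed yields the same 50 tasks, which is what makes results comparable across models.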

Tools

Models interact with the grid through these tools:

Tool	Description
`set_cell(x, y, color)`	Set a single cell
`set_width(width)`	Resize grid width
`set_height(height)`	Resize grid height
`flood_fill(x, y, color)`	Fill connected region
`copy_input()`	Copy test input to output grid
`copy_region(x, y, w, h)`	Copy a rectangular region to clipboard
`paste_region(x, y)`	Paste clipboard at position
`undo()`	Undo last operation
`reset()`	Reset grid to initial state
`submit(explanation)`	Submit current grid as answer
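To make the tool semantics concrete, here is a toy grid supporting three of the tools above. It is a sketch under stated assumptions, not the package's implementation: it assumes `x` is the column and `y` the row, that `flood_fill` uses 4-connectivity, and that `undo` restores a snapshot taken before each change. The class name `ToyGrid` is hypothetical.

```python
from collections import deque


class ToyGrid:
    """Illustrative output grid with an undo history (not the package's class)."""

    def __init__(self, width: int, height: int, fill: int = 0) -> None:
        self.cells = [[fill] * width for _ in range(height)]
        self._history = []  # snapshots taken before each modifying operation

    def _snapshot(self) -> None:
        self._history.append([row[:] for row in self.cells])

    def set_cell(self, x: int, y: int, color: int) -> None:
        self._snapshot()
        self.cells[y][x] = color  # assumption: x is the column, y is the row

    def flood_fill(self, x: int, y: int, color: int) -> None:
        target = self.cells[y][x]
        if target == color:
            return  # nothing would change, so record no snapshot
        self._snapshot()
        queue = deque([(x, y)])
        while queue:
            cx, cy = queue.popleft()
            if not (0 <= cy < len(self.cells) and 0 <= cx < len(self.cells[0])):
                continue  # outside the grid
            if self.cells[cy][cx] != target:
                continue  # different color, or already filled
            self.cells[cy][cx] = color
            # Assumption: regions are 4-connected (no diagonal neighbours).
            queue.extend([(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)])

    def undo(self) -> None:
        if self._history:
            self.cells = self._history.pop()


grid = ToyGrid(3, 3)
grid.set_cell(1, 1, 5)    # mark the centre cell
grid.flood_fill(0, 0, 2)  # fill the surrounding background
after_fill = [row[:] for row in grid.cells]
grid.undo()               # revert the flood fill only
after_undo = [row[:] for row in grid.cells]
```

After the fill, every cell except the centre holds color 2; the single `undo()` call restores the background while keeping the centre cell set.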

Python API

from interactive_arc.environment.loader import TaskLoader
from interactive_arc.environment.tools import ToolExecutor
from interactive_arc.agent.loop import AgentLoop
from interactive_arc.agent.providers.anthropic import AnthropicLLM
from interactive_arc.agent.renderers.text_renderer import TextRenderer

# Load a task
loader = TaskLoader("arc-agi-1", "training")
task = loader.load_task("08ed6ac7")

# Create an agent and solve
llm = AnthropicLLM(model="claude-sonnet-4-20250514")
loop = AgentLoop(task=task, llm=llm, renderer=TextRenderer())
result = loop.run()

print(f"Solved: {result.success}")
print(f"Actions: {result.total_tool_calls}")

Output

Each run produces:

Summary JSON: success rate, action efficiency, token usage, cost estimates
Trace files: one JSON per task with the full interaction history (every tool call, grid state, LLM response, and timestamps)
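Because traces are plain JSON files, they can be aggregated with a few lines of standard-library code. The sketch below assumes each trace has top-level `success` and `total_tool_calls` keys, mirroring the result attributes used in the Python API example; the actual trace schema is not documented here, so treat these key names and the `summarize_traces` helper as assumptions.

```python
import json
import tempfile
from pathlib import Path


def summarize_traces(trace_dir: Path) -> dict:
    """Aggregate per-task trace files (key names are assumed, not confirmed)."""
    records = [json.loads(p.read_text()) for p in sorted(trace_dir.glob("*.json"))]
    n = len(records)
    solved = sum(1 for r in records if r.get("success"))
    calls = sum(r.get("total_tool_calls", 0) for r in records)
    return {
        "tasks": n,
        "success_rate": solved / n if n else 0.0,
        "mean_tool_calls": calls / n if n else 0.0,
    }


# Demonstrate on two synthetic trace files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    trace_dir = Path(tmp)
    (trace_dir / "a.json").write_text(json.dumps({"success": True, "total_tool_calls": 4}))
    (trace_dir / "b.json").write_text(json.dumps({"success": False, "total_tool_calls": 10}))
    summary = summarize_traces(trace_dir)
```

For a real run, the same function could be pointed at the directory passed to `--traces` once the key names are adjusted to the actual schema.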

Architecture

The codebase follows a three-layer architecture with strict one-directional dependencies:

Environment: grid state machine, tool execution, task loading
Agent: multi-turn LLM interaction loop, provider adapters, grid renderers
Runner: concurrent orchestration, checkpointing, metrics, CLI

Components are swappable via `typing.Protocol` classes: adding a new LLM provider or grid renderer means implementing a single interface, with no changes to other layers.
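The Protocol-based design can be sketched as follows. The interface name `GridRenderer`, its `render` method, and both renderer classes are hypothetical; the package's real interface may use different names and signatures.

```python
import json
from typing import Protocol

Grid = list[list[int]]


class GridRenderer(Protocol):
    """Hypothetical renderer interface: anything with a matching render() fits."""

    def render(self, grid: Grid) -> str: ...


class SpaceSeparatedRenderer:
    """Renders each row as space-separated color indices."""

    def render(self, grid: Grid) -> str:
        return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


class JsonRenderer:
    """Renders the grid as a JSON array of arrays."""

    def render(self, grid: Grid) -> str:
        return json.dumps(grid)


def build_prompt(grid: Grid, renderer: GridRenderer) -> str:
    # The caller depends only on the protocol, never on a concrete renderer.
    return "Current grid:\n" + renderer.render(grid)


text_prompt = build_prompt([[1, 2], [3, 4]], SpaceSeparatedRenderer())
json_prompt = build_prompt([[1, 2], [3, 4]], JsonRenderer())
```

Neither renderer inherits from `GridRenderer`; structural typing is what lets a new implementation plug in without touching the code that consumes it.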

Development

git clone https://github.com/interactive-arc/interactive-arc.git
cd interactive-arc
uv sync --dev
uv run pytest tests/
uv run ruff check src/ tests/

Licence

MIT
