
interactive-arc/README.md

Interactive-ARC

An interactive benchmark for evaluating LLM abstract reasoning on ARC-AGI tasks.

Instead of producing output grids directly, models construct solutions step-by-step using tool calls. Every action is recorded, producing interpretable reasoning traces that reveal how a model solves a task, not just whether it does.

Features

  • Interactive evaluation: models build solutions incrementally using grid-editing tools
  • Full action traces: every tool call, grid state, and token count is recorded
  • 1,120 public tasks: ships with ARC-AGI-1 (800) and ARC-AGI-2 (320)
  • Multiple providers: supports Anthropic, Amazon Bedrock, and any OpenAI-compatible endpoint (including vLLM)
  • Concurrent execution: evaluates tasks in parallel with configurable concurrency
  • Checkpointing: interrupted runs resume from where they left off
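As an illustration of how bounded concurrency and checkpointing can fit together, here is a minimal sketch. It is not the project's runner: `solve_task`, `load_done`, `run_all`, and the JSON-lines checkpoint format are all hypothetical names and choices made for this example.

```python
import asyncio
import json
import tempfile
from pathlib import Path


async def solve_task(task_id: str) -> dict:
    """Hypothetical stand-in for one model-driven task evaluation."""
    await asyncio.sleep(0)  # placeholder for LLM and tool-call latency
    return {"task_id": task_id, "success": True}


def load_done(checkpoint: Path) -> set[str]:
    """Return the task IDs already recorded in the checkpoint file."""
    if not checkpoint.exists():
        return set()
    with checkpoint.open() as fh:
        return {json.loads(line)["task_id"] for line in fh if line.strip()}


async def run_all(task_ids: list[str], checkpoint: Path, concurrency: int = 4) -> list[str]:
    """Evaluate pending tasks in parallel and return the IDs run in this call."""
    done = load_done(checkpoint)
    pending = [t for t in task_ids if t not in done]
    limit = asyncio.Semaphore(concurrency)  # caps the number of in-flight tasks

    async def worker(task_id: str) -> None:
        async with limit:
            result = await solve_task(task_id)
        # Append one JSON line per finished task so an interrupted run can resume.
        with checkpoint.open("a") as fh:
            fh.write(json.dumps(result) + "\n")

    await asyncio.gather(*(worker(t) for t in pending))
    return pending


# Assumed scenario: the second call simulates resuming with one new task.
checkpoint = Path(tempfile.mkdtemp()) / "checkpoint.jsonl"
first = asyncio.run(run_all(["a", "b"], checkpoint))
second = asyncio.run(run_all(["a", "b", "c"], checkpoint))
```

In this sketch the second call skips the two tasks already present in the checkpoint file and evaluates only the new one.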

Installation

pip install interactive-arc

Requires Python 3.12+.

Quick Start

With a cloud provider

# Anthropic
export ANTHROPIC_API_KEY=your-key
interactive-arc run --provider anthropic --model claude-sonnet-4-20250514

# Amazon Bedrock (uses default AWS credentials)
interactive-arc run --provider bedrock --model anthropic.claude-sonnet-4-20250514-v1:0

With a local model (vLLM, Ollama, etc.)

interactive-arc run \
    --provider openai \
    --base-url http://localhost:8000/v1 \
    --model Qwen/Qwen3.6-27B \
    --dataset arc-agi-1 \
    --split training \
    --sample 50 --seed 42

Inspect a single task

interactive-arc task --task-id 08ed6ac7 --provider anthropic --model claude-sonnet-4-20250514

CLI Reference

interactive-arc run [OPTIONS]
| Option | Default | Description |
|---|---|---|
| --dataset | arc-agi-1 | Dataset (arc-agi-1 or arc-agi-2) |
| --split | training | Split (training or evaluation) |
| --provider | bedrock | LLM provider (anthropic, bedrock, openai) |
| --model | | Model identifier |
| --base-url | | Base URL for OpenAI-compatible endpoints |
| --renderer | text | Grid format sent to model (text, json, markdown) |
| --sample | all | Number of tasks to sample |
| --seed | | Random seed for reproducible sampling |
| --output | | Path for summary statistics JSON |
| --traces | ./traces | Directory for full trace files |
| --max-attempts | 2 | Submission attempts per task (1-10) |
| --enabled-tools | all | Comma-separated subset of tools to enable |
| --grid-feedback | both | Grid state shown after actions (both, output, none) |

Tools

Models interact with the grid through these tools:

| Tool | Description |
|---|---|
| set_cell(x, y, color) | Set a single cell |
| set_width(width) | Resize grid width |
| set_height(height) | Resize grid height |
| flood_fill(x, y, color) | Fill connected region |
| copy_input() | Copy test input to output grid |
| copy_region(x, y, w, h) | Copy a rectangular region to clipboard |
| paste_region(x, y) | Paste clipboard at position |
| undo() | Undo last operation |
| reset() | Reset grid to initial state |
| submit(explanation) | Submit current grid as answer |
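For intuition about what these tools do to the grid, here is a minimal sketch of three of them. It is not the project's ToolExecutor: the `Grid` class is hypothetical, and it assumes that x is the column, y is the row, flood fill uses 4-connectivity, and undo restores a full snapshot.

```python
from collections import deque


class Grid:
    """Hypothetical grid with snapshot-based undo."""

    def __init__(self, width: int, height: int, fill: int = 0) -> None:
        self.cells = [[fill] * width for _ in range(height)]
        self._history: list[list[list[int]]] = []

    def _snapshot(self) -> None:
        # Save a deep copy of the rows so undo() can restore them.
        self._history.append([row[:] for row in self.cells])

    def set_cell(self, x: int, y: int, color: int) -> None:
        self._snapshot()
        self.cells[y][x] = color

    def flood_fill(self, x: int, y: int, color: int) -> None:
        target = self.cells[y][x]
        if target == color:
            return  # nothing to change, and no undo entry is recorded
        self._snapshot()
        queue = deque([(x, y)])
        while queue:
            cx, cy = queue.popleft()
            if not (0 <= cy < len(self.cells) and 0 <= cx < len(self.cells[0])):
                continue  # outside the grid
            if self.cells[cy][cx] != target:
                continue  # different color, or already filled
            self.cells[cy][cx] = color
            # Visit the four orthogonal neighbours (assumed connectivity).
            queue.extend([(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)])

    def undo(self) -> None:
        if self._history:
            self.cells = self._history.pop()


grid = Grid(3, 3)
for row in range(3):
    grid.set_cell(1, row, 5)  # draw a vertical wall in the middle column
grid.flood_fill(0, 0, 2)      # fills only the region left of the wall
filled = [row[:] for row in grid.cells]
grid.undo()                   # reverts the flood fill, keeps the wall
restored = grid.cells
```

The wall stops the fill from reaching the right-hand column, and a single undo reverts the whole fill as one operation.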

Python API

from interactive_arc.environment.loader import TaskLoader
from interactive_arc.environment.tools import ToolExecutor
from interactive_arc.agent.loop import AgentLoop
from interactive_arc.agent.providers.anthropic import AnthropicLLM
from interactive_arc.agent.renderers.text_renderer import TextRenderer

# Load a task
loader = TaskLoader("arc-agi-1", "training")
task = loader.load_task("08ed6ac7")

# Create an agent and solve
llm = AnthropicLLM(model="claude-sonnet-4-20250514")
loop = AgentLoop(task=task, llm=llm, renderer=TextRenderer())
result = loop.run()

print(f"Solved: {result.success}")
print(f"Actions: {result.total_tool_calls}")

Output

Each run produces:

  • Summary JSON: success rate, action efficiency, token usage, cost estimates
  • Trace files: one JSON per task with the full interaction history (every tool call, grid state, LLM response, and timestamps)

Architecture

The codebase follows a three-layer architecture with strict one-directional dependencies:

  1. Environment: grid state machine, tool execution, task loading
  2. Agent: multi-turn LLM interaction loop, provider adapters, grid renderers
  3. Runner: concurrent orchestration, checkpointing, metrics, CLI

Components are swappable via Protocol classes. Adding a new LLM provider or grid renderer requires implementing a single interface with no changes to other layers.
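The Protocol-based swapping described above might look roughly like the following sketch. The names `GridRenderer`, `render`, `CsvRenderer`, and `build_prompt` are assumptions made for illustration; the project's real interface may use different names and signatures.

```python
from typing import Protocol

Grid = list[list[int]]


class GridRenderer(Protocol):
    """Hypothetical renderer interface: anything with a matching render() qualifies."""

    def render(self, grid: Grid) -> str: ...


class CsvRenderer:
    """A new renderer; it needs no base class and no registration."""

    def render(self, grid: Grid) -> str:
        return "\n".join(",".join(str(cell) for cell in row) for row in grid)


def build_prompt(grid: Grid, renderer: GridRenderer) -> str:
    """Depends only on the interface, so any conforming renderer can be passed in."""
    return "Current grid:\n" + renderer.render(grid)


prompt = build_prompt([[1, 2], [3, 4]], CsvRenderer())
```

Because `typing.Protocol` uses structural typing, `CsvRenderer` satisfies `GridRenderer` without importing or subclassing it, which is what lets a new component be added without touching other layers.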

Development

git clone https://github.com/interactive-arc/interactive-arc.git
cd interactive-arc
uv sync --dev
uv run pytest tests/
uv run ruff check src/ tests/

Licence

MIT
