Noemon's ARC-AGI 2 solver

This repository contains Noemon's agentic ARC solver using Gemini 3.1 Pro. It is based on arcprize/arc-agi-benchmarking, which runs ARC-AGI tasks against multiple model adapters. The Noemon solver implementation lives in src/arc_agi_benchmarking/noemon.

Public Eval Results

As of April 2026, the solver achieves state-of-the-art results on the public eval set: 92.5% at roughly $3.9 per task.

Our approach

The core idea is to have a Reasoner agent infer the transformation rule from the training pairs and express it as natural-language instructions. A separate Validator agent then tests those instructions. Once the instructions are validated well enough, the solver uses them to solve the test task. A solution is submitted when it is self-consistent, i.e., it was produced at least twice. Both the learning loop and the solving loop stop after at most 3 iterations; if no self-consistent solution has emerged by then, a judge selects the final solution from the previously generated candidates.

At a high level, the Noemon solver runs the following stages:

reasoner_round_1: infer an initial instruction set from the training pairs.
validator: test those instructions on training pairs.
reasoner_round_2 / reasoner_round_3: refine the instructions or solve the test grid directly, depending on the validation outcome.
final_solver: produce additional independent candidate solutions based on instructions.
judge: choose the best candidate if no duplicate/self-consistent answer was produced.

The Reasoner agent keeps its memory, including the "thought history" over iterations, so it can learn from previous mistakes. Each agent receives a dynamic prompt (src/arc_agi_benchmarking/noemon/prompt_bundle.md) that compiles the information available at the respective stage.
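
The stages above can be sketched as a small control loop. This is a simplified illustration, not the code in src/arc_agi_benchmarking/noemon: the callables `reason`, `validate`, `solve`, and `judge` are hypothetical stand-ins for the agents, and the sketch ignores that later reasoner rounds may also solve the test grid directly.

```python
from collections import Counter
from typing import Callable, Optional

Grid = list[list[int]]


def first_repeated(candidates: list[Grid]) -> Optional[Grid]:
    """Return the first grid that occurs at least twice, else None."""
    counts = Counter(repr(grid) for grid in candidates)
    for grid in candidates:
        if counts[repr(grid)] >= 2:
            return grid
    return None


def run_solver(
    reason: Callable[[list[str]], str],   # hypothetical Reasoner agent
    validate: Callable[[str], bool],      # hypothetical Validator agent
    solve: Callable[[str], Grid],         # hypothetical solving step
    judge: Callable[[list[Grid]], Grid],  # hypothetical judge
    max_iterations: int = 3,
) -> Grid:
    history: list[str] = []   # reasoner memory kept across rounds
    candidates: list[Grid] = []

    # Learning loop: refine instructions until they validate or rounds run out.
    instructions = ""
    for _ in range(max_iterations):
        instructions = reason(history)
        history.append(instructions)
        if validate(instructions):
            break

    # Solving loop: collect candidates until one of them repeats.
    for _ in range(max_iterations):
        candidates.append(solve(instructions))
        answer = first_repeated(candidates)
        if answer is not None:
            return answer

    # No self-consistent answer was produced: let the judge choose.
    return judge(candidates)
```

Comparing candidates by exact equality keeps the self-consistency check cheap; the judge is only consulted when no two candidates agree.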

The following diagram sketches the simplified flow of Noemon's algorithm:

We use the Gemini batch API, which is currently the most reliable way to solve all tasks. Note, however, that depending on current API load a single stage can take up to 24 h, although it usually takes less than 1 h. Expect more than 10 waves for a full run. Batch mode executes the parallelizable stages in waves and waits for the results of each stage before moving on to the next. Thanks to checkpointing, a run can be resumed at any stage, and if a run is resumed while a previously submitted batch job is still alive, polling for that job picks up again. For leaderboard comparability, the reported costs are full API costs rather than discounted batch pricing.
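
The wave-by-wave execution with checkpointing can be pictured roughly as follows. This is a minimal sketch under assumptions: `run_stages`, the `run_stage` callable, and the single JSON checkpoint file are hypothetical, and the actual checkpoint layout and batch-job polling in this repository may look different.

```python
import json
from pathlib import Path
from typing import Callable


def run_stages(
    stages: list[str],
    run_stage: Callable[[str], str],  # hypothetical: submits one wave and waits for it
    checkpoint_path: str,
) -> dict[str, str]:
    """Run stages in order, skipping those already recorded in the checkpoint."""
    path = Path(checkpoint_path)
    done: dict[str, str] = json.loads(path.read_text()) if path.exists() else {}
    for stage in stages:
        if stage in done:
            continue  # finished in an earlier run, nothing to redo
        done[stage] = run_stage(stage)       # blocks until the whole wave returns
        path.write_text(json.dumps(done))    # persist progress after every wave
    return done
```

Writing the checkpoint after each wave means an interrupted run only repeats the stage that was in flight.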

Quickstart

Use the Kaggle notebook or follow these steps:

Clone this repo:

git clone https://github.com/noemon/noemon-arc-agi.git
cd noemon-arc-agi

Install (installs all adapters + SDKs):

pip install .

Download the ARC task data. One option is to clone the upstream dataset repo into data/arc-agi:

git clone https://github.com/fchollet/ARC-AGI.git data/arc-agi

The CLI accepts either a bundled JSON file of tasks or a directory of per-task JSON files.
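
To illustrate the two input layouts, here is a small loader sketch. It is not the repository's loader: `load_tasks` is a hypothetical helper, and it assumes that a bundled file is a single JSON object keyed by task id and that a directory holds one `<task_id>.json` file per task.

```python
import json
from pathlib import Path


def load_tasks(data_dir: str) -> dict[str, dict]:
    """Return {task_id: task} from a bundled JSON file or a directory of task files."""
    path = Path(data_dir)
    if path.is_file():
        # Bundled layout (assumed): one JSON object keyed by task id.
        return json.loads(path.read_text())
    # Directory layout (assumed): one <task_id>.json file per task.
    return {p.stem: json.loads(p.read_text()) for p in sorted(path.glob("*.json"))}
```

Either way the caller ends up with the same mapping from task id to task, so the rest of a pipeline does not need to care which layout was used.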

Single-task dry run (no API keys) with the local random-baseline adapter:

python main.py \
  --data_dir data/sample/tasks \
  --config random-baseline \
  --task_id 66e6c45b \
  --save_submission_dir submissions/random-single \
  --log-level INFO

Run all bundled sample tasks with the random solver:

python cli/run_all.py \
  --config random-baseline \
  --data_dir data/sample/tasks \
  --save_submission_dir submissions \
  --log-level INFO

Score the outputs you just generated:

python src/arc_agi_benchmarking/scoring/scoring.py \
  --task_dir data/sample/tasks \
  --submission_dir submissions/random-baseline \
  --results_dir results/random-baseline

If using the random solver, expect all the attempts to be incorrect.
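
As a rough picture of what "incorrect" means here, the sketch below assumes exact-match scoring in which a test pair counts as solved if any of its attempts equals the expected output grid. The helper names are hypothetical, and scoring.py may compute additional metrics.

```python
Grid = list[list[int]]


def pair_solved(attempts: list[Grid], expected: Grid) -> bool:
    """A test pair is solved if any attempt matches the expected grid exactly."""
    return any(attempt == expected for attempt in attempts)


def task_score(attempts_per_pair: list[list[Grid]], expected_outputs: list[Grid]) -> float:
    """Fraction of a task's test pairs that were solved."""
    solved = sum(
        pair_solved(attempts, expected)
        for attempts, expected in zip(attempts_per_pair, expected_outputs)
    )
    return solved / len(expected_outputs)
```

A random grid almost never matches cell for cell, which is why the random baseline is expected to score zero.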

If you want to run real models, change the config and add the corresponding API keys (see the CLI parameters and Configuring models and providers sections below). For the Noemon solver, use gemini-3-1-pro-noemon-batch and add a GOOGLE_API_KEY.

To re-run the full experiment on the public eval set, run the following:

python cli/run_all.py \
  --data_dir data/arc-agi/data/evaluation \
  --config gemini-3-1-pro-noemon-batch \
  --num_attempts 2

python src/arc_agi_benchmarking/scoring/scoring.py \
  --task_dir data/arc-agi/data/evaluation \
  --submission_dir submissions/gemini-3-1-pro-noemon-batch \
  --results_dir results/gemini-3-1-pro-noemon-batch

CLI parameters

--data_dir: Either a folder containing ARC task .json files (e.g., data/sample/tasks or data/arc-agi/data/evaluation) or a bundled JSON file containing all challenges (e.g., data/arc-agi_evaluation_challenges.json).
--task_list_file: Optional text file listing the tasks to run, one task ID per line.
--config: Model config name from models.yml. Used by both single-task and batch.
--save_submission_dir: Where to write outputs. Use the same flag for single-task and batch.
--num_attempts: How many independent attempts per test pair (set to 2 per official rules).
--retry_attempts: Internal retries within an attempt if the provider call fails.
--log-level: DEBUG|INFO|WARNING|ERROR|CRITICAL|NONE.
--enable-metrics: Toggle metrics collection (saved in metrics_output/).
Scoring-specific:
- --submission_dir: Where your run wrote outputs.
- --results_dir: Where to write aggregated metrics/results.

Configuring models and providers

Each test run is driven by a model config, which holds the settings for that run (max output tokens, temperature, pricing, etc.).

Model configs live in src/arc_agi_benchmarking/models.yml. Example:

- name: "gpt-4o-2024-11-20"   # config name you reference on the CLI; typically includes the reasoning level for clarity (e.g., "-basic", "-advanced")
  model_name: "gpt-4o-2024-11-20"  # provider’s actual model id
  provider: "openai"         # must match an adapter
  max_output_tokens: 4096    # optional; provider-specific
  temperature: 0.0           # optional; provider-specific
  pricing:
    date: "2024-11-20"
    input: 5.00              # USD per 1M input tokens
    output: 15.00            # USD per 1M output tokens

Standard fields: name, model_name, provider, pricing (input/output per 1M tokens, date for traceability).
Provider kwargs: any extra keys become kwargs and are passed directly to the SDK (e.g., temperature, max_output_tokens, stream, etc.).
Rate limits live in provider_config.yml (rate, period per provider).
Environment: set provider keys (e.g., OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY, HUGGING_FACE_API_KEY). Copy .env.example to .env and fill in.
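
Since prices are given per 1M tokens, the cost of a single call follows directly from its token counts. The helper below is only an illustration of that arithmetic; `call_cost_usd` is a hypothetical name, not a function from this repository.

```python
def call_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price: float,   # USD per 1M input tokens, as in the pricing block
    output_price: float,  # USD per 1M output tokens, as in the pricing block
) -> float:
    """Cost of one call in USD, given per-1M-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# 100,000 input and 20,000 output tokens at the example prices above.
print(call_cost_usd(100_000, 20_000, 5.00, 15.00))  # prints 0.8
```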

We use a dedicated adapter with solver-specific runtime parameters in algorithm_params:

- name: "gemini-3-1-pro-noemon-batch"
  model_name: "models/gemini-3.1-pro-preview"
  provider: "noemon-gemini-batch"
  max_output_tokens: 65535
  automatic_function_calling:
    disable: true
  pricing:
    date: "2026-03-02"
    input: 2.00
    output: 12.00
  algorithm_params:
    temperature_reasoner: 1.0
    temperature_validator: 0.7
    temperature_final_solver: 0.7
    temperature_judge: 0.0
    reasoning_reasoner: "high"
    reasoning_validator: "medium"
    reasoning_final_solver: "high"
    reasoning_judge: "high"
    include_grid_metadata: true
    max_reasoner_train_grid_digit_count: 3000
    max_validator_train_grid_digit_count: 2000

automatic_function_calling.disable: true: keeps Gemini from auto-invoking tools.
temperature_<stage>: sampling temperature per Noemon stage (reasoner, validator, final_solver, judge).
reasoning_<stage>: reasoning level per stage, passed through to the provider request options. Note that final_solver refers only to the final solver stage. The temperature and reasoning level of earlier solving stages vary: the reasoner gets its own shot at solving, and for correctly validated tasks a solution attempt is also made with the cheaper validator settings to save cost.
include_grid_metadata: adds extra grid metadata (color count and grid size) to prompts.
max_reasoner_train_grid_digit_count / max_validator_train_grid_digit_count: caps how many training-grid digits/cells are shown to those stages; if the budget is exceeded, prompt construction stops at the last full train pair.
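
The digit budget can be sketched as follows. This is an assumption-laden illustration rather than the actual prompt builder: `select_train_pairs` is a hypothetical helper, and it assumes that both the input and the output grid of a pair count toward the budget.

```python
Grid = list[list[int]]


def count_cells(grid: Grid) -> int:
    """Number of digits/cells in a grid."""
    return sum(len(row) for row in grid)


def select_train_pairs(train_pairs: list[dict], max_digits: int) -> list[dict]:
    """Keep leading train pairs while their combined cell count fits the budget."""
    selected: list[dict] = []
    used = 0
    for pair in train_pairs:
        # Assumption: input and output cells both count toward the budget.
        cost = count_cells(pair["input"]) + count_cells(pair["output"])
        if used + cost > max_digits:
            break  # stop at the last pair that still fits completely
        selected.append(pair)
        used += cost
    return selected
```

Stopping at a pair boundary avoids showing a stage a truncated grid, at the price of leaving some of the budget unused.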
