noemon/noemon-arc-agi

Noemon's ARC-AGI 2 solver

License: MIT | Python 3.12+ | ARC-AGI | Gemini 3.1 Pro

This repository contains Noemon's agentic ARC solver using Gemini 3.1 Pro. It is based on arcprize/arc-agi-benchmarking, which runs ARC-AGI tasks against multiple model adapters. The Noemon solver implementation lives in src/arc_agi_benchmarking/noemon.

Public Eval Results

As of April 2026, the solver achieves state-of-the-art results on the public eval set: 92.5% at roughly $3.90 per task.

Leaderboard plot

Our approach

The core idea is to have a Reasoner agent infer the transformation rule from the training pairs and express it as natural-language instructions. Those instructions are then tested by a separate Validator agent. Once the instructions validate well enough, the solver uses them to solve the test task. A solution is submitted when it is self-consistent, i.e., when it was produced at least twice. Both the learning loop and the solving loop stop after at most 3 iterations; if no self-consistent answer has emerged by then, a judge selects the final solution from the previously generated candidates.

At a high level, the Noemon solver runs the following stages:

  • reasoner_round_1: infer an initial instruction set from the training pairs.
  • validator: test those instructions on training pairs.
  • reasoner_round_2 / reasoner_round_3: refine the instructions or solve the test grid directly, depending on the validation outcome.
  • final_solver: produce additional independent candidate solutions based on instructions.
  • judge: choose the best candidate if no duplicate/self-consistent answer was produced.
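The self-consistency rule and the judge fallback described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function names (`grid_key`, `self_consistent_answer`, `select_final`) and the `judge` callable are hypothetical, and grids are assumed to be lists of integer rows as in the ARC task JSON.

```python
from collections import Counter
from typing import Callable, Optional

Grid = list[list[int]]


def grid_key(grid: Grid) -> tuple[tuple[int, ...], ...]:
    """Convert a grid to a hashable form so identical grids can be counted."""
    return tuple(tuple(row) for row in grid)


def self_consistent_answer(candidates: list[Grid]) -> Optional[Grid]:
    """Return the first candidate that was produced at least twice, else None."""
    counts = Counter(grid_key(grid) for grid in candidates)
    for grid in candidates:
        if counts[grid_key(grid)] >= 2:
            return grid
    return None


def select_final(candidates: list[Grid], judge: Callable[[list[Grid]], Grid]) -> Grid:
    """Prefer a self-consistent answer; otherwise defer to the (hypothetical) judge."""
    answer = self_consistent_answer(candidates)
    return answer if answer is not None else judge(candidates)
```

In this sketch the judge is only consulted when no grid repeats, which mirrors the role of the judge stage in the list above.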

The Reasoner agent keeps its memory, including the "thought history" over iterations, so it can learn from previous mistakes. Each agent receives a dynamic prompt (src/arc_agi_benchmarking/noemon/prompt_bundle.md) that compiles the information available at the respective stage.

The following diagram sketches the simplified flow of Noemon's algorithm:

Simplified Noemon solver flow

We use the Gemini batch API, which is currently the most reliable way to solve all tasks. Note, however, that depending on current API load a single stage can take up to 24 hours, though it usually takes less than 1 hour. Batch mode executes the parallelizable stages in waves and waits for the results of each stage before starting the next one; expect more than 10 waves for a full run. Thanks to checkpointing, a run can be resumed at any stage, and if a batch job that was already submitted is still alive when the run is resumed, polling for it picks up again. For leaderboard comparability, the reported costs are full API cost rather than discounted batch pricing.
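To illustrate the wave-and-checkpoint idea in general terms, here is a small sketch of a stage runner that persists results after every wave and skips finished stages on a rerun. It makes no claim about how the repository stores its checkpoints: `run_waves`, the `run_stage` callable, and the JSON checkpoint layout are all assumptions made for this example.

```python
import json
from pathlib import Path
from typing import Any, Callable


def run_waves(
    stages: list[str],
    run_stage: Callable[[str, dict[str, Any]], Any],
    checkpoint_path: str,
) -> dict[str, Any]:
    """Run stages in order, saving results after each wave so a rerun can resume."""
    path = Path(checkpoint_path)
    # Load results of stages that finished in an earlier run, if any.
    done: dict[str, Any] = json.loads(path.read_text()) if path.exists() else {}
    for stage in stages:
        if stage in done:
            continue  # already completed, nothing to resubmit
        # run_stage is assumed to block until the whole wave has returned.
        done[stage] = run_stage(stage, done)
        path.write_text(json.dumps(done))
    return done
```

Writing the checkpoint after every wave means an interrupted run loses at most the stage that was in flight.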

Quickstart

Use the Kaggle notebook or follow these steps:

  1. Clone this repo:
git clone https://github.com/noemon/noemon-arc-agi.git
cd noemon-arc-agi
  2. Install (installs all adapters + SDKs):
pip install .
  3. Download the ARC task data. One option is to clone the upstream dataset repo into data/arc-agi:
git clone https://github.com/fchollet/ARC-AGI.git data/arc-agi

The CLI accepts either a bundled JSON file of tasks or a directory of per-task JSON files.

  4. Single-task dry run (no API keys) with the local random-baseline adapter:
python main.py \
  --data_dir data/sample/tasks \
  --config random-baseline \
  --task_id 66e6c45b \
  --save_submission_dir submissions/random-single \
  --log-level INFO
  5. Run all bundled sample tasks with the random solver:
python cli/run_all.py \
  --config random-baseline \
  --data_dir data/sample/tasks \
  --save_submission_dir submissions \
  --log-level INFO
  6. Score the outputs you just generated:
python src/arc_agi_benchmarking/scoring/scoring.py \
  --task_dir data/sample/tasks \
  --submission_dir submissions/random-baseline \
  --results_dir results/random-baseline

If using the random solver, expect all the attempts to be incorrect.

If you want to run real models, change the config and add the corresponding API keys (see the CLI parameters and Configuring models and providers sections below). For the Noemon solver, use gemini-3-1-pro-noemon-batch and set GOOGLE_API_KEY.

To re-run the full experiment on the public eval set, run the following:

python cli/run_all.py \
  --data_dir data/arc-agi/data/evaluation \
  --config gemini-3-1-pro-noemon-batch \
  --num_attempts 2
python src/arc_agi_benchmarking/scoring/scoring.py \
  --data_dir data/arc-agi/data/evaluation \
  --submission_dir submissions/gemini-3-1-pro-noemon-batch \
  --results_dir results/gemini-3-1-pro-noemon-batch

CLI parameters

  • --data_dir: Either a folder containing ARC task .json files (e.g., data/sample/tasks or data/arc-agi/data/evaluation) or a bundled JSON file containing all challenges (e.g., data/arc-agi_evaluation_challenges.json).
  • --task_list_file: Optional text file containing a list of tasks that should be run. One task id per row.
  • --config: Model config name from models.yml. Used by both single-task and batch.
  • --save_submission_dir: Where to write outputs. Use the same flag for single-task and batch.
  • --num_attempts: How many independent attempts per test pair (set to 2 per official rules).
  • --retry_attempts: Internal retries within an attempt if the provider call fails.
  • --log-level: DEBUG|INFO|WARNING|ERROR|CRITICAL|NONE.
  • --enable-metrics: Toggle metrics collection (saved in metrics_output/).
  • Scoring-specific:
    • --submission_dir: Where your run wrote outputs.
    • --results_dir: Where to write aggregated metrics/results.
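The two accepted forms of --data_dir, together with the optional --task_list_file filter, could be handled along these lines. This is a sketch under stated assumptions rather than the CLI's actual loader: `load_tasks` is a hypothetical name, the file stem is assumed to be the task id for per-task files, and the bundled file is assumed to map task ids to tasks.

```python
import json
from pathlib import Path
from typing import Optional


def load_tasks(data_dir: str, task_list_file: Optional[str] = None) -> dict[str, dict]:
    """Load ARC tasks from a folder of per-task JSON files or from one bundled JSON file."""
    path = Path(data_dir)
    if path.is_dir():
        # Assumption: one JSON file per task, named <task_id>.json.
        tasks = {p.stem: json.loads(p.read_text()) for p in sorted(path.glob("*.json"))}
    else:
        # Assumption: the bundled file is a JSON object keyed by task id.
        tasks = json.loads(path.read_text())
    if task_list_file is not None:
        # One task id per row; blank rows are ignored.
        rows = Path(task_list_file).read_text().splitlines()
        wanted = {row.strip() for row in rows if row.strip()}
        tasks = {task_id: task for task_id, task in tasks.items() if task_id in wanted}
    return tasks
```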

Configuring models and providers

Each test run is driven by a model config, which holds the settings for that run (max output tokens, temperature, pricing, etc.).

Model configs live in src/arc_agi_benchmarking/models.yml. Example:

- name: "gpt-4o-2024-11-20"   # config name you reference on the CLI; typically includes the reasoning level for clarity (e.g., "-basic", "-advanced")
  model_name: "gpt-4o-2024-11-20"  # provider’s actual model id
  provider: "openai"         # must match an adapter
  max_output_tokens: 4096    # optional; provider-specific
  temperature: 0.0           # optional; provider-specific
  pricing:
    date: "2024-11-20"
    input: 5.00              # USD per 1M input tokens
    output: 15.00            # USD per 1M output tokens
  • Standard fields: name, model_name, provider, pricing (input/output per 1M tokens, date for traceability).
  • Provider kwargs: any extra keys become kwargs and are passed directly to the SDK (e.g., temperature, max_output_tokens, stream, etc.).
  • Rate limits live in provider_config.yml (rate, period per provider).
  • Environment: set provider keys (e.g., OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY, HUGGING_FACE_API_KEY). Copy .env.example to .env and fill in.
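Since pricing is expressed in USD per 1M tokens, the cost of a single request follows from its token counts. The helper below is a minimal sketch of that arithmetic; `request_cost_usd` is a hypothetical name and the token counts are made up for illustration, not taken from a real run.

```python
def request_cost_usd(input_tokens: int, output_tokens: int, pricing: dict) -> float:
    """Cost of one request, given prices in USD per 1M input/output tokens."""
    tokens_per_unit = 1_000_000
    total = input_tokens * pricing["input"] + output_tokens * pricing["output"]
    return total / tokens_per_unit


# Pricing block from the example config above.
pricing = {"date": "2024-11-20", "input": 5.00, "output": 15.00}

# Illustrative token counts: 10,000 input tokens and 2,000 output tokens.
cost = request_cost_usd(10_000, 2_000, pricing)
print(f"{cost:.2f}")  # prints 0.08
```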

We use a dedicated adapter with solver-specific runtime parameters in algorithm_params:

- name: "gemini-3-1-pro-noemon-batch"
  model_name: "models/gemini-3.1-pro-preview"
  provider: "noemon-gemini-batch"
  max_output_tokens: 65535
  automatic_function_calling:
    disable: true
  pricing:
    date: "2026-03-02"
    input: 2.00
    output: 12.00
  algorithm_params:
    temperature_reasoner: 1.0
    temperature_validator: 0.7
    temperature_final_solver: 0.7
    temperature_judge: 0.0
    reasoning_reasoner: "high"
    reasoning_validator: "medium"
    reasoning_final_solver: "high"
    reasoning_judge: "high"
    include_grid_metadata: true
    max_reasoner_train_grid_digit_count: 3000
    max_validator_train_grid_digit_count: 2000
  • automatic_function_calling.disable: true: keeps Gemini from auto-invoking tools.
  • temperature_<stage>: sampling temperature per Noemon stage (reasoner, validator, final_solver, judge).
  • reasoning_<stage>: reasoning level per stage, passed through to the provider request options. Note that final_solver refers only to the final solver stage. The temperature and reasoning level of earlier solving stages vary: the reasoner gets a shot at solving, and for correctly validated tasks a solution attempt is also made with the cheaper validator settings to save cost.
  • include_grid_metadata: adds extra grid metadata (color count and grid size) to prompts.
  • max_reasoner_train_grid_digit_count / max_validator_train_grid_digit_count: caps how many training-grid digits/cells are shown to those stages; if the budget is exceeded, prompt construction stops at the last full train pair.
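The grid metadata and the digit budget described in the last two bullets can be pictured with the sketch below. It is an illustration under assumptions, not the solver's code: the function names are hypothetical, a pair is assumed to cost the cell count of its input plus its output, and whether the real implementation always keeps at least one pair is not stated here.

```python
Grid = list[list[int]]


def grid_metadata(grid: Grid) -> dict[str, int]:
    """Grid size and number of distinct colors, as extra prompt context."""
    colors = {cell for row in grid for cell in row}
    width = len(grid[0]) if grid else 0
    return {"height": len(grid), "width": width, "color_count": len(colors)}


def digit_count(grid: Grid) -> int:
    """Number of cells (digits) in a grid."""
    return sum(len(row) for row in grid)


def pairs_within_budget(pairs: list[dict[str, Grid]], max_digits: int) -> list[dict[str, Grid]]:
    """Keep leading train pairs until the next full pair would exceed the budget."""
    kept: list[dict[str, Grid]] = []
    used = 0
    for pair in pairs:
        # Assumption: input and output grids both count toward the budget.
        cost = digit_count(pair["input"]) + digit_count(pair["output"])
        if used + cost > max_digits:
            break  # stop at the last full pair that still fits
        kept.append(pair)
        used += cost
    return kept
```

Stopping at a pair boundary means a stage never sees a training input without its matching output.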
