# Hard‑check reproduction (official Kaggle dataset)
This notebook is the strongest leakage test. It uses the **official ARC Prize 2024 public dataset available on Kaggle**, deletes the eval solutions immediately, builds a clean training/eval dataset, and trains from scratch. Only after inference do we re‑download solutions for scoring.

## Why this proves there is no leakage
- The eval solutions file is deleted before dataset construction and is **not present** during training/inference.
- The constructed `/content/challenges_dihedral_both.json` is built from train challenges + train solutions and eval challenges only; eval outputs are absent by construction.
- The repo is stripped to `src/*.py`, so there are no hidden files or checkpoints.
- Inference uses `SOLUTIONS_PRESENT=False` and produces `submission.json` without any access to solutions.
- Scoring happens **after** inference in the final section; you can skip it and score elsewhere.
- The core files are written in plain Python and PyTorch with no other external libraries. They make no network calls, and no checkpoint is included or loaded.

**Note: this notebook has only been tested on Google Colab**

### The original result had 3 runs
Training - 40GB A100  
Inference - 80GB A100  
Total eval tasks - 400  

1. 8.75% for 21 cents in lifetime compute (887s training, 404s inference)
    - 11 epochs with 10 color augmentations
2. 16% for 38 cents in lifetime compute (1629s training, 700s inference)
    - 21 epochs with 20 color augmentations
3. 27.5% for $1.7 in lifetime compute (7896s training, 2954s inference)
    - 101 epochs with 100 color augmentations

Note: 
- All 3 runs above had an extra dataset called ConceptARC added in training. This dataset is clean
- To reduce the burden of verification, this notebook omits that dataset. Performance drops only slightly, as the reproduced results below show

**Reproduced result without extra dataset** (same epoch and color augments)  
1. 7.88% for 18 cents (720s training, 339s inference)
2. 14.38% for 32 cents (1309s training, 636s inference)
3. 23.38% for $1.44 (6044s training, 2701s inference)
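
As a quick sanity check on the reproduced figures, the percentages can be converted back into task counts out of the 400 evaluation tasks. The sketch below does only that arithmetic and sums the training and inference times. The resulting half-task values would be consistent with fractional credit on tasks that have more than one test input, but that is an assumption about the scorer, not something stated in this notebook.

```python
TOTAL_EVAL_TASKS = 400

# epochs -> (score %, training seconds, inference seconds),
# copied from the reproduced results listed above
REPRODUCED_RUNS = {
    11: (7.88, 720, 339),
    21: (14.38, 1309, 636),
    101: (23.38, 6044, 2701),
}


def summarize(runs, total_tasks=TOTAL_EVAL_TASKS):
    """Convert each score into task-equivalents and sum the wall-clock time."""
    summary = {}
    for epochs, (score_pct, train_s, infer_s) in runs.items():
        summary[epochs] = {
            "task_equivalents": round(score_pct / 100 * total_tasks, 1),
            "total_seconds": train_s + infer_s,
        }
    return summary


for epochs, row in summarize(REPRODUCED_RUNS).items():
    print(epochs, row)
```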

In [None]:
# Choose reproduction configuration
# runconfig = ["no_concept", 11]  # runs the fastest, expect ~8%
runconfig = ["no_concept", 21]  # second fastest, expect ~14%
# runconfig = ["no_concept", 101] # slowest, expect ~23%

## To run this, you need a Kaggle legacy API key
1. Create a Kaggle account
2. In settings, generate a Legacy API key (it has to be legacy; the newer API keys do not seem to work reliably on Google Colab)
    - This downloads a JSON file with your username and API key
3. Copy the username and API key into the cell below
4. Remember to expire your API key after verification
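
If you would rather not paste the key into a cell, one alternative is to upload the JSON file from step 2 and read the credentials from it. This is a minimal sketch, assuming the file uses Kaggle's usual `username` and `key` fields. The helper name `load_kaggle_credentials` and the example path are hypothetical and not part of this notebook.

```python
import json
import os
from pathlib import Path


def load_kaggle_credentials(json_path):
    """Read a kaggle.json file and export its values as environment variables."""
    creds = json.loads(Path(json_path).read_text())
    os.environ["KAGGLE_USERNAME"] = creds["username"]
    os.environ["KAGGLE_KEY"] = creds["key"]
    return creds["username"]


# Hypothetical upload location; adjust to wherever you placed the file.
# load_kaggle_credentials("/content/kaggle.json")
```

The environment variables set here are the same two that the next cell sets by hand.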

In [None]:
!rm -rf /content/sample_data/
import os

os.environ["KAGGLE_USERNAME"] = "USERNAME"
os.environ["KAGGLE_KEY"] = "LEGACY_API_KEY"

# Check if it works
!kaggle competitions list

In [None]:
# Download the official archive and unzip it,
# then delete the eval solutions immediately.
# Also delete everything other than the required grids
# (train challenges, train solutions, and evaluation challenges).
!kaggle competitions download -c arc-prize-2024

!unzip /content/arc-prize-2024.zip -d /content/arc-prize-2024

!rm /content/arc-prize-2024.zip
!rm /content/arc-prize-2024/arc-agi_evaluation_solutions.json
!rm /content/arc-prize-2024/arc-agi_test_challenges.json
!rm /content/arc-prize-2024/sample_submission.json

Feel free to inspect the folder.  
This is the official public dataset without the eval solutions file.

Only the training challenges, the training solutions, and the evaluation challenges remain.
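
The inspection can also be done programmatically. The sketch below compares the folder contents against the three file names this notebook reads later. The helper name `check_remaining_files` is hypothetical, and the check assumes the archive contains no files beyond those deleted above.

```python
from pathlib import Path

# The three files the dataset-building cell reads
EXPECTED_FILES = {
    "arc-agi_training_challenges.json",
    "arc-agi_training_solutions.json",
    "arc-agi_evaluation_challenges.json",
}


def check_remaining_files(root):
    """Return (unexpected, missing) file names relative to the expected set."""
    present = {p.name for p in Path(root).iterdir() if p.is_file()}
    unexpected = sorted(present - EXPECTED_FILES)
    missing = sorted(EXPECTED_FILES - present)
    return unexpected, missing


# In this notebook:
# unexpected, missing = check_remaining_files("/content/arc-prize-2024")
# Both lists should be empty; the eval solutions file must not appear in `unexpected`.
```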

In [None]:
# Create the dataset by combining eval challenges, train challenges, and train solutions,
# then augment every pair with the 8 dihedral transforms
# (NO EVAL SOLUTIONS)
from pathlib import Path
import json

DATA_ROOT = Path("/content/arc-prize-2024")
TRAIN_CHALLENGES = DATA_ROOT / "arc-agi_training_challenges.json"
TRAIN_SOLUTIONS = DATA_ROOT / "arc-agi_training_solutions.json"
EVAL_CHALLENGES = DATA_ROOT / "arc-agi_evaluation_challenges.json"

OUT_PATH = Path("/content/challenges_dihedral_both.json")
OUT_PATH.parent.mkdir(parents=True, exist_ok=True)


def load_json(path):
    with path.open("r") as f:
        return json.load(f)


def extract_output(sol):
    return sol["output"] if isinstance(sol, dict) and "output" in sol else sol


def attach_train_solutions(challenges, solutions):
    combined = {}
    for task_id, task in challenges.items():
        train_pairs = [dict(pair) for pair in task.get("train", [])]
        test_pairs = [dict(pair) for pair in task.get("test", [])]
        sol_list = solutions.get(task_id)
        if sol_list is None:
            raise KeyError(f"Missing solutions for task {task_id}")
        if len(test_pairs) != len(sol_list):
            raise ValueError(
                f"Solution count mismatch for {task_id}: {len(test_pairs)} test vs {len(sol_list)} sols"
            )
        for pair, sol in zip(test_pairs, sol_list):
            pair["output"] = extract_output(sol)
        task_copy = dict(task)
        task_copy["train"] = train_pairs
        task_copy["test"] = test_pairs
        combined[task_id] = task_copy
    return combined


def move_test_to_train(task_map):
    moved = {}
    for task_id, task in task_map.items():
        train_pairs = [dict(pair) for pair in task.get("train", [])]
        test_pairs = [dict(pair) for pair in task.get("test", [])]
        new_task = dict(task)
        new_task["train"] = train_pairs + test_pairs
        new_task.pop("test", None)
        new_task.pop("name", None)
        moved[task_id] = new_task
    return moved


def merge_task_maps(*maps):
    merged = {}
    for task_map in maps:
        for task_id, task in task_map.items():
            if task_id in merged:
                raise ValueError(f"Duplicate task id: {task_id}")
            merged[task_id] = task
    return {task_id: merged[task_id] for task_id in sorted(merged)}


def copy_grid(grid):
    return [list(row) for row in grid]


def rotate90(grid):
    if not grid:
        return []
    return [list(row) for row in zip(*grid[::-1])]


def rotate180(grid):
    return [list(reversed(row)) for row in reversed(grid)]


def rotate270(grid):
    if not grid:
        return []
    return [list(row) for row in zip(*grid)][::-1]


def flip_horizontal(grid):
    return [list(reversed(row)) for row in grid]


def flip_vertical(grid):
    return [list(row) for row in reversed(grid)]


def flip_main_diagonal(grid):
    if not grid:
        return []
    return [list(row) for row in zip(*grid)]


def flip_anti_diagonal(grid):
    return flip_vertical(rotate90(grid))


TRANSFORMS = [
    ("identity", copy_grid),
    ("rot90", rotate90),
    ("rot180", rotate180),
    ("rot270", rotate270),
    ("flip_horizontal", flip_horizontal),
    ("flip_vertical", flip_vertical),
    ("flip_main_diagonal", flip_main_diagonal),
    ("flip_anti_diagonal", flip_anti_diagonal),
]


def augment_pairs(pairs):
    augmented = []
    for pair in pairs:
        input_grid = pair["input"]
        output_grid = pair.get("output")
        for _, transform in TRANSFORMS:
            new_pair = {"input": transform(input_grid)}
            if output_grid is not None:
                new_pair["output"] = transform(output_grid)
            augmented.append(new_pair)
    return augmented


def augment_dataset(challenges):
    augmented = {}
    for task_id, payload in challenges.items():
        new_payload = dict(payload)
        if "train" in payload:
            new_payload["train"] = augment_pairs(list(payload.get("train", [])))
        if "test" in payload:
            new_payload["test"] = augment_pairs(list(payload.get("test", [])))
        augmented[task_id] = new_payload
    return augmented


train_challenges = load_json(TRAIN_CHALLENGES)
train_solutions = load_json(TRAIN_SOLUTIONS)
eval_challenges = load_json(EVAL_CHALLENGES)

train_both = move_test_to_train(
    attach_train_solutions(train_challenges, train_solutions)
)
combined = merge_task_maps(train_both, eval_challenges)

augmented = augment_dataset(combined)
OUT_PATH.write_text(json.dumps(augmented, indent=2))
print(f"Wrote {OUT_PATH} (tasks: {len(augmented)})")


## Clean dataset built
We now have `/content/challenges_dihedral_both.json`, which contains:
- Training inputs + outputs (from the official training set)
- Evaluation inputs only (no eval solutions)
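
One way to confirm the second point is to scan the file for any `test` pair that carries an `output` key. This is a minimal sketch: the helper name `find_test_outputs` and the two task ids in the example are hypothetical, and the example only mirrors the layout produced above (training tasks have their test pairs moved into `train`, evaluation tasks keep input-only test pairs).

```python
def find_test_outputs(dataset):
    """Return (task_id, pair_index) for every test pair that carries an 'output'."""
    leaks = []
    for task_id, task in dataset.items():
        for idx, pair in enumerate(task.get("test", [])):
            if "output" in pair:
                leaks.append((task_id, idx))
    return leaks


# Tiny illustrative dataset with hypothetical task ids
example = {
    "train_task": {"train": [{"input": [[1]], "output": [[2]]}]},
    "eval_task": {
        "train": [{"input": [[0]], "output": [[1]]}],
        "test": [{"input": [[3]]}],
    },
}
print(find_test_outputs(example))  # prints []
```

Run against the real file, `find_test_outputs(json.loads(OUT_PATH.read_text()))` should likewise return an empty list.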

For safety, the original Kaggle dataset folder (`/content/arc-prize-2024/`) is deleted next, so nothing else from the download remains on disk.

In [None]:
!rm -rf /content/arc-prize-2024/

Now we download the model source code, which is written in PyTorch.

Feel free to inspect it: there are no checkpoints.
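
Once the repository has been cloned by the cell below, the "no checkpoints" claim can be checked by searching for files with typical weight-file extensions. This is only a sketch: the helper name `find_weight_files` is hypothetical, and the suffix list is an illustrative assumption, not an exhaustive definition of what a checkpoint can look like.

```python
from pathlib import Path

# Extensions commonly used for saved model weights (illustrative, not exhaustive)
WEIGHT_SUFFIXES = {".pt", ".pth", ".ckpt", ".bin", ".safetensors", ".npz", ".pkl"}


def find_weight_files(root):
    """Return relative paths of files under root whose suffix looks like saved weights."""
    root = Path(root)
    return sorted(
        str(p.relative_to(root))
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in WEIGHT_SUFFIXES
    )


# In this notebook, before training starts:
# print(find_weight_files("/content/mdlARC"))
```

After training, the same scan would report `runs/tiny.pt`, since that is where the training configuration saves the newly trained model.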

In [None]:
root_folder = "content"  # for colab

%cd /$root_folder/
!git clone https://github.com/mvakde/mdlARC.git # add `-b <branch_name> --single-branch` to clone a specific branch
%cd /$root_folder/mdlARC

For safety, we delete every file other than the Python source files in `src/`.

You can inspect the source files: there is no hardcoding and there are no internet calls. No external libraries are used, just PyTorch.
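
To support a manual read-through, the imports in the source files can be listed statically with Python's `ast` module. The sketch below flags imports of a few well-known networking modules. The helper name `find_network_imports` and the module list are assumptions chosen for illustration. A static scan like this cannot detect dynamic imports or shell calls, so it complements reading the code and does not replace it.

```python
import ast
from pathlib import Path

# Well-known networking modules (an illustrative list, not exhaustive)
NETWORK_MODULES = {
    "socket", "urllib", "http", "requests",
    "httpx", "aiohttp", "ftplib", "smtplib",
}


def find_network_imports(src_dir):
    """Return (file name, line number, module) for each import of a listed module."""
    hits = []
    for path in sorted(Path(src_dir).glob("*.py")):
        tree = ast.parse(path.read_text(), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom):
                names = [node.module or ""]
            else:
                continue
            for name in names:
                # Compare only the top-level package, e.g. "urllib" for "urllib.request"
                if name.split(".")[0] in NETWORK_MODULES:
                    hits.append((path.name, node.lineno, name))
    return hits


# In this notebook, after the clone:
# print(find_network_imports("/content/mdlARC/src"))
```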

In [None]:
# delete the notebooks, dataset-building scripts, readme, and images
!rm -rf /$root_folder/mdlARC/interactive-run.ipynb
!rm -rf /$root_folder/mdlARC/clean-env-run.ipynb
!rm -rf /$root_folder/mdlARC/dataset_building_scripts
!rm -rf /$root_folder/mdlARC/readme.md
!rm -rf /$root_folder/mdlARC/img

In [None]:
from pathlib import Path
import argparse
import importlib
import sys

PROJECT_ROOT = Path.cwd()
SRC_DIR = PROJECT_ROOT / "src"
if SRC_DIR.exists() and str(SRC_DIR) not in sys.path:
    sys.path.insert(0, str(SRC_DIR))

import utils, tinytransformer, train

importlib.reload(utils)  # pick up code changes during iteration
importlib.reload(tinytransformer)
importlib.reload(train)

args = {
    # run config
    "num_workers": 0,
    "device": "cuda",  # 'cuda' | 'mps' | 'cpu'
    "do_validate": False,
    "name": "arc1-37M-bs32-101ep-100color-ccdb",  # download file name
    "GPU": "A100",  # just for logging purposes
    # paths - must pass as Path("<path_to_dir>")
    "train_log_file": Path("runs/training_log.txt"),
    "save_path": Path("runs/tiny.pt"),
    "checkpoint_path": None,  # Path("runs/tiny.pt"),  # or None to start from scratch
    "data_path": Path("../challenges_dihedral_both.json"),
    "dihedral_augmented": True,
    # hyperparameters
    "epochs": runconfig[1],
    "batch_size": 32,
    "val_batch_size": 300,
    "enable_color_aug_train": True,
    "enable_color_on_aug_test_split_during_training": True,
    "max_color_augments_train": (runconfig[1] - 1),
    "disable_color_aug_last_epochs": 1,
    "color_aug_seed": 42,
    "lr": 3e-4,
    "warmup_pct": 0.02,
    "wsd_decay_start_pct": 0.8,  # 1.0 = no decay (start at last epoch)
    "lr_floor": 0.01,
    "weight_decay": 0.01,
    "grad_clip": 1.0,
    "dropout": 0.1,
    "seed": 42,
    # Model Architecture
    "d_model": 768,  # 128, 256, 512, 768 | 128, 384, 640
    "n_heads": 12,  # 4, 8, 8/16, 12 | 4, 12, 10
    "d_ff": 3072,  # 512, 1024, 2048, 3072 | 512, 1536, 2560
    "n_layers": 4,  # 4, 6, 16, 16 | 24, 28, 24
    # Visibility toggles
    "log_train_strings": False,
    "log_train_limit": 10,
    "log_inference_prompt": False,
    "inference_temperature": None,
    "inference_top_k": None,
}
cfg = argparse.Namespace(**args)

runs_dir = Path("runs")
runs_dir.mkdir(parents=True, exist_ok=True)
with (runs_dir / "config.txt").open("w") as f:
    for k, v in args.items():
        f.write(f"{k}: {v}\n")

model, dataset, dataloader, device, data_path = train.build_model_and_data(cfg)

Train the model from scratch

In [None]:
# Training only

from time import perf_counter

t_start = perf_counter()

train.train_model(
    cfg,
    model=model,
    dataloader=dataloader,
    dataset=dataset,
    device=device,
    data_path=data_path,
)

t_duration = perf_counter() - t_start
print(f"Training took {t_duration:.2f}s")

with open(Path("runs/timing.txt"), "w") as f:
    f.write(f"Training: {t_duration:.4f} s\n")

In [None]:
# cleaning up memory to run inference
utils.cleanup_memory(globals())


In [None]:
from pathlib import Path
import importlib
import evaluations
import utils

importlib.reload(evaluations)
importlib.reload(utils)

PATH_BOTH = Path("../challenges_dihedral_both.json")

EVAL_CONFIGS = [("eval", runconfig[1] - 1, PATH_BOTH)]

EVAL_BATCH_SIZE = 1300
SPLITS = ["test"]
CHECKPOINT_PATH = Path("runs/tiny.pt")
SOLUTIONS_PRESENT = False
EVAL_TASK_IDS = None  # Set to None to evaluate full dataset, or ["00576224", ...] for specific tasks
LOG_CORRECT_GRIDS = False  # Print the actual grid, IDs, and augmentation indices for fully correct grids

eval_results = evaluations.run_evaluation_configs(
    cfg,
    EVAL_CONFIGS,
    eval_batch_size=EVAL_BATCH_SIZE,
    splits=SPLITS,
    checkpoint_path=CHECKPOINT_PATH,
    include_targets=SOLUTIONS_PRESENT,
    task_ids=EVAL_TASK_IDS,
    log_correct_grids=LOG_CORRECT_GRIDS,
)


In [None]:
# visualisation WITHOUT loading solutions
EVAL_SUB_FOLDER = EVAL_CONFIGS[0][0]
utils.visualize_eval_submissions(EVAL_SUB_FOLDER, mode="submission")


## Stop here for a strict no‑solutions run
At this point, the model has already produced `runs/<run_name>/submission.json` without any access to solutions. **This means the run is clean!**

**The following cells download the solutions, score the submission, and visualise differences with the ground truth.**

If you want a strict no‑solutions audit, stop here and score the submission yourself.
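
For readers who score the file themselves, here is a minimal sketch of an independent scorer. It assumes the ARC Prize 2024 submission layout, in which each task maps to a list of `{"attempt_1": grid, "attempt_2": grid}` entries (one per test input) and the solutions file maps each task to a list of grids. It is not the repository's `utils.score_arc_submission`, and the function name `score_submission` is hypothetical.

```python
def score_submission(solutions, submission):
    """Average per-task score; a test output counts if either attempt matches exactly."""
    total = 0.0
    for task_id, expected_outputs in solutions.items():
        attempts = submission.get(task_id, [])
        correct = 0
        for idx, expected in enumerate(expected_outputs):
            if idx >= len(attempts):
                continue  # no prediction submitted for this test input
            guess = attempts[idx]
            if expected in (guess.get("attempt_1"), guess.get("attempt_2")):
                correct += 1
        # Tasks with several test inputs earn fractional credit
        total += correct / len(expected_outputs)
    return total / len(solutions) if solutions else 0.0


# Toy example with two hypothetical tasks
solutions = {"a": [[[1, 2]]], "b": [[[0]], [[3]]]}
submission = {
    "a": [{"attempt_1": [[9]], "attempt_2": [[1, 2]]}],
    "b": [
        {"attempt_1": [[0]], "attempt_2": [[0]]},
        {"attempt_1": [[4]], "attempt_2": [[5]]},
    ],
}
print(score_submission(solutions, submission))  # prints 0.75
```

In the toy example, task `a` is fully solved and task `b` is half solved, giving (1 + 0.5) / 2 = 0.75.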

In [None]:
# Download the dataset, unzip it, and delete everything except the eval solutions
%cd /content
!kaggle competitions download -c arc-prize-2024

!unzip /content/arc-prize-2024.zip -d /content/arc-prize-2024

!rm /content/arc-prize-2024.zip
!rm /content/arc-prize-2024/arc-agi_training_challenges.json
!rm /content/arc-prize-2024/arc-agi_training_solutions.json
!rm /content/arc-prize-2024/arc-agi_evaluation_challenges.json
!rm /content/arc-prize-2024/arc-agi_test_challenges.json
!rm /content/arc-prize-2024/sample_submission.json

In [None]:
from pathlib import Path
import utils

SOLUTIONS_FILE = Path("arc-prize-2024/arc-agi_evaluation_solutions.json")
SUBMISSION_FILE = Path(f"mdlARC/runs/{EVAL_SUB_FOLDER}/submission.json")

utils.score_arc_submission(SOLUTIONS_FILE, SUBMISSION_FILE)

In [None]:
# Visualise the differences between the ground-truth solutions and the model's submitted answers
EVAL_SUB_FOLDER = EVAL_CONFIGS[0][0]
utils.visualize_eval_submissions(
    EVAL_SUB_FOLDER,
    submission_base="mdlARC/runs",
    solutions_file="arc-prize-2024/arc-agi_evaluation_solutions.json",
    mode="compare",
)

In [None]:
# AAIVR flow visualization for one task/pair (augmented inputs + outputs)
import aaivr

AAIVR_CONFIG_INDEX = 0  # which eval config to inspect
AAIVR_TASK_ID = None  # set to a task id string to override task index
AAIVR_TASK_INDEX = 0  # 0-based in evaluation pipeline order
AAIVR_INPUT_INDEX = 0  # which test pair (base index before dihedral aug)

AAIVR_DATASET_PATH = EVAL_CONFIGS[AAIVR_CONFIG_INDEX][2]
AAIVR_MAX_COLOR_AUG = EVAL_CONFIGS[AAIVR_CONFIG_INDEX][1]
if len(EVAL_CONFIGS[AAIVR_CONFIG_INDEX]) > 3:
    AAIVR_DIHEDRAL = bool(EVAL_CONFIGS[AAIVR_CONFIG_INDEX][3])
else:
    AAIVR_DIHEDRAL = bool(getattr(cfg, "dihedral_augmented", False))

AAIVR_COLOR_SEED = getattr(cfg, "color_aug_seed_eval", None)
if AAIVR_COLOR_SEED is None:
    AAIVR_COLOR_SEED = getattr(cfg, "color_aug_seed", None)
if AAIVR_COLOR_SEED is None:
    AAIVR_COLOR_SEED = getattr(cfg, "seed", 42)

aaivr.visualize_aaivr_flow(
    eval_results[AAIVR_CONFIG_INDEX][1]["test"]["results"],
    dataset_path=AAIVR_DATASET_PATH,
    input_index=AAIVR_INPUT_INDEX,
    task_id=AAIVR_TASK_ID,
    task_index=AAIVR_TASK_INDEX,
    is_dihedral_augmented=AAIVR_DIHEDRAL,
    max_color_augments=AAIVR_MAX_COLOR_AUG,
    color_aug_seed=AAIVR_COLOR_SEED,
)
