# Sanitized reproduction (no Kaggle account)
This notebook is a clean, self‑contained reproduction that **does not require Kaggle credentials**. It downloads the official ARC datasets from GitHub, constructs a training/eval dataset that excludes eval solutions, trains the model from scratch, and runs inference on eval inputs only.

## Why this proves there is no leakage
- The dataset is built from ARC‑1 training pairs (inputs + outputs), ARC‑1 eval **inputs only**, and, optionally, ConceptARC. Eval solutions are never downloaded.
- After building `assets/challenges_dihedral_both.json`, every file (other than core model files) is deleted, so there’s nowhere for solutions to hide.
- Training starts from scratch (`checkpoint_path=None`) and uses only the cleaned dataset file.
- Inference runs with `SOLUTIONS_PRESENT=False` and produces `submission.json` without reading any solutions.
- If you want to be extra strict, delete the final visualization cell and score the submission yourself.
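
One way to check the second bullet after the cleanup cell has run is to list whatever is left under `assets`. This is a minimal sketch, not part of the repository; `remaining_files` is a hypothetical helper name.

```python
from pathlib import Path


def remaining_files(root):
    """Return the sorted relative paths of all regular files under `root`."""
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


if __name__ == "__main__":
    # Run from the repository root after the cleanup cell.
    # If the cleanup worked, the only entry should be challenges_dihedral_both.json.
    print(remaining_files("assets"))
```

Anything else in the printed list is a file the cleanup step missed and is worth opening by hand.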

FYI:
- The core files make no network calls and use no extra libraries (they are written in pure PyTorch), and no pretrained checkpoints are loaded; the only checkpoint read is the one this notebook trains.
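
The claim about libraries can be spot-checked by parsing each core file and listing the top-level modules it imports. Below is a minimal sketch using the standard-library `ast` module. `imported_modules` is a hypothetical helper, and the `src` directory is assumed from the path setup later in this notebook. It only sees static `import` statements, so dynamic imports would still need a manual read.

```python
import ast
from pathlib import Path


def imported_modules(source):
    """Return the set of top-level module names imported by Python source text."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports are skipped
            modules.add(node.module.split(".")[0])
    return modules


if __name__ == "__main__":
    # Assumption: the core files live in ./src, as in this notebook's path setup.
    for path in sorted(Path("src").glob("*.py")):
        print(path.name, sorted(imported_modules(path.read_text())))
```

Names such as `requests`, `urllib`, or `socket` in the output would be the ones to look at more closely.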

## What to change
- Pick a runconfig (with or without ConceptARC) in the config cell below.
- Reduce `EVAL_BATCH_SIZE` if your GPU has less memory than an A100.
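
As a rough starting point for a smaller GPU, the batch size can be scaled in proportion to available memory. This is only a heuristic sketch: `scaled_eval_batch_size` is a hypothetical helper, linear scaling is an assumption rather than a measured property of this model, and the memory figures in the example are illustrative.

```python
def scaled_eval_batch_size(base_batch_size, reference_mem_gb, available_mem_gb):
    """Scale a batch size linearly with GPU memory, clamped to [1, base_batch_size]."""
    if available_mem_gb >= reference_mem_gb:
        return base_batch_size
    return max(1, int(base_batch_size * available_mem_gb / reference_mem_gb))


# 1300 is this notebook's EVAL_BATCH_SIZE; 40 GB and 16 GB are assumed
# memory sizes for the reference and target GPUs, not values from the repo.
print(scaled_eval_batch_size(1300, reference_mem_gb=40, available_mem_gb=16))
```

If inference still runs out of memory at the scaled value, halve it and retry.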


Steps:
- Upload this notebook to Colab or Modal
    - No need to mount your Google Drive or Modal volume
- Decide which experiment you want to reproduce (list below) and modify the config accordingly
- Choose an A100 GPU
- Hit "Run all"

The original run used an extra dataset called ConceptARC. This dataset is clean, but to reduce the verification burden I have added a config that does not use it (performance drops slightly).

Configs
- Run with ConceptARC for 11 epochs (10 color augments) - about 8-9%
- Run with ConceptARC for 101 epochs (100 color augments) - 25-28%
- Run without ConceptARC for 11 epochs (10 color augments) - about 7%
- Run without ConceptARC for 101 epochs (100 color augments) - about 23%
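
How a `runconfig` pair translates into settings in the cells below can be summarized in a few lines. This sketch only restates the notebook's own logic (dataset build config, epoch count, and `epochs - 1` color augments). `describe_runconfig` is a hypothetical helper, not a function in the repository.

```python
def describe_runconfig(runconfig):
    """Map a [dataset, epochs] pair to the settings the notebook derives from it."""
    dataset, epochs = runconfig
    # The notebook treats anything other than "concept" as the no-ConceptARC
    # case; this sketch is stricter and rejects unknown names.
    if dataset not in ("concept", "no_concept"):
        raise ValueError(f"unknown dataset option: {dataset!r}")
    return {
        "build_config": "concept_arc1_clean" if dataset == "concept" else "arc1_clean",
        "epochs": epochs,
        "max_color_augments": epochs - 1,
    }


print(describe_runconfig(["concept", 11]))
```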

In [None]:
# Choose runconfig
runconfig = ["concept", 11]  # expect 8-9%
# runconfig = ["concept", 21]  # expect 16%
# runconfig = ["concept", 101] # expect 25-28%

# These runconfigs reduce the verification burden by dropping the extra dataset, but perf drops slightly
# runconfig = ["no_concept", 11] # expect 7%
# runconfig = ["no_concept", 21] # expect 14%
# runconfig = ["no_concept", 101] # expect 23%

In [None]:
root_folder = "root"
# root_folder = "content" # for colab

%cd /$root_folder/
!git clone -b clean --single-branch https://github.com/mvakde/mdlARC.git # clones only the `clean` branch
%cd /$root_folder/mdlARC

In [None]:
# Build clean train dataset in assets WITHOUT solutions or other clutter
!python dataset_building_scripts/download_datasets.py
!python dataset_building_scripts/group_arc_tasks.py
if runconfig[0] == "concept":
    !python dataset_building_scripts/build_datasets.py --config concept_arc1_clean
else:
    !python dataset_building_scripts/build_datasets.py --config arc1_clean
!python dataset_building_scripts/augment_dataset_dihedral.py

# Delete everything in assets except the final dataset file, then remove notebooks, build scripts, and docs
!find "assets" -mindepth 1 ! -path "assets/challenges_dihedral_both.json" -exec rm -rf -- {} +
!rm -rf /$root_folder/mdlARC/run-script.ipynb
!rm -rf /$root_folder/mdlARC/sanitised-env-run-script.ipynb
!rm -rf /$root_folder/mdlARC/dataset_building_scripts
!rm -rf /$root_folder/mdlARC/readme.md
!rm -rf /$root_folder/mdlARC/img

## Data is now “solution‑free”
At this point, the only dataset file we keep is `assets/challenges_dihedral_both.json`.
- It contains **train inputs/outputs** and **eval inputs only**.
- All other files and folders (including anything that could contain solutions) are deleted.

You can inspect this file before continuing if you want to verify it manually.
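
A manual inspection could look something like the sketch below. The file's schema is not documented here, so this assumes the common ARC "challenges" layout: a dict mapping each task ID to `{"train": [...], "test": [...]}`, where each pair is a dict with an `"input"` grid and possibly an `"output"` grid. `tasks_with_test_outputs` is a hypothetical helper. Training tasks may legitimately appear in its result; what matters is that no evaluation task ID does.

```python
import json
from pathlib import Path


def tasks_with_test_outputs(challenges):
    """Return sorted task IDs whose test pairs include an "output" grid."""
    flagged = []
    for task_id, task in challenges.items():
        if not isinstance(task, dict):
            continue  # layout differs from the assumed one; inspect by hand
        if any("output" in pair for pair in task.get("test", [])):
            flagged.append(task_id)
    return sorted(flagged)


if __name__ == "__main__":
    path = Path("assets/challenges_dihedral_both.json")
    if path.exists():
        ids = tasks_with_test_outputs(json.loads(path.read_text()))
        print(f"{len(ids)} tasks carry test outputs; none should be eval tasks")
```

Comparing the returned IDs against the list of ARC‑1 evaluation task IDs completes the check.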


In [None]:
from pathlib import Path
import argparse
import importlib
import sys

PROJECT_ROOT = Path.cwd()
SRC_DIR = PROJECT_ROOT / "src"
if SRC_DIR.exists() and str(SRC_DIR) not in sys.path:
    sys.path.insert(0, str(SRC_DIR))

import utils, tinytransformer, train

importlib.reload(utils)  # pick up code changes during iteration
importlib.reload(tinytransformer)
importlib.reload(train)

args = {
    # run config
    "num_workers": 0,
    "device": "cuda",  # 'cuda' | 'mps' | 'cpu'
    "do_validate": False,
    "name": "arc1-cleanenv-30M-vvwide-bs32-101ep-100color-ccdb-18dec0430",  # download file name
    "GPU": "A100-noaugreg",  # just for logging purposes
    # paths - must pass as Path("<path_to_dir>")
    "train_log_file": Path("runs/training_log.txt"),
    "save_path": Path("runs/tiny.pt"),
    "checkpoint_path": None,  # Path("runs/tiny.pt"),  # or None to start from scratch
    "data_path": Path("assets/challenges_dihedral_both.json"),
    # hyperparameters
    "epochs": runconfig[1],
    "batch_size": 32,
    "val_batch_size": 300,
    "enable_color_aug_train": True,
    "max_color_augments_train": (runconfig[1] - 1),
    "color_aug_seed": 42,
    "lr": 3e-4,
    "weight_decay": 0.01,
    "grad_clip": 1.0,
    "dropout": 0.1,
    "seed": 42,
    # Model Architecture
    "d_model": 768,  # 128, 256, 512, 768 | 128, 384, 640
    "n_heads": 12,  # 4, 8, 8/16, 12 | 4, 12, 10
    "d_ff": 3072,  # 512, 1024, 2048, 3072 | 512, 1536, 2560
    "n_layers": 4,  # 4, 6, 16, 16 | 24, 28, 24
    # Visibility toggles
    "log_train_strings": False,
    "log_train_limit": 10,
    "log_inference_prompt": False,
}
cfg = argparse.Namespace(**args)

runs_dir = Path("runs")
runs_dir.mkdir(parents=True, exist_ok=True)
with (runs_dir / "config.txt").open("w") as f:
    for k, v in args.items():
        f.write(f"{k}: {v}\n")

model, dataset, dataloader, device, data_path = train.build_model_and_data(cfg)

In [None]:
# Training only

from time import perf_counter

t_start = perf_counter()

# ---
# direct
train.train_model(
    cfg,
    model=model,
    dataloader=dataloader,
    dataset=dataset,
    device=device,
    data_path=data_path,
)


# # periodic checkpointing
# cfg.save_path = Path(f"runs/tiny-{cfg.epochs}.pt")
# for i in range(3):
#   if i != 0:
#     cfg.checkpoint_path = cfg.save_path
#     cfg.save_path = Path(f"runs/tiny-{cfg.epochs*(i+1)}.pt")
#   train.train_model(cfg, model=model, dataloader=dataloader, dataset=dataset, device=device, data_path=data_path)
# ---

t_duration = perf_counter() - t_start
print(f"Training took {t_duration:.2f}s")

with open(Path("runs/timing.txt"), "w") as f:
    f.write(f"Training: {t_duration:.4f} s\n")

In [None]:
# cleaning up memory to run inference
import gc
import torch

# 1. Delete global references to free memory
# Deleting 'model' ensures the inference cell reloads a fresh instance from the checkpoint,
# preventing memory fragmentation or leftover gradients from training.
for name in ["model", "dataset", "dataloader", "optimizer", "scheduler"]:
    if name in globals():
        del globals()[name]

# 2. Reset compiled graph caches (crucial if torch.compile was used)
if hasattr(torch, "_dynamo"):
    torch._dynamo.reset()

# 3. Force garbage collection and clear GPU memory
gc.collect()
torch.cuda.empty_cache()
torch.cuda.ipc_collect()

print(f"GPU cleaned. Allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

## Submission is generated without solutions
The next cells run inference and generate `runs/<run_name>/submission.json` with **no access to eval solutions**.  
Everything stays solution‑free unless you choose to add your own scoring later.
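
If you do add your own scoring, the sketch below shows one way to do it, run separately with a solutions file you obtain yourself. It assumes the layout used by the visualization cell at the end of this notebook: the submission maps each task ID to a list of `attempt_1`/`attempt_2` dicts, and the solutions map each task ID to a list of ground-truth grids. Each task contributes the fraction of its test pairs matched by either attempt. `score_submission` is a hypothetical helper, not part of the repository.

```python
def score_submission(submission, solutions):
    """Return (score, n_tasks); each task adds the fraction of its pairs solved."""
    total = 0.0
    for task_id, gt_grids in solutions.items():
        # Tasks missing from the submission count as unsolved.
        attempts_list = submission.get(task_id, [])
        solved = 0
        for i, gt in enumerate(gt_grids):
            attempts = attempts_list[i] if i < len(attempts_list) else {}
            if gt in (attempts.get("attempt_1"), attempts.get("attempt_2")):
                solved += 1
        if gt_grids:
            total += solved / len(gt_grids)
    return total, len(solutions)


# Tiny illustrative example with made-up grids.
sub = {"a": [{"attempt_1": [[1]], "attempt_2": [[0]]}]}
sol = {"a": [[[1]]], "b": [[[2]]]}
score, n_tasks = score_submission(sub, sol)
print(f"{score:.2f}/{n_tasks}")
```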

In [None]:
import torch
import inference
import tinytransformer
import utils
import sys
import json
import importlib
from pathlib import Path
from time import perf_counter

# Reload modules to pick up changes
importlib.reload(tinytransformer)
importlib.reload(inference)
importlib.reload(utils)

# Define your paths constants
PATH_BOTH = Path("assets/challenges_dihedral_both.json")

# Config List: (Run Name, Max Color Augments, Dataset Path)
EVAL_CONFIGS = [("eval", runconfig[1] - 1, PATH_BOTH)]

# Global settings shared across runs
EVAL_BATCH_SIZE = 1300
SPLITS = ["test"]
CHECKPOINT_PATH = Path("runs/tiny.pt")
SOLUTIONS_PRESENT = False
EVAL_TASK_IDS = None  # Set to None to evaluate full dataset, or ["00576224", ...] for specific tasks
LOG_CORRECT_GRIDS = False  # Print the actual grid, IDs, and augmentation indices for fully correct grids


# Helper class for logging to file and console
class TeeLogger:
    def __init__(self, filepath):
        self.terminal = sys.stdout
        self.log = open(filepath, "w")

    def write(self, message):
        self.terminal.write(message)
        self.log.write(message)

    def flush(self):
        self.terminal.flush()
        self.log.flush()

    def close(self):
        self.log.close()


def run_evaluation_pipeline(run_name, max_color_augments, dataset_path, device):
    print(f"\n{'=' * 60}")
    print(f"STARTING PIPELINE: {run_name} (Color Augs: {max_color_augments})")
    print(f"{'=' * 60}\n")

    # 1. Setup Directories
    base_run_dir = Path("runs") / run_name
    base_run_dir.mkdir(parents=True, exist_ok=True)

    eval_log_path = base_run_dir / "eval_log.txt"
    aaivr_log_path = base_run_dir / "aaivr.txt"
    submission_path = base_run_dir / "submission.json"

    # 2. Update Config
    cfg.checkpoint_path = CHECKPOINT_PATH
    cfg.data_path = dataset_path
    cfg.enable_color_aug_eval = max_color_augments > 0
    cfg.max_color_augments_eval = max_color_augments

    # 3. Build/Rebuild Model & Data
    # We rebuild the dataloader every time to handle the different color augmentation settings
    print("Building model and dataloader for config...")

    # Load checkpoint explicitly to pass to build function
    checkpoint = torch.load(
        cfg.checkpoint_path, map_location=device, weights_only=False
    )

    # Check if model exists in global scope to reuse weights, else create
    global model
    if "model" in globals() and model is not None:
        model.load_state_dict(
            checkpoint["model_state"] if "model_state" in checkpoint else checkpoint,
            strict=False,
        )
        model.eval()
        # Rebuild only dataset/loader
        _, dataset, dataloader, device, _ = train.build_model_and_data(
            cfg, checkpoint=checkpoint
        )
    else:
        model, dataset, dataloader, device, _ = train.build_model_and_data(cfg)

    # 4. Run Inference
    def log_eval(msg):
        print(msg)
        with open(eval_log_path, "a") as f:
            f.write(msg + "\n")

    color_mappings_eval = None
    color_apply_fn = None
    if cfg.enable_color_aug_eval and cfg.max_color_augments_eval > 0:
        color_seed = cfg.color_aug_seed or cfg.seed
        color_mappings_eval = utils.generate_color_mapping_tensors(
            cfg.max_color_augments_eval, color_seed
        )
        color_apply_fn = lambda split: True

    evaluation = inference.evaluate_model_on_dataset(
        model=model,
        dataset=dataset,
        device=device,
        batch_size=EVAL_BATCH_SIZE,
        log_prompts=args["log_inference_prompt"],
        splits=SPLITS,
        color_mappings=color_mappings_eval,
        color_apply_fn=color_apply_fn,
        task_ids=EVAL_TASK_IDS,
        include_targets=SOLUTIONS_PRESENT,
    )

    # 5. Redirect stdout so AAIVR output goes to both the console and aaivr.txt
    if hasattr(sys.stdout, "log"):
        sys.stdout = sys.stdout.terminal  # Reset if needed
    sys.stdout = TeeLogger(str(aaivr_log_path))

    try:
        test_results = evaluation.get("test", {}).get("results", [])
        dataset_has_dihedral_augments = "dihedral_both" in str(cfg.data_path)

        aaivr_results = []
        if test_results:
            aaivr_results = utils.run_aaivr_on_results(
                test_results,
                is_dihedral_augmented=dataset_has_dihedral_augments,
                color_aug_seed=cfg.color_aug_seed,
                max_color_augments=cfg.max_color_augments_eval,
            )

            # # Print Stats (will go to console + aaivr.txt)
            # utils.summarize_aaivr_pass_at_k(aaivr_results)
            # if aaivr_results:
            #     tasks_map = {}
            #     for res in aaivr_results:
            #         if res.task_id not in tasks_map:
            #             tasks_map[res.task_id] = []
            #         tasks_map[res.task_id].append(res)

            #     arc_score = 0.0
            #     total_tasks = len(tasks_map)

            #     for t_id, pairs in tasks_map.items():
            #         n_pairs = len(pairs)
            #         if n_pairs > 0:
            #             n_solved = sum(1 for p in pairs if p.pass_at_k)
            #             arc_score += (n_solved / n_pairs)

            #     max_score = total_tasks
            #     pct = (arc_score / max_score * 100) if max_score > 0 else 0.0
            #     print(f"Official ARC style scoring: {arc_score:.2f}/{max_score} ({pct:.2f}%)")
        else:
            print("No test results for AAIVR.")

    finally:
        # Always restore stdout
        if hasattr(sys.stdout, "terminal"):
            sys.stdout.close()
            sys.stdout = sys.stdout.terminal

    # 6. Generate Submission
    print(f"Generating submission.json for {run_name}...")
    submission_data = {}
    temp_grouping = {}

    if aaivr_results:
        for item in aaivr_results:
            t_id = item.task_id
            p_idx = item.original_pair_index
            if t_id not in temp_grouping:
                temp_grouping[t_id] = {}

            top_grids = item.selected_outputs[:2]
            if not top_grids:
                top_grids = [[[0]]]  # Fallback

            pair_dict = {
                "attempt_1": top_grids[0],
                "attempt_2": top_grids[1] if len(top_grids) > 1 else top_grids[0],
            }
            temp_grouping[t_id][p_idx] = pair_dict

        for t_id, pairs_map in temp_grouping.items():
            sorted_indices = sorted(pairs_map.keys())
            submission_data[t_id] = [pairs_map[idx] for idx in sorted_indices]

    with open(submission_path, "w") as f:
        json.dump(submission_data, f)

    print(f"Finished {run_name}. Submission saved to {submission_path}")


# --- Run each evaluation config and record its timing ---
timing_path = Path("runs/timing.txt")

for name, aug_count, d_path in EVAL_CONFIGS:
    t_start = perf_counter()

    run_evaluation_pipeline(name, aug_count, d_path, device)

    t_duration = perf_counter() - t_start
    print(f"Run {name} took {t_duration:.2f}s")

    with open(timing_path, "a") as f:
        f.write(f"Evaluation {name}: {t_duration:.4f} s\n")

print("\nAll evaluation runs completed.")

In [None]:
# Compare mode requires the solutions file; submission mode does not
# visualisation
import json
import matplotlib.pyplot as plt
from pathlib import Path

EVAL_SUB_FOLDER = EVAL_CONFIGS[0][0]
VIS_MODE = "submission"  # "!" = compare vs solutions, "submission" = attempts-only

submission_file = Path(f"runs/{EVAL_SUB_FOLDER}/submission.json")
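# Uncomment for compare mode; the solutions file is not kept in this environment, so you must supply it yourself.
# solutions_file = Path("assets/solutions.json")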

if not submission_file.exists():
    print(f"Error: Could not find submission file: {submission_file}")
elif VIS_MODE == "!":
    if not solutions_file.exists():
        print(
            f"Error: Could not find solutions file for compare mode:\n{solutions_file}"
        )
    else:
        # Load Data
        with open(submission_file, "r") as f:
            subs = json.load(f)
        with open(solutions_file, "r") as f:
            sols = json.load(f)

        print(f"Visualizing comparison for {len(subs)} tasks...")

        for task_id, attempts_list in subs.items():
            # Get Ground Truth (list of grids)
            if task_id not in sols:
                print(f"Warning: Task {task_id} not found in solutions.json")
                continue

            gt_grids = sols[task_id]
            print(gt_grids)
            for i, attempts in enumerate(attempts_list):
                if i >= len(gt_grids):
                    break

                # 1. Retrieve Grids
                gt = gt_grids[i]
                att1 = attempts.get("attempt_1")
                att2 = attempts.get("attempt_2")

                # 2. Check Correctness
                pass1 = (att1 == gt) if att1 is not None else False
                pass2 = (att2 == gt) if att2 is not None else False

                if pass1 and pass2:
                    status = "Pass - both"
                elif pass1:
                    status = "Pass - 1"
                elif pass2:
                    status = "Pass - 2"
                else:
                    status = "Fail"

                # 3. Visualize
                # Construct list: [Ground Truth, Attempt 1, Attempt 2]
                grids_to_plot = [gt]
                if att1 is not None:
                    grids_to_plot.append(att1)
                if att2 is not None:
                    grids_to_plot.append(att2)

                header = f"Task: {task_id} | Pair: {i} | Status: {status}"
                print(f"Plotting {header}")

                # utils.plot_grids handles the matplotlib figure creation
                try:
                    utils.plot_grids(grids_to_plot, title=header)
                except Exception as e:
                    print(f"Skipping plot for {task_id} due to error: {e}")
else:
    # Submission-only visualization (attempts without ground truth)
    with open(submission_file, "r") as f:
        subs = json.load(f)

    print(f"Visualizing submissions for {len(subs)} tasks (no solutions)...")

    for task_id, attempts_list in subs.items():
        for i, attempts in enumerate(attempts_list):
            att1 = attempts.get("attempt_1")
            att2 = attempts.get("attempt_2")

            grids_to_plot = []
            if att1 is not None:
                grids_to_plot.append(att1)
            if att2 is not None:
                grids_to_plot.append(att2)

            if not grids_to_plot:
                print(f"Skipping {task_id} pair {i} (no attempts)")
                continue

            header = f"Task: {task_id} | Pair: {i} | Status: submission-only"
            print(f"Plotting {header}")

            try:
                utils.plot_grids(grids_to_plot, title=header)
            except Exception as e:
                print(f"Skipping plot for {task_id} due to error: {e}")
