## Pipeline: LLM-powered program generation for solving ARC-AGI

### Imports

In [245]:
import numpy as np
import ollama
import os
import json
from dotenv import load_dotenv
from openai import OpenAI
import ast
import re
import importlib.util
import time

### Shared Variables

In [246]:
# Shared system prompt for all tasks
system_prompt = """
You are a visual reasoning and Python programming expert solving ARC-AGI (Abstraction and Reasoning Corpus - Artificial General Intelligence) tasks.

Each integer in the grid represents a color:
0 = black, 1 = blue, 2 = red, 3 = green, 4 = yellow,
5 = grey, 6 = pink, 7 = orange, 8 = light blue, 9 = brown.
"""


### Prompts

#### Basic Prompt

In [247]:
base_prompt = """
Write a Python function that correctly transforms each input grid into its corresponding output grid based on the given examples.

- ONLY return code. No explanations or anything other than code.
- The function must be named: `solve(grid: List[List[int]]) -> List[List[int]]`
- Use only pure Python — do not import or use libraries like NumPy
- Do not include comments, explanations, or print statements
- Do not hard-code values or specific grid sizes — the function must generalize based on the patterns in the examples
- The function must return a plain 2D list of integers with consistent row lengths (List[List[int]])
- Do not return arrays, nested arrays, floats, or 3D structures
- Ensure your solution works for all provided input-output pairs
"""

In [248]:
print(base_prompt)


Write a Python function that correctly transforms each input grid into its corresponding output grid based on the given examples.

- ONLY return code. No explanations or anything other than code.
- The function must be named: `solve(grid: List[List[int]]) -> List[List[int]]`
- Use only pure Python — do not import or use libraries like NumPy
- Do not include comments, explanations, or print statements
- Do not hard-code values or specific grid sizes — the function must generalize based on the patterns in the examples
- The function must return a plain 2D list of integers with consistent row lengths (List[List[int]])
- Do not return arrays, nested arrays, floats, or 3D structures
- Ensure your solution works for all provided input-output pairs



#### Prompt 1

In [249]:
prompt_1 = """
List visual observations from the training pairs.

- Use bullet points (max 10).
- Focus on colors, shapes, object counts, positions, and fixed elements (e.g., anchors, borders, gray blocks).
- Mention groupings or repeated patterns if visible.
- Avoid reasoning or explanations.
- Be concise. No full sentences, no extra formatting.
"""

In [250]:
print(prompt_1)


List visual observations from the training pairs.

- Use bullet points (max 10).
- Focus on colors, shapes, object counts, positions, and fixed elements (e.g., anchors, borders, gray blocks).
- Mention groupings or repeated patterns if visible.
- Avoid reasoning or explanations.
- Be concise. No full sentences, no extra formatting.



#### Prompt 2

In [251]:
prompt_2 = """
Describe the transformation(s) from input to output grids.

- Use 3 to 5 short sentences.
- Focus on what changes: movement, color, shape, duplication, stacking, mirroring, etc.
- Mention any use of anchors, fixed positions, or reference structures.
- If applicable, describe how objects are grouped, reassigned, or reorganized.
- Avoid implementation hints or code.
"""

In [252]:
print(prompt_2)


Describe the transformation(s) from input to output grids.

- Use 3 to 5 short sentences.
- Focus on what changes: movement, color, shape, duplication, stacking, mirroring, etc.
- Mention any use of anchors, fixed positions, or reference structures.
- If applicable, describe how objects are grouped, reassigned, or reorganized.
- Avoid implementation hints or code.



#### Prompt 3

In [253]:
prompt_3 = """
Reflect on how you would solve the task in Python.

- Use 3 to 5 sentences.
- Describe the main approach, such as identifying anchors, grouping objects, and applying transformations.
- Mention steps like scanning for fixed elements, sorting, or aligning data.
- Call out any challenges or unclear rules you would need to test for.
- Do not return code or pseudocode.
"""

In [254]:
print(prompt_3)


Reflect on how you would solve the task in Python.

- Use 3 to 5 sentences.
- Describe the main approach, such as identifying anchors, grouping objects, and applying transformations.
- Mention steps like scanning for fixed elements, sorting, or aligning data.
- Call out any challenges or unclear rules you would need to test for.
- Do not return code or pseudocode.



#### Prompt 4

In [255]:
revision_prompt = """
In the following you'll receive a Python function that attempted to solve the following task. It didn't succeed and you are tasked with fixing it.

- The function must be named: `solve(grid: List[List[int]]) -> List[List[int]]`
- Include only the code and necessary imports (e.g., `import numpy as np`)
- Do not include comments, explanations, or print statements
- Do not hard-code values or specific grid sizes — the function must generalize based on the patterns in the examples
- Ensure your solution works for all provided input-output pairs
"""

In [256]:
print(revision_prompt)


In the following you'll receive a Python function that attempted to solve the following task. It didn't succeed and you are tasked with fixing it.

- The function must be named: `solve(grid: List[List[int]]) -> List[List[int]]`
- Include only the code and necessary imports (e.g., `import numpy as np`)
- Do not include comments, explanations, or print statements
- Do not hard-code values or specific grid sizes — the function must generalize based on the patterns in the examples
- Ensure your solution works for all provided input-output pairs



#### Classification Prompt

In [257]:
# Check manually entering one demonstration pair for each concept
classification_prompt = """
You will receive a set of demonstration pairs (input and output grids) from a visual reasoning task.

Your task is to classify the transformation into **one of the following 16 concepts**:

- AboveBelow - Objects or patterns are arranged vertically, with relationships defined by what's above or below something else.
- Center - Elements are moved to or arranged around the center of the grid.
- CleanUp - The task removes noise or extraneous elements to leave a cleaner or more regular structure.
- CompleteShape - A partial or broken shape is completed to form a full geometric object.
- Copy - A shape or pattern is duplicated, often to another location in the grid.
- Count - The number of certain elements is counted to determine placement, output quantity, or transformation.
- ExtendToBoundary - Shapes or lines are extended until they touch the edge of the grid.
- ExtractObjects - Specific objects are isolated and copied or transformed while others are ignored.
- FilledNotFilled - The task distinguishes between filled and hollow shapes or fills in uncolored areas.
- HorizontalVertical - Patterns follow or are transformed along horizontal or vertical axes, often involving symmetry or alignment.
- InsideOutside - A relationship is determined based on whether elements are inside or outside a defined boundary.
- MoveToBoundary - Objects are shifted to the nearest edge of the grid without rotation or change in shape.
- Order - Items are rearranged according to size, color, frequency, or another ordinal property.
- SameDifferent - Objects are retained or manipulated based on whether they match or differ in some attribute (e.g., color, shape).
- TopBottom2D - A flat 2D interpretation of objects where the top and bottom halves of the grid are compared or modified.
- TopBottom3D - The task simulates a 3D stacking or layering behavior, such as viewing objects from above or combining vertical slices.

Instructions:
- Respond ONLY with the exact name of the matching concept from the list above.
- Do not explain your answer, just return the concept.
- If uncertain, choose the concept that fits best based on the input-output transformations.
"""


### Concepts

In [None]:
def load_concept_examples(concept_name, base_path="ConceptARC_Data"):
    concept_path = os.path.join(base_path, concept_name, "task.json")
    if not os.path.exists(concept_path):
        print(f"[Warning] No concept task found for '{concept_name}'")
        return ""
    
    with open(concept_path, "r") as f:
        concept_task = json.load(f)
    
    # Format the train pairs as in add_tasks
    formatted_examples = "\n\nHere are a few examples following the same concept that may help with solving this task. Keep in mind that the identified concept may be faulty:\n"
    for i, pair in enumerate(concept_task.get("train", [])):
        formatted_examples += f"\nConcept Train Input {i+1}: {pair['input']}\n"
        formatted_examples += f"Concept Train Output {i+1}: {pair['output']}\n"
    
    return formatted_examples


### Functions

#### Load Tasks

In [259]:
def load_tasks(folder):
    tasks = []
    for filename in sorted(os.listdir(folder)):
        if filename.endswith(".json"):
            with open(os.path.join(folder, filename), "r") as f:
                data = json.load(f)
                tasks.append({"filename": filename, "data": data})
    return tasks

#### Load API-Key

In [260]:
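def load_api_key(file_path="key.env"):
    """Load OPENAI_API_KEY from the env file and create the shared OpenAI client."""
    global client
    load_dotenv(file_path)
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        # Fail early instead of letting the client constructor raise later
        raise RuntimeError(f"No API key found. Please set OPENAI_API_KEY in {file_path}.")
    client = OpenAI(api_key=api_key)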

#### Call GPT

In [261]:
import openai

def call_gpt(prompt, model="o4-mini", retries=3):
    """Send the prompt to the chat model, retrying when the rate limit is hit."""
    for attempt in range(retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": prompt}
                ],
                # Only for GPT-4o
                # temperature=0.0
            )
            return response.choices[0].message.content.strip()

        except openai.RateLimitError:
            wait_time = 5 + attempt * 5  # linear backoff: 5 s, 10 s, 15 s
            print(f"Rate limit hit. Waiting {wait_time} seconds before retrying...")
            time.sleep(wait_time)

    raise RuntimeError("Rate limit retries exhausted.")
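
The retry loop in `call_gpt` grows its wait by a fixed 5 seconds per attempt (5 s, 10 s, 15 s). If a doubling schedule is preferred, a minimal sketch could look like the following. The helper name `backoff_delays` and its parameters `base`, `factor`, and `cap` are illustrative assumptions and not part of the notebook.

```python
def backoff_delays(retries, base=5.0, factor=2.0, cap=60.0):
    """Return the wait time in seconds before each retry.

    Hypothetical helper: each delay is `factor` times the previous one,
    starting at `base` and never exceeding `cap`.
    """
    delays = []
    delay = base
    for _ in range(retries):
        delays.append(min(delay, cap))
        delay *= factor
    return delays


# Schedule used by call_gpt, for comparison
linear = [5 + attempt * 5 for attempt in range(3)]
exponential = backoff_delays(3)

print(linear)       # [5, 10, 15]
print(exponential)  # [5.0, 10.0, 20.0]
```

Inside `call_gpt`, `wait_time` could then be read from this list using `attempt` as the index.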

#### Encodings

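1. Non-zero Coordinates Encoding
Lists only the non-zero cells of each grid as dictionaries with their row, column, and value; background cells (0) are omitted.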

In [262]:
def encode_nonzero_coordinates(grid):
    coords = []
    for i, row in enumerate(grid):
        for j, val in enumerate(row):
            if val != 0:
                coords.append({"row": i, "col": j, "val": val})
    return coords

def encode_task_nonzero_coords(task):
    encoded = {"train": [], "test": []}

    for pair in task["train"]:
        encoded["train"].append({
            "input": encode_nonzero_coordinates(pair["input"]),
            "output": encode_nonzero_coordinates(pair["output"])
        })

    for pair in task["test"]:
        encoded["test"].append({
            "input": encode_nonzero_coordinates(pair["input"])
        })

    return encoded
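
A small worked example may help here. The sketch below restates the encoder compactly and adds a hypothetical inverse, `decode_nonzero_coordinates`, which is not part of the notebook. It illustrates that the encoding alone does not record the grid size, so the dimensions have to come from elsewhere (the pipeline also sends the raw grids via `add_tasks`).

```python
def encode_nonzero_coordinates(grid):
    # Compact restatement of the encoder above
    return [
        {"row": i, "col": j, "val": val}
        for i, row in enumerate(grid)
        for j, val in enumerate(row)
        if val != 0
    ]


def decode_nonzero_coordinates(coords, rows, cols):
    # Hypothetical inverse: rows and cols must be supplied by the caller,
    # because the encoding itself drops the grid dimensions
    grid = [[0] * cols for _ in range(rows)]
    for cell in coords:
        grid[cell["row"]][cell["col"]] = cell["val"]
    return grid


grid = [
    [0, 2, 0],
    [0, 0, 0],
    [5, 0, 0],
]
coords = encode_nonzero_coordinates(grid)
print(coords)
# [{'row': 0, 'col': 1, 'val': 2}, {'row': 2, 'col': 0, 'val': 5}]

restored = decode_nonzero_coordinates(coords, 3, 3)
print(restored == grid)  # True
```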

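2. Object/Bounding Box Encoding: represents each object, a group of 4-connected cells of the same non-zero color, by its color, bounding box (top-left corner, width, height), and pixel coordinates.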

In [263]:
from collections import deque

def extract_objects(grid):
    visited = set()
    objects = []
    rows, cols = len(grid), len(grid[0])

    def bfs(r, c, color):
        q = deque([(r, c)])
        visited.add((r, c))
        pixels = [(r, c)]
        min_r, min_c = r, c
        max_r, max_c = r, c

        while q:
            cr, cc = q.popleft()
            for dr, dc in [(-1,0), (1,0), (0,-1), (0,1)]:
                nr, nc = cr + dr, cc + dc
                if (
                    0 <= nr < rows and
                    0 <= nc < cols and
                    (nr, nc) not in visited and
                    grid[nr][nc] == color
                ):
                    visited.add((nr, nc))
                    q.append((nr, nc))
                    pixels.append((nr, nc))
                    min_r = min(min_r, nr)
                    min_c = min(min_c, nc)
                    max_r = max(max_r, nr)
                    max_c = max(max_c, nc)

        return {
            "color": color,
            "top_left": [min_r, min_c],
            "width": max_c - min_c + 1,
            "height": max_r - min_r + 1,
            "pixels": pixels
        }

    for r in range(rows):
        for c in range(cols):
            color = grid[r][c]
            if color != 0 and (r, c) not in visited:
                objects.append(bfs(r, c, color))

    return objects

def encode_task_objects(task):
    encoded = {"train": [], "test": []}

    for pair in task["train"]:
        encoded["train"].append({
            "input": extract_objects(pair["input"]),
            "output": extract_objects(pair["output"])
        })

    for pair in task["test"]:
        encoded["test"].append({
            "input": extract_objects(pair["input"])
        })

    return encoded
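
Because `extract_objects` only visits the four direct neighbours of a cell, two same-colored cells that touch only diagonally end up as separate objects. The sketch below shows this with a simplified flood fill; `count_objects` and its `diagonal` flag are illustrative assumptions, not functions from the notebook.

```python
def count_objects(grid, diagonal=False):
    """Count connected groups of equal non-zero cells.

    Simplified stand-in for extract_objects: it only counts objects.
    With diagonal=False it uses the same 4-neighbourhood as the notebook;
    diagonal=True (an assumption for comparison) also links diagonal cells.
    """
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if diagonal:
        steps += [(-1, -1), (-1, 1), (1, -1), (1, 1)]

    rows, cols = len(grid), len(grid[0])
    visited = set()
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 0 or (r, c) in visited:
                continue
            # New object found: flood-fill all cells connected to it
            count += 1
            stack = [(r, c)]
            visited.add((r, c))
            while stack:
                cr, cc = stack.pop()
                for dr, dc in steps:
                    nr, nc = cr + dr, cc + dc
                    if (
                        0 <= nr < rows
                        and 0 <= nc < cols
                        and (nr, nc) not in visited
                        and grid[nr][nc] == grid[r][c]
                    ):
                        visited.add((nr, nc))
                        stack.append((nr, nc))
    return count


grid = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 2],
]
print(count_objects(grid))                 # 3: the two 1-cells are separate
print(count_objects(grid, diagonal=True))  # 2: the 1-cells merge, the 2-cell stays apart
```

Which neighbourhood is appropriate depends on the task, so the 4-connected choice is worth keeping in mind when reading the encoded objects.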


Adds the encodings to the prompt, each preceded by a short explanation:

In [264]:
def add_encodings(prompt, encoding_dicts):
    encoding_explanations = {
        "nonzero_coords": "This encoding lists only the non-zero cells as dictionaries containing their row, column, and value.",
        "object_bbox": "This encoding represents each object in the grid as a group of connected same-colored cells, described by color, bounding box, and coordinates.",
    }

    for encoding_name, encoding_data in encoding_dicts.items():
        explanation = encoding_explanations.get(encoding_name, "This encoding represents the input in an alternate form.")
        prompt += f"\n\nEncoding used: {encoding_name}\nExplanation: {explanation}\n\nHere are the demonstration pairs (encoded):\n"
        for i, pair in enumerate(encoding_data["train"]):
            prompt += f"\nTrain Input {i+1}: {pair['input']}\n"
            prompt += f"Train Output {i+1}: {pair['output']}\n"
    
    return prompt


#### Building and Combining Prompts

Adds the task's demonstration pairs to the prompt:

In [265]:
def add_tasks(prompt, task_data):
    full_prompt = prompt.strip() + "\n\nHere are the demonstration pairs (JSON data):\n"
    for i, pair in enumerate(task_data['train']):
        full_prompt += f"\nTrain Input {i+1}: {pair['input']}\n"
        full_prompt += f"Train Output {i+1}: {pair['output']}\n"
    return full_prompt

Combines the response to secondary prompt 1 with the template of prompt 2:

In [266]:
def combine_prompts_1_and_2(prompt_1_response, prompt_2_template):
    combined_prompt = f"""{prompt_2_template.strip()}

Here are visual observations of the task at hand, that may assist you in identifying the transformation:

{prompt_1_response.strip()}

Now provide your transformation analysis based on these observations."""
    return combined_prompt

Combines the responses to secondary prompts 1 and 2 with the template of prompt 3:

In [267]:
def combine_prompts_1_2_and_3(prompt_1_response, prompt_2_response, prompt_3_template):
    combined_prompt = f"""{prompt_3_template.strip()}

Here are visual observations of the task that may help inform your implementation:
{prompt_1_response.strip()}

Here are the transformation rules that have been identified based on the task:
{prompt_2_response.strip()}

Now reflect on how you would implement a solution to this task in Python, following the instructions above.
"""
    return combined_prompt


Combines the responses to secondary prompts 1, 2 and 3 with the base prompt:

In [268]:
def combine_prompts_and_base(prompt_1_response, prompt_2_response, prompt_3_response, prompt_base_template):
    combined_prompt = f"""

Here are visual observations of the task that may help inform your implementation:
{prompt_1_response.strip()}

Here are the transformation rules that have been identified based on the task:
{prompt_2_response.strip()}
    
Here is a reflection on how you might implement a solution to this task in Python:
{prompt_3_response.strip()}

{prompt_base_template.strip()}
"""
    return combined_prompt.strip()

Chains the secondary prompts, classifies the task's concept, and merges all responses into the base prompt to create a task-tailored prompt:

In [269]:
# Global dictionary to store task concepts
task_concepts = {}

def build_prompts(task_data, task_id):
    global task_concepts 

    encoding_dicts = {
        "nonzero_coords": encode_task_nonzero_coords(task_data),
    }

    full_prompt_1 = add_tasks(prompt_1, task_data)
    full_prompt_1 = add_encodings(full_prompt_1, encoding_dicts)
    response_1 = call_gpt(full_prompt_1)

    combined_prompt_2 = combine_prompts_1_and_2(response_1, prompt_2)
    full_prompt_2 = add_tasks(combined_prompt_2, task_data)
    full_prompt_2 = add_encodings(full_prompt_2, encoding_dicts)
    response_2 = call_gpt(full_prompt_2)

    combined_prompt_3 = combine_prompts_1_2_and_3(response_1, response_2, prompt_3)
    full_prompt_3 = add_tasks(combined_prompt_3, task_data)
    full_prompt_3 = add_encodings(full_prompt_3, encoding_dicts)
    response_3 = call_gpt(full_prompt_3)

    # Classification step
    classification_full_prompt = add_tasks(classification_prompt, task_data)
    classification_full_prompt = add_encodings(classification_full_prompt, encoding_dicts)
    predicted_concept = call_gpt(classification_full_prompt).strip()
    concept_instruction = load_concept_examples(predicted_concept)
    task_concepts[task_id] = predicted_concept

    combined_prompt_base = combine_prompts_and_base(response_1, response_2, response_3, base_prompt)
    tailored_prompt = add_tasks(combined_prompt_base, task_data)
    tailored_prompt = add_encodings(tailored_prompt, encoding_dicts)
    
    # Add similar-concept examples (train only)
    tailored_prompt += f"\n\nThis task involves the concept: **{predicted_concept}**."
    tailored_prompt += concept_instruction

    print("Built tailored prompt.")

    return tailored_prompt

#### Save Programs

In [270]:
def save_program(program_text, actual_task_id, suffix=""):
    # Define the base and task-specific folder paths
    base_folder = "Candidate_programs_tailored_prompts"
    task_folder = os.path.join(base_folder, actual_task_id)
    
    # Create the task-specific folder if it doesn't exist
    os.makedirs(task_folder, exist_ok=True)

    # Remove ```python or ``` if present
    cleaned_text = re.sub(r"^```(?:python)?\s*|```$", "", program_text.strip(), flags=re.MULTILINE)

    # Find the next available version number. Revision files (e.g. solution_v3_rev1.py)
    # are counted as well, so every saved file gets its own version number.
    existing_files = os.listdir(task_folder)
    version_numbers = [
        int(re.search(r"solution_v(\d+)", fname).group(1))
        for fname in existing_files
        if re.match(r"solution_v\d+", fname)
    ]
    next_version = max(version_numbers, default=0) + 1
    
    # Define the full path to the new Python file with the suffix (if provided)
    file_path = os.path.join(task_folder, f"solution_v{next_version}{suffix}.py")
    
    # Save the program text to the file
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(cleaned_text.strip())

    print(f"Saved program for task {actual_task_id} as version {next_version}{suffix}: {file_path}")
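
The substitution in `save_program` removes fence markers but keeps any prose the model writes around the code. A possible alternative, sketched under the assumption that the response contains at most one fenced block, is to extract the block's contents and fall back to the full text otherwise. The name `extract_code` is hypothetical.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable
FENCED_BLOCK = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    flags=re.DOTALL,
)


def extract_code(response_text):
    """Return the contents of the first fenced block, or the stripped text."""
    match = FENCED_BLOCK.search(response_text)
    if match:
        return match.group(1).strip()
    return response_text.strip()


response = (
    "Here is the solution:\n"
    + FENCE + "python\n"
    + "def solve(grid):\n"
    + "    return grid\n"
    + FENCE + "\n"
    + "Hope this helps."
)
print(extract_code(response))
```

A response without any fence is returned unchanged apart from surrounding whitespace, so the helper could stand in for the `re.sub` call without affecting clean responses.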


#### Create Programs

In [271]:
def create_programs(tailored_prompt, actual_task_id, amount):
    # Generate `amount` candidate programs from the same tailored prompt
    for _ in range(amount):
        response = call_gpt(tailored_prompt)
        
        # Save the generated program
        save_program(response, actual_task_id)

#### Evaluate Programs

In [272]:
def load_program(file_path):
    """Load and execute a Python program from a file."""
    spec = importlib.util.spec_from_file_location("program", file_path)
    program = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(program)
    return program.solve


Checks if the code is valid:

In [273]:
def is_valid_python_code(filepath):
    """Check if the Python file contains valid syntax."""
    try:
        with open(filepath, "r", encoding="utf-8") as f:
            source = f.read()
        ast.parse(source)
        return True
    except SyntaxError:
        return False
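
A syntax check alone does not guarantee that the file defines the `solve` function that `load_program` expects; in the pipeline this only surfaces at load time, after the file has been executed. As a sketch, the parsed tree can also be inspected for a top-level `solve` definition. The helper `defines_solve` is an illustrative assumption, not part of the notebook.

```python
import ast


def defines_solve(source):
    """Return True if the source parses and defines a top-level function `solve`."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    return any(
        isinstance(node, ast.FunctionDef) and node.name == "solve"
        for node in tree.body
    )


print(defines_solve("def solve(grid):\n    return grid\n"))      # True
print(defines_solve("def transform(grid):\n    return grid\n"))  # False
print(defines_solve("def solve(grid:\n"))                        # False, syntax error
```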


In [274]:
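def evaluate_programs(task_data, task_folder):
    """Evaluate programs and collect detailed candidate outputs for each demonstration pair.

    Files that fail to parse or load are deleted. Returns a list of result dicts,
    or 0 if no program in the folder could be loaded.
    """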
    programs = []
    program_files = [f for f in os.listdir(task_folder) if f.endswith(".py")]

    any_valid = False  # Track whether any valid programs exist

    for program_file in program_files:
        program_path = os.path.join(task_folder, program_file)

        # Check if the program is valid Python code
        if not is_valid_python_code(program_path):
            print(f"Deleting invalid file: {program_file}")
            os.remove(program_path)
            continue

        try:
            solve_function = load_program(program_path)
        except Exception as e:
            print(f"Error loading program {program_file}: {e}")
            os.remove(program_path)
            continue

        any_valid = True  # At least one valid program found

        details = []
        correct_count = 0
        total_pairs = len(task_data['train'])

        for pair in task_data['train']:
            input_grid = pair['input']
            expected_output = pair['output']
            try:
                candidate_output = solve_function(input_grid)
                if np.array_equal(np.array(candidate_output), np.array(expected_output)):
                    correct_count += 1
            except Exception as e:
                candidate_output = f"Error: {e}"
            details.append({
                "input": input_grid,
                "candidate_output": candidate_output,
                "expected_output": expected_output
            })

        score = correct_count / total_pairs if total_pairs > 0 else 0
        programs.append({
            "program_name": program_file,
            "score": score,
            "correct_pairs": correct_count,
            "total_pairs": total_pairs,
            "details": details
        })

    return programs if any_valid else 0
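
The score is the fraction of training pairs that a candidate reproduces exactly, and a candidate that raises an exception on a pair gets no credit for it. A minimal sketch of that rule follows, using a made-up task and toy solvers; none of these names come from the notebook, and the sketch compares plain lists with `==` rather than `np.array_equal`.

```python
def score_solver(solve_fn, train_pairs):
    """Fraction of pairs for which solve_fn returns the expected output."""
    correct = 0
    for pair in train_pairs:
        try:
            if solve_fn(pair["input"]) == pair["output"]:
                correct += 1
        except Exception:
            # A crashing candidate gets no credit for this pair
            pass
    return correct / len(train_pairs) if train_pairs else 0


# Made-up task: mirror each row
train_pairs = [
    {"input": [[1, 0]], "output": [[0, 1]]},
    {"input": [[2, 2, 0]], "output": [[0, 2, 2]]},
]


def mirror(grid):
    return [row[::-1] for row in grid]


def identity(grid):
    return grid


def broken(grid):
    return grid[5]  # raises IndexError on these small grids


print(score_solver(mirror, train_pairs))    # 1.0
print(score_solver(identity, train_pairs))  # 0.0
print(score_solver(broken, train_pairs))    # 0.0
```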

#### Creation of Revision Prompt and Revised Programs

In [275]:
def revise_candidate_program(candidate_file_path, task_data, actual_task_id):
    # Load candidate code text
    with open(candidate_file_path, "r", encoding="utf-8") as f:
        candidate_code_text = f.read()

    # Load candidate's solve function
    try:
        solve_fn = load_program(candidate_file_path)
    except Exception as e:
        print(f"Error loading program for revision: {e}")
        return

    # Use the externally defined revision_prompt and add the generated code and demonstration pairs details
    local_revision_prompt = revision_prompt + "\n\nHere is the generated code:\n" + candidate_code_text + "\n\nDemonstration Pairs:\n"

    for i, pair in enumerate(task_data['train']):
        try:
            candidate_output = solve_fn(pair['input'])
        except Exception as e:
            candidate_output = f"Error: {e}"
        local_revision_prompt += f"{i+1}. Input: {pair['input']}\n"
        local_revision_prompt += f"   Expected Output: {pair['output']}\n"
        local_revision_prompt += f"   Generated Output: {candidate_output}\n"

    local_revision_prompt += "\nPlease revise the code."

    # Send the revision prompt to the LLM
    revised_code = call_gpt(local_revision_prompt)

    # Determine the revision level (e.g. _rev1, _rev2, ...) based on existing files
    task_folder = os.path.join("Candidate_programs_tailored_prompts", actual_task_id)
    base_name = os.path.basename(candidate_file_path)
    base_version_match = re.search(r"solution_v(\d+)", base_name)
    base_version = base_version_match.group(1) if base_version_match else "1"

    # Count existing revisions for this version
    existing_files = os.listdir(task_folder)
    revision_count = len([
        f for f in existing_files
        if re.match(rf"solution_v{base_version}_rev\d+\.py", f)
    ])
    suffix = f"_rev{revision_count + 1}"

    # Save revised program
    save_program(revised_code, actual_task_id, suffix=suffix)


#### Identification of Best Programs

In [276]:
def get_best_programs(evaluation_results, actual_task_id, n=2):
    """
    Selects the top n programs from the evaluation results, ranked by their score on the demonstration pairs,
    and returns their file paths. Returns fewer than n paths if fewer programs were evaluated.
    """
    # Sort programs by score descending; if scores are equal, the original order is preserved.
    sorted_programs = sorted(evaluation_results, key=lambda x: x['score'], reverse=True)
    task_folder = os.path.join("Candidate_programs_tailored_prompts", actual_task_id)
    best_program_files = [os.path.join(task_folder, prog['program_name']) for prog in sorted_programs[:n]]
    for program_file in best_program_files:
        print(f"Best program: {program_file}")
    return best_program_files
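
The tie-breaking behaviour relies on Python's `sorted` being stable, which also holds with `reverse=True`. A short illustration with made-up scores:

```python
results = [
    {"program_name": "solution_v1.py", "score": 0.5},
    {"program_name": "solution_v2.py", "score": 1.0},
    {"program_name": "solution_v3_rev1.py", "score": 0.5},
]

# Same sort call as in get_best_programs
ranked = sorted(results, key=lambda x: x["score"], reverse=True)
names = [r["program_name"] for r in ranked]
print(names)
# ['solution_v2.py', 'solution_v1.py', 'solution_v3_rev1.py']
```

When all scores are tied, the first n files in evaluation order are selected, so the order in which files are listed affects the result.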



#### Generation of Predictions on Test Inputs

In [277]:
def generate_test_predictions(task_data, actual_task_id, best_program_files):
    """
    Loads the two best candidate programs from the specified file paths,
    runs each one on every test input, and saves the generated outputs
    in the submission file.
    
    Expects task_data to have a 'test' key with a list of test pairs,
    where each pair is a dict containing an "input" key.
    
    Returns a submission dictionary of the form:
    { actual_task_id: [ { "attempt_1": output_from_program1, "attempt_2": output_from_program2 }, ... ] }
    """
    # Load the candidate solvers using the file paths.
    best_solvers = [load_program(prog_file) for prog_file in best_program_files]
    
    predictions = []
    # Iterate over each test pair.
    for i, pair in enumerate(task_data["test"]):
        input_grid = pair["input"]
        attempt_predictions = {}
        # Run each candidate solver on the test input.
        for idx, solver in enumerate(best_solvers, start=1):
            try:
                output = solver(input_grid)
            except Exception as e:
                output = f"Error: {e}"
            attempt_predictions[f"attempt_{idx}"] = output
        predictions.append(attempt_predictions)
    
    submission = {str(actual_task_id): predictions}
    
    return submission
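
Since a failing solver contributes an error string instead of a grid, it can be useful to check attempts before writing the submission. The sketch below is one way to test for the format requested in the base prompt (a 2D list of integers with consistent row lengths); `is_valid_grid` is a hypothetical helper, not part of the notebook.

```python
def is_valid_grid(output):
    """Return True for a non-empty 2D list of ints with equal row lengths."""
    if not isinstance(output, list) or not output:
        return False
    if not all(isinstance(row, list) and row for row in output):
        return False
    width = len(output[0])
    return all(
        len(row) == width
        # bool is a subclass of int, so it is excluded explicitly
        and all(isinstance(v, int) and not isinstance(v, bool) for v in row)
        for row in output
    )


print(is_valid_grid([[0, 1], [2, 3]]))                  # True
print(is_valid_grid("Error: list index out of range"))  # False
print(is_valid_grid([[0, 1], [2]]))                     # False, ragged rows
print(is_valid_grid([[0.0, 1.0]]))                      # False, floats
```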


### Pipeline

In [278]:
# Load tasks and API key
tasks = load_tasks("evaluation_set")
load_api_key()

# Stores predictions for all tasks
final_submission = {} 

# Loop through each task (adjustable range)
for i, task in enumerate(tasks[:10]):
    actual_task_id = task["filename"].split(".")[0]
    
    ### PROMPT CREATION ###
    tailored_prompt = build_prompts(task['data'], actual_task_id)
    create_programs(tailored_prompt, actual_task_id, amount=2)
    
    ### INITIAL EVALUATION ###
    task_folder = os.path.join("Candidate_programs_tailored_prompts", actual_task_id)
    evaluation_results = evaluate_programs(task['data'], task_folder)
    
    while evaluation_results == 0:
        create_programs(tailored_prompt, actual_task_id, amount=2)
        evaluation_results = evaluate_programs(task['data'], task_folder)

    ### FIRST REVISION: Revise all < 1 ###
    for result in evaluation_results:
        if result['score'] < 1:
            candidate_file = os.path.join(task_folder, result['program_name'])
            print(f"Revising {result['program_name']}...")
            revise_candidate_program(candidate_file, task['data'], actual_task_id)

    ### FINAL EVALUATION ###
    evaluation_results_3 = evaluate_programs(task['data'], task_folder)
    print(f"Task {actual_task_id} ({i+1}) evaluation results:")
    for result in evaluation_results_3:
        print(f"Program {result['program_name']} solved {result['correct_pairs']} out of {result['total_pairs']} pairs. Score: {result['score']:.2f}")
        print("="*50)

    ### PROGRAM SELECTION + PREDICTIONS ###
    best_program_files = get_best_programs(evaluation_results_3, actual_task_id, n=2)
    submission = generate_test_predictions(task['data'], actual_task_id, best_program_files)
    final_submission.update(submission)

# Final output file
with open("submission_tailored_prompts.json", "w") as f:
    json.dump(final_submission, f)


Built tailored prompt.
Saved program for task 0607ce86 as version 1: Candidate_programs_tailored_prompts\0607ce86\solution_v1.py
Saved program for task 0607ce86 as version 2: Candidate_programs_tailored_prompts\0607ce86\solution_v2.py
Revising solution_v1.py...
Saved program for task 0607ce86 as version 3_rev1: Candidate_programs_tailored_prompts\0607ce86\solution_v3_rev1.py
Revising solution_v2.py...
Saved program for task 0607ce86 as version 4_rev1: Candidate_programs_tailored_prompts\0607ce86\solution_v4_rev1.py
Task 0607ce86 (1) evaluation results:
Program solution_v1.py solved 0 out of 3 pairs. Score: 0.00
Program solution_v2.py solved 0 out of 3 pairs. Score: 0.00
Program solution_v3_rev1.py solved 0 out of 3 pairs. Score: 0.00
Program solution_v4_rev1.py solved 0 out of 3 pairs. Score: 0.00
Best program: Candidate_programs_tailored_prompts\0607ce86\solution_v1.py
Best program: Candidate_programs_tailored_prompts\0607ce86\solution_v2.py
Built tailored prompt.
Saved program for ta

### Get Accuracy (for personal use if test outputs are present)

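Compares the submission file with the correct test outputs of each task and returns the overall accuracy, along with a breakdown per predicted concept. A task counts as solved only if every test input has at least one matching attempt.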

TODO: Add task categories and additionally show accuracies per category.

In [279]:
def are_equal(a, b):
    return json.dumps(a) == json.dumps(b)

def compute_accuracy(submission_path, tasks_dict, task_concepts):
    with open(submission_path, 'r') as f:
        submission = json.load(f)

    total_tasks = len(submission)
    correct_tasks = 0

    concept_totals = {}
    concept_corrects = {}

    for task_id, prediction_list in submission.items():
        task_data = tasks_dict[task_id]
        test_outputs = [test['output'] for test in task_data['test']]

        all_test_cases_correct = True
        for idx, expected_output in enumerate(test_outputs):
            attempts = prediction_list[idx]
            pred1 = attempts["attempt_1"]
            pred2 = attempts["attempt_2"]

            if not (are_equal(pred1, expected_output) or are_equal(pred2, expected_output)):
                all_test_cases_correct = False
                break

        concept = task_concepts.get(task_id, "Unknown Concept")
        concept_totals[concept] = concept_totals.get(concept, 0) + 1
        if all_test_cases_correct:
            correct_tasks += 1
            concept_corrects[concept] = concept_corrects.get(concept, 0) + 1

    accuracy = correct_tasks / total_tasks

    print(f"\nOverall Accuracy: {accuracy:.2%}\n")

    print("Accuracy per concept:")
    for concept, total in concept_totals.items():
        correct = concept_corrects.get(concept, 0)
        concept_accuracy = correct / total
        print(f"- {concept}: {correct}/{total} ({concept_accuracy:.2%})")

    return accuracy
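
The scoring rule inside `compute_accuracy` can be isolated as follows: a task counts as solved only if every test case is matched by at least one of its two attempts. The function name `task_solved` and the sample data are illustrative assumptions, and the sketch compares values with `==` instead of serializing them with `json.dumps`.

```python
def task_solved(prediction_list, expected_outputs):
    """True if each expected output equals attempt_1 or attempt_2 of its test case."""
    return all(
        expected in (attempts["attempt_1"], attempts["attempt_2"])
        for attempts, expected in zip(prediction_list, expected_outputs)
    )


# Made-up task with two test cases
expected_outputs = [[[1, 1]], [[2]]]

predictions_ok = [
    {"attempt_1": [[0, 0]], "attempt_2": [[1, 1]]},
    {"attempt_1": [[2]], "attempt_2": "Error: boom"},
]
predictions_bad = [
    {"attempt_1": [[1, 1]], "attempt_2": [[1, 1]]},
    {"attempt_1": [[3]], "attempt_2": [[4]]},
]

print(task_solved(predictions_ok, expected_outputs))   # True
print(task_solved(predictions_bad, expected_outputs))  # False
```

The second case fails even though its first test case is correct, which is why the per-task accuracy can be much lower than the share of individually correct test outputs.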


In [280]:
with open("submission_tailored_prompts.json") as f:
    submission = json.load(f)

tasks_dict = {task["filename"].split(".")[0]: task["data"] for task in tasks}

accuracy = compute_accuracy("submission_tailored_prompts.json", tasks_dict, task_concepts)



Overall Accuracy: 50.00%

Accuracy per concept:
- CleanUp: 0/1 (0.00%)
- ExtractObjects: 0/1 (0.00%)
- FilledNotFilled: 0/1 (0.00%)
- ExtendToBoundary: 3/3 (100.00%)
- HorizontalVertical: 1/2 (50.00%)
- Count: 0/1 (0.00%)
- Order: 1/1 (100.00%)


In [281]:
print(tailored_prompt)

Here are visual observations of the task that may help inform your implementation:
- four distinct colors: 3 (green), 2 (red), 5 (grey), 8 (light-blue)  
- uniform zero background in all inputs  
- horizontal stripes of color at fixed rows (green: 4,7,10; red: 3,13; grey: 2,5,8,11; light-blue: 2,13)  
- vertical bars of same color at consistent columns per puzzle (green cols 3,7,11; red cols 3,8; grey cols 4,7,8,11; light-blue cols 4,6,8)  
- stripes and bars form hollow rectangular “rings”  
- each puzzle shows repeated stripe/bar pattern (green/grey three bars, red two bars, light-blue three bars)  
- no other colored pixels or border elements  
- outputs preserve stripe rows but shift bar columns inward/outward on alternating stripes  
- count of stripes equal to number of bar groups per color  
- perfect left–right symmetry in all shapes

Here are the transformation rules that have been identified based on the task:
1. Each colored frame’s horizontal stripes stay on their original 