# Pipeline: LLM-powered program generation for solving ARC-AGI

## Imports

In [36]:
import numpy as np
import os
import json
from dotenv import load_dotenv
from openai import OpenAI
import ast
import re
import importlib.util
import time

## Prompts

### Shared Variables

The system prompt contains the basic information that every request to the LLM needs: the task framing and the color mapping of the grid integers.

In [37]:
system_prompt = """
You are a visual reasoning and Python programming expert solving ARC-AGI (Abstraction and Reasoning Corpus - Artificial General Intelligence) tasks.

Each integer in the grid represents a color:
0 = black, 1 = blue, 2 = red, 3 = green, 4 = yellow,
5 = grey, 6 = pink, 7 = orange, 8 = light blue, 9 = brown.
"""


### Base Prompt

The following prompt tasks the LLM with program generation

In [38]:
base_prompt = """
Write a Python function that correctly transforms each input grid into its corresponding output grid based on the given examples.

- ONLY return code. No explanations or anything other than code.
- The function must be named: `solve(grid: List[List[int]]) -> List[List[int]]`
- Use only pure Python — do not import or use libraries like NumPy
- Do not include comments, explanations, or print statements
- Do not hard-code values or specific grid sizes — the function must generalize based on the patterns in the examples
- The function must return a plain 2D list of integers with consistent row lengths (List[List[int]])
- Do not return arrays, nested arrays, floats, or 3D structures
- Ensure your solution works for all provided input-output pairs
"""

In [39]:
print(base_prompt)


Write a Python function that correctly transforms each input grid into its corresponding output grid based on the given examples.

- ONLY return code. No explanations or anything other than code.
- The function must be named: `solve(grid: List[List[int]]) -> List[List[int]]`
- Use only pure Python — do not import or use libraries like NumPy
- Do not include comments, explanations, or print statements
- Do not hard-code values or specific grid sizes — the function must generalize based on the patterns in the examples
- The function must return a plain 2D list of integers with consistent row lengths (List[List[int]])
- Do not return arrays, nested arrays, floats, or 3D structures
- Ensure your solution works for all provided input-output pairs
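
The base prompt asks for a plain 2D list of integers with consistent row lengths. A minimal sketch of how that requirement could be checked on a program's output is shown below. `is_valid_grid` is a hypothetical helper introduced here for illustration; it is not part of the pipeline.

```python
def is_valid_grid(candidate):
    """Return True if candidate is a non-empty rectangular list of lists of ints."""
    if not isinstance(candidate, list) or not candidate:
        return False
    if not all(isinstance(row, list) and row for row in candidate):
        return False
    width = len(candidate[0])
    return all(
        len(row) == width
        # bool is a subclass of int, so it is excluded explicitly
        and all(isinstance(v, int) and not isinstance(v, bool) for v in row)
        for row in candidate
    )


print(is_valid_grid([[1, 2], [3, 4]]))  # rectangular grid of ints
print(is_valid_grid([[1, 2], [3]]))     # ragged rows
print(is_valid_grid([[1.0, 2.0]]))      # floats instead of ints
```

Such a check could run before the comparison with the expected output, so that malformed results are reported separately from wrong ones.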



### Prompt 1

The following prompt asks the LLM to list visual observations of the task

In [40]:
prompt_1 = """
List visual observations from the training pairs.

- Use bullet points (max 10).
- Focus on colors, shapes, object counts, positions, and fixed elements (e.g., anchors, borders, gray blocks).
- Mention groupings or repeated patterns if visible.
- Avoid reasoning or explanations.
- Be concise. No full sentences, no extra formatting.
"""

In [41]:
print(prompt_1)


List visual observations from the training pairs.

- Use bullet points (max 10).
- Focus on colors, shapes, object counts, positions, and fixed elements (e.g., anchors, borders, gray blocks).
- Mention groupings or repeated patterns if visible.
- Avoid reasoning or explanations.
- Be concise. No full sentences, no extra formatting.



### Prompt 2

The following prompt tasks the LLM with describing the task's underlying transformation

In [42]:
prompt_2 = """
Describe the transformation(s) from input to output grids.

- Use 3 to 5 short sentences.
- Focus on what changes: movement, color, shape, duplication, stacking, mirroring, etc.
- Mention any use of anchors, fixed positions, or reference structures.
- If applicable, describe how objects are grouped, reassigned, or reorganized.
- Avoid implementation hints or code.
"""

In [43]:
print(prompt_2)


Describe the transformation(s) from input to output grids.

- Use 3 to 5 short sentences.
- Focus on what changes: movement, color, shape, duplication, stacking, mirroring, etc.
- Mention any use of anchors, fixed positions, or reference structures.
- If applicable, describe how objects are grouped, reassigned, or reorganized.
- Avoid implementation hints or code.



### Prompt 3

The following prompt tasks the LLM to reflect on a possible implementation of the transformation in Python code

In [44]:
prompt_3 = """
Reflect on how you would solve the task in Python.

- Use 3 to 5 sentences.
- Describe the main approach, such as identifying anchors, grouping objects, and applying transformations.
- Mention steps like scanning for fixed elements, sorting, or aligning data.
- Call out any challenges or unclear rules you would need to test for.
- Do not return code or pseudocode.
"""

In [45]:
print(prompt_3)


Reflect on how you would solve the task in Python.

- Use 3 to 5 sentences.
- Describe the main approach, such as identifying anchors, grouping objects, and applying transformations.
- Mention steps like scanning for fixed elements, sorting, or aligning data.
- Call out any challenges or unclear rules you would need to test for.
- Do not return code or pseudocode.



### Prompt 4

The following prompt tasks the LLM with revising the Python implementations it created

In [46]:
revision_prompt = """
In the following you'll receive a Python function that attempted to solve the following task. It didn't succeed and you are tasked with fixing it.

- The function must be named: `solve(grid: List[List[int]]) -> List[List[int]]`
- Include only the code and necessary imports (e.g., `import numpy as np`)
- Do not include comments, explanations, or print statements
- Do not hard-code values or specific grid sizes — the function must generalize based on the patterns in the examples
- Ensure your solution works for all provided input-output pairs
"""

In [47]:
print(revision_prompt)


In the following you'll receive a Python function that attempted to solve the following task. It didn't succeed and you are tasked with fixing it.

- The function must be named: `solve(grid: List[List[int]]) -> List[List[int]]`
- Include only the code and necessary imports (e.g., `import numpy as np`)
- Do not include comments, explanations, or print statements
- Do not hard-code values or specific grid sizes — the function must generalize based on the patterns in the examples
- Ensure your solution works for all provided input-output pairs



## Functions

### Load Tasks

Loads the tasks from the specified folder

In [48]:
def load_tasks(folder):
    tasks = []
    for filename in sorted(os.listdir(folder)):
        if filename.endswith(".json"):
            with open(os.path.join(folder, filename), "r") as f:
                data = json.load(f)
                tasks.append({"filename": filename, "data": data})
    return tasks

### Load API-Key

Loads the API-Key from the .env file

In [49]:
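def load_api_key(file_path="key.env"):
    """Read OPENAI_API_KEY from the .env file and create the global client."""
    global client
    load_dotenv(file_path)
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        print("No API key found. Please set OPENAI_API_KEY in key.env.")
    # Pass the key explicitly instead of setting the legacy module attribute
    client = OpenAI(api_key=api_key)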

### Call GPT

Used to send requests to the LLM

In [50]:
import time
import openai

def call_gpt(prompt, model="o4-mini", retries=3):
    for attempt in range(retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": prompt}
                ],
                # Only for GPT-4o
                # temperature=0.0
            )
            return response.choices[0].message.content.strip()
        
        except openai.RateLimitError:
            wait_time = 5 + attempt * 5  # linear backoff: 5 s, 10 s, 15 s
            print(f"Rate limit hit. Waiting {wait_time} seconds before retrying...")
            time.sleep(wait_time)

    raise Exception("Rate limit retries exhausted.")
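
The retry loop above waits 5, 10 and then 15 seconds, which is a linear schedule. If an exponential schedule is preferred, it could be sketched roughly as follows. `backoff_delays` and `retry_with_backoff` are hypothetical helpers that belong neither to the pipeline nor to the `openai` library, and the `sleep` parameter exists only so the example can run without actually waiting.

```python
import time


def backoff_delays(retries, base=5.0, factor=2.0):
    """Return the wait time after each failed attempt: base, base*factor, ..."""
    return [base * factor ** attempt for attempt in range(retries)]


def retry_with_backoff(fn, retryable=(Exception,), retries=3, base=5.0, sleep=time.sleep):
    """Call fn() until it succeeds or the retries are used up."""
    last_error = None
    for delay in backoff_delays(retries, base):
        try:
            return fn()
        except retryable as error:
            last_error = error
            sleep(delay)
    raise RuntimeError("Retries exhausted.") from last_error


print(backoff_delays(3))
```

In `call_gpt`, `fn` would wrap the `client.chat.completions.create(...)` call and `retryable` would be `(openai.RateLimitError,)`.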

### Encodings

#### Non-zero Coordinates Encoding

Encodes each grid of the task as a list of its non-zero cells (row, column, value)

In [51]:
def encode_nonzero_coordinates(grid):
    coords = []
    for i, row in enumerate(grid):
        for j, val in enumerate(row):
            if val != 0:
                coords.append({"row": i, "col": j, "val": val})
    return coords

def encode_task_nonzero_coords(task):
    encoded = {"train": [], "test": []}

    for pair in task["train"]:
        encoded["train"].append({
            "input": encode_nonzero_coordinates(pair["input"]),
            "output": encode_nonzero_coordinates(pair["output"])
        })

    for pair in task["test"]:
        encoded["test"].append({
            "input": encode_nonzero_coordinates(pair["input"])
        })

    return encoded
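
As a small illustration of what this encoding produces, the sketch below applies the same logic to an invented 3×3 grid (not an ARC task). The function is repeated in compact form only so that the example runs on its own.

```python
def encode_nonzero_coordinates(grid):
    """Same logic as the cell above, written as a list comprehension."""
    return [
        {"row": i, "col": j, "val": val}
        for i, row in enumerate(grid)
        for j, val in enumerate(row)
        if val != 0
    ]


# Toy grid invented for illustration.
grid = [
    [0, 0, 2],
    [0, 5, 0],
    [1, 0, 0],
]
encoded = encode_nonzero_coordinates(grid)
print(encoded)
```

The encoding does not record the grid dimensions, which is one reason the prompts also carry the raw grids.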

#### Object/Bounding Box Encoding

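Encodes each grid as a list of objects: 4-connected groups of same-colored non-zero cells, each described by its color, bounding box, and pixel coordinates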

In [52]:
from collections import deque

def extract_objects(grid):
    visited = set()
    objects = []
    rows, cols = len(grid), len(grid[0])

    def bfs(r, c, color):
        q = deque([(r, c)])
        visited.add((r, c))
        pixels = [(r, c)]
        min_r, min_c = r, c
        max_r, max_c = r, c

        while q:
            cr, cc = q.popleft()
            for dr, dc in [(-1,0), (1,0), (0,-1), (0,1)]:
                nr, nc = cr + dr, cc + dc
                if (
                    0 <= nr < rows and
                    0 <= nc < cols and
                    (nr, nc) not in visited and
                    grid[nr][nc] == color
                ):
                    visited.add((nr, nc))
                    q.append((nr, nc))
                    pixels.append((nr, nc))
                    min_r = min(min_r, nr)
                    min_c = min(min_c, nc)
                    max_r = max(max_r, nr)
                    max_c = max(max_c, nc)

        return {
            "color": color,
            "top_left": [min_r, min_c],
            "width": max_c - min_c + 1,
            "height": max_r - min_r + 1,
            "pixels": pixels
        }

    for r in range(rows):
        for c in range(cols):
            color = grid[r][c]
            if color != 0 and (r, c) not in visited:
                objects.append(bfs(r, c, color))

    return objects

def encode_task_objects(task):
    encoded = {"train": [], "test": []}

    for pair in task["train"]:
        encoded["train"].append({
            "input": extract_objects(pair["input"]),
            "output": extract_objects(pair["output"])
        })

    for pair in task["test"]:
        encoded["test"].append({
            "input": extract_objects(pair["input"])
        })

    return encoded
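
To make the object format easier to read, the following sketch paints a hand-written object list back onto a blank grid. `objects_to_grid` is a hypothetical helper that the pipeline does not use, and the objects describe an invented 3×4 grid in the format returned by `extract_objects`.

```python
def objects_to_grid(objects, rows, cols, background=0):
    """Paint each object's pixels onto a blank grid (hypothetical helper)."""
    grid = [[background] * cols for _ in range(rows)]
    for obj in objects:
        for r, c in obj["pixels"]:
            grid[r][c] = obj["color"]
    return grid


# A red L-shape (color 2) and a single blue cell (color 1).
objects = [
    {"color": 2, "top_left": [0, 0], "width": 2, "height": 2,
     "pixels": [(0, 0), (1, 0), (1, 1)]},
    {"color": 1, "top_left": [2, 3], "width": 1, "height": 1,
     "pixels": [(2, 3)]},
]
grid = objects_to_grid(objects, rows=3, cols=4)
for row in grid:
    print(row)
```

Note that the bounding box (`top_left`, `width`, `height`) covers cells that are not part of the object, such as `(0, 1)` for the L-shape; only `pixels` lists the actual cells.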


#### Add Encoded Task to the Prompt

Adds the encoded task data to the prompt including a short description of the encoding

In [53]:
def add_encodings(prompt, encoding_dicts):
    encoding_explanations = {
        "nonzero_coords": "This encoding lists only the non-zero cells as dictionaries containing their row, column, and value.",
        "object_bbox": "This encoding represents each object in the grid as a group of connected same-colored cells, described by color, bounding box, and coordinates.",
    }

    for encoding_name, encoding_data in encoding_dicts.items():
        explanation = encoding_explanations.get(encoding_name, "This encoding represents the input in an alternate form.")
        prompt += f"\n\nEncoding used: {encoding_name}\nExplanation: {explanation}\n\nHere are the demonstration pairs (encoded):\n"
        for i, pair in enumerate(encoding_data["train"]):
            prompt += f"\nTrain Input {i+1}: {pair['input']}\n"
            prompt += f"Train Output {i+1}: {pair['output']}\n"
    
    return prompt


### Building Task-Tailored Prompt

#### Add Tasks to the Prompt

Adds the task data to the prompt

In [54]:
def add_tasks(prompt, task_data):
    full_prompt = prompt.strip() + "\n\nHere are the demonstration pairs (JSON data):\n"
    for i, pair in enumerate(task_data['train']):
        full_prompt += f"\nTrain Input {i+1}: {pair['input']}\n"
        full_prompt += f"Train Output {i+1}: {pair['output']}\n"
    return full_prompt
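
The text that `add_tasks` appends can be seen on an invented one-pair task. The function is copied from the cell above so that the sketch is self-contained; the task and the prompt string are made up for illustration.

```python
def add_tasks(prompt, task_data):
    """Copy of the cell above."""
    full_prompt = prompt.strip() + "\n\nHere are the demonstration pairs (JSON data):\n"
    for i, pair in enumerate(task_data["train"]):
        full_prompt += f"\nTrain Input {i+1}: {pair['input']}\n"
        full_prompt += f"Train Output {i+1}: {pair['output']}\n"
    return full_prompt


# Invented toy task with a single demonstration pair.
toy_task = {"train": [{"input": [[1, 0]], "output": [[0, 1]]}], "test": []}
text = add_tasks("Describe the transformation.", toy_task)
print(text)
```

Only the `train` pairs are added; the test inputs never reach the LLM through this function.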

#### Different Combination Functions

Combines secondary prompt 1 and 2:

In [55]:
def combine_prompts_1_and_2(prompt_1_response, prompt_2_template):
    combined_prompt = f"""{prompt_2_template.strip()}

Here are visual observations of the task at hand, that may assist you in identifying the transformation:

{prompt_1_response.strip()}

Now provide your transformation analysis based on these observations."""
    return combined_prompt

Combines secondary prompt 1, 2 and 3:

In [56]:
def combine_prompts_1_2_and_3(prompt_1_response, prompt_2_response, prompt_3_template):
    combined_prompt = f"""{prompt_3_template.strip()}

Here are visual observations of the task that may help inform your implementation:
{prompt_1_response.strip()}

Here are the transformation rules that have been identified based on the task:
{prompt_2_response.strip()}

Now reflect on how you would implement a solution to this task in Python, following the instructions above.
"""
    return combined_prompt


Combines secondary prompt 1-3 with the base prompt

In [57]:
def combine_prompts_and_base(prompt_1_response, prompt_2_response, prompt_3_response, prompt_base_template):
    combined_prompt = f"""

Here are visual observations of the task that may help inform your implementation:
{prompt_1_response.strip()}

Here are the transformation rules that have been identified based on the task:
{prompt_2_response.strip()}
    
Here is a reflection on how you might implement a solution to this task in Python:
{prompt_3_response.strip()}

{prompt_base_template.strip()}
"""
    return combined_prompt.strip()

#### Creation of Task-Tailored Prompt

The following code sends Prompts 1-3 to the LLM in sequence and adds the LLM's responses to the task-tailored prompt, together with the task data and its encoding. The result is a complete task-tailored prompt including:

- Visual observations.
- Transformation description.
- Python implementation reflection.
- Base prompt with the program generation instructions.
- Task data and additional encoding.

In [58]:
task_concepts = {}

def build_prompts(task_data, task_id):
    global task_concepts 

    # Create encodings
    encoding_dicts = {
        "nonzero_coords": encode_task_nonzero_coords(task_data),
    }
    
    # Build prompt 1
    full_prompt_1 = add_tasks(prompt_1, task_data)
    full_prompt_1 = add_encodings(full_prompt_1, encoding_dicts)
    response_1 = call_gpt(full_prompt_1)

    # Build prompt 2
    combined_prompt_2 = combine_prompts_1_and_2(response_1, prompt_2)
    full_prompt_2 = add_tasks(combined_prompt_2, task_data)
    full_prompt_2 = add_encodings(full_prompt_2, encoding_dicts)
    response_2 = call_gpt(full_prompt_2)

    # Build prompt 3
    combined_prompt_3 = combine_prompts_1_2_and_3(response_1, response_2, prompt_3)
    full_prompt_3 = add_tasks(combined_prompt_3, task_data)
    full_prompt_3 = add_encodings(full_prompt_3, encoding_dicts)
    response_3 = call_gpt(full_prompt_3)

    # Build final tailored prompt
    combined_prompt_base = combine_prompts_and_base(response_1, response_2, response_3, base_prompt)
    tailored_prompt = add_tasks(combined_prompt_base, task_data)
    tailored_prompt = add_encodings(tailored_prompt, encoding_dicts)

    print("Built tailored prompt.")

    return tailored_prompt

### Save Programs

The following function saves the generated programs in the specified folder using the task's name. Additionally, the LLM's response is cleaned so that only the Python code, without Markdown code fences, is saved.

In [59]:
def save_program(program_text, actual_task_id, suffix=""):
    # Folder definition
    base_folder = "Candidate_programs_tailored_prompts_encodings"
    task_folder = os.path.join(base_folder, actual_task_id)
    
    os.makedirs(task_folder, exist_ok=True)

    # Clean the LLMs response (e.g. remove ```python or ``` if present)
    cleaned_text = re.sub(r"^```(?:python)?\s*|```$", "", program_text.strip(), flags=re.MULTILINE)

    # Determine the next version number for program naming
    existing_files = os.listdir(task_folder)
    version_numbers = [
        int(re.search(r"solution_v(\d+)", fname).group(1))
        for fname in existing_files
        if re.match(r"solution_v\d+", fname)
    ]
    next_version = max(version_numbers, default=0) + 1
    
    # Define the full path to the new Python file and save it
    file_path = os.path.join(task_folder, f"solution_v{next_version}{suffix}.py")
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(cleaned_text.strip())

    print(f"Saved program for task {actual_task_id} as version {next_version}{suffix}: {file_path}")
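
The cleaning step can be tried in isolation. The sketch below applies the same regular expression as `save_program` to an invented LLM response; the fence string is built with `"`" * 3` only so that the example itself contains no literal code fence.

```python
import re

FENCE = "`" * 3
# Same pattern as in save_program: an opening fence (optionally tagged
# "python") at the start of a line, or a closing fence at the end of a line.
pattern = "^" + FENCE + r"(?:python)?\s*|" + FENCE + "$"

# Invented response that wraps the code in a fenced block.
response = FENCE + "python\ndef solve(grid):\n    return grid\n" + FENCE

cleaned = re.sub(pattern, "", response.strip(), flags=re.MULTILINE).strip()
print(cleaned)
```

Text that the LLM adds outside the fences is not removed by this pattern; such responses are caught later by the syntax check.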


### Create Programs

The following function sends the program-generation prompt to the LLM `amount` times and saves each response

In [60]:
def create_programs(tailored_prompt, actual_task_id, amount):
    # Request `amount` candidate programs from the LLM
    for _ in range(amount):
        response = call_gpt(tailored_prompt)
        
        # Save the generated program
        save_program(response, actual_task_id)

### Evaluate Programs

Loads and executes a Python program from a specified file path

In [61]:
def load_program(file_path):
    """Load and execute a Python program from a file."""
    spec = importlib.util.spec_from_file_location("program", file_path)
    program = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(program)
    return program.solve


Checks if the program is valid Python code

In [62]:
def is_valid_python_code(filepath):
    """Check if the Python file contains valid syntax."""
    try:
        with open(filepath, "r", encoding="utf-8") as f:
            source = f.read()
        ast.parse(source)
        return True
    except SyntaxError:
        return False
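
The two steps, syntax check and loading, can be sketched end to end on an invented candidate program written to a temporary folder. The calls are the same standard-library calls used above; the program source and file name are assumptions made for the example.

```python
import ast
import importlib.util
import os
import tempfile

# Invented candidate program: mirrors every row of the grid.
source = "def solve(grid):\n    return [row[::-1] for row in grid]\n"

# Step 1: syntax check only; ast.parse does not execute the code.
ast.parse(source)

# A snippet with a missing colon fails the same check.
try:
    ast.parse("def solve(grid)\n    return grid\n")
    broken_is_valid = True
except SyntaxError:
    broken_is_valid = False

# Step 2: write the program to disk and import it as a module.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "solution_v1.py")
    with open(path, "w", encoding="utf-8") as f:
        f.write(source)
    spec = importlib.util.spec_from_file_location("program", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file's top-level code
    result = module.solve([[1, 2, 3], [4, 5, 6]])

print(broken_is_valid, result)
```

Because `exec_module` runs the file's top-level code, a generated program executes with the same privileges as the notebook itself.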


The following function evaluates the generated programs. This includes:

- Checking that each program is valid Python code and can be loaded; files that fail either check are deleted.
- Checking whether any loadable program remains. If none does, the function returns 0 and the pipeline generates new programs until at least one can be loaded (only loadability matters here, not correctness).
- Comparing the generated outputs with the expected outputs.
- Calculating a score for each program.

The score is the number of correctly transformed demonstration pairs divided by the total number of demonstration pairs of the task. A program with a score below 1 is revised later on.

In [63]:
def evaluate_programs(task_data, task_folder):
    """Evaluate programs and collect detailed candidate outputs for each demonstration pair."""
    programs = []
    program_files = [f for f in os.listdir(task_folder) if f.endswith(".py")]

    # Track whether any valid programs exist
    any_valid = False

    for program_file in program_files:
        program_path = os.path.join(task_folder, program_file)

        # Delete the file if it does not contain valid Python code
        if not is_valid_python_code(program_path):
            print(f"Deleting invalid file: {program_file}")
            os.remove(program_path)
            continue

        try:
            solve_function = load_program(program_path)
        except Exception as e:
            print(f"Error loading program {program_file}: {e}")
            os.remove(program_path)
            continue

        # Set to true if at least one valid program is found
        any_valid = True

        details = []
        correct_count = 0
        total_pairs = len(task_data['train'])

        # Evaluate programs against the training pairs
        for pair in task_data['train']:
            input_grid = pair['input']
            expected_output = pair['output']
            try:
                candidate_output = solve_function(input_grid)
                if np.array_equal(np.array(candidate_output), np.array(expected_output)):
                    correct_count += 1
            except Exception as e:
                candidate_output = f"Error: {e}"
            details.append({
                "input": input_grid,
                "candidate_output": candidate_output,
                "expected_output": expected_output
            })

        # Calculate the score and store the results
        score = correct_count / total_pairs if total_pairs > 0 else 0
        programs.append({
            "program_name": program_file,
            "score": score,
            "correct_pairs": correct_count,
            "total_pairs": total_pairs,
            "details": details
        })

    return programs if any_valid else 0
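
As a worked example of the score, the sketch below evaluates an invented solver on three invented pairs. `score_solver` is a hypothetical, simplified stand-in for the loop above: it compares plain lists with `==` instead of using `np.array_equal`.

```python
def score_solver(solve, pairs):
    """Fraction of pairs whose output the solver reproduces exactly."""
    correct = 0
    for pair in pairs:
        try:
            if solve(pair["input"]) == pair["output"]:
                correct += 1
        except Exception:
            pass  # a crashing solver earns no credit for this pair
    return correct / len(pairs) if pairs else 0


# Invented pairs: the first two are horizontal mirrors, the third is not.
pairs = [
    {"input": [[1, 0]], "output": [[0, 1]]},
    {"input": [[2, 3]], "output": [[3, 2]]},
    {"input": [[4, 5]], "output": [[4, 5]]},
]


def mirror(grid):
    return [row[::-1] for row in grid]


score = score_solver(mirror, pairs)
print(f"{score:.2f}")
```

Two of three pairs match, so the score is 2/3 and this solver would be passed on to the revision step.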

### Creation of Revision Prompt and Revised Programs

The following function builds the revision prompt including the following:

- Generated Python code.
- Task demonstration pairs.
- Generated output.

The prompt is used to task the LLM with creating a new program based on the revision. The program is then saved.

In [64]:
def revise_candidate_program(candidate_file_path, task_data, actual_task_id):
    # Load candidate code from the specified file path
    with open(candidate_file_path, "r", encoding="utf-8") as f:
        candidate_code_text = f.read()

    try:
        solve_fn = load_program(candidate_file_path)
    except Exception as e:
        print(f"Error loading program for revision: {e}")
        return

    ### BUILD THE REVISION PROMPT ###
    
    # Add the candidate code to the revision prompt
    local_revision_prompt = revision_prompt + "\n\nHere is the generated code:\n" + candidate_code_text + "\n\nDemonstration Pairs:\n"

    # Add the demonstration pairs as well as the generated output to the revision prompt
    for i, pair in enumerate(task_data['train']):
        try:
            candidate_output = solve_fn(pair['input'])
        except Exception as e:
            candidate_output = f"Error: {e}"
        local_revision_prompt += f"{i+1}. Input: {pair['input']}\n"
        local_revision_prompt += f"   Expected Output: {pair['output']}\n"
        local_revision_prompt += f"   Generated Output: {candidate_output}\n"

    local_revision_prompt += "\nPlease revise the code."

    # Call the LLM to revise the code
    revised_code = call_gpt(local_revision_prompt)

    # Determine the revision level (e.g. _rev1, _rev2) for the filename
    task_folder = os.path.join("Candidate_programs_tailored_prompts_encodings", actual_task_id)
    base_name = os.path.basename(candidate_file_path)
    base_version_match = re.search(r"solution_v(\d+)", base_name)
    base_version = base_version_match.group(1) if base_version_match else "1"

    existing_files = os.listdir(task_folder)
    revision_count = len([
        f for f in existing_files
        if re.match(rf"solution_v{base_version}_rev\d+\.py", f)
    ])
    suffix = f"_rev{revision_count + 1}"

    # Save revised program
    save_program(revised_code, actual_task_id, suffix=suffix)


### Generation of Predictions on Test Inputs

#### Identification of Best Programs

Selects the two best-performing programs based on their score (= performance on the demonstration pairs). If several programs share the same score, they keep the order in which they were evaluated, so the first ones are picked (e.g. if all scores are 0, solution_v1 and solution_v2 are picked).

In [65]:
def get_best_programs(evaluation_results, actual_task_id, n=2):
    """
    Returns the file paths of the top n programs, ranked by their score on the demonstration pairs.
    Programs are sorted by score in descending order; fewer than n paths are returned if fewer programs were evaluated.
    """
    sorted_programs = sorted(evaluation_results, key=lambda x: x['score'], reverse=True)
    task_folder = os.path.join("Candidate_programs_tailored_prompts_encodings", actual_task_id)
    best_program_files = [os.path.join(task_folder, prog['program_name']) for prog in sorted_programs[:n]]
    for i in best_program_files:
        print(f"Best program: {i}")
    return best_program_files
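
The tie-breaking behavior follows from Python's `sorted` being stable, also with `reverse=True`. A short sketch with invented evaluation results:

```python
# Invented results in the order evaluate_programs might return them.
results = [
    {"program_name": "solution_v1.py", "score": 0.5},
    {"program_name": "solution_v2.py", "score": 1.0},
    {"program_name": "solution_v1_rev1.py", "score": 0.5},
    {"program_name": "solution_v3.py", "score": 1.0},
]

# Same sort call as in get_best_programs.
ranked = sorted(results, key=lambda x: x["score"], reverse=True)
top_two = [r["program_name"] for r in ranked[:2]]
print(top_two)
```

Programs with equal scores stay in their original relative order, so `solution_v2.py` is ranked ahead of `solution_v3.py`.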



#### Creation of Predictions on Test Inputs

The following function loads the two best-performing programs and runs them on the test inputs of the task. The resulting outputs are saved in a submission dictionary that is later written to the submission.json file.

In [66]:
def generate_test_predictions(task_data, actual_task_id, best_program_files):
    """
    Loads the two best candidate programs from the specified file paths,
    runs each one on every test input, and saves the generated outputs
    in the submission file.
    
    Returns a submission dictionary of the form:
    { actual_task_id: [ { "attempt_1": output_from_program1, "attempt_2": output_from_program2 }, ... ] }
    """
    # Load the candidate solvers using the file paths
    best_solvers = [load_program(prog_file) for prog_file in best_program_files]
    
    predictions = []
    # Iterate over each test pair and generate predictions using the best solvers
    for i, pair in enumerate(task_data["test"]):
        input_grid = pair["input"]
        attempt_predictions = {}
        for idx, solver in enumerate(best_solvers, start=1):
            try:
                output = solver(input_grid)
            except Exception as e:
                output = f"Error: {e}"
            attempt_predictions[f"attempt_{idx}"] = output
        predictions.append(attempt_predictions)
    
    # Save the predictions in the submission format
    submission = {str(actual_task_id): predictions}
    
    return submission


## Pipeline

The following is the flow of the Pipeline. All the above functions and prompts are used and work together to create predictions for test inputs. The predictions are saved in a submission.json file for calculating the accuracies.

In [67]:
# Load tasks and API key
tasks = load_tasks("evaluation_set")
load_api_key()

# Load existing submission file if it exists
submission_file = "submission_tailored_prompts_encodings.json"
if os.path.exists(submission_file):
    with open(submission_file, "r") as f:
        final_submission = json.load(f)
else:
    final_submission = {}

# Loop through each task (adjustable range)
for i, task in enumerate(tasks[30:100]):
    actual_task_id = task["filename"].split(".")[0]
    
    ### PROMPT CREATION ###
    tailored_prompt = build_prompts(task['data'], actual_task_id)
    create_programs(tailored_prompt, actual_task_id, amount=2)
    
    ### INITIAL EVALUATION ###
    task_folder = os.path.join("Candidate_programs_tailored_prompts_encodings", actual_task_id)
    evaluation_results = evaluate_programs(task['data'], task_folder)
    
    while evaluation_results == 0:
        create_programs(tailored_prompt, actual_task_id, amount=2)
        evaluation_results = evaluate_programs(task['data'], task_folder)

    ### FIRST REVISION: Revise all < 1 ###
    for result in evaluation_results:
        if result['score'] < 1:
            candidate_file = os.path.join(task_folder, result['program_name'])
            print(f"Revising {result['program_name']}...")
            revise_candidate_program(candidate_file, task['data'], actual_task_id)

    ### FINAL EVALUATION ###
    evaluation_results_3 = evaluate_programs(task['data'], task_folder)
    print(f"Task {actual_task_id} evaluation results:")
    for result in evaluation_results_3:
        print(f"Program {result['program_name']} solved {result['correct_pairs']} out of {result['total_pairs']} pairs. Score: {result['score']:.2f}")
        print("="*50)

    ### PROGRAM SELECTION + PREDICTIONS ###
    best_program_files = get_best_programs(evaluation_results_3, actual_task_id, n=2)
    submission = generate_test_predictions(task['data'], actual_task_id, best_program_files)
    final_submission.update(submission)

# Final output file
with open(submission_file, "w") as f:
    json.dump(final_submission, f)


Built tailored prompt.
Saved program for task 55059096 as version 1: Candidate_programs_tailored_prompts_encodings\55059096\solution_v1.py
Saved program for task 55059096 as version 2: Candidate_programs_tailored_prompts_encodings\55059096\solution_v2.py
Task 55059096 evaluation results:
Program solution_v1.py solved 3 out of 3 pairs. Score: 1.00
Program solution_v2.py solved 3 out of 3 pairs. Score: 1.00
Best program: Candidate_programs_tailored_prompts_encodings\55059096\solution_v1.py
Best program: Candidate_programs_tailored_prompts_encodings\55059096\solution_v2.py
Built tailored prompt.
Saved program for task 5783df64 as version 1: Candidate_programs_tailored_prompts_encodings\5783df64\solution_v1.py
Saved program for task 5783df64 as version 2: Candidate_programs_tailored_prompts_encodings\5783df64\solution_v2.py
Task 5783df64 evaluation results:
Program solution_v1.py solved 3 out of 3 pairs. Score: 1.00
Program solution_v2.py solved 3 out of 3 pairs. Score: 1.00
Best program: 

## Get Accuracy (for personal use if test outputs are present)

Compares the submission.json file with the actual correct test outputs of a task and returns the accuracy.

In [68]:
def are_equal(a, b):
    return json.dumps(a) == json.dumps(b)

def compute_accuracy(submission_path, tasks_dict):
    with open(submission_path, 'r') as f:
        submission = json.load(f)

    total_tasks = len(submission)
    correct_tasks = 0

    for task_id, prediction_list in submission.items():
        task_data = tasks_dict[task_id]
        test_outputs = [test['output'] for test in task_data['test']]

        all_test_cases_correct = True

        for idx, expected_output in enumerate(test_outputs):
            attempts = prediction_list[idx]  # get dict with attempt_1 and attempt_2 for this test input
            pred1 = attempts["attempt_1"]
            pred2 = attempts["attempt_2"]

            if not (are_equal(pred1, expected_output) or are_equal(pred2, expected_output)):
                all_test_cases_correct = False
                break

        if all_test_cases_correct:
            correct_tasks += 1

    accuracy = correct_tasks / total_tasks
    return accuracy
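
The accuracy rule (a task counts as solved only if, for every test input, at least one of the two attempts matches) can be traced on invented data. `task_solved` is a hypothetical condensed form of the inner loop above, and the submission and ground truth are made up for the example.

```python
import json


def task_solved(attempts_per_test, expected_outputs):
    """True if, for every test input, one of the two attempts matches."""
    return all(
        json.dumps(expected) in (json.dumps(a["attempt_1"]), json.dumps(a["attempt_2"]))
        for a, expected in zip(attempts_per_test, expected_outputs)
    )


# Invented submission and ground truth for two toy tasks.
submission = {
    "task_a": [{"attempt_1": [[0, 1]], "attempt_2": [[1, 1]]}],
    "task_b": [
        {"attempt_1": [[2]], "attempt_2": [[3]]},
        {"attempt_1": "Error: boom", "attempt_2": [[5]]},
    ],
}
truth = {"task_a": [[[0, 1]]], "task_b": [[[3]], [[4]]]}

solved = [task_solved(submission[t], truth[t]) for t in submission]
accuracy = sum(solved) / len(solved)
print(f"Accuracy: {accuracy:.2%}")
```

`task_a` is solved by its first attempt, while `task_b` fails on its second test input, so one of two tasks counts as correct.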

In [69]:
with open("submission_tailored_prompts_encodings.json") as f:
    submission = json.load(f)

tasks_dict = {task["filename"].split(".")[0]: task["data"] for task in tasks}

accuracy = compute_accuracy("submission_tailored_prompts_encodings.json", tasks_dict)
print(f"Accuracy: {accuracy:.2%}")

Accuracy: 47.00%


In [70]:
print(tailored_prompt)

Here are visual observations of the task that may help inform your implementation:
- background zeros as border on outer rows/columns and full-zero rows at row 5 and row 10  
- three horizontal bands of content in rows 1–4, 6–9, 11–14  
- in each band two 4×4 square outlines in color 2 at columns 1–4 and 6–9  
- each square outline with hollow interior or interior pattern holes  
- total of six square outlines repeated across bands  
- outputs introduce new fill colors 8 and 3 replacing 2 in those outlines  
- band-wise color pairs: band 1 shows 2 & 8, band 2 shows 8 & 2/3, band 3 shows 2 & 3  
- fixed shape positions and separator rows maintained

Here are the transformation rules that have been identified based on the task:
1. The grid is divided into three fixed horizontal bands, each containing two identical 4×4 square outlines in color 2 at the same column anchors.  
2. In each band exactly one of the two outlines is uniformly recolored to a new solid hue (either 8 or 3) according