In this example, GEPA evolves an entire DSPy program/agent (not just its instructions), including the structure and dataflow of the agent. Starting from a simple dspy.ChainOfThought module for ARC-AGI tasks, we use GEPA to turn it into a full multi-step DSPy program.

Notably, **GEPA improves Gemini-2.5-Pro's performance on ARC-AGI from 44% to 49.5%** by evolving an elaborate 5-step schema for solving the problems:
1. Ask the LLM to hypothesize a natural-language rule from the training examples.
2. Ask the LLM to generate a Python program that implements the natural-language rule.
3. Run the generated Python program on all training examples, gathering feedback on how and when it fails, or confirming that it succeeds on all of them.
4. If the program
 * succeeds on all training examples: proceed as-is
 * otherwise: ask the LLM to improve the program using the gathered feedback
5. Finally, execute the resulting program on all test inputs and return the outputs.

In [7]:
gemini_api_key = input("GEMINI_API_KEY: ")

In [1]:
from datasets import load_dataset

ds = load_dataset("dataartist/arc-agi")

In [None]:
import dspy

trainset = [
    dspy.Example(
        training_examples=ex["train"],
        test_inputs=[x["input"] for x in ex["test"]],
        test_outputs=[x["output"] for x in ex["test"]],
    ).with_inputs("training_examples", "test_inputs")
    for ex in ds["training"]
]
testset = [
    dspy.Example(
        training_examples=ex["train"],
        test_inputs=[x["input"] for x in ex["test"]],
        test_outputs=[x["output"] for x in ex["test"]],
    ).with_inputs("training_examples", "test_inputs")
    for ex in ds["evaluation"]
]

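import random

# Shuffle with a fixed seed so the train/validation split is reproducible.
random.Random(0).shuffle(trainset)

test_set = testset
val_set = trainset[-200:]  # last 200 shuffled training tasks
train_set = trainset[:-200]  # all remaining training tasks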

In [10]:
len(train_set), len(val_set), len(test_set)

(200, 200, 400)

### Defining a simple ChainOfThought program

In [4]:
program_src = """import dspy
from typing import List
import pydantic

MATRIX = List[List[int]]

class TrainingExample(pydantic.BaseModel):
    input: MATRIX
    output: MATRIX

class SolveTaskSignature(dspy.Signature):
    training_examples: List[TrainingExample] = dspy.InputField(description="Input and output examples demonstrating the task to be performed.")
    test_inputs: List[MATRIX] = dspy.InputField(description="Input matrices to be solved following the task described in the training examples.")
    test_outputs: List[MATRIX] = dspy.OutputField(description="Output matrices corresponding to the test inputs.")

program = dspy.ChainOfThought(SolveTaskSignature)"""

### Defining the evaluation metric, which doubles as GEPA's optimization feedback

In [13]:
def is_valid_matrix(matrix, gold_matrix):
    if not isinstance(matrix, list):
        return False, f"The matrix must be a List[List[int]]. The correct matrix is {gold_matrix}."
    n = len(matrix)
    if n == 0:
        return False, f"The matrix must have at least one row. The correct matrix is {gold_matrix}."
    m = len(matrix[0])
    if m == 0:
        return False, f"The matrix must have at least one column. The correct matrix is {gold_matrix}."
    for i in range(n):
        if not isinstance(matrix[i], list):
            return False, f"The {i}-th row must be a List[int]. The correct matrix is {gold_matrix}."
        if len(matrix[i]) != m:
            return (
                False,
                f"The matrix is staggered. Row 0 has {m} columns, but row {i} has {len(matrix[i])} columns. The correct matrix is {gold_matrix}.",
            )
        for j in range(m):
            if not isinstance(matrix[i][j], int):
                return (
                    False,
                    f"The {i}-th row, {j}-th column must be an int, found {type(matrix[i][j])}. The correct matrix is {gold_matrix}.",
                )

    # Check consistency with gold matrix
    gold_n = len(gold_matrix)
    gold_m = len(gold_matrix[0])
    if (n, m) != (gold_n, gold_m):
        return (
            False,
            f"The matrix has dimensions {n}x{m}, but the gold matrix has dimensions {gold_n}x{gold_m}. The correct matrix is {gold_matrix}.",
        )

    same = True
    wrong_indices = []
    for i in range(n):
        for j in range(m):
            if matrix[i][j] != gold_matrix[i][j]:
                same = False
                wrong_indices.append((i, j))
    if same:
        return True, f"Your response is correct. The correct matrix is {gold_matrix}."
    else:
        if len(wrong_indices) < 10:
            return (
                False,
                f"The matrix is incorrect. The following indices are incorrect: {wrong_indices}. The correct matrix is {gold_matrix}.",
            )
        else:
            return False, f"The matrix is incorrect. The correct matrix is {gold_matrix}."


def metric_fn(example, pred, trace=None):
    task_inputs = example.test_inputs
    gold_task_outputs = example.test_outputs
    pred_task_outputs = pred.test_outputs

    if not isinstance(pred_task_outputs, list):
        return dspy.Prediction(
            score=0,
            feedback=f"The response must be a List[List[List[int]]]. The correct response is {gold_task_outputs}.",
        )

    valids = []
    feedbacks = []
    feedback = ""
    if len(task_inputs) != len(pred_task_outputs):
        feedback = f"The number of output matrices ({len(pred_task_outputs)}) must match the number of input matrices ({len(task_inputs)}). The correct response is {gold_task_outputs}."
        return dspy.Prediction(score=0, feedback=feedback)
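    # The test input itself is not used for scoring; the name avoids shadowing the builtin `input`.
    for i, (_task_input, gold_output, pred_output) in enumerate(
        zip(task_inputs, gold_task_outputs, pred_task_outputs, strict=False)
    ):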
        is_valid, feedback = is_valid_matrix(pred_output, gold_output)
        valids.append(is_valid)
        feedbacks.append(f"Feedback on test input {i}: {feedback}")

    score = sum(valids) / len(valids)
    feedback_text = "\n".join(feedbacks)
    return dspy.Prediction(score=score, feedback=feedback_text)
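To make the scoring concrete: the score for a task is the fraction of its test inputs whose predicted matrix exactly matches the gold matrix. The sketch below is a simplified stand-in for metric_fn, not a replacement for it. It skips the shape and type checks, the textual feedback, and the dspy.Prediction wrapper, and the name `exact_match_score` is hypothetical.

```python
from typing import List

Matrix = List[List[int]]


def exact_match_score(gold_outputs: List[Matrix], pred_outputs: List[Matrix]) -> float:
    """Fraction of predicted matrices that equal the gold matrix exactly.

    Simplified illustration only: no validity checks and no feedback text.
    """
    # A wrong number of outputs (or an empty task) scores zero.
    if not gold_outputs or len(gold_outputs) != len(pred_outputs):
        return 0.0
    hits = [pred == gold for gold, pred in zip(gold_outputs, pred_outputs)]
    return sum(hits) / len(hits)


# A task with two test inputs: the first prediction is right, the second is not.
gold = [[[1, 0], [0, 1]], [[2, 2]]]
pred = [[[1, 0], [0, 1]], [[2, 3]]]
score = exact_match_score(gold, pred)
print(score)  # 0.5
```

A single wrong cell makes the whole matrix count as a miss, which is why the full metric also reports the incorrect indices as feedback for GEPA's reflection step.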

### Setting up the GEPA DSPy Adapter (which provides the evaluation harness)

In [None]:
from gepa.adapters.dspy_full_program_adapter.full_program_adapter import DspyAdapter

reflection_lm = dspy.LM(model="gemini/gemini-2.5-pro", max_tokens=32000, api_key=gemini_api_key)
adapter = DspyAdapter(
    task_lm=dspy.LM(model="gemini/gemini-2.5-pro", max_tokens=32000, api_key=gemini_api_key),
    metric_fn=metric_fn,
    num_threads=80,
    reflection_lm=lambda x: reflection_lm(x)[0],
)
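A note on the lambda wrapper: calling a dspy.LM with a prompt returns a list of completions, so `reflection_lm(x)[0]` picks the first one, presumably because the reflection model is expected to behave as a plain prompt-to-string callable. The sketch below illustrates that wrapping pattern with a stand-in class. `FakeLM` and `as_text` are hypothetical names, and no real model is called.

```python
from typing import Callable, List


class FakeLM:
    """Hypothetical stand-in that mimics an LM returning a list of completions."""

    def __call__(self, prompt: str) -> List[str]:
        return [f"echo: {prompt}", "unused second completion"]


fake_lm = FakeLM()

# Same shape as `lambda x: reflection_lm(x)[0]`: keep only the first completion.
as_text: Callable[[str], str] = lambda prompt: fake_lm(prompt)[0]

text = as_text("Propose a better program.")
print(text)  # echo: Propose a better program.
```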

### Evaluating the seed program

In [28]:
o_base = adapter.evaluate(test_set, {"program": program_src})

2025/08/30 04:55:07 INFO dspy.evaluate.evaluate: Average Metric: 176.0 / 400 (44.0%)


The seed program scores 44.0% on the test set (average metric 176.0 / 400).

### GEPA Optimization

In [None]:
from gepa import optimize

o = optimize(
    seed_candidate={"program": program_src},
    trainset=train_set,
    valset=val_set,
    adapter=adapter,
    reflection_lm=lambda x: reflection_lm(x)[0],
    max_metric_calls=4000,
    display_progress_bar=True,
)

2025/08/28 21:33:56 INFO dspy.evaluate.evaluate: Average Metric: 134.0 / 200 (67.0%)
GEPA Optimization:   5%|█████▊                                                                                                             | 200/4000 [00:00<00:11, 318.27rollouts/s]
Iteration 0: Base program full valset score: 0.67
Iteration 1: Selected program 0 score: 0.67
Average Metric: 3.00 / 3 (100.0%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 312.80it/s]
2025/08/28 21:33:56 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)

Iteration 1: All subsample scores perfect. Skipping.
Iteration 1: Reflective mutation did not propose a new candidate
Iteration 2: Selected program 0 score: 0.67
Average Metric: 1.00 / 1 (100.0%):   0%|                                                                                                                       | 0/3 [00:00<?, ?it/s]
Average Metric: 1.00 / 3 (33.3

### GEPA Optimization Results

#### View the GEPA optimized DSPy program

In [30]:
print(o.best_candidate["program"])

import dspy
from typing import List, Tuple, Optional
import pydantic
import copy
import traceback

# Define type aliases and pydantic models for clarity and structure.
MATRIX = List[List[int]]

class TrainingExample(pydantic.BaseModel):
    input: MATRIX
    output: MATRIX

# --- Signatures ---

class HypothesizeRule(dspy.Signature):
    """
    Analyze the provided input/output matrix pairs from the Abstraction and Reasoning Corpus (ARC).
    Deduce the single, underlying transformation rule that converts each input matrix to its corresponding output matrix.
    Describe this rule in clear, step-by-step, unambiguous English. Focus on the logic, not Python code.

    **Successful Strategies to Consider:**
    - **Start Simple:** First, check for simple rules. Is there a global transformation (e.g., rotation, reflection)? Is the output a subgrid of the input? Is a single color being replaced?
    - **Look for Separators:** Check if the grid is partitioned by separator lines (e.g., rows 

As can be seen above, GEPA discovered an elaborate 5-step program that:
1. Asks the LLM to hypothesize a natural-language rule from the training examples.
2. Asks the LLM to generate a Python program that implements the natural-language rule.
3. Runs the generated Python program on all training examples, gathering feedback on how and when it fails, or confirming that it succeeds on all of them.
4. If the program
 * succeeds on all training examples: proceeds as-is
 * otherwise: asks the LLM to improve the program using the gathered feedback
5. Finally, executes the resulting program on all test inputs and returns the outputs.

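Notably, GEPA with Gemini-2.5-Pro discovers reflective refinement on its own: in steps 3 and 4, the program tests its generated code against the training examples and repairs it using the resulting feedback.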
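The control flow of the five steps can be sketched in plain Python. This is a minimal illustration and not the program GEPA produced: the three LLM calls are passed in as ordinary callables (`hypothesize`, `write_code`, `refine_code`, all hypothetical names), the generated source is assumed to define a function `transform(grid)`, and `exec` is used without any sandboxing.

```python
from typing import Callable, List, Optional, Tuple

Matrix = List[List[int]]
Pair = Tuple[Matrix, Matrix]  # (input grid, expected output grid)


def run_transform(code: str, grid: Matrix) -> Matrix:
    """Execute generated source that is assumed to define `transform(grid)`."""
    namespace: dict = {}
    exec(code, namespace)  # illustration only: no sandboxing
    return namespace["transform"](grid)


def check_on_training(code: str, pairs: List[Pair]) -> Optional[str]:
    """Step 3: return None if every training pair is reproduced, else feedback text."""
    problems = []
    for i, (grid, expected) in enumerate(pairs):
        try:
            got = run_transform(code, grid)
        except Exception as exc:
            problems.append(f"Example {i}: raised {type(exc).__name__}: {exc}")
            continue
        if got != expected:
            problems.append(f"Example {i}: expected {expected}, got {got}")
    return "\n".join(problems) or None


def solve(
    pairs: List[Pair],
    test_inputs: List[Matrix],
    hypothesize: Callable[[List[Pair]], str],
    write_code: Callable[[str, List[Pair]], str],
    refine_code: Callable[[str, str, str], str],
) -> List[Matrix]:
    rule = hypothesize(pairs)                   # step 1: natural-language rule
    code = write_code(rule, pairs)              # step 2: program for the rule
    feedback = check_on_training(code, pairs)   # step 3: run on training examples
    if feedback is not None:                    # step 4: refine only on failure
        code = refine_code(rule, code, feedback)
    return [run_transform(code, grid) for grid in test_inputs]  # step 5


# Stubbed "LLM" calls: the first program is wrong, the refinement fixes it.
pairs = [([[1, 2]], [[2, 1]]), ([[3, 4, 5]], [[5, 4, 3]])]
buggy = "def transform(grid):\n    return grid\n"
fixed = "def transform(grid):\n    return [row[::-1] for row in grid]\n"

outputs = solve(
    pairs,
    [[[7, 8, 9]]],
    hypothesize=lambda p: "Reverse every row.",
    write_code=lambda rule, p: buggy,
    refine_code=lambda rule, code, fb: fixed,
)
print(outputs)  # [[[9, 8, 7]]]
```

In this sketch the refinement runs at most once and its result is not re-checked; whether the evolved program retries or re-validates is not shown in the truncated listing above.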

#### Evaluating the optimized agent

In [29]:
o_opt = adapter.evaluate(test_set, o.best_candidate)

2025/08/30 04:55:24 INFO dspy.evaluate.evaluate: Average Metric: 198.0 / 400 (49.5%)


The optimized program raises the test-set score from **44% to 49.5%**!