In this example, GEPA evolves an entire DSPy program, not just its instructions, and may modify the program's structure and dataflow along the way. The starting point is a simple dspy.Predict program for MATH questions.

In [1]:
import os

os.environ["OPENAI_API_KEY"] = input("OPENAI_API_KEY: ")

In [2]:
import dspy

In [19]:
from dspy.datasets import MATH

dataset = MATH(subset="algebra")
print(len(dataset.train), len(dataset.dev), len(dataset.test))

350 350 487


Let's inspect one example from the training set.

In [4]:
example = dataset.train[0]
print("Question:", example.question)
print("Answer:", example.answer)

Question: The doctor has told Cal O'Ree that during his ten weeks of working out at the gym, he can expect each week's weight loss to be $1\%$ of his weight at the end of the previous week. His weight at the beginning of the workouts is $244$ pounds. How many pounds does he expect to weigh at the end of the ten weeks? Express your answer to the nearest whole number.
Answer: 221


Let's define a simple DSPy program to solve this task.

Unlike dspy.GEPA, which takes an instantiated DSPy module as input, here we want to evolve the full DSPy program. A candidate is therefore the program's source code, stored as a string.

In [None]:
program_src = """import dspy
program = dspy.Predict("question -> answer")"""
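
To make the "source code as a string" idea concrete, here is a minimal sketch of how such a candidate could be turned back into a live object: execute the string in a fresh namespace and read out the `program` variable. The helper `load_program` and the stand-in source are hypothetical, and they avoid importing dspy so the snippet runs anywhere; the loading logic inside DspyAdapter may differ.

```python
def load_program(source: str, var_name: str = "program"):
    """Execute candidate source in an isolated namespace and return `var_name`.

    Hypothetical helper for illustration; not part of gepa or dspy.
    """
    namespace: dict = {}
    exec(source, namespace)  # run the candidate's top-level code
    if var_name not in namespace:
        raise ValueError(f"candidate source must define `{var_name}`")
    return namespace[var_name]


# Stand-in candidate: a plain function plays the role of a DSPy module.
toy_src = """def program(question):
    return {"answer": question.upper()}"""

toy_program = load_program(toy_src)
print(toy_program("what is 2 + 2?"))
```

Because a candidate is arbitrary code proposed by a language model, executing it deserves the usual caution, for example running it in a sandboxed environment.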

GEPA interfaces with external frameworks through an adapter. In this case, we integrate GEPA with a DspyAdapter.

In [6]:
from gepa.adapters.dspy_full_program_adapter.full_program_adapter import DspyAdapter

In [7]:
def metric_fn(example, pred, trace=None):
    score = dataset.metric(example, pred)
    if score:
        feedback_text = f"The provided answer {pred.answer} is correct."
    else:
        feedback_text = f"The provided answer {pred.answer} is incorrect. The correct answer is {example.answer}."
    return dspy.Prediction(score=score, feedback=feedback_text)
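
To show what the reflection model receives, here is a dependency-free sketch of the same two feedback branches. `exact_match` is a hypothetical stand-in for `dataset.metric` (the real metric may compare answers more leniently), and a plain dict stands in for `dspy.Prediction`.

```python
from types import SimpleNamespace


def exact_match(example, pred) -> bool:
    # Hypothetical stand-in for dataset.metric: plain string equality.
    return str(example.answer).strip() == str(pred.answer).strip()


def feedback_for(example, pred) -> dict:
    # Mirrors the two branches of metric_fn, returning a plain dict
    # instead of dspy.Prediction so the sketch has no dependencies.
    score = exact_match(example, pred)
    if score:
        text = f"The provided answer {pred.answer} is correct."
    else:
        text = (
            f"The provided answer {pred.answer} is incorrect. "
            f"The correct answer is {example.answer}."
        )
    return {"score": score, "feedback": text}


example = SimpleNamespace(question="...", answer="221")
print(feedback_for(example, SimpleNamespace(answer="221")))
print(feedback_for(example, SimpleNamespace(answer="220")))
```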

In [8]:
reflection_lm = dspy.LM(model="openai/gpt-4.1", max_tokens=32000)
adapter = DspyAdapter(
    task_lm=dspy.LM(model="openai/gpt-4.1-nano", max_tokens=32000),
    metric_fn=metric_fn,
    num_threads=32,
    reflection_lm=lambda x: reflection_lm(x)[0],
)

Let's evaluate the base program on the test set.

In [17]:
adapter.evaluate(dataset.test, {"program": program_src})

2025/08/27 05:10:58 INFO dspy.evaluate.evaluate: Average Metric: 327.0 / 487 (67.1%)


The base program obtains a score of 67.1% on the test set.

Let's launch the GEPA optimization. We use only 50 training and 50 validation examples, with a budget of 500 metric calls (enough for fewer than 10 GEPA candidates to be proposed).
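
The budget arithmetic can be sketched as follows. This is a rough upper bound under the assumption that every program admitted to the candidate pool is scored once on the full validation set; the helper name is hypothetical.

```python
def max_full_evaluations(max_metric_calls: int, valset_size: int) -> int:
    """Upper bound on how many programs can receive a full valset pass."""
    return max_metric_calls // valset_size


budget, valset_size = 500, 50
total = max_full_evaluations(budget, valset_size)
# One pass is spent scoring the seed program before any proposal is made.
proposals_upper_bound = total - 1
print(total, proposals_upper_bound)  # 10 9
```

The small 3-example minibatch evaluations that appear in the log also draw on the same budget, so the number of fully evaluated candidates in practice is lower than this bound.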

In [None]:
import random

random.Random(0).shuffle(dataset.train)
random.Random(0).shuffle(dataset.dev)
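
Each shuffle above uses its own `random.Random(0)` instance, so the order is reproducible across runs and independent of the global random state. A small sketch on toy lists (the list contents are placeholders):

```python
import random

train = list(range(10))
dev = list(range(10))

# A fresh Random(0) per call means both lists receive the same permutation,
# and rerunning the script reproduces it.
random.Random(0).shuffle(train)
random.Random(0).shuffle(dev)

print(train == dev)                      # True
print(sorted(train) == list(range(10)))  # True
```

Lists of equal length receive the same index permutation under this pattern, which is harmless here because the training and dev splits hold different examples.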

In [12]:
from gepa import optimize

o = optimize(
    seed_candidate={"program": program_src},
    trainset=dataset.train[:50],
    valset=dataset.dev[:50],
    adapter=adapter,
    reflection_lm=lambda x: reflection_lm(x)[0],
    max_metric_calls=500,
    display_progress_bar=True,
)

GEPA Optimization:   0%|                                                                                                              | 0/500 [00:00<?, ?rollouts/s]
2025/08/27 05:07:43 INFO dspy.evaluate.evaluate: Average Metric: 34.0 / 50 (68.0%)


Iteration 0: Base program full valset score: 0.68
Iteration 1: Selected program 0 score: 0.68
Average Metric: 2.00 / 3 (66.7%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 678.47it/s]

2025/08/27 05:07:43 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)
2025/08/27 05:07:43 INFO dspy.evaluate.evaluate: Average Metric: 1.0 / 3 (33.3%)
GEPA Optimization:  11%|███████████▏                                                                                        | 56/500 [00:00<00:00, 555.61rollouts/s]


Iteration 1: Proposed new text for program: import dspy

class MathQA(dspy.Signature):
    """
    Solve the given math question step by step, showing all necessary reasoning and calculations.
    Then, provide the final answer clearly and concisely.
    """
    question = dspy.InputField(desc="A math problem to solve.")
    reasoning = dspy.OutputField(desc="Step-by-step logical reasoning and calculations leading to the answer.")
    answer = dspy.OutputField(desc="The final answer, clearly stated.")

program = dspy.ChainOfThought(MathQA)
Iteration 1: New subsample score is not better, skipping
Iteration 2: Selected program 0 score: 0.68
Average Metric: 2.00 / 3 (66.7%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 646.54it/s]

2025/08/27 05:07:43 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)
2025/08/27 05:07:43 INFO dspy.evaluate.evaluate: Average Metric: 1.0 / 3 (33.3%)



Iteration 2: Proposed new text for program: from typing import Optional
import dspy

class MathQA(dspy.Signature):
    """
    Solve the given math question step by step, showing all necessary reasoning and calculations.
    Then, provide the final answer in the required form.
    """
    question = dspy.InputField(desc="A math problem to solve.")
    reasoning = dspy.OutputField(desc="Step-by-step solution with all calculations and explanations.")
    answer = dspy.OutputField(desc="Final answer, clearly boxed or highlighted as required.")

program = dspy.ChainOfThought(MathQA)
Iteration 2: New subsample score is not better, skipping
Iteration 3: Selected program 0 score: 0.68
Average Metric: 2.00 / 3 (66.7%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 645.15it/s]

2025/08/27 05:07:44 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)
2025/08/27 05:07:44 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)



Iteration 3: Proposed new text for program: from typing import Any
import dspy

class MathQA(dspy.Signature):
    """
    Solve the given math question step by step, showing clear reasoning and calculation, then provide the final answer in the required format.
    """
    question = dspy.InputField(desc="A math problem to solve.")
    reasoning = dspy.OutputField(desc="Step-by-step solution and calculations, with clear explanations.")
    answer = dspy.OutputField(desc="Final answer, formatted exactly as requested in the question.")

program = dspy.ChainOfThought(MathQA)
Iteration 3: New subsample score is not better, skipping
Iteration 4: Selected program 0 score: 0.68
Average Metric: 2.00 / 3 (66.7%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 671.30it/s]

2025/08/27 05:07:44 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)
2025/08/27 05:07:44 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)



Iteration 4: Proposed new text for program: import dspy

class MathQA(dspy.Signature):
    """
    Solve the given math question step by step, showing all necessary reasoning and calculations.
    Provide a clear, concise answer in the required format.
    """
    question = dspy.InputField(desc="A math problem to solve.")
    reasoning = dspy.OutputField(desc="Step-by-step solution with all calculations and explanations.")
    answer = dspy.OutputField(desc="Final answer, clearly stated and formatted as requested in the question.")

program = dspy.ChainOfThought(MathQA)
Iteration 4: New subsample score is not better, skipping
Iteration 5: Selected program 0 score: 0.68
Average Metric: 1.00 / 3 (33.3%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 446.92it/s]

2025/08/27 05:07:44 INFO dspy.evaluate.evaluate: Average Metric: 1.0 / 3 (33.3%)
2025/08/27 05:07:44 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 3 (0.0%)



Iteration 5: Proposed new text for program: from typing import Tuple
import dspy

class MathQA(dspy.Signature):
    """
    Solve the given math question step by step, showing clear reasoning and providing a concise, boxed final answer.
    """
    question = dspy.InputField(desc="A math problem to solve.")
    reasoning = dspy.OutputField(desc="Step-by-step solution with clear logical steps and all necessary calculations.")
    answer: str = dspy.OutputField(desc="Final answer, boxed or clearly marked, in simplest form.")

program = dspy.ChainOfThought(MathQA)
Iteration 5: New subsample score is not better, skipping
Iteration 6: Selected program 0 score: 0.68
Average Metric: 1.00 / 3 (33.3%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 548.44it/s]

2025/08/27 05:07:44 INFO dspy.evaluate.evaluate: Average Metric: 1.0 / 3 (33.3%)
2025/08/27 05:07:44 INFO dspy.evaluate.evaluate: Average Metric: 0.0 / 3 (0.0%)



Iteration 6: Proposed new text for program: import dspy

class MathQA(dspy.Signature):
    """
    Solve the given math or word problem step by step, showing all reasoning and calculations.
    Then, provide the final answer clearly and concisely.
    """
    question = dspy.InputField(desc="A math or word problem to solve.", type=str)
    reasoning = dspy.OutputField(desc="Step-by-step reasoning and calculations.", type=str)
    answer = dspy.OutputField(desc="Final answer, clearly stated.", type=str)

program = dspy.ChainOfThought(MathQA)
Iteration 6: New subsample score is not better, skipping
Iteration 7: Selected program 0 score: 0.68
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 589.50it/s]

2025/08/27 05:07:44 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 7: All subsample scores perfect. Skipping.
Iteration 7: Reflective mutation did not propose a new candidate
Iteration 8: Selected program 0 score: 0.68
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 604.72it/s]

2025/08/27 05:07:44 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 8: All subsample scores perfect. Skipping.
Iteration 8: Reflective mutation did not propose a new candidate
Iteration 9: Selected program 0 score: 0.68
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 561.81it/s]

2025/08/27 05:07:44 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 9: All subsample scores perfect. Skipping.
Iteration 9: Reflective mutation did not propose a new candidate
Iteration 10: Selected program 0 score: 0.68
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 618.96it/s]

2025/08/27 05:07:44 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 10: All subsample scores perfect. Skipping.
Iteration 10: Reflective mutation did not propose a new candidate
Iteration 11: Selected program 0 score: 0.68
Average Metric: 2.00 / 3 (66.7%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 545.94it/s]

2025/08/27 05:07:44 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)
2025/08/27 05:07:44 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 11: Proposed new text for program: import dspy

class MathQA(dspy.Signature):
    """
    Solve the given math question step by step, showing clear reasoning and providing the final answer in the required format.
    """
    question = dspy.InputField(desc="A math question, possibly with instructions on answer format.")
    reasoning = dspy.OutputField(desc="Step-by-step solution, showing all calculations and logic.")
    answer = dspy.OutputField(desc="Final answer, formatted exactly as requested in the question.")

program = dspy.ChainOfThought(MathQA)


2025/08/27 05:07:44 INFO dspy.evaluate.evaluate: Average Metric: 31.0 / 50 (62.0%)
GEPA Optimization:  31%|██████████████████████████████▍                                                                    | 154/500 [00:00<00:00, 502.90rollouts/s]

Iteration 11: Full valset score for new program: 0.62
Iteration 11: Full train_val score for new program: 0.62
Iteration 11: Individual valset scores for new program: [True, False, True, True, True, False, False, True, True, True, True, True, False, False, True, True, False, True, True, True, False, False, True, False, False, True, False, True, True, False, True, True, True, True, True, False, True, False, False, True, True, True, True, True, False, False, True, False, True, False]
Iteration 11: New valset pareto front scores: [True, False, True, True, True, False, False, True, True, True, True, True, True, True, True, True, False, True, True, True, True, False, True, False, False, True, False, True, True, False, True, True, True, True, True, False, True, True, True, True, True, True, True, True, False, False, True, False, True, False]
Iteration 11: Full valset pareto front score: 0.72
Iteration 11: Updated valset pareto front programs: [{0, 1}, {0, 1}, {0, 1}, {0, 1}, {0, 1}, {0, 1}, 
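
The numbers in the log lines above are consistent with a per-example front: the new program alone scores 0.62, yet the front score is 0.72 because each validation example counts as solved if any program in the pool solves it. A minimal sketch of that aggregation, assuming ties keep every best-scoring program (the function names and toy scores are hypothetical, not gepa's API):

```python
def pareto_front_scores(scores_by_program: list[list[float]]) -> list[float]:
    """Best score achieved on each validation example by any program."""
    return [max(per_example) for per_example in zip(*scores_by_program)]


def front_members(scores_by_program: list[list[float]]) -> list[set[int]]:
    """Indices of the programs that attain the best score on each example."""
    best = pareto_front_scores(scores_by_program)
    return [
        {i for i, scores in enumerate(scores_by_program) if scores[j] == b}
        for j, b in enumerate(best)
    ]


# Toy data: two programs scored on four validation examples.
toy_scores = [
    [1.0, 0.0, 1.0, 0.0],  # program 0
    [1.0, 1.0, 0.0, 0.0],  # program 1
]
front = pareto_front_scores(toy_scores)
print(front)                    # [1.0, 1.0, 1.0, 0.0]
print(sum(front) / len(front))  # 0.75
print(front_members(toy_scores))
```

In the toy data each program averages 0.5 on its own, while the front averages 0.75, mirroring how the pool as a whole can outscore any single program.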

2025/08/27 05:07:44 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 12: All subsample scores perfect. Skipping.
Iteration 12: Reflective mutation did not propose a new candidate
Iteration 13: Selected program 0 score: 0.68
Average Metric: 2.00 / 3 (66.7%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 614.46it/s]

2025/08/27 05:07:44 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)



Iteration 13: Proposed new text for program: from typing import Literal
import dspy

class MathWordProblemSignature(dspy.Signature):
    """
    Solve the given math word problem step by step, showing clear reasoning, and provide only the final answer in the answer field.
    """
    question = dspy.InputField(desc="A math word problem to solve.")
    reasoning = dspy.OutputField(desc="Step-by-step solution with equations and explanations.")
    answer: str = dspy.OutputField(desc="Final answer only, no explanation.")

program = dspy.ChainOfThought(MathWordProblemSignature)


2025/08/27 05:07:44 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)
2025/08/27 05:07:44 INFO dspy.evaluate.evaluate: Average Metric: 40.0 / 50 (80.0%)


Iteration 13: New program is on the linear pareto front
Iteration 13: Full valset score for new program: 0.8
Iteration 13: Full train_val score for new program: 0.8
Iteration 13: Individual valset scores for new program: [True, True, True, True, True, False, False, True, True, True, True, True, True, True, True, True, False, True, True, True, True, False, True, True, False, True, False, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False, True, True, True, False]
Iteration 13: New valset pareto front scores: [True, True, True, True, True, False, False, True, True, True, True, True, True, True, True, True, False, True, True, True, True, False, True, True, False, True, False, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False, True, True, True, False]
Iteration 13: Full valset pareto front score: 0.8
Iteration 13: Updated valset pareto front programs: [{0, 1, 2

2025/08/27 05:07:44 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 14: All subsample scores perfect. Skipping.
Iteration 14: Reflective mutation did not propose a new candidate
Iteration 15: Selected program 2 score: 0.8
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 582.19it/s]

2025/08/27 05:07:44 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)
GEPA Optimization:  44%|███████████████████████████████████████████▎                                                       | 219/500 [00:00<00:00, 540.88rollouts/s]


Iteration 15: All subsample scores perfect. Skipping.
Iteration 15: Reflective mutation did not propose a new candidate
Iteration 16: Selected program 2 score: 0.8
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 590.44it/s]

2025/08/27 05:07:44 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 16: All subsample scores perfect. Skipping.
Iteration 16: Reflective mutation did not propose a new candidate
Iteration 17: Selected program 2 score: 0.8
Average Metric: 1.00 / 3 (33.3%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 659.03it/s]

2025/08/27 05:07:44 INFO dspy.evaluate.evaluate: Average Metric: 1.0 / 3 (33.3%)



Iteration 17: Proposed new text for program: from typing import Literal
import dspy

class MathWordProblemSignature(dspy.Signature):
    """
    Solve the given math word problem step by step, showing clear, detailed reasoning with all intermediate steps, equations, and explanations. In the 'answer' field, provide ONLY the final answer, fully simplified and in the form requested by the problem (e.g., as a reduced fraction, integer, or in terms of variables as specified). Do not include any explanation or units in the 'answer' field.
    """
    question = dspy.InputField(desc="A math word problem to solve.")
    reasoning = dspy.OutputField(desc="Step-by-step solution with all intermediate steps, equations, and explanations. Show all algebraic manipulations and simplifications.")
    answer: str = dspy.OutputField(desc="Final answer only, fully simplified and in the form requested by the problem. No explanation, no units, no extra text.")

program = dspy.ChainOfThought(MathWordProblem

2025/08/27 05:08:08 INFO dspy.evaluate.evaluate: Average Metric: 1.0 / 3 (33.3%)


Iteration 17: New subsample score is not better, skipping
Iteration 18: Selected program 2 score: 0.8
Average Metric: 2.00 / 3 (66.7%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 204.79it/s]

2025/08/27 05:08:08 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)
2025/08/27 05:08:08 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)



Iteration 18: Proposed new text for program: from typing import Literal
import dspy

class MathWordProblemSignature(dspy.Signature):
    """
    Solve the given math word problem step by step, showing clear, detailed reasoning with all intermediate steps, equations, and explanations. Ensure the reasoning is thorough and does not skip any steps. Then, provide ONLY the final answer (no explanation) in the answer field, formatted as requested in the problem (e.g., as a number, to the nearest whole number, etc.).
    """
    question = dspy.InputField(desc="A math word problem to solve.")
    reasoning = dspy.OutputField(desc="Step-by-step, detailed solution with all intermediate steps, equations, and explanations. Do not skip any steps.")
    answer: str = dspy.OutputField(desc="Final answer only, formatted as requested in the problem. No explanation.")

class MathWordProblemModule(dspy.Module):
    def __init__(self):
        super().__init__()
        # First, generate detailed reasoni

2025/08/27 05:08:08 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 19: All subsample scores perfect. Skipping.
Iteration 19: Reflective mutation did not propose a new candidate
Iteration 20: Selected program 2 score: 0.8
Average Metric: 2.00 / 3 (66.7%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 146.79it/s]

2025/08/27 05:08:08 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)
2025/08/27 05:08:08 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)



Iteration 20: Proposed new text for program: from typing import Optional
import dspy

class MathWordProblemSignature(dspy.Signature):
    """
    Solve the given math word problem step by step, showing clear reasoning, and provide only the final answer in the answer field.
    
    Instructions:
    - In the 'reasoning' field, show a detailed, step-by-step solution, including all equations, substitutions, and intermediate calculations. Use clear mathematical notation and explanations.
    - In the 'answer' field, provide ONLY the final answer, formatted as requested in the problem (e.g., include units, round as specified, use scientific notation if required). Do not include any explanation or extra text in the answer field.
    - If the answer is a number, repeat it exactly as it should appear in the answer box (e.g., include a dollar sign if required, or write as a decimal if specified).
    - If the problem requests a sum, product, or other operation on multiple solutions, perform t

2025/08/27 05:08:08 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)
2025/08/27 05:08:08 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 21: Proposed new text for program: from typing import Optional
import dspy

class MathWordProblemSignature(dspy.Signature):
    """
    Solve the given math word problem step by step, showing clear, detailed reasoning with all intermediate calculations, and provide only the final answer in the answer field. 
    The answer should be in the simplest exact form (e.g., reduced fraction, radical, or boxed LaTeX), or as a decimal rounded as specified in the question. 
    Do not repeat the reasoning in the answer field. If the answer is a monetary value, include the dollar sign and two decimal places.
    """
    question = dspy.InputField(desc="A math word problem to solve.")
    reasoning = dspy.OutputField(desc="Step-by-step solution with all intermediate steps, equations, and explanations. Show all calculations and clearly indicate the final answer at the end of the reasoning, but do not repeat it in the answer field.")
    answer: str = dspy.OutputField(desc="Final answer on

2025/08/27 05:08:08 INFO dspy.evaluate.evaluate: Average Metric: 36.0 / 50 (72.0%)
GEPA Optimization:  60%|███████████████████████████████████████████████████████████▊                                        | 299/500 [00:24<00:25,  7.83rollouts/s]

Iteration 21: Full valset score for new program: 0.72
Iteration 21: Full train_val score for new program: 0.72
Iteration 21: Individual valset scores for new program: [True, False, True, True, True, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False, True, True, False, True, False, True, True, False, True, True, True, True, True, False, True, True, True, True, False, True, True, True, False, False, True, False, True, False]
Iteration 21: New valset pareto front scores: [True, True, True, True, True, False, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, False, True, False, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, False, True, True, True, False]
Iteration 21: Full valset pareto front score: 0.82
Iteration 21: Updated valset pareto front programs: [{0, 1, 2, 3}, {2}, {0, 1, 2, 3}, {0, 1, 2, 3}, {0, 1, 2, 3

2025/08/27 05:08:08 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 22: All subsample scores perfect. Skipping.
Iteration 22: Reflective mutation did not propose a new candidate
Iteration 23: Selected program 3 score: 0.72
Average Metric: 3.00 / 3 (100.0%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 98.89it/s]

2025/08/27 05:08:08 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 23: All subsample scores perfect. Skipping.
Iteration 23: Reflective mutation did not propose a new candidate
Iteration 24: Selected program 3 score: 0.72
Average Metric: 2.00 / 3 (66.7%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 124.75it/s]

2025/08/27 05:08:08 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)



Iteration 24: Proposed new text for program: from typing import Optional
import dspy

class MathWordProblemSignature(dspy.Signature):
    """
    Solve the given math word problem step by step, showing clear, detailed reasoning with all intermediate calculations, and provide only the final answer in the answer field. 
    The answer should be in the simplest exact form (e.g., reduced fraction, radical, or boxed LaTeX), or as a decimal rounded as specified in the question. 
    Do not repeat the reasoning in the answer field. If the answer is a monetary value, include the dollar sign and two decimal places.
    """
    question = dspy.InputField(desc="A math word problem to solve.")
    reasoning = dspy.OutputField(desc="Step-by-step solution with all intermediate steps, equations, and explanations. Show all calculations and clearly indicate the final answer at the end of the reasoning, but do not repeat it in the answer field.")
    answer: str = dspy.OutputField(desc="Final answer on

2025/08/27 05:08:08 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)
GEPA Optimization:  63%|██████████████████████████████████████████████████████████████▊                                     | 314/500 [00:24<00:21,  8.65rollouts/s]


Iteration 25: All subsample scores perfect. Skipping.
Iteration 25: Reflective mutation did not propose a new candidate
Iteration 26: Selected program 2 score: 0.8
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 725.53it/s]

2025/08/27 05:08:08 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 26: All subsample scores perfect. Skipping.
Iteration 26: Reflective mutation did not propose a new candidate
Iteration 27: Selected program 2 score: 0.8
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 597.82it/s]

2025/08/27 05:08:08 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 27: All subsample scores perfect. Skipping.
Iteration 27: Reflective mutation did not propose a new candidate
Iteration 28: Selected program 2 score: 0.8
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 550.17it/s]

2025/08/27 05:08:08 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 28: All subsample scores perfect. Skipping.
Iteration 28: Reflective mutation did not propose a new candidate
Iteration 29: Selected program 2 score: 0.8
Average Metric: 2.00 / 3 (66.7%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 491.81it/s]

2025/08/27 05:08:08 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)



Iteration 29: Proposed new text for program: from typing import Optional
import dspy

class MathWordProblemSignature(dspy.Signature):
    """
    Solve the given math word problem step by step, showing clear, detailed reasoning in the reasoning field. In the answer field, provide ONLY the final answer, in the simplest form, with no explanation, no units, and no restatement of the question. If the answer is a fraction, use LaTeX format (e.g., \\frac{a}{b}); if an interval, use standard interval notation (e.g., [0,1)); if a probability, use the simplest exact form.
    """
    question = dspy.InputField(desc="A math word problem to solve.")
    reasoning = dspy.OutputField(desc="Step-by-step solution with equations, logical steps, and explanations. Show all work.")
    answer: str = dspy.OutputField(desc="Final answer ONLY, in the simplest form, with no explanation or restatement. Use LaTeX for fractions, interval notation for intervals, and exact values for probabilities.")

class Math

2025/08/27 05:08:08 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)
2025/08/27 05:08:09 INFO dspy.evaluate.evaluate: Average Metric: 46.0 / 50 (92.0%)
GEPA Optimization:  76%|███████████████████████████████████████████████████████████████████████████▊                        | 379/500 [00:25<00:08, 13.47rollouts/s]

Iteration 29: New program is on the linear pareto front
Iteration 29: Full valset score for new program: 0.92
Iteration 29: Full train_val score for new program: 0.92
Iteration 29: Individual valset scores for new program: [True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, False]
Iteration 29: New valset pareto front scores: [True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, False, True, True, True, False]
Iteration 29: Full valset pareto front score: 0.92
Iteration 29: Updated valset pareto front programs: [{0, 1, 2, 3, 4}, 

2025/08/27 05:08:09 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 30: All subsample scores perfect. Skipping.
Iteration 30: Reflective mutation did not propose a new candidate
Iteration 31: Selected program 4 score: 0.92
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 113.28it/s]

2025/08/27 05:08:09 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 31: All subsample scores perfect. Skipping.
Iteration 31: Reflective mutation did not propose a new candidate
Iteration 32: Selected program 4 score: 0.92
Average Metric: 2.00 / 3 (66.7%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 119.83it/s]

2025/08/27 05:08:09 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)
2025/08/27 05:08:09 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)



Iteration 32: Proposed new text for program: import dspy
from typing import Optional

class MathWordProblemSignature(dspy.Signature):
    """
    Solve the given math word problem step by step, showing clear, detailed reasoning in the reasoning field. In the answer field, provide ONLY the final answer, in the simplest form, with no explanation, no units, and no restatement of the question. If the answer is a fraction, use LaTeX format (e.g., \\frac{a}{b}); if an interval, use standard interval notation (e.g., [0,1)); if a probability, use the simplest exact form.
    """
    question = dspy.InputField(desc="A math word problem to solve.")
    reasoning = dspy.OutputField(desc="Step-by-step solution with equations, logical steps, and explanations. Show all work.")
    answer: str = dspy.OutputField(desc="Final answer ONLY, in the simplest form, with no explanation or restatement. Use LaTeX for fractions, interval notation for intervals, and exact values for probabilities.")

class Math

2025/08/27 05:08:09 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)
2025/08/27 05:08:09 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)



Iteration 33: Proposed new text for program: from typing import Optional
import dspy

class MathWordProblemSignature(dspy.Signature):
    """
    Solve the given math word problem step by step, showing clear, detailed reasoning in the reasoning field. In the answer field, provide ONLY the final answer, in the simplest form, with no explanation, no units, and no restatement of the question. If the answer is a fraction, use LaTeX format (e.g., \\frac{a}{b}); if an interval, use standard interval notation (e.g., [0,1)); if a probability, use the simplest exact form.
    """
    question = dspy.InputField(desc="A math word problem to solve.")
    reasoning = dspy.OutputField(desc="Step-by-step solution with equations, logical steps, and explanations. Show all work.")
    answer: str = dspy.OutputField(desc="Final answer ONLY, in the simplest form, with no explanation or restatement. Use LaTeX for fractions, interval notation for intervals, and exact values for probabilities.")

class Math

2025/08/27 05:08:09 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 34: All subsample scores perfect. Skipping.
Iteration 34: Reflective mutation did not propose a new candidate
Iteration 35: Selected program 4 score: 0.92
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 114.05it/s]

2025/08/27 05:08:09 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 35: All subsample scores perfect. Skipping.
Iteration 35: Reflective mutation did not propose a new candidate
Iteration 36: Selected program 4 score: 0.92
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 115.72it/s]

2025/08/27 05:08:09 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 36: All subsample scores perfect. Skipping.
Iteration 36: Reflective mutation did not propose a new candidate
Iteration 37: Selected program 4 score: 0.92
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 127.38it/s]

2025/08/27 05:08:09 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 37: All subsample scores perfect. Skipping.
Iteration 37: Reflective mutation did not propose a new candidate
Iteration 38: Selected program 4 score: 0.92
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 121.34it/s]

2025/08/27 05:08:09 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)
GEPA Optimization:  82%|██████████████████████████████████████████████████████████████████████████████████▍                 | 412/500 [00:25<00:05, 16.62rollouts/s]


Iteration 38: All subsample scores perfect. Skipping.
Iteration 38: Reflective mutation did not propose a new candidate
Iteration 39: Selected program 4 score: 0.92
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 112.93it/s]

2025/08/27 05:08:09 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 39: All subsample scores perfect. Skipping.
Iteration 39: Reflective mutation did not propose a new candidate
Iteration 40: Selected program 4 score: 0.92
Average Metric: 2.00 / 3 (66.7%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 117.59it/s]

2025/08/27 05:08:09 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)



Iteration 40: Proposed new text for program: from typing import Optional
import dspy

class MathWordProblemSignature(dspy.Signature):
    """
    Solve the given math word problem step by step, showing clear, detailed reasoning in the reasoning field. In the answer field, provide ONLY the final answer, in the simplest form, with no explanation, no units, and no restatement of the question. If the answer is a fraction, use LaTeX format (e.g., \\frac{a}{b}); if an interval, use standard interval notation (e.g., [0,1)); if a probability, use the simplest exact form.
    """
    question = dspy.InputField(desc="A math word problem to solve.")
    reasoning = dspy.OutputField(desc="Step-by-step solution with equations, logical steps, and explanations. Show all work.")
    answer: str = dspy.OutputField(desc="Final answer ONLY, in the simplest form, with no explanation or restatement. Use LaTeX for fractions, interval notation for intervals, and exact values for probabilities.")

class Math

2025/08/27 05:08:09 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)


Iteration 40: New subsample score is not better, skipping
Iteration 41: Selected program 4 score: 0.92
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 111.66it/s]

2025/08/27 05:08:09 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 41: All subsample scores perfect. Skipping.
Iteration 41: Reflective mutation did not propose a new candidate
Iteration 42: Selected program 4 score: 0.92
Average Metric: 2.00 / 3 (66.7%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 118.29it/s]

2025/08/27 05:08:09 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)



Iteration 42: Proposed new text for program: from typing import Optional
import dspy

class MathWordProblemSignature(dspy.Signature):
    """
    Solve the given math word problem step by step, showing clear, detailed reasoning in the reasoning field. In the answer field, provide ONLY the final answer, in the simplest form, with no explanation, no units, and no restatement of the question. If the answer is a fraction, use LaTeX format (e.g., \\frac{a}{b}); if an interval, use standard interval notation (e.g., [0,1)); if a probability, use the simplest exact form. If the answer should be a decimal to a certain precision, follow the instructions in the question.
    """
    question = dspy.InputField(desc="A math word problem to solve.")
    reasoning = dspy.OutputField(desc="Step-by-step solution with equations, logical steps, and explanations. Show all work.")
    answer: str = dspy.OutputField(desc="Final answer ONLY, in the simplest form, with no explanation or restatement. Use LaTe

2025/08/27 05:08:09 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)


Iteration 42: New subsample score is not better, skipping
Iteration 43: Selected program 4 score: 0.92
Average Metric: 2.00 / 3 (66.7%): 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 96.09it/s]

2025/08/27 05:08:09 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)



Iteration 43: Proposed new text for program: from typing import Optional
import dspy

class MathWordProblemSignature(dspy.Signature):
    """
    Solve the given math word problem step by step, showing clear, detailed reasoning in the reasoning field. In the answer field, provide ONLY the final answer, in the simplest form, with no explanation, no units, and no restatement of the question. If the answer is a fraction, use LaTeX format (e.g., \\frac{a}{b}); if an interval, use standard interval notation (e.g., [0,1)); if a probability, use the simplest exact form.
    """
    question = dspy.InputField(desc="A math word problem to solve.")
    reasoning = dspy.OutputField(desc="Step-by-step solution with equations, logical steps, and explanations. Show all work.")
    answer: str = dspy.OutputField(desc="Final answer ONLY, in the simplest form, with no explanation or restatement. Use LaTeX for fractions, interval notation for intervals, and exact values for probabilities.")

class Math

2025/08/27 05:08:09 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)
GEPA Optimization:  87%|███████████████████████████████████████████████████████████████████████████████████████▏            | 436/500 [00:26<00:03, 19.49rollouts/s]

Iteration 43: New subsample score is not better, skipping
Iteration 44: Selected program 4 score: 0.92
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 155.97it/s]

2025/08/27 05:08:09 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 44: All subsample scores perfect. Skipping.
Iteration 44: Reflective mutation did not propose a new candidate
Iteration 45: Selected program 4 score: 0.92
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 113.10it/s]

2025/08/27 05:08:09 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 45: All subsample scores perfect. Skipping.
Iteration 45: Reflective mutation did not propose a new candidate
Iteration 46: Selected program 4 score: 0.92
Average Metric: 2.00 / 3 (66.7%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 120.83it/s]

2025/08/27 05:08:09 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)



Iteration 46: Proposed new text for program: from typing import Optional
import dspy

class MathWordProblemSignature(dspy.Signature):
    """
    Solve the given math word problem step by step, showing clear, detailed reasoning in the reasoning field. In the answer field, provide ONLY the final answer, in the simplest form, with no explanation, no units, and no restatement of the question. If the answer is a fraction, use LaTeX format (e.g., \\frac{a}{b}); if an interval, use standard interval notation (e.g., [0,1)); if a probability, use the simplest exact form.
    """
    question = dspy.InputField(desc="A math word problem to solve.")
    reasoning = dspy.OutputField(desc="Step-by-step solution with equations, logical steps, and explanations. Show all work.")
    answer: str = dspy.OutputField(desc="Final answer ONLY, in the simplest form, with no explanation or restatement. Use LaTeX for fractions, interval notation for intervals, and exact values for probabilities.")

class Math

2025/08/27 05:08:10 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)


Iteration 46: New subsample score is not better, skipping
Iteration 47: Selected program 4 score: 0.92
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 109.59it/s]

2025/08/27 05:08:10 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 47: All subsample scores perfect. Skipping.
Iteration 47: Reflective mutation did not propose a new candidate
Iteration 48: Selected program 4 score: 0.92
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 116.47it/s]

2025/08/27 05:08:10 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 48: All subsample scores perfect. Skipping.
Iteration 48: Reflective mutation did not propose a new candidate
Iteration 49: Selected program 4 score: 0.92
Average Metric: 2.00 / 3 (66.7%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 121.50it/s]

2025/08/27 05:08:10 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)



Iteration 49: Proposed new text for program: from typing import Optional
import dspy

class MathWordProblemSignature(dspy.Signature):
    """
    Solve the given math word problem step by step, showing clear, detailed reasoning in the reasoning field. In the answer field, provide ONLY the final answer, in the simplest form, with no explanation, no units, and no restatement of the question. If the answer is a fraction, use LaTeX format (e.g., \\frac{a}{b}); if an interval, use standard interval notation (e.g., [0,1)); if a probability, use the simplest exact form.
    """
    question = dspy.InputField(desc="A math word problem to solve.")
    reasoning = dspy.OutputField(desc="Step-by-step solution with equations, logical steps, and explanations. Show all work.")
    answer: str = dspy.OutputField(desc="Final answer ONLY, in the simplest form, with no explanation or restatement. Use LaTeX for fractions, interval notation for intervals, and exact values for probabilities.")

class Math

GEPA Optimization:  92%|████████████████████████████████████████████████████████████████████████████████████████████        | 460/500 [00:26<00:01, 23.51rollouts/s]

Iteration 49: New subsample score is not better, skipping
Iteration 50: Selected program 4 score: 0.92
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 119.42it/s]

2025/08/27 05:08:10 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 50: All subsample scores perfect. Skipping.
Iteration 50: Reflective mutation did not propose a new candidate
Iteration 51: Selected program 4 score: 0.92
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 116.15it/s]

2025/08/27 05:08:10 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 51: All subsample scores perfect. Skipping.
Iteration 51: Reflective mutation did not propose a new candidate
Iteration 52: Selected program 4 score: 0.92
Average Metric: 2.00 / 3 (66.7%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 116.09it/s]

2025/08/27 05:08:10 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)



Iteration 52: Proposed new text for program: from typing import Optional
import dspy

class MathWordProblemSignature(dspy.Signature):
    """
    Solve the given math word problem step by step, showing clear, detailed reasoning in the reasoning field. In the answer field, provide ONLY the final answer, in the simplest form, with no explanation, no units, and no restatement of the question. If the answer is a fraction, use LaTeX format (e.g., \\frac{a}{b}); if an interval, use standard interval notation (e.g., [0,1)); if a probability, use the simplest exact form.
    """
    question = dspy.InputField(desc="A math word problem to solve.")
    reasoning = dspy.OutputField(desc="Step-by-step solution with equations, logical steps, and explanations. Show all work.")
    answer: str = dspy.OutputField(desc="Final answer ONLY, in the simplest form, with no explanation or restatement. Use LaTeX for fractions, interval notation for intervals, and exact values for probabilities.")

class Math

2025/08/27 05:08:10 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 53: All subsample scores perfect. Skipping.
Iteration 53: Reflective mutation did not propose a new candidate
Iteration 54: Selected program 4 score: 0.92
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 122.64it/s]

2025/08/27 05:08:10 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)
GEPA Optimization:  96%|███████████████████████████████████████████████████████████████████████████████████████████████▌    | 478/500 [00:26<00:00, 27.54rollouts/s]


Iteration 54: All subsample scores perfect. Skipping.
Iteration 54: Reflective mutation did not propose a new candidate
Iteration 55: Selected program 4 score: 0.92
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 118.67it/s]

2025/08/27 05:08:10 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 55: All subsample scores perfect. Skipping.
Iteration 55: Reflective mutation did not propose a new candidate
Iteration 56: Selected program 4 score: 0.92
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 112.16it/s]

2025/08/27 05:08:10 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 56: All subsample scores perfect. Skipping.
Iteration 56: Reflective mutation did not propose a new candidate
Iteration 57: Selected program 4 score: 0.92
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 121.06it/s]

2025/08/27 05:08:10 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 57: All subsample scores perfect. Skipping.
Iteration 57: Reflective mutation did not propose a new candidate
Iteration 58: Selected program 4 score: 0.92
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 121.87it/s]

2025/08/27 05:08:10 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 58: All subsample scores perfect. Skipping.
Iteration 58: Reflective mutation did not propose a new candidate
Iteration 59: Selected program 4 score: 0.92
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 116.46it/s]

2025/08/27 05:08:10 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)
GEPA Optimization:  99%|██████████████████████████████████████████████████████████████████████████████████████████████████▌ | 493/500 [00:26<00:00, 31.61rollouts/s]


Iteration 59: All subsample scores perfect. Skipping.
Iteration 59: Reflective mutation did not propose a new candidate
Iteration 60: Selected program 4 score: 0.92
Average Metric: 3.00 / 3 (100.0%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 33.93it/s]

2025/08/27 05:08:10 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 60: All subsample scores perfect. Skipping.
Iteration 60: Reflective mutation did not propose a new candidate
Iteration 61: Selected program 4 score: 0.92
Average Metric: 3.00 / 3 (100.0%): 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 106.63it/s]

2025/08/27 05:08:10 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)



Iteration 61: All subsample scores perfect. Skipping.
Iteration 61: Reflective mutation did not propose a new candidate
Iteration 62: Selected program 4 score: 0.92
Average Metric: 2.00 / 3 (66.7%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 121.96it/s]

2025/08/27 05:08:10 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)



Iteration 62: Proposed new text for program: from typing import Optional
import dspy

class MathWordProblemSignature(dspy.Signature):
    """
    Solve the given math word problem step by step, showing clear, detailed reasoning in the reasoning field. In the answer field, provide ONLY the final answer, in the simplest form, with no explanation, no units, and no restatement of the question. If the answer is a fraction, use LaTeX format (e.g., \\frac{a}{b}); if an interval, use standard interval notation (e.g., [0,1)); if a probability, use the simplest exact form.
    """
    question = dspy.InputField(desc="A math word problem to solve.")
    reasoning = dspy.OutputField(desc="Step-by-step solution with equations, logical steps, and explanations. Show all work.")
    answer: str = dspy.OutputField(desc="Final answer ONLY, in the simplest form, with no explanation or restatement. Use LaTeX for fractions, interval notation for intervals, and exact values for probabilities.")

class Math

2025/08/27 05:08:10 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)
GEPA Optimization: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████▊| 499/500 [00:26<00:00, 18.68rollouts/s]

Iteration 62: New subsample score is not better, skipping
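
Most iterations in the log above end in one of two ways: either the sampled minibatch of 3 examples is already solved perfectly, so no new program is proposed, or a new program is proposed but scores no better on that minibatch and is discarded. A minimal sketch for tallying these outcomes from captured output follows. The message strings are copied from the log above; `tally_outcomes` and the `OUTCOMES` labels are hypothetical helpers written for this illustration, not part of GEPA.

```python
import re
from collections import Counter

# A few lines copied from the log above. In practice this would be the
# full captured output of the optimization run.
LOG = """\
Iteration 60: All subsample scores perfect. Skipping.
Iteration 60: Reflective mutation did not propose a new candidate
Iteration 61: All subsample scores perfect. Skipping.
Iteration 61: Reflective mutation did not propose a new candidate
Iteration 62: Selected program 4 score: 0.92
Iteration 62: New subsample score is not better, skipping
"""

# Hypothetical labels mapped to the log text that marks each outcome.
OUTCOMES = {
    "perfect_minibatch": "All subsample scores perfect",
    "rejected_proposal": "New subsample score is not better",
}


def tally_outcomes(log_text):
    """Count how many iterations ended in each known outcome."""
    counts = Counter()
    for line in log_text.splitlines():
        match = re.match(r"Iteration (\d+): (.*)", line)
        if not match:
            continue  # progress bars, timestamps, program text, blank lines
        message = match.group(2)
        for label, marker in OUTCOMES.items():
            if message.startswith(marker):
                counts[label] += 1
    return counts


counts = tally_outcomes(LOG)
print(counts["perfect_minibatch"], counts["rejected_proposal"])  # 2 1
```

A high share of perfect minibatches late in the run suggests the selected program already solves most of the sampled training examples, which leaves the reflection step little feedback to work with.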





Let's see the DSPy program found by GEPA:

In [21]:
print(o.best_candidate["program"])

from typing import Optional
import dspy

class MathWordProblemSignature(dspy.Signature):
    """
    Solve the given math word problem step by step, showing clear, detailed reasoning in the reasoning field. In the answer field, provide ONLY the final answer, in the simplest form, with no explanation, no units, and no restatement of the question. If the answer is a fraction, use LaTeX format (e.g., \\frac{a}{b}); if an interval, use standard interval notation (e.g., [0,1)); if a probability, use the simplest exact form.
    """
    question = dspy.InputField(desc="A math word problem to solve.")
    reasoning = dspy.OutputField(desc="Step-by-step solution with equations, logical steps, and explanations. Show all work.")
    answer: str = dspy.OutputField(desc="Final answer ONLY, in the simplest form, with no explanation or restatement. Use LaTeX for fractions, interval notation for intervals, and exact values for probabilities.")

class MathWordProblemFinalAnswer(dspy.Signature):
    ""
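
Because the best candidate is a source string rather than an instantiated module, it has to be executed before it can be called. The sketch below shows one way to do that, assuming the convention used by the seed candidate earlier in this notebook: the source defines a top-level variable named `program`. The `load_program` helper is hypothetical and is not necessarily how `DspyAdapter` builds programs internally. A toy source string stands in for the real candidate so the example runs without DSPy or an API key.

```python
def load_program(source, entry_point="program"):
    """Execute candidate source code and return its top-level entry point.

    Hypothetical helper for illustration. It assumes the source assigns
    the runnable object to a variable named `entry_point`.
    """
    namespace = {}
    # Caution: exec runs arbitrary code. Only use it on source you trust
    # or inside a sandbox, since candidates here are written by an LM.
    exec(source, namespace)
    if entry_point not in namespace:
        raise ValueError(f"source does not define {entry_point!r}")
    return namespace[entry_point]


# Toy stand-in for a candidate; a real one would import dspy and build modules.
toy_source = '''
class Echo:
    def __call__(self, question):
        return {"answer": question.upper()}

program = Echo()
'''

program = load_program(toy_source)
print(program(question="two plus two"))  # {'answer': 'TWO PLUS TWO'}
```

With the real candidate, the same call would be `load_program(o.best_candidate["program"])`, after which the returned object can be invoked like any other DSPy module.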

Let's evaluate the optimized program on the test set.

In [18]:
adapter.evaluate(dataset.test, o.best_candidate)

2025/08/27 05:12:51 INFO dspy.evaluate.evaluate: Average Metric: 448.0 / 487 (92.0%)


We see the test-set score go from 67.1% to 92.0% (448 of 487 questions) within a budget of 500 rollouts!
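
As a quick check of the arithmetic, the snippet below recomputes the percentage from the raw counts in the evaluation log and the gain over the 67.1% starting score. The variable names are chosen for this illustration only.

```python
# Raw counts from the evaluation log line: "Average Metric: 448.0 / 487".
correct, total = 448, 487

# Starting score quoted in the text above, in percent.
baseline_pct = 67.1

optimized_pct = 100 * correct / total
gain_points = optimized_pct - baseline_pct

print(f"optimized: {optimized_pct:.1f}%")       # optimized: 92.0%
print(f"gain: {gain_points:.1f} points")        # gain: 24.9 points
```

The gain is stated in percentage points (an absolute difference), not as a relative improvement over the baseline.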