In [1]:
import dspy
lm = dspy.LM('azure/gpt-4.1-mini')
dspy.settings.configure(lm=lm)
from dspy.datasets.gsm8k import GSM8K, gsm8k_metric
gsm8k = GSM8K()
trainset, devset = gsm8k.train, gsm8k.dev


class CoT(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        return self.prog(question=question)



In [2]:
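from dspy.evaluate import Evaluate

# Score programs on the dev set with the GSM8K metric, using 8 worker threads.
evaluate = Evaluate(devset=devset, metric=gsm8k_metric, num_threads=8, display_progress=True, display_table=False)

# Baseline: the unoptimized chain-of-thought program.
program = CoT()
evaluate(program)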

Average Metric: 274.00 / 300 (91.3%): 100%|██████████| 300/300 [01:08<00:00,  4.35it/s]

2025/05/01 16:15:20 INFO dspy.evaluate.evaluate: Average Metric: 274 / 300 (91.3%)





91.33

In [3]:
# Import the optimizer
from dspy.teleprompt import MIPROv2

# Initialize optimizer
teleprompter = MIPROv2(
    metric=gsm8k_metric,
    auto="light", # Can choose between light, medium, and heavy optimization runs
)

# Optimize program
print("Optimizing program with MIPRO...")
optimized_program = teleprompter.compile(
    program.deepcopy(),
    trainset=trainset,
    max_bootstrapped_demos=3,
    max_labeled_demos=4,
    requires_permission_to_run=False,
)

# Save the optimized program for future use
optimized_program.save("mipro_optimized", save_program=True)

# Evaluate the optimized program
print("Evaluate optimized program...")
evaluate(optimized_program, devset=devset[:])

2025/05/01 16:18:19 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING LIGHT AUTO RUN SETTINGS:
num_trials: 7
minibatch: True
num_candidates: 5
valset size: 100

2025/05/01 16:18:19 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/05/01 16:18:19 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/05/01 16:18:19 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=5 sets of demonstrations...


Optimizing program with MIPRO...
Bootstrapping set 1/5
Bootstrapping set 2/5
Bootstrapping set 3/5


  8%|▊         | 3/40 [00:08<01:44,  2.81s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 4/5


  8%|▊         | 3/40 [00:04<00:53,  1.45s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 5/5


  5%|▌         | 2/40 [00:04<01:30,  2.37s/it]
2025/05/01 16:18:36 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/05/01 16:18:36 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.


2025/05/01 16:18:49 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing instructions...

2025/05/01 16:19:12 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/05/01 16:19:12 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Given the fields `question`, produce the fields `answer`.

2025/05/01 16:19:12 INFO dspy.teleprompt.mipro_optimizer_v2: 1: Given a natural language arithmetic word problem in the `question` field, generate a detailed step-by-step chain of thought reasoning that breaks down the problem into clear intermediate calculations and logical steps, and then provide the final numeric answer. The reasoning should explicitly show how each part of the problem is addressed, using straightforward integer arithmetic and clear explanations, culminating in a concise final answer. Output both the `reasoning` and the `answer` fields with transparent, interpretable calculations and a final numeric solution.

2025/05/01 16:19:12 INFO dspy.teleprompt.mipro_op

Average Metric: 88.00 / 100 (88.0%): 100%|██████████| 100/100 [00:25<00:00,  3.97it/s]

2025/05/01 16:19:38 INFO dspy.evaluate.evaluate: Average Metric: 88 / 100 (88.0%)
2025/05/01 16:19:38 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 88.0

2025/05/01 16:19:38 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 10 - Minibatch ==



Average Metric: 30.00 / 35 (85.7%): 100%|██████████| 35/35 [00:08<00:00,  4.11it/s]

2025/05/01 16:19:46 INFO dspy.evaluate.evaluate: Average Metric: 30 / 35 (85.7%)
2025/05/01 16:19:46 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 85.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 1'].
2025/05/01 16:19:46 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [85.71]
2025/05/01 16:19:46 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [88.0]
2025/05/01 16:19:46 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 88.0


2025/05/01 16:19:46 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 10 - Minibatch ==



Average Metric: 35.00 / 35 (100.0%): 100%|██████████| 35/35 [00:07<00:00,  4.49it/s]

2025/05/01 16:19:54 INFO dspy.evaluate.evaluate: Average Metric: 35 / 35 (100.0%)
2025/05/01 16:19:54 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 100.0 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 1'].
2025/05/01 16:19:54 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [85.71, 100.0]
2025/05/01 16:19:54 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [88.0]
2025/05/01 16:19:54 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 88.0


2025/05/01 16:19:54 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 10 - Minibatch ==



Average Metric: 33.00 / 35 (94.3%): 100%|██████████| 35/35 [00:09<00:00,  3.56it/s]

2025/05/01 16:20:04 INFO dspy.evaluate.evaluate: Average Metric: 33 / 35 (94.3%)
2025/05/01 16:20:04 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 94.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 1'].
2025/05/01 16:20:04 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [85.71, 100.0, 94.29]
2025/05/01 16:20:04 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [88.0]
2025/05/01 16:20:04 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 88.0


2025/05/01 16:20:04 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 10 - Minibatch ==



Average Metric: 34.00 / 35 (97.1%): 100%|██████████| 35/35 [00:05<00:00,  6.33it/s] 

2025/05/01 16:20:10 INFO dspy.evaluate.evaluate: Average Metric: 34 / 35 (97.1%)
2025/05/01 16:20:10 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.14 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 1'].
2025/05/01 16:20:10 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [85.71, 100.0, 94.29, 97.14]
2025/05/01 16:20:10 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [88.0]
2025/05/01 16:20:10 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 88.0


2025/05/01 16:20:10 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 10 - Minibatch ==



Average Metric: 34.00 / 35 (97.1%): 100%|██████████| 35/35 [00:08<00:00,  4.17it/s] 

2025/05/01 16:20:18 INFO dspy.evaluate.evaluate: Average Metric: 34 / 35 (97.1%)
2025/05/01 16:20:18 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.14 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 3'].
2025/05/01 16:20:18 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [85.71, 100.0, 94.29, 97.14, 97.14]
2025/05/01 16:20:18 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [88.0]
2025/05/01 16:20:18 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 88.0


2025/05/01 16:20:18 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 10 - Full Evaluation =====
2025/05/01 16:20:18 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 98.57) from minibatch trials...



Average Metric: 95.00 / 100 (95.0%): 100%|██████████| 100/100 [00:09<00:00, 10.77it/s]

2025/05/01 16:20:27 INFO dspy.evaluate.evaluate: Average Metric: 95 / 100 (95.0%)
2025/05/01 16:20:27 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 95.0
2025/05/01 16:20:27 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [88.0, 95.0]
2025/05/01 16:20:27 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 95.0
2025/05/01 16:20:27 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/05/01 16:20:27 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 8 / 10 - Minibatch ==



Average Metric: 31.00 / 35 (88.6%): 100%|██████████| 35/35 [00:09<00:00,  3.53it/s] 

2025/05/01 16:20:37 INFO dspy.evaluate.evaluate: Average Metric: 31 / 35 (88.6%)
2025/05/01 16:20:37 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 1'].
2025/05/01 16:20:37 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [85.71, 100.0, 94.29, 97.14, 97.14, 88.57]
2025/05/01 16:20:37 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [88.0, 95.0]
2025/05/01 16:20:37 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 95.0


2025/05/01 16:20:37 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 9 / 10 - Minibatch ==



Average Metric: 34.00 / 35 (97.1%): 100%|██████████| 35/35 [00:07<00:00,  4.71it/s] 

2025/05/01 16:20:45 INFO dspy.evaluate.evaluate: Average Metric: 34 / 35 (97.1%)
2025/05/01 16:20:45 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.14 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 4'].
2025/05/01 16:20:45 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [85.71, 100.0, 94.29, 97.14, 97.14, 88.57, 97.14]
2025/05/01 16:20:45 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [88.0, 95.0]
2025/05/01 16:20:45 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 95.0


2025/05/01 16:20:45 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 10 / 10 - Full Evaluation =====
2025/05/01 16:20:45 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 97.14) from minibatch trials...



Average Metric: 93.00 / 100 (93.0%): 100%|██████████| 100/100 [00:14<00:00,  6.74it/s]

2025/05/01 16:21:00 INFO dspy.evaluate.evaluate: Average Metric: 93 / 100 (93.0%)
2025/05/01 16:21:00 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [88.0, 95.0, 93.0]
2025/05/01 16:21:00 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 95.0
2025/05/01 16:21:00 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/05/01 16:21:00 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 95.0!
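The "top averaging program (Avg Score: 98.57)" line at Trial 7 can be reproduced from the minibatch scores logged in Trials 2 to 6. The sketch below groups those scores by (instruction, few-shot set) and averages them. It is only a reading of the log, not MIPROv2's internal code, and the names `minibatch_trials` and `average_by_config` are hypothetical.

```python
from collections import defaultdict

# (instruction index, few-shot set index) -> minibatch score,
# copied by hand from the Trial 2-6 log lines above.
minibatch_trials = [
    ((1, 1), 85.71),  # Trial 2
    ((2, 1), 100.0),  # Trial 3
    ((4, 1), 94.29),  # Trial 4
    ((2, 1), 97.14),  # Trial 5
    ((4, 3), 97.14),  # Trial 6
]


def average_by_config(trials):
    """Average the minibatch scores of each (instruction, few-shot set) pair."""
    grouped = defaultdict(list)
    for config, score in trials:
        grouped[config].append(score)
    return {config: sum(scores) / len(scores) for config, scores in grouped.items()}


averages = average_by_config(minibatch_trials)

# The configuration with the highest average is the one promoted to a full evaluation.
best_config = max(averages, key=averages.get)
print(best_config, round(averages[best_config], 2))  # (2, 1) 98.57
```

Instruction 2 with few-shot set 1 was sampled twice (100.0 and 97.14), which gives the 98.57 average reported in the log.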



Evaluate optimized program...



Average Metric: 284.00 / 300 (94.7%): 100%|██████████| 300/300 [01:09<00:00,  4.32it/s]

2025/05/01 16:22:09 INFO dspy.evaluate.evaluate: Average Metric: 284 / 300 (94.7%)





94.67
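To put the two dev-set results side by side (274 / 300 before optimization, 284 / 300 after), the short sketch below recomputes both accuracies along with the absolute gain and the share of errors removed. The helper `summarize` is hypothetical and uses only the counts printed in the outputs above.

```python
def summarize(correct, total):
    """Return accuracy in percent and the number of wrong answers."""
    return {"accuracy": 100.0 * correct / total, "errors": total - correct}


baseline = summarize(274, 300)   # In [2]: unoptimized CoT program
optimized = summarize(284, 300)  # In [3]: MIPROv2-optimized program

# Absolute gain in percentage points.
gain = optimized["accuracy"] - baseline["accuracy"]

# Share of the baseline's errors that the optimized program no longer makes.
error_reduction = 100.0 * (baseline["errors"] - optimized["errors"]) / baseline["errors"]

print(round(baseline["accuracy"], 2))   # 91.33
print(round(optimized["accuracy"], 2))  # 94.67
print(round(gain, 2))                   # 3.33
print(round(error_reduction, 1))        # 38.5
```

With only 300 dev examples, the gain corresponds to 10 additional correct answers, so it is worth treating as indicative rather than precise.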

In [4]:
from dspy.teleprompt.utils import print_full_program

print_full_program(optimized_program)

Predictor 0
i: Given a natural language math word problem in the field `question`, generate a detailed, step-by-step reasoning process that clearly explains the solution, followed by the final concise numeric answer in the field `answer`.
p: Answer:




In [7]:
lm.inspect_history(n=5)

[2025-05-01T16:22:08.226462]

System message:

Your input fields are:
1. `question` (str)
Your output fields are:
1. `reasoning` (str)
2. `answer` (str)
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## question ## ]]
{question}

[[ ## reasoning ## ]]
{reasoning}

[[ ## answer ## ]]
{answer}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Given a natural language math word problem in the field `question`, generate a detailed, step-by-step reasoning process that clearly explains the solution, followed by the final concise numeric answer in the field `answer`.


User message:

This is an example of the task, though some input or output fields are not supplied.

[[ ## question ## ]]
Mark donates soup to the 6 homeless shelters.  Each one services 30 people and he decides to buy 10 cans of soup per person.  How many cans of soup does Mark donate?


Assistant message:

In [None]:
lm.inspect_history(n=1)

[2025-05-01T16:22:09.552388]

System message:

Your input fields are:
1. `question` (str)
Your output fields are:
1. `reasoning` (str)
2. `answer` (str)
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## question ## ]]
{question}

[[ ## reasoning ## ]]
{reasoning}

[[ ## answer ## ]]
{answer}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Given a natural language math word problem in the field `question`, generate a detailed, step-by-step reasoning process that clearly explains the solution, followed by the final concise numeric answer in the field `answer`.


User message:

This is an example of the task, though some input or output fields are not supplied.

[[ ## question ## ]]
Mark donates soup to the 6 homeless shelters.  Each one services 30 people and he decides to buy 10 cans of soup per person.  How many cans of soup does Mark donate?


Assistant message:
