In [1]:
from dotenv import load_dotenv

load_dotenv()

True

In [2]:
import os
import dspy

lm = dspy.LM("openai/gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])
dspy.configure(lm=lm)

In [3]:
math = dspy.ChainOfThought("question -> answer: float")
math(question="Two dice are tossed. What is the probability that the sum equals two?")

Prediction(
    reasoning='When two dice are tossed, each die has 6 faces, resulting in a total of 6 * 6 = 36 possible outcomes. The only way to achieve a sum of 2 is if both dice show a 1 (i.e., (1,1)). There is only 1 favorable outcome for this event. Therefore, the probability of the sum equaling 2 is the number of favorable outcomes divided by the total number of outcomes, which is 1/36.',
    answer=0.027777777777777776
)
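
The `answer` above can be checked without a language model. The block below is a minimal sketch that enumerates all outcomes of two fair six-sided dice and counts those summing to two; the variable names are illustrative and not part of the notebook.

```python
from fractions import Fraction
from itertools import product

# All ordered outcomes of two fair six-sided dice: 6 * 6 = 36 pairs.
outcomes = list(product(range(1, 7), repeat=2))

# Only (1, 1) sums to two.
favorable = [roll for roll in outcomes if sum(roll) == 2]

# Exact probability as a fraction, then as a float for comparison with `answer`.
probability = Fraction(len(favorable), len(outcomes))
print(probability, float(probability))
```

The exact fraction is 1/36, which agrees with the float returned in the prediction.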

In [4]:
import dspy
from dspy.datasets import HotPotQA

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

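# Tool exposed to the agent: return the text of the top-3 Wikipedia abstracts for a query.
def search_wikipedia(query: str) -> list[str]:
    results = dspy.ColBERTv2(url="http://20.102.90.50:2017/wiki17_abstracts")(query, k=3)
    return [x["text"] for x in results]

# Load 500 HotPotQA training examples and mark `question` as the only input field.
trainset = [x.with_inputs('question') for x in HotPotQA(train_seed=2024, train_size=500).train]
# ReAct agent that may call search_wikipedia before producing `answer`.
react = dspy.ReAct("question -> answer", tools=[search_wikipedia])

# Optimize the agent's instructions and few-shot demos against exact-match accuracy.
tp = dspy.MIPROv2(metric=dspy.evaluate.answer_exact_match, auto="light", num_threads=24)
optimized_react = tp.compile(react, trainset=trainset)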

`trust_remote_code` is not supported anymore.
Please check that the Hugging Face dataset 'hotpot_qa' isn't based on a loading script and remove `trust_remote_code`.
If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
Generating train split: 100%|██████████| 90447/90447 [00:03<00:00, 24418.55 examples/s]
Generating validation split: 100%|██████████| 7405/7405 [00:00<00:00, 26771.52 examples/s]
Generating test split: 100%|██████████| 7405/7405 [00:00<00:00, 25039.82 examples/s]
2025/08/25 03:16:44 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE F

Bootstrapping set 1/6
Bootstrapping set 2/6
Bootstrapping set 3/6


 18%|█▊        | 18/100 [02:46<12:37,  9.24s/it]


Bootstrapped 4 full traces after 18 examples for up to 1 rounds, amounting to 18 attempts.
Bootstrapping set 4/6


  5%|▌         | 5/100 [01:09<22:08, 13.99s/it]


Bootstrapped 3 full traces after 5 examples for up to 1 rounds, amounting to 5 attempts.
Bootstrapping set 5/6


  5%|▌         | 5/100 [00:30<09:37,  6.08s/it]


Bootstrapped 1 full traces after 5 examples for up to 1 rounds, amounting to 5 attempts.
Bootstrapping set 6/6


  2%|▏         | 2/100 [00:20<16:49, 10.30s/it]
2025/08/25 03:21:32 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/08/25 03:21:32 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.


Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.


2025/08/25 03:22:08 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=3 instructions...

2025/08/25 03:23:01 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/08/25 03:23:01 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Given the fields `question`, produce the fields `answer`.

You are an Agent. In each episode, you will be given the fields `question` as input. And you can see your past trajectory so far.
Your goal is to use one or more of the supplied tools to collect any necessary information for producing `answer`.

To do this, you will interleave next_thought, next_tool_name, and next_tool_args in each turn, and also when finishing the task.
After each tool call, you receive a resulting observation, which gets appended to your trajectory.

When writing next_thought, you may reason about the current situation and plan for future steps.
When selecting the next_tool_name and its next_tool_args, the tool must be one of:

(1) search_wikipedia. It ta

Average Metric: 30.00 / 100 (30.0%): 100%|██████████| 100/100 [00:53<00:00,  1.86it/s]

2025/08/25 03:23:56 INFO dspy.evaluate.evaluate: Average Metric: 30 / 100 (30.0%)
2025/08/25 03:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 30.0

2025/08/25 03:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 25 - Minibatch ==



Average Metric: 8.00 / 35 (22.9%): 100%|██████████| 35/35 [00:36<00:00,  1.04s/it]

2025/08/25 03:24:32 INFO dspy.evaluate.evaluate: Average Metric: 8 / 35 (22.9%)
2025/08/25 03:24:32 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 22.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 3', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 0'].
2025/08/25 03:24:32 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86]
2025/08/25 03:24:32 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [30.0]
2025/08/25 03:24:32 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 30.0


2025/08/25 03:24:32 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 25 - Minibatch ==



Average Metric: 16.00 / 35 (45.7%): 100%|██████████| 35/35 [00:30<00:00,  1.14it/s]

2025/08/25 03:25:03 INFO dspy.evaluate.evaluate: Average Metric: 16 / 35 (45.7%)
2025/08/25 03:25:03 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 45.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 2'].
2025/08/25 03:25:03 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 45.71]
2025/08/25 03:25:03 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [30.0]
2025/08/25 03:25:03 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 30.0


2025/08/25 03:25:03 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 25 - Minibatch ==



Average Metric: 8.00 / 35 (22.9%): 100%|██████████| 35/35 [00:39<00:00,  1.13s/it]

2025/08/25 03:25:43 INFO dspy.evaluate.evaluate: Average Metric: 8 / 35 (22.9%)
2025/08/25 03:25:43 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 22.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 0'].
2025/08/25 03:25:43 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 45.71, 22.86]
2025/08/25 03:25:43 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [30.0]
2025/08/25 03:25:43 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 30.0


2025/08/25 03:25:43 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 25 - Minibatch ==



Average Metric: 19.00 / 35 (54.3%): 100%|██████████| 35/35 [00:21<00:00,  1.63it/s]

2025/08/25 03:26:05 INFO dspy.evaluate.evaluate: Average Metric: 19 / 35 (54.3%)
2025/08/25 03:26:05 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 54.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 4'].
2025/08/25 03:26:05 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 45.71, 22.86, 54.29]
2025/08/25 03:26:05 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [30.0]
2025/08/25 03:26:05 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 30.0


2025/08/25 03:26:05 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 25 - Minibatch ==





2025/08/25 03:26:06 ERROR dspy.utils.parallelizer: Error for Example({'question': 'Which magazine established itself as a cradle of New Journalism, St. Anthony Messenger or New York?', 'answer': 'New York'}) (input_keys={'question'}): AST constructor recursion depth mismatch (before=63, after=69). Set `provide_traceback=True` for traceback.


Average Metric: 15.00 / 34 (44.1%): 100%|██████████| 35/35 [00:21<00:00,  1.64it/s]

2025/08/25 03:26:27 INFO dspy.evaluate.evaluate: Average Metric: 15.0 / 35 (42.9%)
2025/08/25 03:26:27 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 42.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 2'].
2025/08/25 03:26:27 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 45.71, 22.86, 54.29, 42.86]
2025/08/25 03:26:27 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [30.0]
2025/08/25 03:26:27 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 30.0


2025/08/25 03:26:27 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 25 - Full Evaluation =====
2025/08/25 03:26:27 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 54.29) from minibatch trials...



Average Metric: 51.00 / 100 (51.0%): 100%|██████████| 100/100 [00:42<00:00,  2.34it/s]

2025/08/25 03:27:10 INFO dspy.evaluate.evaluate: Average Metric: 51 / 100 (51.0%)
2025/08/25 03:27:10 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 51.0
2025/08/25 03:27:10 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [30.0, 51.0]
2025/08/25 03:27:10 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0
2025/08/25 03:27:10 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/08/25 03:27:10 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 8 / 25 - Minibatch ==



Average Metric: 9.00 / 35 (25.7%): 100%|██████████| 35/35 [00:41<00:00,  1.19s/it]

2025/08/25 03:27:52 INFO dspy.evaluate.evaluate: Average Metric: 9 / 35 (25.7%)
2025/08/25 03:27:52 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 25.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 0'].
2025/08/25 03:27:52 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 45.71, 22.86, 54.29, 42.86, 25.71]
2025/08/25 03:27:52 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [30.0, 51.0]
2025/08/25 03:27:52 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/08/25 03:27:52 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 9 / 25 - Minibatch ==



Average Metric: 19.00 / 35 (54.3%): 100%|██████████| 35/35 [00:25<00:00,  1.39it/s]

2025/08/25 03:28:18 INFO dspy.evaluate.evaluate: Average Metric: 19 / 35 (54.3%)
2025/08/25 03:28:18 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 54.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 2', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 1'].
2025/08/25 03:28:18 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 45.71, 22.86, 54.29, 42.86, 25.71, 54.29]
2025/08/25 03:28:18 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [30.0, 51.0]
2025/08/25 03:28:18 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/08/25 03:28:18 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 10 / 25 - Minibatch ==



Average Metric: 11.00 / 35 (31.4%): 100%|██████████| 35/35 [00:35<00:00,  1.03s/it]

2025/08/25 03:28:54 INFO dspy.evaluate.evaluate: Average Metric: 11 / 35 (31.4%)
2025/08/25 03:28:54 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 31.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 0', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 0'].
2025/08/25 03:28:54 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 45.71, 22.86, 54.29, 42.86, 25.71, 54.29, 31.43]
2025/08/25 03:28:54 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [30.0, 51.0]
2025/08/25 03:28:54 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/08/25 03:28:55 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 11 / 25 - Minibatch ==



Average Metric: 17.00 / 35 (48.6%): 100%|██████████| 35/35 [00:20<00:00,  1.70it/s]

2025/08/25 03:29:15 INFO dspy.evaluate.evaluate: Average Metric: 17 / 35 (48.6%)





2025/08/25 03:29:15 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 48.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 1', 'Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 4'].
2025/08/25 03:29:15 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 45.71, 22.86, 54.29, 42.86, 25.71, 54.29, 31.43, 48.57]
2025/08/25 03:29:15 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [30.0, 51.0]
2025/08/25 03:29:15 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/08/25 03:29:15 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 12 / 25 - Minibatch ==


Average Metric: 15.00 / 35 (42.9%): 100%|██████████| 35/35 [00:54<00:00,  1.56s/it]

2025/08/25 03:30:11 INFO dspy.evaluate.evaluate: Average Metric: 15 / 35 (42.9%)
2025/08/25 03:30:11 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 42.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 4', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 1'].
2025/08/25 03:30:11 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 45.71, 22.86, 54.29, 42.86, 25.71, 54.29, 31.43, 48.57, 42.86]
2025/08/25 03:30:11 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [30.0, 51.0]
2025/08/25 03:30:11 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/08/25 03:30:11 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 13 / 25 - Full Evaluation =====
2025/08/25 03:30:11 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 54.29) from minibatch trials...



Average Metric: 45.00 / 100 (45.0%): 100%|██████████| 100/100 [00:44<00:00,  2.26it/s]

2025/08/25 03:30:56 INFO dspy.evaluate.evaluate: Average Metric: 45 / 100 (45.0%)
2025/08/25 03:30:56 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [30.0, 51.0, 45.0]
2025/08/25 03:30:56 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0
2025/08/25 03:30:56 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/08/25 03:30:56 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 14 / 25 - Minibatch ==



Average Metric: 4.00 / 6 (66.7%):  14%|█▍        | 5/35 [00:00<00:05,  5.34it/s] 

2025/08/25 03:30:56 ERROR dspy.utils.parallelizer: Error for Example({'question': 'From the age of 14 Bate lived on the family property "Mountain View" at these villages. It was the site for which television series?', 'answer': 'River Cottage Australia'}) (input_keys={'question'}): AST constructor recursion depth mismatch (before=72, after=68). Set `provide_traceback=True` for traceback.


Average Metric: 16.00 / 34 (47.1%): 100%|██████████| 35/35 [00:00<00:00, 54.09it/s]

2025/08/25 03:30:57 INFO dspy.evaluate.evaluate: Average Metric: 16.0 / 35 (45.7%)
2025/08/25 03:30:57 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 45.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 4'].
2025/08/25 03:30:57 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 45.71, 22.86, 54.29, 42.86, 25.71, 54.29, 31.43, 48.57, 42.86, 45.71]
2025/08/25 03:30:57 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [30.0, 51.0, 45.0]
2025/08/25 03:30:57 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/08/25 03:30:57 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 15 / 25 - Minibatch ==



Average Metric: 23.00 / 35 (65.7%): 100%|██████████| 35/35 [00:05<00:00,  5.92it/s]

2025/08/25 03:31:03 INFO dspy.evaluate.evaluate: Average Metric: 23 / 35 (65.7%)
2025/08/25 03:31:03 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 65.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 2', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 1'].
2025/08/25 03:31:03 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 45.71, 22.86, 54.29, 42.86, 25.71, 54.29, 31.43, 48.57, 42.86, 45.71, 65.71]
2025/08/25 03:31:03 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [30.0, 51.0, 45.0]
2025/08/25 03:31:03 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/08/25 03:31:03 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 16 / 25 - Minibatch ==



Average Metric: 18.00 / 35 (51.4%): 100%|██████████| 35/35 [00:06<00:00,  5.22it/s]

2025/08/25 03:31:10 INFO dspy.evaluate.evaluate: Average Metric: 18 / 35 (51.4%)
2025/08/25 03:31:10 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 51.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 2', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 4'].
2025/08/25 03:31:10 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 45.71, 22.86, 54.29, 42.86, 25.71, 54.29, 31.43, 48.57, 42.86, 45.71, 65.71, 51.43]
2025/08/25 03:31:11 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [30.0, 51.0, 45.0]
2025/08/25 03:31:11 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/08/25 03:31:11 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 17 / 25 - Minibatch ==



Average Metric: 16.00 / 35 (45.7%): 100%|██████████| 35/35 [00:07<00:00,  4.83it/s]

2025/08/25 03:31:18 INFO dspy.evaluate.evaluate: Average Metric: 16 / 35 (45.7%)
2025/08/25 03:31:18 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 45.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 2', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 1'].
2025/08/25 03:31:18 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 45.71, 22.86, 54.29, 42.86, 25.71, 54.29, 31.43, 48.57, 42.86, 45.71, 65.71, 51.43, 45.71]
2025/08/25 03:31:18 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [30.0, 51.0, 45.0]
2025/08/25 03:31:18 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/08/25 03:31:18 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 18 / 25 - Minibatch ==



Average Metric: 17.00 / 35 (48.6%): 100%|██████████| 35/35 [00:25<00:00,  1.38it/s]

2025/08/25 03:31:44 INFO dspy.evaluate.evaluate: Average Metric: 17 / 35 (48.6%)
2025/08/25 03:31:44 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 48.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 3'].
2025/08/25 03:31:44 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 45.71, 22.86, 54.29, 42.86, 25.71, 54.29, 31.43, 48.57, 42.86, 45.71, 65.71, 51.43, 45.71, 48.57]
2025/08/25 03:31:44 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [30.0, 51.0, 45.0]
2025/08/25 03:31:44 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/08/25 03:31:44 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 19 / 25 - Full Evaluation =====
2025/08/25 03:31:44 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 55.709999999999994) from minibatch trials...



Average Metric: 51.00 / 100 (51.0%): 100%|██████████| 100/100 [00:07<00:00, 13.06it/s]

2025/08/25 03:31:52 INFO dspy.evaluate.evaluate: Average Metric: 51 / 100 (51.0%)
2025/08/25 03:31:52 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [30.0, 51.0, 45.0, 51.0]
2025/08/25 03:31:52 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0
2025/08/25 03:31:52 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/08/25 03:31:52 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 20 / 25 - Minibatch ==



Average Metric: 16.00 / 35 (45.7%): 100%|██████████| 35/35 [00:36<00:00,  1.03s/it]

2025/08/25 03:32:29 INFO dspy.evaluate.evaluate: Average Metric: 16 / 35 (45.7%)
2025/08/25 03:32:29 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 45.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 3', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 4'].
2025/08/25 03:32:29 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 45.71, 22.86, 54.29, 42.86, 25.71, 54.29, 31.43, 48.57, 42.86, 45.71, 65.71, 51.43, 45.71, 48.57, 45.71]
2025/08/25 03:32:29 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [30.0, 51.0, 45.0, 51.0]
2025/08/25 03:32:29 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/08/25 03:32:29 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 21 / 25 - Minibatch ==



Average Metric: 18.00 / 35 (51.4%): 100%|██████████| 35/35 [00:05<00:00,  6.85it/s]

2025/08/25 03:32:35 INFO dspy.evaluate.evaluate: Average Metric: 18 / 35 (51.4%)
2025/08/25 03:32:35 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 51.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 1'].
2025/08/25 03:32:35 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 45.71, 22.86, 54.29, 42.86, 25.71, 54.29, 31.43, 48.57, 42.86, 45.71, 65.71, 51.43, 45.71, 48.57, 45.71, 51.43]
2025/08/25 03:32:35 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [30.0, 51.0, 45.0, 51.0]
2025/08/25 03:32:35 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/08/25 03:32:35 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 22 / 25 - Minibatch ==



Average Metric: 18.00 / 35 (51.4%): 100%|██████████| 35/35 [00:25<00:00,  1.36it/s]

2025/08/25 03:33:01 INFO dspy.evaluate.evaluate: Average Metric: 18 / 35 (51.4%)
2025/08/25 03:33:01 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 51.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 4', 'Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 5'].
2025/08/25 03:33:01 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 45.71, 22.86, 54.29, 42.86, 25.71, 54.29, 31.43, 48.57, 42.86, 45.71, 65.71, 51.43, 45.71, 48.57, 45.71, 51.43, 51.43]
2025/08/25 03:33:01 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [30.0, 51.0, 45.0, 51.0]
2025/08/25 03:33:01 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/08/25 03:33:01 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 23 / 25 - Minibatch ==



Average Metric: 16.00 / 35 (45.7%): 100%|██████████| 35/35 [00:04<00:00,  7.23it/s]

2025/08/25 03:33:06 INFO dspy.evaluate.evaluate: Average Metric: 16 / 35 (45.7%)





2025/08/25 03:33:06 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 45.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 2', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 5'].
2025/08/25 03:33:06 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 45.71, 22.86, 54.29, 42.86, 25.71, 54.29, 31.43, 48.57, 42.86, 45.71, 65.71, 51.43, 45.71, 48.57, 45.71, 51.43, 51.43, 45.71]
2025/08/25 03:33:06 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [30.0, 51.0, 45.0, 51.0]
2025/08/25 03:33:06 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/08/25 03:33:06 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 24 / 25 - Minibatch ==


Average Metric: 22.00 / 35 (62.9%): 100%|██████████| 35/35 [00:15<00:00,  2.28it/s]

2025/08/25 03:33:22 INFO dspy.evaluate.evaluate: Average Metric: 22 / 35 (62.9%)
2025/08/25 03:33:22 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 62.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 2', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 1'].
2025/08/25 03:33:22 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 45.71, 22.86, 54.29, 42.86, 25.71, 54.29, 31.43, 48.57, 42.86, 45.71, 65.71, 51.43, 45.71, 48.57, 45.71, 51.43, 51.43, 45.71, 62.86]
2025/08/25 03:33:22 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [30.0, 51.0, 45.0, 51.0]
2025/08/25 03:33:22 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/08/25 03:33:22 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 25 / 25 - Full Evaluation =====
2025/08/25 03:33:22 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 62.86) from minibatch trials...


Average Metric: 49.00 / 100 (49.0%): 100%|██████████| 100/100 [00:52<00:00,  1.89it/s]

2025/08/25 03:34:15 INFO dspy.evaluate.evaluate: Average Metric: 49 / 100 (49.0%)





2025/08/25 03:34:16 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [30.0, 51.0, 45.0, 51.0, 49.0]
2025/08/25 03:34:16 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0
2025/08/25 03:34:16 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/08/25 03:34:16 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 51.0!
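
The "next top averaging program" lines in the log can be reproduced by hand. The sketch below copies the minibatch results logged before Trial 19, averages the scores per parameter combination, and picks the highest average among combinations not yet fully evaluated. The selection rule and the `fully_evaluated` set are inferred from the log messages, not taken from the DSPy source, and the helper `top_averaging` is hypothetical.

```python
from collections import defaultdict

# (trial, combination, minibatch score) copied from the log.
# A combination is (Predictor 0 instruction, Predictor 0 few-shot set,
#                   Predictor 1 instruction, Predictor 1 few-shot set).
minibatch_trials = [
    (2, (1, 3, 2, 0), 22.86),
    (3, (1, 5, 2, 2), 45.71),
    (4, (0, 5, 2, 0), 22.86),
    (5, (2, 5, 1, 4), 54.29),
    (6, (2, 5, 2, 2), 42.86),
    (8, (0, 5, 0, 0), 25.71),
    (9, (1, 2, 2, 1), 54.29),
    (10, (1, 0, 0, 0), 31.43),
    (11, (2, 1, 1, 4), 48.57),
    (12, (1, 4, 2, 1), 42.86),
    (14, (2, 5, 1, 4), 45.71),
    (15, (1, 2, 0, 1), 65.71),
    (16, (1, 2, 0, 4), 51.43),
    (17, (1, 2, 0, 1), 45.71),
    (18, (1, 5, 1, 3), 48.57),
]

# Assumption: Trials 7 and 13 fully evaluated these two combinations,
# since each was the only unevaluated one averaging 54.29 at that point.
fully_evaluated = {(2, 5, 1, 4), (1, 2, 2, 1)}


def top_averaging(trials, skip):
    """Return the combination with the highest mean minibatch score, ignoring `skip`."""
    scores = defaultdict(list)
    for _trial, combo, score in trials:
        scores[combo].append(score)
    averages = {c: sum(v) / len(v) for c, v in scores.items() if c not in skip}
    best = max(averages, key=averages.get)
    return best, averages[best]


best_combo, best_avg = top_averaging(minibatch_trials, fully_evaluated)
print(best_combo, round(best_avg, 2))
```

Under these assumptions the winner is the combination used in Trials 15 and 17, whose scores of 65.71 and 45.71 average to about 55.71, in line with the "Avg Score: 55.709999999999994" reported at Trial 19.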


In [6]:
react

react = Predict(StringSignature(question, trajectory -> next_thought, next_tool_name, next_tool_args
    instructions="Given the fields `question`, produce the fields `answer`.\n\nYou are an Agent. In each episode, you will be given the fields `question` as input. And you can see your past trajectory so far.\nYour goal is to use one or more of the supplied tools to collect any necessary information for producing `answer`.\n\nTo do this, you will interleave next_thought, next_tool_name, and next_tool_args in each turn, and also when finishing the task.\nAfter each tool call, you receive a resulting observation, which gets appended to your trajectory.\n\nWhen writing next_thought, you may reason about the current situation and plan for future steps.\nWhen selecting the next_tool_name and its next_tool_args, the tool must be one of:\n\n(1) search_wikipedia. It takes arguments {'query': {'type': 'string'}}.\n(2) finish, whose description is <desc>Marks the task as complete. That is, signal

In [5]:
optimized_react

react = Predict(StringSignature(question, trajectory -> next_thought, next_tool_name, next_tool_args
    instructions="You are a knowledgeable trivia agent. Given the fields `question`, produce the fields `answer`. In each episode, you will be presented with the `question` as input alongside your past trajectory of thoughts and observations. Your objective is to utilize one or more of the provided tools to gather the necessary information to formulate the `answer`. \n\nInterleave your reasoning in `next_thought`, determine the appropriate tool to use in `next_tool_name`, and specify the required arguments for that tool in `next_tool_args`. After each tool call, the resulting observation will be added to your trajectory. \n\nWhen composing `next_thought`, think critically about the current situation and plan your next steps. Choose the next tool from the following options:\n\n(1) search_wikipedia, which requires the argument {'query': {'type': 'string'}}.  \n(2) finish, which indicates 

In [9]:
react.save("../tmp/react.json")

In [None]:
optimized_react.save("../tmp/optimized_react.json")
