In [2]:
import dspy

lm = dspy.LM('openai/gpt-4.1-nano')
dspy.configure(lm=lm)

In [40]:
qa = dspy.Predict('question: str -> response: str')
response = qa(question="what are high memory and low memory on linux?")

print(response.response)

In Linux, "high memory" and "low memory" refer to different regions of the system's memory address space, particularly in the context of how the kernel manages memory for processes.

- **Low Memory**: This typically refers to the memory that is directly accessible by the kernel and can be used by processes without any special handling. In a 32-bit system, this is usually the first 896 MB of RAM (though this can vary based on the architecture and configuration). Low memory is used for kernel data structures and for user processes that require direct access to memory.

- **High Memory**: This refers to memory that is above the low memory limit and is not directly accessible by the kernel in a 32-bit system. Processes can use this memory, but the kernel must use special mechanisms (like paging) to access it. High memory is often used in systems with large amounts of RAM, allowing more memory to be allocated to user processes while still maintaining a smaller footprint for the kernel.

In 

In [41]:
dspy.inspect_history(n=1)






[34m[2025-05-18T10:39:49.439604][0m

[31mSystem message:[0m

Your input fields are:
1. `question` (str)
Your output fields are:
1. `response` (str)
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## question ## ]]
{question}

[[ ## response ## ]]
{response}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Given the fields `question`, produce the fields `response`.


[31mUser message:[0m

[[ ## question ## ]]
what are high memory and low memory on linux?

Respond with the corresponding output fields, starting with the field `[[ ## response ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


[31mResponse:[0m

[32m[[ ## response ## ]]
In Linux, "high memory" and "low memory" refer to different regions of the system's memory address space, particularly in the context of how the kernel manages memory for processes.

- **Low Memory**: This typically refers to the memory that 

In [42]:
cot = dspy.ChainOfThought('question -> response')
r=cot(question="should curly braces appear on their own line?")
r.response

"Curly braces can either appear on their own line or on the same line as the preceding statement, depending on the coding style you choose to follow. It's important to maintain consistency throughout your codebase."

In [43]:
dspy.inspect_history(n=1)






[34m[2025-05-18T10:42:11.263307][0m

[31mSystem message:[0m

Your input fields are:
1. `question` (str)
Your output fields are:
1. `reasoning` (str)
2. `response` (str)
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## question ## ]]
{question}

[[ ## reasoning ## ]]
{reasoning}

[[ ## response ## ]]
{response}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Given the fields `question`, produce the fields `response`.


[31mUser message:[0m

[[ ## question ## ]]
should curly braces appear on their own line?

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## response ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.


[31mResponse:[0m

[32m[[ ## reasoning ## ]]
The placement of curly braces on their own line is largely a matter of coding style and conventions. In some programming languages and style guides, such a

In [44]:
import ujson
from dspy.utils import download

# Download question--answer pairs from the RAG-QA Arena "Tech" dataset.
download("https://huggingface.co/dspy/cache/resolve/main/ragqa_arena_tech_examples.jsonl")

with open("ragqa_arena_tech_examples.jsonl") as f:
    data = [ujson.loads(line) for line in f]

Downloading 'ragqa_arena_tech_examples.jsonl'...


In [45]:
data[0]

{'question': 'why igp is used in mpls?',
 'response': "An IGP exchanges routing prefixes between gateways/routers.  \nWithout a routing protocol, you'd have to configure each route on every router and you'd have no dynamic updates when routes change because of link failures. \nFuthermore, within an MPLS network, an IGP is vital for advertising the internal topology and ensuring connectivity for MP-BGP inside the network.",
 'gold_doc_ids': [2822, 2823]}

In [46]:
data = [dspy.Example(**d).with_inputs('question') for d in data]

# Let's pick an `example` here from the data.
example = data[2]
example

Example({'question': 'why are my text messages coming up as maybe?', 'response': 'This is part of the Proactivity features new with iOS 9: It looks at info in emails to see if anyone with this number sent you an email and if it finds the phone number associated with a contact from your email, it will show you "Maybe". \n\nHowever, it has been suggested there is a bug in iOS 11.2 that can result in "Maybe" being displayed even when "Find Contacts in Other Apps" is disabled.', 'gold_doc_ids': [3956, 3957, 8034]}) (input_keys={'question'})

In [47]:
import random

random.Random(0).shuffle(data)
trainset, devset, testset = data[:200], data[200:500], data[500:1000]

len(trainset), len(devset), len(testset)

(200, 300, 500)

In [48]:
example.inputs()

Example({'question': 'why are my text messages coming up as maybe?'}) (input_keys={'question'})

In [49]:
from dspy.evaluate import SemanticF1

# Instantiate the metric.
metric = SemanticF1(decompositional=True)

# Produce a prediction from our `cot` module, using the `example` above as input.
pred = cot(**example.inputs())

# Compute the metric score for the prediction.
score = metric(example, pred)

print(f"Question: \t {example.question}\n")
print(f"Gold Response: \t {example.response}\n")
print(f"Predicted Response: \t {pred.response}\n")
print(f"Semantic F1 Score: {score:.2f}")

Question: 	 why are my text messages coming up as maybe?

Gold Response: 	 This is part of the Proactivity features new with iOS 9: It looks at info in emails to see if anyone with this number sent you an email and if it finds the phone number associated with a contact from your email, it will show you "Maybe". 

However, it has been suggested there is a bug in iOS 11.2 that can result in "Maybe" being displayed even when "Find Contacts in Other Apps" is disabled.

Predicted Response: 	 Your text messages are showing up as "maybe" because your messaging app is uncertain about the sender's identity. This typically happens when the sender's number is not saved in your contacts or if the message is coming from an unknown or unrecognized number. Check your contacts to see if the sender is listed, or consider saving their number to avoid this issue in the future.

Semantic F1 Score: 0.28


In [50]:
dspy.inspect_history(n=1)






[34m[2025-05-18T10:44:40.956373][0m

[31mSystem message:[0m

Your input fields are:
1. `question` (str)
2. `ground_truth` (str)
3. `system_response` (str)
Your output fields are:
1. `reasoning` (str)
2. `ground_truth_key_ideas` (str): enumeration of key ideas in the ground truth
3. `system_response_key_ideas` (str): enumeration of key ideas in the system response
4. `discussion` (str): discussion of the overlap between ground truth and system response
5. `recall` (float): fraction (out of 1.0) of ground truth covered by the system response
6. `precision` (float): fraction (out of 1.0) of system response covered by the ground truth
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## question ## ]]
{question}

[[ ## ground_truth ## ]]
{ground_truth}

[[ ## system_response ## ]]
{system_response}

[[ ## reasoning ## ]]
{reasoning}

[[ ## ground_truth_key_ideas ## ]]
{ground_truth_key_ideas}

[[ ## system_response_key_ideas ## ]]


In [55]:
# Define an evaluator that we can re-use.
evaluate = dspy.Evaluate(devset=devset, metric=metric, num_threads=24,
                         display_progress=True, display_table=2)

# Evaluate the Chain-of-Thought program.
evaluate(cot)

Average Metric: 39.11 / 63 (62.1%):  21%|█▋      | 63/300 [00:14<00:57,  4.12it/s]



Average Metric: 40.91 / 66 (62.0%):  22%|█▊      | 66/300 [00:15<01:05,  3.56it/s]



Average Metric: 48.20 / 77 (62.6%):  26%|██      | 77/300 [00:18<00:58,  3.78it/s]



Average Metric: 52.58 / 84 (62.6%):  28%|██▏     | 84/300 [00:19<00:33,  6.38it/s]



Average Metric: 66.59 / 104 (64.0%):  35%|██    | 104/300 [00:23<00:32,  6.01it/s]



Average Metric: 81.72 / 129 (63.3%):  43%|██▌   | 128/300 [00:28<00:31,  5.45it/s]



Average Metric: 103.40 / 162 (63.8%):  54%|██▋  | 162/300 [00:35<00:29,  4.70it/s]



Average Metric: 135.72 / 212 (64.0%):  71%|███▌ | 212/300 [00:44<00:14,  6.06it/s]



Average Metric: 144.83 / 225 (64.4%):  75%|███▊ | 225/300 [00:47<00:13,  5.69it/s]



Average Metric: 152.39 / 239 (63.8%):  79%|███▉ | 238/300 [00:49<00:10,  5.86it/s]



Average Metric: 164.80 / 258 (63.9%):  86%|████▎| 258/300 [00:54<00:10,  3.84it/s]



Average Metric: 165.80 / 259 (64.0%):  86%|████▎| 259/300 [00:54<00:15,  2.60it/s]



Average Metric: 189.70 / 295 (64.3%):  98%|████▉| 294/300 [01:01<00:01,  4.92it/s]



Average Metric: 193.19 / 300 (64.4%): 100%|█████| 300/300 [01:05<00:00,  4.61it/s]

2025/05/18 10:52:50 INFO dspy.evaluate.evaluate: Average Metric: 193.1869318525495 / 300 (64.4%)





Unnamed: 0,question,example_response,gold_doc_ids,reasoning,pred_response,SemanticF1
0,"when to use c over c++, and c++ over c?","If you are equally familiar with both C++ and C, it's advisable to...",[733],Choosing between C and C++ depends on the specific requirements of...,"Use C over C++ when you need a lightweight, procedural language wi...",✔️ [0.774]
1,should images be stored in a git repository?,"One viewpoint expresses that there is no significant downside, esp...","[6253, 6254, 6275, 6278, 8215]",Storing images in a git repository depends on the context and the ...,It depends on the size and number of images. For small projects wi...,✔️ [0.750]


64.4

In [57]:
download("https://huggingface.co/dspy/cache/resolve/main/ragqa_arena_tech_corpus.jsonl")


Downloading 'ragqa_arena_tech_corpus.jsonl'...


In [59]:
max_characters = 6000  # for truncating >99th percentile of documents
topk_docs_to_retrieve = 5  # number of documents to retrieve per search query

with open("ragqa_arena_tech_corpus.jsonl") as f:
    corpus = [ujson.loads(line)['text'][:max_characters] for line in f]
    print(f"Loaded {len(corpus)} documents. Will encode them below.")

len(corpus)

Loaded 28436 documents. Will encode them below.


28436

In [60]:
embedder = dspy.Embedder('openai/text-embedding-3-small', dimensions=512)
search = dspy.retrievers.Embeddings(embedder=embedder, corpus=corpus, k=topk_docs_to_retrieve)

Training a 32-byte FAISS index with 337 partitions, based on 28436 x 512-dim embeddings


In [64]:
p=search("what are high memory and low memory on linux?")
p.passages

['As far as I remember, High Memory is used for application space and Low Memory for the kernel. Advantage is that (user-space) applications cant access kernel-space memory.',
 'HIGHMEM is a range of kernels memory space, but it is NOT memory you access but its a place where you put what you want to access. A typical 32bit Linux virtual memory map is like: 0x00000000-0xbfffffff: user process (3GB) 0xc0000000-0xffffffff: kernel space (1GB) (CPU-specific vector and whatsoever are ignored here). Linux splits the 1GB kernel space into 2 pieces, LOWMEM and HIGHMEM. The split varies from installation to installation. If an installation chooses, say, 512MB-512MB for LOW and HIGH mems, the 512MB LOWMEM (0xc0000000-0xdfffffff) is statically mapped at the kernel boot time; usually the first so many bytes of the physical memory is used for this so that virtual and physical addresses in this range have a constant offset of, say, 0xc0000000. On the other hand, the latter 512MB (HIGHMEM) has no stat

In [65]:
class RAG(dspy.Module):
    def __init__(self):
        self.respond = dspy.ChainOfThought('context, question -> response')

    def forward(self, question):
        context = search(question).passages
        return self.respond(context=context, question=question)

In [66]:
rag = RAG()
rag(question="what are high memory and low memory on linux?")

Prediction(
    reasoning="High memory (HighMem) and low memory (LowMem) are segments of physical memory in Linux systems, particularly relevant in 32-bit architectures. Low memory is always mapped into the kernel's address space, allowing the kernel to access it directly at all times. High memory, on the other hand, is memory above a certain threshold (around 860MB in some systems) that is not permanently mapped into the kernel's address space. To access high memory, the kernel must explicitly map pages into its address space using functions like kmap, which makes high memory slower to access compared to low memory. User-space programs can directly access high memory, but the kernel must perform special steps to access it when needed. This division helps optimize memory management and system performance, especially when physical memory exceeds the kernel's direct mapping capacity.",
    response="High memory in Linux refers to the segment of physical memory that is not permanently map

In [67]:
evaluate(rag)


Average Metric: 23.26 / 35 (66.5%):  11%|▉       | 34/300 [00:15<00:50,  5.30it/s]



Average Metric: 59.64 / 84 (71.0%):  28%|██▏     | 84/300 [00:27<00:58,  3.69it/s]



Average Metric: 77.81 / 107 (72.7%):  36%|██▏   | 107/300 [00:32<00:36,  5.31it/s]



Average Metric: 86.66 / 120 (72.2%):  40%|██▍   | 120/300 [00:36<00:59,  3.00it/s]



Average Metric: 86.66 / 121 (71.6%):  40%|██▍   | 121/300 [00:36<01:02,  2.85it/s]



Average Metric: 94.97 / 132 (71.9%):  44%|██▋   | 132/300 [00:39<00:22,  7.62it/s]



Average Metric: 128.83 / 179 (72.0%):  60%|██▉  | 179/300 [00:50<00:40,  2.98it/s]



Average Metric: 148.35 / 207 (71.7%):  69%|███▍ | 207/300 [00:58<00:32,  2.84it/s]



Average Metric: 205.44 / 284 (72.3%):  95%|████▋| 284/300 [01:16<00:03,  4.11it/s]



Average Metric: 215.59 / 297 (72.6%):  99%|████▉| 297/300 [01:18<00:00,  5.88it/s]



Average Metric: 218.15 / 300 (72.7%): 100%|█████| 300/300 [01:21<00:00,  3.68it/s]

2025/05/18 11:19:44 INFO dspy.evaluate.evaluate: Average Metric: 218.15210245486182 / 300 (72.7%)





Unnamed: 0,question,example_response,gold_doc_ids,reasoning,pred_response,SemanticF1
0,"when to use c over c++, and c++ over c?","If you are equally familiar with both C++ and C, it's advisable to...",[733],The context provides insights into the advantages and typical use ...,Use C over C++ primarily in embedded systems or environments where...,✔️ [0.500]
1,should images be stored in a git repository?,"One viewpoint expresses that there is no significant downside, esp...","[6253, 6254, 6275, 6278, 8215]",The context provides multiple perspectives on storing images in a ...,"Images can be stored in a git repository, but it is generally not ...",✔️ [0.750]


72.72

In [None]:
tp = dspy.MIPROv2(metric=metric, auto="medium", num_threads=24)  # use fewer threads if your rate limit is small

optimized_rag = tp.compile(RAG(), trainset=trainset,
                           max_bootstrapped_demos=3, max_labeled_demos=3,
                           requires_permission_to_run=False)

In [None]:
evaluate(optimized_rag)

In [3]:
import random
import json
import asyncio

import pandas as pd
from collections import defaultdict
from datasets import load_dataset
from tqdm.asyncio import tqdm_asyncio
import pickle

from predictors import *
from llm import *
from prompts import *
from tools import *
from evaluations import *

dataset = load_dataset("gsm8k", "main")
training, validation = build_datasets(dataset, 1000, 0.7)


In [4]:
for x in training:
    x['answer']=int(x['answer'].replace(',', ''))

In [5]:
data = [dspy.Example(**d).with_inputs('question') for d in training]
len(data)


700

In [6]:
trainset, devset = data[:300], data[300:500]

In [32]:
class Solver(dspy.Module):
    def __init__(self):
        self.guess = dspy.ChainOfThought('question -> answer : int')
        self.critic = [dspy.ChainOfThought('question, last_reasoning, last_answer -> corrected_answer : int') for _ in range(3)]

    def forward(self, question):
        pred = self.guess(question=question)
        last_answer, last_reasoning = pred.answer, pred.reasoning
        for k in range(3):
            newpred = self.critic[k](question=question,
                                  last_reasoning=last_reasoning,
                                  last_answer=last_answer)
            last_answer, last_reasoning = newpred.corrected_answer, last_reasoning +" ==> "+ newpred.reasoning
        return dspy.Prediction(reasoning = last_reasoning, answer = last_answer)

solver = Solver()

In [35]:
solver(**data[0].inputs())

Prediction(
    reasoning='Natalia sold 48 clips in April. In May, she sold half as many clips, which is 48 divided by 2, resulting in 24 clips. To find the total number of clips sold in April and May, add the clips sold in both months: 48 + 24 = 72. ==> Natalia sold 48 clips in April. In May, she sold half as many clips, which is 48 divided by 2, resulting in 24 clips. To find the total number of clips sold in April and May, add the clips sold in both months: 48 + 24 = 72. ==> The problem states that Natalia sold 48 clips in April. In May, she sold half as many clips, which is 48 divided by 2, resulting in 24 clips. To find the total number of clips sold in April and May, add the clips sold in both months: 48 + 24 = 72. The calculation is correct, and the total clips sold is 72. ==> The problem states that Natalia sold 48 clips in April. In May, she sold half as many clips, which is 48 divided by 2, resulting in 24 clips. To find the total number of clips sold in April and May, add th

In [36]:
solver.guess(question=data[0].question)

Prediction(
    reasoning='Natalia sold 48 clips in April. In May, she sold half as many clips, which is 48 divided by 2, resulting in 24 clips. To find the total number of clips sold in April and May, add the clips sold in both months: 48 + 24 = 72.',
    answer=72
)

In [37]:
def match(example, pred):
    return example.answer == pred.answer


evaluate = dspy.Evaluate(devset=devset, metric=match, num_threads=24,
                         display_progress=True, display_table=2)

In [38]:
evaluate(solver)

Average Metric: 36.00 / 38 (94.7%):  18%|█▍      | 37/200 [00:08<00:28,  5.72it/s]



Average Metric: 70.00 / 75 (93.3%):  38%|███     | 75/200 [00:14<00:21,  5.83it/s]



Average Metric: 71.00 / 76 (93.4%):  38%|███     | 75/200 [00:14<00:21,  5.83it/s]

2025/05/18 13:37:46 ERROR dspy.utils.parallelizer: Error for Example({'question': "Sam works at the Widget Factory, assembling Widgets. He can assemble 1 widget every 10 minutes. Jack from the loading dock can help assemble widgets when he doesn't have anything else to do. When he helps, they put together 2 complete widgets every 15 minutes. Recently the factory hired Tony to help assemble widgets. Being new to the job, he doesn't work as fast as Sam or Jack. Yesterday Sam worked for 6 hours before he had to leave work early for a dentist appointment. Jack was able to help out for 4 hours before he had to go back to the loading dock to unload a new shipment of widget materials. Tony worked the entire 8-hour shift. At the end of the day, they had completed 68 widgets. How long does it take Tony to assemble a Widget, in minutes?", 'thoughts': 'Sam completes a widget every 10 minutes.\nWhen Jack helps they finish 2 in 15 minutes. Sam has finished 1 widget and has begun working on another 

Average Metric: 174.00 / 199 (87.4%): 100%|█████| 200/200 [00:33<00:00,  5.98it/s]

2025/05/18 13:38:06 INFO dspy.evaluate.evaluate: Average Metric: 174.0 / 200 (87.0%)





Unnamed: 0,question,thoughts,example_answer,reasoning,pred_answer,match,answer
0,It costs $.10 to print one page. Jenny wants to print 7 copies of ...,Jenny needs to print 7*25= <<7*25=175>>175 pages. This will cost 1...,12.0,"First, calculate the cost to print 7 copies of a 25-page essay. Ea...",12.0,✔️ [True],
1,240 sixth-grade students were asked about the number of novels the...,1/6 * 240 students = <<1/6*240=40>>40 students. So 40 students rea...,16.0,"First, find the number of students in each category based on the g...",16.0,✔️ [True],


87.0

In [39]:
tp = dspy.MIPROv2(metric=match, auto="medium", num_threads=24)  # use fewer threads if your rate limit is small

optimized_rag = tp.compile(Solver(), trainset=trainset,
                           max_bootstrapped_demos=0, max_labeled_demos=0,
                           requires_permission_to_run=False)

2025/05/18 13:38:20 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING MEDIUM AUTO RUN SETTINGS:
num_trials: 57
minibatch: True
num_fewshot_candidates: 12
num_instruct_candidates: 6
valset size: 240

2025/05/18 13:38:20 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/05/18 13:38:20 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used for informing instruction proposal.

2025/05/18 13:38:20 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=12 sets of demonstrations...


Bootstrapping set 1/12
Bootstrapping set 2/12


  0%|                                                      | 0/60 [00:00<?, ?it/s]2025/05/18 13:38:25 ERROR dspy.teleprompt.bootstrap: Failed to run or to evaluate example Example({'question': 'Ken created a care package to send to his brother, who was away at boarding school.  Ken placed a box on a scale, and then he poured into the box enough jelly beans to bring the weight to 2 pounds.  Then, he added enough brownies to cause the weight to triple.  Next, he added another 2 pounds of jelly beans.  And finally, he added enough gummy worms to double the weight once again.  What was the final weight of the box of goodies, in pounds?', 'thoughts': 'To the initial 2 pounds of jelly beans, he added enough brownies to cause the weight to triple, bringing the weight to 2*3=<<2*3=6>>6 pounds.\nNext, he added another 2 pounds of jelly beans, bringing the weight to 6+2=<<6+2=8>>8 pounds.\nAnd finally, he added enough gummy worms to double the weight once again, to a final weight of 8*2=<<8*2=16

Average Metric: 6.00 / 7 (85.7%):   3%|▎          | 7/240 [00:03<00:56,  4.09it/s]



Average Metric: 49.00 / 53 (92.5%):  22%|█▋      | 52/240 [00:09<00:23,  8.13it/s]



Average Metric: 50.00 / 54 (92.6%):  22%|█▊      | 54/240 [00:09<00:20,  9.11it/s]

2025/05/18 13:40:36 ERROR dspy.utils.parallelizer: Error for Example({'question': 'Jim collects model cars, and he has 301 models total.  Jim has 4 times as many Buicks as Fords, and 3 more than twice the number of Fords than Chevys.  How many Buicks does Jim have?', 'thoughts': 'Let x represent the number of Chevys\nFords:3+2x\nBuicks:4(3+2x)=12+8x\nTotal:x+3+2x+12+8x=301\n11x+15=301\n11x=286\nx=<<26=26>>26\nBuicks:12+8(26)=220', 'answer': 220}) (input_keys={'question'}): Both structured output format and JSON mode failed. Please choose a model that supports `response_format` argument. Original error: Expected dict_keys(['reasoning', 'answer']) but got dict_keys(['reasoning']). Set `provide_traceback=True` for traceback.


Average Metric: 95.00 / 101 (94.1%):  42%|██▌   | 102/240 [00:17<00:15,  8.92it/s]



Average Metric: 220.00 / 239 (92.1%): 100%|█████| 240/240 [00:38<00:00,  6.29it/s]

2025/05/18 13:41:04 INFO dspy.evaluate.evaluate: Average Metric: 220.0 / 240 (91.7%)
2025/05/18 13:41:04 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 91.67

2025/05/18 13:41:04 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 70 - Minibatch ==



Average Metric: 15.00 / 18 (83.3%):  51%|████▋    | 18/35 [00:06<00:03,  5.39it/s]



Average Metric: 28.00 / 33 (84.8%):  91%|████████▏| 32/35 [00:10<00:00,  3.54it/s]



Average Metric: 29.00 / 35 (82.9%): 100%|█████████| 35/35 [00:24<00:00,  1.45it/s]

2025/05/18 13:41:29 INFO dspy.evaluate.evaluate: Average Metric: 29 / 35 (82.9%)
2025/05/18 13:41:29 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 82.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 1: Instruction 5', 'Predictor 2: Instruction 0', 'Predictor 3: Instruction 4'].
2025/05/18 13:41:29 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86]
2025/05/18 13:41:29 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67]
2025/05/18 13:41:29 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:41:29 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 70 - Minibatch ==



Average Metric: 29.00 / 35 (82.9%): 100%|█████████| 35/35 [00:11<00:00,  2.99it/s]

2025/05/18 13:41:40 INFO dspy.evaluate.evaluate: Average Metric: 29 / 35 (82.9%)
2025/05/18 13:41:40 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 82.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 1: Instruction 2', 'Predictor 2: Instruction 0', 'Predictor 3: Instruction 2'].
2025/05/18 13:41:40 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86]
2025/05/18 13:41:40 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67]
2025/05/18 13:41:40 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:41:40 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 70 - Minibatch ==



Average Metric: 32.00 / 35 (91.4%): 100%|█████████| 35/35 [00:08<00:00,  4.22it/s]

2025/05/18 13:41:49 INFO dspy.evaluate.evaluate: Average Metric: 32 / 35 (91.4%)
2025/05/18 13:41:49 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 91.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 1: Instruction 2', 'Predictor 2: Instruction 4', 'Predictor 3: Instruction 4'].
2025/05/18 13:41:49 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43]
2025/05/18 13:41:49 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67]
2025/05/18 13:41:49 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:41:49 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 70 - Minibatch ==



Average Metric: 19.00 / 21 (90.5%):  60%|█████▍   | 21/35 [00:06<00:01,  7.87it/s]



Average Metric: 31.00 / 35 (88.6%): 100%|█████████| 35/35 [00:27<00:00,  1.28it/s]

2025/05/18 13:42:16 INFO dspy.evaluate.evaluate: Average Metric: 31 / 35 (88.6%)
2025/05/18 13:42:16 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 3', 'Predictor 1: Instruction 5', 'Predictor 2: Instruction 2', 'Predictor 3: Instruction 4'].
2025/05/18 13:42:16 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57]
2025/05/18 13:42:16 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67]
2025/05/18 13:42:16 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:42:16 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 70 - Minibatch ==



Average Metric: 6.00 / 8 (75.0%):  20%|██▍         | 7/35 [00:04<00:12,  2.27it/s]



Average Metric: 14.00 / 20 (70.0%):  54%|████▉    | 19/35 [00:06<00:01,  8.24it/s]



Average Metric: 28.00 / 35 (80.0%): 100%|█████████| 35/35 [00:11<00:00,  2.96it/s]

2025/05/18 13:42:28 INFO dspy.evaluate.evaluate: Average Metric: 28 / 35 (80.0%)
2025/05/18 13:42:28 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 80.0 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 1: Instruction 0', 'Predictor 2: Instruction 5', 'Predictor 3: Instruction 1'].
2025/05/18 13:42:28 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0]
2025/05/18 13:42:28 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67]
2025/05/18 13:42:28 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:42:28 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 70 - Full Evaluation =====
2025/05/18 13:42:28 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 91.43) from minibatch trials...



Average Metric: 49.00 / 53 (92.5%):  22%|█▊      | 53/240 [00:08<00:14, 13.15it/s]



Average Metric: 50.00 / 54 (92.6%):  22%|█▊      | 53/240 [00:08<00:14, 13.15it/s]

2025/05/18 13:42:36 ERROR dspy.utils.parallelizer: Error for Example({'question': 'Jim collects model cars, and he has 301 models total.  Jim has 4 times as many Buicks as Fords, and 3 more than twice the number of Fords than Chevys.  How many Buicks does Jim have?', 'thoughts': 'Let x represent the number of Chevys\nFords:3+2x\nBuicks:4(3+2x)=12+8x\nTotal:x+3+2x+12+8x=301\n11x+15=301\n11x=286\nx=<<26=26>>26\nBuicks:12+8(26)=220', 'answer': 220}) (input_keys={'question'}): Both structured output format and JSON mode failed. Please choose a model that supports `response_format` argument. Original error: Expected dict_keys(['reasoning', 'answer']) but got dict_keys(['reasoning']). Set `provide_traceback=True` for traceback.


Average Metric: 93.00 / 101 (92.1%):  42%|██▌   | 102/240 [00:15<00:22,  6.18it/s]



Average Metric: 219.00 / 239 (91.6%): 100%|█████| 240/240 [00:35<00:00,  6.83it/s]

2025/05/18 13:43:03 INFO dspy.evaluate.evaluate: Average Metric: 219.0 / 240 (91.2%)
2025/05/18 13:43:03 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25]
2025/05/18 13:43:03 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67
2025/05/18 13:43:03 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/05/18 13:43:03 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 8 / 70 - Minibatch ==



Average Metric: 28.00 / 35 (80.0%): 100%|█████████| 35/35 [00:15<00:00,  2.21it/s]

2025/05/18 13:43:19 INFO dspy.evaluate.evaluate: Average Metric: 28 / 35 (80.0%)
2025/05/18 13:43:19 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 80.0 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 1: Instruction 3', 'Predictor 2: Instruction 3', 'Predictor 3: Instruction 0'].
2025/05/18 13:43:19 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0]
2025/05/18 13:43:19 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25]
2025/05/18 13:43:19 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:43:19 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 9 / 70 - Minibatch ==



Average Metric: 32.00 / 35 (91.4%): 100%|█████████| 35/35 [00:14<00:00,  2.41it/s]

2025/05/18 13:43:34 INFO dspy.evaluate.evaluate: Average Metric: 32 / 35 (91.4%)
2025/05/18 13:43:34 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 91.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 3', 'Predictor 1: Instruction 4', 'Predictor 2: Instruction 4', 'Predictor 3: Instruction 0'].
2025/05/18 13:43:34 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43]
2025/05/18 13:43:34 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25]
2025/05/18 13:43:34 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:43:34 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 10 / 70 - Minibatch ==



Average Metric: 32.00 / 35 (91.4%): 100%|█████████| 35/35 [00:11<00:00,  3.09it/s]

2025/05/18 13:43:45 INFO dspy.evaluate.evaluate: Average Metric: 32 / 35 (91.4%)
2025/05/18 13:43:45 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 91.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 3', 'Predictor 1: Instruction 3', 'Predictor 2: Instruction 4', 'Predictor 3: Instruction 3'].
2025/05/18 13:43:45 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43]
2025/05/18 13:43:45 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25]
2025/05/18 13:43:45 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:43:45 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 11 / 70 - Minibatch ==



Average Metric: 31.00 / 35 (88.6%): 100%|█████████| 35/35 [00:09<00:00,  3.60it/s]

2025/05/18 13:43:55 INFO dspy.evaluate.evaluate: Average Metric: 31 / 35 (88.6%)
2025/05/18 13:43:55 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 1: Instruction 0', 'Predictor 2: Instruction 0', 'Predictor 3: Instruction 0'].
2025/05/18 13:43:55 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57]
2025/05/18 13:43:55 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25]
2025/05/18 13:43:55 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:43:55 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 12 / 70 - Minibatch ==



Average Metric: 34.00 / 35 (97.1%): 100%|█████████| 35/35 [00:04<00:00,  8.31it/s]

2025/05/18 13:43:59 INFO dspy.evaluate.evaluate: Average Metric: 34 / 35 (97.1%)
2025/05/18 13:43:59 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.14 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 1: Instruction 0', 'Predictor 2: Instruction 2', 'Predictor 3: Instruction 0'].
2025/05/18 13:43:59 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14]
2025/05/18 13:43:59 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25]
2025/05/18 13:43:59 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:43:59 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 13 / 70 - Full Evaluation =====
2025/05/18 13:43:59 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 97.14) from minibatch trials...



Average Metric: 50.00 / 53 (94.3%):  22%|█▋      | 52/240 [00:04<00:16, 11.24it/s]



Average Metric: 51.00 / 54 (94.4%):  22%|█▊      | 54/240 [00:04<00:16, 11.14it/s]

2025/05/18 13:44:04 ERROR dspy.utils.parallelizer: Error for Example({'question': 'Jim collects model cars, and he has 301 models total.  Jim has 4 times as many Buicks as Fords, and 3 more than twice the number of Fords than Chevys.  How many Buicks does Jim have?', 'thoughts': 'Let x represent the number of Chevys\nFords:3+2x\nBuicks:4(3+2x)=12+8x\nTotal:x+3+2x+12+8x=301\n11x+15=301\n11x=286\nx=<<26=26>>26\nBuicks:12+8(26)=220', 'answer': 220}) (input_keys={'question'}): Both structured output format and JSON mode failed. Please choose a model that supports `response_format` argument. Original error: Expected dict_keys(['reasoning', 'answer']) but got dict_keys(['reasoning']). Set `provide_traceback=True` for traceback.


Average Metric: 94.00 / 101 (93.1%):  42%|██▌   | 101/240 [00:08<00:14,  9.44it/s]



Average Metric: 220.00 / 239 (92.1%): 100%|█████| 240/240 [00:19<00:00, 12.11it/s]

2025/05/18 13:44:19 INFO dspy.evaluate.evaluate: Average Metric: 220.0 / 240 (91.7%)
2025/05/18 13:44:19 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67]
2025/05/18 13:44:19 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67
2025/05/18 13:44:19 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/05/18 13:44:19 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 14 / 70 - Minibatch ==



Average Metric: 31.00 / 35 (88.6%): 100%|█████████| 35/35 [00:07<00:00,  4.56it/s]

2025/05/18 13:44:27 INFO dspy.evaluate.evaluate: Average Metric: 31 / 35 (88.6%)
2025/05/18 13:44:27 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 1: Instruction 1', 'Predictor 2: Instruction 0', 'Predictor 3: Instruction 0'].
2025/05/18 13:44:27 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57]
2025/05/18 13:44:27 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67]
2025/05/18 13:44:27 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:44:27 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 15 / 70 - Minibatch ==



Average Metric: 30.00 / 34 (88.2%):  94%|████████▍| 33/35 [00:06<00:00,  5.14it/s]



Average Metric: 31.00 / 35 (88.6%): 100%|█████████| 35/35 [00:12<00:00,  2.83it/s]

2025/05/18 13:44:39 INFO dspy.evaluate.evaluate: Average Metric: 31 / 35 (88.6%)
2025/05/18 13:44:39 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 1: Instruction 0', 'Predictor 2: Instruction 3', 'Predictor 3: Instruction 2'].
2025/05/18 13:44:39 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57]
2025/05/18 13:44:39 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67]
2025/05/18 13:44:39 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:44:39 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 16 / 70 - Minibatch ==



Average Metric: 32.00 / 35 (91.4%): 100%|█████████| 35/35 [00:12<00:00,  2.81it/s]

2025/05/18 13:44:51 INFO dspy.evaluate.evaluate: Average Metric: 32 / 35 (91.4%)
2025/05/18 13:44:51 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 91.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 1: Instruction 2', 'Predictor 2: Instruction 2', 'Predictor 3: Instruction 1'].
2025/05/18 13:44:51 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43]
2025/05/18 13:44:51 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67]
2025/05/18 13:44:51 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:44:51 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 17 / 70 - Minibatch ==



Average Metric: 32.00 / 35 (91.4%): 100%|█████████| 35/35 [00:06<00:00,  5.79it/s]

2025/05/18 13:44:58 INFO dspy.evaluate.evaluate: Average Metric: 32 / 35 (91.4%)
2025/05/18 13:44:58 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 91.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 1: Instruction 0', 'Predictor 2: Instruction 5', 'Predictor 3: Instruction 3'].
2025/05/18 13:44:58 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43]
2025/05/18 13:44:58 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67]
2025/05/18 13:44:58 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:44:58 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 18 / 70 - Minibatch ==



Average Metric: 22.00 / 24 (91.7%):  69%|██████▏  | 24/35 [00:08<00:04,  2.21it/s]



Average Metric: 32.00 / 34 (94.1%):  97%|████████▋| 34/35 [00:13<00:00,  1.45it/s]



Average Metric: 32.00 / 35 (91.4%): 100%|█████████| 35/35 [00:24<00:00,  1.41it/s]

2025/05/18 13:45:23 INFO dspy.evaluate.evaluate: Average Metric: 32 / 35 (91.4%)
2025/05/18 13:45:23 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 91.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 1: Instruction 4', 'Predictor 2: Instruction 2', 'Predictor 3: Instruction 3'].
2025/05/18 13:45:23 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43]
2025/05/18 13:45:23 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67]
2025/05/18 13:45:23 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:45:23 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 19 / 70 - Full Evaluation =====
2025/05/18 13:45:23 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 91.43) from minibatch trials...



Average Metric: 18.00 / 19 (94.7%):   8%|▋       | 19/240 [00:05<00:41,  5.37it/s]



Average Metric: 24.00 / 25 (96.0%):  10%|▊       | 24/240 [00:06<00:31,  6.93it/s]



Average Metric: 78.00 / 84 (92.9%):  35%|██▊     | 84/240 [00:18<01:09,  2.25it/s]



Average Metric: 89.00 / 96 (92.7%):  40%|███▏    | 95/240 [00:20<00:30,  4.81it/s]



Average Metric: 105.00 / 114 (92.1%):  48%|██▍  | 114/240 [00:25<00:27,  4.62it/s]

2025/05/18 13:45:48 ERROR dspy.utils.parallelizer: Error for Example({'question': 'Jim collects model cars, and he has 301 models total.  Jim has 4 times as many Buicks as Fords, and 3 more than twice the number of Fords than Chevys.  How many Buicks does Jim have?', 'thoughts': 'Let x represent the number of Chevys\nFords:3+2x\nBuicks:4(3+2x)=12+8x\nTotal:x+3+2x+12+8x=301\n11x+15=301\n11x=286\nx=<<26=26>>26\nBuicks:12+8(26)=220', 'answer': 220}) (input_keys={'question'}): Both structured output format and JSON mode failed. Please choose a model that supports `response_format` argument. Original error: Expected dict_keys(['reasoning', 'answer']) but got dict_keys(['reasoning']). Set `provide_traceback=True` for traceback.


Average Metric: 195.00 / 215 (90.7%):  90%|████▌| 216/240 [00:42<00:04,  5.36it/s]



Average Metric: 216.00 / 239 (90.4%): 100%|█████| 240/240 [00:48<00:00,  4.98it/s]

2025/05/18 13:46:11 INFO dspy.evaluate.evaluate: Average Metric: 216.0 / 240 (90.0%)
2025/05/18 13:46:11 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0]
2025/05/18 13:46:11 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67
2025/05/18 13:46:11 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/05/18 13:46:11 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 20 / 70 - Minibatch ==



Average Metric: 19.00 / 19 (100.0%):  51%|████    | 18/35 [00:05<00:02,  7.35it/s]



Average Metric: 35.00 / 35 (100.0%): 100%|████████| 35/35 [00:12<00:00,  2.78it/s]

2025/05/18 13:46:23 INFO dspy.evaluate.evaluate: Average Metric: 35 / 35 (100.0%)
2025/05/18 13:46:23 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 100.0 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 1: Instruction 0', 'Predictor 2: Instruction 1', 'Predictor 3: Instruction 5'].
2025/05/18 13:46:23 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0]
2025/05/18 13:46:23 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0]
2025/05/18 13:46:23 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:46:23 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 21 / 70 - Minibatch ==



  0%|                                                      | 0/35 [00:00<?, ?it/s]



Average Metric: 21.00 / 22 (95.5%):  63%|█████▋   | 22/35 [00:07<00:02,  4.88it/s]



Average Metric: 33.00 / 34 (97.1%):  97%|████████▋| 34/35 [00:10<00:00,  3.10it/s]

2025/05/18 13:46:38 ERROR dspy.utils.parallelizer: Error for Example({'question': 'An aquarium holds an equal number of clownfish and blowfish. 26 of the blowfish stay in their own tank, and the remaining blowfish swim into a display tank.  An equal number of clownfish join the blowfish in the display tank, but then a third of these clownfish swim back into their own tank.  If the aquarium holds a combined total of 100 fish, how many clownfish are now in the display tank?', 'thoughts': 'There is an equal number of clownfish and blowfish, which means there are 100 fish / 2 = <<100/2=50>>50 blowfish.\nSince some stay in their tanks, there are 50 blowfish – 26 blowfish = <<50-26=24>>24 blowfish in the display tank.\nThere were the same amount of clownfish until 24 clownfish / 3 = <<24/3=8>>8 clownfish swam back to their own tank.\nThis leaves 24 clownfish – 8 clownfish = <<24-8=16>>16 clownfish in the display tank.', 'answer': 16}) (input_keys={'question'}): Both structured output format 

Average Metric: 33.00 / 34 (97.1%): 100%|█████████| 35/35 [00:14<00:00,  2.36it/s]

2025/05/18 13:46:38 INFO dspy.evaluate.evaluate: Average Metric: 33.0 / 35 (94.3%)
2025/05/18 13:46:38 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 94.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 1: Instruction 4', 'Predictor 2: Instruction 5', 'Predictor 3: Instruction 5'].
2025/05/18 13:46:38 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29]
2025/05/18 13:46:38 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0]
2025/05/18 13:46:38 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:46:38 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 22 / 70 - Minibatch ==



Average Metric: 30.00 / 35 (85.7%): 100%|█████████| 35/35 [00:11<00:00,  3.08it/s]

2025/05/18 13:46:50 INFO dspy.evaluate.evaluate: Average Metric: 30 / 35 (85.7%)
2025/05/18 13:46:50 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 85.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 1: Instruction 4', 'Predictor 2: Instruction 5', 'Predictor 3: Instruction 5'].
2025/05/18 13:46:50 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71]
2025/05/18 13:46:50 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0]
2025/05/18 13:46:50 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:46:50 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 23 / 70 - Minibatch ==



Average Metric: 30.00 / 35 (85.7%): 100%|█████████| 35/35 [00:08<00:00,  4.09it/s]

2025/05/18 13:46:58 INFO dspy.evaluate.evaluate: Average Metric: 30 / 35 (85.7%)
2025/05/18 13:46:58 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 85.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 1: Instruction 0', 'Predictor 2: Instruction 1', 'Predictor 3: Instruction 5'].
2025/05/18 13:46:58 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71]
2025/05/18 13:46:58 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0]
2025/05/18 13:46:58 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:46:58 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 24 / 70 - Minibatch ==



Average Metric: 33.00 / 35 (94.3%): 100%|█████████| 35/35 [00:13<00:00,  2.56it/s]

2025/05/18 13:47:12 INFO dspy.evaluate.evaluate: Average Metric: 33 / 35 (94.3%)
2025/05/18 13:47:12 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 94.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 1: Instruction 4', 'Predictor 2: Instruction 3', 'Predictor 3: Instruction 5'].
2025/05/18 13:47:12 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29]
2025/05/18 13:47:12 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0]
2025/05/18 13:47:12 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:47:12 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 25 / 70 - Full Evaluation =====
2025/05/18 13:47:12 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 94.29) from minibatch trial


Average Metric: 17.00 / 17 (100.0%):   7%|▍      | 16/240 [00:05<00:39,  5.65it/s]



Average Metric: 86.00 / 92 (93.5%):  38%|███     | 92/240 [00:19<00:29,  5.02it/s]



Average Metric: 87.00 / 93 (93.5%):  39%|███     | 93/240 [00:19<00:27,  5.42it/s]



Average Metric: 90.00 / 97 (92.8%):  40%|███▏    | 97/240 [00:20<00:26,  5.38it/s]



Average Metric: 175.00 / 196 (89.3%):  81%|████ | 195/240 [00:37<00:05,  8.26it/s]



Average Metric: 216.00 / 240 (90.0%): 100%|█████| 240/240 [00:47<00:00,  5.04it/s]

2025/05/18 13:48:00 INFO dspy.evaluate.evaluate: Average Metric: 216 / 240 (90.0%)
2025/05/18 13:48:00 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0]
2025/05/18 13:48:00 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67
2025/05/18 13:48:00 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/05/18 13:48:00 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 26 / 70 - Minibatch ==



Average Metric: 34.00 / 35 (97.1%): 100%|█████████| 35/35 [00:08<00:00,  4.19it/s]

2025/05/18 13:48:08 INFO dspy.evaluate.evaluate: Average Metric: 34 / 35 (97.1%)
2025/05/18 13:48:08 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.14 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 1: Instruction 5', 'Predictor 2: Instruction 5', 'Predictor 3: Instruction 5'].
2025/05/18 13:48:08 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14]
2025/05/18 13:48:08 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0]
2025/05/18 13:48:08 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:48:08 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 27 / 70 - Minibatch ==



Average Metric: 3.00 / 3 (100.0%):   9%|▉          | 3/35 [00:02<00:28,  1.11it/s]



Average Metric: 4.00 / 4 (100.0%):   9%|▉          | 3/35 [00:02<00:28,  1.11it/s]

2025/05/18 13:48:11 ERROR dspy.utils.parallelizer: Error for Example({'question': 'Jim collects model cars, and he has 301 models total.  Jim has 4 times as many Buicks as Fords, and 3 more than twice the number of Fords than Chevys.  How many Buicks does Jim have?', 'thoughts': 'Let x represent the number of Chevys\nFords:3+2x\nBuicks:4(3+2x)=12+8x\nTotal:x+3+2x+12+8x=301\n11x+15=301\n11x=286\nx=<<26=26>>26\nBuicks:12+8(26)=220', 'answer': 220}) (input_keys={'question'}): Both structured output format and JSON mode failed. Please choose a model that supports `response_format` argument. Original error: Expected dict_keys(['reasoning', 'answer']) but got dict_keys(['reasoning']). Set `provide_traceback=True` for traceback.


Average Metric: 30.00 / 34 (88.2%): 100%|█████████| 35/35 [00:08<00:00,  4.14it/s]

2025/05/18 13:48:16 INFO dspy.evaluate.evaluate: Average Metric: 30.0 / 35 (85.7%)
2025/05/18 13:48:16 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 85.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 1: Instruction 5', 'Predictor 2: Instruction 5', 'Predictor 3: Instruction 2'].
2025/05/18 13:48:16 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71]
2025/05/18 13:48:16 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0]
2025/05/18 13:48:16 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:48:16 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 28 / 70 - Minibatch ==



Average Metric: 32.00 / 35 (91.4%): 100%|█████████| 35/35 [00:07<00:00,  4.76it/s]

2025/05/18 13:48:24 INFO dspy.evaluate.evaluate: Average Metric: 32 / 35 (91.4%)
2025/05/18 13:48:24 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 91.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 1: Instruction 5', 'Predictor 2: Instruction 2', 'Predictor 3: Instruction 5'].
2025/05/18 13:48:24 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43]
2025/05/18 13:48:24 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0]
2025/05/18 13:48:24 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:48:24 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 29 / 70 - Minibatch ==



Average Metric: 28.00 / 35 (80.0%): 100%|█████████| 35/35 [00:09<00:00,  3.89it/s]

2025/05/18 13:48:33 INFO dspy.evaluate.evaluate: Average Metric: 28 / 35 (80.0%)
2025/05/18 13:48:33 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 80.0 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 1: Instruction 5', 'Predictor 2: Instruction 1', 'Predictor 3: Instruction 3'].
2025/05/18 13:48:33 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0]
2025/05/18 13:48:33 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0]
2025/05/18 13:48:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:48:33 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 30 / 70 - Minibatch ==



Average Metric: 33.00 / 35 (94.3%): 100%|█████████| 35/35 [00:12<00:00,  2.70it/s]

2025/05/18 13:48:46 INFO dspy.evaluate.evaluate: Average Metric: 33 / 35 (94.3%)
2025/05/18 13:48:46 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 94.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 1: Instruction 0', 'Predictor 2: Instruction 1', 'Predictor 3: Instruction 4'].
2025/05/18 13:48:46 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29]
2025/05/18 13:48:46 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0]
2025/05/18 13:48:46 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:48:46 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 31 / 70 - Full Evaluation =====
2025/05/18 13:48:46 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program


Average Metric: 38.00 / 42 (90.5%):  18%|█▍      | 42/240 [00:04<00:12, 15.49it/s]



Average Metric: 49.00 / 53 (92.5%):  22%|█▊      | 53/240 [00:05<00:14, 12.84it/s]



Average Metric: 49.00 / 54 (90.7%):  22%|█▊      | 53/240 [00:05<00:14, 12.84it/s]

2025/05/18 13:48:52 ERROR dspy.utils.parallelizer: Error for Example({'question': 'Jim collects model cars, and he has 301 models total.  Jim has 4 times as many Buicks as Fords, and 3 more than twice the number of Fords than Chevys.  How many Buicks does Jim have?', 'thoughts': 'Let x represent the number of Chevys\nFords:3+2x\nBuicks:4(3+2x)=12+8x\nTotal:x+3+2x+12+8x=301\n11x+15=301\n11x=286\nx=<<26=26>>26\nBuicks:12+8(26)=220', 'answer': 220}) (input_keys={'question'}): Both structured output format and JSON mode failed. Please choose a model that supports `response_format` argument. Original error: Expected dict_keys(['reasoning', 'answer']) but got dict_keys(['reasoning']). Set `provide_traceback=True` for traceback.


Average Metric: 93.00 / 101 (92.1%):  42%|██▌   | 102/240 [00:11<00:14,  9.60it/s]



Average Metric: 218.00 / 239 (91.2%): 100%|█████| 240/240 [00:27<00:00,  8.71it/s]

2025/05/18 13:49:14 INFO dspy.evaluate.evaluate: Average Metric: 218.0 / 240 (90.8%)
2025/05/18 13:49:14 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83]
2025/05/18 13:49:14 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67
2025/05/18 13:49:14 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/05/18 13:49:14 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 32 / 70 - Minibatch ==



Average Metric: 35.00 / 35 (100.0%): 100%|████████| 35/35 [00:12<00:00,  2.87it/s]

2025/05/18 13:49:26 INFO dspy.evaluate.evaluate: Average Metric: 35 / 35 (100.0%)
2025/05/18 13:49:26 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 100.0 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 1: Instruction 1', 'Predictor 2: Instruction 5', 'Predictor 3: Instruction 5'].
2025/05/18 13:49:26 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0]
2025/05/18 13:49:26 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83]
2025/05/18 13:49:26 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:49:26 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 33 / 70 - Minibatch ==



Average Metric: 33.00 / 35 (94.3%): 100%|█████████| 35/35 [00:08<00:00,  4.03it/s]

2025/05/18 13:49:35 INFO dspy.evaluate.evaluate: Average Metric: 33 / 35 (94.3%)
2025/05/18 13:49:35 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 94.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 1: Instruction 1', 'Predictor 2: Instruction 5', 'Predictor 3: Instruction 5'].
2025/05/18 13:49:35 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29]
2025/05/18 13:49:35 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83]
2025/05/18 13:49:35 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:49:35 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 34 / 70 - Minibatch ==



Average Metric: 5.00 / 6 (83.3%):  14%|█▋          | 5/35 [00:02<00:11,  2.72it/s]



Average Metric: 31.00 / 35 (88.6%): 100%|█████████| 35/35 [00:13<00:00,  2.53it/s]

2025/05/18 13:49:48 INFO dspy.evaluate.evaluate: Average Metric: 31 / 35 (88.6%)
2025/05/18 13:49:48 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 3', 'Predictor 1: Instruction 3', 'Predictor 2: Instruction 5', 'Predictor 3: Instruction 5'].
2025/05/18 13:49:48 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57]
2025/05/18 13:49:48 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83]
2025/05/18 13:49:48 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:49:48 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 35 / 70 - Minibatch ==



Average Metric: 31.00 / 35 (88.6%): 100%|█████████| 35/35 [00:09<00:00,  3.58it/s]

2025/05/18 13:49:58 INFO dspy.evaluate.evaluate: Average Metric: 31 / 35 (88.6%)
2025/05/18 13:49:58 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 1: Instruction 0', 'Predictor 2: Instruction 4', 'Predictor 3: Instruction 5'].
2025/05/18 13:49:58 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57]
2025/05/18 13:49:58 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83]
2025/05/18 13:49:58 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:49:58 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 36 / 70 - Minibatch ==



Average Metric: 32.00 / 35 (91.4%): 100%|█████████| 35/35 [00:07<00:00,  4.84it/s]

2025/05/18 13:50:05 INFO dspy.evaluate.evaluate: Average Metric: 32 / 35 (91.4%)
2025/05/18 13:50:05 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 91.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 1: Instruction 3', 'Predictor 2: Instruction 1', 'Predictor 3: Instruction 5'].
2025/05/18 13:50:05 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43]
2025/05/18 13:50:05 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83]
2025/05/18 13:50:05 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:50:05 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 37 / 70 - Full Evaluation =====
2025/05/18 13:50:05 INFO dspy.teleprompt.mipro_optimizer_v2: Doi


Average Metric: 19.00 / 20 (95.0%):   8%|▋       | 19/240 [00:03<00:33,  6.59it/s]



Average Metric: 48.00 / 53 (90.6%):  22%|█▋      | 52/240 [00:07<00:37,  5.02it/s]



Average Metric: 100.00 / 112 (89.3%):  46%|██▎  | 111/240 [00:16<00:19,  6.53it/s]



Average Metric: 216.00 / 240 (90.0%): 100%|█████| 240/240 [00:36<00:00,  6.57it/s]

2025/05/18 13:50:42 INFO dspy.evaluate.evaluate: Average Metric: 216 / 240 (90.0%)
2025/05/18 13:50:42 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0]
2025/05/18 13:50:42 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67
2025/05/18 13:50:42 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/05/18 13:50:42 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 38 / 70 - Minibatch ==



Average Metric: 2.00 / 2 (100.0%):   6%|▋          | 2/35 [00:02<00:38,  1.18s/it]



Average Metric: 28.00 / 33 (84.8%):  94%|████████▍| 33/35 [00:07<00:00,  4.22it/s]



Average Metric: 30.00 / 35 (85.7%): 100%|█████████| 35/35 [00:10<00:00,  3.26it/s]

2025/05/18 13:50:53 INFO dspy.evaluate.evaluate: Average Metric: 30 / 35 (85.7%)
2025/05/18 13:50:53 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 85.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 3', 'Predictor 1: Instruction 0', 'Predictor 2: Instruction 2', 'Predictor 3: Instruction 5'].
2025/05/18 13:50:53 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71]
2025/05/18 13:50:53 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0]
2025/05/18 13:50:53 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:50:53 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 39 / 70 - Minibatch ==



Average Metric: 33.00 / 35 (94.3%): 100%|█████████| 35/35 [00:10<00:00,  3.41it/s]

2025/05/18 13:51:03 INFO dspy.evaluate.evaluate: Average Metric: 33 / 35 (94.3%)
2025/05/18 13:51:03 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 94.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 1: Instruction 1', 'Predictor 2: Instruction 5', 'Predictor 3: Instruction 1'].
2025/05/18 13:51:03 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71, 94.29]
2025/05/18 13:51:03 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0]
2025/05/18 13:51:03 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:51:03 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 40 / 70 - Minibatch ==



Average Metric: 33.00 / 35 (94.3%): 100%|█████████| 35/35 [00:09<00:00,  3.75it/s]

2025/05/18 13:51:13 INFO dspy.evaluate.evaluate: Average Metric: 33 / 35 (94.3%)
2025/05/18 13:51:13 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 94.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 1: Instruction 0', 'Predictor 2: Instruction 0', 'Predictor 3: Instruction 1'].
2025/05/18 13:51:13 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71, 94.29, 94.29]
2025/05/18 13:51:13 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0]
2025/05/18 13:51:13 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:51:13 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 41 / 70 - Minibatch ==



Average Metric: 30.00 / 35 (85.7%): 100%|█████████| 35/35 [00:06<00:00,  5.33it/s]

2025/05/18 13:51:19 INFO dspy.evaluate.evaluate: Average Metric: 30 / 35 (85.7%)
2025/05/18 13:51:19 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 85.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 1: Instruction 1', 'Predictor 2: Instruction 1', 'Predictor 3: Instruction 2'].
2025/05/18 13:51:19 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71, 94.29, 94.29, 85.71]
2025/05/18 13:51:19 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0]
2025/05/18 13:51:19 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:51:19 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 42 / 70 - Minibatch ==



Average Metric: 5.00 / 5 (100.0%):  11%|█▎         | 4/35 [00:02<00:17,  1.78it/s]



Average Metric: 7.00 / 7 (100.0%):  17%|█▉         | 6/35 [00:03<00:11,  2.55it/s]



Average Metric: 31.00 / 35 (88.6%): 100%|█████████| 35/35 [00:10<00:00,  3.43it/s]

2025/05/18 13:51:29 INFO dspy.evaluate.evaluate: Average Metric: 31 / 35 (88.6%)
2025/05/18 13:51:29 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 1: Instruction 5', 'Predictor 2: Instruction 5', 'Predictor 3: Instruction 1'].
2025/05/18 13:51:29 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71, 94.29, 94.29, 85.71, 88.57]
2025/05/18 13:51:29 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0]
2025/05/18 13:51:29 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:51:29 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 43 / 70 - Full Evaluation =====
2025/05/18 13:51:29 INF


Average Metric: 17.00 / 19 (89.5%):   8%|▌       | 18/240 [00:03<00:42,  5.22it/s]



Average Metric: 50.00 / 53 (94.3%):  22%|█▊      | 53/240 [00:06<00:18, 10.26it/s]



Average Metric: 93.00 / 102 (91.2%):  42%|██▌   | 102/240 [00:12<00:14,  9.27it/s]



Average Metric: 183.00 / 205 (89.3%):  85%|████▎| 204/240 [00:26<00:05,  7.16it/s]



Average Metric: 218.00 / 240 (90.8%): 100%|█████| 240/240 [00:31<00:00,  7.53it/s]

2025/05/18 13:52:01 INFO dspy.evaluate.evaluate: Average Metric: 218 / 240 (90.8%)
2025/05/18 13:52:01 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83]
2025/05/18 13:52:01 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67
2025/05/18 13:52:01 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/05/18 13:52:01 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 44 / 70 - Minibatch ==



  0%|                                                      | 0/35 [00:00<?, ?it/s]



Average Metric: 32.00 / 35 (91.4%): 100%|█████████| 35/35 [00:11<00:00,  3.13it/s]

2025/05/18 13:52:13 INFO dspy.evaluate.evaluate: Average Metric: 32 / 35 (91.4%)
2025/05/18 13:52:13 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 91.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 1: Instruction 1', 'Predictor 2: Instruction 2', 'Predictor 3: Instruction 5'].
2025/05/18 13:52:13 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71, 94.29, 94.29, 85.71, 88.57, 91.43]
2025/05/18 13:52:13 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83]
2025/05/18 13:52:13 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:52:13 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 45 / 70 - Minibatch ==



Average Metric: 6.00 / 6 (100.0%):  14%|█▌         | 5/35 [00:03<00:12,  2.43it/s]



Average Metric: 32.00 / 35 (91.4%): 100%|█████████| 35/35 [00:11<00:00,  2.98it/s]

2025/05/18 13:52:24 INFO dspy.evaluate.evaluate: Average Metric: 32 / 35 (91.4%)
2025/05/18 13:52:24 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 91.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 1: Instruction 4', 'Predictor 2: Instruction 0', 'Predictor 3: Instruction 5'].
2025/05/18 13:52:24 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71, 94.29, 94.29, 85.71, 88.57, 91.43, 91.43]
2025/05/18 13:52:24 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83]
2025/05/18 13:52:24 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:52:24 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 46 / 70 - Minibatch ==



  0%|                                                      | 0/35 [00:00<?, ?it/s]

2025/05/18 13:52:24 ERROR dspy.utils.parallelizer: Error for Example({'question': 'Jim collects model cars, and he has 301 models total.  Jim has 4 times as many Buicks as Fords, and 3 more than twice the number of Fords than Chevys.  How many Buicks does Jim have?', 'thoughts': 'Let x represent the number of Chevys\nFords:3+2x\nBuicks:4(3+2x)=12+8x\nTotal:x+3+2x+12+8x=301\n11x+15=301\n11x=286\nx=<<26=26>>26\nBuicks:12+8(26)=220', 'answer': 220}) (input_keys={'question'}): Both structured output format and JSON mode failed. Please choose a model that supports `response_format` argument. Original error: Expected dict_keys(['reasoning', 'answer']) but got dict_keys(['reasoning']). Set `provide_traceback=True` for traceback.


Average Metric: 32.00 / 34 (94.1%): 100%|█████████| 35/35 [00:14<00:00,  2.37it/s]

2025/05/18 13:52:39 INFO dspy.evaluate.evaluate: Average Metric: 32.0 / 35 (91.4%)
2025/05/18 13:52:39 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 91.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 1: Instruction 4', 'Predictor 2: Instruction 5', 'Predictor 3: Instruction 0'].
2025/05/18 13:52:39 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71, 94.29, 94.29, 85.71, 88.57, 91.43, 91.43, 91.43]
2025/05/18 13:52:39 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83]
2025/05/18 13:52:39 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:52:39 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 47 / 70 - Minibatch ==



Average Metric: 31.00 / 35 (88.6%): 100%|█████████| 35/35 [00:09<00:00,  3.56it/s]

2025/05/18 13:52:49 INFO dspy.evaluate.evaluate: Average Metric: 31 / 35 (88.6%)
2025/05/18 13:52:49 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 1: Instruction 4', 'Predictor 2: Instruction 1', 'Predictor 3: Instruction 5'].
2025/05/18 13:52:49 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71, 94.29, 94.29, 85.71, 88.57, 91.43, 91.43, 91.43, 88.57]
2025/05/18 13:52:49 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83]
2025/05/18 13:52:49 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:52:49 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 48 / 70 - Minibatch ==



Average Metric: 31.00 / 35 (88.6%): 100%|█████████| 35/35 [00:07<00:00,  4.46it/s]

2025/05/18 13:52:57 INFO dspy.evaluate.evaluate: Average Metric: 31 / 35 (88.6%)
2025/05/18 13:52:57 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 1: Instruction 4', 'Predictor 2: Instruction 5', 'Predictor 3: Instruction 4'].
2025/05/18 13:52:57 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71, 94.29, 94.29, 85.71, 88.57, 91.43, 91.43, 91.43, 88.57, 88.57]
2025/05/18 13:52:57 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83]
2025/05/18 13:52:57 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:52:57 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 49 / 70 - Ful


Average Metric: 24.00 / 26 (92.3%):  11%|▊       | 26/240 [00:05<00:41,  5.11it/s]



Average Metric: 35.00 / 39 (89.7%):  16%|█▎      | 39/240 [00:09<00:57,  3.48it/s]



Average Metric: 57.00 / 62 (91.9%):  26%|██      | 62/240 [00:13<00:36,  4.82it/s]

2025/05/18 13:53:11 ERROR dspy.utils.parallelizer: Error for Example({'question': 'By the time Anne is two times as old as Emile, Emile will be six times as old as Maude. If Maude will be 8 years old, how old will Anne be?', 'thoughts': "If Maude's age is 8 by the time Anne's age is four times Emile's age, Emile will be six times as old as Maude, which totals 6*8 = 48 years.\nIf Emile's age is 48 years old by the time Anne's age is twice her number, Anne will be 2*48 = <<48*2=96>>96 years.", 'answer': 96}) (input_keys={'question'}): Both structured output format and JSON mode failed. Please choose a model that supports `response_format` argument. Original error: Expected dict_keys(['reasoning', 'answer']) but got dict_keys(['reasoning']). Set `provide_traceback=True` for traceback.


Average Metric: 76.00 / 82 (92.7%):  35%|██▊     | 83/240 [00:18<00:31,  4.95it/s]



Average Metric: 133.00 / 151 (88.1%):  63%|███▏ | 152/240 [00:38<00:17,  5.02it/s]



Average Metric: 148.00 / 168 (88.1%):  70%|███▌ | 169/240 [00:44<00:22,  3.19it/s]



Average Metric: 161.00 / 181 (89.0%):  76%|███▊ | 182/240 [00:47<00:12,  4.58it/s]



Average Metric: 216.00 / 239 (90.4%): 100%|█████| 240/240 [01:09<00:00,  3.46it/s]

2025/05/18 13:54:06 INFO dspy.evaluate.evaluate: Average Metric: 216.0 / 240 (90.0%)
2025/05/18 13:54:06 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83, 90.0]
2025/05/18 13:54:06 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67
2025/05/18 13:54:06 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/05/18 13:54:06 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 50 / 70 - Minibatch ==



Average Metric: 31.00 / 35 (88.6%): 100%|█████████| 35/35 [00:15<00:00,  2.23it/s]

2025/05/18 13:54:22 INFO dspy.evaluate.evaluate: Average Metric: 31 / 35 (88.6%)
2025/05/18 13:54:22 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 1: Instruction 5', 'Predictor 2: Instruction 5', 'Predictor 3: Instruction 4'].
2025/05/18 13:54:22 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71, 94.29, 94.29, 85.71, 88.57, 91.43, 91.43, 91.43, 88.57, 88.57, 88.57]
2025/05/18 13:54:22 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83, 90.0]
2025/05/18 13:54:22 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:54:22 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 51 


Average Metric: 2.00 / 2 (100.0%):   3%|▎          | 1/35 [00:04<02:18,  4.08s/it]



Average Metric: 34.00 / 35 (97.1%): 100%|█████████| 35/35 [00:16<00:00,  2.06it/s]

2025/05/18 13:54:39 INFO dspy.evaluate.evaluate: Average Metric: 34 / 35 (97.1%)
2025/05/18 13:54:39 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 97.14 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 1: Instruction 3', 'Predictor 2: Instruction 2', 'Predictor 3: Instruction 0'].
2025/05/18 13:54:39 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71, 94.29, 94.29, 85.71, 88.57, 91.43, 91.43, 91.43, 88.57, 88.57, 88.57, 97.14]
2025/05/18 13:54:39 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83, 90.0]
2025/05/18 13:54:39 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:54:39 INFO dspy.teleprompt.mipro_optimizer_v2: == Tr


Average Metric: 30.00 / 34 (88.2%):  97%|████████▋| 34/35 [00:10<00:00,  1.36it/s]



Average Metric: 30.00 / 35 (85.7%): 100%|█████████| 35/35 [00:19<00:00,  1.81it/s]

2025/05/18 13:54:59 INFO dspy.evaluate.evaluate: Average Metric: 30 / 35 (85.7%)
2025/05/18 13:54:59 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 85.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 1: Instruction 3', 'Predictor 2: Instruction 1', 'Predictor 3: Instruction 0'].
2025/05/18 13:54:59 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71, 94.29, 94.29, 85.71, 88.57, 91.43, 91.43, 91.43, 88.57, 88.57, 88.57, 97.14, 85.71]
2025/05/18 13:54:59 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83, 90.0]
2025/05/18 13:54:59 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:54:59 INFO dspy.teleprompt.mipro_optimizer_v2


Average Metric: 7.00 / 7 (100.0%):  20%|██▏        | 7/35 [00:04<00:12,  2.17it/s]



Average Metric: 30.00 / 35 (85.7%): 100%|█████████| 35/35 [00:14<00:00,  2.42it/s]

2025/05/18 13:55:13 INFO dspy.evaluate.evaluate: Average Metric: 30 / 35 (85.7%)
2025/05/18 13:55:13 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 85.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 1: Instruction 3', 'Predictor 2: Instruction 2', 'Predictor 3: Instruction 3'].
2025/05/18 13:55:13 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71, 94.29, 94.29, 85.71, 88.57, 91.43, 91.43, 91.43, 88.57, 88.57, 88.57, 97.14, 85.71, 85.71]
2025/05/18 13:55:13 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83, 90.0]
2025/05/18 13:55:13 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:55:13 INFO dspy.teleprompt.mipro_optim


Average Metric: 30.00 / 35 (85.7%): 100%|█████████| 35/35 [00:12<00:00,  2.77it/s]

2025/05/18 13:55:26 INFO dspy.evaluate.evaluate: Average Metric: 30 / 35 (85.7%)
2025/05/18 13:55:26 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 85.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 1: Instruction 3', 'Predictor 2: Instruction 2', 'Predictor 3: Instruction 0'].
2025/05/18 13:55:26 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71, 94.29, 94.29, 85.71, 88.57, 91.43, 91.43, 91.43, 88.57, 88.57, 88.57, 97.14, 85.71, 85.71, 85.71]
2025/05/18 13:55:26 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83, 90.0]
2025/05/18 13:55:26 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:55:26 INFO dspy.teleprompt.mipr


Average Metric: 19.00 / 19 (100.0%):   8%|▌      | 18/240 [00:03<00:36,  6.11it/s]



Average Metric: 21.00 / 21 (100.0%):   8%|▌      | 20/240 [00:03<00:36,  6.11it/s]

2025/05/18 13:55:30 ERROR dspy.utils.parallelizer: Error for Example({'question': 'An aquarium holds an equal number of clownfish and blowfish. 26 of the blowfish stay in their own tank, and the remaining blowfish swim into a display tank.  An equal number of clownfish join the blowfish in the display tank, but then a third of these clownfish swim back into their own tank.  If the aquarium holds a combined total of 100 fish, how many clownfish are now in the display tank?', 'thoughts': 'There is an equal number of clownfish and blowfish, which means there are 100 fish / 2 = <<100/2=50>>50 blowfish.\nSince some stay in their tanks, there are 50 blowfish – 26 blowfish = <<50-26=24>>24 blowfish in the display tank.\nThere were the same amount of clownfish until 24 clownfish / 3 = <<24/3=8>>8 clownfish swam back to their own tank.\nThis leaves 24 clownfish – 8 clownfish = <<24-8=16>>16 clownfish in the display tank.', 'answer': 16}) (input_keys={'question'}): Both structured output format 

Average Metric: 72.00 / 75 (96.0%):  31%|██▌     | 75/240 [00:11<00:21,  7.82it/s]



Average Metric: 148.00 / 164 (90.2%):  69%|███▍ | 165/240 [00:29<00:17,  4.36it/s]



Average Metric: 180.00 / 197 (91.4%):  82%|████▏| 198/240 [00:35<00:07,  5.40it/s]



Average Metric: 218.00 / 239 (91.2%): 100%|█████| 240/240 [00:46<00:00,  5.20it/s]

2025/05/18 13:56:12 INFO dspy.evaluate.evaluate: Average Metric: 218.0 / 240 (90.8%)
2025/05/18 13:56:12 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83, 90.0, 90.83]
2025/05/18 13:56:12 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67
2025/05/18 13:56:12 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/05/18 13:56:12 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 56 / 70 - Minibatch ==



Average Metric: 4.00 / 4 (100.0%):  11%|█▎         | 4/35 [00:01<00:14,  2.09it/s]



Average Metric: 10.00 / 10 (100.0%):  29%|██▎     | 10/35 [00:02<00:03,  7.37it/s]



Average Metric: 11.00 / 11 (100.0%):  29%|██▎     | 10/35 [00:02<00:03,  7.37it/s]

2025/05/18 13:56:15 ERROR dspy.utils.parallelizer: Error for Example({'question': 'Jim collects model cars, and he has 301 models total.  Jim has 4 times as many Buicks as Fords, and 3 more than twice the number of Fords than Chevys.  How many Buicks does Jim have?', 'thoughts': 'Let x represent the number of Chevys\nFords:3+2x\nBuicks:4(3+2x)=12+8x\nTotal:x+3+2x+12+8x=301\n11x+15=301\n11x=286\nx=<<26=26>>26\nBuicks:12+8(26)=220', 'answer': 220}) (input_keys={'question'}): Both structured output format and JSON mode failed. Please choose a model that supports `response_format` argument. Original error: Expected dict_keys(['reasoning', 'answer']) but got dict_keys(['reasoning']). Set `provide_traceback=True` for traceback.


Average Metric: 32.00 / 33 (97.0%):  97%|████████▋| 34/35 [00:05<00:00,  6.60it/s]



Average Metric: 33.00 / 34 (97.1%): 100%|█████████| 35/35 [00:15<00:00,  2.24it/s]

2025/05/18 13:56:28 INFO dspy.evaluate.evaluate: Average Metric: 33.0 / 35 (94.3%)
2025/05/18 13:56:28 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 94.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 1: Instruction 0', 'Predictor 2: Instruction 1', 'Predictor 3: Instruction 0'].
2025/05/18 13:56:28 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71, 94.29, 94.29, 85.71, 88.57, 91.43, 91.43, 91.43, 88.57, 88.57, 88.57, 97.14, 85.71, 85.71, 85.71, 94.29]
2025/05/18 13:56:28 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83, 90.0, 90.83]
2025/05/18 13:56:28 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:56:28 INFO dspy


Average Metric: 6.00 / 8 (75.0%):  20%|██▍         | 7/35 [00:03<00:05,  4.80it/s]



Average Metric: 31.00 / 35 (88.6%): 100%|█████████| 35/35 [00:09<00:00,  3.62it/s]

2025/05/18 13:56:37 INFO dspy.evaluate.evaluate: Average Metric: 31 / 35 (88.6%)
2025/05/18 13:56:37 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 1: Instruction 0', 'Predictor 2: Instruction 2', 'Predictor 3: Instruction 4'].
2025/05/18 13:56:37 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71, 94.29, 94.29, 85.71, 88.57, 91.43, 91.43, 91.43, 88.57, 88.57, 88.57, 97.14, 85.71, 85.71, 85.71, 94.29, 88.57]
2025/05/18 13:56:37 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83, 90.0, 90.83]
2025/05/18 13:56:37 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:56:37 INFO


Average Metric: 32.00 / 35 (91.4%): 100%|█████████| 35/35 [00:16<00:00,  2.15it/s]

2025/05/18 13:56:54 INFO dspy.evaluate.evaluate: Average Metric: 32 / 35 (91.4%)
2025/05/18 13:56:54 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 91.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 4', 'Predictor 1: Instruction 5', 'Predictor 2: Instruction 0', 'Predictor 3: Instruction 5'].
2025/05/18 13:56:54 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71, 94.29, 94.29, 85.71, 88.57, 91.43, 91.43, 91.43, 88.57, 88.57, 88.57, 97.14, 85.71, 85.71, 85.71, 94.29, 88.57, 91.43]
2025/05/18 13:56:54 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83, 90.0, 90.83]
2025/05/18 13:56:54 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18 13:56:


Average Metric: 30.00 / 35 (85.7%): 100%|█████████| 35/35 [00:10<00:00,  3.44it/s]

2025/05/18 13:57:04 INFO dspy.evaluate.evaluate: Average Metric: 30 / 35 (85.7%)
2025/05/18 13:57:04 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 85.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 1: Instruction 1', 'Predictor 2: Instruction 5', 'Predictor 3: Instruction 2'].
2025/05/18 13:57:04 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71, 94.29, 94.29, 85.71, 88.57, 91.43, 91.43, 91.43, 88.57, 88.57, 88.57, 97.14, 85.71, 85.71, 85.71, 94.29, 88.57, 91.43, 85.71]
2025/05/18 13:57:04 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83, 90.0, 90.83]
2025/05/18 13:57:04 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2025/05/18






Average Metric: 5.00 / 5 (100.0%):  11%|█▏        | 4/35 [00:00<00:00, 210.23it/s]

2025/05/18 13:57:04 ERROR dspy.utils.parallelizer: Error for Example({'question': 'Jim collects model cars, and he has 301 models total.  Jim has 4 times as many Buicks as Fords, and 3 more than twice the number of Fords than Chevys.  How many Buicks does Jim have?', 'thoughts': 'Let x represent the number of Chevys\nFords:3+2x\nBuicks:4(3+2x)=12+8x\nTotal:x+3+2x+12+8x=301\n11x+15=301\n11x=286\nx=<<26=26>>26\nBuicks:12+8(26)=220', 'answer': 220}) (input_keys={'question'}): Both structured output format and JSON mode failed. Please choose a model that supports `response_format` argument. Original error: Expected dict_keys(['reasoning', 'answer']) but got dict_keys(['reasoning']). Set `provide_traceback=True` for traceback.


Average Metric: 30.00 / 34 (88.2%): 100%|█████████| 35/35 [00:06<00:00,  5.45it/s]

2025/05/18 13:57:10 INFO dspy.evaluate.evaluate: Average Metric: 30.0 / 35 (85.7%)
2025/05/18 13:57:10 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 85.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 1: Instruction 2', 'Predictor 2: Instruction 2', 'Predictor 3: Instruction 0'].
2025/05/18 13:57:10 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71, 94.29, 94.29, 85.71, 88.57, 91.43, 91.43, 91.43, 88.57, 88.57, 88.57, 97.14, 85.71, 85.71, 85.71, 94.29, 88.57, 91.43, 85.71, 85.71]
2025/05/18 13:57:10 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83, 90.0, 90.83]
2025/05/18 13:57:10 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 91.67


2


Average Metric: 49.00 / 53 (92.5%):  22%|█▊      | 53/240 [00:05<00:15, 12.20it/s]



Average Metric: 50.00 / 54 (92.6%):  22%|█▊      | 53/240 [00:05<00:15, 12.20it/s]

2025/05/18 13:57:16 ERROR dspy.utils.parallelizer: Error for Example({'question': 'Jim collects model cars, and he has 301 models total.  Jim has 4 times as many Buicks as Fords, and 3 more than twice the number of Fords than Chevys.  How many Buicks does Jim have?', 'thoughts': 'Let x represent the number of Chevys\nFords:3+2x\nBuicks:4(3+2x)=12+8x\nTotal:x+3+2x+12+8x=301\n11x+15=301\n11x=286\nx=<<26=26>>26\nBuicks:12+8(26)=220', 'answer': 220}) (input_keys={'question'}): Both structured output format and JSON mode failed. Please choose a model that supports `response_format` argument. Original error: Expected dict_keys(['reasoning', 'answer']) but got dict_keys(['reasoning']). Set `provide_traceback=True` for traceback.


Average Metric: 96.00 / 102 (94.1%):  42%|██▌   | 102/240 [00:11<00:20,  6.84it/s]



Average Metric: 221.00 / 239 (92.5%): 100%|█████| 240/240 [00:28<00:00,  8.38it/s]

2025/05/18 13:57:39 INFO dspy.evaluate.evaluate: Average Metric: 221.0 / 240 (92.1%)
2025/05/18 13:57:39 INFO dspy.teleprompt.mipro_optimizer_v2: [92mNew best full eval score![0m Score: 92.08
2025/05/18 13:57:39 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83, 90.0, 90.83, 92.08]
2025/05/18 13:57:39 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 92.08
2025/05/18 13:57:39 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/05/18 13:57:39 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 62 / 70 - Minibatch ==



Average Metric: 1.00 / 1 (100.0%):   3%|▎          | 1/35 [00:02<01:29,  2.63s/it]



Average Metric: 14.00 / 15 (93.3%):  43%|███▊     | 15/35 [00:05<00:03,  5.81it/s]



Average Metric: 33.00 / 35 (94.3%): 100%|█████████| 35/35 [00:10<00:00,  3.35it/s]

2025/05/18 13:57:50 INFO dspy.evaluate.evaluate: Average Metric: 33 / 35 (94.3%)
2025/05/18 13:57:50 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 94.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 1: Instruction 4', 'Predictor 2: Instruction 4', 'Predictor 3: Instruction 5'].
2025/05/18 13:57:50 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71, 94.29, 94.29, 85.71, 88.57, 91.43, 91.43, 91.43, 88.57, 88.57, 88.57, 97.14, 85.71, 85.71, 85.71, 94.29, 88.57, 91.43, 85.71, 85.71, 94.29]
2025/05/18 13:57:50 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83, 90.0, 90.83, 92.08]
2025/05/18 13:57:50 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so fa


Average Metric: 30.00 / 35 (85.7%): 100%|█████████| 35/35 [00:11<00:00,  3.06it/s]

2025/05/18 13:58:01 INFO dspy.evaluate.evaluate: Average Metric: 30 / 35 (85.7%)
2025/05/18 13:58:01 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 85.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 1: Instruction 0', 'Predictor 2: Instruction 3', 'Predictor 3: Instruction 0'].
2025/05/18 13:58:01 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71, 94.29, 94.29, 85.71, 88.57, 91.43, 91.43, 91.43, 88.57, 88.57, 88.57, 97.14, 85.71, 85.71, 85.71, 94.29, 88.57, 91.43, 85.71, 85.71, 94.29, 85.71]
2025/05/18 13:58:01 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83, 90.0, 90.83, 92.08]
2025/05/18 13:58:01 INFO dspy.teleprompt.mipro_optimizer_v2: Best full scor


Average Metric: 32.00 / 35 (91.4%): 100%|█████████| 35/35 [00:12<00:00,  2.87it/s]

2025/05/18 13:58:13 INFO dspy.evaluate.evaluate: Average Metric: 32 / 35 (91.4%)
2025/05/18 13:58:13 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 91.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 1: Instruction 5', 'Predictor 2: Instruction 1', 'Predictor 3: Instruction 5'].
2025/05/18 13:58:13 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71, 94.29, 94.29, 85.71, 88.57, 91.43, 91.43, 91.43, 88.57, 88.57, 88.57, 97.14, 85.71, 85.71, 85.71, 94.29, 88.57, 91.43, 85.71, 85.71, 94.29, 85.71, 91.43]
2025/05/18 13:58:13 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83, 90.0, 90.83, 92.08]
2025/05/18 13:58:13 INFO dspy.teleprompt.mipro_optimizer_v2: Best fu


Average Metric: 27.00 / 35 (77.1%): 100%|█████████| 35/35 [00:07<00:00,  4.91it/s]

2025/05/18 13:58:21 INFO dspy.evaluate.evaluate: Average Metric: 27 / 35 (77.1%)
2025/05/18 13:58:21 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 77.14 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 1: Instruction 0', 'Predictor 2: Instruction 3', 'Predictor 3: Instruction 5'].
2025/05/18 13:58:21 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71, 94.29, 94.29, 85.71, 88.57, 91.43, 91.43, 91.43, 88.57, 88.57, 88.57, 97.14, 85.71, 85.71, 85.71, 94.29, 88.57, 91.43, 85.71, 85.71, 94.29, 85.71, 91.43, 77.14]
2025/05/18 13:58:21 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83, 90.0, 90.83, 92.08]
2025/05/18 13:58:21 INFO dspy.teleprompt.mipro_optimizer_v2: 


Average Metric: 29.00 / 35 (82.9%): 100%|█████████| 35/35 [00:11<00:00,  3.14it/s]

2025/05/18 13:58:32 INFO dspy.evaluate.evaluate: Average Metric: 29 / 35 (82.9%)
2025/05/18 13:58:32 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 82.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 1: Instruction 4', 'Predictor 2: Instruction 5', 'Predictor 3: Instruction 5'].
2025/05/18 13:58:32 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71, 94.29, 94.29, 85.71, 88.57, 91.43, 91.43, 91.43, 88.57, 88.57, 88.57, 97.14, 85.71, 85.71, 85.71, 94.29, 88.57, 91.43, 85.71, 85.71, 94.29, 85.71, 91.43, 77.14, 82.86]
2025/05/18 13:58:32 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83, 90.0, 90.83, 92.08]
2025/05/18 13:58:32 INFO dspy.teleprompt.mipro_optimiz


Average Metric: 18.00 / 19 (94.7%):   8%|▋       | 19/240 [00:04<00:46,  4.79it/s]



Average Metric: 19.00 / 20 (95.0%):   8%|▋       | 19/240 [00:04<00:46,  4.79it/s]

2025/05/18 13:58:36 ERROR dspy.utils.parallelizer: Error for Example({'question': 'An aquarium holds an equal number of clownfish and blowfish. 26 of the blowfish stay in their own tank, and the remaining blowfish swim into a display tank.  An equal number of clownfish join the blowfish in the display tank, but then a third of these clownfish swim back into their own tank.  If the aquarium holds a combined total of 100 fish, how many clownfish are now in the display tank?', 'thoughts': 'There is an equal number of clownfish and blowfish, which means there are 100 fish / 2 = <<100/2=50>>50 blowfish.\nSince some stay in their tanks, there are 50 blowfish – 26 blowfish = <<50-26=24>>24 blowfish in the display tank.\nThere were the same amount of clownfish until 24 clownfish / 3 = <<24/3=8>>8 clownfish swam back to their own tank.\nThis leaves 24 clownfish – 8 clownfish = <<24-8=16>>16 clownfish in the display tank.', 'answer': 16}) (input_keys={'question'}): Both structured output format 

Average Metric: 97.00 / 101 (96.0%):  42%|██▌   | 102/240 [00:16<00:11, 12.53it/s]



Average Metric: 218.00 / 239 (91.2%): 100%|█████| 240/240 [00:35<00:00,  6.68it/s]

2025/05/18 13:59:08 INFO dspy.evaluate.evaluate: Average Metric: 218.0 / 240 (90.8%)
2025/05/18 13:59:08 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83, 90.0, 90.83, 92.08, 90.83]
2025/05/18 13:59:08 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 92.08
2025/05/18 13:59:08 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/05/18 13:59:08 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 68 / 70 - Minibatch ==



Average Metric: 18.00 / 21 (85.7%):  60%|█████▍   | 21/35 [00:04<00:01,  7.50it/s]



Average Metric: 30.00 / 35 (85.7%): 100%|█████████| 35/35 [00:07<00:00,  4.47it/s]

2025/05/18 13:59:16 INFO dspy.evaluate.evaluate: Average Metric: 30 / 35 (85.7%)
2025/05/18 13:59:16 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 85.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 1: Instruction 0', 'Predictor 2: Instruction 2', 'Predictor 3: Instruction 2'].
2025/05/18 13:59:16 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71, 94.29, 94.29, 85.71, 88.57, 91.43, 91.43, 91.43, 88.57, 88.57, 88.57, 97.14, 85.71, 85.71, 85.71, 94.29, 88.57, 91.43, 85.71, 85.71, 94.29, 85.71, 91.43, 77.14, 82.86, 85.71]
2025/05/18 13:59:16 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83, 90.0, 90.83, 92.08, 90.83]
2025/05/18 13:59:16 INFO dspy.teleprompt


Average Metric: 32.00 / 35 (91.4%): 100%|█████████| 35/35 [00:06<00:00,  5.22it/s]

2025/05/18 13:59:22 INFO dspy.evaluate.evaluate: Average Metric: 32 / 35 (91.4%)
2025/05/18 13:59:22 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 91.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 5', 'Predictor 1: Instruction 0', 'Predictor 2: Instruction 5', 'Predictor 3: Instruction 0'].
2025/05/18 13:59:22 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [82.86, 82.86, 91.43, 88.57, 80.0, 80.0, 91.43, 91.43, 88.57, 97.14, 88.57, 88.57, 91.43, 91.43, 91.43, 100.0, 94.29, 85.71, 85.71, 94.29, 97.14, 85.71, 91.43, 80.0, 94.29, 100.0, 94.29, 88.57, 88.57, 91.43, 85.71, 94.29, 94.29, 85.71, 88.57, 91.43, 91.43, 91.43, 88.57, 88.57, 88.57, 97.14, 85.71, 85.71, 85.71, 94.29, 88.57, 91.43, 85.71, 85.71, 94.29, 85.71, 91.43, 77.14, 82.86, 85.71, 91.43]
2025/05/18 13:59:22 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83, 90.0, 90.83, 92.08, 90.83]
2025/05/18 13:59:22 INFO dspy.tel


Average Metric: 15.00 / 18 (83.3%):   7%|▌       | 17/240 [00:02<00:33,  6.75it/s]



Average Metric: 18.00 / 21 (85.7%):   8%|▋       | 20/240 [00:02<00:32,  6.75it/s]

2025/05/18 13:59:25 ERROR dspy.utils.parallelizer: Error for Example({'question': 'An aquarium holds an equal number of clownfish and blowfish. 26 of the blowfish stay in their own tank, and the remaining blowfish swim into a display tank.  An equal number of clownfish join the blowfish in the display tank, but then a third of these clownfish swim back into their own tank.  If the aquarium holds a combined total of 100 fish, how many clownfish are now in the display tank?', 'thoughts': 'There is an equal number of clownfish and blowfish, which means there are 100 fish / 2 = <<100/2=50>>50 blowfish.\nSince some stay in their tanks, there are 50 blowfish – 26 blowfish = <<50-26=24>>24 blowfish in the display tank.\nThere were the same amount of clownfish until 24 clownfish / 3 = <<24/3=8>>8 clownfish swam back to their own tank.\nThis leaves 24 clownfish – 8 clownfish = <<24-8=16>>16 clownfish in the display tank.', 'answer': 16}) (input_keys={'question'}): Both structured output format 

Average Metric: 71.00 / 75 (94.7%):  31%|██▌     | 75/240 [00:05<00:06, 26.47it/s]



Average Metric: 75.00 / 80 (93.8%):  34%|██▋     | 81/240 [00:06<00:12, 12.61it/s]



Average Metric: 131.00 / 147 (89.1%):  61%|███  | 147/240 [00:12<00:08, 11.19it/s]



Average Metric: 174.00 / 193 (90.2%):  80%|████ | 193/240 [00:15<00:03, 14.31it/s]



Average Metric: 216.00 / 239 (90.4%): 100%|█████| 240/240 [00:20<00:00, 11.55it/s]

2025/05/18 13:59:43 INFO dspy.evaluate.evaluate: Average Metric: 216.0 / 240 (90.0%)
2025/05/18 13:59:43 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [91.67, 91.25, 91.67, 90.0, 90.0, 90.83, 90.0, 90.83, 90.0, 90.83, 92.08, 90.83, 90.0]
2025/05/18 13:59:43 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 92.08
2025/05/18 13:59:43 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/05/18 13:59:43 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 92.08!





In [40]:
evaluate(optimized_rag)

Average Metric: 36.00 / 37 (97.3%):  18%|█▍      | 37/200 [00:05<00:25,  6.33it/s]



Average Metric: 71.00 / 75 (94.7%):  38%|███     | 75/200 [00:08<00:10, 12.30it/s]



Average Metric: 72.00 / 76 (94.7%):  38%|███     | 75/200 [00:08<00:10, 12.30it/s]

2025/05/18 13:59:52 ERROR dspy.utils.parallelizer: Error for Example({'question': "Sam works at the Widget Factory, assembling Widgets. He can assemble 1 widget every 10 minutes. Jack from the loading dock can help assemble widgets when he doesn't have anything else to do. When he helps, they put together 2 complete widgets every 15 minutes. Recently the factory hired Tony to help assemble widgets. Being new to the job, he doesn't work as fast as Sam or Jack. Yesterday Sam worked for 6 hours before he had to leave work early for a dentist appointment. Jack was able to help out for 4 hours before he had to go back to the loading dock to unload a new shipment of widget materials. Tony worked the entire 8-hour shift. At the end of the day, they had completed 68 widgets. How long does it take Tony to assemble a Widget, in minutes?", 'thoughts': 'Sam completes a widget every 10 minutes.\nWhen Jack helps they finish 2 in 15 minutes. Sam has finished 1 widget and has begun working on another 

Average Metric: 115.00 / 126 (91.3%):  64%|███▏ | 127/200 [00:13<00:06, 10.99it/s]



Average Metric: 175.00 / 198 (88.4%):  99%|████▉| 198/200 [00:22<00:00,  9.96it/s]



Average Metric: 175.00 / 199 (87.9%): 100%|█████| 200/200 [00:26<00:00,  7.55it/s]

2025/05/18 14:00:10 INFO dspy.evaluate.evaluate: Average Metric: 175.0 / 200 (87.5%)





Unnamed: 0,question,thoughts,example_answer,reasoning,pred_answer,match,answer
0,It costs $.10 to print one page. Jenny wants to print 7 copies of ...,Jenny needs to print 7*25= <<7*25=175>>175 pages. This will cost 1...,12.0,"First, calculate the cost to print 7 copies of a 25-page essay. Ea...",12.0,✔️ [True],
1,240 sixth-grade students were asked about the number of novels the...,1/6 * 240 students = <<1/6*240=40>>40 students. So 40 students rea...,16.0,"First, find the number of students in each category based on the g...",16.0,✔️ [True],


87.5

In [41]:
optimized_rag(**data[0].inputs())

Prediction(
    reasoning='Natalia sold 48 clips in April. In May, she sold half as many clips, which is 48 divided by 2, resulting in 24 clips. To find the total number of clips sold in April and May, add the clips sold in both months: 48 + 24 = 72. ==> Natalia sold 48 clips in April. In May, she sold half as many clips, which is 48 divided by 2, resulting in 24 clips. To find the total number of clips sold in April and May, add the clips sold in both months: 48 + 24 = 72. ==> Natalia sold 48 clips in April. In May, she sold half as many clips, which is 48 divided by 2, resulting in 24 clips. To find the total number of clips sold in April and May, add the clips sold in both months: 48 + 24 = 72. ==> Natalia sold 48 clips in April. In May, she sold half as many clips, which is 48 divided by 2, resulting in 24 clips. To find the total number of clips sold in April and May, add the clips sold in both months: 48 + 24 = 72. The calculation is correct, and the total is 72 clips.',
    answ

In [42]:
dspy.inspect_history(n=1)






[34m[2025-05-18T14:00:12.984661][0m

[31mSystem message:[0m

Your input fields are:
1. `question` (str)
2. `last_reasoning` (str)
3. `last_answer` (str)
Your output fields are:
1. `reasoning` (str)
2. `corrected_answer` (int)
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## question ## ]]
{question}

[[ ## last_reasoning ## ]]
{last_reasoning}

[[ ## last_answer ## ]]
{last_answer}

[[ ## reasoning ## ]]
{reasoning}

[[ ## corrected_answer ## ]]
{corrected_answer}        # note: the value you produce must be a single int value

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Given the fields `question`, `last_reasoning`, `last_answer`, produce the fields `corrected_answer`.


[31mUser message:[0m

[[ ## question ## ]]
Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?

[[ ## last_reason