# DSPy Prompt Optimization with Local LangWatch

* Notes
  * To launch LangWatch locally, edit the `compose.yaml` file:
    * Comment out `- "_JAVA_OPTIONS=-XX:UseSVE=0"`
    * Comment out the resources
  * The local version of LangWatch does not seem to work with Gemini 2.0 Flash.

## Preparing the LLM

In [1]:
import os
import dspy


OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')
llm = dspy.LM("openai/gpt-4.1-nano", api_key=OPENAI_API_KEY)
print("LLM test response:", llm("How much do you know about Robert Nishihara?"))

# the retrieval model
colbertv2_wiki17_abstracts = dspy.ColBERTv2(
    url="http://20.102.90.50:2017/wiki17_abstracts"
)
dspy.settings.configure(lm=llm, rm=colbertv2_wiki17_abstracts)



LLM test response: ['As of my knowledge cutoff in October 2023, Robert Nishihara is a researcher known for his work in the fields of machine learning, reinforcement learning, and distributed systems. He has contributed to the development of frameworks and algorithms that improve the scalability and efficiency of machine learning workflows. Nishihara has been involved with projects such as Ray, an open-source framework for distributed computing, which aims to simplify the development of scalable applications. He has also published research papers on topics related to reinforcement learning, parallel computing, and system design. If you have specific questions about his work or background, feel free to ask!']


## Preparing the Dataset for RAG

In [2]:
from dspy.datasets import HotPotQA


dataset = HotPotQA(train_seed=1, train_size=32, eval_seed=2025, dev_size=50, test_size=0)
trainset = [x.with_inputs('question') for x in dataset.train]
devset = [x.with_inputs('question') for x in dataset.dev]

print(len(trainset), len(devset))
print(trainset[0])
print(devset[0])

32 50
Example({'question': 'At My Window was released by which American singer-songwriter?', 'answer': 'John Townes Van Zandt'}) (input_keys={'question'})
Example({'question': 'Pehchaan: The Face of Truth stars Vinod Khanna, Rati Agnihotri and which Indian actress, producer, and former model who also produced the film?', 'answer': 'Raveena Tandon', 'gold_titles': {'Pehchaan: The Face of Truth', 'Raveena Tandon'}}) (input_keys={'question'})


## Define RAG

In [3]:
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""
    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")


class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)


dev_example = devset[12]
print(f"[Devset] Question: {dev_example.question}")
print(f"[Devset] Answer: {dev_example.answer}")
print(f"[Devset] Relevant Wikipedia Titles: {dev_example.gold_titles}")
print()

generate_answer = RAG()
pred = generate_answer(question=dev_example.question)
print(f"[Prediction] Question: {dev_example.question}")
print(f"[Prediction] Predicted Answer: {pred.answer}")

[Devset] Question: Twelve Inches is a compilation album by which 1980s British band?
[Devset] Answer: Frankie Goes to Hollywood
[Devset] Relevant Wikipedia Titles: {'Twelve Inches', 'Frankie Goes to Hollywood'}

[Prediction] Question: Twelve Inches is a compilation album by which 1980s British band?
[Prediction] Predicted Answer: Spandau Ballet
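
The unoptimized prediction above misses the gold answer. As a rough illustration of how such a miss gets scored, here is a minimal sketch of the two checks that the validation metric later in this notebook combines: exact match and passage match. It assumes a simplified SQuAD-style normalization and is not DSPy's actual implementation; `normalize`, `exact_match`, `passage_match`, and the sample passage are all hypothetical names and data introduced for illustration.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(gold: str, predicted: str) -> bool:
    """True if the two answers are identical after normalization."""
    return normalize(gold) == normalize(predicted)


def passage_match(gold: str, passages: list[str]) -> bool:
    """True if the normalized gold answer occurs in any retrieved passage."""
    target = normalize(gold)
    return any(target in normalize(p) for p in passages)


gold = "Frankie Goes to Hollywood"
predicted = "Spandau Ballet"
# Hypothetical passage, made up for illustration only (not real retriever output).
passages = ["Twelve Inches is a compilation album by Frankie Goes to Hollywood."]

em = exact_match(gold, predicted)
pm = passage_match(gold, passages)
print(em, pm, em and pm)
```

In this sketch the passage check passes but the exact-match check fails, so a combined `and` metric scores the example as a miss.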


In [10]:
import langwatch


langwatch.endpoint = "http://localhost:5560"
langwatch.login()

LangWatch API key is already set, if you want to login again, please call as langwatch.login(relogin=True)


## Start Training Session

* Notes
  * `langwatch.dspy.init()` worked when the endpoint was "https://app.langwatch.ai". With the local endpoint "http://localhost:5560", I had to call `langwatch.dspy.__init__()` instead.
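
One way to avoid hard-coding either spelling is to look the method up at runtime. The sketch below is only an assumption about how that could be done: `resolve_init` is a hypothetical helper, and the two stand-in classes exist so that the example runs without LangWatch installed. It has not been verified against either LangWatch endpoint.

```python
from typing import Any, Callable


def resolve_init(tracker: Any) -> Callable[..., Any]:
    """Return tracker.init when it exists, otherwise fall back to tracker.__init__."""
    init = getattr(tracker, "init", None)
    if callable(init):
        return init
    return tracker.__init__


# Stand-in objects (hypothetical) that mimic the two situations described above.
class WithInit:
    def init(self, **kwargs):
        return ("init", kwargs)


class WithoutInit:
    def __init__(self, **kwargs):
        self.kwargs = kwargs


print(resolve_init(WithInit())(experiment="demo"))
print(resolve_init(WithoutInit()).__name__)
```

With the real library, the call would presumably look like `resolve_init(langwatch.dspy)(experiment=..., optimizer=optimizer)`.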

In [None]:
import langwatch
print(dir(langwatch.dspy))  # try to find the init() function in langwatch.dspy

['__bool__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']


In [18]:
from dspy.teleprompt import MIPROv2
import dspy.evaluate


# Validation metric: the prediction must exactly match the gold answer,
# and the gold answer must also appear in the retrieved context passages.
def validate_context_and_answer(example, pred, trace=None):
    answer_EM = dspy.evaluate.answer_exact_match(example, pred)
    answer_PM = dspy.evaluate.answer_passage_match(example, pred)
    return answer_EM and answer_PM

# Set up a MIPROv2 optimizer, which will compile our RAG program.
optimizer = MIPROv2(metric=validate_context_and_answer, prompt_model=llm,
                    task_model=llm, num_candidates=2, init_temperature=0.7,
                    auto=None)

# Initialize langwatch for this run, to track the optimizer compilation
langwatch.dspy.__init__(experiment="hanhan_exp1", optimizer=optimizer)

# Compile
compiled_rag = optimizer.compile(
    RAG(),
    trainset=trainset,
    num_trials=10,
    max_bootstrapped_demos=3,
    max_labeled_demos=5,
    minibatch_size=10,
    requires_permission_to_run=False,
)

2025/05/19 16:27:40 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/05/19 16:27:40 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/05/19 16:27:40 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=2 sets of demonstrations...
2025/05/19 16:27:40 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/05/19 16:27:40 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.


Bootstrapping set 1/2
Bootstrapping set 2/2


2025/05/19 16:27:45 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing instructions...

2025/05/19 16:27:55 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/05/19 16:27:55 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Answer questions with short factoid answers.

2025/05/19 16:27:55 INFO dspy.teleprompt.mipro_optimizer_v2: 1: Given a context containing relevant facts and a specific question, analyze the information step-by-step to reason through the answer. Provide a concise, factual response that is typically between one and five words. Ensure your answer directly addresses the question using the context provided.

2025/05/19 16:27:55 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/05/19 16:27:55 INFO dspy.teleprompt.mipro_optimizer_v2: ==> STEP 3: FINDING OPTIMAL PROMPT PARAMETERS <==
2025/05/19 16:27:55 INFO dspy.teleprompt.mipro_optimizer_v2: We will evaluate the program over a series of trials with different combinations of instructions and few-

Average Metric: 6.00 / 25 (24.0%): 100%|██████████| 25/25 [00:08<00:00,  2.95it/s]

2025/05/19 16:28:04 INFO dspy.evaluate.evaluate: Average Metric: 6 / 25 (24.0%)
2025/05/19 16:28:04 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 24.0

2025/05/19 16:28:04 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 12 - Minibatch ==



Average Metric: 2.00 / 10 (20.0%): 100%|██████████| 10/10 [00:03<00:00,  2.52it/s]

2025/05/19 16:28:08 INFO dspy.evaluate.evaluate: Average Metric: 2 / 10 (20.0%)
2025/05/19 16:28:08 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 20.0 on minibatch of size 10 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 0'].
2025/05/19 16:28:08 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [20.0]
2025/05/19 16:28:08 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [24.0]
2025/05/19 16:28:08 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 24.0


2025/05/19 16:28:08 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 12 - Minibatch ==



Average Metric: 3.00 / 10 (30.0%): 100%|██████████| 10/10 [00:04<00:00,  2.27it/s]

2025/05/19 16:28:12 INFO dspy.evaluate.evaluate: Average Metric: 3 / 10 (30.0%)
2025/05/19 16:28:12 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 30.0 on minibatch of size 10 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 0'].
2025/05/19 16:28:12 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [20.0, 30.0]
2025/05/19 16:28:12 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [24.0]
2025/05/19 16:28:12 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 24.0


2025/05/19 16:28:12 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 12 - Minibatch ==



Average Metric: 3.00 / 10 (30.0%): 100%|██████████| 10/10 [00:04<00:00,  2.39it/s]

2025/05/19 16:28:16 INFO dspy.evaluate.evaluate: Average Metric: 3 / 10 (30.0%)
2025/05/19 16:28:16 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 30.0 on minibatch of size 10 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 1'].
2025/05/19 16:28:16 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [20.0, 30.0, 30.0]
2025/05/19 16:28:16 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [24.0]
2025/05/19 16:28:16 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 24.0


2025/05/19 16:28:16 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 12 - Minibatch ==



Average Metric: 3.00 / 10 (30.0%): 100%|██████████| 10/10 [00:00<00:00, 271.36it/s]

2025/05/19 16:28:17 INFO dspy.evaluate.evaluate: Average Metric: 3 / 10 (30.0%)
2025/05/19 16:28:17 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 30.0 on minibatch of size 10 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 0'].
2025/05/19 16:28:17 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [20.0, 30.0, 30.0, 30.0]
2025/05/19 16:28:17 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [24.0]
2025/05/19 16:28:17 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 24.0


2025/05/19 16:28:17 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 12 - Minibatch ==



Average Metric: 5.00 / 10 (50.0%): 100%|██████████| 10/10 [00:04<00:00,  2.32it/s]

2025/05/19 16:28:21 INFO dspy.evaluate.evaluate: Average Metric: 5 / 10 (50.0%)
2025/05/19 16:28:21 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 50.0 on minibatch of size 10 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 1'].
2025/05/19 16:28:21 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [20.0, 30.0, 30.0, 30.0, 50.0]
2025/05/19 16:28:21 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [24.0]
2025/05/19 16:28:21 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 24.0


2025/05/19 16:28:21 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 7 / 12 - Minibatch ==



Average Metric: 5.00 / 10 (50.0%): 100%|██████████| 10/10 [00:03<00:00,  3.26it/s]

2025/05/19 16:28:24 INFO dspy.evaluate.evaluate: Average Metric: 5 / 10 (50.0%)
2025/05/19 16:28:24 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 50.0 on minibatch of size 10 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 0'].
2025/05/19 16:28:24 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [20.0, 30.0, 30.0, 30.0, 50.0, 50.0]
2025/05/19 16:28:24 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [24.0]
2025/05/19 16:28:24 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 24.0


2025/05/19 16:28:24 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 8 / 12 - Minibatch ==



Average Metric: 1.00 / 10 (10.0%): 100%|██████████| 10/10 [00:00<00:00, 189.99it/s]

2025/05/19 16:28:24 INFO dspy.evaluate.evaluate: Average Metric: 1 / 10 (10.0%)
2025/05/19 16:28:24 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 10.0 on minibatch of size 10 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 0'].
2025/05/19 16:28:24 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [20.0, 30.0, 30.0, 30.0, 50.0, 50.0, 10.0]
2025/05/19 16:28:24 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [24.0]
2025/05/19 16:28:24 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 24.0


2025/05/19 16:28:24 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 9 / 12 - Minibatch ==



Average Metric: 4.00 / 10 (40.0%): 100%|██████████| 10/10 [00:05<00:00,  1.80it/s]

2025/05/19 16:28:30 INFO dspy.evaluate.evaluate: Average Metric: 4 / 10 (40.0%)
2025/05/19 16:28:30 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 40.0 on minibatch of size 10 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 1'].
2025/05/19 16:28:30 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [20.0, 30.0, 30.0, 30.0, 50.0, 50.0, 10.0, 40.0]
2025/05/19 16:28:30 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [24.0]
2025/05/19 16:28:30 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 24.0


2025/05/19 16:28:30 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 10 / 12 - Minibatch ==



Average Metric: 2.00 / 10 (20.0%): 100%|██████████| 10/10 [00:02<00:00,  3.35it/s]

2025/05/19 16:28:33 INFO dspy.evaluate.evaluate: Average Metric: 2 / 10 (20.0%)
2025/05/19 16:28:33 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 20.0 on minibatch of size 10 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 1'].
2025/05/19 16:28:33 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [20.0, 30.0, 30.0, 30.0, 50.0, 50.0, 10.0, 40.0, 20.0]
2025/05/19 16:28:33 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [24.0]
2025/05/19 16:28:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 24.0


2025/05/19 16:28:33 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 11 / 12 - Full Evaluation =====
2025/05/19 16:28:33 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 35.0) from minibatch trials...



Average Metric: 8.00 / 25 (32.0%): 100%|██████████| 25/25 [00:02<00:00,  9.25it/s]

2025/05/19 16:28:36 INFO dspy.evaluate.evaluate: Average Metric: 8 / 25 (32.0%)
2025/05/19 16:28:36 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 32.0
2025/05/19 16:28:36 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [24.0, 32.0]
2025/05/19 16:28:36 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 32.0
2025/05/19 16:28:36 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/05/19 16:28:36 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 32.0!
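
The "Avg Score: 35.0" reported before the full evaluation can be reproduced from the minibatch lines in the log. The sketch below groups the scores of trials 2 to 10 by (instruction, few-shot set) and averages them. It is a recomputation from the logged numbers only, not MIPROv2's own selection code, and the variable names are illustrative.

```python
from collections import defaultdict

# (instruction index, few-shot set index, minibatch score),
# copied from trials 2-10 in the log above.
minibatch_trials = [
    (1, 0, 20.0),
    (1, 0, 30.0),
    (1, 1, 30.0),
    (0, 0, 30.0),
    (0, 1, 50.0),
    (1, 0, 50.0),
    (0, 0, 10.0),
    (1, 1, 40.0),
    (0, 1, 20.0),
]

# Collect every minibatch score seen for each configuration.
scores = defaultdict(list)
for instruction, few_shot_set, score in minibatch_trials:
    scores[(instruction, few_shot_set)].append(score)

# Average per configuration, printed from best to worst.
averages = {config: sum(vals) / len(vals) for config, vals in scores.items()}
for config, avg in sorted(averages.items(), key=lambda kv: -kv[1]):
    print(f"instruction={config[0]} few_shot_set={config[1]} "
          f"avg={avg:.1f} n={len(scores[config])}")

best = max(averages.values())
```

Two configurations tie at an average of 35.0: Instruction 1 with Few-Shot Set 1, and Instruction 0 with Few-Shot Set 1. The instruction text in the `compiled_rag` output matches Instruction 1, which suggests the first of the two was the one fully evaluated. Each average rests on only two or three minibatches of 10 examples, so the ranking is noisy.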





In [19]:
compiled_rag

generate_answer.predict = Predict(StringSignature(context, question -> reasoning, answer
    instructions='Given a context containing relevant facts and a specific question, analyze the information step-by-step to reason through the answer. Provide a concise, factual response that is typically between one and five words. Ensure your answer directly addresses the question using the context provided.'
    context = Field(annotation=str required=True json_schema_extra={'desc': 'may contain relevant facts', '__dspy_field_type': 'input', 'prefix': 'Context:'})
    question = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Question:', 'desc': '${question}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    answer = Field(annotation=str required=True json_schema_extra={'desc': 'often between 1 and 5 words',