# LangWatch DSPy Visualizer

This notebook walks through a simple DSPy optimization run integrated with LangWatch, which visualizes the training process and helps with debugging.

[<img align="center" src="https://colab.research.google.com/assets/colab-badge.svg" />](https://colab.research.google.com/github/langwatch/langwatch/blob/main/python-sdk/examples/dspy_visualization.ipynb)

In [None]:
# Install langwatch along with dspy for the visualization
!pip install dspy langwatch

## Preparing the LLM

In [6]:
import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("Enter your OPENAI_API_KEY: ")

import dspy

llm = dspy.LM("openai/gpt-4o-mini", api_key=os.environ["OPENAI_API_KEY"])

print("LLM test response:", llm("hello there"))

colbertv2_wiki17_abstracts = dspy.ColBERTv2(
    url="http://20.102.90.50:2017/wiki17_abstracts"
)
dspy.settings.configure(lm=llm, rm=colbertv2_wiki17_abstracts)

LLM test response: ['Hello! How can I assist you today?']


## Preparing the Dataset

In [7]:
from dspy.datasets import HotPotQA

# Load the dataset.
dataset = HotPotQA(train_seed=1, train_size=32, eval_seed=2025, dev_size=50, test_size=0)

# Tell DSPy that the 'question' field is the input. Any other fields are labels and/or metadata.
trainset = [x.with_inputs('question') for x in dataset.train]
devset = [x.with_inputs('question') for x in dataset.dev]

len(trainset), len(devset)

(32, 50)
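To illustrate what marking `question` as the input means, here is a plain-Python stand-in that splits a record into inputs and labels by key. The `split_fields` helper is hypothetical and only mimics the idea; it is not how DSPy's `Example` class is implemented. The sample record reuses a devset question that appears later in this notebook.

```python
def split_fields(record: dict, input_keys: set) -> tuple:
    """Split a record into (inputs, labels) based on which keys are inputs."""
    inputs = {k: v for k, v in record.items() if k in input_keys}
    labels = {k: v for k, v in record.items() if k not in input_keys}
    return inputs, labels


# Hypothetical record shaped like a HotPotQA example (question + answer).
record = {
    "question": "Which magazine was released first, Fortune or Motor Trend?",
    "answer": "Motor Trend",
}

inputs, labels = split_fields(record, {"question"})
print("inputs:", inputs)  # what the program is allowed to see
print("labels:", labels)  # what the metric compares against
```

Only the input side is passed to the program during optimization; the remaining fields stay available to the metric.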

## Defining the model

In [8]:
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")


class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()

        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        context = self.retrieve(question).passages  # type: ignore
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)


dev_example = devset[18]
print(f"[Devset] Question: {dev_example.question}")
print(f"[Devset] Answer: {dev_example.answer}")
print(f"[Devset] Relevant Wikipedia Titles: {dev_example.gold_titles}")

generate_answer = RAG()

pred = generate_answer(question=dev_example.question)

# Print the input and the prediction.
print(f"[Prediction] Question: {dev_example.question}")
print(f"[Prediction] Predicted Answer: {pred.answer}")

[Devset] Question: Which magazine was released first, Fortune or Motor Trend?
[Devset] Answer: Motor Trend
[Devset] Relevant Wikipedia Titles: {'Fortune (magazine)', 'Motor Trend'}
[Prediction] Question: Which magazine was released first, Fortune or Motor Trend?
[Prediction] Predicted Answer: Fortune
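The unoptimized program answers "Fortune" while the gold answer is "Motor Trend". The training cell further down scores predictions with an exact-match check combined with a passage-match check. The sketch below is a simplified stand-in for that kind of scoring, assuming SQuAD-style normalization and a plain substring test; the helper names are hypothetical and DSPy's own functions may differ in detail. The passage text is a made-up placeholder, not real retriever output.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(predicted: str, gold: str) -> bool:
    """True when the normalized prediction equals the normalized gold answer."""
    return normalize(predicted) == normalize(gold)


def passage_match(passages: list, gold: str) -> bool:
    """True when the normalized gold answer occurs in at least one passage."""
    target = normalize(gold)
    return any(target in normalize(passage) for passage in passages)


# Gold and predicted answers come from the devset example above.
gold = "Motor Trend"
predicted = "Fortune"
# Placeholder passage (an assumption for illustration only).
passages = ["Placeholder passage mentioning Motor Trend magazine."]

em = exact_match(predicted, gold)
pm = passage_match(passages, gold)
print("exact match:", em)
print("passage match:", pm)
print("combined metric:", em and pm)
```

Under these assumptions the combined metric fails on this example: the passage check passes, but the exact-match check does not.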


## Log in to LangWatch

In [9]:
import langwatch

langwatch.login()

LangWatch API key is already set, if you want to login again, please call as langwatch.login(relogin=True)


## Start Training Session!

In [10]:
from dspy.teleprompt import MIPROv2
import dspy.evaluate

# Define the validation metric: the answer must match exactly and appear in a retrieved passage
def validate_context_and_answer(example, pred, trace=None):
    answer_EM = dspy.evaluate.answer_exact_match(example, pred)
    answer_PM = dspy.evaluate.answer_passage_match(example, pred)
    return answer_EM and answer_PM

# Set up a MIPROv2 optimizer, which will compile our RAG program.
optimizer = MIPROv2(
    metric=validate_context_and_answer,
    prompt_model=llm,
    task_model=llm,
    num_candidates=2,
    init_temperature=0.7,
)

# Initialize langwatch for this run, to track the optimizer compilation
langwatch.dspy.init(experiment="my-awesome-experiment", optimizer=optimizer)

# Compile
compiled_rag = optimizer.compile(
    RAG(),
    trainset=trainset,
    num_trials=10,
    max_bootstrapped_demos=3,
    max_labeled_demos=5,
)


[LangWatch] Experiment initialized, run_id: peculiar-marmot-of-renovation
[LangWatch] Open http://localhost:5560/inbox-narrator/experiments/my-awesome-experiment?runIds=peculiar-marmot-of-renovation to track your DSPy training session live



2025/03/26 22:15:14 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/03/26 22:15:14 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/03/26 22:15:14 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=2 sets of demonstrations...
2025/03/26 22:15:14 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/03/26 22:15:14 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.


Bootstrapping set 1/2
Bootstrapping set 2/2


2025/03/26 22:15:17 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing instructions...

2025/03/26 22:15:29 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/03/26 22:15:29 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Answer questions with short factoid answers.

2025/03/26 22:15:29 INFO dspy.teleprompt.mipro_optimizer_v2: 1: Given the context and the question, generate a reasoned answer that is concise and factually correct, ensuring it consists of no more than 5 words.

2025/03/26 22:15:29 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/03/26 22:15:29 INFO dspy.teleprompt.mipro_optimizer_v2: ==> STEP 3: FINDING OPTIMAL PROMPT PARAMETERS <==
2025/03/26 22:15:29 INFO dspy.teleprompt.mipro_optimizer_v2: We will evaluate the program over a series of trials with different combinations of instructions and few-shot examples to find the optimal combination using Bayesian Optimization.

2025/03/26 22:15:29 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 1

  0%|          | 0/25 [00:00<?, ?it/s]

2025/03/26 22:15:29 ERROR dspy.utils.parallelizer: Error for Example({'question': 'Which American actress who made their film debut in the 1995 teen drama "Kids" was the co-founder of Voto Latino?', 'answer': 'Rosario Dawson'}) (input_keys={'question'}): 'topk'. Set `provide_traceback=True` for traceback.
2025/03/26 22:15:29 ERROR dspy.utils.parallelizer: Error for Example({'question': 'Samantha Cristoforetti and Mark Shuttleworth are both best known for being first in their field to go where? ', 'answer': 'space'}) (input_keys={'question'}): 'topk'. Set `provide_traceback=True` for traceback.


Average Metric: 3.00 / 4 (75.0%):  24%|██▍       | 6/25 [00:01<00:03,  4.94it/s] 

2025/03/26 22:15:30 ERROR dspy.utils.parallelizer: Error for Example({'question': 'The Organisation that allows a community to influence their operation or use and to enjoy the benefits arisingwas founded in what year?', 'answer': '2010'}) (input_keys={'question'}): 'topk'. Set `provide_traceback=True` for traceback.


Average Metric: 3.00 / 4 (75.0%):  24%|██▍       | 6/25 [00:01<00:03,  4.94it/s]

2025/03/26 22:15:30 ERROR dspy.utils.parallelizer: Error for Example({'question': 'Who is older, Aleksandr Danilovich Aleksandrov or Anatoly Fomenko?', 'answer': 'Aleksandr Danilovich Aleksandrov'}) (input_keys={'question'}): 'topk'. Set `provide_traceback=True` for traceback.


Average Metric: 4.00 / 12 (33.3%):  60%|██████    | 15/25 [00:02<00:01,  6.47it/s]

2025/03/26 22:15:32 ERROR dspy.utils.parallelizer: Error for Example({'question': 'Which band had a longer hiatus, Juliette and the Licks or The Last Shadow Puppets?', 'answer': 'The Last Shadow Puppets'}) (input_keys={'question'}): 'topk'. Set `provide_traceback=True` for traceback.


Average Metric: 4.00 / 12 (33.3%):  68%|██████▊   | 17/25 [00:02<00:01,  7.50it/s]

2025/03/26 22:15:32 ERROR dspy.utils.parallelizer: Error for Example({'question': 'Who composed "Sunflower Slow Drag" with the King of Ragtime?', 'answer': 'Scott Hayden'}) (input_keys={'question'}): 'topk'. Set `provide_traceback=True` for traceback.


Average Metric: 6.00 / 19 (31.6%): 100%|██████████| 25/25 [00:04<00:00,  5.20it/s]

2025/03/26 22:15:34 INFO dspy.evaluate.evaluate: Average Metric: 6.0 / 25 (24.0%)





2025/03/26 22:15:34 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 24.0

2025/03/26 22:15:34 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 12 - Minibatch ==


Average Metric: 8.00 / 25 (32.0%): 100%|██████████| 25/25 [00:05<00:00,  4.89it/s]

2025/03/26 22:15:39 INFO dspy.evaluate.evaluate: Average Metric: 8 / 25 (32.0%)





2025/03/26 22:15:40 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 32.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 0'].
2025/03/26 22:15:40 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/03/26 22:15:40 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [24.0, 32.0]
2025/03/26 22:15:40 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 24.0


2025/03/26 22:15:40 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 12 - Minibatch ==


Average Metric: 7.00 / 25 (28.0%): 100%|██████████| 25/25 [00:05<00:00,  4.87it/s]

2025/03/26 22:15:45 INFO dspy.evaluate.evaluate: Average Metric: 7 / 25 (28.0%)





2025/03/26 22:15:45 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 28.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 0'].
2025/03/26 22:15:45 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/03/26 22:15:45 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [24.0, 32.0, 28.0]
2025/03/26 22:15:45 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 24.0


2025/03/26 22:15:45 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 12 - Minibatch ==


Average Metric: 10.00 / 25 (40.0%): 100%|██████████| 25/25 [00:06<00:00,  4.16it/s]

2025/03/26 22:15:51 INFO dspy.evaluate.evaluate: Average Metric: 10 / 25 (40.0%)





2025/03/26 22:15:52 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 40.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 1'].
2025/03/26 22:15:52 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/03/26 22:15:52 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [24.0, 32.0, 28.0, 40.0]
2025/03/26 22:15:52 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 24.0


2025/03/26 22:15:52 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 12 - Minibatch ==


Average Metric: 9.00 / 25 (36.0%): 100%|██████████| 25/25 [00:06<00:00,  4.01it/s]

2025/03/26 22:15:58 INFO dspy.evaluate.evaluate: Average Metric: 9 / 25 (36.0%)





2025/03/26 22:15:58 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 36.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 0'].
2025/03/26 22:15:58 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/03/26 22:15:58 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [24.0, 32.0, 28.0, 40.0, 36.0]
2025/03/26 22:15:58 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 24.0


2025/03/26 22:15:58 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 12 - Minibatch ==


Average Metric: 9.00 / 25 (36.0%): 100%|██████████| 25/25 [00:07<00:00,  3.39it/s]

2025/03/26 22:16:06 INFO dspy.evaluate.evaluate: Average Metric: 9 / 25 (36.0%)





2025/03/26 22:16:06 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 36.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 1'].
2025/03/26 22:16:06 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/03/26 22:16:06 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [24.0, 32.0, 28.0, 40.0, 36.0, 36.0]
2025/03/26 22:16:06 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 24.0


2025/03/26 22:16:06 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 7 / 12 - Minibatch ==


Average Metric: 8.00 / 25 (32.0%): 100%|██████████| 25/25 [00:08<00:00,  3.00it/s]

2025/03/26 22:16:14 INFO dspy.evaluate.evaluate: Average Metric: 8 / 25 (32.0%)





2025/03/26 22:16:15 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 32.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 0'].
2025/03/26 22:16:15 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/03/26 22:16:15 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [24.0, 32.0, 28.0, 40.0, 36.0, 36.0, 32.0]
2025/03/26 22:16:15 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 24.0


2025/03/26 22:16:15 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 8 / 12 - Minibatch ==


Average Metric: 10.00 / 25 (40.0%): 100%|██████████| 25/25 [00:10<00:00,  2.36it/s]

2025/03/26 22:16:25 INFO dspy.evaluate.evaluate: Average Metric: 10 / 25 (40.0%)





2025/03/26 22:16:26 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 40.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 0'].
2025/03/26 22:16:26 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/03/26 22:16:26 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [24.0, 32.0, 28.0, 40.0, 36.0, 36.0, 32.0, 40.0]
2025/03/26 22:16:26 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 24.0


2025/03/26 22:16:26 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 9 / 12 - Minibatch ==


Average Metric: 10.00 / 25 (40.0%): 100%|██████████| 25/25 [00:06<00:00,  4.07it/s]

2025/03/26 22:16:32 INFO dspy.evaluate.evaluate: Average Metric: 10 / 25 (40.0%)





2025/03/26 22:16:33 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 40.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 1'].
2025/03/26 22:16:33 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/03/26 22:16:33 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [24.0, 32.0, 28.0, 40.0, 36.0, 36.0, 32.0, 40.0, 40.0]
2025/03/26 22:16:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 24.0


2025/03/26 22:16:33 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 10 / 12 - Minibatch ==


Average Metric: 9.00 / 25 (36.0%): 100%|██████████| 25/25 [00:06<00:00,  3.70it/s]

2025/03/26 22:16:39 INFO dspy.evaluate.evaluate: Average Metric: 9 / 25 (36.0%)





2025/03/26 22:16:40 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 36.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 1'].
2025/03/26 22:16:40 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: []
2025/03/26 22:16:40 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [24.0, 32.0, 28.0, 40.0, 36.0, 36.0, 32.0, 40.0, 40.0, 36.0]
2025/03/26 22:16:40 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 24.0


2025/03/26 22:16:40 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 11 / 12 - Full Evaluation =====
2025/03/26 22:16:40 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 40.0) from minibatch trials...


Average Metric: 10.00 / 25 (40.0%): 100%|██████████| 25/25 [00:06<00:00,  3.67it/s]

2025/03/26 22:16:47 INFO dspy.evaluate.evaluate: Average Metric: 10 / 25 (40.0%)





2025/03/26 22:16:47 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 40.0
2025/03/26 22:16:47 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [24.0, 32.0, 28.0, 40.0, 36.0, 36.0, 32.0, 40.0, 40.0, 36.0, 40.0]
2025/03/26 22:16:47 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 40.0
2025/03/26 22:16:47 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/03/26 22:16:47 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 40.0!
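In Trial 1 the progress bar ends at `6.00 / 19 (31.6%)`, yet the evaluator then reports `6.0 / 25 (24.0%)`. The gap lines up with the six `'topk'` errors logged during that trial: the running figure appears to divide by the examples that were scored without error, while the final figure divides by all 25. The arithmetic below only checks this reading of the log; it is not a description of DSPy internals.

```python
# Numbers taken from the Trial 1 log above.
total_examples = 25   # size of the evaluation set
errored_examples = 6  # count of "'topk'" error lines in Trial 1
correct = 6           # numerator shown in both log lines

# Examples that actually produced a score.
scored = total_examples - errored_examples

# Running average shown by the progress bar vs. final reported average.
running_pct = round(100 * correct / scored, 1)
final_pct = round(100 * correct / total_examples, 1)

print(f"scored examples: {scored}")
print(f"running average: {running_pct}%")
print(f"final average:   {final_pct}%")
```

If this reading is right, failed retrievals effectively count as misses in the reported score, so the 24.0 baseline is partly a retriever-availability problem rather than purely a prompt-quality one.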


In [11]:
compiled_rag

generate_answer.predict = Predict(StringSignature(context, question -> reasoning, answer
    instructions='Given the context and the question, generate a reasoned answer that is concise and factually correct, ensuring it consists of no more than 5 words.'
    context = Field(annotation=str required=True json_schema_extra={'desc': 'may contain relevant facts', '__dspy_field_type': 'input', 'prefix': 'Context:'})
    question = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Question:', 'desc': '${question}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    answer = Field(annotation=str required=True json_schema_extra={'desc': 'often between 1 and 5 words', '__dspy_field_type': 'output', 'prefix': 'Answer:'})
))

In [None]:
compiled_rag.save("optimized_model.json")