<a href="https://colab.research.google.com/github/hanhanwu/Hanhan_COLAB_Experiemnts/blob/master/GenAI_Practice/LangWatch/dspy_promot_optimization_online_dashboard.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LangWatch DSPy Visualizer

This notebook walks through a simple DSPy prompt-optimization run and connects it to LangWatch, so the training session can be visualized and debugged from the online dashboard.

In [1]:
%%capture --no-stderr
%pip install -U --quiet dspy langwatch

## Preparing the LLM

In [2]:
import dspy
from google.colab import userdata


OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
llm = dspy.LM("openai/gpt-4.1-nano", api_key=OPENAI_API_KEY)
print("LLM test response:", llm("How much do you know about Robert Nishihara?"))

# The retrieval model: a hosted ColBERTv2 index (wiki17_abstracts)
colbertv2_wiki17_abstracts = dspy.ColBERTv2(
    url="http://20.102.90.50:2017/wiki17_abstracts"
)
dspy.settings.configure(lm=llm, rm=colbertv2_wiki17_abstracts)

LLM test response: ['As of my knowledge cutoff in October 2023, Robert Nishihara is a researcher known for his work in machine learning, particularly in the areas of reinforcement learning, optimization, and scalable algorithms. He has contributed to the development of tools and frameworks that facilitate large-scale machine learning experiments. Nishihara has been affiliated with institutions such as the University of California, Berkeley, and has collaborated on projects related to distributed computing and efficient training methods for complex models. If you have specific questions about his work or background, feel free to ask!']


## Preparing the Dataset

In [3]:
from dspy.datasets import HotPotQA


dataset = HotPotQA(train_seed=1, train_size=32, eval_seed=2025, dev_size=50, test_size=0)
trainset = [x.with_inputs('question') for x in dataset.train]
devset = [x.with_inputs('question') for x in dataset.dev]

print(len(trainset), len(devset))
print(trainset[0])
print(devset[0])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


32 50
Example({'question': 'At My Window was released by which American singer-songwriter?', 'answer': 'John Townes Van Zandt'}) (input_keys={'question'})
Example({'question': 'Pehchaan: The Face of Truth stars Vinod Khanna, Rati Agnihotri and which Indian actress, producer, and former model who also produced the film?', 'answer': 'Raveena Tandon', 'gold_titles': {'Pehchaan: The Face of Truth', 'Raveena Tandon'}}) (input_keys={'question'})


## Defining the model

In [4]:
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""
    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")


class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)


dev_example = devset[12]
print(f"[Devset] Question: {dev_example.question}")
print(f"[Devset] Answer: {dev_example.answer}")
print(f"[Devset] Relevant Wikipedia Titles: {dev_example.gold_titles}")
print()

generate_answer = RAG()
pred = generate_answer(question=dev_example.question)
print(f"[Prediction] Question: {dev_example.question}")
print(f"[Prediction] Predicted Answer: {pred.answer}")

[Devset] Question: Twelve Inches is a compilation album by which 1980s British band?
[Devset] Answer: Frankie Goes to Hollywood
[Devset] Relevant Wikipedia Titles: {'Twelve Inches', 'Frankie Goes to Hollywood'}

[Prediction] Question: Twelve Inches is a compilation album by which 1980s British band?
[Prediction] Predicted Answer: Spandau Ballet
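
The unoptimized program answers "Spandau Ballet" while the gold answer is "Frankie Goes to Hollywood", so any exact-match style metric scores this example as a miss. The sketch below illustrates the two checks that the training metric later combines: an exact match on the answer, and a check that the gold answer appears in a retrieved passage. It is a simplified stand-in written in plain Python, not DSPy's implementation; the `normalize` rules and the sample passage are assumptions made for illustration.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(predicted: str, gold: str) -> bool:
    """True if both answers are identical after normalization."""
    return normalize(predicted) == normalize(gold)


def passage_match(gold: str, passages: list[str]) -> bool:
    """True if the normalized gold answer occurs in any normalized passage."""
    target = normalize(gold)
    return any(target in normalize(p) for p in passages)


gold = "Frankie Goes to Hollywood"

# The prediction printed above is a miss.
print(exact_match("Spandau Ballet", gold))

# Case and trailing punctuation do not matter after normalization.
print(exact_match("frankie goes to hollywood.", gold))

# Hypothetical retrieved passage, invented for illustration only.
passages = ["Twelve Inches is a compilation album by Frankie Goes to Hollywood."]
print(passage_match(gold, passages))
```

Requiring both checks, as the metric in the training cell does, is stricter than exact match alone: a correct answer only counts when the retrieved context also contains it.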


## Log in to LangWatch

* This run uses the hosted LangWatch dashboard at https://app.langwatch.ai.

In [5]:
import langwatch


langwatch.endpoint = "https://app.langwatch.ai"
langwatch.login()

Please go to https://app.langwatch.ai/authorize to get your API key
Paste your API key here: ··········
LangWatch API key set


## Start Training Session!

In [6]:
from dspy.teleprompt import MIPROv2
import dspy.evaluate


# Define the validation metric: the answer must match exactly AND appear in the retrieved context
def validate_context_and_answer(example, pred, trace=None):
    answer_EM = dspy.evaluate.answer_exact_match(example, pred)
    answer_PM = dspy.evaluate.answer_passage_match(example, pred)
    return answer_EM and answer_PM

# Set up a MIPROv2 optimizer, which will compile our RAG program.
optimizer = MIPROv2(metric=validate_context_and_answer, prompt_model=llm,
                    task_model=llm, num_candidates=2, init_temperature=0.7,
                    auto=None)

# Initialize LangWatch for this run, to track the optimizer compilation
langwatch.dspy.init(experiment="hanhan_exp1", optimizer=optimizer)

# Compile
compiled_rag = optimizer.compile(
    RAG(),
    trainset=trainset,
    num_trials=10,
    max_bootstrapped_demos=3,
    max_labeled_demos=5,
    minibatch_size=10,
    requires_permission_to_run=False
)

2025/05/09 22:50:42 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/05/09 22:50:42 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/05/09 22:50:42 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=2 sets of demonstrations...
2025/05/09 22:50:42 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/05/09 22:50:42 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.



[LangWatch] Experiment initialized, run_id: pastel-visionary-lynx
[LangWatch] Open https://app.langwatch.ai/my-garden-vZCaox/experiments/hanhan-exp1?runIds=pastel-visionary-lynx to track your DSPy training session live

Bootstrapping set 1/2
Bootstrapping set 2/2


2025/05/09 22:50:45 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=2 instructions...

2025/05/09 22:50:50 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/05/09 22:50:50 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Answer questions with short factoid answers.

2025/05/09 22:50:50 INFO dspy.teleprompt.mipro_optimizer_v2: 1: Given a context of relevant facts and a specific question, generate a clear, step-by-step reasoning process to arrive at a concise, factual answer. The reasoning should explain how the context supports the answer, which should be brief (typically 1-5 words). Ensure the response is accurate, direct, and aligned with the provided information. Use the chain-of-thought approach to justify the answer thoroughly before stating it.

2025/05/09 22:50:50 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/05/09 22:50:50 INFO dspy.teleprompt.mipro_optimizer_v2: ==> STEP 3: FINDING OPTIMAL PROMPT PARAMETERS <==
2025/05/09 22:50:50 INFO dsp

Average Metric: 7.00 / 25 (28.0%): 100%|██████████| 25/25 [00:15<00:00,  1.61it/s]

2025/05/09 22:51:06 INFO dspy.evaluate.evaluate: Average Metric: 7 / 25 (28.0%)





2025/05/09 22:51:07 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 28.0

2025/05/09 22:51:07 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 13 - Minibatch ==


Average Metric: 3.00 / 10 (30.0%): 100%|██████████| 10/10 [00:02<00:00,  3.95it/s]

2025/05/09 22:51:10 INFO dspy.evaluate.evaluate: Average Metric: 3 / 10 (30.0%)





2025/05/09 22:51:11 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 30.0 on minibatch of size 10 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 0'].
2025/05/09 22:51:11 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [30.0]
2025/05/09 22:51:11 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [28.0]
2025/05/09 22:51:11 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 28.0


2025/05/09 22:51:11 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 13 - Minibatch ==


Average Metric: 5.00 / 10 (50.0%): 100%|██████████| 10/10 [00:01<00:00,  6.65it/s]

2025/05/09 22:51:12 INFO dspy.evaluate.evaluate: Average Metric: 5 / 10 (50.0%)





2025/05/09 22:51:13 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 50.0 on minibatch of size 10 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 0'].
2025/05/09 22:51:13 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [30.0, 50.0]
2025/05/09 22:51:13 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [28.0]
2025/05/09 22:51:13 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 28.0


2025/05/09 22:51:13 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 13 - Minibatch ==


Average Metric: 3.00 / 10 (30.0%): 100%|██████████| 10/10 [00:01<00:00,  5.72it/s]

2025/05/09 22:51:15 INFO dspy.evaluate.evaluate: Average Metric: 3 / 10 (30.0%)





2025/05/09 22:51:16 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 30.0 on minibatch of size 10 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 1'].
2025/05/09 22:51:16 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [30.0, 50.0, 30.0]
2025/05/09 22:51:16 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [28.0]
2025/05/09 22:51:16 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 28.0


2025/05/09 22:51:16 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 13 - Minibatch ==


Average Metric: 5.00 / 10 (50.0%): 100%|██████████| 10/10 [00:00<00:00, 751.14it/s]

2025/05/09 22:51:16 INFO dspy.evaluate.evaluate: Average Metric: 5 / 10 (50.0%)





2025/05/09 22:51:17 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 50.0 on minibatch of size 10 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 0'].
2025/05/09 22:51:17 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [30.0, 50.0, 30.0, 50.0]
2025/05/09 22:51:17 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [28.0]
2025/05/09 22:51:17 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 28.0


2025/05/09 22:51:17 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 13 - Minibatch ==


Average Metric: 3.00 / 10 (30.0%): 100%|██████████| 10/10 [00:01<00:00,  6.25it/s]

2025/05/09 22:51:19 INFO dspy.evaluate.evaluate: Average Metric: 3 / 10 (30.0%)





2025/05/09 22:51:20 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 30.0 on minibatch of size 10 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 1'].
2025/05/09 22:51:20 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [30.0, 50.0, 30.0, 50.0, 30.0]
2025/05/09 22:51:20 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [28.0]
2025/05/09 22:51:20 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 28.0


2025/05/09 22:51:20 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 13 - Full Evaluation =====
2025/05/09 22:51:20 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 50.0) from minibatch trials...


Average Metric: 7.00 / 25 (28.0%): 100%|██████████| 25/25 [00:00<00:00, 641.39it/s]

2025/05/09 22:51:20 INFO dspy.evaluate.evaluate: Average Metric: 7 / 25 (28.0%)





2025/05/09 22:51:21 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [28.0, 28.0]
2025/05/09 22:51:21 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 28.0
2025/05/09 22:51:21 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/05/09 22:51:21 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 8 / 13 - Minibatch ==


Average Metric: 6.00 / 10 (60.0%): 100%|██████████| 10/10 [00:01<00:00,  7.39it/s]

2025/05/09 22:51:23 INFO dspy.evaluate.evaluate: Average Metric: 6 / 10 (60.0%)





2025/05/09 22:51:23 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 60.0 on minibatch of size 10 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 0'].
2025/05/09 22:51:23 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [30.0, 50.0, 30.0, 50.0, 30.0, 60.0]
2025/05/09 22:51:23 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [28.0, 28.0]
2025/05/09 22:51:23 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 28.0


2025/05/09 22:51:23 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 9 / 13 - Minibatch ==


Average Metric: 1.00 / 10 (10.0%): 100%|██████████| 10/10 [00:00<00:00, 1474.22it/s]

2025/05/09 22:51:23 INFO dspy.evaluate.evaluate: Average Metric: 1 / 10 (10.0%)





2025/05/09 22:51:24 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 10.0 on minibatch of size 10 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 0'].
2025/05/09 22:51:24 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [30.0, 50.0, 30.0, 50.0, 30.0, 60.0, 10.0]
2025/05/09 22:51:24 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [28.0, 28.0]
2025/05/09 22:51:24 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 28.0


2025/05/09 22:51:24 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 10 / 13 - Minibatch ==


Average Metric: 3.00 / 10 (30.0%): 100%|██████████| 10/10 [00:01<00:00,  6.23it/s]

2025/05/09 22:51:25 INFO dspy.evaluate.evaluate: Average Metric: 3 / 10 (30.0%)





2025/05/09 22:51:27 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 30.0 on minibatch of size 10 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 1'].
2025/05/09 22:51:27 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [30.0, 50.0, 30.0, 50.0, 30.0, 60.0, 10.0, 30.0]
2025/05/09 22:51:27 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [28.0, 28.0]
2025/05/09 22:51:27 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 28.0


2025/05/09 22:51:27 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 11 / 13 - Minibatch ==


Average Metric: 2.00 / 10 (20.0%): 100%|██████████| 10/10 [00:00<00:00, 15.54it/s]

2025/05/09 22:51:27 INFO dspy.evaluate.evaluate: Average Metric: 2 / 10 (20.0%)





2025/05/09 22:51:28 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 20.0 on minibatch of size 10 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 0'].
2025/05/09 22:51:28 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [30.0, 50.0, 30.0, 50.0, 30.0, 60.0, 10.0, 30.0, 20.0]
2025/05/09 22:51:28 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [28.0, 28.0]
2025/05/09 22:51:28 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 28.0


2025/05/09 22:51:28 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 12 / 13 - Minibatch ==


Average Metric: 5.00 / 10 (50.0%): 100%|██████████| 10/10 [00:00<00:00, 1606.52it/s]

2025/05/09 22:51:28 INFO dspy.evaluate.evaluate: Average Metric: 5 / 10 (50.0%)





2025/05/09 22:51:29 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 50.0 on minibatch of size 10 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 0'].
2025/05/09 22:51:29 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [30.0, 50.0, 30.0, 50.0, 30.0, 60.0, 10.0, 30.0, 20.0, 50.0]
2025/05/09 22:51:29 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [28.0, 28.0]
2025/05/09 22:51:29 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 28.0


2025/05/09 22:51:29 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 13 / 13 - Full Evaluation =====
2025/05/09 22:51:29 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 42.0) from minibatch trials...


Average Metric: 8.00 / 25 (32.0%): 100%|██████████| 25/25 [00:01<00:00, 18.14it/s]

2025/05/09 22:51:31 INFO dspy.evaluate.evaluate: Average Metric: 8 / 25 (32.0%)





2025/05/09 22:51:32 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 32.0
2025/05/09 22:51:32 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [28.0, 28.0, 32.0]
2025/05/09 22:51:32 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 32.0
2025/05/09 22:51:32 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/05/09 22:51:32 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 32.0!
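
The two "Full Evaluation" lines in the log report an "Avg Score" of 50.0 (trial 7) and 42.0 (trial 13). These figures can be reproduced from the minibatch scores by averaging per parameter combination and picking the highest average. The sketch below does that arithmetic on values copied from the log. It only mirrors what the log messages imply and is not DSPy's code; the function name `top_unevaluated` and the rule of skipping already evaluated combinations (read from the phrase "next top averaging program") are assumptions.

```python
from collections import defaultdict
from statistics import mean

# (instruction index, few-shot set index, minibatch score),
# copied from the minibatch trials in the log above.
minibatch_trials = [
    (1, 0, 30.0),  # trial 2
    (1, 0, 50.0),  # trial 3
    (1, 1, 30.0),  # trial 4
    (0, 0, 50.0),  # trial 5
    (0, 1, 30.0),  # trial 6
    (1, 0, 60.0),  # trial 8
    (0, 0, 10.0),  # trial 9
    (1, 1, 30.0),  # trial 10
    (1, 0, 20.0),  # trial 11
    (1, 0, 50.0),  # trial 12
]


def top_unevaluated(trials, already_evaluated):
    """Return the (instruction, few-shot) pair with the best mean minibatch
    score among pairs that have not had a full evaluation yet (assumed rule)."""
    scores = defaultdict(list)
    for instruction, fewshot, score in trials:
        scores[(instruction, fewshot)].append(score)
    averages = {
        config: mean(values)
        for config, values in scores.items()
        if config not in already_evaluated
    }
    best = max(averages, key=averages.get)
    return best, averages[best]


# Before trial 7, only the first five minibatch trials have run.
print(top_unevaluated(minibatch_trials[:5], set()))

# Before trial 13, all ten minibatch trials have run.
print(top_unevaluated(minibatch_trials, {(0, 0)}))
```

Under these assumptions the second call selects Instruction 1 with Few-Shot Set 0, which is consistent with the instruction text shown in the compiled program. Note how noisy the minibatch scores are: the same combination scores anywhere from 20.0 to 60.0 on batches of 10, while the full evaluation over 25 examples only moves from 28.0 to 32.0.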


In [9]:
compiled_rag

generate_answer.predict = Predict(StringSignature(context, question -> reasoning, answer
    instructions='Given a context of relevant facts and a specific question, generate a clear, step-by-step reasoning process to arrive at a concise, factual answer. The reasoning should explain how the context supports the answer, which should be brief (typically 1-5 words). Ensure the response is accurate, direct, and aligned with the provided information. Use the chain-of-thought approach to justify the answer thoroughly before stating it.'
    context = Field(annotation=str required=True json_schema_extra={'desc': 'may contain relevant facts', '__dspy_field_type': 'input', 'prefix': 'Context:'})
    question = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Question:', 'desc': '${question}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__

In [10]:
compiled_rag.save("optimized_model.json")
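
Because the program is saved with a `.json` extension, the file can be inspected with the standard library before reusing it. The helper below is a minimal sketch: `top_level_keys` is a hypothetical name, and no particular key layout is assumed, since the saved structure may differ between DSPy versions.

```python
import json
from pathlib import Path


def top_level_keys(path: str) -> list[str]:
    """Return the sorted top-level keys of the JSON object stored at `path`."""
    with Path(path).open(encoding="utf-8") as f:
        state = json.load(f)
    if not isinstance(state, dict):
        raise ValueError(f"expected a JSON object at the top level of {path}")
    return sorted(state)


# Only inspect the file if the save cell above has already been run.
if Path("optimized_model.json").exists():
    print(top_level_keys("optimized_model.json"))
```

To reuse the optimized prompts in a later session, DSPy modules also provide a `load` method, typically called on a fresh instance of the same class (here `RAG`) with the saved file path.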