<a href="https://colab.research.google.com/github/hanhanwu/Hanhan_COLAB_Experiemnts/blob/master/GenAI_Practice/Langwatch/try_dspy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Try DSPy for RAG Prompt Optimization

* https://github.com/hanhanwu/Hanhan_COLAB_Experiemnts/blob/master/GenAI_Practice/Langwatch/dspy_prompt_optimization_online_dashboard.ipynb
* https://dspy.ai/tutorials/rag/

In [6]:
%%capture --no-stderr
%pip install --upgrade nbformat
%pip install -U --quiet dspy

## Prepare LLM

* The ColBERTv2 endpoint `http://20.102.90.50:2017/wiki17_abstracts` serves the Wikipedia 2017 abstracts used as the retrieval corpus here

In [11]:
import os
import pandas as pd
from getpass import getpass
import dspy
from google.colab import userdata


# OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
# llm = dspy.LM("openai/gpt-4.1-nano", api_key=OPENAI_API_KEY)

GOOGLE_AI_API_KEY = userdata.get('GOOGLE_AI_API_KEY')
llm = dspy.LM("gemini/gemini-2.0-flash", api_key=GOOGLE_AI_API_KEY)
print("LLM test response:", llm("Where's Silicon Valley?"))

# the retrieval model
colbertv2_wiki17_abstracts = dspy.ColBERTv2(
    url="http://20.102.90.50:2017/wiki17_abstracts"
)
dspy.settings.configure(lm=llm, rm=colbertv2_wiki17_abstracts)

LLM test response: ['Silicon Valley is located in the southern part of the San Francisco Bay Area in **Northern California, United States**.\n']


## Preparing Dataset

In [12]:
from dspy.datasets import HotPotQA


dataset = HotPotQA(train_seed=1, train_size=32, eval_seed=2025, dev_size=50, test_size=0)
trainset = [x.with_inputs('question') for x in dataset.train]
devset = [x.with_inputs('question') for x in dataset.dev]

print()
print(len(trainset), len(devset))
print(trainset[0])
print(devset[0])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



32 50
Example({'question': 'At My Window was released by which American singer-songwriter?', 'answer': 'John Townes Van Zandt'}) (input_keys={'question'})
Example({'question': 'Pehchaan: The Face of Truth stars Vinod Khanna, Rati Agnihotri and which Indian actress, producer, and former model who also produced the film?', 'answer': 'Raveena Tandon', 'gold_titles': {'Pehchaan: The Face of Truth', 'Raveena Tandon'}}) (input_keys={'question'})


## Defining DSPy RAG

In [13]:
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""
    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")


class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context,
                               answer=prediction.answer,
                               reasoning=prediction.reasoning)


dev_example = devset[12]
print(f"[Devset] Question: {dev_example.question}")
print(f"[Devset] Answer: {dev_example.answer}")
print(f"[Devset] Relevant Wikipedia Titles: {dev_example.gold_titles}")
print()

generate_answer = RAG()
pred = generate_answer(question=dev_example.question)
print(f"[Prediction] Question: {dev_example.question}")
print(f"[Prediction] Predicted Answer: {pred.answer}")
print(f"[Prediction] Reasoning: {pred.reasoning}")

[Devset] Question: Twelve Inches is a compilation album by which 1980s British band?
[Devset] Answer: Frankie Goes to Hollywood
[Devset] Relevant Wikipedia Titles: {'Twelve Inches', 'Frankie Goes to Hollywood'}

[Prediction] Question: Twelve Inches is a compilation album by which 1980s British band?
[Prediction] Predicted Answer: Soft Cell, Bananarama, Spandau Ballet
[Prediction] Reasoning: The question asks which 1980s British band released a compilation album called "Twelve Inches". I need to find a band that matches both criteria.
The context provides three albums with "Twelve Inches" in the title: "The Twelve Inch Singles" by Soft Cell, "The Twelve Inches of Bananarama" by Bananarama, and "The Twelve Inch Mixes" by Spandau Ballet.
Soft Cell, Bananarama, and Spandau Ballet are all British bands from the 1980s.
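
The baseline prediction above lists three bands instead of the gold answer, so any exact-match style comparison will score it as wrong. The sketch below shows one common way such a comparison is done (lowercase, strip punctuation and articles, then compare). It is a standalone approximation written for illustration: `normalize_text` and `exact_match` are hypothetical helpers, not DSPy's own implementation.

```python
import re
import string


def normalize_text(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when both strings are identical after normalization."""
    return normalize_text(prediction) == normalize_text(gold)


# Strings copied from the devset example and the baseline prediction above
gold = "Frankie Goes to Hollywood"
baseline = "Soft Cell, Bananarama, Spandau Ballet"

print(exact_match(baseline, gold))                      # False
print(exact_match("frankie goes to hollywood.", gold))  # True
```

Normalization forgives differences in case and punctuation, but it cannot rescue an answer that names the wrong entities.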


In [18]:
from dspy.teleprompt import MIPROv2


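def validate_context_and_answer(example, pred, trace=None):
    """Pass only if the predicted answer exactly matches the gold answer
    and the gold answer appears in at least one retrieved passage."""
    answer_EM = dspy.evaluate.answer_exact_match(example, pred)
    answer_PM = dspy.evaluate.answer_passage_match(example, pred)
    return answer_EM and answer_PM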

optimizer = MIPROv2(
    metric=validate_context_and_answer,
    prompt_model=llm,
    task_model=llm,
    num_candidates=2,
    init_temperature=0.7,
    auto=None,
    verbose=True
)

compiled_rag = optimizer.compile(
    RAG(),
    trainset=trainset,
    num_trials=5,
    max_bootstrapped_demos=2,
    max_labeled_demos=3,
    minibatch_size=4,
    requires_permission_to_run=False
)

2025/05/30 22:19:55 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/05/30 22:19:55 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/05/30 22:19:55 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=2 sets of demonstrations...
2025/05/30 22:19:55 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/05/30 22:19:55 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.
2025/05/30 22:19:55 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=2 instructions...

2025/05/30 22:19:55 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/05/30 22:19:55 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Answer questions with 

Bootstrapping set 1/2
Bootstrapping set 2/2
SOURCE CODE: StringSignature(context, question -> reasoning, answer
    instructions='Answer questions with short factoid answers.'
    context = Field(annotation=str required=True json_schema_extra={'desc': 'may contain relevant facts', '__dspy_field_type': 'input', 'prefix': 'Context:'})
    question = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Question:', 'desc': '${question}'})
    reasoning = Field(annotation=str required=True json_schema_extra={'prefix': "Reasoning: Let's think step by step in order to", 'desc': '${reasoning}', '__dspy_field_type': 'output'})
    answer = Field(annotation=str required=True json_schema_extra={'desc': 'often between 1 and 5 words', '__dspy_field_type': 'output', 'prefix': 'Answer:'})
)

class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_ans

2025/05/30 22:19:58 INFO dspy.evaluate.evaluate: Average Metric: 9 / 25 (36.0%)
2025/05/30 22:19:58 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 36.0

2025/05/30 22:19:58 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 7 - Minibatch ==
2025/05/30 22:19:58 INFO dspy.teleprompt.mipro_optimizer_v2: Evaluating the following candidate program...




Predictor 0
i: Given relevant context, answer the question with a concise, fact-based response. Provide a brief, step-by-step reasoning before stating the final answer.
p: Answer:


Average Metric: 1.00 / 4 (25.0%): 100%|██████████| 4/4 [00:02<00:00,  1.58it/s]

2025/05/30 22:20:01 INFO dspy.evaluate.evaluate: Average Metric: 1 / 4 (25.0%)
2025/05/30 22:20:01 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 25.0 on minibatch of size 4 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 0'].
2025/05/30 22:20:01 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.0]
2025/05/30 22:20:01 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [36.0]
2025/05/30 22:20:01 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 36.0


2025/05/30 22:20:01 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 7 - Minibatch ==
2025/05/30 22:20:01 INFO dspy.teleprompt.mipro_optimizer_v2: Evaluating the following candidate program...




Predictor 0
i: Given relevant context, answer the question with a concise, fact-based response. Provide a brief, step-by-step reasoning before stating the final answer.
p: Answer:


Average Metric: 1.00 / 4 (25.0%): 100%|██████████| 4/4 [00:01<00:00,  2.70it/s]

2025/05/30 22:20:02 INFO dspy.evaluate.evaluate: Average Metric: 1 / 4 (25.0%)
2025/05/30 22:20:02 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 25.0 on minibatch of size 4 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 0'].
2025/05/30 22:20:02 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.0, 25.0]
2025/05/30 22:20:02 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [36.0]
2025/05/30 22:20:02 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 36.0


2025/05/30 22:20:02 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 7 - Minibatch ==
2025/05/30 22:20:02 INFO dspy.teleprompt.mipro_optimizer_v2: Evaluating the following candidate program...




Predictor 0
i: Given relevant context, answer the question with a concise, fact-based response. Provide a brief, step-by-step reasoning before stating the final answer.
p: Answer:


Average Metric: 1.00 / 4 (25.0%): 100%|██████████| 4/4 [00:00<00:00, 262.24it/s]

2025/05/30 22:20:02 INFO dspy.evaluate.evaluate: Average Metric: 1 / 4 (25.0%)
2025/05/30 22:20:02 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 25.0 on minibatch of size 4 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 1'].
2025/05/30 22:20:02 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.0, 25.0, 25.0]
2025/05/30 22:20:02 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [36.0]
2025/05/30 22:20:02 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 36.0


2025/05/30 22:20:02 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 7 - Minibatch ==
2025/05/30 22:20:02 INFO dspy.teleprompt.mipro_optimizer_v2: Evaluating the following candidate program...




Predictor 0
i: Answer questions with short factoid answers.
p: Answer:


Average Metric: 2.00 / 4 (50.0%): 100%|██████████| 4/4 [00:00<00:00, 1000.73it/s]

2025/05/30 22:20:02 INFO dspy.evaluate.evaluate: Average Metric: 2 / 4 (50.0%)
2025/05/30 22:20:02 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 50.0 on minibatch of size 4 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 0'].
2025/05/30 22:20:02 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.0, 25.0, 25.0, 50.0]
2025/05/30 22:20:02 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [36.0]
2025/05/30 22:20:02 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 36.0


2025/05/30 22:20:02 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 7 - Minibatch ==
2025/05/30 22:20:02 INFO dspy.teleprompt.mipro_optimizer_v2: Evaluating the following candidate program...




Predictor 0
i: Answer questions with short factoid answers.
p: Answer:


Average Metric: 1.00 / 4 (25.0%): 100%|██████████| 4/4 [00:00<00:00, 675.19it/s]

2025/05/30 22:20:02 INFO dspy.evaluate.evaluate: Average Metric: 1 / 4 (25.0%)
2025/05/30 22:20:02 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 25.0 on minibatch of size 4 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 1'].
2025/05/30 22:20:02 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [25.0, 25.0, 25.0, 50.0, 25.0]
2025/05/30 22:20:02 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [36.0]
2025/05/30 22:20:02 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 36.0


2025/05/30 22:20:02 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 7 - Full Evaluation =====
2025/05/30 22:20:02 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 50.0) from minibatch trials...



Average Metric: 9.00 / 25 (36.0%): 100%|██████████| 25/25 [00:00<00:00, 430.46it/s]

2025/05/30 22:20:03 INFO dspy.evaluate.evaluate: Average Metric: 9 / 25 (36.0%)
2025/05/30 22:20:03 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [36.0, 36.0]
2025/05/30 22:20:03 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 36.0
2025/05/30 22:20:03 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/05/30 22:20:03 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 36.0!
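
The last trial in the log runs a full evaluation on the "next top averaging program (Avg Score: 50.0)". One way to read that step is to group the minibatch scores by (instruction, few-shot set) pair and average each group. The sketch below reproduces that arithmetic with the scores copied from trials 2 to 6. It illustrates the log only and is not MIPROv2's actual code; all variable names are made up for this example.

```python
from collections import defaultdict
from statistics import mean

# (instruction index, few-shot set index, minibatch score), copied from trials 2-6
minibatch_trials = [
    (1, 0, 25.0),
    (1, 0, 25.0),
    (1, 1, 25.0),
    (0, 0, 50.0),
    (0, 1, 25.0),
]

# Collect every minibatch score seen for each candidate configuration
scores_by_config = defaultdict(list)
for instruction, fewshot_set, score in minibatch_trials:
    scores_by_config[(instruction, fewshot_set)].append(score)

# Average per configuration and pick the highest average
averages = {config: mean(scores) for config, scores in scores_by_config.items()}
best_config = max(averages, key=averages.get)

print(best_config, averages[best_config])  # (0, 0) 50.0
```

The selected pair (Instruction 0, Few-Shot Set 0) then scored 9 / 25 on the full evaluation, the same 36.0 as the default program, so this run returned a program with no measured improvement.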





In [19]:
dev_example = devset[0]
pred = compiled_rag(question=dev_example.question)
print("\n--- Test on dev example ---")
print(f"Question: {dev_example.question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Ground Truth: {dev_example.answer}")


--- Test on dev example ---
Question: Pehchaan: The Face of Truth stars Vinod Khanna, Rati Agnihotri and which Indian actress, producer, and former model who also produced the film?
Predicted Answer: Raveena Tandon
Ground Truth: Raveena Tandon


In [21]:
print(f"Total prompts sent: {len(llm.history)}")
llm.inspect_history(n=1)

Total prompts sent: 127




[2025-05-30T22:20:16.147937]

System message:

Your input fields are:
1. `context` (str): may contain relevant facts
2. `question` (str)
Your output fields are:
1. `reasoning` (str)
2. `answer` (str): often between 1 and 5 words
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## context ## ]]
{context}

[[ ## question ## ]]
{question}

[[ ## reasoning ## ]]
{reasoning}

[[ ## answer ## ]]
{answer}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Answer questions with short factoid answers.


User message:

[[ ## context ## ]]
[1] «Pehchaan: The Face of Truth | Pehchaan: The Face of Truth is a Bollywood film released in 2005. The film directed by Shrabani Deodhar stars Vinod Khanna, Rati Agnihotri and Raveena Tandon who also produced the film.»
[2] «Mashaal | Mashaal is a 1984 Bollywood film. Produced and directed by Yash Chopra, it starred Dilip K
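
The system message above asks the model to reply with `[[ ## reasoning ## ]]`, `[[ ## answer ## ]]` and `[[ ## completed ## ]]` headers. As a rough illustration of how a reply in that layout can be split back into named fields, here is a small regex-based sketch. `parse_fields` is a hypothetical helper and the sample completion is hand-written to match the template; neither comes from DSPy or from the run above.

```python
import re

# Matches a header line such as "[[ ## answer ## ]]" and captures the field name
FIELD_HEADER = re.compile(r"^\[\[ ## (\w+) ## \]\][ \t]*$", re.MULTILINE)


def parse_fields(completion: str) -> dict[str, str]:
    """Split a completion into {field_name: text} using the [[ ## name ## ]] headers."""
    parts = FIELD_HEADER.split(completion)
    # re.split with one capture group yields [preamble, name1, body1, name2, body2, ...]
    names, bodies = parts[1::2], parts[2::2]
    return {
        name: body.strip()
        for name, body in zip(names, bodies)
        if name != "completed"  # the end marker carries no content
    }


# Hypothetical completion, written by hand to follow the template above
sample = """[[ ## reasoning ## ]]
The first passage says Raveena Tandon starred in and produced the film.

[[ ## answer ## ]]
Raveena Tandon

[[ ## completed ## ]]"""

fields = parse_fields(sample)
print(fields["answer"])  # Raveena Tandon
```

A parser like this makes it easy to spot replies that drop a field or add text outside the expected headers.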

In [32]:
import pprint
import pandas as pd

# llm.history is assumed to be a list of dicts, one per LM call
history_records = []

# Pretty-print only the first two records
for record in llm.history[:2]:
    pprint.pprint(record)
    print()

{'cost': 9.8e-06,
 'kwargs': {},
 'messages': None,
 'model': 'gemini/gemini-2.0-flash',
 'model_type': 'chat',
 'outputs': ['Silicon Valley is located in the southern part of the San '
             'Francisco Bay Area in **Northern California, United States**.\n'],
 'prompt': "Where's Silicon Valley?",
 'response': ModelResponse(id='chatcmpl-9a55bde3-cda1-4d2d-b3f6-57b77dfc6610', created=1748642983, model='gemini-2.0-flash', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Silicon Valley is located in the southern part of the San Francisco Bay Area in **Northern California, United States**.\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None))], usage=Usage(completion_tokens=23, prompt_tokens=6, total_tokens=29, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=None, text_tokens=6, image_tokens=None)), vertex_ai