# Setup

In [1]:
import dspy

from dotenv import load_dotenv
load_dotenv()

True

# DSPy playground

DSPy consists of two different concepts brought together:
- DSPy programming language
- DSPy compiler

The DSPy programming language focuses on organizing LMs as building blocks of a more complex system. Here, LMs are treated as layers in a neural network, while their prompts are the weights.
> In this context, different prompting techniques are seen as different layers: Predict (zero-shot), ChainOfThought, ReAct, and so on.

The DSPy compiler is a tool that optimizes a DSPy program. It takes a DSPy program, a training set, and a metric, and optimizes the program following a teleprompter (an optimizer).

Therefore, the parallel follows:

| DSPy              | PyTorch                   |
| :--------:        | :-------:                 |
| LMs               | Layers                    |
| LM prompts        | Layers' weights           |
| Teleprompters     | Optimizer                 |
| Compiling         | Training                  |
| Bootstrapping     | Hyperparameter tuning     |
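
To make the "prompts are the weights" analogy concrete, here is a minimal pure-Python sketch. `ToyPredict` and `labeled_few_shot` are hypothetical stand-ins written for this note, not DSPy classes: the module's learnable state is its list of demonstrations, and the toy "compile" step fills that state from a training set (here simply the first `k` examples, which is an assumption made to keep the sketch deterministic).

```python
from dataclasses import dataclass, field


@dataclass
class ToyPredict:
    """Hypothetical stand-in for a prompting module; not the DSPy class."""
    instruction: str
    demos: list = field(default_factory=list)  # the "weights" of the prompt

    def render(self, question: str) -> str:
        # The prompt is rebuilt from the current demos on every call.
        parts = [self.instruction]
        for demo_question, demo_answer in self.demos:
            parts.append(f"Question: {demo_question}\nAnswer: {demo_answer}")
        parts.append(f"Question: {question}\nAnswer:")
        return "\n\n".join(parts)


def labeled_few_shot(module: ToyPredict, trainset: list, k: int) -> ToyPredict:
    """Toy 'compile' step: return a copy whose prompt holds the first k labeled examples."""
    return ToyPredict(instruction=module.instruction, demos=list(trainset[:k]))


trainset = [("What is 2 + 2?", "4"), ("What is 3 * 3?", "9"), ("What is 10 - 7?", "3")]
zero_shot = ToyPredict(instruction="Answer the question.")
compiled = labeled_few_shot(zero_shot, trainset, k=2)
print(compiled.render("What is 5 + 1?"))
```

The uncompiled module is left untouched, which mirrors how a compiled program is a separate object from the program it was derived from in the cells further down.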


Hands-on!

Despite DSPy's architecture, we can still call LMs directly.

In [13]:
gpt35_turbo = dspy.OpenAI(model='gpt-3.5-turbo-instruct')
# gpt35_turbo = dspy.OpenAI(model='gpt-3.5-turbo-1106', model_type="text")
dspy.settings.configure(lm=gpt35_turbo)

In [3]:
config = {
    'n' : 5,
    'max_tokens' : 250,
    'temperature' : 0.5,
}
h = f"{40*'-'} Output {40*'-'}"
print(f"{h}\n"
      f"{gpt35_turbo('Using DSPy, can I just call LM models directly?', **config)}"
      f"\n{len(h)*'-'}\n"
      )


---------------------------------------- Output ----------------------------------------
['\n\nYes, you can call LM (linear model) models directly in DSPy using the LinearModel class. This class allows you to create and manipulate linear models, including fitting them to data, making predictions, and evaluating model performance.', '\n\nNo, DSPy does not have the capability to directly call LM models. DSPy is a library for digital signal processing and does not have any specific functionality for LM models. However, you can use DSPy to preprocess data and then use a separate library or tool to call LM models.', '\n\nNo, DSPy does not have a direct interface for calling LM models. It is designed to be used for building and training neural networks, not for directly calling pre-trained models. However, DSPy does have the ability to load and use pre-trained models as part of the network architecture, so you could use it in that way if you have a pre-trained LM model that you want to incor

Of course, we don't need DSPy for that. DSPy's strength comes from wrapping LMs in programming modules (layers) that can be assembled to create programs (models).

In [4]:
pred = dspy.Predict('question -> answer', **config)

h = f"{40*'-'} Predict output {40*'-'}"
print(f"{h}\n"
      f"{pred(question = 'What is the capital of France?').completions.answer}"
      f"\n{len(h)*'-'}\n"
      )

CoT = dspy.ChainOfThought('question -> answer', **config)
h = f"{35*'-'} Chain of Thoughts output {35*'-'}"
print(f"{h}\n"
      f"{CoT(question = 'What is the capital of France?').completions.answer}"
      f"\n{len(h)*'-'}"
      )

---------------------------------------- Predict output ----------------------------------------
['Paris', 'Paris', 'Paris', 'Paris', 'Paris']
------------------------------------------------------------------------------------------------

----------------------------------- Chain of Thoughts output -----------------------------------
['Paris', 'Paris', 'Paris', 'The capital of France is Paris.', 'Paris']
------------------------------------------------------------------------------------------------
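
The string `'question -> answer'` passed to both modules is a signature: input field names on the left of the arrow, output field names on the right. The sketch below shows one plausible way such a string could be split into fields and turned into a prompt skeleton. `parse_signature` and `render_prompt` are hypothetical helpers for illustration only, and the prompt layout is an assumption, not the format DSPy actually sends to the LM.

```python
def parse_signature(signature: str) -> tuple[list[str], list[str]]:
    """Split 'a, b -> c' into (input field names, output field names)."""
    inputs, arrow, outputs = signature.partition("->")
    if not arrow:
        raise ValueError(f"expected '->' in signature: {signature!r}")

    def split_fields(part: str) -> list[str]:
        return [name.strip() for name in part.split(",") if name.strip()]

    return split_fields(inputs), split_fields(outputs)


def render_prompt(signature: str, **values: str) -> str:
    """Build a toy prompt: one line per input, then an empty slot for the first output."""
    input_names, output_names = parse_signature(signature)
    missing = [name for name in input_names if name not in values]
    if missing:
        raise ValueError(f"missing inputs: {missing}")
    lines = [f"{name.capitalize()}: {values[name]}" for name in input_names]
    lines.append(f"{output_names[0].capitalize()}:")
    return "\n".join(lines)


print(parse_signature("context, question -> answer"))
print(render_prompt("question -> answer", question="What is the capital of France?"))
```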


But how do we know which LM `dspy.Predict` and `dspy.ChainOfThought` are using?

Annoyingly, DSPy takes the LM from its configuration (`dspy.settings.configure(lm=gpt35_turbo)`), not from the modules. 

But there's a workaround: `dspy.context`.

In [26]:
gpt4_turbo = dspy.OpenAI(model='gpt-4-1106-preview')

with dspy.context(lm=gpt4_turbo):
  config = {
      'n' : 2,
      'max_tokens' : 250,
      'temperature' : 0.5,
  }
  pred = dspy.Predict('question -> answer', **config)

  h = f"{40*'-'} Predict output {40*'-'}"
  print(f"{h}\n"
        f"{pred(question = 'What is the capital of France?').completions.answer}"
        f"\n{len(h)*'-'}\n"
        )

---------------------------------------- Predict output ----------------------------------------
['Question: What is the capital of France?\nAnswer: The capital of France is Paris.', 'Question: What is the capital of France?\nAnswer: The capital of France is Paris.']
------------------------------------------------------------------------------------------------
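
One way to picture this behavior: modules do not store an LM, they look up the active one when they are called, and a context block temporarily swaps what that lookup returns. The sketch below imitates the pattern with the standard-library `contextvars` module. `configure`, `context`, and `toy_predict` are hypothetical stand-ins, and DSPy's real internals may well differ.

```python
import contextvars
from contextlib import contextmanager

# Global default, playing the role of the configured LM.
_active_lm = contextvars.ContextVar("active_lm", default="gpt-3.5-turbo-instruct")


def configure(lm: str) -> None:
    """Set the default LM name for all later calls."""
    _active_lm.set(lm)


@contextmanager
def context(lm: str):
    """Temporarily override the active LM, restoring the previous one on exit."""
    token = _active_lm.set(lm)
    try:
        yield
    finally:
        _active_lm.reset(token)


def toy_predict(question: str) -> str:
    # The LM is resolved at call time, not when the function was defined.
    return f"[{_active_lm.get()}] {question}"


configure("gpt-3.5-turbo-instruct")
outside_before = toy_predict("What is the capital of France?")
with context("gpt-4-1106-preview"):
    inside = toy_predict("What is the capital of France?")
outside_after = toy_predict("What is the capital of France?")
print(outside_before, inside, outside_after, sep="\n")
```

Under this reading, what matters is where the module is called, not where it is constructed.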



DSPy supports a few remote LMs:
- OpenAI
- AzureOpenAI
- Cohere
- Anyscale
- Together
- Databricks
- Mistral
- AWSMistral
- AWSAnthropic
- AWSMeta
- ...

But we can also use local LMs, for example through a local server, with:
- HFClientTGI
- HFClientVLLM
- HFModel

# Reproducing DSPy

## GSM8K dataset

In [183]:
from dspy.datasets.gsm8k import GSM8K, gsm8k_metric
from dspy.teleprompt import (
    LabeledFewShot,
    BootstrapFewShotWithRandomSearch,
    Ensemble
)
from dspy.evaluate import Evaluate

gpt35_turbo = dspy.OpenAI(model='gpt-3.5-turbo-instruct')
dspy.settings.configure(lm=gpt35_turbo)

In [184]:
gsm8k = GSM8K()
gsm8k_trainset, gsm8k_devset = gsm8k.train[:10], gsm8k.dev[:10]


100%|██████████| 7473/7473 [00:00<00:00, 67816.78it/s]

100%|██████████| 1319/1319 [00:00<00:00, 68176.11it/s]


In [8]:
q=8
print(
    f'Question: {gsm8k_trainset[q].question}\n'
    f'Answer: {gsm8k_trainset[q].answer}'
    )

Question: Megan pays $16 for a shirt that costs $22 before sales. What is the amount of the discount?
Answer: 6
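
Since GSM8K labels are plain numbers, like the `6` above, a metric for it has to pull a number out of free-form model text before comparing. The sketch below is a simplified metric in that spirit. `toy_numeric_metric` and `last_integer` are hypothetical, and the parsing rules of the real `gsm8k_metric` may differ. The `(example, prediction, trace=None)` argument order follows the convention used for DSPy metrics.

```python
import re
from types import SimpleNamespace


def last_integer(text):
    """Return the last integer mentioned in the text, or None if there is none."""
    matches = re.findall(r"-?\d[\d,]*", str(text))
    if not matches:
        return None
    return int(matches[-1].replace(",", ""))


def toy_numeric_metric(example, prediction, trace=None) -> bool:
    """True when the last integer in the prediction equals the labeled answer."""
    gold = last_integer(example.answer)
    guess = last_integer(prediction.answer)
    return gold is not None and gold == guess


example = SimpleNamespace(answer="6")
print(toy_numeric_metric(example, SimpleNamespace(answer="The discount is $6.")))
print(toy_numeric_metric(example, SimpleNamespace(answer="16")))
```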


### gpt-3.5 turbo with no fancy prompt

In [None]:
class Vanilla(dspy.Module):
    def __init__(self):
        super().__init__()
        self.predict = dspy.Predict("question -> answer")
    
    def forward(self, question):
        return self.predict(question=question)

vanilla = Vanilla()
vanilla(question=gsm8k_trainset[8].question)

# Compiling
fewshot = LabeledFewShot(k=8).compile(vanilla, trainset=gsm8k_trainset)

tp = BootstrapFewShotWithRandomSearch(metric=gsm8k_metric)
bootstrap = tp.compile(vanilla , trainset=gsm8k_trainset , valset=gsm8k_devset)
bootstrap2 = tp.compile(vanilla , teacher=bootstrap, trainset=gsm8k_trainset, valset=gsm8k_devset)
# ensemble = Ensemble(reduce_fn=dspy.majority).compile()

In [23]:
print(f"Checking it for a random question: \nQuestion: {gsm8k_devset[q].question}")
print(f"Label:      {gsm8k_devset[q].answer}\n"
      f"zeroshot:   {vanilla(gsm8k_devset[q].question).answer}\n"
      f"fewshot:    {fewshot(gsm8k_devset[q].question).answer}\n"
      f"bootstrap:  {bootstrap(gsm8k_devset[q].question).answer}\n"
      f"bootstrap2: {bootstrap2(gsm8k_devset[q].question).answer}\n"
      # f"Ensemble: {bootstrap2(gsm8k_devset[q].question).answer}\n"
      )


print("Now, a systematic evaluation")
evaluate = Evaluate(devset=gsm8k_devset, metric=gsm8k_metric, num_threads=4, display_progress=True, display_table=0)
print(f"\nScores:\n"
      f"zeroshot:   {evaluate(vanilla)}\n"
      f"fewshot:    {evaluate(fewshot)}\n"
      f"bootstrap:  {evaluate(bootstrap)}\n"
      f"bootstrap2: {evaluate(bootstrap2)}\n"
      # f"Ensemble: {evaluate(ensemble)}\n"
      )

Checking it for a random question: 
Question: Martha's cat catches 3 rats and 7 birds. Cara's cat catches 3 less than five times as many animals as Martha's cat. How many animals does Cara's cat catch?
Label:      47
zeroshot:   32
fewshot:    47
bootstrap:  47
bootstrap2: 47

Now, a systematic evaluation
Average Metric: 0 / 10  (0.0%)
Average Metric: 6 / 10  (60.0%)
Average Metric: 10 / 10  (100.0%)
Average Metric: 10 / 10  (100.0%)

Scores:
zeroshot:   0.0
fewshot:    60.0
bootstrap:  100.0
bootstrap2: 100.0
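
The gap between `fewshot` and `bootstrap` is easier to read with the bootstrapping step in mind: labeled few-shot copies training examples into the prompt as they are, while bootstrapping runs a teacher program on the training questions and keeps only the traces whose answer passes the metric. The sketch below shows that filtering loop in pure Python. `toy_teacher`, `exact_metric`, and `bootstrap_demos` are hypothetical names, and the teacher's "reasoning slip" is simulated.

```python
def toy_teacher(question: str) -> dict:
    """Hypothetical teacher: adds the numbers in the question, with a simulated slip."""
    numbers = [int(token) for token in question.split() if token.isdigit()]
    total = sum(numbers)
    if total > 20:
        total -= 1  # simulated reasoning error on "harder" questions
    return {"question": question, "rationale": f"sum of {numbers}", "answer": str(total)}


def exact_metric(example: dict, prediction: dict) -> bool:
    return example["answer"] == prediction["answer"]


def bootstrap_demos(teacher, trainset, metric, max_demos):
    """Keep teacher traces that pass the metric, up to max_demos."""
    demos = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        trace = teacher(example["question"])
        if metric(example, trace):
            demos.append(trace)  # the trace carries the rationale, not just the label
    return demos


trainset = [
    {"question": "Add 2 and 3", "answer": "5"},
    {"question": "Add 15 and 9", "answer": "24"},
    {"question": "Add 4 and 4", "answer": "8"},
    {"question": "Add 1 and 6", "answer": "7"},
]
demos = bootstrap_demos(toy_teacher, trainset, exact_metric, max_demos=2)
print([demo["question"] for demo in demos])
```

The second training example is dropped because the teacher's answer fails the metric, so the prompt only ever sees validated demonstrations.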



### gpt-3.5 turbo with ChainOfThought

In [None]:
class CoT(dspy.Module):
    def __init__(self):
        super().__init__()
        self.cot = dspy.ChainOfThought("question -> answer")
    
    def forward(self, question):
        return self.cot(question=question)

cot = CoT()

# Compiling
fewshot = LabeledFewShot(k=8).compile(cot, trainset=gsm8k_trainset)

tp = BootstrapFewShotWithRandomSearch(metric=gsm8k_metric)
bootstrap = tp.compile(cot , trainset=gsm8k_trainset , valset=gsm8k_devset)
bootstrap2 = tp.compile(cot , teacher=bootstrap, trainset=gsm8k_trainset, valset=gsm8k_devset)
# ensemble = Ensemble(reduce_fn=dspy.majority).compile()

In [14]:
print(f"Checking it for a random question: \nQuestion: {gsm8k_devset[q].question}")
print(f"Label:      {gsm8k_devset[q].answer}\n"
      f"zeroshot:   {cot(gsm8k_devset[q].question).answer}\n"
      f"fewshot:    {fewshot(gsm8k_devset[q].question).answer}\n"
      f"bootstrap:  {bootstrap(gsm8k_devset[q].question).answer}\n"
      f"bootstrap2: {bootstrap2(gsm8k_devset[q].question).answer}\n"
      # f"Ensemble: {bootstrap2(gsm8k_devset[q].question).answer}\n"
      )

print("Now, a systematic evaluation")
evaluate = Evaluate(devset=gsm8k_devset, metric=gsm8k_metric, num_threads=4, display_progress=True, display_table=0)
print(f"\nScores:\n"
      f"zeroshot:   {evaluate(cot)}\n"
      f"fewshot:    {evaluate(fewshot)}\n"
      f"bootstrap:  {evaluate(bootstrap)}\n"
      f"bootstrap2: {evaluate(bootstrap2)}\n"
      # f"Ensemble: {evaluate(ensemble)}\n"
      )

Label: 47
fewshot: 47
bootstrap: 47
bootstrap2: 47



### gpt-3.5 turbo with CoT and reflection

In [None]:
class ThoughtReflection(dspy.Module):
    def __init__(self, num_attempts):
        super().__init__()  # initialize dspy.Module state, as in the other modules
        # Sample num_attempts reasoning chains, then let a comparison module reconcile them.
        self.predict = dspy.ChainOfThought("question -> answer", n=num_attempts)
        self.compare = dspy.MultiChainComparison('question -> answer', M=num_attempts)

    def forward(self, question):
        completions = self.predict(question=question).completions
        return self.compare(question=question, completions=completions)

reflection = ThoughtReflection(num_attempts=5)

# Compiling
fewshot = LabeledFewShot(k=8).compile(reflection, trainset=gsm8k_trainset)

tp = BootstrapFewShotWithRandomSearch(metric=gsm8k_metric)
bootstrap = tp.compile(reflection , trainset=gsm8k_trainset , valset=gsm8k_devset)
bootstrap2 = tp.compile(reflection , teacher=bootstrap, trainset=gsm8k_trainset, valset=gsm8k_devset)
# ensemble = Ensemble(reduce_fn=dspy.majority).compile()
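
`ThoughtReflection` samples several reasoning chains and asks a second module to reconcile them. A simpler baseline for combining multiple attempts, in the spirit of the commented-out `Ensemble(reduce_fn=dspy.majority)` line, is a plain majority vote over the answers. The sketch below is a hypothetical pure-Python version. `majority_vote` and its normalization are assumptions for illustration, not the behavior of `dspy.majority`.

```python
from collections import Counter


def normalize_answer(answer: str) -> str:
    """Light normalization so '47' and '47 ' count as the same vote."""
    return answer.strip().lower().rstrip(".")


def majority_vote(answers: list) -> str:
    """Return the most frequent answer; the earliest one wins ties."""
    if not answers:
        raise ValueError("need at least one answer to vote on")
    counts = Counter(normalize_answer(answer) for answer in answers)
    top_count = max(counts.values())
    for answer in answers:
        if counts[normalize_answer(answer)] == top_count:
            return answer


completions = ["Paris", "Paris", "Paris", "The capital of France is Paris.", "Paris"]
print(majority_vote(completions))
```

A vote only counts surface forms, so a verbose but correct completion is treated as a dissenting answer. Comparing the chains with an LM, as `ThoughtReflection` does, is one way around that.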

In [None]:
print(f"Label: {gsm8k_devset[q].answer}\n"
      f"zeroshot: {reflection(gsm8k_devset[q].question).answer}\n"
      f"fewshot: {fewshot(gsm8k_devset[q].question).answer}\n"
      f"bootstrap: {bootstrap(gsm8k_devset[q].question).answer}\n"
      f"bootstrap2: {bootstrap2(gsm8k_devset[q].question).answer}\n"
      # f"Ensemble: {bootstrap2(gsm8k_devset[q].question).answer}\n"
      )

print("Now, a systematic evaluation")
evaluate = Evaluate(devset=gsm8k_devset, metric=gsm8k_metric, num_threads=4, display_progress=True, display_table=0)
print(f"\nScores:\n"
      f"zeroshot:   {evaluate(reflection)}\n"
      f"fewshot:    {evaluate(fewshot)}\n"
      f"bootstrap:  {evaluate(bootstrap)}\n"
      f"bootstrap2: {evaluate(bootstrap2)}\n"
      # f"Ensemble: {evaluate(ensemble)}\n"
      )

Note that BootstrapFewShot is not an optimizing teleprompter, i.e., it simply creates and validates examples for the steps of the pipeline but does not optimize the metric.

Other teleprompters like BootstrapFewShotWithRandomSearch and MIPRO will apply direct optimization.
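
The difference can be sketched as a loop: instead of accepting the first set of validated demonstrations, a random-search optimizer draws several candidate sets, scores each one on a validation set, and keeps the best. The code below is a toy version under stated assumptions. `toy_program`, `score`, and `random_search` are hypothetical, and the "program" only answers correctly when a demo shows it the right operation.

```python
import random


def toy_program(question: str, demos: list) -> str:
    """Hypothetical program: only handles operations it has seen in its demos."""
    operation = question.split()[0]
    known = {demo["question"].split()[0] for demo in demos}
    if operation not in known:
        return "unknown"
    a, b = [int(token) for token in question.split() if token.isdigit()]
    return str(a + b if operation == "Add" else a * b)


def score(demos: list, valset: list) -> float:
    hits = sum(toy_program(ex["question"], demos) == ex["answer"] for ex in valset)
    return 100.0 * hits / len(valset)


def random_search(pool, valset, num_candidates, demos_per_set, seed=0):
    """Draw candidate demo sets at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_demos, best_score = [], score([], valset)
    for _ in range(num_candidates):
        demos = rng.sample(pool, demos_per_set)
        candidate_score = score(demos, valset)
        if candidate_score > best_score:
            best_demos, best_score = demos, candidate_score
    return best_demos, best_score


pool = [
    {"question": "Add 1 and 2", "answer": "3"},
    {"question": "Add 2 and 5", "answer": "7"},
    {"question": "Multiply 2 and 3", "answer": "6"},
    {"question": "Multiply 4 and 5", "answer": "20"},
]
valset = [
    {"question": "Add 3 and 4", "answer": "7"},
    {"question": "Multiply 3 and 3", "answer": "9"},
    {"question": "Add 10 and 1", "answer": "11"},
    {"question": "Multiply 2 and 8", "answer": "16"},
]
best_demos, best_score = random_search(pool, valset, num_candidates=16, demos_per_set=2)
print(best_score, [demo["question"] for demo in best_demos])
```

Because the winner is selected on the validation set, scoring it again on that same set overstates its quality. In the cells above, the dev set serves as both `valset` and evaluation set, so the reported scores are likely optimistic.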

## HotPotQA Dataset

HotPotQA is a multi-hop QA dataset. 

The DSPy authors used CoT, ReAct, and multi-hop as model pipelines, compiled with bootstrapping and by fine-tuning T5-large.

In [191]:
from dspy.datasets.hotpotqa import HotPotQA
from dspy.teleprompt import (
    LabeledFewShot,
    BootstrapFewShotWithRandomSearch,
    BootstrapFinetune,
    Ensemble
)
from dspy.evaluate import Evaluate
from dspy.evaluate import answer_exact_match

gpt35_turbo = dspy.OpenAI(model='gpt-3.5-turbo-instruct')
colbertv2_wiki17_abstracts = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
dspy.settings.configure(lm=gpt35_turbo, rm=colbertv2_wiki17_abstracts)

In [192]:
dataset = HotPotQA(train_size=10, dev_size=10)
hotpot_trainset = [x.with_inputs('question') for x in dataset.train]
hotpot_devset = [x.with_inputs('question') for x in dataset.dev]
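
HotPotQA answers are short strings, and the runs below score them with `answer_exact_match`. The sketch here shows what an exact-match metric of this kind typically does, assuming a SQuAD-style normalization (lowercase, strip punctuation and articles). `normalize` and `toy_exact_match` are hypothetical, and the real metric's rules may differ.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def toy_exact_match(gold_answer: str, predicted_answer: str) -> bool:
    return normalize(gold_answer) == normalize(predicted_answer)


print(toy_exact_match("Paris", "paris."))
print(toy_exact_match("Paris", "The capital of France is Paris."))
```

A metric this strict rejects verbose but correct answers, which is worth keeping in mind when reading low exact-match scores.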


In [206]:
class Vanilla(dspy.Module):
    def __init__(self):
        super().__init__()
        self.predict = dspy.Predict("question -> answer")
    
    def forward(self, question):
        return self.predict(question=question)


class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()  # initialize dspy.Module state, as in Vanilla and React
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        # Retrieve supporting passages first, then answer with them as context.
        context = self.retrieve(question).passages
        return self.generate_answer(context=context, question=question)


class React(dspy.Module):
    def __init__(self):
        super().__init__()
        self.react = dspy.ReAct("question  -> answer", tools=[dspy.Retrieve(k=1)], max_iters=5)

    def forward(self, question):
        return self.react(question=question)


class BasicMultiHop(dspy.Module):
    def __init__(self, passages_per_hop=3):
        super().__init__()  # initialize dspy.Module state, as in Vanilla and React
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_query = dspy.ChainOfThought("context, question -> search_query")
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = []
        # Two hops: each hop writes a new search query from what has been retrieved so far.
        for hop in range(2):
            query = self.generate_query(context=context, question=question).search_query
            context += self.retrieve(query).passages

        return self.generate_answer(context=context, question=question)

# Pipelines
vanilla = Vanilla()
cot = RAG(num_passages=5)
react = React()
multihop = BasicMultiHop()

In [207]:
# Compilations
for model in [cot]:  # , react, multihop]:
    fewshot = LabeledFewShot(k=8).compile(model, trainset=hotpot_trainset)

    tp = BootstrapFewShotWithRandomSearch(metric=answer_exact_match)
    bootstrap = tp.compile(model, trainset=hotpot_trainset, valset=hotpot_devset)
    # multihop_t5 = BootstrapFinetune(metric=answer_exact_match).compile(model, teacher=bootstrap, trainset=hotpot_trainset, target='t5-large')

    # evaluate = Evaluate(devset=hotpot_devset, metric=answer_exact_match, num_threads=4, display_progress=True, display_table=0)
    # print(f"\nScores for {model}:\n"
    #     f"zeroshot:   {evaluate(model)}\n"
    #     f"fewshot:    {evaluate(fewshot)}\n"
    #     f"bootstrap:  {evaluate(bootstrap)}\n"
    #     # f"T5-FT: {evaluate(multihop_t5)}\n"
    #     )

Going to sample between 1 and 4 traces per predictor.
Will attempt to train 16 candidate sets.


Average Metric: 1 / 10  (10.0): 100%|██████████| 10/10 [00:04<00:00,  2.17it/s]


Average Metric: 1 / 10  (10.0%)
Score: 10.0 for set: [0]
New best score: 10.0 for seed -3
Scores so far: [10.0]
Best score: 10.0


Average Metric: 2 / 10  (20.0): 100%|██████████| 10/10 [00:04<00:00,  2.22it/s]


Average Metric: 2 / 10  (20.0%)
Score: 20.0 for set: [10]
New best score: 20.0 for seed -2
Scores so far: [10.0, 20.0]
Best score: 20.0


100%|██████████| 10/10 [00:15<00:00,  1.56s/it]


Bootstrapped 3 full traces after 10 examples in round 0.


Average Metric: 3 / 10  (30.0): 100%|██████████| 10/10 [00:02<00:00,  3.45it/s]


Average Metric: 3 / 10  (30.0%)
Score: 30.0 for set: [10]
New best score: 30.0 for seed -1
Scores so far: [10.0, 20.0, 30.0]
Best score: 30.0
Average of max per entry across top 1 scores: 0.3
Average of max per entry across top 2 scores: 0.5
Average of max per entry across top 3 scores: 0.5
Average of max per entry across top 5 scores: 0.5
Average of max per entry across top 8 scores: 0.5
Average of max per entry across top 9999 scores: 0.5


100%|██████████| 10/10 [00:14<00:00,  1.50s/it]


Bootstrapped 4 full traces after 10 examples in round 0.


Average Metric: 3 / 10  (30.0): 100%|██████████| 10/10 [00:03<00:00,  2.74it/s]


Average Metric: 3 / 10  (30.0%)
Score: 30.0 for set: [10]
Scores so far: [10.0, 20.0, 30.0, 30.0]
Best score: 30.0
Average of max per entry across top 1 scores: 0.3
Average of max per entry across top 2 scores: 0.4
Average of max per entry across top 3 scores: 0.5
Average of max per entry across top 5 scores: 0.5
Average of max per entry across top 8 scores: 0.5
Average of max per entry across top 9999 scores: 0.5


 40%|████      | 4/10 [00:06<00:09,  1.58s/it]


Bootstrapped 2 full traces after 5 examples in round 0.


Average Metric: 3 / 10  (30.0): 100%|██████████| 10/10 [00:03<00:00,  2.99it/s]


Average Metric: 3 / 10  (30.0%)
Score: 30.0 for set: [10]
Scores so far: [10.0, 20.0, 30.0, 30.0, 30.0]
Best score: 30.0
Average of max per entry across top 1 scores: 0.3
Average of max per entry across top 2 scores: 0.4
Average of max per entry across top 3 scores: 0.6
Average of max per entry across top 5 scores: 0.6
Average of max per entry across top 8 scores: 0.6
Average of max per entry across top 9999 scores: 0.6


 40%|████      | 4/10 [00:06<00:10,  1.70s/it]


Bootstrapped 1 full traces after 5 examples in round 0.


Average Metric: 3 / 10  (30.0): 100%|██████████| 10/10 [00:03<00:00,  2.75it/s]


Average Metric: 3 / 10  (30.0%)
Score: 30.0 for set: [10]
Scores so far: [10.0, 20.0, 30.0, 30.0, 30.0, 30.0]
Best score: 30.0
Average of max per entry across top 1 scores: 0.3
Average of max per entry across top 2 scores: 0.4
Average of max per entry across top 3 scores: 0.6
Average of max per entry across top 5 scores: 0.8
Average of max per entry across top 8 scores: 0.8
Average of max per entry across top 9999 scores: 0.8


 30%|███       | 3/10 [00:04<00:10,  1.51s/it]


Bootstrapped 2 full traces after 4 examples in round 0.


Average Metric: 2 / 10  (20.0): 100%|██████████| 10/10 [00:03<00:00,  2.56it/s]


Average Metric: 2 / 10  (20.0%)
Score: 20.0 for set: [10]
Scores so far: [10.0, 20.0, 30.0, 30.0, 30.0, 30.0, 20.0]
Best score: 30.0
Average of max per entry across top 1 scores: 0.3
Average of max per entry across top 2 scores: 0.4
Average of max per entry across top 3 scores: 0.6
Average of max per entry across top 5 scores: 0.8
Average of max per entry across top 8 scores: 0.8
Average of max per entry across top 9999 scores: 0.8


 40%|████      | 4/10 [00:05<00:08,  1.34s/it]


Bootstrapped 2 full traces after 5 examples in round 0.


Average Metric: 3 / 10  (30.0): 100%|██████████| 10/10 [00:04<00:00,  2.25it/s]


Average Metric: 3 / 10  (30.0%)
Score: 30.0 for set: [10]
Scores so far: [10.0, 20.0, 30.0, 30.0, 30.0, 30.0, 20.0, 30.0]
Best score: 30.0
Average of max per entry across top 1 scores: 0.3
Average of max per entry across top 2 scores: 0.4
Average of max per entry across top 3 scores: 0.6
Average of max per entry across top 5 scores: 0.9
Average of max per entry across top 8 scores: 0.9
Average of max per entry across top 9999 scores: 0.9


 90%|█████████ | 9/10 [00:12<00:01,  1.35s/it]


Bootstrapped 3 full traces after 10 examples in round 0.


Average Metric: 1 / 4  (25.0):  30%|███       | 3/10 [00:01<00:03,  2.08it/s]

Backing off 0.9 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}
Backing off 0.7 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 1 / 5  (20.0):  50%|█████     | 5/10 [00:03<00:03,  1.61it/s]

Backing off 0.9 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}
Backing off 0.6 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 1 / 6  (16.7):  60%|██████    | 6/10 [00:05<00:04,  1.01s/it]

Backing off 0.3 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}
Backing off 1.1 seconds after 2 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 1 / 7  (14.3):  70%|███████   | 7/10 [00:07<00:04,  1.35s/it]

Backing off 0.0 seconds after 2 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 2 / 8  (25.0):  80%|████████  | 8/10 [00:09<00:02,  1.41s/it]

Backing off 1.1 seconds after 3 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 3 / 10  (30.0): 100%|██████████| 10/10 [00:16<00:00,  1.68s/it]


Average Metric: 3 / 10  (30.0%)
Score: 30.0 for set: [10]
Scores so far: [10.0, 20.0, 30.0, 30.0, 30.0, 30.0, 20.0, 30.0, 30.0]
Best score: 30.0
Average of max per entry across top 1 scores: 0.3
Average of max per entry across top 2 scores: 0.4
Average of max per entry across top 3 scores: 0.6
Average of max per entry across top 5 scores: 0.9
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


 40%|████      | 4/10 [00:05<00:08,  1.49s/it]


Bootstrapped 1 full traces after 5 examples in round 0.


Average Metric: 2 / 10  (20.0): 100%|██████████| 10/10 [00:04<00:00,  2.13it/s]


Average Metric: 2 / 10  (20.0%)
Score: 20.0 for set: [10]
Scores so far: [10.0, 20.0, 30.0, 30.0, 30.0, 30.0, 20.0, 30.0, 30.0, 20.0]
Best score: 30.0
Average of max per entry across top 1 scores: 0.3
Average of max per entry across top 2 scores: 0.4
Average of max per entry across top 3 scores: 0.6
Average of max per entry across top 5 scores: 0.9
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


 80%|████████  | 8/10 [00:11<00:02,  1.47s/it]


Bootstrapped 3 full traces after 9 examples in round 0.


Average Metric: 1 / 4  (25.0):  40%|████      | 4/10 [00:02<00:03,  1.97it/s]

Backing off 0.4 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}
Backing off 0.6 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 1 / 5  (20.0):  50%|█████     | 5/10 [00:04<00:04,  1.05it/s]

Backing off 0.7 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}
Backing off 0.5 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 1 / 6  (16.7):  60%|██████    | 6/10 [00:06<00:05,  1.35s/it]

Backing off 0.9 seconds after 2 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}
Backing off 0.4 seconds after 2 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 1 / 7  (14.3):  70%|███████   | 7/10 [00:08<00:04,  1.64s/it]

Backing off 0.9 seconds after 2 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 1 / 8  (12.5):  80%|████████  | 8/10 [00:10<00:03,  1.66s/it]

Backing off 3.6 seconds after 3 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 2 / 10  (20.0): 100%|██████████| 10/10 [00:16<00:00,  1.63s/it]


Average Metric: 2 / 10  (20.0%)
Score: 20.0 for set: [10]
Scores so far: [10.0, 20.0, 30.0, 30.0, 30.0, 30.0, 20.0, 30.0, 30.0, 20.0, 20.0]
Best score: 30.0
Average of max per entry across top 1 scores: 0.3
Average of max per entry across top 2 scores: 0.4
Average of max per entry across top 3 scores: 0.6
Average of max per entry across top 5 scores: 0.9
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


 60%|██████    | 6/10 [00:08<00:05,  1.40s/it]


Bootstrapped 2 full traces after 7 examples in round 0.


Average Metric: 2 / 6  (33.3):  60%|██████    | 6/10 [00:02<00:01,  2.95it/s]

Backing off 0.8 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 3 / 7  (42.9):  70%|███████   | 7/10 [00:03<00:01,  1.98it/s]

Backing off 0.2 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}
Backing off 0.1 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 3 / 10  (30.0): 100%|██████████| 10/10 [00:07<00:00,  1.26it/s]


Average Metric: 3 / 10  (30.0%)
Score: 30.0 for set: [10]
Scores so far: [10.0, 20.0, 30.0, 30.0, 30.0, 30.0, 20.0, 30.0, 30.0, 20.0, 20.0, 30.0]
Best score: 30.0
Average of max per entry across top 1 scores: 0.3
Average of max per entry across top 2 scores: 0.4
Average of max per entry across top 3 scores: 0.6
Average of max per entry across top 5 scores: 0.9
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


100%|██████████| 10/10 [00:16<00:00,  1.67s/it]


Bootstrapped 3 full traces after 10 examples in round 0.


Average Metric: 2 / 7  (28.6):  70%|███████   | 7/10 [00:04<00:02,  1.15it/s] 

Backing off 0.5 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}
Backing off 0.6 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 3 / 10  (30.0): 100%|██████████| 10/10 [00:10<00:00,  1.09s/it]


Average Metric: 3 / 10  (30.0%)
Score: 30.0 for set: [10]
Scores so far: [10.0, 20.0, 30.0, 30.0, 30.0, 30.0, 20.0, 30.0, 30.0, 20.0, 20.0, 30.0, 30.0]
Best score: 30.0
Average of max per entry across top 1 scores: 0.3
Average of max per entry across top 2 scores: 0.4
Average of max per entry across top 3 scores: 0.6
Average of max per entry across top 5 scores: 0.9
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


 30%|███       | 3/10 [00:04<00:10,  1.44s/it]


Bootstrapped 1 full traces after 4 examples in round 0.


Average Metric: 1 / 4  (25.0):  40%|████      | 4/10 [00:01<00:01,  3.53it/s]

Backing off 0.3 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}
Backing off 0.8 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 1 / 6  (16.7):  60%|██████    | 6/10 [00:03<00:02,  1.83it/s]

Backing off 1.5 seconds after 2 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}
Backing off 0.7 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}
Backing off 0.4 seconds after 2 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 3 / 10  (30.0): 100%|██████████| 10/10 [00:07<00:00,  1.33it/s]


Average Metric: 3 / 10  (30.0%)
Score: 30.0 for set: [10]
Scores so far: [10.0, 20.0, 30.0, 30.0, 30.0, 30.0, 20.0, 30.0, 30.0, 20.0, 20.0, 30.0, 30.0, 30.0]
Best score: 30.0
Average of max per entry across top 1 scores: 0.3
Average of max per entry across top 2 scores: 0.4
Average of max per entry across top 3 scores: 0.6
Average of max per entry across top 5 scores: 0.9
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


100%|██████████| 10/10 [00:16<00:00,  1.64s/it]


Bootstrapped 3 full traces after 10 examples in round 0.


Average Metric: 1 / 4  (25.0):  40%|████      | 4/10 [00:02<00:02,  2.47it/s]

Backing off 0.2 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 1 / 6  (16.7):  60%|██████    | 6/10 [00:03<00:02,  1.85it/s]

Backing off 0.0 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 1 / 7  (14.3):  70%|███████   | 7/10 [00:05<00:03,  1.10s/it]

Backing off 0.5 seconds after 2 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 2 / 8  (25.0):  80%|████████  | 8/10 [00:06<00:02,  1.12s/it]

Backing off 1.0 seconds after 2 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 3 / 10  (30.0): 100%|██████████| 10/10 [00:10<00:00,  1.09s/it]


Average Metric: 3 / 10  (30.0%)
Score: 30.0 for set: [10]
Scores so far: [10.0, 20.0, 30.0, 30.0, 30.0, 30.0, 20.0, 30.0, 30.0, 20.0, 20.0, 30.0, 30.0, 30.0, 30.0]
Best score: 30.0
Average of max per entry across top 1 scores: 0.3
Average of max per entry across top 2 scores: 0.4
Average of max per entry across top 3 scores: 0.6
Average of max per entry across top 5 scores: 0.9
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


100%|██████████| 10/10 [00:15<00:00,  1.52s/it]


Bootstrapped 4 full traces after 10 examples in round 0.


Average Metric: 1 / 5  (20.0):  50%|█████     | 5/10 [00:03<00:03,  1.32it/s]

Backing off 0.8 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}
Backing off 0.1 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}
Backing off 0.4 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 1 / 6  (16.7):  60%|██████    | 6/10 [00:05<00:04,  1.06s/it]

Backing off 0.6 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 1 / 7  (14.3):  70%|███████   | 7/10 [00:08<00:05,  1.73s/it]

Backing off 0.4 seconds after 2 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 2 / 10  (20.0): 100%|██████████| 10/10 [00:15<00:00,  1.54s/it]


Average Metric: 2 / 10  (20.0%)
Score: 20.0 for set: [10]
Scores so far: [10.0, 20.0, 30.0, 30.0, 30.0, 30.0, 20.0, 30.0, 30.0, 20.0, 20.0, 30.0, 30.0, 30.0, 30.0, 20.0]
Best score: 30.0
Average of max per entry across top 1 scores: 0.3
Average of max per entry across top 2 scores: 0.4
Average of max per entry across top 3 scores: 0.6
Average of max per entry across top 5 scores: 0.9
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


 60%|██████    | 6/10 [00:09<00:06,  1.55s/it]


Bootstrapped 3 full traces after 7 examples in round 0.


Average Metric: 2 / 4  (50.0):  40%|████      | 4/10 [00:03<00:04,  1.30it/s] 

Backing off 0.7 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}
Backing off 0.5 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 2 / 5  (40.0):  50%|█████     | 5/10 [00:04<00:04,  1.06it/s]

Backing off 0.2 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}
Backing off 0.7 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 3 / 6  (50.0):  60%|██████    | 6/10 [00:06<00:05,  1.36s/it]

Backing off 0.0 seconds after 2 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}
Backing off 1.5 seconds after 2 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 3 / 8  (37.5):  80%|████████  | 8/10 [00:10<00:03,  1.67s/it]

Backing off 3.1 seconds after 3 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 3 / 10  (30.0): 100%|██████████| 10/10 [00:16<00:00,  1.62s/it]


Average Metric: 3 / 10  (30.0%)
Score: 30.0 for set: [10]
Scores so far: [10.0, 20.0, 30.0, 30.0, 30.0, 30.0, 20.0, 30.0, 30.0, 20.0, 20.0, 30.0, 30.0, 30.0, 30.0, 20.0, 30.0]
Best score: 30.0
Average of max per entry across top 1 scores: 0.3
Average of max per entry across top 2 scores: 0.4
Average of max per entry across top 3 scores: 0.6
Average of max per entry across top 5 scores: 0.9
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


 30%|███       | 3/10 [00:03<00:09,  1.32s/it]


Bootstrapped 1 full traces after 4 examples in round 0.


Average Metric: 1 / 5  (20.0):  50%|█████     | 5/10 [00:01<00:01,  3.43it/s]

Backing off 0.1 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}
Backing off 0.4 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 1 / 6  (16.7):  60%|██████    | 6/10 [00:02<00:01,  2.00it/s]

Backing off 0.2 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}
Backing off 1.0 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 2 / 7  (28.6):  70%|███████   | 7/10 [00:03<00:02,  1.33it/s]

Backing off 1.0 seconds after 2 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 2 / 10  (20.0): 100%|██████████| 10/10 [00:06<00:00,  1.45it/s]


Average Metric: 2 / 10  (20.0%)
Score: 20.0 for set: [10]
Scores so far: [10.0, 20.0, 30.0, 30.0, 30.0, 30.0, 20.0, 30.0, 30.0, 20.0, 20.0, 30.0, 30.0, 30.0, 30.0, 20.0, 30.0, 20.0]
Best score: 30.0
Average of max per entry across top 1 scores: 0.3
Average of max per entry across top 2 scores: 0.4
Average of max per entry across top 3 scores: 0.6
Average of max per entry across top 5 scores: 0.9
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0


 50%|█████     | 5/10 [00:07<00:07,  1.56s/it]


Bootstrapped 2 full traces after 6 examples in round 0.


Average Metric: 1 / 4  (25.0):  40%|████      | 4/10 [00:02<00:02,  2.18it/s]

Backing off 0.1 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}
Backing off 0.2 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 1 / 5  (20.0):  50%|█████     | 5/10 [00:03<00:03,  1.63it/s]

Backing off 0.8 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}
Backing off 0.5 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 2 / 6  (33.3):  60%|██████    | 6/10 [00:04<00:03,  1.09it/s]

Backing off 1.6 seconds after 2 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}
Backing off 0.7 seconds after 1 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 2 / 7  (28.6):  70%|███████   | 7/10 [00:05<00:02,  1.06it/s]

Backing off 2.0 seconds after 2 tries calling function <function GPT3.request at 0x118e89b40> with kwargs {}


Average Metric: 2 / 10  (20.0): 100%|██████████| 10/10 [00:10<00:00,  1.10s/it]

Average Metric: 2 / 10  (20.0%)
Score: 20.0 for set: [10]
Scores so far: [10.0, 20.0, 30.0, 30.0, 30.0, 30.0, 20.0, 30.0, 30.0, 20.0, 20.0, 30.0, 30.0, 30.0, 30.0, 20.0, 30.0, 20.0, 20.0]
Best score: 30.0
Average of max per entry across top 1 scores: 0.3
Average of max per entry across top 2 scores: 0.4
Average of max per entry across top 3 scores: 0.6
Average of max per entry across top 5 scores: 0.9
Average of max per entry across top 8 scores: 1.0
Average of max per entry across top 9999 scores: 1.0
19 candidate programs found.
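
The "Average of max per entry across top K scores" lines appear to report, for each dev example, the best result among the K highest-scoring candidate programs, averaged over the dev set. Here "top 1" (0.3) matches the best single program (30%), and "top 8" (1.0) suggests every dev example is solved by at least one of the eight best candidates. A minimal sketch of that calculation follows. The data and the `avg_of_max_per_entry` helper are hypothetical and not taken from DSPy's internals.

```python
# Hypothetical per-example scores (1 = correct, 0 = wrong) for three candidate
# programs on a five-example dev set; the numbers are made up for illustration.
candidates = [
    {"score": 40.0, "subscores": [1, 0, 0, 1, 0]},
    {"score": 40.0, "subscores": [0, 1, 0, 1, 0]},
    {"score": 20.0, "subscores": [0, 0, 1, 0, 0]},
]


def avg_of_max_per_entry(candidates, k):
    """Average, over dev examples, of the best score among the top-k programs."""
    # Rank programs by overall score and keep the k best
    top_k = sorted(candidates, key=lambda c: c["score"], reverse=True)[:k]
    # Regroup the scores so that each tuple holds one dev example's results
    per_example = zip(*(c["subscores"] for c in top_k))
    best = [max(entry) for entry in per_example]
    return sum(best) / len(best)


for k in (1, 2, 3):
    print(k, avg_of_max_per_entry(candidates, k))
```

The value can only grow with K, since adding programs never lowers a per-example maximum; a large gap between "top 1" and "top 8" hints that the candidates solve different examples.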





In [216]:
print(hotpot_devset[0].answer)
print(bootstrap(hotpot_devset[0].question).answer)

print(gpt35_turbo.history[0]['prompt'])
print(gpt35_turbo.history[0]['response']['choices'][0]['text'])

Lowell Thomas George
Lowell George
Given the fields `context`, `question`, produce the fields `answer`.

---

Follow the following format.

Context: ${context}

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the answer}. We ...

Answer: ${answer}

---

Context:
[1] «Bill Payne | Bill Payne (born March 12, 1949) is an American pianist who co-founded with Lowell George the American rock band Little Feat. He is considered by many other rock pianists, including Elton John, to be one of the finest American piano rock and blues musicians. In addition to his trademark barrelhouse blues piano, he is noted for his work on the Hammond B3 organ. Payne is an accomplished songwriter whose credits include "Oh, Atlanta". Following the death of Little Feat drummer Richie Hayward on August 12, 2010, Payne is the only member of the group from the original four-piece line-up currently playing in the band.»
[2] «Lowell Mason | Lowell Mason (January 8, 1792 – August 11, 18

In [210]:
evaluate = Evaluate(devset=hotpot_devset, metric=answer_exact_match, num_threads=4, display_progress=True, display_table=0)
print(f"\nScores for cot:\n"
    f"zeroshot:   {evaluate(cot)}\n"
    f"fewshot:    {evaluate(fewshot)}\n"
    f"bootstrap:  {evaluate(bootstrap)}\n"
    # f"T5-FT: {evaluate(multihop_t5)}\n"
    )

Average Metric: 1 / 10  (10.0): 100%|██████████| 10/10 [00:00<00:00, 2353.57it/s]


Average Metric: 1 / 10  (10.0%)


Average Metric: 2 / 10  (20.0): 100%|██████████| 10/10 [00:04<00:00,  2.47it/s]


Average Metric: 2 / 10  (20.0%)


Average Metric: 3 / 10  (30.0): 100%|██████████| 10/10 [00:00<00:00, 1026.81it/s]

Average Metric: 3 / 10  (30.0%)

Scores for cot:
zeroshot:   10.0
fewshot:    20.0
bootstrap:  30.0






# Let's do something fun now

In [131]:
import pandas as pd
import numpy as np
import random
random.seed(42)

from dspy.teleprompt import (
    LabeledFewShot,
    BootstrapFewShotWithRandomSearch,
)

gpt35_turbo = dspy.OpenAI(model='gpt-3.5-turbo-instruct')
dspy.settings.configure(lm=gpt35_turbo)

df = pd.read_csv("../data/12744_ocm_dataset.csv", sep=";").astype(str)
df = df.rename(columns={"prompt": "procedure", "completion": "C2_yield"})

# Create dataset
dataset = [dspy.Example(x).with_inputs('procedure') for x in df.to_dict(orient='records')]
train_dataset = dataset[:1000]
dev_dataset = random.sample(dataset[1000:], 200)


In [None]:
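def mae(example, pred, trace=None):
    """Negative mean absolute error; expects a list of examples and a list of floats."""
    try:
        pred = np.array(pred)
        labels = np.array([float(exp.C2_yield) for exp in example])
    except Exception:
        # Inputs could not be parsed: fall back to a fixed penalty score
        return -10
    if len(pred) == 0:
        # Nothing to score, e.g. every completion failed to parse
        return float("nan")
    return -np.sum(np.absolute(pred - labels)) / len(pred)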

class OCMSignature(dspy.Signature):
    """Predict C2 yield from a given experimental procedure"""

    # context = dspy.InputField(desc="may contain relevant facts")
    procedure = dspy.InputField(desc="A description of the experiment")
    C2_yield = dspy.OutputField(desc="C2 yield in %")


class Vanilla(dspy.Module):
    def __init__(self):
        super().__init__()
        self.predict = dspy.Predict("procedure -> C2_yield")
        # self.predict = dspy.Predict(OCMSignature)
    
    def forward(self, procedure):
        return self.predict(procedure=procedure)


class CoT(dspy.Module):
    def __init__(self):
        super().__init__()
        self.cot = dspy.ChainOfThought("procedure -> C2_yield")

    def forward(self, procedure):
        return self.cot(procedure=procedure)

vanilla = Vanilla()
cot = CoT()

fewshot = LabeledFewShot(k=5).compile(vanilla, trainset=train_dataset)

tp = BootstrapFewShotWithRandomSearch(metric=mae)
bootstrap = tp.compile(fewshot, trainset=train_dataset, valset=dev_dataset[:10])
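
DSPy optimizers generally call the metric on one example and one prediction at a time, as `metric(example, pred, trace)`, whereas `mae` above expects lists. A per-example variant might look like the sketch below. The name `neg_abs_error` and the `penalty` value are assumptions, and `SimpleNamespace` objects stand in for `dspy.Example` and `dspy.Prediction` so the snippet runs without an LM.

```python
from types import SimpleNamespace


def neg_abs_error(example, pred, trace=None, penalty=-100.0):
    """Per-example metric: negative absolute error between predicted and true yield.

    `penalty` is a hypothetical score for completions that cannot be parsed.
    """
    try:
        y_true = float(example.C2_yield)
        y_pred = float(pred.C2_yield)
    except (AttributeError, TypeError, ValueError):
        return penalty
    return -abs(y_pred - y_true)


# Stand-ins for dspy.Example / dspy.Prediction
example = SimpleNamespace(C2_yield="5.82")
print(neg_abs_error(example, SimpleNamespace(C2_yield="6.00")))
print(neg_abs_error(example, SimpleNamespace(C2_yield="n.a.")))
```

A negative number is still truthy in Python, so if the optimizer uses the metric to accept or reject bootstrapped traces, a thresholded boolean may be the better return value during compilation.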

In [178]:
print(dev_dataset[0])
bootstrap(dev_dataset[0].procedure)

Example({'procedure': 'To synthesize Na2WO4/SiO2, SiO2 (1.0 g) was impregnated with 4.5 mL of an aqueous solution consiting of n.a. (0%), Na (67%), W (33%), at 50 ºC for 6 h. Once activated the reaction is ran at 900 ºC. The total flow rate was 15 mL/min (Ar: 10.5 mL/min, CH4: 3.4 mL/min, O2: 1.1 mL/min), leading to a contact time of 0.5 s.', 'C2_yield': '5.82'}) (input_keys={'procedure'})


Prediction(
    C2_yield='n.a.'
)

In [176]:
gpt35_turbo.inspect_history()





Given the fields `procedure`, produce the fields `C2_yield`.

---

Follow the following format.

Procedure: ${procedure}
C 2 Yield: ${C2_yield}

---

Procedure: To synthesize Na2WO4/SiO2, SiO2 (1.0 g) was impregnated with 4.5 mL of an aqueous solution consiting of n.a. (0%), Na (67%), W (33%), at 50 ºC for 6 h. Once activated the reaction is ran at 900 ºC. The total flow rate was 15 mL/min (Ar: 10.5 mL/min, CH4: 3.4 mL/min, O2: 1.1 mL/min), leading to a contact time of 0.5 s.
C 2 Yield: n.a.





In [179]:

def evaluate(model, devset, metric, verbose=False):
    """Run `model` over `devset` and score only completions that parse as floats."""
    pred = []
    examples = []
    failed = 0
    for example in devset:
        yhat = None  # stays defined even if the model call itself raises
        try:
            yhat = model(example.procedure).C2_yield
            pred.append(float(yhat))
            examples.append(example)
        except Exception:
            failed += 1
            if verbose:
                print("WARNING: Failed to parse response for example:", example)
                print("WARNING: Model completed:", yhat, "\n")
    return metric(examples, pred), pred, examples, failed

In [181]:
m, yhat, y, f = evaluate(bootstrap, dev_dataset[:2], mae)
print(m, f)
yhat, y

nan 2


([], [])
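
The `nan 2` result above means both completions failed `float()` (the model answered `n.a.`), so the metric received two empty lists. When the model wraps the number in extra text, a more lenient parser can rescue some completions. The sketch below is one possible approach; `parse_yield` is a hypothetical helper, not part of DSPy.

```python
import re

# A number that is not glued to a preceding letter, digit, or dot,
# so the "2" in "C2" is skipped
_NUMBER = re.compile(r"(?<![\w.])[-+]?\d+(?:\.\d+)?")


def parse_yield(text):
    """Return the first standalone number in `text` as a float, or None."""
    match = _NUMBER.search(str(text))
    return float(match.group()) if match else None


print(parse_yield("5.82"))
print(parse_yield("C2 yield: 12.4 %"))
print(parse_yield("n.a."))
```

Completions such as `n.a.` still return `None` and should keep counting as failures; the parser only helps when a number is actually present.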

In [None]:
evaluate = Evaluate(devset=dev_dataset, metric=dspy.bleu, num_threads=4, display_progress=True, display_table=0)
print(f"\nScores for cot:\n"
    f"zeroshot:   {evaluate(cot)}\n"
    f"fewshot:    {evaluate(fewshot)}\n"
    # f"bootstrap:  {evaluate(bootstrap)}\n"
    )

In [None]:
evaluate = Evaluate(devset=dev_dataset, metric=answer_exact_match, num_threads=4, display_progress=True, display_table=0)
print(f"\nScores for cot:\n"
      f"vanilla:  {evaluate(vanilla)}\n"
      f"zeroshot: {evaluate(cot)}\n"
    #   f"fewshot:  {evaluate(fewshot)}\n"
    #   f"bootstrap:{evaluate(bootstrap)}\n"
    )
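
Exact match is a harsh criterion for a numeric target, and `answer_exact_match` appears to compare `answer` fields, while this dataset stores its labels under `C2_yield`. A tolerance-based metric may fit better. The sketch below assumes a tolerance of one percentage point; the name `yield_within_tolerance` and the `abs_tol` default are illustrative choices.

```python
import math
from types import SimpleNamespace


def yield_within_tolerance(example, pred, trace=None, abs_tol=1.0):
    """True when the predicted yield is within `abs_tol` percentage points of the label."""
    try:
        y_true = float(example.C2_yield)
        y_pred = float(pred.C2_yield)
    except (AttributeError, TypeError, ValueError):
        # Unparseable completions count as misses
        return False
    return math.isclose(y_pred, y_true, rel_tol=0.0, abs_tol=abs_tol)


# Stand-ins for dspy.Example / dspy.Prediction
label = SimpleNamespace(C2_yield="5.82")
print(yield_within_tolerance(label, SimpleNamespace(C2_yield="6.3")))
print(yield_within_tolerance(label, SimpleNamespace(C2_yield="n.a.")))
```

Since the function follows the `(example, pred, trace)` convention, it could be passed as `metric=` in place of `answer_exact_match`.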