<a href="https://colab.research.google.com/github/olanigan/DSPy_Cookbook/blob/main/dvc_dspy_parea_rag_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center>
    <p style="text-align:center">
        <img alt="parea logo" src="https://media.dev.to/cdn-cgi/image/width=320,height=320,fit=cover,gravity=auto,format=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F8067%2Fc508b9f7-50ae-43b6-91fc-d8535102b518.png" width="200"/>
        <br>
        <a href="https://docs.parea.ai/">Docs</a>
        |
        <a href="https://github.com/parea-ai/parea-sdk-py">GitHub</a>
        |
        <a href="https://discord.gg/KbHtZqwvsQ">Community</a>
    </p>
</center>
<h1 align="center">Tracing & Evaluating a DSPy Application Using Parea & DVC</h1>

[DSPy](https://github.com/stanfordnlp/dspy) is a framework for automatically prompting and fine-tuning language models. It provides:

- Composable and declarative APIs that allow developers to describe the architecture of their LLM application in the form of a "module" (inspired by PyTorch's `nn.Module`),
- Optimizers (formerly known as "teleprompters") that optimize a user-defined module for a particular task. The optimization could involve selecting few-shot examples, generating prompts, or fine-tuning language models.

[Parea](https://www.parea.ai/) makes your DSPy applications *observable* by visualizing the underlying structure of each call to your compiled DSPy module and surfacing problematic spans of execution based on latency, token count, or other evaluation metrics. Additionally, Parea allows you to *track the performance* of your DSPy modules over time, across different architectures, optimizers, etc.

[DVC's experiment tracking](https://dvc.org/doc/start/experiments/experiment-tracking) lets you associate any evaluated DSPy module with a snapshot of the workspace without polluting the git history. This enables *reproducible experiments*.

In this tutorial, you will:
- Build and optimize DSPy modules that use retrieval-augmented generation and multi-hop reasoning to answer questions over the [HotPotQA](https://hotpotqa.github.io) dataset,
- Instrument your application using [Parea AI](https://parea.ai),
- Inspect the traces of your application to understand the inner workings of a DSPy forward pass,
- Evaluate your modules with experiments,
- Integrate with DVC to make experiments reproducible.

ℹ️ This notebook requires an OpenAI API key.

ℹ️ This notebook requires a Parea API key, which can be created [here](https://docs.parea.ai/api-reference/authentication#parea-api-key).


## 1. Install Dependencies and Import Libraries

Install Parea, DSPy, DVC, and other dependencies.

In [None]:
!pip install "regex~=2023.10.3" pygit2==1.14.1 dspy-ai parea-ai dvc  # DSPy requires an old version of regex that conflicts with the installed version on Colab

Collecting regex~=2023.10.3
  Downloading regex-2023.10.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (773 kB)
Collecting dspy-ai
  Downloading dspy_ai-2.4.9-py3-none-any.whl (220 kB)
Collecting parea-ai
  Downloading parea_ai-0.2.157-py3-none-any.whl (1.4 MB)
Collecting dvc
  Downloading dvc-3.50.2-py3-none-any.whl (451 kB)
Collecting backoff~=2.2.1 (from dspy-ai)
  Downloading backoff-2.2.1-py3-none-any.whl (15 kB)
Collecting joblib~=1.3.2 (from dspy-ai)
  Downloading

⚠️ DSPy conflicts with the default version of the `regex` module that comes pre-installed on Google Colab. If you are running this notebook in Google Colab, you won't need to restart the kernel after running the installation step above.

Also, initialize a git repository and create a first commit if no git repository exists yet. This is necessary for the DVC integration.

In [None]:
!git init
!git add -A
!git config --global user.email "you@example.com"
!git config --global user.name "Your Name"
!git commit -m "Init commit"

hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint:
hint: 	git config --global init.defaultBranch <name>
hint:
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint:
hint: 	git branch -m <name>
Initialized empty Git repository in /content/.git/


Import libraries.

In [None]:
import json
import os
import random
from getpass import getpass

import dspy
import nest_asyncio
import openai
from dsp.utils import deduplicate
from dspy import evaluate as dspy_eval
from dspy.datasets import HotPotQA
from dspy.teleprompt import BootstrapFewShot

from parea import Parea
from parea.helpers import TurnOffPareaLogging
from parea.utils.trace_integrations.dspy import attach_evals_to_module, convert_dspy_examples_to_parea_dicts

## 2. Configure Your OpenAI & Parea API Key

Set your OpenAI & Parea API key if they are not already set as environment variables.

In [None]:
for api_key_name in ["OPENAI_API_KEY", "PAREA_API_KEY"]:
    if not (api_key_value := os.getenv(api_key_name)):
        api_key_value = getpass(f"🔑 Enter your {api_key_name.split('_')[0].title()} API key: ")
    if api_key_name == "OPENAI_API_KEY":
        openai.api_key = api_key_value
    os.environ[api_key_name] = api_key_value

🔑 Enter your Openai API key: ··········
🔑 Enter your Parea API key: ··········


## 3. Configure LM

We will use `gpt-3.5-turbo` as our LLM of choice for this tutorial. Additionally, we will use ColBERTv2 to retrieve Wikipedia abstracts.

In [None]:
turbo = dspy.OpenAI(model="gpt-3.5-turbo")
colbertv2_wiki17_abstracts = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')

dspy.settings.configure(lm=turbo, rm=colbertv2_wiki17_abstracts)

## 4. Load & Index Data

Next we will download the [HotPotQA](https://hotpotqa.github.io) dataset and mark the `question` field as the input field. Then, we can split the data into a training and test set.


In [None]:
dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=50, test_size=0)
train_set = [x.with_inputs('question') for x in dataset.train]
test_set = [x.with_inputs('question') for x in dataset.dev]

len(train_set), len(test_set)

(20, 50)

Each sample in our dataset has a question and a human-annotated answer. The test set also comes with the titles of the Wikipedia articles needed to answer the question (`gold_titles`). This information will be helpful to evaluate the retrieval step.

In [None]:
train_set[0], test_set[0]

(Example({'question': 'At My Window was released by which American singer-songwriter?', 'answer': 'John Townes Van Zandt'}) (input_keys={'question'}),
 Example({'question': 'Are both Cangzhou and Qionghai in the Hebei province of China?', 'answer': 'no', 'gold_titles': {'Qionghai', 'Cangzhou'}}) (input_keys={'question'}))
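To make the effect of marking `question` as the input field more concrete, here is a minimal plain-Python sketch. It does not use DSPy; `split_example` is a hypothetical helper that only illustrates the idea that input keys are passed to the module while the remaining keys serve as labels for evaluation.

```python
def split_example(example: dict, input_keys: set) -> tuple:
    """Split a sample into the fields the module sees and the fields kept as labels."""
    inputs = {k: v for k, v in example.items() if k in input_keys}
    labels = {k: v for k, v in example.items() if k not in input_keys}
    return inputs, labels


# Sample copied from the test set output above.
sample = {
    "question": "Are both Cangzhou and Qionghai in the Hebei province of China?",
    "answer": "no",
    "gold_titles": {"Qionghai", "Cangzhou"},
}

inputs, labels = split_example(sample, input_keys={"question"})
print(inputs)
print(sorted(labels))
```

Only `question` ends up in `inputs`; `answer` and `gold_titles` stay available to the evaluation functions.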

## 5. Define A Simple RAG Module

In order to define the RAG module, we need to define a signature that takes in two inputs, `context` and `question`, and outputs an `answer`. The signature provides:

- A description of the sub-task the language model is supposed to solve.
- A description of the input fields to the language model.
- A description of the output fields the language model must produce.

In [None]:
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")
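As a rough illustration of how these three pieces of information (task description, input fields, output fields) could end up in a prompt, here is a plain-Python sketch. `FieldSpec`, `SignatureSketch`, and the prompt layout are assumptions made for illustration; DSPy's actual prompt template is different and more elaborate.

```python
from dataclasses import dataclass, field


@dataclass
class FieldSpec:
    name: str
    desc: str = ""
    is_input: bool = True


@dataclass
class SignatureSketch:
    instructions: str
    fields: list = field(default_factory=list)

    def render(self, **inputs) -> str:
        """Lay out the instructions, the filled-in inputs, and the open output slots."""
        lines = [self.instructions, ""]
        for spec in self.fields:
            label = spec.name.title()
            if spec.is_input:
                lines.append(f"{label}: {inputs[spec.name]}")
            else:
                hint = f" ({spec.desc})" if spec.desc else ""
                lines.append(f"{label}{hint}:")
        return "\n".join(lines)


sketch = SignatureSketch(
    instructions="Answer questions with short factoid answers.",
    fields=[
        FieldSpec("context", "may contain relevant facts"),
        FieldSpec("question"),
        FieldSpec("answer", "often between 1 and 5 words", is_input=False),
    ],
)

prompt = sketch.render(
    context="OpenAI is an American AI research organization founded in December 2015",
    question="When was OpenAI founded?",
)
print(prompt)
```

The output field is left open at the end of the prompt so that the language model completes it.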

In [None]:
cot = dspy.ChainOfThought(GenerateAnswer)
cot(
    question='When was OpenAI founded?',
    context='OpenAI is an American artificial intelligence (AI) research organization founded in December 2015'
)

Prediction(
    rationale='produce the answer. We know that OpenAI is an American AI research organization.',
    answer='December 2015'
)

Define your module by subclassing `dspy.Module` and overriding the `forward` method. Here, we use the retriever configured above (ColBERTv2) to fetch the top-k passages for the question and then use Chain-of-Thought to generate the final answer.

In [None]:
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)

In [None]:
rag_pipeline = RAG()
rag_pipeline(question='When was OpenAI founded?')

Prediction(
    context=['Budapest Open Access Initiative | The Budapest Open Access Initiative (BOAI) is a public statement of principles relating to open access to the research literature, which was released to the public February 14, 2002. It arose from a conference convened in Budapest by the Open Society Institute on December 1–2, 2001 to promote open access – at the time also known as "Free Online Scholarship". This small gathering of individuals is recognised as one of the major defining events of the open access movement. On the occasion of the 10th anniversary of the initiative, it was reaffirmed in 2012 and supplemented with a set of concrete recommendations for achieving "the new goal that within the next ten years, OA will become the default method for distributing new peer-reviewed research in every field and country."', 'OpenAI | OpenAI is a non-profit artificial intelligence (AI) research company that aims to promote and develop friendly AI in such a way as to benefit hu

## 6. Evaluate the RAG Module

We will use Parea to evaluate the RAG module on the test set. This consists of two parts:
- **instrumentation**: We will trace the execution of the module components to understand how the module processes the input: done by the `trace_dspy` method.
- **experimentation**: We will run an experiment to see the model's performance on the test set.

To be able to execute experiments in a notebook, we need to enable nested asyncio loops with the help of the `nest_asyncio` module.
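The following standard-library sketch shows the problem that `nest_asyncio` works around: starting a new event loop from code that already runs inside one raises a `RuntimeError`. The function names are hypothetical, and the outer `asyncio.run` call merely simulates the event loop a notebook kernel already has running.

```python
import asyncio


async def fetch_answer():
    return "done"


async def notebook_cell():
    # We are inside a running event loop here, just like code in a notebook cell.
    coro = fetch_answer()
    try:
        asyncio.run(coro)  # tries to start a second, nested event loop
    except RuntimeError as err:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {err}"
    return "no error"


result = asyncio.run(notebook_cell())
print(result)
```

After `nest_asyncio.apply()`, such nested calls are permitted, which is why the call is needed before running experiments from a notebook.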

In [None]:
p = Parea(api_key=os.getenv("PAREA_API_KEY"))
p.trace_dspy()

nest_asyncio.apply()  # needed to make p.experiment work in notebooks
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # needed because of transformers

Additionally, we will integrate Parea with DVC's experiment tracking.

In [None]:
!dvc init  # initializes DVC
!parea dvc-init  # initializes Parea integration with DVC for experiment tracking
!git add .parea/metrics.json .parea/dvc.yaml && git commit -m "Parea DVC integration"  # files which need to be tracked in git

Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>

We will use two evaluation functions for our experiment:
- `dspy.evaluate.answer_exact_match`: checks if the predicted answer is an exact match with the target answer.
- `gold_passages_retrieved`: checks if the retrieved context matches the golden context.

Note, we need to convert the list of `dspy.Example`s into a list of dictionaries and also attach the evaluation metric to the module such that we can execute the experiment with Parea. We can do the former via `convert_dspy_examples_to_parea_dicts` and the latter via `attach_evals_to_module`.
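To show what the two checks compute, here is a self-contained sketch on toy data. The `normalize` function is a simplified stand-in for `dspy.evaluate.normalize_text` (an assumption: it only lowercases, strips punctuation and articles, and collapses whitespace), and the passage bodies are made-up placeholders. Passages follow the `Title | text` layout seen in the retriever output above.

```python
import re
import string


def normalize(text: str) -> str:
    """Simplified normalization: lowercase, drop punctuation and articles, squeeze spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, answer: str) -> bool:
    return normalize(prediction) == normalize(answer)


def gold_titles_found(gold_titles, passages) -> bool:
    """True only if every gold title appears among the retrieved passage titles."""
    gold = {normalize(title) for title in gold_titles}
    found = {normalize(passage.split(" | ")[0]) for passage in passages}
    return gold.issubset(found)


gold = {"Qionghai", "Cangzhou"}
both = ["Cangzhou | placeholder text", "Qionghai | placeholder text"]
only_one = ["Cangzhou | placeholder text", "Hebei | placeholder text"]

print(exact_match("No.", "no"))
print(gold_titles_found(gold, both))
print(gold_titles_found(gold, only_one))
```

Note that the retrieval check is strict: retrieving only one of the two gold articles counts as a miss.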

In [None]:
def gold_passages_retrieved(example, pred, trace=None):
    gold_titles = set(map(dspy.evaluate.normalize_text, example['gold_titles']))
    found_titles = set(map(dspy.evaluate.normalize_text, [c.split(' | ')[0] for c in pred.context]))

    return gold_titles.issubset(found_titles)


p.experiment(
    "HotPotQA",  # name of the experiment
    convert_dspy_examples_to_parea_dicts(test_set),  # dataset of the experiment
    attach_evals_to_module(RAG(), [dspy_eval.answer_exact_match, gold_passages_retrieved]),  # function which should be evaluated
).run(
    "naive-rag"
)  # name of the run

100%|██████████| 50/50 [00:04<00:00, 12.30it/s]
0it [00:04, ?it/s]


Experiment HotPotQA Run naive-rag2 stats:
{
  "latency": "0.79",
  "input_tokens": "0.00",
  "output_tokens": "0.00",
  "total_tokens": "0.00",
  "cost": "0.00000",
  "answer_exact_match": "0.54",
  "gold_passages_retrieved": "0.26"
}


View experiment & traces at: https://app.parea.ai/experiments/HotPotQA/07aea21a-3a8f-4d92-b795-17ad6113d0e2



Now we can check that the DVC integration is working correctly:

In [None]:
!dvc exp show

 ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  Experiment   Created   latency   input_tokens   output_tokens   total_tokens   cost   answer_exact_match   gold_passages_retrieved
 ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  workspace    -         0.79

In the run analyzed here, the correct context is retrieved in only 37% of the cases; the exact figure varies between runs, and the run logged above scored 0.26 on `gold_passages_retrieved`. Additionally, by looking at the relationship between the retrieval accuracy (`gold_passages_retrieved`) and the overall accuracy of our RAG pipeline (`answer_exact_match`), we can see that our retrieval step is the bottleneck (e.g. both metrics agree in 90% of cases).

![Simple RAG](https://drive.google.com/uc?id=1zZ-9b9PVfeeIX6fgSfqu_8NapIscpLsw)
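If you export the per-sample scores, the agreement between two metrics can be computed in a few lines. The sketch below uses made-up boolean scores for five samples; `agreement_rate` is a hypothetical helper, not part of Parea or DSPy.

```python
def agreement_rate(metric_a, metric_b) -> float:
    """Fraction of samples on which two boolean metrics give the same verdict."""
    if len(metric_a) != len(metric_b):
        raise ValueError("both metrics must cover the same samples")
    if not metric_a:
        raise ValueError("need at least one sample")
    matches = sum(bool(a) == bool(b) for a, b in zip(metric_a, metric_b))
    return matches / len(metric_a)


# Toy per-sample scores (not taken from the experiment above).
answer_correct = [True, False, True, False, False]
context_found = [True, False, False, False, False]

rate = agreement_rate(answer_correct, context_found)
print(rate)
```

A high agreement rate means that answers tend to be right exactly when the retrieval is right, which points at retrieval as the limiting step.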

When inspecting a single sample, we can see that the retrieved context (middle red box) doesn't match the question (top red box) and the correct context (bottom red box) at all:

![Bad Retrieval](https://drive.google.com/uc?id=1zBXRzKmTde4Qtd3cegSV1xAb9iUExDIu)

## 7. We need better retrieval: Simplified Baleen

One way to improve this is to iteratively refine the query, given the contexts retrieved so far, before generating a final answer. In NLP, this idea is embodied by multi-hop search systems; see, e.g., Baleen (Khattab et al., 2021). Let's try it out!

For that we will introduce a new `Signature`: given some context and a question, generate a new query to find more relevant information.

In [None]:
class GenerateSearchQuery(dspy.Signature):
    """Write a simple search query that will help answer a complex question."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    query = dspy.OutputField()

Now we can define a simplified version of Baleen. Concretely, the `forward` pass will do the following:

1. Loop `self.max_hops` times to fetch diverse contexts. In each iteration:
    1. Generate a search query using Chain-of-Thought (the predictor at `self.generate_query[hop]`).
    2. Then, retrieve the top-k passages using that query.
    3. Finally, add the (deduplicated) passages to our accumulated context.
2. After the loop, `self.generate_answer` generates an answer via CoT.
3. Finally, return a prediction with the retrieved context and predicted answer.

Note that the retriever (ColBERTv2) is configured globally via `dspy.settings.configure` rather than instantiated inside the module. This keeps the module pickleable, which is a requirement to optimize it later on.

In [None]:
class SimplifiedBaleen(dspy.Module):
    def __init__(self, passages_per_hop=3, max_hops=2):
        super().__init__()

        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(max_hops)]
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
        self.max_hops = max_hops

    def forward(self, question):
        context = []

        for hop in range(self.max_hops):
            query = self.generate_query[hop](context=context, question=question).query
            passages = self.retrieve(query).passages
            context = deduplicate(context + passages)

        pred = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=pred.answer)
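To follow the control flow of the loop without calling an LLM or the retriever, here is a sketch with stubs. Everything in it is hypothetical: `generate_query` and `retrieve` are toy callables, the corpus is made up, and `deduplicate` is an order-preserving stand-in for `dsp.utils.deduplicate`.

```python
def deduplicate(items):
    """Drop repeated items while keeping first-seen order."""
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


def multi_hop(question, generate_query, retrieve, max_hops=2):
    context = []
    for _ in range(max_hops):
        query = generate_query(context, question)  # query depends on what was found so far
        passages = retrieve(query)
        context = deduplicate(context + passages)  # accumulate without repeats
    return context


# Toy corpus: the second hop returns one passage already seen in the first hop.
corpus = {
    "first query": ["A | text", "B | text"],
    "follow-up query": ["B | text", "C | text"],
}


def toy_generate_query(context, question):
    return "first query" if not context else "follow-up query"


def toy_retrieve(query):
    return corpus[query]


final_context = multi_hop("toy question", toy_generate_query, toy_retrieve)
print(final_context)
```

The second hop contributes only the passage that was not already in the context, so the accumulated context grows without duplicates.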

## 8. Optimizing the Baleen Model

Now, we can apply the **magic** of DSPy and optimize our model on our training set. For that we need to select an optimizer and define an evaluation metric.

As the optimizer, we will choose `BootstrapFewShot`, which uses few-shot examples to boost the performance of the prompts. To evaluate the pipeline we will apply the following logic:
1. check if the predicted answer is an exact match with the target answer
2. check if the retrieved context contains the target answer (via `answer_passage_match`)
3. check if the queries for the individual hops aren't too long
4. check if the queries are sufficiently different from each other
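Checks 3 and 4 can be sketched in plain Python as follows. This is an approximation under stated assumptions: `token_f1` is a simplified token-overlap F1 used as a stand-in for the similarity test behind `answer_exact_match_str(..., frac=0.8)`, and `hops_are_valid` is a hypothetical helper. As in the metric defined in this section, the question itself counts as the first hop, and comparisons start at the second generated query.

```python
from collections import Counter


def token_f1(a: str, b: str) -> float:
    """Token-overlap F1 between two strings (lowercased, whitespace-tokenized)."""
    a_tokens, b_tokens = a.lower().split(), b.lower().split()
    overlap = sum((Counter(a_tokens) & Counter(b_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(a_tokens)
    recall = overlap / len(b_tokens)
    return 2 * precision * recall / (precision + recall)


def hops_are_valid(question, queries, max_len=100, frac=0.8) -> bool:
    hops = [question] + list(queries)
    # Check 3: no hop may be longer than max_len characters.
    if max(len(h) for h in hops) > max_len:
        return False
    # Check 4: from the second query on, a hop must not nearly repeat an earlier one.
    for idx in range(2, len(hops)):
        if any(token_f1(hops[idx], earlier) >= frac for earlier in hops[:idx]):
            return False
    return True


question = "Are both Cangzhou and Qionghai in the Hebei province of China?"
print(hops_are_valid(question, ["Cangzhou province", "Qionghai province"]))
print(hops_are_valid(question, ["Cangzhou province", "cangzhou province"]))
print(hops_are_valid(question, ["x" * 101]))
```

The first call passes, the second fails because the two queries are identical up to casing, and the third fails the length check.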

In [None]:
def validate_context_and_answer_and_hops(example, pred, trace=None):
    if not dspy.evaluate.answer_exact_match(example, pred):
        return False
    if not dspy.evaluate.answer_passage_match(example, pred):
        return False

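    # Collect the question plus every generated query; `trace` is None outside of optimization, so fall back to an empty list.
    hops = [example.question] + [outputs.query for *_, outputs in (trace or []) if "query" in outputs]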

    if max([len(h) for h in hops]) > 100:
        return False
    if any(dspy.evaluate.answer_exact_match_str(hops[idx], hops[:idx], frac=0.8) for idx in range(2, len(hops))):
        return False

    return True


teleprompter = BootstrapFewShot(metric=validate_context_and_answer_and_hops)
with TurnOffPareaLogging():  # turn off logging during optimization
    compiled_baleen = teleprompter.compile(SimplifiedBaleen(), teacher=SimplifiedBaleen(passages_per_hop=2), trainset=train_set)

 35%|███▌      | 7/20 [00:35<01:06,  5.13s/it]


Now let's compare the unoptimized with the optimized system to see if there are any improvements:

In [None]:
p.experiment(
    "HotPotQA",
    convert_dspy_examples_to_parea_dicts(test_set),
    attach_evals_to_module(SimplifiedBaleen(), [dspy_eval.answer_exact_match, gold_passages_retrieved]),
).run("unoptimized-baleen")

p.experiment(
    "HotPotQA", convert_dspy_examples_to_parea_dicts(test_set), attach_evals_to_module(compiled_baleen, [dspy_eval.answer_exact_match, gold_passages_retrieved])
).run("optimized-baleen")

100%|██████████| 50/50 [00:26<00:00,  1.92it/s]
0it [00:04, ?it/s]


Experiment HotPotQA Run unoptimized-baleen stats:
{
  "latency": "4.73",
  "input_tokens": "0.00",
  "output_tokens": "0.00",
  "total_tokens": "0.00",
  "cost": "0.00000",
  "answer_exact_match": "0.56",
  "gold_passages_retrieved": "0.40"
}


View experiment & traces at: https://app.parea.ai/experiments/HotPotQA/e342a0a2-0558-4e99-bd2a-99e915bfc003



 76%|███████▌  | 38/50 [00:18<00:03,  3.85it/s]ERROR:root:Error occurred in function basic_request, Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-3.5-turbo in organization org-6Yw8jxokcWTXBqzY3Yv2pEuI on tokens per min (TPM): Limit 160000, Used 158172, Requested 2787. Please try again in 359ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/parea/utils/trace_utils.py", line 253, in wrapper
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dsp/modules/gpt3.py", line 117, in basic_request
    response = chat_request(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dsp/modules/gpt3.py", line 263, in chat_request
    return v1_cached_gpt3_turbo_request_v2_wrapped(**kwargs).model_dump()
  File "/usr/local/lib/python3.10/dist-packages/dsp/modules/cache_utils.p

Backing off 0.3 seconds after 1 tries calling function <function GPT3.request at 0x7f427a826b90> with kwargs {}


Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/parea/utils/trace_utils.py", line 253, in wrapper
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dsp/modules/gpt3.py", line 117, in basic_request
    response = chat_request(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dsp/modules/gpt3.py", line 263, in chat_request
    return v1_cached_gpt3_turbo_request_v2_wrapped(**kwargs).model_dump()
  File "/usr/local/lib/python3.10/dist-packages/dsp/modules/cache_utils.py", line 16, in wrapper
    return func(*args, 

Backing off 0.3 seconds after 1 tries calling function <function GPT3.request at 0x7f427a826b90> with kwargs {}
Backing off 1.5 seconds after 3 tries calling function <function GPT3.request at 0x7f427a826b90> with kwargs {}
Backing off 0.6 seconds after 2 tries calling function <function GPT3.request at 0x7f427a826b90> with kwargs {}
Backing off 3.6 seconds after 3 tries calling function <function GPT3.request at 0x7f427a826b90> with kwargs {}


 82%|████████▏ | 41/50 [00:27<00:14,  1.65s/it]ERROR:root:Error occurred in function basic_request, Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-3.5-turbo in organization org-6Yw8jxokcWTXBqzY3Yv2pEuI on tokens per min (TPM): Limit 160000, Used 159685, Requested 2906. Please try again in 971ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/parea/utils/trace_utils.py", line 253, in wrapper
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dsp/modules/gpt3.py", line 117, in basic_request
    response = chat_request(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dsp/modules/gpt3.py", line 263, in chat_request
    return v1_cached_gpt3_turbo_request_v2_wrapped(**kwargs).model_dump()
  File "/usr/local/lib/python3.10/dist-packages/dsp/modules/cache_utils.

Backing off 0.0 seconds after 1 tries calling function <function GPT3.request at 0x7f427a826b90> with kwargs {}
Backing off 0.9 seconds after 1 tries calling function <function GPT3.request at 0x7f427a826b90> with kwargs {}
Backing off 0.4 seconds after 4 tries calling function <function GPT3.request at 0x7f427a826b90> with kwargs {}


 84%|████████▍ | 42/50 [00:29<00:13,  1.74s/it]ERROR:root:Error occurred in function basic_request, Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-3.5-turbo in organization org-6Yw8jxokcWTXBqzY3Yv2pEuI on tokens per min (TPM): Limit 160000, Used 159386, Requested 2787. Please try again in 814ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/parea/utils/trace_utils.py", line 253, in wrapper
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dsp/modules/gpt3.py", line 117, in basic_request
    response = chat_request(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dsp/modules/gpt3.py", line 263, in chat_request
    return v1_cached_gpt3_turbo_request_v2_wrapped(**kwargs).model_dump()
  File "/usr/local/lib/python3.10/dist-packages/dsp/modules/cache_utils.

Backing off 7.6 seconds after 4 tries calling function <function GPT3.request at 0x7f427a826b90> with kwargs {}
Backing off 4.2 seconds after 4 tries calling function <function GPT3.request at 0x7f427a826b90> with kwargs {}
Backing off 1.5 seconds after 3 tries calling function <function GPT3.request at 0x7f427a826b90> with kwargs {}


 86%|████████▌ | 43/50 [00:30<00:10,  1.50s/it]ERROR:root:Error occurred in function basic_request, Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-3.5-turbo in organization org-6Yw8jxokcWTXBqzY3Yv2pEuI on tokens per min (TPM): Limit 160000, Used 159553, Requested 2920. Please try again in 927ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/parea/utils/trace_utils.py", line 253, in wrapper
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dsp/modules/gpt3.py", line 117, in basic_request
    response = chat_request(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dsp/modules/gpt3.py", line 263, in chat_request
    return v1_cached_gpt3_turbo_request_v2_wrapped(**kwargs).model_dump()
  File "/usr/local/lib/python3.10/dist-packages/dsp/modules/cache_utils.

Backing off 12.8 seconds after 5 tries calling function <function GPT3.request at 0x7f427a826b90> with kwargs {}
Backing off 0.2 seconds after 2 tries calling function <function GPT3.request at 0x7f427a826b90> with kwargs {}


 90%|█████████ | 45/50 [00:32<00:05,  1.17s/it]ERROR:root:Error occurred in function basic_request, Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-3.5-turbo in organization org-6Yw8jxokcWTXBqzY3Yv2pEuI on tokens per min (TPM): Limit 160000, Used 159319, Requested 2948. Please try again in 850ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/parea/utils/trace_utils.py", line 253, in wrapper
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dsp/modules/gpt3.py", line 117, in basic_request
    response = chat_request(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dsp/modules/gpt3.py", line 263, in chat_request
    return v1_cached_gpt3_turbo_request_v2_wrapped(**kwargs).model_dump()
  File "/usr/local/lib/python3.10/dist-packages/dsp/modules/cache_utils.p

Backing off 3.6 seconds after 4 tries calling function <function GPT3.request at 0x7f427a826b90> with kwargs {}


100%|██████████| 50/50 [00:44<00:00,  1.12it/s]
0it [00:04, ?it/s]


Experiment HotPotQA Run optimized-baleen stats:
{
  "latency": "6.87",
  "input_tokens": "0.00",
  "output_tokens": "0.00",
  "total_tokens": "0.00",
  "cost": "0.00000",
  "answer_exact_match": "0.66",
  "gold_passages_retrieved": "0.54"
}


View experiment & traces at: https://app.parea.ai/experiments/HotPotQA/fe60b715-13b3-426b-aeef-3223c4dc95de



When selecting both experiments in the overview, we can see that our retrieval accuracy has increased from 40% to 53.3% and the overall accuracy has increased from 37% to 43%.

![Experiments Comparison](https://drive.google.com/uc?id=1NI8_ELz-0Gyxw2VqQwz_HyuBOua_HVT2)
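
To put these gains in perspective, it can help to separate the absolute change (in percentage points) from the relative change. The following is a minimal sketch using only the figures quoted above; the helper name `improvement` and the dictionary labels are illustrative and are not part of Parea, DSPy, or DVC.

```python
def improvement(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = absolute / before * 100
    return absolute, relative


# Figures quoted in the text above, in percent (before, after).
metrics = {
    "retrieval accuracy": (40.0, 53.3),
    "overall accuracy": (37.0, 43.0),
}

for name, (before, after) in metrics.items():
    absolute, relative = improvement(before, after)
    print(f"{name}: +{absolute:.1f} pp ({relative:.1f}% relative)")
```

Reporting both numbers avoids ambiguity: a gain of a few percentage points can correspond to a much larger relative improvement when the baseline is low.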

Finally, we can see that all experiments were also logged with DVC:

In [None]:
!dvc exp show

 ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── 
  Experiment                         Created    latency   input_tokens   output_tokens   total_tokens      cost   answer_exact_match   gold_passages_retrieved  
 ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── 
  workspace                          -             6