# Goal

This notebook shows how to make synthetic data to bootstrap evaluation of your retrieval system. 

This synthetic data contains many triplets of `(RAG system input, system output, desired chunk to retrieve)`. For this example, we will work on a hardware retailer's system that answers user questions based on existing product reviews. So the synthetic data will look like this:

```
Q: How frequently do I need to replace the blades on this saw?
A: A customer reported getting 7-10 hours of active use between blade replacements.
Chunk ID: 3
```

Once you have many of these triplets, you can experiment with different retrieval strategies (e.g. different embedding models, embedding vs keyword search, etc.) to determine which strategies most consistently retrieve the desired chunks.
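
As a rough illustration of that comparison, the sketch below scores a retrieval strategy by recall@k: the fraction of questions for which the desired chunk appears in the top `k` results. The toy corpus, the `keyword_retriever` function, and the `recall_at_k` helper are all hypothetical names invented for this example; a real retriever (embedding search, BM25, etc.) would take the place of the word-overlap scoring.

```python
from typing import Callable, Dict, List

# Hypothetical toy corpus for illustration: chunk ID -> chunk text.
corpus: Dict[str, str] = {
    "1": "The drill battery charges in under an hour.",
    "2": "The ladder is sturdy and folds flat for storage.",
    "3": "I get 7-10 hours of active use between blade replacements on this saw.",
}

# Synthetic triplets in the (question, answer, chunk ID) shape described above.
triplets = [
    {
        "question": "How frequently do I need to replace the blades on this saw?",
        "answer": "A customer reported getting 7-10 hours of active use between blade replacements.",
        "chunk_id": "3",
    },
    {
        "question": "How long does the drill battery take to charge?",
        "answer": "It charges in under an hour.",
        "chunk_id": "1",
    },
]


def keyword_retriever(question: str, k: int) -> List[str]:
    """Toy strategy: rank chunks by the number of words shared with the question."""
    q_words = set(question.lower().replace("?", "").split())
    scores = {
        chunk_id: len(q_words & set(text.lower().replace(".", "").split()))
        for chunk_id, text in corpus.items()
    }
    ranked = sorted(scores, key=lambda cid: scores[cid], reverse=True)
    return ranked[:k]


def recall_at_k(
    retriever: Callable[[str, int], List[str]], triplets: List[dict], k: int
) -> float:
    """Fraction of questions whose desired chunk is among the top-k retrieved IDs."""
    hits = sum(t["chunk_id"] in retriever(t["question"], k) for t in triplets)
    return hits / len(triplets)


score = recall_at_k(keyword_retriever, triplets, k=1)
print(f"recall@1 = {score:.2f}")
```

Running the same `recall_at_k` call with a second retriever function gives a like-for-like comparison between strategies on the same synthetic questions.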

# A Starting Point

A simple approach would follow this pseudocode:

```
synth_data = []
for chunk in corpus:
    response = call_llm(f"Give me a JSON array of 10 question/answer pairs. The questions should be things someone might ask about a product before purchase. The answer should be something contained in this text: {chunk}")
    # Parse the LLM's JSON array into (question, answer) pairs
    q_a_pairs = json.loads(response.content)
    q_a_c = [{'question': q, 'answer': a, 'chunk': chunk} for (q, a) in q_a_pairs]
    synth_data.extend(q_a_c)
```

A practical implementation should address three issues that arise in the naive pseudocode.

| Issue | Solution |
|---------|----------|
| Inconsistent formatting of LLM response (e.g. different keys) | Instructor library |
| Bad questions | Guidance/examples in prompt |
| Time waiting for LLM responses when iterating over many chunks | Async LLM calls|

# Reusable Code to Bootstrap Evals

The code in this notebook addresses these issues. The code is also available as [this script](https://gist.github.com/jxnl/5627c9d463ffe0b085896f7890fab1bf).

## Data

This course uses synthetic data for the use case of answering questions on a hardware retailer's website from product reviews. We created this data in `make_product_reviews.ipynb`. Here is a small sample of the data.

In [1]:
import lancedb
import pandas as pd

pd.set_option("display.max_colwidth", 160)


In [2]:
db = lancedb.connect("./lancedb")
db

LanceDBConnection(/home/msivanes/Documents/1Projects/systematically-improving-rag/week1_bootstrap_evals/lancedb)

In [3]:
reviews_table = db.open_table("reviews")
reviews_table

LanceTable(connection=LanceDBConnection(/home/msivanes/Documents/1Projects/systematically-improving-rag/week1_bootstrap_evals/lancedb), name="reviews")

In [4]:
sample_reviews = reviews_table.to_pandas()

In [6]:
sample_reviews.head(3)

| | id | product_title | product_description | review | vector |
|---|---|---|---|---|---|
| 0 | 0 | Cordless Drill | This powerful cordless drill features an ergonomic design perfect for all-day use. With 20 torque settings and a lithium-ion battery, it offers unmatched ve... | I've been using this cordless drill for the past 6 months, and it's been a game-changer for my DIY projects. The 20 torque settings allow me to adjust the p... | [-0.0038200605, -0.009364537, -0.026297344, -0.020058312, 0.0376258, 0.0038050916, -0.000636552, 0.06768333, 0.023818497, -0.0054426882, 0.011939186, 0.0186... |
| 1 | 1 | Cordless Drill | This powerful cordless drill features an ergonomic design perfect for all-day use. With 20 torque settings and a lithium-ion battery, it offers unmatched ve... | Purchased this cordless drill a year ago and it has not disappointed. The 20 torque settings provide great control and precision, especially on delicate tas... | [-0.023377769, -0.01479075, -0.014503318, -0.029437784, 0.029629406, -0.0010449336, -0.022671165, 0.08718758, -0.0075630434, -0.0136050945, 0.041390147, 0.0... |
| 2 | 2 | Cordless Drill | Our lightweight cordless drill comes equipped with a flexible LED work light to illuminate your workspace. The 18V battery provides ample power for tough ma... | I've been using this cordless drill for the past six months on various projects around the house, and I am thoroughly impressed. The 18V battery provides in... | [-0.015202215, 0.0038308129, 0.0022391798, -0.008841734, 0.008781216, 0.008865941, -0.007449812, 0.061050933, 0.045049876, -0.01014288, 0.037472975, -0.0050... |


In [1]:
import lancedb
import pandas as pd

pd.set_option("display.max_colwidth", 160)

db = lancedb.connect("./lancedb")
reviews_table = db.open_table("reviews")
sample_reviews = reviews_table.to_pandas()
sample_reviews.review

0       I've owned several cordless drills over the years, but this one is exceptional. It is lightweight, making it easy to use for extended periods without fatigu...
1       As a professional contractor, I rely on my tools every day. This cordless drill has exceeded my expectations with its powerful motor and ergonomic design. T...
2       I'm a DIY enthusiast and bought this cordless drill for home projects. It's perfect for everything from hanging shelves to assembling furniture. The drill i...
3       After using this cordless drill for several months, I can confidently say it's one of the best I've owned. The lightweight design makes it comfortable to us...
4       This cordless drill has become my go-to tool for all my DIY projects. The lightweight design reduces strain on my wrist, and the 2-speed transmission is inc...
                                                                                     ...                                                                        

## Structure The Data

We use Pydantic and Instructor to provide a reliable interface between our LLMs and the structured data formats we need in order to run code on LLM output.
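
To make the formatting problem concrete, here is a minimal sketch of how a Pydantic model rejects a response whose keys do not match the expected schema. It does not call an LLM: the two JSON strings are made-up stand-ins for model output, and `QAPair` and `parse_pair` are hypothetical names used only for this illustration.

```python
import json
from typing import Optional

from pydantic import BaseModel, ValidationError


class QAPair(BaseModel):
    """Hypothetical schema for illustration: the keys we expect from the LLM."""

    question: str
    answer: str


# Made-up stand-ins for LLM output: one with the expected keys, one without.
good_response = '{"question": "Is it loud?", "answer": "It is quieter than older drills."}'
bad_response = '{"q": "Is it loud?", "a": "It is quieter than older drills."}'


def parse_pair(raw: str) -> Optional[QAPair]:
    """Return a validated QAPair, or None if required keys are missing."""
    try:
        return QAPair(**json.loads(raw))
    except ValidationError:
        return None


parsed_good = parse_pair(good_response)
parsed_bad = parse_pair(bad_response)

print(parsed_good)
print(parsed_bad)
```

The malformed response fails validation instead of flowing into downstream code with the wrong keys, which is the guarantee the rest of this notebook relies on.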

In [7]:
from pydantic import BaseModel

In [8]:
class Review(BaseModel):
    id: str
    product_title: str
    product_description: str
    review: str

In [9]:
sample_chunks = [
    Review(
        id=str(row.id),
        product_title=row.product_title,
        product_description=row.product_description,
        review=row.review,
    )
    for _, row in sample_reviews.iterrows()
]

In [10]:
n_questions = 2  # number of questions to get in each LLM call

In [11]:
example_questions = [
    "What does the reviewer like about the product?",
    "What does the reviewer think could be improved?",
]

In [2]:
from pydantic import BaseModel


class Review(BaseModel):
    id: str
    product_title: str
    product_description: str
    review: str


sample_chunks = [
    Review(
        id=str(row.id),
        product_title=row.product_title,
        product_description=row.product_description,
        review=row.review,
    )
    for _, row in sample_reviews.iterrows()
]

n_questions = 2  # number of questions to get in each LLM call
example_questions = [
    "What does the reviewer like about the product?",
    "What does the reviewer think could be improved?",
]

In [12]:
from typing import List
import instructor
from openai import AsyncOpenAI

In [13]:
# Patch the AsyncOpenAI client
client = instructor.from_openai(AsyncOpenAI())

In [14]:
class QuestionAnswer(BaseModel):
    question: str
    answer: str

In [15]:
class ChunkEval(QuestionAnswer):
    chunk_id: str

In [17]:
sample_chunks

[Review(id='0', product_title='Cordless Drill', product_description='This powerful cordless drill features an ergonomic design perfect for all-day use. With 20 torque settings and a lithium-ion battery, it offers unmatched versatility and efficiency in any drilling task.', review="I've been using this cordless drill for the past 6 months, and it's been a game-changer for my DIY projects. The 20 torque settings allow me to adjust the power precisely for each material I'm working with, whether it's wood, metal, or plastic. The lithium-ion battery charges quickly, often taking less than an hour to reach full capacity, and it easily lasts through a full day of work without needing a recharge. The ergonomic design is truly comfortable, reducing fatigue during extended use. One feature I particularly appreciate is the built-in LED light, which illuminates the work area perfectly, making it convenient to work in low-light conditions. Compared to my old drill, this one is significantly quieter

Now see how we build questions for a single chunk:

In [21]:
chr(10) # newline character

'\n'

In [25]:
review = sample_chunks[0]
n_questions, example_questions

(2,
 ['What does the reviewer like about the product?',
  'What does the reviewer think could be improved?'])

In [26]:
prompt = f"""
Generate `{n_questions}` question-answer pairs about a {review.product_title}. The answers should primarily be derived from information in this product review:

<content>
{review.review}
</content>

While they should contain information from the product review, you may also find it helpful context to see a product description:
<content>
{review.product_description}
</content>

Example questions:
{chr(10).join(f'- {q}' for q in example_questions)}

Provide a concise and specific answer for each question.
Do not use the exact example questions. Use them only as inspiration for the types of more specific questions to generate.
Do not include answers that are not in the content.
Questions should ask about product characteristics (e.g. durability) and answers should refer to product characteristics without referring to the reviewer specifically.
Stylistically, the questions should resemble what people would ask a RAG-based answer bot on a retailer's website. So they can be a little informal, messy or scattered.
"""

In [27]:
pairs = client.chat.completions.create_iterable(
    model="gpt-4o-mini",
    response_model=QuestionAnswer,
    messages=[{"role": "user", "content": prompt}],
)

In [28]:
pairs

<async_generator object AsyncInstructor.create_iterable at 0x7f6139f9fd40>

In [30]:
[
    ChunkEval(question=pair.question, answer=pair.answer, chunk_id=review.id)
    async for pair in pairs
]

[ChunkEval(question='How many torque settings does this cordless drill have?', answer='The cordless drill has 20 torque settings, allowing precise adjustment of power for various materials.', chunk_id='0'),
 ChunkEval(question="What's special about the battery of this drill?", answer='The lithium-ion battery charges quickly, often taking less than an hour to reach full capacity, and lasts through a full day of work.', chunk_id='0')]

In [3]:
from typing import List
import instructor
from openai import AsyncOpenAI

# Patch the AsyncOpenAI client
client = instructor.from_openai(AsyncOpenAI())


class QuestionAnswer(BaseModel):
    question: str
    answer: str


class ChunkEval(QuestionAnswer):
    chunk_id: str


async def generate_evals(
    review: Review, n_questions: int, example_questions: List[str]
) -> List[ChunkEval]:

    prompt = f"""
        Generate `{n_questions}` question-answer pairs about a {review.product_title}. The answers should primarily be derived from information in this product review:

        <content>
        {review.review}
        </content>

        While they should contain information from the product review, you may also find it helpful context to see a product description:
        <content>
        {review.product_description}
        </content>

        Example questions:
        {chr(10).join(f'- {q}' for q in example_questions)}

        Provide a concise and specific answer for each question.
        Do not use the exact example questions. Use them only as inspiration for the types of more specific questions to generate.
        Do not include answers that are not in the content.
        Questions should ask about product characteristics (e.g. durability) and answers should refer to product characteristics without referring to the reviewer specifically.
        Stylistically, the questions should resemble what people would ask a RAG-based answer bot on a retailer's website. So they can be a little informal, messy or scattered.
        """

    try:
        pairs = client.chat.completions.create_iterable(
            model="gpt-4o-mini",
            response_model=QuestionAnswer,
            messages=[{"role": "user", "content": prompt}],
        )
        return [
            ChunkEval(question=pair.question, answer=pair.answer, chunk_id=review.id)
            async for pair in pairs
        ]
    except Exception as e:
        print(f"Error generating evals: {str(e)}")
        return []


first_chunk_res = await generate_evals(sample_chunks[0], n_questions, example_questions)
first_chunk_res

[ChunkEval(question='How does the weight of this cordless drill affect its usability?', answer='The cordless drill is lightweight, making it easy to use for extended periods without fatigue.', chunk_id='0'),
 ChunkEval(question='What features does this drill have for different tasks?', answer='It has a 2-speed transmission that allows switching between high torque for drilling and high speed for driving screws.', chunk_id='0')]

To run `generate_evals` for many chunks in parallel, wrap it in a function that also takes a semaphore, which caps the number of LLM calls in flight at once.

In [31]:
import asyncio

In [32]:
class ChunkProcessingError(Exception):
    pass

In [4]:
import asyncio


class ChunkProcessingError(Exception):
    pass


async def process_chunk(
    review: Review,
    n_questions: int,
    example_questions: List[str],
    semaphore: asyncio.Semaphore,
) -> List[ChunkEval]:
    async with semaphore:
        try:
            return await generate_evals(review, n_questions, example_questions)
        except Exception as e:
            print(f"Unexpected error processing chunk {review.id}: {str(e)}")
            raise ChunkProcessingError(f"Failed to process chunk {review.id}") from e


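# Check that the wrapper returns results of the same form as calling generate_evals directly
# (the questions themselves differ between calls because LLM output is not deterministic)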
await process_chunk(
    sample_chunks[0], n_questions, example_questions, asyncio.Semaphore(1)
)

[ChunkEval(question='How heavy is this cordless drill compared to others on the market?', answer='This cordless drill is lightweight, making it easy to use for extended periods without fatigue.', chunk_id='0'),
 ChunkEval(question='What kind of tasks can this drill handle?', answer='The drill handles heavy-duty tasks with ease, thanks to its top-notch build quality and 2-speed transmission that allows for high torque and high speed.', chunk_id='0')]
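
To see what the semaphore buys you, the sketch below replaces the LLM call with a short `asyncio.sleep` and records how many calls are in flight at once. `fake_llm_call` and `run_all` are hypothetical names for this illustration, and the sleep duration is arbitrary.

```python
import asyncio
from typing import Dict, List, Tuple


async def fake_llm_call(i: int, semaphore: asyncio.Semaphore, state: Dict[str, int]) -> int:
    """Stand-in for an LLM call; tracks how many calls run concurrently."""
    async with semaphore:
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
        await asyncio.sleep(0.01)  # simulated network latency
        state["running"] -= 1
        return i * 2


async def run_all(n_tasks: int, max_concurrency: int) -> Tuple[List[int], int]:
    """Launch all tasks at once; the semaphore limits how many proceed together."""
    semaphore = asyncio.Semaphore(max_concurrency)
    state = {"running": 0, "peak": 0}
    results = await asyncio.gather(
        *(fake_llm_call(i, semaphore, state) for i in range(n_tasks))
    )
    return list(results), state["peak"]


# In a plain script use asyncio.run; inside a notebook, use `await run_all(...)` instead.
results, peak = asyncio.run(run_all(n_tasks=20, max_concurrency=3))
print(f"completed {len(results)} calls, peak concurrency {peak}")
```

All 20 tasks are created up front, but no more than `max_concurrency` of them are past the `async with semaphore` line at any moment, which helps keep a large run within API rate limits.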

Now you can call `create_synthetic_dataset`, which runs `process_chunk` on every chunk, to build the full dataset.

In [5]:
import json


async def create_synthetic_dataset(
    reviews: List[Review],
    n_questions: int,
    example_questions: List[str],
    max_concurrency: int = 10,
) -> List[ChunkEval]:
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [
        process_chunk(review, n_questions, example_questions, semaphore)
        for review in reviews
    ]
    results = await asyncio.gather(*tasks, return_exceptions=True)

    dataset = []
    for result in results:
        if isinstance(result, ChunkProcessingError):
            print(result)
        elif isinstance(result, list):
            dataset.extend(result)
        else:
            print(f"Unexpected result type: {type(result)}")

    return dataset


def save_dataset(dataset: List[ChunkEval], filename: str):
    with open(filename, "w") as f:
        json.dump([chunk_eval.model_dump() for chunk_eval in dataset], f, indent=2)


synthetic_dataset = await create_synthetic_dataset(
    sample_chunks, n_questions, example_questions
)
save_dataset(synthetic_dataset, "synthetic_eval_dataset.json")

print(f"Generated {len(synthetic_dataset)} ChunkEvals.")
print("Dataset saved as 'synthetic_eval_dataset.json'")

Generated 2270 ChunkEvals.
Dataset saved as 'synthetic_eval_dataset.json'


View the data as a DataFrame:

In [6]:
data = [(i.question, i.answer, i.chunk_id) for i in synthetic_dataset]
pd.DataFrame(data, columns=["question", "answer", "chunk_id"]).head()

| | question | answer | chunk_id |
|---|---|---|---|
| 0 | How good is the battery life on this cordless drill? | It comes with two included batteries, ensuring that you never run out of power on the job. | 0 |
| 1 | Is this cordless drill easy to handle for long tasks? | Yes, its lightweight design makes it easy to use for extended periods without fatigue. | 0 |
| 2 | How powerful is the motor in this cordless drill? | The cordless drill features a powerful motor that exceeds expectations for professional use. | 1 |
| 3 | What design features make this drill suitable for overhead tasks? | The cordless drill has a lightweight design and ergonomic build, making it perfect for overhead tasks. | 1 |
| 4 | How durable are the batteries for this cordless drill? | The batteries charge quickly and last a long time, which is a huge plus. | 2 |
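
Because `generate_evals` returns an empty list when a call fails, some chunks can end up with fewer questions than requested without any visible gap in the dataset. A quick coverage check, sketched below, can surface them. The `chunks_below_target` helper and the small `records` list are hypothetical; in this notebook the records would come from `synthetic_dataset` and the chunk IDs from `sample_chunks`.

```python
from collections import Counter
from typing import Dict, List


def chunks_below_target(
    records: List[Dict[str, str]], all_chunk_ids: List[str], n_questions: int
) -> List[str]:
    """Return the IDs of chunks that have fewer than n_questions generated pairs."""
    counts = Counter(r["chunk_id"] for r in records)
    return [cid for cid in all_chunk_ids if counts[cid] < n_questions]


# Hypothetical records for illustration: chunk "1" is short one pair, chunk "2" has none.
records = [
    {"question": "Is the battery life good?", "answer": "Two batteries are included.", "chunk_id": "0"},
    {"question": "Is it easy to handle?", "answer": "Yes, it is lightweight.", "chunk_id": "0"},
    {"question": "How powerful is the motor?", "answer": "It has a powerful motor.", "chunk_id": "1"},
]

missing = chunks_below_target(records, all_chunk_ids=["0", "1", "2"], n_questions=2)
print(missing)
```

Chunks flagged this way can be passed through the generation step again before the dataset is used for evaluation.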
