# Synthetic Data Generation for Evaluations

We'll build a system that automatically generates realistic user questions for our technical documentation use case.

We can use this synthetic data to test our agent and its search function.

We'll work in a Jupyter notebook. Install and start it with uv:

```bash
uv add --dev jupyter
uv run jupyter notebook
```

## Loading and Processing Documentation

First, let's load the documentation data:

In [None]:
import docs

raw_documents = docs.read_github_data()
documents = docs.parse_data(raw_documents)
len(documents)


Note that we don't chunk the documents here. We want to use the complete documents for generating the data.

If your data is too big for that (e.g. it's a book), split it along logical boundaries, such as chapters or sections, and generate questions from each part.
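
As a minimal sketch of such logical chunking, assuming the documents are markdown and that level-2 headings (`## `) mark the sections, a document could be split like this. The function name `split_by_sections` and the sample text are illustrative and not part of the course code:

```python
import re

def split_by_sections(text):
    """Split a markdown document into parts that each start at a level-2 heading."""
    # Zero-width split: cut right before every line that starts with "## ".
    # Deeper headings such as "### " stay inside their parent section.
    parts = re.split(r"(?=^## )", text, flags=re.MULTILINE)
    return [part.strip() for part in parts if part.strip()]

# Hypothetical sample document
sample = "Intro text\n\n## Install\nRun pip.\n\n## Usage\nCall it.\n### Details\nMore."

sections = split_by_sections(sample)
for section in sections:
    print(repr(section))
```

Each section can then be passed to the question generator in place of a full document.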

Loading the data gives us 95 documents in total. Now let's filter and select documents that are suitable for question generation:



In [None]:
num_questions_total = 0

selected_documents = []

for doc in documents[5:]:
    if 'title' not in doc:
        continue

    title = doc['title']
    if 'unpublished' in title.lower():
        continue
    if 'legacy' in title.lower():
        continue
    if 'leftovers' in title.lower():
        continue

    content = doc.get('content', '').strip()
    if len(content) <= 1000:
        continue

    # We want to create one question for every 1000 characters
    num_questions = len(content) // 1000
    print(doc.get('title'))
    print(len(content), num_questions)
    num_questions_total = num_questions_total + num_questions
    print('------------')

    selected_documents.append(doc)

print(num_questions_total)

This filtering process:

- Skips the first five documents (`documents[5:]`)
- Skips documents without titles
- Excludes unpublished, legacy, and leftover content
- Only includes substantial documents (over 1000 characters)

The longer the document, the more questions we may want to generate. So let's use the following logic: for each 1000 characters of the document we generate one question.

This way we can generate approximately 471 questions from 69 selected documents.
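
The rule can be written as a small helper to make the edge cases visible. This is a sketch: `question_budget` and `CHARS_PER_QUESTION` are names introduced here for illustration. It mirrors the loop above, where the content is stripped first and documents of 1000 characters or fewer are skipped:

```python
CHARS_PER_QUESTION = 1000  # same threshold as in the filtering loop

def question_budget(content):
    """Return how many questions to request for a document, or 0 if it is too short."""
    length = len(content.strip())
    if length <= CHARS_PER_QUESTION:
        return 0
    # Integer division: one question per full 1000 characters
    return length // CHARS_PER_QUESTION

for length in (800, 1000, 1001, 3450):
    print(length, question_budget("x" * length))
# 800 0
# 1000 0
# 1001 1
# 3450 3
```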

## Setting Up the LLM Client

We already know how to produce structured output. For this lesson let's use the OpenAI client. We will re-use the helper function for structured output that we created in Week 1:

In [None]:
from openai import OpenAI

openai_client = OpenAI()

def llm_structured(instructions, user_prompt, output_format, model="gpt-4o-mini"):
    messages = [
        {"role": "system", "content": instructions},
        {"role": "user", "content": user_prompt}
    ]

    response = openai_client.responses.parse(
        model=model,
        input=messages,
        text_format=output_format
    )

    return (response.output_parsed, response.usage)

## Developing Instructions

Let's start with these instructions:

In [None]:
generator_instructions = """
you're given an article and your task is to think what kind of questions 
the reader of this article may have
"""

And a very simple schema:

In [None]:
from pydantic import BaseModel

class Question(BaseModel):
    question: str
    summary_answer: str

class GeneratedQuestions(BaseModel):
    questions: list[Question]


Run it:

In [None]:
import json

def process_document(doc):
    content = doc['content']
    num_questions = len(content) // 1000
    user_prompt = f"""generate {num_questions} questions for this document:
    <document>{json.dumps(doc)}</document>
    """
    response, usage = llm_structured(
        instructions=generator_instructions,
        user_prompt=user_prompt,
        output_format=GeneratedQuestions
    )
    return {
        'doc': doc,
        'questions': response.questions,
        'usage': usage
    }

doc = selected_documents[0]
result = process_document(doc)


The results don't look like search queries yet. Let's ask it explicitly:

`formulate it as a search query`

Or, with the help of ChatGPT, we can write more detailed instructions:

In [None]:
generator_instructions = """
You are given a technical article. Your task is to imagine what questions 
a person might type into a search engine before they find and read this article.

Formulate natural, human-like search queries — the way real users might ask them online. 
Avoid robotic or academic phrasing. Include diversity in tone and style.

Assume users have different knowledge levels:
- Beginners who are just starting and may not know the terminology.
- Intermediate users who understand the basics but need clarification or examples.
- Advanced users who seek details, edge cases, or integration insights.

Each query should sound plausible as a Google-style search — short, goal-oriented, and reflecting the user’s curiosity or problem.

For each generated query:
- Provide the likely search question (in natural language).
- Provide a summary_answer — a 1–2 sentence explanation summarizing how the article would answer that question.

Return results in the GeneratedQuestions schema.
"""


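The questions are still quite long, so we ask the model to "avoid full-sentence questions with punctuation like 'What is...'". We can also ask it to label each query with a difficulty level and specify how the queries should be distributed across levels. These instructions ask for `difficulty` and `description` fields, so the schema has to be extended to match. The final version of the schema appears later in this section.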

In [None]:
generator_instructions = """
You are given a technical article. Your task is to imagine what a person might type into a search engine 
before finding and reading this article.

Generate realistic, human-like search queries — not formal questions. 
They should sound like things real users type into Google or Stack Overflow 
when trying to solve a problem or learn about the topic.

Guidelines:
- Avoid full-sentence questions with punctuation like "What is..." or "How do I...".
- Use short, natural search phrases instead, such as:
  - "evidently data definition example"
  - "map target and prediction columns evidently"
  - "difference between timestamp and datetime evidently"
- Vary phrasing to sound human and spontaneous.
- Assume users of different knowledge levels:
  - beginner: broad or basic understanding
  - intermediate: knows basic terms but seeks clarification or examples
  - advanced: familiar with the tool, looking for specific options or integrations

For each generated query:
- question: the realistic search phrase
- summary_answer: a short, 1–2 sentence summary of how the article answers it
- difficulty: one of ["beginner", "intermediate", "advanced"]
- 50% of questions should be beginner, 30% intermediate and 20% advanced

Also include a description summarizing what kind of article the questions are about.

Return results in the GeneratedQuestions schema.
"""


Let's also include intent classification to understand whether users are looking for conceptual explanations or code examples:

In [None]:
generator_instructions = """
You are given a technical article. Your task is to imagine what a person might type into a search engine 
before finding and reading this article.

Generate realistic, human-like search queries — not formal questions. 
They should sound like what people actually type into Google or Stack Overflow 
when trying to solve a problem, learn a concept, or find code examples.

Guidelines:
- Avoid full-sentence questions with punctuation like "What is..." or "How do I...".
- Use short, natural search phrases instead, such as:
  - "evidently data definition example"
  - "map target and prediction columns evidently"
  - "difference between timestamp and datetime evidently"
- Make queries varied and spontaneous, not repetitive or over-polished.
- Assume users of different knowledge levels:
  - beginner: broad or basic understanding
  - intermediate: knows basic terms but seeks clarification or examples
  - advanced: familiar with the tool, looking for details, edge cases, or integration options

Distribution rules:
- 60% of the queries should target beginner-level users
- 30% should target intermediate-level users
- 10% should target advanced-level users
- 75% of queries should have an intent of "code" (looking for examples or implementation)
- 25% should have an intent of "text" (looking for conceptual or theoretical explanations)

For each generated query, include:
- question: the natural, human-style search phrase
- summary_answer: a short 1–2 sentence summary of how the article addresses it
- difficulty: one of ["beginner", "intermediate", "advanced"]
- intent: one of ["text", "code"]

Also include a description summarizing what kind of article the questions are about.
"""

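With only a handful of questions per document, percentages like 60/30/10 cannot be met exactly. One option, which the original notebook does not use, is to convert the percentages into explicit counts and put those counts in the user prompt. The sketch below uses largest-remainder rounding so the counts always add up to the requested total. `split_by_percent` is a hypothetical helper, and it assumes the percentages sum to 100:

```python
def split_by_percent(total, percents):
    """Split `total` items into integer counts proportional to `percents`."""
    # Round every share down first
    counts = {name: total * pct // 100 for name, pct in percents.items()}
    leftover = total - sum(counts.values())
    # Give the remaining items to the categories with the largest remainders
    by_remainder = sorted(
        percents,
        key=lambda name: (total * percents[name]) % 100,
        reverse=True,
    )
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts

difficulty_percents = {"beginner": 60, "intermediate": 30, "advanced": 10}
intent_percents = {"code": 75, "text": 25}

print(split_by_percent(7, difficulty_percents))
# {'beginner': 4, 'intermediate': 2, 'advanced': 1}
print(split_by_percent(7, intent_percents))
# {'code': 5, 'text': 2}
```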

For the schema, this is the final version with intent classification:

In [None]:
from pydantic import BaseModel, Field
from typing import List, Literal

class Question(BaseModel):
    """
    Represents a realistic search-engine-style query a user might type before finding the article.
    Each question captures the likely search phrase, a short summary answer,
    the user's assumed skill level, and their intent (conceptual or code-focused).
    """
    question: str = Field(
        ...,
        description="A natural, short search query — not a full-sentence question — phrased like something typed into Google."
    )
    summary_answer: str = Field(
        ...,
        description="A concise 1–2 sentence summary of how the article addresses the query."
    )
    difficulty: Literal["beginner", "intermediate", "advanced"] = Field(
        ...,
        description="The assumed knowledge level of the user making the query."
    )
    intent: Literal["text", "code"] = Field(
        ...,
        description="Specifies if the user's intent is to get a theoretical explanation ('text') or an implementation example ('code')."
    )


class GeneratedQuestions(BaseModel):
    """
    A structured collection of human-like search queries derived from a given article.
    Includes a brief description of the article topic and a list of generated queries.
    Difficulty distribution: 60% beginner, 30% intermediate, 10% advanced.
    Intent distribution: 75% code-focused, 25% concept-focused.
    """
    description: str = Field(
        ...,
        description="A summary of the article or topic these search-style questions were generated for."
    )
    questions: List[Question] = Field(
        ...,
        description="A list of realistic search queries with short summaries, difficulty levels, and user intent."
    )


## Processing Documents

Let's process all the documents. We want to make it fast, so we'll do this in multiple parallel threads.

Here's some helper code for that:

In [None]:
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor

def map_progress(pool, seq, f):
    """Map function f over seq using the provided executor pool while
    displaying a tqdm progress bar. Returns a list of results in submission order.
    """
    results = []
    
    with tqdm(total=len(seq)) as progress:
        futures = []
    
        for el in seq:
            future = pool.submit(f, el)
            future.add_done_callback(lambda p: progress.update())
            futures.append(future)

        for future in futures:
            result = future.result()
            results.append(result)
        
        return results

And this is how we use it:

In [None]:
with ThreadPoolExecutor(max_workers=6) as pool:
    all_results = map_progress(pool, selected_documents, process_document)
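
Note that `future.result()` re-raises any exception thrown inside a worker, so with `map_progress` as written a single failed API call aborts the whole run. A minimal sketch of one way around this, assuming it's acceptable to collect failures and retry them later. `safe_call` and `flaky_process` are hypothetical names, and `flaky_process` only stands in for `process_document`:

```python
from concurrent.futures import ThreadPoolExecutor

def safe_call(f):
    """Wrap f so that exceptions are returned as data instead of being raised."""
    def wrapper(el):
        try:
            return {"ok": True, "input": el, "result": f(el)}
        except Exception as e:
            return {"ok": False, "input": el, "error": repr(e)}
    return wrapper

def flaky_process(doc):
    # Stand-in for process_document: fails for one specific input
    if doc == "bad":
        raise ValueError("simulated API error")
    return doc.upper()

sample_inputs = ["a", "bad", "c"]

with ThreadPoolExecutor(max_workers=2) as pool:
    # pool.map returns results in input order
    outcomes = list(pool.map(safe_call(flaky_process), sample_inputs))

succeeded = [o["result"] for o in outcomes if o["ok"]]
failed = [o["input"] for o in outcomes if not o["ok"]]
print(succeeded, failed)
# ['A', 'C'] ['bad']
```

With the helper from this section, the wrapped function would be passed as `map_progress(pool, selected_documents, safe_call(process_document))`. The code that reads the results afterwards would then need to unpack the `result` field.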

## Cost

Let's see how much it cost to generate the data.

Create the pricing config:

In [None]:
from toyaikit.pricing import PricingConfig
pricing = PricingConfig()

Calculate the price:

In [None]:
total_input = 0
total_output = 0

for res in all_results:
    usage = res['usage']
    total_input = total_input + usage.input_tokens
    total_output = total_output + usage.output_tokens

pricing.calculate_cost('gpt-4o-mini', total_input, total_output)

$0.04

Not bad.
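
The arithmetic behind such a number is simple: token counts multiplied by the price per million tokens. The sketch below is not the `toyaikit` implementation. `estimate_cost` is a hypothetical helper, the prices ($0.15 per million input tokens and $0.60 per million output tokens) are assumptions that should be checked against current gpt-4o-mini pricing, and the token counts are made-up round numbers rather than the actual usage from this run:

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_million, output_price_per_million):
    """Return the cost in dollars for the given token counts and prices."""
    input_cost = input_tokens / 1_000_000 * input_price_per_million
    output_cost = output_tokens / 1_000_000 * output_price_per_million
    return input_cost + output_cost

# Hypothetical token counts and assumed prices
cost = estimate_cost(
    input_tokens=200_000,
    output_tokens=20_000,
    input_price_per_million=0.15,
    output_price_per_million=0.60,
)
print(f"${cost:.2f}")
# $0.04
```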

## Saving the results

Now let's save all the results. First, we flatten them into a list with one dictionary per question:

In [None]:
all_questions = []

for res in all_results:
    doc = res['doc']
    questions = res['questions']
    for q in questions:
        q_dict = q.model_dump()
        q_dict['filename'] = doc['filename']
        all_questions.append(q_dict)

Then we use pandas to write them to a CSV file:

In [None]:
import pandas as pd

df_questions = pd.DataFrame(all_questions)
df_questions.to_csv('ground_truth_evidently.csv', index=False)
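
Before using the data, it may be worth checking how closely the generated questions follow the distribution rules from the instructions. A minimal sketch using only the standard library follows. `share_by` is a hypothetical helper, and the sample rows are invented for illustration (the query texts are the example phrases from the prompt):

```python
from collections import Counter

def share_by(rows, key):
    """Return the fraction of rows for each distinct value of row[key]."""
    counts = Counter(row[key] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

# Hypothetical rows with the same fields as all_questions
sample_questions = [
    {"question": "evidently data definition example", "difficulty": "beginner", "intent": "code"},
    {"question": "map target and prediction columns evidently", "difficulty": "beginner", "intent": "code"},
    {"question": "difference between timestamp and datetime evidently", "difficulty": "intermediate", "intent": "text"},
    {"question": "evidently custom column mapping", "difficulty": "advanced", "intent": "code"},
]

print(share_by(sample_questions, "difficulty"))
# {'beginner': 0.5, 'intermediate': 0.25, 'advanced': 0.25}
print(share_by(sample_questions, "intent"))
# {'code': 0.75, 'text': 0.25}
```

On the real data, `share_by(all_questions, 'difficulty')` and `share_by(all_questions, 'intent')` can be compared with the 60/30/10 and 75/25 targets.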

## Summary and next steps

We have generated synthetic data for our evaluations. Next we can use it for:

- Evaluating search
- Evaluating our agent

Let's start with search.