# Generating Test Datasets

## Key Concepts
- Evaluation datasets contain inputs to test prompt performance systematically
- A dataset is an array of JSON objects, each with a "task" property
- Can generate datasets manually or automatically using Claude
- Use faster models (like Haiku) for generating test data to save costs
- Prefilling and stop sequences help parse JSON responses cleanly
- Save datasets to files for reusable evaluation workflows

## Important Code Patterns
- `add_user_message(messages, text)` - append user message to conversation
- `add_assistant_message(messages, text)` - append assistant message (for prefilling)
- `chat(messages, system=None, temperature=1.0, stop_sequences=[])` - wrapper for API calls
- `add_assistant_message(messages, "```json")` - prefill to force JSON format
- `chat(messages, stop_sequences=["```"])` - stop at closing code fence
- `json.loads(text)` - parse JSON response into Python object
- `json.dump(dataset, f, indent=2)` - save dataset to file with formatting
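
The prefill and stop-sequence pattern above leaves only the JSON body in the returned text, so it can go straight into `json.loads`. The following is a minimal offline sketch of that parsing step. `simulated_response` and `parse_tasks` are hypothetical names used for illustration, and the string stands in for what a real `chat(...)` call might return.

```python
import json

# Three backticks, built programmatically so this example stays readable.
FENCE = "`" * 3

# Hypothetical text returned after prefilling with FENCE + "json" and
# stopping at FENCE: the fences themselves are not part of the text.
simulated_response = """
[
    {"task": "Write a regex that matches an S3 bucket name"},
    {"task": "Write a Python function that extracts the region from an ARN"}
]
"""


def parse_tasks(text):
    """Parse the model's text into a list of task dictionaries."""
    return json.loads(text.strip())


tasks = parse_tasks(simulated_response)
print(len(tasks), "tasks parsed")
```

If the model adds commentary outside the JSON, `json.loads` raises `json.JSONDecodeError`, which is a useful signal that the prompt or prefill needs adjusting.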

## Best Practices
- Generate focused test cases that match your evaluation goals (a single function, a single JSON object, or a single regex)
- Keep test tasks simple and specific to the domain (e.g., AWS-related)
- Always specify expected output format clearly in generation prompts
- Use prefilling with stop sequences for reliable JSON parsing
- Save datasets to files for reproducibility and reuse
- Start with small datasets (3-5 cases) before scaling up
- Use an appropriate model tier for each task (a fast model such as Haiku for generation, a more capable model for evaluation)
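
One way to act on these practices is to check a generated dataset before saving it. The sketch below assumes the dataset shape described in these notes (a list of objects with a "task" string). `validate_dataset` is a hypothetical helper, not part of any library.

```python
def validate_dataset(dataset):
    """Return a list of problems found; an empty list means the shape looks right."""
    if not isinstance(dataset, list):
        return ["dataset is not a list"]

    problems = []
    for i, item in enumerate(dataset):
        if not isinstance(item, dict):
            problems.append(f"item {i} is not an object")
        elif not isinstance(item.get("task"), str) or not item["task"].strip():
            problems.append(f"item {i} has no non-empty 'task' string")
    return problems


# Example: one well-formed case and one case missing its task text
sample = [{"task": "Write a regex that matches an IAM role ARN"}, {"task": ""}]
for problem in validate_dataset(sample):
    print(problem)
```

Running a check like this on a small batch first makes it cheaper to catch prompt problems before scaling up.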

In [58]:
# Load env variables and create client
from dotenv import load_dotenv
from anthropic import Anthropic

load_dotenv()

client = Anthropic()
model = "claude-haiku-4-5"

In [59]:
# Helper functions
def add_user_message(messages, text):
    user_message = {"role": "user", "content": text}
    messages.append(user_message)


def add_assistant_message(messages, text):
    assistant_message = {"role": "assistant", "content": text}
    messages.append(assistant_message)


def chat(messages, system=None, temperature=1.0, stop_sequences=[]):
    # Build the request; optional parameters are added only when provided
    params = {
        "model": model,
        "max_tokens": 1000,
        "messages": messages,
        "temperature": temperature,
    }

    if system:
        params["system"] = system
    if stop_sequences:
        params["stop_sequences"] = stop_sequences

    response = client.messages.create(**params)
    # Return the text of the first content block
    return response.content[0].text

In [60]:
import json


def generate_dataset():
    prompt = """
Generate an evaluation dataset for a prompt evaluation. The dataset will be used to evaluate prompts
that generate Python, JSON, or Regex specifically for AWS-related tasks. Generate an array of JSON objects,
each representing a task that requires Python, JSON, or a Regex to complete.

Example output:
```json
[
    {
        "task": "Description of task"
    },
    ...additional
]
```

* Focus on tasks that can be solved by writing a single Python function, a single JSON object, or a regular expression.
* Focus on tasks that do not require writing much code

Please generate 3 objects.
"""

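    messages = []
    add_user_message(messages, prompt)
    # Prefill the assistant turn so the model starts directly with a JSON block
    add_assistant_message(messages, "```json")
    # Stop at the closing fence so only the JSON body is returned
    text = chat(messages, stop_sequences=["```"])
    # Parse into a Python list of task dictionaries
    return json.loads(text)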


In [61]:
dataset = generate_dataset()

# Uncomment the next line to display the dataset in the notebook
# dataset

with open("dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)
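
To reuse a saved dataset in a later evaluation run, it can be read back with `json.load`. The sketch below assumes the file holds a JSON array of task objects. `load_dataset` is a hypothetical helper, and the example writes a small sample to a temporary directory so that it runs on its own without touching `dataset.json`.

```python
import json
import os
import tempfile


def load_dataset(path):
    """Read a dataset file and return its parsed contents."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Hypothetical sample data standing in for a generated dataset
sample = [{"task": "Write a JSON policy that allows read access to one S3 bucket"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dataset.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f, indent=2)

    loaded = load_dataset(path)

# Each test case exposes its input under the "task" key
for case in loaded:
    print(case["task"])
```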