# Running the Eval

## Key Concepts
- Three-function pipeline: run_prompt, run_test_case, run_eval
- run_prompt merges test case with prompt template and gets Claude's response
- run_test_case orchestrates running prompt and grading (grading placeholder for now)
- run_eval processes entire dataset and collects results
- Initial prompt kept simple without formatting instructions
- Results structured as list of dicts containing output, test_case, and score
- Expect ~30 seconds runtime with Haiku for full dataset

## Important Code Patterns
- `run_prompt(test_case)` - interpolate `test_case["task"]` into prompt template
- `run_test_case(test_case)` - calls run_prompt, applies grading, returns result dict
- `run_eval(dataset)` - loops through dataset, appends results to list
- `with open("dataset.json", "r") as f: dataset = json.load(f)` - load generated dataset
- `results = run_eval(dataset)` - execute full evaluation pipeline
- Result structure: `{"output": text, "test_case": dict, "score": number}`
- Hardcoded `score = 10` as placeholder until grading implemented
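
The three functions listed above can be exercised end to end without an API key by swapping the model call for a stub. The sketch below is illustrative only: `fake_chat` and the `chat_fn` parameter are hypothetical additions, not part of the original notebook, and stand in for the real `chat` helper.

```python
def fake_chat(messages):
    """Hypothetical stand-in for chat(): echoes the last user message instead of calling Claude."""
    return f"[stubbed response to] {messages[-1]['content'].strip()}"


def run_prompt(test_case, chat_fn=fake_chat):
    """Interpolate the task into the prompt template and return the model output."""
    prompt = f"""
Please solve the following task:

{test_case["task"]}
"""
    messages = [{"role": "user", "content": prompt}]
    return chat_fn(messages)


def run_test_case(test_case, chat_fn=fake_chat):
    """Run the prompt for one test case and attach a placeholder score."""
    output = run_prompt(test_case, chat_fn)
    score = 10  # placeholder until grading is implemented
    return {"output": output, "test_case": test_case, "score": score}


def run_eval(dataset, chat_fn=fake_chat):
    """Run every test case and collect the result dicts in order."""
    return [run_test_case(test_case, chat_fn) for test_case in dataset]


# Made-up sample tasks, used only to exercise the pipeline
sample_dataset = [
    {"task": "Write a regex that matches an S3 URI"},
    {"task": "Write a JSON object describing an IAM policy"},
]
results = run_eval(sample_dataset)
```

Passing the model call in as a parameter is a design choice for this sketch: it lets the pipeline shape be checked cheaply before spending tokens on real requests.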

## Best Practices
- Start with simple baseline prompt before optimization
- Use placeholders (hardcoded scores) to test pipeline before full implementation
- Structure results for easy analysis and comparison
- Expect verbose Claude output without formatting constraints
- Core pipeline is foundation - complexity comes from prompt refinement and grading
- Process all test cases to get comprehensive evaluation data
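
One way to make results easy to analyze is to reduce the list of result dicts to a few summary numbers. A minimal sketch, assuming the result structure `{"output", "test_case", "score"}` described earlier; the `summarize` helper and the keys it returns are hypothetical, not part of the original notebook.

```python
from statistics import mean


def summarize(results):
    """Hypothetical helper: aggregate scores from a list of result dicts."""
    scores = [r["score"] for r in results]
    if not scores:
        return {"count": 0, "average_score": None, "lowest": None}

    # The lowest-scoring case is usually the first one worth inspecting
    lowest = min(results, key=lambda r: r["score"])
    return {
        "count": len(scores),
        "average_score": mean(scores),
        "lowest": lowest["test_case"]["task"],
    }


# Made-up sample results, used only to exercise the helper
sample_results = [
    {"output": "...", "test_case": {"task": "task A"}, "score": 10},
    {"output": "...", "test_case": {"task": "task B"}, "score": 6},
]
summary = summarize(sample_results)
```

With the hardcoded placeholder score, every average will be 10, so a summary like this only becomes informative once real grading is in place.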

In [1]:
# Load env variables and create client
from dotenv import load_dotenv
from anthropic import Anthropic

load_dotenv()

client = Anthropic()
model = "claude-haiku-4-5"

In [2]:
# Helper functions
def add_user_message(messages, text):
    user_message = {"role": "user", "content": text}
    messages.append(user_message)


def add_assistant_message(messages, text):
    assistant_message = {"role": "assistant", "content": text}
    messages.append(assistant_message)


def chat(messages, system=None, temperature=1.0, stop_sequences=None):
    params = {
        "model": model,
        "max_tokens": 1000,
        "messages": messages,
        "temperature": temperature,
    }

    # Only send optional parameters when they are actually set
    if system:
        params["system"] = system
    if stop_sequences:
        params["stop_sequences"] = stop_sequences

    response = client.messages.create(**params)
    return response.content[0].text

In [3]:
import json


def generate_dataset():
    prompt = """
Generate an evaluation dataset for a prompt evaluation. The dataset will be used to evaluate prompts
that generate Python, JSON, or Regex specifically for AWS-related tasks. Generate an array of JSON objects,
each representing a task that requires Python, JSON, or a Regex to complete.

Example output:
```json
[
    {
        "task": "Description of task"
    },
    ...additional
]
```

* Focus on tasks that can be solved by writing a single Python function, a single JSON object, or a regular expression.
* Focus on tasks that do not require writing much code

Please generate 3 objects.
"""

    messages = []
    add_user_message(messages, prompt)
    add_assistant_message(messages, "```json")
    text = chat(messages, stop_sequences=["```"])
    return json.loads(text)

In [4]:
dataset = generate_dataset()

# Uncomment the next line to display the dataset in the notebook
# dataset

with open("dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)

In [5]:
def run_prompt(test_case):
    """Merges the prompt and test case input, then returns the result"""
    prompt = f"""
Please solve the following task:

{test_case["task"]}
"""
    
    messages = []
    add_user_message(messages, prompt)
    output = chat(messages)
    return output

In [6]:
def run_test_case(test_case):
    """Calls run_prompt, then grades the result"""
    output = run_prompt(test_case)

    # TODO - Grading
    score = 10

    return {
        "output": output,
        "test_case": test_case,
        "score": score
    }

In [7]:
def run_eval(dataset):
    """Calls run_test_case on each test case in the dataset and collects the results"""
    results = []

    for test_case in dataset:
        result = run_test_case(test_case)
        results.append(result)

    return results


In [8]:
with open("dataset.json", "r") as f:
    dataset = json.load(f)

results = run_eval(dataset)

In [9]:
print(json.dumps(results, indent=2))


[
  {
    "output": "# AWS S3 Bucket Name Parser\n\nHere's a clean and robust solution:\n\n```python\ndef parse_s3_bucket_name(s3_uri: str) -> str:\n    \"\"\"\n    Parse an AWS S3 bucket name from an S3 URI.\n    \n    Args:\n        s3_uri: S3 URI in the format 's3://bucket-name/key/path'\n        \n    Returns:\n        The bucket name\n        \n    Raises:\n        ValueError: If the URI format is invalid\n        \n    Examples:\n        >>> parse_s3_bucket_name('s3://my-bucket/path/to/file.txt')\n        'my-bucket'\n        >>> parse_s3_bucket_name('s3://bucket-123/key')\n        'bucket-123'\n    \"\"\"\n    if not s3_uri.startswith('s3://'):\n        raise ValueError(f\"Invalid S3 URI format: {s3_uri}. Must start with 's3://'\")\n    \n    # Remove 's3://' prefix\n    uri_without_scheme = s3_uri[5:]\n    \n    # Split by '/' and get the first part (bucket name)\n    bucket_name = uri_without_scheme.split('/')[0]\n    \n    if not bucket_name:\n        raise ValueError(f\"Inva