# Prompt Evaluation

**Prompt Evaluation** is the process of testing the efficacy of a prompt. The goal is to obtain an objective metric that tells us whether a prompt is effective. While prompts can be evaluated manually, the best approach is often to define an evaluation workflow that automatically tests and scores prompts, so developers can easily see how changes to their prompts affect performance.

A typical prompt evaluation workflow might look like this:
- Step 1: Draft a prompt
- Step 2: Create an evaluation dataset
- Step 3: Test and score the prompt (using the evaluation dataset)
- Step 4: Iterate

Scoring is generally achieved by scoring each test case in the evaluation set individually and then averaging the individual scores to derive an aggregate score.
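As a small illustration of that aggregation, the sketch below averages a handful of per-test-case scores. The score values are made up for the example and do not come from any real run.

```python
from statistics import mean

# Hypothetical per-test-case scores on a 0-10 scale (illustrative values only)
test_case_scores = [7.5, 4.0, 8.0]

# The aggregate score is the arithmetic mean of the individual scores
aggregate_score = mean(test_case_scores)
print(f"Aggregate score: {aggregate_score:.2f}")
```

A plain mean weights every test case equally, which is the approach this notebook takes.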

This notebook offers an implementation of a rudimentary prompt evaluation workflow.

**Note:** There are open source (e.g., [DeepEval](https://github.com/confident-ai/deepeval) and [Promptfoo](https://github.com/promptfoo/promptfoo)) and paid packages that can help you develop a prompt evaluation workflow. We are writing our own below to:
- Demonstrate that you don't *require* full-featured tools to perform prompt evaluation
- Build competency with the concepts

## Set up the Environment and Create Helper Functions

In [35]:
# Load environment variables and create client

from dotenv import load_dotenv
from anthropic import Anthropic
import json

load_dotenv()   # Load environment variables from .env file

# Create an API client
client = Anthropic()
model = "claude-haiku-4-5"

In [None]:
# Define helper functions for managing message history and chatting with Claude

def add_user_message(messages : list[dict], content : str):
    """Append a user message to a message (aka conversation) history.

    Args:
        messages (list[dict]): The message history.
        content (str): A user message to append to the message history.
    """
    user_message = { "role": "user", "content": content }
    messages.append(user_message)

def add_assistant_message(messages : list[dict], content : str):
    """Append an assistant message to a message (aka conversation) history.

    Args:
        messages (list[dict]): The message history.
        content (str): An assistant message to append to the message history.
    """
    assistant_message = { "role": "assistant", "content": content }
    messages.append(assistant_message)

def chat(messages : list[dict],
         system : str = None,
         temperature : float = 1.0,
         stop_sequences : list[str] = None) -> str:
    """Chat with Claude.

    Args:
        messages (list[dict]): The message (aka conversation) history.
        system (str, optional): The system prompt. Defaults to None.
        temperature (float, optional): Temperature for response generation. Defaults to 1.0.
        stop_sequences (list[str], optional): Stop sequences for Claude's response. Defaults to None.

    Returns:
        str: Claude's text response.
    """
    params = {
        "model": model,
        "max_tokens": 1000,
        "messages": messages,
        "temperature": temperature,
    }

    # Add the optional parameters to the dictionary only if provided
    if system:
        params["system"] = system
    if stop_sequences:
        params["stop_sequences"] = stop_sequences

    response = client.messages.create(**params)

    return response.content[0].text

## Evaluation Workflow

### Step 1: Draft a Prompt

Our prompt's goal is to assist users in writing three specific types of output for AWS use cases. The output could be:

- Python code
- A JSON configuration file
- A regular expression

The key requirement is that when a user requests help with a task, we return clean output in one of these formats without any extra explanations, headers, or footers.

Version 1 of our prompt, which is *super* naive, is as follows:

```python
prompt_v1 = """
    Please provide a solution to the following task:
    {{task}}
"""
```

### Step 2: Create Evaluation Dataset

First, define a function for creating our evaluation dataset.

In this example, the evaluation dataset will:
- Be generated by Claude
- Contain only three entries

In "real life", the evaluation dataset would be much larger and would likely combine human-generated and machine-generated entries.
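One way to combine the two kinds of entries is sketched below. The `merge_datasets` helper is hypothetical (it is not part of this notebook) and assumes both lists use the same `task`/`format` record shape as the generated dataset.

```python
def merge_datasets(human_entries, generated_entries):
    """Combine two lists of {"task", "format"} records, dropping duplicate tasks.

    Hypothetical helper: tasks are compared case-insensitively after trimming
    whitespace, and human-written entries win when the same task appears twice.
    """
    merged = []
    seen = set()
    for record in human_entries + generated_entries:
        key = record["task"].strip().lower()
        if key not in seen:
            seen.add(key)
            merged.append(record)
    return merged
```

Listing the human-written entries first means they take precedence over machine-generated duplicates, which is a design choice made for this sketch rather than a requirement.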

In [33]:
# Define the function for generating an evaluation dataset

import json     # For validating JSON data

def generate_dataset():

    prompt = """
        Generate an evaluation dataset for a prompt evaluation. The dataset will be used to evaluate
        prompts that generate Python, JSON, or Regex specifically for AWS-related tasks. Generate an
        array of JSON objects, each representing a task that requires Python, JSON, or a Regex to
        complete.

        Example output:
        ```json
        [
            {
                "task": "Description of task",
                "format": "python" or "json" or "regex"
            },
            ...additional
        ]
        ```

        * Focus on tasks that can be solved by writing a single Python function, a single JSON object,
        or a regular expression.
        * Focus on tasks that do not require writing much code

        Please generate 3 objects.
        """
    
    messages = []
    add_user_message(messages, prompt)
    add_assistant_message(messages, "```json")
    response = chat(messages, stop_sequences=["```"])
    return json.loads(response)     # the json.loads will raise an error if response is not valid JSON




Next, create the evaluation dataset and write it to a file named `dataset.json`.

In [34]:
# Generate the evaluation dataset and save it to a file

dataset = generate_dataset()

# Write the evaluation dataset to a file
with open("dataset.json", "w") as f:
    json.dump(dataset, f, indent=4)

### Step 3: Test and Score the Prompt

In this step, we take each task in the evaluation dataset, merge it with our prompt to create a test case, feed the test case into Claude, and then score the results.

The most complex operation in this step is scoring. Generally speaking, there are three types of scorers:

- **Code Scorers** are programmatic scorers written by humans. These scorers can do things like check the length of a response, verify keywords, and validate syntax.
- **Model Scorers** are typically language models that evaluate the quality, completeness, helpfulness, safety, etc. of the output.
- **Human Scorers** often provide the best-quality scoring, but are generally slower and more expensive than the other two.
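To make the first category concrete, here is a minimal sketch of two code scorers: one that checks response length and one that checks for required keywords. The function names, the 0-10 scale, and the `max_chars` threshold are assumptions made for illustration; they are not used elsewhere in this notebook.

```python
def score_length(response: str, max_chars: int = 2000) -> int:
    """Return 10 if the response is non-empty and within max_chars, else 0."""
    return 10 if 0 < len(response.strip()) <= max_chars else 0


def score_keywords(response: str, required: list[str]) -> float:
    """Return the fraction of required keywords found, scaled to 0-10."""
    if not required:
        return 10.0
    text = response.lower()
    hits = sum(1 for keyword in required if keyword.lower() in text)
    return 10 * hits / len(required)
```

Scorers like these are cheap and deterministic, so they can run on every test case without an extra model call.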

In the example below, we will use both code and model scorers.

Before we implement scorers, however, we will need to define evaluation criteria. For our example, we will evaluate output based on:

- **Format/Syntax** - Responses should be Python code, JSON, or regex *only*, and should be syntactically valid.
- **Accuracy** - Responses should accurately, directly, and clearly address the task.

We will be using:

- Code scorers for format and syntax
- A model scorer for accuracy

We start by defining functions that will accept various prompts, create the test cases, and score the output.

**Note:** The `run_test_case` and `score_the_accuracy` functions below use the *prefilled assistant messages* and *stop sequences* techniques introduced in [005_controlling_output.ipynb](001_accessing_claude_with_the_api/005_controlling_output.ipynb) and [006_structured_data.ipynb](001_accessing_claude_with_the_api/006_structured_data.ipynb).

In [61]:
# Functions for evaluating a prompt using the evaluation dataset

from statistics import mean 
import re                       # For validating regex patterns
import ast                      # For validating Python code

# Merge the prompt with a specific test case
def merge_task_with_prompt(task, prompt):
    return prompt.replace("{{task}}", task)

# Run a single test case
def run_test_case(test_case):
    messages = []
    add_user_message(messages, test_case)
    add_assistant_message(messages, "```code")
    response = chat(messages, stop_sequences=["```"])
    return response

# Score the format/syntax of the response using a code scorer
def score_the_syntax(solution, expected_format):
    if expected_format == "json":
        try:
            json.loads(solution.strip())
            return 10
        except json.JSONDecodeError:
            return 0
    elif expected_format == "python":
        try:
            ast.parse(solution.strip())
            return 10
        except SyntaxError:
            return 0
    else:
        try:
            re.compile(solution.strip())
            return 10
        except re.error:
            return 0

# Score the accuracy of the response using a model scorer
def score_the_accuracy(task, solution):

    # Create the evaluation prompt
    # Note: If you only ask Claude to score the accuracy, it often tends to give a score of 6. You must ask
    # for strengths, weaknesses, and reasoning to get a more balanced evaluation.
    eval_prompt = """
        You are an expert AWS code reviewer. Your task is to evaluate the following AI-generated solution.

        Original Task:
        <task>
        {{task}}
        </task>

        Solution to Evaluate:
        <solution>
        {{solution}}
        </solution>

        Output Format
        Provide your evaluation as a structured JSON object with the following fields, in this specific order:
        - "strengths": An array of 1-3 key strengths
        - "weaknesses": An array of 1-3 key areas for improvement
        - "reasoning": A concise explanation of your overall assessment
        - "score": A number between 1-10

        Respond with JSON. Keep your response concise and direct.
        Example response shape:
        {
            "strengths": string[],
            "weaknesses": string[],
            "reasoning": string,
            "score": number
        }
    """    
    eval_prompt = eval_prompt.replace("{{task}}", task)
    eval_prompt = eval_prompt.replace("{{solution}}", solution)

    messages = []

    # Prepare the messages using a prefilled assistant message and a stop sequence
    add_user_message(messages, eval_prompt)
    add_assistant_message(messages, "Your evaluation in JSON format is as follows:\n```json")
    response = chat(messages, stop_sequences=["```"])

    # Convert the JSON into a dict for easy access to the data (in Python).
    # Note: If the response is not valid JSON, fall back to a default evaluation with a score of 0
    try:
        response = json.loads(response)
    except json.JSONDecodeError:
        response = {
            "strengths": [],
            "weaknesses": [],
            "reasoning": "Response was not valid JSON.",
            "score": 0
        }

    return response

# Evaluate the prompt on the entire dataset
def evaluate_prompt(prompt, dataset):
    results = []
    scores = []
    for record in dataset:

        # Run the test case
        test_case = merge_task_with_prompt(record["task"], prompt)
        response = run_test_case(test_case)

        # Score the response
        syntax_score = score_the_syntax(response, record["format"])
        accuracy_object = score_the_accuracy(record["task"], response)
        accuracy_score = accuracy_object["score"]
        accuracy_reasoning = accuracy_object["reasoning"]

        # Get the average score
        individual_score = mean([syntax_score, accuracy_score])
        
        # Store the score for overall averaging later
        scores.append(individual_score)

        # Store the results of each evaluation
        results.append({
            "task": record["task"],
            "output": response,
            "syntax_score": syntax_score,
            "accuracy_score": accuracy_score,
            "accuracy_reasoning": accuracy_reasoning,
            "score": individual_score
        })

    # Calculate and print the overall average score
    overall_average_score = mean(scores)
    print(f"Average score (across all test cases) for this prompt: {overall_average_score}")

    return results

### Step 4: Iterate

Here is the first prompt:

In [62]:
prompt_v1 = """
    Please provide a solution to the following task: {{task}}
"""

with open("dataset.json", "r") as f:
    dataset = json.load(f)

results = evaluate_prompt(prompt_v1, dataset)
print(json.dumps(results, indent=4))


Average score (across all test cases) for this prompt: 4.833333333333333
[
    {
        "task": "Parse an AWS S3 bucket name from a full S3 URI (e.g., 's3://my-bucket/path/to/object') and extract only the bucket name",
        "output": "\n# Solution 1: Using string methods (Simple and straightforward)\ndef extract_bucket_name_simple(s3_uri):\n    \"\"\"\n    Extract bucket name from S3 URI using string methods.\n    \n    Args:\n        s3_uri (str): Full S3 URI (e.g., 's3://my-bucket/path/to/object')\n    \n    Returns:\n        str: The bucket name\n    \"\"\"\n    # Remove the 's3://' prefix and split by '/'\n    return s3_uri.replace('s3://', '').split('/')[0]\n\n\n# Solution 2: Using urllib.parse (More robust)\nfrom urllib.parse import urlparse\n\ndef extract_bucket_name_urlparse(s3_uri):\n    \"\"\"\n    Extract bucket name from S3 URI using urllib.parse.\n    \n    Args:\n        s3_uri (str): Full S3 URI (e.g., 's3://my-bucket/path/to/object')\n    \n    Returns:\n        str

Here is the (trivially different) second prompt.

In [63]:
prompt_v2 = """
    Please provide a solution to the following task: {{task}}
    Respond only with Python, JSON, or a plain Regex.
    Do not add any comments or commentary or explanation.
"""

with open("dataset.json", "r") as f:
    dataset = json.load(f)

results = evaluate_prompt(prompt_v2, dataset)
print(json.dumps(results, indent=4))



Average score (across all test cases) for this prompt: 6.5
[
    {
        "task": "Parse an AWS S3 bucket name from a full S3 URI (e.g., 's3://my-bucket/path/to/object') and extract only the bucket name",
        "output": "\nimport re\n\ndef extract_bucket_name(s3_uri):\n    match = re.match(r's3://([a-z0-9.-]+)/', s3_uri)\n    if match:\n        return match.group(1)\n    return None\n",
        "syntax_score": 10,
        "accuracy_score": 5,
        "accuracy_reasoning": "The solution works for the most common case (URIs with paths) but has significant limitations. The trailing slash requirement is a critical flaw since 's3://my-bucket' is a valid S3 URI. The regex pattern also doesn't validate AWS bucket naming constraints, which could allow invalid bucket names through. The implementation would benefit from more robust pattern matching and input validation.",
        "score": 7.5
    },
    {
        "task": "Create a Python function that takes an AWS IAM policy document as a JS

In the run shown above, the average score (across all test cases) rose from about 4.83 for version 1 of the prompt to 6.5 for version 2, though results can vary from run to run. If you want to see more dramatic improvement, you could (as [this](https://anthropic.skilljar.com/claude-with-the-anthropic-api/287738) video suggests):

- Update the dataset generation prompt to ask for solution criteria
- Update the prompt in the `score_the_accuracy` function to include that solution criteria
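The second suggestion could be sketched roughly as follows. This is only an illustration: the `build_eval_prompt` helper and the `solution_criteria` field are hypothetical names, and the sketch assumes the dataset generation prompt has already been extended to emit that field alongside `task` and `format`.

```python
def build_eval_prompt(task: str, solution: str, solution_criteria: str) -> str:
    """Fill a grading prompt template with the task, solution, and criteria.

    Hypothetical helper: `solution_criteria` is assumed to come from each
    dataset record, which the current dataset does not yet include.
    """
    template = """
        You are an expert AWS code reviewer. Your task is to evaluate the following AI-generated solution.

        Original Task:
        <task>
        {{task}}
        </task>

        Solution to Evaluate:
        <solution>
        {{solution}}
        </solution>

        Criteria the solution must meet:
        <criteria>
        {{solution_criteria}}
        </criteria>
    """
    # Substitute each placeholder using the same replace-based approach as the notebook
    prompt = template.replace("{{task}}", task)
    prompt = prompt.replace("{{solution_criteria}}", solution_criteria)
    prompt = prompt.replace("{{solution}}", solution)
    return prompt
```

Giving the model scorer explicit criteria gives it something concrete to grade against, rather than leaving it to infer what a good solution looks like from the task alone.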