# Prompting with Claude

- **Prompt Engineering**: Best practices to *improve* prompts (multishot prompting, structure with XML tags, etc.)
- **Prompt Evaluation**: Automated testing to *measure* how well prompts work (test against expected answers, compare different versions, review outputs for errors, etc.)

##### Import packages & set constants

In [5]:
# Load env variables
from dotenv import load_dotenv
load_dotenv()

# Set client
from anthropic import Anthropic
client = Anthropic()

In [6]:
import json
from statistics import mean
import ast
import re

##### Chat functions

In [2]:
def add_user_message(messages: list, text: str):
    user_message = {"role": "user", "content": text}
    messages.append(user_message)

def add_assistant_message(messages: list, text: str):
    assistant_message = {"role": "assistant", "content": text}
    messages.append(assistant_message)

def chat(messages: list,
         model: str = "claude-3-haiku-latest",
         max_tokens: int = 1000,
         system = None,
         temperature: float = 1.0,
         stop_sequences: list = None  # None instead of a shared mutable default ([])
    ):
    params = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": messages,
        "temperature": temperature,
    }
    # Only send optional parameters when they are set
    if stop_sequences:
        params["stop_sequences"] = stop_sequences
    if system:
        params["system"] = system
    message = client.messages.create(**params)
    # Return the text of the first content block
    return message.content[0].text

## 1. Prompt Evaluation

Run the prompt through an evaluation pipeline to score it, then iterate on the prompt.

*Sample Eval Workflow*: 
1. Draft a prompt
2. Create an Eval Dataset
3. Feed through Claude
4. Feed through a grader
5. Change prompt and repeat
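
The five steps above can be sketched offline before wiring in any API calls. This is a minimal illustration only: `fake_model` and `fake_grader` are hypothetical stand-ins for Claude and for a real grader, and the toy dataset is made up.

```python
from statistics import mean

def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a Claude call: always returns a fixed JSON string."""
    return '{"status": "ok"}'

def fake_grader(output: str) -> float:
    """Hypothetical stand-in grader: 10 if the output looks like a JSON object, else 0."""
    return 10.0 if output.strip().startswith("{") else 0.0

def run_eval(prompt_template: str, dataset: list[dict]) -> float:
    """Feed each test case through the model and the grader, then average the scores."""
    scores = []
    for case in dataset:
        prompt = prompt_template.format(task=case["task"])  # steps 1-2: prompt + dataset
        output = fake_model(prompt)                          # step 3: feed through the model
        scores.append(fake_grader(output))                   # step 4: feed through a grader
    return mean(scores)

toy_dataset = [
    {"task": "Return a JSON status object"},
    {"task": "Return an empty JSON object"},
]

# Step 5: change the template, rerun, and compare the averages
baseline = run_eval("Solve: {task}", toy_dataset)
print(baseline)
```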


**Exercise**: Write a prompt that will assist users in writing Python code, JSON config, or Regular expressions focused on AWS-specific use cases.
- **Input**: User's request
- **Output**: Python, JSON, or regular expression without any explanation.

#### 1.1 Generate Eval Dataset

In [11]:
def generate_dataset():
    prompt = """
Generate an evaluation dataset for a prompt evaluation. The dataset will be used to evaluate prompts that generate Python, JSON, or Regex specifically for AWS-related tasks. Generate an array of JSON objects, each representing a task that requires Python, JSON, or a Regex to complete.

Example output:
```json
[
{
    "task": "Description of task",
    "format": "json" or "python" or "regex",
    "solution_criteria": "Key criteria for evaluating the solution"
},
...additional
]
```

* Focus on tasks that can be solved by writing a single Python function, a single JSON object, or a single regex
* Focus on tasks that do not require writing much code

Generate 3 objects.
"""

    messages = []
    add_user_message(messages, prompt)
    add_assistant_message(messages, "```json")
    text = chat(messages, stop_sequences=["```"], model = "claude-3-haiku-20240307")
    return json.loads(text)

In [12]:
dataset = generate_dataset()

dataset

[{'task': 'Create an AWS Lambda function to handle a simple API Gateway request that returns a JSON response.',
  'format': 'python',
  'solution_criteria': 'The solution should include a Python function that takes an event object as input, processes the request, and returns a JSON response.'},
 {'task': 'Define a JSON structure to represent an AWS EC2 instance configuration, including instance type, image ID, key pair, and security group.',
  'format': 'json',
  'solution_criteria': 'The solution should include a valid JSON object that accurately represents the required EC2 instance configuration.'},
 {'task': 'Write a regular expression to validate an AWS IAM user name, which must be between 1 and 64 characters long and can only contain alphanumeric characters and the following special characters: plus (+), equal (=), comma (,), period (.), at (@), and hyphen (-).',
  'format': 'regex',
  'solution_criteria': 'The solution should include a regular expression that accurately validates

In [None]:
# Save dataset
with open("02_PromptEvaluation_outputs/dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)

#### 1.2 Running the eval

**Graders**:

| Grader | Description | Uses | 
| -- | -- | -- |
| **Code** | Programmatically evaluate the result. | - Check output length <br> - Verify words in output  <br> - Syntax validation  <br> - Readability scores |
| **Model** | Ask a model to assign a score to the output, or compare 2 versions | - Response quality <br> - Instruction following <br> - Completeness <br> - Helpfulness <br> - Safety |
| **Human** | Ask a human to assign a score to the output, or compare 2 versions | - General response quality <br> - Comprehensiveness <br> - Depth <br> - Conciseness <br> - Relevance |
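
As a rough illustration of the first two code-grader uses in the table (output length and words in the output), here is a small sketch. The function names, the 200-character limit, and the phrase list are assumptions chosen for the example, not part of the exercise.

```python
def grade_length(output: str, max_chars: int = 200) -> float:
    """Return 10 if the output fits within max_chars (an assumed limit), else 0."""
    return 10.0 if len(output) <= max_chars else 0.0

def grade_no_commentary(output: str,
                        forbidden: tuple[str, ...] = ("here is", "explanation")) -> float:
    """Return 0 if the output contains any assumed 'commentary' phrase, else 10."""
    lowered = output.lower()
    return 0.0 if any(phrase in lowered for phrase in forbidden) else 10.0

print(grade_length("^[a-z]+$"))                      # 10.0
print(grade_no_commentary("Here is the regex: ^a$"))  # 0.0
```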


**Evaluation Criteria for Exercise**:
- Format: Should return only Python, JSON, or Regex, no explanation.
- Valid Syntax: Produced Python, JSON and Regex should have correct syntax.
- Task Following: Response should directly and clearly address the user's task; generated code should be accurate.

##### Functions

In [107]:
# Create a code-based grader for *Format and Syntax*

# JSON
def validate_json(text):
    try:
        json.loads(text.strip())
        return 10
    except json.JSONDecodeError:
        return 0

# Python
def validate_python(text):
    try:
        ast.parse(text.strip())
        return 10
    except SyntaxError:
        return 0

# Regex
def validate_regex(text):
    try:
        re.compile(text.strip())
        return 10
    except re.error:
        return 0

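# Code grader
def code_grader(response, test_case):
    # Normalize the label so "JSON", "Json", and "json" all match
    # (also avoids shadowing the built-in `format`)
    fmt = test_case["format"].strip().lower()
    match fmt:
        case "python":
            score = validate_python(response)
        case "json":
            score = validate_json(response)
        case "regex":
            score = validate_regex(response)
        case _:
            # Unknown format: fail loudly instead of hitting an unbound `score`
            raise ValueError(f"Unsupported format: {test_case['format']!r}")
    return float(score)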
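
Note that these checks only confirm the text *parses*; they say nothing about whether it solves the task. The sketch below (with a hypothetical helper `parses`) shows how permissive each parser is, which is one reason a model grader is used alongside the code grader.

```python
import ast
import json
import re

def parses(parser, text: str, error: type[Exception]) -> bool:
    """Return True if parser(text) succeeds, False if it raises the given error."""
    try:
        parser(text)
        return True
    except error:
        return False

print(parses(ast.parse, "def f(:", SyntaxError))           # False: broken Python
print(parses(ast.parse, "hello", SyntaxError))             # True: a bare name is valid Python
print(parses(re.compile, "print('hi')", re.error))         # True: Python code compiles as a regex
print(parses(json.loads, "42", json.JSONDecodeError))      # True: a bare number is valid JSON
print(parses(json.loads, '{"a": 1,}', json.JSONDecodeError))  # False: trailing comma
```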


In [122]:
# Create a model grader for *Task Following*
def model_grader(test_case, output):
    # Create evaluation prompt
    eval_prompt = f"""
    You are an expert AWS code reviewer. Evaluate this AI-generated solution and consider how well it aligns with the solution criteria provided.
    
    ORIGINAL TASK:
    <task>
    {test_case['task']}
    </task>

    SOLUTION TO EVALUATE:
    <solution>
    {output}
    </solution>

    SOLUTION_CRITERIA:
    <solution_criteria>
    {test_case['solution_criteria']}
    </solution_criteria>

    OUTPUT FORMAT
    Provide your evaluation as a structured JSON object with:
    - "strengths": An array of 1-3 key strengths
    - "weaknesses": An array of 1-3 key areas for improvement  
    - "reasoning": A concise explanation of your assessment
    - "score": A number between 1-10

    Respond with JSON. Keep the response concise and direct.
    Sample response:
    {{
      "strengths": string[],
      "weaknesses": string[],
      "reasoning": string,
      "score": number
    }}
    """
    
    messages = []
    add_user_message(messages, eval_prompt)
    add_assistant_message(messages, "```json")
    eval_text = chat(messages, stop_sequences=["```"])
    return json.loads(eval_text)
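
The model grader trusts that the response is well-formed JSON with the requested keys. One possible safeguard, sketched here as a hypothetical helper `parse_grade` (not part of the original notebook), is to check the shape and score range before the result is used downstream.

```python
import json

# Keys requested in the evaluation prompt's output format
REQUIRED_KEYS = {"strengths", "weaknesses", "reasoning", "score"}

def parse_grade(raw: str) -> dict:
    """Parse grader output and check it has the expected shape (hypothetical helper)."""
    grade = json.loads(raw)
    if not isinstance(grade, dict):
        raise ValueError("Grader response must be a JSON object")
    missing = REQUIRED_KEYS - grade.keys()
    if missing:
        raise ValueError(f"Grader response is missing keys: {sorted(missing)}")
    score = float(grade["score"])
    if not 1 <= score <= 10:
        raise ValueError(f"Score {score} is outside the 1-10 range")
    grade["score"] = score
    return grade

sample = '{"strengths": ["valid syntax"], "weaknesses": [], "reasoning": "Meets the criteria.", "score": 9}'
print(parse_grade(sample)["score"])  # 9.0
```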

In [None]:
def create_prompt(prompt, test_case):
    """Merges the prompt and test case input"""
    prompt_merged = prompt(test_case)
    return prompt_merged

def run_prompt(prompt, test_case):
    """Sends the merged prompt to Claude and returns the raw output"""
    prompt_merged = create_prompt(prompt, test_case)
    messages = []
    add_user_message(messages, prompt_merged)
    # Prefill a code fence so the model answers with code only
    add_assistant_message(messages, "```code")
    output = chat(messages, stop_sequences=["```"])
    return output

def score_output(prompt, test_case):
    """Calls run_prompt and grades the result"""
    # Get response 
    output = run_prompt(prompt, test_case)

    # Grading
    model_score = model_grader(test_case, output)
    syntax_score = code_grader(output, test_case)
    score = (2 * syntax_score/3 + model_score['score']/3)

    result = {
        "test_case": test_case,
        "output": output,
        "strengths": model_score['strengths'],
        "weaknesses": model_score['weaknesses'],
        "reasoning": model_score['reasoning'],
        "model_score": model_score['score'],
        "syntax_score": syntax_score,
        "final_score": score
    }
    return result

def run_full_eval(prompt, dataset):
    """Scores every test case in the dataset and prints the average final score"""
    results = []
    for test_case in dataset:
        result = score_output(prompt, test_case)
        results.append(result)
    
    average_score = mean([result["final_score"] for result in results])
    print(f"Average score: {average_score}")
    
    return {"prompt": create_prompt(prompt, "test_case"), "results": results}
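
The final score in `score_output` is a weighted average in which the syntax check counts twice as much as the model grade. A small worked example, using a hypothetical helper `combine_scores` that mirrors the formula:

```python
def combine_scores(syntax_score: float, model_score: float) -> float:
    """Weighted average: syntax counts for 2/3, the model grade for 1/3."""
    return 2 * syntax_score / 3 + model_score / 3

# Valid syntax (10) with a model grade of 8
print(round(combine_scores(10, 8), 2))  # 9.33

# Invalid syntax (0) caps the result at a third of the model grade
print(round(combine_scores(0, 9), 2))   # 3.0
```

With this weighting, an output that fails the syntax check can score at most 10/3 overall, however highly the model grader rates it.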

##### Prompt V1

In [None]:
# Load dataset
with open("02_PromptEvaluation_outputs/dataset.json", "r") as f:
    dataset = json.load(f)

In [None]:
prompt_v1 = lambda x: f"""Provide a solution to the following task:
{x}
"""

results = run_full_eval(prompt = prompt_v1, dataset = dataset)

Average score: 9.333333333333334


In [None]:
with open("eval_prompt_v1.json", "w") as f:
    json.dump(results, f)

##### Prompt V2

In [123]:
prompt_v2 = lambda x: f"""Provide a solution to the following task:
{x}

* Respond only with Python, JSON, or plain Regex.
* Do not add any comments or commentary or explanation.
"""

results = run_full_eval(prompt = prompt_v2, dataset = dataset)

with open("eval_prompt_v2.json", "w") as f:
    json.dump(results, f)

Average score: 10.0
