# Notebook 1: Eval Set & Baseline Implementation

## 1. Introduction

This notebook introduces:
- A synthetic evaluation set for customer support queries
- A baseline monolithic LLM system (no tools)
- Evaluation logic to track how well it answers queries based on policy

We'll use a simulated product catalog, return policy, and order data.

## 2. Setup

First, let's install the necessary libraries and set up our environment:

In [None]:
%pip install pandas

In [12]:
import random
import pandas as pd
from datetime import datetime, timedelta
import openai
import getpass

# Initialize OpenAI API key
openai.api_key = getpass.getpass('Enter your OpenAI API key: ')
client = openai.OpenAI(api_key=openai.api_key)

## 3. Generate Synthetic Data

Before we start building, we need to curate a dataset. This is necessary to ensure that our agent is making the right choices. Since we don't have production data, we can synthetically generate some. I'll go ahead and generate 20 customer support queries related to refunds. This is enough to get started. I can't stress this enough, but evaluating your AI agents is not a one-time thing. It's a continuous process that should be done after each iteration of your development cycle. 

In [24]:
products = [
    {"product_name": "Wireless Headphones", "category": "Electronics"},
    {"product_name": "Running Shoes", "category": "Footwear"},
    {"product_name": "Leather Wallet", "category": "Accessories"},
    {"product_name": "Smartwatch", "category": "Electronics"},
    {"product_name": "Yoga Mat", "category": "Fitness"}
]

return_policy_days = {
    "Electronics": 14,
    "Footwear": 30,
    "Accessories": 30,
    "Fitness": 20
}

today = datetime.strptime("2025-04-23", "%Y-%m-%d")
examples = []

for i in range(20):
    product = random.choice(products)
    order_date = today - timedelta(days=random.randint(5, 45))
    policy_days = return_policy_days[product["category"]]
    refund_eligible = (today - order_date).days <= policy_days

    examples.append({
        "query": f"Can I return my {product['product_name']}?",
        "customer_id": f"C{i+1:03}",
        "order_info": [{
            "product_name": product["product_name"],
            "order_id": f"ORD{i+1000}",
            "order_date": order_date.strftime("%Y-%m-%d"),
            "amount": round(random.uniform(20, 200), 2),
            "return_policy_days": policy_days
        }],
        "expected_tool_sequence": ["none"],
        "expected_response_contains": [
            "not eligible" if not refund_eligible else "eligible"
        ]
    })

df_eval = pd.DataFrame(examples)
df_eval.head()

Unnamed: 0,query,customer_id,order_info,expected_tool_sequence,expected_response_contains
0,Can I return my Leather Wallet?,C001,"[{'product_name': 'Leather Wallet', 'order_id'...",[none],[eligible]
1,Can I return my Wireless Headphones?,C002,"[{'product_name': 'Wireless Headphones', 'orde...",[none],[not eligible]
2,Can I return my Yoga Mat?,C003,"[{'product_name': 'Yoga Mat', 'order_id': 'ORD...",[none],[eligible]
3,Can I return my Wireless Headphones?,C004,"[{'product_name': 'Wireless Headphones', 'orde...",[none],[not eligible]
4,Can I return my Leather Wallet?,C005,"[{'product_name': 'Leather Wallet', 'order_id'...",[none],[eligible]


Note that the expected_tool_sequence fields are empty for now. We'll fill them in later.

## 4. Define the LLM

We'll use OpenAI's GPT-4o for this example.

In [25]:
def call_monolithic_llm(query: str) -> str:
    """
    Call a monolithic LLM using just the query.
    Replace with your actual OpenAI or local model call.
    """
    response = client.responses.create(
        model="gpt-4o-mini",
        input = [
            {
                "role": "user",
                "content": f"Respond with eligible or not eligible based on the following query: {query}\n\n Do not say anything else.",
            },
        ]
    )
    return response.output_text

def run_monolithic_llm_responses(df):
    responses = []
    for _, row in df.iterrows():
        response = call_monolithic_llm(row["query"])
        responses.append(response)
    return responses

df_eval["baseline_response"] = run_monolithic_llm_responses(df_eval)


## 5. Evaluation

Let's evaluate the monolithic LLM responses. 

In [26]:
def evaluate_response(row):
    response = row["baseline_response"].lower()
    score = sum(1 for phrase in row["expected_response_contains"] if phrase.lower() in response)
    return score / len(row["expected_response_contains"])

df_eval["baseline_score"] = df_eval.apply(evaluate_response, axis=1)
df_eval[["query", "baseline_response", "baseline_score"]].head()

# Summary Stats
accuracy = (df_eval["baseline_score"] == 1.0).mean()
print(f"Baseline Accuracy (exact match on all expected phrases): {accuracy:.2%}")

Baseline Accuracy (exact match on all expected phrases): 60.00%


This is obviously completely random, since the LLM did not have access to our customer data. Let's improve the reliability of the LLM by giving it access to its first tool, a retriever. 