# Evaluation Quickstart
In this notebook, we'll walk through the principles of creating evaluations through the SDK.
This draws on the code in this package that implements an email agent (`agent/*`), as well as experiments (`setup/experiments.py`).

Evaluations are made up of three components:

1. A **dataset** of test inputs and expected outputs.
2. An **application or target function** that defines what you are evaluating, taking in inputs and returning the application output.
3. **Evaluators** that score your target function's outputs.
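
The three components can be sketched in plain Python before involving the SDK. The snippet below is a toy stand-in, not the LangSmith API: `toy_dataset`, `toy_target`, `toy_evaluator`, and `run_toy_evaluation` are hypothetical names invented for illustration.

```python
# Toy illustration of the three evaluation components (no SDK involved).
# All names here are hypothetical and exist only for this sketch.

# 1. Dataset: test inputs paired with expected outputs.
toy_dataset = [
    {"inputs": {"subject": "Quarterly planning meeting"}, "outputs": {"decision": "respond"}},
    {"inputs": {"subject": "Reminder: Submit your expense reports"}, "outputs": {"decision": "notify"}},
]

# 2. Target function: takes inputs, returns the application output.
def toy_target(inputs: dict) -> dict:
    decision = "notify" if inputs["subject"].lower().startswith("reminder") else "respond"
    return {"decision": decision}

# 3. Evaluator: scores the target's output against the expected output.
def toy_evaluator(outputs: dict, reference_outputs: dict) -> dict:
    return {"key": "correct", "score": outputs["decision"] == reference_outputs["decision"]}

def run_toy_evaluation(dataset: list, target, evaluator) -> list:
    """Run the target on every example and score each result."""
    return [evaluator(target(ex["inputs"]), ex["outputs"]) for ex in dataset]

results = run_toy_evaluation(toy_dataset, toy_target, toy_evaluator)
print(results)
```

The rest of the notebook follows the same shape, with LangSmith storing the dataset and running the loop.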

## Setup
We'll start by importing the email agent we created. This agent consists of two main steps: a triage step and a response step.

The triage step determines whether the agent should respond to the email or ignore it. The response step takes the actions needed (such as checking our calendar or scheduling a meeting) to construct a response.

We also import our LangSmith client to use for running our evaluations.

In [None]:
from agent.agent import email_assistant
from setup.config import client

![Arch](../images/architecture.png)

## Part 1: Final Response Evaluations

### Dataset Creation
We'll create a dataset that captures both the inputs to our email agent, as well as the expected output. Instead of directly crafting ground truth, we'll define our expected output as success criteria our agent should meet.


In [None]:
examples = [
    {
        "email": {
            "author": "Alice Smith <alice.smith@company.com>",
            "to": "Robert Xu <Robert@company.com>",
            "subject": "Quick question about API documentation",
            "email_thread": """Hi Robert,

        I was reviewing the API documentation for the new authentication service and noticed a few endpoints seem to be missing from the specs. Could you help clarify if this was intentional or if we should update the docs?

        Specifically, I'm looking at:
        - /auth/refresh
        - /auth/validate

        Thanks!
        Alice""",
        },

        "success_criteria":  """
        • Send email with write_email tool call to acknowledge the question and confirm it will be investigated
        """,
    },
    {
        "email": {
            "author": "Project Manager <pm@client.com>",
            "to": "Robert Xu <Robert@company.com>",
            "subject": "Tax season let's schedule call",
            "email_thread": """Robert,

        It's tax season again, and I wanted to schedule a call to discuss your tax planning strategies for this year. I have some suggestions that could potentially save you money.

        Are you available sometime next week? Tuesday or Thursday afternoon would work best for me, for about 45 minutes.

        Regards,
        Project Manager""",
        },

        "success_criteria": """
        • Check calendar availability for Tuesday or Thursday afternoon next week with check_calendar_availability tool call 
        • Confirm availability for a 45-minute meeting
        • Send calendar invite with schedule_meeting tool call 
        • Send email with write_email tool call to acknowledge the tax planning request and notify the sender that a meeting has been scheduled
        """,

    },

    {
        "email": {
            "author": "HR Department <hr@company.com>",
            "to": "Robert Xu <Robert@company.com>",
            "subject": "Reminder: Submit your expense reports",
            "email_thread": """Hello Robert,

        This is a friendly reminder that all expense reports for the previous month need to be submitted by this Friday. Please make sure to include all receipts and proper documentation.

        If you have any questions about the submission process, feel free to reach out to the HR team.

        Best regards,
        HR Department""",

        },

        "success_criteria": """
        • No response needed
        • Ensure the user is notified  
        """,
    },
    { 
        "email": {
            "author": "Conference Organizer <events@techconf.com>",
            "to": "Robert Xu <Robert@company.com>",
            "subject": "Do you want to attend this conference?",
            "email_thread": """Hi Robert,

        We're reaching out to invite you to TechConf 2025, happening May 15-17 in San Francisco. 

        The conference features keynote speakers from major tech companies, workshops on AI and ML, and great networking opportunities. Early bird registration is available until April 30th.

        Would you be interested in attending? We can also arrange for group discounts if other team members want to join.

        Best regards,
        Conference Organizers""",
        },

        "success_criteria": """
        • Express interest in attending TechConf 2025
        • Ask specific questions about AI/ML workshops
        • Inquire about group discount details
        • Send email with write_email tool call to express interest in attending TechConf 2025, ask specific questions about AI/ML workshops, and inquire about group discount details
        """,
    },
    { 
        "email": {
            "author": "Team Lead <teamlead@company.com>",
            "to": "Robert Xu <Robert@company.com>",
            "subject": "Quarterly planning meeting",
            "email_thread": """Hi Robert,

        It's time for our quarterly planning session. I'd like to schedule a 90-minute meeting next week to discuss our roadmap for Q3.

        Could you let me know your availability for Monday or Wednesday? Ideally sometime between 10AM and 3PM.

        Looking forward to your input on the new feature priorities.

        Best,
        Team Lead""",
        },

        "success_criteria": """
        • Check calendar for 90-minute meeting availability for Monday or Wednesday with check_calendar_availability tool call 
        • Send email acknowledging the request and providing availability with write_email tool call  
        """
    },
]

dataset_name = "Email Agent Notebook: Final Response"

if not client.has_dataset(dataset_name=dataset_name):
    dataset = client.create_dataset(dataset_name=dataset_name)
    client.create_examples(
        inputs=[{"email_input": ex["email"]} for ex in examples],
        outputs=[{"success_criteria": ex["success_criteria"]} for ex in examples],
        dataset_id=dataset.id
    )
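
Before uploading, it can help to confirm that every example has the shape the rest of the notebook expects. The check below is a minimal sketch; `validate_examples` is a hypothetical helper, and the required keys are inferred from the examples above rather than from any SDK schema.

```python
# Hypothetical pre-upload check; the required keys mirror the examples above.
REQUIRED_EMAIL_KEYS = {"author", "to", "subject", "email_thread"}

def validate_examples(examples: list, output_key: str) -> list:
    """Return a list of human-readable problems; an empty list means every example looks well-formed."""
    problems = []
    for idx, ex in enumerate(examples):
        missing = REQUIRED_EMAIL_KEYS - set(ex.get("email", {}))
        if missing:
            problems.append(f"example {idx}: email is missing {sorted(missing)}")
        if output_key not in ex:
            problems.append(f"example {idx}: missing '{output_key}'")
    return problems

# Made-up sample: the second email lacks two required fields
sample = [
    {"email": {"author": "a", "to": "b", "subject": "s", "email_thread": "t"}, "success_criteria": "• Reply"},
    {"email": {"author": "a", "to": "b"}, "success_criteria": "• Reply"},
]
print(validate_examples(sample, "success_criteria"))
# ["example 1: email is missing ['email_thread', 'subject']"]
```

The same helper can be reused for the later datasets by passing `"classification_decision"` or `"trajectory"` as `output_key`.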

Next, we'll create a run function to run our email agent on the dataset inputs.

In [None]:
import uuid

def run_email_agent(inputs: dict) -> dict:
    # Use a fresh thread ID for each run
    thread_id = uuid.uuid4()
    configuration = {"thread_id": thread_id}

    result = email_assistant.invoke(inputs, config=configuration)
    return {"classification_decision": result["classification_decision"], "messages": result["messages"]}

We'll define an LLM-as-a-judge evaluator to check whether the response met our success criteria.

In [None]:
from langchain_openai import ChatOpenAI
from typing import TypedDict, Annotated

# LLM-as-judge output schema for completeness
class Completeness(TypedDict):
    """Evaluate whether an agent response meets all success criteria."""
    reasoning: Annotated[str, ..., "Explain your step-by-step reasoning for the completeness assessment, covering which success criteria were met and which were missed."]
    is_complete: Annotated[bool, ..., "True if the agent response meets all success criteria, otherwise False."]

# Judge LLM for completeness
model = ChatOpenAI(model="gpt-4o-mini", temperature=0)
completeness_grader_llm = model.with_structured_output(Completeness, method="json_schema", strict=True)

async def completeness_evaluator(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    instructions = """
You are an expert data analyst grading outputs generated by an AI email assistant. You are to judge whether the agent generated an accurate and complete response for the given input email. You are also provided with success criteria written by a human, which serves as the ground truth rubric for your grading.

When grading, complete emails will have the following properties:
- All success criteria are met by the output, and none are missing
- The output correctly chooses whether to ignore, notify, or respond to the email
"""
    user_context = f"""Please grade the following example according to the above instructions:
<example>
<input>
{inputs}
</input>

<output>
{outputs}
</output>

<success_criteria>
{reference_outputs["success_criteria"]}
</success_criteria>
"""
    grade = await completeness_grader_llm.ainvoke([
        {"role": "system", "content": instructions}, 
        {"role": "user", "content": user_context}
    ])
    return {"key": "completeness", "score": grade["is_complete"], "comment": grade["reasoning"]}
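
To check the evaluator's prompt assembly and return shape without calling a model, a stub can stand in for the grader. This is a sketch under assumptions: `StubGrader` and `grade_with` are hypothetical names, the prompt is abbreviated, and the stub only mimics the `ainvoke` call used above.

```python
import asyncio

class StubGrader:
    """Hypothetical stand-in for the judge LLM; records the messages it receives."""
    def __init__(self):
        self.last_messages = None

    async def ainvoke(self, messages: list) -> dict:
        self.last_messages = messages
        return {"reasoning": "stubbed reasoning", "is_complete": True}

async def grade_with(grader, inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Simplified version of the evaluator's flow, with the grader passed in explicitly."""
    user_context = (
        f"<input>\n{inputs}\n</input>\n"
        f"<output>\n{outputs}\n</output>\n"
        f"<success_criteria>\n{reference_outputs['success_criteria']}\n</success_criteria>"
    )
    grade = await grader.ainvoke([
        {"role": "system", "content": "Grade the example against the success criteria."},
        {"role": "user", "content": user_context},
    ])
    return {"key": "completeness", "score": grade["is_complete"], "comment": grade["reasoning"]}

stub = StubGrader()
feedback = asyncio.run(grade_with(
    stub,
    {"email_input": {"subject": "Quarterly planning meeting"}},
    {"messages": []},
    {"success_criteria": "• Send email with write_email tool call"},
))
print(feedback)
```

Inside a notebook, where an event loop is already running, `await grade_with(...)` takes the place of `asyncio.run(...)`.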

And we'll use LangSmith's `evaluate()` to run our experiment!

In [None]:
final_response_dataset = "Email Agent Notebook: Final Response"
results = client.evaluate(
    run_email_agent,
    data=final_response_dataset,
    evaluators=[completeness_evaluator],
    experiment_prefix="email-agent-gpt4.1",
    num_repetitions=1,
    max_concurrency=4,
)

## Part 2: Single Step Evaluations

### Dataset Creation
Let's create a dataset that matches the format of our triage step.

In [None]:
examples = [
    {
        "email": {
            "author": "Alice Smith <alice.smith@company.com>",
            "to": "Robert Xu <Robert@company.com>",
            "subject": "Quick question about API documentation",
            "email_thread": """Hi Robert,

        I was reviewing the API documentation for the new authentication service and noticed a few endpoints seem to be missing from the specs. Could you help clarify if this was intentional or if we should update the docs?

        Specifically, I'm looking at:
        - /auth/refresh
        - /auth/validate

        Thanks!
        Alice""",
        },

        "classification_decision": "respond",
    },
    {
        "email": {
            "author": "Project Manager <pm@client.com>",
            "to": "Robert Xu <Robert@company.com>",
            "subject": "Tax season let's schedule call",
            "email_thread": """Robert,

        It's tax season again, and I wanted to schedule a call to discuss your tax planning strategies for this year. I have some suggestions that could potentially save you money.

        Are you available sometime next week? Tuesday or Thursday afternoon would work best for me, for about 45 minutes.

        Regards,
        Project Manager""",
        },

        "classification_decision": "respond",

    },

    {
        "email": {
            "author": "HR Department <hr@company.com>",
            "to": "Robert Xu <Robert@company.com>",
            "subject": "Reminder: Submit your expense reports",
            "email_thread": """Hello Robert,

        This is a friendly reminder that all expense reports for the previous month need to be submitted by this Friday. Please make sure to include all receipts and proper documentation.

        If you have any questions about the submission process, feel free to reach out to the HR team.

        Best regards,
        HR Department""",

        },

        "classification_decision": "notify",
    },
    { 
        "email": {
            "author": "Conference Organizer <events@techconf.com>",
            "to": "Robert Xu <Robert@company.com>",
            "subject": "Do you want to attend this conference?",
            "email_thread": """Hi Robert,

        We're reaching out to invite you to TechConf 2025, happening May 15-17 in San Francisco. 

        The conference features keynote speakers from major tech companies, workshops on AI and ML, and great networking opportunities. Early bird registration is available until April 30th.

        Would you be interested in attending? We can also arrange for group discounts if other team members want to join.

        Best regards,
        Conference Organizers""",
        },

        "classification_decision": "respond",
    },
    { 
        "email": {
            "author": "Team Lead <teamlead@company.com>",
            "to": "Robert Xu <Robert@company.com>",
            "subject": "Quarterly planning meeting",
            "email_thread": """Hi Robert,

        It's time for our quarterly planning session. I'd like to schedule a 90-minute meeting next week to discuss our roadmap for Q3.

        Could you let me know your availability for Monday or Wednesday? Ideally sometime between 10AM and 3PM.

        Looking forward to your input on the new feature priorities.

        Best,
        Team Lead""",
        },

        "classification_decision": "respond",
    },
]

dataset_name = "Email Agent Notebook: Single Step"

if not client.has_dataset(dataset_name=dataset_name):
    dataset = client.create_dataset(dataset_name=dataset_name)
    client.create_examples(
        inputs=[{"email_input": ex["email"]} for ex in examples],
        outputs=[{"classification_decision": ex["classification_decision"]} for ex in examples],
        dataset_id=dataset.id
    )

Next, we'll create a run function that only runs the triage step of our email agent.

In [None]:
def run_triage_step(inputs: dict) -> dict:
    thread_id = uuid.uuid4()
    configuration = {"thread_id": thread_id}

    # Stop after the triage node; interrupt_after takes a list of node names
    result = email_assistant.invoke(inputs, config=configuration, interrupt_after=["triage_router"])
    return {"classification_decision": result["classification_decision"]}


Finally, we'll define an evaluator to check whether the classification was correct.

In [None]:
def exact_match(outputs, reference_outputs):
    correctness = outputs["classification_decision"].lower() == reference_outputs["classification_decision"].lower()
    return {"key": "correctness", "score": correctness}
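
A per-example match can also be rolled up into a single accuracy figure for quick local checks. The snippet below is a small sketch that needs no SDK; `triage_accuracy` is a hypothetical helper, and the predictions are made-up values for illustration.

```python
def triage_accuracy(predictions: list, references: list) -> float:
    """Fraction of predictions that match the reference label, ignoring case."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(p.lower() == r.lower() for p, r in zip(predictions, references))
    return matches / len(references)

# Made-up predictions compared against the five reference labels from the dataset above
references = ["respond", "respond", "notify", "respond", "respond"]
predictions = ["respond", "Respond", "ignore", "respond", "respond"]
print(triage_accuracy(predictions, references))  # 0.8
```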

And we'll use LangSmith's `evaluate()` to run our experiment!

In [None]:
single_step_dataset = "Email Agent Notebook: Single Step"
results = client.evaluate(
    run_triage_step,
    data=single_step_dataset,
    evaluators=[exact_match],
    experiment_prefix="email-agent-gpt4.1",
    num_repetitions=1,
    max_concurrency=4,
)

## Part 3: Trajectory Evaluations

### Dataset Creation
We first need to create a dataset that matches the format of our trajectory. As an example, we can manually define some samples below. For larger example datasets, see `setup/datasets.py`.

In [None]:
examples = [
    {
        "email": {
            "author": "Alice Smith <alice.smith@company.com>",
            "to": "Robert Xu <Robert@company.com>",
            "subject": "Quick question about API documentation",
            "email_thread": """Hi Robert,

        I was reviewing the API documentation for the new authentication service and noticed a few endpoints seem to be missing from the specs. Could you help clarify if this was intentional or if we should update the docs?

        Specifically, I'm looking at:
        - /auth/refresh
        - /auth/validate

        Thanks!
        Alice""",
        },

        "trajectory": ["write_email", "done"],
    },
    {
        "email": {
            "author": "Project Manager <pm@client.com>",
            "to": "Robert Xu <Robert@company.com>",
            "subject": "Tax season let's schedule call",
            "email_thread": """Robert,

        It's tax season again, and I wanted to schedule a call to discuss your tax planning strategies for this year. I have some suggestions that could potentially save you money.

        Are you available sometime next week? Tuesday or Thursday afternoon would work best for me, for about 45 minutes.

        Regards,
        Project Manager""",
        },

        "trajectory": ["check_calendar_availability", "schedule_meeting", "write_email", "done"],

    },

    {
        "email": {
            "author": "HR Department <hr@company.com>",
            "to": "Robert Xu <Robert@company.com>",
            "subject": "Reminder: Submit your expense reports",
            "email_thread": """Hello Robert,

        This is a friendly reminder that all expense reports for the previous month need to be submitted by this Friday. Please make sure to include all receipts and proper documentation.

        If you have any questions about the submission process, feel free to reach out to the HR team.

        Best regards,
        HR Department""",

        },

        "trajectory": [],
    },
    { 
        "email": {
            "author": "Conference Organizer <events@techconf.com>",
            "to": "Robert Xu <Robert@company.com>",
            "subject": "Do you want to attend this conference?",
            "email_thread": """Hi Robert,

        We're reaching out to invite you to TechConf 2025, happening May 15-17 in San Francisco. 

        The conference features keynote speakers from major tech companies, workshops on AI and ML, and great networking opportunities. Early bird registration is available until April 30th.

        Would you be interested in attending? We can also arrange for group discounts if other team members want to join.

        Best regards,
        Conference Organizers""",
        },

        "trajectory": ["write_email", "done"],
    },
    { 
        "email": {
            "author": "Team Lead <teamlead@company.com>",
            "to": "Robert Xu <Robert@company.com>",
            "subject": "Quarterly planning meeting",
            "email_thread": """Hi Robert,

        It's time for our quarterly planning session. I'd like to schedule a 90-minute meeting next week to discuss our roadmap for Q3.

        Could you let me know your availability for Monday or Wednesday? Ideally sometime between 10AM and 3PM.

        Looking forward to your input on the new feature priorities.

        Best,
        Team Lead""",
        },

        "trajectory": ["check_calendar_availability", "write_email", "done"],
    },
]

dataset_name = "Email Agent Notebook: Trajectory"

if not client.has_dataset(dataset_name=dataset_name):
    dataset = client.create_dataset(dataset_name=dataset_name)
    client.create_examples(
        inputs=[{"email_input": ex["email"]} for ex in examples],
        outputs=[{"trajectory": ex["trajectory"]} for ex in examples],
        dataset_id=dataset.id
    )

Next, we'll create a run function that runs our email agent and captures the trajectory of the tool calls it made. This requires a helper function to extract the relevant information.

In [None]:
import uuid
from typing import Any, List

# Helper to extract tool trajectory
def extract_tool_calls(messages: List[Any]) -> List[str]:
    """Extract tool call names from messages, safely handling messages without tool_calls."""
    tool_call_names = []
    for message in messages:
        # Check if message is a dict and has tool_calls
        if isinstance(message, dict) and message.get("tool_calls"):
            tool_call_names.extend([call["name"].lower() for call in message["tool_calls"]])
        # Check if message is an object with tool_calls attribute
        elif hasattr(message, "tool_calls") and message.tool_calls:
            tool_call_names.extend([call["name"].lower() for call in message.tool_calls])
    
    return tool_call_names


# Define Run Function for your Application
def run_email_trajectory(inputs: dict) -> dict:
    """Run the email assistant on the given email input."""
    # Use a fresh thread ID for each run
    thread_id = uuid.uuid4()
    configuration = {"thread_id": thread_id}

    result = email_assistant.invoke(inputs, config=configuration)
    return {"trajectory": extract_tool_calls(result["messages"])}
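
The helper accepts both dict-style messages and message objects. The sketch below repeats a compact copy of `extract_tool_calls` so it runs on its own, and feeds it made-up messages; `SimpleNamespace` stands in for a real message object here.

```python
from types import SimpleNamespace
from typing import Any, List

def extract_tool_calls(messages: List[Any]) -> List[str]:
    """Compact copy of the helper above, repeated so this snippet runs on its own."""
    names = []
    for message in messages:
        if isinstance(message, dict):
            calls = message.get("tool_calls") or []
        else:
            calls = getattr(message, "tool_calls", None) or []
        names.extend(call["name"].lower() for call in calls)
    return names

# Made-up conversation mixing dict messages and attribute-style messages
messages = [
    {"role": "user", "content": "Can we meet next week?"},
    {"role": "assistant", "tool_calls": [{"name": "check_calendar_availability", "args": {}}]},
    SimpleNamespace(content="", tool_calls=[{"name": "Write_Email", "args": {}}]),
    SimpleNamespace(content="Scheduled."),  # no tool_calls attribute
    SimpleNamespace(content="", tool_calls=[{"name": "Done", "args": {}}]),
]
print(extract_tool_calls(messages))
# ['check_calendar_availability', 'write_email', 'done']
```

Names are lowercased, so the extracted trajectory can be compared directly with the lowercase reference trajectories in the dataset.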

Finally, we'll create an evaluator that defines a metric we want to measure. In this case, let's see how many extra steps our agent outputs compared to the ground truth.

In [None]:
# Define Evaluator Functions
def evaluate_extra_steps(outputs: dict, reference_outputs: dict) -> dict:
    """Evaluate the number of unmatched steps in the agent's output."""
    i = j = 0
    unmatched_steps = 0

    while i < len(reference_outputs['trajectory']) and j < len(outputs['trajectory']):
        if reference_outputs['trajectory'][i] == outputs['trajectory'][j]:
            i += 1  # Match found, move to the next step in reference trajectory
        else:
            unmatched_steps += 1  # Step is not part of the reference trajectory
        j += 1  # Always move to the next step in outputs trajectory

    # Count remaining unmatched steps in outputs beyond the comparison loop
    unmatched_steps += len(outputs['trajectory']) - j

    return {
        "key": "unmatched_steps",
        "score": unmatched_steps,
    }
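
A short worked example makes the metric concrete. The copy of `evaluate_extra_steps` below is condensed so the snippet runs standalone. `evaluate_missing_steps` is a hypothetical companion, not part of the original notebook, included because the extra-step count alone does not penalize reference steps the agent never took.

```python
def evaluate_extra_steps(outputs: dict, reference_outputs: dict) -> dict:
    """Condensed copy of the evaluator above: count output steps not matched, in order, to the reference."""
    ref, out = reference_outputs["trajectory"], outputs["trajectory"]
    i = unmatched = 0
    for step in out:
        if i < len(ref) and step == ref[i]:
            i += 1  # step matches the next expected reference step
        else:
            unmatched += 1  # step is extra relative to the reference
    return {"key": "unmatched_steps", "score": unmatched}

def evaluate_missing_steps(outputs: dict, reference_outputs: dict) -> dict:
    """Hypothetical companion metric: count reference steps never matched in order."""
    ref, out = reference_outputs["trajectory"], outputs["trajectory"]
    i = 0
    for step in out:
        if i < len(ref) and step == ref[i]:
            i += 1
    return {"key": "missing_steps", "score": len(ref) - i}

reference = {"trajectory": ["check_calendar_availability", "write_email", "done"]}
with_detour = {"trajectory": ["check_calendar_availability", "schedule_meeting", "write_email", "done"]}
cut_short = {"trajectory": ["check_calendar_availability"]}

print(evaluate_extra_steps(with_detour, reference))    # {'key': 'unmatched_steps', 'score': 1}
print(evaluate_missing_steps(with_detour, reference))  # {'key': 'missing_steps', 'score': 0}
print(evaluate_extra_steps(cut_short, reference))      # {'key': 'unmatched_steps', 'score': 0}
print(evaluate_missing_steps(cut_short, reference))    # {'key': 'missing_steps', 'score': 2}
```

The `cut_short` case scores 0 extra steps even though the agent stopped early, which is why a second metric may be worth tracking alongside the first.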

We can use LangSmith's evaluate function to run this experiment!

In [None]:
trajectory_dataset = "Email Agent Notebook: Trajectory"
results = client.evaluate(
    run_email_trajectory,
    data=trajectory_dataset,
    evaluators=[evaluate_extra_steps],
    experiment_prefix="email-agent-gpt4.1",
    num_repetitions=1,
    max_concurrency=4,
)
