# Module 2, Section 1: Establishing Baseline with Offline Evaluation

We now have an MVP customer support agent for TechHub that can answer questions about orders, provide product information, and explain store policies. But before we put it in front of customers, we first need to build confidence that it does what we expect it to do.

Throughout this module, we'll learn how to run offline evaluations to establish baseline performance and then systematically improve our agent with evaluation-driven development (EDD).

**What is offline evaluation?**

Evaluation is a crucial, ongoing process that allows us to quantitatively measure how well our application is working, identify areas for improvement, and reliably evolve our system over time.

<div align="center">
    <img src="../../images/offline_eval_process.png">
</div>

The offline evaluation process consists of three components:

1. Dataset - a curated set of representative examples, where each example includes:
    - an input to the system
    - a ground truth (reference) output that demonstrates what the expected, high quality outcome should look like
2. Application - the LLM system that we intend to evaluate. We feed it our example inputs, and collect the system's actual output.
3. Evaluators - functions that quantify some aspect of performance by comparing the inputs, outputs, and reference outputs
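The three components above can be sketched in plain Python before bringing in any tooling. This is a minimal illustration, not the LangSmith API: the toy dataset, `toy_application`, `exact_match`, and `run_offline_eval` are all hypothetical names made up for this example.

```python
from typing import Callable

# 1. Dataset: hypothetical examples pairing an input with a reference output.
toy_dataset = [
    {"input": "2 + 2", "reference": "4"},
    {"input": "3 * 3", "reference": "9"},
]


# 2. Application: a stand-in for the LLM system under test (a tiny calculator).
def toy_application(question: str) -> str:
    left, op, right = question.split()
    if op == "+":
        return str(int(left) + int(right))
    return str(int(left) * int(right))


# 3. Evaluator: compares the actual output with the reference output.
def exact_match(inputs: str, outputs: str, reference: str) -> bool:
    return outputs.strip() == reference.strip()


def run_offline_eval(dataset: list, application: Callable, evaluator: Callable) -> list:
    """Run every example through the application and score the result."""
    results = []
    for ex in dataset:
        output = application(ex["input"])
        score = evaluator(ex["input"], output, ex["reference"])
        results.append({"input": ex["input"], "output": output, "score": score})
    return results


results = run_offline_eval(toy_dataset, toy_application, exact_match)
for row in results:
    print(row)
```

The rest of this section follows the same loop, with LangSmith storing the dataset, our agent as the application, and evaluator functions producing the scores.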


**What makes a good eval setup?**

When beginning to create a new eval suite, it's often best to:

- Gather a small set of labeled examples that are representative of your system's core functionality and scenarios it should handle.
- Lean on domain expertise to ensure the examples are representative and accurate.
- Select only a few, simple metrics. In practice, binary evaluation metrics force clearer thinking, more consistent labeling, and are easier/faster to interpret when analyzing and iterating on your system.

Starting with a large sample and/or many, complex metrics makes it harder to inspect and deeply understand system behavior, which quickly leads to analysis paralysis.
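To illustrate why binary metrics are quick to interpret, here is a small sketch that turns pass/fail scores into an overall pass rate and a per-category breakdown. The rows, category names, and the `pass_rate_by_category` helper are hypothetical and only stand in for real evaluation results.

```python
from collections import defaultdict

# Hypothetical binary scores, one per example, tagged with a question category.
scores = [
    {"category": "orders", "correct": True},
    {"category": "orders", "correct": False},
    {"category": "policies", "correct": True},
    {"category": "products", "correct": True},
]


def pass_rate_by_category(rows: list) -> dict:
    """Return the fraction of passing examples for each category."""
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for row in rows:
        totals[row["category"]][0] += int(row["correct"])
        totals[row["category"]][1] += 1
    return {cat: passed / total for cat, (passed, total) in totals.items()}


overall = sum(row["correct"] for row in scores) / len(scores)
by_category = pass_rate_by_category(scores)

print(f"overall pass rate: {overall:.0%}")
print(by_category)
```

With only a handful of examples, a single failing row is easy to locate and read in full, which is the point of starting small.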

Let's see how we can perform offline evaluation on our TechHub agent in LangSmith to establish the baseline performance!

#### Setup


In [None]:
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

## 1. Curate a set of representative examples

In our use case, we've teamed up with the TechHub customer support team to create 10 ground truth examples. Each example has:

- **inputs**: A customer's question
- **outputs**: The expected correct answer (i.e. ground truth)
- **metadata**: A category that the customer support team uses to bucket question types

This dataset structure allows us to evaluate the "end-to-end" nature of our agent - commonly referred to as [final answer evaluation](https://docs.langchain.com/langsmith/evaluation-approaches#evaluating-an-agent%E2%80%99s-final-response).
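As a rough sketch of what one record might look like, the snippet below builds a single example and checks its shape. The key names (`inputs.question`, `outputs.answer`, `metadata`) mirror how the records are accessed later in this notebook; the question, answer, and category values are invented, and `validate_example` is a hypothetical helper.

```python
# Hypothetical record: the values are invented for illustration only.
sample_record = {
    "inputs": {"question": "Can I return an opened laptop?"},
    "outputs": {"answer": "Placeholder answer written by the support team."},
    "metadata": {"category": "policy", "example_number": 1},
}


def validate_example(ex: dict) -> list:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = []
    for section, field in (("inputs", "question"), ("outputs", "answer")):
        value = ex.get(section)
        if not isinstance(value, dict) or not isinstance(value.get(field), str):
            problems.append(f"missing or non-string {section}.{field}")
        elif not value[field].strip():
            problems.append(f"empty {section}.{field}")
    if not isinstance(ex.get("metadata"), dict):
        problems.append("metadata should be a dict")
    return problems


print(validate_example(sample_record))
```

A check like this is cheap to run over every record before uploading, so that malformed examples are caught before they skew an experiment.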

Let's load and explore the dataset:

In [None]:
import json
from pathlib import Path
from pprint import pprint

# Load the dataset from JSON
dataset_path = Path("baseline_dataset.json")

with open(dataset_path, "r") as f:
    examples = json.load(f)

In [None]:
pprint(examples[0])

## 2. Create a dataset in LangSmith

Note that we convert the examples into `messages` format so that each one is stored natively in LangSmith with the structure our agent expects when we invoke it.
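The conversion itself is small. As a sketch, a helper like the hypothetical `to_messages` below wraps a plain string into the `messages` structure; the raw record and its text are made up for illustration.

```python
def to_messages(role: str, content: str) -> dict:
    """Wrap a plain string in the chat `messages` structure."""
    return {"messages": [{"role": role, "content": content}]}


# Hypothetical raw record in the same shape as the JSON file's entries.
raw = {
    "inputs": {"question": "Where is my order?"},
    "outputs": {"answer": "Placeholder reference answer."},
}

example_inputs = to_messages("user", raw["inputs"]["question"])
example_outputs = to_messages("assistant", raw["outputs"]["answer"])

print(example_inputs)
print(example_outputs)
```

Because the stored inputs already have this shape, they can be passed to the agent unchanged during evaluation.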

In [None]:
import uuid
from langsmith import Client

client = Client()

dataset_name = f"techhub-baseline-eval-{uuid.uuid4()}"
dataset_description = "Representative set of customer support questions and answers curated by our support team"

dataset = client.create_dataset(
    dataset_name=dataset_name,
    description=dataset_description,
)

client.create_examples(
    dataset_id=dataset.id,
    inputs=[
        {"messages": [{"role": "user", "content": ex["inputs"]["question"]}]}
        for ex in examples
    ],
    outputs=[
        {"messages": [{"role": "assistant", "content": ex["outputs"]["answer"]}]}
        for ex in examples
    ],
    metadata=[ex["metadata"] for ex in examples],
)

print(f"Dataset in LangSmith: {dataset.url}")

## 3. Initialize the agent we want to evaluate

Here we'll use the supervisor agent with HITL verification that we built in Section 4 of Module 1.

In [None]:
from IPython.display import Image, display
from agents.supervisor_hitl_agent import create_supervisor_hitl_agent

agent = create_supervisor_hitl_agent()

display(Image(agent.get_graph(xray=True).draw_mermaid_png()))

Quick test:

In [None]:
import uuid

thread_id = uuid.uuid4()
config = {"configurable": {"thread_id": thread_id}}

t = agent.invoke(
    {
        "messages": [
            {
                "role": "user",
                "content": "What's your return policy for opened electronics?",
            }
        ]
    },
    config=config,
)

In [None]:
t["messages"][-1].pretty_print()

## 4. Define our evaluators

Evaluators are functions that score how well your application performs on a particular example.

We'll start out with two simple evaluators: `correctness` and `total_tool_calls`.
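Before the two evaluators used in this section, it may help to see the evaluator contract in its simplest form. The sketch below is not one of the workshop's evaluators: `non_empty_answer_evaluator` is a hypothetical example that takes `outputs` in `messages` format and returns a dictionary with `key`, `score`, and `comment`, the same shape the evaluators below return.

```python
def non_empty_answer_evaluator(outputs: dict) -> dict:
    """Score True when the final message contains any non-whitespace text."""
    messages = outputs.get("messages", [])
    content = messages[-1].get("content", "") if messages else ""
    has_text = bool(str(content).strip())

    return {
        "key": "non_empty_answer",
        "score": has_text,
        "comment": "Answer has text." if has_text else "Answer is empty.",
    }


# Hypothetical outputs in the same shape our target function will produce.
sample_outputs = {"messages": [{"role": "assistant", "content": "Hello!"}]}
print(non_empty_answer_evaluator(sample_outputs))
```

A deterministic check like this needs no reference output and no LLM call, which makes it a useful first smoke test alongside heavier evaluators.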

### Evaluator #1: Correctness

An evaluator that uses LLM-as-a-Judge to determine if the agent's output is "correct" when comparing it against the reference output (i.e. ground truth output).

> Note: We manually define the LLM-as-a-Judge evaluator below for clarity, but you can achieve the same goal with fewer lines of code via our [openevals](https://github.com/langchain-ai/openevals) library.

In [None]:
from pydantic import BaseModel, Field
from langchain.chat_models import init_chat_model
from config import DEFAULT_MODEL


CORRECTNESS_PROMPT = """You are an expert data labeler evaluating model outputs for correctness.

Your task is to assign a boolean score based on the following rubric:

<Rubric>
  A correct answer (True):
  - Provides accurate and complete information
  - Contains no factual errors
  - Addresses all parts of the question
  - Is logically consistent
</Rubric>

<Instructions>
  - Carefully read the input and output
  - Compare the output to the reference_output
  - Check for factual accuracy and completeness
  - Focus on correctness of information rather than style or verbosity differences
  - Return a boolean score (True if correct, False if incorrect), not a string
</Instructions>

<Note>
- It's ok if the output provides additional information that is not directly included in the reference output
- The output is just the final output from an agent invocation, so it will not include the intermediate steps or tool calls. This is ok.
</Note>

<input>
{inputs}
</input>

<output>
{outputs}
</output>

<reference_outputs>
{reference_outputs}
</reference_outputs>
"""


# For structured LLM output
class CorrectnessScore(BaseModel):
    reasoning: str = Field(..., description="A concise reasoning for the score")
    score: bool = Field(
        ..., description="True if the output is correct, False if incorrect."
    )


# Create a structured LLM
correctness_evaluator_llm = init_chat_model(model=DEFAULT_MODEL).with_structured_output(
    CorrectnessScore
)


# Define the evaluator function
def correctness_evaluator(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Evaluate the correctness of the output against the reference output."""

    formatted_prompt = CORRECTNESS_PROMPT.format(
        inputs=inputs, outputs=outputs, reference_outputs=reference_outputs
    )

    eval_result = correctness_evaluator_llm.invoke(formatted_prompt)

    # return a dictionary with the format the evaluator expects
    return {
        "key": "correctness",
        "score": eval_result.score,
        "comment": eval_result.reasoning,
    }

#### Target Function

A target function specifies how `inputs` from our dataset are processed to produce the `outputs` that we want to evaluate. In this case, it simply runs the `inputs` through our agent to produce a response message.

In [None]:
import uuid


def target_function(inputs: dict) -> dict:
    """Target function that runs our agent to get outputs for evaluation."""

    thread_id = uuid.uuid4()
    config = {"configurable": {"thread_id": thread_id}}

    result = agent.invoke(
        inputs,
        config=config,
    )

    return {
        "messages": [{"role": "assistant", "content": result["messages"][-1].content}]
    }

Now, let's test the `correctness` evaluator on a single example from our dataset to see how everything works.

In [None]:
# get an example from our dataset
example = next(
    client.list_examples(dataset_id=dataset.id, metadata={"example_number": 1})
)
pprint(example.inputs)

In [None]:
# run the example inputs through our target function
output = target_function(example.inputs)
pprint(output)

In [None]:
# score the agent output against the reference output
correctness_score = correctness_evaluator(
    inputs=example.inputs, outputs=output, reference_outputs=example.outputs
)
pprint(correctness_score)

### Evaluator #2: Total Tool Calls

An evaluator that doesn't rely on a ground truth reference, but is worth tracking because it can reveal agent patterns and highlight inefficiencies.

In [None]:
from langsmith.schemas import Run


def count_total_tool_calls_evaluator(run: Run) -> dict:
    """
    Count total tool calls across the entire run (supervisor + sub-agents).

    Returns a single 'score' metric with the total count.
    Use for tracking efficiency: fewer calls = more efficient.
    """

    def traverse_runs(run_obj: Run) -> int:
        """Recursively count all tool-type runs in the tree."""
        count = 0

        # Count this run if it's a tool execution
        if run_obj.run_type == "tool":
            count = 1

        # Recursively count child runs
        if hasattr(run_obj, "child_runs") and run_obj.child_runs:
            for child in run_obj.child_runs:
                count += traverse_runs(child)

        return count

    total_tools = traverse_runs(run)

    return {"key": "total_tool_calls", "score": total_tools}

This evaluator doesn't depend on reference outputs to produce a score, but rather uses metadata from the target function's execution on a given example. This is flexibly handled in the LangSmith SDK by passing a `Run` object as input to the evaluator.

> Note: see [this docs page](https://docs.langchain.com/langsmith/code-evaluator#evaluator-args) for a list of all arguments accepted by an evaluator function
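Before fetching a real run, the traversal logic can be sanity-checked offline on a hand-built tree. In the sketch below, `FakeRun` is a hypothetical stand-in that exposes only the two attributes the traversal reads (`run_type` and `child_runs`); it is not the LangSmith `Run` schema, and the run names are invented.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRun:
    """Hypothetical stand-in with only the attributes the traversal reads."""

    name: str
    run_type: str
    child_runs: list = field(default_factory=list)


def count_tool_runs(run) -> int:
    """Recursively count runs whose run_type is 'tool'."""
    own = 1 if run.run_type == "tool" else 0
    return own + sum(count_tool_runs(child) for child in (run.child_runs or []))


# A made-up trace: a supervisor calling one sub-agent and one tool directly.
tree = FakeRun(
    "supervisor",
    "chain",
    [
        FakeRun("model_call", "llm"),
        FakeRun(
            "sub_agent",
            "chain",
            [FakeRun("lookup_tool", "tool"), FakeRun("model_call", "llm")],
        ),
        FakeRun("search_tool", "tool"),
    ],
)

total = count_tool_runs(tree)
print(total)
```

Only the two `tool` runs are counted, including the one nested inside the sub-agent, which is the behavior we want from the evaluator when it receives a real run tree.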

Let's walk through an example to make this clear.

In [None]:
# get a recent sample run from our project
runs = client.list_runs(
    project_name="langsmith-agent-lifecycle-workshop",  # Your project
    filter="""and(eq(is_root, true), eq(name, "supervisor_hitl_agent"))""",  # get a LangGraph (i.e. supervisor) run
    limit=1,
)
run = next(runs)

# Fetch the complete run with all children
full_run = client.read_run(run.id, load_child_runs=True)

In [None]:
# we can inspect the full run metadata
# vars(full_run)

# or just look at the child runs
vars(full_run).get("child_runs")

In [None]:
# now let's pass the run to our evaluator
count_total_tool_calls_evaluator(full_run)

## 5. Run an experiment on the full dataset

Now we can use our `target_function` and two evaluators to programmatically run an offline evaluation over each example in our dataset. This is called an "experiment" in LangSmith.

In [None]:
# Disable parallelism warnings from the tokenizers library to keep notebook output clean
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Run the experiment
results = client.evaluate(
    target_function,
    data=dataset_name,
    evaluators=[correctness_evaluator, count_total_tool_calls_evaluator],
    experiment_prefix="baseline-eval",
    description="Evaluate the final answer correctness and total tool calls of our agent on the baseline dataset",
    max_concurrency=5,
)

## 6. Error Analysis in LangSmith UI

Now let's analyze our baseline performance in the LangSmith UI via the link above.
