# Customer Support

Here, we show an example of building a customer support chatbot.

This customer support chatbot interacts with a SQL database to answer questions.
We will use a mock SQL database to get started: the [Chinook](https://www.sqlitetutorial.net/sqlite-sample-database/) database.
This database models sales from a music store: which songs and albums exist, customer orders, and so on.

This chatbot has two different states: 
1. Music: the user can inquire about different songs and albums present in the store
2. Account: the user can ask questions about their account

Under the hood, this is handled by two separate agents.
Each has its own prompt and a set of tools related to its objective.
There is also a generic agent that is responsible for routing between these two agents as needed.

In [223]:
!uv sync

Resolved 103 packages in 32ms
Uninstalled 19 packages in 268ms
Installed 11 packages in 254ms
 - anthropic==0.75.0
 - anyio==4.12.0
 + anyio==4.11.0
 - docstring-parser==0.17.0
 - langchain==1.1.2
 + langchain==0.0.27
 - langchain-anthropic==1.2.0
 - langchain-core==1.1.1
 + langchain-core==1.1.0
 - langchain-openai==1.1.0
 + langchain-openai==0.3.34
 - langgraph==1.0.4
 + langgraph==0.6.11
 - langgraph-prebuilt==1.0.5
 + langgraph-prebuilt==0.6.5
 - langgraph-sdk==0.2.14

In [224]:
from dotenv import load_dotenv

load_dotenv()

True

## Load the data

Utils to pull the Chinook database, populate an in-memory SQLite database, and create the engine.

In [225]:
import sqlite3
import requests
from langchain_community.utilities.sql_database import SQLDatabase
from sqlalchemy import create_engine
from sqlalchemy.pool import StaticPool

def get_engine_for_chinook_db():
    """Pull sql file, populate in-memory database, and create engine."""
    url = "https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_Sqlite.sql"
    response = requests.get(url)
    sql_script = response.text

    connection = sqlite3.connect(":memory:", check_same_thread=False)
    connection.executescript(sql_script)
    return create_engine(
        "sqlite://",
        creator=lambda: connection,
        poolclass=StaticPool,
        connect_args={"check_same_thread": False},
    )

engine = get_engine_for_chinook_db()
db = SQLDatabase(engine)

In [226]:
print(db.get_usable_table_names())

['Album', 'Artist', 'Customer', 'Employee', 'Genre', 'Invoice', 'InvoiceLine', 'MediaType', 'Playlist', 'PlaylistTrack', 'Track']
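
The loading step can be tried without network access or SQLAlchemy. The following is a minimal sketch using only the standard library; the two-table script and the helper names `build_in_memory_db` and `list_tables` are assumptions made for illustration and stand in for the full Chinook SQL file.

```python
import sqlite3

# Hypothetical two-table script standing in for the much larger Chinook SQL file.
SQL_SCRIPT = """
CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Album (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER);
INSERT INTO Artist VALUES (1, 'AC/DC'), (2, 'Accept');
INSERT INTO Album VALUES (1, 'Let There Be Rock', 1), (2, 'Balls to the Wall', 2);
"""


def build_in_memory_db(script: str) -> sqlite3.Connection:
    """Create an in-memory SQLite database and populate it from a SQL script."""
    connection = sqlite3.connect(":memory:")
    connection.executescript(script)
    return connection


def list_tables(connection: sqlite3.Connection) -> list:
    """Return the table names in alphabetical order."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


conn = build_in_memory_db(SQL_SCRIPT)
print(list_tables(conn))  # ['Album', 'Artist']
```

The in-memory database lives only as long as the connection object, which is why the notebook hands the same connection to the engine through `creator` and `StaticPool`.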


## Get sample of every table

In [227]:
import pandas as pd
from sqlalchemy import text, inspect

def sample_all_tables(engine, n=5):
    inspector = inspect(engine)
    tables = inspector.get_table_names()

    samples = {}

    with engine.connect() as conn:
        for table in tables:
            query = text(f"SELECT * FROM {table} LIMIT {n}")
            result = conn.execute(query)

            rows = result.fetchall()
            cols = result.keys()

            df = pd.DataFrame(rows, columns=cols)
            samples[table] = df

            print(f"\n{'='*80}")
            print(f"TABLE: {table} (showing up to {n} rows)")
            print(f"{'='*80}")
            print(df)

    return samples


In [228]:
samples = sample_all_tables(engine, n=5)
print(db.run(f"""SELECT Track.Name as SongName, Artist.Name as ArtistName 
        FROM Album 
        LEFT JOIN Artist ON Album.ArtistId = Artist.ArtistId 
        LEFT JOIN Track ON Track.AlbumId = Album.AlbumId 
        WHERE Artist.Name LIKE '%Aerosmith%';"""))


TABLE: Album (showing up to 5 rows)
   AlbumId                                  Title  ArtistId
0        1  For Those About To Rock We Salute You         1
1        2                      Balls to the Wall         2
2        3                      Restless and Wild         2
3        4                      Let There Be Rock         1
4        5                               Big Ones         3

TABLE: Artist (showing up to 5 rows)
   ArtistId               Name
0         1              AC/DC
1         2             Accept
2         3          Aerosmith
3         4  Alanis Morissette
4         5    Alice In Chains

TABLE: Customer (showing up to 5 rows)
   CustomerId  FirstName     LastName  \
0           1       Luís    Gonçalves   
1           2     Leonie       Köhler   
2           3   François     Tremblay   
3           4      Bjørn       Hansen   
4           5  František  Wichterlová   

                                            Company  \
0  Embraer - Empresa Brasileira de Ae

## Load an LLM

We will load a language model to use.
For this demo we will use OpenAI.

In [229]:
from langchain_openai import ChatOpenAI

# We will set streaming=True so that we can stream tokens
# See the streaming section for more information on this.
model = ChatOpenAI(temperature=0, streaming=True, model="gpt-4o", tags=["router-agent"])

## Load Other Modules

Load other modules we will use.

All of the tools our agents use are custom tools, so we create them with the `@tool` decorator.

We will pass messages to the agent, so we also load `HumanMessage` and `SystemMessage`.

In [230]:
from langchain_core.tools import tool
from langchain_core.messages import HumanMessage, SystemMessage

## Define the Customer Agent

This agent is responsible for looking up customer information.
It will have a specific prompt as well as a specific tool to look up information about that customer (after asking for their customer ID).

In [231]:
# This tool is given to the agent to look up information about a customer
@tool
def get_customer_info(customer_id: int):
    """Look up customer info given their ID. ALWAYS make sure you have the customer ID before invoking this."""
    return db.run(f"SELECT * FROM Customer WHERE CustomerID = {customer_id};")
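
The tool above formats the ID straight into the SQL string. That is tolerable here because the argument is typed as `int`, but string arguments would be open to injection. As a hedged alternative, the sketch below shows the same lookup with a bound parameter using the standard-library `sqlite3` module; the one-row table and the helper name `lookup_customer` are assumptions for illustration, not part of the notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (CustomerId INTEGER PRIMARY KEY, Email TEXT)")
conn.execute("INSERT INTO Customer VALUES (4, 'bjorn.hansen@yahoo.no')")


def lookup_customer(connection, customer_id):
    """Fetch a customer row using a bound parameter instead of string formatting."""
    # The "?" placeholder makes SQLite treat the value strictly as data.
    cursor = connection.execute(
        "SELECT CustomerId, Email FROM Customer WHERE CustomerId = ?",
        (customer_id,),
    )
    return cursor.fetchall()


print(lookup_customer(conn, 4))           # [(4, 'bjorn.hansen@yahoo.no')]
print(lookup_customer(conn, "4 OR 1=1"))  # []
```

With the placeholder, the injection-style input is compared as a literal value and matches nothing.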

In [232]:
customer_prompt = """Your job is to help a user update their profile.

You only have certain tools you can use. These tools require specific input. If you don't know the required input, then ask the user for it.

If you are unable to help the user, you can """

def get_customer_messages(messages):
    return [SystemMessage(content=customer_prompt)] + messages

customer_chain = get_customer_messages | model.bind_tools([get_customer_info]).with_config(tags=["customer-agent"])

## Define the Music Agent

This agent is responsible for finding information about music. To do that, we will create a prompt and several tools for looking up information about music.

The tools match artist and track names with SQL `LIKE` patterns, so a partial name is enough to find a match.
Note that a misspelled name will still return no rows.

First, let's create a tool for getting albums by artist.

In [233]:
@tool
def get_albums_by_artist(artist: str):
    """Get albums by an artist."""
    return db.run(
        f"""
        SELECT Album.Title, Artist.Name 
        FROM Album 
        JOIN Artist ON Album.ArtistId = Artist.ArtistId 
        WHERE Artist.Name LIKE '%{artist}%';
        """,
        include_columns=True
    )

Next, let's create a tool for getting tracks by an artist.

In [234]:
@tool
def get_tracks_by_artist(artist: str):
    """Get songs by an artist (or similar artists)."""
    return db.run(
        f"""
        SELECT Track.Name as SongName, Artist.Name as ArtistName 
        FROM Album 
        LEFT JOIN Artist ON Album.ArtistId = Artist.ArtistId 
        LEFT JOIN Track ON Track.AlbumId = Album.AlbumId 
        WHERE Artist.Name LIKE '%{artist}%';
        """,
        include_columns=True
    )

Finally, let's create a tool for looking up songs by their name.

In [235]:
@tool
def check_for_songs(song_title: str):
    """Check if a song exists by its name."""
    return db.run(
        f"""
        SELECT * FROM Track WHERE Name LIKE '%{song_title}%';
        """,
        include_columns=True
    )
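
To see what the `LIKE '%...%'` filters in these tools can and cannot match, here is a small sketch on a throwaway table. It also shows one possible fallback for misspellings based on `difflib.get_close_matches` from the standard library. The fallback, the three-row table, and the helper names are assumptions for illustration; the notebook's tools do not do this.

```python
import difflib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany(
    "INSERT INTO Artist (Name) VALUES (?)",
    [("AC/DC",), ("Aerosmith",), ("Amy Winehouse",)],
)


def like_search(connection, fragment):
    """Substring match, equivalent to the LIKE '%...%' filters used by the tools."""
    rows = connection.execute(
        "SELECT Name FROM Artist WHERE Name LIKE ?", (f"%{fragment}%",)
    ).fetchall()
    return [name for (name,) in rows]


def fuzzy_search(connection, query, cutoff=0.6):
    """Hypothetical fallback: rank stored names by similarity to the query."""
    names = [name for (name,) in connection.execute("SELECT Name FROM Artist")]
    return difflib.get_close_matches(query, names, n=3, cutoff=cutoff)


print(like_search(conn, "winehouse"))        # ['Amy Winehouse']
print(like_search(conn, "Ammy Winehouse"))   # []
print(fuzzy_search(conn, "Ammy Winehouse"))  # ['Amy Winehouse']
```

`LIKE` is case-insensitive for ASCII text and tolerates partial names, but a single extra letter defeats it, whereas the similarity ranking still recovers the intended artist.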

Create the chain that calls the relevant tools.

In [236]:
song_system_message = """Your job is to help a customer find any songs they are looking for. 

You only have certain tools you can use. If a customer asks you to look something up that you don't know how, politely tell them what you can help with.

When looking up artists and songs, sometimes the artist/song will not be found. In that case, the tools will return information \
on similar songs and artists. This is intentional, it is not the tool messing up."""

def get_song_messages(messages):
    return [SystemMessage(content=song_system_message)] + messages

song_recc_chain = get_song_messages | model.bind_tools([get_albums_by_artist, get_tracks_by_artist, check_for_songs]).with_config(tags=["music-agent"])

In [237]:
msgs = [HumanMessage(content="hi! can you help me find songs by amy whinehouse?")]
song_recc_chain.invoke(msgs)

AIMessage(content='', additional_kwargs={'tool_calls': [{'index': 0, 'id': 'call_WBAEDrCxcEghaRcsLVhEjyjn', 'function': {'arguments': '{"artist":"Amy Winehouse"}', 'name': 'get_tracks_by_artist'}, 'type': 'function'}]}, response_metadata={'finish_reason': 'tool_calls', 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_37d212baff', 'service_tier': 'default'}, id='lc_run--d8b0a079-946f-49c8-8f67-9e265e029daf', tool_calls=[{'name': 'get_tracks_by_artist', 'args': {'artist': 'Amy Winehouse'}, 'id': 'call_WBAEDrCxcEghaRcsLVhEjyjn', 'type': 'tool_call'}])

## Define the Generic Agent

We now define a generic agent that is responsible for handling initial inquiries and routing to the right sub agent.

In [238]:
from langchain_core.messages import SystemMessage, HumanMessage
from pydantic import BaseModel, Field

class Router(BaseModel):
    """Call this if you are able to route the user to the appropriate representative."""
    choice: str = Field(description="should be one of: music, customer")

system_message = """Your job is to help as a customer service representative for a music store.

You should interact politely with customers to try to figure out how you can help. You can help in a few ways:

- Updating user information: if a customer wants to update the information in the user database. Call the router with `customer`
- Recommending music: if a customer wants to find some music or information about music. Call the router with `music`

If the user is asking or wants to ask about updating or accessing their information, send them to that route.
If the user is asking or wants to ask about music, send them to that route.
Otherwise, respond."""

def get_messages(messages):
    return [SystemMessage(content=system_message)] + messages

In [239]:
chain = get_messages | model.bind_tools([Router])

In [240]:
msgs = [HumanMessage(content="hi! can you help me find a good song?")]
chain.invoke(msgs)

AIMessage(content='', additional_kwargs={'tool_calls': [{'index': 0, 'id': 'call_9DakkICCaiYnhfg7yb58E5tG', 'function': {'arguments': '{"choice":"music"}', 'name': 'Router'}, 'type': 'function'}]}, response_metadata={'finish_reason': 'tool_calls', 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_83554c687e', 'service_tier': 'default'}, id='lc_run--1670cec0-ab32-48ea-90d7-f375532b49ae', tool_calls=[{'name': 'Router', 'args': {'choice': 'music'}, 'id': 'call_9DakkICCaiYnhfg7yb58E5tG', 'type': 'tool_call'}])

In [241]:
msgs = [HumanMessage(content="hi! whats the email you have for me?")]
chain.invoke(msgs)

AIMessage(content='', additional_kwargs={'tool_calls': [{'index': 0, 'id': 'call_tLdKZ73bD1eCFvTzMWs1XWMe', 'function': {'arguments': '{"choice":"customer"}', 'name': 'Router'}, 'type': 'function'}]}, response_metadata={'finish_reason': 'tool_calls', 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_83554c687e', 'service_tier': 'default'}, id='lc_run--3d8fb07b-1908-40da-acde-bd59016ebce1', tool_calls=[{'name': 'Router', 'args': {'choice': 'customer'}, 'id': 'call_tLdKZ73bD1eCFvTzMWs1XWMe', 'type': 'tool_call'}])

In [242]:
from langchain_core.messages import AIMessage

def add_name(message, name):
    _dict = message.model_dump()
    _dict["name"] = name
    return {"messages": [AIMessage(**_dict)]}

In [243]:
from langgraph.graph import END

def _get_last_ai_message(messages):
    for m in messages[::-1]:
        if isinstance(m, AIMessage):
            return m
    return None


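def _is_tool_call(msg):
    # Check `tool_calls` rather than indexing `content_blocks[0]`: that lookup raises
    # IndexError on empty content and misses a tool call that follows a text block.
    return isinstance(msg, AIMessage) and bool(msg.tool_calls)


def _route(messages):
    last_message = messages["messages"][-1] if messages["messages"] else None
    if isinstance(last_message, AIMessage):
        if not _is_tool_call(last_message):
            # A plain answer ends the turn
            return END
        else:
            if last_message.name == "general":
                # The generic agent's only tool is Router; its argument names the next node
                tool_calls = last_message.tool_calls
                if len(tool_calls) > 1:
                    raise ValueError("Expected exactly one Router call")
                tool_call = tool_calls[0]
                return tool_call['args']['choice']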
            else:
                return "tools"
    last_m = _get_last_ai_message(messages["messages"])
    if last_m is None:
        return "general"
    if last_m.name == "music":
        return "music"
    elif last_m.name == "customer":
        return "customer"
    else:
        return "general"
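
The routing rules are easier to follow on plain objects. The sketch below mirrors the decisions in `_route` without LangChain or LangGraph; the `Msg` dataclass and the `route` function are hypothetical stand-ins, and `END` is set to the string `"__end__"` as a placeholder for `langgraph.graph.END`.

```python
from dataclasses import dataclass, field
from typing import Optional

END = "__end__"  # placeholder for langgraph.graph.END


@dataclass
class Msg:
    """Hypothetical minimal message; role is 'human', 'ai', or 'tool'."""
    role: str
    name: Optional[str] = None
    tool_calls: list = field(default_factory=list)


def route(messages):
    """Pick the next node from the message history, mirroring `_route`."""
    last = messages[-1] if messages else None
    if last is not None and last.role == "ai":
        if not last.tool_calls:
            return END  # plain answer: the turn is over
        if last.name == "general":
            return last.tool_calls[0]["args"]["choice"]  # Router call names the sub-agent
        return "tools"  # a sub-agent asked for a tool
    # Human or tool message: resume with whichever agent spoke last.
    for m in reversed(messages):
        if m.role == "ai":
            return m.name if m.name in ("music", "customer") else "general"
    return "general"


router_call = Msg("ai", "general", [{"name": "Router", "args": {"choice": "music"}}])
tool_request = Msg("ai", "music", [{"name": "get_tracks_by_artist", "args": {"artist": "Aerosmith"}}])

print(route([Msg("human")]))                                          # general
print(route([Msg("human"), router_call]))                             # music
print(route([Msg("human"), router_call, tool_request]))               # tools
print(route([Msg("human"), router_call, tool_request, Msg("tool")]))  # music
```

The last case shows why tool results return to the agent that requested them: the most recent AI message still carries that agent's name.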

In [244]:
from langgraph.prebuilt import ToolNode

tools = [get_albums_by_artist, get_tracks_by_artist, check_for_songs, get_customer_info]
tools_node = ToolNode(tools)

In [245]:
def _filter_out_routes(messages):
    ms = []
    for m in messages["messages"]:
        if _is_tool_call(m):
            if m.name == "general":
                continue
        ms.append(m)
    return ms

In [246]:
from functools import partial

general_node = _filter_out_routes | chain | partial(add_name, name="general")
music_node = _filter_out_routes | song_recc_chain | partial(add_name, name="music")
customer_node = _filter_out_routes | customer_chain | partial(add_name, name="customer")

In [247]:
from langgraph.graph import MessagesState, StateGraph

# Map each value returned by `_route` to the node it names
nodes = {"general": "general", "music": "music", END: END, "tools": "tools", "customer": "customer"}
# Define a new graph
workflow = StateGraph(MessagesState)
workflow.add_node("general", general_node)
workflow.add_node("music", music_node)
workflow.add_node("customer", customer_node)
workflow.add_node("tools", tools_node)
workflow.add_conditional_edges("general", _route, nodes)
workflow.add_conditional_edges("tools", _route, nodes)
workflow.add_conditional_edges("music", _route, nodes)
workflow.add_conditional_edges("customer", _route, nodes)
workflow.set_conditional_entry_point(_route, nodes)
graph = workflow.compile()

# Evals

## Final Response Evals

The dataset was created in LangSmith and is referenced here by name:

In [248]:
from langsmith import Client, evaluate

# 1. Create and/or select your dataset
client = Client()
dataset_name = "Chinook support bot: Final Response"

Correctness evaluation using openevals prompt:

In [249]:
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

# CORRECTNESS_PROMPT = """
# You are an expert data labeler evaluating model outputs for correctness. Your task is to assign a score based on the following rubric:

# <Rubric>
#   A correct answer:
#   - Provides accurate and complete information
#   - Contains no factual errors
#   - Addresses all parts of the question
#   - Is logically consistent
#   - Uses precise and accurate terminology

#   When scoring, you should penalize:
#   - Factual errors or inaccuracies
#   - Incomplete or partial answers
#   - Misleading or ambiguous statements
#   - Incorrect terminology
#   - Logical inconsistencies
#   - Missing key information
# </Rubric>

# <Instructions>
#   - Carefully read the input and output
#   - Check for factual accuracy and completeness
#   - Focus on correctness of information rather than style or verbosity
# </Instructions>

# <Reminder>
#   The goal is to evaluate factual correctness and completeness of the response.
# </Reminder>

# <input>
# {inputs}
# </input>

# <output>
# {outputs}
# </output>

# Use the reference outputs below to help you evaluate the correctness of the response:

# <reference_outputs>
# {reference_outputs}
# </reference_outputs>
# """

correctness_evaluator = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    feedback_key="correctness",
    model="openai:o3-mini",
)

Run evaluation:

In [250]:
from langchain_core.messages import HumanMessage

evaluate(
    lambda x : (
        graph.invoke({"messages": [HumanMessage(content=x['Input'])]})
    ),
    data=dataset_name,
    evaluators=[correctness_evaluator],
    experiment_prefix="sql-agent-gpt4o-final-response"
)

View the evaluation results for experiment: 'Chinook support bot: Final Response experiment-c1f52024' at:
https://smith.langchain.com/o/911e8147-82b3-4451-9946-e96f9331e9f4/datasets/fa69919b-e8f2-4642-82d3-c2c0ae39e782/compare?selectedSessions=5e8c9ce4-d67c-400e-b6f2-12243bf07011





| | inputs.Input | outputs.messages | error | reference.Output | feedback.correctness | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|
| 0 | My customer ID is 4, what is my email? | [{'role': 'user', 'content': 'My customer ID i... | | Your email is bjorn.hansen@yahoo.no | True | 2.689188 | 5c91d6a8-b34e-455e-8a57-0474560f7717 | 019af9a9-ac79-7093-8663-14a66e39bfd6 |
| 1 | What address do you have for me? | [{'role': 'user', 'content': 'What address do ... | | To look up your account information, I’ll need... | True | 2.04494 | 499d3cab-c932-4764-b6a4-4233fe2c109b | 019af9a9-c7e3-7052-a0f2-73cf2c7a57fa |
| 2 | My customer ID is 2, what is my birthday? | [{'role': 'user', 'content': 'My customer ID i... | | I couldn’t find your birthday within our system. | True | 3.549573 | 665517d4-5336-445c-ac94-a877ccc511b4 | 019af9a9-da7e-7053-aed9-91cc90b06d8e |
| 3 | Find me songs by Ammy Winehouse | [{'role': 'user', 'content': 'Find me songs by... | | Here are the songs we have by Amy Winehouse:\n... | False | 1.344135 | 688cfd87-f368-4c4c-adf0-d2f21fe87ae2 | 019af9a9-f74e-7482-9536-81ace3b6bdec |
| 4 | Do you have the song Rehab and how much does i... | [{'role': 'user', 'content': 'Do you have the ... | | Yes, the song 'Rehab' is available in our cata... | False | 1.879769 | 789b371f-7219-440d-9012-d05f811376b8 | 019af9aa-09c8-72b6-9a94-ee71599b69e8 |
| 5 | What albums does AC/DC have? | [{'role': 'user', 'content': 'What albums does... | | Here are the albums we have by AC/DC:\n\n1. Fo... | True | 1.979476 | 81bcbeb3-22ab-4fa1-9cd2-d8b94fa3d08e | 019af9aa-20d4-737d-a0cf-2dea29242e98 |
| 6 | My customer ID is 999999, what is my address? | [{'role': 'user', 'content': 'My customer ID i... | | I couldn’t find a customer with that ID in our... | True | 2.086384 | cd0677d2-02c6-46e9-806e-62cb0cc2f528 | 019af9aa-3b7c-72d2-8920-698c31b08696 |
| 7 | How many songs Aerosmith? | [{'role': 'user', 'content': 'How many songs A... | | We have 15 songs by Aerosmith in our catalog. | False | 1.580661 | eb2cba20-9bdc-4a7b-a582-048d5de03807 | 019af9aa-4de1-75c8-97d3-2ff43d5cd0f2 |
| 8 | Can you help me? | [{'role': 'user', 'content': 'Can you help me?... | | I can help you find songs and albums in our mu... | True | 0.67278 | f0121034-ddee-4f6f-b305-350f7e49e896 | 019af9aa-6af4-739e-902f-6210ef82f4d2 |


## Single Step Evaluation: Routing

In [251]:
# Target function for running the routing eval
def run_router_eval(inputs: dict) -> dict:
    result = graph.nodes["general"].invoke(inputs["Input"])

    messages = result["messages"]
    if not messages:
        # Return the same shape as the normal path so evaluators can index "choice"
        return {"choice": None}

    last_message = messages[-1]
    
    router_tool_calls = [x for x in last_message.tool_calls if x['name'] == 'Router' and 'choice' in x['args']]
    if not router_tool_calls:
        return {
            "choice": None
        }

    return {
        "choice": router_tool_calls[0]['args']['choice']
    }

In [252]:
# Evaluator
def correct_router(outputs: dict, reference_outputs: dict) -> bool:
    """Check if the agent chose the correct route."""
    # Debug output: show the reference for each example
    print(reference_outputs)
    return outputs["choice"] == reference_outputs['Output']
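
Because the evaluator is a plain function, it can be exercised on hand-written dictionaries before running it against a dataset. One thing to watch with strict equality: if the dataset stores "no route expected" as a string such as `'null'` while the target function returns `None`, the two never compare equal. The sketch below is one way to handle that under the assumption that such a convention is in use; `normalize_choice` and `correct_router_normalized` are hypothetical names, not part of the notebook.

```python
def normalize_choice(value):
    """Map None and placeholder strings ('null', 'none', '') to None."""
    if value is None:
        return None
    text = str(value).strip()
    return None if text.lower() in {"null", "none", ""} else text


def correct_router_normalized(outputs: dict, reference_outputs: dict) -> bool:
    """Compare the chosen route to the reference after normalizing 'no route' values."""
    return normalize_choice(outputs.get("choice")) == normalize_choice(
        reference_outputs.get("Output")
    )


print(correct_router_normalized({"choice": "music"}, {"Output": "music"}))  # True
print(correct_router_normalized({"choice": None}, {"Output": "null"}))      # True
print(correct_router_normalized({"choice": "music"}, {"Output": "null"}))   # False
```

Whether to normalize in the evaluator or to store a real null in the dataset is a design choice; fixing the dataset keeps the evaluator simpler.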

In [253]:
from langsmith import evaluate

evaluate(
    run_router_eval,
    data="Chinook support bot: Single Step",
    evaluators=[correct_router],
    experiment_prefix="sql-agent-gpt4o-single-step",
)

View the evaluation results for experiment: 'Chinook support bot: Single Step Router experiment-b8addd5f' at:
https://smith.langchain.com/o/911e8147-82b3-4451-9946-e96f9331e9f4/datasets/9f0e12ac-c715-4c97-8323-a5a187ae4639/compare?selectedSessions=7a40a9e0-87ae-4e5c-a9d6-d41ba4528b09





{'Output': 'null'}
{'Output': 'customer'}
{'Output': 'music'}
{'Output': 'customer'}
{'Output': 'music'}
{'Output': 'music'}
{'Output': 'customer'}
{'Output': 'music'}


| | inputs.Input | outputs.choice | error | reference.Output | feedback.correct_router | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|
| 0 | {'messages': [{'role': 'user', 'content': 'I d... | | | | False | 0.767935 | f9ccc869-75bb-44f2-9f2d-46f5128f4281 | 019af9aa-8655-7269-a039-0f04e78dfd08 |
| 1 | {'messages': [{'role': 'user', 'content': 'Can... | customer | | customer | True | 0.388266 | 01f6c316-adf3-496a-a2cc-e715f21facf3 | 019af9aa-895b-713b-9811-b832992e1751 |
| 2 | {'messages': [{'role': 'user', 'content': 'Wha... | music | | music | True | 0.629129 | 29affc93-5738-4a81-b2ed-400a08700fac | 019af9aa-8ae3-7534-9b8e-78e7fabbc739 |
| 3 | {'messages': [{'role': 'user', 'content': 'My ... | customer | | customer | True | 0.398341 | 44f6f053-62b9-46a2-a01e-9aef812b6c54 | 019af9aa-8d5c-77bb-8c9a-3e7d64e46d8e |
| 4 | {'messages': [{'role': 'user', 'content': 'I w... | music | | music | True | 1.339869 | 805040fd-c6b2-45e0-9119-8da27712aa3d | 019af9aa-8eee-74a9-bfb7-ed7b2b72ccf3 |
| 5 | {'messages': [{'role': 'user', 'content': 'How... | music | | music | True | 0.393043 | 82e5a21d-9137-4250-8775-ad5c63ba950e | 019af9aa-942e-74c2-9842-e7e98535df95 |
| 6 | {'messages': [{'role': 'user', 'content': 'Wha... | customer | | customer | True | 0.549793 | ad13b9bc-9a0c-41c0-8c2e-4d328c495b6a | 019af9aa-95bc-7509-8684-93e03a9a0e40 |
| 7 | {'messages': [{'role': 'user', 'content': 'Fin... | music | | music | True | 0.439133 | c2a6a4a8-2d71-46b5-ba6f-644b68454b0f | 019af9aa-97e7-7047-8e7a-dd96ffcd5f6f |


## Trajectory Eval

In [None]:
def run_graph_trajectory(inputs: dict) -> dict:
    """Run the graph and record the trajectory (nodes and tools) it takes."""
    trajectory = []
    # Set subgraphs=True to stream events from subgraphs of the main graph: https://langchain-ai.github.io/langgraph/how-tos/streaming-subgraphs/
    # Set stream_mode="debug" to stream all possible events: https://langchain-ai.github.io/langgraph/concepts/streaming
    # Use the synchronous `stream`: `astream` returns an async iterator that a plain `for` loop cannot consume.
    for chunk in graph.stream({"messages": [HumanMessage(content=inputs['Input'])]}, subgraphs=True, stream_mode="debug"):
        # With subgraphs=True each chunk is a (namespace, event) tuple
        print(chunk)
        # Event type for entering a node
        if chunk[1]['type'] == 'task':
            # Record the node name
            trajectory.append(chunk[1]['payload']['name'])
            # Given how we defined our dataset, we also need to track when specific tools are
            # called by our question answering ReAct agent. These tool calls can be found
            # when the ToolNode (named "tools") is invoked by looking at the AIMessage.tool_calls
            # of the latest input message.
            if chunk[1]['payload']['name'] == 'tools':
                for tc in chunk[1]['payload']['input']['messages'][-1].tool_calls:
                    trajectory.append(tc['name'])
    return {"trajectory": trajectory}

In [None]:
def evaluate_extra_steps(outputs: dict, reference_outputs: dict) -> dict:
    """
    Evaluate the number of extra steps in the agent's output.

    Agent performance indicator.
    """
    extra_steps = len(outputs['trajectory']) - len(reference_outputs['trajectory'])
    return {
        "key": "extra_steps",
        "score": extra_steps,
    }

def evaluate_precision_error_percentage(outputs: dict, reference_outputs: dict) -> dict:
    """
    Precision Error (%): Percent of agent steps that are incorrect vs. the reference (correct order).

    This answers: "How many unnecessary steps are in the agent's trajectory?"

    It DOES NOT penalize missing required steps (recall).
    """

    i = j = 0
    true_positive = 0

    while i < len(reference_outputs["trajectory"]) and j < len(outputs["trajectory"]):
        if reference_outputs["trajectory"][i] == outputs["trajectory"][j]:
            true_positive += 1
            i += 1   # Advance reference only on ordered match
        j += 1       # Always advance through agent output

    total_predicted = len(outputs["trajectory"])
    false_positives = total_predicted - true_positive

    # Precision error = FP / (TP + FP)
    precision_error_pct = (
        (false_positives / total_predicted) * 100 if total_predicted > 0 else 0.0
    )

    return {
        "key": "precision_error_pct",
        "score": precision_error_pct,
    }

In [None]:
def evaluate_recall_error_percentage(outputs: dict, reference_outputs: dict) -> dict:
    """
    Recall Error (%): Percent of required reference steps 
    missing from the agent's trajectory (correct order).
    
    This answers: "How incomplete is the agent's trajectory?"

    It DOES NOT penalize extra or hallucinated steps (precision).
    """
    i = j = 0
    true_positive = 0

    while i < len(reference_outputs["trajectory"]) and j < len(outputs["trajectory"]):
        if reference_outputs["trajectory"][i] == outputs["trajectory"][j]:
            true_positive += 1
            i += 1   # Advance reference only on ordered match
        j += 1       # Always advance through agent output

    total_required = len(reference_outputs["trajectory"])
    false_negatives = total_required - true_positive

    # Recall error = FN / (TP + FN)
    recall_error_pct = (
        (false_negatives / total_required) * 100 if total_required > 0 else 0.0
    )

    return {
        "key": "recall_error_pct",
        "score": recall_error_pct,
    }
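
A worked example helps when reading these metrics. The sketch below re-implements the same ordered matching in compact form and applies it to two made-up trajectories; the helper names and the trajectories themselves are assumptions for illustration, built from the node and tool names used in this notebook.

```python
def ordered_matches(reference, actual):
    """Count reference steps that appear in `actual` in the same relative order."""
    i = matched = 0
    for step in actual:
        if i < len(reference) and reference[i] == step:
            matched += 1
            i += 1  # advance the reference only on an ordered match
    return matched


def trajectory_errors(reference, actual):
    """Return extra steps plus precision and recall error percentages."""
    tp = ordered_matches(reference, actual)
    precision_error = (len(actual) - tp) / len(actual) * 100 if actual else 0.0
    recall_error = (len(reference) - tp) / len(reference) * 100 if reference else 0.0
    return {
        "extra_steps": len(actual) - len(reference),
        "precision_error_pct": precision_error,
        "recall_error_pct": recall_error,
    }


# Hypothetical reference: route to music, call one tool, answer.
reference = ["general", "music", "tools", "get_tracks_by_artist", "music"]
# Hypothetical run: the agent first tried the wrong tool, then the right one.
detour = ["general", "music", "tools", "get_albums_by_artist", "music",
          "tools", "get_tracks_by_artist", "music"]

result = trajectory_errors(reference, detour)
print(result["extra_steps"])          # 3
print(result["precision_error_pct"])  # 37.5
print(result["recall_error_pct"])     # 0.0
```

Here all five reference steps are found in order, so recall error is zero, while three of the eight agent steps are unmatched, giving a precision error of 37.5%. A run that stops after `["general", "music"]` would show the opposite pattern: no precision error, but three of five required steps missing.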

In [None]:
evaluate(
    run_graph_trajectory,
    data="",
    evaluators=[evaluate_extra_steps, evaluate_precision_error_percentage, evaluate_recall_error_percentage],
    experiment_prefix="sql-agent-gpt4o-trajectory",
)

## Test it out

In [None]:
from langchain_core.messages import HumanMessage
from langgraph.graph import START

history = []
while True:
    user = input('User (q/Q to quit): ')
    if user in {'q', 'Q'}:
        break
    history.append(HumanMessage(content=user))
    async for output in graph.astream({"messages": history}):
        if END in output or START in output:
            continue
        # stream() yields dictionaries with output keyed by node name
        for key, value in output.items():
            print(f"Output from node '{key}':")
            print("---")
            print(value)
        print("\n---\n")
    history.append(AIMessage(content=value["messages"][-1].content))

User (q/Q to quit):  hi! whats the email you have for me?


Output from node 'general':
---
{'messages': [AIMessage(content='', additional_kwargs={'tool_calls': [{'index': 0, 'id': 'call_jBrtVtnLXNVtYgPtsBhzaTRI', 'function': {'arguments': '{"choice":"customer"}', 'name': 'Router'}, 'type': 'function'}]}, response_metadata={'finish_reason': 'tool_calls', 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_83554c687e', 'service_tier': 'default'}, name='general', id='lc_run--24c9858e-ebd6-464a-9a2b-861de0333fce', tool_calls=[{'name': 'Router', 'args': {'choice': 'customer'}, 'id': 'call_jBrtVtnLXNVtYgPtsBhzaTRI', 'type': 'tool_call'}])]}

---

Output from node 'customer':
---
{'messages': [AIMessage(content='Could you please provide me with your customer ID so I can look up your information?', additional_kwargs={}, response_metadata={'finish_reason': 'stop', 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_cbf1785567', 'service_tier': 'default'}, name='customer', id='lc_run--61b3145a-dc67-4701-82e0-e2aca1f574ff')]}

---



User (q/Q to quit):  4


Output from node 'general':
---
{'messages': [AIMessage(content='', additional_kwargs={'tool_calls': [{'index': 0, 'id': 'call_KEdvkNlEWtTyrsxz1i0lvj1n', 'function': {'arguments': '{"choice":"customer"}', 'name': 'Router'}, 'type': 'function'}]}, response_metadata={'finish_reason': 'tool_calls', 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_83554c687e', 'service_tier': 'default'}, name='general', id='lc_run--8d630fc5-3faa-4ba8-8cc7-bb83cf433670', tool_calls=[{'name': 'Router', 'args': {'choice': 'customer'}, 'id': 'call_KEdvkNlEWtTyrsxz1i0lvj1n', 'type': 'tool_call'}])]}

---

Output from node 'customer':
---
{'messages': [AIMessage(content='', additional_kwargs={'tool_calls': [{'index': 0, 'id': 'call_jYpwM9qUSONONIvEIvL8cH0v', 'function': {'arguments': '{"customer_id":4}', 'name': 'get_customer_info'}, 'type': 'function'}]}, response_metadata={'finish_reason': 'tool_calls', 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_cbf1785567', 'service_tier': 'defau