# Debug, Evaluate, and Monitor LLMs with LangSmith

LangChain makes it easy to get started with Agents and other LLM applications. Even so, delivering a high-quality agent to production can be deceptively difficult. To aid the development process, we've designed tracing and callbacks at the core of LangChain. In this notebook, you will get started prototyping, testing, and monitoring an LLM agent.

When might you want to use tracing? Some situations where we've found it useful include:
- Quickly debugging a new chain, agent, or set of tools
- Evaluating a given chain across different LLMs or Chat Models to compare results or improve prompts
- Running a given chain multiple times on a dataset to ensure it consistently meets a quality bar
- Capturing production traces and using LangChain summarizers to analyze app usage

## Prerequisites

**Either [create a hosted LangSmith account](https://www.langchain.plus/) and connect with an API key OR
run the server locally.**


To run the local server, execute the following command in your terminal:
```
pip install --upgrade langchain
langchain plus start
```

Now, let's get started by creating a client to connect to LangChain+.

## Debug your Agent

First, configure your environment variables to tell LangChain to log traces. This is done by setting the `LANGCHAIN_TRACING_V2` environment variable to true.
You can tell LangChain which project to log to by setting the `LANGCHAIN_PROJECT` environment variable. This will automatically create a debug project for you.

For more information on other ways to set up tracing, please reference the [LangSmith documentation](https://docs.langchain.plus/docs/).

**NOTE:** You must also set your `OPENAI_API_KEY` and `SERPAPI_API_KEY` environment variables in order to run the following tutorial.

**NOTE:** If you are using the hosted version, which is in private beta, you can optionally set the `LANGCHAIN_ENDPOINT` and `LANGCHAIN_API_KEY` environment variables.
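
Because a missing key only surfaces once the agent makes its first call, it can help to check the environment up front. The snippet below is a minimal sketch using only the standard library; `missing_env_vars` is a hypothetical helper written for this walkthrough, not part of LangChain or the LangSmith SDK.

```python
import os
from typing import Iterable, List, Mapping


def missing_env_vars(
    required: Iterable[str], environ: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the names in `required` that are unset or empty in `environ`."""
    return [name for name in required if not environ.get(name)]


# Keys the notes above say this tutorial relies on.
REQUIRED = ["OPENAI_API_KEY", "SERPAPI_API_KEY"]

missing = missing_env_vars(REQUIRED)
if missing:
    print("Set these before running the agent:", ", ".join(missing))
```

Passing a plain dictionary as `environ` makes the helper easy to try without touching your real environment.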

In [2]:
import os
from uuid import uuid4
from langchainplus_sdk import LangChainPlusClient

unique_id = uuid4().hex[0:8]
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"Tracing Walkthrough - {unique_id}"
# os.environ["LANGCHAIN_ENDPOINT"] = "https://api.langchain.plus"  # Uncomment this line to use the hosted version
# os.environ["LANGCHAIN_API_KEY"] = "<YOUR-LANGCHAINPLUS-API-KEY>"  # Uncomment this line to use the hosted version.

# Used by the agent below
# os.environ["OPENAI_API_KEY"] = "<YOUR-OPENAI-API-KEY>"
# os.environ["SERPAPI_API_KEY"] = "<YOUR-SERPAPI-API-KEY>"

client = LangChainPlusClient()
print("You can click the link below to view the UI")
client

You can click the link below to view the UI


Now, start prototyping your agent. We will use a straightforward math example.

In [3]:
from langchain.chat_models import ChatOpenAI
from langchain.agents import initialize_agent, load_tools
from langchain.agents import AgentType

llm = ChatOpenAI(temperature=0)
tools = load_tools(["serpapi", "llm-math"], llm=llm)
agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=False
)

In [4]:
import asyncio

inputs = [
    "How many people live in canada as of 2023?",
    "who is dua lipa's boyfriend? what is his age raised to the .43 power?",
    "what is dua lipa's boyfriend age raised to the .43 power?",
    "how far is it from paris to boston in miles",
    "what was the total number of points scored in the 2023 super bowl? what is that number raised to the .23 power?",
    "what was the total number of points scored in the 2023 super bowl raised to the .23 power?",
    "how many more points were scored in the 2023 super bowl than in the 2022 super bowl?",
    "what is 153 raised to .1312 power?",
    "who is kendall jenner's boyfriend? what is his height (in inches) raised to .13 power?",
    "what is 1213 divided by 4345?",
]
results = []


async def arun(agent, input_example):
    try:
        return await agent.arun(input_example)
    except Exception as e:
        # The agent sometimes makes mistakes! These will be captured by the tracing.
        return e


for input_example in inputs:
    results.append(arun(agent, input_example))
results = await asyncio.gather(*results)
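
The cell above wraps each call in `try`/`except` so that one failing question does not cancel the rest. As a rough alternative sketch, `asyncio.gather` can do the same with `return_exceptions=True`. Here `fake_agent_run` is a hypothetical stand-in for `agent.arun`, so the example runs without any API keys.

```python
import asyncio
from typing import List, Tuple, Union


async def fake_agent_run(question: str) -> str:
    """Hypothetical stand-in for `agent.arun`; fails on blank input."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    if not question.strip():
        raise ValueError("empty question")
    return f"answer to: {question}"


async def run_all(questions: List[str]) -> List[Union[str, BaseException]]:
    """Run every question concurrently; failures come back as exception objects."""
    tasks = [fake_agent_run(q) for q in questions]
    return await asyncio.gather(*tasks, return_exceptions=True)


def split_results(
    results: List[Union[str, BaseException]]
) -> Tuple[List[str], List[BaseException]]:
    """Separate successful answers from captured exceptions."""
    ok = [r for r in results if not isinstance(r, BaseException)]
    failed = [r for r in results if isinstance(r, BaseException)]
    return ok, failed
```

In a notebook you would call `await run_all([...])`; in a plain script, `asyncio.run(run_all([...]))`. Results keep the order of the inputs, so each exception can be matched back to the question that caused it.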

In [5]:
from langchain.callbacks.tracers.langchain import wait_for_all_tracers

# Logs are submitted in a background thread. Make sure they've been submitted before moving on.
wait_for_all_tracers()

Assuming you've successfully started the server as described earlier, your agent logs should show up on your server. You can check by clicking on the link below:

In [6]:
client

## Test

Once you've debugged a prototype of your agent, you will want to create tests and benchmark evaluations as you think about putting it into a production environment.

In this notebook, you will run evaluators to test an agent. You will do so in a few steps:

1. Create a dataset
2. Select or create evaluators to measure performance
3. Define the LLM or Chain initializer to test
4. Run the chain and evaluators using the helper functions

### 1. Create Dataset

Below, use the client to create a dataset from the Agent runs you just logged while debugging above. You will use these later to measure performance.

For more information on datasets, including how to create them from CSVs or other files or how to create them in the web app, please refer to the [LangSmith documentation](https://docs.langchain.plus/docs).
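
The selection rule used in the cell that follows (keep only top-level runs that finished without an error, then store their inputs and outputs as examples) can be pictured with plain Python. This is only a sketch: `LoggedRun` is a hypothetical, simplified record, not the SDK's run schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LoggedRun:
    """Hypothetical, simplified stand-in for a traced run record."""

    inputs: Dict[str, str]
    outputs: Optional[Dict[str, str]]
    execution_order: int  # 1 marks a top-level run in this sketch
    error: Optional[str] = None


def runs_to_examples(runs: List[LoggedRun]) -> List[Dict[str, Dict[str, str]]]:
    """Keep top-level, error-free runs and turn them into input/output pairs."""
    examples = []
    for run in runs:
        is_top_level = run.execution_order == 1
        succeeded = run.error is None and run.outputs is not None
        if is_top_level and succeeded:
            examples.append({"inputs": run.inputs, "outputs": run.outputs})
    return examples
```

Note that the stored outputs are whatever the agent produced while debugging, so they are only as trustworthy as those runs were.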

In [7]:
dataset_name = "calculator-example-dataset"

In [8]:
if dataset_name in set([dataset.name for dataset in client.list_datasets()]):
    client.delete_dataset(dataset_name=dataset_name)
dataset = client.create_dataset(
    dataset_name, description="A calculator example dataset"
)

runs = client.list_runs(
    project_name=os.environ["LANGCHAIN_PROJECT"],
    execution_order=1,  # Only return the top-level runs
    error=False,  # Only runs that succeed
)
for run in runs:
    client.create_example(
        inputs=run.inputs, outputs=run.outputs, dataset_id=dataset.id
    )

### 2. Select RunEvaluators

Manually comparing the results of chains in the UI is effective, but it can be time-consuming.
It's easier to leverage AI-assisted feedback to evaluate your agent's performance.

Below, we will create some pre-implemented run evaluators that do the following:
- Compare results against ground truth labels. (You used the debug outputs above for this.)
- Evaluate the overall agent trajectory based on the tool usage and intermediate steps.
- Evaluate 'aspects' of the agent's response in a reference-free manner using custom criteria.
- Evaluate performance based on 'context' such as retrieved documents or tool results.

For a longer discussion of how to select an appropriate evaluator for your use case and how to create your own
custom evaluators, please refer to the [LangSmith documentation](https://docs.langchain.plus/docs/).
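
To make the first idea in the list above concrete (comparing a result against a ground truth label), here is a toy, LLM-free check suited to calculator-style answers. It is a sketch only: `numeric_match_feedback` and the shape of the dictionary it returns are illustrative assumptions, not LangChain's evaluator interface.

```python
import math
import re
from typing import Dict, Optional


def first_number(text: str) -> Optional[float]:
    """Return the first number found in `text`, ignoring thousands separators."""
    match = re.search(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None


def numeric_match_feedback(
    prediction: str, reference: str, rel_tol: float = 1e-3
) -> Dict[str, object]:
    """Score 1 if the first numbers in both strings agree within `rel_tol`."""
    predicted, expected = first_number(prediction), first_number(reference)
    if predicted is None or expected is None:
        return {"key": "numeric_match", "score": 0, "comment": "no number found"}
    close = math.isclose(predicted, expected, rel_tol=rel_tol)
    return {
        "key": "numeric_match",
        "score": int(close),
        "comment": f"predicted={predicted}, expected={expected}",
    }
```

A rule like this is cheap but brittle (it only looks at the first number in each string), which is one reason model-graded evaluators are attractive for free-form answers.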

Below, create the run evaluators.

**Note: the feedback API is currently experimental and subject to change.**

In [9]:
from langchain.evaluation.run_evaluators import (
    get_qa_evaluator,
    get_criteria_evaluator,
    get_trajectory_evaluator,
)
from langchain.chat_models import ChatOpenAI

# You can use any model, but stronger llms tend to be more reliable
eval_llm = ChatOpenAI(model="gpt-4", temperature=0)

# Measures accuracy against ground truth
qa_evaluator = get_qa_evaluator(eval_llm) 

# Measures how effective and efficient the agent's actions are
tools = load_tools(["serpapi", "llm-math"], llm=llm)
trajectory_evaluator = get_trajectory_evaluator(eval_llm, agent_tools=tools)

# Measure helpfulness. We have some pre-defined criteria you can select
helpfulness_evaluator = get_criteria_evaluator(
    eval_llm,
    "helpfulness",
)

# Custom criteria are specified as a dictionary
custom_criteria_evaluator = get_criteria_evaluator(
    eval_llm,
    {
        "fifth-grader-score": "Do you have to be smarter than a fifth grader to answer this question?"
    },
)

evaluators = [
    qa_evaluator,
    trajectory_evaluator,
    helpfulness_evaluator,
    custom_criteria_evaluator,
]

### 3. Define the Agent or LLM to Test

You can evaluate any LLM or chain. Since chains can have memory, we need to pass an
initializer function that returns a new chain for each row.

In this case, you will test an agent that uses OpenAI's function calling endpoints, but it can be any simple chain.
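
The reason for passing a factory rather than a single chain is easiest to see with a toy example. In the sketch below, `ChainWithMemory` and `run_rows` are hypothetical stand-ins, not LangChain classes; they only show how state leaks between rows when one instance is reused.

```python
from typing import Callable, List


class ChainWithMemory:
    """Hypothetical stateful chain: remembers every input it has seen."""

    def __init__(self) -> None:
        self.memory: List[str] = []

    def run(self, question: str) -> str:
        self.memory.append(question)
        return f"{question} (context size: {len(self.memory)})"


def run_rows(rows: List[str], factory: Callable[[], ChainWithMemory]) -> List[str]:
    """Call the factory once per row, mirroring the 'new chain for each row' idea."""
    return [factory().run(row) for row in rows]


rows = ["first question", "second question"]

# Fresh chain per row: each row sees only its own input.
isolated = run_rows(rows, ChainWithMemory)

# One shared instance: the second row also sees the first row's input.
shared_chain = ChainWithMemory()
leaky = run_rows(rows, lambda: shared_chain)
```

With the shared instance, the result for the second row depends on the first, so evaluation scores would depend on the order of the dataset.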

In [10]:
from langchain.chat_models import ChatOpenAI
from langchain.agents import initialize_agent, load_tools
from langchain.agents import AgentType

llm = ChatOpenAI(model="gpt-3.5-turbo-0613", temperature=0)
tools = load_tools(["serpapi", "llm-math"], llm=llm)

# Since chains can be stateful (e.g. they can have memory), we need to provide
# a way to initialize a new chain for each row in the dataset. This is done
# by passing in a factory function that returns a new chain for each row.
def agent_factory():
    return initialize_agent(
        tools, llm, agent=AgentType.OPENAI_FUNCTIONS, verbose=False
    )

# If your chain is NOT stateful, your factory can return the object directly
# to improve runtime performance. For example:
# chain_factory = lambda: agent

### 4. Run the Agent and Evaluators

With the dataset, agent, and evaluators selected, you can use the helper function below to run them all.

The run traces and evaluation feedback will automatically be associated with the dataset for easy attribution and analysis.
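
The helper accepts a `concurrency_level` argument that sets how many examples run at a time. How the helper enforces that limit internally is not shown in this notebook; the sketch below only illustrates the general technique of capping concurrent work with `asyncio.Semaphore`. `run_with_limit` and `ConcurrencyProbe` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_with_limit(
    items: List[str],
    worker: Callable[[str], Awaitable[str]],
    concurrency_level: int = 5,
) -> List[str]:
    """Run `worker` over `items`, with at most `concurrency_level` in flight."""
    semaphore = asyncio.Semaphore(concurrency_level)

    async def guarded(item: str) -> str:
        async with semaphore:  # blocks while the limit is reached
            return await worker(item)

    return await asyncio.gather(*(guarded(item) for item in items))


class ConcurrencyProbe:
    """Hypothetical worker that records the peak number of simultaneous calls."""

    def __init__(self) -> None:
        self.active = 0
        self.peak = 0

    async def __call__(self, item: str) -> str:
        self.active += 1
        self.peak = max(self.peak, self.active)
        await asyncio.sleep(0.01)  # simulate a slow model or tool call
        self.active -= 1
        return item.upper()
```

A lower limit is gentler on provider rate limits, while a higher one finishes the dataset sooner.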

In [11]:
from langchain.client import (
    arun_on_dataset,
    run_on_dataset, # Available if your chain doesn't support async calls.
)

?arun_on_dataset

Signature:
arun_on_dataset(
    dataset_name: 'str',
    llm_or_chain_factory: 'MODEL_OR_CHAIN_FACTORY',
    *,
    concurrency_level: 'int' = 5,
    num_repetitions: 'int' = 1,
    project_name: 'Optional[str]' = None,
    verbose: 'bool' = False,
    client: 'Optional[LangChainPlusClient]' = None,
    tags: 'Optional[List[s

In [12]:
chain_results = await arun_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=agent_factory,
    concurrency_level=5,  # Optional, sets the number of examples to run at a time
    verbose=True,
    client=client,
    tags=[
        "testing-notebook",
    ],  # Optional, adds a tag to the resulting chain runs
    run_evaluators=evaluators,
)

# Sometimes, the agent will error due to parsing issues, incompatible tool inputs, etc.
# These are logged as warnings here and captured as errors in the tracing UI.

Processed examples: 6

### Review the Test Results

You can review the test results in the tracing UI below by navigating to the Testing project
with the title that starts with **"calculator-example-dataset-AgentExecutor-"**.

This will show the new runs and the feedback logged from the selected evaluators.

In [13]:
# You can navigate to the UI by clicking on the link below
client

For a real production application, you will want to add many more test cases and
incorporate larger datasets, then run benchmark evaluations to measure aggregate performance
across them. For more information on recommended ways to do this, see the [LangSmith documentation](https://docs.langchain.plus/docs/).

## Monitor

Once your agent passes the selected quality bar, you can deploy it to production. For this notebook, you will simulate user interactions directly while logging your traces to LangSmith for monitoring.

For more information on real production deployments, check out the [LangChain documentation](https://python.langchain.com/docs/guides/deployments/) or contact us at [support@langchain.dev](mailto:support@langchain.dev).

**First, create a new project to use in your production deployment.**

In [14]:
deployment_name = f"Search + Calculator Deployment - {unique_id}"
project = client.create_project(deployment_name, mode="monitor")

**Then, deploy your agent to production, making sure to configure the environment to log to the monitoring project.**
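
If the same process logs to more than one project, it can be handy to scope the `LANGCHAIN_PROJECT` setting to a block of code and restore the previous value afterwards. The context manager below is a minimal sketch using only the standard library; `use_project` is a hypothetical helper, and it assumes the tracer reads the variable at the time each run starts, which this notebook does not confirm.

```python
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def use_project(project_name: str) -> Iterator[None]:
    """Temporarily point LANGCHAIN_PROJECT at `project_name`."""
    previous = os.environ.get("LANGCHAIN_PROJECT")
    os.environ["LANGCHAIN_PROJECT"] = project_name
    try:
        yield
    finally:
        # Restore the earlier value, or remove the variable if it was unset.
        if previous is None:
            os.environ.pop("LANGCHAIN_PROJECT", None)
        else:
            os.environ["LANGCHAIN_PROJECT"] = previous
```

Code run inside `with use_project(deployment_name):` would see the monitoring project name, and the debug project name would be back in place once the block exits.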

In [15]:
agent = initialize_agent(
    tools, llm, agent=AgentType.OPENAI_FUNCTIONS, verbose=False
)

In [16]:
os.environ["LANGCHAIN_PROJECT"] = deployment_name

inputs = [
    "What's the ratio of the current US GDP to the average lifespan of a human?",
    "What's sin of 180 degrees?",
    "I need help on my homework",
    "If the price of bushel of wheat increases by 10 cents, about how much will that impact the average cost of bread?",
    # etc.
]
for query in inputs:
    try:
        await agent.arun(query)
    except Exception as e:
        print(e)

LLMMathChain._evaluate("
US_GDP / average_lifespan
") raised error: 'US_GDP'. Please try again with a valid numerical expression


## Conclusion

Congratulations! You have successfully connected an agent to LangSmith to trace and debug it, evaluated it for accuracy, helpfulness, and trajectory efficiency over a dataset, and instrumented a monitoring project for a simulated "production" application!

This was a quick guide to get started, but there are many more ways to use LangSmith to speed up your developer flow and produce better products.

For more information on how you can get the most out of LangSmith, check out the [LangSmith documentation](https://docs.langchain.plus/docs/), and please reach out with questions, feature requests, or feedback at [support@langchain.dev](mailto:support@langchain.dev).