# LangSmith Walkthrough

LangChain makes it easy to prototype LLM applications and Agents. However, delivering LLM applications to production can be deceptively difficult. You will likely have to heavily customize and iterate on your prompts, chains, and other components to create a high-quality product.

To aid in this process, we've launched LangSmith, a unified platform for debugging, testing, and monitoring your LLM applications.

When might this come in handy? You may find it useful when you want to:

- Quickly debug a new chain, agent, or set of tools
- Visualize how components (chains, LLMs, retrievers, etc.) relate and are used
- Evaluate different prompts and LLMs for a single component
- Run a given chain several times over a dataset to ensure it consistently meets a quality bar
- Capture usage traces and use LLMs or analytics pipelines to generate insights

## Prerequisites

**[Create a LangSmith account](https://smith.langchain.com/) and create an API key (see bottom left corner). Familiarize yourself with the platform by looking through the [docs](https://docs.smith.langchain.com/).**

Note: LangSmith is in closed beta, and we're in the process of rolling it out to more users. You can fill out the form on the website for expedited access.

Now, let's get started!

## Log runs to LangSmith

First, configure your environment variables to tell LangChain to log traces. This is done by setting the `LANGCHAIN_TRACING_V2` environment variable to true.
You can tell LangChain which project to log to by setting the `LANGCHAIN_PROJECT` environment variable (if this isn't set, runs will be logged to the `default` project). This will automatically create the project for you if it doesn't exist. You must also set the `LANGCHAIN_ENDPOINT` and `LANGCHAIN_API_KEY` environment variables.

For more information on other ways to set up tracing, please reference the [LangSmith documentation](https://docs.smith.langchain.com/docs/).

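**NOTE:** You must also set your `OPENAI_API_KEY` environment variable in order to run the following tutorial. The agent below uses DuckDuckGo for search, so `SERPAPI_API_KEY` is only needed if you swap in a SerpAPI-based search tool.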

**NOTE:** You can only access an API key when you first create it. Keep it somewhere safe.

**NOTE:** You can also use a context manager in Python to log traces:
```python
from langchain.callbacks.manager import tracing_v2_enabled

with tracing_v2_enabled(project_name="My Project"):
    agent.run("How many people live in canada as of 2023?")
```

However, in this example, we will use environment variables.

In [1]:
# %pip install -U langchain langsmith --quiet
%pip install openai tiktoken pandas html2text faiss-cpu duckduckgo-search --quiet

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
from uuid import uuid4

unique_id = uuid4().hex[0:8]
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"Tracing Walkthrough - {unique_id}"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
# os.environ["LANGCHAIN_API_KEY"] = ""  # Update to your API key

# Used by the agent in this tutorial
# os.environ["OPENAI_API_KEY"] = "<YOUR-OPENAI-API-KEY>"
# os.environ["SERPAPI_API_KEY"] = "<YOUR-SERPAPI-API-KEY>"
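
Because some of the variables above are left commented out, it can help to confirm that everything is set before running the agent. The following is a minimal sketch, not part of the LangSmith SDK: the helper name `missing_env_vars` and the `REQUIRED_VARS` list are assumptions made for illustration, based on the variables named in this section.

```python
import os

# Variables this walkthrough says must be set (list assembled for illustration).
REQUIRED_VARS = [
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_ENDPOINT",
    "LANGCHAIN_API_KEY",
    "OPENAI_API_KEY",
]


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty in `environ`."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

`LANGCHAIN_PROJECT` is left out of the list on purpose, since runs fall back to the `default` project when it isn't set.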

Create the LangSmith client to interact with the API.

In [3]:
from langsmith import Client

client = Client()

Create a LangChain component and log runs to the platform. In this example, we will create a ReAct-style agent with access to a general search tool (DuckDuckGo) as well as a vector store retrieval tool.

In [5]:
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import RecursiveUrlLoader
from langchain.document_transformers import Html2TextTransformer
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import TokenTextSplitter
from langchain.vectorstores import FAISS

api_loader = RecursiveUrlLoader("https://docs.smith.langchain.com")
text_splitter = TokenTextSplitter(
    model_name="gpt-3.5-turbo",
    chunk_size=1000,
    chunk_overlap=200,
)
doc_transformer = Html2TextTransformer()
raw_documents = api_loader.load()
transformed = doc_transformer.transform_documents(raw_documents)
documents = text_splitter.split_documents(transformed)

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(documents, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
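
The `chunk_size` and `chunk_overlap` arguments passed to `TokenTextSplitter` above control how the documents are cut into overlapping windows of tokens. As a rough illustration of the idea, here is a simplified sketch that works on an already-tokenized list. It is not the library's implementation, and the function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split `tokens` into windows of `chunk_size` that overlap by `chunk_overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the current window already reached the end
        start += step
    return chunks


# Ten stand-in "tokens", windows of 4 that share 1 token with their neighbor.
example_chunks = chunk_tokens(list(range(10)), chunk_size=4, chunk_overlap=1)
print(example_chunks)
```

With the values used in this walkthrough (1000 and 200), each window would advance by 800 tokens under this scheme.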



In [6]:
from langchain.agents import AgentType, initialize_agent
from langchain.agents.agent_toolkits.conversational_retrieval.tool import (
    create_retriever_tool,
)
from langchain.chat_models import ChatOpenAI
from langchain.tools import DuckDuckGoSearchResults, StructuredTool


llm = ChatOpenAI(
    model="gpt-3.5-turbo-16k",
    temperature=0,
)

tools = [
    DuckDuckGoSearchResults(), # General internet search using DuckDuckGo
    create_retriever_tool( # Search the Langsmith documentation
        retriever,
        name="query-langsmith-docs",
        description="Query the Langsmith documentation using semantic search.",
    )
]
agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=False,
)

We are running the agent concurrently on multiple inputs to reduce latency. Runs get logged to LangSmith in the background so execution latency is unaffected.

In [7]:
inputs = [
    "What is LangChain?",
    "How might I query for all runs in a project?",
    "What's a langsmith dataset?",
    "How do I use a traceable decorator?",
    "Can I trace my Llama V2 llm?",
    "Why do I have to set environment variables?",
    "How do I move my project between organizations?",
    "How do I search for a run with metadata?",
]

results = agent.batch(inputs, return_exceptions=True)

In [8]:
# This weaker agent returns a number of errors
results[:3]

[{'input': 'What is LangChain?',
  'output': 'LangChain is a framework for developing applications powered by language models.'},
 InvalidRequestError(message="This model's maximum context length is 16385 tokens. However, your messages resulted in 16771 tokens. Please reduce the length of the messages.", param='messages', code='context_length_exceeded', http_status=400, request_id=None),
 langchain.schema.output_parser.OutputParserException('Could not parse LLM output: `Based on the observation, a langsmith dataset is a curated dataset that can be used for testing and evaluating changes to a prompt or chain in LangSmith. It can be used to run the chain over the data points and visualize the outputs. The LangSmith client makes it easy to pull down a dataset and run a chain over them, logging the results to a new project associated with the dataset. The dataset can also be exported for use in other contexts.`')]
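
Since `return_exceptions=True` places exceptions in the results list alongside successful outputs, it can be convenient to separate the two before inspecting them. Below is a small sketch using stand-in data rather than a live agent; the helper name `split_batch_results` is an assumption for illustration.

```python
def split_batch_results(inputs, results):
    """Pair each input with its result and separate successes from exceptions."""
    successes, failures = [], []
    for question, result in zip(inputs, results):
        if isinstance(result, Exception):
            failures.append((question, result))
        else:
            successes.append((question, result))
    return successes, failures


# Stand-in data shaped like the output above (no API calls are made here).
demo_inputs = ["What is LangChain?", "What's a langsmith dataset?"]
demo_results = [
    {"input": "What is LangChain?", "output": "A framework for LLM applications."},
    ValueError("Could not parse LLM output"),
]

ok, failed = split_batch_results(demo_inputs, demo_results)
print(f"{len(ok)} succeeded, {len(failed)} failed")
for question, error in failed:
    print(f"- {question!r}: {type(error).__name__}")
```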

Assuming you've successfully set up your environment, your agent traces should show up in the `Projects` section in the [app](https://smith.langchain.com/). Congrats!

## Evaluate another agent implementation

In addition to logging runs, LangSmith also allows you to test and evaluate your LLM applications.

In this section, you will leverage LangSmith to create a benchmark dataset and run AI-assisted evaluators on an agent. You will do so in a few steps:

1. Create a dataset from pre-existing run inputs and outputs
2. Initialize a new agent to benchmark
3. Configure evaluators to grade an agent's output
4. Run the agent over the dataset and evaluate the results

### 1. Create a LangSmith dataset

Below, we use the LangSmith client to create a dataset from the input questions above and a list of labels. You will use these later to measure the performance of a new agent. A dataset is a collection of examples, which are nothing more than input-output pairs you can use as test cases for your application.

For more information on datasets, including how to create them from CSVs or other files or how to create them in the platform, please refer to the [LangSmith documentation](https://docs.smith.langchain.com/).

In [9]:
outputs = [
    "LangChain is an open-source framework for building applications using large language models. It is also the name of the company building LangSmith.",
    "client.list_runs(project_name='my-project-name'), or in TypeScript, client.ListRuns({projectName: 'my-project-name'})",
    "A LangSmith dataset is a collection of examples. Each example contains inputs and optional expected outputs or references for that data point.",
    """The traceable decorator is available in the langsmith python SDK. To use, configure your environment with your API key, import the required function, decorate your function, and then call the function. Below is an example:
```python
from langsmith.run_helpers import traceable
@traceable(run_type="chain") # or "llm", etc.
def my_function(input_param):
    # Function logic goes here
    return output
result = my_function(input_param)
```""",
    "So long as you are using one of LangChain's LLM implementations, all your calls can be traced",
    "Environment variables can tell your LangChain application to perform tracing and contain the information necessary to authenticate to LangSmith. While there are other ways to connect, environment variables tend to be the simplest way to configure your application.",
    "LangSmith doesn't directly support moving projects between organizations.",
    """You can search for runs with specific metadata using the filter 'has(metadata, "<json-search>")'. An example in python is:
```python
from langsmith import Client

client = Client()
runs = list(client.list_runs(
    project_name="<your_project>",
    filter='has(metadata, '{"variant": "abc123"}')',
))```""",
]

In [10]:
dataset_name = f"langsmith-docs-dataset-{unique_id}"

dataset = client.create_dataset(
    dataset_name, description="An example dataset of questions over the LangSmith documentation."
)

for query, answer in zip(inputs, outputs):
    client.create_example(inputs={"input": query}, outputs={"output": answer}, dataset_id=dataset.id)
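
One thing to watch for in the loop above is that `zip` silently stops at the shorter list, so a missing label would drop a question without warning. The sketch below adds a length check before pairing; the helper name `pair_examples` is hypothetical, and it only builds the dictionaries without calling the LangSmith API.

```python
def pair_examples(questions, answers):
    """Build (inputs, outputs) dictionaries, failing loudly on a length mismatch."""
    if len(questions) != len(answers):
        raise ValueError(
            f"Got {len(questions)} questions but {len(answers)} answers"
        )
    return [({"input": q}, {"output": a}) for q, a in zip(questions, answers)]


# Stand-in data; in the walkthrough these would be `inputs` and `outputs`.
pairs = pair_examples(
    ["What is LangChain?"],
    ["LangChain is an open-source framework."],
)
for example_inputs, example_outputs in pairs:
    print(example_inputs, "->", example_outputs)
```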

### 2. Initialize a new agent to benchmark

LangSmith lets you evaluate any LLM, chain, agent, or even a custom function. Conversational agents are stateful (they have memory); to ensure that this state isn't shared between dataset runs, we will pass in a `chain_factory` (aka a `constructor`) function to initialize for each call.

In this case, we will test an agent that uses OpenAI's function calling endpoints.

In [11]:
from langchain.chat_models import ChatOpenAI
from langchain.agents import AgentType, initialize_agent, load_tools, AgentExecutor
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.agents.format_scratchpad import format_to_openai_functions
from langchain.agents.output_parsers import OpenAIFunctionsAgentOutputParser
from langchain.tools.render import format_tool_to_openai_function
from langchain import hub


# Since chains can be stateful (e.g. they can have memory), we provide
# a way to initialize a new chain for each row in the dataset. This is done
# by passing in a factory function that returns a new chain for each row.
def agent_factory():
    prompt = hub.pull("wfh/langsmith-agent-prompt")
    
    llm_with_tools = llm.bind(
        functions=[format_tool_to_openai_function(t) for t in tools]
    )
    runnable_agent = (
        {
            "input": lambda x: x["input"],
            "agent_scratchpad": lambda x: format_to_openai_functions(
                x["intermediate_steps"]
            ),
        }
        | prompt
        | llm_with_tools
        | OpenAIFunctionsAgentOutputParser()
    )
    return AgentExecutor(agent=runnable_agent, tools=tools)
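
To see why a factory is passed rather than a single shared instance, consider the toy sketch below. `ToyConversation` is a hypothetical stand-in for any stateful component and is not a LangChain class; it only illustrates how memory leaks between calls when one object is reused.

```python
class ToyConversation:
    """Hypothetical stateful component that remembers every question it sees."""

    def __init__(self):
        self.memory = []

    def run(self, question):
        self.memory.append(question)
        return f"seen {len(self.memory)} message(s)"


def toy_factory():
    """Return a fresh instance so no state carries over between dataset rows."""
    return ToyConversation()


questions = ["first question", "second question"]

# One shared instance: the second call sees state left over from the first.
shared = ToyConversation()
shared_outputs = [shared.run(q) for q in questions]

# A factory: every call starts from an empty memory.
fresh_outputs = [toy_factory().run(q) for q in questions]

print(shared_outputs)
print(fresh_outputs)
```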




### 3. Configure evaluation

Manually comparing the results of chains in the UI is effective, but it can be time-consuming.
It can be helpful to use automated metrics and AI-assisted feedback to evaluate your component's performance.

Below, we will create some pre-implemented run evaluators that do the following:
- Compare results against ground-truth labels
- Measure semantic (dis)similarity using embedding distance
- Evaluate 'aspects' of the agent's response in a reference-free manner using custom criteria

For a longer discussion of how to select an appropriate evaluator for your use case and how to create your own
custom evaluators, please refer to the [LangSmith documentation](https://docs.smith.langchain.com/).
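
For intuition about the embedding distance metric mentioned above, here is a minimal sketch of cosine distance between two vectors, assuming the usual definition of one minus cosine similarity. The vectors are made-up stand-ins for embeddings, and the function is an illustration rather than the evaluator's actual code.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


# Made-up vectors standing in for the embeddings of a prediction and a reference.
prediction_vec = [1.0, 0.0]
reference_vec = [1.0, 1.0]
print(round(cosine_distance(prediction_vec, reference_vec), 4))
```

Lower values indicate that the prediction and the reference are closer in embedding space.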


In [12]:
from langchain.evaluation import EvaluatorType
from langchain.smith import RunEvalConfig

evaluation_config = RunEvalConfig(
    # Evaluators can either be an evaluator type (e.g., "qa", "criteria", "embedding_distance", etc.) or a configuration for that evaluator
    evaluators=[
        # Measures whether a QA response is "Correct", based on a reference answer
        # You can also select via the raw string "qa"
        EvaluatorType.QA,
        # Measure the embedding distance between the output and the reference answer
        # Equivalent to: RunEvalConfig.EmbeddingDistance(embeddings=OpenAIEmbeddings())
        EvaluatorType.EMBEDDING_DISTANCE,
        # Grade whether the output satisfies the stated criteria. You can select a default one such as "helpfulness" or provide your own.
        RunEvalConfig.LabeledCriteria("helpfulness"),
        # Both the Criteria and LabeledCriteria evaluators can be configured with a dictionary of custom criteria.
        RunEvalConfig.Criteria(
            {
                "fifth-grader-score": "Do you have to be smarter than a fifth grader to answer this question?"
            }
        ),
    ],
    # You can add custom StringEvaluator or RunEvaluator objects here as well, which will automatically be
    # applied to each prediction. Check out the docs for examples.
    custom_evaluators=[],
)

### 4. Run the agent and evaluators

Use the [run_on_dataset](https://api.python.langchain.com/en/latest/smith/langchain.smith.evaluation.runner_utils.run_on_dataset.html#langchain.smith.evaluation.runner_utils.run_on_dataset) (or asynchronous [arun_on_dataset](https://api.python.langchain.com/en/latest/smith/langchain.smith.evaluation.runner_utils.arun_on_dataset.html#langchain.smith.evaluation.runner_utils.arun_on_dataset)) function to evaluate your model. This will:
1. Fetch example rows from the specified dataset.
2. Run your agent (or any custom function) on each example.
3. Apply evaluators to the resulting run traces and corresponding reference examples to generate automated feedback.

The results will be visible in the LangSmith app.

In [13]:
from langchain.smith import (
    arun_on_dataset,
    run_on_dataset, 
)

chain_results = run_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=agent_factory,
    evaluation=evaluation_config,
    verbose=True,
    client=client,
    project_name=f"runnable-agent-test-{unique_id}",
    tags=["testing-notebook"],  # Optional, adds a tag to the resulting chain runs
)

# Sometimes, the agent will error due to parsing issues, incompatible tool inputs, etc.
# These are logged as warnings here and captured as errors in the tracing UI.

View the evaluation results for project 'openai-functions-agent-test-e36e3f91' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/projects/p/4219e049-472b-4dce-ac02-1ba390379fb5
[------------------------------------------------->] 8/8
 Eval quantiles:
                               0.25       0.5      0.75      mean      mode
correctness                0.000000  0.000000  1.000000  0.375000  0.000000
embedding_cosine_distance  0.139897  0.166687  0.184377  0.154261  0.072665
helpfulness                0.000000  0.500000  1.000000  0.500000  0.000000
fifth-grader-score         0.000000  0.000000  1.000000  0.375000  0.000000


### Review the test results

You can review the test results in the tracing UI by clicking the URL in the output above, or by navigating to the "Testing & Datasets" page in LangSmith and opening the **`langsmith-docs-dataset-{unique_id}`** dataset.

This will show the new runs and the feedback logged from the selected evaluators. You can also explore a summary of the results in tabular format below.

In [14]:
chain_results.to_dataframe()

| | correctness | embedding_cosine_distance | helpfulness | fifth-grader-score | input | output | reference |
|---|---|---|---|---|---|---|---|
| 1e9fbacd-1655-43fd-88ea-5837ccb74394 | 0 | 0.155596 | 0 | 0 | {'input': 'How do I search for a run with meta... | {'input': 'How do I search for a run with meta... | {'output': 'You can search for runs with speci... |
| a0691ae7-2fe5-4f64-878f-a67f68f68416 | 0 | 0.182275 | 0 | 0 | {'input': 'How do I move my project between or... | {'input': 'How do I move my project between or... | {'output': 'LangSmith doesn't directly support... |
| d4bdcd32-0ac4-4eac-b913-7c46ba37b879 | 1 | 0.154744 | 1 | 0 | {'input': 'Why do I have to set environment va... | {'input': 'Why do I have to set environment va... | {'output': 'Environment variables can tell you... |
| 6429c1e0-11c4-4d0b-b277-b4ba7e43b7bf | 0 | 0.204993 | 0 | 1 | {'input': 'Can I trace my Llama V2 llm?'} | {'input': 'Can I trace my Llama V2 llm?', 'out... | {'output': 'So long as you are using one of La... |
| 64677fff-5c59-40f3-b7c4-95df1c2411a7 | 1 | 0.072665 | 1 | 1 | {'input': 'How do I use a traceable decorator?'} | {'input': 'How do I use a traceable decorator?... | {'output': 'The traceable decorator is availab... |
| 015f3a00-0587-41ee-b6bd-d195c50a96d9 | 1 | 0.095354 | 1 | 0 | {'input': 'What's a langsmith dataset?'} | {'input': 'What's a langsmith dataset?', 'outp... | {'output': 'A LangSmith dataset is a collectio... |
| cd1c6874-ff6f-4e55-92ad-1f80d8037f09 | 0 | 0.190684 | 1 | 0 | {'input': 'How might I query for all runs in a... | {'input': 'How might I query for all runs in a... | {'output': 'client.list_runs(project_name='my-... |
| d436add4-3eb0-417c-9998-59dbc53b70a8 | 0 | 0.177778 | 0 | 1 | {'input': 'What is LangChain?'} | {'input': 'What is LangChain?', 'output': 'I'm... | {'output': 'LangChain is an open-source framew... |
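
The means reported in the "Eval quantiles" summary can be reproduced from the per-example scores in the table above. The sketch below copies those scores by hand into plain lists, in row order, and averages them with the standard library. In practice you would likely aggregate the DataFrame directly; the helper name `summarize` is an assumption for illustration.

```python
from statistics import mean

# Scores copied from the results table above, one entry per row.
scores = {
    "correctness": [0, 0, 1, 0, 1, 1, 0, 0],
    "embedding_cosine_distance": [
        0.155596, 0.182275, 0.154744, 0.204993,
        0.072665, 0.095354, 0.190684, 0.177778,
    ],
    "helpfulness": [0, 0, 1, 0, 1, 1, 1, 0],
    "fifth-grader-score": [0, 0, 0, 1, 1, 0, 0, 1],
}


def summarize(score_table):
    """Return the mean score for each feedback key."""
    return {name: mean(values) for name, values in score_table.items()}


summary = summarize(scores)
for name, value in summary.items():
    print(f"{name}: {value:.6f}")
```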


## Exporting datasets and runs

LangSmith lets you export data to common formats such as CSV or JSONL directly in the web app. You can also use the client to fetch runs for further analysis, to store in your own database, or to share with others. Let's fetch the run traces from the evaluation run.

**Note: It may be a few moments before all the runs are accessible.**

In [15]:
runs = client.list_runs(project_name=chain_results["project_name"], execution_order=1)

In [16]:
# After some time, these will be populated.
client.read_project(project_name=chain_results["project_name"]).feedback_stats
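
If you want to store fetched runs yourself, one option is to write them to a JSONL file, with one JSON object per line. The sketch below works on plain dictionaries with made-up fields; how you convert run objects into dictionaries is left out, and the helper names `write_jsonl` and `read_jsonl` are hypothetical.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write each record as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # default=str is a fallback for values such as UUIDs or datetimes.
            f.write(json.dumps(record, default=str) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Stand-in records; the field names here are illustrative only.
records = [
    {"name": "example-run", "inputs": {"input": "What is LangChain?"}, "error": None},
    {"name": "example-run", "inputs": {"input": "What's a langsmith dataset?"}, "error": None},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    export_path = os.path.join(tmp_dir, "runs.jsonl")
    write_jsonl(records, export_path)
    restored = read_jsonl(export_path)

print(len(restored), "records restored")
```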

## Conclusion

Congratulations! You have successfully traced and evaluated an agent using LangSmith!

This was a quick guide to get started, but there are many more ways to use LangSmith to speed up your developer flow and produce better results.

For more information on how you can get the most out of LangSmith, check out the [LangSmith documentation](https://docs.smith.langchain.com/), and please reach out with questions, feature requests, or feedback at [support@langchain.dev](mailto:support@langchain.dev).