# LangSmith Demo
    By Aaron Roberts

* LangSmith is a platform for building production-grade LLM applications. URL: https://smith.langchain.com/
* It lets you debug, test, evaluate, and monitor chains and intelligent agents built on any LLM framework
* It seamlessly integrates with LangChain, the go-to open source framework for building with LLMs.

    ** It is in private beta, so you need to join the waitlist to get access (I got an account almost instantly) **

##### Part of the presentation:

* How to integrate your chains and agents with LangSmith (basic debugging)
* Creating/Collecting Datasets
* Testing and Evaluations
* Human Evaluations
* Monitoring


### Getting Started
##### Set Environment Variables
To use LangSmith, all you need to do is set a few environment variables.

In [2]:
import os

#LangSmith API keys:
os.environ['LANGCHAIN_API_KEY'] = os.getenv('LANGCHAIN_API_KEY') # API key on LangSmith
os.environ['LANGCHAIN_TRACING_V2'] = 'true' # Enable Tracing
os.environ['LANGCHAIN_ENDPOINT'] = "https://api.smith.langchain.com" # Cloud endpoint
os.environ['LANGCHAIN_PROJECT'] = "LangSmith-Demo" # Project name

# Other API KEYS:
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY')
os.environ['SERPAPI_API_KEY'] = os.getenv('SERPAPI_API_KEY') # For internet searches
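
One caveat with the cell above: `os.environ[...] = os.getenv(...)` raises a `TypeError` when the variable is not already set, because `os.getenv` returns `None`. A minimal sketch of a friendlier check, assuming the three keys used in this demo are the ones you need (the helper name `missing_env_vars` is made up for illustration):

```python
import os

# Keys this demo relies on (assumption: adjust to the tools you actually load)
REQUIRED_VARS = ["LANGCHAIN_API_KEY", "OPENAI_API_KEY", "SERPAPI_API_KEY"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this before building the agent gives a readable message instead of a traceback.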

### Build a Basic Agent

In [3]:
from operator import itemgetter 

from langchain import hub # Prompt repository (LangChain Hub)
from langchain.agents import AgentExecutor, load_tools # Agent runtime with tool capabilities; helper that loads tools by name
from langchain.agents.format_scratchpad import format_to_openai_functions # Converts agent actions and tool outputs into function messages
from langchain.agents.output_parsers import OpenAIFunctionsAgentOutputParser # Parses a message into an agent action or the final output
from langchain.chat_models import ChatOpenAI # Basic chat API for OpenAI
from langchain.tools.render import format_tool_to_openai_function # Converts a tool into the OpenAI function format

llm = ChatOpenAI(temperature=0) # Creates an LLM connection (gpt-3.5-turbo by default)
tools = load_tools(["serpapi", "llm-math"], llm=llm) # Load the tools
llm_with_tools = llm.bind(functions=[format_tool_to_openai_function(t) for t in tools]) # Bind the tools as OpenAI functions

prompt = hub.pull("wfh/agent-lcel-prompt") # Pulls a prompt from LangChain Hub that works well for agents with tools


# Builds the agent using LCEL (LangChain Expression Language)
agent = (
    {
        "input": lambda x: x["input"],
        "agent_scratchpad": lambda x: format_to_openai_functions(
            x["intermediate_steps"]
        ),
    }
    | prompt
    | llm_with_tools
    | OpenAIFunctionsAgentOutputParser()
)

# Build the Agent RunTime
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    handle_parsing_errors=True,
    return_intermediate_steps=True
)
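
If the `|` syntax in the agent definition is new to you, here is a rough plain-Python sketch of the idea: each stage takes the previous stage's output. This is only an illustration, not how LangChain implements it; the `Step` class and the three stand-in stages are hypothetical.

```python
class Step:
    """Hypothetical stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # a | b builds a new Step that feeds a's output into b
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)


# Stand-ins for the stages used in the agent (all hypothetical)
prepare = Step(
    lambda x: {"input": x["input"], "agent_scratchpad": list(x["intermediate_steps"])}
)
render_prompt = Step(
    lambda d: f"Question: {d['input']} | steps so far: {len(d['agent_scratchpad'])}"
)
fake_llm = Step(lambda text: text.upper())

pipeline = prepare | render_prompt | fake_llm
print(pipeline.invoke({"input": "how many cats?", "intermediate_steps": []}))
```

The real agent follows the same shape: a mapping step, then the prompt, then the model, then the output parser.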



### Create Some Inputs for the Agent

In [6]:
inputs = [
    "What is the number of cats in the US multiplied by the number of cats in China",
    "How much is all of the cats in the world multiplied by the number of legs that a cat has?",
    "How many cats are there in the US divided by the number of US citizens?",
    "How much larger is an average cat to an average dog?",
    "Are there more dogs or cats in the US?",
    "What is the size ratio from the largest cat to the smallest cat?",
    "How many cats does it take to fill a 1500 square foot house?",
    "How many more hours per day do cats sleep than the average human?",
    "How many cat themed Gods are there compared to dog themed Gods?",
    "Is the largest rat bigger than the smallest cat?",
    "How many cats are there in the United States to the power of 10?"
]

### Batch the inputs and send them to LangSmith

In [7]:
results = agent_executor.batch([{"input": x} for x in inputs], return_exceptions=True)
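
With `return_exceptions=True`, a failed input should come back as an exception object in the results list rather than stopping the whole batch. A small sketch for separating the two cases, using stand-in data instead of real agent output (the helper name `split_batch_results` is made up for illustration):

```python
def split_batch_results(questions, batch_results):
    """Pair each input with its result and separate successes from failures."""
    successes, failures = [], []
    for question, result in zip(questions, batch_results):
        if isinstance(result, Exception):
            failures.append((question, result))
        else:
            successes.append((question, result))
    return successes, failures


# Stand-in data shaped like a batch result (not real agent output)
demo_questions = ["question one", "question two"]
demo_results = [{"output": "an answer"}, ValueError("tool call failed")]

ok, failed = split_batch_results(demo_questions, demo_results)
print(f"{len(ok)} succeeded, {len(failed)} failed")
for question, error in failed:
    print(f"  {question!r}: {error}")
```

In the notebook you would pass `inputs` and `results` from the cells above.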

#### Test for single inputs

In [18]:
agent_executor.invoke({"input":"How many cats are there in the state of Texas to the power of 10?"})

{'input': 'How many cats are there in the state of Texas to the power of 10?',
 'output': 'There are 10 billion cats in the state of Texas to the power of 10.',
 'intermediate_steps': [(AgentActionMessageLog(tool='Calculator', tool_input='10^10', log='\nInvoking: `Calculator` with `10^10`\n\n\n', message_log=[AIMessageChunk(content='', additional_kwargs={'function_call': {'arguments': '{\n  "__arg1": "10^10"\n}', 'name': 'Calculator'}})]),
   'Answer: 10000000000')]}
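
The `intermediate_steps` list is worth a look even before opening LangSmith. In the output above the only step is a `Calculator` call on `10^10`, with no search step at all, which explains the odd answer. A sketch for listing the steps, using a stand-in object shaped like the output above (the helper name `summarize_steps` is an assumption):

```python
from types import SimpleNamespace


def summarize_steps(result):
    """Return (tool, tool_input, observation) for each intermediate step."""
    return [
        (action.tool, action.tool_input, observation)
        for action, observation in result.get("intermediate_steps", [])
    ]


# Stand-in built from the values shown in the output above
demo_result = {
    "input": "How many cats are there in the state of Texas to the power of 10?",
    "output": "There are 10 billion cats in the state of Texas to the power of 10.",
    "intermediate_steps": [
        (SimpleNamespace(tool="Calculator", tool_input="10^10"), "Answer: 10000000000")
    ],
}

for tool, tool_input, observation in summarize_steps(demo_result):
    print(f"{tool}({tool_input!r}) -> {observation}")
```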

## The LangChain Traces are sent to the Project View

##### You can view the different stats of the run.

* End state status (complete/error)
* Latency
* Tokens
* Aggregate Stats
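
The same kind of stats can be computed in code if you pull the runs yourself. A minimal sketch, assuming each run object exposes `start_time`, `end_time`, and `error` (the stand-in runs and the helper name `summarize_runs` are made up for illustration):

```python
from datetime import datetime, timedelta
from types import SimpleNamespace


def summarize_runs(runs):
    """Compute count, error rate, and mean latency (seconds) for run-like objects."""
    runs = list(runs)
    latencies = [
        (r.end_time - r.start_time).total_seconds()
        for r in runs
        if r.end_time is not None
    ]
    errors = sum(1 for r in runs if r.error)
    return {
        "count": len(runs),
        "error_rate": errors / len(runs) if runs else 0.0,
        "mean_latency_s": sum(latencies) / len(latencies) if latencies else None,
    }


# Stand-in runs (not real LangSmith data)
t0 = datetime(2023, 1, 1, 12, 0, 0)
demo_runs = [
    SimpleNamespace(start_time=t0, end_time=t0 + timedelta(seconds=2), error=None),
    SimpleNamespace(start_time=t0, end_time=t0 + timedelta(seconds=4), error="boom"),
]
print(summarize_runs(demo_runs))
```

With a real project, the function could be fed from something like `langsmith.Client().list_runs(project_name="LangSmith-Demo")`; check the attribute names against your installed SDK version.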

##### Tree View / Agent Path

* Step through the trace tree
* Keeps track of the tools that were called
* Walks through each step (Search/Math)
* Keeps track of the responses in the agent scratchpad
* Shows the reasoning behind the next steps
* Gives a breakdown of token usage/latency at each step


### LangSmith Playground

* Try a different model
* Change the prompt
* Inputs
* Temperature
* **You need to supply your API key; LangSmith calls the API on your behalf.**
* Easy way to test out the flow of the agent to see if you get different results.
* Does not change the code, but it is a good place to test things out.

### Easy to check where the error exists in the Agent Run

* Very useful way to check for problems.
* You can find the exact problem and adjust the code to include guardrails

### Sharing the Run (Collaborative Debugging)

* The run can be shared with anyone who has the link, whether or not they have a LangSmith account.
* Great way to get someone to help narrow down a problem.

## **Collecting examples**

* LangSmith makes it very easy to collect datasets
* Collecting datasets is otherwise hard (you need thousands to tens of thousands of examples)
* Datasets can be golden datasets (ideal responses)
* Or problem datasets (you might want to iterate over these to find ways to prevent the underlying issues)

### Collecting a Golden Dataset

* Filter for the successful runs.
* Check the box next to the runs that you found to be correct
* Create a new dataset

#### There are many ways to create datasets

* Manually
* Upload a CSV
* Do it programmatically
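
For the programmatic route, a rough sketch using the LangSmith client's `create_dataset` and `create_example` methods. The helper name `build_dataset` is an assumption, as are the `"input"` and `"output"` keys, which are chosen to match the agent's input and the `reference_key` used later in the evaluation config. Verify the call signatures against your installed `langsmith` version.

```python
def build_dataset(client, dataset_name, examples, description=""):
    """Create a dataset and add one example per (question, answer) pair.

    `client` is expected to be a langsmith.Client (or anything with the
    same create_dataset / create_example methods).
    """
    dataset = client.create_dataset(dataset_name=dataset_name, description=description)
    for question, answer in examples:
        client.create_example(
            inputs={"input": question},    # what the agent receives
            outputs={"output": answer},    # the reference (golden) answer
            dataset_id=dataset.id,
        )
    return dataset
```

With a real client this would be called as `build_dataset(langsmith.Client(), "my-dataset", pairs)`, where `pairs` is your own list of question/answer tuples.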

**You can share your dataset with your team**

Collecting datasets is the first thing you need to do before you can run any further evaluations.

## **Testing and Evaluation**

When putting your LLM into production, you need to test the model over a large number of prompts to be confident that it meets all of your standards before releasing it.

### Off the shelf evaluators

* Correctness
* Helpfulness
* Harmfulness

* You can build your own evaluators (e.g. check for SQL or JSON formatting)
* They do a lot of the hard work and help build confidence in the application
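
As an idea of what a custom formatting check could look like, here is a minimal sketch of a JSON validity scorer in plain Python. The function name and the shape of the returned dict (`key`, `score`, `comment`) are assumptions for illustration; wiring it into LangSmith as a custom evaluator is not shown here.

```python
import json


def json_format_score(prediction):
    """Score 1 if the prediction parses as JSON, else 0, with a short comment."""
    try:
        json.loads(prediction)
    except (TypeError, ValueError) as exc:
        # json.JSONDecodeError is a ValueError; TypeError covers non-string input
        return {"key": "valid_json", "score": 0, "comment": str(exc)}
    return {"key": "valid_json", "score": 1, "comment": ""}


print(json_format_score('{"cats": 3}'))
print(json_format_score("cats: three"))
```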

In [8]:
import langsmith

from langchain import chat_models, prompts, smith
from langchain.schema import output_parser


eval_config = smith.RunEvalConfig(
    evaluators=[
        "cot_qa",
        smith.RunEvalConfig.LabeledCriteria("harmfulness"),
        smith.RunEvalConfig.LabeledCriteria("helpfulness")
    ],
    reference_key="output",
    eval_llm=chat_models.ChatOpenAI(model="gpt-4", temperature=0) # Better at evaluating responses
)

client = langsmith.Client()
chain_results = client.run_on_dataset(
    dataset_name="demo-correct-outputs",
    llm_or_chain_factory=agent_executor,
    evaluation=eval_config,
    project_name="test-cot-harm-help-0.1",
    concurrency_level=5,
    tags=["gpt-3.5"] # Tags let you track versions/variants of each of your prompts
)


### Test Result Break Down

* You can see aggregate measures for each evaluation criterion in each test run
* The output is compared against the reference output from the dataset
* You can drill into the run for each evaluator (each one is another LLM run)
* Makes it much easier to add evaluations

#### Test Run Comparisons

* You can compare test runs side by side to see how their results differ.
* The reviewer can see exactly where they differ.

## **Human Evaluation**

#### Why would you need human evaluations?

* Evaluation criteria that are hard for a model to evaluate automatically
* You want a human to pick through a few examples out of thousands of evaluations to make sure the LLM grader is performing up to par.
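
For the spot-check case, one simple approach is to draw a random sample of runs for a human to grade. A sketch under the assumption that you have a list of run IDs on hand (the helper name `sample_for_review` and the ID format are made up):

```python
import random


def sample_for_review(run_ids, k=20, seed=0):
    """Pick up to k run IDs at random; a fixed seed makes the sample reproducible."""
    run_ids = list(run_ids)
    rng = random.Random(seed)
    return rng.sample(run_ids, min(k, len(run_ids)))


# Stand-in IDs (not real LangSmith run IDs)
all_ids = [f"run-{i}" for i in range(1000)]
picked = sample_for_review(all_ids, k=5)
print(picked)
```

A random sample avoids only reviewing the runs that happen to sit at the top of the list.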


### How to start a human evaluation queue

* Go to your test run
* Select the test you want to evaluate
* Filter if needed
* Add to the annotation queue
* Go through each of the examples and provide a grade
* You can create new tags for evaluation, such as humor or creativity

## **Monitoring**

This is a way to monitor LLM applications that are currently in production.

This is a live LLM application in production that searches over the LangChain documentation:
https://chat.langchain.com/

* Check volume: trace count, LLM call count (how active the application is)
* Success Rates
* Latency
* Number of calls per trace
* Tokens
* Streaming metrics (how long until the first token is streamed back), which show how well the user is being served
* You can spot failures, click in, and see where the error happened. (This is a very helpful way to monitor the agent in production in real time.)

* The aggregated feedback is posted up top
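
To make the streaming metric concrete, here is a small sketch that measures the time until the first chunk arrives from any iterator. The `fake_stream` generator is a stand-in for a streaming LLM response, and the helper name is an assumption; LangSmith records this for you, so this is only to show what the number means.

```python
import time


def time_to_first_chunk(stream):
    """Return (seconds until the first chunk, that chunk), or (None, None) if empty."""
    start = time.perf_counter()
    for chunk in stream:
        return time.perf_counter() - start, chunk
    return None, None


def fake_stream():
    # Stand-in for a streaming LLM response
    time.sleep(0.05)
    yield "Hello"
    yield " world"


delay, first = time_to_first_chunk(fake_stream())
print(f"first chunk {first!r} after {delay:.3f}s")
```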

## **Conclusion**

* LangSmith is a great tool for debugging, evaluating, testing, and monitoring LLM applications at any point during the production cycle.
* Rich ecosystem with a ton of visibility and support from the LangChain team.
* Very powerful tool for anyone who hopes to ship an LLM application and wants to be confident that the tools/agent are performing to standard.

**Thank you for listening to my presentation**
