# LangSmith Overview with AI Makerspace

Today we'll be looking at an amazing tool:

[LangSmith](https://docs.smith.langchain.com/)!

This tool will help us monitor, test, debug, and evaluate our LangChain applications - and more!


## Dependencies and OpenAI API Key

We'll be using OpenAI's suite of models today to embed our documents and generate answers for a simple RAG system built on top of LangChain's blogs!

In [None]:
!pip install langchain_core langchain_openai langchain_community langsmith openai tiktoken cohere -qU

In [None]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

Enter your OpenAI API Key:··········


## Basic RAG Chain

Now we'll set up our basic RAG chain. First up, we need a model!

### OpenAI Model


We'll use OpenAI's `gpt-3.5-turbo` model for generation, which leaves room to use a stronger model as the evaluator later!

Notice that we can tag our resources - this will help us keep track of which resources were used where later on!

In [None]:
from langchain_openai.chat_models import ChatOpenAI

base_llm = ChatOpenAI(model="gpt-3.5-turbo", tags=["base_llm"])

#### Asyncio Bug Handling

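This is necessary for Colab: the notebook already runs an event loop, and `nest_asyncio` lets the async code we call later run inside it.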

In [None]:
import nest_asyncio
nest_asyncio.apply()

### SiteMap Loader

We'll use a `SitemapLoader` to scrape the LangChain blogs.

In [None]:
from langchain_community.document_loaders import SitemapLoader

documents = SitemapLoader(web_path="https://blog.langchain.dev/sitemap-posts.xml").load()

Fetching pages: 100%|##########| 213/213 [00:08<00:00, 24.48it/s]


In [None]:
documents[0].metadata["source"]

'https://blog.langchain.dev/documentation-refresh-for-langchain-v0-2/'

### RecursiveCharacterTextSplitter

We're going to use a relatively naive text splitting strategy today!

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

split_documents = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size = 500,
    chunk_overlap = 20
).split_documents(documents)
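
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a plain sliding window over a token list. This is a simplification: `RecursiveCharacterTextSplitter` also tries to split on natural separators first, which this toy version does not attempt. The `chunk_tokens` helper and the toy tokens are hypothetical names made up for illustration.

```python
def chunk_tokens(tokens, chunk_size=500, chunk_overlap=20):
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


# Toy "tokens" stand in for real tiktoken tokens in this sketch.
toy_tokens = [f"tok{i}" for i in range(1200)]
toy_chunks = chunk_tokens(toy_tokens, chunk_size=500, chunk_overlap=20)
print([len(chunk) for chunk in toy_chunks])
```

Each window starts 480 tokens after the previous one, so the last 20 tokens of one chunk are repeated at the start of the next. That small overlap helps keep a sentence that straddles a boundary retrievable from at least one chunk.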

In [None]:
len(split_documents)

1498

### Embeddings

We'll be leveraging OpenAI's [text-embedding-3-small](https://platform.openai.com/docs/guides/embeddings/how-to-get-embeddings) today!

In [None]:
from langchain_openai.embeddings import OpenAIEmbeddings

base_embeddings_model = OpenAIEmbeddings(model="text-embedding-3-small")

### FAISS VectorStore Retriever

Now we can use a FAISS VectorStore to embed and store our documents and then convert it to a retriever so it can be used in our chain!

In [None]:
!pip install faiss-cpu -qU

In [None]:
from langchain_community.vectorstores import FAISS

vectorstore = FAISS.from_documents(split_documents, base_embeddings_model)

In [None]:
base_retriever = vectorstore.as_retriever()
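
Conceptually, the retriever embeds the question and returns the stored chunks whose vectors are closest to it. Here is a minimal pure-Python sketch of that idea. Cosine similarity, the three-dimensional toy vectors, and the `top_k` helper are all assumptions chosen for illustration; the actual FAISS index may use a different distance metric and a much faster search.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_chunks, k=4):
    """Return the texts of the k chunks most similar to the query vector."""
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for text, vector in indexed_chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hypothetical (text, embedding) pairs standing in for the vector store.
toy_index = [
    ("chunk about agents", [0.9, 0.1, 0.0]),
    ("chunk about tracing", [0.1, 0.9, 0.0]),
    ("chunk about pricing", [0.0, 0.1, 0.9]),
]
toy_query = [0.8, 0.2, 0.0]
toy_hits = top_k(toy_query, toy_index, k=2)
print(toy_hits)
```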

### Prompt Template

All we have left is a prompt template, which we'll create here!

In [None]:
from langchain_core.prompts import ChatPromptTemplate

base_rag_prompt_template = """\
Using the provided context, please answer the user's question. If you don't know the answer based on the context, say you don't know.

Context:
{context}

Question:
{question}
"""

base_rag_prompt = ChatPromptTemplate.from_template(base_rag_prompt_template)

### LCEL Chain

Now that we have:

- Embeddings Model
- Generation Model
- Retriever
- Prompt

We're ready to build our LCEL chain!

Keep in mind that we're returning our source documents with our queries - while this isn't necessary, it's a great thing to get into the habit of doing.

In [None]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.output_parsers import StrOutputParser

base_rag_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the base_retriever
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    # RunnablePassthrough.assign passes every key from the previous step through unchanged
    # and (re)assigns "context" to the value already stored under the "context" key
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": base_rag_prompt | base_llm | StrOutputParser(), "context": itemgetter("context")}
)
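
If the pipe syntax feels opaque, it can help to trace the same three steps as ordinary function calls. The sketch below is not LCEL; it only mirrors the data flow using plain dictionaries. `stub_retriever`, `stub_llm`, and `run_chain` are hypothetical stand-ins for the retriever, the model plus output parser, and the chain itself.

```python
from operator import itemgetter


def stub_retriever(question):
    # Hypothetical stand-in for base_retriever: returns fixed "documents".
    return [f"doc mentioning: {question}"]


def stub_llm(prompt):
    # Hypothetical stand-in for base_llm | StrOutputParser().
    return f"answer derived from a prompt of {len(prompt)} characters"


PROMPT = "Context:\n{context}\n\nQuestion:\n{question}\n"


def run_chain(inputs):
    # Step 1: build {"context", "question"} from the incoming dict.
    question = itemgetter("question")(inputs)
    state = {"context": stub_retriever(question), "question": question}
    # Step 2: keep every key and (re)assign "context", like RunnablePassthrough.assign.
    state = {**state, "context": itemgetter("context")(state)}
    # Step 3: format the prompt, call the model, and carry the context along.
    prompt = PROMPT.format(**state)
    return {"response": stub_llm(prompt), "context": state["context"]}


result = run_chain({"question": "What is LangSmith?"})
print(sorted(result.keys()))
```

The output dictionary has the same two keys as the real chain, which is what lets us read `["response"]` or `["context"]` from the result later on.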

Let's test it out!

In [None]:
base_rag_chain.invoke({"question" : "What is a good way to evaluate agents?"})

{'response': 'A good way to evaluate agents is by testing their capabilities in various tasks that are prerequisites for common agentic workflows, such as planning, task decomposition, function calling, and the ability to override pre-trained biases when needed. Additionally, measuring overall performance across all tasks and analyzing specific key findings can help assess the effectiveness of agents.',
 'context': [Document(page_content="that there are some downsides/dangers:With agents, they can occasionally spiral out of control. That's why we've added controls to our AgentExecutor to cap them at a certain max amount of steps. It's also worth noting that this is a VERY focused agent, in that it's only given one tool (and a pretty simple tool at that). In general, the fewer (and simpler) tools an agent is given, the more likely it is to be reliable.By remembering ai <-> tool interactions, that can hog the context window occasionally. That's why we've included a flag to disable that t

## LangSmith

Now that we have a chain - we're ready to get started with LangSmith!

We're going to go ahead and use the following `env` variables to get our Colab notebook set up to start reporting.

If all you needed was simple monitoring - this is all you would need to do!

In [None]:
from uuid import uuid4

unique_id = uuid4().hex[0:8]

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"LangSmith - {unique_id}"

### LangSmith API

In order to use LangSmith, you will need a beta key. You can join the queue through the `Beta Sign Up` button on LangSmith's homepage!

Join [here](https://www.langchain.com/langsmith)

In [None]:
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass('Enter your LangSmith API key: ')

Enter your LangSmith API key: ··········


Let's test out our first generation!

In [None]:
base_rag_chain.invoke({"question" : "What is LangSmith?"}, {"tags" : ["Demo Run"]})['response']

'LangSmith is a unified platform for debugging, testing, evaluating, and monitoring LLM (Large Language Models) applications. It provides features such as utilizing existing datasets, creating new datasets, running them against chains, visual feedback on outputs, accuracy metrics, evaluation of LLM runs, monitoring AI processes, and more.'

## Create Testing Dataset

Now we can create a dataset from some user-defined questions, storing the context retrieved for each question as its "ground truth" reference.

> NOTE: There are many different ways you can approach this specific task - generating ground truth answers with AI, using human experts to generate golden datasets, and more!

In [None]:
from langsmith import Client

test_inputs = [
    "What is LangSmith?",
    "What is LangServe?",
    "How could I benchmark RAG on tables?",
    "What was exciting about LangChain's first birthday?",
    "What features were released for LangChain on August 7th?",
    "What is a conversational retrieval agent?"
]

client = Client()

dataset_name = "langsmith-demo-dataset-v1"

dataset = client.create_dataset(
    dataset_name=dataset_name, description="LangChain Blog Test Questions"
)

for question in test_inputs:
    client.create_example(
        inputs={"question": question},
        # The retrieved context is stored as the reference "answer" for this question
        outputs={"answer": base_rag_chain.invoke({"question": question})["context"]},
        dataset_id=dataset.id,
    )

### Evaluation

Now we can run the evaluation!

In [None]:
from langchain.smith import RunEvalConfig, run_on_dataset

eval_llm = ChatOpenAI(model="gpt-4o", temperature=0)

eval_config = RunEvalConfig(
  evaluators=[
    RunEvalConfig.CoTQA(llm=eval_llm, prediction_key="response"),
    RunEvalConfig.Criteria("harmfulness", prediction_key="response"),
  ]
)

base_rag_base_run = run_on_dataset(
    client=client,
    dataset_name=dataset_name,
    llm_or_chain_factory=base_rag_chain,
    evaluation=eval_config,
    verbose=True,
)

View the evaluation results for project 'whispered-side-96' at:
https://smith.langchain.com/o/340cd80b-3296-5752-9a9e-58582118073a/datasets/61bdc417-2079-44d4-916c-d9b553fa6f32/compare?selectedSessions=318afa52-d514-4ccb-8df8-84844d1760bb

View all tests for Dataset langsmith-demo-dataset-v1 at:
https://smith.langchain.com/o/340cd80b-3296-5752-9a9e-58582118073a/datasets/61bdc417-2079-44d4-916c-d9b553fa6f32
[------------------------------------------------->] 6/6
 Experiment Results:
        feedback.Contextual Accuracy  feedback.harmfulness error  execution_time                                run_id
count                           6.00                  6.00     0            6.00                                     6
unique                           NaN                   NaN     0             NaN                                     6
top                              NaN                   NaN   NaN             NaN  119add34-9617-4904-bf33-71f75c40c697
freq                             NaN

## Adding Reranking

We'll add reranking to our RAG application to test a claim made by [Cohere](https://cohere.com/rerank):

`Improve search performance with a single line of code`

We'll put that to the test today!

In [None]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Enter your Cohere API Key:")

Enter your Cohere API Key:··········


In [None]:
base_retriever_expander = vectorstore.as_retriever(
    search_kwargs={"k" : 10}
)

In [None]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

reranker = CohereRerank()
rerank_retriever = ContextualCompressionRetriever(
    base_compressor=reranker, base_retriever=base_retriever_expander
)
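
The pattern here is "retrieve wide, then rerank narrow": the first stage returns more candidates than we need (`k=10` above), and a second scorer keeps only the best few. Here is a minimal sketch of that pattern. The word-overlap scorer is a hypothetical placeholder and is not how Cohere's reranker scores documents, and `top_n` is an assumed parameter name for this toy helper.

```python
def word_overlap_score(query, text):
    """Hypothetical relevance score: fraction of query words found in the text."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)


def retrieve_then_rerank(query, candidates, top_n=3):
    """Re-order first-stage candidates by the second-stage score, keep top_n."""
    ranked = sorted(
        candidates,
        key=lambda text: word_overlap_score(query, text),
        reverse=True,
    )
    return ranked[:top_n]


# Toy candidates standing in for the chunks returned by the wide retriever.
toy_candidates = [
    "pricing updates for the hosted platform",
    "how to evaluate agents with test datasets",
    "evaluate agents",
    "release notes",
]
toy_top = retrieve_then_rerank("how to evaluate agents", toy_candidates, top_n=2)
print(toy_top)
```

The design trade-off is that the first stage can afford to be cheap and generous, since the more expensive second-stage scorer only sees a handful of candidates.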


### Recreating our Chain with Reranker

Now we can recreate our chain using the reranker.

In [None]:
rerank_rag_chain = (
    {"context": itemgetter("question") | rerank_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": base_rag_prompt | base_llm | StrOutputParser(), "context": itemgetter("context")}
)

rerank_rag_chain = rerank_rag_chain.with_config({"tags" : ["cohere-rerank"]})

### Improved Evaluation

Now we can leverage the full suite of LangSmith's evaluation to evaluate our chains on multiple metrics, including custom metrics!

In [None]:
eval_config = RunEvalConfig(
  evaluators=[
    RunEvalConfig.CoTQA(llm=eval_llm, prediction_key="response"),
    RunEvalConfig.Criteria("harmfulness", prediction_key="response"),
    RunEvalConfig.LabeledCriteria(
        {
            "helpfulness" : (
                "Is this submission helpful to the user, "
                "taking into account the correct reference answer?"
            )
        },
        prediction_key="response"
    ),
    RunEvalConfig.LabeledCriteria(
        {
            "litness" : (
                "Is this submission lit, dope, or cool?"
            )
        },
        prediction_key="response"
    ),
    RunEvalConfig.LabeledCriteria("conciseness", prediction_key="response"),
    RunEvalConfig.LabeledCriteria("coherence", prediction_key="response"),
    RunEvalConfig.LabeledCriteria("relevance", prediction_key="response")
  ]
)

### Running Eval on Each Chain

Now we can evaluate each of our chains!

In [None]:
base_chain_results = run_on_dataset(
    client=client,
    dataset_name=dataset_name,
    llm_or_chain_factory=base_rag_chain,
    evaluation=eval_config,
    verbose=True,
)

View the evaluation results for project 'stupendous-test-17' at:
https://smith.langchain.com/o/340cd80b-3296-5752-9a9e-58582118073a/datasets/61bdc417-2079-44d4-916c-d9b553fa6f32/compare?selectedSessions=ba5a6170-232f-4e8b-9dab-f3a19b5b8e5e

View all tests for Dataset langsmith-demo-dataset-v1 at:
https://smith.langchain.com/o/340cd80b-3296-5752-9a9e-58582118073a/datasets/61bdc417-2079-44d4-916c-d9b553fa6f32
[------------------------------------------------->] 6/6
 Experiment Results:
        feedback.Contextual Accuracy  feedback.harmfulness  feedback.helpfulness  feedback.litness  feedback.conciseness  feedback.coherence  feedback.relevance error  execution_time                                run_id
count                           6.00                  6.00                  6.00              6.00                  6.00                6.00                6.00     0            6.00                                     6
unique                           NaN                   NaN           

In [None]:
rerank_chain_results = run_on_dataset(
    client=client,
    dataset_name=dataset_name,
    llm_or_chain_factory=rerank_rag_chain,
    evaluation=eval_config,
    verbose=True,
)

View the evaluation results for project 'left-measure-57' at:
https://smith.langchain.com/o/340cd80b-3296-5752-9a9e-58582118073a/datasets/61bdc417-2079-44d4-916c-d9b553fa6f32/compare?selectedSessions=f39cb22d-1974-438d-ae18-6d63fede9745

View all tests for Dataset langsmith-demo-dataset-v1 at:
https://smith.langchain.com/o/340cd80b-3296-5752-9a9e-58582118073a/datasets/61bdc417-2079-44d4-916c-d9b553fa6f32
[------------------------------------------------->] 6/6
 Experiment Results:
        feedback.Contextual Accuracy  feedback.harmfulness  feedback.helpfulness  feedback.litness  feedback.conciseness  feedback.coherence  feedback.relevance error  execution_time                                run_id
count                           6.00                  6.00                  6.00              6.00                  6.00                6.00                6.00     0            6.00                                     6
unique                           NaN                   NaN              