# LangSmith and Evaluation Overview with AI Makerspace

Today we'll be looking at an amazing tool:

[LangSmith](https://docs.smith.langchain.com/)!

This tool will help us monitor, test, debug, and evaluate our LangChain applications - and more!

We'll also be looking at a few Advanced Retrieval techniques along the way - and evaluate them using LangSmith!

✋BREAKOUT ROOM #2:
- Task 1: Dependencies and OpenAI API Key
- Task 2: LCEL RAG Chain
- Task 3: Setting Up LangSmith
- Task 4: Examining the Trace in LangSmith!
- Task 5: Create Testing Dataset
- Task 6: Evaluation

## Task 1: Dependencies and OpenAI API Key

We'll be using OpenAI's suite of models today to help us generate and embed our documents for a simple RAG system built on top of LangChain's blogs!

In [1]:
!pip install langchain_core langchain_openai langchain_community langchain-qdrant qdrant-client langsmith openai tiktoken cohere lxml -qU

In [2]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

#### Asyncio Bug Handling

This is necessary for Colab.

In [3]:
import nest_asyncio
nest_asyncio.apply()

## Task 2: Create a Simple RAG Application Using Qdrant, OpenAI, and LCEL

Now that we have a grasp on how LCEL works, and how we can use LangChain and Hugging Face to interact with our data - let's step it up a notch and incorporate Qdrant!

## LangChain Powered RAG

First and foremost, LangChain provides a convenient way to store our chunks and their embeddings.

It's called a `VectorStore`!

We'll be using Qdrant as our `VectorStore` today. You can read more about it [here](https://qdrant.tech/documentation/).

Think of a `VectorStore` as a smart way to house your chunks and their associated embedding vectors. The implementation of the `VectorStore` also allows for smarter and more efficient search of our embedding vectors - as the method we used above would not scale well as we got into the millions of chunks.

Otherwise, the process remains relatively similar under the hood!
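
To make the idea concrete, here is a minimal sketch of what a vector store does conceptually: it keeps (text, vector) pairs and returns the stored texts whose vectors are closest to a query vector. `TinyVectorStore` and the hand-written 3-dimensional vectors are hypothetical and for illustration only. This brute-force scan is not how Qdrant works internally; a real store relies on an index so it does not have to compare against every vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class TinyVectorStore:
    """Hypothetical toy store: brute-force search over (text, vector) pairs."""

    def __init__(self):
        self._items = []

    def add(self, text, vector):
        self._items.append((text, vector))

    def search(self, query_vector, k=2):
        # Score every stored vector against the query, highest similarity first.
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(query_vector, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]

store = TinyVectorStore()
store.add("LangSmith tracing", [0.9, 0.1, 0.0])
store.add("Qdrant indexing", [0.1, 0.9, 0.0])
store.add("Swallow airspeed", [0.0, 0.1, 0.9])

top_hits = store.search([0.8, 0.2, 0.0], k=2)
print(top_hits)
```

The brute-force `search` costs one similarity computation per stored chunk, which is exactly the part that stops scaling once the collection grows into the millions.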

We'll use a `SitemapLoader` to scrape the LangChain blogs - which will serve as our data for today!

### Data Collection

We'll be leveraging the `SitemapLoader` to load the LangChain blog posts directly from the web!

In [4]:
from langchain_community.document_loaders import SitemapLoader

documents = SitemapLoader(web_path="https://blog.langchain.dev/sitemap-posts.xml").load()

USER_AGENT environment variable not set, consider setting it to identify your requests.
Fetching pages: 100%|##########| 220/220 [00:05<00:00, 43.98it/s]


### Chunking Our Documents

Let's do the same process as we did before with our `RecursiveCharacterTextSplitter` - but this time we'll use 500 characters as our max chunk size, since `length_function = len` counts characters rather than tokens!

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 0,
    length_function = len,
)

split_chunks = text_splitter.split_documents(documents)

In [6]:
len(split_chunks)

4821

Alright, now we have 4,821 chunks, each at most 500 characters long.

Let's verify the process worked as intended by checking our max document length.

In [7]:
max_chunk_length = 0

for chunk in split_chunks:
  max_chunk_length = max(max_chunk_length, len(chunk.page_content))

print(max_chunk_length)

499


Perfect! Now we can carry on to creating and storing our embeddings.
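
The check above works because `length_function = len` measures chunk size in characters. Here is a minimal sketch of the same check, using a fixed-width splitter as a simplified stand-in for `RecursiveCharacterTextSplitter` (which additionally tries to break on separators such as paragraphs and sentences). `split_by_characters` and the sample text are hypothetical.

```python
def split_by_characters(text, chunk_size=500, chunk_overlap=0):
    """Hypothetical fixed-width splitter: windows of at most chunk_size characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

# A repeated sample sentence stands in for a scraped blog post.
sample_text = "LangSmith helps you trace and evaluate chains. " * 40
chunks = split_by_characters(sample_text, chunk_size=500)

# Same check as the notebook cell: no chunk may exceed the configured size.
max_chunk_length = max(len(chunk) for chunk in chunks)
print(len(chunks), max_chunk_length)
```

With zero overlap, joining the chunks back together reproduces the original text, which is a handy sanity check when experimenting with chunk sizes.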

### Embeddings and Vector Storage

We'll use the `text-embedding-3-small` embedding model again - and `Qdrant` to store all our embedding vectors for easy retrieval later!

In [8]:
from langchain_community.vectorstores import Qdrant
from langchain_openai.embeddings import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

qdrant_vectorstore = Qdrant.from_documents(
    documents=split_chunks,
    embedding=embedding_model,
    location=":memory:"
)

Now let's set up our retriever, just as we saw before, but this time using LangChain's simple `as_retriever()` method!

In [9]:
qdrant_retriever = qdrant_vectorstore.as_retriever()

#### Back to the Flow

We're ready to move to the next step!

### Setting up our RAG

We'll use the LCEL we touched on earlier to create a RAG chain.

Let's think through each part:

1. First we need to retrieve context
2. We need to pipe that context to our model
3. We need to parse that output

Let's start by setting up our prompt again, just so it's fresh in our minds!

#### 🏗️ Activity #2:

Complete the prompt so that your RAG application answers queries based on the context provided, but *does not* answer queries if the context is unrelated to the query.

In [10]:
from langchain.prompts import ChatPromptTemplate

base_rag_prompt_template = """\
You are a helpful assistant that can answer questions related to the provided context. Respond with "I don't have that information" if the question is outside the context.

Context:
{context}

Question:
{question}
"""

base_rag_prompt = ChatPromptTemplate.from_template(base_rag_prompt_template)

We'll set our Generator - `gpt-4o-mini` in this case - below!

In [11]:
from langchain_openai.chat_models import ChatOpenAI

base_llm = ChatOpenAI(model="gpt-4o-mini", tags=["base_llm"])

#### Our RAG Chain

Notice how we have a bit of a more complex chain this time - that's because we want to return our sources with the response.

Let's break down the chain step-by-step:

1. We invoke the chain with the `question` item. Notice how we only need to provide `question` since both the retriever and the `"question"` object depend on it.
  - We also chain our `"question"` into our `retriever`! This is what ultimately collects the context through Qdrant.
2. We assign our collected context to a `RunnablePassthrough()` from the previous object. This is going to let us simply pass it through to the next step, but still allow us to run that section of the chain.
3. We finally collect our response by chaining our prompt, which expects both a `"question"` and `"context"`, into our `llm`. We also collect the `"context"` again so we can output it in the final response object.

The key thing to keep in mind here is that we need to pass our context through *after* we've retrieved it - to populate the object in a way that doesn't require us to call it or try and use it for something else.
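
To see the data flow without any LangChain machinery, here is a plain-Python sketch of the same three stages. `fake_retriever`, `fake_llm`, and `format_prompt` are hypothetical stand-ins, and the dictionary steps only approximate what the LCEL operators do.

```python
from operator import itemgetter

def fake_retriever(question):
    """Hypothetical stand-in for the retriever: returns a fixed list of context strings."""
    return [f"Document relevant to: {question}"]

def fake_llm(prompt_text):
    """Hypothetical stand-in for the chat model: reports the prompt length."""
    return f"(answer generated from a {len(prompt_text)}-character prompt)"

def format_prompt(values):
    """Fill a simplified version of the RAG prompt with context and question."""
    return f"Context:\n{values['context']}\n\nQuestion:\n{values['question']}\n"

def run_chain(inputs):
    # Stage 1: build {"context", "question"} from the single "question" input.
    stage_one = {
        "context": fake_retriever(itemgetter("question")(inputs)),
        "question": itemgetter("question")(inputs),
    }
    # Stage 2: pass everything through, re-assigning "context" from the previous step.
    stage_two = {**stage_one, "context": itemgetter("context")(stage_one)}
    # Stage 3: generate the response and carry the context into the final output.
    return {
        "response": fake_llm(format_prompt(stage_two)),
        "context": itemgetter("context")(stage_two),
    }

result = run_chain({"question": "What is LangSmith?"})
print(result["response"])
print(result["context"])
```

The final dictionary has the same two keys as the real chain's output, which is why the later cells can read both `response["response"]` and `response["context"]`.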

In [12]:
from operator import itemgetter
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the qdrant_retriever
    {"context": itemgetter("question") | qdrant_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": base_rag_prompt | base_llm, "context": itemgetter("context")}
)

Let's get a visual understanding of our chain!

In [13]:
!pip install -qU grandalf

In [14]:
print(retrieval_augmented_qa_chain.get_graph().draw_ascii())

          +---------------------------------+      
          | Parallel<context,question>Input |      
          +---------------------------------+      
                    **            **               
                  **                **             
                **                    **           
         +--------+                     **         
         | Lambda |                      *         
         +--------+                      *         
              *                          *         
              *                          *         
              *                          *         
  +----------------------+          +--------+     
  | VectorStoreRetriever |          | Lambda |     
  +----------------------+          +--------+     
                    **            **               
                      **        **                 
                        **    **                   
          +----------------------------------+     
          | 

Let's try another visual representation:

![image](https://i.imgur.com/Ad31AhL.png)

Let's test our chain out!

In [15]:
response = retrieval_augmented_qa_chain.invoke({"question" : "What's new in LangChain v0.2?"})

In [16]:
response["response"].content

'LangChain v0.2 includes several new features and improvements:\n\n1. **Full Separation of langchain and langchain-community** - This decoupling allows langchain-community to depend on langchain-core and langchain.\n2. **New (and versioned) Documentation** - Improved documentation has been introduced to assist users.\n3. **More Mature and Controllable Agent Framework** - Enhancements to the agent framework for better control and maturity.\n4. **Improved LLM Interface Standardization** - Better standardization around tool calling in the LLM interface.\n5. **Streaming Support** - Addition of streaming support.\n6. **30+ New Partner Packages** - A variety of new partner packages have been included.\n\nThis is a pre-release, and the full version is expected to come in a few weeks.'

In [16]:
for context in response["context"]:
  print("Context:")
  print(context)
  print("----")

Context:
page_content='Four months ago, we released the first stable version of LangChain. Today, we are following up by announcing a pre-release of langchain v0.2.This release builds upon the foundation laid in v0.1 and incorporates community feedback. We’re excited to share that v0.2 brings: A much-desired full separation of langchain and langchain-community New (and versioned!) docs A more mature and controllable agent framework Improved LLM interface standardization, particularly around tool callingBetter' metadata={'source': 'https://blog.langchain.dev/langchain-v02-leap-to-stability/', 'loc': 'https://blog.langchain.dev/langchain-v02-leap-to-stability/', 'lastmod': '2024-05-16T22:26:07.000Z', '_id': '83497f257e074d49a0b86ca0c54000db', '_collection_name': '35e716a0fe8142f99ca0ca03bbe8ca72'}
----
Context:
page_content='LangChain v0.2: A Leap Towards Stability

Skip to content

All Posts

Case Studies

In the Lo

Let's see if it can handle a query that is totally unrelated to the source documents.

In [17]:
response = retrieval_augmented_qa_chain.invoke({"question" : "What is the airspeed velocity of an unladen swallow?"})

In [18]:
response["response"].content

"I don't have that information."

## Task 3: Setting Up LangSmith

Now that we have a chain - we're ready to get started with LangSmith!

We're going to go ahead and use the following `env` variables to get our Colab notebook set up to start reporting.

If all you needed was simple monitoring - this is all you would need to do!

In [31]:
from uuid import uuid4

unique_id = uuid4().hex[0:8]

os.environ["LANGCHAIN_TRACING_V2"] = "true"
# Fixed project name so every run lands in the same LangSmith project.
# For a fresh project per session, use f"LangSmith - {unique_id}" instead.
os.environ["LANGCHAIN_PROJECT"] = "LangSmith - Evaluation"

### LangSmith API

In order to use LangSmith, you will need a beta key. You can join the queue through the `Beta Sign Up` button on LangSmith's homepage!

Join [here](https://www.langchain.com/langsmith)

In [20]:
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass('Enter your LangSmith API key: ')

Let's test out our first generation!

In [21]:
retrieval_augmented_qa_chain.invoke({"question" : "What is LangSmith?"}, {"tags" : ["Demo Run"]})['response']

AIMessage(content='LangSmith is a framework built on LangChain, designed to enhance the observability and decision-making processes in AI development, particularly for applications involving large language models (LLMs). It provides tools for tracking and improving the performance of LLMs, allowing for fine-grain controls and customizability through an SDK. LangSmith can streamline processes such as prompt engineering and enhance the overall product lifecycle by offering insights into the quality and behavior of AI-powered products.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 91, 'prompt_tokens': 917, 'total_tokens': 1008}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_f33667828e', 'finish_reason': 'stop', 'logprobs': None}, id='run-d0b313e5-888f-40da-a303-d54e018d07b5-0', usage_metadata={'input_tokens': 917, 'output_tokens': 91, 'total_tokens': 1008})

## Task 4: Examining the Trace in LangSmith!

Head on over to your LangSmith web UI to check out how the trace looks in LangSmith!

#### 🏗️ Activity #1:

Include a screenshot of your trace and explain what it means.

# Answer
![Langsmith](./langsmith.png)

```python

base_rag_prompt_template = """\
You are a helpful assistant that can answer questions related to the provided context. Respond with "I don't have that information" if the question is outside the context.

Context:
{context}

Question:
{question}
"""

base_rag_prompt = ChatPromptTemplate.from_template(base_rag_prompt_template)

base_llm = ChatOpenAI(model="gpt-4o-mini", tags=["base_llm"])

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the qdrant_retriever
    {"context": itemgetter("question") | qdrant_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": base_rag_prompt | base_llm, "context": itemgetter("context")}
)

retrieval_augmented_qa_chain.invoke({"question" : "What is LangSmith?"}, {"tags" : ["Demo Run"]})['response']
```


LangSmith Interface Walkthrough: https://www.loom.com/share/c87f632bd715428aa04dcba82f9b88cc?sid=6ceec49b-fb32-4689-b837-82d83b7b438a

## Task 5: Loading Our Testing Set

In [23]:
!git clone https://github.com/AI-Maker-Space/DataRepository.git

fatal: destination path 'DataRepository' already exists and is not an empty directory.


In [27]:
import pandas as pd

test_df = pd.read_csv("DataRepository/langchain_blog_test_data.csv")
test_df

| Unnamed: 0.1 | Unnamed: 0 | question | context | idx | answer |
|---|---|---|---|---|---|
| 0 | 0 | How did Podium improve their agent F1 response... | How Podium optimized agent behavior and reduce... | 0 | Podium optimized agent behavior and reduced en... |
| 1 | 1 | How did Athena Intelligence utilize LangSmith ... | How Athena Intelligence optimized research rep... | 40 | Athena Intelligence used the LangSmith playgro... |
| 2 | 2 | What are the four strategies supported by Lang... | and with LangGraph and LangSmith, LangChain de... | 80 | LangGraph Cloud provides four different strate... |
| 3 | 3 | What action is required after receiving the su... | Tags\n\n\n\nJoin our newsletter\nUpdates from ... | 120 | Please check your inbox and click the link to ... |
| 4 | 4 | What are the key features of the Open Source E... | Open Source Extraction Service\n\n\n\n\n\n\n\n... | 160 | The Open Source Extraction Service is a newly ... |
| 5 | 5 | What are the main subsections of Gödel's mathe... | {'article_h1_main': 'Kurt Gödel', 'article_h2_... | 200 | The question isn't provided, but based on the ... |
| 6 | 6 | What should you do after receiving the "Succes... | Tags\nBy LangChain\n\n\nJoin our newsletter\nU... | 240 | Subscribe |
| 7 | 7 | What is the role of the ISUSE token in the Sel... | output is yes, no, continueISREL token decides... | 280 | The ISUSE token decides whether the generation... |
| 8 | 8 | What should enterprises do if they are looking... | strategies to get the Assistant architecture t... | 320 | gtm@langchain.dev |
| 9 | 9 | What is the purpose of using LangChain in the ... | As 2023 comes to a close, Graphite wanted to c... | 360 | Year in code is a personalized, AI-generated v... |


Now we can set up our LangSmith client - and we'll add the above created dataset to our LangSmith instance!

> NOTE: Read more about this process [here](https://docs.smith.langchain.com/old/evaluation/faq/manage-datasets#create-from-list-of-values)

In [29]:
from langsmith import Client

client = Client()

dataset_name = "langsmith-demo-dataset-aie4-triples-v3"

dataset = client.create_dataset(
    dataset_name=dataset_name, description="LangChain Blog Test Questions"
)

for _, triplet in test_df.iterrows():
  client.create_example(
      inputs={"question" : triplet["question"], "context": triplet["context"]},
      outputs={"answer" : triplet["answer"]},
      dataset_id=dataset.id
  )
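
The loop above maps each row of the test set to an example with `inputs` and `outputs`. That mapping can be sketched on its own, without a LangSmith client. `row_to_example` is a hypothetical helper, and the single CSV row is made-up sample data that only borrows the column names from the test CSV.

```python
import csv
import io

def row_to_example(row):
    """Hypothetical helper: split a row into the inputs/outputs shape used for dataset examples."""
    return {
        "inputs": {"question": row["question"], "context": row["context"]},
        "outputs": {"answer": row["answer"]},
    }

# Made-up sample data with the same column names as the test CSV.
sample_csv = io.StringIO(
    "question,context,idx,answer\n"
    "What is LCEL?,LCEL is the LangChain Expression Language.,0,The LangChain Expression Language\n"
)

examples = [row_to_example(row) for row in csv.DictReader(sample_csv)]
print(examples[0]["inputs"]["question"])
print(examples[0]["outputs"]["answer"])
```

Keeping `context` under `inputs` rather than `outputs` matters later: an evaluator can then use the stored context as its reference instead of the stored answer.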

## Task 6: Evaluation

Now we can run the evaluation!

We'll need to start by writing a few custom data preparation functions to ensure our chain works with the expected inputs/outputs from the `evaluate` process in LangSmith.

> NOTE: More reading on this available [here](https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application#evaluate-a-langchain-runnable)

In [35]:
def prepare_data_ref(run, example):
  return {
      "prediction" : run.outputs["response"],
      "reference" : example.outputs["answer"],
      "input" : example.inputs["question"]
  }

def prepare_data_noref(run, example):
  return {
      "prediction" : run.outputs["response"],
      "input" : example.inputs["question"]
  }

def prepare_context_ref(run, example):
  return {
      "prediction" : run.outputs["response"],
      "reference" : example.inputs["context"],
      "input" : example.inputs["question"]
  }
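
To check what these functions hand to an evaluator, they can be exercised with stand-in objects. The `SimpleNamespace` objects below are hypothetical mock-ups that only mimic the `.outputs` and `.inputs` attributes used here, and their contents are made-up sample data; real LangSmith run and example objects carry many more fields.

```python
from types import SimpleNamespace

def prepare_context_ref(run, example):
    # Same mapping as in the notebook: grade the prediction against the example's context.
    return {
        "prediction": run.outputs["response"],
        "reference": example.inputs["context"],
        "input": example.inputs["question"],
    }

# Hypothetical stand-ins for a LangSmith run and a dataset example.
mock_run = SimpleNamespace(outputs={"response": "LangSmith helps you trace LLM apps."})
mock_example = SimpleNamespace(
    inputs={"question": "What is LangSmith?", "context": "LangSmith is used to trace LLM apps."},
    outputs={"answer": "A tool for tracing and evaluating LLM apps."},
)

prepared = prepare_context_ref(mock_run, mock_example)
print(prepared)
```

Note that the reference here comes from `example.inputs["context"]`, not `example.outputs["answer"]`, so an evaluator wired to this function judges the prediction against the source context.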

We'll be using a few custom evaluators to evaluate our pipeline, as well as a few "built in" methods!

Check out the built-ins [here](https://docs.smith.langchain.com/reference/sdk_reference/langchain_evaluators)!

In [40]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

# Chain-of-thought question answering evaluator, which grades answers to questions using chain-of-thought "reasoning".
cot_qa_evaluator = LangChainStringEvaluator("cot_qa", config={"llm": base_llm}, prepare_data=prepare_context_ref)

unlabeled_dopeness_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "llm": base_llm,
        "criteria" : {
            "dopeness" : "Is the answer to the question dope, meaning cool - awesome - and legit?"
        }
    },
    prepare_data=prepare_data_noref
)

labeled_score_evaluator = LangChainStringEvaluator(
    "labeled_score_string",
    config={
        "llm": base_llm,
        "criteria": {
            "accuracy": "Is the generated answer the same as the reference answer?"
        },
    },
    prepare_data=prepare_data_ref
)

# Conciseness evaluator
unlabled_conciseness_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "llm": base_llm,
        "criteria": {            
            "conciseness": "Is the answer to the question succinct, providing only the necessary information without unnecessary elaboration?"
        }
    },
    prepare_data=prepare_data_noref
)

unlabled_harmfulness_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "llm": base_llm,
        "criteria": {            
            "harmfulness": """Does the answer contain any harmful, offensive, discriminatory, or inappropriate content that could negatively impact the user?
            
            Please analyze the response according to these steps:
            1. Identify any language that is offensive, harmful, discriminatory, or inappropriate.
            2. Determine if the content could negatively impact the user.
            3. Conclude with a clear "Yes" if any harmful content is found, otherwise "No"."""
        }
    },
    prepare_data=prepare_data_noref
)

base_rag_results = evaluate(
    retrieval_augmented_qa_chain.invoke,
    data=dataset_name,
    evaluators=[
        cot_qa_evaluator,
        unlabeled_dopeness_evaluator,
        labeled_score_evaluator,
        unlabled_conciseness_evaluator,
        unlabled_harmfulness_evaluator
    ],
    experiment_prefix="Base RAG Evaluation Updated"
)

View the evaluation results for experiment: 'Base RAG Evaluation Updated-a615c459' at:
https://smith.langchain.com/o/e319c8f1-73bd-5d44-8897-d19a191ebc54/datasets/0037c4f9-186e-4c24-bef0-776064165e00/compare?selectedSessions=a5a1625d-33be-4092-803f-955ef368ba6d




0it [00:00, ?it/s]

#### ❓Question #1:

What conclusions can you draw about the above results?

Describe in your own words what the metrics are expressing.

# My Understanding of the Evaluation Experiment Code Above

 - First, we are evaluating against a dataset, in this case a set of questions built from the LangChain blog posts on their website.
 - The dataset is in CSV format, and each row contains `question`, `context`, `idx`, and `answer` fields.
 - The evaluation is done using the LangSmith evaluation framework, which expects the data in a certain format, so we need to prepare the data in that format.
 - We create a dataset in the LangSmith portal and add examples to it.
   - Each example has `inputs` and `outputs`: `inputs` contains the `question` and `context`, and `outputs` contains the expected answer from the LLM.
 - Here we are evaluating a RAG pipeline (LLM chain) against the dataset examples.
 - To perform the evaluation we need the following items:
   - The entity being evaluated (in this case, the RAG pipeline chain).
   - Dataset examples - here we use the examples prepared and uploaded to LangSmith.
   - Evaluation criteria - we need to specify the characteristics we are evaluating, such as accuracy, dopeness, and chain-of-thought correctness.
     - COT evaluator - it grades the model's answer, with the judge reasoning step by step through chain-of-thought prompts.
     - Each evaluator needs to be instructed how to read the examples. Some evaluators need all three fields (`question`, `context`, and `response`); some need only two.
   - Each evaluator uses an LLM to evaluate based on the criteria. In this example, `gpt-4o-mini` is both the model being evaluated and the judge. We could use a completely different model as the judge for this evaluation.
   - LangSmith provides the evaluator types listed below, which fit most criteria:
       - embedding_evaluator
       - criteria_evaluator
       - exact_match_evaluator
       - regex_match_evaluator
       - scoring_evaluator
       - string_distance_evaluator

# What Are the Metrics Expressing?

  1. COT evaluator - It uses an LLM as a judge to grade the chain's response (the prediction) for a given input against a reference, which in this notebook is the example's `context`. The evaluator is loaded with a prompt that applies the chain-of-thought method, giving the judge example steps for evaluating the response, like a teacher grading a student's answer paper.
  2. criteria_evaluator - This evaluator grades the response against the specified criteria. A criterion may be one of the default implemented criteria: conciseness, relevance, correctness, coherence, harmfulness, maliciousness, helpfulness, controversiality, misogyny, and criminality. Or we can define our own criterion in a custom dict as follows: `{ "criterion_key": "criterion description" }`
  3. Dopeness is one such custom evaluation criterion, which evaluates the dopeness of the response.
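
As a rough sketch of how a custom criterion such as `dopeness` might turn into a score: the criterion description is placed into a judge prompt, and the judge's final Y/N verdict is mapped to 1 or 0. The prompt wording, `build_judge_prompt`, and `parse_verdict` are all hypothetical and do not reproduce LangChain's actual implementation; the judge reply is made up in place of a real LLM call.

```python
def build_judge_prompt(criteria, question, answer):
    """Hypothetical prompt builder: lists each criterion and asks for a final Y or N."""
    criteria_text = "\n".join(
        f"{name}: {description}" for name, description in criteria.items()
    )
    return (
        f"Question: {question}\n"
        f"Submission: {answer}\n"
        f"Criteria:\n{criteria_text}\n"
        "Explain your reasoning, then give a final verdict of Y or N on the last line."
    )

def parse_verdict(judge_output):
    """Map the last line of the judge's reply to a score: Y -> 1, N -> 0, otherwise None."""
    lines = judge_output.strip().splitlines()
    if not lines:
        return None
    return {"Y": 1, "N": 0}.get(lines[-1].strip().upper())

criteria = {
    "dopeness": "Is the answer to the question dope, meaning cool - awesome - and legit?"
}
prompt = build_judge_prompt(criteria, "What is LangSmith?", "A tool for tracing LLM apps.")

# A made-up judge reply, standing in for a real LLM call.
score = parse_verdict("The answer is accurate and clearly stated.\nY")
print(score)
```

Averaging these 0/1 scores over all dataset examples gives the kind of per-criterion number that shows up in the experiment view.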


# Post-Discussion with Peer Supporter

I feel evaluations are the techniques for validating the different components involved in generative AI. The above example validates that a particular LLM's responses meet certain evaluation criteria. Similarly, we can apply this logic to validate our prompts, tools, agents, retrievers, etc.

![evaluation](./evaluation.png)
![experiment](./eval-experiment.png)
![cot-eval](./cot-eval.png)