# LangSmith Evaluation Deep Dive

### Summary

See here for an overview of evaluation:
https://docs.smith.langchain.com/evaluation

![langsmith_summary.png](https://education-team-2020.s3.eu-west-1.amazonaws.com/ai-eng/langsmith_summary.png)

## Environment

In [None]:
# ! pip install -U langsmith openai ollama

In [None]:
import os
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())

os.environ['LANGCHAIN_TRACING_V2'] = 'true' # enables tracing 
os.environ["LANGCHAIN_PROJECT"] = "Test"

# Dataset: Manually Curated

`Question:` 

How can I build my own dataset?

`Setup:` 

Let's build a dataset of question-answer pairs on this blog post about `DBRX`:

https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm

We'll build a `Manually Curated` dataset of input, output pairs:

![ai-eng/langsmith_rag_story](https://education-team-2020.s3.eu-west-1.amazonaws.com/ai-eng/langsmith_rag_story.png)

In [None]:
import pandas as pd

# QA
inputs = [
    "How many tokens was DBRX pre-trained on?",
    "Is DBRX a MOE model and how many parameters does it have?",
    "How many GPUs was DBRX trained on and what was the connectivity between GPUs?",
]

outputs = [
    "DBRX was pre-trained on 12 trillion tokens of text and code data.",
    "Yes, DBRX is a fine-grained mixture-of-experts (MoE) architecture with 132B total parameters.",
    "DBRX was trained on 3072 NVIDIA H100s connected by 3.2Tbps Infiniband",
]

# Dataset
qa_pairs = [{"question": q, "answer": a} for q, a in zip(inputs, outputs)]
df = pd.DataFrame(qa_pairs)

# Write to csv

# Create directory data if it does not exist
os.makedirs("data", exist_ok=True)
csv_path = "data/DBRX_eval.csv"
df.to_csv(csv_path, index=False)

LangSmith SDK docs:

* https://docs.smith.langchain.com/evaluation/quickstart#1-create-a-dataset

In [None]:
from langsmith import Client

client = Client()
dataset_name = "DBRX"

# Store
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="QA pairs about DBRX model.",
)
client.create_examples(
    inputs=[{"question": q} for q in inputs],
    outputs=[{"answer": a} for a in outputs],
    dataset_id=dataset.id,
)

Update dataset

In [None]:
new_questions = [
    "What is the context window of DBRX Instruct?",
]

new_answers = [
    "DBRX Instruct was trained with up to a 32K token context window.",
]

# See updated version in the UI
client.create_examples(
    inputs=[{"question": q} for q in new_questions],
    outputs=[{"answer": a} for a in new_answers],
    dataset_id=dataset.id,
)

We can also create a dataset directly from a CSV file in the LangSmith UI.

LangSmith UI docs:

https://docs.smith.langchain.com/evaluation/faq/manage-datasets
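
If you would rather stay in the SDK, a CSV such as `data/DBRX_eval.csv` can be turned back into the `inputs` and `outputs` lists that `client.create_examples` received above. The following is a minimal sketch using only the standard library; the helper name `csv_to_examples` is hypothetical, and the inline CSV text stands in for the file on disk.

```python
import csv
import io


def csv_to_examples(csv_file):
    """Split rows with 'question' and 'answer' columns into example inputs and outputs."""
    inputs, outputs = [], []
    for row in csv.DictReader(csv_file):
        inputs.append({"question": row["question"]})
        outputs.append({"answer": row["answer"]})
    return inputs, outputs


# Stand-in for: open("data/DBRX_eval.csv", newline="")
sample_csv = io.StringIO(
    "question,answer\n"
    "How many tokens was DBRX pre-trained on?,"
    "DBRX was pre-trained on 12 trillion tokens of text and code data.\n"
)

example_inputs, example_outputs = csv_to_examples(sample_csv)
print(example_inputs)
print(example_outputs)
```

The two lists have the same shape as the `inputs=` and `outputs=` arguments used in the cells above.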

# Dataset: From User Logs

`Question:` 

How can I save user logs as a dataset for future testing?

![ai-eng/userlogs.png](https://education-team-2020.s3.eu-west-1.amazonaws.com/ai-eng/userlogs.png)

In [None]:
# Create a new project where user questions are logged

import os

os.environ["LANGCHAIN_PROJECT"] = "DBRX"

In [None]:
# Load blog post

import requests
from bs4 import BeautifulSoup

url = "https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
text = [p.text for p in soup.find_all("p")]
full_text = "\n".join(text)

In [None]:
# OpenAI API

import openai
from langsmith.wrappers import wrap_openai

openai_client = wrap_openai(openai.Client())


def answer_dbrx_question_oai(inputs: dict) -> dict:
    """
    Generates answers to user questions based on a provided website text using OpenAI API.

    Parameters:
    inputs (dict): A dictionary with a single key 'question', representing the user's question as a string.

    Returns:
    dict: A dictionary with a single key 'answer', containing the generated answer as a string.
    """

    # System prompt
    system_msg = (
        f"Answer user questions in 2-3 sentences about this context: \n\n\n {full_text}"
    )

    # Pass in website text
    messages = [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": inputs["question"]},
    ]

    # Call OpenAI
    response = openai_client.chat.completions.create(
        messages=messages, model="gpt-3.5-turbo"
    )

    # Response in output dict
    return {"answer": response.choices[0].message.content}

In [None]:
# User question example

answer_dbrx_question_oai(
    {
        "question": "What are the main differences in training efficiency between MPT-7B vs DBRX?"
    }
)

In [None]:
# User question example

answer_dbrx_question_oai({"question": "How many tokens was DBRX pre-trained on?"})

# LLM-as-Judge: Built-in evaluator

`Question:` 

How can I evaluate my LLM against my dataset?

`Evaluation flow`

![ai-eng/llm-as-judge.png](https://education-team-2020.s3.eu-west-1.amazonaws.com/ai-eng/llm-as-judge.png)

`Built-in evaluator`

https://docs.smith.langchain.com/evaluation/faq/evaluator-implementations

`CoT_qa`

```
Use chain of thought "reasoning" before determining a final verdict
```

In [None]:
from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Evaluators
qa_evaluator = [LangChainStringEvaluator("cot_qa")]
dataset_name = "DBRX"

experiment_results = evaluate(
    answer_dbrx_question_oai,
    data=dataset_name,
    evaluators=qa_evaluator,
    experiment_prefix="test-dbrx-qa-oai",
    # Any experiment metadata can be specified here
    metadata={
        "variant": "stuff website context into gpt-3.5-turbo",
    },
)

`What did we do?`

![ai-eng/llm-as-judge2.png](https://education-team-2020.s3.eu-west-1.amazonaws.com/ai-eng/llm-as-judge2.png)

# Custom evaluator

`Question:` 

How can I define my own custom evaluator? 

Let's say we want to define a simple assertion that an answer is actually generated.

![ai-eng/custom-evaluator.png](https://education-team-2020.s3.eu-west-1.amazonaws.com/ai-eng/custom-evaluator.png)

In [None]:
from langsmith.schemas import Run, Example


def is_answered(run: Run, example: Example) -> dict:
    # Get outputs
    student_answer = run.outputs.get("answer")

    # Check if the student_answer is an empty string
    if not student_answer:
        return {"key": "is_answered", "score": 0}
    else:
        return {"key": "is_answered", "score": 1}


# Evaluators
qa_evaluator = [is_answered]
dataset_name = "DBRX"

# Run
experiment_results = evaluate(
    answer_dbrx_question_oai,
    data=dataset_name,
    evaluators=qa_evaluator,
    experiment_prefix="test-dbrx-qa-custom-eval-is-answered",
    # Any experiment metadata can be specified here
    metadata={
        "variant": "stuff website context into gpt-3.5-turbo",
    },
)
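
A custom evaluator only needs to read `run.outputs` and return a dict with `key` and `score`, so other simple checks follow the same pattern as `is_answered`. As a sketch, the evaluator below checks the "2-3 sentences" instruction from the system prompt. The name `is_concise`, the `MAX_SENTENCES` limit, and the naive punctuation-based sentence split are all assumptions, and `SimpleNamespace` stands in for a LangSmith `Run` so the example runs offline.

```python
import re
from types import SimpleNamespace

MAX_SENTENCES = 3  # assumed limit, taken from the "2-3 sentences" instruction


def is_concise(run, example) -> dict:
    """Score 1 if the answer is non-empty and has at most MAX_SENTENCES sentences."""
    answer = (run.outputs or {}).get("answer") or ""
    # Naive split: a sentence ends at . ! or ? followed by whitespace or end of text
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", answer) if s.strip()]
    score = int(0 < len(sentences) <= MAX_SENTENCES)
    return {"key": "is_concise", "score": score}


# Stand-ins for LangSmith Run objects (only the .outputs attribute is used)
short_run = SimpleNamespace(
    outputs={"answer": "DBRX is an MoE model. It has 132B parameters."}
)
long_run = SimpleNamespace(outputs={"answer": "One. Two. Three. Four."})

print(is_concise(short_run, None))
print(is_concise(long_run, None))
```

Because it has the same `(run, example)` signature as `is_answered`, a function like this could be listed alongside it in `evaluators`.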

# Comparison 

`Question:` 

How does `Mistral-7b` running locally compare to `GPT-3.5-turbo` for question-answering?
 
`Setup:`

https://github.com/ollama/ollama-python

After installing it, start the server:

`ollama serve`

If you installed it on a Mac with Homebrew, the binary lives under the Homebrew prefix:

`/opt/homebrew/opt/ollama/bin/ollama serve`

Then pull the Mistral model: `ollama pull mistral`

Instrument Ollama calls with LangSmith: 

https://docs.smith.langchain.com/cookbook/tracing-examples/traceable#using-the-decorator

In [None]:
# Mistral

import ollama
from langsmith.run_helpers import traceable


@traceable(run_type="llm")
def call_ollama(messages, model: str):
    # Use the model passed in by the caller instead of a hardcoded name
    stream = ollama.chat(messages=messages, model=model, stream=True)
    response = ""
    for chunk in stream:
        print(chunk["message"]["content"], end="", flush=True)
        response = response + chunk["message"]["content"]
    return response


def answer_dbrx_question_mistral(inputs: dict) -> dict:
    """
    Generates answers to user questions based on a provided website text using Ollama serving Mistral locally.

    Parameters:
    inputs (dict): A dictionary with a single key 'question', representing the user's question as a string.

    Returns:
    dict: A dictionary with a single key 'answer', containing the generated answer as a string.
    """

    # System prompt
    system_msg = f"Answer user questions about this context: \n\n\n {full_text}"

    # Pass in website text
    messages = [
        {"role": "system", "content": system_msg},
        {
            "role": "user",
            "content": f'Answer the question in 2-3 sentences {inputs["question"]}',
        },
    ]

    # Call Mistral
    response = call_ollama(messages, model="mistral")

    # Response in output dict
    return {"answer": response}


result = answer_dbrx_question_mistral(
    {
        "question": "What are the main differences in training efficiency between MPT-7B vs DBRX?"
    }
)

What are we doing?

![ai-eng/comparison.png](https://education-team-2020.s3.eu-west-1.amazonaws.com/ai-eng/comparison.png)

In [None]:
# Evaluators
qa_evaluator = [LangChainStringEvaluator("cot_qa")]
dataset_name = "DBRX"

experiment_results = evaluate(
    answer_dbrx_question_mistral,
    data=dataset_name,
    evaluators=qa_evaluator,
    experiment_prefix="test-dbrx-qa-mistral",
    # Any experiment metadata can be specified here
    metadata={
        "variant": "stuff website context into mistral",
    },
)

Use comparison view to inspect results.

# Experiment on datasets from the prompt playground (no code)

We've shown various ways to run evals using the SDK.
 
But sometimes I want to do more rapid testing.

For this I can use the LangSmith prompt hub directly: 

https://docs.smith.langchain.com/evaluation/faq/experiments-app

Here is a problem I've worked on recently: 

I want to grade documents in a RAG chain that takes as input: (1) A document and (2) A question.
 
And returns: (3) JSON with `score` yes or no that tells me if the documents are related to a question. 

See notebooks [here](https://github.com/langchain-ai/langgraph/tree/main/examples/rag).

![ai-eng/experiment.png](https://education-team-2020.s3.eu-west-1.amazonaws.com/ai-eng/experiment.png)

`Question:` 

How do different LLMs perform at instruction following to produce a JSON output?

First, I build a dataset of test examples:

In [None]:
# Define a dataset

import pandas as pd

# relevance check
inputs = [
    {
        "question": "agent memory",
        "doc_txt": "agent memory has two types: short and long term",
    },
    {"question": "hallucinations", "doc_txt": "DBRX was pretrained on 12T tokens"},
    {
        "question": "DBRX context window",
        "doc_txt": "DBRX has a 32K token context window",
    },
]

outputs = ["yes", "no", "yes"]

In [None]:
from langsmith import Client

client = Client()
dataset_name = "Relevance_grade"

# Store
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Testing relevance grading.",
)
client.create_examples(
    inputs=inputs,
    outputs=[{"answer": a} for a in outputs],
    dataset_id=dataset.id,
)

Test prompt in the [Prompt Hub](https://smith.langchain.com/hub/rlm/score_documents?organizationId=1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8).

```
SYSTEM

You are a grader assessing relevance of a retrieved document to a user question. It does not need to be a stringent test. The goal is to filter out erroneous retrievals. If the document contains keyword(s) or semantic meaning related to the user question, grade it as relevant. Give a binary score, 'yes' or 'no', to indicate whether the document is relevant to the question. Provide the binary score as a JSON with a single key 'score' and no preamble or explanation.

HUMAN

Question: {question} 

Document: {doc_txt}
```

# Attach evaluators to datasets (no code)

In the previous section, we:

(1) Set up a dataset of test cases for document grading 

(2) Ran experiments from the prompt hub

(3) Manually reviewed them

But, we can go one step further:

We can attach an LLM evaluator to our dataset. 

This is automatically applied for every experiment.

Grade prompt:

```
You are a grader. You will be shown: 

(1) Submission: a student submission for a JSON string

(2) Reference: the ground truth value expected in the JSON string

The student is producing a JSON with a single key "score" to indicate whether doc_text is relevant to question for this input:

[Input]: {input}

Grade the student as correct if the student submission is valid JSON (or a JSON string) and contains the Reference value. If the student submission contains a preamble of text (e.g., "sure, here is the JSON"), score it as incorrect because we only want the JSON returned.

[BEGIN DATA]

***

[Submission]: {output}

***

[Reference]: {reference}

***

[END DATA]
```
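
The rule in this grade prompt is mechanical enough that it can also be checked in code, which is useful for sanity-checking the LLM grader on a few cases. The following is a minimal sketch, assuming the submission arrives as a raw string and the reference is `yes` or `no`; the function name `grade_json_submission` is hypothetical and is not part of LangSmith.

```python
import json


def grade_json_submission(submission: str, reference: str) -> bool:
    """Return True if the submission is bare JSON whose 'score' equals the reference."""
    try:
        parsed = json.loads(submission)
    except (json.JSONDecodeError, TypeError):
        # A preamble such as "Sure, here is the JSON" makes parsing fail
        return False
    # The prompt also accepts "a JSON string": unwrap one level of string-encoded JSON
    if isinstance(parsed, str):
        try:
            parsed = json.loads(parsed)
        except json.JSONDecodeError:
            return False
    return isinstance(parsed, dict) and parsed.get("score") == reference


print(grade_json_submission('{"score": "yes"}', "yes"))
print(grade_json_submission('Sure, here is the JSON: {"score": "yes"}', "yes"))
```

Disagreements between a check like this and the LLM evaluator are good candidates for manual review.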

# Summary Evaluators

We previously talked about using retrieval grading as part of RAG:

![ai-eng/summary.png](https://education-team-2020.s3.eu-west-1.amazonaws.com/ai-eng/summary.png)

In short, we use an LLM to grade whether a document is relevant to the input question.

This returns a binary `yes` or `no`.

We built an eval set and ground truth is a binary `yes` or `no` for each example:

https://smith.langchain.com/public/ad300ffb-8bf5-450a-9c26-1b34481fb709/d

`Question:`

How can I create a custom metric to summarize performance on this dataset?

![ai-eng/summary2.png](https://education-team-2020.s3.eu-west-1.amazonaws.com/ai-eng/summary2.png)

First, let's set up the two chains we want to compare:

* [OpenAI w/ tool use](https://github.com/langchain-ai/langgraph/blob/e779b4335b8a8b11c9e8ac71b89e9a08a94e3ff9/examples/rag/langgraph_self_rag.ipynb)
* [Mistral w/ JSON mode running locally](https://github.com/langchain-ai/langgraph/blob/e779b4335b8a8b11c9e8ac71b89e9a08a94e3ff9/examples/rag/langgraph_self_rag_local.ipynb) 

In [None]:
### OpenAI Grader

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field


# Data model
class GradeDocuments(BaseModel):
    """Binary score for relevance check on retrieved documents."""

    score: str = Field(
        description="Documents are relevant to the question, 'yes' or 'no'"
    )


# LLM with function call
llm = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)
structured_llm_grader = llm.with_structured_output(GradeDocuments)

# Prompt
system = """You are a grader assessing relevance of a retrieved document to a user question. \n 
    It does not need to be a stringent test. The goal is to filter out erroneous retrievals. \n
    If the document contains keyword(s) or semantic meaning related to the user question, grade it as relevant. \n
    Give a binary score 'yes' or 'no' score to indicate whether the document is relevant to the question."""
grade_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "Retrieved document: \n\n {document} \n\n User question: {question}"),
    ]
)

retrieval_grader_oai = grade_prompt | structured_llm_grader


def predict_oai(inputs: dict) -> dict:
    # Returns pydantic object
    grade = retrieval_grader_oai.invoke(
        {"question": inputs["question"], "document": inputs["doc_txt"]}
    )
    return {"grade": grade.score}

In [None]:
### Mistral Grader

from langchain.prompts import PromptTemplate
from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import JsonOutputParser

# LLM
llm = ChatOllama(model="mistral", format="json", temperature=0)

prompt = PromptTemplate(
    template="""You are a grader assessing relevance of a retrieved document to a user question. \n 
    Here is the retrieved document: \n\n {document} \n\n
    Here is the user question: {question} \n
    If the document contains keywords related to the user question, grade it as relevant. \n
    It does not need to be a stringent test. The goal is to filter out erroneous retrievals. \n
    Give a binary score 'yes' or 'no' score to indicate whether the document is relevant to the question. \n
    Provide the binary score as a JSON with a single key 'score' and no preamble or explanation.""",
    input_variables=["question", "document"],
)

retrieval_grader_mistral = prompt | llm | JsonOutputParser()


def predict_mistral(inputs: dict) -> dict:
    # Returns JSON
    grade = retrieval_grader_mistral.invoke(
        {"question": inputs["question"], "document": inputs["doc_txt"]}
    )
    return {"grade": grade["score"]}

Documentation: 

https://docs.smith.langchain.com/evaluation/faq/custom-evaluators#summary-evaluators

We can define a custom summary metric over the dataset.

[`Precision` and `Recall` are common metrics for evaluating binary classification](https://en.wikipedia.org/wiki/Precision_and_recall):

* `Precision`: True positives (`TP`) / all predicted positives (`TP` + false positives, `FP`).
* `Recall`: `TP` / all samples that should have been identified as positive (`TP` + false negatives, `FN`).

`F1` combines precision and recall into a single score:
* `F1` is the harmonic mean of precision and recall, `2 * precision * recall / (precision + recall)`. Its best value is 1 and its worst is 0.
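
To make the definitions concrete, here is a small worked example on `yes`/`no` labels like the ones in the relevance dataset. It is a sketch with made-up labels; the helper name `precision_recall_f1` is hypothetical, and `yes` is treated as the positive class.

```python
def precision_recall_f1(references, predictions, positive="yes"):
    """Compute precision, recall and F1 for binary string labels."""
    pairs = list(zip(references, predictions))
    tp = sum(r == positive and p == positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)
    fn = sum(r == positive and p != positive for r, p in pairs)

    # Guard against division by zero when a denominator is empty
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Made-up labels: TP = 1, FP = 1, FN = 2
references = ["yes", "no", "yes", "yes"]
predictions = ["yes", "yes", "no", "no"]

precision, recall, f1 = precision_recall_f1(references, predictions)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.50 recall=0.33 f1=0.40
```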

In [None]:
from typing import List
from langsmith.schemas import Example, Run
from langsmith.evaluation import evaluate


def f1_score_summary_evaluator(runs: List[Run], examples: List[Example]) -> dict:
    """
    Evaluates the F1 score for a list of runs against a set of examples.

    The function iterates through paired runs and examples, comparing the output
    of each run (`run.outputs["grade"]`) with the expected output in the example
    (`example.outputs["answer"]`). It calculates the true positives, false positives,
    and false negatives based on these comparisons to compute the F1 score of the predictions.

    Parameters:
    - runs (List[Run]): A list of run objects, where each run contains an output that is a prediction.
    - examples (List[Example]): A list of example objects, where each example contains an output that is the expected answer.

    Returns:
    - dict: A dictionary with "key" set to "f1_score" and "score" set to the computed F1 score.
    """

    # Default values
    true_positives = 0
    false_positives = 0
    false_negatives = 0

    # Iterate through samples
    for run, example in zip(runs, examples):
        # Labels are the strings "yes" / "no". Convert them to booleans first:
        # any non-empty string (including "no") is truthy in Python.
        reference = str(example.outputs["answer"]).strip().lower() == "yes"
        prediction = str(run.outputs["grade"]).strip().lower() == "yes"
        if reference and prediction:
            true_positives += 1
        elif prediction and not reference:
            false_positives += 1
        elif reference and not prediction:
            false_negatives += 1
    if true_positives == 0:
        return {"key": "f1_score", "score": 0.0}

    # Compute F1 score
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1_score = 2 * (precision * recall) / (precision + recall)
    return {"key": "f1_score", "score": f1_score}

In [None]:
evaluate(
    predict_mistral,
    data="Relevance_grade",
    summary_evaluators=[f1_score_summary_evaluator],
    experiment_prefix="test-score-mistral",
    # Any experiment metadata can be specified here
    metadata={
        "model": "mistral",
    },
)

In [None]:
evaluate(
    predict_oai,
    data="Relevance_grade",
    summary_evaluators=[f1_score_summary_evaluator],
    experiment_prefix="test-score-oai",
    # Any experiment metadata can be specified here
    metadata={
        "model": "oai",
    },
)

On the LangSmith website, select both experiments by ticking the small checkbox to the left of each one, then select "Compare".

# Evaluating RAG

See our [RAG guide](https://docs.smith.langchain.com/cookbook/testing-examples/rag_eval).

# Regression testing

Previously, we talked about various types of RAG evaluations.

`Question:` 

How can I assess whether a new LLM (e.g., phi3) can be used in my RAG chain?

For this, regression testing is highly useful.

It lets us easily pinpoint changes in performance in our eval set across model versions.
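
Conceptually, a regression test compares per-example scores from a baseline experiment with those from a candidate and flags the examples that got worse. In LangSmith this is done by comparing experiments in the UI; the sketch below shows the same idea offline. The scores, the example ids, and the helper `find_regressions` are made up for illustration and are not LangSmith output.

```python
def find_regressions(baseline, candidate, tolerance=0.0):
    """Return {example_id: (baseline_score, candidate_score)} where the candidate is worse."""
    regressions = {}
    for example_id, base_score in baseline.items():
        new_score = candidate.get(example_id)
        # Skip examples the candidate did not run on
        if new_score is not None and new_score < base_score - tolerance:
            regressions[example_id] = (base_score, new_score)
    return regressions


# Hypothetical per-example scores from two experiments on the same dataset
baseline_scores = {"q1": 0.9, "q2": 0.8, "q3": 0.7}
candidate_scores = {"q1": 0.9, "q2": 0.4, "q3": 0.8}

regressions = find_regressions(baseline_scores, candidate_scores)
print(regressions)
# prints: {'q2': (0.8, 0.4)}
```

Keying both experiments by example is what makes the comparison meaningful: an unchanged average can hide one example that improved and another that regressed.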

First, define an eval set:

In [None]:
import os
os.environ["LANGCHAIN_PROJECT"] = "RAG_repetitions"

In [None]:
from langsmith import Client

# QA
inputs = [
    "My LCEL map contains the key 'question'. What is the difference between using itemgetter('question'), lambda x: x['question'], and x.get('question')?",
    "How can I make the output of my LCEL chain a string?",
    "How can I run two LCEL chains in parallel and write their output to a map?",
]

outputs = [
    "Itemgetter can be used as shorthand to extract specific keys from the map. In the context of a map operation, the lambda function is applied to each element in the input map and the function returns the value associated with the key 'question'. (get) is safer for accessing values in a dictionary because it handles the case where the key might not exist.",
    "Use StrOutputParser. from langchain_openai import ChatOpenAI; from langchain_core.prompts import ChatPromptTemplate; from langchain_core.output_parsers import StrOutputParser; prompt = ChatPromptTemplate.from_template('Tell me a short joke about {topic}'); model = ChatOpenAI(model='gpt-3.5-turbo') #gpt-4 or other LLMs can be used here; output_parser = StrOutputParser(); chain = prompt | model | output_parser",
    "We can use RunnableParallel. For example: from langchain_core.prompts import ChatPromptTemplate; from langchain_core.runnables import RunnableParallel; from langchain_openai import ChatOpenAI; model = ChatOpenAI(); joke_chain = ChatPromptTemplate.from_template('tell me a joke about {topic}') | model; poem_chain = (ChatPromptTemplate.from_template('write a 2-line poem about {topic}') | model); map_chain = RunnableParallel(joke=joke_chain, poem=poem_chain); map_chain.invoke({'topic': 'bear'})",
]

qa_pairs = [{"question": q, "answer": a} for q, a in zip(inputs, outputs)]

# Create dataset
client = Client()
dataset_name = "RAG_QA_LCEL"
dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="QA pairs about LCEL.",
)
client.create_examples(
    inputs=[{"question": q} for q in inputs],
    outputs=[{"answer": a} for a in outputs],
    dataset_id=dataset.id,
)

RAG chain:

In [None]:
### INDEX

from bs4 import BeautifulSoup as Soup
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader

url = "https://python.langchain.com/v0.1/docs/expression_language/"
loader = RecursiveUrlLoader(url=url, max_depth=20, extractor=lambda x: Soup(x, "html.parser").text)
docs = loader.load()
full_doc_text = ' ---- '.join([d.page_content for d in docs])

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

# Embed
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

# Index
retriever = vectorstore.as_retriever()
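
`chunk_size` and `chunk_overlap` control how much text each split holds and how much neighbouring splits share. The sketch below is a simplified fixed-width character splitter meant only to illustrate overlap. It is not the algorithm `RecursiveCharacterTextSplitter` uses, which tries to break on separators such as paragraphs and newlines. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut text into fixed-width chunks where consecutive chunks share chunk_overlap characters."""
    if chunk_overlap < 0 or chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        # Stop once the current chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
# prints: ['abcd', 'cdef', 'efgh', 'ghij']
```

With the values used above (4000 and 200), each chunk would repeat the last 200 characters of the previous one, which reduces the chance that a relevant passage is cut in half at a chunk boundary.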

In [None]:
### RAG

import openai
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from langchain.prompts import PromptTemplate
from langchain_community.chat_models import ChatOllama
from langchain_core.output_parsers import StrOutputParser

class RagBot:
    """
    A class to interface with retrieval-augmented generation (RAG) models from different providers
    such as OpenAI or Ollama, utilizing a retriever for document-based context.
    """

    def __init__(
        self,
        retriever,
        provider: str = "openai",
        model: str = "gpt-4-0125-preview",
        use_vectorstore: bool = True,
    ):
        """
        Initializes the RagBot with a retriever, provider information, model details, and configuration
        to use a vector store for document retrieval.

        Args:
        retriever: The document retriever instance.
        provider (str): The provider of the RAG model ('openai' or 'ollama').
        model (str): The model identifier used by the provider.
        use_vectorstore (bool): Flag to determine whether to use vectorstore for document retrieval.
        """
        self._retriever = retriever
        self._provider = provider
        self._model = model
        self._use_vectorstore = use_vectorstore
        if provider == "openai":
            self._client = wrap_openai(openai.Client())
        elif provider == "ollama":
            self._client = ChatOllama(model=model, temperature=0)

    @traceable()
    def retrieve_docs(self, question):
        """
        Retrieves documents based on the input question, using either a vectorstore or full context.

        Args:
        question (str): The question to retrieve documents for.

        Returns:
        list: A list of documents relevant to the question or the full context (as a string).
        """
        if self._use_vectorstore:
            return self._retriever.invoke(question)
        else:
            return full_doc_text

    @traceable()
    def get_answer(self, question: str):
        """
        Generates an answer for a given question by using RAG, leveraging both the retriever
        and the provider's model capabilities.

        Args:
        question (str): The user's question to answer.

        Returns:
        dict: A dictionary containing the 'answer' and 'contexts' (related documents).
        """
        similar = self.retrieve_docs(question)
        if self._provider == "openai":
            # OpenAI RAG
            response = self._client.chat.completions.create(
                model=self._model,
                messages=[
                    {
                        "role": "system",
                        "content": "You are a helpful AI code assistant with expertise in LCEL.\n"
                        " Use the following docs to produce a concise code solution to the user question.\n"
                        " Use three sentences maximum and keep the answer concise. \n"
                        f"## Docs\n\n{similar}",
                    },
                    {"role": "user", "content": question},
                ],
            )
            response_str = response.choices[0].message.content

        elif self._provider == "ollama":
            # Ollama RAG
            prompt = PromptTemplate(
                template="""You are a helpful AI code assistant with expertise in LCEL.
                Use the following docs to produce a concise code solution to the user question.
                If you don't know the answer, just say that you don't know. 
                Use three sentences maximum and keep the answer concise.
                Question: {question} 
                Context: {context} 
                Answer: """,
                input_variables=["question", "context"],
            )
            rag_chain = prompt | self._client | StrOutputParser()
            response_str = rag_chain.invoke({"context": similar, "question": question})

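        # retrieve_docs returns a plain string when use_vectorstore is False;
        # iterating over it would yield single characters, so wrap it instead
        if isinstance(similar, str):
            contexts = [similar]
        else:
            contexts = [str(doc) for doc in similar]
        return {
            "answer": response_str,
            "contexts": contexts,
        }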

In [None]:
def predict_rag_answer_oai(example: dict):
    """Use this for answer evaluation"""
    rag_bot = RagBot(retriever, provider="openai", model="gpt-4-0125-preview")
    response = rag_bot.get_answer(example["question"])
    return {"answer": response["answer"]}


def predict_rag_answer_llama3(example: dict):
    """Use this for answer evaluation"""
    rag_bot = RagBot(retriever, provider="ollama", model="llama3")
    response = rag_bot.get_answer(example["question"])
    return {"answer": response["answer"]}


def predict_rag_answer_phi3(example: dict):
    """Use this for answer evaluation"""
    rag_bot = RagBot(retriever, provider="ollama", model="phi3")
    response = rag_bot.get_answer(example["question"])
    return {"answer": response["answer"]}

Define evaluator:

Note, the evaluator is checked in here: 

https://smith.langchain.com/hub/langchain-ai/rag-answer-accuracy/

In [None]:
from langchain_openai import ChatOpenAI
from langsmith.schemas import Example, Run
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langsmith.evaluation import LangChainStringEvaluator, evaluate

def answer_evaluator(root_run: Run, example: Example) -> dict:
    """
    A simple evaluator for RAG answer generation
    """

    # Get question, answer, and reference answer
    rag_pipeline_run = next(
        run for run in root_run.child_runs if run.name == "get_answer"
    )
    retrieve_run = next(
        run for run in rag_pipeline_run.child_runs if run.name == "retrieve_docs"
    )
    input_question = example.inputs["question"]
    reference = example.outputs["answer"]
    prediction = rag_pipeline_run.outputs["answer"]

    # Data model for grade
    class GradeAnswer(BaseModel):
        """A numerical score for answer accuracy."""

        score: int = Field(
            description="Answer matches the ground truth, score from 1 to 10"
        )

    # LLM with function call, use highest capacity model
    llm = ChatOpenAI(model="gpt-4-turbo", temperature=0)
    structured_llm_grader = llm.with_structured_output(GradeAnswer)

    # Prompt
    system = """Is the Assistant's Answer grounded in and similar to the Ground Truth answer? Note that we do not expect all of the text 
            in code solution examples to be identical. We expect (1) code imports to be identical if the same import is used. (2) But, it is
            ok if there are differences in the implementation itself. The main point is that the same concept is employed. A score of 1 means 
            that the Assistant answer is not at all conceptually grounded in and similar to the Ground Truth answer. A score of 5 means that the Assistant 
            answer contains some information that is conceptually grounded in and similar to the Ground Truth answer. A score of 10 means that the 
            Assistant answer is fully conceptually grounded in and similar to the Ground Truth answer."""

    grade_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", system),
            (
                "human",
                "Ground Truth answer: \n\n {reference} \n\n Assistant's Answer: {prediction}",
            ),
        ]
    )

    answer_grader = grade_prompt | structured_llm_grader
    score = answer_grader.invoke({"reference": reference, "prediction": prediction})
    return {"key": "answer_accuracy", "score": int(score.score) / 10}

#### Compare OpenAI vs Ollama (open source LLMs)

In [None]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

dataset_name = "RAG_QA_LCEL"
experiment_results = evaluate(
    predict_rag_answer_oai,
    data=dataset_name,
    evaluators=[answer_evaluator],
    experiment_prefix="rag-qa-gpt4-0125",
    metadata={"variant": "LCEL context, gpt-4-0125-preview"},
)

In [None]:
experiment_results = evaluate(
    predict_rag_answer_llama3,
    data=dataset_name,
    evaluators=[answer_evaluator],
    experiment_prefix="rag-qa-llama3",
    metadata={"variant": "LCEL context, llama3"},
)

In [None]:
experiment_results = evaluate(
    predict_rag_answer_phi3,
    data=dataset_name,
    evaluators=[answer_evaluator],
    experiment_prefix="rag-qa-phi3",
    metadata={"variant": "LCEL context, phi3"},
)

# Online Evaluators

Sometimes we want to evaluate generations as they are logged to a project.

There are a few common applications for online evaluation:

* [RAG: answer hallucinations](https://smith.langchain.com/hub/rlm/rag-answer-hallucination?organizationId=1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8)
* [RAG: document relevance](https://smith.langchain.com/hub/rlm/rag-document-relevance/playground?organizationId=1fa8b1f4-fcb9-4072-9aa9-983e35ad61b8&type=structured)

![ai-eng/onlineevaluators.png](https://education-team-2020.s3.eu-west-1.amazonaws.com/ai-eng/onlineevaluators.png)

In [None]:
# Test our RAG bot
rag_bot = RagBot(retriever, provider="openai", model="gpt-3.5-turbo")
response = rag_bot.get_answer("How to define an RAG chain in LCEL?")

In [None]:
import os

os.environ["LANGCHAIN_PROJECT"] = "RAG_online_eval"

In [None]:
import openai
from langsmith import traceable, Client
import uuid

client = openai.Client()


@traceable(
    run_type="chain",
    name="rag",
)
def rag(question: str, documents):
    return (
        client.chat.completions.create(
            model="gpt-3.5-turbo",
            temperature=0,
            messages=[
                {
                    "role": "system",
                    "content": f"Answer questions based on these documents: {documents}",
                },
                {"role": "user", "content": question},
            ],
        )
        .choices[0]
        .message.content
    )


rag("where did harrison work", ["ankush and his friend worked at kensho"])

In [None]:
rag("where did ankush work", ["ankush and his friend worked at kensho"])

In [None]:
rag("where did lance work", ["ankush and his friend worked at kensho"])