# Checking for Hallucinations in a Q&A System

When evaluating any system where accuracy matters, the gold standard is to collect labeled datasets and apply strong correctness metrics, combined with human spot checks and oversight.

But it's a challenge to collect and maintain up-to-date, labeled datasets.

In the absence of labels, you can quantify other metrics like:
- Faithfulness - How faithful is the generated response to the retrieved documents?
- Relevance - How relevant is the response to the original question?
- Helpfulness - How helpful is the response in resolving the intent behind the question?

This example shows one way to measure these metrics using LLM-assisted evaluators. The main steps are:

1. Create a dataset of questions.
2. Define the retrieval-augmented generation (RAG) question-answering (Q&A) system.
3. Define the evaluators, which will generate metrics over the dataset.
4. Run evaluation in LangSmith.

**Note:** Separately evaluating the retriever itself (using standard retrieval metrics) can be helpful alongside whole-system evaluations. This guide focuses on measuring the LLM response **conditioned on seeing the selected documents**. To maximize your system's effectiveness, you will likely also want to evaluate and tune the retriever itself.
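
As a rough illustration of what "standard retrieval metrics" can mean, the sketch below computes recall@k and reciprocal rank for a single query. It assumes you have labeled relevant document IDs for each question, which this guide does not provide; the function names and document IDs are hypothetical.

```python
from typing import Sequence, Set


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def reciprocal_rank(retrieved_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# Hypothetical IDs for one query: ranked retriever output and labeled relevant docs.
retrieved = ["doc-7", "doc-2", "doc-9", "doc-4"]
relevant = {"doc-2", "doc-4", "doc-5"}

print(recall_at_k(retrieved, relevant, k=4))
print(reciprocal_rank(retrieved, relevant))
```

Averaging these values over all questions gives a retriever-only score that can be tracked next to the response-level metrics in this guide.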

## Prerequisites

We will be using [LangSmith](https://smith.langchain.com) and LangChain. Please configure your API key appropriately.

In [None]:
# %env LANGCHAIN_API_KEY=<YOUR_API_KEY>

We will also be using Chroma and OpenAI for this example.

In [None]:
# %pip install -U langchain > /dev/null
# %pip install openai tiktoken > /dev/null
# %pip install chromadb > /dev/null
# %pip install lxml > /dev/null
# %pip install html2text > /dev/null

In [None]:
# %env OPENAI_API_KEY=<YOUR_API_KEY>

## 1. Create a Dataset of Examples

We are going to hard-code a list of input questions to evaluate and use the `create_example` method on the client to create each example row.

In [1]:
from langsmith import Client

client = Client()

In [2]:
questions = [
    "How might I query for all runs in a project?",
    "What's a langsmith dataset?",
    "How do I use a traceable decorator?",
    "Can I trace my Llama V2 llm?",
    "Why do I have to set environment variables?"
]

dataset_name = "Retrieval QA Questions"
dataset = client.create_dataset(dataset_name=dataset_name)

for q in questions:
    client.create_example(inputs={"question": q}, dataset_id=dataset.id)

## 2. Define RAG Q&A System

For our example, we will use a Q&A system over the LangSmith documentation. The chain will be composed of:

1. An embedding model to vectorize documents and user queries for retrieval. In this case, the [OpenAIEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.openai.OpenAIEmbeddings.html) model.
2. A [VectorStoreRetriever](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.base.VectorStoreRetriever.html#langchain.vectorstores.base.VectorStoreRetriever) to retrieve documents. We will use [Chroma](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.chroma.Chroma.html#langchain.vectorstores.chroma.Chroma) in this example.
3. A [ChatPromptTemplate](https://api.python.langchain.com/en/latest/prompts/langchain.prompts.chat.ChatPromptTemplate.html#langchain.prompts.chat.ChatPromptTemplate) to combine the query and documents. 
4. An LLM, in this case `gpt-3.5-turbo-16k`, via [ChatOpenAI](https://api.python.langchain.com/en/latest/chat_models/langchain.chat_models.openai.ChatOpenAI.html#langchain.chat_models.openai.ChatOpenAI).

We will combine them using the [expression syntax](https://python.langchain.com/docs/guides/expression_language/cookbook).

First, load the documents to populate the vectorstore:

In [4]:
from langchain.document_loaders import RecursiveUrlLoader
from langchain.document_transformers import Html2TextTransformer
from langchain.text_splitter import TokenTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

api_loader = RecursiveUrlLoader("https://docs.smith.langchain.com")
text_splitter = TokenTextSplitter(
    model_name="gpt-3.5-turbo",
    chunk_size=2000,
    chunk_overlap=200,
)
doc_transformer = Html2TextTransformer()

raw_documents = api_loader.load()
transformed = doc_transformer.transform_documents(raw_documents)
documents = text_splitter.split_documents(transformed)

embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)
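
To make the `chunk_size` and `chunk_overlap` settings above more concrete, here is a simplified sketch of overlapping chunking. It assumes whitespace-separated words as "tokens" (the real splitter uses the model's tokenizer), and `split_with_overlap` is a hypothetical helper, not LangChain's implementation.

```python
from typing import List, Sequence


def split_with_overlap(
    tokens: Sequence[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(list(tokens[start:end]))
        if end == len(tokens):
            break  # the last window already reaches the end of the input
        start += step
    return chunks


# Toy input: whitespace "tokens" stand in for real tokenizer output.
words = "a b c d e f g h i j".split()
for chunk in split_with_overlap(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.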



In [1]:
from datetime import datetime
from operator import itemgetter

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser


def get_chain(retriever):
    prompt = ChatPromptTemplate.from_messages(
        [
            ("system", "You are a helpful documentation Q&A assistant, trained to answer"
            " questions from LangChain's documentation."
            " LangChain is a framework for building applications using large language models."
            "\nThe current time is {time}.\n\nRelevant documents will be retrieved in the following messages."),
            ("system", "{context}"),
            ("human","{question}")
        ]
    ).partial(time=str(datetime.now()))
    
    model = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0)
    chain = (
        {
            "context": itemgetter("question") | retriever | (lambda docs: "\n".join([doc.page_content for doc in docs])),
            "question": itemgetter("question")
        }
        | prompt 
        | model 
        | StrOutputParser()
    )
    return chain


def eval_wrapper(example_input: dict) -> dict:
    # Our production chain doesn't return the docs to the end user, but we want to surface them
    # for our evaluator. We can wrap the retriever to pipe out the retrieved docs.
    collected_docs = None
    def collect_docs(docs: list) -> list:
        """Pass-through to return the retrieved docs as well."""
        nonlocal collected_docs
        collected_docs = docs
        return docs
    retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
    wrapped_retriever = retriever | collect_docs
    chain = get_chain(wrapped_retriever)
    result = chain.invoke(example_input)
    return {
        "prediction": result,
        "docs": collected_docs
    }
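
The pass-through trick in `eval_wrapper` can be seen in isolation in the sketch below. It uses plain Python functions instead of LangChain runnables; `fake_retriever` and `answer_with_docs` are hypothetical stand-ins, and the "answer" is a stub rather than an LLM call.

```python
from typing import Callable, List, Optional


def fake_retriever(query: str) -> List[str]:
    """Hypothetical stand-in for a vector store retriever."""
    corpus = {
        "tracing": ["Doc about the traceable decorator."],
        "datasets": ["Doc about datasets.", "Doc about examples."],
    }
    return corpus.get(query, [])


def answer_with_docs(query: str, retriever: Callable[[str], List[str]]) -> dict:
    collected_docs: Optional[List[str]] = None

    def collect_docs(docs: List[str]) -> List[str]:
        """Record the retrieved docs, then hand them on unchanged."""
        nonlocal collected_docs
        collected_docs = docs
        return docs

    # Same shape as `retriever | collect_docs`: retrieve, record, pass along.
    docs = collect_docs(retriever(query))
    context = "\n".join(docs)
    # Stub in place of the prompt + model steps of the real chain.
    prediction = f"stub answer grounded in {len(context)} characters of context"
    return {"prediction": prediction, "docs": collected_docs}


print(answer_with_docs("datasets", fake_retriever))
```

Because the closure is created inside each call, every invocation gets its own `collected_docs`, so runs executed concurrently should not overwrite each other's captured documents.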

## 3. Evaluate

We will create a custom evaluator to apply to our retrieval Q&A system. In this case, we will wrap one of LangChain's criteria evaluators
and treat the retrieved documents as the reference. This way, each criterion is evaluated against the documents the answer is supposed to be grounded in.

In [9]:
from typing import Any, Optional

from langsmith.schemas import Run, Example
from langsmith.evaluation import EvaluationResult, RunEvaluator
from langchain.evaluation import load_evaluator


class RetrievalEvaluator(RunEvaluator):
    """Grade a prediction against its retrieved documents using a criteria evaluator."""

    def __init__(self, criteria: dict):
        self.criteria_evaluator = load_evaluator("labeled_criteria", criteria=criteria)

    def evaluate_run(self, run: Run, example: Example) -> EvaluationResult:
        prediction, docs = run.outputs["prediction"], run.outputs["docs"]
        docs_string = "\n".join([doc['page_content'] for doc in docs])
        res = self.criteria_evaluator.evaluate_strings(
            prediction=prediction, 
            reference=docs_string,
            input=run.inputs["question"],
            include_run_info=True,
        )
        return EvaluationResult(
            key=self.criteria_evaluator.evaluation_name,
            score=res.get("score"),
            comment=res.get("reasoning"),
            # Crosslink this run to the feedback
            evaluator_info={
                "__run": res["__run"]
            }
        )
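
The evaluator above reads `score` and `reasoning` from the criteria evaluator's result. As a simplified illustration of how a grader's free-text response could be reduced to those fields, the sketch below assumes the grading LLM writes its reasoning first and a bare `Y` or `N` on the final line. `parse_verdict` is a hypothetical helper, not LangChain's actual parser.

```python
from typing import Any, Dict


def parse_verdict(grader_text: str) -> Dict[str, Any]:
    """Map a 'reasoning... then Y/N' response to reasoning, value, and score."""
    lines = [line.strip() for line in grader_text.splitlines() if line.strip()]
    last = lines[-1].upper() if lines else ""
    if last in ("Y", "N"):
        return {
            "reasoning": "\n".join(lines[:-1]),
            "value": last,
            "score": 1 if last == "Y" else 0,  # Y means the criterion is met
        }
    # No recognizable verdict: keep the text but leave the score unset.
    return {"reasoning": "\n".join(lines), "value": None, "score": None}


print(parse_verdict("Every claim appears in the reference documents.\nY"))
print(parse_verdict("The answer mentions a flag the docs never describe.\n\nN"))
```

Leaving the score as `None` for unparseable responses keeps them from being silently counted as passes or failures.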
        

In [10]:
from langchain.smith import RunEvalConfig

faithfulness_criterion = """Is the submission faithful to and fully consistent\
 with the reference supporting documents? It cannot contradict anything therein."""
hallucination_criterion = """Is all information conveyed in the submission\
 grounded in the reference documents? If ANYTHING is inferred,\
 presumed, or not fully and explicitly stated in the reference docs, then the submission fails: respond 'N'."""

eval_config = RunEvalConfig(
    custom_evaluators = [
        RetrievalEvaluator({"faithfulness": faithfulness_criterion}),
        RetrievalEvaluator({"hallucination": hallucination_criterion}),
    ],
    # If you are fetching many large documents, you may need
    # a larger token window for the evaluator.
    # Claude 2 can perform reasonably well.
    # In general, it is not recommended to use eval LLMs less
    # capable than Claude 2 or GPT-4.
    # (Requires `from langchain.chat_models import ChatAnthropic`.)
    # eval_llm=ChatAnthropic(model="claude-2", temperature=0)
)

In [11]:
_ = await client.arun_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=eval_wrapper,
    evaluation=eval_config,
)

View the evaluation results for project '3e9533de54124606843771d21a8584f4-RunnableLambda' at:
https://dev.smith.langchain.com/projects/p/360f5072-f42f-4070-b600-485e5978d67c?eval=true
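
Each evaluator attaches one score per example under its criterion key. A minimal sketch of turning such scores into a per-criterion pass rate follows. The rows are made-up toy values, not results from this run, and `pass_rates` is a hypothetical helper; in practice you would read the feedback from LangSmith instead of hard-coding it.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def pass_rates(feedback: Iterable[Tuple[str, Optional[int]]]) -> Dict[str, float]:
    """Average the 0/1 scores for each criterion key, skipping missing scores."""
    totals: Dict[str, List[int]] = defaultdict(lambda: [0, 0])  # key -> [sum, count]
    for key, score in feedback:
        if score is None:
            continue  # the grader returned no usable verdict for this run
        totals[key][0] += score
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Toy (criterion, score) pairs for illustration only.
toy_feedback = [
    ("faithfulness", 1),
    ("faithfulness", 1),
    ("faithfulness", 0),
    ("hallucination", 1),
    ("hallucination", None),
    ("hallucination", 0),
]

print(pass_rates(toy_feedback))
```

Given how the criteria are phrased as yes/no questions, a score of 1 under the `hallucination` key should mean the response passed the check (everything was grounded), not that a hallucination was found.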


