# Checking Hallucinations in a Q&A System

In this example, you will create a chain using LangChain's [Expression Language](https://python.langchain.com/docs/guides/expression_language/) to perform question answering (Q&A) over the LangSmith documentation, and then evaluate it using a dataset and criteria designed to check for hallucinations. When evaluating any system where accuracy matters, collecting labeled datasets and using strong correctness metrics (combined with human spot-checking and oversight) is the gold standard.

But it's a challenge to collect and maintain up-to-date, labeled datasets.

In the absence of labels, you can quantify other metrics like:
- Faithfulness - How faithful is the generated response to the retrieved documents?
- Relevance - How relevant is the response to the original question?
- Helpfulness - How helpful is the response in resolving the intent behind the question?

This example shows one way to measure faithfulness using LLM-assisted evaluators. The main steps are:

1. Define the retrieval-augmented generation (RAG) question-answering (Q&A) system.
2. Create a dataset of questions.
3. Define the evaluation config.
4. Run the evaluation in LangSmith.

**Note:** Separately evaluating the retriever itself (using standard retrieval metrics) can be helpful alongside whole-system evaluations. This guide focuses on measuring the LLM response **conditioned on seeing the selected documents**. To maximize your system's effectiveness, you will likely also want to evaluate and tune the retriever itself.

## Prerequisites

We will be using [LangSmith](https://smith.langchain.com) and LangChain. Configure your LangSmith API key before running the cells below.

In [1]:
# %env LANGCHAIN_API_KEY=<YOUR_API_KEY>

We will be using langchain, openai, and chromadb for this example. Upgrade to the latest versions of these libraries to make sure
you have the requisite functionality.

In [2]:
# %pip install -U langchain > /dev/null
# %pip install -U langsmith > /dev/null
# %pip install openai > /dev/null
# %pip install chromadb > /dev/null
# %pip install lxml > /dev/null
# %pip install html2text > /dev/null

In [3]:
# %env OPENAI_API_KEY=<YOUR-API-KEY>

## 1. Define RAG Q&A System

For our example, we will create a Q&A system over the LangSmith documentation.

In [4]:
from langchain.document_loaders import RecursiveUrlLoader
from langchain.document_transformers import Html2TextTransformer
from langchain.text_splitter import TokenTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

api_loader = RecursiveUrlLoader("https://docs.smith.langchain.com")
text_splitter = TokenTextSplitter(
    model_name="gpt-3.5-turbo",
    chunk_size=1000,
    chunk_overlap=200,
)
doc_transformer = Html2TextTransformer()

raw_documents = api_loader.load()
transformed = doc_transformer.transform_documents(raw_documents)
documents = text_splitter.split_documents(transformed)

embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(documents, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
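
To build intuition for the `chunk_size=1000` and `chunk_overlap=200` settings above, here is a minimal sketch of sliding-window chunking over a token list. It is a simplified illustration, not the library's implementation: `chunk_with_overlap` and the toy tokens are hypothetical names introduced only for this example, and real tokenization is replaced by a plain list of strings.

```python
from typing import List


def chunk_with_overlap(
    tokens: List[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


# Toy example: 10 "tokens", windows of 4 with an overlap of 2.
toy_tokens = [f"t{i}" for i in range(10)]
toy_chunks = chunk_with_overlap(toy_tokens, chunk_size=4, chunk_overlap=2)
for chunk in toy_chunks:
    print(chunk)
```

Each window repeats the tail of the previous one, so a sentence that straddles a chunk boundary is still likely to appear intact in at least one chunk.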



We will construct our chatbot using the expression language below. It is simple and incorporates the following components:
- Retriever
- Prompt template
- LLM

In [5]:
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.schema.output_parser import StrOutputParser
from datetime import datetime
from operator import itemgetter
from uuid import uuid4

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful documentation Q&A assistant, trained to answer"
        " questions from LangChain's documentation."
        " LangChain is a framework for building applications using large language models."
        "\nThe current time is {time}.\n\nRelevant documents will be retrieved in the following messages."),
        ("system", "{context}"),
        ("human","{question}")
    ]
).partial(time=str(datetime.now()))

model = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0)

chain = (
    {
        "context": itemgetter("question") | retriever | (lambda docs: "\n".join([doc.page_content for doc in docs])),
        "question": itemgetter("question")
    } | prompt 
    | model 
    | StrOutputParser()
)
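
The dictionary at the head of the chain fans a single input out into the two variables the prompt expects. The sketch below mimics that data flow in plain Python so it can run without an API key. `FakeDoc`, `fake_retriever`, and `build_prompt_inputs` are hypothetical stand-ins introduced for illustration; they are not part of LangChain.

```python
from dataclasses import dataclass
from operator import itemgetter
from typing import Dict, List


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document."""

    page_content: str


def fake_retriever(query: str) -> List[FakeDoc]:
    """Hypothetical retriever that returns two canned documents."""
    return [FakeDoc(f"First passage about: {query}"), FakeDoc("Second passage.")]


def build_prompt_inputs(inputs: Dict[str, str]) -> Dict[str, str]:
    """Derive both prompt variables from the same input dictionary."""
    question = itemgetter("question")(inputs)
    docs = fake_retriever(question)
    return {
        # Same joining logic as the lambda in the chain above.
        "context": "\n".join(doc.page_content for doc in docs),
        "question": question,
    }


prompt_inputs = build_prompt_inputs({"question": "What is a dataset?"})
print(prompt_inputs["context"])
```

Because the context is a single newline-joined string, it fills the `{context}` placeholder of the second system message directly.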

## 2. Create a Dataset of Examples

We are going to hard-code a list of input questions and then use the traces to assemble the retrieved documents into a dataset.

In [6]:
questions = [
    "How might I query for all runs in a project?",
    "What's a langsmith dataset?",
    "How do I use a traceable decorator?",
    "Can I trace my Llama V2 llm?",
    "Why do I have to set environment variables?"
]

In [7]:
from langchain.callbacks import tracing_v2_enabled

random_salt = str(uuid4().hex[0:8])
project_name = f"Retrieval QA Question Generation - {random_salt}"
with tracing_v2_enabled(project_name=project_name):
    for q in questions:
        chain.invoke({"question": q})

In [8]:
from langsmith import Client

client = Client()

In [9]:
from IPython.display import IFrame
# Here's an example run:
first_run = next(iter(client.list_runs(project_name=project_name, execution_order=1)))
shared_link = client.share_run(first_run.id)
IFrame(shared_link, width='1000px', height='520px', zoom=50)

### Extract Data from Runs

It's easy to extract the pieces we want from the fully hydrated chat prompt template in the LLM runs. The code below reads the LLM runs in the project and parses the messages from that structure.

We are NOT going to add the chain's responses as outputs to the dataset examples, since we have not verified that they are correct.

In [10]:
import json
from langchain.load.load import loads

example_inputs = []
for run in client.list_runs(project_name=project_name, run_type="llm"):
    # Deserialize the langchain objects
    messages = loads(json.dumps(run.inputs['messages']))
    context = messages[1].content
    question = messages[-1].content
    example_inputs.append({'context': context, 'question': question})
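
The indices used above depend on the order of messages in the prompt template defined earlier: system instructions, then the system message holding the context, then the human question. The sketch below shows that positional extraction on stand-in objects. `FakeMessage` and `extract_example` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FakeMessage:
    """Hypothetical stand-in for a deserialized chat message."""

    role: str
    content: str


def extract_example(messages: List[FakeMessage]) -> Dict[str, str]:
    """Pull the context and question out by position in the prompt."""
    # [0] system instructions, [1] retrieved context, [-1] human question.
    return {"context": messages[1].content, "question": messages[-1].content}


sample_messages = [
    FakeMessage("system", "You are a helpful documentation Q&A assistant."),
    FakeMessage("system", "Passage A\nPassage B"),
    FakeMessage("human", "What's a langsmith dataset?"),
]
sample_example = extract_example(sample_messages)
print(sample_example)
```

If you reorder or add messages in the prompt template, these positions change and the extraction needs to be updated to match.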

### Create the Dataset

We will use the `create_example` method on the client to create each example row. Here, we store the retrieved context as the example's output so that it acts as a reference, and then direct the criteria evaluator to consider that reference when grading the response.

In [11]:
dataset_name = f"Retrieval QA Questions - {random_salt}"
dataset = client.create_dataset(dataset_name=dataset_name)

for data in example_inputs:
    client.create_example(inputs={"question": data['question']}, outputs={"context": data['context']}, dataset_id=dataset.id)

## 3. Evaluate

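We will use labeled criteria evaluators for this, one per criterion. "Labeled" means the evaluator also sees the reference stored with each example (here, the retrieved context) when grading the submission.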

In [12]:
from langchain.smith import RunEvalConfig

faithfulness_criterion = """Is the submission faithful to and fully consistent\
 with the reference supporting documents? It cannot contradict anything therein."""
hallucination_criterion = """Is all information conveyed in the submission\
 grounded in the reference documents? If ANYTHING is inferred,\
 presumed, or not fully and explicitly stated in the reference docs, then the submission fails: respond 'N'."""

eval_config = RunEvalConfig(
    evaluators = [
        RunEvalConfig.LabeledCriteria(
            criteria={"faithfulness": faithfulness_criterion}
        ),
        RunEvalConfig.LabeledCriteria(
            criteria={"hallucination": hallucination_criterion}
        ),
    ],
    # If you are fetching many, large documents, you may need
    # a larger token window for the evaluator.
    # Claude 2 can perform reasonably well.
    # In general, it is not recommended to use eval LLMs less
    # capable than claude 2 or gpt-4
    # eval_llm=ChatAnthropic(model="claude-2", temperature=0)
)

In [13]:
res = await client.arun_on_dataset(
    dataset_name=dataset_name,
    llm_or_chain_factory=chain,
    evaluation=eval_config,
)

View the evaluation results for project '696acf01f9e44a399eec71dfd3afa652-RunnableSequence' at:
https://dev.smith.langchain.com/projects/p/d8f42fc8-c577-42b1-87e3-3c1cdfb5e71a?eval=true


In [15]:
project = client.read_project(project_name=res["project_name"])
project.feedback_stats

{'faithfulness': {'n': 5, 'avg': 0.2, 'mode': 0},
 'hallucination': {'n': 5, 'avg': 0.2, 'mode': 0}}
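
One way to read these aggregates, assuming each evaluator score is binary (1 for a pass, 0 for a fail): an average of 0.2 over 5 examples corresponds to a single passing example, and a mode of 0 means failing was the most common outcome. The sketch below reproduces the same summary from a list of per-example scores. The list itself is hypothetical, since the output above does not say which example passed.

```python
from statistics import mean, mode

# Hypothetical per-example scores: 1 = pass, 0 = fail.
# Only the aggregate is known from the output above, not the ordering.
scores = [0, 0, 1, 0, 0]

stats = {"n": len(scores), "avg": mean(scores), "mode": mode(scores)}
passed = sum(scores)

print(stats)
print(f"{passed} of {stats['n']} examples passed")
```

With only five examples, a single changed verdict moves the average by 0.2, so treat these numbers as a prompt to inspect individual runs rather than as a precise measurement.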