# Evaluating Using Fixed Sources

A simple RAG pipeline has at least two components: the retriever and the response generator. You can evaluate the whole chain end-to-end, as shown in the [QA Correctness](../qa-correctness/) walkthrough, but you will likely get more actionable, fine-grained metrics by evaluating each component in isolation.

This example evaluates the response generation component by fixing the retrieved documents within the dataset inputs. In this walkthrough, you will use this dataset to evaluate the response generator with both a correctness evaluator and a custom "faithfulness" evaluator.

![Custom Evaluator](./img/example_results.png)

## Prerequisites

First, install the required packages and configure your environment.

In [None]:
%pip install -U langchain openai anthropic

In [1]:
import os
import uuid

os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com" # Update with your API URL if using a hosted instance of LangSmith.
# os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY" # Update with your API key
uid = uuid.uuid4()

## 1. Create a dataset

Next, we'll create a dataset. The simple dataset below illustrates the correctness and faithfulness metrics we want to use.

In [2]:
# A simple example dataset
examples = [
    {
        "inputs": {
            "question": "What's the company's total revenue for q2 of 2022?",
            "documents": [
                {
                    "metadata": {},
                    "page_content": "In q1 the lemonade company made $4.95. In q2 revenue increased by a sizeable amount to just over $2T dollars."
                }
            ],
        },
        "outputs": {
            "label": "2 trillion dollars",
        },
    },
    {
        "inputs": {
            "question": "Who is Lebron?",
            "documents": [
                {
                    "metadata": {},
                    "page_content": "On Thursday, February 16, Lebron James was nominated as President of the United States."
                }
            ],
        },
        "outputs": {
            "label": "Lebron James is the President of the USA.",
        },
    }
]

In [3]:
from langsmith import Client

client = Client()

dataset_name = f"Faithfulness Example - {uid}"
dataset = client.create_dataset(dataset_name=dataset_name)

In [4]:
client.create_examples(inputs=[e["inputs"] for e in examples], outputs=[e["outputs"] for e in examples], dataset_id=dataset.id)

## 2. Define Chain

Suppose your chain is composed of two main components: a retriever and a response synthesizer. Using LangChain runnables, it's easy to separate these two components and evaluate each in isolation.

Below is a very simple RAG chain with a placeholder retriever. For our testing, we will evaluate ONLY the response synthesizer.

In [5]:
from langchain import chat_models, prompts
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.retriever import BaseRetriever, Document

class MyRetriever(BaseRetriever):
    def _get_relevant_documents(self, query, *, run_manager):
        return [Document(page_content="Example")]

# This is what we will evaluate
response_synthesizer = (
    prompts.ChatPromptTemplate.from_messages(
        [
            ("system", "Respond using the following documents as context:\n{documents}"),
            ("user", "{question}")
        ]
    ) | chat_models.ChatAnthropic(model="claude-2", max_tokens=1000)
)

# Full chain below for illustration
chain = (
    {
        "documents": MyRetriever(),
        "question": RunnablePassthrough(),
    }
    | response_synthesizer
)
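
The `{documents}` placeholder in the prompt above receives whatever value is stored under `documents`, so a list of document dicts from the dataset is likely rendered using its raw Python representation. If you would rather pass cleaner context to the model, one option is to join the page contents into a single string first. The following is a minimal sketch in plain Python; `format_documents` is a hypothetical helper introduced here for illustration and is not part of LangChain.

```python
from typing import Any, Dict, List


def format_documents(documents: List[Dict[str, Any]]) -> str:
    """Join the page_content of each document dict into one context string.

    Hypothetical helper: assumes each document is a dict with a
    "page_content" key, as in the example dataset above.
    """
    return "\n\n".join(
        f"Document {i}:\n{doc['page_content']}"
        for i, doc in enumerate(documents, start=1)
    )


docs = [
    {"metadata": {}, "page_content": "In q1 the lemonade company made $4.95."},
    {"metadata": {}, "page_content": "In q2 revenue increased."},
]
context = format_documents(docs)
print(context)
```

How you wire a helper like this into the chain depends on your setup; the key point is that the synthesizer sees exactly the documents stored in the dataset, not the output of a live retriever.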

## 3. Evaluate

Below, we will define a custom "FaithfulnessEvaluator" that measures how faithful the chain's output prediction is to the reference input documents, given the user's input question.

In this case, we will wrap the [Scoring Eval Chain](https://python.langchain.com/docs/guides/evaluation/string/scoring_eval_chain) and manually select which fields in the run and dataset example to use to represent the prediction, input question, and reference.

In [6]:
from langsmith.evaluation import RunEvaluator, EvaluationResult
from langchain.evaluation import load_evaluator

class FaithfulnessEvaluator(RunEvaluator):

    def __init__(self):
        self.evaluator = load_evaluator(
            "labeled_score_string", 
            criteria={"faithful": "How faithful is the submission to the reference context?"},
            normalize_by=10,
        )

    def evaluate_run(self, run, example) -> EvaluationResult:
        res = self.evaluator.evaluate_strings(
            prediction=next(iter(run.outputs.values())),
            input=run.inputs["question"],
            # We are treating the documents as the reference context in this case.
            reference=example.inputs["documents"],
        )
        return EvaluationResult(key="labeled_criteria:faithful", **res)
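
The `normalize_by=10` argument in the evaluator above is worth a note: the scoring evaluator asks the grading LLM for a rating on a 1 to 10 scale, and dividing by 10 maps that rating onto a 0 to 1 range that is easier to compare with other metrics. The sketch below approximates that step in plain Python. It is an illustration under stated assumptions, not the library's actual implementation; `parse_and_normalize` and the `[[rating]]` verdict format used here are assumptions made for the example.

```python
import re
from typing import Optional


def parse_and_normalize(verdict: str, normalize_by: Optional[float] = None) -> float:
    """Extract a 1-10 rating written as [[N]] and optionally normalize it.

    Hypothetical helper that mimics the idea behind `normalize_by`.
    """
    match = re.search(r"\[\[(\d+)\]\]", verdict)
    if match is None:
        raise ValueError(f"No rating found in verdict: {verdict!r}")
    score = int(match.group(1))
    if not 1 <= score <= 10:
        raise ValueError(f"Rating out of range: {score}")
    # Without normalize_by, return the raw rating as a float.
    if normalize_by is None:
        return float(score)
    return score / normalize_by


verdict = "The submission is grounded in the context. Rating: [[8]]"
raw_score = parse_and_normalize(verdict)
normalized_score = parse_and_normalize(verdict, normalize_by=10)
print(raw_score, normalized_score)
```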

In [7]:
from langchain.smith import RunEvalConfig

eval_config = RunEvalConfig(
    evaluators=["qa"],
    custom_evaluators=[FaithfulnessEvaluator()],
    input_key="question",
)
results = client.run_on_dataset(
    llm_or_chain_factory=response_synthesizer,
    dataset_name=dataset_name,
    evaluation=eval_config,
)

View the evaluation results for project 'test-spotless-vessel-38' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/projects/p/79c76eb3-d2a6-4274-ab35-d64efeb6ec20
[------------------------------------------------->] 2/2

You can review the results in LangSmith to see how the chain fares. The trace for the custom faithfulness evaluator should look something like this:

[![](./img/example_score.png)](https://smith.langchain.com/public/9a4e6ee2-f26c-4bcd-a050-04766fbfd350/r)

## Discussion


Most of LangChain's open-source evaluators implement the "[StringEvaluator](https://python.langchain.com/docs/guides/evaluation/string/)" interface, meaning they compute a metric based on:

- An input string from the dataset example inputs (configurable by the RunEvalConfig's input_key property)
- An output prediction string from the evaluated chain's outputs (configurable by the RunEvalConfig's prediction_key property)
- (If labels are required) a reference string from the example outputs (configurable by the RunEvalConfig's reference_key property)

However, you may want to compute metrics based on multiple fields in the dataset. For instance, you could simultaneously compute standard labeled metrics like "correctness" as well as other metrics like "faithfulness" that depend on additional input fields. By configuring a custom RunEvaluator, you can measure your chain's performance using data from any fields in the example inputs and outputs.
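
To make the field selection concrete, here is a minimal, dependency-free sketch of an evaluator function that reads the prediction from the run outputs and the context from the example inputs. `FakeRun`, `FakeExample`, and `context_token_overlap` are hypothetical stand-ins for illustration only: they mirror the `inputs` and `outputs` attributes used in this walkthrough but are not LangSmith classes, and the token-overlap score is a crude proxy, not the LLM-graded faithfulness metric used above.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class FakeRun:
    """Stand-in for a traced run (assumption: has inputs and outputs dicts)."""
    inputs: Dict[str, Any]
    outputs: Dict[str, Any]


@dataclass
class FakeExample:
    """Stand-in for a dataset example (assumption: has inputs and outputs dicts)."""
    inputs: Dict[str, Any]
    outputs: Dict[str, Any]


def context_token_overlap(run: FakeRun, example: FakeExample) -> Dict[str, Any]:
    """Score the fraction of prediction tokens that appear in the fixed documents."""
    # Prediction comes from the run outputs.
    prediction = str(next(iter(run.outputs.values()))).lower().split()
    # Reference context comes from the example *inputs*, not the outputs.
    context_tokens = set(
        " ".join(d["page_content"] for d in example.inputs["documents"]).lower().split()
    )
    supported = sum(1 for token in prediction if token in context_tokens)
    score = supported / len(prediction) if prediction else 0.0
    return {"key": "context_token_overlap", "score": score}


example = FakeExample(
    inputs={
        "question": "What's the company's total revenue for q2 of 2022?",
        "documents": [
            {"metadata": {}, "page_content": "In q2 revenue increased to just over $2T dollars."}
        ],
    },
    outputs={"label": "2 trillion dollars"},
)
run = FakeRun(inputs=example.inputs, outputs={"output": "revenue was just over $2T"})
result = context_token_overlap(run, example)
print(result)
```

The same pattern applies to any metric: pick the fields you need from `run` and `example`, compute the score, and return it under a descriptive key.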
