## Component-wise Evaluation of Retrieval-Augmented Generation Systems

### Motivation
End-to-end evaluation of retrieval-augmented generation (RAG) systems often obscures the source of failures. An incorrect or unfaithful answer may arise from retrieval errors, generation errors, or interactions between the two. Without isolating components, it is difficult to determine whether a system failed because relevant information was unavailable, incorrectly retrieved, or improperly used.

This notebook studies component-wise evaluation as a way to attribute failures more precisely within RAG pipelines.

### Experimental Setup
We evaluate the response generation component independently by supplying fixed, curated source documents to the generator. By holding retrieval constant, we can examine whether the model correctly conditions on provided evidence, adheres to source material, and avoids introducing unsupported claims.

This setup allows us to separate failures of information access from failures of information use.

### What Component-wise Evaluation Reveals
Isolating the generator surfaces several important failure modes:
- hallucinated content despite correct evidence being available,
- partial use of retrieved information,
- selective attention to salient but irrelevant passages,
- confident synthesis that diverges subtly from source documents.

These behaviours are often masked in end-to-end evaluations, where retrieval and generation errors are conflated.
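
As a rough illustration of the first failure mode, the sketch below flags answer tokens that never appear in the supplied documents. It is a crude lexical heuristic introduced here for illustration only, not the evaluator used later in this notebook. The helper names `content_tokens` and `unsupported_tokens` are hypothetical, and a paraphrased but faithful answer would still be flagged.

```python
import re


def content_tokens(text):
    """Return the set of lowercase alphanumeric tokens with at least two characters."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if len(t) >= 2}


def unsupported_tokens(answer, documents):
    """List answer tokens that do not occur in any document's page_content."""
    evidence = set()
    for doc in documents:
        evidence |= content_tokens(doc["page_content"])
    return sorted(content_tokens(answer) - evidence)


docs = [
    {
        "metadata": {},
        "page_content": "In q1 the lemonade company made $4.95. In q2 revenue "
        "increased by a sizeable amount to just over $2T dollars.",
    }
]

grounded = "Revenue increased to just over $2T dollars."
drifted = "Revenue increased to roughly $3T after a merger."

print(unsupported_tokens(grounded, docs))  # []
print(unsupported_tokens(drifted, docs))   # ['3t', 'after', 'merger', 'roughly']
```

A check like this can only hint at divergence from the sources; the LLM-based faithfulness evaluator defined later in the notebook is the actual measurement.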

### Prerequisites and Setup

In [1]:
# The '%pip install' command installs python packages from the notebook.
# -U flag ensures we get the latest versions.
%pip install -U langchain openai anthropic

In [2]:
import os # Import the 'os' module to interact with the operating system.
import uuid # Import the uuid library to generate unique identifiers.

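# Update with your API URL if using a hosted instance of LangSmith.
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com" # Set the LangSmith API endpoint.
os.environ["LANGCHAIN_API_KEY"] = "YOUR API KEY"  # Replace the placeholder with your LangSmith API key.
# The Anthropic model used below also expects ANTHROPIC_API_KEY in the environment;
# the built-in evaluators typically default to an OpenAI model (OPENAI_API_KEY).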
uid = uuid.uuid4() # Generate a unique ID to keep dataset names unique.

### Create a Dataset with Fixed Context

The key feature of this dataset is its structure. For each example, the `inputs` dictionary contains both the user's `question` and a list of `documents`, which is the fixed context the response generator will use. The `outputs` dictionary contains the `label`, the ground-truth answer used for the correctness check.

The examples below are designed to test whether the response generator extracts information correctly and whether it sets aside its pre-trained knowledge in favor of the provided (and sometimes counter-intuitive) context.
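
Because the evaluation depends on this shape, it can help to check each example before uploading it. The following is a minimal sketch under the structure described above; `validate_example` is a hypothetical helper, not part of LangSmith, and the sample dictionaries are illustrative.

```python
def validate_example(example):
    """Return a list of structural problems in one example (empty if none)."""
    problems = []
    inputs = example.get("inputs", {})
    outputs = example.get("outputs", {})

    # The question must be a non-empty string.
    question = inputs.get("question")
    if not isinstance(question, str) or not question.strip():
        problems.append("inputs.question must be a non-empty string")

    # The fixed context must be a non-empty list of dicts with page_content.
    documents = inputs.get("documents")
    if not isinstance(documents, list) or not documents:
        problems.append("inputs.documents must be a non-empty list")
    else:
        for i, doc in enumerate(documents):
            if not isinstance(doc, dict) or not isinstance(doc.get("page_content"), str):
                problems.append(f"inputs.documents[{i}] needs a string page_content")

    # The label is the ground truth for the correctness check.
    if not isinstance(outputs.get("label"), str):
        problems.append("outputs.label must be a string")
    return problems


good = {
    "inputs": {
        "question": "Who is Lebron?",
        "documents": [{"metadata": {}, "page_content": "Example context."}],
    },
    "outputs": {"label": "Example label."},
}
bad = {"inputs": {"question": "Who is Lebron?"}, "outputs": {}}

print(validate_example(good))
print(validate_example(bad))
```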

In [3]:
# A simple example dataset to illustrate the concept.
examples = [
    {
        "inputs": {
            "question": "What's the company's total revenue for q2 of 2022?",
            # The 'documents' are part of the input for the component we are testing.
            "documents": [
                {
                    "metadata": {},
                    "page_content": "In q1 the lemonade company made $4.95. In q2 revenue increased by a sizeable amount to just over $2T dollars.",
                }
            ],
        },
        "outputs": {
            # The 'label' is the ground-truth answer for correctness evaluation.
            "label": "2 trillion dollars",
        },
    },
    {
        "inputs": {
            "question": "Who is Lebron?",
            # This document provides a fictional, counter-intuitive context.
            "documents": [
                {
                    "metadata": {},
                    "page_content": "On Thursday, February 16, Lebron James was nominated as President of the United States.",
                }
            ],
        },
        "outputs": {
            "label": "Lebron James is the President of the USA.",
        },
    },
]

In [4]:
from langsmith import Client # Import the Client class to interact with LangSmith.

client = Client() # Instantiate the LangSmith client.

dataset_name = f"Faithfulness Example - {uid}" # Create a unique name for our dataset.
dataset = client.create_dataset(dataset_name=dataset_name) # Create the dataset on the LangSmith platform.
# Create the examples in the dataset.
client.create_examples(
    inputs=[e["inputs"] for e in examples], # Pass the list of input dictionaries.
    outputs=[e["outputs"] for e in examples], # Pass the list of output dictionaries.
    dataset_id=dataset.id, # Link these examples to the dataset we just created.
)

### Define the Chain Component

The full chain is shown for context, but the **`response_synthesizer`** component is defined separately because it is the specific part of the chain that we evaluate. It takes a dictionary containing `documents` and a `question` and generates the final answer.

In [5]:
from langchain import chat_models, prompts # Import core LangChain components.
from langchain_core.documents import Document # Import the Document class.
from langchain_core.retrievers import BaseRetriever # Import the base retriever class.
from langchain_core.runnables import RunnablePassthrough # Import a passthrough runnable.


# This is a placeholder retriever to illustrate the full chain. It will not be used in our evaluation.
class MyRetriever(BaseRetriever):
    def _get_relevant_documents(self, query, *, run_manager):
        return [Document(page_content="Example")]


# This is the specific component we will be evaluating.
response_synthesizer = prompts.ChatPromptTemplate.from_messages(
    [
        ("system", "Respond using the following documents as context:\n{documents}"),
        ("user", "{question}"),
    ]
) | chat_models.ChatAnthropic(model="claude-2", max_tokens=1000) # We pipe the prompt to an LLM.

# The full RAG chain is shown below for illustrative purposes only.
chain = {
    "documents": MyRetriever(),
    "question": RunnablePassthrough(),
} | response_synthesizer
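
The keys of the mapping that feeds the synthesizer have to match the prompt's placeholders exactly. The sketch below mimics that contract with plain `str.format` as a stand-in for the prompt template; it is not how LangChain renders prompts internally, and `render_messages` is a hypothetical helper.

```python
SYSTEM_TEMPLATE = "Respond using the following documents as context:\n{documents}"
USER_TEMPLATE = "{question}"


def render_messages(values):
    """Fill both templates from one dict; a missing placeholder key raises KeyError."""
    return [
        ("system", SYSTEM_TEMPLATE.format(**values)),
        ("user", USER_TEMPLATE.format(**values)),
    ]


# Keys match the placeholders, so rendering succeeds.
messages = render_messages({"documents": ["Example"], "question": "Who is Lebron?"})
print(messages[1])

# A misspelled key leaves the {question} placeholder unfilled.
try:
    render_messages({"documents": ["Example"], "quesiton": "Who is Lebron?"})
except KeyError as missing:
    print(f"missing prompt variable: {missing}")
```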

### Define a Custom Faithfulness Evaluator

To measure faithfulness, we need an evaluator that checks if the model's `prediction` is consistent with the provided `documents`. Standard evaluators assume the reference context comes from the `outputs` of a dataset example. In our case, the context (the documents) is in the `inputs`.

To handle this, we'll create a custom `FaithfulnessEvaluator`. This class will wrap a standard LangChain scoring evaluator but will override the data mapping. It will tell the underlying evaluator to use:
- The model's generation as the `prediction`.
- The `question` from the run's inputs as the `input`.
- The `documents` from the *example's inputs* as the `reference` context.

This allows us to use an off-the-shelf LLM-based scoring mechanism with our custom dataset structure.

In [6]:
from langsmith.evaluation import RunEvaluator, EvaluationResult # Import the base classes for custom evaluation.
from langchain.evaluation import load_evaluator # Import a helper to load built-in evaluators.


# Define our custom evaluator class, inheriting from RunEvaluator.
class FaithfulnessEvaluator(RunEvaluator):
    def __init__(self):
        # Initialize a built-in 'labeled_score_string' evaluator.
        # This evaluator uses an LLM to score a prediction on a 1-10 scale based on given criteria.
        self.evaluator = load_evaluator(
            "labeled_score_string",
            criteria={
                "faithful": "How faithful is the submission to the reference context?"
            },
            normalize_by=10, # Normalize the score to be between 0 and 1.
        )

    # This is the core method that LangSmith will call for each run.
    def evaluate_run(self, run, example) -> EvaluationResult:
        # Call the underlying evaluator's 'evaluate_strings' method with custom-mapped fields.
        res = self.evaluator.evaluate_strings(
            prediction=next(iter(run.outputs.values())), # The LLM's generated answer.
            input=run.inputs["question"], # The user's question.
            # This is the key part: we use the 'documents' from the example's INPUTS as the reference context.
            reference=str(example.inputs["documents"]),
        )
        # Return the result in the standard EvaluationResult format.
        return EvaluationResult(key="labeled_criteria:faithful", **res)
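
To see what the underlying evaluator receives, the sketch below applies the same field mapping to stand-in objects, without calling LangSmith or an LLM. The `SimpleNamespace` objects only imitate the `inputs` and `outputs` attributes used above; `map_fields` and `normalize` are hypothetical helpers, and `normalize` assumes that `normalize_by` simply divides the raw 1-10 score.

```python
from types import SimpleNamespace


def map_fields(run, example):
    """Mirror the field mapping used in evaluate_run."""
    return {
        "prediction": next(iter(run.outputs.values())),
        "input": run.inputs["question"],
        "reference": str(example.inputs["documents"]),
    }


def normalize(raw_score, normalize_by=10):
    """Scale a raw 1-10 score into the 0-1 range."""
    return raw_score / normalize_by


documents = [
    {
        "metadata": {},
        "page_content": "On Thursday, February 16, Lebron James was nominated "
        "as President of the United States.",
    }
]
inputs = {"question": "Who is Lebron?", "documents": documents}

# Stand-ins for the run and the dataset example (illustrative values only).
example = SimpleNamespace(inputs=inputs)
run = SimpleNamespace(
    inputs=inputs,
    outputs={"output": "Lebron James was nominated as President of the United States."},
)

fields = map_fields(run, example)
print(fields["reference"])  # the documents list rendered as a Python literal
print(normalize(8))         # 0.8
```

Note that `str(...)` passes the documents to the judge as a Python literal, metadata included; joining the `page_content` strings would be an alternative if a cleaner reference is preferred.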

### Run the Evaluation

Now we can run the evaluation. We will configure it to use two evaluators:
1. The standard `"qa"` evaluator, which will measure correctness against the `label` in our dataset outputs.
2. Our custom `FaithfulnessEvaluator`, which will measure how grounded the response is in the provided documents.

We will pass the `response_synthesizer` directly as the system to be tested.

In [7]:
from langchain.smith import RunEvalConfig # Import the evaluation configuration class.

# Create an evaluation configuration.
eval_config = RunEvalConfig(
    evaluators=["qa"], # Include the standard 'qa' correctness evaluator.
    custom_evaluators=[FaithfulnessEvaluator()], # Include our custom faithfulness evaluator.
    input_key="question", # Tell the 'qa' evaluator to use the 'question' field from the inputs.
)
# Run the evaluation on the dataset.
results = client.run_on_dataset(
    llm_or_chain_factory=response_synthesizer, # The specific component to be tested.
    dataset_name=dataset_name, # The name of our dataset in LangSmith.
    evaluation=eval_config, # The evaluation configuration.
)

View the evaluation results for project 'test-puzzled-texture-92' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/projects/p/4d35dd98-d797-47ce-ae4b-608e96ddf6bf
[------------------------------------------------->] 2/2

You can now review the results in LangSmith by clicking the link in the output above. You will see scores for both correctness (`qa`) and faithfulness (`labeled_criteria:faithful`). Inspecting the trace for the faithfulness evaluator will show how the LLM judged the response against the provided documents.

[![](./img/example_score.png)](https://smith.langchain.com/public/9a4e6ee2-f26c-4bcd-a050-04766fbfd350/r)
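
One way to read the two scores together is a simple triage, sketched below. The 0.5 threshold, the category wording, and the `diagnose` helper are assumptions made for illustration; the sketch also assumes both scores lie in the range 0 to 1. It is not part of LangSmith.

```python
def diagnose(correctness, faithfulness, threshold=0.5):
    """Combine correctness and faithfulness scores into a coarse diagnosis."""
    correct = correctness >= threshold
    faithful = faithfulness >= threshold
    if correct and faithful:
        return "ok: answer matches the label and is grounded in the documents"
    if correct and not faithful:
        return "ungrounded: answer matches the label but may rely on prior knowledge"
    if faithful:
        return "check data: answer follows the documents but disagrees with the label"
    return "generation failure: answer is neither correct nor grounded"


# Hypothetical score pairs, not results from this notebook.
for scores in [(1.0, 0.9), (1.0, 0.2), (0.0, 0.8), (0.0, 0.1)]:
    print(scores, "->", diagnose(*scores))
```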

### Limitations
Component-wise evaluation does not capture interactions between retrieval and generation that occur in full pipelines, and it does not assess retrieval quality itself. Its value lies in diagnosis rather than completeness: it provides clarity about where failures occur, not a holistic system score.

### Role in a Broader Evaluation Framework
Within this project, component-wise RAG evaluation complements structured validation, LLM-based judging, and trajectory analysis. When used together, these methods allow failures to be localised to specific stages of the pipeline, enabling targeted improvements rather than broad, undirected changes.

### Discussion
As RAG systems are increasingly deployed in settings where faithfulness matters, understanding whether models are using evidence correctly becomes as important as whether they can retrieve it. This notebook demonstrates how isolating components can turn opaque system failures into tractable research questions.