# Evaluating RAG systems with DeepEval

In RAG systems, the output quality depends not only on the language model but also on the relevance and quality of retrieved documents. To build trustworthy and robust RAG applications, we need ways to evaluate them beyond simple accuracy.

This notebook walks through a practical setup for evaluating RAG outputs using the `deepeval` library. It focuses on measuring correctness, faithfulness to retrieved context, and contextual relevance of outputs.

In [3]:
from deepeval import evaluate
from deepeval.metrics import GEval, FaithfulnessMetric, ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from dotenv import load_dotenv
import os

# Load environment variables from a .env file
load_dotenv()

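# load_dotenv() puts OPENAI_API_KEY into os.environ when .env defines it;
# fail early with a clear message if it is still missing
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")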

### Test correctness
Let’s begin by checking whether the model’s answer is factually correct relative to the expected answer. This is the most basic evaluation step: making sure the model tells the truth.

In [4]:
# Define a GEval metric for checking correctness
correctness_metric = GEval(
    name="Correctness",  # Name for logging or reports
    model="gpt-3.5-turbo-0125",  # Evaluation model
    evaluation_params=[
        LLMTestCaseParams.EXPECTED_OUTPUT,  # What the model should have said
        LLMTestCaseParams.ACTUAL_OUTPUT  # What the model actually said
    ],
    evaluation_steps=[
        # The prompt used internally to guide the LLM's judgment
        "Determine whether the actual output is factually correct based on the expected output."
    ],
)

# Define a test case: Ground truth vs prediction
gt_answer = "Madrid is the capital of Spain."  # This is the expected, factually correct output
pred_answer = "MadriD."  # This is what our RAG model produced - Incorrect casing and slightly vague

# Construct the test case to evaluate
test_case_correctness = LLMTestCase(
    input="What is the capital of Spain?",  # The original question
    expected_output=gt_answer,  # Ground truth answer
    actual_output=pred_answer,  # Model-generated answer
)

# Evaluate the test case using the correctness metric
correctness_metric.measure(test_case_correctness)
print(correctness_metric.score)

0.17888971300910023


Here, we define and run a factual correctness test on a single QA pair, comparing a predicted answer with a known correct answer. We are using `GEval`, which wraps an LLM to judge correctness by semantic and factual alignment rather than by exact string match. This allows flexibility in wording while still penalizing factual errors. Even though "MadriD." names the right city, it scored only about 0.18 here, likely because of the odd capitalization and the terse, incomplete phrasing. The score reflects how confident the evaluator model is that the answer is valid.

This kind of evaluation is useful when we are building systems where correctness matters more than style or structure — like education, factual assistants, or research tools.
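
For contrast, here is a minimal sketch of the string-based baselines that an LLM judge is meant to improve on. The helpers `normalize`, `exact_match`, and `token_recall` are hypothetical names written for this illustration and are not part of `deepeval`. They show why surface comparison alone struggles with an answer like "MadriD.".

```python
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())


def exact_match(expected: str, actual: str) -> bool:
    """True only if both strings are identical after normalization."""
    return normalize(expected) == normalize(actual)


def token_recall(expected: str, actual: str) -> float:
    """Fraction of expected tokens that also appear in the actual answer."""
    expected_tokens = normalize(expected).split()
    actual_tokens = set(normalize(actual).split())
    if not expected_tokens:
        return 0.0
    hits = sum(token in actual_tokens for token in expected_tokens)
    return hits / len(expected_tokens)


# Same strings as the correctness test case above
baseline_expected = "Madrid is the capital of Spain."
baseline_actual = "MadriD."

print(exact_match(baseline_expected, baseline_actual))   # False: the strings differ
print(token_recall(baseline_expected, baseline_actual))  # 1 of 6 expected tokens matched
```

Neither baseline can tell that a one-word answer naming the right city is essentially correct, which is the gap an LLM-based judge such as `GEval` is intended to close.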


### Test faithfulness
Even when an answer is factually correct, it might still be unfaithful — meaning it includes information that wasn't supported by the retrieved context. For RAG systems, we want to ensure that the model doesn't hallucinate facts or "make things up" beyond what was retrieved. This step checks whether the answer remains grounded in the context.

In [5]:
# Define a simple test case: question and its retrieved context
question = "what is 3+3?"
context = ["6"]  # Simulated retrieved documents — in this case, a simple answer
generated_answer = "6"  # Model output (correct and consistent)

# Instantiate the faithfulness metric to check if the answer aligns with context
faithfulness_metric = FaithfulnessMetric(
    threshold=0.7,  # Score below this would flag the answer as unfaithful
    model="gpt-3.5-turbo-0125",  # LLM that performs the evaluation
    include_reason=False  # Skip explanations for now
)

# Package everything into a test case
test_case = LLMTestCase(
    input=question,
    actual_output=generated_answer,
    retrieval_context=context,
)

# Evaluate the model's answer for alignment with the context
faithfulness_metric.measure(test_case)
# Print the faithfulness score (ranges from 0 to 1)
print(faithfulness_metric.score)
# Optional: print reason if include_reason=True was set
#print(faithfulness_metric.reason)

1.0


This code sets up a small test to validate whether a model's output is faithful to the retrieved context. The answer is `"6"`, which aligns directly with the context (also just `"6"`), so the score is expected to be high.

Technically, the `FaithfulnessMetric` uses an LLM to compare the answer against the supporting documents. It does not check whether the answer is correct in the abstract; it checks whether the answer is justified by what was retrieved. This is a critical distinction in RAG systems, where correctness alone isn't enough: faithfulness ensures the model is not freelancing.

This metric is especially useful when retrieved content includes partial or ambiguous information and we want to catch hallucinations or unsupported reasoning in the generated answer.
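
To make the idea concrete, here is a rough sketch of a faithfulness-style score, assuming it is computed as the fraction of claims in the answer that the retrieved context supports. In the real metric an LLM extracts the claims and judges each one. Here the claims are listed by hand and `naive_is_supported` is a toy substring check standing in for the judge. All names in this block are hypothetical.

```python
from typing import Callable, List


def faithfulness_score(
    claims: List[str],
    context_chunks: List[str],
    is_supported: Callable[[str, List[str]], bool],
) -> float:
    """Share of claims that the judge function marks as supported by the context."""
    if not claims:
        # Assumption for this sketch: an answer with no claims cannot contradict anything
        return 1.0
    supported = sum(1 for claim in claims if is_supported(claim, context_chunks))
    return supported / len(claims)


def naive_is_supported(claim: str, context_chunks: List[str]) -> bool:
    """Toy stand-in for the LLM judge: case-insensitive substring match."""
    return any(claim.lower() in chunk.lower() for chunk in context_chunks)


example_context = ["Madrid is the capital of Spain."]
example_claims = [
    "Madrid is the capital of Spain",   # present in the context
    "Madrid has 10 million residents",  # not in the context: an unsupported claim
]

example_score = faithfulness_score(example_claims, example_context, naive_is_supported)
print(example_score)  # one of two claims is supported
```

The second claim lowers the score regardless of whether it happens to be true, because nothing in the context backs it up. That is the difference between faithfulness and correctness.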


### Test contextual relevancy
Even if a model produces a fluent and factually correct answer, that does not mean the retrieved context was useful. In a RAG setup, we want retrieval to return passages that actually bear on the question. This part tests how much of the retrieved context is relevant to the input, and how much of it is noise.

In [6]:
# Define a sample generated answer
actual_output = "then go somewhere else."
# Define the context that was retrieved by the RAG system
retrieval_context = ["this is a test context","mike is a cat","if the shoes don't fit, then go somewhere else."]
# Define the expected answer based on relevant content
gt_answer = "if the shoes don't fit, then go somewhere else."

# Create the contextual relevancy metric
relevance_metric = ContextualRelevancyMetric(
    threshold=1,  # Maximum score threshold — very strict in this example
    model="gpt-3.5-turbo-0125",  # Evaluation model used to judge relevance
    include_reason=True  # Ask the model to explain the score it gives
)

# Package everything into a DeepEval test case
relevance_test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output,  # What the model said
    retrieval_context=retrieval_context,  # Retrieved context the model had access to
    expected_output=gt_answer,  # What we expected the model to say

)

# Evaluate how relevant the retrieved context is to the input question
relevance_metric.measure(relevance_test_case)
# Print the relevancy score (0–1, higher is better)
print(relevance_metric.score)
# Print the reason from the evaluator (optional)
print(relevance_metric.reason)

0.3333333333333333
The score is 0.33 because the majority of the statements in the retrieval context are not relevant to the input about shoe fitting, but there is one statement that directly addresses the issue.


This test checks how much of the retrieved context is relevant to the question. The key here is not whether the answer is “good”; it is whether the retriever returned passages that bear on the input.

In this case, only one of the three context chunks is relevant to the question about shoes. The rest are noise, which is why the score is 0.33. The metric relies on an LLM evaluator to judge the retrieved statements against the input, as the evaluator's own explanation in the output indicates.

This helps evaluate whether retrieval is pulling useful content or just filler — a critical distinction in high-stakes or noisy RAG deployments where irrelevant documents might drown out the real signal. We are not testing the generation here — we are testing the retrieval pipeline’s contribution to the final output.
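
The 0.33 reported above is consistent with one relevant chunk out of three. As a small worked sketch, assuming the score is the ratio of relevant statements to all retrieved statements, the arithmetic looks like this. The relevance verdicts are labelled by hand here, whereas the metric obtains them from an LLM, and the variable names are illustrative only.

```python
# The three retrieved chunks from the test case above
chunks = [
    "this is a test context",
    "mike is a cat",
    "if the shoes don't fit, then go somewhere else.",
]

# Hand-labelled verdicts: is each chunk relevant to "What if these shoes don't fit?"
chunk_is_relevant = [False, False, True]

# Assumed scoring rule: relevant statements divided by total statements
relevancy = sum(chunk_is_relevant) / len(chunks)

# With threshold=1, every retrieved chunk must be relevant for the case to pass
strict_threshold = 1.0
passes = relevancy >= strict_threshold

print(round(relevancy, 2))
print(passes)
```

A threshold of 1 is deliberately strict: a single irrelevant chunk is enough to fail the case, which makes it useful for spotting noisy retrieval.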


### Evaluating multiple test cases with multiple metrics
Once we have defined individual metrics like correctness, faithfulness, and contextual relevance, it becomes useful to scale up our evaluation. Instead of testing cases one by one, `deepeval` allows us to run multiple test cases through multiple metrics in a single evaluation step.

This part showcases how to do that — an essential feature when we want to benchmark our RAG system across several dimensions at once.

#### Add another test case
Before running a batch evaluation, we create another reusable test case — this time a basic factual one.

In [7]:
# Define a second test case to evaluate
new_test_case = LLMTestCase(
    input="What is the capital of Spain?",  # Original question
    expected_output="Madrid is the capital of Spain.",  # Ground truth
    actual_output="MadriD.",  # Model's generated output (formatting is slightly off)
    retrieval_context=["Madrid is the capital of Spain."]  # What the model had access to
)

This test checks how well the model answers a factual question and whether it stays grounded in the provided context. Although this case seems simple, it helps validate multiple evaluation layers simultaneously: is the answer accurate? Is it grounded in the context? Is the context even relevant?

#### Run multi-metric evaluation
Now that we have two test cases and three metrics defined (correctness, faithfulness, relevance), we evaluate all combinations at once. This is where `deepeval` really shines — we can get structured, consistent evaluation across different dimensions.

In [None]:
# Run all three metrics on both test cases in one batch
evaluate(
    test_cases=[relevance_test_case, new_test_case],  # The cases we defined
    metrics=[correctness_metric, faithfulness_metric, relevance_metric]  # Metrics we want to apply
)

This code runs all provided test cases against the listed metrics, using the appropriate LLM behind each metric. Internally, `deepeval` will execute the evaluation logic for each metric, generate a score, and optionally include an explanation or rationale.

The real strength here is consistency — every test case is measured against the same standards, enabling meaningful comparison and deeper insight into where our RAG system may be strong or weak.

This also scales beautifully: if we are testing dozens or hundreds of cases, we only need to define them once and plug them into this batch call.
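
When reading batch results, each score is compared against its metric's threshold. The sketch below illustrates that pass/fail logic in plain Python, assuming a metric passes when its score is at least its threshold. The scores are the ones printed earlier in this notebook (they come from different test cases), the faithfulness and relevancy thresholds are the ones set above, and the correctness threshold of 0.5 is an assumption because none was set explicitly. `summarize` is a hypothetical helper, not a `deepeval` function.

```python
from typing import Dict, List, Tuple

# (metric name, score, threshold)
recorded_results: List[Tuple[str, float, float]] = [
    ("Correctness", 0.17888971300910023, 0.5),  # threshold assumed, not set in the notebook
    ("Faithfulness", 1.0, 0.7),
    ("Contextual Relevancy", 0.3333333333333333, 1.0),
]


def summarize(results: List[Tuple[str, float, float]]) -> Dict[str, bool]:
    """Map each metric name to whether its score meets its threshold."""
    return {name: score >= threshold for name, score, threshold in results}


summary = summarize(recorded_results)
for name, passed in summary.items():
    print(f"{name}: {'pass' if passed else 'fail'}")
```

Viewing results this way makes it easier to see which dimension is failing: here the generation is grounded, but the answer wording and the retrieval noise both fall short of their thresholds.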

### Utility: Batch creation of test cases from lists
When we are evaluating many inputs at once — especially during QA pipelines or automated benchmarks — creating `LLMTestCase` objects manually becomes tedious and error-prone. This helper function solves that by turning raw data into a clean list of test cases, ready for evaluation.

It accepts four parallel lists:
- Questions (inputs)
- Ground truth answers (expected outputs)
- Model outputs (generated answers)
- Retrieved documents (context retrieved for each question)

In [8]:
# Utility function to generate a list of LLMTestCase objects for batch evaluation
def create_deep_eval_test_cases(questions, gt_answers, generated_answers, retrieved_documents):
    return [
        LLMTestCase(
            input=question,  # Original user query
            expected_output=gt_answer,  # Ground truth answer
            actual_output=generated_answer,  # Output generated by the model
            retrieval_context=retrieved_document  # List of retrieved document snippets
        )
        for question, gt_answer, generated_answer, retrieved_document in zip(
            questions, gt_answers, generated_answers, retrieved_documents
        )
    ]

This function loops over four input sequences simultaneously using `zip()` — one for each element of a test case — and constructs a list of `LLMTestCase` objects. Each test case encapsulates everything needed to evaluate a single example under correctness, faithfulness, or contextual relevance metrics.

This is especially useful when:
- We are running evaluation across a large dataset.
- We already have model outputs stored from a previous run.
- We are integrating `deepeval` into automated pipelines or scheduled batch jobs.

It turns messy raw data into a structured format ready for `deepeval`'s `evaluate()` function — helping ensure our RAG evaluation remains scalable, testable, and consistent.
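
One caveat worth guarding against: `zip()` stops at the shortest input, so a missing answer or context would silently drop test cases. A minimal sketch of a pre-check follows. `check_parallel_lists` is a hypothetical helper and the sample data simply reuses the examples from earlier in this notebook, with "6" assumed as the expected answer to the arithmetic question.

```python
def check_parallel_lists(**named_lists) -> int:
    """Raise ValueError if the lists differ in length; otherwise return the shared length."""
    lengths = {name: len(values) for name, values in named_lists.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Input lists differ in length: {lengths}")
    return next(iter(lengths.values()), 0)


questions = ["What is the capital of Spain?", "what is 3+3?"]
gt_answers = ["Madrid is the capital of Spain.", "6"]
generated_answers = ["MadriD.", "6"]
# Each entry is itself a list of snippets, one list per question
retrieved_documents = [["Madrid is the capital of Spain."], ["6"]]

n_cases = check_parallel_lists(
    questions=questions,
    gt_answers=gt_answers,
    generated_answers=generated_answers,
    retrieved_documents=retrieved_documents,
)
print(n_cases)

# After the check passes, the helper defined above can be called safely:
# test_cases = create_deep_eval_test_cases(
#     questions, gt_answers, generated_answers, retrieved_documents
# )
```

On Python 3.10 or later, passing `strict=True` to `zip()` inside the helper would give a similar guarantee without a separate check.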