# Deep Evaluation of RAG Systems using deepeval

## Overview

This code demonstrates the use of the `deepeval` library to perform comprehensive evaluations of Retrieval-Augmented Generation (RAG) systems. It covers various evaluation metrics and provides a framework for creating and running test cases.

## Key Components

1. Correctness Evaluation
2. Faithfulness Evaluation
3. Contextual Relevancy Evaluation
4. Combined Evaluation of Multiple Metrics
5. Batch Test Case Creation

## Evaluation Metrics

### 1. Correctness (GEval)

- Evaluates whether the actual output is factually correct based on the expected output.
- Uses `gpt-4o` as the evaluation model in this notebook.
- Compares the expected and actual outputs.

### 2. Faithfulness (FaithfulnessMetric)

- Assesses whether the generated answer is faithful to the provided context.
- Uses `gpt-4-turbo` as the evaluation model in this notebook.
- Can provide detailed reasons for the evaluation.

### 3. Contextual Relevancy (ContextualRelevancyMetric)

- Evaluates how relevant the retrieved context is to the question.
- Uses `gpt-4o` as the evaluation model in this notebook.
- Can provide detailed reasons for the evaluation.

## Key Features

1. Flexible Metric Configuration: Each metric can be customized with different models and parameters.
2. Multi-Metric Evaluation: Ability to evaluate test cases using multiple metrics simultaneously.
3. Batch Test Case Creation: Utility function to create multiple test cases efficiently.
4. Detailed Feedback: Options to include detailed reasons for evaluation results.

## Benefits of this Approach

1. Comprehensive Evaluation: Covers multiple aspects of RAG system performance.
2. Flexibility: Easy to add or modify evaluation metrics and test cases.
3. Scalability: Capable of handling multiple test cases and metrics efficiently.
4. Interpretability: Provides detailed reasons for evaluation results, aiding in system improvement.

## Conclusion

This deep evaluation approach using the `deepeval` library offers a robust framework for assessing the performance of RAG systems. By evaluating correctness, faithfulness, and contextual relevancy, it provides a multi-faceted view of system performance. This comprehensive evaluation is crucial for identifying areas of improvement and ensuring the reliability and effectiveness of RAG systems in real-world applications.

In [13]:
!pip install deepeval
!pip install ipywidgets


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [1]:
from deepeval import evaluate
from deepeval.metrics import GEval, FaithfulnessMetric, ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

In [2]:
import os
import deepeval

# Read the Confident AI key from the environment instead of hardcoding it in the notebook
deepeval.login_with_confident_api_key(os.environ["CONFIDENT_API_KEY"])


### Test Correctness

Compares the expected output with the actual output to test the response from an LLM.

#### Correctness

**Definition:** Measures whether the model’s answer is factually correct compared to the expected (ground truth) answer.

**How it works:** Compares the actual output to the expected output and checks if the information is accurate and complete.

**Example:**

* Expected output: "Madrid is the capital of Spain."
* Actual output: "Madrid."
* Correctness: Partial, because the answer is factually correct but incomplete.

**Correctness = Is the answer factually right?**

In [None]:
correctness_metric = GEval(
    name="Correctness",
    model="gpt-4o",
    evaluation_params=[
        LLMTestCaseParams.EXPECTED_OUTPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
    evaluation_steps=[
        "Determine whether the actual output is factually correct based on the expected output."
    ],
)

gt_answer = "Madrid is the capital of Spain."
pred_answer = "MadriD."
# Uncomment the line below to test an incorrect prediction
#pred_answer = "Madrid is the capital of France."


test_case_correctness = LLMTestCase(
    input="What is the capital of Spain?",
    expected_output=gt_answer,
    actual_output=pred_answer,
)

correctness_metric.measure(test_case_correctness)
print(correctness_metric.score)

Output()

0.11657310100660305


### Test faithfulness

#### Faithfulness

**Definition:** Measures whether the model’s answer is faithful to the provided context (retrieved documents or facts).

**How it works:** Checks if the answer only uses information present in the context and does not hallucinate or invent facts.

**Example:**

* Context: ["6"]
* Question: "What is 3+3?"
* Generated answer: "6"
* Faithfulness: High, because the answer is directly supported by the context.

**Faithfulness = Is the answer strictly based on the provided context?**
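As a rough illustration of the idea, the score can be thought of as the share of claims in the answer that the context does not contradict. The sketch below is not deepeval's implementation: the function name `faithfulness_score` is hypothetical, the verdicts are hand-assigned booleans (in the real metric a judge LLM produces them), and treating an answer with no extracted claims as fully faithful is an assumption.

```python
def faithfulness_score(claim_verdicts):
    """Return the fraction of claims judged consistent with the context.

    claim_verdicts: list of booleans, one per claim in the answer
    (True = not contradicted by the retrieval context).
    Assumption: an answer with no claims counts as fully faithful.
    """
    if not claim_verdicts:
        return 1.0
    return sum(claim_verdicts) / len(claim_verdicts)


# Hand-labelled verdicts, for illustration only.
print(faithfulness_score([True]))         # single supported claim
print(faithfulness_score([False]))        # single contradicted claim
print(faithfulness_score([True, False]))  # one of two claims contradicted
```

How a claim that is neither supported nor contradicted should be judged is left to the evaluator, so it is not modelled here.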

In [13]:
faithfulness_metric = FaithfulnessMetric(
    threshold=0.7,
    model="gpt-4-turbo",
    include_reason=True,
    #verbose_mode=True   # More debugging info
)

# Define test cases for faithfulness metric
test_case = LLMTestCase(
    input="what is 3+3?",
    actual_output="6",
    retrieval_context=["6"]
)

# Test 1: Clear contradiction (should score low)
test_contradiction = LLMTestCase(
    input="What color is the car?",
    actual_output="The car is red",
    retrieval_context=["The car is blue and parked outside"]
)

# Test 2: Supported claim (should score high)
test_supported = LLMTestCase(
    input="What color is the car?", 
    actual_output="The car is blue",
    retrieval_context=["The car is blue and parked outside"]
)

# Test 3: Partially supported (should score medium)
test_partial = LLMTestCase(
    input="What do we know about the car?",
    actual_output="The car is blue and has leather seats",  # Only blue is supported
    retrieval_context=["The car is blue and parked outside"]
)

# Test 4: Answer contradicts the context (should score low)
test_case_with_context = LLMTestCase(
    input="What number is mentioned?",
    actual_output="The number is 7",  # Wrong number
    retrieval_context=["The document mentions the number 6"]  # Contradictory context
)


faithfulness_metric.measure(test_case)
print(faithfulness_metric.score)
print(faithfulness_metric.reason)




Output()

1
The score is 1.00 because there are no contradictions between the actual output and the retrieval context, indicating perfect faithfulness.
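The cell above defines `test_contradiction`, `test_supported`, `test_partial`, and `test_case_with_context`, but only `test_case` is measured. One way to score the rest is a small loop helper. The sketch below assumes only what the notebook already uses: a metric object with a `measure` method that sets `score` and, optionally, `reason`. The name `score_cases` is hypothetical.

```python
def score_cases(metric, named_cases):
    """Measure each test case with the given metric.

    metric: any object exposing measure(case), .score and optionally .reason
    named_cases: dict mapping a label to a test case
    Returns a list of (label, score, reason) tuples in insertion order.
    """
    results = []
    for label, case in named_cases.items():
        metric.measure(case)
        results.append((label, metric.score, getattr(metric, "reason", None)))
    return results


# Intended use with the objects defined in the cell above (calls the judge model):
# for label, score, reason in score_cases(faithfulness_metric, {
#     "contradiction": test_contradiction,
#     "supported": test_supported,
#     "partial": test_partial,
#     "wrong number": test_case_with_context,
# }):
#     print(label, score, reason)
```

Each call to `measure` sends a request to the evaluation model, so the loop costs one evaluation per test case.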


### Test contextual relevancy 

This code uses the deepeval library’s `ContextualRelevancyMetric` to evaluate how relevant the retrieved context is to the question.

**How it works:**

- **actual_output:** The answer generated by the model (`"then go somewhere else."`)
- **retrieval_context:** A list of context strings retrieved for the question (e.g., `["this is a test context", "mike is a cat", "if the shoes don't fit, then go somewhere else."]`)
- **gt_answer:** The ground truth answer (`"if the shoes don't fit, then go somewhere else."`)
- A `ContextualRelevancyMetric` is created with a threshold and model.
- A test case is defined with the question, actual output, context, and expected output.
- The metric’s `measure` method evaluates how relevant the retrieved context is to the input question, and prints both the score (between 0 and 1) and the reason for the score.
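To make the score concrete, it can be read as the share of retrieved context items that are relevant to the question. The sketch below is a simplification, not deepeval's code: the function name `contextual_relevancy` is hypothetical, the relevance verdicts are hand-assigned (the real metric asks a judge LLM), and returning 0.0 for an empty context is an assumption.

```python
def contextual_relevancy(relevance_verdicts):
    """Return the fraction of retrieved items judged relevant to the input.

    relevance_verdicts: list of booleans, one per retrieved context item.
    Assumption: an empty retrieval context scores 0.0.
    """
    if not relevance_verdicts:
        return 0.0
    return sum(relevance_verdicts) / len(relevance_verdicts)


retrieval_context = [
    "this is a test context",
    "mike is a cat",
    "if the shoes don't fit, then go somewhere else.",
]
# Hand-assigned: only the third string relates to the question about shoes.
verdicts = [False, False, True]
print(contextual_relevancy(verdicts))  # 0.3333333333333333
```

With one relevant item out of three, the ratio is 1/3, so padding the context with unrelated strings lowers the score even when the right passage is present.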

In [14]:
actual_output = "then go somewhere else."
retrieval_context = ["this is a test context","mike is a cat","if the shoes don't fit, then go somewhere else."]
gt_answer = "if the shoes don't fit, then go somewhere else."

relevance_metric = ContextualRelevancyMetric(
    threshold=1,
    model="gpt-4o",
    include_reason=True
)
relevance_test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output,
    retrieval_context=retrieval_context,
    expected_output=gt_answer,

)

relevance_metric.measure(relevance_test_case)
print(relevance_metric.score)
print(relevance_metric.reason)

Output()

0.3333333333333333
The score is 0.33 because the majority of the context, such as 'this is a test context' and 'mike is a cat,' does not relate to the question about shoe fitting. However, the statement 'if the shoes don't fit, then go somewhere else' provides some relevant advice, contributing to the score.


### Evaluate multiple test cases with several metrics

In [16]:
new_test_case = LLMTestCase(
    input="What is the capital of Spain?",
    expected_output="Madrid is the capital of Spain.",
    actual_output="MadriD.",
    retrieval_context=["Madrid is the capital of Spain."]
)

In [17]:
evaluate(
    test_cases=[relevance_test_case, new_test_case],
    metrics=[correctness_metric, faithfulness_metric, relevance_metric]
)

Output()



Metrics Summary

  - ❌ Correctness (GEval) (score: 0.1362608186515598, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The actual output is incorrect; 'MadriD.' is not a factual statement about the capital of Spain., error: None)
  - ✅ Faithfulness (score: 1.0, threshold: 0.7, strict: False, evaluation model: gpt-4-turbo, reason: The score is 1.00 because there are no contradictions between the actual output and the retrieval context, indicating perfect faithfulness., error: None)
  - ✅ Contextual Relevancy (score: 1.0, threshold: 1.0, strict: False, evaluation model: gpt-4o, reason: The score is 1.00 because the statement 'Madrid is the capital of Spain.' directly answers the input question with perfect relevance. Great job!, error: None)

For test case:

  - input: What is the capital of Spain?
  - actual output: MadriD.
  - expected output: Madrid is the capital of Spain.
  - context: None
  - retrieval context: ['Madrid is the capital of Spain.']


Overall Metric

EvaluationResult(test_results=[TestResult(name='test_case_1', success=False, metrics_data=[MetricData(name='Correctness (GEval)', threshold=0.5, success=False, score=0.1362608186515598, reason="The actual output is incorrect; 'MadriD.' is not a factual statement about the capital of Spain.", strict_mode=False, evaluation_model='gpt-4o', error=None, evaluation_cost=0.000915, verbose_logs='Criteria:\nNone \n \nEvaluation Steps:\n[\n    "Determine whether the actual output is factually correct based on the expected output."\n] \n \nRubric:\nNone'), MetricData(name='Faithfulness', threshold=0.7, success=True, score=1.0, reason='The score is 1.00 because there are no contradictions between the actual output and the retrieval context, indicating perfect faithfulness.', strict_mode=False, evaluation_model='gpt-4-turbo', error=None, evaluation_cost=0.00956, verbose_logs='Truths (limit=None):\n[\n    "Madrid is the capital of Spain."\n] \n \nClaims:\n[] \n \nVerdicts:\n[]'), MetricData(name='Co

In [20]:
import pytest
import deepeval
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import AnswerRelevancyMetric, GEval

# To run this file: deepeval test run <file_name>.py

# Created empty. Passing `alias="My dataset"` raised, with the installed deepeval version:
# TypeError: EvaluationDataset.__init__() got an unexpected keyword argument 'alias'
dataset = EvaluationDataset()


test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    # Replace this with the actual output of your LLM application
    actual_output="We offer a 30-day full refund at no extra cost.",
    expected_output="You're eligible for a free full refund within 30 days of purchase.",
)

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)

correctness_metric = GEval(
    name="Correctness",
    criteria="Correctness - determine if the actual output is correct according to the expected output.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    strict_mode=True,  # the keyword is `strict_mode` (as shown in the metric output above), not `strict`
)


@deepeval.log_hyperparameters(model="gpt-4", prompt_template="...")
def hyperparameters():
    return {"temperature": 1, "chunk size": 500}