![deepeval_image](images/deepeval.png)

# RAG Evaluations Demonstration: DeepEval

DeepEval is an open-source evaluation framework for LLMs. It makes it easy to build and iterate on LLM applications and was built with the following principles in mind:

- Easily "unit test" LLM outputs in a similar way to Pytest.
- Plug-and-use 14+ LLM-evaluated metrics, most with research backing.
- Synthetic dataset generation with state-of-the-art evolution techniques.
- Metrics are simple to customize and cover all use cases.
- Real-time evaluations in production.

#### Load Evaluation Dataset 

In [1]:
# Download amnesty_qa dataset
from datasets import load_dataset

amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2", trust_remote_code=True)
eval_data = amnesty_qa['eval']
eval_data[2]

Repo card metadata block was not found. Setting CardData to empty.


{'question': 'Which private companies in the Americas are the largest GHG emitters according to the Carbon Majors database?',
 'ground_truth': 'The largest private companies in the Americas that are the largest GHG emitters according to the Carbon Majors database are ExxonMobil, Chevron, and Peabody.',
 'answer': 'According to the Carbon Majors database, the largest private companies in the Americas that are the largest GHG emitters are:\n\n1. Chevron Corporation (United States)\n2. ExxonMobil Corporation (United States)\n3. ConocoPhillips Company (United States)\n4. BP plc (United Kingdom, but with significant operations in the Americas)\n5. Royal Dutch Shell plc (Netherlands, but with significant operations in the Americas)\n6. Peabody Energy Corporation (United States)\n7. Duke Energy Corporation (United States)\n8. TotalEnergies SE (France, but with significant operations in the Americas)\n9. BHP Group Limited (Australia, but with significant operations in the Americas)\n10. Rio Ti

#### Build a local LLM-as-a-judge directly from HuggingFace

In [2]:
# Create an LLM-as-a-judge from a custom Hugging Face model
from llama3_deepeval import Llama3_8B
from transformers import AutoModelForCausalLM, AutoTokenizer

model_str = "solidrust/Meta-Llama-3-8B-Instruct-hf-AWQ"

model = AutoModelForCausalLM.from_pretrained(model_str, device_map="auto")
# device_map only applies to model weights, so the tokenizer is loaded without it
tokenizer = AutoTokenizer.from_pretrained(model_str)

llama_3 = Llama3_8B(model=model, tokenizer=tokenizer)



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [3]:
# test basic prompting of the local llm
gen_output = llama_3.generate("Why is the sky blue?")

print(gen_output)

The sky appears blue because of a phenomenon called Rayleigh scattering, named after the British physicist Lord Rayleigh. He discovered that shorter wavelengths of light, such as blue and violet, are scattered more than longer wavelengths, like red and orange, when they interact with tiny molecules of gases in the atmosphere.

Here's what happens:

1. When sunlight enters Earth's atmosphere, it encounters tiny molecules of gases like nitrogen (N2) and oxygen (O2).
2. These molecules scatter the light in all directions, but they scatter shorter wavelengths (like blue and violet) more than longer wavelengths (like red and orange).
3. As a result, the blue and violet light is dispersed throughout the atmosphere, reaching our eyes from all directions.
4. Our brains perceive this scattered blue light as the color of the sky, making it appear blue during the daytime.

The color of the sky can vary depending on several factors, such as:

* Time of day: The sky can take on hues of red, orange,

#### Create DeepEval Test Cases

DeepEval uses an `LLMTestCase` class to handle evaluations. This class has built-in fields that closely match the fields of the dataset loaded above:
- input
- expected_output
- actual_output
- retrieval_context

There is also an `additional_metadata` field, which can be used for *custom* metrics.

Here we define a function that translates our dataset into a list of test cases:

In [4]:
# create deepeval evaluation dataset from the downloaded dataset
from deepeval.test_case import LLMTestCase
from random import uniform

def test_case_from_data(data_point):
    test_case = LLMTestCase(
        input = data_point['question'],
        actual_output = data_point['answer'],
        expected_output = data_point['ground_truth'],
        retrieval_context = data_point['contexts'],
        additional_metadata = {'latency': uniform(0,20)}
    )
    return test_case


test_cases = [test_case_from_data(data_point) for data_point in eval_data]

#### Import the Evaluation Metrics

In [5]:
# create metrics to measure from deepeval
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    AnswerRelevancyMetric
)

# Evaluate whether nodes in retrieval_context that are relevant to the given input are ranked higher than irrelevant ones.
contextual_precision = ContextualPrecisionMetric(model=llama_3, threshold=0.5)

# Evaluate the quality of the retriever by measuring the extent to which the retrieval_context aligns with the expected_output
contextual_recall = ContextualRecallMetric(model=llama_3, threshold=0.5)

# Evaluate how relevant the actual_output is to the provided input
answer_relevancy = AnswerRelevancyMetric(model=llama_3, threshold=0.5)

Now we can make our own custom metrics as well!

In [6]:
from custom_deepeval_metric import LatencyMetric

# Evaluate the latency of the test_case run
latency = LatencyMetric(max_seconds=10)
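
The `custom_deepeval_metric` module is not shown in this notebook. The following is a minimal sketch of what a latency metric of this kind might look like, assuming it reads the `latency` value stored in `additional_metadata`. In DeepEval, a custom metric would normally subclass `deepeval.metrics.BaseMetric`; to keep the sketch runnable without the library, it uses a plain class with the same general shape. The names `SketchLatencyMetric` and `FakeTestCase`, and the wording of the reason string, are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class FakeTestCase:
    """Hypothetical stand-in for LLMTestCase; only the field this sketch needs."""
    additional_metadata: dict = field(default_factory=dict)


class SketchLatencyMetric:
    """Illustrative latency check; not the notebook's actual LatencyMetric."""

    def __init__(self, max_seconds: float = 10.0):
        self.max_seconds = max_seconds
        self.score = None
        self.reason = None
        self.success = None

    def measure(self, test_case) -> int:
        # Assumption: latency (in seconds) was stored under the 'latency' key.
        latency = test_case.additional_metadata["latency"]
        self.success = latency <= self.max_seconds
        # Binary score: 1 if the latency is within the limit, otherwise 0.
        self.score = 1 if self.success else 0
        comparison = "within" if self.success else "above"
        self.reason = (
            f"Latency of {latency:.2f}s was {comparison} "
            f"the acceptable limit of {self.max_seconds}s."
        )
        return self.score

    def is_successful(self) -> bool:
        return bool(self.success)

    @property
    def __name__(self):
        # Used as the column prefix when results are tabulated.
        return "Latency"
```

Because the metric does not call an LLM, it is deterministic and cheap to run, which makes it a convenient sanity check alongside the LLM-evaluated metrics.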

#### Run the Evaluation

First, some housekeeping functions

In [7]:
def retry_five_times(func, *args, **kwargs):
    """Call func up to five times; return its result, or None if every attempt fails."""
    for attempt in range(5):
        try:
            # Return as soon as a call succeeds
            return func(*args, **kwargs)
        except Exception as e:
            print(f"Function call failed with error: {e}")
            if attempt == 4:
                print("Function failed all attempts, returning None")
    return None

Running the evaluation of a single test case

In [8]:
# evaluate one at a time
retry_five_times(contextual_precision.measure, test_cases[2])
print("Contextual Precision Score: ", contextual_precision.score)
print("Reason: ", contextual_precision.reason)

Function call failed with error: Evaluation LLM outputted an invalid JSON. Please use a better evaluation model.


Contextual Precision Score:  1.0
Reason:  The score is 1.00 because the retrieval contexts are perfectly aligned with the input, with the most relevant nodes ranked higher and the irrelevant nodes correctly placed lower. The first two nodes provide direct and specific information about the largest private companies in the Americas that are the largest GHG emitters according to the Carbon Majors database, while the third node is a general statement about the issue of greenhouse gas emissions, making it less relevant to the question.
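
To make the score above more concrete, here is a small sketch of the weighted cumulative precision that, as far as we understand from DeepEval's documentation, underlies contextual precision: each relevant node at rank `k` contributes the precision at `k`, and the sum is divided by the number of relevant nodes. The function name `ranked_precision` and the relevance flags are illustrative assumptions; in DeepEval the flags come from the evaluator LLM's verdicts.

```python
def ranked_precision(relevance):
    """Weighted cumulative precision over a ranked list of relevance flags."""
    relevant_total = sum(relevance)
    if relevant_total == 0:
        return 0.0
    score = 0.0
    relevant_so_far = 0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_so_far += 1
            # Precision at rank k, counted only where the node is relevant
            score += relevant_so_far / k
    return score / relevant_total


# Two relevant nodes ranked above one irrelevant node, as in the reason above
best_order = ranked_precision([True, True, False])
# The same nodes with the irrelevant one ranked first
worse_order = ranked_precision([False, True, True])
print(best_order, worse_order)
```

With the relevant nodes first, the score is 1.0, matching the output above; moving the irrelevant node to the top lowers the score even though the same nodes were retrieved.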


Running the evaluation of multiple test cases on multiple metrics

**Note:** This is not the fastest way to run the evaluation.

In [9]:
import pandas as pd

def evaluation_grid(metrics, test_cases):
    df = pd.DataFrame()
    df["Questions"] = [test_case.input for test_case in test_cases]
    df["Answers"] = [test_case.actual_output for test_case in test_cases]
    for metric in metrics:
        metric_scores = []
        metric_reasons = []
        for test_case in test_cases:
            retry_five_times(metric.measure,test_case)
            metric_scores.append(metric.score)
            metric_reasons.append(metric.reason)
        df[metric.__name__+" Score"] = metric_scores
        df[metric.__name__+" Reason"] = metric_reasons

    return df
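
One caveat with a grid like this: if every attempt to measure a test case fails, `metric.score` and `metric.reason` may still hold the values from the previous test case, so a stale score could end up in the table. Below is a minimal sketch of one way to guard against that. The helper `measure_or_none` and the stand-in `FlakyMetric` are hypothetical names used so the example runs without DeepEval; the stand-in "test cases" are plain strings.

```python
def measure_or_none(metric, test_case, attempts=5):
    """Return (score, reason) from metric.measure, or (None, error text) if all attempts fail."""
    last_error = None
    for _ in range(attempts):
        try:
            metric.measure(test_case)
            return metric.score, metric.reason
        except Exception as e:
            last_error = e
    # Report the failure explicitly instead of reusing a previous score
    return None, f"All {attempts} attempts failed: {last_error}"


class FlakyMetric:
    """Hypothetical stand-in metric that always fails on inputs containing 'bad'."""
    score = None
    reason = None

    def measure(self, test_case):
        if "bad" in test_case:
            raise ValueError("invalid JSON from evaluator")
        self.score, self.reason = 1.0, "ok"


metric = FlakyMetric()
results = [measure_or_none(metric, case) for case in ["good", "bad", "good"]]
print(results)
```

Recording `None` for a failed case keeps the failure visible in the resulting DataFrame rather than silently repeating the prior row's score.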
        

In [10]:
import nest_asyncio
nest_asyncio.apply()

evaluation_data = evaluation_grid([contextual_recall,contextual_precision,answer_relevancy,latency], test_cases[2:5])


In [15]:
evaluation_data.head()

| | Questions | Answers | Contextual Recall Score | Contextual Recall Reason | Contextual Precision Score | Contextual Precision Reason | Answer Relevancy Score | Answer Relevancy Reason | Latency Score | Latency Reason |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Which private companies in the Americas are th... | According to the Carbon Majors database, the l... | 0.666667 | The score is 0.67 because the model partially ... | 1.0 | The score is 1.00 because all the relevant nod... | 0.909091 | The score is 0.91 because the output is mostly... | 1 | Latency was below the acceptable limit of 1 se... |
| 1 | What action did Amnesty International urge its... | Amnesty International urged its supporters to ... | 1.0 | The score is 1.00 because the expected output ... | 1.0 | The score is 1.00 because all the relevant nod... | 0.75 | The score is 0.75 because, although the output... | 1 | Latency was below the acceptable limit of 1 se... |
| 2 | What are the recommendations made by Amnesty I... | Amnesty International made several recommendat... | 0.8 | The score is 0.80 because the output accuratel... | 1.0 | The score is 1.00 because all relevant nodes i... | 1.0 | The score is 1.00 because the actual output di... | 1 | Latency was below the acceptable limit of 1 se... |
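
As far as we understand from DeepEval's documentation, contextual recall and answer relevancy are both ratios of statement counts (attributable statements over total statements, and relevant statements over total statements, respectively). The sketch below recovers the simplest small fraction for each non-trivial score in the table. The cap of 20 on the denominator is an assumption, and the actual statement counts used by the evaluator are not shown in the output, so these fractions are only one consistent reading.

```python
from fractions import Fraction

# Scores copied from the table above
scores = {
    "Contextual Recall, row 0": 0.666667,
    "Contextual Recall, row 2": 0.8,
    "Answer Relevancy, row 0": 0.909091,
    "Answer Relevancy, row 1": 0.75,
}

# Closest fraction with a denominator of at most 20 (an assumed upper bound)
ratios = {name: Fraction(score).limit_denominator(20) for name, score in scores.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio.numerator}/{ratio.denominator}")
```

Read this way, a score of 0.909091 would correspond to something like 10 of 11 statements in the answer being judged relevant to the question.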


You can also create an evaluation dataset from your test cases and evaluate them all at once:

In [None]:
from deepeval.dataset import EvaluationDataset

deepeval_ds = EvaluationDataset(test_cases=test_cases[2:5])

deepeval_ds.evaluate(metrics=[contextual_recall,contextual_precision,answer_relevancy,latency])

#### Running Tests with DeepEval

In [14]:
from deepeval import assert_test

try:
    assert_test(test_cases[4], metrics=[latency])
    print("Test passed!")
except Exception as e:
    print('Error found: {}'.format(e))


Event loop is already running. Applying nest_asyncio patch to allow async execution...


Test passed!


DeepEval also offers integration with the ConfidentAI app, which provides a centralized place to log evals, change hyperparameters, debug via LLM traces, and monitor in production. However, this app doesn't appear to be open-source or self-hostable, so it most likely wouldn't be available in an air-gapped scenario.

![confident_ai](images/confident_ai.png)

#### DeepEval Pros
- Lots of built-in metrics (includes everything provided by RAGAS)
- Lots of customization (handles LLM and non-LLM evals)
- Applicable to RAG and non-RAG specific evals
- Metrics supply reasoning (more transparency than just a score)
- Easy to customize evaluator LLMs
- Test set generation capabilities

#### DeepEval Cons
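- Slower than RAGAS (generating a reason for each score takes time)
- Requires a strong evaluator LLM (a weaker model can return invalid JSON, as in the failed attempt shown earlier)
- When run as a batch (for example, with `EvaluationDataset`), a single eval error causes the whole batch to fail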
