## Evaluating a Retrieval-Augmented Generation (RAG) System

This notebook is a quick-start guide to evaluating RAG systems with the LastMile AutoEval SDK. We'll cover how to:

1. Measure the faithfulness of a generated output against a ground truth and input
2. Check the toxicity of a generated output and reject responses that are too toxic
3. Detect hallucinations by comparing generated output to retrieved context

Note: you will need an OpenAI API key set in your environment variables to run this notebook.
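
A quick way to confirm the key is visible to the notebook before running anything else is sketched below. It assumes the key is exposed under the conventional `OPENAI_API_KEY` variable name; `has_openai_key` is a hypothetical helper written for this example, not part of either SDK.

```python
import os


def has_openai_key(env=None):
    """Return True if a non-empty OPENAI_API_KEY exists in the given mapping.

    `env` defaults to the process environment; passing a dict makes the
    check easy to try out without touching real environment variables.
    """
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not has_openai_key():
    print("OPENAI_API_KEY is not set; export it before running the cells below.")
```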

Let's get started!

In [None]:
# Install Dependencies
%pip install "llama-index>=0.11.0"
%pip install lastmile
%pip install pandas

# Configure pandas to display DataFrames without truncation
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

### Set Up a LlamaIndex RAG System

First, let's set up a simple RAG system using LlamaIndex to generate responses.
Note: LlamaIndex defaults to OpenAI as the LLM, but feel free to swap in your preferred LLM.

In [2]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data/PaulGrahamEssay").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

### Evaluating with the Base Faithfulness Metric

Now let's evaluate how faithful the generated response is to the ground truth, given an input query.

In [3]:
from lastmile.lib.auto_eval import AutoEval, Metric
import pandas as pd

eval = AutoEval()

query = "Where did the author grow up?"
expected_response = "England"
llm_response = query_engine.query(query)

eval_result = eval.evaluate_data(
    data=pd.DataFrame({
        "input": [query],
        "output": [llm_response.response],
        "ground_truth": [expected_response]
    }),
    metrics=[Metric(name="Faithfulness")]
)

print('Evaluation results:')
eval_result

Evaluation results:


|   | input | output | ground_truth | Faithfulness_score |
|---|---|---|---|---|
| 0 | Where did the author grow up? | The author grew up in Yorkville, a neighborhoo... | England | 0.058303 |


### Checking Output Toxicity

We can also check the toxicity of the output and reject responses that are too toxic. The Toxicity metric is designed to detect and flag low-quality or potentially harmful AI-generated content.

In [4]:
toxicity_result = eval.evaluate_data(
    data=pd.DataFrame({"output": [llm_response.response]}),
    metrics=[Metric(name="Toxicity")]
)

toxicity_score = toxicity_result["Toxicity_score"][0]
print(f'Toxicity: {toxicity_score}')


Toxicity: 0.7909269332885742
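
The cell above only prints the score. A minimal sketch of the "reject" step follows. It rests on two assumptions that are not confirmed here: that a higher score means more toxic, and that `0.5` is a sensible cutoff. `gate_response`, `TOXICITY_THRESHOLD`, and the refusal message are hypothetical names chosen for illustration.

```python
# Hypothetical cutoff; tune it against your own data.
TOXICITY_THRESHOLD = 0.5

REFUSAL_MESSAGE = "Sorry, I can't share that response."


def gate_response(response_text, toxicity_score, threshold=TOXICITY_THRESHOLD):
    """Return the response unchanged, or a refusal if it scores too high.

    Assumes higher scores mean more toxic. If the metric is oriented the
    other way, flip the comparison.
    """
    if toxicity_score >= threshold:
        return REFUSAL_MESSAGE
    return response_text


# Example with made-up scores:
print(gate_response("The author grew up abroad.", 0.1))
print(gate_response("Some offensive text.", 0.9))
```

Under these assumptions the score printed above (about 0.79) would exceed the cutoff, so it is worth confirming the score direction in the metric's documentation before relying on a gate like this.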


### Detecting Hallucinations

Finally, let's detect potential hallucinations by comparing the generated response to the retrieved context used to generate it.

In [5]:
query = "What year was the author born?"
llm_response = query_engine.query(query)

hallucination_result = eval.evaluate_data(
    data=pd.DataFrame({
        "input": [query], 
        "output": [llm_response.response],
        "ground_truth": [llm_response.source_nodes[0].text]
    }),
    metrics=[Metric(name="Faithfulness")]
)

print(f'Faithfulness to Context: {hallucination_result["Faithfulness_score"][0]}')
hallucination_result

Faithfulness to Context: 0.2740519344806671


|   | input | output | ground_truth | Faithfulness_score |
|---|---|---|---|---|
| 0 | What year was the author born? | The author was born in 1964. | If he even knew about the strange classes I wa... | 0.274052 |
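
The cell above compares the response only to the first retrieved node. If the query engine returns several nodes, one option is to join their text into a single context string before scoring. The sketch below assumes every node exposes a `.text` attribute, as the first one does above; `build_context` is a hypothetical helper, and the stand-in nodes exist only so the example runs without LlamaIndex.

```python
from types import SimpleNamespace


def build_context(source_nodes, separator="\n\n"):
    """Join the text of all retrieved nodes into one context string."""
    return separator.join(node.text for node in source_nodes)


# Stand-in objects that mimic the `.text` attribute of retrieved nodes.
nodes = [
    SimpleNamespace(text="First retrieved passage."),
    SimpleNamespace(text="Second retrieved passage."),
]

context = build_context(nodes)
print(context)
```

In the hallucination check, the result of `build_context(llm_response.source_nodes)` could then be used as the `ground_truth` value in place of `llm_response.source_nodes[0].text`.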


## Evaluating with AutoEval's Built-in Metrics

LastMile's built-in metrics (`BuiltinMetrics`) are a set of predefined metrics that cover a range of common evaluation tasks:
- Faithfulness: Measures how closely the generated response matches the ground truth.
- Relevance: Measures how relevant the generated response is to the input query.
- Toxicity: Measures the toxicity of a generated response.
- Answer Correctness: Measures how correct the generated response is.
- Summarization: Measures how well the generated response summarizes the input.

In [6]:
# Evaluate one query against all built-in metrics at once
from lastmile.lib.auto_eval import BuiltinMetrics

query = "Where did the author grow up?"
expected_response = "England"
llm_response = query_engine.query(query)

eval_result = eval.evaluate_data(
    data=pd.DataFrame({
        "input": [query],
        "output": [llm_response.response],
        "ground_truth": [expected_response]
    }),
    metrics=[
        BuiltinMetrics.FAITHFULNESS,
        BuiltinMetrics.RELEVANCE,
        BuiltinMetrics.TOXICITY,
        BuiltinMetrics.ANSWER_CORRECTNESS,
        BuiltinMetrics.SUMMARIZATION,
    ]
)

print('Evaluation results:')
eval_result

Evaluation results:


|   | input | output | ground_truth | Faithfulness_score | Relevance_score | Toxicity_score | Answer Correctness_score | Summarization_score |
|---|---|---|---|---|---|---|---|---|
| 0 | Where did the author grow up? | The author grew up in Yorkville, a neighborhoo... | England | 0.058303 | 0.754154 | 0.058303 | 0.058303 | 0.058303 |
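
Each metric appears in the result as a column named `<metric>_score`. One possible way to pull just the scores out of a row is sketched below. `extract_scores` is a hypothetical helper, and the row is a plain dict copied from the output above so the example runs without the SDK.

```python
def extract_scores(row):
    """Map metric name -> score for every key ending in '_score'."""
    suffix = "_score"
    return {
        key[: -len(suffix)]: value
        for key, value in row.items()
        if key.endswith(suffix)
    }


# Values copied from the result row shown above.
row = {
    "input": "Where did the author grow up?",
    "ground_truth": "England",
    "Faithfulness_score": 0.058303,
    "Relevance_score": 0.754154,
    "Toxicity_score": 0.058303,
    "Answer Correctness_score": 0.058303,
    "Summarization_score": 0.058303,
}

scores = extract_scores(row)
for metric, score in scores.items():
    print(f"{metric}: {score}")
```

With the DataFrame returned above, a row dict like this could be obtained with `eval_result.iloc[0].to_dict()`.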


### Fine-tune a metric

Follow the [Getting Started](https://github.com/lastmile-ai/lastmile-docs/blob/main/cookbook/AutoEval_Getting_Started.ipynb) guide to build a custom evaluation metric.