# Use Ragas to evaluate a RAG pipeline

Ragas is an open-source project for evaluating RAG components. [Paper](https://arxiv.org/abs/2309.15217), [Code](https://docs.ragas.io/en/stable/getstarted/index.html), [Docs](https://docs.ragas.io/en/stable/getstarted/index.html), [Intro blog](https://medium.com/towards-data-science/rag-evaluation-using-ragas-4645a4c6c477).

<div>
<img src="../../pics/ragas_eval_image.png" width="80%"/>
</div>

**Please note that Ragas can consume a large number of OpenAI API tokens.** <br> 

Read through this notebook carefully and pay attention to the number of questions and metrics you want to evaluate.

### 1. Prepare Ragas environment and ground truth data

In [None]:
# ! python -m pip install langchain openai datasets ragas pandas

In [3]:
# Read questions and ground truth answers into a pandas dataframe.
# Note: Surround each context string with ''' to avoid issues with quotes inside.
# Note: Separate each context string with a comma.
import pandas as pd
import numpy as np
import os

# Get the current working directory.
cwd = os.getcwd()
relative_path = '/data/ground_truth_answers.csv'
file_path = cwd + relative_path

# Read ground truth answers from file.
eval_df = pd.read_csv(file_path, header=0, skip_blank_lines=True)
display(eval_df.head())

| Unnamed: 0 | Question | ground_truth_answer | Sources | Custom_RAG_context | Custom_RAG_answer | llama3_answer | anthropic_claud3_haiku_answer |
|---|---|---|---|---|---|---|---|
| 0 | What do the parameters for HNSW mean? | # M: maximum degree of nodes in a layer of the... | https://milvus.io/docs/index.md | this value can improve recall rate at the cost... | The parameters for HNSW are as follows:\n\n# M... | The parameters for HNSW include M, which is th... | The parameters for HNSW (Hierarchical Navigabl... |
| 1 | What are good default values for HNSW paramete... | M=16, efConstruction=32, ef=32 | https://milvus.io/docs/index.md, https://milvu... | parameters vary with Milvus distribution. Sele... | M=16, efConstruction=500, and ef=64 | For a Milvus distribution, there is no direct ... | I don't know. The context provided does not co... |
| 2 | What does nlist mean in ivf_flat? | The `nlist` parameter in IVF_FLAT index divide... | https://milvus.io/docs/index.md | index? IVF_FLAT index divides a vector space i... | In IVF_FLAT, nlist refers to the number of clu... | The `nlist` parameter in IVF_FLAT index divide... | The nlist parameter in the IVF_FLAT index in M... |
| 3 | What is the default AUTOINDEX distance metric ... | Trick answer: IP inner product, not yet updat... | https://milvus.io/docs/index.md | and Hamming. This type of indexes include BIN_... | L2 | According to the Milvus documentation, the def... | The default AUTOINDEX distance metric in Milvu... |
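
The later cells read the `Question`, `ground_truth_answer`, `Custom_RAG_context`, and `Custom_RAG_answer` columns, so it can help to check the CSV header before spending any tokens. The following is a minimal sketch using only the standard library; `missing_columns` and the two-line `sample` are hypothetical names and data made up for illustration, not part of the notebook.

```python
import csv
import io

# Columns that the later notebook cells read from the ground-truth file.
REQUIRED_COLUMNS = {
    "Question",
    "ground_truth_answer",
    "Custom_RAG_context",
    "Custom_RAG_answer",
}

def missing_columns(csv_text, required=REQUIRED_COLUMNS):
    """Return the required column names absent from the CSV header, sorted."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return sorted(set(required) - set(header))

# Hypothetical sample: a quoted field may safely contain commas.
sample = (
    "Question,ground_truth_answer,Custom_RAG_context\n"
    'What is M?,"M=16, efConstruction=32",some context\n'
)
print(missing_columns(sample))
```

With a real file, the text passed in would come from opening `file_path` rather than from an inline string.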


In [5]:
# Replace the Custom_RAG_answer column with the LLM answer of your choice.

# Possible choices:
# 1. openai gpt-3.5-turbo = 'Custom_RAG_answer'
# 2. llama3_answer
# 3. anthropic_claud3_haiku_answer
# LLM_TO_EVALUATE = 'Custom_RAG_answer'
LLM_TO_EVALUATE = 'llama3_answer'
# LLM_TO_EVALUATE = 'anthropic_claud3_haiku_answer'

temp_df = eval_df.copy()
if LLM_TO_EVALUATE != 'Custom_RAG_answer':
    temp_df['Custom_RAG_answer'] = temp_df[LLM_TO_EVALUATE]

# Display the dataframe.
display(temp_df.head())

| Unnamed: 0 | Question | ground_truth_answer | Sources | Custom_RAG_context | Custom_RAG_answer | llama3_answer | anthropic_claud3_haiku_answer |
|---|---|---|---|---|---|---|---|
| 0 | What do the parameters for HNSW mean? | # M: maximum degree of nodes in a layer of the... | https://milvus.io/docs/index.md | this value can improve recall rate at the cost... | The parameters for HNSW include M, which is th... | The parameters for HNSW include M, which is th... | The parameters for HNSW (Hierarchical Navigabl... |
| 1 | What are good default values for HNSW paramete... | M=16, efConstruction=32, ef=32 | https://milvus.io/docs/index.md, https://milvu... | parameters vary with Milvus distribution. Sele... | For a Milvus distribution, there is no direct ... | For a Milvus distribution, there is no direct ... | I don't know. The context provided does not co... |
| 2 | What does nlist mean in ivf_flat? | The `nlist` parameter in IVF_FLAT index divide... | https://milvus.io/docs/index.md | index? IVF_FLAT index divides a vector space i... | The `nlist` parameter in IVF_FLAT index divide... | The `nlist` parameter in IVF_FLAT index divide... | The nlist parameter in the IVF_FLAT index in M... |
| 3 | What is the default AUTOINDEX distance metric ... | Trick answer: IP inner product, not yet updat... | https://milvus.io/docs/index.md | and Hamming. This type of indexes include BIN_... | According to the Milvus documentation, the def... | According to the Milvus documentation, the def... | The default AUTOINDEX distance metric in Milvu... |


In [None]:
# Ragas uses HuggingFace Datasets by default.
# https://docs.ragas.io/en/latest/getstarted/evaluation.html
from datasets import Dataset

def assemble_ragas_dataset(input_df):
    """Assemble a RAGAS HuggingFace Dataset from an input pandas df."""

    # Assemble Ragas lists: questions, ground_truth_answers, retrieval_contexts, and RAG answers.
    question_list, truth_list, context_list = [], [], []

    # Get all the questions.
    question_list = input_df.Question.to_list()

    # Get all the ground truth answers.
    truth_list = input_df.ground_truth_answer.to_list()

    # Get all the Milvus Retrieval Contexts as list[list[str]]
    context_list = input_df.Custom_RAG_context.to_list()
    context_list = [[context] for context in context_list]

    # Get all the RAG answers based on contexts.
    rag_answer_list = input_df.Custom_RAG_answer.to_list()

    # Create a HuggingFace Dataset from the ground truth lists.
    ragas_ds = Dataset.from_dict({"question": question_list,
                            "contexts": context_list,
                            "answer": rag_answer_list,
                            "ground_truth": truth_list
                            })
    return ragas_ds
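
The function above passes four equal-length columns to `Dataset.from_dict`, with `contexts` nested as `list[list[str]]`. A shape check on the plain dictionary can catch mistakes before the dataset is built. This is only a sketch: `check_ragas_rows` is a hypothetical helper and the row values are made-up placeholders, not data from the notebook.

```python
def check_ragas_rows(rows):
    """Raise ValueError unless all columns have equal length and
    every entry of 'contexts' is a list of strings."""
    lengths = {key: len(values) for key, values in rows.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    for i, contexts in enumerate(rows["contexts"]):
        is_list = isinstance(contexts, list)
        if not is_list or not all(isinstance(c, str) for c in contexts):
            raise ValueError(f"Row {i}: contexts must be a list of strings")
    return True

# Hypothetical single-row example with the same keys the notebook uses.
rows = {
    "question": ["What does nlist mean in ivf_flat?"],
    "contexts": [["IVF_FLAT divides a vector space into clusters."]],
    "answer": ["nlist is the number of clusters."],
    "ground_truth": ["nlist sets the number of cluster units."],
}
print(check_ragas_rows(rows))
```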

In [None]:
# Create a Ragas HuggingFace Dataset from temp_df, so that the answers
# of the LLM selected in LLM_TO_EVALUATE are the ones evaluated.
ragas_input_ds = assemble_ragas_dataset(temp_df)
display(ragas_input_ds)

In [None]:
# Debugging: inspect all the data.
ragas_input_df = ragas_input_ds.to_pandas()
display(ragas_input_df.head())

### 2. Start Ragas Evaluation with custom Evaluation LLM

The default OpenAI model used by Ragas is `gpt-3.5-turbo-16k`.

Note that evaluation consumes a large number of OpenAI API tokens. Every question and every metric evaluated triggers calls to the OpenAI service, so keep an eye on your token consumption.
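
To get a rough feel for how the request count grows, the arithmetic can be sketched as questions times metrics times calls per metric. The `calls_per_metric` value below is an assumption for illustration only; the real number depends on the metric and the Ragas version, so treat the result as a lower bound rather than a measurement.

```python
def estimate_llm_calls(num_questions, num_metrics, calls_per_metric=1):
    """Rough lower bound on LLM requests for one evaluation run.

    calls_per_metric is a hypothetical knob: some metrics may issue
    several requests per question.
    """
    return num_questions * num_metrics * calls_per_metric

# Four eval questions and two metrics, as in this notebook.
print(estimate_llm_calls(num_questions=4, num_metrics=2))
```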

In [None]:
import os, openai, pprint
from openai import OpenAI

# Save api key in env variable.
# https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
openai_api_key=os.environ['OPENAI_API_KEY']
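
Reading `os.environ['OPENAI_API_KEY']` directly raises a bare `KeyError` when the variable is missing. A small wrapper can give a clearer message. This is a sketch; `require_env` is a hypothetical helper name, not part of any library.

```python
import os

def require_env(name):
    """Return the value of environment variable `name`,
    or raise RuntimeError with a readable message if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Set the {name} environment variable before running the evaluation."
        )
    return value

# Usage in the notebook (left commented so the sketch runs without a key):
# openai_api_key = require_env("OPENAI_API_KEY")
```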

In [None]:
# Choose the metrics you want to see.
# Remove context relevancy metric - it is deprecated and not maintained.
from ragas.metrics import (
    context_recall, 
    context_precision, 
    # answer_relevancy,
    # answer_similarity,
    # answer_correctness,
    # faithfulness,
    )
metrics = ['context_recall', 'context_precision']

# Change the default LLM-as-critic.
# It is also possible to switch out a HuggingFace open LLM here if you want.
# https://docs.ragas.io/en/stable/howtos/customisations/bring-your-own-llm-or-embs.html
from ragas.llms import llm_factory
LLM_NAME = "gpt-3.5-turbo"
# Default temperature = 1e-8
ragas_llm = llm_factory(model=LLM_NAME)

# Change the default embeddings to HuggingFace models.
from langchain_community.embeddings import HuggingFaceEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
EMB_NAME = "BAAI/bge-large-en-v1.5"
lc_embeddings = HuggingFaceEmbeddings(model_name=EMB_NAME)

# # Alternatively use OpenAI embedding models.
# # https://openai.com/blog/new-embedding-models-and-api-updates
# from langchain_openai.embeddings import OpenAIEmbeddings
# lc_embeddings = OpenAIEmbeddings(
#     model="text-embedding-3-small", 
#     # 512 or 1536 possible for 3-small
#     # 256, 1024, or 3072 for 3-large
#     dimensions=512)
ragas_emb = LangchainEmbeddingsWrapper(embeddings=lc_embeddings)

# Change each metric.
for metric in metrics:
    globals()[metric].llm = ragas_llm
    globals()[metric].embeddings = ragas_emb
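
The loop above looks each metric up by name through `globals()`. One alternative, sketched here under the assumption that metrics are ordinary objects with settable `llm` and `embeddings` attributes, is to keep the objects themselves in a list. `DummyMetric` and `configure_metrics` are hypothetical stand-ins so the sketch runs without Ragas installed.

```python
class DummyMetric:
    """Hypothetical stand-in for a Ragas metric object."""

    def __init__(self, name):
        self.name = name
        self.llm = None
        self.embeddings = None

def configure_metrics(metric_objects, llm, embeddings):
    """Attach the same critic LLM and embeddings to every metric."""
    for metric in metric_objects:
        metric.llm = llm
        metric.embeddings = embeddings
    return metric_objects

# In the notebook this list would hold context_recall and context_precision.
metric_objects = [DummyMetric("context_recall"), DummyMetric("context_precision")]
configure_metrics(metric_objects, llm="critic-llm", embeddings="critic-embeddings")
print([(m.name, m.llm) for m in metric_objects])
```

Holding the objects directly also lets the same list be passed to `evaluate(metrics=...)`, which avoids keeping a list of names and a list of objects in sync.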

In [None]:
# Evaluate the dataset.
from ragas import evaluate

ragas_result = evaluate(
    ragas_input_ds,
    metrics=[
        context_precision,
        context_recall,
    ],
    llm=ragas_llm,
)

# View evaluations.
ragas_output_df = ragas_result.to_pandas()
temp = ragas_output_df.fillna(0.0)
temp['context_f1'] = 2.0 * temp.context_precision * temp.context_recall \
                    / (temp.context_precision + temp.context_recall)
display(temp.head())

# Calculate Retrieval average score.
avg_retrieval_f1 = np.round(temp.context_f1.mean(),2)

In [None]:
# Display Retrieval average score.
print(f"Using {eval_df.shape[0]} eval questions, Mean Retrieval F1 Score = {avg_retrieval_f1}")
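
One detail of the F1 expression used above: when precision and recall are both 0, the division is 0/0, which pandas evaluates to NaN, and `mean()` then skips that row by default. The sketch below shows the same harmonic-mean formula in plain Python with an explicit zero case. Scoring such rows as 0.0 is an assumption about the desired behavior, and the sample pairs are made up.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    if total == 0:
        return 0.0
    return 2.0 * precision * recall / total

def mean_f1(pairs):
    """Average F1 over (precision, recall) pairs, counting zero rows."""
    scores = [f1_score(p, r) for p, r in pairs]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical (precision, recall) pairs; the last row has no relevant context.
pairs = [(1.0, 1.0), (0.5, 1.0), (0.0, 0.0)]
print(round(mean_f1(pairs), 2))
```

Counting the zero row lowers the average compared with skipping it, so the two conventions should not be mixed when comparing runs.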

In [None]:
# Props to Sebastian Raschka for this handy watermark.
# !pip install watermark

%load_ext watermark
%watermark -a 'Christy Bergman' -v -p datasets,langchain,openai,ragas --conda