# RAG Metrics

The metrics below are commonly used to measure the performance of a RAG system.
Performance depends on three components: Answer Synthesis (LLM parameters and prompt), Retrieval (search parameters and ranking), and Indexing (embedding model and chunking method), and each metric targets a different component. The list is not exhaustive; other use cases call for other metrics, for example measuring the tone of the final answer for customer-facing models.

**Correctness**: 

Whether the generated answer matches that of the reference answer given the query (requires labels).

**Semantic Similarity**: 

Whether the predicted answer is semantically similar to the reference answer (requires labels).

**Faithfulness**: 

Whether the answer is faithful to the retrieved contexts (in other words, whether it contains hallucinations).

**Context Relevancy**: 

Whether retrieved context is relevant to the query. 

**Answer Relevancy**: 

Whether the generated answer is relevant to the query. 

**Guideline Adherence**: 

Whether the predicted answer adheres to specific guidelines.

**Hit Rate**:

Hit rate calculates the fraction of queries where the correct answer is found within the top-k retrieved documents. In simpler terms, it’s about how often our system gets it right within the top few guesses.

**Mean Reciprocal Rank (MRR)**:

MRR serves as a metric for evaluating the accuracy of a system by examining the rank of the highest-placed relevant document for each query. It calculates the average of the reciprocals of these ranks across all queries. For instance, if the first relevant document is ranked highest, the reciprocal rank is 1; if it’s second, the reciprocal rank is 1/2, and so forth.
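
The two retrieval metrics above can be sketched in a few lines of Python. This is a minimal illustration, assuming each query has exactly one relevant document and that retrieval results are ranked lists of document IDs. The function names and the sample IDs are hypothetical and are not part of the notebook.

```python
from typing import List


def hit_rate(retrieved: List[List[str]], relevant: List[str], k: int) -> float:
    """Fraction of queries whose relevant document appears in the top-k results."""
    hits = sum(1 for docs, target in zip(retrieved, relevant) if target in docs[:k])
    return hits / len(relevant)


def mean_reciprocal_rank(retrieved: List[List[str]], relevant: List[str]) -> float:
    """Average of 1/rank of the relevant document; a miss contributes 0."""
    total = 0.0
    for docs, target in zip(retrieved, relevant):
        if target in docs:
            total += 1.0 / (docs.index(target) + 1)  # ranks are 1-based
    return total / len(relevant)


# Hypothetical ranked results for three queries (best match first)
retrieved = [
    ["d1", "d2", "d3"],  # relevant doc d1 is at rank 1
    ["d4", "d5", "d6"],  # relevant doc d5 is at rank 2
    ["d7", "d8", "d9"],  # relevant doc d0 was not retrieved
]
relevant = ["d1", "d5", "d0"]

print(hit_rate(retrieved, relevant, k=2))         # 2 of 3 queries hit
print(mean_reciprocal_rank(retrieved, relevant))  # (1 + 1/2 + 0) / 3
```

Both metrics need labels that say which document is relevant for each query, so they evaluate the Retrieval and Indexing components independently of the LLM.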

# Preparation

Run the cell below and restart the kernel. Once the kernel restarts, you can skip to **Set Up**.

In [68]:
!pip install -q -U llama-index llama-index-llms-ibm pydantic
print("Done Installing!")

Done Installing!


# Set Up

Before we measure performance, we need a way to tell WatsonX.AI that an authorized user is trying to use the models. Follow the instructions below to retrieve your API key.

Instruction:

When done, paste **your API key below** and run both cells.

In [None]:
myApikey="your Apikey Here"
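
As an alternative to pasting the key into the notebook, it could be read from an environment variable so that it never ends up in a saved copy. The helper below is only a sketch: the function name `load_api_key` and the variable name `WATSONX_APIKEY` are assumptions made for illustration, not something this notebook defines.

```python
import os


def load_api_key(env_var: str = "WATSONX_APIKEY") -> str:
    """Return the API key stored in env_var (an assumed name), or fail loudly."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the notebook.")
    return key
```

With the variable set, `myApikey = load_api_key()` would replace the hard-coded string in the cell above.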

In [6]:

from llama_index.core.evaluation import BatchEvalRunner
from llama_index.core.evaluation import AnswerRelevancyEvaluator
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.llms.ibm import WatsonxLLM
from statistics import mean
import os
import nest_asyncio
import asyncio
import warnings
import pandas as pd
pd.set_option('display.max_colwidth', None)
warnings.filterwarnings("ignore")

nest_asyncio.apply()

watsonx_llm = WatsonxLLM(
    model_id="mistralai/mistral-large", # granite or mistral 
    url="https://us-south.ml.cloud.ibm.com",
    apikey=myApikey,
    project_id=os.getenv("PROJECT_ID"),
    max_new_tokens=1000
)

In [7]:
async def evaluate_rag(log, ground_truth=None):
    # Ingest the session log and ground truth labels to compute the RAG metrics.
    # Only Answer Relevancy is computed for now; Correctness and Semantic Similarity
    # require ground_truth labels and are left as commented-out stubs.
    parsed_list = log.split("\n")  # split the session log into lines
    # Drop the role-token lines and any blank or very short lines
    filtered = [x for x in parsed_list if "<|system|>" not in x and "<|assistant|>" not in x and "<|user|>" not in x and len(x) > 5]
    # filtered[0] is the system prompt; queries and responses alternate after it
    queries = filtered[1::2]  # extract the queries only
    responses = filtered[2::2]  # extract the responses only

    #print("Your queries are: {0}{1}".format('\n\n',' \n'.join(queries)))
    #print("\nYour responses are: {0}{1}".format('\n\n',' \n\n'.join(responses)))

    ar_eval = AnswerRelevancyEvaluator(llm=watsonx_llm)
    cr_eval = CorrectnessEvaluator(llm=watsonx_llm)
    #ss_eval=SemanticSimilarityEvaluator(llm=watsonx_llm)

    ar_runner = BatchEvalRunner(
        {"answer_relevancy": ar_eval}, workers=5, show_progress=True)

    # Not used yet: reserved for the correctness evaluation stub below
    cr_ss_runner = BatchEvalRunner(
        {"correctness": cr_eval}, workers=5, show_progress=True)

    eval_results_ar = await ar_runner.aevaluate_response_strs(queries=queries, response_strs=responses)

    #eval_results_crss = await cr_ss_runner.aevaluate_response_strs(queries=queries, response_strs=responses,reference=ground_truth)

    # Average the per-response scores and express the result as a percentage
    ar_score = mean([float(eval_results_ar['answer_relevancy'][x].score) for x in range(len(queries))]) * 100
    results = [eval_results_ar['answer_relevancy'][x].feedback for x in range(len(queries))]
    # Each feedback string is expected to be "<reasoning>\n\n[RESULT] <points>"
    feedback, scores = map(list, zip(*(s.split("\n\n") for s in results)))
    feedback = [x.replace("\n", "") for x in feedback]
    print("\n")

    print(f"The average Answer Relevancy score is {round(ar_score,2)}%")
    print("\n")

    df = pd.DataFrame(data={'Query': queries, 'Response': responses, 'Feedback': feedback, 'Score (max:2)': scores})
    with pd.option_context('display.max_rows', None, 'display.max_columns', None):
        display(df)
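
The function above recovers queries and responses by filtering lines and taking every other element, which relies on each turn fitting on a single line longer than five characters. A possible alternative, shown here only as a sketch, is to split the log on the role tokens themselves. The helper name `parse_chat_log` and the sample log are hypothetical; only the `<|system|>`, `<|user|>`, and `<|assistant|>` tokens come from the notebook.

```python
import re
from typing import List, Tuple

ROLE_TOKEN = re.compile(r"<\|(system|user|assistant)\|>")


def parse_chat_log(log: str) -> List[Tuple[str, str]]:
    """Return (query, response) pairs from a log delimited by role tokens."""
    parts = ROLE_TOKEN.split(log)
    # With one capture group, split() yields [prefix, role, text, role, text, ...]
    turns = [(role, text.strip()) for role, text in zip(parts[1::2], parts[2::2])]
    pairs = []
    pending_query = None
    for role, text in turns:
        if role == "user" and text:
            pending_query = text
        elif role == "assistant" and text and pending_query is not None:
            pairs.append((pending_query, text))
            pending_query = None
    return pairs


# Hypothetical two-turn log with a trailing empty assistant token
sample_log = """<|system|>
You are a helpful assistant.
<|user|>
What is energy star?
<|assistant|>
A government-backed program.
It promotes energy efficiency.
<|assistant|>

"""

for query, response in parse_chat_log(sample_log):
    print(query, "->", response)
```

This variant keeps multi-line responses together and ignores an empty trailing `<|assistant|>` turn, at the cost of assuming the role tokens never appear inside the message text.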
    



    

    

With the LLM connection established, copy and paste the RAG log from your Prompt Lab session into the cell below, then run both cells.

The responses will be evaluated against the following two criteria:

1. Does the provided response match the subject matter of the user's query?
2. Does the provided response attempt to address the focus or perspective on the subject matter taken on by the user's query?

Each question above is worth 1 point.
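
To make the scoring concrete, the sketch below turns raw points (0 to 2 per response) into the average percentage that `evaluate_rag` prints. It assumes the evaluator's per-response score is the raw points divided by the maximum of 2, which is consistent with the sample run in this notebook (raw scores 2, 2, and 0 giving 66.67%). The helper name `relevancy_percentage` is hypothetical.

```python
from statistics import mean

MAX_POINTS = 2  # two criteria, one point each


def relevancy_percentage(raw_points):
    """Average the normalized scores (points / MAX_POINTS) and scale to percent."""
    return mean(p / MAX_POINTS for p in raw_points) * 100


print(round(relevancy_percentage([2, 2, 0]), 2))  # 66.67
```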

In [8]:
myLog="""<|system|>
You are Granite Chat, an AI language model developed by IBM. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior. You are a AI language model designed to function as a specialized Retrieval Augmented Generation (RAG) assistant. When generating responses, prioritize correctness, i.e., ensure that your response is correct given the context and user query, and that it is grounded in the context. Furthermore, make sure that the response is supported by the given document or context. Always make sure that your response is relevant to the question. If an explanation is needed, first provide the explanation or reasoning, and then give the final answer. Avoid repeating information unless asked.
<|user|>
What's the ideal temperature?
<|assistant|>
The ideal temperature settings for a programmable thermostat depend on the time of day and whether the homeowner wants heating or cooling. For example, the wake time should be set between 60-70°F (15-21°C) for heat and 70-78°F (21-26°C) for cool. During the day, the temperature can be set back at least 8°F (4°C) for heat and 7°F (4°C) for cool during the daytime. In the evening, the temperature should be set back at least 8°F (4°C) for heat and 4°F (2°C) for cool. It's important to note that these are just general guidelines and can be adjusted to fit the homeowner's specific needs and comfort preferences.
<|user|>
What is energy star?
<|assistant|>
Energy Star is a government-backed program that helps people save money and protect the environment by promoting energy-efficient products and practices. It was established by the U.S. Environmental Protection Agency (EPA) to reduce greenhouse gas emissions and combat global warming. Energy Star provides recommendations for energy-efficient equipment, installation, maintenance, and ways to optimize heating and cooling systems. By using Energy Star qualified products and taking energy-saving actions, individuals can save money on energy bills, improve comfort, and protect their valuables from harmful ultraviolet light.
<|user|>
What is IBM Sterling?
<|assistant|>
Explanation: The document provided does not contain any information about IBM Sterling. It is a software product used for managing supply chains and trading networks.
<|assistant|>

"""


In [9]:
await evaluate_rag(myLog)

100%|██████████| 3/3 [00:06<00:00,  2.16s/it]



The average Answer Relevancy score is 66.67%







| | Query | Response | Feedback | Score (max:2) |
|---|---|---|---|---|
| 0 | What's the ideal temperature? | The ideal temperature settings for a programmable thermostat depend on the time of day and whether the homeowner wants heating or cooling. For example, the wake time should be set between 60-70°F (15-21°C) for heat and 70-78°F (21-26°C) for cool. During the day, the temperature can be set back at least 8°F (4°C) for heat and 7°F (4°C) for cool during the daytime. In the evening, the temperature should be set back at least 8°F (4°C) for heat and 4°F (2°C) for cool. It's important to note that these are just general guidelines and can be adjusted to fit the homeowner's specific needs and comfort preferences. | 1. The provided response matches the subject matter of the user's query as it discusses temperature settings.2. The provided response attempts to address the focus of the user's query by giving specific temperature ranges for different times of the day and for both heating and cooling. | [RESULT] 2 |
| 1 | What is energy star? | Energy Star is a government-backed program that helps people save money and protect the environment by promoting energy-efficient products and practices. It was established by the U.S. Environmental Protection Agency (EPA) to reduce greenhouse gas emissions and combat global warming. Energy Star provides recommendations for energy-efficient equipment, installation, maintenance, and ways to optimize heating and cooling systems. By using Energy Star qualified products and taking energy-saving actions, individuals can save money on energy bills, improve comfort, and protect their valuables from harmful ultraviolet light. | 1. The provided response matches the subject matter of the user's query as it explains what Energy Star is.2. The provided response addresses the focus of the user's query by explaining the purpose and benefits of the Energy Star program. | [RESULT] 2 |
| 2 | What is IBM Sterling? | Explanation: The document provided does not contain any information about IBM Sterling. It is a software product used for managing supply chains and trading networks. | 1. The response does not directly match the subject matter of the user's query. While it mentions "IBM Sterling," it primarily discusses the absence of information in a document rather than explaining what IBM Sterling is.2. The response does not address the focus or perspective of the user's query, which is to understand what IBM Sterling is. Instead, it talks about the lack of information in a document. | [RESULT] 0 |
