# RAG Metrics

The list below covers some of the metrics used to measure the performance of a RAG model.
Performance depends on three different components: Answer Synthesis (LLM parameters and prompt), Retrieval (search parameters and ranking), and Indexing (embedding model and chunking methods). Each metric attempts to measure a different component. The list is not exhaustive, and different use cases call for different metrics, for example measuring the tone of the final answer for customer-facing models.

**Correctness**: 

Whether the generated answer matches that of the reference answer given the query (requires labels).

**Semantic Similarity**: 

Whether the predicted answer is semantically similar to the reference answer (requires labels).
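
One common way to score semantic similarity, assumed here rather than stated in the text above, is the cosine similarity between an embedding of the predicted answer and an embedding of the reference answer. The sketch below uses small hand-written vectors as hypothetical stand-ins for real embeddings; `cosine_similarity`, `predicted_vec`, and `reference_vec` are illustrative names only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings; a real pipeline would get these from an embedding model
predicted_vec = [1.0, 2.0, 3.0]
reference_vec = [2.0, 4.0, 6.0]   # same direction as predicted_vec
unrelated_vec = [3.0, 0.0, -1.0]  # orthogonal to predicted_vec

print(cosine_similarity(predicted_vec, reference_vec))
print(cosine_similarity(predicted_vec, unrelated_vec))
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, so a threshold on this value can serve as a pass/fail check.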

**Faithfulness**: 

Evaluates whether the answer is faithful to the retrieved contexts (in other words, whether there is hallucination).

**Context Relevancy**: 

Whether retrieved context is relevant to the query. 

**Answer Relevancy**: 

Whether the generated answer is relevant to the query. 

**Guideline Adherence**: 

Whether the predicted answer adheres to specific guidelines.

**Hit Rate**:

Hit rate calculates the fraction of queries where the correct answer is found within the top-k retrieved documents. In simpler terms, it’s about how often our system gets it right within the top few guesses.

**Mean Reciprocal Rank (MRR)**:

MRR evaluates the accuracy of a system by examining the rank of the highest-placed relevant document for each query. It is the average of the reciprocals of these ranks across all queries. For instance, if the first relevant document is ranked first, the reciprocal rank is 1; if it’s second, the reciprocal rank is 1/2, and so forth.
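
The two retrieval metrics above can be sketched in a few lines of plain Python. This is a minimal illustration that assumes exactly one relevant document per query; the function names and the document IDs are hypothetical, and a query whose relevant document is never retrieved contributes 0 to MRR.

```python
def hit_rate(ranked_results, relevant_ids, k):
    """Fraction of queries whose relevant document appears in the top-k results."""
    hits = sum(
        1 for ranked, relevant in zip(ranked_results, relevant_ids)
        if relevant in ranked[:k]
    )
    return hits / len(ranked_results)

def mean_reciprocal_rank(ranked_results, relevant_ids):
    """Average of 1/rank of the relevant document (0 if it was not retrieved)."""
    total = 0.0
    for ranked, relevant in zip(ranked_results, relevant_ids):
        if relevant in ranked:
            total += 1.0 / (ranked.index(relevant) + 1)
    return total / len(ranked_results)

# Hypothetical retrieval output: one ranked list of document IDs per query
ranked_results = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8", "d9"]]
relevant_ids = ["d1", "d5", "d0"]  # "d0" is never retrieved

print(hit_rate(ranked_results, relevant_ids, k=2))
print(mean_reciprocal_rank(ranked_results, relevant_ids))
```

In this toy data the first query finds its document at rank 1, the second at rank 2, and the third misses entirely, so the reciprocal ranks are 1, 1/2, and 0.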

# Preparation

Run the cell below and restart the kernel. Once the kernel restarts, you can skip to **Set Up**.

In [None]:
!pip install -q -U llama-index llama-index-llms-ibm pydantic
print("Done Installing!")

# Set Up

Before we measure performance, we need a way to tell watsonx.ai that an authorized user is trying to use the models. Follow the instructions below to retrieve your API key.

Instruction:

When done, paste **your API key below** and run both cells.

In [None]:
myApikey="Paste your API KEY Here"

In [None]:

from llama_index.core.evaluation import BatchEvalRunner
from llama_index.core.evaluation import AnswerRelevancyEvaluator
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.llms.ibm import WatsonxLLM
from statistics import mean
import os
import nest_asyncio
import asyncio
import warnings

warnings.filterwarnings("ignore")

nest_asyncio.apply()

watsonx_llm = WatsonxLLM(
    model_id="mistralai/mistral-large", # granite or mistral 
    url="https://us-south.ml.cloud.ibm.com",
    apikey=myApikey,
    project_id=os.getenv("PROJECT_ID"),
    max_new_tokens=1000
)

In [None]:
async def evaluate_rag(log,ground_truth=None):
    # Ingest the session log and ground truth labels to compute the RAG metrics
    # The metrics to be computed are: Answer Relevancy, Correctness, and Semantic Similarity
    parsed_list=log.split("\n") # take the log from the session log and turn it into a list
    filtered=[x for x in parsed_list if "<|system|>" not in x and "<|assistant|>" not in x and "<|user|>" not in x and len(x)>5]
    queries=filtered[1::2] #extract the queries only
    responses=filtered[2::2] #extract the responses only
    
    print("Your queries are: {0}{1}".format('\n\n',' \n'.join(queries)))
    print("\nYour responses are: {0}{1}".format('\n\n',' \n'.join(responses)))
    
    ar_eval=AnswerRelevancyEvaluator(llm=watsonx_llm)
    cr_eval=CorrectnessEvaluator(llm=watsonx_llm)
    #ss_eval=SemanticSimilarityEvaluator(llm=watsonx_llm)
    
    ar_runner = BatchEvalRunner(
        {"answer_relevancy": ar_eval},workers=5,show_progress=True)

    cr_ss_runner=BatchEvalRunner(
        {"correctness": cr_eval},workers=5,show_progress=True)

    eval_results_ar = await ar_runner.aevaluate_response_strs(queries=queries, response_strs=responses)
    ar_score=mean([float(eval_results_ar['answer_relevancy'][x].score) for x in range(len(queries))])*100
    print(f"The average Answer Relevancy score is {round(ar_score,2)}%")
    
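    # Correctness needs one reference answer per query, so it only runs when ground_truth is provided
    if ground_truth is not None:
        eval_results_cr = await cr_ss_runner.aevaluate_response_strs(queries=queries, response_strs=responses, reference=ground_truth)
        # CorrectnessEvaluator scores on a 1 to 5 scale by default; skip results without a score
        cr_score=mean([float(r.score) for r in eval_results_cr['correctness'] if r.score is not None])
        print("\n")
        print(f"The average Correctness score is {round(cr_score,2)} out of 5")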

With the LLM connection established, copy and paste the RAG log from your saved Prompt Lab session into the cell below.

In [None]:
myLog="""<|system|>
You are Granite Chat, an AI language model developed by IBM. You are a cautious assistant. You carefully follow instructions. You are helpful and harmless and you follow ethical guidelines and promote positive behavior. You are a AI language model designed to function as a specialized Retrieval Augmented Generation (RAG) assistant. When generating responses, prioritize correctness, i.e., ensure that your response is correct given the context and user query, and that it is grounded in the context. Furthermore, make sure that the response is supported by the given document or context. Always make sure that your response is relevant to the question. If an explanation is needed, first provide the explanation or reasoning, and then give the final answer. Avoid repeating information unless asked.
<|user|>
What is mega yacht
<|assistant|>
A mega yacht is a type of large yacht that is typically over 80 feet in length overall. These vessels are often used for leisure, recreational, or educational purposes and are available for hire in the inland and offshore waters of Puerto Rico. To be considered a mega yacht, the vessel must meet certain criteria, such as being available in Puerto Rico for tourist activities for at least six months out of the year and submitting quarterly reports to the Tourism Office detailing its usage in the tourist activity.
<|user|>
What is an existing business
<|assistant|>
An existing business, as referred to in the document, is a business that is already engaged in a Tourist Activity at the time of filing an application for incentives under Chapter 5 of Subtitle B of this Code or which otherwise does not qualify as a New Tourism Business. This means that the business has already started operating and is generating revenue in the Tourist Activity sector.
<|user|>
What are the qualifications
<|assistant|>
To become a Qualified Promoter in Puerto Rico, one must hold a bachelor's degree from an accredited university and have at least five years of professional experience in their area of competence, which can include fields such as accounting, finance, marketing, planning, international trade, law, economics, science, engineering, real estate, or other similar fields.
<|assistant|>"""

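The log above marks each turn with a role tag (`<|system|>`, `<|user|>`, `<|assistant|>`), and `evaluate_rag` recovers queries and responses by filtering lines and relying on their alternation. As an alternative sketch, not the method used in this notebook, the turns can be split on the tags themselves so that multi-line or very short messages are still paired correctly. `parse_chat_log` and `sample_log` are hypothetical names, and the sample is a shortened stand-in for a real session log.

```python
import re

def parse_chat_log(log):
    """Pair each user query with the assistant response that follows it."""
    # Splitting on a pattern with a capture group keeps the role names:
    # [prefix, role1, text1, role2, text2, ...]
    parts = re.split(r"<\|(system|user|assistant)\|>", log)
    turns = [(role, text.strip()) for role, text in zip(parts[1::2], parts[2::2])]

    queries, responses = [], []
    pending_query = None
    for role, text in turns:
        if role == "user" and text:
            pending_query = text
        elif role == "assistant" and text and pending_query is not None:
            queries.append(pending_query)
            responses.append(text)
            pending_query = None
    return queries, responses

# Shortened, hypothetical log in the same tag format as the session log
sample_log = """<|system|>
You are a helpful assistant.
<|user|>
What is mega yacht
<|assistant|>
A mega yacht is a large yacht.
<|assistant|>"""

queries, responses = parse_chat_log(sample_log)
print(queries)
print(responses)
```

The system prompt and the trailing empty `<|assistant|>` tag are ignored because only non-empty user and assistant turns are paired.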

To evaluate the answer correctness, we've prepared the ground truth below

In [None]:
await evaluate_rag(myLog)