# Learning Objectives

Implement a monitoring workflow that checks groundedness and relevance for a RAG application deployed on Hugging Face Spaces.

# Setup

In [None]:
!pip install -q openai==1.23.2 datasets


In [None]:
import pandas as pd

from openai import OpenAI
from datasets import load_dataset
from google.colab import userdata
from tqdm import tqdm

In [None]:
anyscale_api_key = userdata.get('anyscale_api_key')

In [None]:
client = OpenAI(
    base_url="https://api.endpoints.anyscale.com/v1",
    api_key=anyscale_api_key
)

In [None]:
rater_model = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# Access Logs

In [None]:
prediction_logs = load_dataset("pgurazada1/document-qna-chroma-anyscale-logs")


In [None]:
prediction_logs_df = prediction_logs['train'].to_pandas()

In [None]:
prediction_logs_df

Unnamed: 0,user_input,retrieved_context,model_response
0,What was the total revenue of the company in 2...,(Dollars\tin\tmillions)\n2023\n2022\n2021\n$\n...,The total revenue of the company in 2022 was $...
1,Summarize the Management Discussion and Analys...,ITEM\t7.\t\nMANAGEMENT’S\tDISCUSSION\tAND\tANA...,In the 2021 Management Discussion and Analysis...
2,What was the company's debt level in 2020?,"1,800\n\t\n\t\n\t\n\t\n\t\n—\n\t\n\t\n\t\n\t\n...","The company's debt level in 2020 was $10,402."
3,Identify 5 key risks identified in the 2019 10...,is\tnot\tincorporated\tby\treference\tinto\tth...,1. Impact from macroeconomic conditions result...


# RAG Quality Checks

We now use the LLM-as-a-judge method to check the quality of the RAG system, covering both retrieval and generation, with two ratings: groundedness and relevance. The ratings are computed on the logs collected from the production endpoint. Note that the rater prompts are the same ones we used during prompt engineering before deployment.

In [None]:
groundedness_rater_system_message = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.

Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
1 - The metric is not followed at all
2 - The metric is followed only to a limited extent
3 - The metric is followed to a good extent
4 - The metric is followed mostly
5 - The metric is followed completely

Metric:
The answer should be derived only from the information presented in the context

Instructions:
1. First write down the steps that are needed to evaluate the answer as per the metric.
2. Give a step-by-step explanation if the answer adheres to the metric considering the question and context as the input.
3. Next, evaluate the extent to which the metric is followed.
4. Use the previous information to rate the answer using the evaluation criteria and assign a score.
"""

In [None]:
relevance_rater_system_message = """
You are tasked with rating AI generated answers to questions posed by users.
You will be presented a question, context used by the AI system to generate the answer and an AI generated answer to the question.
In the input, the question will begin with ###Question, the context will begin with ###Context while the AI generated answer will begin with ###Answer.

Evaluation criteria:
The task is to judge the extent to which the metric is followed by the answer.
1 - The metric is not followed at all
2 - The metric is followed only to a limited extent
3 - The metric is followed to a good extent
4 - The metric is followed mostly
5 - The metric is followed completely

Metric:
Relevance measures how well the answer addresses the main aspects of the question, based on the context.
Consider whether all and only the important aspects are contained in the answer when evaluating relevance.

Instructions:
1. First write down the steps that are needed to evaluate the context as per the metric.
2. Give a step-by-step explanation if the context adheres to the metric considering the question as the input.
3. Next, evaluate the extent to which the metric is followed.
4. Use the previous information to rate the context using the evaluation criteria and assign a score.
"""

In [None]:
user_message_template = """
###Question
{question}

###Context
{context}

###Answer
{answer}
"""

We iterate over the rows in the logs and retrieve the three components needed for evaluation: question, context and answer.

In [None]:
groundedness_evaluation, relevance_evaluation = [], []

In [None]:
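# Pass total so tqdm can show progress; iterrows() is a generator with no length.
for index, row in tqdm(prediction_logs_df.iterrows(), total=len(prediction_logs_df)):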

    user_input = row['user_input']
    retrieved_context = row['retrieved_context']
    model_response = row['model_response']

    groundedness_prompt = [
        {'role':'system', 'content': groundedness_rater_system_message},
        {'role': 'user', 'content': user_message_template.format(
            question=user_input,
            context=retrieved_context,
            answer=model_response
            )
        }
    ]

    groundedness_response = client.chat.completions.create(
        model=rater_model,
        messages=groundedness_prompt,
        temperature=0
    )

    groundedness_rating = groundedness_response.choices[0].message.content

    relevance_prompt = [
        {'role':'system', 'content': relevance_rater_system_message},
        {'role': 'user', 'content': user_message_template.format(
            question=user_input,
            context=retrieved_context,
            answer=model_response
            )
        }
    ]

    relevance_response = client.chat.completions.create(
        model=rater_model,
        messages=relevance_prompt,
        temperature=0
    )

    relevance_rating = relevance_response.choices[0].message.content

    groundedness_evaluation.append(groundedness_rating)
    relevance_evaluation.append(relevance_rating)
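
The two prompts built inside the loop differ only in their system message, so the message construction could be factored into a small helper. The sketch below is one possible way to do that. `build_rater_messages` is a hypothetical name that does not appear in the original notebook, and the template is repeated here only so that the block runs on its own.

```python
# Template repeated from the notebook so this sketch is self-contained.
user_message_template = """
###Question
{question}

###Context
{context}

###Answer
{answer}
"""


def build_rater_messages(system_message, question, context, answer):
    """Return the chat messages for one rater call (hypothetical helper)."""
    return [
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': user_message_template.format(
            question=question,
            context=context,
            answer=answer
        )},
    ]


# Example with placeholder strings (not taken from the logs).
example_messages = build_rater_messages(
    "You are a rater.", "Q?", "Some context.", "An answer."
)
```

Inside the loop, `groundedness_prompt` and `relevance_prompt` would then each be a single call to this helper with the corresponding system message.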

In [None]:
ratings_df = prediction_logs_df.copy()
ratings_df['groundedness_evaluation'] = groundedness_evaluation
ratings_df['relevance_evaluation'] = relevance_evaluation

In [None]:
ratings_df

Unnamed: 0,user_input,retrieved_context,model_response,groundedness_evaluation,relevance_evaluation
0,What was the total revenue of the company in 2...,(Dollars\tin\tmillions)\n2023\n2022\n2021\n$\n...,The total revenue of the company in 2022 was $...,Steps to evaluate the answer:\n1. Identify th...,1. The steps to evaluate the context as per t...
1,Summarize the Management Discussion and Analys...,ITEM\t7.\t\nMANAGEMENT’S\tDISCUSSION\tAND\tANA...,In the 2021 Management Discussion and Analysis...,Steps to evaluate the answer:\n\n1. Identify ...,1. To evaluate the context as per the relevan...
2,What was the company's debt level in 2020?,"1,800\n\t\n\t\n\t\n\t\n\t\n—\n\t\n\t\n\t\n\t\n...","The company's debt level in 2020 was $10,402.",Steps to evaluate the answer:\n1. Identify th...,1. To evaluate the context as per the relevan...
3,Identify 5 key risks identified in the 2019 10...,is\tnot\tincorporated\tby\treference\tinto\tth...,1. Impact from macroeconomic conditions result...,Steps to evaluate the answer:\n1. Identify th...,Steps to evaluate the context as per the rele...


This dataframe can then be inspected for low ratings on specific inputs. Depending on the issue uncovered, appropriate action can be taken (e.g., changing the embedding model, the chunk size or overlap, or the generation model).
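
Because the rater returns free-form text, spotting low ratings means pulling a numeric score out of each evaluation first. The sketch below is a heuristic and not part of the original notebook. It assumes the rater states its score in a form such as "Score: 4", "rating of 4" or "4/5", which the prompts do not guarantee. The helper name `extract_score`, the threshold of 3, and the sample strings are all illustrative assumptions.

```python
import re

import pandas as pd

# Heuristic patterns for a 1-5 score; the rater's wording is not guaranteed,
# so unmatched texts return None and should be read manually.
SCORE_PATTERNS = [
    re.compile(r'(?:score|rating)\D{0,20}?([1-5])\b', re.IGNORECASE),
    re.compile(r'\b([1-5])\s*(?:/|out of)\s*5\b', re.IGNORECASE),
]


def extract_score(evaluation_text):
    """Return the last 1-5 score found in the rater's text, or None."""
    found = []
    for pattern in SCORE_PATTERNS:
        for match in pattern.finditer(evaluation_text):
            found.append((match.start(), int(match.group(1))))
    if not found:
        return None
    # The prompts ask for the score as the final step, so take the last match.
    return max(found)[1]


# Toy rows standing in for ratings_df; these strings are made up for illustration.
sample_df = pd.DataFrame({
    'user_input': ['question A', 'question B', 'question C'],
    'groundedness_evaluation': [
        'Steps...\nThe answer is fully supported. Score: 5',
        'Steps...\nSeveral claims are missing from the context. I assign a rating of 2.',
        'Steps...\nThe answer is partly supported.',
    ],
})

# to_numeric turns None into NaN so the comparison below is well defined.
sample_df['groundedness_score'] = pd.to_numeric(
    sample_df['groundedness_evaluation'].apply(extract_score)
)

LOW_SCORE_THRESHOLD = 3  # assumed cut-off; tune to your own tolerance

# Rows scoring below the threshold, plus rows where no score could be parsed.
needs_review = sample_df[
    sample_df['groundedness_score'].isna()
    | (sample_df['groundedness_score'] < LOW_SCORE_THRESHOLD)
]
```

Rows where no score can be parsed are kept in `needs_review` on purpose, since a missing score says nothing about quality and the full evaluation text should be read in that case.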