# RAG metrics: answer correctness

This notebook implements and explores answer correctness, an evaluation metric for RAG.
It accompanies a blog post, which can be found [here](https://www.opper.ai/blog/rag-metrics-answer-correctness).

In [None]:
%pip install -U opperai pandas

# import os
# os.environ['OPPER_API_KEY'] = 'YOUR_API_KEY'

## Answer Correctness

Answer correctness measures the accuracy of a RAG answer by comparing the generated answer to the ground truth. The `calculate` method takes the generated answer and the ground truth and returns a number between 0 and 1, where a higher score indicates a more correct answer. The score is an F1 score computed over statements classified as true positives, false positives, and false negatives. See [F-score](https://en.wikipedia.org/wiki/F-score) for more information.
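
As a short derivation of the formula used by the `score` property in the next cell: writing $TP$, $FP$, and $FN$ for the number of true positives, false positives, and false negatives, the standard definition of F1 as the harmonic mean of precision $P$ and recall $R$ simplifies to the expression in the code. The symbols are introduced here for illustration only.

$$
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}
$$

$$
F_1 = \frac{2PR}{P + R} = \frac{2\,TP}{2\,TP + FP + FN} = \frac{TP}{TP + \tfrac{1}{2}(FP + FN)}
$$

When $TP = 0$ the score is defined as 0, which is what the `if tp > 0 else 0` guard in the code does.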

In [15]:
from typing import List
from pydantic import BaseModel, Field
from opperai import fn


class Reason(BaseModel):
    statement: str = Field(..., description="The statement that was classified")
    reason: str = Field(
        ..., description="The reason why the statement was classified as such"
    )


class CorrectnessClassifications(BaseModel):
    true_positives: List[Reason] = Field(..., description="True positives - statements that are present in answer that are also directly supported by the one or more statements in ground truth")
    false_positives: List[Reason] = Field(..., description="False positives - statements that are present in answer but not directly supported by any statement in ground truth")
    false_negatives: List[Reason] = Field(..., description="False negatives - statements found in the ground truth but not present in answer")

    @property
    def score(self) -> float:
        """Calculate the correctness (F1) score from the classification counts."""

        tp = len(self.true_positives)
        fp = len(self.false_positives)
        # Named fn_count to avoid shadowing the imported `fn` decorator.
        fn_count = len(self.false_negatives)

        # F1 = TP / (TP + 0.5 * (FP + FN)); 0.0 when there are no true positives.
        return tp / (tp + 0.5 * (fp + fn_count)) if tp > 0 else 0.0


class Correctness(BaseModel):
    @fn(model="openai/gpt4-turbo")
    def classify(answer: str, ground_truth: str) -> CorrectnessClassifications:
        """Given an answer and a ground truth, analyze each statement and classify it as belonging 
        to one of the classifications.

        NOTE Each statement can only belong to one classification.
        """

    def calculate(self, answer: str, ground_truth: str) -> float:
        """Given an answer and a ground truth, calculate the correctness score."""

        classifications = self.classify(answer=answer, ground_truth=ground_truth)

        return classifications.score

We can now try out our correctness calculation. We'll use a couple of samples from the Paul Graham Q&A dataset, which can be found on [Hugging Face](https://huggingface.co/datasets/LangChainDatasets/question-answering-paul-graham).

First, we provide the same text as both `answer` and `ground_truth`, which should give a perfect score.

In [16]:
correctness_calculator = Correctness()

classification = correctness_calculator.classify(
    answer="The two main things the author worked on before college were writing and programming.",
    ground_truth="The two main things the author worked on before college were writing and programming.",
)

print(classification)
print(classification.score)

true_positives=[Reason(statement='The two main things the author worked on before college were writing and programming.', reason='This statement is present in both the answer and the ground truth.')] false_positives=[] false_negatives=[]
1.0


We now modify the answer by changing `writing` to `cleaning windows`, which should give us a lower score.

In [17]:
classification = correctness_calculator.classify(
    answer="The two main things the author worked on before college were cleaning windows and programming.",
    ground_truth="The two main things the author worked on before college were writing and programming.",
)

print(classification)
print(classification.score)

true_positives=[Reason(statement='programming', reason='This statement is present in both the answer and the ground truth, describing one of the main things the author worked on before college.')] false_positives=[Reason(statement='cleaning windows', reason="This activity is mentioned in the answer but not supported by the ground truth, which instead lists 'writing' as the activity.")] false_negatives=[Reason(statement='writing', reason="This activity is listed in the ground truth but omitted in the answer, which mentions 'cleaning windows' instead.")]
0.5
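
The 0.5 above follows directly from the counts in the printed classification: one true positive, one false positive, and one false negative. A minimal sketch of that arithmetic, using a hypothetical helper `f1_from_counts` that mirrors the formula in the `score` property without calling the model:

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Same formula as the notebook's score property, applied to raw counts."""
    return tp / (tp + 0.5 * (fp + fn)) if tp > 0 else 0.0


# Counts read off the printed classification above:
# 'programming' (TP), 'cleaning windows' (FP), 'writing' (FN).
print(f1_from_counts(tp=1, fp=1, fn=1))  # 0.5

# Hypothetical alternative: if the model had treated each sentence as a single
# indivisible statement, there would be no true positive at all.
print(f1_from_counts(tp=0, fp=1, fn=1))  # 0.0
```

The second call illustrates that the score depends on how finely the model splits the text into statements, so scores for the same answer can vary between runs.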


## Naive RAG in Opper 

To try this at a slightly larger scale, we can run it on the Paul Graham Q&A dataset. To have something to benchmark, we create an index in Opper, upload the essay that the questions are based on (`what_i_worked_on.txt`), and then create a function that uses entries from the index to answer questions.

First, we load the essay into an Opper index.

In [18]:
from opperai import Opper

opper = Opper()

index = opper.indexes.get(name="qna")
if not index:
    index = opper.indexes.create("qna")
    res = index.upload_file("what_i_worked_on.txt")
    print(res)

We can now query the index:

In [19]:
res = index.query("What were the two main things the author worked on before college?")

print(res)

[RetrievalResponse(content='What I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain\'s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.\n\nThe language we used was an early version of Fortran. You had to type prog

The result of querying an index is a list of `RetrievalResponse` objects. We can create a function that, given such a list and a question, returns an answer to the question.

In [20]:
from opperai.types import RetrievalResponse


@fn(model="openai/gpt4-turbo")
def query_qna(query: str, context: List[RetrievalResponse]) -> str:
    """Given a query and a context, answer the question using the context."""


query = "What were the two main things the author worked on before college?"
context = index.query(query)
res = query_qna(query=query, context=context)

print(res)


def answer_question(query: str) -> str:
    context = index.query(query)
    res = query_qna(query=query, context=context)

    return res

The two main things the author worked on before college were writing and programming.


This looks good. Now we load all the questions and answers from the Q&A dataset and try the first two.

In [21]:
import requests

url = "https://huggingface.co/datasets/LangChainDatasets/question-answering-paul-graham/raw/main/paul_graham_qa.json"

response = requests.get(url)
response.raise_for_status()

json_data = response.json()

for row in json_data[0:2]:
    q = row["question"]
    a = row["answer"]
    res = answer_question(q)
    print(f"Question: {q}")
    print(f"Answer: {a}")
    print(f"Opper Answer: {res}")

Question: What were the two main things the author worked on before college?
Answer: The two main things the author worked on before college were writing and programming.
Opper Answer: The author worked on writing and programming before college.
Question: What made the author want to work on AI?
Answer: The novel 'The Moon is a Harsh Mistress' and a PBS documentary showing Terry Winograd using SHRDLU made the author want to work on AI.
Opper Answer: The author was motivated to work on AI by two main influences: a novel by Heinlein called 'The Moon is a Harsh Mistress' featuring an intelligent computer named Mike, and a PBS documentary showing Terry Winograd using the SHRDLU program. These experiences led the author to believe that creating intelligent machines like Mike was imminent and inspired him to teach himself AI, starting with learning Lisp.


## Benchmark correctness

Now we stitch the RAG function together with the correctness function.

First, create a dataframe with the first 10 questions and answers from the Q&A dataset, and add a column with the RAG answer for each question.


In [22]:
import pandas as pd

df = pd.DataFrame(json_data[:10])
df["opper_answer"] = df["question"].apply(answer_question)

print(df)

                                            question  \
0  What were the two main things the author worke...   
1           What made the author want to work on AI?   
2  What did the author realize while looking at a...   
3   What did the author write their dissertation on?   
4  What is the difference between painting still ...   
5  What did the author learn while working at Int...   
6  What did the author do to survive during the n...   
7  What was the author's motivation for wanting t...   
8        What is Viaweb and how did it get its name?   
9  What was the price charged by Viaweb for a sma...   

                                              answer  \
0  The two main things the author worked on befor...   
1  The novel 'The Moon is a Harsh Mistress' and a...   
2  The author realized that paintings were someth...   
3  The author wrote their dissertation on applica...   
4  Painting still lives is different from paintin...   
5  The author learned that low end software ten

We then create a new column with the correctness score for each question.

In [23]:
df["correctness"] = df.apply(
    lambda row: correctness_calculator.calculate(row["opper_answer"], row["answer"]),
    axis=1,
)
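
Each call to `calculate` goes through a model, so if one call raises, the whole `apply` stops and the scores computed so far are lost. The following is a minimal sketch of a more forgiving loop, not part of the original notebook: `score_rows` and `fake_score` are hypothetical names, and because the exceptions the Opper SDK may raise are not covered here, the sketch catches `Exception` broadly.

```python
import math
from typing import Callable, Dict, List


def score_rows(
    rows: List[Dict[str, str]],
    score_fn: Callable[[str, str], float],
) -> List[float]:
    """Score each row, recording NaN instead of aborting when a call fails."""
    scores = []
    for row in rows:
        try:
            scores.append(float(score_fn(row["opper_answer"], row["answer"])))
        except Exception as exc:  # assumption: any failure is safe to skip
            print(f"Scoring failed for {row['question']!r}: {exc}")
            scores.append(math.nan)
    return scores


def fake_score(answer: str, ground_truth: str) -> float:
    """Stand-in scorer for illustration only: exact match, fails on empty input."""
    if not answer:
        raise ValueError("empty answer")
    return 1.0 if answer == ground_truth else 0.0


rows = [
    {"question": "q1", "answer": "a", "opper_answer": "a"},
    {"question": "q2", "answer": "a", "opper_answer": "b"},
    {"question": "q3", "answer": "a", "opper_answer": ""},
]
scores = score_rows(rows, fake_score)
print(scores)
```

With the notebook's objects, this could be used as `df["correctness"] = score_rows(df.to_dict("records"), correctness_calculator.calculate)`. pandas skips NaN values by default when computing the mean, so failed rows would not distort the average.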

In [24]:
for row in df[:5].itertuples():
    print(f"Question: {row.question}")
    print(f"Answer: {row.answer}")
    print(f"Opper Answer: {row.opper_answer}")
    print(f"Answer Correctness: {row.correctness}")
    print("\n")

Question: What were the two main things the author worked on before college?
Answer: The two main things the author worked on before college were writing and programming.
Opper Answer: The author worked on writing and programming before college.
Answer Correctness: 0.6666666666666666


Question: What made the author want to work on AI?
Answer: The novel 'The Moon is a Harsh Mistress' and a PBS documentary showing Terry Winograd using SHRDLU made the author want to work on AI.
Opper Answer: The author was inspired to work on AI primarily by a novel by Heinlein titled 'The Moon is a Harsh Mistress', featuring an intelligent computer called Mike, and a PBS documentary showcasing Terry Winograd using SHRDLU. These works deeply influenced him and made him believe that creating intelligent machines like Mike was imminent and feasible.
Answer Correctness: 0.5


Question: What did the author realize while looking at a painting at the Carnegie Institute?
Answer: The author realized that paintin

Taking the mean of the correctness column gives us the average correctness score.

In [25]:
print(df["correctness"].mean())

0.742051282051282
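
The mean alone hides which questions score poorly. A minimal sketch of a slightly richer summary, using only the standard library: the scores below are made-up values for illustration, not results from this notebook, and `summarize` and its `threshold` parameter are hypothetical.

```python
from statistics import mean, median
from typing import Dict

# Hypothetical scores for illustration only. In the notebook they could be
# built with dict(zip(df["question"], df["correctness"])).
example_scores = {"q1": 1.0, "q2": 0.5, "q3": 0.8, "q4": 0.0}


def summarize(scores: Dict[str, float], threshold: float = 0.5) -> dict:
    """Return basic statistics and the questions scoring below the threshold."""
    values = list(scores.values())
    return {
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "below_threshold": sorted(q for q, s in scores.items() if s < threshold),
    }


summary = summarize(example_scores)
print(summary)
```

Listing the low-scoring questions makes it easier to inspect their classifications and see whether the answer or the classification was at fault.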


## Conclusion

What we have implemented is the basis of a more comprehensive evaluation of RAG. It can be used to compare the correctness of different implementations, prompts, models, etc. We can also add more metrics, such as faithfulness and relevance.

The correctness prompt is heavily inspired by the [ragas](https://github.com/explodinggradients/ragas) correctness prompt.