# RAG Evaluation

Let's start evaluating my chat assistant's question-answering ability. These are the checks I'm planning:
1. Is the answer relevant to the question?
2. Does the answer include all the important points from my resume that are relevant to the question?
3. Are the contexts retrieved from the vector DB the chunks that best match my question?
4. Is the answer relevant to the context?

I first tried [Non-LLM Evaluation techniques](#non_llm_evaluations) and later moved on to [LLM Evaluation techniques](#llm_evaluations).

## Getting response from the RAG system

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.chat_message_histories import SQLChatMessageHistory
from langchain_openai import ChatOpenAI
from langchain_community.vectorstores import Pinecone as pc_vector
from langchain.chains import ConversationalRetrievalChain
from langchain.chains.conversation.memory import ConversationBufferMemory
import os
from pinecone import Pinecone
from dotenv import load_dotenv
from langchain_huggingface import HuggingFaceEmbeddings


load_dotenv()
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
openai_api_key = os.getenv("OPENAI_API_KEY")
username = 'hari'

def configure_retriever(index_name):
    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L12-v1")

    docsearch = pc_vector.from_existing_index(index_name, embeddings)
    retriever = docsearch.as_retriever(search_type="mmr")
    return retriever

db_path = 'local_sqlite_db.db'
msgs = SQLChatMessageHistory(
    session_id=username,
    connection="sqlite:///" + db_path  # This is the SQLite connection string
)

# index_name = "my-portfolio"
index_name = 'test-my-portfolio'
memory = ConversationBufferMemory(memory_key="chat_history", chat_memory=msgs, return_messages=True)

llm = ChatOpenAI(
        model_name="gpt-4o-mini", openai_api_key=openai_api_key, temperature=0, streaming=True
    )

system_prompt = "You are A CHAT ASSISTANT even with or without context and you are NOT Haripriya. You are supposed to answer the questions asked only about Haripriya and you are NOT Haripriya. Use the following pieces of retrieved context to answer the question.\n\n{context}"
qa_prompt = ChatPromptTemplate.from_messages(
[
    ("system", system_prompt),
    ("human", "{question}"),
]
)
qa_chain = ConversationalRetrievalChain.from_llm(
        llm, retriever=configure_retriever(index_name), memory=memory, verbose=True,
        combine_docs_chain_kwargs={"prompt": qa_prompt},
)

user_query = 'Give educational background of Haripriya'
user_query += ' Brief it in at most 35 words or the count limit given in the previous sentence.'
response = qa_chain.invoke({"question":user_query})
print(response)

Since the output was huge, I took a snapshot of it to show the format of the retrieved context: [Old Context Format's Image](images/old_context_format_pdf.png). You can see a lot of extra spacing. The same thing happens when text is copied directly from a PDF and pasted elsewhere: even though the PDF shows no such spaces, the extraction introduces them. The same happens when the PDF is converted to embeddings, and it obscures the actual context during retrieval.
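To make the spacing problem concrete, here is a minimal sketch of a cleanup step that could be applied to extracted text before chunking. The function name `normalize_whitespace` and the sample string are my own illustrations, not part of the original pipeline, and the sketch only collapses redundant whitespace; it would not repair words that the extractor split letter by letter.

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse redundant whitespace in text extracted from a PDF (hypothetical helper)."""
    # Replace runs of spaces or tabs with a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Drop the single spaces left around line breaks
    text = re.sub(r" ?\n ?", "\n", text)
    # Keep at most one blank line between paragraphs
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

# Made-up sample that mimics the spacing seen in the snapshot
sample = "Bachelors   in  Electronics \n\n\n\n and   Communication"
print(repr(normalize_whitespace(sample)))
```

Cleaning the text before it is embedded keeps the chunk boundaries and the embeddings from being dominated by layout noise.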

# Non-LLM Evaluations
<a id='non_llm_evaluations'></a>

- BLEU Score: Compares the generated text to reference texts, assessing the overlap of n-grams (sequences of words).
- ROUGE Score: Similar to BLEU, but focuses on recall, measuring how much of the reference content is captured in the generated text.
- METEOR: Considers synonyms and paraphrases, aiming to better align with human judgment.
- BERTScore: Evaluates text by comparing the similarity of contextual embeddings, focusing on meaning.
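
The difference between the precision-oriented BLEU and the recall-oriented ROUGE is easier to see on a toy example. The sketch below is a simplification of both: it only computes clipped n-gram overlap for a single n, and leaves out BLEU's brevity penalty and its averaging over several n-gram orders. The helper names and the two sentences are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap_precision_recall(candidate: str, reference: str, n: int = 1):
    """Clipped n-gram overlap, reported from both sides."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    # Counter intersection clips each n-gram count to the smaller of the two
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)  # BLEU-style view
    recall = overlap / max(sum(ref.values()), 1)      # ROUGE-style view
    return precision, recall

p, r = overlap_precision_recall(
    "she studied engineering",
    "she studied electronics and communication engineering",
)
print(p, r)
```

A short answer that only uses words from the reference gets a high precision but a low recall, which is why both views are worth checking.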

In [None]:
import nltk  # For BLEU and other metrics (if needed)
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score
import numpy as np

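# 1. BLEU Score
def calculate_bleu(generated, references):
    """Calculates the mean sentence-level BLEU score."""
    smoothing = SmoothingFunction().method4  # Smoothing avoids zero scores on short texts
    bleu_scores = []
    for candidate, refs in zip(generated, references):
        # sentence_bleu expects tokenized references (a list of token lists)
        tokenized_refs = [ref.split() for ref in refs]
        # Named `bleu` so it does not shadow the imported bert_score `score`
        bleu = sentence_bleu(tokenized_refs, candidate.split(), smoothing_function=smoothing)
        bleu_scores.append(bleu)
    return np.mean(bleu_scores)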

# 2. ROUGE Score
def calculate_rouge(generated, references):
    """Calculates ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L)."""
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    rouge_1_scores = []
    rouge_2_scores = []
    rouge_l_scores = []

    for i in range(len(generated)):
        scores = scorer.score(references[i][0], generated[i]) # reference is a list of sentences, we take the first one
        rouge_1_scores.append(scores['rouge1'].fmeasure)
        rouge_2_scores.append(scores['rouge2'].fmeasure)
        rouge_l_scores.append(scores['rougeL'].fmeasure)
    return np.mean(rouge_1_scores), np.mean(rouge_2_scores), np.mean(rouge_l_scores)

# 3. BERTScore
def calculate_bertscore(generated, references):
    """Calculates BERTScore."""
    P, R, F1 = score(generated, [ref[0] for ref in references], lang="en")  # reference is a list of sentences, we take the first one
    return np.mean(P.cpu().numpy()), np.mean(R.cpu().numpy()), np.mean(F1.cpu().numpy())

In [33]:
# Clear the stored chat session history when required
msgs.clear()

In [17]:
reference_texts = [['Haripriya had good scores in academics during school and studied Electronics and Communication Engineering. She also is studying micromasters in Statistics and Data science']]
generated_texts = ['Haripriya excelled academically, achieving School First in both SSLC and Higher Secondary School with 97% marks, and District First in the Sri Ramanujan Maths Aptitude Test and Second in a Maths Quiz during HSC.']

bert_p, bert_r, bert_f1 = calculate_bertscore(generated_texts, reference_texts)
print(f"BERTScore Precision: {bert_p}")
print(f"BERTScore Recall: {bert_r}")
print(f"BERTScore F1: {bert_f1}")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BERTScore Precision: 0.8457125425338745
BERTScore Recall: 0.861638069152832
BERTScore F1: 0.8536010384559631


The score is low. I'm expecting an F1 of 0.90 or above. I'll re-create the index after modifying the PDF and the chunk size.
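
To read the three numbers above, it helps to see how BERTScore derives them. Below is a minimal sketch of its greedy matching step using made-up 2-D vectors in place of real contextual embeddings. The real library uses `roberta-large` token embeddings and optional IDF weighting, neither of which is reproduced here. The function names are my own.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_bertscore(cand_vecs, ref_vecs):
    """Greedy-matching precision, recall and F1 over token vectors."""
    # Precision: match every candidate token to its most similar reference token
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: match every reference token to its most similar candidate token
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy vectors (assumptions): the candidate has one token the reference lacks
cand = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.0]]
p, r, f1 = greedy_bertscore(cand, ref)
print(p, r, f1)
```

Raw BERTScore values tend to sit in a narrow, high range, so small differences between runs are hard to interpret. The `bert_score` library's `score` function accepts `rescale_with_baseline=True`, which may spread the values out and be worth trying here.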

## Modification 1

Changed the chunk size to 2000.
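
For reference, this is roughly what a chunk size of 2000 means if it is counted in characters. The sketch is a plain fixed-size splitter written for illustration; it is not the splitter used to build the index. The `overlap` of 200 and the function name `chunk_text` are assumptions, since only the chunk size is stated above.

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200):
    """Split text into fixed-size character chunks with overlap (hypothetical helper)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    # Each chunk starts `step` characters after the previous one
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

# Dummy 4500-character document
chunks = chunk_text("x" * 4500)
print([len(chunk) for chunk in chunks])
```

Larger chunks keep related resume sections together, at the cost of pulling more unrelated text into each retrieved context.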

In [None]:
qa_chain = ConversationalRetrievalChain.from_llm(
        llm, retriever=configure_retriever(index_name), memory=memory, verbose=True,
        combine_docs_chain_kwargs={"prompt": qa_prompt},
)

user_query = 'Give educational background of Haripriya'
user_query += ' Brief it in at most 35 words or the count limit given in the previous sentence.'
response = qa_chain.invoke({"question":user_query})
print(response)

In [20]:
reference_texts = [['Haripriya had good scores in academics during school and studied Electronics and Communication Engineering. She also is studying micromasters in Statistics and Data science']]

generated_texts = ['Haripriya Rajendran has completed a Micromasters in ML with Python, Google Advanced Data Analytics and Data Analytics Professional Certificates, and a Machine Learning course from Stanford University, among other certifications.']

bert_p, bert_r, bert_f1 = calculate_bertscore(generated_texts, reference_texts)
print(f"BERTScore Precision: {bert_p}")
print(f"BERTScore Recall: {bert_r}")
print(f"BERTScore F1: {bert_f1}")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BERTScore Precision: 0.8500571846961975
BERTScore Recall: 0.8697115182876587
BERTScore F1: 0.8597720265388489


I still did not get the score I was expecting.

## Modification 2

Switched the embedding model back to the older L6 model.

In [None]:
qa_chain = ConversationalRetrievalChain.from_llm(
        llm, retriever=configure_retriever(index_name), memory=memory, verbose=True,
        combine_docs_chain_kwargs={"prompt": qa_prompt},
)

user_query = 'Give educational background of Haripriya'
user_query += ' Brief it in at most 35 words or the count limit given in the previous sentence.'
response = qa_chain.invoke({"question":user_query})
print(response)

In [23]:
reference_texts = [['Haripriya had good scores in academics during school and studied Electronics and Communication Engineering. She also is studying micromasters in Statistics and Data science']]

generated_texts = ['Haripriya excelled academically, achieving school first in both SSLC and Higher Secondary with 97% marks, and district first in the Sri Ramanujan Maths aptitude test, along with second place in a Maths quiz.']

bert_p, bert_r, bert_f1 = calculate_bertscore(generated_texts, reference_texts)
print(f"BERTScore Precision: {bert_p}")
print(f"BERTScore Recall: {bert_r}")
print(f"BERTScore F1: {bert_f1}")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BERTScore Precision: 0.857337474822998
BERTScore Recall: 0.861737847328186
BERTScore F1: 0.8595320582389832


The score is still low. Let's use some LLM-based RAG evaluation metrics and see where the issue lies: in the context, the input, or the output.

# Metrics using DeepEval - LLM Evaluation
<a id='llm_evaluations'></a>

I took the input, output, and contexts from the LangChain queries run above (they were in the output cells) and stored them in a separate local file.

In [9]:

import contexts_for_testing
contexts = contexts_for_testing.context_1


In [10]:
print([len(context) for context in contexts])

[1136, 816, 493, 1742]


In [17]:
%load_ext dotenv
%dotenv

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


In [18]:
from deepeval.metrics import ContextualRecallMetric, ContextualRelevancyMetric, ContextualPrecisionMetric
from deepeval.test_case import LLMTestCase

# Actual output from the RAG chain
actual_output = "Haripriya Rajendran excelled academically, achieving school first in both SSLC and Higher Secondary with 97% marks, and district first in the Sri Ramanujan Maths aptitude test, alongside various professional certifications in data science and analytics."

# Expected (reference) answer
expected_output = "Haripriya had good scores in academics during school and studied Electronics and Communication Engineering. She also is studying micromasters in Statistics and Data science"

# Contexts retrieved by the RAG pipeline
retrieval_context = contexts

contextualRecall = ContextualRecallMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)
test_case = LLMTestCase(
    input="Give educational background of Haripriya",
    actual_output=actual_output,
    expected_output=expected_output,
    retrieval_context=retrieval_context
)

contextualRelevancy = ContextualRelevancyMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)

contextualPrecision = ContextualPrecisionMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)

contextualRecall.measure(test_case)
print("contextualRecall : ", contextualRecall.score)
print(contextualRecall.reason)

contextualRelevancy.measure(test_case)
print("contextualRelevancy : ", contextualRelevancy.score)
print(contextualRelevancy.reason)

contextualPrecision.measure(test_case)
print("contextualPrecision : ", contextualPrecision.score)
print(contextualPrecision.reason)

Output()

Output()

contextualRecall :  0.0
The score is 0.00 because none of the sentences in the expected output align with any of the nodes in the retrieval context, as there are no relevant quotes to support Haripriya's academic achievements or her current studies.


Output()

contextualRelevancy :  0.3448275862068966
The score is 0.34 because while there are relevant educational achievements mentioned, such as 'DISTRICT FIRST in Sri Ramanujan Maths aptitude test' and various certificates from prestigious institutions like 'MITx' and 'Stanford University', the majority of the retrieval context is focused on technical skills and work experiences that do not pertain to Haripriya's educational background.


contextualPrecision :  0.3333333333333333
The score is 0.33 because while there is one relevant node that provides direct information about Haripriya's educational background, the higher-ranked nodes predominantly contain irrelevant content. Specifically, the first node ranks highest but states, "The first document does not mention Haripriya's academic background or any relevant details about her education," thus contributing to the lower score. Furthermore, the second and fourth nodes also do not provide pertinent information related to the request, confirming that relevant nodes are not ranked high enough.


In [19]:
from deepeval.metrics import FaithfulnessMetric
faithfulness = FaithfulnessMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)

faithfulness.measure(test_case)
print("faithfulness :", faithfulness.score)
print(faithfulness.reason)

Output()

faithfulness : 0.75
The score is 0.75 because while the actual output accurately conveys Haripriya Rajendran's top achievements in her classes, it incorrectly interprets 'first in academics' as 'school first'. This slight misalignment leads to a reduction in faithfulness to the retrieval context.


- The contextual recall is 0, meaning that none of the retrieved contexts supports the expected output.
- The contextual relevancy is also very low, and the reason explains why in a human-readable way. This metric checks the relevancy of the retrieved context to the input.
- The contextual precision is also low because only one of the four documents is relevant to the input, and it is not ranked at the top; an irrelevant document is ranked first.
- Faithfulness has a good score because the actual output is faithful to the context: the LLM answers properly with whatever context it has.

So the retrieved context is the actual problem.
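
The contextual precision score of 0.33 reported above can be reproduced by hand. The sketch below assumes the rank-weighted formula described in the DeepEval documentation (the mean of precision@k over the ranks k that hold a relevant node). The verdict list is my reading of the reason text, where only the third of the four nodes was judged relevant. The function name is my own.

```python
def contextual_precision(verdicts):
    """Rank-weighted precision; verdicts[k] is True when the node at rank k+1 is relevant."""
    relevant_seen = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(verdicts, start=1):
        if is_relevant:
            relevant_seen += 1
            # precision@rank, counted only at ranks that hold a relevant node
            weighted_sum += relevant_seen / rank
    return weighted_sum / relevant_seen if relevant_seen else 0.0

# Only the third retrieved node is relevant (assumption read from the reason text)
print(contextual_precision([False, False, True, False]))  # 0.3333333333333333
```

The same relevant node ranked first would give 1.0, so the low score comes from the ranking, not from the number of relevant nodes.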

## Modification 3

I figured that the retrieved context was not what I expected and that its format was also off, since it was extracted from a formatted PDF. So I populated the Pinecone index from a Word document using a Docx loader instead of `PyPDFLoader`. The vectors created from this document are now used as context. Hopefully there won't be any formatting issues this time and the context will be clean.

### Test sample 1

In [None]:
qa_chain = ConversationalRetrievalChain.from_llm(
        llm, retriever=configure_retriever(index_name), memory=memory, verbose=True,
        combine_docs_chain_kwargs={"prompt": qa_prompt},
)

user_query = 'Give educational background of Haripriya'
user_query += ' Brief it in at most 35 words or the count limit given in the previous sentence.'
response = qa_chain.invoke({"question":user_query})
print(response)

Since the output was huge, I took snapshots of it to show the format of the context I was getting.

Here they are: [New Context Format's Image](images/new_context_format_word_1.png), [New Context Format's Image 2](images/new_context_format_word_2.png)

In [27]:
reference_texts = [['Haripriya had good scores in academics during school and studied Electronics and Communication Engineering. She also is studying micromasters in Statistics and Data science']]

generated_texts = ["Haripriya holds a Bachelor's degree in Electronics and Communication Engineering from Sri Shakthi Institute of Engineering and Technology, Coimbatore, with a CGPA of 8.44, and is pursuing a Micromasters in Statistics and Data Science."]

bert_p, bert_r, bert_f1 = calculate_bertscore(generated_texts, reference_texts)
print(f"BERTScore Precision: {bert_p}")
print(f"BERTScore Recall: {bert_r}")
print(f"BERTScore F1: {bert_f1}")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BERTScore Precision: 0.8738595843315125
BERTScore Recall: 0.9114338159561157
BERTScore F1: 0.8922513127326965


The recall is better and the response is much better as well. Earlier, there was no mention of my Engineering degree, which is what I was expecting the most. Now it's there. We will try a few other responses as well.

As you can see, the properly formatted context made this happen. Let's try the LLM evaluation metrics.

In [22]:
import contexts_for_testing
contexts = contexts_for_testing.context_2

In [23]:
print([len(context) for context in contexts])

[1979, 1146, 1929, 1876]


In [24]:
from deepeval.metrics import ContextualRecallMetric, ContextualRelevancyMetric, ContextualPrecisionMetric
from deepeval.test_case import LLMTestCase

# Actual output from the RAG chain (the response generated in this test sample)
actual_output = "Haripriya holds a Bachelor's degree in Electronics and Communication Engineering from Sri Shakthi Institute of Engineering and Technology, Coimbatore, with a CGPA of 8.44, and is pursuing a Micromasters in Statistics and Data Science."

# Expected (reference) answer
expected_output = "Haripriya had good scores in academics during school and studied Electronics and Communication Engineering. She also is studying micromasters in Statistics and Data science"

# Contexts retrieved by the RAG pipeline
retrieval_context = contexts

contextualRecall = ContextualRecallMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)
test_case = LLMTestCase(
    input="Give educational background of Haripriya",
    actual_output=actual_output,
    expected_output=expected_output,
    retrieval_context=retrieval_context
)

contextualRelevancy = ContextualRelevancyMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)

contextualPrecision = ContextualPrecisionMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)

contextualRecall.measure(test_case)
print("contextualRecall : ", contextualRecall.score)
print(contextualRecall.reason)

contextualRelevancy.measure(test_case)
print("contextualRelevancy : ", contextualRelevancy.score)
print(contextualRelevancy.reason)

contextualPrecision.measure(test_case)
print("contextualPrecision : ", contextualPrecision.score)
print(contextualPrecision.reason)

Output()

Output()

contextualRecall :  1.0
The score is 1.00 because the expected output aligns perfectly with the details from the nodes in the retrieval context, specifically mentioning 'Electronics and Communication Engineering' from node 9 and the 'micromasters in Statistics and Data Science' from node 1 under 'EDUCATION'.


Output()

contextualRelevancy :  0.4186046511627907
The score is 0.42 because while there are relevant educational achievements listed, such as 'Bachelors in Electronics and Communication Engineering - 2014 - 2018' and various certifications like 'Google Advanced Data Analytics Professional Certificate', the majority of the context focuses on job tasks and non-educational qualifications, as highlighted in the reasons for irrelevancy.


contextualPrecision :  0.5833333333333333
The score is 0.58 because although there are relevant nodes that provide details about Haripriya's education, two irrelevant nodes rank higher than the last relevant one. Specifically, the first node detailing data preparation is ranked first and states, 'Performed Retro Data preparation and Analysis on new data products' which does not mention Haripriya's educational background. This lowers the score despite the presence of two relevant nodes ranked second and third, which effectively address her academic qualifications.


- The contextual recall is now perfect, which means the expected output is fully supported by the retrieved contexts.
- The contextual relevancy is still low but better than before: even though we now have the context we need, the other retrieved contexts are not relevant to the input.
- The contextual precision is still low because the relevant documents are not ranked at the top; an irrelevant document is ranked first.

So far so good!
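
Contextual recall and contextual relevancy are both simple ratios over LLM-judged verdicts, as far as I understand the DeepEval documentation. Recall is the share of expected-output sentences that can be attributed to the retrieved context. Relevancy is the share of statements in the retrieved context that are relevant to the input. The sketch below illustrates this. The recall verdicts follow the reason text above (both sentences were attributed to a node), while the relevancy counts are hypothetical and only show why a lot of off-topic text drags the score down.

```python
def ratio_score(verdicts):
    """Fraction of True verdicts; 0.0 for an empty list."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

# Contextual recall: one verdict per sentence of the expected output
recall_verdicts = [True, True]

# Contextual relevancy: one verdict per statement found in the retrieved context
# (hypothetical counts: 3 relevant statements out of 10)
relevancy_verdicts = [True] * 3 + [False] * 7

recall = ratio_score(recall_verdicts)
relevancy = ratio_score(relevancy_verdicts)
print(recall, relevancy)
```

This is why recall can be perfect while relevancy stays low: recall only needs the right facts to be somewhere in the context, whereas relevancy is penalised by every unrelated statement retrieved alongside them.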

### Test sample 2

In [None]:
qa_chain = ConversationalRetrievalChain.from_llm(
        llm, retriever=configure_retriever(index_name), memory=memory, verbose=True,
        combine_docs_chain_kwargs={"prompt": qa_prompt},
)

user_query = 'give a summary on her Generative AI projects'
user_query += ' Brief it in at most 35 words or the count limit given in the previous sentence.'
response = qa_chain.invoke({"question":user_query})
print(response)

In [36]:
reference_texts = [['Haripriya has worked on her portfolio project on creating a chat assistant about her using RAG with generativeAI and langchain. She also has created a weather API agent using streamlit']]

generated_texts = ["Haripriya specializes in Generative AI, utilizing LangChain for RAG, tool calling with Agents, and fine-tuning Gemma models, enhancing predictive capabilities and automating processes in data science applications."]

bert_p, bert_r, bert_f1 = calculate_bertscore(generated_texts, reference_texts)
print(f"BERTScore Precision: {bert_p}")
print(f"BERTScore Recall: {bert_r}")
print(f"BERTScore F1: {bert_f1}")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BERTScore Precision: 0.8725282549858093
BERTScore Recall: 0.874882698059082
BERTScore F1: 0.8737038373947144



In [27]:
import contexts_for_testing
contexts = contexts_for_testing.context_3
print([len(context) for context in contexts])


[1509, 1929, 1876, 1146]


In [28]:
from deepeval.metrics import ContextualRecallMetric, ContextualRelevancyMetric, ContextualPrecisionMetric
from deepeval.test_case import LLMTestCase

# Actual output from the RAG chain
actual_output = "Haripriya specializes in Generative AI, utilizing LangChain for RAG, tool calling with Agents, and fine-tuning Gemma models, enhancing predictive capabilities and automating processes in data science applications."

# Expected (reference) answer
expected_output = 'Haripriya has worked on her portfolio project on creating a chat assistant about her using RAG with generativeAI and langchain. She also has created a weather API agent using streamlit'

# Contexts retrieved by the RAG pipeline
retrieval_context = contexts

contextualRecall = ContextualRecallMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)
test_case = LLMTestCase(
    input="give a summary on her Generative AI projects",
    actual_output=actual_output,
    expected_output=expected_output,
    retrieval_context=retrieval_context
)

contextualRelevancy = ContextualRelevancyMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)

contextualPrecision = ContextualPrecisionMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)

contextualRecall.measure(test_case)
print("contextualRecall : ", contextualRecall.score)
print(contextualRecall.reason)

contextualRelevancy.measure(test_case)
print("contextualRelevancy : ", contextualRelevancy.score)
print(contextualRelevancy.reason)

contextualPrecision.measure(test_case)
print("contextualPrecision : ", contextualPrecision.score)
print(contextualPrecision.reason)

Output()

Output()

contextualRecall :  1.0
The score is 1.00 because all elements of the expected output perfectly align with the provided context nodes, with the chat assistant work linking to node 5 and the weather API creation linked to node 1.


Output()

contextualRelevancy :  0.12903225806451613
The score is 0.13 because, despite the presence of relevant statements like 'Lead Data Scientist with GenerativeAI' and 'expertise in using GenAI models,' the majority of the context focuses on unrelated qualifications and general skills rather than summarizing specific Generative AI projects.


contextualPrecision :  1.0
The score is 1.00 because all relevant nodes are ranked higher than the irrelevant nodes. The first node highlights expertise in 'using GenAI models for RAG using LangChain', making it directly pertinent to Generative AI projects. Similarly, the second node presents a specific Generative AI portfolio project, the 'langchain-weather-tool-calling'. In contrast, the third and fourth nodes discuss themes unrelated to Generative AI, such as model performance evaluation and course certifications, allowing for a clear distinction in relevance.


- The contextual recall is again perfect, which means the expected output is fully supported by the retrieved contexts.
- The contextual relevancy is very low because, even though we have the context we need, most of the retrieved content is not relevant to the input.
- The contextual precision is perfect this time: the relevant documents are ranked above the irrelevant ones.

### Success

We cannot increase contextual relevancy for now, but increasing contextual recall was the major goal, and we achieved it! I would consider this a success, at least for this project!