<a href="https://colab.research.google.com/github/lisun85/AI-MakerSpace/blob/main/Evaluating_RAG_with_Ragas_part1_(2025)_AI_Makerspace.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using Ragas to Evaluate a RAG Application built with LangChain and LangGraph

In the following notebook, we'll be looking at how [Ragas](https://github.com/explodinggradients/ragas) can be helpful in a number of ways when looking to evaluate your RAG applications!

While this example is rooted in LangChain/LangGraph - Ragas is framework agnostic (you don't even need to be using a framework!).

- 🤝 Breakout Room #1
  1. Task 1: Installing Required Libraries
  2. Task 2: Set Environment Variables
  3. Task 3: Synthetic Dataset Generation for Evaluation using Ragas
  4. Task 4: Evaluating our Pipeline with Ragas
  5. Task 6: Making Adjustments and Re-Evaluating

But first! Let's set some dependencies!

## Dependencies and API Keys:

> NOTE: Please skip the pip install commands if you are running the notebook locally.

In [2]:
#!pip install -qU ragas==0.2.10

In [3]:
#!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0

We'll also need to provide our API keys.

First, OpenAI's for our LLM/embedding model combination!

In [4]:
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Please enter your OpenAI API key!")

Please enter your OpenAI API key!··········


OPTIONALLY:

We can also provide a Ragas API key - which you can sign-up for [here](https://app.ragas.io/).

In [5]:
os.environ["RAGAS_APP_TOKEN"] = getpass("Please enter your Ragas API key!")

Please enter your Ragas API key!··········


## Generating Synthetic Test Data

We wil be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it will allow us to find out how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data - and download our webpages which we'll be using for our data today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [6]:
!mkdir data

mkdir: cannot create directory ‘data’: File exists


In [7]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100 31440    0 31440    0     0   181k      0 --:--:-- --:--:-- --:--:--  181k


In [8]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100 70299    0 70299    0     0   491k      0 --:--:-- --:--:-- --:--:--  490k


Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [9]:
!pip install langchain-community
!pip install unstructured
from langchain_community.document_loaders import DirectoryLoader

path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()



### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [12]:
!pip install ragas==0.2.10 # install the 'ragas' package
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())



In [13]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/26 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [14]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What advancements have been made in LLMs since...,[The ethics of this space remain diabolically ...,"Since GPT-3, there have been significant advan...",single_hop_specifc_query_synthesizer
1,What are some challenges associated with using...,"[and software engineer, LLMs are infuriating. ...",Some challenges associated with using ChatGPT ...,single_hop_specifc_query_synthesizer
2,What AI is in 2023?,[Simon Willison’s Weblog Subscribe Stuff we fi...,"In 2023, AI refers to Large Language Models (L...",single_hop_specifc_query_synthesizer
3,What insights does the author provide about Op...,[the document includes some of the clearest ex...,The author notes that OpenAI has been a signif...,single_hop_specifc_query_synthesizer
4,What are the implications of prompt driven app...,[<1-hop>\n\nPrompt driven app generation is a ...,Prompt driven app generation has significant i...,multi_hop_abstract_query_synthesizer
5,What are the implications of training costs on...,[<1-hop>\n\nPrompt driven app generation is a ...,The implications of training costs on the deve...,multi_hop_abstract_query_synthesizer
6,What are the implications of prompt driven app...,[<1-hop>\n\nPrompt driven app generation is a ...,Prompt driven app generation has become a comm...,multi_hop_abstract_query_synthesizer
7,What are the implications of prompt driven app...,[<1-hop>\n\nPrompt driven app generation is a ...,Prompt driven app generation has become a sign...,multi_hop_abstract_query_synthesizer
8,What are the implications of Meta's Llama seri...,[<1-hop>\n\nThe ethics of this space remain di...,"Meta's Llama series, particularly with the rel...",multi_hop_specific_query_synthesizer
9,What advancements have been made in Mistral Ch...,[<1-hop>\n\nThe ethics of this space remain di...,Mistral Chat has introduced significant featur...,multi_hop_specific_query_synthesizer


#### OPTIONAL:

If you've provided your Ragas API key - you can use this web interface to look at the created data!

In [15]:
dataset.upload()

Testset uploaded! View at https://app.ragas.io/dashboard/alignment/testset/ef28e31b-a556-4d2f-b627-829b3ccfc810


'https://app.ragas.io/dashboard/alignment/testset/ef28e31b-a556-4d2f-b627-829b3ccfc810'

## LangChain RAG

Now we'll construct our LangChain RAG, which we will be evaluating using the above created test data!

### R - Retrieval

Let's start with building our retrieval pipeline, which will involve loading the same data we used to create our synthetic test set above.

> NOTE: We need to use the same data - as our test set is specifically designed for this data.

In [16]:
path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

Now that we have our data loaded, let's split it into chunks!

In [17]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_documents = text_splitter.split_documents(docs)
len(split_documents)

74

#### ❓ Question:

What is the purpose of the `chunk_overlap` parameter in the `RecursiveCharacterTextSplitter`?

The chunk_overlap parameter in the RecursiveCharacterTextSplitter is used to control how much overlap there is between consecutive chunks of text when splitting a document into smaller parts.

Next up, we'll need to provide an embedding model that we can use to construct our vector store.

In [18]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

Now we can build our in memory QDrant vector store.

In [20]:
!pip install langchain-qdrant
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="ai_across_years",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="ai_across_years",
    embedding=embeddings,
)

Collecting langchain-qdrant
  Downloading langchain_qdrant-0.2.0-py3-none-any.whl.metadata (1.8 kB)
Collecting qdrant-client<2.0.0,>=1.10.1 (from langchain-qdrant)
  Downloading qdrant_client-1.13.2-py3-none-any.whl.metadata (10 kB)
Collecting grpcio-tools>=1.41.0 (from qdrant-client<2.0.0,>=1.10.1->langchain-qdrant)
  Downloading grpcio_tools-1.70.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.3 kB)
Collecting portalocker<3.0.0,>=2.7.0 (from qdrant-client<2.0.0,>=1.10.1->langchain-qdrant)
  Downloading portalocker-2.10.1-py3-none-any.whl.metadata (8.5 kB)
Collecting protobuf<6.0dev,>=5.26.1 (from grpcio-tools>=1.41.0->qdrant-client<2.0.0,>=1.10.1->langchain-qdrant)
  Downloading protobuf-5.29.3-cp38-abi3-manylinux2014_x86_64.whl.metadata (592 bytes)
Downloading langchain_qdrant-0.2.0-py3-none-any.whl (23 kB)
Downloading qdrant_client-1.13.2-py3-none-any.whl (306 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m306.6/306.6 kB[0m [31m8.2 MB/s

We can now add our documents to our vector store.

In [21]:
_ = vector_store.add_documents(documents=split_documents)

Let's define our retriever.

In [22]:
retriever = vector_store.as_retriever(search_kwargs={"k": 5})

Now we can produce a node for retrieval!

In [23]:
def retrieve(state):
  retrieved_docs = retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

### Augmented

Let's create a simple RAG prompt!

In [24]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

### Generation

We'll also need an LLM to generate responses - we'll use `gpt-4o-mini` to avoid using the same model as our judge model.

In [25]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

Then we can create a `generate` node!

In [26]:
def generate(state):
  docs_content = "\n\n".join(doc.page_content for doc in state["context"])
  messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
  response = llm.invoke(messages)
  return {"response" : response.content}

### Building RAG Graph with LangGraph

Let's create some state for our LangGraph RAG graph!

In [28]:
!pip install langgraph
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

class State(TypedDict):
  question: str
  context: List[Document]
  response: str

Collecting langgraph
  Downloading langgraph-0.2.73-py3-none-any.whl.metadata (17 kB)
Collecting langgraph-checkpoint<3.0.0,>=2.0.10 (from langgraph)
  Downloading langgraph_checkpoint-2.0.15-py3-none-any.whl.metadata (4.6 kB)
Collecting langgraph-sdk<0.2.0,>=0.1.42 (from langgraph)
  Downloading langgraph_sdk-0.1.51-py3-none-any.whl.metadata (1.8 kB)
Downloading langgraph-0.2.73-py3-none-any.whl (151 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m151.5/151.5 kB[0m [31m3.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading langgraph_checkpoint-2.0.15-py3-none-any.whl (38 kB)
Downloading langgraph_sdk-0.1.51-py3-none-any.whl (44 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m44.7/44.7 kB[0m [31m3.5 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: langgraph-sdk, langgraph-checkpoint, langgraph
Successfully installed langgraph-0.2.73 langgraph-checkpoint-2.0.15 langgraph-sdk-0.1.51


Now we can build our simple graph!

> NOTE: We're using `add_sequence` since we will always move from retrieval to generation. This is essentially building a chain in LangGraph.

In [29]:
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

Let's do a test to make sure it's doing what we'd expect.

In [31]:
response = graph.invoke({"question" : "How are LLM agents useful?"})

In [32]:
response["response"]

"LLM agents can be useful in several ways, primarily in the context of coding and executing tasks on behalf of users. Here are the key points regarding their usefulness:\n\n1. **Ease of Development**: LLMs are surprisingly easy to build, requiring only a few hundred lines of code in Python, provided that the right training data is available. This accessibility makes it feasible for a broader range of developers and organizations to experiment with and create LLMs.\n\n2. **Code Generation**: One of the strongest applications of LLM agents is in writing code. The success of LLMs in this area is attributed to the simpler grammar rules of programming languages compared to natural languages, making them particularly effective at generating and debugging code.\n\n3. **On-device Usage**: Recent advancements allow LLMs to be run on personal devices, making them more accessible to individual users without needing expensive server infrastructure.\n\n4. **Iterative Improvement**: LLMs can generat

## Evaluating the App with Ragas

Now we can finally do our evaluation!

We'll start by running the queries we generated usign SDG above through our application to get context and responses.

In [33]:
for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [34]:
dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,What advancements have been made in LLMs since...,[So training an LLM still isn’t something a ho...,[The ethics of this space remain diabolically ...,"Since GPT-3, there have been several notable a...","Since GPT-3, there have been significant advan...",single_hop_specifc_query_synthesizer
1,What are some challenges associated with using...,[Did you know ChatGPT has two entirely differe...,"[and software engineer, LLMs are infuriating. ...",Some challenges associated with using ChatGPT ...,Some challenges associated with using ChatGPT ...,single_hop_specifc_query_synthesizer
2,What AI is in 2023?,[Law is not ethics. Is it OK to train models o...,[Simon Willison’s Weblog Subscribe Stuff we fi...,"In 2023, AI, particularly in the context of la...","In 2023, AI refers to Large Language Models (L...",single_hop_specifc_query_synthesizer
3,What insights does the author provide about Op...,[“Agents” still haven’t really happened yet\n\...,[the document includes some of the clearest ex...,"In 2023, the author highlights several key ins...",The author notes that OpenAI has been a signif...,single_hop_specifc_query_synthesizer
4,What are the implications of prompt driven app...,"[I think this means that, as individual users,...",[<1-hop>\n\nPrompt driven app generation is a ...,The implications of prompt-driven app generati...,Prompt driven app generation has significant i...,multi_hop_abstract_query_synthesizer
5,What are the implications of training costs on...,[The rise of inference-scaling “reasoning” mod...,[<1-hop>\n\nPrompt driven app generation is a ...,The implications of training costs on the deve...,The implications of training costs on the deve...,multi_hop_abstract_query_synthesizer
6,What are the implications of prompt driven app...,[The really impressive thing about DeepSeek v3...,[<1-hop>\n\nPrompt driven app generation is a ...,The implications of prompt-driven app generati...,Prompt driven app generation has become a comm...,multi_hop_abstract_query_synthesizer
7,What are the implications of prompt driven app...,"[I think this means that, as individual users,...",[<1-hop>\n\nPrompt driven app generation is a ...,The implications of prompt-driven app generati...,Prompt driven app generation has become a sign...,multi_hop_abstract_query_synthesizer
8,What are the implications of Meta's Llama seri...,[Was the best currently available LLM trained ...,[<1-hop>\n\nThe ethics of this space remain di...,The implications of Meta's Llama series on the...,"Meta's Llama series, particularly with the rel...",multi_hop_specific_query_synthesizer
9,What advancements have been made in Mistral Ch...,"[Since then, a whole bunch of other teams have...",[<1-hop>\n\nThe ethics of this space remain di...,"In 2024, Mistral Chat has made significant adv...",Mistral Chat has introduced significant featur...,multi_hop_specific_query_synthesizer


Then we can convert that table into a `EvaluationDataset` which will make the process of evaluation smoother.

In [35]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

We'll need to select a judge model - in this case we're using the same model that was used to generate our Synthetic Data.

In [38]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

Next up - we simply evaluate on our desired metrics!

In [39]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

{'context_recall': 0.9514, 'faithfulness': 0.8937, 'factual_correctness': 0.5233, 'answer_relevancy': 0.7881, 'context_entity_recall': 0.2727, 'noise_sensitivity_relevant': 0.2684}

## Making Adjustments and Re-Evaluating

Now that we've got our baseline - let's make a change and see how the model improves or doesn't improve!

> NOTE: This will be using Cohere's Rerank model (which was updated fairly [recently](https://docs.cohere.com/v2/changelog/rerank-v3.5)) - please be sure to [sign-up for an API key!](https://docs.cohere.com/reference/about)

In [40]:
os.environ["COHERE_API_KEY"] = getpass("Please enter your Cohere API key!")

Please enter your Cohere API key!··········


In [41]:
#!pip install -qU cohere langchain_cohere


We'll first set our retriever to return more documents, which will allow us to take advantage of the reranking.

In [42]:
retriever = vector_store.as_retriever(search_kwargs={"k": 20})

Reranking, or contextual compression, is a technique that uses a reranker to compress the retrieved documents into a smaller set of documents.

This is essentially a slower, more accurate form of semantic similarity that we use on a smaller subset of our documents.

In [44]:
!pip install langchain_cohere
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

def retrieve_adjusted(state):
  compressor = CohereRerank(model="rerank-v3.5")
  compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever, search_kwargs={"k": 5}
  )
  retrieved_docs = compression_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

Collecting langchain_cohere
  Downloading langchain_cohere-0.4.2-py3-none-any.whl.metadata (6.6 kB)
Collecting cohere<6.0,>=5.12.0 (from langchain_cohere)
  Downloading cohere-5.13.12-py3-none-any.whl.metadata (3.4 kB)
Collecting types-pyyaml<7.0.0.0,>=6.0.12.20240917 (from langchain_cohere)
  Downloading types_PyYAML-6.0.12.20241230-py3-none-any.whl.metadata (1.8 kB)
Collecting fastavro<2.0.0,>=1.9.4 (from cohere<6.0,>=5.12.0->langchain_cohere)
  Downloading fastavro-1.10.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.5 kB)
Collecting types-requests<3.0.0,>=2.0.0 (from cohere<6.0,>=5.12.0->langchain_cohere)
  Downloading types_requests-2.32.0.20241016-py3-none-any.whl.metadata (1.9 kB)
Downloading langchain_cohere-0.4.2-py3-none-any.whl (42 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.2/42.2 kB[0m [31m1.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading cohere-5.13.12-py3-none-any.whl (252 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

We can simply rebuild our graph with the new retriever!

In [45]:
class State(TypedDict):
  question: str
  context: List[Document]
  response: str

graph_builder = StateGraph(State).add_sequence([retrieve_adjusted, generate])
graph_builder.add_edge(START, "retrieve_adjusted")
graph = graph_builder.compile()

In [46]:
response = graph.invoke({"question" : "How are LLM agents useful?"})
response["response"]

"LLM agents, or AI systems that act on users' behalf, are considered useful primarily in the realm of writing code. They are particularly effective at this task because the grammar rules of programming languages are simpler than those of human languages, making it a suitable area for their capabilities. However, there is skepticism about their overall utility due to issues like gullibility, where these systems may struggle to distinguish between truth and fiction. This raises concerns about how reliable they can be for tasks such as decision-making or providing accurate information. The potential applications of LLMs face criticism regarding their environmental impact, ethics, reliability, and implications for jobs, prompting a call for responsible usage and critical discussions around these technologies."

In [47]:
import time

for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
  time.sleep(2) # To try to avoid rate limiting.

In [48]:
result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

{'context_recall': 0.8556, 'faithfulness': 0.8838, 'factual_correctness': 0.4842, 'answer_relevancy': 0.7101, 'context_entity_recall': 0.2472, 'noise_sensitivity_relevant': 0.2819}

#### ❓ Question:

Which system performed better, on what metrics, and why?

To determine which system performed better and on what metrics, we need to compare the evaluation results of the original RAG pipeline versus the one that includes Cohere's Rerank model for contextual compression.

Comparison of Performance Metrics:

1. LLM Context Recall – Measures how well the retrieved context covers the ground-truth information needed to answer the question. The reranked system is likely to perform better because reranking prioritizes the most relevant retrieved documents, reducing noise in the context.

2. Faithfulness – Evaluates whether the generated response stays faithful to the retrieved context. The reranked system is expected to improve this metric as well since higher-quality context should lead to responses that better align with the source material.

3. Factual Correctness – Checks if the response is factually correct based on external knowledge.This may improve slightly, but if the retriever already provided relevant documents in the original system, the improvement might not be drastic.

4. Response Relevancy – Measures how relevant the response is to the user query.A better selection of retrieved context should lead to more relevant responses, so the reranked system should show improvements.

5. Context Entity Recall – Evaluates how well entities in the reference answer appear in the retrieved context. Since reranking is focused on semantic similarity, this metric should see moderate improvement.

6. Noise Sensitivity – Measures how much noise in the context affects response quality.The reranked system should have lower noise sensitivity, meaning it is less affected by irrelevant information due to reranking filtering out less relevant documents.

Expected Outcome

The reranked system should outperform the baseline on LLM Context Recall, Faithfulness, Response Relevancy, and Noise Sensitivity.
Factual Correctness and Context Entity Recall might see moderate improvements, but they depend more on the retriever's initial quality.