# Using Ragas to Evaluate a RAG Application built with LangChain and LangGraph

In the following notebook, we'll look at how [Ragas](https://github.com/explodinggradients/ragas) can help you evaluate your RAG applications!

While this example is rooted in LangChain/LangGraph - Ragas is framework agnostic (you don't even need to be using a framework!).

- 🤝 Breakout Room #1
  1. Task 1: Installing Required Libraries
  2. Task 2: Set Environment Variables
  3. Task 3: Synthetic Dataset Generation for Evaluation using Ragas
  4. Task 4: Evaluating our Pipeline with Ragas
  5. Task 5: Making Adjustments and Re-Evaluating

But first! Let's set some dependencies!

## Dependencies and API Keys:

> NOTE: Please skip the pip install commands if you are running the notebook locally.

In [1]:
#!pip install -qU ragas==0.2.10

In [3]:
#!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0

We'll also need to provide our API keys.

First, OpenAI's for our LLM/embedding model combination!

In [1]:
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Please enter your OpenAI API key!")

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it will allow us to find out how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll start by downloading the webpages that we'll be using as our data today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [2]:
!mkdir -p data

In [3]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 31524    0 31524    0     0  66520      0 --:--:-- --:--:-- --:--:-- 66646


In [4]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 70549    0 70549    0     0   101k      0 --:--:-- --:--:-- --:--:--  101k


Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [5]:
from langchain_community.document_loaders import DirectoryLoader

path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Abstracted SDG

Building the knowledge graph, personas, and scenarios by hand is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [6]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())



In [7]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/26 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [8]:
dataset.to_pandas()

| | user_input | reference_contexts | reference | synthesizer_name |
|---|------------|--------------------|-----------|------------------|
| 0 | Wut iz Claude 3.5 Sonnet and why iz it importe... | [The rise of inference-scaling “reasoning” mod... | Claude 3.5 Sonnet iz a model from Anthropic's ... | single_hop_specifc_query_synthesizer |
| 1 | what gpt-4 turbo cost last year and how it com... | [on a story about the town's history and the e... | In December 2023, OpenAI were charging $10 per... | single_hop_specifc_query_synthesizer |
| 2 | Wut rol has OpenAI plaid in advancin multi-mod... | [you talk to me exclusively in Spanish. OpenAI... | OpenAI aren’t the only group with a multi-moda... | single_hop_specifc_query_synthesizer |
| 3 | Why Google say Encanto 2 real when it not, and... | [skeptical as to their utility based, once aga... | Google Search was caught serving up an entirel... | single_hop_specifc_query_synthesizer |
| 4 | How has the commoditization of AI-generated ap... | [<1-hop>\n\nyou talk to me exclusively in Span... | The commoditization of AI-generated apps has b... | multi_hop_abstract_query_synthesizer |
| 5 | How has the environmental impact of AI model t... | [<1-hop>\n\nThe rise of inference-scaling “rea... | The environmental impact of AI model training ... | multi_hop_abstract_query_synthesizer |
| 6 | how llms get better and what happen to gentrif... | [<1-hop>\n\nThe rise of inference-scaling “rea... | llms got better by breaking the gpt-4 barrier,... | multi_hop_abstract_query_synthesizer |
| 7 | How have advancements in large language models... | [<1-hop>\n\nThe rise of inference-scaling “rea... | Advancements in large language models (LLMs) i... | multi_hop_abstract_query_synthesizer |
| 8 | LLMs is easy to build but why hobbyists can't ... | [<1-hop>\n\nSimon Willison’s Weblog Subscribe ... | LLMs is easy to build because you only need a ... | multi_hop_specific_query_synthesizer |
| 9 | How did Meta contribute to the evolution of la... | [<1-hop>\n\nWe don’t yet know how to build GPT... | Meta played a significant role in the evolutio... | multi_hop_specific_query_synthesizer |


## LangChain RAG

Now we'll construct our LangChain RAG, which we will be evaluating using the above created test data!

### R - Retrieval

Let's start with building our retrieval pipeline, which will involve loading the same data we used to create our synthetic test set above.

> NOTE: We need to use the same data - as our test set is specifically designed for this data.

In [9]:
path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

Now that we have our data loaded, let's split it into chunks!

In [10]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_documents = text_splitter.split_documents(docs)
len(split_documents)

75

#### ❓ Question:

What is the purpose of the `chunk_overlap` parameter in the `RecursiveCharacterTextSplitter`?

**Answer**:

The `chunk_overlap` parameter in the `RecursiveCharacterTextSplitter` determines how many characters should overlap between adjacent chunks when splitting a document. In this notebook, it's set to 200 characters of overlap between consecutive chunks.

The main purposes of chunk overlap are:

1. **Maintain context continuity** - By having overlapping text between chunks, related information that spans chunk boundaries isn't completely separated
2. **Preserve semantic meaning** - Important context or information that might be split across chunk boundaries won't be lost
3. **Improve retrieval quality** - When searching for specific information that might fall at the edge of a chunk, the overlap ensures it can be found
4. **Prevent information fragmentation** - Complete sentences, paragraphs, or concepts that would otherwise be split can be captured in multiple chunks

In RAG applications, this overlap is crucial for maintaining coherence and ensuring that relevant information isn't missed during retrieval simply because it happened to fall at a chunk boundary.

[Reference](https://python.langchain.com/docs/how_to/recursive_text_splitter/)
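To make the overlap concrete, here's a minimal sketch of a fixed-size character splitter with overlap. The `split_with_overlap` helper is hypothetical and much simpler than `RecursiveCharacterTextSplitter` (which also tries to break on separators such as paragraphs and sentences); it only illustrates how consecutive chunks end up sharing `chunk_overlap` characters.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size chunks that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so a short phrase that straddles a boundary still appears whole in at least one chunk.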

Next up, we'll need to provide an embedding model that we can use to construct our vector store.

In [11]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

Now we can build our in-memory Qdrant vector store.

In [12]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="ai_across_years",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="ai_across_years",
    embedding=embeddings,
)



We can now add our documents to our vector store.

In [13]:
_ = vector_store.add_documents(documents=split_documents)

Let's define our retriever.

In [14]:
retriever = vector_store.as_retriever(search_kwargs={"k": 5})
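As a rough picture of what a top-`k` similarity search does, here's a small pure-Python sketch. The three-dimensional vectors and document names are made up for illustration (real embeddings from `text-embedding-3-small` have 1536 dimensions), and this is not how Qdrant implements search internally; it only shows the idea of ranking by cosine similarity and keeping the best `k`.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical toy "embeddings" for three chunks.
toy_vectors = {
    "agents": [0.9, 0.1, 0.0],
    "pricing": [0.1, 0.9, 0.0],
    "multimodal": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

top_two = top_k(query, toy_vectors, k=2)
print(top_two)
```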

Now we can produce a node for retrieval!

In [15]:
def retrieve(state):
  retrieved_docs = retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

### Augmented

Let's create a simple RAG prompt!

In [16]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

### Generation

We'll also need an LLM to generate responses - we'll use `gpt-4o-mini` to avoid using the same model as our judge model.

In [17]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

Then we can create a `generate` node!

In [18]:
def generate(state):
  docs_content = "\n\n".join(doc.page_content for doc in state["context"])
  messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
  response = llm.invoke(messages)
  return {"response" : response.content}

### Building RAG Graph with LangGraph

Let's create some state for our LangGraph RAG graph!

In [19]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

class State(TypedDict):
  question: str
  context: List[Document]
  response: str

Now we can build our simple graph!

> NOTE: We're using `add_sequence` since we will always move from retrieval to generation. This is essentially building a chain in LangGraph.

In [20]:
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
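If it helps to picture what a sequence of nodes does with the state, here's a toy model in plain Python. It is only a sketch of the idea (each node receives the current state and returns a partial update that is merged in), not LangGraph's implementation; `run_sequence`, `toy_retrieve`, and `toy_generate` are hypothetical names.

```python
from typing import Callable

Node = Callable[[dict], dict]


def run_sequence(nodes: list[Node], initial_state: dict) -> dict:
    """Run nodes in order, merging each node's partial update into the state."""
    state = dict(initial_state)
    for node in nodes:
        state.update(node(state))  # keys returned by the node overwrite old values
    return state


def toy_retrieve(state: dict) -> dict:
    # Stand-in for the real retriever: returns one fake document.
    return {"context": [f"doc about {state['question']}"]}


def toy_generate(state: dict) -> dict:
    # Stand-in for the LLM call: reports how much context it received.
    return {"response": f"Answer using {len(state['context'])} document(s)"}


final_state = run_sequence([toy_retrieve, toy_generate], {"question": "LLM agents"})
print(final_state)
```

This is why `retrieve` only returns `{"context": ...}` and `generate` only returns `{"response": ...}`: each node contributes its own keys, and the final state holds all three.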

Let's do a test to make sure it's doing what we'd expect.

In [21]:
response = graph.invoke({"question" : "How are LLM agents useful?"})

In [23]:
from IPython.display import Markdown

Markdown(response["response"])

LLM agents are useful in a few key ways, though their utility is still a topic of debate. 

1. **Acting on Behalf of Users**: Some people view AI agents as tools that can perform tasks on behalf of users, similar to a travel agent. This model suggests that LLMs can handle various inquiries and tasks, potentially saving users time and effort.

2. **Problem Solving with Tools**: LLMs can be programmed to use tools in a loop to help solve complex problems. This means that they can continuously refine their outputs based on the results they achieve, making them adaptable and potentially more effective in certain scenarios.

3. **Ease of Creation**: The technology to build LLMs has become more accessible, requiring only a few hundred lines of code and a substantial amount of quality training data. This democratization of technology allows more individuals and organizations to experiment and create their own models.

4. **Running Locally**: Advances have made it possible to run LLMs on personal devices, as seen with the release of models like Meta's Llama. This local accessibility can enhance privacy and reduce reliance on cloud services.

5. **Code Generation**: LLMs can generate code, and despite their tendency to "hallucinate" or produce incorrect outputs, they can also execute and test the code they generate. This ability allows them to iteratively refine their outputs, making them effective in programming tasks.

However, there are significant concerns regarding their reliability, ethical implications, and the potential negative impact on jobs and society. Critics emphasize the importance of careful consideration and responsible use of LLMs to maximize their positive applications while minimizing negative consequences.

## Evaluating the App with Ragas

Now we can finally do our evaluation!

We'll start by running the queries we generated using SDG above through our application to get context and responses.

In [24]:
for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [25]:
dataset.to_pandas()

| | user_input | retrieved_contexts | reference_contexts | response | reference | synthesizer_name |
|---|------------|--------------------|--------------------|----------|-----------|------------------|
| 0 | Wut iz Claude 3.5 Sonnet and why iz it importe... | [Getting back to models that beat GPT-4: Anthr... | [The rise of inference-scaling “reasoning” mod... | Claude 3.5 Sonnet is an advanced AI model rele... | Claude 3.5 Sonnet iz a model from Anthropic's ... | single_hop_specifc_query_synthesizer |
| 1 | what gpt-4 turbo cost last year and how it com... | [LLM prices crashed, thanks to competition and... | [on a story about the town's history and the e... | Last year, in December 2023, OpenAI was chargi... | In December 2023, OpenAI were charging $10 per... | single_hop_specifc_query_synthesizer |
| 2 | Wut rol has OpenAI plaid in advancin multi-mod... | [In October I upgraded my LLM CLI tool to supp... | [you talk to me exclusively in Spanish. OpenAI... | OpenAI has played a significant role in advanc... | OpenAI aren’t the only group with a multi-moda... | single_hop_specifc_query_synthesizer |
| 3 | Why Google say Encanto 2 real when it not, and... | [Just the other day Google Search was caught s... | [skeptical as to their utility based, once aga... | Google's claim about "Encanto 2" being real st... | Google Search was caught serving up an entirel... | single_hop_specifc_query_synthesizer |
| 4 | How has the commoditization of AI-generated ap... | [Prompt driven app generation is a commodity a... | [<1-hop>\n\nyou talk to me exclusively in Span... | The commoditization of AI-generated apps throu... | The commoditization of AI-generated apps has b... | multi_hop_abstract_query_synthesizer |
| 5 | How has the environmental impact of AI model t... | [Law is not ethics. Is it OK to train models o... | [<1-hop>\n\nThe rise of inference-scaling “rea... | The environmental impact of AI model training ... | The environmental impact of AI model training ... | multi_hop_abstract_query_synthesizer |
| 6 | how llms get better and what happen to gentrif... | [MLC Chat: Llama - [System] Ready to chat. a p... | [<1-hop>\n\nThe rise of inference-scaling “rea... | The context provided does not offer specific i... | llms got better by breaking the gpt-4 barrier,... | multi_hop_abstract_query_synthesizer |
| 7 | How have advancements in large language models... | [In October I upgraded my LLM CLI tool to supp... | [<1-hop>\n\nThe rise of inference-scaling “rea... | Advancements in large language models (LLMs) i... | Advancements in large language models (LLMs) i... | multi_hop_abstract_query_synthesizer |
| 8 | LLMs is easy to build but why hobbyists can't ... | [So training an LLM still isn’t something a ho... | [<1-hop>\n\nSimon Willison’s Weblog Subscribe ... | Hobbyists typically can't train large language... | LLMs is easy to build because you only need a ... | multi_hop_specific_query_synthesizer |
| 9 | How did Meta contribute to the evolution of la... | [I wrote about how Large language models are h... | [<1-hop>\n\nWe don’t yet know how to build GPT... | Meta contributed to the evolution of large lan... | Meta played a significant role in the evolutio... | multi_hop_specific_query_synthesizer |


Then we can convert that table into an `EvaluationDataset`, which will make the process of evaluation smoother.

In [26]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

We'll need to select a judge model - in this case we're using `gpt-4o`, which is a different model from both the `gpt-4.1` model that generated our synthetic data and the `gpt-4o-mini` model that generates our responses.

In [27]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

Next up - we simply evaluate on our desired metrics!

In [28]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

Exception raised in Job[13]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-XyIrJZx8AUgCzADN6GcfWGO6 on tokens per min (TPM): Limit 30000, Used 29382, Requested 2414. Please try again in 3.592s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[16]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-XyIrJZx8AUgCzADN6GcfWGO6 on tokens per min (TPM): Limit 30000, Used 29718, Requested 1773. Please try again in 2.982s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[25]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-XyIrJZx8AUgCzADN6GcfWGO6 on tokens per min (TPM): Limit 30000, Used 29663, Requeste

{'context_recall': 0.8125, 'faithfulness': 0.3974, 'factual_correctness': 0.5530, 'answer_relevancy': 0.8396, 'context_entity_recall': 0.3960, 'noise_sensitivity_relevant': 0.1860}

## Making Adjustments and Re-Evaluating

Now that we've got our baseline - let's make a change and see how the model improves or doesn't improve!

> NOTE: This will be using Cohere's Rerank model (which was updated fairly [recently](https://docs.cohere.com/v2/changelog/rerank-v3.5)) - please be sure to [sign-up for an API key!](https://docs.cohere.com/reference/about)

In [29]:
os.environ["COHERE_API_KEY"] = getpass("Please enter your Cohere API key!")

In [34]:
#!pip install -qU cohere langchain_cohere


We'll first set our retriever to return more documents, which will allow us to take advantage of the reranking.

In [30]:
retriever = vector_store.as_retriever(search_kwargs={"k": 20})

Reranking, or contextual compression, is a technique that uses a reranker to compress the retrieved documents into a smaller set of documents.

This is essentially a slower, more accurate form of semantic similarity that we use on a smaller subset of our documents.
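The retrieve-then-rerank pattern can be sketched in plain Python. Both scorers below are toy stand-ins chosen for illustration (word overlap for the cheap first stage, matching word pairs for the "more careful" second stage); they are not what the vector store or Cohere's reranker actually compute, and the documents are made up.

```python
def words(text: str) -> list[str]:
    return text.lower().split()


def first_stage(query: str, docs: list[str], k: int) -> list[str]:
    """Cheap, recall-oriented pass: rank by number of shared words, keep k."""
    query_words = set(words(query))
    ranked = sorted(docs, key=lambda d: len(query_words & set(words(d))), reverse=True)
    return ranked[:k]


def bigram_score(query: str, doc: str) -> int:
    """Toy 'careful' scorer: count adjacent word pairs shared with the query."""
    q, d = words(query), words(doc)
    query_pairs = set(zip(q, q[1:]))
    return sum(pair in query_pairs for pair in zip(d, d[1:]))


def rerank(query: str, candidates: list[str], top_n: int) -> list[str]:
    """Slower second pass over the small candidate set only."""
    return sorted(candidates, key=lambda d: bigram_score(query, d), reverse=True)[:top_n]


docs = [
    "agents for travel booking are useful",
    "llm agents use tools in a loop",
    "llm pricing dropped this year",
    "recipes for sourdough bread",
]
query = "how are llm agents useful"

candidates = first_stage(query, docs, k=3)   # wide net
reranked = rerank(query, candidates, top_n=2)  # narrow, reordered
print(reranked)
```

The first stage ranks the travel-booking document highest because it shares the most individual words with the query, while the second stage promotes the document that actually contains the phrase "llm agents".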

In [31]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

def retrieve_adjusted(state):
  # `top_n` sets how many documents the reranker keeps. `ContextualCompressionRetriever`
  # has no `search_kwargs` field, so the limit is set on the compressor instead.
  compressor = CohereRerank(model="rerank-v3.5", top_n=5)
  compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
  )
  retrieved_docs = compression_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

We can simply rebuild our graph with the new retriever!

In [32]:
class State(TypedDict):
  question: str
  context: List[Document]
  response: str

graph_builder = StateGraph(State).add_sequence([retrieve_adjusted, generate])
graph_builder.add_edge(START, "retrieve_adjusted")
graph = graph_builder.compile()

In [33]:
response = graph.invoke({"question" : "How are LLM agents useful?"})
response["response"]

'LLM agents are seen as potentially useful in several areas, particularly in acting on behalf of users. There are two main perspectives on their utility: one views them as digital assistants similar to travel agents, while the other sees them as systems that can utilize tools in a loop to solve problems. \n\nHowever, there is skepticism regarding their effectiveness due to the inherent challenge of gullibility; LLMs can struggle to distinguish between truth and fiction. This raises concerns about how reliable these agents can be in making meaningful decisions. Despite excitement around AI agents, there are few real-world examples of them in production, which may be attributed to their gullibility issues.\n\nOne area where LLMs have shown significant capability is in writing code, as the grammar of programming languages is less complex than that of natural languages. This specific application has been increasingly recognized as a potential strength of LLMs.\n\nOverall, while there is po

In [34]:
import time

for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
  time.sleep(2) # To try to avoid rate limiting.
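A fixed `time.sleep(2)` is a blunt tool against rate limits. A common alternative is to retry with exponential backoff and a little jitter, sketched below. `call_with_backoff` is a hypothetical helper, the failure is simulated, and in real code you would catch the provider's rate-limit exception rather than a bare `Exception`. Ragas' `RunConfig` also exposes settings such as `max_workers` that may help for the evaluation step; check the docs for your version.

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponentially growing, jittered delays on failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # assumption: narrow this to the rate-limit error in practice
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the original error
            delay = base_delay * 2 ** attempt
            sleep(delay + random.uniform(0, delay))  # jitter spreads out retries


# Simulated flaky call: fails twice, then succeeds.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated 429")
    return "ok"

delays = []  # record delays instead of really sleeping
outcome = call_with_backoff(flaky, base_delay=0.01, sleep=delays.append)
print(outcome, calls["count"], len(delays))
```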

In [35]:
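# Rebuild the evaluation dataset so `evaluate` scores the reranked responses
# rather than the baseline copy created before the loop above.
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result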

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

Exception raised in Job[13]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-XyIrJZx8AUgCzADN6GcfWGO6 on tokens per min (TPM): Limit 30000, Used 29811, Requested 2419. Please try again in 4.46s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[25]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-XyIrJZx8AUgCzADN6GcfWGO6 on tokens per min (TPM): Limit 30000, Used 29192, Requested 2210. Please try again in 2.804s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[1]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-XyIrJZx8AUgCzADN6GcfWGO6 on tokens per min (TPM): Limit 30000, Used 29246, Requested 

{'context_recall': 0.7250, 'faithfulness': 0.3974, 'factual_correctness': 0.5560, 'answer_relevancy': 0.8397, 'context_entity_recall': 0.2421, 'noise_sensitivity_relevant': 0.1221}

#### ❓ Question:

Which system performed better, on what metrics, and why?

**Answer**:

## Comparison


| Metric | Default System | Rerank System | Difference | Better System |
|--------|---------------|--------------|------------|--------------|
| Context Recall | 0.8125 | 0.7250 | -0.0875 (11%) | Default |
| Faithfulness | 0.3974 | 0.3974 | 0 (0%) | Tie |
| Factual Correctness | 0.5530 | 0.5560 | +0.0030 (0.5%) | Rerank |
| Answer Relevancy | 0.8396 | 0.8397 | +0.0001 (0.01%) | Rerank |
| Context Entity Recall | 0.3960 | 0.2421 | -0.1539 (39%) | Default |
| Noise Sensitivity | 0.1860 | 0.1221 | -0.0639 (34%) | Rerank |
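The differences in the table can be reproduced with a few lines of Python. The scores are copied from the two `evaluate` outputs above; the variable names and the `lower_is_better` set are choices made for this sketch, with relative change measured against the default system's score.

```python
default_scores = {
    "context_recall": 0.8125, "faithfulness": 0.3974, "factual_correctness": 0.5530,
    "answer_relevancy": 0.8396, "context_entity_recall": 0.3960,
    "noise_sensitivity_relevant": 0.1860,
}
rerank_scores = {
    "context_recall": 0.7250, "faithfulness": 0.3974, "factual_correctness": 0.5560,
    "answer_relevancy": 0.8397, "context_entity_recall": 0.2421,
    "noise_sensitivity_relevant": 0.1221,
}
lower_is_better = {"noise_sensitivity_relevant"}

winners = {}
relative_change = {}
for metric, base in default_scores.items():
    delta = round(rerank_scores[metric] - base, 4)
    relative_change[metric] = abs(delta) / base * 100  # percent of the default score
    if delta == 0:
        winners[metric] = "Tie"
    elif (delta < 0) == (metric in lower_is_better):
        winners[metric] = "Rerank"  # moved in the "good" direction for this metric
    else:
        winners[metric] = "Default"
    print(f"{metric:28s} {delta:+.4f} ({relative_change[metric]:.2f}%) {winners[metric]}")
```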

## Performance Comparison Between Systems

The **default system** performed better on the following metrics:
- **Context Recall**: 0.8125 vs 0.7250 (rerank scored 11% lower)
- **Context Entity Recall**: 0.3960 vs 0.2421 (rerank scored 39% lower)

The **rerank system** performed better on:
- **Factual Correctness**: 0.5560 vs 0.5530 (slight improvement)
- **Answer Relevancy**: 0.8397 vs 0.8396 (negligible difference)
- **Noise Sensitivity**: 0.1221 vs 0.1860 (34% better, lower is better)

Both systems performed identically on **Faithfulness** (0.3974).

### Analysis

The default system with k=5 direct retrieval was more effective at retrieving comprehensive context and entities from the source documents. This makes sense because it directly used vector similarity to find the most relevant chunks.

The rerank system, while starting with more initial documents (k=20), then filtering down to k=5 using Cohere's reranker, showed a slight improvement in factual correctness and significantly lower noise sensitivity. This suggests the reranker successfully filtered out irrelevant information but may have been too aggressive in removing some relevant entities and context.

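The low faithfulness score for both systems (0.3974) suggests that responses often contain claims that are not supported by the retrieved context. This is a common challenge in RAG systems when the retrieved information is insufficient for the complexity of the questions being asked. A score that is identical to four decimal places across two runs is unusual, though, and both evaluation runs logged rate-limit errors, so it is worth confirming that the second run actually scored the reranked responses before reading too much into the comparison.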

Overall, the default system would be preferable if comprehensive information retrieval is the priority, while the rerank system would be better for applications where reducing noise and slightly improving factual accuracy are more important.

