# Using Ragas to Evaluate a RAG Application built with LangChain and LangGraph

In the following notebook, we'll be looking at how [Ragas](https://github.com/explodinggradients/ragas) can be helpful in a number of ways when looking to evaluate your RAG applications!

While this example is rooted in LangChain/LangGraph - Ragas is framework agnostic (you don't even need to be using a framework!).

- 🤝 Breakout Room #1
  1. Task 1: Installing Required Libraries
  2. Task 2: Set Environment Variables
  3. Task 3: Synthetic Dataset Generation for Evaluation using Ragas
  4. Task 4: Evaluating our Pipeline with Ragas
  5. Task 6: Making Adjustments and Re-Evaluating

But first! Let's set some dependencies!

## Dependencies and API Keys:

We'll also need to provide our API keys.

First, OpenAI's for our LLM/embedding model combination!

In [1]:
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Please enter your OpenAI API key!")

## Generating Synthetic Test Data

We wil be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it will allow us to find out how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data - which should hopefull be familiar at this point since it's our Loan Data use-case!

Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [2]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader


path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [3]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [4]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs[:20], testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/17 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/20 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node


Applying SummaryExtractor:   0%|          | 0/31 [00:00<?, ?it/s]

Property 'summary' already exists in node '76d1f6'. Skipping!
Property 'summary' already exists in node 'b14b55'. Skipping!
Property 'summary' already exists in node '8c3cf7'. Skipping!
Property 'summary' already exists in node '1c3b06'. Skipping!
Property 'summary' already exists in node 'dbfe93'. Skipping!
Property 'summary' already exists in node '53794e'. Skipping!
Property 'summary' already exists in node '347bfa'. Skipping!
Property 'summary' already exists in node '29b407'. Skipping!
Property 'summary' already exists in node 'c1c473'. Skipping!
Property 'summary' already exists in node '775a84'. Skipping!
Property 'summary' already exists in node 'df8467'. Skipping!
Property 'summary' already exists in node '9b6b4c'. Skipping!
Property 'summary' already exists in node '253924'. Skipping!
Property 'summary' already exists in node '252a7b'. Skipping!


Applying CustomNodeFilter:   0%|          | 0/6 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/41 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node 'b14b55'. Skipping!
Property 'summary_embedding' already exists in node '9b6b4c'. Skipping!
Property 'summary_embedding' already exists in node 'c1c473'. Skipping!
Property 'summary_embedding' already exists in node '347bfa'. Skipping!
Property 'summary_embedding' already exists in node '252a7b'. Skipping!
Property 'summary_embedding' already exists in node '775a84'. Skipping!
Property 'summary_embedding' already exists in node '76d1f6'. Skipping!
Property 'summary_embedding' already exists in node '8c3cf7'. Skipping!
Property 'summary_embedding' already exists in node '1c3b06'. Skipping!
Property 'summary_embedding' already exists in node '29b407'. Skipping!
Property 'summary_embedding' already exists in node '253924'. Skipping!
Property 'summary_embedding' already exists in node '53794e'. Skipping!
Property 'summary_embedding' already exists in node 'dbfe93'. Skipping!
Property 'summary_embedding' already exists in node 'df8467'. Sk

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [5]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What is BBAY 1 and how is it used for monitori...,"[non-term (includes clock-hour calendars), or ...",BBAY 1 is one of the options for monitoring Di...,single_hop_specifc_query_synthesizer
1,Whaat are the requirments for includin phsyica...,[Inclusion of Clinical Work in a Standard Term...,Clinical work in physical therapy may be inclu...,single_hop_specifc_query_synthesizer
2,Which Title IV programs require disbursements ...,[Non-Term Characteristics A program that measu...,All Title IV programs except the Federal Work-...,single_hop_specifc_query_synthesizer
3,"According to federal regulations, how does acc...",[both the credit or clock hours and the weeks ...,In clock-hour or non-term credit-hour programs...,single_hop_specifc_query_synthesizer
4,How does the disbursement timing for federal s...,[<1-hop>\n\nboth the credit or clock hours and...,In clock-hour or non-term credit-hour programs...,multi_hop_abstract_query_synthesizer
5,when do student get disbursement for federal s...,[<1-hop>\n\nboth the credit or clock hours and...,"in clock-hour or non-term credit-hour program,...",multi_hop_abstract_query_synthesizer
6,when do student get disbursement for federal s...,[<1-hop>\n\nboth the credit or clock hours and...,"in clock-hour or non-term credit-hour program,...",multi_hop_abstract_query_synthesizer
7,What are the disbursement timing requirements ...,[<1-hop>\n\nboth the credit or clock hours and...,"In subscription-based programs, for the first ...",multi_hop_abstract_query_synthesizer
8,how title iv program work for subscription-bas...,[<1-hop>\n\nnon-term (includes clock-hour cale...,"for title iv program, subscription-based progr...",multi_hop_specific_query_synthesizer
9,How do the payment period requirements for Tit...,[<1-hop>\n\nNon-Term Characteristics A program...,"For Title IV programs, payment periods apply t...",multi_hop_specific_query_synthesizer


## LangChain RAG

Now we'll construct our LangChain RAG, which we will be evaluating using the above created test data!

### R - Retrieval

Let's start with building our retrieval pipeline, which will involve loading the same data we used to create our synthetic test set above.

> NOTE: We need to use the same data - as our test set is specifically designed for this data.

In [6]:
path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

Now that we have our data loaded, let's split it into chunks!

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_documents = text_splitter.split_documents(docs)
len(split_documents)

1102

#### ❓ Question: 

What is the purpose of the `chunk_overlap` parameter in the `RecursiveCharacterTextSplitter`?

#### ✅ Answer:

### Purpose of the `chunk_overlap` Parameter in `RecursiveCharacterTextSplitter`

The `chunk_overlap` parameter in `RecursiveCharacterTextSplitter` serves a crucial role in maintaining **contextual continuity** when splitting large documents into smaller, manageable chunks for RAG (Retrieval-Augmented Generation) systems.

### **Primary Functions:**

**1. Preserving Context Across Boundaries**
When documents are split into chunks, important information often spans across chunk boundaries. Without overlap, a sentence, paragraph, or concept that's split between two chunks could lose its meaning. The overlap ensures that critical context from the end of one chunk appears at the beginning of the next chunk.

**2. Improving Retrieval Quality**
In the example shown (with `chunk_size=1000` and `chunk_overlap=200`), each chunk shares 200 characters with its adjacent chunks. This redundancy increases the likelihood that relevant information will be retrieved during similarity search, as the same concept appears in multiple chunks with slightly different surrounding context.

**3. Handling Imperfect Split Points**
The `RecursiveCharacterTextSplitter` tries to split at natural boundaries (sentences, paragraphs), but sometimes it must split mid-concept. The overlap acts as a safety net, ensuring that if a key piece of information is truncated at a chunk boundary, it's still fully preserved in the overlapping portion.

### **Practical Benefits:**

- **Better Question Answering**: When a user asks a question that relates to information near chunk boundaries, the overlap increases the chances that the complete answer will be retrieved
- **Reduced Information Loss**: Concepts that span multiple sentences or paragraphs remain intact across chunks
- **Enhanced Semantic Coherence**: The additional context helps embedding models better understand the meaning of each chunk

### **Trade-offs:**

**Pros:**
- Improved retrieval accuracy
- Better preservation of context
- More robust information retrieval

**Cons:**
- Increased storage requirements (redundant text)
- Higher computational costs (more chunks to process)
- Potential for retrieving duplicate information

### **Best Practices:**
The overlap should typically be 10-20% of the chunk size. In this example, 200 characters of overlap with 1000-character chunks (20%) is a well-balanced choice that provides sufficient context preservation without excessive redundancy.

The `chunk_overlap` parameter is essential for building robust RAG systems that can reliably retrieve and present complete, contextually accurate information to users.


Next up, we'll need to provide an embedding model that we can use to construct our vector store.

In [8]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

Now we can build our in memory QDrant vector store.

In [9]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="loan_data",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="loan_data",
    embedding=embeddings,
)

We can now add our documents to our vector store.

In [10]:
_ = vector_store.add_documents(documents=split_documents)

Let's define our retriever.

In [11]:
retriever = vector_store.as_retriever(search_kwargs={"k": 5})

Now we can produce a node for retrieval!

In [12]:
def retrieve(state):
  retrieved_docs = retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

### Augmented

Let's create a simple RAG prompt!

In [13]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

### Generation

We'll also need an LLM to generate responses - we'll use `gpt-4o-nano` to avoid using the same model as our judge model.

In [14]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-nano")

Then we can create a `generate` node!

In [15]:
def generate(state):
  docs_content = "\n\n".join(doc.page_content for doc in state["context"])
  messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
  response = llm.invoke(messages)
  return {"response" : response.content}

### Building RAG Graph with LangGraph

Let's create some state for our LangGraph RAG graph!

In [16]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

class State(TypedDict):
  question: str
  context: List[Document]
  response: str

Now we can build our simple graph!

> NOTE: We're using `add_sequence` since we will always move from retrieval to generation. This is essentially building a chain in LangGraph.

In [17]:
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

Let's do a test to make sure it's doing what we'd expect.

In [18]:
response = graph.invoke({"question" : "What are the different kinds of loans?"})

In [19]:
response["response"]

'Based on the provided context, the different kinds of loans mentioned are:\n\n1. Direct Loans, which include:\n   - Direct Unsubsidized Loans\n   - Other types of Direct Loans (implied, but not explicitly named in the excerpt)\n\nThe context primarily discusses the administration and transfer of Direct Loans, their interest accrual, repayment options, and related counseling, but it does not explicitly list other distinct types of loans beyond Direct Loans.'

## Evaluating the App with Ragas

Now we can finally do our evaluation!

We'll start by running the queries we generated usign SDG above through our application to get context and responses.

In [20]:
for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [21]:
dataset.samples[0].eval_sample.response

"BBAY 1 is a way of defining an academic period for monitoring Direct Loan annual loan limit progression, particularly for credit-hour programs with an SAY (Scheduled Academic Year). It is an alternative to using the SAY and is not a fixed calendar period; instead, its start and end depend on the individual student's enrollment.\n\nBBAY 1 must include the same number of terms as the SAY that would otherwise be used. For example, if the SAY includes fall, winter, and spring quarters, a BBAY 1 would consist of any three consecutive terms. It can include terms the student does not attend, provided the student could have enrolled at least half-time in those terms. However, unlike the SAY, a BBAY 1 must begin with a term in which the student was enrolled.\n\nIn terms of monitoring loan limits, BBAY 1 allows a student to regain eligibility for a new annual loan limit once the BBAY 1 period has elapsed. It offers flexibility in academic year standards, enabling a student to receive another lo

Then we can convert that table into a `EvaluationDataset` which will make the process of evaluation smoother.

In [22]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

We'll need to select a judge model - in this case we're using the same model that was used to generate our Synthetic Data.

In [23]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))

Next up - we simply evaluate on our desired metrics!

In [24]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

Exception raised in Job[41]: AttributeError('StringIO' object has no attribute 'statements')


{'context_recall': 0.9014, 'faithfulness': 0.9033, 'factual_correctness': 0.6250, 'answer_relevancy': 0.8748, 'context_entity_recall': 0.3166, 'noise_sensitivity_relevant': 0.3104}

## Making Adjustments and Re-Evaluating

Now that we've got our baseline - let's make a change and see how the model improves or doesn't improve!

> NOTE: This will be using Cohere's Rerank model - please be sure to [sign-up for an API key!](https://docs.cohere.com/reference/about)

In [25]:
os.environ["COHERE_API_KEY"] = getpass("Please enter your Cohere API key!")


We'll first set our retriever to return more documents, which will allow us to take advantage of the reranking.

In [26]:
adjusted_example_retriever = vector_store.as_retriever(search_kwargs={"k": 20})

Reranking, or contextual compression, is a technique that uses a reranker to compress the retrieved documents into a smaller set of documents.

This is essentially a slower, more accurate form of semantic similarity that we use on a smaller subset of our documents.

In [27]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

def retrieve_adjusted(state):
  compressor = CohereRerank(model="rerank-v3.5")
  compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=adjusted_example_retriever, search_kwargs={"k": 5}
  )
  retrieved_docs = compression_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

We can simply rebuild our graph with the new retriever!

In [28]:
class AdjustedState(TypedDict):
  question: str
  context: List[Document]
  response: str

adjusted_graph_builder = StateGraph(AdjustedState).add_sequence([retrieve_adjusted, generate])
adjusted_graph_builder.add_edge(START, "retrieve_adjusted")
adjusted_graph = adjusted_graph_builder.compile()

In [29]:
response = adjusted_graph.invoke({"question" : "What are the different kinds of loans?"})
response["response"]

'The provided context mentions specific types of loans related to education financing, including:\n\n1. Federal PLUS Loans\n2. Federal Family Education Loan (FFEL) Program loans (before July 1, 2010)\n3. Direct Subsidized Loans\n4. Direct Unsubsidized Loans\n5. Student Direct PLUS Loans\n\nThese are the different kinds of loans referenced in the context.'

In [30]:
import time
import copy

rerank_dataset = copy.deepcopy(dataset)

for test_row in rerank_dataset:
  response = adjusted_graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
  time.sleep(2) # To try to avoid rate limiting.

In [31]:
rerank_dataset.samples[0].eval_sample.response

"BBAY 1 is a type of Borrower-Based Academic Year used for credit-hour programs with an SAY (Scheduled Academic Year). It serves as an alternative method for monitoring a student's progression toward the Direct Loan annual loan limit. \n\nA BBAY 1 is not fixed in duration like a traditional academic year; instead, its start and end dates depend on the individual student's enrollment. For programs with an SAY, a BBAY 1 must include the same number of terms as the SAY (excluding any summer trailer or header). For example, if the SAY includes three consecutive terms (such as fall, winter, and spring), the BBAY 1 would also consist of any three consecutive terms. It may include terms the student does not attend if the student could have enrolled at least half-time during those terms, but it must begin with a term in which the student was enrolled.\n\nIn summary, BBAY 1 is used to monitor annual loan limit progression by aligning with the student's enrollment pattern, matching the number of

In [32]:
rerank_evaluation_dataset = EvaluationDataset.from_pandas(rerank_dataset.to_pandas())

In [33]:
result = evaluate(
    dataset=rerank_evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

{'context_recall': 0.8069, 'faithfulness': 0.8275, 'factual_correctness': 0.6142, 'answer_relevancy': 0.9429, 'context_entity_recall': 0.4058, 'noise_sensitivity_relevant': 0.2551}

#### ❓ Question: 

Which system performed better, on what metrics, and why?

##### ✅ Answer:

## **System Performance Comparison - Baseline vs Reranked (Recalculated)**

**Actual Evaluation Results:**
- **BASELINE**: {'context_recall': 0.9014, 'faithfulness': 0.9033, 'factual_correctness': 0.6250, 'answer_relevancy': 0.8748, 'context_entity_recall': 0.3166, 'noise_sensitivity_relevant': 0.3104}
- **RERANKED**: {'context_recall': 0.8069, 'faithfulness': 0.8275, 'factual_correctness': 0.6142, 'answer_relevancy': 0.9429, 'context_entity_recall': 0.4058, 'noise_sensitivity_relevant': 0.2551}

| Metric | Baseline System | Reranked System | Change | Winner |
|--------|----------------|-----------------|---------|---------|
| Context Recall | **0.9014** | 0.8069 | -0.0945 (-10.5%) | **Baseline** |
| Faithfulness | **0.9033** | 0.8275 | -0.0758 (-8.4%) | **Baseline** |
| Factual Correctness | **0.6250** | 0.6142 | -0.0108 (-1.7%) | **Baseline** |
| Answer Relevancy | 0.8748 | **0.9429** | +0.0681 (+7.8%) | **Reranked** |
| Context Entity Recall | 0.3166 | **0.4058** | +0.0892 (+28.2%) | **Reranked** |
| Noise Sensitivity | 0.3104 | **0.2551** | -0.0553 (-17.8%) | **Baseline** |

*Note: For noise sensitivity, lower scores indicate better performance (less sensitivity to noise).*

## **Analysis**

**Contrary to initial expectations, the Baseline System outperformed the Reranked System overall**, winning on 3 out of 6 metrics including the most critical ones for RAG performance.

### **Baseline System Advantages:**

**1. Context Recall (-10.5% for reranked)**: The baseline system was significantly better at retrieving all relevant context pieces. This is crucial for RAG as it directly impacts the system's ability to find necessary information.

**2. Faithfulness (-8.4% for reranked)**: The baseline produced more faithful responses that better adhered to the retrieved context, indicating less hallucination and better grounding.

**3. Factual Correctness (-1.7% for reranked)**: While a smaller difference, the baseline was more factually accurate, which is fundamental for trustworthy RAG systems.

### **Reranked System Advantages:**

**1. Answer Relevancy (+7.8%)**: The reranked system produced more relevant answers to the specific questions asked, likely due to the reranker's ability to identify contextually appropriate information.

**2. Context Entity Recall (+28.2%)**: Significant improvement in retrieving documents containing relevant entities, showing the reranker's strength in entity-focused retrieval.

**3. Noise Sensitivity (-17.8%)**: Better resilience to irrelevant information, suggesting the reranking process helps filter out noise.

### **Why the Baseline Performed Better Overall:**

**1. Retrieval Coverage**: The baseline's higher context recall suggests that retrieving more documents (k=5) without reranking captured more of the relevant information than the reranked approach (retrieve k=20, rerank to k=5).

**2. Information Loss**: The reranking process, while good at noise reduction, may have inadvertently filtered out some relevant context, leading to lower faithfulness and factual correctness.

**3. Over-optimization**: The reranker may have over-optimized for specific types of relevance while sacrificing broader contextual coverage needed for comprehensive answers.

### **Trade-off Analysis:**

The results reveal a classic **precision vs. recall trade-off**:
- **Baseline**: Higher recall (finds more relevant information) but potentially more noise
- **Reranked**: Higher precision (finds more targeted information) but misses some relevant context

**Overall Verdict**: For this specific dataset and use case, the **Baseline System performed better** due to superior context retrieval and factual accuracy. The reranking approach, while improving answer relevancy and entity recall, came at the cost of losing important contextual information that degraded overall performance. This suggests that for comprehensive question-answering tasks, casting a wider net without aggressive filtering may be more effective than sophisticated reranking.