# Using Ragas to Evaluate a RAG Application built with LangChain and LangGraph

In this notebook, we'll look at how [Ragas](https://github.com/explodinggradients/ragas) can help you evaluate your RAG applications!

While this example is rooted in LangChain/LangGraph - Ragas is framework agnostic (you don't even need to be using a framework!).

- 🤝 Breakout Room #1
  1. Task 1: Installing Required Libraries
  2. Task 2: Set Environment Variables
  3. Task 3: Synthetic Dataset Generation for Evaluation using Ragas
  4. Task 4: Evaluating our Pipeline with Ragas
  5. Task 5: Making Adjustments and Re-Evaluating

But first! Let's set up some dependencies!

## Dependencies and API Keys:

We'll also need to provide our API keys.

First, OpenAI's for our LLM/embedding model combination!

In [36]:
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Please enter your OpenAI API key!")

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it gives us a test set against which we can measure how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data - which should hopefully be familiar at this point since it's our Loan Data use-case!

Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [37]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader


path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge-graph-based approach to create data. This is extremely useful because it lets us create complex queries rather simply. The additional testset complexity lets us evaluate harder problems more effectively, since systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Abstracted SDG

Building the knowledge graph step by step is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [38]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [39]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs[:20], testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/17 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/20 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node


Applying SummaryExtractor:   0%|          | 0/31 [00:00<?, ?it/s]

Property 'summary' already exists in node '9a57c2'. Skipping!
Property 'summary' already exists in node '4d646e'. Skipping!
Property 'summary' already exists in node '76ca2d'. Skipping!
Property 'summary' already exists in node '9f5b2d'. Skipping!
Property 'summary' already exists in node 'efde27'. Skipping!
Property 'summary' already exists in node 'f76bff'. Skipping!
Property 'summary' already exists in node 'b327e2'. Skipping!
Property 'summary' already exists in node 'd4c37c'. Skipping!
Property 'summary' already exists in node 'd742d0'. Skipping!
Property 'summary' already exists in node '019c17'. Skipping!
Property 'summary' already exists in node '30ef8d'. Skipping!
Property 'summary' already exists in node 'd4e9a1'. Skipping!
Property 'summary' already exists in node '6c6449'. Skipping!
Property 'summary' already exists in node '18256e'. Skipping!


Applying CustomNodeFilter:   0%|          | 0/6 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/41 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node 'd4e9a1'. Skipping!
Property 'summary_embedding' already exists in node 'd4c37c'. Skipping!
Property 'summary_embedding' already exists in node 'efde27'. Skipping!
Property 'summary_embedding' already exists in node '019c17'. Skipping!
Property 'summary_embedding' already exists in node '18256e'. Skipping!
Property 'summary_embedding' already exists in node '9a57c2'. Skipping!
Property 'summary_embedding' already exists in node 'b327e2'. Skipping!
Property 'summary_embedding' already exists in node 'd742d0'. Skipping!
Property 'summary_embedding' already exists in node '9f5b2d'. Skipping!
Property 'summary_embedding' already exists in node '4d646e'. Skipping!
Property 'summary_embedding' already exists in node '6c6449'. Skipping!
Property 'summary_embedding' already exists in node '30ef8d'. Skipping!
Property 'summary_embedding' already exists in node '76ca2d'. Skipping!
Property 'summary_embedding' already exists in node 'f76bff'. Sk

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [40]:
dataset.to_pandas()

|   | user_input | reference_contexts | reference | synthesizer_name |
| :-: | :-- | :-- | :-- | :-- |
| 0 | What is Scheduled Academic Year and how it wor... | [non-term (includes clock-hour calendars), or ... | Scheduled Academic Year (SAY) is used for moni... | single_hop_specifc_query_synthesizer |
| 1 | How does federal student aid policy address th... | [Inclusion of Clinical Work in a Standard Term... | Federal student aid policy allows clinical wor... | single_hop_specifc_query_synthesizer |
| 2 | How are clock hours used to determine the stru... | [Non-Term Characteristics A program that measu... | A program that measures progress in clock hour... | single_hop_specifc_query_synthesizer |
| 3 | Whaat are the requirments for FSEOG disbursmen... | [both the credit or clock hours and the weeks ... | A student must complete both the credit or clo... | single_hop_specifc_query_synthesizer |
| 4 | what is the disbursement timing for federal st... | [<1-hop>\n\nboth the credit or clock hours and... | In clock-hour or non-term credit-hour programs... | multi_hop_abstract_query_synthesizer |
| 5 | What are the disbursement requirements and tim... | [<1-hop>\n\nboth the credit or clock hours and... | In clock-hour or non-term credit-hour programs... | multi_hop_abstract_query_synthesizer |
| 6 | what is the disbursement requirements for fede... | [<1-hop>\n\nboth the credit or clock hours and... | for federal student aid programs like Pell Gra... | multi_hop_abstract_query_synthesizer |
| 7 | if practicum is licensure need and program is ... | [<1-hop>\n\nInclusion of Clinical Work in a St... | if practicum or clinical experience is needed ... | multi_hop_abstract_query_synthesizer |
| 8 | According to the guidance in Volume 2, Chapter... | [<1-hop>\n\nnon-term (includes clock-hour cale... | The requirements for including clinical work i... | multi_hop_specific_query_synthesizer |
| 9 | According to Volume 8, Chapter 3, how do the d... | [<1-hop>\n\nDisbursement Timing in Subscriptio... | In subscription-based programs, as outlined in... | multi_hop_specific_query_synthesizer |


## LangChain RAG

Now we'll construct our LangChain RAG, which we will be evaluating using the above created test data!

### R - Retrieval

Let's start with building our retrieval pipeline, which will involve loading the same data we used to create our synthetic test set above.

> NOTE: We need to use the same data - as our test set is specifically designed for this data.

In [41]:
path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

Now that we have our data loaded, let's split it into chunks!

In [42]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_documents = text_splitter.split_documents(docs)
len(split_documents)

1102

#### ❓ Question: 

What is the purpose of the `chunk_overlap` parameter in the `RecursiveCharacterTextSplitter`?

##### ✅ Answer:
`chunk_overlap` creates useful redundancy between adjacent chunks. Without overlap, a chunk can be cut off abruptly or start in a way that loses connected information. For example, if the information we are retrieving sits at the end of one chunk, the next chunk may hold the details about it. To increase our chances of getting that second chunk into the context, we make neighboring chunks share some text, which makes both more likely to be retrieved when the information is truly connected.
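To make the effect of the overlap visible, here is a minimal sketch using a plain fixed-width splitter. This is an assumption-laden simplification: `chunk_with_overlap` is a hypothetical helper, not the algorithm `RecursiveCharacterTextSplitter` uses (which also tries to split on separators such as paragraphs and sentences), and the sample text is made up.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows; consecutive windows share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks

# Made-up sample text, for illustration only.
text = "Loans are disbursed in two payments. The second payment requires completed hours."

no_overlap = chunk_with_overlap(text, chunk_size=40, chunk_overlap=0)
with_overlap = chunk_with_overlap(text, chunk_size=40, chunk_overlap=15)

for chunk in with_overlap:
    print(repr(chunk))
```

With `chunk_overlap=15`, the last 15 characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is more likely to show up whole in at least one chunk.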

Next up, we'll need to provide an embedding model that we can use to construct our vector store.

In [43]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

Now we can build our in-memory Qdrant vector store.

In [44]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="loan_data",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="loan_data",
    embedding=embeddings,
)

We can now add our documents to our vector store.

In [45]:
_ = vector_store.add_documents(documents=split_documents)

Let's define our retriever.

In [46]:
retriever = vector_store.as_retriever(search_kwargs={"k": 5})
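To make the `k` setting concrete, here is a rough sketch of what a top-`k` cosine-similarity lookup does. It is a brute-force toy, not Qdrant's implementation; the three-dimensional vectors and document names are invented for illustration (the real embeddings here have 1536 dimensions).

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical toy embeddings.
toy_vectors = {
    "pell_grants": [0.9, 0.1, 0.0],
    "direct_loans": [0.1, 0.9, 0.1],
    "campus_parking": [0.0, 0.1, 0.9],
}
query = [0.2, 0.8, 0.0]

print(top_k(query, toy_vectors, k=2))
```

Raising `k` trades precision for recall: more chunks reach the prompt, so relevant ones are less likely to be missed, but more irrelevant ones come along too.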

Now we can produce a node for retrieval!

In [47]:
def retrieve(state):
  retrieved_docs = retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

### A - Augmented

Let's create a simple RAG prompt!

In [48]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

### G - Generation

We'll also need an LLM to generate responses - we'll use `gpt-4.1-nano` to avoid using the same model as our judge model.

In [49]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-nano")

Then we can create a `generate` node!

In [50]:
def generate(state):
  docs_content = "\n\n".join(doc.page_content for doc in state["context"])
  messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
  response = llm.invoke(messages)
  return {"response" : response.content}
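To see roughly what text the model receives, the context assembly can be reproduced with plain string formatting. This is only a sketch: `ToyDoc` and `TOY_PROMPT` are hypothetical stand-ins for the LangChain `Document` and the prompt template above, and no model is called.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a LangChain Document; only `page_content` is modeled.
@dataclass
class ToyDoc:
    page_content: str

# Shortened copy of the prompt layout used above (system instruction omitted).
TOY_PROMPT = "### Question\n{question}\n\n### Context\n{context}\n"

def build_prompt(question: str, docs: list[ToyDoc]) -> str:
    # Join chunk texts with a blank line between them, as the generate node does.
    context = "\n\n".join(doc.page_content for doc in docs)
    return TOY_PROMPT.format(question=question, context=context)

prompt = build_prompt(
    "What is chunk A about?",
    [ToyDoc("Text of chunk A."), ToyDoc("Text of chunk B.")],
)
print(prompt)
```

Because the chunks are simply concatenated, their order and number directly shape the prompt, which is why retrieval changes can move the generation metrics.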

### Building RAG Graph with LangGraph

Let's create some state for our LangGraph RAG graph!

In [51]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

class State(TypedDict):
  question: str
  context: List[Document]
  response: str

Now we can build our simple graph!

> NOTE: We're using `add_sequence` since we will always move from retrieval to generation. This is essentially building a chain in LangGraph.

In [52]:
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
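Conceptually, a sequence of nodes behaves like a chain in which each node reads the shared state and returns a partial update that is merged back in. The sketch below mimics that idea in plain Python; `run_sequence` and the toy nodes are hypothetical, and LangGraph itself does considerably more (reducers, checkpointing, streaming).

```python
def run_sequence(nodes, initial_state: dict) -> dict:
    """Run nodes in order, merging each node's partial update into the state."""
    state = dict(initial_state)
    for node in nodes:
        update = node(state)   # each node returns only the keys it changes
        state.update(update)   # keys are overwritten, other keys are kept
    return state

def toy_retrieve(state: dict) -> dict:
    # Stand-in for the retrieval node: produces a "context" list.
    return {"context": [f"chunk about {state['question']}"]}

def toy_generate(state: dict) -> dict:
    # Stand-in for the generation node: reads "context", writes "response".
    return {"response": f"answer built from {len(state['context'])} chunk(s)"}

final_state = run_sequence([toy_retrieve, toy_generate], {"question": "loans"})
print(final_state)
```

This is also why the `generate` node can rely on `state["context"]` being present: the retrieval node always runs first and its update is already merged.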

Let's do a test to make sure it's doing what we'd expect.

In [53]:
response = graph.invoke({"question" : "What are the different kinds of loans?"})

In [54]:
response["response"]

'Based on the provided context, the document discusses different aspects of student loans, particularly focusing on the types of loans under the Direct Loan program. The main kinds of loans mentioned are:\n\n1. **Direct Loans** – These are federal student loans that include:\n   - **Direct Unsubsidized Loans** – Loans where interest accrues during school and deferment periods, and borrowers may choose to pay the interest while in school.\n   - (Implied) **Direct Subsidized Loans** – Not explicitly mentioned in the excerpt, but generally part of the Direct Loan program, where the government pays interest while the student is in school.\n\nThe context emphasizes the importance of understanding loan options, repayment plans, and borrower responsibilities related to Direct Loans.\n\n**Note:** The document does not mention other types of student loans such as Perkins Loans or Private Loans, only the Direct Loan types are explicitly referenced.\n\n**In summary:**  \n**The different kinds of 

## Evaluating the App with Ragas

Now we can finally do our evaluation!

We'll start by running the queries we generated using SDG above through our application to get context and responses.

In [55]:
for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [56]:
dataset.samples[0].eval_sample.response

"A Scheduled Academic Year (SAY) is a fixed, traditional academic calendar that generally begins and ends at the same time each year, typically aligning with the fall and spring semesters or trimesters. It may also include summer terms, but these are part of the overall fixed period. SAY is used for monitoring the progression of Direct Loan annual loan limits by establishing a clear time frame (usually a full academic year) during which a student can receive up to the applicable loan limit.\n\nFor programs with standard terms or SE9W nonstandard terms that align with a traditional academic calendar, the SAY corresponds to the overall academic year. In subscription-based programs with standard or SE9W nonstandard terms, the SAY still applies if the program follows a similar fixed, academic-year structure.\n\nWhen it comes to nonstandard term programs or non-term (clock-hour) programs, the use of SAY depends on whether the program has a comparable calendar structure. Nonstandard terms th

Then we can convert that table into an `EvaluationDataset`, which will make the process of evaluation smoother.

In [57]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

We'll need to select a judge model - in this case we're using `gpt-4.1-mini`, which comes from the same model family as the `gpt-4.1` model that generated our synthetic data.

In [58]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))

Next up - we simply evaluate on our desired metrics!

In [59]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

{'context_recall': 0.8194, 'faithfulness': 0.9039, 'factual_correctness': 0.6017, 'answer_relevancy': 0.9557, 'context_entity_recall': 0.3605, 'noise_sensitivity_relevant': 0.3477}

## Making Adjustments and Re-Evaluating

Now that we've got our baseline - let's make a change and see how the model improves or doesn't improve!

> NOTE: This will be using Cohere's Rerank model - please be sure to [sign-up for an API key!](https://docs.cohere.com/reference/about)

In [60]:
os.environ["COHERE_API_KEY"] = getpass("Please enter your Cohere API key!")


We'll first set our retriever to return more documents, which will allow us to take advantage of the reranking.

In [61]:
adjusted_example_retriever = vector_store.as_retriever(search_kwargs={"k": 20})

Reranking, or contextual compression, is a technique that uses a reranker to compress the retrieved documents into a smaller set of documents.

This is essentially a slower, more accurate form of semantic similarity that we use on a smaller subset of our documents.
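The retrieve-then-rerank pattern can be sketched without any external service. Everything here is a toy under stated assumptions: word overlap stands in for the fast vector search, a "query-word density" score stands in for the reranker (Cohere's model is a learned relevance model, not this heuristic), and the corpus is invented.

```python
def first_stage(query: str, corpus: list[str], k: int) -> list[str]:
    """Cheap, recall-oriented pass: rank by number of shared words."""
    q_words = set(query.lower().split())
    ranked = sorted(corpus, key=lambda doc: len(q_words & set(doc.lower().split())), reverse=True)
    return ranked[:k]

def density_score(query: str, doc: str) -> float:
    """Toy 'expensive' scorer: fraction of the document's words that appear in the query."""
    q_words = set(query.lower().split())
    words = doc.lower().split()
    return sum(word in q_words for word in words) / len(words)

def rerank(query: str, candidates: list[str], top_n: int) -> list[str]:
    """Slower, precision-oriented pass over the small candidate set only."""
    return sorted(candidates, key=lambda doc: density_score(query, doc), reverse=True)[:top_n]

# Invented corpus for illustration.
corpus = [
    "annual loan limits apply to every student and every program and every year of study",
    "annual loan limits",
    "loan limits",
    "cafeteria menu",
]
query = "annual loan limits"

candidates = first_stage(query, corpus, k=3)  # wide net
reranked = rerank(query, candidates, top_n=2)  # narrow, reordered result
print(reranked)
```

The first stage ranks the long, diluted document first because it shares three words with the query; the second stage pushes it out of the final set in favor of the focused ones.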

In [62]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

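def retrieve_adjusted(state):
  # The reranker rescores the 20 candidates returned by the base retriever.
  compressor = CohereRerank(model="rerank-v3.5")
  # NOTE: `search_kwargs` is not a documented parameter of ContextualCompressionRetriever and is
  # probably ignored here. The number of documents kept after reranking is controlled by
  # CohereRerank's `top_n`; set it explicitly (e.g. `top_n=5`) to match the baseline's k=5.
  compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=adjusted_example_retriever, search_kwargs={"k": 5}
  )
  retrieved_docs = compression_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}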

We can simply rebuild our graph with the new retriever!

In [63]:
class AdjustedState(TypedDict):
  question: str
  context: List[Document]
  response: str

adjusted_graph_builder = StateGraph(AdjustedState).add_sequence([retrieve_adjusted, generate])
adjusted_graph_builder.add_edge(START, "retrieve_adjusted")
adjusted_graph = adjusted_graph_builder.compile()

In [64]:
response = adjusted_graph.invoke({"question" : "What are the different kinds of loans?"})
response["response"]

'The different kinds of loans mentioned are:\n- Federal PLUS Loans\n- Federal Family Education Loan (FFEL) Program loans (made under this program before July 1, 2010)\n- Direct Subsidized Loans\n- Direct Unsubsidized Loans'

In [65]:
import time
import copy

rerank_dataset = copy.deepcopy(dataset)

for test_row in rerank_dataset:
  response = adjusted_graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
  time.sleep(2) # To try to avoid rate limiting.

In [66]:
rerank_dataset.samples[0].eval_sample.response

"A Scheduled Academic Year (SAY) is a fixed, traditional academic year calendar that typically begins and ends at the same time each year. It generally aligns with the school's published academic schedule and can be used for awarding and disbursing Title IV aid, including federal student aid programs like Direct Loans.\n\n**How SAY Works for Financial Aid:**\n- It provides a consistent time frame for monitoring student enrollment and loan limit progression.\n- For programs with standard terms (such as semesters or trimesters) or SE9W nonstandard terms that are comparable in length, the SAY serves as the basis for tracking annual loan limits and determining full-time enrollment.\n- In subscription-based programs with standard or SE9W nonstandard terms, the SAY is also used for this purpose, provided nonstandard terms are substantially equal in length.\n\n**Implications for Different Academic Calendars:**\n- If a program uses a standard term or a comparable nonstandard term calendar, the

In [67]:
rerank_evaluation_dataset = EvaluationDataset.from_pandas(rerank_dataset.to_pandas())

In [68]:
result = evaluate(
    dataset=rerank_evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

{'context_recall': 0.8694, 'faithfulness': 0.8532, 'factual_correctness': 0.4933, 'answer_relevancy': 0.7864, 'context_entity_recall': 0.3649, 'noise_sensitivity_relevant': 0.3113}

#### ❓ Question: 

Which system performed better, on what metrics, and why?

##### ✅ Answer:
Comparison of the two runs:

| Metric                       | Original run | Re-ranked run (new) | Relative change |
| :--------------------------: | :----------: | :-----------------: | :-------------: |
| `context_recall`             | 0.8194       | 0.8694              | +6.1%           |
| `faithfulness`               | 0.9039       | 0.8532              | -5.6%           |
| `factual_correctness`        | 0.6017       | 0.4933              | -18.0%          |
| `answer_relevancy`           | 0.9557       | 0.7864              | -17.7%          |
| `context_entity_recall`      | 0.3605       | 0.3649              | +1.2%           |
| `noise_sensitivity_relevant` | 0.3477       | 0.3113              | -10.5%          |
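The relative changes in the table can be reproduced from the two result dictionaries printed above. A small sketch (the helper name `relative_change` is my own):

```python
# Scores copied from the two evaluate() outputs above.
baseline = {
    "context_recall": 0.8194,
    "faithfulness": 0.9039,
    "factual_correctness": 0.6017,
    "answer_relevancy": 0.9557,
    "context_entity_recall": 0.3605,
    "noise_sensitivity_relevant": 0.3477,
}
reranked = {
    "context_recall": 0.8694,
    "faithfulness": 0.8532,
    "factual_correctness": 0.4933,
    "answer_relevancy": 0.7864,
    "context_entity_recall": 0.3649,
    "noise_sensitivity_relevant": 0.3113,
}

def relative_change(before: dict, after: dict) -> dict:
    """Percentage change of each metric relative to its baseline value."""
    return {name: (after[name] - before[name]) / before[name] * 100 for name in before}

changes = relative_change(baseline, reranked)
for name, pct in changes.items():
    print(f"{name:28s} {pct:+.1f}%")
```

Keep in mind that these are averages over roughly ten synthetic samples, so small differences are hard to distinguish from run-to-run variation.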

1. **[Expected] Context recall improved slightly. Good thing.** Context recall tells us how many of the claims in the reference can be attributed to the retrieved context (the more reference claims the retrieved chunks support, the better). We'd expect the reranked RAG to retrieve documents that are more relevant to the question, so this aligns with our expectations.
2. **[Unexpected] Faithfulness decreased slightly. Bad thing.** Faithfulness tells us whether the model stays within the retrieved context, in other words, whether the claims made by the model are supported by the retrieved context. My main hypothesis: `gpt-4.1-nano` is a weak model, so its answers may vary a lot from run to run. I don't expect reranking to hurt faithfulness unless retrieval now returns vaguer documents (abstract documents would give the model more degrees of freedom in answering). That is unlikely, as reranking should increase relevance to the query, not decrease it.
3. **[Strongly unexpected] Factual correctness decreased significantly. Bad thing.** The default mode of factual correctness is the F1 score, which balances precision and recall of the response's claims against the reference. The only way it can drop is through more false positives (response claims not supported by the reference) or more false negatives (reference claims missing from the response). The reranking model may be mistuned for our dataset, or we may have restricted "k" too much (the adjusted pipeline was meant to keep the same final k as the original, so the number of documents the reranker actually returns is worth verifying).
4. **[Unexpected] Answer relevancy decreased significantly. Bad thing.** The metric tells us how relevant the answer is to the question: it generates synthetic questions from the answer and measures their cosine similarity with the original question. We'd expect the reranked RAG to have better answer relevancy thanks to better retrieved context; however, given the decrease in factual correctness, it is unsurprising that answer relevancy decreased too. The root causes are probably similar.
5. **[Expected] Context entity recall increased slightly (probably just noise). Good thing.** The metric shows the proportion of entities in the reference that also appear in the retrieved context. It depends heavily on how many entities our documents and references contain, and on whether the relevant chunks are retrieved. Since we expected context recall to improve, we'd also expect context entity recall to improve, because we reranked for relevance. The change is small, though, so it is most likely noise.
6. **[Expected] Noise sensitivity decreased slightly. Good thing.** The metric shows the proportion of claims in the generated answer that are incorrect (here, incorrect claims drawn from relevant retrieved context), so lower is better. As we expect the reranked pipeline to return more relevant context, we expect the final answer to be better grounded. However, it's surprising that this metric improved while factual correctness and answer relevancy got worse.
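Since the factual correctness score is an F1 over claims, it helps to see how false positives and false negatives move it. The sketch below uses made-up claim counts (not taken from this run); in Ragas the counts come from an LLM that decomposes and checks claims, which is not reproduced here.

```python
def f1_from_claims(tp: int, fp: int, fn: int) -> float:
    """F1 from claim counts.

    tp: response claims supported by the reference
    fp: response claims not supported by the reference
    fn: reference claims missing from the response
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: same coverage of the reference, but the second answer
# adds four extra unsupported claims.
concise_answer = f1_from_claims(tp=6, fp=2, fn=2)
padded_answer = f1_from_claims(tp=6, fp=6, fn=2)

print(round(concise_answer, 2), round(padded_answer, 2))
```

In this toy case recall is identical for both answers, yet the padded answer scores lower. A more verbose response can therefore reduce factual correctness even when it covers the reference equally well.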


Overall conclusion: reranked RAG has better retrieval as evidenced by context recall and context entity recall metrics, but degraded generation. 

Potential root causes:
1. The model overfits to the better context from the retrieved docs (context is too narrow).
2. The model underfits with broader context from the retrieved docs (context is too wide, so the model is unsure what to select for the answer).
3. Denser semantic similarity leads to a narrower information stream than our use case needs.
4. The reranker model is mistuned/mismatched relative to our RAG pipeline quality evaluation metrics.

