# Using Ragas to Evaluate a RAG Application built with LangChain and LangGraph

In this notebook, we'll look at how [Ragas](https://github.com/explodinggradients/ragas) can help you evaluate your RAG applications!

While this example is rooted in LangChain/LangGraph - Ragas is framework agnostic (you don't even need to be using a framework!).

- 🤝 Breakout Room #1
  1. Task 1: Installing Required Libraries
  2. Task 2: Set Environment Variables
  3. Task 3: Synthetic Dataset Generation for Evaluation using Ragas
  4. Task 4: Evaluating our Pipeline with Ragas
  5. Task 5: Making Adjustments and Re-Evaluating

But first! Let's set up some dependencies!

## Dependencies and API Keys:

We'll also need to provide our API keys.

First, OpenAI's for our LLM/embedding model combination!

In [4]:
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Please enter your OpenAI API key!")

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it gives us a fixed set of questions with known answers against which we can measure how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data - which should hopefully be familiar at this point since it's our Loan Data use-case!

Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [5]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader


path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.
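
To make the knowledge-graph idea a little more concrete, here is a toy sketch. It is not Ragas's implementation: the node names, the three-dimensional "embeddings", the threshold, and the `build_edges` helper are all hypothetical. Nodes whose vectors are similar enough get linked, and each link is a candidate pair of contexts for a multi-hop question.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def build_edges(nodes, threshold=0.8):
    """Link every pair of nodes whose embeddings are similar enough."""
    edges = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(nodes.items(), 2):
        score = cosine(vec_a, vec_b)
        if score >= threshold:
            edges.append((name_a, name_b, round(score, 3)))
    return edges

# Hypothetical chunk embeddings (real ones would come from an embedding model).
toy_nodes = {
    "usage_growth": [0.9, 0.1, 0.0],
    "message_counts": [0.8, 0.2, 0.1],
    "occupations": [0.0, 0.2, 0.9],
}

toy_edges = build_edges(toy_nodes)
# Each connected pair could seed a question that needs both chunks to answer.
multi_hop_candidates = [(a, b) for a, b, _ in toy_edges]
print(multi_hop_candidates)
```

Only the two closely related nodes end up connected, which is the kind of relationship a multi-hop query can then be built around.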

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [6]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [7]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/21 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/64 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node

Applying SummaryExtractor:   0%|          | 0/38 [00:00<?, ?it/s]

Property 'summary' already exists in node 'd2fdb6'. Skipping!
Property 'summary' already exists in node '2c7890'. Skipping!
Property 'summary' already exists in node '47b685'. Skipping!
Property 'summary' already exists in node '5c36d6'. Skipping!
Property 'summary' already exists in node '1961b7'. Skipping!
Property 'summary' already exists in node '8c38bb'. Skipping!
Property 'summary' already exists in node '669679'. Skipping!
Property 'summary' already exists in node 'a94345'. Skipping!
Property 'summary' already exists in node '89f0b8'. Skipping!
Property 'summary' already exists in node '1054ba'. Skipping!
Property 'summary' already exists in node '814b99'. Skipping!
Property 'summary' already exists in node '2c39eb'. Skipping!
Property 'summary' already exists in node 'e56ab1'. Skipping!
Property 'summary' already exists in node 'a051a9'. Skipping!
Property 'summary' already exists in node '7cc9c7'. Skipping!
Property 'summary' already exists in node '7e7d7a'. Skipping!
Property

Applying CustomNodeFilter:   0%|          | 0/8 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/48 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node '1961b7'. Skipping!
Property 'summary_embedding' already exists in node '814b99'. Skipping!
Property 'summary_embedding' already exists in node 'd2fdb6'. Skipping!
Property 'summary_embedding' already exists in node '47b685'. Skipping!
Property 'summary_embedding' already exists in node '2c39eb'. Skipping!
Property 'summary_embedding' already exists in node '2c7890'. Skipping!
Property 'summary_embedding' already exists in node 'a94345'. Skipping!
Property 'summary_embedding' already exists in node '669679'. Skipping!
Property 'summary_embedding' already exists in node 'e56ab1'. Skipping!
Property 'summary_embedding' already exists in node '89f0b8'. Skipping!
Property 'summary_embedding' already exists in node '8c38bb'. Skipping!
Property 'summary_embedding' already exists in node '5c36d6'. Skipping!
Property 'summary_embedding' already exists in node '1054ba'. Skipping!
Property 'summary_embedding' already exists in node '7cc9c7'. Sk

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [8]:
dataset.to_pandas()

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | Whaat do Korinek and Suh say abot the impact o... | [Introduction ChatGPT launched in November 202... | Korinek and Suh (2024) are referenced in the c... | single_hop_specifc_query_synthesizer |
| 1 | What are the total number of messages and the ... | [Month Non-Work (M) (%) Work (M) (%) Total Mes... | In Jun 2024, there were a total of 451 message... | single_hop_specifc_query_synthesizer |
| 2 | who Collis and Brynjolfsson say about ai and m... | [Table 1: ChatGPT daily message counts (millio... | Collis and Brynjolfsson (2025) use choice expe... | single_hop_specifc_query_synthesizer |
| 3 | What information does Figure 23 provide about ... | [Variation by Occupation Figure 23 presents va... | Figure 23 presents variation in ChatGPT usage ... | single_hop_specifc_query_synthesizer |
| 4 | Based on the provided message volume statistic... | [<1-hop>\n\nMonth Non-Work (M) (%) Work (M) (%... | Between June 2024 and June 2025, ChatGPT's tot... | multi_hop_abstract_query_synthesizer |
| 5 | What do the message volume statistics reveal a... | [<1-hop>\n\nMonth Non-Work (M) (%) Work (M) (%... | The message volume statistics show that ChatGP... | multi_hop_abstract_query_synthesizer |
| 6 | how chatgpt use different for work in jobs? wh... | [<1-hop>\n\nIntroduction ChatGPT launched in N... | chatgpt use for work is different by job. peop... | multi_hop_abstract_query_synthesizer |
| 7 | How has the rapid adoption and usage of ChatGP... | [<1-hop>\n\nIntroduction ChatGPT launched in N... | Since its launch in November 2022, ChatGPT has... | multi_hop_abstract_query_synthesizer |
| 8 | How did the volume and proportion of work vers... | [<1-hop>\n\nIntroduction ChatGPT launched in N... | Between Jun 2024 and Jun 2025, the total numbe... | multi_hop_specific_query_synthesizer |
| 9 | How has the rapid growth of ChatGPT in the US ... | [<1-hop>\n\nTable 1: ChatGPT daily message cou... | The rapid growth of ChatGPT in the US has led ... | multi_hop_specific_query_synthesizer |


In [42]:
import pprint

pprint.pprint(dataset)

Testset(samples=[TestsetSample(eval_sample=SingleTurnSample(user_input="How is Artificial Intelligence related to ChatGPT's underlying technology?", retrieved_contexts=['tion of ChatGPT and Generative AI more broadly,', 'The Growth of ChatGPT', 'that the work activities associated with ChatGPT'], reference_contexts=['Introduction ChatGPT launched in November 2022. By July 2025, 18 billion messages were being sent each week by 700 million users, representing around 10% of the global adult population.1 For a new technology, this speed of global diffusion has no precedent (Bick et al., 2024). This paper studies consumer usage of ChatGPT, the first mass-market chatbot and likely the largest.2 ChatGPT is based on a Large Language Model (LLM), a type of Artificial Intelligence (AI) developed over the last decade and generally considered to represent an acceleration in AI capabilities.3 The sudden growth in LLM abilities and adoption has intensified interest in the effects of artificial intel

## LangChain RAG

Now we'll construct our LangChain RAG, which we will be evaluating using the above created test data!

### R - Retrieval

Let's start with building our retrieval pipeline, which will involve loading the same data we used to create our synthetic test set above.

> NOTE: We need to use the same data - as our test set is specifically designed for this data.

In [43]:
path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

Now that we have our data loaded, let's split it into chunks!

In [44]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=0)
split_documents = text_splitter.split_documents(docs)
len(split_documents)

3122

#### ❓ Question: 

What is the purpose of the `chunk_overlap` parameter in the `RecursiveCharacterTextSplitter`?

The chunk_overlap parameter specifies the number of characters (or tokens) that are shared between consecutive chunks. This overlap helps preserve important context that might otherwise be lost when dividing text into smaller segments.
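
As a rough illustration of the overlap idea, here is a minimal sliding-window sketch. The `chunk_text` helper is hypothetical and much simpler than `RecursiveCharacterTextSplitter`, which also tries to split on separators such as paragraphs and sentences; the sketch only shows how consecutive chunks come to share characters.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    # Each new window starts `chunk_size - chunk_overlap` characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

sample = "abcdefghijklmnopqrst"  # 20 characters
no_overlap = chunk_text(sample, chunk_size=8, chunk_overlap=0)
with_overlap = chunk_text(sample, chunk_size=8, chunk_overlap=3)
print(no_overlap)    # ['abcdefgh', 'ijklmnop', 'qrst']
print(with_overlap)  # ['abcdefgh', 'fghijklm', 'klmnopqr', 'pqrst']
```

With an overlap of 3, the last three characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is less likely to be cut off from its context.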

Next up, we'll need to provide an embedding model that we can use to construct our vector store.

In [45]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

Now we can build our in-memory Qdrant vector store.

In [46]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="use_case_data",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="use_case_data",
    embedding=embeddings,
)

We can now add our documents to our vector store.

In [47]:
_ = vector_store.add_documents(documents=split_documents)

Let's define our retriever.

In [48]:
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

Now we can produce a node for retrieval!

In [49]:
def retrieve(state):
  retrieved_docs = retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

### Augmented

Let's create a simple RAG prompt!

In [50]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

### Generation

We'll also need an LLM to generate responses - we'll use `gpt-4.1-nano` to avoid using the same model as our judge model.

In [51]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-nano")

Then we can create a `generate` node!

In [52]:
def generate(state):
  docs_content = "\n\n".join(doc.page_content for doc in state["context"])
  messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
  response = llm.invoke(messages)
  return {"response" : response.content}

### Building RAG Graph with LangGraph

Let's create some state for our LangGraph RAG graph!

In [53]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

class State(TypedDict):
  question: str
  context: List[Document]
  response: str

Now we can build our simple graph!

> NOTE: We're using `add_sequence` since we will always move from retrieval to generation. This is essentially building a chain in LangGraph.

In [54]:
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
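
If it helps to see the data flow, a sequence of nodes can be sketched in plain Python as below. This is only an approximation of what the compiled graph does for this simple case (LangGraph itself also handles state schemas, reducers, and more); `run_sequence`, `fake_retrieve`, and `fake_generate` are hypothetical stand-ins, not LangGraph APIs.

```python
def run_sequence(nodes, initial_state):
    """Run nodes in order, merging each node's partial update into the state."""
    state = dict(initial_state)
    for node in nodes:
        state.update(node(state))
    return state

# Hypothetical stand-ins for the real `retrieve` and `generate` nodes.
def fake_retrieve(state):
    return {"context": [f"doc about {state['question']}"]}

def fake_generate(state):
    return {"response": f"Answer built from {len(state['context'])} document(s)."}

final_state = run_sequence([fake_retrieve, fake_generate], {"question": "chunking"})
print(final_state)
```

Each node only returns the keys it is responsible for, and the later node can read what the earlier one wrote.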

Let's do a test to make sure it's doing what we'd expect.

In [55]:
response = graph.invoke({"question" : "What are the different kinds of loans?"})

In [56]:
response["response"]

'The provided context does not mention or specify any kinds of loans.'

## Evaluating the App with Ragas

Now we can finally do our evaluation!

We'll start by running the queries we generated using SDG above through our application to get context and responses.

In [57]:
for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [27]:
dataset.samples[0].eval_sample.response

"The provided context mentions the relationship between ChatGPT and Generative AI but does not provide specific details about how Artificial Intelligence is related to ChatGPT's underlying technology."

Then we can convert that table into an `EvaluationDataset` which will make the process of evaluation smoother.

In [58]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

We'll need to select a judge model - in this case we're using `gpt-4.1-mini`, a smaller model from the same family as the `gpt-4.1` model that was used to generate our Synthetic Data.

In [59]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))

Next up - we simply evaluate on our desired metrics!

In [60]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

baseline_result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
baseline_result

Evaluating:   0%|          | 0/60 [00:00<?, ?it/s]

{'context_recall': 0.1375, 'faithfulness': 0.6592, 'factual_correctness': 0.2940, 'answer_relevancy': 0.2831, 'context_entity_recall': 0.1053, 'noise_sensitivity_relevant': 0.0000}

## Making Adjustments and Re-Evaluating

Now that we've got our baseline - let's make a change and see how the model improves or doesn't improve!

> NOTE: This will be using Cohere's Rerank model - please be sure to [sign-up for an API key!](https://docs.cohere.com/reference/about)

In [61]:
os.environ["COHERE_API_KEY"] = getpass("Please enter your Cohere API key!")


We'll first re-chunk our documents with a larger `chunk_size` (500, with a `chunk_overlap` of 30) and set our retriever to return more documents (`k=20`), which will allow us to take advantage of the reranking.

In [62]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=30)
split_documents = text_splitter.split_documents(docs)
len(split_documents)

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="use_case_data_new_chunks",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="use_case_data_new_chunks",
    embedding=embeddings,
)

_ = vector_store.add_documents(documents=split_documents)

adjusted_example_retriever = vector_store.as_retriever(search_kwargs={"k": 20})

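Reranking (exposed in LangChain as contextual compression) rescores the retrieved documents against the query with a dedicated reranker model and keeps only the highest-scoring ones, compressing the candidate set into a smaller set of documents.

This is essentially a slower, more accurate form of semantic similarity, which is why we only run it on the small subset of documents returned by the first-stage retriever.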

In [63]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

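def retrieve_adjusted(state):
  # NOTE: `search_kwargs` is not a documented ContextualCompressionRetriever field and
  # appears to have no effect here. The number of documents kept after reranking is
  # controlled by CohereRerank's `top_n` argument, which this cell leaves at its default.
  compressor = CohereRerank(model="rerank-v3.5")
  compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=adjusted_example_retriever, search_kwargs={"k": 5}
  )
  retrieved_docs = compression_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}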
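
The two-stage pattern (retrieve broadly, then rerank a short list) can be sketched without any external service. Everything below is a hypothetical stand-in: `word_overlap` plays the role of the fast vector search, and `phrase_score` plays the role of the slower, more precise reranker. It is not how Cohere's model scores documents.

```python
def word_overlap(query, doc):
    """Cheap first-stage score: number of query words that appear in the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def phrase_score(query, doc):
    """Stand-in for a slower, more precise scorer: exact phrase match first, then overlap."""
    return (query.lower() in doc.lower(), word_overlap(query, doc))

def retrieve_then_rerank(query, docs, k, top_n):
    # Stage 1: cast a wide net with the cheap scorer.
    candidates = sorted(docs, key=lambda d: word_overlap(query, d), reverse=True)[:k]
    # Stage 2: re-order only those candidates with the more expensive scorer.
    return sorted(candidates, key=lambda d: phrase_score(query, d), reverse=True)[:top_n]

corpus = [
    "message volume grew while weekly users doubled",
    "weekly message volume reached new highs",
    "users in many occupations adopted the tool",
    "volume of weekly message traffic",
]
top = retrieve_then_rerank("weekly message volume", corpus, k=3, top_n=1)
print(top)
```

Three documents tie on the cheap score, and the second stage is what promotes the one that actually contains the queried phrase.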

We can simply rebuild our graph with the new retriever!

In [64]:
class AdjustedState(TypedDict):
  question: str
  context: List[Document]
  response: str

adjusted_graph_builder = StateGraph(AdjustedState).add_sequence([retrieve_adjusted, generate])
adjusted_graph_builder.add_edge(START, "retrieve_adjusted")
adjusted_graph = adjusted_graph_builder.compile()

In [65]:
response = adjusted_graph.invoke({"question" : "What are the different kinds of loans?"})
response["response"]

'The provided context does not mention or describe the different kinds of loans.'

In [66]:
import time
import copy

rerank_dataset = copy.deepcopy(dataset)

for test_row in rerank_dataset:
  response = adjusted_graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
  time.sleep(3) # To try to avoid rate limiting.

In [67]:
rerank_dataset.samples[0].eval_sample.response

"Artificial Intelligence is related to ChatGPT's underlying technology because ChatGPT is based on a Large Language Model (LLM), which is a type of Artificial Intelligence (AI)."

In [68]:
rerank_evaluation_dataset = EvaluationDataset.from_pandas(rerank_dataset.to_pandas())

In [69]:
rerank_result = evaluate(
    dataset=rerank_evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
rerank_result

Evaluating:   0%|          | 0/60 [00:00<?, ?it/s]

{'context_recall': 0.6756, 'faithfulness': 0.8517, 'factual_correctness': 0.6110, 'answer_relevancy': 0.9444, 'context_entity_recall': 0.3712, 'noise_sensitivity_relevant': 0.1112}

#### ❓ Question: 

Which system performed better, on what metrics, and why?


# RAG Evaluation Comparison: With vs Without Re-ranker

| Metric | Without Re-ranker | With Re-ranker | Delta | 
|--------|------------------|----------------|-------|
| **context_recall** | 0.1375 | 0.6756 | +0.5381 | 
| **faithfulness** | 0.6592 | 0.8517 | +0.1925 | 
| **factual_correctness** | 0.2940 | 0.6110 | +0.3170 | 
| **answer_relevancy** | 0.2831 | 0.9444 | +0.6613 |
| **context_entity_recall** | 0.1053 | 0.3712 | +0.2659 | 
| **noise_sensitivity_relevant** | 0.0000 | 0.1112 | +0.1112 | 
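
For reference, the Delta column and the relative changes can be recomputed from the two result dictionaries with a few lines of Python. The `compare` helper is hypothetical (not part of Ragas), and the scores are copied from the two `evaluate` outputs above.

```python
baseline = {
    "context_recall": 0.1375, "faithfulness": 0.6592, "factual_correctness": 0.2940,
    "answer_relevancy": 0.2831, "context_entity_recall": 0.1053,
    "noise_sensitivity_relevant": 0.0000,
}
adjusted = {
    "context_recall": 0.6756, "faithfulness": 0.8517, "factual_correctness": 0.6110,
    "answer_relevancy": 0.9444, "context_entity_recall": 0.3712,
    "noise_sensitivity_relevant": 0.1112,
}

def compare(before, after):
    """Return {metric: (absolute delta, relative change in %)} for each metric."""
    rows = {}
    for metric, old in before.items():
        delta = round(after[metric] - old, 4)
        # A relative change is undefined when the starting score is 0.
        pct = round(100 * delta / old, 1) if old else None
        rows[metric] = (delta, pct)
    return rows

comparison = compare(baseline, adjusted)
for metric, (delta, pct) in comparison.items():
    pct_text = "n/a" if pct is None else f"{pct:+.1f}%"
    print(f"{metric:28s} {delta:+.4f} {pct_text}")
```

Since Ragas scores are best read directionally, the sign and rough size of each delta matter more than the exact values.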

## Key Insights

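- **Most Improved**: `context_recall` (+391.3%) and `answer_relevancy` (+233.6%) show the largest relative gains
- **Consistent Improvement**: Every higher-is-better metric (`context_recall`, `faithfulness`, `factual_correctness`, `answer_relevancy`, `context_entity_recall`) improved with the adjusted pipeline
- **One Regression**: `noise_sensitivity_relevant` rose from 0 to 0.1112. Ragas treats this metric as lower-is-better, so this is a slight regression rather than a gain. The baseline's 0 is plausibly an artifact of it frequently declining to answer, which leaves few claims to get wrong
- **Overall Performance**: The adjusted system performs substantially better on five of the six metrics. Note that it changed two things at once, the chunking (`chunk_size` 50 → 500, `chunk_overlap` 0 → 30) and the addition of the re-ranker, so the gains cannot be attributed to the re-ranker alone

The re-ranker essentially acts as a "quality filter" that passes only the most relevant of the 20 retrieved chunks on to answer generation, while the larger chunks give each retrieved passage enough context to be useful. Together these lead to more accurate, faithful, and relevant responses.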