# Using Ragas to Evaluate a RAG Application built with LangChain and LangGraph

In the following notebook, we'll be looking at how [Ragas](https://github.com/explodinggradients/ragas) can be helpful in a number of ways when looking to evaluate your RAG applications!

While this example is rooted in LangChain/LangGraph - Ragas is framework agnostic (you don't even need to be using a framework!).

- 🤝 Breakout Room #1
  1. Task 1: Installing Required Libraries
  2. Task 2: Set Environment Variables
  3. Task 3: Synthetic Dataset Generation for Evaluation using Ragas
  4. Task 4: Evaluating our Pipeline with Ragas
  5. Task 6: Making Adjustments and Re-Evaluating

But first! Let's set some dependencies!

## Dependencies and API Keys:

> NOTE: Please skip the pip install commands if you are running the notebook locally.

In [13]:
!pip install -qU ragas==0.2.10

In [14]:
!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0

We'll also need to provide our API keys.

First, OpenAI's for our LLM/embedding model combination!

In [9]:
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Please enter your OpenAI API key!")

OPTIONALLY:

We can also provide a Ragas API key - which you can sign-up for [here](https://app.ragas.io/).

In [10]:
os.environ["RAGAS_APP_TOKEN"] = getpass("Please enter your Ragas API key!")

## Generating Synthetic Test Data

We wil be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it will allow us to find out how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data - and download our webpages which we'll be using for our data today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [11]:
!mkdir data

mkdir: data: File exists


In [12]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 31392    0 31392    0     0   162k      0 --:--:-- --:--:-- --:--:--  163k


In [13]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 70292    0 70292    0     0  1267k      0 --:--:-- --:--:-- --:--:-- 1271k


Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [14]:
from langchain_community.document_loaders import DirectoryLoader
import nltk

# nltk.download('punkt_tab')
# nltk.download('averaged_perceptron_tagger_eng')

path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Abstracted SDG

The above method is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [15]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [16]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/26 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [17]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What Stability AI do in LLMs?,[Code may be the best application The ethics o...,Stability AI is one of the organizations that ...,single_hop_specifc_query_synthesizer
1,Wut wuz the signifcance of the term 'September...,[Based Development As a computer scientist and...,The term 'September' is significant because it...,single_hop_specifc_query_synthesizer
2,"Wht are the key advancemnts in AI, particulrly...",[Simon Willison’s Weblog Subscribe Stuff we fi...,2023 was the breakthrough year for Large Langu...,single_hop_specifc_query_synthesizer
3,Wht is the Plausible analytics data showing ab...,[easy to follow. The rest of the document incl...,The Plausible analytics data shows that AI-rel...,single_hop_specifc_query_synthesizer
4,What are the challenges and implications of us...,[<1-hop>\n\nCode may be the best application T...,The use of Large Language Models (LLMs) as bla...,multi_hop_abstract_query_synthesizer
5,How have advancements in model training costs ...,[<1-hop>\n\nCode may be the best application T...,Advancements in model training costs have sign...,multi_hop_abstract_query_synthesizer
6,What are the ethical concerns associated with ...,[<1-hop>\n\nCode may be the best application T...,The ethical concerns associated with the train...,multi_hop_abstract_query_synthesizer
7,How do the challenges of understanding and con...,[<1-hop>\n\nCode may be the best application T...,The challenges of understanding and controllin...,multi_hop_abstract_query_synthesizer
8,What were the key advancements in Large Langua...,[<1-hop>\n\nCode may be the best application T...,"In 2023, significant advancements in Large Lan...",multi_hop_specific_query_synthesizer
9,How has Meta's Llama model contributed to the ...,[<1-hop>\n\nfor a model to follow the resultin...,Meta's Llama model has significantly contribut...,multi_hop_specific_query_synthesizer


#### OPTIONAL:

If you've provided your Ragas API key - you can use this web interface to look at the created data!

In [18]:
dataset.upload()

Testset uploaded! View at https://app.ragas.io/dashboard/alignment/testset/083da5aa-170d-46f8-a862-812f5b4df7c9


'https://app.ragas.io/dashboard/alignment/testset/083da5aa-170d-46f8-a862-812f5b4df7c9'

## LangChain RAG

Now we'll construct our LangChain RAG, which we will be evaluating using the above created test data!

### R - Retrieval

Let's start with building our retrieval pipeline, which will involve loading the same data we used to create our synthetic test set above.

> NOTE: We need to use the same data - as our test set is specifically designed for this data.

In [19]:
path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

Now that we have our data loaded, let's split it into chunks!

In [20]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_documents = text_splitter.split_documents(docs)
len(split_documents)

73

#### ❓ Question: 

What is the purpose of the `chunk_overlap` parameter in the `RecursiveCharacterTextSplitter`?

#### Answer

The chunk_overlap parameter (set to 200 in this case) determines how many characters should overlap between consecutive chunks when splitting documents. This overlap is important because:

It helps maintain context between chunks, preventing important information from being cut off at chunk boundaries
It ensures that related information that might span a chunk boundary isn't lost
It improves retrieval quality by giving multiple opportunities to match content that might be split across chunks
It helps preserve semantic coherence when the text is split

Next up, we'll need to provide an embedding model that we can use to construct our vector store.

In [21]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

Now we can build our in memory QDrant vector store.

In [22]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="ai_across_years",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="ai_across_years",
    embedding=embeddings,
)

We can now add our documents to our vector store.

In [23]:
_ = vector_store.add_documents(documents=split_documents)

Let's define our retriever.

In [24]:
retriever = vector_store.as_retriever(search_kwargs={"k": 5})

Now we can produce a node for retrieval!

In [25]:
def retrieve(state):
  retrieved_docs = retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

### Augmented

Let's create a simple RAG prompt!

In [26]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

### Generation

We'll also need an LLM to generate responses - we'll use `gpt-4o-mini` to avoid using the same model as our judge model.

In [27]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

Then we can create a `generate` node!

In [28]:
def generate(state):
  docs_content = "\n\n".join(doc.page_content for doc in state["context"])
  messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
  response = llm.invoke(messages)
  return {"response" : response.content}

### Building RAG Graph with LangGraph

Let's create some state for our LangGraph RAG graph!

In [29]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

class State(TypedDict):
  question: str
  context: List[Document]
  response: str

Now we can build our simple graph!

> NOTE: We're using `add_sequence` since we will always move from retrieval to generation. This is essentially building a chain in LangGraph.

In [30]:
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

Let's do a test to make sure it's doing what we'd expect.

In [31]:
response = graph.invoke({"question" : "How are LLM agents useful?"})

In [32]:
response["response"]

'LLM agents can be useful in several ways, primarily in automating tasks and assisting users in problem-solving. They can act on behalf of users similarly to a travel agent, helping with decisions and actions. Furthermore, LLMs can be equipped with tools to run processes in a loop, making them capable of solving more complex problems. \n\nDespite their potential, there is skepticism regarding their utility, particularly due to their tendency to believe and propagate inaccurate information. This raises concerns about their effectiveness in making meaningful decisions, as they may struggle to distinguish between truth and fiction. \n\nOn a technical level, LLMs are relatively easy to build, requiring only a few hundred lines of code and a substantial amount of quality training data. This accessibility allows more people to experiment with LLMs, even if training them still requires significant resources.\n\nMoreover, LLMs can be run on personal devices, making them more accessible than pr

## Evaluating the App with Ragas

Now we can finally do our evaluation!

We'll start by running the queries we generated usign SDG above through our application to get context and responses.

In [33]:
for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [34]:
dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,What Stability AI do in LLMs?,[So training an LLM still isn’t something a ho...,[Code may be the best application The ethics o...,Stability AI is one of the organizations that ...,Stability AI is one of the organizations that ...,single_hop_specifc_query_synthesizer
1,Wut wuz the signifcance of the term 'September...,"[29th: You can now run prompts against images,...",[Based Development As a computer scientist and...,The provided context does not contain any spec...,The term 'September' is significant because it...,single_hop_specifc_query_synthesizer
2,"Wht are the key advancemnts in AI, particulrly...",[OpenAI are not the only game in town here. Go...,[Simon Willison’s Weblog Subscribe Stuff we fi...,"In 2023, several key advancements in Large Lan...",2023 was the breakthrough year for Large Langu...,single_hop_specifc_query_synthesizer
3,Wht is the Plausible analytics data showing ab...,"[The top five: ai (342), generativeai (300), l...",[easy to follow. The rest of the document incl...,The Plausible analytics data shows that AI-rel...,The Plausible analytics data shows that AI-rel...,single_hop_specifc_query_synthesizer
4,What are the challenges and implications of us...,[Another common technique is to use larger mod...,[<1-hop>\n\nCode may be the best application T...,The challenges and implications of using Large...,The use of Large Language Models (LLMs) as bla...,multi_hop_abstract_query_synthesizer
5,How have advancements in model training costs ...,[OpenAI are not the only game in town here. Go...,[<1-hop>\n\nCode may be the best application T...,Advancements in model training costs have sign...,Advancements in model training costs have sign...,multi_hop_abstract_query_synthesizer
6,What are the ethical concerns associated with ...,"[Since then, almost every major LLM (and most ...",[<1-hop>\n\nCode may be the best application T...,The ethical concerns associated with the train...,The ethical concerns associated with the train...,multi_hop_abstract_query_synthesizer
7,How do the challenges of understanding and con...,[Code may be the best application\n\nThe ethic...,[<1-hop>\n\nCode may be the best application T...,The challenges associated with understanding a...,The challenges of understanding and controllin...,multi_hop_abstract_query_synthesizer
8,What were the key advancements in Large Langua...,[Training a GPT-4 beating model was a huge dea...,[<1-hop>\n\nCode may be the best application T...,"In 2023, several key advancements in Large Lan...","In 2023, significant advancements in Large Lan...",multi_hop_specific_query_synthesizer
9,How has Meta's Llama model contributed to the ...,[I wrote about how Large language models are h...,[<1-hop>\n\nfor a model to follow the resultin...,Meta's Llama model has significantly contribut...,Meta's Llama model has significantly contribut...,multi_hop_specific_query_synthesizer


Then we can convert that table into a `EvaluationDataset` which will make the process of evaluation smoother.

In [35]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

We'll need to select a judge model - in this case we're using the same model that was used to generate our Synthetic Data.

In [36]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

Next up - we simply evaluate on our desired metrics!

In [37]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

Exception raised in Job[22]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-4vKkvHgki0O52P13nyzZJ8RE on tokens per min (TPM): Limit 30000, Used 29838, Requested 1704. Please try again in 3.084s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[25]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-4vKkvHgki0O52P13nyzZJ8RE on tokens per min (TPM): Limit 30000, Used 29800, Requested 1940. Please try again in 3.48s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[1]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-4vKkvHgki0O52P13nyzZJ8RE on tokens per min (TPM): Limit 30000, Used 29179, Requested 

{'context_recall': 0.6429, 'faithfulness': 0.8132, 'factual_correctness': 0.4145, 'answer_relevancy': 0.7959, 'context_entity_recall': 0.4509, 'noise_sensitivity_relevant': 0.2901}

## Making Adjustments and Re-Evaluating

Now that we've got our baseline - let's make a change and see how the model improves or doesn't improve!

> NOTE: This will be using Cohere's Rerank model (which was updated fairly [recently](https://docs.cohere.com/v2/changelog/rerank-v3.5)) - please be sure to [sign-up for an API key!](https://docs.cohere.com/reference/about)

In [38]:
os.environ["COHERE_API_KEY"] = getpass("Please enter your Cohere API key!")

In [34]:
#!pip install -qU cohere langchain_cohere


We'll first set our retriever to return more documents, which will allow us to take advantage of the reranking.

In [39]:
retriever = vector_store.as_retriever(search_kwargs={"k": 20})

Reranking, or contextual compression, is a technique that uses a reranker to compress the retrieved documents into a smaller set of documents.

This is essentially a slower, more accurate form of semantic similarity that we use on a smaller subset of our documents.

In [40]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

def retrieve_adjusted(state):
  compressor = CohereRerank(model="rerank-v3.5")
  compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever, search_kwargs={"k": 5}
  )
  retrieved_docs = compression_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

We can simply rebuild our graph with the new retriever!

In [41]:
class State(TypedDict):
  question: str
  context: List[Document]
  response: str

graph_builder = StateGraph(State).add_sequence([retrieve_adjusted, generate])
graph_builder.add_edge(START, "retrieve_adjusted")
graph = graph_builder.compile()

In [42]:
response = graph.invoke({"question" : "How are LLM agents useful?"})
response["response"]

'LLM agents can be useful in specific contexts, particularly in areas like coding, where they demonstrate a strong capability to perform tasks effectively. The grammar rules of programming languages are simpler than those of natural languages, which may contribute to their effectiveness in writing code. However, there is skepticism surrounding their broader utility, primarily due to concerns about their inability to distinguish truth from fiction, referred to as "gullibility." This limitation raises questions about the reliability of LLM agents in making meaningful decisions on behalf of users.\n\nThe excitement around AI agents is often tied to their potential to act autonomously, but practical implementations in production are still limited, with many prototypes not yet materializing into effective solutions. Critics highlight the need for better scrutiny of LLMs, considering their environmental impact, ethical concerns, and reliability issues. Overall, while LLM agents show promise,

In [44]:
import time

for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
  time.sleep(2) # To try to avoid rate limiting.

In [45]:
result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

Exception raised in Job[19]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-4vKkvHgki0O52P13nyzZJ8RE on tokens per min (TPM): Limit 30000, Used 29286, Requested 2075. Please try again in 2.722s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[7]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-4vKkvHgki0O52P13nyzZJ8RE on tokens per min (TPM): Limit 30000, Used 29966, Requested 2317. Please try again in 4.566s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[16]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-4vKkvHgki0O52P13nyzZJ8RE on tokens per min (TPM): Limit 30000, Used 29464, Requested

{'context_recall': 0.6667, 'faithfulness': 0.8974, 'factual_correctness': 0.4455, 'answer_relevancy': 0.7952, 'context_entity_recall': 0.3425, 'noise_sensitivity_relevant': 0.3325}

#### ❓ Question: 

Which system performed better, on what metrics, and why?

#### Answer

The reranking-enhanced system outperformed the baseline on several key metrics, while showing mixed results on others:
Improvements:

Context Recall: Increased from 0.6429 to 0.6667 (+3.7%)
Faithfulness: Improved significantly from 0.8132 to 0.8974 (+10.4%)
Factual Correctness: Improved from 0.4145 to 0.4455 (+7.5%)
Noise Sensitivity (relevant): Increased from 0.2901 to 0.3325 (+14.6%)

Slight Decrease:

Answer Relevancy: Minimal decrease from 0.7959 to 0.7952 (-0.1%)

Noticeable Decrease:

Context Entity Recall: Decreased from 0.4509 to 0.3425 (-24.0%)

The reranking approach performed better overall because it significantly improved the system's ability to remain faithful to the retrieved context and increased factual correctness. The improvement in context recall indicates that the reranker was more effective at identifying truly relevant passages from the initial larger set of 20 documents.
The increased noise sensitivity score suggests the reranker made the system more resilient to irrelevant information, likely because the Cohere Rerank v3.5 model is better at distinguishing relevant from irrelevant context compared to pure vector similarity.
The decrease in context entity recall is interesting and suggests that while the reranker improved overall context quality, it may have prioritized passages with fewer but more relevant entities rather than passages with higher entity density. This trade-off resulted in better faithfulness and factual correctness despite fewer entities being captured.
The virtually unchanged answer relevancy suggests that both approaches produced similarly relevant answers to the questions, indicating that even with different retrieval mechanisms, the response generation remained consistent in addressing the query intent.