# [RAGAS] Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up QDrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

We'll also provide our OpenAI key, as well as our Cohere API key.

In [1]:
import gc
import pandas as pd

In [2]:
from dotenv import load_dotenv

load_dotenv(dotenv_path="../.env")

True

In [3]:
def check_if_env_var_is_set(env_var_name: str, human_readable_string: str = "API Key"):
    api_key = os.getenv(env_var_name)

    if api_key:
        print(f"{env_var_name} is present")
    else:
        print(f"{env_var_name} is NOT present, paste key at the prompt:")
        os.environ[env_var_name] = getpass.getpass(f"Please enter your {human_readable_string}: ")

In [4]:
import os
import getpass

# os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

check_if_env_var_is_set("OPENAI_API_KEY", "OpenAI API key")

OPENAI_API_KEY is present


In [5]:
# os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

check_if_env_var_is_set("COHERE_API_KEY", "Cohere API key")

COHERE_API_KEY is present


## Task 2: Data Collection and Preparation

We'll be using our Loan Data once again - this time the structured data available through the CSV!

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

In [6]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

loader = CSVLoader(
    file_path="./data/complaints.csv",
    metadata_columns=[
      "Date received", 
      "Product", 
      "Sub-product", 
      "Issue", 
      "Sub-issue", 
      "Consumer complaint narrative", 
      "Company public response", 
      "Company", 
      "State", 
      "ZIP code", 
      "Tags", 
      "Consumer consent provided?", 
      "Submitted via", 
      "Date sent to company", 
      "Company response to consumer", 
      "Timely response?", 
      "Consumer disputed?", 
      "Complaint ID"
    ]
)

loan_complaint_data = loader.load()

for doc in loan_complaint_data:
    doc.page_content = doc.metadata["Consumer complaint narrative"]

Let's look at an example document to see if everything worked as expected!

In [7]:
loan_complaint_data[0]

Document(metadata={'source': './data/complaints.csv', 'row': 0, 'Date received': '03/27/25', 'Product': 'Student loan', 'Sub-product': 'Federal student loan servicing', 'Issue': 'Dealing with your lender or servicer', 'Sub-issue': 'Trouble with how payments are being handled', 'Consumer complaint narrative': "The federal student loan COVID-19 forbearance program ended in XX/XX/XXXX. However, payments were not re-amortized on my federal student loans currently serviced by Nelnet until very recently. The new payment amount that is effective starting with the XX/XX/XXXX payment will nearly double my payment from {$180.00} per month to {$360.00} per month. I'm fortunate that my current financial position allows me to be able to handle the increased payment amount, but I am sure there are likely many borrowers who are not in the same position. The re-amortization should have occurred once the forbearance ended to reduce the impact to borrowers.", 'Company public response': 'None', 'Company'

In [8]:
gc.collect()

30

## Task 3: Setting up QDrant!

Now that we have our documents, let's create a QDrant VectorStore with the collection name "LoanComplaints".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [9]:
from langchain_community.vectorstores import Qdrant
from qdrant_client import QdrantClient, models
from langchain_qdrant import QdrantVectorStore
from langchain_openai import OpenAIEmbeddings

small_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    loan_complaint_data,
    small_embeddings,
    location=":memory:",
    collection_name="LoanComplaints"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply treat each complaint as a document and use cosine similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later
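
To make "cosine similarity plus top-`k`" concrete, here is a minimal pure-Python sketch. It is not what Qdrant does internally; the helper names (`cosine_similarity`, `top_k`) and the 3-dimensional toy vectors are assumptions made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=10):
    """Return (index, score) pairs for the k most similar document vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (hypothetical values, not real model output).
toy_doc_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
toy_query_vec = [1.0, 0.1, 0.0]
toy_hits = top_k(toy_query_vec, toy_doc_vecs, k=2)
```

A real vector store replaces the linear scan with an index, but the ranking idea is the same.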

In [10]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [11]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [12]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano",                       
    temperature=0.1,      # Lower temperature for more consistent outputs
    request_timeout=120   # Longer timeout for complex operations
)

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [13]:
%%time
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the naive_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

CPU times: user 20.4 ms, sys: 3.48 ms, total: 23.9 ms
Wall time: 27.5 ms


Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [14]:
%%time
# naive_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

CPU times: user 5 μs, sys: 1e+03 ns, total: 6 μs
Wall time: 11.2 μs


In [15]:
%%time
# naive_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

CPU times: user 6 μs, sys: 2 μs, total: 8 μs
Wall time: 13.1 μs


In [16]:
%%time
# naive_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

CPU times: user 3 μs, sys: 1 μs, total: 4 μs
Wall time: 6.68 μs


Overall, this is not bad! Let's see if we can make it better!

In [17]:
gc.collect()

48

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.
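
As a rough illustration of the word-matching idea, here is a textbook Okapi BM25 scorer in plain Python. This is a sketch under assumptions: whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, and a smoothed IDF. It is not the implementation behind `BM25Retriever`, and the toy documents are invented.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a larger weight than common ones.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates and is normalized by document length.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Invented toy documents for illustration.
toy_docs = [
    "my loan payment doubled after forbearance ended",
    "the servicer never answered my phone calls",
    "payment was applied to the wrong loan",
]
toy_scores = bm25_scores("loan payment", toy_docs)
```

Documents that share no words with the query score exactly zero, which is the key difference from embedding-based retrieval.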

This retriever is very straightforward to set up! Let's see it happen down below!


In [18]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(loan_complaint_data)

We'll construct the same chain - only changing the retriever.

In [19]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [20]:
%%time
# bm25_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

CPU times: user 6 μs, sys: 0 ns, total: 6 μs
Wall time: 9.3 μs


It's not clear whether this is better or worse. If only we had a way to test this! (SPOILERS: We do, and the second half of the notebook will cover it.)

In [21]:
gc.collect()

0

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.
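
The two steps above can be sketched without any external service. In this hedged example, both scoring functions are toy stand-ins: `word_overlap` plays the cheap first-stage retriever and `phrase_match` is a hypothetical placeholder for a real reranking model. All names and documents are assumptions for illustration.

```python
def retrieve_then_rerank(query, docs, first_stage_score, rerank_score, k=10, top_n=3):
    """Keep the k best docs by a cheap score, then re-score only those and keep top_n."""
    candidates = sorted(docs, key=lambda d: first_stage_score(query, d), reverse=True)[:k]
    reranked = sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
    return reranked[:top_n]


def word_overlap(query, doc):
    # Cheap first stage: how many query words appear anywhere in the document.
    return len(set(query.lower().split()) & set(doc.lower().split()))


def phrase_match(query, doc):
    # Hypothetical stand-in for a reranking model: rewards the exact phrase.
    return 1.0 if query.lower() in doc.lower() else 0.0


# Invented toy documents for illustration.
toy_docs = [
    "fee notice arrived late",
    "I was charged a late fee twice",
    "the weather is nice",
    "late payment reported in error",
]
toy_top = retrieve_then_rerank("late fee", toy_docs, word_overlap, phrase_match, k=3, top_n=1)
```

The first stage cannot tell the first two documents apart (both contain "late" and "fee"); the second stage can, because it looks at the query and document together.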

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [22]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

Let's create our chain again, and see how this does!

In [23]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [24]:
%%time
# contextual_compression_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

CPU times: user 57 μs, sys: 12 μs, total: 69 μs
Wall time: 77 μs


We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

In [25]:
gc.collect()

40

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context
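
The three steps above can be sketched in a few lines. Everything here is a stand-in: `fake_generate_queries` replaces the LLM call with fixed rewrites, and `fake_retrieve` is a keyword lookup over an invented index, so treat this as an illustration of the control flow rather than of `MultiQueryRetriever` itself.

```python
def multi_query_retrieve(question, generate_queries, retrieve):
    """Retrieve for every generated query and keep each document once, in first-seen order."""
    seen = set()
    unique_docs = []
    for query in generate_queries(question):
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs


def fake_generate_queries(question):
    # Hypothetical stand-in for the LLM that rewrites the user question.
    return [
        "most common loan issue",
        "most common loan problem",
        "most common loan complaint",
    ]


# Invented keyword index: keyword -> documents that mention it.
TOY_INDEX = {
    "issue": ["doc A", "doc B"],
    "problem": ["doc B", "doc C"],
    "complaint": ["doc D"],
}


def fake_retrieve(query):
    # Hypothetical stand-in for a retriever: return docs whose keyword is in the query.
    return [doc for keyword, docs in TOY_INDEX.items() if keyword in query for doc in docs]


toy_context = multi_query_retrieve(
    "What is the most common issue with loans?", fake_generate_queries, fake_retrieve
)
```

Note that "doc B" is returned by two of the rewritten queries but appears only once in the final context.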

So, how hard is it to set up? Not bad! Let's see it down below!



In [26]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)

In [27]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [28]:
%%time
# multi_query_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

CPU times: user 4 μs, sys: 1 μs, total: 5 μs
Wall time: 7.63 μs


In [29]:
gc.collect()

0

## Task 8: Parent Document Retriever

The Parent Document Retriever follows a "small-to-big" strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents
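
Here is a minimal sketch of "search small, return big" using plain dictionaries and lists. It assumes word-based chunking and a toy word-overlap score in place of embeddings; the function names and example texts are invented for illustration and are not the `ParentDocumentRetriever` internals.

```python
def build_index(parent_docs, chunk_words=4):
    """Store full parents in a docstore and small child chunks that point back to them."""
    docstore = {}      # parent_id -> full parent text
    child_chunks = []  # (child_text, parent_id)
    for parent_id, text in enumerate(parent_docs):
        docstore[parent_id] = text
        words = text.split()
        for start in range(0, len(words), chunk_words):
            child_chunks.append((" ".join(words[start:start + chunk_words]), parent_id))
    return docstore, child_chunks


def word_overlap(query, text):
    # Toy similarity standing in for an embedding comparison.
    return len(set(query.lower().split()) & set(text.lower().split()))


def search_small_return_big(query, docstore, child_chunks, k=1):
    """Rank child chunks, then return the de-duplicated parents of the best k."""
    best = sorted(child_chunks, key=lambda c: word_overlap(query, c[0]), reverse=True)[:k]
    parent_ids = list(dict.fromkeys(parent_id for _, parent_id in best))
    return [docstore[parent_id] for parent_id in parent_ids]


# Invented toy parents for illustration.
toy_parents = [
    "the servicer doubled my payment without any notice at all",
    "my credit report shows a loan that I never opened",
]
toy_docstore, toy_children = build_index(toy_parents)
toy_result = search_small_return_big("credit report", toy_docstore, toy_children)
```

The match happens on the short chunk "my credit report shows", but the caller gets the whole complaint back.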

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [30]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

parent_docs = loan_complaint_data
child_splitter = RecursiveCharacterTextSplitter(chunk_size=750)

We'll need to set up a new QDrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [31]:
vectorstore.client.create_collection(
  collection_name="full_documents",
  vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = Qdrant(
  client=vectorstore.client,     # ✅ Reuse existing client
  embeddings=small_embeddings,         # ✅ Reuse embeddings
  collection_name="full_documents"
)


Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [32]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [33]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [34]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [35]:
%%time
# parent_document_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

CPU times: user 3 μs, sys: 1 μs, total: 4 μs
Wall time: 6.68 μs


Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to more effectively answer the question about the performance.

In [36]:
gc.collect()

201

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever simply takes 2, or more, retrievers and combines their retrieved documents based on a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.
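
To show what weighted rank fusion does with those weights, here is a small sketch of weighted Reciprocal Rank Fusion. The constant `c=60` is the value commonly used with RRF and is an assumption here, as are the function name and the toy rankings; this is not a copy of the `EnsembleRetriever` source.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several rankings: each doc earns weight / (rank + c) from every list it appears in."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)


# Invented toy rankings from two hypothetical retrievers.
toy_rankings = [["A", "B", "C"], ["B", "C", "D"]]
toy_fused = weighted_rrf(toy_rankings, weights=[0.5, 0.5])
```

Documents that several retrievers agree on ("B" and "C") rise above documents that only one retriever ranked highly ("A").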

In [37]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.

In [38]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [39]:
%%time
# ensemble_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

CPU times: user 4 μs, sys: 1e+03 ns, total: 5 μs
Wall time: 8.34 μs


In [40]:
gc.collect()

0

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of several [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

The `breakpoint_threshold_type` parameter controls when the semantic chunker creates chunk boundaries based on embedding similarity between sentences:

**Four Threshold Types:**

1. _"percentile" (default)_
- Splits when sentence embedding distance exceeds the 95th percentile of all distances
- Effect: Creates chunks at the most semantically distinct boundaries
- Behavior: More conservative splitting, larger chunks

2. _"standard_deviation"_
- Splits when distance exceeds 3 standard deviations from mean
- Effect: Better predictable performance, especially for normally distributed content
- Behavior: More consistent chunk sizes

3. _"interquartile"_
- Uses IQR * 1.5 scaling factor to determine breakpoints
- Effect: Middle-ground approach, robust to outliers
- Behavior: Balanced chunk distribution

4. _"gradient"_
- Detects anomalies in embedding distance gradients
- Effect: Best for domain-specific/highly correlated content
- Behavior: Finds subtle semantic transitions

**Impact:** _The threshold type determines sensitivity to semantic changes - more sensitive types create smaller, more focused chunks while less sensitive types create larger, more comprehensive chunks._

We'll use the `percentile` thresholding method for this example. It calculates the distances between all neighboring sentences, then breaks a sequence of sentences apart wherever the distance exceeds a given percentile of all distances.
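
The percentile rule can be sketched in plain Python. This is a simplified illustration: the distances are supplied directly (made-up numbers) instead of being computed from embeddings, and the helper names are assumptions rather than `SemanticChunker` internals.

```python
def percentile(values, pct):
    """Linearly interpolated percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * pct / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def split_at_percentile(sentences, distances, pct=95):
    """distances[i] is the distance between sentences[i] and sentences[i + 1]."""
    threshold = percentile(distances, pct)
    chunks = []
    current = [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            # Large semantic jump: close the current chunk and start a new one.
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Invented sentences and distances for illustration.
toy_sentences = [
    "My payment doubled.",
    "Nobody warned me.",
    "Separately, my credit report is wrong.",
    "It lists a loan I never opened.",
]
toy_distances = [0.10, 0.80, 0.15]
toy_chunks = split_at_percentile(toy_sentences, toy_distances)
```

Only the jump between the second and third sentence exceeds the threshold, so the text is split once, at the topic change.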

In [41]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    small_embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [42]:
%%time
semantic_documents = semantic_chunker.split_documents(loan_complaint_data[:20])

CPU times: user 271 ms, sys: 30.8 ms, total: 302 ms
Wall time: 9.18 s


Let's create a new vector store.

In [43]:
vectorstore.client.create_collection(
  collection_name="Loan_Complaint_Data_Semantic_Chunks",
  vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

semantic_vectorstore = Qdrant(
  client=vectorstore.client,     # ✅ Reuse existing client
  embeddings=small_embeddings,         # ✅ Reuse embeddings
  collection_name="Loan_Complaint_Data_Semantic_Chunks"
)

# Add documents after creation
_ = semantic_vectorstore.add_documents(semantic_documents)

We'll use naive retrieval for this example.

In [44]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [45]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [46]:
# semantic_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

In [47]:
gc.collect()

520

# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against each other.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparison
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.
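
If you want a quick local latency number alongside what LangSmith reports, a simple timing helper is one option. This is a sketch under assumptions: `time_calls` is a made-up helper, and `fake_chain` is a hypothetical stand-in for calling `.invoke(...)` on one of the chains above.

```python
import statistics
import time


def time_calls(fn, inputs):
    """Call fn on each input and return per-call wall-clock latencies in seconds."""
    latencies = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        latencies.append(time.perf_counter() - start)
    return latencies


def fake_chain(payload):
    # Hypothetical stand-in for `some_chain.invoke(payload)`.
    time.sleep(0.01)
    return {"response": "stub"}


toy_latencies = time_calls(fake_chain, [{"question": "q1"}, {"question": "q2"}])
toy_mean_latency = statistics.fmean(toy_latencies)
```

With a real chain you would pass something like `naive_retrieval_chain.invoke` as `fn` and the test questions as `inputs`; network variance means a handful of calls gives only a rough estimate.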

In [48]:
### YOUR CODE HERE

In [49]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [52]:
# docs = loan_complaint_data.copy()
print(f"Original documents count: {len(loan_complaint_data)}")

filtered_docs = []
for doc in loan_complaint_data:
    narrative = doc.metadata.get("Consumer complaint narrative", "")
    if (len(narrative.strip()) < 100 or 
        narrative.count("XXXX") > 5 or 
        narrative.strip() in ["", "None", "N/A"]):
        continue

    doc.page_content = f"Customer Issue: {doc.metadata.get('Issue', 'Unknown')}\n"
    doc.page_content += f"Product: {doc.metadata.get('Product', 'Unknown')}\n"
    doc.page_content += f"Complaint Details: {narrative}"

    filtered_docs.append(doc)

print(f"Documents count after filtering: {len(filtered_docs)}")

Original documents count: 825
Documents count after filtering: 480


In [53]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
# generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_llm = LangchainLLMWrapper(ChatOpenAI(
    model="gpt-4.1-nano",  # Less capable than mini for reasoning tasks, but okay for the task
    temperature=0.1,      # Lower temperature for more consistent outputs
    request_timeout=120   # Longer timeout for complex operations
))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [54]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(
    ChatOpenAI(
        model="gpt-4.1-mini",
        temperature=0.1,      # Lower temperature for more consistent outputs
        request_timeout=120   # Longer timeout for complex operations        
    )
)

In [55]:
gc.collect()

40

In [56]:
# %%time
from ragas.testset.transforms import default_transforms, apply_transforms
transformer_llm = generator_llm
embedding_model = generator_embeddings

In [57]:
gc.collect()

0

In [58]:
import psutil

# Check memory usage
process = psutil.Process(os.getpid())
memory_mb = process.memory_info().rss / 1024 / 1024
print(f"Memory usage before generation: {memory_mb:.1f} MB")

Memory usage before generation: 484.4 MB


In [59]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(
                llm=generator_llm, embedding_model=embedding_model, 
                #knowledge_graph=loan_data_kg
)

In [60]:
%%time
testset = None

from ragas.testset import TestsetGenerator
from ragas import EvaluationDataset

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

# if os.path.exists('golden-master.csv'):
#     golden_dataset_df = pd.read_csv('golden-master.csv')
#     golden_dataset_df['reference_contexts'] = golden_dataset_df['reference_contexts'].apply(eval)
#     testset = EvaluationDataset.from_pandas(golden_dataset_df)
# else:
#     testset = generator.generate_with_langchain_docs(loan_complaint_data[:20], testset_size=10)
testset = generator.generate_with_langchain_docs(loan_complaint_data[:20], testset_size=10)
testset.to_pandas()

Applying SummaryExtractor:   0%|          | 0/14 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/20 [00:00<?, ?it/s]

Node bc149e2a-4864-4d8e-915c-ae5c61dbb8bb does not have a summary. Skipping filtering.
Node 27daea41-fe22-4363-b130-2eb07abc95f4 does not have a summary. Skipping filtering.
Node 79cb6a7a-bf39-4a56-b2ae-d26f88f5f2e1 does not have a summary. Skipping filtering.
Node c3b82972-0700-4233-becb-6239f293b25c does not have a summary. Skipping filtering.
Node 36b7a2ea-7cc9-4f32-8e57-44d346f6e869 does not have a summary. Skipping filtering.
Node 3fdc6de8-ff46-4e3c-9e1a-05b5f8397ded does not have a summary. Skipping filtering.


Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/54 [00:00<?, ?it/s]

Applying OverlapScoreBuilder:   0%|          | 0/1 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/10 [00:00<?, ?it/s]

CPU times: user 1.67 s, sys: 234 ms, total: 1.9 s
Wall time: 2min 27s


| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | Can you please provide me with the detailed Co... | [Customer Issue: Dealing with your lender or s... | Customer Issue: Dealing with your lender or se... | single_hop_specifc_query_synthesizer |
| 1 | What issues are being reported with Aidvantage... | [Customer Issue: Dealing with your lender or s... | The customer reports that Aidvantage assigned ... | single_hop_specifc_query_synthesizer |
| 2 | How does FERPA protect student loan borrowers ... | [Customer Issue: Dealing with your lender or s... | Customer Issue: Dealing with your lender or se... | single_hop_specifc_query_synthesizer |
| 3 | What issues are being raised regarding Nelnet ... | [Customer Issue: Dealing with your lender or s... | The consumer is confused about the issuer of t... | single_hop_specifc_query_synthesizer |
| 4 | What does it mean that I was told I am in forb... | [Customer Issue: Dealing with your lender or s... | The context states that the borrower was told ... | single_hop_specifc_query_synthesizer |
| 5 | How does FDCPA relate to the illegal student l... | [<1-hop>\n\nCustomer Issue: Improper use of yo... | The FDCPA (Fair Debt Collection Practices Act)... | multi_hop_specific_query_synthesizer |
| 6 | Based on the issues reported with Aidvantage, ... | [<1-hop>\n\nCustomer Issue: Dealing with your ... | The cases highlight that Aidvantage has failed... | multi_hop_specific_query_synthesizer |
| 7 | How does the Department of Education's abolish... | [<1-hop>\n\nCustomer Issue: Improper use of yo... | The context indicates that following the Depar... | multi_hop_specific_query_synthesizer |
| 8 | How does the violation of the FCRA by XXXX and... | [<1-hop>\n\nI am writing to formally dispute i... | The violation of the FCRA by XXXX and XXXX, wh... | multi_hop_specific_query_synthesizer |
| 9 | How does the violation of the Family Education... | [<1-hop>\n\nCustomer Issue: Improper use of yo... | The violation of FERPA is directly related to ... | multi_hop_specific_query_synthesizer |


In [61]:
# Check memory usage
process = psutil.Process(os.getpid())
memory_mb = process.memory_info().rss / 1024 / 1024
print(f"Memory usage after generation: {memory_mb:.1f} MB")

Memory usage after generation: 490.8 MB


In [62]:
gc.collect()

2935

In [63]:
process = psutil.Process(os.getpid())
memory_mb = process.memory_info().rss / 1024 / 1024
print(f"Memory usage after gc.collect(): {memory_mb:.1f} MB")

Memory usage after gc.collect(): 490.8 MB


In [64]:
from langsmith import Client

langsmith_client = Client(
    timeout_ms=60000,  # 60 seconds
    retry_config={"max_retries": 5}
)

dataset_name = "Loan Synthetic Data (s09)"

existing_datasets = langsmith_client.list_datasets()
dataset_exists = any(dataset.name == dataset_name for dataset in existing_datasets)

if dataset_exists:
  langsmith_dataset = langsmith_client.read_dataset(dataset_name=dataset_name)
  print(f"Using existing dataset: {dataset_name}")
else:
  langsmith_dataset = langsmith_client.create_dataset(
      dataset_name=dataset_name,
      description="Loan Synthetic Data (for s09 exercise)"
  )
  print(f"Created new dataset: {dataset_name}")

Using existing dataset: Loan Synthetic Data (s09)


In [65]:
gc.collect()

0

In [66]:
gc.collect()

0

## Ragas Evaluation

In [67]:
from ragas.metrics import (
    faithfulness,
    answer_relevancy, 
    context_precision,
    context_recall,
    answer_correctness,
    answer_similarity
)

In [70]:
retriever_chains_list = {
    "naive_retrieval_chain" : { 'rag_chain': naive_retrieval_chain },
    "bm25_retrieval_chain": { 'rag_chain': bm25_retrieval_chain },
    "contextual_compression_retrieval_chain": { 'rag_chain': contextual_compression_retrieval_chain },
    "multi_query_retrieval_chain": { 'rag_chain': multi_query_retrieval_chain },
    "parent_document_retrieval_chain": { 'rag_chain': parent_document_retrieval_chain },
    "ensemble_retrieval_chain": { 'rag_chain': ensemble_retrieval_chain }
}

In [71]:
import copy
def simplest_copy_method(original_dataset):
    """
    Simplest method: Use copy.deepcopy()
    This creates a completely independent copy
    """
    dataset_copy = copy.deepcopy(original_dataset)
    return dataset_copy

In [73]:
# %%time
from ragas import EvaluationDataset
from tqdm.notebook import tqdm

# Run every RAG chain over its own copy of the test set, recording the
# generated answer and the retrieved contexts on each sample.
for retriever_chain in tqdm(retriever_chains_list.keys()):
    copy_of_testset = simplest_copy_method(testset)
    retriever_chains_list[retriever_chain]['dataset'] = copy_of_testset
    rag_chain = retriever_chains_list[retriever_chain]['rag_chain']
    for test_row in copy_of_testset:
        response = rag_chain.invoke({"question" : test_row.eval_sample.user_input})
        test_row.eval_sample.response = response["response"].content
        # test_row.eval_sample.metrics = response["response"].usage
        test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [74]:
for retriever_chain in tqdm(retriever_chains_list.keys()):
    copy_of_testset = retriever_chains_list[retriever_chain]['dataset']
    retriever_chains_list[retriever_chain]['evaluation_dataset'] = EvaluationDataset.from_pandas(copy_of_testset.to_pandas())

In [75]:
from evaluation_cache import save_evaluation_result, load_evaluation_result

pipeline_stages_folder_name = ".pipeline-stages"
os.makedirs(pipeline_stages_folder_name, exist_ok=True)
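
The `evaluation_cache` module is a local helper that is not shown in this notebook. As a rough idea of what such a helper could look like, here is a minimal sketch, assuming the result object can be pickled and that the cache stores a timestamp next to it. The payload layout (`cached_at`, `result`) is an assumption for illustration, not the actual module.

```python
import os
import pickle
import tempfile
from datetime import datetime


def save_evaluation_result(result, filename):
    """Pickle the result together with the time it was cached (assumed layout)."""
    payload = {"cached_at": datetime.now().isoformat(), "result": result}
    with open(filename, "wb") as f:
        pickle.dump(payload, f)


def load_evaluation_result(filename):
    """Load a cached result and report when it was stored."""
    with open(filename, "rb") as f:
        payload = pickle.load(f)
    print(f"Loaded evaluation results from: {filename}")
    print(f"Cached on: {payload['cached_at']}")
    return payload["result"]


# Round-trip a stand-in object (a plain dict here, not a real Ragas result).
cache_dir = tempfile.mkdtemp()
cache_file = os.path.join(cache_dir, "ragas_evaluation_results_demo.pkl")
save_evaluation_result({"faithfulness": 0.84}, cache_file)
restored = load_evaluation_result(cache_file)
```

Caching each retriever's result in its own file means an interrupted evaluation run can be resumed without paying for the finished chains again.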

In [98]:
%%time
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig
from ragas.cost import get_token_usage_for_openai

evaluation_results = {}
custom_run_config = RunConfig(timeout=360)

for retriever_chain in tqdm(retriever_chains_list.keys()):
    evaluation_results_filename = f"{pipeline_stages_folder_name}/ragas_evaluation_results_{retriever_chain}.pkl"
    if os.path.exists(evaluation_results_filename):
        print(f"{retriever_chain} already processed, skipping to the next one...")
        retriever_chains_list[retriever_chain]['evaluation_result'] = load_evaluation_result(evaluation_results_filename)
        continue

    result = evaluate(
        dataset=retriever_chains_list[retriever_chain]['evaluation_dataset'],
        metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
        llm=evaluator_llm,
        token_usage_parser=get_token_usage_for_openai,
        run_config=custom_run_config
    )
    print(f"Saving {retriever_chain}...")
    retriever_chains_list[retriever_chain]['evaluation_result'] = result
    save_evaluation_result(result, evaluation_results_filename)
        
    print(f"Finished evaluating and saving {retriever_chain} moving to the next one...")

naive_retrieval_chain already processed, skipping to the next one...
✅ Loaded evaluation results from: .pipeline-stages/ragas_evaluation_results_naive_retrieval_chain.pkl
📅 Cached on: 2025-07-29T02:46:06.269653
bm25_retrieval_chain already processed, skipping to the next one...
✅ Loaded evaluation results from: .pipeline-stages/ragas_evaluation_results_bm25_retrieval_chain.pkl
📅 Cached on: 2025-07-29T02:11:28.847554
contextual_compression_retrieval_chain already processed, skipping to the next one...
✅ Loaded evaluation results from: .pipeline-stages/ragas_evaluation_results_contextual_compression_retrieval_chain.pkl
📅 Cached on: 2025-07-29T02:15:28.550960
multi_query_retrieval_chain already processed, skipping to the next one...
✅ Loaded evaluation results from: .pipeline-stages/ragas_evaluation_results_multi_query_retrieval_chain.pkl
📅 Cached on: 2025-07-29T02:23:23.942733
parent_document_retrieval_chain already processed, skipping to the next one...
✅ Loaded evaluation results from:

## Evaluation and Performance Analysis

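Now that we have evaluation results from Ragas, let's analyze the performance of the different retrievers across multiple dimensions: **Performance**, **Cost**, and **Latency**. Note that latency is not captured in these results, so the latency columns below are reported as 0.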

In [99]:
from tqdm.notebook import tqdm

In [147]:
def extract_ragas_metrics(ragas_result, model_name: str = ''):
    """Extract cost, latency, and token metrics from RAGAS evaluation result"""
    import numpy as np
    
    def get_value(obj, key):
        """Get value from dict key or object attribute"""
        return obj.get(key) if isinstance(obj, dict) else getattr(obj, key, None)
    
    def safe_mean(values):
        """Calculate mean, filtering out NaN values"""
        if not values:
            return 0
        arr = np.array(values, dtype=float)
        valid = arr[~np.isnan(arr)]
        return float(np.mean(valid)) if len(valid) > 0 else 0

    def get_model_costs(model_name):
        """Get per-token (input, output) costs in USD for common models"""
        PER_MILLION = 1_000_000
        costs = {
            'gpt-4.1': (2.50 / PER_MILLION, 10.00 / PER_MILLION),
            'gpt-4.1-nano': (0.15 / PER_MILLION, 0.60 / PER_MILLION),
            'gpt-4.1-mini': (0.15 / PER_MILLION, 0.60 / PER_MILLION), 
            'gpt-4o-mini': (0.15 / PER_MILLION, 0.60 / PER_MILLION),
            'gpt-4o': (2.50 / PER_MILLION, 10.00 / PER_MILLION),
            'gpt-4-turbo': (10.00 / PER_MILLION, 30.00 / PER_MILLION),
            'gpt-3.5-turbo': (0.50 / PER_MILLION, 1.50 / PER_MILLION),
            'claude-3-haiku': (0.25 / PER_MILLION, 1.25 / PER_MILLION),
            'claude-3-sonnet': (3.00 / PER_MILLION, 15.00 / PER_MILLION),
            'claude-3-opus': (15.00 / PER_MILLION, 75.00 / PER_MILLION),
            'text-embedding-3-small': (0.02 / PER_MILLION, 0.0),
            'text-embedding-3-large': (0.13 / PER_MILLION, 0.0),
            'rerank-v3.5': (2.00 / PER_MILLION, 0.0)
        }
        
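        # Try exact match, then partial match (longest key first, so that
        # e.g. a dated 'gpt-4.1-mini-...' name is not priced as 'gpt-4.1')
        if model_name in costs:
            return costs[model_name]

        for model_key in sorted(costs, key=len, reverse=True):
            if model_key in model_name.lower():
                return costs[model_key]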
        
        return costs['gpt-4o-mini']  # Default
    
    # Extract data
    scores = get_value(ragas_result, 'scores') or []
    scores_dict = get_value(ragas_result, '_scores_dict') or {}
    cost_cb = get_value(ragas_result, 'cost_cb') or {}
    usage_data = get_value(cost_cb, 'usage_data') or []
    
    # Calculate runs
    total_runs = len(scores) if scores else 1
    
    # Calculate RAGAS scores (averages from score lists)
    ragas_scores = {}
    for metric, values in scores_dict.items():
        if isinstance(values, list):
            ragas_scores[metric] = safe_mean(values)
    
    # Calculate tokens and cost
    total_input = sum(get_value(usage, 'input_tokens') or 0 for usage in usage_data)
    total_output = sum(get_value(usage, 'output_tokens') or 0 for usage in usage_data)
    
    input_cost, output_cost = get_model_costs(model_name)
    total_cost = (total_input * input_cost) + (total_output * output_cost)
    
    # Build metrics
    metrics = {
        'Total_Runs': total_runs,
        'Total_Cost': total_cost,
        'Total_Input_Tokens': total_input,
        'Total_Output_Tokens': total_output,
        'Total_Latency_Sec': 0,  # Not available in this data
        'Avg_Cost_Per_Run': total_cost / total_runs,
        'Avg_Input_Tokens_Per_Run': total_input / total_runs,
        'Avg_Output_Tokens_Per_Run': total_output / total_runs,
        'Avg_Latency_Sec': 0,
        **ragas_scores
    }
    
    return metrics

In [148]:
import pandas as pd
raw_stats_df = pd.DataFrame()
for retriever_chain in tqdm(retriever_chains_list.keys()):
    result = retriever_chains_list[retriever_chain]['evaluation_result']
    retriever_chains_list[retriever_chain]['evaluation_metrics'] = extract_ragas_metrics(result, 'gpt-4.1-mini').copy()
    each_retriever_df = pd.concat([
        pd.DataFrame([{"retriever": retriever_chain}]), 
        pd.DataFrame([retriever_chains_list[retriever_chain]['evaluation_metrics']])
    ], axis=1)
    raw_stats_df = pd.concat([
        raw_stats_df, each_retriever_df
    ])

In [149]:
raw_stats_df.to_csv('ragas_retriever_raw_stats.csv', index=False)

In [150]:
raw_stats_df

| | retriever | Total_Runs | Total_Cost | Total_Input_Tokens | Total_Output_Tokens | Total_Latency_Sec | Avg_Cost_Per_Run | Avg_Input_Tokens_Per_Run | Avg_Output_Tokens_Per_Run | Avg_Latency_Sec | context_recall | faithfulness | factual_correctness | answer_relevancy | context_entity_recall | noise_sensitivity_relevant |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | naive_retrieval_chain | 10 | 0.242236 | 469358 | 286387 | 0 | 0.024224 | 46935.8 | 28638.7 | 0 | 0.75881 | 0.840723 | 0.454 | 0.950364 | 0.436795 | 0.199752 |
| 0 | bm25_retrieval_chain | 10 | 0.153595 | 309224 | 178686 | 0 | 0.01536 | 30922.4 | 17868.6 | 0 | 0.85619 | 0.867977 | 0.445 | 0.958127 | 0.47906 | 0.359528 |
| 0 | contextual_compression_retrieval_chain | 10 | 0.12462 | 253635 | 144291 | 0 | 0.012462 | 25363.5 | 14429.1 | 0 | 0.639524 | 0.803765 | 0.472222 | 0.865389 | 0.387106 | 0.332763 |
| 0 | multi_query_retrieval_chain | 10 | 0.222198 | 457379 | 255985 | 0 | 0.02222 | 45737.9 | 25598.5 | 0 | 0.887381 | 0.915717 | 0.506 | 0.960024 | 0.425128 | 0.545455 |
| 0 | parent_document_retrieval_chain | 10 | 0.15221 | 307142 | 176898 | 0 | 0.015221 | 30714.2 | 17689.8 | 0 | 0.898095 | 0.895017 | 0.441 | 0.861598 | 0.445128 | 0.366881 |
| 0 | ensemble_retrieval_chain | 10 | 0.241888 | 542828 | 267440 | 0 | 0.024189 | 54282.8 | 26744.0 | 0 | 0.931429 | 0.878024 | 0.414 | 0.960497 | 0.436813 | 0.0 |
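
The `Total_Cost` column can be cross-checked by hand from the token counts. Here is a minimal sketch, assuming the rates that `get_model_costs` uses for `'gpt-4.1-mini'` above (0.15 and 0.60 USD per million input and output tokens). The token counts and reported costs are copied from the table, and `total_cost` is a helper name introduced just for this check.

```python
PER_MILLION = 1_000_000
INPUT_RATE = 0.15 / PER_MILLION   # USD per input token (rate from the notebook's cost table)
OUTPUT_RATE = 0.60 / PER_MILLION  # USD per output token

# (input tokens, output tokens) per retriever, copied from the table above.
token_counts = {
    "naive_retrieval_chain": (469358, 286387),
    "bm25_retrieval_chain": (309224, 178686),
    "contextual_compression_retrieval_chain": (253635, 144291),
    "multi_query_retrieval_chain": (457379, 255985),
    "parent_document_retrieval_chain": (307142, 176898),
    "ensemble_retrieval_chain": (542828, 267440),
}

# Total_Cost values as reported in the table above.
reported_cost = {
    "naive_retrieval_chain": 0.242236,
    "bm25_retrieval_chain": 0.153595,
    "contextual_compression_retrieval_chain": 0.12462,
    "multi_query_retrieval_chain": 0.222198,
    "parent_document_retrieval_chain": 0.15221,
    "ensemble_retrieval_chain": 0.241888,
}


def total_cost(input_tokens, output_tokens):
    """Cost in USD for one evaluation run at the assumed rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE


for name, (inp, out) in token_counts.items():
    cost = total_cost(inp, out)
    print(f"{name}: {cost:.6f} (reported {reported_cost[name]:.6f})")
```

The recomputed values agree with the reported ones to within rounding. These figures cover the tokens recorded during the Ragas evaluation run, priced at the rates hard-coded in the notebook. They should be checked against current provider pricing before being used for budgeting.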


In [162]:
import importlib

import ragas_rank_retrievers
importlib.reload(ragas_rank_retrievers)
from ragas_rank_retrievers import RetrieverRanker

ranker = RetrieverRanker('ragas_retriever_raw_stats.csv')

## Final outcome of the Ragas Evaluators

In [163]:
ranker.get_recommendations_table()

| | Category | Retriever | Key Metric | Description |
|---|---|---|---|---|
| 0 | Overall Winner | Multi Query | Score: 0.827 | Best balanced performance |
| 1 | Budget Option | Contextual Compression | Cost: $0.0125 | Lowest cost per run |
| 2 | Quality Leader | Ensemble | Quality: 0.923 | Highest average quality metrics |
| 3 | Production Ready | Bm25 | Score: 0.558 | Meets minimum thresholds |


In [164]:
ranker.get_rankings_table('weighted')

| | rank | retriever_chain | score | context_recall | faithfulness | factual_correctness | answer_relevancy | Avg_Cost_Per_Run |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Multi Query | 0.8273 | 0.8874 | 0.9157 | 0.506 | 0.96 | 0.0222 |
| 1 | 2 | Bm25 | 0.7172 | 0.8562 | 0.868 | 0.445 | 0.9581 | 0.0154 |
| 2 | 3 | Ensemble | 0.587 | 0.9314 | 0.878 | 0.414 | 0.9605 | 0.0242 |
| 3 | 4 | Parent Document | 0.5575 | 0.8981 | 0.895 | 0.441 | 0.8616 | 0.0152 |
| 4 | 5 | Naive | 0.4648 | 0.7588 | 0.8407 | 0.454 | 0.9504 | 0.0242 |
| 5 | 6 | Contextual Compression | 0.2331 | 0.6395 | 0.8038 | 0.4722 | 0.8654 | 0.0125 |
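
`RetrieverRanker` lives in the local `ragas_rank_retrievers` module, which is not shown here, so the exact scoring formula behind the `score` column is unknown. One common way to build such a ranking is to min-max normalize each metric across the retrievers, invert cost so that cheaper is better, and take a weighted sum. The sketch below illustrates that idea only: the weights, the `min_max` and `weighted_scores` helpers, and the choice of normalization are all assumptions, and the resulting scores are not expected to reproduce the table above.

```python
# Metric values copied from the weighted rankings table above.
rows = {
    "Multi Query": {"context_recall": 0.8874, "faithfulness": 0.9157,
                    "factual_correctness": 0.506, "answer_relevancy": 0.96,
                    "Avg_Cost_Per_Run": 0.0222},
    "Bm25": {"context_recall": 0.8562, "faithfulness": 0.868,
             "factual_correctness": 0.445, "answer_relevancy": 0.9581,
             "Avg_Cost_Per_Run": 0.0154},
    "Ensemble": {"context_recall": 0.9314, "faithfulness": 0.878,
                 "factual_correctness": 0.414, "answer_relevancy": 0.9605,
                 "Avg_Cost_Per_Run": 0.0242},
    "Parent Document": {"context_recall": 0.8981, "faithfulness": 0.895,
                        "factual_correctness": 0.441, "answer_relevancy": 0.8616,
                        "Avg_Cost_Per_Run": 0.0152},
    "Naive": {"context_recall": 0.7588, "faithfulness": 0.8407,
              "factual_correctness": 0.454, "answer_relevancy": 0.9504,
              "Avg_Cost_Per_Run": 0.0242},
    "Contextual Compression": {"context_recall": 0.6395, "faithfulness": 0.8038,
                               "factual_correctness": 0.4722, "answer_relevancy": 0.8654,
                               "Avg_Cost_Per_Run": 0.0125},
}

# Hypothetical weights (they sum to 1); the real ranker's weights are unknown.
WEIGHTS = {"context_recall": 0.25, "faithfulness": 0.25,
           "factual_correctness": 0.20, "answer_relevancy": 0.20,
           "Avg_Cost_Per_Run": 0.10}
LOWER_IS_BETTER = {"Avg_Cost_Per_Run"}


def min_max(values):
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def weighted_scores(rows, weights):
    """Weighted sum of min-max normalized metrics, with cost inverted."""
    names = list(rows)
    scores = dict.fromkeys(names, 0.0)
    for metric, weight in weights.items():
        normalized = min_max([rows[name][metric] for name in names])
        for name, value in zip(names, normalized):
            if metric in LOWER_IS_BETTER:
                value = 1.0 - value
            scores[name] += weight * value
    return scores


scores = weighted_scores(rows, WEIGHTS)
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for rank, (name, score) in enumerate(ranking, start=1):
    print(rank, name, round(score, 4))
```

Because min-max normalization is relative to the retrievers being compared, adding or removing a retriever changes every score, so the ranking is only meaningful within one comparison set.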


In [165]:
ranker.get_metrics_comparison_table()

| | retriever_chain | context_recall | faithfulness | factual_correctness | answer_relevancy | context_entity_recall | noise_sensitivity_relevant | Avg_Cost_Per_Run |
|---|---|---|---|---|---|---|---|---|
| 0 | Naive | 0.7588 | 0.8407 | 0.454 | 0.9504 | 0.4368 | 0.1998 | 0.0242 |
| 1 | Bm25 | 0.8562 | 0.868 | 0.445 | 0.9581 | 0.4791 | 0.3595 | 0.0154 |
| 2 | Contextual Compression | 0.6395 | 0.8038 | 0.4722 | 0.8654 | 0.3871 | 0.3328 | 0.0125 |
| 3 | Multi Query | 0.8874 | 0.9157 | 0.506 | 0.96 | 0.4251 | 0.5455 | 0.0222 |
| 4 | Parent Document | 0.8981 | 0.895 | 0.441 | 0.8616 | 0.4451 | 0.3669 | 0.0152 |
| 5 | Ensemble | 0.9314 | 0.878 | 0.414 | 0.9605 | 0.4368 | 0.0 | 0.0242 |


In [166]:
ranker.get_algorithm_comparison_table()

| retriever | weighted_rank | weighted_score | quality_first_rank | quality_first_score | balanced_rank | balanced_score | production_ready_rank | production_ready_score |
|---|---|---|---|---|---|---|---|---|
| Bm25 | 2 | 0.7172 | 1 | 0.7572 | 2 | 0.9897 | 1 | 0.5584 |
| Contextual Compression | 6 | 0.2331 | 5 | 0.6952 | 6 | 0.2699 | 6 | 0.0 |
| Ensemble | 3 | 0.587 | 4 | 0.6963 | 3 | 0.9151 | 4 | 0.4208 |
| Multi Query | 1 | 0.8273 | 3 | 0.7343 | 1 | 1.2358 | 3 | 0.4642 |
| Naive | 5 | 0.4648 | 6 | 0.651 | 5 | 0.6777 | 5 | 0.1999 |
| Parent Document | 4 | 0.5575 | 2 | 0.7505 | 4 | 0.7674 | 2 | 0.5343 |
