# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up Qdrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

We'll also provide our OpenAI key, as well as our Cohere API key.

In [4]:
import gc

In [5]:
from dotenv import load_dotenv

load_dotenv(dotenv_path="../.env")

True

In [6]:
import getpass
import os


def check_if_env_var_is_set(env_var_name: str, human_readable_string: str = "API Key") -> None:
    """Report whether an environment variable is set; prompt for it if missing."""
    api_key = os.getenv(env_var_name)

    if api_key:
        print(f"{env_var_name} is present")
    else:
        print(f"{env_var_name} is NOT present, paste key at the prompt:")
        os.environ[env_var_name] = getpass.getpass(f"Please enter your {human_readable_string}: ")

In [7]:
import os
import getpass

# os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

check_if_env_var_is_set("OPENAI_API_KEY", "OpenAI API key")

OPENAI_API_KEY is present


In [8]:
# os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

check_if_env_var_is_set("COHERE_API_KEY", "Cohere API key")

COHERE_API_KEY is present


## Task 2: Data Collection and Preparation

We'll be using our Loan Data once again - this time the structured data available through the CSV!

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

In [9]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

loader = CSVLoader(
    file_path="./data/complaints.csv",
    metadata_columns=[
      "Date received", 
      "Product", 
      "Sub-product", 
      "Issue", 
      "Sub-issue", 
      "Consumer complaint narrative", 
      "Company public response", 
      "Company", 
      "State", 
      "ZIP code", 
      "Tags", 
      "Consumer consent provided?", 
      "Submitted via", 
      "Date sent to company", 
      "Company response to consumer", 
      "Timely response?", 
      "Consumer disputed?", 
      "Complaint ID"
    ]
)

loan_complaint_data = loader.load()

for doc in loan_complaint_data:
    doc.page_content = doc.metadata["Consumer complaint narrative"]

Let's look at an example document to see if everything worked as expected!

In [10]:
loan_complaint_data[0]

Document(metadata={'source': './data/complaints.csv', 'row': 0, 'Date received': '03/27/25', 'Product': 'Student loan', 'Sub-product': 'Federal student loan servicing', 'Issue': 'Dealing with your lender or servicer', 'Sub-issue': 'Trouble with how payments are being handled', 'Consumer complaint narrative': "The federal student loan COVID-19 forbearance program ended in XX/XX/XXXX. However, payments were not re-amortized on my federal student loans currently serviced by Nelnet until very recently. The new payment amount that is effective starting with the XX/XX/XXXX payment will nearly double my payment from {$180.00} per month to {$360.00} per month. I'm fortunate that my current financial position allows me to be able to handle the increased payment amount, but I am sure there are likely many borrowers who are not in the same position. The re-amortization should have occurred once the forbearance ended to reduce the impact to borrowers.", 'Company public response': 'None', 'Company'

In [11]:
gc.collect()

295

## Task 3: Setting up Qdrant!

Now that we have our documents, let's create a Qdrant VectorStore with the collection name "LoanComplaints".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [12]:
from langchain_community.vectorstores import Qdrant
from qdrant_client import QdrantClient, models
from langchain_qdrant import QdrantVectorStore
from langchain_openai import OpenAIEmbeddings

small_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    loan_complaint_data,
    small_embeddings,
    location=":memory:",
    collection_name="LoanComplaints"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply treat each complaint as a document, and use cosine similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [13]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})
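
To make "cosine similarity, top `k`" concrete, here is a minimal pure-Python sketch of the idea. It is not how Qdrant implements search internally; the three-dimensional vectors and the helper names `cosine_similarity` and `top_k` are made up for illustration (the real embeddings from `text-embedding-3-small` have 1536 dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical toy "embeddings" for three complaints.
toy_vectors = {
    "complaint_a": [0.9, 0.1, 0.0],
    "complaint_b": [0.1, 0.9, 0.0],
    "complaint_c": [0.7, 0.3, 0.1],
}

nearest = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
print(nearest)
```

In the notebook, `search_kwargs={"k": 10}` plays the role of the `k` argument here.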

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here; we want this to mostly be about the Retrieval process.

In [14]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [15]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [16]:
%%time
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and piping it into naive_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # RunnablePassthrough.assign passes the input dict through unchanged and re-assigns "context"
    #              to the value of the "context" key from the previous step (effectively a no-op here)
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

CPU times: user 6.97 ms, sys: 3.47 ms, total: 10.4 ms
Wall time: 47.5 ms


Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [17]:
%%time
naive_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

CPU times: user 432 ms, sys: 19.2 ms, total: 451 ms
Wall time: 2.06 s


'Based on the provided context, the most common issue with loans appears to be dealing with the lender or servicer, including errors in loan balances, misapplied payments, wrongful denials of payment plans, and poor communication regarding loan transfers and account status. Many complaints also involve incorrect or outdated information on credit reports, inability to make proper payments or apply extra funds correctly, and issues related to loan mismanagement and lack of transparency.'

In [18]:
%%time
naive_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

CPU times: user 401 ms, sys: 0 ns, total: 401 ms
Wall time: 1.95 s


'Based on the provided information, it appears that several complaints were related to delays or failures in handling issues in a timely manner. Specifically:\n\n- One complaint (row 441) from 03/28/25 regarding a loan application that showed no movement despite multiple follow-ups, was marked as "Timely response? No."\n- Another complaint (row 67) from 04/14/25 about a previous unresolved issue with auto pay and missing bank information was marked as "Timely response? Yes," but the complaint indicates ongoing issues.\n- Multiple complaints (rows 423, 518, 716, 810, 400, etc.) involve delays ranging from a few days to over a year, with some responses marked as "Closed with explanation" or "None," and multiple instances of unresolved or delayed responses.\n\nTherefore, yes, there were complaints that did not get handled in a timely manner.'

In [19]:
%%time
naive_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

CPU times: user 403 ms, sys: 0 ns, total: 403 ms
Wall time: 4.22 s


"People failed to pay back their loans for several reasons, including:\n\n1. **Accumulation of interest and limited repayment options:** Many borrowers experienced ongoing interest accumulation, especially when loans were put into forbearance or deferment, which extended the repayment period and increased the total debt. Lowering monthly payments often resulted in interest compounding, making it difficult to pay down the principal.\n\n2. **Financial hardships and income challenges:** Borrowers faced economic difficulties such as stagnant wages, high living expenses, unemployment, or underemployment, which made it impossible to afford higher payments or even maintain existing payments.\n\n3. **Lack of clear communication and transparency:** Many borrowers were not adequately informed about their loan status, payment resumption dates, or changes in loan servicers. Sudden reporting of delinquency or late payments without proper notification led to credit score drops and financial setbacks

Overall, this is not bad! Let's see if we can make it better!

In [20]:
gc.collect()

694

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is a ranking function based on the [Bag-of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) model, which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.
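
As a rough illustration of that idea, here is a minimal sketch of the textbook Okapi BM25 scoring formula in plain Python. It is an assumption-laden toy: whitespace tokenization, commonly used default values `k1=1.5` and `b=0.75`, and a hypothetical helper name `bm25_scores`. The library behind `BM25Retriever` may tokenize and compute IDF differently.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with textbook Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates and is normalized by document length.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy documents, not rows from the complaints CSV.
toy_docs = [
    "my payment was applied to the wrong loan",
    "the servicer never answered my phone calls",
    "autopay payment reversed twice this month",
]

scores = bm25_scores("payment reversed", toy_docs)
best = max(range(len(toy_docs)), key=scores.__getitem__)
print(best, scores)
```

Only documents that literally contain a query term receive a non-zero score, which is the defining trait of this kind of lexical matching.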

This retriever is very straightforward to set up! Let's see it happen down below!


In [21]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(loan_complaint_data)

We'll construct the same chain - only changing the retriever.

In [22]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [23]:
%%time
bm25_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

CPU times: user 117 ms, sys: 14.1 ms, total: 131 ms
Wall time: 2.37 s


'Based on the provided context, the most common issue with loans appears to be problems related to dealing with lenders or servicers, specifically issues such as:\n\n- Disputes over fees charged.\n- Trouble with how payments are being handled, such as difficulty applying payments to the principal or paying off loans more quickly.\n- Receiving inaccurate or bad information about loans, including loan balances, interest calculations, and loan history.\n- Issues with loan terms, deferments, income-driven repayment plans, and loan duration.\n\nThese issues suggest that a prevalent problem is the lack of transparency, miscommunication, and difficulties in managing and understanding loan details. Therefore, the most common issue with loans, as reflected in the complaints, is **"Dealing with lenders or servicers" with specific problems like incorrect information, fee disputes, and payment application issues**.'

In [24]:
%%time
bm25_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

CPU times: user 36.4 ms, sys: 11.5 ms, total: 47.9 ms
Wall time: 1.95 s


'Based on the provided information, it appears that several complaints were handled with a response marked as "Closed with explanation," and the responses were noted as "Timely response?": "Yes" in each case. However, some complaints involved ongoing issues where the complainants expressed dissatisfaction with the adequacy of the company\'s response or unresolved issues, particularly regarding loan corrections or validation of debt, but the records indicate that responses were officially provided in a timely manner.\n\nTherefore, while formal responses were given promptly, the presence of unresolved or ongoing concerns in the complaints suggests that some issues may not have been fully addressed to the complainants\' satisfaction in a timely manner. Nonetheless, based solely on the data about response timings, the complaints did receive timely responses, even if the issues remain unresolved.\n\nIn conclusion:  \n**Yes, some complaints were answered promptly, but there are indications t

In [25]:
%%time
bm25_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

CPU times: user 44.8 ms, sys: 17.9 ms, total: 62.7 ms
Wall time: 1.7 s


'People failed to pay back their loans for various reasons, including problems with payment plans, miscommunication or lack of communication from the loan servicers, and issues related to incorrect or unauthorized transfer of their loans. Some specific causes include:\n\n- Being steered into wrong types of forbearances or experiencing improper handling of their repayment plans.\n- Lack of response or failure to receive notices from loan servicers about important changes, such as loan transfers or due dates.\n- Automatic payment issues, such as payments being unenrolled without proper notification or payments being reversed multiple times due to errors on the part of the servicers.\n- Negative impact on credit scores due to billing errors or lack of contact from the servicers.\n- Not being properly informed of their loan status or changes, leading to overdue payments and negative credit reporting.\n- Attempts to seek hardship relief or forbearance not being addressed, resulting in conti

It's not clear whether this is better or worse. If only we had a way to test this! (SPOILERS: We do; the second half of the notebook will cover this.)

In [26]:
gc.collect()

618

#### ❓ Question #1:

Give an example query where BM25 is better than embeddings and justify your answer.

#### ✅ Answer:

BM25, a traditional full-text search ranking function, is particularly effective when dealing with queries that rely heavily on exact term matching, term frequency, and inverse document frequency (TF-IDF) principles.

BM25 is generally better suited for scenarios where exact keyword matching is essential, such as in e-commerce search engines, document retrieval systems, and legal e-discovery.

Additionally, BM25 is often used in hybrid search systems alongside vector search to create a more comprehensive understanding of both semantic meaning and keyword importance.

Here are a couple of queries where exact term matches in the documents are essential, because otherwise the results fill up with noisy near-matches that are close but not close enough:

- "Find documents about COVID-19 vaccine side effects in patients with diabetes"
  - The key terms here, "COVID-19 vaccine" and "diabetes", are where the focus of the query lies.
- "Best practices for data backup in 2025"
  - It includes specific terms like "data backup" and "2025" that are likely to appear verbatim in relevant documents.
  - BM25 can leverage term frequency (e.g., how often "data backup" appears in a document) and document length normalization to rank documents. The query relies less on semantic similarity than on the presence and frequency of exact keywords.
  - In contrast, dense embeddings might struggle if the training data does not include similar phrasing or if the semantic model does not strongly associate "best practices" with "data backup" in the context of 2025.

Embeddings, on the other hand, are better suited to capturing semantic relationships between words and documents. If embeddings were used in the scenarios above, the results would likely be less precise than with BM25.


### Addendum

_**Sparse Embeddings** are high-dimensional vectors where most values are zero, with only a few non-zero values representing specific features or tokens that are present, making them memory-efficient and interpretable but limited to explicit feature representation._

_**Dense Embeddings** are vectors where most or all dimensions have non-zero values, creating rich, continuous representations that capture complex semantic relationships and contextual meaning, but require more storage and are less interpretable._

_**Key Difference:** Sparse embeddings work like "on/off switches" for specific features (like one-hot encoding or TF-IDF), while dense embeddings work like "semantic fingerprints" where every dimension contributes to the overall meaning representation - sparse focuses on explicit presence/absence, dense captures nuanced relationships._
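
A small sketch of the contrast, under made-up assumptions: the eight-term vocabulary, the helper `sparse_vector`, and the dense values are all invented for illustration and do not come from any real model.

```python
def sparse_vector(text, vocabulary):
    """Bag-of-words counts: one slot per vocabulary term, mostly zeros."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocabulary]


# Hypothetical tiny vocabulary; real vocabularies have many thousands of terms.
vocabulary = ["autopay", "balance", "credit", "fee", "forbearance", "interest", "late", "servicer"]

sparse = sparse_vector("late fee after autopay failed", vocabulary)
zero_fraction = sparse.count(0) / len(sparse)

# A dense vector has no per-term slots; these numbers are invented placeholders.
dense = [0.12, -0.48, 0.33, 0.05]

print(sparse, zero_fraction)
print(dense)
```

Each non-zero slot in `sparse` maps back to a specific word, which is what makes it interpretable; no single entry of `dense` carries that kind of standalone meaning.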

___

_**Sparse Retrieval** uses exact keyword matching with algorithms like BM25, where documents are represented as sparse vectors containing only the specific terms that appear in them, making it excellent for precise term-based searches but limited to lexical matches._

_**Dense Retrieval** uses semantic embeddings where documents and queries are converted into dense vector representations that capture meaning and context, allowing it to find semantically similar content even when different words are used, but potentially missing exact keyword matches._

_**Key Difference:** Sparse retrieval excels at "what you search is what you get" with exact terms, while dense retrieval excels at "what you mean is what you get" through semantic understanding - which is why hybrid approaches combining both often work best._


## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.
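
The two steps above can be sketched without any external service. In this toy version a simple word-overlap function stands in for the reranking model; that scorer, and the names `overlap_score` and `rerank_and_compress`, are assumptions for illustration only. A real reranker such as Cohere Rerank uses a learned relevance model instead.

```python
def overlap_score(query, text):
    """Toy relevance score: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def rerank_and_compress(query, candidates, score_fn, top_n):
    """Re-order the retrieved candidates by relevance and keep only the best top_n."""
    ranked = sorted(candidates, key=lambda text: score_fn(query, text), reverse=True)
    return ranked[:top_n]


# Pretend these four texts came back from a first-stage retriever.
candidates = [
    "servicer transferred my loan without notice",
    "late fee charged after autopay failed",
    "autopay failed and a late fee was charged twice",
    "interest rate changed unexpectedly",
]

kept = rerank_and_compress("autopay late fee", candidates, overlap_score, top_n=2)
print(kept)
```

The first stage casts a wide net (here, four candidates) and the second stage narrows it, so less irrelevant text reaches the LLM.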

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [27]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

Let's create our chain again, and see how this does!

In [28]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [29]:
%%time
contextual_compression_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

CPU times: user 429 ms, sys: 0 ns, total: 429 ms
Wall time: 1.71 s


'Based on the provided context, the most common issue with loans appears to be problems related to the handling and management of the loans by servicers, including errors in loan balances, misapplied payments, incorrect or bad information, and issues with communication and documentation. Specifically, issues such as errors in loan balances, misapplied payments, wrongful denials of payment plans, and mishandling of information are frequently mentioned.'

In [30]:
%%time
contextual_compression_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

CPU times: user 425 ms, sys: 0 ns, total: 425 ms
Wall time: 1.84 s


'Yes, according to the provided complaints, at least one complaint was not handled in a timely manner. Specifically, the complaint regarding the student loan issues submitted to Maximus Federal Services, Inc. has been open for over 1 year and nearly 18 months without resolution. The complaint from EdFinancial Services about payments not being applied also indicates ongoing issues, although the response was marked as "closed with explanation."'

In [31]:
%%time
contextual_compression_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

CPU times: user 405 ms, sys: 0 ns, total: 405 ms
Wall time: 2.12 s


'People failed to pay back their loans primarily due to a lack of clear communication and understanding about their loan obligations, as well as issues related to the management and handling of their loans by servicers. Specifically, some borrowers were unaware that they needed to repay their loans, and they were not adequately informed about the details of interest accumulation, payment options, or changes in loan ownership. Additionally, difficulties such as technical problems accessing online accounts, lack of notifications about repayment requirements, and complex or conflicting loan account information contributed to borrowers being unable to fulfill their repayment obligations. In some cases, borrowers also experienced financial hardship because the available repayment options, such as forbearance or deferment, led to ongoing interest accumulation, making it even harder to pay off the loans over time.'

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

In [32]:
gc.collect()

1132

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context
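
Those three steps can be sketched in plain Python. Everything here is a stand-in: `fake_generate_queries` replaces the LLM call, `fake_retrieve` replaces the vector search, and the `(doc_id, text)` tuples are invented. It shows the merge-and-deduplicate logic only, not the internals of `MultiQueryRetriever`.

```python
def multi_query_retrieve(question, generate_queries, retrieve):
    """Retrieve for every generated query and return the unique union of results."""
    seen_ids = set()
    merged = []
    for query in generate_queries(question):
        for doc_id, text in retrieve(query):
            if doc_id not in seen_ids:  # keep only the first copy of each document
                seen_ids.add(doc_id)
                merged.append((doc_id, text))
    return merged


# Hypothetical canned results keyed by reformulated query.
FAKE_INDEX = {
    "why did borrowers default?": [(1, "lost job"), (2, "interest grew")],
    "reasons for missed loan payments": [(2, "interest grew"), (3, "never notified")],
}


def fake_generate_queries(question):
    """Stand-in for the LLM: always returns the same two reformulations."""
    return list(FAKE_INDEX)


def fake_retrieve(query):
    """Stand-in for the retriever: looks up canned results."""
    return FAKE_INDEX.get(query, [])


unique_docs = multi_query_retrieve(
    "Why did people fail to pay back their loans?", fake_generate_queries, fake_retrieve
)
print(unique_docs)
```

Document 2 is returned by both reformulations but appears once in the merged context, while documents 1 and 3 are each found by only one of them.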

So, how hard is it to set up? Not bad! Let's see it down below!



In [33]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)

In [34]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [35]:
%%time
multi_query_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

CPU times: user 1.15 s, sys: 0 ns, total: 1.15 s
Wall time: 3.83 s


"The most common issues with loans, based on the complaints provided, are:\n\n- Dealing with lenders or servicers, including trouble with how payments are handled, misapplication of payments, and lack of communication.\n- Problems with loan balances, interest calculations, and errors in account information.\n- Issues related to loan management practices such as forbearance steering, unauthorized interest capitalization, and improper loan deferments.\n- Problems with loan forgiveness, cancellation, or discharge, often involving mismanagement, missed opportunities, or fraud concerns.\n- Difficulties in obtaining loan information, account validation, and transparency about terms and balances.\n- Errors leading to negative impacts on credit reports and scores.\n\nOverall, the most common underlying theme appears to be **mismanagement or poor communication by loan servicers**, leading to errors in balances, interest, and account status, which significantly affect borrowers' financial health

In [36]:
%%time
multi_query_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

CPU times: user 1.15 s, sys: 0 ns, total: 1.15 s
Wall time: 4.38 s


'Based on the information provided, yes, some complaints did not get handled in a timely manner. Specifically, there are complaints from consumers indicating delays in responses or resolution, such as:\n\n- Complaint with ID 12709087 (submitted 03/28/25) from CA: The company response was "Closed with explanation," and it was marked as "Not timely." The consumer reported that it has been nearly 18 months with no resolution.\n- Complaint with ID 12739706 (submitted 04/01/25) from NJ: Marked as "Not timely."\n- Complaint with ID 12668396 (submitted 03/26/25) from NJ: Marked as "Not timely."\n- Complaint with ID 13056764 (submitted 04/18/25) from IN: Marked as "Timely," but involves ongoing issues with inaction.\n- Multiple other complaints show delays, failed follow-ups, or responses with explanations that indicate unresolved issues over long periods.\n\nWhile some complaints were responded to promptly, others experienced significant delays or lack of resolution, indicating that not all c

In [37]:
%%time
multi_query_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

CPU times: user 1.18 s, sys: 0 ns, total: 1.18 s
Wall time: 5.19 s


"People failed to pay back their loans primarily due to systemic issues and misconduct by loan servicers and lenders. The provided complaints reveal several common reasons:\n\n1. **Mismanagement and Errors by Servicers:** Many complain about errors in loan balances, misapplied payments, incorrect account statuses, and improper transfer of accounts between servicers without proper notification or authorization. For example, some borrowers' accounts were reported as delinquent or in default despite being current, often due to reporting errors or lack of proper notices.\n\n2. **Lack of Proper Communication and Notification:** Several accounts of borrowers not being informed about the resumption of payments, changes in servicers, or delinquency status. For instance, borrowers reported receiving no prior notices before being marked as late or delinquent, which hindered their ability to manage payments.\n\n3. **Inadequate Guidance and Support:** Complaints include instances where servicers f

In [38]:
gc.collect()

495

#### ❓ Question #2:

Explain how generating multiple reformulations of a user query can improve recall.

#### ✅ Answer:

Multiple reformulations improve recall because relevant documents may use different terminology than the original query, and each reformulation can surface documents that the others miss.

In other words, the reformulations approach the same information need from different angles, so the retriever returns documents covering each of those angles. This widens retrieval coverage around the common theme while capturing variations in terminology and perspective.

A retriever that uses multiple reformulations follows these steps:

  1. Generate multiple query variations from the original query using an LLM
  2. Retrieve documents for each variation (each gets k results)
  3. Deduplicate and merge the results from all queries
  4. Return the final deduplicated set

Because the final set is the union of the per-query results, it can contain relevant documents that no single query would have retrieved on its own.

For example, "machine learning algorithms" and "AI models" retrieve different relevant documents about the same or a similar theme.

## Task 8: Parent Document Retriever

A "small-to-big" strategy - the Parent Document Retriever works based on a simple strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.
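
Here is a minimal sketch of "search small, return big" using plain dictionaries and lists. The word-based splitter, the word-overlap scorer (a stand-in for embedding similarity), and all helper names are assumptions for illustration; `ParentDocumentRetriever` handles these pieces with a real vector store and docstore.

```python
def split_into_chunks(text, words_per_chunk):
    """Split a document into child chunks of a fixed number of words."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk]) for i in range(0, len(words), words_per_chunk)]


def build_stores(parents, words_per_chunk):
    """Keep full parents in a docstore and index each child chunk with its parent id."""
    docstore = dict(parents)
    child_index = []
    for parent_id, text in parents.items():
        for chunk in split_into_chunks(text, words_per_chunk):
            child_index.append((chunk, parent_id))
    return docstore, child_index


def word_overlap(query, text):
    """Toy similarity: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query, docstore, child_index, score_fn, k):
    """Search over child chunks, then return their de-duplicated parent documents."""
    ranked = sorted(child_index, key=lambda item: score_fn(query, item[0]), reverse=True)
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [docstore[parent_id] for parent_id in parent_ids]


# Hypothetical parent documents.
parents = {
    "p1": "My autopay was cancelled without notice. Later a late fee appeared on my account.",
    "p2": "The servicer reported the wrong balance. I disputed it with the credit bureau.",
}

docstore, child_index = build_stores(parents, words_per_chunk=6)
results = retrieve_parents("late fee", docstore, child_index, word_overlap, k=2)
print(results)
```

The match happens on the small chunk containing "late fee", but the caller receives the whole parent, including the sentence about autopay that the chunk alone would have dropped.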

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [39]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

parent_docs = loan_complaint_data
child_splitter = RecursiveCharacterTextSplitter(chunk_size=750)

We'll need to set up a new Qdrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [40]:
vectorstore.client.create_collection(
  collection_name="full_documents",
  vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = Qdrant(
  client=vectorstore.client,     # ✅ Reuse existing client
  embeddings=small_embeddings,         # ✅ Reuse embeddings
  collection_name="full_documents"
)

Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [41]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [42]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [43]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [44]:
%%time
parent_document_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

CPU times: user 394 ms, sys: 21.2 ms, total: 415 ms
Wall time: 1.21 s


'The most common issue with loans, based on the provided complaints, appears to be related to mismanagement and errors by loan servicers. This includes problems such as incorrect information on credit reports, discrepancies in loan balances and interest rates, wrongful denials of payment plans, errors in loan accounting, and issues stemming from the transfer or sale of loans. Many complaints highlight systemic breakdowns, misapplication of payments, misleading reporting, and difficulty in resolving discrepancies.'

In [45]:
%%time
parent_document_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

CPU times: user 398 ms, sys: 0 ns, total: 398 ms
Wall time: 1.64 s


"Based on the provided context, several complaints explicitly mention that they were not handled in a timely manner. For example, the complaint about the student loan application filed on 03/28/25 indicates that the consumer waited for responses and had not heard back, with responses delayed beyond the estimated timeframes. Similarly, complaints filed on 04/11/25 regarding issues with Mohela's processing times also note that responses were not timely. The complaint about the dispute settlement sent over 30 days ago also emphasizes that the issue has not been addressed within a reasonable period.\n\nTherefore, yes, there are complaints within the provided data that did not get handled in a timely manner."

In [46]:
%%time
parent_document_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

CPU times: user 395 ms, sys: 0 ns, total: 395 ms
Wall time: 1.2 s


'People may fail to pay back their loans due to various reasons, including financial hardship, lack of proper information or communication from lenders, misrepresentation by educational institutions about the value or stability of their programs, and systemic issues within loan servicing agencies. Specifically, some individuals face difficulties because they were not properly informed about repayment obligations, experienced severe financial hardship after graduation, or encountered issues with loan servicing companies that failed to communicate correctly or manage their accounts transparently.'

Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to answer the performance question more rigorously.

In [47]:
gc.collect()

1030

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever takes two or more retrievers and combines their retrieved documents using a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.
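
As a rough sketch of what rank fusion does, the snippet below scores each document by summing `weight / (rank + c)` over every ranked list it appears in. The function name `weighted_rrf` and the toy document IDs are made up for illustration; `c = 60` is the constant suggested in the linked paper, and we're only assuming that the ensemble retriever used below behaves along these lines.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists of document IDs with weighted Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for doc_ids, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(doc_ids, start=1):
            # Documents near the top of a list contribute more to the fused score.
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results from two retrievers (best match first).
keyword_results = ["doc_a", "doc_b", "doc_c"]
dense_results = ["doc_b", "doc_c", "doc_d"]

fused = weighted_rrf([keyword_results, dense_results], weights=[0.5, 0.5])
print(fused)
```

Documents that show up in both lists (`doc_b`, `doc_c`) rise above documents that only one retriever found, even when that retriever ranked them first.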

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [48]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.

In [49]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [50]:
%%time
ensemble_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

CPU times: user 2.25 s, sys: 0 ns, total: 2.25 s
Wall time: 4.74 s


'The most common issue with loans, based on the complaints in the provided data, appears to be problems related to how payments are being handled, mismanagement by servicers, and errors in loan information. Specifically, recurring issues include:\n\n- Payments being reversed without explanation, causing credit scores to drop and inaccurate reporting of delinquency.\n- Errors in reporting loan status, such as incorrect delinquency or late payment marks.\n- Lack of clear communication or notification about account changes, transfers, or balances.\n- Discrepancies and inaccuracies in loan balances and payment histories.\n- Failure of servicers to properly process payments or update account information.\n- Issues with loan classification, mismanagement, and mishandling of deferments or forbearance.\n\nWhile other problems like bad information, incorrect balances, or dispute of validity also occur, the most frequent pattern is the mishandling of payments and inaccurate reporting by loan ser

In [51]:
%%time
ensemble_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

CPU times: user 2.27 s, sys: 0 ns, total: 2.27 s
Wall time: 6.04 s


'Based on the provided complaints data, yes, some complaints indicate that issues were not handled in a timely manner. Specifically:\n\n- Complaint ID 12935889 (Page 1): The response was marked as "No" for timely response, and the response was "Closed with explanation."\n- Complaint ID 12739706 (Page 2): The response was "Closed with explanation," and it was marked as "No" for timely response.\n- Complaint ID 12950199 (Page 2): The response was "Closed with explanation," but it was marked as "Yes" for timely response.\n- Complaint ID 12973003 (Page 2): The response was "Closed with explanation," and it was marked as "Yes" for timely response.\n- Complaint ID 13062402 (Page 2): The response was "Closed with explanation," and it was marked as "Yes" for timely response.\n- Several other complaints, such as IDs 13160766, 13205525, and 13197090, are marked as timely, but some still resulted in closure with explanation, which may indicate handling was not fully satisfactory or timely.\n\nIn 

In [52]:
%%time
ensemble_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

CPU times: user 2.27 s, sys: 0 ns, total: 2.27 s
Wall time: 7.4 s


"People failed to pay back their loans primarily due to a combination of factors highlighted in the complaints:\n\n1. **Lack of clear communication and notification:** Many borrowers were not adequately informed about their repayment status, whether their loans had resumed, or when payments were due. Several complaints mention not receiving proper notices, emails, or mail, leading to unexpected delinquencies and credit impacts.\n\n2. **Mismanagement and errors by loan servicers:** Servicers, such as Nelnet, Maximus, EdFinancial, and others, have been reported to mishandle accounts—incorrectly reporting delinquencies, refusing to provide accurate information, or failing to follow regulations. This mismanagement often results in increased balances, erroneous defaults, or inaccurate credit reporting.\n\n3. **Interest accumulation and forbearance practices:** Borrowers were often steered into forbearance or deferment without adequate understanding of how interest would accrue or options to

In [53]:
gc.collect()

1533

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of several [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

The `breakpoint_threshold_type` parameter controls when the semantic chunker creates chunk boundaries based on embedding similarity between sentences:

**Four Threshold Types:**

1. _"percentile" (default)_
- Splits when sentence embedding distance exceeds the 95th percentile of all distances
- Effect: Creates chunks at the most semantically distinct boundaries
- Behavior: More conservative splitting, larger chunks

2. _"standard_deviation"_
- Splits when distance exceeds 3 standard deviations from mean
- Effect: More predictable performance, especially for normally distributed content
- Behavior: More consistent chunk sizes

3. _"interquartile"_
- Uses IQR * 1.5 scaling factor to determine breakpoints
- Effect: Middle-ground approach, robust to outliers
- Behavior: Balanced chunk distribution

4. _"gradient"_
- Detects anomalies in embedding distance gradients
- Effect: Best for domain-specific/highly correlated content
- Behavior: Finds subtle semantic transitions

**Impact:** _The threshold type determines sensitivity to semantic changes - more sensitive types create smaller, more focused chunks while less sensitive types create larger, more comprehensive chunks._
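
To get a feel for how the first three threshold types differ, here is a small sketch that computes a cut-off for each from a list of distances. It assumes the cut-offs are the 95th percentile, `mean + 3 * std`, and `mean + 1.5 * IQR` respectively, which matches the defaults described above but should be checked against the library version you're running. The `gradient` type is left out, and `breakpoint_thresholds` is a hypothetical helper name.

```python
import statistics


def breakpoint_thresholds(distances):
    """Return one cut-off per threshold type for a list of sentence distances."""
    mean = statistics.mean(distances)
    # "inclusive" quantiles interpolate linearly between the sorted values.
    percentiles = statistics.quantiles(distances, n=100, method="inclusive")
    q1, _, q3 = statistics.quantiles(distances, n=4, method="inclusive")
    return {
        "percentile": percentiles[94],  # 95th percentile
        "standard_deviation": mean + 3 * statistics.pstdev(distances),
        "interquartile": mean + 1.5 * (q3 - q1),
    }


# Toy distances between consecutive sentences (made-up numbers).
distances = [0.1, 0.2, 0.3, 0.4, 0.9]
thresholds = breakpoint_thresholds(distances)

for name, threshold in thresholds.items():
    breakpoints = [i for i, d in enumerate(distances) if d > threshold]
    print(f"{name}: threshold={threshold:.3f}, breakpoints at {breakpoints}")
```

On this toy list the `standard_deviation` cut-off lands above every distance, so it would not split at all, while the other two isolate the single large jump.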

We'll use the `percentile` thresholding method for this example which will:

Calculate the distances between consecutive sentences, and then break the sequence of sentences wherever a distance exceeds a given percentile of all distances.
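
Before handing this over to `SemanticChunker`, here is a toy version of the percentile approach. It is a sketch under simplifying assumptions: the 2-D vectors are hand-written stand-ins for real embeddings, each sentence is compared only with its immediate neighbour, and `cosine_distance` and `semantic_chunks` are hypothetical helper names.

```python
import math
import statistics


def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def semantic_chunks(sentences, vectors, percentile=95):
    """Start a new chunk wherever the distance to the next sentence is unusually large."""
    distances = [
        cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)
    ]
    threshold = statistics.quantiles(distances, n=100, method="inclusive")[percentile - 1]

    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # semantic break: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Made-up sentences with hand-written "embeddings" (two topics).
sentences = [
    "Payments were reversed.",
    "My payment was undone.",
    "Nobody answers the phone.",
    "Calls go unanswered.",
]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

chunks = semantic_chunks(sentences, vectors)
print(chunks)
```

The only large distance is between the second and third sentences, so the four sentences end up in two chunks, one per topic.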

In [54]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    small_embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [55]:
%%time
semantic_documents = semantic_chunker.split_documents(loan_complaint_data[:20])

CPU times: user 304 ms, sys: 0 ns, total: 304 ms
Wall time: 8.09 s


Let's create a new vector store.

In [56]:
vectorstore.client.create_collection(
  collection_name="Loan_Complaint_Data_Semantic_Chunks",
  vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

semantic_vectorstore = Qdrant(
  client=vectorstore.client,     # ✅ Reuse existing client
  embeddings=small_embeddings,         # ✅ Reuse embeddings
  collection_name="Loan_Complaint_Data_Semantic_Chunks"
)

# Add documents after creation
_ = semantic_vectorstore.add_documents(semantic_documents)

We'll use naive retrieval for this example.

In [57]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [58]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [59]:
semantic_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, appears to be problems related to loan servicing and communication issues. These include difficulties with repayment, errors or delays in payment processing, issues with loan reporting and credit impact, and lack of clear information or transparency from loan servicers. Many complainants report being unable to get accurate or timely information about their loan status, repayment amounts, or servicing changes.'

In [60]:
semantic_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, several complaints indicate that they were handled in a timely manner, as the responses from companies were marked as "Closed with explanation" and responded to within the expected time frames. For example, complaints received on 04/28/25, 05/01/25, 05/04/25, and 05/09/25 all state that responses were timely. \n\nHowever, the first complaint from 05/04/25 involving Nelnet mentions that despite acknowledgment of receipt and response, the complaint involved serious unresolved issues such as unresponded to certified mail and ongoing violations—suggesting that some aspects of the complaint were not fully handled or resolved to the complainant\'s satisfaction.\n\nGiven the available data, there is no explicit mention of complaints that were not handled in a timely manner overall. Most responses are marked "Yes" for timely response. Nonetheless, some complaints involve ongoing issues or unresolved problems despite responses being sent.\n\nTherefore, based 

In [61]:
semantic_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans mainly due to issues such as mismanagement or lack of proper communication from lenders or servicers, disputes over the legitimacy of the debt, technical problems with payment processing, and inadequate or delayed re-amortization of payments after forbearance periods. Additionally, some individuals faced complications related to incorrect or disputed information on their credit reports, and there are reports of alleged improper or illegal reporting and collection practices by loan servicers. These factors can create barriers or uncertainties that hinder borrowers' ability or willingness to repay their loans promptly."

In [62]:
gc.collect()

1405

#### ❓ Question #3:

If sentences are short and highly repetitive (e.g., FAQs), how might semantic chunking behave, and how would you adjust the algorithm?

#### ✅ Answer:

Short and highly repetitive sentences create _minimal embedding distance_ variations, making it difficult to detect _meaningful semantic_ boundaries.

Threshold Type Behaviors:

1. "percentile" (95th percentile)

- Behavior: Creates very few chunks since most distances are similar
- Issue: May group unrelated FAQ topics together
- Adjustment: Lower to 75-85th percentile to increase sensitivity

2. "standard_deviation" (3σ)

- Behavior: Performs poorly due to low variance in short, similar sentences
- Issue: Creates massive chunks with no meaningful breaks
- Adjustment: Reduce to 1-2 standard deviations for more splitting

3. "interquartile" (IQR × 1.5)

- Behavior: Most robust for FAQs due to outlier resistance
- Issue: Still may miss subtle topic transitions
- Adjustment: Reduce scaling factor to 0.8-1.0

4. "gradient" (anomaly detection)

- Behavior: Best performer - detects subtle topic shifts in repetitive content
- Issue: May be overly sensitive to minor variations
- Adjustment: Fine-tune threshold to 85-90th percentile

Conclusion: Use "gradient" with _85th percentile_ + minimum chunk size constraints + keyword-based post-processing to ensure FAQ topics remain grouped appropriately despite repetitive language patterns.

# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against each other.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparison between them
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.
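
If you want a quick local sanity check on latency before looking at LangSmith, a small timing helper is enough. This is only a sketch: `measure_latency` and `fake_retriever` are hypothetical names, and it assumes each retriever or chain can be wrapped in a plain callable that takes a question.

```python
import statistics
import time


def measure_latency(invoke, questions, repeats=1):
    """Time `invoke(question)` for every question and summarise the wall-clock latency."""
    timings = []
    for question in questions:
        for _ in range(repeats):
            start = time.perf_counter()
            invoke(question)
            timings.append(time.perf_counter() - start)
    return {
        "calls": len(timings),
        "mean_s": statistics.mean(timings),
        "max_s": max(timings),
    }


def fake_retriever(question):
    """Placeholder standing in for a real retriever call."""
    return [question.upper()]


stats = measure_latency(
    fake_retriever,
    ["What is the most common issue with loans?", "Why did people fail to pay back their loans?"],
    repeats=2,
)
print(stats)
```

With a real chain you could pass something like `lambda q: chain.invoke({"question": q})` as `invoke`; cost still needs to come from LangSmith or your provider's usage data.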

In [63]:
### YOUR CODE HERE

In [64]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [65]:
from dotenv import load_dotenv

load_dotenv(dotenv_path="../.env")

True

In [66]:
def check_if_env_var_is_set(env_var_name: str, human_readable_string: str = "API Key"):
    api_key = os.getenv(env_var_name)

    if api_key:
        print(f"{env_var_name} is present")
    else:
        print(f"{env_var_name} is NOT present, paste key at the prompt:")
        os.environ[env_var_name] = getpass.getpass(f"Please enter your {human_readable_string}: ")

In [67]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
check_if_env_var_is_set("LANGCHAIN_API_KEY", "LangChain API key")
check_if_env_var_is_set("LANGSMITH_API_KEY", "LangSmith API key")
check_if_env_var_is_set("OPENAI_API_KEY", "OpenAI API key")

LANGCHAIN_API_KEY is present
LANGSMITH_API_KEY is present
OPENAI_API_KEY is present


In [68]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - ADwLC - {uuid4().hex[0:8]}"

In [69]:
# langsmith_project_name = os.environ["LANGCHAIN_PROJECT"]

In [70]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader

path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

In [71]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [72]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

In [73]:
from ragas.testset.graph import Node, NodeType
if not os.path.exists('loan_data_kg.json'):
    ### NOTICE: We're using a subset of the data for this example - this is to keep costs/time down.
    for doc in docs[:10]: ### 20
        kg.nodes.append(
            Node(
                type=NodeType.DOCUMENT,
                properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
            )
        )
kg

KnowledgeGraph(nodes: 0, relationships: 0)

In [74]:
gc.collect()

20

In [75]:
%%time
from ragas.testset.transforms import default_transforms, apply_transforms
transformer_llm = generator_llm
embedding_model = generator_embeddings

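if not os.path.exists('loan_data_kg.json'):
    # Use a new name so the imported `default_transforms` function is not shadowed.
    transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
    apply_transforms(kg, transforms)
else:
    # NOTE: `KnowledgeGraph.load` returns a new graph and the result is not assigned here,
    # so `kg` stays empty; the saved graph is loaded into `loan_data_kg` in the next cell.
    kg.load('loan_data_kg.json')
kg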

CPU times: user 1.52 s, sys: 721 ms, total: 2.24 s
Wall time: 2.2 s


KnowledgeGraph(nodes: 0, relationships: 0)

In [76]:
%%time
if not os.path.exists('loan_data_kg.json'):
    kg.save("loan_data_kg.json")
    
loan_data_kg = KnowledgeGraph.load("loan_data_kg.json")
loan_data_kg

CPU times: user 1.33 s, sys: 175 ms, total: 1.5 s
Wall time: 1.47 s


KnowledgeGraph(nodes: 43, relationships: 647)

In [77]:
gc.collect()

0

In [78]:
import psutil

# Check memory usage
process = psutil.Process(os.getpid())
memory_mb = process.memory_info().rss / 1024 / 1024
print(f"Memory usage before generation: {memory_mb:.1f} MB")

Memory usage before generation: 686.7 MB


In [79]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=loan_data_kg)

In [80]:
from ragas.testset.synthesizers import (
    SingleHopSpecificQuerySynthesizer,
    MultiHopAbstractQuerySynthesizer,
    MultiHopSpecificQuerySynthesizer,
)

query_distribution = [
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
        (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25)
]

In [81]:
%%time
testset = None
if not os.path.exists('golden-master.csv'):
    testset = generator.generate(testset_size=10, query_distribution=query_distribution)
    testset.to_pandas()

CPU times: user 66 μs, sys: 0 ns, total: 66 μs
Wall time: 71.3 μs


In [82]:
# Check memory usage
process = psutil.Process(os.getpid())
memory_mb = process.memory_info().rss / 1024 / 1024
print(f"Memory usage after generation: {memory_mb:.1f} MB")

Memory usage after generation: 686.7 MB


In [83]:
gc.collect()

0

In [84]:
process = psutil.Process(os.getpid())
memory_mb = process.memory_info().rss / 1024 / 1024
print(f"Memory usage after gc.collect(): {memory_mb:.1f} MB")

Memory usage after gc.collect(): 686.7 MB


In [85]:
import pandas as pd

In [86]:
if testset is not None:
    testset_df = testset.to_pandas()
    testset_df.to_csv('golden-master.csv', index=False)
else:
    testset_df = pd.read_csv('golden-master.csv')
testset_df

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What is the role of the School Participation D...,"['Chapter 1 Academic Years, Academic Calendars...",The context does not specify the exact role of...,single_hop_specifc_query_synthesizer
1,What does 34 CFR 668.3(a) specify regarding th...,['Regulatory Citations Academic year minimums:...,34 CFR 668.3(a) pertains to the minimum number...,single_hop_specifc_query_synthesizer
2,"In the context of Chapter 3, how does the incl...",['Inclusion of Clinical Work in a Standard Ter...,Inclusion of clinical work in a standard term ...,single_hop_specifc_query_synthesizer
3,Can you explain how Title IV regulations apply...,['Non-Term Characteristics A program that meas...,Payment periods under Title IV are applicable ...,single_hop_specifc_query_synthesizer
4,How are academic calendars defined for differe...,"['Chapter 1 Academic Years, Academic Calendars...",Academic calendars are defined for each progra...,single_hop_specifc_query_synthesizer
5,Wht are the regultory citatons for academic ye...,"['<1-hop>\n\nChapter 1 Academic Years, Academi...",The regulatory citations for academic years an...,multi_hop_abstract_query_synthesizer
6,Include clinical work in standard term periods...,['<1-hop>\n\nInclusion of Clinical Work in a S...,The context explains that clinical work conduc...,multi_hop_abstract_query_synthesizer
7,How does the impact of term length and measure...,['<1-hop>\n\nInclusion of Clinical Work in a S...,The inclusion of clinical work in standard ter...,multi_hop_abstract_query_synthesizer
8,How do Chapters 2 and 3 collectively address t...,"['<1-hop>\n\nChapter 1 Academic Years, Academi...",Chapter 2 details the requirements for establi...,multi_hop_specific_query_synthesizer
9,How do different academic years for programs a...,"['<1-hop>\n\nChapter 1 Academic Years, Academi...",Different academic years for various programs ...,multi_hop_specific_query_synthesizer


In [87]:
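from langsmith import Client
from urllib3.util.retry import Retry

langsmith_client = Client(
    timeout_ms=60000,  # 60 seconds
    retry_config=Retry(total=5),  # `retry_config` expects a urllib3 `Retry` object rather than a dict
)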

dataset_name = "Loan Synthetic Data (s09)"

existing_datasets = langsmith_client.list_datasets()
dataset_exists = any(dataset.name == dataset_name for dataset in existing_datasets)

if dataset_exists:
  langsmith_dataset = langsmith_client.read_dataset(dataset_name=dataset_name)
  print(f"Using existing dataset: {dataset_name}")
else:
  langsmith_dataset = langsmith_client.create_dataset(
      dataset_name=dataset_name,
      description="Loan Synthetic Data (for s09 exercise)"
  )
  print(f"Created new dataset: {dataset_name}")

Using existing dataset: Loan Synthetic Data (s09)


In [88]:
gc.collect()

0

In [89]:
for _, row in testset_df.iterrows():
  langsmith_client.create_example(
      inputs={
          "question": row["user_input"]
      },
      outputs={
          "answer": row["reference"]
      },
      metadata={
          "context": row["reference_contexts"]
      },
      dataset_id=langsmith_dataset.id
  )

In [90]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(docs)

In [91]:
from langchain_openai import OpenAIEmbeddings

In [92]:
from langchain_community.vectorstores import Qdrant

small_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
try:
    vectorstore.client.create_collection(
      collection_name="Loan RAG (semantic)",
      vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
    )
except Exception:
    # Most likely the collection already exists from an earlier run; reuse it.
    pass

semantic_vectorstore = Qdrant(
    client=vectorstore.client,     # ✅ Reuse existing client
    embeddings=small_embeddings,         # ✅ Reuse embeddings
    collection_name="Loan RAG (semantic)"
)

_ = semantic_vectorstore.add_documents(rag_documents)

In [93]:
# Retrieve from the collection we just populated with `rag_documents`.
retriever = semantic_vectorstore.as_retriever(search_kwargs={"k": 10})

In [94]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

In [95]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

In [96]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.schema import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)

## LangSmith Evaluation Set-up

In [97]:
eval_llm = ChatOpenAI(model="gpt-4.1")

In [98]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

qa_evaluator = LangChainStringEvaluator(
    "qa",
    config={"llm": eval_llm},
    prepare_data=lambda run, example: {
        "prediction": run.outputs["response"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)

labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        },
        "llm" : eval_llm
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["response"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)

empathy_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "empathy": "Is this response empathetic? Does it make the user feel like they are being heard?",
        },
        "llm": eval_llm 
    },
    prepare_data=lambda run, example: {
       "prediction": run.outputs["response"],
       "input": example.inputs["question"],
    }
)

## LangSmith Evaluation

## Dope-ifying Our Application

In [99]:
EMPATHY_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

You must answer the question using empathy and kindness, and make sure the user feels heard.

Context: {context}
Question: {question}
"""

empathy_rag_prompt = ChatPromptTemplate.from_template(EMPATHY_RAG_PROMPT)

In [100]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(docs)

In [101]:
from langchain_openai import OpenAIEmbeddings

large_embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

In [102]:
try:
    vectorstore.client.create_collection(
      collection_name="Loan Data for RAG",
      vectors_config=models.VectorParams(size=3072, distance=models.Distance.COSINE) ### was 1536
    )
except Exception:
    # Most likely the collection already exists from an earlier run; reuse it.
    pass

dope_app_vectorstore = Qdrant(
  client=vectorstore.client,     # ✅ Reuse existing client
  embeddings=large_embeddings,         # Larger embedding model (3072 dimensions)
  collection_name="Loan Data for RAG"
)

# Add documents after creation
_ = dope_app_vectorstore.add_documents(rag_documents)

In [103]:
# Retrieve from the collection we just populated with the larger chunks.
retriever = dope_app_vectorstore.as_retriever()

In [104]:
empathy_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | empathy_rag_prompt | llm | StrOutputParser()
)

In [105]:
gc.collect()

829

### Retriever Evaluation

#### Naive Retrieval Chain

In [108]:
from tqdm.notebook import tqdm

In [163]:
pipeline_stages_folder_name = ".pipeline-stages"
os.makedirs(pipeline_stages_folder_name, exist_ok=True)
def write_to_file(filename: str, content: str):
    # The `with` block closes the file automatically, even if the write fails.
    with open(f"{pipeline_stages_folder_name}/{filename}", 'w') as text_file:
        text_file.write(content)

## Evaluation and Performance Analysis

Now that we have evaluation data from LangSmith, let's analyze the performance of different retrievers across multiple dimensions: **Performance**, **Cost**, and **Latency**.
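
One simple way to combine the three dimensions into a single ranking is to min-max normalise each metric and take a weighted sum, inverting cost and latency so that lower is better. The sketch below is only an illustration of that idea: `rank_retrievers`, the weights, and the numbers are all made up, and it is not the scoring used by the analyzer in the cells that follow.

```python
def min_max(values):
    """Scale values to [0, 1]; if they are all equal, treat them as equally good."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    return [(value - lo) / (hi - lo) for value in values]


def rank_retrievers(metrics, weights=(0.5, 0.25, 0.25)):
    """Rank retrievers by weighted performance, cost, and latency (best first)."""
    names = list(metrics)
    performance = min_max([metrics[name]["performance"] for name in names])
    cost = min_max([metrics[name]["cost"] for name in names])
    latency = min_max([metrics[name]["latency"] for name in names])

    w_perf, w_cost, w_latency = weights
    scores = {
        name: w_perf * p + w_cost * (1 - c) + w_latency * (1 - l)  # lower cost/latency is better
        for name, p, c, l in zip(names, performance, cost, latency)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder numbers for illustration only, not measured results.
example_metrics = {
    "retriever_a": {"performance": 0.8, "cost": 0.001, "latency": 1.0},
    "retriever_b": {"performance": 0.9, "cost": 0.010, "latency": 4.0},
    "retriever_c": {"performance": 0.7, "cost": 0.003, "latency": 0.5},
}

ranking = rank_retrievers(example_metrics)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The weights encode what matters for your use case; shifting weight from performance to latency can change which retriever comes out on top.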

In [110]:
# !uv sync

### Option 1: Quick Analysis (Recommended)

### 🎯 Final Solution - Dataset-Based Analysis

Based on the working LangSmith pattern, this approach uses the dataset name to find evaluation sessions and extract runs properly.

In [2]:
%%time
# Run complete performance analysis
# This will:
# 1. Extract evaluation results from LangSmith
# 2. Analyze performance, cost, and latency metrics
# 3. Generate rankings and recommendations
# 4. Create visualizations
# 5. Save detailed reports

# Use the dataset name from your LangSmith evaluation
# dataset_name = "Loan Synthetic Data (s09)"

# This approach:
# 1. Gets dataset ID from dataset name
# 2. Finds evaluation sessions that used the dataset
# 3. Extracts all runs from those sessions
# 4. Groups runs by retriever names
# 5. Analyzes performance, cost, and latency

# analyzer = analyze_retriever_performance(dataset_name=dataset_name, project_name_prefix="AIM - ADwLC - 2b1903b4", days_back=7, save_report=True)
# display(analyzer)
# print(f"Ready to analyze dataset: {dataset_name}")

# print("Performance analysis ready")

🚀 Starting retriever performance analysis...
📊 Dataset: Loan Synthetic Data (s09)
📅 Looking back: 7 days
--------------------------------------------------
=== RETRIEVER PERFORMANCE ANALYSIS ===
Getting evaluation runs for dataset: Loan Synthetic Data (s09)
Dataset approach failed: Failed to POST /runs/query in LangSmith API. HTTPError('400 Client Error: Bad Request for url: https://api.smith.langchain.com/runs/query', '{"detail":"At least one of \'session\', \'id\', \'parent_run\', \'trace\' or \'reference_example\' must be specified"}')
Fallback: Getting all recent runs... (this may take a bit of time)
Filtered to 1 projects matching prefix 'AIM - ADwLC - 2b1903b4'


✅ Fallback found 273 total runs
Analyzing runs: 273 runs
  -> Processed 273 runs, avg_cost: $0.0002, avg_latency: 2.11s

❌ Error during analysis: object of type 'NoneType' has no len()
Error type: <class 'TypeError'>


<performance_analysis.RetrieverPerformanceAnalyzer at 0x7f6615c1f390>

Ready to analyze dataset: Loan Synthetic Data (s09)
Performance analysis ready
CPU times: user 199 ms, sys: 47 ms, total: 246 ms
Wall time: 6.27 s


### Option 2: Step-by-Step Analysis

In [112]:
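# Initialize the analyzer (defined in the local `performance_analysis` module)
from performance_analysis import RetrieverPerformanceAnalyzer

analyzer = RetrieverPerformanceAnalyzer()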

# Step 1: Extract experiment data
# print("Step 1: Extracting experiment data from LangSmith...")

In [113]:
# Uncomment to extract data:
# results_df = analyzer.analyze_all_retrievers(days_back=7)
# print(f"Found data for {len(results_df)} experiments")
# results_df.head()

# print("Data extraction ready - uncomment above to run")

In [114]:
# Step 2: Create performance summary
# Uncomment to run:
# summary = analyzer.create_performance_summary()
# for retriever, metrics in summary.items():
#     print(f"\n{retriever}:")
#     print(f"  - Avg Cost per Run: ${metrics.get('avg_cost_per_run', 0):.4f}")
#     print(f"  - Avg Latency: {metrics.get('avg_latency', 0):.2f}s")
#     print(f"  - QA Score: {metrics.get('qa_score', 'N/A')}")

# print("Performance summary ready - uncomment above to run")

In [115]:
# Step 3: Generate rankings
# Uncomment to run:
# rankings = analyzer.rank_retrievers()

# print("🏆 RANKINGS 🏆")
# print("\n📈 By Performance (Higher is Better):")
# for i, item in enumerate(rankings['by_performance'][:3], 1):
#     print(f"  {i}. {item['name']}: {item['performance']:.3f}")

# print("\n💰 By Cost (Lower is Better):")
# for i, item in enumerate(rankings['by_cost'][:3], 1):
#     print(f"  {i}. {item['name']}: ${item['cost']:.4f}")

# print("\n⚡ By Latency (Lower is Better):")
# for i, item in enumerate(rankings['by_latency'][:3], 1):
#     print(f"  {i}. {item['name']}: {item['latency']:.2f}s")

# print("\n🎯 Overall Ranking (Weighted):")
# for i, item in enumerate(rankings['by_overall'][:3], 1):
#     print(f"  {i}. {item['name']}: {item['overall_score']:.3f}")

# print("Rankings ready - uncomment above to run")

In [116]:
# Step 4: Generate full report and save results
# Uncomment to run:
# report = analyzer.generate_analysis_report()
# print(report)

# # Save all results
# analyzer.save_analysis("retriever_analysis_report.md")

# print("Report generation ready - uncomment above to run")

In [117]:
# Step 5: Create visualizations
# Uncomment to run:
# analyzer.create_visualizations()

# print("Visualization creation ready - uncomment above to run")

### Analysis Framework

The performance analysis script evaluates each retriever along three dimensions and then combines them into an overall ranking:

#### 📊 **Performance Metrics**
- **QA Score**: Correctness of answers based on reference ground truth
- **Helpfulness Score**: How helpful responses are to users
- **Empathy Score**: Empathetic quality of responses

#### 💰 **Cost Analysis**
- **Total Cost**: Cumulative cost across all evaluation runs
- **Cost per Run**: Average cost per individual evaluation
- **Token Usage**: Input and output token consumption
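
As a rough illustration of how these three cost figures relate to each other, the sketch below aggregates a list of per-run records. The `RunUsage` record, its field names, and the numbers are all hypothetical and chosen for this example; they are not taken from the analysis script.

```python
from dataclasses import dataclass


@dataclass
class RunUsage:
    """Hypothetical per-run record; the field names are assumptions."""
    input_tokens: int
    output_tokens: int
    cost_usd: float


def summarize_cost(runs: list[RunUsage]) -> dict[str, float]:
    """Aggregate total cost, average cost per run, and token usage."""
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "total_cost": total_cost,
        "cost_per_run": total_cost / len(runs) if runs else 0.0,
        "total_input_tokens": sum(r.input_tokens for r in runs),
        "total_output_tokens": sum(r.output_tokens for r in runs),
    }


# Made-up numbers, purely for illustration.
runs = [RunUsage(4000, 250, 0.00014), RunUsage(3800, 270, 0.00013)]
cost_summary = summarize_cost(runs)
print(cost_summary)
```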

#### ⚡ **Latency Analysis**
- **Average Latency**: Mean response time per retriever
- **Total Processing Time**: Cumulative time across evaluations
- **Latency Distribution**: Variation in response times
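
The latency figures can be sketched in the same way. The helper below is a minimal example using only the standard library; `summarize_latency` is a hypothetical name, the input values are made up, and the population standard deviation is just one possible way to describe the distribution.

```python
import statistics


def summarize_latency(latencies_sec: list[float]) -> dict[str, float]:
    """Return mean, total, and spread of per-run latencies in seconds."""
    return {
        "avg_latency": statistics.mean(latencies_sec),
        "total_time": sum(latencies_sec),
        # Population standard deviation as one simple measure of spread.
        "stdev": statistics.pstdev(latencies_sec),
        "min": min(latencies_sec),
        "max": max(latencies_sec),
    }


latencies = [2.0, 4.0, 6.0]  # made-up values for illustration
latency_summary = summarize_latency(latencies)
print(latency_summary)
```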

#### 🎯 **Overall Ranking**
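The overall ranking combines the three dimensions with fixed weights:
- **40%** Performance (accuracy/quality)
- **30%** Cost efficiency
- **30%** Speed/latency

Weighting answer quality most heavily, while still accounting for cost and speed, gives a balanced view across different use cases and requirements.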
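
One way to picture the 40/30/30 weighting is the sketch below. It assumes each dimension is min-max normalized to the range 0 to 1, with cost and latency inverted so that lower values score higher. That normalization is an assumption made for this example, not necessarily what the analysis script does, and the retriever names and numbers are made up.

```python
WEIGHTS = {"performance": 0.4, "cost": 0.3, "latency": 0.3}


def min_max(values, higher_is_better=True):
    """Scale values to [0, 1]; invert when lower raw values are better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def rank(retrievers):
    """Return (name, overall_score) pairs sorted from best to worst."""
    names = list(retrievers)
    perf = min_max([retrievers[n]["performance"] for n in names])
    cost = min_max([retrievers[n]["cost"] for n in names], higher_is_better=False)
    lat = min_max([retrievers[n]["latency"] for n in names], higher_is_better=False)
    scored = [
        (n, WEIGHTS["performance"] * p + WEIGHTS["cost"] * c + WEIGHTS["latency"] * l)
        for n, p, c, l in zip(names, perf, cost, lat)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical retrievers with made-up metrics.
example = {
    "retriever_a": {"performance": 0.8, "cost": 0.0004, "latency": 4.0},
    "retriever_b": {"performance": 0.6, "cost": 0.0001, "latency": 2.0},
    "retriever_c": {"performance": 0.4, "cost": 0.0002, "latency": 3.0},
}
ranking = rank(example)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this made-up example the cheapest and fastest retriever ranks first even though it is not the most accurate, which shows how the cost and latency weights can outvote the performance weight.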

## 🎯 Direct Analysis from Evaluate Results

The most accurate approach is to analyze the results returned directly by the `evaluate()` calls. This gives us immediate access to all metrics without having to query LangSmith again.

### Method 1: Collect Results During Evaluation

Modify the evaluation loop so that it keeps the object returned by each `evaluate()` call instead of discarding it:

In [168]:
# Import the helpers that cache evaluate() results to disk and load them back
from evaluation_cache import save_evaluation_results, load_evaluation_results

print("✅ Imported evaluate results analyzer")

✅ Imported evaluate results analyzer


### Method 2: Analyze Collected Results

In [167]:
retriever_chains_list = {
    "naive_retrieval_chain" : naive_retrieval_chain,
    "bm25_retrieval_chain": bm25_retrieval_chain,
    "contextual_compression_retrieval_chain": contextual_compression_retrieval_chain,
    "multi_query_retrieval_chain": multi_query_retrieval_chain,
    "parent_document_retrieval_chain": parent_document_retrieval_chain,
    "ensemble_retrieval_chain": ensemble_retrieval_chain,
    "semantic_retrieval_chain": semantic_retrieval_chain
}

if not os.path.exists("evaluation_results.pkl"):
    evaluation_results = {}
    retriever_eval_progress_bar = tqdm(retriever_chains_list)
    for retriever_chain in retriever_eval_progress_bar:
        gc.collect()
        chain_stage_filename = f"{pipeline_stages_folder_name}/{retriever_chain}"
        if os.path.exists(chain_stage_filename):
            print(f"{retriever_chain} already processed, skipping to the next one...")
            continue
    
        retriever_eval_progress_bar.set_description(retriever_chain, refresh=True)
        chain_to_invoke = retriever_chains_list[retriever_chain]
        try:
            result = evaluate(
                chain_to_invoke.invoke,
                data=dataset_name,
                evaluators=[qa_evaluator, labeled_helpfulness_evaluator, empathy_evaluator],
                num_repetitions=3,
                max_concurrency=1,  # Limit concurrent requests to avoid rate limits
                metadata={"revision_id": retriever_chain},
                experiment_prefix=retriever_chain,
            )
            
            # Store the result for analysis
            evaluation_results[retriever_chain] = result
            write_to_file(retriever_chain, f"revision_id: {retriever_chain}")
            print(f"Finished evaluating and saving {retriever_chain} moving to the next one...")
        except Exception as ex:
            print(f"Failed to run evaluation on the {retriever_chain}, due to {ex}, skipping to the next one...")
            continue

    save_evaluation_results(evaluation_results, "evaluation_results.pkl")
        

View the evaluation results for experiment: 'naive_retrieval_chain-b210c0d3' at:
https://smith.langchain.com/o/4a563880-75b7-483f-b9cd-cf740f81427b/datasets/8fab2835-2938-4c6a-9e6d-55e84c59a784/compare?selectedSessions=17ee44f7-5020-4528-ad21-c18d738d39d2




Finished evaluating and saving naive_retrieval_chain moving to the next one...
View the evaluation results for experiment: 'bm25_retrieval_chain-0061bbf8' at:
https://smith.langchain.com/o/4a563880-75b7-483f-b9cd-cf740f81427b/datasets/8fab2835-2938-4c6a-9e6d-55e84c59a784/compare?selectedSessions=41364f12-b61c-4946-a574-b1e2115149fb




Finished evaluating and saving bm25_retrieval_chain moving to the next one...
View the evaluation results for experiment: 'contextual_compression_retrieval_chain-af218c41' at:
https://smith.langchain.com/o/4a563880-75b7-483f-b9cd-cf740f81427b/datasets/8fab2835-2938-4c6a-9e6d-55e84c59a784/compare?selectedSessions=f38840d0-8041-4237-adc5-cba35b0146c9




Error running target function: status_code: 429, body: data=None id='11c8dea4-283c-4f1d-8c57-1f7dc440885c' message="You are using a Trial key, which is limited to 10 API calls / minute. You can continue to use the Trial key for free or upgrade to a Production key with higher rate limits at 'https://dashboard.cohere.com/api-keys'. Contact us on 'https://discord.gg/XW44jPfYJu' or email us at support@cohere.com with any questions"
Traceback (most recent call last):
  File "/home/AIE7/09_Advanced_Retrieval/.venv/lib/python3.13/site-packages/langsmith/evaluation/_runner.py", line 1907, in _forward
    fn(*args, langsmith_extra=langsmith_extra)
    ~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/AIE7/09_Advanced_Retrieval/.venv/lib/python3.13/site-packages/langchain_core/runnables/base.py", line 3044, in invoke
    input_ = context.run(step.invoke, input_, config, **kwargs)
  File "/home/AIE7/09_Advanced_Retrieval/.venv/lib/python3.13/site-packages/langchain_core/runnables/base.py",

Finished evaluating and saving contextual_compression_retrieval_chain moving to the next one...
View the evaluation results for experiment: 'multi_query_retrieval_chain-aafc4de1' at:
https://smith.langchain.com/o/4a563880-75b7-483f-b9cd-cf740f81427b/datasets/8fab2835-2938-4c6a-9e6d-55e84c59a784/compare?selectedSessions=81f3e562-015b-4e9e-ae40-953c774f6af0




Finished evaluating and saving multi_query_retrieval_chain moving to the next one...
View the evaluation results for experiment: 'parent_document_retrieval_chain-586d81dc' at:
https://smith.langchain.com/o/4a563880-75b7-483f-b9cd-cf740f81427b/datasets/8fab2835-2938-4c6a-9e6d-55e84c59a784/compare?selectedSessions=a6a41dbc-08d9-4f89-909d-6645b35a33dc




Finished evaluating and saving parent_document_retrieval_chain moving to the next one...
View the evaluation results for experiment: 'ensemble_retrieval_chain-b6f6d0c5' at:
https://smith.langchain.com/o/4a563880-75b7-483f-b9cd-cf740f81427b/datasets/8fab2835-2938-4c6a-9e6d-55e84c59a784/compare?selectedSessions=eb81b1ac-69a9-4543-9ae0-fa9e3fc56b1d




Finished evaluating and saving ensemble_retrieval_chain moving to the next one...
View the evaluation results for experiment: 'semantic_retrieval_chain-14786ed9' at:
https://smith.langchain.com/o/4a563880-75b7-483f-b9cd-cf740f81427b/datasets/8fab2835-2938-4c6a-9e6d-55e84c59a784/compare?selectedSessions=b05b2aa5-c596-4c5c-8efa-dc49fbbd8d33




Finished evaluating and saving semantic_retrieval_chain moving to the next one...
🔄 Extracting serializable data from evaluation results...
   Processing naive_retrieval_chain...
   Processing bm25_retrieval_chain...
   Processing contextual_compression_retrieval_chain...
   Processing multi_query_retrieval_chain...
   Processing parent_document_retrieval_chain...
   Processing ensemble_retrieval_chain...
   Processing semantic_retrieval_chain...
✅ Evaluation results cached to: evaluation_results.pkl
📋 JSON version saved to: evaluation_results.json
📋 Metadata saved to: evaluation_results_metadata.json
📊 Cached 7 retriever evaluations
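
The 429 error in the output above comes from the Cohere trial key, which the message says is limited to 10 API calls per minute. Setting `max_concurrency=1` limits how many requests run at once, but it does not space them out in time. A minimal sketch of one way to do that is shown below; `rate_limited` is a hypothetical helper written for this example, not part of LangSmith or Cohere, and the `clock` and `sleep` parameters exist only so the behavior can be tested without waiting.

```python
import time
from typing import Any, Callable


def rate_limited(
    fn: Callable[..., Any],
    calls_per_minute: int,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[..., Any]:
    """Wrap fn so consecutive calls start at least 60/calls_per_minute seconds apart."""
    min_interval = 60.0 / calls_per_minute
    last_call = None  # start time of the previous call, if any

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        nonlocal last_call
        now = clock()
        if last_call is not None:
            wait = min_interval - (now - last_call)
            if wait > 0:
                # Too soon after the previous call: pause for the remainder.
                sleep(wait)
                now = clock()
        last_call = now
        return fn(*args, **kwargs)

    return wrapper
```

In the evaluation loop this might be used as `evaluate(rate_limited(chain_to_invoke.invoke, calls_per_minute=10), ...)`, assuming each chain invocation makes at most one Cohere call. If a chain makes several calls per invocation, a lower `calls_per_minute` value would be needed.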


In [154]:
from performance_analysis import analyze_from_evaluate_results

In [190]:
cached_results = load_evaluation_results("evaluation_results.pkl")

# # Run the analysis
# analyzer = analyze_from_evaluate_results(cached_results)

# # Save comprehensive results
# analyzer.save_analysis("retriever_performance_from_evaluate.md")

# # Access specific data
# results_df = analyzer.results_df
# summary = analyzer.analysis_summary
# rankings = analyzer.rank_retrievers()

# print("Analysis complete!")
# print(f"Results DataFrame shape: {results_df.shape}")
# print(f"Retrievers analyzed: {list(summary.keys())}")

# print("Analysis code ready - uncomment when evaluation_results is populated")

✅ Loaded evaluation results from: evaluation_results.pkl
📅 Cached on: 2025-07-27T02:59:55.289381
📊 Found 7 retriever evaluations:
   - naive_retrieval_chain
   - bm25_retrieval_chain
   - contextual_compression_retrieval_chain
   - multi_query_retrieval_chain
   - parent_document_retrieval_chain
   - ensemble_retrieval_chain
   - semantic_retrieval_chain


In [191]:
cached_results['naive_retrieval_chain'].keys()

dict_keys(['_manager', '_results', '_queue', '_processing_complete', '_thread', '_summary_results'])

In [244]:
from performance_analysis import analyze_retrievers
df = analyze_retrievers(evaluation_results)
# df = analyze_retrievers(cached_results)
df

🚀 Analyzing 7 retrievers...
✅ Processed 33 runs for naive_retrieval_chain
✅ Processed 33 runs for bm25_retrieval_chain
✅ Processed 33 runs for contextual_compression_retrieval_chain
✅ Processed 33 runs for multi_query_retrieval_chain
✅ Processed 33 runs for parent_document_retrieval_chain
✅ Processed 33 runs for ensemble_retrieval_chain
✅ Processed 33 runs for semantic_retrieval_chain

RETRIEVER PERFORMANCE SUMMARY

📊 COMPLETE RESULTS:
             Retriever  Total_Runs  Avg_Cost_Per_Run  Total_Cost  Total_Input_Tokens  Total_Output_Tokens  Avg_Input_Tokens_Per_Run  Avg_Output_Tokens_Per_Run  Avg_Latency_Sec  Total_Latency_Sec  QA_Avg_Score  QA_Success_Rate  QA_Min  QA_Max  Helpfulness_Avg_Score  Helpfulness_Success_Rate  Helpfulness_Min  Helpfulness_Max  Empathy_Avg_Score  Empathy_Success_Rate  Empathy_Min  Empathy_Max  Correctness_Avg_Score  Correctness_Success_Rate  Correctness_Min  Correctness_Max
       Parent Document          33          0.000132    0.004347              127620 

| | Retriever | Total_Runs | Avg_Cost_Per_Run | Total_Cost | Total_Input_Tokens | Total_Output_Tokens | Avg_Input_Tokens_Per_Run | Avg_Output_Tokens_Per_Run | Avg_Latency_Sec | Total_Latency_Sec | QA_Avg_Score | QA_Success_Rate | QA_Min | QA_Max | Helpfulness_Avg_Score | Helpfulness_Success_Rate | Helpfulness_Min | Helpfulness_Max | Empathy_Avg_Score | Empathy_Success_Rate | Empathy_Min | Empathy_Max | Correctness_Avg_Score | Correctness_Success_Rate | Correctness_Min | Correctness_Max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | Parent Document | 33 | 0.000132 | 0.004347 | 127620 | 8640 | 3867.272727 | 261.818182 | 4.05 | 133.59 | 0.0 | 0.0 | 0.0 | 0.0 | 0.152 | 0.152 | 0 | 1 | 0.0 | 0.0 | 0 | 0 | 0.788 | 0.788 | 0 | 1 |
| 5 | Ensemble | 33 | 0.000656 | 0.021653 | 702660 | 9556 | 21292.727273 | 289.575758 | 8.39 | 276.73 | 0.0 | 0.0 | 0.0 | 0.0 | 0.152 | 0.152 | 0 | 1 | 0.121 | 0.121 | 0 | 1 | 0.758 | 0.758 | 0 | 1 |
| 2 | Contextual Compression | 33 | 2.8e-05 | 0.000917 | 25486 | 2546 | 772.30303 | 77.151515 | 1.56 | 51.57 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | 0.7 | 0.7 | 0 | 1 |
| 3 | Multi Query | 33 | 0.000468 | 0.015428 | 495689 | 9289 | 15020.878788 | 281.484848 | 6.34 | 209.17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.061 | 0.061 | 0 | 1 | 0.091 | 0.091 | 0 | 1 | 0.697 | 0.697 | 0 | 1 |
| 0 | Naive | 33 | 0.00028 | 0.009245 | 290671 | 8741 | 8808.212121 | 264.878788 | 3.93 | 129.79 | 0.0 | 0.0 | 0.0 | 0.0 | 0.121 | 0.121 | 0 | 1 | 0.061 | 0.061 | 0 | 1 | 0.606 | 0.606 | 0 | 1 |
| 1 | Bm25 | 33 | 0.000125 | 0.004139 | 124071 | 6941 | 3759.727273 | 210.333333 | 3.12 | 102.94 | 0.0 | 0.0 | 0.0 | 0.0 | 0.091 | 0.091 | 0 | 1 | 0.091 | 0.091 | 0 | 1 | 0.545 | 0.545 | 0 | 1 |
| 6 | Semantic | 33 | 0.000225 | 0.007409 | 233058 | 6953 | 7062.363636 | 210.69697 | 3.11 | 102.62 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0.121 | 0.121 | 0 | 1 | 0.424 | 0.424 | 0 | 1 |
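
The per-run columns in the table above appear to be the corresponding totals divided by `Total_Runs`. The short check below reproduces the averages for the Parent Document row from its totals; the variable names are chosen for this example, and all input values are copied from that row.

```python
# Values copied from the "Parent Document" row of the table above.
total_runs = 33
total_cost = 0.004347
total_input_tokens = 127620
total_output_tokens = 8640
total_latency_sec = 133.59

# Each per-run average is the matching total divided by the number of runs.
avg_cost_per_run = total_cost / total_runs
avg_input_tokens = total_input_tokens / total_runs
avg_output_tokens = total_output_tokens / total_runs
avg_latency_sec = total_latency_sec / total_runs

print(f"Avg_Cost_Per_Run:          {avg_cost_per_run:.6f}")
print(f"Avg_Input_Tokens_Per_Run:  {avg_input_tokens:.6f}")
print(f"Avg_Output_Tokens_Per_Run: {avg_output_tokens:.6f}")
print(f"Avg_Latency_Sec:           {avg_latency_sec:.2f}")
```

Rounded to the precision shown in the table, these match the reported 0.000132, 3867.272727, 261.818182, and 4.05.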


