# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up Qdrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

We'll also provide our OpenAI key, as well as our Cohere API key.

In [1]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [2]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

In [3]:
# Get LangSmith API key
langsmith_key = getpass.getpass("LangSmith API Key:")
os.environ["LANGSMITH_API_KEY"] = langsmith_key

## Task 2: Data Collection and Preparation

We'll be using our Loan Data once again - this time the structured data available through the CSV!

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

In [4]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

loader = CSVLoader(
    file_path="./data/complaints.csv",
    metadata_columns=[
      "Date received", 
      "Product", 
      "Sub-product", 
      "Issue", 
      "Sub-issue", 
      "Consumer complaint narrative", 
      "Company public response", 
      "Company", 
      "State", 
      "ZIP code", 
      "Tags", 
      "Consumer consent provided?", 
      "Submitted via", 
      "Date sent to company", 
      "Company response to consumer", 
      "Timely response?", 
      "Consumer disputed?", 
      "Complaint ID"
    ]
)

loan_complaint_data = loader.load()

for doc in loan_complaint_data:
    doc.page_content = doc.metadata["Consumer complaint narrative"]

Let's look at an example document to see if everything worked as expected!

In [5]:
loan_complaint_data[0]

Document(metadata={'source': './data/complaints.csv', 'row': 0, 'Date received': '03/27/25', 'Product': 'Student loan', 'Sub-product': 'Federal student loan servicing', 'Issue': 'Dealing with your lender or servicer', 'Sub-issue': 'Trouble with how payments are being handled', 'Consumer complaint narrative': "The federal student loan COVID-19 forbearance program ended in XX/XX/XXXX. However, payments were not re-amortized on my federal student loans currently serviced by Nelnet until very recently. The new payment amount that is effective starting with the XX/XX/XXXX payment will nearly double my payment from {$180.00} per month to {$360.00} per month. I'm fortunate that my current financial position allows me to be able to handle the increased payment amount, but I am sure there are likely many borrowers who are not in the same position. The re-amortization should have occurred once the forbearance ended to reduce the impact to borrowers.", 'Company public response': 'None', 'Company'

## Task 3: Setting up Qdrant!

Now that we have our documents, let's create a Qdrant VectorStore with the collection name "LoanComplaints".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [6]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    loan_complaint_data,
    embeddings,
    location=":memory:",
    collection_name="LoanComplaints"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply treat each complaint as a document, and use cosine similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later
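To make "cosine similarity plus top-`k`" concrete, here is a minimal pure-Python sketch. It is not how Qdrant implements search; the vectors are made-up 3-dimensional stand-ins for real embeddings, and `top_k`, `toy_doc_vecs`, and `toy_query_vec` are hypothetical names used only for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]

# Toy "embeddings" (assumed values, not real model output).
toy_doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_vec = [1.0, 0.05, 0.0]

# Indices of the two documents pointing in nearly the same direction as the query.
nearest = top_k(toy_query_vec, toy_doc_vecs, k=2)
```

A real vector store uses an approximate nearest-neighbour index rather than scoring every document, but the ranking criterion is the same idea.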

In [7]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [8]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [9]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [10]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the naive_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : RunnablePassthrough.assign passes the incoming dict through unchanged and re-assigns
    #              "context" to the value it already holds from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry-picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [11]:
naive_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, appears to be problems related to dealing with lenders or servicers, including errors in loan balances, misapplied payments, wrongful denials of payment plans, and issues with how payments are being handled. Many complaints involve mismanagement, inaccurate information, and difficulties in making or applying payments correctly.'

In [12]:
naive_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

"Based on the provided information, yes, some complaints were not handled in a timely manner. Specifically, at least one complaint was marked as 'No' in the 'Timely response?' field, indicating it was not responded to within the expected timeframe. For example, the complaint with Complaint ID '12709087' submitted to MOHELA on 03/28/25 was not handled in a timely manner."

In [13]:
naive_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans mainly due to a combination of factors such as ongoing interest accumulation even during forbearance or deferment periods, lack of clear communication from loan servicers about payment resumption or delinquency status, and the inability to afford increased payments without jeopardizing their basic living expenses. Additionally, some borrowers faced issues like being misled about their repayment obligations, being unaware of loan transfers between companies, or experiencing difficulties in applying extra payments to principal, which extended the duration and cost of repayment. These challenges, coupled with financial hardships, stagnant wages, and limited access to loan forgiveness programs, contributed to their inability to fully repay their student loans.'

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.
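As a rough illustration of that idea, the sketch below scores documents with the classic Okapi BM25 formula in plain Python. It assumes whitespace tokenization and the commonly used defaults `k1=1.5` and `b=0.75`; it is not necessarily the exact variant or tokenizer the library uses, and `bm25_scores` and `toy_docs` are hypothetical names.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a larger inverse-document-frequency weight.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Term frequency saturates (k1) and is normalized by document length (b).
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores

toy_docs = [
    "autopay was cancelled without notice",
    "my payment was applied to the wrong loan",
    "the servicer never answered my calls",
]
toy_scores = bm25_scores("autopay cancelled", toy_docs)
best_doc = toy_scores.index(max(toy_scores))
```

Documents that share no words with the query score exactly zero, which is the main contrast with embedding-based retrieval.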

This retriever is very straightforward to set up! Let's see it happen down below!


In [14]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(loan_complaint_data)

We'll construct the same chain - only changing the retriever.

In [15]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [16]:
bm25_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, the most common issue with loans appears to be problems related to dealing with lenders or servicers, including issues such as incorrect or bad information about the loan, difficulty in managing payments, and disputes over fees or loan details. Multiple complaints involve challenges with understanding or verifying loan balances, issues with loan application or repayment processes, and inaccuracies or bad communication from loan servicers.'

In [17]:
bm25_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, all the complaints indicated that the companies responded timely. Specifically, the complaints from 04/26/25, 04/01/25, 04/24/25, and 05/08/25 all note that the companies responded with a "Closed with explanation" status and specify "Timely response? Yes." Therefore, there is no evidence in the provided data that any complaints were not handled in a timely manner.'

In [18]:
bm25_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for various reasons, including difficulties with their payment plans, miscommunication or lack of communication from lenders or servicers, problems with automated payments, and issues with understanding or receiving information about their loans. Some specific examples include being steered into incorrect forbearance options, having their autopayments unexpectedly discontinued without proper notification, and experiencing errors or delays in response from loan servicers when requesting deferments or forbearances. Additionally, there are cases where borrowers believe they fulfilled all requirements for discharge or repayment assistance but did not receive proper acknowledgment or help, leading to continued or increased debt. Overall, these issues often stem from administrative errors, poor communication, or alleged deceptive practices by loan servicers.'

It's not clear whether this is better or worse. If only we had a way to test this! (SPOILERS: We do - the second half of the notebook will cover this.)

#### ❓ Question #1:

Give an example query where BM25 is better than embeddings and justify your answer.

##### ✅ Answer:

**Example Query: "Why did people fail to pay back their loans?"**

Based on the invocations in this notebook, BM25 performs better than embeddings for this query.

**Comparison of responses:**

**Naive Retrieval (Embeddings):** Provided a broad, conceptual response covering systemic issues, miscommunication, and financial constraints but was more general in nature.

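**BM25 Retrieval:** Delivered more specific, actionable details, including borrowers:
- being steered into incorrect forbearance options
- having their autopayments unexpectedly discontinued without proper notification
- experiencing errors or delays in responses from loan servicers when requesting deferments or forbearances
- believing they had fulfilled all requirements for discharge or repayment assistance without receiving proper acknowledgment or help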

**Why BM25 is better here:**

1. **Exact Term Matching**: BM25 excels at finding documents containing specific financial and procedural terminology like "autopay", "forbearances", "capitalized interest" - terms that are crucial for understanding concrete loan servicing problems.

2. **Factual Precision**: The query asks for specific reasons why payments failed. BM25's keyword-based approach captures precise procedural failures and technical issues that directly answer the "why" question.

3. **Domain-Specific Language**: In financial/loan contexts, exact terminology matters immensely. BM25's ability to match specific loan servicing terms provides more actionable and legally/procedurally accurate information than semantic similarity alone.

4. **Reduced Semantic Drift**: Embeddings might retrieve conceptually similar but less precise content, whereas BM25 stays focused on documents containing the exact operational terms that explain payment failures.


## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.
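The two steps above can be sketched in plain Python as follows. The scoring function here is a deliberately crude word-overlap score standing in for a learned reranking model; `retrieve_then_rerank`, `toy_overlap_score`, and `toy_candidates` are hypothetical names, and nothing here reflects how Cohere's model scores documents.

```python
def retrieve_then_rerank(query, candidates, score_fn, top_n):
    """Score each candidate against the query and keep the top_n highest."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_n]

def toy_overlap_score(query, doc):
    """Hypothetical relevance score: fraction of query words present in the doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words)

# Pretend these four documents came back from a first-stage retriever.
toy_candidates = [
    "the servicer never answered my calls",
    "my payment was late because autopay was cancelled",
    "late payment reported to the credit bureau",
    "interest kept growing during forbearance",
]

# "Compress" four candidates down to the two most relevant ones.
compressed = retrieve_then_rerank(
    "late payment autopay", toy_candidates, toy_overlap_score, top_n=2
)
```

The first stage is tuned for recall (fetch many candidates cheaply), while the second stage spends more compute per document to improve precision.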

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [19]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

Let's create our chain again, and see how this does!

In [20]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [21]:
contextual_compression_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, the most common issues with loans, particularly student loans, appear to involve errors and misconduct by servicers or lenders. Specific recurring problems include errors in loan balances, misapplied payments, wrongful denials of payment plans, incorrect or inconsistent information about loan amounts and interest, unauthorized transfers of loans, and mishandling of personal data. These issues often lead to confusion, damaged credit, and violations of legal rights.\n\nIn summary, the most common issue is dealing with errors and misconduct by loan servicers or lenders, especially related to inaccurate information, mismanagement, and inadequate communication.'

In [22]:
contextual_compression_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

"Based on the provided information, yes, there are complaints that did not get handled in a timely manner. For example, the complaint regarding the student's loan account review and resolution has been open for over 1 year and nearly 18 months without resolution. Additionally, a previous complaint about issues with payments not appearing on the account was still unresolved after 2-3 weeks, and the complaint about unapplied payments also concerns ongoing unresolved issues."

In [23]:
contextual_compression_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans primarily because they were not adequately informed about the repayment obligations and the complexities of the loan process. For example, some borrowers were unaware that they would have to repay their loans at all, or were not told about important details like interest accumulation, payment plans, and loan transfers. Additionally, many faced challenges such as interest accruing during deferment or forbearance, which increased their total debt over time, and a lack of clear communication from loan servicers about payment requirements, due dates, or account status. These issues made it difficult for them to manage and repay their loans effectively, leading to missed payments, growing balances, and sometimes even late payment reports on credit reports.'

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context
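The steps above can be sketched without any LLM or vector store. In this toy version the reformulated queries are hard-coded (in the real retriever they come from an LLM), and `toy_retrieve` is a hypothetical dictionary lookup standing in for a retriever.

```python
def multi_query_retrieve(queries, retrieve_fn):
    """Run retrieve_fn for every query and return unique documents in first-seen order."""
    seen = set()
    unique_docs = []
    for query in queries:
        for doc in retrieve_fn(query):
            if doc not in seen:
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs

# Hypothetical stand-in for a retriever: each query maps to the documents it would return.
toy_index = {
    "payment issues": ["doc_a", "doc_b"],
    "billing problems": ["doc_b", "doc_c"],
    "transaction difficulties": ["doc_d"],
}

def toy_retrieve(query):
    return toy_index.get(query, [])

# Assumed reformulations of one user question (hard-coded for illustration).
toy_queries = ["payment issues", "billing problems", "transaction difficulties"]
merged = multi_query_retrieve(toy_queries, toy_retrieve)
```

Note that `doc_b` is returned by two queries but appears only once in the merged context.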

So, how hard is it to set up? Not bad! Let's see it down below!



In [24]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)

In [25]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [26]:
multi_query_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, appears to be problems related to "Dealing with your lender or servicer," specifically sub-issues such as:\n\n- Trouble with how payments are being handled (e.g., improper application of payments, inability to apply extra funds to principal, payments being directed to interest, or misapplied payments).\n- Issues with loan balances, interest calculations, or discrepancies in account information.\n- Lack of or improper communication from loan servicers about payment status, loan amount, or account changes.\n- Unauthorized transfers or reassignment of loans without borrower consent.\n- Inadequate documentation or verification of loan terms and legal rights.\n- Misleading or inaccurate information about loan terms, balances, or repayment options.\n- Challenges in obtaining accurate loan history or account status, and difficulties in resolving disputes.\n\nOverall, the most prevalent theme is the difficulties borrowers exp

In [27]:
multi_query_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints data, yes, some complaints indicate that issues were not handled in a timely manner. Specifically, a few complaints mention delays in responses or resolutions:\n\n- Complaint ID 12698650 (Mohela, LA): The consumer reported that the issue was not resolved after over 18 months, despite ongoing efforts and multiple submissions. Although the company responded "timely" in the response, the consumer\'s experience reflects a significant delay in resolution.\n- Complaint ID 12739706 (Mohela, MD): The respondent was marked as "No" for timely response, indicating the complaint was not addressed promptly.\n- Complaint ID 12973003 (EdFinancial, NJ): The complaint response was marked as "Yes" but the narrative describes ongoing unresolved issues over several weeks.\n- Complaint ID 12654977 (Mohela, NJ): Marked as "No" for timely response, indicating a delay.\n- Complaint IDs involving credit report inaccuracies or account status issues also reflect delays or failur

In [28]:
multi_query_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans primarily due to issues such as mismanagement by loan servicers, errors in loan balances, misapplied payments, and wrongful denials of payment plans. Additionally, some borrowers experienced difficulties because of inadequate communication from lenders, legal discrepancies, privacy violations, and systemic failures in handling their accounts. Factors like being misled about repayment options, being placed in long-term forbearances without proper guidance, and experiencing unauthorized or incorrect reporting of delinquencies also contributed to their inability to repay the loans.'

#### ❓ Question #2:

Explain how generating multiple reformulations of a user query can improve recall.

##### ✅ Answer:

Generating multiple reformulations of a user query can significantly improve recall through several mechanisms:

**1. Vocabulary Mismatch Reduction:**
- Users and document authors often use different words to describe the same concepts
- Multiple reformulations increase the likelihood of matching the exact terminology used in relevant documents
- Example: A user asking "payment issues" might miss documents that use "billing problems" or "transaction difficulties"

**2. Query Perspective Diversification:**
- Different reformulations can approach the same topic from various angles
- Each perspective might surface different relevant documents that focus on specific aspects
- Example: "Why did loans fail?" vs "What caused borrower defaults?" vs "Reasons for payment problems"

**3. Semantic Coverage Expansion:**
- LLMs can generate reformulations that capture different semantic nuances of the original query
- This helps retrieve documents that are conceptually relevant but use different language patterns
- Increases the semantic search space beyond the original query's limited scope

**4. Synonym and Paraphrase Utilization:**
- Reformulations naturally incorporate synonyms and paraphrases
- Documents using alternative terminology become discoverable
- Reduces dependency on exact keyword matches

**5. Comprehensive Document Retrieval:**
- Documents are retrieved for each reformulated query, and the union of all results is taken
- The final context includes a broader set of potentially relevant documents
- Higher chance of including the most relevant information that might have been missed by a single query

**Implementation in Multi-Query Retriever:**
The Multi-Query Retriever demonstrates this by:
1. Using an LLM to generate multiple query variations
2. Running retrieval for each variation
3. Combining all unique documents into the final context
4. Providing richer, more comprehensive information for answer generation


## Task 8: Parent Document Retriever

The Parent Document Retriever follows a "small-to-big" strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.
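Here is a minimal sketch of "search small, return big" in plain Python. It assumes fixed-size character chunks and uses a substring match in place of vector similarity, so it only illustrates the bookkeeping between child chunks and parents; all names (`build_child_index`, `retrieve_parents`, `toy_parents`) are hypothetical.

```python
def split_into_chunks(text, chunk_size):
    """Cut text into consecutive fixed-size character chunks."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def build_child_index(parents, chunk_size):
    """Pair every child chunk with the id of the parent it came from."""
    child_index = []
    for parent_id, text in parents.items():
        for chunk in split_into_chunks(text, chunk_size):
            child_index.append((chunk, parent_id))
    return child_index

def retrieve_parents(query, child_index, parents, match_fn):
    """Search over child chunks, but return the de-duplicated parent documents."""
    parent_ids = []
    for chunk, parent_id in child_index:
        if match_fn(query, chunk) and parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[parent_id] for parent_id in parent_ids]

# The "docstore": full parent documents keyed by id.
toy_parents = {
    "p1": "autopay was cancelled. then a late fee appeared on my account.",
    "p2": "the servicer transferred my loan without telling me.",
}
toy_children = build_child_index(toy_parents, chunk_size=25)

# Substring match stands in for a similarity search over child embeddings.
found = retrieve_parents("late fee", toy_children, toy_parents, lambda q, c: q in c)
```

Only one small chunk of `p1` matches the query, yet the whole parent document is what gets handed to the LLM.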

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [29]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = loan_complaint_data
child_splitter = RecursiveCharacterTextSplitter(chunk_size=750)

We'll need to set up a new Qdrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [30]:
from langchain_qdrant import QdrantVectorStore

client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = QdrantVectorStore(
    collection_name="full_documents", embedding=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)

Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [31]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [32]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [33]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [34]:
parent_document_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

"The most common issue with loans, based on the context provided, appears to be problems related to the servicing of federal student loans. These include errors in loan balances, misapplied payments, wrongful denials of payment plans, discrepancies with loan balances and interest rates, and improper or outdated credit reporting. Many complaints involve systemic breakdowns, miscommunications, and unfair practices by loan servicers, which can severely impact borrowers' credit scores and financial stability."

In [35]:
parent_document_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, the complaints listed indicate that they did not receive a timely response from the companies involved. Specifically:\n\n- The complaint received on 03/28/25 regarding the student loan application process was marked as "Timely response?": No.\n- The complaint received on 04/11/25 regarding issues with loan payment handling was also marked as "Timely response?": No.\n\nAdditionally, the complaint filed on 04/27/25 about credit bureau dispute resolution was marked as "Timely response?": Yes, but it appears there has been a delay of over 30 days in response.\n\nTherefore, yes, there were complaints that did not get handled in a timely manner.'

In [36]:
parent_document_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans mainly due to factors such as experiencing severe financial hardship, being misinformed about the long-term consequences of their loans, and facing difficulties in securing employment or income to make payments. In some cases, they also encountered issues with loan servicing, such as being unaware of payment requirements, lack of proper communication from lenders, or complications related to loan transfer and reporting, which further hindered their ability to repay.'

Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to more effectively answer the question about the performance.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever simply takes two or more retrievers and combines their retrieved documents based on a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.
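For intuition, here is a small sketch of weighted Reciprocal Rank Fusion in plain Python. It assumes the constant `c=60` from the original paper and per-retriever weights; the exact details inside the library may differ, and `weighted_rrf` and the toy rankings are hypothetical.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists: each document earns weight / (c + rank) per list."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += weight / (c + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc: scores[doc], reverse=True)

# Made-up rankings from two retrievers (best match first).
toy_bm25_ranking = ["doc_a", "doc_b", "doc_c"]
toy_dense_ranking = ["doc_b", "doc_d", "doc_a"]

fused = weighted_rrf([toy_bm25_ranking, toy_dense_ranking], weights=[0.5, 0.5])
```

Because only ranks are used, the retrievers' raw scores never need to be on a comparable scale, which is what makes it easy to mix sparse and dense retrievers.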

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [37]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.

In [38]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [39]:
ensemble_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, the most common issues with student loan loans appear to involve:\n\n- Dealing with lenders or servicers, including receiving bad information about loans, problematic handling of payments, and mismanagement\n- Errors in loan balances, misapplied payments, wrongful denials of repayment plans, and incorrect account statuses\n- Problems with repayment plans, including difficulty with income-driven repayment, forbearance, deferments, and loan forgiveness\n- Incorrect or incomplete information on credit reports and credit reporting errors\n- Lack of proper communication, notifications, or transparency from loan servicers\n- Unauthorized or improper transfer or sale of loans without borrower knowledge\n- Inaccurate loan classification or mislabeling of loan types (e.g., FFELP vs. HEAL)\n- Problems with loan repayment timing and interest capitalization, leading to increasing balances\n- Issues related to illegal collection practices and failure to follow federa

In [40]:
ensemble_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided data, yes, there are complaints that were not handled in a timely manner. Specifically, one complaint received on 03/26/25 involving Maximus Federal Services, Inc. (Aidvantage) was marked as "No" in the "Timely response?" field, indicating it was not addressed within the expected period. Additionally, several complaints involving ED Financial Services and Maximus Federal Services, Inc. were also marked as "No" under "Timely response," meaning they were not responded to promptly.\n\nIn sum, multiple complaints documented in the data were not handled in a timely manner.'

In [41]:
ensemble_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans for several reasons, including:\n\n1. Lack of proper notification or communication from loan servicers about payment obligations, due dates, or changes in account status, leading to unawareness of when payments were due.\n2. Difficulty understanding or navigating repayment plans, especially when offered limited options like forbearance or deferment that result in accumulating interest, making loans harder to pay off.\n3. Mismanagement or errors by loan servicers, such as incorrect account information, misapplied payments, and errors in balances, which can hinder repayment efforts.\n4. Financial hardships, such as unemployment, low income, homelessness, or medical issues, which make it challenging to make timely payments.\n5. Issues with the transfer of loans between companies or lack of transparency about account status, leading borrowers to be unaware of their obligations.\n6. Problems with loan handling, including inappropriate steering into lon

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of several [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

We'll use the `percentile` thresholding method for this example, which will:

Calculate the distances between consecutive sentences, and then break apart sequences of sentences wherever a distance exceeds a given percentile of all distances.
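
The percentile rule can be sketched in plain Python. This is a minimal illustration, not the `SemanticChunker` implementation: the two-dimensional toy embeddings, the helper names (`cosine_distance`, `percentile`, `chunk_by_percentile`), and the 70th-percentile cutoff are all assumptions chosen to keep the example small, and the library differs in details such as how it groups neighbouring sentences before embedding.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def percentile(values, p):
    """Linear-interpolation percentile of a non-empty list (p in 0..100)."""
    ordered = sorted(values)
    k = (len(ordered) - 1) * p / 100
    lo, hi = math.floor(k), math.ceil(k)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (k - lo)

def chunk_by_percentile(embeddings, p):
    """Group sentence indices, starting a new chunk after any gap above the percentile threshold."""
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    threshold = percentile(distances, p)
    chunks, current = [], [0]
    for i, distance in enumerate(distances):
        if distance > threshold:
            chunks.append(current)
            current = []
        current.append(i + 1)
    chunks.append(current)
    return chunks

# Hypothetical embeddings: sentences 0-2 share one topic, sentences 3-4 another.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05], [0.0, 1.0], [0.1, 0.9]]
chunks = chunk_by_percentile(toy_embeddings, p=70)
print(chunks)
```

Only the gap between sentences 2 and 3 exceeds the threshold here, so the five sentences fall into two chunks.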

In [42]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [43]:
semantic_documents = semantic_chunker.split_documents(loan_complaint_data[:20])

Let's create a new vector store.

In [44]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="Loan_Complaint_Data_Semantic_Chunks"
)

We'll use naive retrieval for this example.

In [45]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [46]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [47]:
semantic_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided complaints, the most common issues with loans seem to involve problems related to loan servicing and communication. These include:\n\n- Struggles to repay or issues with repayment plans (e.g., difficulties with income-driven repayment calculations and unexpected payment amounts).\n- Poor communication or lack of transparency from loan servicers (delays, miscommunication about loan status or payment terms).\n- Errors in reporting or account status (incorrect default notices, delinquency reports, or account statuses on credit reports).\n- Problems with loan handling after legal or policy changes, such as illegitimate collections or data breaches.\n- Difficulties in setting up or verifying auto-debits and payment processing.\n- Issues arising from the end of forbearance or re-amortization not being properly processed.\n\nOverall, a common theme appears to be that many issues stem from poor communication, administrative errors, or mismanagement by loan servicers, cau

In [48]:
semantic_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, several complaints indicate that responses from the companies were handled in a timely manner, with responses marked as "Yes" for timely response and "Closed with explanation." However, there is at least one complaint where the consumer explicitly states that despite multiple efforts to communicate, the company did not respond to their written complaints or questions. Specifically, in complaint ID 13331376 regarding Nelnet, Inc., the consumer mentions that despite sending multiple certified mail letters detailing serious misconduct and violations of law, Nelnet never responded to the CM, nor provided any answers to the questions raised.\n\nTherefore, yes, some complaints did not get handled in a timely manner, or in some cases, not at all from the consumer’s perspective.'

In [49]:
semantic_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People may fail to pay back their loans for various reasons, including communication issues with lenders or servicers, lack of transparency, difficulties in navigating loan processes, and administrative errors. For example, some borrowers experience trouble receiving clear information about their loan status or payment requirements, which can lead to missed payments. Others face delays or inaccuracies in payment processing, or encounter problems with documentation and loan transfer procedures. Additionally, disputes over loan legitimacy or claims of improper reporting can cause confusion and hinder repayment. Overall, challenges related to poor communication, administrative complications, and lack of clear information contribute to some borrowers' inability to repay their loans."

#### ❓ Question #3:

If sentences are short and highly repetitive (e.g., FAQs), how might semantic chunking behave, and how would you adjust the algorithm?

##### ✅ Answer:

**How Semantic Chunking Behaves with Short, Repetitive Sentences:**

**1. Problematic Similarity Patterns:**
- Short, repetitive sentences (like FAQ questions) often have very high semantic similarity scores
- The algorithm may struggle to find meaningful breakpoints between semantically similar content
- Could result in either overly large chunks (everything grouped together) or overly small chunks (no grouping occurs)

**2. Threshold Sensitivity Issues:**
- With highly similar content, small variations in similarity scores become less meaningful
- Percentile-based thresholding may not work effectively when most distances are very close
- The algorithm might either chunk everything together or keep everything separate
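
A small sketch of why percentile thresholding is fragile on near-identical distances. The distance lists and helper names below are hypothetical; the point is only that a percentile cutoff always flags the top few gaps, whether or not they reflect a real topic change.

```python
import math

def percentile(values, p):
    """Linear-interpolation percentile of a non-empty list (p in 0..100)."""
    ordered = sorted(values)
    k = (len(ordered) - 1) * p / 100
    lo, hi = math.floor(k), math.ceil(k)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (k - lo)

def breakpoints(distances, p=95.0):
    """Indices of gaps whose distance exceeds the p-th percentile of all gaps."""
    threshold = percentile(distances, p)
    return [i for i, d in enumerate(distances) if d > threshold]

# 20 gaps with one genuine topic shift at index 9.
clear = [0.02] * 9 + [0.90] + [0.02] * 10

# 20 gaps that are all nearly identical, as with short repetitive FAQ sentences.
flat = [0.0200 + 0.00001 * ((i * 7) % 20) for i in range(20)]

print(breakpoints(clear))  # the real shift
print(breakpoints(flat))   # a break is still produced, driven by noise
```

Both inputs yield exactly one breakpoint, but in the `flat` case the chosen gap differs from its neighbours only in the fifth decimal place.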

**3. Loss of Logical Structure:**
- FAQs have inherent question-answer pairs that should ideally stay together
- Semantic chunking might break these logical pairs if focusing purely on sentence-level similarity
- Context and structure information gets lost in favor of semantic similarity

**Adjustments to Improve Performance:**

**1. Threshold Method Modifications:**
```python
# Break on smaller jumps in distance than the default settings would
semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="standard_deviation",  # Threshold = mean + N standard deviations of the distances
    breakpoint_threshold_amount=1.0  # N = 1.0; a lower N produces more breakpoints
)
```

**2. Pre-processing Strategies:**
- Identify FAQ structure patterns (Q: A: formatting)
- Use regex to detect question-answer pairs
- Apply metadata-based chunking to preserve logical pairs

**3. Hybrid Approaches:**
Combine semantic chunking with structural rules:
- First chunk by logical structure (Q-A pairs)
- Then apply semantic chunking within those logical boundaries
- Use custom splitting logic for FAQ-specific patterns
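
The structural first pass could look something like the sketch below. It assumes a hypothetical FAQ format in which every question starts a line with `Q:`; the function name `split_faq_pairs` and the sample text are illustrative only.

```python
import re

def split_faq_pairs(text):
    """Split FAQ text into one chunk per question-answer pair.

    Assumes each pair starts at a line beginning with "Q:" and runs
    until the next such line or the end of the text.
    """
    pattern = re.compile(r"^Q:.*?(?=^Q:|\Z)", flags=re.MULTILINE | re.DOTALL)
    return [match.group(0).strip() for match in pattern.finditer(text)]

# Hypothetical FAQ content.
faq = """Q: How do I reset my password?
A: Use the reset link on the login page.

Q: How do I change my email?
A: Update it under account settings.
"""

pairs = split_faq_pairs(faq)
for pair in pairs:
    print(repr(pair))
```

Each pair stays intact as its own chunk. Semantic chunking would then only be worth applying inside a pair whose answer is long enough to contain several topics.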

**4. Alternative Chunking Methods:**
- **Fixed-size chunking** with Q-A pair preservation
- **Metadata-based chunking** using FAQ structure
- **Custom splitters** that understand FAQ formatting

**5. Enhanced Similarity Calculation:**
- Use more sophisticated embedding models that better capture subtle differences
- Apply domain-specific embeddings trained on FAQ data
- Consider a higher-dimensional embedding model for better differentiation

**Recommended Implementation for FAQs:**
Rather than relying purely on semantic chunking, use a structured approach that preserves the logical Q-A relationships while still benefiting from semantic organization for related topics.


# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against each other.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever-specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparison
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.

In [50]:
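from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader

# Load every PDF under data/ as source documents for synthetic test-set generation.
# Note: the retrievers above index complaints.csv, so these PDFs need to cover the
# same material for the generated reference contexts to be retrievable.
path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()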

In [51]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [52]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(
    llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs[:20], testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/17 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/20 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node


Applying SummaryExtractor:   0%|          | 0/31 [00:00<?, ?it/s]


Applying CustomNodeFilter:   0%|          | 0/6 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/41 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [None]:
# Import required libraries for evaluation
from ragas.metrics import ContextPrecision, ContextRecall, ContextRelevance
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.dataset_schema import SingleTurnSample, EvaluationDataset
import pandas as pd
import time
from langsmith import traceable
from datetime import datetime
import os
import getpass

## Set up LangSmith for cost and latency tracking

In [None]:

# # Get LangSmith API key
# langsmith_key = getpass.getpass("LangSmith API Key:")
# os.environ["LANGSMITH_API_KEY"] = langsmith_key

# Enable LangSmith tracing
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_PROJECT"] = "retriever-evaluation"

print("✅ LangSmith tracing enabled!")
print(f"📊 Project: {os.environ['LANGSMITH_PROJECT']}")
print("🔗 Visit https://smith.langchain.com to view your traces")

✅ LangSmith tracing enabled!
📊 Project: retriever-evaluation
🔗 Visit https://smith.langchain.com to view your traces


## Set up RAGAS evaluator with LLM wrapper

In [None]:

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# Define RAGAS metrics for retriever evaluation
ragas_metrics = [
    ContextPrecision(llm=evaluator_llm),
    ContextRecall(llm=evaluator_llm), 
    ContextRelevance(llm=evaluator_llm)
]

print("RAGAS evaluator setup complete!")

RAGAS evaluator setup complete!


In [56]:
# Convert the generated dataset to RAGAS format
test_df = dataset.to_pandas()

# Create evaluation samples from the test dataset
evaluation_samples = []
for idx, row in test_df.iterrows():
    sample = {
        'user_input': row['user_input'],
        'reference_contexts': row['reference_contexts'],
        'reference': row['reference']
    }
    evaluation_samples.append(sample)

print(f"Created {len(evaluation_samples)} evaluation samples")
print("Sample structure:", evaluation_samples[0].keys())


Created 12 evaluation samples
Sample structure: dict_keys(['user_input', 'reference_contexts', 'reference'])


## Define all retrievers to evaluate

In [None]:
retrievers_to_evaluate = {
    "naive": naive_retriever,
    "bm25": bm25_retriever, 
    "contextual_compression": compression_retriever,
    "multi_query": multi_query_retriever,
    "parent_document": parent_document_retriever,
    "ensemble": ensemble_retriever,
    "semantic": semantic_retriever
}

print(f"Will evaluate {len(retrievers_to_evaluate)} retrieval methods")


Will evaluate 7 retrieval methods


In [58]:
@traceable
def evaluate_retriever(retriever, retriever_name, evaluation_samples):
    """Evaluate a single retriever using RAGAS metrics"""
    
    print(f"Evaluating {retriever_name}...")
    start_time = time.time()
    
    # Prepare evaluation data
    ragas_samples = []
    successful_samples = 0
    
    for sample in evaluation_samples:
        try:
            # Retrieve documents for this question
            retrieved_docs = retriever.invoke(sample['user_input'])
            retrieved_contexts = [doc.page_content for doc in retrieved_docs]
            
            # Create RAGAS sample
            ragas_sample = SingleTurnSample(
                user_input=sample['user_input'],
                retrieved_contexts=retrieved_contexts,
                reference_contexts=sample['reference_contexts'],
                reference=sample['reference']
            )
            ragas_samples.append(ragas_sample)
            successful_samples += 1
            
        except Exception as e:
            print(f"Error processing sample for {retriever_name}: {e}")
            continue
    
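    # Average wall-clock time per query: covers retrieval plus sample construction,
    # and failed samples still count toward the denominator
    end_time = time.time()
    avg_latency = (end_time - start_time) / len(evaluation_samples)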
    
    # Evaluate with RAGAS
    if ragas_samples:
        eval_dataset = EvaluationDataset(samples=ragas_samples)
        evaluation_results = evaluate(dataset=eval_dataset, metrics=ragas_metrics)
        
        return {
            'retriever_name': retriever_name,
            'metrics': evaluation_results,
            'avg_latency_seconds': avg_latency,
            'total_samples': len(evaluation_samples),
            'successful_samples': successful_samples
        }
    else:
        return None

print("Evaluation function defined!")


Evaluation function defined!


## Run evaluation for all retrievers

In [None]:

print("Starting retriever evaluation...")
print("="*60)

results = []
for retriever_name, retriever in retrievers_to_evaluate.items():
    try:
        result = evaluate_retriever(retriever, retriever_name, evaluation_samples)
        if result:
            results.append(result)
            print(f"✅ {retriever_name} evaluation completed")
        else:
            print(f"❌ {retriever_name} evaluation failed")
    except Exception as e:
        print(f"❌ Error evaluating {retriever_name}: {e}")
        continue
    # Pause between retrievers to stay under the OpenAI tokens-per-minute rate limit
    time.sleep(60)

print(f"\nCompleted evaluation of {len(results)} retrievers")


Starting retriever evaluation...
Evaluating naive...


Evaluating:   0%|          | 0/36 [00:00<?, ?it/s]

✅ naive evaluation completed
Evaluating bm25...


Evaluating:   0%|          | 0/36 [00:00<?, ?it/s]

✅ bm25 evaluation completed
Evaluating contextual_compression...


Evaluating:   0%|          | 0/36 [00:00<?, ?it/s]

✅ contextual_compression evaluation completed
Evaluating multi_query...


Evaluating:   0%|          | 0/36 [00:00<?, ?it/s]

An error occurred: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4.1-mini in organization org-Hp16PKiuF3av02eUg87TR03d on tokens per min (TPM): Limit 200000, Used 200000, Requested 1995. Please try again in 598ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}. Skipping a sample by assigning it nan score.
✅ multi_query evaluation completed
Evaluating parent_document...


Evaluating:   0%|          | 0/36 [00:00<?, ?it/s]

✅ parent_document evaluation completed
Evaluating ensemble...


Evaluating:   0%|          | 0/36 [00:00<?, ?it/s]

An error occurred: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4.1-mini in organization org-Hp16PKiuF3av02eUg87TR03d on tokens per min (TPM): Limit 200000, Used 200000, Requested 1995. Please try again in 598ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}. Skipping a sample by assigning it nan score.
✅ ensemble evaluation completed
Evaluating semantic...


Evaluating:   0%|          | 0/36 [00:00<?, ?it/s]

✅ semantic evaluation completed

Completed evaluation of 7 retrievers


## Analyze and compile results

In [None]:
print("Analyzing results...")
print("="*80)

if not results:
    print("No results to analyze!")
else:
    # Create a comprehensive results DataFrame
    compiled_results = []
    
    for result in results:
        # Convert metrics to pandas DataFrame and extract mean values
        metrics_df = result['metrics'].to_pandas()
        
        # Calculate mean values for each metric
        context_precision_mean = metrics_df['context_precision'].mean()
        context_recall_mean = metrics_df['context_recall'].mean() 
        context_relevance_mean = metrics_df['nv_context_relevance'].mean()
        
        row = {
            'Retriever': result['retriever_name'],
            'Context_Precision': round(context_precision_mean, 4),
            'Context_Recall': round(context_recall_mean, 4),
            'Context_Relevance': round(context_relevance_mean, 4),
            'Avg_Latency_Seconds': round(result['avg_latency_seconds'], 4),
            'Success_Rate': round(result['successful_samples'] / result['total_samples'], 4),
            'Total_Samples': result['total_samples'],
            'Successful_Samples': result['successful_samples']
        }
        compiled_results.append(row)
    
    # Create results DataFrame
    results_df = pd.DataFrame(compiled_results)
    
    # Display results
    print("RETRIEVER EVALUATION RESULTS")
    print("="*80)
    print(results_df.to_string(index=False))
    
    # Find best performers
    print("\n" + "="*80)
    print("TOP PERFORMERS BY METRIC")
    print("="*80)
    
    best_precision_idx = results_df['Context_Precision'].idxmax()
    best_recall_idx = results_df['Context_Recall'].idxmax()
    best_relevance_idx = results_df['Context_Relevance'].idxmax()
    best_latency_idx = results_df['Avg_Latency_Seconds'].idxmin()  # Lower is better
    
    print(f"🎯 Best Context Precision: {results_df.at[best_precision_idx, 'Retriever']} - {results_df.at[best_precision_idx, 'Context_Precision']}")
    print(f"🔍 Best Context Recall: {results_df.at[best_recall_idx, 'Retriever']} - {results_df.at[best_recall_idx, 'Context_Recall']}")
    print(f"⭐ Best Context Relevance: {results_df.at[best_relevance_idx, 'Retriever']} - {results_df.at[best_relevance_idx, 'Context_Relevance']}")
    print(f"⚡ Lowest Latency: {results_df.at[best_latency_idx, 'Retriever']} - {results_df.at[best_latency_idx, 'Avg_Latency_Seconds']}s")


Analyzing results...
RETRIEVER EVALUATION RESULTS
             Retriever  Context_Precision  Context_Recall  Context_Relevance  Avg_Latency_Seconds  Success_Rate  Total_Samples  Successful_Samples
                 naive             0.1250          0.0000             0.2083               0.2762           1.0             12                  12
                  bm25             0.0000          0.0000             0.0417               0.0034           1.0             12                  12
contextual_compression             0.0278          0.0833             0.1250               0.5156           1.0             12                  12
           multi_query             0.1042          0.0521             0.2045               9.5312           1.0             12                  12
       parent_document             0.0833          0.1146             0.1667               0.3101           1.0             12                  12
              ensemble             0.1667          0.1250           

## Calculate overall performance score (weighted average)

In [None]:

if results:
    print("\n" + "="*80)
    print("OVERALL PERFORMANCE ANALYSIS")
    print("="*80)
    
    # Scale each quality metric relative to the best retriever (best = 1.0)
    results_df['Precision_Normalized'] = results_df['Context_Precision'] / results_df['Context_Precision'].max()
    results_df['Recall_Normalized'] = results_df['Context_Recall'] / results_df['Context_Recall'].max()
    results_df['Relevance_Normalized'] = results_df['Context_Relevance'] / results_df['Context_Relevance'].max()
    
    # For latency, lower is better, so score each retriever as fastest latency / own latency (fastest = 1.0)
    results_df['Latency_Normalized'] = results_df['Avg_Latency_Seconds'].min() / results_df['Avg_Latency_Seconds']
    
    # Calculate composite score (equal weighting for simplicity)
    results_df['Composite_Score'] = (
        results_df['Precision_Normalized'] * 0.25 + 
        results_df['Recall_Normalized'] * 0.25 + 
        results_df['Relevance_Normalized'] * 0.25 + 
        results_df['Latency_Normalized'] * 0.25
    ).round(4)
    
    # Sort by composite score
    results_df_sorted = results_df.sort_values('Composite_Score', ascending=False)
    
    print("RANKING BY COMPOSITE SCORE (Precision + Recall + Relevance + Speed):")
    print("-" * 70)
    # Number rows by position in the sorted ranking, not by their original DataFrame index
    for rank, (_, row) in enumerate(results_df_sorted.iterrows(), start=1):
        print(f"{rank:2d}. {row['Retriever']:20} - Score: {row['Composite_Score']:.4f}")
    
    print("\n" + "="*80)
    print("RECOMMENDATIONS")
    print("="*80)
    
    best_overall = results_df_sorted.iloc[0]
    print(f"🏆 **BEST OVERALL RETRIEVER: {best_overall['Retriever'].upper()}**")
    print(f"   - Composite Score: {best_overall['Composite_Score']:.4f}")
    print(f"   - Context Precision: {best_overall['Context_Precision']:.4f}")
    print(f"   - Context Recall: {best_overall['Context_Recall']:.4f}")
    print(f"   - Context Relevance: {best_overall['Context_Relevance']:.4f}")
    print(f"   - Average Latency: {best_overall['Avg_Latency_Seconds']:.4f}s")
    
    print("\n📊 **ANALYSIS SUMMARY:**")
    print("This evaluation considers cost, latency, and performance factors:")
    print("• **Performance**: Measured through RAGAS context precision, recall, and relevance metrics")
    print("• **Latency**: Time taken to retrieve documents per query")  
    print("• **Cost**: Tracked through LangSmith (check your LangSmith dashboard for detailed cost analysis)")
    
    print(f"\n💡 **RECOMMENDATION FOR LOAN COMPLAINT DATA:**")
    print(f"Based on the evaluation, **{best_overall['Retriever']}** retriever is recommended because:")
    
    # Provide specific reasoning based on the best performer
    if best_overall['Retriever'] == 'ensemble':
        print("• Combines strengths of multiple retrieval strategies")
        print("• Shows balanced performance across all metrics")
        print("• More robust to different query types")
    elif best_overall['Retriever'] == 'contextual_compression':
        print("• Excellent at filtering most relevant content")
        print("• Reduces noise in retrieved documents")
        print("• Good balance of precision and relevance")
    elif best_overall['Retriever'] == 'multi_query':
        print("• Improves recall through query reformulation")
        print("• Captures different perspectives of user questions")
        print("• Good for complex, ambiguous queries")
    elif best_overall['Retriever'] == 'bm25':
        print("• Excellent for exact term matching")
        print("• Very fast retrieval with low latency")
        print("• Cost-effective with no embedding computation")
    else:
        print("• Shows strong performance across key metrics")
        print("• Good balance of effectiveness and efficiency")
    
    print("\n📋 **FINAL NOTES:**")
    print("• Check LangSmith dashboard for detailed cost analysis")
    print("• Consider your specific use case when choosing a retriever")
    print("• Ensemble methods often provide the most robust performance")
    print("• BM25 excels for keyword-heavy queries, embeddings for semantic similarity")



OVERALL PERFORMANCE ANALYSIS
RANKING BY COMPOSITE SCORE (Precision + Recall + Relevance + Speed):
----------------------------------------------------------------------
 1. ensemble             - Score: 0.7501
 2. parent_document      - Score: 0.5402
 3. multi_query          - Score: 0.4855
 4. naive                - Score: 0.4196
 5. contextual_compression - Score: 0.3474
 6. bm25                 - Score: 0.2959
 7. semantic             - Score: 0.0028

RECOMMENDATIONS
🏆 **BEST OVERALL RETRIEVER: ENSEMBLE**
   - Composite Score: 0.7501
   - Context Precision: 0.1667
   - Context Recall: 0.1250
   - Context Relevance: 0.2273
   - Average Latency: 8.2376s

📊 **ANALYSIS SUMMARY:**
This evaluation considers cost, latency, and performance factors:
• **Performance**: Measured through RAGAS context precision, recall, and relevance metrics
• **Latency**: Time taken to retrieve documents per query
• **Cost**: Tracked through LangSmith (check your LangSmith dashboard for detailed cost analysis

# 📊 Retrieval Performance Analysis

## Evaluation Results Summary

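Evaluated with RAGAS metrics on 12 synthetic test samples, the **Ensemble Retriever** ranked first with a composite score of **0.7501**. Absolute scores are low for every retriever (the best context precision is 0.1667), so the ranking is more informative than the raw values.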

### 🏆 Performance Rankings

| Rank | Retriever | Composite Score | Precision | Recall | Relevance | Latency (s) |
|------|-----------|----------------|-----------|--------|-----------|-------------|
| 1 | **Ensemble** | **0.7501** | 0.1667 | 0.1250 | 0.2273 | 8.24 |
| 2 | Parent Document | 0.5402 | 0.0833 | 0.1146 | 0.1667 | 0.31 |
| 3 | Multi Query | 0.4855 | 0.1042 | 0.0521 | 0.2045 | 9.53 |
| 4 | Naive | 0.4196 | 0.1250 | 0.0000 | 0.2083 | 0.28 |
| 5 | Contextual Compression | 0.3474 | 0.0278 | 0.0833 | 0.1250 | 0.52 |
| 6 | BM25 | 0.2959 | 0.0000 | 0.0000 | 0.0417 | 0.003 |
| 7 | Semantic | 0.0028 | 0.0000 | 0.0000 | 0.0000 | 0.31 |
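
As a check on the arithmetic, the composite scores can be recomputed from the rounded per-retriever means. This sketch copies its inputs from the printed results above and leaves out the semantic retriever, whose row is cut off in the printed output; the variable and function names are illustrative.

```python
# (precision, recall, relevance, average latency in seconds), copied from the printed results
rows = {
    "naive": (0.1250, 0.0000, 0.2083, 0.2762),
    "bm25": (0.0000, 0.0000, 0.0417, 0.0034),
    "contextual_compression": (0.0278, 0.0833, 0.1250, 0.5156),
    "multi_query": (0.1042, 0.0521, 0.2045, 9.5312),
    "parent_document": (0.0833, 0.1146, 0.1667, 0.3101),
    "ensemble": (0.1667, 0.1250, 0.2273, 8.2376),
}

# Best value per column: highest for quality metrics, lowest for latency.
max_precision = max(r[0] for r in rows.values())
max_recall = max(r[1] for r in rows.values())
max_relevance = max(r[2] for r in rows.values())
min_latency = min(r[3] for r in rows.values())

def composite(precision, recall, relevance, latency):
    """Equal-weight average of the four normalized terms."""
    return round(
        0.25 * (precision / max_precision)
        + 0.25 * (recall / max_recall)
        + 0.25 * (relevance / max_relevance)
        + 0.25 * (min_latency / latency),
        4,
    )

scores = {name: composite(*values) for name, values in rows.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:24} {score:.4f}")
```

Because latency is normalized against BM25's 0.0034s, the latency term contributes almost nothing to any other retriever (about 0.0001 for Ensemble), so the ranking is driven almost entirely by the three quality metrics.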

### 💡 Key Findings

**Why Ensemble Retriever Wins:**
- **Robustness**: Combines strengths of multiple retrieval strategies, mitigating individual weaknesses
- **Balanced Performance**: Achieves the highest scores across precision, recall, and relevance metrics
- **Query Adaptability**: Effectively handles diverse query types in financial complaint data

**Trade-off Analysis:**
- **Speed vs Quality**: BM25 offers lightning-fast retrieval (0.003s) but poor relevance
- **Precision vs Coverage**: Naive retrieval has the second-highest precision (0.1250) but zero recall
- **Complexity vs Performance**: Ensemble's latency (8.24s, roughly 30x naive retrieval's 0.28s) buys the highest scores on all three quality metrics

### 🎯 Recommendation

For loan complaint analysis, **Ensemble Retrieval** is optimal because:
- Financial queries require comprehensive coverage (high recall)
- Accurate information filtering is critical (high precision)
- Diverse complaint types benefit from multi-strategy approaches
- Quality improvements outweigh the added latency (8.24s per query) when the use case is not latency-sensitive

*Note: Monitor LangSmith dashboard for detailed cost analysis and production optimization.*
