# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up QDrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

We'll also provide our OpenAI key, as well as our Cohere API key.

In [1]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [2]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

## Task 2: Data Collection and Preparation

We'll be using our Loan Data once again - this time the structured data available through the CSV!

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

In [3]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

loader = CSVLoader(
    file_path="./data/complaints.csv",
    metadata_columns=[
      "Date received", 
      "Product", 
      "Sub-product", 
      "Issue", 
      "Sub-issue", 
      "Consumer complaint narrative", 
      "Company public response", 
      "Company", 
      "State", 
      "ZIP code", 
      "Tags", 
      "Consumer consent provided?", 
      "Submitted via", 
      "Date sent to company", 
      "Company response to consumer", 
      "Timely response?", 
      "Consumer disputed?", 
      "Complaint ID"
    ]
)

loan_complaint_data = loader.load()

for doc in loan_complaint_data:
    doc.page_content = doc.metadata["Consumer complaint narrative"]

Let's look at an example document to see if everything worked as expected!

In [4]:
loan_complaint_data[0]

Document(metadata={'source': './data/complaints.csv', 'row': 0, 'Date received': '03/27/25', 'Product': 'Student loan', 'Sub-product': 'Federal student loan servicing', 'Issue': 'Dealing with your lender or servicer', 'Sub-issue': 'Trouble with how payments are being handled', 'Consumer complaint narrative': "The federal student loan COVID-19 forbearance program ended in XX/XX/XXXX. However, payments were not re-amortized on my federal student loans currently serviced by Nelnet until very recently. The new payment amount that is effective starting with the XX/XX/XXXX payment will nearly double my payment from {$180.00} per month to {$360.00} per month. I'm fortunate that my current financial position allows me to be able to handle the increased payment amount, but I am sure there are likely many borrowers who are not in the same position. The re-amortization should have occurred once the forbearance ended to reduce the impact to borrowers.", 'Company public response': 'None', 'Company'

## Task 3: Setting up QDrant!

Now that we have our documents, let's create a QDrant VectorStore with the collection name "LoanComplaints".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [5]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    loan_complaint_data,
    embeddings,
    location=":memory:",
    collection_name="LoanComplaints"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply treat each complaint as a document, and use cosine-similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [6]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})
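To build some intuition for what "cosine-similarity top-`k`" means, here is a minimal pure-Python sketch. The tiny 3-dimensional vectors and the `top_k` helper are made up for illustration; the real vector store works on full-size embeddings and handles all of this for us.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy "embeddings" (illustration only, not real model output).
toy_doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_vec = [1.0, 0.05, 0.0]

nearest = top_k(toy_query_vec, toy_doc_vecs, k=2)
print(nearest)
```

The two vectors pointing in nearly the same direction as the query come back first, which is the behavior `k` controls in `search_kwargs`.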

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [7]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [8]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [9]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and piping it into naive_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # RunnablePassthrough.assign forwards the incoming dict unchanged and (re)assigns "context"
    #              to the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object, which is piped
    #              into the LLM; the model's reply is stored under the "response" key
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)
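If the data flow through the chain is hard to picture, the following plain-Python imitation may help. It is only a sketch of the dictionary shapes at each step: `toy_retriever`, `toy_llm`, and `run_chain` are hypothetical stand-ins, not LangChain objects.

```python
def toy_retriever(question):
    """Hypothetical retriever that returns canned context for any question."""
    return ["complaint A", "complaint B"]

def toy_llm(prompt):
    """Hypothetical model that returns a fixed string instead of calling an API."""
    return "stub answer"

def run_chain(inputs):
    # Step 1: build {"context", "question"} from the incoming {"question"}.
    step1 = {
        "context": toy_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: pass everything through, re-assigning "context" to itself.
    step2 = {**step1, "context": step1["context"]}
    # Step 3: format the prompt, call the model, and keep the context alongside.
    prompt = f"Query:\n{step2['question']}\n\nContext:\n{step2['context']}"
    return {"response": toy_llm(prompt), "context": step2["context"]}

result = run_chain({"question": "What is the most common issue with loans?"})
print(sorted(result.keys()))
```

In the real chain the `"response"` value is a chat message object rather than a string, which is why the cells below read `["response"].content`.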

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [10]:
naive_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the complaints provided, appears to be problems related to the mismanagement and mishandling of student loans. This includes errors in loan balances, misapplied payments, wrongful denials of payment plans, incorrect information reported on credit reports, transfers of loans without proper notification, and difficulties in applying payments correctly. Many complaints also involve issues with loan servicing companies providing bad or confusing information, and improper handling of loan data and privacy violations.'

In [11]:
naive_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided context, it appears that at least one complaint was not handled in a timely manner. Specifically, the complaint with Complaint ID \'12709087\' submitted to MOHELA on 03/28/25 was marked as "Timely response?": "No," indicating it was not handled promptly. The narratives mention ongoing delays and lack of responses despite assurances.\n\nAdditionally, several other complaints, such as those with Complaint IDs \'12832400\' and \'12832400\', were marked as "Timely response?": "Yes," implying they were handled on time. However, the complaint with ID \'12709087\' clearly indicates a failure to respond in a timely manner.\n\nTherefore, yes, there were complaints that did not get handled in a timely manner.'

In [12]:
naive_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans primarily because of a combination of mismanagement, lack of clear communication, and financial hardship. Many borrowers were not adequately informed about the status of their loans, including transfer of servicing companies, payment resumption dates, or changes in payment plans. Some were unaware that interest would continue to accrue during forbearance or deferment periods, making the debt grow rather than decrease. Others faced financial difficulties due to stagnant wages, economic downturns, or personal hardships, which made it impossible to increase payments or meet the required repayment schedules. Additionally, complex or unhelpful loan repayment options, as well as administrative errors such as incorrect reporting or failure to notify about payment due dates, contributed to borrowers falling behind. Overall, issues related to poor communication, unexpected interest accumulation, and economic challenges led many to struggle with repaying th

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model), which is a sparse representation of text.

In essence, it's a way to score how well a piece of text matches a query based on the words they share, giving more weight to rarer words and adjusting for document length.

This retriever is very straightforward to set up! Let's see it happen down below!


In [13]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(loan_complaint_data)
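For intuition about what a BM25-style score looks like, here is a small self-contained sketch of one common formulation of the scoring function. The whitespace tokenization, the `k1` and `b` values, and the exact IDF variant are assumptions made for illustration; the library behind `BM25Retriever` may differ in these details.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a common BM25 formulation."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Term frequency saturates (k1) and is normalized by document length (b).
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avg_len))
        scores.append(score)
    return scores

# Hypothetical toy corpus (not taken from the complaints data).
toy_docs = [
    "autopay was cancelled without notice",
    "my payment was applied to the wrong loan",
    "the servicer never answered my call",
]
toy_scores = bm25_scores("autopay cancelled", toy_docs)
best_doc = toy_docs[toy_scores.index(max(toy_scores))]
print(best_doc)
```

Documents that share no words with the query score exactly zero, which is the key difference from embedding-based retrieval.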

We'll construct the same chain - only changing the retriever.

In [14]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [15]:
bm25_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, the most common issue with loans appears to be related to dealing with lenders or servicers, specifically issues such as misreported or confusing loan information, difficulty in making payments or applying funds correctly, and disputes over fees or loan terms. Many complaints involve borrowers receiving incorrect or bad information about their loans, problems with repayment processes, and issues with the accuracy of loan balances and interest calculations.'

In [16]:
bm25_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, yes, there are complaints that did not get handled in a timely manner. In particular, one complaint describes a situation where the consumer waited over several minutes (with some details redacted as "XXXX") on a call, and eventually had to hang up without resolution, indicating the complaint was not addressed promptly. Additionally, the overall tone of multiple complaints suggests ongoing difficulties with communication and resolution times.'

In [17]:
bm25_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People often fail to pay back their loans due to various reasons such as problems with their payment plans, miscommunication or lack of communication from the loan servicers, errors in processing payments, and issues related to loan transfers or automatic payments being unenrolled without proper notification. In some cases, borrowers are unaware of their current debt status, or their requests for deferment or forbearance are not properly addressed, leading to continued billing and late fees. Additionally, deceptive practices or mismanagement by loan servicing companies can contribute to borrowers' inability to fulfill repayment obligations."

It's not clear whether this is better or worse. If only we had a way to test this! (SPOILERS: We do - the second half of the notebook will cover this.)

#### ❓ Question #1:

Give an example query where BM25 is better than embeddings and justify your answer.

##### ✅ Answer:

**Example Query: "Why did people fail to pay back their loans?"**

Based on the invocations in this notebook, BM25 performs better than embeddings for this query.

**Comparison of responses:**

**Naive Retrieval (Embeddings):** Provided a broad, conceptual response covering systemic issues, miscommunication, and financial constraints but was more general in nature.

**BM25 Retrieval:** Delivered more specific, actionable details including:
- "automatic payments being unenrolled without proper notification"
- "errors in processing payments"
- "requests for deferment or forbearance are not properly addressed"
- "continued billing and late fees"
- "issues related to loan transfers"

**Why BM25 is better here:**

1. **Exact Term Matching**: BM25 excels at finding documents containing specific financial and procedural terminology like "automatic payments", "deferment", "forbearance", and "late fees" - terms that are crucial for understanding concrete loan servicing problems.

2. **Factual Precision**: The query asks for specific reasons why payments failed. BM25's keyword-based approach captures precise procedural failures and technical issues that directly answer the "why" question.

3. **Domain-Specific Language**: In financial/loan contexts, exact terminology matters immensely. BM25's ability to match specific loan servicing terms provides more actionable and legally/procedurally accurate information than semantic similarity alone.

4. **Reduced Semantic Drift**: Embeddings might retrieve conceptually similar but less precise content, whereas BM25 stays focused on documents containing the exact operational terms that explain payment failures.


## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [18]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)
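The retrieve-then-rerank idea can be sketched without any API calls. In the toy example below, `toy_relevance` is a hypothetical stand-in for a reranking model (it just measures word overlap), so this only illustrates the shape of the process, not how Cohere's model scores documents.

```python
def toy_relevance(query, text):
    """Hypothetical stand-in for a reranking model: fraction of query words present."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)

def rerank(query, candidates, top_n):
    """Re-score every candidate against the query and keep the top_n best."""
    ranked = sorted(candidates, key=lambda text: toy_relevance(query, text), reverse=True)
    return ranked[:top_n]

# Pretend these came back from a first-stage retriever, in its original order.
toy_candidates = [
    "my loan was transferred to a new servicer",
    "nobody gave a timely response to my complaint",
    "interest kept growing during forbearance",
    "the response to my complaint was late",
]
compressed = rerank("timely response to complaint", toy_candidates, top_n=2)
print(compressed)
```

The first stage casts a wide net, and the second stage keeps only the candidates that score best against the query itself.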

Let's create our chain again, and see how this does!

In [19]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [20]:
contextual_compression_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, particularly student loans as indicated in the provided complaints, appears to be problems related to dealing with lenders or servicers. Specific issues include errors in loan balances, misapplied payments, wrongful denials of payment plans, incorrect or misleading information, lack of proper communication, and mishandling of loan data. These issues often lead to confusion, inaccurate credit reporting, and difficulties in managing repayment.'

In [21]:
contextual_compression_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, yes, some complaints did not get handled in a timely manner. For example, there is a complaint from a consumer who has been waiting over a year for a response and resolution to their request regarding loan account issues, with nearly 18 months having passed without resolution. This indicates that a complaint was not addressed in a timely manner.'

In [22]:
contextual_compression_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans mainly because they were often misled or not fully informed about their repayment obligations, and faced difficulties managing their loan payments due to accumulating interest and financial hardships. Specifically, some borrowers were unaware they had to repay their student loans until long after taking them out, and they did not receive clear information about interest accrual, loan transfer processes, or payment plans. Additionally, options like deferment or forbearance did not prevent interest from growing, which increased the total amount owed over time. Many borrowers also felt that the financial burdens and repayment terms made it impossible to pay off their loans without sacrificing their basic living expenses, especially when they could not qualify for loan forgiveness programs. Overall, lack of clear communication, unexpected interest accumulation, and economic hardships contributed to failure to repay loans.'

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context

So, how hard is it to set up? Not bad! Let's see it down below!



In [23]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)
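The "retrieve per query, then keep the unique documents" step can be sketched in a few lines. Everything here is hypothetical: `toy_retrieve` ranks by word overlap, and the query variants are hand-written rather than generated by an LLM.

```python
def toy_retrieve(query, corpus, k=1):
    """Hypothetical retriever: rank documents by how many query words they share."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def multi_query_retrieve(queries, corpus, k=1):
    """Retrieve for every query and return the unique documents, in first-seen order."""
    unique_docs = []
    for query in queries:
        for doc in toy_retrieve(query, corpus, k):
            if doc not in unique_docs:
                unique_docs.append(doc)
    return unique_docs

toy_corpus = [
    "payment problems after the loan transfer",
    "billing errors on my statement",
    "autopay stopped without notice",
    "servicer ignored my dispute",
]
# Hand-written rewrites standing in for LLM-generated query variants.
toy_queries = ["payment problems", "billing errors", "loan payment issues"]

single_query_docs = toy_retrieve(toy_queries[0], toy_corpus)
multi_query_docs = multi_query_retrieve(toy_queries, toy_corpus)
print(len(single_query_docs), len(multi_query_docs))
```

The rewritten query surfaces a document the original wording misses, and the duplicate hit from the third query is dropped.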

In [24]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [25]:
multi_query_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided complaints, the most common issues with student loans tend to revolve around:\n\n- Errors or discrepancies in loan balances and interest calculations.\n- Problems with loan servicing, including mismanagement and improper handling of payments.\n- Lack of proper communication or notification about account status, transfers, or late payments.\n- Difficulties in obtaining accurate information or validation of loans.\n- Issues related to loan transfers between agencies or servicers without proper notification.\n- Problems with repayment plans, including being steered into unsuitable options or being unable to make principal payments.\n- Violations of borrower rights, including privacy breaches or improper data handling.\n- Problems with loan discharge, forgiveness, or legal disputes over enforceability.\n\nWhile specific issues vary, a common theme is the mismanagement or mishandling of loan information and payments, leading to confusion, inaccurate reporting, and fin

In [26]:
multi_query_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints data, yes, some complaints were not handled in a timely manner. For example:\n\n- Complaint ID 12739706 against MOHELA, submitted on 04/01/25, was marked as "No" for timely response, indicating it was not handled promptly.\n- Similarly, complaint ID 12709087 also against MOHELA, submitted on 03/28/25, was marked as "No" for timely response.\n\nHowever, many other complaints received responses marked as "Yes," indicating they were handled timely.\n\nIn conclusion, at least some complaints reported delays or failures in response times.'

In [27]:
multi_query_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans primarily due to a combination of systemic servicing issues, misinformation, legal discrepancies, and financial hardships. The complaints indicate that borrowers often received bad information, were misled into forbearance or consolidation practices that increased their debt due to interest capitalization, and were not properly informed about available repayment options such as income-driven plans or rehabilitation programs. Additionally, some borrowers experienced disputes over inaccurate account information, legal violations, and the negative impact of loan management errors on their credit scores, making repayment even more difficult. Overall, these factors created obstacles that prevented many from successfully repaying their loans.'

#### ❓ Question #2:

Explain how generating multiple reformulations of a user query can improve recall.

##### ✅ Answer:

Generating multiple reformulations of a user query can significantly improve recall through several mechanisms:

**1. Vocabulary Mismatch Reduction:**
- Users and document authors often use different words to describe the same concepts
- Multiple reformulations increase the likelihood of matching the exact terminology used in relevant documents
- Example: A user asking "payment issues" might miss documents that use "billing problems" or "transaction difficulties"

**2. Query Perspective Diversification:**
- Different reformulations can approach the same topic from various angles
- Each perspective might surface different relevant documents that focus on specific aspects
- Example: "Why did loans fail?" vs "What caused borrower defaults?" vs "Reasons for payment problems"

**3. Semantic Coverage Expansion:**
- LLMs can generate reformulations that capture different semantic nuances of the original query
- This helps retrieve documents that are conceptually relevant but use different language patterns
- Increases the semantic search space beyond the original query's limited scope

**4. Synonym and Paraphrase Utilization:**
- Reformulations naturally incorporate synonyms and paraphrases
- Documents using alternative terminology become discoverable
- Reduces dependency on exact keyword matches

**5. Comprehensive Document Retrieval:**
- Documents are retrieved for each reformulated query, and the union of all results is taken
- The final context includes a broader set of potentially relevant documents
- Higher chance of including the most relevant information that might have been missed by a single query

**Implementation in Multi-Query Retriever:**
The Multi-Query Retriever demonstrates this by:
1. Using an LLM to generate multiple query variations
2. Running retrieval for each variation
3. Combining all unique documents into the final context
4. Providing richer, more comprehensive information for answer generation


## Task 8: Parent Document Retriever

The Parent Document Retriever follows a "small-to-big" strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent documents".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.
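Here is a minimal pure-Python sketch of "search small, return big". The fixed-width splitter, the word-overlap search, and the `toy_parents` data are all assumptions for illustration; the real retriever uses a text splitter, embeddings, and a vector store instead.

```python
def split_into_chunks(text, chunk_size):
    """Naive fixed-width splitter (a stand-in for a real text splitter)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Hypothetical parent documents keyed by id (this plays the role of the docstore).
toy_parents = {
    "doc-a": "My autopay was cancelled. Then late fees were added to my account.",
    "doc-b": "The servicer transferred my loan. I was never told about it.",
}

# Index every child chunk alongside the id of the parent it came from.
child_index = [
    (chunk, parent_id)
    for parent_id, text in toy_parents.items()
    for chunk in split_into_chunks(text, chunk_size=30)
]

def retrieve_parents(query, k=1):
    """Search over child chunks, then return the de-duplicated parent documents."""
    query_words = set(query.lower().split())
    ranked = sorted(
        child_index,
        key=lambda item: len(query_words & set(item[0].lower().split())),
        reverse=True,
    )
    parent_ids = []
    for _chunk, parent_id in ranked[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [toy_parents[parent_id] for parent_id in parent_ids]

results = retrieve_parents("late fees")
print(results)
```

The match happens on a 30-character chunk, but the caller gets the whole parent document back, including the sentence that explains why the fees appeared.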

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [28]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = loan_complaint_data
child_splitter = RecursiveCharacterTextSplitter(chunk_size=750)

We'll need to set up a new QDrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [29]:
from langchain_qdrant import QdrantVectorStore

client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = QdrantVectorStore(
    collection_name="full_documents", embedding=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)

Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [30]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [31]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [32]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [33]:
parent_document_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, appears to be related to misconduct by loan servicers, including errors in loan balances, misapplication of payments, wrongful denials of payment plans, and incorrect or unverified reporting to credit bureaus. Additionally, issues such as discrepancies in interest rates, improper account handling, and problems stemming from loan transfers or sale of loans are prevalent.'

In [34]:
parent_document_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, several complaints were identified as not being handled in a timely manner. Specifically, the complaints about the delayed responses from Mohela regarding student loan applications and payments were marked as "Timely response?": "No" in the records. For instance, the complaint from row 441 and row 84 both indicate that the responses were not timely. Therefore, yes, some complaints did not get handled in a timely manner.'

In [35]:
parent_document_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People often fail to pay back their loans due to a variety of reasons. Based on the provided context, some common reasons include:\n\n1. Financial Hardship: Borrowers experience severe financial difficulties that make it difficult to make loan payments, such as unemployment, health issues, or other financial burdens.\n2. Lack of Proper Information or Miscommunication: Borrowers may not be properly informed about when payments are due, payment plans, or the terms of their loans, leading to unintentional delinquency.\n3. Institutional Misconduct or Misrepresentation: Some borrowers were misled about the value of their education, the manageability of their loans, or the financial stability of their educational institution, resulting in unexpected financial burdens and difficulty in repayment.\n4. Issues with Loan Servicing: Problems such as failure to notify borrowers of repayment obligations, errors in reporting late payments, or improper handling of account changes can hinder repayment

Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to more effectively answer the question about the performance.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever simply takes two or more retrievers and combines their retrieved documents using a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.
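To make the fusion step concrete, here is a small sketch of weighted Reciprocal Rank Fusion. The constant `c=60` follows the value commonly used with RRF, and the document ids and rankings are made up; treat this as an illustration of the idea rather than a copy of the library's implementation.

```python
def reciprocal_rank_fusion(ranked_lists, weights, c=60):
    """Fuse several ranked lists: each document earns weight / (c + rank) per list."""
    fused = {}
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] = fused.get(doc, 0.0) + weight / (c + rank)
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical rankings from two different retrievers.
keyword_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]

fused_ranking = reciprocal_rank_fusion(
    [keyword_ranking, vector_ranking], weights=[0.5, 0.5]
)
print(fused_ranking)
```

Documents that rank well in several lists rise to the top, even if no single retriever ranked them first.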

In [36]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.

In [37]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [38]:
ensemble_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issues with loans, based on the complaints, include:\n\n- Errors or inaccuracies in loan balances and interest calculations\n- Poor communication from servicers, including lack of notices or notifications\n- Problems with how payments are applied, often limited to interest rather than principal\n- Unauthorized or improper transfer or sale of loans without borrower notice\n- Bad or misleading information reported to credit bureaus, leading to credit score damages\n- Coercive servicing practices such as steering into forbearance or consolidation without providing all options\n- Unexplained or incorrect delinquency and default reporting\n- Inadequate investigation or resolution of disputes about loan terms, balances, or misconduct\n- Data breaches or improper disclosures violating privacy laws like FERPA\n- Challenges in obtaining documentation or proof of original loan agreements, transfers, and legal authority\n\nOverall, the most common issues center around mismanageme

In [39]:
ensemble_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the complaints provided, yes, several complaints indicate that complaints were not handled in a timely manner. Specifically, there are multiple instances where responses from the companies were marked as "No" or "Not timely," such as:\n\n- Complaint ID: 12935889 (Mohela, MD) — Response was "No" for timeliness.\n- Complaint ID: 12654977 (Mohela, MD) — Response was "No" for timeliness.\n- Complaint ID: 12739706 (Mohela, NJ) — Response was "No" for timeliness.\n- Complaint ID: 12744910 (Maximus Federal Services, KY) — Response was "Yes" for timeliness.\n- Complaint ID: 12823876 (EdFinancial Services, CA) — Response was "Yes" for timeliness.\n- Complaint ID: 12516723 (EdFinancial Services, CA) — Response was "Yes" for timeliness.\n- Multiple complaints regarding failures of companies like Maximus Federal Services/Aidvantage and TransUnion showing they either did not respond or responded after delays, with some explicitly indicating that their complaints or disputes were not addre

In [40]:
ensemble_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People often failed to pay back their loans due to a variety of complex issues, including:\n\n1. Lack of clear or adequate information from lenders or servicers about repayment options, interest accumulation, and possible forgiveness programs, leading to misconceptions about their obligations.\n2. Being steered into forbearance or deferment, which allowed interest to continue accruing and increased the total debt over time.\n3. Financial hardships such as unemployment, low income, health issues, or unexpected expenses like homelessness or accidents, which made payments unmanageable.\n4. Poor communication or failure of loan servicers to notify borrowers about due dates, payment statuses, or transfers between servicers, resulting in missed payments or incorrect delinquency reporting.\n5. Administrative errors, such as incorrect account information, improper reporting to credit bureaus, or mishandling of payments, which damaged credit scores and created barriers to further borrowing.\n6

## Task 10: Semantic Chunking

While this is not a retrieval method, it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of several [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

We'll use the `percentile` thresholding method for this example. It calculates the distance between each pair of consecutive sentences, then breaks the sequence of sentences wherever a distance exceeds a given percentile of all distances.
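To make the percentile idea concrete, here is a minimal sketch in plain Python. It is not LangChain's implementation: the 2-D `toy_vectors` stand in for real embeddings, and the helper names and the 95th-percentile setting are assumptions chosen for illustration.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

def percentile_chunks(sentences, vectors, pct=95.0):
    # Distance between each sentence and the sentence that follows it
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, pct)

    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            # The jump to this sentence is unusually large: start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

# Hypothetical sentences with hand-made 2-D "embeddings"
toy_sentences = [
    "My payment was applied to interest only.",
    "The servicer never credited the principal.",
    "My data was shared without consent.",
    "I was never notified of the disclosure.",
]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

for chunk in percentile_chunks(toy_sentences, toy_vectors):
    print(chunk)
```

In this toy data the only large distance sits between the second and third sentences, so the sketch produces two chunks: one about payments and one about privacy.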

In [41]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [42]:
semantic_documents = semantic_chunker.split_documents(loan_complaint_data[:20])

Let's create a new vector store.

In [43]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="Loan_Complaint_Data_Semantic_Chunks"
)

We'll use naive retrieval for this example.

In [44]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [45]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [46]:
semantic_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided complaints, the most common issues with loans appear to be related to miscommunication or errors in loan servicing, such as problems with repayment plans, incorrect account status or reporting, difficulties with auto-debit setup, and disputes over loan balances or default status. \n\nWhile the specific "most common issue" is not explicitly stated, recurring themes include:\n\n- Problems with repayment and payment processing\n- Incorrect or disputed account status and reporting\n- Lack of clear communication from servicers\n- Issues with loan forgiveness or discharge processes\n- Breaches of privacy or improper use of personal data\n\nIn summary, the most common issue seems to be **problems related to loan servicing, including payment handling, account status, and communication difficulties**.'

In [47]:
semantic_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the information provided, all the complaints included responses labeled as "Closed with explanation," and the "Timely response?" field indicates "Yes" for each. This suggests that, according to the recorded data, no complaints remain unhandled or unresolved due to delays. \n\nHowever, it is important to note that some complaints mention issues such as lack of response or ongoing disputes, but the official status in the data indicates that they were responded to within the expected timeframe.\n\nTherefore, the answer is: **No, there are no complaints recorded in this data set that were left unhandled or not handled in a timely manner.**'

In [48]:
semantic_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans for various reasons, including issues related to miscommunication, administrative delays, and disputes over the legitimacy or status of their loans. For example, some borrowers experienced trouble due to receiving incorrect or bad information from lenders or servicers regarding their loan status or repayment terms. Others encountered difficulties with the handling of payments, such as payments not being processed correctly or being rejected despite sufficient funds. Additionally, some borrowers faced complications arising from loan transfer issues, inaccurate reporting, or legal disputes concerning the validity of their debts. In some cases, borrowers felt that their personal information was mishandled or compromised, leading to further complications. Overall, these issues highlight challenges such as lack of transparency, administrative delays, or legal and informational disputes that contributed to borrowers' inability to successfully repay thei

#### ❓ Question #3:

If sentences are short and highly repetitive (e.g., FAQs), how might semantic chunking behave, and how would you adjust the algorithm?

##### ✅ Answer:

**How Semantic Chunking Behaves with Short, Repetitive Sentences:**

**1. Problematic Similarity Patterns:**
- Short, repetitive sentences (like FAQ questions) often have very high semantic similarity scores
- The algorithm may struggle to find meaningful breakpoints between semantically similar content
- Could result in either overly large chunks (everything grouped together) or overly small chunks (no grouping occurs)

**2. Threshold Sensitivity Issues:**
- With highly similar content, small variations in similarity scores become less meaningful
- Percentile-based thresholding may not work effectively when most distances are very close
- The algorithm might either chunk everything together or keep everything separate

**3. Loss of Logical Structure:**
- FAQs have inherent question-answer pairs that should ideally stay together
- Semantic chunking might break these logical pairs if focusing purely on sentence-level similarity
- Context and structure information gets lost in favor of semantic similarity

**Adjustments to Improve Performance:**

**1. Threshold Method Modifications:**
```python
# Use more aggressive thresholding
semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="standard_deviation",  # More sensitive to small differences
    breakpoint_threshold_amount=1.0  # Lower threshold for more breaks
)
```
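As a rough illustration of why a lower `breakpoint_threshold_amount` yields more breaks, the sketch below assumes the standard-deviation method places the threshold at the mean distance plus `amount` standard deviations. That formula, the helper names, and the `faq_distances` values are assumptions for illustration, not a description of the library internals.

```python
import statistics

def std_threshold(distances, amount):
    # Assumed rule: mean distance + amount * (population) standard deviation
    return statistics.fmean(distances) + amount * statistics.pstdev(distances)

def count_breaks(distances, threshold):
    # A break is placed wherever a distance exceeds the threshold
    return sum(1 for d in distances if d > threshold)

# Hypothetical distances between consecutive FAQ sentences: tightly clustered,
# with two slightly larger gaps where the topic changes
faq_distances = [0.10, 0.11, 0.09, 0.10, 0.12, 0.10, 0.18, 0.11, 0.10, 0.09, 0.16, 0.10]

for amount in (3.0, 1.0):
    threshold = std_threshold(faq_distances, amount)
    breaks = count_breaks(faq_distances, threshold)
    print(f"amount={amount}: threshold={threshold:.3f}, breaks={breaks}")
```

With these numbers, an amount of 3.0 finds no breakpoints at all, while 1.0 separates the two topic changes.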

**2. Pre-processing Strategies:**
- Identify FAQ structure patterns (Q: A: formatting)
- Use regex to detect question-answer pairs
- Apply metadata-based chunking to preserve logical pairs

**3. Hybrid Approaches:**
Combine semantic chunking with structural rules:
- First chunk by logical structure (Q-A pairs)
- Then apply semantic chunking within those logical boundaries
- Use custom splitting logic for FAQ-specific patterns
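The structural first step could look something like the sketch below. It assumes a hypothetical FAQ layout in which every question starts a line with `Q:` and is followed by an `A:` line; the pattern, the `split_faq` helper, and the sample text are illustrative only.

```python
import re

# Assumed FAQ layout: "Q: ..." on one line, followed by "A: ..." (possibly multi-line)
QA_PATTERN = re.compile(
    r"^Q:\s*(?P<question>.+?)\s*\nA:\s*(?P<answer>.+?)(?=\nQ:|\Z)",
    re.MULTILINE | re.DOTALL,
)

def split_faq(text):
    """Return one chunk per question-answer pair so pairs are never separated."""
    chunks = []
    for match in QA_PATTERN.finditer(text.strip()):
        question = match.group("question").strip()
        answer = match.group("answer").strip()
        chunks.append(f"Q: {question}\nA: {answer}")
    return chunks

faq_text = """Q: How do I reset my password?
A: Use the reset link on the login page.

Q: How do I change my password?
A: Open account settings and choose a new password.
"""

for chunk in split_faq(faq_text):
    print(chunk)
    print("---")
```

Each resulting chunk keeps a question with its answer. Semantic chunking could then be applied inside a single chunk only when an answer is long enough to need further splitting.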

**4. Alternative Chunking Methods:**
- **Fixed-size chunking** with Q-A pair preservation
- **Metadata-based chunking** using FAQ structure
- **Custom splitters** that understand FAQ formatting

**5. Enhanced Similarity Calculation:**
- Use more sophisticated embedding models that better capture subtle differences
- Apply domain-specific embeddings trained on FAQ data
- Consider embeddings with more dimensions for finer differentiation

**Recommended Implementation for FAQs:**
Rather than relying purely on semantic chunking, use a structured approach that preserves the logical Q-A relationships while still benefiting from semantic organization for related topics.


# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against each other.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever-specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparison between them
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance
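For the performance factor, it can help to picture what retriever-level metrics measure. The sketch below is a simplified, ID-based stand-in: the Ragas metrics used in this notebook rely on an LLM to judge relevance, whereas here relevance is given by a hypothetical set of known-relevant document IDs.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Average of precision@k taken at every rank k that holds a relevant item.

    Rewards retrievers that place relevant documents near the top.
    """
    hits = 0
    precision_sum = 0.0
    for k, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / k
    return precision_sum / hits if hits else 0.0

def context_recall(retrieved_ids, relevant_ids):
    """Fraction of the relevant documents that were retrieved at all."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids) & set(relevant_ids)) / len(relevant_ids)

# Hypothetical ranking returned by a retriever, and the ground-truth relevant set
retrieved = ["d3", "d7", "d1", "d9"]
relevant = {"d3", "d1", "d5"}

print(f"precision: {context_precision(retrieved, relevant):.3f}")
print(f"recall:    {context_recall(retrieved, relevant):.3f}")
```

Precision penalizes the irrelevant `d7` sitting above `d1`, while recall penalizes the retriever for never returning `d5`.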

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.
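If you also want a quick local latency estimate alongside LangSmith, a small timing helper is one option. This is a minimal sketch: `measure_latency` and `fake_retriever` are hypothetical names, and the stand-in simply sleeps instead of querying a vector store.

```python
import statistics
import time

def measure_latency(fn, queries, repeats=3):
    """Call fn(query) `repeats` times per query and summarize wall-clock time."""
    timings = []
    for query in queries:
        for _ in range(repeats):
            start = time.perf_counter()
            fn(query)
            timings.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.fmean(timings),
        "max_s": max(timings),
        "calls": len(timings),
    }

def fake_retriever(query):
    # Hypothetical stand-in for a real retriever call
    time.sleep(0.005)
    return [query.upper()]

stats = measure_latency(fake_retriever, ["late fees", "forbearance"])
print(stats)
```

With a real retriever, passing `retriever.invoke` as `fn` would time the same call used elsewhere in this notebook.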

In [49]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader


path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

In [50]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [51]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(
    llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs[:20], testset_size=10)


In [52]:
dataset.to_pandas()

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | How does the use of BBAY 3 relate to Direct Lo... | [non-term (includes clock-hour calendars), or ... | If substantially equal nonstandard terms in a ... | single_hop_specifc_query_synthesizer |
| 1 | Are there exceptions to the normal loan period... | [Inclusion of Clinical Work in a Standard Term... | Yes, there are exceptions to the normal loan p... | single_hop_specifc_query_synthesizer |
| 2 | What is the payment period requirement for Tit... | [Non-Term Characteristics A program that measu... | Title IV program disbursements, except for Fed... | single_hop_specifc_query_synthesizer |
| 3 | Whaat informashun does Volume 8 provide regard... | [both the credit or clock hours and the weeks ... | Volume 8, Chapters 5 and 6, contains informati... | single_hop_specifc_query_synthesizer |
| 4 | What are the definition and characteristics of... | [<1-hop>\n\nInclusion of Clinical Work in a St... | Nonstandard terms are defined as terms that ar... | multi_hop_abstract_query_synthesizer |
| 5 | How do the disbursement requirements for feder... | [<1-hop>\n\nboth the credit or clock hours and... | In clock-hour or non-term credit-hour programs... | multi_hop_abstract_query_synthesizer |
| 6 | What are the disbursement requirements for fed... | [<1-hop>\n\nboth the credit or clock hours and... | For federal student aid programs such as Pell ... | multi_hop_abstract_query_synthesizer |
| 7 | wut is a nonstanderd term and how is a non-ter... | [<1-hop>\n\nInclusion of Clinical Work in a St... | A nonstandard term is a term that is not a sem... | multi_hop_abstract_query_synthesizer |
| 8 | How do the disbursement requirements for Title... | [<1-hop>\n\nDisbursement Timing in Subscriptio... | In subscription-based programs, for the first ... | multi_hop_specific_query_synthesizer |
| 9 | How do the examples in Appendix A and Appendix... | [<1-hop>\n\nDisbursement Timing in Subscriptio... | The examples in Appendix A illustrate the prin... | multi_hop_specific_query_synthesizer |


In [None]:
# Set up LangSmith for cost and latency tracking
import os
import getpass

os.environ["LANGSMITH_API_KEY"] = getpass.getpass("LangSmith API Key:")

# Enable LangSmith tracing
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_PROJECT"] = "retriever-evaluation"

In [54]:
# Import required libraries for evaluation
from ragas.metrics import ContextPrecision, ContextRecall, ContextRelevance
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.dataset_schema import SingleTurnSample, EvaluationDataset
import pandas as pd
import time
from langsmith import traceable
from datetime import datetime


In [55]:
# Set up RAGAS evaluator with LLM wrapper
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# Define RAGAS metrics for retriever evaluation
ragas_metrics = [
    ContextPrecision(llm=evaluator_llm),
    ContextRecall(llm=evaluator_llm), 
    ContextRelevance(llm=evaluator_llm)
]

print("RAGAS evaluator setup complete!")


RAGAS evaluator setup complete!


In [56]:
# Convert the generated dataset to RAGAS format
test_df = dataset.to_pandas()

# Create evaluation samples from the test dataset
evaluation_samples = []
for idx, row in test_df.iterrows():
    sample = {
        'user_input': row['user_input'],
        'reference_contexts': row['reference_contexts'],
        'reference': row['reference']
    }
    evaluation_samples.append(sample)

print(f"Created {len(evaluation_samples)} evaluation samples")
print("Sample structure:", evaluation_samples[0].keys())


Created 12 evaluation samples
Sample structure: dict_keys(['user_input', 'reference_contexts', 'reference'])


In [57]:
# Create a comprehensive evaluation function for retrievers
@traceable(name="retriever_evaluation")
def evaluate_retriever(retriever, retriever_name, test_samples, use_semantic_chunking=False):
    """
    Evaluate a retriever using RAGAS metrics with cost and latency tracking
    """
    results = []
    total_latency = 0
    
    print(f"Evaluating {retriever_name} (Semantic Chunking: {use_semantic_chunking})...")
    
    for i, sample in enumerate(test_samples):
        try:
            start_time = time.time()
            
            # Retrieve documents using the retriever
            retrieved_docs = retriever.invoke(sample['user_input'])
            
            # Extract context from retrieved documents
            retrieved_contexts = [doc.page_content for doc in retrieved_docs]
            
            end_time = time.time()
            latency = end_time - start_time
            total_latency += latency
            
            # Create RAGAS sample
            ragas_sample = {
                'user_input': sample['user_input'],
                'retrieved_contexts': retrieved_contexts,
                'reference_contexts': sample['reference_contexts'],
                'reference': sample['reference']
            }
            
            results.append(ragas_sample)
            
        except Exception as e:
            print(f"Error evaluating sample {i}: {str(e)}")
            continue
    
    # Convert to RAGAS dataset format
    ragas_dataset = EvaluationDataset.from_list(results)
    
    # Run RAGAS evaluation
    evaluation_result = evaluate(dataset=ragas_dataset, metrics=ragas_metrics, llm=evaluator_llm)
    
    # Only successful samples contribute to total_latency, so average over those
    avg_latency = total_latency / len(results) if results else 0
    
    return {
        'retriever_name': retriever_name,
        'semantic_chunking': use_semantic_chunking,
        'metrics': evaluation_result,
        'avg_latency_seconds': avg_latency,
        'total_samples': len(test_samples),
        'successful_samples': len(results)
    }

print("Retriever evaluation function created!")


Retriever evaluation function created!


In [None]:
# Create semantic chunking versions of all retrievers
print("Creating semantic chunking versions of retrievers...")

# Use more documents for semantic chunking to get better evaluation results
semantic_documents_full = semantic_chunker.split_documents(loan_complaint_data[:100])

# Create semantic vectorstore with more documents
semantic_vectorstore_full = Qdrant.from_documents(
    semantic_documents_full,
    embeddings,
    location=":memory:",
    collection_name="Semantic_Full_Eval"
)

# Semantic versions of retrievers
semantic_naive_retriever = semantic_vectorstore_full.as_retriever(search_kwargs={"k": 10})

# Semantic BM25 retriever
from langchain_community.retrievers import BM25Retriever
semantic_bm25_retriever = BM25Retriever.from_documents(semantic_documents_full)

# Semantic compression retriever
semantic_compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, 
    base_retriever=semantic_naive_retriever
)

# Semantic multi-query retriever  
semantic_multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=semantic_naive_retriever, 
    llm=chat_model
)

# Semantic parent document retriever
semantic_client = QdrantClient(location=":memory:")
semantic_client.create_collection(
    collection_name="semantic_parent_docs",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

semantic_parent_vectorstore = QdrantVectorStore(
    collection_name="semantic_parent_docs", 
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"), 
    client=semantic_client
)

semantic_store = InMemoryStore()
semantic_parent_document_retriever = ParentDocumentRetriever(
    vectorstore=semantic_parent_vectorstore,
    docstore=semantic_store,
    child_splitter=child_splitter,
)

# Add semantic documents to parent retriever
semantic_parent_document_retriever.add_documents(semantic_documents_full, ids=None)

# Semantic ensemble retriever
semantic_ensemble_retriever = EnsembleRetriever(
    retrievers=[
        semantic_bm25_retriever, 
        semantic_naive_retriever, 
        semantic_parent_document_retriever, 
        semantic_compression_retriever, 
        semantic_multi_query_retriever
    ], 
    weights=[0.2, 0.2, 0.2, 0.2, 0.2]
)

print("Semantic chunking versions created!")


Creating semantic chunking versions of retrievers...
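The equal `weights` passed to the ensemble above control how much each retriever's ranking counts when the lists are merged. A common way to merge ranked lists is weighted Reciprocal Rank Fusion, sketched below in plain Python. The constant `c=60`, the function name, and the toy rankings are assumptions for illustration rather than a copy of the library's code.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each appearance of a document adds weight / (rank + c) to its score,
    so documents ranked highly by several retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from a sparse and a dense retriever
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]

fused = weighted_rrf([bm25_ranking, dense_ranking], weights=[0.5, 0.5])
print(fused)
```

Here `doc_b` comes out first because both retrievers rank it near the top, even though only one of them ranks it first.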


In [None]:
# Run evaluations for all retrievers
results = []

# Define retrievers to evaluate
retrievers_to_evaluate = [
    # Original retrievers (without semantic chunking)
    (naive_retriever, "Naive RAG", False),
    (bm25_retriever, "BM25", False),
    (compression_retriever, "Contextual Compression", False),
    (multi_query_retriever, "Multi-Query", False),
    (parent_document_retriever, "Parent Document", False),
    (ensemble_retriever, "Ensemble", False),
    
    # Semantic chunking versions
    (semantic_naive_retriever, "Naive RAG", True),
    (semantic_bm25_retriever, "BM25", True),
    (semantic_compression_retriever, "Contextual Compression", True),
    (semantic_multi_query_retriever, "Multi-Query", True),
    (semantic_parent_document_retriever, "Parent Document", True),
    (semantic_ensemble_retriever, "Ensemble", True),
]

print("Starting comprehensive retriever evaluation...")
print(f"Evaluating {len(retrievers_to_evaluate)} retriever configurations...")

# Run evaluations
for retriever, name, semantic in retrievers_to_evaluate:
    try:
        result = evaluate_retriever(retriever, name, evaluation_samples, semantic)
        results.append(result)
        print(f"✓ Completed: {name} (Semantic: {semantic})")
    except Exception as e:
        print(f"✗ Error evaluating {name} (Semantic: {semantic}): {str(e)}")
        continue

print(f"\nCompleted evaluation of {len(results)} retriever configurations!")


In [None]:
# Analyze and compile results
print("Analyzing results...")

# Create a comprehensive results DataFrame
compiled_results = []

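for result in results:
    # In recent Ragas versions evaluate() returns a result object, not a plain dict,
    # so read the per-sample scores from its DataFrame and average each metric.
    scores_df = result['metrics'].to_pandas()

    def mean_score(*column_names):
        # Column names can differ between Ragas versions (assumption), so try each candidate
        for column_name in column_names:
            if column_name in scores_df.columns:
                return scores_df[column_name].mean()
        return float('nan')

    row = {
        'Retriever': result['retriever_name'],
        'Semantic_Chunking': result['semantic_chunking'],
        'Context_Precision': mean_score('context_precision'),
        'Context_Recall': mean_score('context_recall'),
        'Context_Relevance': mean_score('nv_context_relevance', 'context_relevance'),
        'Avg_Latency_Seconds': result['avg_latency_seconds'],
        'Total_Samples': result['total_samples'],
        'Successful_Samples': result['successful_samples'],
        'Success_Rate': result['successful_samples'] / result['total_samples'] if result['total_samples'] > 0 else 0
    }
    compiled_results.append(row)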

results_df = pd.DataFrame(compiled_results)

# Display results
print("\n" + "="*80)
print("RETRIEVER EVALUATION RESULTS")
print("="*80)
print(results_df.round(4).to_string(index=False))

# Calculate summary statistics
print("\n" + "="*80)
print("SUMMARY STATISTICS")
print("="*80)

# Group by semantic chunking
semantic_comparison = results_df.groupby('Semantic_Chunking').agg({
    'Context_Precision': 'mean',
    'Context_Recall': 'mean', 
    'Context_Relevance': 'mean',
    'Avg_Latency_Seconds': 'mean'
}).round(4)

print("\nSemantic Chunking Impact:")
print(semantic_comparison)

# Find best performers
print("\nTop Performers by Metric:")
for label, column in [
    ("Context Precision", "Context_Precision"),
    ("Context Recall", "Context_Recall"),
    ("Context Relevance", "Context_Relevance"),
]:
    best = results_df.loc[results_df[column].idxmax()]
    print(f"Best {label}: {best['Retriever']} (Semantic: {best['Semantic_Chunking']}) - {best[column]:.4f}")
fastest = results_df.loc[results_df['Avg_Latency_Seconds'].idxmin()]
print(f"Fastest Retriever: {fastest['Retriever']} (Semantic: {fastest['Semantic_Chunking']}) - {fastest['Avg_Latency_Seconds']:.4f}s")


In [None]:
# Efficiency analysis from the measured latencies (dollar cost is tracked in LangSmith, not computed here)
print("\n" + "="*80)
print("COST AND EFFICIENCY ANALYSIS")
print("="*80)

# Calculate efficiency scores
results_df['Overall_Performance'] = (
    results_df['Context_Precision'] + 
    results_df['Context_Recall'] + 
    results_df['Context_Relevance']
) / 3

# Calculate efficiency (performance per second)
results_df['Efficiency_Score'] = results_df['Overall_Performance'] / results_df['Avg_Latency_Seconds']

# Sort by overall performance
top_performers = results_df.sort_values('Overall_Performance', ascending=False)

print("\nRanked by Overall Performance:")
print(top_performers[['Retriever', 'Semantic_Chunking', 'Overall_Performance', 'Avg_Latency_Seconds', 'Efficiency_Score']].round(4).to_string(index=False))

# Efficiency analysis
print("\nEfficiency Analysis (Performance per Second):")
efficiency_ranking = results_df.sort_values('Efficiency_Score', ascending=False)
print(efficiency_ranking[['Retriever', 'Semantic_Chunking', 'Efficiency_Score', 'Overall_Performance', 'Avg_Latency_Seconds']].round(4).to_string(index=False))


In [None]:
# Create visualizations
import matplotlib.pyplot as plt
import seaborn as sns

# Set up the plotting style
plt.style.use('default')
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. Performance comparison by retriever
ax1 = axes[0, 0]
performance_data = results_df.pivot(index='Retriever', columns='Semantic_Chunking', values='Overall_Performance')
performance_data.plot(kind='bar', ax=ax1, color=['skyblue', 'lightcoral'])
ax1.set_title('Overall Performance by Retriever\n(With vs Without Semantic Chunking)')
ax1.set_ylabel('Overall Performance Score')
ax1.legend(['Without Semantic', 'With Semantic'])
ax1.tick_params(axis='x', rotation=45)

# 2. Latency comparison
ax2 = axes[0, 1]
latency_data = results_df.pivot(index='Retriever', columns='Semantic_Chunking', values='Avg_Latency_Seconds')
latency_data.plot(kind='bar', ax=ax2, color=['lightgreen', 'orange'])
ax2.set_title('Average Latency by Retriever\n(With vs Without Semantic Chunking)')
ax2.set_ylabel('Average Latency (seconds)')
ax2.legend(['Without Semantic', 'With Semantic'])
ax2.tick_params(axis='x', rotation=45)

# 3. Performance vs Latency scatter plot
ax3 = axes[1, 0]
semantic_true = results_df[results_df['Semantic_Chunking'] == True]
semantic_false = results_df[results_df['Semantic_Chunking'] == False]

ax3.scatter(semantic_false['Avg_Latency_Seconds'], semantic_false['Overall_Performance'], 
           label='Without Semantic', alpha=0.7, s=100, color='skyblue')
ax3.scatter(semantic_true['Avg_Latency_Seconds'], semantic_true['Overall_Performance'], 
           label='With Semantic', alpha=0.7, s=100, color='lightcoral')

# Add retriever labels
for i, row in results_df.iterrows():
    ax3.annotate(row['Retriever'][:8], 
                (row['Avg_Latency_Seconds'], row['Overall_Performance']),
                xytext=(5, 5), textcoords='offset points', fontsize=8)

ax3.set_xlabel('Average Latency (seconds)')
ax3.set_ylabel('Overall Performance Score')
ax3.set_title('Performance vs Latency Trade-off')
ax3.legend()
ax3.grid(True, alpha=0.3)

# 4. Metric comparison heatmap
ax4 = axes[1, 1]
metrics_data = results_df.set_index(['Retriever', 'Semantic_Chunking'])[['Context_Precision', 'Context_Recall', 'Context_Relevance']]
heatmap_data = metrics_data.T
sns.heatmap(heatmap_data, annot=True, fmt='.3f', cmap='YlOrRd', ax=ax4, cbar_kws={'shrink': 0.8})
ax4.set_title('Performance Metrics Heatmap')
ax4.set_xlabel('')

plt.tight_layout()
plt.show()

print("Visualizations created successfully!")


In [None]:
# Save results for further analysis
results_df.to_csv('retriever_evaluation_results.csv', index=False)
print("Results saved to 'retriever_evaluation_results.csv'")

# Final summary table for easy reference
print("\n" + "="*100)
print("FINAL SUMMARY TABLE")
print("="*100)

summary_table = results_df.copy()
summary_table['Retriever_Config'] = summary_table['Retriever'] + ' (' + summary_table['Semantic_Chunking'].apply(lambda x: 'Semantic' if x else 'Standard') + ')'

final_summary = summary_table[['Retriever_Config', 'Overall_Performance', 'Avg_Latency_Seconds', 'Efficiency_Score']].sort_values('Overall_Performance', ascending=False)

print("\nRanked by Overall Performance:")
print(final_summary.round(4).to_string(index=False))

print("\n🎉 Evaluation Complete!")
print(f"✅ Evaluated {len(results)} of {len(retrievers_to_evaluate)} retriever configurations")
print(f"✅ Used {len(evaluation_samples)} test samples") 
print("✅ Tracked cost and latency with LangSmith")
print("✅ Applied RAGAS retriever-specific metrics")
print("✅ Compared with and without semantic chunking")

print("\n📊 Check LangSmith dashboard for detailed cost and trace analysis:")
print("https://smith.langchain.com")
