# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic Chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up QDrant!
  - Tasks 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain integration packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

We'll also provide our OpenAI key, as well as our Cohere API key.

In [3]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [4]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

## Task 2: Data Collection and Preparation

We'll be using our Loan Data once again, this time the structured data available through the CSV!

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

In [5]:
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(
    file_path="./data/complaints.csv",
    metadata_columns=[
      "Date received",
      "Product",
      "Sub-product",
      "Issue",
      "Sub-issue",
      "Consumer complaint narrative",
      "Company public response",
      "Company",
      "State",
      "ZIP code",
      "Tags",
      "Consumer consent provided?",
      "Submitted via",
      "Date sent to company",
      "Company response to consumer",
      "Timely response?",
      "Consumer disputed?",
      "Complaint ID"
    ]
)

loan_complaint_data = loader.load()

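for doc in loan_complaint_data:
    # Use the complaint narrative as the text to embed and search; all columns (including the narrative) remain in metadata.
    doc.page_content = doc.metadata["Consumer complaint narrative"]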

Let's look at an example document to see if everything worked as expected!

In [6]:
loan_complaint_data[0]

Document(metadata={'source': './data/complaints.csv', 'row': 0, 'Date received': '03/27/25', 'Product': 'Student loan', 'Sub-product': 'Federal student loan servicing', 'Issue': 'Dealing with your lender or servicer', 'Sub-issue': 'Trouble with how payments are being handled', 'Consumer complaint narrative': "The federal student loan COVID-19 forbearance program ended in XX/XX/XXXX. However, payments were not re-amortized on my federal student loans currently serviced by Nelnet until very recently. The new payment amount that is effective starting with the XX/XX/XXXX payment will nearly double my payment from {$180.00} per month to {$360.00} per month. I'm fortunate that my current financial position allows me to be able to handle the increased payment amount, but I am sure there are likely many borrowers who are not in the same position. The re-amortization should have occurred once the forbearance ended to reduce the impact to borrowers.", 'Company public response': 'None', 'Company'

## Task 3: Setting up QDrant!

Now that we have our documents, let's create a QDrant VectorStore with the collection name "LoanComplaints".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [7]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    loan_complaint_data,
    embeddings,
    location=":memory:",
    collection_name="LoanComplaints"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever simply treats each complaint as a document and uses cosine similarity to fetch the 10 most relevant ones.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later.
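Under the hood, the naive retriever compares the query embedding with each stored embedding and keeps the `k` highest-scoring documents. Here is a minimal brute-force sketch, assuming hand-made 3-d vectors (hypothetical values) in place of real 1536-d embeddings; a vector database such as Qdrant may use an index rather than scanning every vector.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 10) -> list[tuple[str, float]]:
    """Score every document against the query and return the k best (id, score) pairs."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-d "embeddings" standing in for real ones (hypothetical values).
docs = {
    "forbearance": [0.9, 0.1, 0.0],
    "credit_report": [0.1, 0.9, 0.1],
    "autopay": [0.5, 0.5, 0.2],
}
print([doc_id for doc_id, _ in top_k([1.0, 0.2, 0.0], docs, k=2)])  # ['forbearance', 'autopay']
```

If `k` exceeds the number of stored documents, the sketch simply returns all of them.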

In [8]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [9]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [10]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [11]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and piping it into naive_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : re-assigned to the value of the "context" key from the previous step; RunnablePassthrough.assign
    #              passes the rest of the input dict through, so this step leaves the dict unchanged
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)
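If the LCEL pipe syntax is unfamiliar, the data flow of this chain can be sketched in plain Python. This is only an illustration of the dict shapes, not how LCEL executes internally; `fake_retriever` and `fake_llm` are hypothetical stand-ins for `naive_retriever` and `rag_prompt | chat_model`.

```python
from typing import Callable

# Same slots as RAG_TEMPLATE (instructions omitted for brevity).
TEMPLATE = "Query:\n{question}\n\nContext:\n{context}\n"

def run_rag_chain(
    inputs: dict,
    retriever: Callable[[str], list],
    llm: Callable[[str], str],
) -> dict:
    question = inputs["question"]
    # Step 1: build {"context": retrieved docs, "question": original question}.
    step = {"context": retriever(question), "question": question}
    # Step 2: RunnablePassthrough.assign(context=itemgetter("context")) re-assigns
    # "context" to its own value, so the dict passes through unchanged.
    step = {**step, "context": step["context"]}
    # Step 3: fill the prompt, call the LLM, and return the answer with its context.
    prompt = TEMPLATE.format(question=step["question"], context=step["context"])
    return {"response": llm(prompt), "context": step["context"]}

# Hypothetical stand-ins for naive_retriever and rag_prompt | chat_model.
def fake_retriever(question: str) -> list[str]:
    return ["Complaint A about forbearance", "Complaint B about autopay"]

def fake_llm(prompt: str) -> str:
    return f"Answer based on {prompt.count('Complaint')} complaints"

result = run_rag_chain({"question": "Why did people miss payments?"}, fake_retriever, fake_llm)
print(result["response"])  # Answer based on 2 complaints
```

This shape is why the cells below index into `["response"].content`: `response` holds the chat model's message, while `context` keeps the retrieved documents available to the caller. In the real chain the list of `Document` objects is interpolated into the prompt as a string, which includes each document's metadata; that is likely how the model can see fields such as "Timely response?".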

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry-picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [12]:
naive_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, appears to be problems related to mismanagement and misinformation. Specifically, many complaints highlight issues such as errors in loan balances, misapplied payments, incorrect reporting on credit reports, unfair or confusing payment handling, and wrongful transfer or sale of loans without proper notification. Additionally, issues like difficulties in repayment plans, unauthorized collection efforts, and disputes over interest capitalization and loan terms are prevalent.\n\nIn summary, the most common issue is **mismanagement and miscommunication regarding loan balances, payments, and terms**, leading to financial hardship and credit report inaccuracies.'

In [13]:
naive_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, yes, some complaints were not handled in a timely manner. For example:\n\n- The complaint received on 03/28/25 by MOHELA was marked as "No" for timely response.\n- The complaint received on 04/18/25 by Nelnet, Inc. was handled with a response "Closed with explanation" and marked as "Yes" for timely response.\n- The complaint received on 04/24/25 by Maximus Federal Services, Inc. was timely responded to.\n\nMost other complaints indicate either timely responses or issues with unresolved complaints over longer periods. \n\nSo, to answer your question: Yes, some complaints, such as the one received on 03/28/25 by MOHELA, did not get handled in a timely manner.'

In [14]:
naive_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans for several reasons, including:\n\n1. Accumulation of interest during forbearance or deferment periods, which increased the total amount owed and made repayment more difficult.\n2. Lack of clear communication or proper notification from lenders or servicers about repayment start dates, loan transfers, or delinquency status, leading to unawareness of when payments were due.\n3. Inability to afford increased or minimum payments due to stagnant wages, high living expenses, or financial hardship.\n4. Mismanagement of loans by servicers, such as incorrect or confusing account information, failure to provide accurate payment or balance details, or improper handling of loan transfers.\n5. Restrictions on applying extra payments directly to principal, which extended the repayment period and increased the total interest paid.\n6. Errors or delays in processing income-based repayment plans or forbearance requests, leading to missed payments or credit issues

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.
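To make "based on the words they both contain" concrete, here is a minimal sketch of the Okapi BM25 scoring function. It assumes a naive lowercase whitespace tokenizer and the commonly used defaults k1 = 1.5 and b = 0.75. LangChain's `BM25Retriever` delegates scoring to the `rank_bm25` package, whose IDF variant differs slightly from the Lucene-style one used here.

```python
import math
from collections import Counter

def bm25_scores(query: str, corpus: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document in `corpus` against `query` with Okapi BM25."""
    docs = [doc.lower().split() for doc in corpus]  # naive whitespace tokenizer (assumption)
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Lucene-style IDF (always positive): rare terms weigh more.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term-frequency saturation (k1) and document-length normalization (b).
            denom = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

corpus = [
    "Error TS-999 means the token expired",
    "General guide to error codes and troubleshooting",
    "Error TS-100 means the disk is full",
]
scores = bm25_scores("what does error TS-999 mean", corpus)
print(max(range(len(corpus)), key=scores.__getitem__))  # 0
```

Every document contains `error`, so that term contributes little; only the first contains the rare token `ts-999`, and its high IDF pushes that document to the top. Note that `mean` does not match `means`: pure lexical matching has no notion of word variants.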

This retriever is very straightforward to set up! Let's see it in action below!


In [15]:
from langchain_community.retrievers import BM25Retriever

# Note: BM25Retriever returns k=4 documents by default, fewer than naive_retriever's 10.
bm25_retriever = BM25Retriever.from_documents(loan_complaint_data)

We'll construct the same chain - only changing the retriever.

In [16]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [17]:
bm25_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, the most common issue with loans appears to be dealing with the lender or servicer, specifically related to disputes over fees, interest calculations, payment application, and inaccurate or bad information about the loan. Many complaints involve difficulties in understanding or managing repayment terms, being misled about loan details, or issues with payment processing.\n\nIf I had to identify a single most common issue from the data, it would be: **Problems related to dealing with the lender or servicer, including disputes over fees, interest, and loan information.**\n\nPlease note that this is based on the recurring theme observed in multiple complaints within the context.'

In [18]:
bm25_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, all the complaints listed indicate that the companies responded to the complaints in a timely manner, with responses marked as "Yes" under the "Timely response?" field. Therefore, there is no indication that any complaints did not get handled in a timely manner.'

In [19]:
bm25_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans primarily because of issues related to miscommunication, mismanagement, and administrative problems with loan servicers. For example, some borrowers' automatic payments were unenrolled without their knowledge, leading to missed payments and negative impacts on their credit scores. Others were steered into incorrect forbearance options or experienced delays and lack of response when attempting to apply for relief, deferments, or forbearance. Additionally, some borrowers were not properly informed about changes in their loan servicing, new account details, or the status of their payments, which contributed to unpaid balances and deteriorating credit. Overall, these issues stem from ineffective communication, mishandling by servicers, and procedural errors within the loan servicing system."

It's not clear whether this is better or worse. If only we had a way to test this! (SPOILER: we do; the second half of the notebook covers it.)

#### ❓ Question #1:

Give an example query where BM25 is better than embeddings and justify your answer.

##### ✅ **Answer:**

Given the question *"What does error code TS-999 mean?"* for a corpus including error documentation and many specific types of error codes:

- An embedding model will retrieve chunks about error codes in general, but may miss the exact match for "TS-999", which is the most important part of this query.

- BM25 will rank chunks containing the exact token "TS-999" highly, because that rare term has a high inverse document frequency.

These differences come from how each method represents text. Embedding models are designed to capture semantic meaning, so "TS-999" and a neighboring code such as "TS-100" may land close together in vector space. BM25 (short for "Best Matching 25") is a ranking function based on lexical matching, so it rewards precise word or phrase matches. That makes it particularly effective for queries that include unique identifiers or technical terms.

If your use case is likely to require exact keyword matches, the advice is to use a hybrid approach: combine both retrieval methods (embeddings and BM25 ranking) with rank fusion to get the best results from RAG.

[Source: Anthropic Contextual Retrieval](https://www.anthropic.com/news/contextual-retrieval)

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.
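The two-stage idea can be sketched in a few lines. `overlap_score` is a hypothetical toy scorer standing in for Cohere's Rerank model, which uses a learned relevance model instead; only the retrieve, re-score, and truncate structure is meant to carry over.

```python
from typing import Callable

def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Re-score retrieved candidates against the query and keep only the best top_n."""
    scored = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return scored[:top_n]

def overlap_score(query: str, doc: str) -> float:
    """Hypothetical stand-in for a reranker model: fraction of query terms found in the doc."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms)

# Pretend these came back from a first-stage retriever.
retrieved = [
    "servicer never responded to my complaint",
    "payment applied late by servicer",
    "complaint not handled in a timely manner by servicer",
    "interest capitalized after forbearance",
    "loan transferred without notice",
]
print(rerank("complaint not handled timely", retrieved, overlap_score, top_n=2))
# ['complaint not handled in a timely manner by servicer', 'servicer never responded to my complaint']
```

The first stage can afford to be broad (cheap, high recall) because the second stage is more precise and only has to score a handful of candidates.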

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [20]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

Let's create our chain again, and see how this does!

In [21]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [22]:
contextual_compression_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, appears to involve problems with loan servicing, including errors in loan balances, misapplied payments, incorrect or confusing information about loan terms, and mishandling of personal data. Many complaints highlight issues such as inaccurate account information, unauthorized transfers of loans, lack of communication or documentation, and violations of privacy laws.'

In [23]:
contextual_compression_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, it appears that at least one complaint was not handled in a timely manner. Specifically, the complaint regarding the student loan account review and violations of FERPA has been open for nearly 18 months without resolution, despite requests for assistance. Although the company\'s response indicated it was "Closed with explanation" and the response was "Yes" in terms of timeliness, the resolution has not been achieved for an extended period, suggesting that this complaint was not addressed promptly.\n\nAdditionally, there is a complaint from another individual about unresolved issues with their loan payments not being applied, which also seems to remain unresolved despite the complaint being submitted.\n\nTherefore, yes, some complaints did not get handled in a timely manner.'

In [24]:
contextual_compression_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans primarily due to a lack of clear communication, misinformation, and difficulties managing loan repayment options. Many borrowers were unaware that they needed to repay their loans because they were not properly informed by financial aid officers or loan servicers. Additionally, issues such as incorrect or inconsistent account information, unexplained increases in loan balances and interest, and limited repayment options like forbearance or deferment—while interest continued to accrue—made it challenging for borrowers to pay off their loans. Some borrowers also faced hardships because the available options often led to increasing total debt over time, and they lacked sufficient financial resources or information about how interest compounds and affects repayment, which further hindered their ability to successfully repay their loans.'

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context
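Steps 2 and 3 amount to retrieving once per query and taking a de-duplicated union. In this sketch the reformulations are hard-coded (in `MultiQueryRetriever` an LLM writes them), and `keyword_retriever` with its `INDEX` is a hypothetical toy, not a real vector store.

```python
from typing import Callable

def multi_query_retrieve(queries: list[str], retriever: Callable[[str], list[str]]) -> list[str]:
    """Retrieve for every query variant and return the unique documents in first-seen order."""
    seen: set[str] = set()
    merged: list[str] = []
    for q in queries:
        for doc in retriever(q):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

# Hypothetical keyword index standing in for a vector-store retriever.
INDEX = {
    "late": ["doc_late_fee"],
    "delinquent": ["doc_credit_report", "doc_late_fee"],
    "missed": ["doc_autopay_dropped"],
}

def keyword_retriever(query: str) -> list[str]:
    return [doc for word in query.lower().split() for doc in INDEX.get(word, [])]

# Step 1 (normally done by an LLM): hand-written reformulations of the user query.
variants = [
    "why were payments late",
    "why were accounts reported delinquent",
    "why were payments missed",
]
print(multi_query_retrieve(variants, keyword_retriever))
# ['doc_late_fee', 'doc_credit_report', 'doc_autopay_dropped']
```

The first variant alone finds one document; the union finds three, and the duplicate `doc_late_fee` appears only once in the context.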

So, how hard is it to set up? Not bad! Let's see it below!



In [25]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)

In [26]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [27]:
multi_query_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided context, appears to be problems related to **debt management and reporting errors**, including:\n\n- Incorrect or misleading information on credit reports (e.g., account status incorrectly reported as delinquent or overdue).\n- Systemic errors in credit reporting impacting credit scores.\n- Confusion or disputes over loan balances, interest calculations, and loan transfer or servicing changes.\n- Difficulties in communication with loan servicers, such as lack of transparency, unnotified loan transfers, and inadequate or misleading information about repayment options and loan terms.\n- Challenges in handling forbearance, repayment plans, and unawarded or misapplied payments leading to increased balances and interest.\n\nThese issues often lead to significant financial hardship, damaged credit scores, and frustration with the loan servicing process, indicating that a most common problem is systemic mismanagement and errors in loan 

In [28]:
multi_query_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the information provided, several complaints indicate that issues were not handled in a timely manner. For example, a complaint from a consumer involves a response time exceeding 1 year, with the complaint still unresolved despite requests for review and resolution. Additionally, some complaints note delays of over 30 days or more before receiving any response or resolution, despite the company\'s responses being marked as "timely" or "with explanation."\n\nTherefore, yes, there were complaints that did not get handled in a timely manner.'

In [29]:
multi_query_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans due to a variety of systemic and individual challenges, including:\n\n1. Lack of clear information and understanding about loan terms, interest accumulation, and repayment options, which led borrowers to make uninformed decisions.\n2. High and accumulating interest, especially when loans were placed in forbearance or deferment, causing balances to grow and making repayment seem impossible.\n3. Inadequate communication from lenders or servicers, resulting in borrowers being unaware of when repayment resumed or of available options such as income-driven repayment plans or loan forgiveness.\n4. Misinformation or mismanagement by loan servicers, including incorrect account reporting, mishandling of payments, and improper transfer of loans, which further complicated repayment efforts.\n5. Financial hardships, such as unemployment, illness, homelessness, or unexpected events, that made it difficult for borrowers to meet their repayment obligations.\n6. 

#### ❓ Question #2:

Explain how generating multiple reformulations of a user query can improve recall.

##### ✅ **Answer:**

Multi-query retrieval improves recall by addressing the fundamental limitation that a single query formulation may not capture all relevant documents, even when those documents contain the information the user seeks. This is called the **vocabulary mismatch problem**. Users and document authors often use different terminology to describe the same concepts. Multi-query generates several reformulations of the original query, typically 3-5 variations that:

- Use different synonyms and related terms
- Vary the specificity level (broader or narrower focus)
- Rephrase the question structure
- Emphasize different aspects of the information

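This increased query diversity, in turn, increases the chance of retrieving relevant documents: the final context is the union of the results for every variant, so each reformulation can add relevant documents that the original query missed.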

## Task 8: Parent Document Retriever

The Parent Document Retriever implements a "small-to-big" strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent documents".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.
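A minimal sketch of search-small, return-big, assuming a hypothetical word-count splitter and a toy word-overlap score in place of `RecursiveCharacterTextSplitter` and embedding similarity. The `parents` dict plays the role of the docstore, and each child chunk remembers its parent's id.

```python
def split_into_children(text: str, words_per_chunk: int) -> list[str]:
    """Hypothetical word-count splitter standing in for RecursiveCharacterTextSplitter."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk]) for i in range(0, len(words), words_per_chunk)]

def build_child_index(parents: dict[str, str], words_per_chunk: int) -> list[tuple[str, str]]:
    """Return (child_chunk, parent_id) pairs; full parent texts stay in `parents` (the docstore)."""
    return [
        (child, parent_id)
        for parent_id, text in parents.items()
        for child in split_into_children(text, words_per_chunk)
    ]

def overlap(query: str, chunk: str) -> int:
    """Toy relevance score (number of shared words) in place of embedding similarity."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def parent_document_search(
    query: str, parents: dict[str, str], child_index: list[tuple[str, str]], k: int = 2
) -> list[str]:
    # 1. Search over the small child chunks, keeping the k best matching ones.
    hits = sorted(child_index, key=lambda pair: overlap(query, pair[0]), reverse=True)
    hits = [pair for pair in hits if overlap(query, pair[0]) > 0][:k]
    # 2. Return the big parent documents behind those hits, without duplicates.
    results: list[str] = []
    for _, parent_id in hits:
        if parents[parent_id] not in results:
            results.append(parents[parent_id])
    return results

parents = {
    "p1": "my autopay was removed without notice so I missed two payments and my credit score dropped",
    "p2": "the servicer capitalized interest after forbearance and my balance grew",
}
child_index = build_child_index(parents, words_per_chunk=5)
print(parent_document_search("autopay removed", parents, child_index))
```

Only the first five-word chunk of `p1` matches the query, yet the full complaint comes back, including the "credit score dropped" consequence that the matching chunk alone would have lost.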

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [30]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = loan_complaint_data
child_splitter = RecursiveCharacterTextSplitter(chunk_size=750)

We'll need to set up a new QDrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension (1536, the output size of `text-embedding-3-small`). You'll need to change this if you're using a different embedding model.

In [31]:
from langchain_qdrant import QdrantVectorStore

client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = QdrantVectorStore(
    collection_name="full_documents", embedding=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)

Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [32]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore=parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [33]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [34]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [35]:
parent_document_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, appears to be related to mismanagement and errors by loan servicers, including errors in loan balances, misapplied payments, wrongful denials of payment plans, and incorrect or inconsistent credit reporting. Many complaints highlight problems such as inaccurate loan balances, unfair increases in interest rates, and issues arising from the transfer or sale of loan accounts.'

In [36]:
parent_document_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, yes, there are complaints that did not get handled in a timely manner. Specifically, the complaints about the student loan issues with MOHELA (Complaint IDs: 12709087 and 12935889) indicate that the responses from the company were delayed, with the consumer noting they had not heard from anyone despite multiple follow-ups and extended wait times. The complaint about the credit dispute with Nelnet (Complaint ID: 13205525) was responded to within the expected timeframe, so that was handled timely.\n\nSo, the answer is: Yes, some complaints did not get handled in a timely manner.'

In [37]:
parent_document_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans primarily due to financial hardships, lack of proper information, and mismanagement by loan servicers or educational institutions. Specifically, some borrowers experienced severe financial difficulties after graduation, making it impossible to consistently make payments. Others were misled by schools' representations about the value and manageability of their loans, and some faced issues with loan servicers failing to provide proper notification or account management, leading to unverified or questionable debts, late payments, or credit reporting errors. Additionally, some borrowers relied on deferment and forbearance options that increased interest costs, further complicating repayment."

Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to answer the performance question more rigorously.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever simply takes two or more retrievers and combines their retrieved documents using a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.
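Reciprocal Rank Fusion only looks at ranks, not raw scores, which is what lets it merge BM25 scores and cosine similarities that live on different scales. Below is a minimal sketch of a weighted variant, assuming the constant c = 60 suggested in the RRF paper; LangChain's `EnsembleRetriever` follows a similar weighted formula, but treat this as an illustration rather than its exact code. The document ids are hypothetical.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists: each doc earns weight / (c + rank) from every list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(docs, start=1):
            scores[doc] += weight / (c + rank)
    # Highest fused score first; ties keep first-seen order because sorted() is stable.
    return sorted(scores, key=scores.__getitem__, reverse=True)

bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_c", "doc_a", "doc_d"]
print(weighted_rrf([bm25_hits, dense_hits], weights=[0.5, 0.5]))
# ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents found by both retrievers (`doc_a`, `doc_c`) outrank those found by only one. Shifting the weights to `[0.1, 0.9]` promotes the second retriever's favorites, giving `['doc_c', 'doc_a', 'doc_d', 'doc_b']`.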

In [38]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We've packed *all* of these retrievers together in an ensemble. Now let's plug it into the same chain.

In [39]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [40]:
ensemble_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, the most common issues with student loans appear to involve:\n\n- Dealing with lenders or servicers, including receiving bad information about loans, incorrect account status, or mishandling of accounts.\n- Problems related to loan management such as improper transfer, unclear or unnotified transfer of loan management, or incorrect reporting of loan status.\n- Errors in loan balances or interest calculations, including unauthorized interest accrual, improper capitalization, or incorrect reporting to credit bureaus.\n- Difficulties with repayment plans, including being steered into long-term forbearance without proper guidance on options like income-driven repayment or rehabilitation.\n- Communications failures, such as lack of timely notification about default, transfer, or changes in account status.\n- Issues with loan consolidation, including insufficient disclosure, lack of informed consent, or unexpected payment amounts.\n\nWhile there is no single "

In [41]:
ensemble_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the information provided, yes, there are complaints indicating that some complaints did not get handled in a timely manner. Specifically, at least one complaint (Complaint ID: 12935889) was marked "No" for timely response, meaning it was not handled promptly. Additionally, multiple complaints mention delays, lack of response, or insufficient follow-up from the companies involved.'

In [42]:
ensemble_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans primarily because of a combination of factors highlighted in the complaints:\n\n1. Lack of clear and adequate communication: Borrowers were often not informed about when payments were due, changes in loan servicers, or the status of their loans, leading to unintentional delinquency.\n2. Mismanagement and errors by loan servicers: Many complaints describe errors such as misapplied payments, incorrect loan balances, and reporting delinquency to credit bureaus without proper notice or verification.\n3. Unsuitable repayment options and strategies: Borrowers were frequently steered into forbearance or deferment without fully understanding the long-term consequences, including accumulating interest, which increased the total amount owed and extended repayment periods.\n4. Difficulty in accessing accurate information: Complaints include issues like inability to obtain original loan documentation, proof of ownership, or correct account status, which impai

## Task 10: Semantic Chunking

While this is not a retrieval method, it *is* an effective way of improving retrieval performance on corpora that have clean semantic breaks.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of several [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Keeping each sequence of related sentences as its own document!

Let's see how to implement this!

We'll use the `percentile` thresholding method for this example. It calculates the distance between each pair of adjacent sentences, then splits the text wherever a distance exceeds a given percentile (the 95th by default) of all distances.
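The three steps above, using the `percentile` rule, can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the 2-D vectors are hand-made stand-ins for real embedding-model output, and `cosine_distance`, `percentile`, and `semantic_split` are hypothetical helpers. LangChain's `SemanticChunker` also combines each sentence with its neighbors before embedding, which this sketch skips.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values, q):
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_split(sentences, embeddings, q=95.0):
    """Group adjacent sentences, starting a new chunk wherever the distance
    to the previous sentence exceeds the q-th percentile of all distances."""
    if len(sentences) < 2:
        return [" ".join(sentences)]
    # Step 1 (assumed done): one embedding vector per sentence
    distances = [cosine_distance(embeddings[i], embeddings[i + 1])
                 for i in range(len(embeddings) - 1)]
    # Step 2: percentile-based breakpoint threshold
    threshold = percentile(distances, q)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:  # semantic break: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    # Step 3: each run of related sentences becomes one chunk
    chunks.append(" ".join(current))
    return chunks


# Toy "embeddings": two loan sentences, then two unrelated weather sentences
sentences = [
    "The loan payment was late.",
    "The servicer misapplied it.",
    "It rained all day.",
    "The sky stayed gray.",
]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(semantic_split(sentences, embeddings))
# → ['The loan payment was late. The servicer misapplied it.', 'It rained all day. The sky stayed gray.']
```

Only the loan-to-weather gap exceeds the 95th-percentile distance, so the four sentences end up in two chunks.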

In [43]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [44]:
semantic_documents = semantic_chunker.split_documents(loan_complaint_data[:20])

Let's create a new vector store.

In [45]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="Loan_Complaint_Data_Semantic_Chunks"
)

We'll use naive retrieval for this example.

In [46]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally, we can create our classic chain!

In [47]:
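semantic_retrieval_chain = (
    # Retrieve chunks for the question and carry the question forward
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    # Re-assigns `context` to itself, i.e. passes it through unchanged
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # Generate the answer and return it alongside the retrieved context
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)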

And view the results!

In [48]:
semantic_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, appears to be problems related to loan servicing, including miscommunication, inaccurate or inconsistent information about loan status or payment requirements, and difficulties with repayment plans. Specifically, many complaints mention issues such as:\n\n- Struggling to repay loans or problems with forgiveness or discharge.\n- Errors or discrepancies in loan account information, such as default notices when borrowers have not defaulted.\n- Confusion about loan servicer or issuer notices.\n- Problems setting up or maintaining auto-debit payments.\n- Disputes over inaccurate reporting or illegal collection activities.\n- Lack of transparency and inadequate communication from loan servicers.\n\nTherefore, a common and recurring problem is the mishandling of loan information and servicing issues, which often lead to borrower frustration, incorrect account statuses, and difficulty managing repayment plans.'

In [49]:
semantic_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Yes, according to the provided complaints, some complaints were handled promptly with responses marked as "Yes" for timely response and "Closed with explanation." However, multiple complaints indicate that the complaints were eventually closed with explanations, and in some cases, the complainants reported ongoing issues or violations, suggesting that not all complaints may have been fully resolved in a timely or satisfactory manner. \n\nBased on the information, it appears that at least some complaints were not fully handled in a timely manner, or at minimum, there are concerns about the effectiveness and completeness of the resolution process for certain complaints.'

In [50]:
semantic_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans due to various reasons such as receiving bad or misleading information about their loan status or repayment terms, experiencing technical issues or lack of transparency with their loan servicers, and encountering difficulties in verifying or discharging their debts due to illegal reporting or administrative errors. In some cases, borrowers believe their loans are invalid or unenforceable because of violations of privacy laws, administrative mishandling, or changes in legal status of the loans, which can lead to delays or failures in repayment.'

#### ❓ Question #3:

If sentences are short and highly repetitive (e.g., FAQs), how might semantic chunking behave, and how would you adjust the algorithm?

##### ✅ **Answer:**

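With short, highly repetitive sentences, the embeddings of neighboring sentences tend to be very similar, so the distances between them fall into a narrow band. A `percentile` threshold still splits at the largest of those small gaps, which can cut question-answer pairs apart at arbitrary points. Interquartile Range (IQR) thresholding is likely the best choice for FAQ content for several reasons:

- Robustness to outliers: FAQs often contain a few highly similar pairs and some completely distinct ones. IQR handles this distribution better than standard deviation, which gets skewed by extreme values.
- Adaptive to content distribution: A percentile rule always flags a fixed fraction of gaps (e.g., the top 5%) as breakpoints, whether or not the content has real topic shifts. IQR only flags gaps that stand out from the bulk of the similarity distribution in your FAQ dataset.
- Interpretable boundaries: IQR provides clearer decision boundaries for what constitutes "typical" vs "unusual" similarity levels in repetitive content.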
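To illustrate the outlier argument, here is a small sketch comparing a 95th-percentile rule with an IQR rule on made-up adjacent-sentence distances. The distance values are invented for illustration, and the IQR rule used here (mean + 1.5 × IQR) is an assumption about how such a threshold might be set; LangChain's exact formula should be checked in its source.

```python
import statistics


def percentile(values, q):
    """Percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def percentile_breaks(distances, q=95.0):
    """Indices of gaps above the q-th percentile: always flags the top gaps."""
    threshold = percentile(distances, q)
    return [i for i, d in enumerate(distances) if d > threshold]


def iqr_breaks(distances, k=1.5):
    """Indices of gaps above mean + k * IQR: only flags gaps that stand out."""
    iqr = percentile(distances, 75) - percentile(distances, 25)
    threshold = statistics.mean(distances) + k * iqr
    return [i for i, d in enumerate(distances) if d > threshold]


# Hypothetical FAQ: every adjacent pair is similar, so there is no real topic shift
faq_distances = [0.10, 0.11, 0.12, 0.11, 0.10, 0.12, 0.11, 0.13, 0.10, 0.11]
print(percentile_breaks(faq_distances))  # → [7]  (forced split at the largest small gap)
print(iqr_breaks(faq_distances))         # → []   (no split)

# The same FAQ followed by one genuinely different section
with_shift = faq_distances + [0.60]
print(percentile_breaks(with_shift))  # → [10]
print(iqr_breaks(with_shift))         # → [10]
```

Both rules catch the genuine shift, but only the percentile rule invents a breakpoint in the uniform FAQ.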

Changing the thresholding method alone is likely not sufficient for FAQs. We would also want to combat over-fragmentation, similarity inflation, and weak boundary detection. A few ideas for algorithmic changes:

- Content-aware preprocessing: Implement FAQ-specific parsing to identify question-answer pairs as atomic units. Treat each Q&A pair as the minimum chunk size, preventing fragmentation of logically connected content.
- Similarity normalization: Apply domain-specific adjustments to similarity thresholds. Since FAQ language is inherently repetitive, we need to recalibrate what constitutes "semantically different" content in this context.
- Structural awareness: Leverage FAQ formatting patterns (numbered lists, consistent question structures) to inform chunking decisions rather than relying solely on semantic similarity.
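The content-aware preprocessing idea can be sketched as a parser that never splits inside a Q&A pair. This assumes a hypothetical FAQ layout where every question line starts with `Q:`; real FAQ pages would need their own parsing rules, and `split_faq` is an illustrative helper, not a library function.

```python
def split_faq(text):
    """Split FAQ text into chunks, one per Q&A pair.

    Assumes (hypothetically) that every question line starts with "Q:".
    Answer lines, including multi-line answers, stay with their question.
    """
    chunks, current = [], []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines between entries
        if line.startswith("Q:") and current:
            # A new question closes the previous Q&A pair
            chunks.append(" ".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append(" ".join(current))
    return chunks


faq = """Q: How do I update my mailing address?
A: Log in and open the profile page.

Q: Where can I see my payment history?
A: Open the payments tab.
Older payments are listed under the archive link."""

for chunk in split_faq(faq):
    print(chunk)
```

Each chunk could then be wrapped in a LangChain `Document` and indexed as usual, with semantic similarity used, if at all, only to decide whether adjacent pairs should be merged.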

# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against each other.

You are expected to:

1. Create a "golden dataset"
    - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
    - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparison between them
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.

---
##### ✅ **Answer:**

1. Generate a "golden dataset" with Ragas synthetic data generation (SDG), which builds a knowledge graph from the documents
    - I installed `ragas` and dependencies
2. Pick specific metrics for each retriever
    - Run all retriever metrics, but focus on the most relevant ones for each retriever
3. Compile the results and write about findings
    - Summarize overall findings and lessons learned


In [60]:
### SOLUTION
# 1. Generate golden dataset

from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.testset.synthesizers import SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer, MultiHopSpecificQuerySynthesizer
from ragas.testset import TestsetGenerator

# define LLM to generate data/embeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

query_distribution = [
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
    (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
    (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
]

# use a subset of the complaint data (the first 50 records) to keep generation cost down
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
golden_dataset = generator.generate_with_langchain_docs(loan_complaint_data[:50], testset_size=20, query_distribution=query_distribution)


Node a48e7012-ccb1-4c09-8e8f-a573f3832a05 does not have a summary. Skipping filtering.
Node b9e7bab8-64d9-480c-b27b-f349b11d40eb does not have a summary. Skipping filtering.
Node 5faac7b5-47f9-4589-8ce9-597dc1412a11 does not have a summary. Skipping filtering.
Node 05918dc3-256c-4dc3-aadc-56a6aa053af0 does not have a summary. Skipping filtering.
Node 4f840c59-5a08-4c5f-9732-b48dadc02307 does not have a summary. Skipping filtering.
Node fe0796e2-a229-476b-bbf2-9de2770d60de does not have a summary. Skipping filtering.
Node e560d683-70d8-4e64-aa12-f8ce424eef5a does not have a summary. Skipping filtering.
Node fc2bb20c-e967-4b75-8052-cb9e763c583c does not have a summary. Skipping filtering.
Node 012125bf-9ab3-4976-9119-d6d358f58355 does not have a summary. Skipping filtering.
Node 5186583d-0fa1-4610-ba00-ff0392d34afd does not have a summary. Skipping filtering.
Node 892eff0c-e4fe-4b0d-a00a-c0eedcc113e6 does not have a summary. Skipping filtering.
Node a79d81e6-ec2c-43fe-9f26-0ced26bcf498 d


In [61]:
# human review of dataset
golden_dataset.to_pandas()

| | user_input | reference_contexts | reference | synthesizer_name |
| --- | --- | --- | --- | --- |
| 0 | How did the end of the federal student loan CO... | [The federal student loan COVID-19 forbearance... | The federal student loan COVID-19 forbearance ... | single_hop_specifc_query_synthesizer |
| 1 | How does Aidvantage handle repayment calculati... | [I submitted my annual Income-Driven Repayment... | According to the context, Aidvantage assigned ... | single_hop_specifc_query_synthesizer |
| 2 | How does FERPA protect my personal and financi... | [My personal and financial data was compromise... | My personal and financial data was compromised... | single_hop_specifc_query_synthesizer |
| 3 | What is the issue with Nelnet regarding the bo... | [According to Studentaid.gov, Im to get an ema... | The borrower states that according to Studenta... | single_hop_specifc_query_synthesizer |
| 4 | How does the Consumer Financial Protection Bur... | [I am writing to formally dispute inaccurate i... | The Consumer Financial Protection Bureau (CFPB... | single_hop_specifc_query_synthesizer |
| 5 | How can I report a wrongful default on my stud... | [I am devastated. I would like to report a sit... | You can report the situation to protect your r... | single_hop_specifc_query_synthesizer |
| 6 | What is the Department of Government Efficienc... | [On XXXX XXXX XXXX, XXXX XXXX instructed his t... | On XXXX XXXX XXXX, XXXX XXXX instructed his te... | single_hop_specifc_query_synthesizer |
| 7 | What is wrong with EdFinancials and why they k... | [I have provided documentation relating to my ... | EdFinancials form provides for only one entry ... | single_hop_specifc_query_synthesizer |
| 8 | How does a violation of FERPA impact a borrowe... | [My personal and financial data was compromise... | My personal and financial data was compromised... | single_hop_specifc_query_synthesizer |
| 9 | How does the violation of FERPA impact borrowe... | [I am writing to formally dispute my XXXX XXXX... | The violation of FERPA occurs when student rec... | single_hop_specifc_query_synthesizer |


##### With the test dataset created, let's set up LangSmith for tracing

In [63]:
import os
import getpass
from uuid import uuid4

# Get environment variables
# os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

# Settings for LangSmith
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"AIM - Compare Retrievers Student Loan Complaints - {uuid4().hex[0:8]}"

In [None]:
import os

from langchain.callbacks.tracers import LangChainTracer
from langchain_openai import ChatOpenAI
from ragas import EvaluationDataset, evaluate, RunConfig
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import LLMContextRecall, ContextEntityRecall, LLMContextPrecisionWithReference, NonLLMContextPrecisionWithReference

# Judge LLM for the LLM-based metrics, wrapped explicitly for Ragas
eval_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))

# Helper function to run the tests
def test_retriever(name, retriever, results_dict):
    """Retrieve contexts for every golden-dataset question with `retriever`,
    score them with Ragas, store the result under `name`, and return it."""
    print(f"Evaluating {name}...")

    # Attach this retriever's contexts to every test sample
    for test_row in golden_dataset:
        retrieved_docs = retriever.invoke(test_row.eval_sample.user_input)
        test_row.eval_sample.retrieved_contexts = [doc.page_content for doc in retrieved_docs]

    eval_dataset = EvaluationDataset.from_pandas(golden_dataset.to_pandas())

    # One LangSmith project per retriever. Use single quotes inside the f-string:
    # nested double quotes are a SyntaxError before Python 3.12.
    tracer = LangChainTracer(project_name=f"{os.environ['LANGCHAIN_PROJECT']} - {name}")

    # Evaluate THIS retriever
    result = evaluate(
        dataset=eval_dataset,
        metrics=[LLMContextRecall(), ContextEntityRecall(), LLMContextPrecisionWithReference(), NonLLMContextPrecisionWithReference()],
        llm=eval_llm,
        run_config=RunConfig(timeout=360),
        callbacks=[tracer],
    )

    # Key by the readable name rather than the retriever object
    results_dict[name] = result
    return result


In [None]:
# Evaluate every retriever built earlier in the notebook
evaluation_table = {}

retrievers_to_test = {
    "BM25": bm25_retriever,
    "Reranker": compression_retriever,
    "Ensemble": ensemble_retriever,
    "Multi-Query": multi_query_retriever,
    "Naive": naive_retriever,
    "Parent-Document": parent_document_retriever,
}

for name, retriever in retrievers_to_test.items():
    test_retriever(name, retriever, evaluation_table)

print(evaluation_table)


Evaluating BM25...

KeyboardInterrupt: 

Exception raised in Job[66]: AttributeError('NoneType' object has no attribute 'generate')
Exception raised in Job[78]: AttributeError('NoneType' object has no attribute 'generate')
Exception raised in Job[70]: AttributeError('NoneType' object has no attribute 'generate')
Exception raised in Job[74]: AttributeError('NoneType' object has no attribute 'generate')


In [55]:
# 2. Pick specific metrics for each retriever

2. Retrievers and metrics

| Retrieval method   | FactualCorrectness |
| ------------------ | ------------- |
| Naive              |  |
| BM25               |  |
| Multi-query        |  |
| Parent-document    |  |
| Rerank             |  |
| Ensemble           |  |



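Candidate Ragas metrics:

- Retriever-specific (score the retrieved contexts):
  - Context Precision
  - Context Recall
  - Context Entities Recall
- Generation-focused (score the generated response):
  - Noise Sensitivity
  - Response Relevancy
  - Faithfulness
  - Multimodal Faithfulness
  - Multimodal Relevance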
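For the retriever-specific metrics, the Ragas documentation defines Context Precision@K as the sum of precision@k over the ranks k holding a relevant chunk, divided by the number of relevant chunks in the top K. A minimal sketch of that formula, plus a simple non-LLM recall, is below. The LLM-based Ragas metrics judge relevance with an LLM; here `difflib.SequenceMatcher` and the 0.5 threshold are stand-in assumptions, not what Ragas uses internally, and all three helpers are hypothetical.

```python
from difflib import SequenceMatcher


def context_precision_at_k(relevance):
    """Average of precision@k over the ranks k where the chunk is relevant.

    `relevance` is a ranked list of booleans (v_k), best-ranked first.
    """
    hits, total = 0, 0.0
    for k, relevant in enumerate(relevance, start=1):
        if relevant:
            hits += 1
            total += hits / k  # precision@k, counted only where v_k = 1
    return total / hits if hits else 0.0


def is_relevant(chunk, references, threshold=0.5):
    """Stand-in relevance check: string similarity to any reference context."""
    return any(SequenceMatcher(None, chunk, ref).ratio() >= threshold for ref in references)


def context_recall(retrieved, references, threshold=0.5):
    """Fraction of reference contexts matched by at least one retrieved chunk."""
    if not references:
        return 0.0
    found = sum(is_relevant(ref, retrieved, threshold) for ref in references)
    return found / len(references)


# Ranked retrieval where ranks 1 and 3 are relevant: (1/1 + 2/3) / 2
print(round(context_precision_at_k([True, False, True]), 4))  # → 0.8333
```

For one golden-dataset sample, the relevance list could be built as `[is_relevant(c, sample.reference_contexts) for c in sample.retrieved_contexts]`. Precision rewards retrievers that rank relevant chunks early, while recall only asks whether each reference context was found at all.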


In [56]:
# Draft (not run): end-to-end Ragas evaluation of generated responses and contexts


# for test_row in dataset:
#   response = graph.invoke({"question" : test_row.eval_sample.user_input})
#   test_row.eval_sample.response = response["response"]
#   test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

# eval_llm = ChatOpenAI(model="gpt-4.1")

# from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
# from ragas import evaluate, RunConfig

# custom_run_config = RunConfig(timeout=360)

# result = evaluate(
#     dataset=evaluation_dataset,
#     metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
#     llm=evaluator_llm,
#     run_config=custom_run_config
# )
# result

##### Findings


# 3. Compile the results and write about findings.

TODO add table of results, graphs?

| Retrieval method   | Metric          | Value  |
| ------------------ | --------------- | ------ |
| Naive              |                 |        |
| BM25               |                 |        |
| Multi-query        |                 |        |
| Parent-document    |                 |        |
| Rerank             |                 |        |
| Ensemble           |                 |        |
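Once every retriever has been evaluated, a small helper can turn the scores into a comparison table like the one above. This sketch assumes each Ragas result has already been reduced to one mean score per metric; `results_to_markdown` is a hypothetical helper, not part of Ragas or LangSmith.

```python
def results_to_markdown(scores):
    """Render {retriever_name: {metric_name: value}} as a Markdown table.

    Metrics missing for a retriever are shown as "n/a".
    """
    metrics = sorted({m for per_retriever in scores.values() for m in per_retriever})
    lines = [
        "| Retrieval method | " + " | ".join(metrics) + " |",
        "| --- | " + " | ".join("---" for _ in metrics) + " |",
    ]
    for name, per_retriever in scores.items():
        cells = [f"{per_retriever[m]:.3f}" if m in per_retriever else "n/a" for m in metrics]
        lines.append(f"| {name} | " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

Assuming a dict `results` (hypothetical name) that maps each retriever's display name to its Ragas result, the input could be built with `{name: r.to_pandas().mean(numeric_only=True).to_dict() for name, r in results.items()}`. Because the helper accepts any metric names, cost and latency figures pulled from LangSmith could be added as extra keys per retriever.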


TODO add paragraph of findings

TODO how does semantic chunking figure in?