# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up QDrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

We'll also provide our OpenAI key, as well as our Cohere API key.

In [1]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [2]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

## Task 2: Data Collection and Preparation

We'll be using our Loan Data once again - this time the structured data available through the CSV!

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

In [3]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

loader = CSVLoader(
    file_path="./data/complaints.csv",
    metadata_columns=[
      "Date received",
      "Product",
      "Sub-product",
      "Issue",
      "Sub-issue",
      "Consumer complaint narrative",
      "Company public response",
      "Company",
      "State",
      "ZIP code",
      "Tags",
      "Consumer consent provided?",
      "Submitted via",
      "Date sent to company",
      "Company response to consumer",
      "Timely response?",
      "Consumer disputed?",
      "Complaint ID"
    ]
)

loan_complaint_data = loader.load()

for doc in loan_complaint_data:
    doc.page_content = doc.metadata["Consumer complaint narrative"]

Let's look at an example document to see if everything worked as expected!

In [5]:
loan_complaint_data[0]

Document(metadata={'source': './data/complaints.csv', 'row': 0, 'Date received': '03/27/25', 'Product': 'Student loan', 'Sub-product': 'Federal student loan servicing', 'Issue': 'Dealing with your lender or servicer', 'Sub-issue': 'Trouble with how payments are being handled', 'Consumer complaint narrative': "The federal student loan COVID-19 forbearance program ended in XX/XX/XXXX. However, payments were not re-amortized on my federal student loans currently serviced by Nelnet until very recently. The new payment amount that is effective starting with the XX/XX/XXXX payment will nearly double my payment from {$180.00} per month to {$360.00} per month. I'm fortunate that my current financial position allows me to be able to handle the increased payment amount, but I am sure there are likely many borrowers who are not in the same position. The re-amortization should have occurred once the forbearance ended to reduce the impact to borrowers.", 'Company public response': 'None', 'Company'

## Task 3: Setting up QDrant!

Now that we have our documents, let's create a QDrant VectorStore with the collection name "LoanComplaints".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [6]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    loan_complaint_data,
    embeddings,
    location=":memory:",
    collection_name="LoanComplaints"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply treat each complaint as a document, and use cosine similarity to fetch the 10 most relevant documents.
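
To make "cosine similarity, top `k`" concrete, here is a minimal pure-Python sketch. The toy vectors and the helper names `cosine_similarity` and `top_k` are assumptions made for illustration only; the vector store runs an equivalent search over real embeddings, and its internals may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy 3-dimensional "embeddings" (hypothetical values, not real model output).
toy_doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_vec = [1.0, 0.05, 0.0]

nearest = top_k(toy_query_vec, toy_doc_vecs, k=2)
print(nearest)
```

Only the direction of each vector matters here, not its length, which is why cosine similarity is a common default for text embeddings.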

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [7]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [8]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [9]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [10]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the naive_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : re-assigned from the "context" key of the previous step via RunnablePassthrough.assign;
    #              all other keys (here, "question") pass through unchanged
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [11]:
naive_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, appears to be dealing with the servicing and management of student loans. Specifically, frequent issues include errors in loan balances, misapplied payments, wrongful denials of payment plans, incorrect or outdated account and status information, trouble with repayment plans, and mismanagement stemming from loan transfers or improper handling of loan data. Many complaints highlight problems like receiving bad information, difficulties in applying payments correctly, unauthorized loan transfers, and issues with loan balances and interest accumulation.'

In [12]:
naive_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Yes, some complaints did not get handled in a timely manner. Specifically, at least two complaints received a response marked as "No" for timely response:\n\n- Complaint from row 441 received on 03/28/25, regarding delays in processing a student loan application, was marked as "Timely response? No."\n- Complaint from row 816 received on 04/05/25, about non-response to a CFPB complaint, was marked as "Timely response? Yes," but the narrative indicates ongoing failure to respond.\n\nOverall, there are instances where complaints were not handled promptly.'

In [13]:
naive_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans primarily because of several interconnected reasons highlighted in the complaints:\n\n1. **Lack of Clear and Timely Information:** Borrowers were often not adequately informed about when their repayment obligations would resume, changes in loan servicers, or specific payment details, leading to confusion and unintentional delinquency.\n\n2. **Interest Accumulation and Limited Payment Options:** Many borrowers only received options like forbearance or deferment, which allowed interest to continue accumulating. Lowering payments or delaying repayment often resulted in the total debt growing, making it harder to pay off later.\n\n3. **Financial Hardship and Unaffordable Payments:** Borrowers faced financial hardships such as stagnant wages, rising living costs, or job instability, which made meeting repayment obligations difficult without sacrificing essential expenses.\n\n4. **Loan Management and Servicer Issues:** Complaints about mismanagement, tr

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model), which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.
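
As a rough illustration of that word-overlap idea, here is a minimal sketch of one common BM25 formulation. The function name `bm25_scores`, the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the toy documents are all assumptions for this example; the LangChain retriever delegates to its own BM25 implementation, which may differ in details such as the IDF variant.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a common BM25 variant."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a larger inverse-document-frequency weight.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates (k1) and is normalized by document length (b).
            numerator = term_freq[term] * (k1 + 1)
            denominator = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * numerator / denominator
        scores.append(score)
    return scores

# Hypothetical toy corpus, for illustration only.
toy_docs = [
    "error ts-999 means the sensor timed out",
    "error ts-100 means the disk is full",
    "general guide to reading error codes",
]
toy_scores = bm25_scores("error ts-999", toy_docs)
best_index = max(range(len(toy_docs)), key=lambda i: toy_scores[i])
print(best_index, toy_scores)
```

Every toy document contains "error", so that term contributes little; the rare token "ts-999" is what separates the first document from the rest.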

This retriever is very straightforward to set up! Let's see it happen down below!


In [14]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(loan_complaint_data)

We'll construct the same chain - only changing the retriever.

In [15]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [16]:
bm25_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, the most common issue with loans appears to be problems dealing with the lender or servicer, including issues such as incorrect fees, trouble with how payments are applied, and receiving bad or confusing information about loan balances or terms. Several complaints highlight issues like being unable to properly pay down the principal, being misled about loan details, or experiencing unhelpful or deceptive responses from the servicers.'

In [17]:
bm25_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, all the complaints listed indicate that the companies responded in a timely manner, with responses marked as "Yes" for being timely. Therefore, there is no indication that any complaints were left unresolved or not handled in a timely manner.'

In [18]:
bm25_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People fail to pay back their loans for various reasons, including problems with managing their payment plans, miscommunication or lack of communication from loan servicers, and issues related to loan transfer and autopay enrollment. Specific issues mentioned include being steered into incorrect forbearance options, being unaware of transfers to new loan servicers like Aidvantage, having autopay discontinues without notice, and experiencing repeated payment reversals due to errors by the servicer. Additionally, some borrowers faced difficulties in receiving accurate information or timely responses from their lenders or servicers, leading to missed payments, negative credit impacts, and frustration.'

It's not clear whether this is better or worse. If only we had a way to test this! (SPOILERS: We do, the second half of the notebook will cover this.)

#### ❓ Question #1:

Give an example query where BM25 is better than embeddings and justify your answer.

##### ✅ **Answer:**

Given the question *"What does error code TS-999 mean?"* for a corpus including error documentation and many specific types of error codes:

- An embedding model will find chunks related to error codes in general, but may not surface the exact match for "TS-999", which is what matters most in this query.

- BM25 will rank chunks that contain the exact token "TS-999" highly, because a rare term receives a large weight.

These differences are fundamental to the two approaches. Embedding models are designed to capture semantic meaning. BM25 (short for "Best Matching 25") is a ranking function that uses lexical matching to find exact term matches. It's particularly effective for queries that include unique identifiers or technical terms.

If your use case is likely to require exact keyword matches, the advice is to use a hybrid approach: combine embedding retrieval and BM25 ranking with rank fusion to get the best results from RAG.

[Source: Anthropic Contextual Retrieval](https://www.anthropic.com/news/contextual-retrieval)

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.
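
The two steps above can be sketched in plain Python. The word-overlap scorer below is only a stand-in for a real reranking model, and the names `overlap_score` and `rerank` plus the toy candidates are assumptions for illustration; a hosted reranker scores query-document pairs with a learned model instead.

```python
def overlap_score(query, text):
    """Toy relevance score: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)

def rerank(query, candidates, top_n):
    """Re-score the retrieved candidates and keep only the top_n best."""
    ranked = sorted(candidates, key=lambda text: overlap_score(query, text), reverse=True)
    return ranked[:top_n]

# Step 1 (assumed already done): a first-stage retriever returned these candidates.
candidates = [
    "my payment was applied to the wrong loan",
    "the servicer never answered my phone calls",
    "late payment fee charged after autopay failed",
    "i was denied an income driven repayment plan",
]

# Step 2: "compress" four candidates down to the two most relevant ones.
compressed = rerank("late payment fee", candidates, top_n=2)
print(compressed)
```

The design point is that the first stage can afford to be broad (a larger `k`), because the second stage discards whatever turns out to be weakly related.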

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [19]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

Let's create our chain again, and see how this does!

In [20]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [21]:
contextual_compression_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, a common issue with loans, particularly student loans, involves errors and mismanagement by lenders or servicers. Specific issues include errors in loan balances, misapplied payments, wrongful denials of payment plans, and mishandling of loan data. Additionally, problems such as receiving bad information about loans, lack of communication, discrepancies in account balances, unauthorized transfers of loans, and violations of privacy laws are also prevalent.\n\nOverall, the most common issue appears to be **dealing with lenders or servicers who provide incorrect, incomplete, or mismanaged information about the loan, leading to confusion, errors, and disputes**.'

In [23]:
contextual_compression_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided context, it appears that at least one complaint was explicitly noted to have taken a very long time to resolve. For example, the complaint about the student loan account review and remediations has been open for over 1 year and nearly 18 months with no resolution, and the issue has not been handled in a timely manner. \n\nAdditionally, the complaint regarding the problem with payments not being applied to the account also involved delays, although the response to that particular complaint was marked as "closed with explanation" and the update suggests a response was provided within the timeline.\n\nOverall, yes, there were complaints that did not get handled in a timely manner, with specific issues taking over a year to resolve or remain unresolved for an extended period.'

In [22]:
contextual_compression_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans primarily due to a combination of factors including a lack of understanding about their loan obligations, poor communication from lenders or servicers, and complications arising from interest accumulation and management options. Many borrowers were not adequately informed about their repayment responsibilities or the consequences of forbearance and deferment, which allowed interest to continue accruing and increased their total debt over time. Additionally, borrowers faced challenges with inconsistent or confusing information, such as not receiving proper notifications about payment due dates, loan transfers without their knowledge, or difficulties accessing their account information. Economic hardships, stagnant wages, and the unavailability of manageable repayment plans further contributed, making full repayment difficult for many.'

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context
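
The three steps above can be sketched without any LLM. In this sketch the reformulated queries are hand-written stand-ins for LLM output, and `keyword_retriever`, `multi_query_retrieve`, and the toy corpus are hypothetical names and data used only for illustration.

```python
def keyword_retriever(query, corpus, k=1):
    """Toy retriever: rank documents by how many query words they share."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(doc.lower().split())), i) for i, doc in enumerate(corpus)]
    hits = [(score, i) for score, i in scored if score > 0]
    hits.sort(key=lambda pair: (-pair[0], pair[1]))
    return [corpus[i] for _, i in hits[:k]]

def multi_query_retrieve(queries, corpus, k=1):
    """Retrieve for every query and return the unique documents, in first-seen order."""
    seen = set()
    unique_docs = []
    for query in queries:
        for doc in keyword_retriever(query, corpus, k):
            if doc not in seen:
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs

corpus = [
    "borrower could not afford the monthly payment",
    "servicer misapplied funds to the wrong account",
    "customer was unable to repay after losing a job",
    "interest kept growing during forbearance",
]

original_query = "why could borrowers not afford payment"
# Hand-written reformulations standing in for what an LLM would generate.
queries = [
    original_query,
    "unable to repay loan after job loss",
    "interest growing during forbearance",
]

single = keyword_retriever(original_query, corpus)
combined = multi_query_retrieve(queries, corpus)
print(len(single), len(combined))
```

With the original wording alone, only one document is found; the reformulations reach two more that use different vocabulary for the same underlying problem.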

So, how hard is it to set up? Not bad! Let's see it down below!



In [24]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)

In [26]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [27]:
multi_query_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the context provided, appears to be problems related to improper handling by loan servicers, including errors in loan balances, misapplied payments, inaccurate information, and mishandling of loan documentation. Many complaints highlight issues such as:\n\n- Errors in loan balances and interest calculations\n- Mismanagement of loan transfers and reassignments\n- Lack of proper loan documentation (e.g., signed promissory notes)\n- Violations of borrower rights such as unauthorized access to personal information\n- Inaccurate credit reporting and incorrect account statuses\n- Trouble with repayment plans and loan forgiveness processes\n\nOverall, a predominant theme is the mismanagement and mishandling by loan servicers leading to inaccuracies and legal concerns for borrowers.'

In [28]:
multi_query_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Yes, several complaints did not get handled in a timely manner. Specifically, at least two complaints (Complaint IDs: 12709087 and 12739706) were marked as "N/A" under the "Timely response?" field, indicating they were not responded to promptly or possibly not responded to at all within the expected timeframes. Additionally, some complaints (such as Complaint ID: 12654977) received responses marked as "No" under "Timely response?", which confirms delays or failures in handling in a timely manner.'

In [29]:
multi_query_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans primarily because of complex issues related to misunderstandings, lack of clear information, and systemic servicing practices. Specifically, many borrowers were steered into forbearance or deferment without fully understanding how interest would continue to accumulate, making repayment more difficult over time. Others faced difficulties due to inaccurate or incomplete loan information, errors in reporting, and inadequate communication from servicers about available repayment options like income-driven plans or loan forgiveness programs. Additionally, some borrowers experienced hardship due to financial instability, stagnant wages, or unexpected life events, which made it impossible to keep up with payments. In some cases, systemic practices such as forbearance steering and lack of transparency contributed significantly to borrowers falling behind on their loans.\n\nIf you need more specific details or examples, feel free to ask!'

#### ❓ Question #2:

Explain how generating multiple reformulations of a user query can improve recall.

##### ✅ **Answer:**

Multi-query retrieval improves recall by addressing the fundamental limitation that a single query formulation may not capture all relevant documents, even when those documents contain the information the user seeks. This is called the **vocabulary mismatch problem**. Users and document authors often use different terminology to describe the same concepts. Multi-query generates several reformulations of the original query, typically 3-5 variations that:

- Use different synonyms and related terms
- Vary the specificity level (broader or narrower focus)
- Rephrase the question structure
- Emphasize different aspects of the information

This increased diversity in queries in turn increases the chances of retrieving relevant documents.

## Task 8: Parent Document Retriever

The Parent Document Retriever is a "small-to-big" strategy that works as follows:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.
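
Here is a minimal sketch of "search small, return big" using plain dictionaries. The word-count splitter, the word-overlap search, and names such as `split_into_children` and `retrieve_parents` are assumptions for illustration; the real retriever uses a text splitter, an embedding-based vector store, and a document store.

```python
def split_into_children(text, chunk_size):
    """Split a parent document into child chunks of chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

# The "docstore": full parent documents keyed by id (toy data).
parents = {
    "p1": "the servicer lost my paperwork twice and then my autopay was cancelled without any notice",
    "p2": "interest kept growing during forbearance and nobody explained the capitalization rules to me",
}

# The "vector store" stand-in: every child chunk remembers its parent id.
child_index = []
for parent_id, text in parents.items():
    for child in split_into_children(text, chunk_size=5):
        child_index.append((child, parent_id))

def retrieve_parents(query, k=2):
    """Search over child chunks, then return the de-duplicated parent documents."""
    query_words = set(query.lower().split())
    ranked = sorted(
        child_index,
        key=lambda item: len(query_words & set(item[0].split())),
        reverse=True,
    )
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[parent_id] for parent_id in parent_ids]

result = retrieve_parents("autopay cancelled")
print(result)
```

Both matching child chunks belong to the same parent, so a single full document comes back, including the surrounding context that neither child contains on its own.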

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [30]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = loan_complaint_data
child_splitter = RecursiveCharacterTextSplitter(chunk_size=750)

We'll need to set up a new QDrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [31]:
from langchain_qdrant import QdrantVectorStore

client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = QdrantVectorStore(
    collection_name="full_documents", embedding=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)

Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [32]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore=parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [33]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [34]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [35]:
parent_document_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, appears to be related to problems with federal student loan servicing. Specifically, issues include errors in loan balances, misapplied payments, wrongful denials of payment plans, discrepancies in loan balances and interest rates, and miscommunication or bad information about the loans. There are also repeated concerns about illegal credit reporting, unverified debts, and unfair practices by loan servicers.'

In [36]:
parent_document_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, all the complaints in the context were marked as "No" for timely response. Specifically, two complaints related to student loan servicing (entries with Complaint IDs 12709087 and 12935889) were explicitly noted as "Timely response?": "No." \n\nThis indicates that some complaints did not get handled in a timely manner.'

In [37]:
parent_document_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for various reasons, including:\n\n1. Lack of proper communication and transparency from loan servicers, such as failure to notify borrowers of payment obligations or reasons for payments starting earlier than expected.\n2. Financial hardship or severe economic difficulties that made it difficult to make loan payments.\n3. Misrepresentation by educational institutions about the value of degrees and career outcomes, leading to increased debt burdens that borrowers could not manage.\n4. Poor management or errors by loan servicers, such as incorrect reporting of payments or failure to verify the legitimacy of debts.\n5. Borrowers relying on deferment or forbearance options which increased interest owed, making repayment more difficult.\n6. Lack of adequate financial guidance or support, leading to unawareness of repayment obligations or full impact of their debts.\n7. Unforeseen personal or health issues that impacted their ability to meet repayment 

Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to answer the performance question more rigorously.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever simply takes two or more retrievers and combines their retrieved documents using a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.
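
As a rough sketch of weighted Reciprocal Rank Fusion: each document earns `weight / (c + rank)` from every ranking it appears in, and the totals decide the fused order. The constant `c=60` follows the linked paper; the function name `reciprocal_rank_fusion` and the toy rankings are assumptions for illustration, not the library's code.

```python
def reciprocal_rank_fusion(ranked_lists, weights=None, c=60):
    """Fuse several rankings of document ids into one list, best first."""
    if weights is None:
        weights = [1.0] * len(ranked_lists)
    scores = {}
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Higher-ranked documents (small rank) contribute more to the total.
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from two retrievers.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking], weights=[0.5, 0.5])
print(fused)
```

Because only ranks are used, the raw scores of the individual retrievers never need to be comparable, which is what makes it easy to mix a sparse retriever with dense ones.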

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [38]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.

In [39]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [40]:
ensemble_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, appears to be mismanagement and misinformation by loan servicers. Many complaints highlight problems such as:\n\n- Errors in loan balances and interest calculations\n- Incomplete or inaccurate loan information\n- Incorrect account status reporting (e.g., reporting current loans as delinquent or in default)\n- Unauthorized or improper loan transfers\n- Problems with payment application, often being applied primarily to interest rather than principal\n- Lack of communication or notification about important account changes\n- Unauthorized or confusing loan classification and handling (e.g., mislabeling loan types, ending in-school deferments improperly)\n- Bad information about loan status leading to credit report damage\n- Issues with understanding or accessing income-driven repayment or forgiveness programs\n\nOverall, a recurring theme is that borrowers experience significant difficulty due to poor communication, inco

In [41]:
ensemble_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Yes, several complaints indicate that they were not handled in a timely manner. For example, one complaint noted that the response was "No" for timely response, and the issue involved being in dispute for over 1 year without resolution. Another complaint from EdFinancial Services was marked "No" for response time, indicating it was not handled on time. Overall, based on the provided data, multiple complaints were either not responded to promptly or experienced significant delays in resolution.'

In [42]:
ensemble_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People often fail to pay back their loans due to several interconnected reasons highlighted in the complaints:\n\n1. **Accumulation of Interest and Unmanageable Balances:** Many borrowers were misled about repayment terms and faced interest that compounded or increased over time, even when making payments or entering forbearance, leading to balances that are difficult to reduce or pay off.\n\n2. **Inadequate or Misleading Information:** Borrowers report not being properly informed about their loan statuses, repayment options, or the consequences of forbearance and deferment, resulting in confusion and unintentional delays in repayment.\n\n3. **Payment Processing and Administrative Failures:** There are recurring issues with payments not being properly applied, reversed, or not reflected in account statements, causing missed payments and credit score drops.\n\n4. **Loan Servicer Practices and Mismanagement:** Complaints include wrongful transfer of loans without consent, failure to not

## Task 10: Semantic Chunking

While this is not a retrieval method, it *is* an effective way to improve retrieval performance on corpora with clean semantic breaks.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of several [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
   - `percentile`
   - `standard_deviation`
   - `interquartile`
   - `gradient`
3. Keeping each sequence of related sentences as a document!

Let's see how to implement this!

We'll use the `percentile` thresholding method for this example, which calculates the distances between consecutive sentences and then breaks the sequence wherever a distance exceeds a given percentile of all distances.
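To make the percentile idea concrete, here is a minimal, dependency-free sketch. It is not `SemanticChunker`'s actual implementation: the helper names (`cosine_distance`, `percentile`, `split_on_percentile`) and the toy 2-D vectors standing in for real embeddings are assumptions made purely for illustration.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

def split_on_percentile(sentences, vectors, pct=95):
    """Group sentences, starting a new chunk wherever the distance to the
    previous sentence exceeds the chosen percentile of all distances."""
    if len(sentences) < 2:
        return list(sentences)
    # Distance between each sentence and the one that follows it.
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

# Hypothetical sentences and hand-written 2-D "embeddings" (not real model output).
sentences = [
    "My payment was misapplied.",
    "The balance is wrong.",
    "Interest was miscalculated.",
    "Nobody answers the phone.",
    "Emails go unanswered.",
]
vectors = [(1.0, 0.0), (0.9, 0.1), (1.0, 0.1), (0.1, 1.0), (0.0, 1.0)]

chunks = split_on_percentile(sentences, vectors)
for chunk in chunks:
    print(chunk)
```

In this toy example the only large distance sits between the third and fourth sentences, so the billing sentences and the communication sentences end up in separate chunks.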

In [43]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [44]:
semantic_documents = semantic_chunker.split_documents(loan_complaint_data[:20])

Let's create a new vector store.

In [45]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="Loan_Complaint_Data_Semantic_Chunks"
)

We'll use naive retrieval for this example.

In [46]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [47]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [48]:
semantic_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the complaints provided, appears to be problems related to mismanagement and poor communication from loan servicers. Specific common issues include:\n\n- Struggling to repay or problems with loan forgiveness, cancellation, or discharge.\n- Reporting and collection of loans that are now legally void or disputed.\n- Errors in loan account status, such as loans being reported as in default when the borrower disputes this.\n- Issues with how payments are being handled, including auto-debit setup failures and incorrect payment amounts.\n- Lack of proper communication or transparency regarding loan status, payment plans, or servicing changes.\n- Unauthorized access or breach of personal and financial information.\n- Misreporting on credit reports and improper use of borrower data.\n\nOverall, a key issue is inadequate servicing, miscommunication, and mishandling of loan accounts, which leads to borrower frustration and potential violations of legal

In [49]:
semantic_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, all the complaints listed indicate that responses from the companies were marked as "Closed with explanation," and the "Timely response?" column is marked as "Yes" for each. This suggests that, in these cases, complaints were handled in a timely manner. \n\nHowever, it is important to note that the specific complaints do not explicitly mention whether the complaints were handled effectively or satisfactorily from the consumer\'s perspective, only that the responses were timely.\n\nTherefore, based on the available information, **no complaints in this dataset are indicated as not being handled in a timely manner**.'

In [50]:
semantic_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for various reasons, including issues with loan processing, miscommunication, and legal or administrative complications. For example:\n\n- Some borrowers experienced difficulties with their loan servicers providing accurate or consistent information about their loan status or repayment options, causing confusion and delays.\n- Others faced delays or errors in payment processing due to technical issues or bank rejections, which could lead to missed payments.\n- In certain cases, borrowers encountered problems with documentation or verification for loan forgiveness or discharge programs, resulting in difficulties in proving their eligibility.\n- There are also instances where borrowers disputed the legitimacy of their loans or reported that their information was mishandled or compromised, making repayment complicated.\n- Additionally, some borrowers faced legal or administrative disputes, such as claims of illegal reporting, unauthorized data breach

#### ❓ Question #3:

If sentences are short and highly repetitive (e.g., FAQs), how might semantic chunking behave, and how would you adjust the algorithm?

##### ✅ **Answer:**

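With short, highly repetitive sentences, the distances between neighboring sentences cluster in a narrow band, so breakpoints tend to fall on noise rather than on real topic shifts. Of the available thresholding methods, Interquartile Range (IQR) is likely the best choice for FAQ content for several reasons:

- Robustness to outliers: FAQs often contain a few highly similar pairs and some completely distinct ones. IQR handles this distribution better than standard deviation, which gets skewed by extreme values.
- Adaptive to content distribution: A percentile cut-off always flags a fixed share of the distances as breakpoints, whether or not they stand out, while an IQR cut-off scales with the spread of the similarity distribution in your FAQ dataset.
- Interpretable boundaries: IQR provides clearer decision boundaries for what constitutes "typical" vs "unusual" similarity levels in repetitive content.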
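The difference between a percentile cut-off and an IQR-based cut-off can be sketched on a handful of numbers. Everything below is an assumption for illustration: the distances are made up, and the IQR rule is a Tukey-style fence (`Q3 + 1.5 * IQR`), which is not necessarily the exact formula `SemanticChunker` uses for its `interquartile` mode.

```python
import statistics

def percentile_cutoff(distances, pct=95):
    """Cut-off at a fixed percentile of the distances (linear interpolation)."""
    ordered = sorted(distances)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

def iqr_cutoff(distances, scale=1.5):
    """Tukey-style fence: Q3 + scale * IQR (an assumed formula for this sketch)."""
    q1, _, q3 = statistics.quantiles(distances, n=4, method="inclusive")
    return q3 + scale * (q3 - q1)

def count_breaks(distances, cutoff):
    """Number of sentence gaps that would become chunk boundaries."""
    return sum(1 for d in distances if d > cutoff)

# Made-up distances between neighboring FAQ sentences: all small, no real topic shift.
repetitive = [0.020, 0.031, 0.024, 0.042, 0.029, 0.022, 0.035, 0.038, 0.027, 0.033]
# The same list plus one genuinely large gap.
with_shift = repetitive + [0.40]

for name, distances in [("repetitive", repetitive), ("with_shift", with_shift)]:
    p_breaks = count_breaks(distances, percentile_cutoff(distances))
    i_breaks = count_breaks(distances, iqr_cutoff(distances))
    print(f"{name}: percentile breaks={p_breaks}, IQR breaks={i_breaks}")
```

On these made-up numbers the percentile cut-off still flags one break in the list with no real topic shift, while the fence flags none; both flag the single large gap in the second list.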

Changing the thresholding method alone is likely not sufficient for FAQs. We'd also want to combat over-fragmentation, similarity inflation, and weak boundary detection. A few ideas for algorithmic changes:

- Content-Aware Preprocessing: Implement FAQ-specific parsing to identify question-answer pairs as atomic units. Treat each Q&A pair as the minimum chunk size, preventing fragmentation of logically connected content.
- Similarity Normalization: Apply domain-specific adjustments to similarity thresholds. Since FAQ language is inherently repetitive, we need to recalibrate what constitutes "semantically different" content in this context.
- Structural Awareness: Leverage FAQ formatting patterns (numbered lists, consistent question structures) to inform chunking decisions rather than relying solely on semantic similarity.
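As a rough illustration of the content-aware preprocessing idea, the sketch below keeps each question-answer pair together as one chunk. The `Q:` / `A:` line prefixes, the helper name `split_faq_into_pairs`, and the sample FAQ text are all hypothetical; a real FAQ would need a pattern that matches its own formatting.

```python
import re

def split_faq_into_pairs(text):
    """Return one chunk per question-answer pair.

    Assumes every question line starts with "Q:"; adapt the pattern
    to the formatting of your own FAQ.
    """
    # Zero-width split before each line that begins with "Q:" (Python 3.7+).
    blocks = re.split(r"(?m)^(?=Q:)", text)
    return [block.strip() for block in blocks if block.strip()]

# Made-up FAQ text used only to exercise the helper.
faq_text = """Q: How do I change my payment date?
A: Contact your servicer or use the online portal.

Q: How do I change my payment amount?
A: Apply for a different repayment plan.
"""

pairs = split_faq_into_pairs(faq_text)
for pair in pairs:
    print(repr(pair))
```

Each pair could then be embedded as a single unit, so semantic similarity is only used to decide whether neighboring pairs belong together, never to split a question from its answer.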

# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against each other.

You are expected to:

1. Create a "golden dataset"
    - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
    - Semantic Chunking is not considered a retriever method and is not required for marks, but you may find it useful to compare "semantic chunking on" vs. "semantic chunking off"
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.

##### Answer:

Make a plan:

1. Generate a "golden dataset" using a knowledge graph and synthetic data generation (SDG) with Ragas.
2. Pick specific metrics for each retriever.
3. Compile the results and write about findings.
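For step 3, a small helper can turn per-retriever scores into a Markdown table once the evaluation has been run. This is only a sketch: `render_results_table` and the `context_recall` key are hypothetical names, and the scores are left as `None` because no results exist yet.

```python
def render_results_table(results):
    """Render {retrieval method: {metric name: score or None}} as a Markdown table."""
    lines = [
        "| Retrieval method | Metric | Value |",
        "| --- | --- | --- |",
    ]
    for method, metrics in results.items():
        for metric, value in metrics.items():
            # Scores that have not been measured yet are shown as "TBD".
            shown = "TBD" if value is None else f"{value:.3f}"
            lines.append(f"| {method} | {metric} | {shown} |")
    return "\n".join(lines)

# Placeholder structure only: values stay None until the evaluation has been run.
pending = {
    "Naive": {"context_recall": None},
    "BM25": {"context_recall": None},
}
table = render_results_table(pending)
print(table)
```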

Observation: 80 complaints have "No" for timely response.

In [None]:
### SOLUTION

# 1. Generate golden dataset

import os
import getpass
from uuid import uuid4

# Get environment variables
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

# Settings for LangSmith
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"AIM - Retrievers - {uuid4().hex[0:8]}"


In [None]:
# 1. Generate golden dataset (cont.)

# load the csv file (825 rows + header)
from langchain_community.document_loaders.csv_loader import CSVLoader

file_path = "data/complaints.csv"
loader = CSVLoader(file_path=file_path)
complaint_data = loader.load()

# TODO do we need to parse headers? prob yes

In [None]:
# 1. Generate golden dataset (cont.)

from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.testset import TestsetGenerator

# define LLM to generate data/embeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# create generator using our selected LLM and embeddings
# using full dataset to build knowledge graph
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(complaint_data, testset_size=20)

# human review of dataset
dataset.to_pandas()

In [None]:
# 1. Generate golden dataset (cont.)

from langsmith import Client

client = Client()

dataset_name = "StudentLoanComplaintsSynthetic"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Student loan complaints synthetic data"
)

# TODO check if this is right mapping
# map dataset data to expected keys for langsmith client
for _, row in dataset.to_pandas().iterrows():
    client.create_example(
        inputs={"question": row["user_input"]},
        outputs={"answer": row["reference"]},
        metadata={"context": row["reference_contexts"]},
        dataset_id=langsmith_dataset.id,
    )


In [None]:
import csv

def get_csv_headings(file_path):
    """
    Reads the first row of a CSV file and returns a list of column headings.

    Args:
        file_path (str): Path to the CSV file.

    Returns:
        list: List of column headings as strings.

    Raises:
        FileNotFoundError: If the file does not exist.
        ValueError: If the file is empty or not a valid CSV.
    """
    assert isinstance(file_path, str) and file_path.endswith('.csv'), "file_path must be a CSV file path string."
    try:
        with open(file_path, newline='', encoding='utf-8') as csvfile:
            reader = csv.reader(csvfile)
            headings = next(reader, None)
            if headings is None:
                raise ValueError("CSV file is empty or does not contain a header row.")
            return headings
    except FileNotFoundError as e:
        raise FileNotFoundError(f"CSV file not found: {file_path}") from e
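    # Catch only parsing/decoding failures so the "empty file" ValueError above is not re-wrapped.
    except (csv.Error, UnicodeDecodeError) as e:
        raise ValueError(f"Error reading CSV file: {e}") from e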

# Example usage:
csv_headings = get_csv_headings("./data/complaints.csv")
print(csv_headings)

In [None]:
# 2. Pick specific metrics for each retriever

In [None]:
# LangSmith evaluation


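from ragas import EvaluationDataset, RunConfig, evaluate
from ragas.metrics import (
    LLMContextRecall,
    Faithfulness,
    FactualCorrectness,
    ResponseRelevancy,
    ContextEntityRecall,
    NoiseSensitivity,
)

# Fill in responses and retrieved contexts for every synthetic test question.
# Swap in each retrieval chain in turn to compare retrievers.
for test_row in dataset:
    response = ensemble_retrieval_chain.invoke({"question": test_row.eval_sample.user_input})
    # The chain returns a chat message under "response", so keep only its text.
    test_row.eval_sample.response = response["response"].content
    test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

# Build the evaluation dataset from the now-populated test set.
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

# Judge LLM for the Ragas metrics (ChatOpenAI and LangchainLLMWrapper were imported above).
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result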

##### Findings


# 3. Compile the results and write about findings.

TODO add table of results, graphs?

| Retrieval method   | Metric          | Value  |
| ------------------ | --------------- | ------ |
| Naive              |                 |        |
| BM25               |                 |        |
| Multi-query        |                 |        |
| Parent-document    |                 |        |
| Rerank             |                 |        |
| Ensemble           |                 |        |


TODO add paragraph of findings

TODO how does semantic chunking figure in?