# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up QDrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

We'll also provide our OpenAI key, as well as our Cohere API key.

In [1]:
import gc

In [2]:
from dotenv import load_dotenv

load_dotenv(dotenv_path="../.env")

True

In [3]:
import getpass
import os

def check_if_env_var_is_set(env_var_name: str, human_readable_string: str = "API Key"):
    """Report whether an environment variable is set; prompt for it if it is missing."""
    api_key = os.getenv(env_var_name)

    if api_key:
        print(f"{env_var_name} is present")
    else:
        print(f"{env_var_name} is NOT present, paste key at the prompt:")
        os.environ[env_var_name] = getpass.getpass(f"Please enter your {human_readable_string}: ")

In [4]:
import os
import getpass

# os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

check_if_env_var_is_set("OPENAI_API_KEY", "OpenAI API key")

OPENAI_API_KEY is present


In [5]:
# os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

check_if_env_var_is_set("COHERE_API_KEY", "Cohere API key")

COHERE_API_KEY is present


## Task 2: Data Collection and Preparation

We'll be using our Loan Data once again - this time the structured data available through the CSV!

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

In [6]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

loader = CSVLoader(
    file_path="./data/complaints.csv",
    metadata_columns=[
      "Date received", 
      "Product", 
      "Sub-product", 
      "Issue", 
      "Sub-issue", 
      "Consumer complaint narrative", 
      "Company public response", 
      "Company", 
      "State", 
      "ZIP code", 
      "Tags", 
      "Consumer consent provided?", 
      "Submitted via", 
      "Date sent to company", 
      "Company response to consumer", 
      "Timely response?", 
      "Consumer disputed?", 
      "Complaint ID"
    ]
)

loan_complaint_data = loader.load()

for doc in loan_complaint_data:
    doc.page_content = doc.metadata["Consumer complaint narrative"]

Let's look at an example document to see if everything worked as expected!

In [7]:
loan_complaint_data[0]

Document(metadata={'source': './data/complaints.csv', 'row': 0, 'Date received': '03/27/25', 'Product': 'Student loan', 'Sub-product': 'Federal student loan servicing', 'Issue': 'Dealing with your lender or servicer', 'Sub-issue': 'Trouble with how payments are being handled', 'Consumer complaint narrative': "The federal student loan COVID-19 forbearance program ended in XX/XX/XXXX. However, payments were not re-amortized on my federal student loans currently serviced by Nelnet until very recently. The new payment amount that is effective starting with the XX/XX/XXXX payment will nearly double my payment from {$180.00} per month to {$360.00} per month. I'm fortunate that my current financial position allows me to be able to handle the increased payment amount, but I am sure there are likely many borrowers who are not in the same position. The re-amortization should have occurred once the forbearance ended to reduce the impact to borrowers.", 'Company public response': 'None', 'Company'

In [8]:
gc.collect()

10

## Task 3: Setting up QDrant!

Now that we have our documents, let's create a QDrant VectorStore with the collection name "LoanComplaints".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [9]:
from langchain_community.vectorstores import Qdrant
from qdrant_client import QdrantClient, models
from langchain_qdrant import QdrantVectorStore
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    loan_complaint_data,
    embeddings,
    location=":memory:",
    collection_name="LoanComplaints"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply treat each complaint as a document, and use cosine similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later
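To make "cosine similarity, top `k`" concrete, here's a minimal pure-Python sketch of what a naive retriever does conceptually. The helper names (`cosine_similarity`, `top_k`) and the toy 3-dimensional vectors are hypothetical and only for illustration; the actual search is handled by the vector store.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=10):
    """Return (doc_id, score) pairs for the k most similar documents."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy "embeddings" (made-up values, not real model output).
toy_doc_vecs = {
    "complaint_a": [1.0, 0.0, 0.0],
    "complaint_b": [0.7, 0.7, 0.0],
    "complaint_c": [0.0, 0.0, 1.0],
}
toy_top_hits = top_k([1.0, 0.1, 0.0], toy_doc_vecs, k=2)
print(toy_top_hits)
```

A real embedding model produces vectors with many more dimensions, but the ranking logic is the same idea.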

In [10]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [11]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [12]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [13]:
%%time
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the naive_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

CPU times: user 29.3 ms, sys: 3.74 ms, total: 33.1 ms
Wall time: 64.5 ms


Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [14]:
%%time
# naive_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

CPU times: user 7 μs, sys: 1 μs, total: 8 μs
Wall time: 12.6 μs


In [15]:
%%time
# naive_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

CPU times: user 4 μs, sys: 1e+03 ns, total: 5 μs
Wall time: 7.39 μs


In [16]:
%%time
# naive_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

CPU times: user 8 μs, sys: 1e+03 ns, total: 9 μs
Wall time: 13.6 μs


Overall, this is not bad! Let's see if we can make it better!

In [17]:
gc.collect()

48

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.
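If you'd like to see the scoring idea without a library, here's a rough sketch of Okapi BM25 in plain Python. It assumes whitespace tokenization, the commonly used defaults `k1=1.5` and `b=0.75`, and a non-negative IDF variant; the function name `bm25_scores` and the toy documents are made up for illustration and are not how `BM25Retriever` is implemented internally.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a simple Okapi BM25 variant."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher weight (IDF), kept non-negative by the +1.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates via k1 and is normalized by document length via b.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

toy_docs = [
    "my student loan payment doubled after forbearance ended",
    "the servicer reported my account late to the credit bureau",
    "forbearance was applied to my loan without my consent",
]
toy_bm25 = bm25_scores("forbearance payment", toy_docs)
print(toy_bm25)
```

Notice that a document sharing no words with the query scores exactly zero, which is the main contrast with dense embeddings.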

This retriever is very straightforward to set up! Let's see it happen down below!


In [18]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(loan_complaint_data)

We'll construct the same chain - only changing the retriever.

In [19]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [20]:
%%time
# bm25_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

CPU times: user 6 μs, sys: 1e+03 ns, total: 7 μs
Wall time: 10.7 μs


In [21]:
%%time
# bm25_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

CPU times: user 16 μs, sys: 2 μs, total: 18 μs
Wall time: 34.1 μs


In [22]:
%%time
# bm25_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

CPU times: user 13 μs, sys: 1e+03 ns, total: 14 μs
Wall time: 23.1 μs


It's not clear whether this is better or worse. If only we had a way to test this! (SPOILERS: We do; the second half of the notebook will cover this.)

In [23]:
gc.collect()

0

#### ❓ Question #1:

Give an example query where BM25 is better than embeddings and justify your answer.

#### ✅ Answer:

BM25, a traditional full-text search ranking function, is particularly effective when dealing with queries that rely heavily on exact term matching, term frequency, and inverse document frequency (TF-IDF) principles.

BM25 is generally better suited for scenarios where exact keyword matching is essential, such as in e-commerce search engines, document retrieval systems, and legal e-discovery.

Additionally, BM25 is often used in hybrid search systems alongside vector search to create a more comprehensive understanding of both semantic meaning and keyword importance.

Here are a couple of queries where exact term matching is essential to avoid noisy results full of near-miss terms:

- "Find documents about COVID-19 vaccine side effects in patients with diabetes"
  - The key terms here, "COVID-19 vaccine" and "diabetes", are where the focus of the query lies.
- "Best practices for data backup in 2025"
  - It includes specific terms like "data backup" and "2025" that are likely to appear verbatim in relevant documents.
  - BM25 can effectively leverage term frequency (e.g., how often "data backup" appears in a document) and document length normalization to rank documents accurately. The query does not heavily rely on semantic similarity but rather on the presence and frequency of exact keywords.
  - In contrast, dense embeddings might struggle if the training data does not include similar phrasing or if the semantic model does not strongly associate "best practices" with "data backup" in the context of 2025.

Embeddings, on the other hand, are better suited for capturing semantic relationships between words and documents. If embeddings were used in the above scenarios, the results would likely be less precise than with BM25.


### Addendum

_**Sparse Embeddings** are high-dimensional vectors where most values are zero, with only a few non-zero values representing specific features or tokens that are present, making them memory-efficient and interpretable but limited to explicit feature representation._

_**Dense Embeddings** are vectors where most or all dimensions have non-zero values, creating rich, continuous representations that capture complex semantic relationships and contextual meaning, but require more storage and are less interpretable._

_**Key Difference:** Sparse embeddings work like "on/off switches" for specific features (like one-hot encoding or TF-IDF), while dense embeddings work like "semantic fingerprints" where every dimension contributes to the overall meaning representation - sparse focuses on explicit presence/absence, dense captures nuanced relationships._

___

_**Sparse Retrieval** uses exact keyword matching with algorithms like BM25, where documents are represented as sparse vectors containing only the specific terms that appear in them, making it excellent for precise term-based searches but limited to lexical matches._

_**Dense Retrieval** uses semantic embeddings where documents and queries are converted into dense vector representations that capture meaning and context, allowing it to find semantically similar content even when different words are used, but potentially missing exact keyword matches._

_**Key Difference:** Sparse retrieval excels at "what you search is what you get" with exact terms, while dense retrieval excels at "what you mean is what you get" through semantic understanding - which is why hybrid approaches combining both often work best._


## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.
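The two steps above can be sketched in a few lines of plain Python. To keep it runnable without an API key, the sketch below swaps in a simple word-overlap score as a stand-in for a real reranking model; `overlap_score`, `rerank`, and the toy candidates are hypothetical names and data, not Cohere's API.

```python
def overlap_score(query, text):
    """Stand-in relevance score: Jaccard overlap between query words and text words."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    union = query_words | text_words
    return len(query_words & text_words) / len(union) if union else 0.0

def rerank(query, candidates, score_fn, top_n=3):
    """Re-score first-stage candidates against the query and keep only the best top_n."""
    ranked = sorted(candidates, key=lambda text: score_fn(query, text), reverse=True)
    return ranked[:top_n]

# Pretend these came back from a first-stage retriever with a generous k.
toy_candidates = [
    "loan servicer ignored my calls",
    "late fee charged on my loan payment",
    "payment was late because autopay failed",
    "credit card rewards were removed",
]
toy_reranked = rerank("late loan payment", toy_candidates, overlap_score, top_n=2)
print(toy_reranked)
```

The design point is the shape of the pipeline: a cheap, wide first pass followed by a more careful scorer that only has to look at a handful of candidates.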

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [24]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

Let's create our chain again, and see how this does!

In [25]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [26]:
%%time
# contextual_compression_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

CPU times: user 4 μs, sys: 1e+03 ns, total: 5 μs
Wall time: 7.15 μs


In [27]:
%%time
# contextual_compression_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

CPU times: user 4 μs, sys: 0 ns, total: 4 μs
Wall time: 6.44 μs


In [28]:
%%time
# contextual_compression_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

CPU times: user 6 μs, sys: 1 μs, total: 7 μs
Wall time: 11.9 μs


We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

In [29]:
gc.collect()

40

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context
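Here's a small sketch of those three steps in plain Python. The query rewriting is done by a hard-coded function standing in for the LLM, and retrieval is a toy keyword lookup; `multi_query_retrieve`, `toy_rewrite`, `toy_retrieve`, and `toy_index` are all hypothetical names used only for illustration.

```python
def multi_query_retrieve(question, rewrite_fn, retrieve_fn):
    """Retrieve for every rewritten query and return the unique documents, in first-seen order."""
    seen = set()
    merged = []
    for query in rewrite_fn(question):      # step 1: n new queries
        for doc_id in retrieve_fn(query):   # step 2: retrieve per query
            if doc_id not in seen:          # step 3: keep unique documents only
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

# Toy keyword index: each query term maps to the documents it would retrieve.
toy_index = {
    "late": ["doc_1", "doc_2"],
    "delayed": ["doc_2", "doc_3"],
    "overdue": ["doc_4"],
}

def toy_rewrite(question):
    """Stand-in for the LLM: returns fixed reformulations regardless of the question."""
    return ["late", "delayed", "overdue"]

def toy_retrieve(query):
    """Stand-in for the base retriever."""
    return toy_index.get(query, [])

toy_merged = multi_query_retrieve("Were any payments late?", toy_rewrite, toy_retrieve)
print(toy_merged)
```

Note that `doc_2` is retrieved by two reformulations but only appears once in the merged context.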

So, how hard is it to set up? Not bad! Let's see it down below!



In [30]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)

In [31]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [32]:
%%time
# multi_query_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

CPU times: user 6 μs, sys: 0 ns, total: 6 μs
Wall time: 12.2 μs


In [33]:
%%time
# multi_query_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

CPU times: user 8 μs, sys: 1e+03 ns, total: 9 μs
Wall time: 15.3 μs


In [34]:
%%time
# multi_query_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

CPU times: user 6 μs, sys: 1 μs, total: 7 μs
Wall time: 10.5 μs


In [35]:
gc.collect()

0

#### ❓ Question #2:

Explain how generating multiple reformulations of a user query can improve recall.

#### ✅ Answer:

Multiple reformulations improve recall because relevant documents may use different terminology than the original query, and each reformulation can surface documents the others miss (different phrasings in multiple reformulations of a query can match different relevant documents).

In other words, multiple reformulations approach the same query from different angles/facets, leading to retrieval of documents covering those various angles. This increases the confluence of documents around the common theme while capturing variations in terminology and perspective, thereby enhancing retrieval scope.

A retriever built on multiple reformulations follows these steps:

  1. Generates multiple query variations from the original query using an LLM
  2. Retrieves documents for each variation (each gets k results)
  3. Deduplicates and merges the results from all queries
  4. Returns the final deduplicated set

Because the merged set draws on every variation, it tends to cover more of the relevant documents than a single query would.

For example, "machine learning algorithms" and "AI models" retrieve different relevant documents on the same or a similar theme.

## Task 8: Parent Document Retriever

A "small-to-big" strategy - the Parent Document Retriever works based on a simple strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.
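The "search small, return big" idea can be sketched with two plain data structures: a dictionary acting as the parent store and a list of (child chunk, parent id) pairs acting as the searchable index. Everything here is a simplified illustration under stated assumptions: fixed-width character chunks, substring matching in place of vector similarity, and made-up names (`build_small_to_big_index`, `retrieve_parents`).

```python
def split_into_chunks(text, chunk_size):
    """Cut text into fixed-width character chunks (a crude stand-in for a text splitter)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def build_small_to_big_index(parents, chunk_size):
    """Return (docstore, child_index): full parents by id, and child chunks tagged with their parent id."""
    docstore = dict(parents)
    child_index = []
    for parent_id, text in parents.items():
        for chunk in split_into_chunks(text, chunk_size):
            child_index.append((chunk, parent_id))
    return docstore, child_index

def retrieve_parents(query, docstore, child_index, match_fn):
    """Search the child chunks, then return each matching chunk's full parent once."""
    parent_ids = []
    for chunk, parent_id in child_index:
        if match_fn(query, chunk) and parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [docstore[parent_id] for parent_id in parent_ids]

toy_parents = {
    "p1": "My payment doubled. The servicer never sent notice.",
    "p2": "Autopay failed twice and I was charged a fee.",
}
toy_docstore, toy_children = build_small_to_big_index(toy_parents, chunk_size=20)

# Substring matching stands in for embedding similarity here.
toy_result = retrieve_parents(
    "servicer", toy_docstore, toy_children,
    match_fn=lambda query, chunk: query.lower() in chunk.lower(),
)
print(toy_result)
```

The match happens on a 20-character child chunk, but what comes back is the whole parent complaint, surrounding context included.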

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [36]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

parent_docs = loan_complaint_data
child_splitter = RecursiveCharacterTextSplitter(chunk_size=750)

We'll need to set up a new QDrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [37]:
vectorstore.client.create_collection(
  collection_name="full_documents",
  vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = Qdrant(
  client=vectorstore.client,     # ✅ Reuse existing client
  embeddings=embeddings,         # ✅ Reuse embeddings
  collection_name="full_documents"
)


Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [38]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [39]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [40]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [41]:
%%time
# parent_document_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

CPU times: user 6 μs, sys: 0 ns, total: 6 μs
Wall time: 10.7 μs


In [42]:
%%time
# parent_document_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

CPU times: user 3 μs, sys: 1e+03 ns, total: 4 μs
Wall time: 6.2 μs


In [43]:
%%time
# parent_document_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

CPU times: user 4 μs, sys: 0 ns, total: 4 μs
Wall time: 6.2 μs


Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to answer the performance question more rigorously.

In [44]:
gc.collect()

201

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever simply takes two or more retrievers and combines their retrieved documents based on a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.
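To see how ranks and weights combine, here's a minimal sketch of weighted Reciprocal Rank Fusion: each document earns `weight / (rank + c)` from every ranked list it appears in, and the totals decide the final order. The constant `c=60` follows the RRF paper; the function name `weighted_rrf` and the toy rankings are hypothetical, and this is an illustration rather than the library's exact implementation.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists of doc ids into one list, best first."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Higher-ranked documents (small rank) contribute more to the fused score.
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)

# Three toy retrievers that disagree about the ordering.
toy_rankings = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_c", "doc_a"],
]
toy_fused = weighted_rrf(toy_rankings, weights=[1 / 3] * 3)
print(toy_fused)
```

Because only ranks are used, the retrievers' raw scores never need to be on the same scale, which is what makes it easy to mix BM25 with embedding-based retrievers.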

In [45]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.

In [46]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [47]:
%%time
# ensemble_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

CPU times: user 9 μs, sys: 0 ns, total: 9 μs
Wall time: 15 μs


In [48]:
%%time
# ensemble_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

CPU times: user 4 μs, sys: 1 μs, total: 5 μs
Wall time: 6.68 μs


In [49]:
%%time
# ensemble_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

CPU times: user 4 μs, sys: 1e+03 ns, total: 5 μs
Wall time: 5.96 μs


In [50]:
gc.collect()

0

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences based on their semantic similarity based on a number of [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

The `breakpoint_threshold_type` parameter controls when the semantic chunker creates chunk boundaries based on embedding similarity between sentences:

**Four Threshold Types:**

1. _"percentile" (default)_
- Splits when sentence embedding distance exceeds the 95th percentile of all distances
- Effect: Creates chunks at the most semantically distinct boundaries
- Behavior: More conservative splitting, larger chunks

2. _"standard_deviation"_
- Splits when distance exceeds 3 standard deviations from mean
- Effect: Better predictable performance, especially for normally distributed content
- Behavior: More consistent chunk sizes

3. _"interquartile"_
- Uses IQR * 1.5 scaling factor to determine breakpoints
- Effect: Middle-ground approach, robust to outliers
- Behavior: Balanced chunk distribution

4. _"gradient"_
- Detects anomalies in embedding distance gradients
- Effect: Best for domain-specific/highly correlated content
- Behavior: Finds subtle semantic transitions

**Impact:** _The threshold type determines sensitivity to semantic changes - more sensitive types create smaller, more focused chunks while less sensitive types create larger, more comprehensive chunks._

We'll use the `percentile` thresholding method for this example.

It calculates all distances between consecutive sentences, and then breaks sequences of sentences apart wherever the distance exceeds a given percentile of all distances.
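Here's a rough sketch of the percentile idea in plain Python. It assumes the distances between consecutive sentences have already been computed (the toy values below are made up), and it uses a simple linearly interpolated percentile; `percentile` and `split_on_percentile` are hypothetical helper names, not part of `SemanticChunker`.

```python
def percentile(values, pct):
    """Linearly interpolated percentile of a list of numbers (pct between 0 and 100)."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    position = (len(ordered) - 1) * pct / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)

def split_on_percentile(sentences, distances, pct=95):
    """Start a new chunk wherever the distance to the previous sentence exceeds the percentile threshold.

    distances[i] is the distance between sentences[i] and sentences[i + 1].
    """
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

toy_sentences = [
    "My payment doubled.",
    "Nobody warned me.",
    "The new amount is too high.",
    "Separately, my credit report is wrong.",
    "The bureau will not fix it.",
]
toy_distances = [0.10, 0.12, 0.80, 0.11]  # made-up distances between neighbours

toy_chunks_95 = split_on_percentile(toy_sentences, toy_distances, pct=95)
toy_chunks_50 = split_on_percentile(toy_sentences, toy_distances, pct=50)
print(len(toy_chunks_95), len(toy_chunks_50))
```

Lowering the percentile lowers the threshold, so more boundaries qualify and the chunks get smaller.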

In [51]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [52]:
%%time
semantic_documents = semantic_chunker.split_documents(loan_complaint_data[:20])

CPU times: user 250 ms, sys: 50.2 ms, total: 300 ms
Wall time: 9.04 s


Let's create a new vector store.

In [53]:
vectorstore.client.create_collection(
  collection_name="Loan_Complaint_Data_Semantic_Chunks",
  vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

semantic_vectorstore = Qdrant(
  client=vectorstore.client,     # ✅ Reuse existing client
  embeddings=embeddings,         # ✅ Reuse embeddings
  collection_name="Loan_Complaint_Data_Semantic_Chunks"
)

# Add documents after creation
_ = semantic_vectorstore.add_documents(semantic_documents)

We'll use naive retrieval for this example.

In [54]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [55]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [56]:
# semantic_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

In [57]:
# semantic_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

In [58]:
# semantic_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

In [59]:
gc.collect()

226

#### ❓ Question #3:

If sentences are short and highly repetitive (e.g., FAQs), how might semantic chunking behave, and how would you adjust the algorithm?

#### ✅ Answer:

Short and highly repetitive sentences create _minimal embedding distance_ variations, making it difficult to detect _meaningful semantic_ boundaries.

Threshold Type Behaviors:

1. "percentile" (95th percentile)

- Behavior: Creates very few chunks since most distances are similar
- Issue: May group unrelated FAQ topics together
- Adjustment: Lower to 75-85th percentile to increase sensitivity

2. "standard_deviation" (3σ)

- Behavior: Performs poorly due to low variance in short, similar sentences
- Issue: Creates massive chunks with no meaningful breaks
- Adjustment: Reduce to 1-2 standard deviations for more splitting

3. "interquartile" (IQR × 1.5)

- Behavior: Most robust for FAQs due to outlier resistance
- Issue: Still may miss subtle topic transitions
- Adjustment: Reduce scaling factor to 0.8-1.0

4. "gradient" (anomaly detection)

- Behavior: Best performer - detects subtle topic shifts in repetitive content
- Issue: May be overly sensitive to minor variations
- Adjustment: Fine-tune threshold to 85-90th percentile

Conclusion: Use "gradient" with _85th percentile_ + minimum chunk size constraints + keyword-based post-processing to ensure FAQ topics remain grouped appropriately despite repetitive language patterns.

# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against each other.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparison between them
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.
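As a starting point for planning, here's a tiny harness that records a hit rate and mean latency for any retriever exposed as a function. This is only a sketch under assumptions: it is not a Ragas metric, the "golden" examples are made-up (question, relevant document id) pairs, and `evaluate_retriever` is a hypothetical helper name.

```python
import time

def evaluate_retriever(retrieve_fn, golden_examples, k=10):
    """Return the hit rate at k and the mean latency (seconds) over the golden examples."""
    hits = 0
    latencies = []
    for question, relevant_id in golden_examples:
        start = time.perf_counter()
        retrieved_ids = retrieve_fn(question)[:k]
        latencies.append(time.perf_counter() - start)
        # A "hit" means the known-relevant document showed up in the top k.
        hits += relevant_id in retrieved_ids
    total = len(golden_examples)
    return {"hit_rate": hits / total, "mean_latency_s": sum(latencies) / total}

# Made-up golden pairs and a stand-in retriever that always returns the same ids.
toy_golden = [
    ("Why was my payment late?", "doc_2"),
    ("Who reported me to the bureau?", "doc_9"),
]

def toy_retrieve_ids(question):
    return ["doc_1", "doc_2", "doc_3"]

toy_report = evaluate_retriever(toy_retrieve_ids, toy_golden, k=3)
print(toy_report)
```

Running the same harness over each retriever gives you comparable rows for the performance and latency columns; cost still has to come from your tracing or billing data.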

In [60]:
### YOUR CODE HERE

In [61]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [62]:
from dotenv import load_dotenv

load_dotenv(dotenv_path="../.env")

True

In [63]:
def check_if_env_var_is_set(env_var_name: str, human_readable_string: str = "API Key"):
    api_key = os.getenv(env_var_name)

    if api_key:
        print(f"{env_var_name} is present")
    else:
        print(f"{env_var_name} is NOT present, paste key at the prompt:")
        os.environ[env_var_name] = getpass.getpass(f"Please enter your {human_readable_string}: ")

In [64]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
check_if_env_var_is_set("LANGCHAIN_API_KEY", "LangChain API key")
check_if_env_var_is_set("OPENAI_API_KEY", "OpenAI API key")

LANGCHAIN_API_KEY is present
OPENAI_API_KEY is present


In [65]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - ADwLC - {uuid4().hex[0:8]}"

In [95]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader

path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

In [96]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [97]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

In [98]:
from ragas.testset.graph import Node, NodeType
if not os.path.exists('loan_data_kg.json'):
    ### NOTICE: We're using a subset of the data for this example - this is to keep costs/time down.
    for doc in docs[:5]: ### 20
        kg.nodes.append(
            Node(
                type=NodeType.DOCUMENT,
                properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
            )
        )
kg

KnowledgeGraph(nodes: 0, relationships: 0)

In [99]:
gc.collect()

329

In [100]:
%%time
from ragas.testset.transforms import default_transforms, apply_transforms
transformer_llm = generator_llm
embedding_model = generator_embeddings

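if not os.path.exists('loan_data_kg.json'):
    # Use a separate name so the imported default_transforms function is not shadowed on re-run.
    transforms = default_transforms(documents=docs, llm=transformer_llm, embedding_model=embedding_model)
    apply_transforms(kg, transforms)
else:
    # KnowledgeGraph.load returns a new graph and does not modify kg in place,
    # so kg stays empty here; the saved graph is loaded into loan_data_kg in the next cell.
    kg.load('loan_data_kg.json')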
kg

CPU times: user 126 ms, sys: 17.4 ms, total: 143 ms
Wall time: 141 ms


KnowledgeGraph(nodes: 0, relationships: 0)

In [101]:
%%time
if not os.path.exists('loan_data_kg.json'):
    kg.save("loan_data_kg.json")
    
loan_data_kg = KnowledgeGraph.load("loan_data_kg.json")
loan_data_kg

CPU times: user 104 ms, sys: 9.98 ms, total: 114 ms
Wall time: 112 ms


KnowledgeGraph(nodes: 11, relationships: 39)

In [102]:
gc.collect()

0

In [103]:
import psutil

# Check memory usage
process = psutil.Process(os.getpid())
memory_mb = process.memory_info().rss / 1024 / 1024
print(f"Memory usage before generation: {memory_mb:.1f} MB")

Memory usage before generation: 551.1 MB


In [104]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=embedding_model, knowledge_graph=loan_data_kg)

In [105]:
from ragas.testset.synthesizers import (
    SingleHopSpecificQuerySynthesizer,
    MultiHopAbstractQuerySynthesizer,
    MultiHopSpecificQuerySynthesizer,
)

query_distribution = [
    # Alternative mix (50% / 25% / 25%):
    # (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
    # (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.25),
    # (MultiHopSpecificQuerySynthesizer(llm=generator_llm), 0.25),
    (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.8),  # 80%
    (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.2),   # 20%
]
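
Each entry pairs a synthesizer with its share of the generated questions. As a rough illustration (not Ragas's internal code), the weights can be turned into approximate per-synthesizer counts for a given `testset_size`. `expected_counts` is a hypothetical helper, and plain strings stand in for the synthesizer objects.

```python
def expected_counts(distribution, testset_size):
    """Approximate number of questions per synthesizer for a given test set size."""
    total_weight = sum(weight for _, weight in distribution)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights should sum to 1.0, got {total_weight}")
    return {name: round(weight * testset_size) for name, weight in distribution}


# Hypothetical labels mirroring the 80% / 20% split used above.
distribution = [("single_hop_specific", 0.8), ("multi_hop_abstract", 0.2)]
counts = expected_counts(distribution, testset_size=10)
print(counts)
```

With `testset_size=10` this works out to 8 single-hop and 2 multi-hop questions.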

In [106]:
%%time
testset = None
if not os.path.exists('golden-master.csv'):
    testset = generator.generate(testset_size=10, query_distribution=query_distribution)
    testset.to_pandas()

CPU times: user 57 μs, sys: 6 μs, total: 63 μs
Wall time: 68.9 μs


In [107]:
# Check memory usage
process = psutil.Process(os.getpid())
memory_mb = process.memory_info().rss / 1024 / 1024
print(f"Memory usage after generation: {memory_mb:.1f} MB")

Memory usage after generation: 551.1 MB


In [108]:
gc.collect()

75

In [109]:
process = psutil.Process(os.getpid())
memory_mb = process.memory_info().rss / 1024 / 1024
print(f"Memory usage after gc.collect(): {memory_mb:.1f} MB")

Memory usage after gc.collect(): 551.3 MB


In [110]:
import pandas as pd

In [111]:
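if testset is not None:
    # A fresh test set was generated in this session, so persist it.
    testset_df = testset.to_pandas()
    testset_df.to_csv('golden-master.csv', index=False)
else:
    # Reuse the previously saved test set.
    testset_df = pd.read_csv('golden-master.csv')
testset_df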

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What are academic years in the context of educ...,"['Chapter 1 Academic Years, Academic Calendars...",Academic years are defined periods that every ...,single_hop_specifc_query_synthesizer
1,Could you please explain the role of the Knowl...,"['Chapter 1 Academic Years, Academic Calendars...",The Knowledge Center provides information and ...,single_hop_specifc_query_synthesizer
2,What information is available regarding the co...,"['Chapter 1 Academic Years, Academic Calendars...",The provided context discusses Chapter 1 and i...,single_hop_specifc_query_synthesizer
3,What is a department in the context of academi...,"['Chapter 1 Academic Years, Academic Calendars...",The provided context does not explicitly defin...,single_hop_specifc_query_synthesizer
4,What does 34 CFR 668.3(b) specify regarding we...,['Regulatory Citations Academic year minimums:...,34 CFR 668.3(b) pertains to weeks of instructi...,single_hop_specifc_query_synthesizer
5,What is the regulation 34 CFR 668.3(a) about i...,['Regulatory Citations Academic year minimums:...,Regulatory Citations Academic year minimums: 3...,single_hop_specifc_query_synthesizer
6,What does the regulation 34 CFR 668.3(b) speci...,['Regulatory Citations Academic year minimums:...,Regulatory citations indicate that 34 CFR 668....,single_hop_specifc_query_synthesizer
7,Could you please explain the significance of 3...,['Regulatory Citations Academic year minimums:...,Regulatory citations indicate that 34 CFR 668....,single_hop_specifc_query_synthesizer
8,what is the rule about academic calendars and ...,"['<1-hop>\n\nChapter 1 Academic Years, Academi...",The context explains that each eligible progra...,multi_hop_abstract_query_synthesizer
9,Different academic year defs for programs how ...,"['<1-hop>\n\nChapter 1 Academic Years, Academi...",The context explains that a school can have di...,multi_hop_abstract_query_synthesizer


In [112]:
from langsmith import Client

langsmith_client = Client()

dataset_name = "Loan Synthetic Data (s09)"

existing_datasets = langsmith_client.list_datasets()
dataset_exists = any(dataset.name == dataset_name for dataset in existing_datasets)

if dataset_exists:
  langsmith_dataset = langsmith_client.read_dataset(dataset_name=dataset_name)
  print(f"Using existing dataset: {dataset_name}")
else:
  langsmith_dataset = langsmith_client.create_dataset(
      dataset_name=dataset_name,
      description="Loan Synthetic Data (for s09 exercise)"
  )
  print(f"Created new dataset: {dataset_name}")

Using existing dataset: Loan Synthetic Data (s09)


In [113]:
gc.collect()

46

In [114]:
# Note: re-running this cell against an existing dataset appends the same examples again.
for _, row in testset_df.iterrows():
  langsmith_client.create_example(
      inputs={
          "question": row["user_input"]
      },
      outputs={
          "answer": row["reference"]
      },
      metadata={
          "context": row["reference_contexts"]
      },
      dataset_id=langsmith_dataset.id
  )

In [115]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(docs)

In [116]:
from langchain_openai import OpenAIEmbeddings

In [117]:
from langchain_community.vectorstores import Qdrant
from qdrant_client import models

vectorstore.client.create_collection(
  collection_name="Loan RAG (semantic)",
  vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

semantic_vectorstore = Qdrant(
    client=vectorstore.client,     # ✅ Reuse existing client
    embeddings=embeddings,         # ✅ Reuse embeddings
    collection_name="Loan RAG (semantic)"
)

_ = semantic_vectorstore.add_documents(rag_documents)

In [118]:
# Retrieve from the collection that was just populated above.
retriever = semantic_vectorstore.as_retriever(search_kwargs={"k": 10})

In [119]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

Context: {context}
Question: {question}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

In [120]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4.1-mini")

In [121]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | rag_prompt | llm | StrOutputParser()
)

In [122]:
# rag_chain.invoke({"question" : "What kinds of loans are available?"})

## LangSmith Evaluation Set-up

In [123]:
eval_llm = ChatOpenAI(model="gpt-4.1")

In [124]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

# Correctness: compares the chain's response against the reference answer.
qa_evaluator = LangChainStringEvaluator(
    "qa",
    config={"llm": eval_llm},
    prepare_data=lambda run, example: {
        "prediction": run.outputs["response"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)

labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        },
        "llm" : eval_llm
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["response"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)

empathy_evaluator = LangChainStringEvaluator(
    "criteria",
    config={
        "criteria": {
            "empathy": "Is this response empathetic? Does it make the user feel like they are being heard?",
        },
        "llm": eval_llm 
    },
    prepare_data=lambda run, example: {
       "prediction": run.outputs["response"],
       "input": example.inputs["question"],
    }
)
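
Each `prepare_data` callback maps a run and its dataset example onto the fields the evaluator expects. The sketch below uses `SimpleNamespace` stand-ins (hypothetical, not real LangSmith objects) to show the shapes assumed above: the chain output must expose a `response` key, and the example must carry `question` and `answer`.

```python
from types import SimpleNamespace


def prepare_labeled(run, example):
    """Same mapping as the prepare_data lambdas used by the labeled evaluators."""
    return {
        "prediction": run.outputs["response"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }


# Hypothetical stand-ins for a LangSmith run and dataset example.
run = SimpleNamespace(outputs={"response": "An academic year is a defined period."})
example = SimpleNamespace(
    inputs={"question": "What is an academic year?"},
    outputs={"answer": "A defined period used to measure a program."},
)

fields = prepare_labeled(run, example)
print(fields)
```

A chain that ends in `StrOutputParser()` returns a bare string, so its output would likely need to be wrapped under a `response` key (or the callback adjusted) before these evaluators can read it.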

## LangSmith Evaluation

In [125]:
# evaluate(
#     rag_chain.invoke,
#     data=dataset_name,
#     evaluators=[
#         qa_evaluator,
#         labeled_helpfulness_evaluator,
#         empathy_evaluator
#     ],
#     metadata={"revision_id": "default_chain_init"},
# )

## Dope-ifying Our Application

In [126]:
EMPATHY_RAG_PROMPT = """\
Given a provided context and question, you must answer the question based only on context.

If you cannot answer the question based on the context - you must say "I don't know".

You must answer the question using empathy and kindness, and make sure the user feels heard.

Context: {context}
Question: {question}
"""

empathy_rag_prompt = ChatPromptTemplate.from_template(EMPATHY_RAG_PROMPT)

In [127]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 50
)

rag_documents = text_splitter.split_documents(docs)

In [128]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

In [131]:
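try:
    vectorstore.client.create_collection(
      collection_name="Loan Data for RAG",
      # text-embedding-3-large produces 3072-dimensional vectors (was 1536)
      vectors_config=models.VectorParams(size=3072, distance=models.Distance.COSINE)
    )
except Exception as ex:
    # Most likely the collection already exists from an earlier run; keep going.
    print(f"Skipping collection creation: {ex}")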

dope_app_vectorstore = Qdrant(
  client=vectorstore.client,     # ✅ Reuse existing client
  embeddings=embeddings,         # ✅ Reuse embeddings
  collection_name="Loan Data for RAG"
)

# Add documents after creation
_ = dope_app_vectorstore.add_documents(rag_documents)

In [132]:
# Retrieve from the collection that was just populated above.
retriever = dope_app_vectorstore.as_retriever()

In [133]:
empathy_rag_chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | empathy_rag_prompt | llm | StrOutputParser()
)

In [134]:
# empathy_rag_chain.invoke({"question" : "What kinds of loans are available?"})

In [135]:
# evaluate(
#     empathy_rag_chain.invoke,
#     data=dataset_name,
#     evaluators=[
#         qa_evaluator,
#         labeled_helpfulness_evaluator,
#         empathy_evaluator
#     ],
#     metadata={"revision_id": "empathy_rag_chain"},
# )

In [None]:
gc.collect()

### Retriever Evaluation

#### Naive Retrieval Chain

In [136]:
from tqdm.notebook import tqdm

In [137]:
def write_to_file(filename: str, content: str):
    # The with-block closes the file automatically, even if the write fails.
    with open(filename, 'w') as text_file:
        text_file.write(content)

In [None]:
retriever_chains_list = {
    "naive_retrieval_chain" : naive_retrieval_chain,
    "bm25_retrieval_chain": bm25_retrieval_chain,
    "contextual_compression_retrieval_chain": contextual_compression_retrieval_chain,
    "multi_query_retrieval_chain": multi_query_retrieval_chain,
    "parent_document_retrieval_chain": parent_document_retrieval_chain,
    "ensemble_retrieval_chain": ensemble_retrieval_chain,
    "semantic_retrieval_chain": semantic_retrieval_chain
}

retriever_eval_progress_bar = tqdm(retriever_chains_list)
for retriever_chain in retriever_eval_progress_bar:
    if os.path.exists(retriever_chain):
        print(f"{retriever_chain} already processed, skipping to the next one...")
        continue

    retriever_eval_progress_bar.set_description(retriever_chain, refresh=True)
    chain_to_invoke = retriever_chains_list[retriever_chain]
    try:
        evaluate(
          chain_to_invoke.invoke,
          data=dataset_name,
          evaluators=[qa_evaluator, labeled_helpfulness_evaluator, empathy_evaluator],
          metadata={"revision_id": retriever_chain},
          experiment_prefix=retriever_chain
        )
        write_to_file(retriever_chain, f"revision_id: {retriever_chain}")
    except Exception as ex:
        print(f"Failed to run evaluation on the {retriever_chain}, due to {ex}, skipping to the next one...")
        continue

  0%|          | 0/7 [00:00<?, ?it/s]

View the evaluation results for experiment: 'naive_retrieval_chain-7ca064c5' at:
https://smith.langchain.com/o/4a563880-75b7-483f-b9cd-cf740f81427b/datasets/8fab2835-2938-4c6a-9e6d-55e84c59a784/compare?selectedSessions=9e8db4be-dfd8-4128-8daf-cd064652bd29




0it [00:00, ?it/s]

View the evaluation results for experiment: 'bm25_retrieval_chain-55985f76' at:
https://smith.langchain.com/o/4a563880-75b7-483f-b9cd-cf740f81427b/datasets/8fab2835-2938-4c6a-9e6d-55e84c59a784/compare?selectedSessions=796aae70-2bca-4999-aa42-918ad963c585




0it [00:00, ?it/s]

View the evaluation results for experiment: 'contextual_compression_retrieval_chain-90109886' at:
https://smith.langchain.com/o/4a563880-75b7-483f-b9cd-cf740f81427b/datasets/8fab2835-2938-4c6a-9e6d-55e84c59a784/compare?selectedSessions=8e9c8e22-164d-4863-aa9d-57cd5f244c59




0it [00:00, ?it/s]

### Debugging LangSmith

In [None]:
# try:
#   # Try to list projects to see current usage
#   projects = langsmith_client.list_projects()
#   print(f"Current projects: {len(list(projects))}")
#
#   # Try to list datasets
#   datasets = langsmith_client.list_datasets()
#   print(f"Current datasets: {len(list(datasets))}")
#
# except Exception as e:
#   print(f"Error details: {e}")
#   print(f"Error type: {type(e)}")

In [None]:
# import logging
# import requests

# # Enable debug logging for requests
# logging.basicConfig(level=logging.INFO) ### logging.DEBUG
# requests_log = logging.getLogger("requests.packages.urllib3")
# requests_log.setLevel(logging.INFO) ### logging.DEBUG
# requests_log.propagate = True

In [None]:
# import requests
# import os

# headers = {
#   "X-API-Key": os.environ["LANGCHAIN_API_KEY"],
#   "Content-Type": "application/json"
# }

# from pprint import pprint
# try:
#   response = requests.get(
#       "https://api.smith.langchain.com/datasets",
#       headers=headers,
#       timeout=30
#   )
#   print(f"Status code: {response.status_code}")
#   pprint(f"Response: {response.text}")
# except Exception as e:
#   print(f"Direct API error: {e}")

In [None]:
# from langsmith import Client

# client = Client()
# try:
#   # Make any API call and check the response
#   datasets = list(client.list_datasets(limit=1))
# except Exception as e:
#   # Look for rate limit headers in the error
#   if hasattr(e, 'response') and e.response:
#       print(f"Headers: {e.response.headers}")
#       print(f"Status: {e.response.status_code}")
#       print(f"Body: {e.response.text}")