# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up Qdrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

We'll also provide our OpenAI key, as well as our Cohere API key.

In [1]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [46]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

## Task 2: Data Collection and Preparation

We'll be using our Use Case Data once again - this time the structured data available through the CSV!

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

In [3]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

loader = CSVLoader(
    file_path="./data/Projects_with_Domains.csv",
    metadata_columns=[
      "Project Title",
      "Project Domain",
      "Secondary Domain",
      "Description",
      "Judge Comments",
      "Score",
      "Project Name",
      "Judge Score"
    ]
)

synthetic_usecase_data = loader.load()

for doc in synthetic_usecase_data:
    doc.page_content = doc.metadata["Description"]

Let's look at an example document to see if everything worked as expected!

In [4]:
synthetic_usecase_data[0]

Document(metadata={'source': './data/Projects_with_Domains.csv', 'row': 0, 'Project Title': 'InsightAI 1', 'Project Domain': 'Security', 'Secondary Domain': 'Finance / FinTech', 'Description': 'A low-latency inference system for multimodal agents in autonomous systems.', 'Judge Comments': 'Technically ambitious and well-executed.', 'Score': '85', 'Project Name': 'Project Aurora', 'Judge Score': '9.5'}, page_content='A low-latency inference system for multimodal agents in autonomous systems.')

## Task 3: Setting up Qdrant!

Now that we have our documents, let's create a Qdrant VectorStore with the collection name "Synthetic_Usecases".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [6]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    synthetic_usecase_data,
    embeddings,
    location=":memory:",
    collection_name="Synthetic_Usecases"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply treat each project description as a document, and use cosine similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later
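To make "fetch the `k` most relevant documents by cosine similarity" concrete, here is a minimal pure-Python sketch. It is not how Qdrant implements search; the vectors are made-up 3-dimensional stand-ins for real embeddings, and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy "embeddings" (hypothetical values, for illustration only).
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [1.0, 0.05, 0.0]

nearest = top_k(query_vec, doc_vecs, k=2)  # indices of the two closest documents
```

A real vector store does the same ranking over much larger vectors, usually with an approximate index instead of the exhaustive scan used here.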

In [17]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [18]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [19]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [20]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the naive_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry-picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [21]:
naive_retrieval_chain.invoke({"question" : "What is the most common project domain?"})["response"].content

'The most common project domain in the provided data appears to be "Healthcare / MedTech," as it is mentioned multiple times among the projects.'

In [15]:
naive_retrieval_chain.invoke({"question" : "Were there any usecases about security?"})["response"].content

'Yes, there are usecases related to security. Specifically, one project, "LatticeFlow," is described as an AI model compression suite enabling on-device reasoning for IoT sensors, which falls under the secondary domain of Security. Additionally, another project, "Pathfinder 24," focuses on an AI-powered platform optimizing logistics routes for sustainability, which is linked to security in the context of healthcare / MedTech.'

In [16]:
naive_retrieval_chain.invoke({"question" : "What did judges have to say about the fintech projects?"})["response"].content

'Judges had a generally positive view of the fintech projects, often highlighting their technical strength and real-world impact. For example, one judge described a fintech project as a "clever solution with measurable environmental benefit," and another called a project "impressive" with "robust experimental validation." Overall, the judges appreciated the quality, ambition, and potential impact of the fintech-related projects, though some noted minor issues such as integration challenges or the need for further benchmarking.'

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.
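As a rough illustration of the scoring idea, the sketch below implements the common Okapi BM25 formula in pure Python. It is not the code `BM25Retriever` runs internally; the whitespace tokenizer, the `k1` and `b` defaults, and the three toy documents are assumptions made for this example.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalisation: long documents need more matches to score as high.
            denom = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores

# Toy documents (hypothetical, not taken from the CSV).
docs = [
    "federated learning for healthcare privacy",
    "route optimization for logistics",
    "fraud detection for fintech payments",
]
scores = bm25_scores("fintech fraud", docs)
best = scores.index(max(scores))  # index of the highest-scoring document
```

Only documents that share at least one exact token with the query receive a non-zero score, which is the key difference from embedding-based retrieval.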

This retriever is very straightforward to set up! Let's see it happen down below!


In [17]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(synthetic_usecase_data)

We'll construct the same chain - only changing the retriever.

In [18]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [19]:
bm25_retrieval_chain.invoke({"question" : "What is the most common project domain?"})["response"].content

'Based on the provided data, the project domains mentioned include Productivity Assistants, E-commerce / Marketplaces, Healthcare / MedTech, and Finance / FinTech. Since the dataset snippet shows multiple projects with various domains but does not specify the total counts, I cannot definitively determine the most common domain. \n\nHowever, among the listed projects, Finance / FinTech appears twice (PulseAI 50 and DocuCheck 47), which suggests it might be a common domain in this sample. \n\nIf you need a precise answer based on the full data, I recommend analyzing the entire dataset to identify the domain with the highest frequency.'

In [20]:
bm25_retrieval_chain.invoke({"question" : "Were there any usecases about security?"})["response"].content

'Yes, there was a use case related to security. The project titled "SecureNest 49" in the domain of E-commerce / Marketplaces with a secondary focus on Legal / Compliance involved a document summarization and retrieval system for enterprise knowledge bases, which relates to security and compliance aspects.'

In [21]:
bm25_retrieval_chain.invoke({"question" : "What did judges have to say about the fintech projects?"})["response"].content

'The judges had the following comments about the fintech projects:\n\n- For the project "SynthMind" in the finance/fintech domain, the judges noted that it had a conceptually strong approach but indicated that the results need more benchmarking.\n- For the project "PulseAI 50," which is also in the finance/fintech secondary domain, the judges described it as "Technically ambitious and well-executed."'

It's not clear whether this is better or worse. If only we had a way to test this (SPOILERS: we do, and the second half of the notebook will cover it).

#### ❓ Question #1:

Give an example query where BM25 is better than embeddings and justify your answer.

##### ✅ Answer

BM25 tends to do better when the query hinges on a specific term or name that must match exactly. For example, take a document about space and planets that contains the sentence: `The Kepler-186f exoplanet is located about 500 light-years away and may have conditions similar to Earth.`<br>
For the query `What is known about Kepler-186f?`, BM25 finds this sentence immediately because it contains the exact token "Kepler-186f".<br>An embedding model may not represent a rare identifier like this distinctly, and if the text is chunked by character count the sentence can be cut apart, which weakens the embedding further. BM25 still rewards the exact term overlap (Kepler-186f).

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.
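The two steps above can be sketched without any external service. In the sketch below, `overlap_score` is a deliberately crude stand-in for a real reranking model such as Cohere Rerank, and the candidate strings are invented; only the retrieve-many-then-keep-few shape is the point.

```python
def overlap_score(query, text):
    """Stand-in relevance score: fraction of query terms found in the text."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    return len(query_terms & text_terms) / len(query_terms)

def rerank(query, candidates, score_fn, top_n):
    """Re-score the first-stage candidates and keep only the top_n."""
    ranked = sorted(candidates, key=lambda text: score_fn(query, text), reverse=True)
    return ranked[:top_n]

# Pretend these came back from a first-stage retriever with a large k.
candidates = [
    "a logistics platform for route planning",
    "fraud detection for fintech payments",
    "a fintech budgeting assistant",
    "medical imaging triage",
]
compressed = rerank("fintech fraud detection", candidates, overlap_score, top_n=2)
```

A real reranker scores each (query, document) pair with a learned model, which is slower per document than the first-stage search. That is why it is applied only to the small candidate set.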

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [22]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

Let's create our chain again, and see how this does!

In [28]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [29]:
contextual_compression_retrieval_chain.invoke({"question" : "What is the most common project domain?"})["response"].content

'Based on the provided data, the most common project domain appears to be "Security," as it is listed for at least one project in the sample. However, since the sample includes only a few projects, and the data source is a CSV file, I cannot definitively determine the most common domain without analyzing the entire dataset.\n\nIf you have the complete dataset or additional information, I can help analyze it further.'

In [30]:
contextual_compression_retrieval_chain.invoke({"question" : "Were there any usecases about security?"})["response"].content

'Based on the provided context, there are no specific use cases explicitly related to security. The use cases mentioned mainly focus on privacy improvements in healthcare applications through federated learning.'

In [31]:
contextual_compression_retrieval_chain.invoke({"question" : "What did judges have to say about the fintech projects?"})["response"].content

'The judges had positive comments about the fintech projects. For example, one project, Pathfinder 27, was praised for its excellent code quality and use of open-source libraries, receiving a high judge score of 9.8. Additionally, a project called PlanPilot 35 was described as a clever solution with measurable environmental benefits and received a judge score of 8.4. Overall, the judges appreciated the quality, innovation, and potential impact of the fintech projects.'

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context
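The three steps above can be sketched as follows. The query generator and the retriever here are hypothetical stand-ins (a real setup would call an LLM and a vector store), and this sketch keeps the original question among the variants, which is a choice made for the example, not a statement about `MultiQueryRetriever`'s defaults.

```python
def multi_query_retrieve(question, generate_queries, retrieve):
    """Retrieve for each query variant and return the de-duplicated union."""
    seen = set()
    merged = []
    for query in generate_queries(question):
        for doc in retrieve(query):
            if doc not in seen:  # keep only the first occurrence of each document
                seen.add(doc)
                merged.append(doc)
    return merged

# Hypothetical stand-in for the LLM that rewrites the question.
def fake_generate_queries(question):
    return [question, "security related projects", "projects about cyber defense"]

# Hypothetical stand-in for a retriever: a fixed lookup table of results.
FAKE_INDEX = {
    "Were there any usecases about security?": ["doc_a", "doc_b"],
    "security related projects": ["doc_b", "doc_c"],
    "projects about cyber defense": ["doc_d"],
}

def fake_retrieve(query):
    return FAKE_INDEX.get(query, [])

docs = multi_query_retrieve(
    "Were there any usecases about security?", fake_generate_queries, fake_retrieve
)
```

Note that `doc_b` is returned by two variants but appears once in the merged context, while `doc_c` and `doc_d` are only reachable through the reformulations.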

So, how hard is it to set up? Not bad! Let's see it down below!



In [32]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
) 

In [33]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [34]:
multi_query_retrieval_chain.invoke({"question" : "What is the most common project domain?"})["response"].content

'Based on the provided data, there are multiple project domains such as Healthcare / MedTech, Developer Tools / DevEx, E-commerce / Marketplaces, Legal / Compliance, Finance / FinTech, Data / Analytics, Sales / CRM, Customer Support / Helpdesk, Productivity Assistants, Security, Creative / Design / Media, and Writing & Content.\n\nAmong these, the most frequently mentioned project domain appears to be **Writing & Content**. Several projects are categorized under this domain, indicating it is the most common project domain in the dataset.'

In [35]:
multi_query_retrieval_chain.invoke({"question" : "Were there any usecases about security?"})["response"].content

'Yes, there are usecases related to security. One example is the project titled "SecureNest," which focuses on a hardware-aware model quantization benchmark suite with a strong concept in the security and compliance domain.'

In [36]:
multi_query_retrieval_chain.invoke({"question" : "What did judges have to say about the fintech projects?"})["response"].content

'Judges had generally positive comments about the fintech projects. For example, one judge described the project as a "clever solution with measurable environmental benefit," indicating appreciation for innovation and impact. Another commented that the project was "technically ambitious and well-executed," highlighting its strong technical foundation. Additionally, some projects received praise for their promising ideas with robust validation, or for their potential real-world impact. Overall, the judges recognized the fintech projects for their ingenuity, technical rigor, and potential benefits.'

#### ❓ Question #2:

Explain how generating multiple reformulations of a user query can improve recall.

##### ✅ Answer
A single user query is often too narrow or ambiguous. For example, take the query `planets similar to Earth`. If the documents describe “Earth-like exoplanets discovered by Kepler”, “Habitable zone worlds around distant stars”, and “Terrestrial planets with liquid water potential”, a single lexical or embedding-based search might miss some of them because the exact wording “planets similar to Earth” doesn’t appear in every text.

Generating multiple reformulations of the query increases recall because each reformulation can match documents that express the same concept in different language or terminology, so the union of the results covers more of the relevant documents.

## Task 8: Parent Document Retriever

A "small-to-big" strategy - the Parent Document Retriever works based on a simple strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.
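Here is a minimal sketch of the "search small, return big" idea in plain Python. Word-overlap scoring stands in for vector similarity, a plain dict stands in for the document store, and the two parent texts are invented for illustration; none of this is the `ParentDocumentRetriever` implementation.

```python
def split_into_children(parent_id, text, chunk_size):
    """Cut a parent text into fixed-size word chunks that remember their parent."""
    words = text.split()
    return [
        (parent_id, " ".join(words[i:i + chunk_size]))
        for i in range(0, len(words), chunk_size)
    ]

def score(query, chunk):
    """Stand-in for vector similarity: count of shared lowercase words."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve_parents(query, parents, chunk_size=4, k=2):
    """Search over child chunks, then return the distinct parents they belong to."""
    children = []
    for parent_id, text in parents.items():
        children.extend(split_into_children(parent_id, text, chunk_size))
    best_children = sorted(children, key=lambda c: score(query, c[1]), reverse=True)[:k]
    parent_ids = []
    for parent_id, _ in best_children:
        if parent_id not in parent_ids:  # several children may share one parent
            parent_ids.append(parent_id)
    return [parents[pid] for pid in parent_ids]

# Hypothetical parent documents keyed by id (the "docstore").
parents = {
    "p1": "a federated learning toolkit that improves privacy in healthcare applications",
    "p2": "an ai platform that optimizes logistics routes for sustainability",
}
results = retrieve_parents("privacy in healthcare", parents)
```

Both of the best-matching child chunks come from `p1`, so the caller gets that single full parent back once, with the surrounding context the small chunks lacked.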

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [37]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = synthetic_usecase_data
child_splitter = RecursiveCharacterTextSplitter(chunk_size=750)

We'll need to set up a new Qdrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [38]:
from langchain_qdrant import QdrantVectorStore

client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = QdrantVectorStore(
    collection_name="full_documents", embedding=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)

Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [39]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [40]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [41]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [42]:
parent_document_retrieval_chain.invoke({"question" : "What is the most common project domain?"})["response"].content

'The most common project domain in the provided data appears to be "Healthcare / MedTech," as it is mentioned in multiple project entries.'

In [43]:
parent_document_retrieval_chain.invoke({"question" : "Were there any usecases about security?"})["response"].content

'Based on the provided context, there are no specific use cases about security explicitly mentioned. The projects focus on federated learning and improving privacy in healthcare applications, which relates to security and privacy concerns, but there are no direct references to security use cases.'

In [44]:
parent_document_retrieval_chain.invoke({"question" : "What did judges have to say about the fintech projects?"})["response"].content

'The judges had positive comments about the fintech projects. For example, one project was described as "A clever solution with measurable environmental benefit," indicating that the judges appreciated its innovation and impact. Another project was noted as "Technically ambitious and well-executed," suggesting a high level of technical quality and execution. Overall, the judges recognized the projects for their creativity, impact, and soundness.'

Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to answer the performance question more rigorously.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever takes two or more retrievers and combines their retrieved documents using a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.
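For intuition, a weighted variant of Reciprocal Rank Fusion can be sketched in a few lines. The constant `c=60` follows the value suggested in the RRF paper linked above; the function name, the document ids, and the two input rankings are made up for this example, and the sketch is not a copy of `EnsembleRetriever`'s code.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, weights, c=60):
    """Fuse several ranked lists of document ids into a single ranking."""
    fused = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for documents it ranks near the top.
            fused[doc_id] += weight / (c + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical outputs of two retrievers for the same query.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]

fused_ranking = reciprocal_rank_fusion([bm25_ranking, dense_ranking], weights=[0.5, 0.5])
```

Because only ranks are used, the raw scores of the individual retrievers never need to be on a comparable scale, which is what makes it practical to mix sparse and dense retrievers.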

In [45]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.

In [46]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [47]:
ensemble_retrieval_chain.invoke({"question" : "What is the most common project domain?"})["response"].content

'The most common project domain in the provided data appears to be "E‑commerce / Marketplaces," as it is mentioned more than once among the sample projects.'

In [48]:
ensemble_retrieval_chain.invoke({"question" : "Were there any usecases about security?"})["response"].content

'Yes, there was at least one use case related to security. The project titled "SecureNest" involves a document summarization and retrieval system for enterprise knowledge bases, which is associated with the domain of E-commerce / Marketplaces and secondary domain of Legal / Compliance. The description indicates it is a comprehensive and technically mature approach, likely addressing security in the context of managing sensitive enterprise information.'

In [49]:
ensemble_retrieval_chain.invoke({"question" : "What did judges have to say about the fintech projects?"})["response"].content

'Judges had various comments about the fintech projects. For the project "SynthMind," judges described it as having a strong conceptual foundation but noted that the results require more benchmarking. Overall, the feedback indicates that while some fintech projects are conceptually solid, there is a need for more comprehensive evaluation to fully demonstrate their effectiveness.'

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of several [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

We'll use the `percentile` thresholding method for this example which will:

Calculate the distance between each pair of consecutive sentences, and then break the sequence wherever a distance exceeds a given percentile of all distances.
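The percentile rule can be sketched in pure Python as below. This is a simplified illustration and not the `SemanticChunker` implementation: it compares single sentences with no surrounding buffer, the 2-dimensional vectors are toy stand-ins for real embeddings, and the `pct=95` default is an assumption of this sketch.

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def percentile(values, pct):
    """Linear-interpolation percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * pct / 100
    lower, upper = math.floor(position), math.ceil(position)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)

def semantic_chunks(sentences, vectors, pct=95):
    """Start a new chunk wherever the distance to the next sentence exceeds the percentile."""
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # large semantic jump: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

# Toy sentences with hand-made vectors: two about cats, two about rockets.
sentences = ["Cats purr.", "Cats nap a lot.", "Rockets burn fuel.", "Rockets reach orbit."]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
chunks = semantic_chunks(sentences, vectors)
```

Because the threshold is relative to the distances within the text being split, the same percentile setting adapts to corpora whose sentences are uniformly close together or far apart.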

In [50]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [51]:
semantic_documents = semantic_chunker.split_documents(synthetic_usecase_data[:20])

Let's create a new vector store.

In [52]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="Synthetic_Usecase_Data_Semantic_Chunks"
)

We'll use naive retrieval for this example.

In [53]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [54]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [55]:
semantic_retrieval_chain.invoke({"question" : "What is the most common project domain?"})["response"].content

'The most common project domain in the provided data is "Legal / Compliance," which appears twice among the listed projects.'

In [56]:
semantic_retrieval_chain.invoke({"question" : "Were there any usecases about security?"})["response"].content

'Yes, there are use cases related to security. Specifically, projects such as "MediMind 17" and "AutoMate 11" are in the security domain. "MediMind 17" involves a medical imaging solution, and "AutoMate 11" features a reinforcement learning setup for optimizing energy efficiency in data centers, which can be linked to security aspects of infrastructure.'

In [57]:
semantic_retrieval_chain.invoke({"question" : "What did judges have to say about the fintech projects?"})["response"].content

'Judges had a generally positive view of the fintech projects. For example, the project "TrendLens 19" was described as "Technically ambitious and well-executed," and "AutoMate 5" was noted for being "A forward-looking idea with solid supporting data." Overall, the judges appreciated the technical quality and potential of the fintech-related projects.'

#### ❓ Question #3:

If sentences are short and highly repetitive (e.g., FAQs), how might semantic chunking behave, and how would you adjust the algorithm?

##### ✅ Answer
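With short, highly repetitive sentences, the embeddings of neighbouring sentences tend to be very similar, so distance-based breakpoints become unreliable: the chunker may merge many unrelated FAQ entries into one chunk or split at essentially arbitrary points. Possible adjustments:<br> Keep each question-and-answer pair as one atomic chunk, with no overlap and a short chunk size only if the atomic FAQ fits (typically 100–300 tokens).<br> Use keyword and semantic search together: BM25 finds exact word matches, while embeddings find similar meanings.<br> Rerank the results with a cross-encoder.<br> Use a ParentDocumentRetriever to keep answers intact.<br>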

# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against each other.

You are expected to:

1. Create a "golden dataset"
   - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever-specific* Ragas metrics
   - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparison
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.
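If you also want a quick local latency number to go with the LangSmith traces, a small timing helper is enough. The sketch below is only an illustration: `time_retriever` and `fake_retrieve` are hypothetical names, and with the retrievers defined earlier you could pass something like `naive_retriever.invoke` as the callable instead.

```python
import statistics
import time

def time_retriever(retrieve, questions, repeats=3):
    """Return the median wall-clock latency in seconds over all calls."""
    latencies = []
    for question in questions:
        for _ in range(repeats):
            start = time.perf_counter()
            retrieve(question)
            latencies.append(time.perf_counter() - start)
    return statistics.median(latencies)

# Hypothetical stand-in for a retriever call.
def fake_retrieve(question):
    return [question.upper()]

median_latency = time_retriever(fake_retrieve, ["q1", "q2"])
```

The median is used instead of the mean so that one slow call (for example a cold start or a network hiccup) does not dominate the comparison between retrievers.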

In [8]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("data/howpeopleuseai.pdf")
pdf_docs = loader.load()

In [9]:

from ragas.testset import TestsetGenerator
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o")) 
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())



In [10]:
from ragas.testset import TestsetGenerator
from ragas.testset.synthesizers import SingleHopSpecificQuerySynthesizer, MultiHopAbstractQuerySynthesizer


generator = TestsetGenerator(
    llm=generator_llm, 
    embedding_model=generator_embeddings,
)

dataset = generator.generate_with_langchain_docs(
    pdf_docs, 
    testset_size=10,
    query_distribution=[
        (SingleHopSpecificQuerySynthesizer(llm=generator_llm), 0.5),
        (MultiHopAbstractQuerySynthesizer(llm=generator_llm), 0.5),
    ],
)


## Evaluate Naive Retriever with Ragas Metrics

Now let's use the synthetic dataset to evaluate our naive retriever, followed by each of the other retrievers in turn.


In [12]:
# Convert to pandas to see the data more clearly
import pandas as pd
df_dataset = dataset.to_pandas()
df_dataset.head()


| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | Wht did Bick et al., 2024 say about the speed ... | [Introduction\nChatGPT launched in November 20... | Bick et al., 2024 noted that the speed of glob... | single_hop_specific_query_synthesizer |
| 1 | Who is Brynjolfsson? | [Month\nNon-Work (M)\n(%)\nWork (M)\n(%)\nTota... | Brynjolfsson is mentioned in the context as pa... | single_hop_specific_query_synthesizer |
| 2 | What does Handa et al. (2025) report about the... | [Two of our findings stand in contrast to othe... | Handa et al. (2025) report that 37% of convers... | single_hop_specific_query_synthesizer |
| 3 | What Caplin et al. say about ChatGPT and how i... | [Doing, and that Asking messages are consisten... | Caplin et al. (2023) argue that ChatGPT likely... | single_hop_specific_query_synthesizer |
| 4 | How do large language models (LLMs) function a... | [What is ChatGPT?\nHere we give a simplified o... | An LLM can be thought of as a function from a ... | single_hop_specific_query_synthesizer |


### Create Vector Stores from PDF Documents




Naive Retriever

In [22]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

pdf_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

pdf_vectorstore = Qdrant.from_documents(
    pdf_docs,
    pdf_embeddings,
    location=":memory:",
    collection_name="PDF_Documents"
)


In [23]:
pdf_naive_retriever = pdf_vectorstore.as_retriever(search_kwargs={"k": 10})

BM25

In [25]:
from langchain_community.retrievers import BM25Retriever

pdf_bm25_retriever = BM25Retriever.from_documents(pdf_docs)
pdf_bm25_retriever.k = 10
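
To make the keyword ranking behind `BM25Retriever` more concrete, here is a minimal pure-Python sketch of BM25 scoring. It is an illustration only, not the code the retriever runs (LangChain's `BM25Retriever` relies on the `rank_bm25` package). The whitespace tokenizer, the toy documents, the `k1`/`b` values, and the `+ 1` IDF variant are all assumptions made for this example.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a toy BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a larger weight than common ones
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates (k1) and is normalized by length (b)
            tf = term_freq[term]
            denom = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores

# Hypothetical mini-corpus, unrelated to pdf_docs
toy_docs = [
    "ChatGPT launched in November 2022",
    "Usage of ChatGPT at work and outside work",
    "Large language models predict the next token",
]
toy_scores = bm25_scores("chatgpt work", toy_docs)
```

The second document should score highest because it contains both query terms, and the rarer term ("work") appears twice. A document that shares no terms with the query scores zero, which is why BM25 can miss paraphrased questions that a dense retriever would catch.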

MultiQueryRetriever

In [26]:
from langchain.retrievers.multi_query import MultiQueryRetriever

pdf_multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=pdf_naive_retriever,
    llm=chat_model
)
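
Conceptually, multi-query retrieval asks the LLM for several rephrasings of the question, runs each one through the base retriever, and returns the de-duplicated union of the results. The sketch below mimics that flow without any API calls. `fake_rephrase`, `fake_retrieve`, and `TOY_CORPUS` are hypothetical stand-ins for the LLM, for `pdf_naive_retriever`, and for the indexed documents.

```python
TOY_CORPUS = [
    "ChatGPT launched in November 2022",
    "Most usage happens outside of work",
    "LLMs predict the next token",
]

def fake_rephrase(question):
    # Hypothetical stand-in for the LLM that writes query variants
    return [question, "ChatGPT release date", "November 2022 launch"]

def fake_retrieve(query, k=2):
    # Hypothetical stand-in for a retriever: rank by shared words
    words = set(query.lower().split())
    ranked = sorted(
        TOY_CORPUS,
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def multi_query_retrieve(question, rephrase, retrieve):
    """Run every query variant and keep the unique union of the hits."""
    seen, merged = set(), []
    for variant in rephrase(question):
        for doc in retrieve(variant):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

merged_docs = multi_query_retrieve(
    "When did ChatGPT launch?", fake_rephrase, fake_retrieve
)
```

Because every variant triggers its own retrieval, the cost grows with the number of generated queries, plus one extra LLM call to produce them.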

ParentDocumentRetriever

In [None]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient, models

pdf_child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

pdf_parent_client = QdrantClient(location=":memory:")
pdf_parent_client.create_collection(
    collection_name="pdf_parent_docs",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

pdf_parent_vectorstore = QdrantVectorStore(
    collection_name="pdf_parent_docs",
    embedding=pdf_embeddings,
    client=pdf_parent_client
)

# Create parent document retriever
pdf_parent_store = InMemoryStore()
pdf_parent_document_retriever = ParentDocumentRetriever(
    vectorstore=pdf_parent_vectorstore,
    docstore=pdf_parent_store,
    child_splitter=pdf_child_splitter,
)

pdf_parent_document_retriever.add_documents(pdf_docs, ids=None)
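
The "small-to-big" idea behind the parent-document retriever can be sketched in a few lines: search over small child chunks, then return the full parent documents those chunks came from. In this sketch the word-based splitter and the word-overlap score are assumptions standing in for `RecursiveCharacterTextSplitter` and the Qdrant similarity search, and the two parent texts are made up.

```python
def split_into_children(text, words_per_chunk=6):
    # Stand-in for the child splitter: fixed-size word windows
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]

# Stand-in for the docstore: parent id -> full parent text
toy_parents = {
    "doc-a": "ChatGPT launched in November 2022. Adoption was fast across many countries.",
    "doc-b": "An LLM maps a sequence of tokens to a distribution over the next token.",
}

# Stand-in for the vector store: each child chunk remembers its parent id
child_index = [
    (chunk, parent_id)
    for parent_id, text in toy_parents.items()
    for chunk in split_into_children(text)
]

def overlap(query, text):
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve_parents(query, k=2):
    """Rank child chunks, then return their de-duplicated parents."""
    ranked = sorted(child_index, key=lambda item: overlap(query, item[0]), reverse=True)
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [toy_parents[parent_id] for parent_id in parent_ids]

parent_hits = retrieve_parents("distribution over the next", k=1)
```

The match happens on a six-word chunk, but the LLM receives the whole parent text, which gives it more surrounding context than the chunk alone.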

ContextualCompressionRetriever

In [48]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

pdf_compressor = CohereRerank(model="rerank-v3.5")
pdf_compression_retriever = ContextualCompressionRetriever(
    base_compressor=pdf_compressor,
    base_retriever=pdf_naive_retriever
)

EnsembleRetriever

In [29]:
from langchain.retrievers import EnsembleRetriever

pdf_retriever_list = [
    pdf_bm25_retriever,
    pdf_naive_retriever,
    pdf_multi_query_retriever
]

pdf_ensemble_retriever = EnsembleRetriever(
    retrievers=pdf_retriever_list,
    weights=[1/3, 1/3, 1/3]
)
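
`EnsembleRetriever` merges the ranked lists from its retrievers with weighted Reciprocal Rank Fusion (RRF). A minimal sketch of that fusion step follows. The document IDs and the two rankings are made up, and `c=60` is the constant commonly used for RRF, assumed here rather than taken from this notebook.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several rankings: each list adds weight / (rank + c) per document."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from a sparse and a dense retriever
toy_bm25_ranking = ["d1", "d2", "d3"]
toy_dense_ranking = ["d2", "d4", "d1"]

fused_ranking = weighted_rrf(
    [toy_bm25_ranking, toy_dense_ranking],
    weights=[0.5, 0.5],
)
```

Here `d2` ends up first because both retrievers rank it near the top, even though only one of them puts it in first place. Only ranks matter in this scheme, so the raw similarity scores of the individual retrievers never need to be comparable.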

In [None]:
from operator import itemgetter
from langchain_core.messages import AIMessage
from langchain_core.runnables import RunnablePassthrough
from ragas import EvaluationDataset, evaluate, RunConfig
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import (
    LLMContextRecall, Faithfulness, FactualCorrectness,
)

def reset_eval_fields(ds):
    # Clear the previous retriever's outputs so each run starts clean
    for s in ds:
        if hasattr(s, "eval_sample"):
            s.eval_sample.response = ""
            s.eval_sample.retrieved_contexts = []

def to_text(x):
    # Chat models return an AIMessage; keep only its text content
    return x.content if isinstance(x, AIMessage) else str(x)

def make_chain(retriever, prompt, llm):
    # Retrieve context for the question, then return both the LLM response
    # and the retrieved documents so they can be recorded for evaluation
    return (
        {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
        | RunnablePassthrough.assign(context=itemgetter("context"))
        | {"response": prompt | llm, "context": itemgetter("context")}
    )

def evaluate_current_dataset(ds, evaluator_llm):
    # Score whatever responses and contexts are currently stored on the dataset
    evaluation_dataset = EvaluationDataset.from_pandas(ds.to_pandas())
    return evaluate(
        dataset=evaluation_dataset,
        metrics=[
            LLMContextRecall(), Faithfulness(), FactualCorrectness()
        ],
        llm=evaluator_llm,
        run_config=RunConfig(timeout=360),
        raise_exceptions=False,  # failed jobs are skipped instead of aborting the run
    )

evaluator_llm = LangchainLLMWrapper(chat_model)




**NAIVE retriever evaluation**

In [None]:
naive_chain = make_chain(pdf_naive_retriever, rag_prompt, chat_model)

reset_eval_fields(dataset)
for row in dataset:
    q = getattr(row.eval_sample, "user_input", None) or getattr(row.eval_sample, "question", None)
    if not q: 
        continue
    out = naive_chain.invoke({"question": q})
    row.eval_sample.response = to_text(out["response"])
    row.eval_sample.retrieved_contexts = [d.page_content for d in out["context"]][:10]

# Score the collected responses and contexts with Ragas
res_naive = evaluate_current_dataset(dataset, evaluator_llm)
print(res_naive)


Evaluating:   0%|          | 0/48 [00:00<?, ?it/s]

LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
Exception raised in Job[20]: PermissionDeniedError(Error code: 403)
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
Exception raised in Job[29]: TimeoutError()
Exception raised in Job[34]: LLMDidNotFinishException(The LLM generation was not completed. Please increase the max_tokens and try again.)
Exception raised in Job[35]: TimeoutError()
Exception raised in Job[41]: TimeoutError()
Exception raised in Job[47]: TimeoutError()


{'context_recall': 1.0000, 'faithfulness': 0.9492, 'factual_correctness(mode=f1)': 0.6900, 'answer_relevancy': 0.8274, 'context_entity_recall': 0.2358, 'noise_sensitivity(mode=relevant)': 0.5278}
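
The timeouts and other exceptions logged above mean that some per-sample scores are missing. With `raise_exceptions=False`, my understanding is that Ragas records those as NaN and averages only the samples that were actually scored, so an aggregate can rest on fewer samples than the dataset contains. The sketch below shows the effect in plain Python. The score list is illustrative and not taken from this run.

```python
import math

# Illustrative per-sample scores; NaN marks a job that failed or timed out
toy_scores_with_gaps = [1.0, 0.5, float("nan"), 0.75, float("nan")]

scored_only = [s for s in toy_scores_with_gaps if not math.isnan(s)]

# Average over the samples that produced a score
mean_ignoring_failures = sum(scored_only) / len(scored_only)

# How many samples the average is actually based on
n_scored = len(scored_only)
```

When comparing retrievers, it is worth checking how many samples each average is based on, since a run with many failed jobs is a noisier estimate than a complete one.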


**BM25 retriever**

In [34]:
bm25_chain = make_chain(pdf_bm25_retriever, rag_prompt, chat_model)

reset_eval_fields(dataset)
for row in dataset:
    q = getattr(row.eval_sample, "user_input", None) or getattr(row.eval_sample, "question", None)
    if not q: 
        continue
    out = bm25_chain.invoke({"question": q})
    row.eval_sample.response = to_text(out["response"])
    row.eval_sample.retrieved_contexts = [d.page_content for d in out["context"]][:10]

res_bm25 = evaluate_current_dataset(dataset, evaluator_llm)
print(res_bm25)


Evaluating:   0%|          | 0/48 [00:00<?, ?it/s]

LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
Prompt extract_entities_prompt failed to parse output: The output parser failed to pars

{'context_recall': 1.0000, 'faithfulness': 0.8382, 'factual_correctness(mode=f1)': 0.6700, 'answer_relevancy': 0.8212, 'context_entity_recall': 0.1084, 'noise_sensitivity(mode=relevant)': 0.2000}


**Multi-Query retriever**

In [35]:
mq_chain = make_chain(pdf_multi_query_retriever, rag_prompt, chat_model)

reset_eval_fields(dataset)
for row in dataset:
    q = getattr(row.eval_sample, "user_input", None) or getattr(row.eval_sample, "question", None)
    if not q: 
        continue
    out = mq_chain.invoke({"question": q})
    row.eval_sample.response = to_text(out["response"])
    row.eval_sample.retrieved_contexts = [d.page_content for d in out["context"]][:10]

res_mq = evaluate_current_dataset(dataset, evaluator_llm)
res_mq


Evaluating:   0%|          | 0/48 [00:00<?, ?it/s]

LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
Exception raised in Job[11]: TimeoutError()
Exception raised in Job[5]: TimeoutError()
Exception raised in Job[28]: TimeoutError()
Exception raised in Job[29]: TimeoutError()
Exception raised in Job[35]: TimeoutError()
Exception raised in Job[41]: TimeoutError()
Exception raised in Job[47]: TimeoutError()


{'context_recall': 1.0000, 'faithfulness': 0.9861, 'factual_correctness(mode=f1)': 0.6700, 'answer_relevancy': 0.9376, 'context_entity_recall': 0.3073, 'noise_sensitivity(mode=relevant)': 0.3500}

**Parent-Document retriever**

In [37]:
parent_chain = make_chain(pdf_parent_document_retriever, rag_prompt, chat_model)

reset_eval_fields(dataset)
for row in dataset:
    q = getattr(row.eval_sample, "user_input", None) or getattr(row.eval_sample, "question", None)
    if not q: 
        continue
    out = parent_chain.invoke({"question": q})
    row.eval_sample.response = to_text(out["response"])
    row.eval_sample.retrieved_contexts = [d.page_content for d in out["context"]][:10]

res_parent = evaluate_current_dataset(dataset, evaluator_llm)
print(res_parent)


Evaluating:   0%|          | 0/24 [00:00<?, ?it/s]

{'context_recall': 1.0000, 'faithfulness': 0.8750, 'factual_correctness(mode=f1)': 0.7637}


**Compression retriever (Cohere rerank)**

In [49]:
compression_chain = make_chain(pdf_compression_retriever, rag_prompt, chat_model)

reset_eval_fields(dataset)
for row in dataset:
    q = getattr(row.eval_sample, "user_input", None) or getattr(row.eval_sample, "question", None)
    if not q: 
        continue
    out = compression_chain.invoke({"question": q})
    row.eval_sample.response = to_text(out["response"])
    row.eval_sample.retrieved_contexts = [d.page_content for d in out["context"]][:10]

res_compression = evaluate_current_dataset(dataset, evaluator_llm)
print(res_compression)


Evaluating:   0%|          | 0/24 [00:00<?, ?it/s]

{'context_recall': 1.0000, 'faithfulness': 0.8353, 'factual_correctness(mode=f1)': 0.5763}


**Ensemble retriever**

In [54]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [
    pdf_bm25_retriever,
    pdf_naive_retriever,
    pdf_multi_query_retriever,
    pdf_parent_document_retriever,
    pdf_compression_retriever,  
]

equal_weighting = [1/len(retriever_list)] * len(retriever_list)

pdf_ensemble_retriever_all = EnsembleRetriever(
    retrievers=retriever_list,
    weights=equal_weighting,
)

ensemble_chain = make_chain(pdf_ensemble_retriever_all, rag_prompt, chat_model)

reset_eval_fields(dataset)
for row in dataset:
    q = getattr(row.eval_sample, "user_input", None) or getattr(row.eval_sample, "question", None)
    if not q:
        continue
    out = ensemble_chain.invoke({"question": q})
    row.eval_sample.response = to_text(out["response"])
    row.eval_sample.retrieved_contexts = [d.page_content for d in out["context"]][:10]

res_ensemble = evaluate_current_dataset(dataset, evaluator_llm)
print(res_ensemble)


Evaluating:   0%|          | 0/24 [00:00<?, ?it/s]

{'context_recall': 1.0000, 'faithfulness': 1.0000, 'factual_correctness(mode=f1)': 0.7963}


In [None]:
import numpy as np
import pandas as pd

def mean_score(result, metric):
    # result[metric] is a list with one score per sample, so average it,
    # skipping NaN entries left behind by jobs that failed or timed out
    return float(np.nanmean(result[metric]))

results_by_retriever = {
    "Naive": res_naive,
    "BM25": res_bm25,
    "Multi-Query": res_mq,
    "Parent-Doc": res_parent,
    "Compression": res_compression,
    "Ensemble": res_ensemble,
}

metric_keys = {
    "Context Recall": "context_recall",
    "Faithfulness": "faithfulness",
    "Factual Correctness": "factual_correctness(mode=f1)",
}

results_df = pd.DataFrame(
    [
        {"Retriever": name, **{col: mean_score(res, key) for col, key in metric_keys.items()}}
        for name, res in results_by_retriever.items()
    ]
)

# Sort by Factual Correctness, best first
results_df = results_df.sort_values("Factual Correctness", ascending=False)

print("\n" + "="*80)
print("ACTIVITY 1: RETRIEVER COMPARISON RESULTS")
print("="*80)
print(results_df.to_string(index=False, float_format="{:.4f}".format))
print("="*80)


ACTIVITY 1: RETRIEVER COMPARISON RESULTS

Mean score per retriever, as reported by each evaluation run above:

| Retriever | Context Recall | Faithfulness | Factual Correctness |
|---|---|---|---|
| Ensemble | 1.0000 | 1.0000 | 0.7963 |
| Parent-Doc | 1.0000 | 0.8750 | 0.7637 |
| Naive | 1.0000 | 0.9492 | 0.6900 |
| BM25 | 1.0000 | 0.8382 | 0.6700 |
| Multi-Query | 1.0000 | 0.9861 | 0.6700 |
| Compression | 1.0000 | 0.8353 | 0.5763 |

**ANALYSIS OF RETRIEVAL STRATEGIES**

Based on evaluation of 6 retrieval methods on 8 synthetic questions (the generator was
asked for 10 but produced 8 samples), the Ensemble retriever achieved the highest factual
correctness (0.80) together with perfect faithfulness (1.00). This suggests that combining
multiple retrieval strategies (BM25, Naive, Multi-Query, Parent-Doc, and Compression)
captures diverse relevant information. However, this comes at significant cost: the
Ensemble approach incurs the cumulative API expenses and latency of all constituent
retrievers plus Cohere reranking.

For production use prioritizing cost-efficiency, the Parent-Document retriever offers
the best balance, achieving strong factual correctness (0.76) and faithfulness (0.88)
with only a single embedding call and a single LLM call per query. BM25 is the lowest-cost
option (no embedding fees) but trails in factual correctness (0.67). The Compression
retriever, despite the added Cohere reranking cost, surprisingly underperformed (0.58
factual correctness), suggesting the reranker may have filtered out relevant context for
this particular dataset.

All retrievers achieved perfect context recall (1.00), indicating the primary differentiator 
is not retrieval coverage but rather how well each method ranks and presents context to 
the LLM. For this PDF dataset about AI usage, I recommend the Parent-Document retriever 
for balanced production use, or Ensemble when accuracy justifies the higher operational cost.