# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up Qdrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

> You do not need to run the following cells if you are running this notebook locally. 

In [1]:
#!pip install -qU langchain langchain-openai langchain-cohere rank_bm25

We're also going to be leveraging [Qdrant's](https://qdrant.tech/documentation/frameworks/langchain/) (pronounced "Quadrant") VectorDB in "memory" mode (so we can leverage it locally in our colab environment).

In [2]:
#!pip install -qU qdrant-client

We'll also provide our OpenAI key, as well as our Cohere API key.

In [3]:
import os
import getpass

def set_api_key(key_name: str) -> None:
    """
    Securely set an environment variable if it doesn't already exist.
    Prompts the user for input using a password-style hidden input.

    Args:
        key_name (str): Name of the environment variable to set (e.g., "OPENAI_API_KEY")
    """
    if not os.environ.get(key_name):
        os.environ[key_name] = getpass.getpass(f"{key_name}: ")

set_api_key("OPENAI_API_KEY")
set_api_key("COHERE_API_KEY")


### LangSmith Configuration

In [4]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - 13 Advanced Retrieval - {uuid4().hex[0:8]}"
os.environ["LANGCHAIN_TRACING_V2"] = "true"
set_api_key("LANGCHAIN_API_KEY")

## Task 2: Data Collection and Preparation

We'll be using some reviews from the 4 movies in the John Wick franchise today to explore the different retrieval strategies.

These were obtained from IMDB, and are available in the [AIM Data Repository](https://github.com/AI-Maker-Space/DataRepository).

### Data Collection

We can simply `wget` these from GitHub.

You could use any review data you wanted in this step - just be careful to make sure your metadata is aligned with your choice.

In [5]:
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw1.csv -O john_wick_1.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw2.csv -O john_wick_2.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw3.csv -O john_wick_3.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw4.csv -O john_wick_4.csv

--2025-03-04 09:07:26--  https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw1.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19628 (19K) [text/plain]
Saving to: ‘john_wick_1.csv’


2025-03-04 09:07:26 (82.8 KB/s) - ‘john_wick_1.csv’ saved [19628/19628]

--2025-03-04 09:07:27--  https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw2.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14747 (14K) [text/plain]
Saving to: ‘john_wick_2.csv’


2025-03-04 09:07:27 (1.79 MB/s) - ‘john_wick_2.csv’

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

- Self-Query: Wants as much metadata as we can provide
- Time-weighted: Wants temporal data

> NOTE: While we're creating a temporal relationship based on when these movies came out for illustrative purposes, it needs to be clear that the "time-weighting" in the Time-weighted Retriever is based on when the document was *accessed* last - not when it was created.

In [6]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

documents = []

for i in range(1, 5):
  loader = CSVLoader(
      file_path=f"john_wick_{i}.csv",
      metadata_columns=["Review_Date", "Review_Title", "Review_Url", "Author", "Rating"]
  )

  movie_docs = loader.load()
  for doc in movie_docs:

    # Add the "Movie Title" (John Wick 1, 2, ...)
    doc.metadata["Movie_Title"] = f"John Wick {i}"

    # convert "Rating" to an `int`, if no rating is provided - assume 0 rating
    doc.metadata["Rating"] = int(doc.metadata["Rating"]) if doc.metadata["Rating"] else 0

    # newer movies have a more recent "last_accessed_at"
    doc.metadata["last_accessed_at"] = datetime.now() - timedelta(days=4-i)

  documents.extend(movie_docs)

Let's look at an example document to see if everything worked as expected!

In [7]:
documents[0]

Document(metadata={'source': 'john_wick_1.csv', 'row': 0, 'Review_Date': '6 May 2015', 'Review_Title': ' Kinetic, concise, and stylish; John Wick kicks ass.\n', 'Review_Url': '/review/rw3233896/?ref_=tt_urv', 'Author': 'lnvicta', 'Rating': 8, 'Movie_Title': 'John Wick 1', 'last_accessed_at': datetime.datetime(2025, 3, 1, 9, 7, 31, 139811)}, page_content=": 0\nReview: The best way I can describe John Wick is to picture Taken but instead of Liam Neeson it's Keanu Reeves and instead of his daughter it's his dog. That's essentially the plot of the movie. John Wick (Reeves) is out to seek revenge on the people who took something he loved from him. It's a beautifully simple premise for an action movie - when action movies get convoluted, they get bad i.e. A Good Day to Die Hard. John Wick gives the viewers what they want: Awesome action, stylish stunts, kinetic chaos, and a relatable hero to tie it all together. John Wick succeeds in its simplicity.")

## Task 3: Setting up Qdrant!

Now that we have our documents, let's create a Qdrant VectorStore with the collection name "JohnWick".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [8]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    documents,
    embeddings,
    location=":memory:",
    collection_name="JohnWick"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply look at each review as a document, and use cosine-similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [9]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})
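
To make "use cosine-similarity to fetch the `k` most relevant documents" concrete, here is a minimal pure-Python sketch. The 3-dimensional vectors and the `toy_` names are made up for illustration; the real retriever compares 1536-dimensional OpenAI embeddings inside Qdrant, and this is not how Qdrant implements the search internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to the query vector."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy "embeddings" (hypothetical values, for illustration only).
toy_vectors = {
    "review_a": [0.9, 0.1, 0.0],
    "review_b": [0.1, 0.9, 0.0],
    "review_c": [0.7, 0.3, 0.1],
}

toy_top2 = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
print(toy_top2)
```

With `k=2`, the two vectors pointing closest to the query direction are returned, most similar first.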

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [10]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4o-mini` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [11]:
from langchain_openai import ChatOpenAI
# Changed to model with larger token limit to prevent errors when invoking some retriever chains
chat_model = ChatOpenAI(model="gpt-4o-mini")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [12]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the naive_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : re-assigned to itself with RunnablePassthrough.assign, so the retrieved documents
    #              (and the "question" key) pass through to the next step unchanged
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)
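
If the LCEL syntax is new to you, the data flow of the chain can be mimicked with plain Python functions. This is only a sketch of the shape of the inputs and outputs: `toy_retriever`, `toy_llm`, and `TOY_TEMPLATE` are hypothetical stand-ins, not LangChain objects, and no model is called.

```python
def toy_retriever(question):
    """Stand-in for a retriever: returns a fixed list of 'documents'."""
    return ["John Wick is a stylish action film."]

def toy_llm(prompt):
    """Stand-in for a chat model: reports the size of the prompt it received."""
    return f"(answer based on {len(prompt)} prompt characters)"

TOY_TEMPLATE = "Query:\n{question}\n\nContext:\n{context}\n"

def toy_rag_chain(inputs):
    # Step 1: build a dict holding the retrieved context and the original question.
    state = {
        "context": toy_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: assigning "context" to itself leaves the dict unchanged.
    state = {**state, "context": state["context"]}
    # Step 3: format the prompt, call the model, and keep the context
    # next to the response so it can be inspected or evaluated later.
    prompt = TOY_TEMPLATE.format(**state)
    return {"response": toy_llm(prompt), "context": state["context"]}

toy_output = toy_rag_chain({"question": "Did people generally like John Wick?"})
print(sorted(toy_output))
```

Returning the context alongside the response is what lets the evaluation code later in the notebook read both `response` and `context` from one chain call.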

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry-picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [13]:
# naive_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

In [14]:
# naive_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

In [15]:
# naive_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.

This retriever is very straightforward to set up! Let's see it happen down below!


In [16]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(documents)
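
To show what "comparing text based on the words they contain" looks like numerically, here is a small sketch of a BM25 score. It assumes whitespace tokenization, the commonly used parameters `k1=1.5` and `b=0.75`, and a Lucene-style smoothed IDF; the `rank_bm25` package behind `BM25Retriever` may differ in its exact IDF handling, so treat this as an illustration rather than a reimplementation.

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in `corpus` against `query` with a basic BM25."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates (k1) and is normalized by document length (b).
            denom = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

toy_corpus = [
    "john wick is a stylish action movie",
    "the dog was the best part",
    "a slow romantic drama",
]
toy_bm25 = bm25_scores("stylish action", toy_corpus)
print(toy_bm25)
```

Only the first toy document shares words with the query, so it is the only one with a non-zero score. This also hints at the weakness of sparse methods: a paraphrase with no shared words scores zero.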

We'll construct the same chain - only changing the retriever.

In [17]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [18]:
# bm25_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

In [19]:
# bm25_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

In [20]:
# bm25_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

It's not clear that this is better or worse - but the `I don't know` isn't great!

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [21]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-english-v3.0")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)
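
The two-stage idea (retrieve many, then keep fewer but better) can be sketched without any external service. Both scoring functions below are toy stand-ins chosen for illustration: the first counts query-word occurrences, the second uses Jaccard overlap. Cohere's Rerank model is a learned relevance model and works very differently; `fetch_k` and `top_n` are hypothetical parameter names.

```python
def occurrence_score(query, doc):
    """Cheap first-stage score: how often query words occur in the document."""
    doc_words = doc.split()
    return sum(doc_words.count(term) for term in query.split())

def jaccard_score(query, doc):
    """Toy 'reranker': overlap of the two word sets relative to their union."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)

def retrieve_then_rerank(query, candidates, fetch_k=10, top_n=3):
    # Stage 1: pull a wide candidate set with the cheap scorer.
    wide = sorted(candidates, key=lambda d: occurrence_score(query, d), reverse=True)[:fetch_k]
    # Stage 2: reorder the candidates with the stricter scorer and keep the best few.
    return sorted(wide, key=lambda d: jaccard_score(query, d), reverse=True)[:top_n]

toy_candidates = [
    "wick wick wick wick is a film",
    "john wick action",
    "a romance film",
]
toy_reranked = retrieve_then_rerank("john wick action", toy_candidates, fetch_k=2, top_n=1)
print(toy_reranked)
```

The keyword-stuffed document wins the first stage, but the second stage promotes the document that actually matches the whole query. This is also why `k` was set to `10` on the naive retriever: the reranker can only choose among what the first stage hands it.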

Let's create our chain again, and see how this does!

In [22]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [23]:
# contextual_compression_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

In [24]:
# contextual_compression_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

In [25]:
# contextual_compression_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context

So, how hard is it to set up? Not bad! Let's see it down below!



In [26]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)
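
The three steps above can be sketched in plain Python. Here `toy_generate_queries` stands in for the LLM that rewrites the question and `toy_retrieve` stands in for the underlying retriever; both are hypothetical helpers with hard-coded outputs, so only the "retrieve per query, then keep unique documents" logic is real.

```python
def toy_generate_queries(question):
    """Stand-in for the LLM step: returns fixed rewrites of the question."""
    return ["q1", "q2", "q3"]

TOY_INDEX = {
    "q1": ["doc_a", "doc_b"],
    "q2": ["doc_b", "doc_c"],
    "q3": ["doc_c", "doc_d"],
}

def toy_retrieve(query):
    """Stand-in for a retriever: looks up pre-baked results."""
    return TOY_INDEX.get(query, [])

def multi_query_retrieve(question, generate_queries, retrieve):
    """Retrieve for every generated query and return the unique documents."""
    seen = set()
    unique_docs = []
    for query in generate_queries(question):
        for doc in retrieve(query):
            if doc not in seen:  # drop documents already returned by another query
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs

toy_union = multi_query_retrieve("Did people like John Wick?", toy_generate_queries, toy_retrieve)
print(toy_union)
```

Note the cost implication: every generated query triggers its own retrieval, plus one extra LLM call to produce the rewrites.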

In [27]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [28]:
# multi_query_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

In [29]:
# multi_query_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

In [30]:
# multi_query_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

## Task 8: Parent Document Retriever

A "small-to-big" strategy - the Parent Document Retriever works based on a simple strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.
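
Here is a minimal sketch of "search small, return big". It assumes a word-based splitter and a shared-word count as the similarity score, both made up for illustration; the notebook itself uses `RecursiveCharacterTextSplitter` (which counts characters) and embedding similarity.

```python
def split_into_children(text, chunk_size):
    """Split a parent text into chunks of `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def build_child_index(parents, chunk_size):
    """Return (child_text, parent_id) pairs so each child remembers its parent."""
    return [
        (child, parent_id)
        for parent_id, text in parents.items()
        for child in split_into_children(text, chunk_size)
    ]

def shared_words(query, text):
    """Toy similarity: number of distinct words the two texts share."""
    return len(set(query.split()) & set(text.split()))

def parent_retrieve(query, child_index, parents, k):
    """Rank child chunks, then return their (deduplicated) parent documents."""
    best_children = sorted(child_index, key=lambda pair: shared_words(query, pair[0]), reverse=True)[:k]
    parent_ids = []
    for _, parent_id in best_children:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[parent_id] for parent_id in parent_ids]

toy_parents = {
    "p1": "the dog dies early and john wick seeks revenge on the mob",
    "p2": "the hotel has strict rules and gold coins",
}
toy_child_index = build_child_index(toy_parents, chunk_size=4)
toy_parent_hits = parent_retrieve("gold coins", toy_child_index, toy_parents, k=1)
print(toy_parent_hits)
```

The match happens on the small chunk "rules and gold coins", but the full parent text is what gets returned as context.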

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [31]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

We'll need to set up a new Qdrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [32]:
client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = Qdrant(
    collection_name="full_documents", embeddings=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)


Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [33]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [34]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [35]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [36]:
# parent_document_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

In [37]:
# parent_document_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

In [38]:
# parent_document_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to answer the performance question more rigorously.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever simply takes two or more retrievers and combines their retrieved documents based on a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [39]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.
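
For intuition, here is a small sketch of weighted Reciprocal Rank Fusion. Each document earns `weight / (rank + c)` from every ranked list it appears in, and the summed scores decide the final order. The constant `c=60` comes from the RRF paper linked above; we assume, without having checked the source here, that `EnsembleRetriever` uses the same default. The document ids are made up.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists into one using weighted Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(docs, start=1):
            # Higher ranks (smaller numbers) contribute more to the fused score.
            scores[doc] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)

toy_rankings = [
    ["doc_a", "doc_b", "doc_c"],  # e.g. ranking from a sparse retriever
    ["doc_b", "doc_c", "doc_a"],  # e.g. ranking from a dense retriever
]
toy_fused = weighted_rrf(toy_rankings, weights=[0.5, 0.5])
print(toy_fused)
```

`doc_b` ends up first because it is ranked consistently high in both lists, even though it only tops one of them. Because only ranks are used, retrievers with incomparable score scales (BM25 scores vs. cosine similarities) can be combined directly.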

In [40]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [41]:
# ensemble_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

In [42]:
# ensemble_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

In [43]:
# ensemble_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of a number of [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

> NOTE: You do not need to run this cell if you're running this locally

In [44]:
#!pip install -qU langchain_experimental

We'll use the `percentile` thresholding method for this example, which will calculate all distances between adjacent sentences and then split the sequence wherever a distance exceeds a given percentile of all distances.

In [45]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)
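
The percentile rule can be sketched in a few lines. This is a simplified illustration, not `SemanticChunker`'s implementation: it assumes one toy 2-dimensional vector per sentence (made-up values), skips any sentence-grouping the library may do before embedding, and uses a 95th-percentile threshold, which we believe matches the library default.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

def percentile(values, pct):
    """Percentile with linear interpolation between neighbouring values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def semantic_chunks(sentences, vectors, pct=95):
    """Split wherever the distance between adjacent sentences exceeds the percentile."""
    if len(sentences) < 2:
        return [" ".join(sentences)]
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # large semantic jump: start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

toy_sentences = [
    "The action is superb.",
    "The stunts are stylish.",
    "The plot is thin.",
    "The story is simple.",
]
toy_sentence_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_chunks = semantic_chunks(toy_sentences, toy_sentence_vectors)
print(toy_chunks)
```

The only large jump is between the second and third sentences (action talk vs. plot talk), so the text is split into two chunks there.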

Now we can split our documents.

In [46]:
semantic_documents = semantic_chunker.split_documents(documents)

Let's create a new vector store.

In [47]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="JohnWickSemantic"
)

We'll use naive retrieval for this example.

In [48]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [49]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [50]:
# semantic_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

In [51]:
# semantic_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

In [52]:
# semantic_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against each other.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparison between them
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.

### Golden Dataset using SDG

In [53]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset import TestsetGenerator

generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangchainEmbeddingsWrapper(embeddings)

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
golden_dataset = generator.generate_with_langchain_docs(documents, testset_size=10)

In [54]:
golden_dataset.to_pandas()

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | What makes John Wick a standout action movie a... | `[: 0\nReview: The best way I can describe John...` | John Wick is described as a standout action mo... | single_hop_specifc_query_synthesizer |
| 1 | Wut makes Jon Wick so popular among action mov... | `[: 2\nReview: With the fourth installment scor...` | The fourth installment of John Wick scored imm... | single_hop_specifc_query_synthesizer |
| 2 | What makes Chad Stahelski's direction in John ... | `[: 3\nReview: John wick has a very simple reve...` | Chad Stahelski's direction in John Wick stands... | single_hop_specifc_query_synthesizer |
| 3 | What role do Russian mobsters play in the movi... | `[: 4\nReview: Though he no longer has a taste ...` | In the movie John Wick, Russian mobsters are r... | single_hop_specifc_query_synthesizer |
| 4 | How does the Russian mob prince contribute to ... | `[: 5\nReview: Ultra-violent first entry with l...` | In John Wick (2014), the Russian mob prince pl... | single_hop_specifc_query_synthesizer |
| 5 | So like, how's Keanu Reeves doin' in that movi... | `[<1-hop>\n\n: 24\nReview: Predictable, juvenil...` | In the movie, Keanu Reeves mumbles his way thr... | multi_hop_specific_query_synthesizer |
| 6 | How does John Wick's battle against the Russia... | `[<1-hop>\n\n: 20\nReview: John Wick is somethi...` | John Wick's battle against the Russian Mafia i... | multi_hop_specific_query_synthesizer |
| 7 | How does 'John Wick: Chapter 3 - Parabellum' e... | `[<1-hop>\n\n: 24\nReview: John Wick: Chapter 3...` | 'John Wick: Chapter 3 - Parabellum' explores t... | multi_hop_specific_query_synthesizer |
| 8 | How do the reviews of 'John Wick: Chapter 4' d... | `[<1-hop>\n\n: 24\nReview: John Wick: Chapter 4...` | The reviews of 'John Wick: Chapter 4' differ s... | multi_hop_specific_query_synthesizer |
| 9 | In what ways does the film 'John Wick' stand o... | `[<1-hop>\n\n: 3\nReview: John wick has a very ...` | The film 'John Wick' stands out in the action ... | multi_hop_specific_query_synthesizer |


### RAGAS Evaluation

In [55]:
from ragas import EvaluationDataset, RunConfig, evaluate
from ragas.metrics import context_precision, context_recall, context_entity_recall, answer_relevancy, faithfulness
import pandas as pd
import json

retriever_specific_metrics = [context_precision, context_recall, context_entity_recall]
additional_metrics = [faithfulness, answer_relevancy]
all_metrics = retriever_specific_metrics + additional_metrics

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
custom_run_config = RunConfig(timeout=360)

def run_ragas_evaluation(retriever_chain, dataset):
    for test_row in dataset:
        response = retriever_chain.invoke({"question" : test_row.eval_sample.user_input})
        test_row.eval_sample.response = response["response"].content
        test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]] 

    evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

    result = evaluate(
        dataset=evaluation_dataset,
        metrics=all_metrics,
        llm=evaluator_llm,
        run_config=custom_run_config
    )

    return result

def add_results_to_df(results_df, name, result):
    df = pd.DataFrame([{"retriver_name": name}])
    df = pd.concat([df, pd.DataFrame(json.loads(f"{result}".replace("'", '"')), index=[0])], axis=1)
    if results_df is None:
        results_df = df
    else:
        results_df = pd.concat([results_df, df], ignore_index=True)
    return results_df
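
For intuition about what a metric like `context_precision` rewards, here is a sketch of a rank-weighted precision: the mean of precision@k taken at every rank where a relevant chunk appears. This follows our understanding of the metric's description in the Ragas documentation and is an assumption, not Ragas code; in Ragas the per-chunk relevance verdicts come from an LLM judge, whereas here they are hand-written 0/1 labels.

```python
def rank_weighted_precision(relevance):
    """
    relevance: list of 0/1 labels for the retrieved chunks, in rank order.
    Returns the mean of precision@k over the ranks k that hold a relevant chunk.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    hits = 0
    weighted_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        hits += is_relevant
        if is_relevant:
            weighted_sum += hits / k  # precision@k, counted only at relevant ranks
    return weighted_sum / total_relevant

toy_early = rank_weighted_precision([1, 1, 0])   # relevant chunks ranked first
toy_mixed = rank_weighted_precision([1, 0, 1])   # one relevant chunk pushed down
toy_late = rank_weighted_precision([0, 0, 1])    # relevant chunk ranked last
print(toy_early, toy_mixed, toy_late)
```

The same number of relevant chunks scores lower when they appear further down the list, which is why a reranker can raise this kind of metric without retrieving anything new.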

In [56]:
retriever_chains = {
    "Naive": naive_retrieval_chain,
    "BM25": bm25_retrieval_chain,
    "ContextualCompression": contextual_compression_retrieval_chain,
    "ParentDocument": parent_document_retrieval_chain,
    "MultiQuery": multi_query_retrieval_chain,
    "Ensemble": ensemble_retrieval_chain,
    # "Semantic Chunking": semantic_retrieval_chain
}

results_df = None
for name, retriever_chain in retriever_chains.items():
    print(f"Evaluating {name}")
    result = run_ragas_evaluation(retriever_chain.with_config({"run_name": name}), golden_dataset)
    print(result)
    results_df = add_results_to_df(results_df, name, result)
    print("\n" + "="*50 + "\n")

Evaluating Naive
{'context_precision': 0.5649, 'context_recall': 0.8000, 'context_entity_recall': 0.7117, 'faithfulness': 0.8182, 'answer_relevancy': 0.9479}

Evaluating BM25
{'context_precision': 0.4583, 'context_recall': 0.6500, 'context_entity_recall': 0.5733, 'faithfulness': 0.7171, 'answer_relevancy': 0.8544}

Evaluating ContextualCompression
{'context_precision': 0.6833, 'context_recall': 0.7667, 'context_entity_recall': 0.6600, 'faithfulness': 0.7595, 'answer_relevancy': 0.9571}

Evaluating ParentDocument
{'context_precision': 0.6083, 'context_recall': 0.7167, 'context_entity_recall': 0.5733, 'faithfulness': 0.8395, 'answer_relevancy': 0.9600}

Evaluating MultiQuery
{'context_precision': 0.5970, 'context_recall': 0.8667, 'context_entity_recall': 0.6783, 'faithfulness': 0.8501, 'answer_relevancy': 0.9396}

Evaluating Ensemble
{'context_precision': 0.5306, 'context_recall': 0.8667, 'context_entity_recall': 0.6867, 'faithfulness': 0.8470, 'answer_relevancy': 0.9531}


In [57]:
results_df = results_df.set_index('retriver_name')
results_df

| retriver_name | context_precision | context_recall | context_entity_recall | faithfulness | answer_relevancy |
|---|---|---|---|---|---|
| Naive | 0.5649 | 0.8000 | 0.7117 | 0.8182 | 0.9479 |
| BM25 | 0.4583 | 0.6500 | 0.5733 | 0.7171 | 0.8544 |
| ContextualCompression | 0.6833 | 0.7667 | 0.6600 | 0.7595 | 0.9571 |
| ParentDocument | 0.6083 | 0.7167 | 0.5733 | 0.8395 | 0.9600 |
| MultiQuery | 0.5970 | 0.8667 | 0.6783 | 0.8501 | 0.9396 |
| Ensemble | 0.5306 | 0.8667 | 0.6867 | 0.8470 | 0.9531 |


### LangSmith Evaluation

In [58]:
from langsmith import Client

client = Client()

dataset_name = "John Wick Reviews"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="John Wick Reviews"
)

In [59]:
# Upload each golden-dataset row to LangSmith as one example.
for _, row in golden_dataset.to_pandas().iterrows():
    client.create_example(
        inputs={"question": row["user_input"]},
        outputs={"answer": row["reference"]},
        metadata={"context": row["reference_contexts"]},
        dataset_id=langsmith_dataset.id,
    )

In [67]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

eval_llm = ChatOpenAI(model="gpt-4o")
qa_evaluator = LangChainStringEvaluator(
    "qa", 
    config={"llm" : eval_llm},
    prepare_data=lambda run, example: {
        "prediction": run.outputs["response"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)

def run_langsmith_evaluation(retriever_name, retriever_chain):
    evaluate(
        retriever_chain.invoke,
        data=dataset_name,
        evaluators=[qa_evaluator],
        metadata={"revision_id": retriever_name}
    )
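
To make the `prepare_data` mapping above concrete, here is a minimal sketch that runs the same lambda logic on stand-in objects. The `SimpleNamespace` objects and the placeholder strings are assumptions for illustration only; in the real evaluation LangSmith supplies the `run` and `example` objects, and we assume (as the lambda implies) that each chain returns a dict with a `response` key.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the objects LangSmith passes to `prepare_data`.
# The strings are placeholders, not real chain output or dataset rows.
run = SimpleNamespace(outputs={"response": "placeholder chain response"})
example = SimpleNamespace(
    inputs={"question": "placeholder question"},
    outputs={"answer": "placeholder reference answer"},
)


def prepare_data(run, example):
    """Map a run and its dataset example onto the fields the "qa" evaluator reads."""
    return {
        "prediction": run.outputs["response"],   # what the chain produced
        "reference": example.outputs["answer"],  # the golden answer
        "input": example.inputs["question"],     # the original question
    }


fields = prepare_data(run, example)
print(fields)
```

If a chain returned its answer under a different key, only the `"prediction"` line would need to change.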

In [69]:
for name, retriever_chain in retriever_chains.items():
    print(f"Evaluating {name}")
    run_langsmith_evaluation(name, retriever_chain)
    print("\n" + "="*50 + "\n")

Evaluating Naive
View the evaluation results for experiment: 'advanced-reaction-8' at:
https://smith.langchain.com/o/df6cd833-569f-46ec-9ae9-90000a6e38c6/datasets/f82d93ec-f21b-4649-8d7e-aa454ad3d78c/compare?selectedSessions=f7e5bf89-6d9b-40b9-b96b-2c948f8d2554







Evaluating BM25
View the evaluation results for experiment: 'healthy-whistle-63' at:
https://smith.langchain.com/o/df6cd833-569f-46ec-9ae9-90000a6e38c6/datasets/f82d93ec-f21b-4649-8d7e-aa454ad3d78c/compare?selectedSessions=48bb66a7-17e1-4853-8894-b370182c293d







Evaluating ContextualCompression
View the evaluation results for experiment: 'monthly-chart-71' at:
https://smith.langchain.com/o/df6cd833-569f-46ec-9ae9-90000a6e38c6/datasets/f82d93ec-f21b-4649-8d7e-aa454ad3d78c/compare?selectedSessions=a0108ed6-8acd-4663-b06c-cefc92655abd







Evaluating ParentDocument
View the evaluation results for experiment: 'helpful-surprise-55' at:
https://smith.langchain.com/o/df6cd833-569f-46ec-9ae9-90000a6e38c6/datasets/f82d93ec-f21b-4649-8d7e-aa454ad3d78c/compare?selectedSessions=6b8f4bbd-f13a-47f6-af8a-ffcd9842a71d







Evaluating MultiQuery
View the evaluation results for experiment: 'pertinent-leather-41' at:
https://smith.langchain.com/o/df6cd833-569f-46ec-9ae9-90000a6e38c6/datasets/f82d93ec-f21b-4649-8d7e-aa454ad3d78c/compare?selectedSessions=2b94d175-68d7-4270-a873-dc344bae48d4







Evaluating Ensemble
View the evaluation results for experiment: 'sparkling-speed-65' at:
https://smith.langchain.com/o/df6cd833-569f-46ec-9ae9-90000a6e38c6/datasets/f82d93ec-f21b-4649-8d7e-aa454ad3d78c/compare?selectedSessions=a8d0c069-ca4e-4c62-9a71-3c91452be9c2









### LangSmith Evaluation Results

![LangSmith Evaluation Results](langsmith_experiments.jpeg)

### Add Latency and Cost from LangSmith

In [71]:
retriever_stats = {
    "Naive": {"latency": 3.98, "cost": 0.00668085},
    "BM25": {"latency": 3.05, "cost": 0.0028041},
    "ContextualCompression": {"latency": 3.51, "cost": 0.0030783},
    "ParentDocument": {"latency": 3.12, "cost": 0.00203745},
    "MultiQuery": {"latency": 7.50, "cost": 0.0091617},
    "Ensemble": {"latency": 11.06, "cost": 0.00995205}
}

for name, stats in retriever_stats.items():
    results_df.loc[results_df.index == name, 'latency'] = stats['latency']
    results_df.loc[results_df.index == name, 'cost'] = stats['cost']

results_df

| retriver_name | context_precision | context_recall | context_entity_recall | faithfulness | answer_relevancy | latency | cost |
|---|---|---|---|---|---|---|---|
| Naive | 0.5649 | 0.8000 | 0.7117 | 0.8182 | 0.9479 | 3.98 | 0.006681 |
| BM25 | 0.4583 | 0.6500 | 0.5733 | 0.7171 | 0.8544 | 3.05 | 0.002804 |
| ContextualCompression | 0.6833 | 0.7667 | 0.6600 | 0.7595 | 0.9571 | 3.51 | 0.003078 |
| ParentDocument | 0.6083 | 0.7167 | 0.5733 | 0.8395 | 0.9600 | 3.12 | 0.002037 |
| MultiQuery | 0.5970 | 0.8667 | 0.6783 | 0.8501 | 0.9396 | 7.50 | 0.009162 |
| Ensemble | 0.5306 | 0.8667 | 0.6867 | 0.8470 | 0.9531 | 11.06 | 0.009952 |


# Best Retriever Analysis
Based on the *retriever-specific* evaluation metrics (context precision, context recall, context entity recall) and operational considerations (latency, cost), the optimal retriever choice ***depends on specific use case requirements***:

- Where *precision is paramount*, the **Contextual Compression** retriever stands out with the highest context precision (0.6833) at low latency (3.51) and cost (0.003078). The trade-off is lower context recall (0.7667) than the Naive, Multi-Query, and Ensemble retrievers.
- For use cases that *prioritize recall*, such as comprehensive question answering, the **Multi-Query** and **Ensemble** retrievers tie for the highest context recall (0.8667). Multi-Query gets there with higher context precision (0.5970 vs. 0.5306), lower latency (7.50 vs. 11.06), and slightly lower cost (0.009162 vs. 0.009952). Ensemble has the highest latency and cost of all six retrievers and only a marginally higher context entity recall (0.6867 vs. 0.6783), so on these numbers Multi-Query is the better choice of the two.
- If *budget constraints* are the primary concern, the **Parent Document** retriever offers the lowest cost (0.002037), the second-lowest latency (3.12), and the second-highest context precision (0.6083). Its context recall (0.7167) and context entity recall (0.5733) are among the lowest.
- If *latency* matters most, **BM25** is the fastest (3.05) and second-cheapest (0.002804) option. It also scores lowest on context precision (0.4583) and context recall (0.6500) and ties for the lowest context entity recall (0.5733), so on this dataset it fits only when speed outweighs retrieval quality.
- As a *baseline*, the **Naive** retriever has the highest context entity recall (0.7117) and solid context recall (0.8000) at moderate latency (3.98), though its cost (0.006681) is more than double that of any of the three cheapest retrievers.
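
As a quick sanity check on the comparison above, the following sketch finds which retriever leads each metric, including ties. It is self-contained: the numbers are copied by hand from the results table and `retriever_stats`, and the `leaders` helper is a hypothetical name introduced here, not part of the notebook's pipeline.

```python
# Retrieval metrics copied from the results table (higher is better).
retrieval_metrics = {
    "Naive": {"context_precision": 0.5649, "context_recall": 0.8000, "context_entity_recall": 0.7117},
    "BM25": {"context_precision": 0.4583, "context_recall": 0.6500, "context_entity_recall": 0.5733},
    "ContextualCompression": {"context_precision": 0.6833, "context_recall": 0.7667, "context_entity_recall": 0.6600},
    "ParentDocument": {"context_precision": 0.6083, "context_recall": 0.7167, "context_entity_recall": 0.5733},
    "MultiQuery": {"context_precision": 0.5970, "context_recall": 0.8667, "context_entity_recall": 0.6783},
    "Ensemble": {"context_precision": 0.5306, "context_recall": 0.8667, "context_entity_recall": 0.6867},
}

# Latency and cost copied from `retriever_stats` (lower is better).
operational = {
    "Naive": {"latency": 3.98, "cost": 0.00668085},
    "BM25": {"latency": 3.05, "cost": 0.0028041},
    "ContextualCompression": {"latency": 3.51, "cost": 0.0030783},
    "ParentDocument": {"latency": 3.12, "cost": 0.00203745},
    "MultiQuery": {"latency": 7.50, "cost": 0.0091617},
    "Ensemble": {"latency": 11.06, "cost": 0.00995205},
}


def leaders(table, metric, higher_is_better=True):
    """Return every retriever tied for the best value of `metric`."""
    pick = max if higher_is_better else min
    best = pick(row[metric] for row in table.values())
    return [name for name, row in table.items() if row[metric] == best]


for metric in ("context_precision", "context_recall", "context_entity_recall"):
    print(f"{metric}: {leaders(retrieval_metrics, metric)}")

for metric in ("latency", "cost"):
    print(f"{metric}: {leaders(operational, metric, higher_is_better=False)}")
```

Returning a list rather than a single name keeps ties visible, which matters here because two retrievers share the top context recall score.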