# [RAGAS] Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

In [4]:
import importlib

import core_functions

importlib.reload(core_functions)  # reload the module imported above

from core_functions import load_and_prepare_data, get_vector_store, get_naive_retriever, get_rag_prompt, get_chat_model

ModuleNotFoundError: No module named 'langchain_cohere'

## Task 2: Data Collection and Preparation

We'll be using our Loan Data once again - this time the structured data available through the CSV!

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

Let's look at an example document to see if everything worked as expected!

In [None]:
%%time
load_and_prepare_data()

## Task 3: Setting up Qdrant!

Now that we have our documents, let's create a Qdrant VectorStore with the collection name "LoanComplaints".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [9]:
vectorstore = get_vector_store()

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever simply treats each complaint as a document and uses cosine similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later.

In [10]:
naive_retriever = get_naive_retriever(vectorstore)
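
As a rough illustration of what cosine-similarity retrieval with a fixed `k` amounts to, here is a minimal pure-Python sketch. It is not the code inside `get_naive_retriever` (which this notebook doesn't show); the function names and the toy 2-d vectors are hypothetical, and a real vector store would use an approximate index rather than a full sort.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=10):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-d "embeddings" (hypothetical values, for illustration only).
toy_docs = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
nearest = top_k([1.0, 0.1], toy_docs, k=2)
```

With these toy vectors the query points almost along the first axis, so `a` ranks first and `b` second.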

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [11]:
rag_prompt = get_rag_prompt()

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [12]:
from langchain_openai import ChatOpenAI

chat_model = get_chat_model("gpt-4.1-nano")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [13]:
naive_retrieval_chain = get_naive_retrieval_chain(naive_retriever, rag_prompt, chat_model)

CPU times: user 20.4 ms, sys: 3.48 ms, total: 23.9 ms
Wall time: 27.5 ms


In [17]:
gc.collect()

48

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model), which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.

This retriever is very straightforward to set up! Let's see it happen down below!


In [18]:
bm25_retriever = get_bm25_retriever(filtered_loan_dataset)
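
To make the bag-of-words idea concrete, here is a small sketch of the textbook Okapi BM25 scoring formula in plain Python. It is an assumption-laden illustration, not the implementation behind `get_bm25_retriever`: the whitespace tokenizer, the `k1` and `b` values, and the IDF variant are choices made for this example, and libraries differ in those details.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in corpus against query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a larger weight than common ones.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturates, and long documents are penalised.
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


toy_corpus = [
    "loan servicer charged a late fee",
    "credit report shows wrong balance",
    "late fee was charged twice on my loan",
]
scores = bm25_scores("late fee", toy_corpus)
```

The second document shares no words with the query and scores zero; the first and third both match, with the shorter one scoring higher because of length normalisation.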

We'll construct the same chain - only changing the retriever.

In [19]:
bm25_retrieval_chain = get_bm25_retriever_chain(bm25_retriever, rag_prompt, chat_model)

In [21]:
gc.collect()

0

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [22]:
contextual_compression_retriever = get_contextual_compression_retriever(naive_retriever)
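
The retrieve-then-compress pattern can be sketched without any external service. In the sketch below, `compress_with_reranker` and `word_overlap` are hypothetical names; a real reranker such as Cohere's scores each (query, document) pair with a learned model, whereas the word-overlap scorer here is only a stand-in so the example runs on its own.

```python
def compress_with_reranker(query, candidates, score_fn, top_n=3):
    """Re-score the retrieved candidates and keep only the top_n."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_n]


def word_overlap(query, doc):
    """Hypothetical stand-in scorer: number of lowercase words shared."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


# Pretend these came back from the base retriever (k candidates).
candidates = [
    "payment was applied to the wrong loan",
    "customer service was rude",
    "my loan payment was never applied",
    "interest rate changed without notice",
]
kept = compress_with_reranker("loan payment applied", candidates, word_overlap, top_n=2)
```

The base retriever casts a wide net, and the reranker narrows it: four candidates go in, and the two that actually mention the query terms come out.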

Let's create our chain again, and see how this does!

In [23]:
contextual_compression_retrieval_chain = get_compression_retriever_chain(
    contextual_compression_retriever, rag_prompt, chat_model
)

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

In [25]:
gc.collect()

40

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` new queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context.

So, how hard is it to set up? Not bad! Let's see it down below!



In [26]:
multi_query_retriever = get_multi_query_retriever(naive_retriever)
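
The three steps of multi-query retrieval can be sketched as follows. Everything here is hypothetical scaffolding: `fake_generate_queries` stands in for the LLM call that rewrites the question, and `fake_retrieve` stands in for the underlying retriever, so the example can run without any API keys.

```python
def multi_query_retrieve(question, generate_queries, retrieve):
    """Retrieve for each generated query and return the unique union of results."""
    seen = set()
    unique_docs = []
    for query in generate_queries(question):
        for doc in retrieve(query):
            if doc not in seen:  # keep first-seen order, drop duplicates
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs


def fake_generate_queries(question):
    """Hypothetical stub for the LLM that produces query variants."""
    return ["problems with loan payments", "loan billing errors"]


FAKE_INDEX = {
    "problems with loan payments": ["doc-1", "doc-2"],
    "loan billing errors": ["doc-2", "doc-3"],
}


def fake_retrieve(query):
    """Hypothetical stub for the base retriever."""
    return FAKE_INDEX.get(query, [])


docs = multi_query_retrieve("What goes wrong with loans?", fake_generate_queries, fake_retrieve)
```

`doc-2` is returned by both query variants but appears only once in the final context.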

In [27]:
multi_query_retrieval_chain = get_multi_query_retrieval_chain(multi_query_retriever, rag_prompt, chat_model)

In [29]:
gc.collect()

0

## Task 8: Parent Document Retriever

The Parent Document Retriever follows a "small-to-big" strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.
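
The "search small, return big" idea can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the retriever used in this notebook: the word-window splitter, the `word_overlap` scorer (a stand-in for embedding similarity), and all names below are hypothetical.

```python
def word_overlap(query, text):
    """Hypothetical stand-in for embedding similarity: number of shared words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def split_into_children(parents, words_per_chunk=5):
    """Split every parent into small word-window chunks tagged with the parent id."""
    children = []
    for parent_id, text in parents.items():
        words = text.split()
        for start in range(0, len(words), words_per_chunk):
            chunk = " ".join(words[start:start + words_per_chunk])
            children.append((parent_id, chunk))
    return children


def retrieve_parents(query, children, parents, k=1):
    """Search the small chunks, but return the full parent documents."""
    ranked = sorted(children, key=lambda child: word_overlap(query, child[1]), reverse=True)
    parent_ids = []
    for parent_id, _chunk in ranked[:k]:
        if parent_id not in parent_ids:  # several chunks may share one parent
            parent_ids.append(parent_id)
    return [parents[parent_id] for parent_id in parent_ids]


# The dict plays the role of the in-memory parent store.
parent_store = {
    "p1": "the servicer lost my paperwork and then reported me late to the credit bureau",
    "p2": "an origination fee was charged that nobody disclosed when I signed the loan",
}
child_index = split_into_children(parent_store)
results = retrieve_parents("credit bureau", child_index, parent_store, k=1)
```

The match is found in a four-word child chunk, but the caller receives the whole parent, including the surrounding context about lost paperwork.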

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [30]:
parent_document_retrieval_chain = get_parent_document_retrieval_chain(parent_document_retriever, rag_prompt, chat_model)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

Let's give it a whirl!

In [36]:
gc.collect()

201

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever takes two or more retrievers and combines their retrieved documents using a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [37]:
ensemble_retriever = get_ensemble_retriever()
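
For intuition, weighted Reciprocal Rank Fusion can be sketched in a few lines. This is a minimal illustration, assuming equal weights and the constant `c = 60` suggested in the RRF paper; the weights and retrievers actually passed inside `get_ensemble_retriever` are not shown in this notebook, and the document ids below are made up.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists: each hit adds weight / (rank + c) to its document."""
    scores = defaultdict(float)
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(docs, start=1):
            scores[doc] += weight / (rank + c)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical result lists from two retrievers, best hit first.
dense_hits = ["doc-a", "doc-b", "doc-c"]
sparse_hits = ["doc-b", "doc-d", "doc-a"]
fused = weighted_rrf([dense_hits, sparse_hits], weights=[0.5, 0.5])
```

Only ranks matter, not raw scores, which is why the method can combine retrievers whose scores live on different scales. Here `doc-b` wins because it is ranked highly by both lists.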

We'll pack *all* of these retrievers together in an ensemble.

In [38]:
ensemble_retrieval_chain = get_ensemble_retrieval_chain(
    ensemble_retriever, rag_prompt, chat_model
)

Let's look at our results!

In [40]:
gc.collect()

0

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of several [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
   - `percentile`
   - `standard_deviation`
   - `interquartile`
   - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

The `breakpoint_threshold_type` parameter controls when the semantic chunker creates chunk boundaries based on embedding similarity between sentences:

**Four Threshold Types:**

1. _"percentile" (default)_
   - Splits when the distance between sentence embeddings exceeds the 95th percentile of all distances
   - Effect: Creates chunks at the most semantically distinct boundaries
   - Behavior: More conservative splitting, larger chunks

2. _"standard_deviation"_
   - Splits when the distance is more than 3 standard deviations above the mean
   - Effect: More predictable performance, especially for normally distributed content
   - Behavior: More consistent chunk sizes

3. _"interquartile"_
   - Uses a 1.5 × IQR scaling factor to determine breakpoints
   - Effect: Middle-ground approach, robust to outliers
   - Behavior: Balanced chunk distribution

4. _"gradient"_
   - Detects anomalies in embedding distance gradients
   - Effect: Best for domain-specific/highly correlated content
   - Behavior: Finds subtle semantic transitions

**Impact:** _The threshold type determines sensitivity to semantic changes - more sensitive types create smaller, more focused chunks while less sensitive types create larger, more comprehensive chunks._

We'll use the `percentile` thresholding method for this example. It calculates the distance between each pair of adjacent sentences, then splits the sequence of sentences wherever that distance exceeds a given percentile of all distances.
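
The percentile rule can be sketched in plain Python. This is a simplified illustration, not the library's implementation: it assumes the sentence embeddings are already computed, uses a hand-rolled linear-interpolation percentile, and the toy sentences and 2-d vectors are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * pct / 100.0
    lower, upper = math.floor(position), math.ceil(position)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def semantic_chunks(sentences, embeddings, pct=95):
    """Start a new chunk wherever the distance to the previous sentence exceeds the threshold."""
    if len(sentences) < 2:
        return [" ".join(sentences)]
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(sentences) - 1)
    ]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # semantic break before this sentence
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


toy_sentences = [
    "Late fees were added.",
    "More late fees followed.",
    "The app crashed.",
    "The website was also down.",
]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
chunks = semantic_chunks(toy_sentences, toy_embeddings)
```

The only large jump in distance is between the second and third sentences, so the text is split into a "fees" chunk and an "outage" chunk.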

In [41]:
semantic_retriever = get_semantic_retriever(loan_complaint_data, vectorstore)

In [45]:
semantic_retrieval_chain = get_semantic_retrieval_chain(semantic_retriever, rag_prompt, chat_model)

In [59]:
golden_master = generate_golden_master()

In [66]:
create_examples_on_langsmith()

0

## Ragas Evaluation

In [None]:
create_evaluation_dataset_after_applying_retrieval_chains()

In [None]:
create_pipeline_folder()

In [None]:
run_ragas_evaluations()

## Evaluation and Performance Analysis

Now that we have evaluation data from LangSmith, let's analyze the performance of different retrievers across multiple dimensions: **Performance**, **Cost**, and **Latency**.

In [185]:
from tqdm.notebook import tqdm

In [187]:
raw_stats_df = gather_and_save_raw_stats()


In [217]:
raw_stats_df

| retriever | Total_Runs | Total_Cost | Total_Input_Tokens | Total_Output_Tokens | Total_Latency_Sec | Avg_Cost_Per_Run | Avg_Input_Tokens_Per_Run | Avg_Output_Tokens_Per_Run | Avg_Latency_Sec | context_recall | llm_context_precision_without_reference | llm_context_precision_with_reference | non_llm_context_precision_with_reference | context_entity_recall | noise_sensitivity_relevant | faithfulness | faithful_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| naive_retrieval_chain | 10 | 0.243176 | 659319 | 240463 | 0 | 0.024318 | 65931.9 | 24046.3 | 0 | 0.794524 | 0.918197 | 0.777317 | 0.392897 | 0.320769 | 0.223647 | 0.80804 | 1.0 |
| bm25_retrieval_chain | 10 | 0.150856 | 392780 | 153232 | 0 | 0.015086 | 39278.0 | 15323.2 | 0 | 0.822857 | 0.913889 | 0.683333 | 0.413889 | 0.434487 | 0.37931 | 0.870687 | 1.0 |
| contextual_compression_retrieval_chain | 10 | 0.111962 | 280371 | 116510 | 0 | 0.011196 | 28037.1 | 11651.0 | 0 | 0.639524 | 0.983333 | 0.733333 | 0.508333 | 0.449487 | 0.406235 | 0.782676 | 1.0 |
| multi_query_retrieval_chain | 10 | 0.254773 | 805097 | 223347 | 0 | 0.025477 | 80509.7 | 22334.7 | 0 | 0.844524 | 0.908392 | 0.850487 | 0.359511 | 0.424487 | 0.454545 | 0.896755 | 1.0 |
| parent_document_retrieval_chain | 10 | 0.145341 | 365018 | 150980 | 0 | 0.014534 | 36501.8 | 15098.0 | 0 | 0.80619 | 0.933333 | 0.858333 | 0.255556 | 0.445128 | 0.317317 | 0.883523 | 1.0 |
| ensemble_retrieval_chain | 10 | 0.283593 | 1023190 | 216857 | 0 | 0.028359 | 102319.0 | 21685.7 | 0 | 0.851429 | 0.88989 | 0.767724 | 0.39311 | 0.474359 | 0.0 | 0.891511 | 1.0 |


In [218]:
import importlib

import ragas_rank_retrievers
importlib.reload(ragas_rank_retrievers)
from ragas_rank_retrievers import RetrieverRanker

ranker = RetrieverRanker('ragas_retriever_raw_stats.csv')

## Final outcome of the Ragas Evaluators

In [219]:
# ranker.print_available_metrics()

In [220]:
ranker.get_recommendations_table()

| Category | Retriever | Key Metric | Description |
|---|---|---|---|
| Overall Winner | Parent Document | Score: 0.729 | Best balanced performance |
| Budget Option | Contextual Compression | Cost: $0.0112 | Lowest cost per run |
| Quality Leader | Multi Query | Quality: 0.864 | Highest average across 3 quality metrics |
| Production Ready | Parent Document | Score: 0.540 | Meets minimum thresholds |
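
The recommendations output does not say which three metrics make up the "Quality" figure. One combination that reproduces the reported 0.864 for Multi Query is the plain mean of `context_recall`, `faithfulness`, and `llm_context_precision_with_reference`. The sketch below checks that arithmetic using values copied from the metrics tables in this notebook; treating these three as the quality metrics is an assumption, not something confirmed by `RetrieverRanker`.

```python
from statistics import mean

# (context_recall, faithfulness, llm_context_precision_with_reference)
# Values copied from the metrics comparison output in this notebook.
metrics = {
    "Naive": (0.7945, 0.808, 0.7773),
    "Bm25": (0.8229, 0.8707, 0.6833),
    "Contextual Compression": (0.6395, 0.7827, 0.7333),
    "Multi Query": (0.8445, 0.8968, 0.8505),
    "Parent Document": (0.8062, 0.8835, 0.8583),
    "Ensemble": (0.8514, 0.8915, 0.7677),
}

# Assumed definition of "quality": unweighted mean of the three metrics.
quality = {name: mean(values) for name, values in metrics.items()}
leader = max(quality, key=quality.get)
leader_score = round(quality[leader], 3)
```

Under that assumption Multi Query comes out on top with a mean of about 0.864, matching the "Quality Leader" row.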


In [221]:
ranker.get_rankings_table('weighted')

| rank | retriever_chain | score | context_recall | faithfulness | llm_context_precision_with_reference | llm_context_precision_without_reference | faithful_rate | context_entity_recall | Avg_Cost_Per_Run |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Parent Document | 0.729 | 0.8062 | 0.8835 | 0.8583 | 0.9333 | 1.0 | 0.4451 | 0.0145 |
| 2 | Multi Query | 0.7278 | 0.8445 | 0.8968 | 0.8505 | 0.9084 | 1.0 | 0.4245 | 0.0255 |
| 3 | Ensemble | 0.6126 | 0.8514 | 0.8915 | 0.7677 | 0.8899 | 1.0 | 0.4744 | 0.0284 |
| 4 | Bm25 | 0.6052 | 0.8229 | 0.8707 | 0.6833 | 0.9139 | 1.0 | 0.4345 | 0.0151 |
| 5 | Contextual Compression | 0.5338 | 0.6395 | 0.7827 | 0.7333 | 0.9833 | 1.0 | 0.4495 | 0.0112 |
| 6 | Naive | 0.4459 | 0.7945 | 0.808 | 0.7773 | 0.9182 | 1.0 | 0.3208 | 0.0243 |


In [222]:
ranker.get_metrics_comparison_table()

| retriever_chain | context_recall | faithfulness | llm_context_precision_with_reference | llm_context_precision_without_reference | non_llm_context_precision_with_reference | faithful_rate | context_entity_recall | noise_sensitivity_relevant | Avg_Cost_Per_Run |
|---|---|---|---|---|---|---|---|---|---|
| Naive | 0.7945 | 0.808 | 0.7773 | 0.9182 | 0.3929 | 1.0 | 0.3208 | 0.2236 | 0.0243 |
| Bm25 | 0.8229 | 0.8707 | 0.6833 | 0.9139 | 0.4139 | 1.0 | 0.4345 | 0.3793 | 0.0151 |
| Contextual Compression | 0.6395 | 0.7827 | 0.7333 | 0.9833 | 0.5083 | 1.0 | 0.4495 | 0.4062 | 0.0112 |
| Multi Query | 0.8445 | 0.8968 | 0.8505 | 0.9084 | 0.3595 | 1.0 | 0.4245 | 0.4545 | 0.0255 |
| Parent Document | 0.8062 | 0.8835 | 0.8583 | 0.9333 | 0.2556 | 1.0 | 0.4451 | 0.3173 | 0.0145 |
| Ensemble | 0.8514 | 0.8915 | 0.7677 | 0.8899 | 0.3931 | 1.0 | 0.4744 | 0.0 | 0.0284 |


In [223]:
ranker.get_algorithm_comparison_table()

| retriever | weighted_rank | weighted_score | quality_first_rank | quality_first_score | balanced_rank | balanced_score | production_ready_rank | production_ready_score |
|---|---|---|---|---|---|---|---|---|
| Bm25 | 4 | 0.6052 | 2 | 0.8 | 4 | 0.8241 | 5 | 0.0 |
| Contextual Compression | 5 | 0.5338 | 4 | 0.7847 | 5 | 0.7125 | 6 | 0.0 |
| Ensemble | 3 | 0.6126 | 5 | 0.7501 | 3 | 0.8605 | 3 | 0.2607 |
| Multi Query | 2 | 0.7278 | 3 | 0.7918 | 2 | 1.0665 | 2 | 0.3921 |
| Naive | 6 | 0.4459 | 6 | 0.7481 | 6 | 0.7065 | 4 | 0.2005 |
| Parent Document | 1 | 0.729 | 1 | 0.8509 | 1 | 1.1147 | 1 | 0.5397 |
