# Advanced Retrieval Evaluation

Please read the full article at [thedataguy.pro](https://thedataguy.pro/blog/evaluating-advanced-rag-retrievers/).


## Solution Approach

### Step 1: Generate Synthetic Data

We'll generate test data using RAGAS Synthetic Data Generator to create a comprehensive evaluation set for our retrievers. The synthetic data will be stored in CSV format for reuse across different retriever evaluations.

> Note: `testset.csv` was generated using `grok-3` and `Snowflake/snowflake-arctic-embed-l` in the `grok_3.ipynb` notebook.

### Step 2: Generate Eval Dataset

We'll create a LangSmith dataset and run evaluation using the following RAGAS metrics:
- `ContextPrecision`: Measures whether the relevant retrieved chunks are ranked above the irrelevant ones
- `Faithfulness`: Evaluates if the answers are grounded in the retrieved context
- `ContextRecall`: Assesses how well the retriever found relevant documents
- `AnswerRelevancy`: Measures how relevant the generated answers are to the questions

### Step 3: Evaluation

Each retriever chain will be evaluated using LangSmith experiments:
- Naive Retriever (Vector similarity)
- BM25 Retriever (Sparse retrieval)
- Contextual Compression Retriever with Cohere Rerank
- Multi-Query Retriever
- Parent Document Retriever
- Ensemble Retriever

### Step 4: Analysis and Report

We'll build comparison charts and tables across three key dimensions:
- Performance: Based on RAGAS metrics scores
- Cost: API costs for embedding, retrieval, and reranking operations
- Latency: Response time measurements for each retriever

The final report will include recommendations on which retriever performs best for this specific dataset and use case.



Open `Advanced_Retrieval_Evaluation` in VS Code and run `uv sync` to install the dependencies.

### Required API Keys

This notebook requires the following API keys to function properly:

1. **OpenAI API Key** - For embeddings and LLM access
2. **Cohere API Key** - For reranking capability
3. **LangChain API Key** - For LangSmith tracing and evaluation
4. **X AI API Key** - For synthetic data generation

The API keys will be securely collected in the following cells using `getpass` to avoid exposing sensitive information.

In [1]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [2]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

In [3]:
# For LangSmith
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")
os.environ["LANGCHAIN_TRACING_V2"] = "true"
dataset_name = "Retrieval Evaulation - John Wick"
os.environ["LANGCHAIN_PROJECT"] = dataset_name

In [4]:
os.environ["XAI_API_KEY"] = getpass.getpass("Enter your XAI API Key:")

### Data Collection

We can simply `wget` the John Wick review CSVs from GitHub.

In [None]:
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw1.csv -O john_wick_1.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw2.csv -O john_wick_2.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw3.csv -O john_wick_3.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw4.csv -O john_wick_4.csv

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

- Self-Query: Wants as much metadata as we can provide
- Time-weighted: Wants temporal data

> NOTE: While we're creating a temporal relationship based on when these movies came out for illustrative purposes, it needs to be clear that the "time-weighting" in the Time-weighted Retriever is based on when the document was *accessed* last - not when it was created.

In [5]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

documents = []

for i in range(1, 5):
  loader = CSVLoader(
      file_path=f"john_wick_{i}.csv",
      metadata_columns=["Review_Date", "Review_Title", "Review_Url", "Author", "Rating"]
  )

  movie_docs = loader.load()
  for doc in movie_docs:

    # Add the "Movie Title" (John Wick 1, 2, ...)
    doc.metadata["Movie_Title"] = f"John Wick {i}"

    # convert "Rating" to an `int`, if no rating is provided - assume 0 rating
    doc.metadata["Rating"] = int(doc.metadata["Rating"]) if doc.metadata["Rating"] else 0

    # newer movies have a more recent "last_accessed_at"
    doc.metadata["last_accessed_at"] = datetime.now() - timedelta(days=4-i)

  documents.extend(movie_docs)

Let's look at an example document to see if everything worked as expected!

In [6]:
documents[0]

Document(metadata={'source': 'john_wick_1.csv', 'row': 0, 'Review_Date': '6 May 2015', 'Review_Title': ' Kinetic, concise, and stylish; John Wick kicks ass.\n', 'Review_Url': '/review/rw3233896/?ref_=tt_urv', 'Author': 'lnvicta', 'Rating': 8, 'Movie_Title': 'John Wick 1', 'last_accessed_at': datetime.datetime(2025, 5, 15, 15, 9, 8, 768714)}, page_content=": 0\nReview: The best way I can describe John Wick is to picture Taken but instead of Liam Neeson it's Keanu Reeves and instead of his daughter it's his dog. That's essentially the plot of the movie. John Wick (Reeves) is out to seek revenge on the people who took something he loved from him. It's a beautifully simple premise for an action movie - when action movies get convoluted, they get bad i.e. A Good Day to Die Hard. John Wick gives the viewers what they want: Awesome action, stylish stunts, kinetic chaos, and a relatable hero to tie it all together. John Wick succeeds in its simplicity.")

## Setting up Qdrant!

Now that we have our documents, let's create a Qdrant VectorStore with the collection name "JohnWick".

In [7]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    documents,
    embeddings,
    location=":memory:",
    collection_name="JohnWick"
)



## RAG Prompt Template

In [16]:
RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""



In [24]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser


chat_model = ChatOpenAI(model="gpt-4.1-nano")

def create_rag_chain(llm, retriever, template=RAG_TEMPLATE):
    """Create a RAG chain using the provided LLM, retriever, and template."""
    prompt = ChatPromptTemplate.from_template(template)
    
    return (
        {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
        | RunnablePassthrough.assign(context=itemgetter("context"))
        | {"answer": prompt | llm | StrOutputParser(), "contexts": itemgetter("context")}
    )



## Naive RAG Chain

In [25]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

naive_retrieval_chain = create_rag_chain(
    chat_model, retriever=naive_retriever, template=RAG_TEMPLATE
)


In [26]:
naive_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})

{'answer': 'Based on the reviews, people generally liked John Wick. The majority of reviews are quite positive, with ratings often high—many reviewers gave it 8 or 9 out of 10, praising its style, action sequences, and entertainment value. While there are some mixed opinions and a few lower ratings, overall, the sentiment indicates that people appreciated the film and regarded it as a strong action movie.',
 'contexts': [Document(metadata={'source': 'john_wick_1.csv', 'row': 9, 'Review_Date': '20 October 2014', 'Review_Title': " The coolest action film you'll see all year\n", 'Review_Url': '/review/rw3107759/?ref_=tt_urv', 'Author': 'trublu215', 'Rating': 9, 'Movie_Title': 'John Wick 1', 'last_accessed_at': '2025-05-15T15:09:08.768729', '_id': '117febd3e3fc4ab8a29102d62f1db260', '_collection_name': 'JohnWick'}, page_content=": 9\nReview: At first glance, John Wick sounds like a terrible film on paper but with the slickness of Keanu Reeves' performance as the titular character and the s

Overall, this is not bad! Let's see if we can make it better!

## Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.
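To make the scoring idea concrete, here is a minimal sketch of one common Okapi BM25 formulation in plain Python. The function name `bm25_scores`, the toy reviews, and the parameter defaults (`k1=1.5`, `b=0.75`) are assumptions for illustration; the library implementation behind `BM25Retriever` may differ in tokenization and in its exact IDF formula.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with an Okapi BM25 variant."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Term frequency saturates (k1) and is normalized by document length (b).
            tf = term_freq[term]
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus, not taken from the John Wick dataset.
toy_reviews = [
    "john wick is a stylish action movie",
    "the dog was the best part of the movie",
    "a slow romantic drama with no action",
]
print(bm25_scores("stylish action", toy_reviews))
```

A document that shares no words with the query scores exactly zero, which is the main weakness of sparse retrieval compared with embedding similarity.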

This retriever is very straightforward to set up! Let's see it happen down below!


In [None]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(documents)

bm25_retrieval_chain = create_rag_chain(
    chat_model, retriever=bm25_retriever, template=RAG_TEMPLATE
)

Let's look at the responses!

In [32]:
bm25_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["answer"]

'Based on the provided reviews, there are no reviews with a rating of 10.'

It's not clear whether this is better or worse, but flatly answering that no reviews have a rating of 10 isn't great!

## Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.
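The two steps above can be sketched without any external service. In this rough illustration, `rerank` and `word_overlap` are hypothetical names, and the word-overlap score is only a stand-in for a real reranking model such as Cohere's, which scores each query-document pair with a learned model.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Keep the `top_n` candidates that `score_fn` rates most relevant to `query`."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_n]


def word_overlap(query, doc):
    """Toy relevance score: number of distinct query words found in the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


# Step 1: pretend these came back from a broad first-pass retrieval.
candidates = [
    "the soundtrack is loud",
    "keanu reeves delivers great action",
    "great action and great style",
    "popcorn was expensive",
]

# Step 2: "compress" them down to the most relevant few.
top_docs = rerank("great action", candidates, word_overlap, top_n=2)
print(top_docs)
```

The design point is that the first-pass retriever can afford to be generous (a large `k`), because the reranker decides what actually reaches the prompt.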

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [35]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-english-v3.0")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

contextual_compression_retrieval_chain = create_rag_chain(
    chat_model, retriever=compression_retriever, template=RAG_TEMPLATE
)

Let's look at the responses!

In [37]:
contextual_compression_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["answer"]

"Based on the reviews provided, people generally liked John Wick. The reviews are highly positive, with ratings of 9 and 10 out of 10, highlighting its stylishness, action sequences, and Keanu Reeves' performance. Although there is one mixed review with a rating of 5, the overall sentiment from the majority of the reviews suggests that viewers generally liked the film."

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context
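Those three steps can be sketched as a small function. Everything here is a stand-in chosen for illustration: `toy_generate_queries` replaces the LLM call, `toy_retrieve` replaces the vector store lookup, and documents are plain strings rather than LangChain `Document` objects.

```python
def multi_query_retrieve(question, generate_queries, retrieve):
    """Retrieve for every generated query and return the unique documents."""
    seen = set()
    unique_docs = []
    for query in generate_queries(question):      # step 1: n new queries
        for doc in retrieve(query):               # step 2: retrieve per query
            if doc not in seen:                   # step 3: keep unique docs only
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs


# Hypothetical stand-ins for the LLM and the retriever.
toy_index = {
    "wick reviews": ["doc_a", "doc_b"],
    "wick ratings": ["doc_b", "doc_c"],
}


def toy_generate_queries(question):
    return [f"{question} reviews", f"{question} ratings"]


def toy_retrieve(query):
    return toy_index.get(query, [])


print(multi_query_retrieve("wick", toy_generate_queries, toy_retrieve))
# → ['doc_a', 'doc_b', 'doc_c']
```

Note that the cost grows with `n`: each generated query triggers its own retrieval call, plus one LLM call to produce the queries.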

So, how hard is it to set up? Not bad! Let's see it down below!



In [39]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)

multi_query_retrieval_chain = create_rag_chain(
    chat_model, retriever=multi_query_retriever, template=RAG_TEMPLATE
)

In [40]:
multi_query_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["answer"]

"Based on the reviews in the provided context, people generally liked John Wick. The films received high ratings such as 9 and 10 from some reviewers, and many praised its stylish action sequences, Keanu Reeves' performance, and entertainment value. While there are some negative reviews criticizing over-the-top action or lack of plot, the overall sentiment appears to be positive, indicating that most people generally enjoyed John Wick."

## Parent Document Retriever

A "small-to-big" strategy - the Parent Document Retriever works based on a simple strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.
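As a rough illustration of "search small, return big", the sketch below uses plain dictionaries and word overlap in place of a docstore, a vector store, and embedding similarity. The helper names (`tokens`, `split_into_chunks`, `retrieve_parents`) and the two toy reviews are assumptions made up for this example.

```python
def tokens(text):
    """Lowercase words with surrounding punctuation stripped."""
    return {word.strip(".,!?").lower() for word in text.split()}


def split_into_chunks(text, words_per_chunk):
    """Split a document into consecutive chunks of a few words each."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


# "Docstore": parent id -> full parent document.
parents = {
    "review_1": "John Wick is a stylish revenge film. The choreography is superb.",
    "review_2": "The fourth movie is far too long. The stunts are still great.",
}

# "Vector store": child chunks, each tagged with the id of its parent.
children = [
    (chunk, parent_id)
    for parent_id, text in parents.items()
    for chunk in split_into_chunks(text, words_per_chunk=4)
]


def retrieve_parents(query, k=1):
    """Match the query against child chunks, then return their parent documents."""
    query_tokens = tokens(query)
    best = sorted(
        children, key=lambda child: len(query_tokens & tokens(child[0])), reverse=True
    )[:k]
    # De-duplicate parent ids while preserving ranking order.
    parent_ids = list(dict.fromkeys(parent_id for _, parent_id in best))
    return [parents[parent_id] for parent_id in parent_ids]


print(retrieve_parents("superb choreography"))
```

The query matches only the small chunk "choreography is superb.", yet the caller receives the whole first review, including the sentence about it being a revenge film.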

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [43]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = Qdrant(
    collection_name="full_documents", embeddings=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)

store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)
parent_document_retriever.add_documents(parent_docs, ids=None)

parent_document_retrieval_chain = create_rag_chain(
    chat_model, retriever=parent_document_retriever, template=RAG_TEMPLATE
)



Let's give it a whirl!

In [44]:
parent_document_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["answer"]

'Based on the provided reviews, people generally liked John Wick. The majority of positive reviews praise the series, with one reviewer calling it "beautifully choreographed" and highly recommend it. Although there is at least one negative review describing John Wick 4 as a "HORRIBLE" movie, overall, the positive comments and ratings suggest that people tend to enjoy the series.'

Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to answer the performance question more rigorously.

## Ensemble Retriever

In brief, an Ensemble Retriever takes two or more retrievers and combines their retrieved documents using a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.
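For intuition, a weighted form of Reciprocal Rank Fusion can be sketched in a few lines: each retriever contributes `weight / (rank + c)` for every document it returns, and documents are sorted by their summed score. The function name `weighted_rrf`, the toy result lists, and the constant `c=60` (the value suggested in the linked paper) are assumptions for this example, not a copy of the library's code.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists of document ids into one ranking."""
    scores = defaultdict(float)
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(docs, start=1):
            # Higher-ranked documents (small rank) contribute more.
            scores[doc] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical outputs of two retrievers for the same query.
sparse_results = ["doc_a", "doc_b", "doc_c"]
dense_results = ["doc_b", "doc_d", "doc_a"]

fused = weighted_rrf([sparse_results, dense_results], weights=[0.5, 0.5])
print(fused)
# → ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Because only ranks are used, the fusion never has to reconcile BM25 scores with cosine similarities, which live on incompatible scales.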

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [45]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

ensemble_retrieval_chain = create_rag_chain(
    chat_model, retriever=ensemble_retriever, template=RAG_TEMPLATE
)

Let's look at our results!

In [47]:
ensemble_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["answer"]

'Based on the reviews provided, people generally liked John Wick. Many reviews are highly positive, praising its action, style, and entertainment value, with some ratings close to perfect scores. However, there are also some mixed or negative reviews expressing disappointment or criticism of certain elements. Overall, the majority of reviewers seem to have a favorable opinion of John Wick.'

## Generate Synthetic Data

In [48]:
from ragas.run_config import RunConfig
from ragas.testset import TestsetGenerator
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_xai import ChatXAI

grok = ChatXAI(model="grok-3-latest")
# Create a RunConfig with rate limit settings
my_run_config = RunConfig(
    max_workers=8,      # Control concurrent requests (default is 16)
    timeout=180,         # Maximum time to wait for a single operation (default is 180s)
    max_retries=10,      # Maximum number of retry attempts (default is 10)
    max_wait=120,         # Maximum wait time between retries (default is 60s)
    exception_types=(Exception,)  # Types of exceptions to retry on
)

# Initialize the generator with the LLM and embedding model
generator_llm = LangchainLLMWrapper(grok)
generator_embeddings = LangchainEmbeddingsWrapper(HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l"))

# Set the run config for both LLM and embeddings
generator_llm.set_run_config(my_run_config)
generator_embeddings.set_run_config(my_run_config)

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

# Use the run_config in your generate call
dataset = generator.generate_with_langchain_docs(
    documents=documents, 
    testset_size=10,
    run_config=my_run_config
)
# Save the testset to a CSV file
dataset.to_pandas().to_csv("testset.csv", index=False)

Applying SummaryExtractor:   0%|          | 0/44 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/100 [00:00<?, ?it/s]

Node 25b69217-d2b5-4f01-9309-1ba668f586e1 does not have a summary. Skipping filtering.
Node 73fc9fd7-6ce5-4ac2-b7cd-98c4fbeb8ba9 does not have a summary. Skipping filtering.
Node 0e5d82ca-a1df-438d-8d9e-be363c7663d7 does not have a summary. Skipping filtering.
Node a4433134-6c85-45f4-8e06-3b12176d6e0d does not have a summary. Skipping filtering.
Node 821934f0-2fd4-42d7-9c79-a7313c97510d does not have a summary. Skipping filtering.
Node 5a09a6a7-011c-4e94-b1ec-58ebd8ab2eb7 does not have a summary. Skipping filtering.
Node 52803da1-aa1c-4e02-a7c4-0a5d7eab19fc does not have a summary. Skipping filtering.
Node 079381f5-62aa-47ca-9551-c6f47b5e5491 does not have a summary. Skipping filtering.
Node c28ce429-e5d2-4a65-81af-1d371bf76cb9 does not have a summary. Skipping filtering.
Node d96b7034-b55f-4c6a-8101-5b66b213d7f2 does not have a summary. Skipping filtering.
Node de436da4-0cbe-4756-a47c-6939cddc6278 does not have a summary. Skipping filtering.
Node d4f6551d-3a3f-44eb-8c92-a240e69f9c33 d

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/244 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

## Upload LangSmith Dataset

In [49]:
# from ragas.integrations.langsmith import upload_dataset

from __future__ import annotations
import pandas as pd
import ast

import typing as t

from langchain.smith import RunEvalConfig

from ragas.integrations.langchain import EvaluatorChain

if t.TYPE_CHECKING:
    from langsmith.schemas import Dataset as LangsmithDataset

    from ragas.testset import Testset

try:
    from langsmith import Client
    from langsmith.utils import LangSmithNotFoundError
except ImportError:
    raise ImportError(
        "Please install langsmith to use this feature. You can install it via pip install langsmith"
    )


def upload_dataset(
    dataset: pd.DataFrame, dataset_name: str, dataset_desc: str = ""
) -> LangsmithDataset:
    """
    Uploads a pandas DataFrame to LangSmith as a new dataset. If a dataset with
    the specified name already exists, the function raises an error.

    Parameters
    ----------
    dataset : pd.DataFrame
        The dataset to be uploaded. Must contain the `question` and
        `ground_truth` columns used as input and output keys.
    dataset_name : str
        The name for the new dataset in LangSmith.
    dataset_desc : str, optional
        A description for the new dataset. The default is an empty string.

    Returns
    -------
    LangsmithDataset
        The dataset object as stored in LangSmith after upload.

    Raises
    ------
    ValueError
        If a dataset with the specified name already exists in LangSmith.

    Notes
    -----
    The function attempts to read a dataset by the given name to check its existence.
    If not found, it uploads the DataFrame, using `question` as the input key and
    `ground_truth` as the output key.
    """
    client = Client()
    try:
        # check if dataset exists
        langsmith_dataset: LangsmithDataset = client.read_dataset(
            dataset_name=dataset_name
        )
        raise ValueError(
            f"Dataset {dataset_name} already exists in langsmith. [{langsmith_dataset}]"
        )
    except LangSmithNotFoundError:
        # if not create a new one with the generated query examples
        langsmith_dataset: LangsmithDataset = client.upload_dataframe(
            df=dataset,
            name=dataset_name,
            input_keys=["question"],
            output_keys=["ground_truth"],
            #metadata_keys=["context"],
            description=dataset_desc,
        )

        print(
            f"Created a new dataset '{langsmith_dataset.name}'. Dataset is accessible at {langsmith_dataset.url}"
        )
        return langsmith_dataset
    
# Load the test set from a CSV file

df = pd.read_csv("testset.csv")
# Convert string representations of lists to actual Python lists
df['reference_contexts'] = df['reference_contexts'].apply(ast.literal_eval)
# set columns to question, context, ground_truth
df = df.rename(columns={
    'user_input': 'question',
    'reference_contexts': 'context',
    'reference': 'ground_truth'
})

upload_dataset(
    dataset=df,
    dataset_name=dataset_name,
    dataset_desc="A test set of John Wick reviews",
)

Created a new dataset 'Retrieval Evaulation - John Wick'. Dataset is accessible at https://smith.langchain.com/o/e106fdae-1163-4ad0-b46b-09a4850df972/datasets/738cb06d-ba41-46b9-a772-a11e2e7bda4b


Dataset(name='Retrieval Evaulation - John Wick', description='A test set of John Wick reviews', data_type=<DataType.kv: 'kv'>, id=UUID('738cb06d-ba41-46b9-a772-a11e2e7bda4b'), created_at=datetime.datetime(2025, 5, 18, 20, 40, 17, 275857, tzinfo=datetime.timezone.utc), modified_at=datetime.datetime(2025, 5, 18, 20, 40, 17, 275857, tzinfo=datetime.timezone.utc), example_count=0, session_count=0, last_session_start_time=None, inputs_schema=None, outputs_schema=None, transformations=None)

## Run Evaluation

In [None]:
from ragas.integrations.langsmith import evaluate
from ragas.metrics import context_recall, faithfulness, context_precision, answer_relevancy

# build the evaluation metrics
metrics = [answer_relevancy, context_precision, faithfulness, context_recall]

# Create a list of chains to evaluate
chain_list = [
    ("Naive Retrieval", naive_retrieval_chain),
    ("BM25 Retrieval", bm25_retrieval_chain),
    ("Parent Document Retrieval", parent_document_retrieval_chain),
    ("Contextual Compression Retrieval", contextual_compression_retrieval_chain),
    ("Multi-Query Retrieval", multi_query_retrieval_chain),
    ("Ensemble Retrieval", ensemble_retrieval_chain),
]


# Run evaluation on each chain
for chain_name, chain in chain_list:
    print(f"Evaluating {chain_name}...")

    evaluate(
        dataset_name=dataset_name,
        llm_or_chain_factory=chain,
        experiment_name=f"{chain_name}",
        metrics=metrics,
    )
