# Reranking methods in RAG systems

Reranking is a crucial step in RAG systems that aims to improve the relevance and quality of retrieved documents. It involves reassessing and reordering initially retrieved documents to ensure that the most relevant information is prioritized for subsequent processing or presentation. This notebook demonstrates two reranking strategies: one using an LLM, and another using a cross-encoder model.

In [1]:
import os
from dotenv import load_dotenv
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate  # top-level `from langchain import PromptTemplate` is deprecated
from langchain.docstore.document import Document
from typing import List, Dict, Any, Tuple
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain_core.retrievers import BaseRetriever
from sentence_transformers import CrossEncoder
from langchain_core.pydantic_v1 import BaseModel, Field

import warnings
warnings.simplefilter("ignore", category=FutureWarning)

# Load environment variables from a .env file
load_dotenv()

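# Check that the API key is available (load_dotenv() exports it if it is defined in .env)
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file or environment")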

## Preprocessing

### Load and encode the PDF into a vector store
In this step, we process a PDF document and convert its textual content into a vector database using FAISS and OpenAI embeddings.


In [2]:
def encode_pdf(path, chunk_size=1000, chunk_overlap=200):
    """
    Encodes a PDF book into a vector store using OpenAI embeddings.

    Args:
        path: The path to the PDF file.
        chunk_size: The desired size of each text chunk.
        chunk_overlap: The amount of overlap between consecutive chunks.

    Returns:
        A FAISS vector store containing the encoded book content.
    """

    # Load PDF documents
    loader = PyPDFLoader(path)
    documents = loader.load()

    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap, length_function=len)
    texts = text_splitter.split_documents(documents)

    # Replace tab characters ('\t') with spaces in the document chunks
    for doc in texts:
        doc.page_content = doc.page_content.replace('\t', ' ')  # Replace tabs with spaces

    # Create embeddings and vector store
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_documents(texts, embeddings)

    return vectorstore

path = "Understanding_Climate_Change.pdf"
vectorstore = encode_pdf(path)

- `PyPDFLoader` reads the PDF and extracts its text content.
- `loader.load()` returns a list of `Document` objects, each usually representing one page of the PDF.
- `RecursiveCharacterTextSplitter` tries to split the text at natural breakpoints (paragraphs, then sentences, etc.). Here we split the document into chunks of up to 1000 characters, with 200 characters of overlap between consecutive chunks.
  - `chunk_size` caps the length of each chunk, keeping chunks small enough to embed and retrieve individually.
  - `chunk_overlap` repeats the end of one chunk at the start of the next, so that context spanning a split point is not lost.
  - `length_function=len` tells the splitter to measure chunk length in characters. Because the splitter prefers natural breakpoints, chunks are generally no longer than the specified size and are often shorter.
- We then initialize `OpenAIEmbeddings()`, which provides access to OpenAI’s text embedding model. This model will convert each chunk of text into a high-dimensional vector that captures its semantic meaning.
- With the cleaned chunks and the embedding model ready, we use `FAISS.from_documents()` to create a vector store. It enables us to efficiently query the document for relevant content based on a user’s input.
- Finally, the function returns the completed `vectorstore`.
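
To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window character chunker. It is a simplification for illustration only: `sliding_window_chunks` is a hypothetical helper, and it does not reproduce the separator-aware logic of `RecursiveCharacterTextSplitter`.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into fixed-size character windows that overlap (hypothetical helper)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Toy input: 2500 characters of repeating digits
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])
# The last 200 characters of one chunk reappear at the start of the next
print(chunks[0][-200:] == chunks[1][:200])
```

With a window of 1000 and an overlap of 200, each new chunk starts 800 characters after the previous one, which is why text near a split point appears in both neighbours.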

## Method 1: LLM based function to rerank the retrieved documents
In this step, we define a reranking function that uses an LLM to evaluate and reorder documents retrieved from the vector store. This allows us to go beyond simple similarity scores and instead use natural language reasoning to assess which chunks are most relevant to a given query.

Embedding-based retrieval (like FAISS) is fast and powerful, but it relies purely on vector similarity, which may not always capture deeper nuances in the query-document relationship. By using an LLM to re-score the documents, we can apply a much more context-aware and task-specific measure of relevance — one that mimics human reasoning. This often leads to better answer quality and fewer irrelevant results in downstream tasks.
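
As a rough illustration of what "purely vector similarity" means, the sketch below ranks toy vectors by cosine similarity to a query vector. The three-dimensional vectors and the helper names are assumptions made for this example; real embeddings have many more dimensions, and the actual FAISS index may use a different metric such as L2 distance.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_similarity(
    query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (document index, similarity) pairs for the k most similar vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (hypothetical values, not produced by a real model)
toy_docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
toy_query = [1.0, 0.0, 0.0]
top = top_k_by_similarity(toy_query, toy_docs, k=2)
print(top)
```

The ranking depends only on the geometry of the vectors; nothing in this step reads the text itself, which is the gap a reranker is meant to close.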

#### Define a structured output class
We start by defining a small `RatingScore` class using `pydantic` that describes the structure of the LLM’s response. It simply expects a single numerical score.



In [3]:
# Class to define the expected output format from the LLM — a single float value.
class RatingScore(BaseModel):
    relevance_score: float = Field(..., description="The relevance score of a document to a query.")

This class specifies the structure we expect when the LLM outputs a response — in this case, just a single field: `relevance_score`, a float from 1 to 10 indicating how relevant the document is to the query.
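
Note that the schema above declares a float but does not itself restrict the value to the 1 to 10 range. One possible safeguard, sketched here with a hypothetical `normalize_score` helper that is not part of the original notebook, is to coerce and clamp whatever comes back before using it for sorting.

```python
import math


def normalize_score(raw, low: float = 1.0, high: float = 10.0, default: float = 0.0) -> float:
    """Coerce a raw score to float and clamp it to [low, high] (hypothetical helper).

    Returns `default` when the value cannot be interpreted as a number.
    """
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return default
    if math.isnan(value):
        return default
    return max(low, min(high, value))


print(normalize_score("7.5"))  # numeric string is accepted
print(normalize_score(12))     # above the scale, clamped to the upper bound
print(normalize_score("n/a"))  # not a number, falls back to the default
```

The default of `0.0` sits below the scale on purpose, so unparseable scores sort after every validly rated document.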

#### Function to prompt the LLM to evaluate relevance
We now define the `rerank_documents` function, which prompts the LLM to score each document.

In [4]:
def rerank_documents(query: str, docs: List[Document], top_n: int = 3) -> List[Document]:
    # Define the prompt template that will be used to query the LLM
    prompt_template = PromptTemplate(
        input_variables=["query", "doc"],
        template="""On a scale of 1-10, rate the relevance of the following document to the query. Consider the specific context and intent of the query, not just keyword matches.
        Query: {query}
        Document: {doc}
        Relevance Score:"""
    )

    # Instantiate the LLM with deterministic settings (temperature = 0)
    llm = ChatOpenAI(temperature=0, model_name="gpt-4o-mini-2024-07-18", max_tokens=4000)
    # Combine the prompt with the model and request structured output (RatingScore)
    llm_chain = prompt_template | llm.with_structured_output(RatingScore)

    scored_docs = []
    # Evaluate each document by prompting the LLM
    for doc in docs:
        input_data = {"query": query, "doc": doc.page_content}
        try:
            # Invoke the chain and make sure the returned score is a float
            score = float(llm_chain.invoke(input_data).relevance_score)
        except (ValueError, TypeError, AttributeError):
            # Malformed or missing structured output: fall back to a default score
            score = 0.0
        scored_docs.append((doc, score))  # Store doc along with its score

    # Sort all documents by score in descending order
    reranked_docs = sorted(scored_docs, key=lambda x: x[1], reverse=True)
    # Return only the top_n most relevant documents
    return [doc for doc, _ in reranked_docs[:top_n]]

- Inside the `rerank_documents()` function, we define a natural language prompt using `PromptTemplate`. This prompt asks the LLM to rate how relevant a given document is to a query on a scale from 1 to 10. The instructions encourage the model to focus on semantic relevance, not just keyword overlap.
- We instantiate a `ChatOpenAI` model with `temperature=0` to make the scoring as consistent as possible across runs (outputs may still vary slightly). We also use the `with_structured_output()` method so that the LLM's response is parsed into our `RatingScore` schema.
- For each document in the list, we pass the query and the document content to the prompt pipeline. The LLM returns a `relevance_score`, which we attempt to convert to a float. If something goes wrong (e.g. a malformed output), we assign a default score of `0` to prevent the process from failing.
- Once all scores are collected, we sort the documents in descending order of their relevance score using `sorted()`. This produces a reranked list, with the most relevant documents at the top.
- Finally, we return only the top `n` documents (default: 3) from the reranked list. These will be the most semantically relevant chunks chosen by the LLM, and they can now be used as input to a generation model.
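
The score, sort, and truncate pattern described above does not depend on the LLM itself. The sketch below reproduces it with a pluggable scoring function so it can run offline. `word_overlap_score` is a deliberately crude stand-in chosen for this example (it is not how the LLM judges relevance), and plain strings replace `Document` objects.

```python
from typing import Callable, List


def rerank(query: str, docs: List[str], score_fn: Callable[[str, str], float], top_n: int = 3) -> List[str]:
    """Score every document against the query, sort descending, keep the top_n."""
    scored = []
    for doc in docs:
        try:
            score = float(score_fn(query, doc))
        except (TypeError, ValueError):
            score = 0.0  # same fallback idea as in the notebook: never fail on a bad score
        scored.append((doc, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in scored[:top_n]]


def word_overlap_score(query: str, doc: str) -> float:
    """Stand-in scorer (assumption): fraction of query words that appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


docs = [
    "coral reefs are bleaching as the climate warms",
    "climate change threatens biodiversity in coral reefs",
    "stock markets rallied on friday",
]
ranked = rerank("climate change biodiversity", docs, word_overlap_score, top_n=2)
print(ranked)
```

Swapping `word_overlap_score` for a function that calls an LLM chain gives the same control flow as `rerank_documents()`, at the cost of one model call per candidate document.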

#### Example usage: Run retrieval and reranking on a real query
Now that we have defined the reranking function, we can try it out on a real example. We will start by performing a standard similarity search using the FAISS vector store, then apply our `rerank_documents()` method to reorder the results based on LLM-based semantic relevance.

In [5]:
query = "What are the impacts of climate change on biodiversity?"

# Retrieve top 15 documents using vector similarity
initial_docs = vectorstore.similarity_search(query, k=15)

# Apply LLM-based reranking
reranked_docs = rerank_documents(query, initial_docs)

# Print the top 3 initial documents (from similarity search)
print("Top initial documents:")
for i, doc in enumerate(initial_docs[:3]):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:200] + "...")  # Print first 200 characters of each document


# Print the top reranked documents (after LLM scoring)
print(f"Query: {query}\n")
print("Top reranked documents:")
for i, doc in enumerate(reranked_docs):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:200] + "...")  # Print first 200 characters of each document

Top initial documents:

Document 1:
managed retreats. 
Extreme Weather Events 
Climate change is linked to an increase in the frequency and severity of extreme weather 
events, such as hurricanes, heatwaves, droughts, and heavy rainfall...

Document 2:
development of eco-friendly fertilizers and farming techniques is essential for reducing the 
agricultural sector's carbon footprint. 
Chapter 3: Effects of Climate Change 
The effects of climate chan...

Document 3:
Heatwaves can lead to heat-related illnesses and exacerbate existing health conditions. 
Changing Seasons 
Climate change is altering the timing and length of seasons, affecting ecosystems and human 
...
Query: What are the impacts of climate change on biodiversity?

Top reranked documents:

Document 1:
Tropical rainforests are particularly important for carbon storage. Deforestation in the 
Amazon, Congo Basin, and Southeast Asia has significant impacts on global carbon cycles 
and biodiversity. The...

Document 2:
Coral re

- We define a query relevant to our document.
- Using `vectorstore.similarity_search()`, we retrieve the top 15 most similar chunks from the vector database.
- Then, we pass these retrieved documents through our reranker, which uses a language model to score and sort them by relevance to the query.
- Finally, we compare the top documents before and after reranking to see how the LLM changes the order based on a deeper understanding of the query.

We can see that these reranked results are more focused, precise, and contextually aligned with the intent of the query.

#### Wrap reranking into a custom retriever
To make the reranking process seamlessly integrate with a RAG pipeline, we encapsulate it in a custom retriever. This allows us to plug it into LangChain’s RetrievalQA class just like any other retriever — but under the hood, we now benefit from our LLM-based reranking logic.

In [6]:
# Create a custom retriever class that incorporates LLM-based reranking
class CustomRetriever(BaseRetriever, BaseModel):

    vectorstore: Any = Field(description="Vector store for initial retrieval")

    class Config:
        arbitrary_types_allowed = True

    def _get_relevant_documents(self, query: str, num_docs=2) -> List[Document]:
        # Step 1: Retrieve an initial set of documents (more than we need)
        initial_docs = self.vectorstore.similarity_search(query, k=30)
        # Step 2: Rerank and return only the top N most relevant ones
        return rerank_documents(query, initial_docs, top_n=num_docs)

- Here we define a class `CustomRetriever` that inherits from both `BaseRetriever` (LangChain interface) and `BaseModel` (to support field validation).
- The `vectorstore` is passed into the retriever as a parameter.
- In `_get_relevant_documents()`, we first retrieve a broader set of documents (e.g., top 30 from FAISS) and then narrow it down using the LLM-based `rerank_documents()` function.
- The final output is a smaller set of top-ranked documents, now selected not just by similarity but by semantic relevance using the language model.

#### Connect to a RetrievalQA chain
With our custom retriever ready, we can now use it in a `RetrievalQA` pipeline. This pipeline retrieves relevant documents using our reranker, then passes them to the LLM to generate an answer.

In [7]:
# Create an instance of the custom retriever
custom_retriever = CustomRetriever(vectorstore=vectorstore)

# Instantiate the LLM for answering questions
llm = ChatOpenAI(temperature=0, model_name="gpt-4o-mini-2024-07-18")

# Create the RetrievalQA chain with the custom retriever
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=custom_retriever,
    return_source_documents=True
)

- We create the `llm` with a temperature of `0` for deterministic responses.
- We pass our `CustomRetriever` into the QA pipeline.
- The `RetrievalQA` chain is configured with the `"stuff"` chain type — meaning it will simply concatenate the retrieved documents and feed them to the LLM as context.
- Setting `return_source_documents=True` ensures that we can inspect which chunks the model used to generate its answer.
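
To illustrate what "stuffing" amounts to, the following sketch joins the retrieved chunks and places them in a single prompt. The template text and the `build_stuff_prompt` helper are assumptions made for this example; the prompt that LangChain actually uses may be worded differently.

```python
from typing import List

# Hypothetical template, written for illustration only
STUFF_TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_stuff_prompt(question: str, chunks: List[str], separator: str = "\n\n") -> str:
    """Concatenate all chunks into one context block and fill in the template."""
    context = separator.join(chunks)
    return STUFF_TEMPLATE.format(context=context, question=question)


prompt = build_stuff_prompt(
    "What threatens coral reefs?",
    ["Coral reefs are sensitive to temperature.", "Ocean acidification harms reefs."],
)
print(prompt)
```

Because every chunk ends up in one prompt, the number of documents the retriever returns directly controls the prompt length, which is one reason to rerank and keep only a few chunks.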

#### Example query: Run the RAG pipeline with reranked retrieval
Now that the pipeline is fully set up, we can pass in a natural language query and run the full RAG process. This involves three main steps under the hood:
1. Initial retrieval via the vector store,
2. Reranking using the LLM to find the most semantically relevant chunks,
3. Answer generation by feeding those reranked chunks into the LLM.

In [8]:
# Run the full QA pipeline
result = qa_chain.invoke({"query": query})

print(f"\nQuestion: {query}")
print(f"Answer: {result['result']}")
print("\nRelevant source documents:")
for i, doc in enumerate(result["source_documents"]):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:200] + "...")  # Print first 200 characters of each document


Question: What are the impacts of climate change on biodiversity?
Answer: Climate change impacts biodiversity in several ways, including:

1. Habitat Loss: Changes in temperature and precipitation patterns can lead to the loss of habitats, such as tropical rainforests and boreal forests, which are crucial for many species.

2. Species Extinction: As habitats are altered or destroyed, many species may not be able to adapt quickly enough to survive, leading to increased rates of extinction.

3. Ocean Acidification: Increased CO2 levels lead to ocean acidification, which affects marine ecosystems, particularly coral reefs. This can result in coral bleaching and mortality, threatening marine biodiversity.

4. Disruption of Food Webs: Changes in species distributions and interactions can disrupt food webs, affecting the survival of various marine and terrestrial species.

5. Altered Migration Patterns: Climate change can affect the migration patterns of species, leading to mismatches in ti

- We invoke the `qa_chain` by passing in the query as a dictionary. The chain internally:
  - Uses our custom retriever to find and rerank documents,
  - Concatenates the top documents into a context window,
  - And feeds them into the LLM to generate a natural language answer.
  
- The result is returned as a dictionary. The field `result['result']` contains the model's final answer.
- The field `result['source_documents']` provides the actual document chunks that were used to generate the answer — this is crucial for transparency and debugging.

This gives us a more explainable question-answering system, powered by both retrieval and reasoning, with full visibility into which chunks informed the answer.

### Example: Why reranking improves retrieval quality
To demonstrate the value of using reranking with an LLM, we will walk through a toy example. We define a list of short text chunks — some contain surface-level keyword matches (like "The capital of France is..."), while others include richer context and reasoning (such as detailed descriptions of Paris).

By comparing the baseline FAISS retrieval to our custom reranked retriever, we can see how reranking helps us surface the most meaningful and informative content, rather than just text that shares similar words.

In [10]:
# Sample input text chunks with varying levels of semantic relevance
chunks = [
    "The capital of France is great.",
    "The capital of France is huge.",
    "The capital of France is beautiful.",
    """Have you ever visited Paris? It is a beautiful city where you can eat delicious food and see the Eiffel Tower.
    I really enjoyed all the cities in france, but its capital with the Eiffel Tower is my favorite city.""",
    "I really enjoyed my trip to Paris, France. The city is beautiful and the food is delicious. I would love to visit again. Such a great capital city."
]
docs = [Document(page_content=sentence) for sentence in chunks] # Wrap each chunk as a Document

# Function to compare baseline and reranked retrieval side-by-side
def compare_rag_techniques(query: str, docs: List[Document] = docs) -> None:
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_documents(docs, embeddings) # Build a temporary vectorstore for testing

    print("Comparison of Retrieval Techniques")
    print("==================================")
    print(f"Query: {query}\n")

    # Baseline: Retrieve top 2 results using vector similarity only
    print("Baseline Retrieval Result:")
    baseline_docs = vectorstore.similarity_search(query, k=2)
    for i, doc in enumerate(baseline_docs):
        print(f"\nDocument {i+1}:")
        print(doc.page_content)

    # Advanced: Retrieve top 2 reranked results using LLM scoring
    print("\nAdvanced Retrieval Result:")
    custom_retriever = CustomRetriever(vectorstore=vectorstore)
    advanced_docs = custom_retriever.invoke(query)  # invoke() replaces the deprecated get_relevant_documents()
    for i, doc in enumerate(advanced_docs):
        print(f"\nDocument {i+1}:")
        print(doc.page_content)

# Run the comparison
query = "what is the capital of france?"
compare_rag_techniques(query, docs)

Comparison of Retrieval Techniques
Query: what is the capital of france?

Baseline Retrieval Result:

Document 1:
The capital of France is great.

Document 2:
The capital of France is beautiful.

Advanced Retrieval Result:

Document 1:
The capital of France is beautiful.

Document 2:
The capital of France is great.


- This example illustrates a common issue with standard vector similarity: it favors short, keyword-aligned snippets over more contextually relevant ones. Both baseline results are generic “The capital of France is...” sentences that match the query's wording but never name the city.
- In contrast, the reranked results are chosen by the LLM based on its relevance scores. In this run, however, the reranker returned the same two sentences in swapped order rather than the chunks that mention Paris. One likely factor is the model: `gpt-4o-mini-2024-07-18` is smaller than the full `gpt-4o`, so its ability to judge relevance may be more limited, resulting in less-than-expected improvements in reranking.

This highlights how reranking works by interpreting meaning rather than just matching words, but the choice of model can still affect the outcome.
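
When comparing two result lists like the ones above, it can help to see exactly which items moved. The sketch below uses a hypothetical `rank_changes` helper (not part of the original notebook) on the two lists printed in the output.

```python
from typing import Dict, List, Optional, Tuple


def rank_changes(baseline: List[str], reranked: List[str]) -> Dict[str, Tuple[Optional[int], int]]:
    """Map each reranked item to (old position, new position), counting from 1.

    The old position is None when the item did not appear in the baseline list.
    """
    old_position = {item: i for i, item in enumerate(baseline, start=1)}
    return {item: (old_position.get(item), i) for i, item in enumerate(reranked, start=1)}


baseline = ["The capital of France is great.", "The capital of France is beautiful."]
reranked = ["The capital of France is beautiful.", "The capital of France is great."]

for text, (old, new) in rank_changes(baseline, reranked).items():
    print(f"{old} -> {new}: {text}")
```

Here both items simply trade places, which confirms that the reranker reordered the baseline results without surfacing any new chunk.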

## Method 2: Reranking with a cross-encoder model
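In this step, we define a custom retriever that leverages a cross-encoder model to rerank documents based on a deeper, pairwise semantic evaluation. Unlike embedding-based similarity, which encodes the query and each document independently and then compares the resulting vectors, a cross-encoder processes each query-document pair jointly, often yielding state-of-the-art results in many information retrieval tasks. The LLM-based rating above also sees the query and document together, but a cross-encoder is a much smaller model trained specifically to output a relevance score.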

Cross-encoders can be slower than other methods (since each query-document pair is passed through the model), but they make up for it with more accurate relevance scoring, especially for nuanced or complex queries.

This cross-encoder approach is particularly effective when:
- We have a small number of documents to rerank.
- Accuracy is more important than speed.
- The retrieval task demands more nuanced understanding of the query-document relationship (e.g. subtle distinctions in meaning or intent).

It provides a strong alternative to the LLM-based reranker, especially when we want a more deterministic and cost-efficient solution without calling an external generative model.

In [21]:
# Load a pretrained cross-encoder model fine-tuned for ranking
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

Here we load a cross-encoder model from Hugging Face. This model has been trained on MS MARCO — a large-scale dataset for passage ranking — making it a good fit for reranking retrieved documents based on relevance.

#### Define the CrossEncoderRetriever class
We now define a custom retriever class that integrates both a vector store for initial retrieval and the cross-encoder model for reranking the results.

In [24]:
class CrossEncoderRetriever(BaseRetriever, BaseModel):
    vectorstore: Any = Field(description="Vector store for initial retrieval")
    cross_encoder: Any = Field(description="Cross-encoder model for reranking")
    k: int = Field(default=5, description="Number of documents to retrieve initially")
    rerank_top_k: int = Field(default=3, description="Number of documents to return after reranking")

    class Config:
        arbitrary_types_allowed = True

    def _get_relevant_documents(self, query: str) -> List[Document]:
        # Initial retrieval
        initial_docs = self.vectorstore.similarity_search(query, k=self.k)

        # Prepare pairs for cross-encoder
        pairs = [[query, doc.page_content] for doc in initial_docs]

        # Get cross-encoder scores
        scores = self.cross_encoder.predict(pairs)

        # Sort documents by score
        scored_docs = sorted(zip(initial_docs, scores), key=lambda x: x[1], reverse=True)

        # Return top reranked documents
        return [doc for doc, _ in scored_docs[:self.rerank_top_k]]

    async def _aget_relevant_documents(self, query: str) -> List[Document]:
        raise NotImplementedError("Async retrieval not implemented")

- This class extends LangChain's `BaseRetriever` and uses `pydantic` for schema validation.
- It takes in a `vectorstore` (for fast initial retrieval) and a `cross_encoder` (for scoring).
- `k` controls how many documents are initially retrieved using FAISS, and `rerank_top_k` sets how many reranked documents will be returned.

In `_get_relevant_documents`:
  - We first retrieve `k` documents using traditional vector similarity (`similarity_search()` from FAISS).
  - Each document is then paired with the query and passed into the cross-encoder. The model scores each pair based on how relevant the document is to the query.
  - These scores are used to sort the documents from most to least relevant.
  - Finally, the top `rerank_top_k` documents are returned — now reranked using deeper semantic reasoning.

In `async def _aget_relevant_documents`:
  - We also define the async counterpart of the retrieval method. Since it is not implemented here, calling it raises `NotImplementedError`, so this retriever supports synchronous use only.
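
The pair-building and batch-scoring steps can be tried without downloading a model. In the sketch below, `ToyPairScorer` is a hypothetical stand-in that mimics only the `predict(pairs)` call shape used above; it counts shared words and is in no way a real cross-encoder. Plain strings replace `Document` objects.

```python
from typing import List, Sequence, Tuple


class ToyPairScorer:
    """Stand-in for a cross-encoder (assumption): scores a pair by its number of shared words."""

    def predict(self, pairs: Sequence[Sequence[str]]) -> List[float]:
        scores = []
        for query, text in pairs:
            query_words = set(query.lower().split())
            text_words = set(text.lower().split())
            scores.append(float(len(query_words & text_words)))
        return scores


def rerank_with_pair_scorer(
    query: str, candidates: List[str], scorer: ToyPairScorer, top_k: int = 3
) -> List[Tuple[str, float]]:
    """Build [query, text] pairs, score them in one batch, and keep the top_k."""
    pairs = [[query, text] for text in candidates]
    scores = scorer.predict(pairs)  # one score per pair, in the same order
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


candidates = [
    "sea levels are rising",
    "climate change reduces biodiversity",
    "biodiversity loss is accelerating",
]
result = rerank_with_pair_scorer("climate change biodiversity", candidates, ToyPairScorer(), top_k=2)
print(result)
```

All pairs are scored in a single `predict` call, so the cost grows with the number of candidates; this is why the initial retrieval size `k` is kept modest.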


#### Example: Running a QA pipeline with cross-encoder reranking
Now that we have defined our `CrossEncoderRetriever`, we can plug it into a full RAG pipeline. We will set up the retriever, initialize the LLM, and use `RetrievalQA` to answer a query using the most relevant documents — reranked by the cross-encoder.

The goal here is to use the best of both worlds: fast FAISS-based search for initial recall, and accurate reranking via a cross-encoder before passing the results to the language model for answer generation.

In [25]:
# Create the cross-encoder retriever
cross_encoder_retriever = CrossEncoderRetriever(
    vectorstore=vectorstore,
    cross_encoder=cross_encoder,
    k=10,  # Retrieve 10 documents initially
    rerank_top_k=5  # Return top 5 after reranking
)

- We instantiate the `CrossEncoderRetriever`, passing in our FAISS vector store and the cross-encoder model.
- `k=10` means the retriever will first pull 10 documents based on vector similarity.
- These are then reranked by the cross-encoder, and only the top 5 (`rerank_top_k=5`) are passed along to the next stage.

In [26]:
# Set up the LLM
llm = ChatOpenAI(temperature=0, model_name="gpt-4o-mini-2024-07-18")

- We instantiate the `gpt-4o-mini-2024-07-18` chat model with `temperature=0` for more consistent answers.
- This model will be used to generate a final answer based on the selected documents.

In [27]:
# Create the RetrievalQA chain with the cross-encoder retriever
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=cross_encoder_retriever,
    return_source_documents=True
)

- We build a `RetrievalQA` chain, combining our LLM and custom retriever.
- The `chain_type="stuff"` setting means all the top documents are simply concatenated ("stuffed") into the prompt for the LLM.
- `return_source_documents=True` allows us to inspect which documents the model used to generate its answer — useful for debugging and transparency.

We run an example query that explores the effects of climate change on biodiversity. This query will flow through the retriever → reranker → LLM pipeline.

In [28]:
# Example query
query = "What are the impacts of climate change on biodiversity?"
result = qa_chain.invoke({"query": query})  # use invoke(), as in the earlier example; calling the chain directly is deprecated

print(f"\nQuestion: {query}")
print(f"Answer: {result['result']}")
print("\nRelevant source documents:")
for i, doc in enumerate(result["source_documents"]):
    print(f"\nDocument {i+1}:")
    print(doc.page_content[:200] + "...")  # Print first 200 characters of each document


Question: What are the impacts of climate change on biodiversity?
Answer: The impacts of climate change on biodiversity include habitat loss, species extinction, and disruptions to ecosystems. Changes in temperature and precipitation patterns can alter the timing and length of seasons, affecting plant and animal life cycles. Additionally, extreme weather events, such as hurricanes and droughts, can have devastating effects on communities and ecosystems. Coral reefs, for example, are highly sensitive to changes in temperature and acidity, leading to coral bleaching and mortality, which threatens marine biodiversity. Overall, climate change poses significant risks to the survival of various species and the health of ecosystems.

Relevant source documents:

Document 1:
Tropical rainforests are particularly important for carbon storage. Deforestation in the 
Amazon, Congo Basin, and Southeast Asia has significant impacts on global carbon cycles 
and biodiversity. The...

Document 2:
devel

Here we can assess how well the retriever worked — and whether the final response is grounded in the retrieved evidence. This method offers context-aware document selection, leading to more accurate and reliable responses from the LLM.