# Enhancing Retrieval with Reranking

Reranking is a two-stage process:

1. Initial Retrieval: First, we use our existing retrievers (like BM25, vector search, or hybrid approaches) to efficiently fetch a candidate set of potentially relevant documents.
2. Reranking: Then, we apply a more computationally intensive model to score and reorder these candidates based on their relevance to the query.
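
The shape of the two stages can be sketched without any models. This is a minimal illustration, not the implementation used later in this notebook: `cheap_score` and `careful_score` are hypothetical stand-ins (word overlap and Jaccard similarity) for a real first-stage retriever and a real reranking model, and `retrieve_then_rerank` is a name chosen for this example only.

```python
import re
from typing import Callable, List, Tuple


def tokens(text: str) -> set:
    """Lowercase word set; a crude tokenizer used only for this sketch."""
    return set(re.findall(r"\w+", text.lower()))


def cheap_score(query: str, doc: str) -> float:
    """Stage 1 stand-in: number of words shared by query and document."""
    return float(len(tokens(query) & tokens(doc)))


def careful_score(query: str, doc: str) -> float:
    """Stage 2 stand-in: Jaccard similarity of the two word sets."""
    q, d = tokens(query), tokens(doc)
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def retrieve_then_rerank(
    query: str,
    docs: List[str],
    first_stage: Callable[[str, str], float],
    second_stage: Callable[[str, str], float],
    fetch_k: int,
    top_n: int,
) -> List[Tuple[float, str]]:
    # Stage 1: score every document cheaply and keep the best fetch_k.
    candidates = sorted(docs, key=lambda d: first_stage(query, d), reverse=True)[:fetch_k]
    # Stage 2: score only the candidates with the more expensive function.
    scored = [(second_stage(query, d), d) for d in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


docs = [
    "Python libraries like PyTorch and TensorFlow are used for deep learning.",
    "Python is a high-level, interpreted programming language known for its readability.",
    "BM25 is a bag-of-words retrieval function used in information retrieval.",
    "Machine learning is a subset of artificial intelligence that learns from data.",
]
query = "How is Python used in machine learning?"
results = retrieve_then_rerank(query, docs, cheap_score, careful_score, fetch_k=3, top_n=2)
for score, doc in results:
    print(f"{score:.4f} - {doc}")
```

The second stage only ever sees `fetch_k` documents, which is what keeps an expensive scorer affordable.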

Cross-Encoders vs. Bi-Encoders:

- Bi-Encoders (like those used in vector search):
    - Encode queries and documents separately
    - Allow for pre-computation of document embeddings
    - Efficient for initial retrieval across large collections
    - Examples: OpenAI embeddings, sentence-transformers, etc.

- Cross-Encoders:
    - Process query and document pairs together
    - Capture complex interactions between query and document
    - More accurate at assessing relevance
    - Computationally more expensive (can't pre-compute)
    - Examples: BERT-based cross-encoders from Hugging Face
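
A rough way to see why the two model types are used at different stages is to count model forward passes. The sketch below is back-of-the-envelope arithmetic only: the function names are made up for this example, the corpus and query counts are illustrative, and the count ignores the cost of the vector similarity search itself.

```python
def bi_encoder_calls(num_docs: int, num_queries: int) -> int:
    """Documents are embedded once up front; each query adds one encode call."""
    return num_docs + num_queries


def cross_encoder_calls(num_candidates: int, num_queries: int) -> int:
    """Every (query, document) pair needs its own forward pass."""
    return num_candidates * num_queries


num_docs, num_queries, fetch_k = 100_000, 1_000, 20

print(bi_encoder_calls(num_docs, num_queries))     # 101000
print(cross_encoder_calls(num_docs, num_queries))  # 100000000 (cross-encoder over the whole corpus)
print(cross_encoder_calls(fetch_k, num_queries))   # 20000 (cross-encoder over fetch_k candidates)
```

Restricting the cross-encoder to a small candidate set is what brings its cost back into the same range as the bi-encoder.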


In [1]:
from typing import List
import numpy as np

# LlamaIndex imports
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import NodeWithScore, QueryBundle, Document, TextNode
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.retrievers.bm25 import BM25Retriever

# Hugging Face imports
from sentence_transformers import CrossEncoder, SentenceTransformer

# For demonstration
import time

In [2]:
class BiEncoder:
    """BiEncoder using Sentence Transformers for initial retrieval.
    
    Bi-encoders encode queries and documents separately, allowing
    for efficient retrieval from large collections by pre-computing 
    document embeddings.
    """

    def __init__(
        self,
        model_name: str = "sentence-transformers/all-MiniLM-L6-v2"
    ):
        """Initialize with SentenceTransformer model."""
        self.model_name = model_name

        # Load bi-encoder model (CPU for Codespaces)
        self.model = SentenceTransformer(model_name, device="cpu")
        print(f"Loaded BiEncoder model: {model_name}")

    def encode(self, texts):
        """Encode texts to embedding vectors."""
        return self.model.encode(texts)

In [3]:
class CrossEncoderReranker:
    """Reranker using HuggingFace Cross-Encoder models.
    
    Cross-encoders process query-document pairs together to capture
    complex interactions, providing more accurate relevance scores.
    """

    def __init__(
        self,
        model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
        top_n: int = None,
    ):
        """Initialize with CrossEncoder model."""
        self.model_name = model_name
        self.top_n = top_n

        # Load cross-encoder model (always use CPU for Codespaces)
        self.model = CrossEncoder(model_name, device="cpu")
        print(f"Loaded CrossEncoder model: {model_name}")

    def rerank(
        self,
        query: str,
        nodes: List[NodeWithScore]
    ) -> List[NodeWithScore]:
        """Rerank nodes for a given query."""
        if not nodes:
            return []

        # Extract texts from nodes
        node_texts = [node.node.get_content() for node in nodes]

        # Create query-document pairs
        query_doc_pairs = [(query, text) for text in node_texts]

        # Get scores from cross-encoder
        rerank_scores = self.model.predict(query_doc_pairs)

        # Create new NodeWithScore objects with updated scores
        reranked_nodes = []
        for i, node in enumerate(nodes):
            reranked_node = NodeWithScore(
                node=node.node,
                score=float(rerank_scores[i])
            )
            reranked_nodes.append(reranked_node)

        # Sort by new scores (descending)
        reranked_nodes.sort(key=lambda x: x.score, reverse=True)

        # Apply top_n filter if specified
        if self.top_n is not None and self.top_n < len(reranked_nodes):
            reranked_nodes = reranked_nodes[:self.top_n]

        return reranked_nodes

In [4]:
class RerankedRetriever(BaseRetriever):
    """Retriever with reranking capabilities."""

    def __init__(
        self,
        base_retriever: BaseRetriever,
        reranker: CrossEncoderReranker,
        fetch_k: int = 20,
    ):
        """Initialize with base retriever and reranker."""
        self.base_retriever = base_retriever
        self.reranker = reranker
        self.fetch_k = fetch_k
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve and rerank nodes for the given query."""
        # Step 1: Get initial candidates from base retriever
        base_nodes = self.base_retriever.retrieve(query_bundle)

        # Limit candidates if fetch_k is specified
        if self.fetch_k is not None and len(base_nodes) > self.fetch_k:
            base_nodes = base_nodes[:self.fetch_k]

        # Step 2: Rerank the candidates
        reranked_nodes = self.reranker.rerank(
            query=query_bundle.query_str,
            nodes=base_nodes
        )

        return reranked_nodes

In [5]:
def create_sample_documents():
    """Create a set of sample documents for demonstration."""
    texts = [
        "Python is a high-level, interpreted programming language known for its readability.",
        "Machine learning is a subset of artificial intelligence that learns from data.",
        "Neural networks are computing systems inspired by biological neural networks.",
        "Deep learning uses neural networks with many layers to extract features from data.",
        "Natural language processing helps computers understand human language.",
        "Python libraries like PyTorch and TensorFlow are used for deep learning.",
        "BM25 is a bag-of-words retrieval function used in information retrieval.",
        "Vector search finds documents by measuring similarity in embedding space.",
        "Reranking refines initial search results with a more complex model.",
        "Hybrid search combines multiple retrieval methods to improve search quality."
    ]

    documents = []
    for i, text in enumerate(texts):
        doc = Document(text=text, id_=f"doc_{i}")
        documents.append(doc)

    return documents

In [6]:

"""Demo of reranking with a simple example."""

# Create sample documents
documents = create_sample_documents()

# Create nodes from documents
nodes = [TextNode(text=doc.text, id_=doc.id_) for doc in documents]

# Set up bi-encoder for vector search
# This explicitly shows the bi-encoder component and configures LlamaIndex to use it
model_name = "sentence-transformers/all-MiniLM-L6-v2"
bi_encoder = SentenceTransformer(model_name, device="cpu")
embed_model = HuggingFaceEmbedding(model_name=model_name)

# Attach embeddings to the nodes up front
# We compute them manually so the bi-encoder step is visible
for node in nodes:
    text = node.get_content()
    # encode() returns a NumPy array; store it as a plain list of floats
    node.embedding = bi_encoder.encode(text).tolist()

# Create vector index with our pre-embedded nodes
vector_index = VectorStoreIndex(nodes=nodes, embed_model=embed_model)
vector_retriever = vector_index.as_retriever(similarity_top_k=5)

# Set up BM25 retriever
bm25_retriever = BM25Retriever.from_defaults(
    nodes=nodes, similarity_top_k=5)

# Create hybrid retriever (same as from Video 3)
class WeightedFusionRetriever(BaseRetriever):
    def __init__(self, retrievers, weights):
        self.retrievers = retrievers
        self.weights = weights
        super().__init__()

    def _retrieve(self, query_bundle):
        all_results = {}
        for name, retriever in self.retrievers.items():
            results = retriever.retrieve(query_bundle)
            weight = self.weights.get(name, 1.0)

            for node_with_score in results:
                node_id = node_with_score.node.node_id
                weighted_score = node_with_score.score * weight

                if node_id not in all_results:
                    all_results[node_id] = {
                        "node": node_with_score.node,
                        "scores": {}
                    }
                all_results[node_id]["scores"][name] = weighted_score

        final_results = []
        for node_id, data in all_results.items():
            combined_score = sum(data["scores"].values())
            node_with_score = NodeWithScore(
                node=data["node"],
                score=combined_score
            )
            final_results.append(node_with_score)

        final_results.sort(key=lambda x: x.score, reverse=True)
        return final_results

# Create weighted fusion retriever
hybrid_retriever = WeightedFusionRetriever(
    retrievers={"vector": vector_retriever, "bm25": bm25_retriever},
    weights={"vector": 0.7, "bm25": 0.3}
)

# Create cross-encoder reranker
reranker = CrossEncoderReranker(
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",
    top_n=3
)

# Create reranked retriever
reranked_retriever = RerankedRetriever(
    base_retriever=hybrid_retriever,
    reranker=reranker,
    fetch_k=5
)

Loaded CrossEncoder model: cross-encoder/ms-marco-MiniLM-L-6-v2


In [7]:
# Show how bi-encoder works
print("1. EMBEDDING SIMILARITY (BI-ENCODER)")
print("-----------------------------------")
print("Bi-encoders encode queries and documents separately.")

# Define our query
query = "How is Python used in machine learning?"

# Encode query and a document
query_embedding = bi_encoder.encode(query)
doc_embedding = bi_encoder.encode(
    "Python libraries like PyTorch and TensorFlow are used for deep learning.")

# Cosine similarity: dot product divided by the product of the vector norms
query_norm = np.linalg.norm(query_embedding)
doc_norm = np.linalg.norm(doc_embedding)
similarity = np.dot(query_embedding, doc_embedding) / (query_norm * doc_norm)

print(f"Query: '{query}'")
print(f"Query embedding dimensions: {len(query_embedding)}")
print(f"Document embedding dimensions: {len(doc_embedding)}")
print(f"Similarity score: {similarity:.4f}")
print("This score was calculated without the query and document seeing each other.")

1. EMBEDDING SIMILARITY (BI-ENCODER)
-----------------------------------
Bi-encoders encode queries and documents separately.
Query: 'How is Python used in machine learning?'
Query embedding dimensions: 384
Document embedding dimensions: 384
Similarity score: 0.6004
This score was calculated without the query and document seeing each other.


In [8]:
print("\n2. HYBRID RETRIEVAL RESULTS")
print("-------------------------")
hybrid_start = time.time()
hybrid_results = hybrid_retriever.retrieve(QueryBundle(query))
hybrid_time = time.time() - hybrid_start

for i, node in enumerate(hybrid_results, 1):
    print(f"{i}. Score: {node.score:.4f} - {node.node.get_content()}")

print(f"Time: {hybrid_time:.4f}s")


2. HYBRID RETRIEVAL RESULTS
-------------------------
1. Score: 0.8852 - Python libraries like PyTorch and TensorFlow are used for deep learning.
2. Score: 0.8512 - Machine learning is a subset of artificial intelligence that learns from data.
3. Score: 0.6068 - Python is a high-level, interpreted programming language known for its readability.
4. Score: 0.5325 - Deep learning uses neural networks with many layers to extract features from data.
5. Score: 0.2845 - Natural language processing helps computers understand human language.
6. Score: 0.1412 - BM25 is a bag-of-words retrieval function used in information retrieval.
Time: 0.0367s
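
One caveat about the combined scores above: BM25 scores are unbounded, while cosine similarities fall in [-1, 1], so a weighted sum of raw scores mixes two different scales. A possible variation, which the `WeightedFusionRetriever` in this notebook does not do, is to min-max normalize each retriever's scores before weighting. The sketch below assumes hypothetical helper names (`min_max_normalize`, `weighted_fusion`) and made-up scores.

```python
from typing import Dict


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one retriever's scores to [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as equally good.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(
    per_retriever: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> Dict[str, float]:
    """Sum weighted, normalized scores per document, best first."""
    combined: Dict[str, float] = {}
    for name, scores in per_retriever.items():
        weight = weights.get(name, 1.0)
        for doc_id, s in min_max_normalize(scores).items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weight * s
    return dict(sorted(combined.items(), key=lambda kv: kv[1], reverse=True))


# Made-up scores for illustration only.
per_retriever = {
    "vector": {"doc_5": 0.60, "doc_1": 0.55, "doc_0": 0.40},
    "bm25": {"doc_5": 4.2, "doc_0": 3.1, "doc_6": 0.9},
}
fused = weighted_fusion(per_retriever, {"vector": 0.7, "bm25": 0.3})
for doc_id, score in fused.items():
    print(f"{doc_id}: {score:.4f}")
```

Min-max scaling is sensitive to outliers and to the size of the result list, so rank-based schemes are another option worth comparing.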


In [9]:
print("\n3. CROSS-ENCODER RERANKING")
print("-------------------------")
print("Cross-encoders process query and document pairs together to better capture relevance.")
rerank_start = time.time()
reranked_results = reranked_retriever.retrieve(QueryBundle(query))
rerank_time = time.time() - rerank_start

for i, node in enumerate(reranked_results, 1):
    print(f"{i}. Score: {node.score:.4f} - {node.node.get_content()}")

print(f"Time: {rerank_time:.4f}s")
print(f"Reranking Overhead: {rerank_time - hybrid_time:.4f}s")


3. CROSS-ENCODER RERANKING
-------------------------
Cross-encoders process query and document pairs together to better capture relevance.
1. Score: 5.1547 - Python libraries like PyTorch and TensorFlow are used for deep learning.
2. Score: 0.4332 - Python is a high-level, interpreted programming language known for its readability.
3. Score: -2.5431 - Machine learning is a subset of artificial intelligence that learns from data.
Time: 1.3182s
Reranking Overhead: 1.2816s


In [10]:
print("\n4. DIRECT CROSS-ENCODER VS BI-ENCODER COMPARISON")
print("----------------------------------------------")
test_doc = "Python libraries like PyTorch and TensorFlow are used for deep learning."

# Cross-encoder score
cross_scores = reranker.model.predict([(query, test_doc)])
print(
    f"Cross-encoder score: {cross_scores[0]:.4f} (evaluates query-document pair together)")

# Bi-encoder similarity from earlier
print(
    f"Bi-encoder similarity: {similarity:.4f} (compares separate embeddings)")
print("The cross-encoder can capture more complex relevance patterns between query and document.")


4. DIRECT CROSS-ENCODER VS BI-ENCODER COMPARISON
----------------------------------------------
Cross-encoder score: 5.1547 (evaluates query-document pair together)
Bi-encoder similarity: 0.6004 (compares separate embeddings)
The cross-encoder can capture more complex relevance patterns between query and document.
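
Note that the two numbers above are not on the same scale: cosine similarity is bounded, while the cross-encoder scores in this run include a negative value, which suggests they are unbounded raw scores rather than probabilities. If a bounded score is more convenient, one option is to pass the raw values through a sigmoid. This is a minimal sketch under that assumption, using the three reranked scores printed earlier; it preserves the ordering and only changes the scale.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function mapping any real number to (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Raw scores copied from the reranking output above.
raw_scores = [5.1547, 0.4332, -2.5431]
squashed = [sigmoid(s) for s in raw_scores]

for raw, prob in zip(raw_scores, squashed):
    print(f"{raw:+.4f} -> {prob:.4f}")
```

Even after squashing, these values should be read as relative relevance scores for ranking, not as calibrated probabilities.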


In [11]:
print("\n5. RETRIEVAL QUALITY COMPARISON")
print("-----------------------------")
print("Let's see how the document ranking changed after reranking:")

# Get the contents of top results from each method
hybrid_contents = [node.node.get_content() for node in hybrid_results[:3]]
reranked_contents = [node.node.get_content() for node in reranked_results[:3]]

print("\nChanges in ranking:")
for i, content in enumerate(reranked_contents):
    if content in hybrid_contents:
        old_rank = hybrid_contents.index(content) + 1
        if old_rank != i + 1:
            print(f"Document moved from position {old_rank} to {i+1}")
    else:
        print(f"New document at position {i+1} (wasn't in top 3 before)")


5. RETRIEVAL QUALITY COMPARISON
-----------------------------
Let's see how the document ranking changed after reranking:

Changes in ranking:
Document moved from position 3 to 2
Document moved from position 2 to 3


# Additional Considerations

1. Latency-Quality Tradeoff:

    - Reranking adds processing time (in our example, about 1.28 seconds on CPU, versus roughly 0.04 seconds for hybrid retrieval alone)
    - This overhead scales with the number of candidates being reranked
    - Adjust fetch_k to balance between quality and performance

2. Model Selection:

    - Smaller cross-encoder models (like the one we used) are faster but less accurate
    - Larger models provide better quality but with higher computational costs
    - Consider distilled models that balance speed and accuracy

3. Resource Requirements:

    - Cross-encoders are more compute-intensive than bi-encoders
    - GPU acceleration can significantly improve throughput
    - Consider batching and asynchronous processing for efficiency

4. Candidate Selection:

    - The quality of reranking depends on having good initial candidates
    - If the initial retrieval misses relevant documents, reranking can't fix it
    - Always focus on improving both stages of the pipeline

5. Result Diversity:

    - Reranking can sometimes over-prioritize one aspect of relevance
    - Consider adding diversity mechanisms if this is a concern
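
One common diversity mechanism is Maximal Marginal Relevance (MMR), which is not used anywhere in this notebook. It picks results greedily, trading relevance against similarity to results already selected. The sketch below is a minimal, dependency-free version: `mmr_select` and `lambda_mult` are names chosen for this example, the relevance scores and two-dimensional embeddings are made up, and it assumes relevance scores are on a scale comparable to cosine similarity (for example, after squashing to the 0 to 1 range).

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(
    relevance: Dict[str, float],
    embeddings: Dict[str, Sequence[float]],
    k: int,
    lambda_mult: float = 0.7,
) -> List[str]:
    """Greedily pick k ids, balancing relevance against redundancy."""
    selected: List[str] = []
    remaining = set(relevance)
    while remaining and len(selected) < k:

        def mmr_score(doc_id: str) -> float:
            # Redundancy: highest similarity to anything already selected.
            redundancy = max(
                (cosine(embeddings[doc_id], embeddings[s]) for s in selected),
                default=0.0,
            )
            return lambda_mult * relevance[doc_id] - (1.0 - lambda_mult) * redundancy

        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Made-up data: "b" is nearly a duplicate of "a"; "c" is less relevant but different.
relevance = {"a": 0.90, "b": 0.85, "c": 0.50}
embeddings = {"a": [1.0, 0.0], "b": [0.99, 0.14], "c": [0.0, 1.0]}

print(mmr_select(relevance, embeddings, k=2, lambda_mult=0.5))  # ['a', 'c']
print(mmr_select(relevance, embeddings, k=2, lambda_mult=1.0))  # ['a', 'b']
```

With `lambda_mult=1.0` the selection reduces to plain relevance ordering; lower values penalize near-duplicates more strongly.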