# Demo #5: Re-Ranking with Cross-Encoders - Post-Retrieval Refinement

## Overview

This notebook demonstrates **Two-Stage Retrieval** using bi-encoders for fast initial retrieval followed by cross-encoders for precise re-ranking.

### The Two-Stage Architecture

**Stage 1 - Bi-Encoder (Fast Recall):**
- Query and documents encoded separately
- Fast cosine similarity search
- Retrieve top-N candidates (e.g., 20)

**Stage 2 - Cross-Encoder (Accurate Precision):**
- Query + Document concatenated
- Deep attention mechanism
- Precise relevance scoring
- Re-rank to top-K (e.g., 5)

### Key Concepts Demonstrated
- Two-stage retrieval architecture
- Bi-encoder vs. Cross-encoder comparison
- Re-ranking for precision optimization
- Trade-off between speed and accuracy

### Data Flow
```
Query → Bi-encoder retrieval (top-20, fast) → 
Cross-encoder re-ranking (top-5, accurate) → 
LLM generation with highest-quality context
```
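The flow above can be sketched without loading any model. In this minimal sketch, `two_stage_search` is an illustrative name (not a LlamaIndex API), and `token_overlap` and `phrase_bonus` are hypothetical stand-ins for the bi-encoder and cross-encoder: simple string heuristics, not real encoders.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def two_stage_search(
    query: str,
    docs: List[str],
    fast_score: Scorer,
    precise_score: Scorer,
    top_n: int = 20,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Retrieve top_n candidates with a cheap scorer, then re-rank to top_k."""
    # Stage 1: cheap score over the whole corpus, keep the best top_n.
    candidates = sorted(docs, key=lambda d: fast_score(query, d), reverse=True)[:top_n]
    # Stage 2: expensive score over the candidates only, keep the best top_k.
    rescored = [(d, precise_score(query, d)) for d in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_k]


# Hypothetical stand-in scorers (NOT real encoders), for illustration only.
def token_overlap(query: str, doc: str) -> float:
    """Stage-1 stand-in: fraction of query tokens that appear in the document."""
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / max(len(q), 1)


def phrase_bonus(query: str, doc: str) -> float:
    """Stage-2 stand-in: token overlap plus a bonus for an exact phrase match."""
    return token_overlap(query, doc) + (1.0 if query.lower() in doc.lower() else 0.0)


docs = [
    "attention self models",
    "attention is all you need",
    "self attention lets each token attend to every other token",
    "gradient descent updates weights",
]
results = two_stage_search("self attention", docs, token_overlap, phrase_bonus, top_n=3, top_k=2)
for doc, score in results:
    print(f"{score:.1f}  {doc}")
```

With these toy scorers, stage 1 cannot separate the first and third documents (both contain every query token), while the phrase-aware stage 2 promotes the third. A real cross-encoder plays the same role, using attention over the joint input rather than a string match.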

### References
- **Rerankers and Two-Stage Retrieval**: Pinecone Documentation
- **Research**: "Comparative Analysis of Lion and AdamW Optimizers for Cross-Encoder Reranking" (Hugging Face Papers)
- **Model**: cross-encoder/ms-marco-MiniLM-L6-v2 (4.9M downloads)

## 1. Environment Setup

In [None]:
# Install required packages (uncomment to run; pandas is used in Sections 11 and 12)
# !pip install llama-index llama-index-llms-azure-openai llama-index-embeddings-azure-openai
# !pip install sentence-transformers torch python-dotenv pandas

In [None]:
import os
from dotenv import load_dotenv
from typing import List, Optional
import warnings
warnings.filterwarnings('ignore')

# LlamaIndex imports
from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    Settings
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import NodeWithScore, QueryBundle
from llama_index.core.postprocessor import BaseNodePostprocessor
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.llms.azure_openai import AzureOpenAI

# Sentence Transformers for cross-encoder
from sentence_transformers import CrossEncoder
import torch

# Load environment variables
load_dotenv()

print("✓ Environment setup complete")
print(f"  PyTorch: {torch.__version__}")
print(f"  CUDA available: {torch.cuda.is_available()}")

## 2. Configure Azure OpenAI

In [None]:
# Initialize Azure OpenAI LLM
azure_llm = AzureOpenAI(
    engine=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"),
    model="gpt-4",
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
    temperature=0.1
)

# Initialize Azure OpenAI Embedding (Bi-Encoder)
azure_embed = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name=os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
)

# Set global defaults
Settings.llm = azure_llm
Settings.embed_model = azure_embed
Settings.chunk_size = 512
Settings.chunk_overlap = 50

print("✓ Azure OpenAI configured")
print(f"  LLM: {os.getenv('AZURE_OPENAI_DEPLOYMENT_NAME')}")
print(f"  Embeddings (Bi-Encoder): {os.getenv('AZURE_OPENAI_EMBEDDING_DEPLOYMENT')}")

## 3. Load Cross-Encoder Model

We'll use the popular **cross-encoder/ms-marco-MiniLM-L6-v2** model from Hugging Face.
- 4.9M downloads
- Trained on MS MARCO dataset
- Optimized for passage ranking

In [None]:
# Load cross-encoder model
print("Loading cross-encoder model...")
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L6-v2')

print("✓ Cross-encoder model loaded")
print("  Model: cross-encoder/ms-marco-MiniLM-L6-v2")
print("  Downloads: 4.9M+")
print("  Use case: Passage ranking and re-ranking")

## 4. Prepare Knowledge Base

We'll use a moderately sized knowledge base with topically similar but semantically distinct content to demonstrate re-ranking effectiveness.

In [None]:
# Load documents from two domains
tech_docs_path = "../RAG_v2/data/tech_docs/"
ml_concepts_path = "../RAG_v2/data/ml_concepts/"

# Load tech documents
tech_reader = SimpleDirectoryReader(input_dir=tech_docs_path)
tech_documents = tech_reader.load_data()

# Load ML concept documents
ml_reader = SimpleDirectoryReader(input_dir=ml_concepts_path)
ml_documents = ml_reader.load_data()

# Combine all documents
all_documents = tech_documents + ml_documents

print(f"✓ Loaded {len(all_documents)} documents")
print(f"  Tech docs: {len(tech_documents)}")
print(f"  ML concept docs: {len(ml_documents)}")

for i, doc in enumerate(all_documents[:5]):
    print(f"  {i+1}. {doc.metadata.get('file_name', 'Unknown')} ({len(doc.text)} chars)")

In [None]:
# Parse documents into chunks
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(all_documents)

print(f"✓ Created {len(nodes)} chunks from documents")
print(f"  Average chunk size: {sum(len(n.text) for n in nodes)//len(nodes)} characters")
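To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch. It splits on whitespace-separated words, which is an assumption made for illustration: `SentenceSplitter` counts tokens and prefers sentence boundaries, so its output will differ. `window_chunks` is a hypothetical helper.

```python
from typing import List


def window_chunks(words: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split a word list into windows of chunk_size that share chunk_overlap words."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[List[str]] = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the input
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = window_chunks(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` words after the previous one, so neighbouring chunks share their boundary words and a sentence cut at a chunk edge still appears whole in at least one chunk when it is shorter than the overlap.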

## 5. Build Baseline Query Engine (Bi-Encoder Only)

First, establish a baseline with standard bi-encoder retrieval (no re-ranking).

In [None]:
# Create vector index
index = VectorStoreIndex(nodes, embed_model=azure_embed)

# Create baseline retriever with high top-k
baseline_retriever = index.as_retriever(similarity_top_k=20)

# Create baseline query engine
baseline_query_engine = index.as_query_engine(
    similarity_top_k=5,  # Only top 5 for generation
    llm=azure_llm
)

print("✓ Baseline query engine ready")
print("  Stage 1: Bi-encoder retrieval (top-5)")
print("  Stage 2: No re-ranking")

## 6. Implement Cross-Encoder Re-Ranker

Create a custom postprocessor that re-ranks nodes using the cross-encoder.

In [None]:
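from typing import Any
from llama_index.core.bridge.pydantic import Field, PrivateAttr


class CrossEncoderReranker(BaseNodePostprocessor):
    """Re-rank nodes using a cross-encoder model."""

    # BaseNodePostprocessor is a Pydantic model: public attributes must be
    # declared as fields, and the model object is kept as a private attribute.
    top_n: int = Field(default=5, description="Number of nodes to keep after re-ranking.")
    _model: Any = PrivateAttr()

    def __init__(
        self,
        model: CrossEncoder,
        top_n: int = 5,
    ):
        # Initialize the Pydantic state before assigning any attributes
        super().__init__(top_n=top_n)
        self._model = model

    def _postprocess_nodes(
        self,
        nodes: List[NodeWithScore],
        query_bundle: Optional[QueryBundle] = None,
    ) -> List[NodeWithScore]:
        """Re-rank nodes using cross-encoder."""

        # Nothing to score without a query or without candidates
        if query_bundle is None or not nodes:
            return nodes

        query_str = query_bundle.query_str

        # Prepare query-document pairs for cross-encoder
        pairs = [[query_str, node.node.get_content()] for node in nodes]

        # Get cross-encoder scores (one score per pair)
        scores = self._model.predict(pairs)

        # Replace the bi-encoder scores with cross-encoder scores
        for node, score in zip(nodes, scores):
            node.score = float(score)

        # Sort by new scores (without mutating the input list) and return top-n
        return sorted(nodes, key=lambda x: x.score, reverse=True)[: self.top_n]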

# Create re-ranker
reranker = CrossEncoderReranker(model=cross_encoder, top_n=5)

print("✓ Cross-encoder re-ranker created")
print("  Input: Top-20 from bi-encoder")
print("  Output: Top-5 after re-ranking")

## 7. Build Re-Ranking Query Engine

Create a query engine with two-stage retrieval: bi-encoder retrieval followed by cross-encoder re-ranking.

In [None]:
# Create re-ranking query engine
rerank_query_engine = index.as_query_engine(
    similarity_top_k=20,  # Bi-encoder retrieves 20
    node_postprocessors=[reranker],  # Cross-encoder re-ranks to 5
    llm=azure_llm
)

print("✓ Re-ranking query engine ready")
print("  Stage 1: Bi-encoder retrieval (top-20)")
print("  Stage 2: Cross-encoder re-ranking (top-5)")

## 8. Test Query 1: Complex Technical Query

Test with a query where initial retrieval may include noise.

In [None]:
test_query_1 = "Explain how transformer models use self-attention mechanisms for sequence processing."

print("="*80)
print(f"TEST QUERY: {test_query_1}")
print("="*80)

### 8.1 Baseline: Bi-Encoder Only (Top-20 Retrieved)

In [None]:
# Retrieve top-20 with bi-encoder only (QueryBundle was imported in Section 1)

query_bundle = QueryBundle(query_str=test_query_1)
baseline_nodes_20 = baseline_retriever.retrieve(query_bundle)

print("\n" + "="*80)
print("BASELINE: BI-ENCODER RETRIEVAL (Top-20)")
print("="*80)

print("\n📊 Retrieved Nodes (showing first 10):")
for i, node in enumerate(baseline_nodes_20[:10]):
    print(f"\n{i+1}. Score: {node.score:.4f}")
    print(f"   Source: {node.node.metadata.get('file_name', 'Unknown')}")
    print(f"   Preview: {node.text[:150]}...")

### 8.2 Analyze Top-5 from Baseline

In [None]:
print("\n" + "="*80)
print("BASELINE TOP-5 (Direct from Bi-Encoder)")
print("="*80)

response_baseline = baseline_query_engine.query(test_query_1)

print("\n📄 Top-5 Chunks Used for Generation:")
for i, node in enumerate(response_baseline.source_nodes):
    print(f"\n{i+1}. Bi-Encoder Score: {node.score:.4f}")
    print(f"   Source: {node.node.metadata.get('file_name', 'Unknown')}")
    print(f"   Text: {node.text[:200]}...")

print("\n💡 Generated Answer:")
print(response_baseline.response)

### 8.3 Re-Ranked Results with Cross-Encoder

In [None]:
print("\n" + "="*80)
print("RE-RANKED TOP-5 (Bi-Encoder → Cross-Encoder)")
print("="*80)

response_rerank = rerank_query_engine.query(test_query_1)

print("\n📄 Top-5 Chunks After Re-Ranking:")
for i, node in enumerate(response_rerank.source_nodes):
    print(f"\n{i+1}. Cross-Encoder Score: {node.score:.4f}")
    print(f"   Source: {node.node.metadata.get('file_name', 'Unknown')}")
    print(f"   Text: {node.text[:200]}...")

print("\n💡 Generated Answer:")
print(response_rerank.response)
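The scores printed in 8.2 and 8.3 are not on the same scale. Bi-encoder scores are similarity values, while a cross-encoder may return unbounded raw logits, depending on how the model is configured (an assumption worth checking by printing a few scores). If a bounded value is convenient for display or thresholds, a logistic sigmoid is one common mapping; it is monotonic, so the ranking does not change. `to_probability` is a hypothetical helper, and the logits below are made up.

```python
import math
from typing import List


def to_probability(logit: float) -> float:
    """Map an unbounded score into the 0..1 range with a numerically stable sigmoid."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)  # avoids overflow for large negative inputs
    return z / (1.0 + z)


raw_scores: List[float] = [8.2, 0.0, -3.5]  # made-up example logits
bounded = [to_probability(s) for s in raw_scores]
for raw, prob in zip(raw_scores, bounded):
    print(f"{raw:+.1f} -> {prob:.4f}")
```

Even after this mapping, cross-encoder and bi-encoder scores measure different things, so they should be compared by rank rather than by value.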

### 8.4 Visualize Rank Changes

In [None]:
print("\n" + "="*80)
print("RANK CHANGES: Bi-Encoder vs Cross-Encoder")
print("="*80)

# Map nodes to their sources
baseline_sources = [node.node.metadata.get('file_name', 'Unknown') 
                   for node in response_baseline.source_nodes]
rerank_sources = [node.node.metadata.get('file_name', 'Unknown') 
                 for node in response_rerank.source_nodes]

print("\n📊 Source Distribution Comparison:")
print("\nBi-Encoder Only (Baseline):")
for i, source in enumerate(baseline_sources):
    print(f"  Rank {i+1}: {source}")

print("\nAfter Cross-Encoder Re-Ranking:")
for i, source in enumerate(rerank_sources):
    print(f"  Rank {i+1}: {source}")
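The listing above compares file names, so two chunks from the same file are indistinguishable and actual rank movement is not visible. A per-chunk view needs a stable identifier. The sketch below works on plain lists of IDs; with the objects in this notebook, the IDs could be taken from `node.node.node_id` (assumed to be available on each retrieved node), with the "before" list built from `baseline_nodes_20` so that every re-ranked chunk has an original rank. `rank_changes` is a hypothetical helper, and the IDs are made up.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, Optional[int], int, Optional[int]]


def rank_changes(before: List[str], after: List[str]) -> List[Row]:
    """For each ID in `after`, return (id, old_rank, new_rank, places_gained)."""
    old_rank: Dict[str, int] = {node_id: i + 1 for i, node_id in enumerate(before)}
    rows: List[Row] = []
    for new, node_id in enumerate(after, start=1):
        old = old_rank.get(node_id)  # None if the ID was not in `before`
        gained = None if old is None else old - new
        rows.append((node_id, old, new, gained))
    return rows


before_ids = ["a", "b", "c", "d", "e", "f"]  # made-up bi-encoder order
after_ids = ["d", "a", "f"]                  # made-up re-ranked order
for node_id, old, new, gained in rank_changes(before_ids, after_ids):
    if gained is None:
        print(f"{node_id}: not in original list -> rank {new}")
    else:
        print(f"{node_id}: rank {old} -> rank {new} ({gained:+d})")
```

A positive value means the chunk moved up after re-ranking; a negative value means it moved down.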

## 9. Test Query 2: Ambiguous Query

In [None]:
test_query_2 = "What are the key challenges in training deep neural networks?"

print("="*80)
print(f"TEST QUERY: {test_query_2}")
print("="*80)

In [None]:
# Baseline
print("\n🔵 BASELINE (Bi-Encoder Only):")
response_b2 = baseline_query_engine.query(test_query_2)

print("\nTop-5 Sources:")
for i, node in enumerate(response_b2.source_nodes):
    print(f"  {i+1}. {node.node.metadata.get('file_name', 'Unknown')} (Score: {node.score:.4f})")

print(f"\nAnswer Preview: {response_b2.response[:250]}...")

In [None]:
# Re-ranked
print("\n🟢 RE-RANKED (Bi-Encoder → Cross-Encoder):")
response_r2 = rerank_query_engine.query(test_query_2)

print("\nTop-5 Sources After Re-Ranking:")
for i, node in enumerate(response_r2.source_nodes):
    print(f"  {i+1}. {node.node.metadata.get('file_name', 'Unknown')} (Score: {node.score:.4f})")

print(f"\nAnswer Preview: {response_r2.response[:250]}...")

## 10. Bi-Encoder vs Cross-Encoder: Architecture Explanation
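The speed difference between the two architectures comes down to how many transformer forward passes happen at query time. The sketch below counts them under a simplified cost model, which is an assumption: document embeddings are pre-computed, and vector comparisons are treated as negligible next to a forward pass. The function names are hypothetical.

```python
def bi_encoder_passes(num_docs: int) -> int:
    """Query-time forward passes: encode the query once; docs are pre-computed."""
    return 1


def cross_encoder_passes(num_docs: int) -> int:
    """Query-time forward passes: one per (query, document) pair."""
    return num_docs


def two_stage_passes(num_docs: int, top_n: int) -> int:
    """One query encoding plus one pass per retrieved candidate."""
    return bi_encoder_passes(num_docs) + cross_encoder_passes(min(top_n, num_docs))


corpus_size = 1_000_000
print(bi_encoder_passes(corpus_size))           # 1
print(cross_encoder_passes(corpus_size))        # 1000000
print(two_stage_passes(corpus_size, top_n=20))  # 21
```

Under this model the two-stage cost depends on the candidate pool size, not on the corpus size, which is why a cross-encoder stays affordable as a second stage.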

In [None]:
print("\n" + "="*80)
print("BI-ENCODER vs CROSS-ENCODER: ARCHITECTURAL DIFFERENCES")
print("="*80)

print("""
╔════════════════════════════════════════════════════════════════════╗
║                         BI-ENCODER                                 ║
╚════════════════════════════════════════════════════════════════════╝

    Query: "transformer attention"          Document: "attention mechanism..."
          │                                            │
          ▼                                            ▼
    ┌──────────┐                                ┌──────────┐
    │ Encoder  │                                │ Encoder  │
    │ (BERT)   │                                │ (BERT)   │
    └─────┬────┘                                └─────┬────┘
          │                                            │
          ▼                                            ▼
    [0.23, 0.45, ...]                          [0.21, 0.48, ...]
    (Query Embedding)                          (Doc Embedding)
          │                                            │
          └────────────────────┬───────────────────────┘
                               ▼
                      Cosine Similarity
                         Score: 0.87

✓ FAST: Encode once, compare many (millions of docs)
✓ SCALABLE: Pre-compute document embeddings
⚠ APPROXIMATE: No query-document interaction

╔════════════════════════════════════════════════════════════════════╗
║                        CROSS-ENCODER                               ║
╚════════════════════════════════════════════════════════════════════╝

    Query + Document Concatenated:
    "[CLS] transformer attention [SEP] attention mechanism... [SEP]"
                               │
                               ▼
                    ┌─────────────────────┐
                    │   BERT Encoder      │
                    │   (Deep Attention)  │
                    │   Query ⟷ Document │
                    └──────────┬──────────┘
                               │
                               ▼
                        [CLS] Token
                               │
                               ▼
                    ┌──────────────────┐
                    │ Classification   │
                    │ Head (Linear)    │
                    └──────┬───────────┘
                           │
                           ▼
                    Relevance Score
                       Score: 0.94

✓ ACCURATE: Full attention between query and document
✓ PRECISE: Captures nuanced semantic relationships
⚠ SLOW: Must process each query-document pair separately

╔════════════════════════════════════════════════════════════════════╗
║                      TWO-STAGE STRATEGY                            ║
╚════════════════════════════════════════════════════════════════════╝

1. Use Bi-Encoder for FAST RECALL (top-N candidates: 20 in this demo, often 100-1000 at scale)
   → Screen millions of documents efficiently
   → Get "good enough" top-N candidates

2. Use Cross-Encoder for PRECISE RANKING (re-rank top-N → top-K: 20 → 5 in this demo)
   → Deep analysis of query-document interaction
   → More accurate final rankings

Result: Best of both worlds - Speed + Accuracy
""")

## 11. Performance Analysis

In [None]:
import pandas as pd
import time

# Benchmark retrieval times
print("\n" + "="*80)
print("PERFORMANCE BENCHMARKING")
print("="*80)

test_queries = [
    "What is BERT?",
    "Explain gradient descent optimization",
    "How do transformers handle long sequences?"
]

baseline_times = []
rerank_times = []

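for query in test_queries:
    # Baseline (end-to-end: embedding call + retrieval + LLM generation)
    start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    baseline_query_engine.query(query)
    baseline_times.append(time.perf_counter() - start)
    
    # Re-ranked (same path plus cross-encoder scoring of the top-20)
    start = time.perf_counter()
    rerank_query_engine.query(query)
    rerank_times.append(time.perf_counter() - start)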

avg_baseline = sum(baseline_times) / len(baseline_times)
avg_rerank = sum(rerank_times) / len(rerank_times)

print(f"\n⏱️ Average Query Times:")
print(f"  Baseline (Bi-Encoder only): {avg_baseline:.2f}s")
print(f"  With Re-Ranking (Bi + Cross): {avg_rerank:.2f}s")
print(f"  Overhead: {avg_rerank - avg_baseline:.2f}s ({((avg_rerank/avg_baseline)-1)*100:.1f}% increase)")

print("\n💡 Trade-off Analysis:")
print("  - Timings are end-to-end, so LLM generation and network calls dominate")
print("  - Re-ranking overhead is bounded: the cross-encoder scores only the top-20")
print("  - Whether the precision gain justifies the latency depends on the application;")
print("    compare the retrieved chunks in Sections 8 and 9 before deciding")
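With only three queries and remote API calls in the loop, a single slow response can skew the averages. One way to get a steadier number for an individual step is to time it directly, discard a warm-up run, and report the median. `time_call` is a hypothetical helper, and the lambda is a placeholder workload standing in for the step you want to measure (for example, a call to the re-ranker on a fixed list of nodes).

```python
import statistics
import time
from typing import Any, Callable, List


def time_call(fn: Callable[[], Any], repeats: int = 5, warmup: int = 1) -> float:
    """Return the median wall-clock seconds of fn() over `repeats` timed runs."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed (model load, caches, connections)
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Placeholder workload; swap in the step you want to measure.
median_s = time_call(lambda: sum(i * i for i in range(10_000)), repeats=5)
print(f"median: {median_s * 1000:.2f} ms")
```

The median is less sensitive than the mean to one-off stalls, which matters when the sample is this small.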

## 12. Comparative Summary

In [None]:
comparison_data = {
    'Aspect': [
        'Architecture',
        'Speed',
        'Accuracy',
        'Scalability',
        'Use Case',
        'Cost'
    ],
    'Bi-Encoder Only': [
        'Separate query & doc encoding',
        '⚡⚡⚡⚡⚡ Very Fast',
        '⭐⭐⭐ Good',
        '✓ Millions of documents',
        'Initial retrieval / Recall',
        '$ Low'
    ],
    'Cross-Encoder Only': [
        'Concatenated query + doc',
        '⚡ Slow',
        '⭐⭐⭐⭐⭐ Excellent',
        '✗ Hundreds of documents max',
        'Final ranking / Precision',
        '$$$ High'
    ],
    'Two-Stage (Bi + Cross)': [
        'Bi-encoder → Cross-encoder',
        '⚡⚡⚡⚡ Fast',
        '⭐⭐⭐⭐⭐ Excellent',
        '✓ Millions (via bi-encoder)',
        'Production RAG systems',
        '$$ Moderate'
    ]
}

comparison_df = pd.DataFrame(comparison_data)

print("\n" + "="*80)
print("RETRIEVAL STRATEGY COMPARISON")
print("="*80)
print(comparison_df.to_string(index=False))

## 13. Key Findings & Best Practices

### When to Use Re-Ranking

✅ **Use Re-Ranking When:**
- High precision is critical (medical, legal, financial domains)
- Initial retrieval returns many semi-relevant results
- Answer quality matters more than latency
- Working with nuanced queries requiring semantic understanding
- Budget allows for additional compute

❌ **Skip Re-Ranking When:**
- Bi-encoder already achieves good results
- Extreme latency requirements (< 100ms)
- Simple keyword-based queries
- Very limited compute budget

### Implementation Guidelines

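1. **Retrieval Pool Size**: Retrieve several times your final top-k with the bi-encoder
   - This demo retrieves 20 for a final top-5 (4x); a common guideline is 10-20x (top-50 to 100 for top-5)
   - A larger pool raises the chance that the best chunk is among the candidates, but cross-encoder latency grows linearly with pool size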

2. **Model Selection**:
   - **Fast**: cross-encoder/ms-marco-MiniLM-L6-v2 (this demo)
   - **Accurate**: BAAI/bge-reranker-v2-m3 (multilingual)
   - **Balanced**: Qwen/Qwen3-Reranker-0.6B

3. **Hybrid Optimization**:
   - Combine with query expansion for better recall
   - Use with hybrid search (dense + sparse) for diversity
   - Add context compression after re-ranking for efficiency

4. **Caching Strategy**:
   - Cache re-ranked results for common queries
   - Reduces latency on repeated queries

### Production Considerations

- **Latency Budget**: Re-ranking adds roughly 100-500ms, depending on the number of candidates scored (top-N), passage length, and hardware
- **GPU Acceleration**: Cross-encoders benefit significantly from GPU
- **Batch Processing**: Process multiple queries in batches when possible
- **Monitoring**: Track precision improvements to justify latency cost
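Batch processing can be sketched as grouping (query, passage) pairs into fixed-size slices before scoring. `fake_scorer` below is a hypothetical stand-in for a model call. Note that `CrossEncoder.predict` also accepts a `batch_size` argument, so manual batching like this is mainly useful when pooling pairs from several queries.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Pair = Tuple[str, str]
BatchScorer = Callable[[Sequence[Pair]], List[float]]


def batched(pairs: Sequence[Pair], batch_size: int) -> Iterator[Sequence[Pair]]:
    """Yield consecutive slices of at most batch_size pairs."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(pairs), batch_size):
        yield pairs[start:start + batch_size]


def score_all(pairs: Sequence[Pair], score_batch: BatchScorer, batch_size: int = 32) -> List[float]:
    """Score every pair one batch at a time, preserving input order."""
    scores: List[float] = []
    for batch in batched(pairs, batch_size):
        scores.extend(score_batch(batch))
    return scores


# Hypothetical scorer: passage length as a fake score, for illustration only.
def fake_scorer(batch: Sequence[Pair]) -> List[float]:
    return [float(len(passage)) for _, passage in batch]


pairs = [("q1", "aa"), ("q1", "bbbb"), ("q2", "c"), ("q2", "ddd"), ("q2", "")]
scores = score_all(pairs, fake_scorer, batch_size=2)
print(scores)  # [2.0, 4.0, 1.0, 3.0, 0.0]
```

Because order is preserved, the scores can be split back per query by position after the batched call.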

### Research-Backed Insights

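1. **MS MARCO Dataset**: Cross-encoders trained on MS MARCO are reported to improve NDCG@10 by roughly 10-30% over bi-encoder-only retrieval; the gain varies with dataset and baseline
2. **Optimal Pool Size**: A re-ranking pool of 20-50 documents is often a good starting point; larger pools trade latency for recall, so validate the size on your own data
3. **Domain Adaptation**: Fine-tuning a cross-encoder on in-domain data typically improves results over the off-the-shelf model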
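For readers unfamiliar with the metric, NDCG@k can be computed as in the sketch below. It uses the linear-gain form, DCG = sum of rel_i / log2(i + 1) over ranked positions i (an assumption; some evaluations use the exponential gain 2^rel - 1), and it normalizes by the ideal ordering of the same judged list. The relevance labels are made up.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Made-up relevance labels, listed in ranked order (1 = relevant, 0 = not).
before_rerank = [0, 1, 0, 1, 0]
after_rerank = [1, 1, 0, 0, 0]
print(round(ndcg_at_k(before_rerank), 3))
print(round(ndcg_at_k(after_rerank), 3))  # 1.0
```

Because the ideal ordering is built from the retrieved list itself, this version does not penalize relevant documents that were never retrieved; a full evaluation would normalize against all judged documents for the query.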

## 14. Further Reading & Resources

### Research Papers
1. **MS MARCO Ranking**: "Passage Re-ranking with BERT" (Nogueira & Cho, 2019)
2. **Comparative Analysis**: "Comparative Analysis of Lion and AdamW Optimizers for Cross-Encoder Reranking"
   - Link: https://hf.co/papers/2506.18297
3. **Relevance Feedback**: "Incorporating Relevance Feedback for Information-Seeking Retrieval"
   - Link: https://hf.co/papers/2210.10695

### Hugging Face Models
1. **cross-encoder/ms-marco-MiniLM-L6-v2** (Used in this demo)
   - Link: https://hf.co/cross-encoder/ms-marco-MiniLM-L6-v2
   - Downloads: 4.9M+

2. **BAAI/bge-reranker-v2-m3** (Multilingual)
   - Link: https://hf.co/BAAI/bge-reranker-v2-m3
   - Downloads: 2.6M+

3. **Qwen/Qwen3-Reranker-0.6B** (Modern architecture)
   - Link: https://hf.co/Qwen/Qwen3-Reranker-0.6B
   - Downloads: 930K+

### Documentation
- **Pinecone**: Rerankers and Two-Stage Retrieval
- **LlamaIndex**: CohereRerank, SentenceTransformerRerank
- **Sentence Transformers**: Cross-Encoder documentation

## Summary

In this demo, we successfully implemented **Two-Stage Retrieval with Cross-Encoder Re-Ranking**:

1. ✅ Established baseline with bi-encoder only retrieval
2. ✅ Loaded and configured cross-encoder model from Hugging Face
3. ✅ Implemented custom CrossEncoderReranker postprocessor
4. ✅ Compared retrieved chunks and generated answers with and without re-ranking
5. ✅ Analyzed trade-offs between speed and accuracy
6. ✅ Explained architectural differences (bi-encoder vs cross-encoder)

**Key Takeaway**: Two-stage retrieval combines the speed of bi-encoders with the precision of cross-encoders, which makes it a practical default for production RAG systems where answer quality is critical.