# Demo #5: Re-ranking with Cross-Encoders

## Overview

This demo shows the **two-pass retrieval architecture** with cross-encoder re-ranking, a technique that can substantially improve retrieval precision in RAG systems.

### The Problem with Single-Pass Retrieval

Traditional RAG systems use **bi-encoders** (like sentence-transformers or OpenAI embeddings) for retrieval:
- Query and documents are encoded **independently**
- Similarity is computed via simple operations (cosine similarity, dot product)
- **Fast** but **less accurate** - no interaction between query and document during encoding

### The Solution: Two-Pass Retrieval with Cross-Encoders

Cross-encoders evaluate query-document pairs **jointly**, so they are applied only to a short candidate list:
1. **Pass 1 (Recall)**: Use a fast bi-encoder to retrieve the top-K candidates (e.g., top-20; this demo uses top-10)
2. **Pass 2 (Precision)**: Use a more accurate cross-encoder to re-rank those candidates and keep the top-N (e.g., top-3)

**Why this works:**
- Cross-encoders encode query+document **together**, capturing interaction signals
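- Typically more accurate than bi-encoders but much slower: every query-document pair needs its own forward pass at query time, whereas bi-encoder document embeddings are computed once and reused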
- Two-pass approach gets **best of both worlds**: speed from bi-encoders, accuracy from cross-encoders
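
The two passes can be sketched without any model at all. The example below is a toy illustration, not the demo's implementation: `embed`, `cosine`, `joint_score`, and `two_pass_retrieve` are hypothetical names, the bag-of-words vectors stand in for a bi-encoder, and the word-order check stands in for a cross-encoder. What it preserves is the structure: texts are encoded independently in pass 1, and only the surviving candidates are scored jointly with the query in pass 2.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Stand-in bi-encoder: encodes one text on its own as a bag of words."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def joint_score(query: str, doc: str) -> float:
    """Stand-in cross-encoder: sees both texts and rewards shared word order."""
    q, d = query.lower().split(), doc.lower().split()
    doc_bigrams = set(zip(d, d[1:]))
    return float(sum(1 for pair in zip(q, q[1:]) if pair in doc_bigrams))


def two_pass_retrieve(
    query: str, docs: List[str], top_k: int = 20, top_n: int = 3
) -> List[Tuple[str, float]]:
    # Pass 1 (recall): in a real system the document vectors are pre-computed.
    doc_vectors = [embed(d) for d in docs]
    q_vec = embed(query)
    candidates = sorted(
        range(len(docs)), key=lambda i: cosine(q_vec, doc_vectors[i]), reverse=True
    )[:top_k]
    # Pass 2 (precision): joint scoring runs only on the K candidates.
    reranked = sorted(
        ((docs[i], joint_score(query, docs[i])) for i in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return reranked[:top_n]


docs = [
    "dog bites man in the park",
    "man bites dog in the park",
    "stock markets rallied today",
]
results = two_pass_retrieve("man bites dog", docs, top_k=2, top_n=1)
print(results)
```

The first two documents contain exactly the same words, so the independent encoding cannot tell them apart; only the joint pass, which looks at query and document together, separates them.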

### Core Concepts Demonstrated
- Two-pass retrieval architecture (recall → precision)
- Bi-encoder vs. Cross-encoder comparison
- Post-retrieval optimization
- Precision improvement over recall-focused retrieval

### References
- Retrieval-Augmented Generation (RAG) from basics to advanced (Reference 15)
- Advanced RAG Techniques: What They Are & How to Use Them - FalkorDB (Reference 5)

## 1. Environment Setup and Imports

In [1]:
# Core imports
import os
import sys
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# LlamaIndex imports
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Settings,
)
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding

# For re-ranking - we'll implement a custom cross-encoder using sentence-transformers
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core.schema import NodeWithScore, QueryBundle
from llama_index.core.postprocessor.types import BaseNodePostprocessor
from typing import List, Optional

# For cross-encoder model
try:
    from sentence_transformers import CrossEncoder
    HAS_SENTENCE_TRANSFORMERS = True
except ImportError:
    print("⚠️ sentence-transformers not installed. Installing...")
    import subprocess
    subprocess.check_call([sys.executable, "-m", "pip", "install", "sentence-transformers", "-q"])
    from sentence_transformers import CrossEncoder
    HAS_SENTENCE_TRANSFORMERS = True

# Visualization
import pandas as pd
from IPython.display import display, Markdown, HTML
import matplotlib.pyplot as plt

# Utilities
from dotenv import load_dotenv

load_dotenv()

print("✓ All imports successful")

✓ All imports successful


## 2. Azure OpenAI Configuration

In [2]:
# Azure OpenAI configuration from environment variables
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_API_VERSION = os.getenv("AZURE_OPENAI_API_VERSION", "2024-02-15-preview")
AZURE_OPENAI_DEPLOYMENT = os.getenv("AZURE_OPENAI_DEPLOYMENT", "gpt-4")
AZURE_OPENAI_EMBEDDING_DEPLOYMENT = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT", "text-embedding-ada-002")

# Validate configuration
if not all([AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT]):
    raise ValueError(
        "Missing Azure OpenAI configuration. Please set:\n"
        "- AZURE_OPENAI_API_KEY\n"
        "- AZURE_OPENAI_ENDPOINT\n"
        "- AZURE_OPENAI_DEPLOYMENT (optional, default: gpt-4)\n"
        "- AZURE_OPENAI_EMBEDDING_DEPLOYMENT (optional, default: text-embedding-ada-002)"
    )

# Initialize Azure OpenAI LLM
llm = AzureOpenAI(
    model="gpt-4",
    deployment_name=AZURE_OPENAI_DEPLOYMENT,
    api_key=AZURE_OPENAI_API_KEY,
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_version=AZURE_OPENAI_API_VERSION,
    temperature=0.1,
)

# Initialize Azure OpenAI Embeddings (Bi-Encoder)
embed_model = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name=AZURE_OPENAI_EMBEDDING_DEPLOYMENT,
    api_key=AZURE_OPENAI_API_KEY,
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_version=AZURE_OPENAI_API_VERSION,
)

# Configure global settings
Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 512
Settings.chunk_overlap = 50

print("✓ Azure OpenAI configured successfully")
print(f"  LLM Deployment: {AZURE_OPENAI_DEPLOYMENT}")
print(f"  Embedding Deployment (Bi-Encoder): {AZURE_OPENAI_EMBEDDING_DEPLOYMENT}")

✓ Azure OpenAI configured successfully
  LLM Deployment: gpt-4
  Embedding Deployment (Bi-Encoder): text-embedding-ada-002


## 3. Custom Cross-Encoder Re-Ranker

Implement a custom cross-encoder post-processor for LlamaIndex.
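
The post-processor in this section overwrites each node's score with the value returned by `CrossEncoder.predict`. Depending on the model's configured activation, those values may be unbounded logits rather than probabilities, so they are not directly comparable to the cosine similarities from pass 1. The sketch below is an optional, framework-free illustration of mapping raw scores into (0, 1) with a sigmoid and dropping weak candidates. The names `sigmoid` and `normalize_and_filter`, and the `min_prob` threshold, are hypothetical and not part of the demo.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def normalize_and_filter(
    scored: List[Tuple[str, float]], min_prob: float = 0.5
) -> List[Tuple[str, float]]:
    """Map raw scores to (0, 1) and keep items at or above min_prob.

    `scored` holds (text, raw_score) pairs; the 0.5 default is an
    illustrative assumption, not a recommended production threshold.
    """
    normalized = [(text, sigmoid(raw)) for text, raw in scored]
    return [(text, prob) for text, prob in normalized if prob >= min_prob]


# Made-up raw scores, used only to exercise the function.
example = [("relevant chunk", 8.6), ("off-topic chunk", -4.3)]
print(normalize_and_filter(example))
```

Because the sigmoid is monotonic, applying it never changes the ranking; it only makes scores easier to read and threshold.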

In [None]:
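from typing import Any
from llama_index.core.bridge.pydantic import Field, PrivateAttr


class CrossEncoderReranker(BaseNodePostprocessor):
    """Cross-encoder reranker using sentence-transformers."""

    # BaseNodePostprocessor is a pydantic model, so configuration is declared
    # as fields and the loaded model is kept in a private attribute. Assigning
    # an undeclared attribute such as `self.model` in __init__ is rejected.
    encoder_name: str = Field(description="HuggingFace model name for the cross-encoder")
    top_n: int = Field(default=3, description="Number of results to return after reranking")
    _model: Any = PrivateAttr()

    def __init__(
        self,
        model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
        top_n: int = 3,
    ):
        """Initialize cross-encoder reranker.

        Args:
            model_name: HuggingFace model name for cross-encoder
            top_n: Number of top results to return after reranking
        """
        super().__init__(encoder_name=model_name, top_n=top_n)
        self._model = CrossEncoder(model_name)
        print(f"✓ Cross-Encoder loaded: {model_name}")

    @classmethod
    def class_name(cls) -> str:
        return "CrossEncoderReranker"

    def _postprocess_nodes(
        self,
        nodes: List[NodeWithScore],
        query_bundle: Optional[QueryBundle] = None,
    ) -> List[NodeWithScore]:
        """Rerank nodes using cross-encoder."""
        # Without a query or without candidates there is nothing to score
        if query_bundle is None or not nodes:
            return nodes

        query_str = query_bundle.query_str

        # Prepare query-document pairs for cross-encoder
        pairs = [[query_str, node.node.get_content()] for node in nodes]

        # Get cross-encoder scores (one score per query-document pair)
        scores = self._model.predict(pairs)

        # Replace the bi-encoder similarity with the cross-encoder score
        for node, score in zip(nodes, scores):
            node.score = float(score)

        # Sort by cross-encoder score and return top_n
        nodes_sorted = sorted(nodes, key=lambda x: x.score, reverse=True)
        return nodes_sorted[: self.top_n]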

# Initialize cross-encoder reranker
cross_encoder_reranker = CrossEncoderReranker(
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",  # Fast, accurate cross-encoder
    top_n=3,
)

print("✓ Cross-Encoder Re-ranker ready")




## 4. Data Preparation

Load technical documents with varying relevance patterns to demonstrate re-ranking benefits.

In [None]:
# Define data directory
data_dir = Path("./data/tech_docs")

# Load documents
print("Loading documents...")
documents = SimpleDirectoryReader(str(data_dir)).load_data()

print(f"\n✓ Loaded {len(documents)} documents")
for i, doc in enumerate(documents, 1):
    file_name = Path(doc.metadata.get('file_name', 'unknown')).name
    print(f"  {i}. {file_name} ({len(doc.text)} chars)")

## 5. Build Vector Index (Bi-Encoder)

In [None]:
# Build index using bi-encoder (Azure OpenAI embeddings)
print("Building vector index with bi-encoder embeddings...")
index = VectorStoreIndex.from_documents(documents)

print("✓ Vector index built successfully")

## 6. Single-Pass Retrieval (Baseline)

Create a baseline query engine using only bi-encoder retrieval.

In [None]:
# Baseline: Retrieve top-10 with bi-encoder only
baseline_retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,  # Retrieve top-10 for baseline
)

baseline_query_engine = RetrieverQueryEngine(
    retriever=baseline_retriever,
)

print("✓ Baseline query engine ready (bi-encoder only, top-10)")

## 7. Two-Pass Retrieval with Cross-Encoder Re-Ranking

In [None]:
# Two-pass: Retrieve top-10 with bi-encoder, then rerank to top-3 with cross-encoder
rerank_retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,  # Pass 1: Cast wide net with bi-encoder
)

rerank_query_engine = RetrieverQueryEngine(
    retriever=rerank_retriever,
    node_postprocessors=[cross_encoder_reranker],  # Pass 2: Rerank with cross-encoder
)

print("✓ Two-pass query engine ready")
print("  Pass 1: Bi-encoder retrieves top-10")
print("  Pass 2: Cross-encoder reranks to top-3")

## 8. Comparative Evaluation

Test both systems with queries that benefit from cross-encoder re-ranking.
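
The cells below compare the two systems by eye. If relevance labels are available, the comparison can also be scored. The following is a minimal sketch assuming each retrieved chunk has an identifier and that a set of relevant identifiers has been labeled by hand. The helper names and the chunk ids (`c1`, `c2`, ...) are made up for illustration and do not come from the demo's data.

```python
from typing import Sequence, Set


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k positions occupied by relevant items."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# Hypothetical rankings for one query (not measured results).
relevant = {"c1", "c2"}
baseline_ids = ["c7", "c2", "c9", "c1"]
reranked_ids = ["c2", "c1", "c7"]

for name, ids in [("baseline", baseline_ids), ("re-ranked", reranked_ids)]:
    print(name, precision_at_k(ids, relevant, k=3), reciprocal_rank(ids, relevant))
```

Averaging `reciprocal_rank` over several queries gives mean reciprocal rank (MRR), a common summary for ranking quality.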

In [None]:
# Define test queries
test_queries = [
    "How does the attention mechanism work in transformers?",
    "What are the key differences between REST APIs and GraphQL?",
    "Explain how BERT uses masked language modeling for training.",
]

print(f"Testing with {len(test_queries)} queries...\n")

### Query 1: Attention Mechanism in Transformers

In [None]:
query = test_queries[0]
print(f"Query: {query}\n")

# Baseline: Bi-encoder only (top-10)
print("="*80)
print("BASELINE: BI-ENCODER ONLY (Top-10)")
print("="*80)
baseline_response = baseline_query_engine.query(query)

print(f"\nRetrieved Nodes: {len(baseline_response.source_nodes)}\n")
baseline_results = []
for i, node in enumerate(baseline_response.source_nodes, 1):
    file_name = Path(node.node.metadata.get('file_name', 'unknown')).name
    baseline_results.append({
        'Rank': i,
        'Score': f"{node.score:.4f}",
        'Source': file_name,
        'Text': node.node.text[:150] + "..."
    })
    print(f"Rank {i} | Score: {node.score:.4f} | Source: {file_name}")
    print(f"  {node.node.text[:150]}...\n")

# Two-pass: Bi-encoder + Cross-encoder reranking (top-3)
print("\n" + "="*80)
print("TWO-PASS: BI-ENCODER → CROSS-ENCODER RE-RANKING (Top-3)")
print("="*80)
rerank_response = rerank_query_engine.query(query)

print(f"\nRe-ranked Nodes: {len(rerank_response.source_nodes)}\n")
rerank_results = []
for i, node in enumerate(rerank_response.source_nodes, 1):
    file_name = Path(node.node.metadata.get('file_name', 'unknown')).name
    rerank_results.append({
        'Rank': i,
        'Cross-Encoder Score': f"{node.score:.4f}",
        'Source': file_name,
        'Text': node.node.text[:150] + "..."
    })
    print(f"Rank {i} | Cross-Encoder Score: {node.score:.4f} | Source: {file_name}")
    print(f"  {node.node.text[:150]}...\n")

# Compare answers
print("\n" + "="*80)
print("ANSWER COMPARISON")
print("="*80)
print("\nBaseline Answer (Bi-encoder only):")
print(baseline_response.response)
print("\n" + "-"*80)
print("\nRe-ranked Answer (With cross-encoder):")
print(rerank_response.response)

### Query 2: REST APIs vs GraphQL

In [None]:
query = test_queries[1]
print(f"Query: {query}\n")

# Baseline
print("="*80)
print("BASELINE: BI-ENCODER ONLY")
print("="*80)
baseline_response = baseline_query_engine.query(query)
print(f"\nTop 3 from baseline (out of {len(baseline_response.source_nodes)}):")
for i, node in enumerate(baseline_response.source_nodes[:3], 1):
    file_name = Path(node.node.metadata.get('file_name', 'unknown')).name
    print(f"  {i}. Score: {node.score:.4f} | {file_name}")

# Two-pass
print("\n" + "="*80)
print("TWO-PASS: WITH CROSS-ENCODER RE-RANKING")
print("="*80)
rerank_response = rerank_query_engine.query(query)
print(f"\nRe-ranked top 3:")
for i, node in enumerate(rerank_response.source_nodes, 1):
    file_name = Path(node.node.metadata.get('file_name', 'unknown')).name
    print(f"  {i}. Cross-Encoder Score: {node.score:.4f} | {file_name}")

print("\n" + "="*80)
print("ANSWER COMPARISON")
print("="*80)
print("\nBaseline Answer:")
print(baseline_response.response)
print("\n" + "-"*80)
print("\nRe-ranked Answer:")
print(rerank_response.response)

### Query 3: BERT Masked Language Modeling

In [None]:
query = test_queries[2]
print(f"Query: {query}\n")

# Baseline
print("="*80)
print("BASELINE: BI-ENCODER ONLY")
print("="*80)
baseline_response = baseline_query_engine.query(query)
print(f"\nTop 3 from baseline:")
for i, node in enumerate(baseline_response.source_nodes[:3], 1):
    file_name = Path(node.node.metadata.get('file_name', 'unknown')).name
    print(f"  {i}. Score: {node.score:.4f} | {file_name}")

# Two-pass
print("\n" + "="*80)
print("TWO-PASS: WITH CROSS-ENCODER RE-RANKING")
print("="*80)
rerank_response = rerank_query_engine.query(query)
print(f"\nRe-ranked top 3:")
for i, node in enumerate(rerank_response.source_nodes, 1):
    file_name = Path(node.node.metadata.get('file_name', 'unknown')).name
    print(f"  {i}. Cross-Encoder Score: {node.score:.4f} | {file_name}")

print("\n" + "="*80)
print("ANSWER COMPARISON")
print("="*80)
print("\nBaseline Answer:")
print(baseline_response.response)
print("\n" + "-"*80)
print("\nRe-ranked Answer:")
print(rerank_response.response)

## 9. Architecture Visualization

In [None]:
architecture_md = """
### Two-Pass Retrieval Architecture

```
┌────────────────────────────────────────────────────────────────┐
│                        USER QUERY                              │
└────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│  PASS 1: BI-ENCODER RETRIEVAL (Fast, Recall-Focused)           │
│  ─────────────────────────────────────────────────────────      │
│  • Encode query independently                                   │
│  • Compute cosine similarity with all documents                 │
│  • Retrieve top-K candidates (e.g., K=10)                       │
│  • Fast: pre-computed doc embeddings                            │
│  • Moderate accuracy: no query-doc interaction                  │
└─────────────────────────────────────────────────────────────────┘
                              ↓
                    Top-10 Candidates Retrieved
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│  PASS 2: CROSS-ENCODER RE-RANKING (Accurate, Precision-Focused)│
│  ───────────────────────────────────────────────────────────    │
│  • Encode query+document pairs JOINTLY                          │
│  • Capture interaction signals between query and doc            │
│  • Re-rank candidates based on relevance scores                 │
│  • Return top-N (e.g., N=3)                                     │
│  • Slower but more accurate: O(K) evaluations                   │
└─────────────────────────────────────────────────────────────────┘
                              ↓
                      Top-3 Re-ranked Results
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│                 LLM GENERATION                                  │
│  Uses top-3 highly relevant contexts                            │
└─────────────────────────────────────────────────────────────────┘
```

### Bi-Encoder vs Cross-Encoder

| Aspect | Bi-Encoder | Cross-Encoder |
|--------|------------|---------------|
| **Encoding** | Query and docs encoded separately | Query+doc encoded together |
| **Interaction** | No interaction signals | Captures query-doc interaction |
| **Speed** | Very fast (pre-computed embeddings) | Slower (one forward pass per candidate) |
| **Accuracy** | Good | Excellent |
| **Use Case** | Initial retrieval (recall) | Re-ranking (precision) |
| **Scalability** | Millions of docs | Limited to top-K candidates |

### Key Benefits

1. **Best of Both Worlds**: Speed from bi-encoders + accuracy from cross-encoders
2. **Improved Precision**: Cross-encoders catch nuanced relevance signals missed by bi-encoders
3. **Ranking Quality**: Better ordering of results leads to better LLM context
4. **Scalable**: Two-pass approach is computationally feasible even at scale
"""

display(Markdown(architecture_md))

## 10. Quantitative Analysis

In [None]:
# Compare the two approaches
comparison_data = {
    'Metric': [
        'Retrieval Model',
        'Pass 1 (Recall)',
        'Pass 2 (Precision)',
        'Candidates Retrieved',
        'Final Results',
        'Query-Doc Interaction',
        'Speed',
        'Accuracy',
        'Best For',
    ],
    'Baseline (Bi-Encoder Only)': [
        'Bi-Encoder (Azure OpenAI)',
        'Yes (cosine similarity)',
        'No',
        '10',
        'Top-10',
        'No (independent encoding)',
        'Fast',
        'Good',
        'Speed-critical applications',
    ],
    'Two-Pass (Bi-Encoder + Cross-Encoder)': [
        'Bi-Encoder → Cross-Encoder',
        'Yes (bi-encoder)',
        'Yes (cross-encoder rerank)',
        '10',
        'Top-3 (re-ranked)',
        'Yes (joint encoding)',
        'Fast overall (only 10 rerank ops)',
        'Excellent',
        'Accuracy-critical applications',
    ]
}

df_comparison = pd.DataFrame(comparison_data)
display(HTML("<h3>Comparative Analysis</h3>"))
display(df_comparison)

## 11. Key Takeaways

### What We Learned

1. **Two-Pass Architecture is Best Practice**:
   - Use fast bi-encoders for initial retrieval (recall optimization)
   - Use accurate cross-encoders for final re-ranking (precision optimization)
   - Get both speed and accuracy

2. **Why Cross-Encoders Are More Accurate**:
   - Joint encoding of query+document captures interaction signals
   - Can model attention between query terms and document terms
   - Better at detecting nuanced relevance

3. **Computational Trade-offs**:
   - Bi-encoder: one query encoding plus a vector similarity search over pre-computed document embeddings (fast)
   - Cross-encoder: O(K) forward passes, where K is the number of candidates (slower)
   - Two-pass: Fast for initial retrieval, accurate for final ranking

4. **Practical Impact**:
   - Improved retrieval precision → better context for LLM
   - Better ranking → reduced noise in top results
   - Often higher-quality answers for a modest latency increase
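
The computational trade-off in point 3 can be made concrete by counting transformer forward passes at query time. This is a back-of-the-envelope sketch under stated assumptions: document embeddings are already computed, and the cost of the vector search itself is ignored. The function name `query_time_forward_passes` is hypothetical.

```python
from typing import Dict


def query_time_forward_passes(corpus_size: int, top_k: int) -> Dict[str, int]:
    """Count model forward passes needed to answer one query.

    Assumes document embeddings are pre-computed and ignores the cost of
    the nearest-neighbor search over those embeddings.
    """
    k = min(top_k, corpus_size)
    return {
        # Encode the query once; document vectors are reused.
        "bi_encoder_only": 1,
        # One joint pass per document in the whole corpus.
        "cross_encoder_only": corpus_size,
        # One query encoding plus one joint pass per candidate.
        "two_pass": 1 + k,
    }


print(query_time_forward_passes(corpus_size=1_000_000, top_k=10))
```

With a million chunks and K=10, the two-pass setup needs 11 passes per query, while scoring the whole corpus with a cross-encoder would need a million.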

### When to Use Cross-Encoder Re-Ranking

✅ **Good for**:
- High-stakes applications where accuracy is critical
- Complex queries requiring nuanced matching
- When you can afford slight latency increase
- Domain-specific retrieval (use domain-tuned cross-encoders)

❌ **Less suitable for**:
- Ultra-low latency requirements (< 100ms)
- Very large candidate sets (> 100 candidates)
- Simple keyword matching tasks

### Popular Cross-Encoder Models

1. **ms-marco-MiniLM-L-6-v2** (used in this demo): Fast, good balance
2. **ms-marco-MiniLM-L-12-v2**: More accurate, slightly slower
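3. **nli-deberta-v3-large**: A cross-encoder trained for natural language inference; it predicts entailment labels rather than relevance scores, so it is not a drop-in re-ranker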
4. **Domain-specific models**: Fine-tune on your domain for best results

### Production Considerations

1. **Latency Budget**: Cross-encoder re-ranking adds roughly 50-200ms, depending on model, hardware, candidate count, and batch size
2. **Batch Processing**: Evaluate multiple candidates in parallel for efficiency
3. **Caching**: Cache cross-encoder scores for frequently retrieved docs
4. **Model Selection**: Choose cross-encoder based on accuracy/speed requirements
5. **Monitoring**: Track ranking changes and answer quality improvements
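
Points 2 and 3 (batching and caching) can be combined in a small wrapper. The sketch below is one possible shape, not a prescribed design: `CachedPairScorer` and `stub_score_batch` are hypothetical names, the cache is an unbounded in-memory dict keyed by `(query, doc_id)`, and it assumes a document's text does not change under the same id. In a real setup the batch function could be something like a cross-encoder's `predict` method.

```python
from typing import Callable, Dict, List, Sequence, Tuple

BatchScorer = Callable[[List[Tuple[str, str]]], Sequence[float]]


class CachedPairScorer:
    """Caches (query, doc_id) scores and scores all cache misses in one batch."""

    def __init__(self, score_batch: BatchScorer):
        self._score_batch = score_batch
        self._cache: Dict[Tuple[str, str], float] = {}

    def score(self, query: str, docs: Dict[str, str]) -> Dict[str, float]:
        """Return a score per doc_id; `docs` maps doc_id to document text."""
        missing = [doc_id for doc_id in docs if (query, doc_id) not in self._cache]
        if missing:
            # Single batched call for everything not seen before.
            pairs = [(query, docs[doc_id]) for doc_id in missing]
            for doc_id, value in zip(missing, self._score_batch(pairs)):
                self._cache[(query, doc_id)] = float(value)
        return {doc_id: self._cache[(query, doc_id)] for doc_id in docs}


batch_sizes: List[int] = []


def stub_score_batch(pairs: List[Tuple[str, str]]) -> List[float]:
    """Stand-in for a real model: scores a pair by document length."""
    batch_sizes.append(len(pairs))
    return [float(len(doc)) for _, doc in pairs]


scorer = CachedPairScorer(stub_score_batch)
first = scorer.score("q", {"a": "xx", "b": "yyy"})
second = scorer.score("q", {"a": "xx", "c": "z"})
print(first, second, batch_sizes)
```

A production version would likely bound the cache (for example with an LRU policy) and invalidate entries when documents are re-indexed.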

## 12. Further Exploration

Try these experiments:
1. Adjust `similarity_top_k` (5, 10, 20) and observe impact on final results
2. Test different cross-encoder models and compare accuracy vs speed
3. Combine with other techniques: HyDE + reranking, hybrid search + reranking
4. Measure latency differences between single-pass and two-pass
5. Fine-tune a cross-encoder on your domain-specific data
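
For experiment 4, a small timing helper is enough to get started. This is a minimal sketch: `measure_latency_ms` is a hypothetical helper, and the lambda passed to it here is a placeholder workload. To time the demo itself, the callable could wrap a call such as `baseline_query_engine.query(test_queries[0])`, keeping in mind that this also includes LLM generation time and makes billable API calls.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency_ms(
    fn: Callable[[], object], repeats: int = 5, warmup: int = 1
) -> Dict[str, float]:
    """Time repeated calls to `fn` and report min, median, and max in milliseconds."""
    for _ in range(warmup):
        fn()  # Warm-up runs absorb one-off costs such as model loading.
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "min_ms": min(samples),
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }


# Placeholder workload; swap in the single-pass or two-pass query call.
stats = measure_latency_ms(lambda: sum(range(10_000)), repeats=3)
print(stats)
```

Comparing the median for the single-pass and two-pass engines on the same queries gives a first estimate of what re-ranking costs in a given environment.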