# Demo #5: Re-Ranking with Cross-Encoders - Post-Retrieval Refinement

## Objective
Demonstrate how a two-stage retrieval process (fast bi-encoder + accurate cross-encoder) improves precision by re-ordering initial results.

## Core Concepts
- **Two-stage retrieval architecture**: Fast bi-encoder for recall, accurate cross-encoder for precision
- **Bi-encoder vs. Cross-encoder comparison**: Understanding the trade-offs
- **Re-ranking for precision optimization**: Improving the quality of retrieved results

## What is Re-Ranking?

### The Problem
Standard RAG systems use **bi-encoders** (like Azure OpenAI embeddings) that:
- Encode queries and documents **independently**
- Compare via simple cosine similarity
- Are **fast** but may miss nuanced semantic relationships
- Can rank less-relevant documents higher due to keyword overlap

### The Solution: Two-Stage Retrieval
1. **Stage 1 (Recall)**: Bi-encoder retrieves a larger set (e.g., top-20) quickly
2. **Stage 2 (Precision)**: Cross-encoder re-ranks this smaller set more accurately
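
The two stages above can be sketched without any model at all. The snippet below is a minimal illustration, assuming two hypothetical scoring functions (`word_overlap` as the cheap scorer and `phrase_match` as the expensive one). Neither is a real encoder; they only stand in for the bi-encoder and cross-encoder roles.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def two_stage_retrieve(
    query: str,
    docs: List[str],
    cheap_score: Scorer,
    expensive_score: Scorer,
    recall_k: int = 20,
    final_k: int = 5,
) -> List[Tuple[str, float]]:
    """Keep recall_k candidates by cheap_score, then re-order them by expensive_score."""
    # Stage 1 (recall): score every document with the cheap function.
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:recall_k]
    # Stage 2 (precision): score only the surviving candidates with the expensive function.
    rescored = [(d, expensive_score(query, d)) for d in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:final_k]


# Toy stand-ins (hypothetical, not real encoders)
def word_overlap(query: str, doc: str) -> float:
    """Cheap score: number of lowercase words shared by query and document."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_match(query: str, doc: str) -> float:
    """'Expensive' score: 1.0 if the whole query appears as a phrase, else 0.0."""
    return 1.0 if query.lower() in doc.lower() else 0.0


docs = [
    "neural networks and decision trees",
    "networks of neural pathways in biology",
    "training neural networks with backpropagation",
    "docker containers",
]
results = two_stage_retrieve(
    "neural networks", docs, word_overlap, phrase_match, recall_k=3, final_k=2
)
for doc, score in results:
    print(f"{score:.1f}  {doc}")
```

The first three documents tie on word overlap, so stage 1 cannot separate them; stage 2 drops the one where the words appear in a different order and meaning.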

### Bi-Encoder vs. Cross-Encoder

**Bi-Encoder:**
```
Query → Encoder → Vector Q
Document → Encoder → Vector D   (encoded separately, typically by the same model)
Similarity = cosine(Q, D)
```
- ✅ Fast: Pre-compute document embeddings
- ✅ Scalable: Works with millions of documents
- ❌ Limited: No direct query-document interaction

**Cross-Encoder:**
```
[Query + Document] → Single Encoder → Relevance Score
```
- ✅ Accurate: Deep attention between query and document
- ✅ Contextual: Captures subtle semantic relationships
- ❌ Slow: Must process each query-document pair
- ❌ Not scalable: Cannot pre-compute
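
The scalability trade-off can be made concrete by counting model forward passes. This is a rough sketch under a simplifying assumption: every pass is counted as one unit, even though a cross-encoder pass over a concatenated pair usually costs more than a bi-encoder pass over a single text. The function name and the corpus sizes are illustrative only.

```python
def encoder_passes(num_docs: int, num_queries: int, rerank_k: int) -> dict:
    """Count model forward passes under three retrieval designs."""
    return {
        # Bi-encoder: each document is encoded once offline, each query once online.
        "bi_encoder_only": num_docs + num_queries,
        # Cross-encoder over the full corpus: one pass per (query, document) pair.
        "cross_encoder_only": num_docs * num_queries,
        # Two-stage: bi-encoder cost plus rerank_k cross-encoder passes per query.
        "two_stage": num_docs + num_queries + num_queries * min(rerank_k, num_docs),
    }


costs = encoder_passes(num_docs=1_000_000, num_queries=1_000, rerank_k=20)
for design, passes in costs.items():
    print(f"{design:>20}: {passes:,} passes")
```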

## Data Flow
```
Query
  ↓
Bi-encoder retrieval (top-20, fast)
  ↓
Cross-encoder re-ranking (top-5, accurate)
  ↓
LLM generation with highest-quality context
```

## Setup: Install Dependencies and Load Environment

In [None]:
# Install required packages
# Run this cell if packages are not already installed
# !pip install llama-index llama-index-llms-azure-openai llama-index-embeddings-azure-openai
# !pip install sentence-transformers torch  # For cross-encoder model
# !pip install python-dotenv

In [None]:
import os
from dotenv import load_dotenv
import warnings
warnings.filterwarnings('ignore')

# Load environment variables
load_dotenv()

# Verify Azure OpenAI credentials
required_vars = [
    'AZURE_OPENAI_API_KEY',
    'AZURE_OPENAI_ENDPOINT',
    'AZURE_OPENAI_API_VERSION',
    'AZURE_OPENAI_DEPLOYMENT_NAME',
    'AZURE_OPENAI_EMBEDDING_DEPLOYMENT'
]

missing_vars = [var for var in required_vars if not os.getenv(var)]
if missing_vars:
    print(f"❌ Missing environment variables: {', '.join(missing_vars)}")
    print("Please ensure your .env file contains all required Azure OpenAI credentials.")
else:
    print("✅ All required environment variables are set.")
    print(f"   Endpoint: {os.getenv('AZURE_OPENAI_ENDPOINT')}")
    print(f"   LLM Deployment: {os.getenv('AZURE_OPENAI_DEPLOYMENT_NAME')}")
    print(f"   Embedding Deployment: {os.getenv('AZURE_OPENAI_EMBEDDING_DEPLOYMENT')}")
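
The check above only prints a message, so later cells can still run with missing credentials. A fail-fast variant might look like the sketch below; `missing_env_vars` and `require_env` are hypothetical helper names, and the example dictionary contains placeholder values rather than real settings.

```python
import os
from typing import Iterable, List, Mapping


def missing_env_vars(names: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in env."""
    return [name for name in names if not env.get(name)]


def require_env(names: Iterable[str], env: Mapping[str, str] = os.environ) -> None:
    """Raise instead of printing, so execution stops when credentials are missing."""
    missing = missing_env_vars(names, env)
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")


# Demonstration against a placeholder mapping instead of the real environment
example_env = {"AZURE_OPENAI_API_KEY": "placeholder", "AZURE_OPENAI_ENDPOINT": ""}
print(missing_env_vars(["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"], example_env))
# prints ['AZURE_OPENAI_ENDPOINT']
```

In the notebook, calling `require_env(required_vars)` right after `load_dotenv()` would stop execution at the first cell instead of failing later with a less obvious API error.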

## Step 1: Data and Baseline Setup

We'll use a small knowledge base (11 documents total) combining:
- ML concepts (5 docs): gradient boosting, neural networks, random forests, support vector machines, k-means clustering
- Tech docs (6 docs): BERT, GPT-4, REST API, Transformer Architecture, Docker, Embeddings

This mix includes topically similar but semantically distinct content, which is ideal for demonstrating re-ranking effectiveness.

In [None]:
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.core import Settings

# Initialize Azure OpenAI LLM
azure_llm = AzureOpenAI(
    model="gpt-4",
    deployment_name=os.getenv('AZURE_OPENAI_DEPLOYMENT_NAME'),
    api_key=os.getenv('AZURE_OPENAI_API_KEY'),
    azure_endpoint=os.getenv('AZURE_OPENAI_ENDPOINT'),
    api_version=os.getenv('AZURE_OPENAI_API_VERSION'),
    temperature=0.1  # Low temperature for consistent results
)

# Initialize Azure OpenAI Embedding Model
azure_embed = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name=os.getenv('AZURE_OPENAI_EMBEDDING_DEPLOYMENT'),
    api_key=os.getenv('AZURE_OPENAI_API_KEY'),
    azure_endpoint=os.getenv('AZURE_OPENAI_ENDPOINT'),
    api_version=os.getenv('AZURE_OPENAI_API_VERSION'),
)

# Set global defaults
Settings.llm = azure_llm
Settings.embed_model = azure_embed

print("✅ Azure OpenAI LLM and Embedding models initialized successfully.")

In [None]:
# Load documents from both directories
ml_docs = SimpleDirectoryReader('data/ml_concepts').load_data()
tech_docs = SimpleDirectoryReader('data/tech_docs').load_data()
documents = ml_docs + tech_docs

print(f"📚 Loaded {len(documents)} documents:")
print(f"   - ML Concepts: {len(ml_docs)} documents")
print(f"   - Tech Docs: {len(tech_docs)} documents")
print(f"\nDocument names:")
for i, doc in enumerate(documents, 1):
    filename = os.path.basename(doc.metadata.get('file_name', 'Unknown'))
    print(f"   {i}. {filename}")

In [None]:
# Chunk documents
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(documents)

print(f"✂️ Created {len(nodes)} chunks from {len(documents)} documents")
print(f"   Average chunks per document: {len(nodes) / len(documents):.1f}")

# Show sample chunk
print(f"\n📄 Sample chunk (first 300 chars):")
print(f"   {nodes[0].text[:300]}...")
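
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window chunker over a list of words. It is only an illustration of the overlap idea: `SentenceSplitter` counts tokens and prefers sentence boundaries, which this sketch does not attempt to reproduce. The function name is hypothetical.

```python
from typing import List


def sliding_window_chunks(
    tokens: List[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = sliding_window_chunks(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last word of the previous one, which is the role the 50-token overlap plays in the cell above: content that straddles a boundary stays retrievable from at least one chunk.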

In [None]:
# Create vector index
index = VectorStoreIndex(nodes, embed_model=azure_embed)
print("✅ Vector index created successfully with Azure OpenAI embeddings.")

In [None]:
# Configure baseline retriever with high top_k
# We retrieve 20 chunks to simulate a realistic scenario where initial retrieval includes noise
baseline_retriever = index.as_retriever(similarity_top_k=20)

print("✅ Baseline retriever configured (top_k=20)")
print("   This will retrieve a larger set that likely includes some irrelevant chunks.")

## Step 2: Analyze Baseline Retrieval Quality

Let's test the baseline retriever with a query that might retrieve some irrelevant content in the initial top-20.

In [None]:
# Test query that spans multiple topics
test_query = "How do neural networks use backpropagation to learn features from data?"

print(f"🔍 Test Query: {test_query}")
print("\nThis query is specifically about neural networks and backpropagation.")
print("We expect some retrieved chunks to be about other ML algorithms,")
print("which may have high cosine similarity but lower actual relevance.")

In [None]:
# Retrieve with baseline
baseline_results = baseline_retriever.retrieve(test_query)

print(f"\n📊 Baseline Bi-Encoder Retrieval Results (Top 20):")
print("=" * 80)

for i, node in enumerate(baseline_results, 1):
    score = node.score
    filename = os.path.basename(node.node.metadata.get('file_name', 'Unknown'))
    text_preview = node.node.text[:150].replace('\n', ' ')
    
    print(f"\nRank {i:2d} | Score: {score:.4f} | Source: {filename}")
    print(f"         Text: {text_preview}...")

### Observations on Baseline Results

Let's analyze the quality of these results:
- Are all top-20 results relevant to neural networks and backpropagation?
- Do we see chunks from other ML algorithms (SVM, Random Forests, etc.) in the results?
- Are the most relevant chunks always at the top?

**The Issue**: Bi-encoders compute relevance as the cosine similarity between independently encoded vectors. This can lead to:
- False positives from keyword overlap (e.g., "learning", "data", "training")
- Missing nuanced semantic relationships
- Sub-optimal ranking where less relevant docs appear higher
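
The keyword-overlap failure mode can be reproduced with a deliberately crude stand-in for a bi-encoder: a bag-of-words vector compared by cosine similarity. This exaggerates the effect (learned embeddings are far less sensitive to shared filler words), and the two documents are invented for illustration, but the mechanism is the same: each side is encoded without seeing the other.

```python
import math
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Toy 'encoder': word counts, computed without any knowledge of the query."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = "how do neural networks learn from data"
docs = {
    "random_forest": "random forests learn from data how do trees vote",
    "backprop": "backpropagation updates neural networks weights using gradients",
}

# Document vectors are built once, independently of any query
doc_vectors = {name: bow_vector(text) for name, text in docs.items()}
query_vector = bow_vector(query)
scores = {name: cosine(query_vector, vec) for name, vec in doc_vectors.items()}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>14}: {score:.3f}")
```

The random-forest text wins on shared generic words even though the backpropagation text is the one that answers the question.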

## Step 3: Implement Cross-Encoder Re-Ranking

We'll use a Sentence-Transformers cross-encoder model that processes [query, document] pairs jointly.

**Model**: `cross-encoder/ms-marco-MiniLM-L-6-v2`
- Trained on MS MARCO dataset for passage ranking
- Lightweight (6 layers)
- High accuracy for re-ranking tasks

In [None]:
from sentence_transformers import CrossEncoder
from llama_index.core.schema import NodeWithScore
from typing import List, Optional

# Load cross-encoder model
cross_encoder_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
print("✅ Cross-encoder model loaded: cross-encoder/ms-marco-MiniLM-L-6-v2")

In [None]:
# Create custom re-ranker post-processor
from typing import Any
from llama_index.core import QueryBundle
from llama_index.core.bridge.pydantic import Field, PrivateAttr
from llama_index.core.postprocessor.types import BaseNodePostprocessor

class CrossEncoderReranker(BaseNodePostprocessor):
    """
    Custom post-processor that uses a cross-encoder model to re-rank retrieved nodes.
    
    The cross-encoder processes [query, document] pairs together, allowing for
    deep attention between query and document tokens, resulting in more accurate
    relevance scores compared to bi-encoder cosine similarity.
    """
    
    # BaseNodePostprocessor is a pydantic model, so attributes must be declared:
    # top_n as a regular field, the cross-encoder as a private attribute.
    top_n: int = Field(default=5, description="Number of results to keep after re-ranking")
    _model: Any = PrivateAttr()
    
    def __init__(self, model, top_n: int = 5):
        """
        Args:
            model: Sentence-Transformers CrossEncoder model
            top_n: Number of top results to return after re-ranking
        """
        super().__init__(top_n=top_n)
        self._model = model
    
    def _postprocess_nodes(
        self,
        nodes: List[NodeWithScore],
        query_bundle: Optional[QueryBundle] = None,
    ) -> List[NodeWithScore]:
        """
        Re-rank nodes using cross-encoder model.
        """
        # Without a query or without candidates there is nothing to re-score
        if query_bundle is None or not nodes:
            return nodes
        
        query_str = query_bundle.query_str
        
        # Prepare [query, document] pairs for cross-encoder
        pairs = [[query_str, node.node.text] for node in nodes]
        
        # Get cross-encoder scores
        scores = self._model.predict(pairs)
        
        # Build new NodeWithScore objects so the caller's bi-encoder scores are not overwritten
        rescored = [
            NodeWithScore(node=node.node, score=float(score))
            for node, score in zip(nodes, scores)
        ]
        
        # Sort by new scores (descending)
        rescored.sort(key=lambda x: x.score, reverse=True)
        
        # Return top_n
        return rescored[:self.top_n]

# Create re-ranker instance
reranker = CrossEncoderReranker(model=cross_encoder_model, top_n=5)
print("✅ Custom CrossEncoderReranker created (will return top-5 after re-ranking)")

In [None]:
# Create query engine with re-ranking
from llama_index.core.query_engine import RetrieverQueryEngine

# Standard query engine (no re-ranking)
baseline_query_engine = RetrieverQueryEngine(
    retriever=baseline_retriever,
    node_postprocessors=[],  # No post-processing
)

# Query engine with re-ranking
rerank_query_engine = RetrieverQueryEngine(
    retriever=baseline_retriever,
    node_postprocessors=[reranker],  # Apply cross-encoder re-ranking
)

print("✅ Query engines created:")
print("   1. Baseline (bi-encoder only, top-20)")
print("   2. Re-ranked (bi-encoder → cross-encoder, top-5)")

## Step 4: Execute Re-Ranking and Compare Results

Now let's retrieve and re-rank with the same query to see the improvement.

In [None]:
# Retrieve with baseline (top-20)
print(f"🔍 Query: {test_query}\n")
baseline_results = baseline_retriever.retrieve(test_query)

print(f"📊 STAGE 1: Bi-Encoder Retrieval (Top 20)")
print("=" * 80)
for i, node in enumerate(baseline_results[:10], 1):  # Show first 10 for brevity
    score = node.score
    filename = os.path.basename(node.node.metadata.get('file_name', 'Unknown'))
    text_preview = node.node.text[:100].replace('\n', ' ')
    print(f"Rank {i:2d} | Bi-Score: {score:.4f} | {filename}")
    print(f"         {text_preview}...\n")

print(f"... ({max(len(baseline_results) - 10, 0)} more results with lower scores) ...\n")

In [None]:
# Re-rank with cross-encoder
from llama_index.core import QueryBundle

query_bundle = QueryBundle(query_str=test_query)
reranked_results = reranker.postprocess_nodes(baseline_results, query_bundle=query_bundle)

print(f"📊 STAGE 2: Cross-Encoder Re-Ranking (Top 5)")
print("=" * 80)
for i, node in enumerate(reranked_results, 1):
    cross_score = node.score
    filename = os.path.basename(node.node.metadata.get('file_name', 'Unknown'))
    text_preview = node.node.text[:100].replace('\n', ' ')
    
    # Find original rank in baseline results
    original_rank = None
    for orig_idx, orig_node in enumerate(baseline_results, 1):
        if orig_node.node.node_id == node.node.node_id:
            original_rank = orig_idx
            break
    
    rank_change = f" (↑ moved from rank {original_rank})" if original_rank and original_rank > i else ""
    
    print(f"Rank {i} | Cross-Score: {cross_score:.4f} | {filename}{rank_change}")
    print(f"        {text_preview}...\n")
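
The cross-encoder scores printed above are on a different scale from the bi-encoder's cosine similarities, so the two columns should not be compared numerically. Depending on the sentence-transformers version and the model's configuration, `predict` may return unbounded logits rather than values between 0 and 1. If a bounded score is more convenient for display or thresholding, one option is a sigmoid, which leaves the ranking unchanged because it is monotonic. The values below are hypothetical, not real model output.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function mapping any real score into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


raw_scores = [8.2, 1.5, -0.3, -4.7]  # hypothetical logits, already sorted descending
probabilities = [sigmoid(s) for s in raw_scores]

for raw, prob in zip(raw_scores, probabilities):
    print(f"logit {raw:+.1f} -> {prob:.4f}")
```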

## Step 5: Visualize Rank Changes

Let's create a visualization showing how documents moved in ranking.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Track rank changes
rank_changes = []
labels = []

for new_rank, node in enumerate(reranked_results, 1):
    # Find original rank
    for orig_rank, orig_node in enumerate(baseline_results, 1):
        if orig_node.node.node_id == node.node.node_id:
            filename = os.path.basename(node.node.metadata.get('file_name', 'Unknown'))
            rank_changes.append((orig_rank, new_rank))
            labels.append(filename if len(filename) <= 20 else f"{filename[:20]}...")
            break

# Create visualization
fig, ax = plt.subplots(figsize=(12, 6))

for i, (orig, new) in enumerate(rank_changes):
    color = 'green' if orig > 5 else 'blue'
    ax.plot([0, 1], [orig, new], 'o-', color=color, linewidth=2, markersize=8)
    ax.text(-0.05, orig, f"{orig}", ha='right', va='center', fontsize=10)
    ax.text(1.05, new, f"{new}", ha='left', va='center', fontsize=10, fontweight='bold')

ax.set_xlim(-0.2, 1.2)
ax.set_ylim(0, 21)
ax.invert_yaxis()
ax.set_xticks([0, 1])
ax.set_xticklabels(['Bi-Encoder\n(Top 20)', 'Cross-Encoder\n(Top 5)'], fontsize=12, fontweight='bold')
ax.set_ylabel('Rank Position', fontsize=12)
ax.set_title('Re-Ranking Impact: Document Rank Changes', fontsize=14, fontweight='bold')
ax.grid(axis='y', alpha=0.3, linestyle='--')

# Add legend
from matplotlib.lines import Line2D
legend_elements = [
    Line2D([0], [0], color='blue', linewidth=2, label='Already in top-5'),
    Line2D([0], [0], color='green', linewidth=2, label='Promoted to top-5')
]
ax.legend(handles=legend_elements, loc='lower right')

plt.tight_layout()
plt.show()

print("\n📈 Rank Changes Summary:")
for i, (orig, new) in enumerate(rank_changes):
    change = orig - new
    arrow = "↑" if change > 0 else "↓" if change < 0 else "→"
    print(f"   {labels[i]}: Rank {orig} → {new} {arrow} (moved {abs(change)} positions)")
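
The rank bookkeeping above uses a nested loop over both result lists. An equivalent approach, sketched here on plain string IDs rather than node objects, builds a lookup table once. The helper name `rank_moves` and the sample IDs are illustrative.

```python
from typing import Dict, List, Tuple


def rank_moves(before: List[str], after: List[str]) -> Dict[str, Tuple[int, int, int]]:
    """Map each id in `after` to (old_rank, new_rank, old_rank - new_rank), 1-based."""
    old_rank = {node_id: rank for rank, node_id in enumerate(before, 1)}
    moves = {}
    for new_rank, node_id in enumerate(after, 1):
        if node_id in old_rank:
            # Positive delta means promoted, negative means demoted
            moves[node_id] = (old_rank[node_id], new_rank, old_rank[node_id] - new_rank)
    return moves


before = ["a", "b", "c", "d", "e", "f"]  # stage 1 order
after = ["d", "a", "f"]                  # stage 2 order
moves = rank_moves(before, after)
for node_id, (old, new, delta) in moves.items():
    print(f"{node_id}: {old} -> {new} ({delta:+d})")
```

With the notebook's objects, the inputs would be `[n.node.node_id for n in baseline_results]` and `[n.node.node_id for n in reranked_results]`.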

## Step 6: Compare Generated Answers

Let's see how the quality of retrieved context affects the final generated answer.

In [None]:
# Generate answer with baseline (top-5 from bi-encoder)
baseline_top5_retriever = index.as_retriever(similarity_top_k=5)
baseline_top5_engine = RetrieverQueryEngine(retriever=baseline_top5_retriever)

print("🤖 Generating answer with BASELINE approach (Bi-encoder top-5 only)...\n")
baseline_response = baseline_top5_engine.query(test_query)

print("📝 BASELINE Answer (Bi-Encoder Top-5):")
print("=" * 80)
print(baseline_response.response)
print("=" * 80)

In [None]:
# Generate answer with re-ranking (top-5 from cross-encoder)
print("\n🤖 Generating answer with RE-RANKING approach (Cross-encoder top-5)...\n")
reranked_response = rerank_query_engine.query(test_query)

print("📝 RE-RANKED Answer (Cross-Encoder Top-5):")
print("=" * 80)
print(reranked_response.response)
print("=" * 80)

## Step 7: Analyze Answer Quality

Let's use the LLM as a judge to compare the quality of both answers.

In [None]:
# LLM-as-judge evaluation
# Answers are labelled only "A" (baseline) and "B" (re-ranked) in the prompt,
# so the judge is not told which retrieval method produced which answer.
judge_prompt = f"""
You are an expert evaluator. Compare two answers to the following question and determine which is better.

Question: {test_query}

Answer A:
{baseline_response.response}

Answer B:
{reranked_response.response}

Evaluate based on:
1. Accuracy: Does it correctly answer the question?
2. Completeness: Does it cover all aspects of the question?
3. Relevance: Does it stay focused on the question?
4. Clarity: Is it well-structured and easy to understand?

Provide:
- Brief analysis of each answer's strengths/weaknesses
- Overall verdict: Which is better and why?
"""

print("⚖️ Using LLM as Judge to evaluate answer quality...\n")
judgment = azure_llm.complete(judge_prompt)

print("🎯 Quality Evaluation:")
print("=" * 80)
print(judgment.text)
print("=" * 80)
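
A single judgment can be influenced by the order in which the two answers are shown. One common mitigation is to ask twice with the order swapped and accept a verdict only when both runs agree. The sketch below assumes a hypothetical `judge` callable that takes a prompt and returns "A" or "B"; the prompt template and the two stub judges are illustrative and are not part of the notebook.

```python
from typing import Callable, Optional, Tuple

PROMPT_TEMPLATE = (
    "Question: {question}\n\n"
    "Answer A:\n{first}\n\n"
    "Answer B:\n{second}\n\n"
    'Which answer is better? Reply with exactly "A" or "B".'
)


def swapped_prompts(question: str, answer_1: str, answer_2: str) -> Tuple[str, str]:
    """Build the same comparison twice, with the answer order reversed."""
    forward = PROMPT_TEMPLATE.format(question=question, first=answer_1, second=answer_2)
    reverse = PROMPT_TEMPLATE.format(question=question, first=answer_2, second=answer_1)
    return forward, reverse


def consistent_winner(
    judge: Callable[[str], str], question: str, answer_1: str, answer_2: str
) -> Optional[int]:
    """Return 1 or 2 if the judge prefers the same answer in both orders, else None."""
    forward, reverse = swapped_prompts(question, answer_1, answer_2)
    first_vote = judge(forward).strip().upper()
    second_vote = judge(reverse).strip().upper()
    if first_vote == "A" and second_vote == "B":
        return 1
    if first_vote == "B" and second_vote == "A":
        return 2
    return None  # the verdict flipped with position, so treat it as inconclusive


# Stub judges standing in for an LLM call
def always_a(prompt: str) -> str:
    return "A"  # purely position-biased


def content_judge(prompt: str) -> str:
    return "A" if prompt.index("correct") < prompt.index("wrong") else "B"


question = "What is backpropagation?"
print(consistent_winner(always_a, question, "the correct explanation", "the wrong explanation"))
print(consistent_winner(content_judge, question, "the correct explanation", "the wrong explanation"))
```

In the notebook, `judge` could wrap the existing model call, for example `lambda p: azure_llm.complete(p).text`.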

## Step 8: Test with Multiple Queries

Let's test several different query types to demonstrate the robustness of re-ranking.

In [None]:
# Define diverse test queries
test_queries = [
    "What are the main differences between BERT and GPT-4 architectures?",
    "How does k-means clustering determine the optimal number of clusters?",
    "Explain the concept of containerization in Docker and its benefits.",
]

print("🧪 Testing re-ranking across diverse queries...\n")

for idx, query in enumerate(test_queries, 1):
    print(f"\n{'='*80}")
    print(f"Test Query {idx}: {query}")
    print(f"{'='*80}")
    
    # Bi-encoder results
    bi_results = baseline_top5_retriever.retrieve(query)
    
    # Cross-encoder re-ranking
    qb = QueryBundle(query_str=query)
    bi_results_20 = baseline_retriever.retrieve(query)
    cross_results = reranker.postprocess_nodes(bi_results_20, query_bundle=qb)
    
    print(f"\n📊 Top-3 Sources Comparison:")
    print(f"\nBi-Encoder Top-3:")
    for i, node in enumerate(bi_results[:3], 1):
        filename = os.path.basename(node.node.metadata.get('file_name', 'Unknown'))
        print(f"  {i}. {filename} (score: {node.score:.4f})")
    
    print(f"\nCross-Encoder Top-3:")
    for i, node in enumerate(cross_results[:3], 1):
        filename = os.path.basename(node.node.metadata.get('file_name', 'Unknown'))
        print(f"  {i}. {filename} (score: {node.score:.4f})")
    
    # Check if re-ranking changed the top-3
    bi_top3_ids = {n.node.node_id for n in bi_results[:3]}
    cross_top3_ids = {n.node.node_id for n in cross_results[:3]}
    
    if bi_top3_ids != cross_top3_ids:
        print(f"\n✅ Re-ranking changed the top-3 results (compare the sources above to judge relevance)")
    else:
        print(f"\n→ Re-ranking kept the same top-3 results")
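
A changed top-3 shows that re-ranking had an effect, not that it helped. To quantify precision you need relevance labels, which this notebook does not include. Assuming you hand-label which chunk IDs are relevant for a query, precision@k can be computed as in the sketch below; the IDs and labels are made up for illustration.

```python
from typing import Iterable, Sequence


def precision_at_k(retrieved_ids: Sequence[str], relevant_ids: Iterable[str], k: int) -> float:
    """Fraction of the top-k retrieved ids that are labelled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    hits = sum(1 for node_id in retrieved_ids[:k] if node_id in relevant)
    return hits / k


# Hypothetical hand-labelled example
relevant = {"n1", "n3", "n4"}
bi_encoder_top = ["n1", "n7", "n3"]
cross_encoder_top = ["n3", "n1", "n4"]

print(f"bi-encoder    P@3 = {precision_at_k(bi_encoder_top, relevant, 3):.2f}")
print(f"cross-encoder P@3 = {precision_at_k(cross_encoder_top, relevant, 3):.2f}")
```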

## Key Takeaways

### 1. **Two-Stage Architecture Balances Speed and Accuracy**
- **Stage 1 (Bi-Encoder)**: Fast retrieval over millions of documents
- **Stage 2 (Cross-Encoder)**: Accurate re-ranking of a smaller candidate set
- This combines the best of both worlds: speed + accuracy

### 2. **Cross-Encoders Provide Superior Precision**
- Deep attention between query and document captures nuanced semantics
- More accurate than simple cosine similarity
- Can identify subtle relevance differences that bi-encoders miss

### 3. **When to Use Re-Ranking**
✅ **Use when:**
- You need high precision (e.g., question answering, search)
- Initial retrieval includes noise or false positives
- Query complexity requires deep semantic understanding
- You have a two-stage budget: cheap first-pass, expensive second-pass

❌ **Skip when:**
- Latency is critical and bi-encoder quality is sufficient
- Dataset is very small (< 100 documents)
- Simple keyword matching suffices

### 4. **Performance Considerations**
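- **Bi-Encoder**: ~1ms per query for the vector lookup over pre-computed embeddings; embedding the query through a hosted API adds a network round trip on top
- **Cross-Encoder**: ~50-100ms for 20 pairs (must compute on-the-fly; varies with hardware and chunk length)
- Total: on the order of ~100ms plus the query-embedding call; treat these as rough estimates and measure on your own setup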
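
Latency figures like these depend heavily on hardware, model size, and network conditions, so it is worth measuring them directly. A small timing helper, sketched here with a stand-in workload, reports the median over several runs; the helper name is hypothetical.

```python
import time
from statistics import median
from typing import Callable


def median_latency_ms(fn: Callable[[], object], repeats: int = 5) -> float:
    """Run fn `repeats` times and return the median wall-clock time in milliseconds."""
    if repeats <= 0:
        raise ValueError("repeats must be positive")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return median(timings)


# Stand-in workload; replace with a retrieval or re-ranking call
latency = median_latency_ms(lambda: sum(range(10_000)))
print(f"median latency: {latency:.3f} ms")
```

In the notebook, the two stages could be timed separately with `median_latency_ms(lambda: baseline_retriever.retrieve(test_query))` and a second call wrapping the re-ranker.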

### 5. **Implementation Tips**
- Start with top-20 to top-50 from bi-encoder
- Re-rank to top-5 to top-10 for generation
- Choose cross-encoder model based on domain:
  - `ms-marco-MiniLM`: General web/passage ranking
  - `ms-marco-TinyBERT`: Faster, slightly lower quality
  - Domain-specific models: If available for your use case

## Architecture Diagram

```
User Query
    │
    ▼
┌─────────────────────────────────────┐
│   Stage 1: Bi-Encoder Retrieval    │
│   (Fast, High Recall)               │
│   • Retrieve top-20                 │
│   • Cosine similarity               │
│   • ~1ms latency                    │
└─────────────┬───────────────────────┘
              │ 20 candidates
              ▼
┌─────────────────────────────────────┐
│  Stage 2: Cross-Encoder Re-Ranking  │
│  (Accurate, High Precision)         │
│  • Re-score all 20 pairs            │
│  • Deep attention                   │
│  • Return top-5                     │
│  • ~50-100ms latency                │
└─────────────┬───────────────────────┘
              │ 5 best chunks
              ▼
┌─────────────────────────────────────┐
│      LLM Generation                 │
│      (High-Quality Context)         │
└─────────────────────────────────────┘
```

## Next Steps

To further improve your RAG system:
1. **Experiment with different cross-encoder models** for your domain
2. **Tune the top-K parameters** (Stage 1: K=20-50, Stage 2: K=5-10)
3. **Combine with other techniques**: Hybrid search → Re-ranking → Context compression
4. **Monitor performance**: Track retrieval precision and generation quality
5. **Consider fine-tuning** cross-encoders on domain-specific data

---

**Demo Complete! ✅**

You've successfully implemented re-ranking with cross-encoders and seen how two-stage retrieval can improve precision in RAG systems.