# 📓 The GenAI Revolution Cookbook

**Title:** Cross Encoder Reranking: The Low-Cost Fix for RAG Misses

**Description:** Cut RAG hallucinations and misses using cross-encoder reranking. Learn optimal rerank depth, caching strategies, and ColBERT tradeoffs for throughput balance.

---

*This Jupyter notebook contains executable code examples. Run the cells below to try out the code yourself!*



When your RAG system retrieves passages that *look* relevant but lead the LLM to cite incorrect facts or miss critical nuances, the problem often lies in first-stage retrieval. BM25 and dense bi-encoders excel at broad recall but struggle with subtle intent—negations, temporal constraints, or domain-specific phrasing. A cross-encoder reranker fixes this by scoring query-passage pairs jointly, catching "close but wrong" candidates before they reach the LLM.

This article explains why first-stage retrieval fails on nuanced queries, how cross-encoder reranking delivers top-rank precision, and the latency tradeoff you're accepting. You'll walk away with a decision rule for when to deploy reranking and a minimal set of defaults to start measuring impact.

---

## Why This Matters

First-stage retrieval optimizes for recall: cast a wide net and surface plausible candidates quickly. BM25 matches keywords but ignores semantics. Dense bi-encoders (e.g., `all-MiniLM-L6-v2`) encode query and passage independently, missing fine-grained interactions like negations or conditional clauses.

The result: your top-30 candidates often include passages that share vocabulary with the query but contradict its intent. When the LLM sees "FDA allows off-label promotion in certain contexts" alongside "FDA prohibits off-label promotion to physicians," it may cite the wrong one—or hedge into a vague, low-confidence answer.

**Once recall is adequate, precision at top-k matters more.** If your LLM only sees 5–10 passages, a single "close but wrong" document in that set can trigger a hallucination or incorrect citation. Cross-encoder reranking re-scores the top candidates using joint attention over query and passage tokens, surfacing the truly relevant passages and demoting near-misses.

**The tradeoff:** reranking adds roughly 100–200ms of latency per query (for ~30 candidates). If your application tolerates this and precision failures are measurable (e.g., answer correctness < 85%, or nDCG@10 below the target you set for your eval set), reranking is worth deploying.

---

## How It Works

**1. First-stage retrieval is independent and fast**  
BM25 and bi-encoders score query and passage separately. BM25 counts term overlap; bi-encoders compare precomputed embeddings via cosine similarity. Neither model sees the query and passage together, so they miss contextual cues—negations, qualifiers, or domain-specific phrasing that flip meaning.
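
A minimal sketch of that blind spot, using a toy term-overlap scorer as a hypothetical stand-in for BM25 (it is not a real BM25 implementation). The first passage is the one quoted earlier; the second is a made-up variant with the opposite meaning, added only for illustration.

```python
import re
from typing import Set

def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its distinct word tokens."""
    return set(re.findall(r"[a-z0-9-]+", text.lower()))

def overlap_score(query: str, passage: str) -> int:
    """Count distinct query terms that also appear in the passage."""
    return len(tokenize(query) & tokenize(passage))

query = "Does FDA allow off-label promotion of drugs to physicians?"
passages = {
    "prohibits": "FDA prohibits off-label promotion to physicians",
    # Illustrative variant with the opposite meaning (not from any real source)
    "permits": "FDA permits off-label promotion to physicians",
}

overlap = {name: overlap_score(query, text) for name, text in passages.items()}
print(overlap)  # both passages share the same five query terms
```

Because the scorer only sees shared vocabulary, the two contradictory passages tie. A real BM25 or bi-encoder is more sophisticated, but the same kind of near-tie is what a reranker is meant to break.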

**2. Cross-encoders use joint attention for precision**  
A cross-encoder (e.g., `cross-encoder/ms-marco-MiniLM-L-6-v2`) concatenates query and passage as a single input: `[CLS] query [SEP] passage [SEP]`. Transformer attention layers process both together, learning token-level interactions. This catches subtle mismatches—e.g., "FDA prohibits" vs. "FDA allows"—that independent encoders miss.

**3. Two-stage cascade balances speed and accuracy**  
Retrieve 100–200 candidates with fast first-stage methods (BM25, dense, or hybrid). Rerank the top 30–50 with a cross-encoder. Send the top 5–10 reranked passages to the LLM. This keeps latency manageable while ensuring the LLM sees only high-precision context.
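
The cascade can be sketched as a small function with pluggable stages. The names `cascade`, `toy_retrieve`, and `toy_score` are hypothetical, and the toy stages exist only so the sketch runs without a model or an index; swap in your own retriever and cross-encoder scorer.

```python
from typing import Callable, List, Tuple

Passage = Tuple[str, str]  # (passage_id, passage_text)

def cascade(
    query: str,
    retrieve: Callable[[str, int], List[Passage]],
    score_pair: Callable[[str, str], float],
    retrieve_k: int = 150,
    rerank_depth: int = 30,
    final_k: int = 8,
) -> List[Passage]:
    """Retrieve broadly, rerank a shortlist, keep only the best few."""
    candidates = retrieve(query, retrieve_k)   # stage 1: fast, recall-oriented
    shortlist = candidates[:rerank_depth]      # only this slice pays the reranking cost
    ranked = sorted(shortlist, key=lambda p: score_pair(query, p[1]), reverse=True)
    return ranked[:final_k]                    # context handed to the LLM

# Toy stand-ins (assumptions for illustration only)
def toy_retrieve(query: str, k: int) -> List[Passage]:
    return [(f"doc_{i}", f"passage {i}") for i in range(k)]

def toy_score(query: str, text: str) -> float:
    # Pretend higher-numbered passages are more relevant
    return float(text.split()[-1])

result = cascade("example query", toy_retrieve, toy_score)
print([pid for pid, _ in result])
```

This sketch scores one pair at a time for clarity. With a real cross-encoder you would score the whole shortlist in one batched call instead.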

**4. Latency cost is proportional to candidate count**  
Cross-encoders run a full forward pass for each query-passage pair. Reranking 30 candidates takes ~100–150ms on CPU, ~30–50ms on GPU (T4 or better). Reranking 100+ candidates pushes latency above 300ms. The default (rerank top-30, return top-8) fits most sub-500ms SLA budgets.
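
A back-of-the-envelope sketch of that budget, assuming latency scales linearly with candidate count. This is a simplification: real throughput depends on batching, sequence length, and hardware, so measure on your own setup. The function names are hypothetical.

```python
def ms_per_pair(total_ms: float, n_candidates: int) -> float:
    """Average cost of scoring one query-passage pair."""
    return total_ms / n_candidates

def estimated_latency_ms(n_candidates: int, per_pair_ms: float) -> float:
    """Projected rerank latency under the linear assumption."""
    return n_candidates * per_pair_ms

def max_rerank_depth(budget_ms: float, per_pair_ms: float) -> int:
    """Largest candidate count that fits inside the latency budget."""
    return int(budget_ms // per_pair_ms)

# Upper end of the CPU range quoted above: 150ms for 30 candidates
cpu_per_pair = ms_per_pair(150, 30)

print(cpu_per_pair)                             # 5.0
print(estimated_latency_ms(100, cpu_per_pair))  # 500.0
print(max_rerank_depth(200, cpu_per_pair))      # 40
```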

---

## What You Should Do

**Start with a proven default configuration:**

- **Retrieve 150 candidates** using your existing first-stage method (BM25, dense, or hybrid).
- **Rerank the top 30** with `cross-encoder/ms-marco-MiniLM-L-6-v2` (6-layer MiniLM, strong precision/speed tradeoff).
- **Send the top 8 reranked passages** to your LLM prompt.
- **Set a 200ms timeout** for reranking; fall back to first-stage top-k if exceeded.
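
One way to sketch the timeout-and-fallback step is with a thread pool from the standard library. `rerank_with_timeout`, `fast_rerank`, and `slow_rerank` are hypothetical names, and the sleeping function simulates a slow model call rather than running a real cross-encoder.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, List

def rerank_with_timeout(
    rerank_fn: Callable[[], List[str]],
    fallback_ids: List[str],
    executor: ThreadPoolExecutor,
    timeout_s: float = 0.2,
) -> List[str]:
    """Return reranked IDs, or the first-stage order if reranking is too slow."""
    future = executor.submit(rerank_fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # only helps if the task has not started yet
        return fallback_ids

# Long-lived pool, created once at startup
executor = ThreadPoolExecutor(max_workers=2)
first_stage_ids = [f"doc_{i}" for i in range(8)]

def fast_rerank() -> List[str]:
    return ["doc_3", "doc_1"]

def slow_rerank() -> List[str]:
    time.sleep(0.5)  # simulate a rerank call that blows the 200ms budget
    return ["doc_3", "doc_1"]

fast_result = rerank_with_timeout(fast_rerank, first_stage_ids, executor)
slow_result = rerank_with_timeout(slow_rerank, first_stage_ids, executor)
executor.shutdown(wait=False)

print(fast_result)  # reranked order
print(slow_result)  # first-stage fallback
```

A Python thread cannot be killed, so a timed-out rerank keeps running in the background and still consumes compute. The timeout protects response latency, not throughput.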

**Measure impact with two metrics:**

- **nDCG@10** on a labeled eval set (query + gold passage IDs). Reranking should lift nDCG by 5–15 points if first-stage precision is weak.
- **Answer correctness** via LLM-as-judge or human eval. Track whether reranking reduces hallucinations or incorrect citations in your application.
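
For the first metric, here is a minimal nDCG@k sketch in plain Python. It uses linear gain (the relevance label itself); some implementations use `2^rel - 1` instead, and the two agree when labels are binary. The document IDs and labels below are made up for illustration.

```python
import math
from typing import Dict, List

def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_ids: List[str], gold: Dict[str, float], k: int = 10) -> float:
    """nDCG@k for one query; gold maps passage_id -> relevance label."""
    gains = [gold.get(pid, 0.0) for pid in ranked_ids]
    ideal_dcg = dcg_at_k(sorted(gold.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical eval example: two gold passages for one query
gold = {"doc_2": 1.0, "doc_7": 1.0}
first_stage = ["doc_0", "doc_2", "doc_5", "doc_7"]
reranked = ["doc_2", "doc_7", "doc_0", "doc_5"]

ndcg_before = ndcg_at_k(first_stage, gold)
ndcg_after = ndcg_at_k(reranked, gold)
print(f"before: {ndcg_before:.3f}, after: {ndcg_after:.3f}")
```

Average the per-query scores over your labeled eval set, once with first-stage ordering and once with reranked ordering, to get the lift.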

**When to deploy reranking:**

- Precision@5 from first-stage retrieval is below 60% on your eval set.
- Users report incorrect or contradictory answers despite relevant documents existing in your index.
- Your application can absorb +100–200ms latency without breaking SLA.

**When to skip reranking:**

- First-stage retrieval already achieves >80% precision@10 (e.g., small, curated corpus with simple queries).
- Latency budget is under 200ms end-to-end and cannot flex.
- Queries are keyword-heavy with little semantic nuance (pure BM25 may suffice).
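
The two checklists can be folded into a small helper. This is a sketch under assumptions: the thresholds come from the lists above, the function names are hypothetical, and the "measure" outcome for cases the lists do not cover is an addition for illustration.

```python
from typing import List, Set

def precision_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are labeled relevant."""
    return sum(pid in relevant for pid in ranked_ids[:k]) / k

def rerank_decision(p_at_5: float, p_at_10: float, latency_budget_ms: float) -> str:
    """Return 'rerank', 'skip', or 'measure' for cases the rules do not cover."""
    if latency_budget_ms < 200:   # end-to-end budget cannot absorb reranking
        return "skip"
    if p_at_10 > 0.80:            # first stage is already precise
        return "skip"
    if p_at_5 < 0.60:             # weak top-of-list precision
        return "rerank"
    return "measure"              # gray zone: gather more eval data (assumption)

# Hypothetical first-stage ranking and relevance labels for one query
ranked = ["doc_1", "doc_9", "doc_4", "doc_3", "doc_8",
          "doc_2", "doc_7", "doc_5", "doc_6", "doc_0"]
relevant = {"doc_1", "doc_4", "doc_2", "doc_7"}

p5 = precision_at_k(ranked, relevant, 5)
p10 = precision_at_k(ranked, relevant, 10)
print(p5, p10, rerank_decision(p5, p10, latency_budget_ms=500))
```

In practice you would average precision over the whole eval set before applying the rule, rather than deciding from a single query.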

**Integrate reranking into your pipeline with this code:**

This example demonstrates how to add a cross-encoder reranker to your RAG pipeline. It retrieves first-stage candidates, scores the top 30 in batches with a cross-encoder, and keeps the 8 highest-scoring passages for the LLM prompt.

```python
from typing import List, Tuple
from sentence_transformers import CrossEncoder

def get_first_stage_candidates(query: str, k: int = 150) -> List[Tuple[str, str]]:
    """
    Simulate first-stage retrieval (replace with your BM25/dense retriever).
    Returns a list of (passage_id, passage_text), best first.
    """
    return [(f"doc_{i}", f"Passage {i} for '{query}'") for i in range(k)]

def rerank_candidates(
    query: str,
    candidates: List[Tuple[str, str]],
    cross_encoder: CrossEncoder,
    top_n: int = 30
) -> List[Tuple[str, float]]:
    """
    Score every (query, passage) pair with the cross-encoder and return
    the top_n (passage_id, score) pairs, highest score first.
    """
    input_pairs = [(query, ptext) for _, ptext in candidates]
    # predict() returns one relevance score per pair
    scores = cross_encoder.predict(input_pairs, batch_size=32)
    # Cast NumPy scores to plain floats to match the declared return type
    scored = [(pid, float(score)) for (pid, _), score in zip(candidates, scores)]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:top_n]

# Load model once at startup
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Pipeline: retrieve 150, rerank the top 30, keep the best 8
query = "Does FDA allow off-label promotion of drugs to physicians?"
candidates = get_first_stage_candidates(query, k=150)
reranked = rerank_candidates(query, candidates[:30], cross_encoder, top_n=8)

# IDs of the top 8 passages to include in the LLM prompt
top_passage_ids = [pid for pid, _ in reranked]
print(f"Top passage IDs for LLM: {top_passage_ids}")
```

---

## Conclusion – Key Takeaways

Cross-encoder reranking fixes "close but wrong" retrieval by scoring query and passage jointly, catching nuances that first-stage methods miss. It trades 100–200ms of latency for measurably better precision at top-k, reducing hallucinations and incorrect citations.

**Deploy reranking when:**

- First-stage precision@5 is below 60%
- Answer correctness is suffering despite good recall
- You can afford +100–200ms in your latency budget

**Default recipe:** retrieve 150, rerank top-30 with `ms-marco-MiniLM-L-6-v2`, send top-8 to LLM. Measure nDCG@10 and answer correctness to confirm impact.

For advanced patterns—late interaction models like ColBERT, caching strategies, or dynamic rerank depth under SLA pressure—see the related explainers on production-scale reranking and retrieval optimization.