# Tutorial 3 — Reranking (Two-Stage Retrieval)

## Where You Are in the Learning Journey

```
 Tutorial 1 --> Tutorial 2 --> Tutorial 3 --> Tutorial 4 --> Tutorial 5
 Basic RAG      Semantic       YOU ARE        Add Keyword    Benchmark
 (baseline)     Chunking       HERE           Search         All Four
```

**What changed from Tutorial 2:** one new stage added after retrieval — the **reranker**.
Chunking, embeddings, and the vector store are identical to Tutorial 2.

**What you will learn in this tutorial:**
- What a reranker is and how it differs from vector search
- What 'bi-encoder' and 'cross-encoder' mean and why the difference matters
- How two-stage retrieval works: retrieve a wide candidate set, then rerank to find the best
- Why ranking position matters for LLM answer quality

```mermaid
flowchart LR
    A[Query] --> B[Dense Retrieve top-10]
    B --> C[Cross-Encoder Rerank]
    C --> D[Top-5 for LLM]
```


## Learning Checkpoint: What Reranking Solves and What Remains

### What Works Better in Tutorial 3

- Candidate chunks are re-ordered with stronger query awareness: the reranker reads
  the query and each chunk *together* and produces a relevance score that is
  typically more accurate than cosine similarity alone.
- The most relevant context is more likely to appear at rank 1, which the LLM
  reads first (position bias: LLMs tend to pay more attention to early context).

### What Still Does Not Work Well

- **Dense retrieval still dominates stage 1.** If a chunk is not in the first-stage
  top-k (e.g. because the query hinges on an exact term that the embedding captures
  poorly), the reranker never sees it.
- **Exact-term queries** like 'Form A-12' are weakly handled by embedding similarity;
  no reranker can recover a chunk that was never retrieved.
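
To tell these two failure modes apart when debugging, it can help to check at which stage a known-relevant chunk disappeared. The helper below is a minimal sketch: `diagnose_miss` and the chunk ids are hypothetical names for illustration and are not part of the `rag_tutorials` package.

```python
def diagnose_miss(gold_id: str, first_pass_ids: list[str], reranked_ids: list[str]) -> str:
    """Classify where a relevant chunk was lost in a two-stage pipeline."""
    if gold_id not in first_pass_ids:
        # Stage 1 never surfaced the chunk, so the reranker had no chance.
        return "stage1_miss"
    if gold_id not in reranked_ids:
        # The chunk was a candidate, but the reranker cut it from the final list.
        return "stage2_miss"
    return "hit"


# Hypothetical chunk ids, for illustration only.
first_pass = ["c07", "c12", "c03", "c21"]
reranked = ["c12", "c07"]

print(diagnose_miss("c12", first_pass, reranked))  # hit
print(diagnose_miss("c03", first_pass, reranked))  # stage2_miss
print(diagnose_miss("c99", first_pass, reranked))  # stage1_miss
```

A `stage1_miss` is the case no reranker can fix; only a change to first-stage retrieval (a larger k or an additional retrieval signal) can help there.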

### Why Move to Tutorial 4?

Tutorial 4 adds **keyword (BM25) retrieval** alongside dense retrieval.  BM25 handles
exact token matches that embeddings miss, giving the pipeline two complementary signals.


In [None]:
# 1-5) Setup, load handbook text, chunk, embed, index

import importlib.util  # find_spec lives in the importlib.util submodule
import os
from pathlib import Path
import shutil
import subprocess
import sys

import pandas as pd
from dotenv import load_dotenv

if shutil.which("uv") is None:
    print("uv not found. Installing with pip...")
    subprocess.run([sys.executable, "-m", "pip", "install", "uv"], check=True)

cwd = Path.cwd().resolve()
repo_root = next(
    (path for path in [cwd, *cwd.parents] if (path / "pyproject.toml").exists() and (path / "src").exists()),
    cwd,
)
os.chdir(repo_root)
src_path = repo_root / "src"
if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

REQUIRED_PACKAGES = ["openai", "chromadb", "numpy", "pandas", "rank_bm25", "sentence_transformers", "dotenv"]
PIP_NAME_MAP = {"rank_bm25": "rank-bm25", "sentence_transformers": "sentence-transformers", "dotenv": "python-dotenv"}

def find_missing(packages: list[str]) -> list[str]:
    importlib.invalidate_caches()
    return [pkg for pkg in packages if importlib.util.find_spec(pkg) is None]

missing = find_missing(REQUIRED_PACKAGES)
if missing:
    print("Missing packages:", missing)
    print("Running: uv sync")
    subprocess.run(["uv", "sync"], check=True)

missing_after_sync = find_missing(REQUIRED_PACKAGES)
if missing_after_sync:
    pip_targets = [PIP_NAME_MAP.get(pkg, pkg) for pkg in missing_after_sync]
    print("Installing into current kernel with pip:", pip_targets)
    subprocess.run([sys.executable, "-m", "pip", "install", *pip_targets], check=True)

final_missing = find_missing(REQUIRED_PACKAGES)
if final_missing:
    raise ImportError(f"Dependencies still missing in current kernel: {final_missing}")

from rag_tutorials.io_utils import load_handbook_documents, load_queries
from rag_tutorials.chunking import semantic_chunk_documents
from rag_tutorials.pipeline import build_dense_retriever
from rag_tutorials.reranking import LocalCrossEncoderReranker
from rag_tutorials.qa import answer_with_context
from rag_tutorials.evaluation import evaluate_single, summarize

load_dotenv()
if not os.getenv("OPENAI_API_KEY"):
    raise EnvironmentError("OPENAI_API_KEY is required")

embedding_model = os.getenv("OPENAI_EMBEDDING_MODEL", "text-embedding-3-small")
chat_model = os.getenv("OPENAI_CHAT_MODEL", "gpt-4.1-mini")

handbook_path = Path("data/handbook_manual.txt")
queries_path = Path("data/queries.jsonl")
if not handbook_path.exists() or not queries_path.exists():
    raise FileNotFoundError("Run: uv run python scripts/generate_data.py")

documents = load_handbook_documents(handbook_path)
queries = load_queries(queries_path)
chunks = semantic_chunk_documents(documents)

dense_retriever, _ = build_dense_retriever(
    chunks=chunks,
    collection_name="tutorial3_dense_semantic",
    embedding_model=embedding_model,
)
reranker = LocalCrossEncoderReranker()

In [None]:
# Chunk boundary visualization (same source text, different split strategies)

from rag_tutorials.chunking import fixed_chunk_documents

section_doc = next(doc for doc in documents if doc.section == "International Work")
fixed_view = [c.text for c in fixed_chunk_documents([section_doc], chunk_size=120)]
semantic_view = [c.text for c in semantic_chunk_documents([section_doc])]

print("Section:", section_doc.section)
print("\nFixed chunks:")
for idx, chunk_text in enumerate(fixed_view, start=1):
    print(f"[{idx}] {chunk_text}")

print("\nSemantic chunks:")
for idx, chunk_text in enumerate(semantic_view, start=1):
    print(f"[{idx}] {chunk_text}")

### Key Concepts: What Is a Reranker and How Does It Work?

#### Why Does Ranking Position Matter?

When the LLM receives retrieved chunks, it reads them in order.
Studies of long-context prompting report that models tend to rely more on information
near the **beginning** of the context than on information buried further down, for example at rank 4 of 5.

```
Context sent to LLM:
  [rank 1]  remote work policy — full section       <- LLM pays most attention here
  [rank 2]  leave entitlement summary
  [rank 3]  manager approval process
  [rank 4]  tax implications — working abroad       <- LLM may under-weight this
  [rank 5]  HR contact information

If rank 4 is actually the most useful chunk, the LLM might produce a weaker answer.
A reranker re-orders so the most useful chunk moves to rank 1.
```

#### What Is a Bi-Encoder? (Dense Retrieval = Stage 1)

A **bi-encoder** is a model that converts each text to a vector *independently* of any other text.
'Bi' refers to the two separate encodings being compared: one for the query, one for each chunk.

```
Query:  "remote work policy"     -> Encoder -> query_vec  = [0.85, 0.15, 0.45, ...]
Chunk:  "Employees may work..."  -> Encoder -> chunk_vec  = [0.80, 0.20, 0.50, ...]

Similarity = cosine(query_vec, chunk_vec) = 0.97

Key point: the encoder never sees query and chunk AT THE SAME TIME.
This means it can pre-compute and store all chunk vectors in advance.
```

**Advantage:** Chunk vectors are computed once during indexing and reused for every query.
Similarity lookup is fast — milliseconds even for thousands of chunks.

**Limitation:** Because the query and chunk are encoded independently, the model
cannot see how words in the query interact with words in the chunk.  Some
relevance signals are missed.
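
The index-once, query-many pattern can be sketched with a toy encoder. This is an illustration under stated assumptions, not the tutorial's embedding model: `toy_encode` is a hypothetical bag-of-words stand-in, and the chunk texts are made up.

```python
from collections import Counter
import math


def toy_encode(text: str) -> dict[str, float]:
    """Hypothetical stand-in for an embedding model: unit-length bag-of-words vector."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {token: c / norm for token, c in counts.items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    # Both vectors are unit length, so the dot product equals the cosine.
    return sum(weight * b.get(token, 0.0) for token, weight in a.items())


chunks = [
    "employees may work remotely with manager approval",
    "annual leave entitlement is twenty five days",
]

# Indexing time: each chunk is encoded once, with no query in sight.
chunk_vecs = [toy_encode(chunk) for chunk in chunks]


def dense_search(query: str, top_k: int = 2) -> list[tuple[float, str]]:
    # Query time: only the query is encoded; chunk vectors are reused as stored.
    query_vec = toy_encode(query)
    scored = [(cosine(query_vec, vec), chunk) for vec, chunk in zip(chunk_vecs, chunks)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


results = dense_search("remote work policy")
for score, chunk in results:
    print(f"{score:.3f}  {chunk}")
```

Note that the toy encoder matches "work" but not "remote" against "remotely"; a real embedding model would handle that better, but the structural point is the same: the encoder never sees the query and a chunk in the same input.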

#### What Is a Cross-Encoder? (Reranking = Stage 2)

A **cross-encoder** reads the query and a single chunk *together* as one combined input,
then outputs a single relevance score.

```
Input:  [query] + [SEP] + [chunk]  ->  Cross-Encoder  ->  relevance score: 0.94

Example:
  Input:  "What is the remote work policy? [SEP] Employees may work remotely..."
  Output: 0.94  (high relevance)

  Input:  "What is the remote work policy? [SEP] Annual leave entitlement is..."
  Output: 0.12  (low relevance)
```

**Advantage:** Sees the full interaction between query tokens and chunk tokens.
Typically yields more accurate relevance scores than cosine similarity.

**Limitation:** Cannot pre-compute scores, because each score depends on the query: the full
model must run for every (query, chunk) pair at query time. That is too slow to run on every
chunk in the corpus, which is why we use it only on the top-10 candidates from stage 1.
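
The reranking step itself is small once a pair scorer exists. The sketch below assumes a scorer that takes a list of `(query, chunk)` pairs and returns one score per pair; `toy_score_pairs` is a hypothetical placeholder (token overlap), not a real cross-encoder, and the tutorial's `LocalCrossEncoderReranker` may be implemented differently.

```python
from typing import Callable, Sequence

Pair = tuple[str, str]


def toy_score_pairs(pairs: Sequence[Pair]) -> list[float]:
    """Placeholder scorer: share of query tokens that also occur in the chunk."""
    scores = []
    for query, chunk in pairs:
        query_tokens = set(query.lower().split())
        chunk_tokens = set(chunk.lower().split())
        scores.append(len(query_tokens & chunk_tokens) / max(len(query_tokens), 1))
    return scores


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[Sequence[Pair]], Sequence[float]] = toy_score_pairs,
    top_k: int = 5,
) -> list[tuple[str, float]]:
    # Every candidate is paired with the query, so scoring happens at query time.
    pairs = [(query, chunk) for chunk in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


candidates = [
    "annual leave entitlement is 25 days",
    "employees may work remotely two days per week",
    "the remote work policy requires manager approval",
]
top = rerank("remote work policy", candidates, top_k=2)
for chunk, score in top:
    print(f"{score:.2f}  {chunk}")
```

With the sentence-transformers package, the `predict` method of a `CrossEncoder` accepts the same list-of-pairs shape, so a real model could be passed in as `score_pairs` in place of the placeholder.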


### Two-Stage Retrieval: Why Not Just Use the Cross-Encoder Directly?

A natural question: if the cross-encoder is more accurate, why use dense retrieval at all?

**The answer is speed.**

| | Bi-encoder (dense, Stage 1) | Cross-encoder (Stage 2) |
|---|---|---|
| **How it works** | Embeds query and chunk *independently*; compares vectors | Reads `(query, chunk)` as one input; models their interaction |
| **Vectors pre-computed?** | Yes — chunk vectors stored in advance | No — must run every pair at query time |
| **Speed** | Milliseconds for thousands of chunks | Seconds if run on thousands of chunks |
| **Accuracy** | Good but misses subtle word interactions | Better — sees query and chunk together |
| **Role in pipeline** | Fast first-pass: retrieve top-10 candidates | Accurate second-pass: reorder those 10 candidates |

**Stage 1 — Dense retrieval (same cosine similarity as Tutorial 1)**  
The query is embedded at query time and compared with the stored chunk vectors; the top-10 chunks by cosine score are selected.
This is fast because chunk vectors were pre-computed and stored in Chroma.

**Stage 2 — Cross-encoder reranking (new in this tutorial)**  
The cross-encoder model reads each `(query, chunk)` pair and outputs a relevance score.
It can see word-level interactions (e.g., the word "cannot" near "work abroad" matters).
We run it on only 10 candidates: few enough to stay fast, yet enough for the reranker to surface chunks that Stage 1 under-ranked.

**Why top-10 → top-5?**  
We ask Stage 1 for 10 candidates (more recall), rerank all 10, then send only the top-5 to the LLM.
This balances cost (LLM input tokens) with accuracy (best context available).
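
The speed argument comes down to how many `(query, chunk)` pairs the cross-encoder has to score per query. A back-of-the-envelope sketch follows; the corpus size and per-pair latency are assumed figures for illustration, not measurements from this tutorial.

```python
def rerank_cost_seconds(num_pairs: int, seconds_per_pair: float) -> float:
    """Cross-encoder cost grows linearly with the number of pairs scored."""
    return num_pairs * seconds_per_pair


N_CHUNKS = 5_000         # hypothetical corpus size
FIRST_STAGE_K = 10       # matches first_stage_k used in this tutorial
SECONDS_PER_PAIR = 0.02  # assumed latency per pair, not measured

cross_encoder_only = rerank_cost_seconds(N_CHUNKS, SECONDS_PER_PAIR)
two_stage = rerank_cost_seconds(FIRST_STAGE_K, SECONDS_PER_PAIR)

print(f"cross-encoder on every chunk: {cross_encoder_only:.1f} s per query")
print(f"cross-encoder on top-{FIRST_STAGE_K} only: {two_stage:.1f} s per query")
```

Under these assumptions the two-stage pipeline scores 500 times fewer pairs per query, and the cross-encoder cost stays flat as the corpus grows because it depends on `FIRST_STAGE_K`, not on `N_CHUNKS`.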

See **Tutorial 1, cells 8–10** for the cosine similarity basics.

### Nearest-Neighbor Search in Stage 1 — How Candidates Are Shortlisted

Before the reranker sees anything, **dense retrieval** runs nearest-neighbor search
to produce a shortlist of candidates.  Here is what that looks like:

```
Stage 1 — Dense nearest-neighbor (top-k=10)

 rank  score   chunk
 ────  ──────  ────────────────────────────────────────
   1   0.91    remote work policy — full section
   2   0.87    international transfer rules
   3   0.80    home-office equipment allowance
   4   0.76    tax implications — working abroad
   5   0.70    remote work — manager approval process
  ...  ...     (6 more candidates)

← ALL 10 are handed to the cross-encoder for Stage 2 →
```

**Why use a large first-stage k?**  
The nearest-neighbor score (cosine similarity) measures vector direction, not
deep semantic match.  A truly relevant chunk might be ranked 7th by cosine
similarity but 1st after the cross-encoder reads the query and chunk together.
Casting a wider net (k=10 instead of k=3) gives the reranker a better pool to
work with at a small cost in latency.

```
Stage 2 — Reranker re-orders the 10 candidates:

 new rank  rerank score  original rank  chunk
 ────────  ────────────  ─────────────  ──────────────────────────────────
    1          0.94           4         tax implications — working abroad   ← moved up!
    2          0.91           1         remote work policy — full section
    3          0.83           2         international transfer rules
    ...        ...           ...        (2 more candidates kept)
  (remaining 5 candidates dropped; only final_k=5 passed to LLM)
```
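
The effect of the first-stage k can be replayed with the numbers from the two tables above. This is a sketch using only the five Stage 1 rows that are shown; the chunk labels are shortened, and the two rerank scores marked as hypothetical are filled in for illustration because the tables do not list them.

```python
# Dense scores copied from the Stage 1 table (the five rows shown).
stage1_scores = {
    "remote work policy": 0.91,
    "international transfer rules": 0.87,
    "home-office equipment allowance": 0.80,
    "tax implications (working abroad)": 0.76,
    "manager approval process": 0.70,
}

# Rerank scores: the first three come from the Stage 2 table.
rerank_scores = {
    "tax implications (working abroad)": 0.94,
    "remote work policy": 0.91,
    "international transfer rules": 0.83,
    "manager approval process": 0.55,         # hypothetical value
    "home-office equipment allowance": 0.30,  # hypothetical value
}


def two_stage(first_stage_k: int, final_k: int) -> list[str]:
    # Stage 1: shortlist by dense score.
    shortlist = sorted(stage1_scores, key=stage1_scores.get, reverse=True)[:first_stage_k]
    # Stage 2: re-order only the shortlist by rerank score.
    return sorted(shortlist, key=rerank_scores.get, reverse=True)[:final_k]


wide = two_stage(first_stage_k=5, final_k=3)
narrow = two_stage(first_stage_k=3, final_k=3)

print("wide shortlist  :", wide)
print("narrow shortlist:", narrow)
```

With the wide shortlist the tax chunk moves from dense rank 4 to rank 1; with `first_stage_k=3` it never reaches the reranker, so it cannot appear in the final list at all.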

See **Tutorial 1 cells 10–13** for the step-by-step nearest-neighbor walkthrough.


In [None]:
# 6) Retriever + reranker logic and novice score inspection

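def retrieve_with_rerank(question: str, first_stage_k: int = 10, final_k: int = 5):
    """Return (first_pass, reranked): the dense shortlist and its reranked top `final_k`."""
    # Stage 1: wide, fast dense retrieval.
    first_pass = dense_retriever(question, top_k=first_stage_k)
    # Stage 2: the cross-encoder rescores the shortlist and keeps the best `final_k`.
    reranked = reranker.rerank(question, first_pass, top_k=final_k)
    return first_pass, reranked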

probe = "What is the policy for working from another country?"
first_pass, reranked = retrieve_with_rerank(probe)

before_df = pd.DataFrame([
    {"rank": i + 1, "chunk_id": r.chunk_id, "dense_score": r.score, "preview": r.text[:90]}
    for i, r in enumerate(first_pass)
])
after_df = pd.DataFrame([
    {"rank": i + 1, "chunk_id": r.chunk_id, "rerank_score": r.score, "preview": r.text[:90]}
    for i, r in enumerate(reranked)
])

print("Before reranking")
display(before_df.head(10))
print("After reranking")
display(after_df.head(5))

In [None]:
# 7-8) Prompt + end-to-end RAG query

def rag_answer_reranked(question: str, top_k: int = 5):
    _, ranked = retrieve_with_rerank(question, first_stage_k=10, final_k=top_k)
    context = [r.text for r in ranked]
    answer = answer_with_context(question, context, model=chat_model)
    return answer, ranked

answer, ranked = rag_answer_reranked(probe)
print(answer)

In [None]:
# 9-10) Evaluation queries and debug output

def retrieval_fn(question: str):
    _, ranked = retrieve_with_rerank(question, first_stage_k=10, final_k=5)
    return ranked

rows = [
    evaluate_single(
        query=q,
        retrieval_fn=retrieval_fn,
        answer_fn=lambda question, context: answer_with_context(question, context, model=chat_model),
        top_k=5,
    )
    for q in queries[:20]
]

print("Tutorial 3 metrics:", summarize(rows))