# Tutorial 5: Side-by-Side RAG Benchmark

This notebook compares the four tutorial variants under the same conditions:

1. Dense baseline + fixed chunks
2. Dense + semantic chunks
3. Dense + semantic + reranking
4. Hybrid (dense + BM25) + semantic chunks

```mermaid
flowchart TD
    A[Shared Documents + Queries] --> B1[Variant 1]
    A --> B2[Variant 2]
    A --> B3[Variant 3]
    A --> B4[Variant 4]
    B1 --> C[Common Evaluation Harness]
    B2 --> C
    B3 --> C
    B4 --> C
    C --> D[Benchmark Table + Plots]
```

Outcome: one consolidated table with `Recall@k`, `MRR`, `Groundedness`, and `Latency`.
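
For readers new to the two retrieval metrics, here is a minimal sketch of how `Recall@k` and `MRR` are commonly defined. The function names and the id-based inputs are assumptions made for illustration; the tutorial's own `rag_tutorials.evaluation` module may compute them differently (for example, as a binary hit rate).

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant ids that appear among the top-k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)


def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Toy ids, not taken from the tutorial data
print(recall_at_k(["a", "b", "c"], ["b", "z"], k=2))
print(mean_reciprocal_rank([(["a", "b"], ["a"]), (["a", "b"], ["b"])]))
```

Recall@k rewards finding the relevant chunk anywhere in the top k, while MRR also rewards ranking it early, which is why the two can disagree about which variant is better.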

## Learning checkpoint: how to read benchmark outcomes

**What this notebook confirms**
- The best variant for your workload depends on which metrics you prioritize.
- Retrieval quality, groundedness, and latency can move in different directions.

**Common interpretation mistakes to avoid**
- Picking a winner from one metric only (e.g., MRR without latency).
- Ignoring failure cases where a variant has blind spots.
- Assuming the best mean score is best for every query type.

**How to use results for next iteration**
- Select a baseline variant from measured tradeoffs.
- Tune top-k, chunking, reranking depth, and hybrid fusion.
- Add domain-specific queries to stress-test your production use cases.
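
One way to make "metric priority" explicit is a weighted score over the summary metrics. The sketch below is only an illustration: the `rank_variants` helper, the weight names, and the numbers are hypothetical placeholders, not measured results from this benchmark.

```python
def rank_variants(summary, weights, latency_key="latency_ms"):
    """Rank variants by a weighted sum of quality metrics plus a speed term.

    Latency is min-max normalized across variants so that the fastest
    variant gets speed 1.0 and the slowest gets 0.0.
    """
    latencies = [row[latency_key] for row in summary.values()]
    slowest, fastest = max(latencies), min(latencies)
    span = slowest - fastest
    scores = {}
    for name, row in summary.items():
        speed = 1.0 if span == 0 else (slowest - row[latency_key]) / span
        quality = sum(
            weight * row[metric]
            for metric, weight in weights.items()
            if metric != "speed"
        )
        scores[name] = quality + weights.get("speed", 0.0) * speed
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder numbers for illustration only
summary = {
    "variant_a": {"recall_at_k": 0.80, "mrr": 0.60, "groundedness": 0.70, "latency_ms": 400.0},
    "variant_b": {"recall_at_k": 0.95, "mrr": 0.85, "groundedness": 0.85, "latency_ms": 1200.0},
}
quality_first = {"recall_at_k": 0.3, "mrr": 0.3, "groundedness": 0.3, "speed": 0.1}
speed_first = {"recall_at_k": 0.2, "mrr": 0.1, "groundedness": 0.2, "speed": 0.5}

print(rank_variants(summary, quality_first)[0][0])
print(rank_variants(summary, speed_first)[0][0])
```

With the same placeholder table, the two weightings pick different winners, which is the point of the checkpoint: decide the weights before looking for a winner.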

In [None]:
# 1) Setup and imports

import importlib.util  # the submodule must be imported explicitly before using find_spec
import os
from pathlib import Path
import shutil
import subprocess
import sys
import time

# Ensure uv is available (installs with: pip install uv)
if shutil.which("uv") is None:
    print("uv not found. Installing with pip...")
    subprocess.run([sys.executable, "-m", "pip", "install", "uv"], check=True)

# Ensure notebook runs from repo root and local src/ is importable
cwd = Path.cwd().resolve()
repo_root = next(
    (path for path in [cwd, *cwd.parents] if (path / "pyproject.toml").exists() and (path / "src").exists()),
    cwd,
)
os.chdir(repo_root)
src_path = repo_root / "src"
if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

REQUIRED_PACKAGES = [
    "openai",
    "chromadb",
    "numpy",
    "pandas",
    "rank_bm25",
    "sentence_transformers",
    "dotenv",
    "seaborn",
    "matplotlib",
]

missing = [pkg for pkg in REQUIRED_PACKAGES if importlib.util.find_spec(pkg) is None]
if missing:
    print("Missing packages:", missing)
    print("Running: uv sync")
    subprocess.run(["uv", "sync"], check=True)
    importlib.invalidate_caches()  # let the import system see newly installed packages
else:
    print("All required packages are available.")

# Third-party imports come after the dependency check, so a missing package
# triggers `uv sync` instead of an ImportError at the top of the cell
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from dotenv import load_dotenv

from rag_tutorials.io_utils import load_handbook_documents, load_queries
from rag_tutorials.chunking import fixed_chunk_documents, semantic_chunk_documents
from rag_tutorials.pipeline import build_dense_retriever, build_hybrid_retriever
from rag_tutorials.reranking import LocalCrossEncoderReranker
from rag_tutorials.qa import answer_with_context
from rag_tutorials.evaluation import evaluate_single, summarize

load_dotenv()
if not os.getenv("OPENAI_API_KEY"):
    raise EnvironmentError("OPENAI_API_KEY is missing. Configure .env before running this notebook.")

EMBED_MODEL = os.getenv("OPENAI_EMBEDDING_MODEL", "text-embedding-3-small")
CHAT_MODEL = os.getenv("OPENAI_CHAT_MODEL", "gpt-4.1-mini")
TOP_K = 5
EVAL_SIZE = 30

sns.set_theme(style="whitegrid")
print("Setup complete")
print("Working directory:", Path.cwd())
print("Repo root:", repo_root)
print("Using src path:", src_path)

In [None]:
# 2) Load shared handbook text + query set from data/

handbook_path = Path("data/handbook_manual.txt")
queries_path = Path("data/queries.jsonl")

if not handbook_path.exists() or not queries_path.exists():
    raise FileNotFoundError(
        "Shared data files are missing. Run: uv run python scripts/generate_data.py"
    )

documents = load_handbook_documents(handbook_path)
queries = load_queries(queries_path)[:EVAL_SIZE]

print({"handbook_path": str(handbook_path), "parsed_sections": len(documents), "eval_queries": len(queries)})

In [None]:
# 3) Build all four retrieval variants

fixed_chunks = fixed_chunk_documents(documents, chunk_size=260)
semantic_chunks = semantic_chunk_documents(documents)

v1_dense_fixed, _ = build_dense_retriever(
    chunks=fixed_chunks,
    collection_name="t5_v1_dense_fixed",
    embedding_model=EMBED_MODEL,
)

v2_dense_semantic, _ = build_dense_retriever(
    chunks=semantic_chunks,
    collection_name="t5_v2_dense_semantic",
    embedding_model=EMBED_MODEL,
)

reranker = LocalCrossEncoderReranker()

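RERANK_CANDIDATES = 10  # size of the first-pass pool handed to the cross-encoder

def v3_reranked(question: str, top_k: int = TOP_K):
    # Retrieve a wider candidate pool (never smaller than top_k), then rerank down to top_k
    first_pass = v2_dense_semantic(question, top_k=max(RERANK_CANDIDATES, top_k))
    return reranker.rerank(question, first_pass, top_k=top_k)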

v4_hybrid_semantic = build_hybrid_retriever(semantic_chunks, v2_dense_semantic)

variants = {
    "t1_dense_fixed": v1_dense_fixed,
    "t2_dense_semantic": v2_dense_semantic,
    "t3_reranked": v3_reranked,
    "t4_hybrid": v4_hybrid_semantic,
}

print("Variants ready:", list(variants.keys()))
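
Variant 4 merges a dense ranking with a BM25 ranking. How `build_hybrid_retriever` fuses the two lists is not shown in this notebook, so the following is only a sketch of one widely used option, reciprocal rank fusion (RRF). The function name, the chunk ids, and the default `k=60` are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_k=5):
    """Fuse several ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by id for deterministic output
    fused = sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))
    return fused[:top_k]


# Hypothetical chunk ids, not taken from the tutorial index
dense_ranking = ["c1", "c2", "c3"]
bm25_ranking = ["c3", "c1", "c4"]
print(reciprocal_rank_fusion([dense_ranking, bm25_ranking], top_k=4))
```

RRF uses only rank positions, so it avoids having to put cosine similarities and BM25 scores on a common scale. Chunks found by both retrievers rise to the top, while chunks found by only one still survive in the fused list.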

In [None]:
# Chunk boundary visualization (same source text, different split strategies)

section_doc = next(doc for doc in documents if doc.section == "International Work")
fixed_view = [c.text for c in fixed_chunk_documents([section_doc], chunk_size=120)]
semantic_view = [c.text for c in semantic_chunk_documents([section_doc])]

print("Section:", section_doc.section)
print("\nFixed chunks:")
for idx, chunk_text in enumerate(fixed_view, start=1):
    print(f"[{idx}] {chunk_text}")

print("\nSemantic chunks:")
for idx, chunk_text in enumerate(semantic_view, start=1):
    print(f"[{idx}] {chunk_text}")
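
To make the difference between the two split strategies concrete without the tutorial helpers, here is a self-contained sketch. It assumes fixed chunking means fixed-width character windows and uses sentence-boundary packing as a simple stand-in for semantic chunking; the actual logic of `fixed_chunk_documents` and `semantic_chunk_documents` may differ, and the sample text is made up.

```python
import re


def fixed_chunks(text, chunk_size):
    """Cut the text into consecutive windows of chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def sentence_chunks(text, max_chars):
    """Pack whole sentences into chunks of at most max_chars where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk at a sentence boundary
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "Alpha beta gamma. Delta epsilon zeta. Eta theta iota."
print(fixed_chunks(sample, 20))
print(sentence_chunks(sample, 40))
```

The fixed windows cut words and sentences wherever the character count runs out, whereas the sentence-based chunks always end on a sentence boundary. That is the kind of difference to look for in the printed output of the cell above.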

In [None]:
# 4) Evaluate every variant with identical queries

def answer_fn(question: str, contexts: list[str]) -> str:
    return answer_with_context(question, contexts, model=CHAT_MODEL)

summary_rows = []
all_rows = []

for name, retrieval_fn in variants.items():
    started = time.perf_counter()
    rows = [
        evaluate_single(
            query=q,
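            # Bind the current retriever via a default argument so the lambda never depends on the loop variable
            retrieval_fn=lambda query_text, fn=retrieval_fn: fn(query_text, top_k=TOP_K),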
            answer_fn=answer_fn,
            top_k=TOP_K,
        )
        for q in queries
    ]
    elapsed = time.perf_counter() - started
    metrics = summarize(rows)
    metrics["variant"] = name
    metrics["wall_seconds"] = elapsed
    summary_rows.append(metrics)

    for row in rows:
        all_rows.append(
            {
                "variant": name,
                "query_id": row.query_id,
                "recall_at_k": row.recall_at_k,
                "mrr": row.mrr,
                "latency_ms": row.latency_ms,
                "groundedness": row.groundedness,
            }
        )

summary_df = pd.DataFrame(summary_rows).sort_values("mrr", ascending=False)
detail_df = pd.DataFrame(all_rows)
summary_df
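
Mean latency can hide slow outliers, so it may help to look at percentiles of the per-query latencies as well. The sketch below assumes rows shaped like the `all_rows` dictionaries built above (`variant` and `latency_ms` keys). The helper names and the numbers are placeholders, not measurements.

```python
import math
from collections import defaultdict


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_profile(rows):
    """Group per-query latencies by variant and report mean, p50, and p95."""
    by_variant = defaultdict(list)
    for row in rows:
        by_variant[row["variant"]].append(row["latency_ms"])
    return {
        name: {
            "mean": sum(values) / len(values),
            "p50": nearest_rank_percentile(values, 50),
            "p95": nearest_rank_percentile(values, 95),
        }
        for name, values in by_variant.items()
    }


# Placeholder latencies: four typical queries and one slow outlier
demo_rows = [
    {"variant": "demo", "latency_ms": value}
    for value in [100.0, 120.0, 110.0, 130.0, 900.0]
]
print(latency_profile(demo_rows))
```

In this toy example the single slow query pulls the mean far above the median, which is why a latency comparison based only on averages can mislead.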

In [None]:
# 5) Plot benchmark comparison

melted = summary_df.melt(
    id_vars=["variant"],
    value_vars=["recall_at_k", "mrr", "groundedness"],
    var_name="metric",
    value_name="value",
)

plt.figure(figsize=(10, 5))
sns.barplot(data=melted, x="metric", y="value", hue="variant")
plt.title("RAG Variant Quality Comparison")
plt.ylim(0, 1)
plt.show()

plt.figure(figsize=(10, 4))
sns.barplot(data=summary_df, x="variant", y="latency_ms")
plt.title("Average End-to-End Latency (ms)")
plt.xticks(rotation=20)
plt.show()

In [None]:
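# 6) Failure analysis helper: show where one variant succeeds and another fails

from IPython.display import display  # explicit import; Jupyter normally injects this name

# One row per query, one column per variant, values are recall@k
pivot_recall = detail_df.pivot_table(index="query_id", columns="variant", values="recall_at_k")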

if {"t1_dense_fixed", "t4_hybrid"}.issubset(set(pivot_recall.columns)):
    hard_cases = pivot_recall[(pivot_recall["t1_dense_fixed"] < 1.0) & (pivot_recall["t4_hybrid"] == 1.0)]
    print("Queries where hybrid recovers misses from baseline:", len(hard_cases))
    display(hard_cases.head(10))
else:
    print("Expected variants not found in detail table.")
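
The same comparison can be run in both directions for any pair of variants, so that regressions are counted alongside recoveries. This is a minimal sketch over rows shaped like the `all_rows` dictionaries; the `compare_variants` helper and the placeholder rows are hypothetical.

```python
def compare_variants(rows, baseline, challenger, metric="recall_at_k"):
    """Return (recovered, regressed) query ids for challenger relative to baseline."""
    by_variant = {}
    for row in rows:
        by_variant.setdefault(row["variant"], {})[row["query_id"]] = row[metric]
    base = by_variant.get(baseline, {})
    chal = by_variant.get(challenger, {})
    shared = sorted(set(base) & set(chal))  # only queries scored for both variants
    recovered = [q for q in shared if base[q] < 1.0 and chal[q] == 1.0]
    regressed = [q for q in shared if base[q] == 1.0 and chal[q] < 1.0]
    return recovered, regressed


# Placeholder rows for illustration only
placeholder_rows = [
    {"variant": "base", "query_id": "q1", "recall_at_k": 0.0},
    {"variant": "base", "query_id": "q2", "recall_at_k": 1.0},
    {"variant": "base", "query_id": "q3", "recall_at_k": 1.0},
    {"variant": "new", "query_id": "q1", "recall_at_k": 1.0},
    {"variant": "new", "query_id": "q2", "recall_at_k": 1.0},
    {"variant": "new", "query_id": "q3", "recall_at_k": 0.5},
]
recovered, regressed = compare_variants(placeholder_rows, "base", "new")
print("recovered:", recovered)
print("regressed:", regressed)
```

Reading the regressed list next to the recovered list guards against the blind-spot mistake from the learning checkpoint: a variant that fixes some misses may also break queries the baseline handled.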