# Tutorial 4: Hybrid Search (Dense + Keyword)

This tutorial adds BM25 keyword retrieval and fuses its results with dense retrieval using Reciprocal Rank Fusion (RRF).

```mermaid
flowchart LR
    Q[Query] --> D[Dense Retriever]
    Q --> K[Keyword Retriever BM25]
    D --> F[Reciprocal Rank Fusion]
    K --> F
    F --> L[Top-k for Generation]
```
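
The fusion step in the diagram can be sketched in a few lines. RRF scores each chunk by summing `1 / (k + rank)` over every ranked list that contains it, so it needs only ranks and never compares raw dense and BM25 scores. This is a minimal standalone sketch, not the code inside `build_hybrid_retriever`: the function name `rrf_fuse`, the constant `k=60` (a commonly used default), and the toy chunk IDs are all assumptions for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_k: int = 5) -> list[tuple[str, float]]:
    """Fuse ranked lists of chunk IDs with Reciprocal Rank Fusion (hypothetical helper)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top result of each list contributes 1 / (k + 1).
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by chunk ID so the output is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_k]


# Toy rankings (assumed data): chunk IDs ordered best-first by each retriever.
dense_ranking = ["c3", "c1", "c7"]
keyword_ranking = ["c1", "c9", "c3"]

fused = rrf_fuse([dense_ranking, keyword_ranking])
for chunk_id, score in fused:
    print(chunk_id, round(score, 5))
```

Chunks that appear in both lists (`c1`, `c3`) rise above chunks that only one retriever found, which is the behavior the hybrid retriever relies on.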

## Learning checkpoint: hybrid tradeoffs and readiness for benchmarking

**What works better in Tutorial 4**
- Keyword retrieval improves exact-token recall (e.g., `Form A-12`).
- Dense retrieval still contributes semantic context.
- Fusion provides more robust retrieval across question styles.
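
To see why keyword retrieval helps with identifiers such as `Form A-12`, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is not the implementation behind `build_bm25`; the tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the three toy documents are assumptions chosen to keep the example self-contained.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumed tokenizer: lowercase, and keep hyphenated identifiers ("a-12") as one token.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25 (illustrative sketch)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization dampens long documents.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


toy_docs = [
    "Submit Form A-12 before any international trip.",
    "Travel abroad requires manager approval and insurance.",
    "Form B-7 covers equipment loans.",
]
toy_scores = bm25_scores("Do I need Form A-12 for my trip?", toy_docs)
best = max(range(len(toy_docs)), key=toy_scores.__getitem__)
print("best document:", best)
print([round(s, 3) for s in toy_scores])
```

The second toy document is topically related but shares no tokens with the query, so BM25 gives it a score of zero. A dense retriever would be expected to cover that case, which is why the two signals are combined.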

**Challenges you should observe**
- More moving parts increase tuning complexity (weights, top-k, fusion behavior).
- Latency and system complexity are higher than in the dense-only baseline, because every query now runs two retrievers plus a fusion step.
- Hybrid retrieval wins in many cases, but it does not win on every metric.
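
A rough way to observe the latency cost is to time each retrieval function over the same questions. The helper below is a hypothetical sketch (`time_retriever` and `fake_retriever` are not part of `rag_tutorials`); the stand-in retriever exists only so the block runs on its own.

```python
import statistics
import time
from typing import Callable, Sequence


def time_retriever(
    retrieve: Callable[[str], object],
    questions: Sequence[str],
    repeats: int = 3,
) -> dict[str, float]:
    """Return median and max wall-clock latency in milliseconds (hypothetical helper)."""
    samples_ms: list[float] = []
    for question in questions:
        for _ in range(repeats):
            start = time.perf_counter()
            retrieve(question)
            samples_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "median_ms": statistics.median(samples_ms),
        "max_ms": max(samples_ms),
        "calls": float(len(samples_ms)),
    }


# Stand-in retriever (assumption) so the sketch is self-contained.
def fake_retriever(question: str) -> list[str]:
    return sorted(question.split())


stats = time_retriever(fake_retriever, ["Do I need Form A-12?", "What is the travel policy?"])
print(stats)
```

In the notebook, passing `dense_only`, `keyword_only`, and `hybrid` in place of `fake_retriever` gives three comparable sets of numbers. Dense retrieval calls an embedding API, so expect its timings to vary with network conditions.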

**Why move to Tutorial 5**
- At this point, intuition is not enough to choose between the variants.
- Tutorial 5 benchmarks all variants side by side to quantify the tradeoffs in answer quality, groundedness, and latency.

```python
# 1-5) Setup, handbook text, semantic chunks, embeddings, index

import os
from pathlib import Path

import pandas as pd
from dotenv import load_dotenv

from rag_tutorials.io_utils import load_handbook_documents, load_queries
from rag_tutorials.chunking import semantic_chunk_documents
from rag_tutorials.pipeline import build_dense_retriever, build_hybrid_retriever
from rag_tutorials.retrieval import build_bm25, bm25_search
from rag_tutorials.qa import answer_with_context
from rag_tutorials.evaluation import evaluate_single, summarize

# Read configuration from .env and fail early if the API key is missing.
load_dotenv()
if not os.getenv("OPENAI_API_KEY"):
    raise EnvironmentError("OPENAI_API_KEY is required")

embedding_model = os.getenv("OPENAI_EMBEDDING_MODEL", "text-embedding-3-small")
chat_model = os.getenv("OPENAI_CHAT_MODEL", "gpt-4.1-mini")

handbook_path = Path("data/handbook_manual.txt")
queries_path = Path("data/queries.jsonl")
if not handbook_path.exists() or not queries_path.exists():
    raise FileNotFoundError(
        "Handbook or query data not found. Run: uv run python scripts/generate_data.py"
    )

documents = load_handbook_documents(handbook_path)
queries = load_queries(queries_path)
chunks = semantic_chunk_documents(documents)

# Dense index over the semantic chunks; the second return value is not needed here.
dense_retriever, _ = build_dense_retriever(
    chunks=chunks,
    collection_name="tutorial4_dense",
    embedding_model=embedding_model,
)
hybrid_retriever = build_hybrid_retriever(chunks, dense_retriever)

# Separate BM25 index, used below to inspect keyword-only results.
bm25_index, corpus, chunk_ids = build_bm25(chunks)
```

```python
# Chunk boundary visualization (same source text, different split strategies)

from rag_tutorials.chunking import fixed_chunk_documents

section_doc = next(doc for doc in documents if doc.section == "International Work")
fixed_view = [c.text for c in fixed_chunk_documents([section_doc], chunk_size=120)]
semantic_view = [c.text for c in semantic_chunk_documents([section_doc])]

print("Section:", section_doc.section)
print("\nFixed chunks:")
for idx, chunk_text in enumerate(fixed_view, start=1):
    print(f"[{idx}] {chunk_text}")

print("\nSemantic chunks:")
for idx, chunk_text in enumerate(semantic_view, start=1):
    print(f"[{idx}] {chunk_text}")
```

```python
# 6) Implement retrieval functions and inspect dense vs keyword vs hybrid

from IPython.display import display  # explicit import so the cell does not rely on Jupyter's implicit name


def dense_only(question: str, top_k: int = 5):
    return dense_retriever(question, top_k=top_k)


def keyword_only(question: str, top_k: int = 5):
    return bm25_search(bm25_index, question, corpus, chunk_ids, top_k=top_k)


def hybrid(question: str, top_k: int = 5):
    return hybrid_retriever(question, top_k=top_k)


def results_frame(results) -> pd.DataFrame:
    """Tabulate ranked results: rank, chunk ID, score, and a 90-character preview."""
    return pd.DataFrame(
        [
            {"rank": i + 1, "chunk_id": r.chunk_id, "score": r.score, "preview": r.text[:90]}
            for i, r in enumerate(results)
        ]
    )


probe = "Do I need Form A-12 for my trip?"

dense_df = results_frame(dense_only(probe))
keyword_df = results_frame(keyword_only(probe))
hybrid_df = results_frame(hybrid(probe))

# Each retriever uses its own scoring function, so compare ranks across tables, not raw scores.
print("Dense only")
display(dense_df)
print("Keyword only")
display(keyword_df)
print("Hybrid")
display(hybrid_df)
```
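
When reading the three tables, it helps to know which retriever contributed each fused result. The sketch below labels every hybrid result as coming from the dense list, the keyword list, or both. The helper name `attribute_sources` and the toy chunk IDs are assumptions for illustration.

```python
def attribute_sources(
    hybrid_ids: list[str],
    dense_ids: list[str],
    keyword_ids: list[str],
) -> list[tuple[int, str, str]]:
    """Label each fused result by the retriever(s) that returned it (hypothetical helper)."""
    dense_set, keyword_set = set(dense_ids), set(keyword_ids)
    rows = []
    for rank, chunk_id in enumerate(hybrid_ids, start=1):
        if chunk_id in dense_set and chunk_id in keyword_set:
            source = "both"
        elif chunk_id in dense_set:
            source = "dense"
        elif chunk_id in keyword_set:
            source = "keyword"
        else:
            # Possible when the fused list draws on deeper candidates than the inspected top-k.
            source = "neither"
        rows.append((rank, chunk_id, source))
    return rows


# Toy chunk IDs (assumed data), ordered best-first.
toy_dense = ["c3", "c1", "c7"]
toy_keyword = ["c1", "c9", "c3"]
toy_hybrid = ["c1", "c3", "c9", "c7"]

attribution = attribute_sources(toy_hybrid, toy_dense, toy_keyword)
for row in attribution:
    print(row)
```

With the frames from the cell above, the inputs would be `hybrid_df["chunk_id"].tolist()`, `dense_df["chunk_id"].tolist()`, and `keyword_df["chunk_id"].tolist()`.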

```python
# 7-8) Prompt assembly + end-to-end query

def rag_answer_hybrid(question: str, top_k: int = 5):
    retrieved = hybrid(question, top_k=top_k)
    context = [r.text for r in retrieved]
    answer = answer_with_context(question, context, model=chat_model)
    return answer, retrieved

answer, retrieved = rag_answer_hybrid(probe)
print(answer)
```
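
Prompt assembly happens inside `answer_with_context`, whose template is not shown in this notebook. As a rough idea of what such a step typically involves, the sketch below numbers the retrieved passages and asks the model to answer only from them. The function `build_prompt` and its wording are assumptions, not the library's actual prompt.

```python
def build_prompt(question: str, context: list[str]) -> str:
    """Assemble a grounded prompt from retrieved passages (hypothetical template)."""
    numbered = "\n\n".join(
        f"[{i}] {passage.strip()}" for i, passage in enumerate(context, start=1)
    )
    return (
        "Answer the question using only the numbered context passages. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Toy passages (assumed data) standing in for retrieved chunk text.
toy_context = [
    "Submit Form A-12 before any international trip. ",
    "Domestic travel does not require a form.",
]
prompt = build_prompt("Do I need Form A-12 for my trip?", toy_context)
print(prompt)
```

Numbering the passages makes it easier to check later whether an answer is grounded in a specific retrieved chunk.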

```python
# 9-10) Evaluation + continuity summary table

# Evaluate the first 20 queries with the hybrid retriever.
rows = [
    evaluate_single(
        query=q,
        retrieval_fn=lambda question: hybrid(question, top_k=5),
        answer_fn=lambda question, context: answer_with_context(question, context, model=chat_model),
        top_k=5,
    )
    for q in queries[:20]
]

print("Tutorial 4 metrics:", summarize(rows))

# One row per tutorial, naming the change each one introduced.
continuity = pd.DataFrame(
    [
        {"tutorial": 1, "change": "dense baseline with fixed chunks"},
        {"tutorial": 2, "change": "semantic chunking"},
        {"tutorial": 3, "change": "reranking"},
        {"tutorial": 4, "change": "hybrid dense + keyword"},
    ]
)
continuity
```
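
The metrics that `evaluate_single` and `summarize` report are not listed in this notebook. For orientation, the sketch below computes two common retrieval metrics, hit rate at k and mean reciprocal rank (MRR), over toy data. The function names and the example runs are assumptions and may not match what `rag_tutorials.evaluation` computes.

```python
def hit_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """1.0 if any relevant chunk appears in the top k results, else 0.0."""
    return 1.0 if any(cid in relevant_ids for cid in retrieved_ids[:k]) else 0.0


def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, cid in enumerate(retrieved_ids, start=1):
        if cid in relevant_ids:
            return 1.0 / rank
    return 0.0


# Toy runs (assumed data): (retrieved chunk IDs best-first, set of relevant chunk IDs).
runs = [
    (["c1", "c4", "c2"], {"c4"}),        # first relevant hit at rank 2
    (["c9", "c8", "c7"], {"c1"}),        # relevant chunk never retrieved
    (["c5", "c6"], {"c5", "c6"}),        # first relevant hit at rank 1
]

hit_rate = sum(hit_at_k(ids, rel, k=3) for ids, rel in runs) / len(runs)
mrr = sum(reciprocal_rank(ids, rel) for ids, rel in runs) / len(runs)
print(f"hit@3={hit_rate:.3f}  MRR={mrr:.3f}")
```

Hit rate only records whether a relevant chunk was found, while MRR also rewards ranking it near the top, so the two can disagree when comparing retrievers.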