# Advanced Retrieval

## Beyond Basic Vector Search

This notebook explores advanced retrieval techniques that address the limitations of basic dense vector search, and evaluates them with concrete metrics.

```
    ┌─────────────────────────────────────────────────────────────┐
    │              Advanced Retrieval Techniques                  │
    │                                                             │
    │   ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐    │
    │   │ Hybrid   │  │Reranking │  │  Query   │  │   RAG    │    │
    │   │ Search   │  │ (Cohere) │  │ Expansion│  │  Fusion  │    │
    │   │ BM25+Vec │  │          │  │          │  │          │    │
    │   └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘    │
    │        │             │             │             │          │
    │        v             v             v             v          │
    │   Exact + dense   Precision    Vocabulary    Comprehensive  │
    │   matching        reordering   bridging      coverage       │
    │                                                             │
    │   ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐    │
    │   │ Semantic │  │Contextual│  │Relevance │  │Systematic│    │
    │   │ Chunking │  │Compress  │  │ Filter   │  │Comparison│    │
    │   └──────────┘  └──────────┘  └──────────┘  └──────────┘    │
    └─────────────────────────────────────────────────────────────┘
```

Topics covered:
- Reciprocal Rank Fusion (RRF) for combining ranked lists.
- Hybrid search (BM25 + dense vectors).
- Reranking with Cohere cross-encoders.
- Query expansion and RAG-Fusion.
- Semantic chunking and structure-aware splitting.
- Contextual compression and relevance filtering.
- Hierarchical retrieval.
- Systematic retriever comparison with RAGAS evaluation.

## Setup and Imports

Make sure you have the required packages installed:

```bash
uv sync
```

Note: This notebook uses `nest_asyncio` for RAGAS compatibility in Jupyter. Some sections require optional dependencies (`cohere`, `langchain-experimental`, `rank-bm25`). The notebook handles missing optional packages gracefully.
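
As a rough illustration of checking for optional packages without importing them, the sketch below uses the standard library's `importlib.util.find_spec`. The helper name `has_module` is hypothetical and is not used elsewhere in this notebook; the setup cell uses `try`/`except ImportError` instead.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


# Use import names, not package names: rank-bm25 installs `rank_bm25`
optional = ["rank_bm25", "cohere", "langchain_experimental"]
availability = {name: has_module(name) for name in optional}
print(availability)
```

This only confirms that a module can be found; it does not confirm that a specific class inside it imports cleanly.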

In [None]:
import os
import json
import time
from pathlib import Path
from dataclasses import dataclass
from typing import Callable

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.documents import Document
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

# RAGAS requires nest_asyncio in Jupyter environments
import nest_asyncio
nest_asyncio.apply()

# Load environment variables
load_dotenv()

# Verify API key
if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("OPENAI_API_KEY not found in environment variables")

# Initialize the LLM and embeddings
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")


def extract_json(content: str) -> str:
    """Extract JSON from LLM response, handling markdown code blocks."""
    if "```json" in content:
        content = content.split("```json")[1].split("```")[0]
    elif "```" in content:
        content = content.split("```")[1].split("```")[0]
    return content.strip()


# Check optional dependencies
HAS_RANK_BM25 = False
try:
    from rank_bm25 import BM25Okapi  # noqa: F401
    HAS_RANK_BM25 = True
except ImportError:
    pass

HAS_COHERE = False
try:
    import cohere  # noqa: F401
    HAS_COHERE = True
except ImportError:
    pass

HAS_SEMANTIC_CHUNKER = False
try:
    from langchain_experimental.text_splitter import SemanticChunker  # noqa: F401
    HAS_SEMANTIC_CHUNKER = True
except ImportError:
    pass

print("Optional dependencies:")
print(f"  rank_bm25 (for BM25Retriever): {'Available' if HAS_RANK_BM25 else 'Not installed - run: uv add rank-bm25'}")
print(f"  cohere (for reranking):         {'Available' if HAS_COHERE else 'Not installed - run: uv add cohere'}")
print(f"  langchain-experimental:         {'Available' if HAS_SEMANTIC_CHUNKER else 'Not installed - run: uv add langchain-experimental'}")
print("\nSetup complete!")

## Loading and Indexing Documents

Before exploring retrieval techniques, we need a document corpus and a vector store. We load the reference guides from the `documents/` directory, chunk them, embed them, and store them in an in-memory Qdrant vector store.

In [None]:
# Load reference documents
documents_dir = Path("documents")
guide_files = sorted(documents_dir.glob("ref-*.md"))

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)

all_chunks = []
raw_documents = []  # Keep full documents for the chunking and summary demos

for filepath in guide_files:
    loader = TextLoader(str(filepath))
    docs = loader.load()
    raw_documents.extend(docs)
    chunks = text_splitter.split_documents(docs)
    all_chunks.extend(chunks)
    print(f"  Loaded {len(chunks):3d} chunks from {filepath.name}")

print(f"\nTotal: {len(all_chunks)} chunks from {len(guide_files)} documents")

# Create in-memory Qdrant vector store
COLLECTION_NAME = "chapter9_retrieval"

qdrant_client = QdrantClient(":memory:")
qdrant_client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name=COLLECTION_NAME,
    embedding=embeddings,
)

# Index all chunks
vector_store.add_documents(all_chunks)
print(f"Indexed {len(all_chunks)} chunks in Qdrant vector store")

## Reciprocal Rank Fusion (RRF)

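Hybrid search combines results from two different scoring systems. Dense search returns cosine similarity, which is bounded (between -1 and 1, and usually between 0 and 1 for text embeddings). BM25 returns unbounded scores whose magnitude depends on query length and corpus statistics. Averaging the two directly would let one scale dominate the other.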

RRF solves this by combining **rankings** instead of scores. If a document appears highly ranked in multiple result lists, it is probably relevant. The formula:

```
score(doc) = sum( 1 / (k + rank) )  for each list where doc appears
```

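Here `rank` is the 1-based position of the document in a list. The constant `k` (typically 60) dampens the gap between adjacent ranks, so a single top-ranked result does not dominate the fused score.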
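
To make the arithmetic concrete, here is a small worked sketch that applies the formula to plain string IDs. The function name `rrf_scores` and the two rankings are made up for illustration; the notebook's own implementation works on `(doc, score)` tuples instead.

```python
def rrf_scores(rankings: dict[str, list[str]], k: int = 60) -> dict[str, float]:
    """Sum 1 / (k + rank) over every list a document appears in (rank is 1-based)."""
    scores: dict[str, float] = {}
    for ranked_ids in rankings.values():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


# Hypothetical rankings of four documents from two retrievers
rankings = {
    "dense": ["a", "c", "b", "d"],
    "bm25": ["b", "c", "d", "a"],
}
scores = rrf_scores(rankings)
fused_order = sorted(scores, key=scores.get, reverse=True)
for doc_id in fused_order:
    print(f"{doc_id}: {scores[doc_id]:.6f}")
```

With these inputs, `b` (ranks 3 and 1) scores 1/63 + 1/61 and narrowly beats `c` (rank 2 in both lists, 2/62), while `a` (ranks 1 and 4) comes third. The small gaps show how strongly `k = 60` flattens rank differences.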

In [None]:
def reciprocal_rank_fusion(
    result_lists: list[list],
    k: int = 60
) -> list:
    """
    Combine multiple ranked result lists using RRF.
    
    Args:
        result_lists: List of ranked result lists, each containing (doc, score) tuples
        k: Ranking constant (default 60, higher values reduce impact of rank differences)
    
    Returns:
        Combined ranked list of documents
    """
    scores = {}
    
    for results in result_lists:
        for rank, (doc, _) in enumerate(results):
            # Use content hash as document ID
            doc_id = hash(doc.page_content)
            
            if doc_id not in scores:
                scores[doc_id] = {"doc": doc, "score": 0}
            
            # RRF formula with 1-based ranks: enumerate() is 0-based, so add 1
            scores[doc_id]["score"] += 1 / (k + rank + 1)
    
    # Sort by combined score
    sorted_results = sorted(
        scores.values(),
        key=lambda x: x["score"],
        reverse=True
    )
    
    return [item["doc"] for item in sorted_results]


print("reciprocal_rank_fusion() defined")

In [None]:
# Demo: RRF with two mock result lists
doc_a = Document(page_content="Document about embeddings and vector search")
doc_b = Document(page_content="Document about BM25 keyword matching")
doc_c = Document(page_content="Document about hybrid search approaches")
doc_d = Document(page_content="Document about search engine optimization")

# List 1: dense search ranking
list1 = [(doc_a, 0.95), (doc_c, 0.88), (doc_b, 0.82), (doc_d, 0.70)]
# List 2: BM25 ranking (different order, different score scale)
list2 = [(doc_b, 15.2), (doc_c, 12.1), (doc_d, 8.3), (doc_a, 5.0)]

fused = reciprocal_rank_fusion([list1, list2])
print("RRF fused ranking:")
for i, doc in enumerate(fused):
    print(f"  {i+1}. {doc.page_content}")

## Hybrid Search

Hybrid search combines dense vector retrieval with BM25 sparse keyword matching. Dense search finds semantically similar content. BM25 finds exact term matches. Together they cover each other's blind spots.

- **Dense search** handles: paraphrasing, synonyms, conceptual similarity
- **BM25** handles: exact identifiers, codes, names, technical terms
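
To show why BM25 is good at exact identifiers, the following is a minimal pure-Python sketch of BM25 scoring. It assumes whitespace tokenization, the non-negative IDF variant `log((N - df + 0.5) / (df + 0.5) + 1)`, and illustrative defaults `k1=1.5`, `b=0.75`. It is not the `rank_bm25` implementation, and the function name `bm25_scores` and the corpus are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query: str, corpus: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document in `corpus` against `query` with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


corpus = [
    "reset the device when error code E1234 appears",
    "the device restarts after a firmware fault",
    "troubleshooting guide for startup failures",
]
scores = bm25_scores("E1234", corpus)
print(scores)
```

Only the first document receives a non-zero score, because BM25 rewards literal term matches. A dense retriever might instead rank all three as loosely related troubleshooting content.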

In [None]:
from langchain_community.retrievers import BM25Retriever


def hybrid_search(
    query: str,
    vector_store: QdrantVectorStore,
    documents: list,
    k: int = 5,
    alpha: float = 0.5
) -> list:
    """
    Combine dense vector search with BM25 keyword search.
    
    Args:
        query: Search query
        vector_store: Qdrant vector store for dense search
        documents: Original documents for BM25 (needs raw text)
        k: Number of results to return
        alpha: Intended weight for dense results (1-alpha for sparse).
            Currently unused: RRF weights both lists equally.
    """
    # Dense search
    dense_results = vector_store.similarity_search_with_score(query, k=k*2)
    
    # Sparse search with BM25
    bm25_retriever = BM25Retriever.from_documents(documents)
    bm25_retriever.k = k * 2
    sparse_docs = bm25_retriever.invoke(query)
    # Convert to (doc, score) format for RRF
    sparse_results = [(doc, 1.0 / (i + 1)) for i, doc in enumerate(sparse_docs)]
    
    # Combine with RRF
    combined = reciprocal_rank_fusion([dense_results, sparse_results])
    
    return combined[:k]


print("hybrid_search() defined")

In [None]:
if HAS_RANK_BM25:
    query = "How does BM25 keyword matching work?"
    print(f"Query: {query}\n")

    # Compare: dense-only vs hybrid
    dense_only = vector_store.similarity_search(query, k=5)
    print("Dense search results:")
    for i, doc in enumerate(dense_only):
        source = Path(doc.metadata.get("source", "unknown")).name
        print(f"  {i+1}. [{source}] {doc.page_content[:80]}...")

    hybrid_results = hybrid_search(query, vector_store, all_chunks, k=5)
    print("\nHybrid search results:")
    for i, doc in enumerate(hybrid_results):
        source = Path(doc.metadata.get("source", "unknown")).name
        print(f"  {i+1}. [{source}] {doc.page_content[:80]}...")
else:
    print("Skipping hybrid search demo: rank_bm25 not installed.")
    print("Install with: uv add rank-bm25")

## Reranking with Cohere

Initial retrieval casts a wide net for recall. Reranking is a second pass that reorders results using a more powerful cross-encoder model for precision.

- **Bi-encoder** (embedding model): processes query and document independently -- fast
- **Cross-encoder** (reranker): processes query+document jointly -- accurate but slow

Reranking 20 candidates typically adds on the order of 100-300 ms, depending on the model and the network round trip. It pays off when precision matters more than latency.
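
The two-stage pattern can be sketched without any external service. In the example below, `term_overlap` and `phrase_match` are toy stand-ins for a bi-encoder and a cross-encoder respectively; they are assumptions for illustration only and say nothing about how Cohere's reranker scores documents.

```python
from typing import Callable


def two_stage_search(
    query: str,
    docs: list[str],
    fast_score: Callable[[str, str], float],
    precise_score: Callable[[str, str], float],
    initial_k: int = 20,
    k: int = 5,
) -> list[str]:
    """Shortlist with a cheap scorer, then reorder the shortlist with a costly one."""
    # Stage 1: the cheap scorer runs over every document
    shortlist = sorted(docs, key=lambda d: fast_score(query, d), reverse=True)[:initial_k]
    # Stage 2: the expensive scorer runs only over the shortlist
    return sorted(shortlist, key=lambda d: precise_score(query, d), reverse=True)[:k]


def term_overlap(query: str, doc: str) -> float:
    """Toy stand-in for a bi-encoder: number of shared lowercase terms."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_match(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: rewards the exact query phrase."""
    return 1.0 if query.lower() in doc.lower() else 0.0


docs = [
    "mechanism design and attention economics",
    "the attention mechanism weighs token interactions",
    "gardening tips for spring",
]
top = two_stage_search("attention mechanism", docs, term_overlap, phrase_match, initial_k=2, k=1)
print(top)
```

The first two documents tie on term overlap, so stage 1 alone cannot separate them. The stage 2 scorer, which looks at query and document together, promotes the one that actually contains the phrase.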

> Note: This section requires a Cohere API key. Set `COHERE_API_KEY` in your `.env` file. Get a free trial key at https://dashboard.cohere.com/api-keys

In [None]:
def search_with_reranking(
    query: str,
    vector_store,
    k: int = 5,
    initial_k: int = 20
) -> list:
    """
    Two-stage retrieval: broad search then precise reranking.
    
    Args:
        query: Search query
        vector_store: Vector store for initial retrieval
        k: Final number of results
        initial_k: Candidates to fetch for reranking
    """
    # Stage 1: Broad retrieval
    candidates = vector_store.similarity_search(query, k=initial_k)
    
    if not candidates:
        return []
    
    # Check for Cohere availability
    if not HAS_COHERE:
        print("Cohere not installed. Returning results without reranking.")
        return candidates[:k]
    
    api_key = os.getenv("COHERE_API_KEY")
    if not api_key or api_key == "your_cohere_api_key_here":
        print("COHERE_API_KEY not set. Returning results without reranking.")
        return candidates[:k]
    
    try:
        # Stage 2: Rerank with Cohere
        import cohere
        co = cohere.Client(api_key=api_key)
        
        rerank_response = co.rerank(
            query=query,
            documents=[doc.page_content for doc in candidates],
            top_n=k,
            model="rerank-english-v3.0"
        )
        
        # Return reranked documents
        reranked = []
        for result in rerank_response.results:
            doc = candidates[result.index]
            doc.metadata["rerank_score"] = result.relevance_score
            reranked.append(doc)
        
        return reranked
    except Exception as e:
        print(f"Cohere reranking failed: {e}")
        return candidates[:k]


print("search_with_reranking() defined")

In [None]:
query = "What is the attention mechanism in transformers?"
print(f"Query: {query}\n")

reranked = search_with_reranking(query, vector_store, k=5, initial_k=20)
print(f"Results ({len(reranked)} docs):")
for i, doc in enumerate(reranked):
    source = Path(doc.metadata.get("source", "unknown")).name
    score = doc.metadata.get("rerank_score", "N/A")
    print(f"  {i+1}. [{source}] (score: {score}) {doc.page_content[:80]}...")

## Query Expansion

Users don't always use the same words as your documents. Query expansion generates multiple query variations to cover different phrasings of the same intent.

- **Synonym-based**: "laptop won't start" -> "notebook fails to boot"
- **Specificity variation**: general to specific and back
- **LLM-generated**: use the model to rephrase the query
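
LLM-generated variations arrive as free text, so it helps to validate the reply before searching with it. The sketch below is one possible guard; the helper name `parse_query_variations` and the `max_variations` parameter are hypothetical and not part of this notebook's code.

```python
import json


def parse_query_variations(raw: str, max_variations: int = 3) -> list[str]:
    """Parse an LLM reply into a clean list of query strings; return [] on bad input."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return []
    # The prompt asks for an array, but models sometimes return an object instead
    if not isinstance(parsed, list):
        return []
    # Keep non-empty strings only, with surrounding whitespace trimmed
    cleaned = [item.strip() for item in parsed if isinstance(item, str) and item.strip()]
    return cleaned[:max_variations]


print(parse_query_variations('["notebook fails to boot", "  pc will not power on  ", 42]'))
print(parse_query_variations('{"queries": ["a"]}'))
print(parse_query_variations("not json"))
```

Returning an empty list on any malformed reply lets the caller fall back to the original query alone.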

In [None]:
def expand_query(query: str, llm) -> list[str]:
    """
    Generate query variations using an LLM.
    
    Returns the original query plus variations.
    """
    prompt = f"""Generate 3 alternative search queries for:

Original: {query}

Create variations using:
1. Synonyms and related terms
2. More specific phrasing
3. More general phrasing

Return as a JSON array of strings. Only the array, no explanation."""

    response = llm.invoke(prompt)
    content = extract_json(response.content)

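    try:
        variations = json.loads(content)
    except json.JSONDecodeError:
        variations = []

    # Guard against non-list replies (for example a JSON object)
    if not isinstance(variations, list):
        variations = []

    # Always include original query; keep only string variations
    return [query] + [v for v in variations if isinstance(v, str)]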


def search_with_expansion(
    query: str,
    vector_store,
    llm,
    k: int = 5
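) -> list:
    """
    Search with query expansion for better coverage.

    Results are interleaved by rank across queries, so documents found only
    by a variation can still reach the top k.
    """
    queries = expand_query(query, llm)

    # One ranked result list per query
    per_query_results = [vector_store.similarity_search(q, k=k) for q in queries]

    all_results = []
    seen_content = set()

    # Round-robin: take rank 1 from every query, then rank 2, and so on
    for rank in range(k):
        for results in per_query_results:
            if rank >= len(results):
                continue
            doc = results[rank]
            content_hash = hash(doc.page_content)
            if content_hash not in seen_content:
                seen_content.add(content_hash)
                all_results.append(doc)

    # Return top k unique results
    return all_results[:k]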


print("expand_query() and search_with_expansion() defined")

In [None]:
query = "How do I make my chatbot remember things?"
print(f"Original query: {query}\n")

expanded = expand_query(query, llm)
print("Expanded queries:")
for i, q in enumerate(expanded):
    print(f"  {i+1}. {q}")

print("\nSearch with expansion:")
expanded_results = search_with_expansion(query, vector_store, llm, k=5)
for i, doc in enumerate(expanded_results):
    source = Path(doc.metadata.get("source", "unknown")).name
    print(f"  {i+1}. [{source}] {doc.page_content[:80]}...")

## RAG-Fusion

RAG-Fusion takes query expansion further by systematically generating diverse queries and combining their results with Reciprocal Rank Fusion. Different query phrasings activate different regions of the embedding space. By searching with multiple phrasings and fusing results, you cover more ground.

Best for: research-style queries, complex multi-faceted questions, situations where missing relevant content is worse than including some irrelevant content.
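
Because RRF sums a contribution from every result list, a query that appears twice is effectively counted twice. A small normalization step before searching avoids that. The helper below is a hypothetical sketch (the name `dedupe_queries` is not part of the notebook) that treats queries differing only in case or whitespace as duplicates.

```python
def dedupe_queries(original: str, generated: list[str]) -> list[str]:
    """Return the original query followed by generated queries, without near-duplicates."""
    seen: set[str] = set()
    unique: list[str] = []
    for q in [original, *generated]:
        # Normalize case and whitespace so trivial variants count as the same query
        key = " ".join(q.lower().split())
        if key and key not in seen:
            seen.add(key)
            unique.append(q.strip())
    return unique


queries = dedupe_queries(
    "What is RAG?",
    ["what is  rag?", "Explain retrieval-augmented generation", ""],
)
print(queries)
```

The original query always comes first, so it keeps its place even when the model echoes it back.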

In [None]:
def rag_fusion(
    query: str,
    vector_store,
    llm,
    k: int = 5,
    num_queries: int = 4
) -> list:
    """
    RAG-Fusion: multi-query retrieval with RRF combination.
    
    Generates diverse queries, searches with each, combines results.
    """
    # Generate diverse queries
    prompt = f"""Generate {num_queries} different search queries to find information for:
\"{query}\"

Make queries diverse:
- Different angles on the topic
- Varying levels of specificity  
- Alternative phrasings and synonyms

Return as a JSON array of strings."""

    response = llm.invoke(prompt)
    content = extract_json(response.content)
    
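    try:
        queries = json.loads(content)
    except json.JSONDecodeError:
        # Fall back to no generated queries; the original is added below
        queries = []
    
    # Include original (once, so it is not double-weighted in RRF)
    queries = [query] + queries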
    
    # Search with each query
    all_ranked_results = []
    for q in queries:
        results = vector_store.similarity_search_with_score(q, k=k*2)
        all_ranked_results.append(results)
    
    # Combine with RRF
    combined = reciprocal_rank_fusion(all_ranked_results)
    
    return combined[:k]


print("rag_fusion() defined")

In [None]:
query = "Compare embeddings and keyword search for different use cases"
print(f"Query: {query}\n")

fusion_results = rag_fusion(query, vector_store, llm, k=5, num_queries=3)
print(f"RAG-Fusion results ({len(fusion_results)} docs):")
for i, doc in enumerate(fusion_results):
    source = Path(doc.metadata.get("source", "unknown")).name
    print(f"  {i+1}. [{source}] {doc.page_content[:100]}...")

## Semantic Chunking

Fixed-size chunking ignores document structure. A chunk might split a code example in half or separate a concept from its explanation.

**Semantic chunking** uses embeddings to identify natural topic boundaries -- it embeds each sentence, then looks for points where consecutive sentences have significantly different embeddings, indicating a topic shift.

For structured documents like markdown, **structure-aware chunking** respects headers and sections explicitly.
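
The percentile idea behind semantic chunking can be sketched on toy vectors. The example below assumes hand-written 2-D "embeddings" in place of real sentence embeddings and uses a linearly interpolated percentile. It is a simplified illustration and does not reproduce everything `SemanticChunker` does, such as combining neighbouring sentences before embedding. All function names are hypothetical.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 minus cosine similarity; assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values: list[float], pct: float) -> float:
    """Linearly interpolated percentile, with pct in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lower, upper = math.floor(pos), math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def find_breakpoints(sentence_vectors: list[list[float]], pct: float = 95) -> list[int]:
    """Return indices i where a split falls between sentence i and sentence i + 1."""
    distances = [
        cosine_distance(sentence_vectors[i], sentence_vectors[i + 1])
        for i in range(len(sentence_vectors) - 1)
    ]
    threshold = percentile(distances, pct)
    return [i for i, d in enumerate(distances) if d > threshold]


# Toy 2-D "embeddings": three sentences on one topic, then two on another
vectors = [[1.0, 0.0], [0.98, 0.2], [0.95, 0.3], [0.1, 1.0], [0.0, 1.0]]
breakpoints = find_breakpoints(vectors)
print(breakpoints)
```

The only large jump is between the third and fourth vectors, so the sketch reports a single split at index 2. Raising the percentile produces fewer splits, which matches the `threshold` parameter described for the chunker.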

> Note: `SemanticChunker` requires `langchain-experimental`. Install with: `uv add langchain-experimental`

In [None]:
def semantic_chunk_documents(documents: list, threshold: float = 95) -> list:
    """
    Split documents at semantic boundaries rather than fixed sizes.
    
    Args:
        documents: Documents to chunk
        threshold: Percentile threshold for detecting topic shifts (higher = fewer splits)
    """
    if not HAS_SEMANTIC_CHUNKER:
        print("langchain-experimental not installed. Using RecursiveCharacterTextSplitter fallback.")
        fallback = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
        chunks = []
        for doc in documents:
            doc_chunks = fallback.split_documents([doc])
            chunks.extend(doc_chunks)
        return chunks
    
    from langchain_experimental.text_splitter import SemanticChunker
    
    chunker_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    
    chunker = SemanticChunker(
        embeddings=chunker_embeddings,
        breakpoint_threshold_type="percentile",
        breakpoint_threshold_amount=threshold
    )
    
    chunks = []
    for doc in documents:
        doc_chunks = chunker.create_documents([doc.page_content])
        
        # Preserve original metadata
        for chunk in doc_chunks:
            chunk.metadata.update(doc.metadata)
        
        chunks.extend(doc_chunks)
    
    return chunks


from langchain_text_splitters import MarkdownHeaderTextSplitter


def chunk_markdown_by_structure(markdown_text: str) -> list:
    """
    Split markdown documents by header structure.
    """
    splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[
            ("#", "h1"),
            ("##", "h2"),
            ("###", "h3"),
        ]
    )
    
    chunks = splitter.split_text(markdown_text)
    # Each chunk includes header hierarchy in metadata
    return chunks


print("semantic_chunk_documents() and chunk_markdown_by_structure() defined")

In [None]:
# Demo: markdown structure-aware chunking on a single document
sample_doc = raw_documents[0]
source_name = Path(sample_doc.metadata.get("source", "unknown")).name
print(f"Structure-aware chunking of: {source_name}")
print(f"  Original length: {len(sample_doc.page_content)} characters\n")

structure_chunks = chunk_markdown_by_structure(sample_doc.page_content)
print(f"  Structure chunks: {len(structure_chunks)}")
for i, chunk in enumerate(structure_chunks[:5]):
    headers = {k: v for k, v in chunk.metadata.items() if k.startswith("h")}
    print(f"    {i+1}. {headers} -> {chunk.page_content[:60]}...")

# Demo: semantic chunking on a small sample (limit to 2 docs to save API calls)
print("\nSemantic chunking (2 documents):")
semantic_chunks = semantic_chunk_documents(raw_documents[:2])
print(f"  Produced {len(semantic_chunks)} semantic chunks")
for i, chunk in enumerate(semantic_chunks[:3]):
    print(f"    {i+1}. ({len(chunk.page_content)} chars) {chunk.page_content[:60]}...")

## Contextual Compression

Sometimes you retrieve the right document but it is too long. A 2000-token chunk might contain only 200 tokens of relevant content. Contextual compression extracts only the relevant portions, saving context window space and reducing noise.

The trade-off: additional LLM calls per retrieved document.
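
To illustrate the shape of compression without an LLM call, the sketch below swaps in a much cruder technique: it keeps only sentences that share terms with the query. This keyword filter is an assumption for illustration and is not what `LLMChainExtractor` does; the function name and sample text are hypothetical.

```python
import re


def extract_relevant_sentences(query: str, text: str, min_overlap: int = 1) -> str:
    """Keep only sentences that share at least `min_overlap` terms with the query."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    # Split on whitespace that follows sentence-ending punctuation
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [
        s for s in sentences
        if len(query_terms & set(re.findall(r"\w+", s.lower()))) >= min_overlap
    ]
    return " ".join(kept)


chunk = (
    "The library ships with several tokenizers. "
    "The attention mechanism lets each token weigh every other token. "
    "Installation requires a recent Python version."
)
compressed = extract_relevant_sentences("attention mechanism", chunk, min_overlap=2)
print(compressed)
```

The input and output have the same structure as in LLM-based compression (a long chunk in, a shorter relevant extract out), but a keyword filter misses paraphrases that an LLM extractor would catch.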

In [None]:
from langchain_classic.retrievers import ContextualCompressionRetriever
from langchain_classic.retrievers.document_compressors import LLMChainExtractor


def create_compression_retriever(base_retriever, llm):
    """
    Wrap a retriever with contextual compression.
    
    Retrieved documents are filtered to only include relevant content.
    """
    compressor = LLMChainExtractor.from_llm(llm)
    
    compression_retriever = ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=base_retriever
    )
    
    return compression_retriever


print("create_compression_retriever() defined")

In [None]:
# Create a compression retriever wrapping basic vector search
base_retriever = vector_store.as_retriever(search_kwargs={"k": 10})
compressed_retriever = create_compression_retriever(base_retriever, llm)

query = "What is the attention mechanism?"
print(f"Query: {query}\n")

# Uncompressed
print("Uncompressed results (first 3):")
uncompressed = base_retriever.invoke(query)
for i, doc in enumerate(uncompressed[:3]):
    print(f"  {i+1}. ({len(doc.page_content)} chars) {doc.page_content[:100]}...")

# Compressed
print("\nCompressed results:")
compressed_docs = compressed_retriever.invoke(query)
for i, doc in enumerate(compressed_docs[:3]):
    print(f"  {i+1}. ({len(doc.page_content)} chars) {doc.page_content[:100]}...")

## Relevance Filtering

A lighter-weight alternative to full compression: score each chunk's cosine similarity to the query and drop low-scoring chunks before sending to the LLM. Faster than LLM compression but less precise -- it drops clearly irrelevant documents but cannot extract relevant portions from partially relevant ones.
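
Cosine similarity is the dot product divided by the product of the two vector lengths. The sketch below shows the calculation and a threshold filter on toy 2-D vectors; the names `cosine_similarity` and `filter_by_cosine` and the vectors themselves are made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as unrelated
    return dot / (norm_a * norm_b)


def filter_by_cosine(
    query_vec: list[float],
    doc_vecs: dict[str, list[float]],
    threshold: float,
) -> list[str]:
    """Return the IDs of documents whose similarity to the query meets the threshold."""
    return [
        doc_id for doc_id, vec in doc_vecs.items()
        if cosine_similarity(query_vec, vec) >= threshold
    ]


query_vec = [1.0, 0.0]
doc_vecs = {"close": [2.0, 0.5], "far": [0.1, 3.0]}
kept = filter_by_cosine(query_vec, doc_vecs, threshold=0.3)
print(kept)
```

The raw dot product for `close` is 2.0, which is outside the valid cosine range. Dividing by the norms is what makes a fixed threshold meaningful when vectors are not already unit length.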

In [None]:
def filter_by_relevance(
    query: str,
    documents: list,
    embeddings,
    threshold: float = 0.7
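) -> list:
    """
    Filter documents below a relevance threshold.

    Uses true cosine similarity (dot product divided by vector norms), so it
    stays correct for embedding models that do not return unit-length vectors.
    """
    if not documents:
        return []

    query_embedding = embeddings.embed_query(query)
    query_norm = sum(a * a for a in query_embedding) ** 0.5

    # Embed all documents in one batched call
    doc_embeddings = embeddings.embed_documents([doc.page_content for doc in documents])

    filtered = []
    for doc, doc_embedding in zip(documents, doc_embeddings):
        doc_norm = sum(b * b for b in doc_embedding) ** 0.5
        if query_norm == 0 or doc_norm == 0:
            continue

        # Cosine similarity
        dot = sum(a * b for a, b in zip(query_embedding, doc_embedding))
        similarity = dot / (query_norm * doc_norm)

        if similarity >= threshold:
            filtered.append(doc)

    return filtered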


print("filter_by_relevance() defined")

In [None]:
query = "How do vector databases store embeddings?"
candidates = vector_store.similarity_search(query, k=10)
print(f"Query: {query}")
print(f"Candidates before filtering: {len(candidates)}")

# Use a moderate threshold -- text-embedding-3-small cosine similarities
# tend to be in the 0.2-0.6 range for related content
filtered = filter_by_relevance(query, candidates, embeddings, threshold=0.3)
print(f"After filtering (threshold=0.3): {len(filtered)}")

for i, doc in enumerate(filtered[:5]):
    source = Path(doc.metadata.get("source", "unknown")).name
    print(f"  {i+1}. [{source}] {doc.page_content[:80]}...")

## Hierarchical Retrieval

For large corpora, flat retrieval can be inefficient. Hierarchical retrieval uses document summaries as a first-stage filter:

1. **Stage 1:** Search document summaries to find relevant documents
2. **Stage 2:** Search chunks within those relevant documents

Like using a table of contents before reading individual chapters.
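
The two stages can be sketched with plain dictionaries. In the example below, `overlap` is a toy relevance score based on shared terms, standing in for vector similarity. The function names, document IDs, and texts are all hypothetical.

```python
def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase terms."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def hierarchical_lookup(
    query: str,
    summaries: dict[str, str],
    chunks_by_doc: dict[str, list[str]],
    top_docs: int = 1,
    k: int = 2,
) -> list[tuple[str, str]]:
    """Pick documents by summary first, then rank only the chunks inside them."""
    # Stage 1: rank documents by how well their summaries match the query
    ranked_docs = sorted(summaries, key=lambda d: overlap(query, summaries[d]), reverse=True)
    selected = ranked_docs[:top_docs]
    # Stage 2: rank only the chunks that belong to the selected documents
    candidates = [(doc_id, chunk) for doc_id in selected for chunk in chunks_by_doc[doc_id]]
    candidates.sort(key=lambda pair: overlap(query, pair[1]), reverse=True)
    return candidates[:k]


summaries = {
    "chunking.md": "guide to chunking documents for retrieval",
    "agents.md": "patterns for building tool-using agents",
}
chunks_by_doc = {
    "chunking.md": [
        "fixed-size chunking splits text by length",
        "overlap between chunks preserves context",
    ],
    "agents.md": [
        "agents call tools in a loop",
        "chunking is rarely needed for agents",
    ],
}
results = hierarchical_lookup("chunking documents", summaries, chunks_by_doc)
print(results)
```

The `agents.md` chunk that mentions chunking never becomes a candidate, because its document was filtered out in stage 1. That is the intended trade: a smaller search space at the cost of missing relevant chunks in documents whose summaries do not match.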

In [None]:
from qdrant_client.models import Filter, FieldCondition, MatchValue


def hierarchical_search(
    query: str,
    summary_store,
    chunk_store,
    k: int = 5,
    top_docs: int = 3
) -> list:
    """
    Two-stage hierarchical retrieval.
    
    Stage 1: Search document summaries to find relevant documents
    Stage 2: Search chunks within those documents
    """
    # Stage 1: Find relevant documents via summaries
    summaries = summary_store.similarity_search(query, k=top_docs)
    
    # Get document IDs from summaries
    doc_ids = [s.metadata["document_id"] for s in summaries]
    
    # Stage 2: Search chunks filtered to those documents
    all_chunks = []
    for doc_id in doc_ids:
        qdrant_filter = Filter(
            must=[
                FieldCondition(
                    key="metadata.document_id",
                    match=MatchValue(value=doc_id)
                )
            ]
        )
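        # Fetch scores so chunks from different documents can be compared
        scored_chunks = chunk_store.similarity_search_with_score(
            query,
            k=k,
            filter=qdrant_filter
        )
        all_chunks.extend(scored_chunks)
    
    # Re-rank combined chunks by similarity score (higher is better for cosine)
    all_chunks.sort(key=lambda pair: pair[1], reverse=True)
    
    return [doc for doc, _ in all_chunks[:k]]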


print("hierarchical_search() defined")

In [None]:
# Build a summary store for demonstration
SUMMARY_COLLECTION = "chapter9_summaries"

qdrant_client.create_collection(
    collection_name=SUMMARY_COLLECTION,
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

summary_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name=SUMMARY_COLLECTION,
    embedding=embeddings,
)

print("Generating document summaries...")
summary_docs = []
for doc in raw_documents[:5]:  # Limit to 5 for speed
    source = Path(doc.metadata.get("source", "unknown")).name
    summary_response = llm.invoke(
        f"Summarize this document in 2-3 sentences:\n\n{doc.page_content[:3000]}"
    )
    summary_doc = Document(
        page_content=summary_response.content,
        metadata={"document_id": source, "source": source}
    )
    summary_docs.append(summary_doc)
    print(f"  Summarized: {source}")

summary_store.add_documents(summary_docs)

# Build a chunk store with document_id metadata
CHUNK_COLLECTION = "chapter9_hier_chunks"
qdrant_client.create_collection(
    collection_name=CHUNK_COLLECTION,
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)
chunk_store = QdrantVectorStore(
    client=qdrant_client,
    collection_name=CHUNK_COLLECTION,
    embedding=embeddings,
)

hier_chunks = []
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
for doc in raw_documents[:5]:
    source = Path(doc.metadata.get("source", "unknown")).name
    chunks = splitter.split_documents([doc])
    for chunk in chunks:
        chunk.metadata["document_id"] = source
    hier_chunks.extend(chunks)

chunk_store.add_documents(hier_chunks)
print(f"\nIndexed {len(hier_chunks)} chunks with document_id metadata")

# Test hierarchical search
query = "What are the best practices for chunking documents?"
print(f"\nHierarchical search: {query}")
results = hierarchical_search(query, summary_store, chunk_store, k=5, top_docs=2)
print(f"Results: {len(results)} chunks")
for i, doc in enumerate(results):
    print(f"  {i+1}. [{doc.metadata.get('document_id', '?')}] {doc.page_content[:80]}...")

## Systematic Comparison of Retrievers

With multiple techniques available, the answer to "which should I use?" is: **measurement**. Build a comparison harness, run your evaluation dataset through different configurations, and let the numbers guide your decisions.

The key to good comparisons: control variables. Same documents, same chunking, same embedding model, same k. Only the retrieval strategy changes.
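
Latency is one of the numbers worth measuring the same way for every configuration. The sketch below times calls with `time.perf_counter` and computes a nearest-rank percentile. The helper names are hypothetical, and the lambda stands in for a real retrieval function.

```python
import math
import time
from typing import Callable


def nearest_rank_percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of values at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def time_calls(fn: Callable[[str], object], queries: list[str]) -> list[float]:
    """Return wall-clock seconds per call, measured with a monotonic clock."""
    latencies = []
    for q in queries:
        start = time.perf_counter()
        fn(q)
        latencies.append(time.perf_counter() - start)
    return latencies


latencies = time_calls(lambda q: q.upper(), ["a", "b", "c"])
p95 = nearest_rank_percentile([0.10, 0.12, 0.11, 0.50, 0.13], 95)
print(p95)
```

With only a handful of queries, the 95th percentile is simply the slowest call, so tail-latency figures become informative only with larger evaluation sets.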

In [None]:
@dataclass
class RetrieverConfig:
    name: str
    retrieve_fn: Callable
    description: str


def compare_retrievers(
    configs: list[RetrieverConfig],
    test_queries: list[dict],
    evaluate_fn: Callable
) -> dict:
    """
    Systematically compare multiple retrieval configurations.
    
    Args:
        configs: List of retriever configurations to test
        test_queries: List of {query, ground_truth} dicts
        evaluate_fn: Function that scores retrieval results
    
    Returns:
        Comparison results with metrics for each config
    """
    results = {}
    
    for config in configs:
        config_results = {
            "scores": [],
            "latencies": [],
            "name": config.name,
            "description": config.description
        }
        
        for test_case in test_queries:
            query = test_case["query"]
            ground_truth = test_case["ground_truth"]
            
            # Measure retrieval with a monotonic, high-resolution clock
            start = time.perf_counter()
            retrieved = config.retrieve_fn(query)
            latency = time.perf_counter() - start
            
            # Score results
            score = evaluate_fn(retrieved, ground_truth, query)
            
            config_results["scores"].append(score)
            config_results["latencies"].append(latency)
        
        # Aggregate metrics
        config_results["mean_score"] = sum(config_results["scores"]) / len(config_results["scores"])
        config_results["mean_latency"] = sum(config_results["latencies"]) / len(config_results["latencies"])
        config_results["p95_latency"] = sorted(config_results["latencies"])[int(len(config_results["latencies"]) * 0.95)]
        
        results[config.name] = config_results
    
    return results


def segment_analysis(
    results: dict,
    test_queries: list[dict],
    segment_fn: Callable
) -> dict:
    """
    Analyze retriever performance across query segments.
    
    Args:
        results: Raw results from compare_retrievers
        test_queries: Original test queries with metadata
        segment_fn: Function that returns segment name for a query
    """
    segmented = {}
    
    for config_name, config_results in results.items():
        segmented[config_name] = {}
        
        for i, query in enumerate(test_queries):
            segment = segment_fn(query)
            
            if segment not in segmented[config_name]:
                segmented[config_name][segment] = {"scores": [], "latencies": []}
            
            segmented[config_name][segment]["scores"].append(
                config_results["scores"][i]
            )
            segmented[config_name][segment]["latencies"].append(
                config_results["latencies"][i]
            )
        
        # Calculate segment averages
        for segment in segmented[config_name]:
            scores = segmented[config_name][segment]["scores"]
            latencies = segmented[config_name][segment]["latencies"]
            segmented[config_name][segment]["mean_score"] = sum(scores) / len(scores)
            segmented[config_name][segment]["mean_latency"] = sum(latencies) / len(latencies)
    
    return segmented


def query_complexity_segment(query: dict) -> str:
    """Segment queries by complexity."""
    words = query["query"].split()
    if len(words) <= 5:
        return "simple"
    elif len(words) <= 15:
        return "moderate"
    else:
        return "complex"


print("Comparison framework defined!")

## Running a Retriever Comparison

Let's compare basic vector search against advanced techniques using test queries with ground truth context. We use a simple relevance scoring function first, then upgrade to RAGAS metrics.
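
One thing to watch with keyword-overlap scoring is that substring checks let short words match inside longer ones (for example, "to" inside "history"). The sketch below is an alternative that matches whole tokens and skips a small stopword list. The `STOPWORDS` set and the function name `token_recall` are assumptions for illustration, not part of the notebook's scoring function.

```python
import re

# A deliberately small, hypothetical stopword list
STOPWORDS = {"a", "an", "and", "are", "from", "in", "is", "of", "the", "to", "where", "with"}


def token_recall(retrieved_text: str, ground_truth: str) -> float:
    """Fraction of non-stopword ground-truth tokens that appear as whole tokens."""
    retrieved_tokens = set(re.findall(r"[a-z0-9]+", retrieved_text.lower()))
    truth_tokens = [
        t for t in re.findall(r"[a-z0-9]+", ground_truth.lower()) if t not in STOPWORDS
    ]
    if not truth_tokens:
        return 0.0
    return sum(1 for t in truth_tokens if t in retrieved_tokens) / len(truth_tokens)


print(token_recall("agents use retry logic", "agents need retry logic"))
print(token_recall("a history of cooking", "to the agents"))
```

The second call scores zero because neither "to" nor "the" counts and "agents" is absent, whereas a substring check would have credited "to" from "history".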

In [None]:
# Test queries with ground truth (key phrases that should appear in good retrievals)
test_queries = [
    {
        "query": "What is RAG?",
        "ground_truth": "retrieval-augmented generation combines retrieval with generation"
    },
    {
        "query": "How do embeddings capture semantic meaning?",
        "ground_truth": "embeddings map text to vectors where similar meanings are close together"
    },
    {
        "query": "What are the different chunking strategies?",
        "ground_truth": "fixed-size chunking recursive character splitting semantic chunking"
    },
    {
        "query": "Compare vector search with keyword search",
        "ground_truth": "dense vector search finds semantic similarity BM25 finds exact term matches"
    },
    {
        "query": "error handling in agents",
        "ground_truth": "agents need error handling retry logic fallback strategies"
    },
    {
        "query": "How does the ReAct pattern work?",
        "ground_truth": "ReAct alternates between reasoning and acting observation loop"
    },
    {
        "query": "What is prompt engineering and why does it matter?",
        "ground_truth": "prompt engineering designs inputs to get better outputs from language models"
    },
    {
        "query": "Explain the supervisor pattern in multi-agent systems",
        "ground_truth": "supervisor agent coordinates delegates tasks to specialized worker agents"
    },
]


def simple_relevance_score(retrieved: list, ground_truth: str, query: str) -> float:
    """Score retrieval as the fraction of ground-truth terms found in the retrieved text.

    `query` is not used by this scorer.
    """
    if not retrieved:
        return 0.0

    retrieved_text = " ".join(doc.page_content.lower() for doc in retrieved)
    truth_terms = ground_truth.lower().split()

    # Substring check: a term also counts when it occurs inside a longer word
    # (e.g. "vector" matches "vectors"), so scores are an optimistic estimate.
    matches = sum(1 for term in truth_terms if term in retrieved_text)
    return matches / len(truth_terms) if truth_terms else 0.0


print(f"Defined {len(test_queries)} test queries with ground truth")

In [None]:
# Define retriever configurations
configs = [
    RetrieverConfig(
        name="basic_vector",
        retrieve_fn=lambda q: vector_store.similarity_search(q, k=5),
        description="Basic dense vector search"
    ),
    RetrieverConfig(
        name="rag_fusion",
        retrieve_fn=lambda q: rag_fusion(q, vector_store, llm, k=5, num_queries=3),
        description="RAG-Fusion with multi-query + RRF"
    ),
]

# Add hybrid search if rank_bm25 is available
if HAS_RANK_BM25:
    configs.append(RetrieverConfig(
        name="hybrid",
        retrieve_fn=lambda q: hybrid_search(q, vector_store, all_chunks, k=5),
        description="Hybrid dense + BM25 with RRF"
    ))

print("Comparing retrievers...\n")
comparison_results = compare_retrievers(configs, test_queries, simple_relevance_score)

# Display results
print(f"{'Retriever':<20} {'Mean Score':>12} {'Mean Latency':>14} {'P95 Latency':>13}")
print("-" * 62)
for name, data in comparison_results.items():
    print(f"{name:<20} {data['mean_score']:>12.3f} {data['mean_latency']*1000:>12.1f}ms {data['p95_latency']*1000:>11.1f}ms")

In [None]:
# Analyze by query complexity
segmented = segment_analysis(comparison_results, test_queries, query_complexity_segment)

print("Performance by query complexity:\n")
for config_name, segments in segmented.items():
    print(f"  {config_name}:")
    for segment, data in segments.items():
        print(f"    {segment:>10}: score={data['mean_score']:.3f}, latency={data['mean_latency']*1000:.1f}ms")
    print()

## RAGAS Evaluation of Retrieval Quality

For a more rigorous evaluation, we use RAGAS metrics to measure context precision and context recall. These metrics go beyond keyword matching: they use an LLM judge to assess whether the retrieved context actually supports answering the question correctly.
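
The intuition behind the two metrics can be shown without an LLM. The sketch below is a simplified stand-in, not the RAGAS implementation: it assumes the per-chunk relevance verdicts and per-claim support labels already exist as plain lists, whereas in RAGAS an LLM judge produces them. `context_precision` and `context_recall` are hypothetical helper names.

```python
def context_precision(verdicts: list[int]) -> float:
    """Rank-aware precision over binary verdicts (1 = relevant), in retrieval order.

    Averages precision@k over the ranks k that hold a relevant chunk, so
    relevant chunks near the top of the list score higher.
    """
    relevant_total = sum(verdicts)
    if relevant_total == 0:
        return 0.0
    hits = 0
    score = 0.0
    for rank, verdict in enumerate(verdicts, start=1):
        if verdict:
            hits += 1
            score += hits / rank  # precision@rank, counted only at relevant ranks
    return score / relevant_total


def context_recall(claim_supported: list[bool]) -> float:
    """Fraction of reference claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


# Same two relevant chunks, different ranking: earlier hits score higher.
print(round(context_precision([1, 0, 1]), 3))  # 0.833
print(round(context_precision([0, 1, 1]), 3))  # 0.583
print(round(context_recall([True, True, False]), 3))  # 0.667
```

Precision rewards putting relevant chunks first, while recall only asks whether the needed information was retrieved at all, which is why the two are reported together.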

In [None]:
from ragas import evaluate
# Import from the public ragas.metrics namespace rather than private modules
from ragas.metrics import LLMContextPrecisionWithoutReference, LLMContextRecall
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from datasets import Dataset

# Configure RAGAS using LangChain wrappers
ragas_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
ragas_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

metrics = [
    LLMContextPrecisionWithoutReference(llm=ragas_llm),
    LLMContextRecall(llm=ragas_llm),
]


def build_ragas_dataset(retriever_fn, test_queries):
    """Run retriever on test queries, generate answers, and build RAGAS-compatible dataset."""
    questions = []
    responses = []
    contexts_list = []
    ground_truths = []
    
    for tq in test_queries:
        retrieved = retriever_fn(tq["query"])
        context_texts = [doc.page_content for doc in retrieved]
        
        # Generate a response from the top 3 retrieved chunks only
        # (all retrieved chunks are still passed to RAGAS as contexts)
        context_str = "\n\n".join(context_texts[:3])
        answer = llm.invoke(
            f"Based on the following context, answer the question briefly.\n\n"
            f"Context:\n{context_str}\n\nQuestion: {tq['query']}"
        )
        
        questions.append(tq["query"])
        responses.append(answer.content)
        contexts_list.append(context_texts)
        ground_truths.append(tq["ground_truth"])
    
    return Dataset.from_dict({
        "question": questions,
        "answer": responses,
        "contexts": contexts_list,
        "ground_truth": ground_truths,
    })


# Evaluate basic vector search (limit to 4 queries to save API calls)
ragas_test_queries = test_queries[:4]

print("Evaluating basic vector search with RAGAS...")
basic_dataset = build_ragas_dataset(
    lambda q: vector_store.similarity_search(q, k=5),
    ragas_test_queries
)
basic_results = evaluate(basic_dataset, metrics=metrics)
basic_df = basic_results.to_pandas()

# Evaluate RAG-Fusion
print("Evaluating RAG-Fusion with RAGAS...")
fusion_dataset = build_ragas_dataset(
    lambda q: rag_fusion(q, vector_store, llm, k=5, num_queries=3),
    ragas_test_queries
)
fusion_results = evaluate(fusion_dataset, metrics=metrics)
fusion_df = fusion_results.to_pandas()

# Compare
metric_cols = basic_df.select_dtypes(include="number").columns
print(f"\n{'Metric':<40} {'Basic Vector':>14} {'RAG-Fusion':>14}")
print("-" * 70)
for col in metric_cols:
    basic_mean = basic_df[col].mean()
    fusion_mean = fusion_df[col].mean()
    delta = fusion_mean - basic_mean
    arrow = "^" if delta > 0.01 else "v" if delta < -0.01 else "="
    print(f"{col:<40} {basic_mean:>14.3f} {fusion_mean:>13.3f} {arrow}")

## Summary

We explored advanced retrieval techniques and a systematic evaluation framework:

| Technique | When to Use | Typical Latency |
|-----------|------------|----------------|
| **Hybrid search** (BM25 + dense) | Queries with specific terms, codes, names | 40-80ms |
| **Reranking** (Cohere cross-encoder) | Precision is critical | 100-250ms |
| **Query expansion** (LLM-generated) | Vocabulary mismatch is common | 100-300ms |
| **RAG-Fusion** (multi-query + RRF) | Complex, multi-faceted queries | 200-500ms |
| **Semantic chunking** | Structured documents, code examples | Indexing-time |
| **Contextual compression** | Long chunks, limited context window | 200-500ms per doc |
| **Relevance filtering** | Quick noise removal | 50-100ms |
| **Hierarchical retrieval** | Very large corpora | Varies |

**Decision framework:**
1. Start simple -- basic vector search with good chunking.
2. Add hybrid search if you have specific identifiers users search for.
3. Add reranking if precision is critical and 100-250ms of extra latency is acceptable.
4. Add query expansion if vocabulary mismatch is causing retrieval failures.
5. Add RAG-Fusion if comprehensive coverage matters more than speed.

The key: **measure before and after each change**. Do not assume that a more sophisticated pipeline retrieves better.
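
The "measure before and after" rule can be turned into an explicit check. The sketch below is one possible gate, not part of this notebook's framework: `should_adopt` is a hypothetical helper, both thresholds are arbitrary assumptions, and the numbers in the usage example are made up for illustration rather than measured. It reads the `mean_score` and `mean_latency` keys (latency in seconds) that the comparison results expose.

```python
def should_adopt(
    baseline: dict,
    candidate: dict,
    min_score_gain: float = 0.02,      # assumed minimum improvement worth keeping
    max_extra_latency_s: float = 0.2,  # assumed latency budget, in seconds
) -> bool:
    """Accept a candidate retriever only if it is clearly better and affordable."""
    score_gain = candidate["mean_score"] - baseline["mean_score"]
    extra_latency = candidate["mean_latency"] - baseline["mean_latency"]
    return score_gain >= min_score_gain and extra_latency <= max_extra_latency_s


# Illustrative numbers only, not results from this notebook.
baseline = {"mean_score": 0.61, "mean_latency": 0.045}
candidate = {"mean_score": 0.66, "mean_latency": 0.310}

print(should_adopt(baseline, candidate))                           # False: over the latency budget
print(should_adopt(baseline, candidate, max_extra_latency_s=0.5))  # True with a looser budget
```

Writing the thresholds down before running the comparison keeps the decision from drifting toward whichever technique was most interesting to build.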

In [None]:
print("=" * 60)
print("CHAPTER 9: ADVANCED RETRIEVAL & EVALUATION COMPLETE")
print("=" * 60)

print(f"\nDocument corpus: {len(all_chunks)} chunks from {len(guide_files)} files")
print(f"Vector store: Qdrant in-memory ({COLLECTION_NAME})")

print("\nTechniques implemented:")
techniques = [
    "Reciprocal Rank Fusion (RRF)",
    "Hybrid Search (BM25 + Dense)" + (" [active]" if HAS_RANK_BM25 else " [needs rank-bm25]"),
    "Reranking (Cohere)" + (" [active]" if HAS_COHERE and os.getenv("COHERE_API_KEY") else " [needs cohere + API key]"),
    "Query Expansion (LLM-based)",
    "RAG-Fusion (multi-query + RRF)",
    "Semantic Chunking" + (" [active]" if HAS_SEMANTIC_CHUNKER else " [needs langchain-experimental]"),
    "Structure-Aware Chunking (Markdown headers)",
    "Contextual Compression (LLMChainExtractor)",
    "Relevance Filtering (cosine threshold)",
    "Hierarchical Retrieval (summary -> chunk)",
    "Systematic Retriever Comparison Framework",
    "RAGAS Evaluation (context precision + recall)",
]
for t in techniques:
    print(f"  - {t}")

print("\nComparison results:")
for name, data in comparison_results.items():
    print(f"  {name}: score={data['mean_score']:.3f}, latency={data['mean_latency']*1000:.1f}ms")