# Demo #4: Hierarchical Retrieval - Parent Document Retriever

## Overview

This notebook demonstrates **Hierarchical Retrieval** using the Parent-Child chunking pattern to solve the fundamental chunking trade-off:
- **Small chunks**: Precise retrieval but insufficient context for generation
- **Large chunks**: Rich context but imprecise retrieval

### The Solution: Parent-Child Architecture

1. **Child Chunks (Small)**: Used for precise embedding-based retrieval
2. **Parent Chunks (Large)**: Returned to LLM for context-rich generation

### Key Concepts Demonstrated
- Hierarchical chunking strategy
- Parent-Child relationship in document storage
- Precision in retrieval, richness in generation
- Trade-off optimization

### Data Flow
```
Query → Embed → Search child embeddings (precise) → Identify top-K children → 
Lookup parent IDs → Fetch parent chunks (rich context) → LLM generation
```
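
The flow above can be sketched in plain Python before any library is involved. This is a toy illustration only: the word-overlap `similarity` function stands in for embedding similarity, and the `parents` and `children` stores, their IDs, and `retrieve_parents` are hypothetical names invented for the example.

```python
# Toy stand-in for embedding similarity: Jaccard overlap of word sets.
def similarity(query: str, text: str) -> float:
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0

# Hypothetical stores: parents keyed by ID, children carrying a parent_id.
parents = {
    "p1": "Chunk size controls the balance between retrieval precision and context. "
          "Small chunks match queries closely, large chunks carry more background.",
    "p2": "Embedding models map text to vectors. Similar texts land close together.",
}
children = [
    {"text": "Chunk size controls the balance between retrieval precision and context.", "parent_id": "p1"},
    {"text": "Small chunks match queries closely, large chunks carry more background.", "parent_id": "p1"},
    {"text": "Embedding models map text to vectors.", "parent_id": "p2"},
    {"text": "Similar texts land close together.", "parent_id": "p2"},
]

def retrieve_parents(query: str, top_k: int = 2) -> list[str]:
    # Search the children (precise), then swap each hit for its parent (rich).
    ranked = sorted(children, key=lambda c: similarity(query, c["text"]), reverse=True)
    parent_ids = []
    for child in ranked[:top_k]:
        if child["parent_id"] not in parent_ids:  # de-duplicate, keep rank order
            parent_ids.append(child["parent_id"])
    return [parents[pid] for pid in parent_ids]

context = retrieve_parents("how does chunk size affect retrieval precision")
print(context)
```

Both top-ranked children here belong to the same parent, so a single parent chunk is returned. The full notebook handles the same situation by de-duplicating parent IDs.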

### References
- **Parent Document Retriever Pattern**: LangChain Documentation
- **Research**: "Rethinking Chunk Size For Long-Document Retrieval" (arXiv:2505.21700)
- **Research**: "Late Chunking: Contextual Chunk Embeddings" (arXiv:2409.04701)

## 1. Environment Setup

In [None]:
# Install required packages
# !pip install llama-index llama-index-llms-azure-openai llama-index-embeddings-azure-openai python-dotenv pandas

In [None]:
import os
from dotenv import load_dotenv
from typing import List, Dict, Any, Optional
import warnings
warnings.filterwarnings('ignore')

# LlamaIndex imports
from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    StorageContext,
    Settings
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import TextNode, NodeWithScore
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.llms.azure_openai import AzureOpenAI

# Load environment variables
load_dotenv()

print("✓ Environment setup complete")

## 2. Configure Azure OpenAI

In [None]:
# Initialize Azure OpenAI LLM
azure_llm = AzureOpenAI(
    engine=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"),
    model="gpt-4",
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
    temperature=0.1
)

# Initialize Azure OpenAI Embedding
azure_embed = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name=os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
)

# Set global defaults
Settings.llm = azure_llm
Settings.embed_model = azure_embed
Settings.chunk_size = 512
Settings.chunk_overlap = 50

print("✓ Azure OpenAI configured")
print(f"  LLM: {os.getenv('AZURE_OPENAI_DEPLOYMENT_NAME')}")
print(f"  Embeddings: {os.getenv('AZURE_OPENAI_EMBEDDING_DEPLOYMENT')}")
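
Because `os.getenv` returns `None` for an unset variable, a missing entry in `.env` may only surface later as a less obvious client error. A small pre-flight check along these lines can catch that early. The variable names are the ones used in the cell above; the helper `missing_env_vars` is a hypothetical name introduced for this sketch.

```python
import os

# Variable names taken from the configuration cell above.
REQUIRED_VARS = [
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
]

def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All Azure OpenAI variables are set")
```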

## 3. Load Long-Form Documents

We'll use 3-4 long-form technical documents to demonstrate the effectiveness of hierarchical retrieval.

In [None]:
# Load long-form documents
data_path = "../RAG_v2/data/long_form_docs/"

reader = SimpleDirectoryReader(input_dir=data_path)
documents = reader.load_data()

print(f"✓ Loaded {len(documents)} documents")
for i, doc in enumerate(documents):
    print(f"  {i+1}. {doc.metadata.get('file_name', 'Unknown')} ({len(doc.text)} characters)")

total_chars = sum(len(doc.text) for doc in documents)
print(f"\nTotal content: {total_chars:,} characters (~{total_chars//4:,} tokens)")

## 4. Baseline Approach #1: Medium-Sized Chunks

First, let's establish a baseline with standard medium-sized chunks (512 tokens).
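
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch over a list of tokens. It is a simplification: `SentenceSplitter` counts tokens with a tokenizer and prefers sentence boundaries, so its real output will differ. The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks

tokens = [f"t{i}" for i in range(10)]
chunks = chunk_with_overlap(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last token of the previous one, which is what the overlap setting buys: text that straddles a boundary still appears whole in at least one chunk.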

In [None]:
# Create medium-sized chunks
medium_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
medium_nodes = medium_splitter.get_nodes_from_documents(documents)

print(f"✓ Created {len(medium_nodes)} medium-sized chunks (512 tokens)")
print(f"\nSample chunk:")
print(f"Length: {len(medium_nodes[0].text)} characters")
print(f"Text preview: {medium_nodes[0].text[:200]}...")

In [None]:
# Build index with medium chunks
medium_index = VectorStoreIndex(medium_nodes, embed_model=azure_embed)
medium_query_engine = medium_index.as_query_engine(
    llm=azure_llm,
    similarity_top_k=3
)

print("✓ Medium-chunk baseline query engine ready")

## 5. Baseline Approach #2: Small Chunks

Next, let's try small chunks (128 tokens) for precise retrieval.

In [None]:
# Create small chunks
small_splitter = SentenceSplitter(chunk_size=128, chunk_overlap=10)
small_nodes = small_splitter.get_nodes_from_documents(documents)

print(f"✓ Created {len(small_nodes)} small chunks (128 tokens)")
print(f"\nSample chunk:")
print(f"Length: {len(small_nodes[0].text)} characters")
print(f"Text preview: {small_nodes[0].text[:200]}")

In [None]:
# Build index with small chunks
small_index = VectorStoreIndex(small_nodes, embed_model=azure_embed)
small_query_engine = small_index.as_query_engine(
    llm=azure_llm,
    similarity_top_k=3
)

print("✓ Small-chunk baseline query engine ready")

## 6. Hierarchical Approach: Parent-Child Chunking

Now implement the parent-child strategy:
- **Parent chunks**: 1024 tokens (rich context)
- **Child chunks**: 256 tokens (precise retrieval)

### Architecture
1. Split documents into parent chunks
2. For each parent, create multiple child chunks
3. Link each child to its parent through a parent relationship that stores the parent's node ID
4. Index only children for embedding search
5. Store parents in a document store
6. Custom retriever: search children, return parents
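
Steps 1 to 5 can be sketched without LlamaIndex to show the data structures involved. This is an illustration under simplifying assumptions: `split_words` is a whitespace splitter without overlap standing in for `SentenceSplitter`, a plain dict plays the role of the document store, and all names are hypothetical.

```python
import uuid

def split_words(text, size):
    """Whitespace-token splitter (no overlap), a stand-in for SentenceSplitter."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_hierarchy(document, parent_size, child_size):
    docstore = {}       # parent_id -> parent text (step 5)
    child_records = []  # what would be embedded and indexed (step 4)
    for parent_text in split_words(document, parent_size):        # step 1
        parent_id = str(uuid.uuid4())
        docstore[parent_id] = parent_text
        for child_text in split_words(parent_text, child_size):   # step 2
            # step 3: every child remembers which parent it came from
            child_records.append({"text": child_text, "parent_id": parent_id})
    return docstore, child_records

document = " ".join(f"w{i}" for i in range(20))
docstore, child_records = build_hierarchy(document, parent_size=8, child_size=4)
print(len(docstore), "parents,", len(child_records), "children")
```

Only `child_records` would be embedded; `docstore` is consulted after the search to turn child hits into parent text.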

In [None]:
# Define parent and child splitters
parent_splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=100)
child_splitter = SentenceSplitter(chunk_size=256, chunk_overlap=25)

print("✓ Splitters configured:")
print("  Parent chunks: 1024 tokens, overlap 100")
print("  Child chunks: 256 tokens, overlap 25")

In [None]:
# Create parent nodes
parent_nodes = parent_splitter.get_nodes_from_documents(documents)

print(f"✓ Created {len(parent_nodes)} parent nodes")
print(f"\nSample parent node:")
print(f"Length: {len(parent_nodes[0].text)} characters")
print(f"Text preview: {parent_nodes[0].text[:300]}...")

In [None]:
# Create child nodes linked to parents
from llama_index.core.schema import Document, NodeRelationship

child_nodes = []

for parent_node in parent_nodes:
    # Lightweight reference (node ID + metadata) to store on each child
    parent_ref = parent_node.as_related_node_info()

    # Wrap the parent text in a temporary document so the splitter can process it
    temp_doc = Document(text=parent_node.text, metadata=parent_node.metadata)

    # Split parent into children
    children = child_splitter.get_nodes_from_documents([temp_doc])

    # Link each child to its parent. Relationship keys are NodeRelationship
    # enum members; a plain "parent" string is not a valid key.
    for child in children:
        child.relationships[NodeRelationship.PARENT] = parent_ref
        child_nodes.append(child)

print(f"✓ Created {len(child_nodes)} child nodes linked to parents")
print(f"  Average: {len(child_nodes)/len(parent_nodes):.1f} children per parent")
print(f"\nSample child node:")
print(f"Length: {len(child_nodes[0].text)} characters")
print(f"Has parent link: {NodeRelationship.PARENT in child_nodes[0].relationships}")

In [None]:
# Create document store for parents
docstore = SimpleDocumentStore()

# Add all parent nodes to the document store in one call
docstore.add_documents(parent_nodes)

print(f"✓ Parent document store created with {len(parent_nodes)} parents")

In [None]:
# Index only child nodes for retrieval
child_index = VectorStoreIndex(child_nodes, embed_model=azure_embed)

print("✓ Vector index created with child nodes for precise retrieval")

### 6.1 Implement Parent Document Retriever

Custom retriever that:
1. Retrieves top-k child nodes based on similarity
2. Extracts parent IDs from children
3. Fetches parent nodes from document store
4. Returns parent nodes (with rich context) to query engine

In [None]:
class ParentDocumentRetriever(BaseRetriever):
    """Custom retriever that searches children but returns parents."""

    def __init__(
        self,
        child_index: VectorStoreIndex,
        docstore: SimpleDocumentStore,
        similarity_top_k: int = 3,
    ):
        self._child_retriever = child_index.as_retriever(
            similarity_top_k=similarity_top_k
        )
        self._docstore = docstore
        self._similarity_top_k = similarity_top_k
        super().__init__()

    def _retrieve(self, query_bundle) -> List[NodeWithScore]:
        """Retrieve parent nodes based on child node matches."""

        # Step 1: Retrieve child nodes
        child_hits = self._child_retriever.retrieve(query_bundle)

        # Step 2: Track the best child score per parent
        # (dict keys de-duplicate parents shared by several children)
        parent_scores = {}
        for hit in child_hits:
            parent_ref = hit.node.relationships.get(NodeRelationship.PARENT)
            if parent_ref is None:
                continue  # child without a parent link
            parent_id = parent_ref.node_id
            score = hit.score or 0.0  # score can be None
            if parent_id not in parent_scores or score > parent_scores[parent_id]:
                parent_scores[parent_id] = score

        # Step 3: Fetch parent nodes from document store
        parent_nodes = []
        for parent_id, score in parent_scores.items():
            # raise_error=False returns None for an unknown ID instead of raising
            parent_node = self._docstore.get_document(parent_id, raise_error=False)
            if parent_node is not None:
                # Create NodeWithScore using the best child score
                parent_nodes.append(NodeWithScore(node=parent_node, score=score))

        # Step 4: Sort by score and return top-k
        parent_nodes.sort(key=lambda x: x.score, reverse=True)

        return parent_nodes[:self._similarity_top_k]

# Create parent document retriever
parent_retriever = ParentDocumentRetriever(
    child_index=child_index,
    docstore=docstore,
    similarity_top_k=3
)

print("✓ Parent Document Retriever created")
print("  Retrieval: Uses child node embeddings (256 tokens, precise)")
print("  Context: Returns parent nodes (1024 tokens, rich)")
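
The de-duplication inside `_retrieve` (keep the best child score per parent, then rank) can be isolated as a pure function and checked without an index. The sketch below uses hypothetical `(parent_id, score)` pairs in place of real child hits; `best_parents` is a name introduced for the example.

```python
def best_parents(child_hits, top_k):
    """child_hits: iterable of (parent_id, score) pairs from the child search."""
    best = {}
    for parent_id, score in child_hits:
        # Keep only the highest child score seen for each parent
        if parent_id not in best or score > best[parent_id]:
            best[parent_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

hits = [("p1", 0.82), ("p2", 0.79), ("p1", 0.91), ("p3", 0.40)]
ranked = best_parents(hits, top_k=2)
print(ranked)
```

Since several children can share one parent, the retriever may return fewer parents than `similarity_top_k`. Retrieving more children than the number of parents wanted is one possible way to compensate.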

In [None]:
# Create query engine with parent retriever
parent_query_engine = RetrieverQueryEngine(
    retriever=parent_retriever,
    llm=azure_llm
)

print("✓ Hierarchical query engine ready")

## 7. Comparative Evaluation

Test all three approaches with queries requiring both precision and context.

### Test Query 1: Requires Precision + Context

In [None]:
test_query_1 = "What are the key considerations when choosing chunk size for RAG systems?"

print("="*80)
print(f"TEST QUERY: {test_query_1}")
print("="*80)

In [None]:
# Approach 1: Medium chunks (512 tokens)
print("\n" + "="*80)
print("APPROACH 1: MEDIUM CHUNKS (512 tokens)")
print("="*80)

response_medium = medium_query_engine.query(test_query_1)

print("\n📄 Retrieved Chunks:")
for i, node in enumerate(response_medium.source_nodes):
    print(f"\nChunk {i+1}:")
    print(f"  Score: {node.score:.4f}")
    print(f"  Length: {len(node.text)} characters")
    print(f"  Preview: {node.text[:200]}...")

print("\n💡 Generated Answer:")
print(response_medium.response)

In [None]:
# Approach 2: Small chunks (128 tokens)
print("\n" + "="*80)
print("APPROACH 2: SMALL CHUNKS (128 tokens)")
print("="*80)

response_small = small_query_engine.query(test_query_1)

print("\n📄 Retrieved Chunks:")
for i, node in enumerate(response_small.source_nodes):
    print(f"\nChunk {i+1}:")
    print(f"  Score: {node.score:.4f}")
    print(f"  Length: {len(node.text)} characters")
    print(f"  Preview: {node.text}")

print("\n💡 Generated Answer:")
print(response_small.response)

In [None]:
# Approach 3: Parent-Child Hierarchical
print("\n" + "="*80)
print("APPROACH 3: HIERARCHICAL (Child retrieval: 256, Parent context: 1024)")
print("="*80)

response_parent = parent_query_engine.query(test_query_1)

print("\n📄 Retrieved Parent Chunks (via child matching):")
for i, node in enumerate(response_parent.source_nodes):
    print(f"\nParent Chunk {i+1}:")
    print(f"  Score: {node.score:.4f} (from best matching child)")
    print(f"  Length: {len(node.text)} characters")
    print(f"  Preview: {node.text[:300]}...")

print("\n💡 Generated Answer:")
print(response_parent.response)

### Test Query 2: Technical Detail Requiring Broad Context

In [None]:
test_query_2 = "Explain the relationship between embedding quality and retrieval performance in RAG systems."

print("="*80)
print(f"TEST QUERY: {test_query_2}")
print("="*80)

In [None]:
# Compare all three approaches
print("\n" + "="*80)
print("COMPARISON: Medium vs Small vs Hierarchical")
print("="*80)

# Medium chunks
print("\n🔵 Medium Chunks (512 tokens):")
response_m2 = medium_query_engine.query(test_query_2)
print(f"Context length: {sum(len(n.text) for n in response_m2.source_nodes)} chars")
print(f"Answer preview: {response_m2.response[:250]}...\n")

# Small chunks
print("\n🟡 Small Chunks (128 tokens):")
response_s2 = small_query_engine.query(test_query_2)
print(f"Context length: {sum(len(n.text) for n in response_s2.source_nodes)} chars")
print(f"Answer preview: {response_s2.response[:250]}...\n")

# Hierarchical
print("\n🟢 Hierarchical (Child: 256, Parent: 1024):")
response_p2 = parent_query_engine.query(test_query_2)
print(f"Context length: {sum(len(n.text) for n in response_p2.source_nodes)} chars")
print(f"Answer preview: {response_p2.response[:250]}...\n")

## 8. Analysis: The Chunking Trade-off

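The table below summarizes how each chunking strategy is expected to behave. The ratings are qualitative and hand-assigned; they are not computed from the query runs above.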

In [None]:
import pandas as pd

# Qualitative ratings (hand-assigned for illustration, not measured)
comparison_data = {
    'Approach': ['Small Chunks', 'Medium Chunks', 'Hierarchical (Parent-Child)'],
    'Chunk Size': ['128 tokens', '512 tokens', 'Child: 256, Parent: 1024'],
    'Retrieval Precision': ['⭐⭐⭐⭐⭐ Excellent', '⭐⭐⭐ Good', '⭐⭐⭐⭐⭐ Excellent (via children)'],
    'Context Richness': ['⭐⭐ Limited', '⭐⭐⭐⭐ Good', '⭐⭐⭐⭐⭐ Excellent (via parents)'],
    'Answer Quality': ['⭐⭐⭐ Fair', '⭐⭐⭐⭐ Good', '⭐⭐⭐⭐⭐ Excellent'],
    'Trade-off': ['Precise but lacks context', 'Balanced but suboptimal', 'Best of both worlds'],
}

comparison_df = pd.DataFrame(comparison_data)
print("\n" + "="*80)
print("CHUNKING STRATEGY COMPARISON")
print("="*80)
print(comparison_df.to_string(index=False))
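
The star ratings in the table are hard-coded strings rather than measurements. As a rough sketch of how retrieval quality could be quantified instead, the functions below compute precision@k and hit rate from chunk IDs. The labeled data is entirely hypothetical; in practice the relevant IDs would come from human judgments on your own corpus.

```python
def hit_rate(retrieved_ids, relevant_ids):
    """1.0 if any retrieved chunk is relevant, else 0.0."""
    return 1.0 if set(retrieved_ids) & set(relevant_ids) else 0.0

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for cid in top if cid in relevant) / len(top)

# Hypothetical labels: query -> IDs of chunks judged relevant by a human.
labels = {"q1": ["c2", "c7"]}
# Hypothetical retriever output for the same query.
retrieved = {"q1": ["c2", "c5", "c7"]}

p = precision_at_k(retrieved["q1"], labels["q1"], k=3)
h = hit_rate(retrieved["q1"], labels["q1"])
print(f"precision@3={p:.2f} hit_rate={h:.0f}")
```

Running the same labeled queries through each of the three query engines would give comparable numbers for the "Retrieval Precision" column.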

## 9. Visualization: How Hierarchical Retrieval Works

In [None]:
print("\n" + "="*80)
print("HIERARCHICAL RETRIEVAL: ARCHITECTURE & DATA FLOW")
print("="*80)

print("""
┌─────────────────────────────────────────────────────────────────────┐
│                         DOCUMENT CORPUS                              │
│                  (Long-form technical documents)                     │
└──────────────────────────┬──────────────────────────────────────────┘
                           │
                           ▼
              ┌────────────────────────┐
              │   PARENT SPLITTER      │
              │   (1024 tokens,        │
              │    overlap 100)        │
              └────────┬───────────────┘
                       │
       ┌───────────────┴───────────────┐
       ▼                               ▼
┌──────────────┐              ┌──────────────┐
│ PARENT NODE  │              │ PARENT NODE  │
│  (Rich       │              │  (Rich       │
│   Context)   │              │   Context)   │
└──────┬───────┘              └──────┬───────┘
       │                             │
       │ Child Splitter              │ Child Splitter
       │ (256 tokens)                │ (256 tokens)
       ▼                             ▼
┌─────────────┐              ┌─────────────┐
│  Child 1    │              │  Child 3    │
│  (Precise)  │              │  (Precise)  │
├─────────────┤              ├─────────────┤
│  Child 2    │              │  Child 4    │
│  (Precise)  │              │  (Precise)  │
└─────────────┘              └─────────────┘
       │                             │
       └──────────┬──────────────────┘
                  ▼
        ┌──────────────────┐
        │  VECTOR INDEX    │
        │  (Child Nodes    │
        │   Only)          │
        └────────┬─────────┘
                 │
    ┌────────────┴────────────┐
    │   USER QUERY            │
    │   "Chunk size in RAG?"  │
    └────────────┬────────────┘
                 │
                 ▼
    ┌────────────────────────┐
    │  SIMILARITY SEARCH     │
    │  (on child embeddings) │
    └────────────┬───────────┘
                 │
                 ▼
    ┌────────────────────────┐
    │  TOP-K CHILDREN        │
    │  (Precise matches)     │
    └────────────┬───────────┘
                 │
                 ▼
    ┌────────────────────────┐
    │  EXTRACT PARENT IDs    │
    │  (from relationships)  │
    └────────────┬───────────┘
                 │
                 ▼
    ┌────────────────────────┐
    │  FETCH PARENTS         │
    │  (from docstore)       │
    └────────────┬───────────┘
                 │
                 ▼
    ┌────────────────────────┐
    │  RETURN TO LLM         │
    │  (Rich parent context) │
    └────────────┬───────────┘
                 │
                 ▼
    ┌────────────────────────┐
    │  GENERATED ANSWER      │
    │  (Accurate + Complete) │
    └────────────────────────┘

KEY INSIGHT:
  • Child nodes provide PRECISION in retrieval (small, focused embeddings)
  • Parent nodes provide CONTEXT for generation (large, comprehensive text)
  • This solves the "chunking trade-off" problem
""")

## 10. Key Findings & Best Practices

### The Chunking Trade-off Problem

**Small Chunks (128-256 tokens):**
- ✅ Precise semantic matching
- ✅ Lower noise in retrieval
- ❌ Insufficient context for generation
- ❌ May miss surrounding information

**Large Chunks (512-1024 tokens):**
- ✅ Rich context for generation
- ✅ Complete information
- ❌ Less precise retrieval
- ❌ More noise in embeddings

**Hierarchical (Parent-Child):**
- ✅ Precise retrieval (via small children)
- ✅ Rich context (via large parents)
- ✅ Best of both worlds
- ⚠️ Slightly more complex implementation

### Research-Backed Insights

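1. **Optimal Child Size**: arXiv:2505.21700 reports that small chunks (64-128 tokens) retrieve best when answers are short and fact-based; this demo uses 256-token children
2. **Optimal Parent Size**: 512-1024 tokens for context, the range the same paper finds better for questions needing broader context
3. **Child-Parent Ratio**: Typically 3-5 children per parent as a rule of thumb (1024/256 gives roughly 4 in this demo)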
4. **Use Cases**: Essential for long documents, technical content, legal/medical texts
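
The child-parent ratio follows from simple arithmetic on chunk size and overlap. The sketch below assumes idealized fixed-size windows; `SentenceSplitter` respects sentence boundaries, so the average printed earlier in the notebook may differ. The function name `windows_needed` is hypothetical.

```python
import math

def windows_needed(total_tokens, chunk_size, overlap):
    """Number of fixed-size windows with the given overlap needed to cover total_tokens."""
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap
    return 1 + math.ceil((total_tokens - chunk_size) / stride)

# Settings used in this demo: 1024-token parents, 256-token children, overlap 25
n = windows_needed(1024, 256, 25)
print(n)
```

With no overlap the same parent splits into exactly 4 children; the 25-token overlap pushes the idealized count to 5.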

### Implementation Guidelines

1. **Document Length Matters**: Use hierarchical retrieval for documents >2000 tokens
2. **Domain Considerations**: More critical for specialized domains requiring precision
3. **Storage Trade-off**: Requires storing both child and parent nodes
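4. **Alternative Approach**: Consider "Late Chunking" (arXiv:2409.04701), which embeds the whole document with a long-context model and then pools the token embeddings into per-chunk embeddings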
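
For the Late Chunking alternative, the distinctive step is pooling: token embeddings are computed once over the full document, and each chunk vector is the mean of the token vectors inside its span. The sketch below shows only that pooling step. The token embeddings are made-up numbers standing in for the output of a long-context embedding model, and the function names are hypothetical.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def late_chunk(token_embeddings, spans):
    """Pool token embeddings over each (start, end) span, end exclusive."""
    return [mean_pool(token_embeddings[start:end]) for start, end in spans]

# Stand-in for per-token output of a long-context embedding model
# run once over the whole document (values are made up).
token_embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
chunk_vectors = late_chunk(token_embeddings, spans=[(0, 2), (2, 4)])
print(chunk_vectors)
```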

### When to Use Hierarchical Retrieval

✅ **Use when:**
- Working with long-form documents
- Need both precision and context
- Complex technical content
- Multi-paragraph reasoning required

❌ **Skip when:**
- Short documents (< 1000 tokens)
- Simple Q&A over structured data
- Memory/storage constraints
- Single-sentence retrieval sufficient

## 11. Further Reading & Resources

### Research Papers
1. **Rethinking Chunk Size** (arXiv:2505.21700)
   - Multi-dataset analysis of optimal chunk sizes
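   - Finds the best size depends on the task: 64-128 tokens for short factual answers, 512-1024 tokens for questions needing broader context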

2. **Late Chunking** (arXiv:2409.04701)
   - Alternative: Contextual chunk embeddings using long-context models

3. **Is Semantic Chunking Worth the Computational Cost?** (arXiv:2410.13070)
   - Critical analysis of chunking strategies

### Implementation Resources
- **LangChain**: Parent Document Retriever
- **LlamaIndex**: Hierarchical Node Parser
- **Hugging Face**: Alibaba-NLP/gte-large-en-v1.5 (8192 token context)

### Related Techniques
- **Sentence Window Retrieval**: Retrieve sentence, return surrounding window
- **Recursive Retrieval**: Multi-level hierarchies
- **Auto-merging Retriever**: Replace retrieved child chunks with their parent when enough children of the same parent are retrieved
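
Sentence Window Retrieval is the closest relative of the parent-child pattern and is easy to sketch: match on a single sentence, then hand the LLM that sentence plus its neighbours. The example below assumes a naive regex sentence splitter and a hit index supplied by some upstream search; both function names are hypothetical.

```python
import re

def split_sentences(text):
    """Naive splitter on ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def sentence_window(sentences, hit_index, window=1):
    """Return the matched sentence plus `window` neighbours on each side."""
    start = max(0, hit_index - window)
    end = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[start:end])

text = "Chunks are embedded. Small chunks match precisely. Context is lost. Windows restore it."
sentences = split_sentences(text)
context = sentence_window(sentences, hit_index=1, window=1)
print(context)
```

Unlike parent chunks, the window is centred on the hit, so the surrounding context moves with each match instead of being fixed at indexing time.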

## Summary

In this demo, we successfully implemented **Hierarchical Retrieval using Parent-Child chunking**:

1. ✅ Demonstrated the fundamental chunking trade-off
2. ✅ Implemented parent-child document structure
3. ✅ Created custom ParentDocumentRetriever
4. ✅ Compared three approaches (small, medium, hierarchical)
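5. ✅ Compared retrieved context and answers side by side; the hierarchical approach is expected to give the best balance, though the ratings in Section 8 are qualitative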

**Key Takeaway**: Hierarchical retrieval solves the precision vs. context trade-off by using small chunks for accurate retrieval and large chunks for comprehensive generation context.