# Demo #4: Hierarchical Retrieval - Parent Document Retriever

## Objective
Demonstrate how hierarchical chunking solves the chunking trade-off by retrieving with small, precise child chunks while generating with larger, context-rich parent chunks.

## Core Concepts
- **Hierarchical chunking strategy**: Multiple chunk sizes for different purposes
- **Parent-Child relationship**: Small chunks for retrieval, large chunks for generation
- **Precision vs Context trade-off**: Getting the best of both worlds

## The Chunking Problem
One of the fundamental challenges in RAG systems is choosing the right chunk size:
- **Small chunks (128-256 tokens)**: Precise retrieval, but too little context for the LLM
- **Large chunks (1024+ tokens)**: Rich context, but imprecise retrieval (the relevant passage is diluted by surrounding text)
- **Medium chunks (512 tokens)**: A compromise that excels at neither

Hierarchical retrieval solves this by using **small chunks for finding** and **large chunks for generating**.
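
The dilution effect behind the large-chunk problem can be seen in a toy example. The sketch below is an illustration only: it swaps in bag-of-words cosine similarity for real embeddings, and the `cosine` helper and the sample strings are hypothetical, not part of this demo's pipeline.

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (a stand-in for embedding similarity)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

query = "advantages of semantic chunking"

# A small chunk that contains only the relevant sentence
small_chunk = "semantic chunking keeps related sentences together"

# A large chunk: the same sentence plus unrelated surrounding text
large_chunk = small_chunk + " rerankers score candidate pairs jointly while vector stores index embeddings for fast lookup"

small_score = cosine(query, small_chunk)
large_score = cosine(query, large_chunk)
print(f"small chunk: {small_score:.3f}, large chunk: {large_score:.3f}")
```

The extra sentences add length without adding query terms, so the large chunk scores lower even though it contains the same relevant sentence.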

## 1. Setup and Dependencies

In [None]:
# Import required libraries
import os
from dotenv import load_dotenv
import warnings
warnings.filterwarnings('ignore')

# LlamaIndex core components
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, StorageContext
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import TextNode
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.storage.docstore import SimpleDocumentStore

# Azure OpenAI
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding

# For visualization
import pandas as pd

print("✓ All libraries imported successfully")

## 2. Configure Azure OpenAI

In [None]:
# Load environment variables
load_dotenv()

# Configure Azure OpenAI LLM
azure_llm = AzureOpenAI(
    model="gpt-4",
    deployment_name=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
    temperature=0.1
)

# Configure Azure OpenAI Embeddings
azure_embed = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name=os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
)

print("✓ Azure OpenAI configured successfully")
print(f"  LLM: {azure_llm.model}")
print(f"  Embedding: {azure_embed.model_name}")  # embedding classes expose the model as `model_name`

## 3. Load Long-Form Documents

We'll use comprehensive technical documents that benefit from hierarchical chunking.

In [None]:
# Load long-form documents
documents = SimpleDirectoryReader(
    input_dir="./data/long_form_docs",
    recursive=True
).load_data()

print(f"✓ Loaded {len(documents)} documents")
for i, doc in enumerate(documents, 1):
    print(f"  {i}. {doc.metadata.get('file_name', 'Unknown')} ({len(doc.text)} characters)")

total_chars = sum(len(doc.text) for doc in documents)
print(f"\nTotal content: {total_chars:,} characters (~{total_chars//4:,} tokens)")

## 4. Baseline Approach #1: Medium-Sized Chunks (512 tokens)

First, let's try the standard compromise - medium-sized chunks.

In [None]:
# Create medium-sized chunks (standard approach)
medium_splitter = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=50
)

medium_nodes = medium_splitter.get_nodes_from_documents(documents)

print(f"✓ Created {len(medium_nodes)} medium-sized chunks (512 tokens)")
print(f"  Average chunk size: {sum(len(node.text) for node in medium_nodes) // len(medium_nodes)} characters")

# Show example chunk
print("\n📄 Example medium chunk:")
print("=" * 80)
print(medium_nodes[min(5, len(medium_nodes) - 1)].text[:300] + "...")  # guard against short corpora
print("=" * 80)

In [None]:
# Build index with medium chunks
medium_index = VectorStoreIndex(
    medium_nodes,
    embed_model=azure_embed,
    show_progress=True
)

medium_query_engine = medium_index.as_query_engine(
    llm=azure_llm,
    similarity_top_k=3
)

print("✓ Medium-chunk query engine ready")

## 5. Baseline Approach #2: Small Chunks (128 tokens)

Now let's try small chunks for precision.

In [None]:
# Create small chunks for precision
small_splitter = SentenceSplitter(
    chunk_size=128,
    chunk_overlap=10
)

small_nodes = small_splitter.get_nodes_from_documents(documents)

print(f"✓ Created {len(small_nodes)} small chunks (128 tokens)")
print(f"  Average chunk size: {sum(len(node.text) for node in small_nodes) // len(small_nodes)} characters")

# Show example chunk
print("\n📄 Example small chunk:")
print("=" * 80)
print(small_nodes[min(20, len(small_nodes) - 1)].text)  # guard against short corpora
print("=" * 80)

In [None]:
# Build index with small chunks
small_index = VectorStoreIndex(
    small_nodes,
    embed_model=azure_embed,
    show_progress=True
)

small_query_engine = small_index.as_query_engine(
    llm=azure_llm,
    similarity_top_k=3
)

print("✓ Small-chunk query engine ready")

## 6. Advanced Approach: Hierarchical Chunking (Parent-Child)

Now let's implement the hierarchical approach:
- **Parent chunks**: 1024 tokens (rich context for generation)
- **Child chunks**: 256 tokens (precise retrieval)
- **Strategy**: Search children, return parents
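
Before the LlamaIndex version, the parent-child structure can be sketched in plain Python. This is a simplified illustration under stated assumptions: the hypothetical `split_words` helper splits on whitespace with a fixed stride (not on sentence boundaries as `SentenceSplitter` does), and the sizes are scaled down to 40-word parents and 10-word children.

```python
def split_words(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` words that overlap by `overlap` words."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    starts = range(0, max(len(words) - overlap, 1), step)
    return [" ".join(words[i:i + size]) for i in starts]

# Hypothetical 100-word document (w0 ... w99)
text = " ".join(f"w{i}" for i in range(100))

# Parents: large windows, keyed by an ID
parents = {
    f"parent-{i}": chunk
    for i, chunk in enumerate(split_words(text, size=40, overlap=10))
}

# Children: small windows cut from each parent, each remembering its parent ID
children = [
    {"text": child_text, "parent_id": parent_id}
    for parent_id, parent_text in parents.items()
    for child_text in split_words(parent_text, size=10, overlap=2)
]

print(f"{len(parents)} parents, {len(children)} children")
```

Only the children would be embedded and searched; the `parent_id` on each child is the link used to fetch the larger chunk at generation time.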

In [None]:
# Create parent chunks (large, context-rich)
parent_splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=100
)

parent_nodes = parent_splitter.get_nodes_from_documents(documents)

print(f"✓ Created {len(parent_nodes)} parent chunks (1024 tokens)")
print(f"  Average parent size: {sum(len(node.text) for node in parent_nodes) // len(parent_nodes)} characters")

In [None]:
# Create child chunks from each parent
child_splitter = SentenceSplitter(
    chunk_size=256,
    chunk_overlap=25
)

# Store parent nodes in document store
docstore = SimpleDocumentStore()

# Create child nodes and link to parents
all_child_nodes = []

for parent_node in parent_nodes:
    # Store parent in docstore
    docstore.add_documents([parent_node])
    
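    # Create children from the parent node itself; node parsers accept any
    # BaseNode, so no conversion to a Document is needed
    child_nodes = child_splitter.get_nodes_from_documents([parent_node])
    
    # Link each child to its parent
    for child_node in child_nodes:
        child_node.metadata["parent_id"] = parent_node.node_id
        child_node.metadata["file_name"] = parent_node.metadata.get("file_name", "Unknown")
        # Keep the bookkeeping ID out of the text that gets embedded
        child_node.excluded_embed_metadata_keys = [
            *child_node.excluded_embed_metadata_keys, "parent_id"
        ]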
    
    all_child_nodes.extend(child_nodes)

print(f"✓ Created {len(all_child_nodes)} child chunks (256 tokens)")
print(f"  Average child size: {sum(len(node.text) for node in all_child_nodes) // len(all_child_nodes)} characters")
print(f"  Ratio: {len(all_child_nodes) / len(parent_nodes):.1f} children per parent")

### Create Custom Parent Document Retriever

This retriever:
1. Searches over child embeddings (precise)
2. Maps children to parent IDs
3. Fetches parent nodes from docstore
4. Returns parent nodes to LLM (context-rich)
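
Steps 2 to 4 above can be sketched without any library. This is a minimal sketch, assuming child hits arrive as `(parent_id, score)` pairs and parents live in a plain dict; the `children_to_parents` helper and the sample data are hypothetical stand-ins for the retriever and docstore used in the next cell.

```python
def children_to_parents(child_hits, parent_store):
    """Collapse child hits into unique parents, ranked by their best child score."""
    best = {}
    for parent_id, score in child_hits:
        # Skip children whose parent is missing from the store
        if parent_id not in parent_store:
            continue
        # Keep the highest child score seen for each parent
        if score > best.get(parent_id, float("-inf")):
            best[parent_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [(parent_id, parent_store[parent_id], score) for parent_id, score in ranked]

# Three child hits, two of which share a parent
child_hits = [("abc123", 0.89), ("abc123", 0.85), ("def456", 0.82)]
parent_store = {"abc123": "parent text A", "def456": "parent text B"}

results = children_to_parents(child_hits, parent_store)
for parent_id, parent_text, score in results:
    print(parent_id, score, parent_text)
```

Taking the maximum child score is one design choice among several; summing or averaging child scores would instead favor parents matched by many children.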

In [None]:
from typing import List, Optional
from llama_index.core.schema import NodeWithScore, QueryBundle

class ParentDocumentRetriever(BaseRetriever):
    """Retriever that searches child chunks but returns parent chunks."""
    
    def __init__(
        self,
        child_index: VectorStoreIndex,
        docstore: SimpleDocumentStore,
        similarity_top_k: int = 3,
    ):
        self._child_retriever = child_index.as_retriever(
            similarity_top_k=similarity_top_k
        )
        self._docstore = docstore
        super().__init__()
    
    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve parent nodes based on child node matches."""
        # Step 1: Retrieve child nodes
        child_nodes_with_scores = self._child_retriever.retrieve(query_bundle)
        
        # Step 2: Get unique parent IDs
        parent_ids = set()
        parent_scores = {}  # Track best score for each parent
        
        for node_with_score in child_nodes_with_scores:
            parent_id = node_with_score.node.metadata.get("parent_id")
            if parent_id:
                parent_ids.add(parent_id)
                # Keep highest child score for each parent (score may be None)
                score = node_with_score.score or 0.0
                if score > parent_scores.get(parent_id, float("-inf")):
                    parent_scores[parent_id] = score
        
        # Step 3: Fetch parent nodes from docstore
        parent_nodes_with_scores = []
        for parent_id in parent_ids:
            # raise_error=False returns None for a missing ID instead of raising
            parent_node = self._docstore.get_document(parent_id, raise_error=False)
            if parent_node:
                parent_nodes_with_scores.append(
                    NodeWithScore(
                        node=parent_node,
                        score=parent_scores[parent_id]
                    )
                )
        
        # Sort by score
        parent_nodes_with_scores.sort(key=lambda x: x.score, reverse=True)
        
        return parent_nodes_with_scores

print("✓ Custom ParentDocumentRetriever defined")

In [None]:
# Build child index (only child nodes are embedded and indexed)
child_index = VectorStoreIndex(
    all_child_nodes,
    embed_model=azure_embed,
    show_progress=True
)

print("✓ Child index created")

In [None]:
# Create parent document retriever
parent_retriever = ParentDocumentRetriever(
    child_index=child_index,
    docstore=docstore,
    similarity_top_k=3
)

# Build query engine with parent retriever
hierarchical_query_engine = RetrieverQueryEngine(
    retriever=parent_retriever,
    llm=azure_llm
)

print("✓ Hierarchical query engine ready")

## 7. Comparative Evaluation

Let's test all three approaches with queries that require both precision and context.

In [None]:
# Test queries that benefit from hierarchical retrieval
test_queries = [
    "What are the advantages and disadvantages of semantic chunking compared to fixed-size chunking?",
    "Explain the differences between bi-encoders and cross-encoders in retrieval systems.",
    "How does hierarchical chunking solve the precision-context trade-off?"
]

print("Test Queries:")
for i, q in enumerate(test_queries, 1):
    print(f"{i}. {q}")

### Test Query 1: Semantic vs Fixed-Size Chunking

In [None]:
query = test_queries[0]
print(f"\n🔍 Query: {query}")
print("=" * 100)

In [None]:
# Medium chunks (512 tokens)
print("\n📊 APPROACH 1: Medium Chunks (512 tokens)")
print("=" * 100)

medium_response = medium_query_engine.query(query)

print("\n📄 Retrieved Chunks:")
for i, node in enumerate(medium_response.source_nodes, 1):
    print(f"\nChunk {i} (score: {node.score:.4f}):")
    print(f"Source: {node.node.metadata.get('file_name', 'Unknown')}")
    print(f"Length: {len(node.node.text)} characters")
    print(f"Preview: {node.node.text[:200]}...")

print("\n💡 Generated Answer:")
print("-" * 100)
print(medium_response.response)
print("-" * 100)

In [None]:
# Small chunks (128 tokens)
print("\n📊 APPROACH 2: Small Chunks (128 tokens)")
print("=" * 100)

small_response = small_query_engine.query(query)

print("\n📄 Retrieved Chunks:")
for i, node in enumerate(small_response.source_nodes, 1):
    print(f"\nChunk {i} (score: {node.score:.4f}):")
    print(f"Source: {node.node.metadata.get('file_name', 'Unknown')}")
    print(f"Length: {len(node.node.text)} characters")
    print(f"Content: {node.node.text}")

print("\n💡 Generated Answer:")
print("-" * 100)
print(small_response.response)
print("-" * 100)

In [None]:
# Hierarchical (256 token children -> 1024 token parents)
print("\n📊 APPROACH 3: Hierarchical Chunking (Child: 256 → Parent: 1024)")
print("=" * 100)

hierarchical_response = hierarchical_query_engine.query(query)

print("\n📄 Retrieved Parent Chunks:")
for i, node in enumerate(hierarchical_response.source_nodes, 1):
    print(f"\nParent Chunk {i} (score: {node.score:.4f}):")
    print(f"Source: {node.node.metadata.get('file_name', 'Unknown')}")
    print(f"Length: {len(node.node.text)} characters")
    print(f"Preview: {node.node.text[:300]}...")

print("\n💡 Generated Answer:")
print("-" * 100)
print(hierarchical_response.response)
print("-" * 100)

### Analysis: Query 1 Results

In [None]:
# Compare the approaches
def summarize(label, response):
    """Summarize retrieved context and answer length for one approach."""
    sizes = [len(n.node.text) for n in response.source_nodes]
    return {
        "Approach": label,
        # max(..., 1) avoids a ZeroDivisionError when nothing was retrieved
        "Avg Chunk Size": f"{sum(sizes) // max(len(sizes), 1)} chars",
        "Total Context": f"{sum(sizes)} chars",
        "Answer Length": f"{len(response.response)} chars"
    }

comparison_data = [
    summarize("Medium Chunks (512)", medium_response),
    summarize("Small Chunks (128)", small_response),
    summarize("Hierarchical (256→1024)", hierarchical_response)
]

df = pd.DataFrame(comparison_data)
print("\n📊 Comparison Table:")
print(df.to_string(index=False))

### Test Query 2: Bi-encoders vs Cross-encoders

In [None]:
query = test_queries[1]
print(f"\n🔍 Query: {query}")
print("=" * 100)

# Test all three approaches
medium_response_2 = medium_query_engine.query(query)
small_response_2 = small_query_engine.query(query)
hierarchical_response_2 = hierarchical_query_engine.query(query)

In [None]:
print("\n💡 MEDIUM CHUNKS Answer:")
print("-" * 100)
print(medium_response_2.response)
print(f"\nContext provided: {sum(len(n.node.text) for n in medium_response_2.source_nodes)} characters")

print("\n\n💡 SMALL CHUNKS Answer:")
print("-" * 100)
print(small_response_2.response)
print(f"\nContext provided: {sum(len(n.node.text) for n in small_response_2.source_nodes)} characters")

print("\n\n💡 HIERARCHICAL Answer:")
print("-" * 100)
print(hierarchical_response_2.response)
print(f"\nContext provided: {sum(len(n.node.text) for n in hierarchical_response_2.source_nodes)} characters")

## 8. Visualizing the Hierarchical Retrieval Process

In [None]:
print("📊 DATA FLOW: Hierarchical Retrieval Process")
print("=" * 100)
print("""\n
1. USER QUERY
   ↓
   'What are the advantages of semantic chunking?'
   ↓

2. EMBED QUERY (Azure OpenAI)
   ↓
   Query Embedding: [0.123, -0.456, 0.789, ...] (1536 dimensions)
   ↓

3. SEARCH CHILD EMBEDDINGS (Precise)
   ↓
   Vector similarity search over small chunks (256 tokens)
   ↓
   Top-K Child Chunks Retrieved:
   - Child 1 (score: 0.89) → Parent ID: abc123
   - Child 2 (score: 0.85) → Parent ID: abc123
   - Child 3 (score: 0.82) → Parent ID: def456
   ↓

4. MAP CHILDREN → PARENTS
   ↓
   Unique Parent IDs: {abc123, def456}
   ↓

5. FETCH PARENT CHUNKS (Rich Context)
   ↓
   Document Store Lookup:
   - Parent abc123: 1024 tokens (comprehensive explanation)
   - Parent def456: 1024 tokens (related concepts)
   ↓

6. PASS TO LLM (Context-Rich Generation)
   ↓
   LLM receives large parent chunks (not small children)
   ↓

7. GENERATE ANSWER
   ↓
   Comprehensive, well-contextualized response

""")
print("=" * 100)

print("\n✨ KEY INSIGHT:")
print("  - RETRIEVAL uses SMALL chunks (precise targeting)")
print("  - GENERATION uses LARGE chunks (rich context)")
print("  - Best of both worlds!")

## 9. Key Findings and Best Practices

### The Chunking Trade-off

| Approach | Retrieval Precision | Context for LLM | Best Use Case |
|----------|-------------------|-----------------|---------------|
| **Small Chunks** | ⭐⭐⭐⭐⭐ High | ⭐⭐ Limited | Factual lookups, specific data extraction |
| **Medium Chunks** | ⭐⭐⭐ Moderate | ⭐⭐⭐ Moderate | General-purpose RAG (compromise) |
| **Large Chunks** | ⭐⭐ Low | ⭐⭐⭐⭐⭐ Rich | When you know what section to retrieve |
| **Hierarchical** | ⭐⭐⭐⭐⭐ High | ⭐⭐⭐⭐⭐ Rich | Complex queries requiring comprehension |

### When to Use Hierarchical Chunking

✅ **Ideal Scenarios:**
- Long-form documents (whitepapers, articles, documentation)
- Queries requiring deep understanding and context
- When answer quality is more important than latency
- Complex analytical questions

❌ **When to Avoid:**
- Simple factual queries ("What is X?")
- Very short documents
- When minimizing token usage is critical
- Real-time, high-throughput applications (extra complexity)

### Implementation Considerations

1. **Storage**: Need to store both child embeddings and parent documents
2. **Complexity**: More moving parts than simple chunking
3. **Cost**: Larger chunks → more tokens → higher LLM costs
4. **Latency**: The docstore lookup for parent chunks adds minimal overhead; the longer prompt is the more likely source of added generation time
5. **Maintenance**: Need to keep parent-child mappings consistent
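
The cost point can be made concrete with a rough upper bound. This sketch assumes `similarity_top_k=3` as used in this demo and that every retrieved chunk fills its configured chunk size; the `max_context_tokens` helper is hypothetical. Real counts will be lower when chunks are shorter or when several children share one parent.

```python
def max_context_tokens(chunk_tokens: int, top_k: int = 3) -> int:
    """Upper bound on retrieved-context tokens sent to the LLM per query."""
    return chunk_tokens * top_k

# Chunk sizes (in tokens) of what each approach passes to the LLM
approaches = {
    "Small chunks": 128,
    "Medium chunks": 512,
    "Hierarchical parents": 1024,
}

budget = {name: max_context_tokens(size) for name, size in approaches.items()}
for name, tokens in budget.items():
    print(f"{name}: up to {tokens} context tokens per query")
```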

### Optimization Tips

- **Parent size**: 1024-2048 tokens (balance context and cost)
- **Child size**: 128-256 tokens (precise but not too fragmented)
- **Overlap**: Use overlap in parent chunks to avoid boundary issues
- **De-duplication**: If multiple children from the same parent are retrieved, return the parent only once
- **Metadata**: Store useful metadata in child chunks for filtering
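
When trying different parent and child sizes, it helps to estimate how many children each parent will produce, since that drives the number of embeddings to store. The sketch below assumes fixed-stride sliding windows; `SentenceSplitter` respects sentence boundaries, so actual counts will differ somewhat. The `children_per_parent` helper is hypothetical.

```python
import math

def children_per_parent(parent_size: int, child_size: int, child_overlap: int) -> int:
    """Estimate the child windows needed to cover one parent with a fixed stride."""
    if child_size <= child_overlap:
        raise ValueError("child_size must exceed child_overlap")
    stride = child_size - child_overlap
    return max(1, math.ceil((parent_size - child_overlap) / stride))

# (parent tokens, child tokens, child overlap) combinations to compare
configs = [(1024, 256, 25), (1024, 128, 10), (2048, 256, 25)]
for parent_size, child_size, overlap in configs:
    count = children_per_parent(parent_size, child_size, overlap)
    print(f"parent={parent_size}, child={child_size}, overlap={overlap}: ~{count} children")
```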

## 10. Conclusion

Hierarchical chunking elegantly solves the fundamental chunking dilemma in RAG systems:

🎯 **The Problem**: Small chunks provide precise retrieval but lack context for generation. Large chunks provide context but dilute retrieval precision.

✨ **The Solution**: Use small child chunks for retrieval (precision) and large parent chunks for generation (context).

📈 **The Result**: Improved answer quality, especially for complex queries requiring deep understanding.

This technique is particularly powerful for:
- Technical documentation
- Research papers and whitepapers  
- Legal and regulatory documents
- Educational content
- Any long-form content where context matters

**Next Steps**: Experiment with different parent/child size ratios for your specific use case!