# Demo #4: Hierarchical Retrieval with Sentence Window

## Overview

This demo shows **Hierarchical Retrieval** using the **Sentence Window** technique, which addresses a fundamental challenge in RAG systems: the trade-off between retrieval precision and context sufficiency.

### The Problem

Traditional RAG systems face a dilemma:
- **Small chunks**: Provide precise retrieval (easier to match user queries) but lack sufficient context for the LLM to generate comprehensive answers
- **Large chunks**: Provide rich context but dilute relevance signals, causing less relevant information to be retrieved

### The Solution: Sentence Window Retrieval

Sentence Window Retrieval solves this by **separating retrieval granularity from generation context**:
1. **Index small units**: Embed and index individual sentences or small text segments for precise retrieval
2. **Return expanded context**: When a sentence is retrieved, provide the surrounding "window" of sentences to the LLM

This gives us the best of both worlds: precise retrieval matching with sufficient context for generation.
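
The two steps above can be sketched in plain Python, without LlamaIndex. This is a toy illustration only: the regex sentence splitter, the Jaccard word-overlap score (standing in for embedding similarity), and the names `split_sentences`, `build_index`, and `retrieve` are assumptions made for this sketch, not part of the demo.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric word set used for the toy similarity score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(text: str, window_size: int = 1) -> list[dict]:
    """Step 1: one entry per sentence, each carrying its surrounding window."""
    sentences = split_sentences(text)
    index = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        index.append({"sentence": sentence, "window": " ".join(sentences[lo:hi])})
    return index


def retrieve(index: list[dict], query: str) -> dict:
    """Step 2: score on the single sentence, return the entry with its window."""
    q = tokens(query)

    def score(entry: dict) -> float:
        s = tokens(entry["sentence"])
        union = q | s
        return len(q & s) / len(union) if union else 0.0

    return max(index, key=score)


doc = (
    "RAG combines retrieval with generation. Small chunks match queries precisely. "
    "Large chunks carry more context. Sentence windows try to offer both."
)
index = build_index(doc, window_size=1)
hit = retrieve(index, "Which chunks match queries precisely?")
print("Matched sentence:", hit["sentence"])
print("Context for LLM: ", hit["window"])
```

Matching happens on the single sentence, while the text handed to the generator is the wider window.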

### Core Concepts Demonstrated
- Hierarchical retrieval strategies
- Sentence Window Retrieval
- Separation of retrieval granularity from generation context
- Lost-in-the-middle problem mitigation

### References
- Advanced Retrieval-Augmented Generation: From Theory to LlamaIndex Implementation (Reference 37)
- Develop a RAG Solution - Chunking Phase - Azure Architecture (Reference 19)

## 1. Environment Setup and Imports

In [1]:
# Core imports
import os
import sys
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# LlamaIndex imports
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Settings,
)
from llama_index.core.node_parser import (
    SentenceWindowNodeParser,
    SentenceSplitter,
)
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding

# Visualization
import pandas as pd
from IPython.display import display, Markdown, HTML

# Utilities
from dotenv import load_dotenv

load_dotenv()

print("✓ All imports successful")

✓ All imports successful


## 2. Azure OpenAI Configuration

Configure Azure OpenAI for both LLM and embeddings.

In [2]:
# Azure OpenAI configuration from environment variables
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_API_VERSION = os.getenv("AZURE_OPENAI_API_VERSION", "2024-02-15-preview")
AZURE_OPENAI_DEPLOYMENT = os.getenv("AZURE_OPENAI_DEPLOYMENT", "gpt-4")
AZURE_OPENAI_EMBEDDING_DEPLOYMENT = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT", "text-embedding-ada-002")

# Validate configuration
if not all([AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT]):
    raise ValueError(
        "Missing Azure OpenAI configuration. Please set:\n"
        "- AZURE_OPENAI_API_KEY\n"
        "- AZURE_OPENAI_ENDPOINT\n"
        "- AZURE_OPENAI_DEPLOYMENT (optional, default: gpt-4)\n"
        "- AZURE_OPENAI_EMBEDDING_DEPLOYMENT (optional, default: text-embedding-ada-002)"
    )

# Initialize Azure OpenAI LLM
llm = AzureOpenAI(
    model="gpt-4",
    deployment_name=AZURE_OPENAI_DEPLOYMENT,
    api_key=AZURE_OPENAI_API_KEY,
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_version=AZURE_OPENAI_API_VERSION,
    temperature=0.1,
)

# Initialize Azure OpenAI Embeddings
embed_model = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name=AZURE_OPENAI_EMBEDDING_DEPLOYMENT,
    api_key=AZURE_OPENAI_API_KEY,
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_version=AZURE_OPENAI_API_VERSION,
)

# Configure global settings
Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 512
Settings.chunk_overlap = 50

print("✓ Azure OpenAI configured successfully")
print(f"  LLM Deployment: {AZURE_OPENAI_DEPLOYMENT}")
print(f"  Embedding Deployment: {AZURE_OPENAI_EMBEDDING_DEPLOYMENT}")

✓ Azure OpenAI configured successfully
  LLM Deployment: gpt-4
  Embedding Deployment: text-embedding-ada-002
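
The validation above lists every variable name even when only one is missing. A small variation that reports just the unset ones might look like the following; `missing_env_vars` is a hypothetical helper, not part of the notebook, and it accepts an explicit mapping so it can be tried without real credentials.

```python
import os

# The two variables the notebook treats as mandatory.
REQUIRED_VARS = ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT")


def missing_env_vars(required=REQUIRED_VARS, environ=None) -> list[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


# Example with a placeholder mapping instead of the real environment.
example_env = {"AZURE_OPENAI_API_KEY": "placeholder-key"}
print(missing_env_vars(environ=example_env))
```

The returned list can be joined into the `ValueError` message so the error names only what actually needs fixing.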


## 3. Data Preparation

Load long-form documents from `data/long_form_docs/`. These documents are 1000+ words and ideal for demonstrating the benefits of sentence window retrieval.

In [3]:
# Define data directory
data_dir = Path("./data/long_form_docs")

# Load documents
print("Loading documents...")
documents = SimpleDirectoryReader(str(data_dir)).load_data()

print(f"\n✓ Loaded {len(documents)} documents")
for i, doc in enumerate(documents, 1):
    print(f"  {i}. {Path(doc.metadata.get('file_name', 'unknown')).name} ({len(doc.text)} chars)")

Loading documents...

✓ Loaded 3 documents
  1. advanced_chunking_strategies.md (13411 chars)
  2. embedding_models_deep_dive.md (13849 chars)
  3. rag_comprehensive_guide.md (16903 chars)
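
To check that inputs meet the 1000+ word expectation before indexing, a quick standard-library pass over the directory could look like this. The helper names `word_counts` and `short_documents` and the `*.md` glob are assumptions for this sketch; the demo itself relies on `SimpleDirectoryReader`.

```python
from pathlib import Path


def word_counts(directory, pattern="*.md") -> dict[str, int]:
    """Map each matching file name to its whitespace-delimited word count."""
    return {
        path.name: len(path.read_text(encoding="utf-8").split())
        for path in sorted(Path(directory).glob(pattern))
    }


def short_documents(counts: dict[str, int], minimum: int = 1000) -> list[str]:
    """Names of documents that fall below the minimum word count."""
    return [name for name, n in counts.items() if n < minimum]
```

For example, `short_documents(word_counts("./data/long_form_docs"))` would return an empty list if every file is long enough.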


## 4. Baseline: Standard Chunking Strategy

First, let's create a baseline RAG system with standard fixed-size chunking to compare against.

In [4]:
# Create standard sentence splitter
baseline_parser = SentenceSplitter(
    chunk_size=512,
    chunk_overlap=50,
)

# Parse documents into nodes
baseline_nodes = baseline_parser.get_nodes_from_documents(documents)

print(f"✓ Created {len(baseline_nodes)} baseline chunks")
print(f"\nExample baseline chunk (first 300 chars):")
print(f"{baseline_nodes[10].text[:300]}...")

✓ Created 19 baseline chunks

Example baseline chunk (first 300 chars):
Modern vector databases can search millions of embeddings in milliseconds, making bi-encoders practical for large-scale retrieval.

However, bi-encoders have a fundamental limitation: they can't model interactions between query and document text. The query "capital of France" and the document "Paris...
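
For intuition about what `chunk_size` and `chunk_overlap` control, here is a simplified fixed-size chunker. It is a sketch only: it counts whitespace-separated words instead of model tokens and ignores sentence boundaries, both of which `SentenceSplitter` takes into account. `chunk_words` is a hypothetical name.

```python
def chunk_words(text: str, chunk_size: int = 8, chunk_overlap: int = 2) -> list[str]:
    """Split text into word chunks of at most chunk_size, sharing chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(20))
for chunk in chunk_words(sample):
    print(chunk)
```

Each chunk repeats the last two words of the previous one, which is the role `chunk_overlap` plays in the baseline parser.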


In [5]:
# Build baseline index
print("Building baseline vector index...")
baseline_index = VectorStoreIndex(baseline_nodes)

# Create baseline query engine
baseline_query_engine = baseline_index.as_query_engine(
    similarity_top_k=3,
    response_mode="compact",
)

print("✓ Baseline RAG system ready")

Building baseline vector index...




✓ Baseline RAG system ready


## 5. Advanced: Sentence Window Retrieval

Now let's implement the Sentence Window approach:
1. Parse documents into individual sentences
2. Store metadata about surrounding sentences (the "window")
3. Retrieve sentences precisely, then expand to include window context

In [6]:
# Create Sentence Window Node Parser
# window_size=3 means: retrieve 1 sentence, but include 3 sentences before and 3 after
sentence_window_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # Number of sentences before and after to include
    window_metadata_key="window",  # Metadata key for expanded window
    original_text_metadata_key="original_sentence",  # Key for original sentence
)

# Parse documents into sentence nodes with window metadata
sentence_nodes = sentence_window_parser.get_nodes_from_documents(documents)

print(f"✓ Created {len(sentence_nodes)} sentence nodes with window metadata")
print(f"\nExample sentence node:")
example_node = sentence_nodes[50]
print(f"Original sentence: {example_node.text[:200]}...")
print(f"\nWindow context (includes surrounding sentences): {example_node.metadata.get('window', '')[:300]}...")

✓ Created 344 sentence nodes with window metadata

Example sentence node:
Original sentence: However, implementation is more complex, processing is slower, and chunk sizes become variable, requiring careful handling downstream.

...

Window context (includes surrounding sentences): By prompting a language model to identify logical breakpoints in a document, systems can achieve human-like understanding of where natural divisions occur.  The LLM might be asked to segment a long document into sections that each cover a distinct topic, or to identify the minimal units of informati...
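
A window holds up to `2 * window_size + 1` sentences, but sentences near the start or end of a document have fewer neighbours. The arithmetic below assumes windows are simply clipped at document boundaries; `window_sentence_count` is a hypothetical helper for illustration.

```python
def window_sentence_count(index: int, total: int, window_size: int = 3) -> int:
    """Number of sentences in the window around position `index` (0-based)."""
    if not 0 <= index < total:
        raise IndexError("index out of range")
    before = min(index, window_size)              # neighbours available to the left
    after = min(total - 1 - index, window_size)   # neighbours available to the right
    return before + 1 + after


total = 10
print([window_sentence_count(i, total) for i in range(total)])
# [4, 5, 6, 7, 7, 7, 7, 6, 5, 4]
```

Only sentences at least `window_size` positions away from both ends get the full seven-sentence window.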


In [7]:
# Build sentence window index
print("Building sentence window vector index...")
sentence_window_index = VectorStoreIndex(sentence_nodes)

# Create Metadata Replacement Post-Processor
# This replaces the retrieved sentence with its expanded window context before sending to LLM
metadata_replacement_postprocessor = MetadataReplacementPostProcessor(
    target_metadata_key="window",
)

# Create sentence window query engine with post-processor
sentence_window_query_engine = sentence_window_index.as_query_engine(
    similarity_top_k=3,
    response_mode="compact",
    node_postprocessors=[metadata_replacement_postprocessor],
)

print("✓ Sentence Window RAG system ready")
print("  - Retrieval: Precise sentence-level matching")
print("  - Generation: Expanded 7-sentence context window (3 before + 1 target + 3 after)")

Building sentence window vector index...



✓ Sentence Window RAG system ready
  - Retrieval: Precise sentence-level matching
  - Generation: Expanded 7-sentence context window (3 before + 1 target + 3 after)
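
Conceptually, the post-processing step swaps each retrieved node's text for the value stored under the target metadata key while keeping the retrieval score. The sketch below mimics that behaviour with a stand-in `Node` dataclass and a hypothetical `replace_with_metadata` function. It is not LlamaIndex's implementation, and falling back to the original text when the key is missing is an assumption of this sketch.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Node:
    """Stand-in for a scored retrieval result (not a LlamaIndex class)."""
    text: str
    score: float
    metadata: dict = field(default_factory=dict)


def replace_with_metadata(nodes: list[Node], target_metadata_key: str = "window") -> list[Node]:
    """Return copies whose text is the metadata value, if present."""
    return [
        replace(node, text=node.metadata.get(target_metadata_key, node.text))
        for node in nodes
    ]


retrieved = [
    Node("B.", 0.91, {"window": "A. B. C.", "original_sentence": "B."}),
    Node("Z.", 0.42),  # no window stored, so the text is left unchanged
]
for node in replace_with_metadata(retrieved):
    print(f"{node.score:.2f}  {node.text}")
```

The score still reflects the sentence-level match; only the text passed on to generation changes.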


## 6. Comparative Evaluation

Let's test both systems with queries that benefit from the sentence window approach.

In [8]:
# Define test queries
test_queries = [
    "What are the main limitations of pure LLM approaches that RAG addresses?",
    "Explain the trade-off between chunk size and retrieval precision in RAG systems.",
    "How do embedding models capture semantic meaning in text?",
]

print(f"Testing with {len(test_queries)} queries...\n")

Testing with 3 queries...



### Query 1: LLM Limitations

In [9]:
query = test_queries[0]
print(f"Query: {query}\n")

# Baseline retrieval
print("="*80)
print("BASELINE RETRIEVAL (Standard Chunking)")
print("="*80)
baseline_response = baseline_query_engine.query(query)
print(f"\nAnswer:\n{baseline_response.response}\n")
print(f"\nRetrieved Chunks ({len(baseline_response.source_nodes)}):")
for i, node in enumerate(baseline_response.source_nodes, 1):
    print(f"\n--- Chunk {i} (Score: {node.score:.4f}) ---")
    print(f"{node.text[:300]}...")

# Sentence window retrieval
print("\n" + "="*80)
print("SENTENCE WINDOW RETRIEVAL")
print("="*80)
window_response = sentence_window_query_engine.query(query)
print(f"\nAnswer:\n{window_response.response}\n")
print(f"\nRetrieved Contexts ({len(window_response.source_nodes)}):")
for i, node in enumerate(window_response.source_nodes, 1):
    print(f"\n--- Context Window {i} (Score: {node.score:.4f}) ---")
    print(f"Original Sentence: {node.metadata.get('original_sentence', node.text)[:200]}...")
    print(f"\nExpanded Window Context: {node.text[:400]}...")



Query: What are the main limitations of pure LLM approaches that RAG addresses?

BASELINE RETRIEVAL (Standard Chunking)





Answer:
The main limitations of pure LLM approaches that RAG addresses include:

1. **Knowledge Cutoff**: LLMs cannot access information published after their training data cutoff, making them unable to provide up-to-date responses.
2. **Hallucination**: LLMs may generate plausible-sounding but factually incorrect or fabricated information.
3. **Expensive Updates**: Updating an LLM's knowledge requires costly retraining, which is impractical for rapidly changing domains.
4. **Lack of Transparency**: LLMs do not provide clear sources for their information, making it difficult to verify or audit their outputs. 

RAG mitigates these issues by incorporating a retrieval step to dynamically fetch relevant and factual information from external sources before generating responses.


Retrieved Chunks (3):

--- Chunk 1 (Score: 0.8205) ---
# Comprehensive Guide to Retrieval-Augmented Generation (RAG)

## Introduction to RAG

Retrieval-Augmented Generation (RAG) is a paradigm that combines the st

SENTENCE WINDOW RETRIEVAL



Answer:
The main limitations of pure LLM approaches that RAG addresses are:  
1. LLMs have a knowledge cutoff date, meaning they cannot access information published after their training concluded.  
2. LLMs may hallucinate or generate plausible-sounding but factually incorrect content, even within their training data.  
3. Updating an LLM's knowledge requires expensive retraining, making it impractical for domains that evolve rapidly.  
4. LLMs lack transparency about their information sources, making it difficult to verify or audit their outputs.  


Retrieved Contexts (3):

--- Context Window 1 (Score: 0.8873) ---
Original Sentence: The fundamental motivation behind RAG stems from several limitations of pure LLM approaches. ...

Expanded Window Context: # Comprehensive Guide to Retrieval-Augmented Generation (RAG)

## Introduction to RAG

Retrieval-Augmented Generation (RAG) is a paradigm that combines the strengths of large language models with external knowledge retrieval to gener

### Query 2: Chunking Trade-off

In [10]:
query = test_queries[1]
print(f"Query: {query}\n")

# Baseline
print("="*80)
print("BASELINE RETRIEVAL")
print("="*80)
baseline_response = baseline_query_engine.query(query)
print(f"\nAnswer:\n{baseline_response.response}\n")

# Sentence Window
print("\n" + "="*80)
print("SENTENCE WINDOW RETRIEVAL")
print("="*80)
window_response = sentence_window_query_engine.query(query)
print(f"\nAnswer:\n{window_response.response}\n")



Query: Explain the trade-off between chunk size and retrieval precision in RAG systems.

BASELINE RETRIEVAL





Answer:
In Retrieval-Augmented Generation (RAG) systems, the trade-off between chunk size and retrieval precision revolves around balancing the need for precise information retrieval with the need for sufficient context during generation.

Smaller chunks enhance retrieval precision by reducing irrelevant content in the results. They are more semantically focused, which minimizes noise and ensures that the retrieved information is highly relevant to the query. This is particularly beneficial when working with limited context windows or when token processing costs are high.

However, smaller chunks can lack the broader context needed for effective generation. Language models perform better when they have access to richer surrounding information, which helps them interpret and generate responses more accurately. Larger chunks provide this richer context, offering narrative flow and background details that improve the quality of the generated output. Additionally, larger chunks reduce the

SENTENCE WINDOW RETRIEVAL



Answer:
In RAG systems, the trade-off between chunk size and retrieval precision revolves around balancing the granularity of information retrieval and the context provided for generation. Smaller chunks allow for more precise retrieval, as each chunk can be narrowly focused and independently scored for relevance. This makes it easier to locate specific information. However, smaller chunks may lack sufficient context, which can negatively impact the quality of the generated responses. On the other hand, larger chunks provide richer context, which can improve the quality of the generated answers by giving the model more comprehensive information. However, larger chunks may dilute relevance signals, making it harder to retrieve the most pertinent information. This trade-off is a fundamental challenge in designing effective chunking strategies for RAG systems.



### Query 3: Embedding Models

In [11]:
query = test_queries[2]
print(f"Query: {query}\n")

# Baseline
print("="*80)
print("BASELINE RETRIEVAL")
print("="*80)
baseline_response = baseline_query_engine.query(query)
print(f"\nAnswer:\n{baseline_response.response}\n")

# Sentence Window
print("\n" + "="*80)
print("SENTENCE WINDOW RETRIEVAL")
print("="*80)
window_response = sentence_window_query_engine.query(query)
print(f"\nAnswer:\n{window_response.response}\n")



Query: How do embedding models capture semantic meaning in text?

BASELINE RETRIEVAL





Answer:
Embedding models capture semantic meaning in text by representing it as dense vectors in a high-dimensional space. These vectors encode abstract semantic features, allowing text with similar meanings to produce similar embeddings. Modern embedding models, such as those based on transformers, process entire sentences or passages, considering word relationships and context. This enables them to capture not only individual word meanings but also compositional semantics, reflecting how words combine to convey meaning. The training process positions semantically related texts close together in the embedding space, optimizing for tasks like semantic similarity and retrieval.


SENTENCE WINDOW RETRIEVAL





Answer:
Embedding models capture semantic meaning in text by mapping variable-length text sequences to fixed-dimensional dense vectors in a high-dimensional space. These vectors are designed so that semantically similar texts are positioned close to each other in the embedding space, while unrelated texts are placed farther apart. This is achieved through training neural networks to optimize the positioning of texts based on their semantic relationships. The learned representations capture not only individual word meanings but also compositional semantics, reflecting how words combine to create meaning. This enables embedding models to represent abstract semantic features effectively.
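
The three query cells repeat the same pattern. A small helper can run any number of engines over a list of queries and collect the answers side by side. This sketch takes plain callables so it stays independent of LlamaIndex; `compare_engines` is a hypothetical name, and the stand-in engines below only echo the query.

```python
from typing import Callable


def compare_engines(engines: dict[str, Callable[[str], str]], queries: list[str]) -> list[dict]:
    """Run every query through every engine; one result row per query."""
    rows = []
    for query in queries:
        row = {"query": query}
        for label, ask in engines.items():
            row[label] = ask(query)
        rows.append(row)
    return rows


# Stand-in engines so the helper can be tried without an API key.
fake_engines = {
    "baseline": lambda q: f"baseline answer to: {q}",
    "sentence_window": lambda q: f"window answer to: {q}",
}
for row in compare_engines(fake_engines, ["What is RAG?"]):
    print(row)
```

With the notebook's objects, one assumed way to adapt an engine is `lambda q: str(baseline_query_engine.query(q))`; the resulting rows can be passed to `pd.DataFrame` for display.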



## 7. Quantitative Analysis

Let's analyze the key differences between the two approaches.

In [12]:
# Analyze chunk/node characteristics
baseline_lengths = [len(node.text) for node in baseline_nodes]
sentence_lengths = [len(node.text) for node in sentence_nodes]
window_lengths = [len(node.metadata.get('window', node.text)) for node in sentence_nodes]

analysis_data = {
    'Metric': [
        'Total Nodes',
        'Avg Retrieval Unit Size (chars)',
        'Avg Context Size (chars)',
        'Retrieval Precision',
        'Context Sufficiency',
    ],
    'Baseline (Standard)': [
        len(baseline_nodes),
        f"{sum(baseline_lengths)/len(baseline_lengths):.0f}",
        f"{sum(baseline_lengths)/len(baseline_lengths):.0f}",
        'Medium (512-token chunks)',
        'Medium (512-token chunks)',
    ],
    'Sentence Window': [
        len(sentence_nodes),
        f"{sum(sentence_lengths)/len(sentence_lengths):.0f}",
        f"{sum(window_lengths)/len(window_lengths):.0f}",
        'High (sentence-level)',
        'High (7-sentence window)',
    ]
}

df_analysis = pd.DataFrame(analysis_data)
display(HTML("<h3>Comparative Analysis</h3>"))
display(df_analysis)

| | Metric | Baseline (Standard) | Sentence Window |
|---|---|---|---|
| 0 | Total Nodes | 19 | 344 |
| 1 | Avg Retrieval Unit Size (chars) | 2493 | 128 |
| 2 | Avg Context Size (chars) | 2493 | 888 |
| 3 | Retrieval Precision | Medium (512-token chunks) | High (sentence-level) |
| 4 | Context Sufficiency | Medium (512-token chunks) | High (7-sentence window) |
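
A few ratios help when reading the table. The figures below are copied from the output above; the variable names are illustrative.

```python
# Average sizes in characters, taken from the comparison table.
baseline_avg_chars = 2493   # baseline chunk (retrieval unit and context)
sentence_avg_chars = 128    # sentence node (retrieval unit)
window_avg_chars = 888      # sentence window (context sent to the LLM)

retrieval_unit_ratio = baseline_avg_chars / sentence_avg_chars
window_expansion = window_avg_chars / sentence_avg_chars
context_ratio = window_avg_chars / baseline_avg_chars

print(f"A baseline chunk is {retrieval_unit_ratio:.1f}x the size of a sentence node")
print(f"The window expands the matched sentence {window_expansion:.1f}x")
print(f"A window is {context_ratio:.0%} of a baseline chunk")
```

In this run each window is roughly a third the size of a baseline chunk, so the "Medium" and "High" labels in the last two rows are qualitative judgments, not measurements.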


## 8. Visualization: Retrieval vs Context Separation

In [13]:
# Visualize the concept
visualization_md = """
### Sentence Window Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                    ORIGINAL DOCUMENT                            │
│  Sentence 1. Sentence 2. Sentence 3. [TARGET]. Sentence 5.     │
│  Sentence 6. Sentence 7. ...                                    │
└─────────────────────────────────────────────────────────────────┘
                                ↓
                    ┌───────────────────────┐
                    │  INDEXING PHASE       │
                    └───────────────────────┘
                                ↓
        ┌───────────────────────────────────────────────┐
        │  Each sentence embedded INDIVIDUALLY          │
        │  + Metadata stores surrounding window         │
        └───────────────────────────────────────────────┘
                                ↓
                    ┌───────────────────────┐
                    │  USER QUERY           │
                    └───────────────────────┘
                                ↓
        ┌───────────────────────────────────────────────┐
        │  RETRIEVAL: Match on SENTENCE level           │
        │  ✓ High precision (small retrieval unit)      │
        └───────────────────────────────────────────────┘
                                ↓
        ┌───────────────────────────────────────────────┐
        │  POST-PROCESSING: Expand to WINDOW            │
        │  Sentence 1. Sentence 2. Sentence 3.          │
        │  [TARGET]. Sentence 5. Sentence 6.            │
        │  Sentence 7.                                  │
        └───────────────────────────────────────────────┘
                                ↓
        ┌───────────────────────────────────────────────┐
        │  GENERATION: LLM gets EXPANDED context        │
        │  ✓ High context sufficiency (7 sentences)     │
        └───────────────────────────────────────────────┘
```

### Key Benefits:
1. **Precise Retrieval**: Semantic search operates on small, focused units (sentences)
2. **Rich Context**: LLM receives expanded windows with surrounding sentences
3. **Lost-in-the-Middle Mitigation**: Focused retrieval reduces noise in context
4. **Flexible Window Sizing**: Adjust `window_size` parameter based on task needs
"""

display(Markdown(visualization_md))




## 9. Key Takeaways

### What We Learned

1. **Hierarchical Retrieval Solves the Chunking Dilemma**: By separating retrieval granularity from generation context, we achieve both precise matching and sufficient context.

2. **Sentence Window Technique**:
   - Index at sentence level for precise semantic matching
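   - Retrieve the best-matching sentences by similarity score (in Query 1, the top sentence hit scored 0.8873 versus 0.8205 for the top baseline chunk)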
   - Expand to multi-sentence windows for LLM generation
   
3. **Configurable Trade-offs**:
   - `window_size` parameter controls context expansion
   - Smaller windows (1-2): For precise, focused answers
   - Larger windows (3-5): For comprehensive, contextual answers

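4. **Lost-in-the-Middle Problem**: By retrieving focused units and expanding only what's needed, each context passed to the LLM stays shorter (888 versus 2,493 characters on average in this run), which can help limit the "lost in the middle" phenomenon where LLMs miss information buried in long contexts.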

5. **Performance Benefits**:
   - Improved retrieval precision (sentence-level matching)
   - Better answer quality (sufficient context)
   - Reduced noise (focused retrieval units)

### When to Use Sentence Window Retrieval

✅ **Good for**:
- Long-form documents where context matters
- Questions requiring nuanced, contextual understanding
- Domains where sentences are semantically rich
- When retrieval precision is critical

❌ **Less suitable for**:
- Very short documents
- Structured data (tables, lists)
- When global document context is required

### Production Considerations

1. **Window Size Tuning**: Experiment with different `window_size` values based on your documents and use cases
2. **Storage Overhead**: Storing window metadata increases index size
3. **Retrieval Latency**: Sentence-level indexing creates many more nodes (344 versus 19 in this demo), which means more embeddings to compute at index time and may impact search time at scale
4. **Hybrid Approaches**: Consider combining with other techniques like re-ranking or query transformation
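
The storage overhead can be estimated from numbers already printed in this notebook. This is a rough calculation that assumes the rounded average window size is representative and counts characters only, ignoring embeddings and other metadata.

```python
# Figures taken from the notebook output above.
doc_chars = [13411, 13849, 16903]   # characters per source document
sentence_node_count = 344           # sentence nodes created
avg_window_chars = 888              # average stored window size

original_total = sum(doc_chars)
window_total = sentence_node_count * avg_window_chars
overhead = window_total / original_total

print(f"Original text:  {original_total:,} chars")
print(f"Stored windows: {window_total:,} chars")
print(f"Window text is about {overhead:.1f}x the original")
```

Because neighbouring windows overlap, the same sentence is stored several times, which is where the multiple comes from.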

## 10. Further Exploration

Try these experiments:
1. Adjust `window_size` (1, 3, 5, 10) and observe answer quality changes
2. Test with different document types (technical docs, narratives, etc.)
3. Combine with other techniques from previous demos (HyDE, hybrid search)
4. Compare with Auto-Merging Retrieval (another hierarchical approach)
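
Before running the first experiment, it can help to estimate how much text each `window_size` sends to the LLM. The sketch below is an upper-bound estimate under stated assumptions: every window is full, retrieved windows do not overlap, and the average sentence length (128 characters) and baseline chunk size (2,493 characters) from this notebook's run apply. `estimated_context_chars` is a hypothetical helper.

```python
def estimated_context_chars(window_size: int, top_k: int = 3, avg_sentence_chars: int = 128) -> int:
    """Upper-bound characters sent to the LLM for top_k full windows."""
    return top_k * (2 * window_size + 1) * avg_sentence_chars


baseline_context = 3 * 2493  # top_k baseline chunks at the average chunk size

for w in (1, 3, 5, 10):
    estimate = estimated_context_chars(w)
    print(f"window_size={w:>2}: ~{estimate:,} chars "
          f"({estimate / baseline_context:.0%} of baseline context)")
```

Under these assumptions, only the largest setting in the list sends more text than the baseline's three chunks.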