## What is Chunking and Why Does it Matter?

Chunking is the process of breaking down documents into smaller pieces that can be efficiently processed by language models and retrieval systems. The way you chunk your documents directly impacts:

- **Retrieval Precision**: How accurately your system can find relevant information
- **Context Preservation**: How much surrounding information is maintained
- **Token Economy**: How efficiently you use your LLM's context window
- **Storage Requirements**: How much vector storage you need

Let's explore different chunking strategies and their impact on document retrieval.
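
To make the trade-offs listed above concrete before turning to library splitters, here is a minimal sketch of fixed-size chunking with overlap. It is only an illustration: the helper `chunk_words` is hypothetical, and it counts whitespace-separated words as a stand-in for real tokens.

```python
def chunk_words(text, chunk_size, overlap):
    """Split text into chunks of `chunk_size` words that share `overlap` words with their neighbour."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window has reached the end of the text
    return chunks


# A synthetic 100-word "document" (assumption: words approximate tokens)
words = " ".join(f"w{i}" for i in range(100))
for size, overlap in [(20, 0), (20, 10), (50, 10)]:
    chunks = chunk_words(words, size, overlap)
    print(f"chunk_size={size}, overlap={overlap}: {len(chunks)} chunks")
```

Smaller chunks and larger overlaps both increase the number of chunks to embed and store, while larger chunks keep more surrounding context in each retrieved piece.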

In [None]:
from llama_index.core.schema import Document
import textwrap

# Sample document text
sample_text = """
# Introduction to Vector Databases

Vector databases are specialized database systems designed to store and query vector embeddings efficiently.
Unlike traditional databases optimized for exact matches, vector databases excel at similarity searches.

## Key Advantages

Vector databases offer several advantages for AI applications:
- Efficient similarity search using algorithms like HNSW and IVF
- Support for high-dimensional vector data
- Optimized for retrieval-augmented generation (RAG) applications

## Common Operations

The most common operations in vector databases include:
1. Adding vectors with associated metadata
2. Searching for similar vectors using distance metrics
3. Filtering results based on metadata
4. Building and optimizing indexes for faster retrieval

# Performance Considerations

When working with vector databases at scale, consider:
- Index construction time vs. query performance
- Memory usage vs. search accuracy
- Batch processing for efficient vector insertion
"""

# Create a Document
document = Document(text=sample_text)

# Print the first 500 characters of the document to get a sense of its structure
print("Original Document:")
print(textwrap.fill(document.text[:500], 100))
print("...\n")

Original Document:
 # Introduction to Vector Databases  Vector databases are specialized database systems designed to
store and query vector embeddings efficiently. Unlike traditional databases optimized for exact
matches, vector databases excel at similarity searches.  ## Key Advantages  Vector databases offer
several advantages for AI applications: - Efficient similarity search using algorithms like HNSW and
IVF - Support for high-dimensional vector data - Optimized for retrieval-augmented generation (RAG)
appli
...



In [None]:
from llama_index.core.node_parser import SentenceSplitter

# Sentence-based chunking
sentence_splitter = SentenceSplitter(
    chunk_size=60,  # Target chunk size (in tokens, not characters)
    chunk_overlap=50  # Overlap between consecutive chunks (in tokens)
)

sentence_nodes = sentence_splitter.get_nodes_from_documents([document])

print(f"Sentence Splitting created {len(sentence_nodes)} chunks\n")
print("Example chunks:")
for i in range(min(3, len(sentence_nodes))):
    print(f"\nChunk {i}:")
    print(textwrap.fill(sentence_nodes[i].text[:300], 100))
    print(f"Character length: {len(sentence_nodes[i].text)}")

Sentence Splitting created 9 chunks

Example chunks:

Chunk 0:
# Introduction to Vector Databases  Vector databases are specialized database systems designed to
store and query vector embeddings efficiently. Unlike traditional databases optimized for exact
matches, vector databases excel at similarity searches.  ## Key Advantages  Vector databases offer
several
Character length: 384

Chunk 1:
Unlike traditional databases optimized for exact matches, vector databases excel at similarity
searches.  ## Key Advantages  Vector databases offer several advantages for AI applications: -
Efficient similarity search using algorithms like HNSW and IVF - Support for high-dimensional vector
data - Op
Character length: 342

Chunk 2:
## Key Advantages  Vector databases offer several advantages for AI applications: - Efficient
similarity search using algorithms like HNSW and IVF - Support for high-dimensional vector data -
Optimized for retrieval-augmented generation (RAG) applications  ## Common Oper
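
The idea behind sentence-aware splitting can be sketched without the library: split the text into sentences, then greedily pack whole sentences into a chunk until a size budget is reached. This is a simplified illustration under stated assumptions, not LlamaIndex's actual algorithm: `pack_sentences` is a hypothetical helper, sentences are detected with a naive regular expression, and whitespace-separated words stand in for tokens.

```python
import re


def pack_sentences(text, max_words):
    """Greedily pack whole sentences into chunks of at most `max_words` words."""
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the budget
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "Vector databases store embeddings. They excel at similarity search. "
    "Traditional databases favour exact matches. Both have their place."
)
for chunk in pack_sentences(text, max_words=10):
    print(repr(chunk))
```

In this sketch a single sentence longer than the budget is kept whole rather than cut, which is one of several possible design choices.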

In [None]:
from llama_index.core.node_parser import TokenTextSplitter

# Token-based chunking
token_splitter = TokenTextSplitter(
    chunk_size=60,  # Target chunk size (in tokens)
    chunk_overlap=20  # Overlap between chunks (in tokens)
)

token_nodes = token_splitter.get_nodes_from_documents([document])

print(f"Token Splitting created {len(token_nodes)} chunks\n")
print("Example chunks:")
for i in range(min(3, len(token_nodes))):
    print(f"\nChunk {i}:")
    print(textwrap.fill(token_nodes[i].text[:300], 100))
    print(f"Character length: {len(token_nodes[i].text)}")

Token Splitting created 5 chunks

Example chunks:

Chunk 0:
# Introduction to Vector Databases  Vector databases are specialized database systems designed to
store and query vector embeddings efficiently. Unlike traditional databases optimized for exact
matches, vector databases excel at similarity searches.  ## Key Advantages  Vector databases offer
several
Character length: 379

Chunk 1:
Key Advantages  Vector databases offer several advantages for AI applications: - Efficient
similarity search using algorithms like HNSW and IVF - Support for high-dimensional vector data -
Optimized for retrieval-augmented generation (RAG) applications  ## Common Operations  The most
common operatio
Character length: 302

Chunk 2:
for retrieval-augmented generation (RAG) applications  ## Common Operations  The most common
operations in vector databases include: 1. Adding vectors with associated metadata 2. Searching for
similar vectors using distance metrics 3. Filtering results based on metadata 4. 

In [None]:
from llama_index.core.node_parser import HierarchicalNodeParser

# Hierarchical chunking
hierarchical_splitter = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[70, 60, 50]  # Chunk size per level (in tokens), coarsest to finest
)

hierarchical_nodes = hierarchical_splitter.get_nodes_from_documents([document])

print(f"Hierarchical Splitting created {len(hierarchical_nodes)} chunks\n")
print("Example chunks:")
for i in range(min(3, len(hierarchical_nodes))):
    print(f"\nChunk {i}:")
    print(textwrap.fill(hierarchical_nodes[i].text[:300], 100))
    print(f"Character length: {len(hierarchical_nodes[i].text)}")

Hierarchical Splitting created 18 chunks

Example chunks:

Chunk 0:
# Introduction to Vector Databases  Vector databases are specialized database systems designed to
store and query vector embeddings efficiently. Unlike traditional databases optimized for exact
matches, vector databases excel at similarity searches.
Character length: 249

Chunk 1:
Unlike traditional databases optimized for exact matches, vector databases excel at similarity
searches.  ## Key Advantages  Vector databases offer several advantages for AI applications: -
Efficient similarity search using algorithms like HNSW and IVF - Support for high-dimensional vector
data - Op
Character length: 443

Chunk 2:
Adding vectors with associated metadata 2. Searching for similar vectors using distance metrics 3.
Filtering results based on metadata 4. Building and optimizing indexes for faster retrieval  #
Performance Considerations  When working with vector databases at scale, consider: - Index
construction ti
Character length
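
The hierarchical count is higher than the others most likely because the returned list contains nodes from every level, parents and children together. A minimal pure-Python sketch of that idea follows. It is an assumption-laden illustration rather than the library's implementation: `hierarchical_chunks` is a hypothetical helper, it splits on whitespace words, and it uses no overlap.

```python
from collections import Counter


def split_words(words, size):
    """Cut a list of words into consecutive pieces of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def hierarchical_chunks(text, chunk_sizes):
    """Return a flat list of nodes; each node records its level and its parent's id."""
    nodes = []

    def recurse(words, level, parent_id):
        for piece in split_words(words, chunk_sizes[level]):
            node_id = len(nodes)
            nodes.append(
                {"id": node_id, "level": level, "parent": parent_id, "text": " ".join(piece)}
            )
            # Split this piece again with the next, smaller chunk size
            if level + 1 < len(chunk_sizes):
                recurse(piece, level + 1, node_id)

    recurse(text.split(), 0, None)
    return nodes


text = " ".join(f"w{i}" for i in range(16))
nodes = hierarchical_chunks(text, [8, 4, 2])
per_level = Counter(node["level"] for node in nodes)
print(f"{len(nodes)} nodes in total, per level: {dict(per_level)}")
```

When indexing a hierarchy like this, it is common to embed only the finest-level nodes and use the `parent` links to pull in wider context at query time.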

In [27]:
from llama_index.core.node_parser import MarkdownNodeParser

# Structure-aware chunking for Markdown
markdown_splitter = MarkdownNodeParser()

markdown_nodes = markdown_splitter.get_nodes_from_documents([document])

print(f"\nMarkdown-aware Splitting created {len(markdown_nodes)} chunks\n")
print("Example chunks with their headings:")
for i in range(min(3, len(markdown_nodes))):
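    # NOTE: the parser did not populate a 'heading' metadata key in this run,
    # so the fallback value is what gets printed
    heading = markdown_nodes[i].metadata.get('heading', 'No heading')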
    print(f"\nChunk {i} (Heading: {heading}):")
    print(textwrap.fill(markdown_nodes[i].text[:200], 100))


Markdown-aware Splitting created 4 chunks

Example chunks with their headings:

Chunk 0 (Heading: No heading):
# Introduction to Vector Databases  Vector databases are specialized database systems designed to
store and query vector embeddings efficiently. Unlike traditional databases optimized for exact
matche

Chunk 1 (Heading: No heading):
## Key Advantages  Vector databases offer several advantages for AI applications: - Efficient
similarity search using algorithms like HNSW and IVF - Support for high-dimensional vector data -
Optimize

Chunk 2 (Heading: No heading):
## Common Operations  The most common operations in vector databases include: 1. Adding vectors with
associated metadata 2. Searching for similar vectors using distance metrics 3. Filtering results ba
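
Every chunk above reports "No heading" because the metadata lookup found no `heading` key. If you need the heading alongside each chunk, one option is to extract it yourself. The sketch below is a hypothetical pure-Python helper (`split_markdown_sections`), not part of LlamaIndex. It assumes ATX-style headings (`#` to `######`) and does not handle `#` lines inside fenced code blocks.

```python
import re


def split_markdown_sections(markdown_text):
    """Split Markdown at headings and keep each section's heading text and level."""
    sections = []
    current = None
    for line in markdown_text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            # A new heading closes the previous section
            if current is not None:
                sections.append(current)
            current = {
                "heading": match.group(2).strip(),
                "level": len(match.group(1)),
                "lines": [line],
            }
        elif current is not None:
            current["lines"].append(line)
        elif line.strip():
            # Text that appears before the first heading
            current = {"heading": None, "level": 0, "lines": [line]}
    if current is not None:
        sections.append(current)
    return [
        {"heading": s["heading"], "level": s["level"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
    ]


sample = "# Intro\n\nBody text.\n\n## Details\n\n- a\n- b\n"
for section in split_markdown_sections(sample):
    print(section["level"], section["heading"], "->", len(section["text"]), "characters")
```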


## Let's try the different strategies!

In [None]:
# Integration package needed to run Hugging Face embedding models locally
%pip install llama-index-embeddings-huggingface



In [36]:
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Create a local embedding model
local_embed_model = HuggingFaceEmbedding(model_name="all-MiniLM-L6-v2")

# Create vector stores with different chunking strategies (using local embeddings)
sentence_index = VectorStoreIndex(
    sentence_nodes, embed_model=local_embed_model)
token_index = VectorStoreIndex(token_nodes, embed_model=local_embed_model)
markdown_index = VectorStoreIndex(
    markdown_nodes, embed_model=local_embed_model)

# Query to test retrieval
query = "What are the top operations in vector databases?"

# Get retrieval results
sentence_results = sentence_index.as_retriever().retrieve(query)
token_results = token_index.as_retriever().retrieve(query)
markdown_results = markdown_index.as_retriever().retrieve(query)

# Compare top results
print("\nRETRIEVAL COMPARISON\n")
print("Sentence chunking result:")
print(textwrap.fill(sentence_results[0].node.text[:300], 100))

print("\nToken chunking result:")
print(textwrap.fill(token_results[0].node.text[:300], 100))

print("\nMarkdown-aware chunking result:")
print(textwrap.fill(markdown_results[0].node.text[:300], 100))



RETRIEVAL COMPARISON

Sentence chunking result:
applications  ## Common Operations  The most common operations in vector databases include: 1.
Adding vectors with associated metadata 2. Searching for similar vectors using distance metrics 3.
Filtering results based on metadata 4. Building and optimizing indexes for faster retrieval  #
Performance

Token chunking result:
for retrieval-augmented generation (RAG) applications  ## Common Operations  The most common
operations in vector databases include: 1. Adding vectors with associated metadata 2. Searching for
similar vectors using distance metrics 3. Filtering results based on metadata 4. Building and
optimizing in

Markdown-aware chunking result:
## Common Operations  The most common operations in vector databases include: 1. Adding vectors with
associated metadata 2. Searching for similar vectors using distance metrics 3. Filtering results
based on metadata 4. Building and optimizing indexes for faster retrieval
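
All three retrievers rank chunks by how similar each chunk's vector is to the query's vector. The sketch below illustrates that ranking step with cosine similarity over simple word-count vectors. This is a deliberate simplification: real embedding models such as the one used above produce dense vectors that capture meaning, and the helper names (`bag_of_words`, `cosine`, `rank_chunks`) are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts, used here as a stand-in for an embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_chunks(query, chunks):
    """Return (score, chunk) pairs sorted from most to least similar."""
    q = bag_of_words(query)
    scored = [(cosine(q, bag_of_words(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


chunks = [
    "Consider memory usage and index construction time",
    "Common operations include adding vectors and searching for similar vectors",
]
for score, chunk in rank_chunks("common operations in vector databases", chunks):
    print(f"{score:.3f}  {chunk}")
```

Because a chunk is scored as a whole, a chunk that mixes the tail of one section with the start of another dilutes its match with the query, which is one reason the chunk boundaries in the results above differ between strategies.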
