## Chunking

Chunking is the process of breaking documents into smaller pieces that language models and retrieval systems can process efficiently. The way you chunk your documents directly impacts:

- **Retrieval Precision**: How accurately your system can find relevant information
- **Context Preservation**: How much surrounding information is maintained
- **Token Economy**: How efficiently you use your LLM's context window
- **Storage Requirements**: How much vector storage you need

Let's explore different chunking strategies and their impact on document retrieval.
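
Before running any splitter, it can help to get a rough feel for how chunk size and overlap drive the number of chunks (and therefore storage). The sketch below is only back-of-the-envelope arithmetic for a plain sliding window. The helper `estimate_chunk_count` and the 200-token document are hypothetical, and sentence-aware or structure-aware splitters will not match these numbers exactly.

```python
import math


def estimate_chunk_count(total_tokens: int, chunk_size: int, chunk_overlap: int) -> int:
    """Estimate how many chunks a fixed-size sliding window produces.

    Hypothetical helper for illustration; not part of LlamaIndex.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if total_tokens <= chunk_size:
        return 1
    # Each new chunk advances by (chunk_size - chunk_overlap) tokens.
    stride = chunk_size - chunk_overlap
    return 1 + math.ceil((total_tokens - chunk_size) / stride)


# Assumed document length of 200 tokens, purely for illustration.
for size, overlap in [(40, 0), (40, 10), (100, 10)]:
    n = estimate_chunk_count(200, size, overlap)
    print(f"chunk_size={size}, overlap={overlap} -> about {n} chunks "
          f"(at most {n * size} tokens stored)")
```

With the same chunk size, adding overlap shortens the stride, so more chunks (and more stored tokens) are needed to cover the same text.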

In [1]:
from llama_index.core.schema import Document
import textwrap

In [2]:
# Sample document text
sample_doc = """
# Introduction to Vector Databases

Vector databases are specialized database systems designed to store and query vector embeddings efficiently.
Unlike traditional databases optimized for exact matches, vector databases excel at similarity searches.

## Key Advantages

Vector databases offer several advantages for AI applications:
- Efficient similarity search using algorithms like HNSW and IVF
- Support for high-dimensional vector data
- Optimized for retrieval-augmented generation (RAG) applications

## Common Operations

The most common operations in vector databases include:
1. Adding vectors with associated metadata
2. Searching for similar vectors using distance metrics
3. Filtering results based on metadata
4. Building and optimizing indexes for faster retrieval

# Performance Considerations

When working with vector databases at scale, consider:
- Index construction time vs. query performance
- Memory usage vs. search accuracy
- Batch processing for efficient vector insertion
"""

In [3]:
# Create a Document
document = Document(text=sample_doc)

print("Original Document")
print(textwrap.fill(document.text, 90))

Original Document
 # Introduction to Vector Databases  Vector databases are specialized database systems
designed to store and query vector embeddings efficiently. Unlike traditional databases
optimized for exact matches, vector databases excel at similarity searches.  ## Key
Advantages  Vector databases offer several advantages for AI applications: - Efficient
similarity search using algorithms like HNSW and IVF - Support for high-dimensional vector
data - Optimized for retrieval-augmented generation (RAG) applications  ## Common
Operations  The most common operations in vector databases include: 1. Adding vectors with
associated metadata 2. Searching for similar vectors using distance metrics 3. Filtering
results based on metadata 4. Building and optimizing indexes for faster retrieval  #
Performance Considerations  When working with vector databases at scale, consider: - Index
construction time vs. query performance - Memory usage vs. search accuracy - Batch
processing for efficient

In [5]:
from llama_index.core.node_parser import SentenceSplitter

# Sentence-based chunking
sentence_splitter = SentenceSplitter(chunk_size=40, chunk_overlap=10)

sentence_nodes = sentence_splitter.get_nodes_from_documents([document])

Metadata length (0) is close to chunk size (40). Resulting chunks are less than 50 tokens. Consider increasing the chunk size or decreasing the size of your metadata to avoid this.
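
To make the idea behind sentence-based chunking more concrete, here is a minimal pure-Python sketch that greedily packs whole sentences into a word budget and carries the last sentence over as overlap. It is an illustration under simplifying assumptions (words stand in for tokens, sentences are split with a naive regex) and is not how `SentenceSplitter` is implemented. The function name `sentence_chunks` and the toy text are made up for this example.

```python
import re


def sentence_chunks(text: str, max_words: int, overlap_sentences: int = 1) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_words words.

    Illustrative sketch only: a single sentence longer than max_words is
    still emitted as its own (oversized) chunk.
    """
    def word_count(sentences: list[str]) -> int:
        return sum(len(s.split()) for s in sentences)

    # Naive sentence boundary: whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        n = len(sentence.split())
        if current and word_count(current) + n > max_words:
            chunks.append(" ".join(current))
            carry = current[-overlap_sentences:] if overlap_sentences > 0 else []
            # Keep the overlap only if the next sentence still fits next to it.
            current = carry if word_count(carry) + n <= max_words else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


toy_text = (
    "Vector databases store embeddings. They excel at similarity search. "
    "Indexes such as HNSW speed up queries. Metadata filters narrow the results."
)
for chunk in sentence_chunks(toy_text, max_words=12):
    print(f"[{len(chunk.split())} words] {chunk}")
```

Because chunks always end on a sentence boundary, no sentence is cut in half, at the cost of chunks with uneven lengths.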


In [7]:
print(f"Sentence chunks created : {len(sentence_nodes)}")
for i in range(len(sentence_nodes)):
    print(f"- Chunk n°{i + 1} :")
    print(f"{textwrap.fill(sentence_nodes[i].text, 100)}")
    print(f"Characters length : {len(sentence_nodes[i].text)}")
    print()

Sentence chunks created : 6
- Chunk n°1 :
# Introduction to Vector Databases  Vector databases are specialized database systems designed to
store and query vector embeddings efficiently. Unlike traditional databases optimized for exact
matches, vector databases excel at similarity searches.  ## Key
Characters length : 257

- Chunk n°2 :
## Key Advantages  Vector databases offer several advantages for AI applications: - Efficient
similarity search using algorithms like HNSW and IVF - Support for high-dimensional vector data -
Optimized for
Characters length : 205

- Chunk n°3 :
for high-dimensional vector data - Optimized for retrieval-augmented generation (RAG) applications
## Common Operations  The most common operations in vector databases include: 1.
Characters length : 180

- Chunk n°4 :
common operations in vector databases include: 1. Adding vectors with associated metadata 2.
Searching for similar vectors using distance metrics 3. Filtering results based on metadata 4.
Character

In [8]:
from llama_index.core.node_parser import TokenTextSplitter

# Token-based chunking
token_splitter = TokenTextSplitter(chunk_size=40, chunk_overlap=10)
token_nodes = token_splitter.get_nodes_from_documents([document])

Metadata length (2) is close to chunk size (40). Resulting chunks are less than 50 tokens. Consider increasing the chunk size or decreasing the size of your metadata to avoid this.


In [9]:
print(f"Token chunks created : {len(token_nodes)}")
for i in range(len(token_nodes)):
    print(f"- Chunk n°{i + 1} :")
    print(f"{textwrap.fill(token_nodes[i].text, 100)}")
    print(f"Characters length : {len(token_nodes[i].text)}")
    print()

Token chunks created : 7
- Chunk n°1 :
# Introduction to Vector Databases  Vector databases are specialized database systems designed to
store and query vector embeddings efficiently. Unlike traditional databases optimized for exact
matches, vector databases excel at similarity
Characters length : 239

- Chunk n°2 :
optimized for exact matches, vector databases excel at similarity searches.  ## Key Advantages
Vector databases offer several advantages for AI applications: - Efficient similarity search using
algorithms like HNSW and
Characters length : 219

- Chunk n°3 :
Efficient similarity search using algorithms like HNSW and IVF - Support for high-dimensional vector
data - Optimized for retrieval-augmented generation (RAG) applications  ## Common
Characters length : 182

- Chunk n°4 :
generation (RAG) applications  ## Common Operations  The most common operations in vector databases
include: 1. Adding vectors with associated metadata 2. Searching for similar vectors using distance
Ch

In [None]:
from llama_index.core.node_parser import HierarchicalNodeParser

# Hierarchical chunking
hierarchical_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[70, 60, 40])
hierarchical_nodes = hierarchical_parser.get_nodes_from_documents([document])
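
The general idea of hierarchical chunking can be sketched without any library: split the text coarsely, split each coarse chunk again at a finer size, and keep parent/child links so a retriever could later swap a small matching chunk for its larger parent. The `Chunk` class and `build_hierarchy` function below are hypothetical, use word counts instead of tokens, and are not the `HierarchicalNodeParser` implementation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # eq=False avoids recursive comparisons through parent/child links
class Chunk:
    text: str
    level: int
    parent: Optional["Chunk"] = None
    children: list["Chunk"] = field(default_factory=list)


def split_words(text: str, size: int) -> list[str]:
    """Split text into consecutive pieces of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_hierarchy(text: str, chunk_sizes: list[int]) -> list[Chunk]:
    """Return the chunks of every level, coarse to fine, with parent/child links."""
    parents = [Chunk(text=t, level=0) for t in split_words(text, chunk_sizes[0])]
    all_chunks = list(parents)
    for level, size in enumerate(chunk_sizes[1:], start=1):
        children = []
        for parent in parents:
            for piece in split_words(parent.text, size):
                child = Chunk(text=piece, level=level, parent=parent)
                parent.children.append(child)
                children.append(child)
        all_chunks.extend(children)
        parents = children
    return all_chunks


toy_text = " ".join(f"w{i}" for i in range(20))
chunks = build_hierarchy(toy_text, chunk_sizes=[10, 5])
leaves = [c for c in chunks if not c.children]
print(f"{len(chunks)} chunks in total, {len(leaves)} of them leaves")
```

Note that the list contains chunks from every level, so the same text appears several times at different granularities. In a setup like this you would typically embed only the leaves and use the parent links to recover wider context.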

In [11]:
print(f"Hierarchical chunks created : {len(hierarchical_nodes)}")
for i in range(len(hierarchical_nodes)):
    print(f"- Chunk n°{i + 1} :")
    print(f"{textwrap.fill(hierarchical_nodes[i].text, 100)}")
    print(f"Characters length : {len(hierarchical_nodes[i].text)}")
    print()

Hierarchical chunks created : 18
- Chunk n°1 :
# Introduction to Vector Databases  Vector databases are specialized database systems designed to
store and query vector embeddings efficiently. Unlike traditional databases optimized for exact
matches, vector databases excel at similarity searches.
Characters length : 249

- Chunk n°2 :
## Key Advantages  Vector databases offer several advantages for AI applications: - Efficient
similarity search using algorithms like HNSW and IVF - Support for high-dimensional vector data -
Optimized for retrieval-augmented generation (RAG) applications  ## Common Operations  The most
common operations in vector databases include: 1.
Characters length : 337

- Chunk n°3 :
Adding vectors with associated metadata 2. Searching for similar vectors using distance metrics 3.
Filtering results based on metadata 4. Building and optimizing indexes for faster retrieval  #
Performance Considerations  When working with vector databases at scale, consider: - Index
con

In [12]:
from llama_index.core.node_parser import MarkdownNodeParser

# Structure-aware chunking for Markdown
markdown_parser = MarkdownNodeParser()
markdown_nodes = markdown_parser.get_nodes_from_documents([document])
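
To illustrate what "structure-aware" means here, the sketch below splits Markdown on header lines so that each section becomes one chunk. It is a simplified stand-in written for this example (the name `split_by_headers` is hypothetical), not the `MarkdownNodeParser` implementation, and it deliberately ignores edge cases such as `#` characters inside fenced code blocks.

```python
import re

# A Markdown ATX header: 1 to 6 '#' characters, whitespace, then the title.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(markdown_text: str) -> list[dict]:
    """Split Markdown into one section per header (illustrative sketch)."""
    sections = []
    current = None
    for line in markdown_text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            # A new header closes the previous section and opens a new one.
            if current is not None:
                sections.append(current)
            current = {
                "level": len(match.group(1)),
                "title": match.group(2).strip(),
                "lines": [line],
            }
        elif current is not None:
            current["lines"].append(line)
        elif line.strip():
            # Text before the first header goes into an untitled section.
            current = {"level": 0, "title": "", "lines": [line]}
    if current is not None:
        sections.append(current)
    return [
        {"level": s["level"], "title": s["title"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
    ]


toy_doc = "# Intro\n\nSome text.\n\n## Details\n\n- first\n- second\n"
for section in split_by_headers(toy_doc):
    print(section["level"], section["title"], len(section["text"]))
```

Each chunk then corresponds to one complete section, so lists and paragraphs are never separated from the heading that introduces them.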

In [13]:
print(f"Markdown chunks created : {len(markdown_nodes)}")
for i in range(len(markdown_nodes)):
    print(f"- Chunk n°{i + 1} :")
    print(f"{textwrap.fill(markdown_nodes[i].text, 100)}")
    print(f"Characters length : {len(markdown_nodes[i].text)}")
    print()

Markdown chunks created : 4
- Chunk n°1 :
# Introduction to Vector Databases  Vector databases are specialized database systems designed to
store and query vector embeddings efficiently. Unlike traditional databases optimized for exact
matches, vector databases excel at similarity searches.
Characters length : 249

- Chunk n°2 :
## Key Advantages  Vector databases offer several advantages for AI applications: - Efficient
similarity search using algorithms like HNSW and IVF - Support for high-dimensional vector data -
Optimized for retrieval-augmented generation (RAG) applications
Characters length : 255

- Chunk n°3 :
## Common Operations  The most common operations in vector databases include: 1. Adding vectors with
associated metadata 2. Searching for similar vectors using distance metrics 3. Filtering results
based on metadata 4. Building and optimizing indexes for faster retrieval
Characters length : 271

- Chunk n°4 :
# Performance Considerations  When working with vector database

In [15]:
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

In [16]:
# Create a local embedding model
embedding_model = HuggingFaceEmbedding(model_name="all-MiniLM-L6-v2")

# Create vector stores with different chunking strategies (using local embeddings)
sentence_index = VectorStoreIndex(sentence_nodes, embed_model=embedding_model)
token_index = VectorStoreIndex(token_nodes, embed_model=embedding_model)
hierarchical_index = VectorStoreIndex(hierarchical_nodes, embed_model=embedding_model)
markdown_index = VectorStoreIndex(markdown_nodes, embed_model=embedding_model)

# Query to test retrieval
query = "What are the top operations in vector databases ?"

# Get retrieval results
sentence_result = sentence_index.as_retriever().retrieve(query)
token_result = token_index.as_retriever().retrieve(query)
hierarchical_result = hierarchical_index.as_retriever().retrieve(query)
markdown_result = markdown_index.as_retriever().retrieve(query)

In [17]:
# Compare top results
print("- Sentence chunking result :")
print(f"{textwrap.fill(sentence_result[0].text, 100)}")
print()
print("- Token chunking result :")
print(f"{textwrap.fill(token_result[0].text, 100)}")
print()
print("- Hierarchical chunking result :")
print(f"{textwrap.fill(hierarchical_result[0].text, 100)}")
print()
print("- Markdown chunking result :")
print(f"{textwrap.fill(markdown_result[0].text, 100)}")

- Sentence chunking result :
common operations in vector databases include: 1. Adding vectors with associated metadata 2.
Searching for similar vectors using distance metrics 3. Filtering results based on metadata 4.

- Token chunking result :
generation (RAG) applications  ## Common Operations  The most common operations in vector databases
include: 1. Adding vectors with associated metadata 2. Searching for similar vectors using distance

- Hierarchical chunking result :
## Key Advantages  Vector databases offer several advantages for AI applications: - Efficient
similarity search using algorithms like HNSW and IVF - Support for high-dimensional vector data -
Optimized for retrieval-augmented generation (RAG) applications  ## Common Operations  The most
common operations in vector databases include: 1.

- Markdown chunking result :
## Common Operations  The most common operations in vector databases include: 1. Adding vectors with
associated metadata 2. Searching for similar vectors 

## Conclusions

1. **Chunk Size Trade-offs**: Smaller chunks allow more precise retrieval but may lose context. Larger chunks preserve more context but can introduce noise and consume more of the LLM's context window.

2. **Overlap Between Chunks**: Adding overlap ensures that sentences or ideas that cross chunk boundaries aren't lost, but it increases storage requirements and can surface duplicate information at retrieval time.

3. **Structure Awareness**: Domain-specific chunking that understands document structure (like the Markdown example) typically produces better results but requires more specialized processing.
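
The overlap trade-off from the list above can be checked on a toy example. The sketch below uses a plain word-level sliding window (the helpers `window_chunks` and `contains_phrase` are hypothetical and written only for this illustration) and a made-up 20-word document in which one four-word phrase straddles a chunk boundary.

```python
def window_chunks(words: list[str], size: int, overlap: int) -> list[list[str]]:
    """Fixed-size sliding window over a list of words (illustrative sketch)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    stride = size - overlap
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the window already reached the end of the document
    return chunks


def contains_phrase(chunk: list[str], phrase: list[str]) -> bool:
    """True if the phrase appears contiguously inside the chunk."""
    n = len(phrase)
    return any(chunk[i:i + n] == phrase for i in range(len(chunk) - n + 1))


# Toy document of 20 placeholder words; the phrase crosses the boundary at word 10.
words = [f"w{i}" for i in range(20)]
phrase = ["w8", "w9", "w10", "w11"]

for overlap in (0, 4):
    chunks = window_chunks(words, size=10, overlap=overlap)
    stored = sum(len(c) for c in chunks)
    intact = any(contains_phrase(c, phrase) for c in chunks)
    print(f"overlap={overlap}: {len(chunks)} chunks, {stored} words stored, "
          f"phrase intact: {intact}")
```

In this toy setup, no overlap gives 2 chunks and 20 stored words but cuts the phrase in half, while an overlap of 4 keeps the phrase intact in one chunk at the cost of 3 chunks and 28 stored words.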