[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Hawksight-AI/semantica/blob/main/cookbook/introduction/11_Chunking_and_Splitting.ipynb)

# Chunking and Splitting - Comprehensive Guide

## Overview

This notebook provides a **comprehensive walkthrough** of Semantica's split module, demonstrating all chunking strategies and methods for optimal document processing. You'll learn to use 15+ splitting methods including standard, semantic, and knowledge graph-aware approaches.

**Documentation**: [API Reference](https://semantica.readthedocs.io/reference/split/)

### Learning Objectives

By the end of this notebook, you will be able to:

- Use `TextSplitter` with multiple methods
- Apply standard splitting methods (recursive, token, sentence, paragraph)
- Use semantic chunking for topic coherence
- Apply KG-aware chunking (entity-aware, relation-aware, graph-based)
- Use specialized chunkers (structural, sliding window, table, hierarchical)
- Validate chunk quality with `ChunkValidator`
- Track provenance with `ProvenanceTracker`
- Choose the right method for your use case

### What You'll Learn

| Component | Purpose | When to Use |
|-----------|---------|-------------|
| `TextSplitter` | Unified splitter | All chunking needs |
| `SemanticChunker` | Semantic boundaries | Topic-based chunks |
| `EntityAwareChunker` | Preserve entities | GraphRAG workflows |
| `RelationAwareChunker` | Preserve triplets | KG construction |
| `StructuralChunker` | Document structure | Formatted documents |
| `HierarchicalChunker` | Multi-level chunks | Large documents |

---

## Installation

Install Semantica from PyPI:

```bash
pip install semantica
# Or with all optional dependencies:
pip install semantica[all]
```

---

In [2]:
!pip install -q semantica




## Step 1: Basic Chunking with TextSplitter

Let's start with the unified `TextSplitter` interface, which provides access to all chunking methods.

### What is TextSplitter?

`TextSplitter` is a unified interface that supports 15+ chunking methods:
- **Standard**: recursive, token, sentence, paragraph, character, word
- **Semantic**: semantic_transformer, llm, huggingface, nltk
- **KG/Ontology**: entity_aware, relation_aware, graph_based, ontology_aware
- **Advanced**: hierarchical, structural, sliding_window, table
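
To make the `recursive` strategy concrete, here is a minimal, library-independent sketch of recursive splitting. It is not Semantica's implementation: the function name `recursive_split`, the separator order, and the absence of overlap handling are all assumptions made for illustration.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first; fall back to finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to hard character cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph. It has two sentences.\n\nSecond paragraph is here."
pieces = recursive_split(sample, chunk_size=40)
for piece in pieces:
    print(len(piece), repr(piece))
```

The sketch prefers paragraph breaks, then line breaks, then sentence ends, then spaces, so chunks tend to end at natural boundaries whenever the size limit allows it.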

In [3]:
from semantica.split import TextSplitter

# Sample long text
text = """
Apple Inc. is a technology company founded by Steve Jobs, Steve Wozniak, and Ronald Wayne 
in Cupertino, California on April 1, 1976. The company's current CEO is Tim Cook, who took 
over from Steve Jobs in August 2011. Apple is headquartered at One Apple Park Way in Cupertino.

Apple develops and sells consumer electronics, computer software, and online services. The company's 
hardware products include the iPhone smartphone, the iPad tablet computer, the Mac personal computer, 
the iPod portable media player, the Apple Watch smartwatch, the Apple TV digital media player, and the 
HomePod smart speaker.

Apple's software includes the macOS and iOS operating systems, the iTunes media player, the Safari web 
browser, and the iLife and iWork creativity and productivity suites. Its online services include the 
iTunes Store, the iOS App Store and Mac App Store, Apple Music, and iCloud.
"""

# Basic recursive splitting
splitter = TextSplitter(
    method="recursive",
    chunk_size=200,
    chunk_overlap=50
)

chunks = splitter.split(text)

print(f"Split into {len(chunks)} chunks using recursive method\n")
print("=" * 80)

for i, chunk in enumerate(chunks, 1):
    print(f"\nChunk {i}:")
    print(f"  Length: {len(chunk.text)} characters")
    print(f"  Start: {chunk.start_index}, End: {chunk.end_index}")
    print(f"  Text: {chunk.text[:100]}...")

print("\n" + "=" * 80)



Split into 8 chunks using recursive method


Chunk 1:
  Length: 181 characters
  Start: 0, End: 184
  Text: Apple Inc. is a technology company founded by Steve Jobs, Steve Wozniak, and Ronald Wayne 
in Cupert...

Chunk 2:
  Length: 144 characters
  Start: 134, End: 281
  Text: The company's current CEO is Tim Cook, who took 
over from Steve Jobs in August 2011. Apple is headq...

Chunk 3:
  Length: 150 characters
  Start: 231, End: 383
  Text: eadquartered at One Apple Park Way in Cupertino.

Apple develops and sells consumer electronics, com...

Chunk 4:
  Length: 151 characters
  Start: 333, End: 486
  Text: ter software, and online services. The company's 
hardware products include the iPhone smartphone, t...

Chunk 5:
  Length: 176 characters
  Start: 436, End: 614
  Text: iPad tablet computer, the Mac personal computer, 
the iPod portable media player, the Apple Watch sm...

Chunk 6:
  Length: 152 characters
  Start: 564, End: 718
  Text: al media player, and the 
HomePod smart sp

## Step 2: Standard Splitting Methods

Let's compare different standard splitting methods.

### Method Comparison

| Method | Best For | Speed | Accuracy |
|--------|----------|-------|----------|
| **recursive** | General text | Fast | Good |
| **sentence** | Coherent chunks | Medium | Very Good |
| **token** | LLM context | Medium | Excellent |
| **paragraph** | Natural breaks | Fast | Good |

In [4]:
# Compare different methods
methods = ["recursive", "sentence", "paragraph"]

print("Comparing Standard Splitting Methods:\n")
print("=" * 80)

for method in methods:
    splitter = TextSplitter(
        method=method,
        chunk_size=200,
        chunk_overlap=50
    )
    
    chunks = splitter.split(text)
    
    print(f"\nMethod: {method.upper()}")
    print("-" * 40)
    print(f"  Chunks created: {len(chunks)}")
    print(f"  Avg chunk size: {sum(len(c.text) for c in chunks) / len(chunks):.0f} chars")
    print(f"  First chunk: {chunks[0].text[:80]}...")

print("\n" + "=" * 80)

Comparing Standard Splitting Methods:


Method: RECURSIVE
----------------------------------------
  Chunks created: 8
  Avg chunk size: 154 chars
  First chunk: Apple Inc. is a technology company founded by Steve Jobs, Steve Wozniak, and Ron...

Method: SENTENCE
----------------------------------------
  Chunks created: 6
  Avg chunk size: 148 chars
  First chunk: Apple Inc. is a technology company founded by Steve Jobs, Steve Wozniak, and Ron...

Method: PARAGRAPH
----------------------------------------
  Chunks created: 3
  Avg chunk size: 297 chars
  First chunk: Apple Inc. is a technology company founded by Steve Jobs, Steve Wozniak, and Ron...



## Step 3: Token-Based Splitting

Token-based splitting is crucial for LLM applications where you need to respect token limits.

### Why Token-Based?

- **LLM Context Windows**: Models have fixed token limits (the original GPT-4 models shipped with 8K and 32K context windows)
- **Accurate Counting**: Character count ≠ token count
- **Cost Optimization**: Tokens determine API costs
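
The difference between character and token counts, and the way overlap works at the token level, can be sketched without any external dependency. The regex tokenizer below is only a stand-in (real tokenizers such as tiktoken use byte-pair encoding and will produce different counts), and the names `simple_tokenize` and `token_windows` are hypothetical.

```python
import re


def simple_tokenize(text):
    """Stand-in tokenizer: each word and each punctuation mark becomes one token."""
    return re.findall(r"\w+|[^\w\s]", text)


def token_windows(tokens, chunk_size, chunk_overlap):
    """Yield windows of at most chunk_size tokens that share chunk_overlap tokens with the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    for start in range(0, len(tokens), step):
        yield tokens[start:start + chunk_size]
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text


sentence = "Apple's current CEO is Tim Cook, who took over in August 2011."
tokens = simple_tokenize(sentence)
print(f"{len(sentence)} characters, {len(tokens)} tokens")
for window in token_windows(tokens, chunk_size=8, chunk_overlap=2):
    print(len(window), window)
```

Because the window advances by `chunk_size - chunk_overlap` tokens, every chunk stays within the token budget regardless of how many characters those tokens span.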

In [5]:
from semantica.split import split_by_tokens

# Token-based splitting
chunks = split_by_tokens(
    text,
    chunk_size=100,  # 100 tokens
    chunk_overlap=20,
    tokenizer="tiktoken",
    model="gpt-4"
)

print("Token-Based Splitting Results:\n")
print("=" * 80)

for i, chunk in enumerate(chunks, 1):
    token_count = chunk.metadata.get('token_count', 'N/A')
    print(f"\nChunk {i}:")
    print(f"  Tokens: {token_count}")
    print(f"  Characters: {len(chunk.text)}")
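    # Format the ratio only when a token count exists; applying :.2f to 'N/A' would raise ValueError
    ratio = f"{len(chunk.text) / token_count:.2f}" if token_count != 'N/A' else 'N/A'
    print(f"  Ratio: {ratio} chars/token")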

print("\n" + "=" * 80)

Token-Based Splitting Results:


Chunk 1:
  Tokens: 100
  Characters: 461
  Ratio: 4.61 chars/token

Chunk 2:
  Tokens: 100
  Characters: 501
  Ratio: 5.01 chars/token

Chunk 3:
  Tokens: 31
  Characters: 150
  Ratio: 4.84 chars/token

Chunk 4:
  Tokens: 20
  Characters: 78
  Ratio: 3.90 chars/token

Chunk 5:
  Tokens: 19
  Characters: 76
  Ratio: 4.00 chars/token

Chunk 6:
  Tokens: 18
  Characters: 74
  Ratio: 4.11 chars/token

Chunk 7:
  Tokens: 17
  Characters: 70
  Ratio: 4.12 chars/token

Chunk 8:
  Tokens: 16
  Characters: 64
  Ratio: 4.00 chars/token

Chunk 9:
  Tokens: 15
  Characters: 63
  Ratio: 4.20 chars/token

Chunk 10:
  Tokens: 14
  Characters: 59
  Ratio: 4.21 chars/token

Chunk 11:
  Tokens: 13
  Characters: 55
  Ratio: 4.23 chars/token

Chunk 12:
  Tokens: 12
  Characters: 51
  Ratio: 4.25 chars/token

Chunk 13:
  Tokens: 11
  Characters: 45
  Ratio: 4.09 chars/token

Chunk 14:
  Tokens: 10
  Characters: 41
  Ratio: 4.10 chars/token

Chunk 15:
  Tokens: 9
  Character

## Step 4: Semantic Chunking

Semantic chunking creates chunks based on semantic boundaries using embeddings.

### How It Works

1. Split text into sentences
2. Generate embeddings for each sentence
3. Calculate similarity between consecutive sentences
4. Create boundaries where similarity drops below threshold
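
The four steps above can be sketched in plain Python. This is an illustration under stated assumptions, not Semantica's algorithm: a bag-of-words count vector stands in for a real embedding model such as `all-MiniLM-L6-v2`, the sentence splitter is a naive regex, and `semantic_chunks` is a hypothetical name.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(text, similarity_threshold):
    # Step 1: split into sentences (naive: break after ., ! or ? followed by whitespace).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    # Step 2: embed every sentence.
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev_vec, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        # Steps 3 and 4: compare neighbours and cut where similarity drops below the threshold.
        if cosine(prev_vec, vec) < similarity_threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


doc = (
    "Apple sells the iPhone. Apple also sells the iPad. "
    "Bananas grow in tropical climates. Bananas need warm tropical weather."
)
result = semantic_chunks(doc, similarity_threshold=0.3)
for chunk in result:
    print(repr(chunk))
```

With real embeddings the similarity scores are much less sensitive to exact word overlap, so a threshold tuned on this toy vectorizer would not carry over to a transformer model.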

In [6]:
from semantica.split import SemanticChunker

# Semantic chunking
semantic_chunker = SemanticChunker(
    chunk_size=200,
    chunk_overlap=50,
    embedding_model="all-MiniLM-L6-v2",
    similarity_threshold=0.7
)

chunks = semantic_chunker.chunk(text)

print("Semantic Chunking Results:\n")
print("=" * 80)

for i, chunk in enumerate(chunks, 1):
    coherence = chunk.metadata.get('coherence_score', 'N/A')
    print(f"\nChunk {i}:")
    print(f"  Length: {len(chunk.text)} chars")
    print(f"  Coherence: {coherence}")
    print(f"  Text: {chunk.text[:100]}...")

print("\n" + "=" * 80)

Status,Action,Module,Submodule,File,Time
✅,Semantica is splitting,✂️ split,SemanticChunker,-,0.15s
✅,Semantica is splitting,✂️ split,EntityAwareChunker,-,1.14s
✅,Semantica is extracting,🎯 semantic_extract,NERExtractor,-,0.50s
✅,Semantica is splitting,✂️ split,RelationAwareChunker,-,1.55s
✅,Semantica is extracting,🎯 semantic_extract,RelationExtractor,-,0.52s
✅,Semantica is splitting,✂️ split,StructuralChunker,-,0.01s
✅,Semantica is splitting,✂️ split,HierarchicalChunker,-,0.00s
✅,Semantica is splitting,✂️ split,SlidingWindowChunker,-,0.06s
❌,Semantica is splitting,✂️ split,ChunkValidator,-,0.01s


Semantic Chunking Results:


Chunk 1:
  Length: 133 chars
  Coherence: N/A
  Text: Apple Inc. is a technology company founded by Steve Jobs, Steve Wozniak, and Ronald Wayne 
in Cupert...

Chunk 2:
  Length: 194 chars
  Coherence: N/A
  Text: Wayne 
in Cupertino, California on April 1, 1976. The company's current CEO is Tim Cook, who took 
o...

Chunk 3:
  Length: 136 chars
  Coherence: N/A
  Text: headquartered at One Apple Park Way in Cupertino. Apple develops and sells consumer electronics, com...

Chunk 4:
  Length: 295 chars
  Coherence: N/A
  Text: ectronics, computer software, and online services. The company's 
hardware products include the iPho...

Chunk 5:
  Length: 223 chars
  Coherence: N/A
  Text: ital media player, and the 
HomePod smart speaker. Apple's software includes the macOS and iOS opera...

Chunk 6:
  Length: 159 chars
  Coherence: N/A
  Text: Life and iWork creativity and productivity suites. Its online services include the 
iTunes Store, th...



## Step 5: Entity-Aware Chunking for GraphRAG

Entity-aware chunking preserves entity boundaries, crucial for GraphRAG workflows.

### Why Entity-Aware?

- **Preserve Entities**: Don't split "Steve Jobs" across chunks
- **Better Extraction**: Complete entities improve NER accuracy
- **GraphRAG**: Essential for knowledge graph construction
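
One way to picture entity preservation is as a boundary adjustment: whenever a proposed cut falls inside an entity span, the cut is moved to the edge of that span. The sketch below assumes entity spans are already known and do not overlap; `entity_safe_cuts` and `find_spans` are hypothetical helpers, and exact string matching stands in for a real NER model.

```python
def entity_safe_cuts(text, chunk_size, entity_spans):
    """Return cut positions roughly every chunk_size characters, never inside an entity span."""
    cuts, pos = [], 0
    while pos + chunk_size < len(text):
        cut = pos + chunk_size
        for start, end in entity_spans:
            if start < cut < end:
                # Pull the cut back to the entity start; if that makes no progress, push past the entity.
                cut = start if start > pos else end
                break
        cuts.append(cut)
        pos = cut
    return cuts


def find_spans(text, names):
    """Hypothetical stand-in for NER: locate the first exact match of each known name."""
    return [(text.index(n), text.index(n) + len(n)) for n in names if n in text]


text = "Apple was founded by Steve Jobs and Steve Wozniak in Cupertino."
names = ["Steve Jobs", "Steve Wozniak", "Cupertino"]
cuts = entity_safe_cuts(text, chunk_size=25, entity_spans=find_spans(text, names))
bounds = [0, *cuts, len(text)]
for a, b in zip(bounds, bounds[1:]):
    print(repr(text[a:b]))
```

A plain 25-character cut would land in the middle of "Steve Jobs"; the adjusted cuts keep every name inside a single chunk at the cost of uneven chunk lengths.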

In [7]:
from semantica.split import EntityAwareChunker

# Entity-aware chunking
entity_chunker = EntityAwareChunker(
    chunk_size=200,
    chunk_overlap=50,
    ner_method="ml",  # "ml" (spaCy), "pattern", or "llm"
    preserve_entities=True
)

chunks = entity_chunker.chunk(text)

print("Entity-Aware Chunking Results:\n")
print("=" * 80)

for i, chunk in enumerate(chunks, 1):
    entities = chunk.metadata.get('entities', [])
    print(f"\nChunk {i}:")
    print(f"  Length: {len(chunk.text)} chars")
    print(f"  Entities: {len(entities)}")
    
    if entities:
        # Entities may be dicts or Entity objects; show only their text
        entity_texts = [
            e.get('text', e.get('entity', '')) if isinstance(e, dict) else getattr(e, 'text', str(e))
            for e in entities[:3]
        ]
        print(f"  Sample entities: {entity_texts}")

print("\n" + "=" * 80)

Entity-Aware Chunking Results:


Chunk 1:
  Length: 892 chars
  Entities: 28
  Sample entities: ['Apple Inc.', 'Steve Jobs', 'Steve Wozniak']



## Step 6: Relation-Aware Chunking

Relation-aware chunking preserves relationship triplets within chunks.

### Why Relation-Aware?

- **Preserve Triplets**: Keep (subject, predicate, object) together
- **KG Construction**: Better for building knowledge graphs
- **Context**: Relationships need complete context

In [8]:
from semantica.split import RelationAwareChunker

# Relation-aware chunking
relation_chunker = RelationAwareChunker(
    chunk_size=200,
    chunk_overlap=50,
    preserve_triplets=True
)

chunks = relation_chunker.chunk(text)

print("Relation-Aware Chunking Results:\n")
print("=" * 80)

for i, chunk in enumerate(chunks, 1):
    triplets = chunk.metadata.get('triplets', [])
    relationships = chunk.metadata.get('relationships', [])
    
    print(f"\nChunk {i}:")
    print(f"  Length: {len(chunk.text)} chars")
    print(f"  Triplets: {len(triplets)}")
    print(f"  Relationships: {len(relationships)}")

print("\n" + "=" * 80)

Relation-Aware Chunking Results:


Chunk 1:
  Length: 133 chars
  Triplets: 2
  Relationships: 2

Chunk 2:
  Length: 144 chars
  Triplets: 0
  Relationships: 0

Chunk 3:
  Length: 86 chars
  Triplets: 0
  Relationships: 0

Chunk 4:
  Length: 244 chars
  Triplets: 0
  Relationships: 0

Chunk 5:
  Length: 172 chars
  Triplets: 0
  Relationships: 0

Chunk 6:
  Length: 108 chars
  Triplets: 0
  Relationships: 0



## Step 7: Structural Chunking

Structural chunking respects document structure like headings, paragraphs, and lists.

### When to Use?

- **Formatted Documents**: Markdown, HTML, structured text
- **Preserve Hierarchy**: Keep sections together
- **Better Context**: Headings provide context
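
As a rough illustration of heading-based splitting, the sketch below groups Markdown lines into sections that each start at an ATX heading (`#` to `######`). It is a simplified assumption of how a structural chunker might work, not the `StructuralChunker` implementation; for example, it does not skip headings that appear inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown):
    """Group lines into sections, each starting at a Markdown ATX heading."""
    sections = []
    current = {"title": None, "level": 0, "lines": []}

    def flush(section):
        # Keep a section if it has a heading or at least one non-blank body line.
        if section["title"] is not None or any(ln.strip() for ln in section["lines"]):
            sections.append(section)

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush(current)
            current = {"title": match.group(2).strip(), "level": len(match.group(1)), "lines": []}
        else:
            current["lines"].append(line)
    flush(current)
    return sections


doc = """# Apple Inc.

## History

Founded in 1976.

## Products

- iPhone
- iPad
"""
sections = split_by_headings(doc)
for s in sections:
    body = "\n".join(s["lines"]).strip()
    print(s["level"], s["title"], "->", repr(body))
```

Storing the title and level next to each section is what lets a retriever show a chunk together with the heading that gives it context.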

In [9]:
from semantica.split import StructuralChunker

# Markdown text with structure
markdown_text = """
# Apple Inc.

## History

Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in 1976.

## Products

### Hardware
- iPhone
- iPad
- Mac

### Software
- macOS
- iOS
- Safari
"""

# Structural chunking
structural_chunker = StructuralChunker(
    respect_headings=True,
    respect_paragraphs=True,
    respect_lists=True,
    max_chunk_size=500
)

chunks = structural_chunker.chunk(markdown_text)

print("Structural Chunking Results:\n")
print("=" * 80)

for i, chunk in enumerate(chunks, 1):
    section = chunk.metadata.get('section_title', 'N/A')
    level = chunk.metadata.get('heading_level', 'N/A')
    
    print(f"\nChunk {i}:")
    print(f"  Section: {section}")
    print(f"  Level: {level}")
    print(f"  Text: {chunk.text[:80]}...")

print("\n" + "=" * 80)

Structural Chunking Results:


Chunk 1:
  Section: N/A
  Level: N/A
  Text: # Apple Inc....

Chunk 2:
  Section: N/A
  Level: N/A
  Text: ## History

Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayn...

Chunk 3:
  Section: N/A
  Level: N/A
  Text: ## Products

### Hardware

- iPhone
- iPad
- Mac

### Software

- macOS
- iOS
- ...



## Step 8: Hierarchical Chunking

Hierarchical chunking creates multi-level chunks for large documents.

### Benefits

- **Multiple Granularities**: Document → Section → Paragraph
- **Better Navigation**: Parent-child relationships
- **Flexible Retrieval**: Query at different levels
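
The parent-child structure can be sketched with a small recursive splitter. This is an assumption-laden illustration rather than `HierarchicalChunker` itself: it uses fixed-size character cuts with no overlap, and the `Node` dataclass and `build_hierarchy` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    id: str
    level: int
    text: str
    parent_id: Optional[str] = None
    child_ids: List[str] = field(default_factory=list)


def build_hierarchy(text, chunk_sizes):
    """Split text at chunk_sizes[0], then re-split every chunk at the next, smaller size."""
    nodes = []

    def split(segment, level, parent):
        size = chunk_sizes[level]
        for offset in range(0, len(segment), size):
            node = Node(
                id=f"L{level}-{len(nodes)}",  # len(nodes) keeps IDs unique across levels
                level=level,
                text=segment[offset:offset + size],
                parent_id=parent.id if parent else None,
            )
            nodes.append(node)
            if parent:
                parent.child_ids.append(node.id)
            if level + 1 < len(chunk_sizes):
                split(node.text, level + 1, node)

    split(text, 0, None)
    return nodes


nodes = build_hierarchy("x" * 500, chunk_sizes=[400, 200])
for n in nodes:
    print(n.id, n.level, len(n.text), n.parent_id, n.child_ids)
```

With links in both directions, a retriever can match on a small child chunk and then return its larger parent for additional context.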

In [10]:
from semantica.split import HierarchicalChunker

# Hierarchical chunking
hierarchical_chunker = HierarchicalChunker(
    chunk_sizes=[400, 200, 100],  # 3 levels
    chunk_overlaps=[80, 40, 20],
    create_parent_chunks=True
)

chunks = hierarchical_chunker.chunk(text)

print("Hierarchical Chunking Results:\n")
print("=" * 80)

for i, chunk in enumerate(chunks, 1):
    level = chunk.metadata.get('level', 'N/A')
    parent_id = chunk.metadata.get('parent_id', None)
    child_ids = chunk.metadata.get('child_ids', [])
    
    print(f"\nChunk {i}:")
    print(f"  Level: {level}")
    print(f"  Length: {len(chunk.text)} chars")
    print(f"  Parent: {parent_id if parent_id else 'None (root)'}")
    print(f"  Children: {len(child_ids)}")

print("\n" + "=" * 80)

Hierarchical Chunking Results:


Chunk 1:
  Level: N/A
  Length: 278 chars
  Parent: None (root)
  Children: 0

Chunk 2:
  Level: N/A
  Length: 331 chars
  Parent: None (root)
  Children: 0

Chunk 3:
  Level: N/A
  Length: 281 chars
  Parent: None (root)
  Children: 0



## Step 9: Sliding Window Chunking

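Sliding window chunking creates fixed-size chunks that overlap by a fixed amount: each window starts `step_size` characters after the previous one, so the overlap is `window_size - step_size`.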

### Use Cases

- **Dense Retrieval**: Ensure no information is missed
- **Fixed Context**: Consistent chunk sizes
- **Overlap Control**: Precise overlap management
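
A minimal sketch of the sliding window idea follows. The parameter names mirror `window_size` and `step_size`, but the function `sliding_windows` is hypothetical and independent of `SlidingWindowChunker`.

```python
def sliding_windows(text, window_size, step_size):
    """Return (start, end, chunk) tuples; consecutive windows overlap by window_size - step_size."""
    if not 0 < step_size <= window_size:
        raise ValueError("step_size must be between 1 and window_size")
    windows = []
    for start in range(0, len(text), step_size):
        end = min(start + window_size, len(text))
        windows.append((start, end, text[start:end]))
        if end == len(text):
            break  # the last window already reaches the end of the text
    return windows


# 320 characters of repeating alphabet so overlaps are easy to check by eye.
sample = "".join(chr(ord("a") + i % 26) for i in range(320))
windows = sliding_windows(sample, window_size=150, step_size=100)
for start, end, chunk in windows:
    print(f"{start}-{end}: {len(chunk)} chars")
```

Requiring `step_size <= window_size` guarantees that every character falls into at least one window, which is the property that makes this strategy attractive for dense retrieval.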

In [16]:
from semantica.split import SlidingWindowChunker

# Sliding window chunking
sliding_chunker = SlidingWindowChunker(
    window_size=150,
    step_size=100,  # 50 char overlap
    min_chunk_size=50
)

chunks = sliding_chunker.chunk(text)

print("Sliding Window Chunking Results:\n")
print("=" * 80)

for i, chunk in enumerate(chunks, 1):
    overlap = chunk.metadata.get('overlap_chars', 0)
    
    print(f"\nWindow {i}:")
    print(f"  Position: {chunk.start_index}-{chunk.end_index}")
    print(f"  Length: {len(chunk.text)} chars")
    print(f"  Overlap with previous: {overlap} chars")

print("\n" + "=" * 80)

## Step 10: Table Chunking

Table chunking preserves table structure while splitting large tables.

### Features

- **Preserve Headers**: Keep column headers in each chunk
- **Row-Based Splitting**: Split by rows, not characters
- **Context Inclusion**: Include surrounding text
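
Header preservation and row-based splitting can be illustrated on a plain Markdown table. The helper `chunk_markdown_table` is a hypothetical sketch, not part of Semantica; it assumes the input contains only the table, with a header row followed by a separator row.

```python
def chunk_markdown_table(table, max_rows_per_chunk):
    """Split a Markdown table into row groups, repeating the header and separator in each group."""
    lines = [line for line in table.strip().splitlines() if line.strip()]
    header, separator, rows = lines[0], lines[1], lines[2:]
    return [
        "\n".join([header, separator, *rows[i:i + max_rows_per_chunk]])
        for i in range(0, len(rows), max_rows_per_chunk)
    ]


table = """
| Product | Category | Release Year |
|---------|----------|--------------|
| iPhone | Smartphone | 2007 |
| iPad | Tablet | 2010 |
| Mac | Computer | 1984 |
| Apple Watch | Wearable | 2015 |
| AirPods | Audio | 2016 |
"""
parts = chunk_markdown_table(table, max_rows_per_chunk=3)
for part in parts:
    print(part, end="\n\n")
```

Each resulting chunk is a valid table on its own, so a model reading only the second chunk still sees which column is the category and which is the release year.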

In [12]:
from semantica.split import TableChunker

# Text with table
text_with_table = """
Apple's product lineup includes:

| Product | Category | Release Year |
|---------|----------|-------------|
| iPhone | Smartphone | 2007 |
| iPad | Tablet | 2010 |
| Mac | Computer | 1984 |
| Apple Watch | Wearable | 2015 |
| AirPods | Audio | 2016 |

These products have revolutionized their respective categories.
"""

# Table chunking
table_chunker = TableChunker(
    preserve_headers=True,
    max_rows_per_chunk=3,
    include_context=True,
    table_format="markdown"
)

chunks = table_chunker.chunk(text_with_table)

print("Table Chunking Results:\n")
print("=" * 80)

for i, chunk in enumerate(chunks, 1):
    is_table = chunk.metadata.get('is_table', False)
    
    print(f"\nChunk {i}:")
    print(f"  Type: {'Table' if is_table else 'Text'}")
    
    if is_table:
        rows = chunk.metadata.get('row_count', 'N/A')
        cols = chunk.metadata.get('column_count', 'N/A')
        print(f"  Rows: {rows}, Columns: {cols}")
    
    print(f"  Content: {chunk.text[:100]}...")

print("\n" + "=" * 80)

AttributeError: 'TableChunker' object has no attribute 'chunk'

## Step 11: Chunk Validation

Validate chunk quality to ensure optimal processing.

### Validation Checks

- **Size Constraints**: Min/max chunk size
- **Overlap**: Appropriate overlap percentage
- **Completeness**: Full text coverage
- **Quality Score**: Overall quality metric
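
The size and completeness checks can be sketched on character offsets alone. This is a library-independent illustration with a hypothetical `validate_chunks` function; it does not reproduce the `ChunkValidator` API, and it leaves out the quality score because the source does not define how that score is computed.

```python
def validate_chunks(chunks, source_length, min_size, max_size):
    """Check (start, end) character offsets for size limits and full coverage of the source."""
    issues = []
    for i, (start, end) in enumerate(chunks):
        size = end - start
        if not min_size <= size <= max_size:
            issues.append(f"chunk {i}: size {size} outside [{min_size}, {max_size}]")
    # Completeness: walking the chunks in start order must leave no uncovered gap.
    covered = 0
    for start, end in sorted(chunks):
        if start > covered:
            issues.append(f"gap: characters {covered}-{start} are not covered")
        covered = max(covered, end)
    if covered < source_length:
        issues.append(f"tail: characters {covered}-{source_length} are not covered")
    return {"valid": not issues, "issues": issues}


report = validate_chunks(
    [(0, 200), (150, 350), (400, 420)],
    source_length=450, min_size=50, max_size=300,
)
print(report["valid"])
for issue in report["issues"]:
    print(" -", issue)
```

Working with offsets instead of chunk text keeps the coverage check exact even when overlapping chunks repeat the same characters.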

In [13]:
from semantica.split import ChunkValidator

# Create chunks
splitter = TextSplitter(method="recursive", chunk_size=200, chunk_overlap=50)
chunks = splitter.split(text)

# Validate chunks
validator = ChunkValidator(
    min_chunk_size=50,
    max_chunk_size=300,
    min_overlap=20,
    max_overlap=100
)

validation_result = validator.validate(chunks)

print("Chunk Validation Results:\n")
print("=" * 80)

print(f"\nOverall Valid: {validation_result.get('valid', False)}")
print(f"Quality Score: {validation_result.get('quality_score', 0):.2f}")

issues = validation_result.get('issues', [])
if issues:
    print(f"\nIssues Found: {len(issues)}")
    for issue in issues[:3]:
        print(f"  - {issue}")
else:
    print("\nNo issues found!")

print("\n" + "=" * 80)

AttributeError: 'list' object has no attribute 'text'

## Step 12: Provenance Tracking

Track chunk origins for data lineage and debugging.

### Why Track Provenance?

- **Data Lineage**: Know where chunks came from
- **Debugging**: Trace issues back to source
- **Compliance**: Required for some use cases
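
At its core, provenance tracking is a mapping from a chunk identifier to a record of where the chunk came from. The class below is a hypothetical, minimal sketch of that idea (`SimpleProvenanceLog`, `record`, and `lineage` are invented names) and makes no claim about how `ProvenanceTracker` works internally.

```python
import hashlib
from datetime import datetime, timezone


class SimpleProvenanceLog:
    """Hypothetical minimal tracker: maps a chunk ID to its source record."""

    def __init__(self):
        self._records = {}

    def record(self, chunk_text, start, end, source):
        # Derive a stable ID from the document ID, the offsets and the content.
        key = f"{source['document_id']}:{start}:{end}:{chunk_text}"
        chunk_id = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
        self._records[chunk_id] = {
            "source": dict(source),  # copy so later changes by the caller do not alter the log
            "start": start,
            "end": end,
            "recorded_at": datetime.now(timezone.utc).isoformat(),
        }
        return chunk_id

    def lineage(self, chunk_id):
        """Return the stored record, or None for an unknown chunk ID."""
        return self._records.get(chunk_id)


log = SimpleProvenanceLog()
source = {"document_id": "apple_doc_001", "file_path": "data/apple.txt", "method": "recursive"}
chunk_id = log.record("Apple Inc. is a technology company", 0, 34, source)
print(chunk_id, log.lineage(chunk_id)["source"]["file_path"])
```

Hashing the document ID, offsets, and text yields the same identifier on every run, which makes it possible to trace a chunk back to its source even after re-chunking the same document.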

In [14]:
from semantica.split import ProvenanceTracker

# Create chunks
splitter = TextSplitter(method="recursive", chunk_size=200, chunk_overlap=50)
chunks = splitter.split(text)

# Track provenance
tracker = ProvenanceTracker()

for chunk in chunks:
    tracker.track(
        chunk=chunk,
        source={
            "document_id": "apple_doc_001",
            "file_path": "data/apple.txt",
            "timestamp": "2024-01-01T00:00:00Z",
            "method": "recursive"
        }
    )

print("Provenance Tracking Results:\n")
print("=" * 80)

# Get lineage for first chunk
if chunks:
    lineage = tracker.get_lineage(chunks[0].id)
    
    print(f"\nLineage for Chunk 1:")
    print(f"  Source Document: {lineage.get('source', {}).get('document_id')}")
    print(f"  File Path: {lineage.get('source', {}).get('file_path')}")
    print(f"  Method: {lineage.get('source', {}).get('method')}")
    print(f"  Timestamp: {lineage.get('source', {}).get('timestamp')}")

print("\n" + "=" * 80)

AttributeError: 'ProvenanceTracker' object has no attribute 'track'

## Step 13: Method Comparison

Let's compare all methods side-by-side to help you choose the right one.

### Comparison Criteria

- **Chunk Count**: Number of chunks created
- **Average Size**: Average chunk size
- **Processing Time**: Speed of chunking

In [15]:
import time

# Methods to compare
methods_to_compare = [
    ("recursive", {}),
    ("sentence", {}),
    ("paragraph", {}),
    ("token", {"tokenizer": "tiktoken"}),
]

print("Method Comparison:\n")
print("=" * 80)
print(f"{'Method':<15} {'Chunks':<10} {'Avg Size':<12} {'Time (ms)':<12}")
print("-" * 80)

for method, kwargs in methods_to_compare:
    try:
        start_time = time.time()
        
        splitter = TextSplitter(
            method=method,
            chunk_size=200,
            chunk_overlap=50,
            **kwargs
        )
        
        chunks = splitter.split(text)
        
        elapsed = (time.time() - start_time) * 1000
        avg_size = sum(len(c.text) for c in chunks) / len(chunks) if chunks else 0
        
        print(f"{method:<15} {len(chunks):<10} {avg_size:<12.0f} {elapsed:<12.2f}")
        
    except Exception as e:
        print(f"{method:<15} Error: {str(e)[:40]}")

print("=" * 80)

Method Comparison:

Method          Chunks     Avg Size     Time (ms)   
--------------------------------------------------------------------------------
recursive       8          154          0.00        
sentence        6          148          519.51      
paragraph       3          297          0.00        
token           51         129          0.00        


## Step 14: Best Practices

### Choosing the Right Method

1. **General Documents**: Use `recursive` for speed and simplicity
2. **LLM Applications**: Use `token` to respect context windows
3. **Semantic Search**: Use `semantic_transformer` for topic coherence
4. **GraphRAG**: Use `entity_aware` or `relation_aware`
5. **Structured Docs**: Use `structural` for formatted documents
6. **Large Documents**: Use `hierarchical` for multi-level access

### Chunk Size Guidelines

| Use Case | Recommended Size | Overlap |
|----------|------------------|----------|
| Semantic Search | 512-1024 chars | 20% |
| LLM Context | 2000-4000 chars | 10-20% |
| Entity Extraction | 500-1500 chars | 15-25% |
| Question Answering | 1000-2000 chars | 20% |

### Overlap Recommendations

- **10-15%**: Fast processing, less redundancy
- **20-25%**: Balanced (recommended)
- **30-40%**: Maximum context preservation
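
Since `chunk_overlap` is given as an absolute value in the examples in this notebook, the percentages above need converting. The sketch below shows that arithmetic along with a rough estimate of how many fixed-size chunks a document produces; both helper names are hypothetical, and the estimate assumes plain fixed-size windows rather than boundary-aware splitting.

```python
import math


def overlap_chars(chunk_size, overlap_ratio):
    """Convert an overlap percentage (as a fraction, e.g. 0.2) into a character count."""
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    return round(chunk_size * overlap_ratio)


def estimate_chunk_count(text_length, chunk_size, chunk_overlap):
    """Number of fixed-size windows needed to cover text_length characters."""
    if text_length <= chunk_size:
        return 1 if text_length else 0
    step = chunk_size - chunk_overlap
    return 1 + math.ceil((text_length - chunk_size) / step)


chunk_size = 1000
overlap = overlap_chars(chunk_size, 0.20)
print(f"chunk_size={chunk_size}, chunk_overlap={overlap}")
print("chunks for a 10,000-character document:",
      estimate_chunk_count(10_000, chunk_size, overlap))
```

Raising the overlap shrinks the step between chunks, so the chunk count (and with it embedding and storage cost) grows as overlap increases.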

## Summary

### What You've Learned

In this notebook, you've learned how to:

- Use `TextSplitter` with multiple methods
- Apply standard splitting (recursive, token, sentence, paragraph)
- Use semantic chunking for topic coherence
- Apply KG-aware chunking (entity-aware, relation-aware)
- Use specialized chunkers (structural, hierarchical, sliding window, table)
- Validate chunk quality
- Track provenance
- Choose the right method for your use case

### Key Takeaways

1. **Method Selection Matters**: Different methods for different needs
2. **Chunk Size is Critical**: Balance between context and processing
3. **Overlap Helps**: 20% overlap is a good default
4. **Validate Quality**: Always validate chunks before use
5. **Track Provenance**: Important for debugging and compliance
6. **KG-Aware for GraphRAG**: Use entity/relation-aware for knowledge graphs

### Next Steps

**Next Notebook**: [12_Embedding_Generation.ipynb](./12_Embedding_Generation.ipynb)  
Learn how to generate embeddings for your chunks!

**Further Reading**:
- [Split Module API Reference](https://semantica.readthedocs.io/reference/split/)
- [Advanced Chunking Strategies](../advanced/11_Text_Chunking_Strategies.ipynb)
- [GraphRAG Pipeline](../use_cases/advanced_rag/01_GraphRAG_Complete.ipynb)

---

**Questions or Issues?** Check out our [GitHub repository](https://github.com/Hawksight-AI/semantica) or [documentation](https://semantica.readthedocs.io).