# Document Splitting Techniques: The Foundation of Effective RAG

## Table of Contents
1. [Introduction to Document Splitting](#introduction)
2. [Fixed-Size Chunking](#fixed-size)
3. [Semantic Chunking](#semantic-chunking)
4. [Recursive Character Splitting](#recursive)
5. [Sentence-Aware Splitting](#sentence-aware)
6. [HTML/Markdown Splitting](#html-markdown)
7. [Code Splitting](#code-splitting)
8. [Performance Comparison](#performance)
9. [Best Practices & Guidelines](#best-practices)
10. [Real-World Examples](#real-world)

---

## Introduction to Document Splitting {#introduction}

Document splitting is the process of breaking down large documents into smaller, manageable chunks that can be effectively processed by embedding models and retrieved by RAG systems.

### Why Splitting Matters

**The Challenge**: 
- LLMs and embedding models have finite context windows (e.g., 4K, 8K, or 32K tokens)
- Large documents can overwhelm the context window
- Irrelevant information can dilute the response quality

**The Solution**:
- Split documents into focused, coherent chunks
- Preserve semantic meaning within chunks
- Enable precise retrieval of relevant information

### Key Considerations

1. **Chunk Size**: Balance between context and precision
2. **Overlap**: Ensure continuity between chunks
3. **Semantic Boundaries**: Respect natural language boundaries
4. **Metadata Preservation**: Maintain document structure and context
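
The trade-off between chunk size and overlap can be made concrete with a little arithmetic. The sketch below assumes a plain sliding window over characters, where each new chunk starts `chunk_size - chunk_overlap` characters after the previous one. The function name `estimate_chunk_count` is hypothetical and is not used elsewhere in this notebook.

```python
import math

def estimate_chunk_count(doc_length: int, chunk_size: int, chunk_overlap: int) -> int:
    """Estimate how many sliding-window chunks are needed to cover a document."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if doc_length <= 0:
        return 0
    if doc_length <= chunk_size:
        return 1
    # Each chunk after the first contributes only (chunk_size - chunk_overlap) new characters.
    stride = chunk_size - chunk_overlap
    return math.ceil((doc_length - chunk_overlap) / stride)

# More overlap means more chunks (and more embeddings to store) for the same document.
for overlap in (0, 50, 100):
    count = estimate_chunk_count(doc_length=2000, chunk_size=500, chunk_overlap=overlap)
    print(f"overlap={overlap:>3}: {count} chunks")
```

Larger overlap improves continuity between neighbouring chunks at the cost of extra storage and embedding calls, which is why overlap is usually kept to a small fraction of the chunk size.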

In [None]:
# Install required packages
!pip install -q langchain langchain-text-splitters tiktoken nltk spacy sentence-transformers

# Import necessary libraries
import os
import re
import json
from typing import List, Dict, Any, Tuple, Optional
from dataclasses import dataclass
import numpy as np
import tiktoken
from sentence_transformers import SentenceTransformer
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

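# Download required NLTK data
# (newer NLTK releases load the sentence tokenizer from 'punkt_tab' instead of 'punkt')
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)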

# Set up encoding for token counting
encoding = tiktoken.get_encoding("cl100k_base")

print("✅ All packages imported successfully!")
print("🔧 Environment configured for document splitting analysis")

## Fixed-Size Chunking {#fixed-size}

The simplest and most common approach: split text into chunks of a fixed character or token length.

### Pros:
- ✅ Simple to implement
- ✅ Predictable chunk sizes
- ✅ Fast processing
- ✅ Works with any text type

### Cons:
- ❌ May break sentences mid-way
- ❌ Can lose semantic context
- ❌ No consideration of content structure
- ❌ May create incoherent chunks
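
The first drawback is easy to see on a toy input. The snippet below is a minimal illustration, separate from the splitter class in this notebook: it cuts a made-up three-sentence string every 40 characters and reports which chunks happen to end on a sentence boundary.

```python
# Hypothetical example text; any short multi-sentence string works.
text = ("Chunking splits documents. Each chunk is embedded separately. "
        "Retrieval returns the closest chunks.")
chunk_size = 40  # characters

# Naive fixed-size split with no overlap.
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# A chunk "ends cleanly" if its last non-space character closes a sentence.
ends_cleanly = [c.rstrip().endswith((".", "!", "?")) for c in chunks]

for chunk, clean in zip(chunks, ends_cleanly):
    print(f"{chunk!r:45} ends on sentence boundary: {clean}")
```

Overlap softens this problem because the cut-off sentence reappears at the start of the next chunk, but it does not remove the mid-sentence cut itself.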

In [None]:
@dataclass
class Chunk:
    """Represents a text chunk with metadata"""
    content: str
    start_index: int
    end_index: int
    chunk_id: str
    metadata: Optional[Dict[str, Any]] = None

class FixedSizeSplitter:
    """Fixed-size text splitter"""
    
    def __init__(self, chunk_size: int = 500, chunk_overlap: int = 50, 
                 use_tokens: bool = False):
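        # An overlap >= chunk_size would stop the window from advancing
        if chunk_overlap >= chunk_size:
            raise ValueError("chunk_overlap must be smaller than chunk_size")
        self.chunk_size = chunk_size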
        self.chunk_overlap = chunk_overlap
        self.use_tokens = use_tokens
        
    def count_tokens(self, text: str) -> int:
        """Count tokens in text"""
        return len(encoding.encode(text))
    
    def split_text(self, text: str, metadata: Dict = None) -> List[Chunk]:
        """Split text into fixed-size chunks"""
        chunks = []
        
        if self.use_tokens:
            # Token-based splitting
            tokens = encoding.encode(text)
            start = 0
            chunk_id = 0
            
            while start < len(tokens):
                end = min(start + self.chunk_size, len(tokens))
                chunk_tokens = tokens[start:end]
                chunk_text = encoding.decode(chunk_tokens)
                
                chunk = Chunk(
                    content=chunk_text,
                    start_index=start,  # token offset, not a character offset
                    end_index=end,
                    chunk_id=f"chunk_{chunk_id}",
                    metadata=metadata or {}
                )
                chunks.append(chunk)
                
                start = end - self.chunk_overlap
                chunk_id += 1
                
                if start >= len(tokens) - self.chunk_overlap:
                    break
        else:
            # Character-based splitting
            start = 0
            chunk_id = 0
            
            while start < len(text):
                end = min(start + self.chunk_size, len(text))
                chunk_text = text[start:end]
                
                chunk = Chunk(
                    content=chunk_text,
                    start_index=start,
                    end_index=end,
                    chunk_id=f"chunk_{chunk_id}",
                    metadata=metadata or {}
                )
                chunks.append(chunk)
                
                start = end - self.chunk_overlap
                chunk_id += 1
                
                if start >= len(text) - self.chunk_overlap:
                    break
        
        return chunks

# Sample technical documentation
sample_doc = """
# Machine Learning Fundamentals

## Introduction
Machine learning is a subset of artificial intelligence that focuses on algorithms that can learn from data. It has revolutionized many industries including healthcare, finance, and technology.

## Types of Machine Learning

### Supervised Learning
Supervised learning uses labeled training data to learn a mapping from inputs to outputs. Common algorithms include:
- Linear Regression: Used for predicting continuous values
- Decision Trees: Create a tree-like model of decisions
- Random Forest: Ensemble method using multiple decision trees
- Support Vector Machines: Find optimal hyperplane for classification

### Unsupervised Learning
Unsupervised learning finds patterns in data without labeled examples:
- Clustering: Group similar data points together
- Dimensionality Reduction: Reduce number of features while preserving information
- Association Rules: Find relationships between variables

### Reinforcement Learning
Reinforcement learning learns through interaction with an environment:
- Agent takes actions in an environment
- Receives rewards or penalties
- Learns optimal policy through trial and error

## Applications
Machine learning is used in:
- Recommendation systems (Netflix, Amazon)
- Image recognition (medical diagnosis, autonomous vehicles)
- Natural language processing (chatbots, translation)
- Fraud detection (banking, insurance)
- Predictive analytics (weather, stock markets)

## Best Practices
1. Start with simple models
2. Ensure data quality
3. Use cross-validation
4. Monitor for overfitting
5. Consider interpretability
"""

# Test fixed-size splitting
print("📄 Sample Document Length:", len(sample_doc), "characters")
print("🔢 Token Count:", len(encoding.encode(sample_doc)), "tokens")

# Character-based splitting
char_splitter = FixedSizeSplitter(chunk_size=200, chunk_overlap=50, use_tokens=False)
char_chunks = char_splitter.split_text(sample_doc, {"source": "ml_fundamentals"})

print(f"\n📊 Character-based splitting results:")
print(f"Number of chunks: {len(char_chunks)}")
print(f"Average chunk length: {np.mean([len(chunk.content) for chunk in char_chunks]):.0f} characters")

# Token-based splitting
token_splitter = FixedSizeSplitter(chunk_size=100, chunk_overlap=25, use_tokens=True)
token_chunks = token_splitter.split_text(sample_doc, {"source": "ml_fundamentals"})

print(f"\n📊 Token-based splitting results:")
print(f"Number of chunks: {len(token_chunks)}")
print(f"Average token count: {np.mean([len(encoding.encode(chunk.content)) for chunk in token_chunks]):.0f} tokens")

# Show sample chunks
print(f"\n🔍 Sample chunks (character-based):")
for i, chunk in enumerate(char_chunks[:2]):
    print(f"\nChunk {i+1} (ID: {chunk.chunk_id}):")
    print(f"Content: {chunk.content[:150]}...")
    print(f"Length: {len(chunk.content)} characters")

## Semantic Chunking {#semantic-chunking}

Split text based on semantic similarity: consecutive sentences that are semantically related are grouped into the same chunk.

### Pros:
- ✅ Preserves semantic coherence
- ✅ Better for retrieval tasks
- ✅ Reduces noise in chunks
- ✅ More meaningful chunk boundaries

### Cons:
- ❌ More complex to implement
- ❌ Requires embedding model
- ❌ Slower processing
- ❌ May create very small or large chunks
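
The core idea, a chunk boundary wherever the similarity between neighbouring sentences drops below a threshold, can be sketched without an embedding model. The example below uses hand-made 2-D vectors as stand-ins for real sentence embeddings; `find_breakpoints` is a hypothetical helper name, and the vectors are assumed to be non-zero.

```python
import math
from typing import List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def find_breakpoints(embeddings: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Return every index i at which sentence i should start a new chunk."""
    return [
        i for i in range(1, len(embeddings))
        if cosine(embeddings[i - 1], embeddings[i]) < threshold
    ]

# Toy "embeddings": sentences 0-1 point one way, sentences 2-3 another.
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
print(find_breakpoints(toy_embeddings, threshold=0.7))  # [2]
```

With real embeddings the threshold usually needs tuning per model and corpus, since typical similarity values between adjacent sentences vary widely.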

In [None]:
class SemanticSplitter:
    """Semantic text splitter based on sentence similarity"""
    
    def __init__(self, model_name: str = "all-MiniLM-L6-v2", 
                 similarity_threshold: float = 0.7,
                 min_chunk_size: int = 100,
                 max_chunk_size: int = 1000):
        self.model = SentenceTransformer(model_name)
        self.similarity_threshold = similarity_threshold
        self.min_chunk_size = min_chunk_size
        self.max_chunk_size = max_chunk_size
        
    def split_text(self, text: str, metadata: Dict = None) -> List[Chunk]:
        """Split text based on semantic similarity"""
        # Split into sentences
        sentences = sent_tokenize(text)
        
        if len(sentences) <= 1:
            return [Chunk(
                content=text,
                start_index=0,
                end_index=len(text),
                chunk_id="chunk_0",
                metadata=metadata or {}
            )]
        
        # Generate unit-length embeddings so the dot product below is cosine similarity
        sentence_embeddings = self.model.encode(sentences, normalize_embeddings=True)
        
        # Group sentences by similarity
        chunks = []
        current_chunk = [sentences[0]]
        current_start = 0
        
        for i in range(1, len(sentences)):
            # Cosine similarity with the previous sentence
            similarity = np.dot(sentence_embeddings[i-1], sentence_embeddings[i])
            
            # Check if we should start a new chunk
            current_text = " ".join(current_chunk)
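            # Split on a topic shift, or if adding this sentence would exceed max_chunk_size
            should_split = (
                similarity < self.similarity_threshold or
                len(current_text) + 1 + len(sentences[i]) > self.max_chunk_size
            )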
            
            if should_split and len(current_text) >= self.min_chunk_size:
                # Create chunk
                chunk_text = " ".join(current_chunk)
                chunk = Chunk(
                    content=chunk_text,
                    start_index=current_start,
                    end_index=current_start + len(chunk_text),
                    chunk_id=f"chunk_{len(chunks)}",
                    metadata=metadata or {}
                )
                chunks.append(chunk)
                
                # Start new chunk
                current_chunk = [sentences[i]]
                current_start += len(chunk_text) + 1
            else:
                current_chunk.append(sentences[i])
        
        # Add final chunk
        if current_chunk:
            chunk_text = " ".join(current_chunk)
            chunk = Chunk(
                content=chunk_text,
                start_index=current_start,
                end_index=current_start + len(chunk_text),
                chunk_id=f"chunk_{len(chunks)}",
                metadata=metadata or {}
            )
            chunks.append(chunk)
        
        return chunks

# Test semantic splitting
semantic_splitter = SemanticSplitter(
    similarity_threshold=0.6,
    min_chunk_size=150,
    max_chunk_size=800
)

semantic_chunks = semantic_splitter.split_text(sample_doc, {"source": "ml_fundamentals"})

print(f"📊 Semantic splitting results:")
print(f"Number of chunks: {len(semantic_chunks)}")
print(f"Average chunk length: {np.mean([len(chunk.content) for chunk in semantic_chunks]):.0f} characters")

# Show sample chunks
print(f"\n🔍 Sample semantic chunks:")
for i, chunk in enumerate(semantic_chunks[:3]):
    print(f"\nChunk {i+1} (ID: {chunk.chunk_id}):")
    print(f"Content: {chunk.content[:200]}...")
    print(f"Length: {len(chunk.content)} characters")
    print(f"Sentences: {len(sent_tokenize(chunk.content))}")

## Recursive Character Splitting {#recursive}

A hierarchical approach that tries a list of separators in order of preference, from coarse (paragraph breaks) to fine (single characters), until the chunks fit the size limit.

### Pros:
- ✅ Respects document structure
- ✅ Handles multiple text types
- ✅ Configurable splitting hierarchy
- ✅ Good balance of structure and size

### Cons:
- ❌ Can be complex to configure
- ❌ May not always find optimal splits
- ❌ Requires understanding of document structure
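
Before the full class, the separator hierarchy can be shown in a compact form. This is a simplified sketch under stated assumptions: no overlap, no offsets or metadata, and a shortened separator list. The function `recursive_split` is a hypothetical name and is not the class defined in this notebook. Only pieces that are still too large descend to the next, finer separator, and neighbouring pieces are merged back together while they fit.

```python
from typing import List, Sequence

def recursive_split(text: str, chunk_size: int,
                    separators: Sequence[str] = ("\n\n", "\n", ". ", " ")) -> List[str]:
    """Split text so that every piece is at most chunk_size characters (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard character cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    pieces: List[str] = []
    for i, part in enumerate(parts):
        # Re-attach the separator so that joining all pieces restores the text.
        piece = part + sep if i < len(parts) - 1 else part
        # Only oversized pieces are split again with the finer separators.
        pieces.extend(recursive_split(piece, chunk_size, finer))

    # Greedily merge neighbouring pieces while the result still fits.
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if current and len(current) + len(piece) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks

doc = "Intro paragraph.\n\nSecond paragraph is longer. It has two sentences.\n\nEnd."
for chunk in recursive_split(doc, chunk_size=40):
    print(repr(chunk))
```

Because separators are re-attached, concatenating the returned chunks reproduces the input exactly, which makes the behaviour easy to check.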

In [None]:
class RecursiveSplitter:
    """Recursive character splitter with multiple strategies"""
    
    def __init__(self, chunk_size: int = 500, chunk_overlap: int = 50):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        
        # Define splitting separators in order of preference
        self.separators = [
            "\n\n",  # Paragraph breaks
            "\n",    # Line breaks
            ". ",    # Sentence endings
            "! ",    # Exclamation marks
            "? ",    # Question marks
            "; ",    # Semicolons
            ", ",    # Commas
            " ",     # Spaces
            ""       # Character level
        ]
    
    def split_text(self, text: str, metadata: Dict = None) -> List[Chunk]:
        """Split text using recursive strategy"""
        return self._split_text_recursive(text, 0, metadata or {})
    
    def _split_text_recursive(self, text: str, start_index: int, 
                            metadata: Dict, separator_index: int = 0) -> List[Chunk]:
        """Recursively split text using separators"""
        
        # If text is small enough, return as single chunk
        if len(text) <= self.chunk_size:
            return [Chunk(
                content=text,
                start_index=start_index,
                end_index=start_index + len(text),
                chunk_id=f"chunk_{start_index}",
                metadata=metadata
            )]
        
        # If we've tried all separators, split by character
        if separator_index >= len(self.separators):
            return self._split_by_character(text, start_index, metadata)
        
        separator = self.separators[separator_index]
        
        # Split by current separator
        if separator:
            parts = text.split(separator)
        else:
            parts = list(text)
        
        # If splitting didn't help, try next separator
        if len(parts) <= 1:
            return self._split_text_recursive(text, start_index, metadata, separator_index + 1)
        
        # Group parts into chunks
        chunks = []
        current_chunk = ""
        current_start = start_index
        
        for i, part in enumerate(parts):
            # Add separator back (except for last part)
            if i < len(parts) - 1 and separator:
                part_with_sep = part + separator
            else:
                part_with_sep = part
            
            # Check if adding this part would exceed chunk size
            if len(current_chunk + part_with_sep) > self.chunk_size and current_chunk:
                # Create chunk
                chunk = Chunk(
                    content=current_chunk.strip(),
                    start_index=current_start,
                    end_index=current_start + len(current_chunk),
                    chunk_id=f"chunk_{current_start}",
                    metadata=metadata
                )
                chunks.append(chunk)
                
                # Start new chunk with overlap; advance the offset by the emitted
                # chunk's length minus the characters carried over as overlap
                overlap_text = current_chunk[-self.chunk_overlap:] if self.chunk_overlap > 0 else ""
                current_start += len(current_chunk) - len(overlap_text)
                current_chunk = overlap_text + part_with_sep
            else:
                current_chunk += part_with_sep
        
        # Add final chunk
        if current_chunk.strip():
            chunk = Chunk(
                content=current_chunk.strip(),
                start_index=current_start,
                end_index=current_start + len(current_chunk),
                chunk_id=f"chunk_{current_start}",
                metadata=metadata
            )
            chunks.append(chunk)
        
        # If chunks are still too large, try next separator
        if any(len(chunk.content) > self.chunk_size for chunk in chunks):
            return self._split_text_recursive(text, start_index, metadata, separator_index + 1)
        
        return chunks
    
    def _split_by_character(self, text: str, start_index: int, metadata: Dict) -> List[Chunk]:
        """Split text by character when all other methods fail"""
        chunks = []
        # Guard against a non-positive stride when chunk_overlap >= chunk_size
        step = max(1, self.chunk_size - self.chunk_overlap)
        
        for i in range(0, len(text), step):
            end = min(i + self.chunk_size, len(text))
            chunk_text = text[i:end]
            
            # i is the offset within text, so the absolute offset is start_index + i
            chunk = Chunk(
                content=chunk_text,
                start_index=start_index + i,
                end_index=start_index + end,
                chunk_id=f"chunk_{start_index + i}",
                metadata=metadata
            )
            chunks.append(chunk)
            
            # Stop once the end is reached to avoid a redundant trailing chunk
            if end == len(text):
                break
        
        return chunks

# Test recursive splitting
recursive_splitter = RecursiveSplitter(chunk_size=300, chunk_overlap=50)
recursive_chunks = recursive_splitter.split_text(sample_doc, {"source": "ml_fundamentals"})

print(f"📊 Recursive splitting results:")
print(f"Number of chunks: {len(recursive_chunks)}")
print(f"Average chunk length: {np.mean([len(chunk.content) for chunk in recursive_chunks]):.0f} characters")

# Show sample chunks
print(f"\n🔍 Sample recursive chunks:")
for i, chunk in enumerate(recursive_chunks[:3]):
    print(f"\nChunk {i+1} (ID: {chunk.chunk_id}):")
    print(f"Content: {chunk.content[:200]}...")
    print(f"Length: {len(chunk.content)} characters")
    print(f"Starts with: '{chunk.content[:50]}...'")
    print(f"Ends with: '...{chunk.content[-50:]}'")

## Sentence-Aware Splitting {#sentence-aware}

Split text while respecting sentence boundaries to maintain coherence.

### Pros:
- ✅ Preserves sentence integrity
- ✅ Better for natural language processing
- ✅ Maintains grammatical structure
- ✅ Good for question-answering tasks

### Cons:
- ❌ May create very small chunks
- ❌ Doesn't consider semantic similarity
- ❌ Can break paragraph context
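
The packing step at the heart of this strategy fits in a few lines. The sketch below is a dependency-free approximation: it uses a simple regular expression in place of NLTK's `sent_tokenize`, so it will mis-split on abbreviations such as "e.g.", and it omits overlap. `pack_sentences` is a hypothetical helper name.

```python
import re
from typing import List

def pack_sentences(text: str, chunk_size: int) -> List[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size characters.

    A single sentence longer than chunk_size is kept whole rather than cut.
    """
    # Naive sentence boundary: whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > chunk_size:
            # The sentence does not fit: close the chunk and start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

print(pack_sentences("One two. Three four! Five six? Seven.", chunk_size=20))
# ['One two. Three four!', 'Five six? Seven.']
```

Note that the length check here includes the joining space, so a chunk never exceeds `chunk_size` unless a single sentence does.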

In [None]:
class SentenceAwareSplitter:
    """Split text while respecting sentence boundaries"""
    
    def __init__(self, chunk_size: int = 500, chunk_overlap: int = 50):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
    
    def split_text(self, text: str, metadata: Dict = None) -> List[Chunk]:
        """Split text respecting sentence boundaries"""
        # Split into sentences
        sentences = sent_tokenize(text)
        
        if len(sentences) <= 1:
            return [Chunk(
                content=text,
                start_index=0,
                end_index=len(text),
                chunk_id="chunk_0",
                metadata=metadata or {}
            )]
        
        chunks = []
        current_chunk = []
        current_length = 0
        current_start = 0
        chunk_id = 0
        
        for sentence in sentences:
            sentence_length = len(sentence)
            
            # Check if adding this sentence would exceed chunk size
            if current_length + sentence_length > self.chunk_size and current_chunk:
                # Create chunk
                chunk_text = " ".join(current_chunk)
                chunk = Chunk(
                    content=chunk_text,
                    start_index=current_start,
                    end_index=current_start + len(chunk_text),
                    chunk_id=f"chunk_{chunk_id}",
                    metadata=metadata or {}
                )
                chunks.append(chunk)
                
                # Start new chunk with overlap
                if self.chunk_overlap > 0:
                    # Keep last few sentences for overlap
                    overlap_sentences = []
                    overlap_length = 0
                    for sent in reversed(current_chunk):
                        if overlap_length + len(sent) <= self.chunk_overlap:
                            overlap_sentences.insert(0, sent)
                            overlap_length += len(sent)
                        else:
                            break
                    current_chunk = overlap_sentences + [sentence]
                    current_length = overlap_length + sentence_length
                else:
                    current_chunk = [sentence]
                    current_length = sentence_length
                
                # Advance the (approximate) offset past the emitted chunk, minus any overlap
                carried = overlap_length if self.chunk_overlap > 0 else 0
                current_start += len(chunk_text) - carried
                chunk_id += 1
            else:
                current_chunk.append(sentence)
                current_length += sentence_length
        
        # Add final chunk
        if current_chunk:
            chunk_text = " ".join(current_chunk)
            chunk = Chunk(
                content=chunk_text,
                start_index=current_start,
                end_index=current_start + len(chunk_text),
                chunk_id=f"chunk_{chunk_id}",
                metadata=metadata or {}
            )
            chunks.append(chunk)
        
        return chunks

# Test sentence-aware splitting
sentence_splitter = SentenceAwareSplitter(chunk_size=400, chunk_overlap=100)
sentence_chunks = sentence_splitter.split_text(sample_doc, {"source": "ml_fundamentals"})

print(f"📊 Sentence-aware splitting results:")
print(f"Number of chunks: {len(sentence_chunks)}")
print(f"Average chunk length: {np.mean([len(chunk.content) for chunk in sentence_chunks]):.0f} characters")
print(f"Average sentences per chunk: {np.mean([len(sent_tokenize(chunk.content)) for chunk in sentence_chunks]):.1f}")

# Show sample chunks
print(f"\n🔍 Sample sentence-aware chunks:")
for i, chunk in enumerate(sentence_chunks[:3]):
    print(f"\nChunk {i+1} (ID: {chunk.chunk_id}):")
    print(f"Content: {chunk.content[:200]}...")
    print(f"Length: {len(chunk.content)} characters")
    print(f"Sentences: {len(sent_tokenize(chunk.content))}")
    print(f"First sentence: '{sent_tokenize(chunk.content)[0][:100]}...'")
    print(f"Last sentence: '...{sent_tokenize(chunk.content)[-1][-100:]}'")

## HTML/Markdown Splitting {#html-markdown}

Specialized splitting for structured documents like HTML and Markdown that preserves document hierarchy.

### Pros:
- ✅ Preserves document structure
- ✅ Maintains heading hierarchy
- ✅ Good for technical documentation
- ✅ Enables section-based retrieval

### Cons:
- ❌ Requires parsing HTML/Markdown
- ❌ May create very small chunks
- ❌ Complex to implement correctly
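
One way to maintain the heading hierarchy is to keep a stack of the currently active headings and attach the resulting path to each section. The sketch below is one possible approach, not the splitter class from this notebook: `sections_with_paths` is a hypothetical name, only ATX-style headers (`#` to `######`) are recognised, and `#` lines inside fenced code blocks are not handled.

```python
import re
from typing import Dict, List

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")

def sections_with_paths(markdown: str) -> List[Dict[str, str]]:
    """Return each section body together with its full heading path."""
    sections: List[Dict[str, str]] = []
    stack: List[str] = []  # stack[i] is the active heading at level i + 1
    body: List[str] = []

    def flush() -> None:
        # Emit the text collected so far under the current heading path.
        content = "\n".join(body).strip()
        if content:
            path = " > ".join(h for h in stack if h)
            sections.append({"path": path, "content": content})
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del stack[level - 1:]                            # drop deeper or equal levels
            stack.extend([""] * (level - 1 - len(stack)))    # pad skipped levels
            stack.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return sections

example = "# Guide\nIntro text.\n## Setup\nInstall steps.\n# FAQ\nAnswers."
for section in sections_with_paths(example):
    print(section["path"], "->", section["content"])
```

Storing the path (for example `Guide > Setup`) in chunk metadata lets a retriever show or filter by the section a chunk came from, even when the chunk text itself no longer contains the parent headings.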

In [None]:
class MarkdownSplitter:
    """Split Markdown documents while preserving structure"""
    
    def __init__(self, chunk_size: int = 500, chunk_overlap: int = 50):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
    
    def split_text(self, text: str, metadata: Dict = None) -> List[Chunk]:
        """Split Markdown text preserving structure"""
        chunks = []
        current_chunk = ""
        current_start = 0
        chunk_id = 0
        
        # Split by headers first
        sections = self._split_by_headers(text)
        
        for section in sections:
            section_text, section_metadata = section
            
            # If section is small enough, add as single chunk
            if len(section_text) <= self.chunk_size:
                chunk = Chunk(
                    content=section_text,
                    start_index=current_start,
                    end_index=current_start + len(section_text),
                    chunk_id=f"chunk_{chunk_id}",
                    metadata={**(metadata or {}), **section_metadata}
                )
                chunks.append(chunk)
                current_start += len(section_text)
                chunk_id += 1
            else:
                # Split large section further
                sub_chunks = self._split_section(section_text, current_start, 
                                               {**(metadata or {}), **section_metadata})
                chunks.extend(sub_chunks)
                current_start += len(section_text)
                chunk_id += len(sub_chunks)
        
        return chunks
    
    def _split_by_headers(self, text: str) -> List[Tuple[str, Dict]]:
        """Split text by Markdown headers"""
        sections = []
        current_section = ""
        current_metadata = {}
        
        lines = text.split('\n')
        
        for line in lines:
            # Check if line is an ATX header (1-6 '#' characters followed by whitespace)
            if re.match(r'^#{1,6}\s', line):
                # Save previous section
                if current_section.strip():
                    sections.append((current_section.strip(), current_metadata.copy()))
                
                # Start new section
                header_level = len(line) - len(line.lstrip('#'))
                header_text = line.lstrip('#').strip()
                current_metadata = {
                    'header_level': header_level,
                    'header_text': header_text,
                    'section_type': 'header'
                }
                current_section = line + '\n'
            else:
                current_section += line + '\n'
        
        # Add final section
        if current_section.strip():
            sections.append((current_section.strip(), current_metadata))
        
        return sections
    
    def _split_section(self, text: str, start_index: int, metadata: Dict) -> List[Chunk]:
        """Split a large section into smaller chunks"""
        chunks = []
        current_chunk = ""
        current_start = start_index
        chunk_id = 0
        
        # Split by paragraphs
        paragraphs = text.split('\n\n')
        
        for paragraph in paragraphs:
            if len(current_chunk + paragraph) > self.chunk_size and current_chunk:
                # Create chunk
                chunk = Chunk(
                    content=current_chunk.strip(),
                    start_index=current_start,
                    end_index=current_start + len(current_chunk),
                    chunk_id=f"chunk_{start_index}_{chunk_id}",
                    metadata=metadata
                )
                chunks.append(chunk)
                
                # Start new chunk with overlap; advance the offset by the emitted
                # chunk's length minus the characters carried over as overlap
                overlap_text = current_chunk[-self.chunk_overlap:] if self.chunk_overlap > 0 else ""
                current_start += len(current_chunk) - len(overlap_text)
                current_chunk = overlap_text + paragraph + '\n\n'
                chunk_id += 1
            else:
                current_chunk += paragraph + '\n\n'
        
        # Add final chunk
        if current_chunk.strip():
            chunk = Chunk(
                content=current_chunk.strip(),
                start_index=current_start,
                end_index=current_start + len(current_chunk),
                chunk_id=f"chunk_{start_index}_{chunk_id}",
                metadata=metadata
            )
            chunks.append(chunk)
        
        return chunks

# Test Markdown splitting
markdown_splitter = MarkdownSplitter(chunk_size=300, chunk_overlap=50)
markdown_chunks = markdown_splitter.split_text(sample_doc, {"source": "ml_fundamentals"})

print(f"📊 Markdown splitting results:")
print(f"Number of chunks: {len(markdown_chunks)}")
print(f"Average chunk length: {np.mean([len(chunk.content) for chunk in markdown_chunks]):.0f} characters")

# Show sample chunks with metadata
print(f"\n🔍 Sample Markdown chunks:")
for i, chunk in enumerate(markdown_chunks[:3]):
    print(f"\nChunk {i+1} (ID: {chunk.chunk_id}):")
    print(f"Content: {chunk.content[:200]}...")
    print(f"Length: {len(chunk.content)} characters")
    print(f"Metadata: {chunk.metadata}")
    print(f"Starts with: '{chunk.content[:50]}...'")

## Code Splitting {#code-splitting}

Specialized splitting for code files that preserves syntax and structure.

### Pros:
- ✅ Preserves code syntax
- ✅ Maintains function/class boundaries
- ✅ Good for code search and retrieval
- ✅ Enables code-specific queries

### Cons:
- ❌ Language-specific implementation
- ❌ Complex parsing requirements
- ❌ May create very small chunks
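
For Python sources specifically, the standard-library `ast` module can locate function and class boundaries without line-prefix heuristics. The sketch below is an alternative to string matching, not the approach used by the class in this notebook. It assumes Python 3.8+ (for `end_lineno` and `ast.get_source_segment`) and syntactically valid input; `top_level_definitions` is a hypothetical name. Decorators are not included in the extracted segment.

```python
import ast
from typing import Any, Dict, List

def top_level_definitions(source: str) -> List[Dict[str, Any]]:
    """List top-level functions and classes with their line ranges and source text."""
    tree = ast.parse(source)  # raises SyntaxError on invalid code
    results: List[Dict[str, Any]] = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            results.append({
                "name": node.name,
                "type": "class" if isinstance(node, ast.ClassDef) else "function",
                "start_line": node.lineno,      # 1-based line of the def/class keyword
                "end_line": node.end_lineno,    # 1-based last line of the body
                "content": ast.get_source_segment(source, node),
            })
    return results

example_source = (
    "import os\n\n"
    "def f(x):\n    return x + 1\n\n"
    "class A:\n    def m(self):\n        pass\n\n"
    "async def g():\n    pass\n"
)
for item in top_level_definitions(example_source):
    print(item["type"], item["name"], item["start_line"], item["end_line"])
```

Keeping a class as one unit, as this sketch does, preserves method context. If a class is larger than the chunk size, its methods can be extracted the same way by iterating over `node.body` of the `ClassDef`.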

In [None]:
class CodeSplitter:
    """Split code files while preserving syntax structure"""
    
    def __init__(self, chunk_size: int = 500, chunk_overlap: int = 50):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
    
    def split_text(self, text: str, metadata: Dict = None) -> List[Chunk]:
        """Split code text preserving structure"""
        chunks = []
        current_chunk = ""
        current_start = 0
        chunk_id = 0
        
        # Split by functions/classes first
        sections = self._split_by_functions(text)
        
        for section in sections:
            section_text, section_metadata = section
            
            # If section is small enough, add as single chunk
            if len(section_text) <= self.chunk_size:
                chunk = Chunk(
                    content=section_text,
                    start_index=current_start,
                    end_index=current_start + len(section_text),
                    chunk_id=f"chunk_{chunk_id}",
                    metadata={**(metadata or {}), **section_metadata}
                )
                chunks.append(chunk)
                current_start += len(section_text)
                chunk_id += 1
            else:
                # Split large section further
                sub_chunks = self._split_section(section_text, current_start, 
                                               {**(metadata or {}), **section_metadata})
                chunks.extend(sub_chunks)
                current_start += len(section_text)
                chunk_id += len(sub_chunks)
        
        return chunks
    
    def _split_by_functions(self, text: str) -> List[Tuple[str, Dict]]:
        """Split text by function/class definitions"""
        sections = []
        current_section = ""
        current_metadata = {}
        
        lines = text.split('\n')
        
        for line in lines:
            # Check if line starts a function or class
            if (line.strip().startswith('def ') or 
                line.strip().startswith('class ') or
                line.strip().startswith('async def ')):
                
                # Save previous section
                if current_section.strip():
                    sections.append((current_section.strip(), current_metadata.copy()))
                
                # Start new section ('async def' is treated as a function too)
                stripped = line.strip()
                if stripped.startswith(('def ', 'async def ')):
                    # The name is the last word before the opening parenthesis
                    func_name = stripped.split('(')[0].split()[-1]
                    current_metadata = {
                        'type': 'function',
                        'name': func_name,
                        'section_type': 'function'
                    }
                elif stripped.startswith('class '):
                    class_name = stripped.split('(')[0].replace('class ', '').split(':')[0].strip()
                    current_metadata = {
                        'type': 'class',
                        'name': class_name,
                        'section_type': 'class'
                    }
                
                current_section = line + '\n'
            else:
                current_section += line + '\n'
        
        # Add final section
        if current_section.strip():
            sections.append((current_section.strip(), current_metadata))
        
        return sections
    
    def _split_section(self, text: str, start_index: int, metadata: Dict) -> List[Chunk]:
        """Split a large section into smaller chunks"""
        chunks = []
        current_chunk = ""
        current_start = start_index
        chunk_id = 0
        
        # Accumulate lines and emit a chunk once the block exceeds chunk_size
        lines = text.split('\n')
        current_block = []
        
        for line in lines:
            current_block.append(line)
            
            # Check if we should create a chunk
            if len('\n'.join(current_block)) > self.chunk_size and len(current_block) > 1:
                # Create chunk from previous block
                chunk_text = '\n'.join(current_block[:-1])
                chunk = Chunk(
                    content=chunk_text,
                    start_index=current_start,
                    end_index=current_start + len(chunk_text),
                    chunk_id=f"chunk_{start_index}_{chunk_id}",
                    metadata=metadata
                )
                chunks.append(chunk)
                
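                # Start new chunk with a one-line overlap. current_block[-1] is `line`
                # itself, so take only the line before it to avoid duplicating `line`.
                overlap_lines = current_block[-2:-1] if len(current_block) > 2 else []
                current_block = overlap_lines + [line]
                if overlap_lines:
                    # Step back so the next chunk starts at the overlapped line
                    current_start += len(chunk_text) - len(overlap_lines[0])
                else:
                    current_start += len(chunk_text) + 1  # skip the newline separator
                chunk_id += 1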
        
        # Add final chunk
        if current_block:
            chunk_text = '\n'.join(current_block)
            chunk = Chunk(
                content=chunk_text,
                start_index=current_start,
                end_index=current_start + len(chunk_text),
                chunk_id=f"chunk_{start_index}_{chunk_id}",
                metadata=metadata
            )
            chunks.append(chunk)
        
        return chunks

# Sample Python code
sample_code = '''
class DocumentProcessor:
    """Handles document processing and chunking"""
    
    def __init__(self, chunk_size: int = 500, chunk_overlap: int = 50):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.encoding = tiktoken.get_encoding("cl100k_base")
    
    def count_tokens(self, text: str) -> int:
        """Count tokens in text using tiktoken"""
        return len(self.encoding.encode(text))
    
    def create_product_document(self, product: Dict[str, Any]) -> str:
        """Convert product data to a searchable document"""
        doc_parts = [
            f"Product: {product['name']}",
            f"Category: {product['category']}",
            f"Price: ${product['price']}",
            f"Description: {product['description']}"
        ]
        return "\\n".join(doc_parts)
    
    def chunk_text(self, text: str, metadata: Dict[str, Any]) -> List[DocumentChunk]:
        """Split text into overlapping chunks"""
        chunks = []
        words = text.split()
        
        start = 0
        chunk_id = 0
        
        while start < len(words):
            end = min(start + self.chunk_size, len(words))
            chunk_words = words[start:end]
            chunk_text = " ".join(chunk_words)
            
            chunk_id_str = f"{metadata.get('product_id', 'unknown')}_chunk_{chunk_id}"
            
            chunk = DocumentChunk(
                id=chunk_id_str,
                content=chunk_text,
                metadata=metadata.copy()
            )
            chunks.append(chunk)
            
            start = end - self.chunk_overlap
            chunk_id += 1
            
            if start >= len(words) - self.chunk_overlap:
                break
        
        return chunks

def process_documents(documents: List[Dict]) -> List[DocumentChunk]:
    """Process a list of documents into chunks"""
    processor = DocumentProcessor()
    all_chunks = []
    
    for doc in documents:
        chunks = processor.chunk_text(doc['content'], doc['metadata'])
        all_chunks.extend(chunks)
    
    return all_chunks
'''

# Test code splitting
code_splitter = CodeSplitter(chunk_size=400, chunk_overlap=50)
code_chunks = code_splitter.split_text(sample_code, {"source": "document_processor.py"})

print("📊 Code splitting results:")
print(f"Number of chunks: {len(code_chunks)}")
print(f"Average chunk length: {np.mean([len(chunk.content) for chunk in code_chunks]):.0f} characters")

# Show sample chunks
print("\n🔍 Sample code chunks:")
for i, chunk in enumerate(code_chunks[:3]):
    print(f"\nChunk {i+1} (ID: {chunk.chunk_id}):")
    print(f"Type: {chunk.metadata.get('type', 'unknown')}")
    print(f"Name: {chunk.metadata.get('name', 'unknown')}")
    print(f"Content: {chunk.content[:200]}...")
    print(f"Length: {len(chunk.content)} characters")

## Performance Comparison {#performance}

Let's compare all splitting techniques on the same document to understand their strengths and weaknesses.

In [None]:
import time
import pandas as pd
from typing import List, Dict, Any

class SplittingEvaluator:
    """Evaluate different splitting techniques"""
    
    def __init__(self):
        self.splitters = {
            "Fixed Size (Char)": FixedSizeSplitter(chunk_size=300, chunk_overlap=50, use_tokens=False),
            "Fixed Size (Token)": FixedSizeSplitter(chunk_size=150, chunk_overlap=25, use_tokens=True),
            "Semantic": SemanticSplitter(similarity_threshold=0.6, min_chunk_size=100, max_chunk_size=800),
            "Recursive": RecursiveSplitter(chunk_size=300, chunk_overlap=50),
            "Sentence Aware": SentenceAwareSplitter(chunk_size=400, chunk_overlap=100),
            "Markdown": MarkdownSplitter(chunk_size=300, chunk_overlap=50),
            "Code": CodeSplitter(chunk_size=400, chunk_overlap=50)
        }
    
    def evaluate_splitter(self, splitter, text: str, metadata: Dict) -> Dict[str, Any]:
        """Evaluate a single splitter"""
        # perf_counter is a monotonic, high-resolution clock suited to short timings
        start_time = time.perf_counter()
        
        try:
            chunks = splitter.split_text(text, metadata)
            end_time = time.perf_counter()
            
            # Calculate metrics
            chunk_lengths = [len(chunk.content) for chunk in chunks]
            chunk_tokens = [len(encoding.encode(chunk.content)) for chunk in chunks]
            
            # Check for sentence breaks
            sentence_breaks = 0
            for chunk in chunks:
                sentences = sent_tokenize(chunk.content)
                if len(sentences) > 1:
                    # Check if any sentence is cut off
                    for sentence in sentences:
                        if not sentence.strip().endswith(('.', '!', '?')):
                            sentence_breaks += 1
            
            return {
                "success": True,
                "processing_time": end_time - start_time,
                "num_chunks": len(chunks),
                "avg_chunk_length": np.mean(chunk_lengths),
                "std_chunk_length": np.std(chunk_lengths),
                "min_chunk_length": np.min(chunk_lengths),
                "max_chunk_length": np.max(chunk_lengths),
                "avg_chunk_tokens": np.mean(chunk_tokens),
                "sentence_breaks": sentence_breaks,
                "chunk_size_consistency": 1 - (np.std(chunk_lengths) / np.mean(chunk_lengths)) if np.mean(chunk_lengths) > 0 else 0
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e),
                "processing_time": 0,
                "num_chunks": 0,
                "avg_chunk_length": 0,
                "std_chunk_length": 0,
                "min_chunk_length": 0,
                "max_chunk_length": 0,
                "avg_chunk_tokens": 0,
                "sentence_breaks": 0,
                "chunk_size_consistency": 0
            }
    
    def compare_all(self, text: str, metadata: Optional[Dict] = None) -> pd.DataFrame:
        """Compare all splitters"""
        results = []
        
        for name, splitter in self.splitters.items():
            print(f"🔄 Evaluating {name}...")
            result = self.evaluate_splitter(splitter, text, metadata or {})
            result["splitter_name"] = name
            results.append(result)
        
        return pd.DataFrame(results)

# Evaluate all splitters
evaluator = SplittingEvaluator()

# Test on sample document
print("🧪 Evaluating all splitting techniques...")
results_df = evaluator.compare_all(sample_doc, {"source": "ml_fundamentals"})

# Display results
print("\n📊 Performance Comparison Results:")
print("="*80)

# Filter successful results
successful_results = results_df[results_df["success"]].copy()

if len(successful_results) > 0:
    # Sort by processing time
    successful_results = successful_results.sort_values("processing_time")
    
    print("\n⏱️ Processing Time (seconds):")
    for _, row in successful_results.iterrows():
        print(f"  {row['splitter_name']}: {row['processing_time']:.3f}s")
    
    print("\n📏 Chunk Statistics:")
    print(f"{'Splitter':<20} {'Chunks':<8} {'Avg Length':<12} {'Consistency':<12} {'Sentence Breaks':<15}")
    print("-" * 80)
    
    for _, row in successful_results.iterrows():
        print(f"{row['splitter_name']:<20} {row['num_chunks']:<8} {row['avg_chunk_length']:<12.0f} {row['chunk_size_consistency']:<12.3f} {row['sentence_breaks']:<15}")
    
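    print("\n🎯 Recommendations:")
    # Report the measured winners rather than hard-coding them
    fastest = successful_results.loc[successful_results["processing_time"].idxmin(), "splitter_name"]
    most_consistent = successful_results.loc[successful_results["chunk_size_consistency"].idxmax(), "splitter_name"]
    print(f"  • Fastest on this document: {fastest}")
    print(f"  • Most consistent chunk sizes: {most_consistent}")
    # General guidance, not derived from the measurements above
    print("  • Best for NLP: Sentence Aware")
    print("  • Best for structure: Markdown/Code splitters")
    print("  • Best for semantics: Semantic splitter")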
    
    # Show failed splitters
    failed_results = results_df[~results_df["success"]]
    if len(failed_results) > 0:
        print("\n❌ Failed Splitters:")
        for _, row in failed_results.iterrows():
            print(f"  {row['splitter_name']}: {row['error']}")
else:
    print("❌ All splitters failed!")
    print("\nError details:")
    for _, row in results_df.iterrows():
        print(f"  {row['splitter_name']}: {row['error']}")

## Best Practices & Guidelines {#best-practices}

### Choosing the Right Splitting Technique

| Use Case | Recommended Technique | Reason |
|----------|----------------------|---------|
| **General Text** | Recursive Character | Good balance of structure and size |
| **Technical Docs** | Markdown Splitter | Preserves heading hierarchy |
| **Code Files** | Code Splitter | Maintains syntax structure |
| **Q&A Systems** | Sentence Aware | Preserves sentence integrity |
| **Semantic Search** | Semantic Splitter | Groups related content |
| **High Performance** | Fixed Size | Fastest processing |

### Chunk Size Guidelines

| Document Type | Recommended Size | Overlap |
|---------------|------------------|---------|
| **Short Articles** | 200-400 chars | 50-100 chars |
| **Long Documents** | 500-1000 chars | 100-200 chars |
| **Code Files** | 300-600 chars | 50-100 chars |
| **Technical Docs** | 400-800 chars | 100-150 chars |
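
To see how size and overlap interact, it can help to estimate how many chunks a given setting produces. The sketch below assumes a plain sliding window that advances by `chunk_size - overlap` characters per step; `estimate_num_chunks` is a hypothetical helper for illustration, not part of the splitters defined earlier, and boundary-aware splitters will deviate from this estimate.

```python
import math

def estimate_num_chunks(doc_length: int, chunk_size: int, overlap: int) -> int:
    """Estimate the chunk count for a sliding window with stride chunk_size - overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if doc_length <= chunk_size:
        return 1 if doc_length > 0 else 0
    stride = chunk_size - overlap
    # The first chunk covers chunk_size characters; each later chunk adds `stride` new ones
    return 1 + math.ceil((doc_length - chunk_size) / stride)

# A 10,000-character document with 500-char chunks and 100-char overlap
print(estimate_num_chunks(10_000, 500, 100))  # → 25
```

Larger overlap shrinks the stride, so the same document yields more chunks and a larger index; that cost is the price of better continuity across chunk boundaries.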

### Quality Metrics to Monitor

1. **Chunk Size Consistency**: 1 - (standard deviation / mean) of chunk lengths; values closer to 1 mean more uniform chunks
2. **Sentence Breaks**: Number of broken sentences
3. **Processing Time**: Time to split documents
4. **Retrieval Quality**: Precision and recall in RAG
5. **Context Preservation**: Semantic coherence within chunks
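
Two of these metrics are simple to compute directly. The following is a minimal sketch, assuming chunk lengths are available as a list of integers and that retrieval results and ground-truth labels are chunk ID strings; the function names and the example IDs are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Set, Tuple

def chunk_size_consistency(chunk_lengths: List[int]) -> float:
    """1 - coefficient of variation; 1.0 means every chunk has the same length."""
    if not chunk_lengths or mean(chunk_lengths) == 0:
        return 0.0
    # pstdev is the population standard deviation, matching np.std's default
    return 1 - pstdev(chunk_lengths) / mean(chunk_lengths)

def precision_recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> Tuple[float, float]:
    """Precision and recall over the top-k retrieved chunk IDs."""
    top_k = retrieved[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 2 of the top 3 results are relevant, out of 3 relevant chunks
precision, recall = precision_recall_at_k(["c1", "c4", "c2", "c9"], {"c1", "c2", "c3"}, k=3)
print(f"precision@3={precision:.2f}, recall@3={recall:.2f}")
print(f"consistency={chunk_size_consistency([280, 300, 320]):.3f}")
```

Retrieval precision and recall need a labeled set of queries with known relevant chunks, so they are usually measured on a small hand-built evaluation set rather than on the full corpus.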

## Real-World Examples {#real-world}

### 1. E-commerce Product Catalog
- **Challenge**: Product descriptions vary greatly in length
- **Solution**: Recursive splitting with 300-char chunks
- **Result**: Consistent retrieval across different product types

### 2. Technical Documentation
- **Challenge**: Need to preserve section hierarchy
- **Solution**: Markdown splitting with header-aware chunking
- **Result**: Better context for technical queries

### 3. Legal Documents
- **Challenge**: Long paragraphs with complex legal language
- **Solution**: Sentence-aware splitting with 400-char chunks
- **Result**: Preserved legal context and improved accuracy

### 4. Code Repository
- **Challenge**: Need to maintain code structure
- **Solution**: Code splitting by functions and classes
- **Result**: Better code search and retrieval
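
For Python sources specifically, one alternative to matching `def`/`class` line prefixes is to let the parser find the boundaries. The sketch below uses the standard-library `ast` module (it relies on `end_lineno`, available from Python 3.8) and returns one chunk per top-level definition; `split_python_by_definitions` and the sample source are hypothetical, and code outside definitions (imports, module-level statements) is intentionally skipped here.

```python
import ast
from typing import Dict, List

def split_python_by_definitions(source: str) -> List[Dict[str, str]]:
    """Return one chunk per top-level function or class, using the parser for boundaries."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Decorators sit above node.lineno, so start from the earliest one
            first_line = min([node.lineno] + [d.lineno for d in node.decorator_list])
            text = "\n".join(lines[first_line - 1:node.end_lineno])
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            chunks.append({"type": kind, "name": node.name, "content": text})
    return chunks

example = '''
import functools

@functools.lru_cache(maxsize=None)
def load(path):
    return path.upper()

class Indexer:
    def add(self, doc):
        pass

async def fetch(url):
    return url
'''

for chunk in split_python_by_definitions(example):
    print(chunk["type"], chunk["name"], len(chunk["content"]))
```

Because the parser handles decorators, multi-line signatures, and strings that happen to contain `def `, this avoids some false splits of a line-prefix approach, at the cost of failing on files that do not parse.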

### 5. Customer Support Knowledge Base
- **Challenge**: Mix of short FAQs and long articles
- **Solution**: Hybrid approach - semantic for articles, fixed-size for FAQs
- **Result**: Optimized retrieval for different content types
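
A hybrid setup like this can be sketched as a small router that picks a splitter per document. The example below is illustrative only: it routes by character length with a hypothetical `long_doc_threshold`, and `paragraph_split` is a simple stand-in for a semantic splitter so the block stays self-contained.

```python
from typing import Callable, List

Splitter = Callable[[str], List[str]]

def fixed_size_split(text: str, chunk_size: int = 300) -> List[str]:
    """Cut text into consecutive chunk_size-character pieces (no overlap)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def paragraph_split(text: str) -> List[str]:
    """Stand-in for a semantic splitter: one chunk per blank-line-separated paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def hybrid_split(text: str, long_doc_threshold: int = 1000,
                 short_splitter: Splitter = fixed_size_split,
                 long_splitter: Splitter = paragraph_split) -> List[str]:
    """Route short documents (FAQs) and long documents (articles) to different splitters."""
    splitter = long_splitter if len(text) > long_doc_threshold else short_splitter
    return splitter(text)

faq = "Q: How do I reset my password? A: Use the 'Forgot password' link on the sign-in page."
article = "\n\n".join(f"Paragraph {i}: " + "details " * 60 for i in range(4))

print(len(hybrid_split(faq)))      # → 1
print(len(hybrid_split(article)))  # → 4
```

In practice the routing signal could also be a content-type field in the document metadata rather than length; the point is that the splitter is chosen per document instead of per corpus.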

## Key Takeaways & Next Steps

### What We've Learned
✅ **Multiple Splitting Techniques** with different strengths and use cases
✅ **Performance Comparison** showing trade-offs between speed and quality
✅ **Real-world Applications** demonstrating practical implementations
✅ **Best Practices** for choosing the right technique

### Key Insights
1. **No One-Size-Fits-All**: Choose technique based on your specific use case
2. **Chunk Size Matters**: Balance between context and precision
3. **Overlap is Important**: Ensures continuity between chunks
4. **Structure Preservation**: Consider document type and retrieval needs
5. **Performance vs Quality**: Trade-offs between speed and accuracy

### Next Steps
- **Experiment**: Try different techniques on your specific documents
- **Measure**: Implement evaluation metrics for your use case
- **Optimize**: Fine-tune parameters based on performance data
- **Monitor**: Track retrieval quality and user satisfaction

### Advanced Topics to Explore
- **Hybrid Splitting**: Combine multiple techniques
- **Dynamic Chunking**: Adjust chunk size based on content
- **Metadata Enrichment**: Add semantic tags to chunks
- **Quality Scoring**: Rate chunk quality automatically

---

**Ready to implement document splitting?** Start with recursive character splitting for general text, then experiment with specialized techniques based on your specific needs!