# ✂️ Notebook 03: Text Splitting Strategies

**LangChain 1.0.5+ | Mixed Level Class**

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
1. Understand **why** text splitting is necessary for RAG
2. Master **RecursiveCharacterTextSplitter** (the recommended default)
3. Learn other splitters: Character, HTMLHeader, RecursiveJson, Token
4. Choose optimal **chunk sizes** and **overlap**
5. Compare splitters side-by-side
6. Apply the right splitter for different content types

---

## 📖 Table of Contents

1. [Why Split Text?](#why-split)
2. [RecursiveCharacterTextSplitter](#recursive-splitter)
3. [CharacterTextSplitter](#character-splitter)
4. [HTMLHeaderTextSplitter](#html-splitter)
5. [RecursiveJsonSplitter](#json-splitter)
6. [TokenTextSplitter](#token-splitter)
7. [Chunk Size & Overlap Optimization](#optimization)
8. [Comparison & Best Practices](#comparison)
9. [Summary & Exercises](#summary)

---

In [None]:
# Setup
from pathlib import Path
from dotenv import load_dotenv
load_dotenv()

print("✅ Environment ready")

<a id="why-split"></a>
## 1. Why Split Text? 🤔

### 🔰 BEGINNER EXPLANATION

Imagine you have a 200-page book and someone asks: *"What did the author say about machine learning on page 87?"*

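**Problem:** LLMs have a limited "attention span" (context window). For the original model releases:
- GPT-3.5-Turbo: ~4,000 tokens (~16,000 characters)
- GPT-4: ~8,000 tokens (~32,000 characters)
- You **can't** fit a whole book in one query!

Newer models accept much longer inputs, but a large document collection still won't fit, and sending less text per query is cheaper and faster.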

**Solution:** Split the book into smaller **chunks**:
1. Each chunk is small enough for the LLM
2. Search finds the **relevant chunks** (like page 87)
3. Only send those chunks to the LLM
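
The three steps above can be sketched without any library. This is a toy illustration only: it assumes a simple word-overlap score stands in for real embedding search, and `words` and `retrieve` are hypothetical helper names, not LangChain APIs.

```python
import re


def words(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(chunks: list[str], question: str, k: int = 1) -> list[str]:
    """Return the k chunks that share the most words with the question."""
    q = words(question)
    ranked = sorted(chunks, key=lambda c: len(words(c) & q), reverse=True)
    return ranked[:k]


# Step 1: the "book" is already split into small chunks.
book_chunks = [
    "Chapter 1 covers the history of computing.",
    "Machine learning lets programs improve from data.",
    "Chapter 9 is about database indexing.",
]

# Steps 2 and 3: find the relevant chunk and send only that to the LLM.
top = retrieve(book_chunks, "What is machine learning?")
print(top[0])
```

Real RAG pipelines replace the word-overlap score with vector similarity, but the flow (split, search, send only the matches) stays the same.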

### The Challenge

If you split text randomly:
```
❌ BAD SPLIT:
Chunk 1: "The transformer architecture revolutionized NLP. It uses self-att"
Chunk 2: "ention mechanisms to process sequences in parallel. This allows..."
```

The word "attention" is cut in half! 😱

**Good splitters** respect boundaries (paragraphs, sentences, words):
```
✅ GOOD SPLIT:
Chunk 1: "The transformer architecture revolutionized NLP. It uses self-attention mechanisms."
Chunk 2: "Self-attention allows the model to process sequences in parallel. This improves speed..."
```
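
To make the difference concrete, here is a minimal pure-Python sketch that contrasts a fixed-width cut with sentence-aware packing. The helper names `fixed_width` and `by_sentence` are hypothetical, and the sentence detection is a simple regex assumption, not what any LangChain splitter does internally.

```python
import re

text = (
    "The transformer architecture revolutionized NLP. "
    "It uses self-attention mechanisms. "
    "Self-attention processes sequences in parallel."
)


def fixed_width(text: str, size: int) -> list[str]:
    """Cut every `size` characters, ignoring word boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def by_sentence(text: str, size: int) -> list[str]:
    """Pack whole sentences into chunks of at most `size` characters.

    A single sentence longer than `size` is kept whole.
    """
    sentences = re.split(r"(?<=\.)\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > size:
            chunks.append(current)   # current chunk is full
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


naive = fixed_width(text, 65)
aware = by_sentence(text, 90)

print("Fixed width:", naive)
print("Sentence aware:", aware)
```

The fixed-width version ends its first chunk in the middle of "self-attention", while every sentence-aware chunk ends on a full stop.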

### 🎓 INTERMEDIATE: Trade-offs

| Aspect | Small Chunks (500 chars) | Large Chunks (2000 chars) |
|--------|-------------------------|---------------------------|
| **Precision** | High (very specific) | Lower (more general) |
| **Context** | Less context | More context |
| **Retrieval Quality** | More precise matches | May include noise |
| **# of Chunks** | More chunks = more storage | Fewer chunks |
| **Best for** | Q&A, facts, technical docs | Long-form, narrative content |

<a id="recursive-splitter"></a>
## 2. RecursiveCharacterTextSplitter ⭐

### 🔰 BEGINNER: The Default Choice

**RecursiveCharacterTextSplitter** is your go-to splitter for most cases.

**How it works** (with the separator list used in the cell below):
1. Tries to split on **double newlines** (\n\n) → paragraphs
2. If chunks are still too big, splits on **single newlines** (\n) → lines
3. If still too big, splits on **period + space** (". ") → sentences
4. If still too big, splits on **spaces** ( ) → words
5. Last resort: splits on **characters**

The built-in default separator list is `["\n\n", "\n", " ", ""]`, so the sentence step only applies when you pass `". "` yourself.

This preserves meaning as much as possible!
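
The fallback order described above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: it has no overlap, it drops separators at chunk edges, and `hard_cut` and `recursive_split` are hypothetical names.

```python
def hard_cut(text: str, chunk_size: int) -> list[str]:
    """Last resort: slice every chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Try coarse separators first and fall back to finer ones when needed."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        return hard_cut(text, chunk_size)

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        if len(piece) > chunk_size:
            # Too big for this separator: flush, then retry with finer ones.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, finer))
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate       # piece still fits, keep merging
        else:
            chunks.append(current)    # chunk is full, start a new one
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Para one is short.\n\n"
    "Para two has a first line.\n"
    "And a second line that is longer."
)
demo_chunks = recursive_split(sample, 30, ["\n\n", "\n", " ", ""])
for c in demo_chunks:
    print(repr(c))
```

The short paragraph survives intact, the second paragraph is broken at its newline, and only the over-long line is broken at a space.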

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader

# Load our sample text
txt_path = "sample_data/notes.txt"

if Path(txt_path).exists():
    # Load the document
    loader = TextLoader(txt_path)
    documents = loader.load()
    
    print(f"📄 Original document: {len(documents[0].page_content)} characters\n")
    
    # Create splitter
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,        # Maximum chunk size in characters
        chunk_overlap=200,      # Overlap between chunks
        length_function=len,    # How to measure length
        separators=["\n\n", "\n", ". ", " ", ""]  # Try these in order
    )
    
    # Split the document
    chunks = splitter.split_documents(documents)
    
    print(f"✂️ Split into {len(chunks)} chunks\n")
    
    # Examine first 3 chunks
    for i, chunk in enumerate(chunks[:3], 1):
        print(f"{'='*70}")
        print(f"Chunk {i} ({len(chunk.page_content)} chars):")
        print(f"{'='*70}")
        print(chunk.page_content[:300] + "..." if len(chunk.page_content) > 300 else chunk.page_content)
        print()
else:
    print(f"❌ File not found: {txt_path}")

### 🔰 Understanding Chunk Overlap

**Why overlap?** To preserve context across boundaries.

**Example without overlap:**
```
Chunk 1: "...introducing the transformer architecture."
Chunk 2: "The model uses multi-head attention..."
```
→ Missing connection between "transformer" and "multi-head attention"

**Example with overlap:**
```
Chunk 1: "...introducing the transformer architecture. The model uses..."
Chunk 2: "...transformer architecture. The model uses multi-head attention..."
```
→ Both chunks have the connection!
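
The mechanics of overlap can be shown with a minimal sliding-window sketch. It assumes fixed-size character windows (the hypothetical helper `sliding_chunks`); real splitters cut on separators, so their overlap is at most `chunk_overlap` rather than exactly that many characters.

```python
def sliding_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Fixed-size windows; each one repeats the last `overlap` characters
    of the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    if not text:
        return []
    step = size - overlap
    chunks, start = [], 0
    while True:
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window reached the end of the text
        start += step
    return chunks


passage = (
    "The paper begins by introducing the transformer architecture. "
    "The model uses multi-head attention to relate tokens."
)
windows = sliding_chunks(passage, size=50, overlap=20)

for previous, current in zip(windows, windows[1:]):
    print("shared text:", repr(previous[-20:]))
    assert previous[-20:] == current[:20]
```

Each printed "shared text" appears at the end of one chunk and at the start of the next, which is what keeps a sentence that straddles a boundary readable from either side.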

In [None]:
# Demonstrate overlap
if Path(txt_path).exists():
    docs = TextLoader(txt_path).load()
    
    # Splitter with overlap
    splitter_with_overlap = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=100  # 100 chars overlap
    )
    
    chunks = splitter_with_overlap.split_documents(docs)
    
    print("🔍 Examining overlap between chunks:\n")
    
    # Show overlap between chunk 1 and 2
    if len(chunks) >= 2:
        chunk1_end = chunks[0].page_content[-150:]
        chunk2_start = chunks[1].page_content[:150]
        
        print("Chunk 1 ending:")
        print(f"  ...{chunk1_end}")
        print("\nChunk 2 beginning:")
        print(f"  {chunk2_start}...")
        print("\n💡 Notice the overlap? This preserves context!")

### 🎓 INTERMEDIATE: Custom Separators for Code

In [None]:
# Example: Splitting Python code
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Python code example
python_code = '''
def calculate_total(items):
    """Calculate total price of items."""
    total = 0
    for item in items:
        total += item['price']
    return total

def apply_discount(total, discount_percent):
    """Apply discount to total."""
    discount = total * (discount_percent / 100)
    return total - discount

class ShoppingCart:
    def __init__(self):
        self.items = []
    
    def add_item(self, item):
        self.items.append(item)
'''

# Python-aware splitter
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=200,
    chunk_overlap=50
)

code_chunks = python_splitter.split_text(python_code)

print(f"✂️ Split code into {len(code_chunks)} chunks:\n")
for i, chunk in enumerate(code_chunks, 1):
    print(f"Chunk {i}:")
    print(chunk)
    print("-" * 50)

<a id="character-splitter"></a>
## 3. CharacterTextSplitter

### 🔰 BEGINNER: Simple Splitting

**CharacterTextSplitter** splits on a single separator (like "\n\n").
- Simpler than Recursive
- Less intelligent
- Use for testing or very simple text
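
What "less intelligent" means in practice: with only one separator there is no finer fallback, so a piece longer than the limit stays oversized. The sketch below is a pure-Python approximation under that assumption (the helper `split_on_separator` is hypothetical and has no overlap). LangChain's own splitter typically reports this situation with a "Created a chunk of size ..." warning.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on one separator, then merge neighbours up to chunk_size.

    Pieces longer than chunk_size are emitted as-is, because there is
    no finer separator to fall back on.
    """
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


doc = (
    "Short intro.\n\n"
    + "A very long paragraph without breaks. " * 5
    + "\n\nShort outro."
)
parts = split_on_separator(doc, "\n\n", chunk_size=60)

for part in parts:
    print(len(part), repr(part[:40]))
```

The middle chunk is far longer than 60 characters, which is the main reason to prefer the recursive splitter for real documents.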

In [None]:
from langchain_text_splitters import CharacterTextSplitter

# Sample text with clear paragraph breaks
sample_text = """First paragraph about machine learning.
It has multiple sentences. This is important context.

Second paragraph about deep learning.
Neural networks are powerful. They learn from data.

Third paragraph about transformers.
Attention mechanisms are key. They revolutionized NLP.
"""

# Split on paragraph breaks
simple_splitter = CharacterTextSplitter(
    separator="\n\n",  # Split on double newline (paragraphs)
    chunk_size=100,
    chunk_overlap=20
)

chunks = simple_splitter.split_text(sample_text)

print(f"Split into {len(chunks)} chunks:\n")
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {chunk.strip()}\n")

<a id="html-splitter"></a>
## 4. HTMLHeaderTextSplitter 🌐

### 🔰 BEGINNER: Structure-Aware Splitting

**HTMLHeaderTextSplitter** splits HTML based on headers (h1, h2, h3).
- Preserves document structure
- Adds header information to metadata
- Perfect for documentation
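
The idea of header-based splitting with header metadata can be sketched with the standard library's `html.parser`. This is a toy version for intuition only, not how LangChain implements it; `HeaderSectionParser` and `split_by_headers` are hypothetical names, and it ignores nesting details that a real splitter has to handle.

```python
from html.parser import HTMLParser


class HeaderSectionParser(HTMLParser):
    """Group text under the headers that precede it (toy version)."""

    def __init__(self, header_labels):
        super().__init__()
        self.header_labels = header_labels  # e.g. {"h1": "Title"}
        self.active = {}          # header tag -> header text currently in scope
        self.open_header = None   # header tag being read right now, if any
        self.header_text = []
        self.body = []
        self.sections = []

    def flush(self):
        """Close the current section if it collected any text."""
        content = " ".join(" ".join(self.body).split())
        if content:
            metadata = {self.header_labels[t]: v for t, v in self.active.items()}
            self.sections.append({"metadata": metadata, "content": content})
        self.body = []

    def handle_starttag(self, tag, attrs):
        if tag in self.header_labels:
            self.flush()
            # A new header replaces headers of the same or a deeper level
            # ("h1" < "h2" < "h3" as strings).
            self.active = {t: v for t, v in self.active.items() if t < tag}
            self.open_header, self.header_text = tag, []

    def handle_endtag(self, tag):
        if tag == self.open_header:
            self.active[tag] = "".join(self.header_text).strip()
            self.open_header = None

    def handle_data(self, data):
        (self.header_text if self.open_header else self.body).append(data)


def split_by_headers(html, header_labels):
    parser = HeaderSectionParser(header_labels)
    parser.feed(html)
    parser.close()   # process any text still buffered by the parser
    parser.flush()   # emit the final section
    return parser.sections


page = """
<h1>Guide</h1><p>Intro text.</p>
<h2>Install</h2><p>Run the installer.</p>
<h2>Usage</h2><p>Call the API.</p>
"""
sections = split_by_headers(page, {"h1": "Title", "h2": "Section"})
for section in sections:
    print(section)
```

Each section carries the headers it sits under, so a retrieved chunk can be traced back to its place in the document.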

In [None]:
from langchain_text_splitters import HTMLHeaderTextSplitter

# Load the HTML blog post
html_path = "sample_data/blog_post.html"

if Path(html_path).exists():
    # Read HTML content
    with open(html_path, 'r', encoding='utf-8') as f:
        html_content = f.read()
    
    # Define headers to split on
    headers_to_split_on = [
        ("h1", "Title"),
        ("h2", "Section"),
        ("h3", "Subsection"),
    ]
    
    # Create splitter
    html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
    
    # Split the HTML
    html_chunks = html_splitter.split_text(html_content)
    
    print(f"✂️ Split HTML into {len(html_chunks)} sections\n")
    
    # Show first 3 sections with metadata
    for i, chunk in enumerate(html_chunks[:3], 1):
        print(f"{'='*70}")
        print(f"Section {i}:")
        print(f"Metadata: {chunk.metadata}")
        print(f"Content (first 200 chars): {chunk.page_content[:200]}...")
        print()
else:
    print(f"❌ HTML file not found: {html_path}")

<a id="json-splitter"></a>
## 5. RecursiveJsonSplitter 📦

### 🔰 BEGINNER: Splitting JSON Data

**RecursiveJsonSplitter** splits JSON while preserving structure.
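
"Preserving structure" means every chunk is itself valid JSON that keeps the key path down to each value. The following is a minimal sketch of that idea, assuming a simple greedy packing of leaf values; the helpers `iter_leaves`, `set_path`, and `split_json_sketch` are hypothetical and not the library's algorithm.

```python
import copy
import json


def iter_leaves(data: dict, path: tuple = ()):
    """Yield (path, value) for every non-dict value in a nested dict."""
    for key, value in data.items():
        if isinstance(value, dict) and value:
            yield from iter_leaves(value, path + (key,))
        else:
            yield path + (key,), value


def set_path(target: dict, path: tuple, value) -> None:
    """Write value into target, creating nested dicts along the path."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_json_sketch(data: dict, max_chars: int) -> list[dict]:
    """Pack leaves into dicts whose JSON text stays within max_chars.

    A single leaf that is too large still gets a chunk of its own.
    """
    chunks, current = [], {}
    for path, value in iter_leaves(data):
        trial = copy.deepcopy(current)
        set_path(trial, path, value)
        if current and len(json.dumps(trial)) > max_chars:
            chunks.append(current)      # current chunk is full
            current = {}
            set_path(current, path, value)
        else:
            current = trial
    if current:
        chunks.append(current)
    return chunks


payload = {
    "user": {"id": 7, "name": "Ada", "roles": ["admin", "dev"]},
    "settings": {"theme": "dark", "lang": "en"},
}
json_parts = split_json_sketch(payload, max_chars=50)
for part in json_parts:
    print(json.dumps(part))
```

Note how the `roles` list lands in its own chunk but still sits under the `user` key, so the chunk remains meaningful when read in isolation.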

In [None]:
from langchain_text_splitters import RecursiveJsonSplitter
import json

# Load JSON data
json_path = "sample_data/api_response.json"

if Path(json_path).exists():
    with open(json_path, 'r', encoding='utf-8') as f:
        json_data = json.load(f)
    
    # Create splitter
    json_splitter = RecursiveJsonSplitter(
        max_chunk_size=1000,
        min_chunk_size=100
    )
    
    # Split into a list of dicts (split_text would return JSON strings instead,
    # which json.dumps below would only wrap in quotes)
    json_chunks = json_splitter.split_json(
        json_data=json_data,
        convert_lists=True
    )
    
    print(f"✂️ Split JSON into {len(json_chunks)} chunks\n")
    
    # Show first chunk
    print("First chunk:")
    print(json.dumps(json_chunks[0], indent=2)[:500] + "...")
else:
    print(f"❌ JSON file not found: {json_path}")

<a id="token-splitter"></a>
## 6. TokenTextSplitter 🎯

### 🎓 INTERMEDIATE: Precise Token-Based Splitting

**TokenTextSplitter** splits based on **tokens** (not characters).
- More accurate for LLM context windows
- Uses tiktoken (OpenAI's tokenizer)
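
The difference between counting characters and counting tokens can be illustrated without installing a tokenizer. The sketch below assumes a toy tokenizer that treats words and punctuation marks as tokens; real tokenizers such as tiktoken use subword units, so actual counts will differ. `toy_tokenize` and `split_by_tokens` are hypothetical names.

```python
import re


def toy_tokenize(text: str) -> list[str]:
    """Words and punctuation marks as separate tokens (illustration only)."""
    return re.findall(r"\w+|[^\w\s]", text)


def split_by_tokens(text: str, chunk_size: int, overlap: int) -> list[list[str]]:
    """Slide a window of chunk_size tokens, repeating `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    tokens = toy_tokenize(text)
    step = chunk_size - overlap
    token_windows, start = [], 0
    while start < len(tokens):
        token_windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window reached the last token
        start += step
    return token_windows


sentence = "Self-attention lets the model process sequences in parallel."
tokens = toy_tokenize(sentence)
token_windows = split_by_tokens(sentence, chunk_size=5, overlap=1)

print(f"{len(sentence)} characters, {len(tokens)} toy tokens")
for window in token_windows:
    print(window)
```

Because the limit is applied to tokens, every window is guaranteed to fit a token budget, which a character limit can only approximate.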

In [None]:
from langchain_text_splitters import TokenTextSplitter

# Sample text
text = """The transformer architecture, introduced in the paper 'Attention Is All You Need', 
revolutionized natural language processing. It uses self-attention mechanisms to process 
sequences in parallel, making it much faster than recurrent neural networks."""

# Token-based splitter
token_splitter = TokenTextSplitter(
    chunk_size=50,  # 50 tokens (not characters!)
    chunk_overlap=10,
    encoding_name="cl100k_base"  # GPT-3.5/GPT-4 tokenizer
)

token_chunks = token_splitter.split_text(text)

print(f"Split into {len(token_chunks)} token-based chunks:\n")
for i, chunk in enumerate(token_chunks, 1):
    print(f"Chunk {i}: {chunk}\n")

<a id="optimization"></a>
## 7. Chunk Size & Overlap Optimization 📊

### 🔰 BEGINNER: Rules of Thumb

#### Recommended Configurations

| Content Type | Chunk Size | Overlap | Why |
|-------------|-----------|---------|-----|
| **General Text** | 1000 chars | 200 chars | Balanced precision & context |
| **Technical Docs** | 500-800 | 100-150 | Precision for code/commands |
| **Long Articles** | 1500-2000 | 300 | More context for narrative |
| **Code** | 200-400 | 50-100 | Function/class level |
| **FAQs** | 200-300 | 30-50 | Question-answer pairs |

#### Overlap Guidelines
- **10-15%**: Minimal overlap, saves storage
- **20%**: Sweet spot (recommended)
- **30%+**: Maximum context preservation
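
The storage cost of a given overlap can be estimated with simple arithmetic. This sketch assumes idealized fixed-size windows, where each new chunk advances by `chunk_size - overlap` characters; real splitters cut on separators, so actual counts will differ somewhat. `estimate_chunks` is a hypothetical helper.

```python
import math


def estimate_chunks(doc_chars: int, chunk_size: int, overlap: int) -> int:
    """Approximate chunk count for fixed-size windows with overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if doc_chars == 0:
        return 0
    if doc_chars <= chunk_size:
        return 1
    # The first chunk covers chunk_size characters; every later chunk
    # adds only (chunk_size - overlap) new characters.
    return math.ceil((doc_chars - overlap) / (chunk_size - overlap))


doc_chars = 10_000
for pct in (0, 10, 20, 30):
    overlap = 1000 * pct // 100
    count = estimate_chunks(doc_chars, 1000, overlap)
    print(f"overlap {pct:>2}%: about {count} chunks")
```

As a rule of thumb under the same assumption, stored text grows by roughly `1 / (1 - overlap_fraction)`, so 20% overlap stores about 1.25 times the original characters.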

In [None]:
# Compare different chunk sizes
if Path(txt_path).exists():
    docs = TextLoader(txt_path).load()
    
    chunk_sizes = [500, 1000, 1500, 2000]
    
    print("📊 Chunk Size Comparison:\n")
    print(f"{'Size':<8} {'Chunks':<10} {'Avg Length':<12} {'Overlap %'}")
    print("-" * 50)
    
    for size in chunk_sizes:
        overlap = int(size * 0.2)  # 20% overlap
        
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=size,
            chunk_overlap=overlap
        )
        
        chunks = splitter.split_documents(docs)
        avg_length = sum(len(c.page_content) for c in chunks) / len(chunks)
        overlap_pct = (overlap / size) * 100
        
        print(f"{size:<8} {len(chunks):<10} {avg_length:<12.0f} {overlap_pct:.0f}%")

<a id="comparison"></a>
## 8. Comparison & Best Practices 🌟

### Splitter Comparison

| Splitter | Best For | Pros | Cons |
|----------|----------|------|------|
| **RecursiveCharacter** | General text, docs | Smart boundaries, flexible | Slower |
| **Character** | Simple text | Fast, simple | Not intelligent |
| **HTMLHeader** | Web content, docs | Preserves structure | HTML only |
| **RecursiveJson** | JSON data | Preserves JSON structure | JSON only |
| **Token** | Precise LLM usage | Accurate token count | Requires tokenizer |

### 🎓 Best Practices

1. **Start with RecursiveCharacterTextSplitter**
2. **Test different chunk sizes** with your data
3. **Use 20% overlap** as default
4. **Match splitter to content type** (HTML → HTMLHeaderTextSplitter)
5. **Monitor retrieval quality** and adjust
6. **Consider token-based splitting** for production
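
Practice 4 can be captured as a small lookup. The sketch below only returns the name of a suggested splitter based on the file extension; the mapping and the helper `suggest_splitter` are illustrative assumptions drawn from the comparison table above, not a LangChain feature.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to the splitter covered in this notebook.
SPLITTER_BY_SUFFIX = {
    ".html": "HTMLHeaderTextSplitter",
    ".htm": "HTMLHeaderTextSplitter",
    ".json": "RecursiveJsonSplitter",
    ".py": "RecursiveCharacterTextSplitter.from_language(Language.PYTHON)",
}
DEFAULT_SPLITTER = "RecursiveCharacterTextSplitter"


def suggest_splitter(filename: str) -> str:
    """Pick a splitter name by file suffix, falling back to the default."""
    suffix = Path(filename).suffix.lower()
    return SPLITTER_BY_SUFFIX.get(suffix, DEFAULT_SPLITTER)


for name in ("blog_post.html", "api_response.json", "notes.txt"):
    print(f"{name:<20} -> {suggest_splitter(name)}")
```

Anything without a specialised splitter falls back to the recursive default, in line with practice 1.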

<a id="summary"></a>
## 9. Summary & Exercises 📝

### 🎉 What You Learned

✅ Text splitting is necessary because **LLMs have limited context windows**

✅ **RecursiveCharacterTextSplitter** is the recommended default

✅ **Chunk size** determines precision vs context trade-off

✅ **Overlap** (20%) preserves context across boundaries

✅ Different content types need different splitters

✅ **Best practice:** chunk_size=1000, chunk_overlap=200 for general text

### 💡 Practice Exercises

#### 🔰 Beginner
1. Load a PDF and split it with chunk_size=500, overlap=100
2. Count total chunks created
3. Print first and last chunks

#### 🎓 Intermediate
1. Compare chunk sizes (500, 1000, 2000) on the same document
2. Create a chart showing # of chunks vs chunk size
3. Test different overlap percentages (10%, 20%, 30%)

### 📚 Next: Notebook 04 - Embeddings!

---