# Text Preprocessing and Chunking Strategies

This notebook explores text preprocessing and chunking strategies for retrieval-augmented generation (RAG) systems. How documents are split and cleaned directly affects retrieval quality, because the retriever can only return the chunks we give it.

## Learning Objectives
By the end of this notebook, you will:
1. Understand different chunking strategies and their trade-offs
2. Learn how to preprocess text for optimal retrieval
3. Compare chunking methods on real documents
4. Implement quality filtering and metadata preservation
5. Understand the impact of chunking on RAG performance


## Setup and Imports

Let's import the libraries we need and load our data.


In [12]:
# Standard library imports
import json
import re
from collections import Counter
from pathlib import Path
from typing import List, Dict, Any

# Third-party imports
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm import tqdm

# Add project root to path
import sys
sys.path.append(str(Path.cwd().parent))

# Import our modules
from src.data.preprocess_data import TextPreprocessor, ChunkingConfig
from src.config import DATA_DIR

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")

# Download required NLTK data
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')
    print("Downloaded NLTK punkt tokenizer")


Libraries imported successfully!


## Load Sample Data

Let's load some sample documents to work with different chunking strategies.


In [13]:
# Load processed chunks from previous notebook
chunks_file = DATA_DIR / "processed" / "all_chunks.json"

if chunks_file.exists():
    with open(chunks_file, 'r', encoding='utf-8') as f:
        all_chunks = json.load(f)
    print(f"Loaded {len(all_chunks)} chunks from previous notebook")
else:
    print("No processed chunks found. Creating sample data...")
    
    # Create sample documents for demonstration
    sample_documents = [
        {
            'id': 'doc1',
            'title': 'Machine Learning Fundamentals',
            'text': '''Machine learning is a subset of artificial intelligence that focuses on algorithms that can learn from data. It involves training models on historical data to make predictions or decisions without being explicitly programmed for every scenario. There are three main types of machine learning: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning uses labeled training data to learn a mapping from inputs to outputs. Unsupervised learning finds hidden patterns in data without labeled examples. Reinforcement learning learns through interaction with an environment, receiving rewards or penalties for actions.''',
            'source': 'wikipedia'
        },
        {
            'id': 'doc2',
            'title': 'Deep Learning and Neural Networks',
            'text': '''Deep learning is a subset of machine learning that uses artificial neural networks with multiple layers to model and understand complex patterns in data. These networks are inspired by the structure and function of the human brain. Deep learning has revolutionized many fields including computer vision, natural language processing, and speech recognition. Convolutional Neural Networks (CNNs) are particularly effective for image processing tasks. Recurrent Neural Networks (RNNs) and their variants like LSTM and GRU are well-suited for sequential data. The success of deep learning is largely due to the availability of large datasets, increased computational power, and improved algorithms.''',
            'source': 'wikipedia'
        },
        {
            'id': 'doc3',
            'title': 'Natural Language Processing',
            'text': '''Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and human language. It involves developing algorithms and models that can understand, interpret, and generate human language in a valuable way. NLP tasks include text classification, sentiment analysis, named entity recognition, machine translation, and question answering. Recent advances in transformer models like BERT, GPT, and T5 have significantly improved NLP performance. These models use attention mechanisms to process sequences of text and can capture long-range dependencies in language.''',
            'source': 'wikipedia'
        }
    ]
    
    all_chunks = sample_documents
    print(f"Created {len(all_chunks)} sample documents")


Loaded 22 chunks from previous notebook


## Understanding Chunking Strategies

Let's explore different chunking strategies and their characteristics.


In [14]:
# Initialize text preprocessor
preprocessor = TextPreprocessor()

print("Text Preprocessor initialized!")
print(f"Default chunk size: {preprocessor.config.chunk_size}")
print(f"Default chunk overlap: {preprocessor.config.chunk_overlap}")

# Let's examine one document in detail
if all_chunks:
    sample_doc = all_chunks[0]
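    # Chunks from the previous notebook use 'source_title'; the fallback sample documents use 'title'
    doc_title = sample_doc.get('source_title', sample_doc.get('title', 'untitled'))
    print(f"\nSample document: {doc_title}")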
    print(f"Text length: {len(sample_doc['text'])} characters")
    print(f"Word count: {len(sample_doc['text'].split())} words")
    print(f"\nText preview:")
    print(sample_doc['text'][:300] + "...")


INFO:src.data.preprocess_data:Text preprocessor initialized. Output directory: /Users/scienceman/Desktop/LLM/data/processed


Text Preprocessor initialized!
Default chunk size: 512
Default chunk overlap: 50

Sample document: Machine learning
Text length: 460 characters
Word count: 68 words

Text preview:
Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalise to unseen data, and thus perform tasks without explicit instructions Within a subdiscipline in machine learning, advances in...


## Fixed-Size Chunking

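Fixed-size chunking splits text into chunks of a predetermined length, optionally overlapping consecutive chunks so that context at the boundaries is not lost. It is simple and predictable, but it can cut sentences or paragraphs in the middle.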
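
To make the idea concrete, here is a minimal sketch of character-based fixed-size chunking with overlap. The function name `fixed_size_chunks` is hypothetical and this is not the code behind `TextPreprocessor.chunk_document(..., strategy='fixed')`; the project's implementation may measure size differently or filter out short chunks.

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Hypothetical helper for illustration only. Each window starts
    (chunk_size - overlap) characters after the previous one, so
    consecutive chunks share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


print(fixed_size_chunks("abcdefghij", chunk_size=4, overlap=1))
# → ['abcd', 'defg', 'ghij']
```

Note that the cut points depend only on character positions, which is why chunks produced this way rarely end on sentence punctuation.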


In [15]:
# Test fixed-size chunking with different sizes
if all_chunks:
    sample_doc = all_chunks[0]
    
    chunk_sizes = [100, 200, 300, 500]
    
    print("Fixed-Size Chunking Comparison:")
    print("=" * 50)
    
    for chunk_size in chunk_sizes:
        # Create custom config
        config = ChunkingConfig(
            chunk_size=chunk_size,
            chunk_overlap=50,
            chunk_by_sentences=False
        )
        
        # Create preprocessor with custom config
        custom_preprocessor = TextPreprocessor(config)
        
        # Chunk the document
        chunks = custom_preprocessor.chunk_document(sample_doc, strategy='fixed')
        
        print(f"\nChunk size: {chunk_size}")
        print(f"Number of chunks: {len(chunks)}")
        
        if chunks:
            chunk_lengths = [len(chunk['text']) for chunk in chunks]
            print(f"Chunk lengths: min={min(chunk_lengths)}, max={max(chunk_lengths)}, avg={np.mean(chunk_lengths):.1f}")
            
            # Show first chunk
            print(f"First chunk: {chunks[0]['text'][:100]}...")
            
            # Count chunks that end on sentence-final punctuation
            sentence_breaks = sum(1 for chunk in chunks if chunk['text'].endswith(('.', '!', '?')))
            print(f"Chunks ending with sentence punctuation: {sentence_breaks}/{len(chunks)}")


INFO:src.data.preprocess_data:Text preprocessor initialized. Output directory: /Users/scienceman/Desktop/LLM/data/processed
INFO:src.data.preprocess_data:Text preprocessor initialized. Output directory: /Users/scienceman/Desktop/LLM/data/processed
INFO:src.data.preprocess_data:Text preprocessor initialized. Output directory: /Users/scienceman/Desktop/LLM/data/processed
INFO:src.data.preprocess_data:Text preprocessor initialized. Output directory: /Users/scienceman/Desktop/LLM/data/processed


Fixed-Size Chunking Comparison:

Chunk size: 100
Number of chunks: 0

Chunk size: 200
Number of chunks: 3
Chunk lengths: min=170, max=195, avg=186.7
First chunk: Machine learning (ML) is a field of study in artificial intelligence concerned with the development ...
Chunks ending with sentence punctuation: 0/3

Chunk size: 300
Number of chunks: 2
Chunk lengths: min=213, max=297, avg=255.0
First chunk: Machine learning (ML) is a field of study in artificial intelligence concerned with the development ...
Chunks ending with sentence punctuation: 0/2

Chunk size: 500
Number of chunks: 1
Chunk lengths: min=460, max=460, avg=460.0
First chunk: Machine learning (ML) is a field of study in artificial intelligence concerned with the development ...
Chunks ending with sentence punctuation: 0/1


## Semantic Chunking

Semantic chunking tries to keep related content together by splitting at natural boundaries like sentences or paragraphs.
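
As a rough illustration of splitting at sentence boundaries, the sketch below packs whole sentences into chunks of up to `max_chars` characters. The function name `sentence_chunks` and the regex-based sentence splitter are assumptions made for this example; the project's `strategy='semantic'` code may work differently (for instance, by using the NLTK punkt tokenizer downloaded earlier).

```python
import re


def sentence_chunks(text: str, max_chars: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    Hypothetical helper for illustration only. Sentences are detected
    naively: a split happens after '.', '!' or '?' followed by whitespace.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        # Accept the sentence if it fits, or if the chunk is still empty
        # (a single over-long sentence is kept whole rather than cut)
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(sentence_chunks("One two. Three four! Five six? Seven.", max_chars=20))
# → ['One two. Three four!', 'Five six? Seven.']
```

A consequence of this design is that a chunk can exceed `max_chars` when a single sentence is longer than the limit, or when the text contains no detectable sentence boundaries at all.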


In [16]:
# Test semantic chunking
if all_chunks:
    sample_doc = all_chunks[0]
    
    print("Semantic Chunking:")
    print("=" * 30)
    
    # Test with different configurations
    configs = [
        (512, 50, True, "Default semantic"),
        (256, 25, True, "Smaller chunks"),
        (1024, 100, True, "Larger chunks"),
        (512, 50, False, "No sentence boundary respect")
    ]
    
    for chunk_size, overlap, by_sentences, description in configs:
        config = ChunkingConfig(
            chunk_size=chunk_size,
            chunk_overlap=overlap,
            chunk_by_sentences=by_sentences
        )
        
        custom_preprocessor = TextPreprocessor(config)
        chunks = custom_preprocessor.chunk_document(sample_doc, strategy='semantic')
        
        print(f"\n{description}:")
        print(f"  Chunk size: {chunk_size}, Overlap: {overlap}, By sentences: {by_sentences}")
        print(f"  Number of chunks: {len(chunks)}")
        
        if chunks:
            chunk_lengths = [len(chunk['text']) for chunk in chunks]
            print(f"  Chunk lengths: min={min(chunk_lengths)}, max={max(chunk_lengths)}, avg={np.mean(chunk_lengths):.1f}")
            
            # Show first chunk
            print(f"  First chunk: {chunks[0]['text'][:150]}...")


INFO:src.data.preprocess_data:Text preprocessor initialized. Output directory: /Users/scienceman/Desktop/LLM/data/processed
INFO:src.data.preprocess_data:Text preprocessor initialized. Output directory: /Users/scienceman/Desktop/LLM/data/processed
INFO:src.data.preprocess_data:Text preprocessor initialized. Output directory: /Users/scienceman/Desktop/LLM/data/processed
INFO:src.data.preprocess_data:Text preprocessor initialized. Output directory: /Users/scienceman/Desktop/LLM/data/processed


Semantic Chunking:

Default semantic:
  Chunk size: 512, Overlap: 50, By sentences: True
  Number of chunks: 1
  Chunk lengths: min=460, max=460, avg=460.0
  First chunk: Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn...

Smaller chunks:
  Chunk size: 256, Overlap: 25, By sentences: True
  Number of chunks: 1
  Chunk lengths: min=460, max=460, avg=460.0
  First chunk: Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn...

Larger chunks:
  Chunk size: 1024, Overlap: 100, By sentences: True
  Number of chunks: 1
  Chunk lengths: min=460, max=460, avg=460.0
  First chunk: Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn...

No sentence boundary respect:
  Chunk size: 512, Overlap: 50, By 