
Document Chunker

A lightweight Python utility for splitting large documents into optimally-sized chunks for Large Language Model (LLM) processing. Preserves sentence boundaries and maintains context continuity with configurable overlap between chunks.

Features

  • Smart Sentence-Based Chunking: Splits documents within configurable size ranges (default 2000-4000 words) while preserving sentence boundaries
  • Even Split Mode: Divides documents into a specified number of roughly equal chunks
  • Configurable Overlap: Maintains context continuity between chunks with customizable overlap (default 200 words)
  • Batch Processing: Process entire directories of documents at once
  • Intelligent Handling: Gracefully handles encoding issues (UTF-8, Latin-1 fallback) and very long sentences
  • Chunk Metadata: Each chunk includes word count, preview text, and chunk number
  • Zero Dependencies: Uses only the Python standard library - no external packages required

Use Cases

  • Preparing long documents for LLM analysis (ChatGPT, Claude, etc.)
  • Breaking down research papers, books, or documentation for AI processing
  • Creating overlapping segments for comprehensive document analysis
  • Preprocessing data for vector embeddings or semantic search systems

Requirements

  • Python 3.7+
  • No external dependencies (standard library only)

Installation

  1. Clone or download the project (for example, with the git command below)
  2. No additional installation needed - ready to use immediately!
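
For example, cloning with git (URL assumed from the repository slug shown on the page):

git clone https://github.com/jdstrausb/doc-chunker.git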

Usage

Interactive Mode

Run the script directly for prompted input:

python doc_chunker.py

You'll be guided through the chunking options interactively.

As a Python Module

Import and use programmatically:

from doc_chunker import DocumentChunker, analyze_document, analyze_document_even_split, batch_process_directory

# Smart chunking with size range
chunker = DocumentChunker(min_chunk_size=2000, max_chunk_size=4000, overlap=200)
chunks = chunker.process_file('path/to/document.txt')

# Analyze a document and see chunk statistics
analyze_document('path/to/document.txt')

# Even split into specific number of chunks
analyze_document_even_split('path/to/document.txt', num_chunks=5)

# Batch process entire directory
batch_process_directory('path/to/directory', output_dir='chunks')

Command Examples

Basic single file chunking:

from doc_chunker import DocumentChunker

chunker = DocumentChunker()
chunks = chunker.process_file('research_paper.txt')

for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {chunk['word_count']} words")
    print(f"Preview: {chunk['preview']}")
    print(f"Content: {chunk['content'][:200]}...")

Custom chunk sizes:

# Smaller chunks for short-form summaries
small_chunker = DocumentChunker(min_chunk_size=500, max_chunk_size=1000, overlap=100)

# Larger chunks for deep analysis
large_chunker = DocumentChunker(min_chunk_size=4000, max_chunk_size=8000, overlap=500)

Export chunks to files:

chunker = DocumentChunker()
chunks = chunker.process_file('book.txt')
chunker.save_chunks(chunks, output_dir='./chunks', base_filename='book')
# Creates: book_chunk_001.txt, book_chunk_002.txt, etc.

Batch processing:

from doc_chunker import batch_process_directory

# Process all .txt files in a directory
batch_process_directory('./documents', output_dir='./processed_chunks')

Configuration Options

DocumentChunker Parameters

Parameter        Default  Description
min_chunk_size   2000     Minimum target words per chunk
max_chunk_size   4000     Maximum words per chunk
overlap          200      Number of overlapping words between chunks
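
The defaults above can also be passed explicitly; the two constructions below are equivalent:

from doc_chunker import DocumentChunker

chunker = DocumentChunker()  # defaults from the table above
explicit = DocumentChunker(min_chunk_size=2000, max_chunk_size=4000, overlap=200)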

Chunking Strategies

1. Smart Range-Based Chunking (Default)

  • Aims for chunks between min and max size
  • Preserves sentence boundaries
  • Adds overlap for context continuity
  • Best for: General LLM processing, maintaining readability

2. Even Split Chunking

  • Divides the document into N roughly equal chunks
  • Useful when you need a predictable chunk count (see the example below)
  • Best for: Parallel processing, consistent batch sizes

Output Format

Chunk Object Structure

Each chunk is a dictionary with:

{
    'content': 'Full text of the chunk...',
    'word_count': 2547,
    'preview': 'First 100 characters of content...',
    'chunk_number': 1
}

Saved Chunk File Format

Chunk 1
Word count: 2547
--------------------------------------------------

[Full chunk text content...]

Key Functions

Core Functions

  • DocumentChunker.chunk_text(text) - Primary chunking method with overlap
  • DocumentChunker.chunk_text_even_split(text, num_chunks) - Even distribution chunking
  • DocumentChunker.process_file(file_path) - Process single file
  • DocumentChunker.save_chunks(chunks, output_dir, base_filename) - Export chunks to files

Utility Functions

  • analyze_document(file_path, min_size, max_size) - Analyze with statistics
  • batch_process_directory(directory_path, output_dir) - Process multiple files

Technical Details

How It Works

  1. File Reading: Attempts UTF-8 encoding, falls back to Latin-1 if needed
  2. Sentence Splitting: Uses regex to identify sentence boundaries
  3. Chunking Algorithm (sketched after this list):
    • Accumulates sentences until max_chunk_size is reached
    • Ensures minimum chunk size is met
    • Handles edge cases (very long sentences, small documents)
  4. Overlap Addition: Appends last N words from previous chunk to maintain context
  5. Export: Optional file output with metadata headers
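
The loop in step 3 can be sketched as follows. This is an illustrative reimplementation of the steps above, not the module's actual code; the handling of very long sentences and small documents is simplified:

import re

def split_sentences(text):
    # Step 2: split on whitespace that follows sentence-ending punctuation
    return re.split(r'(?<=[.!?])\s+', text)

def chunk_words(text, min_size=2000, max_size=4000, overlap=200):
    chunks, current = [], []
    for sentence in split_sentences(text):
        words = sentence.split()
        # Step 3: close the current chunk once adding this sentence would
        # push it past max_size (a lone oversized sentence still becomes
        # its own chunk)
        if current and len(current) + len(words) > max_size:
            chunks.append(current)
            # Step 4: carry the last `overlap` words into the next chunk
            current = current[-overlap:] if overlap else []
        current.extend(words)
    if current:
        # Fold a too-small remainder into the previous chunk
        if chunks and len(current) < min_size:
            chunks[-1].extend(current)
        else:
            chunks.append(current)
    return [' '.join(c) for c in chunks]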

Sentence Boundary Detection

Uses regex pattern: r'(?<=[.!?])\s+'

  • Splits on whitespace that follows a period, exclamation mark, or question mark
  • Keeps the terminal punctuation attached to each sentence
  • Handles common prose well, though abbreviations like "Dr." or "e.g." can cause extra splits
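
For example, applying the pattern directly:

import re

text = "First sentence. Second one! A third? Done."
print(re.split(r'(?<=[.!?])\s+', text))
# ['First sentence.', 'Second one!', 'A third?', 'Done.']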

Encoding Support

  • Primary: UTF-8
  • Fallback: Latin-1
  • Graceful error handling for encoding issues
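
A minimal sketch of that fallback (read_text here is a hypothetical helper, not part of the module):

def read_text(path):
    # Try UTF-8 first; Latin-1 accepts any byte sequence, so it cannot fail
    try:
        with open(path, encoding='utf-8') as f:
            return f.read()
    except UnicodeDecodeError:
        with open(path, encoding='latin-1') as f:
            return f.read()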

Example Workflow

from doc_chunker import DocumentChunker

# Initialize with custom settings
chunker = DocumentChunker(
    min_chunk_size=3000,  # Larger chunks
    max_chunk_size=5000,
    overlap=300           # More context overlap
)

# Process a document
chunks = chunker.process_file('long_research_paper.txt')

# Review chunk statistics
print(f"Total chunks created: {len(chunks)}")
for chunk in chunks:
    print(f"Chunk {chunk['chunk_number']}: {chunk['word_count']} words")

# Save to files
chunker.save_chunks(
    chunks,
    output_dir='./research_chunks',
    base_filename='research_paper'
)

# Now process with your LLM of choice
for chunk in chunks:
    # Send chunk['content'] to ChatGPT, Claude, etc.
    process_with_llm(chunk['content'])

Tips & Best Practices

  1. Choosing Chunk Sizes (a sizing sketch follows this list):

    • GPT-3.5: 2000-3000 words per chunk
    • GPT-4: 4000-6000 words per chunk
    • Claude: 6000-8000 words per chunk
    • Consider API token limits when sizing chunks
  2. Overlap Settings:

    • 10-15% of max_chunk_size recommended
    • Higher overlap for documents with complex context
    • Lower overlap for independent sections
  3. Batch Processing:

    • Use batch_process_directory() for multiple files
    • Chunks are automatically saved with descriptive filenames
    • Original files remain unchanged
  4. Memory Efficiency:

    • Processes documents in memory
    • For very large files (100MB+), consider splitting first
    • No temporary files created during processing
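
Putting tips 1 and 2 together, a small helper can derive a chunker from a word budget (make_chunker is a hypothetical convenience, not part of the module):

from doc_chunker import DocumentChunker

def make_chunker(min_words, max_words, overlap_ratio=0.1):
    # Tip 2: overlap of 10-15% of max_chunk_size
    return DocumentChunker(
        min_chunk_size=min_words,
        max_chunk_size=max_words,
        overlap=int(max_words * overlap_ratio),
    )

# Tip 1: e.g. 4000-6000 words per chunk for GPT-4
gpt4_chunker = make_chunker(4000, 6000, overlap_ratio=0.12)  # overlap = 720 words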

Troubleshooting

Issue: "UnicodeDecodeError"

  • Solution: The script automatically falls back to Latin-1 encoding; if issues persist, check the file encoding with file -I filename.txt (macOS) or file -i filename.txt (Linux)

Issue: Chunks are too small/large

  • Solution: Adjust min_chunk_size and max_chunk_size parameters
  • Check if sentences in your document are unusually long/short

Issue: Losing context between chunks

  • Solution: Increase the overlap parameter to preserve more context

Issue: Permission errors when saving

  • Solution: Ensure write permissions for output directory: chmod +w output_dir

Performance

  • Speed: Processes roughly 1MB of text in under a second on modern hardware (a timing sketch follows this list)
  • Memory: Loads the entire document into memory, making it suitable for files up to about 100MB
  • Accuracy: Preserves sentence boundaries reliably for conventional prose; regex-based splitting can mis-split around abbreviations
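
To verify throughput on your own files, a quick timing sketch around chunk_text ('big_document.txt' is a placeholder path):

import time
from doc_chunker import DocumentChunker

chunker = DocumentChunker()
with open('big_document.txt', encoding='utf-8') as f:
    text = f.read()

start = time.perf_counter()
chunks = chunker.chunk_text(text)
print(f"{len(chunks)} chunks in {time.perf_counter() - start:.2f}s")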

License

MIT


Perfect for: Data scientists, AI researchers, developers working with LLM APIs, content analysts, and anyone processing large documents for AI/ML workflows.
