## What is Chunking?

**Chunking** is the process of breaking large documents into smaller pieces (chunks).

### Why Do We Need Chunking?

**Problem:** You have a 50-page research paper, but:
1. **LLMs have token limits** - The model can't process the entire paper at once (e.g., the original GPT-3.5 had a 4K-token context window)
2. **Better retrieval** - When searching, you want specific relevant sections, not the whole document
3. **More precise answers** - Smaller chunks = more focused context for the LLM
4. **Cost efficiency** - Only send relevant chunks to the LLM, not the entire document

### Real-World Analogy:

Imagine you're studying for an exam:
- ❌ **No chunking:** Reading the entire 500-page textbook every time you have a question
- ✅ **With chunking:** Having an index that points you to specific pages/paragraphs relevant to your question

## Installation

In [None]:
# Install required packages (tiktoken is needed by TokenTextSplitter in Method 3)
# !pip install langchain langchain-community langchain-text-splitters pypdf tiktoken

## Load Your Research Papers

Let's start by loading the two research papers from the data folder.

In [2]:
from langchain_community.document_loaders import PyPDFLoader
import os

# Get the project root directory
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
data_dir = os.path.join(project_root, 'data')

# Load both research papers
pdf1_path = os.path.join(data_dir, 'CodingqualitativedataResearchgate.pdf')
pdf2_path = os.path.join(data_dir, 'EJ1172284.pdf')

# Load first paper
loader1 = PyPDFLoader(pdf1_path)
pages1 = loader1.load()

# Load second paper
loader2 = PyPDFLoader(pdf2_path)
pages2 = loader2.load()

print(f"📄 Paper 1: {len(pages1)} pages")
print(f"📄 Paper 2: {len(pages2)} pages")
print(f"\nTotal pages: {len(pages1) + len(pages2)}")

# Let's look at the first page
print(f"\n{'='*50}")
print("First page preview:")
print(f"{'='*50}")
print(pages1[0].page_content[:500] + "...")

📄 Paper 1: 27 pages
📄 Paper 2: 11 pages

Total pages: 38

First page preview:
1 
Coding qualitative data: a synthesis to guide the novice 
 
Mai Skjøtt Linneberg 
Department of Management, Aarhus University 
 
Steffen Korsgaard 
Department of Entrepreneurship and Relationship Management 
 
 
 
This is a pre-print version of our paper that will be published in Qualitative Research 
Journal (https://doi.org/10.1108/QRJ -12-2018-0012)...


## Understanding Chunking Parameters

Before we chunk, let's understand the key parameters:

### 1. **chunk_size** (Most Important)
- The maximum number of characters in each chunk
- Too small → Loses context, many chunks
- Too large → Still hits token limits, less precise
- **Sweet spot:** 500-1500 characters (for most use cases)

### 2. **chunk_overlap**
- How many characters to share between consecutive chunks
- Prevents information from being cut off mid-sentence
- **Typical:** 10-20% of chunk_size

### 3. **separators**
- An ordered list of strings to split on, tried from coarsest (paragraphs) to finest (single characters)
- Default for `RecursiveCharacterTextSplitter`: `["\n\n", "\n", " ", ""]`
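
To make the separator order concrete, here is a minimal pure-Python sketch of the recursive idea. It is a simplified illustration, not LangChain's actual implementation: the function name `recursive_split` is hypothetical, it ignores `chunk_overlap`, and it drops empty pieces.

```python
def recursive_split(text, separators, chunk_size):
    """Simplified illustration of recursive splitting (no overlap).

    Split on the first separator, recurse into oversized pieces with the
    remaining separators, then merge neighbouring pieces that still fit.
    """
    if len(text) <= chunk_size:
        return [text]
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if part:  # skip empty strings produced by repeated separators
            pieces.extend(recursive_split(part, finer, chunk_size))

    # Greedily merge adjacent pieces while the result stays within chunk_size.
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb\n\ncccc dddd eeee"
result = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=10)
print(result)
```

The first paragraph fits in one chunk, so it is kept whole; the second is too long, so the sketch falls through to the space separator and splits it between words.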

### Visual Example:

```
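Original Document: [------ 3000 characters ------]

With chunk_size=1000, chunk_overlap=200, each chunk starts 800 characters after the previous one
(idealized; real splitters cut at separators, so sizes vary):

Chunk 1: chars    0-1000
Chunk 2: chars  800-1800   (first 200 chars repeat the end of Chunk 1)
Chunk 3: chars 1600-2600   (first 200 chars repeat the end of Chunk 2)
Chunk 4: chars 2400-3000   (first 200 chars repeat the end of Chunk 3)

The overlap ensures important context isn't lost at boundaries!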
```
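
The arithmetic behind `chunk_size` and `chunk_overlap` can be sketched with a plain fixed-width sliding window. This is an idealized illustration (the helper name `sliding_window_chunks` is hypothetical); LangChain's splitters cut at separators instead, so their chunk lengths vary. Note that with a stride of 800 characters, a 3000-character text needs four windows, the last one shorter.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between window starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# 3000 characters of repeating digits, so overlaps can be compared exactly
text = "".join(str(i % 10) for i in range(3000))
windows = sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200)

print([len(w) for w in windows])
print(windows[0][-200:] == windows[1][:200])  # tail of one window == head of the next
```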

## Method 1: RecursiveCharacterTextSplitter ⭐ Most Popular

This is the **most commonly used** splitter. It tries to keep paragraphs, sentences, and words together.

In [2]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create the splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # Maximum characters per chunk
    chunk_overlap=200,      # Overlap between chunks
    length_function=len,    # How to measure chunk size
    separators=["\n\n", "\n", " ", ""]  # Try paragraphs first, then lines, then words, then single characters
)

# Split the first research paper
chunks = text_splitter.split_documents(pages1)

print(f"Original: {len(pages1)} pages")
print(f"After chunking: {len(chunks)} chunks")
print(f"\nAverage chunk size: {sum(len(chunk.page_content) for chunk in chunks) / len(chunks):.0f} characters")

# Look at first 3 chunks
print(f"\n{'='*60}")
print("First 3 chunks:")
print(f"{'='*60}")
for i, chunk in enumerate(chunks[:3]):
    print(f"\n--- Chunk {i+1} ({len(chunk.page_content)} chars) ---")
    print(chunk.page_content[:200] + "...")
    print(f"Metadata: {chunk.metadata}")

Original: 27 pages
After chunking: 67 chunks

Average chunk size: 793 characters

First 3 chunks:

--- Chunk 1 (357 chars) ---
1 
Coding qualitative data: a synthesis to guide the novice 
 
Mai Skjøtt Linneberg 
Department of Management, Aarhus University 
 
Steffen Korsgaard 
Department of Entrepreneurship and Relationship M...
Metadata: {'producer': 'Microsoft® Word for Office 365', 'creator': 'Microsoft® Word for Office 365', 'creationdate': '2019-05-16T11:07:21+02:00', 'author': 'Steffen Korsgaard', 'moddate': '2019-05-16T11:07:21+02:00', 'source': '/Users/apple/Desktop/RAG Langchain/data/CodingqualitativedataResearchgate.pdf', 'total_pages': 27, 'page': 0, 'page_label': '1'}

--- Chunk 2 (933 chars) ---
2 
Purpose 
Qualitative research has gained in importance in the social sciences. General knowledge about 
qualitative data analysis, how to code qualitative data and decisions concerning related 
res...
Metadata: {'producer': 'Microsoft® Word for Office 365', 'creator': 'Micro

## Method 2: CharacterTextSplitter (Simple)

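Simpler than the recursive splitter: it splits on a single separator only. If a piece between two separators is longer than `chunk_size`, it is kept whole rather than split further, so chunks can exceed the limit.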

In [3]:
from langchain_text_splitters import CharacterTextSplitter

# Simple splitter - splits on double newlines (paragraphs)
simple_splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separator="\n\n"  # Only split on paragraph breaks
)

simple_chunks = simple_splitter.split_documents(pages2)

print(f"Simple splitting created: {len(simple_chunks)} chunks")
print("First chunk preview:")
print(simple_chunks[0].page_content[:300] + "...")

Simple splitting created: 11 chunks
First chunk preview:
The EUROCALL Review, Volume 25, No. 2, September 2017 
 
 18 
Research paper 
 
A look at advanced learners’ use of mobile devices for 
English language study: Insights from interview data 
Mariusz Kruk 
University of Zielona Gora, Poland 
____________________________________________________________...


## Method 3: TokenTextSplitter (For LLMs)

Splits based on **tokens** (what LLMs actually count), not characters.

**Why use tokens?**
- LLMs count in tokens, not characters
- 1 token ≈ 4 characters (English)
- More accurate for staying within LLM limits
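
The 4-characters-per-token rule of thumb above can be turned into a quick budget check. This is only a rough sketch: the helper names are hypothetical and the ratio is an approximation for English text, so a real tokenizer will give different counts.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb for English text, not an exact figure


def estimate_tokens(text):
    """Rough token estimate based on character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_budget(chunks, token_budget):
    """True if the estimated tokens of all chunks fit within token_budget."""
    return sum(estimate_tokens(c) for c in chunks) <= token_budget


chunk = "a" * 1000                           # a 1000-character chunk
print(estimate_tokens(chunk))                # about 250 tokens
print(fits_budget([chunk] * 4, 4096))        # 4 chunks fit in a 4K window
print(fits_budget([chunk] * 20, 4096))       # 20 chunks do not
```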

In [3]:
from langchain_text_splitters import TokenTextSplitter

# Split by tokens (more accurate for LLMs)
token_splitter = TokenTextSplitter(
    chunk_size=250,         # 250 tokens (roughly 1000 characters)
    chunk_overlap=50        # 50 tokens overlap
)

token_chunks = token_splitter.split_documents(pages1[:5])  # Use first 5 pages

print(f"Token-based splitting created: {len(token_chunks)} chunks")
print("\nFirst chunk:")
print(token_chunks[0].page_content[:300] + "...")

Token-based splitting created: 9 chunks

First chunk:
1 
Coding qualitative data: a synthesis to guide the novice 
 
Mai Skjøtt Linneberg 
Department of Management, Aarhus University 
 
Steffen Korsgaard 
Department of Entrepreneurship and Relationship Management 
 
 
 
This is a pre-print version of our paper that will be published in Qualitative Rese...


## Method 4: Semantic Chunking (Advanced) 🚀

**Meaning-aware** - splits based on meaning, not just character count!

Uses embeddings to understand where topics change.
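
The idea of splitting where the topic changes can be sketched without any API key. The toy below is an assumption-heavy stand-in: it uses word overlap (Jaccard similarity) in place of real embeddings and a fixed, hypothetical threshold in place of a percentile-based one, so it only illustrates the concept.

```python
import re


def word_set(sentence):
    """Lowercased set of words in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def similarity(a, b):
    """Jaccard overlap of word sets: a crude stand-in for embedding similarity."""
    wa, wb = word_set(a), word_set(b)
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def semantic_chunks(sentences, threshold=0.2):
    """Start a new chunk whenever adjacent sentences are less similar than threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if similarity(prev, cur) < threshold:
            groups.append([cur])      # topic change: open a new chunk
        else:
            groups[-1].append(cur)    # same topic: extend the current chunk
    return [" ".join(g) for g in groups]


sentences = [
    "Coding is the process of labelling qualitative data.",
    "Coding qualitative data helps researchers find patterns.",
    "Mobile devices are popular among language learners.",
    "Learners use mobile devices to study English.",
]
topic_chunks = semantic_chunks(sentences)
for chunk in topic_chunks:
    print("-", chunk)
```

The two sentences about coding land in one chunk and the two about mobile devices in another, because the only sharp drop in similarity is between the second and third sentence.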

In [None]:
# Note: Semantic chunking requires an embeddings model (OpenAI API or local models)
# and two extra packages: pip install langchain-experimental langchain-openai
# For now, let's show the concept

# from langchain_experimental.text_splitter import SemanticChunker
# from langchain_openai.embeddings import OpenAIEmbeddings
# 
# semantic_splitter = SemanticChunker(
#     OpenAIEmbeddings(),
#     breakpoint_threshold_type="percentile"  # Split when meaning changes significantly
# )
# 
# semantic_chunks = semantic_splitter.split_documents(pages1)

print("Semantic Chunking splits by meaning/topic changes")
print("Requires embeddings model (OpenAI or local)")
print("Most intelligent but also most computationally expensive")

## Comparing Different Chunk Sizes

Let's see how different chunk sizes affect the output.

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Test different chunk sizes
chunk_sizes = [500, 1000, 2000]

print("Comparing different chunk sizes:\n")
print(f"{'Chunk Size':<15} {'Total Chunks':<15} {'Avg Chunk':<15}")
print("-" * 45)

for size in chunk_sizes:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=size,
        chunk_overlap=int(size * 0.2)  # 20% overlap
    )
    test_chunks = splitter.split_documents(pages1)
    avg_size = sum(len(c.page_content) for c in test_chunks) / len(test_chunks)
    
    print(f"{size:<15} {len(test_chunks):<15} {avg_size:<15.0f}")

print("\n💡 Notice: Smaller chunks = More chunks (better precision)")
print("💡 Larger chunks = Fewer chunks (more context per chunk)")

Comparing different chunk sizes:

Chunk Size      Total Chunks    Avg Chunk      
---------------------------------------------
500             125             428            
1000            67              793            
2000            30              1578           

💡 Notice: Smaller chunks = More chunks (better precision)
💡 Larger chunks = Fewer chunks (more context per chunk)


## Chunk Overlap Visualization

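Let's see how overlap works in practice. Keep in mind that `split_documents` splits each page separately, so overlap only appears between chunks that come from the same page; a chunk that ends one page shares nothing with the chunk that starts the next.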
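
Overlap can also be measured rather than eyeballed. The helper below is a small sketch (the name `shared_overlap` is hypothetical) that finds the longest text ending one string and starting the next. With real LangChain chunks the measured overlap may be shorter than `chunk_overlap`, since splits land on separators and whitespace at chunk edges is stripped by default.

```python
def shared_overlap(left, right):
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, 0, -1):  # try the longest candidate first
        if left.endswith(right[:size]):
            return right[:size]
    return ""


end_of_chunk = "the vocabulary and craft of coding. Design"
start_of_next = "coding. Design Having pooled our experience"

overlap = shared_overlap(end_of_chunk, start_of_next)
print(repr(overlap), len(overlap))
```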

In [5]:
# Create splitter with overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100
)

# Get first few chunks
overlap_chunks = splitter.split_documents(pages1[:2])

print("Showing overlap between consecutive chunks:\n")

# Compare the first two pairs of consecutive chunks
for i in range(min(2, len(overlap_chunks)-1)):
    chunk1 = overlap_chunks[i].page_content
    chunk2 = overlap_chunks[i+1].page_content
    
    # Print the end of one chunk and the start of the next to spot shared text
    print(f"--- Chunk {i+1} ends with: ---")
    print(chunk1[-150:])  # Last 150 chars of chunk 1
    
    print(f"\n--- Chunk {i+2} starts with: ---")
    print(chunk2[:150])  # First 150 chars of chunk 2
    
    print(f"\n{'='*60}\n")

print("👆 Notice how chunks share some text to maintain context!")

Showing overlap between consecutive chunks:

--- Chunk 1 ends with: ---
nt 
 
 
 
This is a pre-print version of our paper that will be published in Qualitative Research 
Journal (https://doi.org/10.1108/QRJ -12-2018-0012)

--- Chunk 2 starts with: ---
2 
Purpose 
Qualitative research has gained in importance in the social sciences. General knowledge about 
qualitative data analysis, how to code qual


--- Chunk 2 ends with: ---
rticle 
offers researchers who are new to qualitative research a thorough yet practical introduction to 
the vocabulary and craft of coding. 
 
Design

--- Chunk 3 starts with: ---
the vocabulary and craft of coding. 
 
Design 
Having pooled our experience in coding qualitative material and teaching students how to 
code, in this


👆 Notice how chunks share some text to maintain context!


## Metadata Preservation

Chunks preserve metadata from the original document (page numbers, source, etc.)

In [None]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

chunks_with_metadata = splitter.split_documents(pages1)

print("Metadata in chunks:\n")
for i in range(min(5, len(chunks_with_metadata))):
    chunk = chunks_with_metadata[i]
    print(f"Chunk {i+1}:")
    print(f"  Source: {chunk.metadata.get('source', 'N/A')}")
    print(f"  Page: {chunk.metadata.get('page', 'N/A')}")
    print(f"  Length: {len(chunk.page_content)} chars")
    print()

print("💡 Metadata helps track where each chunk came from!")
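
One practical use of this metadata is building a citation for each chunk. The sketch below uses `SimpleNamespace` objects as stand-ins for LangChain documents (an assumption made so that it runs without any PDF), and the helper names are hypothetical. It adds 1 to `page` because the metadata shown earlier stores a zero-based page index (`'page': 0` alongside `'page_label': '1'`).

```python
import os
from collections import Counter
from types import SimpleNamespace


def cite(chunk):
    """Format a short 'file, p. N' citation from a chunk's metadata."""
    source = os.path.basename(chunk.metadata.get("source", "unknown"))
    page = chunk.metadata.get("page")
    if page is None:
        return source
    return f"{source}, p. {page + 1}"  # page index is zero-based


def chunks_per_page(chunks):
    """Count how many chunks each page produced."""
    return Counter(c.metadata.get("page") for c in chunks)


# Stand-in chunks with the same attributes the notebook's chunks expose
demo_chunks = [
    SimpleNamespace(page_content="...", metadata={"source": "data/paper.pdf", "page": 0}),
    SimpleNamespace(page_content="...", metadata={"source": "data/paper.pdf", "page": 1}),
    SimpleNamespace(page_content="...", metadata={"source": "data/paper.pdf", "page": 1}),
]

print(cite(demo_chunks[0]))
print(chunks_per_page(demo_chunks))
```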

## Practical Example: RAG Pipeline

Let's see how chunking fits into a real RAG application.

In [6]:
# Complete RAG Pipeline Step-by-Step

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

print("RAG Pipeline Steps:\n")

# Step 1: Load documents
print("1️⃣ Loading documents...")
loader = PyPDFLoader(pdf1_path)
documents = loader.load()
print(f"   Loaded {len(documents)} pages")

# Step 2: Chunk documents
print("\n2️⃣ Chunking documents...")
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_documents(documents)
print(f"   Created {len(chunks)} chunks")

# Step 3: Would create embeddings (not shown - requires API key)
print("\n3️⃣ Next steps (not executed):")
print("   - Create embeddings for each chunk")
print("   - Store in vector database (Chroma, Pinecone, etc.)")
print("   - When user asks question:")
print("     a) Convert question to embedding")
print("     b) Find most similar chunks")
print("     c) Send relevant chunks to LLM")
print("     d) LLM generates answer")

print("\n✅ Chunking is Step 2 in the RAG pipeline!")

RAG Pipeline Steps:

1️⃣ Loading documents...
   Loaded 27 pages

2️⃣ Chunking documents...
   Created 67 chunks

3️⃣ Next steps (not executed):
   - Create embeddings for each chunk
   - Store in vector database (Chroma, Pinecone, etc.)
   - When user asks question:
     a) Convert question to embedding
     b) Find most similar chunks
     c) Send relevant chunks to LLM
     d) LLM generates answer

✅ Chunking is Step 2 in the RAG pipeline!
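
The retrieval steps that were not executed can be sketched with a toy similarity measure. This is a hypothetical stand-in, not a real embedding model or vector database: `embed` just counts words, and the function names are made up for illustration. It shows the shape of steps a) to c): embed the question, rank chunks by similarity, and assemble the prompt that would be sent to the LLM.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': word counts (a real pipeline would use an embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question, context_chunks):
    """Assemble the text that would be sent to the LLM."""
    context = "\n\n".join(context_chunks)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"


toy_chunks = [
    "Coding is the process of labelling qualitative data.",
    "Mobile devices help learners study English.",
    "Interviews were recorded and transcribed.",
]
question = "How do learners use mobile devices?"
top = retrieve(question, toy_chunks, k=1)
print(build_prompt(question, top))
```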


## Best Practices for Chunking

### 1. Chunk Size Guidelines

| Document Type | Recommended Chunk Size (characters) | Reason |
|--------------|----------------------|--------|
| **Research Papers** | 1000-1500 | Preserve paragraph context |
| **Books** | 1500-2000 | Longer narrative context |
| **News Articles** | 500-1000 | Shorter, focused content |
| **Technical Docs** | 800-1200 | Balance detail and context |
| **Social Media** | 200-500 | Short, standalone posts |

### 2. Overlap Rules
- **General rule:** 10-20% of chunk size
- **More overlap** = Better context continuity but more storage
- **Less overlap** = Less storage but might lose context
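
The storage cost of overlap follows from simple arithmetic: with fixed-width chunks, each new chunk advances by `chunk_size - chunk_overlap` characters but stores `chunk_size` characters. A rough sketch (the helper name is hypothetical, and real splitters will deviate because they cut at separators):

```python
def storage_multiplier(chunk_size, chunk_overlap):
    """Approximate ratio of stored characters to original characters."""
    step = chunk_size - chunk_overlap
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    return chunk_size / step


for overlap in (0, 100, 200):
    ratio = storage_multiplier(1000, overlap)
    print(f"overlap={overlap}: about {ratio:.2f}x the original text")
```

Under this approximation, 20% overlap stores roughly a quarter more text than no overlap.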

### 3. Choosing a Splitter

```python
# For most cases (⭐ Recommended)
RecursiveCharacterTextSplitter  # Intelligent, preserves structure

# For simple documents
CharacterTextSplitter  # Fast, basic

# For precise token counting
TokenTextSplitter  # Measures chunk size in tokens rather than characters

# For maximum intelligence (advanced)
SemanticChunker  # Splits by meaning (requires embeddings)
```

### 4. Testing Your Chunks

Always check:
1. Are chunks too small? (losing context)
2. Are chunks too large? (exceeding token limits)
3. Do chunks split mid-sentence? (bad)
4. Do chunks preserve meaning? (important)
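
The first three checks can be partly automated. The sketch below works on plain strings; the function name and the `min_chars`/`max_chars` thresholds are hypothetical defaults to adjust for your documents, and the mid-sentence test is a simple punctuation heuristic. Whether chunks preserve meaning still needs a manual read.

```python
def check_chunks(chunks, min_chars=100, max_chars=1500):
    """Flag chunks that look too small, too large, or cut off mid-sentence."""
    issues = []
    for i, text in enumerate(chunks):
        stripped = text.strip()
        if len(stripped) < min_chars:
            issues.append((i, "too small"))
        if len(stripped) > max_chars:
            issues.append((i, "too large"))
        # Heuristic: a complete sentence usually ends with closing punctuation.
        if stripped and stripped[-1] not in ".!?\"')":
            issues.append((i, "ends mid-sentence"))
    return issues


sample_chunks = [
    "A complete sentence. " * 10,   # fine
    "short",                        # too small and cut off
    "x" * 2000 + ".",               # too large
]
report = check_chunks(sample_chunks)
for index, problem in report:
    print(f"Chunk {index + 1}: {problem}")
```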

## Advanced: Custom Splitting Logic

You can create custom splitters for specific needs.

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

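# Custom splitter for academic papers (tries to preserve sections)
# Note: the header separators only help if the text contains Markdown headings;
# raw PDF text from PyPDFLoader usually does not, so splits fall back to paragraphs.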
academic_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=300,
    separators=[
        "\n## ",      # Markdown headers
        "\n### ",     # Sub-headers
        "\n\n",       # Paragraphs
        "\n",         # Lines
        ". ",         # Sentences
        " ",          # Words
        ""            # Characters
    ]
)

academic_chunks = academic_splitter.split_documents(pages1)
print(f"Academic splitter created {len(academic_chunks)} chunks")
print("\nThis splitter tries to keep sections and subsections together!")

## Summary: The Complete Chunking Process

### What We Learned:

1. **Why Chunk?**
   - LLM token limits
   - Better retrieval
   - Cost efficiency

2. **Key Parameters:**
   - `chunk_size`: Max characters per chunk (500-2000)
   - `chunk_overlap`: Shared characters (10-20% of chunk_size)
   - `separators`: How to split (paragraphs → lines → words)

3. **Splitter Types:**
   - **RecursiveCharacterTextSplitter** ⭐ Best for most cases
   - **CharacterTextSplitter** - Simple splitting
   - **TokenTextSplitter** - Token-based (for LLMs)
   - **SemanticChunker** - AI-powered (advanced)

4. **Best Practices:**
   - Start with chunk_size=1000, overlap=200
   - Test with your specific documents
   - Check chunk quality manually
   - Preserve metadata

### Next Steps:

After chunking, you would:
1. Create embeddings for each chunk
2. Store in a vector database
3. Use for retrieval in RAG applications

**Remember:** Good chunking = Better RAG performance! 🚀