# Pre-Retrieval Methods: Comprehensive Guide to Document Chunking Strategies

This notebook compares four document chunking strategies for Retrieval-Augmented Generation (RAG) systems. Chunking is a preprocessing step that directly affects retrieval quality and, in turn, the quality of the generated answers.

## Learning Objectives

By the end of this notebook, you will understand:

### Four Core Document Chunking Strategies
1. **Baseline Chunking**: Character-based splitting with fixed size limits
2. **Recursive Character Chunking**: Hierarchical text splitting that respects natural language boundaries
3. **Unstructured Chunking**: Structure-aware document processing that preserves semantic elements
4. **Docling Chunking**: Advanced hybrid parsing with sophisticated document understanding

### Evaluation Framework
- Systematic evaluation methodology for chunking strategies
- Key performance metrics for RAG system assessment

## Dataset and Methodology

We demonstrate these techniques using research papers from the [mahimaarora025/research_papers](https://huggingface.co/datasets/mahimaarora025/research_papers/tree/main/sample_research_papers) dataset. This dataset contains peer-reviewed academic papers spanning multiple domains:
- Analytics and Data Science
- Computer Vision
- Generative AI
- Machine Learning
- Statistics

The diverse academic content provides an excellent testbed for evaluating chunking strategies across different document structures and content types.


## Environment Setup and Dependencies

This section initializes the required libraries and configures the environment for our chunking strategy comparison. We'll use a combination of document processing libraries, language models, and evaluation frameworks.

### Key Dependencies
- **LangChain**: Document loading and text splitting utilities
- **Unstructured**: Advanced document parsing and structure recognition
- **Docling**: State-of-the-art document conversion and chunking
- **HuggingFace Transformers**: Embedding models for semantic similarity
- **Qdrant**: Vector database for storing and retrieving document chunks
- **RAGAS**: Evaluation framework for RAG system assessment


In [None]:
# Core Python libraries for file handling and data manipulation
import os
import gc
import tempfile
import uuid
from typing import List, Dict, Any
from pathlib import Path

# Dataset loading from HuggingFace Hub
from datasets import load_dataset

# LangChain ecosystem for document processing and language model integration
from langchain.schema import Document
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings

# Advanced document processing libraries
from unstructured.partition.pdf import partition_pdf  # PDF parsing with structure recognition
from unstructured.chunking.title import chunk_by_title  # Title-based intelligent chunking
from docling.document_converter import DocumentConverter  # Advanced document conversion
from docling.chunking import HybridChunker  # Hybrid semantic-syntactic chunking

# Vector database and embeddings for semantic search
from langchain_huggingface import HuggingFaceEmbeddings
from qdrant_client import QdrantClient
from langchain_qdrant import QdrantVectorStore

# Data analysis and evaluation frameworks
import json
import pandas as pd
import numpy as np
from retrieval_playground.src.pre_retrieval.chunking_evaluation import ChunkingEvaluator

# Additional Docling components for markdown processing
from langchain_text_splitters import MarkdownHeaderTextSplitter
from langchain_docling.loader import ExportType
from langchain_docling import DoclingLoader

# Environment variable management
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# System configuration and API keys
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")  # Gemini API for language model operations
QDRANT_URL = os.getenv("QDRANT_URL")  # Qdrant cloud instance URL
QDRANT_KEY = os.getenv("QDRANT_KEY")  # Qdrant authentication key
EMBEDDING_MODEL = "Qwen/Qwen3-Embedding-0.6B"  # Lightweight multilingual embedding model

print("Imports completed")

## Loading Models

In [None]:
# Verify API credentials are available
if not GOOGLE_API_KEY:
    raise ValueError("Please set GOOGLE_API_KEY environment variable")

# Initialize Google Gemini language model for text generation and evaluation
# Using Gemini 2.0 Flash for fast inference with low temperature for consistency
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0.1,  # Low temperature for more consistent (not strictly deterministic) outputs
    google_api_key=GOOGLE_API_KEY
)

# Initialize HuggingFace embedding model for semantic similarity computation
# Qwen3-Embedding-0.6B provides efficient multilingual embeddings
embeddings = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL
)

print("Models initialized")

In [None]:
del llm, embeddings
gc.collect()

## Document Loading and Preprocessing

This section demonstrates loading a research paper to serve as our test document for comparing chunking strategies. We'll use a representative academic paper that contains typical document structures found in research literature.

### Data Source
The test document comes from the research papers collection available at:
https://huggingface.co/datasets/mahimaarora025/research_papers/tree/main/sample_research_papers

To get started:
1. Go to the URL above.
2. Download the paper titled `Generative_AI_2025_Frozen_in_Time__Parameter-Efficient_Time_Series_Transformers_via___Reservoir-Ind.pdf`.
3. Create the folder `data/sample_research_papers`.
4. Move the downloaded file into that folder.


In [None]:
pdf_path = "../data/sample_research_papers/Generative_AI_2025_Frozen_in_Time__Parameter-Efficient_Time_Series_Transformers_via___Reservoir-Ind.pdf"

In [None]:
from langchain_community.document_loaders import PyPDFLoader

# Initialize PDF loader for the selected research paper
loader = PyPDFLoader(str(pdf_path))
pdf_docs = loader.load()  # Load all pages as separate document objects

# Extract the text of every page and join the pages into a single string,
# separating consecutive pages with a blank line
sample_data = "\n\n".join(doc.page_content for doc in pdf_docs)

# Document Chunking Strategy Implementations

This section provides hands-on demonstrations of four distinct chunking approaches. Each strategy represents a different philosophy for dividing documents into manageable pieces while preserving semantic coherence and structural integrity.

## Strategy 1: Baseline Character-Based Chunking

### Overview
The baseline approach splits text on a single separator (here, paragraph breaks) and merges the resulting pieces into chunks of up to a fixed number of characters. This method prioritizes speed and simplicity over semantic preservation.

### Characteristics
- **Speed**: Fastest execution time
- **Simplicity**: Minimal configuration required
- **Limitations**: May split sentences, paragraphs, or concepts mid-way
- **Best Use Cases**: Large-scale processing where speed trumps precision
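
To make the idea of size-limited chunks with overlap concrete, here is a minimal pure-Python sketch. It is an illustration only and not LangChain's implementation: `fixed_size_chunks` is a hypothetical helper that slices at fixed character offsets, whereas `CharacterTextSplitter` splits on a separator first and then merges pieces.

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Slice text into windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters. Hypothetical helper
    for illustration; it ignores sentence and paragraph boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "Chunking splits long documents. Each piece is embedded separately."
for chunk in fixed_size_chunks(sample, chunk_size=30, overlap=5):
    print(repr(chunk))
```

Because the cut points depend only on character offsets, words and sentences can be split mid-way, which is the limitation listed above.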


In [None]:
# Configure the baseline character-based text splitter.
# CharacterTextSplitter splits on one separator and merges the pieces up to chunk_size;
# a piece longer than chunk_size is not split further, so some chunks can exceed the limit.
baseline_splitter = CharacterTextSplitter(
    chunk_size=5000,        # Target chunk size in characters
    chunk_overlap=100,      # Character overlap between consecutive chunks
    separator="\n\n"        # The only split point used (paragraph breaks)
)

# Apply baseline chunking to the sample document
baseline_chunks = baseline_splitter.split_text(sample_data)

# Display results summary
print("BASELINE CHUNKING RESULTS")
print(f"Number of chunks: {len(baseline_chunks)}")
print("-" * 50)

# Show first chunk as example
print(baseline_chunks[0])


## Strategy 2: Recursive Character Chunking

### Overview
Recursive character chunking employs a hierarchical approach that attempts to split text at natural boundaries while respecting size constraints. This method balances efficiency with semantic preservation.

### Splitting Hierarchy
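With LangChain's default separators, the algorithm tries to split text in the following order of preference. The code cell in this section restricts the list to the first two separators, so a single line longer than the chunk size is kept whole: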
1. **Paragraph breaks** (`\n\n`) - Preserves conceptual boundaries
2. **Line breaks** (`\n`) - Maintains sentence structure when possible
3. **Spaces** (` `) - Avoids breaking words
4. **Character-level** - Last resort when size constraints are strict
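
The hierarchy above can be sketched in a few lines of pure Python. This is a simplified illustration under stated assumptions, not `RecursiveCharacterTextSplitter` itself: `recursive_split` is a hypothetical helper, it ignores chunk overlap, and it falls back to hard character cuts whenever the separator list runs out.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split text so that every chunk has at most chunk_size characters.

    Splits on the first separator, merges neighbouring pieces while they fit,
    and re-splits any piece that is still too long with the remaining separators.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate          # the piece still fits in the open chunk
            continue
        if current:
            chunks.append(current)       # close the open chunk
            current = ""
        if len(part) <= chunk_size:
            current = part               # start a new chunk with this piece
        else:
            # Too long even on its own: retry with finer separators.
            chunks.extend(recursive_split(part, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


text = "para one line a\npara one line b\n\npara two"
for chunk in recursive_split(text, chunk_size=20, separators=["\n\n", "\n", " ", ""]):
    print(repr(chunk))
```

The paragraph break is tried first; only the first paragraph, which exceeds the limit, is split again at its line break.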

### Characteristics
- **Intelligence**: Respects natural text boundaries
- **Flexibility**: Configurable separator hierarchy
- **Balance**: Good trade-off between speed and quality
- **Best Use Cases**: General-purpose text chunking for most applications


In [None]:
# Configure recursive character text splitter with hierarchical boundary detection
# This approach intelligently chooses split points based on natural text structure
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=5000,                    # Target chunk size in characters
    chunk_overlap=50,                   # Overlap for context continuity
    separators=["\n\n", "\n"]           # Hierarchical split preferences: paragraphs, then lines
)

# Apply recursive chunking strategy to the sample document
recursive_chunks = recursive_splitter.split_text(sample_data)

# Display results summary
print("RECURSIVE CHUNKING RESULTS")
print(f"Number of chunks: {len(recursive_chunks)}")
print("-" * 50)

# Show first chunk demonstrating boundary-aware splitting
print(recursive_chunks[0])

## Strategy 3: Unstructured Document-Aware Chunking

### Overview
The Unstructured library provides sophisticated document parsing that recognizes and preserves document structure. This approach understands document elements like titles, headers, paragraphs, lists, and tables before applying chunking logic.

### Document Understanding Capabilities
- **Element Detection**: Automatically identifies titles, headers, body text, captions
- **Structure Preservation**: Maintains hierarchical relationships between elements
- **Content Classification**: Distinguishes between different types of content
- **Intelligent Grouping**: Chunks content based on semantic coherence rather than arbitrary size

### Characteristics
- **Accuracy**: High-quality structure recognition
- **Semantic Preservation**: Maintains document logic and flow
- **Flexibility**: Handles diverse document formats and layouts
- **Best Use Cases**: Documents with clear structure, academic papers, reports, books
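
The grouping idea behind title-based chunking can be sketched without the library. The example below is an assumption-laden illustration: `Element` and `group_by_title` are hypothetical stand-ins, not Unstructured's classes or its `chunk_by_title` implementation, and an element longer than `max_characters` is simply kept whole here.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a parsed document element."""
    category: str  # e.g. "Title" or "NarrativeText"
    text: str


def group_by_title(elements: list[Element], max_characters: int) -> list[str]:
    """Start a new chunk at every title, or when the size limit would be exceeded."""
    chunks: list[str] = []
    current: list[str] = []  # texts collected for the chunk being built
    for el in elements:
        candidate = "\n\n".join(current + [el.text])
        starts_section = el.category == "Title"
        if current and (starts_section or len(candidate) > max_characters):
            chunks.append("\n\n".join(current))
            current = []
        current.append(el.text)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


demo_elements = [
    Element("Title", "Introduction"),
    Element("NarrativeText", "Chunking matters."),
    Element("Title", "Method"),
    Element("NarrativeText", "We compare four strategies."),
    Element("NarrativeText", "Each uses the same paper."),
]
for chunk in group_by_title(demo_elements, max_characters=50):
    print(repr(chunk))
```

Each chunk begins at a section title where possible, so body text stays attached to the heading that introduces it.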

### Implementation Details
We'll process the PDF directly to extract structural elements before applying title-based chunking:


In [None]:
# Parse the PDF into structural elements with the Unstructured library.
# The "fast" strategy extracts the embedded text directly, trading layout accuracy for speed.
elements = partition_pdf(
    pdf_path,
    strategy="fast",
    infer_table_structure=True          # Applies only with strategy="hi_res"; has no effect with "fast"
)

# Apply title-based intelligent chunking that respects document hierarchy
# This groups content under relevant headings and maintains semantic coherence
unstructured_chunks = chunk_by_title(
    elements, 
    max_characters=5000                 # Maximum chunk size while preserving structure
)

# Convert chunk objects to text strings, keeping only the first six for this demo.
# The chunk count printed below, and any later comparison, reflects this truncated list.
unstructured_chunks = [str(chunk) for chunk in unstructured_chunks[:6]]

# Display results summary
print("UNSTRUCTURED CHUNKING RESULTS")
print(f"Number of chunks: {len(unstructured_chunks)}")
print("-" * 50)

# Show first chunk demonstrating structure-aware processing
print(unstructured_chunks[0])

## Strategy 4: Docling Advanced Hybrid Chunking

### Overview

Docling Hybrid Chunking processes documents by combining structured PDF parsing, markdown conversion, and intelligent chunking. 
It preserves hierarchical structure while creating semantically meaningful chunks.

### Processing Pipeline

- **PDF Conversion** – Converts PDF documents to structured markdown.
- **Header-Aware Splitting** – Uses markdown headers (#, ##) to guide chunk boundaries while preserving headers for context.
- **Hybrid Chunking** – Integrates token-based limits with semantic awareness for fine-grained and meaningful chunks.
- **Document Loading** – Processes documents through the Docling loader, applying the hybrid chunking pipeline.

### Key Features

- **Structure Preservation** – Maintains headings and document hierarchy.
- **Semantic Awareness** – Considers content meaning alongside structure.
- **Custom Token Limits** – Fine-grained chunks optimized for downstream embedding models.
- **Advanced Parsing** – Handles complex PDF layouts converted to markdown.

### Characteristics

- **Accuracy** – Produces high-quality chunks that respect structure and content.
- **Completeness** – Preserves key information across document sections.
- **Resource Usage** – More computationally intensive due to hybrid processing.
- **Best Use Cases** – Complex or high-value documents where structure and context matter.
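
The header-aware splitting step can be illustrated with a short pure-Python sketch. It is a simplified approximation, not the `MarkdownHeaderTextSplitter` implementation: `split_on_headers` is a hypothetical helper, the `Header_1` and `Header_2` keys mirror the names used in this notebook, and `#` lines inside fenced code blocks are not handled.

```python
import re

# Matches level-1 and level-2 markdown headers only ("#" or "##" followed by whitespace).
HEADER_RE = re.compile(r"^(#{1,2})\s+(.*)$")


def split_on_headers(markdown: str) -> list[dict]:
    """Split markdown at level-1 and level-2 headers, keeping header lines in the content."""
    sections: list[dict] = []
    headers: dict[str, str] = {}   # header context of the section being built
    lines: list[str] = []

    def flush() -> None:
        content = "\n".join(lines).strip()
        if content:
            sections.append({"content": content, "headers": dict(headers)})
        lines.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()  # close the previous section before updating the context
            if len(match.group(1)) == 1:
                headers.clear()  # a new top-level section resets the subsection
                headers["Header_1"] = match.group(2).strip()
            else:
                headers["Header_2"] = match.group(2).strip()
        lines.append(line)
    flush()
    return sections


demo_markdown = "# Title\nIntro text.\n## Methods\nWe split.\n### Detail\nMore.\n## Results\nDone."
for section in split_on_headers(demo_markdown):
    print(section["headers"], repr(section["content"]))
```

Keeping the header line inside each section corresponds to `strip_headers=False`: every chunk carries its own heading as context.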

In [None]:
# Configure markdown header-based splitter for structured document processing
# This respects the hierarchical structure created by Docling's markdown conversion
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "Header_1"),              # Top-level headers (titles, major sections)
        ("##", "Header_2"),             # Second-level headers (subsections)
    ],
    strip_headers=False                 # Preserve headers in chunks for context
)

# Initialize the Docling loader.
# Note: as documented for langchain-docling, ExportType.MARKDOWN returns each input
# document as a single markdown Document, and the chunker argument is applied only in
# ExportType.DOC_CHUNKS mode. The chunks in this cell therefore come from the
# header splitter configured above.
loader = DoclingLoader(
    file_path=pdf_path,
    export_type=ExportType.MARKDOWN,   # Convert to structured markdown format
    chunker=HybridChunker(              # Used only when export_type=ExportType.DOC_CHUNKS
        tokenizer=EMBEDDING_MODEL,      # Use same tokenizer as embedding model
        max_tokens=100                  # Conservative token limit for fine-grained chunks
    )
)

# Process document through Docling pipeline
docs = loader.load()

# Apply header-aware splitting to the markdown-converted content
# Creates chunks that respect document structure and semantic boundaries
docling_chunks = [
    split.page_content 
    for doc in docs 
    for split in splitter.split_text(doc.page_content)
]

# Display results summary
print("DOCLING CHUNKING RESULTS")
print(f"Number of chunks: {len(docling_chunks)}")
print("-" * 50)

# Show first chunk demonstrating advanced structure preservation
print(docling_chunks[0])

# Chunking Strategy Comparative Analysis

This section provides a quantitative comparison of the four chunking strategies implemented above. We'll examine key metrics including chunk count, average chunk length, and qualitative characteristics to understand the trade-offs between different approaches.
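
An average alone can hide very uneven chunk sizes. As an optional complement, the sketch below computes a fuller length distribution per strategy; `length_stats` is a hypothetical helper that is not part of the original notebook and uses only the standard library.

```python
from statistics import mean, median


def length_stats(chunks: list[str]) -> dict[str, float]:
    """Summarize chunk lengths in characters; returns zeros for an empty list."""
    if not chunks:
        return {"count": 0, "min": 0, "median": 0, "mean": 0, "max": 0}
    lengths = [len(chunk) for chunk in chunks]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": median(lengths),
        "mean": mean(lengths),
        "max": max(lengths),
    }


print(length_stats(["a", "bbb", "cc"]))
```

A large gap between the median and the maximum suggests that a few oversized chunks dominate the average.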

In [None]:
# Define strategy comparison dataset with descriptive metadata
strategies = [
    ("Baseline", baseline_chunks, "Character-based splitting"),
    ("Recursive", recursive_chunks, "Boundary-aware splitting"), 
    ("Unstructured", unstructured_chunks, "Structure-aware parsing"),
    ("Docling", docling_chunks, "Advanced hybrid parsing")
]

# Calculate comparative metrics for each chunking strategy
comparison_data = []
for name, chunks, description in strategies:
    # Compute average chunk length as primary size metric (0 if a strategy produced no chunks)
    avg_length = sum(len(chunk) for chunk in chunks) / len(chunks) if chunks else 0
    
    # Compile strategy performance summary
    comparison_data.append({
        "Strategy": name,
        "Chunks": len(chunks),              # Total number of chunks produced
        "Avg Length": f"{avg_length:.0f}",  # Mean characters per chunk
        "Description": description          # Strategy characterization
    })

# Create comparison DataFrame for structured analysis
comparison_df = pd.DataFrame(comparison_data)

# Display formatted comparison results
print("CHUNKING STRATEGY COMPARISON")
print("=" * 70)
print(comparison_df.to_string(index=False))
print("=" * 70)

In [None]:
del baseline_chunks, recursive_chunks, unstructured_chunks, docling_chunks
gc.collect()

# Vector Database Integration and Collection Management

This section demonstrates connecting to Qdrant, a high-performance vector database, to explore existing chunk collections. Vector databases are essential for RAG systems as they enable semantic search and similarity-based retrieval of document chunks.

## Qdrant Vector Database
Qdrant provides:
- **High Performance**: Optimized for similarity search at scale
- **Flexibility**: Support for various distance metrics and filtering
- **Scalability**: Handles large collections efficiently
- **Integration**: Seamless integration with embedding models

## Collection Exploration
We'll examine existing collections to understand how different chunking strategies have been previously processed and stored.
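
To show what similarity-based retrieval means, here is a brute-force sketch using cosine similarity over toy vectors. It is a conceptual illustration only: the vectors and chunk labels are made up, `cosine_similarity` and `top_k` are hypothetical helpers, and Qdrant itself relies on an approximate nearest-neighbor index (HNSW) rather than scanning every vector.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], stored: dict[str, list[float]], k: int = 2) -> list[tuple[float, str]]:
    """Return the k stored chunks most similar to the query, best first."""
    scored = [(cosine_similarity(query, vector), text) for text, vector in stored.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real embedding models produce far larger vectors.
toy_store = {
    "chunk about transformers": [0.9, 0.1, 0.0],
    "chunk about reservoirs": [0.1, 0.9, 0.0],
    "chunk about statistics": [0.0, 0.1, 0.9],
}
for score, text in top_k([1.0, 0.2, 0.0], toy_store, k=2):
    print(f"{score:.3f}  {text}")
```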


In [None]:
if not QDRANT_URL or not QDRANT_KEY:
    print("Warning: Qdrant credentials not found. Please set QDRANT_URL and QDRANT_KEY")
    qdrant_client = None
else:
    # Initialize Qdrant client with cloud credentials
    qdrant_client = QdrantClient(
        url=QDRANT_URL,    # Cloud instance endpoint
        api_key=QDRANT_KEY # Authentication key
    )
    print("Qdrant connection established")
    
    # Retrieve and display available vector collections
    # Each collection typically represents a different chunking strategy or dataset
    collections = qdrant_client.get_collections()
    print("\nAVAILABLE QDRANT COLLECTIONS")
    print("-" * 40)
    
    if collections.collections:
        # List all collections with their basic information
        for collection in collections.collections:
            print(f"Collection: {collection.name}")
            # Additional collection metadata could be displayed here
            print()
    else:
        print("No collections found")

In [None]:
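# Peek at two stored points from the "baseline" collection (assumes that collection exists).
# scroll() returns a (points, next_page_offset) tuple rather than an iterator.
if qdrant_client is not None:
    points, next_page = qdrant_client.scroll(
        collection_name="baseline",
        limit=2
    )
    for p in points:
        print(p)
else:
    print("Skipping scroll: Qdrant client is not configured")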

# Comprehensive Evaluation Framework

### **What We're Evaluating**
We measure **end-to-end RAG pipeline performance** by testing how well each chunking strategy enables:
- **Information Retrieval**: Finding relevant document chunks
- **Answer Generation**: Producing accurate, helpful responses
- **Source Fidelity**: Maintaining truthfulness to original documents

---

### 📏 **Core Evaluation Metrics**

| Metric | **What It Measures** | **Key Question** |
|--------|---------------------|------------------|
| **Answer Relevancy** | How well the generated answer addresses the user's question | *"Does this answer actually help the user?"* |
| **Faithfulness** | Whether the answer is grounded in the retrieved context | *"Is the answer truthful to the source material?"* |
| **Context Precision** | Relevance of the retrieved chunks, rewarding relevant chunks that are ranked near the top | *"Did we retrieve the right information?"* |
| **Context Recall** | Completeness of information retrieval | *"Did we get all the necessary information?"* |
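
The two context metrics can be sketched numerically. This is a simplified illustration modeled on how RAGAS describes them, not its implementation: RAGAS obtains the relevance and support judgments from an LLM, whereas the hypothetical helpers below take those judgments as ready-made booleans.

```python
def context_precision(relevance: list[bool]) -> float:
    """Rank-weighted precision over retrieved chunks, in retrieval order.

    Averages precision@k over the positions k that hold a relevant chunk,
    so relevant chunks ranked earlier yield a higher score.
    """
    hits = 0
    weighted = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / rank  # precision@rank at a relevant position
    return weighted / hits if hits else 0.0


def context_recall(claim_supported: list[bool]) -> float:
    """Fraction of reference-answer claims that the retrieved context supports."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0


# The same two relevant chunks score higher when they are ranked earlier.
print(context_precision([True, False, True]))
print(context_precision([False, True, True]))
print(context_recall([True, True, False, True]))
```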

In [None]:
# Load standardized test queries for consistent evaluation across all chunking strategies
# These queries are designed to test different aspects of retrieval and comprehension
import json
with open("../tests/test_queries.json", 'r') as f:
    test_queries = json.load(f)

# Display sample query to demonstrate evaluation approach
print("\nSample Test Query:")
print(f"Question: {test_queries[0]['user_input']}")
print(f"Source Document: {test_queries[0]['source_file']}")

In [None]:
# Initialize comprehensive chunking evaluation framework
# This evaluator will test all four chunking strategies against standardized queries
evaluator = ChunkingEvaluator(
    query_count=2,  # Number of test queries per strategy
    metrics=[       # RAGAS evaluation metrics for comprehensive assessment
        'answer_relevancy',    
        'faithfulness',        
        'context_precision',   
        'context_recall'       
    ]
)

print("Starting comprehensive chunking evaluation...")
print("This process evaluates all strategies against standardized queries using RAGAS metrics.")
print("Expected duration: 3-5 minutes depending on system performance.\n")

results_df = None  # Stays None if the evaluation fails

try:
    # Execute evaluation across all chunking strategies
    # This runs complete RAG pipelines for each strategy with identical test conditions
    results_df = evaluator.evaluate_all_strategies()
    print("\nEvaluation completed successfully!")
    
    # Provide summary of evaluation results
    print(f"Results matrix: {results_df.shape}")
    print("Analysis includes performance across all metrics and strategies")
    
except Exception as e:
    print(f"Evaluation failed with error: {e}")
    print("Check vector database connectivity and API credentials")

In [None]:
# Display formatted results (skipped if the evaluation above failed)
if results_df is not None:
    evaluator.print_results(results_df)
else:
    print("No results to display")