# Pre-Retrieval Methods: Comprehensive Guide to Document Chunking Strategies

This notebook compares four document chunking strategies for Retrieval-Augmented Generation (RAG) systems. Chunking is a preprocessing step that directly affects retrieval quality and, in turn, the quality of the generated answers.

## Learning Objectives

By the end of this notebook, you will understand:

### Four Core Document Chunking Strategies
1. **Baseline Chunking**: Character-based splitting with fixed size limits
2. **Recursive Character Chunking**: Hierarchical text splitting that respects natural language boundaries
3. **Unstructured Chunking**: Structure-aware document processing that preserves semantic elements
4. **Docling Chunking**: Advanced hybrid parsing with sophisticated document understanding

### Evaluation Framework
- Systematic evaluation methodology for chunking strategies
- Key performance metrics for RAG system assessment
- Evidence-based recommendations for strategy selection
- Practical trade-offs between quality, speed, and complexity

## Dataset and Methodology

We demonstrate these techniques using research papers from the [mahimaarora025/research_papers](https://huggingface.co/datasets/mahimaarora025/research_papers/tree/main/sample_research_papers) dataset. This dataset contains academic research papers spanning multiple domains:
- Analytics and Data Science
- Computer Vision
- Generative AI
- Machine Learning
- Statistics

The diverse academic content provides an excellent testbed for evaluating chunking strategies across different document structures and content types.


## Environment Setup and Dependencies

This section initializes the required libraries and configures the environment for our chunking strategy comparison. We'll use a combination of document processing libraries, language models, and evaluation frameworks.

### Key Dependencies
- **LangChain**: Document loading and text splitting utilities
- **Unstructured**: Advanced document parsing and structure recognition
- **Docling**: State-of-the-art document conversion and chunking
- **HuggingFace Transformers**: Embedding models for semantic similarity
- **Qdrant**: Vector database for storing and retrieving document chunks
- **RAGAS**: Evaluation framework for RAG system assessment


In [None]:
# Core Python libraries for file handling and data manipulation
import os
import tempfile
import uuid
from typing import List, Dict, Any
from pathlib import Path

# Dataset loading from HuggingFace Hub
from datasets import load_dataset

# LangChain ecosystem for document processing and language model integration
from langchain.schema import Document
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings

# Advanced document processing libraries
from unstructured.partition.pdf import partition_pdf  # PDF parsing with structure recognition
from unstructured.chunking.title import chunk_by_title  # Title-based intelligent chunking
from docling.document_converter import DocumentConverter  # Advanced document conversion
from docling.chunking import HybridChunker  # Hybrid semantic-syntactic chunking

# Vector database and embeddings for semantic search
from langchain_huggingface import HuggingFaceEmbeddings
from qdrant_client import QdrantClient
from langchain_qdrant import QdrantVectorStore

# Data analysis and evaluation frameworks
import json
import pandas as pd
import numpy as np
from retrieval_playground.src.pre_retrieval.chunking_evaluation import ChunkingEvaluator

# Additional Docling components for markdown processing
from langchain_text_splitters import MarkdownHeaderTextSplitter
from langchain_docling.loader import ExportType
from langchain_docling import DoclingLoader

# Environment variable management
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# System configuration and API keys
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")  # Gemini API for language model operations
QDRANT_URL = os.getenv("QDRANT_URL")  # Qdrant cloud instance URL
QDRANT_KEY = os.getenv("QDRANT_KEY")  # Qdrant authentication key
EMBEDDING_MODEL = "Qwen/Qwen3-Embedding-0.6B"  # Lightweight multilingual embedding model

print("Imports completed")




Imports completed


In [None]:
# Verify API credentials are available
if not GOOGLE_API_KEY:
    raise ValueError("Please set GOOGLE_API_KEY environment variable")

# Initialize Google Gemini language model for text generation and evaluation
# Using Gemini 2.0 Flash for fast inference with low temperature for consistency
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0.1,  # Low temperature for more consistent (not strictly deterministic) outputs
    google_api_key=GOOGLE_API_KEY
)

# Initialize HuggingFace embedding model for semantic similarity computation
# Qwen3-Embedding-0.6B provides efficient multilingual embeddings
embeddings = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL
)

print("Models initialized")

2025-09-20 22:41:16,008 - INFO - Use pytorch device_name: mps
2025-09-20 22:41:16,008 - INFO - Load pretrained SentenceTransformer: Qwen/Qwen3-Embedding-0.6B
2025-09-20 22:41:20,307 - INFO - 1 prompt is loaded, with the key: query


Models initialized


## Document Loading and Preprocessing

This section demonstrates loading a research paper to serve as our test document for comparing chunking strategies. We'll use a representative academic paper that contains typical document structures found in research literature.

### Data Source
The test document comes from the research papers collection available at:
https://huggingface.co/datasets/mahimaarora025/research_papers/tree/main/sample_research_papers

### Document Characteristics
Academic papers provide excellent test cases for chunking strategies because they contain:
- Abstract and introduction sections
- Multiple hierarchical headings
- Mathematical formulations and equations
- References and citations
- Tables and figures with captions
- Mixed text densities and complexity levels


In [3]:
pdf_path = "Generative_AI_2025_Frozen_in_Time__Parameter-Efficient_Time_Series_Transformers_via___Reservoir-Ind.pdf"

In [None]:
from langchain_community.document_loaders import PyPDFLoader

# Initialize PDF loader for the selected research paper
loader = PyPDFLoader(str(pdf_path))
pdf_docs = loader.load()  # Load all pages as separate document objects

# Extract the text of every page and combine it into a single string.
# Pages are joined with a blank line ("\n\n"), so page boundaries become
# candidate split points for the paragraph-level separators used below.
sample_data = "\n\n".join(doc.page_content for doc in pdf_docs)

# Document Chunking Strategy Implementations

This section provides hands-on demonstrations of four distinct chunking approaches. Each strategy represents a different philosophy for dividing documents into manageable pieces while preserving semantic coherence and structural integrity.

## Strategy 1: Baseline Character-Based Chunking

### Overview
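The baseline approach splits text on a single separator and merges the resulting pieces until a target character count is reached. It prioritizes speed and simplicity over semantic preservation. Because only one separator is used, a piece longer than the target is kept whole, so chunks can exceed `chunk_size`.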

### Characteristics
- **Speed**: Fastest execution time
- **Simplicity**: Minimal configuration required
- **Limitations**: Split points depend only on where the separator occurs, so chunks may cut across sections or concepts and can vary widely in size
- **Best Use Cases**: Large-scale processing where speed trumps precision
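
The split-and-merge behavior described above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `split_on_separator` is made up for this example, and overlap handling is left out. It shows why a single-separator splitter can emit chunks longer than the target size.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on one separator, then greedily merge pieces up to chunk_size."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece fits, or it is a lone oversized piece that
            # cannot be split further because only one separator is known.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "short para\n\n" + "x" * 30 + "\n\nanother short one"
chunks = split_on_separator(text, "\n\n", chunk_size=20)
print([len(c) for c in chunks])  # [10, 30, 17]
```

With a target of 20 characters, the 30-character paragraph is still emitted as one chunk because there is no finer separator to fall back on.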


In [None]:
# Configure baseline character-based text splitter
# Splits on a single separator, then merges pieces up to the target size
baseline_splitter = CharacterTextSplitter(
    chunk_size=5000,        # Target chunk size in characters (oversized pieces are kept whole)
    chunk_overlap=100,      # Character overlap between consecutive chunks
    separator="\n\n"        # The only split point considered (paragraph breaks)
)

# Apply baseline chunking to the sample document
baseline_chunks = baseline_splitter.split_text(sample_data)

# Display results summary
print("BASELINE CHUNKING RESULTS")
print(f"Number of chunks: {len(baseline_chunks)}")
print("-" * 50)

# Show first chunk as example
print(baseline_chunks[0])




BASELINE CHUNKING RESULTS
Number of chunks: 8
--------------------------------------------------
Frozen in Time: Parameter-Efficient Time Series
Transformers via Reservoir-Induced Feature Expansion
and Fixed Random Dynamics
Pradeep Singha,*, Mehak Sharmaa, Anupriya Deya and Balasubramanian Ramana
aMachine Intelligence Lab, Department of Computer Science and Engineering, IIT Roorkee, Roorkee-247667, India
ORCID (Pradeep Singh): https://orcid.org/0000-0002-5372-3355, ORCID (Mehak Sharma):
https://orcid.org/0009-0001-3102-1045, ORCID (Anupriya Dey): https://orcid.org/0009-0000-1630-1017, ORCID
(Balasubramanian Raman): https://orcid.org/0000-0001-6277-6267
Abstract. Transformers are the de-facto choice for sequence mod-
elling, yet their quadratic self-attention and weak temporal bias can
make long-range forecasting both expensive and brittle. We intro-
duce FreezeTST, a lightweight hybrid that interleavesfrozen random-
feature (reservoir) blocks with standard trainable Transformer lay-
er

## Strategy 2: Recursive Character Chunking

### Overview
Recursive character chunking employs a hierarchical approach that attempts to split text at natural boundaries while respecting size constraints. This method balances efficiency with semantic preservation.

### Splitting Hierarchy
By default, the algorithm tries separators in the following order of preference:
1. **Paragraph breaks** (`\n\n`) - Preserves conceptual boundaries
2. **Line breaks** (`\n`) - Keeps lines intact when possible
3. **Spaces** (` `) - Avoids breaking words
4. **Character-level** - Last resort when a piece still exceeds the size limit

The configuration in this notebook passes only the first two separators, so text is never split inside a line.
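
The fallback through a separator hierarchy can be sketched as a small recursive function. This is a minimal illustration under simplifying assumptions, not LangChain's code: the name `recursive_split` is hypothetical, overlap is ignored, and the example uses three separators plus a hard character cut as the last resort.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split with the first separator; recurse with the next one on oversized pieces."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # Last resort: hard cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if len(piece) > chunk_size:
            # Flush what has been merged so far, then split this piece more finely.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, rest, chunk_size))
        elif not current:
            current = piece
        elif len(current) + len(sep) + len(piece) <= chunk_size:
            current += sep + piece
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "Para one line a\nPara one line b\n\nPara two is a single long line of words"
chunks = recursive_split(text, ["\n\n", "\n", " "], chunk_size=20)
for chunk in chunks:
    print(repr(chunk))
```

The first paragraph is split at its line break, while the second paragraph has no line break and falls through to the space separator, so no word is cut.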

### Characteristics
- **Intelligence**: Respects natural text boundaries
- **Flexibility**: Configurable separator hierarchy
- **Balance**: Good trade-off between speed and quality
- **Best Use Cases**: General-purpose text chunking for most applications


In [None]:
# Configure recursive character text splitter with hierarchical boundary detection
# This approach intelligently chooses split points based on natural text structure
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=5000,                    # Target chunk size in characters
    chunk_overlap=50,                   # Overlap for context continuity
    separators=["\n\n", "\n"]           # Hierarchical split preferences: paragraphs, then lines
)

# Apply recursive chunking strategy to the sample document
recursive_chunks = recursive_splitter.split_text(sample_data)

# Display results summary
print("RECURSIVE CHUNKING RESULTS")
print(f"Number of chunks: {len(recursive_chunks)}")
print("-" * 50)

# Show first chunk demonstrating boundary-aware splitting
print(recursive_chunks[0])

RECURSIVE CHUNKING RESULTS
Number of chunks: 15
--------------------------------------------------
Frozen in Time: Parameter-Efficient Time Series
Transformers via Reservoir-Induced Feature Expansion
and Fixed Random Dynamics
Pradeep Singha,*, Mehak Sharmaa, Anupriya Deya and Balasubramanian Ramana
aMachine Intelligence Lab, Department of Computer Science and Engineering, IIT Roorkee, Roorkee-247667, India
ORCID (Pradeep Singh): https://orcid.org/0000-0002-5372-3355, ORCID (Mehak Sharma):
https://orcid.org/0009-0001-3102-1045, ORCID (Anupriya Dey): https://orcid.org/0009-0000-1630-1017, ORCID
(Balasubramanian Raman): https://orcid.org/0000-0001-6277-6267
Abstract. Transformers are the de-facto choice for sequence mod-
elling, yet their quadratic self-attention and weak temporal bias can
make long-range forecasting both expensive and brittle. We intro-
duce FreezeTST, a lightweight hybrid that interleavesfrozen random-
feature (reservoir) blocks with standard trainable Transformer lay-


## Strategy 3: Unstructured Document-Aware Chunking

### Overview
The Unstructured library provides sophisticated document parsing that recognizes and preserves document structure. This approach understands document elements like titles, headers, paragraphs, lists, and tables before applying chunking logic.

### Document Understanding Capabilities
- **Element Detection**: Automatically identifies titles, headers, body text, captions
- **Structure Preservation**: Maintains hierarchical relationships between elements
- **Content Classification**: Distinguishes between different types of content
- **Intelligent Grouping**: Groups content under its section heading, subject to a maximum chunk size, rather than cutting at arbitrary offsets

### Characteristics
- **Accuracy**: High-quality structure recognition
- **Semantic Preservation**: Maintains document logic and flow
- **Flexibility**: Handles diverse document formats and layouts
- **Best Use Cases**: Documents with clear structure, academic papers, reports, books
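
The idea behind title-based grouping can be illustrated without the library. The sketch below is an assumption-laden toy: the `Element` class, its `kind` labels, and `chunk_by_heading` are invented for this example and only loosely mirror how typed elements are grouped under headings with a size cap.

```python
from dataclasses import dataclass


@dataclass
class Element:
    kind: str   # hypothetical labels for this sketch, e.g. "Title" or "NarrativeText"
    text: str


def chunk_by_heading(elements: list[Element], max_characters: int) -> list[str]:
    """Start a new chunk at every title; also break when the size cap would be exceeded."""
    chunks: list[str] = []
    current: list[str] = []

    def flush() -> None:
        if current:
            chunks.append("\n\n".join(current))
            current.clear()

    for el in elements:
        # Length of the joined chunk if this element were appended.
        size_if_added = sum(len(t) for t in current) + 2 * len(current) + len(el.text)
        if el.kind == "Title" or (current and size_if_added > max_characters):
            flush()
        current.append(el.text)
    flush()
    return chunks


elements = [
    Element("Title", "1 Introduction"),
    Element("NarrativeText", "Transformers are widely used."),
    Element("NarrativeText", "They can be expensive."),
    Element("Title", "2 Method"),
    Element("NarrativeText", "We freeze some blocks."),
]
chunks = chunk_by_heading(elements, max_characters=60)
for chunk in chunks:
    print(repr(chunk))
```

Each heading opens a new chunk, and the size cap forces an extra break inside the first section, so the body text under "2 Method" never mixes with the introduction.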

### Implementation Details
We'll process the PDF directly to extract structural elements before applying title-based chunking:


In [None]:
# Parse PDF into typed elements using the Unstructured library
# The "fast" strategy extracts the embedded text directly and skips the layout model
elements = partition_pdf(
    pdf_path,
    strategy="fast",
    infer_table_structure=True          # Only takes effect with strategy="hi_res"
)

# Apply title-based intelligent chunking that respects document hierarchy
# This groups content under relevant headings and maintains semantic coherence
unstructured_chunks = chunk_by_title(
    elements, 
    max_characters=5000                 # Maximum chunk size while preserving structure
)

# Convert chunk objects to text strings
# Note: only the first 6 chunks are kept for the demo, so the chunk count
# reported for this strategy is capped at 6
unstructured_chunks = [str(chunk) for chunk in unstructured_chunks[:6]]

# Display results summary
print("UNSTRUCTURED CHUNKING RESULTS")
print(f"Number of chunks: {len(unstructured_chunks)}")
print("-" * 50)

# Show first chunk demonstrating structure-aware processing
print(unstructured_chunks[0])

2025-09-20 22:41:42,652 - INFO - pikepdf C++ to Python logger bridge initialized


UNSTRUCTURED CHUNKING RESULTS
Number of chunks: 6
--------------------------------------------------
5 2 0 2

g u A 5 2

]

G L . s c [

1 v 0 3 1 8 1 . 8 0 5 2 : v i X r a

Frozen in Time: Parameter-Efficient Time Series Transformers via Reservoir-Induced Feature Expansion and Fixed Random Dynamics Pradeep Singha,*, Mehak Sharmaa, Anupriya Deya and Balasubramanian Ramana

aMachine Intelligence Lab, Department of Computer Science and Engineering, IIT Roorkee, Roorkee-247667, India ORCID (Pradeep Singh): https://orcid.org/0000-0002-5372-3355, ORCID (Mehak Sharma): https://orcid.org/0009-0001-3102-1045, ORCID (Anupriya Dey): https://orcid.org/0009-0000-1630-1017, ORCID (Balasubramanian Raman): https://orcid.org/0000-0001-6277-6267

Abstract. Transformers are the de-facto choice for sequence mod- elling, yet their quadratic self-attention and weak temporal bias can make long-range forecasting both expensive and brittle. We intro- duce FreezeTST, a lightweight hybrid that interleaves froze

## Strategy 4: Docling Advanced Hybrid Chunking

### Overview
Docling is a document conversion library that offers layout-aware PDF parsing and export to structured Markdown. Its hybrid chunker combines document-structure-based chunking with tokenizer-aware size limits to choose chunk boundaries.

### Advanced Processing Pipeline
1. **Document Conversion**: PDF to structured markdown with layout preservation
2. **Element Recognition**: Advanced AI-powered detection of document components
3. **Hybrid Chunking**: Combines structure-based chunking with token-level size limits
4. **Header-Aware Splitting**: Respects markdown headers for natural boundaries

### Key Innovations
- **AI-Powered Parsing**: Uses machine learning models for superior accuracy
- **Layout Understanding**: Preserves spatial relationships and document flow
- **Multi-Modal Processing**: Handles text, images, tables, and complex layouts
- **Structure-Aware Chunking**: Uses headings and element types, not only size, to place chunk boundaries

### Characteristics
- **Accuracy**: Most detailed document understanding of the four approaches compared here
- **Completeness**: Preserves maximum document information
- **Computational Cost**: Most resource-intensive approach
- **Best Use Cases**: High-value documents, complex layouts, maximum quality requirements
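
Header-aware splitting of Markdown, the last step of the pipeline above, can be sketched with a regular expression. This is a toy version for illustration only: `split_on_headers` is a hypothetical helper, it recognizes only `#` and `##` headers, keeps the header line inside each section, and does not special-case fenced code blocks.

```python
import re

# Matches "# Heading" or "## Heading"; deeper levels such as "###" are not split on.
HEADER_RE = re.compile(r"^(#{1,2})\s+(.*)$")


def split_on_headers(markdown: str) -> list[dict]:
    """Group lines under the most recent '#' or '##' header."""
    sections: list[dict] = []
    current = {"header": None, "lines": []}

    def has_content(section: dict) -> bool:
        return section["header"] is not None or any(l.strip() for l in section["lines"])

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            if has_content(current):
                sections.append(current)
            current = {"header": match.group(2).strip(), "lines": [line]}
        else:
            current["lines"].append(line)
    if has_content(current):
        sections.append(current)

    return [
        {"header": s["header"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
    ]


md = "## Title\nAbstract text.\n\n## 1 Introduction\nIntro text.\n### Detail\nMore."
sections = split_on_headers(md)
for section in sections:
    print(section["header"], "->", repr(section["text"]))
```

The `### Detail` line stays inside the "1 Introduction" section because only the first two header levels are treated as boundaries, which mirrors splitting on a configured subset of header levels.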


In [None]:
# Configure markdown header-based splitter for structured document processing
# This respects the hierarchical structure created by Docling's markdown conversion
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "Header_1"),              # Top-level headers (titles, major sections)
        ("##", "Header_2"),             # Second-level headers (subsections)
    ],
    strip_headers=False                 # Preserve headers in chunks for context
)

# Initialize Docling loader for PDF parsing and Markdown conversion
# Note: langchain-docling applies `chunker` only with ExportType.DOC_CHUNKS;
# with ExportType.MARKDOWN the document is exported whole and split by headers below
loader = DoclingLoader(
    file_path=pdf_path,
    export_type=ExportType.MARKDOWN,   # Convert to structured markdown format
    chunker=HybridChunker(              # Hybrid chunker (structure + token limits)
        tokenizer=EMBEDDING_MODEL,      # Use same tokenizer as embedding model
        max_tokens=100                  # Token limit per chunk
    )
)

# Process document through Docling pipeline
docs = loader.load()

# Apply header-aware splitting to the markdown-converted content
# Creates chunks that respect document structure and semantic boundaries
docling_chunks = [
    split.page_content 
    for doc in docs 
    for split in splitter.split_text(doc.page_content)
]

# Display results summary
print("DOCLING CHUNKING RESULTS")
print(f"Number of chunks: {len(docling_chunks)}")
print("-" * 50)

# Show first chunk demonstrating advanced structure preservation
print(docling_chunks[0])

2025-09-20 22:41:46,971 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-09-20 22:41:47,015 - INFO - Going to convert document batch...
2025-09-20 22:41:47,015 - INFO - Initializing pipeline for StandardPdfPipeline with options hash e647edf348883bed75367b22fbe60347
2025-09-20 22:41:47,030 - INFO - Loading plugin 'docling_defaults'
2025-09-20 22:41:47,031 - INFO - Registered picture descriptions: ['vlm', 'api']
2025-09-20 22:41:47,037 - INFO - Loading plugin 'docling_defaults'
2025-09-20 22:41:47,039 - INFO - Registered ocr engines: ['easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
2025-09-20 22:41:47,086 - INFO - Accelerator device: 'mps'
2025-09-20 22:41:48,889 - INFO - Accelerator device: 'mps'
2025-09-20 22:41:49,715 - INFO - Accelerator device: 'mps'
2025-09-20 22:41:50,137 - INFO - Processing document Generative_AI_2025_Frozen_in_Time__Parameter-Efficient_Time_Series_Transformers_via___Reservoir-Ind.pdf
2025-09-20 22:42:06,554 - INFO - Finished converting docum

DOCLING CHUNKING RESULTS
Number of chunks: 9
--------------------------------------------------
## Frozen in Time: Parameter-Efficient Time Series Transformers via Reservoir-Induced Feature Expansion and Fixed Random Dynamics  
Pradeep Singh a, * , Mehak Sharma a , Anupriya Dey a and Balasubramanian Raman a a Machine Intelligence Lab, Department of Computer Science and Engineering, IIT Roorkee, Roorkee-247667, India ORCID (Pradeep Singh): https://orcid.org/0000-0002-5372-3355, ORCID (Mehak Sharma):  
https://orcid.org/0009-0001-3102-1045, ORCID (Anupriya Dey): https://orcid.org/0009-0000-1630-1017, ORCID (Balasubramanian Raman): https://orcid.org/0000-0001-6277-6267  
Abstract. Transformers are the de-facto choice for sequence modelling, yet their quadratic self-attention and weak temporal bias can make long-range forecasting both expensive and brittle. We introduce FreezeTST , a lightweight hybrid that interleaves frozen randomfeature (reservoir) blocks with standard trainable Transfo

# Chunking Strategy Comparative Analysis

This section provides a quantitative comparison of the four chunking strategies implemented above. We'll examine key metrics including chunk count, average chunk length, and qualitative characteristics to understand the trade-offs between different approaches.

## Comparative Metrics

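The analysis considers the following dimensions; the code below computes the first two, while consistency and boundary quality are assessed by inspecting the chunks: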
- **Chunk Count**: Total number of chunks generated
- **Average Length**: Mean character count per chunk
- **Consistency**: Variance in chunk sizes
- **Boundary Quality**: How well chunks respect semantic boundaries
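
Consistency and boundary quality can also be approximated numerically. The sketch below is one possible way to do it, using only the standard library; `chunk_stats` and `sentence_boundary_ratio` are hypothetical helpers, and treating "ends with sentence punctuation" as a sign of a clean boundary is a crude heuristic rather than an established metric.

```python
from statistics import mean, pstdev


def chunk_stats(chunks: list[str]) -> dict:
    """Count, mean, spread, and range of chunk lengths in characters."""
    lengths = [len(c) for c in chunks]
    if not lengths:
        return {"count": 0, "mean": 0.0, "std": 0.0, "min": 0, "max": 0}
    return {
        "count": len(lengths),
        "mean": mean(lengths),
        "std": pstdev(lengths),   # population standard deviation as a consistency measure
        "min": min(lengths),
        "max": max(lengths),
    }


def sentence_boundary_ratio(chunks: list[str]) -> float:
    """Share of chunks whose last non-space character is '.', '!' or '?'."""
    if not chunks:
        return 0.0
    clean = sum(c.rstrip().endswith((".", "!", "?")) for c in chunks)
    return clean / len(chunks)


demo_chunks = ["a" * 10, "b" * 20, "c" * 30]
stats = chunk_stats(demo_chunks)
ratio = sentence_boundary_ratio(["A full sentence.", "cut off mid-"])
print(stats)
print(ratio)
```

A lower standard deviation indicates more uniform chunk sizes, and a higher ratio suggests that fewer chunks end in the middle of a sentence.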


In [None]:
# Define strategy comparison dataset with descriptive metadata
strategies = [
    ("Baseline", baseline_chunks, "Character-based splitting"),
    ("Recursive", recursive_chunks, "Boundary-aware splitting"), 
    ("Unstructured", unstructured_chunks, "Structure-aware parsing"),
    ("Docling", docling_chunks, "Advanced hybrid parsing")
]

# Calculate comparative metrics for each chunking strategy
comparison_data = []
for name, chunks, description in strategies:
    # Compute average chunk length as primary size metric
    avg_length = sum(len(chunk) for chunk in chunks) / len(chunks)
    
    # Compile strategy performance summary
    comparison_data.append({
        "Strategy": name,
        "Chunks": len(chunks),              # Total number of chunks produced
        "Avg Length": f"{avg_length:.0f}",  # Mean characters per chunk
        "Description": description          # Strategy characterization
    })

# Create comparison DataFrame for structured analysis
comparison_df = pd.DataFrame(comparison_data)

# Display formatted comparison results
print("CHUNKING STRATEGY COMPARISON")
print("=" * 70)
print(comparison_df.to_string(index=False))
print("=" * 70)

CHUNKING STRATEGY COMPARISON
    Strategy  Chunks Avg Length               Description
    Baseline       8       6020 Character-based splitting
   Recursive      15       3216  Boundary-aware splitting
Unstructured       6       3716   Structure-aware parsing
     Docling       9       6551   Advanced hybrid parsing


# Vector Database Integration and Collection Management

This section demonstrates connecting to Qdrant, a high-performance vector database, to explore existing chunk collections. Vector databases are essential for RAG systems as they enable semantic search and similarity-based retrieval of document chunks.

## Qdrant Vector Database
Qdrant provides:
- **High Performance**: Optimized for similarity search at scale
- **Flexibility**: Support for various distance metrics and filtering
- **Scalability**: Handles large collections efficiently
- **Integration**: Seamless integration with embedding models
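
Similarity-based retrieval can be illustrated with a brute-force cosine search over a few hand-written vectors. This sketch is illustrative only: the three-dimensional vectors and chunk texts are made up, and a vector database such as Qdrant relies on approximate nearest-neighbor indexing instead of scanning every vector.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2):
    """Return the k (score, text) pairs with the highest cosine similarity."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy "collection": (chunk text, embedding) pairs with made-up embeddings.
store = [
    ("chunk about transformers", [0.9, 0.1, 0.0]),
    ("chunk about reservoirs", [0.1, 0.9, 0.0]),
    ("chunk about datasets", [0.0, 0.1, 0.9]),
]
results = top_k([1.0, 0.0, 0.0], store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The query vector points along the first axis, so the chunk whose embedding is closest to that direction is ranked first.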

## Collection Exploration
We'll examine the existing collections to see how chunks from each strategy were previously embedded and stored.


In [None]:
# Establish connection to Qdrant vector database
# Check for required credentials before attempting connection
if not QDRANT_URL or not QDRANT_KEY:
    print("Warning: Qdrant credentials not found. Please set QDRANT_URL and QDRANT_KEY")
    qdrant_client = None
else:
    # Initialize Qdrant client with cloud credentials
    qdrant_client = QdrantClient(
        url=QDRANT_URL,    # Cloud instance endpoint
        api_key=QDRANT_KEY # Authentication key
    )
    print("Qdrant connection established")
    
    # Retrieve and display available vector collections
    # Each collection typically represents a different chunking strategy or dataset
    collections = qdrant_client.get_collections()
    print("\nAVAILABLE QDRANT COLLECTIONS")
    print("-" * 40)
    
    if collections.collections:
        # List all collections with their basic information
        for collection in collections.collections:
            print(f"Collection: {collection.name}")
            # Additional collection metadata could be displayed here
            print()
    else:
        print("No collections found")

2025-09-20 22:42:53,445 - INFO - HTTP Request: GET https://1d20b7dd-e936-4d2e-b034-c62a8dc85ef5.us-east4-0.gcp.cloud.qdrant.io:6333 "HTTP/1.1 200 OK"


Qdrant connection established


2025-09-20 22:42:53,776 - INFO - HTTP Request: GET https://1d20b7dd-e936-4d2e-b034-c62a8dc85ef5.us-east4-0.gcp.cloud.qdrant.io:6333/collections "HTTP/1.1 200 OK"



AVAILABLE QDRANT COLLECTIONS
----------------------------------------
Collection: docling

Collection: unstructured

Collection: baseline

Collection: recursive_character



In [19]:
# scroll() returns a (points, next_offset) tuple rather than an iterator
points, next_offset = qdrant_client.scroll(
    collection_name="baseline",
    limit=2
)

for p in points:
    print(p)

2025-09-20 23:09:55,873 - INFO - HTTP Request: POST https://1d20b7dd-e936-4d2e-b034-c62a8dc85ef5.us-east4-0.gcp.cloud.qdrant.io:6333/collections/baseline/points/scroll "HTTP/1.1 200 OK"


id='0025580a-754e-4fed-9c65-c5c8eff536ea' payload={'page_content': 'edge classification), ensuring broad coverage for\nevaluating graph reasoning.\nGraph-to-text augmentation. Unlike prior work\nthat tokenizes structural features using GNN en-\ncoders, we revisit the pure graph-to-text paradigm.\nTaking node-level tasks as an example, for a target\nnode vi, we extract itsh-hop subgraph and describe\nall node features Ti = {x(vj) | j ∈ N(i) ∪ {i}},\nand edge relations Ei = {x(ejk) | vj, vk ∈\nN(i) ∪ {i}} within the subgraph using natural lan-\nguage, where N(i) is the neighborhood of vi. To\nmaintain input tractability for large graphs with\nverbose node texts (e.g., citation networks with ti-\ntles and abstracts), we apply DEEP SEEK -V3 for\nautomatic summarization. Prompt templates are\nprovided in Appendix B.\nReasoning-trace extraction. A distinctive fea-\nture of our dataset construction is the inclusion of\nexplicit reasoning traces for each answer. Specif-\nically, each subgraph 

# Comprehensive Evaluation Framework

This section implements a rigorous evaluation methodology to assess the real-world performance of different chunking strategies. We'll use a complete RAG pipeline with standardized test queries to measure retrieval quality and answer generation effectiveness.

## Evaluation Methodology

### Test Framework
- **RAGAS Metrics**: Industry-standard RAG evaluation framework
- **Standardized Queries**: Consistent test questions across all strategies
- **Controlled Variables**: Same embedding model, LLM, and retrieval parameters
- **Multiple Dimensions**: Retrieval quality, answer relevance, faithfulness, precision, recall

### Key Performance Indicators
1. **Answer Relevancy**: How directly the generated answer addresses the question
2. **Faithfulness**: Whether the claims in the answer are supported by the retrieved context
3. **Context Precision**: Whether the relevant retrieved chunks are ranked near the top
4. **Context Recall**: How much of the reference answer is covered by the retrieved context
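
The two context metrics can be made concrete with hand-labeled data. The sketch below is a simplified illustration: in RAGAS the relevance and support judgments come from an LLM, whereas here they are supplied as plain lists, and the function names are chosen for this example only.

```python
def context_precision(relevance: list[int]) -> float:
    """Average of precision@k over the ranks k that hold a relevant chunk.

    `relevance` lists the retrieved chunks in rank order, 1 = relevant, 0 = not.
    """
    hits, total = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


def context_recall(claim_supported: list[bool]) -> float:
    """Share of reference-answer claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


good_ranking = context_precision([1, 0, 1])   # relevant chunks at ranks 1 and 3
poor_ranking = context_precision([0, 1, 1])   # same chunks, ranked lower
recall = context_recall([True, True, False])

print(f"{good_ranking:.3f} {poor_ranking:.3f} {recall:.3f}")
```

Both rankings contain two relevant chunks out of three, yet the first scores higher because its relevant chunks appear earlier, which is the property that makes context precision sensitive to chunking and ranking quality.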


In [None]:
# Load standardized test queries for consistent evaluation across all chunking strategies
# These queries are designed to test different aspects of retrieval and comprehension
import json
with open("../retrieval_playground/tests/test_queries.json", 'r') as f:
    test_queries = json.load(f)

# Display sample query to demonstrate evaluation approach
print("\nSample Test Query:")
print(f"Question: {test_queries[0]['user_input']}")
print(f"Source Document: {test_queries[0]['source_file']}")


📝 Sample query:
Question: How does MC3G improve upon existing counterfactual explanation methods, particularly concerning cost computation and causal dependencies?
Source: Analytics_2025_MC3G__Model_Agnostic_Causally_Constrained_Counterfactual_Generation.pdf


In [None]:
# Initialize comprehensive chunking evaluation framework
# This evaluator will test all four chunking strategies against standardized queries
evaluator = ChunkingEvaluator(
    query_count=2,  # Number of test queries per strategy
    metrics=[       # RAGAS evaluation metrics for comprehensive assessment
        'answer_relevancy',    
        'faithfulness',        
        'context_precision',   
        'context_recall'       
    ]
)

print("Starting comprehensive chunking evaluation...")
print("This process evaluates all strategies against standardized queries using RAGAS metrics.")
print("Expected duration: 3-5 minutes depending on system performance.\n")

try:
    # Execute evaluation across all chunking strategies
    # This runs complete RAG pipelines for each strategy with identical test conditions
    results_df = evaluator.evaluate_all_strategies()
    print("\nEvaluation completed successfully!")
    
    # Provide summary of evaluation results
    print(f"Results matrix: {results_df.shape}")
    print("Analysis includes performance across all metrics and strategies")
    
except Exception as e:
    print(f"Evaluation failed with error: {e}")
    print("Check vector database connectivity and API credentials")

2025-09-20 23:12:21.086 INFO model_manager - _initialize_models: 🔄 ModelManager: Initializing shared AI models...


2025-09-20 23:12:21,204 - INFO - Use pytorch device_name: mps
2025-09-20 23:12:21,205 - INFO - Load pretrained SentenceTransformer: Qwen/Qwen3-Embedding-0.6B
2025-09-20 23:12:25,311 - INFO - 1 prompt is loaded, with the key: query


2025-09-20 23:12:25.312 INFO model_manager - _initialize_models: ✅ ModelManager: Shared AI models initialized successfully




2025-09-20 23:12:25.312 INFO evaluation - __init__: RAGEvaluator initialized with metrics: ['answer_relevancy', 'faithfulness', 'context_precision', 'context_recall']




2025-09-20 23:12:25.313 INFO chunking_evaluation - __init__: Loaded 2 test queries




Starting comprehensive chunking evaluation...
This may take a few minutes...

2025-09-20 23:12:25.313 INFO chunking_evaluation - evaluate_all_strategies: Starting evaluation of all chunking strategies...




2025-09-20 23:12:25.313 INFO chunking_evaluation - evaluate_strategy: Evaluating baseline strategy...


2025-09-20 23:12:25,715 - INFO - HTTP Request: GET https://1d20b7dd-e936-4d2e-b034-c62a8dc85ef5.us-east4-0.gcp.cloud.qdrant.io:6333 "HTTP/1.1 200 OK"


2025-09-20 23:12:25.720 INFO baseline_rag - __init__: BaselineRAG pipeline initialized




Generating answers for queries...


2025-09-20 23:12:26,059 - INFO - HTTP Request: GET https://1d20b7dd-e936-4d2e-b034-c62a8dc85ef5.us-east4-0.gcp.cloud.qdrant.io:6333/collections/baseline "HTTP/1.1 200 OK"
2025-09-20 23:12:26,746 - INFO - HTTP Request: POST https://1d20b7dd-e936-4d2e-b034-c62a8dc85ef5.us-east4-0.gcp.cloud.qdrant.io:6333/collections/baseline/points/query "HTTP/1.1 200 OK"
2025-09-20 23:12:28,467 - INFO - HTTP Request: GET https://1d20b7dd-e936-4d2e-b034-c62a8dc85ef5.us-east4-0.gcp.cloud.qdrant.io:6333/collections/baseline "HTTP/1.1 200 OK"
2025-09-20 23:12:28,767 - INFO - HTTP Request: POST https://1d20b7dd-e936-4d2e-b034-c62a8dc85ef5.us-east4-0.gcp.cloud.qdrant.io:6333/collections/baseline/points/query "HTTP/1.1 200 OK"


2025-09-20 23:12:30.290 INFO evaluation - evaluate_batch: 🔄 Evaluating 2 QA pairs...




2025-09-20 23:12:30.309 INFO evaluation - evaluate_batch: 🧮 Computing RAGAS metrics...


Evaluating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:21<00:00,  2.68s/it]


2025-09-20 23:12:53.929 INFO evaluation - evaluate_batch: ✅ Evaluation completed

2025-09-20 23:12:53.932 INFO chunking_evaluation - evaluate_strategy: ✅ baseline evaluation completed

2025-09-20 23:12:53.939 INFO chunking_evaluation - evaluate_strategy: Evaluating recursive_character strategy...
2025-09-20 23:12:54,313 - INFO - HTTP Request: GET https://1d20b7dd-e936-4d2e-b034-c62a8dc85ef5.us-east4-0.gcp.cloud.qdrant.io:6333 "HTTP/1.1 200 OK"


2025-09-20 23:12:54.317 INFO baseline_rag - __init__: BaselineRAG pipeline initialized


Generating answers for queries...


2025-09-20 23:12:54,640 - INFO - HTTP Request: GET https://1d20b7dd-e936-4d2e-b034-c62a8dc85ef5.us-east4-0.gcp.cloud.qdrant.io:6333/collections/recursive_character "HTTP/1.1 200 OK"
2025-09-20 23:12:54,985 - INFO - HTTP Request: POST https://1d20b7dd-e936-4d2e-b034-c62a8dc85ef5.us-east4-0.gcp.cloud.qdrant.io:6333/collections/recursive_character/points/query "HTTP/1.1 200 OK"
2025-09-20 23:12:56,709 - INFO - HTTP Request: GET https://1d20b7dd-e936-4d2e-b034-c62a8dc85ef5.us-east4-0.gcp.cloud.qdrant.io:6333/collections/recursive_character "HTTP/1.1 200 OK"
2025-09-20 23:12:56,955 - INFO - HTTP Request: POST https://1d20b7dd-e936-4d2e-b034-c62a8dc85ef5.us-east4-0.gcp.cloud.qdrant.io:6333/collections/recursive_character/points/query "HTTP/1.1 200 OK"


2025-09-20 23:12:58.657 INFO evaluation - evaluate_batch: 🔄 Evaluating 2 QA pairs...

2025-09-20 23:12:58.667 INFO evaluation - evaluate_batch: 🧮 Computing RAGAS metrics...
Evaluating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:19<00:00,  2.41s/it]


2025-09-20 23:13:19.602 INFO evaluation - evaluate_batch: ✅ Evaluation completed

2025-09-20 23:13:19.605 INFO chunking_evaluation - evaluate_strategy: ✅ recursive_character evaluation completed

2025-09-20 23:13:19.611 INFO chunking_evaluation - evaluate_strategy: Evaluating unstructured strategy...
2025-09-20 23:13:19,989 - INFO - HTTP Request: GET https://1d20b7dd-e936-4d2e-b034-c62a8dc85ef5.us-east4-0.gcp.cloud.qdrant.io:6333 "HTTP/1.1 200 OK"


2025-09-20 23:13:19.994 INFO baseline_rag - __init__: BaselineRAG pipeline initialized


Generating answers for queries...


2025-09-20 23:13:20,371 - INFO - HTTP Request: GET https://1d20b7dd-e936-4d2e-b034-c62a8dc85ef5.us-east4-0.gcp.cloud.qdrant.io:6333/collections/unstructured "HTTP/1.1 200 OK"
2025-09-20 23:13:20,705 - INFO - HTTP Request: POST https://1d20b7dd-e936-4d2e-b034-c62a8dc85ef5.us-east4-0.gcp.cloud.qdrant.io:6333/collections/unstructured/points/query "HTTP/1.1 200 OK"
2025-09-20 23:13:22,604 - INFO - HTTP Request: GET https://1d20b7dd-e936-4d2e-b034-c62a8dc85ef5.us-east4-0.gcp.cloud.qdrant.io:6333/collections/unstructured "HTTP/1.1 200 OK"
2025-09-20 23:13:22,904 - INFO - HTTP Request: POST https://1d20b7dd-e936-4d2e-b034-c62a8dc85ef5.us-east4-0.gcp.cloud.qdrant.io:6333/collections/unstructured/points/query "HTTP/1.1 200 OK"


2025-09-20 23:13:23.644 INFO evaluation - evaluate_batch: 🔄 Evaluating 2 QA pairs...

2025-09-20 23:13:23.650 INFO evaluation - evaluate_batch: 🧮 Computing RAGAS metrics...
Evaluating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:18<00:00,  2.26s/it]


2025-09-20 23:13:42.806 INFO evaluation - evaluate_batch: ✅ Evaluation completed

2025-09-20 23:13:42.808 INFO chunking_evaluation - evaluate_strategy: ✅ unstructured evaluation completed

2025-09-20 23:13:42.814 INFO chunking_evaluation - evaluate_strategy: Evaluating docling strategy...
2025-09-20 23:13:43,254 - INFO - HTTP Request: GET https://1d20b7dd-e936-4d2e-b034-c62a8dc85ef5.us-east4-0.gcp.cloud.qdrant.io:6333 "HTTP/1.1 200 OK"


2025-09-20 23:13:43.258 INFO baseline_rag - __init__: BaselineRAG pipeline initialized


Generating answers for queries...


2025-09-20 23:13:43,606 - INFO - HTTP Request: GET https://1d20b7dd-e936-4d2e-b034-c62a8dc85ef5.us-east4-0.gcp.cloud.qdrant.io:6333/collections/docling "HTTP/1.1 200 OK"
2025-09-20 23:13:43,928 - INFO - HTTP Request: POST https://1d20b7dd-e936-4d2e-b034-c62a8dc85ef5.us-east4-0.gcp.cloud.qdrant.io:6333/collections/docling/points/query "HTTP/1.1 200 OK"
2025-09-20 23:13:45,552 - INFO - HTTP Request: GET https://1d20b7dd-e936-4d2e-b034-c62a8dc85ef5.us-east4-0.gcp.cloud.qdrant.io:6333/collections/docling "HTTP/1.1 200 OK"
2025-09-20 23:13:45,859 - INFO - HTTP Request: POST https://1d20b7dd-e936-4d2e-b034-c62a8dc85ef5.us-east4-0.gcp.cloud.qdrant.io:6333/collections/docling/points/query "HTTP/1.1 200 OK"


2025-09-20 23:13:46.268 INFO evaluation - evaluate_batch: 🔄 Evaluating 2 QA pairs...

2025-09-20 23:13:46.272 INFO evaluation - evaluate_batch: 🧮 Computing RAGAS metrics...
Evaluating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:14<00:00,  1.87s/it]


2025-09-20 23:14:03.503 INFO evaluation - evaluate_batch: ✅ Evaluation completed

2025-09-20 23:14:03.505 INFO chunking_evaluation - evaluate_strategy: ✅ docling evaluation completed

2025-09-20 23:14:03.516 INFO chunking_evaluation - evaluate_all_strategies: ✅ All strategies evaluated successfully



Evaluation completed successfully!
Results shape: (8, 7)


In [4]:
# Display formatted results
evaluator.print_results(results_df)


📊 CHUNKING STRATEGY EVALUATION RESULTS

STRATEGY RANKINGS (by average score):
--------------------------------------------------
1. UNSTRUCTURED         | Avg: 0.976
2. BASELINE             | Avg: 0.759
3. RECURSIVE_CHARACTER  | Avg: 0.756
4. DOCLING              | Avg: 0.600

DETAILED METRICS:
--------------------------------------------------------------------------------
Strategy             Relevancy  Faithful   Precision  Recall     Average   
--------------------------------------------------------------------------------
unstructured         0.929      1.000      0.975      1.000      0.976     
baseline             0.947      1.000      0.465      0.625      0.759     
recursive_character  0.958      0.976      0.465      0.625      0.756     
docling              0.497      1.000      0.402      0.500      0.600     
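
The Average column in the table above is consistent with an unweighted mean of the four metric columns, and the reported performance gap is consistent with the difference between the best and worst averages, expressed relative to the worst. The internals of `evaluator.print_results` are not shown here, so the following is only a minimal sketch under that assumption; the `scores` dictionary and the `average_scores` helper are hypothetical names, with values transcribed from the table.

```python
# Metric values transcribed from the DETAILED METRICS table:
# [relevancy, faithfulness, precision, recall] per strategy.
scores = {
    "unstructured": [0.929, 1.000, 0.975, 1.000],
    "baseline": [0.947, 1.000, 0.465, 0.625],
    "recursive_character": [0.958, 0.976, 0.465, 0.625],
    "docling": [0.497, 1.000, 0.402, 0.500],
}


def average_scores(metric_table):
    """Return the unweighted mean of the metrics for each strategy (assumed formula)."""
    return {name: sum(vals) / len(vals) for name, vals in metric_table.items()}


averages = average_scores(scores)

# Rank strategies from highest to lowest average score.
ranking = sorted(averages, key=averages.get, reverse=True)
best, worst = ranking[0], ranking[-1]

# Absolute gap, and the same gap as a percentage of the worst average.
gap = averages[best] - averages[worst]
relative = gap / averages[worst] * 100

for rank, name in enumerate(ranking, start=1):
    print(f"{rank}. {name.upper():<20} | Avg: {averages[name]:.3f}")
print(f"Gap: {gap:.3f} ({relative:.1f}% relative to {worst})")
```

Recomputing the averages this way is a quick sanity check that the ranking follows from the per-metric numbers rather than from a separate weighting scheme.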

KEY INSIGHTS:
------------------------------
• Best Strategy: UNSTRUCTURED (0.976)
• Worst Strategy: DOCLING (0.600)
• Performance Gap: 0.376 (62.7% improvement