# Data Processing: Chunking with HybridChunker

This notebook demonstrates how to use the **hybrid (token-based)** [Docling](https://docling-project.github.io/docling/) chunking strategy to split documents into smaller, semantically meaningful pieces. These chunks are essential for Retrieval-Augmented Generation (RAG) workflows.

The `HybridChunker` combines structural, rule-based chunking with token-aware splitting. This notebook will walk you through its main parameters to show how different settings affect the size and content of the resulting document chunks.

## 📦 Installation

Install the necessary packages into this notebook environment. We need `docling` for chunking and `transformers` for tokenization. Run this once per session. If you restart the kernel, re-run this cell before continuing.

In [None]:
%pip install -qq docling transformers

from pathlib import Path
from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
from transformers import AutoTokenizer

## 🔧 Configuration

### Set file to be chunked

Set the source file to be chunked. We will download a sample `docling` JSON document from a public URL.

In [None]:
# URL to the raw docling JSON file on GitHub
file_to_chunk = "https://raw.githubusercontent.com/docling-project/docling/refs/tags/v2.55.1/tests/data/groundtruth/docling_v2/2206.01062.json"


### Set output directory

Choose where to save results. This notebook creates the folder if it doesn’t exist.

In [None]:
output_dir = Path("hybrid-chunking-demonstration/output")
output_dir.mkdir(parents=True, exist_ok=True)

### Configure chunking strategies

Next, we set up the different chunking strategies we want to demonstrate. The cell below contains three configurations:

1.  **`default_chunker`**: A `HybridChunker` initialized without an explicit tokenizer or token limit. It uses a default tokenizer (e.g., a sentence-embedding model's tokenizer) and the token limit derived from it (typically 512 tokens). Because `merge_peers` is `True` by default, it merges adjacent small structural chunks (such as short paragraphs) into larger, contextually richer chunks, stopping at the token limit. It also splits any individual structural element that exceeds the token limit.
2.  **`no_merge_chunker`**: A `HybridChunker` with `merge_peers=False`. This prevents the merging of adjacent chunks, resulting in smaller chunks that strictly follow the document's structure. Note that it still splits any individual structural element that exceeds the token limit.
3.  **`custom_chunker`**: A `HybridChunker` configured with a tokenizer of choice and a `max_tokens` limit.

For additional customization, check the [official documentation](https://docling-project.github.io/docling/concepts/chunking/).
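To build intuition for what peer merging does before running the real chunkers, here is a minimal, self-contained sketch. It is not Docling's implementation: `toy_merge_peers` is a hypothetical helper that only works on a list of token counts and greedily groups neighbors while the running total stays within the limit. The real `HybridChunker` also takes headings and other metadata into account and splits oversized elements, which this toy version does not.

```python
def toy_merge_peers(token_counts, max_tokens):
    """Greedily merge adjacent items while the running total stays within max_tokens.

    Hypothetical illustration only; an item larger than max_tokens is kept as-is
    (the real chunker would split it).
    """
    merged, current = [], 0
    for count in token_counts:
        # Close the current group if adding this item would exceed the limit.
        if current and current + count > max_tokens:
            merged.append(current)
            current = 0
        current += count
    if current:
        merged.append(current)
    return merged


# Token counts of five short, adjacent paragraphs (made-up numbers).
paragraph_tokens = [40, 30, 50, 20, 90]

# Without merging, each paragraph stays its own chunk: 5 chunks.
print(len(paragraph_tokens))

# With merging and a 128-token limit, the paragraphs collapse into 2 chunks.
print(toy_merge_peers(paragraph_tokens, max_tokens=128))  # [120, 110]
```

Lowering the limit in this sketch produces more, smaller groups, which mirrors the effect of a smaller `max_tokens` in the experiments below.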

In [None]:
# 1. Default HybridChunker (no explicit tokenizer; defaults are used)
default_chunker = HybridChunker()

# Configure the tokenizer and token limit used by the custom chunker (3)
EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
MAX_TOKENS = 128
hf_tokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),
    max_tokens=MAX_TOKENS,
)
# 2. HybridChunker with the default tokenizer and merge_peers=False
no_merge_chunker = HybridChunker(
    merge_peers=False,
)
# 3. HybridChunker with a tokenizer and max_tokens
custom_chunker = HybridChunker(
    tokenizer=hf_tokenizer,
    # merge_peers is True by default
)

### Load the docling JSON into a Python object
The `file_to_chunk` URL points to a Docling JSON document generated from the PDF of an arXiv paper (2206.01062, the DocLayNet paper on document-layout analysis). It provides a good mix of headings, paragraphs, and other structural elements that help demonstrate the different chunking behaviors.

In [None]:
# Convert to a docling document
doc = DocumentConverter().convert(source=file_to_chunk)
print(f"Document loaded: {doc.document.name}")

## 🧪 Experiment: Comparing Chunking Strategies

Let's compare how each chunking strategy affects the document:

1. **Default HybridChunker**: default tokenizer and `max_tokens`, with `merge_peers=True`
2. **No-Merge Chunker**: default tokenizer and `max_tokens`, with `merge_peers=False`
3. **Custom Chunker**: explicit tokenizer with `max_tokens=128` and `merge_peers=True`

For each strategy, we'll show:
- Total number of chunks created
- Average chunk size (in tokens)
- Sample chunks with their token counts

In [None]:
def analyze_chunks(chunker, doc, num_samples=3):
    """Analyze chunks produced by a chunker, including token counts and statistics."""
    chunks = list(chunker.chunk(dl_doc=doc.document))
    
    # Get token counts if we have a tokenizer
    token_counts = []
    if chunker.tokenizer:
        for chunk in chunks:
            text = chunker.contextualize(chunk=chunk)
            try:
                count = len(chunker.tokenizer.tokenizer.encode(text))
                token_counts.append(count)
            except Exception as e:
                print(f"Warning: Could not count tokens for chunk: {e}")
    
    # Print statistics
    print(f"\nTotal chunks: {len(chunks)}")
    if token_counts:
        avg_tokens = sum(token_counts) / len(token_counts)
        print(f"Average tokens per chunk: {avg_tokens:.1f}")
    
    # Show first num_samples samples from the chunks
    print("\nSample chunks:")
    samples = chunks[:min(len(chunks), num_samples)]
    for i, chunk in enumerate(samples, 1):
        text = chunker.contextualize(chunk=chunk)
        print(f"\nChunk {i}:")
        if chunker.tokenizer:
            try:
                count = len(chunker.tokenizer.tokenizer.encode(text))
                print(f"Token count: {count}")
            except Exception:
                print("Token count: N/A")
        print("-" * 40)
        print((text[:200] + "...") if len(text) > 200 else text)
        print("-" * 40)

print("Helper function created for analyzing chunks")
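The average alone can hide outliers, so it may help to look at the spread of token counts as well. The sketch below is an optional, standalone addition: `summarize_token_counts` is a hypothetical helper (not part of Docling) that takes a plain list of token counts, such as the `token_counts` list built inside `analyze_chunks`, and reports a few basic statistics. Counts obtained with `encode` may include special tokens, so a few chunks can appear slightly over the configured limit.

```python
from statistics import mean, median


def summarize_token_counts(token_counts, max_tokens=None):
    """Return basic statistics for a list of per-chunk token counts.

    Hypothetical helper for illustration; works on plain integers only.
    """
    if not token_counts:
        return {}
    summary = {
        "chunks": len(token_counts),
        "min": min(token_counts),
        "max": max(token_counts),
        "mean": round(mean(token_counts), 1),
        "median": median(token_counts),
    }
    if max_tokens is not None:
        # Number of chunks whose count exceeds the given limit.
        summary["over_limit"] = sum(1 for c in token_counts if c > max_tokens)
    return summary


# Made-up counts, for demonstration only.
example_summary = summarize_token_counts([120, 110, 131], max_tokens=128)
print(example_summary)
```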

### 1. Default HybridChunker

First, let's see how the HybridChunker with default settings behaves.

> ⚠️ **Expected Warning**: The output may show tokenizer warnings about sequence length limits. These are informational only and do not affect the chunking results.

In [None]:
print("Running default HybridChunker (no explicit tokenizer)...")
analyze_chunks(default_chunker, doc)

### 2. Disabling Peer Merging

Let's see what happens when we disable peer merging. This will prevent the chunker from combining small adjacent chunks, even if they would fit within our token limit.

In [None]:
print("Running HybridChunker (with merge_peers=False)...")
analyze_chunks(no_merge_chunker, doc)

### 3. Custom Chunking

Now let's set the tokenizer and `max_tokens` on the `HybridChunker`. Since we are setting `max_tokens` to a smaller value (128), you should see an increased number of chunks. We leave `merge_peers` at its default of `True`, so the chunker still merges small adjacent chunks as long as the result stays under the token limit, while trying to preserve document structure.

In [None]:
print("Running HybridChunker with custom tokenizer and max_tokens...")
analyze_chunks(custom_chunker, doc)
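The `output_dir` created earlier can be used to persist chunks for a downstream RAG pipeline. Below is a minimal sketch of one way to do that, assuming a JSON Lines layout with one record per chunk. The helper `save_chunks_jsonl`, the file name, and the record fields (`chunk_id`, `text`) are illustrative choices, not a Docling convention. The demo at the end writes to a temporary directory so it runs on its own.

```python
import json
import tempfile
from pathlib import Path


def save_chunks_jsonl(texts, path):
    """Write one JSON record per chunk text to a JSON Lines file.

    Hypothetical helper for illustration; `texts` is any iterable of strings.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for i, text in enumerate(texts):
            record = {"chunk_id": i, "text": text}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


# Standalone demo with placeholder strings instead of real chunks.
with tempfile.TemporaryDirectory() as tmp:
    out = save_chunks_jsonl(
        ["first chunk", "second chunk"], Path(tmp) / "demo" / "chunks.jsonl"
    )
    lines = out.read_text(encoding="utf-8").splitlines()
print(len(lines))

# In this notebook, the same helper could be used roughly like this:
# texts = [
#     custom_chunker.contextualize(chunk=c)
#     for c in custom_chunker.chunk(dl_doc=doc.document)
# ]
# save_chunks_jsonl(texts, output_dir / "custom_chunks.jsonl")
```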

## 🍩 Additional Resources

For more information about Docling and its features:

- [Docling Documentation](https://docling-project.github.io/docling/)
- [Open Data Hub Data Processing Examples](https://github.com/opendatahub-io/odh-data-processing)

### Any Feedback?

We'd love to hear if you have any feedback on this or any other notebook in this series! Please [open an issue](https://github.com/opendatahub-io/odh-data-processing/issues) and help us improve our demos.