# Chunking

The goal of chunking converted documents is to break them down into smaller, logical pieces.

This notebook performs chunking with [Docling](https://docling-project.github.io/docling/examples/hybrid_chunking/#hybrid-chunking).

The input to this notebook is a Docling JSON file created by a Docling conversion, or a directory of Docling JSON files.

In [None]:
!pip install -qq docling

### Set the directory of files to chunk and the output directory

In [None]:
from pathlib import Path

sample_data_dir = Path("data/sample-docling-json")
docling_json_files = list(sample_data_dir.glob("*.json"))

output_dir = Path("data/output")
output_dir.mkdir(parents=True, exist_ok=True)

### Initialize the Chunker

Docling provides two chunkers, the `HierarchicalChunker` and the `HybridChunker`.
The `HierarchicalChunker` creates chunks based on the hierarchy in the Docling document.

The `HybridChunker` builds on the `HierarchicalChunker` by making it tokenization-aware.

The `HybridChunker` has options for a `tokenizer`, the `max_tokens` in a chunk, and `merge_peers` to merge undersized chunks that are next to each other. Uncomment the commented-out code to configure these.

In [None]:
#from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
#from transformers import AutoTokenizer

from docling.chunking import HybridChunker

# EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
# MAX_TOKENS = 1024
#
# tokenizer = HuggingFaceTokenizer(
#     tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),
#     max_tokens=MAX_TOKENS,  # optional, by default derived from `tokenizer` for HF case
# )

chunker = HybridChunker(
    #tokenizer=tokenizer,
    #merge_peers=True,  # whether to merge undersized chunks - defaults to True
)
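
To make the effect of `max_tokens` and `merge_peers` more concrete, here is a toy sketch of the idea of merging undersized neighbouring chunks. It is not Docling's implementation: the `ToyChunk` class, the `merge_undersized_peers` function, and the whitespace-based token count are all hypothetical stand-ins chosen for illustration, and the rule used (merge adjacent chunks that share a heading while the combined size fits the budget) is an assumption about the general approach.

```python
from dataclasses import dataclass


@dataclass
class ToyChunk:
    heading: str  # section heading the chunk belongs to (hypothetical field)
    text: str


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for real tokenizer tokens
    return len(text.split())


def merge_undersized_peers(chunks: list[ToyChunk], max_tokens: int) -> list[ToyChunk]:
    """Greedily merge adjacent chunks under the same heading within a token budget."""
    merged: list[ToyChunk] = []
    for chunk in chunks:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev.heading == chunk.heading
            and count_tokens(prev.text) + count_tokens(chunk.text) <= max_tokens
        ):
            # The combined chunk still fits, so fold it into the previous one
            merged[-1] = ToyChunk(prev.heading, prev.text + " " + chunk.text)
        else:
            merged.append(chunk)
    return merged


toy_chunks = [
    ToyChunk("Intro", "Docling converts documents."),
    ToyChunk("Intro", "Chunks follow the document hierarchy."),
    ToyChunk("Usage", "Install the package first."),
]
toy_merged = merge_undersized_peers(toy_chunks, max_tokens=10)
print([c.text for c in toy_merged])
```

With a budget of 10 toy tokens the two "Intro" chunks are combined, while the "Usage" chunk stays separate because it sits under a different heading.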

### Load and chunk the converted docling document

Next, let's load each Docling JSON file as a Docling document and chunk it.

The resulting chunks for each document are stored in a file called `chunks.jsonl`, in a directory named after the document under the output directory. This file is used as an input in a later step when creating the seed dataset for SDG.

In [None]:
import json
from docling.document_converter import DocumentConverter

all_chunks = []

# Create the converter once and reuse it for every file
converter = DocumentConverter()

for json_file in docling_json_files:
    # Reconvert the Docling JSON into a Docling document for chunking
    doc = converter.convert(source=json_file)

    document_chunks = []
    chunk_objs = list(chunker.chunk(dl_doc=doc.document))

    print(f"Extracted {len(chunk_objs)} chunks from {doc.document.name}")

    for chunk in chunk_objs:
        # Keep the contextualized chunk text with its source document and metadata
        c = dict(
            chunk=chunker.contextualize(chunk=chunk),
            file=doc.document.name,
            metadata=chunk.meta.export_json_dict(),
        )
        document_chunks.append(c)
        all_chunks.append(c)

    # Write one JSON object per line (JSONL) for this document
    document_chunk_dir = output_dir / doc.document.name
    document_chunk_dir.mkdir(parents=True, exist_ok=True)
    chunks_file_path = document_chunk_dir / "chunks.jsonl"
    with open(chunks_file_path, "w", encoding="utf-8") as out_file:
        for chunk in document_chunks:
            json.dump(chunk, out_file)
            out_file.write("\n")
    print(f"Path of chunks JSONL is: {chunks_file_path.resolve()}")
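
Since each `chunks.jsonl` file holds one JSON object per line, it can be read back with the standard library alone. The helper below is a minimal sketch; the name `load_chunks` is hypothetical, and it assumes the file follows the one-record-per-line layout written above.

```python
import json
from pathlib import Path


def load_chunks(jsonl_path: Path) -> list[dict]:
    """Read a JSONL file and return one dict per non-empty line."""
    records = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                records.append(json.loads(line))
    return records
```

For example, `load_chunks(output_dir / "<document name>" / "chunks.jsonl")` would return a list of records with the `chunk`, `file`, and `metadata` keys, where `<document name>` is a placeholder for one of the directories created by the cell above.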

### View the Chunks

In [None]:
chunk_gen = iter(all_chunks)

Each document is now broken into small sections, with metadata about each chunk based on the document's format. To view the chunks one by one, run the following cell repeatedly.

In [None]:
print(next(chunk_gen)['chunk'])
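
Calling `next()` on an exhausted iterator raises `StopIteration`, which is what happens once every chunk has been shown. A small sketch of a more forgiving variant follows; `show_next_chunk` and the demo records are hypothetical and only illustrate the default-value form of `next()`.

```python
def show_next_chunk(chunk_iter):
    """Return the next chunk's text, or a notice when no chunks remain."""
    record = next(chunk_iter, None)  # None once the iterator is exhausted
    if record is None:
        return "No more chunks. Re-create the iterator to start over."
    return record["chunk"]


# Demo with stand-in records shaped like the entries of `all_chunks`
demo_iter = iter([{"chunk": "first"}, {"chunk": "second"}])
print(show_next_chunk(demo_iter))
print(show_next_chunk(demo_iter))
print(show_next_chunk(demo_iter))
```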

To view several randomly selected chunks, run the following cell as many times as you like:

In [None]:
import random

NUM_CHUNKS_TO_VIEW = 5

sample = random.sample(all_chunks, min(len(all_chunks), NUM_CHUNKS_TO_VIEW))

for i, chunk in enumerate(sample, start=1):
    print(f"== Randomly selected chunk {i}: ==========\n\n{chunk['chunk']}\n\n")
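
Beyond reading individual chunks, a quick size summary per source document can help spot chunks that are unusually short or long. The sketch below is one possible way to do that; `summarize_chunks` is a hypothetical helper, and it uses whitespace-separated word counts as a rough proxy, which will not match the token counts of the tokenizer used by the chunker.

```python
from collections import defaultdict


def summarize_chunks(chunks: list[dict]) -> dict[str, dict[str, float]]:
    """Group records by their `file` key and report word-count statistics."""
    words_per_file = defaultdict(list)
    for record in chunks:
        # Rough size proxy: number of whitespace-separated words in the chunk text
        words_per_file[record["file"]].append(len(record["chunk"].split()))
    return {
        name: {
            "chunks": len(counts),
            "min_words": min(counts),
            "max_words": max(counts),
            "mean_words": sum(counts) / len(counts),
        }
        for name, counts in words_per_file.items()
    }
```

Calling `summarize_chunks(all_chunks)` returns one entry per document name, since each record carries the `chunk` and `file` keys.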