## Chunking
The goal of chunking for InstructLab SDG is to give the teacher model small, logical pieces of the source document from which to generate data.

In this notebook we chunk documents with [Docling](https://docling-project.github.io/docling/examples/hybrid_chunking/#hybrid-chunking).

The input to this notebook is a docling JSON file created after a docling conversion, or a directory of docling JSON files.

### Dependencies

In [28]:
!pip install docling

Import docling document converter and chunkers

In [29]:
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker, HierarchicalChunker
from docling.exceptions import ConversionError  # caught in the chunking loop below
from pathlib import Path

## Set the source document path

Set `doc_path` to the `converted.json` produced by the conversion notebook, or to a directory of such files.

If the conversion notebook was not run, setting the path to the source document itself is also fine.

In [30]:
doc_path = Path("output")

files = []

if doc_path.is_file():
    files = [doc_path]
else:
    files = list(doc_path.rglob("*.json"))
print(f"Docling JSON files to chunk: {files}")

Docling JSON files to chunk: [PosixPath('output/US-Youth-Soccer-Travel-Policy.json'), PosixPath('output/2-tables-one-page-cargo-theft-report.json'), PosixPath('output/cargo-theft-report-2018.json'), PosixPath('output/top-100-movies.json')]
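The file-or-directory logic above can also be wrapped in a small helper. This is a minimal sketch; `find_docling_json` is a hypothetical name, not part of Docling, and sorting the results is an added assumption to make the order stable between runs.

```python
from pathlib import Path


def find_docling_json(path: Path) -> list[Path]:
    """Collect the files to chunk (hypothetical helper, not a Docling API)."""
    if path.is_file():
        # A single file is used as-is, whatever its extension.
        return [path]
    # For a directory, search recursively for JSON files.
    # sorted() is an assumption added here to get a stable order.
    return sorted(path.rglob("*.json"))
```

Calling `find_docling_json(Path("output"))` would then replace the `if`/`else` block in the cell above.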


## Initialize the Chunker

Docling provides two chunkers, the `HierarchicalChunker` and the `HybridChunker`.
The `HierarchicalChunker` creates chunks based on the hierarchy in the Docling document.

The `HybridChunker` builds on the `HierarchicalChunker` by making it tokenization-aware.

The `HybridChunker` has options for a `tokenizer`, the `max_tokens` in a chunk, and whether to merge undersized peer chunks.

In [31]:
#chunker = HierarchicalChunker()
chunker = HybridChunker()
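To give a feel for what "merging undersized peer chunks" under a token limit means, here is a simplified sketch in plain Python. It is not Docling's implementation: `merge_undersized` is a hypothetical function, and whitespace splitting stands in for a real tokenizer. It ignores details a real chunker would handle, such as splitting oversized chunks or checking that neighbours belong to the same section.

```python
def merge_undersized(pieces, max_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily merge neighbouring pieces while the result fits in max_tokens.

    Illustrative only: count_tokens defaults to a whitespace word count,
    which is an assumption and not how a model tokenizer counts.
    """
    merged = []
    for piece in pieces:
        if merged and count_tokens(merged[-1] + " " + piece) <= max_tokens:
            # The previous chunk still has room, so append this piece to it.
            merged[-1] = merged[-1] + " " + piece
        else:
            # Start a new chunk; oversized pieces are kept whole in this sketch.
            merged.append(piece)
    return merged


print(merge_undersized(["a b", "c", "d e f", "g"], max_tokens=3))
```

With a limit of 3 "tokens", the first two pieces fit together, while the last two each start a new chunk.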

## Load and chunk the converted docling document

Next, let's load each document we want to chunk as a Docling Document and run the chunker over it.

In [32]:
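all_chunks = []
converter = DocumentConverter()  # reuse one converter for all files
for file in files:
    try:
        doc = converter.convert(source=file).document
        chunk_iter = chunker.chunk(dl_doc=doc)
        # Serialize each chunk to text, including its contextual metadata
        chunks = [chunker.serialize(chunk=chunk) for chunk in chunk_iter]
        for chunk in chunks:
            # Remember which source file each chunk came from
            all_chunks.append(dict(chunk=chunk, file=file.stem))
    except ConversionError as e:
        # ConversionError comes from docling.exceptions
        print(f"Skipping file {file}: {e}")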

Token indices sequence length is longer than the specified maximum sequence length for this model (659 > 512). Running this sequence through the model will result in indexing errors
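The warning above reports a sequence longer than the tokenizer's 512-token limit. One rough way to see which serialized chunks are large is to count tokens per chunk and list those over a threshold. The sketch below is an illustration only: `oversized_chunks` is a hypothetical helper, and the default whitespace count is an assumption that will differ from the tokenizer's real counts. Pass your own `count_tokens` function for accurate numbers.

```python
def oversized_chunks(chunks, max_tokens, count_tokens=lambda s: len(s.split())):
    """Return (file, index, token_count) for every chunk above max_tokens.

    `chunks` is a list of dicts shaped like the entries of `all_chunks`:
    {"chunk": <text>, "file": <source file stem>}.
    """
    report = []
    for i, c in enumerate(chunks):
        n = count_tokens(c["chunk"])
        if n > max_tokens:
            report.append((c["file"], i, n))
    return report
```

For example, `oversized_chunks(all_chunks, max_tokens=512)` would list the chunks worth inspecting.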


## View the Chunks

To view the chunks, uncomment and run the following cell. The document is broken into small pieces, each with metadata about the chunk based on the document's format.

In [1]:
# print(all_chunks)
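Printing the whole list can be hard to read for large documents. As an alternative, a short preview of the first few chunks may be enough. This is a minimal sketch; `preview_chunks` is a hypothetical helper that assumes each entry has the `chunk` and `file` keys built in the previous cell.

```python
def preview_chunks(chunks, limit=3, width=80):
    """Return one summary line per chunk for the first `limit` chunks."""
    lines = []
    for i, c in enumerate(chunks[:limit]):
        # Collapse newlines and repeated spaces so each preview fits on one line.
        text = " ".join(c["chunk"].split())
        if len(text) > width:
            text = text[:width] + "..."
        lines.append(f"[{i}] {c['file']}: {text}")
    return lines
```

Running `print("\n".join(preview_chunks(all_chunks)))` shows the source file and the start of each chunk.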

## Save the chunks to a text file for each chunk

Each chunk is saved to an individual text file named `{docling-json-file-name}-{chunk #}.txt`. Saving the chunks in this format is important because they are the input to the create-sdg-seed-data notebook.

In [33]:
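output_dir = Path("output/chunks")
output_dir.mkdir(parents=True, exist_ok=True)  # create the directory if it is missing
for i, chunk in enumerate(all_chunks):
    # Single quotes inside the f-string keep this valid before Python 3.12
    chunk_path = output_dir / f"{chunk['file']}-{i}.txt"
    with open(chunk_path, "w", encoding="utf-8") as f:
        f.write(chunk["chunk"])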
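Note that the cell above numbers chunks with a single running index across all files. If you would rather restart the chunk number at 0 for each source document, a variant could look like the sketch below. `write_chunks` is a hypothetical helper, and whether the create-sdg-seed-data notebook depends on a particular numbering scheme is not something this sketch assumes.

```python
from collections import defaultdict
from pathlib import Path


def write_chunks(chunks, output_dir):
    """Write each chunk to `{file}-{n}.txt`, numbering chunks per source file."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    counters = defaultdict(int)  # next chunk number for each source file
    written = []
    for c in chunks:
        n = counters[c["file"]]
        counters[c["file"]] += 1
        path = output_dir / f"{c['file']}-{n}.txt"
        path.write_text(c["chunk"], encoding="utf-8")
        written.append(path)
    return written
```

Usage would be `write_chunks(all_chunks, "output/chunks")`, which returns the list of paths it wrote.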