# 🐶 Data Pre-Processing: From source PDF to SDG-ready

This notebook outlines the data pre-processing stages for knowledge contributions. A knowledge contribution consists of one or more PDF files that serve as the dataset for fine-tuning a model.

At a high level the steps for the data pre-processing are:

1. [Contribution Overview](#Contribution-Overview)
1. [Getting Started](#Getting-Started)
1. [Data Gathering](#Data-Gathering)
1. [Document Conversion](#Document-Conversion)
1. [Chunking](#Chunking)
1. [Authoring](#Authoring)
1. [Create Seed Dataset](#Create-Seed-Dataset-for-SDG)

Each step occurs in order and produces outputs used in subsequent steps. The final step creates an SDG dataset that allows users to run the [SDG-Hub knowledge-generation notebook](https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub/blob/main/examples/knowledge_tuning/instructlab/knowledge_generation_and_mixing.ipynb) and generate samples.

**NOTE**: Starting the notebook using Python 3.12 is recommended.


***

## Contribution Overview

### What is a Contribution?

To add knowledge to a model, a user groups the source documents that contain the knowledge into knowledge contributions. A knowledge contribution is made up of:

1. One or more PDF documents that can be described by a contribution summary.
2. A contribution summary.
3. A contribution domain.
4. A unique name used to create a directory in the workspace for artifacts created by each step for the contribution.

Once contributions are set up, a user can go through the data pre-processing workflow.

### What is a Contribution Summary?

In the synthetic data generation step, a model (known as the teacher model) generates synthetic data based on the source document.
The contribution summary and domain are used in the prompts that are sent to the teacher model to create data.

The document gets broken up into [chunks](#Chunking), and each chunk is included in the prompt sent to the teacher model.
The contribution summary provides additional context for each chunk of a source document, ensuring the teacher model has the necessary background information.

Contribution summaries should be specific, avoid acronyms and other vague references, and represent the documents' focus areas.
When a contribution includes multiple versions of the same document, the contribution summary should include publication dates, volume numbers, or other identifiers that distinguish between versions.

Here is an example of a contribution summary from a recent paper on [inference-time scaling](https://arxiv.org/pdf/2502.01618):

```
"A Probabilistic Inference Approach to Inference-Time Scaling of Large Language Models (LLMs)"
```

Since the title of the paper does a good job summarizing the paper, the summary is based on the title, with the acronym LLM spelled out.

Usually contributions only have one document. Contributions with multiple documents happen when the subject matter and format are similar among a group of documents. 

An example of a contribution having multiple documents would be the desire to teach a model an organization's bylaws over the years 2021, 2022, 2023, 2024, with a different PDF for each year.

A contribution summary in this case might look like:

`Bylaws of organization Foo from 2021 - 2024`

In the case that there was only one source document from the year 2023, the contribution summary would be:

`2023 Bylaws of organization Foo`

Another example of having multiple documents within the same contribution would be if the documents had the same format. An example here could be grouping together a furniture company's instruction manuals. The format and layout of the instruction manuals would be the same across different pieces of furniture, but each manual covers different furniture.

`Furniture company Foo's assembly instructions for tables, desks, and nightstands`

If the contribution only contained a PDF for the assembly instructions for an oak dining table the summary would be:

`Assembly instructions for furniture company Foo's oak dining table`

### What is a Contribution Domain?

A contribution's domain is the overarching subject or scope of the source document(s). The domain provides critical context to guide the teacher model in generating synthetic data that is relevant and grounded.

The domain is brief and should not exceed 3 words, but should ideally be 1-2 words.

To determine the domain, users should review each document's primary subject and identify its main topic or purpose.
Consider the intended use of the document and align the domain with the use case or audience. For example, a technical manual for developers might fall under the “software development” domain.

For the contribution summary examples discussed in the previous sections, domains could be `Artificial Intelligence Research`, `Bylaws`, and `Furniture Assembly`.

**Note:** Different contributions can have the same domain.

## Getting Started

The first step in this notebook is to establish a workspace. Workspaces allow multi-tenancy or multiple different runs of this notebook. Without workspaces the results of each of the steps would be overwritten each time this notebook is executed.

Users should change the `WORKSPACE_NAME` to suit their needs.

> **NOTE:**
> If this notebook is ever run from the middle the following two cells need to be rerun to initialize variables used in every section.

In [None]:
from pathlib import Path

WORKSPACE_NAME = "default"

WORKSPACE_ROOT = Path("workspaces")
WORKSPACE_ROOT.mkdir(exist_ok=True)

WORKSPACE_DIR = WORKSPACE_ROOT / WORKSPACE_NAME
WORKSPACE_DIR.mkdir(exist_ok=True)

SOURCE_DOCUMENT_DIR = "source_documents"
CONVERSION_DIR = "conversion"
CHUNKING_DIR = "chunking"
AUTHORING_DIR = "authoring"

To create contributions, define the `name` for the contribution, and the `domain` and `summary`. The `name`, `domain` and `summary` go into a dictionary called `knowledge_contribution` which gets added to a list called `contributions`.

Once the list of `contributions` is set, a directory with each contribution name is created within the workspace and subdirectories for `source_documents`, `conversion`, `chunking`, `authoring` are created within the contribution name directory.

In [None]:
# Populated later on
contributions = []

# Inference Time Scaling Contribution
contribution_name = "inference-time-scaling"
contribution_domain = "Artificial Intelligence Research" 
contribution_summary = "A Probabilistic Inference Approach to Inference-Time Scaling of Large Language Models (LLMs)"

# Add contribution information to the knowledge_contribution dictionary for it
knowledge_contribution = {"name": contribution_name, "domain": contribution_domain, "summary": contribution_summary}
contributions.append(knowledge_contribution)

# NFL Rules Contribution
contribution2_name = "nfl"
contribution2_domain = "sports rules" 
contribution2_summary = "Official playing rules of the National Football League 2022, 2023"
knowledge_contribution2 = {"name": contribution2_name, "domain": contribution2_domain, "summary": contribution2_summary}
contributions.append(knowledge_contribution2)

for contribution in contributions:
    contribution_dir = WORKSPACE_DIR / contribution["name"]
    contribution["dir"] = contribution_dir

    for subdir in [SOURCE_DOCUMENT_DIR, CONVERSION_DIR, CHUNKING_DIR, AUTHORING_DIR]:
        (contribution_dir / subdir).mkdir(parents=True, exist_ok=True)
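
Before moving on, it can help to check the `contributions` list against the guidance in the Contribution Overview: every contribution needs a unique `name`, a non-empty `summary`, and a `domain` of at most 3 words. The following is a minimal sketch of such a check; `check_contributions` is a hypothetical helper written for illustration and is not part of this notebook's `utils` package.

```python
def check_contributions(contributions: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    seen_names = set()
    for contribution in contributions:
        name = str(contribution.get("name", "")).strip()
        label = name or "<unnamed>"
        # Every contribution needs a name, a domain, and a summary.
        for key in ("name", "domain", "summary"):
            if not str(contribution.get(key, "")).strip():
                problems.append(f"{label}: missing or empty '{key}'")
        # Names become directory names, so they must be unique.
        if name in seen_names:
            problems.append(f"{label}: duplicate contribution name")
        seen_names.add(name)
        # The domain should not exceed 3 words.
        word_count = len(str(contribution.get("domain", "")).split())
        if word_count > 3:
            problems.append(f"{label}: domain has {word_count} words (limit is 3)")
    return problems


# Illustrative input: the second entry is deliberately wrong in three ways.
sample = [
    {
        "name": "nfl",
        "domain": "sports rules",
        "summary": "Official playing rules of the National Football League 2022, 2023",
    },
    {"name": "nfl", "domain": "rules of professional American football", "summary": ""},
]
for problem in check_contributions(sample):
    print(problem)
```

In this notebook the helper could be called as `check_contributions(contributions)`; an empty result suggests the list follows the guidance above.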

## Data Gathering

Copy each contribution file to the `WORKSPACE_DIR/<CONTRIBUTION NAME>/source_documents` directory for the following conversion step to detect them.

In [None]:
import shutil

# Inference Time Scaling Contribution
orig_path = Path("sample-pdfs/inference-time-scaling.pdf")
dst_path = WORKSPACE_DIR / contribution_name / SOURCE_DOCUMENT_DIR

shutil.copy(orig_path, dst_path)

# NFL Rules Contribution
rules_2022 = Path("sample-pdfs/2022-nfl-rulebook.pdf")
rules_2023 = Path("sample-pdfs/2023-nfl-rulebook.pdf")
rules_dst = WORKSPACE_DIR / contribution2_name / SOURCE_DOCUMENT_DIR

shutil.copy(rules_2022, rules_dst)
shutil.copy(rules_2023, rules_dst) 

Review this list of files to verify that all expected files are included in each of the contributions.

In [None]:
print("Files to pre-process\n--------------------")
for contribution in contributions:
    print(f"\nContribution: {contribution.get('name')}")
    print("Files:")
    files = list((contribution['dir'] / SOURCE_DOCUMENT_DIR).glob("*.pdf"))
    for file in files:
        print(file.resolve())

## Document Conversion

This notebook uses [Docling](https://github.com/docling-project/docling) to convert any type of document into a Docling Document: a structured representation of the original document that can be exported as JSON. The resulting JSON output is used in the following step, which applies Docling's chunking methods.

In [None]:
!pip install -qq docling

### Configure Docling conversion pipeline

Next we set the configuration options for our conversion pipeline. 

The standard pipeline options generally yield good and fast results for most documents. In some cases, however, alternative conversion techniques can lead to better outcomes. For instance, OCR is effective for scanned documents or images that contain text to be extracted and analyzed. In cases where other techniques didn't produce good results, using a vision-language model (VLM) may be a good option.

The next cell contains three combinations of pipeline options: the default (standard) options, a variant that forces OCR on the entire document, and another that uses a VLM. You can comment or uncomment the corresponding code blocks to switch between them. For more information and additional conversion techniques, check our [Docling Conversion Tutorials](https://github.com/instructlab/examples/blob/main/docs/docling-conversion/README.md).

For a complete reference on Docling's conversion pipeline configuration, check the [Examples](https://docling-project.github.io/docling/examples/) section of the official documentation, as well as the [PDFPipelineOptions](https://docling-project.github.io/docling/reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions) and [PDFFormatOptions](https://docling-project.github.io/docling/reference/document_converter/#docling.document_converter.InputFormat.XML_JATS) reference pages.

In [None]:
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    EasyOcrOptions,
    PdfPipelineOptions,
    VlmPipelineOptions,
    smoldocling_vlm_conversion_options,
)
from docling.pipeline.vlm_pipeline import VlmPipeline
from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend

# Standard pipeline options
pipeline_options = PdfPipelineOptions()
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options
        )
    }
)

# Force OCR on the entire page
# pipeline_options = PdfPipelineOptions()
# pipeline_options.do_ocr = True
# Set the language on the same options object; replacing `ocr_options`
# after configuring it would discard the earlier settings.
# pipeline_options.ocr_options = EasyOcrOptions(force_full_page_ocr=True, lang=["en"])
# pipeline_options.accelerator_options = AcceleratorOptions(
#     num_threads=4, device=AcceleratorDevice.AUTO
# )
# doc_converter = DocumentConverter(
#     format_options={
#         InputFormat.PDF: PdfFormatOption(
#             pipeline_options=pipeline_options,
#             backend=DoclingParseV4DocumentBackend,
#         )
#     }
# )

# Use the SmolDocling VLM
# pipeline_options = VlmPipelineOptions()
# pipeline_options.vlm_options = smoldocling_vlm_conversion_options
# doc_converter = DocumentConverter(
#     format_options={
#         InputFormat.PDF: PdfFormatOption(
#             pipeline_options=pipeline_options,
#             pipeline_cls=VlmPipeline,
#         )
#     }
# )



Finally, we convert every document into Docling JSON, provided it is a file type that can be converted (the cell below picks up `*.pdf` files).

In [None]:
import json

confidence_reports = dict()

json_files=[]

for contribution in contributions:
    files = list((contribution["dir"] / SOURCE_DOCUMENT_DIR).glob("*.pdf"))
                 
    for file in files:
        print(f"Converting {file}...")
        
        conversion_result = doc_converter.convert(source=file)

        doc = conversion_result.document
        doc_dict = doc.export_to_dict()
   
        confidence_reports[file] = conversion_result.confidence
        
        conversion_output_dir = contribution["dir"] / CONVERSION_DIR
        conversion_output_dir.mkdir(parents=True, exist_ok=True)
        
        json_output_path = conversion_output_dir / f"{file.stem}.json"
        with open(json_output_path, "w") as f:
            json.dump(doc_dict, f)
            print(f"Path of JSON output is: {Path(json_output_path).resolve()}")
            json_files.append(json_output_path.resolve())

        print("Document sample:\n")
        print(f"{doc.export_to_text()[:500]}...")
        print()
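
A quick look at the exported JSON can catch empty or near-empty conversions before chunking. The sketch below counts a few top-level collections in a converted file. The key names `texts`, `tables`, and `pictures` are assumptions about the exported layout, and `summarize_docling_json` is a hypothetical helper, not a Docling function.

```python
import json
import tempfile
from pathlib import Path


def summarize_docling_json(json_path: Path) -> dict:
    """Count a few top-level collections in an exported Docling JSON file."""
    with open(json_path, "r", encoding="utf-8") as f:
        doc_dict = json.load(f)
    return {
        "name": doc_dict.get("name"),
        # Assumed key names; missing keys are counted as empty.
        "texts": len(doc_dict.get("texts", [])),
        "tables": len(doc_dict.get("tables", [])),
        "pictures": len(doc_dict.get("pictures", [])),
    }


# Demo on a tiny hand-written stand-in for a converted document.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.json"
    demo_path.write_text(
        json.dumps({"name": "demo", "texts": [{}, {}], "tables": [{}]}),
        encoding="utf-8",
    )
    demo_summary = summarize_docling_json(demo_path)

print(demo_summary)
```

In the notebook, the helper could be applied to each path in `json_files`. A document with no text items at all is a hint that a different conversion technique, such as forced OCR, may be worth trying.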

### Conversion confidence

When converting a document, Docling can calculate how confident it is in the quality of the conversion. This *confidence* is expressed as both a *score* and a *grade*. The score is a numeric value between 0 and 1, and the grade is a label that can be **poor**, **fair**, **good**, or **excellent**. If Docling is unable to calculate a confidence grade, the value will be marked as *unspecified*.

If your document receives a low score (for example, below 0.8) and a grade of *poor* or *fair*, you'll probably benefit from using a different conversion technique. In that case, go back to the *Configure Docling Conversion Pipeline* section and try selecting a different approach (e.g. forcing OCR or using a VLM) and compare the results.

In [None]:
for file, confidence_report in confidence_reports.items():
    print(f"Conversion confidence for {file}:")
    
    print(f"Average confidence: \x1b[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")
    
    low_score_pages = []
    for page in confidence_report.pages:
        page_confidence_report = confidence_report.pages[page]
        if page_confidence_report.mean_score < confidence_report.mean_score:
            low_score_pages.append(page)

    print(f"Pages that scored lower than average: {', '.join(str(x + 1) for x in low_score_pages)}")
    
    print()
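
The rule of thumb above (a score below 0.8 together with a grade of *poor* or *fair*) can be turned into a small filter. This is a minimal sketch: `needs_reconversion` is a hypothetical helper, the 0.8 threshold is only the example value from the text, and the scores in the demo are made-up inputs rather than real conversion results.

```python
def needs_reconversion(mean_score: float, grade: str, threshold: float = 0.8) -> bool:
    """Flag a document whose score is low and whose grade is poor or fair."""
    return mean_score < threshold and grade.lower() in {"poor", "fair"}


# Made-up (score, grade) pairs, for illustration only.
demo_reports = {
    "scanned-report.pdf": (0.65, "FAIR"),
    "clean-paper.pdf": (0.93, "EXCELLENT"),
}
flagged = [
    name
    for name, (score, grade) in demo_reports.items()
    if needs_reconversion(score, grade)
]
print(flagged)
```

With the `confidence_reports` dictionary from the conversion step, the call would look like `needs_reconversion(report.mean_score, report.mean_grade.name)` for each report.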

### Post-Conversion: Illuminator Analysis

The output of document conversion is not always perfect. Data may become distorted or corrupted, which can negatively affect a model's performance after training. While optional, reviewing your converted data is strongly recommended. The following example explains how to use the Illuminator tool to identify common conversion issues.

In [None]:
from utils.illuminator.analysis import analyze_docling_tables
from utils.illuminator.utils import generate_summary
from docling.datamodel.document import DoclingDocument

import json
import sys
from pathlib import Path

for contribution in contributions:
    conversion_dir = contribution["dir"] / CONVERSION_DIR
    converted_json_paths = list(conversion_dir.glob("*.json"))
    results = {}
    
    for path in converted_json_paths:
        with open(path, "r") as f:
            doc_dict = json.load(f)
    
        doc = DoclingDocument(**doc_dict)
        results[path] = analyze_docling_tables(doc)
    
        summary_path = contribution["dir"] / CONVERSION_DIR / f"illuminator-readable-summary-{doc.name}.txt"
        
        with open(summary_path, "w") as f:
            generate_summary(results, file=f)
        
        print(f"✅ Post-conversion summary saved to: {summary_path.resolve()}")


The output of this post-conversion step should help determine whether to avoid using the content for chunking entirely or to manually edit it before proceeding with chunking.


## Chunking

The goal of chunking the converted documents is to provide the teacher model with small, logical pieces of the source document from which to generate data.

In this notebook we are doing chunking with [Docling](https://docling-project.github.io/docling/examples/hybrid_chunking/#hybrid-chunking).

The input to this step is a Docling JSON file created by the Docling conversion, or a directory of Docling JSON files.

### Initialize the Chunker

Docling provides two chunkers, the `HierarchicalChunker` and the `HybridChunker`.
The `HierarchicalChunker` creates chunks based on the hierarchy in the Docling document.

The `HybridChunker` builds on the `HierarchicalChunker` by making it tokenization-aware.

The `HybridChunker` has options for a `tokenizer`, the `max_tokens` in a chunk, and whether to merge undersized peer chunks. Uncomment the commented out code to configure these.

In [None]:
#from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
#from transformers import AutoTokenizer

from docling.chunking import HybridChunker

#EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
#MAX_TOKENS = 1024
#
# tokenizer = HuggingFaceTokenizer(
#     tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),
#     max_tokens=MAX_TOKENS,  # optional, by default derived from `tokenizer` for HF case
# )

chunker = HybridChunker(
    #tokenizer=tokenizer,
    #merge_peers=True,  # whether to merge undersized chunks - defaults to True
)
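
To build intuition for what `max_tokens` and merging undersized chunks mean, here is a toy sketch of the idea. It is not Docling's implementation: it counts whitespace-separated words instead of using a real tokenizer, it ignores document structure, and `merge_undersized` is a hypothetical function.

```python
def merge_undersized(chunks: list[str], max_tokens: int) -> list[str]:
    """Greedily merge neighbouring chunks while the result stays within max_tokens."""
    merged: list[str] = []
    for chunk in chunks:
        # Word count stands in for a real token count in this sketch.
        if merged and len(merged[-1].split()) + len(chunk.split()) <= max_tokens:
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    return merged


pieces = [
    "Heading text.",
    "A short paragraph here.",
    "Another short paragraph follows here.",
]
result = merge_undersized(pieces, max_tokens=6)
print(result)
```

The first two pieces fit within the budget together and are merged; adding the third would exceed it, so it stays a separate chunk. Larger `max_tokens` values give the teacher model more context per chunk at the cost of fewer, longer chunks.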

### Load and chunk the converted docling document

Next, let's load each converted Docling JSON file back into a Docling Document and chunk it.

The resulting chunks are stored in a file called `chunks.jsonl` in the `chunking` directory of your contribution. This file is used as an input in a later step when creating the seed dataset for SDG.

In [None]:
import json
from docling.document_converter import DocumentConverter

all_chunks = []

for contribution in contributions:
    conversion_dir = contribution["dir"] / CONVERSION_DIR
    json_files = list(conversion_dir.glob("*.json"))
    chunking_output_dir = contribution["dir"] / CHUNKING_DIR
    chunking_output_dir.mkdir(parents=True, exist_ok=True)
    contribution_chunks = []
    
    for file in json_files:
        # reconvert the docling JSON for chunking
        doc = DocumentConverter().convert(source=file)
        
        chunk_iter = chunker.chunk(dl_doc=doc.document)
        chunk_objs = list(chunk_iter)
    
        print(f"Extracted {len(chunk_objs)} chunks from {doc.document.name}")
        
        for chunk in chunk_objs:
            c = dict(
                chunk=chunker.contextualize(chunk=chunk),
                file=doc.document.name,
                metadata=chunk.meta.export_json_dict(),
            )
            contribution_chunks.append(c)
            all_chunks.append(c)

    # Write all chunks for this contribution once, after every document is processed
    chunks_file_path = chunking_output_dir / "chunks.jsonl"
    with open(chunks_file_path, "w", encoding="utf-8") as f:
        for chunk in contribution_chunks:
            json.dump(chunk, f)
            f.write("\n")
    print(f"Path of chunks JSON is: {chunks_file_path.resolve()}")

### View the Chunks

In [None]:
chunk_gen = iter(all_chunks)

To view the chunks one by one, rerun the following cell. The document is now broken into small sections with metadata about the chunk based on the document's format.

In [None]:
print(next(chunk_gen)['chunk'])

To view several randomly selected chunks, run the following cell as many times as you like:

In [None]:
NUM_CHUNKS_TO_VIEW = 5

import random
import json

sample = random.sample(all_chunks, min(len(all_chunks), NUM_CHUNKS_TO_VIEW))

i = 1
for chunk in sample:
    print(f"== Randomly selected chunk {i}: ==========\n\n{chunk['chunk']}\n\n")
    i += 1

## Authoring

To start the synthetic data generation process, users need to prepare a diverse set of questions and answers based on chunks from each source document. A chunk together with its question-and-answer pairs is called a seed example.

### Install docling-sdg

This notebook uses [Docling-sdg](https://github.com/docling-project/docling-sdg) to generate question and answer pairs for each chunk.

In [None]:
!pip install -qq docling-sdg

### Select the chunks for the seed examples

Chunks for seed examples should be diverse in style. They can be selected by hand or by picking diverse chunks from all of the chunks using the [subset selection notebook](https://github.com/instructlab/examples/blob/main/notebooks/instructlab-knowledge/subset-selection.ipynb).

If users are selecting chunks by hand, chunks should be taken directly from lines in `chunks.jsonl`. These lines have `chunk`, `file`, and `metadata` fields for each entry.

The code below randomly selects a preset number of chunks and saves them in a JSONL file for the next step.

In [None]:
from utils.qna_gen import save_random_chunk_selection

NUM_SEED_EXAMPLES = 7

for contribution in contributions:
    chunks_jsonl_path = contribution["dir"] / CHUNKING_DIR / "chunks.jsonl"
    authoring_path = contribution["dir"] / AUTHORING_DIR

    selected_chunks_jsonl = save_random_chunk_selection(chunks_jsonl_path,
                           authoring_path,
                           NUM_SEED_EXAMPLES)
    print(f"selected_chunks.jsonl saved to: {selected_chunks_jsonl}")
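
For hand selection, one option is to note the line numbers of the chunks you want in `chunks.jsonl` and copy those lines, unchanged, into `selected_chunks.jsonl`. The sketch below does that. `select_chunks_by_line` is a hypothetical helper, and it assumes that the selected file uses the same line format as `chunks.jsonl`.

```python
import json
import tempfile
from pathlib import Path


def select_chunks_by_line(chunks_jsonl: Path, output_jsonl: Path, line_numbers: list[int]) -> int:
    """Copy the chosen 1-based lines of chunks_jsonl into output_jsonl; return the count."""
    wanted = set(line_numbers)
    kept = 0
    with open(chunks_jsonl, "r", encoding="utf-8") as src, \
            open(output_jsonl, "w", encoding="utf-8") as dst:
        for line_number, line in enumerate(src, start=1):
            if line_number in wanted and line.strip():
                json.loads(line)  # fail early if the line is not valid JSON
                dst.write(line if line.endswith("\n") else line + "\n")
                kept += 1
    return kept


# Demo with three placeholder chunks; lines 1 and 3 are selected.
with tempfile.TemporaryDirectory() as tmp:
    chunks_path = Path(tmp) / "chunks.jsonl"
    selected_path = Path(tmp) / "selected_chunks.jsonl"
    rows = [{"chunk": f"text {i}", "file": "demo", "metadata": {}} for i in range(1, 4)]
    chunks_path.write_text("".join(json.dumps(r) + "\n" for r in rows), encoding="utf-8")
    kept = select_chunks_by_line(chunks_path, selected_path, [1, 3])
    selected = [
        json.loads(line)
        for line in selected_path.read_text(encoding="utf-8").splitlines()
    ]

print(kept, [row["chunk"] for row in selected])
```

Running this instead of the random selection would overwrite `selected_chunks.jsonl` in the authoring directory, so use one approach or the other for a given contribution.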

### Generate the questions and answers for each chunk

To generate questions and answers you need to set: 
1. The OpenAI-compatible endpoint for the model generating question-and-answer pairs
2. The model's API key
3. The model's name

In [None]:
import os

API_KEY = os.getenv("MODEL_API_KEY") or "<INSERT API KEY HERE>"  # the API access key for your account (cannot be empty)
ENDPOINT_URL = os.getenv("MODEL_ENDPOINT_URL") or "<INSERT ENDPOINT URL HERE>" # the URL of your model's API. URL can end in "/v1"
MODEL_NAME = os.getenv("MODEL_NAME") or "mistralai/Mixtral-8x7B-Instruct-v0.1" # the name of your model
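
Because the values above fall back to `<INSERT ...>` placeholders, it is easy to start generation with a setting that was never filled in. A minimal guard, assuming placeholders always begin with `<INSERT`, might look like the following; `missing_settings` is a hypothetical helper and the endpoint URL in the demo is a made-up local address.

```python
def missing_settings(settings: dict[str, str]) -> list[str]:
    """Return names of settings that are empty or still hold an '<INSERT ...>' placeholder."""
    return [
        name
        for name, value in settings.items()
        if not value or value.strip().startswith("<INSERT")
    ]


# Illustrative values only; the URL is a hypothetical local endpoint.
demo_settings = {
    "API_KEY": "<INSERT API KEY HERE>",
    "ENDPOINT_URL": "http://localhost:8000/v1",
    "MODEL_NAME": "mistralai/Mixtral-8x7B-Instruct-v0.1",
}
unset = missing_settings(demo_settings)
print(unset)
```

In the notebook this could be called with `{"API_KEY": API_KEY, "ENDPOINT_URL": ENDPOINT_URL, "MODEL_NAME": MODEL_NAME}` before generating questions and answers.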

#### [OPTIONAL] Prompt customization for Q&A Generation

Optionally insert your own stylistic customization statement below. If `customization_str` is `None`, there will be no customization attempted and the default QA generation prompt will be used.

In [None]:
customization_str = None 

# Example: 
# customization_str = "Write at the fifth grade level."

#### Generate questions and answers and create qna.yaml file

In [None]:
from utils.qna_gen import generate_seed_examples

for contribution in contributions:
    authoring_path = contribution["dir"] / AUTHORING_DIR
    selected_chunks_path = authoring_path / "selected_chunks.jsonl"

    qna_output_path = generate_seed_examples(contribution["name"],
                           selected_chunks_path,
                           authoring_path,
                           API_KEY,
                           ENDPOINT_URL,
                           MODEL_NAME,
                           contribution["domain"],
                           contribution["summary"],                  
                           customization_str)
    print(f"qna.yaml saved to: {qna_output_path}")


### Review and Revise Seed Examples

A quality set of seed examples has diverse contexts and question-and-answer pairs across every seed example. You can assess the `qna.yaml` files in your preferred text editor to ensure the quality, diversity, and style of generated questions and answers, and modify them accordingly.

In [None]:
from utils.qna_gen import view_seed_example

index = 0 # index of seed example to view. Value must be lower than number of seed examples

# pass in path to qna.yaml file and seed example index to view single seed example
view_seed_example(qna_output_path, index)

After assessment, the `qna.yaml` files can be quickly reviewed to ensure they include the required elements and the correct number of each. It is recommended to have at least 5 seed examples. Each seed example must have 3 question and answer pairs.

In [None]:
from utils.qna_gen import review_seed_examples_file

for contribution in contributions:
    qna_path = contribution["dir"] / AUTHORING_DIR / "qna.yaml"
    review_seed_examples_file(qna_path, min_seed_examples=5, num_qa_pairs=3)
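
To make the two requirements concrete, the sketch below applies them to an already-parsed `qna.yaml`. It is not the implementation of `review_seed_examples_file`. The keys `context` and `questions_and_answers` are assumptions about the file layout, and `check_seed_examples` is a hypothetical helper.

```python
def check_seed_examples(qna: dict, min_seed_examples: int = 5, num_qa_pairs: int = 3) -> list[str]:
    """Return problems found in a parsed qna.yaml; an empty list means none."""
    problems = []
    seed_examples = qna.get("seed_examples", [])
    if len(seed_examples) < min_seed_examples:
        problems.append(
            f"found {len(seed_examples)} seed examples, expected at least {min_seed_examples}"
        )
    for index, example in enumerate(seed_examples):
        # Assumed keys: each seed example holds a context and a list of Q&A pairs.
        if not str(example.get("context", "")).strip():
            problems.append(f"seed example {index}: empty context")
        pairs = example.get("questions_and_answers", [])
        if len(pairs) != num_qa_pairs:
            problems.append(
                f"seed example {index}: {len(pairs)} question and answer pairs, expected {num_qa_pairs}"
            )
    return problems


# Placeholder content: five seed examples with three pairs each.
good_qna = {
    "seed_examples": [
        {
            "context": f"chunk {i}",
            "questions_and_answers": [{"question": "q", "answer": "a"}] * 3,
        }
        for i in range(5)
    ]
}
print(check_seed_examples(good_qna))
```

Removing a pair from any seed example, or dropping to four seed examples, would each add one entry to the returned list.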

## Create Seed Dataset for SDG

This step creates the seed data for SDG. This data is a JSON Lines file that contains a combination of the `seed_examples` in the `qna.yaml` and the chunks from the source document.

Intermediate seed data files are created for each contribution with the contribution's name included in the file name. For example, in the `nfl` contribution, a file containing seed data called `seed_data-nfl.jsonl` would be created in `$WORKSPACE_DIR/nfl`. This file contains a combination of all of the chunks from the NFL source documents and the seed examples in the `qna.yaml` in `$WORKSPACE_DIR/nfl/authoring`.

After seed data files are created for each contribution, a final `seed_data.jsonl` is created in `$WORKSPACE_DIR`. This file is a concatenation of all of the intermediate `seed_data-{contribution name}.jsonl` files and should be used as an input to SDG.

In [None]:
!pip install -qq datasets transformers

In [None]:
from utils.create_seed_dataset import get_seed_dataset, safe_concatenate_datasets

contribution_datasets = []
for contribution in contributions:
    chunks_dir = contribution["dir"] / CHUNKING_DIR
    qna_dir = contribution["dir"] / AUTHORING_DIR
    seed_data = get_seed_dataset(chunks_dir, qna_dir)
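    # Use the current contribution's own directory and name so that each
    # contribution writes its own intermediate file
    output_path = f'{contribution["dir"]}/seed_data-{contribution["name"]}.jsonl'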
    seed_data.to_json(output_path, orient='records', lines=True)
    contribution_datasets.append(seed_data)
    print(f"Intermediate results saved to: {output_path}")

final_seed_data = safe_concatenate_datasets(contribution_datasets)
output_path = f'{WORKSPACE_DIR}/seed_data.jsonl'
final_seed_data.to_json(output_path, orient='records', lines=True)

print(f"Final seed data contains {final_seed_data.data.num_rows} rows")
print(f"Final seed data for SDG saved to: {output_path}")
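
Since the final file is described as a concatenation of the intermediate files, a simple consistency check is to compare row counts. The sketch below assumes the concatenation neither drops nor duplicates rows. `count_jsonl_rows` and `rows_match` are hypothetical helpers, and the demo files contain placeholder records.

```python
import tempfile
from pathlib import Path


def count_jsonl_rows(path: Path) -> int:
    """Count non-empty lines in a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def rows_match(intermediate_files: list[Path], final_file: Path) -> bool:
    """True when the final file has as many rows as all intermediate files combined."""
    return sum(count_jsonl_rows(p) for p in intermediate_files) == count_jsonl_rows(final_file)


# Demo with two small intermediate files and their concatenation.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    part_a = tmp_dir / "seed_data-a.jsonl"
    part_b = tmp_dir / "seed_data-b.jsonl"
    final = tmp_dir / "seed_data.jsonl"
    part_a.write_text('{"id": 1}\n{"id": 2}\n', encoding="utf-8")
    part_b.write_text('{"id": 3}\n', encoding="utf-8")
    final.write_text(
        part_a.read_text(encoding="utf-8") + part_b.read_text(encoding="utf-8"),
        encoding="utf-8",
    )
    demo_ok = rows_match([part_a, part_b], final)

print(demo_ok)
```

In the notebook, the intermediate paths could be collected with `WORKSPACE_DIR.glob("*/seed_data-*.jsonl")` and compared against `WORKSPACE_DIR / "seed_data.jsonl"`.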

### Inspect the seed data

In [None]:
print(seed_data.data.table.slice(length=1))

# Summary

To recap, given source documents in PDF format, this notebook:

1. Converts the documents using Docling and saves them as Docling JSON
2. Splits the converted documents into chunks saved as JSON Lines
3. Generates Q&A pairs for a subset of those chunks
4. Creates a `qna.yaml` available for inspection and revision
5. Combines the chunks and `qna.yaml` to create a `seed_data.jsonl` to use for SDG

The next step is to use the resulting `seed_data.jsonl` for SDG, such as illustrated in [this notebook](https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub/blob/main/examples/instructlab/knowledge/knowledge_generation_and_mixing.ipynb).