This notebook uses [Docling](https://docling-project.github.io/docling/) to convert documents (PDFs in this example) into Docling Documents. A Docling Document is Docling's representation of a converted document, and it can be exported as JSON. The JSON produced by this notebook can then be used as input to other notebooks, such as one that applies Docling's chunking methods.

In [18]:
from docling.document_converter import DocumentConverter, ConversionError, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
import json
from pathlib import Path

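First, we set the path to the documents we want to convert and the directory where the JSON output should be written. The document path can point to a single file or to a directory, which is searched recursively for PDFs.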

In [7]:
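doc_path = Path("/path/to/documents")  # a single PDF or a directory of PDFs
output_dir = Path("/path/to/output/dir")
output_dir.mkdir(parents=True, exist_ok=True)  # make sure the output directory exists

# Use a single file as-is; otherwise search the directory recursively for PDFs.
if doc_path.is_file():
    files = [doc_path]
else:
    files = list(doc_path.rglob("*.pdf"))
print(f"Files to convert: {files}")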

Files to convert: [PosixPath('temp/2-tables-one-page-cargo-theft-report.pdf'), PosixPath('temp/US-Youth-Soccer-Travel-Policy.pdf'), PosixPath('temp/cargo-theft-report-2018.pdf'), PosixPath('temp/dir2/top-100-movies.pdf')]
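The cell above only picks up PDFs. A minimal sketch of collecting several file types at once is shown below. The helper name `collect_files` and the `SUFFIXES` set are hypothetical, and whether the converter accepts each of these formats depends on your Docling version and configuration.

```python
from pathlib import Path

# Hypothetical set of extensions to pick up; adjust to the formats you need.
SUFFIXES = {".pdf", ".docx", ".html"}


def collect_files(doc_path: Path, suffixes: set[str] = SUFFIXES) -> list[Path]:
    """Return doc_path itself if it is a file, else every matching file below it."""
    if doc_path.is_file():
        return [doc_path]
    # Compare suffixes case-insensitively so "report.PDF" is also found.
    return sorted(
        p for p in doc_path.rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    )
```

Sorting the result keeps the conversion order stable between runs, which makes the printed file list easier to compare.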


Next, we set the configuration options for our conversion pipeline. The PDF conversion options set here are the defaults. More information about pipeline configuration can be found in the [Docling documentation](https://docling-project.github.io/docling/).

In [19]:
pipeline_options = PdfPipelineOptions()

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options
        )
    }
)

Finally, we convert each document to Docling JSON. Files that Docling cannot convert, for example because of an unsupported file type, are skipped.

In [21]:
for file in files:
    try:
        # Convert the file and export the resulting Docling Document as a dict.
        doc = doc_converter.convert(source=file).document
        doc_dict = doc.export_to_dict()
    except ConversionError as e:
        # Report why the file could not be converted and move on to the next one.
        print(f"Skipping file {file}: {e}")
        continue
    json_output_path = output_dir / f"{file.stem}.json"
    with open(json_output_path, "w", encoding="utf-8") as f:
        json.dump(doc_dict, f)
    print(f"Path of JSON output is: {json_output_path.resolve()}")

Path of JSON output is: /Users/amaredia/dev/examples/notebooks/preprocessing-sdg/output/2-tables-one-page-cargo-theft-report.json
Path of JSON output is: /Users/amaredia/dev/examples/notebooks/preprocessing-sdg/output/US-Youth-Soccer-Travel-Policy.json
Path of JSON output is: /Users/amaredia/dev/examples/notebooks/preprocessing-sdg/output/cargo-theft-report-2018.json
Path of JSON output is: /Users/amaredia/dev/examples/notebooks/preprocessing-sdg/output/top-100-movies.json
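The loop above names each output after `file.stem`, so two PDFs with the same name in different subdirectories would write to the same JSON file. One possible way around this, sketched below, is to mirror each file's location relative to the input directory. The helper `json_path_for` is hypothetical and is not part of Docling.

```python
from pathlib import Path


def json_path_for(file: Path, doc_root: Path, output_dir: Path) -> Path:
    """Mirror file's location relative to doc_root under output_dir, with a .json suffix."""
    try:
        relative = file.relative_to(doc_root)
    except ValueError:
        # file is not located under doc_root: fall back to the bare file name.
        relative = Path(file.name)
    if relative.name == "":
        # doc_root was the file itself, so the relative path is empty.
        relative = Path(file.name)
    return output_dir / relative.with_suffix(".json")
```

When using a helper like this in the loop, create the parent directory first, for example with `json_output_path.parent.mkdir(parents=True, exist_ok=True)`, since the mirrored subdirectories may not exist yet.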
