# Document Conversion with Docling

This notebook uses [Docling](https://github.com/docling-project/docling) to convert documents in any supported format into a Docling Document: a structured representation of the original document that can be exported as JSON.

In [None]:
!pip install -qq docling

### Set directory for files to convert and output directory

In [None]:
from pathlib import Path

sample_data_dir = Path("data/sample-pdfs")
files = list(sample_data_dir.glob("*.pdf"))

output_dir = Path("data/output")
output_dir.mkdir(parents=True, exist_ok=True)
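
The cell above only picks up `*.pdf` files. If your sample directory mixes several document types, a small helper can gather them by suffix. This is a minimal sketch: `collect_input_files` and `CANDIDATE_SUFFIXES` are hypothetical names, and the suffix set is an illustrative subset rather than Docling's authoritative list of supported formats.

```python
from pathlib import Path

# Assumption: a partial, illustrative set of suffixes. Consult Docling's
# supported-formats page for the authoritative list.
CANDIDATE_SUFFIXES = {".pdf", ".docx", ".pptx", ".html", ".md"}


def collect_input_files(directory: Path, suffixes=CANDIDATE_SUFFIXES) -> list[Path]:
    """Return the files in `directory` whose suffix is in `suffixes`.

    The suffix comparison is case-insensitive and the result is sorted.
    """
    return sorted(
        p
        for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() in suffixes
    )
```

With this helper, `files = collect_input_files(sample_data_dir)` could stand in for the `glob` call.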

### Configure Docling conversion pipeline

Next, we set the configuration options for our conversion pipeline.

The standard pipeline options generally yield good, fast results for most documents. In some cases, however, alternative conversion pipelines lead to better outcomes. For instance, OCR is effective for scanned documents or images that contain text to be extracted and analyzed. If the other pipelines don't produce good results, a vision-language model (VLM) may be a good option.

The next cell contains three combinations of pipeline options: the default (standard) options, a variant that forces OCR on the entire document, and another that uses a VLM. You can comment or uncomment the corresponding code blocks to switch between them or create a custom combination of settings. For more information and additional conversion pipelines, check our [Docling Conversion Tutorials](https://github.com/instructlab/examples/blob/main/docs/docling-conversion/README.md).

For a complete reference on Docling's conversion pipeline configuration, check the [Examples](https://docling-project.github.io/docling/examples/) section of the official documentation, as well as the [`PdfPipelineOptions`](https://docling-project.github.io/docling/reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions) and [`PdfFormatOption`](https://docling-project.github.io/docling/reference/document_converter/#docling.document_converter.InputFormat.XML_JATS) reference pages.

In [None]:
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    EasyOcrOptions,
    PdfPipelineOptions,
    VlmPipelineOptions,
    smoldocling_vlm_conversion_options,
)
from docling.pipeline.vlm_pipeline import VlmPipeline
from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend

# Standard pipeline options
pipeline_options = PdfPipelineOptions()
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options
        )
    }
)

# Force OCR on the entire document
# pipeline_options = PdfPipelineOptions()
# pipeline_options.do_ocr = True
# # Build the OCR options in one step: assigning a new EasyOcrOptions object
# # afterwards would discard settings made on the previous ocr_options.
# pipeline_options.ocr_options = EasyOcrOptions(force_full_page_ocr=True, lang=["en"])
# pipeline_options.accelerator_options = AcceleratorOptions(
#     num_threads=4, device=AcceleratorDevice.AUTO
# )
# doc_converter = DocumentConverter(
#     format_options={
#         InputFormat.PDF: PdfFormatOption(
#             pipeline_options=pipeline_options,
#             backend=DoclingParseV4DocumentBackend,
#         )
#     }
# )

# Use the SmolDocling VLM
# pipeline_options = VlmPipelineOptions()
# pipeline_options.vlm_options = smoldocling_vlm_conversion_options
# doc_converter = DocumentConverter(
#     format_options={
#         InputFormat.PDF: PdfFormatOption(
#             pipeline_options=pipeline_options,
#             pipeline_cls=VlmPipeline,
#         )
#     }
# )
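
Instead of commenting blocks in and out, the three combinations above could be wrapped in a single function. The following is a minimal sketch, not part of Docling: `build_converter` and its mode names are hypothetical, and the import paths simply mirror the cell above, so they may need adjusting for your installed Docling version.

```python
VALID_MODES = ("standard", "force_ocr", "vlm")  # hypothetical mode names


def build_converter(mode: str = "standard"):
    """Return a DocumentConverter configured for one of VALID_MODES."""
    if mode not in VALID_MODES:
        raise ValueError(f"Unknown mode {mode!r}; expected one of {VALID_MODES}")

    # Imported lazily so the check above runs even if Docling is missing.
    # Assumption: these paths match the imports used earlier in this notebook.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        EasyOcrOptions,
        PdfPipelineOptions,
        VlmPipelineOptions,
        smoldocling_vlm_conversion_options,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption
    from docling.pipeline.vlm_pipeline import VlmPipeline

    if mode == "vlm":
        options = VlmPipelineOptions()
        options.vlm_options = smoldocling_vlm_conversion_options
        pdf_option = PdfFormatOption(pipeline_options=options, pipeline_cls=VlmPipeline)
    else:
        options = PdfPipelineOptions()
        if mode == "force_ocr":
            options.do_ocr = True
            options.ocr_options = EasyOcrOptions(force_full_page_ocr=True)
        pdf_option = PdfFormatOption(pipeline_options=options)

    return DocumentConverter(format_options={InputFormat.PDF: pdf_option})
```

Calling `doc_converter = build_converter("force_ocr")` would then replace the manual uncommenting, which makes it easier to rerun the notebook with a different approach.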

Finally, we convert every file into a Docling Document and save it as JSON. Each input file must be a [supported file type](https://docling-project.github.io/docling/usage/supported_formats/).

In [None]:
import json

confidence_reports = {}
json_files = []

for file in files:
    conversion_result = doc_converter.convert(source=file)

    doc = conversion_result.document
    doc_dict = doc.export_to_dict()
    confidence_reports[file] = conversion_result.confidence

    json_output_path = output_dir / f"{file.stem}.json"
    with open(json_output_path, "w", encoding="utf-8") as f:
        json.dump(doc_dict, f)
    print(f"Path of JSON output is: {json_output_path.resolve()}")
    json_files.append(json_output_path.resolve())

    print("Document sample:\n")
    print(f"{doc.export_to_text()[:500]}...")
    print()
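
In the loop above, one file that fails to convert stops the whole batch. A possible variation records failures per file and keeps going. This is a sketch under assumptions: `convert_all` is a hypothetical helper, and `convert_fn` stands for any callable that converts a single path, such as a thin wrapper around `doc_converter.convert`.

```python
from pathlib import Path
from typing import Any, Callable, Iterable


def convert_all(files: Iterable[Path], convert_fn: Callable[[Path], Any]):
    """Convert each file, collecting failures instead of stopping at the first.

    Returns a (results, failures) pair of dicts keyed by input path.
    """
    results: dict[Path, Any] = {}
    failures: dict[Path, str] = {}
    for file in files:
        try:
            results[file] = convert_fn(file)
        except Exception as exc:  # broad on purpose: record any per-file failure
            failures[file] = f"{type(exc).__name__}: {exc}"
    return results, failures
```

For example, `results, failures = convert_all(files, lambda f: doc_converter.convert(source=f))` would leave the successful conversions in `results` and a short error message for each problematic file in `failures`.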

### Conversion confidence

When converting a document, Docling can calculate how confident it is in the quality of the conversion. This *confidence* is expressed as both a *score* and a *grade*. The score is a numeric value between 0 and 1, and the grade is a label that can be **poor**, **fair**, **good**, or **excellent**. If Docling is unable to calculate a confidence grade, the value will be marked as *unspecified*.

If your document receives a low score (for example, below 0.8) and a grade of *poor* or *fair*, you'll probably benefit from a different conversion technique. In that case, go back to the *Configure Docling conversion pipeline* section, select a different approach (e.g. forcing OCR or using a VLM), and compare the results.
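
The rule of thumb above can be written as a small check. This is only a sketch: `needs_reconversion` is a hypothetical helper, and the 0.8 threshold is the example value from the text, not a limit defined by Docling.

```python
def needs_reconversion(score: float, grade: str, threshold: float = 0.8) -> bool:
    """Return True when both the score and the grade suggest a weak conversion.

    `grade` is the grade label as a string, compared case-insensitively.
    """
    low_grade = grade.lower() in {"poor", "fair"}
    return score < threshold and low_grade
```

Given a confidence report, the call would look like `needs_reconversion(confidence_report.mean_score, confidence_report.mean_grade.name)`.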

In [None]:
for file, confidence_report in confidence_reports.items():
    print(f"Conversion confidence for {file}:")
    
    print(f"Average confidence: \033[1m{confidence_report.mean_grade.name}\033[0m (score {confidence_report.mean_score:.3f})")
    
    # Collect the pages whose score falls below the document-wide mean.
    low_score_pages = [
        page
        for page, page_confidence_report in confidence_report.pages.items()
        if page_confidence_report.mean_score < confidence_report.mean_score
    ]

    pages_text = ", ".join(str(x + 1) for x in low_score_pages) or "none"
    print(f"Pages that scored lower than average: {pages_text}")
    
    print()

### Post-Conversion: Illuminator Analysis

The output of document conversion is not always perfect. While optional, reviewing your converted data is strongly recommended. The following example explains how to use the Illuminator tool to identify common conversion issues.

In [None]:
from illuminator.analysis import analyze_docling_tables
from illuminator.utils import generate_summary
from docling.datamodel.document import DoclingDocument

import json
from pathlib import Path

converted_json_paths = list(output_dir.glob("*.json"))
results = {}
    
for path in converted_json_paths:
    with open(path, "r") as f:
        doc_dict = json.load(f)

    doc = DoclingDocument(**doc_dict)
    results[path] = analyze_docling_tables(doc)

    summary_path = output_dir / f"illuminator-readable-summary-{doc.name}.txt"
    
    with open(summary_path, "w") as f:
        generate_summary(results, file=f)
    
    print(f"✅ Post-conversion summary saved to: {summary_path.resolve()}")

The post-conversion summary should help you decide whether to discard the converted content entirely or to edit it manually before using the document.