# Docling Comprehensive Tutorial

This notebook provides a complete guide to using **Docling**, a powerful document conversion and processing library. We'll explore everything from basic conversions to advanced pipelines with VLMs, ASR, custom OCR configurations, and more.

## üìö Learning Path

The diagram below illustrates all concepts covered in this tutorial:

```mermaid
graph TB
    Start([üöÄ Start]):::startEnd --> Basics
    
    subgraph Basics[üî∞ Core Conversion]
        B1[Minimal Conversion]
        B2[Custom Configuration]
        B3[Batch Processing]
        B4[Multi-Format Support]
    end
    
    subgraph Backends[üíæ Backends]
        BE1[CSV Backend]
        BE2[XML & RAG]
    end
    
    subgraph Pipelines[ü§ñ Advanced Pipelines]
        P1[VLM Pipeline - Minimal]
        P2[VLM - Compare Models]
        P3[VLM - API Model]
        P4[ASR Pipeline]
    end
    
    subgraph Exports[üì§ Exporting Results]
        E1[Export Figures]
        E2[Export Tables]
        E3[Multimodal Export]
    end
    
    subgraph OCR[üëÅÔ∏è Advanced OCR]
        O1[Full Page OCR]
        O2[Tesseract Lang Detection]
        O3[RapidOCR Custom Models]
        O4[SuryaOCR Custom Models]
    end
    
    subgraph Advanced[‚ö° Enhancements]
        A1[Accelerator Options]
        A2[PII Obfuscation]
        A3[Translation]
    end
    
    Basics --> Backends
    Backends --> Pipelines
    Pipelines --> Exports
    Exports --> OCR
    OCR --> Advanced
    Advanced --> End([üéØ Complete]):::startEnd
    
    classDef startEnd fill:#ff6b6b,stroke:#c92a2a,stroke-width:3px,color:#fff
    classDef basics fill:#4ecdc4,stroke:#2a9d8f,stroke-width:2px
    classDef pipelines fill:#95e1d3,stroke:#38ada9,stroke-width:2px
    classDef exports fill:#ffd93d,stroke:#f6b93b,stroke-width:2px
    classDef ocr fill:#a8e6cf,stroke:#6bcf9f,stroke-width:2px
    classDef advanced fill:#ff8b94,stroke:#ff6b7a,stroke-width:2px
    
    class B1,B2,B3,B4 basics
    class P1,P2,P3,P4 pipelines
    class E1,E2,E3 exports
    class O1,O2,O3,O4 ocr
    class A1,A2,A3 advanced
```

---
# üî∞ Part 1: Core Conversion

## 1. Minimal Conversion

The simplest way to use Docling - convert a document with default settings.

### üìñ Concept Overview

**What you'll learn:**
- How to perform a basic document conversion with default settings
- Understanding the `DocumentConverter` class
- Converting PDFs, images, and other formats to markdown
- Checking conversion status

**Key concepts:**
- `DocumentConverter()` - The main entry point for conversions
- `convert()` - Converts a single document
- `export_to_markdown()` - Exports the result as markdown text

This is the simplest way to get started with Docling!

## üìÅ Create Mock Data

Since we won't assume any data exists, let's create all the mock files we'll need for this tutorial.

In [1]:
import os
import csv
from pathlib import Path
from PIL import Image, ImageDraw, ImageFont
import io

# Create data directory
data_dir = Path("docling_tutorial_data")
data_dir.mkdir(exist_ok=True)

print(f"‚úÖ Created data directory: {data_dir.absolute()}")

# 1. Create a mock PDF using reportlab
try:
    from reportlab.lib.pagesizes import letter
    from reportlab.pdfgen import canvas
    from reportlab.lib.utils import ImageReader
    
    pdf_path = data_dir / "sample_document.pdf"
    c = canvas.Canvas(str(pdf_path), pagesize=letter)
    
    # Page 1
    c.setFont("Helvetica-Bold", 24)
    c.drawString(100, 750, "Docling Tutorial Sample Document")
    c.setFont("Helvetica", 12)
    c.drawString(100, 720, "This is a sample PDF created for demonstrating Docling's capabilities.")
    c.drawString(100, 700, "It contains multiple pages with text, tables, and images.")
    
    # Add a simple table
    c.setFont("Helvetica-Bold", 14)
    c.drawString(100, 650, "Sample Table:")
    c.setFont("Helvetica", 10)
    y = 630
    table_data = [
        ["Employee ID", "Name", "Department", "Salary"],
        ["001", "John Doe", "Engineering", "$75,000"],
        ["002", "Jane Smith", "Marketing", "$65,000"],
        ["003", "Bob Johnson", "Sales", "$70,000"]
    ]
    for row in table_data:
        c.drawString(100, y, " | ".join(row))
        y -= 15
    
    c.showPage()
    
    # Page 2
    c.setFont("Helvetica-Bold", 18)
    c.drawString(100, 750, "Page 2: Additional Content")
    c.setFont("Helvetica", 12)
    c.drawString(100, 720, "This page contains more text for testing multi-page conversion.")
    c.drawString(100, 700, "Docling can extract text from complex layouts efficiently.")
    
    c.save()
    print(f"‚úÖ Created PDF: {pdf_path}")
    
except ImportError:
    print("‚ö†Ô∏è reportlab not installed. Creating a simple text file instead.")
    pdf_path = data_dir / "sample_document.txt"
    with open(pdf_path, "w") as f:
        f.write("Docling Tutorial Sample Document\n")
        f.write("This is a sample text file for demonstration.\n")
    print(f"‚úÖ Created text file: {pdf_path}")

# 2. Create sample CSV
csv_path = data_dir / "employees.csv"
with open(csv_path, "w", newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(["EmployeeID", "Name", "Department", "Email", "Salary"])
    writer.writerow(["E001", "Alice Johnson", "Engineering", "alice@company.com", "95000"])
    writer.writerow(["E002", "Bob Martinez", "Product", "bob@company.com", "87000"])
    writer.writerow(["E003", "Carol White", "Marketing", "carol@company.com", "72000"])
    writer.writerow(["E004", "David Brown", "Sales", "david@company.com", "68000"])
    writer.writerow(["E005", "Eva Green", "HR", "eva@company.com", "65000"])

print(f"‚úÖ Created CSV: {csv_path}")

# 3. Create sample XML
xml_path = data_dir / "library.xml"
xml_content = """<?xml version="1.0" encoding="UTF-8"?>
<library>
    <metadata>
        <name>Tech Books Collection</name>
        <location>Main Branch</location>
        <established>2020</established>
    </metadata>
    <books>
        <book id="1">
            <title>Introduction to Machine Learning</title>
            <author>Dr. Sarah Anderson</author>
            <year>2023</year>
            <isbn>978-0-123456-78-9</isbn>
            <category>AI/ML</category>
            <available>true</available>
        </book>
        <book id="2">
            <title>Advanced Python Programming</title>
            <author>Michael Chen</author>
            <year>2024</year>
            <isbn>978-0-987654-32-1</isbn>
            <category>Programming</category>
            <available>false</available>
        </book>
        <book id="3">
            <title>Data Structures and Algorithms</title>
            <author>Lisa Rodriguez</author>
            <year>2022</year>
            <isbn>978-0-555666-77-8</isbn>
            <category>Computer Science</category>
            <available>true</available>
        </book>
    </books>
</library>
"""
with open(xml_path, "w", encoding='utf-8') as f:
    f.write(xml_content)

print(f"‚úÖ Created XML: {xml_path}")

# 4. Create sample images for OCR testing
img_path = data_dir / "sample_text_image.png"
img = Image.new('RGB', (800, 400), color=(255, 255, 255))
draw = ImageDraw.Draw(img)

# Use default font
try:
    font_large = ImageFont.truetype("arial.ttf", 36)
    font_medium = ImageFont.truetype("arial.ttf", 24)
    font_small = ImageFont.truetype("arial.ttf", 18)
except:
    font_large = ImageFont.load_default()
    font_medium = ImageFont.load_default()
    font_small = ImageFont.load_default()

draw.text((50, 30), "Docling OCR Test Image", fill=(0, 0, 0), font=font_large)
draw.text((50, 100), "This image contains text that will be extracted", fill=(50, 50, 50), font=font_medium)
draw.text((50, 150), "using Optical Character Recognition (OCR).", fill=(50, 50, 50), font=font_medium)

# Draw a simple table
draw.text((50, 220), "Sample Table:", fill=(0, 0, 0), font=font_medium)
draw.rectangle([50, 250, 750, 350], outline=(0, 0, 0), width=2)
draw.line([50, 280, 750, 280], fill=(0, 0, 0), width=2)
draw.line([300, 250, 300, 350], fill=(0, 0, 0), width=1)
draw.line([550, 250, 550, 350], fill=(0, 0, 0), width=1)

draw.text((100, 255), "Product", fill=(0, 0, 0), font=font_small)
draw.text((350, 255), "Quantity", fill=(0, 0, 0), font=font_small)
draw.text((600, 255), "Price", fill=(0, 0, 0), font=font_small)

draw.text((100, 290), "Laptop", fill=(0, 0, 0), font=font_small)
draw.text((380, 290), "15", fill=(0, 0, 0), font=font_small)
draw.text((600, 290), "$1200", fill=(0, 0, 0), font=font_small)

draw.text((100, 320), "Mouse", fill=(0, 0, 0), font=font_small)
draw.text((380, 320), "50", fill=(0, 0, 0), font=font_small)
draw.text((600, 320), "$25", fill=(0, 0, 0), font=font_small)

img.save(img_path)
print(f"‚úÖ Created image: {img_path}")

# 5. Create a second image with different content
img2_path = data_dir / "multilingual_sample.png"
img2 = Image.new('RGB', (600, 300), color=(240, 248, 255))
draw2 = ImageDraw.Draw(img2)

draw2.text((50, 30), "Multilingual Text Sample", fill=(0, 0, 139), font=font_medium)
draw2.text((50, 80), "English: Hello, World!", fill=(0, 0, 0), font=font_small)
draw2.text((50, 110), "Spanish: ¬°Hola, Mundo!", fill=(0, 0, 0), font=font_small)
draw2.text((50, 140), "French: Bonjour, le Monde!", fill=(0, 0, 0), font=font_small)
draw2.text((50, 170), "German: Hallo, Welt!", fill=(0, 0, 0), font=font_small)

img2.save(img2_path)
print(f"‚úÖ Created multilingual image: {img2_path}")

# 6. Create a mock audio file placeholder (we'll note it's for ASR demos)
audio_placeholder = data_dir / "sample_speech.mp3"
# Note: Creating actual audio requires additional libraries
# For now, we'll just note where it should be
print(f"üìù Audio file placeholder: {audio_placeholder}")
print("   (For ASR examples, you would need a real audio file)")

# 7. Create HTML sample
html_path = data_dir / "sample_page.html"
html_content = """<!DOCTYPE html>
<html>
<head>
    <title>Sample Web Page</title>
</head>
<body>
    <h1>Welcome to Docling Demo</h1>
    <p>This is a sample HTML page that can be converted using Docling.</p>
    <h2>Features</h2>
    <ul>
        <li>Convert HTML to structured documents</li>
        <li>Preserve formatting and structure</li>
        <li>Extract text and metadata</li>
    </ul>
    <table border="1">
        <tr>
            <th>Feature</th>
            <th>Status</th>
        </tr>
        <tr>
            <td>Text Extraction</td>
            <td>‚úì Supported</td>
        </tr>
        <tr>
            <td>Image Processing</td>
            <td>‚úì Supported</td>
        </tr>
    </table>
</body>
</html>
"""
with open(html_path, "w", encoding='utf-8') as f:
    f.write(html_content)

print(f"‚úÖ Created HTML: {html_path}")

print("\n" + "="*60)
print("üéâ All mock data files created successfully!")
print("="*60)

‚úÖ Created data directory: c:\git-projects\personal\github.com\OPENSEARCH_INTERMEDIATE_TUTORIAL\7. BONUS_PROJECTS\chonkie_docling_langxtract\2.docling\docling_tutorial_data
‚úÖ Created PDF: docling_tutorial_data\sample_document.pdf
‚úÖ Created CSV: docling_tutorial_data\employees.csv
‚úÖ Created XML: docling_tutorial_data\library.xml
‚úÖ Created image: docling_tutorial_data\sample_text_image.png
‚úÖ Created multilingual image: docling_tutorial_data\multilingual_sample.png
üìù Audio file placeholder: docling_tutorial_data\sample_speech.mp3
   (For ASR examples, you would need a real audio file)
‚úÖ Created HTML: docling_tutorial_data\sample_page.html

üéâ All mock data files created successfully!


---
# üî∞ Part 1: Core Conversion

## 1. Minimal Conversion

The simplest way to use Docling - convert a document with default settings.

### üìñ Concept Overview

**What you'll learn:**
- How to perform a basic document conversion with default settings
- Understanding the `DocumentConverter` class
- Converting PDFs, images, and other formats to markdown
- Checking conversion status

**Key concepts:**
- `DocumentConverter()` - The main entry point for conversions
- `convert()` - Converts a single document
- `export_to_markdown()` - Exports the result as markdown text

This is the simplest way to get started with Docling!

In [21]:
from docling.document_converter import DocumentConverter

# Initialize the converter with default settings
converter = DocumentConverter()

# Convert the document
source = data_dir / "sample_document.pdf"
result = converter.convert(source)

# Display the result
print("Conversion Status:", result.status)
print("\n" + "="*60)
print("Markdown Output:")
print("="*60)
print(result.document.export_to_markdown())

2026-01-13 16:29:18,107 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2026-01-13 16:29:18,109 - INFO - Going to convert document batch...
2026-01-13 16:29:18,110 - INFO - Initializing pipeline for StandardPdfPipeline with options hash e15bc6f248154cc62f8db15ef18a8ab7
2026-01-13 16:29:18,110 - INFO - Accelerator device: 'cuda:0'
[32m[INFO] 2026-01-13 16:29:18,119 [RapidOCR] base.py:22: Using engine_name: onnxruntime[0m
[32m[INFO] 2026-01-13 16:29:18,124 [RapidOCR] download_file.py:60: File exists and is valid: C:\git-projects\personal\github.com\OPENSEARCH_INTERMEDIATE_TUTORIAL\7. BONUS_PROJECTS\chonkie_docling_langxtract\.venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx[0m
[32m[INFO] 2026-01-13 16:29:18,124 [RapidOCR] main.py:53: Using C:\git-projects\personal\github.com\OPENSEARCH_INTERMEDIATE_TUTORIAL\7. BONUS_PROJECTS\chonkie_docling_langxtract\.venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx[0m
[32m[INFO] 2026-01-13 16:29:18,207 [Ra

Conversion Status: ConversionStatus.SUCCESS

Markdown Output:
## Docling Tutorial Sample Document

This is a sample PDF created for demonstrating Docling's capabilities.

It contains multiple pages with text, tables, and images.

## Sample Table:

Employee ID | Name | Department | Salary 001 | John Doe | Engineering | $75,000 002 | Jane Smith | Marketing | $65,000 003 | Bob Johnson | Sales | $70,000

## Page 2: Additional Content

This page contains more text for testing multi-page conversion.

Docling can extract text from complex layouts efficiently.


## 2. Custom Convert

Configure specific options for the conversion pipeline, such as enabling OCR, table structure recognition, and more.

### üìñ Concept Overview

**What you'll learn:**
- Customizing pipeline options for better control
- Enabling OCR (Optical Character Recognition)
- Configuring table structure recognition
- Fine-tuning conversion behavior

**Key concepts:**
- `PdfPipelineOptions` - Configuration for PDF processing
- `do_ocr` - Enable/disable OCR
- `do_table_structure` - Enable table detection
- `TableStructureOptions` - Configure table extraction accuracy

Use custom configuration when you need more control over how documents are processed.

In [20]:
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableStructureOptions

# Configure pipeline options
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options = TableStructureOptions(
    do_cell_matching=True,
    mode="accurate"  # Options: "fast", "accurate"
)

# Create converter with custom configuration
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

# Convert with custom settings
result = doc_converter.convert(data_dir / "sample_document.pdf")

print(f"‚úÖ Conversion Status: {result.status}")
print(f"üìä OCR Enabled: {pipeline_options.do_ocr}")
print(f"üìã Table Structure Recognition: {pipeline_options.do_table_structure}")
print("\nDocument excerpt:")
print(result.document.export_to_markdown()[:500] + "...")

2026-01-13 16:27:55,536 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2026-01-13 16:27:55,541 - INFO - Going to convert document batch...
2026-01-13 16:27:55,541 - INFO - Initializing pipeline for StandardPdfPipeline with options hash e15bc6f248154cc62f8db15ef18a8ab7
2026-01-13 16:27:55,541 - INFO - Accelerator device: 'cuda:0'
[32m[INFO] 2026-01-13 16:27:55,554 [RapidOCR] base.py:22: Using engine_name: onnxruntime[0m
[32m[INFO] 2026-01-13 16:27:55,558 [RapidOCR] download_file.py:60: File exists and is valid: C:\git-projects\personal\github.com\OPENSEARCH_INTERMEDIATE_TUTORIAL\7. BONUS_PROJECTS\chonkie_docling_langxtract\.venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx[0m
[32m[INFO] 2026-01-13 16:27:55,559 [RapidOCR] main.py:53: Using C:\git-projects\personal\github.com\OPENSEARCH_INTERMEDIATE_TUTORIAL\7. BONUS_PROJECTS\chonkie_docling_langxtract\.venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx[0m
[32m[INFO] 2026-01-13 16:27:55,622 [Ra

‚úÖ Conversion Status: ConversionStatus.SUCCESS
üìä OCR Enabled: True
üìã Table Structure Recognition: True

Document excerpt:
## Docling Tutorial Sample Document

This is a sample PDF created for demonstrating Docling's capabilities.

It contains multiple pages with text, tables, and images.

## Sample Table:

Employee ID | Name | Department | Salary 001 | John Doe | Engineering | $75,000 002 | Jane Smith | Marketing | $65,000 003 | Bob Johnson | Sales | $70,000

## Page 2: Additional Content

This page contains more text for testing multi-page conversion.

Docling can extract text from complex layouts efficiently....


## 3. Batch Convert

Process multiple documents efficiently in a single operation.

### üìñ Concept Overview

**What you'll learn:**
- Converting multiple documents efficiently
- Using `convert_all()` for batch processing
- Handling different file formats in one operation
- Processing conversion results

**Key concepts:**
- `convert_all()` - Batch conversion method
- Iterating over results
- Status checking for each document

Batch processing is essential for production workloads with many documents.

In [19]:
from docling.document_converter import DocumentConverter

# List of documents to convert
input_sources = [
    data_dir / "sample_document.pdf",
    data_dir / "sample_text_image.png",
    data_dir / "sample_page.html"
]

converter = DocumentConverter()

# Convert all documents
results = converter.convert_all(input_sources)

# Process results
print("Batch Conversion Results:")
print("="*60)

for result in results:
    print(f"\nüìÑ File: {result.input.file.name}")
    print(f"   Status: {result.status}")
    if result.status.name == 'SUCCESS':
        doc_preview = result.document.export_to_markdown()[:150]
        print(f"   Preview: {doc_preview}...")
    print("-"*60)

2026-01-13 16:27:33,270 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2026-01-13 16:27:33,272 - INFO - Going to convert document batch...
2026-01-13 16:27:33,272 - INFO - Initializing pipeline for StandardPdfPipeline with options hash e15bc6f248154cc62f8db15ef18a8ab7
2026-01-13 16:27:33,274 - INFO - Accelerator device: 'cuda:0'
[32m[INFO] 2026-01-13 16:27:33,285 [RapidOCR] base.py:22: Using engine_name: onnxruntime[0m
[32m[INFO] 2026-01-13 16:27:33,288 [RapidOCR] download_file.py:60: File exists and is valid: C:\git-projects\personal\github.com\OPENSEARCH_INTERMEDIATE_TUTORIAL\7. BONUS_PROJECTS\chonkie_docling_langxtract\.venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx[0m
[32m[INFO] 2026-01-13 16:27:33,289 [RapidOCR] main.py:53: Using C:\git-projects\personal\github.com\OPENSEARCH_INTERMEDIATE_TUTORIAL\7. BONUS_PROJECTS\chonkie_docling_langxtract\.venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx[0m
[32m[INFO] 2026-01-13 16:27:33,351 [Ra

Batch Conversion Results:


2026-01-13 16:27:33,510 - INFO - Auto OCR model selected rapidocr with onnxruntime.
2026-01-13 16:27:33,510 - INFO - Accelerator device: 'cuda:0'
2026-01-13 16:27:33,945 - INFO - Accelerator device: 'cuda:0'
2026-01-13 16:27:34,392 - INFO - Processing document sample_document.pdf
2026-01-13 16:27:34,791 - INFO - Finished converting document sample_document.pdf in 1.51 sec.
2026-01-13 16:27:34,793 - INFO - detected formats: [<InputFormat.IMAGE: 'image'>]
2026-01-13 16:27:34,796 - INFO - Going to convert document batch...
2026-01-13 16:27:34,796 - INFO - Processing document sample_text_image.png



üìÑ File: sample_document.pdf
   Status: ConversionStatus.SUCCESS
   Preview: ## Docling Tutorial Sample Document

This is a sample PDF created for demonstrating Docling's capabilities.

It contains multiple pages with text, tab...
------------------------------------------------------------


2026-01-13 16:27:37,751 - INFO - Finished converting document sample_text_image.png in 2.95 sec.
2026-01-13 16:27:37,753 - INFO - detected formats: [<InputFormat.HTML: 'html'>]
2026-01-13 16:27:37,755 - INFO - Going to convert document batch...
2026-01-13 16:27:37,756 - INFO - Initializing pipeline for SimplePipeline with options hash 995a146ad601044538e6a923bea22f4e
2026-01-13 16:27:37,756 - INFO - Processing document sample_page.html
2026-01-13 16:27:37,757 - INFO - Finished converting document sample_page.html in 0.01 sec.



üìÑ File: sample_text_image.png
   Status: ConversionStatus.SUCCESS
   Preview: ## Docling OCR Test Image

This image contains text that will be extracted using Optical Character Recognition (OCR).

## Sample Table:

| Product   |...
------------------------------------------------------------

üìÑ File: sample_page.html
   Status: ConversionStatus.SUCCESS
   Preview: # Welcome to Docling Demo

This is a sample HTML page that can be converted using Docling.

## Features

- Convert HTML to structured documents
- Pres...
------------------------------------------------------------


## 4. Multi-Format Support

Docling supports various input formats: PDF, DOCX, PPTX, images, HTML, and more. You can control which formats to allow.

### üìñ Concept Overview

**What you'll learn:**
- Understanding all supported input formats
- Restricting allowed formats for security/performance
- Format-specific handling
- Working with PDF, DOCX, images, HTML, and more

**Key concepts:**
- `InputFormat` enum - All available formats
- `allowed_formats` - Whitelist specific formats
- Format detection and handling

Docling supports 10+ document formats out of the box!

In [None]:
from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat

# Show all available formats
print("Available Input Formats:")
print("="*60)
for fmt in InputFormat:
    print(f"  ‚Ä¢ {fmt.name}: {fmt.value}")

print("\n" + "="*60)

# Create converter with specific allowed formats
converter = DocumentConverter(
    allowed_formats=[
        InputFormat.PDF,
        InputFormat.IMAGE,
        InputFormat.HTML,
        InputFormat.DOCX
    ]
)

# Try converting different format types
test_files = [
    data_dir / "sample_document.pdf",
    data_dir / "sample_text_image.png",
    data_dir / "sample_page.html"
]

print("\nConverting multiple formats:")
print("="*60)

for file in test_files:
    if file.exists():
        try:
            result = converter.convert(file)
            print(f"‚úÖ {file.suffix.upper()}: {file.name} - {result.status}")
        except Exception as e:
            print(f"‚ùå {file.name}: {str(e)[:50]}...")

2026-01-13 15:55:37,162 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2026-01-13 15:55:37,349 - INFO - Going to convert document batch...


Available Input Formats:
  ‚Ä¢ DOCX: docx
  ‚Ä¢ PPTX: pptx
  ‚Ä¢ HTML: html
  ‚Ä¢ IMAGE: image
  ‚Ä¢ PDF: pdf
  ‚Ä¢ ASCIIDOC: asciidoc
  ‚Ä¢ MD: md
  ‚Ä¢ CSV: csv
  ‚Ä¢ XLSX: xlsx
  ‚Ä¢ XML_USPTO: xml_uspto
  ‚Ä¢ XML_JATS: xml_jats
  ‚Ä¢ METS_GBS: mets_gbs
  ‚Ä¢ JSON_DOCLING: json_docling
  ‚Ä¢ AUDIO: audio
  ‚Ä¢ VTT: vtt


Converting multiple formats:


2026-01-13 15:55:37,351 - INFO - Initializing pipeline for StandardPdfPipeline with options hash e15bc6f248154cc62f8db15ef18a8ab7
2026-01-13 15:55:37,370 - INFO - Loading plugin 'docling_defaults'
2026-01-13 15:55:37,377 - INFO - Registered picture descriptions: ['vlm', 'api']
2026-01-13 15:55:37,395 - INFO - Loading plugin 'docling_defaults'
2026-01-13 15:55:37,406 - INFO - Registered ocr engines: ['auto', 'easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']
2026-01-13 15:55:38,127 - INFO - Accelerator device: 'cuda:0'
[32m[INFO] 2026-01-13 15:55:38,144 [RapidOCR] base.py:22: Using engine_name: onnxruntime[0m
[32m[INFO] 2026-01-13 15:55:38,161 [RapidOCR] download_file.py:60: File exists and is valid: C:\git-projects\personal\github.com\OPENSEARCH_INTERMEDIATE_TUTORIAL\7. BONUS_PROJECTS\chonkie_docling_langxtract\.venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx[0m
[32m[INFO] 2026-01-13 15:55:38,161 [RapidOCR] main.py:53: Using C:\git-projects\personal\githu

‚úÖ .PDF: sample_document.pdf - ConversionStatus.SUCCESS


2026-01-13 15:55:48,502 - INFO - Finished converting document sample_text_image.png in 2.89 sec.
2026-01-13 15:55:48,502 - INFO - detected formats: [<InputFormat.HTML: 'html'>]
2026-01-13 15:55:48,508 - INFO - Going to convert document batch...
2026-01-13 15:55:48,508 - INFO - Initializing pipeline for SimplePipeline with options hash 995a146ad601044538e6a923bea22f4e
2026-01-13 15:55:48,508 - INFO - Processing document sample_page.html
2026-01-13 15:55:48,508 - INFO - Finished converting document sample_page.html in 0.01 sec.


‚úÖ .PNG: sample_text_image.png - ConversionStatus.SUCCESS
‚úÖ .HTML: sample_page.html - ConversionStatus.SUCCESS


### Allow external plugins
- e.g. surya_ocr

In [3]:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# 1. Initialize pipeline options and enable external plugins
pipeline_options = PdfPipelineOptions()
pipeline_options.allow_external_plugins = True  # Required for 3rd-party modules

# 2. (Optional) Configure specific options from your plugin
# pipeline_options.ocr_options = YourCustomPluginOptions() 

# 3. Setup the converter with these options
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

# 4. Use the converter as normal
source = data_dir / "sample_document.pdf"
result = doc_converter.convert(source)
print(result.document.export_to_markdown())


2026-01-13 16:01:37,936 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2026-01-13 16:01:37,940 - INFO - Going to convert document batch...
2026-01-13 16:01:37,940 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 19290a5a28cef23fbe50840b45d241ba
2026-01-13 16:01:37,957 - INFO - Loading plugin 'docling_defaults'
2026-01-13 16:01:37,957 - INFO - Registered picture descriptions: ['vlm', 'api']
2026-01-13 16:01:37,970 - INFO - Loading plugin 'docling_defaults'
2026-01-13 16:01:37,970 - INFO - Loading plugin 'surya-ocr'
2026-01-13 16:01:37,971 - INFO - Registered ocr engines: ['auto', 'easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract', 'suryaocr']
2026-01-13 16:01:37,971 - INFO - Accelerator device: 'cuda:0'
[32m[INFO] 2026-01-13 16:01:37,989 [RapidOCR] base.py:22: Using engine_name: onnxruntime[0m
[32m[INFO] 2026-01-13 16:01:37,993 [RapidOCR] download_file.py:60: File exists and is valid: C:\git-projects\personal\github.com\OPENSEARCH_INTERMEDIATE_TU

## Docling Tutorial Sample Document

This is a sample PDF created for demonstrating Docling's capabilities.

It contains multiple pages with text, tables, and images.

## Sample Table:

Employee ID | Name | Department | Salary 001 | John Doe | Engineering | $75,000 002 | Jane Smith | Marketing | $65,000 003 | Bob Johnson | Sales | $70,000

## Page 2: Additional Content

This page contains more text for testing multi-page conversion.

Docling can extract text from complex layouts efficiently.


---
# üíæ Part 2: Backends

## 5. CSV Backend

Docling can process CSV files and convert them to structured documents.

## 6. XML Backend (RAG Ready)

Convert XML documents and make them ready for RAG (Retrieval-Augmented Generation) applications.

### üìñ Concept Overview

**What you'll learn:**
- Parsing XML documents
- Preparing documents for RAG applications
- Chunking content for vector databases
- Exporting structured data

**Key concepts:**
- XML document processing
- `iterate_items()` - Walk through document structure
- `export_to_dict()` - Get structured representation
- RAG (Retrieval-Augmented Generation) preparation

XML backend helps prepare documents for AI/ML pipelines and search systems.

In [None]:
from docling.document_converter import DocumentConverter

# Convert XML file
converter = DocumentConverter()
xml_source = data_dir / "library.xml"

result = converter.convert(xml_source)

print("XML Conversion Result:")
print("="*60)
markdown_output = result.document.export_to_markdown()
print(markdown_output)

# For RAG applications, you can chunk the content
print("\n" + "="*60)
print("Document Structure for RAG:")
print("="*60)

# Iterate through document items
for item, level in result.document.iterate_items():
    if hasattr(item, 'text') and item.text:
        print(f"Level {level}: {item.text[:100]}...")
        
# Export to dict format (useful for vector databases)
print("\n" + "="*60)
print("Exportable Dict Format:")
print("="*60)
doc_dict = result.document.export_to_dict()
print(f"Keys: {list(doc_dict.keys())}")
print(f"Document name: {doc_dict.get('name', 'N/A')}")

---
# ü§ñ Part 3: Advanced Pipelines

## 7. Minimal VLM Pipeline

Use Vision Language Models (VLMs) for advanced document understanding, especially useful for complex layouts.

## 8. Compare VLM Models

Compare different VLM models to find the best one for your use case.

In [18]:
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions

try:
    from docling.pipeline.vlm_pipeline import VlmPipeline
    from docling import vlm_model_specs
    
    # Available VLM models
    print("Available VLM Model Specs:")
    print("="*60)
    
    vlm_models = [
        ("SMOLDOCLING_TRANSFORMERS", vlm_model_specs.SMOLDOCLING_TRANSFORMERS),
        ("GRANITEDOCLING_TRANSFORMERS", vlm_model_specs.GRANITEDOCLING_TRANSFORMERS),
    ]
    
    for model_name, model_spec in vlm_models:
        print(f"\nüì¶ {model_name}")
        print(f"   Type: {type(model_spec).__name__}")
    
    # Example: Use a specific model
    print("\n" + "="*60)
    print("Converting with SmolDocling Model:")
    print("="*60)
    
    pipeline_options = VlmPipelineOptions(
        vlm_options=vlm_model_specs.SMOLDOCLING_TRANSFORMERS
    )
    
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=VlmPipeline,
                pipeline_options=pipeline_options
            )
        }
    )
    
    result = converter.convert(data_dir / "sample_document.pdf")
    print(f"‚úÖ Conversion Status: {result.status}")
    print(f"\nFirst 300 chars:\n{result.document.export_to_markdown()[:300]}...")
    
except ImportError:
    print("‚ö†Ô∏è VLM features require: pip install docling[vlm]")
except Exception as e:
    print(f"‚ö†Ô∏è Model comparison requires model downloads: {e}")

‚ö†Ô∏è VLM features require: pip install docling[vlm]


## 9. VLM Pipeline with API Model

Use remote API models (like GPT-4V) instead of local models to reduce compute requirements.

In [17]:
from docling.datamodel.pipeline_options import VlmPipelineOptions

# Configuration example for API-based VLM
print("VLM API Configuration Example:")
print("="*60)

config_example = """
# To use an API-based VLM model:

from docling.datamodel.pipeline_options import VlmPipelineOptions, ApiVlmOptions

api_options = ApiVlmOptions(
    api_url="https://api.openai.com/v1/chat/completions",
    api_key="your-api-key-here",
    model_name="gpt-4-vision-preview"
)

pipeline_options = VlmPipelineOptions(
    vlm_options=api_options
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options
        )
    }
)

# This offloads processing to the API instead of local compute
result = converter.convert("document.pdf")
"""

print(config_example)

print("\nüí° Benefits of API-based VLM:")
print("   ‚Ä¢ No local GPU required")
print("   ‚Ä¢ Access to latest models")
print("   ‚Ä¢ Scalable processing")
print("   ‚Ä¢ Pay per use")

print("\n‚ö†Ô∏è Note: Requires API credentials and internet connection")

VLM API Configuration Example:

# To use an API-based VLM model:

from docling.datamodel.pipeline_options import VlmPipelineOptions, ApiVlmOptions

api_options = ApiVlmOptions(
    api_url="https://api.openai.com/v1/chat/completions",
    api_key="your-api-key-here",
    model_name="gpt-4-vision-preview"
)

pipeline_options = VlmPipelineOptions(
    vlm_options=api_options
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options
        )
    }
)

# This offloads processing to the API instead of local compute
result = converter.convert("document.pdf")


üí° Benefits of API-based VLM:
   ‚Ä¢ No local GPU required
   ‚Ä¢ Access to latest models
   ‚Ä¢ Scalable processing
   ‚Ä¢ Pay per use

‚ö†Ô∏è Note: Requires API credentials and internet connection


## 10. Minimal ASR Pipeline

Use Automatic Speech Recognition (ASR) to transcribe audio files.

In [4]:
from docling.document_converter import DocumentConverter, AudioFormatOption
from docling.datamodel.base_models import InputFormat

try:
    from docling.pipeline.asr_pipeline import AsrPipeline
    
    # Check if we have an audio file
    audio_file = data_dir / "sample_speech.mp3"
    
    if audio_file.exists() and audio_file.stat().st_size > 0:
        # Configure ASR pipeline
        converter = DocumentConverter(
            format_options={
                InputFormat.AUDIO: AudioFormatOption(pipeline_cls=AsrPipeline)
            }
        )
        
        # Convert audio to text
        result = converter.convert(audio_file)
        
        print("ASR Pipeline Result:")
        print("="*60)
        print(f"Status: {result.status}")
        print("\nTranscription:")
        print(result.document.export_to_markdown())
    else:
        print("üìù ASR Pipeline Configuration Example:")
        print("="*60)
        print("""
# To use ASR pipeline with a real audio file:

from docling.document_converter import DocumentConverter, AudioFormatOption
from docling.pipeline.asr_pipeline import AsrPipeline
from docling.datamodel.base_models import InputFormat

converter = DocumentConverter(
    format_options={
        InputFormat.AUDIO: AudioFormatOption(pipeline_cls=AsrPipeline)
    }
)

# Supported audio formats: MP3, WAV, M4A, FLAC, etc.
result = converter.convert("speech.mp3")
transcription = result.document.export_to_markdown()
        """)
        
        print("\nüí° ASR Features:")
        print("   ‚Ä¢ Automatic language detection")
        print("   ‚Ä¢ Speaker diarization (when available)")
        print("   ‚Ä¢ Timestamp support")
        print("   ‚Ä¢ Multiple audio format support")
        
except ImportError:
    print("‚ö†Ô∏è ASR pipeline requires: pip install docling[asr]")
    print("   Also requires ffmpeg to be installed on your system")

2026-01-12 16:04:22,860 - INFO - detected formats: [<InputFormat.AUDIO: 'audio'>]
2026-01-12 16:04:22,860 - INFO - Going to convert document batch...
2026-01-12 16:04:22,860 - INFO - Initializing pipeline for AsrPipeline with options hash f13ce7fb9c9ff942ac42023aa9fd5569
2026-01-12 16:04:22,860 - INFO - artifacts-path: None
2026-01-12 16:04:22,860 - INFO - accelerator_options: num_threads=4 device='auto' cuda_use_flash_attention2=False


‚ö†Ô∏è ASR pipeline requires: pip install docling[asr]
   Also requires ffmpeg to be installed on your system


---
# üì§ Part 4: Exporting Results

## 11. Export Figures

Extract and save figures/images from documents.

## 12. Export Tables

Extract tables and export them in various formats (DataFrame, HTML, CSV, Markdown).

### üìñ Concept Overview

**What you'll learn:**
- Extracting tables from documents
- Exporting tables as DataFrames, CSV, HTML, Markdown
- Table structure recognition
- Programmatic table manipulation

**Key concepts:**
- `document.tables` - Access detected tables
- `export_to_dataframe()` - Convert to pandas
- `export_to_html()` / `export_to_csv()` - Various formats

Table extraction is crucial for data analysis and structured information retrieval.

In [15]:
from docling.document_converter import DocumentConverter
import pandas as pd

# Convert document
converter = DocumentConverter()
result = converter.convert(data_dir / "sample_document.pdf")

print("Table Extraction Results:")
print("="*60)

# Create output directory for tables
tables_dir = data_dir / "extracted_tables"
tables_dir.mkdir(exist_ok=True)

# Extract tables
table_count = len(result.document.tables) if hasattr(result.document, 'tables') else 0
print(f"üìä Total tables found: {table_count}\n")

if table_count > 0:
    for i, table in enumerate(result.document.tables):
        print(f"\n{'='*60}")
        print(f"Table {i+1}:")
        print('='*60)
        
        # Export as DataFrame
        try:
            df = table.export_to_dataframe()
            print("\nüìã DataFrame Preview:")
            print(df)
            
            # Save as CSV
            csv_path = tables_dir / f"table_{i+1}.csv"
            df.to_csv(csv_path, index=False)
            print(f"\n‚úÖ Saved as CSV: {csv_path.name}")
            
        except Exception as e:
            print(f"‚ö†Ô∏è Could not export DataFrame: {e}")
        
        # Export as HTML
        try:
            html = table.export_to_html()
            html_path = tables_dir / f"table_{i+1}.html"
            with open(html_path, 'w', encoding='utf-8') as f:
                f.write(html)
            print(f"‚úÖ Saved as HTML: {html_path.name}")
            print(f"\nHTML Preview:\n{html[:200]}...")
        except Exception as e:
            print(f"‚ö†Ô∏è Could not export HTML: {e}")
        
        # Export as Markdown
        try:
            markdown = table.export_to_markdown()
            md_path = tables_dir / f"table_{i+1}.md"
            with open(md_path, 'w', encoding='utf-8') as f:
                f.write(markdown)
            print(f"‚úÖ Saved as Markdown: {md_path.name}")
        except Exception as e:
            print(f"‚ö†Ô∏è Could not export Markdown: {e}")

    print(f"\nüìÅ Tables saved to: {tables_dir.absolute()}")
else:
    print("No tables found in the document.")
    print("Tables can be extracted from PDFs, images with OCR, and structured documents.")

2026-01-13 16:22:02,196 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2026-01-13 16:22:02,199 - INFO - Going to convert document batch...
2026-01-13 16:22:02,200 - INFO - Initializing pipeline for StandardPdfPipeline with options hash e15bc6f248154cc62f8db15ef18a8ab7
2026-01-13 16:22:02,201 - INFO - Accelerator device: 'cuda:0'
[32m[INFO] 2026-01-13 16:22:02,217 [RapidOCR] base.py:22: Using engine_name: onnxruntime[0m
[32m[INFO] 2026-01-13 16:22:02,222 [RapidOCR] download_file.py:60: File exists and is valid: C:\git-projects\personal\github.com\OPENSEARCH_INTERMEDIATE_TUTORIAL\7. BONUS_PROJECTS\chonkie_docling_langxtract\.venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx[0m
[32m[INFO] 2026-01-13 16:22:02,223 [RapidOCR] main.py:53: Using C:\git-projects\personal\github.com\OPENSEARCH_INTERMEDIATE_TUTORIAL\7. BONUS_PROJECTS\chonkie_docling_langxtract\.venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx[0m
[32m[INFO] 2026-01-13 16:22:02,285 [Ra

Table Extraction Results:
üìä Total tables found: 0

No tables found in the document.
Tables can be extracted from PDFs, images with OCR, and structured documents.


## 13. Export Multimodal

Export documents with combined text, layout, and visual information for multimodal AI applications.

In [13]:
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat
import json

# Configure for multimodal export
pipeline_options = PdfPipelineOptions()
pipeline_options.generate_page_images = True
pipeline_options.generate_picture_images = True

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result = converter.convert(data_dir / "sample_document.pdf")

print("Multimodal Export:")
print("="*60)

# Export to dict (contains full structure)
doc_dict = result.document.export_to_dict()

print(f"üì¶ Document Structure Keys: {list(doc_dict.keys())[:10]}")
print(f"üìÑ Document Name: {doc_dict.get('name', 'N/A')}")

# Export metadata
if 'metadata' in doc_dict:
    print(f"\nüìã Metadata:")
    for key, value in list(doc_dict['metadata'].items())[:5]:
        print(f"   {key}: {value}")

# Page information
if 'pages' in doc_dict:
    print(f"\nüìñ Pages: {len(doc_dict['pages'])}")
    for page_num, page_data in list(doc_dict['pages'].items())[:2]:
        print(f"   Page {page_num}: {len(str(page_data))} chars of data")

# Save multimodal data as JSON
multimodal_path = data_dir / "multimodal_export.json"
with open(multimodal_path, 'w', encoding='utf-8') as f:
    json.dump(doc_dict, f, indent=2, default=str)

print(f"\n‚úÖ Multimodal data saved to: {multimodal_path.name}")

# Document for Parquet export (useful for ML pipelines)
print("\n" + "="*60)
print("Parquet Export for ML Pipelines:")
print("="*60)

try:
    import pandas as pd
    from docling_core.transforms.chunker import HybridChunker
    
    # Prepare data for Parquet
    chunks = []
    for item, level in result.document.iterate_items():
        if hasattr(item, 'text') and item.text:
            chunks.append({
                'text': item.text,
                'level': level,
                'type': type(item).__name__
            })
    
    if chunks:
        df = pd.DataFrame(chunks)
        parquet_path = data_dir / "document_chunks.parquet"
        df.to_parquet(parquet_path)
        print(f"‚úÖ Saved {len(chunks)} chunks to Parquet")
        print(f"üìÅ File: {parquet_path.name}")
        print(f"\nDataFrame preview:")
        print(df.head())
    else:
        print("No chunks to export")
        
except ImportError:
    print("‚ö†Ô∏è Parquet export requires: pip install pandas pyarrow")
except Exception as e:
    print(f"‚ö†Ô∏è Parquet export error: {e}")

2026-01-13 16:20:16,126 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2026-01-13 16:20:16,129 - INFO - Going to convert document batch...
2026-01-13 16:20:16,130 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 43d100e5a88a3c7f4833eb75adbe811f
2026-01-13 16:20:16,130 - INFO - Accelerator device: 'cuda:0'
[32m[INFO] 2026-01-13 16:20:16,141 [RapidOCR] base.py:22: Using engine_name: onnxruntime[0m
[32m[INFO] 2026-01-13 16:20:16,146 [RapidOCR] download_file.py:60: File exists and is valid: C:\git-projects\personal\github.com\OPENSEARCH_INTERMEDIATE_TUTORIAL\7. BONUS_PROJECTS\chonkie_docling_langxtract\.venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx[0m
[32m[INFO] 2026-01-13 16:20:16,147 [RapidOCR] main.py:53: Using C:\git-projects\personal\github.com\OPENSEARCH_INTERMEDIATE_TUTORIAL\7. BONUS_PROJECTS\chonkie_docling_langxtract\.venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx[0m
[32m[INFO] 2026-01-13 16:20:16,213 [Ra

Multimodal Export:
üì¶ Document Structure Keys: ['schema_name', 'version', 'name', 'origin', 'furniture', 'body', 'groups', 'texts', 'pictures', 'tables']
üìÑ Document Name: sample_document

üìñ Pages: 2
   Page 1: 52888 chars of data
   Page 2: 26784 chars of data

‚úÖ Multimodal data saved to: multimodal_export.json

Parquet Export for ML Pipelines:
‚úÖ Saved 8 chunks to Parquet
üìÅ File: document_chunks.parquet

DataFrame preview:
                                                text  level               type
0                   Docling Tutorial Sample Document      1  SectionHeaderItem
1  This is a sample PDF created for demonstrating...      1           TextItem
2  It contains multiple pages with text, tables, ...      1           TextItem
3                                      Sample Table:      1  SectionHeaderItem
4  Employee ID | Name | Department | Salary 001 |...      1           TextItem


---
# üëÅÔ∏è Part 5: Advanced OCR

## 14. Full Page OCR

Force full-page OCR instead of using native text extraction (useful for scanned documents).

## 15. Tesseract Language Detection

Use Tesseract OCR with automatic language detection or specify languages.

### üìñ Concept Overview

**What you'll learn:**
- Using Tesseract OCR engine
- Automatic language detection
- Specifying multiple languages
- Page segmentation modes

**Key concepts:**
- `TesseractCliOcrOptions` - Tesseract configuration
- `lang` parameter - Language specification
- `psm` - Page segmentation mode
- Language packs and installation

Tesseract is a mature, open-source OCR engine with 100+ language support.

In [None]:
from docling.document_converter import DocumentConverter, ImageFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.base_models import InputFormat

print("Tesseract Language Detection Configuration:")
print("="*60)

# Configuration examples
config_examples = """
# 1. Auto-detect language
from docling.datamodel.pipeline_options import TesseractCliOcrOptions

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = TesseractCliOcrOptions(
    lang=["auto"]  # Automatic language detection
)

# 2. Specify multiple languages
pipeline_options.ocr_options = TesseractCliOcrOptions(
    lang=["eng", "fra", "deu", "spa"]  # English, French, German, Spanish
)

# 3. Single language for better accuracy
pipeline_options.ocr_options = TesseractCliOcrOptions(
    lang=["eng"],  # English only
    psm=6  # Page segmentation mode (6 = uniform block of text)
)

converter = DocumentConverter(
    format_options={
        InputFormat.IMAGE: ImageFormatOption(pipeline_options=pipeline_options)
    }
)
"""

print(config_examples)

print("\nüí° Common Tesseract Language Codes:")
print("="*60)
languages = {
    'eng': 'English',
    'fra': 'French',
    'deu': 'German',
    'spa': 'Spanish',
    'ita': 'Italian',
    'por': 'Portuguese',
    'rus': 'Russian',
    'jpn': 'Japanese',
    'chi_sim': 'Chinese (Simplified)',
    'chi_tra': 'Chinese (Traditional)',
    'ara': 'Arabic',
    'hin': 'Hindi'
}

for code, name in languages.items():
    print(f"  {code:12} - {name}")

print("\n‚ö†Ô∏è Note: Tesseract requires:")
print("   1. System installation: sudo apt-get install tesseract-ocr")
print("   2. Language packs: sudo apt-get install tesseract-ocr-[lang]")
print("   3. Python package: pip install pytesseract")

# Test with multilingual sample
try:
    from docling.datamodel.pipeline_options import TesseractCliOcrOptions
    
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.ocr_options = TesseractCliOcrOptions(lang=["eng"])
    pipeline_options.ocr_options.tesseract_cmd='C://Program Files//Tesseract-OCR//tesseract.exe'  # Example for Windows; adjust as needed
    
    converter = DocumentConverter(
        format_options={
            InputFormat.IMAGE: ImageFormatOption(pipeline_options=pipeline_options)
        }
    )
    
    result = converter.convert(data_dir / "multilingual_sample.png") #data_dir / "sample_text_image.png" #data_dir / "multilingual_sample.png"
    print("\n‚úÖ Tesseract OCR Result:")
    print("="*60)
    print(result.document.export_to_markdown())
    
except ImportError:
    print("\n‚ö†Ô∏è Tesseract not available in this environment")
except Exception as e:
    print(f"\n‚ö†Ô∏è Tesseract error: {e}")

2026-01-13 16:18:04,604 - INFO - detected formats: [<InputFormat.IMAGE: 'image'>]
2026-01-13 16:18:04,606 - INFO - Going to convert document batch...
2026-01-13 16:18:04,606 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 533c34cfcb200dfad02d3e0d118afa57
2026-01-13 16:18:04,648 - INFO - command: C://Program Files//Tesseract-OCR//tesseract.exe --list-langs
2026-01-13 16:18:04,689 - INFO - Accelerator device: 'cuda:0'


Tesseract Language Detection Configuration:

# 1. Auto-detect language
from docling.datamodel.pipeline_options import TesseractCliOcrOptions

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = TesseractCliOcrOptions(
    lang=["auto"]  # Automatic language detection
)

# 2. Specify multiple languages
pipeline_options.ocr_options = TesseractCliOcrOptions(
    lang=["eng", "fra", "deu", "spa"]  # English, French, German, Spanish
)

# 3. Single language for better accuracy
pipeline_options.ocr_options = TesseractCliOcrOptions(
    lang=["eng"],  # English only
    psm=6  # Page segmentation mode (6 = uniform block of text)
)

converter = DocumentConverter(
    format_options={
        InputFormat.IMAGE: ImageFormatOption(pipeline_options=pipeline_options)
    }
)


üí° Common Tesseract Language Codes:
  eng          - English
  fra          - French
  deu          - German
  spa          - Spanish
  ita          - Italian
  por          

2026-01-13 16:18:05,184 - INFO - Accelerator device: 'cuda:0'
2026-01-13 16:18:05,629 - INFO - Processing document sample_text_image.png
2026-01-13 16:18:05,707 - INFO - command: C://Program Files//Tesseract-OCR//tesseract.exe --psm 0 -l osd C:\Users\PMACHA~1\AppData\Local\Temp\tmp4e6sloq9.png stdout
2026-01-13 16:18:06,054 - INFO - command: C://Program Files//Tesseract-OCR//tesseract.exe -l eng C:\Users\PMACHA~1\AppData\Local\Temp\tmp4e6sloq9.png stdout tsv
2026-01-13 16:18:06,602 - INFO - Finished converting document sample_text_image.png in 1.98 sec.



‚úÖ Tesseract OCR Result:
## Docling OCR Test Image

This image contains text that will be extracted using Optical Character Recognition (OCR).

## Sample Table:

| Laptop   |   15 | $1200   |
|----------|------|---------|
| Mouse    |   50 | $25     |


## 16. RapidOCR with Custom Models

Use RapidOCR with custom ONNX model paths for detection and recognition.

In [4]:
from docling.document_converter import DocumentConverter, ImageFormatOption
from docling.datamodel.base_models import InputFormat

print("RapidOCR Custom Models Configuration:")
print("="*60)

config_example = """
# RapidOCR allows using custom ONNX models
from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions

# Configure with custom model paths
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True

# Specify custom ONNX model paths
pipeline_options.ocr_options = RapidOcrOptions(
    det_model_path="/path/to/detection_model.onnx",
    rec_model_path="/path/to/recognition_model.onnx",
    cls_model_path="/path/to/classification_model.onnx"  # Optional
)

# Or use default RapidOCR models
pipeline_options.ocr_options = RapidOcrOptions()

converter = DocumentConverter(
    format_options={
        InputFormat.IMAGE: ImageFormatOption(pipeline_options=pipeline_options)
    }
)

result = converter.convert("image.png")
"""

print(config_example)

print("\nüí° RapidOCR Features:")
print("="*60)
print("  ‚úì Fast inference with ONNX Runtime")
print("  ‚úì Lightweight models")
print("  ‚úì CPU-friendly")
print("  ‚úì Support for custom trained models")
print("  ‚úì Good for production deployments")

print("\nüì¶ Model Types:")
print("  ‚Ä¢ Detection Model (det): Locates text regions")
print("  ‚Ä¢ Recognition Model (rec): Converts regions to text")
print("  ‚Ä¢ Classification Model (cls): Determines text orientation")

# Try with default RapidOCR
try:
    from docling.datamodel.pipeline_options import RapidOcrOptions
    
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.ocr_options = RapidOcrOptions()
    
    converter = DocumentConverter(
        format_options={
            InputFormat.IMAGE: ImageFormatOption(pipeline_options=pipeline_options)
        }
    )
    
    result = converter.convert(data_dir / "sample_text_image.png")
    print("\n‚úÖ RapidOCR Result:")
    print("="*60)
    print(result.document.export_to_markdown()[:400])
    
except ImportError:
    print("\n‚ö†Ô∏è RapidOCR requires: pip install rapidocr-onnxruntime")
except Exception as e:
    print(f"\n‚ö†Ô∏è RapidOCR error: {e}")

2026-01-12 16:12:45,410 - INFO - detected formats: [<InputFormat.IMAGE: 'image'>]
2026-01-12 16:12:45,410 - INFO - Going to convert document batch...
2026-01-12 16:12:45,415 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 332bedb9a442cccff51646a28356cc8a
2026-01-12 16:12:45,432 - INFO - Loading plugin 'docling_defaults'
2026-01-12 16:12:45,436 - INFO - Registered picture descriptions: ['vlm', 'api']
2026-01-12 16:12:45,446 - INFO - Loading plugin 'docling_defaults'
2026-01-12 16:12:45,454 - INFO - Registered ocr engines: ['auto', 'easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']


RapidOCR Custom Models Configuration:

# RapidOCR allows using custom ONNX models
from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions

# Configure with custom model paths
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True

# Specify custom ONNX model paths
pipeline_options.ocr_options = RapidOcrOptions(
    det_model_path="/path/to/detection_model.onnx",
    rec_model_path="/path/to/recognition_model.onnx",
    cls_model_path="/path/to/classification_model.onnx"  # Optional
)

# Or use default RapidOCR models
pipeline_options.ocr_options = RapidOcrOptions()

converter = DocumentConverter(
    format_options={
        InputFormat.IMAGE: ImageFormatOption(pipeline_options=pipeline_options)
    }
)

result = converter.convert("image.png")


üí° RapidOCR Features:
  ‚úì Fast inference with ONNX Runtime
  ‚úì Lightweight models
  ‚úì CPU-friendly
  ‚úì Support for custom trained models
  ‚úì Good for production deployments

üì¶ Model Type

2026-01-12 16:12:47,479 - INFO - Accelerator device: 'cuda:0'
[32m[INFO] 2026-01-12 16:12:47,488 [RapidOCR] base.py:22: Using engine_name: onnxruntime[0m
[32m[INFO] 2026-01-12 16:12:47,518 [RapidOCR] download_file.py:60: File exists and is valid: C:\git-projects\personal\github.com\OPENSEARCH_INTERMEDIATE_TUTORIAL\7. BONUS_PROJECTS\1. chunking\.venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx[0m
[32m[INFO] 2026-01-12 16:12:47,518 [RapidOCR] main.py:53: Using C:\git-projects\personal\github.com\OPENSEARCH_INTERMEDIATE_TUTORIAL\7. BONUS_PROJECTS\1. chunking\.venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx[0m
[32m[INFO] 2026-01-12 16:12:47,599 [RapidOCR] base.py:22: Using engine_name: onnxruntime[0m
[32m[INFO] 2026-01-12 16:12:47,600 [RapidOCR] download_file.py:60: File exists and is valid: C:\git-projects\personal\github.com\OPENSEARCH_INTERMEDIATE_TUTORIAL\7. BONUS_PROJECTS\1. chunking\.venv\Lib\site-packages\rapidocr\models\ch_ppocr_mobile_v2


‚úÖ RapidOCR Result:
## Docling OCR Test Image

This image contains text that will be extracted using Optical Character Recognition (OCR).

## Sample Table:

| Product   |   Quantity | Price   |
|-----------|------------|---------|
| Laptop    |         15 | $1200   |
| Mouse     |         50 | $25     |


## 17. SuryaOCR with Custom Models

Use SuryaOCR, a modern OCR engine with support for custom models.

In [4]:
from docling.document_converter import DocumentConverter, ImageFormatOption
from docling.datamodel.base_models import InputFormat

print("SuryaOCR Custom Models Configuration:")
print("="*60)

config_example = """
# SuryaOCR - Advanced OCR with custom model support
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling_surya.options import SuryaOcrOptions

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True

# Configure SuryaOCR
pipeline_options.ocr_options = SuryaOcrOptions(
    lang=["en"],  # Supported languages
    # Custom model paths (optional)
    det_model_path="/path/to/detection_model",
    rec_model_path="/path/to/recognition_model"
)

converter = DocumentConverter(
    format_options={
        InputFormat.IMAGE: ImageFormatOption(pipeline_options=pipeline_options)
    }
)

result = converter.convert("document.png")
"""

print(config_example)

print("\nüí° SuryaOCR Features:")
print("="*60)
print("  ‚úì Modern transformer-based architecture")
print("  ‚úì High accuracy on complex layouts")
print("  ‚úì Multilingual support")
print("  ‚úì Custom model fine-tuning")
print("  ‚úì Good for handwriting and difficult text")

print("\nüìã Supported Languages:")
print("  ‚Ä¢ English (en)")
print("  ‚Ä¢ Spanish (es)")
print("  ‚Ä¢ French (fr)")
print("  ‚Ä¢ German (de)")
print("  ‚Ä¢ Chinese (zh)")
print("  ‚Ä¢ And many more...")

# Try with SuryaOCR
try:
    from docling_surya import SuryaOcrOptions
    from docling.datamodel.pipeline_options import PdfPipelineOptions

    pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_model="suryaocr",           # Plugin engine name
    allow_external_plugins=True,     # Required for third-party plugins
    ocr_options=SuryaOcrOptions(
        lang=["en"],                 # OCR language(s)
        use_gpu=True,                # Optional: force GPU
    ),
)
    
    converter = DocumentConverter(
        format_options={
            InputFormat.IMAGE: ImageFormatOption(pipeline_options=pipeline_options)
        }
    )
    
    result = converter.convert(data_dir / "sample_text_image.png")
    print("\n‚úÖ SuryaOCR Result:")
    print("="*60)
    print(result.document.export_to_markdown()[:400])
    
except ImportError:
    print("\n‚ö†Ô∏è SuryaOCR requires: pip install docling-surya")
except Exception as e:
    print(f"\n‚ö†Ô∏è SuryaOCR error: {e}")

2026-01-13 16:02:28,467 - INFO - detected formats: [<InputFormat.IMAGE: 'image'>]


2026-01-13 16:02:28,471 - INFO - Going to convert document batch...
2026-01-13 16:02:28,472 - INFO - Initializing pipeline for StandardPdfPipeline with options hash c3e46052a277609cda7a92f3807bd9c4


SuryaOCR Custom Models Configuration:

# SuryaOCR - Advanced OCR with custom model support
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling_surya.options import SuryaOcrOptions

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True

# Configure SuryaOCR
pipeline_options.ocr_options = SuryaOcrOptions(
    lang=["en"],  # Supported languages
    # Custom model paths (optional)
    det_model_path="/path/to/detection_model",
    rec_model_path="/path/to/recognition_model"
)

converter = DocumentConverter(
    format_options={
        InputFormat.IMAGE: ImageFormatOption(pipeline_options=pipeline_options)
    }
)

result = converter.convert("document.png")


üí° SuryaOCR Features:
  ‚úì Modern transformer-based architecture
  ‚úì High accuracy on complex layouts
  ‚úì Multilingual support
  ‚úì Custom model fine-tuning
  ‚úì Good for handwriting and difficult text

üìã Supported Languages:
  ‚Ä¢ English (en)
  ‚Ä¢ Spanish (es)
  ‚Ä¢ French (f

Downloading manifest.json: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 262/262 [00:00<?, ?B/s] 
Downloading text_recognition model to C:\Users\pmacharla\.cache\docling\models\SuryaOcr\text_recognition/2025_09_23:   0%|          | 0/12 [00:00<?, ?it/s]

[A[A
[A


Downloading vocab_math.json: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 20.1k/20.1k [00:00<00:00, 3.07MB/s]
Downloading .gitattributes: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1.48k/1.48k [00:00<00:00, 227kB/s]
Downloading training_args.bin: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 7.45k/7.45k [00:00<00:00, 616kB/s]
Downloading special_tokens_map.json: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 278/278 [00:00<00:00, 36.1kB/s]cr\text_recognition/2025_09_23:   8%|‚ñä         | 1/12 [00:00<00:01,  7.87it/s]

[A

Downloading tokenizer_config.json: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 694/694 [00:00<00:00, 82.3kB/s]
Downloading specials_dict.json: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 43.5k/43.5k [00:00<00:00, 5.99MB/s]

[A


[A[A[A



Downloading spe


‚úÖ SuryaOCR Result:
## Docling OCR Test Image

This image contains text that will be extracted using Optical Character Recognition (OCR).

## Sample Table:

| Product   |   Quantity | Price   |
|-----------|------------|---------|
| Laptop    |         15 | $1200   |
| Mouse     |         50 | $25     |


---
# ‚ö° Part 6: Performance & Enhancement

## 18. Accelerator Options

Configure hardware acceleration for optimal performance (CPU, CUDA, MPS).

## 19. PII Obfuscation

Detect and obfuscate Personally Identifiable Information (PII) in documents.

### üìñ Concept Overview

**What you'll learn:**
- Detecting personally identifiable information (PII)
- Obfuscating sensitive data
- Regular expressions for pattern matching
- Using NER models for advanced detection

**Key concepts:**
- PII types (names, emails, SSN, credit cards)
- Pattern-based detection
- NER (Named Entity Recognition)
- Compliance requirements (GDPR, HIPAA, CCPA)

PII obfuscation is critical for protecting privacy and meeting compliance requirements.

In [6]:
from docling.document_converter import DocumentConverter
import re

print("PII Obfuscation:")
print("="*60)

# Convert document
converter = DocumentConverter()
result = converter.convert(data_dir / "employees.csv")

print("Original Document:")
print("-"*60)
original_text = result.document.export_to_markdown()
print(original_text)

# PII Detection and Obfuscation Functions
def obfuscate_emails(text):
    """Replace email addresses with [EMAIL]"""
    return re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b', '[EMAIL]', text)

def obfuscate_phone_numbers(text):
    """Replace phone numbers with [PHONE]"""
    patterns = [
        r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',  # US format
        r'\b\(\d{3}\)\s*\d{3}[-.]?\d{4}\b',  # (123) 456-7890
    ]
    for pattern in patterns:
        text = re.sub(pattern, '[PHONE]', text)
    return text

def obfuscate_ssn(text):
    """Replace SSN with [SSN]"""
    return re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', text)

def obfuscate_credit_cards(text):
    """Replace credit card numbers with [CREDIT_CARD]"""
    return re.sub(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b', '[CREDIT_CARD]', text)

def obfuscate_names(text, names_list):
    """Replace specific names with [NAME]"""
    for name in names_list:
        text = text.replace(name, '[NAME]')
    return text

# Apply obfuscation
names_to_obfuscate = ["Alice Johnson", "Bob Martinez", "Carol White", "David Brown", "Eva Green"]
salaries_pattern = r'\$\d{1,3}(,\d{3})*(\.\d{2})?'

obfuscated_text = original_text
obfuscated_text = obfuscate_emails(obfuscated_text)
obfuscated_text = obfuscate_phone_numbers(obfuscated_text)
obfuscated_text = obfuscate_ssn(obfuscated_text)
obfuscated_text = obfuscate_credit_cards(obfuscated_text)
obfuscated_text = obfuscate_names(obfuscated_text, names_to_obfuscate)
obfuscated_text = re.sub(salaries_pattern, '[SALARY]', obfuscated_text)

print("\n" + "="*60)
print("Obfuscated Document:")
print("-"*60)
print(obfuscated_text)

# Advanced PII detection with NER models
print("\n" + "="*60)
print("Advanced PII Detection with NER:")
print("="*60)

pii_detection_example = """
# Using GLiNER for PII detection
try:
    from gliner import GLiNER
    
    # Load GLiNER model for NER
    model = GLiNER.from_pretrained("urchade/gliner_base")
    
    # Define PII entities to detect
    labels = ["person", "email", "phone number", "social security number", 
              "credit card", "address", "organization"]
    
    # Detect entities
    entities = model.predict_entities(original_text, labels)
    
    # Obfuscate detected entities
    obfuscated = original_text
    for entity in sorted(entities, key=lambda x: x['start'], reverse=True):
        start, end = entity['start'], entity['end']
        entity_type = entity['label'].upper().replace(' ', '_')
        obfuscated = obfuscated[:start] + f'[{entity_type}]' + obfuscated[end:]
    
    print(obfuscated)
    
except ImportError:
    print("Install GLiNER: pip install gliner")
"""

print(pii_detection_example)

print("\nüí° PII Types Commonly Obfuscated:")
print("  ‚Ä¢ Names (PERSON)")
print("  ‚Ä¢ Email addresses (EMAIL)")
print("  ‚Ä¢ Phone numbers (PHONE)")
print("  ‚Ä¢ Social Security Numbers (SSN)")
print("  ‚Ä¢ Credit Card Numbers (CREDIT_CARD)")
print("  ‚Ä¢ Addresses (ADDRESS)")
print("  ‚Ä¢ Bank Account Numbers (ACCOUNT)")
print("  ‚Ä¢ Passport Numbers (PASSPORT)")
print("  ‚Ä¢ Medical Record Numbers (MRN)")

print("\n‚ö†Ô∏è Compliance:")
print("  ‚Ä¢ GDPR (Europe)")
print("  ‚Ä¢ HIPAA (Healthcare - USA)")
print("  ‚Ä¢ CCPA (California - USA)")
print("  ‚Ä¢ Use PII obfuscation before sharing documents")

2026-01-13 16:07:50,136 - INFO - detected formats: [<InputFormat.CSV: 'csv'>]
2026-01-13 16:07:50,136 - INFO - Going to convert document batch...
2026-01-13 16:07:50,139 - INFO - Initializing pipeline for SimplePipeline with options hash 995a146ad601044538e6a923bea22f4e
2026-01-13 16:07:50,140 - INFO - Processing document employees.csv
2026-01-13 16:07:50,141 - INFO - Parsing CSV with delimiter: ","
2026-01-13 16:07:50,141 - INFO - Detected 6 lines
2026-01-13 16:07:50,142 - INFO - Finished converting document employees.csv in 0.02 sec.


PII Obfuscation:
Original Document:
------------------------------------------------------------
| EmployeeID   | Name          | Department   | Email             |   Salary |
|--------------|---------------|--------------|-------------------|----------|
| E001         | Alice Johnson | Engineering  | alice@company.com |    95000 |
| E002         | Bob Martinez  | Product      | bob@company.com   |    87000 |
| E003         | Carol White   | Marketing    | carol@company.com |    72000 |
| E004         | David Brown   | Sales        | david@company.com |    68000 |
| E005         | Eva Green     | HR           | eva@company.com   |    65000 |

Obfuscated Document:
------------------------------------------------------------
| EmployeeID   | Name          | Department   | Email             |   Salary |
|--------------|---------------|--------------|-------------------|----------|
| E001         | [NAME] | Engineering  | [EMAIL] |    95000 |
| E002         | [NAME]  | Product      | [EMAI

## 20. Translation

Translate document content while preserving structure and formatting.

### üìñ Concept Overview

**What you'll learn:**
- Translating document content programmatically
- Preserving document structure during translation
- Using translation APIs (Google, DeepL, AWS, Azure)
- Maintaining formatting and layout

**Key concepts:**
- `iterate_items()` - Walk through document elements
- Translation service integration
- Structure preservation
- Quality checking with back-translation

Translation enables multilingual document processing while maintaining the original structure.

In [5]:
from docling.document_converter import DocumentConverter

print("Document Translation:")
print("="*60)

# Convert document
converter = DocumentConverter()
result = converter.convert(data_dir / "sample_document.pdf")

print("Original Document (English):")
print("-"*60)
original_text = result.document.export_to_markdown()
print(original_text[:500] + "...")

# Mock translation function (in production, use a real translation API)
def mock_translate(text, target_lang="es"):
    """
    Mock translation function
    In production, use services like:
    - Google Translate API
    - AWS Translate
    - Azure Translator
    - DeepL API
    """
    translations = {
        "en": {
            "Sample": {"es": "Muestra", "fr": "√âchantillon", "de": "Beispiel"},
            "Document": {"es": "Documento", "fr": "Document", "de": "Dokument"},
            "Table": {"es": "Tabla", "fr": "Tableau", "de": "Tabelle"},
            "Employee": {"es": "Empleado", "fr": "Employ√©", "de": "Mitarbeiter"},
            "Department": {"es": "Departamento", "fr": "D√©partement", "de": "Abteilung"},
            "This is a": {"es": "Este es un", "fr": "C'est un", "de": "Dies ist ein"},
        }
    }
    
    translated = text
    for eng_word, trans_dict in translations["en"].items():
        if target_lang in trans_dict:
            translated = translated.replace(eng_word, trans_dict[target_lang])
    
    return translated

# Translate to Spanish
print("\n" + "="*60)
print("Translated Document (Spanish):")
print("-"*60)
translated_text = mock_translate(original_text, target_lang="es")
print(translated_text[:500] + "...")

# Advanced translation with structure preservation
print("\n" + "="*60)
print("Translation with Structure Preservation:")
print("="*60)

translation_example = """
# Professional Translation Pipeline
from docling.document_converter import DocumentConverter

def translate_document(doc_result, target_lang="es", translation_service="google"):
    '''
    Translate document while preserving structure
    
    Args:
        doc_result: Docling conversion result
        target_lang: Target language code (ISO 639-1)
        translation_service: 'google', 'azure', 'deepl', 'aws'
    
    Returns:
        Translated document with preserved structure
    '''
    
    # Initialize translation client
    if translation_service == "google":
        from google.cloud import translate_v2
        translator = translate_v2.Client()
    elif translation_service == "deepl":
        import deepl
        translator = deepl.Translator("YOUR_API_KEY")
    
    # Iterate through document items
    for item, level in doc_result.document.iterate_items():
        if hasattr(item, 'text') and item.text:
            # Translate text while preserving formatting
            original = item.text
            
            # Call translation API
            if translation_service == "google":
                result = translator.translate(original, target_language=target_lang)
                item.text = result['translatedText']
            elif translation_service == "deepl":
                result = translator.translate_text(original, target_lang=target_lang.upper())
                item.text = result.text
    
    return doc_result

# Usage
converter = DocumentConverter()
result = converter.convert("document.pdf")
translated_result = translate_document(result, target_lang="es", translation_service="google")

# Export translated document
translated_md = translated_result.document.export_to_markdown()
"""

print(translation_example)

print("\nüìö Popular Translation Services:")
print("="*60)
print("""
1. Google Cloud Translation API
   - 100+ languages
   - High quality
   - pip install google-cloud-translate

2. DeepL API
   - 30+ languages
   - Very natural translations
   - pip install deepl

3. AWS Translate
   - 75+ languages
   - Scalable
   - boto3 library

4. Azure Translator
   - 100+ languages
   - Custom models available
   - pip install azure-ai-translation-text
""")

print("üí° Best Practices:")
print("  ‚úì Preserve document structure (tables, lists, headings)")
print("  ‚úì Handle special characters and formatting")
print("  ‚úì Batch translate for efficiency")
print("  ‚úì Cache translations to reduce API calls")
print("  ‚úì Maintain metadata (page numbers, sections)")
print("  ‚úì Quality check with back-translation")

2026-01-13 16:06:14,254 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2026-01-13 16:06:14,257 - INFO - Going to convert document batch...
2026-01-13 16:06:14,257 - INFO - Initializing pipeline for StandardPdfPipeline with options hash e15bc6f248154cc62f8db15ef18a8ab7
2026-01-13 16:06:14,258 - INFO - Accelerator device: 'cuda:0'
[32m[INFO] 2026-01-13 16:06:14,275 [RapidOCR] base.py:22: Using engine_name: onnxruntime[0m
[32m[INFO] 2026-01-13 16:06:14,278 [RapidOCR] download_file.py:60: File exists and is valid: C:\git-projects\personal\github.com\OPENSEARCH_INTERMEDIATE_TUTORIAL\7. BONUS_PROJECTS\chonkie_docling_langxtract\.venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx[0m
[32m[INFO] 2026-01-13 16:06:14,278 [RapidOCR] main.py:53: Using C:\git-projects\personal\github.com\OPENSEARCH_INTERMEDIATE_TUTORIAL\7. BONUS_PROJECTS\chonkie_docling_langxtract\.venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx[0m
[32m[INFO] 2026-01-13 16:06:14,354 [Ra

Document Translation:


2026-01-13 16:06:14,500 - INFO - Auto OCR model selected rapidocr with onnxruntime.
2026-01-13 16:06:14,507 - INFO - Accelerator device: 'cuda:0'
2026-01-13 16:06:15,062 - INFO - Accelerator device: 'cuda:0'
2026-01-13 16:06:15,505 - INFO - Processing document sample_document.pdf
2026-01-13 16:06:15,924 - INFO - Finished converting document sample_document.pdf in 1.67 sec.


Original Document (English):
------------------------------------------------------------
## Docling Tutorial Sample Document

This is a sample PDF created for demonstrating Docling's capabilities.

It contains multiple pages with text, tables, and images.

## Sample Table:

Employee ID | Name | Department | Salary 001 | John Doe | Engineering | $75,000 002 | Jane Smith | Marketing | $65,000 003 | Bob Johnson | Sales | $70,000

## Page 2: Additional Content

This page contains more text for testing multi-page conversion.

Docling can extract text from complex layouts efficiently....

Translated Document (Spanish):
------------------------------------------------------------
## Docling Tutorial Muestra Documento

Este es un sample PDF created for demonstrating Docling's capabilities.

It contains multiple pages with text, tables, and images.

## Muestra Tabla:

Empleado ID | Name | Departamento | Salary 001 | John Doe | Engineering | $75,000 002 | Jane Smith | Marketing | $65,000 003 | 

---
# üéì Conclusion

## Summary

Congratulations! You've completed the comprehensive Docling tutorial covering:

### ‚úÖ What We Covered

1. **Core Conversion** (4 topics)
   - Minimal conversion
   - Custom configuration
   - Batch processing
   - Multi-format support

2. **Backends** (2 topics)
   - CSV processing
   - XML for RAG applications

3. **Advanced Pipelines** (4 topics)
   - VLM (Vision Language Models)
   - VLM model comparison
   - API-based VLM
   - ASR (Automatic Speech Recognition)

4. **Exporting Results** (3 topics)
   - Figure extraction
   - Table export (CSV, HTML, Markdown)
   - Multimodal export

5. **Advanced OCR** (4 topics)
   - Full page OCR
   - Tesseract with language detection
   - RapidOCR with custom models
   - SuryaOCR with custom models

6. **Performance & Enhancement** (3 topics)
   - Hardware acceleration (CPU/GPU)
   - PII obfuscation
   - Document translation

### üöÄ Next Steps

1. **Explore Advanced Features**
   - Fine-tune VLM models for specific domains
   - Integrate with vector databases for RAG
   - Build production pipelines

2. **Performance Optimization**
   - Benchmark different configurations
   - Implement caching strategies
   - Scale with distributed processing

3. **Integration Projects**
   - Document search systems
   - Automated compliance checking
   - Multilingual document processing

### üìñ Resources

- **Documentation**: https://docling-project.github.io/docling/
- **GitHub**: https://github.com/docling-project/docling
- **Community**: Join discussions and contribute!

### üí¨ Feedback

This tutorial was designed to be comprehensive and hands-on. Each section included:
- ‚úì Executable code examples
- ‚úì Mock data generation
- ‚úì Best practices
- ‚úì Configuration examples

Happy document processing with Docling! üéâ