# Docling Information Extraction Tutorial

This notebook demonstrates **Docling's structured information extraction** from unstructured documents using templates and Pydantic models.

## Learning Path

```mermaid
%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#e3f2fd','primaryTextColor':'#0d47a1','primaryBorderColor':'#1976d2','lineColor':'#42a5f5','secondaryColor':'#fff3e0','tertiaryColor':'#f3e5f5','edgeLabelBackground':'#ffffff'}}}%%
graph TB
    subgraph Setup["🔧 Setup (Blue)"]
        A[Document Extractor]
        B[Input Formats]
    end
    
    subgraph Templates["📋 Templates (Orange)"]
        C[String Template]
        D[Dict Template]
        E[Basic Pydantic Model]
    end
    
    subgraph Advanced["🚀 Advanced Pydantic (Purple)"]
        F[Nested Models]
        G[Default Values]
        H[Field Examples]
        I[Optional Fields]
    end
    
    subgraph Validation["✅ Validation (Green)"]
        J[Model Validation]
        K[Type Conversion]
        L[Structured Output]
    end
    
    subgraph UseCases["💼 Use Cases (Pink)"]
        M[Invoice Extraction]
        N[Contact Information]
        O[Multi-Page Documents]
        P[Real-world Workflows]
    end
    
    A --> B --> C
    C --> D --> E
    E --> F --> G
    G --> H --> I
    I --> J --> K
    K --> L --> M
    M --> N --> O --> P
    
    style Setup fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    style Templates fill:#fff3e0,stroke:#f57c00,stroke-width:3px
    style Advanced fill:#f3e5f5,stroke:#8e24aa,stroke-width:3px
    style Validation fill:#e8f5e9,stroke:#388e3c,stroke-width:3px
    style UseCases fill:#fce4ec,stroke:#c2185b,stroke-width:3px
```

**Topics Covered:**
1. DocumentExtractor Setup & Configuration
2. String & Dictionary Templates
3. Basic Pydantic Models with Fields
4. Advanced Nested Pydantic Models
5. Default Values, Examples & Optional Fields
6. Model Validation & Type Conversion
7. Real-world Invoice & Contact Extraction
8. Multi-Page Document Processing

> **⚠️ Note:** The extraction API is currently experimental and may change without prior notice. Only PDF and image formats are supported.

# 📦 Setup & Installation

In [None]:
# # Install required packages
# import sys
# import subprocess

# packages = [
#     "docling[vlm]",  # Docling with VLM support for extraction
#     "pydantic",
#     "rich",
#     "reportlab",
#     "pillow"
# ]

# for package in packages:
#     try:
#         subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])
#         print(f"✓ {package} installed")
#     except subprocess.CalledProcessError:
#         print(f"✗ Failed to install {package}")

# 🗂️ Mock Data Generation

Generate a sample invoice PDF, a receipt image, and a business card image for testing information extraction.

In [1]:
# Generate mock invoice images and PDFs
from pathlib import Path
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
from PIL import Image, ImageDraw, ImageFont

# Create mock_data directory
mock_dir = Path("./mock_data_extraction")
mock_dir.mkdir(exist_ok=True)

# 1. Generate Invoice PDF
invoice_pdf_path = mock_dir / "sample_invoice.pdf"
c = canvas.Canvas(str(invoice_pdf_path), pagesize=letter)
width, height = letter

# Invoice Header
c.setFont("Helvetica-Bold", 24)
c.drawString(50, height - 50, "INVOICE")

# Invoice Details
c.setFont("Helvetica", 12)
invoice_details = [
    "Invoice No: INV-2024-001",
    "Date: January 13, 2026",
    "",
    "From:",
    "TechCorp Solutions",
    "123 Innovation Drive",
    "San Francisco, CA 94105",
    "Tax ID: 12-3456789",
    "",
    "To:",
    "Global Enterprises Inc.",
    "456 Business Plaza",
    "New York, NY 10001",
    "",
    "Description:",
    "Software Development Services - Q1 2026",
    "Total Hours: 160",
    "Rate: $150/hour",
    "",
    "Subtotal: $24,000.00",
    "Tax (8.5%): $2,040.00",
    "",
    "TOTAL: $26,040.00",
]

y_pos = height - 100
for line in invoice_details:
    if line.startswith("TOTAL:"):
        c.setFont("Helvetica-Bold", 14)
    else:
        c.setFont("Helvetica", 12)
    c.drawString(50, y_pos, line)
    y_pos -= 20

c.save()
print(f"✓ Created Invoice PDF: {invoice_pdf_path}")

# 2. Generate Receipt Image
receipt_img_path = mock_dir / "sample_receipt.png"
img = Image.new('RGB', (800, 1000), color='white')
draw = ImageDraw.Draw(img)

# Try Arial first; fall back to Pillow's built-in font if it is not installed
try:
    font = ImageFont.truetype("arial.ttf", 18)
    title_font = ImageFont.truetype("arial.ttf", 28)
    bold_font = ImageFont.truetype("arialbd.ttf", 22)
except OSError:
    font = ImageFont.load_default()
    title_font = ImageFont.load_default()
    bold_font = ImageFont.load_default()

# Receipt content
draw.text((250, 30), "RETAIL STORE", fill='black', font=title_font)
draw.text((200, 70), "123 Shopping Ave, Suite 100", fill='black', font=font)
draw.text((250, 100), "City, State 12345", fill='black', font=font)
draw.line([(50, 130), (750, 130)], fill='black', width=2)

y = 160
draw.text((50, y), "Receipt #: RCP-5678", fill='black', font=font)
y += 30
draw.text((50, y), "Date: 2026-01-13", fill='black', font=font)
y += 30
draw.text((50, y), "Cashier: Jane Smith", fill='black', font=font)
y += 50

draw.line([(50, y), (750, y)], fill='black', width=2)
y += 30

# Items
items = [
    ("Product A", "2", "$15.99", "$31.98"),
    ("Product B", "1", "$25.50", "$25.50"),
    ("Product C", "3", "$8.99", "$26.97"),
]

draw.text((50, y), "Item", fill='black', font=bold_font)
draw.text((400, y), "Qty", fill='black', font=bold_font)
draw.text((500, y), "Price", fill='black', font=bold_font)
draw.text((650, y), "Total", fill='black', font=bold_font)
y += 40

for item, qty, price, total in items:
    draw.text((50, y), item, fill='black', font=font)
    draw.text((400, y), qty, fill='black', font=font)
    draw.text((500, y), price, fill='black', font=font)
    draw.text((650, y), total, fill='black', font=font)
    y += 35

draw.line([(50, y), (750, y)], fill='black', width=2)
y += 30

# Totals
draw.text((400, y), "Subtotal:", fill='black', font=bold_font)
draw.text((650, y), "$84.45", fill='black', font=bold_font)
y += 35
draw.text((400, y), "Tax (7%):", fill='black', font=font)
draw.text((650, y), "$5.91", fill='black', font=font)
y += 35
draw.text((400, y), "TOTAL:", fill='black', font=bold_font)
draw.text((650, y), "$90.36", fill='black', font=bold_font)

y += 60
draw.line([(50, y), (750, y)], fill='black', width=2)
y += 30
draw.text((200, y), "Thank you for your business!", fill='black', font=font)

img.save(receipt_img_path)
print(f"✓ Created Receipt Image: {receipt_img_path}")

# 3. Generate Business Card Image
card_img_path = mock_dir / "business_card.png"
card = Image.new('RGB', (600, 350), color='lightblue')
draw_card = ImageDraw.Draw(card)

try:
    card_font = ImageFont.truetype("arial.ttf", 24)
    card_name_font = ImageFont.truetype("arialbd.ttf", 32)
except OSError:
    card_font = ImageFont.load_default()
    card_name_font = ImageFont.load_default()

# Business card content
draw_card.text((50, 50), "John Anderson", fill='black', font=card_name_font)
draw_card.text((50, 100), "Senior Solutions Architect", fill='darkblue', font=card_font)
draw_card.text((50, 150), "TechVision Corp", fill='black', font=card_font)
draw_card.text((50, 190), "john.anderson@techvision.com", fill='black', font=card_font)
draw_card.text((50, 230), "+1 (555) 123-4567", fill='black', font=card_font)
draw_card.text((50, 270), "789 Innovation Blvd, Seattle, WA 98101", fill='black', font=card_font)

card.save(card_img_path)
print(f"✓ Created Business Card: {card_img_path}")

print(f"\n✅ Mock extraction data created in: {mock_dir.absolute()}")
print(f"Files: {list(mock_dir.glob('*'))}")

✓ Created Invoice PDF: mock_data_extraction\sample_invoice.pdf
✓ Created Receipt Image: mock_data_extraction\sample_receipt.png
✓ Created Business Card: mock_data_extraction\business_card.png

✅ Mock extraction data created in: c:\git-projects\personal\github.com\OPENSEARCH_INTERMEDIATE_TUTORIAL\7. BONUS_PROJECTS\chonkie_docling_langxtract\2.docling\mock_data_extraction
Files: [WindowsPath('mock_data_extraction/business_card.png'), WindowsPath('mock_data_extraction/sample_invoice.pdf'), WindowsPath('mock_data_extraction/sample_receipt.png')]


---
# 1. Defining the Document Extractor

## Concept Overview
The `DocumentExtractor` is the main entry point for structured information extraction. It supports PDF and image formats and uses VLM (Vision Language Models) to extract structured data based on user-defined templates.

In [1]:
from docling.datamodel.base_models import InputFormat
from docling.document_extractor import DocumentExtractor
from pydantic import BaseModel, Field
from typing import Optional
from rich import print as rprint

# Initialize the document extractor
extractor = DocumentExtractor(
    allowed_formats=[InputFormat.IMAGE, InputFormat.PDF]
)

print("✓ DocumentExtractor initialized")
print(f"✓ Supported formats: {[InputFormat.IMAGE, InputFormat.PDF]}")
print("\n⚠️ Note: The extraction API is experimental and may change")

✓ DocumentExtractor initialized
✓ Supported formats: [<InputFormat.IMAGE: 'image'>, <InputFormat.PDF: 'pdf'>]

⚠️ Note: The extraction API is experimental and may change


---
# 2. String Template Extraction

## Concept Overview
String templates define the extraction schema as a JSON string. This is the simplest approach for quick extraction tasks. The format is `{"field_name": "type"}` where type can be `string`, `float`, `int`, etc.

In [3]:
# Extract using string template
file_path = "./mock_data_extraction/sample_invoice.pdf"

result = extractor.extract(
    source=file_path,
    template='{"invoice_no": "string", "total": "float", "tax_id": "string"}',
)

print("📋 String Template Extraction Results:\n")
rprint(result.pages)

# Access extracted data
if result.pages:
    extracted = result.pages[0].extracted_data
    print(f"\n✓ Extracted Invoice No: {extracted.get('invoice_no', 'N/A')}")
    print(f"✓ Extracted Total: ${extracted.get('total', 0):.2f}")
    print(f"✓ Extracted Tax ID: {extracted.get('tax_id', 'N/A')}")



📋 String Template Extraction Results:




✓ Extracted Invoice No: INV-2024-001
✓ Extracted Total: $26040.00
✓ Extracted Tax ID: 12-3456789

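Hand-writing JSON inside a Python string invites quoting mistakes. A minimal sketch, assuming only that the string template must be valid JSON as described above: build the schema as a dict and serialize it with `json.dumps`. The names `fields` and `template_str` are illustrative.

```python
import json

# Illustrative schema: field name -> type name, as in the string template above.
fields = {"invoice_no": "string", "total": "float", "tax_id": "string"}

# json.dumps guarantees valid JSON (double quotes, no trailing commas).
template_str = json.dumps(fields)
print(template_str)

# Round-trip check: the string parses back to the same mapping.
assert json.loads(template_str) == fields
```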

---
# 3. Dictionary Template Extraction

## Concept Overview
Dictionary templates provide the same functionality as string templates but with Python dictionary syntax. This is more Pythonic and easier to read/maintain than JSON strings.

In [None]:
# Extract using dict template
result = extractor.extract(
    source=file_path,
    template={
        "invoice_no": "string",
        "total": "float",
        "tax_id": "string",
        "date": "string",
    },
)

print("📋 Dictionary Template Extraction Results:\n")
rprint(result.pages)

# Access extracted data
if result.pages:
    extracted = result.pages[0].extracted_data
    print(f"\n✓ Invoice No: {extracted.get('invoice_no', 'N/A')}")
    print(f"✓ Total: ${extracted.get('total', 0):.2f}")
    print(f"✓ Tax ID: {extracted.get('tax_id', 'N/A')}")
    print(f"✓ Date: {extracted.get('date', 'N/A')}")

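The `:.2f` format spec used above raises an error if the amount comes back as text (for example `"$26,040.00"`) or as `None`. Whether that happens depends on the model and the document. The helper below is a hypothetical sketch (`to_amount` is not part of Docling) that normalizes such values before formatting.

```python
from typing import Any, Optional


def to_amount(value: Any) -> Optional[float]:
    """Best-effort conversion of an extracted value to float (hypothetical helper)."""
    if value is None or isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)
    # Strip currency symbols, thousands separators, and whitespace from strings.
    cleaned = str(value).replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


for raw in (26040.0, "$26,040.00", None, "n/a"):
    amount = to_amount(raw)
    label = "N/A" if amount is None else f"${amount:.2f}"
    print(f"{raw!r} -> {label}")
```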
---
# 4. Basic Pydantic Model Template

## Concept Overview
Pydantic models provide type safety, validation, and IDE support. They define structured schemas with default values, examples, and optional fields. This is the recommended approach for production use.

In [None]:
# Define a basic Pydantic model
class Invoice(BaseModel):
    invoice_no: str = Field(
        examples=["INV-123", "INV-456"]  # Provide examples, no default
    )
    total: float = Field(
        default=0.0,  # Provide default value
        examples=[100.0, 500.0]
    )
    tax_id: Optional[str] = Field(
        default=None,  # Optional field
        examples=["12-3456789"]
    )
    date: Optional[str] = Field(
        default=None,
        examples=["2024-01-01"]
    )

# Extract using Pydantic model
result = extractor.extract(
    source=file_path,
    template=Invoice,
)

print("📋 Pydantic Model Extraction Results:\n")
rprint(result.pages)

# Validate and load with Pydantic
if result.pages:
    invoice = Invoice.model_validate(result.pages[0].extracted_data)
    print(f"\n✅ Validated Invoice Object:")
    rprint(invoice)
    print(f"\n✓ Type-safe access: {invoice.invoice_no}")
    print(f"✓ Total: ${invoice.total:.2f}")

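To see what `model_validate` does without running an extraction, the sketch below feeds it a hand-written dictionary. `InvoiceSketch` is a trimmed stand-in for the `Invoice` model, and `raw` is made-up data; the behavior shown (numeric strings coerced to `float`, defaults applied, required fields enforced) is standard Pydantic v2.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class InvoiceSketch(BaseModel):
    # Same shape as the Invoice model above, redefined so this block runs on its own.
    invoice_no: str = Field(examples=["INV-123"])
    total: float = Field(default=0.0)
    tax_id: Optional[str] = Field(default=None)


# Hand-written stand-in for extracted_data; no extraction is performed here.
raw = {"invoice_no": "INV-2024-001", "total": "26040.00"}

invoice = InvoiceSketch.model_validate(raw)
print(type(invoice.total).__name__, invoice.total)  # numeric string coerced to float
print(invoice.tax_id)                               # missing optional field -> default

missing = []
try:
    InvoiceSketch.model_validate({"total": 10})
except ValidationError as exc:
    # invoice_no has no default, so leaving it out is an error.
    missing = [err["loc"][0] for err in exc.errors()]
print("missing:", missing)
```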
---
# 5. Pydantic Model with Instance Defaults

## Concept Overview
You can pass a Pydantic model **instance** as a template, overriding default values. This is useful when you have contextual information (e.g., a known invoice number) that should serve as a fallback if extraction fails.

In [None]:
# Extract with custom instance defaults
result = extractor.extract(
    source=file_path,
    template=Invoice(
        invoice_no="UNKNOWN-001",  # Fallback if not extracted
        total=0.0,
        tax_id="00-0000000",  # Default tax ID
        date="2026-01-01",
    ),
)

print("📋 Extraction with Instance Defaults:\n")
rprint(result.pages)

if result.pages:
    invoice = Invoice.model_validate(result.pages[0].extracted_data)
    print(f"\n✓ Invoice No: {invoice.invoice_no} (extracted or default)")
    print(f"✓ Tax ID: {invoice.tax_id} (extracted or default)")
    print(f"✓ Total: ${invoice.total:.2f}")

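This notebook does not show how the extractor combines instance values with extracted ones. If you want explicit control over fallbacks, one option is to do the merge yourself after extraction. The sketch below is an assumption-laden illustration: `merge_with_fallback` and `InvoiceSketch` are hypothetical names, and `extracted` is hand-written data.

```python
from typing import Any, Dict, Optional

from pydantic import BaseModel


class InvoiceSketch(BaseModel):
    invoice_no: str
    total: float = 0.0
    tax_id: Optional[str] = None


def merge_with_fallback(fallback: BaseModel, extracted: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay non-null extracted values on top of the fallback instance's values."""
    merged = fallback.model_dump()
    merged.update({k: v for k, v in extracted.items() if v is not None})
    return merged


fallback = InvoiceSketch(invoice_no="UNKNOWN-001", tax_id="00-0000000")
# Stand-in for extracted_data where the model could not find a tax ID.
extracted = {"invoice_no": "INV-2024-001", "total": 26040.0, "tax_id": None}

invoice = InvoiceSketch.model_validate(merge_with_fallback(fallback, extracted))
print(invoice)
```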
---
# 6. Advanced Nested Pydantic Models

## Concept Overview
Nested Pydantic models capture hierarchical data structures. For example, an invoice with sender and receiver contact information. This enables extraction of complex, structured data with relationships.

In [None]:
# Define nested models
class Contact(BaseModel):
    name: Optional[str] = Field(
        default=None,
        examples=["John Smith", "Jane Doe"]
    )
    address: str = Field(
        default="123 Main St",
        examples=["456 Elm St", "789 Oak Ave"]
    )
    postal_code: str = Field(
        default="12345",
        examples=["67890", "11111"]
    )
    city: str = Field(
        default="Anytown",
        examples=["Springfield", "Portland"]
    )
    country: Optional[str] = Field(
        default=None,
        examples=["USA", "Canada"]
    )

class ExtendedInvoice(BaseModel):
    invoice_no: str = Field(examples=["INV-123", "INV-456"])
    total: float = Field(default=0.0, examples=[100.0, 500.0])
    tax_amount: float = Field(default=0.0, examples=[10.0, 50.0])
    description: Optional[str] = Field(
        default=None,
        examples=["Software Development", "Consulting Services"]
    )
    sender: Contact = Field(
        default=Contact(),
        examples=[Contact()]
    )
    receiver: Contact = Field(
        default=Contact(),
        examples=[Contact()]
    )

# Extract with nested model
result = extractor.extract(
    source=file_path,
    template=ExtendedInvoice,
)

print("📋 Nested Pydantic Model Extraction Results:\n")
rprint(result.pages)

if result.pages:
    extended_invoice = ExtendedInvoice.model_validate(result.pages[0].extracted_data)
    print(f"\n✅ Validated Extended Invoice:")
    rprint(extended_invoice)

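Nested dictionaries are validated recursively into nested model objects. The sketch below shows this on hand-written data, using `default_factory` as one way to state that each invoice gets its own contact object. `ContactSketch` and `InvoiceSketch` are trimmed, hypothetical stand-ins for the models above.

```python
from typing import Optional

from pydantic import BaseModel, Field


class ContactSketch(BaseModel):
    name: Optional[str] = None
    city: str = "Anytown"


class InvoiceSketch(BaseModel):
    invoice_no: str
    # default_factory builds a fresh ContactSketch for each invoice.
    sender: ContactSketch = Field(default_factory=ContactSketch)
    receiver: ContactSketch = Field(default_factory=ContactSketch)


# Stand-in for extracted_data: nested dicts, receiver missing entirely.
raw = {
    "invoice_no": "INV-2024-001",
    "sender": {"name": "TechCorp Solutions", "city": "San Francisco"},
}

invoice = InvoiceSketch.model_validate(raw)
print(invoice.sender.name, "/", invoice.sender.city)
print(invoice.receiver.name, "/", invoice.receiver.city)  # defaults used
```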
---
# 7. Validating and Using Extracted Data

## Concept Overview
Once extracted, Pydantic validates the data and provides a type-safe object. You can access fields with IDE autocomplete and use the data in workflows without manual parsing or type checking.

In [None]:
# Extract and validate
result = extractor.extract(
    source=file_path,
    template=ExtendedInvoice,
)

if result.pages:
    # Validate with Pydantic
    invoice = ExtendedInvoice.model_validate(result.pages[0].extracted_data)
    
    print("✅ Data Validation & Usage:\n")
    
    # Type-safe access
    print(f"Invoice #{invoice.invoice_no}")
    print(f"Description: {invoice.description or 'N/A'}")
    print(f"Total: ${invoice.total:.2f}")
    print(f"Tax: ${invoice.tax_amount:.2f}")
    print()
    
    # Access nested data
    print(f"From: {invoice.sender.name or 'Unknown'}")
    print(f"      {invoice.sender.address}")
    print(f"      {invoice.sender.city}, {invoice.sender.postal_code}")
    print()
    
    print(f"To: {invoice.receiver.name or 'Unknown'}")
    print(f"    {invoice.receiver.address}")
    print(f"    {invoice.receiver.city}, {invoice.receiver.postal_code}")
    print()
    
    # Use in business logic
    formatted_message = (
        f"Invoice #{invoice.invoice_no} was sent by {invoice.sender.name or 'Unknown'} "
        f"to {invoice.receiver.name or 'Unknown'} at {invoice.receiver.address}. "
        f"Total amount: ${invoice.total:.2f}"
    )
    print(f"📧 Business Logic Output:\n{formatted_message}")

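Type validation does not catch a misread digit, so a cheap arithmetic cross-check is a useful piece of business logic. The sketch below assumes you also extract a subtotal (the `ExtendedInvoice` model above does not have one); `totals_consistent` is a hypothetical helper.

```python
from decimal import Decimal


def totals_consistent(subtotal: float, tax: float, total: float,
                      tolerance: str = "0.01") -> bool:
    """Return True when subtotal + tax matches total within the tolerance."""
    # Convert through str so binary float noise does not leak into the comparison.
    diff = Decimal(str(subtotal)) + Decimal(str(tax)) - Decimal(str(total))
    return abs(diff) <= Decimal(tolerance)


# Figures from the mock invoice generated earlier: $24,000.00 + $2,040.00 = $26,040.00.
print(totals_consistent(24000.00, 2040.00, 26040.00))
# A misread digit in the total is caught.
print(totals_consistent(24000.00, 2040.00, 20040.00))
```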
---
# 8. Extracting from Receipt Images

## Concept Overview
Extraction also works on images. Here we extract the receipt number, date, cashier, subtotal, tax, and total from a receipt image using a custom Pydantic model. The `Receipt` model below does not capture the individual line items.

In [None]:
# Define receipt model
class Receipt(BaseModel):
    receipt_no: str = Field(examples=["RCP-123", "RCP-456"])
    date: str = Field(examples=["2024-01-01"])
    cashier: Optional[str] = Field(default=None, examples=["John Doe"])
    subtotal: float = Field(default=0.0, examples=[50.0, 100.0])
    tax: float = Field(default=0.0, examples=[5.0, 10.0])
    total: float = Field(default=0.0, examples=[55.0, 110.0])

# Extract from receipt image
receipt_path = "./mock_data_extraction/sample_receipt.png"
result = extractor.extract(
    source=receipt_path,
    template=Receipt,
)

print("📋 Receipt Image Extraction Results:\n")
rprint(result.pages)

if result.pages:
    receipt = Receipt.model_validate(result.pages[0].extracted_data)
    print(f"\n✅ Validated Receipt:")
    print(f"Receipt #{receipt.receipt_no}")
    print(f"Date: {receipt.date}")
    print(f"Cashier: {receipt.cashier or 'N/A'}")
    print(f"Subtotal: ${receipt.subtotal:.2f}")
    print(f"Tax: ${receipt.tax:.2f}")
    print(f"Total: ${receipt.total:.2f}")

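If you also need the line items, a list of nested models is the natural Pydantic shape. The sketch below only validates hand-written data that mirrors the mock receipt; whether the extractor fills list fields reliably is not verified here. `LineItem` and `ReceiptWithItems` are hypothetical models.

```python
from typing import List

from pydantic import BaseModel, Field


class LineItem(BaseModel):
    # Hypothetical model; not part of the Receipt schema above.
    name: str
    qty: int = 1
    unit_price: float = 0.0
    line_total: float = 0.0


class ReceiptWithItems(BaseModel):
    receipt_no: str
    items: List[LineItem] = Field(default_factory=list)
    subtotal: float = 0.0


# Hand-written data mirroring the mock receipt image; no extraction is run here.
raw = {
    "receipt_no": "RCP-5678",
    "items": [
        {"name": "Product A", "qty": 2, "unit_price": 15.99, "line_total": 31.98},
        {"name": "Product B", "qty": 1, "unit_price": 25.50, "line_total": 25.50},
        {"name": "Product C", "qty": 3, "unit_price": 8.99, "line_total": 26.97},
    ],
    "subtotal": 84.45,
}

receipt = ReceiptWithItems.model_validate(raw)
# Round to cents before comparing so float noise does not cause a false mismatch.
items_sum = round(sum(item.line_total for item in receipt.items), 2)
print(f"{len(receipt.items)} items, sum of lines = {items_sum:.2f}")
print("matches subtotal:", items_sum == receipt.subtotal)
```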
---
# 9. Extracting Contact Information from Business Cards

## Concept Overview
Information extraction can parse business cards, extracting names, titles, companies, emails, phones, and addresses. This demonstrates extraction from visually-oriented documents.

In [None]:
# Define business card model
class BusinessCard(BaseModel):
    name: str = Field(examples=["John Smith", "Jane Anderson"])
    title: Optional[str] = Field(
        default=None,
        examples=["Software Engineer", "Senior Manager"]
    )
    company: Optional[str] = Field(
        default=None,
        examples=["TechCorp", "InnovateLabs"]
    )
    email: Optional[str] = Field(
        default=None,
        examples=["john@example.com", "jane@company.com"]
    )
    phone: Optional[str] = Field(
        default=None,
        examples=["+1-555-123-4567", "(555) 987-6543"]
    )
    address: Optional[str] = Field(
        default=None,
        examples=["123 Main St, City, State 12345"]
    )

# Extract from business card
card_path = "./mock_data_extraction/business_card.png"
result = extractor.extract(
    source=card_path,
    template=BusinessCard,
)

print("📋 Business Card Extraction Results:\n")
rprint(result.pages)

if result.pages:
    card = BusinessCard.model_validate(result.pages[0].extracted_data)
    print(f"\n✅ Validated Business Card:")
    print(f"Name: {card.name}")
    print(f"Title: {card.title or 'N/A'}")
    print(f"Company: {card.company or 'N/A'}")
    print(f"Email: {card.email or 'N/A'}")
    print(f"Phone: {card.phone or 'N/A'}")
    print(f"Address: {card.address or 'N/A'}")
    
    # Format for CRM import
    crm_entry = {
        "full_name": card.name,
        "job_title": card.title,
        "organization": card.company,
        "email_primary": card.email,
        "phone_mobile": card.phone,
        "address_full": card.address,
    }
    print(f"\n📊 CRM Format:")
    rprint(crm_entry)

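Before importing contacts into a CRM, it usually helps to normalize free-form fields. The helpers below are a hypothetical sketch using only the standard library: `normalize_phone` keeps digits and a leading `+`, and `looks_like_email` is a loose shape check rather than full address validation.

```python
import re
from typing import Optional


def normalize_phone(phone: Optional[str]) -> Optional[str]:
    """Keep a leading '+' and digits only; return None when no digits remain."""
    if not phone:
        return None
    digits = re.sub(r"\D", "", phone)
    if not digits:
        return None
    return ("+" if phone.strip().startswith("+") else "") + digits


def looks_like_email(email: Optional[str]) -> bool:
    """Loose shape check (something@something.tld); not full RFC validation."""
    return bool(email and re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email))


print(normalize_phone("+1 (555) 123-4567"))
print(looks_like_email("john.anderson@techvision.com"))
print(looks_like_email("john.anderson(at)techvision"))
```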
---
# 10. Multi-Page Document Extraction

## Concept Overview
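The extractor returns results organized by page. Each page can have different extracted data, allowing processing of multi-page documents like contracts, reports, or multi-page invoices. The cell below reuses the `result` from the previous section, which comes from a single-image source; the same loop applies unchanged to documents with more pages.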

In [None]:
# Demonstrate multi-page processing
print("📄 Multi-Page Document Handling:\n")

# The result.pages list contains ExtractedPageData for each page
if result.pages:
    print(f"Total pages processed: {len(result.pages)}")
    
    for page_data in result.pages:
        print(f"\n--- Page {page_data.page_no} ---")
        print(f"Extracted data: {page_data.extracted_data}")
        print(f"Raw text: {page_data.raw_text[:100]}..." if page_data.raw_text else "No raw text")
        print(f"Errors: {page_data.errors if page_data.errors else 'None'}")

# Example: Processing multi-page invoice
print("\n✓ For multi-page documents:")
print("  • Loop through result.pages")
print("  • Access page_data.page_no for page number")
print("  • Access page_data.extracted_data for structured data")
print("  • Check page_data.errors for extraction issues")

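When one logical record spans several pages, the per-page dictionaries have to be combined. One simple policy is "first non-null value wins". The sketch below uses `SimpleNamespace` objects as stand-ins for the page entries, exposing only `page_no` and `extracted_data`; `merge_pages` is a hypothetical helper.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable


def merge_pages(pages: Iterable[Any]) -> Dict[str, Any]:
    """Combine per-page dicts; the first non-null value for each key wins."""
    merged: Dict[str, Any] = {}
    for page in pages:
        for key, value in (page.extracted_data or {}).items():
            if value is not None and merged.get(key) is None:
                merged[key] = value
    return merged


# Stand-ins for result.pages entries, exposing only the attributes used above.
pages = [
    SimpleNamespace(page_no=1, extracted_data={"invoice_no": "INV-2024-001", "total": None}),
    SimpleNamespace(page_no=2, extracted_data={"invoice_no": None, "total": 26040.0}),
]

merged_record = merge_pages(pages)
print(merged_record)
```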
---
# 11. Error Handling and Edge Cases

## Concept Overview
Extraction may fail or return partial data. The `errors` field in `ExtractedPageData` contains any issues encountered. Always check for errors and handle missing data gracefully.

In [None]:
# Demonstrate error handling
print("⚠️ Error Handling Best Practices:\n")

# Example with potential missing fields
result = extractor.extract(
    source=file_path,
    template=ExtendedInvoice,
)

if result.pages:
    for page_data in result.pages:
        # Check for errors
        if page_data.errors:
            print(f"❌ Errors on page {page_data.page_no}:")
            for error in page_data.errors:
                print(f"   - {error}")
        else:
            print(f"✅ Page {page_data.page_no}: No errors")
        
        # Safe access with get(); fall back to an empty dict if nothing was extracted
        data = page_data.extracted_data or {}
        invoice_no = data.get('invoice_no', 'UNKNOWN')
        total = data.get('total', 0.0)
        
        print(f"   Invoice: {invoice_no}, Total: ${total:.2f}")
        
        # Validate with Pydantic (handles missing fields)
        try:
            invoice = ExtendedInvoice.model_validate(data)
            print(f"   ✓ Validation successful")
        except Exception as e:
            print(f"   ✗ Validation failed: {e}")

print("\n💡 Best Practices:")
print("  1. Always check result.pages is not empty")
print("  2. Check page_data.errors for extraction issues")
print("  3. Use Optional fields for data that may not exist")
print("  4. Use Field(default=...) for graceful fallbacks")
print("  5. Wrap model_validate() in try-except")

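Catching `ValidationError` specifically, rather than a bare `Exception`, gives access to a structured list of what failed. The sketch below runs on hand-written data with a trimmed, hypothetical `InvoiceSketch` model; `exc.errors()` is standard Pydantic v2.

```python
from pydantic import BaseModel, Field, ValidationError


class InvoiceSketch(BaseModel):
    invoice_no: str
    total: float = Field(default=0.0, ge=0.0)


# Stand-in for extracted_data with two problems: no invoice_no, negative total.
raw = {"total": -5}

problems = []
try:
    InvoiceSketch.model_validate(raw)
except ValidationError as exc:
    # exc.errors() returns one dict per failing field.
    for err in exc.errors():
        field = ".".join(str(part) for part in err["loc"])
        problems.append((field, err["type"]))

for field, kind in problems:
    print(f"{field}: {kind}")
```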
---
# 12. Complete Extraction Workflow

## Concept Overview
End-to-end workflow: define schema → extract → validate → use in business logic. This demonstrates the shape of a document-processing automation pipeline.

In [None]:
# Complete workflow demonstration
print("🔄 Complete Information Extraction Workflow\n")
print("=" * 80)

# Step 1: Define schema
print("\n1️⃣ Define Pydantic Schema")
class InvoiceWorkflow(BaseModel):
    invoice_no: str = Field(examples=["INV-123"])
    total: float = Field(default=0.0)
    tax_id: Optional[str] = Field(default=None)
    sender: Contact = Field(default=Contact())
    receiver: Contact = Field(default=Contact())
print("   ✓ Schema defined with nested models")

# Step 2: Extract
print("\n2️⃣ Extract from Document")
file_path = "./mock_data_extraction/sample_invoice.pdf"
result = extractor.extract(source=file_path, template=InvoiceWorkflow)
print(f"   ✓ Extracted {len(result.pages)} page(s)")

# Step 3: Validate
print("\n3️⃣ Validate Extracted Data")
invoice = None  # Stays None if there are no pages or validation fails
if result.pages:
    page_data = result.pages[0]
    if page_data.errors:
        print(f"   ⚠️ Found {len(page_data.errors)} error(s)")
    else:
        print(f"   ✓ No errors")
    
    try:
        invoice = InvoiceWorkflow.model_validate(page_data.extracted_data)
        print(f"   ✓ Pydantic validation passed")
    except Exception as e:
        print(f"   ✗ Validation failed: {e}")
        invoice = None

# Step 4: Business Logic
print("\n4️⃣ Apply Business Logic")
if invoice:
    # Example: Send email notification
    email_body = f"""
    New Invoice Received
    
    Invoice Number: {invoice.invoice_no}
    Total Amount: ${invoice.total:.2f}
    
    From: {invoice.sender.name or 'Unknown'}
          {invoice.sender.city}, {invoice.sender.country or 'N/A'}
    
    To: {invoice.receiver.name or 'Unknown'}
        {invoice.receiver.city}, {invoice.receiver.country or 'N/A'}
    
    Action Required: Review and approve
    """
    print(email_body)
    
    # Example: Database record
    db_record = {
        "invoice_id": invoice.invoice_no,
        "amount": invoice.total,
        "sender_name": invoice.sender.name,
        "receiver_name": invoice.receiver.name,
        "status": "pending_approval",
    }
    print(f"💾 Database Record:")
    rprint(db_record)

print("\n" + "=" * 80)
print("✅ Complete workflow executed successfully!")

---
# 13. Batch Processing Multiple Documents

## Concept Overview
Process multiple documents in a loop, collecting results for batch processing. This is useful for automation pipelines handling folders of invoices, receipts, or forms.

In [None]:
# Batch processing example
print("📚 Batch Document Processing\n")

# Define simple extraction model
class SimpleInvoice(BaseModel):
    invoice_no: str = Field(examples=["INV-123"])
    total: float = Field(default=0.0)
    date: Optional[str] = Field(default=None)

# List of documents to process
documents = [
    "./mock_data_extraction/sample_invoice.pdf",
    "./mock_data_extraction/sample_receipt.png",
]

# Process batch
results = []
for i, doc_path in enumerate(documents, 1):
    print(f"Processing document {i}/{len(documents)}: {Path(doc_path).name}")
    
    try:
        result = extractor.extract(source=doc_path, template=SimpleInvoice)
        
        if result.pages:
            data = result.pages[0].extracted_data
            invoice = SimpleInvoice.model_validate(data)
            results.append({
                "file": Path(doc_path).name,
                "invoice_no": invoice.invoice_no,
                "total": invoice.total,
                "status": "success"
            })
            print(f"  ✓ Extracted: {invoice.invoice_no}, Total: ${invoice.total:.2f}")
        else:
            results.append({
                "file": Path(doc_path).name,
                "status": "no_pages"
            })
            print(f"  ⚠️ No pages extracted")
    
    except Exception as e:
        results.append({
            "file": Path(doc_path).name,
            "status": "error",
            "error": str(e)
        })
        print(f"  ✗ Error: {e}")

# Summary
print(f"\n📊 Batch Processing Summary:")
print(f"Total documents: {len(documents)}")
print(f"Successful: {sum(1 for r in results if r.get('status') == 'success')}")
print(f"Failed: {sum(1 for r in results if r.get('status') != 'success')}")

print(f"\n📋 Results:")
rprint(results)

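For a real folder of documents, the hard-coded list can be replaced by a directory scan. The sketch below is a stdlib-only illustration; `collect_documents` is a hypothetical helper, and the suffix set is an assumption about which file extensions count as "PDF and image" inputs. It runs against a temporary folder so it does not depend on the mock data.

```python
import tempfile
from pathlib import Path
from typing import List

# Assumed set of extensions for PDF and image inputs; adjust to your files.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg"}


def collect_documents(folder: Path) -> List[Path]:
    """Return supported files in a folder, sorted for reproducible batch order."""
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )


# Demo on a throwaway folder so the sketch runs without the mock data.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ("b_receipt.PNG", "a_invoice.pdf", "notes.txt"):
        (folder / name).touch()
    found = [p.name for p in collect_documents(folder)]

print(found)
```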
---
# 14. Advanced Field Configuration

## Concept Overview
Pydantic `Field` settings shape both the template and the validation step. Use `examples` to guide the model, `default` for missing data, `description` for documentation, and constraints such as `pattern` and `ge` to reject malformed values at validation time.

In [None]:
from pydantic import field_validator, Field
from typing import List

# Advanced field configuration
class AdvancedInvoice(BaseModel):
    invoice_no: str = Field(
        description="Unique invoice identifier",
        examples=["INV-2024-001", "INV-2025-042"],  # Examples should match the pattern below
        pattern=r"^[A-Z]{3}-\d{4}-\d{3}$"  # Validation pattern
    )
    
    total: float = Field(
        description="Total amount in USD",
        default=0.0,
        ge=0.0,  # Greater than or equal to 0
        examples=[100.0, 1500.50]
    )
    
    date: str = Field(
        description="Invoice date in ISO format",
        examples=["2024-01-13", "2026-12-31"],
        pattern=r"^\d{4}-\d{2}-\d{2}$"
    )
    
    items: Optional[List[str]] = Field(
        default=None,
        description="List of line items",
        examples=[["Item 1", "Item 2"]]
    )
    
    # Custom validator
    @field_validator('total')
    @classmethod
    def validate_total(cls, v):
        if v < 0:
            raise ValueError('Total cannot be negative')
        if v > 1000000:
            raise ValueError('Total exceeds maximum allowed amount')
        return v

print("🔧 Advanced Field Configuration:\n")
print("✓ Pattern validation for invoice_no")
print("✓ Range validation for total (>= 0)")
print("✓ Date format validation")
print("✓ List fields for items")
print("✓ Custom validators for business rules")

# Example usage
print("\n📋 Schema:")
rprint(AdvancedInvoice.model_json_schema())

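The cell above only prints the schema. To see a `pattern` constraint in action, the sketch below validates a few candidate strings against the same regular expression. `PatternSketch` and `is_valid` are hypothetical names; strings that do not match the pattern are rejected with a `ValidationError`.

```python
from pydantic import BaseModel, Field, ValidationError


class PatternSketch(BaseModel):
    # Same pattern as AdvancedInvoice.invoice_no above.
    invoice_no: str = Field(pattern=r"^[A-Z]{3}-\d{4}-\d{3}$")


def is_valid(value: str) -> bool:
    """Return True when the value passes the pattern constraint."""
    try:
        PatternSketch(invoice_no=value)
        return True
    except ValidationError:
        return False


for candidate in ("INV-2024-001", "BILL-123", "inv-2024-001"):
    print(f"{candidate}: {is_valid(candidate)}")
```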
---
# 15. Real-world Use Cases

## Concept Overview
Information extraction enables automation of document-heavy workflows: invoice processing, receipt scanning, form data entry, contract parsing, and identity verification.

In [None]:
print("💼 Real-world Use Cases for Information Extraction\n")
print("=" * 80)

use_cases = [
    {
        "title": "📄 Invoice Processing",
        "description": "Extract invoice data, validate against POs, route for approval",
        "fields": ["invoice_no", "vendor", "amount", "line_items", "due_date"],
        "automation": "Auto-match PO, send to approver, update accounting system"
    },
    {
        "title": "🧾 Receipt Scanning",
        "description": "Extract expense data from receipts for reimbursement",
        "fields": ["merchant", "date", "total", "tax", "category"],
        "automation": "Create expense report, attach to employee record, submit for approval"
    },
    {
        "title": "📋 Form Data Entry",
        "description": "Extract data from paper/PDF forms into database",
        "fields": ["applicant_name", "address", "ssn", "income", "employer"],
        "automation": "Populate CRM/ERP, trigger workflows, send confirmation"
    },
    {
        "title": "📜 Contract Parsing",
        "description": "Extract key terms from legal contracts",
        "fields": ["parties", "effective_date", "termination_date", "value", "clauses"],
        "automation": "Alert on renewals, track obligations, compliance monitoring"
    },
    {
        "title": "🪪 Identity Verification",
        "description": "Extract data from IDs, passports, driver's licenses",
        "fields": ["name", "dob", "id_number", "expiry_date", "address"],
        "automation": "KYC verification, fraud detection, account creation"
    },
    {
        "title": "📊 Financial Statements",
        "description": "Extract financial data from reports and statements",
        "fields": ["revenue", "expenses", "net_income", "assets", "liabilities"],
        "automation": "Financial analysis, reporting, compliance checks"
    },
]

for i, uc in enumerate(use_cases, 1):
    print(f"\n{i}. {uc['title']}")
    print(f"   {uc['description']}")
    print(f"   Fields: {', '.join(uc['fields'])}")
    print(f"   Automation: {uc['automation']}")

print("\n" + "=" * 80)
print("\n✅ Benefits:")
print("  • ⚡ Much faster than manual data entry")
print("  • 🎯 High accuracy is achievable with well-designed templates (measure on your own documents)")
print("  • 💰 Significant cost savings on data entry labor")
print("  • 📈 Scales to large document volumes with batch processing")
print("  • 🔄 Integrates with existing business systems")
print("  • 🤖 Enables end-to-end process automation")

---
# 🎯 Summary & Key Takeaways

## Core Concepts Covered
1. **DocumentExtractor** - Main API for information extraction from PDFs and images
2. **Template Types** - String, dict, and Pydantic model templates
3. **Pydantic Models** - Type-safe schemas with validation and defaults
4. **Nested Models** - Complex hierarchical data structures
5. **Field Configuration** - Examples, defaults, validation, and constraints

## Pydantic Best Practices
6. **Optional Fields** - Use `Optional[Type]` for fields that may not exist
7. **Default Values** - Provide `Field(default=...)` for graceful fallbacks
8. **Examples** - Guide extraction with `Field(examples=[...])`
9. **Validation** - Use `@field_validator` for custom business rules
10. **Instance Templates** - Override defaults with model instances

## Production Patterns
11. **Error Handling** - Check `errors` field and wrap validation in try-except
12. **Multi-Page Processing** - Loop through `result.pages` for each page
13. **Batch Processing** - Process folders of documents in loops
14. **Business Logic Integration** - Use validated models in workflows
15. **Real-world Applications** - Invoice processing, receipt scanning, form extraction

## 🚀 Next Steps
- Define Pydantic models for your document types
- Test extraction accuracy with sample documents
- Build batch processing pipelines for automation
- Integrate with your business systems (ERP, CRM, etc.)
- Monitor extraction errors and refine templates

## 💡 Pro Tips
- Use descriptive field names matching document terminology
- Provide multiple examples for better extraction accuracy
- Start with simple flat models, add nesting as needed
- Always validate extracted data with Pydantic
- Log errors for continuous improvement

---

**Resources:**
- [Docling Extraction Docs](https://docling-project.github.io/docling/examples/extraction/)
- [Pydantic Documentation](https://docs.pydantic.dev/)
- [VLM Models for Extraction](https://docling-project.github.io/docling/usage/vision_models/)

---

⚠️ **Note:** The extraction API is experimental and may change. Only PDF and image formats are currently supported.