# Document Parsing with Docling for RAG Systems

## A Comprehensive Guide to Document Conversion and Processing

This notebook demonstrates document parsing with **Docling** (v2.55.1), a Python library developed by IBM that converts a range of document formats into structured representations for AI/ML workflows, particularly Retrieval-Augmented Generation (RAG) systems.

### What You'll Learn

1. **Basic Document Conversion** - Convert PDFs and other formats to Markdown, JSON, HTML
2. **Multiple File Formats** - PDF, DOCX, XLSX, PPTX, HTML, Markdown, Images, Audio
3. **Pipeline Configuration** - OCR engines, table extraction, layout analysis, VLM
4. **LangChain Integration** - DoclingLoader and RAG pipeline with Chroma
5. **Advanced Topics** - Batch processing, enrichment, error handling

### Prerequisites

- Python 3.12 (recommended for full compatibility)
- OpenAI API key (for RAG examples)
- Sufficient disk space for model downloads (~2-4GB)
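
The prerequisites above can be checked before any models are downloaded. The following is a minimal sketch using only the standard library; the function name `preflight` and the 4 GB threshold are illustrative assumptions, not part of Docling.

```python
import os
import shutil
import sys


def preflight(min_python=(3, 12), min_free_gb=4.0, path="."):
    """Return a dict mapping each check name to True (passed) or False."""
    # Free space on the filesystem that contains `path`, in GiB
    free_gb = shutil.disk_usage(path).free / 1024**3
    return {
        "python_version": sys.version_info[:2] >= min_python,
        "disk_space": free_gb >= min_free_gb,
        "openai_api_key": bool(os.getenv("OPENAI_API_KEY")),
    }


for name, ok in preflight().items():
    print(f"{name}: {'ok' if ok else 'check this'}")
```

A failed `openai_api_key` check only affects the RAG examples; the conversion examples do not need the key.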

---

## 1. Installation & Setup

### 1.1 Create Python 3.12 Virtual Environment

```bash
# Create virtual environment with Python 3.12
python3.12 -m venv .venv

# Activate the environment
source .venv/bin/activate  # On macOS/Linux
# .venv\Scripts\activate  # On Windows
```

### 1.2 Install Dependencies

Run the following cell to install the dependencies. To install from a terminal instead, drop the leading `!`.

In [None]:
# Install Docling and its optional dependencies
# Comment these lines out once the packages are installed

!uv pip install docling==2.55.1 langchain-docling langchain-openai python-dotenv
!uv pip install "docling[easyocr,vlm,asr]"
!uv pip install "docling-core[chunking]"
!uv pip install langchain-chroma langchain-classic langchain-community
!uv pip install chromadb transformers sentence-transformers
!uv pip install pandas openpyxl

In [None]:
# Verify installation
import docling
from importlib.metadata import version

print(version("docling"))


### 1.3 Environment Configuration

In [None]:
from dotenv import load_dotenv, dotenv_values

# Load variables from .env into the process environment
load_dotenv()

# Also read .env into a plain dict, without touching os.environ
config = dotenv_values(".env")

print(f"Loaded {len(config)} value(s) from .env")

In [None]:
# Load environment variables from .env file
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Verify OpenAI API key is set (for RAG examples later)
if os.getenv("OPENAI_API_KEY"):
    print("OpenAI API key is configured")
else:
    print("Warning: OpenAI API key not found. Some RAG examples will not work.")
    print("Create a .env file with: OPENAI_API_KEY=your-key-here")

In [None]:
# Import core modules that we'll use throughout the notebook
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')  # Suppress warnings for cleaner output

# Docling imports
from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat, ConversionStatus

# Set up paths
SAMPLE_DIR = Path("sample_documents")
OUTPUT_DIR = Path("output")
OUTPUT_DIR.mkdir(exist_ok=True)

print("Core imports loaded successfully!")

---

## 2. Basic Document Conversion

The `DocumentConverter` class is the main entry point for document conversion in Docling. It handles format detection, backend selection, and pipeline execution automatically.

### Key Concepts:
- **ConversionResult**: Contains the converted document, status, and any errors
- **DoclingDocument**: The unified internal representation of any document
- **Export Formats**: Markdown, JSON, HTML, Text, DocTags

### 2.1 Simple PDF Conversion

In [None]:
# Basic PDF conversion example
# Using the Docling paper from arXiv as an example

from docling.document_converter import DocumentConverter

# Initialize the converter with default settings
converter = DocumentConverter()

# Convert a PDF from URL
# The Docling paper: "Docling Technical Report"
pdf_url = "https://arxiv.org/pdf/2408.09869"

print(f"Converting PDF from: {pdf_url}")
print("This may take a minute for the first run as models are downloaded...")

# Perform conversion
result = converter.convert(pdf_url)

# Check conversion status
print(f"\nConversion Status: {result.status}")
print(f"Document Name: {result.input.file.name}")
print(f"Number of Pages: {len(result.pages) if result.pages else 'N/A'}")

In [None]:
# Access the converted document
doc = result.document

# Display document structure information
print(f"Document Type: {type(doc).__name__}")
print(f"Number of Tables: {len(doc.tables) if hasattr(doc, 'tables') else 0}")
print(f"Number of Pictures: {len(doc.pictures) if hasattr(doc, 'pictures') else 0}")

In [None]:
# Display Tables
print("=" * 50)
print("TABLES")
print("=" * 50)

if hasattr(doc, 'tables') and doc.tables:
    for i, table in enumerate(doc.tables):
        print(f"\n--- Table {i+1} ---")
        # Export table to markdown format
        print(table.export_to_markdown())
else:
    print("No tables found")

# Display Pictures
print("\n" + "=" * 50)
print("PICTURES")
print("=" * 50)

if hasattr(doc, 'pictures') and doc.pictures:
    for i, picture in enumerate(doc.pictures):
        print(f"\n--- Picture {i+1} ---")
        # Resolve the picture's caption references into plain text
        caption = picture.caption_text(doc)
        if caption:
            print(f"Caption: {caption}")
        # Show any available metadata
        if hasattr(picture, 'prov'):
            print(f"Provenance: {picture.prov}")
else:
    print("No pictures found")

### 2.2 Export Formats

Docling supports multiple export formats:

| Method | Output | Use Case |
|--------|--------|----------|
| `export_to_markdown()` | Markdown text | LLM input, readable output |
| `export_to_dict()` | Python dict | Programmatic access |
| `save_as_json()` | JSON file | Persistence, API responses |
| `save_as_html()` | HTML file | Web display |
| `export_to_text()` | Plain text | Simple text extraction |
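
The methods in the table can be wrapped in one helper that picks the export from the target file's suffix. This is a sketch under stated assumptions: `export_document` is a hypothetical helper, and it relies only on the method names listed above, so it works with any object that exposes them.

```python
from pathlib import Path


def export_document(doc, path):
    """Write `doc` to `path`, choosing the export method from the suffix."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".json":
        doc.save_as_json(path)
    elif suffix in {".html", ".htm"}:
        doc.save_as_html(path)
    elif suffix == ".md":
        path.write_text(doc.export_to_markdown(), encoding="utf-8")
    elif suffix == ".txt":
        path.write_text(doc.export_to_text(), encoding="utf-8")
    else:
        raise ValueError(f"Unsupported export suffix: {suffix!r}")
    return path


# Usage in this notebook (assumes `doc` and OUTPUT_DIR from earlier cells):
# export_document(doc, OUTPUT_DIR / "docling_paper.md")
```

Writing text exports with an explicit `encoding="utf-8"` avoids platform-dependent defaults when the document contains non-ASCII characters.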

In [None]:
# Export to Markdown
markdown_content = doc.export_to_markdown()

# Display first 2000 characters
print("=" * 80)
print("MARKDOWN OUTPUT (first 2000 chars)")
print("=" * 80)
print(markdown_content[:2000])
print("\n... [truncated] ...")

In [None]:
# Export to JSON (save to file)
json_output_path = OUTPUT_DIR / "docling_paper.json"
doc.save_as_json(json_output_path)
print(f"JSON saved to: {json_output_path}")

# Export to HTML
html_output_path = OUTPUT_DIR / "docling_paper.html"
doc.save_as_html(html_output_path)
print(f"HTML saved to: {html_output_path}")

# Export to Markdown file
md_output_path = OUTPUT_DIR / "docling_paper.md"
with open(md_output_path, "w", encoding="utf-8") as f:
    f.write(markdown_content)
print(f"Markdown saved to: {md_output_path}")

In [None]:
# Export to dictionary for programmatic access
doc_dict = doc.export_to_dict()

# Explore the structure
print("Document Dictionary Keys:")
for key in doc_dict.keys():
    print(f"  - {key}")

### 2.3 ConversionResult Structure

The `ConversionResult` object contains valuable metadata about the conversion process.

In [None]:
# Examine the ConversionResult structure
print("ConversionResult Attributes:")
print(f"  status: {result.status}")
print(f"  input.file: {result.input.file}")
print(f"  input.format: {result.input.format}")
print(f"  input.document_hash: {result.input.document_hash[:16]}...")

# Check for errors
if result.errors:
    print(f"\nErrors ({len(result.errors)}):")
    for error in result.errors:
        print(f"  - {error.component_type}: {error.error_message}")
else:
    print("\nNo errors during conversion!")

---

## 3. Supported File Formats

Docling supports a wide variety of input formats, each handled by specialized backends:

| Format | Extensions | Backend | Pipeline |
|--------|-----------|---------|----------|
| PDF | `.pdf` | DoclingParseV4DocumentBackend | StandardPdfPipeline |
| Word | `.docx` | MsWordDocumentBackend | SimplePipeline |
| Excel | `.xlsx` | MsExcelDocumentBackend | SimplePipeline |
| PowerPoint | `.pptx` | MsPowerpointDocumentBackend | SimplePipeline |
| HTML | `.html`, `.htm` | HTMLDocumentBackend | SimplePipeline |
| Markdown | `.md` | MarkdownDocumentBackend | SimplePipeline |
| Images | `.png`, `.jpg`, `.tiff` | ImageDocumentBackend | StandardPdfPipeline |
| Audio | `.wav`, `.mp3` | NoOpBackend | AsrPipeline |
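
Before handing a mixed folder to the converter, it can help to group files by the format families in the table. The sketch below uses only `pathlib`; the mapping is copied from the extensions listed above, and `group_by_format` is a hypothetical helper rather than a Docling function.

```python
from pathlib import Path

# Extension -> format family, copied from the table above
FORMAT_BY_EXTENSION = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".xlsx": "Excel",
    ".pptx": "PowerPoint",
    ".html": "HTML",
    ".htm": "HTML",
    ".md": "Markdown",
    ".png": "Image",
    ".jpg": "Image",
    ".tiff": "Image",
    ".wav": "Audio",
    ".mp3": "Audio",
}


def group_by_format(paths):
    """Group file names by format family; unknown extensions go under 'unsupported'."""
    groups = {}
    for p in map(Path, paths):
        family = FORMAT_BY_EXTENSION.get(p.suffix.lower(), "unsupported")
        groups.setdefault(family, []).append(p.name)
    return groups


print(group_by_format(["report.PDF", "slides.pptx", "notes.txt", "talk.mp3"]))
```

Docling performs its own format detection, so this pre-filter is optional; it is mainly useful for reporting which files will be skipped.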

### 3.1 PDF Documents

PDF is the most feature-rich format with support for:
- Layout analysis (headers, paragraphs, lists)
- Table structure extraction
- OCR for scanned pages
- Image/figure extraction
- Reading order determination

In [None]:
# PDF with detailed options
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

# Configure PDF pipeline with specific options
pdf_options = PdfPipelineOptions(
    do_ocr=False,              # Disable OCR for native PDFs (faster)
    do_table_structure=True,   # Enable table structure extraction
    generate_page_images=True, # Generate page images for HTML export
)

# Create converter with custom options
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_options)
    }
)

# Convert the PDF
result = converter.convert(pdf_url)
print(f"Conversion status: {result.status}")

In [None]:
# Access tables from the converted document
doc = result.document

if hasattr(doc, 'tables') and doc.tables:
    print(f"Found {len(doc.tables)} tables in the document\n")
    
    # Display the first two tables
    for i, table in enumerate(doc.tables[:2]):
        print(f"Table {i+1}:")
        print("-" * 40)
        
        # Export to a DataFrame; fall back to Markdown if that fails
        try:
            df = table.export_to_dataframe()
            print(df.head())
        except Exception as e:
            print(f"DataFrame export failed ({e}); Markdown fallback:")
            print(table.export_to_markdown()[:500])
        print()
else:
    print("No tables found in the document")

### 3.2 Microsoft Office Documents

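Docling supports the Office Open XML formats (DOCX, XLSX, PPTX) and preserves their formatting in the converted document. The first two cells below convert the HTML and Markdown samples, which run through the same `SimplePipeline`; the Office examples follow.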

In [None]:
# Convert HTML document (from our sample files)
from docling.document_converter import DocumentConverter

converter = DocumentConverter()

# Convert the sample HTML file
html_path = SAMPLE_DIR / "sample.html"

if html_path.exists():
    result = converter.convert(str(html_path))
    print(f"HTML Conversion Status: {result.status}")
    
    # Display converted content
    html_markdown = result.document.export_to_markdown()
    print("\nConverted HTML to Markdown:")
    print("=" * 60)
    print(html_markdown[:1500])
else:
    print(f"Sample HTML file not found at {html_path}")

In [None]:
# Convert Markdown document
md_path = SAMPLE_DIR / "sample.md"

if md_path.exists():
    result = converter.convert(str(md_path))
    print(f"Markdown Conversion Status: {result.status}")
    
    # Markdown to Markdown (demonstrates parsing and re-export)
    output_md = result.document.export_to_markdown()
    print("\nParsed and re-exported Markdown:")
    print("=" * 60)
    print(output_md[:1500])
else:
    print(f"Sample Markdown file not found at {md_path}")

In [None]:
# Example: converting a DOCX file
# This demonstrates the pattern for Word documents

from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter

# Restrict the converter to Office formats
converter = DocumentConverter(
    allowed_formats=[InputFormat.DOCX, InputFormat.PPTX, InputFormat.XLSX],
)

# Word conversion pattern
print("Word (DOCX) Conversion:")
print("-" * 40)
result = converter.convert("sample_documents/sample.docx")
docx = result.document
docx_markdown = docx.export_to_markdown()

print("To convert a Word document, use: converter.convert('your_document.docx')")
print(docx_markdown)

In [None]:
from docling.document_converter import DocumentConverter  
from docling.datamodel.base_models import InputFormat  
  
# Initialize converter with office document support  
converter = DocumentConverter(  
    allowed_formats=[InputFormat.DOCX, InputFormat.XLSX, InputFormat.PPTX]  
)  
  
# Convert the sample Excel workbook
result = converter.convert("sample_documents/sample.xlsx")  
print(result.document.export_to_markdown())

In [None]:
from docling.document_converter import DocumentConverter  
from docling.datamodel.base_models import InputFormat  
  
# Initialize converter with office document support  
converter = DocumentConverter(  
    allowed_formats=[InputFormat.DOCX, InputFormat.XLSX, InputFormat.PPTX]  
)  
print("\nPowerPoint (PPTX) Conversion:")
print("-" * 40)
# Convert the sample PowerPoint deck
# Each slide becomes a section in the document
result = converter.convert("sample_documents/dl.pptx")  
# result = converter.convert("sample_documents/sample1.pptx")  
print(result.document.export_to_markdown())

### 3.3 Image Files with OCR

Images are processed through the same pipeline as PDFs, with OCR enabled to extract text.

In [None]:
# Image conversion with OCR
from docling.document_converter import (
    DocumentConverter,
    ImageFormatOption,
    PdfFormatOption,
)
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

# Configure OCR; images and scanned PDFs share the same pipeline options
image_pipeline_options = PdfPipelineOptions(
    do_ocr=True,  # Enable OCR for text extraction from images
)

converter = DocumentConverter(
    format_options={
        InputFormat.IMAGE: ImageFormatOption(
            pipeline_options=image_pipeline_options
        ),
        # The sample below is a scanned PDF, so register the options for PDF too
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=image_pipeline_options
        ),
    }
)

# Conversion pattern (an image path is passed the same way):
result = converter.convert("sample_documents/scan.pdf")
text = result.document.export_to_markdown()

print("Image OCR conversion pattern:")
print("-" * 40)
print(text)
print("Supported formats: PNG, JPEG, TIFF, BMP, WEBP")
print("Multi-page TIFF files are automatically handled.")

### 3.4 Audio Files (ASR Pipeline)

Docling can transcribe audio files using Automatic Speech Recognition (ASR).

### Run on a GPU

https://colab.research.google.com/drive/1EemOQ8V5BeGz1v7W2xjD6YUC3eZdJLOU?usp=sharing

In [None]:
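from docling.document_converter import DocumentConverter, AudioFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.datamodel import asr_model_specs
from docling.pipeline.asr_pipeline import AsrPipeline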
  
# Initialize converter with ASR support  
converter = DocumentConverter(  
    allowed_formats=[InputFormat.AUDIO],  
    format_options={  
        InputFormat.AUDIO: AudioFormatOption(  
            pipeline_cls=AsrPipeline,  
            pipeline_options=AsrPipelineOptions(  
                asr_options=asr_model_specs.WHISPER_TINY  
            )  
        )  
    }  
)  
  
# Convert audio file  
result = converter.convert("sample_documents/sample.mp3")  
print(result.document.export_to_markdown())

In [None]:
# Audio transcription example (requires 'asr' extra)
from docling.document_converter import DocumentConverter, AudioFormatOption
from docling.datamodel.base_models import InputFormat
from docling.pipeline.asr_pipeline import AsrPipeline
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.datamodel import asr_model_specs

print("Audio Transcription (ASR) Pattern:")
print("-" * 40)


# Configure ASR pipeline
asr_options = AsrPipelineOptions(
    asr_options=asr_model_specs.WHISPER_TINY,  # or WHISPER_BASE, WHISPER_SMALL
)

converter = DocumentConverter(
    format_options={
        InputFormat.AUDIO: AudioFormatOption(
            pipeline_cls=AsrPipeline,
            pipeline_options=asr_options,
        )
    }
)

result = converter.convert("sample_documents/sample.mp3")  # or .wav
transcript = result.document.export_to_markdown()
print(transcript)
print("\nSupported formats: WAV, MP3")
print("Requires: pip install 'docling[asr]'")

---

## 4. Pipeline Options & Configuration

Docling provides extensive configuration options for customizing the document processing pipeline.

### 4.1 OCR Configuration

Multiple OCR engines are available, each with different strengths:

| Engine | Best For | Installation |
|--------|----------|-------------|
| RapidOCR | General use (default) | Included |
| EasyOCR | Multi-language | `pip install 'docling[easyocr]'` |
| Tesseract | Production | System install + `pip install 'docling[tesserocr]'` |
| OcrMac | macOS native | `pip install 'docling[ocrmac]'` |
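
Since most engines in the table are optional installs, a quick availability check can save a failed conversion later. The sketch below only tests whether a Python module can be found; the module names in `OCR_MODULES` are assumptions about what each extra installs and may differ between versions, and Tesseract additionally needs the system binary.

```python
from importlib.util import find_spec

# Engine name -> importable module name (assumed; adjust to your install)
OCR_MODULES = {
    "RapidOCR": "rapidocr",
    "EasyOCR": "easyocr",
    "Tesseract": "tesserocr",
    "OcrMac": "ocrmac",
}


def available_ocr_engines(modules=OCR_MODULES):
    """Map each engine name to True if its module can be found for import."""
    return {name: find_spec(module) is not None for name, module in modules.items()}


for engine, present in available_ocr_engines().items():
    print(f"{engine}: {'installed' if present else 'not installed'}")
```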

In [None]:
# OCR Configuration Examples
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    EasyOcrOptions,
    RapidOcrOptions,
    TesseractOcrOptions,
)

# Option 1: RapidOCR (default, fast)
rapid_ocr_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=RapidOcrOptions(),
)

# Option 2: EasyOCR (multi-language support)
easy_ocr_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=EasyOcrOptions(
        lang=["en", "fr", "de"],  # English, French, German
        use_gpu=True,  # Use GPU if available
    ),
)

# Option 3: Tesseract (production-ready)
tesseract_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=TesseractOcrOptions(
        lang=["eng", "fra"],  # Tesseract language codes
    ),
)

print("OCR configurations created successfully!")
print("\nAvailable OCR options:")
print("  - RapidOcrOptions: Fast, general-purpose")
print("  - EasyOcrOptions: Multi-language, GPU support")
print("  - TesseractOcrOptions: Production, requires system Tesseract")
print("  - OcrMacOptions: macOS Vision framework (macOS only)")

In [None]:
# Using EasyOCR with custom language support
# This example shows how to set up OCR for scanned documents

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, EasyOcrOptions
from docling.datamodel.accelerator_options import AcceleratorOptions, AcceleratorDevice

# Configure EasyOCR with accelerator options
pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    do_table_structure=True,
    ocr_options=EasyOcrOptions(
        lang=["en"],
    ),
    accelerator_options=AcceleratorOptions(
        device=AcceleratorDevice.AUTO,  # AUTO, CPU, CUDA, or MPS
        num_threads=4,
    ),
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

# Conversion pattern:
result = converter.convert("sample_documents/scan.pdf")
text = result.document.export_to_markdown()
print("Converter configured with EasyOCR and accelerator options.")
print(f"Accelerator device: {AcceleratorDevice.AUTO}")

In [None]:
text

### 4.2 Table Structure Options

Configure table extraction with TableFormer model settings.

In [None]:
# Table structure configuration
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TableStructureOptions,
    TableFormerMode,
)

# Configure table extraction
table_options = TableStructureOptions(
    do_cell_matching=True,  # Match cells with text content
    mode=TableFormerMode.ACCURATE,  # ACCURATE or FAST
)

pipeline_options = PdfPipelineOptions(
    do_table_structure=True,
    table_structure_options=table_options,
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)


# Conversion pattern:
pdf_url = "https://arxiv.org/pdf/2408.09869v1"
result = converter.convert(pdf_url)
text = result.document.export_to_markdown()
print("Table extraction configured:")
print(f"  - Cell matching: {table_options.do_cell_matching}")
print(f"  - Mode: {table_options.mode}")

In [None]:
text

### 4.3 VLM Pipeline (Vision-Language Models)

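For documents with complex layouts, a Vision-Language Model (VLM) can convert each page end to end instead of relying on the separate layout, OCR, and table models of the standard pipeline.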

**Available VLM Models:**
- `GRANITEDOCLING_TRANSFORMERS` - IBM GraniteDocling with Transformers
- `GRANITEDOCLING_MLX` - GraniteDocling optimized for Apple Silicon
- `SMOLDOCLING_TRANSFORMERS` - Smaller, faster model
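
Choosing between the MLX and Transformers variants can be automated from the host platform. This is a minimal sketch: `pick_vlm_spec_name` is a hypothetical helper that returns one of the spec names listed above as a string, and the rule (MLX only on Apple Silicon) is an assumption based on the descriptions in that list.

```python
import platform


def pick_vlm_spec_name(system=None, machine=None):
    """Return the name of a VLM spec suited to the given (or current) platform."""
    system = system or platform.system()
    machine = machine or platform.machine()
    # Apple Silicon reports system "Darwin" and machine "arm64"
    if system == "Darwin" and machine == "arm64":
        return "GRANITEDOCLING_MLX"
    return "GRANITEDOCLING_TRANSFORMERS"


print(pick_vlm_spec_name())
```

In the notebook, the returned name could be resolved with `getattr(vlm_model_specs, pick_vlm_spec_name())` and passed as `vlm_options`.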

In [None]:
# VLM Pipeline Configuration
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.pipeline.vlm_pipeline import VlmPipeline
from docling.datamodel import vlm_model_specs
from docling.datamodel.pipeline_options import VlmPipelineOptions

print("VLM Pipeline Configuration:")
print("=" * 60)

# Option 1: GraniteDocling with Transformers (cross-platform)
print("\n1. GraniteDocling with Transformers (GPU/CPU):")
print("-" * 40)
print("""pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS,
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        ),
    }
)""")

# Option 2: GraniteDocling MLX (Apple Silicon optimized)
print("\n2. GraniteDocling MLX (Apple Silicon M1/M2/M3/M4):")
print("-" * 40)
print("""pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_MLX,
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        ),
    }
)""")

In [None]:
# Option 2: GraniteDocling MLX (Apple Silicon optimized)

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.pipeline.vlm_pipeline import VlmPipeline
from docling.datamodel import vlm_model_specs
from docling.datamodel.pipeline_options import VlmPipelineOptions

print("\n2. GraniteDocling MLX (Apple Silicon M1/M2/M3/M4):")
print("-" * 40)
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_MLX,
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        ),
    }
)
# Convert with VLM
pdf_url = "https://arxiv.org/pdf/2408.09869"
result = converter.convert(pdf_url)
vlm_markdown = result.document.export_to_markdown()
print(vlm_markdown[:2000])

In [None]:
# VLM Pipeline - Live Example (requires significant GPU/memory)
# Uncomment to run if you have sufficient resources

"""from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.pipeline.vlm_pipeline import VlmPipeline
from docling.datamodel import vlm_model_specs
from docling.datamodel.pipeline_options import VlmPipelineOptions

# Configure VLM pipeline
pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS,
)

vlm_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        ),
    }
)

# Convert with VLM
result = vlm_converter.convert(pdf_url)
vlm_markdown = result.document.export_to_markdown()
print(vlm_markdown[:2000])
"""

print("VLM example is commented out to avoid resource issues.")
print("Uncomment and run if you have GPU/sufficient memory.")

---

## 6. LangChain Integration

Docling integrates with LangChain through the `langchain-docling` package.

### 6.1 DoclingLoader

The `DoclingLoader` provides a LangChain-compatible document loader.

In [None]:
# DoclingLoader Basic Usage
from langchain_docling import DoclingLoader
from langchain_docling.loader import ExportType

pdf_url = "https://arxiv.org/pdf/2408.09869"

# Create loader with DOC_CHUNKS export (recommended for RAG)
loader = DoclingLoader(
    file_path=pdf_url,
    export_type=ExportType.DOC_CHUNKS,  # Returns chunked documents
)

print("Loading documents with DoclingLoader...")
docs = loader.load()

print(f"\nLoaded {len(docs)} document chunks")
print("\nFirst document chunk:")
print("=" * 60)
print(f"Content: {docs[0].page_content[:500]}...")
print(f"\nMetadata: {docs[0].metadata}")

In [None]:
# DoclingLoader with MARKDOWN export
loader_md = DoclingLoader(
    file_path=pdf_url,
    export_type=ExportType.MARKDOWN,  # Returns full document as Markdown
)

docs_md = loader_md.load()

print(f"Loaded {len(docs_md)} document(s) as Markdown")
print(f"\nDocument length: {len(docs_md[0].page_content)} characters")
print("\nFirst 500 characters:")
print(docs_md[0].page_content[:500])

In [None]:
# DoclingLoader with custom converter
from langchain_docling import DoclingLoader
from langchain_docling.loader import ExportType
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

# Create custom converter with specific options
custom_pipeline = PdfPipelineOptions(
    do_ocr=False,
    do_table_structure=True,
)

custom_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=custom_pipeline)
    }
)

# Use custom converter with DoclingLoader
loader_custom = DoclingLoader(
    file_path=pdf_url,
    converter=custom_converter,  # Pass custom converter
    export_type=ExportType.DOC_CHUNKS,
)

docs_custom = loader_custom.load()
print(f"Loaded {len(docs_custom)} chunks with custom converter")

### 6.2 RAG Pipeline with LangChain

Build a complete RAG pipeline using Docling, LangChain, and Chroma.
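
The retrieval step of such a pipeline ranks chunks by how close their embedding is to the query's embedding and keeps the top `k`. The toy sketch below shows that idea with hand-written 3-dimensional vectors and cosine similarity; it is an illustration only, and cosine similarity is not necessarily the metric the vector store uses by default.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, chunk_vecs, k=5):
    """Return the ids of the k chunks most similar to the query."""
    scored = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in scored[:k]]


# Toy "embeddings"; real ones have hundreds of dimensions
chunks = {
    "tables": [0.9, 0.1, 0.0],
    "ocr": [0.1, 0.9, 0.0],
    "export": [0.0, 0.2, 0.9],
}
print(top_k([1.0, 0.2, 0.0], chunks, k=2))  # ['tables', 'ocr']
```

A larger `k` gives the LLM more context at the cost of a longer prompt.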

In [None]:
# Complete RAG Pipeline Setup
import os
from langchain_docling import DoclingLoader
from langchain_docling.loader import ExportType
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_classic.chains import create_retrieval_chain
from langchain_community.vectorstores.utils import filter_complex_metadata

# Check for OpenAI API key
if not os.getenv("OPENAI_API_KEY"):
    print("Warning: OPENAI_API_KEY not set. RAG example will not work.")
    print("Set your API key: os.environ['OPENAI_API_KEY'] = 'your-key'")
else:
    print("OpenAI API key found. Proceeding with RAG setup...")

In [None]:
# Step 1: Load and chunk documents

pdf_url = "https://arxiv.org/pdf/2408.09869"

if os.getenv("OPENAI_API_KEY"):
    print("Step 1: Loading documents...")
    
    loader = DoclingLoader(
        file_path=pdf_url,
        export_type=ExportType.DOC_CHUNKS,
    )
    
    documents = loader.load()
    print(f"Loaded {len(documents)} document chunks")

In [None]:
# Step 2: Create embeddings and vector store
if os.getenv("OPENAI_API_KEY"):
    print("Step 2: Creating embeddings and vector store...")
    
    # Initialize embeddings
    embeddings = OpenAIEmbeddings(
        model="text-embedding-3-small"
    )
    # Filter complex metadata from documents
    filtered_documents = filter_complex_metadata(documents)
    
    # Create Chroma vector store
    vectorstore = Chroma.from_documents(
        documents=filtered_documents,
        embedding=embeddings,
        persist_directory="./chroma_db",  # Persist to disk
        collection_name="docling_demo",
    )
    
    print(f"Vector store created with {len(filtered_documents)} documents")
    print("Persisted to: ./chroma_db")

In [None]:
from langchain_classic.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

# Step 3: Create RAG chain
if os.getenv("OPENAI_API_KEY"):
    print("Step 3: Creating RAG chain...")
    
    # Initialize LLM
    llm = ChatOpenAI(
        model="gpt-4o-mini",
        temperature=0,
    )

    # Create prompt template
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer the question based only on the following context:\n\n{context}"),
        ("human", "{input}"),
    ])
    
    # Create retriever
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 5},  # Return top 5 relevant chunks
    )
    
    # Create QA chain
    question_answer_chain = create_stuff_documents_chain(llm, prompt)
    qa_chain = create_retrieval_chain(retriever, question_answer_chain)
    
    print("RAG chain created successfully!")

In [None]:
response = qa_chain.invoke({"input": "What is this document about?"})
response

In [None]:
# Step 4: Query the RAG system
if os.getenv("OPENAI_API_KEY"):
    print("Step 4: Querying the RAG system...")
    print("=" * 60)
    
    # Example questions about Docling
    questions = [
        "What is Docling and what are its main features?",
        "What file formats does Docling support?",
        "How does Docling handle table extraction?",
    ]
    
    for question in questions:
        print(f"\nQ: {question}")
        print("-" * 40)
        
        response = qa_chain.invoke({"input": question})
        
        print(response["answer"])
        # create_retrieval_chain returns the retrieved chunks under "context"
        print(f"\n(Based on {len(response['context'])} source chunks)")
        print("=" * 60)

---

## 7. Export & Serialization

### 7.1 Export Methods

Docling provides multiple export methods for different use cases.

In [None]:
print(OUTPUT_DIR)

In [None]:
# Comprehensive export examples
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert(pdf_url)
doc = result.document

# 1. Export to Markdown
markdown = doc.export_to_markdown()
print(f"Markdown export: {len(markdown)} characters")

# 2. Export to Text (plain text, no formatting)
text = doc.export_to_text()
print(f"Text export: {len(text)} characters")

# 3. Export to Dictionary
doc_dict = doc.export_to_dict()
print(f"Dict export: {len(doc_dict.keys())} top-level keys")

# 4. Save as JSON
json_path = OUTPUT_DIR / "export_demo.json"
doc.save_as_json(json_path)
print(f"JSON saved: {json_path}")

# 5. Save as HTML
html_path = OUTPUT_DIR / "export_demo.html"
doc.save_as_html(html_path)
print(f"HTML saved: {html_path}")

### 7.2 Table Export

Export tables to pandas DataFrames or CSV.

In [None]:
# Table export to DataFrame
import pandas as pd

# Access tables from the document
if hasattr(doc, 'tables') and doc.tables:
    print(f"Found {len(doc.tables)} tables\n")
    
    for i, table in enumerate(doc.tables[:3]):  # First 3 tables
        print(f"Table {i+1}:")
        print("-" * 40)
        
        try:
            # Export to DataFrame
            df = table.export_to_dataframe()
            print(df.head())
            
            # Save to CSV
            csv_path = OUTPUT_DIR / f"table_{i+1}.csv"
            df.to_csv(csv_path, index=False)
            print(f"Saved to: {csv_path}")
        except Exception as e:
            print(f"Error exporting table: {e}")
        
        print()
else:
    print("No tables found in the document")

---

## 8. Advanced Topics

### 8.1 Batch Processing

Process multiple documents efficiently with `convert_all()`.

In [None]:
# Batch processing example
from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import ConversionStatus
from pathlib import Path

# Define sources (can be paths, URLs, or streams)
sources = [
    str(SAMPLE_DIR / "sample.html"),
    str(SAMPLE_DIR / "sample.md"),
]

# Filter to existing files only
existing_sources = [s for s in sources if Path(s).exists()]

if existing_sources:
    converter = DocumentConverter()
    
    # Batch convert with error handling
    results = {
        "success": [],
        "partial": [],
        "failed": [],
    }
    
    print(f"Processing {len(existing_sources)} documents...")
    
    for result in converter.convert_all(existing_sources, raises_on_error=False):
        if result.status == ConversionStatus.SUCCESS:
            results["success"].append(result)
            print(f"  SUCCESS: {result.input.file.name}")
        elif result.status == ConversionStatus.PARTIAL_SUCCESS:
            results["partial"].append(result)
            print(f"  PARTIAL: {result.input.file.name}")
        else:
            results["failed"].append(result)
            print(f"  FAILED: {result.input.file.name}")
    
    print(f"\nSummary: {len(results['success'])} success, "
          f"{len(results['partial'])} partial, "
          f"{len(results['failed'])} failed")
else:
    print("No sample files found for batch processing demo.")
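
The batch cell above lists its sources by hand and then filters out missing files. For a larger folder, the list could instead be built by scanning a directory. The sketch below uses only `pathlib`; `collect_sources` and `SUPPORTED_SUFFIXES` are hypothetical names, and the suffix set is an assumption you should adjust to the formats you actually convert.

```python
from collections.abc import Collection
from pathlib import Path

# Assumed default for illustration; not an exhaustive list of Docling formats.
SUPPORTED_SUFFIXES = frozenset({".pdf", ".docx", ".html", ".md"})


def collect_sources(root: Path, suffixes: Collection[str] = SUPPORTED_SUFFIXES) -> list[str]:
    """Return sorted paths of all files under `root` with a matching suffix.

    The search is recursive and the suffix comparison is case-insensitive.
    """
    return sorted(
        str(path)
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in suffixes
    )
```

The returned list contains only existing files, so it could be passed to `converter.convert_all(...)` in place of `existing_sources`.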

### 8.2 Document Enrichment

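Enable enrichment features such as picture classification and description. The cell below only configures the converter; the enrichment models run when you call `converter.convert()` on a PDF.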

In [None]:
# Document enrichment configuration
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

# Enable enrichment features
enrichment_options = PdfPipelineOptions(
    do_table_structure=True,
    do_picture_classification=True,   # Classify pictures (chart, diagram, etc.)
    do_picture_description=False,     # Disable VLM description (resource intensive)
    generate_picture_images=True,     # Save picture images
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=enrichment_options)
    }
)

print("Enrichment features configured:")
print(f"  - Picture classification: {enrichment_options.do_picture_classification}")
print(f"  - Picture description: {enrichment_options.do_picture_description}")
print(f"  - Generate picture images: {enrichment_options.generate_picture_images}")

### 8.3 Error Handling

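Handle conversion errors gracefully by checking the result status. With `raises_on_error=False`, most conversion failures are reported through the result's `status` and `errors` instead of raising; the `try`/`except` catches anything else.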

In [None]:
# Error handling patterns
from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import ConversionStatus

converter = DocumentConverter()

def safe_convert(source):
    """Safely convert a document with proper error handling."""
    try:
        result = converter.convert(source, raises_on_error=False)
        
        if result.status == ConversionStatus.SUCCESS:
            print(f"Conversion successful: {result.input.file.name}")
            return result.document
        
        elif result.status == ConversionStatus.PARTIAL_SUCCESS:
            print(f"Partial success: {result.input.file.name}")
            print(f"  Errors: {len(result.errors)}")
            for error in result.errors:
                print(f"    - {error.component_type}: {error.error_message}")
            return result.document  # Still usable
        
        else:
            print(f"Conversion failed: {result.input.file.name}")
            for error in result.errors:
                print(f"  - {error.component_type}: {error.error_message}")
            return None
            
    except Exception as e:
        print(f"Unexpected error: {e}")
        return None

# Example usage
doc = safe_convert(pdf_url)
if doc:
    print(f"\nDocument ready with {len(doc.export_to_markdown())} characters")
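
`safe_convert` prints the error list twice with the same loop, once for partial success and once for failure. One way to remove that duplication is a small formatting helper, sketched below. `format_errors` is a hypothetical name; it assumes only that each error object exposes `component_type` and `error_message`, as the loops above do. The demo uses made-up stand-in objects rather than real Docling errors.

```python
from types import SimpleNamespace


def format_errors(errors, indent: str = "  ") -> list[str]:
    """Return one '<indent>- <component>: <message>' line per error item."""
    return [f"{indent}- {err.component_type}: {err.error_message}" for err in errors]


# Stand-in error objects with made-up values, for illustration only.
demo_errors = [
    SimpleNamespace(component_type="document_backend", error_message="could not read page 3"),
    SimpleNamespace(component_type="pipeline", error_message="timeout"),
]

for line in format_errors(demo_errors):
    print(line)
```

Inside `safe_convert`, both loops could then become `print("\n".join(format_errors(result.errors)))`.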

---

## Summary

In this notebook, we covered:

1. **Installation & Setup** - Installing Docling 2.55.1 with all dependencies
2. **Basic Conversion** - Converting documents to Markdown, JSON, HTML
3. **File Formats** - PDF, Office (DOCX, XLSX, PPTX), HTML, Markdown, Images, Audio
4. **Pipeline Options** - OCR engines, table extraction, layout analysis, VLM
5. **Chunking** - HybridChunker and HierarchicalChunker for RAG
6. **LangChain Integration** - DoclingLoader and RAG pipeline
7. **Export Methods** - Multiple output formats and table export
8. **Advanced Topics** - Batch processing, enrichment, error handling

### Key Takeaways

- **Docling** provides unified document parsing across multiple formats
- **DocumentConverter** is the main entry point for all conversions
- **Pipeline options** allow fine-tuned control over processing
- **Native chunking** is optimized for RAG applications
- **LangChain integration** enables seamless RAG pipeline creation

### Resources

- [Docling Documentation](https://docling-project.github.io/docling/)
- [Docling GitHub](https://github.com/docling-project/docling)
- [LangChain Docling Integration](https://docs.langchain.com/oss/python/integrations/document_loaders/docling)
- [Docling Examples](https://docling-project.github.io/docling/examples/)

In [None]:
# Cleanup (optional)
import shutil

# Uncomment to clean up generated files
# if OUTPUT_DIR.exists():
#     shutil.rmtree(OUTPUT_DIR)
# if Path("./chroma_db").exists():
#     shutil.rmtree("./chroma_db")
# if Path("./chroma_rag_demo").exists():
#     shutil.rmtree("./chroma_rag_demo")

print("Notebook completed successfully!")
print(f"Output files saved to: {OUTPUT_DIR.absolute()}")
