# OpenDataLoaderPDFLoader

Parse PDFs into LangChain `Document` objects for RAG pipelines.

**Features:**
- Accurate multi-column reading order (XY-Cut++ algorithm)
- Table extraction with structure preservation
- Multiple output formats: text, markdown, JSON, HTML
- 100% local processing — no cloud APIs
- Fast rule-based extraction, no GPU required

## Installation

Requires Java 11 or later, with the `java` executable on your system PATH.
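
To confirm the Java requirement before installing, a quick check along these lines may help. This is a minimal sketch using only the Python standard library; the helper names `parse_java_major` and `java_major_version` are illustrative and not part of `langchain-opendataloader-pdf`.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy scheme: Java 8 and earlier report themselves as "1.8.0_xxx".
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def java_major_version():
    """Return the major version of the `java` found on PATH, or None."""
    java = shutil.which("java")
    if java is None:
        return None
    result = subprocess.run(
        [java, "-version"], capture_output=True, text=True, check=False, timeout=30
    )
    # Most JDKs print the version banner to stderr, so inspect both streams.
    return parse_java_major(result.stderr + result.stdout)


major = java_major_version()
if major is None:
    print("Java was not found on PATH, or its version could not be parsed.")
elif major < 11:
    print(f"Found Java {major}; Java 11 or later is required.")
else:
    print(f"Found Java {major}; requirement satisfied.")
```

If the check reports a missing or older Java, install a JDK 11+ distribution and make sure its `bin` directory is on PATH before running the cells below.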

In [None]:
%pip install -qU langchain-opendataloader-pdf requests

## Quickstart

### Download Sample PDF

In [None]:
import os
import tempfile

import requests

url = "https://arxiv.org/pdf/2408.02509v1.pdf"

# Save the PDF into a fresh temporary directory
output_dir = tempfile.mkdtemp()
filename = os.path.basename(url)
save_path = os.path.join(output_dir, filename)

# Stream the download and fail early on HTTP errors instead of saving an error page
with requests.get(url, stream=True, timeout=30) as response:
    response.raise_for_status()
    with open(save_path, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)

print(f"Downloaded PDF to: {save_path}")

### Load documents with OpenDataLoaderPDFLoader

In [None]:
from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

# Basic usage - load as markdown
loader = OpenDataLoaderPDFLoader(
    file_path=save_path,
    format="markdown",
    quiet=True
)
documents = loader.load()

print(f"Loaded {len(documents)} document(s)")
print(f"Metadata: {documents[0].metadata}")
print(f"Content preview:\n{documents[0].page_content[:500]}...")

## Output Formats

Choose the format that best fits your use case:

In [None]:
# Plain text - best for simple RAG
loader_text = OpenDataLoaderPDFLoader(file_path=save_path, format="text", quiet=True)

# JSON - structured data with bounding boxes (for citations)
loader_json = OpenDataLoaderPDFLoader(file_path=save_path, format="json", quiet=True)

# HTML - styled output
loader_html = OpenDataLoaderPDFLoader(file_path=save_path, format="html", quiet=True)

print("Available formats: text, markdown, json, html")

## Advanced Options

In [None]:
# Tagged PDF support (for accessible documents)
loader = OpenDataLoaderPDFLoader(
    file_path=save_path,
    format="markdown",
    use_struct_tree=True,  # Use native PDF structure tags
    quiet=True
)

# Table detection with clustering (better for borderless tables)
loader = OpenDataLoaderPDFLoader(
    file_path=save_path,
    format="markdown",
    table_method="cluster",
    quiet=True
)

# Page separators for chunking
loader = OpenDataLoaderPDFLoader(
    file_path=save_path,
    format="text",
    text_page_separator="\n\n--- Page %page-number% ---\n\n",
    quiet=True
)

# Image handling
loader = OpenDataLoaderPDFLoader(
    file_path=save_path,
    format="markdown",
    image_output="embedded",  # Base64 embedded images
    image_format="jpeg",
    quiet=True
)

print("Advanced options configured")
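
The page separator configured above can serve as a chunk boundary after loading. The sketch below splits text on markers of the form `--- Page N ---`. It assumes that `%page-number%` is replaced by the page number and that each marker precedes the page it numbers; verify both against real loader output before relying on it. The `split_pages` helper and the sample string are hypothetical and do not come from the library.

```python
import re

# Matches the separator "\n\n--- Page %page-number% ---\n\n" once the
# placeholder has been substituted with a number (assumed behavior).
PAGE_MARKER = re.compile(r"\n\n--- Page (\d+) ---\n\n")


def split_pages(text):
    """Split text into (page_number, page_text) pairs at each page marker."""
    # re.split with one capture group yields:
    # [text_before_first_marker, number, body, number, body, ...]
    parts = PAGE_MARKER.split(text)
    pages = []
    leading = parts[0].strip()
    if leading:
        # Text before the first marker has no page number; label it page 0.
        pages.append((0, leading))
    for number, body in zip(parts[1::2], parts[2::2]):
        pages.append((int(number), body.strip()))
    return pages


# Hand-written sample standing in for documents[0].page_content
sample = "\n\n--- Page 1 ---\n\nIntro text\n\n--- Page 2 ---\n\nMethods text"
for page_number, page_text in split_pages(sample):
    print(page_number, repr(page_text))
```

Each resulting pair can then be wrapped in its own `Document` with the page number stored in metadata, which keeps page-level citations available after chunking.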

## Parameters Reference

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `file_path` | `str \| List[str]` | — | **(Required)** PDF file path(s) or directories |
| `format` | `str` | `"text"` | Output format: `"text"`, `"markdown"`, `"json"`, `"html"` |
| `quiet` | `bool` | `False` | Suppress console logging |
| `password` | `str` | `None` | Password for encrypted PDFs |
| `use_struct_tree` | `bool` | `False` | Use PDF structure tree (tagged PDFs) |
| `table_method` | `str` | `None` | `"default"` or `"cluster"` (better for borderless tables) |
| `reading_order` | `str` | `"xycut"` | Reading order: `"xycut"` or `"off"` |
| `keep_line_breaks` | `bool` | `False` | Preserve original line breaks |
| `text_page_separator` | `str` | `None` | Separator between pages (use `%page-number%` for page numbers) |
| `image_output` | `str` | `None` | `"off"`, `"embedded"` (Base64), or `"external"` |
| `image_format` | `str` | `None` | `"png"` or `"jpeg"` |
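
Several parameters in the table accept only a fixed set of string values, so a typo such as `format="md"` is easy to make. The sketch below checks options against the values listed in the table before a loader is constructed. The `ALLOWED` mapping and the `check_options` helper are illustrative only; whether the library performs its own validation is not covered here.

```python
# Allowed string values, copied from the parameters table above.
ALLOWED = {
    "format": {"text", "markdown", "json", "html"},
    "table_method": {"default", "cluster"},
    "reading_order": {"xycut", "off"},
    "image_output": {"off", "embedded", "external"},
    "image_format": {"png", "jpeg"},
}


def check_options(**options):
    """Return a list of messages for option values not listed in the table."""
    problems = []
    for name, value in options.items():
        allowed = ALLOWED.get(name)
        # Unlisted parameters and None (unset) values are passed through unchecked.
        if allowed is not None and value is not None and value not in allowed:
            problems.append(f"{name}={value!r} (expected one of {sorted(allowed)})")
    return problems


options = {"format": "markdown", "table_method": "cluster", "image_format": "jpg"}
for problem in check_options(**options):
    print("Invalid option:", problem)
```

Once `check_options` returns an empty list, the same dictionary can be unpacked into the loader call alongside `file_path`.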