Merged
2 changes: 2 additions & 0 deletions src/oss/python/integrations/document_loaders/index.mdx
@@ -67,6 +67,7 @@ The below document loaders allow you to load PDF documents.
| [Upstage Document Parse Loader](/oss/integrations/document_loaders/upstage) | Load PDF files using UpstageDocumentParseLoader | Package |
| [Docling](/oss/integrations/document_loaders/docling) | Load PDF files using Docling | Package |
| [UnDatasIO](/oss/integrations/document_loaders/undatasio) | Load PDF files using UnDatasIO | Package |
| [OpenDataLoader PDF](/oss/integrations/document_loaders/opendataloader_pdf) | Load PDF files using OpenDataLoader PDF | Package |


### Cloud Providers
@@ -258,6 +259,7 @@ The below document loaders allow you to load data from common data formats.
<Card title="Notion DB" icon="link" href="/oss/integrations/document_loaders/notion" arrow="true" cta="View guide" />
<Card title="Nuclia" icon="link" href="/oss/integrations/document_loaders/nuclia" arrow="true" cta="View guide" />
<Card title="Obsidian" icon="link" href="/oss/integrations/document_loaders/obsidian" arrow="true" cta="View guide" />
<Card title="OpenDataLoader PDF" icon="link" href="/oss/integrations/document_loaders/opendataloader_pdf" arrow="true" cta="View guide" />
<Card title="Open Document Format (ODT)" icon="link" href="/oss/integrations/document_loaders/odt" arrow="true" cta="View guide" />
<Card title="Open City Data" icon="link" href="/oss/integrations/document_loaders/open_city_data" arrow="true" cta="View guide" />
<Card title="Oracle Autonomous Database" icon="link" href="/oss/integrations/document_loaders/oracleadb_loader" arrow="true" cta="View guide" />
@@ -0,0 +1,67 @@
---
title: OpenDataLoader PDF
---

**Safe, Open, High-Performance — PDF for AI**

[OpenDataLoader PDF](https://github.com/opendataloader-project/opendataloader-pdf) converts PDFs into JSON, Markdown, or HTML, ready to feed into modern AI stacks (LLMs, vector search, and RAG).

It reconstructs document layout (headings, lists, tables, and reading order) so the content is easier to chunk, index, and query.
Powered by fast, heuristic, rule-based inference, it runs entirely on your local machine and delivers high-throughput processing for large document sets.
AI-safety is enabled by default and automatically filters likely prompt-injection content embedded in PDFs to reduce downstream risk.
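
Because the layout reconstruction keeps headings, Markdown output can be chunked along section boundaries. The following is a minimal sketch, assuming `format="markdown"` yields ATX-style (`#`) headings; `split_by_headings` is a hypothetical helper written for illustration and is not part of the package.

```python
import re
from typing import List, Tuple

# Matches ATX-style Markdown headings such as "# Title" or "### Section".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headings(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown text into (heading, body) chunks.

    Text that appears before the first heading is returned under an
    empty heading. Hypothetical helper, not part of the loader.
    """
    chunks: List[Tuple[str, str]] = []
    title: str = ""
    body: List[str] = []

    def flush() -> None:
        # Emit the current chunk unless it is completely empty.
        if title or any(ln.strip() for ln in body):
            chunks.append((title, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous chunk before starting a new one
            title, body = match.group(2), []
        else:
            body.append(line)
    flush()
    return chunks


sample = "Intro text.\n\n# Scope\nApplies to all PDFs.\n\n## Limits\n- 10 pages\n"
for heading, text in split_by_headings(sample):
    print(repr(heading), "->", repr(text))
```

The sketch treats any line starting with `#` as a heading, so a `#` comment inside a fenced code block would be split incorrectly; a production splitter would need to track fences.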

## Overview

### Integration details

| Class | Package | Local | Serializable | JS support |
| :--- | :--- | :---: | :---: | :---: |
| [OpenDataLoader PDF](https://github.com/opendataloader-project/opendataloader-pdf) | [langchain-opendataloader-pdf](https://pypi.org/project/langchain-opendataloader-pdf/) | ✅ | ❌ | ❌ |

### Loader features

| Source | Document Lazy Loading | Native Async Support |
| :---: | :---: | :---: |
| OpenDataLoaderPDFLoader | ✅ | ❌ |

The `OpenDataLoaderPDFLoader` component enables you to parse PDFs into structured `Document` objects.
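
Lazy loading matters when a folder holds many PDFs and only a few documents are needed at a time. The sketch below consumes an iterator of `Document`-like objects without materializing the whole list. It assumes the loader exposes LangChain's standard `lazy_load()` method and that `metadata` carries a `"source"` key; `preview` and `fake_lazy_load` are hypothetical stand-ins so the example runs without the package or Java.

```python
from itertools import islice
from types import SimpleNamespace
from typing import Iterable, Iterator


def preview(docs: Iterable, limit: int = 2, width: int = 80) -> Iterator[str]:
    """Yield one short summary line for each of the first `limit` documents."""
    for doc in islice(docs, limit):
        # "source" is an assumed metadata key; fall back if it is missing.
        source = doc.metadata.get("source", "<unknown>")
        yield f"{source}: {doc.page_content[:width]}"


def fake_lazy_load() -> Iterator[SimpleNamespace]:
    """Stand-in for loader.lazy_load(); yields Document-like objects."""
    for i in range(1000):
        yield SimpleNamespace(
            metadata={"source": f"doc{i}.pdf"},
            page_content=f"Text {i}",
        )


# Only the first two items are ever produced by the generator.
for line in preview(fake_lazy_load()):
    print(line)

# With the real loader (assumption: standard LangChain lazy_load()):
# loader = OpenDataLoaderPDFLoader(file_path=["path/to/folder"], format="text")
# for line in preview(loader.lazy_load()):
#     print(line)
```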

## Requirements
- Python >= 3.9
- Java 11 or newer available on the system `PATH`
- opendataloader-pdf >= 1.1.1
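
A quick preflight check can catch a missing interpreter or Java runtime before the loader fails at run time. This is a minimal sketch using only the standard library; `check_requirements` is a hypothetical helper, and it only confirms that a `java` executable is on `PATH`, not that its version is 11 or newer.

```python
import shutil
import sys
from typing import List, Tuple


def check_requirements(min_python: Tuple[int, int] = (3, 9)) -> List[str]:
    """Return a list of human-readable problems; empty means all checks passed."""
    problems: List[str] = []
    if sys.version_info < min_python:
        found = ".".join(str(part) for part in sys.version_info[:3])
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python >= {wanted} is required, found {found}")
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which("java") is None:
        problems.append("No `java` executable found on PATH (Java 11 or newer is required)")
    return problems


problems = check_requirements()
print("All checks passed" if not problems else "\n".join(problems))
```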

## Installation
```bash
pip install -U langchain-opendataloader-pdf
```

## Quick start
```python
from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

loader = OpenDataLoaderPDFLoader(
    file_path=["path/to/document.pdf", "path/to/folder"],
    format="text"
)
documents = loader.load()

for doc in documents:
    print(doc.metadata, doc.page_content[:80])
```

## Parameters

| Parameter | Type | Required | Default | Description |
|--------------------------|-----------------------| ---------- |--------------|--------------------------------------------------------------------------------------------------------------------|
| `file_path` | `List[str]` | ✅ Yes | — | One or more PDF file paths or directories to process. |
| `format` | `str` | No | `None` | Output format (e.g. `"json"`, `"html"`, `"markdown"`, `"text"`). |
| `quiet` | `bool` | No | `False` | Suppresses CLI logging output when `True`. |
| `content_safety_off` | `Optional[List[str]]` | No | `None` | List of content safety filters to disable (e.g. `"all"`, `"hidden-text"`, `"off-page"`, `"tiny"`, `"hidden-ocg"`). |
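
The parameters in the table can be combined in a single call. The sketch below gathers them into a keyword dictionary so the combination can be inspected before anything is parsed; `loader_kwargs` is a hypothetical helper, and only the parameter names and example values come from the table above.

```python
from typing import Any, Dict, List, Optional


def loader_kwargs(
    paths: List[str],
    output_format: str = "markdown",
    disable_filters: Optional[List[str]] = None,
    quiet: bool = True,
) -> Dict[str, Any]:
    """Collect the documented constructor arguments into one dict."""
    if not paths:
        raise ValueError("file_path needs at least one PDF file or directory")
    return {
        "file_path": list(paths),
        "format": output_format,
        "quiet": quiet,
        # None leaves every content safety filter enabled (the default).
        "content_safety_off": disable_filters,
    }


kwargs = loader_kwargs(["path/to/document.pdf"], disable_filters=["hidden-text"])
print(kwargs)

# Requires the package and Java; uncomment to run against real files.
# from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
# documents = OpenDataLoaderPDFLoader(**kwargs).load()
```

Disabling a filter such as `"hidden-text"` removes part of the default prompt-injection protection, so it is best limited to PDFs from trusted sources.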

## Additional Resources

- [LangChain OpenDataLoader PDF integration GitHub](https://github.com/opendataloader-project/langchain-opendataloader-pdf)
- [LangChain OpenDataLoader PDF integration PyPI package](https://pypi.org/project/langchain-opendataloader-pdf/)
- [OpenDataLoader PDF GitHub](https://github.com/opendataloader-project/opendataloader-pdf)
- [OpenDataLoader PDF Homepage](https://opendataloader.org/)
8 changes: 8 additions & 0 deletions src/oss/python/integrations/providers/all_providers.mdx
@@ -1990,6 +1990,14 @@ Browse the complete collection of integrations available for Python. LangChain P
>
GPT models and comprehensive AI platform.
</Card>

<Card
title="OpenDataLoader PDF"
href="/oss/integrations/providers/opendataloader_pdf"
icon="link"
>
Safe, Open, High-Performance — PDF for AI
</Card>

<Card
title="OpenGradient"
51 changes: 51 additions & 0 deletions src/oss/python/integrations/providers/opendataloader_pdf.mdx
@@ -0,0 +1,51 @@
---
title: OpenDataLoader PDF
---

> **Safe, Open, High-Performance — PDF for AI**

> [OpenDataLoader PDF](https://github.com/opendataloader-project/opendataloader-pdf) converts PDFs into JSON, Markdown, or HTML, ready to feed into modern AI stacks (LLMs, vector search, and RAG).
>
> It reconstructs document layout (headings, lists, tables, and reading order) so the content is easier to chunk, index, and query.
> Powered by fast, heuristic, rule-based inference, it runs entirely on your local machine and delivers high-throughput processing for large document sets.
> AI-safety is enabled by default and automatically filters likely prompt-injection content embedded in PDFs to reduce downstream risk.

## Requirements
- Python >= 3.9
- Java 11 or newer available on the system `PATH`
- opendataloader-pdf >= 1.1.1

## Installation
```bash
pip install -U langchain-opendataloader-pdf
```

## Quick start
```python
from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

loader = OpenDataLoaderPDFLoader(
    file_path=["path/to/document.pdf", "path/to/folder"],
    format="text"
)
documents = loader.load()

for doc in documents:
    print(doc.metadata, doc.page_content[:80])
```

## Parameters

| Parameter | Type | Required | Default | Description |
|--------------------------|-----------------------| ---------- |--------------|--------------------------------------------------------------------------------------------------------------------|
| `file_path` | `List[str]` | ✅ Yes | — | One or more PDF file paths or directories to process. |
| `format` | `str` | No | `None` | Output format (e.g. `"json"`, `"html"`, `"markdown"`, `"text"`). |
| `quiet` | `bool` | No | `False` | Suppresses CLI logging output when `True`. |
| `content_safety_off` | `Optional[List[str]]` | No | `None` | List of content safety filters to disable (e.g. `"all"`, `"hidden-text"`, `"off-page"`, `"tiny"`, `"hidden-ocg"`). |

## Additional Resources

- [LangChain OpenDataLoader PDF integration GitHub](https://github.com/opendataloader-project/langchain-opendataloader-pdf)
- [LangChain OpenDataLoader PDF integration PyPI package](https://pypi.org/project/langchain-opendataloader-pdf/)
- [OpenDataLoader PDF GitHub](https://github.com/opendataloader-project/opendataloader-pdf)
- [OpenDataLoader PDF Homepage](https://opendataloader.org/)