diff --git a/src/oss/python/integrations/document_loaders/index.mdx b/src/oss/python/integrations/document_loaders/index.mdx
index ba144f7ab..074f8e3b3 100644
--- a/src/oss/python/integrations/document_loaders/index.mdx
+++ b/src/oss/python/integrations/document_loaders/index.mdx
@@ -67,6 +67,7 @@ The below document loaders allow you to load PDF documents.
 | [Upstage Document Parse Loader](/oss/integrations/document_loaders/upstage) | Load PDF files using UpstageDocumentParseLoader | Package |
 | [Docling](/oss/integrations/document_loaders/docling) | Load PDF files using Docling | Package |
 | [UnDatasIO](/oss/integrations/document_loaders/undatasio) | Load PDF files using UnDatasIO | Package |
+| [OpenDataLoader PDF](/oss/integrations/document_loaders/opendataloader_pdf) | Load PDF files using OpenDataLoader PDF | Package |
 
 ### Cloud Providers
 
@@ -258,6 +259,7 @@ The below document loaders allow you to load data from common data formats.
+
diff --git a/src/oss/python/integrations/document_loaders/opendataloader_pdf.mdx b/src/oss/python/integrations/document_loaders/opendataloader_pdf.mdx
new file mode 100644
index 000000000..68929a331
--- /dev/null
+++ b/src/oss/python/integrations/document_loaders/opendataloader_pdf.mdx
@@ -0,0 +1,67 @@
+---
+title: OpenDataLoader PDF
+---
+
+**Safe, Open, High-Performance — PDF for AI**
+
+[OpenDataLoader PDF](https://github.com/opendataloader-project/opendataloader-pdf) converts PDFs into JSON, Markdown, or HTML, ready to feed into modern AI stacks (LLMs, vector search, and RAG).
+
+It reconstructs document layout (headings, lists, tables, and reading order) so the content is easier to chunk, index, and query.
+Powered by fast, heuristic, rule-based inference, it runs entirely on your local machine and delivers high-throughput processing for large document sets.
+AI-safety is enabled by default and automatically filters likely prompt-injection content embedded in PDFs to reduce downstream risk.
+
+## Overview
+
+### Integration details
+
+| Class | Package | Local | Serializable | JS support |
+| :--- | :--- | :---: | :---: | :---: |
+| [OpenDataLoader PDF](https://github.com/opendataloader-project/opendataloader-pdf) | [langchain-opendataloader-pdf](https://pypi.org/project/langchain-opendataloader-pdf/) | ✅ | ❌ | ❌ |
+
+### Loader features
+
+| Source | Document Lazy Loading | Native Async Support |
+| :---: | :---: | :---: |
+| OpenDataLoaderPDFLoader | ✅ | ❌ |
+
+The `OpenDataLoaderPDFLoader` component enables you to parse PDFs into structured `Document` objects.
+
+## Requirements
+- Python >= 3.9
+- Java 11 or newer available on the system `PATH`
+- opendataloader-pdf >= 1.1.1
+
+## Installation
+```bash
+pip install -U langchain-opendataloader-pdf
+```
+
+## Quick start
+```python
+from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
+
+loader = OpenDataLoaderPDFLoader(
+    file_path=["path/to/document.pdf", "path/to/folder"],
+    format="text"
+)
+documents = loader.load()
+
+for doc in documents:
+    print(doc.metadata, doc.page_content[:80])
+```
+
+## Parameters
+
+| Parameter | Type | Required | Default | Description |
+|--------------------------|-----------------------| ---------- |--------------|--------------------------------------------------------------------------------------------------------------------|
+| `file_path` | `List[str]` | ✅ Yes | — | One or more PDF file paths or directories to process. |
+| `format` | `str` | No | `None` | Output formats (e.g. `"json"`, `"html"`, `"markdown"`, `"text"`). |
+| `quiet` | `bool` | No | `False` | Suppresses CLI logging output when `True`. |
+| `content_safety_off` | `Optional[List[str]]` | No | `None` | List of content safety filters to disable (e.g. `"all"`, `"hidden-text"`, `"off-page"`, `"tiny"`, `"hidden-ocg"`). |
+
+## Additional Resources
+
+- [LangChain OpenDataLoader PDF integration GitHub](https://github.com/opendataloader-project/langchain-opendataloader-pdf)
+- [LangChain OpenDataLoader PDF integration PyPI package](https://pypi.org/project/langchain-opendataloader-pdf/)
+- [OpenDataLoader PDF GitHub](https://github.com/opendataloader-project/opendataloader-pdf)
+- [OpenDataLoader PDF Homepage](https://opendataloader.org/)
diff --git a/src/oss/python/integrations/providers/all_providers.mdx b/src/oss/python/integrations/providers/all_providers.mdx
index 706070481..3521c4120 100644
--- a/src/oss/python/integrations/providers/all_providers.mdx
+++ b/src/oss/python/integrations/providers/all_providers.mdx
@@ -1990,6 +1990,14 @@ Browse the complete collection of integrations available for Python. LangChain P
 > GPT models and comprehensive AI platform.
+
+
+ Safe, Open, High-Performance — PDF for AI
+ **Safe, Open, High-Performance — PDF for AI**
+
+> [OpenDataLoader PDF](https://github.com/opendataloader-project/opendataloader-pdf) converts PDFs into JSON, Markdown, or HTML, ready to feed into modern AI stacks (LLMs, vector search, and RAG).
+>
+> It reconstructs document layout (headings, lists, tables, and reading order) so the content is easier to chunk, index, and query.
+> Powered by fast, heuristic, rule-based inference, it runs entirely on your local machine and delivers high-throughput processing for large document sets.
+> AI-safety is enabled by default and automatically filters likely prompt-injection content embedded in PDFs to reduce downstream risk.
+
+## Requirements
+- Python >= 3.9
+- Java 11 or newer available on the system `PATH`
+- opendataloader-pdf >= 1.1.1
+
+## Installation
+```bash
+pip install -U langchain-opendataloader-pdf
+```
+
+## Quick start
+```python
+from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
+
+loader = OpenDataLoaderPDFLoader(
+    file_path=["path/to/document.pdf", "path/to/folder"],
+    format="text"
+)
+documents = loader.load()
+
+for doc in documents:
+    print(doc.metadata, doc.page_content[:80])
+```
+
+## Parameters
+
+| Parameter | Type | Required | Default | Description |
+|--------------------------|-----------------------| ---------- |--------------|--------------------------------------------------------------------------------------------------------------------|
+| `file_path` | `List[str]` | ✅ Yes | — | One or more PDF file paths or directories to process. |
+| `format` | `str` | No | `None` | Output formats (e.g. `"json"`, `"html"`, `"markdown"`, `"text"`). |
+| `quiet` | `bool` | No | `False` | Suppresses CLI logging output when `True`. |
+| `content_safety_off` | `Optional[List[str]]` | No | `None` | List of content safety filters to disable (e.g. `"all"`, `"hidden-text"`, `"off-page"`, `"tiny"`, `"hidden-ocg"`). |
+
+## Additional Resources
+
+- [LangChain OpenDataLoader PDF integration GitHub](https://github.com/opendataloader-project/langchain-opendataloader-pdf)
+- [LangChain OpenDataLoader PDF integration PyPI package](https://pypi.org/project/langchain-opendataloader-pdf/)
+- [OpenDataLoader PDF GitHub](https://github.com/opendataloader-project/opendataloader-pdf)
+- [OpenDataLoader PDF Homepage](https://opendataloader.org/)
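
The Parameters table and the "Document Lazy Loading" row in this patch suggest a usage pattern the quick start does not show: passing `quiet` and `content_safety_off`, and streaming documents instead of loading them all at once. The sketch below is a minimal illustration only. `build_loader_kwargs` and `iter_documents` are hypothetical helpers, not part of `langchain-opendataloader-pdf`, and the use of `lazy_load()` assumes the loader follows the standard LangChain `BaseLoader` interface.

```python
from typing import Any, Dict, Iterator, List, Optional


def build_loader_kwargs(
    paths: List[str],
    fmt: str = "markdown",
    quiet: bool = True,
    content_safety_off: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Collect the four documented constructor parameters into one dict.

    Hypothetical helper for illustration; not provided by the package.
    """
    if not paths:
        raise ValueError("file_path needs at least one PDF file or directory")
    kwargs: Dict[str, Any] = {
        "file_path": list(paths),
        "format": fmt,
        "quiet": quiet,
    }
    # Keep the default safety filters unless the caller explicitly opts out.
    if content_safety_off:
        kwargs["content_safety_off"] = list(content_safety_off)
    return kwargs


def iter_documents(kwargs: Dict[str, Any]) -> Iterator[Any]:
    """Yield Document objects one at a time (assumes BaseLoader.lazy_load)."""
    # Imported lazily so build_loader_kwargs works without the package or Java.
    from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

    loader = OpenDataLoaderPDFLoader(**kwargs)
    yield from loader.lazy_load()
```

A call such as `iter_documents(build_loader_kwargs(["path/to/folder"], content_safety_off=["hidden-text"]))` would then stream documents with only the hidden-text filter disabled; omitting `content_safety_off` leaves every safety filter at its default.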