# UnstructuredPDFLoader

[Unstructured](https://unstructured-io.github.io/unstructured/) supports a common interface for working with unstructured or semi-structured file formats, such as Markdown or PDF. LangChain's [UnstructuredPDFLoader](https://python.langchain.com/api_reference/unstructured/document_loaders/langchain_unstructured.UnstructuredPDFLoader.html) integrates with Unstructured to parse PDF documents into LangChain [Document](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html) objects.

Please see [this page](https://python.langchain.com/docs/integrations/providers/unstructured/) for more information on installing system requirements.

This notebook provides a quick overview for getting started with the `Unstructured` [document loader](https://python.langchain.com/docs/concepts/document_loaders). For detailed documentation of all UnstructuredPDFLoader features and configurations, head to the [API reference](https://python.langchain.com/api_reference/unstructured/document_loaders/langchain_unstructured.document_loaders.UnstructuredPDFLoader.html).

  

## Overview
### Integration details

| Class                                                                                                                                                      | Package                                                                                   | Local | Serializable | JS support|
|:-----------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------| :---: | :---: |  :---: |
| [UnstructuredPDFLoader](https://python.langchain.com/api_reference/unstructured/document_loaders/langchain_unstructured.document_loaders.UnstructuredPDFLoader.html) | [langchain_unstructured](https://python.langchain.com/api_reference/unstructured/index.html) | ✅ | ❌ | ❌ |

---------   

### Loader features

|        Source         | Document Lazy Loading | Native Async Support | Extract Images | Extract Tables |
|:---------------------:| :---: | :---: | :---: |:---: |
| UnstructuredPDFLoader | ✅ | ❌ | ✅ | ✅ |

  

## Setup

### Credentials

No credentials are required to use UnstructuredPDFLoader.

If you want best-in-class automated tracing of your model calls, you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting the lines below:

In [None]:
# import getpass
# import os
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

### Installation

Install **langchain_unstructured**.

In [None]:
%pip install -qU langchain_unstructured

## Initialization

Now we can instantiate our loader object and load documents:

In [None]:
from langchain_unstructured.document_loaders import UnstructuredPDFLoader

STRATEGY = "fast"
file_path = "./example_data/layout-parser-paper.pdf"
loader = UnstructuredPDFLoader(file_path, strategy=STRATEGY)

## Load

In [None]:
docs = loader.load()
docs[0]

In [None]:
import pprint

pprint.pp(docs[0].metadata)

## Lazy Load


In [None]:
pages = []
for doc in loader.lazy_load():
    pages.append(doc)
    if len(pages) >= 10:
        # do some paged operation on the current batch, e.g.
        # index.upsert(pages)

        pages = []
len(pages)
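
The batching pattern in the loop above can also be factored into a small reusable helper. The following is a minimal sketch, not part of the loader's API: `batched` is a hypothetical helper name, and a generator of strings stands in for `loader.lazy_load()` so the example runs without any PDF.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in for loader.lazy_load(): any generator of documents works here.
fake_docs = (f"doc-{i}" for i in range(25))

batch_sizes = [len(batch) for batch in batched(fake_docs, 10)]
print(batch_sizes)  # [10, 10, 5]
```

Unlike the inline loop, this version also hands over the final, partially filled batch, so no documents are left unprocessed when the total is not a multiple of the batch size.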

In [None]:
print(pages[0].page_content[:100])
pprint.pp(pages[0].metadata)

The metadata attribute contains at least the following keys:
- source
- page (if in mode *page*)
- total_page
- creationdate
- creator
- producer

Additional metadata are specific to each parser.
These pieces of information can be helpful (to categorize your PDFs for example).
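
As a rough illustration of the categorization idea, the sketch below groups files by their `producer` metadata. The dictionaries are invented stand-ins for `doc.metadata`, and the values are assumptions rather than output from a real PDF.

```python
from collections import defaultdict

# Hypothetical metadata dicts shaped like `doc.metadata`; values are invented.
metadatas = [
    {"source": "a.pdf", "producer": "pdfTeX", "total_page": 16},
    {"source": "b.pdf", "producer": "LibreOffice", "total_page": 3},
    {"source": "c.pdf", "producer": "pdfTeX"},
    {"source": "d.pdf"},
]

by_producer = defaultdict(list)
for metadata in metadatas:
    # .get() guards against keys that a given parser or file does not supply.
    by_producer[metadata.get("producer", "unknown")].append(metadata["source"])

grouped = dict(by_producer)
print(grouped)
```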

### Retain Elements

Under the hood, Unstructured creates different "elements" for different chunks of text. By default we combine those together, but you can easily keep that separation by specifying `mode="elements"`.

In [None]:
file_path = "./example_data/layout-parser-paper.pdf"
loader = UnstructuredPDFLoader(
    file_path,
    mode="elements",
    strategy=STRATEGY,
)

data = loader.load()
data[0]

See the full set of element types for this particular document:

In [None]:
set(doc.metadata["category"] for doc in data)
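
Beyond listing the distinct element types, it can be useful to see how often each one occurs. This is a small sketch that assumes the categories have already been collected into a list; the sample values below are placeholders, not the actual output for this document.

```python
from collections import Counter

# Stand-in for [doc.metadata["category"] for doc in data]; sample values only.
categories = [
    "Title",
    "NarrativeText",
    "NarrativeText",
    "ListItem",
    "Title",
    "NarrativeText",
]

counts = Counter(categories)
for category, count in counts.most_common():
    print(f"{category}: {count}")
```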

## Splitting mode & custom pages delimiter

When loading the PDF file you can split it in two different ways:
- By page
- As a single text flow

By default, UnstructuredPDFLoader splits the PDF by page.

### Extract the PDF by page. Each page is extracted as a LangChain Document object:

In [None]:
loader = UnstructuredPDFLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
    strategy=STRATEGY,
)
docs = loader.load()
print(len(docs))
pprint.pp(docs[0].metadata)

In this mode, the PDF is split by page and each resulting Document's metadata contains the page number. In some cases, however, you may want to process the PDF as a single text flow so that paragraphs are not cut in half at page boundaries. For that, use the *single* mode:

### Extract the whole PDF as a single LangChain Document object:

In [None]:
loader = UnstructuredPDFLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="single",
    strategy=STRATEGY,
)
docs = loader.load()
print(len(docs))
pprint.pp(docs[0].metadata)

Logically, in this mode, the `page_number` metadata disappears. Here's how to clearly identify where pages end in the text flow:

### Add a custom *pages_delimitor* to mark where pages end in *single* mode:

In [None]:
loader = UnstructuredPDFLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="single",
    strategy=STRATEGY,
    pages_delimitor="\n-------THIS IS A CUSTOM END OF PAGE-------\n",
)
docs = loader.load()
print(docs[0].page_content[:5780])

This could simply be `\n`, or `\f` to clearly indicate a page change, or `<!-- PAGE BREAK -->` for seamless injection in a Markdown viewer without a visual effect.
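
One practical consequence of a distinctive delimiter is that page boundaries can be recovered later with a plain string split. The sketch below assumes the delimiter appears between consecutive pages and not after the last one; the page text is invented and stands in for `docs[0].page_content`.

```python
PAGE_DELIMITER = "\n-------THIS IS A CUSTOM END OF PAGE-------\n"

# Stand-in for docs[0].page_content in single mode; the text is invented.
page_content = PAGE_DELIMITER.join(
    ["First page text.", "Second page text.", "Third page text."]
)

# Splitting on the delimiter recovers one string per page.
pages = page_content.split(PAGE_DELIMITER)
for number, text in enumerate(pages, start=1):
    print(number, text)
```

A delimiter that could plausibly occur in the body text, such as a bare `\n`, would not round-trip this way, which is one reason to prefer something unambiguous.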

## Extract images from the PDF

You can extract images from your PDFs with a choice of three different solutions:
- RapidOCR (a lightweight Optical Character Recognition tool)
- Tesseract (an OCR tool with high precision)
- A multimodal language model

You can tune these functions to choose the output format of the extracted images: *html*, *markdown* or *text*.

The result is inserted between the last and the second-to-last paragraphs of text of the page.
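
To make that placement concrete, here is a toy illustration in plain Python. It is not the loader's implementation: `insert_before_last_paragraph` is a hypothetical helper, and it assumes paragraphs are separated by blank lines.

```python
def insert_before_last_paragraph(page_text: str, image_text: str) -> str:
    """Insert image_text between the second-to-last and last paragraphs."""
    paragraphs = page_text.split("\n\n")
    # Index len - 1 places the new text just before the final paragraph.
    paragraphs.insert(max(len(paragraphs) - 1, 0), image_text)
    return "\n\n".join(paragraphs)


merged = insert_before_last_paragraph(
    "Intro.\n\nBody.\n\nConclusion.",
    "[text extracted from image]",
)
print(merged)
```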

### Extract images from the PDF with RapidOCR:

In [None]:
%pip install -qU rapidocr-onnxruntime

In [None]:
from langchain_community.document_loaders.parsers.pdf import (
    convert_images_to_text_with_rapidocr,
)

loader = UnstructuredPDFLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
    strategy=STRATEGY,
    extract_images=True,
    images_to_text=convert_images_to_text_with_rapidocr(format="html"),
)
docs = loader.load()

print(docs[5].page_content)

Note that RapidOCR is designed to work with Chinese and English, not other languages.

### Extract images from the PDF with Tesseract:

In [None]:
%pip install -qU pytesseract

In [None]:
from langchain_community.document_loaders.parsers.pdf import (
    convert_images_to_text_with_tesseract,
)

loader = UnstructuredPDFLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
    strategy=STRATEGY,
    extract_images=True,
    images_to_text=convert_images_to_text_with_tesseract(format="text"),
)
docs = loader.load()
print(docs[5].page_content)

### Extract images from the PDF with a multimodal model:

In [None]:
%pip install -qU langchain_openai

In [None]:
import os

from dotenv import load_dotenv

load_dotenv()

In [None]:
from getpass import getpass

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key =")

In [None]:
from langchain_community.document_loaders.parsers.pdf import (
    convert_images_to_description,
)
from langchain_openai import ChatOpenAI

loader = UnstructuredPDFLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
    strategy=STRATEGY,
    extract_images=True,
    images_to_text=convert_images_to_description(
        model=ChatOpenAI(model="gpt-4o", max_tokens=1024), format="markdown"
    ),
)
docs = loader.load()
print(docs[5].page_content)

## Extract tables from the PDF

With UnstructuredPDFLoader you can extract tables from your PDFs in *html*, *markdown* or *csv* format:

In [None]:
loader = UnstructuredPDFLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
    strategy=STRATEGY,
    extract_tables="markdown",
)
docs = loader.load()
print(docs[4].page_content)

## Working with Files

Many document loaders involve parsing files. The difference between such loaders usually stems from how the file is parsed, rather than how the file is loaded. For example, you can use `open` to read the binary content of either a PDF or a markdown file, but you need different parsing logic to convert that binary data into text.

As a result, it can be helpful to decouple the parsing logic from the loading logic, which makes it easier to re-use a given parser regardless of how the data was loaded.
You can use this strategy to analyze different files, with the same parsing parameters.

In [None]:
from langchain_community.document_loaders import FileSystemBlobLoader
from langchain_community.document_loaders.generic import GenericLoader
from langchain_unstructured.document_loaders import UnstructuredPDFParser

loader = GenericLoader(
    blob_loader=FileSystemBlobLoader(
        path="./example_data/",
        glob="*.pdf",
    ),
    blob_parser=UnstructuredPDFParser(
        strategy=STRATEGY,
    ),
)
docs = loader.load()
print(docs[0].page_content)
pprint.pp(docs[0].metadata)
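
The same separation of concerns can be sketched without any library. In the toy example below, `load_blobs`, `parse_text`, and `load_and_parse` are hypothetical names introduced for illustration, with text files standing in for PDFs: the loading step only yields raw bytes, and the parser is passed in as an interchangeable function.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, Iterator, Tuple


def load_blobs(directory: Path, pattern: str) -> Iterator[Tuple[str, bytes]]:
    """Loading step: yield (file name, raw bytes) with no knowledge of the format."""
    for path in sorted(directory.glob(pattern)):
        yield path.name, path.read_bytes()


def parse_text(raw: bytes) -> str:
    """Parsing step: a toy parser that decodes UTF-8 and strips whitespace."""
    return raw.decode("utf-8").strip()


def load_and_parse(
    directory: Path, pattern: str, parser: Callable[[bytes], str]
) -> Dict[str, str]:
    """Combine any blob source with any parser."""
    return {name: parser(raw) for name, raw in load_blobs(directory, pattern)}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("  alpha \n", encoding="utf-8")
    (root / "b.txt").write_text("beta", encoding="utf-8")
    (root / "skip.md").write_text("ignored", encoding="utf-8")
    parsed = load_and_parse(root, "*.txt", parse_text)

print(parsed)
```

Swapping `parse_text` for another function changes how the bytes are interpreted without touching the file discovery code, which is the property the blob loader and blob parser split is meant to provide.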

It is possible to work with files from cloud storage.

In [None]:
from langchain_community.document_loaders import CloudBlobLoader
from langchain_community.document_loaders.generic import GenericLoader

loader = GenericLoader(
    blob_loader=CloudBlobLoader(
        url="s3://mybucket",  # Supports s3://, az://, gs://, file:// schemes.
        glob="*.pdf",
    ),
    blob_parser=UnstructuredPDFParser(
        strategy=STRATEGY,
    ),
)
docs = loader.load()
print(docs[0].page_content)
pprint.pp(docs[0].metadata)
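
Because the storage URL is easy to mistype, a quick sanity check before constructing the loader can help. This sketch uses only the standard library; `is_valid_bucket_url` is a hypothetical helper, and the set of schemes is taken from the comment in the cell above rather than from the library itself.

```python
from urllib.parse import urlparse

# Schemes listed in the comment above; treated here as an assumption.
SUPPORTED_SCHEMES = {"s3", "az", "gs", "file"}


def is_valid_bucket_url(url: str) -> bool:
    """Return True when the URL has a supported scheme and a bucket or path."""
    parsed = urlparse(url)
    if parsed.scheme not in SUPPORTED_SCHEMES:
        return False
    if parsed.scheme == "file":
        # file:// URLs carry their location in the path component.
        return bool(parsed.path)
    # Cloud schemes carry the bucket name in the network-location component.
    return bool(parsed.netloc)


for candidate in ["s3://mybucket", "s3:/mybucket", "file:///tmp/data", "ftp://host"]:
    print(candidate, is_valid_bucket_url(candidate))
```

A URL with a single slash after the scheme parses with an empty bucket name, so the check rejects it.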

### Using Unstructured via API

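To partition documents through the Unstructured API instead of locally, set the `UNSTRUCTURED_API_KEY` environment variable: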

In [None]:
import os
from getpass import getpass

from dotenv import load_dotenv

# Load variables from a local .env file, if one exists.
load_dotenv()

if not os.environ.get("UNSTRUCTURED_API_KEY"):
    os.environ["UNSTRUCTURED_API_KEY"] = getpass("Unstructured API key =")

In [None]:
loader = UnstructuredPDFLoader(
    file_path=file_path,
    partition_via_api=True,
    strategy="fast",
    include_page_breaks=True,
    coordinates=True,
)

docs = loader.load()
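
If the same notebook has to run both with and without an API key, one option is to build the loader arguments conditionally. The helper below is a hypothetical sketch, not part of `langchain_unstructured`: it only assembles a dictionary of the keyword arguments used above and enables `partition_via_api` when the key is present.

```python
import os


def partition_kwargs(strategy: str = "fast") -> dict:
    """Hypothetical helper: enable API partitioning only when a key is set."""
    kwargs = {"strategy": strategy}
    if os.environ.get("UNSTRUCTURED_API_KEY"):
        kwargs["partition_via_api"] = True
    return kwargs


print(partition_kwargs())
```

The result could then be unpacked into the constructor, for example `UnstructuredPDFLoader(file_path, **partition_kwargs())`, so that local partitioning is used whenever no key is configured.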


## API reference

For detailed documentation of all `UnstructuredPDFLoader` features and configurations head to the API reference: https://python.langchain.com/api_reference/unstructured/