# PDFMinerLoader

This notebook provides a quick overview for getting started with the `PDFMiner` [document loader](https://python.langchain.com/docs/concepts/document_loaders). For detailed documentation of all `PDFMinerLoader` features and configurations, head to the [API reference](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PDFMinerLoader.html).

  

## Overview
### Integration details

| Class                                                                                                                                                | Package | Local | Serializable | JS support|
|:-----------------------------------------------------------------------------------------------------------------------------------------------------| :--- | :---: | :---: |  :---: |
| [PDFMinerLoader](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PDFMinerLoader.html) | [langchain_community](https://python.langchain.com/api_reference/community/index.html) | ✅ | ❌ | ❌ |

---------   

### Loader features

|     Source     | Document Lazy Loading | Native Async Support | Extract Images | Extract Tables |
|:--------------:| :---: | :---: | :---: |:---: |
| PDFMinerLoader | ✅ | ❌ | ✅ | ✅ |

  

## Setup

### Credentials

No credentials are required to use `PDFMinerLoader`.

If you want automated, best-in-class tracing of your model calls, you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting the lines below:

In [1]:
# import getpass
# import os
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

### Installation

Install **langchain_community** and **pdfminer.six**.

In [2]:
%pip install -qU langchain_community pdfminer.six

Note: you may need to restart the kernel to use updated packages.


## Initialization

Now we can instantiate our loader object and load documents:

In [3]:
from langchain_community.document_loaders import PDFMinerLoader

file_path = "./example_data/layout-parser-paper.pdf"
loader = PDFMinerLoader(file_path)

## Load

In [4]:
docs = loader.load()
docs[0]

Document(metadata={'author': '', 'creationdate': '2021-06-22T01:27:10+00:00', 'creator': 'LaTeX with hyperref', 'keywords': '', 'moddate': '2021-06-22T01:27:10+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'producer': 'pdfTeX-1.40.21', 'subject': '', 'title': '', 'trapped': 'False', 'total_pages': 16, 'source': './example_data/layout-parser-paper.pdf'}, page_content='1\n2\n0\n2\n\nn\nu\nJ\n\n1\n2\n\n]\n\nV\nC\n.\ns\nc\n[\n\n2\nv\n8\n4\n3\n5\n1\n.\n3\n0\n1\n2\n:\nv\ni\nX\nr\na\n\nLayoutParser: A Uniﬁed Toolkit for Deep\nLearning Based Document Image Analysis\n\nZejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n\n1 Allen Institute for AI\nshannons@allenai.org\n2 Brown University\nruochen zhang@brown.edu\n3 Harvard University\n{melissadell,jacob carlson}@fas.harvard.edu\n4 University of Washington\nbcgl@cs.washington.edu\n5 University of Waterloo\nw4

In [5]:
import pprint

pprint.pp(docs[0].metadata)

{'author': '',
 'creationdate': '2021-06-22T01:27:10+00:00',
 'creator': 'LaTeX with hyperref',
 'keywords': '',
 'moddate': '2021-06-22T01:27:10+00:00',
 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live '
                    '2020) kpathsea version 6.3.2',
 'producer': 'pdfTeX-1.40.21',
 'subject': '',
 'title': '',
 'trapped': 'False',
 'total_pages': 16,
 'source': './example_data/layout-parser-paper.pdf'}


## Lazy Load


In [6]:
pages = []
for doc in loader.lazy_load():
    pages.append(doc)
    if len(pages) >= 10:
        # do some paged operation, e.g.
        # index.upsert(page)

        pages = []
len(pages)

1

In [7]:
print(pages[0].page_content[:100])
pprint.pp(pages[0].metadata)

1
2
0
2

n
u
J

1
2

]

V
C
.
s
c
[

2
v
8
4
3
5
1
.
3
0
1
2
:
v
i
X
r
a

LayoutParser: A Uniﬁed Too
{'author': '',
 'creationdate': '2021-06-22T01:27:10+00:00',
 'creator': 'LaTeX with hyperref',
 'keywords': '',
 'moddate': '2021-06-22T01:27:10+00:00',
 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live '
                    '2020) kpathsea version 6.3.2',
 'producer': 'pdfTeX-1.40.21',
 'subject': '',
 'title': '',
 'trapped': 'False',
 'total_pages': 16,
 'source': './example_data/layout-parser-paper.pdf'}
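
The batching pattern in the lazy-load loop above can be factored into a small helper. The following is a minimal sketch, assuming a hypothetical `batched` helper (not part of LangChain) and a plain `range` as a stand-in for `loader.lazy_load()`. Unlike the loop above, it also yields the final partial batch, so no documents are left unprocessed.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        # islice pulls at most `size` items without exhausting the source.
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in for loader.lazy_load(): any iterable works, including a generator.
fake_pages = range(16)
batch_sizes = [len(batch) for batch in batched(fake_pages, 10)]
print(batch_sizes)
```

With a real loader, the loop body would be something like `for batch in batched(loader.lazy_load(), 10): ...`, where each batch is handed to an index or another paged operation.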


The metadata attribute contains at least the following keys:
- source
- page (if in mode *page*)
- total_pages
- creationdate
- creator
- producer

Additional metadata are specific to each parser.
These pieces of information can be helpful (to categorize your PDFs for example).
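
As an illustration of categorizing PDFs by metadata, here is a minimal sketch that groups sources by the `producer` key. The sample dictionaries and the `group_by_key` helper are hypothetical; they only mimic the shape of the metadata printed above.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical metadata dicts, shaped like the `metadata` output shown above.
sample_metadata = [
    {"source": "a.pdf", "producer": "pdfTeX-1.40.21", "total_pages": 16},
    {"source": "b.pdf", "producer": "pdfTeX-1.40.21", "total_pages": 4},
    {"source": "c.pdf", "producer": "", "total_pages": 2},
]


def group_by_key(metadata_list: List[dict], key: str) -> Dict[str, List[str]]:
    """Map each value of `key` to the list of sources that carry it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for metadata in metadata_list:
        # Treat a missing or empty value as "unknown".
        value = metadata.get(key) or "unknown"
        groups[value].append(metadata["source"])
    return dict(groups)


by_producer = group_by_key(sample_metadata, "producer")
print(by_producer)
```

With loaded documents, the input would be `[doc.metadata for doc in docs]`.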

## Splitting mode & custom pages delimiter

When loading the PDF file you can split it in two different ways:
- By page
- As a single text flow

By default, `PDFMinerLoader` extracts the PDF as a single text flow, which is why the outputs above contain one Document with no `page` key.

### Extract the PDF by page. Each page is extracted as a langchain Document object:

In [8]:
loader = PDFMinerLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
)
docs = loader.load()
print(len(docs))
pprint.pp(docs[0].metadata)

16
{'author': '',
 'creationdate': '2021-06-22T01:27:10+00:00',
 'creator': 'LaTeX with hyperref',
 'keywords': '',
 'moddate': '2021-06-22T01:27:10+00:00',
 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live '
                    '2020) kpathsea version 6.3.2',
 'producer': 'pdfTeX-1.40.21',
 'subject': '',
 'title': '',
 'trapped': 'False',
 'total_pages': 16,
 'source': './example_data/layout-parser-paper.pdf',
 'page': 0}


In this mode, the PDF is split by page and each resulting Document's metadata contains the page number. In some cases, however, you may want to process the PDF as a single text flow so that paragraphs are not cut in half at page boundaries. In that case, use the *single* mode:

### Extract the whole PDF as a single langchain Document object:

In [9]:
loader = PDFMinerLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="single",
)
docs = loader.load()
print(len(docs))
pprint.pp(docs[0].metadata)

1
{'author': '',
 'creationdate': '2021-06-22T01:27:10+00:00',
 'creator': 'LaTeX with hyperref',
 'keywords': '',
 'moddate': '2021-06-22T01:27:10+00:00',
 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live '
                    '2020) kpathsea version 6.3.2',
 'producer': 'pdfTeX-1.40.21',
 'subject': '',
 'title': '',
 'trapped': 'False',
 'total_pages': 16,
 'source': './example_data/layout-parser-paper.pdf'}


Logically, in this mode, the `page` metadata key disappears. Here's how to clearly identify where pages end in the text flow:

### Add a custom *pages_delimiter* to mark where pages end in *single* mode:

In [11]:
loader = PDFMinerLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="single",
    pages_delimiter="\n-------THIS IS A CUSTOM END OF PAGE-------\n",
)
docs = loader.load()
print(docs[0].page_content[:5780])

1
2
0
2

n
u
J

1
2

]

V
C
.
s
c
[

2
v
8
4
3
5
1
.
3
0
1
2
:
v
i
X
r
a

LayoutParser: A Uniﬁed Toolkit for Deep
Learning Based Document Image Analysis

Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain
Lee4, Jacob Carlson3, and Weining Li5

1 Allen Institute for AI
shannons@allenai.org
2 Brown University
ruochen zhang@brown.edu
3 Harvard University
{melissadell,jacob carlson}@fas.harvard.edu
4 University of Washington
bcgl@cs.washington.edu
5 University of Waterloo
w422li@uwaterloo.ca

Abstract. Recent advances in document image analysis (DIA) have been
primarily driven by the application of neural networks. Ideally, research
outcomes could be easily deployed in production and extended for further
investigation. However, various factors like loosely organized codebases
and sophisticated model conﬁgurations complicate the easy reuse of im-
portant innovations by a wide audience. Though there have been on-going
eﬀorts to improve reusability and simplify d

The delimiter could simply be `\n`, or `\f` to clearly indicate a page change, or `<!-- PAGE BREAK -->` for seamless injection in a Markdown viewer without a visual effect.
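
A custom delimiter also makes it possible to recover per-page text from a single-mode Document later on. The sketch below assumes the delimiter appears only between pages and never inside the page text. The sample text and the `split_pages` helper are hypothetical; in practice the text would come from `docs[0].page_content`.

```python
from typing import List

# Same delimiter as in the cell above; the page texts are made-up placeholders.
PAGES_DELIMITER = "\n-------THIS IS A CUSTOM END OF PAGE-------\n"
single_flow = PAGES_DELIMITER.join(
    ["first page text", "second page text", "third page text"]
)


def split_pages(text: str, delimiter: str) -> List[str]:
    """Split a single text flow back into per-page strings."""
    if not delimiter:
        raise ValueError("delimiter must be a non-empty string")
    return text.split(delimiter)


recovered_pages = split_pages(single_flow, PAGES_DELIMITER)
print(len(recovered_pages))
```

A distinctive delimiter is safer for this purpose than `\n`, which also occurs inside every page.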

## Extract images from the PDF

You can extract images from your PDFs with a choice of three different solutions:
- RapidOCR (lightweight Optical Character Recognition tool)
- Tesseract (OCR tool with high precision)
- Multimodal language model

You can choose the output format of the content extracted from images with `images_inner_format`: *html*, *markdown*, or *text* (the examples below pass `html-img` and `markdown-img`).

The result is inserted between the last and the second-to-last paragraphs of text of the page.
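
To give a rough idea of what the three formats mean, the sketch below wraps a piece of OCR text in each style. The `wrap_image_text` helper is a hypothetical illustration written for this notebook, not the loader's own code, and the exact markup produced by `PDFMinerLoader` may differ.

```python
import html


def wrap_image_text(text: str, inner_format: str) -> str:
    """Wrap text extracted from an image (hypothetical illustration only)."""
    if inner_format == "text":
        # Plain text: the extracted content is inserted as-is.
        return text
    if inner_format == "markdown-img":
        # Markdown image whose alt text is the extracted content.
        escaped = text.replace("]", "\\]")
        return f"![{escaped}](#)"
    if inner_format == "html-img":
        # HTML image tag whose alt attribute is the extracted content.
        return f'<img alt="{html.escape(text, quote=True)}" src="#" />'
    raise ValueError(f"unknown format: {inner_format!r}")


for fmt in ("text", "markdown-img", "html-img"):
    print(fmt, "->", wrap_image_text("Fig. 2 caption", fmt))
```

Carrying the extracted content as alt text keeps it searchable while letting a Markdown or HTML viewer treat it as an image placeholder.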

### Extract images from the PDF with rapidOCR:

In [12]:
%pip install -qU rapidocr-onnxruntime

Note: you may need to restart the kernel to use updated packages.


In [14]:
from langchain_community.document_loaders.parsers import RapidOCRBlobParser

loader = PDFMinerLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
    images_inner_format="markdown-img",
    images_parser=RapidOCRBlobParser(),
)
docs = loader.load()

print(docs[5].page_content)

6

Z. Shen et al.

Fig. 2: The relationship between the three types of layout data structures.
Coordinate supports three kinds of variation; TextBlock consists of the co-
ordinate information and extra features like block text, types, and reading orders;
a Layout object is a list of all possible layout elements, including other Layout
objects. They all support the same set of transformation and operation APIs for
maximum ﬂexibility.

Shown in Table 1, LayoutParser currently hosts 9 pre-trained models trained
on 5 diﬀerent datasets. Description of the training dataset is provided alongside
with the trained models such that users can quickly identify the most suitable
models for their tasks. Additionally, when such a model is not readily available,
LayoutParser also supports training customized layout models and community
sharing of the models (detailed in Section 3.5).

3.2 Layout Data Structures

A critical feature of LayoutParser is the implementation of a series of data
structures an

Note that RapidOCR is designed to work with Chinese and English, not other languages.

### Extract images from the PDF with Tesseract:

In [15]:
%pip install -qU pytesseract

Note: you may need to restart the kernel to use updated packages.


In [16]:
from langchain_community.document_loaders.parsers import TesseractBlobParser

loader = PDFMinerLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
    images_inner_format="html-img",
    images_parser=TesseractBlobParser(),
)
docs = loader.load()
print(docs[5].page_content)

6

Z. Shen et al.

Fig. 2: The relationship between the three types of layout data structures.
Coordinate supports three kinds of variation; TextBlock consists of the co-
ordinate information and extra features like block text, types, and reading orders;
a Layout object is a list of all possible layout elements, including other Layout
objects. They all support the same set of transformation and operation APIs for
maximum ﬂexibility.

Shown in Table 1, LayoutParser currently hosts 9 pre-trained models trained
on 5 diﬀerent datasets. Description of the training dataset is provided alongside
with the trained models such that users can quickly identify the most suitable
models for their tasks. Additionally, when such a model is not readily available,
LayoutParser also supports training customized layout models and community
sharing of the models (detailed in Section 3.5).

3.2 Layout Data Structures

A critical feature of LayoutParser is the implementation of a series of data
structures an

### Extract images from the PDF with a multimodal model:

In [17]:
%pip install -qU langchain_openai

Note: you may need to restart the kernel to use updated packages.


In [18]:
import os

from dotenv import load_dotenv

load_dotenv()

True

In [19]:
from getpass import getpass

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key =")

In [21]:
from langchain_community.document_loaders.parsers import LLMImageBlobParser
from langchain_openai import ChatOpenAI

loader = PDFMinerLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
    images_inner_format="markdown-img",
    images_parser=LLMImageBlobParser(model=ChatOpenAI(model="gpt-4o", max_tokens=1024)),
)
docs = loader.load()
print(docs[5].page_content)

6

Z. Shen et al.

Fig. 2: The relationship between the three types of layout data structures.
Coordinate supports three kinds of variation; TextBlock consists of the co-
ordinate information and extra features like block text, types, and reading orders;
a Layout object is a list of all possible layout elements, including other Layout
objects. They all support the same set of transformation and operation APIs for
maximum ﬂexibility.

Shown in Table 1, LayoutParser currently hosts 9 pre-trained models trained
on 5 diﬀerent datasets. Description of the training dataset is provided alongside
with the trained models such that users can quickly identify the most suitable
models for their tasks. Additionally, when such a model is not readily available,
LayoutParser also supports training customized layout models and community
sharing of the models (detailed in Section 3.5).

3.2 Layout Data Structures

A critical feature of LayoutParser is the implementation of a series of data
structures an

## Working with Files

Many document loaders involve parsing files. The difference between such loaders usually stems from how the file is parsed, rather than how the file is loaded. For example, you can use `open` to read the binary content of either a PDF or a markdown file, but you need different parsing logic to convert that binary data into text.

As a result, it can be helpful to decouple the parsing logic from the loading logic, which makes it easier to re-use a given parser regardless of how the data was loaded.
You can use this strategy to analyze different files, with the same parsing parameters.
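
The idea of decoupling can be sketched with plain Python, independent of LangChain. In this minimal example all names (`iter_file_bytes`, `parse_as_text`, `load_documents`) are hypothetical, and a trivial UTF-8 decoder stands in for a real PDF parser: the loading step knows nothing about the file format, and the parsing step knows nothing about where the bytes came from.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, Iterable, Iterator, Tuple


def iter_file_bytes(directory: Path, pattern: str) -> Iterator[Tuple[str, bytes]]:
    """Loading step: yield (source, raw bytes) pairs, unaware of the format."""
    for path in sorted(directory.glob(pattern)):
        yield str(path), path.read_bytes()


def parse_as_text(raw: bytes) -> str:
    """Parsing step (stand-in): decode UTF-8. A PDF parser would go here."""
    return raw.decode("utf-8")


def load_documents(
    blobs: Iterable[Tuple[str, bytes]], parser: Callable[[bytes], str]
) -> Iterator[Dict[str, str]]:
    """Combine any byte source with any parser."""
    for source, raw in blobs:
        yield {"source": source, "page_content": parser(raw)}


with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp)
    (directory / "a.txt").write_text("hello", encoding="utf-8")
    (directory / "b.txt").write_text("world", encoding="utf-8")
    # Consume the generator while the temporary files still exist.
    documents = list(load_documents(iter_file_bytes(directory, "*.txt"), parse_as_text))

contents = [document["page_content"] for document in documents]
print(contents)
```

Swapping the byte source (local disk, cloud storage) or the parser then requires changing only one argument.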

It is possible to work with files from cloud storage.

In [None]:
from langchain_community.document_loaders import CloudBlobLoader
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import PDFMinerParser

loader = GenericLoader(
    blob_loader=CloudBlobLoader(
        url="s3://mybucket",  # Supports s3://, az://, gs://, file:// schemes.
        glob="*.pdf",
    ),
    blob_parser=PDFMinerParser(),
)
docs = loader.load()
print(docs[0].page_content)
pprint.pp(docs[0].metadata)

## Using PDFMiner to generate HTML text

This can be helpful for chunking text semantically into sections, since the output HTML can be parsed with `BeautifulSoup` to get richer, more structured information about font size, page numbers, PDF headers/footers, etc.

In [None]:
from langchain_community.document_loaders import PDFMinerPDFasHTMLLoader

file_path = "./example_data/layout-parser-paper.pdf"
loader = PDFMinerPDFasHTMLLoader(file_path)
docs = loader.load()
docs[0]

In [None]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(docs[0].page_content, "html.parser")
content = soup.find_all("div")

In [None]:
import re

cur_fs = None
cur_text = ""
snippets = []  # first collect all snippets that have the same font size
for c in content:
    sp = c.find("span")
    if not sp:
        continue
    st = sp.get("style")
    if not st:
        continue
    fs = re.findall(r"font-size:(\d+)px", st)
    if not fs:
        continue
    fs = int(fs[0])
    if not cur_fs:
        cur_fs = fs
    if fs == cur_fs:
        cur_text += c.text
    else:
        snippets.append((cur_text, cur_fs))
        cur_fs = fs
        cur_text = c.text
snippets.append((cur_text, cur_fs))
# Note: The above logic is very straightforward. One can also add more strategies such as removing duplicate snippets (as
# headers/footers in a PDF appear on multiple pages so if we find duplicates it's safe to assume that it is redundant info)
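
The duplicate-removal strategy mentioned in the note above could look like the following minimal sketch. The `drop_duplicate_snippets` helper and the sample data are hypothetical; the sketch assumes that any snippet whose stripped text and font size were already seen is redundant, which may be too aggressive for documents that legitimately repeat sentences.

```python
from typing import List, Set, Tuple

Snippet = Tuple[str, int]  # (text, font size), as collected in the cell above


def drop_duplicate_snippets(snippets: List[Snippet]) -> List[Snippet]:
    """Keep the first occurrence of each (text, font size) pair, in order."""
    seen: Set[Snippet] = set()
    unique: List[Snippet] = []
    for text, font_size in snippets:
        # Compare on stripped text so trailing newlines do not hide duplicates.
        key = (text.strip(), font_size)
        if key in seen:
            continue
        seen.add(key)
        unique.append((text, font_size))
    return unique


# Made-up snippets: a running header repeated on two pages.
sample_snippets = [
    ("Z. Shen et al.\n", 9),
    ("Body text of one page\n", 10),
    ("Z. Shen et al.\n", 9),
    ("Body text of another page\n", 10),
]
deduped = drop_duplicate_snippets(sample_snippets)
print(len(deduped))
```

Applied to the real data, this would be `snippets = drop_duplicate_snippets(snippets)` before building the semantic sections.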

In [None]:
from langchain_core.documents import Document

cur_idx = -1
semantic_snippets = []
# Assumption: headings have higher font size than their respective content
for s in snippets:
    # if current snippet's font size > previous section's heading => it is a new heading
    if (
        not semantic_snippets
        or s[1] > semantic_snippets[cur_idx].metadata["heading_font"]
    ):
        metadata = {"heading": s[0], "content_font": 0, "heading_font": s[1]}
        metadata.update(docs[0].metadata)
        semantic_snippets.append(Document(page_content="", metadata=metadata))
        cur_idx += 1
        continue

    # if current snippet's font size <= previous section's content => content belongs to the same section (one can also create
    # a tree like structure for sub sections if needed but that may require some more thinking and may be data specific)
    if (
        not semantic_snippets[cur_idx].metadata["content_font"]
        or s[1] <= semantic_snippets[cur_idx].metadata["content_font"]
    ):
        semantic_snippets[cur_idx].page_content += s[0]
        semantic_snippets[cur_idx].metadata["content_font"] = max(
            s[1], semantic_snippets[cur_idx].metadata["content_font"]
        )
        continue

    # if current snippet's font size > previous section's content but less than previous section's heading, then also make a new
    # section (e.g. title of a PDF will have the highest font size but we don't want it to subsume all sections)
    metadata = {"heading": s[0], "content_font": 0, "heading_font": s[1]}
    metadata.update(docs[0].metadata)
    semantic_snippets.append(Document(page_content="", metadata=metadata))
    cur_idx += 1

print(semantic_snippets[4])

## API reference

For detailed documentation of all `PDFMinerLoader` features and configurations head to the API reference: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PDFMinerLoader.html

In [22]:
from langchain_community.document_loaders import FileSystemBlobLoader
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import PDFMinerParser

loader = GenericLoader(
    blob_loader=FileSystemBlobLoader(
        path="./example_data/",
        glob="*.pdf",
    ),
    blob_parser=PDFMinerParser(),
)
docs = loader.load()
print(docs[0].page_content)
pprint.pp(docs[0].metadata)

1
2
0
2

n
u
J

1
2

]

V
C
.
s
c
[

2
v
8
4
3
5
1
.
3
0
1
2
:
v
i
X
r
a

LayoutParser: A Uniﬁed Toolkit for Deep
Learning Based Document Image Analysis

Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain
Lee4, Jacob Carlson3, and Weining Li5

1 Allen Institute for AI
shannons@allenai.org
2 Brown University
ruochen zhang@brown.edu
3 Harvard University
{melissadell,jacob carlson}@fas.harvard.edu
4 University of Washington
bcgl@cs.washington.edu
5 University of Waterloo
w422li@uwaterloo.ca

Abstract. Recent advances in document image analysis (DIA) have been
primarily driven by the application of neural networks. Ideally, research
outcomes could be easily deployed in production and extended for further
investigation. However, various factors like loosely organized codebases
and sophisticated model conﬁgurations complicate the easy reuse of im-
portant innovations by a wide audience. Though there have been on-going
eﬀorts to improve reusability and simplify d