# PyMuPDF4LLMLoader

This notebook provides a quick overview for getting started with the `PyMuPDF4LLM` [document loader](https://python.langchain.com/docs/concepts/document_loaders). For detailed documentation of all `PyMuPDF4LLMLoader` features and configurations, head to the [API reference](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PyMuPDF4LLMLoader.html).

  

## Overview
### Integration details

| Class | Package | Local | Serializable | JS support|
| :--- | :--- | :---: | :---: |  :---: |
| [PyMuPDF4LLMLoader](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PyMuPDF4LLMLoader.html) | [langchain_community](https://python.langchain.com/api_reference/community/index.html) | ✅ | ❌ | ❌ | 

---------   

### Loader features

| Source | Document Lazy Loading | Native Async Support | Extract Images | Extract Tables |
| :---: | :---: | :---: | :---: |:---: |
| PyMuPDF4LLMLoader | ✅ | ❌ | ✅ | ✅ |

  

## Setup

### Credentials

No credentials are required to use `PyMuPDF4LLMLoader`.

If you want automated, best-in-class tracing of your model calls, you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting the lines below:

In [4]:
# import getpass
# import os
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

### Installation

Install **langchain_community** and **pymupdf4llm**.

In [None]:
%pip install -qU langchain_community pymupdf4llm

## Initialization

Now we can instantiate the loader and load documents:

In [41]:
from langchain_community.document_loaders import PyMuPDF4LLMLoader

file_path = "./example_data/layout-parser-paper.pdf"
loader = PyMuPDF4LLMLoader(file_path)

## Load

In [6]:
docs = loader.load()
docs[0]

Document(metadata={'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2021-06-22T01:27:10+00:00', 'source': './example_data/layout-parser-paper.pdf', 'file_path': './example_data/layout-parser-paper.pdf', 'total_pages': 16, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2021-06-22T01:27:10+00:00', 'trapped': '', 'modDate': 'D:20210622012710Z', 'creationDate': 'D:20210622012710Z', 'page': 0}, page_content='```\nLayoutParser: A Unified Toolkit for Deep\n\n## Learning Based Document Image Analysis\n\n```\n\nZejiang Shen[1] (�), Ruochen Zhang[2], Melissa Dell[3], Benjamin Charles Germain\nLee[4], Jacob Carlson[3], and Weining Li[5]\n\n1 Allen Institute for AI\n```\n              shannons@allenai.org\n\n```\n2 Brown University\n```\n             ruochen zhang@brown.edu\n\n```\n3 Harvard University\n_{melissadell,jacob carlson}@fas.harvard.edu_\n4 University of Washington\n```\n              bcgl@cs.washington.edu\n\n`

In [55]:
print(docs[0].page_content[:250])

```
LayoutParser: A Unified Toolkit for Deep

## Learning Based Document Image Analysis

```

Zejiang Shen[1] (�), Ruochen Zhang[2], Melissa Dell[3], Benjamin Charles Germain
Lee[4], Jacob Carlson[3], and Weining Li[5]

1 Allen Institute for AI
```
 


In [7]:
import pprint

pprint.pp(docs[0].metadata)

{'producer': 'pdfTeX-1.40.21',
 'creator': 'LaTeX with hyperref',
 'creationdate': '2021-06-22T01:27:10+00:00',
 'source': './example_data/layout-parser-paper.pdf',
 'file_path': './example_data/layout-parser-paper.pdf',
 'total_pages': 16,
 'format': 'PDF 1.5',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'moddate': '2021-06-22T01:27:10+00:00',
 'trapped': '',
 'modDate': 'D:20210622012710Z',
 'creationDate': 'D:20210622012710Z',
 'page': 0}


## Lazy Load


In [56]:
pages = []
for doc in loader.lazy_load():
    pages.append(doc)
    if len(pages) >= 10:
        # do some paged operation, e.g.
        # index.upsert(page)

        pass
len(pages)

16
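The loop above keeps every page in `pages` before anything is done with them. A minimal sketch of handing pages over in fixed-size batches instead, so that only one batch is held at a time. `batched` is a hypothetical helper, not part of LangChain, and the generator of strings stands in for `loader.lazy_load()`:

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in for loader.lazy_load(): 16 placeholder "pages".
fake_pages = (f"page {i}" for i in range(16))
batch_sizes = [len(batch) for batch in batched(fake_pages, 10)]
print(batch_sizes)  # [10, 6]
```

With a real loader, each batch could be passed to whatever paged operation you need (for example an index upsert) before the next batch is read.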

In [60]:
print(pages[2].page_content[500:1000])
pprint.pp(pages[0].metadata)

ser library, DL Model Zoo, and customized
model training are described in Section 3, and the DL model hub and community platform are detailed in Section 4. Section 5 shows two examples of how
```
LayoutParser can be used in practical DIA projects, and Section 6 concludes.

### 2 Related Work

```
Recently, various DL models and datasets have been developed for layout analysis
tasks. The dhSegment [22] utilizes fully convolutional networks [20] for segmentation tasks on historical documents. Obje
{'producer': 'pdfTeX-1.40.21',
 'creator': 'LaTeX with hyperref',
 'creationdate': '2021-06-22T01:27:10+00:00',
 'source': 'example_data/layout-parser-paper.pdf',
 'file_path': 'example_data/layout-parser-paper.pdf',
 'total_pages': 16,
 'format': 'PDF 1.5',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'moddate': '2021-06-22T01:27:10+00:00',
 'trapped': '',
 'modDate': 'D:20210622012710Z',
 'creationDate': 'D:20210622012710Z',
 'page': 0}


The metadata attribute contains at least the following keys:
- source
- page (if in mode *page*)
- total_pages
- creationdate
- creator
- producer

Additional metadata are specific to each parser.
This information can be helpful, for example to categorize your PDFs.
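As a rough illustration of categorizing by metadata, the sketch below groups sources by the value of one metadata key. The dictionaries are made-up stand-ins shaped like `doc.metadata`, and `group_sources_by` is a hypothetical helper, not a LangChain function:

```python
from collections import defaultdict
from typing import Any, Dict, List

# Stand-in metadata dicts shaped like `doc.metadata`; the values are made up.
metadata_list: List[Dict[str, Any]] = [
    {"source": "a.pdf", "page": 0, "total_pages": 2, "creator": "LaTeX with hyperref"},
    {"source": "a.pdf", "page": 1, "total_pages": 2, "creator": "LaTeX with hyperref"},
    {"source": "b.pdf", "page": 0, "total_pages": 1, "creator": "Writer"},
]


def group_sources_by(items: List[Dict[str, Any]], key: str) -> Dict[Any, List[str]]:
    """Map each value of `key` to the sorted list of distinct sources that carry it."""
    groups = defaultdict(set)
    for metadata in items:
        groups[metadata.get(key)].add(metadata["source"])
    return {value: sorted(sources) for value, sources in groups.items()}


by_creator = group_sources_by(metadata_list, "creator")
print(by_creator)
```

With loaded documents, the input list could be built as `[doc.metadata for doc in docs]`.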

## Splitting mode & custom pages delimiter

When loading the PDF file you can split it in two different ways:
- By page
- As a single text flow

By default PyMuPDF4LLMLoader will split the PDF by page.

### Extract the PDF by page. Each page is extracted as a langchain Document object:

In [10]:
loader = PyMuPDF4LLMLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
)
docs = loader.load()
print(len(docs))
pprint.pp(docs[0].metadata)

16
{'producer': 'pdfTeX-1.40.21',
 'creator': 'LaTeX with hyperref',
 'creationdate': '2021-06-22T01:27:10+00:00',
 'source': './example_data/layout-parser-paper.pdf',
 'file_path': './example_data/layout-parser-paper.pdf',
 'total_pages': 16,
 'format': 'PDF 1.5',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'moddate': '2021-06-22T01:27:10+00:00',
 'trapped': '',
 'modDate': 'D:20210622012710Z',
 'creationDate': 'D:20210622012710Z',
 'page': 0}


In this mode the PDF is split by page, and each resulting Document's metadata contains the page number. In some cases, however, you may want to process the PDF as a single text flow (so that paragraphs are not cut in half at page boundaries). For that, use the *single* mode:

### Extract the whole PDF as a single langchain Document object:

In [11]:
loader = PyMuPDF4LLMLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="single",
)
docs = loader.load()
print(len(docs))
pprint.pp(docs[0].metadata)

1
{'producer': 'pdfTeX-1.40.21',
 'creator': 'LaTeX with hyperref',
 'creationdate': '2021-06-22T01:27:10+00:00',
 'source': './example_data/layout-parser-paper.pdf',
 'file_path': './example_data/layout-parser-paper.pdf',
 'total_pages': 16,
 'format': 'PDF 1.5',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'moddate': '2021-06-22T01:27:10+00:00',
 'trapped': '',
 'modDate': 'D:20210622012710Z',
 'creationDate': 'D:20210622012710Z'}


Logically, in this mode, the `page` (page number) metadata disappears. Here's how to clearly identify where pages end in the text flow:

### Add a custom *pages_delimiter* to mark where pages end in *single* mode:

In [20]:
loader = PyMuPDF4LLMLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="single",
    pages_delimiter="\n-------THIS IS A CUSTOM END OF PAGE-------\n",
)
docs = loader.load()
print(docs[0].page_content[2000:3000])

 library is publicly available at https://layout-parser.github.io.](https://layout-parser.github.io)

**Keywords: Document Image Analysis · Deep Learning · Layout Analysis**

    - Character Recognition · Open Source library · Toolkit.

### 1 Introduction


Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of
document image analysis (DIA) tasks including document image classification [11,


-----
-------THIS IS A CUSTOM END OF PAGE-------
2 Z. Shen et al.

37], layout detection [38, 22], table detection [26], and scene text detection [4].
A generalized learning-based framework dramatically reduces the need for the
manual specification of complicated rules, which is the status quo with traditional
methods. DL has the potential to transform DIA pipelines and benefit a broad
spectrum of large-scale document digitization projects.
However, there are several practical difficulties for taking advantages of recent advances in DL-based methods: 1) DL models are notor

The delimiter could simply be `\n`, or `\f` to clearly mark a page change, or `<!-- PAGE BREAK -->` for seamless injection into a Markdown viewer with no visual effect.
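One use of a distinctive delimiter is recovering per-page chunks from the single text flow later on. A minimal sketch using plain string splitting follows. `split_pages` is a hypothetical helper and the sample string is a stand-in for `docs[0].page_content`. Whether the delimiter also follows the last page is not shown above, so empty chunks are simply dropped:

```python
from typing import List

# Same delimiter string as passed to the loader above.
PAGES_DELIMITER = "\n-------THIS IS A CUSTOM END OF PAGE-------\n"


def split_pages(text: str, delimiter: str = PAGES_DELIMITER) -> List[str]:
    """Split a single-mode text flow into per-page chunks, dropping empty ones."""
    chunks = (chunk.strip() for chunk in text.split(delimiter))
    return [chunk for chunk in chunks if chunk]


# Stand-in for docs[0].page_content in "single" mode.
sample = "First page text." + PAGES_DELIMITER + "2 Z. Shen et al.\nSecond page text."
pages = split_pages(sample)
print(len(pages))  # 2
```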

## Extract images from the PDF

You can extract images from your PDFs with a choice of three different solutions:
- rapidOCR (lightweight Optical Character Recognition tool)
- Tesseract (OCR tool with high precision)
- Multimodal language model

The extracted images are included in the `page_content` of the resulting Documents, formatted in *markdown*.

### Extract images from the PDF with rapidOCR:

In [None]:
%pip install -qU rapidocr-onnxruntime

In [None]:
from langchain_community.document_loaders.parsers import RapidOCRBlobParser

loader = PyMuPDF4LLMLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
    extract_images=True,
    images_parser=RapidOCRBlobParser(),
)
docs = loader.load()

print(docs[5].page_content)

Be careful: RapidOCR is designed to work with Chinese and English, not other languages.

### Extract images from the PDF with Tesseract:

In [None]:
%pip install -qU pytesseract

In [None]:
from langchain_community.document_loaders.parsers import TesseractBlobParser

loader = PyMuPDF4LLMLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
    extract_images=True,
    images_parser=TesseractBlobParser(),
)
docs = loader.load()
print(docs[5].page_content)

### Extract images from the PDF with a multimodal model:

In [None]:
%pip install -qU langchain_openai python-dotenv

In [None]:
import os

from dotenv import load_dotenv

load_dotenv()

In [17]:
from getpass import getpass

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key =")

In [None]:
from langchain_community.document_loaders.parsers import LLMImageBlobParser
from langchain_openai import ChatOpenAI

loader = PyMuPDF4LLMLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
    extract_images=True,
    images_parser=LLMImageBlobParser(model=ChatOpenAI(model="gpt-4o", max_tokens=1024)),
)
docs = loader.load()
print(docs[5].page_content)

## Extract tables from the PDF

With PyMuPDF4LLM you can extract tables from your PDFs in *markdown* format by specifying one of the following table extraction strategies.
- `"lines_strict"` is the default strategy and is the most accurate for tables with column and row lines, but may not work well with all documents.
- `"lines"` is a less strict strategy that may work better with some documents.
- `"text"` is the least strict strategy and may work better with documents that do not have tables with lines.

In [40]:
loader = PyMuPDF4LLMLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
    table_strategy="lines",
)
docs = loader.load()
print(docs[4].page_content[:1000])

`LayoutParser: A Unified Toolkit for DL-Based DIA` 5

Table 1: Current layout detection models in the LayoutParser model zoo

**Dataset** **Base Model[1]** **Large Model** **Notes**

PubLayNet [38] F / M M Layouts of modern scientific documents
PRImA [3] M    - Layouts of scanned modern magazines and scientific reports
Newspaper [17] F   - Layouts of scanned US newspapers from the 20th century
TableBank [18] F F Table region on modern scientific and business document
HJDataset [31] F / M   - Layouts of history Japanese documents

1 For each dataset, we train several models of different sizes for different needs (the trade-off between accuracy
vs. computational cost). For “base model” and “large model”, we refer to using the ResNet 50 or ResNet 101
backbones [13], respectively. One can train models of different architectures, like Faster R-CNN [28] (F) and Mask
R-CNN [12] (M). For example, an F in the Large Model column indicates it has a Faster R-CNN model trained
using the ResNet 101 
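In the output above, Table 1 appears to come through as plain text rather than as a pipe-delimited Markdown table, so it can be worth checking which strategy actually yields a table for a given document. The sketch below tries strategies in order and returns the first whose output contains a Markdown separator row. Both helpers are hypothetical, the regular expression is a simple heuristic, and the stub outputs are made up:

```python
import re
from typing import Callable, Optional, Sequence

# Matches a Markdown table separator row with at least two columns,
# such as "|---|---|" or "| :--- | :---: |".
_SEPARATOR_ROW = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)+\|?\s*$")


def has_markdown_table(text: str) -> bool:
    """Return True if any line looks like a Markdown table separator row."""
    return any(_SEPARATOR_ROW.match(line) for line in text.splitlines())


def first_strategy_with_table(
    load_text: Callable[[str], str],
    strategies: Sequence[str] = ("lines_strict", "lines", "text"),
) -> Optional[str]:
    """Try each strategy in order; return the first whose output contains a table."""
    for strategy in strategies:
        if has_markdown_table(load_text(strategy)):
            return strategy
    return None


# Stub outputs standing in for real page content; the text is made up.
fake_outputs = {
    "lines_strict": "Dataset Base Model\n\n-----\n",
    "lines": "**Dataset** **Base Model**\nPubLayNet F / M",
    "text": "|Dataset|Base Model|\n|---|---|\n|PubLayNet|F / M|",
}
chosen = first_strategy_with_table(fake_outputs.__getitem__)
print(chosen)  # text
```

With the real loader, `load_text` could be something like `lambda s: PyMuPDF4LLMLoader(file_path, mode="page", table_strategy=s).load()[4].page_content`, reusing only the parameters shown in the cell above.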

## Working with Files

Many document loaders involve parsing files. The difference between such loaders usually stems from how the file is parsed, rather than how the file is loaded. For example, you can use `open` to read the binary content of either a PDF or a markdown file, but you need different parsing logic to convert that binary data into text.

As a result, it can be helpful to decouple the parsing logic from the loading logic, which makes it easier to re-use a given parser regardless of how the data was loaded.
You can use this strategy to analyze different files, with the same parsing parameters.

In [54]:
from langchain_community.document_loaders import FileSystemBlobLoader
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import PyMuPDF4LLMParser

loader = GenericLoader(
    blob_loader=FileSystemBlobLoader(
        path="./example_data/",
        glob="*.pdf",
    ),
    blob_parser=PyMuPDF4LLMParser(),
)
docs = loader.load()
print(docs[0].page_content[:250])
pprint.pp(docs[0].metadata)

```
LayoutParser: A Unified Toolkit for Deep

## Learning Based Document Image Analysis

```

Zejiang Shen[1] (�), Ruochen Zhang[2], Melissa Dell[3], Benjamin Charles Germain
Lee[4], Jacob Carlson[3], and Weining Li[5]

1 Allen Institute for AI
```
 
{'producer': 'pdfTeX-1.40.21',
 'creator': 'LaTeX with hyperref',
 'creationdate': '2021-06-22T01:27:10+00:00',
 'source': 'example_data/layout-parser-paper.pdf',
 'file_path': 'example_data/layout-parser-paper.pdf',
 'total_pages': 16,
 'format': 'PDF 1.5',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'moddate': '2021-06-22T01:27:10+00:00',
 'trapped': '',
 'modDate': 'D:20210622012710Z',
 'creationDate': 'D:20210622012710Z',
 'page': 0}


It is possible to work with files from cloud storage.

In [None]:
from langchain_community.document_loaders import CloudBlobLoader
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import PyMuPDF4LLMParser

loader = GenericLoader(
    blob_loader=CloudBlobLoader(
        url="s3://mybucket",  # Supports s3://, az://, gs://, file:// schemes.
        glob="*.pdf",
    ),
    blob_parser=PyMuPDF4LLMParser(),
)
docs = loader.load()
print(docs[0].page_content)
pprint.pp(docs[0].metadata)

## API reference

For detailed documentation of all `PyMuPDF4LLMLoader` features and configurations head to the API reference: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PyMuPDF4LLMLoader.html