# RouterParser

This notebook provides a quick overview for getting started with the `PDFRouterLoader` [document loader](https://python.langchain.com/docs/concepts/document_loaders). For detailed documentation of all `PDFRouterLoader` features and configurations, head to the [API reference](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PDFRouterLoader.html).


## Overview
### Integration details

| Class                                                                                                                                                  | Package | Local | Serializable | JS support|
|:-------------------------------------------------------------------------------------------------------------------------------------------------------| :--- |:-----:| :---: |  :---: |
| [PDFRouterLoader](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PDFRouterLoader.html) | [langchain_community](https://python.langchain.com/api_reference/community/index.html) |   ?   | ❌ | ❌ |   
   
---------   

### Loader features

|     Source      | Document Lazy Loading | Native Async Support | Extract Images | Extract Tables |
|:---------------:| :---: | :---: |:--------------:|:--------------:|
| PDFRouterLoader | ✅ | ❌ |       ?        |       ?        |


## Setup

### Credentials

No credentials are required to use `PDFRouterLoader`.

If you want best-in-class automated tracing of your model calls, you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting the lines below:

In [1]:
# import getpass
# import os
#
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass("Enter your LangSmith API key: ")
# os.environ["LANGSMITH_TRACING"] = "true"

### Installation

Install **langchain_community** and **pdfminer-six**.

In [1]:
%pip install -qU langchain_community pdfminer-six
%pip install -q ../../../../dist/patch_langchain_pdf_loader*.whl

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pdfplumber 0.11.4 requires pdfminer.six==20231228, but you have pdfminer-six 20240706 which is incompatible.
Note: you may need to restart the kernel to use updated packages.


## Initialization

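Now we can define the routes, instantiate the loader, and load documents. Each route is a tuple of a route name, a dictionary of metadata patterns, and the parser to use when those patterns match: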

In [5]:
from patch_langchain_community.document_loaders.parsers import PyMuPDFParser, \
    PDFPlumberParser, PyPDFium2Parser
from patch_langchain_community.document_loaders import PDFRouterLoader

mode="single"
routes = [
    ("name",{"producer":"Microsoft", "creator":"Microsoft"}, PyMuPDFParser(mode=mode)),
    ("Latex",{"creator":"LaTeX","page1":"Recent advances in document image analysis"}, PDFPlumberParser(mode=mode)),
    (None, {}, PyPDFium2Parser(mode=mode))
]

file_path = "./example_data/layout-parser-paper.pdf"
loader = PDFRouterLoader(file_path,
                         routes=routes)

## Load

In [6]:
docs = loader.load()
docs[0]

Document(metadata={'author': '', 'creationdate': '2021-06-22T01:27:10+00:00', 'creator': 'LaTeX with hyperref', 'keywords': '', 'moddate': '2021-06-22T01:27:10+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'producer': 'pdfTeX-1.40.21', 'subject': '', 'title': '', 'trapped': 'False', 'source': './example_data/layout-parser-paper.pdf', 'file_path': './example_data/layout-parser-paper.pdf', 'total_pages': 16, 'router': 'Latex'}, page_content='LayoutParser: A Unified Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1 Allen Institute for AI\nshannons@allenai.org\n2 Brown University\nruochen zhang@brown.edu\n3 Harvard University\n{melissadell,jacob carlson}@fas.harvard.edu\n4 University of Washington\nbcgl@cs.washington.edu\n5 University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recentadvances

In [7]:
import pprint

pprint.pp(docs[0].metadata["router"])

'Latex'
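
The `router` value above is the name of the route that was selected for this file. As a rough, library-independent illustration of first-match routing, the sketch below assumes that each pattern is a regular expression searched against the corresponding metadata value and that an empty pattern dictionary matches every file. `select_route` is a hypothetical helper, the parsers are replaced by plain strings, and the `page1` pattern is left out because it depends on extracted page text; the library's actual matching rules may differ.

```python
import re
from typing import Any, Optional

# Each route is (name, patterns, parser); parsers are stand-in strings here.
Route = tuple[Optional[str], dict[str, str], Any]


def select_route(metadata: dict[str, str], routes: list[Route]) -> Route:
    """Return the first route whose patterns all match (assumed semantics)."""
    for route in routes:
        _name, patterns, _parser = route
        # all() over an empty dict is True, so (None, {}, parser) is a catch-all.
        if all(
            re.search(pattern, metadata.get(key, ""))
            for key, pattern in patterns.items()
        ):
            return route
    raise ValueError("No route matched; add a catch-all route with empty patterns.")


routes = [
    ("name", {"producer": "Microsoft", "creator": "Microsoft"}, "PyMuPDFParser"),
    ("Latex", {"creator": "LaTeX"}, "PDFPlumberParser"),
    (None, {}, "PyPDFium2Parser"),
]

metadata = {"creator": "LaTeX with hyperref", "producer": "pdfTeX-1.40.21"}
selected_name, _, selected_parser = select_route(metadata, routes)
print(selected_name, selected_parser)  # Latex PDFPlumberParser
```

Because the routes are tried in order, the catch-all route belongs at the end of the list; placed first, it would shadow every other route.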


## Lazy Load


In [8]:
pages = []
for doc in loader.lazy_load():
    pages.append(doc)
    if len(pages) >= 10:
        # do some paged operation, e.g.
        # index.upsert(page)

        pages = []
len(pages)

1
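
Note that the loop above only processes full batches of 10, so any final partial batch is still sitting in `pages` when the loop ends and has to be handled separately. One way to avoid that is a small batching helper. In this sketch `batched` and `fake_lazy_load` are hypothetical names, and the generator merely stands in for `loader.lazy_load()`.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield lists of up to `size` items, including the final partial batch."""
    iterator = iter(items)
    # islice returns an empty list once the iterator is exhausted, ending the loop.
    while batch := list(islice(iterator, size)):
        yield batch


def fake_lazy_load(n: int) -> Iterator[str]:
    """Stand-in for loader.lazy_load(): yields n placeholder documents lazily."""
    for i in range(n):
        yield f"doc-{i}"


batch_sizes = [len(batch) for batch in batched(fake_lazy_load(23), 10)]
print(batch_sizes)  # [10, 10, 3]
```

Each yielded batch can be passed to a paged operation such as an index upsert, and only one batch is held in memory at a time.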

In [9]:
print(pages[0].page_content[:100])
pprint.pp(pages[0].metadata)

LayoutParser: A Unified Toolkit for Deep
Learning Based Document Image Analysis
Zejiang Shen1 ((cid:
{'author': '',
 'creationdate': '2021-06-22T01:27:10+00:00',
 'creator': 'LaTeX with hyperref',
 'keywords': '',
 'moddate': '2021-06-22T01:27:10+00:00',
 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live '
                    '2020) kpathsea version 6.3.2',
 'producer': 'pdfTeX-1.40.21',
 'subject': '',
 'title': '',
 'trapped': 'False',
 'source': './example_data/layout-parser-paper.pdf',
 'file_path': './example_data/layout-parser-paper.pdf',
 'total_pages': 16,
 'router': 'Latex'}


The metadata attribute contains at least the following keys:
- source
- page (if in mode *page*)
- total_pages
- creationdate
- creator
- producer

Additional metadata are specific to each parser.
These pieces of information can be helpful (to categorize your PDFs for example).
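
As a minimal sketch of such a categorization, the snippet below groups files by their `producer` value. Plain dictionaries stand in for `Document.metadata`, and the file names and values are made up for illustration.

```python
from collections import defaultdict

# Plain dicts stand in for Document.metadata; the values are illustrative only.
metadatas = [
    {"source": "a.pdf", "producer": "pdfTeX-1.40.21", "creator": "LaTeX with hyperref"},
    {"source": "b.pdf", "producer": "Microsoft: Print To PDF", "creator": "Microsoft Word"},
    {"source": "c.pdf", "producer": "pdfTeX-1.40.21", "creator": "LaTeX with hyperref"},
]

# Map each producer to the list of sources it generated.
by_producer: dict[str, list[str]] = defaultdict(list)
for metadata in metadatas:
    by_producer[metadata.get("producer", "unknown")].append(metadata["source"])

print(dict(by_producer))
```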

## Splitting mode & custom pages delimiter

When loading the PDF file you can split it in two different ways:
- By page
- As a single text flow

`PDFRouterLoader` does not take a mode itself; the mode is set on each parser in `routes`, as in the examples below.

### Extract the PDF by page. Each page is extracted as a langchain Document object:

In [10]:
mode="page"
routes = [
    ("name",{"producer":"Microsoft", "creator":"Microsoft"}, PyMuPDFParser(mode=mode)),
    ("Latex",{"creator":"LaTeX","page1":"Recent advances in document image analysis"}, PDFPlumberParser(mode=mode)),
    (None, {}, PyPDFium2Parser(mode=mode))
]

loader = PDFRouterLoader(
    "./example_data/layout-parser-paper.pdf",
    routes=routes,
)
docs = loader.load()
print(len(docs))
pprint.pp(docs[0].metadata)

16
{'author': '',
 'creationdate': '2021-06-22T01:27:10+00:00',
 'creator': 'LaTeX with hyperref',
 'keywords': '',
 'moddate': '2021-06-22T01:27:10+00:00',
 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live '
                    '2020) kpathsea version 6.3.2',
 'producer': 'pdfTeX-1.40.21',
 'subject': '',
 'title': '',
 'trapped': 'False',
 'source': './example_data/layout-parser-paper.pdf',
 'file_path': './example_data/layout-parser-paper.pdf',
 'total_pages': 16,
 'page': 0,
 'router': 'Latex'}


In this mode the PDF is split by page, and the metadata of each resulting Document contains the page number. In some cases, however, we may want to process the PDF as a single text flow, so that paragraphs spanning a page break are not cut in half. For this, use the *single* mode:

### Extract the whole PDF as a single langchain Document object:

In [11]:
mode="single"
routes = [
    ("name",{"producer":"Microsoft", "creator":"Microsoft"}, PyMuPDFParser(mode=mode)),
    ("Latex",{"creator":"LaTeX","page1":"Recent advances in document image analysis"}, PDFPlumberParser(mode=mode)),
    (None, {}, PyPDFium2Parser(mode=mode))
]

loader = PDFRouterLoader(
    "./example_data/layout-parser-paper.pdf",
    routes=routes,
)
docs = loader.load()
print(len(docs))
pprint.pp(docs[0].metadata)

1
{'author': '',
 'creationdate': '2021-06-22T01:27:10+00:00',
 'creator': 'LaTeX with hyperref',
 'keywords': '',
 'moddate': '2021-06-22T01:27:10+00:00',
 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live '
                    '2020) kpathsea version 6.3.2',
 'producer': 'pdfTeX-1.40.21',
 'subject': '',
 'title': '',
 'trapped': 'False',
 'source': './example_data/layout-parser-paper.pdf',
 'file_path': './example_data/layout-parser-paper.pdf',
 'total_pages': 16,
 'router': 'Latex'}


Logically, the `page` metadata key disappears in this mode. A custom pages delimiter can be used to mark where pages end in the text flow.
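
To make the idea concrete, here is a library-independent sketch of what a single text flow amounts to: per-page texts are joined with a delimiter and the per-page `page` key is dropped. The `merge_pages` helper, the dictionary layout, and the delimiter strings are assumptions for illustration only, not the parsers' actual implementation.

```python
from typing import Any


def merge_pages(pages: list[dict[str, Any]], delimiter: str = "\n\f") -> dict[str, Any]:
    """Join page records into one record and drop the per-page 'page' key."""
    if not pages:
        raise ValueError("expected at least one page")
    # Document-level metadata is taken from the first page, minus the page number.
    metadata = {k: v for k, v in pages[0]["metadata"].items() if k != "page"}
    text = delimiter.join(page["page_content"] for page in pages)
    return {"page_content": text, "metadata": metadata}


pages = [
    {"page_content": "First page.", "metadata": {"source": "x.pdf", "total_pages": 2, "page": 0}},
    {"page_content": "Second page.", "metadata": {"source": "x.pdf", "total_pages": 2, "page": 1}},
]
single = merge_pages(pages, delimiter="\n-----\n")
print(single["page_content"])
print(single["metadata"])
```

Splitting the merged text on the same delimiter recovers the page boundaries, which is why a distinctive delimiter is useful when page positions matter downstream.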

## Working with Files

Many document loaders involve parsing files. The difference between such loaders usually stems from how the file is parsed, rather than how the file is loaded. For example, you can use `open` to read the binary content of either a PDF or a markdown file, but you need different parsing logic to convert that binary data into text.

As a result, it can be helpful to decouple the parsing logic from the loading logic, which makes it easier to re-use a given parser regardless of how the data was loaded.
You can use this strategy to analyze different files, with the same parsing parameters.
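
The separation described above can be sketched without any library. In this simplified illustration, `Blob`, `load_blobs`, `parse_text`, and `load_documents` are hypothetical stand-ins rather than LangChain classes: the loading step only produces bytes, and the parsing step is a function that can be swapped out or reused.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Blob:
    """Raw bytes plus where they came from (a simplified stand-in)."""

    data: bytes
    source: str


def load_blobs(files: dict[str, bytes]) -> Iterator[Blob]:
    """Loading step: produce blobs without interpreting their content."""
    for source, data in files.items():
        yield Blob(data=data, source=source)


def parse_text(blob: Blob) -> dict:
    """Parsing step: turn bytes into text; a PDF parser would go here instead."""
    return {"page_content": blob.data.decode("utf-8"), "metadata": {"source": blob.source}}


def load_documents(blobs: Iterable[Blob], parser: Callable[[Blob], dict]) -> list[dict]:
    """Apply one parser, with the same settings, to every blob."""
    return [parser(blob) for blob in blobs]


files = {"a.txt": b"hello", "b.txt": b"world"}  # in-memory stand-in for a directory
parsed = load_documents(load_blobs(files), parse_text)
print([doc["page_content"] for doc in parsed])  # ['hello', 'world']
```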

In [12]:
from patch_langchain_community.document_loaders.parsers import PDFRouterParser
from langchain_community.document_loaders import FileSystemBlobLoader
from langchain_community.document_loaders.generic import GenericLoader

mode="single"
routes = [
    ("name",{"producer":"Microsoft", "creator":"Microsoft"}, PyMuPDFParser(mode=mode)),
    ("Latex",{"creator":"LaTeX","page1":"Recent advances in document image analysis"}, PDFPlumberParser(mode=mode)),
    (None, {}, PyPDFium2Parser(mode=mode))
]

loader = GenericLoader(
    blob_loader=FileSystemBlobLoader(
        path="./example_data/",
        glob="*.pdf",
    ),
    blob_parser=PDFRouterParser(
        routes=routes
    ),
)
docs = loader.load()
print(docs[0].page_content)
pprint.pp(docs[0].metadata)

LayoutParser: A Unified Toolkit for Deep
Learning Based Document Image Analysis
Zejiang Shen1 ((cid:0)), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain
Lee4, Jacob Carlson3, and Weining Li5
1 Allen Institute for AI
shannons@allenai.org
2 Brown University
ruochen zhang@brown.edu
3 Harvard University
{melissadell,jacob carlson}@fas.harvard.edu
4 University of Washington
bcgl@cs.washington.edu
5 University of Waterloo
w422li@uwaterloo.ca
Abstract. Recentadvancesindocumentimageanalysis(DIA)havebeen
primarily driven by the application of neural networks. Ideally, research
outcomescouldbeeasilydeployedinproductionandextendedforfurther
investigation. However, various factors like loosely organized codebases
and sophisticated model configurations complicate the easy reuse of im-
portantinnovationsbyawideaudience.Thoughtherehavebeenon-going
efforts to improve reusability and simplify deep learning (DL) model
developmentindisciplineslikenaturallanguageprocessingandcomputer
vision, none 

To find out more about custom document loaders and parsers, see [this guide](/docs/how_to/document_loader_custom/).

## API reference

For detailed documentation of all `PDFRouterLoader` features and configurations head to the API reference: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.pdf.PDFRouterLoader.html