# Extracting Business Insurance Details from BOP PDF

This PDF is a complex insurance policy document of the kind generated for small businesses that need BOP (business owner's policy) coverage, and it packs an overwhelming amount of information into 111 pages. The forms can differ slightly between carriers, which makes extraction inconsistent, and the templated layouts mean even standard sections can shift position depending on the software that generated the document.


In [None]:
# Install natural-pdf
!pip install natural-pdf

In [None]:
# Download the PDF file
import urllib.request
import os

pdf_url = "https://pub-4e99d31d19cb404d8d4f5f7efa51ef6e.r2.dev/pdfs/sample-bop-policy-restaurant/sample-bop-policy-restaurant.pdf"
pdf_name = "sample-bop-policy-restaurant.pdf"

if not os.path.exists(pdf_name):
    print(f"Downloading {pdf_name}...")
    urllib.request.urlretrieve(pdf_url, pdf_name)
    print(f"Downloaded {pdf_name}")
else:
    print(f"{pdf_name} already exists")

In [None]:
from natural_pdf import PDF
from natural_pdf.analyzers.guides import Guides

pdf = PDF("sample-bop-policy-restaurant.pdf")
page = pdf.pages[0]
page.show()

Look at that watermark!

In [None]:
page.find_all('text[color~=red]').show()

Let's exclude it by finding all reddish text and removing it on each page. We can do this pdf-wide.

In [None]:
# pdf.add_exclusion('text[color~=red]')
pdf.find_all('text[color~=red]').exclude()

We can get the policy number by going to the right of the label.

In [None]:
(
    page
    .find(text="POLICY NUMBER")
    .right(until='text')
    .show()
)

In [None]:
(
    page
    .find(text="POLICY NUMBER")
    .right(until='text')
    .extract_text()
)

The address is a little different since it spans two (or more? or fewer?) lines. We'll start by grabbing the label and expanding downwards until we hit the next piece of text.

In [None]:
(
    page
    .find(text="Mailing Address")
    .expand(bottom='text')
    .show()
)

Then we just swing to the right and grab the text across the rest of the page.

In [None]:
(
    page
    .find(text="Mailing Address")
    .expand(bottom='text')
    .right()
    .extract_text()
)
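
The address comes back as one string with line breaks in it. Here is a minimal sketch of tidying that up, assuming the lines are separated by newlines. `split_address` is a hypothetical helper, and the sample string is made up rather than taken from the policy.

```python
def split_address(raw):
    """Split a multi-line address string into clean, non-empty lines."""
    lines = []
    for line in raw.splitlines():
        # Collapse runs of whitespace left over from PDF spacing
        cleaned = " ".join(line.split())
        if cleaned:
            lines.append(cleaned)
    return lines


# Made-up example text, standing in for the output of .extract_text()
sample = "123  Main St\n  Springfield, IL 62701 \n"
print(split_address(sample))
```

Returning a list keeps it easy to handle addresses with two lines, three lines, or just one.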

Hmm what else do we have?

In [None]:
pdf.pages[:10].show(cols=2)

Hmmm let's go to the **Service of Suit** page. I don't want to think about guessing what page it's on, so I'll just find the text on it.

In [None]:
page = pdf.find(text="SERVICE OF SUIT").page
page.show()

We probably want to get rid of those headers and footers.

In [None]:
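# Header: everything above y=100 (the top strip of the page)
header = page.region(bottom=100)
# Footer: everything below y=height-70 (the bottom strip of the page)
footer = page.region(top=page.height-70)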
(header + footer).show()

Might as well get rid of them on every single page while we're at it.

In [None]:
pdf.add_exclusion(lambda page: page.region(bottom=100))
pdf.add_exclusion(lambda page: page.region(top=page.height-70))
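
Both exclusions are just coordinate arithmetic, with `top` and `bottom` measured down from the top edge of the page, as the two calls above imply. Here is a small sketch of the bands they describe. `header_footer_bands` is a hypothetical helper, and the 792-unit page height (US Letter) is an assumption rather than something read from this PDF.

```python
def header_footer_bands(page_height, header_height=100, footer_height=70):
    """Return (top, bottom) pairs for the header and footer bands."""
    # Header runs from the top edge down to header_height
    header = (0, header_height)
    # Footer runs from footer_height above the bottom edge down to the edge
    footer = (page_height - footer_height, page_height)
    return header, footer


# Assumed page height of 792 (US Letter); the real value is page.height
header, footer = header_footer_bands(792)
print(header, footer)
```

Using `page.height` inside a lambda, as above, means the footer band still lands correctly on pages of a different size.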

And now we can grab the text!

In [None]:
text = page.extract_text()
print(text)

The rest of the PDF is a lot of finding and `.below()` and `.right()` and all of that.
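
Since the same find-then-move pattern keeps repeating, it can be wrapped in a small helper. This is only a sketch: `fields_right_of` is a hypothetical name, it uses just the `.find(text=...)`, `.right(until='text')` and `.extract_text()` calls shown earlier, and it assumes `.find()` returns `None` when a label is missing.

```python
def fields_right_of(page, labels):
    """Map each label to the text found to its right (None if the label is absent)."""
    results = {}
    for label in labels:
        anchor = page.find(text=label)
        if anchor is None:
            # Assumption: a missing label comes back as None
            results[label] = None
            continue
        # Same pattern as the policy number lookup above
        value = anchor.right(until='text').extract_text()
        results[label] = value.strip()
    return results
```

Something like `fields_right_of(pdf.pages[0], ["POLICY NUMBER"])` would then give back a dictionary keyed by label, which is handy once there are more than a couple of fields to pull.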