# Bad OCR in a board of education annual financial report

This PDF contains all sorts of information about the Board of Education in Liberty County, Georgia.


In [None]:
# Install natural-pdf
!pip install natural-pdf

In [None]:
# Download the PDF file
import urllib.request
import os

pdf_url = "https://pub-4e99d31d19cb404d8d4f5f7efa51ef6e.r2.dev/pdfs/liberty-county-boe/liberty-county-boe.pdf"
pdf_name = "liberty-county-boe.pdf"

if not os.path.exists(pdf_name):
    print(f"Downloading {pdf_name}...")
    urllib.request.urlretrieve(pdf_url, pdf_name)
    print(f"Downloaded {pdf_name}")
else:
    print(f"{pdf_name} already exists")
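
An interrupted or failed download can leave behind a file that isn't really a PDF (an HTML error page, for example). As a quick sanity check we can confirm the file starts with the `%PDF-` signature that PDF files normally begin with. This is a minimal sketch using only the standard library; `looks_like_pdf` is a helper name made up for this example, not part of natural-pdf.

```python
from pathlib import Path

def looks_like_pdf(path):
    """Return True if the file exists and begins with the PDF signature."""
    p = Path(path)
    if not p.is_file():
        return False
    # Only the first five bytes are needed to see the "%PDF-" marker
    with p.open("rb") as fp:
        return fp.read(5) == b"%PDF-"

print(looks_like_pdf("liberty-county-boe.pdf"))
```

If this prints `False`, delete the file and run the download cell again.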

We have a reasonably long PDF (72 pages), and we only want a single page of information from it. On top of that, the existing text recognition (OCR) is bad. We'll need to redo it, so we start by reading the PDF with `text_layer=False`, which discards the incorrect text.

In [None]:
from natural_pdf import PDF

pdf = PDF("liberty-county-boe.pdf", text_layer=False)
pdf.pages.show(cols=6)

Now we need to apply *new* OCR to it.

We're impatient and only care about one specific page, and we know the page is somewhere near the front. To speed things up, we'll apply OCR to a subset of the pages.

In [None]:
pdf.pages[5:20].apply_ocr()
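
One thing to keep in mind: Python slices are zero-based and leave out the end index. Assuming `pdf.pages` follows ordinary Python slicing, `[5:20]` covers the 6th through 20th pages of the document, 15 pages in total. A plain list standing in for the 72 pages shows the arithmetic:

```python
# Stand-in for the 72 pages, numbered the way a PDF viewer shows them (1-based)
page_numbers = list(range(1, 73))

# Same slice as the OCR call above
subset = page_numbers[5:20]

print(len(subset))            # 15
print(subset[0], subset[-1])  # 6 20
```

If the page we want turns out to sit outside that range, we can widen the slice and run OCR again.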

Now we can look for the content we're interested in.

In [None]:
pdf.find(text="FINANCIAL HIGHLIGHTS").show()

Granted, if the OCR were off we might not be able to find the heading this easily, but it's printed very cleanly, so we can be fairly confident the text comes through well.
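
If the OCR did mangle the heading (say, reading `I` as `1`), an exact match would miss it. One possible fallback is a fuzzy comparison over the recognized text. The sketch below uses only the standard library's `difflib` and works on any plain string of OCR output. `find_fuzzy_line`, the 0.8 threshold, and the sample text are all made up for illustration and are not part of natural-pdf.

```python
from difflib import SequenceMatcher

def find_fuzzy_line(text, target, threshold=0.8):
    """Return (line, score) for the line most similar to target, or None."""
    best_line, best_score = None, 0.0
    for line in text.splitlines():
        # Compare case-insensitively and ignore surrounding whitespace
        score = SequenceMatcher(None, line.strip().upper(), target.upper()).ratio()
        if score > best_score:
            best_line, best_score = line, score
    if best_score >= threshold:
        return best_line, best_score
    return None

# Invented sample with two OCR-style substitutions (I read as 1)
sample = "ANNUAL REPORT\nF1NANCIAL HIGHL1GHTS\nThe Board's net position increased."
print(find_fuzzy_line(sample, "FINANCIAL HIGHLIGHTS"))
```

The threshold is a judgment call: too low and unrelated lines match, too high and a couple of misread characters are enough to hide the heading.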

We can preview to make sure the page looks right...

In [None]:
page = pdf.find(text="FINANCIAL HIGHLIGHTS").page
page.show()

...and then pull out the text, save it to a file, whatever we want.

In [None]:
text = page.extract_text()
print(text)

# Write as UTF-8 so any non-ASCII characters in the OCR output are saved safely
with open("content.txt", "w", encoding="utf-8") as fp:
    fp.write(text)

If we wanted to pass this to someone else to double-check, we could even save the page itself as an image. We use `.render()` instead of `.show()` because by default it won't include highlights, annotations, and the like.

In [None]:
page.render().save("output.png")