# Extracting Text from Georgia Legislative Bills

This PDF contains bills from the Georgia legislature, which are published yearly. The main challenge is extracting marked-up text such as underlines and strikethroughs. The pages also carry line numbers in the left margin, which complicate text extraction.


In [None]:
# Install natural-pdf
!pip install natural-pdf

In [None]:
# Download the PDF file
import urllib.request
import os

pdf_url = "https://pub-4e99d31d19cb404d8d4f5f7efa51ef6e.r2.dev/pdfs/20252026-236232/20252026-236232.pdf"
pdf_name = "20252026-236232.pdf"

if not os.path.exists(pdf_name):
    print(f"Downloading {pdf_name}...")
    urllib.request.urlretrieve(pdf_url, pdf_name)
    print(f"Downloaded {pdf_name}")
else:
    print(f"{pdf_name} already exists")

The line numbers down the left side complicate text extraction... *or do they make it easier?* Let's open the PDF and look at the last page.

In [None]:
from natural_pdf import PDF

pdf = PDF("20252026-236232.pdf")
page = pdf.pages[-1]
page.show()

## Text with strikethroughs

See those strikeouts? Usually they're awful, terrible, impossible. When you use `.extract_text()` it pulls both the "normal" text and the struck-out text, ruining your ability to analyze the results.

In [None]:
text = page.extract_text()
print(text)

Luckily we have a strikeout selector!

In [None]:
page.find_all('text:strikeout').show(crop='wide')

We can do the same thing with underlined text.

In [None]:
underlined = page.find_all('text:underline')
print("Underlined text is", underlined.extract_text())
underlined.show(crop='wide')

This works across pages, too. In these bills the underlined text is the added text, so all of the additions across every page can be found like this:

/// tab | As one string

In [None]:
text = pdf.find_all('text:underline').extract_text()
print(text)

///
/// tab | As separate strings

In [None]:
text = pdf.find_all('text:underline').extract_each_text()
print(text)

///

### Ignoring struck-out text

If we want `.extract_text()` to fully ignore struck-out text, we can add an exclusion.

In [None]:
pdf.add_exclusion('text:strikeout')

Easy!
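
What an exclusion does can be pictured with a library-free toy model. This is only a sketch of the idea, not natural-pdf's implementation: the `Word` class, its `struck` flag, the `extract_text` function, and the sample sentence are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical stand-in for one text element on a page."""
    text: str
    struck: bool = False  # True if a strikethrough line crosses the word


def extract_text(words, exclude=None):
    """Join word texts, skipping any word the `exclude` predicate matches."""
    kept = [w.text for w in words if exclude is None or not exclude(w)]
    return " ".join(kept)


# A made-up amended phrase: "30" was struck and "60" was added
words = [
    Word("within"),
    Word("30", struck=True),
    Word("60"),
    Word("days"),
]

# Without an exclusion, old and new wording run together
print(extract_text(words))

# With the exclusion, only the surviving wording remains
print(extract_text(words, exclude=lambda w: w.struck))
```

The first call mixes the deleted and the replacement wording, which is the problem seen in the raw `.extract_text()` output earlier; the second keeps only the text that is still in force.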

### Selecting the good areas

We have three approaches to avoiding the numbers in the left-hand column: make use of the numbers, select the region we do want, or ignore the stuff we don't want.

/// tab | Use the numbers

In [None]:
page = pdf.pages[0]
page.show()

One way to describe the sections we want is text to the right of the numbers. So first we find the general area of the numbers...

In [None]:
page.region(right=70).show()

...find the numbers...

In [None]:
(
  page
  .region(right=70)
  .find_all('text')
  .show(crop='wide')
)

...get the stuff to the right of them...

In [None]:
(
  page
  .region(right=70)
  .find_all('text')
  .right()
  .show(crop='wide')
)

...and merge it all together.

In [None]:
(
  page
  .region(right=70)
  .find_all('text')
  .right()
  .merge()
  .show(crop='wide')
)
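
The geometry behind that chain can be sketched without the library. The toy model below assumes that `.right()` yields, for each number, a strip running from the number's right edge to the page's right edge at the number's own height, and that `.merge()` takes the bounding box around all of those strips. Both behaviours are assumptions about natural-pdf rather than confirmed facts, and the `Box` type, the helper functions, the page width, and the coordinates are made up for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical bounding box; y grows downward from the top of the page."""
    x0: float
    top: float
    x1: float
    bottom: float


PAGE_WIDTH = 612  # assumed page width; the real bill pages may differ


def right_of(box: Box) -> Box:
    """Strip from the box's right edge to the page's right edge (assumed behaviour)."""
    return Box(box.x1, box.top, PAGE_WIDTH, box.bottom)


def merge(boxes: list[Box]) -> Box:
    """Smallest box that contains every input box."""
    return Box(
        min(b.x0 for b in boxes),
        min(b.top for b in boxes),
        max(b.x1 for b in boxes),
        max(b.bottom for b in boxes),
    )


# Made-up, right-aligned positions for line numbers "1", "2" and "10"
numbers = [Box(50, 100, 56, 112), Box(50, 120, 56, 132), Box(44, 280, 56, 292)]

strips = [right_of(n) for n in numbers]
content = merge(strips)
print(content)
```

In this model the merged box starts at the first number and ends at the last one, so anything above or below the numbered lines falls outside it.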

We can do it for all pages.

In [None]:
sections = pdf.pages.apply(
    lambda page: (
        page
        .region(right=70)
        .find_all('text')
        .right()
        .merge()
    )
)
sections.show()

In [None]:
text = sections.extract_text()
print(text)

///

/// tab | Pixel-based regions

Most documents have headers and footers; this one also has a left-hand column of line numbers. What if we just selected the region we want based on pixels?

In [None]:
area = page.region(left=70, top=50, bottom=page.height - 100)
area.show()
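
To make the arithmetic explicit, here is a small library-free sketch of how such offsets could turn into a bounding box. It assumes, as the calls in this tutorial suggest, that y is measured from the top of the page and that any edge left unspecified falls back to the page edge. The `region_bbox` helper and the 612 x 792 page size are hypothetical, chosen only for illustration.

```python
def region_bbox(page_width, page_height, left=None, top=None, right=None, bottom=None):
    """Return (x0, top, x1, bottom), defaulting missing edges to the page edges."""
    return (
        0 if left is None else left,
        0 if top is None else top,
        page_width if right is None else right,
        page_height if bottom is None else bottom,
    )


# Assumed page size; the real bill pages may differ
WIDTH, HEIGHT = 612, 792

# Same offsets as above: skip 70 on the left, 50 at the top, 100 at the bottom
content = region_bbox(WIDTH, HEIGHT, left=70, top=50, bottom=HEIGHT - 100)
print(content)  # (70, 50, 612, 692)

# The left-hand number column is the complementary strip
numbers = region_bbox(WIDTH, HEIGHT, right=70)
print(numbers)  # (0, 0, 70, 792)
```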

We can go through each page and do the same thing, ending up with a collection of sections.

In [None]:
sections = pdf.pages.apply(
    lambda page: page.region(left=70, top=50, bottom=page.height - 100)
)
sections.show()

And now we simply grab the text!

In [None]:
# All sections combined into one string
print(sections.extract_text())

# Just the first page's section
print(sections[0].extract_text())

In [None]:
sections.extract_each_text()

///

/// tab | Ignore what we don't want

Another route is through **more exclusions**. We start by finding the areas on the page where the Bad Stuff is.

In [None]:
left = page.region(right=70)
top = page.region(bottom=50)
bottom = page.region(top=page.height-100)
(left + top + bottom).show()

Then we tell the PDF to ignore those regions on every single page.

In [None]:
pdf.add_exclusion(lambda page: page.region(right=70))
pdf.add_exclusion(lambda page: page.region(bottom=50))
pdf.add_exclusion(lambda page: page.region(top=page.height-100))
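
The exclusions are written as lambdas so that each region is computed per page, which matters for `page.height - 100` if pages differ in size. Below is a library-free sketch of that idea. The `Page` and `Word` classes, the `inside` and `extract_text` helpers, and all coordinates are hypothetical stand-ins for illustration, not natural-pdf objects or its actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    text: str
    x: float  # horizontal centre
    y: float  # vertical centre, measured from the top of the page


@dataclass
class Page:
    width: float
    height: float
    words: list = field(default_factory=list)


# Each exclusion is a function of the page that returns (x0, top, x1, bottom)
exclusions = [
    lambda page: (0, 0, 70, page.height),                          # number column
    lambda page: (0, 0, page.width, 50),                           # header
    lambda page: (0, page.height - 100, page.width, page.height),  # footer
]


def inside(word, box):
    """True if the word's centre lies within the box."""
    x0, top, x1, bottom = box
    return x0 <= word.x <= x1 and top <= word.y <= bottom


def extract_text(page):
    """Join the words that fall outside every exclusion box for this page."""
    boxes = [make_box(page) for make_box in exclusions]
    kept = [w.text for w in page.words if not any(inside(w, b) for b in boxes)]
    return " ".join(kept)


page = Page(612, 792, [
    Word("H.B.", 300, 30),    # header
    Word("1", 55, 120),       # line number
    Word("To", 100, 120),     # body
    Word("amend", 130, 120),  # body
    Word("Page", 300, 750),   # footer
])
print(extract_text(page))
```

Because the boxes are rebuilt from each page's own dimensions, the footer strip follows the bottom edge even on a shorter page.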

Done and done! We get a little extra text on the first page compared to the first approach, but this is much easier.

In [None]:
pdf.extract_text()

///