# Extracting Use-of-Force Records from a Vancouver Police PDF

This PDF contains detailed records of Vancouver Police's use-of-force incidents, provided after a public records request by journalists. Challenges include its very small font size and large amounts of whitespace.


In [None]:
# Install natural-pdf
!pip install natural-pdf

In [None]:
# Download the PDF file
import urllib.request
import os

pdf_url = "https://pub-4e99d31d19cb404d8d4f5f7efa51ef6e.r2.dev/pdfs/use-of-force-raw/use-of-force-raw.pdf"
pdf_name = "use-of-force-raw.pdf"

if not os.path.exists(pdf_name):
    print(f"Downloading {pdf_name}...")
    urllib.request.urlretrieve(pdf_url, pdf_name)
    print(f"Downloaded {pdf_name}")
else:
    print(f"{pdf_name} already exists")

In [None]:
from natural_pdf import PDF

pdf = PDF("use-of-force-raw.pdf")
page = pdf.pages[0]
page.show()

Let's find all the headers. They're the text at the top of the page, which means they have the smallest `y0`.

In [None]:
headers = page.find_all('text[y0=min()]')
headers.extract_each_text()
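
To make the "smallest `y0`" idea concrete, here is a minimal plain-Python sketch of the same selection. It does not use natural-pdf at all: the `words` list, its coordinates, the `top_row` helper, and the `tolerance` parameter are all made up for illustration, and `y0` is treated as distance from the top of the page, as in the text above.

```python
# Hypothetical text elements as (text, x0, y0) tuples.
# The values are invented for illustration, not taken from the PDF.
words = [
    ("Incident", 10.0, 20.0),
    ("Date", 80.0, 20.0),
    ("Officer", 150.0, 20.0),
    ("2019-001", 10.0, 35.0),
    ("2019-01-04", 80.0, 35.0),
]


def top_row(elements, tolerance=0.5):
    """Return the elements whose y0 is within `tolerance` of the smallest y0."""
    smallest = min(y0 for _, _, y0 in elements)
    return [element for element in elements if element[2] - smallest <= tolerance]


# Keep only the text of the elements sitting on the topmost line.
header_texts = [text for text, _, _ in top_row(words)]
print(header_texts)
```

The small tolerance is one possible way to cope with headers whose baselines differ by a fraction of a point; whether the `min()` selector does anything similar is not something this sketch claims.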

We can now use those headers to create guides that fit between each column.

In [None]:
from natural_pdf.analyzers.guides import Guides

guides = Guides(page)
guides.vertical.from_headers(headers)
guides.show()
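
As a rough illustration of what "guides that fit between each column" could mean, the sketch below places one vertical line in the middle of each gap between neighbouring header boxes. This is an assumption for illustration only, not a description of how `from_headers` actually positions its lines; the `header_boxes` values and the `boundaries_between` helper are hypothetical.

```python
# Hypothetical header bounding boxes as (text, x0, x1) tuples.
# The coordinates are invented for illustration.
header_boxes = [
    ("Incident", 10.0, 60.0),
    ("Date", 80.0, 110.0),
    ("Officer", 150.0, 200.0),
]


def boundaries_between(boxes):
    """Return x positions: the outer edges plus the midpoint of each gap."""
    ordered = sorted(boxes, key=lambda box: box[1])  # left to right by x0
    lines = [ordered[0][1]]  # left edge of the first header
    for left, right in zip(ordered, ordered[1:]):
        # Midpoint between the right edge of one header and the left edge of the next
        lines.append((left[2] + right[1]) / 2)
    lines.append(ordered[-1][2])  # right edge of the last header
    return lines


column_lines = boundaries_between(header_boxes)
print(column_lines)
```

Any text whose x position falls between two consecutive lines would then belong to the column in between.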

With the columns established, we're free to extract the table. Behind the scenes, [pdfplumber](https://github.com/jsvine/pdfplumber) is smart enough to work out where each row falls.

In [None]:
guides.extract_table().to_df()

## Combining the results for every page

If you provide a list of pages, guides can extract the tables from each of them. It will also do nice things like automatically remove duplicate column headers without you even asking!

In [None]:
df = guides.extract_table(pdf.pages).to_df()
print("You found", len(df), "rows")

df.tail()
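
If you're curious what removing duplicate column headers amounts to, here is a minimal plain-Python sketch of the idea. It is an illustration under assumptions, not natural-pdf's implementation: the `pages` data and the `combine_pages` helper are hypothetical, and it assumes the first row of the first page is the header and that every page repeats that header verbatim.

```python
# Hypothetical per-page tables, each a list of rows (lists of cell strings).
# Every page repeats the header row, as multi-page PDF tables often do.
pages = [
    [["Incident", "Date"], ["2019-001", "2019-01-04"]],
    [["Incident", "Date"], ["2019-002", "2019-01-09"]],
]


def combine_pages(page_tables):
    """Stack rows from all pages, dropping every repeat of the header row."""
    header = page_tables[0][0]  # assumption: first row of first page is the header
    rows = []
    for table in page_tables:
        for row in table:
            if row != header:  # skip the header wherever it reappears
                rows.append(row)
    return header, rows


header, rows = combine_pages(pages)
print(header, len(rows))
```

A genuine data row identical to the header would also be dropped by this approach, so it is worth comparing the row count against what you expect.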