# Demonstration of `pdfplumber.utils`

This notebook uses our [example PDF](../pdfs/background-checks.pdf) from the FBI's National Instant Criminal Background Check System to demonstrate `pdfplumber.utils`.

### Import `pdfplumber` and the relevant `utils`

In [1]:
import pdfplumber
from pdfplumber.utils import within_bbox, extract_columns, collate_chars

### Load the PDF

In [2]:
pdf = pdfplumber.from_path("../pdfs/background-checks.pdf")

### Store the page width for use in constructing the bounding boxes later

In [3]:
PDF_WIDTH = pdf.pages[0].width

### Here are the first three characters stored in the PDF:

In [4]:
print(pdf.chars[:3])

[{'size': 11.414, 'width': 4.642, 'pageid': 1, 'x1': 51.682, 'object_type': 'char', 'top': 67.486, 'text': 'S', 'x0': 47.04, 'upright': True, 'y1': 544.514, 'y0': 533.099, 'height': 11.414, 'doctop': 67.486, 'fontname': 'DCLTEC+Helvetica-Bold', 'adv': 0.667}, {'size': 11.414, 'width': 2.318, 'pageid': 1, 'x1': 53.916, 'object_type': 'char', 'top': 67.486, 'text': 't', 'x0': 51.599, 'upright': True, 'y1': 544.514, 'y0': 533.099, 'height': 11.414, 'doctop': 67.486, 'fontname': 'DCLTEC+Helvetica-Bold', 'adv': 0.333}, {'size': 11.414, 'width': 3.87, 'pageid': 1, 'x1': 57.863, 'object_type': 'char', 'top': 67.486, 'text': 'a', 'x0': 53.993, 'upright': True, 'y1': 544.514, 'y0': 533.099, 'height': 11.414, 'doctop': 67.486, 'fontname': 'DCLTEC+Helvetica-Bold', 'adv': 0.556}]
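
How these keys relate to one another can be checked with a little arithmetic. The sketch below copies values from the first character above. It assumes that `top` is measured down from the top of the page and `y1` up from the bottom; the page height it implies is inferred from the numbers, not stated anywhere in this notebook.

```python
# Values copied from the first character ('S') in the output above.
char = {
    "x0": 47.04, "x1": 51.682,
    "y0": 533.099, "y1": 544.514,
    "top": 67.486, "width": 4.642,
}

# The width is the distance between the left and right edges.
computed_width = round(char["x1"] - char["x0"], 3)

# Assumption: `top` counts down from the top edge and `y1` counts up from
# the bottom edge, so their sum should equal the page height.
implied_page_height = round(char["top"] + char["y1"], 3)

print(computed_width, implied_page_height)
```

The same sum for the first table character further down (`top` 78.535, `y1` 533.465) gives the same figure, which is consistent with that reading.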


### Use `within_bbox` to focus on the main data table

The table starts about 77px from the top of the page and is about 410px tall, so it ends near 485px. To select only its characters, we pass `within_bbox` a bounding box of `(0, 77, PDF_WIDTH, 485)`, which supplies the `(x0, top0, x1, top1)` values.

In [5]:
table_chars = within_bbox(pdf.chars, (0, 77, PDF_WIDTH, 485))
print(table_chars[:3])

[{'size': 7.667, 'width': 3.842, 'pageid': 1, 'x1': 47.042, 'object_type': 'char', 'y0': 525.799, 'text': 'A', 'x0': 43.2, 'upright': True, 'y1': 533.465, 'top': 78.535, 'height': 7.667, 'doctop': 78.535, 'fontname': 'WEVZII+ArialMT', 'adv': 0.667}, {'size': 7.667, 'width': 1.279, 'pageid': 1, 'x1': 48.079, 'object_type': 'char', 'y0': 525.799, 'text': 'l', 'x0': 46.8, 'upright': True, 'y1': 533.465, 'top': 78.535, 'height': 7.667, 'doctop': 78.535, 'fontname': 'WEVZII+ArialMT', 'adv': 0.222}, {'size': 7.667, 'width': 3.203, 'pageid': 1, 'x1': 51.437, 'object_type': 'char', 'y0': 525.799, 'text': 'a', 'x0': 48.234, 'upright': True, 'y1': 533.465, 'top': 78.535, 'height': 7.667, 'doctop': 78.535, 'fontname': 'WEVZII+ArialMT', 'adv': 0.556}]
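
To make the selection concrete, here is a minimal sketch of a bounding-box filter over character dictionaries. It is not `within_bbox` itself: the helper name `chars_in_bbox` is hypothetical, the rule that a character must lie entirely inside the box is an assumption, and `792` is an assumed stand-in for `PDF_WIDTH`.

```python
def chars_in_bbox(chars, bbox):
    """Keep characters that lie entirely inside (x0, top0, x1, top1).

    Hypothetical helper; full containment is an assumed rule.
    """
    x0, top0, x1, top1 = bbox
    return [
        c for c in chars
        if c["x0"] >= x0
        and c["x1"] <= x1
        and c["top"] >= top0
        and c["top"] + c["height"] <= top1  # bottom edge of the character
    ]

# Two characters from the outputs above, trimmed to the relevant keys.
sample = [
    {"text": "S", "x0": 47.04, "x1": 51.682, "top": 67.486, "height": 11.414},
    {"text": "A", "x0": 43.2, "x1": 47.042, "top": 78.535, "height": 7.667},
]

# 792 is an assumed page width, used here in place of PDF_WIDTH.
kept = chars_in_bbox(sample, (0, 77, 792, 485))
print([c["text"] for c in kept])
```

The title character `S` starts above the 77px line and is dropped, while the table's first `A` is kept.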


### Use `extract_columns` to divide the characters into rows and columns

Because side-by-side characters don't abut one another exactly, we pass `x_tolerance=2`.

In [6]:
table = extract_columns(table_chars, x_tolerance=2)
print(table[:2])

[{0: 'Alabama', 1: '18,870', 2: '23,022', 3: '22,650', 4: '859', 5: '1,178', 6: '0', 7: '14', 8: '15', 9: '0', 10: '2,179', 11: '2,307', 12: '11', 13: '0', 14: '0', 15: '0', 16: '', 17: '', 18: '13', 19: '14', 20: '0', 21: '3', 22: '2', 23: '0', 24: '71,137'}, {0: 'Alaska', 1: '209', 2: '3,062', 3: '3,209', 4: '191', 5: '184', 6: '0', 7: '9', 8: '3', 9: '0', 10: '100', 11: '100', 12: '0', 13: '18', 14: '9', 15: '1', 16: '', 17: '', 18: '0', 19: '0', 20: '0', 21: '0', 22: '0', 23: '0', 24: '7,095'}]
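
The effect of a horizontal tolerance can be illustrated with a minimal sketch. This is not how `extract_columns` is implemented; `group_by_gap` is a hypothetical helper that starts a new group whenever the gap between one character's `x1` and the next character's `x0` exceeds the tolerance. The first three characters come from the output above, and the fourth is made up to show a split.

```python
def group_by_gap(chars, x_tolerance=0):
    """Join left-to-right characters, starting a new group at each wide gap.

    Hypothetical helper for illustration only.
    """
    groups = []
    last_x1 = None
    for c in sorted(chars, key=lambda c: c["x0"]):
        # A gap wider than the tolerance marks the start of a new group.
        if last_x1 is None or c["x0"] - last_x1 > x_tolerance:
            groups.append("")
        groups[-1] += c["text"]
        last_x1 = c["x1"]
    return groups

row = [
    {"text": "A", "x0": 43.2, "x1": 47.042},
    {"text": "l", "x0": 46.8, "x1": 48.079},
    {"text": "a", "x0": 48.234, "x1": 51.437},
    {"text": "1", "x0": 90.0, "x1": 93.0},  # hypothetical character, not from the PDF
]

loose = group_by_gap(row, x_tolerance=2)
strict = group_by_gap(row, x_tolerance=0.1)
print(loose)
print(strict)
```

With a tolerance of 2 the letters of "Ala" stay together. With a tolerance of 0.1 the 0.155 gap between `l` and `a` is enough to split them, which is the kind of fragmentation a nonzero tolerance avoids.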


### Convert keys and values to something more useful

The dictionary keys returned by `extract_columns` are simply numbered in order, `0, 1, 2, ...`. Let's replace them with the actual column names, and also convert the numeric strings to integers, e.g., `"18,870" -> 18870`:

In [7]:
COLUMNS = [
    "state",
    "permit",
    "handgun",
    "long_gun",
    "other",
    "multiple",
    "admin",
    "prepawn_handgun",
    "prepawn_long_gun",
    "prepawn_other",
    "redemption_handgun",
    "redemption_long_gun",
    "redemption_other",
    "returned_handgun",
    "returned_long_gun",
    "returned_other",
    "rentals_handgun",
    "rentals_long_gun",
    "private_sale_handgun",
    "private_sale_long_gun",
    "private_sale_other",
    "return_to_seller_handgun",
    "return_to_seller_long_gun",
    "return_to_seller_other",
    "totals"
]

In [8]:
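def parse_value(k, x):
    # Column 0 is the state name; keep it as text.
    if k == 0: return x
    # Blank cells become None.
    if x == "": return None
    # Drop thousands separators, e.g. "18,870" -> 18870.
    return int(x.replace(",", ""))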

In [9]:
def parse_row(row):
    return {COLUMNS[k]: parse_value(k, v) for k, v in row.items()}

In [10]:
parsed_table = [ parse_row(row) for row in table ]

Here's a sample row:

In [11]:
parsed_table[-2]

{'admin': 1,
 'handgun': 1745,
 'long_gun': 2372,
 'multiple': 104,
 'other': 87,
 'permit': 383,
 'prepawn_handgun': 0,
 'prepawn_long_gun': 4,
 'prepawn_other': 0,
 'private_sale_handgun': 1,
 'private_sale_long_gun': 2,
 'private_sale_other': 0,
 'redemption_handgun': 132,
 'redemption_long_gun': 184,
 'redemption_other': 0,
 'rentals_handgun': None,
 'rentals_long_gun': None,
 'return_to_seller_handgun': 0,
 'return_to_seller_long_gun': 2,
 'return_to_seller_other': 0,
 'returned_handgun': 0,
 'returned_long_gun': 0,
 'returned_other': 0,
 'state': 'Wyoming',
 'totals': 5017}
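
One way to sanity-check a parsed row is to confirm that its individual counts add up to its `totals` value. The sketch below does this for the sample row above. It assumes that `totals` is the plain sum of the other numeric columns, which holds for this row but is not something the notebook states.

```python
# The sample row shown above, copied verbatim.
row = {
    "admin": 1, "handgun": 1745, "long_gun": 2372, "multiple": 104,
    "other": 87, "permit": 383,
    "prepawn_handgun": 0, "prepawn_long_gun": 4, "prepawn_other": 0,
    "private_sale_handgun": 1, "private_sale_long_gun": 2, "private_sale_other": 0,
    "redemption_handgun": 132, "redemption_long_gun": 184, "redemption_other": 0,
    "rentals_handgun": None, "rentals_long_gun": None,
    "return_to_seller_handgun": 0, "return_to_seller_long_gun": 2,
    "return_to_seller_other": 0,
    "returned_handgun": 0, "returned_long_gun": 0, "returned_other": 0,
    "state": "Wyoming", "totals": 5017,
}

# Sum every count except the label and the total itself, skipping blanks.
computed_total = sum(
    value for key, value in row.items()
    if key not in ("state", "totals") and value is not None
)

print(computed_total == row["totals"])
```

A mismatch would suggest that a cell was dropped or that two columns were merged during extraction.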

### Sort the data

For demonstration purposes, let's list the rows with the highest number of handgun-only background checks:

In [12]:
for row in sorted(parsed_table, key=lambda x: x["handgun"], reverse=True)[:6]:
    print("{state}: {handgun:,d} handgun-only checks".format(**row))

Totals: 671,330 handgun-only checks
Pennsylvania: 62,752 handgun-only checks
Texas: 56,941 handgun-only checks
Florida: 50,796 handgun-only checks
California: 41,181 handgun-only checks
Ohio: 34,878 handgun-only checks
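
The first line of that output is the table's own `Totals` row, not a state. The sketch below shows one way to leave it out of the ranking, and to skip rows whose `handgun` cell was blank, since `None` cannot be compared with integers in Python 3. The first four rows use figures from the output above; the last row is hypothetical and exists only to show the blank-cell case.

```python
rows = [
    {"state": "Totals", "handgun": 671330},
    {"state": "Pennsylvania", "handgun": 62752},
    {"state": "Texas", "handgun": 56941},
    {"state": "Florida", "handgun": 50796},
    {"state": "Example Territory", "handgun": None},  # hypothetical blank cell
]

# Drop the aggregate row and any row without a handgun count.
states_only = [
    r for r in rows
    if r["state"] != "Totals" and r["handgun"] is not None
]

top = sorted(states_only, key=lambda r: r["handgun"], reverse=True)[:3]
for r in top:
    print("{state}: {handgun:,d} handgun-only checks".format(**r))
```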


### Use `within_bbox` and `collate_chars` to extract the report month

The month of the report is listed in an area 35px to 60px from the top of the page. The code below isolates characters in that space, and then collates their text.

In [13]:
month_chars = within_bbox(pdf.chars, (0, 35, PDF_WIDTH, 60))

In [14]:
collate_chars(month_chars, x_tolerance=2)

'November - 2015'
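
If the month is needed as a date and not a string, the collated text can be parsed with the standard library. This is a small sketch that assumes the text always follows the `Month - Year` pattern seen above and that month names are in English (`%B` depends on the locale).

```python
from datetime import datetime

# The string returned by collate_chars above.
month_text = "November - 2015"

# %B matches the full month name, %Y the four-digit year.
report_month = datetime.strptime(month_text, "%B - %Y")
print(report_month.strftime("%Y-%m"))
```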

---
