# Extract data from a PDF

The PDF we'll be working with today is [a list of licensed debt collectors in Colorado](https://coag.gov/sites/default/files/contentuploads/cp/ConsumerCreditUnit/InternetReports/carreport_0.pdf).

The file lives at `../pdfs/collections.pdf`. The data start on page 2, and the table on each page has headers.

We're going to use a really cool tool called [`pdfplumber`](https://github.com/jsvine/pdfplumber) to extract the data.

Our steps:
1. Import dependencies
2. Open the PDF and noodle around
3. Create an empty data frame and specify the columns
4. Create a function to extract data from a single PDF page and return a data frame
5. Loop over the pages and call the function on each page
6. Clean up the data a bit
7. Do one quick bit of basic analysis in pandas

## 1. Import dependencies

In [1]:
import pdfplumber
import pandas as pd

## 2. Open the PDF and noodle around

Using `pdfplumber`'s syntax to open a file, let's see what's on the first page with data (page 2 of the PDF, which is `pdf.pages[1]` because Python counts from zero), see if we can extract a table, etc.

In [6]:
with pdfplumber.open('../pdfs/collections.pdf') as pdf:
    # print(pdf)
    # print(pdf.pages)
    test = pdf.pages[1]
    table = test.extract_table()
    print(table)

[['Original  Cancel/\nLicensed Instate Mailing License  License  Revoke \nBusiness/Trade Name Location Location Location Number Date Status Date Action', None, None, None, None, None, None, None, None], ['1ST CREDIT OF AMERICA LLC', '300 N ELIZABETH ST STE 220-B\nCHICAGO, IL 60607', '3025 S PARKER RD STE 711\nAURORA, CO 80014', '300 N ELIZABETH ST STE 220-B\nCHICAGO, IL 60607', '988412', '2/20/2004', 'C', '5/15/2007', 'Yes'], ['1ST NATIONAL RECOVERY \nSOLUTIONS LLC', '5497 BROADWAY ST\nLANCASTER, NY 14086', '600 17TH ST STE 800 NORTH\nDENVER, CO 80202', '5497 BROADWAY ST\nLANCASTER, NY 14086', '989708', '8/15/2007', 'E', '3/8/2010', ''], ['1ST NATIONWIDE \nCOLLECTION AGENCY INC', '3760 CALLE TECATE STE B\nCAMARILLO, CA 93012', '3025 S PARKER RD STE 711\nAURORA, CO 80014', 'PO BOX 1418\nCAMARILLO, CA 93011-1418', '989591', '3/6/2007', 'C', '11/12/2008', ''], ['21ST MORTGAGE \nCORPORATION', '620 MARKET ST\nKNOXVILLE, TN 37902', '3455 W SERVICE RD\nEVANS, CO 80620', 'PO BOX 477\nKNOXVILLE
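To make sense of that output: `extract_table()` hands back a list of rows, and each row is a list of cell strings (or `None` where a cell came back empty). Here's a minimal sketch that pokes at a hand-copied slice of the output above instead of the real PDF. The name `sample_table` is just for illustration.

```python
# a hand-copied slice of the extract_table() output shown above
# (sample_table is an illustrative name, not part of pdfplumber)
sample_table = [
    ['Original  Cancel/\nLicensed Instate Mailing License  License  Revoke \n'
     'Business/Trade Name Location Location Location Number Date Status Date Action',
     None, None, None, None, None, None, None, None],
    ['1ST CREDIT OF AMERICA LLC',
     '300 N ELIZABETH ST STE 220-B\nCHICAGO, IL 60607',
     '3025 S PARKER RD STE 711\nAURORA, CO 80014',
     '300 N ELIZABETH ST STE 220-B\nCHICAGO, IL 60607',
     '988412', '2/20/2004', 'C', '5/15/2007', 'Yes'],
]

# the first row is the header, everything after it is data
header_row, data_rows = sample_table[0], sample_table[1:]

# how wide is each row, and how many header cells actually have text?
row_widths = {len(row) for row in sample_table}
filled_header_cells = [cell for cell in header_row if cell is not None]

print(row_widths)                 # {9}
print(len(filled_header_cells))   # 1
print(data_rows[0][0])            # 1ST CREDIT OF AMERICA LLC
```

Every row has nine cells, but all of the header text is crammed into the first one, so the extracted header row isn't usable as column names.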

## 3. Create an empty data frame and define the columns

We're going to create an empty data frame. By looking at the source PDF, we can also define its column headers.

In [7]:
cols = ['bizname', 'license_loc', 'instate_loc', 'mailing_loc',
        'license_no', 'lic_date', 'status', 'cr_date', 'action']

df = pd.DataFrame(columns=cols)

## 4. Create a function to extract data from a single PDF page

This function will be called on every PDF page we hand it. Its job is simple: Take a `pdfplumber.Page` object, extract the table and return the data in a data frame with the same headers as the empty one we just created.

👉 For more details on writing your own functions, [see this notebook](../reference/Functions.ipynb).

In [11]:
def page_to_df(page):
    
    # find the table on the page and extract the data
    table = page.extract_table()
    
    # grab all rows in the table except for the first one,
    # which is the header row
    lines = table[1:]
    
    # return the data in a data frame
    return pd.DataFrame(lines, columns=cols)
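One thing to watch for: as far as I know, `extract_table()` returns `None` when it can't find a table on a page, and `table[1:]` would then throw an error. Below is a hedged sketch of a more defensive variant. The names `page_to_df_safe` and `FakePage` are made up for this example, and `FakePage` is only a stand-in so the code runs without the PDF.

```python
import pandas as pd

cols = ['bizname', 'license_loc', 'instate_loc', 'mailing_loc',
        'license_no', 'lic_date', 'status', 'cr_date', 'action']


def page_to_df_safe(page):
    """Like page_to_df, but return an empty data frame if no table is found."""
    table = page.extract_table()

    # assumption: a page with no table gives us None (or an empty list)
    if not table:
        return pd.DataFrame(columns=cols)

    # skip the header row, same as before
    return pd.DataFrame(table[1:], columns=cols)


class FakePage:
    """Hypothetical stand-in for a pdfplumber page, for demonstration only."""

    def __init__(self, table):
        self._table = table

    def extract_table(self):
        return self._table


# one header row plus one data row copied from the output above
fake_table = [
    ['header text'] + [None] * 8,
    ['1ST CREDIT OF AMERICA LLC',
     '300 N ELIZABETH ST STE 220-B\nCHICAGO, IL 60607',
     '3025 S PARKER RD STE 711\nAURORA, CO 80014',
     '300 N ELIZABETH ST STE 220-B\nCHICAGO, IL 60607',
     '988412', '2/20/2004', 'C', '5/15/2007', 'Yes'],
]

empty_result = page_to_df_safe(FakePage(None))
one_row_result = page_to_df_safe(FakePage(fake_table))

print(empty_result.shape)
print(one_row_result.shape)
```

Whether you want to silently skip a table-less page or stop and investigate is a judgment call. For a PDF you haven't eyeballed page by page, raising an error might be the safer choice.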

## 5. Loop over the pages and call the function on each page

As we extract the data from each page, we'll append the data frame returned by our function to the empty data frame (`df`) that we created earlier. This code block takes a little while to run.

In [12]:
# open the PDF
with pdfplumber.open('../pdfs/collections.pdf') as pdf:
    
    # skip the first page, which doesn't have a data table
    pages_with_data = pdf.pages[1:]
    
    # loop over the pages with data
    for page in pages_with_data:
        
        # call the extraction function to grab the data from this page
        df_to_append = page_to_df(page)
        
        # add it to our main data frame, renumbering the index as we go
        # (`DataFrame.append` was removed in pandas 2.0, so we use `pd.concat`)
        df = pd.concat([df, df_to_append], ignore_index=True)
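An alternative pattern, sketched here with two tiny hand-made data frames instead of real pages: collect each page's data frame in a list, then combine them once at the end with `pd.concat`. Growing a data frame inside a loop copies the data on every pass, so this tends to be faster on long PDFs. The shortened `cols` list and the `page_frames` name are just for illustration.

```python
import pandas as pd

# shortened column list, for illustration only
cols = ['bizname', 'license_no']

# pretend each of these came back from page_to_df(page);
# with the real PDF, this could be something like
# [page_to_df(page) for page in pdf.pages[1:]]
page_frames = [
    pd.DataFrame([['1ST CREDIT OF AMERICA LLC', '988412']], columns=cols),
    pd.DataFrame([['21ST MORTGAGE CORPORATION', '991831']], columns=cols),
]

# stack them into one data frame and renumber the rows from zero
combined = pd.concat(page_frames, ignore_index=True)

print(combined)
```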

Before we continue, let's take a look at what we've got using the pandas `head()` method.

In [13]:
df.head()

Unnamed: 0,bizname,license_loc,instate_loc,mailing_loc,license_no,lic_date,status,cr_date,action
0,1ST CREDIT OF AMERICA LLC,"300 N ELIZABETH ST STE 220-B\nCHICAGO, IL 60607","3025 S PARKER RD STE 711\nAURORA, CO 80014","300 N ELIZABETH ST STE 220-B\nCHICAGO, IL 60607",988412,2/20/2004,C,5/15/2007,Yes
1,1ST NATIONAL RECOVERY \nSOLUTIONS LLC,"5497 BROADWAY ST\nLANCASTER, NY 14086","600 17TH ST STE 800 NORTH\nDENVER, CO 80202","5497 BROADWAY ST\nLANCASTER, NY 14086",989708,8/15/2007,E,3/8/2010,
2,1ST NATIONWIDE \nCOLLECTION AGENCY INC,"3760 CALLE TECATE STE B\nCAMARILLO, CA 93012","3025 S PARKER RD STE 711\nAURORA, CO 80014","PO BOX 1418\nCAMARILLO, CA 93011-1418",989591,3/6/2007,C,11/12/2008,
3,21ST MORTGAGE \nCORPORATION,"620 MARKET ST\nKNOXVILLE, TN 37902","3455 W SERVICE RD\nEVANS, CO 80620","PO BOX 477\nKNOXVILLE, TN 37901-0477",991831,4/16/2013,A,Active,
4,24 ASSET MANAGEMENT \nCORP,"2020 CAMINO DEL RIO N STE 900\nSAN DIEGO, CA 9...","80 GARDEN CTR STE 3\nBROOMFIELD, CO 80020","2020 CAMINO DEL RIO N STE \n900\nSAN DIEGO, CA...",990402,11/13/2009,C,1/6/2016,


I notice two things:
- `\n` newline breaks are being interpreted literally as text -- let's globally replace those
- The license date is coming in as a string, not a date, and we might be interested in doing some date filtering later -- let's coerce those values to date objects

## 6. Clean up the data a bit

In [14]:
# kill line breaks
df.replace('\n', ' ', inplace=True, regex=True)

# coerce license date col to datetime and sort descending
df.lic_date = pd.to_datetime(df.lic_date, errors='coerce')
df = df.sort_values('lic_date', ascending=False)
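To see what those two cleanup calls do, here's a small sketch on a hand-made data frame. The `sample` frame and the `'not a date'` value are made up for illustration. The point is that `errors='coerce'` turns anything unparseable into `NaT` instead of raising an error.

```python
import pandas as pd

# two business names copied from the output above, plus one deliberately bad date
sample = pd.DataFrame({
    'bizname': ['1ST NATIONAL RECOVERY \nSOLUTIONS LLC',
                '21ST MORTGAGE \nCORPORATION'],
    'lic_date': ['8/15/2007', 'not a date'],
})

# swap every newline for a space, in every string column
sample = sample.replace('\n', ' ', regex=True)

# parse the dates; values that can't be parsed become NaT
sample['lic_date'] = pd.to_datetime(sample['lic_date'], errors='coerce')

print(sample)
```

Note that a name like `'RECOVERY \nSOLUTIONS'` ends up with two spaces in a row, because there was already a space before the newline. If that matters for matching later, a pattern such as `r'\s*\n\s*'` would collapse the whole run into one space.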

In [15]:
df.head()

Unnamed: 0,bizname,license_loc,instate_loc,mailing_loc,license_no,lic_date,status,cr_date,action
2192,TITANIUM FINANCIAL SOLUTIONS LLC,"3000 S JAMAICA CT STE 355 AURORA, CO 80014","3000 S JAMAICA CT STE 355 AURORA, CO 80014","PO BOX 372130 DENVER, CO 80237",993201,2018-03-06,A,Active,
1847,RCS CAPITAL PARTNERS INC,"270 NORTHPOINTE PKWY STE 40 AMHERST, NY 14228","7200 S ALTON WAY STE B180 DENVER, CO 80112","270 NORTHPOINTE PKWY STE 40 AMHERST, NY 14228",993200,2018-03-06,A,Active,
1278,KINUM INC,"800 SEAHAWK CIR STE 124 VIRGINIA BEACH, VA 23452","27 N WILLERUP STE B MONTROSE, CO 81401","2133 UPTON DR STE 126-129 VIRGINIA BEACH, VA 2...",993199,2018-03-06,A,Active,Yes
1716,PERFECTION COLLECTION LLC,"313 E 1200 S STE 102 OREM, UT 84058","27 N WILLERUP STE B MONTROSE, CO 81401","313 E 1200 S STE 102 OREM, UT 84058",993194,2018-02-26,A,Active,Yes
1974,RONEN LLC,"2003 WESTERN AVE STE 340 SEATTLE, WA 98121","80 GARDEN CTR BLDG B, STE 3 BROOMFIELD, CO 80020","2003 WESTERN AVE STE 340 SEATTLE, WA 98121",993195,2018-02-26,A,Active,


## 7. Do some basic analysis

Let's get a feel for how many records there are and figure out how many of the debt collectors have been subject to some kind of "action."

According to the Colorado Attorney General (see page 1 of the PDF), the presence of "Yes" in the "action" column means that the company has been

> subject to legal or administrative action by this office or the licensee entered into a voluntary settlement with this office. If the entry is "yes," the licensee may have been subject to one or more letters of admonition, suspension of the license, a judgment or order against the licensee, or other action, including payments (fines, penalties, consumer refunds, or other monetary payments.) Additionally, "yes" may mean that the licensee's records include a voluntary settlement or stipulation with this office. If a licensee has been disciplined, it might still retain its license. Actions and settlements are matters of public record although research, copying, and mailing costs may apply. Contact this office for more information.

Let's write _an entire journalism sentence_ using math and some [string formatting](../reference/String%20formatting.ipynb). We're going to report the number of debt collectors who've faced some form of legal or administrative action, and the percentage of the total that represents.

Let's do the math up front:

In [16]:
# how many records are there, total?
record_count = len(df)

# let's filter to get just the collectors who've had some action taken against them
action = df[df.action == 'Yes']

# how many of those are there?
action_count = len(action)

# calculate the percentage of the whole
pct_whole = (action_count / record_count) * 100
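A shorter route to the same numbers, sketched on a made-up four-row frame: a comparison like `sample.action == 'Yes'` gives a column of `True`/`False` values, and because pandas treats `True` as 1 and `False` as 0, `.sum()` counts the matches and `.mean()` gives their share of the total. The `sample_` names are just for illustration.

```python
import pandas as pd

# made-up data: one 'Yes' out of four records
sample = pd.DataFrame({'action': ['Yes', '', '', '']})

# True where action was taken, False otherwise
is_yes = sample['action'] == 'Yes'

sample_action_count = int(is_yes.sum())   # number of True values
sample_pct_whole = is_yes.mean() * 100    # share of True values, as a percentage

print(sample_action_count)
print(sample_pct_whole)
```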

... and now we can formulate a sentence.

In [19]:
# write out our formatted sentence using an f-string
story_sentence = f'Of {record_count:,} licensed debt collectors in Colorado, {action_count:,} ({pct_whole:0.2f}%) have been subject to some form of legal or administrative action, according to an analysis of Colorado Attorney General data.'

print(story_sentence)

Of 2,402 licensed debt collectors in Colorado, 687 (28.60%) have been subject to some form of legal or administrative action, according to an analysis of Colorado Attorney General data.
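The bits after the colons in that f-string are format specifiers. A quick sketch using the counts printed above, typed in by hand so it runs on its own: `:,` adds thousands separators, and `:0.2f` rounds to two decimal places and keeps trailing zeros. The `count_text` and `pct_text` names are just for illustration.

```python
# the counts from the sentence above, typed in by hand
record_count = 2402
action_count = 687
pct_whole = (action_count / record_count) * 100

# thousands separator
count_text = f'{record_count:,}'

# two decimal places, trailing zero kept
pct_text = f'{pct_whole:0.2f}'

print(count_text)   # 2,402
print(pct_text)     # 28.60
```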


In [36]:
# what else?