# Extract data from a PDF

The PDF we'll be working with today is [a list of licensed debt collectors in Colorado](https://coag.gov/sites/default/files/contentuploads/cp/ConsumerCreditUnit/InternetReports/carreport_0.pdf).

The file lives at `../pdfs/collections.pdf`. The data start on page 2, and the table on each page has headers.

We're going to use a really cool tool called [`pdfplumber`](https://github.com/jsvine/pdfplumber) to extract the data.

Our steps:
1. Import dependencies
2. Open the PDF and noodle around
3. Create an empty data frame and define the columns
4. Create a function to extract data from a single PDF page and return a data frame
5. Loop over the pages and call the function on each page
6. Clean up the data a bit
7. Do one quick bit of basic analysis in pandas

## 1. Import dependencies

In [None]:
# pdfplumber and pandas


## 2. Open the PDF and noodle around

Using `pdfplumber`'s syntax to open a file, let's see what's on the first page, see if we can extract a table, etc.
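
If you get stuck, here is a minimal sketch of what that noodling could look like. The function name `peek_at_pdf` and its `page_index` argument are made up for illustration; `pdfplumber.open()`, `pdf.pages` and `page.extract_table()` are the library's own calls. The sketch assumes the PDF sits at the path mentioned above and that page 2 (index 1) is a good test page, since that's where the data start.

```python
def peek_at_pdf(path='../pdfs/collections.pdf', page_index=1):
    """Open the PDF, print a few basics and return the table on one test page."""
    # third-party dependency, imported here so the function can be defined
    # before pdfplumber is installed (pip install pdfplumber)
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        # the PDF object itself
        print(pdf)

        # how many pages are we dealing with?
        print(len(pdf.pages))

        # grab a test page -- index 1 is the second page
        test_page = pdf.pages[page_index]

        # try to extract the table: a list of rows, or None if nothing is found
        table = test_page.extract_table()
        print(table)

    return table
```

Calling `peek_at_pdf()` from the notebook's directory should hand back the rows on that page as a list of lists, with the header row first.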

In [None]:
# open the pdf with pdfplumber

    # print(pdf)

    # print(pdf.pages)
    
    # get test page
    
    # try to extract the table on the test page

    # and print it


## 3. Create an empty data frame and define the columns

We're going to create an empty data frame. By looking at the source PDF, we can also define its column headers.
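
One way to do this, sketched here with the same column names used in this notebook, is to hand the list of names to the `columns` argument of `pd.DataFrame()` and pass no data at all:

```python
import pandas as pd

# column names taken from the notebook's own `cols` list
cols = ['bizname', 'license_loc', 'instate_loc', 'mailing_loc',
        'license_no', 'lic_date', 'status', 'cr_date', 'action']

# an empty data frame: nine named columns, zero rows
df = pd.DataFrame(columns=cols)

print(df.shape)
```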

In [None]:
# columns for our data frame
cols = ['bizname', 'license_loc', 'instate_loc', 'mailing_loc',
        'license_no', 'lic_date', 'status', 'cr_date', 'action']

# create a data frame from those columns


## 4. Create a function to extract data from a single PDF page

This function will be called on every PDF page we hand it. Its job is simple: Take a `pdfplumber.Page` object, extract the table and return the data in a data frame with the same headers as the empty one we just created.

👉 For more details on writing your own functions, [see this notebook](../reference/Functions.ipynb).
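
Here is one possible shape for that function, as a sketch rather than the only answer. It relies on `extract_table()` returning a list of rows (or `None` when no table is found). The `FakePage` class and its single row of values are invented stand-ins so the example runs without the PDF; in the notebook you would pass a real `pdfplumber` page instead.

```python
import pandas as pd

cols = ['bizname', 'license_loc', 'instate_loc', 'mailing_loc',
        'license_no', 'lic_date', 'status', 'cr_date', 'action']


def page_to_df(page):
    """Extract the table on one page and return it as a data frame."""
    # find the table on the page and extract the data
    table = page.extract_table()

    # no table on this page? return an empty frame with the right columns
    if not table:
        return pd.DataFrame(columns=cols)

    # skip the first row, which is the header row
    rows = table[1:]

    # return the data as a new data frame
    return pd.DataFrame(rows, columns=cols)


class FakePage:
    """Hypothetical stand-in for a pdfplumber page (placeholder values only)."""

    def extract_table(self):
        return [
            cols,  # header row
            ['Example Co', 'x', 'x', 'x', '000', '01/01/2000', 'Active', '', ''],
        ]


sample = page_to_df(FakePage())
print(sample.shape)
```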

In [None]:
# define a new function, page_to_df
# takes one argument: page

    
    # find the table on the page and extract the data

    
    # grab all rows in the table except for the first one,
    # which is the header row

    
    # return the data as a new data frame


## 5. Loop over the pages and call the function on each page

As we extract the data from each page, we'll append the data frame returned by our function to the empty data frame (`df`) that we created earlier. This code block takes a little while to run.
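
A rough sketch of the loop follows. Note that `DataFrame.append()` was removed in pandas 2.0, so this version uses `pd.concat()` with `ignore_index=True` to do the same job. To keep it runnable without the PDF, it uses a shortened, hypothetical column list and invented `FakePage` objects; with the real file, `pages` would be `pdf.pages` inside a `with pdfplumber.open(...)` block.

```python
import pandas as pd

# shortened, hypothetical column list to keep the sketch small
cols = ['bizname', 'action']


def page_to_df(page):
    """Same idea as the extraction function above: drop the header row."""
    table = page.extract_table()
    return pd.DataFrame(table[1:], columns=cols)


class FakePage:
    """Invented stand-in for a pdfplumber page."""

    def __init__(self, rows):
        self.rows = rows

    def extract_table(self):
        return [cols] + self.rows


# placeholder pages: a cover page with no data, then two pages of rows
pages = [
    FakePage([]),
    FakePage([['A Co', 'Yes'], ['B Co', '']]),
    FakePage([['C Co', '']]),
]

# start with the empty data frame
df = pd.DataFrame(columns=cols)

# skip the first page, then loop over the pages with data
for page in pages[1:]:
    new_df = page_to_df(page)

    # add the new rows to the main data frame and renumber the index
    df = pd.concat([df, new_df], ignore_index=True)

print(len(df))
```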

In [None]:
# open the PDF

    
    # skip the first page, which doesn't have a data table

    
    # loop over the pages with data

        
        # call the extraction function to grab the data from this page

        
        # append it to our main dataframe, chopping off the index column


Before we continue, let's take a look at what we've got using the pandas `head()` method.

In [None]:
# check it out with head()


I notice two things:
- `\n` newline breaks are being interpreted literally as text -- let's globally replace those
- The license date is coming in as a string, not a date, and we might be interested in doing some date filtering later -- let's coerce those values to date objects

## 6. Clean up the data a bit
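
As a sketch of the three clean-up steps, here they are applied to a tiny invented data frame (the two rows are placeholders, not real records). Swapping each line break for a space is one choice among several; you could also replace it with an empty string.

```python
import pandas as pd

# invented sample rows standing in for the extracted data
df = pd.DataFrame({
    'bizname': ['Example\nCollections', 'Sample Recovery'],
    'lic_date': ['03/15/2012', 'not a date'],
})

# kill line breaks everywhere, in place, treating the pattern as a regex
df.replace(r'\n', ' ', regex=True, inplace=True)

# convert the license date column to datetime;
# values that can't be parsed become NaT instead of raising an error
df['lic_date'] = pd.to_datetime(df['lic_date'], errors='coerce')

# "save" a version sorted by license date
df = df.sort_values('lic_date')

print(df)
```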

In [None]:
# kill line breaks
# do it inplace
# specify that you need regular expression support


# convert license date column to datetime
# coerce errors


# "save" sorted version (by lic_date column)


In [None]:
# check the output with head()


## 7. Do some basic analysis

Let's get a feel for how many records there are and figure out how many debt collectors have been subject to some kind of "action."

According to the Colorado Attorney General (see page 1 of the PDF), the presence of "Yes" in the "action" column means that the company has been

> subject to legal or administrative action by this office or the licensee entered into a voluntary settlement with this office. If the entry is "yes," the licensee may have been subject to one or more letters of admonition, suspension of the license, a judgment or order against the licensee, or other action, including payments (fines, penalties, consumer refunds, or other monetary payments.) Additionally, "yes" may mean that the licensee's records include a voluntary settlement or stipulation with this office. If a licensee has been disciplined, it might still retain its license. Actions and settlements are matters of public record although research, copying, and mailing costs may apply. Contact this office for more information.

Let's write _an entire journalism sentence_ using math and some [string formatting](../reference/String%20formatting.ipynb). We're going to report the number of debt collectors who've faced some form of legal or administrative action, and the percentage of the total that represents.

Let's do the math up front:
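
Here is a sketch of that math on a small invented data frame. It assumes the `action` column holds exactly the string `Yes` for flagged companies; check the real values (casing, stray whitespace) before trusting the filter.

```python
import pandas as pd

# invented sample rows standing in for the cleaned-up data
df = pd.DataFrame({
    'bizname': ['A Co', 'B Co', 'C Co', 'D Co'],
    'action': ['Yes', '', 'Yes', ''],
})

# how many records are there, total?
record_count = len(df)

# filter to get just the collectors with an action on record
action = df[df['action'] == 'Yes']

# how many of those are there?
action_count = len(action)

# calculate the percentage of the whole
pct_whole = (action_count / record_count) * 100

print(record_count, action_count, pct_whole)
```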

In [None]:
# how many records are there, total? use len
# variable should be called record_count

# let's filter to get just the collectors who've had some action taken against them
# variable should be called action


# how many of those are there?
# variable should be called action_count


# calculate the percentage of the whole
# variable should be called pct_whole


... and now we can formulate a sentence.

In [None]:
# write out our formatted sentence using an f-string
story_sentence = f'Of {record_count:,} licensed debt collectors in Colorado, {action_count:,} ({pct_whole:0.2f}%) have been subject to some form of legal or administrative action, according to an analysis of Colorado Attorney General data.'

print(story_sentence)

# 📚 GROUP HOMEWORK 📚

In groups, answer these questions:
- How many debt collectors had their licenses revoked?
- **Bonus**: How many debt collectors were licensed in the past three years? (Hint: Will require using the [`apply()`](../reference/Using%20the%20apply%20method%20in%20pandas.ipynb) method.)