# More advanced PDF practice

This example shows how to parse a slightly more difficult PDF using our good friend [`pdfplumber`](https://github.com/jsvine/pdfplumber).

For this exercise, we'll pull the data out of a PDF that has a fixed-width table of Colorado county-level voter registration data from April 2008. That file lives here: `../pdfs/apr08_party.pdf`.

We'll need to use a text-based strategy and [explicit vertical lines](https://github.com/jsvine/pdfplumber#table-extraction-settings) in the table extraction settings, and we'll make _liberal_ use of [list comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions), my favorite thing in Python.

👉 For more details on using list comprehensions, [see this notebook](../reference/Python%20data%20types%20and%20basic%20syntax.ipynb#List-comprehensions).
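
As a quick warm-up, here's a small sketch of the two list comprehension patterns this exercise leans on: filtering out blank cells, and filtering plus transforming in one pass. The `raw_row` list is made up for illustration; it only assumes that extracted rows mix real values with empty strings and `None` for blank cells.

```python
# A made-up row: real values mixed with '' and None where cells were blank.
raw_row = ['ADAMS', '', '54,870', None, '69,938', '']

# Filter only: keep the truthy cells (this drops both '' and None).
non_blank = [cell for cell in raw_row if cell]

# Filter and transform: drop the blanks, then strip commas from what's left.
cleaned = [cell.replace(',', '') for cell in raw_row if cell]

print(non_blank)
print(cleaned)
```

The `if cell` at the end is the conditional: anything "falsy" gets left out of the new list, and the expression at the front decides what each surviving item looks like.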

First, let's do our imports:

In [49]:
# import pdfplumber and pandas


Now let's open the PDF and dive in.

In [58]:
# open the PDF with the pdfplumber `open` function

    
    # the table settings I came up with after fiddling for a bit
    table_settings = {
        'vertical_strategy': 'text', 
        'horizontal_strategy': 'text',
        'explicit_vertical_lines': [95, 205, 245, 290]
    }
    
    # extract the table from the page

    
    # ~ lots of fiddling at this point to see what the results looked like ~
    
    # use a list comprehension to grab the headers (which are in row 4)
    # and tack on a conditional to remove blank items
    # https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions

    
    # .remove() the useless 'PREC' item from the headers

    
    # create an empty dataframe

    
    # loop over the table, slicing so that we start with row 6 and leave off some cruft at the bottom

        
        # clean up the row a little -- remove blanks and strip out commas

        
        # use zip() to pair each header with its value, then dict() to turn those pairs into a dictionary
        # https://docs.python.org/3/library/functions.html#func-dict
        # https://docs.python.org/3/library/functions.html#zip

        
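        # append the dict to the dataframe
        # (DataFrame.append() was removed in pandas 2.0 -- on newer versions, use pd.concat() or collect the dicts in a list)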
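
If you get stuck, here's one way the steps above might fit together. Treat it as a sketch under a few assumptions, not the answer key: the table is assumed to be on the first page, "row 4" and "row 6" are read as list indexes 4 and 6, and `footer_rows` is a hypothetical parameter because the number of cruft rows at the bottom is something you find by inspecting the extracted table. The function names are made up, too. Instead of appending to an empty dataframe, this version collects the dictionaries in a list and builds the dataframe once at the end, since `DataFrame.append()` is gone as of pandas 2.0.

```python
def clean_headers(header_row):
    """Keep the non-blank header cells, then drop the unused 'PREC' item."""
    headers = [cell for cell in header_row if cell]
    if 'PREC' in headers:
        headers.remove('PREC')
    return headers


def clean_row(row):
    """Drop blank cells and strip commas from the cells that remain."""
    return [cell.strip().replace(',', '') for cell in row if cell and cell.strip()]


def rows_to_records(headers, rows):
    """Pair each cleaned row with the headers: one dict per county."""
    return [dict(zip(headers, clean_row(row))) for row in rows]


def extract_registration(pdf_path, footer_rows, header_index=4, first_data_index=6):
    """Sketch of the full workflow. `footer_rows` is hypothetical: inspect
    the extracted table to see how many trailing rows are cruft."""
    # Imported here so the helpers above can be tried without these installed.
    import pandas as pd
    import pdfplumber

    # The settings from the notebook cell above.
    table_settings = {
        'vertical_strategy': 'text',
        'horizontal_strategy': 'text',
        'explicit_vertical_lines': [95, 205, 245, 290]
    }

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[0]  # assumption: the table is on the first page
        table = page.extract_table(table_settings)

    headers = clean_headers(table[header_index])
    data_rows = table[first_data_index:len(table) - footer_rows]
    return pd.DataFrame(rows_to_records(headers, data_rows), columns=headers)


# Try the helpers on made-up rows shaped like the extracted table.
sample_headers = clean_headers(['COUNTY NAME', '', 'PREC', 'REP', None, 'DEM'])
sample_records = rows_to_records(sample_headers, [['ADAMS', '', '54,870', None, '69,938']])
print(sample_records)
```

Note that the values stay as strings here. If you want to do math on them, you'd convert the numeric columns afterward.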


Where'd we land?

In [59]:
# look at df sorted on COUNTY NAME


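| | COUNTY NAME | REP | DEM | UNAFF | LIB | GRN | NAT | REF | ACP | GOR | PLP | TOTAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ADAMS | 54870 | 69938 | 75221 | 391 | 130 | 0 | 1 | 36 | 0 | 0 | 200587 |
| 1 | ALAMOSA | 2677 | 3128 | 2426 | 9 | 12 | 0 | 0 | 0 | 0 | 0 | 8252 |
| 2 | ARAPAHOE | 109448 | 98364 | 105295 | 702 | 237 | 0 | 0 | 54 | 0 | 0 | 314100 |
| 3 | ARCHULETA | 4646 | 1899 | 2110 | 18 | 23 | 0 | 0 | 0 | 0 | 0 | 8696 |
| 4 | BACA | 1257 | 1055 | 518 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2831 |
| 5 | BENT | 875 | 1070 | 682 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 2631 |
| 6 | BOULDER | 43487 | 76676 | 79252 | 737 | 862 | 0 | 0 | 22 | 0 | 0 | 201036 |
| 63 | BROOMFIELD | 11513 | 9329 | 12164 | 88 | 35 | 0 | 0 | 4 | 0 | 0 | 33133 |
| 7 | CHAFFEE | 4858 | 3219 | 3749 | 26 | 29 | 0 | 0 | 4 | 0 | 0 | 11885 |
| 8 | CHEYENNE | 915 | 230 | 269 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 1417 |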
