# Playing with voter data

A few months ago, I read about a [voter data breach](http://www.nytimes.com/2015/12/31/us/politics/voting-records-released-privacy-concerns.html?_r=0) ([more here](http://www.databreaches.net/191-million-voters-personal-info-exposed-by-misconfigured-database/)) that resulted in 191 million voters' profiles being released on the web.  A breach like this doesn't tell people who you voted for, but it does share which elections you've voted in and how you've registered in those elections.  It also shares personal information like your phone number and mailing address, and likely any other data that a political campaign [might acquire about you from a third party](http://www.propublica.org/article/how-companies-have-assembled-political-profiles-for-millions-of-internet-us).

Where do your voter registration, election participation history, and contact information come from?  You!  When you participate in elections, states collect this information and make it available to varying degrees.  A company named NationBuilder has a list of the [state-by-state restrictions](http://nationbuilder.com/voterdata) on what one can do with voter data.  The limitations range from "Unrestricted" to "Political purposes only," with a few instances of "No commercial use."

Many states release this data online.  To understand it better, I downloaded the files from [North Carolina](http://www.ncsbe.gov/other-election-related-data), which provides easy and unrestricted access.

In the rest of this post, here's what we'll learn:
 * We'll explore voter data from North Carolina to understand what sort of voter and contact information is available to virtually anyone, presumably in all states.
 * We'll show how this data can be used for political purposes by creating a contact list of active young Democrats and active older Republicans.
 * We'll consider who else might put this data to use, and why that's worth a broader conversation.

# Understanding the North Carolina Data

I picked the North Carolina dataset because it's so easily accessible and my usage of it is unrestricted, so that I don't break any laws. I haven't looked at other states' datasets, but every article I've read on the topic suggests you can get similar data from every state.  Here's how I got the data:

 * Go to the [North Carolina State Board of Elections election data website](http://www.ncsbe.gov/other-election-related-data).
 * Click on "Voter Registration Data by County."
 * Download Alamance county's [Registration](ftp://alt.ncsbe.gov/data/ncvoter1.zip) and [History](ftp://alt.ncsbe.gov/data/ncvhis1.zip) datasets.  They are zipfiles, and you can extract a delimited text file from each of them (despite what a `.csv` name might suggest, the fields are tab-separated).  At the bottom of the list you can download statewide versions of each of these, but they are a few hundred megabytes each and not necessary for illustrative purposes.
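
If you'd rather script the extraction step than click through it, here is a minimal sketch using Python's standard `zipfile` module.  The function name `extract_archive` and the paths are placeholders of mine, and the sketch assumes you've already saved the zipfile to disk.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the extracted paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return [dest / name for name in archive.namelist()]
```

Something like `extract_archive('ncvoter1.zip', 'data')` would then leave the voter file in a `data` directory, ready for Pandas.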
 
We'll now explore some basic information about each of them, starting with the voter data.  I'll be using [Python](https://www.python.org/) and [Pandas](http://pandas.pydata.org/) for my analysis and showing my work, but don't let that stop you from following along!

## The voters of Alamance county
On the next few lines, we'll open up the voter data file and list all of the columns in it.  Many of them are self-explanatory, like `first_name`, `last_name`, `zip_code`, `mail_addr1`, `full_phone_number`, `birth_age`, `ethnic_code`, `race_code`, or `gender_code`.  In short, you can figure out *where someone lives*, *how to contact them*, *how old they are*, and their basic *demographic information*.

In [19]:
# Open up the voter data csv using Pandas
import pandas
import numpy
voters = pandas.read_csv('/Users/marcua/Downloads/ncvoter1.csv',
                         sep='\t',
                         encoding = 'ISO-8859-1')
voters.columns

Index(['county_id', 'county_desc', 'voter_reg_num', 'status_cd',
       'voter_status_desc', 'reason_cd', 'voter_status_reason_desc',
       'absent_ind', 'name_prefx_cd', 'last_name', 'first_name', 'middle_name',
       'name_suffix_lbl', 'res_street_address', 'res_city_desc', 'state_cd',
       'zip_code', 'mail_addr1', 'mail_addr2', 'mail_addr3', 'mail_addr4',
       'mail_city', 'mail_state', 'mail_zipcode', 'full_phone_number',
       'race_code', 'ethnic_code', 'party_cd', 'gender_code', 'birth_age',
       'birth_state', 'drivers_lic', 'registr_dt', 'precinct_abbrv',
       'precinct_desc', 'municipality_abbrv', 'municipality_desc',
       'ward_abbrv', 'ward_desc', 'cong_dist_abbrv', 'super_court_abbrv',
       'judic_dist_abbrv', 'nc_senate_abbrv', 'nc_house_abbrv',
       'county_commiss_abbrv', 'county_commiss_desc', 'township_abbrv',
       'township_desc', 'school_dist_abbrv', 'school_dist_desc',
       'fire_dist_abbrv', 'fire_dist_desc', 'water_dist_abbrv',
       'water

Next, and for funsies, we'll look at some basic statistical information about the numeric attributes in this file using Pandas' `describe` function.  Most of the statistics are meaningless (e.g., the 50th percentile zip code), but there are a few goodies: the file holds 105565 non-null voter records, and the max `birth_age` of voters in Alamance county is 115.

Who are these 115-year-old voters?  Below, we'll look at the six of them in the county.  This view only reports last names, though it would be easy to get some other identifying information.  Sadly, two have been removed as deceased and two are inactive because a confirmation was never returned, but the other two are active and verified.

In [20]:
voters.describe()

Unnamed: 0,county_id,voter_reg_num,zip_code,mail_zipcode,full_phone_number,birth_age,ward_abbrv,ward_desc,cong_dist_abbrv,nc_senate_abbrv,...,fire_dist_abbrv,fire_dist_desc,water_dist_abbrv,water_dist_desc,sewer_dist_abbrv,sewer_dist_desc,sanit_dist_abbrv,sanit_dist_desc,rescue_dist_abbrv,rescue_dist_desc
count,105565,105565.0,105565.0,105565.0,87773.0,105565.0,0.0,0.0,93796.0,93796,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,1,8267825.525032,27245.258599,1296884.0,3615165000.0,51.761758,,,5.406627,24,...,,,,,,,,,,
std,0,2106055.323698,36.232703,18923870.0,2436548000.0,19.446062,,,1.045065,0,...,,,,,,,,,,
min,1,3800.0,27215.0,1864.0,0.0,17.0,,,2.0,24,...,,,,,,,,,,
25%,1,9028020.0,27215.0,27215.0,3362267000.0,36.0,,,4.0,24,...,,,,,,,,,,
50%,1,9087069.0,27244.0,27244.0,3363763000.0,51.0,,,6.0,24,...,,,,,,,,,,
75%,1,9124077.0,27253.0,27253.0,3365844000.0,66.0,,,6.0,24,...,,,,,,,,,,
max,1,9153569.0,27516.0,966781300.0,10000000000.0,115.0,,,6.0,24,...,,,,,,,,,,


In [21]:
voters[voters.birth_age == 115]

Unnamed: 0,county_id,county_desc,voter_reg_num,status_cd,voter_status_desc,reason_cd,voter_status_reason_desc,absent_ind,name_prefx_cd,last_name,...,munic_dist_desc,dist_1_abbrv,dist_1_desc,dist_2_abbrv,dist_2_desc,confidential_ind,age,ncid,vtd_abbrv,vtd_desc
23316,1,ALAMANCE,1702000,I,INACTIVE,IN,CONFIRMATION NOT RETURNED,,,DAYE,...,BURLINGTON,15A,15A PROSECUTORIAL,,,N,Age Over 66,AA12753,12N,12N
28527,1,ALAMANCE,2013800,R,REMOVED,RD,DECEASED,,,FARRELL,...,,,,,,N,Age Over 66,AA14823,,
29714,1,ALAMANCE,2104775,I,INACTIVE,IN,CONFIRMATION NOT RETURNED,,,FITZGERALD,...,BURLINGTON,15A,15A PROSECUTORIAL,,,N,Age Over 66,AA15391,12E,12E
44240,1,ALAMANCE,3206800,R,REMOVED,RD,DECEASED,,,HUFF,...,,,,,,N,Age Over 66,AA22292,,
47454,1,ALAMANCE,3501400,A,ACTIVE,AV,VERIFIED,,,JOHNSON,...,MEBANE,15A,15A PROSECUTORIAL,,,N,Age Over 66,AA24092,10S,10S
70762,1,ALAMANCE,9014216,A,ACTIVE,AV,VERIFIED,,,PARSONS,...,GRAHAM,15A,15A PROSECUTORIAL,,,N,Age Over 66,AA64367,064,064


Curious how the county leans?  There are more registered Democrats (44151, 42%) than Republicans (34244, 32%), but the number of unaffiliated voters (26809) is quite sizable.  Interestingly, `LIB`, which I think is the Libertarian Party, is the only other listed party, but it isn't that sizable at 361 records.

In [22]:
# Count the registered voters in each party
voters.groupby('party_cd').size().reset_index(name='num_voters')

Unnamed: 0,party_cd,num_voters
0,DEM,44151
1,LIB,361
2,REP,34244
3,UNA,26809


## How Alamance county voted

We'll now take a look at the voting history, or the various elections in which voters participated.  There are fewer columns in this file, but the key ones are `voter_reg_num`, which is each voter's ID, `election_desc`, which tells us which election it was, and `voted_party_desc`, which tells us the party for which the voter was registered in that election.  You don't get to learn how the voter voted, but knowing their party and some demographic information from their voter record can probably give you a decent sense.

In [23]:
vote_history = pandas.read_csv('/Users/marcua/Downloads/ncvhis1.txt',
                               sep='\t')
vote_history.columns

Index(['county_id', 'county_desc', 'voter_reg_num', 'election_lbl',
       'election_desc', 'voting_method', 'voted_party_cd', 'voted_party_desc',
       'pct_label', 'pct_description', 'ncid', 'voted_county_id',
       'voted_county_desc', 'vtd_label', 'vtd_description'],
      dtype='object')

How many elections are we talking?  Quite a few!  Below, we can see everything from municipal primaries to general elections to board of education elections.  You can also get a whiff of how dirty some of this data is: some voters are listed as having voted in `11/07/2006 GENERAL`, while others voted in `2006 GENERAL - NOVEMBER 7 2006`, which are presumably two ways to describe the same event.

In [24]:
vote_history.groupby('election_desc').groups.keys()

dict_keys(['05/04/2010 PRIMARY', '11/03/2009 MUNICIPAL ELECTIONS', '10/06/2015 BURLINGTON PRIMARY', '09/10/2013 PRIMARY', '11/06/2007 MUNI/GENERAL', '05/06/2014 PRIMARY', '11/06/2007 MUNICIPAL AND COUNTY REFERENDA', '2008 SPECIAL', '10/09/2007 MUNICIPAL ELECTION', '07/11/2006 SPECIAL MIXED BEV', '09/09/2008 ABC ELECTION', '11/08/2011 GENERAL', '08/29/2006 - LR-J-BW', '10/11/2011 CARY MUNICIPAL', '07/29/2008 ASHEBORO CITY ABC ELECTION', '07/17/2012 SECOND PRIMARY', '10/11/2011 RAMSEUR PRIMARY', '05/05/2009 MIXED BEVERAGE', '11/05/2013 MUNICIPAL ELECTIONS', '11/07/2006 GENERAL', '11/03/2009 MUNICIPAL', '2007 MUNICIPAL', 'TOWN OF BOONE MIXED DRINK', '10/08/2013 MUNICIPAL PRIMARY', '11/04/2008 GENERAL', '11/06/2007 MUNICIPAL /REFERENDUM', '09/15/2009 WINSTON-SALEM PRIMARY', '11/06/2007 MUNICIPAL  ABC GENERAL', '11/06/2007 MUNICIPAL ELECTIONS', '11/05/2013 GENERAL MUNICIPAL', '09/15/2009 ALB PRIMARY', '05/30/2006 SECOND PRIMARY', '11/03/2009 MUNICIPALS', '11/03/2009 MUNICIPAL AND SCHOOL', '
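
One way to sidestep the inconsistent descriptions is to group on a parsed date instead of the free-text `election_desc`.  The sketch below uses a few made-up rows (only the two description spellings come from the text above) and assumes that `election_lbl` carries the same `MM/DD/YYYY` date for both spellings, which I haven't verified against the real file.

```python
import pandas

# Made-up rows for illustration; voter numbers and labels are hypothetical.
sample = pandas.DataFrame({
    'voter_reg_num': [1, 2, 3],
    'election_lbl': ['11/07/2006', '11/07/2006', '11/04/2008'],
    'election_desc': ['11/07/2006 GENERAL',
                      '2006 GENERAL - NOVEMBER 7 2006',
                      '11/04/2008 GENERAL'],
})

# Parse the label into a real date so both spellings land in the same group.
sample['election_date'] = pandas.to_datetime(sample['election_lbl'],
                                             format='%m/%d/%Y')
votes_per_election = sample.groupby('election_date').size()
print(votes_per_election)
```

If the labels turn out to be just as messy as the descriptions, a lookup table mapping each spelling to a canonical election would be the fallback.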

How voters listed their affiliation in each election roughly matched the trends in the voter files, although both Democrats (178351, 47%) and Republicans (146305, 39%) account for a larger share of votes cast than of the registered voter base we saw above.

In [25]:
# Count the votes cast under each party affiliation
vote_history.groupby('voted_party_desc').size().reset_index(name='num_voters')

Unnamed: 0,voted_party_desc,num_voters
0,DEMOCRATIC,178351
1,LIBERTARIAN,615
2,REPUBLICAN,146305
3,UNAFFILIATED,50787
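
The percentages quoted in the text aren't printed by any cell, so here is a rough sketch of the arithmetic behind them.  It copies the counts from the two outputs above instead of rereading the files, and it assumes the party codes line up with the party descriptions (`DEM` with `DEMOCRATIC`, and so on).  Keep in mind that the party recorded for a vote may differ from a voter's current registration, so the comparison is only approximate.

```python
import pandas

# Counts copied from the registration and vote-history outputs above.
registered = pandas.Series({'DEM': 44151, 'LIB': 361, 'REP': 34244, 'UNA': 26809})
votes_cast = pandas.Series({'DEM': 178351, 'LIB': 615, 'REP': 146305, 'UNA': 50787})

# Each party's share of registered voters next to its share of votes cast.
comparison = pandas.DataFrame({
    'pct_of_registered': 100 * registered / registered.sum(),
    'pct_of_votes': 100 * votes_cast / votes_cast.sum(),
}).round(1)
print(comparison)
```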


It's fun to look at a single individual's voting history.  To pick one of our 115-year-old voters from before, voter number 9014216 seems to pretty consistently vote as a Republican, showing up in person at the same precinct since 2006.  How's that for civic participation?

In [26]:
vote_history[vote_history.voter_reg_num==9014216]

Unnamed: 0,county_id,county_desc,voter_reg_num,election_lbl,election_desc,voting_method,voted_party_cd,voted_party_desc,pct_label,pct_description,ncid,voted_county_id,voted_county_desc,vtd_label,vtd_description
270186,1,ALAMANCE,9014216,11/02/2010,11/02/2010 GENERAL,IN-PERSON,REP,REPUBLICAN,64,GRAHAM 4,AA64367,1,ALAMANCE,64,64
270187,1,ALAMANCE,9014216,11/05/2013,11/05/2013 MUNICIPAL GENERAL,IN-PERSON,REP,REPUBLICAN,64,GRAHAM 4,AA64367,1,ALAMANCE,64,64
270188,1,ALAMANCE,9014216,11/04/2008,11/04/2008 GENERAL,IN-PERSON,REP,REPUBLICAN,64,GRAHAM 4,AA64367,1,ALAMANCE,64,64
270189,1,ALAMANCE,9014216,11/06/2012,11/06/2012 GENERAL,IN-PERSON,REP,REPUBLICAN,64,GRAHAM 4,AA64367,1,ALAMANCE,64,64
270190,1,ALAMANCE,9014216,05/08/2012,05/08/2012 PRIMARY,IN-PERSON,REP,REPUBLICAN,64,GRAHAM 4,AA64367,1,ALAMANCE,64,64
270191,1,ALAMANCE,9014216,11/07/2006,11/07/2006 GENERAL,IN-PERSON,REP,REPUBLICAN,64,GRAHAM 4,AA64367,1,ALAMANCE,64,64
270192,1,ALAMANCE,9014216,11/08/2011,11/08/2011 MUNICIPAL GENERAL,IN-PERSON,REP,REPUBLICAN,64,GRAHAM 4,AA64367,1,ALAMANCE,64,64
270193,1,ALAMANCE,9014216,11/04/2014,11/04/2014 GENERAL,IN-PERSON,REP,REPUBLICAN,64,GRAHAM 4,AA64367,1,ALAMANCE,64,64
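
The rows above come back in no particular order, so claims like "since 2006" and "the same precinct" are easier to check with a little code.  This sketch retypes a few columns from the output above into a small DataFrame (the variable names are mine), sorts it by date, and counts the distinct values in each column.

```python
import pandas

# A few columns from the rows printed above for voter 9014216.
history = pandas.DataFrame({
    'election_lbl': ['11/02/2010', '11/05/2013', '11/04/2008', '11/06/2012',
                     '05/08/2012', '11/07/2006', '11/08/2011', '11/04/2014'],
    'voting_method': ['IN-PERSON'] * 8,
    'voted_party_cd': ['REP'] * 8,
    'pct_description': ['GRAHAM 4'] * 8,
})

# Sort chronologically so the first and last elections are easy to read off.
history['election_date'] = pandas.to_datetime(history['election_lbl'],
                                              format='%m/%d/%Y')
history = history.sort_values('election_date')

# One distinct value per column means the method, party, and precinct never changed.
consistency = history[['voting_method', 'voted_party_cd', 'pct_description']].nunique()
print(history['election_date'].min().year, history['election_date'].max().year)
print(consistency)
```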


# What can we do with this data?

So far, we've gotten an aggregate sense of what voter data looks like.  You can see what your neighbor's party affiliation is, or call up a random person to have a chat (please don't!).  But what makes this data powerful?  It depends who you are!  Putting our campaign hats on for a moment, imagine we wanted to find some active voters and reach out to them.  Perhaps the Democrats could find some young active voters and call them to ask for help with door-to-door campaigning in their neighborhood.  Or maybe the Republicans would want to find some active older voters to send letters asking for campaign contributions.  With a few lines of code, we can do both of these things!

Let's start by looking for some active young Democrats.  These might be folks who are younger than 25, who have registered as Democrats, and who have voted more than three times.  In the lines below, we filter our dataset down to young voters (8677), then to the registered Democrats among them (2631), then to the votes those Democrats have cast (1945), and finally to the ones who have voted more than three times (59).  Peeking at the `head` of this table of 59 active young Democrats, we can get their first names.  I've left out details like phone numbers or addresses for their privacy, but you can be sure that campaign organizers wouldn't: they could pick up the phone and reach out for some help.

In [27]:
young_voters = voters[voters.birth_age < 25]
print('young voters', len(young_voters))
young_democrats = young_voters[young_voters.party_cd == 'DEM']
print('young democrats', len(young_democrats))
young_democrat_votes = pandas.merge(young_democrats, vote_history, left_on='voter_reg_num', right_on='voter_reg_num')
print('votes by young democrats', len(young_democrat_votes))
# Count the votes cast by each young Democrat
young_democrat_vote_counts = (young_democrat_votes
                              .groupby('voter_reg_num')
                              .size()
                              .reset_index(name='num_votes'))
young_active_democrats = young_democrat_vote_counts[young_democrat_vote_counts.num_votes > 3]
print('active young democrats', len(young_active_democrats))
young_active_democrats_mailing_list = pandas.merge(young_active_democrats, voters, left_on='voter_reg_num', right_on='voter_reg_num')
young_active_democrats_mailing_list[['first_name']].head()


young voters 8677
young democrats 2631
votes by young democrats 1945
active young democrats 59


Unnamed: 0,first_name
0,JAYMEE
1,FRANKLIN
2,JAYE
3,ANGELA
4,TYLER


Let's not leave our older Republican potential campaign contributors out, either!  These might be people who are older than 65, registered as Republican, and voted more than three times.  There are more of these folks, likely because of the broader age range: 26933 older voters, of whom 9066 are Republicans who have voted a total of 54562 times.  If we keep only the active ones, we end up with 6027 people to contact, and can see the first names of the first five in our list.  A real campaign would have pulled out mailing or other contact information to ask for contributions, but that's not our purpose here!

In [28]:
older_voters = voters[voters.birth_age > 65]
print('older voters', len(older_voters))
older_republicans = older_voters[older_voters.party_cd == 'REP']
print('older republicans', len(older_republicans))
older_republican_votes = pandas.merge(older_republicans, vote_history, left_on='voter_reg_num', right_on='voter_reg_num')
print('votes by older republicans', len(older_republican_votes))
# Count the votes cast by each older Republican
older_republican_vote_counts = (older_republican_votes
                                .groupby('voter_reg_num')
                                .size()
                                .reset_index(name='num_votes'))
older_active_republicans = older_republican_vote_counts[older_republican_vote_counts.num_votes > 3]
print('active older republicans', len(older_active_republicans))
older_active_republican_mailing_list = pandas.merge(older_active_republicans, voters, left_on='voter_reg_num', right_on='voter_reg_num')
older_active_republican_mailing_list[['first_name']].head()

older voters 26933
older republicans 9066
votes by older republicans 54562
active older republicans 6027


Unnamed: 0,first_name
0,ALICE
1,ANTHONY
2,JEFFERSON
3,PAUL
4,NELL
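
The two cells above differ only in the party and the age filter, so the shared steps could be folded into one helper.  This is a sketch, not what I actually ran: `active_voters` and the toy tables are hypothetical names and made-up data, and the only columns assumed are the ones used earlier (`voter_reg_num`, `party_cd`, `birth_age`).

```python
import pandas


def active_voters(voters, vote_history, party, age_mask, min_votes=3):
    """Return voter rows for `party` members picked by `age_mask` with more than `min_votes` votes."""
    selected = voters[age_mask & (voters.party_cd == party)]
    votes = pandas.merge(selected[['voter_reg_num']], vote_history, on='voter_reg_num')
    counts = votes.groupby('voter_reg_num').size().reset_index(name='num_votes')
    active = counts[counts.num_votes > min_votes]
    return pandas.merge(active, voters, on='voter_reg_num')


# Tiny made-up tables, just to show the call.
toy_voters = pandas.DataFrame({
    'voter_reg_num': [1, 2, 3],
    'first_name': ['A', 'B', 'C'],
    'party_cd': ['DEM', 'DEM', 'REP'],
    'birth_age': [22, 23, 70],
})
toy_history = pandas.DataFrame({'voter_reg_num': [1, 1, 1, 1, 2, 3, 3, 3, 3, 3]})

young_dems = active_voters(toy_voters, toy_history, 'DEM', toy_voters.birth_age < 25)
print(young_dems[['first_name', 'num_votes']])
```

With the real tables loaded, the older-Republican list would be `active_voters(voters, vote_history, 'REP', voters.birth_age > 65)`.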


# So what?

What we've learned is that each state, at the very least for political purposes, makes available voter information and voting histories.  While we looked at a single county in North Carolina, it would be easy to perform the same analysis on the entire state.  There are companies dedicated to pulling down this data, cleaning it up, merging multiple states' records, and integrating other sources of information such as purchase histories, web browsing habits, and other socioeconomic markers so that campaigns can get a clear sense of who to reach out to and how to engage them.

The data we've downloaded isn't "big data" in any sense.  On my five-year-old laptop, I can process the queries above in a few seconds.  Processing the state of North Carolina would take at most a few tens of minutes without breaking too much of a sweat.
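
If the statewide files (a few hundred megabytes each) feel too large to load at once, Pandas can read them in pieces.  The sketch below tallies `party_cd` one chunk at a time; `count_parties` is a name I made up, the in-memory sample stands in for the real file, and I haven't timed this against the statewide download.

```python
import io
from collections import Counter

import pandas


def count_parties(source, chunksize=100_000, **read_kwargs):
    """Tally party_cd values from a tab-separated voter file, one chunk at a time."""
    totals = Counter()
    reader = pandas.read_csv(source, sep='\t', usecols=['party_cd'],
                             chunksize=chunksize, **read_kwargs)
    for chunk in reader:
        # Add this chunk's per-party counts to the running totals.
        totals.update(chunk['party_cd'].value_counts().to_dict())
    return pandas.Series(totals).sort_index()


# Made-up stand-in for the statewide file; a real path would go here instead.
sample_file = io.StringIO('voter_reg_num\tparty_cd\n1\tDEM\n2\tREP\n3\tDEM\n4\tUNA\n')
party_totals = count_parties(sample_file, chunksize=2)
print(party_totals)
```

For a real file you would likely pass `encoding='ISO-8859-1'` through `read_kwargs`, as the voter-file cell earlier in the post does.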

I'll pass no judgement on whether making this data available for political purposes is a good thing, but it's also important to highlight how else this data could be used.  Even if they aren't allowed to, non-political entities can use this data to enrich their own customer profiles.  Whereas it might be annoying to be called by a campaign looking for contributions, it's scary to think about this data being readily accessible to your bank, insurance company, or employer.

I hope that playing around with this data has given you a sense of how simple and fun it is to learn with real data.  I also hope society engages in deeper conversations about how and by whom such data should be used.  Enjoy!