Campaign Finance Linker
Campaign finance disclosure laws help us understand how money influences our political system, but inconsistencies in the data make it hard to get a full picture of where the money comes from. This library uses machine learning -- specifically, a technique known as random forest -- to connect donations from the same contributor.
How it works
Campaign finance records generally include a contributor's name, address, occupation and employer, but not a unique identifier for the individual. Inconsistencies like misspelled names or changing job titles make it difficult to connect records by donor.
This library can link contributions within a single dataset or across multiple datasets. It could, for example, match individual contribution records from the 2012 presidential election, connect records across multiple years of federal election data, or find connections between contributions to candidates in a local election and contributions to candidates who ran for president.
To train the classifier, we use an already-linked dataset (data/crp_slice.zip) from the Center for Responsive Politics.
This project was inspired by fec-standardizer from The New York Times' Chase Davis, who first applied the random forest method to campaign finance data and identified the correct feature set for grouping records by donor. See his excellent wiki for background.
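To make the idea of comparing records concrete, here is a rough sketch of how two contribution records might be turned into a numeric feature vector for a classifier such as a random forest. The field names (first_name, city, occupation, employer, zipcode), the sample records, and the choice of difflib for string similarity are all assumptions made for illustration; they are not the feature set this library or fec-standardizer actually uses.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def pair_features(rec_a, rec_b):
    """Turn two contribution records (dicts) into a numeric feature vector.

    The fields compared here are hypothetical, chosen only to illustrate
    the approach.
    """
    return [
        similarity(rec_a["first_name"], rec_b["first_name"]),
        similarity(rec_a["city"], rec_b["city"]),
        similarity(rec_a["occupation"], rec_b["occupation"]),
        similarity(rec_a["employer"], rec_b["employer"]),
        # Exact match on the 5-digit ZIP code, ignoring any ZIP+4 suffix.
        1.0 if rec_a["zipcode"][:5] == rec_b["zipcode"][:5] else 0.0,
    ]


# Two made-up records that plausibly describe the same donor.
a = {"first_name": "Jonathan", "city": "Brooklyn", "occupation": "Attorney",
     "employer": "Smith & Jones LLP", "zipcode": "11201"}
b = {"first_name": "Jonathon", "city": "Brooklyn", "occupation": "Lawyer",
     "employer": "Smith and Jones", "zipcode": "11201-3301"}

features = pair_features(a, b)
print(features)
```

A classifier trained on already-linked pairs would take vectors like this one as input and output a probability that the two records belong to the same donor.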
This project requires Python and MySQL. To install the required Python packages, run:
pip install -r requirements.txt
Follow these steps to create the necessary MySQL schema and to download, import, and link individual contribution data for the 2014 election cycle from the Federal Election Commission.
1) Create a database.yml and edit the connection properties to match your system:
cp config/database.sample.yml config/database.yml
2) Create three tables (individuals, individual_partial_matches, and individual_contributions_2014) for your linkage:
3) Download and import the first 20,000 individual contributions from the 2014 cycle:
4) Generate a training set from the linked CRP data:
5) Train the classifier and link the 2014 individual contribution data:
The 20,000 contributions (individual_contributions_2014) are now linked to about 18,000 canonical individuals (individuals). The difference of roughly 2,000 records is the result of multiple contributions being linked to a single individual. Each contribution record is linked to a canonical individual by the individual_id field. The individual_partial_matches table contains roughly 30 records, which represent pairs that fell short of the threshold the learning algorithm requires for a match but may still be matches. You can resolve these potential matches with the resolve.py script, or use another method to determine whether they're actually matches. You can also ignore them, which results in a slightly less precise linkage.
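The split between matches, partial matches, and non-matches can be pictured as two cutoffs on the classifier's match probability. The sketch below is only an illustration: the threshold values and the triage function are hypothetical, not taken from this library.

```python
# Hypothetical cutoffs; the library's real thresholds may differ.
MATCH_THRESHOLD = 0.8
PARTIAL_THRESHOLD = 0.5


def triage(probability):
    """Sort a pair's match probability into one of three buckets."""
    if probability >= MATCH_THRESHOLD:
        return "match"      # link the contribution to the individual
    if probability >= PARTIAL_THRESHOLD:
        return "partial"    # set aside for manual review
    return "no_match"       # treat as a different person


for p in (0.95, 0.6, 0.1):
    print(p, triage(p))
```

Pairs that land in the middle bucket are the kind of records that would end up in a table like individual_partial_matches for later review.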
Linking a second dataset
Linking a second dataset is easier than linking the first. (The training set only needs to be generated once, so you don't have to run generate.py again.) The steps are:
1) Create a table with the new data (make sure it contains an empty individual_id field to link to the individuals table).
2) Add your new table to database.yml. (You can override field names for the new table if needed.)
3) Link the new dataset by specifying the new table name:
python link.py --table=new_table
Since this second linkage shares the individuals table with the first linkage, some individuals from the 2014 cycle may now be linked to records in the dataset you just imported.
Instead of creating a new table, you could also append new records to the same individual_contributions_2014 table you used for the example linkage and rerun the linkage, as long as you don't delete the existing data in the individuals table.
For performance reasons, the linker only compares records that have the same values for last name and state.
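This restriction is a form of what record-linkage literature calls blocking. A minimal sketch of the idea follows, assuming records are dicts with hypothetical id, last_name and state keys; the case and whitespace normalization is also an assumption, not something the library is known to do.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records):
    """Yield only pairs of records that share a last name and state."""
    blocks = defaultdict(list)
    for rec in records:
        # Normalizing case and whitespace is an assumption of this sketch.
        key = (rec["last_name"].strip().upper(), rec["state"].strip().upper())
        blocks[key].append(rec)
    for block in blocks.values():
        # Compare records within a block, never across blocks.
        yield from combinations(block, 2)


records = [
    {"id": 1, "last_name": "Smith", "state": "NY"},
    {"id": 2, "last_name": "SMITH", "state": "ny"},
    {"id": 3, "last_name": "Smith", "state": "CA"},
    {"id": 4, "last_name": "Jones", "state": "NY"},
    {"id": 5, "last_name": "Smith", "state": "NY"},
]

pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(records)]
print(pairs)  # [(1, 2), (1, 5), (2, 5)]
```

Five records would produce ten pairs if every record were compared with every other; blocking cuts that to three here. The trade-off is that a donor whose last name is misspelled, or who moved to another state, will never be compared with their other records.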
Linking larger datasets can take a long time; the full set of 3.5 million 2012 contributions took about 5 hours to link on a 2 GHz MacBook Pro. You can kill and restart the link.py script at any time. (It would be fairly easy to parallelize the process so the script can be run on multiple machines, each of which pulls out and locks some records to link until there are no records left.)
As the individuals table grows, future linkages will take longer. If you don't need records linked across projects, you can use a different individuals table for each one by creating a new table and modifying database.yml to point to it.
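The parallelization idea mentioned above (workers that each pull out and lock a batch of records until none are left) can be sketched with threads standing in for machines and an in-memory pool standing in for the database. Everything here, including the RecordPool class, the claim method and the batch size, is hypothetical and only illustrates the claim-a-batch pattern; link.py does not currently work this way.

```python
import threading


class RecordPool:
    """In-memory stand-in for a table of not-yet-linked record ids."""

    def __init__(self, record_ids):
        self._unclaimed = list(record_ids)
        self._lock = threading.Lock()

    def claim(self, batch_size):
        """Atomically remove and return up to batch_size record ids."""
        with self._lock:
            batch = self._unclaimed[:batch_size]
            del self._unclaimed[:batch_size]
            return batch


def worker(pool, processed, batch_size=100):
    """Claim batches until the pool is empty, recording each id handled."""
    while True:
        batch = pool.claim(batch_size)
        if not batch:
            return
        processed.extend(batch)  # placeholder for the real linking work


pool = RecordPool(range(1000))
results = [[] for _ in range(4)]  # one output list per worker
threads = [threading.Thread(target=worker, args=(pool, out)) for out in results]
for t in threads:
    t.start()
for t in threads:
    t.join()

all_ids = sorted(i for out in results for i in out)
print(len(all_ids))  # 1000
```

Because each batch is claimed under a lock, no record is handed to two workers. With MySQL, the equivalent step would need a transactional claim of rows rather than a Python lock.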
Records from the individuals table are cached in memory to reduce MySQL queries. Depending on how much RAM you have available, you can tweak the size of the cache by changing the cache size setting in campfin/linker.py. (The default is about 1 GB.)
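As a rough illustration of how a bounded in-memory cache reduces database queries, here is a small least-recently-used cache. It is not the library's implementation: the IndividualCache class, the loader callback, keying by (last name, state), and bounding by entry count rather than by memory size are all assumptions of this sketch.

```python
from collections import OrderedDict


class IndividualCache:
    """Bounded LRU cache in front of a slower lookup (e.g. a MySQL query)."""

    def __init__(self, max_entries, loader):
        self.max_entries = max_entries
        self.loader = loader        # called on a cache miss
        self._rows = OrderedDict()  # oldest entries first
        self.misses = 0

    def get(self, key):
        if key in self._rows:
            self._rows.move_to_end(key)  # mark as most recently used
            return self._rows[key]
        self.misses += 1
        rows = self.loader(key)
        self._rows[key] = rows
        if len(self._rows) > self.max_entries:
            self._rows.popitem(last=False)  # evict least recently used
        return rows


# A dict stands in for the individuals table in this example.
fake_db = {("SMITH", "NY"): [1, 2], ("JONES", "CA"): [3], ("LEE", "TX"): [4]}
cache = IndividualCache(max_entries=2, loader=lambda key: fake_db.get(key, []))

cache.get(("SMITH", "NY"))  # miss: loaded from fake_db
cache.get(("SMITH", "NY"))  # hit: served from memory
cache.get(("JONES", "CA"))  # miss
cache.get(("LEE", "TX"))    # miss: evicts ("SMITH", "NY")
cache.get(("SMITH", "NY"))  # miss again, because it was evicted
print(cache.misses)  # 4
```

A larger cache means fewer evictions and fewer repeated queries, at the cost of more RAM.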
Run test.py to evaluate the machine learning performance and tweak some parameters.
- Jay Boice, email@example.com
- Aaron Bycoffe, firstname.lastname@example.org
- Gabriel Florit, email@example.com
Copyright © 2013 The Huffington Post. See LICENSE for details.