### Getting accurate monthly series for campaign finance

### What's the issue? Why is this so hard?
The difficulty in showing accurate monthly fundraising totals, of the sort used in visualizations like the NYT's [2012 Money Race](http://elections.nytimes.com/2012/campaign-finance), is simply that campaigns do not report monthly throughout the cycle. The [presidential principal campaign committee reporting schedule](http://www.fec.gov/info/report_dates_2012.shtml#monthly) for the 2012 election shows that campaigns reported quarterly in 2011 and only began monthly reports in 2012. In addition, because the election is in November, the final three reporting periods are not quite monthly: their coverage periods are highly skewed, and that needs to be adjusted for as well.

### But the FEC has all this itemized data available with transaction dates - why can't we use that?
The itemized data is only a small subset of the overall data. In the 2012 election cycle, for instance, Obama for America raised 738.5 million dollars, of which 315 million came from individual donors who gave more than 200 dollars, and 234 million came from individual donors who gave less than 200 dollars (the rest came from authorized committees and a few other places). Donations of less than 200 dollars aren't required to be itemized, so they don't show up in the itemized dataset. In addition, the 315 million figure includes donors who gave over 200 dollars in total, but in a series of smaller donations. These also don't show up in the FEC's itemized data, so the itemized data only includes about 200 million, a small fraction of the total. If we looked only at the itemized data, we would think that Romney had massively outraised Obama, when in fact the opposite was true.
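
To make the gap concrete, the figures above can be turned into shares of the total. This is only a back-of-the-envelope sketch: the amounts are the rounded numbers quoted in this section, and the variable names are made up for illustration.

```python
# All amounts are in millions of dollars, as quoted in the text above.
total_raised = 738.5         # Obama for America, 2012 cycle
large_donors = 315.0         # individuals who gave more than 200 dollars
small_donors = 234.0         # individuals who gave less than 200 dollars
itemized_with_dates = 200.0  # approximate amount visible in the itemized files

# Whatever is left came from authorized committees and other sources.
other_sources = total_raised - large_donors - small_donors

# Share of the true total that the dated, itemized records can see.
itemized_share = itemized_with_dates / total_raised

print("other sources: %.1f million" % other_sources)
print("itemized share of total: %.0f%%" % (itemized_share * 100))
```

On these rounded figures the itemized records cover roughly 27 percent of what the campaign actually raised, which is why they cannot be used as totals on their own.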

### So what do we do?

The FEC Form 3P data gives accurate totals, but is a mixture of quarterly and monthly reports. The itemized data covers only a fraction of the total, but every itemized contribution has a date. So we can use the itemized data as a proxy for the overall time distribution of fundraising: where a Form 3P report is quarterly, we prorate its total across the months of the quarter according to each month's share of the itemized contributions in that quarter.
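
The proration idea can be sketched in a few lines of plain Python. This is an illustration rather than the notebook's implementation: `prorate` is a hypothetical helper, the numbers are invented, and the even-split fallback for a period with no itemized contributions is an assumption of this sketch.

```python
def prorate(period_total, itemized_by_month):
    """Split one reported total across months in proportion to itemized amounts."""
    itemized_sum = float(sum(itemized_by_month.values()))
    if itemized_sum == 0:
        # No itemized signal at all: fall back to an even split.
        # (Assumption of this sketch, not something the notebook does.)
        share = 1.0 / len(itemized_by_month)
        return {m: period_total * share for m in itemized_by_month}
    return {m: period_total * amt / itemized_sum
            for m, amt in itemized_by_month.items()}

# Invented numbers: a Q3 report totalling 90, with dated itemized
# contributions of 10, 20 and 30 in July, August and September.
q3_total = 90.0
itemized = {7: 10.0, 8: 20.0, 9: 30.0}

monthly = prorate(q3_total, itemized)
print(monthly)
```

The monthly values always add back up to the reported total, so the accurate Form 3P figure is preserved and the itemized data only determines its shape over time.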

### What data sources do we use?
Getting the Form 3P data is harder than it would seem. The FEC website lets you download spreadsheets with totals, but it's a highly manual process with no persistent links. It is better to use the new beta FEC API, which makes it very easy to get the Form 3P data.

For the itemized data, the API isn't suitable. The itemized data is available, but the endpoints that give aggregates don't offer enough detail or control, and the endpoints that give true itemized data don't currently let you filter or aggregate enough to keep the data size manageable, so going down this route runs into severe rate limiting. It is much better to use the csv files the FEC makes available.


In [None]:
import requests, json, zipfile, StringIO, urllib2, os, re
%pylab inline
pylab.rcParams['figure.figsize'] = (10, 6)

import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
import numpy as np

import pandas as pd
import dateutil.parser

from ipywidgets import widgets
from IPython.display import display, clear_output

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

### Setting up to get Form 3P data through the FEC Beta API
While there is a [python wrapper](https://github.com/jeremyjbowers/pyopenfec) available, it isn't well maintained, and breaking changes to the API have already caused problems. Since what we want is very simple to get, it is much better to just use requests and hit the REST API endpoints directly. The functions below are set up to do that.

Anyone who wants an Excel spreadsheet to look at for comparison can go to http://www.fec.gov/data/CommitteeSummary.do?format=html&election_yr=2012, enter a campaign code, and click through to eventually get one.
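
Hitting the endpoints directly mostly comes down to following the pagination block in each response. Here is a minimal offline sketch of that pattern, with the HTTP call factored out so it can be run without an API key. The response shape (`pagination.pages` and `results`) is the one this notebook's own code reads; `paged_results`, `fetch_page` and `fake_fetch` are hypothetical names used only for this example.

```python
def paged_results(fetch_page):
    """Yield every record from a paginated endpoint.

    fetch_page(page_number) must return a dict shaped like
    {'pagination': {'pages': N}, 'results': [...]}.
    """
    page = 1
    while True:
        data = fetch_page(page)
        for record in data['results']:
            yield record
        # Stop once the last page reported by the server has been consumed.
        if page >= data['pagination']['pages']:
            break
        page += 1

# Stand-in for the network: three fake pages of report types.
FAKE_PAGES = {
    1: ['Q1', 'Q2'],
    2: ['Q3', 'YE'],
    3: ['M2'],
}

def fake_fetch(page):
    return {'pagination': {'pages': len(FAKE_PAGES), 'page': page},
            'results': FAKE_PAGES[page]}

records = list(paged_results(fake_fetch))
print(records)
```

In real use `fetch_page` would wrap a `requests.get` call that passes the page number along with the other query parameters.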

In [None]:
BASE_URL = 'https://api.open.fec.gov/v1'
# Read the key from the environment so it never has to be stored in the notebook
API_KEY = os.environ['OPENFEC_API_KEY']

In [None]:
def all_results(endpoint, params):
    """Yield every record from a paginated endpoint, following the pagination block."""
    params = dict(params, api_key=API_KEY)  # copy so the caller's dict is left untouched
    url = BASE_URL+endpoint
    r = requests.get(url, params=params)
    r.raise_for_status()  # fail loudly on rate limiting or a bad key
    initial_data = r.json()
    num_pages = initial_data['pagination']['pages']
    current_page = initial_data['pagination']['page']
    for record in initial_data['results']:
        yield record

    while current_page < num_pages:
        current_page += 1
        params['page'] = current_page
        r = requests.get(url, params=params)
        r.raise_for_status()
        for record in r.json()['results']:
            yield record

def count_results(endpoint, params):
    """Return the total number of records the endpoint reports for these parameters."""
    params = dict(params, api_key=API_KEY)
    url = BASE_URL+endpoint
    r = requests.get(url, params=params)
    r.raise_for_status()
    return r.json()['pagination']['count']

In [None]:
with open('../App/data/candidates.json', 'r') as f:
    candidate_data = json.load(f)
candidate_data = { year: candidate_data[year] for year in candidate_data.keys() if int(year) >= 2008 }

In [None]:

form3 = {}

def form3totals(committee_id, cycle):
    
    key = (committee_id, cycle)
    
    if key not in form3:
        q = {
            'cycle':cycle,
            'per_page':100,
            'is_amended':False,
        }
        data = []
        for r in all_results('/committee/'+committee_id+'/reports/', q):
            data.append({
                    'Report Year': r["report_year"],
                    'Report Type': r["report_type"],
                    'Report Period': str(r["report_year"])+'-'+r["report_type"],
                    'Coverage Start Date': dateutil.parser.parse(r["coverage_start_date"]).date(),
                    'Coverage End Date': dateutil.parser.parse(r["coverage_end_date"]).date(),
                    'Total Receipts': r["total_receipts_period"],
                    'Total Disbursements': r["total_disbursements_period"],
                })
        columns = [
            'Report Period',
            'Coverage Start Date',
            'Coverage End Date',
            'Total Receipts',
            'Total Disbursements',
        ]
        df = pd.DataFrame(data, columns=columns)


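        df.set_index('Report Period', inplace=True)
        # sort_values returns a new frame, so the result has to be assigned.
        # Newest first, because f3_period_distribution reverses the periods to walk forward in time.
        df = df.sort_values(by='Coverage Start Date', ascending=False)
        df.name = "%s-%d" % (committee_id, int(cycle))
        form3[key] = df
    
    return form3[key]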

### Getting the itemized data in bulk in csv form from the FEC website
The code below downloads all of the available bulk files from the FEC for every cycle in candidate_data, skipping any file that has already been downloaded.
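
The download step builds one URL and one local file name per bulk file and cycle. The sketch below shows only that string construction, using the URL pattern from this notebook. `bulk_file_names` is a hypothetical helper, and the sketch does not check whether the server still serves these paths.

```python
DATA_URL_BASE = 'ftp://ftp.fec.gov/FEC/20%02d/%s%02d.zip'

def bulk_file_names(year, abbrev):
    """Return (download URL, local csv path) for one bulk file and one cycle."""
    # Keep only the last two digits of the year; %02d restores a leading zero.
    last_two = int(str(year)[2:])
    url = DATA_URL_BASE % (last_two, abbrev, last_two)
    local_path = '%s_%s.csv' % (year, abbrev)
    return url, local_path

url, path = bulk_file_names('2012', 'indiv')
print(url)
print(path)

# The zero padding matters for cycles such as 2008.
print(bulk_file_names('2008', 'cm')[0])
```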

In [None]:
# To download all schedule detail csv files for a particular year

schedule_dict = {'Committees': {'abbrev':'cm', 'filename':'cm'},
            'Candidates': {'abbrev':'cn', 'filename':'cn'},
            'Candidate Committee Linkage': {'abbrev':'ccl', 'filename':'ccl'},
            'Committee to Committee Trans': {'abbrev':'oth', 'filename':'itoth'},
            'Contributions by Committees': {'abbrev':'pas2', 'filename':'itpas2'},
            'Contributions by Individuals': {'abbrev':'indiv', 'filename':'itcont'},
            'Operating Expenditures': {'abbrev':'oppexp', 'filename':'oppexp'}
             }
header_url_base = 'http://www.fec.gov/finance/disclosure/metadata/%s_header_file.csv'
data_url_base = 'ftp://ftp.fec.gov/FEC/20%02d/%s%02d.zip'


def get_schedule_data(year, schedule_dict):
    yearlasttwo = int(str(year)[2:])
    for f in schedule_dict.values():
        pathname = '%s_%s.csv' % (year, f['abbrev'])
        if not (os.path.isfile(pathname) and os.path.getsize(pathname) > 0):
            data_url = data_url_base % (yearlasttwo, f['abbrev'], yearlasttwo)
            print " downloading %s into %s..." % (data_url, pathname),
            r = requests.get(header_url_base % f['abbrev'])
            f['headers'] = r.content.strip().split(',')
            d = urllib2.urlopen(data_url)
            z = zipfile.ZipFile(StringIO.StringIO(d.read()))
            data = pd.read_csv(z.open('%s.txt' % f['filename']), sep='|', header=None, names=f['headers'], index_col=False,
                        low_memory=False)
            data.to_csv(pathname, index=True)
            print "done."
        else:
            print " skipping %s." % pathname

for cycle in sorted(candidate_data.keys()):
    print "downloading data for", cycle
    get_schedule_data(cycle, schedule_dict)


### Processing the Individual Contributions data
This drops the small number of rows that have no usable transaction date, then adds month, year and quarter columns to make later manipulation easy, before saving the result to a new csv file. The same step is applied to the operating expenditures file.
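
The zero-padding in process_indiv suggests that the individual contributions file stores dates as MMDDYYYY values that can lose their leading zero when read as numbers. Here is a standard-library sketch of that cleanup; `parse_indiv_date` is a hypothetical helper, and the sample values are invented.

```python
from datetime import datetime

def parse_indiv_date(raw):
    """Parse an MMDDYYYY value that may have lost its leading zero."""
    try:
        padded = "{:0>8d}".format(int(raw))  # 1052012 -> '01052012'
        return datetime.strptime(padded, '%m%d%Y').date()
    except (TypeError, ValueError):
        # Missing or impossible dates are dropped rather than guessed at.
        return None

for raw in [1052012, 12312011, 13452012.0, None]:
    d = parse_indiv_date(raw)
    if d is None:
        print("%s -> dropped" % raw)
    else:
        quarter = (d.month - 1) // 3 + 1
        print("%s -> %s (month %d, quarter %d)" % (raw, d.isoformat(), d.month, quarter))
```

Returning `None` for bad values plays the same role as `errors='coerce'` followed by the `notnull` filter in the pandas version.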

In [None]:
# Process the individual contributions and operating expenditure csv        
def process_date(year, in_format, out_format, date_format, date_preformat=None):
    pathname_in = in_format % year
    pathname_out = out_format % year
    if os.path.isfile(pathname_in) and os.path.getsize(pathname_in) > 0:
        if not (os.path.isfile(pathname_out) and os.path.getsize(pathname_out) > 0):
            print " processing %s into %s..." % (pathname_in, pathname_out),
            df = pd.read_csv(pathname_in, low_memory=False, index_col=0)
            df = df[df['TRANSACTION_DT'].notnull()].copy() # if there isn't a date, no use to us
            df['DATE'] = df['TRANSACTION_DT'] if date_preformat is None else df['TRANSACTION_DT'].apply(date_preformat)
            df['DATE'] = pd.to_datetime(df['DATE'], format=date_format, errors='coerce')
            df = df[df['DATE'].notnull()]
            df['MONTH'] = pd.DatetimeIndex(df['DATE']).month
            df['QUARTER'] = pd.DatetimeIndex(df['DATE']).quarter
            df['YEAR'] = pd.DatetimeIndex(df['DATE']).year
            df[['MONTH', 'QUARTER', 'YEAR']] = df[['MONTH', 'QUARTER', 'YEAR']].astype(int)
            df.to_csv(pathname_out)
            print "done."
        else:
            print "skipping %s." % pathname_out
    else:
        print 'ERROR: %s not found' % pathname_in

def process_indiv(year):
    return process_date(year, '%s_indiv.csv', '%s_indiv_p1.csv', '%m%d%Y', lambda dt: "{:0>8d}".format(int(dt)))
        
def process_oppex(year):
    return process_date(year,'%s_oppexp.csv', '%s_oppexp_p1.csv', '%m/%d/%Y')
        
for cycle in sorted(candidate_data):
    process_indiv(cycle)
    process_oppex(cycle)

### Load the individual contribution data into memory
Loading the processed individual contribution and operating expenditure files into memory makes them much faster to work with from this point on.

In [None]:
indiv_frames = []
oppexp_frames = []
for cycle in sorted(candidate_data.keys()):
    print "reading processed data for", cycle
    indiv_frames.append(pd.read_csv(cycle +'_indiv_p1.csv', low_memory=False, index_col=0))
    oppexp_frames.append(pd.read_csv(cycle +'_oppexp_p1.csv', low_memory=False, index_col=0))

# Concatenate once at the end; appending inside the loop copies the whole frame every time
indivs = pd.concat(indiv_frames)
oppexps = pd.concat(oppexp_frames)

In [None]:
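# months covered by quarter q (1-4), e.g. q2m(3) gives 7, 8, 9
q2m = lambda q: range(3*q - 2, 3*q + 1)
# months covered by a half year: 0 is January-June, anything else is July-December
sa2m = lambda h: range(1, 7) if h == 0 else range(7, 13)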

# for a given line item, which months should it distribute to?
def f3_period_distribution(periods):
    
    # want to move forward in time
    periods = periods[::]
    periods.reverse()
    
    distribution = []
    last_suffix = None
    deferred = False
    for period in periods:
        split = period.split('-')
        year = int(split[0])
        suffix = split[1]
        if suffix == 'YE':
            if last_suffix is None:
                deferred = True
            elif last_suffix == 'MY':
                distribution.append((year, sa2m(1)))
            elif last_suffix[0] == 'Q':
                distribution.append((year, q2m(4)))
            elif last_suffix[0] == 'M':
                distribution.append((year, [12]))
            elif last_suffix == '30G':
                distribution.append((year, q2m(4)))
        elif suffix == '12G' or suffix == '30G':
            distribution.append((year, q2m(4)))
        elif suffix in ['12P', '12R', '12C', '12S', 'TER']:
            distribution.append((year, []))
        else:
            if suffix == 'MY':
                if deferred:
                    distribution.append((year - 1, sa2m(1)))
                distribution.append((year, sa2m(0)))
            else:
                value = int(suffix[1:])
                if suffix[0] == 'M':
                    if deferred:
                        distribution.append((year - 1, [12]))
                    distribution.append((year, [value - 1]))
                elif suffix[0] == 'Q':
                    if deferred:
                        distribution.append((year - 1, q2m(4)))
                    distribution.append((year, q2m(value)))
                else:
                    raise ValueError("unknown suffix in f3 timeline " + suffix)
            deferred = False
        last_suffix = suffix
        
    # reverse to match f3
    distribution.reverse()
    return distribution

In [None]:

# pull all the values covering a distribution and calculate the normalized distribution based on indivs or oppexps
def f3_period_coeffs(committee_id, df, dist):
    
    year, months = dist
    
    df = df[df['CMTE_ID']==committee_id]
    df = df[df['YEAR']==year]
    df = df[df['MONTH'].isin(months)].copy()
    
    grouped = df.groupby(by=['YEAR', 'QUARTER', 'MONTH']).sum()['TRANSACTION_AMT']
    monthly = pd.DataFrame({
        'Monthly Contributions': grouped,
    })
    monthly['Pro Rating Coeffs'] = monthly['Monthly Contributions']/grouped.sum()
    
    value = monthly['Pro Rating Coeffs']
    
    return value

### Creating monthly pro-rating coefficients from the individual contribution data
The f3_period_coeffs function above returns a series of coefficients, indexed by year, quarter and month, to be used in prorating quarterly data to monthly. At the moment it is set up to work with the itemized individual contribution data; we probably need to generalize it so it can also work on the spending side with the itemized spending data.

The f3_monthly function below then takes a Form 3 period series and the itemized data as inputs, works out the pro-rating coefficients for each reporting period, applies the pro-rating methodology, and returns a monthly series as output.
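
The monthly output is laid out on a fixed 24-month index covering the two calendar years of a cycle. The sketch below reproduces just that index so the tuple arithmetic is easier to follow. `cycle_month_index` is a hypothetical helper name; the expression mirrors the one in f3_monthly, written with `//` so that it behaves the same on Python 2 and Python 3.

```python
def cycle_month_index(cycle):
    """Return the 24 (year, quarter, month) tuples covering a two-year cycle."""
    return [(cycle - (1 - i // 12),  # first 12 entries fall in the year before the election
             (i // 3) % 4 + 1,       # quarter, 1-4
             i % 12 + 1)             # month, 1-12
            for i in range(24)]

index = cycle_month_index(2012)
print(index[0])    # January of the off year
print(index[11])   # December of the off year
print(index[12])   # January of the election year
print(index[-1])   # December of the election year
```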

In [None]:

coeff_cache = {}

def f3_monthly(f3, data, committee_id, cycle):
    
    monthly_data = pd.Series(
        index=pd.MultiIndex.from_tuples(
            [(cycle - (1 - i//12), i//3 % 4 + 1, i % 12 + 1) for i in range(0, 24)],
            names=['YEAR', 'QUARTER', 'MONTH'])).fillna(0.)
    monthly_data.name = 'Monthly Data'
        
    f3_periods = f3.index.values.tolist()
    f3_distributions = f3_period_distribution(f3_periods)
    
    for period, distribution in zip(f3_periods, f3_distributions):
        year, months = distribution
        if len(months) > 0: # reports that map to no calendar month (e.g. 12P, TER) are skipped
            if len(months) == 1:
                distributed = pd.Series(
                    index=pd.MultiIndex.from_tuples([(year, (months[0] - 1)//3 % 4 + 1, months[0])],
                                                    names=['YEAR', 'QUARTER', 'MONTH']))
                distributed = distributed.fillna(f3[period])
            else:
                key = (committee_id, cycle, f3.name, distribution[0], tuple(distribution[1]))
                if key not in coeff_cache:
                    coeff_cache[key] = f3_period_coeffs(committee_id, data, distribution)
                coeffs = coeff_cache[key]
                distributed = coeffs * f3[period]
            monthly_data = monthly_data.add(distributed, fill_value=0.)
    
    return monthly_data

In [None]:

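# indiv_monthly and oppexp_monthly must be defined before this cell runs; each is expected
# to return a 24-month series (receipts or expenditures) for one committee and cycle.
cmte_names = {}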
in_monthly = {}
out_monthly = {}

for c in sorted(candidate_data.keys()):
    print c
    in_c = {}
    in_monthly[c] = in_c
    out_c = {}
    out_monthly[c] = out_c
    for p in sorted(candidate_data[c].keys()):
        print " ", p
        cp_dicts = candidate_data[c][p]
        for cp_dict in cp_dicts:
            print "  ", cp_dict['CAND_NAME']
            cmtes = [cp_dict['Principal']]
            cmtes.extend(cp_dict['Supporting'])
            for cmte in cmtes:
                cmte_id = cmte['id']
                print "   ",cmte_id
                if cmte_id not in cmte_names:
                    cmte_names[cmte_id] = cmte['name']
                if cmte_id not in in_c:
                    in_c[cmte_id] = indiv_monthly(cmte_id, c)
                if cmte_id not in out_c:
                    out_c[cmte_id] = oppexp_monthly(cmte_id, c)


In [None]:
from ordered_set import OrderedSet
from collections import OrderedDict
from datetime import date
from dateutil.relativedelta import relativedelta
import json

cycle_cmte_ids = OrderedSet()

for cycle in sorted(in_monthly):
    for cmte_id in sorted(in_monthly[cycle]):
        cycle_cmte_ids.add((cycle, cmte_id))
for cycle in sorted(out_monthly):
    for cmte_id in sorted(out_monthly[cycle]):
        cycle_cmte_ids.add((cycle, cmte_id))

collection = []
for cycle_cmte_id in cycle_cmte_ids:
    cycle, cmte_id = cycle_cmte_id
    in_df = in_monthly[cycle][cmte_id]
    out_df = out_monthly[cycle][cmte_id]
    document = OrderedDict()
    document['cycle'] = cycle
    document['cmte_id'] = cmte_id
    document['cmte_name'] = cmte_names[cmte_id]
    base = date(int(cycle)-1, 1, 1)
    document['date'] = [(base + relativedelta(months=x)).isoformat() for x in range(0, 24)]
    # clip(lower=0) replaces the deprecated clip_lower and zeroes out negative months
    document['receipts'] = in_df.clip(lower=0).values.astype(int).tolist()
    document['expenditures'] = out_df.clip(lower=0).values.astype(int).tolist()
    collection.append(document)
    print document['cycle'], document['cmte_id'], document['cmte_name']
    
with open('cmte_finances.json', 'w') as outfile:
    json.dump(collection, outfile)

