# Independent Expenditures
## What is an Independent Expenditure?
According to the <a href='https://www.fec.gov/help-candidates-and-committees/making-disbursements-pac/independent-expenditures-nonconnected-pac/'> Federal Election Commission documentation</a>, an Independent Expenditure is an expenditure for a communication that 'expressly advocates the election or defeat of a clearly identified federal candidate; and is not coordinated with a candidate, candidate’s committee, party committee or their agents.'

Political action committees make independent expenditures to support or oppose candidates.  Independent expenditures are not contributions to a candidate and are therefore not subject to contribution limits.

## What's the data going to tell us?

The FEC data will be able to tell us a variety of things about a given election year.  Below, we will quantify:
- Number of communications advocating for a given candidate
- Number of communications advocating against a given candidate
- Total cost of the communications for or against a given candidate
- Those candidates a committee has spent money for or against
- The sum of money a given committee has spent for or against a candidate
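
As a rough sketch of the quantities listed above, the snippet below tallies the number and total cost of communications per candidate in plain Python. The sample rows are made up for illustration; only the field names `cand_name`, `sup_opp` and `exp_amo` match the FEC file used in this notebook, and `summarize` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical sample rows; real rows come from the FEC CSV
sample_rows = [
    {"cand_name": "CANDIDATE A", "sup_opp": "S", "exp_amo": "1000"},
    {"cand_name": "CANDIDATE A", "sup_opp": "O", "exp_amo": "250.50"},
    {"cand_name": "CANDIDATE B", "sup_opp": "S", "exp_amo": "400"},
    {"cand_name": "CANDIDATE A", "sup_opp": "S", "exp_amo": "500"},
]

def summarize(rows):
    """Return {(candidate, 'S' or 'O'): [count, total_cost]}."""
    summary = defaultdict(lambda: [0, 0.0])
    for row in rows:
        key = (row["cand_name"].upper(), row["sup_opp"])
        summary[key][0] += 1                      # number of communications
        summary[key][1] += float(row["exp_amo"])  # total cost
    return dict(summary)

summary = summarize(sample_rows)
for (candidate, side), (count, total) in sorted(summary.items()):
    print(candidate, side, count, total)
```

The same grouping could be keyed on the spender instead of the candidate to answer the committee-level questions.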


# The Data
Raw data is obtained from the <a href=https://cg-519a459a-0ea3-42c2-b7bc-fa1143481f74.s3-us-gov-west-1.amazonaws.com/bulk-downloads/index.html> Federal Election Commission bulk data sources </a>.  This notebook primarily uses the 'independent_expenditure' CSV files located in the annual subfolders, e.g. 2016/independent_expenditure_2016.csv

In [1]:
# Import dependencies
# Basic modules
import re
import datetime
from operator import add

# Data analysis modules
from pyspark.sql import SparkSession
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Instantiate variables for processing data moving forward
spark = SparkSession.builder.appName('ElectionAnalyzer').getOrCreate()
datapath = '/Users/Dan/Downloads/'

# Set the election year; modify this value to look at a different year
election_year = "2016"

# Set the file paths for the data file, based on the data path and election year
independent_expenditure_file = '{0}independent_expenditure_{1}.csv'.format(datapath, election_year)

In [3]:
# Read the data file from CSV as a Spark dataframe
ind_exp = spark.read.csv(independent_expenditure_file, header=True)

In [4]:
# Convert the dataframe with headers to an RDD, for use in map/reduce actions
# Filter out the empty lines
total_expenditures_rdd = (
    ind_exp
    .rdd
    .filter(lambda x: len(x) != 0)
)

# Convert the RDD to a Pandas dataframe for future use
total_expenditures_dataframe = (
    total_expenditures_rdd
    .filter(lambda x: x.exp_amo)
    .filter(lambda x: x.exp_date)
    .map(lambda x: (float(x.exp_amo), datetime.datetime.strptime(x.exp_date, '%d-%b-%y')))
    .toDF()
    .toPandas()
)

total_expenditures_dataframe.columns = ['expenditure','date']

print(total_expenditures_dataframe.head(5))

   expenditure       date
0     188000.0 2016-11-07
1      50359.0 2016-10-26
2     100000.0 2016-10-28
3      22000.0 2016-11-03
4      68240.0 2016-11-03
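
To make the conversion step concrete, here is a minimal sketch of how a single row's `exp_amo` and `exp_date` strings become a float and a datetime. `parse_expenditure` is a hypothetical helper, not part of the notebook; the input values are taken from the first row of the data.

```python
import datetime

def parse_expenditure(exp_amo, exp_date):
    """Convert raw FEC strings into (amount, date), or None if either is missing."""
    if not exp_amo or not exp_date:
        return None
    amount = float(exp_amo)
    # '%d-%b-%y' matches dates such as '07-NOV-16' (day, month abbreviation, 2-digit year)
    date = datetime.datetime.strptime(exp_date, "%d-%b-%y")
    return amount, date

parsed = parse_expenditure("188000", "07-NOV-16")
print(parsed)
```

Month abbreviations are matched according to the current locale, so this assumes an English locale.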


In [5]:
# Show the first row of the expenditures RDD -- this is for general debugging and viewing
total_expenditures_rdd.first()

Row(cand_id='P00003392', cand_name='CLINTON, HILLARY RODHAM', spe_id='C00344531', spe_nam='1199 32BJ/144 SERVICE EMPLOYEES INTERNATIONAL UNION HOME CARE POLITICAL ACTION FUND', ele_type='G', can_office_state=None, can_office_dis='00', can_office='P', cand_pty_aff='DEMOCRATIC PARTY', exp_amo='188000', exp_date='07-NOV-16', agg_amo='188000', sup_opp='S', pur='EVENT EXPENSES', pay='ART SCHOOL DROPOUT INC.', file_num='1123839', amndt_ind='N', tran_id='SE.5047', image_num='201611049037122225', receipt_dat='04-NOV-16', fec_election_yr='2016', prev_file_num=None)

### Supporting and Opposing Expenditure Summary
#### Expenditures are declared to either support or oppose a candidate.
A set of functions and data aggregations relating to supporting or opposing expenditures.
We count the supporting and opposing expenditures, grouped by candidate.

In [7]:
# Return an RDD of (total number of expenditures, candidate name),
# sorted by total number of expenditures related to that candidate
def aggregate_sup_or_opp_expenditures(sup_or_opp):
    """ Returns an RDD of the form (number of supporting or opposing expenditures, candidate)
        Given an 'S' returns supporting expenditures; given an 'O' returns opposing expenditures
    """
    sup_or_opp_expenditures = (
        total_expenditures_rdd
        .filter(lambda x: x.sup_opp == sup_or_opp)
        .map(lambda x: ((x.cand_name).upper(), 1))
        .reduceByKey(add)
        .map(lambda x: (x[1], x[0]))
        .sortByKey(False)
    )
    return sup_or_opp_expenditures

supportive_expenditures_by_candidate = aggregate_sup_or_opp_expenditures('S')
oppositional_expenditures_by_candidate = aggregate_sup_or_opp_expenditures('O')


In [None]:
# Print to the console a list containing tuples of (count of supportive expenditures, candidate names)
print(supportive_expenditures_by_candidate.collect())

In [None]:
# Print to the console a list containing tuples of (count of oppositional expenditures, candidate names)
print(oppositional_expenditures_by_candidate.collect())

In [None]:
# Return a list of ((supportive expenditures, oppositional expenditures), CandidateName)
# Sorted by greatest number of supportive expenditures

def reduce_sup_or_opp(val):
    ''' Function to convert the sup_opp values from the FEC data
        into a tuple, so that we can then reduce it by key
        and get a consolidated view of support and oppositions in 
        one function (as opposed to the above cell, which requires two 
        executions of the same function)
    ''' 
    if val == 'S':
        return (1,0)
    if val == 'O':
        return (0,1)
    return (0,0)

sup_or_opp_expenditures = (
    total_expenditures_rdd
    .map(lambda x: ((x.cand_name).upper(), reduce_sup_or_opp(x.sup_opp)))
    .reduceByKey(lambda x, y: (x[0]+y[0], x[1]+y[1]))
    .map(lambda x: (x[1], x[0]))
    .sortByKey(False)
)

print(sup_or_opp_expenditures.collect())
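
To see what the map and `reduceByKey` steps above compute without a Spark session, here is a plain-Python sketch on made-up records. The candidate names and the helper `to_pair` are hypothetical; `to_pair` uses the same encoding as `reduce_sup_or_opp`.

```python
def to_pair(val):
    """Same encoding as reduce_sup_or_opp: 'S' -> (1, 0), 'O' -> (0, 1)."""
    if val == 'S':
        return (1, 0)
    if val == 'O':
        return (0, 1)
    return (0, 0)

# Hypothetical (candidate, sup_opp) records
records = [('A', 'S'), ('A', 'O'), ('B', 'O'), ('A', 'S'), ('B', None)]

# Equivalent of reduceByKey: add the pairs element-wise per candidate
totals = {}
for name, flag in records:
    s, o = to_pair(flag)
    prev_s, prev_o = totals.get(name, (0, 0))
    totals[name] = (prev_s + s, prev_o + o)

# Equivalent of the final map + sortByKey(False): ((support, oppose), name), descending
ranked = sorted(((counts, name) for name, counts in totals.items()), reverse=True)
print(ranked)
```

Because the key is a tuple, sorting orders by support count first and uses the opposition count only to break ties.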

In [None]:
# Calculate the ratio of support to opposition for a given candidate, expressed as a percentage
# Return a list of (support to opposition ratio, candidate name) sorted by support ratio
# This does not handle 0 values gracefully at the moment, which must be adjusted
support_ratio = (
    sup_or_opp_expenditures
    .map(
    lambda x: (
        round((x[0][0]/x[0][1])*100, 2) 
        if (x[0][0] != 0) and (x[0][1] != 0) 
        else x[0][0]-x[0][1]*100, 
        x[1])
    )
    .sortByKey(False)
    .collect()
)

print(support_ratio)
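
One possible way to handle the zero cases explicitly is sketched below. Returning `None` when there is no opposition is an assumption made for illustration, not the notebook's current behaviour, and `safe_support_ratio` is a hypothetical name.

```python
def safe_support_ratio(support, oppose):
    """Support count as a percentage of the opposition count.

    Returns None when there is no opposition, since the ratio is undefined.
    """
    if oppose == 0:
        return None
    return round(support / oppose * 100, 2)

print(safe_support_ratio(3, 4))  # 75.0
print(safe_support_ratio(0, 5))  # 0.0
print(safe_support_ratio(2, 0))  # None
```

Rows with a `None` ratio would need to be filtered out before sorting, because `None` cannot be compared with numbers.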

In [None]:
# Return a list of (total number of expenditures, spender name)
# sorted by total number of expenditures by that spender
expenditures_by_spender = (
    total_expenditures_rdd
    .map(lambda x: ((x.spe_nam), 1))
    .reduceByKey(add)
    .map(lambda x: (x[1], x[0]))
    .sortByKey(False)
    .collect()
)

In [None]:
print(expenditures_by_spender)

In [None]:
## https://content.pivotal.io/blog/how-data-science-assists-sports
## https://seaborn.pydata.org/generated/seaborn.lineplot.html
# Using seaborn and pyspark dataframes together to create a chart/plot! 
## Do this for contributions over time!

a4_dims = (11.7, 8.27)
fig, ax = plt.subplots(figsize=a4_dims)

total_expenditures_timeseries = sns.barplot(ax=ax, x='date', y='expenditure', data=total_expenditures_dataframe)
total_expenditures_timeseries.get_figure().savefig('/Users/Dan/Desktop/output.png')
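
By default, seaborn's `barplot` draws the mean of `y` for each `x` value, so the chart above shows the average expenditure per date. If total spending per day is wanted instead, the amounts can be summed per date before plotting. A minimal sketch with made-up (amount, date) pairs:

```python
import datetime
from collections import defaultdict

# Hypothetical (amount, date) pairs in the same shape as the dataframe rows
pairs = [
    (100.0, datetime.date(2016, 11, 3)),
    (50.0, datetime.date(2016, 11, 3)),
    (25.0, datetime.date(2016, 11, 7)),
]

# Sum the amounts for each date
daily_totals = defaultdict(float)
for amount, day in pairs:
    daily_totals[day] += amount

# Sort chronologically so the result can be plotted as a time series
series = sorted(daily_totals.items())
print(series)
```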

## Calculating the Similarity Coefficient of Candidates based on their Contributors
### Overview
In the cells below we calculate the <a href=https://en.wikipedia.org/wiki/Jaccard_index> 'Jaccard Index', or Similarity Coefficient </a>, of candidates based on the identities of their contributors.  This lets us see which candidates are more or less similar based on who spends money for or against them.  The Jaccard Index itself ranges from 0 to 1, with 1 meaning identical sets of contributors and 0 meaning no shared contributors; the code below multiplies it by 100, so the reported scores range from 0 to 100.  This is useful in understanding patterns of giving and opposition, and how committees might be influencing an election across multiple candidates.
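
For two sets of contributors $A$ and $B$, the standard definition of the index, with a small worked example, is:

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad
J(\{1, 2\}, \{1, 3\}) = \frac{|\{1\}|}{|\{1, 2, 3\}|} = \frac{1}{3} \approx 0.333
```

Multiplied by 100, the worked example gives 33.33, which matches the second test case of `calculate_similarity_coefficient` in this notebook.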

### Method
1. Obtain an RDD of the form (CandidateID, [list of contributor IDs])
2. Obtain the cartesian product of that RDD with itself
3. Calculate the Jaccard Index of each set produced by the cartesian product
    - get the size of the intersection of the sets of contributors for each candidate being compared
    - get the size of the union of the sets of contributors to each candidate
    - divide the size of the intersection by the size of the union
    - multiply by 100
4. Return an RDD of the form (Jaccard Index, (Candidate1ID, Candidate2ID))
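
The method can be sketched in plain Python on a toy mapping, using the standard intersection-over-union definition of the Jaccard Index. The candidate and spender IDs below are made up, and `jaccard_percent` is a hypothetical helper. The sketch uses `itertools.combinations` rather than a full cartesian product, so each pair of candidates appears only once.

```python
from itertools import combinations

# Hypothetical candidate -> set of spender IDs
spenders_by_candidate = {
    "CAND_1": {"C1", "C2", "C3"},
    "CAND_2": {"C2", "C3", "C4"},
    "CAND_3": {"C5"},
}

def jaccard_percent(a, b):
    """Jaccard index of two sets as a percentage; 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union) * 100

# Score every unordered pair of candidates, highest similarity first
scores = sorted(
    (
        (jaccard_percent(spenders_by_candidate[x], spenders_by_candidate[y]), (x, y))
        for x, y in combinations(sorted(spenders_by_candidate), 2)
    ),
    reverse=True,
)
print(scores)
```

With a cartesian product, every pair shows up twice, once as (a, b) and once as (b, a); keeping unordered pairs halves the output without losing information, since the index is symmetric.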


### To Do
Weight the Jaccard Index based on the total number of contributors in each list.  A 1-to-1 similarity between two candidates with a single contributor each may be valuable and important, but I believe it would be interesting to see the candidates with the greatest similarity amongst the greatest number of contributors.

Move the similarity_coefficient functions to another notebook, and import that notebook, rather than simply use them here.  We can use this to great effect in other circumstances.

In [17]:
def prep_data(input_rdd, val_key, val_value):
    ''' A function to map and reduce an RDD based on a given key and value.
        Provided an object attribute to use as a key, and one to use as a value,
        will map the RDD to a tuple of (key, [value]), reduce by key by
        concatenating the lists, and convert each list to a set
    '''
    _prepped_data = (
        input_rdd
        # Skip rows whose key is missing (Spark null or the literal string 'None')
        .filter(lambda x: getattr(x, val_key) not in (None, 'None'))
        .map(lambda x: (getattr(x, val_key), [getattr(x, val_value)]))
        .reduceByKey(lambda x, y: x + y)
        .map(lambda x: (x[0], set(x[1])))
    )
    return _prepped_data
    
def rdd_similarity_coefficient_map(input_rdd):
    ''' Given an RDD of the form [(Identifier, {set of characteristics})]
        this function will produce an RDD of the form [(Jaccard Index, (Identifier1, Identifier2))]
        based on the cartesian product of the RDD with itself
    '''
    _coeff_sim = (
        input_rdd
        # get the cartesian product of the input rdd with itself
        .cartesian(input_rdd)
        # filter out the self * self products
        .filter(lambda x: x[0] != x[1])
        # calculate the Jaccard Index for each result of the cross-product
        # returning a nested tuple of (Jaccard Index, (cand1 id, cand2 id))
        .map(lambda x: (calculate_similarity_coefficient(x[0][1], x[1][1], min_similar=5), (x[0][0], x[1][0])))
    )
    return _coeff_sim
    
def calculate_similarity_coefficient(set_one, set_two, min_similar=None):
    ''' Function to calculate the Jaccard Index, given two sets
        Returns the Jaccard Index as a percentage between 0 and 100,
        or 0 if the sets share fewer than min_similar elements
    '''
    _set_one = set(set_one)
    _set_two = set(set_two)
    _set_union_len = len(_set_one.union(_set_two))
    _set_intersect_len = len(_set_one.intersection(_set_two))
    if min_similar:
        if _set_intersect_len < min_similar:
            return 0
    # Two empty sets have an empty union; treat them as having no similarity
    if _set_union_len == 0:
        return 0
    _sim_coeff = _set_intersect_len/_set_union_len * 100
    return _sim_coeff

In [24]:
# tests for the above function
# 0.0
print(calculate_similarity_coefficient(['1'], ['2']))

# 33.33333333333333
print(calculate_similarity_coefficient(['1','2'],['1','3']))

# 100.00
print(calculate_similarity_coefficient(['1','2','3','4'],['1','2','3','4']))

0.0
33.33333333333333
100.0
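
The tests above do not exercise the `min_similar` threshold. The sketch below uses a stand-alone copy of the same logic (under the hypothetical name `jaccard_percent`, so that it runs on its own) to show how the threshold zeroes out pairs with too few shared elements.

```python
def jaccard_percent(set_one, set_two, min_similar=None):
    """Stand-alone copy of the logic in calculate_similarity_coefficient."""
    one, two = set(set_one), set(set_two)
    shared = len(one & two)
    # Pairs sharing fewer than min_similar elements are scored as 0
    if min_similar and shared < min_similar:
        return 0
    return shared / len(one | two) * 100

below = jaccard_percent(['1', '2'], ['1', '3'], min_similar=2)
at = jaccard_percent(['1', '2', '3'], ['1', '2', '4'], min_similar=2)
print(below)  # one shared element, below the threshold of 2
print(at)     # two shared elements out of four distinct ones
```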


In [25]:
# Calculate the Jaccard similarity for each pair of candidates, based on the committees that have spent for or against them
rdd_similarity_coefficient_map(prep_data(total_expenditures_rdd, 'cand_id', 'spe_id')).sortByKey(False).collect()

[(75.0, ('H6FL02190', 'H6FL02208')),
 (75.0, ('H6FL02208', 'H6FL02190')),
 (61.111111111111114, ('H4FL26038', 'H0IL10302')),
 (61.111111111111114, ('H0IL10302', 'H4FL26038')),
 (55.55555555555556, ('H6MN02149', 'H4NV04017')),
 (55.55555555555556, ('H4NV04017', 'H6MN02149')),
 (54.54545454545454, ('P60008059', 'P60008521')),
 (54.54545454545454, ('P60008521', 'P60008059')),
 (50.0, ('H0TX23086', 'H2TX23124')),
 (50.0, ('H4MN08083', 'H2IA01055')),
 (50.0, ('H2TX23124', 'H0TX23086')),
 (50.0, ('P60008521', 'P60003670')),
 (50.0, ('P60003670', 'P60008521')),
 (50.0, ('H8MN03077', 'H0CA19173')),
 (50.0, ('H2NV04045', 'H2IA01055')),
 (50.0, ('H4ME02200', 'H4NV04017')),
 (50.0, ('H0CA19173', 'H8MN03077')),
 (50.0, ('H2IA01055', 'H4MN08083')),
 (50.0, ('H2IA01055', 'H2NV04045')),
 (50.0, ('H4NV04017', 'H4ME02200')),
 (46.15384615384615, ('H4FL26038', 'H0CA19173')),
 (46.15384615384615, ('H0TX23086', 'H4NV04017')),
 (46.15384615384615, ('H0CA19173', 'H4FL26038')),
 (46.15384615384615, ('H4NV040