# Context
## Independent Expenditures
### What is an Independent Expenditure?
According to the <a href='https://www.fec.gov/help-candidates-and-committees/making-disbursements-pac/independent-expenditures-nonconnected-pac/'> Federal Election Commission documentation</a>, an Independent Expenditure is an expenditure for a communication that 'expressly advocates the election or defeat of a clearly identified federal candidate; and is not coordinated with a candidate, candidate’s committee, party committee or their agents.'

Political action committees make independent expenditures to support or oppose candidates. Independent expenditures are not contributions to a candidate and are therefore not subject to contribution limits.

In this analysis we measure the similarity between candidates, based on the committees that spend money to support or oppose them.

# Analysis
## Calculating the Similarity Coefficient between candidates based on their shared contributors
### Overview
In this notebook we calculate the <a href=https://en.wikipedia.org/wiki/Jaccard_index>Jaccard Index, or Similarity Coefficient</a>, of candidates based on the identities of the committees that have spent money (independent expenditures) to support or oppose them.

Scores range from 0 to 100: 100 means two candidates have an identical set of committee spenders, and 0 means they share none. For example, if two candidates have no committees in common among those spending on them, their Jaccard Index is 0. If exactly the same committees spend on both candidates, their Jaccard Index is 100. The code in this notebook also assigns a score of 0 to any pair that shares fewer than five committees (the `min_similar` threshold).
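
As a minimal illustration of this scale, the sketch below computes the index for a few small hand-made sets. The function name `jaccard_percent` and the committee IDs are hypothetical and do not come from the notebook or the FEC data. The empty-set guard is an assumption added to avoid division by zero.

```python
def jaccard_percent(set_a, set_b):
    """Return the Jaccard Index of two collections as a percentage (0 to 100)."""
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as having no similarity
        return 0.0
    return round(len(a & b) / len(union) * 100, 2)

# Hypothetical committee IDs, for illustration only
identical = jaccard_percent({"C1", "C2"}, {"C1", "C2"})
disjoint = jaccard_percent({"C1", "C2"}, {"C3", "C4"})
partial = jaccard_percent({"C1", "C2", "C3"}, {"C2", "C3", "C4"})
print(identical, disjoint, partial)
```

In the `partial` case the two sets share two committees out of four distinct committees overall, so the score falls halfway between the two extremes.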

This is useful in understanding patterns of giving and opposition, and how political action committees might be exercising influence across multiple candidates and races.

### Method
1. Obtain an RDD of the form (CandidateID, [list of contributor IDs])
2. Cross join the RDD generated in step 1 with itself
3. Calculate the Jaccard Index of each pair of sets produced by the cross join
    - get the length of the intersection of the two candidates' sets of contributors
    - get the length of the union of the two candidates' sets of contributors
    - divide the length of the intersection by the length of the union
    - multiply by 100
4. Return an RDD of the form (Jaccard Index, (Candidate1ID, Candidate2ID))
5. Collect the result and display it
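
The steps above can be sketched without Spark on a small in-memory example. This is only an illustration under stated assumptions: the candidate and committee IDs are made up, `itertools.product` stands in for the RDD cross join, and the score is computed as the size of the intersection divided by the size of the union, matching the `calculate_similarity_coefficient` function defined later in the notebook (without its `min_similar` threshold).

```python
from itertools import product

# Step 1: hypothetical (candidate ID -> set of committee IDs) data, not real FEC records
spenders_by_candidate = {
    "CAND_A": {"C1", "C2", "C3"},
    "CAND_B": {"C1", "C2", "C3", "C4"},
    "CAND_C": {"C5"},
}

def jaccard_percent(set_a, set_b):
    """Intersection size over union size, as a percentage."""
    union = set_a | set_b
    return round(len(set_a & set_b) / len(union) * 100, 2) if union else 0.0

# Steps 2 to 4: cross join, drop self-pairs, score each remaining pair
scored_pairs = [
    (jaccard_percent(s1, s2), (c1, c2))
    for (c1, s1), (c2, s2) in product(spenders_by_candidate.items(), repeat=2)
    if c1 != c2
]

# Step 5: show the most similar pairs first
scored_pairs.sort(key=lambda pair: pair[0], reverse=True)
for score, pair in scored_pairs:
    print(score, pair)
```

Because a cross join yields both orderings of every pair, each pair of candidates appears twice in the result, once as (A, B) and once as (B, A). Using `itertools.combinations` instead would be one way to keep a single entry per pair.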

#### Author
Dan Budris <d.c.budris@gmail.com>

In [8]:
# Import dependencies
# Basic modules
import re
import datetime
from operator import add
import os

# Data analysis modules
from pyspark.sql import SparkSession
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [9]:
# Instantiate variables for processing data moving forward
# Spark configuration
spark = SparkSession.builder.appName('ElectionAnalyzer').getOrCreate()

# The path to the data -- defaults to the demo data that is downloaded by installer script.
# Change this path to a different data set as desired
datapath = os.environ['FEC_data_path']
#datapath = '../../data/2016/'

# Set the election year; modify this value to look at a different year
election_year = "2016"

# Set the file paths for the data file, based on the data path and election year
independent_expenditure_file = '{0}/{1}/independent_expenditure_{1}.csv'.format(datapath, election_year)

In [10]:
# Read the CSV as a spark dataframe
ind_exp = spark.read.csv(independent_expenditure_file, header=True)

# Convert the dataframe to an RDD, and 
# filter out the empty lines
total_expenditures_rdd = (
    ind_exp
    .rdd
    .filter(lambda x: len(x) != 0)
)

In [11]:
# Define functions for prepping and analyzing the data

def prep_data(input_rdd, val_key, val_value):
    ''' A function to group an RDD by a given key and collect the values.
        Provided an object attribute to use as a key, and one to use as a value,
        will map the rdd to a tuple of (key, [value]), reduce by key by
        concatenating the lists, and convert each list to a set of unique values
    '''
    _prepped_data = (
        input_rdd
        .filter(lambda x: (getattr(x, val_key) != 'None'))
        .map(lambda x: (getattr(x, val_key), [getattr(x, val_value)]))
        .reduceByKey(lambda x, y: x + y)
        .map(lambda x: (x[0], set(x[1])))
    )
    return _prepped_data
    
def calculate_similarity_coefficient(set_one, set_two, min_similar=None):
    ''' Function to calculate the Jaccard Index, given two sets
        Returns the Jaccard Index as a numeric percentage between 0 and 100
        Given a 'min_similar' value, will only return positive values for those 
        sets whose intersection is greater than or equal to the minimum similarity.
    '''
    _set_one = set(set_one)
    _set_two = set(set_two)
    _set_union_len = len(_set_one.union(_set_two))
    _set_intersect_len = len(_set_one.intersection(_set_two))
    if min_similar:
        if _set_intersect_len < min_similar:
            return 0
    _sim_coeff = round(_set_intersect_len/_set_union_len * 100, 2)
    return _sim_coeff
    
def rdd_similarity_coefficient_map(input_rdd):
    ''' Given an RDD of the form [(Identifier, [List of Characteristics])]
        This function will produce an RDD of the form [(Jaccard Index, (Identifier1, Identifier2))]
        based on the cartesian product of the RDD
    '''
    _coeff_sim = (
        input_rdd
        # get the cartesian product of the input rdd with itself
        .cartesian(input_rdd)
        # filter out the self * self products
        .filter(lambda x: x[0] != x[1])
        # calculate the Jaccard Index for each result of the cross-product
        # returning a nested tuple of (Jaccard Index, (cand1 id, cand2 id))
        .map(lambda x: (calculate_similarity_coefficient(x[0][1], x[1][1], min_similar=5), (x[0][0], x[1][0])))
    )
    return _coeff_sim

In [12]:
# Calculate the Jaccard similarity for each pair of candidates, based on the committees that have spent on them
candidate_pairs_with_jaccard_index = (
    rdd_similarity_coefficient_map(
        prep_data(
            total_expenditures_rdd, 'cand_id', 'spe_id'
        )
    )
    #.map(lambda x: (x[0], frozenset([x[1][0], x[1][1]])))
    #.distinct()
)

In [13]:
# Collect the (similarity, candidate pair) RDD, sorted by descending similarity
cands_w_coeff = (
    candidate_pairs_with_jaccard_index
    .filter(lambda x: x[0] != 0)
    .filter(lambda x: None not in x[1])
    .sortByKey(False)
    .collect()
)

In [14]:
# This variable has the data in the form (jaccard index, (candidate id 1, candidate id 2))
# Print, save, or otherwise parse this information as you see fit.
cands_w_coeff

[(75.0, ('H6FL02190', 'H6FL02208')),
 (75.0, ('H6FL02208', 'H6FL02190')),
 (61.11, ('H4FL26038', 'H0IL10302')),
 (61.11, ('H0IL10302', 'H4FL26038')),
 (55.56, ('H6MN02149', 'H4NV04017')),
 (55.56, ('H4NV04017', 'H6MN02149')),
 (54.55, ('P60008059', 'P60008521')),
 (54.55, ('P60008521', 'P60008059')),
 (50.0, ('H0TX23086', 'H2TX23124')),
 (50.0, ('H4MN08083', 'H2IA01055')),
 (50.0, ('H2TX23124', 'H0TX23086')),
 (50.0, ('P60008521', 'P60003670')),
 (50.0, ('P60003670', 'P60008521')),
 (50.0, ('H8MN03077', 'H0CA19173')),
 (50.0, ('H2NV04045', 'H2IA01055')),
 (50.0, ('H4ME02200', 'H4NV04017')),
 (50.0, ('H0CA19173', 'H8MN03077')),
 (50.0, ('H2IA01055', 'H4MN08083')),
 (50.0, ('H2IA01055', 'H2NV04045')),
 (50.0, ('H4NV04017', 'H4ME02200')),
 (46.15, ('H4FL26038', 'H0CA19173')),
 (46.15, ('H0TX23086', 'H4NV04017')),
 (46.15, ('H0CA19173', 'H4FL26038')),
 (46.15, ('H4NV04017', 'H0TX23086')),
 (45.45, ('H0TX23086', 'H4ME02200')),
 (45.45, ('H4MN08083', 'H2NV04045')),
 (45.45, ('H4MN08083', 'H4