# Data Merging

Some observations
- For each of the 7 RCV cities in CA, we chose the 5 non-RCV cities with the highest cosine similarity score, giving 35 sampled cities
- Only 33 of those 35 cities were distinct ('Pico Rivera' and 'Whittier' each appear twice)
- 66 of the 21.7 million voters have no registration date and are treated as non-registered
- The sampled cities contain a total of 3.9 million voters after filtering
- The city 'El Paso de Robles' has no match in the demographic data
- How can we identify which election dates apply to each city?
    - We found 122 cases out of 312 with 0% voter turnout
- To do: calculate mean population income, age and donation amount, and counts of college education and donations
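
The cosine-similarity selection mentioned in the first observation can be sketched roughly as below. This is only an illustration: the feature vectors and the `city_a` ... `city_d` names are made up, and the features actually used are in ca_similarity_search.ipynb, which is not reproduced here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_similar(target, candidates, k=5):
    """Return the k candidate names with the highest similarity to target."""
    scored = sorted(candidates.items(),
                    key=lambda item: cosine_similarity(target, item[1]),
                    reverse=True)
    return [name for name, _ in scored[:k]]

# Hypothetical, already-scaled features for one RCV city and four candidates
rcv_city = [0.8, 0.6, 0.9]
non_rcv = {
    'city_a': [0.8, 0.6, 0.9],    # identical direction
    'city_b': [0.9, 0.1, 0.2],
    'city_c': [0.4, 0.3, 0.45],   # same direction, half the magnitude
    'city_d': [0.1, 0.9, 0.1],
}
closest = top_k_similar(rcv_city, non_rcv, k=2)
print(closest)
```

Because cosine similarity ignores magnitude, `city_c` scores as high as `city_a` here, which is why features would normally be scaled before comparing cities.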

In [76]:
import pandas as pd
import janitor
import gc

In [11]:
RCV_cities = ['San Francisco',
 'Oakland',
 'Berkeley',
 'San Leandro',
 'Palm Desert',
 'Eureka',
 'Albany']

sampled_nonRCV_cities = ['Fresno',
 'San Diego',
 'Sacramento',
 'Riverside',
 'San Jose',
 'Santa Ana',
 'Anaheim',
 'Santa Rosa',
 'Merced',
 'Santa Clarita',
 'Alhambra',
 'Davis',
 'Montebello',
 'Burbank',
 'Huntington Park',
 'Bellflower',
 'Watsonville',
 'Gilroy',
 'Whittier',
 'Lynwood',
 'Lakewood',
 'Pico Rivera',
 'Lake Forest',
 'Livermore',
 'Chino Hills',
 'Paramount',
 'El Paso de Robles',
 'Pico Rivera',
 'Buena Park',
 'Whittier',
 'Calabasas',
 'Carpinteria',
 'Morro Bay',
 'San Carlos',
 'Solvang']

print("total number of cities:", len(sampled_nonRCV_cities))

print("number of distinct cities:", len(set(sampled_nonRCV_cities)))

print("name of cities that were duplicated:", set([x for x in sampled_nonRCV_cities if sampled_nonRCV_cities.count(x) > 1]))

combined_sampled_cityName = RCV_cities+list(set(sampled_nonRCV_cities))
print("number of distinct RCV and sampled nonRCV cities:", len(combined_sampled_cityName))

total number of cities: 35
number of distinct cities: 33
name of cities that were duplicated: {'Pico Rivera', 'Whittier'}
number of distinct RCV and sampled nonRCV cities: 40
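
Note that `list(set(...))` removes the duplicates but does not keep the original order of the list. If a stable order matters, a minimal sketch using only the standard library (on a shortened, illustrative list) could look like this:

```python
from collections import Counter

cities = ['Whittier', 'Lynwood', 'Pico Rivera', 'Paramount', 'Pico Rivera', 'Whittier']

# names that appear more than once, found in a single pass over the list
duplicated = sorted(name for name, n in Counter(cities).items() if n > 1)

# drop repeats while keeping first-seen order (dicts preserve insertion order)
distinct = list(dict.fromkeys(cities))

print(duplicated)
print(distinct)
```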


## 1. Demographic Data

1. Select only the columns required: city name ('Residence_Addresses_City'), unique voter id ('LALVOTERID'), voter's ethnicity ('EthnicGroups_EthnicGroup1Desc') and the voter's registration date ('Voters_OfficialRegDate')
2. Keep only the cities that were identified as being similar to RCV cities in CA (see ca_similarity_search.ipynb for reference)
3. Keep only rows where 'EthnicGroups_EthnicGroup1Desc' is one of 'European', 'Likely African-American', 'Hispanic and Portuguese' or 'East and South Asian'
4. Keep only registered voters, i.e. rows with a non-null 'Voters_OfficialRegDate'


In [3]:
# change the filepath as required; we use the folder with the latest date
filepath = 'VM2--CA--2022-04-25/'

selected_variables = ['Residence_Addresses_City', 
                      'LALVOTERID',
                      'EthnicGroups_EthnicGroup1Desc',
                      'Voters_OfficialRegDate'
                     ]

state_demographic = pd.read_csv(f'{filepath}VM2--CA--2022-04-25-DEMOGRAPHIC.tab', 
                                sep='\t', dtype=str, encoding='unicode_escape',
                                usecols=selected_variables)

In [4]:
state_demographic.head(5)

Unnamed: 0,LALVOTERID,Residence_Addresses_City,EthnicGroups_EthnicGroup1Desc,Voters_OfficialRegDate
0,LALCA453164106,Oakland,Other,06/18/2021
1,LALCA453008306,Oakland,Likely African-American,04/01/2021
2,LALCA22129469,Oakland,European,11/16/2021
3,LALCA549803906,Oakland,Other,02/07/2022
4,LALCA24729024,San Leandro,European,02/28/2016


In [5]:
print("total number of unique cities", state_demographic.Residence_Addresses_City.nunique())
print("total number of unique voters", state_demographic.LALVOTERID.nunique())
print("count of non-registered voters", len(state_demographic[state_demographic['Voters_OfficialRegDate'].isnull()]))

total number of unique cities 1533
total number of unique voters 21711617
count of non-registered voters 66


In [13]:
print("number of expected cities:", len(combined_sampled_cityName))
missing_cities = [city for city in combined_sampled_cityName if city not in state_demographic['Residence_Addresses_City'].unique()]
if len(missing_cities) > 0:
    print("number of cities not found in demographic data:", len(missing_cities))
    print(missing_cities)

number of expected cities: 40
number of cities not found in demographic data: 1
['El Paso de Robles']
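
A missing city may simply be spelled differently in the voter file. One way to look for candidates is fuzzy matching with the standard library's `difflib`. In the sketch below the list of names is invented, and 'Paso Robles' is only an assumed alternative spelling that has not been checked against the data.

```python
import difflib

# Hypothetical subset of the city names present in the voter file
cities_in_file = ['Oakland', 'Pasadena', 'Paso Robles', 'El Cajon', 'San Leandro']

missing = 'El Paso de Robles'
# get_close_matches ranks candidates by similarity ratio, best match first
suggestions = difflib.get_close_matches(missing, cities_in_file, n=3, cutoff=0.6)
print(suggestions)
```

Any suggestion would still need a manual check before being added to the city list.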


In [14]:
selected_ethnicities = ['European', 'Likely African-American','Hispanic and Portuguese', 'East and South Asian']

state_demographic_subset = state_demographic[state_demographic['Residence_Addresses_City'].isin(combined_sampled_cityName) &
                                             state_demographic['EthnicGroups_EthnicGroup1Desc'].isin(selected_ethnicities) &
                                             state_demographic['Voters_OfficialRegDate'].notnull()
                                            ]
print(state_demographic_subset.shape)
state_demographic_subset.head()

(3918925, 4)


Unnamed: 0,LALVOTERID,Residence_Addresses_City,EthnicGroups_EthnicGroup1Desc,Voters_OfficialRegDate
1,LALCA453008306,Oakland,Likely African-American,04/01/2021
2,LALCA22129469,Oakland,European,11/16/2021
4,LALCA24729024,San Leandro,European,02/28/2016
6,LALCA22466723,Livermore,European,11/01/2021
7,LALCA22466636,Livermore,European,12/07/2021


In [15]:
print("number of unique cities:", state_demographic_subset.Residence_Addresses_City.nunique())

number of unique cities: 39


In [9]:
del state_demographic
gc.collect()

20
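
The cells above load the whole statewide file and then filter it. If memory is tight, the same filter could probably be applied while reading, one chunk at a time. The sketch below uses a tiny in-memory stand-in for the tab-separated file; the rows and the `wanted_cities` list are made up.

```python
import io
import pandas as pd

# Tiny stand-in for the tab-separated DEMOGRAPHIC file (values are made up)
raw = io.StringIO(
    "LALVOTERID\tResidence_Addresses_City\tVoters_OfficialRegDate\n"
    "A1\tOakland\t06/18/2021\n"
    "A2\tFresno\t\n"
    "A3\tRedding\t01/05/2019\n"
    "A4\tFresno\t02/07/2022\n"
)
wanted_cities = ['Oakland', 'Fresno']

kept = []
# chunksize makes read_csv yield DataFrames of at most that many rows
for chunk in pd.read_csv(raw, sep='\t', dtype=str, chunksize=2):
    mask = (chunk['Residence_Addresses_City'].isin(wanted_cities)
            & chunk['Voters_OfficialRegDate'].notnull())
    kept.append(chunk[mask])

subset = pd.concat(kept, ignore_index=True)
print(subset)
```

With this approach only the filtered rows are held in memory at once, so the later `del` and `gc.collect()` step becomes less important.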

## 2. Vote History

1. Select only the voter id ('LALVOTERID') and the columns for the 4 most recent General elections and the 4 most recent Local_or_Municipal elections
2. Merge Vote History with the sampled Demographic Data 


In [16]:
# read only the first 10 rows: we just need the column names to find the 4 most recent General and Local_or_Municipal elections
state_voterhistory = pd.read_csv(f'{filepath}VM2--CA--2022-04-25-VOTEHISTORY.tab',
                                 sep='\t', dtype=str, encoding='unicode_escape',
                                 nrows=10)
                                
state_voterhistory.head(5)

Unnamed: 0,LALVOTERID,Special_2022_04_19,Special_2022_04_12,Special_2022_04_05,Special_2022_02_15,Special_2022_02_01,Special_2021_12_14,Special_2021_12_07,Special_2021_11_02,Consolidated_General_2021_11_02,...,BallotReturnDate_General_2018_11_06,BallotReturnDate_Primary_2018_06_05,BallotReturnDate_General_2016_11_08,BallotReturnDate_Primary_2016_06_07,BallotReturnDate_General_2014_11_04,BallotReturnDate_Primary_2014_06_03,BallotReturnDate_General_2012_11_06,BallotReturnDate_Primary_2012_06_05,BallotReturnDate_General_2010_11_02,BallotReturnDate_Primary_2010_06_08
0,LALCA453164106,,,,,,,,,,...,,,11/07/2016,,,,,,,
1,LALCA453008306,,,,,,,,,,...,,,,,,,,,,
2,LALCA22129469,,,,,,,,,,...,11/06/2018,,,,,,,,,
3,LALCA549803906,,,,,,,,,,...,,,,,,,,,,
4,LALCA24729024,,,,,,,,,,...,,,,,,,,,,


In [17]:
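def get_4_recent_date(string, df):
    """Return the 4 most recent columns of df whose names start with `string`."""
    list_cols = [col for col in df.columns if col.startswith(string)]
    dates = [col.replace(string+'_', '') for col in list_cols]
    # dates are zero-padded YYYY_MM_DD, so a reverse string sort puts the newest first
    dates.sort(reverse=True)
    return [string+'_'+d for d in dates[:4]]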

GE_cols = get_4_recent_date('General', state_voterhistory)
print(GE_cols)
LM_cols = get_4_recent_date('Local_or_Municipal', state_voterhistory)
print(LM_cols)

['General_2020_11_03', 'General_2018_11_06', 'General_2016_11_08', 'General_2014_11_04']
['Local_or_Municipal_2021_08_31', 'Local_or_Municipal_2021_07_20', 'Local_or_Municipal_2021_06_08', 'Local_or_Municipal_2021_06_01']
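
The column names encode both the election type and the date, so they can also be parsed explicitly instead of matched by prefix. The helper name `split_election_column` and the short column list below are illustrative only; the slicing assumes every name ends in `_YYYY_MM_DD`.

```python
from datetime import datetime

def split_election_column(col):
    """Split e.g. 'Local_or_Municipal_2021_08_31' into (type, date)."""
    elec_type, date_part = col[:-11], col[-10:]
    return elec_type, datetime.strptime(date_part, '%Y_%m_%d').date()

cols = ['General_2014_11_04', 'Consolidated_General_2021_11_02',
        'General_2020_11_03', 'Local_or_Municipal_2021_08_31']

# an exact match on the type avoids picking up prefixed or suffixed variants
general = sorted((c for c in cols if split_election_column(c)[0] == 'General'),
                 key=lambda c: split_election_column(c)[1], reverse=True)
print(general)
```

Parsing to real dates also fails loudly on a malformed column name, which a plain string sort would silently accept.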


In [18]:
del state_voterhistory
gc.collect()

20

In [19]:
needed_variables = ['LALVOTERID'] + LM_cols + GE_cols

state_voterhistory = pd.read_csv(f'{filepath}VM2--CA--2022-04-25-VOTEHISTORY.tab',
                                 sep='\t', dtype=str, encoding='unicode_escape',
                                 usecols=needed_variables)
                                
state_voterhistory.head(5)

Unnamed: 0,LALVOTERID,Local_or_Municipal_2021_08_31,Local_or_Municipal_2021_07_20,Local_or_Municipal_2021_06_08,Local_or_Municipal_2021_06_01,General_2020_11_03,General_2018_11_06,General_2016_11_08,General_2014_11_04
0,LALCA453164106,,,,,Y,Y,Y,
1,LALCA453008306,,,,,,Y,,
2,LALCA22129469,,,,,Y,Y,Y,Y
3,LALCA549803906,,,,,Y,,,
4,LALCA24729024,,,,,,,,


In [20]:
merged_file = pd.merge(state_voterhistory, state_demographic_subset,
                       how='inner', left_on='LALVOTERID', right_on='LALVOTERID')

merged_file.head(5)

Unnamed: 0,LALVOTERID,Local_or_Municipal_2021_08_31,Local_or_Municipal_2021_07_20,Local_or_Municipal_2021_06_08,Local_or_Municipal_2021_06_01,General_2020_11_03,General_2018_11_06,General_2016_11_08,General_2014_11_04,Residence_Addresses_City,EthnicGroups_EthnicGroup1Desc,Voters_OfficialRegDate
0,LALCA453008306,,,,,,Y,,,Oakland,Likely African-American,04/01/2021
1,LALCA22129469,,,,,Y,Y,Y,Y,Oakland,European,11/16/2021
2,LALCA24729024,,,,,,,,,San Leandro,European,02/28/2016
3,LALCA22466723,,,,,,,,,Livermore,European,11/01/2021
4,LALCA22466636,,,,,Y,Y,Y,Y,Livermore,European,12/07/2021


In [21]:
print(merged_file.shape)
print("number of unique cities:", merged_file.Residence_Addresses_City.nunique())

(3918925, 12)
number of unique cities: 39
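
The row count is unchanged by the merge, which suggests every sampled voter has exactly one vote-history row. A sketch of how this could be checked explicitly with `pd.merge` options follows; the two small frames are made up and use an outer join purely to expose unmatched ids.

```python
import pandas as pd

history = pd.DataFrame({'LALVOTERID': ['A1', 'A2', 'A3'],
                        'General_2020_11_03': ['Y', None, 'Y']})
demographic = pd.DataFrame({'LALVOTERID': ['A1', 'A3', 'A4'],
                            'Residence_Addresses_City': ['Oakland', 'Albany', 'Eureka']})

# validate raises MergeError if either side repeats a voter id;
# indicator adds a '_merge' column saying which side each row came from
checked = pd.merge(history, demographic, on='LALVOTERID', how='outer',
                   validate='one_to_one', indicator=True)
print(checked['_merge'].value_counts())

# voters in the demographic subset that have no vote-history row
unmatched = checked.loc[checked['_merge'] == 'right_only', 'LALVOTERID'].tolist()
print(unmatched)
```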


In [22]:
merged_file = merged_file.reset_index(drop = False)

In [23]:
merged_file.to_csv('VM2--CA--2022-04-25-MERGED.csv', index=False)

# Calculate voter turnout using merged data

In [24]:
import pandas as pd
merged_file = pd.read_csv('VM2--CA--2022-04-25-MERGED.csv')



In [25]:
merged_file.head()

Unnamed: 0,index,LALVOTERID,Local_or_Municipal_2021_08_31,Local_or_Municipal_2021_07_20,Local_or_Municipal_2021_06_08,Local_or_Municipal_2021_06_01,General_2020_11_03,General_2018_11_06,General_2016_11_08,General_2014_11_04,Residence_Addresses_City,EthnicGroups_EthnicGroup1Desc,Voters_OfficialRegDate
0,0,LALCA453008306,,,,,,Y,,,Oakland,Likely African-American,04/01/2021
1,1,LALCA22129469,,,,,Y,Y,Y,Y,Oakland,European,11/16/2021
2,2,LALCA24729024,,,,,,,,,San Leandro,European,02/28/2016
3,3,LALCA22466723,,,,,,,,,Livermore,European,11/01/2021
4,4,LALCA22466636,,,,,Y,Y,Y,Y,Livermore,European,12/07/2021


In [26]:
# get the four most recent election dates across the whole file
# these may not be the most recent elections for every city: some cities show 0 voter turnout on these dates

def get_4_recent_date(string, df):
    list_cols = [col for col in df.columns if col.startswith(string)]
    dates = [col.replace(string+'_', '') for col in list_cols]
    dates.sort(reverse=True)
    return [string+'_'+d for d in dates[:4]]

GE_cols = get_4_recent_date('General', merged_file)
print(GE_cols)
LM_cols = get_4_recent_date('Local_or_Municipal', merged_file)
print(LM_cols)

['General_2020_11_03', 'General_2018_11_06', 'General_2016_11_08', 'General_2014_11_04']
['Local_or_Municipal_2021_08_31', 'Local_or_Municipal_2021_07_20', 'Local_or_Municipal_2021_06_08', 'Local_or_Municipal_2021_06_01']


In [27]:
# fill NA values with "N" to make them easier to compare with "Y"
merged_file[GE_cols+LM_cols] = merged_file[GE_cols+LM_cols].fillna('N')
merged_file.head()

Unnamed: 0,index,LALVOTERID,Local_or_Municipal_2021_08_31,Local_or_Municipal_2021_07_20,Local_or_Municipal_2021_06_08,Local_or_Municipal_2021_06_01,General_2020_11_03,General_2018_11_06,General_2016_11_08,General_2014_11_04,Residence_Addresses_City,EthnicGroups_EthnicGroup1Desc,Voters_OfficialRegDate
0,0,LALCA453008306,N,N,N,N,N,Y,N,N,Oakland,Likely African-American,04/01/2021
1,1,LALCA22129469,N,N,N,N,Y,Y,Y,Y,Oakland,European,11/16/2021
2,2,LALCA24729024,N,N,N,N,N,N,N,N,San Leandro,European,02/28/2016
3,3,LALCA22466723,N,N,N,N,N,N,N,N,Livermore,European,11/01/2021
4,4,LALCA22466636,N,N,N,N,Y,Y,Y,Y,Livermore,European,12/07/2021


In [28]:
# Build every (city, ethnicity, voted) combination so that groups where no one voted still get a row with perc_turnout = 0

list_ethnic_city = merged_file[['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc']].drop_duplicates()
list_ethnic_city_No = list_ethnic_city.copy()
list_ethnic_city_No['voted'] = 'N'
list_ethnic_city_Yes = list_ethnic_city.copy()
list_ethnic_city_Yes['voted'] = 'Y'
list_ethnic_city = pd.concat([list_ethnic_city_No, list_ethnic_city_Yes])

In [29]:
list_ethnic_city

Unnamed: 0,Residence_Addresses_City,EthnicGroups_EthnicGroup1Desc,voted
0,Oakland,Likely African-American,N
1,Oakland,European,N
2,San Leandro,European,N
3,Livermore,European,N
7,Oakland,East and South Asian,N
...,...,...,...
3777199,Santa Rosa,European,Y
3777202,Santa Rosa,Hispanic and Portuguese,Y
3777205,Santa Rosa,East and South Asian,Y
3777260,Santa Rosa,Likely African-American,Y


In [32]:
# we also need the total voters information per city and ethnicity
total_city_ethnic = merged_file.groupby(['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc']).size().reset_index()
total_city_ethnic.columns = ['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc', 'total_voters']
total_city_ethnic  = total_city_ethnic.merge(list_ethnic_city, on = ['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc'])
total_city_ethnic

Unnamed: 0,Residence_Addresses_City,EthnicGroups_EthnicGroup1Desc,total_voters,voted
0,Albany,East and South Asian,2405,N
1,Albany,East and South Asian,2405,Y
2,Albany,European,6169,N
3,Albany,European,6169,Y
4,Albany,Hispanic and Portuguese,1035,N
...,...,...,...,...
307,Whittier,European,26477,Y
308,Whittier,Hispanic and Portuguese,76334,N
309,Whittier,Hispanic and Portuguese,76334,Y
310,Whittier,Likely African-American,214,N


In [50]:
elec_date_cols = GE_cols+LM_cols
for i in range(len(elec_date_cols)):
    col = elec_date_cols[i]
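    # count voters per (city, ethnicity, voted) and divide by the (city, ethnicity) total;
    # groupby(level=...).transform replaces the deprecated Series.sum(level=...)
    counts = merged_file.groupby(['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc', col]).size()
    voter_turnout_stats = pd.DataFrame({
        'voted_voters': counts,
        'perc_turnout': counts / counts.groupby(level=[0, 1]).transform('sum'),
    }).reset_index()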
    
    # 'voted' is either 'Y' or 'N'
    voter_turnout_stats = voter_turnout_stats.rename(columns = {col: 'voted'})
    voter_turnout_stats = total_city_ethnic.merge(voter_turnout_stats, 
                                                 how = 'left',
                                                 on = ['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc', 'voted']) 
    voter_turnout_stats = voter_turnout_stats.replace('East and South Asian', 'asian')
    voter_turnout_stats = voter_turnout_stats.replace('European', 'white')
    voter_turnout_stats = voter_turnout_stats.replace('Hispanic and Portuguese', 'hispanic')
    voter_turnout_stats = voter_turnout_stats.replace('Likely African-American', 'black')
    
    voter_turnout_stats['elec_date'] = col[len(col)-10:]
    voter_turnout_stats['elec_year'] = col[len(col)-10:len(col)-6]
    voter_turnout_stats['elec_type'] = col[:len(col)-11]
    
    voter_turnout_stats[['voted_voters', 'perc_turnout']] = voter_turnout_stats[['voted_voters', 'perc_turnout']].fillna(0)
    voter_turnout_stats = voter_turnout_stats[voter_turnout_stats['voted'] == 'Y']    
    pivot_df = voter_turnout_stats.pivot(index = ['elec_type','elec_year', 'elec_date', 'Residence_Addresses_City'],
                                    columns='EthnicGroups_EthnicGroup1Desc', 
                                    values=['total_voters', 'voted_voters', 'perc_turnout']).reset_index()
    pivot_df.columns = pivot_df.columns.map('_'.join)
    
    # stack all types of election into one dataframe 
    if i == 0:
        voter_turnout_merge = pivot_df.copy() 
    else:
        voter_turnout_merge = pd.concat([voter_turnout_merge, pivot_df])



In [51]:
voter_turnout_merge

Unnamed: 0,elec_type_,elec_year_,elec_date_,Residence_Addresses_City_,total_voters_asian,total_voters_black,total_voters_hispanic,total_voters_white,voted_voters_asian,voted_voters_black,voted_voters_hispanic,voted_voters_white,perc_turnout_asian,perc_turnout_black,perc_turnout_hispanic,perc_turnout_white
0,General,2020,2020_11_03,Albany,2405.0,147.0,1035.0,6169.0,1982.0,120.0,896.0,5517.0,0.824116,0.816327,0.865700,0.894310
1,General,2020,2020_11_03,Alhambra,17451.0,191.0,16596.0,7359.0,12135.0,139.0,12815.0,5918.0,0.695376,0.727749,0.772174,0.804185
2,General,2020,2020_11_03,Anaheim,26340.0,1211.0,70052.0,54644.0,20542.0,930.0,50500.0,46119.0,0.779879,0.767960,0.720893,0.843990
3,General,2020,2020_11_03,Bellflower,2153.0,3614.0,19899.0,10792.0,1465.0,2705.0,13642.0,8012.0,0.680446,0.748478,0.685562,0.742402
4,General,2020,2020_11_03,Berkeley,8549.0,5942.0,6388.0,39425.0,6659.0,4668.0,5113.0,33180.0,0.778922,0.785594,0.800407,0.841598
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34,Local_or_Municipal,2021,2021_06_01,Santa Clarita,1197.0,239.0,4511.0,10256.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000
35,Local_or_Municipal,2021,2021_06_01,Santa Rosa,4677.0,832.0,23575.0,78673.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000
36,Local_or_Municipal,2021,2021_06_01,Solvang,102.0,9.0,845.0,4238.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000
37,Local_or_Municipal,2021,2021_06_01,Watsonville,853.0,42.0,18695.0,11481.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000
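
Each `perc_turnout` value is the number of voters who voted divided by the group's total; for Albany's asian group in the 2020 General election this is 1982 / 2405 ≈ 0.824. An alternative sketch, not the approach used in the loop above, computes the same three quantities from a 0/1 indicator in one `groupby`. The voters below are made up.

```python
import pandas as pd

# Made-up voters; 'N' already stands for "did not vote"
votes = pd.DataFrame({
    'Residence_Addresses_City': ['Albany', 'Albany', 'Albany', 'Eureka', 'Eureka'],
    'EthnicGroups_EthnicGroup1Desc': ['European'] * 5,
    'General_2020_11_03': ['Y', 'Y', 'N', 'N', 'N'],
})

keys = ['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc']
# 1 if the voter voted, 0 otherwise
votes['voted'] = votes['General_2020_11_03'].eq('Y').astype(int)

# count = group size, sum = number who voted, mean = turnout share
turnout = (votes.groupby(keys)['voted']
                .agg(total_voters='count', voted_voters='sum', perc_turnout='mean')
                .reset_index())
print(turnout)
```

Groups in which nobody voted get a turnout of 0 directly, so the separate Y/N scaffold and the `fillna(0)` step would not be needed with this formulation.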


In [56]:
# 1. convert the total-voter and voted-voter count columns to integers

cnt_cols = [col for col in voter_turnout_merge.columns if 'total_voters' in col or 'voted_voters' in col]
    
for col in cnt_cols:
    voter_turnout_merge[col] = voter_turnout_merge[col].astype(int)

voter_turnout_merge    

Unnamed: 0,elec_type_,elec_year_,elec_date_,Residence_Addresses_City_,total_voters_asian,total_voters_black,total_voters_hispanic,total_voters_white,voted_voters_asian,voted_voters_black,voted_voters_hispanic,voted_voters_white,perc_turnout_asian,perc_turnout_black,perc_turnout_hispanic,perc_turnout_white
0,General,2020,2020_11_03,Albany,2405,147,1035,6169,1982,120,896,5517,0.824116,0.816327,0.865700,0.894310
1,General,2020,2020_11_03,Alhambra,17451,191,16596,7359,12135,139,12815,5918,0.695376,0.727749,0.772174,0.804185
2,General,2020,2020_11_03,Anaheim,26340,1211,70052,54644,20542,930,50500,46119,0.779879,0.767960,0.720893,0.843990
3,General,2020,2020_11_03,Bellflower,2153,3614,19899,10792,1465,2705,13642,8012,0.680446,0.748478,0.685562,0.742402
4,General,2020,2020_11_03,Berkeley,8549,5942,6388,39425,6659,4668,5113,33180,0.778922,0.785594,0.800407,0.841598
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34,Local_or_Municipal,2021,2021_06_01,Santa Clarita,1197,239,4511,10256,0,0,0,0,0.000000,0.000000,0.000000,0.000000
35,Local_or_Municipal,2021,2021_06_01,Santa Rosa,4677,832,23575,78673,0,0,0,0,0.000000,0.000000,0.000000,0.000000
36,Local_or_Municipal,2021,2021_06_01,Solvang,102,9,845,4238,0,0,0,0,0.000000,0.000000,0.000000,0.000000
37,Local_or_Municipal,2021,2021_06_01,Watsonville,853,42,18695,11481,0,0,0,0,0.000000,0.000000,0.000000,0.000000


In [62]:
# 2. find the city/election rows where turnout is 0 for all four ethnic groups
# a turnout of 0 everywhere probably means the city held no election on that date
no_voter_turnout = voter_turnout_merge[(voter_turnout_merge['perc_turnout_asian'] == 0) &
                                       (voter_turnout_merge['perc_turnout_black'] == 0) &
                                       (voter_turnout_merge['perc_turnout_hispanic'] == 0) &
                                       (voter_turnout_merge['perc_turnout_white'] == 0)]

no_voter_turnout[['elec_type_', 'elec_date_', 'Residence_Addresses_City_']]

Unnamed: 0,elec_type_,elec_date_,Residence_Addresses_City_
0,Local_or_Municipal,2021_08_31,Albany
1,Local_or_Municipal,2021_08_31,Alhambra
2,Local_or_Municipal,2021_08_31,Anaheim
3,Local_or_Municipal,2021_08_31,Bellflower
4,Local_or_Municipal,2021_08_31,Berkeley
...,...,...,...
33,Local_or_Municipal,2021_06_01,Santa Ana
34,Local_or_Municipal,2021_06_01,Santa Clarita
35,Local_or_Municipal,2021_06_01,Santa Rosa
36,Local_or_Municipal,2021_06_01,Solvang


### Merge the voter turnout data with ca-cities to obtain median income, median age and college education

In [None]:
# the voter file has no reported income column
# ['Voters_Age', 'CommercialData_Education'] can be used to compute mean_pop_age and count_college_edu
# 'FECDonors_NumberOfDonations' is an integer: the total number of federal donations made over the last four election cycles
# 'FECDonors_AvgDonation' likewise covers the last four election cycles


In [73]:
# ca-cities.csv reports medians; it has no columns calculated as means
ca_cities = pd.read_csv('ca-cities.csv', usecols=['city', 'income_individual_median', 'age_median', 'education_college_or_above'])
ca_cities = ca_cities.rename(columns ={'city':'Residence_Addresses_City_'})
ca_cities.head()

Unnamed: 0,Residence_Addresses_City_,age_median,income_individual_median,education_college_or_above
0,Los Angeles,35.2,25302,33.1
1,San Francisco,38.3,45229,55.8
2,San Diego,34.3,33037,44.4
3,Riverside,31.3,24962,22.5
4,Sacramento,34.3,28633,31.5


In [74]:
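# merge the pivoted turnout table (its city column is 'Residence_Addresses_City_') with the city-level data
voter_turnout_stats = voter_turnout_merge.merge(ca_cities, on = "Residence_Addresses_City_")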
voter_turnout_stats

Unnamed: 0,elec_type_,elec_year_,elec_date_,Residence_Addresses_City_,total_voters_asian,total_voters_black,total_voters_hispanic,total_voters_white,voted_voters_asian,voted_voters_black,voted_voters_hispanic,voted_voters_white,perc_turnout_asian,perc_turnout_black,perc_turnout_hispanic,perc_turnout_white,age_median,income_individual_median,education_college_or_above
0,General,2020,2020_11_03,Albany,2405,147,1035,6169,1982,120,896,5517,0.824116,0.816327,0.865700,0.894310,35.7,48446,71.9
1,General,2018,2018_11_06,Albany,2405,147,1035,6169,1354,90,676,4678,0.562994,0.612245,0.653140,0.758308,35.7,48446,71.9
2,General,2016,2016_11_08,Albany,2405,147,1035,6169,1396,88,668,4659,0.580457,0.598639,0.645411,0.755228,35.7,48446,71.9
3,General,2014,2014_11_04,Albany,2405,147,1035,6169,714,50,333,3162,0.296881,0.340136,0.321739,0.512563,35.7,48446,71.9
4,Local_or_Municipal,2021,2021_08_31,Albany,2405,147,1035,6169,0,0,0,0,0.000000,0.000000,0.000000,0.000000,35.7,48446,71.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
307,General,2014,2014_11_04,Whittier,3963,214,76334,26477,783,48,12825,8807,0.197578,0.224299,0.168012,0.332628,36.8,31251,24.8
308,Local_or_Municipal,2021,2021_08_31,Whittier,3963,214,76334,26477,0,0,0,0,0.000000,0.000000,0.000000,0.000000,36.8,31251,24.8
309,Local_or_Municipal,2021,2021_07_20,Whittier,3963,214,76334,26477,1,0,32,25,0.000252,0.000000,0.000419,0.000944,36.8,31251,24.8
310,Local_or_Municipal,2021,2021_06_08,Whittier,3963,214,76334,26477,0,0,0,0,0.000000,0.000000,0.000000,0.000000,36.8,31251,24.8


In [63]:
voter_turnout_merge.to_csv('voter_turnout_stats.csv', index=False)

In [66]:
del voter_turnout_merge
gc.collect()

1710