# Data Merging

Some observations:
- We chose the 5 non-RCV cities with the highest cosine similarity score for each of the 7 RCV cities in CA
- Only 33 of those 35 cities were distinct
- 66 of the 21.7 million voters have no registration date (non-registered)
- There are a total of 3.9 million voters in the sampled cities after filtering by ethnicity and registration date
- City 'El Paso de Robles' had no match in the demographic data
- How can we identify which election dates apply to each city?
    - We found 122 cases out of 312 with 0% voter turnout
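
The selection described in the first observation can be sketched as follows. This is a minimal illustration and not the code from ca_similarity_search.ipynb: the feature vectors, the placeholder city names, and the helper `top_k_similar` are all hypothetical.

```python
import numpy as np

def top_k_similar(target, candidates, k=5):
    """Return indices and scores of the k candidate rows most cosine-similar to target."""
    target = np.asarray(target, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    # cosine similarity = dot product divided by the product of the L2 norms
    sims = candidates @ target / (np.linalg.norm(candidates, axis=1) * np.linalg.norm(target))
    # argsort is ascending, so negate to rank by descending similarity
    order = np.argsort(-sims)[:k]
    return order, sims[order]

# hypothetical feature vectors (e.g. scaled population, income, age)
rcv_city = [1.0, 0.5, 0.2]
non_rcv = {
    "city_a": [2.0, 1.0, 0.4],   # same direction as rcv_city
    "city_b": [0.1, 0.9, 0.9],
    "city_c": [1.0, 0.4, 0.3],
}
names = list(non_rcv)
idx, scores = top_k_similar(rcv_city, list(non_rcv.values()), k=2)
selected = [names[i] for i in idx]
print(selected)
```

Repeating this for each of the 7 RCV cities with k=5 gives 35 picks. The same non-RCV city can be picked for more than one RCV city, which would explain why only 33 are distinct.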

In [1]:
import pandas as pd
import janitor
import gc

In [2]:
RCV_cities = ['San Francisco',
 'Oakland',
 'Berkeley',
 'San Leandro',
 'Palm Desert',
 'Eureka',
 'Albany']

sampled_nonRCV_cities = ['Fresno',
 'San Diego',
 'Sacramento',
 'Riverside',
 'San Jose',
 'Santa Ana',
 'Anaheim',
 'Santa Rosa',
 'Merced',
 'Santa Clarita',
 'Alhambra',
 'Davis',
 'Montebello',
 'Burbank',
 'Huntington Park',
 'Bellflower',
 'Watsonville',
 'Gilroy',
 'Whittier',
 'Lynwood',
 'Lakewood',
 'Pico Rivera',
 'Lake Forest',
 'Livermore',
 'Chino Hills',
 'Paramount',
 'El Paso de Robles',
 'Pico Rivera',
 'Buena Park',
 'Whittier',
 'Calabasas',
 'Carpinteria',
 'Morro Bay',
 'San Carlos',
 'Solvang']

print("total number of cities:", len(sampled_nonRCV_cities))

print("number of distinct cities:", len(set(sampled_nonRCV_cities)))

print("name of cities that were duplicated:", set([x for x in sampled_nonRCV_cities if sampled_nonRCV_cities.count(x) > 1]))

combined_sampled_cityName = RCV_cities+list(set(sampled_nonRCV_cities))
print("number of distinct RCV and sampled nonRCV cities:", len(combined_sampled_cityName))

total number of cities: 35
number of distinct cities: 33
name of cities that were duplicated: {'Pico Rivera', 'Whittier'}
number of distinct RCV and sampled nonRCV cities: 40


## 1. Demographic Data

1. Select only the required columns: city name ('Residence_Addresses_City'), unique voter id ('LALVOTERID'), voter's ethnicity ('EthnicGroups_EthnicGroup1Desc') and the date the voter was registered ('Voters_OfficialRegDate')
2. Keep only the cities that were identified as being similar to RCV cities in CA (see ca_similarity_search.ipynb for reference)
3. Keep only rows where 'EthnicGroups_EthnicGroup1Desc' is one of "European", "Likely African-American", "Hispanic and Portuguese" or "East and South Asian"
4. Keep only registered voters, i.e. rows with a non-null 'Voters_OfficialRegDate'


In [4]:
# change the filepath as required, we have selected the folder with the latest date
#filepath = 'VM2--CA--2022-04-25/'
filepath = '../data/'

'''
selected_variables = ['Residence_Addresses_City', 
                      'LALVOTERID',
                      'EthnicGroups_EthnicGroup1Desc',
                      'Voters_OfficialRegDate'
                     ]


state_demographic = pd.read_csv(f'{filepath}VM2--CA--2022-04-25-DEMOGRAPHIC.tab', 
                                sep='\t', dtype=str, encoding='unicode_escape',
                                usecols=selected_variables)
'''

state_demographic = pd.read_parquet(f'{filepath}VM2--CA--2022-04-25-DEMOGRAPHIC_selected_cols.parquet')

In [5]:
state_demographic.head(5)

Unnamed: 0,LALVOTERID,Residence_Addresses_City,Voters_Gender,Voters_Age,Parties_Description,EthnicGroups_EthnicGroup1Desc,Voters_OfficialRegDate,County,CommercialData_Education,CommercialData_EstimatedHHIncome,FECDonors_NumberOfDonations,FECDonors_TotalDonationsAmount
0,LALCA453164106,Oakland,F,29,Democratic,Other,06/18/2021,ALAMEDA,,,,
1,LALCA453008306,Oakland,F,26,Non-Partisan,Likely African-American,04/01/2021,ALAMEDA,,,,
2,LALCA22129469,Oakland,F,47,Democratic,European,11/16/2021,ALAMEDA,HS Diploma - Extremely Likely,,,
3,LALCA549803906,Oakland,M,60,Democratic,Other,02/07/2022,ALAMEDA,,,,
4,LALCA24729024,San Leandro,F,56,Democratic,European,02/28/2016,ALAMEDA,HS Diploma - Extremely Likely,,,


In [6]:
print("total number of unique cities", state_demographic.Residence_Addresses_City.nunique())
print("total number of unique voters", state_demographic.LALVOTERID.nunique())
print("count of non-registered voters", len(state_demographic[state_demographic['Voters_OfficialRegDate'].isnull()]))

total number of unique cities 1533
total number of unique voters 21711617
count of non-registered voters 66


In [7]:
print("number of expected cities:", len(combined_sampled_cityName))
missing_cities = [city for city in combined_sampled_cityName if city not in state_demographic['Residence_Addresses_City'].unique()]
if len(missing_cities) > 0:
    print("number of cities not found in demographic data:", len(missing_cities))
    print(missing_cities)

number of expected cities: 40
number of cities not found in demographic data: 1
['El Paso de Robles']
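
One way to investigate the missing city is to look for near-matching names in the demographic data. The sketch below runs on a toy frame. The idea that the voter file lists the city under its common name 'Paso Robles' is an assumption that has not been checked against the real data, and `find_name_candidates` and `city_aliases` are hypothetical names.

```python
import pandas as pd

# toy stand-in for state_demographic; the 'Paso Robles' spelling is an assumption
toy_demographic = pd.DataFrame({
    'Residence_Addresses_City': ['Oakland', 'Paso Robles', 'Livermore', 'Paso Robles'],
})

def find_name_candidates(df, keyword, col='Residence_Addresses_City'):
    """List distinct city names containing keyword (case-insensitive)."""
    names = df[col].dropna().unique()
    return sorted(n for n in names if keyword.lower() in n.lower())

candidates = find_name_candidates(toy_demographic, 'robles')
print(candidates)

# if a match is confirmed, map the sampled name to the name used in the data
city_aliases = {'El Paso de Robles': 'Paso Robles'}  # hypothetical mapping
sampled = ['Oakland', 'El Paso de Robles']
sampled_fixed = [city_aliases.get(c, c) for c in sampled]
print(sampled_fixed)
```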


In [8]:
selected_ethnicities = ['European', 'Likely African-American','Hispanic and Portuguese', 'East and South Asian']

state_demographic_subset = state_demographic[state_demographic['Residence_Addresses_City'].isin(combined_sampled_cityName) &
                                             state_demographic['EthnicGroups_EthnicGroup1Desc'].isin(selected_ethnicities) &
                                             state_demographic['Voters_OfficialRegDate'].notnull()
                                            ]
print(state_demographic_subset.shape)
state_demographic_subset.head()

(3918925, 12)


Unnamed: 0,LALVOTERID,Residence_Addresses_City,Voters_Gender,Voters_Age,Parties_Description,EthnicGroups_EthnicGroup1Desc,Voters_OfficialRegDate,County,CommercialData_Education,CommercialData_EstimatedHHIncome,FECDonors_NumberOfDonations,FECDonors_TotalDonationsAmount
1,LALCA453008306,Oakland,F,26,Non-Partisan,Likely African-American,04/01/2021,ALAMEDA,,,,
2,LALCA22129469,Oakland,F,47,Democratic,European,11/16/2021,ALAMEDA,HS Diploma - Extremely Likely,,,
4,LALCA24729024,San Leandro,F,56,Democratic,European,02/28/2016,ALAMEDA,HS Diploma - Extremely Likely,,,
6,LALCA22466723,Livermore,F,38,Republican,European,11/01/2021,ALAMEDA,,,,
7,LALCA22466636,Livermore,M,63,Democratic,European,12/07/2021,ALAMEDA,,,,


In [9]:
print("number of unique cities:", state_demographic_subset.Residence_Addresses_City.nunique())

number of unique cities: 39


In [10]:
del state_demographic
gc.collect()

531

## 2. Vote History

1. Select only the voter id and the columns for the 4 most recent General elections and the 4 most recent Local_or_Municipal elections
2. Merge Vote History with the sampled Demographic Data 
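
Because both files are keyed on 'LALVOTERID', the merge in step 2 can be checked for duplicated ids and unmatched rows. The sketch below uses toy frames with made-up ids; `validate` and `indicator` are standard `pd.merge` parameters, and whether the real files satisfy a one-to-one relationship is an assumption.

```python
import pandas as pd

history = pd.DataFrame({'LALVOTERID': ['A1', 'A2', 'A3'],
                        'General_2020_11_03': ['Y', None, 'Y']})
demographic = pd.DataFrame({'LALVOTERID': ['A1', 'A2'],
                            'Residence_Addresses_City': ['Oakland', 'Albany']})

# validate raises MergeError if a voter id is duplicated on either side;
# indicator adds a '_merge' column showing where each row came from
checked = pd.merge(history, demographic, how='left', on='LALVOTERID',
                   validate='one_to_one', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']
print(len(unmatched), "vote-history rows have no sampled demographic record")

# keeping only rows found in both frames reproduces an inner join
merged = checked[checked['_merge'] == 'both'].drop(columns='_merge')
print(merged)
```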


In [11]:
# read only the first 10 rows to find the column names of the 4 most recent General and Local_or_Municipal elections
filepath = '../data/VM2--CA--2022-04-25/'

state_voterhistory = pd.read_csv(f'{filepath}VM2--CA--2022-04-25-VOTEHISTORY.tab',
                                 sep='\t', dtype=str, encoding='unicode_escape',
                                nrows=10)
                                
state_voterhistory.head(5)

Unnamed: 0,LALVOTERID,Special_2022_04_19,Special_2022_04_12,Special_2022_04_05,Special_2022_02_15,Special_2022_02_01,Special_2021_12_14,Special_2021_12_07,Special_2021_11_02,Consolidated_General_2021_11_02,...,BallotReturnDate_General_2018_11_06,BallotReturnDate_Primary_2018_06_05,BallotReturnDate_General_2016_11_08,BallotReturnDate_Primary_2016_06_07,BallotReturnDate_General_2014_11_04,BallotReturnDate_Primary_2014_06_03,BallotReturnDate_General_2012_11_06,BallotReturnDate_Primary_2012_06_05,BallotReturnDate_General_2010_11_02,BallotReturnDate_Primary_2010_06_08
0,LALCA453164106,,,,,,,,,,...,,,11/07/2016,,,,,,,
1,LALCA453008306,,,,,,,,,,...,,,,,,,,,,
2,LALCA22129469,,,,,,,,,,...,11/06/2018,,,,,,,,,
3,LALCA549803906,,,,,,,,,,...,,,,,,,,,,
4,LALCA24729024,,,,,,,,,,...,,,,,,,,,,


In [12]:
def get_4_recent_date(string, df):
    list_cols = [col for col in df.columns if col.startswith(string)]
    dates = [col.replace(string+'_', '') for col in list_cols]
    dates.sort(reverse=True)
    return [string+'_'+d for d in dates[:4]]

GE_cols = get_4_recent_date('General', state_voterhistory)
print(GE_cols)
LM_cols = get_4_recent_date('Local_or_Municipal', state_voterhistory)
print(LM_cols)

['General_2020_11_03', 'General_2018_11_06', 'General_2016_11_08', 'General_2014_11_04']
['Local_or_Municipal_2021_08_31', 'Local_or_Municipal_2021_07_20', 'Local_or_Municipal_2021_06_08', 'Local_or_Municipal_2021_06_01']


In [13]:
del state_voterhistory
gc.collect()

0

In [14]:
needed_variables = ['LALVOTERID'] + LM_cols + GE_cols

state_voterhistory = pd.read_csv(f'{filepath}VM2--CA--2022-04-25-VOTEHISTORY.tab',
                                 sep='\t', dtype=str, encoding='unicode_escape',
                                 usecols=needed_variables)
                                
state_voterhistory.head(5)

Unnamed: 0,LALVOTERID,Local_or_Municipal_2021_08_31,Local_or_Municipal_2021_07_20,Local_or_Municipal_2021_06_08,Local_or_Municipal_2021_06_01,General_2020_11_03,General_2018_11_06,General_2016_11_08,General_2014_11_04
0,LALCA453164106,,,,,Y,Y,Y,
1,LALCA453008306,,,,,,Y,,
2,LALCA22129469,,,,,Y,Y,Y,Y
3,LALCA549803906,,,,,Y,,,
4,LALCA24729024,,,,,,,,


In [15]:
merged_file = pd.merge(state_voterhistory, state_demographic_subset,
                       how='inner', left_on='LALVOTERID', right_on='LALVOTERID')

merged_file.head(5)

Unnamed: 0,LALVOTERID,Local_or_Municipal_2021_08_31,Local_or_Municipal_2021_07_20,Local_or_Municipal_2021_06_08,Local_or_Municipal_2021_06_01,General_2020_11_03,General_2018_11_06,General_2016_11_08,General_2014_11_04,Residence_Addresses_City,Voters_Gender,Voters_Age,Parties_Description,EthnicGroups_EthnicGroup1Desc,Voters_OfficialRegDate,County,CommercialData_Education,CommercialData_EstimatedHHIncome,FECDonors_NumberOfDonations,FECDonors_TotalDonationsAmount
0,LALCA453008306,,,,,,Y,,,Oakland,F,26,Non-Partisan,Likely African-American,04/01/2021,ALAMEDA,,,,
1,LALCA22129469,,,,,Y,Y,Y,Y,Oakland,F,47,Democratic,European,11/16/2021,ALAMEDA,HS Diploma - Extremely Likely,,,
2,LALCA24729024,,,,,,,,,San Leandro,F,56,Democratic,European,02/28/2016,ALAMEDA,HS Diploma - Extremely Likely,,,
3,LALCA22466723,,,,,,,,,Livermore,F,38,Republican,European,11/01/2021,ALAMEDA,,,,
4,LALCA22466636,,,,,Y,Y,Y,Y,Livermore,M,63,Democratic,European,12/07/2021,ALAMEDA,,,,


In [16]:
print(merged_file.shape)
print("number of unique cities:", merged_file.Residence_Addresses_City.nunique())

(3918925, 20)
number of unique cities: 39


In [17]:
merged_file = merged_file.reset_index(drop = False)

In [18]:
merged_file.to_csv('../data/VM2--CA--2022-04-25-MERGED.csv', index=False)

# Calculate voter turnout using merged data

In [19]:
import pandas as pd
merged_file = pd.read_csv('../data/VM2--CA--2022-04-25-MERGED.csv')



In [20]:
merged_file.head()

Unnamed: 0,index,LALVOTERID,Local_or_Municipal_2021_08_31,Local_or_Municipal_2021_07_20,Local_or_Municipal_2021_06_08,Local_or_Municipal_2021_06_01,General_2020_11_03,General_2018_11_06,General_2016_11_08,General_2014_11_04,...,Voters_Gender,Voters_Age,Parties_Description,EthnicGroups_EthnicGroup1Desc,Voters_OfficialRegDate,County,CommercialData_Education,CommercialData_EstimatedHHIncome,FECDonors_NumberOfDonations,FECDonors_TotalDonationsAmount
0,0,LALCA453008306,,,,,,Y,,,...,F,26.0,Non-Partisan,Likely African-American,04/01/2021,ALAMEDA,,,,
1,1,LALCA22129469,,,,,Y,Y,Y,Y,...,F,47.0,Democratic,European,11/16/2021,ALAMEDA,HS Diploma - Extremely Likely,,,
2,2,LALCA24729024,,,,,,,,,...,F,56.0,Democratic,European,02/28/2016,ALAMEDA,HS Diploma - Extremely Likely,,,
3,3,LALCA22466723,,,,,,,,,...,F,38.0,Republican,European,11/01/2021,ALAMEDA,,,,
4,4,LALCA22466636,,,,,Y,Y,Y,Y,...,M,63.0,Democratic,European,12/07/2021,ALAMEDA,,,,


In [72]:
# get the four most recent election dates across the whole file
# these may not be the most recent election dates for every city: some of these dates show 0 voter turnout in some cities

def get_4_recent_date(string, df):
    list_cols = [col for col in df.columns if col.startswith(string)]
    dates = [col.replace(string+'_', '') for col in list_cols]
    dates.sort(reverse=True)
    return [string+'_'+d for d in dates[:4]]

GE_cols = get_4_recent_date('General', merged_file)
print(GE_cols)
LM_cols = get_4_recent_date('Local_or_Municipal', merged_file)
print(LM_cols)

['General_2020_11_03', 'General_2018_11_06', 'General_2016_11_08', 'General_2014_11_04']
['Local_or_Municipal_2021_08_31', 'Local_or_Municipal_2021_07_20', 'Local_or_Municipal_2021_06_08', 'Local_or_Municipal_2021_06_01']


In [73]:
# fill NA values with "N" to make it easier to compare with "Y"
merged_file[GE_cols+LM_cols] = merged_file[GE_cols+LM_cols].fillna('N')
merged_file.head()

Unnamed: 0,index,LALVOTERID,Local_or_Municipal_2021_08_31,Local_or_Municipal_2021_07_20,Local_or_Municipal_2021_06_08,Local_or_Municipal_2021_06_01,General_2020_11_03,General_2018_11_06,General_2016_11_08,General_2014_11_04,...,Voters_Gender,Voters_Age,Parties_Description,EthnicGroups_EthnicGroup1Desc,Voters_OfficialRegDate,County,CommercialData_Education,CommercialData_EstimatedHHIncome,FECDonors_NumberOfDonations,FECDonors_TotalDonationsAmount
0,0,LALCA453008306,N,N,N,N,N,Y,N,N,...,F,26.0,Non-Partisan,Likely African-American,04/01/2021,ALAMEDA,,,,
1,1,LALCA22129469,N,N,N,N,Y,Y,Y,Y,...,F,47.0,Democratic,European,11/16/2021,ALAMEDA,HS Diploma - Extremely Likely,,,
2,2,LALCA24729024,N,N,N,N,N,N,N,N,...,F,56.0,Democratic,European,02/28/2016,ALAMEDA,HS Diploma - Extremely Likely,,,
3,3,LALCA22466723,N,N,N,N,N,N,N,N,...,F,38.0,Republican,European,11/01/2021,ALAMEDA,,,,
4,4,LALCA22466636,N,N,N,N,Y,Y,Y,Y,...,M,63.0,Democratic,European,12/07/2021,ALAMEDA,,,,


In [74]:
# We created the dataframe below in order to easily calculate perc_turnout when no one voted

list_ethnic_city = merged_file[['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc']].drop_duplicates()
list_ethnic_city_No = list_ethnic_city.copy()
list_ethnic_city_No['voted'] = 'N'
list_ethnic_city_Yes = list_ethnic_city.copy()
list_ethnic_city_Yes['voted'] = 'Y'
list_ethnic_city = pd.concat([list_ethnic_city_No, list_ethnic_city_Yes])

In [75]:
list_ethnic_city

Unnamed: 0,Residence_Addresses_City,EthnicGroups_EthnicGroup1Desc,voted
0,Oakland,Likely African-American,N
1,Oakland,European,N
2,San Leandro,European,N
3,Livermore,European,N
7,Oakland,East and South Asian,N
...,...,...,...
3777199,Santa Rosa,European,Y
3777202,Santa Rosa,Hispanic and Portuguese,Y
3777205,Santa Rosa,East and South Asian,Y
3777260,Santa Rosa,Likely African-American,Y


In [76]:
# we also need the total voters information per city and ethnicity
total_city_ethnic = merged_file.groupby(['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc']).size().reset_index()
total_city_ethnic.columns = ['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc', 'total_voters']
total_city_ethnic  = total_city_ethnic.merge(list_ethnic_city, on = ['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc'])
total_city_ethnic

Unnamed: 0,Residence_Addresses_City,EthnicGroups_EthnicGroup1Desc,total_voters,voted
0,Albany,East and South Asian,2405,N
1,Albany,East and South Asian,2405,Y
2,Albany,European,6169,N
3,Albany,European,6169,Y
4,Albany,Hispanic and Portuguese,1035,N
...,...,...,...,...
307,Whittier,European,26477,Y
308,Whittier,Hispanic and Portuguese,76334,N
309,Whittier,Hispanic and Portuguese,76334,Y
310,Whittier,Likely African-American,214,N


In [91]:
elec_date_cols = GE_cols+LM_cols
for i in range(len(elec_date_cols)):
    col = elec_date_cols[i]
    voter_turnout_stats = merged_file.groupby(['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc', col]).size().agg(
      # groupby(level=...).transform('sum') replaces the deprecated Series.sum(level=...)
      {'voted_voters': lambda x: x, 'perc_turnout':lambda x: x / x.groupby(level=[0,1]).transform('sum')}
      ).unstack(level=0).reset_index()
    
    # 'voted' is either 'Y' or 'N'
    voter_turnout_stats = voter_turnout_stats.rename(columns = {col: 'voted'})
    voter_turnout_stats = total_city_ethnic.merge(voter_turnout_stats, 
                                                 how = 'left',
                                                 on = ['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc', 'voted']) 
    voter_turnout_stats = voter_turnout_stats.replace('East and South Asian', 'asian')
    voter_turnout_stats = voter_turnout_stats.replace('European', 'white')
    voter_turnout_stats = voter_turnout_stats.replace('Hispanic and Portuguese', 'hispanic')
    voter_turnout_stats = voter_turnout_stats.replace('Likely African-American', 'black')
    
    # column names end in '_YYYY_MM_DD', so slice the date parts from the end
    voter_turnout_stats['elec_date'] = col[-10:]
    voter_turnout_stats['elec_year'] = col[-10:-6]
    voter_turnout_stats['elec_type'] = col[:-11]
    
    voter_turnout_stats[['voted_voters', 'perc_turnout']] = voter_turnout_stats[['voted_voters', 'perc_turnout']].fillna(0)
    voter_turnout_stats = voter_turnout_stats[voter_turnout_stats['voted'] == 'Y']    
    pivot_df = voter_turnout_stats.pivot(index = ['elec_type','elec_year', 'elec_date', 'Residence_Addresses_City'],
                                    columns='EthnicGroups_EthnicGroup1Desc', 
                                    values=['total_voters', 'voted_voters', 'perc_turnout']).reset_index()
    pivot_df.columns = pivot_df.columns.map('_'.join)
    
    # add info about age and donations
    # for each election, get mean age and mean donation amount per ethnic group and voted status
    
    # age
    age_df = merged_file[['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc', 
                                  'Voters_Age', col]]
    # update columns to match
    age_df = age_df.replace('East and South Asian', 'asian')
    age_df = age_df.replace('European', 'white')
    age_df = age_df.replace('Hispanic and Portuguese', 'hispanic')
    age_df = age_df.replace('Likely African-American', 'black')
    
    age = age_df.groupby(['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc', col]).mean().reset_index()
    age.rename(columns={'Voters_Age': 'mean_age'}, inplace=True)
    
    pivot_age = age.pivot(index = ['Residence_Addresses_City', col],
                                    columns='EthnicGroups_EthnicGroup1Desc', values=['mean_age']).reset_index()
    pivot_age.columns = pivot_age.columns.map('_'.join)
    
    # donations
    donations_df = merged_file[['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc', 
                                  'FECDonors_TotalDonationsAmount', col]]
    
    # update columns to match
    donations_df = donations_df.replace('East and South Asian', 'asian')
    donations_df = donations_df.replace('European', 'white')
    donations_df = donations_df.replace('Hispanic and Portuguese', 'hispanic')
    donations_df = donations_df.replace('Likely African-American', 'black')
    
    
    mean_donations = donations_df.groupby(['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc', col]).mean().reset_index()
    mean_donations.rename(columns={'FECDonors_TotalDonationsAmount': 'mean_donations'}, inplace=True)
    
    pivot_donations = mean_donations.pivot(index = ['Residence_Addresses_City', col],
                                    columns='EthnicGroups_EthnicGroup1Desc', values=['mean_donations']).reset_index()
    pivot_donations.columns = pivot_donations.columns.map('_'.join)
    
    # merge donations and age
    merged = pivot_age.merge(pivot_donations, how='outer', on=['Residence_Addresses_City_', col+'_'])
    
    merged['elec_date_'] = col[-10:]
    merged['elec_year_'] = col[-10:-6]
    merged['elec_type_'] = col[:-11]
    
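    # drop the voted flag column; this leaves one row per city and voted status ('N' and 'Y'),
    # so a city with any turnout appears twice after the merge with pivot_df below
    merged.drop(columns=[col+'_'], inplace=True)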
    
    # merge merged with pivot
    output = pivot_df.merge(merged, how='outer', on=['Residence_Addresses_City_', 'elec_date_', 
                                                     'elec_year_', 'elec_type_'])
    
    # stack all types of election into one dataframe 
    if i == 0:
        voter_turnout_merge = output.copy() 
    else:
        voter_turnout_merge = pd.concat([voter_turnout_merge, output])



In [92]:
voter_turnout_merge

Unnamed: 0,elec_type_,elec_year_,elec_date_,Residence_Addresses_City_,total_voters_asian,total_voters_black,total_voters_hispanic,total_voters_white,voted_voters_asian,voted_voters_black,...,perc_turnout_hispanic,perc_turnout_white,mean_age_asian,mean_age_black,mean_age_hispanic,mean_age_white,mean_donations_asian,mean_donations_black,mean_donations_hispanic,mean_donations_white
0,General,2020,2020_11_03,Albany,2405.0,147.0,1035.0,6169.0,1982.0,120.0,...,0.865700,0.894310,45.276596,39.148148,39.640288,40.702619,1507.500000,,30.000000,2240.583333
1,General,2020,2020_11_03,Albany,2405.0,147.0,1035.0,6169.0,1982.0,120.0,...,0.865700,0.894310,49.288743,49.633333,47.526786,53.328558,1952.010870,1946.545455,1858.243243,3491.918478
2,General,2020,2020_11_03,Alhambra,17451.0,191.0,16596.0,7359.0,12135.0,139.0,...,0.772174,0.804185,51.034065,37.980769,43.260870,46.200697,1138.928571,,261.428571,831.625000
3,General,2020,2020_11_03,Alhambra,17451.0,191.0,16596.0,7359.0,12135.0,139.0,...,0.772174,0.804185,51.728225,45.000000,48.782531,52.719601,1804.295359,230.000000,1068.829167,2727.153257
4,General,2020,2020_11_03,Anaheim,26340.0,1211.0,70052.0,54644.0,20542.0,930.0,...,0.720893,0.843990,44.726174,39.234875,36.025065,42.757864,1712.428571,,772.619048,987.093750
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46,Local_or_Municipal,2021,2021_06_01,Santa Rosa,4677.0,832.0,23575.0,78673.0,0.0,0.0,...,0.000000,0.000000,48.063743,46.239183,42.143778,54.985549,1771.513514,274.857143,1138.776978,2706.721762
47,Local_or_Municipal,2021,2021_06_01,Solvang,102.0,9.0,845.0,4238.0,0.0,0.0,...,0.000000,0.000000,55.372549,56.444444,45.239053,58.503308,11763.750000,,885.052632,3218.320132
48,Local_or_Municipal,2021,2021_06_01,Watsonville,853.0,42.0,18695.0,11481.0,0.0,0.0,...,0.000000,0.000000,55.003525,50.190476,42.829374,55.534553,855.678571,416.666667,834.768519,3666.881647
49,Local_or_Municipal,2021,2021_06_01,Whittier,3963.0,214.0,76334.0,26477.0,0.0,0.0,...,0.000013,0.000076,50.288709,44.990654,45.660623,53.041180,2588.484375,850.000000,899.590595,2088.732673
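
Turnout per city and ethnicity can also be computed by averaging a boolean voted flag, which avoids the multi-level aggregation and gives 0 directly for groups where nobody voted. This is a sketch on toy data, not the loop above; it assumes the vote columns hold 'Y' or a missing value, as in the merged file.

```python
import pandas as pd

toy = pd.DataFrame({
    'Residence_Addresses_City': ['Albany', 'Albany', 'Albany', 'Oakland', 'Oakland'],
    'EthnicGroups_EthnicGroup1Desc': ['European', 'European', 'European',
                                      'European', 'European'],
    'General_2020_11_03': ['Y', 'Y', None, None, None],
})

col = 'General_2020_11_03'
keys = ['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc']

turnout = (
    toy.assign(voted=toy[col].eq('Y'))       # True where the voter cast a ballot
       .groupby(keys)['voted']
       .agg(total_voters='size', voted_voters='sum', perc_turnout='mean')
       .reset_index()
)
print(turnout)
```

Because every city and ethnicity pair present in the data forms its own group, no helper frame of 'Y'/'N' combinations is needed to cover the zero-turnout case.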


In [93]:
# 1. convert the total-voter and voted-voter count columns to integer

cnt_cols = [col for col in voter_turnout_merge.columns if 'total_voters' in col or 'voted_voters' in col]
    
for col in cnt_cols:
    voter_turnout_merge[col] = voter_turnout_merge[col].astype(int)

voter_turnout_merge    

Unnamed: 0,elec_type_,elec_year_,elec_date_,Residence_Addresses_City_,total_voters_asian,total_voters_black,total_voters_hispanic,total_voters_white,voted_voters_asian,voted_voters_black,...,perc_turnout_hispanic,perc_turnout_white,mean_age_asian,mean_age_black,mean_age_hispanic,mean_age_white,mean_donations_asian,mean_donations_black,mean_donations_hispanic,mean_donations_white
0,General,2020,2020_11_03,Albany,2405,147,1035,6169,1982,120,...,0.865700,0.894310,45.276596,39.148148,39.640288,40.702619,1507.500000,,30.000000,2240.583333
1,General,2020,2020_11_03,Albany,2405,147,1035,6169,1982,120,...,0.865700,0.894310,49.288743,49.633333,47.526786,53.328558,1952.010870,1946.545455,1858.243243,3491.918478
2,General,2020,2020_11_03,Alhambra,17451,191,16596,7359,12135,139,...,0.772174,0.804185,51.034065,37.980769,43.260870,46.200697,1138.928571,,261.428571,831.625000
3,General,2020,2020_11_03,Alhambra,17451,191,16596,7359,12135,139,...,0.772174,0.804185,51.728225,45.000000,48.782531,52.719601,1804.295359,230.000000,1068.829167,2727.153257
4,General,2020,2020_11_03,Anaheim,26340,1211,70052,54644,20542,930,...,0.720893,0.843990,44.726174,39.234875,36.025065,42.757864,1712.428571,,772.619048,987.093750
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46,Local_or_Municipal,2021,2021_06_01,Santa Rosa,4677,832,23575,78673,0,0,...,0.000000,0.000000,48.063743,46.239183,42.143778,54.985549,1771.513514,274.857143,1138.776978,2706.721762
47,Local_or_Municipal,2021,2021_06_01,Solvang,102,9,845,4238,0,0,...,0.000000,0.000000,55.372549,56.444444,45.239053,58.503308,11763.750000,,885.052632,3218.320132
48,Local_or_Municipal,2021,2021_06_01,Watsonville,853,42,18695,11481,0,0,...,0.000000,0.000000,55.003525,50.190476,42.829374,55.534553,855.678571,416.666667,834.768519,3666.881647
49,Local_or_Municipal,2021,2021_06_01,Whittier,3963,214,76334,26477,0,0,...,0.000013,0.000076,50.288709,44.990654,45.660623,53.041180,2588.484375,850.000000,899.590595,2088.732673


In [96]:
# 2. find the rows where turnout is 0 for every ethnic group
# zero turnout may mean that the selected date was not an election date for that city
no_voter_turnout = voter_turnout_merge[(voter_turnout_merge['perc_turnout_asian'] == 0) &
                                       (voter_turnout_merge['perc_turnout_black'] == 0) &
                                       (voter_turnout_merge['perc_turnout_hispanic'] == 0) &
                                       (voter_turnout_merge['perc_turnout_white'] == 0)]

no_voter_turnout[['elec_type_', 'elec_date_', 'Residence_Addresses_City_']]

Unnamed: 0,elec_type_,elec_date_,Residence_Addresses_City_
0,Local_or_Municipal,2021_08_31,Albany
1,Local_or_Municipal,2021_08_31,Alhambra
2,Local_or_Municipal,2021_08_31,Anaheim
3,Local_or_Municipal,2021_08_31,Bellflower
4,Local_or_Municipal,2021_08_31,Berkeley
...,...,...,...
44,Local_or_Municipal,2021_06_01,Santa Ana
45,Local_or_Municipal,2021_06_01,Santa Clarita
46,Local_or_Municipal,2021_06_01,Santa Rosa
47,Local_or_Municipal,2021_06_01,Solvang
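
One possible answer to the open question about city-specific election dates is to keep, for each city, only the election columns in which enough residents voted, and then take the most recent of those. The sketch below runs on toy data. The helper `recent_dates_per_city` and its `min_votes` threshold are hypothetical, and treating recorded votes as evidence that a city held an election on that date is an assumption.

```python
import pandas as pd

def recent_dates_per_city(df, prefix, n=4, min_votes=1,
                          city_col='Residence_Addresses_City'):
    """For each city, return up to n most recent columns with at least min_votes 'Y' values."""
    cols = [c for c in df.columns if c.startswith(prefix + '_')]
    # count 'Y' votes per city and election column
    votes = df[cols].eq('Y').groupby(df[city_col]).sum()
    result = {}
    for city, row in votes.iterrows():
        # names end in YYYY_MM_DD, so a reverse string sort is newest first
        active = sorted((c for c in cols if row[c] >= min_votes), reverse=True)
        result[city] = active[:n]
    return result

toy = pd.DataFrame({
    'Residence_Addresses_City': ['Albany', 'Albany', 'Whittier'],
    'Local_or_Municipal_2021_08_31': [None, None, 'Y'],
    'Local_or_Municipal_2021_06_08': ['Y', None, None],
    'Local_or_Municipal_2020_03_03': ['Y', 'Y', None],
})
per_city = recent_dates_per_city(toy, 'Local_or_Municipal', n=2)
print(per_city)
```

A `min_votes` value above 1 may be worth considering, since the output above shows near-zero but non-zero turnout for Whittier on 2021_06_01.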


In [98]:
voter_turnout_merge.to_csv('../data/voter_turnout_age_donations_CA.csv', index=False)

In [99]:
del voter_turnout_merge
gc.collect()

6631

In case we want to replicate other columns found in the "Colorado Sample Output", below are possible steps.

- mean_pop_income: no reported income column is available, and ca-cities contains only the median value

- mean_pop_age: ca-cities contains only median income and median age. To calculate mean age we can use 'Voters_Age' in the Demographic Data.

- count_college_edu: ca-cities contains 'education_college_or_above', but it is unclear why it holds decimal values. We can use 'CommercialData_Education' in the Demographic Data instead.

- count_donated_once: donations are only available as an "integer representing total number of federal donations made over the last four election cycles" in the "FECDonors_NumberOfDonations" column

- mean_donation_amount: similarly, 'FECDonors_AvgDonation' also covers the last four election cycles
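
The voter-file alternatives listed above could be computed along these lines. This is a sketch on toy data. Two things are assumptions: that a 'CommercialData_Education' value mentioning 'Bach' or 'Grad' marks a college degree, and that count_donated_once means "at least one donation".

```python
import pandas as pd

toy = pd.DataFrame({
    'Residence_Addresses_City': ['Albany', 'Albany', 'Albany'],
    'Voters_Age': [30, 50, 40],
    'CommercialData_Education': ['HS Diploma - Extremely Likely',
                                 'Bach Degree - Likely', None],
    'FECDonors_NumberOfDonations': [None, 3, 1],
})

# assumption: these substrings mark a college degree in CommercialData_Education
college = toy['CommercialData_Education'].str.contains('Bach|Grad', na=False)
# assumption: "donated once" means at least one federal donation on record
donated = pd.to_numeric(toy['FECDonors_NumberOfDonations'], errors='coerce').ge(1)

city_stats = (
    toy.assign(college=college, donated=donated)
       .groupby('Residence_Addresses_City')
       .agg(mean_pop_age=('Voters_Age', 'mean'),
            count_college_edu=('college', 'sum'),
            count_donated_once=('donated', 'sum'))
       .reset_index()
)
print(city_stats)
```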


In [100]:
# ca-cities reports medians only; none of its columns is a mean
ca_cities = pd.read_csv('ca-cities.csv', usecols=['city', 'income_individual_median', 'age_median', 'education_college_or_above'])
ca_cities.head()

FileNotFoundError: [Errno 2] No such file or directory: 'ca-cities.csv'