# 1. Data Loading

1. The DEMOGRAPHIC_selected_cols parquet file contains all rows with the selected columns (voter ID, city, county, ethnicity, age, gender, education, income, donations, and party description). See Reduce_to_parquet.ipynb.
2. The VOTEHISTORY_selected_cols parquet file contains all rows with the selected columns (General and Local_or_Municipal elections). See Reduce_to_parquet.ipynb.
3. The GE_LM_dates_per_city parquet file contains the four most recent General elections and the four most recent Local_or_Municipal elections for each selected city. See Find_recent_election_dates.ipynb and ca_similarity_search.ipynb.

Some observations:
- For each of the 7 RCV cities in CA, we choose the 5 non-RCV cities with the highest cosine similarity score (35 picks in total)
- Those 35 picks contain 33 distinct cities
- 66 of the 21.7 million voters have no registration date (non-registered voters)
- The sampled cities contain a total of 3.9 million voters after filtering
- The city 'El Paso de Robles' did not match any city name in the demographic data
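
The selection rule in the first observation can be sketched as below. This is only an illustration: the real feature vectors live in ca_similarity_search.ipynb and are not shown here, so the city names, the three-element vectors, and the helper `top_k_similar` are all hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_similar(target, candidates, k=5):
    """Return the k candidate names ranked by cosine similarity to target."""
    ranked = sorted(candidates,
                    key=lambda name: cosine_similarity(target, candidates[name]),
                    reverse=True)
    return ranked[:k]

# Hypothetical feature vectors (illustrative only, not taken from the data)
rcv_city = [0.6, 0.3, 0.1]
non_rcv = {
    'City A': [0.6, 0.3, 0.1],
    'City B': [0.1, 0.1, 0.8],
    'City C': [0.5, 0.4, 0.1],
}
print(top_k_similar(rcv_city, non_rcv, k=2))
```

Because the same non-RCV city can rank in the top five for more than one RCV city, the 7 × 5 = 35 picks can collapse to fewer distinct cities, which is why only 33 remain here.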


In [1]:
import pandas as pd
import janitor
import gc
import time
start_time = time.time()

In [2]:
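def combine_cities_list(RCV_list, NonRCV_list):
    """Print summary counts and return the RCV cities plus the distinct non-RCV cities."""
    # NOTE: despite its label, this line reports the number of RCV cities only
    print("total number of cities:", len(RCV_list))
    print("number of distinct cities:", len(set(NonRCV_list)))
    print("name of cities that were duplicated:", {x for x in NonRCV_list if NonRCV_list.count(x) > 1})

    # set() removes the duplicated non-RCV cities; the RCV list is kept as given
    combined_cityName = RCV_list + list(set(NonRCV_list))
    print("number of distinct RCV and sampled nonRCV cities:", len(combined_cityName))
    return combined_cityName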


### California

In [4]:
# ------ California -------

## Change the filepath as required; we selected the folder with the latest date

filepath = '../data/VM2--CA--2022-04-25/'
DEMO_filename = 'VM2--CA--2022-04-25-DEMOGRAPHIC_selected_cols.parquet'
VOTE_filename = 'VM2--CA--2022-04-25-VOTEHISTORY_selected_cols.parquet'
elec_dates_filename = 'GE_LM_dates_per_city_CA.parquet'

# 1. List of RCV and non-RCV cities 

RCV_cities_CA = ['San Francisco',
 'Oakland',
 'Berkeley',
 'San Leandro',
 'Palm Desert',
 'Eureka',
 'Albany']

sampled_nonRCV_cities_CA = ['Fresno',
 'San Diego',
 'Sacramento',
 'Riverside',
 'San Jose',
 'Santa Ana',
 'Anaheim',
 'Santa Rosa',
 'Merced',
 'Santa Clarita',
 'Alhambra',
 'Davis',
 'Montebello',
 'Burbank',
 'Huntington Park',
 'Bellflower',
 'Watsonville',
 'Gilroy',
 'Whittier',
 'Lynwood',
 'Lakewood',
 'Pico Rivera',
 'Lake Forest',
 'Livermore',
 'Chino Hills',
 'Paramount',
 'El Paso de Robles',
 'Pico Rivera',
 'Buena Park',
 'Whittier',
 'Calabasas',
 'Carpinteria',
 'Morro Bay',
 'San Carlos',
 'Solvang']

combined_sampled_cityName = combine_cities_list(RCV_list= RCV_cities_CA, NonRCV_list = sampled_nonRCV_cities_CA)
# ---------------------

total number of cities: 7
number of distinct cities: 33
name of cities that were duplicated: {'Whittier', 'Pico Rivera'}
number of distinct RCV and sampled nonRCV cities: 40
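
`set()` removes the duplicates but does not guarantee any particular order, so the order of the combined list can differ between runs. If a stable order matters downstream, one option is an order-preserving de-duplication. The sketch below uses a hypothetical helper, `dedupe_keep_order`, on a short excerpt of the city list.

```python
def dedupe_keep_order(items):
    """Remove duplicates while keeping the first occurrence of each item."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(items))

sample = ['Whittier', 'Lynwood', 'Pico Rivera', 'Paramount', 'Pico Rivera', 'Whittier']
distinct = dedupe_keep_order(sample)
duplicated = sorted({x for x in sample if sample.count(x) > 1})

print(distinct)
print(duplicated)
```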


### Utah

In [6]:
# # ------ Utah -------
# filepath = '../data/VM2--UT--2022-03-30/'
# DEMO_filename = 'VM2--UT--2022-03-30-DEMOGRAPHIC_selected_cols.parquet'
# VOTE_filename = 'VM2--UT--2022-03-30-VOTEHISTORY_selected_cols.parquet'
# elec_dates_filename = 'GE_LM_dates_per_city_UT.parquet'


# ##1. List of RCV and non-RCV cities 

# RCV_cities_UT = ['Salt Lake City', 'Sandy', 'Lehi', 'Millcreek', 
#                  'Draper', 'Riverton',  'Cottonwood Heights', 
#                  'Springville', 'Midvale', 'Magna', 'South Salt Lake', 
#                  'Payson', 'Bluffdale']

# sampled_nonRCV_cities_UT = ['Ogden', 'Provo', 'West Valley City', 
#                             'Logan', 'St. George', 'Taylorsville', 
#                             'Layton', 'Orem', 'South Jordan', 'Murray', 
#                             'South Jordan', 'Clearfield', 'Spanish Fork', 
#                             'Tooele', 'Kearns', 'Cedar City', 'Murray', 
#                             'Bountiful',  'South Jordan', 'Pleasant Grove', 
#                             'Vernal', 'Hurricane', 'Herriman', 'American Fork', 
#                             'Washington', 'Eagle Mountain', 'Brigham City', 
#                             'American Fork', 'Herriman', 'Spanish Fork', 
#                             'Washington', 'Heber', 'Hurricane', 'Vernal', 
#                             'Holladay', 'Pleasant Grove', 'American Fork', 
#                             'Herriman', 'Eagle Mountain', 'Vernal', 
#                             'Bountiful', 'Pleasant Grove', 'Washington', 
#                             'South Jordan', 'Vernal', 'Tooele', 
#                             'Spanish Fork', 'Clearfield', 'Kearns', 
#                             'Eagle Mountain', 'Washington', 'Bountiful', 
#                             'Pleasant Grove', 'Hurricane', 'Cedar City', 
#                             'Saratoga Springs', 'Kaysville', 'Brigham City', 
#                             'North Salt Lake', 'American Fork', 'Highland', 
#                             'Lindon', 'Alpine', 'West Haven', 'North Logan']

# combined_sampled_cityName = combine_cities_list(RCV_list= RCV_cities_UT, NonRCV_list = sampled_nonRCV_cities_UT)
# # ---------------------


# 1.1 Demographic Data

1. Select only the required columns: city name ('Residence_Addresses_City'), unique voter ID ('LALVOTERID'), voter's ethnicity ('EthnicGroups_EthnicGroup1Desc'), voter registration date ('Voters_OfficialRegDate'), voter's gender, date of birth, plus additional columns
2. Keep only the cities identified as similar to RCV cities in CA (see ca_similarity_search.ipynb for reference)
3. Keep only rows where EthnicGroups_EthnicGroup1Desc is "European", "Likely African-American", "Hispanic and Portuguese", or "East and South Asian"
4. Keep only registered voters, i.e. rows with a non-null 'Voters_OfficialRegDate'


In [7]:
def read_DEMOGRAPHIC():
    df_demographic = pd.read_parquet(f'{filepath}{DEMO_filename}')
    print("Total number of unique cities:", df_demographic.Residence_Addresses_City.nunique())
    print("Total number of unique voters:", df_demographic.LALVOTERID.nunique())
    print("Count of non-registered voters:", len(df_demographic[df_demographic['Voters_OfficialRegDate'].isnull()]))
    
    print("Number of expected cities:", len(combined_sampled_cityName))
    missing_cities = [city for city in combined_sampled_cityName if city not in df_demographic['Residence_Addresses_City'].unique()]
    if len(missing_cities) > 0:
        print("number of cities not found in demographic data:", len(missing_cities))
        print(missing_cities)
        
    return df_demographic
        
state_demographic = read_DEMOGRAPHIC()

Total number of unique cities: 1533
Total number of unique voters: 21711617
Count of non-registered voters: 66
Number of expected cities: 40
number of cities not found in demographic data: 1
['El Paso de Robles']


In [8]:
state_demographic.head(5)

Unnamed: 0,LALVOTERID,Residence_Addresses_City,Voters_Gender,Voters_Age,Voters_BirthDate,Parties_Description,EthnicGroups_EthnicGroup1Desc,Voters_OfficialRegDate,County,CommercialData_Education,CommercialData_EstimatedHHIncome,CommercialData_EstimatedHHIncomeAmount,FECDonors_NumberOfDonations,FECDonors_TotalDonationsAmount
0,LALCA453164106,Oakland,F,29,04/29/1993,Democratic,Other,06/18/2021,ALAMEDA,,,,,
1,LALCA453008306,Oakland,F,26,02/02/1996,Non-Partisan,Likely African-American,04/01/2021,ALAMEDA,,,,,
2,LALCA22129469,Oakland,F,47,02/02/1975,Democratic,European,11/16/2021,ALAMEDA,HS Diploma - Extremely Likely,,,,
3,LALCA549803906,Oakland,M,60,02/09/1962,Democratic,Other,02/07/2022,ALAMEDA,,,,,
4,LALCA24729024,San Leandro,F,56,01/01/1966,Democratic,European,02/28/2016,ALAMEDA,HS Diploma - Extremely Likely,,,,


### California

In [9]:
# ----- California ----- 
combined_sampled_cityName = list(map(lambda x: x.replace('El Paso de Robles', 'Paso Robles'), combined_sampled_cityName))
print("number of expected cities:", len(combined_sampled_cityName))
# ----------------------

number of expected cities: 40
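
`str.replace` performs substring replacement, so it would also alter any longer name that happens to contain the same text. A minimal sketch of an exact-match alternative is shown below; the names `CITY_ALIASES` and `normalize_city_names` are hypothetical, and the two aliases are the ones used in this notebook for California and Utah.

```python
# Map the name used in the city lists to the name used in the demographic data
CITY_ALIASES = {
    'El Paso de Robles': 'Paso Robles',
    'St. George': 'Saint George',
}

def normalize_city_names(names, aliases=CITY_ALIASES):
    """Replace a name only when it matches an alias key exactly."""
    return [aliases.get(name, name) for name in names]

print(normalize_city_names(['Oakland', 'El Paso de Robles']))
```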


### Utah

In [10]:
# # ----- Utah ----- 
# combined_sampled_cityName = list(map(lambda x: x.replace('St. George', 'Saint George'), combined_sampled_cityName))
# print("number of expected cities:", len(combined_sampled_cityName))
# # ----------------------


In [11]:
# 2. filter DEMOGRAPHIC data based on the list of cities, ethnicities and registered voters

selected_ethnicities = ['European', 'Likely African-American','Hispanic and Portuguese', 'East and South Asian']

def filter_demo(df, list_cityNames):
    filtered_df = df[df['Residence_Addresses_City'].isin(list_cityNames) &
            df['EthnicGroups_EthnicGroup1Desc'].isin(selected_ethnicities) &
            df['Voters_OfficialRegDate'].notnull()]
    #[['LALVOTERID', 'Residence_Addresses_City']]
    
    print(filtered_df.shape)
    print("number of unique cities:", filtered_df.Residence_Addresses_City.nunique())
    
    return filtered_df

state_demographic_subset = filter_demo(df = state_demographic, list_cityNames = combined_sampled_cityName)
state_demographic_subset.head()

(3944492, 14)
number of unique cities: 40


Unnamed: 0,LALVOTERID,Residence_Addresses_City,Voters_Gender,Voters_Age,Voters_BirthDate,Parties_Description,EthnicGroups_EthnicGroup1Desc,Voters_OfficialRegDate,County,CommercialData_Education,CommercialData_EstimatedHHIncome,CommercialData_EstimatedHHIncomeAmount,FECDonors_NumberOfDonations,FECDonors_TotalDonationsAmount
1,LALCA453008306,Oakland,F,26,02/02/1996,Non-Partisan,Likely African-American,04/01/2021,ALAMEDA,,,,,
2,LALCA22129469,Oakland,F,47,02/02/1975,Democratic,European,11/16/2021,ALAMEDA,HS Diploma - Extremely Likely,,,,
4,LALCA24729024,San Leandro,F,56,01/01/1966,Democratic,European,02/28/2016,ALAMEDA,HS Diploma - Extremely Likely,,,,
6,LALCA22466723,Livermore,F,38,05/29/1984,Republican,European,11/01/2021,ALAMEDA,,,,,
7,LALCA22466636,Livermore,M,63,06/28/1959,Democratic,European,12/07/2021,ALAMEDA,,,,,


In [12]:
del state_demographic
gc.collect()

20

# 1.2 Vote History

1. Select only the columns for the 4 most recent General elections and the 4 most recent Local_or_Municipal elections, plus EthnicGroups_EthnicGroup1Desc
2. Load Vote History
3. Merge Vote History with the sampled Demographic Data


## 1. Get four most recent election dates

In [13]:
# load the list of election dates for each city
GE_LM_dates_dict = pd.read_parquet(f'{filepath}{elec_dates_filename}')
GE_LM_dates_dict

Unnamed: 0,city,GE_dates,LM_dates
0,Oakland,"[General_2020_11_03, General_2018_11_06, Gener...","[Consolidated_General_2021_11_02, Local_or_Mun..."
1,San Leandro,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2020_08_03, Consolidated_G..."
2,Livermore,"[General_2020_11_03, General_2018_11_06, Gener...","[Consolidated_General_2021_11_02, Local_or_Mun..."
3,Berkeley,"[General_2020_11_03, General_2018_11_06, Gener...","[Consolidated_General_2021_11_02, Local_or_Mun..."
4,Albany,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2021_03_02, Local_or_Munic..."
5,San Francisco,"[General_2020_11_03, General_2018_11_06, Gener...","[Consolidated_General_2021_11_02, Local_or_Mun..."
6,San Diego,"[General_2020_11_03, General_2018_11_06, Gener...","[Consolidated_General_2021_11_02, Local_or_Mun..."
7,San Jose,"[General_2020_11_03, General_2018_11_06, Gener...","[Consolidated_General_2021_11_02, Local_or_Mun..."
8,Fresno,"[General_2020_11_03, General_2018_11_06, Gener...","[Consolidated_General_2021_11_02, Local_or_Mun..."
9,Eureka,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2021_06_08, Local_or_Munic..."


In [14]:
e_dates = set()
for v in GE_LM_dates_dict['GE_dates']:
    for vv in v:
        e_dates.add(vv)
for v in GE_LM_dates_dict['LM_dates'] :
    for vv in v:
        e_dates.add(vv)
        
print(list(e_dates))
## when fewer than four dates are found for a city, e_dates contains None, which must be removed
if None in e_dates:
    e_dates.remove(None)
    print(list(e_dates))

['Local_or_Municipal_2019_04_16', 'Local_or_Municipal_2021_07_20', 'Local_or_Municipal_2020_08_03', 'Local_or_Municipal_2020_04_14', 'Local_or_Municipal_2019_06_04', 'General_2014_11_04', 'General_2016_11_08', 'Consolidated_General_2021_11_02', 'Local_or_Municipal_2019_08_27', 'Local_or_Municipal_2021_06_08', 'General_2018_11_06', 'Local_or_Municipal_2019_03_05', 'General_2020_11_03', 'Local_or_Municipal_2021_03_02', 'Consolidated_General_2017_11_07', 'Consolidated_General_2019_11_05', 'Local_or_Municipal_2021_06_01', 'Local_or_Municipal_2021_05_11', 'Local_or_Municipal_2021_04_20', 'Local_or_Municipal_2019_08_13']


In [15]:
# needed to filter out rows after aggregation
def get_correct_dates(list_like_df):
    print("Shape before reshaping:",list_like_df.shape)
    list_like_df = list_like_df.explode(['GE_dates', 'LM_dates']).melt(id_vars=["city"], 
                                                                       var_name="Date", 
                                                                       value_name="Value")
    list_like_df = list_like_df.drop(columns = 'Date')
    list_like_df.columns = ['Residence_Addresses_City', 'elec_type_date']                          
    list_like_df['elec_date'] = list_like_df['elec_type_date'].str[-10:]
    list_like_df['elec_year'] = list_like_df['elec_type_date'].str[-10:-6]
    list_like_df['elec_type'] = list_like_df['elec_type_date'].str[:-11]                    
    list_like_df = list_like_df.drop(columns = 'elec_type_date')
    print("Shape after reshaping:",list_like_df.shape)
    return list_like_df

GE_LM_dates_df = get_correct_dates(GE_LM_dates_dict)
GE_LM_dates_df.head()

Shape before reshaping: (40, 3)
Shape after reshaping: (320, 4)


Unnamed: 0,Residence_Addresses_City,elec_date,elec_year,elec_type
0,Oakland,2020_11_03,2020,General
1,Oakland,2018_11_06,2018,General
2,Oakland,2016_11_08,2016,General
3,Oakland,2014_11_04,2014,General
4,San Leandro,2020_11_03,2020,General
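
The fixed-width slicing above (`str[-10:]`, `str[:-11]`) relies on every column name ending in `_YYYY_MM_DD`. As a hedged alternative, the same three fields can be extracted with a regular expression that fails loudly on an unexpected name. The helper `parse_election_column` is hypothetical and not part of the notebook.

```python
import re

# <election type>_<YYYY>_<MM>_<DD>, e.g. Consolidated_General_2021_11_02
ELECTION_COL = re.compile(
    r'^(?P<elec_type>.+)_(?P<elec_date>(?P<elec_year>\d{4})_\d{2}_\d{2})$'
)

def parse_election_column(name):
    """Split an election column name into (elec_type, elec_year, elec_date)."""
    match = ELECTION_COL.match(name)
    if match is None:
        raise ValueError(f'unexpected election column name: {name!r}')
    return match.group('elec_type'), match.group('elec_year'), match.group('elec_date')

print(parse_election_column('General_2020_11_03'))
print(parse_election_column('Consolidated_General_2021_11_02'))
```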


## 2. Load the Vote History data for selected election dates only

In [16]:
needed_variables = ['LALVOTERID'] + list(e_dates)

state_voterhistory_4_dates = pd.read_parquet(f'{filepath}{VOTE_filename}',
                                             columns=needed_variables)
                                
state_voterhistory_4_dates.head(5)

Unnamed: 0,LALVOTERID,Local_or_Municipal_2019_04_16,Local_or_Municipal_2021_07_20,Local_or_Municipal_2020_08_03,Local_or_Municipal_2020_04_14,Local_or_Municipal_2019_06_04,General_2014_11_04,General_2016_11_08,Consolidated_General_2021_11_02,Local_or_Municipal_2019_08_27,...,General_2018_11_06,Local_or_Municipal_2019_03_05,General_2020_11_03,Local_or_Municipal_2021_03_02,Consolidated_General_2017_11_07,Consolidated_General_2019_11_05,Local_or_Municipal_2021_06_01,Local_or_Municipal_2021_05_11,Local_or_Municipal_2021_04_20,Local_or_Municipal_2019_08_13
0,LALCA453164106,,,,,,,Y,,,...,Y,,Y,,,,,,,
1,LALCA453008306,,,,,,,,,,...,Y,,,,,,,,,
2,LALCA22129469,,,,,,Y,Y,,,...,Y,,Y,,,,,,,
3,LALCA549803906,,,,,,,,,,...,,,Y,,,,,,,
4,LALCA24729024,,,,,,,,,,...,,,,,,,,,,


## 3. Merge Vote History and Demographic Data

In [17]:
merged_file = pd.merge(state_voterhistory_4_dates, state_demographic_subset,
                       how='inner', left_on='LALVOTERID', right_on='LALVOTERID')

print(merged_file.shape)

print("number of unique cities:", merged_file.Residence_Addresses_City.nunique())

merged_file.head(5)

(3944492, 34)
number of unique cities: 40


Unnamed: 0,LALVOTERID,Local_or_Municipal_2019_04_16,Local_or_Municipal_2021_07_20,Local_or_Municipal_2020_08_03,Local_or_Municipal_2020_04_14,Local_or_Municipal_2019_06_04,General_2014_11_04,General_2016_11_08,Consolidated_General_2021_11_02,Local_or_Municipal_2019_08_27,...,Voters_BirthDate,Parties_Description,EthnicGroups_EthnicGroup1Desc,Voters_OfficialRegDate,County,CommercialData_Education,CommercialData_EstimatedHHIncome,CommercialData_EstimatedHHIncomeAmount,FECDonors_NumberOfDonations,FECDonors_TotalDonationsAmount
0,LALCA453008306,,,,,,,,,,...,02/02/1996,Non-Partisan,Likely African-American,04/01/2021,ALAMEDA,,,,,
1,LALCA22129469,,,,,,Y,Y,,,...,02/02/1975,Democratic,European,11/16/2021,ALAMEDA,HS Diploma - Extremely Likely,,,,
2,LALCA24729024,,,,,,,,,,...,01/01/1966,Democratic,European,02/28/2016,ALAMEDA,HS Diploma - Extremely Likely,,,,
3,LALCA22466723,,,,,,,,,,...,05/29/1984,Republican,European,11/01/2021,ALAMEDA,,,,,
4,LALCA22466636,,,,,,Y,Y,,,...,06/28/1959,Democratic,European,12/07/2021,ALAMEDA,,,,,
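
The merged frame has the same number of rows as the demographic subset (3944492), which suggests every sampled voter matched exactly one vote-history row. A small sketch of how that assumption could be checked explicitly with the `validate` and `indicator` options of `pd.merge` follows; the three-row frames are made-up toy data.

```python
import pandas as pd

# Toy stand-ins for the vote-history and demographic frames
votes = pd.DataFrame({'LALVOTERID': ['A1', 'A2', 'A3'],
                      'General_2020_11_03': ['Y', None, 'Y']})
demo = pd.DataFrame({'LALVOTERID': ['A2', 'A3'],
                     'Residence_Addresses_City': ['Oakland', 'Albany']})

# validate raises MergeError if LALVOTERID is duplicated on either side;
# indicator adds a _merge column saying where each row came from
checked = pd.merge(votes, demo, how='outer', on='LALVOTERID',
                   validate='one_to_one', indicator=True)
print(checked['_merge'].value_counts())

# Rows present in both frames are what the inner join would keep
merged = checked[checked['_merge'] == 'both'].drop(columns='_merge')
print(merged)
```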


In [18]:
merge_filename = DEMO_filename.replace('DEMOGRAPHIC_selected_cols.parquet', 'merged.parquet')
print(merge_filename)
merged_file.to_parquet(f'{filepath}{merge_filename}')

VM2--CA--2022-04-25-merged.parquet


# 3.1. Calculate voter turnout per ethnicity

In [19]:
import pandas as pd
merge_filename = DEMO_filename.replace('DEMOGRAPHIC_selected_cols.parquet', 'merged.parquet')
merged_file = pd.read_parquet(f'{filepath}{merge_filename}')

In [20]:
def replace_ethnicities(df):
    df = df.replace('East and South Asian', 'asian')
    df = df.replace('European', 'white')
    df = df.replace('Hispanic and Portuguese', 'hispanic')
    df = df.replace('Likely African-American', 'black')
    return df

In [21]:
merged_file = replace_ethnicities(merged_file)
merged_file.head()

Unnamed: 0,LALVOTERID,Local_or_Municipal_2019_04_16,Local_or_Municipal_2021_07_20,Local_or_Municipal_2020_08_03,Local_or_Municipal_2020_04_14,Local_or_Municipal_2019_06_04,General_2014_11_04,General_2016_11_08,Consolidated_General_2021_11_02,Local_or_Municipal_2019_08_27,...,Voters_BirthDate,Parties_Description,EthnicGroups_EthnicGroup1Desc,Voters_OfficialRegDate,County,CommercialData_Education,CommercialData_EstimatedHHIncome,CommercialData_EstimatedHHIncomeAmount,FECDonors_NumberOfDonations,FECDonors_TotalDonationsAmount
0,LALCA453008306,,,,,,,,,,...,02/02/1996,Non-Partisan,black,04/01/2021,ALAMEDA,,,,,
1,LALCA22129469,,,,,,Y,Y,,,...,02/02/1975,Democratic,white,11/16/2021,ALAMEDA,HS Diploma - Extremely Likely,,,,
2,LALCA24729024,,,,,,,,,,...,01/01/1966,Democratic,white,02/28/2016,ALAMEDA,HS Diploma - Extremely Likely,,,,
3,LALCA22466723,,,,,,,,,,...,05/29/1984,Republican,white,11/01/2021,ALAMEDA,,,,,
4,LALCA22466636,,,,,,Y,Y,,,...,06/28/1959,Democratic,white,12/07/2021,ALAMEDA,,,,,


In [22]:
GE_cols = [col for col in merged_file.columns if col.startswith('General')]
print(GE_cols)
LM_cols = [col for col in merged_file.columns if col.startswith('Local_or_Municipal') \
           or col.startswith('Consolidated_General')]
print(LM_cols)

['General_2014_11_04', 'General_2016_11_08', 'General_2018_11_06', 'General_2020_11_03']
['Local_or_Municipal_2019_04_16', 'Local_or_Municipal_2021_07_20', 'Local_or_Municipal_2020_08_03', 'Local_or_Municipal_2020_04_14', 'Local_or_Municipal_2019_06_04', 'Consolidated_General_2021_11_02', 'Local_or_Municipal_2019_08_27', 'Local_or_Municipal_2021_06_08', 'Local_or_Municipal_2019_03_05', 'Local_or_Municipal_2021_03_02', 'Consolidated_General_2017_11_07', 'Consolidated_General_2019_11_05', 'Local_or_Municipal_2021_06_01', 'Local_or_Municipal_2021_05_11', 'Local_or_Municipal_2021_04_20', 'Local_or_Municipal_2019_08_13']


In [23]:
# fill NA values with "N" to make it easier to compare with "Y"
merged_file[GE_cols+LM_cols] = merged_file[GE_cols+LM_cols].fillna('N')
merged_file.head()

Unnamed: 0,LALVOTERID,Local_or_Municipal_2019_04_16,Local_or_Municipal_2021_07_20,Local_or_Municipal_2020_08_03,Local_or_Municipal_2020_04_14,Local_or_Municipal_2019_06_04,General_2014_11_04,General_2016_11_08,Consolidated_General_2021_11_02,Local_or_Municipal_2019_08_27,...,Voters_BirthDate,Parties_Description,EthnicGroups_EthnicGroup1Desc,Voters_OfficialRegDate,County,CommercialData_Education,CommercialData_EstimatedHHIncome,CommercialData_EstimatedHHIncomeAmount,FECDonors_NumberOfDonations,FECDonors_TotalDonationsAmount
0,LALCA453008306,N,N,N,N,N,N,N,N,N,...,02/02/1996,Non-Partisan,black,04/01/2021,ALAMEDA,,,,,
1,LALCA22129469,N,N,N,N,N,Y,Y,N,N,...,02/02/1975,Democratic,white,11/16/2021,ALAMEDA,HS Diploma - Extremely Likely,,,,
2,LALCA24729024,N,N,N,N,N,N,N,N,N,...,01/01/1966,Democratic,white,02/28/2016,ALAMEDA,HS Diploma - Extremely Likely,,,,
3,LALCA22466723,N,N,N,N,N,N,N,N,N,...,05/29/1984,Republican,white,11/01/2021,ALAMEDA,,,,,
4,LALCA22466636,N,N,N,N,N,Y,Y,N,N,...,06/28/1959,Democratic,white,12/07/2021,ALAMEDA,,,,,


In [24]:
# We created the dataframe below in order to easily calculate perc_turnout when no one voted
list_ethnic_city = merged_file[['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc']].drop_duplicates()
list_ethnic_city_No = list_ethnic_city.copy()
list_ethnic_city_No['voted'] = 'N'
list_ethnic_city_Yes = list_ethnic_city.copy()
list_ethnic_city_Yes['voted'] = 'Y'
list_ethnic_city = pd.concat([list_ethnic_city_No, list_ethnic_city_Yes])

In [25]:
list_ethnic_city

Unnamed: 0,Residence_Addresses_City,EthnicGroups_EthnicGroup1Desc,voted
0,Oakland,black,N
1,Oakland,white,N
2,San Leandro,white,N
3,Livermore,white,N
7,Oakland,asian,N
...,...,...,...
3802766,Santa Rosa,white,Y
3802769,Santa Rosa,hispanic,Y
3802772,Santa Rosa,asian,Y
3802827,Santa Rosa,black,Y


In [26]:
# we also need the total voters information per city and ethnicity
total_city_ethnic = merged_file.groupby(['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc']).size().reset_index()
total_city_ethnic.columns = ['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc', 'total_voters']
total_city_ethnic  = total_city_ethnic.merge(list_ethnic_city, on = ['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc'])

total_city_ethnic = replace_ethnicities(total_city_ethnic)
total_city_ethnic

Unnamed: 0,Residence_Addresses_City,EthnicGroups_EthnicGroup1Desc,total_voters,voted
0,Albany,asian,2405,N
1,Albany,asian,2405,Y
2,Albany,black,147,N
3,Albany,black,147,Y
4,Albany,hispanic,1035,N
...,...,...,...,...
315,Whittier,black,214,Y
316,Whittier,hispanic,76334,N
317,Whittier,hispanic,76334,Y
318,Whittier,white,26477,N


In [27]:
def calc_votes(df, col):
    voter_turnout_stats = df.groupby(['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc', col]).size().reset_index(name='voted_voters')

    # 'voted' is either 'Y' or 'N'
    voter_turnout_stats = voter_turnout_stats.rename(columns = {col: 'voted'})    

    voter_turnout_stats = total_city_ethnic.merge(voter_turnout_stats, 
                                                     how = 'left',
                                                     on = ['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc', 'voted']) 
    voter_turnout_stats['perc_turnout'] = voter_turnout_stats['voted_voters']/voter_turnout_stats['total_voters']

    voter_turnout_stats['elec_date'] = col[len(col)-10:]
    voter_turnout_stats['elec_year'] = col[len(col)-10:len(col)-6]
    voter_turnout_stats['elec_type'] = col[:len(col)-11]

    voter_turnout_stats[['voted_voters', 'perc_turnout']] = voter_turnout_stats[['voted_voters', 'perc_turnout']].fillna(0)
    voter_turnout_stats = voter_turnout_stats[voter_turnout_stats['voted'] == 'Y']    
    pivot_df = voter_turnout_stats.pivot(index = ['elec_type','elec_year', 'elec_date', 'Residence_Addresses_City'],
                                    columns='EthnicGroups_EthnicGroup1Desc', 
                                    values=['total_voters', 'voted_voters', 'perc_turnout']).reset_index()
    pivot_df.columns = pivot_df.columns.map('_'.join)
    pivot_df = pivot_df.rename(columns = {'elec_type_':'elec_type', 'elec_year_':'elec_year', 'elec_date_':'elec_date', 'Residence_Addresses_City_':'Residence_Addresses_City'})

    del voter_turnout_stats
    gc.collect()
    return pivot_df

elec_date_cols = GE_cols+LM_cols

for i in range(len(elec_date_cols)):
    col = elec_date_cols[i]
    pivot_df = calc_votes(merged_file, col)    
    # stack all types of election into one dataframe 
    if i == 0:
        voter_turnout_merge_ethnicity = pivot_df.copy() 
    else:
        voter_turnout_merge_ethnicity = pd.concat([voter_turnout_merge_ethnicity, pivot_df])


In [28]:
print(voter_turnout_merge_ethnicity.shape)
voter_turnout_merge_ethnicity.head()

(800, 16)


Unnamed: 0,elec_type,elec_year,elec_date,Residence_Addresses_City,total_voters_asian,total_voters_black,total_voters_hispanic,total_voters_white,voted_voters_asian,voted_voters_black,voted_voters_hispanic,voted_voters_white,perc_turnout_asian,perc_turnout_black,perc_turnout_hispanic,perc_turnout_white
0,General,2014,2014_11_04,Albany,2405.0,147.0,1035.0,6169.0,714.0,50.0,333.0,3162.0,0.296881,0.340136,0.321739,0.512563
1,General,2014,2014_11_04,Alhambra,17451.0,191.0,16596.0,7359.0,2962.0,34.0,3376.0,2294.0,0.169732,0.17801,0.203423,0.311727
2,General,2014,2014_11_04,Anaheim,26340.0,1211.0,70052.0,54644.0,5986.0,192.0,11115.0,19174.0,0.227259,0.158547,0.158668,0.350889
3,General,2014,2014_11_04,Bellflower,2153.0,3614.0,19899.0,10792.0,325.0,801.0,2469.0,2795.0,0.150952,0.221638,0.124077,0.258988
4,General,2014,2014_11_04,Berkeley,8549.0,5942.0,6388.0,39425.0,2449.0,2313.0,1805.0,19131.0,0.286466,0.389263,0.282561,0.48525
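
For a single election column, `calc_votes` counts the 'Y' rows per city and ethnicity and divides by the group size. A more compact way to get the same three quantities, assuming the column holds only 'Y' and 'N' after the `fillna('N')` step, is to aggregate a boolean "voted" flag. The toy frame below is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'Residence_Addresses_City': ['Albany', 'Albany', 'Albany', 'Oakland', 'Oakland'],
    'EthnicGroups_EthnicGroup1Desc': ['asian', 'asian', 'white', 'white', 'white'],
    'General_2020_11_03': ['Y', 'N', 'N', 'Y', 'Y'],
})

# True where the voter cast a ballot in this election
voted = toy['General_2020_11_03'].eq('Y')

turnout = (
    toy.assign(voted=voted)
       .groupby(['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc'])['voted']
       # size = group total, sum = number of True, mean = share of True
       .agg(total_voters='size', voted_voters='sum', perc_turnout='mean')
       .reset_index()
)
print(turnout)
```

Groups in which nobody voted get a turnout of 0 directly, so the helper frame of all city/ethnicity/'Y'-'N' combinations is not needed in this variant.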


In [29]:
print(voter_turnout_merge_ethnicity.shape)
# remove rows where the election date is not associated with the city
# this is needed only once: later merges use an inner join, so only the necessary city/election-date combinations carry through
voter_turnout_merge_ethnicity = GE_LM_dates_df.merge(voter_turnout_merge_ethnicity, 
                                how = 'left',
                                on = ['elec_type', 'elec_year', 'elec_date', 'Residence_Addresses_City'])
print(voter_turnout_merge_ethnicity.shape)

(800, 16)
(320, 16)


In [30]:
# should be an empty dataframe because of the way we have filtered the dataframe

no_voter_turnout = voter_turnout_merge_ethnicity[(voter_turnout_merge_ethnicity['perc_turnout_asian'] == 0) &
                                       (voter_turnout_merge_ethnicity['perc_turnout_black'] == 0) &
                                       (voter_turnout_merge_ethnicity['perc_turnout_hispanic'] == 0) &
                                       (voter_turnout_merge_ethnicity['perc_turnout_white'] == 0)]

no_voter_turnout.head()

Unnamed: 0,Residence_Addresses_City,elec_date,elec_year,elec_type,total_voters_asian,total_voters_black,total_voters_hispanic,total_voters_white,voted_voters_asian,voted_voters_black,voted_voters_hispanic,voted_voters_white,perc_turnout_asian,perc_turnout_black,perc_turnout_hispanic,perc_turnout_white


# 3.2. Calculate average donation

In [31]:
def calc_donation(df):
    donations_df = df[['Residence_Addresses_City', 'FECDonors_TotalDonationsAmount', 'FECDonors_NumberOfDonations']
                  + elec_date_cols]
    melt_donations_df = donations_df.melt(id_vars=['Residence_Addresses_City', 'FECDonors_TotalDonationsAmount', 'FECDonors_NumberOfDonations'], 
              value_vars=elec_date_cols,
              var_name='elec_type_date',
              value_name='voted')
    melt_donations_df = melt_donations_df[melt_donations_df['voted'] == 'Y']

    melt_donations_df = melt_donations_df.astype({'FECDonors_TotalDonationsAmount': float, 'FECDonors_NumberOfDonations': float})                        
    melt_donations_df = melt_donations_df.groupby(['Residence_Addresses_City', 'elec_type_date']).agg({'FECDonors_TotalDonationsAmount':'sum','FECDonors_NumberOfDonations':'sum'}).reset_index()    
    melt_donations_df['mean_donation'] = melt_donations_df['FECDonors_TotalDonationsAmount']/melt_donations_df['FECDonors_NumberOfDonations']
    melt_donations_df['elec_date'] = melt_donations_df['elec_type_date'].str[-10:]
    melt_donations_df['elec_year'] = melt_donations_df['elec_type_date'].str[-10:-6]
    melt_donations_df['elec_type'] = melt_donations_df['elec_type_date'].str[:-11]
    melt_donations_df = melt_donations_df.drop(columns = 'elec_type_date').reset_index(drop=True)
    
    return melt_donations_df

avg_donations = calc_donation(merged_file)
print(avg_donations.shape)
avg_donations.head()

(489, 7)


Unnamed: 0,Residence_Addresses_City,FECDonors_TotalDonationsAmount,FECDonors_NumberOfDonations,mean_donation,elec_date,elec_year,elec_type
0,Albany,3239.0,37.0,87.540541,2017_11_07,2017,Consolidated_General
1,Albany,44682.0,132.0,338.5,2019_11_05,2019,Consolidated_General
2,Albany,2433223.0,24639.0,98.754941,2014_11_04,2014,General
3,Albany,2803209.0,27551.0,101.74618,2016_11_08,2016,General
4,Albany,2753784.0,26649.0,103.33536,2018_11_06,2018,General
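
`mean_donation` is the summed donation amount divided by the summed number of donations among voters who voted in that election. As a quick arithmetic check on the first row above, and as a sketch of how a zero donation count could be handled, consider the hypothetical helper below. In pandas a float 0/0 yields NaN rather than an error, so the explicit guard is only one possible convention.

```python
def mean_donation(total_amount, n_donations):
    """Average amount per donation, or None when there were no donations."""
    if not n_donations:
        return None
    return total_amount / n_donations

# First row of the table above: Albany, Consolidated_General, 2017_11_07
print(round(mean_donation(3239.0, 37.0), 6))
print(mean_donation(0.0, 0.0))
```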


In [32]:
# Merge 3.1 and 3.2
voter_turnout_merge = voter_turnout_merge_ethnicity.merge(avg_donations, 
                                                          how = 'inner',
                                                          on = ['elec_type', 'elec_year', 'elec_date', 'Residence_Addresses_City'])

voter_turnout_merge.head()

Unnamed: 0,Residence_Addresses_City,elec_date,elec_year,elec_type,total_voters_asian,total_voters_black,total_voters_hispanic,total_voters_white,voted_voters_asian,voted_voters_black,voted_voters_hispanic,voted_voters_white,perc_turnout_asian,perc_turnout_black,perc_turnout_hispanic,perc_turnout_white,FECDonors_TotalDonationsAmount,FECDonors_NumberOfDonations,mean_donation
0,Oakland,2020_11_03,2020,General,30600.0,61476.0,37174.0,83122.0,23041.0,45891.0,26954.0,69989.0,0.752974,0.746486,0.725077,0.842003,42110405.0,386556.0,108.937398
1,Oakland,2018_11_06,2018,General,30600.0,61476.0,37174.0,83122.0,14972.0,35012.0,17857.0,57872.0,0.489281,0.569523,0.480363,0.69623,40806061.0,378375.0,107.845553
2,Oakland,2016_11_08,2016,General,30600.0,61476.0,37174.0,83122.0,16057.0,37256.0,19792.0,57968.0,0.524739,0.606025,0.532415,0.697385,40282807.0,374104.0,107.678098
3,Oakland,2014_11_04,2014,General,30600.0,61476.0,37174.0,83122.0,8145.0,21265.0,8235.0,35411.0,0.266176,0.345907,0.221526,0.426012,35190358.0,332135.0,105.951971
4,San Leandro,2020_11_03,2020,General,12705.0,5596.0,16028.0,17780.0,9229.0,4299.0,11984.0,14638.0,0.726407,0.768227,0.747692,0.823285,1835380.0,30062.0,61.053157


# 3.3. Calculate voter turnout per income

In [33]:
# percent missing values for income
print('Percent of rows with missing value for income:',
      100 * merged_file['CommercialData_EstimatedHHIncome'].isnull().sum() / merged_file.shape[0], '%')

Percent of rows with missing value for income: 1.5685416525119078 %


As long as this percentage is low, we can continue with our turnout calculations for income.

In [34]:
# Similar to before, but with income
list_income_city = merged_file[['Residence_Addresses_City', 'CommercialData_EstimatedHHIncome']].drop_duplicates()
list_income_city_No = list_income_city.copy()
list_income_city_No['voted'] = 'N'
list_income_city_Yes = list_income_city.copy()
list_income_city_Yes['voted'] = 'Y'
list_income_city = pd.concat([list_income_city_No, list_income_city_Yes])

In [35]:
# we also need the total voters information per city and income
total_city_income = merged_file.groupby(['Residence_Addresses_City', 'CommercialData_EstimatedHHIncome']).size().reset_index()
total_city_income.columns = ['Residence_Addresses_City', 'CommercialData_EstimatedHHIncome', 'total_voters']
total_city_income  = total_city_income.merge(list_income_city, on = ['Residence_Addresses_City', 'CommercialData_EstimatedHHIncome'])

total_city_income

Unnamed: 0,Residence_Addresses_City,CommercialData_EstimatedHHIncome,total_voters,voted
0,Albany,$1000-14999,83,N
1,Albany,$1000-14999,83,Y
2,Albany,$100000-124999,1337,N
3,Albany,$100000-124999,1337,Y
4,Albany,$125000-149999,1802,N
...,...,...,...,...
955,Whittier,$35000-49999,6387,Y
956,Whittier,$50000-74999,24013,N
957,Whittier,$50000-74999,24013,Y
958,Whittier,$75000-99999,24819,N


In [36]:
# function to calculate percent turnout by income bracket
# note: relies on the global total_city_income built in the previous cell
def calc_votes_income(df, col):
    # count voters per city, income bracket and value of the election column
    voter_turnout_stats = df.groupby(['Residence_Addresses_City', 'CommercialData_EstimatedHHIncome', col]).size().reset_index(name='voted_voters')

    # 'voted' is either 'Y' or 'N'
    voter_turnout_stats = voter_turnout_stats.rename(columns = {col: 'voted'})

    # left merge onto the scaffold so that combinations without any rows are kept
    voter_turnout_stats = total_city_income.merge(voter_turnout_stats,
                                                  how = 'left',
                                                  on = ['Residence_Addresses_City', 'CommercialData_EstimatedHHIncome', 'voted'])
    voter_turnout_stats['perc_turnout_income'] = voter_turnout_stats['voted_voters']/voter_turnout_stats['total_voters']

    # the column name ends with the election date (YYYY_MM_DD), preceded by the election type
    voter_turnout_stats['elec_date'] = col[-10:]
    voter_turnout_stats['elec_year'] = col[-10:-6]
    voter_turnout_stats['elec_type'] = col[:-11]

    # combinations with no matching rows get a count and turnout of 0
    voter_turnout_stats[['voted_voters', 'perc_turnout_income']] = voter_turnout_stats[['voted_voters', 'perc_turnout_income']].fillna(0)
    voter_turnout_stats = voter_turnout_stats[voter_turnout_stats['voted'] == 'Y']

    # one row per election and city, one column per (measure, income bracket)
    pivot_df = voter_turnout_stats.pivot(index = ['elec_type', 'elec_year', 'elec_date', 'Residence_Addresses_City'],
                                         columns = 'CommercialData_EstimatedHHIncome',
                                         values = ['total_voters', 'voted_voters', 'perc_turnout_income']).reset_index()
    # flatten the two-level column index, e.g. ('total_voters', '$1000-14999') -> 'total_voters_$1000-14999'
    pivot_df.columns = pivot_df.columns.map('_'.join)
    pivot_df = pivot_df.rename(columns = {'elec_type_':'elec_type', 'elec_year_':'elec_year', 'elec_date_':'elec_date', 'Residence_Addresses_City_':'Residence_Addresses_City'})

    del voter_turnout_stats
    gc.collect()
    return pivot_df

elec_date_cols = GE_cols+LM_cols

# stack all types of election into one dataframe
voter_turnout_income = pd.concat([calc_votes_income(merged_file, col) for col in elec_date_cols])
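
The election type, year and date are recovered from the column name by position. A small sketch of that slicing follows, assuming the columns are named `<elec_type>_<YYYY>_<MM>_<DD>` (inferred from the output below, e.g. `General` with `2014_11_04`). The helper `parse_election_col` is hypothetical and not part of the notebook.

```python
def parse_election_col(col):
    """Split a name like 'General_2014_11_04' into (type, year, date)."""
    elec_date = col[-10:]    # last 10 characters: YYYY_MM_DD
    elec_year = col[-10:-6]  # first 4 characters of the date
    elec_type = col[:-11]    # everything before the '_' that precedes the date
    return elec_type, elec_year, elec_date


print(parse_election_col("General_2014_11_04"))
print(parse_election_col("Local_or_Municipal_2020_08_03"))
```

Because the slicing counts from the end of the string, it also works for election types that contain underscores themselves.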


In [37]:
voter_turnout_income.head()

Unnamed: 0,elec_type,elec_year,elec_date,Residence_Addresses_City,total_voters_$1000-14999,total_voters_$100000-124999,total_voters_$125000-149999,total_voters_$15000-24999,total_voters_$150000-174999,total_voters_$175000-199999,...,perc_turnout_income_$125000-149999,perc_turnout_income_$15000-24999,perc_turnout_income_$150000-174999,perc_turnout_income_$175000-199999,perc_turnout_income_$200000-249999,perc_turnout_income_$25000-34999,perc_turnout_income_$250000+,perc_turnout_income_$35000-49999,perc_turnout_income_$50000-74999,perc_turnout_income_$75000-99999
0,General,2014,2014_11_04,Albany,83.0,1337.0,1802.0,113.0,1867.0,829.0,...,0.417314,0.530973,0.447777,0.475271,0.481028,0.510638,0.516741,0.263473,0.52027,0.420063
1,General,2014,2014_11_04,Alhambra,916.0,4420.0,3547.0,937.0,1254.0,1262.0,...,0.216239,0.270011,0.220893,0.222662,0.255875,0.28045,0.277633,0.213055,0.16789,0.211806
2,General,2014,2014_11_04,Anaheim,2471.0,16186.0,16774.0,3075.0,6765.0,6630.0,...,0.260582,0.26374,0.280414,0.279336,0.303997,0.283207,0.293158,0.221431,0.203979,0.221256
3,General,2014,2014_11_04,Bellflower,841.0,3168.0,2570.0,897.0,947.0,838.0,...,0.204669,0.269788,0.187962,0.188544,0.223005,0.204276,0.211864,0.142528,0.157968,0.176051
4,General,2014,2014_11_04,Berkeley,980.0,6129.0,6001.0,735.0,3374.0,7107.0,...,0.37827,0.493878,0.431239,0.466019,0.480263,0.54148,0.532916,0.372627,0.367175,0.391292


In [38]:
# Merge 3.1, 3.2 and 3.3
# merge with previous calculations for race and donation
voter_turnout_merge = voter_turnout_merge.merge(voter_turnout_income, 
                                                          how = 'inner',
                                                          on = ['elec_type', 'elec_year', 'elec_date', 'Residence_Addresses_City'])

voter_turnout_merge.head()

Unnamed: 0,Residence_Addresses_City,elec_date,elec_year,elec_type,total_voters_asian,total_voters_black,total_voters_hispanic,total_voters_white,voted_voters_asian,voted_voters_black,...,perc_turnout_income_$125000-149999,perc_turnout_income_$15000-24999,perc_turnout_income_$150000-174999,perc_turnout_income_$175000-199999,perc_turnout_income_$200000-249999,perc_turnout_income_$25000-34999,perc_turnout_income_$250000+,perc_turnout_income_$35000-49999,perc_turnout_income_$50000-74999,perc_turnout_income_$75000-99999
0,Oakland,2020_11_03,2020,General,30600.0,61476.0,37174.0,83122.0,23041.0,45891.0,...,0.845973,0.749049,0.837453,0.865903,0.885037,0.710177,0.9044,0.668651,0.694306,0.778418
1,Oakland,2018_11_06,2018,General,30600.0,61476.0,37174.0,83122.0,14972.0,35012.0,...,0.678846,0.540957,0.665526,0.710556,0.73842,0.511445,0.745676,0.445798,0.474869,0.580371
2,Oakland,2016_11_08,2016,General,30600.0,61476.0,37174.0,83122.0,16057.0,37256.0,...,0.687129,0.587422,0.667623,0.719837,0.741728,0.561451,0.74626,0.498191,0.51747,0.611121
3,Oakland,2014_11_04,2014,General,30600.0,61476.0,37174.0,83122.0,8145.0,21265.0,...,0.402397,0.325856,0.376959,0.452424,0.474074,0.314004,0.460337,0.243371,0.256071,0.325532
4,San Leandro,2020_11_03,2020,General,12705.0,5596.0,16028.0,17780.0,9229.0,4299.0,...,0.783378,0.789203,0.79914,0.82022,0.852136,0.794416,0.836503,0.763248,0.731269,0.742035


In [39]:
# add one column with the average estimated household income per city
merged_file['CommercialData_EstimatedHHIncomeAmount']= merged_file['CommercialData_EstimatedHHIncomeAmount'].str.replace('$','', regex=False)

merged_file = merged_file.astype({'CommercialData_EstimatedHHIncomeAmount': float})
        
avg_income = merged_file[['Residence_Addresses_City', 'CommercialData_EstimatedHHIncomeAmount']].\
            groupby(['Residence_Addresses_City']).\
            mean().reset_index()

avg_income.head(10)

Unnamed: 0,Residence_Addresses_City,CommercialData_EstimatedHHIncomeAmount
0,Albany,144760.415983
1,Alhambra,94500.265392
2,Anaheim,103325.049915
3,Bellflower,85002.549857
4,Berkeley,149839.767931
5,Buena Park,105422.081532
6,Burbank,120977.462636
7,Calabasas,184202.553293
8,Carpinteria,121539.312743
9,Chino Hills,141110.737722


In [40]:
# Add the city-level average income to the merged results of 3.1, 3.2 and 3.3
# (race, donation and income bracket)
voter_turnout_merge = voter_turnout_merge.merge(avg_income, 
                                                          how = 'inner',
                                                          on = ['Residence_Addresses_City'])


voter_turnout_merge.head(10)

Unnamed: 0,Residence_Addresses_City,elec_date,elec_year,elec_type,total_voters_asian,total_voters_black,total_voters_hispanic,total_voters_white,voted_voters_asian,voted_voters_black,...,perc_turnout_income_$15000-24999,perc_turnout_income_$150000-174999,perc_turnout_income_$175000-199999,perc_turnout_income_$200000-249999,perc_turnout_income_$25000-34999,perc_turnout_income_$250000+,perc_turnout_income_$35000-49999,perc_turnout_income_$50000-74999,perc_turnout_income_$75000-99999,CommercialData_EstimatedHHIncomeAmount
0,Oakland,2020_11_03,2020,General,30600.0,61476.0,37174.0,83122.0,23041.0,45891.0,...,0.749049,0.837453,0.865903,0.885037,0.710177,0.9044,0.668651,0.694306,0.778418,115166.322571
1,Oakland,2018_11_06,2018,General,30600.0,61476.0,37174.0,83122.0,14972.0,35012.0,...,0.540957,0.665526,0.710556,0.73842,0.511445,0.745676,0.445798,0.474869,0.580371,115166.322571
2,Oakland,2016_11_08,2016,General,30600.0,61476.0,37174.0,83122.0,16057.0,37256.0,...,0.587422,0.667623,0.719837,0.741728,0.561451,0.74626,0.498191,0.51747,0.611121,115166.322571
3,Oakland,2014_11_04,2014,General,30600.0,61476.0,37174.0,83122.0,8145.0,21265.0,...,0.325856,0.376959,0.452424,0.474074,0.314004,0.460337,0.243371,0.256071,0.325532,115166.322571
4,Oakland,2021_11_02,2021,Consolidated_General,30600.0,61476.0,37174.0,83122.0,1.0,1.0,...,0.0,0.00011,0.0,4.9e-05,0.0,7.3e-05,0.0,4.9e-05,4.4e-05,115166.322571
5,Oakland,2020_08_03,2020,Local_or_Municipal,30600.0,61476.0,37174.0,83122.0,55.0,54.0,...,0.000601,0.003255,0.002949,0.006716,0.000352,0.004671,0.001008,0.000712,0.00176,115166.322571
6,Oakland,2020_04_14,2020,Local_or_Municipal,30600.0,61476.0,37174.0,83122.0,2.0,0.0,...,0.0,5.5e-05,0.0,0.0,0.0,0.0,0.0,0.0,8.8e-05,115166.322571
7,Oakland,2019_11_05,2019,Consolidated_General,30600.0,61476.0,37174.0,83122.0,320.0,325.0,...,0.004606,0.015339,0.012057,0.01837,0.006339,0.019193,0.003893,0.004763,0.010382,115166.322571
8,San Leandro,2020_11_03,2020,General,12705.0,5596.0,16028.0,17780.0,9229.0,4299.0,...,0.789203,0.79914,0.82022,0.852136,0.794416,0.836503,0.763248,0.731269,0.742035,112858.690163
9,San Leandro,2018_11_06,2018,General,12705.0,5596.0,16028.0,17780.0,5006.0,3045.0,...,0.579263,0.559563,0.530291,0.588574,0.580372,0.573857,0.541365,0.47028,0.470708,112858.690163


#  3.4.  Calculate voter turnout or percentage by education?

In [41]:
# TODO: split voters into with/without college education, then compute % voter turnout for each group
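
One possible way to fill in this step, shown on toy data only. The column name `education`, its labels and the set `college_levels` are assumptions made for illustration; the real education column and its categories need to be checked in the demographic file first.

```python
import pandas as pd

# Toy stand-in for merged_file; 'education' and its labels are hypothetical.
toy = pd.DataFrame({
    "Residence_Addresses_City": ["Albany"] * 5,
    "education": ["Bach Degree", "HS Diploma", "Grad Degree", None, "HS Diploma"],
    "General_2020_11_03": ["Y", "N", "Y", "Y", "Y"],
})

college_levels = {"Bach Degree", "Grad Degree"}  # assumption, adjust to the real labels

# Drop voters with unknown education rather than counting them as no college.
known = toy.dropna(subset=["education"]).copy()
known["college"] = known["education"].isin(college_levels).map(
    {True: "college", False: "no_college"}
)

# Turnout is the share of 'Y' within each (city, college) group.
turnout_by_college = (
    known.assign(voted=known["General_2020_11_03"].eq("Y"))
    .groupby(["Residence_Addresses_City", "college"])["voted"]
    .mean()
)
print(turnout_by_college)
```

The same pattern could be wrapped in a function over `elec_date_cols`, in the way `calc_votes_income` handles income brackets.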

#  3.5.  Calculate voter turnout per age group

In [42]:
# TODO: compute just the average age of voters for each election
# age has to be calculated as of the election date, using the date of birth
# no need to bucket ages
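
A sketch of how age on election day could be derived, on toy data. The `dob` column is a placeholder for whatever date-of-birth field the demographic file provides, and the helper `age_on` is hypothetical. The election date is parsed from the column name under the same `YYYY_MM_DD` suffix assumption used above.

```python
import pandas as pd


def age_on(dob, election_day):
    """Age in whole years on election_day for a Series of birth dates."""
    had_birthday = (dob.dt.month < election_day.month) | (
        (dob.dt.month == election_day.month) & (dob.dt.day <= election_day.day)
    )
    # subtract one year for voters whose birthday is still ahead that year
    return election_day.year - dob.dt.year - (~had_birthday).astype(int)


# Toy stand-in for merged_file; 'dob' is a hypothetical column name.
toy = pd.DataFrame({
    "Residence_Addresses_City": ["Albany", "Albany", "Albany"],
    "dob": pd.to_datetime(["1990-11-03", "1990-11-04", "1960-01-15"]),
    "General_2020_11_03": ["Y", "Y", "N"],
})

col = "General_2020_11_03"
election_day = pd.to_datetime(col[-10:], format="%Y_%m_%d")
toy["age"] = age_on(toy["dob"], election_day)

# average age of those who voted in this election, per city
avg_age_voted = (
    toy.loc[toy[col] == "Y"].groupby("Residence_Addresses_City")["age"].mean()
)
print(avg_age_voted)
```

Comparing month and day directly avoids the off-by-one errors that come from dividing a day difference by 365.25.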

# Save the merged aggregations 

In [43]:
# voter_turnout_merge.to_csv(f'{filepath}voter_turnout_merged.csv', index=False)

In [44]:
# del voter_turnout_merge
# gc.collect()

In [45]:
end_time = time.time()
print("Time taken to run this notebook in seconds: ", end_time - start_time)

Time taken to run this notebook in seconds:  205.35417008399963
