# Data Merging

Some observations
- We choose the 5 non-RCV cities with highest cosine similary score compared to the 7 RCV cities in CA
- There were 33 distinct cities among those 35 cities
- There are 66 non-registered voters among 21.7 million voters
- There are total of 3.9 million voters in the sampled cities
- City 'El Paso de Robles' didn't match in demographic data
- How can we identify election dates are for different cities?
    - We found 122 cases out of 312 with 0% voter turnout 

In [1]:
import pandas as pd
import janitor
import gc

In [2]:
RCV_cities = ['San Francisco',
 'Oakland',
 'Berkeley',
 'San Leandro',
 'Palm Desert',
 'Eureka',
 'Albany']

sampled_nonRCV_cities = ['Fresno',
 'San Diego',
 'Sacramento',
 'Riverside',
 'San Jose',
 'Santa Ana',
 'Anaheim',
 'Santa Rosa',
 'Merced',
 'Santa Clarita',
 'Alhambra',
 'Davis',
 'Montebello',
 'Burbank',
 'Huntington Park',
 'Bellflower',
 'Watsonville',
 'Gilroy',
 'Whittier',
 'Lynwood',
 'Lakewood',
 'Pico Rivera',
 'Lake Forest',
 'Livermore',
 'Chino Hills',
 'Paramount',
 'El Paso de Robles',
 'Pico Rivera',
 'Buena Park',
 'Whittier',
 'Calabasas',
 'Carpinteria',
 'Morro Bay',
 'San Carlos',
 'Solvang']

print("total number of cities:", len(sampled_nonRCV_cities))

print("number of distinct cities:", len(set(sampled_nonRCV_cities)))

print("name of cities that were duplicated:", set([x for x in sampled_nonRCV_cities if sampled_nonRCV_cities.count(x) > 1]))

combined_sampled_cityName = RCV_cities+list(set(sampled_nonRCV_cities))
print("number of distinct RCV and sampled nonRCV cities:", len(combined_sampled_cityName))

total number of cities: 35
number of distinct cities: 33
name of cities that were duplicated: {'Whittier', 'Pico Rivera'}
number of distinct RCV and sampled nonRCV cities: 40


In [3]:
# change the filepath as required, we have selected the folder with the latest date
filepath = 'VM2--CA--2022-04-25/'

# Demographic Data

1. Select only the columns required: city name ('Residence_Addresses_City'), unique voter id ('LALVOTERID'), voter's ethnicity ('EthnicGroups_EthnicGroup1Desc') and date when voter was registered ('Voters_OfficialRegDate')
2. Keep only the cities that were identified as being similar to RCV cities in CA (See ca_similarity_search.ipynb for reference) 
3. Keep only rows EthnicGroups_EthnicGroup1Desc == “European”,  “Likely African-American”,“Hispanic and Portuguese” and “East and South Asian” 
4. Keep only registered voters identified in 'Voters_OfficialRegDate'



# 1. Reduce Demographic to parquet

**Note: Run only once**
- select a subset of columns
- save in parquet format

In [18]:
selected_variables = ['LALVOTERID',
                      'Residence_Addresses_City', 
                      'County',
                      'EthnicGroups_EthnicGroup1Desc',
                      'Voters_OfficialRegDate', 
                      'Voters_Age',
                      'Voters_Gender',        
                      'CommercialData_Education',
                      'CommercialData_EstimatedHHIncome',
                      'FECDonors_NumberOfDonations',
                      'FECDonors_TotalDonationsAmount', 
                      'Parties_Description'
                     ]

state_demographic = pd.read_csv(f'{filepath}VM2--CA--2022-04-25-DEMOGRAPHIC.tab', 
                                sep='\t', dtype=str, encoding='unicode_escape',
                                usecols=selected_variables)

In [19]:
state_demographic.head()

Unnamed: 0,LALVOTERID,Residence_Addresses_City,Voters_Gender,Voters_Age,Parties_Description,EthnicGroups_EthnicGroup1Desc,Voters_OfficialRegDate,County,CommercialData_Education,CommercialData_EstimatedHHIncome,FECDonors_NumberOfDonations,FECDonors_TotalDonationsAmount
0,LALCA453164106,Oakland,F,29,Democratic,Other,06/18/2021,ALAMEDA,,,,
1,LALCA453008306,Oakland,F,26,Non-Partisan,Likely African-American,04/01/2021,ALAMEDA,,,,
2,LALCA22129469,Oakland,F,47,Democratic,European,11/16/2021,ALAMEDA,HS Diploma - Extremely Likely,,,
3,LALCA549803906,Oakland,M,60,Democratic,Other,02/07/2022,ALAMEDA,,,,
4,LALCA24729024,San Leandro,F,56,Democratic,European,02/28/2016,ALAMEDA,HS Diploma - Extremely Likely,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
95,LALCA22584983,Pleasanton,F,57,Non-Partisan,European,02/28/2020,ALAMEDA,,,,
96,LALCA22552111,Pleasanton,F,54,Republican,,09/16/2008,ALAMEDA,,,,
97,LALCA22551998,Pleasanton,M,46,Republican,,09/06/2020,ALAMEDA,HS Diploma - Extremely Likely,,,
98,LALCA22766523,Oakland,F,45,Democratic,Other,09/21/2020,ALAMEDA,,,,


In [21]:
print("Memory usuage:", state_demographic.memory_usage().sum() / 1024**2)

Memory usuage: 1987.7580261230469


In [None]:
# state_demographic.to_parquet(f'{filepath}VM2--CA--2022-04-25-DEMOGRAPHIC_selected_cols.parquet')
# state_demographic.to_csv(f'{filepath}VM2--CA--2022-04-25-DEMOGRAPHIC_selected_cols.csv')

## 2.  Load new Demographic Data

In [4]:
state_demographic = pd.read_parquet(f'{filepath}VM2--CA--2022-04-25-DEMOGRAPHIC_selected_cols.parquet')

In [5]:
state_demographic.head(5)

Unnamed: 0,LALVOTERID,Residence_Addresses_City,Voters_Gender,Voters_Age,Parties_Description,EthnicGroups_EthnicGroup1Desc,Voters_OfficialRegDate,County,CommercialData_Education,CommercialData_EstimatedHHIncome,FECDonors_NumberOfDonations,FECDonors_TotalDonationsAmount
0,LALCA453164106,Oakland,F,29,Democratic,Other,06/18/2021,ALAMEDA,,,,
1,LALCA453008306,Oakland,F,26,Non-Partisan,Likely African-American,04/01/2021,ALAMEDA,,,,
2,LALCA22129469,Oakland,F,47,Democratic,European,11/16/2021,ALAMEDA,HS Diploma - Extremely Likely,,,
3,LALCA549803906,Oakland,M,60,Democratic,Other,02/07/2022,ALAMEDA,,,,
4,LALCA24729024,San Leandro,F,56,Democratic,European,02/28/2016,ALAMEDA,HS Diploma - Extremely Likely,,,


In [6]:
print("total number of unique cities", state_demographic.Residence_Addresses_City.nunique())
print("total number of unique voters", state_demographic.LALVOTERID.nunique())
print("count of non-registered voters", len(state_demographic[state_demographic['Voters_OfficialRegDate'].isnull()]))

total number of unique cities 1533
total number of unique voters 21711617
count of non-registered voters 66


In [7]:
print("number of expected cities:", len(combined_sampled_cityName))
missing_cities = [city for city in combined_sampled_cityName if city not in state_demographic['Residence_Addresses_City'].unique()]
if len(missing_cities) > 0:
    print("number of cities not found in demographic data:", len(missing_cities))
    print(missing_cities)

number of expected cities: 40
number of cities not found in demographic data: 1
['El Paso de Robles']


In [8]:
selected_ethnicities = ['European', 'Likely African-American','Hispanic and Portuguese', 'East and South Asian']

state_demographic_subset = state_demographic[state_demographic['Residence_Addresses_City'].isin(combined_sampled_cityName) &
                                             state_demographic['EthnicGroups_EthnicGroup1Desc'].isin(selected_ethnicities) &
                                             state_demographic['Voters_OfficialRegDate'].notnull()
                                            ]
print(state_demographic_subset.shape)
state_demographic_subset.head()

(3918925, 12)


Unnamed: 0,LALVOTERID,Residence_Addresses_City,Voters_Gender,Voters_Age,Parties_Description,EthnicGroups_EthnicGroup1Desc,Voters_OfficialRegDate,County,CommercialData_Education,CommercialData_EstimatedHHIncome,FECDonors_NumberOfDonations,FECDonors_TotalDonationsAmount
1,LALCA453008306,Oakland,F,26,Non-Partisan,Likely African-American,04/01/2021,ALAMEDA,,,,
2,LALCA22129469,Oakland,F,47,Democratic,European,11/16/2021,ALAMEDA,HS Diploma - Extremely Likely,,,
4,LALCA24729024,San Leandro,F,56,Democratic,European,02/28/2016,ALAMEDA,HS Diploma - Extremely Likely,,,
6,LALCA22466723,Livermore,F,38,Republican,European,11/01/2021,ALAMEDA,,,,
7,LALCA22466636,Livermore,M,63,Democratic,European,12/07/2021,ALAMEDA,,,,


In [9]:
print("number of unique cities:", state_demographic_subset.Residence_Addresses_City.nunique())

number of unique cities: 39


In [10]:
# del state_demographic
# gc.collect()

20

# Vote History

1. Select only the columns that are 4 most recent General elections and 4 most recent Local_or_Municipal elections and EthnicGroups_EthnicGroup1Desc
2. Merge Vote History with the sampled Demographic Data 


# 3. Find four most recent election history

**Note: Run only once**
- select a subset of columns
- save in parquet format

In [11]:
# select only subset of rows to find the column names that are 4 most recent General and Local_or_Municipal elections
state_voterhistory = pd.read_csv(f'{filepath}VM2--CA--2022-04-25-VOTEHISTORY.tab',
                                 sep='\t', dtype=str, encoding='unicode_escape',
                                nrows=100)
                                
state_voterhistory.head(5)

Unnamed: 0,LALVOTERID,Special_2022_04_19,Special_2022_04_12,Special_2022_04_05,Special_2022_02_15,Special_2022_02_01,Special_2021_12_14,Special_2021_12_07,Special_2021_11_02,Consolidated_General_2021_11_02,...,BallotReturnDate_General_2018_11_06,BallotReturnDate_Primary_2018_06_05,BallotReturnDate_General_2016_11_08,BallotReturnDate_Primary_2016_06_07,BallotReturnDate_General_2014_11_04,BallotReturnDate_Primary_2014_06_03,BallotReturnDate_General_2012_11_06,BallotReturnDate_Primary_2012_06_05,BallotReturnDate_General_2010_11_02,BallotReturnDate_Primary_2010_06_08
0,LALCA453164106,,,,,,,,,,...,,,11/07/2016,,,,,,,
1,LALCA453008306,,,,,,,,,,...,,,,,,,,,,
2,LALCA22129469,,,,,,,,,,...,11/06/2018,,,,,,,,,
3,LALCA549803906,,,,,,,,,,...,,,,,,,,,,
4,LALCA24729024,,,,,,,,,,...,,,,,,,,,,


In [177]:
print("total number of General election dates", len([col for col in state_voterhistory.columns if col.startswith('General')]))
print("total number of Local or Municipal election dates", len([col for col in state_voterhistory.columns if col.startswith('Local_or_Municipal')]))

total number of General election dates 18
total number of Local or Municipal election dates 131


In [164]:
def get_recent_date(string, df, start_n_dates, end_n_dates):
    # In descending order
    # start_n_dates to end_n_dates 
    # if top 10 the 0:10, if next top 10 then 10:20 and so on
    list_cols = [col for col in df.columns if col.startswith(string)]
    dates = [col.replace(string+'_', '') for col in list_cols]
    dates.sort(reverse=True)
    return [string+'_'+d for d in dates[start_n_dates:end_n_dates]]

GE_cols = get_recent_date('General', state_voterhistory, 0, 10)
print(GE_cols)
LM_cols = get_recent_date('Local_or_Municipal', state_voterhistory, 0, 10)
print(LM_cols)

['General_2020_11_03', 'General_2018_11_06', 'General_2016_11_08', 'General_2014_11_04', 'General_2012_11_06', 'General_2010_11_02', 'General_2008_11_04', 'General_2006_11_07', 'General_2004_11_02', 'General_2002_11_05']
['Local_or_Municipal_2021_08_31', 'Local_or_Municipal_2021_07_20', 'Local_or_Municipal_2021_06_08', 'Local_or_Municipal_2021_06_01', 'Local_or_Municipal_2021_05_11', 'Local_or_Municipal_2021_05_04', 'Local_or_Municipal_2021_04_20', 'Local_or_Municipal_2021_03_09', 'Local_or_Municipal_2021_03_02', 'Local_or_Municipal_2020_08_03']


In [178]:
# del state_voterhistory
# gc.collect()

In [13]:
needed_variables = ['LALVOTERID'] + LM_cols + GE_cols

state_voterhistory_selected_cols = pd.read_csv(f'{filepath}VM2--CA--2022-04-25-VOTEHISTORY.tab',
                                 sep='\t', dtype=str, encoding='unicode_escape',
                                 usecols=needed_variables)
                                
state_voterhistory_selected_cols.head(5)

Unnamed: 0,LALVOTERID,Special_2022_04_19,Special_2022_04_12,Special_2022_04_05,Special_2022_02_15,Special_2022_02_01,Special_2021_12_14,Special_2021_12_07,Special_2021_11_02,Consolidated_General_2021_11_02,...,BallotReturnDate_General_2018_11_06,BallotReturnDate_Primary_2018_06_05,BallotReturnDate_General_2016_11_08,BallotReturnDate_Primary_2016_06_07,BallotReturnDate_General_2014_11_04,BallotReturnDate_Primary_2014_06_03,BallotReturnDate_General_2012_11_06,BallotReturnDate_Primary_2012_06_05,BallotReturnDate_General_2010_11_02,BallotReturnDate_Primary_2010_06_08
0,LALCA453164106,,,,,,,,,,...,,,11/07/2016,,,,,,,
1,LALCA453008306,,,,,,,,,,...,,,,,,,,,,
2,LALCA22129469,,,,,,,,,,...,11/06/2018,,,,,,,,,
3,LALCA549803906,,,,,,,,,,...,,,,,,,,,,
4,LALCA24729024,,,,,,,,,,...,,,,,,,,,,


In [18]:
merged_file = pd.merge(state_voterhistory_selected_cols, state_demographic_subset,
                       how='inner', left_on='LALVOTERID', right_on='LALVOTERID')

print(merged_file.shape)

print("number of unique cities:", merged_file.Residence_Addresses_City.nunique())

merged_file.head(5)

Unnamed: 0,LALVOTERID,Local_or_Municipal_2021_08_31,Local_or_Municipal_2021_07_20,Local_or_Municipal_2021_06_08,Local_or_Municipal_2021_06_01,Local_or_Municipal_2021_05_11,Local_or_Municipal_2021_05_04,Local_or_Municipal_2021_04_20,Local_or_Municipal_2021_03_09,Local_or_Municipal_2021_03_02,...,Voters_Gender,Voters_Age,Parties_Description,EthnicGroups_EthnicGroup1Desc,Voters_OfficialRegDate,County,CommercialData_Education,CommercialData_EstimatedHHIncome,FECDonors_NumberOfDonations,FECDonors_TotalDonationsAmount
0,LALCA453008306,,,,,,,,,,...,F,26,Non-Partisan,Likely African-American,04/01/2021,ALAMEDA,,,,
1,LALCA22129469,,,,,,,,,,...,F,47,Democratic,European,11/16/2021,ALAMEDA,HS Diploma - Extremely Likely,,,
2,LALCA24729024,,,,,,,,,,...,F,56,Democratic,European,02/28/2016,ALAMEDA,HS Diploma - Extremely Likely,,,
3,LALCA22466723,,,,,,,,,,...,F,38,Republican,European,11/01/2021,ALAMEDA,,,,
4,LALCA22466636,,,,,,,,,,...,M,63,Democratic,European,12/07/2021,ALAMEDA,,,,


In [20]:
# merged_file = merged_file.reset_index(drop = False)

In [22]:
# fill NA values with "N" to make it easier to compare  with "Y"
merged_file[GE_cols+LM_cols] = merged_file[GE_cols+LM_cols].fillna('N')
merged_file.head()

Unnamed: 0,index,LALVOTERID,Local_or_Municipal_2021_08_31,Local_or_Municipal_2021_07_20,Local_or_Municipal_2021_06_08,Local_or_Municipal_2021_06_01,Local_or_Municipal_2021_05_11,Local_or_Municipal_2021_05_04,Local_or_Municipal_2021_04_20,Local_or_Municipal_2021_03_09,...,Voters_Gender,Voters_Age,Parties_Description,EthnicGroups_EthnicGroup1Desc,Voters_OfficialRegDate,County,CommercialData_Education,CommercialData_EstimatedHHIncome,FECDonors_NumberOfDonations,FECDonors_TotalDonationsAmount
0,0,LALCA453008306,N,N,N,N,N,N,N,N,...,F,26,Non-Partisan,Likely African-American,04/01/2021,ALAMEDA,,,,
1,1,LALCA22129469,N,N,N,N,N,N,N,N,...,F,47,Democratic,European,11/16/2021,ALAMEDA,HS Diploma - Extremely Likely,,,
2,2,LALCA24729024,N,N,N,N,N,N,N,N,...,F,56,Democratic,European,02/28/2016,ALAMEDA,HS Diploma - Extremely Likely,,,
3,3,LALCA22466723,N,N,N,N,N,N,N,N,...,F,38,Republican,European,11/01/2021,ALAMEDA,,,,
4,4,LALCA22466636,N,N,N,N,N,N,N,N,...,M,63,Democratic,European,12/07/2021,ALAMEDA,,,,


- create two dictionaries
    1. `list_city_cnt_dates` will count the number of election dates for each city
        - for each city if the count of election dates reaches 4 then stop checking more dates for that city (to do this we will remove the city from the `list_city_cnt_dates` dictionary) 
    2. `list_city_4_dates` will keep track of the cities and their election dates
- for a given election date if at least one voter has "Y" then proceed to find which cities took part on that date
    - for each city in `list_city_cnt_dates` if the city is also present in the dataframe (i.e. the vote "Y" is counted) then increment the count by 1 in `list_city_cnt_dates` and also add the date to `list_city_4_dates`


In [151]:
list_city_cnt_dates_1_top10 = {key: 0 for key in merged_file['Residence_Addresses_City']}
list_city_4_dates_1_top10 = {key: [] for key in merged_file['Residence_Addresses_City']}

def get_list_elec_dates(df, date_cols, list_city_cnt_dates, list_city_4_dates):
    for date_col in date_cols:
        cnt_df = df[df[date_col] == 'Y'][[date_col, 'Residence_Addresses_City']].groupby('Residence_Addresses_City').count()
        
        # If no rows found then none of the city had election held on that date 
        # assuming that at least one voter will present on an election date
        
        if len(cnt_df) > 1 and len(list_city_cnt_dates) > 0:
            
            # for the selected date check which cities held the election on that date
            for city in list(list_city_cnt_dates.keys()): 
                # first check if the city is present in list_city_cnt_dates, 
                # not being present means we have already found the dates so no need to check 
                if city in cnt_df.index:
                    # second check if the city is present in the dataframe with "Y"
                    # not being present means the date is not the election date for this city
                    list_city_cnt_dates[city] += 1
                    list_city_4_dates[city].append(date_col)                
                    if list_city_cnt_dates[city] == 4:
                        # remove the city from dictionary list_city_cnt_dates so that we know when to stop checking for more dates
                        del list_city_cnt_dates[city]
                        
        elif len(cnt_df) == 0:
            print("No cities found for ", date_col)
            
        elif len(list_city_cnt_dates) == 0:
            # means all 4 dates for all cities found since we removed cities every time 4 dates were found
            break
            
    return list_city_cnt_dates, list_city_4_dates
    
temp_LM_df = merged_file[['LALVOTERID','Residence_Addresses_City']+LM_cols].copy()
temp_GE_df = merged_file[['LALVOTERID','Residence_Addresses_City']+GE_cols].copy()

In [152]:
list_city_cnt_GE_dates, list_city_4_GE_dates = get_list_elec_dates(temp_GE_df, GE_cols, list_city_cnt_dates_1_top10, list_city_4_dates_1_top10)
if len(list_city_cnt_GE_dates) == 0:
    print("\nAll general election dates found!")
else:
    print("\nNeed to find more general election dates!!!")
list_city_cnt_GE_dates, list_city_4_GE_dates


All general election dates found!


({},
 {'Oakland': ['General_2020_11_03',
   'General_2018_11_06',
   'General_2016_11_08',
   'General_2014_11_04'],
  'San Leandro': ['General_2020_11_03',
   'General_2018_11_06',
   'General_2016_11_08',
   'General_2014_11_04'],
  'Livermore': ['General_2020_11_03',
   'General_2018_11_06',
   'General_2016_11_08',
   'General_2014_11_04'],
  'Berkeley': ['General_2020_11_03',
   'General_2018_11_06',
   'General_2016_11_08',
   'General_2014_11_04'],
  'Albany': ['General_2020_11_03',
   'General_2018_11_06',
   'General_2016_11_08',
   'General_2014_11_04'],
  'San Francisco': ['General_2020_11_03',
   'General_2018_11_06',
   'General_2016_11_08',
   'General_2014_11_04'],
  'San Diego': ['General_2020_11_03',
   'General_2018_11_06',
   'General_2016_11_08',
   'General_2014_11_04'],
  'San Jose': ['General_2020_11_03',
   'General_2018_11_06',
   'General_2016_11_08',
   'General_2014_11_04'],
  'Fresno': ['General_2020_11_03',
   'General_2018_11_06',
   'General_2016_11_08',

In [154]:
list_city_cnt_dates_1_top10 = {key: 0 for key in merged_file['Residence_Addresses_City']}
list_city_4_dates_1_top10 = {key: [] for key in merged_file['Residence_Addresses_City']}

list_city_cnt_LM_dates, list_city_4_LM_dates = get_list_elec_dates(temp_LM_df, LM_cols, list_city_cnt_dates_1_top10, list_city_4_dates_1_top10)

if len(list_city_cnt_LM_dates) == 0:
    print("\nAll local and municipal election dates found!")
else:
    print("\nNeed to find more local and municipal election dates!!!")

list_city_cnt_LM_dates, list_city_4_LM_dates

No cities found for  Local_or_Municipal_2021_08_31
No cities found for  Local_or_Municipal_2021_05_04
No cities found for  Local_or_Municipal_2021_03_09

Need to find more local and municipal election dates!!!


({'Oakland': 1,
  'San Leandro': 1,
  'Livermore': 1,
  'Albany': 2,
  'Fresno': 3,
  'Eureka': 2,
  'Alhambra': 2,
  'Montebello': 2,
  'Burbank': 2,
  'Lynwood': 2,
  'Pico Rivera': 2,
  'Santa Clarita': 2,
  'Paramount': 2,
  'Huntington Park': 0,
  'Calabasas': 1,
  'Chino Hills': 3,
  'Buena Park': 3,
  'Morro Bay': 2,
  'Merced': 2,
  'Santa Ana': 3,
  'Anaheim': 2,
  'Lake Forest': 0,
  'Palm Desert': 3,
  'San Carlos': 1,
  'Carpinteria': 1,
  'Solvang': 1,
  'Gilroy': 2,
  'Watsonville': 1,
  'Davis': 3,
  'Santa Rosa': 1},
 {'Oakland': ['Local_or_Municipal_2020_08_03'],
  'San Leandro': ['Local_or_Municipal_2020_08_03'],
  'Livermore': ['Local_or_Municipal_2020_08_03'],
  'Berkeley': ['Local_or_Municipal_2021_06_01',
   'Local_or_Municipal_2021_05_11',
   'Local_or_Municipal_2021_03_02',
   'Local_or_Municipal_2020_08_03'],
  'Albany': ['Local_or_Municipal_2021_03_02', 'Local_or_Municipal_2020_08_03'],
  'San Francisco': ['Local_or_Municipal_2021_07_20',
   'Local_or_Municipa

In [156]:
LM_cols_next_top10 = get_recent_date('Local_or_Municipal', state_voterhistory, 10, 20)
print(LM_cols_next_top10)

needed_variables = ['LALVOTERID'] + LM_cols_2_top10

state_voterhistory_next_top10 = pd.read_csv(f'{filepath}VM2--CA--2022-04-25-VOTEHISTORY.tab',
                                 sep='\t', dtype=str, encoding='unicode_escape',
                                 usecols=needed_variables)

# inorder to decrease computation time we will remove all cities for which we have found the dates
state_demographic_next_top10 = state_demographic_subset[state_demographic_subset['Residence_Addresses_City'].isin(list(list_city_cnt_LM_dates.keys()))]
print(len(state_demographic_subset_next_top10))

merged_file_next_top10 = pd.merge(state_voterhistory_next_top10, state_demographic_subset_next_top10,
                       how='inner', left_on='LALVOTERID', right_on='LALVOTERID')
print(merged_file_next_top10.shape)
print("number of unique cities:", merged_file_next_top10.Residence_Addresses_City.nunique())


['Local_or_Municipal_2020_05_19', 'Local_or_Municipal_2020_05_05', 'Local_or_Municipal_2020_04_14', 'Local_or_Municipal_2019_08_27', 'Local_or_Municipal_2019_08_13', 'Local_or_Municipal_2019_06_04', 'Local_or_Municipal_2019_05_07', 'Local_or_Municipal_2019_04_16', 'Local_or_Municipal_2019_03_05', 'Local_or_Municipal_2018_07_24']
1606890
(1606890, 22)
number of unique cities: 30


In [160]:
list_city_cnt_LM_dates_next_top10, list_city_4_LM_dates_next_top10 = get_list_elec_dates(merged_file_next_top10, LM_cols_next_top10, list_city_cnt_LM_dates, list_city_4_LM_dates)

if len(list_city_cnt_LM_dates) == 0:
    print("\nAll local and municipal election dates found!")
else:
    print("\nNeed to find more local and municipal election dates!!!")

list_city_cnt_LM_dates_next_top10, list_city_4_LM_dates_next_top10


Need to find more local and municipal election dates!!!


({'Livermore': 3,
  'Huntington Park': 2,
  'Calabasas': 3,
  'Lake Forest': 3,
  'San Carlos': 3,
  'Solvang': 3,
  'Gilroy': 3,
  'Watsonville': 3},
 {'Oakland': ['Local_or_Municipal_2020_08_03',
   'Local_or_Municipal_2020_04_14',
   'Local_or_Municipal_2019_08_27',
   'Local_or_Municipal_2019_08_13'],
  'San Leandro': ['Local_or_Municipal_2020_08_03',
   'Local_or_Municipal_2019_08_13',
   'Local_or_Municipal_2019_06_04',
   'Local_or_Municipal_2019_03_05'],
  'Livermore': ['Local_or_Municipal_2020_08_03',
   'Local_or_Municipal_2020_04_14',
   'Local_or_Municipal_2019_03_05'],
  'Berkeley': ['Local_or_Municipal_2021_06_01',
   'Local_or_Municipal_2021_05_11',
   'Local_or_Municipal_2021_03_02',
   'Local_or_Municipal_2020_08_03'],
  'Albany': ['Local_or_Municipal_2021_03_02',
   'Local_or_Municipal_2020_08_03',
   'Local_or_Municipal_2019_08_13',
   'Local_or_Municipal_2019_06_04'],
  'San Francisco': ['Local_or_Municipal_2021_07_20',
   'Local_or_Municipal_2021_05_11',
   'Local_

In [162]:
del state_voterhistory_next_top10

del merged_file_next_top10
gc.collect()

0

In [172]:
LM_cols_next_top10 = get_recent_date('Local_or_Municipal', state_voterhistory, 20, 30)
print(LM_cols_next_top10)

needed_variables = ['LALVOTERID'] + LM_cols_next_top10

state_voterhistory_next_top10 = pd.read_csv(f'{filepath}VM2--CA--2022-04-25-VOTEHISTORY.tab',
                                 sep='\t', dtype=str, encoding='unicode_escape',
                                 usecols=needed_variables)

# inorder to decrease computation time we will remove all cities for which we have found the dates
state_demographic_next_top10 = state_demographic_subset[state_demographic_subset['Residence_Addresses_City'].isin(list(list_city_cnt_LM_dates_next_top10.keys()))]
print(len(state_demographic_subset_next_top10))

merged_file_next_top10 = pd.merge(state_voterhistory_next_top10, state_demographic_subset_next_top10,
                       how='inner', left_on='LALVOTERID', right_on='LALVOTERID')
print(merged_file_next_top10.shape)
print("number of unique cities:", merged_file_next_top10.Residence_Addresses_City.nunique())


['Local_or_Municipal_2018_05_21', 'Local_or_Municipal_2018_04_10', 'Local_or_Municipal_2018_03_06', 'Local_or_Municipal_2018_01_30', 'Local_or_Municipal_2017_08_29', 'Local_or_Municipal_2017_07_11', 'Local_or_Municipal_2017_06_30', 'Local_or_Municipal_2017_06_06', 'Local_or_Municipal_2017_05_09', 'Local_or_Municipal_2017_05_02']
1606890
(1606890, 22)
number of unique cities: 30


In [173]:
list_city_cnt_LM_dates_next_top10_2, list_city_4_LM_dates_next_top10_2 = get_list_elec_dates(merged_file_next_top10, LM_cols_next_top10, 
                                                                                         list_city_cnt_LM_dates_next_top10, list_city_4_LM_dates_next_top10)

if len(list_city_cnt_LM_dates_next_top10_2) == 0:
    print("\nAll local and municipal election dates found!")
else:
    print("\nNeed to find more local and municipal election dates!!!")

list_city_cnt_LM_dates_next_top10_2, list_city_4_LM_dates_next_top10_2

No cities found for  Local_or_Municipal_2018_05_21
No cities found for  Local_or_Municipal_2017_06_30

Need to find more local and municipal election dates!!!


({'Huntington Park': 2, 'Watsonville': 3},
 {'Oakland': ['Local_or_Municipal_2020_08_03',
   'Local_or_Municipal_2020_04_14',
   'Local_or_Municipal_2019_08_27',
   'Local_or_Municipal_2019_08_13'],
  'San Leandro': ['Local_or_Municipal_2020_08_03',
   'Local_or_Municipal_2019_08_13',
   'Local_or_Municipal_2019_06_04',
   'Local_or_Municipal_2019_03_05'],
  'Livermore': ['Local_or_Municipal_2020_08_03',
   'Local_or_Municipal_2020_04_14',
   'Local_or_Municipal_2019_03_05',
   'Local_or_Municipal_2017_08_29'],
  'Berkeley': ['Local_or_Municipal_2021_06_01',
   'Local_or_Municipal_2021_05_11',
   'Local_or_Municipal_2021_03_02',
   'Local_or_Municipal_2020_08_03'],
  'Albany': ['Local_or_Municipal_2021_03_02',
   'Local_or_Municipal_2020_08_03',
   'Local_or_Municipal_2019_08_13',
   'Local_or_Municipal_2019_06_04'],
  'San Francisco': ['Local_or_Municipal_2021_07_20',
   'Local_or_Municipal_2021_05_11',
   'Local_or_Municipal_2021_03_02',
   'Local_or_Municipal_2020_08_03'],
  'San Di

In [174]:
del state_voterhistory_next_top10

del merged_file_next_top10
gc.collect()

666

In [175]:
LM_cols_next_top10 = get_recent_date('Local_or_Municipal', state_voterhistory, 30, 40)
print(LM_cols_next_top10)

needed_variables = ['LALVOTERID'] + LM_cols_next_top10

state_voterhistory_next_top10 = pd.read_csv(f'{filepath}VM2--CA--2022-04-25-VOTEHISTORY.tab',
                                 sep='\t', dtype=str, encoding='unicode_escape',
                                 usecols=needed_variables)

# inorder to decrease computation time we will remove all cities for which we have found the dates
state_demographic_next_top10 = state_demographic_subset[state_demographic_subset['Residence_Addresses_City'].\
                                                        isin(list(list_city_cnt_LM_dates_next_top10_2.keys()))]
print(len(state_demographic_subset_next_top10))

merged_file_next_top10 = pd.merge(state_voterhistory_next_top10, state_demographic_subset_next_top10,
                       how='inner', left_on='LALVOTERID', right_on='LALVOTERID')
print(merged_file_next_top10.shape)
print("number of unique cities:", merged_file_next_top10.Residence_Addresses_City.nunique())

list_city_cnt_LM_dates_next_top10_3, list_city_4_LM_dates_next_top10_3 = get_list_elec_dates(merged_file_next_top10, LM_cols_next_top10, 
                                                                                         list_city_cnt_LM_dates_next_top10_2, list_city_4_LM_dates_next_top10_2)


['Local_or_Municipal_2017_04_18', 'Local_or_Municipal_2017_04_11', 'Local_or_Municipal_2017_03_14', 'Local_or_Municipal_2017_03_07', 'Local_or_Municipal_2017_02_28', 'Local_or_Municipal_2017_01_10', 'Local_or_Municipal_2016_10_11', 'Local_or_Municipal_2016_06_28', 'Local_or_Municipal_2016_06_14', 'Local_or_Municipal_2016_06_07']
1606890
(1606890, 22)
number of unique cities: 30
No cities found for  Local_or_Municipal_2017_01_10
No cities found for  Local_or_Municipal_2016_10_11
No cities found for  Local_or_Municipal_2016_06_28
No cities found for  Local_or_Municipal_2016_06_14

Need to find more local and municipal election dates!!!


({'Huntington Park': 3},
 {'Oakland': ['Local_or_Municipal_2020_08_03',
   'Local_or_Municipal_2020_04_14',
   'Local_or_Municipal_2019_08_27',
   'Local_or_Municipal_2019_08_13'],
  'San Leandro': ['Local_or_Municipal_2020_08_03',
   'Local_or_Municipal_2019_08_13',
   'Local_or_Municipal_2019_06_04',
   'Local_or_Municipal_2019_03_05'],
  'Livermore': ['Local_or_Municipal_2020_08_03',
   'Local_or_Municipal_2020_04_14',
   'Local_or_Municipal_2019_03_05',
   'Local_or_Municipal_2017_08_29'],
  'Berkeley': ['Local_or_Municipal_2021_06_01',
   'Local_or_Municipal_2021_05_11',
   'Local_or_Municipal_2021_03_02',
   'Local_or_Municipal_2020_08_03'],
  'Albany': ['Local_or_Municipal_2021_03_02',
   'Local_or_Municipal_2020_08_03',
   'Local_or_Municipal_2019_08_13',
   'Local_or_Municipal_2019_06_04'],
  'San Francisco': ['Local_or_Municipal_2021_07_20',
   'Local_or_Municipal_2021_05_11',
   'Local_or_Municipal_2021_03_02',
   'Local_or_Municipal_2020_08_03'],
  'San Diego': ['Local_or_M

In [179]:
if len(list_city_cnt_LM_dates_next_top10_3) == 0:
    print("\nAll local and municipal election dates found!")
else:
    print("\nNeed to find more local and municipal election dates!!!")

list_city_cnt_LM_dates_next_top10_3, list_city_4_LM_dates_next_top10_3


Need to find more local and municipal election dates!!!


({'Huntington Park': 3},
 {'Oakland': ['Local_or_Municipal_2020_08_03',
   'Local_or_Municipal_2020_04_14',
   'Local_or_Municipal_2019_08_27',
   'Local_or_Municipal_2019_08_13'],
  'San Leandro': ['Local_or_Municipal_2020_08_03',
   'Local_or_Municipal_2019_08_13',
   'Local_or_Municipal_2019_06_04',
   'Local_or_Municipal_2019_03_05'],
  'Livermore': ['Local_or_Municipal_2020_08_03',
   'Local_or_Municipal_2020_04_14',
   'Local_or_Municipal_2019_03_05',
   'Local_or_Municipal_2017_08_29'],
  'Berkeley': ['Local_or_Municipal_2021_06_01',
   'Local_or_Municipal_2021_05_11',
   'Local_or_Municipal_2021_03_02',
   'Local_or_Municipal_2020_08_03'],
  'Albany': ['Local_or_Municipal_2021_03_02',
   'Local_or_Municipal_2020_08_03',
   'Local_or_Municipal_2019_08_13',
   'Local_or_Municipal_2019_06_04'],
  'San Francisco': ['Local_or_Municipal_2021_07_20',
   'Local_or_Municipal_2021_05_11',
   'Local_or_Municipal_2021_03_02',
   'Local_or_Municipal_2020_08_03'],
  'San Diego': ['Local_or_M

In [180]:
del state_voterhistory_next_top10

del merged_file_next_top10
gc.collect()

13

In [181]:
LM_cols_next_top10 = get_recent_date('Local_or_Municipal', state_voterhistory, 40, 50)
print(LM_cols_next_top10)

needed_variables = ['LALVOTERID'] + LM_cols_next_top10

state_voterhistory_next_top10 = pd.read_csv(f'{filepath}VM2--CA--2022-04-25-VOTEHISTORY.tab',
                                 sep='\t', dtype=str, encoding='unicode_escape',
                                 usecols=needed_variables)

# inorder to decrease computation time we will remove all cities for which we have found the dates
state_demographic_next_top10 = state_demographic_subset[state_demographic_subset['Residence_Addresses_City'].\
                                                        isin(list(list_city_cnt_LM_dates_next_top10_3.keys()))]
print(len(state_demographic_subset_next_top10))

merged_file_next_top10 = pd.merge(state_voterhistory_next_top10, state_demographic_subset_next_top10,
                       how='inner', left_on='LALVOTERID', right_on='LALVOTERID')
print(merged_file_next_top10.shape)
print("number of unique cities:", merged_file_next_top10.Residence_Addresses_City.nunique())

list_city_cnt_LM_dates_next_top10_4, list_city_4_LM_dates_next_top10_4 = get_list_elec_dates(merged_file_next_top10, LM_cols_next_top10, 
                                                                                         list_city_cnt_LM_dates_next_top10_3, list_city_4_LM_dates_next_top10_3)

if len(list_city_cnt_LM_dates_next_top10_4) == 0:
    print("\nAll local and municipal election dates found!")
else:
    print("\nNeed to find more local and municipal election dates!!!")

list_city_cnt_LM_dates_next_top10_4, list_city_4_LM_dates_next_top10_4

['Local_or_Municipal_2016_05_03', 'Local_or_Municipal_2016_04_12', 'Local_or_Municipal_2016_03_17', 'Local_or_Municipal_2016_03_08', 'Local_or_Municipal_2016_02_02', 'Local_or_Municipal_2015_10_06', 'Local_or_Municipal_2015_08_25', 'Local_or_Municipal_2015_06_02', 'Local_or_Municipal_2015_05_19', 'Local_or_Municipal_2015_05_05']
1606890
(1606890, 22)
number of unique cities: 30
No cities found for  Local_or_Municipal_2016_05_03

All local and municipal election dates found!


({},
 {'Oakland': ['Local_or_Municipal_2020_08_03',
   'Local_or_Municipal_2020_04_14',
   'Local_or_Municipal_2019_08_27',
   'Local_or_Municipal_2019_08_13'],
  'San Leandro': ['Local_or_Municipal_2020_08_03',
   'Local_or_Municipal_2019_08_13',
   'Local_or_Municipal_2019_06_04',
   'Local_or_Municipal_2019_03_05'],
  'Livermore': ['Local_or_Municipal_2020_08_03',
   'Local_or_Municipal_2020_04_14',
   'Local_or_Municipal_2019_03_05',
   'Local_or_Municipal_2017_08_29'],
  'Berkeley': ['Local_or_Municipal_2021_06_01',
   'Local_or_Municipal_2021_05_11',
   'Local_or_Municipal_2021_03_02',
   'Local_or_Municipal_2020_08_03'],
  'Albany': ['Local_or_Municipal_2021_03_02',
   'Local_or_Municipal_2020_08_03',
   'Local_or_Municipal_2019_08_13',
   'Local_or_Municipal_2019_06_04'],
  'San Francisco': ['Local_or_Municipal_2021_07_20',
   'Local_or_Municipal_2021_05_11',
   'Local_or_Municipal_2021_03_02',
   'Local_or_Municipal_2020_08_03'],
  'San Diego': ['Local_or_Municipal_2021_06_08'

In [195]:
GE_dates_df = pd.DataFrame(list_city_4_GE_dates.items(), columns=['city', 'GE_dates'])
LM_dates_df = pd.DataFrame(list_city_4_LM_dates_next_top10_4.items(), columns=['city', 'LM_dates'])
GE_LM_dates_df = GE_dates_df.merge(LM_dates_df, on = "city")
GE_LM_dates_df.head()

Unnamed: 0,city,GE_dates,LM_dates
0,Oakland,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2020_08_03, Local_or_Munic..."
1,San Leandro,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2020_08_03, Local_or_Munic..."
2,Livermore,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2020_08_03, Local_or_Munic..."
3,Berkeley,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2021_06_01, Local_or_Munic..."
4,Albany,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2021_03_02, Local_or_Munic..."


In [197]:
GE_LM_dates_df.to_parquet('VM2--CA--2022-04-25-GE_LM_dates_df.parquet')

# 4. Merge Vote History and Demographic Data

In [16]:
# load the list of election dates for each city
GE_LM_dates_df = pd.read_parquet('VM2--CA--2022-04-25-GE_LM_dates_df.parquet')
GE_LM_dates_df

Unnamed: 0,city,GE_dates,LM_dates
0,Oakland,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2020_08_03, Local_or_Munic..."
1,San Leandro,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2020_08_03, Local_or_Munic..."
2,Livermore,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2020_08_03, Local_or_Munic..."
3,Berkeley,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2021_06_01, Local_or_Munic..."
4,Albany,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2021_03_02, Local_or_Munic..."
5,San Francisco,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2021_07_20, Local_or_Munic..."
6,San Diego,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2021_06_08, Local_or_Munic..."
7,San Jose,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2021_06_08, Local_or_Munic..."
8,Fresno,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2021_06_01, Local_or_Munic..."
9,Eureka,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2021_06_08, Local_or_Munic..."


In [13]:
e_dates = set()
for v in GE_LM_dates_df['GE_dates']:
    for vv in v:
        e_dates.add(vv)
for v in GE_LM_dates_df['LM_dates'] :
    for vv in v:
        e_dates.add(vv)

In [14]:
list(e_dates)

['Local_or_Municipal_2017_05_02',
 'Local_or_Municipal_2020_04_14',
 'Local_or_Municipal_2019_06_04',
 'General_2018_11_06',
 'Local_or_Municipal_2019_08_27',
 'Local_or_Municipal_2021_06_01',
 'Local_or_Municipal_2017_03_07',
 'Local_or_Municipal_2018_07_24',
 'Local_or_Municipal_2017_08_29',
 'Local_or_Municipal_2019_03_05',
 'Local_or_Municipal_2021_07_20',
 'Local_or_Municipal_2020_08_03',
 'Local_or_Municipal_2021_06_08',
 'Local_or_Municipal_2017_05_09',
 'General_2014_11_04',
 'General_2016_11_08',
 'Local_or_Municipal_2016_04_12',
 'Local_or_Municipal_2019_08_13',
 'Local_or_Municipal_2021_05_11',
 'Local_or_Municipal_2021_03_02',
 'Local_or_Municipal_2021_04_20',
 'Local_or_Municipal_2017_06_06',
 'Local_or_Municipal_2019_04_16',
 'Local_or_Municipal_2018_04_10',
 'General_2020_11_03']

In [15]:
# load the VOTE HISTORY data for selected election dates only

needed_variables = ['LALVOTERID'] + list(e_dates)

state_voterhistory_4_dates = pd.read_csv(f'{filepath}VM2--CA--2022-04-25-VOTEHISTORY.tab',
                                 sep='\t', dtype=str, encoding='unicode_escape',
                                 usecols=needed_variables)
                                
state_voterhistory_4_dates.head(5)

Unnamed: 0,LALVOTERID,Local_or_Municipal_2021_07_20,Local_or_Municipal_2021_06_08,Local_or_Municipal_2021_06_01,Local_or_Municipal_2021_05_11,Local_or_Municipal_2021_04_20,Local_or_Municipal_2021_03_02,General_2020_11_03,Local_or_Municipal_2020_08_03,Local_or_Municipal_2020_04_14,...,Local_or_Municipal_2018_07_24,Local_or_Municipal_2018_04_10,Local_or_Municipal_2017_08_29,Local_or_Municipal_2017_06_06,Local_or_Municipal_2017_05_09,Local_or_Municipal_2017_05_02,Local_or_Municipal_2017_03_07,General_2016_11_08,Local_or_Municipal_2016_04_12,General_2014_11_04
0,LALCA453164106,,,,,,,Y,,,...,,,,,,,,Y,,
1,LALCA453008306,,,,,,,,,,...,,,,,,,,,,
2,LALCA22129469,,,,,,,Y,,,...,,,,,,,,Y,,Y
3,LALCA549803906,,,,,,,Y,,,...,,,,,,,,,,
4,LALCA24729024,,,,,,,,,,...,,,,,,,,,,


In [17]:
merged_file = pd.merge(state_voterhistory_4_dates, state_demographic_subset,
                       how='inner', left_on='LALVOTERID', right_on='LALVOTERID')

print(merged_file.shape)

print("number of unique cities:", merged_file.Residence_Addresses_City.nunique())

merged_file.head(5)

(3918925, 37)
number of unique cities: 39


Unnamed: 0,LALVOTERID,Local_or_Municipal_2021_07_20,Local_or_Municipal_2021_06_08,Local_or_Municipal_2021_06_01,Local_or_Municipal_2021_05_11,Local_or_Municipal_2021_04_20,Local_or_Municipal_2021_03_02,General_2020_11_03,Local_or_Municipal_2020_08_03,Local_or_Municipal_2020_04_14,...,Voters_Gender,Voters_Age,Parties_Description,EthnicGroups_EthnicGroup1Desc,Voters_OfficialRegDate,County,CommercialData_Education,CommercialData_EstimatedHHIncome,FECDonors_NumberOfDonations,FECDonors_TotalDonationsAmount
0,LALCA453008306,,,,,,,,,,...,F,26,Non-Partisan,Likely African-American,04/01/2021,ALAMEDA,,,,
1,LALCA22129469,,,,,,,Y,,,...,F,47,Democratic,European,11/16/2021,ALAMEDA,HS Diploma - Extremely Likely,,,
2,LALCA24729024,,,,,,,,,,...,F,56,Democratic,European,02/28/2016,ALAMEDA,HS Diploma - Extremely Likely,,,
3,LALCA22466723,,,,,,,,,,...,F,38,Republican,European,11/01/2021,ALAMEDA,,,,
4,LALCA22466636,,,,,,,Y,,,...,M,63,Democratic,European,12/07/2021,ALAMEDA,,,,


In [18]:
merged_file.to_parquet('VM2--CA--2022-04-25-merged_VOTE_DEMO.parquet')

# 5. Calculate voter turnout using merged data

In [4]:
# import pandas as pd
# merged_file = pd.read_parquet('VM2--CA--2022-04-25-merged_VOTE_DEMO.parquet')

  merged_file = pd.read_csv('VM2--CA--2022-04-25-MERGED.csv')


In [5]:
# merged_file.head()

Unnamed: 0,index,LALVOTERID,Local_or_Municipal_2021_08_31,Local_or_Municipal_2021_07_20,Local_or_Municipal_2021_06_08,Local_or_Municipal_2021_06_01,General_2020_11_03,General_2018_11_06,General_2016_11_08,General_2014_11_04,Residence_Addresses_City,EthnicGroups_EthnicGroup1Desc,Voters_OfficialRegDate
0,0,LALCA453008306,,,,,,Y,,,Oakland,Likely African-American,04/01/2021
1,1,LALCA22129469,,,,,Y,Y,Y,Y,Oakland,European,11/16/2021
2,2,LALCA24729024,,,,,,,,,San Leandro,European,02/28/2016
3,3,LALCA22466723,,,,,,,,,Livermore,European,11/01/2021
4,4,LALCA22466636,,,,,Y,Y,Y,Y,Livermore,European,12/07/2021


In [20]:
GE_cols = [col for col in merged_file.columns if col.startswith('General')]
print(GE_cols)
LM_cols = [col for col in merged_file.columns if col.startswith('Local_or_Municipal')]
print(LM_cols)

['General_2020_11_03', 'General_2018_11_06', 'General_2016_11_08', 'General_2014_11_04']
['Local_or_Municipal_2021_07_20', 'Local_or_Municipal_2021_06_08', 'Local_or_Municipal_2021_06_01', 'Local_or_Municipal_2021_05_11', 'Local_or_Municipal_2021_04_20', 'Local_or_Municipal_2021_03_02', 'Local_or_Municipal_2020_08_03', 'Local_or_Municipal_2020_04_14', 'Local_or_Municipal_2019_08_27', 'Local_or_Municipal_2019_08_13', 'Local_or_Municipal_2019_06_04', 'Local_or_Municipal_2019_04_16', 'Local_or_Municipal_2019_03_05', 'Local_or_Municipal_2018_07_24', 'Local_or_Municipal_2018_04_10', 'Local_or_Municipal_2017_08_29', 'Local_or_Municipal_2017_06_06', 'Local_or_Municipal_2017_05_09', 'Local_or_Municipal_2017_05_02', 'Local_or_Municipal_2017_03_07', 'Local_or_Municipal_2016_04_12']


In [21]:
# fill NA values with "N" to make it easier to compare  with "Y"
merged_file[GE_cols+LM_cols] = merged_file[GE_cols+LM_cols].fillna('N')
merged_file.head()

Unnamed: 0,LALVOTERID,Local_or_Municipal_2021_07_20,Local_or_Municipal_2021_06_08,Local_or_Municipal_2021_06_01,Local_or_Municipal_2021_05_11,Local_or_Municipal_2021_04_20,Local_or_Municipal_2021_03_02,General_2020_11_03,Local_or_Municipal_2020_08_03,Local_or_Municipal_2020_04_14,...,Voters_Gender,Voters_Age,Parties_Description,EthnicGroups_EthnicGroup1Desc,Voters_OfficialRegDate,County,CommercialData_Education,CommercialData_EstimatedHHIncome,FECDonors_NumberOfDonations,FECDonors_TotalDonationsAmount
0,LALCA453008306,N,N,N,N,N,N,N,N,N,...,F,26,Non-Partisan,Likely African-American,04/01/2021,ALAMEDA,,,,
1,LALCA22129469,N,N,N,N,N,N,Y,N,N,...,F,47,Democratic,European,11/16/2021,ALAMEDA,HS Diploma - Extremely Likely,,,
2,LALCA24729024,N,N,N,N,N,N,N,N,N,...,F,56,Democratic,European,02/28/2016,ALAMEDA,HS Diploma - Extremely Likely,,,
3,LALCA22466723,N,N,N,N,N,N,N,N,N,...,F,38,Republican,European,11/01/2021,ALAMEDA,,,,
4,LALCA22466636,N,N,N,N,N,N,Y,N,N,...,M,63,Democratic,European,12/07/2021,ALAMEDA,,,,


In [22]:
# We created the dataframe below in order to easily calculate perc_turnout when no one voted
list_ethnic_city = merged_file[['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc']].drop_duplicates()
list_ethnic_city_No = list_ethnic_city.copy()
list_ethnic_city_No['voted'] = 'N'
list_ethnic_city_Yes = list_ethnic_city.copy()
list_ethnic_city_Yes['voted'] = 'Y'
list_ethnic_city = pd.concat([list_ethnic_city_No, list_ethnic_city_Yes])

In [23]:
list_ethnic_city

Unnamed: 0,Residence_Addresses_City,EthnicGroups_EthnicGroup1Desc,voted
0,Oakland,Likely African-American,N
1,Oakland,European,N
2,San Leandro,European,N
3,Livermore,European,N
7,Oakland,East and South Asian,N
...,...,...,...
3777199,Santa Rosa,European,Y
3777202,Santa Rosa,Hispanic and Portuguese,Y
3777205,Santa Rosa,East and South Asian,Y
3777260,Santa Rosa,Likely African-American,Y


In [24]:
# we also need the total voters information per city and ethnicity
total_city_ethnic = merged_file.groupby(['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc']).size().reset_index()
total_city_ethnic.columns = ['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc', 'total_voters']
total_city_ethnic  = total_city_ethnic.merge(list_ethnic_city, on = ['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc'])
total_city_ethnic

Unnamed: 0,Residence_Addresses_City,EthnicGroups_EthnicGroup1Desc,total_voters,voted
0,Albany,East and South Asian,2405,N
1,Albany,East and South Asian,2405,Y
2,Albany,European,6169,N
3,Albany,European,6169,Y
4,Albany,Hispanic and Portuguese,1035,N
...,...,...,...,...
307,Whittier,European,26477,Y
308,Whittier,Hispanic and Portuguese,76334,N
309,Whittier,Hispanic and Portuguese,76334,Y
310,Whittier,Likely African-American,214,N


In [25]:
elec_date_cols = GE_cols+LM_cols
for i in range(len(elec_date_cols)):
    col = elec_date_cols[i]
    voter_turnout_stats = merged_file.groupby(['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc', col]).size().agg(
      {'voted_voters': lambda x: x, 'perc_turnout':lambda x: x / x.sum(level=[0,1])}
      ).unstack(level=0).reset_index()
    
    # 'voted' is either 'Y' or 'N'
    voter_turnout_stats = voter_turnout_stats.rename(columns = {col: 'voted'})
    voter_turnout_stats = total_city_ethnic.merge(voter_turnout_stats, 
                                                 how = 'left',
                                                 on = ['Residence_Addresses_City', 'EthnicGroups_EthnicGroup1Desc', 'voted']) 
    voter_turnout_stats = voter_turnout_stats.replace('East and South Asian', 'asian')
    voter_turnout_stats = voter_turnout_stats.replace('European', 'white')
    voter_turnout_stats = voter_turnout_stats.replace('Hispanic and Portuguese', 'hispanic')
    voter_turnout_stats = voter_turnout_stats.replace('Likely African-American', 'black')
    
    voter_turnout_stats['elec_date'] = col[len(col)-10:]
    voter_turnout_stats['elec_year'] = col[len(col)-10:len(col)-6]
    voter_turnout_stats['elec_type'] = col[:len(col)-11]
    
    voter_turnout_stats[['voted_voters', 'perc_turnout']] = voter_turnout_stats[['voted_voters', 'perc_turnout']].fillna(0)
    voter_turnout_stats = voter_turnout_stats[voter_turnout_stats['voted'] == 'Y']    
    pivot_df = voter_turnout_stats.pivot(index = ['elec_type','elec_year', 'elec_date', 'Residence_Addresses_City'],
                                    columns='EthnicGroups_EthnicGroup1Desc', 
                                    values=['total_voters', 'voted_voters', 'perc_turnout']).reset_index()
    pivot_df.columns = pivot_df.columns.map('_'.join)
    
    # stack all types of election into one dataframe 
    if i == 0:
        voter_turnout_merge = pivot_df.copy() 
    else:
        voter_turnout_merge = pd.concat([voter_turnout_merge, pivot_df])

  {'voted_voters': lambda x: x, 'perc_turnout':lambda x: x / x.sum(level=[0,1])}
  {'voted_voters': lambda x: x, 'perc_turnout':lambda x: x / x.sum(level=[0,1])}
  {'voted_voters': lambda x: x, 'perc_turnout':lambda x: x / x.sum(level=[0,1])}
  {'voted_voters': lambda x: x, 'perc_turnout':lambda x: x / x.sum(level=[0,1])}
  {'voted_voters': lambda x: x, 'perc_turnout':lambda x: x / x.sum(level=[0,1])}
  {'voted_voters': lambda x: x, 'perc_turnout':lambda x: x / x.sum(level=[0,1])}
  {'voted_voters': lambda x: x, 'perc_turnout':lambda x: x / x.sum(level=[0,1])}
  {'voted_voters': lambda x: x, 'perc_turnout':lambda x: x / x.sum(level=[0,1])}
  {'voted_voters': lambda x: x, 'perc_turnout':lambda x: x / x.sum(level=[0,1])}
  {'voted_voters': lambda x: x, 'perc_turnout':lambda x: x / x.sum(level=[0,1])}
  {'voted_voters': lambda x: x, 'perc_turnout':lambda x: x / x.sum(level=[0,1])}
  {'voted_voters': lambda x: x, 'perc_turnout':lambda x: x / x.sum(level=[0,1])}
  {'voted_voters': lambda x:

  {'voted_voters': lambda x: x, 'perc_turnout':lambda x: x / x.sum(level=[0,1])}
  {'voted_voters': lambda x: x, 'perc_turnout':lambda x: x / x.sum(level=[0,1])}


In [26]:
voter_turnout_merge

Unnamed: 0,elec_type_,elec_year_,elec_date_,Residence_Addresses_City_,total_voters_asian,total_voters_black,total_voters_hispanic,total_voters_white,voted_voters_asian,voted_voters_black,voted_voters_hispanic,voted_voters_white,perc_turnout_asian,perc_turnout_black,perc_turnout_hispanic,perc_turnout_white
0,General,2020,2020_11_03,Albany,2405.0,147.0,1035.0,6169.0,1982.0,120.0,896.0,5517.0,0.824116,0.816327,0.865700,0.894310
1,General,2020,2020_11_03,Alhambra,17451.0,191.0,16596.0,7359.0,12135.0,139.0,12815.0,5918.0,0.695376,0.727749,0.772174,0.804185
2,General,2020,2020_11_03,Anaheim,26340.0,1211.0,70052.0,54644.0,20542.0,930.0,50500.0,46119.0,0.779879,0.767960,0.720893,0.843990
3,General,2020,2020_11_03,Bellflower,2153.0,3614.0,19899.0,10792.0,1465.0,2705.0,13642.0,8012.0,0.680446,0.748478,0.685562,0.742402
4,General,2020,2020_11_03,Berkeley,8549.0,5942.0,6388.0,39425.0,6659.0,4668.0,5113.0,33180.0,0.778922,0.785594,0.800407,0.841598
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34,Local_or_Municipal,2016,2016_04_12,Santa Clarita,1197.0,239.0,4511.0,10256.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000
35,Local_or_Municipal,2016,2016_04_12,Santa Rosa,4677.0,832.0,23575.0,78673.0,0.0,0.0,2.0,0.0,0.000000,0.000000,0.000085,0.000000
36,Local_or_Municipal,2016,2016_04_12,Solvang,102.0,9.0,845.0,4238.0,0.0,0.0,0.0,1.0,0.000000,0.000000,0.000000,0.000236
37,Local_or_Municipal,2016,2016_04_12,Watsonville,853.0,42.0,18695.0,11481.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000


In [27]:
# 1. convert the data type of columns with information about the total voter and the number of voters who voted into integer 

cnt_cols = [col for col in voter_turnout_merge.columns if 'total_voters' in col or 'voted_voters' in col]
    
for col in cnt_cols:
    voter_turnout_merge[col] = voter_turnout_merge[col].astype(int)

voter_turnout_merge    

Unnamed: 0,elec_type_,elec_year_,elec_date_,Residence_Addresses_City_,total_voters_asian,total_voters_black,total_voters_hispanic,total_voters_white,voted_voters_asian,voted_voters_black,voted_voters_hispanic,voted_voters_white,perc_turnout_asian,perc_turnout_black,perc_turnout_hispanic,perc_turnout_white
0,General,2020,2020_11_03,Albany,2405,147,1035,6169,1982,120,896,5517,0.824116,0.816327,0.865700,0.894310
1,General,2020,2020_11_03,Alhambra,17451,191,16596,7359,12135,139,12815,5918,0.695376,0.727749,0.772174,0.804185
2,General,2020,2020_11_03,Anaheim,26340,1211,70052,54644,20542,930,50500,46119,0.779879,0.767960,0.720893,0.843990
3,General,2020,2020_11_03,Bellflower,2153,3614,19899,10792,1465,2705,13642,8012,0.680446,0.748478,0.685562,0.742402
4,General,2020,2020_11_03,Berkeley,8549,5942,6388,39425,6659,4668,5113,33180,0.778922,0.785594,0.800407,0.841598
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34,Local_or_Municipal,2016,2016_04_12,Santa Clarita,1197,239,4511,10256,0,0,0,0,0.000000,0.000000,0.000000,0.000000
35,Local_or_Municipal,2016,2016_04_12,Santa Rosa,4677,832,23575,78673,0,0,2,0,0.000000,0.000000,0.000085,0.000000
36,Local_or_Municipal,2016,2016_04_12,Solvang,102,9,845,4238,0,0,0,1,0.000000,0.000000,0.000000,0.000236
37,Local_or_Municipal,2016,2016_04_12,Watsonville,853,42,18695,11481,0,0,0,0,0.000000,0.000000,0.000000,0.000000


In [None]:
# remove irrelevent rows  (cases where the election date was not the among four most recent election dates)
voter_turnout_merge

In [32]:
GE_LM_dates_df.head()

Unnamed: 0,city,GE_dates,LM_dates
0,Oakland,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2020_08_03, Local_or_Munic..."
1,San Leandro,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2020_08_03, Local_or_Munic..."
2,Livermore,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2020_08_03, Local_or_Munic..."
3,Berkeley,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2021_06_01, Local_or_Munic..."
4,Albany,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2021_03_02, Local_or_Munic..."


In [80]:
most_recent_elec_df_GE = GE_LM_dates_df[['city', 'GE_dates']]
most_recent_elec_df_GE = most_recent_elec_df_GE.explode('GE_dates')
most_recent_elec_df_GE = most_recent_elec_df_GE.reset_index(drop =True)
most_recent_elec_df_GE.columns = ['Residence_Addresses_City_', 'elec_date_']
most_recent_elec_df_GE['elec_date_'] = most_recent_elec_df_GE['elec_date_'].str[-10:]
most_recent_elec_df_GE['elec_type_'] = 'General'
most_recent_elec_df_GE

Unnamed: 0,Residence_Addresses_City_,elec_date_,elec_type_
0,Oakland,2020_11_03,General
1,Oakland,2018_11_06,General
2,Oakland,2016_11_08,General
3,Oakland,2014_11_04,General
4,San Leandro,2020_11_03,General
...,...,...,...
151,Davis,2014_11_04,General
152,Santa Rosa,2020_11_03,General
153,Santa Rosa,2018_11_06,General
154,Santa Rosa,2016_11_08,General


In [81]:
most_recent_elec_df_LM = GE_LM_dates_df[['city', 'LM_dates']]
most_recent_elec_df_LM = most_recent_elec_df_LM.explode('LM_dates')
most_recent_elec_df_LM = most_recent_elec_df_LM.reset_index(drop =True)
most_recent_elec_df_LM.columns = ['Residence_Addresses_City_', 'elec_date_']
most_recent_elec_df_LM['elec_date_'] = most_recent_elec_df_LM['elec_date_'].str[-10:]
most_recent_elec_df_LM['elec_type_'] = 'Local_or_Municipal'
most_recent_elec_df_LM

Unnamed: 0,Residence_Addresses_City_,elec_date_,elec_type_
0,Oakland,2020_08_03,Local_or_Municipal
1,Oakland,2020_04_14,Local_or_Municipal
2,Oakland,2019_08_27,Local_or_Municipal
3,Oakland,2019_08_13,Local_or_Municipal
4,San Leandro,2020_08_03,Local_or_Municipal
...,...,...,...
151,Davis,2019_08_13,Local_or_Municipal
152,Santa Rosa,2020_08_03,Local_or_Municipal
153,Santa Rosa,2020_04_14,Local_or_Municipal
154,Santa Rosa,2019_08_13,Local_or_Municipal


In [82]:
most_recent_elec_df = pd.concat([most_recent_elec_df_GE,most_recent_elec_df_LM] )
print(len(most_recent_elec_df))
most_recent_elec_df

312


Unnamed: 0,Residence_Addresses_City_,elec_date_,elec_type_
0,Oakland,2020_11_03,General
1,Oakland,2018_11_06,General
2,Oakland,2016_11_08,General
3,Oakland,2014_11_04,General
4,San Leandro,2020_11_03,General
...,...,...,...
151,Davis,2019_08_13,Local_or_Municipal
152,Santa Rosa,2020_08_03,Local_or_Municipal
153,Santa Rosa,2020_04_14,Local_or_Municipal
154,Santa Rosa,2019_08_13,Local_or_Municipal


In [83]:
voter_turnout_merge.head()

Unnamed: 0,elec_type_,elec_year_,elec_date_,Residence_Addresses_City_,total_voters_asian,total_voters_black,total_voters_hispanic,total_voters_white,voted_voters_asian,voted_voters_black,voted_voters_hispanic,voted_voters_white,perc_turnout_asian,perc_turnout_black,perc_turnout_hispanic,perc_turnout_white
0,General,2020,2020_11_03,Albany,2405,147,1035,6169,1982,120,896,5517,0.824116,0.816327,0.8657,0.89431
1,General,2020,2020_11_03,Alhambra,17451,191,16596,7359,12135,139,12815,5918,0.695376,0.727749,0.772174,0.804185
2,General,2020,2020_11_03,Anaheim,26340,1211,70052,54644,20542,930,50500,46119,0.779879,0.76796,0.720893,0.84399
3,General,2020,2020_11_03,Bellflower,2153,3614,19899,10792,1465,2705,13642,8012,0.680446,0.748478,0.685562,0.742402
4,General,2020,2020_11_03,Berkeley,8549,5942,6388,39425,6659,4668,5113,33180,0.778922,0.785594,0.800407,0.841598


In [84]:
voter_turnout_most_recent = pd.merge(most_recent_elec_df, voter_turnout_merge, 
                                     how = "left",
                                     on = ['elec_type_', 'elec_date_', 'Residence_Addresses_City_'])
voter_turnout_most_recent

Unnamed: 0,Residence_Addresses_City_,elec_date_,elec_type_,elec_year_,total_voters_asian,total_voters_black,total_voters_hispanic,total_voters_white,voted_voters_asian,voted_voters_black,voted_voters_hispanic,voted_voters_white,perc_turnout_asian,perc_turnout_black,perc_turnout_hispanic,perc_turnout_white
0,Oakland,2020_11_03,General,2020,30600,61476,37174,83122,23041,45891,26954,69989,0.752974,0.746486,0.725077,0.842003
1,Oakland,2018_11_06,General,2018,30600,61476,37174,83122,14972,35012,17857,57872,0.489281,0.569523,0.480363,0.696230
2,Oakland,2016_11_08,General,2016,30600,61476,37174,83122,16057,37256,19792,57968,0.524739,0.606025,0.532415,0.697385
3,Oakland,2014_11_04,General,2014,30600,61476,37174,83122,8145,21265,8235,35411,0.266176,0.345907,0.221526,0.426012
4,San Leandro,2020_11_03,General,2020,12705,5596,16028,17780,9229,4299,11984,14638,0.726407,0.768227,0.747692,0.823285
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
307,Davis,2019_08_13,Local_or_Municipal,2019,4757,251,4971,23906,1,0,0,2,0.000210,0.000000,0.000000,0.000084
308,Santa Rosa,2020_08_03,Local_or_Municipal,2020,4677,832,23575,78673,1,0,2,21,0.000214,0.000000,0.000085,0.000267
309,Santa Rosa,2020_04_14,Local_or_Municipal,2020,4677,832,23575,78673,0,0,1,3,0.000000,0.000000,0.000042,0.000038
310,Santa Rosa,2019_08_13,Local_or_Municipal,2019,4677,832,23575,78673,0,0,2,4,0.000000,0.000000,0.000085,0.000051


In [86]:
voter_turnout_most_recent.to_csv('voter_turnout_most_recent.csv', index=False)

In [28]:
# 2. for each of the "count" columns find the number of 0 values 
# because if 0 voter turnout then may be the election date that was selected was not the election date for that city
no_voter_turnout = voter_turnout_merge[(voter_turnout_merge['perc_turnout_asian'] == 0) &
                                       (voter_turnout_merge['perc_turnout_black'] == 0) &
                                       (voter_turnout_merge['perc_turnout_hispanic'] == 0) &
                                       (voter_turnout_merge['perc_turnout_white'] == 0)]

no_voter_turnout[['elec_type_', 'elec_date_', 'Residence_Addresses_City_']]

Unnamed: 0,elec_type_,elec_date_,Residence_Addresses_City_
0,Local_or_Municipal,2021_07_20,Albany
1,Local_or_Municipal,2021_07_20,Alhambra
4,Local_or_Municipal,2021_07_20,Berkeley
6,Local_or_Municipal,2021_07_20,Burbank
7,Local_or_Municipal,2021_07_20,Calabasas
...,...,...,...
19,Local_or_Municipal,2016_04_12,Merced
20,Local_or_Municipal,2016_04_12,Montebello
28,Local_or_Municipal,2016_04_12,San Carlos
34,Local_or_Municipal,2016_04_12,Santa Clarita


In [85]:
#should be empty dataframe because of the way we have filitered the dataframe
no_voter_turnout = voter_turnout_merge[(voter_turnout_most_recent['perc_turnout_asian'] == 0) &
                                       (voter_turnout_most_recent['perc_turnout_black'] == 0) &
                                       (voter_turnout_most_recent['perc_turnout_hispanic'] == 0) &
                                       (voter_turnout_most_recent['perc_turnout_white'] == 0)]

no_voter_turnout[['elec_type_', 'elec_date_', 'Residence_Addresses_City_']]

  no_voter_turnout = voter_turnout_merge[(voter_turnout_most_recent['perc_turnout_asian'] == 0) &


Unnamed: 0,elec_type_,elec_date_,Residence_Addresses_City_


Unnamed: 0,elec_type_,elec_year_,elec_date_,Residence_Addresses_City_,total_voters_asian,total_voters_black,total_voters_hispanic,total_voters_white,voted_voters_asian,voted_voters_black,voted_voters_hispanic,voted_voters_white,perc_turnout_asian,perc_turnout_black,perc_turnout_hispanic,perc_turnout_white
0,General,2020,2020_11_03,Albany,2405,147,1035,6169,1982,120,896,5517,0.824116,0.816327,0.8657,0.89431
1,General,2020,2020_11_03,Alhambra,17451,191,16596,7359,12135,139,12815,5918,0.695376,0.727749,0.772174,0.804185
2,General,2020,2020_11_03,Anaheim,26340,1211,70052,54644,20542,930,50500,46119,0.779879,0.76796,0.720893,0.84399
3,General,2020,2020_11_03,Bellflower,2153,3614,19899,10792,1465,2705,13642,8012,0.680446,0.748478,0.685562,0.742402
4,General,2020,2020_11_03,Berkeley,8549,5942,6388,39425,6659,4668,5113,33180,0.778922,0.785594,0.800407,0.841598


In [16]:
del voter_turnout_merge
gc.collect()

20

In case we want to replicated other columns found in the "Colorado Sample Output", below are possible steps. 

- mean_pop_income: cannot see reported Income column. ca-cities contains median value

- mean_pop_age: ca-cities contains median income and age. In order to calculate mean age we can use 'Voters_Age' in Demographic Data. 

- count_college_edu: ca-cities contains 'education_college_or_above', but not sure why there are decimal values. Can use 'CommercialData_Education' in Demographic Data

- count_donated_once: donation is only of the form "integer representing total number of federal donations made over the last four election cycles" in "FECDonors_NumberOfDonations" column

- mean_donation_amount: similarly 'FECDonors_AvgDonation' is also over last four election cycles


In [17]:
# no columns calculated in terms of mean  
ca_cities = pd.read_csv('ca-cities.csv', usecols=['city', 'income_individual_median', 'age_median', 'education_college_or_above'])
ca_cities.head()

Unnamed: 0,city,age_median,income_individual_median,education_college_or_above
0,Los Angeles,35.2,25302,33.1
1,San Francisco,38.3,45229,55.8
2,San Diego,34.3,33037,44.4
3,Riverside,31.3,24962,22.5
4,Sacramento,34.3,28633,31.5
