# Data Merging

Some observations:
- We chose the 5 non-RCV cities with the highest cosine similarity score for each of the 7 RCV cities in CA.
- There were 33 distinct cities among those 35 cities.
- There are 66 non-registered voters among 21.7 million voters.
- There is a total of 3.9 million voters in the sampled cities.
- The city 'El Paso de Robles' didn't match in the demographic data, so we manually searched for possible names for that city and found 'Paso Robles'.
- We found 122 cases out of 312 with 0% voter turnout. This notebook is an attempt to identify the correct election dates for each of the selected cities.
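
The first observation relies on ranking non-RCV cities by cosine similarity against each RCV city. The real feature vectors are computed in ca_similarity_search.ipynb and are not shown here, so the following is only a minimal sketch under assumptions: the feature values are made up, and `cosine_similarity` and `top_k_similar` are hypothetical helper names.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_similar(target, candidates, k=5):
    """Return the k candidate names most similar to the target vector."""
    ranked = sorted(
        candidates,
        key=lambda name: cosine_similarity(target, candidates[name]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical feature vectors (illustrative values only)
rcv_city = [0.6, 0.3, 0.1]
non_rcv = {
    "City A": [0.6, 0.3, 0.1],
    "City B": [0.1, 0.3, 0.6],
    "City C": [0.5, 0.4, 0.1],
}
print(top_k_similar(rcv_city, non_rcv, k=2))
```

Repeating this selection for each of the 7 RCV cities with `k=5` gives 35 picks, and the same non-RCV city can be picked more than once, which is why only 33 of them are distinct.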
    
# Find four most recent election dates
    
The Vote History file doesn't contain the city, so we need to merge it with the DEMOGRAPHIC file to find the four most recent election dates for the selected cities.

1. Load the DEMOGRAPHIC parquet file with only registered voters from selected cities and of selected ethnicities.
     - Get the list of RCV and non-RCV cities computed based on cosine similarity in ca_similarity_search.ipynb
2. Merge the DEMOGRAPHIC with VOTE HISTORY data
3. Find 4 most recent General elections and 4 most recent Local_or_Municipal elections


In [1]:
import pandas as pd
import janitor
import gc
import numpy as np
import time

In [2]:
start_time = time.time()

In [3]:
# ------ California -------

# # change the filepath as required, we have selected the folder with the latest date
# filepath = '../data/VM2--CA--2022-04-25/'

# DEMO_filename = 'VM2--CA--2022-04-25-DEMOGRAPHIC_selected_cols.parquet'
# VOTE_filename = 'VM2--CA--2022-04-25-VOTEHISTORY_selected_cols.parquet'
# VOTE_filename_orig = 'VM2--CA--2022-04-25-VOTEHISTORY.tab'

## ------ Colorado -------

filepath = '../data/VM2--CO--2022-04-26/'
DEMO_filename = 'VM2--CO--2022-04-26-DEMOGRAPHIC_selected_cols.parquet'
VOTE_filename = 'VM2--CO--2022-04-26-VOTEHISTORY_selected_cols.parquet'
VOTE_filename_orig = 'VM2--CO--2022-04-26-VOTEHISTORY.tab'


# # ------ Maryland -------

# filepath = '../data/VM2--MD--2022-04-08/'
# DEMO_filename = 'VM2--MD--2022-04-08-DEMOGRAPHIC_selected_cols.parquet'
# VOTE_filename = 'VM2--MD--2022-04-08-VOTEHISTORY_selected_cols.parquet'
# VOTE_filename_orig = 'VM2--MD--2022-04-08-VOTEHISTORY.tab'


# # ------ Maine -------

# filepath = '../data/VM2--ME--2022-03-02/'
# DEMO_filename = 'VM2--ME--2022-03-02-DEMOGRAPHIC_selected_cols.parquet'
# VOTE_filename = 'VM2--ME--2022-03-02-VOTEHISTORY_selected_cols.parquet'
# VOTE_filename_orig = 'VM2--ME--2022-03-02-VOTEHISTORY.tab'


# # ------ Minnesota -------

# filepath = '../data/VM2--MN--2022-03-25/'
# DEMO_filename = 'VM2--MN--2022-03-25-DEMOGRAPHIC_selected_cols.parquet'
# VOTE_filename = 'VM2--MN--2022-03-25-VOTEHISTORY_selected_cols.parquet'
# VOTE_filename_orig = 'VM2--MN--2022-03-25-VOTEHISTORY.tab'


# # ------ New Mexico -------

# filepath = '../data/VM2--NM--2022-03-30/'
# DEMO_filename = 'VM2--NM--2022-03-30-DEMOGRAPHIC_selected_cols.parquet'
# VOTE_filename = 'VM2--NM--2022-03-30-VOTEHISTORY_selected_cols.parquet'
# VOTE_filename_orig = 'VM2--NM--2022-03-30-VOTEHISTORY.tab'


# # ------ Vermont -------

# filepath = '../data/VM2--VT--2022-04-20/'
# DEMO_filename = 'VM2--VT--2022-04-20-DEMOGRAPHIC_selected_cols.parquet'
# VOTE_filename = 'VM2--VT--2022-04-20-VOTEHISTORY_selected_cols.parquet'
# VOTE_filename_orig = 'VM2--VT--2022-04-20-VOTEHISTORY.tab'


# # ------ Utah -------

# filepath = '../data/VM2--UT--2022-03-30/'
# DEMO_filename = 'VM2--UT--2022-03-30-DEMOGRAPHIC_selected_cols.parquet'
# VOTE_filename = 'VM2--UT--2022-03-30-VOTEHISTORY_selected_cols.parquet'
# VOTE_filename_orig = 'VM2--UT--2022-03-30-VOTEHISTORY.tab'


## 1.  Load new Demographic Data

1. Use the parquet file that was created by Reduce_to_parquet.ipynb.
2. Filter the data based on the list of cities found in ca_similarity_search.ipynb.
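
Before filtering, it helps to check that every expected city name actually appears in the data, since a name such as 'El Paso de Robles' may be spelled differently there. The sketch below runs on toy rows and assumes the column names used in this notebook; `find_missing_cities` and the `aliases` mapping are hypothetical names introduced for illustration.

```python
import pandas as pd

def find_missing_cities(df, expected_cities, city_col="Residence_Addresses_City"):
    """Return the expected city names that never appear in df[city_col]."""
    available = set(df[city_col].dropna().unique())
    return [city for city in expected_cities if city not in available]

# Toy stand-in for the DEMOGRAPHIC data (illustrative rows only)
toy_demo = pd.DataFrame({
    "LALVOTERID": ["v1", "v2", "v3", "v4"],
    "Residence_Addresses_City": ["Oakland", "Paso Robles", "Davis", "Elsewhere"],
})
expected = ["Oakland", "El Paso de Robles", "Davis"]

missing = find_missing_cities(toy_demo, expected)
print(missing)

# Map the list spelling to the spelling used in the data, then filter.
aliases = {"El Paso de Robles": "Paso Robles"}
expected = [aliases.get(city, city) for city in expected]
toy_subset = toy_demo[toy_demo["Residence_Addresses_City"].isin(expected)]
print(toy_subset["Residence_Addresses_City"].tolist())
```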


In [4]:
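def combine_cities_list(RCV_list, NonRCV_list):
    """Combine the RCV cities with the de-duplicated sampled non-RCV cities."""

    # NOTE: this count covers the RCV cities only
    print("total number of cities:", len(RCV_list))

    # NOTE: this count covers the distinct non-RCV cities only
    print("number of distinct cities:", len(set(NonRCV_list)))

    print("name of cities that were duplicated:", set([x for x in NonRCV_list if NonRCV_list.count(x) > 1]))

    # set() drops duplicated non-RCV names but does not preserve their order
    combined_cityName = RCV_list+list(set(NonRCV_list))
    print("number of distinct RCV and sampled nonRCV cities:", len(combined_cityName))
    return combined_cityName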

In [5]:
# 1. List of RCV and non-RCV cities 

# ------ California -------

RCV_cities_CA = ['San Francisco',
 'Oakland',
 'Berkeley',
 'San Leandro',
 'Palm Desert',
 'Eureka',
 'Albany']

sampled_nonRCV_cities_CA = ['Fresno',
 'San Diego',
 'Sacramento',
 'Riverside',
 'San Jose',
 'Santa Ana',
 'Anaheim',
 'Santa Rosa',
 'Merced',
 'Santa Clarita',
 'Alhambra',
 'Davis',
 'Montebello',
 'Burbank',
 'Huntington Park',
 'Bellflower',
 'Watsonville',
 'Gilroy',
 'Whittier',
 'Lynwood',
 'Lakewood',
 'Pico Rivera',
 'Lake Forest',
 'Livermore',
 'Chino Hills',
 'Paramount',
 'El Paso de Robles',
 'Pico Rivera',
 'Buena Park',
 'Whittier',
 'Calabasas',
 'Carpinteria',
 'Morro Bay',
 'San Carlos',
 'Solvang']

combined_sampled_cityName = combine_cities_list(RCV_list= RCV_cities_CA, NonRCV_list = sampled_nonRCV_cities_CA)
# ---------------------


total number of cities: 7
number of distinct cities: 33
name of cities that were duplicated: {'Whittier', 'Pico Rivera'}
number of distinct RCV and sampled nonRCV cities: 40


In [6]:
def read_DEMOGRAPHIC():
    df_demographic = pd.read_parquet(f'{filepath}{DEMO_filename}')
    print("Total number of unique cities:", df_demographic.Residence_Addresses_City.nunique())
    print("Total number of unique voters:", df_demographic.LALVOTERID.nunique())
    print("Count of non-registered voters:", len(df_demographic[df_demographic['Voters_OfficialRegDate'].isnull()]))
    
    print("Number of expected cities:", len(combined_sampled_cityName))
    missing_cities = [city for city in combined_sampled_cityName if city not in df_demographic['Residence_Addresses_City'].unique()]
    if len(missing_cities) > 0:
        print("number of cities not found in demographic data:", len(missing_cities))
        print(missing_cities)
        
    return df_demographic
        
state_demographic = read_DEMOGRAPHIC()

Total number of unique cities: 466
Total number of unique voters: 3745148
Count of non-registered voters: 8
Number of expected cities: 40
number of cities not found in demographic data: 38
['San Francisco', 'Oakland', 'Berkeley', 'San Leandro', 'Palm Desert', 'Eureka', 'Albany', 'Morro Bay', 'Montebello', 'Pico Rivera', 'Santa Rosa', 'Chino Hills', 'Gilroy', 'Riverside', 'San Diego', 'Santa Ana', 'Watsonville', 'Merced', 'Bellflower', 'Fresno', 'Burbank', 'Buena Park', 'Whittier', 'Huntington Park', 'Sacramento', 'Calabasas', 'Paramount', 'Lynwood', 'Alhambra', 'San Carlos', 'Anaheim', 'Davis', 'El Paso de Robles', 'Carpinteria', 'San Jose', 'Lake Forest', 'Santa Clarita', 'Solvang']


In [7]:
state_demographic.head(5)

Unnamed: 0,LALVOTERID,Residence_Addresses_City,Voters_Gender,Voters_Age,Parties_Description,EthnicGroups_EthnicGroup1Desc,Voters_OfficialRegDate,County,CommercialData_Education,CommercialData_EstimatedHHIncome,FECDonors_NumberOfDonations,FECDonors_TotalDonationsAmount
0,LALCA453164106,Oakland,F,29,Democratic,Other,06/18/2021,ALAMEDA,,,,
1,LALCA453008306,Oakland,F,26,Non-Partisan,Likely African-American,04/01/2021,ALAMEDA,,,,
2,LALCA22129469,Oakland,F,47,Democratic,European,11/16/2021,ALAMEDA,HS Diploma - Extremely Likely,,,
3,LALCA549803906,Oakland,M,60,Democratic,Other,02/07/2022,ALAMEDA,,,,
4,LALCA24729024,San Leandro,F,56,Democratic,European,02/28/2016,ALAMEDA,HS Diploma - Extremely Likely,,,


In [8]:
# ----- California ----- 
combined_sampled_cityName = list(map(lambda x: x.replace('El Paso de Robles', 'Paso Robles'), combined_sampled_cityName))
print("number of expected cities:", len(combined_sampled_cityName))
# ----------------------


number of expected cities: 40


In [9]:
# 2. filter DEMOGRAPHIC data based on the list of cities, ethnicities and registered voters

selected_ethnicities = ['European', 'Likely African-American','Hispanic and Portuguese', 'East and South Asian']

def filter_demo(df, list_cityNames):
    filtered_df = df[df['Residence_Addresses_City'].isin(list_cityNames) &
            df['EthnicGroups_EthnicGroup1Desc'].isin(selected_ethnicities) &
            df['Voters_OfficialRegDate'].notnull()][['LALVOTERID', 'Residence_Addresses_City']]
    
    print(filtered_df.shape)
    print("number of unique cities:", filtered_df.Residence_Addresses_City.nunique())
    
    return filtered_df

state_demographic_subset = filter_demo(df = state_demographic, list_cityNames = combined_sampled_cityName)
state_demographic_subset.head()

(3944492, 2)
number of unique cities: 40


Unnamed: 0,LALVOTERID,Residence_Addresses_City
1,LALCA453008306,Oakland
2,LALCA22129469,Oakland
4,LALCA24729024,San Leandro
6,LALCA22466723,Livermore
7,LALCA22466636,Livermore


In [10]:
gc.collect()

20

In [11]:
del state_demographic
gc.collect()

0

## 2. Merge VoteHistory with DEMOGRAPHIC Data 
1. The kernel died when trying to load all General and Local election columns at once, so we load the two types of elections separately.
    1. Load the original data to get the complete list of columns containing "General" and "Local_or_Municipal" (only one row is needed).
    2. Create two lists of column names, one for each type of election.
2. Merge only the city (`Residence_Addresses_City`) from the DEMOGRAPHIC file into VOTE HISTORY to reduce computation time.
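
The two steps above can be sketched on toy data. This is an illustration under assumptions rather than the notebook's exact code: the tab-separated text is made up, and `nrows=0` (header only) is used here instead of reading one row, which is enough when only the column names are needed.

```python
import io
import pandas as pd

# Toy tab-separated text standing in for the VOTEHISTORY .tab file (made-up rows)
toy_text = (
    "LALVOTERID\tGeneral_2020_11_03\tLocal_or_Municipal_2021_03_02\tPrimary_2020_03_03\n"
    "v1\tY\t\tY\n"
    "v2\t\tY\t\n"
)

# nrows=0 parses only the header line, which is enough to list the column names
header = pd.read_csv(io.StringIO(toy_text), sep="\t", dtype=str, nrows=0)

ge_cols = [c for c in header.columns if c.startswith("General")]
lm_cols = [c for c in header.columns if c.startswith("Local_or_Municipal")]
print(ge_cols, lm_cols)

# Load one election type at a time and attach only the city column
toy_demo = pd.DataFrame({
    "LALVOTERID": ["v1", "v2"],
    "Residence_Addresses_City": ["Oakland", "Davis"],
})
toy_votes = pd.read_csv(io.StringIO(toy_text), sep="\t", dtype=str,
                        usecols=["LALVOTERID"] + lm_cols)
merged = toy_demo.merge(toy_votes, how="inner", on="LALVOTERID")
print(merged.shape)
```

Loading only the voter ID plus one group of election columns keeps the merged frame narrow, which is the point of handling General and Local_or_Municipal elections in separate passes.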

In [12]:
# 1.A select only one row to find the column names for General and Local_or_Municipal elections
# need to use the original tab file because pandas' read_parquet doesn't support nrows

state_voterhistory_cols = pd.read_csv(f'{filepath}{VOTE_filename_orig}',
                                 sep='\t', dtype=str, encoding='unicode_escape',
                                nrows=1)

In [13]:
# 1.B select only voter ID and columns with General or Local_or_Municipal election dates
def get_elec_cols(df, string):
    return [col for col in df.columns if col.startswith(string)]
    
GE_cols = get_elec_cols(state_voterhistory_cols, 'General')
print("total number of General election dates", len(GE_cols))

LM_cols = get_elec_cols(state_voterhistory_cols, 'Local_or_Municipal')
print("total number of Local or Municipal election dates", len(LM_cols))


total number of General election dates 18
total number of Local or Municipal election dates 131


In [14]:
del state_voterhistory_cols
gc.collect()

0

In [15]:
# 2. read the VOTEHISTORY parquet file and merge the city from DEMOGRAPHIC file 
df_voterhistory_LM = pd.merge(state_demographic_subset, 
                              pd.read_parquet(f'{filepath}{VOTE_filename}', columns =['LALVOTERID'] +LM_cols), 
                               how='inner', on = 'LALVOTERID') 
df_voterhistory_LM

Unnamed: 0,LALVOTERID,Residence_Addresses_City,Local_or_Municipal_2021_08_31,Local_or_Municipal_2021_07_20,Local_or_Municipal_2021_06_08,Local_or_Municipal_2021_06_01,Local_or_Municipal_2021_05_11,Local_or_Municipal_2021_05_04,Local_or_Municipal_2021_04_20,Local_or_Municipal_2021_03_09,...,Local_or_Municipal_2007_03_06,Local_or_Municipal_2007_02_06,Local_or_Municipal_2006_10_04,Local_or_Municipal_2006_08_29,Local_or_Municipal_2006_07_25,Local_or_Municipal_2006_07_18,Local_or_Municipal_2006_04_11,Local_or_Municipal_2006_04_04,Local_or_Municipal_2006_03_07,Local_or_Municipal_2006_02_07
0,LALCA453008306,Oakland,,,,,,,,,...,,,,,,,,,,
1,LALCA22129469,Oakland,,,,,,,,,...,,,,,,,,,,
2,LALCA24729024,San Leandro,,,,,,,,,...,,,,,,,,,,
3,LALCA22466723,Livermore,,,,,,,,,...,,,,,,,,,,
4,LALCA22466636,Livermore,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3944487,LALCA580571224,Davis,,,,,,,,,...,,,,,,,,,,
3944488,LALCA483029773,Davis,,,,,,,,,...,,,,,,,,,,
3944489,LALCA453079452,Davis,,,,,,,,,...,,,,,,,,,,
3944490,LALCA22064283,Davis,,,,,,,,,...,,,,,,,,,,


In [16]:
# 2.1. reduce number of columns by removing columns if all rows are None

def remove_all_None(df, selected_cols):
    print("-"*20, "\nBefore filtering Vote and Demographic\n", "-"*20)
    print("Total number of General election dates", len(get_elec_cols(df, 'General')))
    print("Total number of Local or Municipal election dates", len(get_elec_cols(df, 'Local_or_Municipal')))
    print("\n")

    # reduce the search space with this step
    cols_all_None = [col for col in selected_cols if len(df[col].value_counts()) == 0]
    print("number of columns with all None:", len(cols_all_None))

    df = df.drop(columns = cols_all_None)
    print("-"*20, "\nAfter removing dates with all None\n", "-"*20)
    print("Total number of General election dates", len(get_elec_cols(df, 'General')))
    print("Total number of Local or Municipal election dates", len(get_elec_cols(df, 'Local_or_Municipal')))

    gc.collect()
    return df

In [17]:
df_voterhistory_LM = remove_all_None(df = df_voterhistory_LM, selected_cols = LM_cols)

-------------------- 
Before filtering Vote and Demographic
 --------------------
Total number of General election dates 0
Total number of Local or Municipal election dates 131


number of columns with all None: 16
-------------------- 
After removing dates with all None
 --------------------
Total number of General election dates 0
Total number of Local or Municipal election dates 115



## 3. Find 4 most recent General elections and 4 most recent Local_or_Municipal elections
1. Reduce the number of columns by removing columns in which all rows are None.
2. Run a loop to keep only the election date columns that are associated with the chosen subset of cities.
3. Create two dictionaries:
    1. `init_city_cnt_dates_{LM|GE}` counts the number of election dates found for each city.
        - Once the count for a city reaches 4, stop checking more dates for that city (to do this, we remove the city from the `init_city_cnt_dates_{LM|GE}` dictionary).
    2. `init_city_4_dates_{LM|GE}` keeps track of the cities and their election dates.
        - For a given election date, if at least one voter has "Y", proceed to find which cities took part on that date.
        - For each city still in `init_city_cnt_dates_{LM|GE}`, if the city is also present in the dataframe (i.e. at least one "Y" vote is counted), increment its count by 1 in `init_city_cnt_dates_{LM|GE}` and add the date to `init_city_4_dates_{LM|GE}`.
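
The same idea can be expressed without an explicit loop over cities. The following is a minimal sketch on toy data, not the loop used in this notebook: `recent_dates_per_city` is a hypothetical helper, and it assumes the election columns are already ordered from most recent to oldest, as they appear in the source file.

```python
import pandas as pd

def recent_dates_per_city(df, date_cols, n=4, city_col="Residence_Addresses_City"):
    """For each city, list up to n election columns with at least one 'Y' vote.

    date_cols is assumed to be ordered most recent first.
    """
    # One row per city, one boolean per election column:
    # True if any voter in that city has 'Y' for that election
    held = df[date_cols].eq("Y").groupby(df[city_col]).any()
    return {
        city: [col for col in date_cols if row[col]][:n]
        for city, row in held.iterrows()
    }

# Toy vote history (illustrative rows only)
toy = pd.DataFrame({
    "Residence_Addresses_City": ["Oakland", "Oakland", "Davis"],
    "Local_or_Municipal_2021_03_02": [None, None, "Y"],
    "Local_or_Municipal_2020_08_03": ["Y", None, "Y"],
    "Local_or_Municipal_2019_08_13": [None, "Y", None],
})
cols = [c for c in toy.columns if c.startswith("Local_or_Municipal")]
result = recent_dates_per_city(toy, cols, n=2)
print(result)
```

A city with fewer than `n` qualifying dates simply gets a shorter list, so checking the list lengths plays the role of the "need to find more dates" check.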

    

In [18]:
def get_list_elec_dates(df, date_cols, list_city_cnt_dates, list_city_4_dates):
    for date_col in date_cols:
        # stop once 4 dates have been found for every city
        # (a city is removed from list_city_cnt_dates as soon as it reaches 4 dates)
        if len(list_city_cnt_dates) == 0:
            break

        cnt_df = df[df[date_col] == 'Y'][[date_col, 'Residence_Addresses_City']].groupby('Residence_Addresses_City').count()
        
        # If no rows are found then none of the cities held an election on that date,
        # assuming that at least one voter turns out on an election date
        if len(cnt_df) == 0:
            print("No cities found for ", date_col)
            continue

        # for the selected date, check which of the remaining cities held an election
        # (a date on which only one city voted is also counted)
        for city in list(list_city_cnt_dates.keys()):
            # a city missing from cnt_df has no "Y" votes,
            # so this date is not an election date for that city
            if city in cnt_df.index:
                list_city_cnt_dates[city] += 1
                list_city_4_dates[city].append(date_col)
                if list_city_cnt_dates[city] == 4:
                    # remove the city so that we stop checking more dates for it
                    del list_city_cnt_dates[city]

    return list_city_cnt_dates, list_city_4_dates

In [19]:
# one entry per distinct city (using .unique() avoids looping over every voter row)
cities_LM = df_voterhistory_LM['Residence_Addresses_City'].unique()
init_city_cnt_dates_LM = {key: 0 for key in cities_LM}
init_city_4_dates_LM = {key: [] for key in cities_LM}

# need to recompute the list of election dates because some columns were removed in the previous step
LM_cols = get_elec_cols(df_voterhistory_LM, 'Local_or_Municipal')

list_city_cnt_dates_LM, list_city_4_dates_LM = get_list_elec_dates(df_voterhistory_LM, 
                                                                   LM_cols, 
                                                                   init_city_cnt_dates_LM, 
                                                                   init_city_4_dates_LM)

if len(list_city_cnt_dates_LM) == 0:
    print("\nAll local and municipal election dates found!")
else:
    print("\nNeed to find more local and municipal election dates!!!")
print(list_city_cnt_dates_LM)
list_city_4_dates_LM


All local and municipal election dates found!
{}


{'Oakland': ['Local_or_Municipal_2020_08_03',
  'Local_or_Municipal_2020_04_14',
  'Local_or_Municipal_2019_08_27',
  'Local_or_Municipal_2019_08_13'],
 'San Leandro': ['Local_or_Municipal_2020_08_03',
  'Local_or_Municipal_2019_08_13',
  'Local_or_Municipal_2019_06_04',
  'Local_or_Municipal_2019_03_05'],
 'Livermore': ['Local_or_Municipal_2020_08_03',
  'Local_or_Municipal_2020_04_14',
  'Local_or_Municipal_2019_03_05',
  'Local_or_Municipal_2017_08_29'],
 'Berkeley': ['Local_or_Municipal_2021_06_01',
  'Local_or_Municipal_2021_05_11',
  'Local_or_Municipal_2021_03_02',
  'Local_or_Municipal_2020_08_03'],
 'Albany': ['Local_or_Municipal_2021_03_02',
  'Local_or_Municipal_2020_08_03',
  'Local_or_Municipal_2019_08_13',
  'Local_or_Municipal_2019_06_04'],
 'San Francisco': ['Local_or_Municipal_2021_07_20',
  'Local_or_Municipal_2021_05_11',
  'Local_or_Municipal_2021_03_02',
  'Local_or_Municipal_2020_08_03'],
 'San Diego': ['Local_or_Municipal_2021_06_08',
  'Local_or_Municipal_2021_0

In [20]:
del df_voterhistory_LM
gc.collect()

0

## Redo step 2.2 and all steps of 3 for General elections

In [21]:
# 2. read the VOTEHISTORY parquet file and merge the city from DEMOGRAPHIC file 
df_voterhistory_GE = pd.merge(state_demographic_subset, 
                              pd.read_parquet(f'{filepath}{VOTE_filename}', columns =['LALVOTERID'] + GE_cols), 
                               how='inner', on = 'LALVOTERID') 
df_voterhistory_GE

Unnamed: 0,LALVOTERID,Residence_Addresses_City,General_2020_11_03,General_2018_11_06,General_2016_11_08,General_2014_11_04,General_2012_11_06,General_2010_11_02,General_2008_11_04,General_2006_11_07,General_2004_11_02,General_2002_11_05,General_2000_11_07,General_1998_11_03,General_1996_11_05,General_1994_11_08,General_1992_11_03,General_1990_11_06,General_1988_11_08,General_1986_11_04
0,LALCA453008306,Oakland,,Y,,,,,,,,,,,,,,,,
1,LALCA22129469,Oakland,Y,Y,Y,Y,Y,Y,,Y,Y,,,,,,,,,
2,LALCA24729024,San Leandro,,,,,,,,,,,,,,,,,,
3,LALCA22466723,Livermore,,,,,,,Y,,,,,,,,,,,
4,LALCA22466636,Livermore,Y,Y,Y,Y,Y,,Y,Y,Y,Y,,,Y,,Y,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3944487,LALCA580571224,Davis,Y,,,,,,,,,,,,,,,,,
3944488,LALCA483029773,Davis,Y,,Y,,,,,,,,,,,,,,,
3944489,LALCA453079452,Davis,Y,Y,Y,,,,,,,,,,,,,,,
3944490,LALCA22064283,Davis,,,Y,,,,Y,,,,,,,,,,,


In [22]:
df_voterhistory_GE = remove_all_None(df_voterhistory_GE, GE_cols)

init_city_cnt_dates_GE = {key: 0 for key in df_voterhistory_GE['Residence_Addresses_City']}
init_city_4_dates_GE = {key: [] for key in df_voterhistory_GE['Residence_Addresses_City']}

GE_cols = get_elec_cols(df_voterhistory_GE, 'General')

list_city_cnt_dates_GE, list_city_4_dates_GE = get_list_elec_dates(df_voterhistory_GE, 
                                                                   GE_cols, 
                                                                   init_city_cnt_dates_GE, 
                                                                   init_city_4_dates_GE)

if len(list_city_cnt_dates_GE) == 0:
    print("\nAll General election dates found!")
else:
    print("\nNeed to find more General election dates!!!")
print(list_city_cnt_dates_GE)
list_city_4_dates_GE

-------------------- 
Before filtering Vote and Demographic
 --------------------
Total number of General election dates 18
Total number of Local or Municipal election dates 0


number of columns with all None: 0
-------------------- 
After removing dates with all None
 --------------------
Total number of General election dates 18
Total number of Local or Municipal election dates 0

All General election dates found!
{}


{'Oakland': ['General_2020_11_03',
  'General_2018_11_06',
  'General_2016_11_08',
  'General_2014_11_04'],
 'San Leandro': ['General_2020_11_03',
  'General_2018_11_06',
  'General_2016_11_08',
  'General_2014_11_04'],
 'Livermore': ['General_2020_11_03',
  'General_2018_11_06',
  'General_2016_11_08',
  'General_2014_11_04'],
 'Berkeley': ['General_2020_11_03',
  'General_2018_11_06',
  'General_2016_11_08',
  'General_2014_11_04'],
 'Albany': ['General_2020_11_03',
  'General_2018_11_06',
  'General_2016_11_08',
  'General_2014_11_04'],
 'San Francisco': ['General_2020_11_03',
  'General_2018_11_06',
  'General_2016_11_08',
  'General_2014_11_04'],
 'San Diego': ['General_2020_11_03',
  'General_2018_11_06',
  'General_2016_11_08',
  'General_2014_11_04'],
 'San Jose': ['General_2020_11_03',
  'General_2018_11_06',
  'General_2016_11_08',
  'General_2014_11_04'],
 'Fresno': ['General_2020_11_03',
  'General_2018_11_06',
  'General_2016_11_08',
  'General_2014_11_04'],
 'Eureka': ['G

In [23]:
GE_dates_df = pd.DataFrame(list_city_4_dates_GE.items(), columns=['city', 'GE_dates'])
LM_dates_df = pd.DataFrame(list_city_4_dates_LM.items(), columns=['city', 'LM_dates'])
GE_LM_dates_df = GE_dates_df.merge(LM_dates_df, on = "city")
GE_LM_dates_df.head()

Unnamed: 0,city,GE_dates,LM_dates
0,Oakland,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2020_08_03, Local_or_Munic..."
1,San Leandro,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2020_08_03, Local_or_Munic..."
2,Livermore,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2020_08_03, Local_or_Munic..."
3,Berkeley,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2021_06_01, Local_or_Munic..."
4,Albany,"[General_2020_11_03, General_2018_11_06, Gener...","[Local_or_Municipal_2021_03_02, Local_or_Munic..."


In [24]:
# save in parquet format
GE_LM_dates_df.to_parquet(f'{filepath}GE_LM_dates_per_city.parquet')

In [25]:
end_time = time.time()
print("Time taken to run this notebook in seconds: ", end_time - start_time)

Time taken to run this notebook in seconds:  346.62226700782776


In [26]:
del df_voterhistory_GE
gc.collect()

0