# Plan

First, get the 2019 election data in and restructure it according to the following schema:

![schema](diagrams/schema.jpg)

Add key ABS metrics for each electorate to the electorate data. Also join the SA2 data to the polling places and include a similar set of variables. Consider including crime data: use geonames to geocode it, and restructure it as well, because its current layout is disgusting.
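
The electorate enrichment step is, at heart, a key-based join on the division name. Below is a minimal plain-Python sketch of that idea. The sample rows, the `pc_age_18_24` column and the helper name `inner_join_on_division` are all made up for illustration; the notebook itself does the join with `DataFrame.merge(..., how='inner', on='division_name')`. Because an inner join silently drops names that do not match, the sketch also returns the unmatched names so they can be checked by hand.

```python
def inner_join_on_division(electorates, statistics, key="division_name"):
    """Inner-join two lists of dicts on `key`; also return keys with no match."""
    stats_by_key = {row[key]: row for row in statistics}
    joined, unmatched = [], []
    for row in electorates:
        match = stats_by_key.get(row[key])
        if match is None:
            # An inner join would drop this row without warning.
            unmatched.append(row[key])
        else:
            joined.append({**row, **match})
    return joined, unmatched


# Hypothetical sample rows (values are invented, not taken from AEC or ABS files)
electorates = [
    {"division_id": 1, "division_name": "Adelaide", "voter_count": 100},
    {"division_id": 2, "division_name": "Aston", "voter_count": 200},
]
statistics = [{"division_name": "Adelaide", "pc_age_18_24": 0.12}]

joined, unmatched = inner_join_on_division(electorates, statistics)
print(joined)     # rows present in both inputs
print(unmatched)  # division names missing from the statistics input
```

Checking the unmatched list before writing the output is a cheap way to catch spelling differences between the AEC and ABS division names.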




## How it is made

Things you will need:

1. Australian Electoral Commission Data for 2019. Files available for download from [here](https://results.aec.gov.au/24310/Website/HouseDownloadsMenu-24310-Csv.htm)
   1. Each of the "Distribution of Preferences by Polling Place - {state}" files was downloaded and expanded into a series of files collected in file_list (file_list_house_of_reps_data in the script)
   2. Party Data from the Political Parties file was used for party_file_path
   3. Polling places file was used for polling_place_file_path
   4. Nominations by Division was used for the division data (nominations_by_division_file_path)
   5. The National List of Candidates was used as input for candidate_file_path
      1. This file was manually gender coded based on known genders of common names (this country, for all its multiculturalism, has many Davids, Nicoles and Dougs)
      2. Where the gender was ambiguous to me (either because the name was not among common English names or because the name itself is ambiguous, such as Chris or Shane), I used Google searches to find images of candidates where possible and assigned their apparent gender.
2. Australian Bureau of Statistics' Australian Statistical Geography Standard (ASGS) 2016 Edition SA2 shape file
   1. Specifically [this file (WARNING: Will commence download as link targets .zip)](https://www.abs.gov.au/AUSSTATS/subscriber.nsf/log?openagent&1270055001_sa2_2016_aust_shape.zip&1270.0.55.001&Data%20Cubes&A09309ACB3FA50B8CA257FED0013D420&0&July%202016&12.07.2016&Latest)
   2. For general information see [this](https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/1270.0.55.001July%202016?OpenDocument) link
3. Australian Bureau of Statistics "Discover Your Electorate" data (see [here](https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/2082.02019))
   1. Used as the basis for electorate_statistics_file_path, but the data was taken out of the Excel workbook and saved as a flat CSV
   2. Download original spreadsheet [here](https://www.abs.gov.au/AUSSTATS/subscriber.nsf/log?openagent&commonwealth%20electorate%20data.xls&2082.0&Data%20Cubes&BDBB9991BD7E99E8CA2583AF0071A786&0&2019&01.03.2019&Latest)
   3. Data was copied and pasted as-is, but column names have been rationalised.
      1. variables begin with either pc_ (percent) or aud_ (Australian Dollars)
      2. pc_ variables are then grouped by type - age, dwelling, occupation, background, family, employment_educ
      3. Abbreviations:
         1. employment_educ - employment, education or training engagement rate
         2. atsi - Aboriginal or Torres Strait Islander (Australian First Nations peoples)
         3. family_nuclear - a family with a couple and children
      4. withheld is used where ABS data indicated that this information was not stated or in some cases too vague. "vague" is used where ABS specifically indicates that the information was provided but with insufficient detail to categorise. 
      
4. Australian Bureau of Statistics General Community Profile SA2 dataset for Australia
   1. General page used to access was [here](https://www.abs.gov.au/census/find-census-data/datapacks?release=2016&product=GCP&geography=SA2&header=S)
   2. [This file](https://www.abs.gov.au/census/find-census-data/datapacks/download/2016_GCP_SA2_for_AUS_short-header.zip) was used to download the data for the 2016 release.
   3. Files used are listed in the script below. To identify where an item comes from, read the file_path variable for each of the sa2 files; the file name shows which table number to look up. The replacement_dict used to clean each file shows which variables are targeted for inclusion.
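
To make the role of each replacement_dict concrete: it acts as both a column whitelist and a rename map, and the resulting count_ columns are then turned into pc_ shares of a total column. Here is a rough plain-Python sketch of that pattern, assuming a tiny in-memory CSV. The header names are borrowed from the dwelling dictionary in the script, but the SA2 code, the values, the `Unused_col` column and the two helper names are invented; the script does the same work with pandas in `drop_and_rename_columns` and `transform_all_count_variables_to_pc_variables`.

```python
import csv
import io


def keep_and_rename(rows, rename_dict):
    """Keep only the columns named in rename_dict and apply the new names."""
    return [{new: row[old] for old, new in rename_dict.items()} for row in rows]


def counts_to_shares(rows, basis, ignore=("SA2_MAIN16",)):
    """Replace each count_* column with a pc_* share of the basis column."""
    out = []
    for row in rows:
        total = float(row[basis])
        new_row = {}
        for name, value in row.items():
            if name in ignore or name == basis:
                new_row[name] = value  # identifiers and the basis stay as they are
            else:
                share = float(value) / total if total else None
                new_row[name.replace("count", "pc", 1)] = share
        out.append(new_row)
    return out


# Invented sample data; only the header names mirror the script's dictionary.
sample_csv = (
    "SA2_MAINCODE_2016,O_OR_Total,O_MTG_Total,R_Tot_Total,Total_Total,Unused_col\n"
    "999999999,50,30,20,100,7\n"
)
rename_dict = {
    "SA2_MAINCODE_2016": "SA2_MAIN16",
    "O_OR_Total": "count_dwelling_owned",
    "O_MTG_Total": "count_dwelling_mortgaged",
    "R_Tot_Total": "count_dwelling_rented",
    "Total_Total": "total_dwellings",
}

rows = list(csv.DictReader(io.StringIO(sample_csv)))
cleaned = keep_and_rename(rows, rename_dict)
shares = counts_to_shares(cleaned, basis="total_dwellings")
print(shares[0])
```

Any column missing from the dictionary (here `Unused_col`) is dropped, so reading a dictionary's keys is enough to see which source variables end up in the output.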
   

In [1]:
import pandas as pd
import numpy as np
import glob
import geopandas as gpd



In [4]:
output_file_path = 'outputs/'

file_list_house_of_reps_data = glob.glob('house_of_reps/*/*.csv')

party_file_path = 'misc/hor/parties.csv'

polling_place_file_path = 'misc/hor/polling_places.csv'

sa2_data_file_path = '../../geospatial/sa2_2016/SA2_2016_AUST.shp'

nominations_by_division_file_path = 'misc/hor/division_nomination.csv'

electorate_statistics_file_path = '../../geospatial/ABS_2802_0_electorate_data_2019.csv'

candidate_file_path = 'misc/hor/members_nominated.csv'

sa2_age_total_population_file_path = '../../geospatial/2016_SA2_Community_profile/2016 Census GCP Statistical Area 2 for AUST/2016Census_G01_AUS_SA2.csv'

sa2_migration_timeframe_file_path = '../../geospatial/2016_SA2_Community_profile/2016 Census GCP Statistical Area 2 for AUST/2016Census_G10C_AUS_SA2.csv'

sa2_occupation_file_path = '../../geospatial/2016_SA2_Community_profile/2016 Census GCP Statistical Area 2 for AUST/2016Census_G57B_AUS_SA2.csv'

sa2_university_file_path = '../../geospatial/2016_SA2_Community_profile/2016 Census GCP Statistical Area 2 for AUST/2016Census_G15_AUS_SA2.csv'


sa2_labourforce_file_path = '../../geospatial/2016_SA2_Community_profile/2016 Census GCP Statistical Area 2 for AUST/2016Census_G40_AUS_SA2.csv'


sa2_religion_file_path = '../../geospatial/2016_SA2_Community_profile/2016 Census GCP Statistical Area 2 for AUST/2016Census_G14_AUS_SA2.csv'

sa2_industry_file_path = '../../geospatial/2016_SA2_Community_profile/2016 Census GCP Statistical Area 2 for AUST/2016Census_G53A_AUS_SA2.csv'


sa2_family_composition_file_path = '../../geospatial/2016_SA2_Community_profile/2016 Census GCP Statistical Area 2 for AUST/2016Census_G25_AUS_SA2.csv'


sa2_dwelling_type_file_path = '../../geospatial/2016_SA2_Community_profile/2016 Census GCP Statistical Area 2 for AUST/2016Census_G33_AUS_SA2.csv'

sa2_medians_file_path = '../../geospatial/2016_SA2_Community_profile/2016 Census GCP Statistical Area 2 for AUST/2016Census_G02_AUS_SA2.csv'


preference_distribution_column_counts_dict = {'StateAb': 'state',
                                              'DivisionId': 'division_id',
                                              'DivisionNm': 'division_name',
                                              'PPId': 'polling_place_id',
                                              'PPNm': 'polling_place_name',
                                              'CountNum': 'round',
                                              'CandidateId': 'candidate_id',
                                              'Surname': 'candidate_surname',
                                              'PartyAb': 'party_code',
                                              'CalculationValue': 'preference_count'}


preference_distribution_column_count_transfer_dict = {'PPId': 'polling_place_id',
                                                      'PPNm': 'polling_place_name',
                                                      'DivisionId':'division_id',
                                                      'CountNum': 'round',
                                                      'CandidateId': 'candidate_id',
                                                      'Surname': 'candidate_surname',
                                                      'PartyAb': 'party_code',
                                                      'CalculationValue': 'transfer_count'}


parties_df_replace_dict = {'StateAb': 'state',
                           'PartyAb': 'party_code',
                           'PartyNm': 'party_name'}


polling_place_rename_dict = {'State': 'state',
                             'DivisionID': 'division_id', 
                             'PollingPlaceID':'polling_place_id',
                             'PollingPlaceTypeID':'polling_place_type_id',
                             'PollingPlaceNm': 'polling_place_name',
                             'PremisesNm': 'premises',
                             'Latitude':'latitude',
                             'Longitude': 'longitude',
                             'geometry': 'geometry'}

division_rename_dict = {'DivisionId': 'division_id',
                        'StateAb': 'state',
                        'DivisionNm': 'division_name',
                        'Enrolment': 'voter_count',
                        'Demographic': 'demographic',
                        'Nominations': 'candidate_count'}

candidate_df_replace_dict = {'PartyAb':'party_code',
                             'CandidateID': 'candidate_id',
                             'Surname':'candidate_surname',
                             'GivenNm':'candidate_given_names',
                             'Gender': 'gender'}

sa2_2016_rename_dict = {'SA2_MAIN16': 'SA2_MAIN16',
                        'SA2_NAME16': 'sa2_name',
                        'geometry': 'geometry'}

sa2_medians_replacement_dict = {'SA2_MAINCODE_2016':'SA2_MAIN16',
                                'Median_age_persons': 'median_age',
                                'Median_mortgage_repay_monthly': 'aud_median_monthly_mortgage_payment',
                                'Median_rent_weekly': 'aud_median_weekly_rent',
                                'Median_tot_hhd_inc_weekly': 'aud_median_weekly_household_income'}

sa2_age_total_population_rename_dict = {'SA2_MAINCODE_2016':'SA2_MAIN16',
                                        'Tot_P_P':'count_persons',
                                        'Tot_P_M':'count_male_persons',
                                        'Tot_P_F':'count_female_persons',  
                                        'Age_0_4_yr_P': 'count_age_0_4',
                                        'Age_5_14_yr_P': 'count_age_5_14',
                                        'Age_15_19_yr_P': 'count_age_15_19',
                                        'Age_20_24_yr_P': 'count_age_20_24',
                                        'Age_25_34_yr_P': 'count_age_25_34',
                                        'Age_35_44_yr_P': 'count_age_35_44',
                                        'Age_45_54_yr_P': 'count_age_45_54',
                                        'Age_55_64_yr_P': 'count_age_55_64',
                                        'Age_65_74_yr_P': 'count_age_65_74',
                                        'Age_75_84_yr_P': 'count_age_75_84',
                                        'Age_85ov_P': 'count_age_over_85',
                                        'Indigenous_P_Tot_P': 'count_atsi',
                                        'Birthplace_Australia_P': 'count_birthplace_au',
                                        'Birthplace_Elsewhere_P': 'count_born_overseas',
                                        'Lang_spoken_home_Eng_only_P': 'count_english_at_home',
                                        'Lang_spoken_home_Oth_Lang_P': 'count_linguistically_diverse',
                                        'Australian_citizen_P': 'count_au_citizens',
                                        'High_yr_schl_comp_Yr_12_eq_P': 'count_grade_12_completed'}

sa2_migration_timeframe_replacement_dict = {'SA2_MAINCODE_2016': 'SA2_MAIN16', 
                                            'Tot_2006_2010':'count_migration_2006_2010', 
                                            'Tot_2011':'count_migration_2011', 
                                            'Tot_2012':'count_migration_2012', 
                                            'Tot_2013':'count_migration_2013', 
                                            'Tot_2014': 'count_migration_2014', 
                                            'Tot_2015': 'count_migration_2015', 
                                            'Tot_2016':'count_migration_2016'}

sa2_university_replacement_dict = {'SA2_MAINCODE_2016': 'SA2_MAIN16',
                                   'Uni_other_Tert_Instit_Tot_P':'count_university_students'}

sa2_labourforce_replacement_dict = {'SA2_MAINCODE_2016': 'SA2_MAIN16',
                                    'Percent_Unem_loyment_M': 'pc_unemployment_male',
                                    'Percent_Unem_loyment_F': 'pc_unemployment_female',
                                    'Percent_Unem_loyment_P': 'pc_unemployment_all',
                                    'Percnt_LabForc_prticipation_M':'pc_labourforce_participation_male',
                                    'Percnt_LabForc_prticipation_F': 'pc_labourforce_participation_female',
                                    'Percnt_LabForc_prticipation_P': 'pc_labourforce_participation_all'}

sa2_religion_replacement_dict = {'SA2_MAINCODE_2016': 'SA2_MAIN16',
                                 'Buddhism_P': 'count_religion_buddhism',
                                 'Christianity_Tot_P': 'count_religion_christianity',
                                 'Hinduism_P': 'count_religion_hinduism',
                                 'Islam_P': 'count_religion_islam',
                                 'Judaism_P': 'count_religion_judaism',
                                 'Other_Religions_Tot_P': 'count_religion_other',
                                 'SB_OSB_NRA_Tot_P': 'count_religion_secular_spiritual_non_religious',
                                 'Religious_affiliation_ns_P': 'count_religion_withheld',
                                 'Tot_P': 'total_for_religion'}

sa2_industry_replacement_dict = {'SA2_MAINCODE_2016': 'SA2_MAIN16',
                                 'Agri_for_fish_Tot':'count_industry_agriculture_forestry_fisheries',
                                 'Mining_Tot': 'count_industry_mining',
                                 'Manufacturing_Tot': 'count_industry_manufacturing',
                                 'El_Gas_W_W_Tot': 'count_industry_elect_gas_water_waste',
                                 'Construction_Tot': 'count_industry_construction',
                                 'WhlesaleTde_Tot': 'count_industry_wholesale_trade',
                                 'RetTde_Tot': 'count_industry_retail_trade',
                                 'Acom_food_scs_Tot':'count_industry_accomodation_food_service',
                                 'Trans_po_wh_Tot': 'count_industry_transport_postal_warehouse',
                                 'Infon_med_tel_Tot': 'count_industry_information_telecomm_media',
                                 'Fin_and_ins_s_Tot': 'count_industry_finance_insurance',
                                 'Rent_hi_re_es_Tot': 'count_industry_hiring_real_estate',
                                 'Prof_sci_tec_Tot': 'count_industry_professional_scientific_technical',
                                 'Admin_sup_s_Tot': 'count_industry_admin_support',
                                 'Pub_adm_sfty_Tot': 'count_industry_public_admin_safety',
                                 'Educ_training_Tot': 'count_industry_education_training',
                                 'Hlth_care_soc_Tot': 'count_industry_health_care_social',
                                 'ArtRecreatTot': 'count_industry_arts_recreation',
                                 'Other_scs_Tot': 'count_industry_other_services',
                                 'ID_NS_Tot': 'count_industry_witheld'}


sa2_occupation_replacement_dict = {'SA2_MAINCODE_2016': 'SA2_MAIN16',
                                   'P_Tot_Managers': 'count_occupation_manager',
                                   'P_Tot_Professionals': 'count_occupation_professional', 
                                   'P_Tot_TechnicTrades_W': 'count_occupation_technician', 
                                   'P_Tot_CommunPersnlSvc_W': 'count_occupation_community_personal_service',
                                   'P_Tot_ClericalAdminis_W': 'count_occupation_clerical',
                                   'P_Tot_Sales_W': 'count_occupation_sales',
                                   'P_Tot_Mach_oper_drivers': 'count_occupation_machinery_operation',
                                   'P_Tot_Labourers': 'count_occupation_labourer',
                                   'P_Tot_Occu_ID_NS' : 'count_occupation_witheld',
                                   'P_Tot_Tot' : 'total_for_occupations'}

sa2_family_composition_replacement_dict = {'SA2_MAINCODE_2016': 'SA2_MAIN16',
                                        'CF_no_children_F': 'count_family_couple_no_children',
                                        'CF_Total_F': 'count_family_nuclear', 
                                        'OPF_Total_F': 'count_family_single_parent', 
                                        'Other_family_F': 'count_family_other', 
                                        'Total_F': 'total_families'}


sa2_dwelling_type_replacement_dict = {'SA2_MAINCODE_2016':'SA2_MAIN16',
                                      'O_OR_Total': 'count_dwelling_owned',
                                      'O_MTG_Total': 'count_dwelling_mortgaged',
                                      'R_Tot_Total': 'count_dwelling_rented',
                                      'Oth_ten_type_Total':'count_dwelling_other',
                                      'Ten_type_NS_Total': 'count_dwelling_witheld',
                                      'Total_Total':'total_dwellings'}


polling_place_crs = 'EPSG:4283'


def read_preference_distribution_files(file_list: list = file_list_house_of_reps_data) -> pd.DataFrame:
    
    df_list = []
    
    for file in file_list:
        df_list.append(pd.read_csv(file, skiprows=1))
    
    preference_distribution_df = pd.concat(df_list)
    
    return preference_distribution_df


def clean_preference_distribution_counts_df(df: pd.DataFrame, column_dict: dict = preference_distribution_column_counts_dict) -> pd.DataFrame:
    
    df = (df
          .query("CalculationType == 'Preference Count'")
          .pipe(drop_and_rename_columns, column_dict)
          # Highest count round reached in each division, aligned to every row
          .assign(max_round = lambda df: df.groupby('division_id')['round'].transform('max'),
                  is_final_round = lambda df: df['max_round']==df['round'])
          .drop(columns = ['max_round'])
          .query('preference_count!=0'))
    
    return df
    
    
def clean_preference_distribution_count_transfers_df(df: pd.DataFrame, 
                                                     column_dict: dict = preference_distribution_column_count_transfer_dict) -> pd.DataFrame:
    
    eliminated_rename_dict = {'candidate_id': 'eliminated_candidate_id',
                              'candidate_surname': 'eliminated_candidate_surname',
                              'party_code': 'eliminated_party_code'}
    
    gaining_rename_dict =  {'candidate_id': 'gaining_candidate_id',
                            'candidate_surname': 'gaining_candidate_surname',
                            'party_code': 'gaining_party_code'}
    
    df = (df
          .query("CalculationType == 'Transfer Count'")
          .pipe(drop_and_rename_columns, column_dict))
    
    
    eliminated_df = (df
                     .query('transfer_count<0')
                     .rename(columns = eliminated_rename_dict)
                     .assign(eliminated_candidate_total_value = lambda df: df['transfer_count']*-1)
                     .drop(columns = ['transfer_count']))
    
    
    gaining_df = (df
                  .query('transfer_count>0')
                  .rename(columns = gaining_rename_dict))
    
    
    rejoined_df = (gaining_df
                   .merge(eliminated_df, how = 'inner', on = ['polling_place_id', 'polling_place_name','division_id', 'round']))
    
    return rejoined_df


def read_and_clean_electorate_data(nominations_by_division_file_path: str = nominations_by_division_file_path) -> pd.DataFrame:
    
    df = (pd.read_csv(nominations_by_division_file_path, skiprows=1)
          .pipe(drop_and_rename_columns, division_rename_dict))
    
    return df


def prepare_enriched_electorate_data(electorate_statistics_file_path: str = electorate_statistics_file_path) -> pd.DataFrame:
    
    electorate_statistics = pd.read_csv(electorate_statistics_file_path)
    
    electorate_data = read_and_clean_electorate_data()
    
    enriched_electorate_data = electorate_data.merge(electorate_statistics, how = 'inner', on ='division_name')
    
    return enriched_electorate_data


def read_and_clean_party_df(party_file_path: str = party_file_path, column_rename_dict: dict = parties_df_replace_dict) -> pd.DataFrame:
    
    # Read from the path argument rather than a hard-coded file name
    df = (pd.read_csv(party_file_path, skiprows=1)
          .pipe(drop_and_rename_columns, column_rename_dict))
    
    return df


def read_and_clean_candidate_df(candidate_file_path: str = candidate_file_path, column_rename_dict: dict = candidate_df_replace_dict) -> pd.DataFrame:
    
    df = (pd.read_csv(candidate_file_path, skiprows=1)
          .pipe(drop_and_rename_columns, column_rename_dict))
    
    return df


def prepare_enriched_polling_place_data()-> pd.DataFrame:
    
    polling_place = read_and_clean_polling_place_data()
    
    sa2_2016_data = read_and_clean_2016_sa2_data()
    
    polling_place_enriched = (polling_place
                              .sjoin(sa2_2016_data, how ='left')
                              .drop(columns = ['geometry', 'index_right']))
    
    return polling_place_enriched


def read_and_clean_polling_place_data(polling_place_file_path: str = polling_place_file_path,
                                      polling_place_crs: str = polling_place_crs,
                                      polling_place_rename_dict: dict = polling_place_rename_dict) -> gpd.GeoDataFrame:
    
    gdf = (gpd.GeoDataFrame(pd.read_csv(polling_place_file_path, skiprows=1))
           .assign(geometry = lambda df: gpd.points_from_xy(df['Longitude'], df['Latitude'], crs = polling_place_crs))
           .pipe(drop_and_rename_columns, polling_place_rename_dict))
    
    return gdf


def read_and_clean_2016_sa2_data(sa2_data_file_path: str = sa2_data_file_path,
                                 sa2_2016_rename_dict = sa2_2016_rename_dict) -> gpd.GeoDataFrame:
    
    df = (gpd.read_file(sa2_data_file_path)
          .pipe(drop_and_rename_columns, sa2_2016_rename_dict))
    
    return df

"""
## Low level SA2 statistical data prep functions

"""
    

def read_and_clean_sa2_medians_data(sa2_medians_file_path: str = sa2_medians_file_path) -> pd.DataFrame:
    
    df = (pd.read_csv(sa2_medians_file_path)
          .pipe(drop_and_rename_columns, sa2_medians_replacement_dict))
    
    return df



def read_and_clean_sa2_age_total_population_data(sa2_age_total_population_file_path: str = sa2_age_total_population_file_path) -> pd.DataFrame:
    
    list_over_20_columns = ['count_age_20_24',
                            'count_age_25_34',
                            'count_age_35_44',
                            'count_age_45_54',
                            'count_age_55_64',
                            'count_age_65_74',
                            'count_age_75_84',
                            'count_age_over_85']
    
    sum_string = '+'.join(list_over_20_columns) 
    
    df = (pd.read_csv(sa2_age_total_population_file_path)
          .pipe(drop_and_rename_columns, sa2_age_total_population_rename_dict)
          .assign(count_age_over_20 = lambda df: df.eval(sum_string),
                  high_school_completion_rate = lambda df: df['count_grade_12_completed']/df['count_age_over_20'])
          .drop(columns = ['count_age_over_20', 'count_grade_12_completed'])
          .pipe(transform_all_count_variables_to_pc_variables, ignore_variable_list = ['SA2_MAIN16','high_school_completion_rate']))
    
    return df


def read_and_clean_sa2_dwelling_data(sa2_dwelling_type_file_path: str = sa2_dwelling_type_file_path) -> pd.DataFrame:
    
    df = (pd.read_csv(sa2_dwelling_type_file_path)
          .pipe(drop_and_rename_columns, sa2_dwelling_type_replacement_dict)
          .pipe(transform_all_count_variables_to_pc_variables, pc_basis_variable = 'total_dwellings'))
    
    return df


def read_and_clean_sa2_family_composition_data(sa2_family_composition_file_path: str = sa2_family_composition_file_path) -> pd.DataFrame:
    
    df = (pd.read_csv(sa2_family_composition_file_path)
          .pipe(drop_and_rename_columns, sa2_family_composition_replacement_dict)
          .pipe(transform_all_count_variables_to_pc_variables, pc_basis_variable = 'total_families'))
    
    return df


def read_and_clean_sa2_migration_timeframe_data(sa2_migration_timeframe_file_path: str = sa2_migration_timeframe_file_path) -> pd.DataFrame:
    
    df = (pd.read_csv(sa2_migration_timeframe_file_path)
          .pipe(drop_and_rename_columns, sa2_migration_timeframe_replacement_dict)
          .set_index(['SA2_MAIN16']).sum(axis=1)
          .reset_index()
          .rename(columns = {0:'count_migrated_since_2006'}))
    
    return df


def read_and_clean_sa2_occupation_data(sa2_occupation_file_path: str = sa2_occupation_file_path) -> pd.DataFrame:
    
    df = (pd.read_csv(sa2_occupation_file_path)
          .pipe(drop_and_rename_columns, sa2_occupation_replacement_dict)
          .pipe(transform_all_count_variables_to_pc_variables, pc_basis_variable = 'total_for_occupations'))
    
    return df


def read_and_clean_sa2_industry_data(sa2_industry_file_path: str = sa2_industry_file_path) -> pd.DataFrame:
    
    df = (pd.read_csv(sa2_industry_file_path)
          .pipe(drop_and_rename_columns, sa2_industry_replacement_dict)
          .assign(total_for_industry = lambda df: (df
                                                   .fillna(0)
                                                   .set_index('SA2_MAIN16')
                                                   .sum(axis=1).values))
          .pipe(transform_all_count_variables_to_pc_variables, pc_basis_variable = 'total_for_industry'))
    
    return df

def read_and_clean_sa2_university_data(sa2_university_file_path: str = sa2_university_file_path) -> pd.DataFrame:
    
    df = (pd.read_csv(sa2_university_file_path)
          .pipe(drop_and_rename_columns, sa2_university_replacement_dict))
    
    return df


def read_and_clean_sa2_labourforce_data(sa2_labourforce_file_path: str = sa2_labourforce_file_path) -> pd.DataFrame:
    
    df = (pd.read_csv(sa2_labourforce_file_path)
          .pipe(drop_and_rename_columns, sa2_labourforce_replacement_dict))
    
    columns_for_division_list = [column for column in df.columns if column!='SA2_MAIN16']
    
    for column in columns_for_division_list:
        
        df[column] = df[column]/100
    
    
    return df


def read_and_clean_sa2_religion_data(sa2_religion_file_path: str = sa2_religion_file_path) -> pd.DataFrame:
    
    df = (pd.read_csv(sa2_religion_file_path)
          .pipe(drop_and_rename_columns, sa2_religion_replacement_dict)
          .pipe(transform_all_count_variables_to_pc_variables, pc_basis_variable = 'total_for_religion'))
    
    
    return df


"""
# Prepare SA2 statistical data.

Prepares the SA2 statistical data records.

"""

sa2_function_list = [read_and_clean_sa2_dwelling_data, 
                     read_and_clean_sa2_family_composition_data,
                     read_and_clean_sa2_migration_timeframe_data,
                     read_and_clean_sa2_occupation_data,
                     read_and_clean_sa2_industry_data,
                     read_and_clean_sa2_university_data,
                     read_and_clean_sa2_labourforce_data,
                     read_and_clean_sa2_religion_data]

def prepare_sa2_statistics_data(function_list: list = sa2_function_list) -> pd.DataFrame:
    
    df = pd.DataFrame()
    
    for function in function_list:
        if len(df)==0:
            df = function()
        else:
            additional_df = function()
            df = df.merge(additional_df, how ='outer', on = 'SA2_MAIN16')
    
    return df


"""
## Utility Functions
"""

def transform_all_count_variables_to_pc_variables(df: pd.DataFrame, 
                                                  pc_basis_variable:str = 'count_persons', 
                                                  ignore_variable_list: list = ['SA2_MAIN16']) -> pd.DataFrame:
    
    non_target_columns = ignore_variable_list + [pc_basis_variable]
    
    target_columns = [column for column in df.columns if column not in non_target_columns]
    
    for target_column_name in target_columns:
        
        new_column_name = target_column_name.replace('count','pc')
        
        df = (df
              .assign(dummy_column = lambda df: df[target_column_name]/df[pc_basis_variable])
              .rename(columns = {'dummy_column':new_column_name})
              .drop(columns = [target_column_name]))
    
    return df


def drop_and_rename_columns(df: pd.DataFrame, drop_and_rename_dict: dict) -> pd.DataFrame:
    
    """
    This function essentially says if you want to rename only the columns that you want to keep, just wrap the rename dict with this function and save yourself a line...
    but does so by writing an additional function but damn it was getting repetitive.
    
    """
        
    keep_only_keys_list = list(drop_and_rename_dict.keys())
    
    df = (df[keep_only_keys_list]
          .rename(columns = drop_and_rename_dict))
    
    
    return df


"""
# Main Function

Calls Multiple sub functions to execute routine
"""

def main(output_file_path: str = output_file_path) -> None:
    
    write_options = {'encoding': 'utf-8',
                     'index': False}
    
    # Read the raw preference files once; both outputs below derive from them.
    preference_distribution_df = read_preference_distribution_files()
    
    # to_csv returns None when given a path, so the results are not assigned.
    (preference_distribution_df
     .pipe(clean_preference_distribution_counts_df)
     .to_csv(f'{output_file_path}election_count.csv', **write_options))
    
    (preference_distribution_df
     .pipe(clean_preference_distribution_count_transfers_df)
     .to_csv(f'{output_file_path}count_transfers.csv', **write_options))
    
    (read_and_clean_party_df()
     .to_csv(f'{output_file_path}political_parties.csv', **write_options))
    
    (pd.DataFrame(prepare_enriched_polling_place_data())
     .to_csv(f'{output_file_path}polling_places.csv', **write_options))
    
    (prepare_enriched_electorate_data()
     .to_csv(f'{output_file_path}electorates.csv', **write_options))
    
    (read_and_clean_candidate_df()
     .to_csv(f'{output_file_path}candidates.csv', **write_options))
    
    (prepare_sa2_statistics_data()
     .to_csv(f'{output_file_path}sa2_statistics.csv', **write_options))
    
    return None



"""
# If called on to run as __main__
"""


if __name__ =='__main__':
    
    main()