# Census Bureau American Community Survey ETL

This notebook extracts Census Bureau data from the ACS 1-year Supplemental Estimates (ACSSE) and the ACS 5-year estimates for geographies derived from a set of target counties (passed in via their 2-digit state + 3-digit county FIPS codes).

ACS Supplemental Estimates are updated yearly with 12 months of collected data, but the smallest geographies supported are Public Use Microdata Areas (PUMAs) and Census Designated Places (CDPs), and only those with populations of 20,000 or more. ACS 5-year estimates are updated yearly with 60 months of collected data and support geographies down to the block group level, with data for all geographies regardless of population.

This ETL process uses `COUNTY` as the reference geography from which all other geographies are derived. For example, if the Texas county of Bexar is the `COUNTY` of reference, data associated with any `CD<current_congress>`, `PLACE`, `PUMA20`, and `ZIP` geographies that intersect with Bexar `COUNTY` (have any overlapping area) will also be collected for analysis.

*Note: running this notebook requires Shapefiles for `CD<current_congress>`, `COUNTY`, `PLACE` (Texas), and `PUMA20`.*

## References
Using 1-year or 5-year American Community Survey Data
- https://www.census.gov/programs-surveys/acs/guidance/estimates.html

ACS 1-year Supplemental Estimates Data Homepage
- https://www.census.gov/data/developers/data-sets/ACS-supplemental-data.html

ACS 1-year Supplemental Estimates Tables
- https://api.census.gov/data/2022/acs/acsse/variables.html

ACS 1-year Supplemental Estimates Available Geographies
- https://api.census.gov/data/2022/acs/acsse/geography.html

ACS 5-year Data Homepage
- https://www.census.gov/data/developers/data-sets/acs-5year.html

ACS 5-year Tables
- https://api.census.gov/data/2022/acs/acs5/variables.html

ACS 5-year Available Geographies
- https://api.census.gov/data/2022/acs/acs5/geography.html

## User Input

This section can be edited by the user of this notebook to change certain settings:
- debug mode
- county or counties of reference (by combination of state and county FIPS codes)
- year of survey
- API key

In [1]:
# if debug is true, data extracts are limited and database writes are disabled
debug = False

# reference county or counties, as FIPS state + county code
state_fips = ['48']
county_fips = ['029']
# specify the data source by year
year = '2022'
# API key string, or import from constants.py file
import constants
api_key = constants.census_api_key

## Pre-ETL

Sets up imports and checks whether the required files are available locally or need to be downloaded from source.

In [2]:
import zipfile
import pandas as pd
import geopandas as gpd
import warnings
import requests
import os
import numpy as np
import sqlalchemy

In [3]:
# combines state FIPS and county FIPS codes into one string inside list object
county_or_counties = []
for index, state_code in enumerate(state_fips):
    county_or_counties.append(state_code + county_fips[index])
county_or_counties

['48029']
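The pairing above relies on the two lists lining up by position. A minimal sketch of the same idea follows, assuming two hypothetical helpers that are not part of the notebook: `build_county_geoids`, which also zero-pads short codes, and `county_ucgid`, which prepends the `0500000US` prefix seen in the county Shapefile's `GEOIDFQ` field.

```python
def build_county_geoids(state_fips: list[str], county_fips: list[str]) -> list[str]:
    """Pair each state FIPS code with the county FIPS code at the same position."""
    if len(state_fips) != len(county_fips):
        raise ValueError('state_fips and county_fips must have the same length')
    # zero-pad so that e.g. '29' becomes '029' before concatenating
    return [state.zfill(2) + county.zfill(3) for state, county in zip(state_fips, county_fips)]


def county_ucgid(geoid: str) -> str:
    """Prefix a 5-digit county GEOID with the county summary level code."""
    return '0500000US' + geoid


geoids = build_county_geoids(['48'], ['029'])
print(geoids)                   # ['48029']
print(county_ucgid(geoids[0]))  # 0500000US48029
```

Raising on mismatched list lengths is a design choice of this sketch; the notebook's loop would instead fail with an `IndexError` or silently ignore extra county codes.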

In [4]:
# lists surveys to iterate through for survey-dependent calls
surveys = ['acsse', 'acs5']

In [5]:
# creates dictionary to contain data for ACS 1yr and ACS 5yr data
acs_data_dict = {}

In [6]:
# checks to see if Shapefile directories contain data, download if not
# todo: add Shapefile extract code for other geographies
shapefiles_base_dirpath = '../data/geospatial_files/shapefiles/census_bureau'
shapefile_types = ['congressional_districts', 'counties', 'places', 'pumas', 'zip_code_tabulation_areas', 'block_groups']

shapefiles_base_url = f'https://www2.census.gov/geo/tiger/TIGER{year}'

for shapefile_type in shapefile_types:
    if not os.path.exists(f'{shapefiles_base_dirpath}/{shapefile_type}/{year}'):
        os.makedirs(f'{shapefiles_base_dirpath}/{shapefile_type}/{year}')
        
        if shapefile_type == 'block_groups':
            # Shapefile download URL
            block_group_url = f'{shapefiles_base_url}/BG/tl_{year}_{state_fips[0]}_bg.zip'
            # local directory and .zip filepath
            block_group_filepath = f'{shapefiles_base_dirpath}/{shapefile_type}/{year}/'
            block_group_zip_filepath = block_group_filepath + f'{year}_{state_fips[0]}_bg.zip'
            shapefile_response = requests.get(block_group_url)
            # stops here if the download failed, rather than writing an error page to disk
            shapefile_response.raise_for_status()

            # write URL response to file
            with open(block_group_zip_filepath, 'wb') as f:
                f.write(shapefile_response.content)

            # extract .zip contents (the context manager closes the archive)
            with zipfile.ZipFile(block_group_zip_filepath, mode='r') as archive:
                archive.extractall(path=block_group_filepath)

            # delete unnecessary files, keeping only the .shp, .shx, and .dbf files
            for file in os.listdir(block_group_filepath):
                if not file.endswith(('.shp', '.shx', '.dbf')):
                    os.remove(block_group_filepath + file)
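The download step can be checked without touching the network by isolating its two decisions: which URL to request and which extracted files to delete. The sketch below assumes two hypothetical helpers, `block_group_zip_url` and `files_to_delete`, that mirror the URL pattern and the extension filter used in the cell above.

```python
from pathlib import Path

# the three Shapefile components the notebook keeps
REQUIRED_SUFFIXES = {'.shp', '.shx', '.dbf'}


def block_group_zip_url(year: str, state_fips_code: str) -> str:
    """Build the TIGER/Line block group archive URL for one state."""
    return f'https://www2.census.gov/geo/tiger/TIGER{year}/BG/tl_{year}_{state_fips_code}_bg.zip'


def files_to_delete(filenames: list[str]) -> list[str]:
    """Return the extracted files that are not required Shapefile components."""
    return [name for name in filenames if Path(name).suffix.lower() not in REQUIRED_SUFFIXES]


print(block_group_zip_url('2022', '48'))
# https://www2.census.gov/geo/tiger/TIGER2022/BG/tl_2022_48_bg.zip
print(files_to_delete(['tl_2022_48_bg.shp', 'tl_2022_48_bg.prj', 'tl_2022_48_bg.shp.xml']))
# ['tl_2022_48_bg.prj', 'tl_2022_48_bg.shp.xml']
```

Note that the `.prj` file, which carries the coordinate reference system, is among the files removed. That may be why later cells assign a CRS explicitly with `set_crs`.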

## Extract

### Preparation

In [7]:
# grabs "crosswalk" table for name-label-concept list available through Census Bureau website that contains names for each individual field in each table, which will be used to programmatically give human-readable names to DataFrame/database columns
for survey in surveys:
    crosswalk_df = pd.DataFrame()
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}
    crosswalk_url = f'https://api.census.gov/data/{year}/acs/{survey}/variables/'
    
    # local CSV versions of variable data
    crosswalk_dir = '../data/datasets/census_bureau/'
    crosswalk_csv = f'{survey}_crosswalk.csv'
    
    try:
        crosswalk_response = requests.get(crosswalk_url, headers=headers)
        # raises an error on a non-200 response so that the local copy is used instead
        crosswalk_response.raise_for_status()
        crosswalk_df = pd.DataFrame(crosswalk_response.json())
        # saves local copy of DataFrame as .csv file in case page is unavailable
        os.makedirs(crosswalk_dir, exist_ok=True)
        crosswalk_df.to_csv(crosswalk_dir + crosswalk_csv)
    except (requests.RequestException, ValueError):
        # falls back to the local copy saved by a previous run
        crosswalk_df = pd.read_csv(crosswalk_dir + crosswalk_csv, index_col=0)
        
    acs_data_dict[f'{survey}_crosswalk_df'] = crosswalk_df
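The crosswalk cell follows a fetch-then-cache pattern: use the live source when it responds, otherwise read the copy saved by an earlier run. The sketch below shows that pattern in isolation. It is an illustration under assumptions: `load_with_fallback` is a hypothetical helper, the fetcher is injected so no network call is made, and the cache is JSON rather than the notebook's CSV.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_with_fallback(fetch: Callable[[], list], cache_path: Path) -> list:
    """Return fresh rows when `fetch` succeeds, otherwise the cached copy."""
    try:
        rows = fetch()
    except Exception:
        # source unavailable: fall back to the copy saved by an earlier run
        return json.loads(cache_path.read_text())
    # refresh the cache on every successful fetch
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(rows))
    return rows


def unavailable() -> list:
    """Stand-in for a request that fails."""
    raise ConnectionError('simulated outage')


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / 'census_bureau' / 'acs5_crosswalk.json'
    first = load_with_fallback(lambda: [['name', 'label', 'concept']], cache)
    second = load_with_fallback(unavailable, cache)
    print(first == second)  # True
```

Writing the cache on every successful fetch, rather than only when the directory is first created, keeps the fallback copy current.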

In [8]:
# converts first row into column headers, then deletes the row
for survey in surveys:
    crosswalk_df = acs_data_dict[f'{survey}_crosswalk_df']
    crosswalk_df.columns = crosswalk_df.iloc[0]
    crosswalk_df = crosswalk_df[1:]
    
    # removes rows not used for naming columns locally
    crosswalk_df = crosswalk_df[crosswalk_df['name'].str.startswith('K') | crosswalk_df['name'].str.startswith('B')]
    
    # `BLKGRP` is a geography field that passes the 'B' filter but is not a table variable
    if survey == 'acs5':
        crosswalk_df = crosswalk_df[crosswalk_df['name'] != 'BLKGRP']
        
    acs_data_dict[f'{survey}_crosswalk_df'] = crosswalk_df

In [9]:
acs_data_dict['acs5_crosswalk_df'].head()

Unnamed: 0,name,label,concept
4,B24022_060E,Estimate!!Total:!!Female:!!Service occupations...,Sex by Occupation and Median Earnings in the P...
5,B19001B_014E,"Estimate!!Total:!!$100,000 to $124,999",Household Income in the Past 12 Months (in 202...
6,B07007PR_019E,Estimate!!Total:!!Moved from different municip...,Geographical Mobility in the Past Year by Citi...
7,B19101A_004E,"Estimate!!Total:!!$15,000 to $19,999",Family Income in the Past 12 Months (in 2022 I...
8,B24022_061E,Estimate!!Total:!!Female:!!Service occupations...,Sex by Occupation and Median Earnings in the P...
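The variables endpoint returns a list of rows whose first row holds the field names, which is why the first row is promoted to column headers before filtering. A pandas-free sketch of the same two steps follows. `crosswalk_records` is a hypothetical helper, and the sample rows use placeholder labels and concepts rather than real API output.

```python
def crosswalk_records(rows: list[list[str]], drop_names: tuple[str, ...] = ()) -> list[dict]:
    """Promote the first row to field names and keep only table variables."""
    header, *body = rows
    records = [dict(zip(header, row)) for row in body]
    # table variables start with 'B' or 'K'; explicitly listed names are dropped
    return [
        record for record in records
        if record['name'].startswith(('B', 'K')) and record['name'] not in drop_names
    ]


# illustrative rows; labels and concepts are placeholders, not real API output
rows = [
    ['name', 'label', 'concept'],
    ['ucgid', 'placeholder label', 'placeholder concept'],
    ['BLKGRP', 'placeholder label', 'placeholder concept'],
    ['B19001B_014E', 'placeholder label', 'placeholder concept'],
]
kept = crosswalk_records(rows, drop_names=('BLKGRP',))
print([record['name'] for record in kept])  # ['B19001B_014E']
```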


In [10]:
# transforms crosswalk_df by truncating `name` column to its table 'group' name (and deleting anything that's not a table name) and normalizing text in `concept` field to lowercase/no spaces format
for survey in surveys:
    tables_df = acs_data_dict[f'{survey}_crosswalk_df'].copy()
    tables_df['name'] = acs_data_dict[f'{survey}_crosswalk_df']['name'].str.split('_').str[0]
    tables_df = tables_df.drop_duplicates(subset='name')
    tables_df = tables_df.drop(columns='label')
    tables_df['concept'] = tables_df['concept'].str.replace(' ', '_').str.lower()
    
    acs_data_dict[f'{survey}_tables_df'] = tables_df        

In [11]:
acs_data_dict['acs5_tables_df'].head()

Unnamed: 0,name,concept
4,B24022,sex_by_occupation_and_median_earnings_in_the_p...
5,B19001B,household_income_in_the_past_12_months_(in_202...
6,B07007PR,geographical_mobility_in_the_past_year_by_citi...
7,B19101A,family_income_in_the_past_12_months_(in_2022_i...
14,B01001B,sex_by_age_(black_or_african_american_alone)
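The two string transformations behind this table can be tried on their own. The sketch below assumes hypothetical helpers `table_name` and `normalize_concept`, and uses the variable names and one concept shown in the outputs above.

```python
def table_name(variable_name: str) -> str:
    """Truncate a variable name such as 'B24022_060E' to its table (group) name."""
    return variable_name.split('_')[0]


def normalize_concept(concept: str) -> str:
    """Lowercase a concept and replace spaces with underscores."""
    return concept.replace(' ', '_').lower()


variables = ['B24022_060E', 'B19001B_014E', 'B07007PR_019E', 'B19101A_004E', 'B24022_061E']
# dict.fromkeys removes duplicates while keeping first-seen order
tables = list(dict.fromkeys(table_name(variable) for variable in variables))
print(tables)  # ['B24022', 'B19001B', 'B07007PR', 'B19101A']
print(normalize_concept('Sex by Age (Black or African American Alone)'))
# sex_by_age_(black_or_african_american_alone)
```

Parentheses survive the normalization, as the `concept` column above shows, so these values may need further cleaning before being used as identifiers.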


In [12]:
# loads GeoDataFrame from Shapefiles for reference geographies and turns county UCGIDs into an iterable list
county_ucgids_list = []
county_ucgids_list_of_lists = []
target_counties_gdf = gpd.GeoDataFrame()

counties_gdf = gpd.read_file('../data/geospatial_files/shapefiles/census_bureau/counties/tl_2023_us_county.shp')
counties_gdf.set_crs(epsg='3395', inplace=True)
for county in county_or_counties:
    county_gdf = counties_gdf[counties_gdf['GEOID'] == county]
    county_ucgids_list_of_lists.append(list(counties_gdf['GEOIDFQ'][counties_gdf['GEOID'] == county])) 
    target_counties_gdf = pd.concat([target_counties_gdf, county_gdf])

for ucgid in county_ucgids_list_of_lists:
    county_ucgids_list.append(ucgid[0])

target_counties_gdf.head()

Unnamed: 0,STATEFP,COUNTYFP,COUNTYNS,GEOID,GEOIDFQ,NAME,NAMELSAD,LSAD,CLASSFP,MTFCC,CSAFP,CBSAFP,METDIVFP,FUNCSTAT,ALAND,AWATER,INTPTLAT,INTPTLON,geometry
615,48,29,1383800,48029,0500000US48029,Bexar,Bexar County,6,H1,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.327 29.210, -98.327 29.210, -98...."


In [13]:
# loads GeoDataFrame from Shapefiles for `CD<current_congress>` geographies based on reference geographies
congressional_districts_gdf = gpd.read_file(
    '../data/geospatial_files/shapefiles/census_bureau/congressional_districts/118th_congress/tl_2023_48_cd118.shp')
congressional_districts_gdf.set_crs(epsg='3395', inplace=True)
# creates overlay, keeping only polygons that exist in both GeoDataFrames
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    congressional_districts_by_county_gdf = congressional_districts_gdf.overlay(target_counties_gdf, how='intersection')
# creates list of UCGIDs to use as inputs for API caller
congressional_districts_by_county_ucgid_list = list(congressional_districts_by_county_gdf['GEOIDFQ_1'])

congressional_districts_by_county_gdf.head()

Unnamed: 0,STATEFP_1,CD118FP,GEOID_1,GEOIDFQ_1,NAMELSAD_1,LSAD_1,CDSESSN,MTFCC_1,FUNCSTAT_1,ALAND_1,...,MTFCC_2,CSAFP,CBSAFP,METDIVFP,FUNCSTAT_2,ALAND_2,AWATER_2,INTPTLAT_2,INTPTLON_2,geometry
0,48,23,4823,5001800US4823,Congressional District 23,C2,118,G5200,N,152261432812,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.805 29.692, -98.803 29.695, -98...."
1,48,28,4828,5001800US4828,Congressional District 28,C2,118,G5200,N,29415114978,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.484 29.141, -98.484 29.142, -98...."
2,48,35,4835,5001800US4835,Congressional District 35,C2,118,G5200,N,1348685093,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.543 29.428, -98.543 29.428, -98...."
3,48,20,4820,5001800US4820,Congressional District 20,C2,118,G5200,N,464891989,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.788 29.501, -98.788 29.501, -98...."
4,48,21,4821,5001800US4821,Congressional District 21,C2,118,G5200,N,16309930932,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.314 29.602, -98.314 29.602, -98...."


In [14]:
# loads GeoDataFrame from Shapefiles for `PLACE` geographies based on reference geographies
places_gdf = gpd.read_file('../data/geospatial_files/shapefiles/census_bureau/places/tl_2023_48_place.shp')
places_gdf.set_crs(epsg='3395', inplace=True)
# creates overlay, keeping only polygons that exist in both GeoDataFrames
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    places_by_county_gdf = places_gdf.overlay(target_counties_gdf, how='intersection')
# creates list of UCGIDs to use as inputs for API caller
places_by_county_ucgid_list = list(places_by_county_gdf['GEOIDFQ_1'])

places_by_county_gdf.head()

Unnamed: 0,STATEFP_1,PLACEFP,PLACENS,GEOID_1,GEOIDFQ_1,NAME_1,NAMELSAD_1,LSAD_1,CLASSFP_1,PCICBSA,...,MTFCC_2,CSAFP,CBSAFP,METDIVFP,FUNCSTAT_2,ALAND_2,AWATER_2,INTPTLAT_2,INTPTLON_2,geometry
0,48,67268,2411878,4867268,1600000US4867268,Shavano Park,Shavano Park city,25,C1,N,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.576 29.592, -98.576 29.592, -98...."
1,48,64172,2412593,4864172,1600000US4864172,St. Hedwig,St. Hedwig town,43,C1,N,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.272 29.421, -98.272 29.421, -98...."
2,48,74408,2412134,4874408,1600000US4874408,Universal City,Universal City city,25,C1,N,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.330 29.539, -98.327 29.541, -98...."
3,48,33146,2410736,4833146,1600000US4833146,Helotes,Helotes city,25,C1,N,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.728 29.531, -98.727 29.532, -98...."
4,48,68708,2411926,4868708,1600000US4868708,Somerset,Somerset city,25,C1,N,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.674 29.228, -98.674 29.228, -98...."


In [15]:
# loads GeoDataFrame from Shapefiles for `PUMA20` geographies based on reference geographies
pumas_gdf = gpd.read_file('../data/geospatial_files/shapefiles/census_bureau/pumas/tl_2023_48_puma20.shp')
pumas_gdf.set_crs(epsg='3395', inplace=True)
# creates overlay, keeping only polygons that exist in both GeoDataFrames
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    pumas_by_county_gdf = pumas_gdf.overlay(target_counties_gdf, how='intersection')
# creates list of UCGIDs to use as inputs for API caller
pumas_by_county_ucgid_list = list(pumas_by_county_gdf['GEOIDFQ20'])

pumas_by_county_gdf.head()

Unnamed: 0,STATEFP20,PUMACE20,GEOID20,GEOIDFQ20,NAMELSAD20,MTFCC20,FUNCSTAT20,ALAND20,AWATER20,INTPTLAT20,...,MTFCC,CSAFP,CBSAFP,METDIVFP,FUNCSTAT,ALAND,AWATER,INTPTLAT,INTPTLON,geometry
0,48,5907,4805907,795P200US4805907,Bexar County (South)--San Antonio City (Far So...,G6120,S,1271464920,33914665,29.3069306,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.305 29.237, -98.307 29.235, -98...."
1,48,5908,4805908,795P200US4805908,San Antonio City (West)--Between Loop TX-1604 ...,G6120,S,68356557,216036,29.4400578,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.671 29.389, -98.672 29.389, -98...."
2,48,5914,4805914,795P200US4805914,Bexar County (Northwest)--San Antonio (Far Nor...,G6120,S,473825286,874684,29.595835,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.716 29.435, -98.716 29.434, -98...."
3,48,5903,4805903,795P200US4805903,San Antonio City (Southeast)--Inside Loop I-41...,G6120,S,89668983,423359,29.3672916,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.437 29.413, -98.437 29.413, -98...."
4,48,5906,4805906,795P200US4805906,San Antonio City (Southwest)--Inside Loop I-41...,G6120,S,81573819,255474,29.3425496,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.504 29.320, -98.504 29.320, -98...."


In [16]:
# loads GeoDataFrame from Shapefiles for block group geographies based on reference geographies
block_group_gdf = gpd.read_file(f'../data/geospatial_files/shapefiles/census_bureau/block_groups/{year}/tl_{year}_{state_fips[0]}_bg.shp')
block_group_gdf.set_crs(epsg='3395', inplace=True)
# creates an overlay that keeps only polygons that exist in both our target geography and the block group geographies
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    block_group_by_county_gdf = block_group_gdf.overlay(target_counties_gdf, how='intersection')
# creates list of UCGIDs to use as inputs for API caller, building each UCGID from the block group summary level prefix (`1500000US`) and `GEOID`
block_group_by_county_gdf['ucgid'] = '1500000US' + block_group_by_county_gdf['GEOID_1']
block_group_by_county_ucgid_list = list(block_group_by_county_gdf['ucgid'])
    
block_group_by_county_gdf.head()

Unnamed: 0,STATEFP_1,COUNTYFP_1,TRACTCE,BLKGRPCE,GEOID_1,NAMELSAD_1,MTFCC_1,FUNCSTAT_1,ALAND_1,AWATER_1,...,CSAFP,CBSAFP,METDIVFP,FUNCSTAT_2,ALAND_2,AWATER_2,INTPTLAT_2,INTPTLON_2,geometry,ucgid
0,48,29,182003,1,480291820031,Block Group 1,G5030,S,5939086,3937,...,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.653 29.594, -98.652 29.595, -98....",1500000US480291820031
1,48,29,150700,2,480291507002,Block Group 2,G5030,S,354965,0,...,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.492 29.377, -98.492 29.377, -98....",1500000US480291507002
2,48,29,121606,3,480291216063,Block Group 3,G5030,S,586667,0,...,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.340 29.554, -98.340 29.555, -98....",1500000US480291216063
3,48,29,121604,2,480291216042,Block Group 2,G5030,S,1509863,7297,...,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.329 29.537, -98.329 29.537, -98....",1500000US480291216042
4,48,29,121604,3,480291216043,Block Group 3,G5030,S,401071,0,...,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.320 29.524, -98.319 29.525, -98....",1500000US480291216043


In [17]:
def api_call_list_creator() -> list:
    # iterates through the DataFrame of tables being collected AND through the UCGIDs collected from all targeted geographies to create one API URL per table and per chunk of UCGIDs
    api_call_url_list = []
    
    for survey in surveys:
        base_url = f'https://api.census.gov/data/{year}/acs/{survey}'
        
        # combines all UCGIDs into one list
        if survey == 'acsse':
            ucgid_list = county_ucgids_list + congressional_districts_by_county_ucgid_list + places_by_county_ucgid_list + pumas_by_county_ucgid_list
        elif survey == 'acs5':
            ucgid_list = block_group_by_county_ucgid_list
            
        # if the list holds too many UCGIDs for one URL, split it into chunks to avoid API errors
        if len(ucgid_list) > 500:
            ucgid_chunks = [list(chunk) for chunk in np.array_split(ucgid_list, 4)]
        else:
            ucgid_chunks = [ucgid_list]
        
        # iterates through tables DataFrame
        for _, row in acs_data_dict[f'{survey}_tables_df'].iterrows():
            # creates one URL per chunk so that no chunk is dropped
            for chunk in ucgid_chunks:
                data_url = f'{base_url}?get=group({row["name"]})&ucgid={",".join(chunk)}&key={api_key}'
                api_call_url_list.append(data_url)
    return api_call_url_list
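The URL builder produces one request per table group and per chunk of UCGIDs. The sketch below illustrates that shape with fixed-size chunks instead of the notebook's four-way `np.array_split`. `chunked` and `build_group_urls` are hypothetical helpers, and the default chunk size of 500 simply reuses the notebook's threshold; it is an assumption, not a documented API limit.

```python
def chunked(items: list[str], max_size: int) -> list[list[str]]:
    """Split a list into consecutive chunks of at most `max_size` items."""
    return [items[start:start + max_size] for start in range(0, len(items), max_size)]


def build_group_urls(base_url: str, table_names: list[str], ucgids: list[str],
                     api_key: str, max_ucgids_per_url: int = 500) -> list[str]:
    """Create one URL per (table, UCGID chunk) pair."""
    urls = []
    for table in table_names:
        for chunk in chunked(ucgids, max_ucgids_per_url):
            urls.append(f'{base_url}?get=group({table})&ucgid={",".join(chunk)}&key={api_key}')
    return urls


ucgids = ['1500000US480291820031', '1500000US480291507002', '1500000US480291216063']
urls = build_group_urls('https://api.census.gov/data/2022/acs/acs5', ['B01001'], ucgids, 'KEY', 2)
print(len(urls))  # 2
print(urls[1])
# https://api.census.gov/data/2022/acs/acs5?get=group(B01001)&ucgid=1500000US480291216063&key=KEY
```

Fixed-size chunks keep every URL under a known length regardless of how many geographies are targeted, whereas a fixed number of chunks grows each URL as the list grows.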

### Extraction

In [18]:
# calls the API with a single URL containing one table group and the targeted UCGIDs (ACSSE only returns geographies with a total population of 20,000 or more)
def api_caller(url):
    r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'})
    if r.status_code == 200:
        return r
    else:
        print(r.status_code)
        print(r.text)
        return r

In [19]:
def data_extractor(url: str) -> pd.DataFrame:
    # calls the API caller once for the given URL and converts the response into a DataFrame
    response = api_caller(url)
    # converts API response JSON object into a DataFrame
    df = pd.DataFrame(response.json())
    # converts first row into column headers, then deletes row
    df.columns = df.iloc[0]
    df = df[1:]
    # drops the 'GEO_ID' column; 'ucgid' is used as the geography key
    df = df.drop(columns=['GEO_ID'], errors='ignore')
    return df

## Transform

Once we've loaded the API data into memory, we can modify the data to exclude unnecessary fields before saving to the database.

In [20]:
def dataframe_column_cleaner(df: pd.DataFrame) -> pd.DataFrame:
    # removes columns representing annotations of estimates (*EA), margins of error (*M), and annotations of margins of error (*MA)
    df = df.drop(columns=df.columns[df.columns.str.endswith(('EA', 'M', 'MA'))])
    
    for series_name in df.columns:
        if series_name == 'ucgid':
            continue
        # converts estimate columns from string to int; columns that cannot be converted (e.g. `NAME`) are left as-is
        try:
            if pd.api.types.is_string_dtype(df[series_name]):
                df = df.astype({series_name: 'int'})
        except (TypeError, ValueError):
            pass
    return df
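The suffix filter and the all-or-nothing integer conversion can be illustrated on plain lists. This is a sketch with hypothetical helpers `estimate_columns` and `convert_column`; the column names follow the `*E`, `*EA`, `*M`, `*MA` pattern described in the cleaner's comment.

```python
ANNOTATION_SUFFIXES = ('EA', 'M', 'MA')


def estimate_columns(columns: list[str]) -> list[str]:
    """Keep columns that are not annotations or margins of error."""
    return [column for column in columns if not column.endswith(ANNOTATION_SUFFIXES)]


def convert_column(values: list) -> list:
    """Convert a whole column to int, or return it unchanged if any value fails."""
    try:
        return [int(value) for value in values]
    except (TypeError, ValueError):
        return values


columns = ['NAME', 'B01001_001E', 'B01001_001EA', 'B01001_001M', 'B01001_001MA', 'ucgid']
print(estimate_columns(columns))                 # ['NAME', 'B01001_001E', 'ucgid']
print(convert_column(['10', '20']))              # [10, 20]
print(convert_column(['Bexar County, Texas']))   # ['Bexar County, Texas']
```

Because the filter matches on suffix alone, any non-table column whose name happens to end in `M` would also be dropped.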

The following cells separate each geographic level of analysis into its own DataFrame, one each for `COUNTY`, `PLACE`, `CD<current_congress>`, `PUMA`, and `BLOCK_GROUP`.

Once separated, each DataFrame is merged with its associated GeoDataFrame in order to keep the GeoDataFrame's `geometry` column, which contains the Shapefile polygons that can be used for geospatial analysis.
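The separation relies on the first three characters of each UCGID, the same prefixes the parser below passes to `startswith`. A minimal sketch of that lookup follows, assuming hypothetical helpers `geography_type` and `group_by_geography` and using UCGIDs taken from the outputs earlier in this notebook.

```python
from collections import defaultdict
from typing import Optional

# UCGID prefix -> geography type, as used by the parser's `startswith` filters
SUMMARY_LEVEL_TO_GEOGRAPHY = {
    '050': 'county',
    '160': 'place',
    '500': 'congressional_district',
    '795': 'puma',
    '150': 'block_group',
}


def geography_type(ucgid: str) -> Optional[str]:
    """Return the geography type for a UCGID, or None if the prefix is unknown."""
    return SUMMARY_LEVEL_TO_GEOGRAPHY.get(ucgid[:3])


def group_by_geography(ucgids: list[str]) -> dict:
    """Group UCGIDs by geography type."""
    groups = defaultdict(list)
    for ucgid in ucgids:
        groups[geography_type(ucgid)].append(ucgid)
    return dict(groups)


ucgids = ['0500000US48029', '5001800US4823', '1600000US4867268',
          '795P200US4805907', '1500000US480291820031']
groups = group_by_geography(ucgids)
print(geography_type('795P200US4805907'))  # puma
print(groups['block_group'])               # ['1500000US480291820031']
```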

In [21]:
def dataframe_geography_parser(df: pd.DataFrame) -> list:
    # separates each geographic level of analysis into its own DataFrame (one API response can hold several levels) and returns a list of (DataFrame, geography type) pairs
    # maps each UCGID prefix to its geography type, source GeoDataFrame, and UCGID column
    geography_map = {
        '050': ('county', target_counties_gdf, 'GEOIDFQ'),
        '160': ('place', places_gdf, 'GEOIDFQ'),
        '500': ('congressional_district', congressional_districts_gdf, 'GEOIDFQ'),
        '795': ('puma', pumas_gdf, 'GEOIDFQ20'),
        '150': ('block_group', block_group_by_county_gdf, 'ucgid'),
    }
    parsed_dataframes = []
    
    for prefix, (dataframe_type, gdf, key) in geography_map.items():
        geography_df = df[df['ucgid'].str.startswith(prefix)]
        # skips geography levels that are not present in this API response
        if geography_df.empty:
            continue
        final_df = pd.merge(geography_df, gdf[[key, 'geometry']], left_on='ucgid', right_on=key)
        parsed_dataframes.append((final_df, dataframe_type))
    
    return parsed_dataframes

## Load

The following code loads the DataFrames/GeoDataFrames into the database for future analysis.

The following cell modifies each DataFrame to ensure column dtype compatibility with SQLAlchemy (Polygon objects in the `geometry` column must be converted to strings), then writes each DataFrame to the database in its own table (one each for `COUNTY`, `CD<current_congress>`, `PLACE`, `PUMA`, and `BLOCK_GROUP`).


In [22]:
def database_writer(index: int, final_df: pd.DataFrame, dataframe_type: str, survey: str) -> None:
    # `index` is 0 on the first write to a survey's database, which triggers saving the crosswalk and tables
    if debug is False:
        databases_dirpath = '../data/databases'
        os.makedirs(databases_dirpath, exist_ok=True)
        
        # completes filepath to database
        demographics_db_filepath = os.path.join(databases_dirpath, f'census_{survey}_{year}.db')
        # creates connection to SQLite database
        sql_engine = sqlalchemy.create_engine('sqlite:///' + demographics_db_filepath)
            
        # converts `geometry` column from `polygon` to `string` for database write
        final_df['geometry'] = final_df['geometry'].astype(str)
        try:
            final_df.to_sql(f'{dataframe_type}', sql_engine, if_exists='append')
            if index == 0:
                acs_data_dict[f'{survey}_crosswalk_df'].to_sql('crosswalk', sql_engine, if_exists='replace')
                acs_data_dict[f'{survey}_tables_df'].to_sql('tables', sql_engine, if_exists='replace')
        except Exception as error:
            # reports the failure instead of silently discarding the data
            print(f'database write failed for {dataframe_type}: {error}')

## Runner
The following code runs the extract, transform, and load functions defined above.

In [24]:
def code_runner() -> None:
    # maps geography type to survey
    survey_map = {'county': 'acsse', 'place': 'acsse', 'congressional_district': 'acsse', 'puma': 'acsse', 'block_group': 'acs5'}
    # counts writes per survey so that the crosswalk and tables are saved once to each survey's database
    writes_per_survey = {survey: 0 for survey in surveys}
    
    # creates list of URLs to call
    api_urls = api_call_list_creator()
    
    for index, url in enumerate(api_urls):
        df = data_extractor(url)
        df = dataframe_column_cleaner(df)
        # writes each geography level found in the response to its own table
        for final_df, df_type in dataframe_geography_parser(df):
            survey = survey_map[df_type]
            database_writer(writes_per_survey[survey], final_df, df_type, survey)
            writes_per_survey[survey] += 1
        if debug is True:
            if index > 1:
                break
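The Load step stores each polygon as text so that it fits in an ordinary database column. The sketch below illustrates that idea with the standard library's `sqlite3` module in place of SQLAlchemy and `DataFrame.to_sql`. `write_rows` is a hypothetical helper, the estimate value is a placeholder, and the geometry is a toy polygon in WKT form rather than a real county boundary.

```python
import sqlite3


def write_rows(connection: sqlite3.Connection, table: str, rows: list[dict]) -> None:
    """Create the table if needed and append the rows, one column per dict key."""
    columns = list(rows[0])
    column_sql = ', '.join(f'"{column}"' for column in columns)
    placeholders = ', '.join('?' for _ in columns)
    connection.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({column_sql})')
    connection.executemany(
        f'INSERT INTO "{table}" ({column_sql}) VALUES ({placeholders})',
        [tuple(row[column] for column in columns) for row in rows],
    )
    connection.commit()


# illustrative row: placeholder estimate and a toy polygon stored as a WKT string
rows = [{'ucgid': '0500000US48029', 'B01001_001E': 100,
         'geometry': 'POLYGON ((0 0, 1 0, 1 1, 0 0))'}]
connection = sqlite3.connect(':memory:')
write_rows(connection, 'county', rows)
print(connection.execute('SELECT ucgid, geometry FROM county').fetchall())
# [('0500000US48029', 'POLYGON ((0 0, 1 0, 1 1, 0 0))')]
```

Appending rows with a different set of columns to an existing table fails in SQLite, so each table group would need either its own table or a merge into one wide DataFrame before the write.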