# Census Bureau American Community Survey ETL

This notebook extracts Census Bureau data at the 1-year supplemental estimate (ACSSE) level and at the 5-year level of analysis for geographies derived from targeted counties (passed in via their 2-digit state + 3-digit county FIPS codes).

ACS Supplemental Estimates are updated yearly with 12 months of collected data, but the smallest geographies supported are Public Use Microdata Areas (PUMAs) and Census Designated Places (CDPs), and only those with populations of 20,000 or more. ACS 5-year estimates are updated yearly with 60 months of collected data and support geographies down to the block group level, with data for all geographies regardless of population.

This ETL process uses `COUNTY` as the reference geography on which all other geographies are based. For example, if the Texas county of Bexar is the `COUNTY` of reference, data associated with any `CD<current_congress>`, `PLACE`, `PUMA20`, and `ZIP` geographies that intersect Bexar `COUNTY` (that is, have any overlapping area) will also be collected for analysis.

*Note: running this notebook requires Shapefiles for `CD<current_congress>`, `COUNTY`, `PLACE` (Texas), and `PUMA20`.*

## References
Using 1-year or 5-year American Community Survey Data
- https://www.census.gov/programs-surveys/acs/guidance/estimates.html

ACS 1-year Supplemental Estimates Data Homepage
- https://www.census.gov/data/developers/data-sets/ACS-supplemental-data.html

ACS 1-year Supplemental Estimates Tables
- https://api.census.gov/data/2022/acs/acsse/variables.html

ACS 1-year Supplemental Estimates Available Geographies
- https://api.census.gov/data/2022/acs/acsse/geography.html

ACS 5-year Data Homepage
- https://www.census.gov/data/developers/data-sets/acs-5year.html

ACS 5-year Tables
- https://api.census.gov/data/2022/acs/acs5/variables.html

ACS 5-year Available Geographies
- https://api.census.gov/data/2022/acs/acs5/geography.html

## User Input

This section can be edited by the user of this notebook to change certain settings:
- initializing run or not
- county of reference
- year of survey
- type of survey
- debug mode

In [178]:
from requests import ReadTimeout

# if initial run (if True, initializes databases, etc.)
initializing = True
# reference county or counties, as FIPS state + county code
state_fips = ['48']
county_fips = ['029']
# specify the data source by year
year = '2022'
# if debug is True, data extract is limited to the first three API calls and database writes are disabled
debug = True

## Pre-ETL

Sets up imports and checks whether the required files are available locally or need to be downloaded from the source.

In [179]:
import zipfile
import pandas as pd
import geopandas as gpd
import warnings
import requests
import os
import sqlalchemy

In [180]:
# combines state FIPS and county FIPS codes into one string inside list object
county_or_counties = []
for index, state_code in enumerate(state_fips):
    county_or_counties.append(state_code + county_fips[index])
county_or_counties

['48029']
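
The index-based loop above relies on `state_fips` and `county_fips` being the same length and already zero-padded. As a minimal sketch of a stricter variant, the hypothetical helper `build_county_geoids` below (a name introduced here for illustration, not part of the notebook) pairs the two lists with `zip` and checks the code widths before concatenating:

```python
def build_county_geoids(state_fips, county_fips):
    """Pair state and county FIPS codes into 5-digit county GEOIDs.

    Hypothetical helper for illustration only.
    """
    if len(state_fips) != len(county_fips):
        raise ValueError('state_fips and county_fips must be the same length')
    geoids = []
    for state_code, county_code in zip(state_fips, county_fips):
        geoid = state_code + county_code
        # state codes are 2 digits and county codes are 3 digits, zero-padded
        if len(state_code) != 2 or len(county_code) != 3 or not geoid.isdigit():
            raise ValueError(f'unexpected FIPS codes: {state_code!r}, {county_code!r}')
        geoids.append(geoid)
    return geoids


print(build_county_geoids(['48'], ['029']))
```

Failing early on a malformed code is a design choice in this sketch; the notebook itself does not validate its inputs.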

In [181]:
# lists surveys to iterate through for survey-dependent calls
surveys = ['acsse', 'acs5']

In [187]:
# creates dictionaries to contain data for ACS 1yr and ACS 5yr data
acs_1yr_data_dict = {}
acs_5yr_data_dict = {}
acs_data_dict = {}

In [183]:
# checks to see if Shapefile directories contain data, download if not
# todo: add Shapefile extract code for other geographies
shapefiles_base_dirpath = '../data/geospatial_files/shapefiles/census_bureau'
shapefile_types = ['congressional_districts', 'counties', 'places', 'pumas', 'zip_code_tabulation_areas', 'block_groups']

shapefiles_base_url = f'https://www2.census.gov/geo/tiger/TIGER{year}'

for shapefile_type in shapefile_types:
    if not os.path.exists(f'{shapefiles_base_dirpath}/{shapefile_type}/{year}'):
        os.makedirs(f'{shapefiles_base_dirpath}/{shapefile_type}/{year}')
        
        if shapefile_type == 'block_groups':
            # Shapefile download URL
            block_group_url = f'{shapefiles_base_url}/BG/tl_{year}_{state_fips[0]}_bg.zip'
            # local directory and .zip filepath
            block_group_filepath = f'{shapefiles_base_dirpath}/{shapefile_type}/{year}/'
            block_group_zip = block_group_filepath + f'{year}_{state_fips[0]}_bg.zip'
            shapefile_response = requests.get(block_group_url)
            shapefile_response.raise_for_status()

            # write URL response to file
            with open(block_group_zip, 'wb') as f:
                f.write(shapefile_response.content)

            # extract .zip contents (the context manager closes the archive)
            with zipfile.ZipFile(block_group_zip, mode='r') as archive:
                archive.extractall(path=block_group_filepath)
            
            # delete unnecessary files
            for file in os.listdir(f'{shapefiles_base_dirpath}/{shapefile_type}/{year}'):
                filename = os.fsdecode(block_group_filepath + file)
                if filename.endswith('.shp') or filename.endswith('.shx') or filename.endswith('.dbf'):
                    continue
                else:
                    os.remove(block_group_filepath + file)

## Extract

### Preparation

In [195]:
# grabs "crosswalk" table for name-label-concept list available through Census Bureau website that contains names for each individual field in each table, which will be used to programmatically give human-readable names to DataFrame/database columns
for survey in surveys:
    crosswalk_df = pd.DataFrame()
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}
    crosswalk_url = f'https://api.census.gov/data/{year}/acs/{survey}/variables/'
    
    # local CSV versions of variable data
    crosswalk_dir = f'../data/datasets/census_bureau/'
    crosswalk_csv = f'{survey}_crosswalk.csv'
    
    try:
        crosswalk_response = requests.get(crosswalk_url, headers=headers)
        crosswalk_response.raise_for_status()
        crosswalk_df = pd.DataFrame(crosswalk_response.json())
        # saves local copy of DataFrame as .csv file in case page is unavailable later
        os.makedirs(crosswalk_dir, exist_ok=True)
        crosswalk_df.to_csv(crosswalk_dir + crosswalk_csv)
    except requests.RequestException:
        # falls back to the local .csv copy if the request fails
        crosswalk_df = pd.read_csv(crosswalk_dir + crosswalk_csv, index_col=0)
        
    acs_data_dict[f'{survey}_crosswalk'] = crosswalk_df

acs_data_dict['acsse_crosswalk']

Unnamed: 0,0,1,2
0,name,label,concept
1,for,Census API FIPS 'for' clause,Census API Geography Specification
2,in,Census API FIPS 'in' clause,Census API Geography Specification
3,ucgid,Uniform Census Geography Identifier clause,Census API Geography Specification
4,K202101_002E,Estimate!!Total:!!Veteran:,Veteran Status for the Civilian Population 18 ...
...,...,...,...
365,GEOCOMP,GEO_ID Component,
366,K201701_007E,Estimate!!Total:!!Income in the past 12 months...,Poverty Status in the Past 12 Months by Age
367,K202301_005E,Estimate!!Total:!!In labor force:!!Civilian la...,Employment Status for the Population 16 Years ...
368,K202504_001E,Estimate!!Total:,Units in Structure


In [196]:
# convert first row into column headers, then deletes the row
for survey in surveys:
    crosswalk_df = acs_data_dict[f'{survey}_crosswalk']
    crosswalk_df.columns = crosswalk_df.iloc[0]
    crosswalk_df = crosswalk_df[1:]
    
    # removes rows not used for naming columns locally
    crosswalk_df = crosswalk_df[crosswalk_df['name'].str.startswith('K') | crosswalk_df['name'].str.startswith('B')]
    
    acs_data_dict[f'{survey}_crosswalk_df'] = crosswalk_df
acs_data_dict['acsse_crosswalk_df']

Unnamed: 0,name,label,concept
4,K202101_002E,Estimate!!Total:!!Veteran:,Veteran Status for the Civilian Population 18 ...
5,K200201_006E,Estimate!!Total:!!Native Hawaiian and Other Pa...,Race
6,K202505_006E,Estimate!!Total:!!Built 1940 to 1959,Year Structure Built
8,K202101_003E,Estimate!!Total:!!Veteran:!!18 to 34 years,Veteran Status for the Civilian Population 18 ...
9,K200201_005E,Estimate!!Total:!!Asian alone,Race
...,...,...,...
364,K202504_002E,"Estimate!!Total:!!1, detached and attached",Units in Structure
366,K201701_007E,Estimate!!Total:!!Income in the past 12 months...,Poverty Status in the Past 12 Months by Age
367,K202301_005E,Estimate!!Total:!!In labor force:!!Civilian la...,Employment Status for the Population 16 Years ...
368,K202504_001E,Estimate!!Total:,Units in Structure
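
The header promotion and filtering in the cell above can also be illustrated without pandas. The sketch below uses a few rows copied from the crosswalk output; the helper names `rows_to_records` and `keep_table_variables` are hypothetical and only mirror what the notebook does with `iloc[0]` and `str.startswith`:

```python
def rows_to_records(rows):
    """Turn a list-of-lists response into dicts keyed by the first (header) row."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def keep_table_variables(records):
    """Keep only variables whose names start with 'K' or 'B', as the notebook does."""
    return [record for record in records if record['name'].startswith(('K', 'B'))]


sample_rows = [
    ['name', 'label', 'concept'],
    ['for', "Census API FIPS 'for' clause", 'Census API Geography Specification'],
    ['K200201_005E', 'Estimate!!Total:!!Asian alone', 'Race'],
    ['K202504_001E', 'Estimate!!Total:', 'Units in Structure'],
]

records = keep_table_variables(rows_to_records(sample_rows))
print([record['name'] for record in records])
```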


In [223]:
# transforms crosswalk_df by truncating `name` column to its table 'group' name (and deleting anything that's not a table name) and normalizing text in `concept` field to lowercase/no spaces format
for survey in surveys:
    tables_df = acs_data_dict[f'{survey}_crosswalk_df'].copy()
    # tables_df['name'] = crosswalk_df['name'].str.split('_').str[0]
    tables_df['name'] = acs_data_dict[f'{survey}_crosswalk_df']['name'].str.split('_').str[0]
    tables_df = tables_df.drop_duplicates(subset='name')
    tables_df = tables_df.drop(columns='label')
    tables_df['concept'] = tables_df['concept'].str.replace(' ', '_').str.lower()
    acs_data_dict[f'{survey}_tables_df'] = tables_df
    
acs_data_dict['acs5_tables_df']

Unnamed: 0,name,concept
4,B24022,sex_by_occupation_and_median_earnings_in_the_p...
5,B19001B,household_income_in_the_past_12_months_(in_202...
6,B07007PR,geographical_mobility_in_the_past_year_by_citi...
7,B19101A,family_income_in_the_past_12_months_(in_2022_i...
14,B01001B,sex_by_age_(black_or_african_american_alone)
...,...,...
27854,B19013H,median_household_income_in_the_past_12_months_...
27894,B25052,kitchen_facilities_for_occupied_housing_units
27901,B25020,tenure_by_rooms
27976,B99187,age_by_allocation_of_independent_living_diffic...
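
As a rough, pandas-free illustration of the transformation above, the sketch below collapses variable-level records into one entry per table group and normalizes the concept text. The function name `table_groups` is an assumption made for this example; keeping the first concept seen per group is meant to mirror `drop_duplicates(subset='name')`:

```python
def table_groups(records):
    """Map each table group (text before the first underscore) to a normalized concept."""
    groups = {}
    for record in records:
        group = record['name'].split('_')[0]
        # keep the first concept seen for each group
        if group not in groups:
            groups[group] = record['concept'].replace(' ', '_').lower()
    return groups


sample_records = [
    {'name': 'K200201_006E', 'concept': 'Race'},
    {'name': 'K202505_006E', 'concept': 'Year Structure Built'},
    {'name': 'K200201_005E', 'concept': 'Race'},
    {'name': 'K202504_002E', 'concept': 'Units in Structure'},
]

print(table_groups(sample_records))
```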


In [199]:
# loads GeoDataFrame from Shapefiles for reference geographies and turns county UCGIDs into an iterable list
county_ucgids_list = []
county_ucgids_list_of_lists = []
target_counties_gdf = gpd.GeoDataFrame()

counties_gdf = gpd.read_file('../data/geospatial_files/shapefiles/census_bureau/counties/tl_2023_us_county.shp')
counties_gdf.set_crs(epsg='3395', inplace=True)
for county in county_or_counties:
    county_gdf = counties_gdf[counties_gdf['GEOID'] == county]
    county_ucgids_list_of_lists.append(list(counties_gdf['GEOIDFQ'][counties_gdf['GEOID'] == county])) 
    target_counties_gdf = pd.concat([target_counties_gdf, county_gdf])

for ucgid in county_ucgids_list_of_lists:
    county_ucgids_list.append(ucgid[0])

target_counties_gdf.head()

Unnamed: 0,STATEFP,COUNTYFP,COUNTYNS,GEOID,GEOIDFQ,NAME,NAMELSAD,LSAD,CLASSFP,MTFCC,CSAFP,CBSAFP,METDIVFP,FUNCSTAT,ALAND,AWATER,INTPTLAT,INTPTLON,geometry
615,48,29,1383800,48029,0500000US48029,Bexar,Bexar County,6,H1,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.327 29.210, -98.327 29.210, -98...."


In [200]:
# loads GeoDataFrame from Shapefiles for `CD<current_congress>` geographies based on reference geographies
congressional_districts_gdf = gpd.read_file(
    '../data/geospatial_files/shapefiles/census_bureau/congressional_districts/118th_congress/tl_2023_48_cd118.shp')
congressional_districts_gdf.set_crs(epsg='3395', inplace=True)
# creates overlay, keeping only polygons that exist in both GeoDataFrames
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    congressional_districts_by_county_gdf = congressional_districts_gdf.overlay(target_counties_gdf, how='intersection')
# creates list of UCGIDs to use as inputs for API caller
congressional_districts_by_county_ucgid_list = list(congressional_districts_by_county_gdf['GEOIDFQ_1'])

congressional_districts_by_county_gdf.head()

Unnamed: 0,STATEFP_1,CD118FP,GEOID_1,GEOIDFQ_1,NAMELSAD_1,LSAD_1,CDSESSN,MTFCC_1,FUNCSTAT_1,ALAND_1,...,MTFCC_2,CSAFP,CBSAFP,METDIVFP,FUNCSTAT_2,ALAND_2,AWATER_2,INTPTLAT_2,INTPTLON_2,geometry
0,48,23,4823,5001800US4823,Congressional District 23,C2,118,G5200,N,152261432812,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.805 29.692, -98.803 29.695, -98...."
1,48,28,4828,5001800US4828,Congressional District 28,C2,118,G5200,N,29415114978,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.484 29.141, -98.484 29.142, -98...."
2,48,35,4835,5001800US4835,Congressional District 35,C2,118,G5200,N,1348685093,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.543 29.428, -98.543 29.428, -98...."
3,48,20,4820,5001800US4820,Congressional District 20,C2,118,G5200,N,464891989,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.788 29.501, -98.788 29.501, -98...."
4,48,21,4821,5001800US4821,Congressional District 21,C2,118,G5200,N,16309930932,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.314 29.602, -98.314 29.602, -98...."


In [201]:
# loads GeoDataFrame from Shapefiles for `PLACE` geographies based on reference geographies
places_gdf = gpd.read_file('../data/geospatial_files/shapefiles/census_bureau/places/tl_2023_48_place.shp')
places_gdf.set_crs(epsg='3395', inplace=True)
# creates overlay, keeping only polygons that exist in both GeoDataFrames
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    places_by_county_gdf = places_gdf.overlay(target_counties_gdf, how='intersection')
# creates list of UCGIDs to use as inputs for API caller
places_by_county_ucgid_list = list(places_by_county_gdf['GEOIDFQ_1'])

places_by_county_gdf.head()

Unnamed: 0,STATEFP_1,PLACEFP,PLACENS,GEOID_1,GEOIDFQ_1,NAME_1,NAMELSAD_1,LSAD_1,CLASSFP_1,PCICBSA,...,MTFCC_2,CSAFP,CBSAFP,METDIVFP,FUNCSTAT_2,ALAND_2,AWATER_2,INTPTLAT_2,INTPTLON_2,geometry
0,48,67268,2411878,4867268,1600000US4867268,Shavano Park,Shavano Park city,25,C1,N,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.576 29.592, -98.576 29.592, -98...."
1,48,64172,2412593,4864172,1600000US4864172,St. Hedwig,St. Hedwig town,43,C1,N,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.272 29.421, -98.272 29.421, -98...."
2,48,74408,2412134,4874408,1600000US4874408,Universal City,Universal City city,25,C1,N,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.330 29.539, -98.327 29.541, -98...."
3,48,33146,2410736,4833146,1600000US4833146,Helotes,Helotes city,25,C1,N,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.728 29.531, -98.727 29.532, -98...."
4,48,68708,2411926,4868708,1600000US4868708,Somerset,Somerset city,25,C1,N,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.674 29.228, -98.674 29.228, -98...."


In [202]:
# loads GeoDataFrame from Shapefiles for `PUMA20` geographies based on reference geographies
pumas_gdf = gpd.read_file('../data/geospatial_files/shapefiles/census_bureau/pumas/tl_2023_48_puma20.shp')
pumas_gdf.set_crs(epsg='3395', inplace=True)
# creates overlay, keeping only polygons that exist in both GeoDataFrames
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    pumas_by_county_gdf = pumas_gdf.overlay(target_counties_gdf, how='intersection')
# creates list of UCGIDs to use as inputs for API caller
pumas_by_county_ucgid_list = list(pumas_by_county_gdf['GEOIDFQ20'])

pumas_by_county_gdf.head()

Unnamed: 0,STATEFP20,PUMACE20,GEOID20,GEOIDFQ20,NAMELSAD20,MTFCC20,FUNCSTAT20,ALAND20,AWATER20,INTPTLAT20,...,MTFCC,CSAFP,CBSAFP,METDIVFP,FUNCSTAT,ALAND,AWATER,INTPTLAT,INTPTLON,geometry
0,48,5907,4805907,795P200US4805907,Bexar County (South)--San Antonio City (Far So...,G6120,S,1271464920,33914665,29.3069306,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.305 29.237, -98.307 29.235, -98...."
1,48,5908,4805908,795P200US4805908,San Antonio City (West)--Between Loop TX-1604 ...,G6120,S,68356557,216036,29.4400578,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.671 29.389, -98.672 29.389, -98...."
2,48,5914,4805914,795P200US4805914,Bexar County (Northwest)--San Antonio (Far Nor...,G6120,S,473825286,874684,29.595835,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.716 29.435, -98.716 29.434, -98...."
3,48,5903,4805903,795P200US4805903,San Antonio City (Southeast)--Inside Loop I-41...,G6120,S,89668983,423359,29.3672916,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.437 29.413, -98.437 29.413, -98...."
4,48,5906,4805906,795P200US4805906,San Antonio City (Southwest)--Inside Loop I-41...,G6120,S,81573819,255474,29.3425496,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.504 29.320, -98.504 29.320, -98...."


In [220]:
# loads GeoDataFrame from Shapefiles for block group geographies based on reference geographies
block_group_gdf = gpd.read_file(f'../data/geospatial_files/shapefiles/census_bureau/block_groups/{year}/tl_{year}_{state_fips[0]}_bg.shp')
block_group_gdf.set_crs(epsg='3395', inplace=True)
# creates an overlay that keeps only polygons that exist in both our target geography and the block group geographies
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    block_group_by_county_gdf = block_group_gdf.overlay(target_counties_gdf, how='intersection')
# creates list of UCGIDs to use as inputs for API caller
block_group_by_county_ucgid_list = list('1500000US' + block_group_by_county_gdf['GEOID_1'])
    
block_group_by_county_gdf.head()

Unnamed: 0,STATEFP_1,COUNTYFP_1,TRACTCE,BLKGRPCE,GEOID_1,NAMELSAD_1,MTFCC_1,FUNCSTAT_1,ALAND_1,AWATER_1,...,MTFCC_2,CSAFP,CBSAFP,METDIVFP,FUNCSTAT_2,ALAND_2,AWATER_2,INTPTLAT_2,INTPTLON_2,geometry
0,48,29,182003,1,480291820031,Block Group 1,G5030,S,5939086,3937,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.653 29.594, -98.652 29.595, -98...."
1,48,29,150700,2,480291507002,Block Group 2,G5030,S,354965,0,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.492 29.377, -98.492 29.377, -98...."
2,48,29,121606,3,480291216063,Block Group 3,G5030,S,586667,0,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.340 29.554, -98.340 29.555, -98...."
3,48,29,121604,2,480291216042,Block Group 2,G5030,S,1509863,7297,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.329 29.537, -98.329 29.537, -98...."
4,48,29,121604,3,480291216043,Block Group 3,G5030,S,401071,0,...,G4020,484,41700,,A,3212426728,40788666,29.4486708,-98.5201465,"POLYGON ((-98.320 29.524, -98.319 29.525, -98...."


In [222]:
# iterates through DataFrame of tables being collected AND through the list of geoIDs collected from all targeted geographies to create an API URL call for each
api_call_url_list = []

for survey in surveys:
    base_url = f'https://api.census.gov/data/{year}/acs/{survey}'
    
    # combines all UCGIDs into one list
    if survey == 'acsse':
        ucgid_list = county_ucgids_list + congressional_districts_by_county_ucgid_list + places_by_county_ucgid_list + pumas_by_county_ucgid_list
    elif survey == 'acs5':
        ucgid_list = block_group_by_county_ucgid_list
    
    # iterates through tables DataFrame, building one URL per table group that covers every UCGID in the combined list
    for index, row in acs_data_dict[f'{survey}_tables_df'].iterrows():
        data_url = f'{base_url}?get=group({row["name"]})&ucgid={",".join(ucgid_list)}'
        api_call_url_list.append(data_url)
            
print(api_call_url_list[0])
print(len(api_call_url_list))

https://api.census.gov/data/2022/acs/acsse?get=group(K202101)&ucgid=0500000US48029,5001800US4823,5001800US4828,5001800US4835,5001800US4820,5001800US4821,1600000US4867268,1600000US4864172,1600000US4874408,1600000US4833146,1600000US4868708,1600000US4865344,1600000US4866704,1600000US4865000,1600000US4875764,1600000US4872296,1600000US4879672,1600000US4825168,1600000US4801600,1600000US4805384,1600000US4813276,1600000US4814716,1600000US4816468,1600000US4823272,1600000US4845288,1600000US4831100,1600000US4833968,1600000US4834628,1600000US4839448,1600000US4842388,1600000US4843096,1600000US4866128,1600000US4814920,1600000US4853988,1600000US4817811,1600000US4845576,1600000US4840036,1600000US4866089,1600000US4860608,1600000US4873057,795P200US4805907,795P200US4805908,795P200US4805914,795P200US4805903,795P200US4805906,795P200US4805916,795P200US4805913,795P200US4805902,795P200US4805905,795P200US4805915,795P200US4805901,795P200US4805909,795P200US4805910,795P200US4805911,795P200US4805912,795P200US48059
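
The URL above packs every UCGID into a single request, which can become very long when many geographies are targeted (block groups in particular). One possible mitigation, sketched here under assumptions, is to split the UCGID list into batches and build one URL per batch. The helper name `build_batched_urls` and the default `batch_size` of 50 are illustrative choices, not a documented API limit:

```python
def build_batched_urls(base_url, group, ucgids, batch_size=50):
    """Build one API URL per batch of UCGIDs for a single table group."""
    urls = []
    for start in range(0, len(ucgids), batch_size):
        batch = ucgids[start:start + batch_size]
        urls.append(f'{base_url}?get=group({group})&ucgid={",".join(batch)}')
    return urls


example_urls = build_batched_urls(
    'https://api.census.gov/data/2022/acs/acsse',
    'K202101',
    ['0500000US48029', '5001800US4823', '5001800US4828'],
    batch_size=2,
)
for url in example_urls:
    print(url)
```

With batching, responses for the same table group would need to be stacked row-wise before being joined to other table groups on `ucgid`.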

### Extraction

In [125]:
# calls the API with a single URL containing one group of tables (max allowed) and COUNTY, PLACE, and CD<current_congress> geographies (only returns any of these geographies containing more than 20,000 total population)
def api_caller(url):
    r = requests.get(url)
    if r.status_code == 200:
        return r
    else:
        print(r.status_code)
        print(r.text)
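
Because `api_caller` returns `None` when the status code is not 200, a caller may want to retry before giving up. The following is a minimal sketch of one way to do that; `call_with_retries`, `attempts`, and `delay_seconds` are hypothetical names and defaults chosen for illustration, and the wrapper accepts any callable so it can be exercised without network access:

```python
import time


def call_with_retries(fetch, url, attempts=3, delay_seconds=1.0):
    """Call fetch(url) until it returns a non-None result or attempts run out."""
    for attempt in range(1, attempts + 1):
        result = fetch(url)
        if result is not None:
            return result
        if attempt < attempts:
            # waits a little longer after each failed attempt
            time.sleep(delay_seconds * attempt)
    return None


# usage with a stand-in fetcher that succeeds on the third call
calls = []


def fake_fetch(url):
    calls.append(url)
    return 'ok' if len(calls) == 3 else None


print(call_with_retries(fake_fetch, 'https://example.invalid', delay_seconds=0))
```

In the notebook this could be called as `call_with_retries(api_caller, url)`, with the result still checked for `None` afterwards.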

In [126]:
# iterates through the list of URLs, calling the API caller once for each URL, and joins all the results into one DataFrame - joining process requires removing any columns that will be duplicated, else DataFrame merge will fail
df = pd.DataFrame()
is_first_response = True

for index, url in enumerate(api_call_url_list):
    # calls the API caller, skipping URLs that did not return a 200 response
    response = api_caller(url)
    if response is None:
        continue
    # converts API response JSON object into a local-scope DataFrame
    temp_df = pd.DataFrame(response.json())
    # converts first row into column headers, then deletes row
    temp_df.columns = temp_df.iloc[0]
    temp_df = temp_df[1:]
    # drops 'GEO_ID' from every response and 'NAME' from all but the first, so the merge does not duplicate columns
    duplicate_columns = ['GEO_ID'] if is_first_response else ['GEO_ID', 'NAME']
    temp_df = temp_df.drop(columns=duplicate_columns, errors='ignore')
    # if this is the first successful response, set non-local scope DataFrame, otherwise merge local and non-local DataFrames
    if is_first_response:
        df = temp_df
        is_first_response = False
    else:
        try:
            df = df.merge(temp_df, on='ucgid')
        except (KeyError, IndexError):
            print('error on merge')
    if debug is True:
        if index > 1:
            break

In [127]:
df.head()

Unnamed: 0,NAME,K202101_001E,K202101_001EA,K202101_001M,K202101_001MA,K202101_002E,K202101_002EA,K202101_002M,K202101_002MA,K202101_003E,...,K202505_005M,K202505_005MA,K202505_006E,K202505_006EA,K202505_006M,K202505_006MA,K202505_007E,K202505_007EA,K202505_007M,K202505_007MA
0,"Bexar County, Texas",1526825,,3094,,151480,,7536,,17356,...,6497,,89110,,5179,,36491,,3357,
1,"Cibolo city, Texas",24995,,1574,,5685,,1408,,846,...,411,,303,,465,,0,,237,
2,"Converse city, Texas",19664,,2617,,3734,,1155,,75,...,396,,117,,199,,0,,237,
3,"San Antonio city, Texas",1105182,,8059,,89194,,5366,,11190,...,5455,,83518,,5008,,33018,,3177,
4,"Schertz city, Texas",34153,,2569,,7062,,1178,,107,...,947,,234,,252,,0,,237,
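
To make the join logic easier to follow, here is a small pandas-free sketch of combining per-table responses on `ucgid`. The function `merge_by_ucgid` is hypothetical, and unlike a default `DataFrame.merge` (an inner join) it keeps every `ucgid` seen in any response; the sample values are taken from the Bexar County row in the notebook outputs:

```python
def merge_by_ucgid(responses):
    """Merge list-of-lists API responses into one record per ucgid."""
    merged = {}
    for rows in responses:
        header, *body = rows
        for row in body:
            record = dict(zip(header, row))
            target = merged.setdefault(record['ucgid'], {})
            for key, value in record.items():
                # columns already present (such as NAME) are kept, not overwritten
                target.setdefault(key, value)
    return merged


veterans = [
    ['NAME', 'K202101_001E', 'ucgid'],
    ['Bexar County, Texas', '1526825', '0500000US48029'],
]
race = [
    ['NAME', 'K200201_001E', 'ucgid'],
    ['Bexar County, Texas', '2059530', '0500000US48029'],
]

merged = merge_by_ucgid([veterans, race])
print(merged['0500000US48029'])
```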


## Transform

Once we've loaded the API data into memory, we can modify the data to exclude unnecessary fields before saving to the database.

In [128]:
# remove columns representing annotations of estimates (*EA), margins of error (*M), and annotations of margins of error (*MA)
df.drop(columns=df.columns[df.columns.str.endswith(('EA', 'M', 'MA'))], inplace=True)

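# combines the per-survey crosswalks so columns from either survey can be matched, regardless of which survey was processed last
crosswalk_df = pd.concat([acs_data_dict[f'{survey}_crosswalk_df'] for survey in surveys])

for series_name in df.columns: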
    with warnings.catch_warnings():
        if crosswalk_df['name'].str.contains(series_name).any():
            # if the crosswalk contains the name of the DataFrame column (i.e., for any data column as opposed to names, descriptors, etc. ) replace table name based on key to one based on descriptor, then strip spaces, punctuation, etc. and replace with underscores for easier data manipulation and normalization, then convert from Series object to int dtype
            new_label = str(crosswalk_df[crosswalk_df['name'].str.contains(series_name)][['concept', 'label']].values)
            new_label = new_label.replace('[', '').replace(']', '').replace('\' \'', '__').replace(' ', '_').replace('\'', '').replace('!!', '_').replace(':', '').lower()
            try:
                df = df.astype({series_name: 'int'})
            except TypeError:
                pass
            df.rename(columns={series_name: new_label + '__' + series_name}, inplace=True, errors='raise')
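
The renaming above goes through the string form of a NumPy array, which appears to leave stray newline characters in some of the longer column names (shown as `\n` in the column headers of the output that follows). As a hedged alternative, the sketch below builds the same `concept__label__name` pattern directly from the two strings; `normalize_label` is a hypothetical helper, not the notebook's method:

```python
def normalize_label(concept, label, name):
    """Build a column name of the form concept__label__name."""
    parts = []
    for text in (concept, label):
        cleaned = text.replace('!!', '_').replace(':', '').replace("'", '')
        # collapses any run of whitespace, including newlines, into one underscore
        parts.append('_'.join(cleaned.split()).lower())
    return '__'.join(parts + [name])


print(normalize_label('Race', 'Estimate!!Total:!!Asian alone', 'K200201_005E'))
```

The variable name is left in its original case at the end so the source table can still be traced.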

In [129]:
df.head()

Unnamed: 0,NAME,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total__K202101_001E,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total_veteran__K202101_002E,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total_veteran_18_to_34_years__K202101_003E,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total_veteran_35_to_64_years__K202101_004E,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total_veteran_65_years_and_over__K202101_005E,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total_nonveteran__K202101_006E,ucgid,race__estimate_total__K200201_001E,race__estimate_total_white_alone__K200201_002E,...,race\n__estimate_total_native_hawaiian_and_other_pacific_islander_alone__K200201_006E,race__estimate_total_some_other_race_alone__K200201_007E,race__estimate_total_two_or_more_races__K200201_008E,year_structure_built__estimate_total__K202505_001E,year_structure_built__estimate_total_built_2020_or_later__K202505_002E,year_structure_built__estimate_total_built_2000_to_2019__K202505_003E,year_structure_built__estimate_total_built_1980_to_1999__K202505_004E,year_structure_built__estimate_total_built_1960_to_1979__K202505_005E,year_structure_built__estimate_total_built_1940_to_1959__K202505_006E,year_structure_built__estimate_total_built_1939_or_earlier__K202505_007E
0,"Bexar County, Texas",1526825,151480,17356,83108,51016,1375345,0500000US48029,2059530,858997,...,3075,231716,711299,820113,34343,290676,206285,163208,89110,36491
1,"Cibolo city, Texas",24995,5685,846,3358,1481,19310,1600000US4814920,34807,16029,...,0,3211,6482,12227,551,9348,1599,426,303,0
2,"Converse city, Texas",19664,3734,75,2840,819,15930,1600000US4816468,29597,9714,...,0,2139,10004,10147,1169,4691,3441,729,117,0
3,"San Antonio city, Texas",1105182,89194,11190,42790,35214,1015988,1600000US4865000,1472904,585287,...,1789,182468,540652,612031,18216,175641,165590,136048,83518,33018
4,"Schertz city, Texas",34153,7062,107,5328,1627,27091,1600000US4866128,45567,19325,...,0,5190,14744,17503,981,8262,5582,2444,234,0


The following cells separate each geographic level of analysis into its own DataFrame: one each for `COUNTY`, `PLACE`, `CD<current_congress>`, and `PUMA20`.

Once separated, each DataFrame is merged with its associated GeoDataFrame in order to keep the GeoDataFrame's `geometry` column, which contains the Shapefile polygons that can be used for geospatial analysis.
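
The split relies on the first three characters of each `ucgid`. A minimal sketch of that idea follows, using the prefixes that appear in this notebook's data; the mapping name `GEOGRAPHY_PREFIXES` and the function `split_by_geography` are assumptions made for this example:

```python
# prefixes as they appear in the UCGIDs used in this notebook
GEOGRAPHY_PREFIXES = {
    '050': 'county',
    '160': 'place',
    '500': 'congressional_district',
    '795': 'puma',
    '150': 'block_group',
}


def split_by_geography(ucgids):
    """Group UCGIDs by geography type based on their 3-character prefix."""
    grouped = {}
    for ucgid in ucgids:
        level = GEOGRAPHY_PREFIXES.get(ucgid[:3], 'other')
        grouped.setdefault(level, []).append(ucgid)
    return grouped


sample_ucgids = [
    '0500000US48029',
    '1600000US4865000',
    '5001800US4820',
    '795P200US4805907',
    '1600000US4814920',
]
print(split_by_geography(sample_ucgids))
```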

In [130]:
# the following cells separate out each geography level of analysis into its own DataFrame - one each for COUNTY, PLACE, CD<congressional_term>, and PUMA
county_df = df[df['ucgid'].str.startswith('050')]
final_county_df = pd.merge(county_df, target_counties_gdf[['GEOIDFQ', 'geometry']], left_on='ucgid', right_on='GEOIDFQ')

final_county_df.head()

Unnamed: 0,NAME,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total__K202101_001E,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total_veteran__K202101_002E,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total_veteran_18_to_34_years__K202101_003E,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total_veteran_35_to_64_years__K202101_004E,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total_veteran_65_years_and_over__K202101_005E,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total_nonveteran__K202101_006E,ucgid,race__estimate_total__K200201_001E,race__estimate_total_white_alone__K200201_002E,...,race__estimate_total_two_or_more_races__K200201_008E,year_structure_built__estimate_total__K202505_001E,year_structure_built__estimate_total_built_2020_or_later__K202505_002E,year_structure_built__estimate_total_built_2000_to_2019__K202505_003E,year_structure_built__estimate_total_built_1980_to_1999__K202505_004E,year_structure_built__estimate_total_built_1960_to_1979__K202505_005E,year_structure_built__estimate_total_built_1940_to_1959__K202505_006E,year_structure_built__estimate_total_built_1939_or_earlier__K202505_007E,GEOIDFQ,geometry
0,"Bexar County, Texas",1526825,151480,17356,83108,51016,1375345,0500000US48029,2059530,858997,...,711299,820113,34343,290676,206285,163208,89110,36491,0500000US48029,"POLYGON ((-98.327 29.210, -98.327 29.210, -98...."


In [131]:
place_df = df[df['ucgid'].str.startswith('160')]
final_place_df = pd.merge(place_df, places_gdf[['GEOIDFQ', 'geometry']], left_on='ucgid', right_on='GEOIDFQ')

final_place_df.head()

Unnamed: 0,NAME,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total__K202101_001E,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total_veteran__K202101_002E,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total_veteran_18_to_34_years__K202101_003E,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total_veteran_35_to_64_years__K202101_004E,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total_veteran_65_years_and_over__K202101_005E,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total_nonveteran__K202101_006E,ucgid,race__estimate_total__K200201_001E,race__estimate_total_white_alone__K200201_002E,...,race__estimate_total_two_or_more_races__K200201_008E,year_structure_built__estimate_total__K202505_001E,year_structure_built__estimate_total_built_2020_or_later__K202505_002E,year_structure_built__estimate_total_built_2000_to_2019__K202505_003E,year_structure_built__estimate_total_built_1980_to_1999__K202505_004E,year_structure_built__estimate_total_built_1960_to_1979__K202505_005E,year_structure_built__estimate_total_built_1940_to_1959__K202505_006E,year_structure_built__estimate_total_built_1939_or_earlier__K202505_007E,GEOIDFQ,geometry
0,"Cibolo city, Texas",24995,5685,846,3358,1481,19310,1600000US4814920,34807,16029,...,6482,12227.0,551.0,9348.0,1599.0,426.0,303.0,0.0,1600000US4814920,"MULTIPOLYGON (((-98.146 29.532, -98.145 29.532..."
1,"Converse city, Texas",19664,3734,75,2840,819,15930,1600000US4816468,29597,9714,...,10004,10147.0,1169.0,4691.0,3441.0,729.0,117.0,0.0,1600000US4816468,"POLYGON ((-98.342 29.536, -98.342 29.536, -98...."
2,"San Antonio city, Texas",1105182,89194,11190,42790,35214,1015988,1600000US4865000,1472904,585287,...,540652,612031.0,18216.0,175641.0,165590.0,136048.0,83518.0,33018.0,1600000US4865000,"MULTIPOLYGON (((-98.305 29.455, -98.304 29.456..."
3,"Schertz city, Texas",34153,7062,107,5328,1627,27091,1600000US4866128,45567,19325,...,14744,17503.0,981.0,8262.0,5582.0,2444.0,234.0,0.0,1600000US4866128,"MULTIPOLYGON (((-98.201 29.509, -98.201 29.509..."
4,"Timberwood Park CDP, Texas",28258,5192,512,3733,947,23066,1600000US4873057,40601,21913,...,14523,,,,,,,,1600000US4873057,"POLYGON ((-98.523 29.678, -98.523 29.678, -98...."


In [132]:
congressional_district_df = df[df['ucgid'].str.startswith('500')]
final_congressional_district_df = pd.merge(congressional_district_df, congressional_districts_gdf[['GEOIDFQ', 'geometry']], left_on='ucgid', right_on='GEOIDFQ')

final_congressional_district_df

Unnamed: 0,NAME,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total__K202101_001E,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total_veteran__K202101_002E,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total_veteran_18_to_34_years__K202101_003E,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total_veteran_35_to_64_years__K202101_004E,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total_veteran_65_years_and_over__K202101_005E,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total_nonveteran__K202101_006E,ucgid,race__estimate_total__K200201_001E,race__estimate_total_white_alone__K200201_002E,...,race__estimate_total_two_or_more_races__K200201_008E,year_structure_built__estimate_total__K202505_001E,year_structure_built__estimate_total_built_2020_or_later__K202505_002E,year_structure_built__estimate_total_built_2000_to_2019__K202505_003E,year_structure_built__estimate_total_built_1980_to_1999__K202505_004E,year_structure_built__estimate_total_built_1960_to_1979__K202505_005E,year_structure_built__estimate_total_built_1940_to_1959__K202505_006E,year_structure_built__estimate_total_built_1939_or_earlier__K202505_007E,GEOIDFQ,geometry
0,"Congressional District 20 (118th Congress), Texas",586111,48276,4846,25001,18429,537835,5001800US4820,781188,296043,...,295309,312295,8536,98940,86456,69859,41095,7409,5001800US4820,"POLYGON ((-98.788 29.501, -98.788 29.501, -98...."
1,"Congressional District 21 (118th Congress), Texas",623073,56503,3864,26236,26403,566570,5001800US4821,807859,523543,...,181928,353442,16631,142027,107193,59382,19103,9106,5001800US4821,"POLYGON ((-100.064 29.711, -100.064 29.711, -1..."
2,"Congressional District 23 (118th Congress), Texas",562182,57355,8090,33218,16047,504827,5001800US4823,778355,339088,...,289082,294073,15653,128898,76358,46109,18711,8344,5001800US4823,"POLYGON ((-106.514 32.001, -106.510 32.001, -1..."
3,"Congressional District 28 (118th Congress), Texas",556198,39823,3481,22405,13937,516375,5001800US4828,777758,248104,...,383351,285562,9799,108510,72279,51825,30299,12850,5001800US4828,"POLYGON ((-100.212 28.197, -100.212 28.197, -1..."
4,"Congressional District 35 (118th Congress), Texas",625445,44308,7607,24452,12249,581137,5001800US4835,802077,322027,...,243116,336731,24748,155708,65438,45813,26095,18929,5001800US4835,"POLYGON ((-98.543 29.427, -98.543 29.428, -98...."


In [133]:
puma_df = df[df['ucgid'].str.startswith('795')]
final_puma_df = pd.merge(puma_df, pumas_gdf[['GEOIDFQ20', 'geometry']], left_on='ucgid', right_on='GEOIDFQ20')

final_puma_df.head()

Unnamed: 0,NAME,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total__K202101_001E,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total_veteran__K202101_002E,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total_veteran_18_to_34_years__K202101_003E,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total_veteran_35_to_64_years__K202101_004E,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total_veteran_65_years_and_over__K202101_005E,veteran_status_for_the_civilian_population_18_years_and_over_by_age\n__estimate_total_nonveteran__K202101_006E,ucgid,race__estimate_total__K200201_001E,race__estimate_total_white_alone__K200201_002E,...,race__estimate_total_two_or_more_races__K200201_008E,year_structure_built__estimate_total__K202505_001E,year_structure_built__estimate_total_built_2020_or_later__K202505_002E,year_structure_built__estimate_total_built_2000_to_2019__K202505_003E,year_structure_built__estimate_total_built_1980_to_1999__K202505_004E,year_structure_built__estimate_total_built_1960_to_1979__K202505_005E,year_structure_built__estimate_total_built_1940_to_1959__K202505_006E,year_structure_built__estimate_total_built_1939_or_earlier__K202505_007E,GEOIDFQ20,geometry
0,San Antonio City (Central) PUMA; Texas,78806,4466,1251,2142,1073,74340,795P200US4805901,112815,28702,...,56166,46184,1328,8110,4260,8325,11669,12492,795P200US4805901,"POLYGON ((-98.461 29.431, -98.461 29.431, -98...."
1,San Antonio City (Northeast)--Inside Loop I-41...,78854,6145,564,2371,3210,72709,795P200US4805902,112214,52660,...,32010,48315,721,7902,7737,11311,14693,5951,795P200US4805902,"POLYGON ((-98.499 29.461, -98.499 29.462, -98...."
2,San Antonio City (Southeast)--Inside Loop I-41...,90288,4457,0,1723,2734,85831,795P200US4805903,120621,38926,...,47223,49112,917,9152,4965,11177,17119,5782,795P200US4805903,"POLYGON ((-98.439 29.413, -98.437 29.413, -98...."
3,San Antonio City (Northwest)--Inside Loop I-41...,77228,5945,436,2401,3108,71283,795P200US4805904,101698,30071,...,60528,44273,132,1041,4625,13216,19426,5833,795P200US4805904,"POLYGON ((-98.526 29.517, -98.526 29.518, -98...."
4,San Antonio City (West)--Inside Loop I-410 PUM...,80463,4363,1868,942,1553,76100,795P200US4805905,114308,32629,...,60119,38617,643,4946,7997,16896,7323,812,795P200US4805905,"POLYGON ((-98.573 29.471, -98.571 29.470, -98...."


In [134]:
if survey == 'acs5': 
    zcta_df = df[df['ucgid'].str.startswith('860')]
    final_zcta_df = pd.merge(zcta_df, zcta_gdf[['GEOIDFQ20', 'geometry']], left_on='ucgid', right_on='GEOIDFQ20')
    
    final_zcta_df.head()

In [135]:
if survey == 'acs5': 
    block_group_df = df[df['ucgid'].str.startswith('150')]
    final_block_group_df = pd.merge(block_group_df, block_group_gdf[['GEOIDFQ20', 'geometry']], left_on='ucgid', right_on='GEOIDFQ20')  # todo: check GEOIDFQ20 is correct
    
    final_block_group_df.head()
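
The three-character prefixes used in the `ucgid` filters above are Census summary level codes: `160` for places, `500` for congressional districts, `795` for PUMAs, `860` for ZCTAs, and `150` for block groups (counties use `050`). As a minimal sketch, the hypothetical helper `split_ucgid` below (not part of this notebook) splits a `ucgid` into its parts. It assumes the layout seen in the outputs above: a 3-character summary level, a 2-character geographic variant, a 2-character geographic component, the literal `US`, and then the GEOID.

```python
# Hypothetical helper, not part of the notebook: splits a fully qualified
# GEOID (ucgid) into its components, assuming the 3 + 2 + 2 + "US" + GEOID layout.
SUMMARY_LEVELS = {
    "050": "county",
    "150": "block group",
    "160": "place",
    "500": "congressional district",
    "795": "PUMA",
    "860": "ZCTA",
}


def split_ucgid(ucgid: str) -> dict:
    # Everything before the first "US" is the 7-character header.
    head, sep, geoid = ucgid.partition("US")
    if not sep or len(head) != 7:
        raise ValueError(f"unexpected ucgid layout: {ucgid!r}")
    summary_level = head[:3]
    return {
        "summary_level": summary_level,
        "geography": SUMMARY_LEVELS.get(summary_level, "unknown"),
        "variant": head[3:5],      # e.g. "18" for the 118th Congress, "P2" for 2020 PUMAs
        "component": head[5:7],
        "geoid": geoid,
    }


# ucgid values copied from the outputs shown in this notebook
for ucgid in ("1600000US4866128", "5001800US4820", "795P200US4805901"):
    print(split_ucgid(ucgid))
```

Filtering on the parsed `summary_level` rather than on `str.startswith` would give the same rows; the helper only makes the meaning of each prefix explicit.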

## Load

The following code loads the DataFrame/GeoDataFrames into the database for future analysis.

In [136]:
# selects the local SQLite database file that matches the survey type
databases_dirpath = os.path.join('..', 'data', 'databases')
if survey == 'acsse':
    demographics_db_filepath = os.path.join(databases_dirpath, 'census_acs_1yr_2022.db')
elif survey == 'acs5':
    demographics_db_filepath = os.path.join(databases_dirpath, 'census_acs_5yr_2022.db')
else:
    # without this check, an unknown survey would leave demographics_db_filepath undefined
    raise ValueError(f"unsupported survey: {survey!r} (expected 'acsse' or 'acs5')")

# todo: add code to check if fields exist before appending to writers
replace_or_append = 'append'

# creates connection to SQLite database
sql_engine = sqlalchemy.create_engine('sqlite:///' + demographics_db_filepath)
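
SQLite creates a missing database file on first connection, but it does not create missing parent directories, so the first write fails if `../data/databases` does not exist yet. The sketch below shows one way to guard against that. The helper name `sqlite_url_for` and its `databases_dir` parameter are hypothetical; the survey names and file names are the ones used in the cell above.

```python
from pathlib import Path

# File names are the ones used in this notebook; the helper itself is hypothetical.
DB_FILENAMES = {
    "acsse": "census_acs_1yr_2022.db",
    "acs5": "census_acs_5yr_2022.db",
}


def sqlite_url_for(survey: str, databases_dir: str = "../data/databases") -> str:
    """Return a SQLite URL for the survey, creating the directory if needed."""
    if survey not in DB_FILENAMES:
        raise ValueError(f"unsupported survey: {survey!r}")
    directory = Path(databases_dir)
    # SQLite will create the .db file, but not the directory that holds it.
    directory.mkdir(parents=True, exist_ok=True)
    return "sqlite:///" + str(directory / DB_FILENAMES[survey])
```

Under these assumptions, the engine could then be created with `sqlalchemy.create_engine(sqlite_url_for(survey))`.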

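The following cells convert each DataFrame's `geometry` column to a string so that the column dtype is compatible with SQLAlchemy (Polygon and MultiPolygon objects cannot be stored directly, so they are written as text). Each DataFrame is then written to its own table in the database: one each for `COUNTY`, `CD<current_congress>`, `PLACE`, and `PUMA`, plus `ZIP` (ZCTA) when the survey is `acs5`. Each `to_sql` call returns the number of rows written. Nothing is written while `debug` is enabled.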

In [137]:
if debug is False:
    final_county_df['geometry'] = final_county_df['geometry'].astype(str)
    final_county_df.to_sql('county', sql_engine, if_exists='replace')

In [138]:
if debug is False:    
    final_congressional_district_df['geometry'] = final_congressional_district_df['geometry'].astype(str)
    final_congressional_district_df.to_sql('congressional_district', sql_engine, if_exists='replace')

In [139]:
if debug is False:    
    final_place_df['geometry'] = final_place_df['geometry'].astype(str)
    final_place_df.to_sql('place', sql_engine, if_exists='replace')

In [140]:
if debug is False:    
    final_puma_df['geometry'] = final_puma_df['geometry'].astype(str)
    final_puma_df.to_sql('puma', sql_engine, if_exists='replace')

In [141]:
if survey == 'acs5':
    if debug is False: 
        final_zcta_df['geometry'] = final_zcta_df['geometry'].astype(str)
        final_zcta_df.to_sql('zcta', sql_engine, if_exists='replace')
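
Because the `to_sql` calls above sit inside `if` blocks, their return values are not displayed in the notebook. As a rough check that the load worked, the row counts can be read back from the database. The sketch below uses only the standard library `sqlite3` module; the helper name `table_row_counts` is hypothetical, and the demonstration runs against a throwaway in-memory database with a single placeholder row rather than the real output.

```python
import sqlite3


def table_row_counts(connection: sqlite3.Connection) -> dict:
    """Return {table_name: row_count} for every table in the database."""
    names = [
        row[0]
        for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    counts = {}
    for name in names:
        # Quote the identifier so unusual table names do not break the query.
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = connection.execute(
            f"SELECT COUNT(*) FROM {quoted}"
        ).fetchone()[0]
    return counts


# Demonstration with a placeholder row (not real notebook output)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE county (ucgid TEXT, geometry TEXT)")
demo.execute(
    "INSERT INTO county VALUES ('0500000US48029', 'POLYGON ((0 0, 1 0, 1 1, 0 0))')"
)
print(table_row_counts(demo))
```

Against the real database, the same function could be called as `table_row_counts(sqlite3.connect(demographics_db_filepath))` and the counts compared with the lengths of the corresponding DataFrames.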