# Exploratory Data Analysis

This notebook combines demographic data, voter turnout numbers, and voter registration figures from the Census Bureau, the Texas Legislative Council, and the Texas Secretary of State. The inputs arrive as both tabular datasets and geospatial files, and are combined to gain insight into patterns of behavior in targeted geographies.


## User Input and Pre-ETL Prep


In [1]:
# reference county or counties, as FIPS state + county codes (48 = Texas, 029 = Bexar County)
state_fips = ['48']
county_fips = ['029']
# data year used to select the source datasets
year = '2022'
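
As a small illustration of how these inputs fit together, the state and county codes can be concatenated into 5-digit county GEOIDs. This is only a sketch: `county_geoids` is a hypothetical name that the notebook itself does not define, and pairing every state with every county only makes sense while a single state is listed.

```python
from itertools import product

state_fips = ['48']    # Texas
county_fips = ['029']  # Bexar County

# Concatenate each state/county pair into a 5-digit county GEOID.
# `county_geoids` is a hypothetical helper variable, not part of the notebook.
county_geoids = [state + county for state, county in product(state_fips, county_fips)]
print(county_geoids)
```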

In [2]:
import geopandas as gpd
import pandas as pd
import os
import sqlalchemy
import requests

## Local ETL

### Census Data ETL
#### ACS Crosswalk
This crosswalk maps each ACS variable code (for example, `B19001B_014E`) to its descriptive label and concept.

In [3]:
# ACS products to pull variable crosswalks for:
# 'acsse' = ACS 1-year supplemental estimates, 'acs5' = ACS 5-year estimates
surveys = ['acsse', 'acs5']
# holds the crosswalk (and, later, table) DataFrames for each survey
acs_data_dict = {}

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}

for survey in surveys:
    crosswalk_url = f'https://api.census.gov/data/{year}/acs/{survey}/variables/'

    crosswalk_response = requests.get(crosswalk_url, headers=headers)
    # stops on a failed request instead of continuing with an empty DataFrame
    crosswalk_response.raise_for_status()
    crosswalk_df = pd.DataFrame(crosswalk_response.json())

    # promotes the first row (name, label, concept) to the column header
    crosswalk_df.columns = crosswalk_df.iloc[0]
    crosswalk_df = crosswalk_df[1:]

    # removes rows not used for naming columns locally
    crosswalk_df = crosswalk_df[crosswalk_df['name'].str.startswith('K') | crosswalk_df['name'].str.startswith('B')]

    if survey == 'acs5':
        # 'BLKGRP' passes the 'B' prefix filter above but is not a table variable
        crosswalk_df = crosswalk_df[crosswalk_df['name'] != 'BLKGRP']

    acs_data_dict[f'{survey}_crosswalk_df'] = crosswalk_df
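
The header promotion and prefix filtering in the loop above can be tried offline on a small payload. The sketch below assumes the response is an array of arrays whose first row holds the column names, which is the shape the loop relies on. `payload_to_frame` and `sample_payload` are hypothetical names, the two variable rows are copied from the preview in this notebook (concepts left truncated as shown there), and the `BLKGRP` label and concept are placeholders.

```python
import pandas as pd

# Illustrative payload in the array-of-arrays shape the loop expects.
# The BLKGRP label and concept are placeholders, not real API text.
sample_payload = [
    ['name', 'label', 'concept'],
    ['B19001B_014E', 'Estimate!!Total:!!$100,000 to $124,999',
     'Household Income in the Past 12 Months (in 202...'],
    ['BLKGRP', 'placeholder label', 'placeholder concept'],
    ['B19101A_004E', 'Estimate!!Total:!!$15,000 to $19,999',
     'Family Income in the Past 12 Months (in 2022 I...'],
]


def payload_to_frame(payload):
    """Build a DataFrame from a payload whose first row is the header."""
    header, *rows = payload
    return pd.DataFrame(rows, columns=header)


frame = payload_to_frame(sample_payload)

# Keep names starting with 'B' or 'K', then drop the BLKGRP geography row.
starts_with_prefix = frame['name'].str.startswith('B') | frame['name'].str.startswith('K')
variables = frame[starts_with_prefix & (frame['name'] != 'BLKGRP')]
print(variables['name'].tolist())
```

Unpacking the header separately avoids the intermediate state in which the header row is also a data row.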

In [15]:
acs_data_dict['acs5_crosswalk_df'].head()

| | name | label | concept |
|---|---|---|---|
| 4 | B24022_060E | Estimate!!Total:!!Female:!!Service occupations... | Sex by Occupation and Median Earnings in the P... |
| 5 | B19001B_014E | Estimate!!Total:!!$100,000 to $124,999 | Household Income in the Past 12 Months (in 202... |
| 6 | B07007PR_019E | Estimate!!Total:!!Moved from different municip... | Geographical Mobility in the Past Year by Citi... |
| 7 | B19101A_004E | Estimate!!Total:!!$15,000 to $19,999 | Family Income in the Past 12 Months (in 2022 I... |
| 8 | B24022_061E | Estimate!!Total:!!Female:!!Service occupations... | Sex by Occupation and Median Earnings in the P... |
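
One way the crosswalk might be applied is to rename coded columns to readable labels. The sketch below is an assumption about usage, not something this notebook does: the crosswalk rows are the two untruncated entries from the preview, while the `data` table and its values are made up purely for illustration.

```python
import pandas as pd

# Two crosswalk rows taken from the preview.
crosswalk = pd.DataFrame({
    'name': ['B19001B_014E', 'B19101A_004E'],
    'label': ['Estimate!!Total:!!$100,000 to $124,999',
              'Estimate!!Total:!!$15,000 to $19,999'],
})

# Hypothetical data table keyed by variable code (values are made up).
data = pd.DataFrame({'B19001B_014E': [12, 7], 'B19101A_004E': [3, 5]})

# Keep only the last '!!'-delimited segment of each label as the column name.
label_by_code = crosswalk.set_index('name')['label'].str.split('!!').str[-1]
readable = data.rename(columns=label_by_code.to_dict())
print(list(readable.columns))
```

Taking only the last label segment is short but can produce duplicate column names within larger tables, so the full label may be the safer choice there.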


#### ACS Tables
This lookup table maps each ACS table ID (the part of a variable name before the underscore) to a normalized version of its concept name.

In [5]:
# derives a table-level lookup from each crosswalk:
#   - truncates `name` to its table ('group') ID, the part before the underscore
#   - drops duplicate IDs so that each table appears once
#   - normalizes `concept` to lowercase with underscores in place of spaces
for survey in surveys:
    tables_df = acs_data_dict[f'{survey}_crosswalk_df'].copy()
    tables_df['name'] = tables_df['name'].str.split('_').str[0]
    tables_df = tables_df.drop_duplicates(subset='name')
    tables_df = tables_df.drop(columns='label')
    tables_df['concept'] = tables_df['concept'].str.replace(' ', '_').str.lower()

    acs_data_dict[f'{survey}_tables_df'] = tables_df

In [16]:
acs_data_dict['acs5_tables_df'].head()

| | name | concept |
|---|---|---|
| 4 | B24022 | sex_by_occupation_and_median_earnings_in_the_p... |
| 5 | B19001B | household_income_in_the_past_12_months_(in_202... |
| 6 | B07007PR | geographical_mobility_in_the_past_year_by_citi... |
| 7 | B19101A | family_income_in_the_past_12_months_(in_2022_i... |
| 14 | B01001B | sex_by_age_(black_or_african_american_alone) |
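
A table lookup of this shape can be searched by keyword to find candidate table IDs. The following is a minimal sketch: `find_tables` is a hypothetical helper, and the `race` concept for `B02001` is assumed from the comment attached to that table ID later in this notebook.

```python
import pandas as pd

# Small stand-in for a tables lookup; the B02001 concept is an assumption.
tables = pd.DataFrame({
    'name': ['B01001B', 'B02001'],
    'concept': ['sex_by_age_(black_or_african_american_alone)', 'race'],
})


def find_tables(tables_df, keyword):
    """Return rows whose normalized concept contains the keyword."""
    # Normalize the keyword the same way the concept column was normalized.
    needle = keyword.lower().replace(' ', '_')
    mask = tables_df['concept'].str.contains(needle, regex=False, na=False)
    return tables_df[mask]


matches = find_tables(tables, 'Sex by Age')
print(matches['name'].tolist())
```

Passing `regex=False` keeps characters such as parentheses in a keyword from being read as a pattern.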


#### ACS Datasets
This step loads the tables requested below, chosen from the table lookup above, from a local SQLite database.

In [10]:
# list the required ACS tables below
requested_survey = ['acs5']
requested_tables = ['B02001', # race
                    ]
requested_geography_type = 'block_group'
requested_geographies = ['1500000US480291820031']
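
The geography identifier above packs several codes into one string. For a block group, the 12 digits after `US` are the state (2), county (3), tract (6), and block group (1), and the first three characters give the summary level (150 for block groups). The sketch below splits the identifier on that basis; `parse_block_group_ucgid` is a hypothetical helper, not part of the notebook.

```python
def parse_block_group_ucgid(ucgid):
    """Split a block-group identifier such as '1500000US480291820031' into parts."""
    prefix, geoid = ucgid.split('US')
    if len(geoid) != 12 or not geoid.isdigit():
        raise ValueError(f'expected a 12-digit block group GEOID, got {geoid!r}')
    return {
        'summary_level': prefix[:3],  # '150' denotes a block group
        'state': geoid[:2],
        'county': geoid[2:5],
        'tract': geoid[5:11],
        'block_group': geoid[11],
    }


parts = parse_block_group_ucgid('1500000US480291820031')
print(parts)
```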

In [8]:
# path to the local SQLite database of ACS 5-year block group data
demographics_db_filepath = os.path.join('data', 'databases', 'census_acs5_2022_block_group.db')

# creates a SQLAlchemy engine for the SQLite database
sql_engine = sqlalchemy.create_engine('sqlite:///' + demographics_db_filepath)
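
Before reading, it can help to confirm that the requested tables exist in the database. The sketch below uses an in-memory SQLite database as a stand-in for the file on disk, so `demo_engine` and its single-row table are assumptions made only for the demonstration.

```python
import pandas as pd
import sqlalchemy

# In-memory SQLite database standing in for the on-disk file (demo assumption).
demo_engine = sqlalchemy.create_engine('sqlite://')

# Write one small table so there is something to inspect.
pd.DataFrame(
    {'ucgid': ['1500000US480291208001'], 'B02001_001E': [1004]}
).to_sql('B02001', demo_engine, index=False)

# List the tables present and flag any requested table that is missing.
available_tables = sqlalchemy.inspect(demo_engine).get_table_names()
requested_tables = ['B02001']
missing = [name for name in requested_tables if name not in available_tables]
print(available_tables, missing)
```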

In [11]:
# reads the requested table from the database (only the single-table case is handled)
if len(requested_tables) == 1:
    df = pd.read_sql_table(requested_tables[0], con=sql_engine)
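
The cell above only covers a single requested table. One possible extension, shown here as a sketch rather than the notebook's method, reads each requested table and merges the results on `ucgid`. `read_requested_tables` is a hypothetical helper, the database is an in-memory stand-in, and the `EXAMPLE` table and its values are made up.

```python
import functools

import pandas as pd
import sqlalchemy

# In-memory stand-in database with two small tables (demo assumption).
demo_engine = sqlalchemy.create_engine('sqlite://')
ucgids = ['1500000US480291208001', '1500000US480291211111']
pd.DataFrame({'ucgid': ucgids, 'B02001_001E': [1004, 1004]}).to_sql(
    'B02001', demo_engine, index=False)
# 'EXAMPLE' is a made-up table used only to demonstrate the merge.
pd.DataFrame({'ucgid': ucgids, 'EXAMPLE_001E': [10, 20]}).to_sql(
    'EXAMPLE', demo_engine, index=False)


def read_requested_tables(table_names, engine, key='ucgid'):
    """Read each table and outer-merge them on the shared geography key."""
    if not table_names:
        raise ValueError('at least one table name is required')
    frames = [pd.read_sql_table(name, con=engine) for name in table_names]
    return functools.reduce(
        lambda left, right: left.merge(right, on=key, how='outer'), frames)


merged = read_requested_tables(['B02001', 'EXAMPLE'], demo_engine)
print(merged.columns.tolist())
```

With a single table name the helper returns that table unchanged. Columns other than the key that appear in more than one table would receive pandas' default `_x` and `_y` suffixes, so they may need dropping before the merge.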

In [13]:
df.head()

| | index | B02001_001E | B02001_002E | B02001_003E | B02001_004E | B02001_005E | B02001_006E | B02001_007E | B02001_008E | B02001_009E | B02001_010E | NAME | ucgid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Block Group 1; Census Tract 9800.01; Bexar Cou... | 1500000US480299800011 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Block Group 1; Census Tract 9800.02; Bexar Cou... | 1500000US480299800021 |
| 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Block Group 1; Census Tract 9800.04; Bexar Cou... | 1500000US480299800041 |
| 3 | 3 | 1004 | 912 | 0 | 0 | 22 | 0 | 7 | 63 | 63 | 0 | Block Group 1; Census Tract 1208; Bexar County... | 1500000US480291208001 |
| 4 | 4 | 1004 | 726 | 0 | 0 | 0 | 0 | 32 | 246 | 246 | 0 | Block Group 1; Census Tract 1211.11; Bexar Cou... | 1500000US480291211111 |
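
A natural next step with a table like this is to filter to the geographies of interest and convert counts to shares. The sketch below is one possible approach, using three rows copied from the preview: `B02001_001E` is the total and `B02001_002E` is the "White alone" count. The `requested` list here uses an identifier from the preview rather than the one set earlier, and `white_alone_share` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Three rows copied from the preview, limited to the columns used here.
df_demo = pd.DataFrame({
    'ucgid': ['1500000US480299800011', '1500000US480291208001', '1500000US480291211111'],
    'B02001_001E': [0, 1004, 1004],  # total population
    'B02001_002E': [0, 912, 726],    # White alone
})
requested = ['1500000US480291208001']

# Treat a zero total as missing so empty block groups yield NaN, not an error.
totals = df_demo['B02001_001E'].replace(0, np.nan)
df_demo['white_alone_share'] = df_demo['B02001_002E'] / totals

# Keep only the requested geographies.
selected = df_demo[df_demo['ucgid'].isin(requested)]
print(selected[['ucgid', 'white_alone_share']])
```

Block groups with no population, such as the first three rows of the preview, come out as `NaN` and can be dropped or flagged explicitly.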
