# Chicago's Pandemic Rail Ridership - Part 2 Neighborhood Demographics
> Utilizing the Census API to estimate neighborhood demographics.

- toc: true 
- badges: true
- comments: true
- categories: [transit]
- image: images/2022-02-09-census.jpeg

## In this Post

In this post we will explore the Census API, which lets us read in 5-year American Community Survey (ACS) estimates.  We use the 5-year data because its larger pooled sample gives more stable estimates and because it is published down to the block-group level.  In Part 3, this will enable us to estimate granular demographic characteristics within a specified distance from each transit stop.

## Setup

In [2]:
import pandas as pd
import numpy as np

## Define Variables

While the Census API is extremely handy for avoiding manual downloads of data extracts, selecting the relevant columns is still a bit manual.  The Census Bureau provides an exhaustive data dictionary for each API it publishes.  In this case, we are able to use the 2019 5-year ACS data dictionary, found at [this link](https://api.census.gov/data/2019/acs/acs5/groups.html).

The grain of this data is the census block group.  Thus, most variables are simply counts of the individuals (or households) that have a particular characteristic or fall within a given range.  For instance, there is a variable that represents the number of married individuals, a variable that represents the number of single individuals, and additional variables for other marital statuses.  Furthermore, there are variables that represent the number of households that fall within a given income range.

Below, we create strings that hold the names of the variables we would like to include in our query.  These variables represent different income ranges and different levels of education.  As you can see, each topic shares a table prefix, and a numeric suffix indicates the specific income range or level of education.

In [3]:
# build the comma-separated income variables for the query
# B19001_011E ($50,000 to $59,999) is the first bracket at or above $50K,
# and B19001_017E ($200,000 or more) is the last
# we want the % of households making less than $50K, so we only need
# the count at or above $50K relative to the total (B19001_001E)
income_vars = ""
for i in range(11, 18):
    income_value = f"B19001_{i:03d}E"
    income_vars = income_vars + income_value + ","

# build the comma-separated education variables for the query
# B15003_001E is the total; B15003_002E through B15003_016E cover
# every attainment level below a regular high school diploma
educ_vars = ""
for i in range(1, 17):
    educ_value = f"B15003_{i:03d}E"
    educ_vars = educ_vars + educ_value + ","
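
The two loops above share one pattern: a table ID, an underscore, a zero-padded three-digit sequence number, and an `E` suffix for the estimate.  As a minimal sketch, the same strings could be built with a small helper.  Note that `build_variable_names` is a hypothetical name introduced here for illustration; it is not part of the Census API or the original notebook.

```python
def build_variable_names(table_id, start, stop):
    """Return ACS estimate variable names for sequence numbers start..stop (inclusive)."""
    return [f"{table_id}_{i:03d}E" for i in range(start, stop + 1)]

# household income brackets at or above $50K
income_list = build_variable_names("B19001", 11, 17)
# educational attainment total plus every level below a high school diploma
educ_list = build_variable_names("B15003", 1, 16)

# join with commas, keeping the trailing comma the query string expects
income_vars = ",".join(income_list) + ","
educ_vars = ",".join(educ_list) + ","

print(income_vars)
```

Keeping the names in lists as well as strings makes it possible to reuse them later, for example when summing the same columns.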

## Generate Query

Queries follow a straightforward pattern.  They stem from a base URL that must specify the appropriate year.  The base URL is followed by a `get` argument, where we specify the variables we would like returned.  Finally, we specify the geography of interest in the `for` argument, narrowed by the `in` argument.  Here the `for` argument holds the **block-group code**, while `in` holds the **state** and **county code**.  In this case I am returning every block group in every county in the state of Illinois.

In [4]:
# metrics of interest:
# percent of households over $50K in income
# percent in poverty
# percent people of color
# percent with less than a high school diploma


variable_dict = {"B01001_001E": "total_population",
                 "B17020_001E": "poverty_status",
                 "B19001_001E": "total_pop_income",
                 "B01001H_001E": "white_non_hispanic_population"}


# define year
acs_year = "2019"

# dynamic base URL based on year
base_url = f"https://api.census.gov/data/{acs_year}/acs/acs5"

# create string of variables to return
variable_names = f"NAME,B19001_001E,{income_vars}{educ_vars}B01001_002E,B02001_002E,B02001_001E"

# create geography query, specifying the state, county, and block group
state_code = "17"
county_code = "*"
block_group_code = "*"
geo_query = f"block%20group:{block_group_code}&in=state:{state_code}%20county:{county_code}"

# combine elements of our query
query = f"{base_url}?get={variable_names}&for={geo_query}"
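
As an alternative to hand-writing the `%20` escapes, the query string could be assembled with the standard library.  This is only a sketch: `build_census_url` is a hypothetical helper, and the short variable list stands in for the full one built above.

```python
from urllib.parse import urlencode, quote

def build_census_url(year, variables, state="17", county="*", block_group="*"):
    """Assemble an ACS 5-year query URL for block groups within a state."""
    base_url = f"https://api.census.gov/data/{year}/acs/acs5"
    params = {
        "get": ",".join(variables),
        "for": f"block group:{block_group}",
        "in": f"state:{state} county:{county}",
    }
    # quote() encodes spaces as %20; keep ':', '*' and ',' readable
    return f"{base_url}?{urlencode(params, quote_via=quote, safe=':*,')}"

example_url = build_census_url("2019", ["NAME", "B19001_001E"])
print(example_url)
```

Printing the URL before requesting it is a cheap way to confirm that the geography clause matches the pattern described above.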

## Read Data

In [5]:
# read in query using pandas read_json function
census_df = pd.read_json(query)

# convert first row to column names
# (copy the slice so the later column assignment does not act on a view)
census_df.columns = census_df.iloc[0]
census_df = census_df[1:].copy()


# convert value columns to numeric
# (variable codes start with "B"; NAME and the geography columns do not)
value_cols = census_df.filter(regex='^B').columns

census_df[value_cols] = census_df[value_cols].apply(pd.to_numeric, errors='coerce')
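
The Census API returns a JSON array of arrays in which the first row holds the column names and every value arrives as a string.  The sketch below applies the same header promotion and numeric conversion to a tiny hand-made sample, so it runs without a network call.  The rows in `sample_response` are made-up placeholder values, not real Census figures.

```python
import pandas as pd

# made-up rows shaped like an API response (header row first, all strings)
sample_response = [
    ["NAME", "B19001_001E", "B19001_011E", "state", "county", "tract", "block group"],
    ["Block Group 1 (placeholder)", "400", "35", "17", "031", "010100", "1"],
    ["Block Group 2 (placeholder)", "250", None, "17", "031", "010100", "2"],
]

# promote the first row to column names
sample_df = pd.DataFrame(sample_response[1:], columns=sample_response[0])

# convert only the estimate columns; geography codes stay as strings
value_cols = sample_df.filter(regex="^B").columns
sample_df[value_cols] = sample_df[value_cols].apply(pd.to_numeric, errors="coerce")

print(sample_df.dtypes)
```

Leaving the geography columns as strings preserves leading zeros in codes such as `031`, which matters if the codes are later concatenated into an identifier.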

## Derive Metrics

Now, we aim to derive new columns to measure the following:

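- The number of adults (age 25 and over) without a high school diploma
- The number of residents who identify as BIPOC, approximated here as everyone not counted as White alone
- The number of households making more than $50K a year, the complement of those making less than $50K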

Given that our variables are just counts of individuals or households that satisfy different characteristics, our derived metrics are simply sums of different columns.  The counts are summable across geographies; the percentages are not, so they should be recomputed from the summed counts after any aggregation.

In [6]:
census_derived_df = census_df \
    .assign(n_over_50k = lambda x: x['B19001_011E'] + x['B19001_012E'] + x['B19001_013E'] + x['B19001_014E'] +
                                   x['B19001_015E'] + x['B19001_016E'] + x['B19001_017E'],
            total_income = lambda x: x['B19001_001E'],
            pct_over_50k = lambda x: x['n_over_50k'] / x['total_income'],
            n_poc = lambda x: x['B02001_001E'] - x['B02001_002E'],
            total_race = lambda x: x['B02001_001E'],
            pct_poc = lambda x: x['n_poc'] / x['total_race'],
            n_nohs = lambda x: x['B15003_002E'] + x['B15003_003E'] + x['B15003_004E'] + x['B15003_005E'] +
                               x['B15003_006E'] + x['B15003_007E'] + x['B15003_008E'] + x['B15003_009E'] +
                               x['B15003_010E'] + x['B15003_011E'] + x['B15003_012E'] + x['B15003_013E'] +
                               x['B15003_014E'] + x['B15003_015E'] + x['B15003_016E'],
            total_educ = lambda x: x['B15003_001E'],
            pct_nohs = lambda x: x['n_nohs'] / x['total_educ'])

census_derived_df.to_csv("../data/census_derived_df.csv")
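
Because the derived counts are additive, a coarser geography can be built by summing block groups and then recomputing each share from the summed counts.  The sketch below illustrates this on a made-up three-row frame; the `tract` grouping and all values are assumptions for illustration only.

```python
import pandas as pd

# made-up block-group counts; column names mirror the derived frame above
toy_df = pd.DataFrame({
    "tract": ["010100", "010100", "010200"],
    "n_over_50k": [30, 10, 50],
    "total_income": [100, 300, 200],
})

# sum the counts first, then recompute the percentage from the sums
tract_df = toy_df.groupby("tract", as_index=False)[["n_over_50k", "total_income"]].sum()
tract_df["pct_over_50k"] = tract_df["n_over_50k"] / tract_df["total_income"]
tract_df["pct_under_50k"] = 1 - tract_df["pct_over_50k"]

print(tract_df)
```

Averaging the block-group percentages directly would weight every block group equally regardless of its size, which is why the sketch divides the summed counts instead.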

## Summary

Using the Census API, we were able to programmatically pull in relevant variables to estimate different demographic characteristics for a given geography.  In Part 3, we will look at how we tie these demographic characteristics to given transit stops based on proximity.