# America's 500 Largest Cities through the Numbers
### An exploration of urban health compiled by Kennon Stewart
We live in strange times. The COVID-19 pandemic exposed the holes in many states' health systems and, for the first time, states are reckoning with their inadequate public health measures. This is an exploration of the pre-COVID health issues facing individual census tracts in 2016-2017, brought together to paint a larger picture of their state.

In [1]:
%autosave 15
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd

Autosaving every 15 seconds


The dataset I'm using is the 500 Cities database provided by the CDC in 2019. Unfortunately, the set only has data from 2016 and 2017, so if anyone knows of more comprehensive data, let me know! To spare my laptop as much stress as possible, I'll read only the relevant columns.

In [2]:
cols = ['Year','StateAbbr','High_Confidence_Limit','UniqueID','Short_Question_Text']

# Variables:
# Year = year of the estimate
# StateAbbr = state abbreviation
# High_Confidence_Limit = upper confidence limit of the estimate for a particular ailment in a particular census tract
# UniqueID = the unique ID for each census tract, which will be matched with census data from 2017
# Short_Question_Text = short name of the health measure (e.g. 'Coronary Heart Disease')

cities = pd.read_csv('/Users/student/Downloads/500 Cities/500_Cities__Local_Data_for_Better_Health__2019_release.csv', usecols=cols)

I'm also loading racial data from the Census that gives demographic information for the tracts we're analyzing in the cities dataset. This gives us more information about the background of each tract and lets us look more broadly at the relationships between health and social factors.

In [4]:
cols2 = ['GEO_ID','DP05_0038E','DP05_0038M','DP05_0038PE','DP05_0038PM']

# Variables:
# GEO_ID = ID of the Census tract 
# DP05_0038E = estimated real count of Black Americans
# DP05_0038M = margin of error of real count of Black Americans
# DP05_0038PE = estimated percentage of Black Americans
# DP05_0038PM = margin of error of estimated percentage of Black Americans

racial = pd.read_csv('/Users/student/Downloads/500 Cities/ACSDP5Y2017.DP05_2020-06-27T105923/ACSDP5Y2017.DP05_data_with_overlays_2020-06-27T105738.csv',usecols=cols2,low_memory=False)

FileNotFoundError: [Errno 2] File /Users/student/Downloads/500 Cities/ACSDP5Y2017.DP05_2020-06-27T105923/ACSDP5Y2017.DP05_data_with_overlays_2020-06-27T105738.csv does not exist: '/Users/student/Downloads/500 Cities/ACSDP5Y2017.DP05_2020-06-27T105923/ACSDP5Y2017.DP05_data_with_overlays_2020-06-27T105738.csv'
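
The failure above comes from an absolute path that exists only on one machine. Here is a minimal sketch of one way to make that easier to diagnose, assuming a hypothetical `DATA_DIR` folder and a hypothetical helper `missing_files` (neither is part of the original notebook):

```python
from pathlib import Path

# Hypothetical location of the downloaded files; adjust to your own machine.
DATA_DIR = Path.home() / "Downloads" / "500 Cities"

def missing_files(data_dir, filenames):
    """Return the names in `filenames` that do not exist under `data_dir`."""
    return [name for name in filenames if not (Path(data_dir) / name).is_file()]

# File name taken from the first read_csv call in this notebook.
needed = ["500_Cities__Local_Data_for_Better_Health__2019_release.csv"]
absent = missing_files(DATA_DIR, needed)
if absent:
    print("Missing input files:", absent)
```

Running a check like this before the `read_csv` calls would surface a missing download in one place instead of as a chain of `NameError`s further down.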

I'll go ahead and clean the unique IDs in the cities dataset so they can be matched against the Census tract IDs.

In [None]:
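# Keep only the tract portion of UniqueID by dropping the first 8 characters
# (assumed to be the city code plus the hyphen). The earlier replace('-', '')
# step was overwritten by this line and had no effect, so it is removed.
cities['ID'] = cities['UniqueID'].str[8:]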
cities = cities.drop('UniqueID',axis=1)

In [None]:
cities.head(10)

In [None]:
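# Row 0 is dropped because it is not a data row (assumed to hold column descriptions)
racial = racial.drop([0])
# The Census file uses placeholder symbols ('*', '**', '-') where a value is unavailable.
# Replacing them one at a time turns '**' into 'NaNNaN', so coerce non-numeric entries to NaN instead.
for col in ['DP05_0038PE', 'DP05_0038PM']:
    racial[col] = pd.to_numeric(racial[col], errors='coerce')
# Keep only the tract code by dropping the first 9 characters of GEO_ID
# (this replaces the earlier replace('US', '') step, which was overwritten)
racial['ID'] = racial['GEO_ID'].str[9:]
racial = racial.drop('GEO_ID', axis=1)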

In [None]:
racial.head()
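
To make the slicing concrete, here is a small sketch on made-up identifiers. It assumes a tract-level `UniqueID` is a 7-digit place code, a hyphen, then an 11-digit tract code, and that `GEO_ID` is a 9-character prefix ending in `US` followed by the same 11-digit code. The sample values and the `cities_demo` / `racial_demo` frames are illustrative, not rows from either file.

```python
import pandas as pd

# Illustrative values only; the formats are assumptions about the two files.
cities_demo = pd.DataFrame({"UniqueID": ["0107000-01073000100", "2622000-26163500100"]})
racial_demo = pd.DataFrame({"GEO_ID": ["1400000US01073000100", "1400000US26163500100"]})

# Drop the 7-digit place code and the hyphen (8 characters in total).
cities_demo["ID"] = cities_demo["UniqueID"].str[8:]
# Drop the 9-character '1400000US' prefix.
racial_demo["ID"] = racial_demo["GEO_ID"].str[9:]

# Both columns should now hold the same 11-character tract codes.
ids_match = cities_demo["ID"].tolist() == racial_demo["ID"].tolist()
print(cities_demo["ID"].tolist(), ids_match)
```

If the two ID columns do not line up on real data, the merge will silently return few or no rows, so a spot check like this is worth doing before merging.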

Next, we merge the racial and health data for the census tracts that make up the 500 largest cities in the United States. Some of the Census data we loaded has no corresponding CDC data. Since the analysis covers only the 500 largest cities, the inner merge excludes those tracts. In addition, some tracts in the 500 Cities survey were too small to be surveyed effectively, so researchers didn't report estimates for them; those rows are dropped as well.

In [None]:
combined = pd.merge(racial, cities, on='ID', how='inner')
combined.dropna(inplace=True)

In [None]:
combined.shape
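
One way to see how many rows fail to match before discarding them is an outer merge with pandas' `indicator=True` option, which adds a `_merge` column marking each row as `both`, `left_only`, or `right_only`. The sketch below uses tiny made-up frames (`racial_demo`, `cities_demo`), not the real data:

```python
import pandas as pd

# Toy stand-ins for the cleaned `racial` and `cities` frames (values are made up).
racial_demo = pd.DataFrame({"ID": ["A", "B", "C"], "DP05_0038PE": [10.0, 55.5, 80.2]})
cities_demo = pd.DataFrame({"ID": ["B", "C", "D"], "High_Confidence_Limit": [6.1, 7.4, 5.0]})

# An outer merge keeps every row from both sides; `_merge` records where each came from.
audit = pd.merge(racial_demo, cities_demo, on="ID", how="outer", indicator=True)
match_counts = audit["_merge"].value_counts().to_dict()
print(match_counts)

# Keeping only the 'both' rows gives the same IDs as an inner merge.
matched = audit[audit["_merge"] == "both"].drop(columns="_merge")
```

The counts show at a glance how much Census data has no CDC counterpart and vice versa, which the inner merge alone hides.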

In [None]:
michigan = combined[combined['StateAbbr']=='MI']

For a sample analysis, I'll look at my home state of Michigan. Now that I have the percentage of Black residents in each census tract as well as the prevalence of health outcomes in that tract, I can run a one-way ANOVA. This is a standard test of whether the mean of a numeric dependent variable differs across the groups defined by a categorical independent variable. In this instance, the dependent variable is Coronary Heart Disease.

In [5]:
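# .copy() gives an independent frame, so the column assignments below don't trigger SettingWithCopyWarning
chdMI = michigan[michigan['Short_Question_Text']=='Coronary Heart Disease'].copy()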

NameError: name 'michigan' is not defined

In [6]:
chdMI.head(15)

NameError: name 'chdMI' is not defined

In [7]:
chdMI[['DP05_0038E','DP05_0038M']] = chdMI[['DP05_0038E','DP05_0038M']].astype(int)
chdMI[['DP05_0038PE','DP05_0038PM']] = chdMI[['DP05_0038PE','DP05_0038PM']].astype('float64')
chdMI['DP05_0038PE'].describe()

NameError: name 'chdMI' is not defined

In [8]:
chdMI.loc[chdMI['DP05_0038PE'] <= 100, 'DL']= 4
chdMI.loc[chdMI['DP05_0038PE'] <= 89.875, 'DL']= 3
chdMI.loc[chdMI['DP05_0038PE'] <= 27.8, 'DL']= 2
chdMI.loc[chdMI['DP05_0038PE'] <= 7.9, 'DL']= 1

NameError: name 'chdMI' is not defined

In [9]:
chdMI['DL'] = chdMI['DL'].astype('int')

NameError: name 'chdMI' is not defined

In [10]:
chdMI.head()

NameError: name 'chdMI' is not defined

### A Note on Categorical Variables
I created a categorical variable for every census tract based on its percentage of Black residents relative to the rest of its state. The categories are 1, 2, 3, and 4, with 1 being a tract in the lowest 25% of Michigan tracts by percentage of Black residents and 4 being a tract in the highest 25%. This gives us ordered groups and lets us continue with statistical tests that require them.
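
The cut points used earlier (7.9, 27.8, 89.875) are hard-coded. A hedged alternative is `pd.qcut`, which derives quartile bins from the data itself. The sketch below uses made-up percentages in a hypothetical frame `demo`. With heavily tied values (for example, many tracts at 0%), `qcut` can raise an error about duplicate bin edges, so treat this as a starting point rather than a drop-in replacement.

```python
import pandas as pd

# Made-up percentages standing in for chdMI['DP05_0038PE'].
demo = pd.DataFrame({"DP05_0038PE": [0.5, 2.0, 7.9, 12.0, 27.8, 40.0, 89.9, 97.0]})

# Split into four equal-sized groups; labels 1-4 run from lowest to highest share.
demo["DL"] = pd.qcut(demo["DP05_0038PE"], q=4, labels=[1, 2, 3, 4]).astype(int)
print(demo)
```

Because the bins are recomputed from whatever data is passed in, the same line would work for another state without looking up new thresholds.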

In [11]:
import scipy.stats as stats

In [12]:
# One sample of High_Confidence_Limit values per group (DL = 1 to 4)
groups = [chdMI.loc[chdMI['DL'] == level, 'High_Confidence_Limit'] for level in (1, 2, 3, 4)]
stats.f_oneway(*groups)

NameError: name 'chdMI' is not defined

After conducting a one-way ANOVA, we can see the result is a minuscule p-value. This indicates that the mean rate of Coronary Heart Disease differs across the four groups, meaning the percentage of Black residents in a Michigan census tract is associated with its rate of Coronary Heart Disease. This is something we can explore further with histograms of the distributions.
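
For readers unfamiliar with what `f_oneway` computes, the sketch below runs it on synthetic groups (random numbers, not the Michigan data) and reproduces the F statistic by hand as the ratio of the between-group mean square to the within-group mean square.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
# Four synthetic groups whose true means differ slightly; not real prevalence data.
groups = [rng.normal(loc=mean, scale=1.0, size=50) for mean in (6.0, 6.5, 7.0, 7.5)]

f_stat, p_value = stats.f_oneway(*groups)

# Hand computation of the same F statistic.
all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n = len(groups), all_values.size
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

print(f_stat, f_manual, p_value)
```

A small p-value says only that at least one group mean differs from the others; it does not say which groups differ or why.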

In [1]:
chdMI['High_Confidence_Limit'].hist()

NameError: name 'chdMI' is not defined
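
A single histogram pools all four groups. One possible way to compare the distributions is to overlay one histogram per `DL` group. The sketch below does this on synthetic values in a hypothetical frame `demo`, since the real `chdMI` frame was never created in this run.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Synthetic stand-in for chdMI: 40 made-up values for each of the four groups.
demo = pd.DataFrame({
    "DL": np.repeat([1, 2, 3, 4], 40),
    "High_Confidence_Limit": rng.normal(loc=7.0, scale=1.5, size=160),
})

fig, ax = plt.subplots()
# Draw one semi-transparent histogram per group on shared axes.
for level, subset in demo.groupby("DL"):
    ax.hist(subset["High_Confidence_Limit"], bins=15, alpha=0.5, label=f"DL = {level}")
ax.set_xlabel("High_Confidence_Limit")
ax.set_ylabel("Number of tracts")
ax.legend()
```

On the real data, swapping `demo` for `chdMI` should be enough, assuming the `DL` column has been created as in the earlier cells.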