## Preprocessing DM and obesity prevalence

Diabetes and obesity prevalence by state are published by the CDC through the [Diabetes Atlas](https://gis.cdc.gov/grasp/diabetes/DiabetesAtlas.html#). I downloaded the prevalence of each of these two indicators as CSV files, one per year. This gave me predictors to work with before tackling the much more laborious extraction of data from the BRFSS.

In [1]:
import pandas as pd
import numpy as np
import pickle

import matplotlib.pyplot as plt

# create autocorr plot
from pandas.plotting import autocorrelation_plot
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf # better and more arguments

%matplotlib inline

The target variable is mortality from myocardial infarction (MI) by state.

In [2]:
%%bash
ls ../data/pickles

MI_mortality_medicaid_expansion.pkl
any_exercise_list_of_dfs.pkl
cardiac_mortality_obesity_dm_df_by_state.pkl
codebook_dfs_dict.pkl
consensus_var_desc_dict.pkl
dict_of_interpol_covariate_state_dfs.pkl
dict_of_relevant_dfs_raw.pkl
interpol_truncated_MI_mortality_per_state_dict.pkl
list_of_relevant_dfs_raw.pkl
master_codebook_all_years.pkl
master_dict_of_state_dfs_with_covariates.pkl
myocardial_infarction_df_state_mortality_dict.pkl
state_mortality_dict.pkl
state_population_by_year_dict.pkl


In [3]:
with open("../data/pickles/myocardial_infarction_df_state_mortality_dict.pkl", "rb") as picklefile:
    df, state_mortality_dict = pickle.load(picklefile)

We'll use the diabetes prevalence and obesity prevalence per state per year as two predictor variables for the target time series.

In [5]:
%%bash
ls ../data/cdc_diabetes/state_diabetes_prevalence/

diabetes_prevalence_1999.csv
diabetes_prevalence_2000.csv
diabetes_prevalence_2001.csv
diabetes_prevalence_2002.csv
diabetes_prevalence_2003.csv
diabetes_prevalence_2004.csv
diabetes_prevalence_2005.csv
diabetes_prevalence_2006.csv
diabetes_prevalence_2007.csv
diabetes_prevalence_2008.csv
diabetes_prevalence_2009.csv
diabetes_prevalence_2010.csv
diabetes_prevalence_2011.csv
diabetes_prevalence_2012.csv
diabetes_prevalence_2013.csv
diabetes_prevalence_2014.csv
diabetes_prevalence_2015.csv


In [13]:
states = list(state_mortality_dict.keys())

years = list(range(1999, 2016))

In [15]:
dm_prevalence_dict = {}

for year in years:
    path = f"../data/cdc_diabetes/state_diabetes_prevalence/diabetes_prevalence_{year}.csv"
    # The 2014 and 2015 files are tab-separated; the others are comma-separated
    sep = "\t" if year in (2014, 2015) else ","
    dm_prevalence_state_df = pd.read_csv(path, sep=sep)

    # Non-numeric entries become NaN
    dm_prevalence_state_df['Percentage'] = pd.to_numeric(dm_prevalence_state_df['Percentage'], errors='coerce')

    for state in states:
        # Expect one row per state; .iloc[0] extracts the scalar instead of calling float() on a Series
        percentage = dm_prevalence_state_df.loc[dm_prevalence_state_df.State == state, 'Percentage'].iloc[0]
        dm_prevalence_dict.setdefault(state, []).append(float(percentage))

In [16]:
for state, series in dm_prevalence_dict.items():
    temp = dict(zip(pd.to_datetime(years, format='%Y'), series))
    dm_prevalence_dict[state] = pd.Series(temp)

We'll parse the obesity prevalence files in a similar fashion.
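
Because the diabetes and obesity loops share the same shape, they could be folded into one helper. The sketch below is only an illustration, not the code used in this notebook: `prevalence_by_state` is a hypothetical name, the CSV contents are made-up numbers, and unlike the cells here it records `NaN` for a state that is missing from a file rather than raising an error.

```python
import io

import pandas as pd


def prevalence_by_state(frames_by_year, states):
    """Build {state: Series indexed by Jan 1 of each year} from {year: DataFrame}.

    Hypothetical helper: each DataFrame is assumed to have 'State' and
    'Percentage' columns, as the CDC files in this notebook do.
    """
    years = sorted(frames_by_year)
    index = pd.to_datetime([str(y) for y in years], format="%Y")
    values = {state: [] for state in states}
    for year in years:
        df = frames_by_year[year]
        # Non-numeric entries become NaN
        pct = pd.to_numeric(df["Percentage"], errors="coerce")
        lookup = dict(zip(df["State"], pct))
        for state in states:
            # Assumption: a state absent from a file is recorded as NaN
            values[state].append(lookup.get(state, float("nan")))
    return {state: pd.Series(vals, index=index) for state, vals in values.items()}


# Toy input with made-up numbers, standing in for two yearly CSV files
csv_1999 = "State,Percentage\nAlabama,6.1\nAlaska,No Data\n"
csv_2000 = "State,Percentage\nAlabama,6.5\nAlaska,4.0\n"
frames = {
    1999: pd.read_csv(io.StringIO(csv_1999)),
    2000: pd.read_csv(io.StringIO(csv_2000)),
}
toy = prevalence_by_state(frames, ["Alabama", "Alaska"])
```

With real files, `frames_by_year` would be filled by calling `pd.read_csv` once per year, passing the appropriate separator for each file.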

In [18]:
obesity_prevalence_dict = {}

for year in years:
    path = f"../data/cdc_diabetes/state_obesity_prevalence/obesity_prevalence_{year}.csv"
    obesity_prevalence_state_df = pd.read_csv(path)

    # Non-numeric entries become NaN
    obesity_prevalence_state_df['Percentage'] = pd.to_numeric(obesity_prevalence_state_df['Percentage'], errors='coerce')

    for state in states:
        # Expect one row per state; .iloc[0] extracts the scalar instead of calling float() on a Series
        percentage = obesity_prevalence_state_df.loc[obesity_prevalence_state_df.State == state, 'Percentage'].iloc[0]
        obesity_prevalence_dict.setdefault(state, []).append(float(percentage))

In [19]:
for state, series in obesity_prevalence_dict.items():
    temp = dict(zip(pd.to_datetime(years, format='%Y'), series))
    obesity_prevalence_dict[state] = pd.Series(temp)

We need to resample the diabetes and obesity prevalence time series so that there is a data point for every month. This matches the frequency of our target variable (total deaths due to myocardial infarction), which is sampled monthly.
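
To make the idea concrete, here is a minimal sketch of upsampling an annual series to monthly values by linear interpolation. It uses made-up numbers and a slightly more direct route than the cells in this notebook: it resamples straight onto a month-start (`"MS"`) grid, which I would expect to give the same interpolated values but have not checked against the real data.

```python
import pandas as pd

# Two annual observations (made-up values), stamped on Jan 1 of each year
annual = pd.Series(
    [6.0, 7.2],
    index=pd.to_datetime(["1999-01-01", "2000-01-01"]),
)

# Upsample to a month-start grid; months without an observation become NaN
monthly = annual.resample("MS").asfreq()

# Fill the gaps by interpolating linearly between the annual values
monthly = monthly.interpolate("linear")
```

The result has one row per month from January 1999 to January 2000, rising in equal steps from the first annual value to the second. Note that nothing is produced after the last annual observation, so the monthly series ends in January of the final year.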

In [24]:
dm_prevalence_resampled_dict = {}

for state, time_series in dm_prevalence_dict.items():
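    # Upsample to month-end labels, keep each annual value in its own month, then interpolate between years
    resampled_ts = time_series.resample("M").ffill(limit=1).interpolate('linear')
    # Move each month-end label back to the first day of the same month
    resampled_ts.index = resampled_ts.index - pd.offsets.MonthBegin(0) - pd.DateOffset(months=1)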
    dm_prevalence_resampled_dict[state] = resampled_ts

In [25]:
obesity_prevalence_resampled_dict = {}

for state, time_series in obesity_prevalence_dict.items():
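    # Upsample to month-end labels, keep each annual value in its own month, then interpolate between years
    resampled_ts = time_series.resample("M").ffill(limit=1).interpolate('linear')
    # Move each month-end label back to the first day of the same month
    resampled_ts.index = resampled_ts.index - pd.offsets.MonthBegin(0) - pd.DateOffset(months=1)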
    obesity_prevalence_resampled_dict[state] = resampled_ts

In [26]:
dict_of_dfs = {}

for state, cardiac_mortality_df in state_mortality_dict.items():
    temp_df = cardiac_mortality_df.copy()
    temp_df['obesity_prevalence'] = obesity_prevalence_resampled_dict[state]
    temp_df['diabetes_prevalence'] = dm_prevalence_resampled_dict[state]
    dict_of_dfs[state] = temp_df

Let's also import the state populations by year and use them to convert the monthly death counts into a mortality rate per 100,000 population.
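
As a quick illustration of the normalization, the sketch below applies the per-100,000 formula to a toy dataframe. The numbers are made up; the column names `Deaths` and `Population` follow the ones used in this notebook.

```python
import pandas as pd

idx = pd.to_datetime(["1999-01-01", "1999-02-01"])

# Made-up monthly death counts for one state
toy = pd.DataFrame({"Deaths": [50, 45]}, index=idx)

# Made-up monthly population; assigning a Series aligns it on the date index
population = pd.Series([5_000_000, 5_000_000], index=idx)
toy["Population"] = population

# Deaths per 100,000 residents for each month
toy["mortality_per_100k"] = 100_000 * toy["Deaths"] / toy["Population"]
```

Because the assignment aligns on the index, the population series and the mortality dataframe must share the same month-start dates. Any month missing from the population series ends up as `NaN` in both new columns.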

In [27]:
%%bash
ls ../data/cdc_diabetes/

DM_PREV_ALL_STATES.xlsx
DM_PREV_by_sex_ALL_STATES.xlsx
INCIDENCE_ALL_STATES.xlsx
LTPIA_PREV_ALL_STATES.xlsx
LTPIA_PREV_by_sex_ALL_STATES.xlsx
OB_PREV_ALL_STATES.xlsx
OB_PREV_by_sex_ALL_STATES.xlsx
state_diabetes_prevalence
state_obesity_prevalence
state_populations_by_year.txt


In [28]:
with open("../data/cdc_diabetes/state_populations_by_year.txt", "r") as f:
    lines = f.readlines()

In [29]:
state_population_by_year = pd.read_csv("../data/cdc_diabetes/state_populations_by_year.txt", delimiter="\t")
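# Keep only the columns we need and drop rows with missing values.
# Chaining .dropna() returns a new dataframe, which avoids modifying a slice in place.
state_population_by_year = state_population_by_year[['State', 'State Code', 'Year', 'Population']].dropna()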

state_population_by_year['Year'] = state_population_by_year['Year'].apply(int)
state_population_by_year['Year'] = pd.to_datetime(state_population_by_year['Year'], format='%Y')
state_population_by_year['Population'] = pd.to_numeric(state_population_by_year['Population'])

In [30]:
grouped_by_state = state_population_by_year.groupby('State')[['Year', 'Population']]

In [31]:
state_population_by_year_dict = {}

for state in states:
    temp = grouped_by_state.get_group(state).set_index('Year')
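    # Upsample to month-end labels, keep each annual value in its own month, then interpolate between years
    temp = temp.resample("M").ffill(limit=1).interpolate('linear')
    # Move each month-end label back to the first day of the same month
    temp.index = temp.index - pd.offsets.MonthBegin(0) - pd.DateOffset(months=1)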
    state_population_by_year_dict[state] = temp

In [32]:
for state in states:
    temp_df = dict_of_dfs[state].copy()
    temp_df['Population'] = state_population_by_year_dict[state]
    temp_df['mortality_per_100k'] = 100000*(temp_df['Deaths']/temp_df['Population'])
    dict_of_dfs[state] = temp_df

In [61]:
with open("../data/pickles/state_population_by_year_dict.pkl", "wb") as picklefile:
    pickle.dump(state_population_by_year_dict, picklefile)

In [None]:
with open("../data/pickles/cardiac_mortality_obesity_dm_df_by_state.pkl", "wb") as picklefile:
    pickle.dump(dict_of_dfs, picklefile)

We now have, for each state, a dataframe containing the target (MI mortality per 100,000 population) and two predictor variables (obesity and diabetes prevalence). We also have a dictionary containing each state's monthly population from 1999 to 2015, obtained by linear interpolation of the annual figures.