## Extracting predictors from BRFSS

In this notebook, we're going to figure out which predictors we want to extract from the BRFSS CSVs. This is going to take several steps:

1. Load consensus_var_desc_dict, which contains the variable names and descriptions from all the codebooks from 1999 to 2017. Also load the master_codebook_all_years dataframe, which shows which variables were used in which years.
2. Decide which predictors to extract from each year's BRFSS CSV.
3. Load each BRFSS CSV into a pandas dataframe and extract the relevant columns. Store the results in a dictionary or list of dataframes.
4. Take each year's dataframe and compute the state-level summary statistics we want. This turns each predictor into a time series with an annual sampling frequency.

In [1]:
%%bash
ls ../data/pickles

MI_mortality_medicaid_expansion.pkl
any_exercise_list_of_dfs.pkl
cardiac_mortality_obesity_dm_df_by_state.pkl
codebook_dfs_dict.pkl
consensus_var_desc_dict.pkl
interpol_truncated_MI_mortality_per_state_dict.pkl
master_codebook_all_years.pkl
myocardial_infarction_df_state_mortality_dict.pkl
state_population_by_year_dict.pkl


In [64]:
import pickle
import pandas as pd

import warnings

from progress_bar import log_progress

warnings.filterwarnings('ignore')

In [4]:
with open("../data/pickles/master_codebook_all_years.pkl", "rb") as f:
    master_codebook_all_years_df = pickle.load(f)

In [6]:
with open("../data/pickles/consensus_var_desc_dict.pkl", "rb") as f:
    consensus_var_desc_dict = pickle.load(f)

What we need to do is find all the variables that mention a particular condition, so that we can figure out which variable names are synonymous with each other.

In [51]:
master_codebook_all_years_df[master_codebook_all_years_df.var_name == 'HLTHPLN1']

Unnamed: 0,var_name,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
1079,HLTHPLN1,,,,,,,,,,,,,"Do you have any kind of health care coverage, ...","Do you have any kind of health care coverage, ...","Do you have any kind of health care coverage, ...","Do you have any kind of health care coverage, ...","Do you have any kind of health care coverage, ...","Do you have any kind of health care coverage, ...","Do you have any kind of health care coverage, ..."
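
The lookup above shows a single variable's row. A minimal sketch of turning such a row into a list of years is given below. It assumes the codebook dataframe has a `var_name` column plus one column per year, with missing values where the variable was not used; the helper name `years_with_variable` and the toy codebook are hypothetical, and the real year columns may be integers rather than strings.

```python
import pandas as pd

def years_with_variable(codebook_df, var_name):
    """Return the year columns that hold a description for var_name."""
    rows = codebook_df[codebook_df["var_name"] == var_name]
    if rows.empty:
        return []
    year_cols = [c for c in codebook_df.columns if c != "var_name"]
    # A year counts as "used" if any matching row has a non-null description.
    present = rows[year_cols].notna().any(axis=0)
    return [c for c in year_cols if present[c]]

# Toy stand-in for master_codebook_all_years_df (not the real data).
toy_codebook = pd.DataFrame({
    "var_name": ["HLTHPLAN", "HLTHPLN1"],
    "2010": ["coverage question (toy text)", None],
    "2011": [None, "coverage question (toy text)"],
    "2012": [None, "coverage question (toy text)"],
})

print(years_with_variable(toy_codebook, "HLTHPLN1"))
```

Running the same helper over both `HLTHPLAN` and `HLTHPLN1` is one way to confirm that two names cover complementary year ranges before treating them as synonyms.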


In [143]:
relevant_variables = {}
relevant_variables['high_cholesterol'] = ['TOLDHI','TOLDHI2']
relevant_variables['hypertension'] = ['BPHIGH', 'BPHIGH2', 'BPHIGH3', 'BPHIGH4']
relevant_variables['aspirin'] = ['CVDASPRN']
relevant_variables['exercise'] = ['EXERANY', 'EXERANY2']
relevant_variables['general_health'] = ['GENHLTH']
relevant_variables['mental_health'] = ['MENTHLTH']
relevant_variables['coverage'] = ['HLTHPLAN', 'HLTHPLN1']
relevant_variables['income'] = ['INCOME2']
relevant_variables['smoker'] = ['SMOKER2', 'SMOKER3']
relevant_variables['weight_to_multiply'] = ['FINALWT', 'LLCPWT']

In [None]:
# This cell goes through every variable name and description in the dictionary and,
# if the condition is mentioned in the description/question (ignoring case),
# prints the variable name and the description/question once.

condition = 'race'

for key, val in consensus_var_desc_dict.items():
    if val and condition.lower() in val.lower():
        print(key)
        print(val)
        print("\n")

Now, let's write a function that takes a list of columns we want to extract, checks it against the columns of a dataframe, and returns the names of the matching columns that the dataframe actually has. We need this function because the dataframes for the different years use different variable names.

In [144]:
columns_to_extract = ['STATE', 'FINALWT', 'LLCPWT']
for key, val in relevant_variables.items():
    columns_to_extract.extend(val)

In [145]:
def find_cols_this_df_has(list_of_columns, df):
    good_cols = []
    for col in df.columns:
        temp_col = col.replace("x.", "")
        for col_to_extract in list_of_columns:
            if col_to_extract.lower() == temp_col.lower():
                good_cols.append(col)
    return good_cols
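
Because columns_to_extract starts with 'FINALWT' and 'LLCPWT' and the loop over relevant_variables adds both names again, the function above appends a matching weight column once per occurrence in the list. The sketch below is one possible variant that returns each dataframe column at most once. The name `find_cols_once` and the `prefix` parameter are hypothetical, and it only strips "x." at the start of a column name, which is an assumption about how the CSV headers are formed.

```python
import pandas as pd

def find_cols_once(list_of_columns, df, prefix="x."):
    """Return each df column that matches a wanted name, without repeats."""
    wanted = {name.lower() for name in list_of_columns}
    good_cols = []
    for col in df.columns:
        # Strip the prefix only when it leads the column name.
        bare = col[len(prefix):] if col.startswith(prefix) else col
        if bare.lower() in wanted and col not in good_cols:
            good_cols.append(col)
    return good_cols

# Toy dataframe with BRFSS-style headers (no rows needed for this check).
toy_df = pd.DataFrame(columns=["x.state", "genhlth", "x.finalwt", "seqno"])
wanted_names = ["STATE", "FINALWT", "LLCPWT", "GENHLTH", "FINALWT"]

print(find_cols_once(wanted_names, toy_df))
```

With unique column names, later steps such as renaming and selecting the weight column by name return a single Series instead of a two-column dataframe.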

Now, let's iterate through each of the BRFSS CSVs and load them into a dataframe, and for each dataframe we'll get all the columns that we want to extract that this dataframe contains.

In [146]:
years = list(range(1999, 2018))

In [147]:
dict_of_relevant_dfs = {}

for year in log_progress(years):
    brfss_df = pd.read_csv(f"../data/brfss/csv/brfss{year}.csv", encoding="cp1252")
    cols = find_cols_this_df_has(columns_to_extract, brfss_df)
    print(year)
    print(cols)
    temp_df = brfss_df[cols].copy()
    
    temp_df['year'] = year
    dict_of_relevant_dfs[year] = temp_df

1999
['x.state', 'genhlth', 'menthlth', 'hlthplan', 'bphigh', 'toldhi', 'income2', 'cvdasprn', 'exerany', 'x.smoker2', 'x.finalwt', 'x.finalwt']
2000
['x.state', 'genhlth', 'menthlth', 'hlthplan', 'exerany', 'income2', 'bphigh', 'toldhi', 'cvdasprn', 'x.smoker2', 'x.finalwt', 'x.finalwt']
2001
['x.state', 'genhlth', 'menthlth', 'hlthplan', 'exerany2', 'bphigh2', 'toldhi2', 'income2', 'cvdasprn', 'x.finalwt', 'x.finalwt', 'x.smoker2']
2002
['x.state', 'genhlth', 'hlthplan', 'exerany2', 'income2', 'bphigh3', 'toldhi2', 'menthlth', 'cvdasprn', 'x.finalwt', 'x.finalwt', 'x.smoker2']
2003
['x.state', 'genhlth', 'menthlth', 'hlthplan', 'exerany2', 'bphigh3', 'toldhi2', 'income2', 'cvdasprn', 'x.finalwt', 'x.finalwt', 'x.smoker2']
2004
['x.state', 'genhlth', 'menthlth', 'hlthplan', 'exerany2', 'income2', 'bphigh3', 'toldhi2', 'cvdasprn', 'x.finalwt', 'x.finalwt', 'x.smoker2']
2005
['x.state', 'genhlth', 'menthlth', 'hlthplan', 'exerany2', 'bphigh4', 'toldhi2', 'income2', 'cvdasprn', 'x.finalw

In [148]:
with open("../data/pickles/dict_of_relevant_dfs_raw.pkl", "wb") as f:
    pickle.dump(dict_of_relevant_dfs, f)
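
The loading loop above reads every column of each CSV before selecting a handful. If memory or load time becomes a problem, one option is to read only the header first and then pass the matching names to `usecols`. This is a sketch under assumptions: `read_selected_columns` is a hypothetical helper, the toy CSV stands in for a BRFSS file, and the "x." prefix handling mirrors the matching logic used earlier.

```python
import os
import tempfile

import pandas as pd

def read_selected_columns(csv_path, wanted, encoding="cp1252", prefix="x."):
    """Read only the CSV columns whose (prefix-stripped) names are wanted."""
    # nrows=0 parses just the header row, so this first pass is cheap.
    header = pd.read_csv(csv_path, nrows=0, encoding=encoding)
    wanted_lower = {name.lower() for name in wanted}
    keep = [
        col for col in header.columns
        if (col[len(prefix):] if col.startswith(prefix) else col).lower() in wanted_lower
    ]
    return pd.read_csv(csv_path, usecols=keep, encoding=encoding)

# Toy file standing in for one year's BRFSS CSV.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_brfss.csv")
    pd.DataFrame(
        {"x.state": [1, 2], "genhlth": [2.0, 3.0], "seqno": [10, 11]}
    ).to_csv(toy_path, index=False)
    demo_df = read_selected_columns(toy_path, ["STATE", "GENHLTH", "LLCPWT"])

print(list(demo_df.columns))
```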

Now, let's clean up each of the dataframes. Unfortunately, the easiest way to do this on a relatively small set is to do it manually.

In [149]:
dict_of_relevant_dfs[1999].columns

Index(['x.state', 'genhlth', 'menthlth', 'hlthplan', 'bphigh', 'toldhi',
       'income2', 'cvdasprn', 'exerany', 'x.smoker2', 'x.finalwt', 'x.finalwt',
       'year'],
      dtype='object')

In [150]:
relevant_variables

{'high_cholesterol': ['TOLDHI', 'TOLDHI2'],
 'hypertension': ['BPHIGH', 'BPHIGH2', 'BPHIGH3', 'BPHIGH4'],
 'aspirin': ['CVDASPRN'],
 'exercise': ['EXERANY', 'EXERANY2'],
 'general_health': ['GENHLTH'],
 'mental_health': ['MENTHLTH'],
 'coverage': ['HLTHPLAN', 'HLTHPLN1'],
 'income': ['INCOME2'],
 'smoker': ['SMOKER2', 'SMOKER3'],
 'weight_to_multiply': ['FINALWT', 'LLCPWT']}

In [151]:
for year, df in dict_of_relevant_dfs.items():
    
    rename_dict = {}

    for col in df.columns:
        for key, value in relevant_variables.items():
            if col.replace("x.", "").upper() in value:
                rename_dict[col] = key
                
    dict_of_relevant_dfs[year] = df.rename(columns=rename_dict)

Now that we have a dictionary of dataframes with the predictors we're interested in, we have to translate the numeric codes into the actual responses. We can then calculate the mean per state per year for each of our predictors.

In [152]:
dict_of_relevant_dfs[2017].head()

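| | x.state | general_health | mental_health | coverage | hypertension | high_cholesterol | income | exercise | aspirin | weight_to_multiply | weight_to_multiply.1 | smoker | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2.0 | 88.0 | 1.0 | 1.0 | 1.0 | 6.0 | 1.0 | NaN | 79.425947 | 79.425947 | 4 | 2017 |
| 1 | 1 | 2.0 | 88.0 | 1.0 | 1.0 | 2.0 | 8.0 | 1.0 | NaN | 89.69458 | 89.69458 | 4 | 2017 |
| 2 | 1 | 3.0 | 88.0 | 1.0 | 3.0 | 1.0 | 99.0 | 2.0 | NaN | 440.121376 | 440.121376 | 4 | 2017 |
| 3 | 1 | 4.0 | 88.0 | 1.0 | 1.0 | 1.0 | 1.0 | NaN | NaN | 194.867164 | 194.867164 | 4 | 2017 |
| 4 | 1 | 4.0 | 88.0 | 1.0 | 3.0 | 2.0 | 2.0 | 2.0 | NaN | 169.087888 | 169.087888 | 3 | 2017 |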


In [171]:
var_name_response_num_to_str_dict = {}
var_name_response_num_to_str_dict['high_cholesterol'] = {1: 'yes', 2: 'no', 7: "don't know", 9: "refused"}
var_name_response_num_to_str_dict['coverage'] = {1: 'yes', 2: 'no', 7: "don't know", 9: "refused"}
var_name_response_num_to_str_dict['general_health'] = {1: 'excellent', 2: 'very good', 3: 'good',
                                                       4: 'fair', 5: 'poor', 7: "don't know", 9: 'refused'}
var_name_response_num_to_str_dict['mental_health'] = {77: "don't know", 88: 0, 99: "refused"}
var_name_response_num_to_str_dict['hypertension'] = {1: 'yes', 2: 'yes, only during pregnancy', 3:'no', 4:'borderline', 7: "don't know", 9: "refused"}
var_name_response_num_to_str_dict['income'] = {1: '<10K', 2: '10K-15K', 3: '15K-20K', 4: '20K-25K', 
                                               5: '25K-35K', 6: '35K-50K', 7: '50K-75K', 8: '>75', 77: "don't know", 99: 'refused'}
var_name_response_num_to_str_dict['aspirin'] = {1: 'yes', 2: 'no', 7: "don't know", 9: "refused"}
var_name_response_num_to_str_dict['smoker'] = {1: 'current, smoke every day', 2: 'current, smoke some days', 3: 'former smoker', 4: 'never smoked', 9: 'refused'}

In [172]:
dict_of_dfs_per_year_with_str_responses = {}

for year in years:
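    # Work on a copy so that re-running this cell does not recode dict_of_relevant_dfs in place.
    temp_df = dict_of_relevant_dfs[year].copy()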
    for col in temp_df.columns:
        if col in var_name_response_num_to_str_dict:
            temp_df[col] = temp_df[col].replace(to_replace=var_name_response_num_to_str_dict[col])
    dict_of_dfs_per_year_with_str_responses[year] = temp_df

Now, we need to figure out which response we're interested in for each predictor.

In [165]:
dict_of_dfs_per_year_with_str_responses[1999].head()

| | x.state | general_health | mental_health | coverage | hypertension | high_cholesterol | income | aspirin | exercise | smoker | weight_to_multiply | weight_to_multiply.1 | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | excellent | 0 | yes | no | no | 50K-75K | no | NaN | former smoker | 1419.911405 | 1419.911405 | 1999 |
| 1 | 1 | very good | 10 | no | no | no | <10K | no | NaN | never smoked | 1839.091587 | 1839.091587 | 1999 |
| 2 | 1 | very good | 0 | yes | no | no | 25K-35K | NaN | NaN | never smoked | 1024.247333 | 1024.247333 | 1999 |
| 3 | 1 | very good | 0 | yes | yes | no | >75 | no | NaN | never smoked | 2381.874267 | 2381.874267 | 1999 |
| 4 | 1 | very good | 2 | no | no | no | 50K-75K | no | NaN | never smoked | 2381.874267 | 2381.874267 | 1999 |
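
Each row above carries a survey weight next to its responses, so one plausible state-level statistic is the weighted share of respondents giving a particular answer. The sketch below illustrates that idea only. `weighted_share` is a hypothetical helper, the toy frame is made up, and it assumes a single, uniquely named weight column (the real frames currently hold the weight twice). Rows with a missing response are dropped from both numerator and denominator, while "don't know" and "refused" stay in the denominator; that is a design choice you may want to change.

```python
import pandas as pd

def weighted_share(df, column, response, weight_col, state_col="x.state"):
    """Weighted fraction of respondents per state whose answer equals response."""
    answered = df.dropna(subset=[column, weight_col])
    # Keep the weight where the answer matches, zero it out otherwise.
    hit_weight = answered[weight_col].where(answered[column] == response, 0.0)
    numerator = hit_weight.groupby(answered[state_col]).sum()
    denominator = answered.groupby(state_col)[weight_col].sum()
    return numerator / denominator

# Toy data: state 1 has weights 1 (yes) and 3 (no); state 2 has two "yes"
# answers and one missing response that is excluded.
toy = pd.DataFrame({
    "x.state": [1, 1, 2, 2, 2],
    "hypertension": ["yes", "no", "yes", "yes", None],
    "weight": [1.0, 3.0, 2.0, 2.0, 5.0],
})

shares = weighted_share(toy, "hypertension", "yes", "weight")
print(shares)
```

Applying such a function to every year's dataframe and stacking the results by year would give one annual series per state and predictor.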


In [166]:
g = dict_of_dfs_per_year_with_str_responses[2017].groupby('x.state')

In [170]:
g.get_group(1).hypertension.unique()

array(['yes', 3.0, 4.0, 'no', "don't know", 'refused'], dtype=object)
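
The output above mixes translated strings with raw codes such as 3.0 and 4.0. A quick way to spot this kind of leftover in any recoded column is to list the values that are still numeric. The helper name `untranslated_codes` is hypothetical and the toy Series simply mirrors the array shown above. The check is not meaningful for columns where numbers are legitimate answers, such as mental_health.

```python
import numbers

import pandas as pd

def untranslated_codes(series):
    """Return the distinct non-null numeric values left in a recoded column."""
    values = series.dropna().unique()
    return sorted(
        v for v in values
        if isinstance(v, numbers.Number) and not isinstance(v, bool)
    )

# Toy column mirroring the mixed output above.
toy_column = pd.Series(
    ["yes", 3.0, 4.0, "no", "don't know", "refused", None], dtype=object
)

print(untranslated_codes(toy_column))
```

An empty list for every categorical predictor in every year would suggest the response dictionaries cover all codes that occur in the data.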