# **Heart Disease Health Indicators Data Processing**
The data comes from the [CDC 2021 BRFSS Survey Data and Documentation](https://www.cdc.gov/brfss/annual_data/annual_2021.html). The Behavioral Risk Factor Surveillance System (BRFSS) is a survey conducted in the U.S. to understand the health-related risk behaviors and chronic health conditions of its residents.
* [2021 Survey Data Information](https://www.cdc.gov/brfss/annual_data/2021/pdf/codebook21_llcp-v2-508.pdf): There are 438,693 records and 303 features for 2021.

The following data processing steps are applied:
* Removed invalid data
    * Dropped missing values
    * Removed duplicate data
* Mapped variables to standardized values
    * Converted "Yes" to 1 and "No" to 0
    * Converted integers to strings for better visualization
* One-hot-encoded non-binary categorical variables
    * BMI, Diabetes, General Health, Mental Health Status, Age, Education, Income
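
As a rough illustration of the steps listed above, the following minimal sketch applies them to a small made-up DataFrame. The frame `toy` and its columns `smoker` and `general_health` are hypothetical and are not taken from the BRFSS file.

```python
import pandas as pd

# Hypothetical toy data; not taken from the BRFSS file
toy = pd.DataFrame({
    "smoker": ["Yes", "No", None, "No", "No"],
    "general_health": ["Good", "Poor", "Good", "Poor", "Poor"],
})

# 1. Remove invalid data: drop missing values, then exact duplicate rows
toy_clean = toy.dropna().drop_duplicates()

# 2. Map binary answers to standardized values
toy_clean = toy_clean.assign(smoker=toy_clean["smoker"].map({"Yes": 1, "No": 0}))

# 3. One-hot encode the non-binary categorical column
toy_encoded = pd.get_dummies(toy_clean, columns=["general_health"])

print(toy_encoded)
```

Dropping missing values before duplicates means rows that differ only by a missing answer are never compared with each other.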
    
After the data processing steps, there are **220,411** records and **24** features left that are related to **heart disease and mental health** which will be used for further analysis. The following is the complete list of the variables used together with their descriptions and data types:

Variable Name | Description | Data Type
| --- | --- | --- |
`mental_health` | Number of days mental health not good | Integer
`physical_health` | Number of days physical health not good | Integer
`high_bp` | Adults who have been told they have high blood pressure | String
`high_chol` | Adults who have been told they have high cholesterol | String
`chol_check` | Adults who have had their cholesterol checked within the past five years | String
`bmi_category` | Body mass index categories | String
`smoker` | Adults who are current smokers | String
`stroke` | Ever diagnosed with a stroke | String
`diabetes_category` | Adults who have been told they have diabetes | String
`physical_activity` | Adults who reported doing physical activity | String
`fruits` | Consume fruit 1 or more times per day  | String
`veggies` | Consume vegetables 1 or more times per day | String
`alcohol_consumption` | Heavy drinkers of alcohol | String
`health_insurance` | Adults who had some form of health insurance | String
`no_doc_cost` | Adults who could not afford to see a doctor  | String
`general_health` | General health | String
`difficult_walk` | Adults who have difficulty in walking or climbing stairs | String
`depressive_disorder` | Adults who have been told they have depressive disorder | String
`mh_status` | 3 level not good mental health status: 0 days, 1-13 days, 14-30 days | String
`sex` | Biological sex | String
`age_grp` | Reported age in five-year age categories | String
`educ_grp` | Highest grade or year of school completed | String
`income_grp` | Income level | String

In [1]:
# General libraries
import warnings
import pandas as pd
from tqdm import tqdm
warnings.filterwarnings("ignore")

## Data Handling

### Convert .XPT to .CSV

In [2]:
# df = pd.read_sas('LLCP2021.XPT', encoding='latin-1')
# df.head()

In [3]:
# df.shape

In [4]:
# df.to_csv("2021_brfss.csv", index=False)

### Load and Read the BRFSS 2021 CSV file

In [5]:
# Load and read the data
data = pd.read_csv('../data/2021_brfss.csv')
data.head(2)

Unnamed: 0,_STATE,FMONTH,IDATE,IMONTH,IDAY,IYEAR,DISPCODE,SEQNO,_PSU,CTELENM1,...,_FRTRES1,_VEGRES1,_FRUTSU1,_VEGESU1,_FRTLT1A,_VEGLT1A,_FRT16A,_VEG23A,_FRUITE1,_VEGETE1
0,1.0,1.0,1192021,1,19,2021,1100.0,2021000001,2021000000.0,1.0,...,1.0,1.0,100.0,214.0,1.0,1.0,1.0,1.0,5.397605e-79,5.397605e-79
1,1.0,1.0,1212021,1,21,2021,1100.0,2021000002,2021000000.0,1.0,...,1.0,1.0,100.0,128.0,1.0,1.0,1.0,1.0,5.397605e-79,5.397605e-79


In [6]:
# Check the shape
data.shape

(438693, 303)

In [7]:
# Check that the data loaded in is in the correct format
pd.set_option('display.max_columns', 500)
data.head(2)

Unnamed: 0,_STATE,FMONTH,IDATE,IMONTH,IDAY,IYEAR,DISPCODE,SEQNO,_PSU,CTELENM1,PVTRESD1,COLGHOUS,STATERE1,CELPHON1,LADULT1,COLGSEX,NUMADULT,LANDSEX,NUMMEN,NUMWOMEN,RESPSLCT,SAFETIME,CTELNUM1,CELLFON5,CADULT1,CELLSEX,PVTRESD3,CCLGHOUS,CSTATE1,LANDLINE,HHADULT,SEXVAR,GENHLTH,PHYSHLTH,MENTHLTH,POORHLTH,PRIMINSR,PERSDOC3,MEDCOST1,CHECKUP1,EXERANY2,BPHIGH6,BPMEDS,CHOLCHK3,TOLDHI3,CHOLMED3,CVDINFR4,CVDCRHD4,CVDSTRK3,ASTHMA3,ASTHNOW,CHCSCNCR,CHCOCNCR,CHCCOPD3,ADDEPEV3,CHCKDNY2,DIABETE4,DIABAGE3,HAVARTH5,ARTHEXER,ARTHEDU,LMTJOIN3,ARTHDIS2,JOINPAI2,MARITAL,EDUCA,RENTHOM1,NUMHHOL3,NUMPHON3,CPDEMO1B,VETERAN3,EMPLOY1,CHILDREN,INCOME3,PREGNANT,WEIGHT2,HEIGHT3,DEAF,BLIND,DECIDE,DIFFWALK,DIFFDRES,DIFFALON,SMOKE100,SMOKDAY2,USENOW3,ECIGNOW1,ALCDAY5,AVEDRNK3,DRNK3GE5,MAXDRNKS,FLUSHOT7,FLSHTMY3,IMFVPLA2,PNEUVAC4,HIVTST7,HIVTSTD3,FRUIT2,FRUITJU2,FVGREEN1,FRENCHF1,POTATOE1,VEGETAB2,PDIABTST,PREDIAB1,INSULIN1,BLDSUGAR,FEETCHK3,DOCTDIAB,CHKHEMO3,FEETCHK,EYEEXAM1,DIABEYE,DIABEDU,TOLDCFS,HAVECFS,WORKCFS,TOLDHEPC,TRETHEPC,PRIRHEPC,HAVEHEPC,HAVEHEPB,MEDSHEPB,HPVADVC4,HPVADSHT,TETANUS1,SHINGLE2,LCSFIRST,LCSLAST,LCSNUMCG,LCSCTSCN,HADMAM,HOWLONG,CERVSCRN,CRVCLCNC,CRVCLPAP,CRVCLHPV,HADHYST2,PSATEST1,PSATIME1,PCPSARS2,PCSTALK,HADSIGM4,COLNSIGM,COLNTES1,SIGMTES1,LASTSIG4,COLNCNCR,VIRCOLO1,VCLNTES1,SMALSTOL,STOLTEST,STOOLDN1,BLDSTFIT,SDNATES1,CNCRDIFF,CNCRAGE,CNCRTYP1,CSRVTRT3,CSRVDOC1,CSRVSUM,CSRVRTRN,CSRVINST,CSRVINSR,CSRVDEIN,CSRVCLIN,CSRVPAIN,CSRVCTL2,HOMBPCHK,HOMRGCHK,WHEREBP,SHAREBP,WTCHSALT,DRADVISE,CIMEMLOS,CDHOUSE,CDASSIST,CDHELP,CDSOCIAL,CDDISCUS,CAREGIV1,CRGVREL4,CRGVLNG1,CRGVHRS1,CRGVPRB3,CRGVALZD,CRGVPER1,CRGVHOU1,CRGVEXPT,ACEDEPRS,ACEDRINK,ACEDRUGS,ACEPRISN,ACEDIVRC,ACEPUNCH,ACEHURT1,ACESWEAR,ACETOUCH,ACETTHEM,ACEHVSEX,ACEADSAF,ACEADNED,MARIJAN1,USEMRJN3,RSNMRJN2,LASTSMK2,STOPSMK2,FIREARM5,GUNLOAD,LOADULK2,RCSGENDR,RCSRLTN2,CASTHDX2,CASTHNO2,BIRTHSEX,SOMALE,SOFEMALE,TRNSGNDR,QSTVER,QSTLANG,_METSTAT,_URBSTAT,MSCODE,_STSTR,_STRWT,_RAWRAKE,_WT2RAKE,_IMPRACE,_CHISPNC,_CRACE1,_CPRACE1,CAGEG
,_CLLCPWT,_DUALUSE,_DUALCOR,_LLCPWT2,_LLCPWT,_RFHLTH,_PHYS14D,_MENT14D,_HLTHPLN,_HCVU652,_TOTINDA,_RFHYPE6,_CHOLCH3,_RFCHOL3,_MICHD,_LTASTH1,_CASTHM1,_ASTHMS1,_DRDXAR3,_LMTACT3,_LMTWRK3,_PRACE1,_MRACE1,_HISPANC,_RACE,_RACEG21,_RACEGR3,_RACEPRV,_SEX,_AGEG5YR,_AGE65YR,_AGE80,_AGE_G,HTIN4,HTM4,WTKG3,_BMI5,_BMI5CAT,_RFBMI5,_CHLDCNT,_EDUCAG,_INCOMG1,_SMOKER3,_RFSMOK3,_CURECI1,DRNKANY5,DROCDY3_,_RFBING5,_DRNKWK1,_RFDRHV7,_FLSHOT7,_PNEUMO3,_AIDTST4,FTJUDA2_,FRUTDA2_,GRENDA1_,FRNCHDA_,POTADA1_,VEGEDA2_,_MISFRT1,_MISVEG1,_FRTRES1,_VEGRES1,_FRUTSU1,_VEGESU1,_FRTLT1A,_VEGLT1A,_FRT16A,_VEG23A,_FRUITE1,_VEGETE1
0,1.0,1.0,1192021,1,19,2021,1100.0,2021000001,2021000000.0,1.0,1.0,,1.0,2.0,1.0,,2.0,,1.0,1.0,2.0,,,,,,,,,,,2.0,5.0,20.0,10.0,88.0,3.0,1.0,2.0,2.0,2.0,3.0,,2.0,1.0,1.0,2.0,2.0,2.0,1.0,1.0,2.0,2.0,1.0,2.0,2.0,3.0,,1.0,2.0,2.0,2.0,1.0,8.0,1.0,4.0,1.0,1.0,1.0,1.0,2.0,7.0,88.0,5.0,,72.0,411.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,3.0,3.0,3.0,888.0,,,,1.0,92020.0,1.0,1.0,2.0,,101.0,555.0,204.0,203.0,201.0,101.0,2.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,,,,,,,2.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,5.0,3.0,,,,,,,,,,,,,,,,,10.0,1.0,1.0,1.0,1.0,11011.0,39.766158,2.0,79.532315,1.0,9.0,,,,,1.0,0.519019,874.242902,744.745531,2.0,3.0,2.0,1.0,9.0,2.0,1.0,1.0,2.0,2.0,2.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,2.0,11.0,2.0,70.0,6.0,59.0,150.0,3266.0,1454.0,1.0,1.0,1.0,2.0,3.0,3.0,1.0,1.0,2.0,5.397605e-79,1.0,5.397605e-79,1.0,1.0,1.0,2.0,5.397605e-79,100.0,57.0,43.0,14.0,100.0,5.397605e-79,5.397605e-79,1.0,1.0,100.0,214.0,1.0,1.0,1.0,1.0,5.397605e-79,5.397605e-79
1,1.0,1.0,1212021,1,21,2021,1100.0,2021000002,2021000000.0,1.0,1.0,,1.0,2.0,1.0,,2.0,,1.0,1.0,2.0,,,,,,,,,,,2.0,3.0,88.0,88.0,,1.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,1.0,2.0,2.0,,2.0,2.0,2.0,2.0,1.0,1.0,98.0,1.0,1.0,2.0,1.0,1.0,10.0,9.0,6.0,1.0,2.0,,1.0,2.0,8.0,88.0,77.0,,7777.0,506.0,2.0,1.0,1.0,1.0,2.0,1.0,2.0,,3.0,3.0,888.0,,,,2.0,,,2.0,2.0,,101.0,555.0,201.0,555.0,201.0,207.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,,,,,,,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,5.0,5.0,,,,,,,,,,,,,,,,,10.0,1.0,1.0,1.0,2.0,11011.0,39.766158,2.0,79.532315,2.0,9.0,,,,,1.0,0.519019,874.242902,299.137394,1.0,1.0,1.0,1.0,9.0,1.0,2.0,1.0,2.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,10.0,2.0,67.0,6.0,66.0,168.0,,,,9.0,1.0,4.0,9.0,4.0,1.0,1.0,2.0,5.397605e-79,1.0,5.397605e-79,1.0,2.0,2.0,2.0,5.397605e-79,100.0,14.0,5.397605e-79,14.0,100.0,5.397605e-79,5.397605e-79,1.0,1.0,100.0,128.0,1.0,1.0,1.0,1.0,5.397605e-79,5.397605e-79


### Select Relevant Columns

In [8]:
# Select specific columns
# .copy() makes the selection an independent DataFrame, so later column
# assignments do not act on a view of `data`
brfss_df_selected = data[['_MICHD',
                          '_RFHYPE6',
                          'TOLDHI3', '_CHOLCH3',
                          '_BMI5',
                          '_RFSMOK3',
                          'CVDSTRK3', 'DIABETE4',
                          '_TOTINDA',
                          '_FRTLT1A', '_VEGLT1A',
                          '_RFDRHV7',
                          '_HLTHPLN', 'MEDCOST1',
                          'GENHLTH', 'MENTHLTH', 'PHYSHLTH', 'DIFFWALK',
                          'ADDEPEV3', '_MENT14D',
                          '_SEX', '_AGEG5YR', 'EDUCA', 'INCOME3']].copy()

In [9]:
# Check the shape
brfss_df_selected.shape

(438693, 24)
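
Since only 24 of the 303 columns are used, the selection could also happen at load time. The sketch below assumes the `usecols` parameter of `pd.read_csv` and uses a tiny in-memory CSV with hypothetical values in place of the real file; `csv_text` and `wanted` are illustrative names.

```python
import io
import pandas as pd

# Tiny stand-in for the BRFSS CSV (hypothetical values)
csv_text = "_STATE,_MICHD,_BMI5,IDATE\n1,2,1454,1192021\n1,1,,1212021\n"

# Columns to keep; in the notebook this would be the 24 names selected above
wanted = ["_MICHD", "_BMI5"]

# usecols makes pandas parse only the requested columns
subset = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(subset.shape)
```

This keeps memory use lower than loading all columns and selecting afterwards, at the cost of having to reload the file if more columns are needed later.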

In [10]:
# Check the data
brfss_df_selected.head(2)

Unnamed: 0,_MICHD,_RFHYPE6,TOLDHI3,_CHOLCH3,_BMI5,_RFSMOK3,CVDSTRK3,DIABETE4,_TOTINDA,_FRTLT1A,_VEGLT1A,_RFDRHV7,_HLTHPLN,MEDCOST1,GENHLTH,MENTHLTH,PHYSHLTH,DIFFWALK,ADDEPEV3,_MENT14D,_SEX,_AGEG5YR,EDUCA,INCOME3
0,2.0,1.0,1.0,1.0,1454.0,1.0,2.0,3.0,2.0,1.0,1.0,1.0,1.0,2.0,5.0,10.0,20.0,2.0,2.0,2.0,2.0,11.0,4.0,5.0
1,1.0,2.0,1.0,1.0,,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,3.0,88.0,88.0,1.0,2.0,1.0,2.0,10.0,6.0,77.0


In [11]:
# Check the dataset info
brfss_df_selected.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 438693 entries, 0 to 438692
Data columns (total 24 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   _MICHD    434058 non-null  float64
 1   _RFHYPE6  438693 non-null  float64
 2   TOLDHI3   377857 non-null  float64
 3   _CHOLCH3  438693 non-null  float64
 4   _BMI5     391841 non-null  float64
 5   _RFSMOK3  438693 non-null  float64
 6   CVDSTRK3  438691 non-null  float64
 7   DIABETE4  438690 non-null  float64
 8   _TOTINDA  438693 non-null  float64
 9   _FRTLT1A  438693 non-null  float64
 10  _VEGLT1A  438693 non-null  float64
 11  _RFDRHV7  438693 non-null  float64
 12  _HLTHPLN  438693 non-null  float64
 13  MEDCOST1  438688 non-null  float64
 14  GENHLTH   438689 non-null  float64
 15  MENTHLTH  438691 non-null  float64
 16  PHYSHLTH  438690 non-null  float64
 17  DIFFWALK  420684 non-null  float64
 18  ADDEPEV3  438690 non-null  float64
 19  _MENT14D  438693 non-null  float64
 20  _SEX

In [12]:
# Check summary statistics
brfss_df_selected.describe()

Unnamed: 0,_MICHD,_RFHYPE6,TOLDHI3,_CHOLCH3,_BMI5,_RFSMOK3,CVDSTRK3,DIABETE4,_TOTINDA,_FRTLT1A,_VEGLT1A,_RFDRHV7,_HLTHPLN,MEDCOST1,GENHLTH,MENTHLTH,PHYSHLTH,DIFFWALK,ADDEPEV3,_MENT14D,_SEX,_AGEG5YR,EDUCA,INCOME3
count,434058.0,438693.0,377857.0,438693.0,391841.0,438693.0,438691.0,438690.0,438693.0,438693.0,438693.0,438693.0,438693.0,438688.0,438689.0,438691.0,438690.0,420684.0,438690.0,438693.0,438693.0,438693.0,438688.0,429846.0
mean,1.918621,1.427244,1.647766,1.71698,2855.226495,1.578063,1.978381,2.761946,1.260891,2.270561,2.257184,1.692548,1.37017,1.942994,2.524761,59.923347,63.190139,1.864055,1.837179,1.634248,1.535529,7.726016,5.035267,23.222501
std,0.273416,0.699127,0.713507,2.036583,655.194977,1.852399,0.369242,0.743411,0.557932,2.485479,2.71146,2.163298,1.566496,0.41258,1.082066,37.47268,36.222075,0.535092,0.591371,1.222564,0.498737,3.645926,1.047852,33.568376
min,1.0,1.0,1.0,1.0,1200.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,2.0,1.0,1.0,1.0,2414.0,1.0,2.0,3.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,15.0,25.0,2.0,2.0,1.0,1.0,5.0,4.0,6.0
50%,2.0,1.0,2.0,1.0,2744.0,1.0,2.0,3.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,88.0,88.0,2.0,2.0,1.0,2.0,8.0,5.0,8.0
75%,2.0,2.0,2.0,1.0,3174.0,1.0,2.0,3.0,1.0,2.0,2.0,1.0,1.0,2.0,3.0,88.0,88.0,2.0,2.0,2.0,2.0,11.0,6.0,10.0
max,2.0,9.0,9.0,9.0,9933.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,99.0,99.0,9.0,9.0,9.0,2.0,14.0,9.0,99.0


### Drop Missing Values

In [13]:
# Check for null values
brfss_df_selected.isna().sum()

_MICHD       4635
_RFHYPE6        0
TOLDHI3     60836
_CHOLCH3        0
_BMI5       46852
_RFSMOK3        0
CVDSTRK3        2
DIABETE4        3
_TOTINDA        0
_FRTLT1A        0
_VEGLT1A        0
_RFDRHV7        0
_HLTHPLN        0
MEDCOST1        5
GENHLTH         4
MENTHLTH        2
PHYSHLTH        3
DIFFWALK    18009
ADDEPEV3        3
_MENT14D        0
_SEX            0
_AGEG5YR        0
EDUCA           5
INCOME3      8847
dtype: int64

In [14]:
# Drop missing values
brfss_df_selected = brfss_df_selected.dropna()
brfss_df_selected.shape

(332527, 24)
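
`dropna()` removes every row that has at least one missing value, so a few sparsely answered columns can account for most of the loss (here 106,166 of 438,693 rows). The sketch below, on hypothetical toy data, shows one way to rank columns by their share of missing values before dropping; the frame `toy` is illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for brfss_df_selected
toy = pd.DataFrame({
    "TOLDHI3": [1.0, np.nan, 2.0, np.nan],
    "_BMI5": [1454.0, 2800.0, np.nan, 3300.0],
    "_SEX": [1.0, 2.0, 2.0, 1.0],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Rows that survive a listwise drop
rows_before = len(toy)
rows_after = len(toy.dropna())
print(f"kept {rows_after} of {rows_before} rows")
```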

### Mapping

In [15]:
# _MICHD
# Change 2 to 0 because this means did not have MI or CHD
brfss_df_selected['_MICHD'] = brfss_df_selected['_MICHD'].replace({2: 0})
brfss_df_selected._MICHD.unique()

array([0., 1.])

In [16]:
#1 _RFHYPE6
# Change 1 to 0 so it represents No high blood pressure and 2 to 1 so it represents high blood pressure
brfss_df_selected['_RFHYPE6'] = brfss_df_selected['_RFHYPE6'].replace({1:0, 2:1})
brfss_df_selected = brfss_df_selected[brfss_df_selected._RFHYPE6 != 9]
brfss_df_selected._RFHYPE6.unique()

array([0., 1.])
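
The same replace-then-filter pattern recurs for each variable in this section. As a hedged sketch, it could be wrapped in a small helper; `recode` and its parameters are hypothetical names and not part of the notebook, and the toy values are made up.

```python
import pandas as pd

def recode(df: pd.DataFrame, column: str, mapping: dict, drop_codes=(9,)) -> pd.DataFrame:
    """Drop rows holding non-response codes, then recode the remaining values."""
    kept = df[~df[column].isin(list(drop_codes))].copy()
    kept[column] = kept[column].replace(mapping)
    return kept

# Hypothetical toy values using the _RFHYPE6 coding (1 = no, 2 = yes, 9 = missing)
toy = pd.DataFrame({"_RFHYPE6": [1.0, 2.0, 9.0, 2.0]})
toy = recode(toy, "_RFHYPE6", {1: 0, 2: 1}, drop_codes=(9,))
print(toy["_RFHYPE6"].tolist())
```

Filtering before recoding avoids accidentally dropping a value that the mapping has just produced, which could happen if a recoded value coincided with a non-response code.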

In [17]:
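#2 TOLDHI3
# Codebook values: 1 = Yes (told cholesterol is high), 2 = No, 7 = don't know/not sure, 9 = refused
# The mapping below therefore encodes Yes as 0 and No as 1; use replace({2: 0}) if 1 should mean high cholesterol
# Remove all 7 (don't know/not sure) and 9 (refused)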
brfss_df_selected['TOLDHI3'] = brfss_df_selected['TOLDHI3'].replace({1:0, 2:1})
brfss_df_selected = brfss_df_selected[brfss_df_selected.TOLDHI3 != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.TOLDHI3 != 9]
brfss_df_selected.TOLDHI3.unique()

array([0., 1.])

In [18]:
#3 _CHOLCH3
# Change 3 to 0 and 2 to 0 for Not checked cholesterol in past 5 years
# Remove 9
brfss_df_selected['_CHOLCH3'] = brfss_df_selected['_CHOLCH3'].replace({3:0, 2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected._CHOLCH3 != 9]
brfss_df_selected._CHOLCH3.unique()

array([1., 0.])

In [19]:
#4 _BMI5 is stored as BMI * 100 (for example, 4018 means a BMI of 40.18), so divide by 100 and round to the nearest whole number
brfss_df_selected['_BMI5'] = brfss_df_selected['_BMI5'].div(100).round(0)
brfss_df_selected._BMI5.unique()

array([15., 28., 33., 29., 24., 46., 23., 40., 27., 35., 18., 30., 25.,
       36., 22., 31., 45., 26., 14., 38., 21., 32., 20., 19., 34., 41.,
       43., 44., 39., 37., 16., 42., 50., 51., 17., 52., 47., 49., 56.,
       57., 48., 58., 61., 53., 63., 64., 54., 68., 55., 62., 13., 59.,
       89., 66., 77., 60., 87., 69., 72., 75., 67., 71., 65., 82., 86.,
       70., 78., 12., 74., 98., 73., 84., 76., 80., 83., 79., 99., 88.,
       81., 90., 92., 91., 95., 85., 94.])

In [20]:
#5 _RFSMOK3
# Change 1 to 0 (not a current smoker) and 2 to 1 (current smoker)
# Remove all 9 (don't know/refused/missing)
brfss_df_selected['_RFSMOK3'] = brfss_df_selected['_RFSMOK3'].replace({1:0, 2:1})
brfss_df_selected = brfss_df_selected[brfss_df_selected._RFSMOK3 != 9]
brfss_df_selected._RFSMOK3.unique()

array([0., 1.])

In [21]:
#6 CVDSTRK3
# Change 2 to 0 because it is No
# Remove all 7 (don't know)
# Remove all 9 (refused)
brfss_df_selected['CVDSTRK3'] = brfss_df_selected['CVDSTRK3'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.CVDSTRK3 != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.CVDSTRK3 != 9]
brfss_df_selected.CVDSTRK3.unique()

array([0., 1.])

In [22]:
#7 DIABETE4
# Make this ordinal: 0 is no diabetes or only during pregnancy, 1 is pre-diabetes or borderline diabetes, 2 is diabetes
# Remove all 7 (don't know)
# Remove all 9 (refused)
brfss_df_selected['DIABETE4'] = brfss_df_selected['DIABETE4'].replace({2:0, 3:0, 4:1, 1:2})
brfss_df_selected = brfss_df_selected[brfss_df_selected.DIABETE4 != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.DIABETE4 != 9]
brfss_df_selected.DIABETE4.unique()

array([0., 2., 1.])

In [23]:
#8 _TOTINDA
# 1 for physical activity
# change 2 to 0 for no physical activity
# Remove all 9 (don't know/refused)
brfss_df_selected['_TOTINDA'] = brfss_df_selected['_TOTINDA'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected._TOTINDA != 9]
brfss_df_selected._TOTINDA.unique()

array([0., 1.])

In [24]:
#9 _FRTLT1A
# 1 means fruit consumed one or more times per day; change 2 to 0 (fruit consumed less than once per day)
# Remove 9 (don't know/refused/missing)
brfss_df_selected['_FRTLT1A'] = brfss_df_selected['_FRTLT1A'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected._FRTLT1A != 9]
brfss_df_selected._FRTLT1A.unique()

array([1., 0.])

In [25]:
#10 _VEGLT1A
# 1 means vegetables consumed one or more times per day; change 2 to 0 (vegetables consumed less than once per day)
# Remove 9 (don't know/refused/missing)
brfss_df_selected['_VEGLT1A'] = brfss_df_selected['_VEGLT1A'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected._VEGLT1A != 9]
brfss_df_selected._VEGLT1A.unique()

array([1., 0.])

In [26]:
#11 _RFDRHV7
# Change 1 to 0 (1 was no for heavy drinking). change all 2 to 1 (2 was yes for heavy drinking)
# Remove 9 (don't know/refused/missing)
brfss_df_selected['_RFDRHV7'] = brfss_df_selected['_RFDRHV7'].replace({1:0, 2:1})
brfss_df_selected = brfss_df_selected[brfss_df_selected._RFDRHV7 != 9]
brfss_df_selected._RFDRHV7.unique()

array([0., 1.])

In [27]:
#12 _HLTHPLN
# 1 is yes, change 2 to 0 because it is No health insurance
# remove 7 and 9 for don't know or refused
brfss_df_selected['_HLTHPLN'] = brfss_df_selected['_HLTHPLN'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected._HLTHPLN != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected._HLTHPLN != 9]
brfss_df_selected._HLTHPLN.unique()

array([1., 0.])

In [28]:
#13 MEDCOST1
# Change 2 to 0 for no, 1 is already yes
# remove 7 for don't know and 9 for refused
brfss_df_selected['MEDCOST1'] = brfss_df_selected['MEDCOST1'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.MEDCOST1 != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.MEDCOST1 != 9]
brfss_df_selected.MEDCOST1.unique()

array([0., 1.])

In [29]:
#14 GENHLTH
# This is an ordinal variable that I want to keep (1 is Excellent -> 5 is Poor)
# Remove 7 and 9 for don't know and refused
brfss_df_selected = brfss_df_selected[brfss_df_selected.GENHLTH != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.GENHLTH != 9]
brfss_df_selected.GENHLTH.unique()

array([5., 2., 3., 4., 1.])

In [30]:
#15 MENTHLTH
# already in days so keep that, scale will be 0-30
# change 88 to 0 because it means none (no bad mental health days)
# remove 77 and 99 for don't know not sure and refused
brfss_df_selected['MENTHLTH'] = brfss_df_selected['MENTHLTH'].replace({88:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.MENTHLTH != 77]
brfss_df_selected = brfss_df_selected[brfss_df_selected.MENTHLTH != 99]
brfss_df_selected.MENTHLTH.unique()

array([10.,  0.,  5., 25.,  2.,  7., 30.,  3., 14., 20.,  8.,  1., 15.,
        4., 28., 24., 21., 12.,  6., 22., 27., 18., 13., 17., 16.,  9.,
       19., 29., 23., 11., 26.])

In [31]:
#16 PHYSHLTH
# already in days so keep that, scale will be 0-30
# change 88 to 0 because it means none (no bad physical health days)
# remove 77 and 99 for don't know not sure and refused
brfss_df_selected['PHYSHLTH'] = brfss_df_selected['PHYSHLTH'].replace({88:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.PHYSHLTH != 77]
brfss_df_selected = brfss_df_selected[brfss_df_selected.PHYSHLTH != 99]
brfss_df_selected.PHYSHLTH.unique()

array([20.,  0., 30., 25.,  1.,  4., 10.,  2.,  3., 15.,  8., 13., 14.,
        5.,  7.,  6., 24., 29., 18.,  9., 16., 17., 26., 28., 12., 21.,
       27., 11., 19., 22., 23.])

In [32]:
#17 DIFFWALK
# change 2 to 0 for no. 1 is already yes
# remove 7 and 9 for don't know not sure and refused
brfss_df_selected['DIFFWALK'] = brfss_df_selected['DIFFWALK'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.DIFFWALK != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.DIFFWALK != 9]
brfss_df_selected.DIFFWALK.unique()

array([0., 1.])

In [33]:
#18 _SEX
# Encode as "is the respondent male?" (a somewhat arbitrary choice, made because men are at higher risk for heart disease)
# Change 2 to 0 (female as 0). Male is 1
brfss_df_selected['_SEX'] = brfss_df_selected['_SEX'].replace({2:0})
brfss_df_selected._SEX.unique()

array([0., 1.])

In [34]:
#19 _AGEG5YR
# already ordinal. 1 is 18-24 all the way up to 13, which is 80 and older. 5 year increments.
# remove 14 because it is don't know or missing
brfss_df_selected = brfss_df_selected[brfss_df_selected._AGEG5YR != 14]
brfss_df_selected._AGEG5YR.unique()

array([11.,  9., 12., 13., 10.,  7.,  6.,  8.,  4.,  3.,  5.,  2.,  1.])

In [35]:
#20 EDUCA
# This is already an ordinal variable with 1 being never attended school or kindergarten only up to 6 being college 4 years or more
# Scale here is 1-6
# Remove 9 for refused:
brfss_df_selected = brfss_df_selected[brfss_df_selected.EDUCA != 9]
brfss_df_selected.EDUCA.unique()

array([4., 3., 5., 6., 2., 1.])

In [36]:
#21 INCOME3
# Variable is already ordinal with 1 being less than $10,000 all the way up to 11 being $200,000 or more
# Remove 77 and 99 for don't know and refused
brfss_df_selected = brfss_df_selected[brfss_df_selected.INCOME3 != 77]
brfss_df_selected = brfss_df_selected[brfss_df_selected.INCOME3 != 99]
brfss_df_selected.INCOME3.unique()

array([ 5.,  3.,  7.,  4.,  6.,  8.,  2.,  9., 10.,  1., 11.])

In [37]:
#22 ADDEPEV3
# Change 2 to 0 (2 was no for depressive disorder)
# Remove 7 (don't know) and 9 (refused)
brfss_df_selected['ADDEPEV3'] = brfss_df_selected['ADDEPEV3'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.ADDEPEV3 != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.ADDEPEV3 != 9]
brfss_df_selected.ADDEPEV3.unique()

array([0., 1.])

In [38]:
#23 _MENT14D 
# This is already an ordinal variable
# Scale here is 1-3
# Remove 9 for refused:
brfss_df_selected = brfss_df_selected[brfss_df_selected._MENT14D != 9]
brfss_df_selected._MENT14D.unique()

array([2., 1., 3.])
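
`_MENT14D` is the calculated three-level version of `MENTHLTH` (0 days, 1-13 days, 14 or more days), so the two columns can be cross-checked after recoding. A minimal sketch, assuming the recoded 0-30 day scale from the `MENTHLTH` cell and using hypothetical values; `days_to_level` is an illustrative helper, not part of the notebook.

```python
import pandas as pd

def days_to_level(days: pd.Series) -> pd.Series:
    """Bin 0-30 'not good' days into levels 1 (0 days), 2 (1-13), 3 (14-30)."""
    binned = pd.cut(days, bins=[-1, 0, 13, 30], labels=[1, 2, 3])
    return binned.astype(int)

# Hypothetical recoded values (88 already replaced by 0)
toy = pd.DataFrame({"MENTHLTH": [0, 1, 13, 14, 30], "_MENT14D": [1, 2, 2, 3, 3]})

derived = days_to_level(toy["MENTHLTH"])
mismatches = int((derived != toy["_MENT14D"]).sum())
print(mismatches)
```

A non-zero mismatch count on the real data would suggest that one of the two columns was recoded or filtered inconsistently.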

In [39]:
# Print the column names
brfss_df_selected.columns

Index(['_MICHD', '_RFHYPE6', 'TOLDHI3', '_CHOLCH3', '_BMI5', '_RFSMOK3',
       'CVDSTRK3', 'DIABETE4', '_TOTINDA', '_FRTLT1A', '_VEGLT1A', '_RFDRHV7',
       '_HLTHPLN', 'MEDCOST1', 'GENHLTH', 'MENTHLTH', 'PHYSHLTH', 'DIFFWALK',
       'ADDEPEV3', '_MENT14D', '_SEX', '_AGEG5YR', 'EDUCA', 'INCOME3'],
      dtype='object')

In [40]:
# Rename the columns to make them more readable
brfss = brfss_df_selected.rename(columns = {'_MICHD':'HeartDiseaseorAttack', 
                                         '_RFHYPE6':'HighBP',  
                                         'TOLDHI3':'HighChol', '_CHOLCH3':'CholCheck', 
                                         '_BMI5':'BMI', 
                                         '_RFSMOK3':'Smoker', 
                                         'CVDSTRK3':'Stroke', 'DIABETE4':'Diabetes', 
                                         '_TOTINDA':'PhysActivity', 
                                         '_FRTLT1A':'Fruits', '_VEGLT1A':"Veggies", 
                                         '_RFDRHV7':'HvyAlcoholConsump', 
                                         '_HLTHPLN':'AnyHealthInsurance', 'MEDCOST1':'NoDocbcCost', 
                                         'GENHLTH':'GenHlth', 'MENTHLTH':'MentHlth', 'PHYSHLTH':'PhysHlth', 'DIFFWALK':'DiffWalk',
                                         'ADDEPEV3': 'DepressiveDisorder', '_MENT14D': 'MentHlthStatus',
                                         '_SEX':'Sex', '_AGEG5YR':'Age', 'EDUCA':'Education', 'INCOME3':'Income' })

In [41]:
# Check for null values
brfss.isna().sum()

HeartDiseaseorAttack    0
HighBP                  0
HighChol                0
CholCheck               0
BMI                     0
Smoker                  0
Stroke                  0
Diabetes                0
PhysActivity            0
Fruits                  0
Veggies                 0
HvyAlcoholConsump       0
AnyHealthInsurance      0
NoDocbcCost             0
GenHlth                 0
MentHlth                0
PhysHlth                0
DiffWalk                0
DepressiveDisorder      0
MentHlthStatus          0
Sex                     0
Age                     0
Education               0
Income                  0
dtype: int64

In [42]:
# Check the dataset info
brfss.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 235718 entries, 0 to 438692
Data columns (total 24 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   HeartDiseaseorAttack  235718 non-null  float64
 1   HighBP                235718 non-null  float64
 2   HighChol              235718 non-null  float64
 3   CholCheck             235718 non-null  float64
 4   BMI                   235718 non-null  float64
 5   Smoker                235718 non-null  float64
 6   Stroke                235718 non-null  float64
 7   Diabetes              235718 non-null  float64
 8   PhysActivity          235718 non-null  float64
 9   Fruits                235718 non-null  float64
 10  Veggies               235718 non-null  float64
 11  HvyAlcoholConsump     235718 non-null  float64
 12  AnyHealthInsurance    235718 non-null  float64
 13  NoDocbcCost           235718 non-null  float64
 14  GenHlth               235718 non-null  float64
 15  

In [43]:
# Check summary statistics
brfss.describe()

Unnamed: 0,HeartDiseaseorAttack,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,Diabetes,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,AnyHealthInsurance,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,DepressiveDisorder,MentHlthStatus,Sex,Age,Education,Income
count,235718.0,235718.0,235718.0,235718.0,235718.0,235718.0,235718.0,235718.0,235718.0,235718.0,235718.0,235718.0,235718.0,235718.0,235718.0,235718.0,235718.0,235718.0,235718.0,235718.0,235718.0,235718.0,235718.0,235718.0
mean,0.086514,0.418449,0.59796,0.963418,28.952787,0.122795,0.038877,0.30766,0.779431,0.621399,0.827888,0.062053,0.962629,0.063504,2.479556,3.924083,3.742926,0.153645,0.20452,1.496055,0.477796,7.865572,5.139421,6.929424
std,0.281123,0.493306,0.490311,0.187733,6.550623,0.328202,0.193302,0.70497,0.414631,0.485039,0.377479,0.241252,0.189669,0.243868,1.028736,7.873229,8.236664,0.360609,0.403351,0.699923,0.499508,3.236053,0.946034,2.37484
min,0.0,0.0,0.0,0.0,12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0
25%,0.0,0.0,0.0,1.0,24.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,5.0,4.0,5.0
50%,0.0,0.0,1.0,1.0,28.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,8.0,5.0,7.0
75%,0.0,1.0,1.0,1.0,32.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,3.0,4.0,2.0,0.0,0.0,2.0,1.0,10.0,6.0,9.0
max,1.0,1.0,1.0,1.0,99.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,5.0,30.0,30.0,1.0,1.0,3.0,1.0,13.0,6.0,11.0


In [44]:
# Check the shape
brfss.shape

(235718, 24)

In [45]:
# Check the data
brfss.head(2)

Unnamed: 0,HeartDiseaseorAttack,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,Diabetes,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,AnyHealthInsurance,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,DepressiveDisorder,MentHlthStatus,Sex,Age,Education,Income
0,0.0,0.0,0.0,1.0,15.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,5.0,10.0,20.0,0.0,0.0,2.0,0.0,11.0,4.0,5.0
2,1.0,1.0,1.0,1.0,28.0,0.0,0.0,2.0,0.0,1.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,11.0,4.0,3.0


In [46]:
# Check how many respondents have had heart disease or a heart attack
brfss.groupby(['HeartDiseaseorAttack']).size()

HeartDiseaseorAttack
0.0    215325
1.0     20393
dtype: int64

In [47]:
# Save the mapping to CSV
brfss.to_csv('../data/2021_brfss_heart_disease_health_indicators.csv', sep=",", index=False)

## Data Transformation

### Load and Read the Heart Disease Health Indicators data

In [48]:
# Load and read the data
heart_data = pd.read_csv('../data/2021_brfss_heart_disease_health_indicators.csv')
heart_data.head(2)

Unnamed: 0,HeartDiseaseorAttack,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,Diabetes,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,AnyHealthInsurance,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,DepressiveDisorder,MentHlthStatus,Sex,Age,Education,Income
0,0.0,0.0,0.0,1.0,15.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,5.0,10.0,20.0,0.0,0.0,2.0,0.0,11.0,4.0,5.0
1,1.0,1.0,1.0,1.0,28.0,0.0,0.0,2.0,0.0,1.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,11.0,4.0,3.0


In [49]:
# Rename the HeartDiseaseorAttack column to target
heart_data = heart_data.rename(columns = {"HeartDiseaseorAttack": "target"})
heart_data.head(2)

Unnamed: 0,target,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,Diabetes,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,AnyHealthInsurance,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,DepressiveDisorder,MentHlthStatus,Sex,Age,Education,Income
0,0.0,0.0,0.0,1.0,15.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,5.0,10.0,20.0,0.0,0.0,2.0,0.0,11.0,4.0,5.0
1,1.0,1.0,1.0,1.0,28.0,0.0,0.0,2.0,0.0,1.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,11.0,4.0,3.0


In [50]:
# Get the shape
heart_data.shape

(235718, 24)

### Remove the Duplicate Data (Unmapped)

In [51]:
# Make a copy of the dataframe
heart_data_nodup = heart_data.copy()
heart_data_nodup.head(2)

Unnamed: 0,target,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,Diabetes,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,AnyHealthInsurance,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,DepressiveDisorder,MentHlthStatus,Sex,Age,Education,Income
0,0.0,0.0,0.0,1.0,15.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,5.0,10.0,20.0,0.0,0.0,2.0,0.0,11.0,4.0,5.0
1,1.0,1.0,1.0,1.0,28.0,0.0,0.0,2.0,0.0,1.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,11.0,4.0,3.0


In [52]:
# Drop duplicated rows from the copied dataset
heart_data_nodup.drop_duplicates(inplace=True)

In [53]:
# Print the shape of the new dataset to check for the number of rows
heart_data_nodup.shape

(220411, 24)
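
Because the retained columns are all coarse categorical or integer codes, distinct respondents can share identical rows; `drop_duplicates()` keeps the first row of each such group. A minimal sketch on hypothetical data showing how the duplicates could be counted before they are dropped:

```python
import pandas as pd

# Hypothetical toy rows; two respondents share identical answers
toy = pd.DataFrame({
    "target": [0.0, 0.0, 1.0, 0.0],
    "HighBP": [0.0, 0.0, 1.0, 1.0],
})

# Rows flagged as repeats of an earlier row (these are what drop_duplicates removes)
n_removed = int(toy.duplicated().sum())

# All rows that belong to a duplicated group, including the first occurrence
n_in_groups = int(toy.duplicated(keep=False).sum())

deduped = toy.drop_duplicates()
print(n_removed, n_in_groups, deduped.shape)
```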

In [54]:
# Check how many respondents have had heart disease or a heart attack
heart_data_nodup.groupby(['target']).size()

target
0.0    200163
1.0     20248
dtype: int64

In [55]:
# Get the percentage of respondents with and without heart disease in the deduplicated data
without_disease_nd = (heart_data_nodup.groupby(['target']).size()[0] / heart_data_nodup.shape[0])*100
with_disease_nd = (heart_data_nodup.groupby(['target']).size()[1] / heart_data_nodup.shape[0])*100

print(f"With Disease: {with_disease_nd}")
print(f"Without Disease: {without_disease_nd}")

With Disease: 9.18647435926519
Without Disease: 90.81352564073481
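
The same class shares can be obtained in one step. This is a sketch on a hypothetical toy column, assuming `Series.value_counts` with `normalize=True`; `class_pct` is an illustrative name.

```python
import pandas as pd

# Hypothetical toy target column (1 = heart disease or attack, 0 = none)
toy = pd.DataFrame({"target": [0.0, 0.0, 0.0, 1.0]})

# Class shares in percent, indexed by class label
class_pct = toy["target"].value_counts(normalize=True).mul(100)
print(class_pct.round(2))
```

Looking the shares up by label (`class_pct.loc[1.0]`) avoids relying on the order of the grouped result.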


In [56]:
# BMI category function (thresholds at 18.5, 25, 30, 35 and 40)
def bmi_cat(bmi):
    if bmi<18.5:
        return "Underweight"
    elif bmi<25:
        return "Normal"
    elif bmi<30:
        return "Overweight"
    elif bmi<35:
        return "Obese 1"
    elif bmi<40:
        return "Obese 2"
    else:
        return "Obese 3"
    
heart_data_nodup['bmi_category']=heart_data_nodup['BMI'].apply(bmi_cat)

In [57]:
heart_data_nodup.head(3)

Unnamed: 0,target,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,Diabetes,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,AnyHealthInsurance,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,DepressiveDisorder,MentHlthStatus,Sex,Age,Education,Income,bmi_category
0,0.0,0.0,0.0,1.0,15.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,5.0,10.0,20.0,0.0,0.0,2.0,0.0,11.0,4.0,5.0,Underweight
1,1.0,1.0,1.0,1.0,28.0,0.0,0.0,2.0,0.0,1.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,11.0,4.0,3.0,Overweight
2,0.0,1.0,0.0,1.0,33.0,0.0,0.0,2.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0,10.0,0.0,0.0,0.0,2.0,0.0,9.0,4.0,7.0,Obese 1
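
A vectorized alternative to applying `bmi_cat` row by row is `pd.cut`. The sketch below assumes the same thresholds as `bmi_cat` and uses hypothetical BMI values; `bins`, `labels` and `toy_bmi` are illustrative names.

```python
import numpy as np
import pandas as pd

# Same thresholds as bmi_cat; right=False makes each bin [lower, upper)
bins = [-np.inf, 18.5, 25, 30, 35, 40, np.inf]
labels = ["Underweight", "Normal", "Overweight", "Obese 1", "Obese 2", "Obese 3"]

# Hypothetical BMI values, including boundary cases
toy_bmi = pd.Series([15.0, 24.0, 25.0, 33.0, 40.0])
bmi_category = pd.cut(toy_bmi, bins=bins, labels=labels, right=False)
print(bmi_category.tolist())
```

The result is an ordered categorical; calling `.astype(str)` on it would give plain strings like those produced by `bmi_cat`.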


In [58]:
# Diabetes category function
def diabetes_cat(diabetes):
    if diabetes==0:
        return "No or During Pregnancy"
    elif diabetes==1:
        return "Pre-Diabetes/Borderline"
    else:
        return "Existing Diabetes"
    
heart_data_nodup['diabetes_category']=heart_data_nodup['Diabetes'].apply(diabetes_cat)

In [59]:
# General health
heart_data_nodup['general_health']=heart_data_nodup['GenHlth'].map({
    1:'Excellent',
    2:'Very good',
    3: 'Good',
    4: 'Fair',
    5: 'Poor'
})

In [60]:
# Mental health status
heart_data_nodup['mh_status']=heart_data_nodup['MentHlthStatus'].map({
    1:'0 days not good',
    2:'1-13 days not good',
    3:'14+ days not good'
})

In [61]:
# Age group
heart_data_nodup['age_grp']=heart_data_nodup['Age'].map({
    1:'18-24',
    2:'25-29',
    3:'30-34',
    4:'35-39',
    5:'40-44',
    6:'45-49',
    7:'50-54',
    8:'55-59',
    9:'60-64',
    10:'65-69',
    11:'70-74',
    12:'75-79',
    13:'80 and up'
})

In [62]:
# Education group
heart_data_nodup['educ_grp']=heart_data_nodup['Education'].map({
    1:'Never attended school / Only Kindergarten',
    2:'Grades 1-8 (Elementary)',
    3:'Grades 9-11 (Some High School)',
    4:'Grade 12 or GED (High School Graduate)',
    5:'College 1-3 (Some College/Technical School)',
    6:'College Graduate'
})

In [63]:
# Income group
heart_data_nodup['income_grp']=heart_data_nodup['Income'].map({
    1:'Less than 10k USD',
    2:'10k-15k USD',
    3:'15k-20k USD',
    4:'20k-25k USD',
    5:'25k-35k USD',
    6:'35k-50k USD',
    7:'50k-75k USD',
    8:'75k-100k USD',
    9:'100k-150k USD',
    10:'150k-200k USD',
    11: '200k USD or more'
})

In [64]:
heart_data_nodup.head(3)

Unnamed: 0,target,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,Diabetes,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,AnyHealthInsurance,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,DepressiveDisorder,MentHlthStatus,Sex,Age,Education,Income,bmi_category,diabetes_category,general_health,mh_status,age_grp,educ_grp,income_grp
0,0.0,0.0,0.0,1.0,15.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,5.0,10.0,20.0,0.0,0.0,2.0,0.0,11.0,4.0,5.0,Underweight,No or During Pregnancy,Poor,1-13 days not good,70-74,Grade 12 or GED (High School Graduate),25k-35k USD
1,1.0,1.0,1.0,1.0,28.0,0.0,0.0,2.0,0.0,1.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,11.0,4.0,3.0,Overweight,Existing Diabetes,Very good,0 days not good,70-74,Grade 12 or GED (High School Graduate),15k-20k USD
2,0.0,1.0,0.0,1.0,33.0,0.0,0.0,2.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0,10.0,0.0,0.0,0.0,2.0,0.0,9.0,4.0,7.0,Obese 1,Existing Diabetes,Very good,1-13 days not good,60-64,Grade 12 or GED (High School Graduate),50k-75k USD


### One-Hot Encoding

In [65]:
to_drop_ohe = ['BMI', 'Diabetes', 'GenHlth', 'MentHlthStatus', 'Age', 'Education', 'Income']

# Get list of categorical columns
cat_cols = heart_data_nodup.select_dtypes(include=['object']).columns.tolist()

# Perform one-hot encoding on categorical variables
one_hot_encoded_dfs = []
for col in tqdm(cat_cols):
    dummies = pd.get_dummies(heart_data_nodup[col], prefix=col, drop_first=False)
    one_hot_encoded_dfs.append(dummies)
    to_drop_ohe.append(col)

# Concatenate all the dataframes with one-hot encoded columns
heart_data_clean_ohe = pd.concat([heart_data_nodup] + one_hot_encoded_dfs, axis=1)

# Drop the original categorical columns
heart_data_clean_ohe = heart_data_clean_ohe.drop(to_drop_ohe, axis=1)

100%|██████████| 7/7 [00:00<00:00, 53.39it/s]
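The loop above can also be collapsed into one `pd.get_dummies` call by passing the `columns` argument, which encodes the listed columns and drops their originals in the same step. A minimal sketch on a toy frame follows (the `toy` dataframe and the explicit `cat_cols` list are assumptions for illustration); `dtype=int` is passed so the dummies are 0/1 integers as in the notebook output, since newer pandas versions default to booleans.

```python
import pandas as pd

# Toy frame with one numeric and two string columns (illustrative values only)
toy = pd.DataFrame({
    "target": [0.0, 1.0, 0.0],
    "bmi_category": ["Normal", "Obese 1", "Normal"],
    "age_grp": ["18-24", "70-74", "70-74"],
})

cat_cols = ["bmi_category", "age_grp"]

# Encodes only cat_cols; other columns pass through unchanged
toy_ohe = pd.get_dummies(toy, columns=cat_cols, drop_first=False, dtype=int)
print(toy_ohe.columns.tolist())
```

In the notebook, the numeric source columns (`BMI`, `Diabetes`, and so on) would still need to be dropped separately, because they are not part of `cat_cols`.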


In [66]:
# View dataset info
heart_data_clean_ohe.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 220411 entries, 0 to 235717
Data columns (total 64 columns):
 #   Column                                                Non-Null Count   Dtype  
---  ------                                                --------------   -----  
 0   target                                                220411 non-null  float64
 1   HighBP                                                220411 non-null  float64
 2   HighChol                                              220411 non-null  float64
 3   CholCheck                                             220411 non-null  float64
 4   Smoker                                                220411 non-null  float64
 5   Stroke                                                220411 non-null  float64
 6   PhysActivity                                          220411 non-null  float64
 7   Fruits                                                220411 non-null  float64
 8   Veggies                                     

In [67]:
heart_data_clean_ohe.head(2)

Unnamed: 0,target,HighBP,HighChol,CholCheck,Smoker,Stroke,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,AnyHealthInsurance,NoDocbcCost,MentHlth,PhysHlth,DiffWalk,DepressiveDisorder,Sex,bmi_category_Normal,bmi_category_Obese 1,bmi_category_Obese 2,bmi_category_Obese 3,bmi_category_Overweight,bmi_category_Underweight,diabetes_category_Existing Diabetes,diabetes_category_No or During Pregnancy,diabetes_category_Pre-Diabetes/Borderline,general_health_Excellent,general_health_Fair,general_health_Good,general_health_Poor,general_health_Very good,mh_status_0 days not good,mh_status_1-13 days not good,mh_status_14+ days not good,age_grp_18-24,age_grp_25-29,age_grp_30-34,age_grp_35-39,age_grp_40-44,age_grp_45-49,age_grp_50-54,age_grp_55-59,age_grp_60-64,age_grp_65-69,age_grp_70-74,age_grp_75-79,age_grp_80 and up,educ_grp_College 1-3 (Some College/Technical School),educ_grp_College Graduate,educ_grp_Grade 12 or GED (High School Graduate),educ_grp_Grades 1-8 (Elementary),educ_grp_Grades 9-11 (Some High School),educ_grp_Never attended school / Only Kindergarten,income_grp_100k-150k USD,income_grp_10k-15k USD,income_grp_150k-200k USD,income_grp_15k-20k USD,income_grp_200k USD or more,income_grp_20k-25k USD,income_grp_25k-35k USD,income_grp_35k-50k USD,income_grp_50k-75k USD,income_grp_75k-100k USD,income_grp_Less than 10k USD
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,10.0,20.0,0.0,0.0,0.0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
1,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0


In [68]:
# Save the unmapped one-hot encoded dataset with no duplicates
heart_data_clean_ohe.to_csv('../data/2021_brfss_ohe_heart_disease_health_indicators.csv', index=False)

### Mapping for Visualization Purposes

In [69]:
# Make a copy of the dataframe
heart_data_trans = heart_data.copy()
heart_data_trans.head()

Unnamed: 0,target,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,Diabetes,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,AnyHealthInsurance,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,DepressiveDisorder,MentHlthStatus,Sex,Age,Education,Income
0,0.0,0.0,0.0,1.0,15.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,5.0,10.0,20.0,0.0,0.0,2.0,0.0,11.0,4.0,5.0
1,1.0,1.0,1.0,1.0,28.0,0.0,0.0,2.0,0.0,1.0,0.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,11.0,4.0,3.0
2,0.0,1.0,0.0,1.0,33.0,0.0,0.0,2.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0,10.0,0.0,0.0,0.0,2.0,0.0,9.0,4.0,7.0
3,1.0,0.0,0.0,1.0,29.0,0.0,1.0,2.0,1.0,1.0,1.0,0.0,1.0,0.0,5.0,0.0,30.0,1.0,0.0,1.0,1.0,12.0,3.0,4.0
4,0.0,0.0,1.0,1.0,24.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,3.0,0.0,0.0,1.0,0.0,1.0,1.0,13.0,5.0,6.0


In [70]:
# High BP
heart_data_trans['high_bp']=heart_data_trans['HighBP'].map({
    0:'No',
    1:'Yes'
})

In [71]:
# High cholesterol
heart_data_trans['high_chol']=heart_data_trans['HighChol'].map({
    0:'No',
    1:'Yes'
})

In [72]:
# Cholesterol check
heart_data_trans['chol_check']=heart_data_trans['CholCheck'].map({
    0:'No',
    1:'Yes'
})

In [73]:
# BMI category function
def bmi_cat(bmi):
    if bmi<18.5:
        return "Underweight"
    elif bmi<25:
        return "Normal"
    elif bmi<30:
        return "Overweight"
    elif bmi<35:
        return "Obese 1"
    elif bmi<40:
        return "Obese 2"
    else:
        return "Obese 3"
    
heart_data_trans['bmi_category']=heart_data_trans['BMI'].apply(bmi_cat)

In [74]:
# Smoker
heart_data_trans['smoker']=heart_data_trans['Smoker'].map({
    0:'No',
    1:'Yes'
})

In [75]:
# Stroke
heart_data_trans['stroke']=heart_data_trans['Stroke'].map({
    0:'No',
    1:'Yes'
})

In [76]:
# Diabetes category function
def diabetes_cat(diabetes):
    if diabetes==0:
        return "No or During Pregnancy"
    elif diabetes==1:
        return "Pre-Diabetes/Borderline"
    else:
        return "Existing Diabetes"
    
heart_data_trans['diabetes_category']=heart_data_trans['Diabetes'].apply(diabetes_cat)

In [77]:
# Physical activity
heart_data_trans['physical_activity']=heart_data_trans['PhysActivity'].map({
    0:'No',
    1:'Yes'
})

In [78]:
# Fruit consumption
heart_data_trans['fruits']=heart_data_trans['Fruits'].map({
    0:'No',
    1:'Yes'
})

In [79]:
# Vegetable consumption
heart_data_trans['veggies']=heart_data_trans['Veggies'].map({
    0:'No',
    1:'Yes'
})

In [80]:
# Heavy alcohol consumption
heart_data_trans['alcohol_consumption']=heart_data_trans['HvyAlcoholConsump'].map({
    0:'No',
    1:'Yes'
})

In [81]:
# Health insurance
heart_data_trans['health_insurance']=heart_data_trans['AnyHealthInsurance'].map({
    0:'No',
    1:'Yes'
})

In [82]:
# Could not afford to see a doctor
heart_data_trans['no_doc_cost']=heart_data_trans['NoDocbcCost'].map({
    0:'No',
    1:'Yes'
})

In [83]:
# General health
heart_data_trans['general_health']=heart_data_trans['GenHlth'].map({
    1:'Excellent',
    2:'Very good',
    3: 'Good',
    4: 'Fair',
    5: 'Poor'
})

In [84]:
# Difficulty walking
heart_data_trans['difficult_walk']=heart_data_trans['DiffWalk'].map({
    0:'No',
    1:'Yes'
})

In [85]:
# Depressive disorder
heart_data_trans['depressive_disorder']=heart_data_trans['DepressiveDisorder'].map({
    0:'No',
    1:'Yes'
})

In [86]:
# Mental health status
heart_data_trans['mh_status']=heart_data_trans['MentHlthStatus'].map({
    1:'0 days not good',
    2:'1-13 days not good',
    3:'14+ days not good'
})

In [87]:
# Sex
heart_data_trans['sex']=heart_data_trans['Sex'].map({
    0:'Female',
    1:'Male'
})

In [88]:
# Age group
heart_data_trans['age_grp']=heart_data_trans['Age'].map({
    1:'18-24',
    2:'25-29',
    3:'30-34',
    4:'35-39',
    5:'40-44',
    6:'45-49',
    7:'50-54',
    8:'55-59',
    9:'60-64',
    10:'65-69',
    11:'70-74',
    12:'75-79',
    13:'80 and up'
})

In [89]:
# Education group
heart_data_trans['educ_grp']=heart_data_trans['Education'].map({
    1:'Never attended school / Only Kindergarten',
    2:'Grades 1-8 (Elementary)',
    3:'Grades 9-11 (Some High School)',
    4:'Grade 12 or GED (High School Graduate)',
    5:'College 1-3 (Some College/Technical School)',
    6:'College Graduate'
})

In [90]:
# Income group
heart_data_trans['income_grp']=heart_data_trans['Income'].map({
    1:'Less than 10k USD',
    2:'10k-15k USD',
    3:'15k-20k USD',
    4:'20k-25k USD',
    5:'25k-35k USD',
    6:'35k-50k USD',
    7:'50k-75k USD',
    8:'75k-100k USD',
    9:'100k-150k USD',
    10:'150k-200k USD',
    11: '200k USD or more'
})
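Most of the mappings above share the same `0 -> 'No'`, `1 -> 'Yes'` dictionary, so they could be driven from a single lookup table. The sketch below shows the idea on a toy frame with two columns; the `toy` dataframe, `yes_no`, and `binary_cols` are hypothetical names introduced for illustration, and the full column list would need to be filled in from the cells above.

```python
import pandas as pd

# Toy frame with two binary indicator columns (illustrative values only)
toy = pd.DataFrame({"HighBP": [0.0, 1.0], "Smoker": [1.0, 0.0]})

yes_no = {0: "No", 1: "Yes"}

# Source column -> new readable column name
binary_cols = {"HighBP": "high_bp", "Smoker": "smoker"}

for source_col, new_col in binary_cols.items():
    toy[new_col] = toy[source_col].map(yes_no)

print(toy[["high_bp", "smoker"]])
```

Keeping the mapping in one dictionary means a label change only has to be made in one place.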

In [91]:
# Rename the columns
heart_data_trans = heart_data_trans.rename(columns={'MentHlth': 'mental_health',
                                                   'PhysHlth': 'physical_health'})

In [92]:
# Drop the original columns
to_drop = ['HighBP', 'HighChol', 'CholCheck', 'BMI', 'Smoker', 'Stroke',
       'Diabetes', 'PhysActivity', 'Fruits', 'Veggies', 'HvyAlcoholConsump',
       'AnyHealthInsurance', 'NoDocbcCost', 'GenHlth',
       'DiffWalk', 'DepressiveDisorder', 'MentHlthStatus', 'Sex', 'Age',
       'Education', 'Income']

heart_data_trans.drop(to_drop, axis=1, inplace=True)

In [93]:
# Check the data
heart_data_trans.head()

Unnamed: 0,target,mental_health,physical_health,high_bp,high_chol,chol_check,bmi_category,smoker,stroke,diabetes_category,physical_activity,fruits,veggies,alcohol_consumption,health_insurance,no_doc_cost,general_health,difficult_walk,depressive_disorder,mh_status,sex,age_grp,educ_grp,income_grp
0,0.0,10.0,20.0,No,No,Yes,Underweight,No,No,No or During Pregnancy,No,Yes,Yes,No,Yes,No,Poor,No,No,1-13 days not good,Female,70-74,Grade 12 or GED (High School Graduate),25k-35k USD
1,1.0,0.0,0.0,Yes,Yes,Yes,Overweight,No,No,Existing Diabetes,No,Yes,No,No,Yes,No,Very good,No,No,0 days not good,Female,70-74,Grade 12 or GED (High School Graduate),15k-20k USD
2,0.0,10.0,0.0,Yes,No,Yes,Obese 1,No,No,Existing Diabetes,Yes,Yes,Yes,No,Yes,No,Very good,No,No,1-13 days not good,Female,60-64,Grade 12 or GED (High School Graduate),50k-75k USD
3,1.0,0.0,30.0,No,No,Yes,Overweight,No,Yes,Existing Diabetes,Yes,Yes,Yes,No,Yes,No,Poor,Yes,No,0 days not good,Male,75-79,Grades 9-11 (Some High School),20k-25k USD
4,0.0,0.0,0.0,No,Yes,Yes,Normal,No,No,No or During Pregnancy,No,No,No,No,Yes,No,Good,Yes,No,0 days not good,Male,80 and up,College 1-3 (Some College/Technical School),35k-50k USD


In [94]:
# Check the data info
heart_data_trans.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235718 entries, 0 to 235717
Data columns (total 24 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   target               235718 non-null  float64
 1   mental_health        235718 non-null  float64
 2   physical_health      235718 non-null  float64
 3   high_bp              235718 non-null  object 
 4   high_chol            235718 non-null  object 
 5   chol_check           235718 non-null  object 
 6   bmi_category         235718 non-null  object 
 7   smoker               235718 non-null  object 
 8   stroke               235718 non-null  object 
 9   diabetes_category    235718 non-null  object 
 10  physical_activity    235718 non-null  object 
 11  fruits               235718 non-null  object 
 12  veggies              235718 non-null  object 
 13  alcohol_consumption  235718 non-null  object 
 14  health_insurance     235718 non-null  object 
 15  no_doc_cost      

In [95]:
# Save the transformed dataset
# heart_data_trans.to_csv('../data/2021_brfss_trans_heart_disease_health_indicators.csv', index=False)

### Check for Duplicate Data

In [96]:
# Get the shape
heart_data_trans.shape

(235718, 24)

In [97]:
# Check how many respondents have had heart disease or a heart attack
heart_data_trans.groupby(['target']).size()

target
0.0    215325
1.0     20393
dtype: int64

In [98]:
# Get the percentage of respondents with and without heart disease before duplicates are removed
without_disease_dup = (heart_data_trans.groupby(['target']).size()[0] / heart_data_trans.shape[0])*100
with_disease_dup = (heart_data_trans.groupby(['target']).size()[1] / heart_data_trans.shape[0])*100

print(f"With Disease: {with_disease_dup}")
print(f"Without Disease: {without_disease_dup}")

With Disease: 8.651439431863498
Without Disease: 91.3485605681365


In [99]:
# Check for duplicates
duplicates = heart_data_trans.duplicated()

# Print out the number of duplicate rows
print("Number of duplicate rows: {}".format(duplicates.sum()))

Number of duplicate rows: 33364
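By default, `duplicated()` marks every repeat of a row except its first occurrence, so the count above is exactly the number of rows that `drop_duplicates()` removes (235,718 - 33,364 = 202,354, which matches the shape reported after removal). A small sketch on a toy frame (values invented for illustration) makes the relationship concrete:

```python
import pandas as pd

# Toy frame: the first three rows are identical (illustrative values only)
toy = pd.DataFrame({
    "target": [0, 0, 0, 1],
    "sex": ["Female", "Female", "Female", "Male"],
})

dup_mask = toy.duplicated()        # keep="first" is the default
n_dups = int(dup_mask.sum())       # repeats only, not the first occurrence
deduped = toy.drop_duplicates()

print(n_dups, len(deduped))
```

Because the survey identifiers are not part of these 24 columns, identical rows may well be distinct respondents with the same answers; the check only establishes that the rows are indistinguishable on the selected features.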


In [100]:
# Show the duplicate rows themselves
heart_data_trans[duplicates].head()

Unnamed: 0,target,mental_health,physical_health,high_bp,high_chol,chol_check,bmi_category,smoker,stroke,diabetes_category,physical_activity,fruits,veggies,alcohol_consumption,health_insurance,no_doc_cost,general_health,difficult_walk,depressive_disorder,mh_status,sex,age_grp,educ_grp,income_grp
1042,0.0,0.0,0.0,No,Yes,Yes,Overweight,No,No,No or During Pregnancy,Yes,No,Yes,No,Yes,No,Very good,No,No,0 days not good,Female,45-49,College Graduate,100k-150k USD
1116,0.0,0.0,0.0,No,Yes,Yes,Normal,No,No,No or During Pregnancy,Yes,No,Yes,No,Yes,No,Excellent,No,No,0 days not good,Female,40-44,College Graduate,100k-150k USD
1243,0.0,0.0,0.0,No,Yes,Yes,Overweight,Yes,No,No or During Pregnancy,Yes,Yes,Yes,No,Yes,No,Very good,No,No,0 days not good,Female,40-44,College 1-3 (Some College/Technical School),100k-150k USD
1537,0.0,0.0,0.0,No,Yes,Yes,Overweight,No,No,No or During Pregnancy,Yes,Yes,Yes,No,Yes,No,Excellent,No,No,0 days not good,Male,50-54,College Graduate,200k USD or more
1803,0.0,0.0,0.0,No,Yes,Yes,Obese 1,No,No,No or During Pregnancy,Yes,Yes,Yes,No,Yes,No,Very good,No,No,0 days not good,Male,50-54,College 1-3 (Some College/Technical School),75k-100k USD


In [101]:
# Check how many duplicate respondents have had heart disease or a heart attack
heart_data_trans[duplicates].groupby(['target']).size()

target
0.0    32894
1.0      470
dtype: int64

### Remove the Duplicate Data

In [102]:
# Make a copy of the transformed dataset
heart_data_clean = heart_data_trans.copy()

# Drop duplicated rows from the copied dataset
heart_data_clean.drop_duplicates(inplace=True)

In [103]:
# Print the shape of the new dataset to check for the number of rows
heart_data_clean.shape

(202354, 24)

In [104]:
# Check how many respondents have had heart disease or a heart attack
heart_data_clean.groupby(['target']).size()

target
0.0    182431
1.0     19923
dtype: int64

In [105]:
# Get the percentage of respondents with and without heart disease in the cleaned data
without_disease_clean = (heart_data_clean.groupby(['target']).size()[0] / heart_data_clean.shape[0])*100
with_disease_clean = (heart_data_clean.groupby(['target']).size()[1] / heart_data_clean.shape[0])*100

print(f"With Disease: {with_disease_clean}")
print(f"Without Disease: {without_disease_clean}")

With Disease: 9.845617086887335
Without Disease: 90.15438291311267


### Change Data Types

In [106]:
# Convert columns from float64 to unsigned 8-bit integers (uint8)
heart_data_clean['target'] = heart_data_clean['target'].astype('uint8')
heart_data_clean['mental_health'] = heart_data_clean['mental_health'].astype('uint8')
heart_data_clean['physical_health'] = heart_data_clean['physical_health'].astype('uint8')
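Casting floats to `uint8` is only safe when every value is a whole number between 0 and 255 and nothing is missing. The following is a hedged sketch of a guard that could precede such a cast; the `toy` dataframe and the `fits` check are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy column of day counts stored as floats (illustrative values only)
toy = pd.DataFrame({"mental_health": [0.0, 10.0, 30.0]})
col = toy["mental_health"]

limits = np.iinfo(np.uint8)  # min=0, max=255

# No missing values, no fractional part, and within the uint8 range
fits = (
    col.notna().all()
    and (col % 1 == 0).all()
    and col.between(limits.min, limits.max).all()
)

if fits:
    toy["mental_health"] = col.astype("uint8")

print(toy.dtypes)
```

For this dataset the check should pass, since the target is 0/1 and the day counts shown above do not exceed 30.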

In [107]:
# View the dataframe info
heart_data_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 202354 entries, 0 to 235717
Data columns (total 24 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   target               202354 non-null  uint8 
 1   mental_health        202354 non-null  uint8 
 2   physical_health      202354 non-null  uint8 
 3   high_bp              202354 non-null  object
 4   high_chol            202354 non-null  object
 5   chol_check           202354 non-null  object
 6   bmi_category         202354 non-null  object
 7   smoker               202354 non-null  object
 8   stroke               202354 non-null  object
 9   diabetes_category    202354 non-null  object
 10  physical_activity    202354 non-null  object
 11  fruits               202354 non-null  object
 12  veggies              202354 non-null  object
 13  alcohol_consumption  202354 non-null  object
 14  health_insurance     202354 non-null  object
 15  no_doc_cost          202354 non-nu

In [108]:
# Display the data
heart_data_clean.head(2)

Unnamed: 0,target,mental_health,physical_health,high_bp,high_chol,chol_check,bmi_category,smoker,stroke,diabetes_category,physical_activity,fruits,veggies,alcohol_consumption,health_insurance,no_doc_cost,general_health,difficult_walk,depressive_disorder,mh_status,sex,age_grp,educ_grp,income_grp
0,0,10,20,No,No,Yes,Underweight,No,No,No or During Pregnancy,No,Yes,Yes,No,Yes,No,Poor,No,No,1-13 days not good,Female,70-74,Grade 12 or GED (High School Graduate),25k-35k USD
1,1,0,0,Yes,Yes,Yes,Overweight,No,No,Existing Diabetes,No,Yes,No,No,Yes,No,Very good,No,No,0 days not good,Female,70-74,Grade 12 or GED (High School Graduate),15k-20k USD


In [109]:
# Save the cleaned dataset
heart_data_clean.to_csv('../data/2021_brfss_clean_heart_disease_health_indicators.csv', index=False)