# Diabetes Health Indicators Dataset Notebook

## Purpose
The purpose of this notebook is to clean BRFSS data into a usable format for machine learning algorithms.
The dataset originally has 330 features (columns), but based on research into the factors that influence diabetes and other chronic health conditions, only a selected subset of features is included in this analysis.

#### Important Risk Factors
Research in the field has identified the following as **important risk factors** for diabetes and other chronic illnesses like heart disease (not in strict order of importance):

*   blood pressure (high)
*   cholesterol (high)
*   smoking
*   diabetes
*   obesity
*   age
*   sex
*   race
*   diet
*   exercise
*   alcohol consumption
*   BMI
*   household income
*   marital status
*   sleep
*   time since last checkup
*   education
*   health care coverage
*   mental health

### Selected Subset of Features from BRFSS 2015
Given these risk factors, I selected features (columns/questions) in the BRFSS that relate to them. To understand what the columns mean, I consulted the BRFSS 2015 Codebook, which lists each question along with information about it, and matched the variable names in the codebook to the variable names in the dataset I downloaded from Kaggle. I also referenced some of the same features chosen by Zidian Xie et al. in their research paper *Building Risk Prediction Models for Type 2 Diabetes Using Machine Learning Techniques*, which uses the 2014 BRFSS.

**BRFSS 2015 Codebook:** https://www.cdc.gov/brfss/annual_data/2015/pdf/codebook15_llcp.pdf

**Relevant Research Paper using BRFSS for Diabetes ML:** https://www.cdc.gov/pcd/issues/2019/19_0109.htm


The **selected features** from the BRFSS 2015 dataset are:

**Response Variable / Dependent Variable:**
*   (Ever told) you have diabetes (If "Yes" and respondent is female, ask "Was this only when you were pregnant?". If Respondent says pre-diabetes or borderline diabetes, use response code 4.) --> DIABETE3

**Independent Variables:**

**High Blood Pressure**
*   Adults who have been told they have high blood pressure by a doctor, nurse, or other health professional --> _RFHYPE5

**High Cholesterol**
*   Have you EVER been told by a doctor, nurse or other health professional that your blood cholesterol is high? --> TOLDHI2
*   Cholesterol check within past five years --> _CHOLCHK

**BMI**
*   Body Mass Index (BMI) --> _BMI5

**Smoking**
*   Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] --> SMOKE100

**Other Chronic Health Conditions**
*   (Ever told) you had a stroke. --> CVDSTRK3
*   Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI) --> _MICHD

**Physical Activity**
*   Adults who reported doing physical activity or exercise during the past 30 days other than their regular job --> _TOTINDA

**Diet**
*   Consume Fruit 1 or more times per day --> _FRTLT1
*   Consume Vegetables 1 or more times per day --> _VEGLT1

**Alcohol Consumption**
*   Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) --> _RFDRHV5

**Health Care**
*   Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service?  --> HLTHPLN1
*   Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? --> MEDCOST

**Health General and Mental Health**
*   Would you say that in general your health is: --> GENHLTH
*   Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good? --> MENTHLTH
*   Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? --> PHYSHLTH
*   Do you have serious difficulty walking or climbing stairs? --> DIFFWALK

**Demographics**
*   Indicate sex of respondent. --> SEX
*   Fourteen-level age category --> _AGEG5YR
*   What is the highest grade or year of school you completed? --> EDUCA
*   Is your annual household income from all sources: (If respondent refuses at any income level, code "Refused.") --> INCOME2

In [1]:
# additional columns of interest, appended to the column selection further down
columns_were_intered_in = ['PERSDOC2','CHECKUP1','HAVARTH3','ADDEPEV2','_RACE']


## 1. Get the data

In [2]:
#imports
import os
import pandas as pd
import numpy as np
import random
random.seed(1)

In [3]:
#read in the dataset (select 2015)
brfss_2015_dataset = pd.read_csv('./2015.csv')

In [4]:
brfss_2015_dataset.head()

Unnamed: 0,_STATE,FMONTH,IDATE,IMONTH,IDAY,IYEAR,DISPCODE,SEQNO,_PSU,CTELENUM,...,_PAREC1,_PASTAE1,_LMTACT1,_LMTWRK1,_LMTSCL1,_RFSEAT2,_RFSEAT3,_FLSHOT6,_PNEUMO2,_AIDTST3
0,1.0,1.0,b'01292015',b'01',b'29',b'2015',1200.0,2015000000.0,2015000000.0,1.0,...,4.0,2.0,1.0,1.0,1.0,1.0,1.0,,,1.0
1,1.0,1.0,b'01202015',b'01',b'20',b'2015',1100.0,2015000000.0,2015000000.0,1.0,...,2.0,2.0,3.0,3.0,4.0,2.0,2.0,,,2.0
2,1.0,1.0,b'02012015',b'02',b'01',b'2015',1200.0,2015000000.0,2015000000.0,1.0,...,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,
3,1.0,1.0,b'01142015',b'01',b'14',b'2015',1100.0,2015000000.0,2015000000.0,1.0,...,4.0,2.0,1.0,1.0,1.0,1.0,1.0,,,9.0
4,1.0,1.0,b'01142015',b'01',b'14',b'2015',1100.0,2015000000.0,2015000000.0,1.0,...,4.0,2.0,1.0,1.0,1.0,1.0,1.0,,,1.0


In [5]:
#How many rows and columns
brfss_2015_dataset.shape

(441456, 330)

In [6]:
#check that the data loaded in is in the correct format
pd.set_option('display.max_columns', 500)
brfss_2015_dataset.head()

Unnamed: 0,_STATE,FMONTH,IDATE,IMONTH,IDAY,IYEAR,DISPCODE,SEQNO,_PSU,CTELENUM,PVTRESD1,COLGHOUS,STATERES,CELLFON3,LADULT,NUMADULT,NUMMEN,NUMWOMEN,CTELNUM1,CELLFON2,CADULT,PVTRESD2,CCLGHOUS,CSTATE,LANDLINE,HHADULT,GENHLTH,PHYSHLTH,MENTHLTH,POORHLTH,HLTHPLN1,PERSDOC2,MEDCOST,CHECKUP1,BPHIGH4,BPMEDS,BLOODCHO,CHOLCHK,TOLDHI2,CVDINFR4,CVDCRHD4,CVDSTRK3,ASTHMA3,ASTHNOW,CHCSCNCR,CHCOCNCR,CHCCOPD1,HAVARTH3,ADDEPEV2,CHCKIDNY,DIABETE3,DIABAGE2,SEX,MARITAL,EDUCA,RENTHOM1,NUMHHOL2,NUMPHON2,CPDEMO1,VETERAN3,EMPLOY1,CHILDREN,INCOME2,INTERNET,WEIGHT2,HEIGHT3,PREGNANT,QLACTLM2,USEEQUIP,BLIND,DECIDE,DIFFWALK,DIFFDRES,DIFFALON,SMOKE100,SMOKDAY2,STOPSMK2,LASTSMK2,USENOW3,ALCDAY5,AVEDRNK2,DRNK3GE5,MAXDRNKS,FRUITJU1,FRUIT1,FVBEANS,FVGREEN,FVORANG,VEGETAB1,EXERANY2,EXRACT11,EXEROFT1,EXERHMM1,EXRACT21,EXEROFT2,EXERHMM2,STRENGTH,LMTJOIN3,ARTHDIS2,ARTHSOCL,JOINPAIN,SEATBELT,FLUSHOT6,FLSHTMY2,IMFVPLAC,PNEUVAC3,HIVTST6,HIVTSTD3,WHRTST10,PDIABTST,PREDIAB1,INSULIN,BLDSUGAR,FEETCHK2,DOCTDIAB,CHKHEMO3,FEETCHK,EYEEXAM,DIABEYE,DIABEDU,PAINACT2,QLMENTL2,QLSTRES2,QLHLTH2,CAREGIV1,CRGVREL1,CRGVLNG1,CRGVHRS1,CRGVPRB1,CRGVPERS,CRGVHOUS,CRGVMST2,CRGVEXPT,VIDFCLT2,VIREDIF3,VIPRFVS2,VINOCRE2,VIEYEXM2,VIINSUR2,VICTRCT4,VIGLUMA2,VIMACDG2,CIMEMLOS,CDHOUSE,CDASSIST,CDHELP,CDSOCIAL,CDDISCUS,WTCHSALT,LONGWTCH,DRADVISE,ASTHMAGE,ASATTACK,ASERVIST,ASDRVIST,ASRCHKUP,ASACTLIM,ASYMPTOM,ASNOSLEP,ASTHMED3,ASINHALR,HAREHAB1,STREHAB1,CVDASPRN,ASPUNSAF,RLIVPAIN,RDUCHART,RDUCSTRK,ARTTODAY,ARTHWGT,ARTHEXER,ARTHEDU,TETANUS,HPVADVC2,HPVADSHT,SHINGLE2,HADMAM,HOWLONG,HADPAP2,LASTPAP2,HPVTEST,HPLSTTST,HADHYST2,PROFEXAM,LENGEXAM,BLDSTOOL,LSTBLDS3,HADSIGM3,HADSGCO1,LASTSIG3,PCPSAAD2,PCPSADI1,PCPSARE1,PSATEST1,PSATIME,PCPSARS1,PCPSADE1,PCDMDECN,SCNTMNY1,SCNTMEL1,SCNTPAID,SCNTWRK1,SCNTLPAD,SCNTLWK1,SXORIENT,TRNSGNDR,RCSGENDR,RCSRLTN2,CASTHDX2,CASTHNO2,EMTSUPRT,LSATISFY,ADPLEASR,ADDOWN,ADSLEEP,ADENERGY,ADEAT1,ADFAIL,ADTHINK,ADMOVE,MISTMNT,ADANXEV,QSTVER,QSTLANG,EXACTOT1,EXACTOT2,MSCODE,_STSTR,_STRWT,_RAWRAKE,_WT2RAKE,_CHISPNC,_CRACE1,
_CPRACE,_CLLCPWT,_DUALUSE,_DUALCOR,_LLCPWT,_RFHLTH,_HCVU651,_RFHYPE5,_CHOLCHK,_RFCHOL,_MICHD,_LTASTH1,_CASTHM1,_ASTHMS1,_DRDXAR1,_PRACE1,_MRACE1,_HISPANC,_RACE,_RACEG21,_RACEGR3,_RACE_G1,_AGEG5YR,_AGE65YR,_AGE80,_AGE_G,HTIN4,HTM4,WTKG3,_BMI5,_BMI5CAT,_RFBMI5,_CHLDCNT,_EDUCAG,_INCOMG,_SMOKER3,_RFSMOK3,DRNKANY5,DROCDY3_,_RFBING5,_DRNKWEK,_RFDRHV5,FTJUDA1_,FRUTDA1_,BEANDAY_,GRENDAY_,ORNGDAY_,VEGEDA1_,_MISFRTN,_MISVEGN,_FRTRESP,_VEGRESP,_FRUTSUM,_VEGESUM,_FRTLT1,_VEGLT1,_FRT16,_VEG23,_FRUITEX,_VEGETEX,_TOTINDA,METVL11_,METVL21_,MAXVO2_,FC60_,ACTIN11_,ACTIN21_,PADUR1_,PADUR2_,PAFREQ1_,PAFREQ2_,_MINAC11,_MINAC21,STRFREQ_,PAMISS1_,PAMIN11_,PAMIN21_,PA1MIN_,PAVIG11_,PAVIG21_,PA1VIGM_,_PACAT1,_PAINDX1,_PA150R2,_PA300R2,_PA30021,_PASTRNG,_PAREC1,_PASTAE1,_LMTACT1,_LMTWRK1,_LMTSCL1,_RFSEAT2,_RFSEAT3,_FLSHOT6,_PNEUMO2,_AIDTST3
0,1.0,1.0,b'01292015',b'01',b'29',b'2015',1200.0,2015000000.0,2015000000.0,1.0,1.0,,1.0,2.0,,3.0,1.0,2.0,,,,,,,,,5.0,15.0,18.0,10.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,2.0,3.0,,2.0,1.0,4.0,1.0,2.0,,1.0,2.0,8.0,88.0,3.0,2.0,280.0,510.0,,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,3.0,,2.0,3.0,888.0,,,,305.0,310.0,320.0,310.0,305.0,101.0,2.0,,,,,,,888.0,1.0,1.0,1.0,6.0,1.0,1.0,112014.0,1.0,1.0,1.0,,,1.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,b'',,,,,,,,,,,,,,,,,,,,,,,,,10.0,1.0,b'',b'',3.0,11011.0,28.78156,3.0,86.344681,,,,,1.0,0.614125,341.384853,2.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,9.0,1.0,63.0,5.0,70.0,178.0,12701.0,4018.0,4.0,2.0,1.0,2.0,2.0,3.0,1.0,2.0,5.397605e-79,1.0,5.397605e-79,1.0,17.0,33.0,67.0,33.0,17.0,100.0,5.397605e-79,5.397605e-79,1.0,1.0,50.0,217.0,2.0,1.0,1.0,1.0,5.397605e-79,5.397605e-79,2.0,,,2469.0,423.0,,,,,,,,,5.397605e-79,5.397605e-79,,,,,,,4.0,2.0,3.0,3.0,2.0,2.0,4.0,2.0,1.0,1.0,1.0,1.0,1.0,,,1.0
1,1.0,1.0,b'01202015',b'01',b'20',b'2015',1100.0,2015000000.0,2015000000.0,1.0,1.0,,1.0,2.0,,1.0,5.397605e-79,1.0,,,,,,,,,3.0,88.0,88.0,,2.0,1.0,1.0,4.0,3.0,,1.0,4.0,2.0,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,2.0,2.0,2.0,3.0,,2.0,2.0,6.0,1.0,2.0,,2.0,2.0,3.0,88.0,1.0,1.0,165.0,508.0,,1.0,2.0,1.0,1.0,2.0,2.0,2.0,1.0,1.0,2.0,,3.0,888.0,,,,302.0,305.0,302.0,202.0,202.0,304.0,1.0,64.0,212.0,100.0,69.0,212.0,100.0,888.0,,,,,3.0,2.0,,,2.0,2.0,,,2.0,3.0,,,,,,,,,,,,,,2.0,,,,,,,,1.0,,,,,,,,,,1.0,5.0,5.0,,5.0,2.0,2.0,,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,2.0,,,,,,,,,,b'',1.0,2.0,,,2.0,60.0,,,,,,,,,,,,,,,,,,,10.0,1.0,b'',b'',5.0,11011.0,28.78156,1.0,28.78156,,,,,9.0,,108.060903,1.0,2.0,1.0,2.0,1.0,2.0,1.0,1.0,3.0,2.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,7.0,1.0,52.0,4.0,68.0,173.0,7484.0,2509.0,3.0,2.0,1.0,4.0,1.0,1.0,2.0,2.0,5.397605e-79,1.0,5.397605e-79,1.0,7.0,17.0,7.0,29.0,29.0,13.0,5.397605e-79,5.397605e-79,1.0,1.0,24.0,78.0,2.0,2.0,1.0,1.0,5.397605e-79,5.397605e-79,1.0,35.0,5.397605e-79,2876.0,493.0,1.0,5.397605e-79,60.0,60.0,2800.0,2800.0,168.0,5.397605e-79,5.397605e-79,5.397605e-79,168.0,5.397605e-79,168.0,5.397605e-79,5.397605e-79,5.397605e-79,2.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,3.0,3.0,4.0,2.0,2.0,,,2.0
2,1.0,1.0,b'02012015',b'02',b'01',b'2015',1200.0,2015000000.0,2015000000.0,1.0,1.0,,1.0,2.0,,2.0,1.0,1.0,,,,,,,,,4.0,15.0,88.0,88.0,1.0,2.0,2.0,1.0,3.0,,1.0,1.0,1.0,7.0,2.0,1.0,2.0,,2.0,1.0,2.0,1.0,2.0,2.0,3.0,,2.0,2.0,4.0,1.0,2.0,,1.0,2.0,7.0,88.0,99.0,2.0,158.0,511.0,,2.0,2.0,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,b'',,,,,,,,,,,,,,,,,,,,,,,,,10.0,1.0,b'',b'',5.0,11011.0,28.78156,2.0,57.56312,,,,,1.0,0.614125,255.264797,2.0,9.0,1.0,1.0,2.0,,1.0,1.0,3.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,11.0,2.0,71.0,6.0,71.0,180.0,7167.0,2204.0,2.0,1.0,1.0,2.0,9.0,9.0,9.0,9.0,900.0,9.0,99900.0,9.0,,,,,,,2.0,4.0,5.397605e-79,5.397605e-79,,,9.0,9.0,1.0,1.0,1.0,1.0,9.0,,,2173.0,373.0,,,,,,,,,,9.0,,,,,,,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,
3,1.0,1.0,b'01142015',b'01',b'14',b'2015',1100.0,2015000000.0,2015000000.0,1.0,1.0,,1.0,2.0,,3.0,1.0,2.0,,,,,,,,,5.0,30.0,30.0,30.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,,2.0,1.0,2.0,1.0,1.0,2.0,3.0,,2.0,1.0,4.0,1.0,2.0,,1.0,2.0,8.0,1.0,8.0,2.0,180.0,507.0,,1.0,2.0,1.0,1.0,1.0,2.0,1.0,2.0,,,,3.0,888.0,,,,555.0,101.0,555.0,301.0,301.0,201.0,2.0,,,,,,,888.0,1.0,1.0,1.0,8.0,1.0,1.0,777777.0,5.0,1.0,9.0,,,2.0,3.0,,,,,,,,,,,,,,2.0,,,,,,,,2.0,,,,,,,,,,1.0,1.0,1.0,2.0,1.0,1.0,2.0,,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,1.0,2.0,1.0,,,,,,,,b'',4.0,7.0,,,,97.0,,,,,,,,,,,,,,,,,,,10.0,1.0,b'',b'',3.0,11011.0,28.78156,3.0,86.344681,,,,,1.0,0.614125,341.384853,2.0,1.0,2.0,1.0,2.0,2.0,1.0,1.0,3.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,9.0,1.0,63.0,5.0,67.0,170.0,8165.0,2819.0,3.0,2.0,2.0,2.0,5.0,4.0,1.0,2.0,5.397605e-79,1.0,5.397605e-79,1.0,5.397605e-79,100.0,5.397605e-79,3.0,3.0,14.0,5.397605e-79,5.397605e-79,1.0,1.0,100.0,20.0,1.0,2.0,1.0,1.0,5.397605e-79,5.397605e-79,2.0,,,2469.0,423.0,,,,,,,,,5.397605e-79,5.397605e-79,,,,,,,4.0,2.0,3.0,3.0,2.0,2.0,4.0,2.0,1.0,1.0,1.0,1.0,1.0,,,9.0
4,1.0,1.0,b'01142015',b'01',b'14',b'2015',1100.0,2015000000.0,2015000000.0,1.0,1.0,,1.0,2.0,,2.0,1.0,1.0,,,,,,,,,5.0,20.0,88.0,30.0,1.0,1.0,2.0,1.0,3.0,,1.0,1.0,2.0,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,1.0,2.0,2.0,3.0,,2.0,1.0,5.0,1.0,2.0,,2.0,2.0,8.0,88.0,77.0,1.0,142.0,504.0,,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,,,,3.0,888.0,,,,777.0,102.0,203.0,204.0,310.0,320.0,2.0,,,,,,,888.0,1.0,1.0,1.0,7.0,1.0,2.0,,,1.0,1.0,777777.0,1.0,1.0,3.0,,,,,,,,,,,,,,2.0,,,,,,,,7.0,,,,,,,,,,2.0,,,,,,1.0,777.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,1.0,2.0,5.0,,,,,,,,b'',5.0,5.0,,,,45.0,,,,,,,,,,,,,,,,,,,10.0,1.0,b'',b'',3.0,11011.0,28.78156,2.0,57.56312,,,,,9.0,,258.682223,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,3.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,9.0,1.0,61.0,5.0,64.0,163.0,6441.0,2437.0,2.0,1.0,1.0,3.0,9.0,4.0,1.0,2.0,5.397605e-79,1.0,5.397605e-79,1.0,,200.0,43.0,57.0,33.0,67.0,1.0,5.397605e-79,5.397605e-79,1.0,,200.0,9.0,1.0,1.0,1.0,1.0,5.397605e-79,2.0,,,2543.0,436.0,,,,,,,,,5.397605e-79,5.397605e-79,,,,,,,4.0,2.0,3.0,3.0,2.0,2.0,4.0,2.0,1.0,1.0,1.0,1.0,1.0,,,1.0


**At this point we have 441,456 records and 330 columns. Each record contains an individual's BRFSS survey responses.**
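Since only a small share of the 330 columns is used later, one option is to parse just those columns when reading the file. The following is a minimal sketch, not part of the original notebook: the in-memory `sample_csv` text stands in for `2015.csv`, and the `wanted` list is an illustrative assumption.

```python
import io
import pandas as pd

# Hypothetical stand-in for ./2015.csv: a tiny CSV with a few BRFSS-style columns.
sample_csv = io.StringIO(
    "_STATE,DIABETE3,_BMI5,SEX,IDATE\n"
    "1.0,3.0,4018.0,2.0,b'01292015'\n"
    "1.0,1.0,2509.0,1.0,b'01202015'\n"
)

# Only the listed columns are parsed; the others are skipped at read time.
wanted = ["DIABETE3", "_BMI5", "SEX"]
subset = pd.read_csv(sample_csv, usecols=wanted)

print(subset.shape)
print(list(subset.columns))
```

With the real file, passing the full list of selected column names to `usecols` should reduce memory use, at the cost of not being able to inspect the other columns afterwards.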

In [7]:
# select specific columns
brfss_df_selected = brfss_2015_dataset[['DIABETE3',
                                         '_RFHYPE5',  
                                         'TOLDHI2', '_CHOLCHK', 
                                         '_BMI5', 
                                         'SMOKE100', 
                                         'CVDSTRK3', '_MICHD', 
                                         '_TOTINDA', 
                                         '_FRTLT1', '_VEGLT1', 
                                         '_RFDRHV5', 
                                         'HLTHPLN1', 'MEDCOST', 
                                         'GENHLTH', 'MENTHLTH', 'PHYSHLTH', 'DIFFWALK', 
                                         'SEX', '_AGEG5YR', 'EDUCA', 'INCOME2' ]+columns_were_intered_in]

In [8]:
brfss_df_selected.shape

(441456, 27)

In [9]:
brfss_df_selected.head()

Unnamed: 0,DIABETE3,_RFHYPE5,TOLDHI2,_CHOLCHK,_BMI5,SMOKE100,CVDSTRK3,_MICHD,_TOTINDA,_FRTLT1,_VEGLT1,_RFDRHV5,HLTHPLN1,MEDCOST,GENHLTH,MENTHLTH,PHYSHLTH,DIFFWALK,SEX,_AGEG5YR,EDUCA,INCOME2,PERSDOC2,CHECKUP1,HAVARTH3,ADDEPEV2,_RACE
0,3.0,2.0,1.0,1.0,4018.0,1.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,2.0,5.0,18.0,15.0,1.0,2.0,9.0,4.0,3.0,1.0,1.0,1.0,1.0,1.0
1,3.0,1.0,2.0,2.0,2509.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,2.0,1.0,3.0,88.0,88.0,2.0,2.0,7.0,6.0,1.0,1.0,4.0,2.0,2.0,1.0
2,3.0,1.0,1.0,1.0,2204.0,,1.0,,9.0,9.0,9.0,9.0,1.0,2.0,4.0,88.0,15.0,,2.0,11.0,4.0,99.0,2.0,1.0,1.0,2.0,1.0
3,3.0,2.0,1.0,1.0,2819.0,2.0,2.0,2.0,2.0,1.0,2.0,1.0,1.0,1.0,5.0,30.0,30.0,1.0,2.0,9.0,4.0,8.0,2.0,1.0,1.0,1.0,1.0
4,3.0,1.0,2.0,1.0,2437.0,2.0,2.0,2.0,2.0,9.0,1.0,1.0,1.0,2.0,5.0,88.0,20.0,2.0,2.0,9.0,5.0,77.0,1.0,1.0,1.0,2.0,1.0


## 2. Clean the data

### 2.1 Drop missing values

In [10]:
brfss_df_selected.describe()

Unnamed: 0,DIABETE3,_RFHYPE5,TOLDHI2,_CHOLCHK,_BMI5,SMOKE100,CVDSTRK3,_MICHD,_TOTINDA,_FRTLT1,_VEGLT1,_RFDRHV5,HLTHPLN1,MEDCOST,GENHLTH,MENTHLTH,PHYSHLTH,DIFFWALK,SEX,_AGEG5YR,EDUCA,INCOME2,PERSDOC2,CHECKUP1,HAVARTH3,ADDEPEV2,_RACE
count,441449.0,441456.0,382302.0,441456.0,405058.0,427201.0,441456.0,437514.0,441456.0,441456.0,441456.0,441456.0,441456.0,441455.0,441454.0,441456.0,441455.0,429122.0,441456.0,441456.0,441456.0,438155.0,441456.0,441455.0,441455.0,441456.0,441456.0
mean,2.757888,1.42841,1.630876,1.533609,2804.2424,1.613987,1.97388,1.911699,1.931871,2.131746,2.109316,1.516312,1.101201,1.916066,2.57879,64.679178,60.655113,1.8566,1.576542,7.803623,4.920094,20.253013,1.395213,1.574185,1.697419,1.837522,2.021758
std,0.723319,0.646749,0.740235,1.555462,665.463433,0.74653,0.348689,0.283733,2.209728,2.322882,2.522517,1.87458,0.512261,0.415414,1.117585,35.843085,37.055684,0.579838,0.494107,3.495609,1.076198,31.853507,0.833429,1.249199,0.644214,0.563087,2.273676
min,1.0,1.0,1.0,1.0,1202.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,3.0,1.0,1.0,1.0,2373.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,28.0,15.0,2.0,1.0,5.0,4.0,5.0,1.0,1.0,1.0,2.0,1.0
50%,3.0,1.0,2.0,1.0,2695.0,2.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,88.0,88.0,2.0,2.0,8.0,5.0,7.0,1.0,1.0,2.0,2.0,1.0
75%,3.0,2.0,2.0,1.0,3090.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,2.0,3.0,88.0,88.0,2.0,2.0,10.0,6.0,8.0,1.0,2.0,2.0,2.0,1.0
max,9.0,9.0,9.0,9.0,9995.0,9.0,9.0,2.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,99.0,99.0,9.0,2.0,14.0,9.0,99.0,9.0,9.0,9.0,9.0,9.0


In [11]:
#Drop rows with missing values - this removes about 98,000 rows right away (441,456 -> 343,605)
brfss_df_selected = brfss_df_selected.dropna()
brfss_df_selected.shape

(343605, 27)
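Before dropping rows, it can help to see which columns contribute the missing values; in the `describe()` output above, `TOLDHI2` and `_BMI5` have the lowest counts. The sketch below shows the idea on made-up data (the `toy` frame is illustrative only, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Made-up rows with a few gaps, using real BRFSS column names for readability.
toy = pd.DataFrame({
    "TOLDHI2": [1.0, np.nan, 2.0, np.nan],
    "_BMI5": [4018.0, 2509.0, np.nan, 2819.0],
    "SEX": [2.0, 2.0, 1.0, 1.0],
})

# Missing values per column, largest first.
missing_per_column = toy.isna().sum().sort_values(ascending=False)
print(missing_per_column)

# dropna() removes every row that has at least one missing value.
rows_before = len(toy)
rows_after = len(toy.dropna())
print(rows_before - rows_after, "rows would be dropped")
```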

### 2.2 Modify and clean the values to be more suitable to ML algorithms
For this part, I referenced the codebook, which describes each column/feature/question: https://www.cdc.gov/brfss/annual_data/2015/pdf/codebook15_llcp.pdf
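The cells that follow repeat one pattern: recode the valid answers, then drop the rows holding "don't know" or "refused" codes. As a sketch only, that pattern could be wrapped in a helper; the function name `recode_and_drop` and the toy data are assumptions, not part of the original notebook.

```python
import pandas as pd

def recode_and_drop(df, column, mapping, drop_codes):
    """Return a copy of df with non-answer codes removed and `column` recoded."""
    # Drop the non-answer codes first so they cannot collide with recoded values.
    out = df[~df[column].isin(drop_codes)].copy()
    out[column] = out[column].replace(mapping)
    return out

# Toy example: 1 = yes, 2 = no, 7 = don't know, 9 = refused.
toy = pd.DataFrame({"SMOKE100": [1.0, 2.0, 7.0, 9.0, 2.0]})
cleaned = recode_and_drop(toy, "SMOKE100", mapping={2: 0}, drop_codes=[7, 9])
print(cleaned["SMOKE100"].tolist())
```

The helper filters before recoding, whereas the notebook cells recode first and filter afterwards. For the codes used here both orders give the same rows, since no recoded value equals 7 or 9.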

In [12]:
# DIABETE3
# going to make this ordinal. 0 is for no diabetes or only during pregnancy, 1 is for pre-diabetes or borderline diabetes, 2 is for yes diabetes
# Remove all 7 (dont knows)
# Remove all 9 (refused)
brfss_df_selected['DIABETE3'] = brfss_df_selected['DIABETE3'].replace({2:0, 3:0, 1:2, 4:1})
brfss_df_selected = brfss_df_selected[brfss_df_selected.DIABETE3 != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.DIABETE3 != 9]
brfss_df_selected.DIABETE3.unique()

array([0., 2., 1.])

In [13]:
#1 _RFHYPE5
#Change 1 to 0 so it represents no high blood pressure and 2 to 1 so it represents high blood pressure
#Remove all 9 (don't know/missing)
brfss_df_selected['_RFHYPE5'] = brfss_df_selected['_RFHYPE5'].replace({1:0, 2:1})
brfss_df_selected = brfss_df_selected[brfss_df_selected._RFHYPE5 != 9]
brfss_df_selected._RFHYPE5.unique()

array([1., 0.])

In [14]:
#2 TOLDHI2
# Change 2 to 0 because it is No
# Remove all 7 (dont knows)
# Remove all 9 (refused)
brfss_df_selected['TOLDHI2'] = brfss_df_selected['TOLDHI2'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.TOLDHI2 != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.TOLDHI2 != 9]
brfss_df_selected.TOLDHI2.unique()

array([1., 0.])

In [15]:
#3 _CHOLCHK
# Change 3 to 0 and 2 to 0 for Not checked cholesterol in past 5 years
# Remove 9
brfss_df_selected['_CHOLCHK'] = brfss_df_selected['_CHOLCHK'].replace({3:0,2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected._CHOLCHK != 9]
brfss_df_selected._CHOLCHK.unique()

array([1., 0.])

In [16]:
#4 _BMI5 (values are stored as BMI * 100, so for example a BMI of 4018 is really 40.18)
# Divide by 100 and round to the nearest whole number
brfss_df_selected['_BMI5'] = brfss_df_selected['_BMI5'].div(100).round(0)
brfss_df_selected._BMI5.unique()

array([40., 25., 28., 24., 27., 30., 26., 23., 34., 33., 21., 22., 31.,
       38., 20., 19., 32., 46., 41., 37., 36., 29., 35., 18., 54., 45.,
       39., 47., 43., 55., 49., 42., 17., 16., 48., 44., 50., 59., 15.,
       52., 53., 57., 51., 14., 58., 63., 61., 56., 60., 74., 62., 64.,
       13., 66., 73., 65., 68., 85., 71., 84., 67., 70., 82., 79., 92.,
       72., 88., 96., 81., 12., 77., 95., 75., 91., 69., 76., 87., 89.,
       83., 98., 86., 80., 90., 78., 97.])
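The `_BMI5` values carry two implied decimal places, so dividing by 100 recovers the actual BMI. A short sketch using a few raw values from the earlier `head()` output:

```python
import pandas as pd

# Raw _BMI5 values as stored in the survey file (BMI * 100).
raw_bmi = pd.Series([4018.0, 2509.0, 2819.0, 2437.0], name="_BMI5")

# Divide by 100 to recover the BMI, then round to whole units as the cell above does.
bmi = raw_bmi.div(100).round(0)
print(bmi.tolist())
```

Rounding to whole units discards the decimal part; keeping `raw_bmi.div(100)` unrounded is an alternative if finer resolution is wanted downstream.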

In [17]:
#5 SMOKE100
# Change 2 to 0 because it is No
# Remove all 7 (dont knows)
# Remove all 9 (refused)
brfss_df_selected['SMOKE100'] = brfss_df_selected['SMOKE100'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.SMOKE100 != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.SMOKE100 != 9]
brfss_df_selected.SMOKE100.unique()

array([1., 0.])

In [18]:
#6 CVDSTRK3
# Change 2 to 0 because it is No
# Remove all 7 (dont knows)
# Remove all 9 (refused)
brfss_df_selected['CVDSTRK3'] = brfss_df_selected['CVDSTRK3'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.CVDSTRK3 != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.CVDSTRK3 != 9]
brfss_df_selected.CVDSTRK3.unique()

array([0., 1.])

In [19]:
#7 _MICHD
#Change 2 to 0 because this means did not have MI or CHD
brfss_df_selected['_MICHD'] = brfss_df_selected['_MICHD'].replace({2: 0})
brfss_df_selected._MICHD.unique()

array([0., 1.])

In [20]:
#8 _TOTINDA
# 1 for physical activity
# change 2 to 0 for no physical activity
# Remove all 9 (don't know/refused)
brfss_df_selected['_TOTINDA'] = brfss_df_selected['_TOTINDA'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected._TOTINDA != 9]
brfss_df_selected._TOTINDA.unique()

array([0., 1.])

In [21]:
#9 _FRTLT1
# Change 2 to 0. This means fruit consumed less than once per day. 1 means fruit consumed 1 or more times per day
# Remove all 9 (don't know, refused, or missing)
brfss_df_selected['_FRTLT1'] = brfss_df_selected['_FRTLT1'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected._FRTLT1 != 9]
brfss_df_selected._FRTLT1.unique()

array([0., 1.])

In [22]:
#10 _VEGLT1
# Change 2 to 0. This means vegetables consumed less than once per day. 1 means vegetables consumed 1 or more times per day
# Remove all 9 (don't know, refused, or missing)
brfss_df_selected['_VEGLT1'] = brfss_df_selected['_VEGLT1'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected._VEGLT1 != 9]
brfss_df_selected._VEGLT1.unique()

array([1., 0.])

In [23]:
#11 _RFDRHV5
# Change 1 to 0 (1 was no for heavy drinking). change all 2 to 1 (2 was yes for heavy drinking)
# remove all dont knows and missing 9
brfss_df_selected['_RFDRHV5'] = brfss_df_selected['_RFDRHV5'].replace({1:0, 2:1})
brfss_df_selected = brfss_df_selected[brfss_df_selected._RFDRHV5 != 9]
brfss_df_selected._RFDRHV5.unique()

array([0., 1.])

In [24]:
#12 HLTHPLN1
# 1 is yes, change 2 to 0 because it is no health care coverage
# remove 7 and 9 for don't know or refused
brfss_df_selected['HLTHPLN1'] = brfss_df_selected['HLTHPLN1'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.HLTHPLN1 != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.HLTHPLN1 != 9]
brfss_df_selected.HLTHPLN1.unique()

array([1., 0.])

In [25]:
#13 MEDCOST
# Change 2 to 0 for no, 1 is already yes
# remove 7 for don't know and 9 for refused
brfss_df_selected['MEDCOST'] = brfss_df_selected['MEDCOST'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.MEDCOST != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.MEDCOST != 9]
brfss_df_selected.MEDCOST.unique()

array([0., 1.])

In [26]:
#14 GENHLTH
# This is an ordinal variable that I want to keep (1 is Excellent -> 5 is Poor)
# Remove 7 and 9 for don't know and refused
brfss_df_selected = brfss_df_selected[brfss_df_selected.GENHLTH != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.GENHLTH != 9]
brfss_df_selected.GENHLTH.unique()

array([5., 3., 2., 4., 1.])

In [27]:
#15 MENTHLTH
# already in days so keep that, scale will be 0-30
# change 88 to 0 because it means none (no bad mental health days)
# remove 77 and 99 for don't know not sure and refused
brfss_df_selected['MENTHLTH'] = brfss_df_selected['MENTHLTH'].replace({88:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.MENTHLTH != 77]
brfss_df_selected = brfss_df_selected[brfss_df_selected.MENTHLTH != 99]
brfss_df_selected.MENTHLTH.unique()

array([18.,  0., 30.,  3.,  5., 15., 10.,  6., 20.,  2., 25.,  1., 29.,
        4.,  7.,  8., 21., 14., 26.,  9., 16., 28., 11., 12., 24., 17.,
       13., 23., 27., 19., 22.])

In [28]:
#16 PHYSHLTH
# already in days so keep that, scale will be 0-30
# change 88 to 0 because it means none (no bad physical health days)
# remove 77 and 99 for don't know not sure and refused
brfss_df_selected['PHYSHLTH'] = brfss_df_selected['PHYSHLTH'].replace({88:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.PHYSHLTH != 77]
brfss_df_selected = brfss_df_selected[brfss_df_selected.PHYSHLTH != 99]
brfss_df_selected.PHYSHLTH.unique()

array([15.,  0., 30.,  2., 14., 28.,  7., 20.,  3., 10.,  1.,  5., 17.,
        4., 19.,  6., 21., 12.,  8., 25., 27., 22., 29., 24.,  9., 16.,
       18., 23., 13., 26., 11.])

In [29]:
#17 DIFFWALK
# change 2 to 0 for no. 1 is already yes
# remove 7 and 9 for don't know not sure and refused
brfss_df_selected['DIFFWALK'] = brfss_df_selected['DIFFWALK'].replace({2:0})
brfss_df_selected = brfss_df_selected[brfss_df_selected.DIFFWALK != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.DIFFWALK != 9]
brfss_df_selected.DIFFWALK.unique()

array([1., 0.])

In [30]:
#18 SEX
# In other words: is the respondent male? (I chose this coding somewhat arbitrarily, because men are at higher risk for heart disease)
# change 2 to 0 (female as 0). Male is 1
brfss_df_selected['SEX'] = brfss_df_selected['SEX'].replace({2:0})
brfss_df_selected.SEX.unique()

array([0., 1.])

In [31]:
#19 _AGEG5YR
# already ordinal. 1 is 18-24 all the way up to 13, which is 80 and older. 5 year increments.
# remove 14 because it is don't know or missing
brfss_df_selected = brfss_df_selected[brfss_df_selected._AGEG5YR != 14]
brfss_df_selected._AGEG5YR.unique()

array([ 9.,  7., 11., 10., 13.,  8.,  4.,  6.,  2., 12.,  5.,  1.,  3.])

In [32]:
#20 EDUCA
# This is already an ordinal variable with 1 being never attended school or kindergarten only up to 6 being college 4 years or more
# Scale here is 1-6
# Remove 9 for refused:
brfss_df_selected = brfss_df_selected[brfss_df_selected.EDUCA != 9]
brfss_df_selected.EDUCA.unique()

array([4., 6., 3., 5., 2., 1.])

In [33]:
#21 INCOME2
# Variable is already ordinal with 1 being less than $10,000 all the way up to 8 being $75,000 or more
# Remove 77 and 99 for don't know and refused
brfss_df_selected = brfss_df_selected[brfss_df_selected.INCOME2 != 77]
brfss_df_selected = brfss_df_selected[brfss_df_selected.INCOME2 != 99]
brfss_df_selected.INCOME2.unique()

array([3., 1., 8., 6., 4., 7., 2., 5.])

In [34]:
# for column in columns_were_intered_in:
#     print(brfss_2015_dataset[column].value_counts())

___
___

PERSDOC2
Do you have one person you think of as your personal doctor or health care provider? (If "No" ask "Is there more than one or is there no person who you think of as your personal doctor or health care provider?".)

1
Yes, only one

2
More than one

3
No

In [35]:
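#encode persdoc2
# 1 = has at least one personal doctor (codes 1 and 2), 0 = no personal doctor (code 3)
# remove 7 and 9 for don't know and refused
brfss_df_selected['PERSDOC2'] = brfss_df_selected['PERSDOC2'].replace({3:0, 2:1})
brfss_df_selected = brfss_df_selected[brfss_df_selected.PERSDOC2 != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.PERSDOC2 != 9]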

CHECKUP1
About how long has it been since you last visited a doctor for a routine checkup? [A routine checkup is a general physical exam, not an exam for a specific injury, illness, or condition.]

1
Within past year (anytime less than 12 months ago)

2
Within past 2 years (1 year but less than 2 years ago)

3
Within past 5 years (2 years but less than 5 years ago)

4
5 or more years ago

7
Don’t know/Not sure

8
Never

In [36]:
#encode checkup1
# 0 = within past year, 1 = within past 2 years, 2 = 2 or more years ago (codes 3 and 4 combined)
# remove 7 (don't know), 8 (never), and 9 (refused)
brfss_df_selected['CHECKUP1'] = brfss_df_selected['CHECKUP1'].replace({1:0, 2:1, 3:2, 4:2})
brfss_df_selected = brfss_df_selected[brfss_df_selected.CHECKUP1 != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.CHECKUP1 != 8]
brfss_df_selected = brfss_df_selected[brfss_df_selected.CHECKUP1 != 9]

HAVARTH3
(Ever told) you have some form of arthritis, rheumatoid arthritis, gout, lupus, or fibromyalgia? (Arthritis diagnoses include: rheumatism, polymyalgia rheumatica; osteoarthritis (not osteporosis); tendonitis, bursitis, bunion, tennis elbow; carpal tunnel syndrome, tarsal tunnel syndrome; joint infection, etc.)

1
Yes

2
No


In [37]:
#encode havarth3
# 1 is already yes, change 2 to 0 for no
# remove 7 and 9 for don't know and refused
brfss_df_selected['HAVARTH3'] = brfss_df_selected['HAVARTH3'].replace({2:0, 1:1})
brfss_df_selected = brfss_df_selected[brfss_df_selected.HAVARTH3 != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.HAVARTH3 != 9]

ADDEPEV2
(Ever told) you that you have a depressive disorder, including depression, major depression, dysthymia, or minor depression?

1
Yes

2
No


In [38]:
#encode addepev2
# 1 is already yes, change 2 to 0 for no
# remove 7 and 9 for don't know and refused
brfss_df_selected['ADDEPEV2'] = brfss_df_selected['ADDEPEV2'].replace({2:0, 1:1})
brfss_df_selected = brfss_df_selected[brfss_df_selected.ADDEPEV2 != 7]
brfss_df_selected = brfss_df_selected[brfss_df_selected.ADDEPEV2 != 9]

_RACE

1
White only, non-Hispanic Notes: _HISPANC = 2 and _MRACE1 = 10

2
Black only, non-Hispanic Notes: _HISPANC = 2 and _MRACE1 = 20

3
American Indian or Alaskan Native only, Non-Hispanic Notes: _HISPANC = 2 and _MRACE1 = 33

4
Asian only, non-Hispanic Notes: _HISPANC = 2 and _MRACE1 = 40,41,42,43,44,45,46,47

5
Native Hawaiian or other Pacific Islander only, Non-Hispanic Notes: _HISPANC = 2 and _MRACE1 = 50,51,52,53,54

6
Other race only, non-Hispanic Notes: _HISPANC = 2 and _MRACE1 = 60

7
Multiracial, non-Hispanic Notes: _HISPANC = 2 and _MRACE1 = 77

8
Hispanic Notes: _HISPANC = 1


In [39]:
# remove 9 (don't know/not sure/refused), then one-hot encode the remaining race categories
brfss_df_selected = brfss_df_selected[brfss_df_selected._RACE != 9]
brfss_df_selected = pd.get_dummies(brfss_df_selected,columns=['_RACE'], dtype=int)
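Because `_RACE` is stored as floats, `pd.get_dummies` builds the new column names from the float codes, which is why names such as `_RACE_1.0` appear later in the notebook. A small sketch on made-up rows (the `toy` frame is illustrative only):

```python
import pandas as pd

# Made-up rows; _RACE holds float codes, as in the cleaned dataset.
toy = pd.DataFrame({"_RACE": [1.0, 8.0, 2.0, 1.0], "SEX": [0.0, 1.0, 1.0, 0.0]})

# One indicator column per race code that occurs; float codes give names like "_RACE_1.0".
encoded = pd.get_dummies(toy, columns=["_RACE"], dtype=int)
print(list(encoded.columns))
print(encoded)
```

Only codes that actually occur in the data get a column, so a code absent from a subset would not produce an indicator. For models sensitive to collinear features, `drop_first=True` is an option worth considering, though the notebook keeps all categories.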

___
___

In [40]:
#Check the shape of the dataset now: We have 246,972 cleaned rows and 34 columns (1 of which is our dependent variable)
brfss_df_selected.shape

(246972, 34)

In [41]:
#Let's see what the data looks like after Modifying Values
brfss_df_selected.head()

Unnamed: 0,DIABETE3,_RFHYPE5,TOLDHI2,_CHOLCHK,_BMI5,SMOKE100,CVDSTRK3,_MICHD,_TOTINDA,_FRTLT1,_VEGLT1,_RFDRHV5,HLTHPLN1,MEDCOST,GENHLTH,MENTHLTH,PHYSHLTH,DIFFWALK,SEX,_AGEG5YR,EDUCA,INCOME2,PERSDOC2,CHECKUP1,HAVARTH3,ADDEPEV2,_RACE_1.0,_RACE_2.0,_RACE_3.0,_RACE_4.0,_RACE_5.0,_RACE_6.0,_RACE_7.0,_RACE_8.0
0,0.0,1.0,1.0,1.0,40.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,5.0,18.0,15.0,1.0,0.0,9.0,4.0,3.0,1.0,0.0,1.0,1.0,1,0,0,0,0,0,0,0
1,0.0,0.0,0.0,0.0,25.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,3.0,0.0,0.0,0.0,0.0,7.0,6.0,1.0,1.0,2.0,0.0,0.0,1,0,0,0,0,0,0,0
3,0.0,1.0,1.0,1.0,28.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,5.0,30.0,30.0,1.0,0.0,9.0,4.0,8.0,1.0,0.0,1.0,1.0,1,0,0,0,0,0,0,0
5,0.0,1.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0,3.0,6.0,1.0,0.0,1.0,0.0,1,0,0,0,0,0,0,0
6,0.0,1.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0,3.0,0.0,0.0,0.0,11.0,5.0,4.0,1.0,0.0,0.0,0.0,0,0,0,0,0,0,1,0


In [42]:
#Check class sizes of the diabetes column
brfss_df_selected.groupby(['DIABETE3']).size()

DIABETE3
0.0    208156
1.0      4476
2.0     34340
dtype: int64

## 3. Make feature names more readable

In [43]:
#Rename the columns to make them more readable
brfss = brfss_df_selected.rename(columns = {'DIABETE3':'Diabetes_012', 
                                        '_RFHYPE5':'HighBP',  
                                        'TOLDHI2':'HighChol', '_CHOLCHK':'CholCheck', 
                                        '_BMI5':'BMI', 
                                        'SMOKE100':'Smoker', 
                                        'CVDSTRK3':'Stroke', '_MICHD':'HeartDiseaseorAttack', 
                                        '_TOTINDA':'PhysActivity', 
                                        '_FRTLT1':'Fruits', '_VEGLT1':"Veggies", 
                                        '_RFDRHV5':'HvyAlcoholConsump', 
                                        'HLTHPLN1':'AnyHealthcare', 'MEDCOST':'NoDocbcCost', 
                                        'GENHLTH':'GenHlth', 'MENTHLTH':'MentHlth', 'PHYSHLTH':'PhysHlth', 'DIFFWALK':'DiffWalk', 
                                        'SEX':'Sex', '_AGEG5YR':'Age', 'EDUCA':'Education', 'INCOME2':'Income',
                                        'PERSDOC2':'PrimaryDoc', 'CHECKUP1':'Checkup1', 'HAVARTH3':'Arthritis', 'ADDEPEV2':'Depression', 
                                        '_RACE_1.0':'Race_white', '_RACE_2.0':'Race_black','_RACE_3.0':'Race_AMI_AKN',
                                        '_RACE_4.0':'Race_asian', '_RACE_5.0':'Race_HI_PI', '_RACE_6.0':'Race_other',
                                        '_RACE_7.0':'Race_multi','_RACE_8.0':'Race_hispanic'})

In [44]:
brfss.head()

Unnamed: 0,Diabetes_012,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income,PrimaryDoc,Checkup1,Arthritis,Depression,Race_white,Race_black,Race_AMI_AKN,Race_asian,Race_HI_PI,Race_other,Race_multi,Race_hispanic
0,0.0,1.0,1.0,1.0,40.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,5.0,18.0,15.0,1.0,0.0,9.0,4.0,3.0,1.0,0.0,1.0,1.0,1,0,0,0,0,0,0,0
1,0.0,0.0,0.0,0.0,25.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,3.0,0.0,0.0,0.0,0.0,7.0,6.0,1.0,1.0,2.0,0.0,0.0,1,0,0,0,0,0,0,0
3,0.0,1.0,1.0,1.0,28.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,5.0,30.0,30.0,1.0,0.0,9.0,4.0,8.0,1.0,0.0,1.0,1.0,1,0,0,0,0,0,0,0
5,0.0,1.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0,3.0,6.0,1.0,0.0,1.0,0.0,1,0,0,0,0,0,0,0
6,0.0,1.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0,3.0,0.0,0.0,0.0,11.0,5.0,4.0,1.0,0.0,0.0,0.0,0,0,0,0,0,0,1,0


In [45]:
brfss.shape

(246972, 34)

In [46]:
#Check how many respondents have no diabetes, prediabetes or diabetes. Note the class imbalance!
brfss.groupby(['Diabetes_012']).size()

Diabetes_012
0.0    208156
1.0      4476
2.0     34340
dtype: int64
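To put the imbalance in relative terms, the class counts can be turned into proportions. The sketch below reuses the counts printed above; roughly 84% of respondents fall in class 0, about 2% in class 1, and about 14% in class 2.

```python
import pandas as pd

# Class counts copied from the output above.
counts = pd.Series({0.0: 208156, 1.0: 4476, 2.0: 34340}, name="Diabetes_012")

# Share of each class among all cleaned rows.
proportions = counts / counts.sum()
print(proportions.round(3))
```

On the actual DataFrame, `brfss['Diabetes_012'].value_counts(normalize=True)` should give the same proportions directly.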

## 4. Save to csv
First, save the three-class version, in which diabetes is the target variable and sits in the first column. This is the full cleaned dataset with prediabetes kept as its own class.

In [47]:
brfss.to_csv('CUSTOM_diabetes_012_health_indicators_BRFSS2015.csv', sep=",", index=False)
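Since `index=False` is passed, the original row labels (which have gaps after filtering) are not written to the file. A minimal round-trip sketch, assuming an in-memory buffer in place of a file on disk and a made-up two-row frame:

```python
import io
import pandas as pd

# Made-up frame whose index has a gap, like the cleaned data after filtering.
toy = pd.DataFrame({"Diabetes_012": [0.0, 2.0], "BMI": [40.0, 25.0]}, index=[0, 3])

buffer = io.StringIO()
toy.to_csv(buffer, sep=",", index=False)  # index=False leaves out the row labels
buffer.seek(0)

# Reading back yields the same columns with a fresh 0..n-1 index.
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))
print(reloaded.index.tolist())
```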


## 5. Create Binary Dataset for diabetes vs. no diabetes

In [50]:
#Copy old table to new one.
brfss_binary = brfss.copy()
#Change the diabetics 2 to a 1 and keep the pre-diabetics as 1, so that 0 means non-diabetic and 1 means pre-diabetic or diabetic.
#(Uncomment the next line to fold the pre-diabetics into the 0 class instead.)
# brfss_binary['Diabetes_012'] = brfss_binary['Diabetes_012'].replace({1:0})
brfss_binary['Diabetes_012'] = brfss_binary['Diabetes_012'].replace({2:1})

#Check the class counts after the change
brfss_binary.Diabetes_012.value_counts()

Diabetes_012
0.0    208156
1.0     38816
Name: count, dtype: int64
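The count of 38,816 in class 1 equals 4,476 + 34,340, so prediabetes and diabetes end up in the same class here. The sketch below contrasts that choice with the commented-out alternative on a made-up series (`three_class` is illustrative only):

```python
import pandas as pd

# Made-up labels: 0 = no diabetes, 1 = prediabetes, 2 = diabetes.
three_class = pd.Series([0.0, 1.0, 2.0, 2.0, 0.0], name="Diabetes_012")

# Option A (what the cell above does): prediabetes and diabetes both become 1.
binary_a = three_class.replace({2: 1})

# Option B (the commented-out alternative): prediabetes joins the 0 class.
binary_b = three_class.replace({1: 0, 2: 1})

print(binary_a.tolist())
print(binary_b.tolist())
```

Which option is appropriate depends on whether prediabetes should count as a positive case for the model being trained.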

In [521]:
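#Rename the target column to Diabetes_binary, then show the change
brfss_binary = brfss_binary.rename(columns = {'Diabetes_012':'Diabetes_binary'})
brfss_binary.head()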

Unnamed: 0,Diabetes_binary,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income,PrimaryDoc,Checkup1,Arthritis,Depression,Race_white,Race_black,Race_AMI_AKN,Race_asian,Race_HI_PI,Race_other,Race_multi,Race_hispanic
0,0.0,1.0,1.0,1.0,40.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,5.0,18.0,15.0,1.0,0.0,9.0,4.0,3.0,1.0,0.0,1.0,1.0,1,0,0,0,0,0,0,0
1,0.0,0.0,0.0,0.0,25.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,3.0,0.0,0.0,0.0,0.0,7.0,6.0,1.0,1.0,2.0,0.0,0.0,1,0,0,0,0,0,0,0
3,0.0,1.0,1.0,1.0,28.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,5.0,30.0,30.0,1.0,0.0,9.0,4.0,8.0,1.0,0.0,1.0,1.0,1,0,0,0,0,0,0,0
5,0.0,1.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0,3.0,6.0,1.0,0.0,1.0,0.0,1,0,0,0,0,0,0,0
6,0.0,1.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0,3.0,0.0,0.0,0.0,11.0,5.0,4.0,1.0,0.0,0.0,0.0,0,0,0,0,0,0,1,0


In [55]:
brfss_binary.to_csv('CUSTOM_BINARY_diabetes_012_health_indicators_BRFSS2015.csv', sep=",", index=False)