## Data Preparation and Cleaning

<p>Data preparation and cleaning are crucial steps in the data analysis process. They transform raw data into a form that is suitable for analysis.</p>

The following are some key steps involved in our data preparation and cleaning process:

>1. **Data Collection**: our raw dataset comes from the 2021 BRFSS survey conducted by the CDC, which covers more than 400,000 survey participants in the US. The original data file: https://www.cdc.gov/brfss/annual_data/annual_2021.html


>2. **Data Extraction**: the survey dataset has 303 columns, one for each response to the questions asked in the survey. To identify factors related to diabetes, we conducted secondary research. The relevant variable columns were then identified and extracted from the survey dataset.


>3. **Data Cleaning**: steps taken to clean the dataset include tackling missing values from survey respondents and dropping irrelevant responses.


>4. **Data Transformation**: after cleaning, the data is transformed into a format that is suitable for analysis.


>5. **Data Documentation & Export**: lastly, the data is documented in detail and the cleaned dataset is exported. Our codebook can be found in the data description file.

### Step 0: Load in the dataset

In [128]:
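#Import libraries
import os
import pandas as pd
import numpy as np
import random
#Seed both generators; the sampling in Step 5 uses np.random, which random.seed does not affect
random.seed(1)
np.random.seed(1)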

In [129]:
#Read the dataset
diabetes = pd.read_csv('LLCP2021 3.csv')
diabetes.head()

Unnamed: 0,_STATE,FMONTH,IDATE,IMONTH,IDAY,IYEAR,DISPCODE,SEQNO,_PSU,CTELENM1,...,_FRTRES1,_VEGRES1,_FRUTSU1,_VEGESU1,_FRTLT1A,_VEGLT1A,_FRT16A,_VEG23A,_FRUITE1,_VEGETE1
0,1,1,1192021,1,19,2021,1100,2021000001,2021000001,1.0,...,1,1,100.0,214.0,1,1,1,1,0,0
1,1,1,1212021,1,21,2021,1100,2021000002,2021000002,1.0,...,1,1,100.0,128.0,1,1,1,1,0,0
2,1,1,1212021,1,21,2021,1100,2021000003,2021000003,1.0,...,1,1,100.0,71.0,1,2,1,1,0,0
3,1,1,1172021,1,17,2021,1100,2021000004,2021000004,1.0,...,1,1,114.0,165.0,1,1,1,1,0,0
4,1,1,1152021,1,15,2021,1100,2021000005,2021000005,1.0,...,1,1,100.0,258.0,1,1,1,1,0,0


### Step 1: Extract columns that are relevant to our problem, i.e., diabetes analysis

After ample secondary research, the following factors were identified as being closely related to diabetes:
- **BMI**: In obese individuals, the amount of nonesterified fatty acids, glycerol, hormones, cytokines, proinflammatory markers, and other substances that are involved in the development of insulin resistance, is increased.

- **Physical Exercise**: Inactivity increases visceral fat accumulation, stimulating chronic low-grade systemic inflammation and dependent comorbidities such as insulin resistance and Diabetes Mellitus.

- **Fruit Intake**: Fruits are an important part of a healthy diet, rich in nutrients such as vitamins, minerals, and antioxidants that are important for maintaining overall health. Some studies have suggested that increased fruit intake may be related to a lower risk of diabetes.

- **Vegetables Intake**: Vegetables are an important part of a healthy diet, and they are rich in nutrients such as vitamins, minerals, and fiber that are essential for maintaining overall health. Several studies have suggested that increased vegetable intake may be related to a lower risk of diabetes.

- **Alcohol Consumption Habit**: Regular heavy drinking can reduce the body's sensitivity to insulin, which can trigger type 2 diabetes. Diabetes is a common side effect of chronic pancreatitis, which may be caused by heavy drinking.

- **Smoker**: Tobacco use can raise your blood glucose (sugar) and reduce your body's ability to use insulin. In fact, people who smoke cigarettes are 30%–40% more likely to develop type 2 diabetes than people who don't smoke.

- **Mental Health Problem**: Psychiatric disorders can disrupt sleep and impair a person's metabolism, leading to an increased risk of diabetes. 

- **Physical Health Problem**: People with diabetes have a higher risk of physical health problems including heart attack, stroke and kidney failure.

- **Age**: According to a survey, the prevalence of diabetes and prediabetes among people aged 40–49 is 11.1% and 40.3%, respectively, while the prevalence of diabetes and prediabetes among people aged 60–69 has increased to 23.9% and 47.6%, respectively. Advanced age is a major risk factor for diabetes and prediabetes.

- **Difficulty walking or climbing**: Individuals with diabetes walk slower and with shorter step lengths, a longer stance phase, a wider base of support, greater step time variability on irregular surfaces, and improper pressure distribution at the foot compared with individuals without diabetes.

- **Chronic Health Conditions**: Over time, diabetes can damage blood vessels in the heart, eyes, kidneys and nerves. People with diabetes have a higher risk of health problems including heart attack, stroke and kidney failure.

- **High Blood Pressure**: Diabetes causes damage by scarring the kidneys, which in turn leads to salt and water retention, which in turn raises blood pressure.

- **Cholesterol**: Clinical studies have shown that increased cholesterol levels lead to deterioration of glucose tolerance, and that a high total cholesterol (TC) to high-density lipoprotein cholesterol (HDL-C) ratio can predict type 2 diabetes.

- **Education**: Holding other factors constant, an increase of a year of schooling decreases the hazard of being diagnosed with Diabetes Mellitus by 0.04.

- **Gender**: Worldwide, an estimated 17.7 million more men than women have diabetes mellitus.

- **Have any health insurance**: People without health insurance would be less inclined to check if they have diabetes. 

- **Afford to see doctor**: People who cannot afford to see a doctor have limited access to healthcare, which exacerbates diabetes prevalence.

- **General Health**: The most common long-term diabetes-related health problems are damage to the large blood vessels of the heart, brain and legs (macrovascular complications) and damage to the small blood vessels, causing problems in the eyes, kidneys, feet and nerves (microvascular complications).

- **Income**: Lower income often correlates with limited healthcare access and preventive measures, exacerbating diabetes prevalence.


Based on the factors identified through secondary research, relevant variables and data are extracted from the raw survey data file. Variables extracted include:
1. _BMI5
2. _TOTINDA
3. PHYSHLTH
4. MENTHLTH
5. _RFDRHV7
6. SMOKE100
7. _AGEG5YR
8. DIFFWALK
9. _SEX
10. CVDSTRK3
11. _MICHD
12. _RFHYPE6
13. TOLDHI3
14. _CHOLCH3
15. _FRTLT1A
16. _VEGLT1A
17. EDUCA
18. _HLTHPLN
19. GENHLTH
20. MEDCOST1
21. INCOME3

Refer to our data description file for a detailed description of each variable.

In [130]:
#Choose specific columns
diabetes_extracted = diabetes[['DIABETE4', '_RFHYPE6', 'TOLDHI3', '_CHOLCH3', '_BMI5', 'SMOKE100', 'CVDSTRK3', '_MICHD', '_TOTINDA', '_FRTLT1A', '_VEGLT1A', '_RFDRHV7', '_HLTHPLN', 'MEDCOST1', 'GENHLTH', 'MENTHLTH', 'PHYSHLTH', 'DIFFWALK', '_SEX', '_AGEG5YR', 'EDUCA', 'INCOME3' ]]

In [131]:
#Check number of rows and columns left
diabetes_extracted.shape

(438693, 22)
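
As an alternative to loading all 303 columns and then selecting, the relevant columns can be requested at read time. The following is a minimal sketch, assuming a small in-memory CSV as a stand-in for the survey file; `toy_csv` and `wanted` are illustrative names, not part of the original notebook.

```python
import io
import pandas as pd

# Toy stand-in for the survey export (the real file has 303 columns).
toy_csv = io.StringIO(
    "DIABETE4,_BMI5,SMOKE100,IDATE\n"
    "3,1454,1,1192021\n"
    "1,,2,1212021\n"
)

# Hypothetical subset of the variables listed above.
wanted = ["DIABETE4", "_BMI5", "SMOKE100"]

# usecols makes read_csv parse only the listed columns,
# which reduces memory use on wide files.
subset = pd.read_csv(toy_csv, usecols=wanted)
print(subset.shape)
```

With the real file, the same `usecols` argument would be passed to the `pd.read_csv` call in Step 0.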

In [132]:
diabetes_extracted.head()

Unnamed: 0,DIABETE4,_RFHYPE6,TOLDHI3,_CHOLCH3,_BMI5,SMOKE100,CVDSTRK3,_MICHD,_TOTINDA,_FRTLT1A,...,_HLTHPLN,MEDCOST1,GENHLTH,MENTHLTH,PHYSHLTH,DIFFWALK,_SEX,_AGEG5YR,EDUCA,INCOME3
0,3.0,1,1.0,1,1454.0,1.0,2.0,2.0,2,1,...,1,2.0,5.0,10.0,20.0,2.0,2,11,4.0,5.0
1,1.0,2,1.0,1,,2.0,2.0,1.0,1,1,...,1,2.0,3.0,88.0,88.0,1.0,2,10,6.0,77.0
2,1.0,2,2.0,1,2829.0,2.0,2.0,1.0,2,1,...,1,2.0,2.0,88.0,88.0,2.0,2,11,4.0,3.0
3,1.0,2,1.0,1,3347.0,2.0,2.0,2.0,1,1,...,1,2.0,2.0,10.0,88.0,2.0,2,9,4.0,7.0
4,1.0,1,1.0,1,2873.0,2.0,1.0,1.0,1,1,...,1,2.0,5.0,88.0,30.0,1.0,1,12,3.0,4.0


### Step 2: Tackle the Missing Values

There are missing values in many variable columns because survey respondents did not provide answers to these questions.

These missing values have to be tackled before we proceed with further analysis and model building. By calling the `.dropna()` method on the DataFrame, rows containing any missing values are eliminated.

In [133]:
diabetes_extracted = diabetes_extracted.dropna()
diabetes_extracted.shape

(330355, 22)
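
Because `.dropna()` removes a row if any single column is missing, it can help to see which columns account for the loss before dropping. The following is a minimal sketch on a toy frame; the values are made up for illustration and are not taken from the survey.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing answers (illustrative values only).
toy = pd.DataFrame({
    "DIABETE4": [3.0, 1.0, 1.0, np.nan],
    "_BMI5": [1454.0, np.nan, 2829.0, 3347.0],
    "SMOKE100": [1.0, 2.0, 2.0, 2.0],
})

# Count missing values per column before dropping anything.
missing_per_column = toy.isna().sum()

# Drop every row that has at least one missing value.
toy_complete = toy.dropna()
rows_dropped = len(toy) - len(toy_complete)

print(missing_per_column)
print(rows_dropped)
```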

### Step 3: Modify and clean the values to be more suitable to ML algorithms



In [134]:
# Make the target binary: 0 is for no diabetes, diabetes only during pregnancy, borderline diabetes or pre-diabetes; 1 is for diabetes. Remove all 7 (don't know) and 9 (refused)
diabetes_extracted['DIABETE4'] = diabetes_extracted['DIABETE4'].replace({2:0, 3:0, 1:1, 4:0})
diabetes_extracted = diabetes_extracted[diabetes_extracted.DIABETE4 != 7]
diabetes_extracted = diabetes_extracted[diabetes_extracted.DIABETE4 != 9]
diabetes_extracted.DIABETE4.unique()

array([0., 1.])
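
The same "recode, then drop the sentinel codes" pattern repeats for most variables in this step. It could be wrapped in a small helper. The sketch below is one possible way to do this; `recode_and_filter` is a hypothetical function that does not appear in the original notebook, and it assumes the sentinel codes never appear as recode targets.

```python
import pandas as pd

def recode_and_filter(df, column, mapping, drop_codes):
    """Hypothetical helper: drop rows holding sentinel codes, then recode."""
    # Keep only rows whose value is not a sentinel (e.g. 7 = don't know, 9 = refused).
    kept = df[~df[column].isin(drop_codes)].copy()
    # Apply the recoding on the copy, so the caller's frame is not modified.
    kept[column] = kept[column].replace(mapping)
    return kept

# Toy data (illustrative values only).
toy = pd.DataFrame({"SMOKE100": [1.0, 2.0, 7.0, 9.0, 2.0]})
cleaned = recode_and_filter(toy, "SMOKE100", mapping={2: 0}, drop_codes=[7, 9])
print(cleaned["SMOKE100"].tolist())
```

Working on an explicit `.copy()` also avoids the `SettingWithCopyWarning` that pandas can raise when assigning into a filtered frame.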

In [135]:
#Change 1 to 0 which represents No high blood pressure and 2 to 1 to represent high blood pressure
diabetes_extracted['_RFHYPE6'] = diabetes_extracted['_RFHYPE6'].replace({1:0, 2:1})
diabetes_extracted = diabetes_extracted[diabetes_extracted._RFHYPE6 != 9]
diabetes_extracted._RFHYPE6.unique()

array([0, 1])

In [136]:
# Change 2 to 0 because it is No, remove all 7 (dont knows) and all 9 (refused)
diabetes_extracted['TOLDHI3'] = diabetes_extracted['TOLDHI3'].replace({2:0})
diabetes_extracted = diabetes_extracted[diabetes_extracted.TOLDHI3 != 7]
diabetes_extracted = diabetes_extracted[diabetes_extracted.TOLDHI3 != 9]
diabetes_extracted.TOLDHI3.unique()

array([1., 0.])

In [137]:
# Change 3 to 0 and 2 to 0 for Not checked cholesterol in past 5 years, remove 9 (don't know/refused)
diabetes_extracted['_CHOLCH3'] = diabetes_extracted['_CHOLCH3'].replace({3:0,2:0})
diabetes_extracted = diabetes_extracted[diabetes_extracted._CHOLCH3 != 9]
diabetes_extracted._CHOLCH3.unique()

array([1, 0])

In [138]:
# _BMI5 stores BMI multiplied by 100, so divide by 100 and round to the nearest whole number
diabetes_extracted['_BMI5'] = diabetes_extracted['_BMI5'].div(100).round(0)
diabetes_extracted._BMI5.unique()

array([15., 28., 33., 29., 24., 46., 23., 40., 27., 35., 18., 30., 25.,
       36., 22., 31., 45., 26., 14., 38., 21., 32., 20., 19., 34., 41.,
       43., 44., 39., 37., 16., 42., 50., 51., 17., 52., 47., 49., 56.,
       57., 48., 58., 61., 53., 63., 64., 54., 68., 55., 62., 13., 59.,
       89., 66., 77., 60., 87., 69., 72., 75., 67., 71., 65., 82., 86.,
       70., 78., 12., 74., 98., 73., 84., 76., 80., 83., 79., 99., 88.,
       81., 90., 92., 91., 95., 85., 94.])
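
As a worked example of the scaling, the snippet below applies the same `div(100).round(0)` step to the four non-missing `_BMI5` values visible in the `head()` output of Step 1. The variable names are illustrative.

```python
import pandas as pd

# Raw _BMI5 values taken from the first rows shown in Step 1.
raw_bmi = pd.Series([1454.0, 2829.0, 3347.0, 2873.0])

# Undo the x100 scaling, then round to whole BMI units.
bmi = raw_bmi.div(100)
bmi_rounded = bmi.round(0)

print(bmi.tolist())
print(bmi_rounded.tolist())
```

The rounded values 15, 28, 33 and 29 match the `_BMI5` column in the cleaned `head()` output shown later in this step.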

In [139]:
# Change 2 to 0 because it is No, remove all 7 (dont knows) and all 9 (refused)
diabetes_extracted['SMOKE100'] = diabetes_extracted['SMOKE100'].replace({2:0})
diabetes_extracted = diabetes_extracted[diabetes_extracted.SMOKE100 != 7]
diabetes_extracted = diabetes_extracted[diabetes_extracted.SMOKE100 != 9]
diabetes_extracted.SMOKE100.unique()

array([1., 0.])

In [140]:
# Change 2 to 0 because it is No, remove all 7 (dont knows) and all 9 (refused)
diabetes_extracted['CVDSTRK3'] = diabetes_extracted['CVDSTRK3'].replace({2:0})
diabetes_extracted = diabetes_extracted[diabetes_extracted.CVDSTRK3 != 7]
diabetes_extracted = diabetes_extracted[diabetes_extracted.CVDSTRK3 != 9]
diabetes_extracted.CVDSTRK3.unique()

array([0., 1.])

In [141]:
# Change 2 to 0 because this means they do not have MI or CHD
diabetes_extracted['_MICHD'] = diabetes_extracted['_MICHD'].replace({2: 0})
diabetes_extracted._MICHD.unique()

array([0., 1.])

In [142]:
# Change 2 to 0 for no physical activites and remove all 9 (don't know/refused)
diabetes_extracted['_TOTINDA'] = diabetes_extracted['_TOTINDA'].replace({2:0})
diabetes_extracted = diabetes_extracted[diabetes_extracted._TOTINDA != 9]
diabetes_extracted._TOTINDA.unique()

array([0, 1])

In [143]:
# Change 2 to 0, meaning fruit consumed less than once per day. 1 means fruit consumed one or more times per day. Remove all 9 (don't know/refused)
diabetes_extracted['_FRTLT1A'] = diabetes_extracted['_FRTLT1A'].replace({2:0})
diabetes_extracted = diabetes_extracted[diabetes_extracted._FRTLT1A != 9]
diabetes_extracted._FRTLT1A.unique()


array([1, 0])

In [144]:
# Change 2 to 0, meaning vegetables consumed less than once per day. 1 means vegetables consumed one or more times per day. Remove all 9 (don't know/refused)
diabetes_extracted['_VEGLT1A'] = diabetes_extracted['_VEGLT1A'].replace({2:0})
diabetes_extracted = diabetes_extracted[diabetes_extracted._VEGLT1A != 9]
diabetes_extracted._VEGLT1A.unique()

array([1, 0])

In [145]:
# Change 1 to 0 (1 was no for heavy drinking). change all 2 to 1 (2 was yes for heavy drinking) and remove all 9 (don't know/refused)
diabetes_extracted['_RFDRHV7'] = diabetes_extracted['_RFDRHV7'].replace({1:0, 2:1})
diabetes_extracted = diabetes_extracted[diabetes_extracted._RFDRHV7 != 9]
diabetes_extracted._RFDRHV7.unique()

array([0, 1])

In [146]:
# In days, scale will be between 0-30
# Change 88 to 0 because it means none (no bad mental health days), remove all 77 (dont knows) and all 99 (refused)
diabetes_extracted['MENTHLTH'] = diabetes_extracted['MENTHLTH'].replace({88:0})
diabetes_extracted = diabetes_extracted[diabetes_extracted.MENTHLTH != 77]
diabetes_extracted = diabetes_extracted[diabetes_extracted.MENTHLTH != 99]
diabetes_extracted.MENTHLTH.unique()


array([10.,  0.,  5., 25.,  2.,  7., 30.,  3., 14., 20.,  8.,  1., 15.,
        4., 28., 24., 21., 12.,  6., 22., 27., 18., 13., 17., 16.,  9.,
       19., 29., 23., 11., 26.])

In [147]:
# In days, scale will be between 0-30
# Change 88 to 0 because it means none (no bad physical health days), remove all 77 (dont knows) and all 99 (refused)
diabetes_extracted['PHYSHLTH'] = diabetes_extracted['PHYSHLTH'].replace({88:0})
diabetes_extracted = diabetes_extracted[diabetes_extracted.PHYSHLTH != 77]
diabetes_extracted = diabetes_extracted[diabetes_extracted.PHYSHLTH != 99]
diabetes_extracted.PHYSHLTH.unique()

array([20.,  0., 30., 25.,  1.,  4., 10.,  2.,  3., 15.,  8., 13., 14.,
        5.,  7.,  6., 24., 29., 18.,  9., 16., 17., 26., 28., 12., 21.,
       27., 11., 19., 22., 23.])

In [148]:
# Change 2 to 0 for no, remove all 7 (dont knows) and all 9 (refused)
diabetes_extracted['DIFFWALK'] = diabetes_extracted['DIFFWALK'].replace({2:0})
diabetes_extracted = diabetes_extracted[diabetes_extracted.DIFFWALK != 7]
diabetes_extracted = diabetes_extracted[diabetes_extracted.DIFFWALK != 9]
diabetes_extracted.DIFFWALK.unique()

array([0., 1.])

In [149]:
#Change 2 to 0 (female as 0, male stays 1)
diabetes_extracted['_SEX'] = diabetes_extracted['_SEX'].replace({2:0})
diabetes_extracted._SEX.unique()

array([0, 1])

In [150]:
# 5 year increments. It is already ordinal. 1 is 18-24 all the way up to 13 which is 80 and older and remove all 14 (don't know or missing)
diabetes_extracted = diabetes_extracted[diabetes_extracted._AGEG5YR != 14]
diabetes_extracted._AGEG5YR.unique()

array([11,  9, 12, 13, 10,  7,  6,  8,  1,  4,  3,  5,  2])

In [151]:
# This is already an ordinal variable with 1 being never attended school or kindergarten only up to 6 being college 4 years or more
# Scale here is 1-6, remove all 9 (refused)
diabetes_extracted = diabetes_extracted[diabetes_extracted.EDUCA != 9]
diabetes_extracted.EDUCA.unique()

array([4., 3., 5., 6., 2., 1.])

In [152]:
# Change 2 to 0 for no health insurance and remove all 9 (don't know/refused)
diabetes_extracted['_HLTHPLN'] = diabetes_extracted['_HLTHPLN'].replace({2:0})
diabetes_extracted = diabetes_extracted[diabetes_extracted._HLTHPLN != 7]
diabetes_extracted = diabetes_extracted[diabetes_extracted._HLTHPLN != 9]
diabetes_extracted._HLTHPLN.unique()

array([1, 0])

In [153]:
# Change 2 to 0 for no, remove all 7 (dont knows) and remove all 9 (refused)
diabetes_extracted['MEDCOST1'] = diabetes_extracted['MEDCOST1'].replace({2:0})
diabetes_extracted = diabetes_extracted[diabetes_extracted.MEDCOST1 != 7]
diabetes_extracted = diabetes_extracted[diabetes_extracted.MEDCOST1 != 9]
diabetes_extracted.MEDCOST1.unique()

array([0., 1.])

In [154]:
# 1 is Excellent -> 5 is Poor, remove all 7 (dont knows) and all 9 (refused)
diabetes_extracted = diabetes_extracted[diabetes_extracted.GENHLTH != 7]
diabetes_extracted = diabetes_extracted[diabetes_extracted.GENHLTH != 9]
diabetes_extracted.GENHLTH.unique()

array([5., 2., 3., 4., 1.])

In [155]:
# INCOME3 = Income Level
# This is already an ordinal variable with 1 being less than $10,000 all the way up to 11 being $200,000 or more
# Remove all 77 (dont knows)
# Remove all 99 (refused)
diabetes_extracted = diabetes_extracted[diabetes_extracted.INCOME3 != 77]
diabetes_extracted = diabetes_extracted[diabetes_extracted.INCOME3 != 99]
diabetes_extracted.INCOME3.unique()

array([ 5.,  3.,  7.,  4.,  6.,  8.,  2.,  9., 10.,  1., 11.])

In [156]:
diabetes_extracted.shape

(236378, 22)
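
To put the cleaning steps in perspective, the row counts reported by the three `.shape` calls so far can be compared directly. The snippet below is plain arithmetic on those reported numbers; the variable names are illustrative.

```python
# Row counts reported by the .shape outputs in Steps 1 to 3.
rows_raw = 438693
rows_after_dropna = 330355
rows_after_recoding = 236378

# Rows removed by each stage.
dropped_by_dropna = rows_raw - rows_after_dropna
dropped_by_filters = rows_after_dropna - rows_after_recoding

# Share of the original respondents that remain after cleaning.
retained_share = rows_after_recoding / rows_raw

print(dropped_by_dropna, dropped_by_filters)
print(f"{retained_share:.1%}")
```

Dropping missing values removes 108,338 rows and the "don't know"/"refused" filters remove a further 93,977, leaving a little over half of the original respondents.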

In [157]:
diabetes_extracted.head()

Unnamed: 0,DIABETE4,_RFHYPE6,TOLDHI3,_CHOLCH3,_BMI5,SMOKE100,CVDSTRK3,_MICHD,_TOTINDA,_FRTLT1A,...,_HLTHPLN,MEDCOST1,GENHLTH,MENTHLTH,PHYSHLTH,DIFFWALK,_SEX,_AGEG5YR,EDUCA,INCOME3
0,0.0,0,1.0,1,15.0,1.0,0.0,0.0,0,1,...,1,0.0,5.0,10.0,20.0,0.0,0,11,4.0,5.0
2,1.0,1,0.0,1,28.0,0.0,0.0,1.0,0,1,...,1,0.0,2.0,0.0,0.0,0.0,0,11,4.0,3.0
3,1.0,1,1.0,1,33.0,0.0,0.0,0.0,1,1,...,1,0.0,2.0,10.0,0.0,0.0,0,9,4.0,7.0
4,1.0,0,1.0,1,29.0,0.0,1.0,1.0,1,1,...,1,0.0,5.0,0.0,30.0,1.0,1,12,3.0,4.0
5,0.0,0,0.0,1,24.0,1.0,0.0,0.0,0,0,...,1,0.0,3.0,0.0,0.0,1.0,1,13,5.0,6.0


In [158]:
#Check class sizes of the diabetes column
diabetes_extracted.groupby(['DIABETE4']).size()

DIABETE4
0.0    202810
1.0     33568
dtype: int64

### Step 4: Make feature names more readable

In [159]:
#Rename the columns to make them more readable
diabetes = diabetes_extracted.rename(columns = {'DIABETE4':'Diabetes_binary', '_RFHYPE6':'HighBP', 'TOLDHI3':'HighChol', '_CHOLCH3':'CholCheck', '_BMI5':'BMI', 'SMOKE100':'Smoker', 'CVDSTRK3':'Stroke', '_MICHD':'HeartDiseaseorAttack', '_TOTINDA':'PhysActivity', '_FRTLT1A':'Fruits', '_VEGLT1A':"Veg", '_RFDRHV7':'HvyAlcoholConsump', 'MENTHLTH':'MentHlth', 'PHYSHLTH':'PhysHlth', 'DIFFWALK':'DiffWalk', '_SEX':'Sex', '_AGEG5YR':'Age', 'EDUCA':'Education'})

In [160]:
diabetes.head()

Unnamed: 0,Diabetes_binary,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,_HLTHPLN,MEDCOST1,GENHLTH,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,INCOME3
0,0.0,0,1.0,1,15.0,1.0,0.0,0.0,0,1,...,1,0.0,5.0,10.0,20.0,0.0,0,11,4.0,5.0
2,1.0,1,0.0,1,28.0,0.0,0.0,1.0,0,1,...,1,0.0,2.0,0.0,0.0,0.0,0,11,4.0,3.0
3,1.0,1,1.0,1,33.0,0.0,0.0,0.0,1,1,...,1,0.0,2.0,10.0,0.0,0.0,0,9,4.0,7.0
4,1.0,0,1.0,1,29.0,0.0,1.0,1.0,1,1,...,1,0.0,5.0,0.0,30.0,1.0,1,12,3.0,4.0
5,0.0,0,0.0,1,24.0,1.0,0.0,0.0,0,0,...,1,0.0,3.0,0.0,0.0,1.0,1,13,5.0,6.0


In [161]:
diabetes.shape

(236378, 22)

In [162]:
diabetes.groupby(['Diabetes_binary']).size()

Diabetes_binary
0.0    202810
1.0     33568
dtype: int64
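
The `head()` output above still shows `_HLTHPLN`, `MEDCOST1`, `GENHLTH` and `INCOME3` under their survey names, because they are not keys in the rename mapping. A quick way to list such columns is sketched below on a toy frame; `rename_map` and `not_renamed` are illustrative names.

```python
import pandas as pd

# Toy frame with a few of the survey column names (no rows needed).
toy = pd.DataFrame(columns=["DIABETE4", "_BMI5", "GENHLTH", "INCOME3"])

# Partial mapping, mirroring the style of the rename call above.
rename_map = {"DIABETE4": "Diabetes_binary", "_BMI5": "BMI"}

# Columns that the mapping does not cover keep their original names.
not_renamed = [col for col in toy.columns if col not in rename_map]
renamed = toy.rename(columns=rename_map)

print(not_renamed)
print(list(renamed.columns))
```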

### Step 5: Creating a 50-50 binary balanced dataset

In [163]:
#Separate the 0s (no diabetes, including pre-diabetes) from the 1s (diabetes)
#Get the 1s
is1 = diabetes['Diabetes_binary'] == 1
diabetes_5050_1 = diabetes[is1]

#Get the 0s
is0 = diabetes['Diabetes_binary'] == 0
diabetes_5050_0 = diabetes[is0] 

#Select 33568 random cases from the 0 (non-diabetes) group to match the 33568 cases in the diabetes group
diabetes_5050_0_rand1 = diabetes_5050_0.take(np.random.permutation(len(diabetes_5050_0))[:33568])

#Combine the 33568 randomly selected 0s with the 33568 1s (pd.concat replaces the private DataFrame._append)
diabetes_5050 = pd.concat([diabetes_5050_0_rand1, diabetes_5050_1], ignore_index = True)

In [164]:
#Now we have a dataset of 67136 rows that is equally balanced with 50% 1 and 50% 0 for the target variable Diabetes_binary
diabetes_5050.head()

Unnamed: 0,Diabetes_binary,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,_HLTHPLN,MEDCOST1,GENHLTH,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,INCOME3
0,0.0,1,1.0,1,33.0,1.0,0.0,0.0,1,1,...,1,0.0,3.0,0.0,6.0,0.0,1,9,4.0,7.0
1,0.0,1,1.0,1,27.0,1.0,0.0,0.0,1,0,...,1,0.0,1.0,5.0,0.0,0.0,0,5,5.0,9.0
2,0.0,0,1.0,1,21.0,1.0,0.0,1.0,1,0,...,1,0.0,3.0,6.0,5.0,1.0,0,10,6.0,6.0
3,0.0,0,0.0,1,25.0,0.0,0.0,0.0,1,1,...,1,1.0,3.0,10.0,2.0,0.0,0,2,6.0,5.0
4,0.0,0,0.0,1,31.0,0.0,0.0,0.0,1,0,...,1,0.0,3.0,30.0,0.0,0.0,1,3,4.0,4.0


In [165]:
diabetes_5050.tail()

Unnamed: 0,Diabetes_binary,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,_HLTHPLN,MEDCOST1,GENHLTH,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,INCOME3
67131,1.0,1,0.0,1,27.0,0.0,0.0,0.0,1,1,...,1,0.0,3.0,0.0,0.0,0.0,1,11,5.0,6.0
67132,1.0,1,1.0,1,26.0,0.0,0.0,0.0,0,1,...,1,0.0,4.0,0.0,0.0,0.0,0,11,4.0,2.0
67133,1.0,1,1.0,1,32.0,0.0,0.0,1.0,1,0,...,1,1.0,2.0,10.0,0.0,0.0,1,8,6.0,6.0
67134,1.0,1,1.0,1,33.0,0.0,0.0,0.0,0,0,...,1,0.0,2.0,0.0,0.0,1.0,1,10,4.0,5.0
67135,1.0,1,1.0,1,21.0,0.0,0.0,0.0,1,1,...,1,0.0,4.0,0.0,0.0,0.0,1,10,2.0,3.0


In [166]:
#See the classes are perfectly balanced now
diabetes_5050.groupby(['Diabetes_binary']).size()

Diabetes_binary
0.0    33568
1.0    33568
dtype: int64
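
The balancing above hard-codes the minority class size and relies on NumPy's global random state. A reproducible alternative is sketched below, assuming pandas 1.1 or newer for `groupby(...).sample`; the toy data and the `random_state` value are illustrative and this is not the method used in the notebook.

```python
import pandas as pd

# Toy imbalanced data: eight 0s and three 1s (illustrative values only).
toy = pd.DataFrame({
    "Diabetes_binary": [0.0] * 8 + [1.0] * 3,
    "BMI": [24.0, 27.0, 31.0, 22.0, 29.0, 25.0, 33.0, 28.0, 35.0, 30.0, 41.0],
})

# Size of the smaller class, computed rather than hard-coded.
minority_size = int(toy["Diabetes_binary"].value_counts().min())

# Draw the same number of rows from each class with a fixed seed.
balanced = (
    toy.groupby("Diabetes_binary")
       .sample(n=minority_size, random_state=1)
       .reset_index(drop=True)
)

print(balanced["Diabetes_binary"].value_counts().to_dict())
```

Passing `random_state` to `sample` makes the draw repeatable regardless of any global seed.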

### Step 6: Export cleaned dataset to csv

In [167]:
diabetes_5050.to_csv('diabetes.csv', sep=",", index=False)
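
After exporting, it can be worth reading the file back to confirm that nothing changed on the way to disk. The following is a minimal sketch using a temporary directory and a toy frame; the file name `diabetes_toy.csv` is illustrative.

```python
import os
import tempfile
import pandas as pd

# Toy frame standing in for the balanced dataset (illustrative values only).
toy = pd.DataFrame({"Diabetes_binary": [0.0, 1.0], "BMI": [27.0, 33.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "diabetes_toy.csv")
    # Same export options as the cell above: comma separator, no index column.
    toy.to_csv(path, sep=",", index=False)
    reloaded = pd.read_csv(path)

# Compare shape, column names and values of the reloaded frame.
round_trip_ok = reloaded.equals(toy)
print(reloaded.shape, list(reloaded.columns), round_trip_ok)
```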