# Diabetes prediction machine learning project

This notebook is part of the final exam for DTSA 5509 Supervised learning. 

## Topic 
Diabetes is a chronic disease that affects millions of people around the world. It is correlated with other health-related risks and characteristics. Working with diabetes data to predict the disease and understand its correlation with other health-related factors helps increase our knowledge of the disease and improve prevention. The goal of this project is to use a dataset of health-related feature variables to predict diabetes (the target variable). This is a classification task.

The machine learning tasks include building several models, training them, and evaluating them. The evaluation step is crucial for assessing the quality of each model. The models selected for this work are:

- Logistic regression 
- SVM
- Random forest 

Before working on a model, data exploration and cleaning are necessary in order to understand the data and use it efficiently.

# Data 

The dataset used for this work is from the Behavioral Risk Factor Surveillance System (BRFSS). The original dataset contains survey responses from 441,456 people and has 330 features. The survey has been conducted in the United States since 1984. You can find more information about the dataset on [Dataset on Kaggle](https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset?select=diabetes_binary_5050split_health_indicators_BRFSS2015.csv).

The dataset used in this work is the first of the three datasets available: `diabetes_012_health_indicators_BRFSS2015.csv`. The target variable (`Diabetes_012`) has 3 classes and is imbalanced:

- 0: no diabetes or only during pregnancy;
- 1: prediabetes;
- 2: diabetes;

The dataset contains 21 feature variables.

In [106]:
import polars as pl  # like pandas, written in Rust
import json 
import plotly.express as px 
import plotly.graph_objects as go

from plotly.subplots import make_subplots
from pathlib import Path 


In [107]:
data_path = Path(".").resolve().parent.joinpath('data')
file_stem = '2015'

parquet_file = data_path.joinpath('interim', f'{file_stem}.parquet')
csv_file = data_path.joinpath('raw', f'{file_stem}.csv')

# If the Parquet file doesn't exist yet, create it. Parquet is a much more efficient format than CSV.
if parquet_file.exists():
    df = pl.read_parquet(parquet_file)
else:
    df = pl.read_csv(csv_file)
    df.write_parquet(parquet_file)
    
print(f'The dataframe has {df.shape[0]} rows and {df.shape[1]} columns.')

The dataframe has 441456 rows and 330 columns.


This dataset is very large. All the columns are described in the file `references/cookbook15_llcp.pdf`. There is also a JSON file that describes the columns, which we will load into the variable `features_dict`.

In [108]:
json_filename = '2015_formats.json'
json_path = data_path.joinpath('raw', json_filename)

with json_path.open('r') as fp: 
    features_dict = json.load(fp)

From the codebook, I selected a subset of features related to diabetes. I used [this](https://www.cdc.gov/pcd/issues/2019/19_0109.htm) publication to help with the feature selection. These features are:

In [109]:
selected_features = [
    'DIABETE3',
    'CVDSTRK3', 
    'DIFFWALK',
    'EDUCA',
    'GENHLTH',
    'HLTHPLN1',
    'INCOME2',
    'MEDCOST',
    'MENTHLTH',
    'PHYSHLTH',
    'SEX',
    'SMOKE100',
    'TOLDHI2',
    '_AGEG5YR',
    '_BMI5',
    '_CHOLCHK',
    '_FRTLT1',
    '_MICHD',
    '_RFDRHV5',
    '_RFHYPE5',
    '_TOTINDA',
    '_VEGLT1'
]

Category descriptions (from the JSON file and the codebook):

- **DIABETE3**: (Ever told) you have diabetes (If "Yes" and respondent is female, ask "Was this only when you were pregnant?". If Respondent says pre-diabetes or borderline diabetes, use response code 4.)
	 - 1: "Yes"
	 - 2: "Yes, but female told only during pregnancy"
	 - 3: "No"
	 - 4: "No, pre-diabetes or borderline diabetes"
	 - 7: "Dont know/Not Sure"
	 - 9: "Refused"
	 - .: "Not asked or Missing"
	 - .D: "DK/NS"
	 - .R: "REFUSED"
- **CVDSTRK3**: (Ever told) you had a stroke.
	 - 1: "Yes"
	 - 2: "No"
	 - 7: "Dont know/Not sure"
	 - 9: "Refused"
	 - .: "Not asked or Missing"
	 - .D: "DK/NS"
	 - .R: "REFUSED"
- **DIFFWALK**: Do you have serious difficulty walking or climbing stairs?
	 - 1: "Yes"
	 - 2: "No"
	 - 7: "Dont know/Not Sure"
	 - 9: "Refused"
	 - .: "Not asked or Missing"
	 - .D: "DK/NS"
	 - .R: "REFUSED"
- **EDUCA**: What is the highest grade or year of school you completed?
	 - 1: "Never attended school or only kindergarten"
	 - 2: "Grades 1 through 8 (Elementary)"
	 - 3: "Grades 9 through 11 (Some high school)"
	 - 4: "Grade 12 or GED (High school graduate)"
	 - 5: "College 1 year to 3 years (Some college or technical school)"
	 - 6: "College 4 years or more (College graduate)"
	 - 9: "Refused"
	 - .: "Not asked or Missing"
	 - .D: "DK/NS"
	 - .R: "REFUSED"
- **GENHLTH**: Would you say that in general your health is ?
	 - 1: "Excellent"
	 - 2: "Very good"
	 - 3: "Good"
	 - 4: "Fair"
	 - 5: "Poor"
	 - 7: "Dont know/Not Sure"
	 - 9: "Refused"
	 - .: "Not asked or Missing"
	 - .D: "DK/NS"
	 - .R: "REFUSED"
- **HLTHPLN1**: Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service?
	 - 1: "Yes"
	 - 2: "No"
	 - 7: "Dont know/Not Sure"
	 - 9: "Refused"
	 - .: "Not asked or Missing"
	 - .D: "DK/NS"
	 - .R: "REFUSED"
- **INCOME2**: Is your annual household income from all sources;
	 - 1: "Less than $10,000"
	 - 2: "Less than $15,000 ($10,000 to less than $15,000)"
	 - 3: "Less than $20,000 ($15,000 to less than $20,000)"
	 - 4: "Less than $25,000 ($20,000 to less than $25,000)"
	 - 5: "Less than $35,000 ($25,000 to less than $35,000)"
	 - 6: "Less than $50,000 ($35,000 to less than $50,000)"
	 - 7: "Less than $75,000 ($50,000 to less than $75,000)"
	 - 8: "$75,000 or more"
	 - 77: "Dont know/Not sure"
	 - 99: "Refused"
	 - .: "Not asked or Missing"
	 - .D: "DK/NS"
	 - .R: "REFUSED"
- **MEDCOST**: Was there a time in the past 12 months when you needed to see a doctor but could not because of cost?
	 - 1: "Yes"
	 - 2: "No"
	 - 7: "Dont know/Not sure"
	 - 9: "Refused"
	 - .: "Not asked or Missing"
	 - .D: "DK/NS"
	 - .R: "REFUSED"
- **MENTHLTH**: Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?
	 - 77: "Dont know/Not sure"
	 - 88: "None"
	 - 99: "Refused"
	 - .: "Not asked or Missing"
	 - .D: "DK/NS"
	 - .R: "REFUSED"
	 - 1       - 30: "Number of days"
- **PHYSHLTH**: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?
	 - 77: "Dont know/Not sure"
	 - 88: "None"
	 - 99: "Refused"
	 - .: "Not asked or Missing"
	 - .D: "DK/NS"
	 - .R: "REFUSED"
	 - 1       - 30: "Number of days"
- **SEX**: Indicate sex of respondent;
	 - 1: "Male"
	 - 2: "Female"
	 - 9: "Refused"
	 - .D: "DK/NS"
	 - .R: "REFUSED"
- **SMOKE100**: Have you smoked at least 100 cigarettes in your entire life? 
	 - 1: "Yes"
	 - 2: "No"
	 - 7: "Dont know/Not Sure"
	 - 9: "Refused"
	 - .: "Not asked or Missing"
	 - .D: "DK/NS"
	 - .R: "REFUSED"
- **TOLDHI2**: Have you EVER been told by a doctor, nurse or other health professional that your blood cholesterol is high?; 
	 - 1: "Yes"
	 - 2: "No"
	 - 7: "Dont know/Not Sure"
	 - 9: "Refused"
	 - .: "Not asked or Missing"
	 - .D: "DK/NS"
	 - .R: "REFUSED"
- **_AGEG5YR**: Fourteen-level age category;
	 - 1: "Age 18 to 24"
	 - 2: "Age 25 to 29"
	 - 3: "Age 30 to 34"
	 - 4: "Age 35 to 39"
	 - 5: "Age 40 to 44"
	 - 6: "Age 45 to 49"
	 - 7: "Age 50 to 54"
	 - 8: "Age 55 to 59"
	 - 9: "Age 60 to 64"
	 - 10: "Age 65 to 69"
	 - 11: "Age 70 to 74"
	 - 12: "Age 75 to 79"
	 - 13: "Age 80 or older"
	 - 14: "Dont know/Refused/Missing"
	 - .D: "DK/NS"
	 - .R: "REFUSED"
- **_BMI5**: Body Mass Index (BMI);
	 - .: "Dont know/Refused/Missing"
	 - .D: "DK/NS"
	 - .R: "REFUSED"
	 - 1       - 9999: "1 or greater"
- **_CHOLCHK**: Cholesterol check within past five years;
	 - 1: "Had cholesterol checked in past 5 years"
	 - 2: "Did not have cholesterol checked in past 5 years"
	 - 3: "Have never had cholesterol checked"
	 - 9: "Dont know/Not Sure Or Refused/Missing"
	 - .D: "DK/NS"
	 - .R: "REFUSED"
- **_FRTLT1**: Consume Fruit 1 or more times per day;
	 - 1: "Consumed fruit one or more times per day"
	 - 2: "Consumed fruit less than one time per day"
	 - 9: "Don't know, refused or missing values"
	 - .D: "DK/NS"
	 - .R: "REFUSED"
- **_MICHD**: Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI);
	 - 1: "Reported having MI or CHD"
	 - 2: "Did not report having MI or CHD"
	 - .: "Not asked or Missing"
	 - .D: "DK/NS"
	 - .R: "REFUSED"
- **_RFDRHV5**: Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week);
	 - 1: "No"
	 - 2: "Yes"
	 - 9: "Dont know/Refused/Missing"
	 - .D: "DK/NS"
	 - .R: "REFUSED"
- **_RFHYPE5**: Adults who have been told they have high blood pressure by a doctor, nurse, or other health professional;
	 - 1: "No"
	 - 2: "Yes"
	 - 9: "Dont know/Not Sure/Refused/Missing"
	 - .D: "DK/NS"
	 - .R: "REFUSED"
- **_TOTINDA**: Adults who reported doing physical activity or exercise during the past 30 days other than their regular job;
	 - 1: "Had physical activity or exercise"
	 - 2: "No physical activity or exercise in last 30 days"
	 - 9: "Dont know/Refused/Missing"
	 - .D: "DK/NS"
	 - .R: "REFUSED"
- **_VEGLT1**: Consume Vegetables 1 or more times per day;
	 - 1: "Consumed vegetables one or more times per day"
	 - 2: "Consumed vegetables less than one time per day"
	 - 9: "Don't know, refused or missing values"
	 - .D: "DK/NS"
	 - .R: "REFUSED"


In [110]:
df = df.select(selected_features)
df.head()

DIABETE3,CVDSTRK3,DIFFWALK,EDUCA,GENHLTH,HLTHPLN1,INCOME2,MEDCOST,MENTHLTH,PHYSHLTH,SEX,SMOKE100,TOLDHI2,_AGEG5YR,_BMI5,_CHOLCHK,_FRTLT1,_MICHD,_RFDRHV5,_RFHYPE5,_TOTINDA,_VEGLT1
f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
3.0,2.0,1.0,4.0,5.0,1.0,3.0,2.0,18.0,15.0,2.0,1.0,1.0,9.0,4018.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0
3.0,2.0,2.0,6.0,3.0,2.0,1.0,1.0,88.0,88.0,2.0,1.0,2.0,7.0,2509.0,2.0,2.0,2.0,1.0,1.0,1.0,2.0
3.0,1.0,,4.0,4.0,1.0,99.0,2.0,88.0,15.0,2.0,,1.0,11.0,2204.0,1.0,9.0,,9.0,1.0,9.0,9.0
3.0,2.0,1.0,4.0,5.0,1.0,8.0,1.0,30.0,30.0,2.0,2.0,1.0,9.0,2819.0,1.0,1.0,2.0,1.0,2.0,2.0,2.0
3.0,2.0,2.0,5.0,5.0,1.0,77.0,2.0,88.0,20.0,2.0,2.0,2.0,9.0,2437.0,1.0,9.0,2.0,1.0,1.0,2.0,1.0


In [111]:
df.describe()

describe,DIABETE3,CVDSTRK3,DIFFWALK,EDUCA,GENHLTH,HLTHPLN1,INCOME2,MEDCOST,MENTHLTH,PHYSHLTH,SEX,SMOKE100,TOLDHI2,_AGEG5YR,_BMI5,_CHOLCHK,_FRTLT1,_MICHD,_RFDRHV5,_RFHYPE5,_TOTINDA,_VEGLT1
str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
count,441456.0,441456.0,441456.0,441456.0,441456.0,441456.0,441456.0,441456.0,441456.0,441456.0,441456.0,441456.0,441456.0,441456.0,441456.0,441456.0,441456.0,441456.0,441456.0,441456.0,441456.0,441456.0
null_count,7.0,0.0,12334.0,0.0,2.0,0.0,3301.0,1.0,0.0,1.0,0.0,14255.0,59154.0,0.0,36398.0,0.0,0.0,3942.0,0.0,0.0,0.0,0.0
mean,2.757888,1.97388,1.8566,4.920094,2.57879,1.101201,20.253013,1.916066,64.679178,60.655113,1.576542,1.613987,1.630876,7.803623,2804.2424,1.533609,2.131746,1.911699,1.516312,1.42841,1.931871,2.109316
std,0.723319,0.348689,0.579838,1.076198,1.117585,0.512261,31.853507,0.415414,35.843085,37.055684,0.494107,0.74653,0.740235,3.495609,665.463433,1.555462,2.322882,0.283733,1.87458,0.646749,2.209728,2.522517
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1202.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,3.0,2.0,2.0,4.0,2.0,1.0,5.0,2.0,28.0,15.0,1.0,1.0,1.0,5.0,2373.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0
50%,3.0,2.0,2.0,5.0,2.0,1.0,7.0,2.0,88.0,88.0,2.0,2.0,2.0,8.0,2695.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0
75%,3.0,2.0,2.0,6.0,3.0,1.0,8.0,2.0,88.0,88.0,2.0,2.0,2.0,10.0,3090.0,1.0,2.0,2.0,1.0,2.0,2.0,2.0
max,9.0,9.0,9.0,9.0,9.0,9.0,99.0,9.0,99.0,99.0,2.0,9.0,9.0,14.0,9995.0,9.0,9.0,2.0,9.0,9.0,9.0,9.0


Let's define a dictionary of the most common feature types with their valid and invalid values; it will be useful later.

In [127]:
features_dtypes = {
    'binary':
        {
            'features': [
                'CVDSTRK3',
                'DIFFWALK',
                'HLTHPLN1',
                'MEDCOST',
                'SEX',
                'SMOKE100',
                'TOLDHI2',
                '_CHOLCHK',
                '_FRTLT1',
                '_MICHD',
                '_RFDRHV5',
                '_RFHYPE5',
                '_TOTINDA',
                '_VEGLT1',
            ],
            'acc_raw_vals':
                [1, 2],
            'rej_raw_vals':
                [7, 9],
        },
    'monthly':
        {
            'features': [
                'MENTHLTH',
                'PHYSHLTH',
            ],
            'acc_raw_vals':
                list(range(1, 31)),
            'rej_raw_vals':
                [77, 88, 99],
        },
    'continuous':
        {
            'features': [
                '_BMI5'
            ], 
            'acc_raw_vals':
                None,
            'rej_raw_vals':
                None,
        }
}
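One way a dictionary like this can be used is to count, per feature, the raw codes that fall outside the accepted list before filtering anything. The sketch below is plain Python so it runs without the survey data; `feature_rules`, `count_rejected` and the sample values are hypothetical names and made-up data, not part of the notebook.

```python
from collections import Counter

# Hypothetical, trimmed-down copy of the dictionary above (two features only).
feature_rules = {
    "binary": {"features": ["CVDSTRK3"], "acc_raw_vals": [1, 2]},
    "monthly": {"features": ["MENTHLTH"], "acc_raw_vals": list(range(1, 31))},
}


def count_rejected(columns, rules):
    """Return {feature: Counter of raw values outside the accepted list}."""
    rejected = {}
    for group in rules.values():
        accepted = set(group["acc_raw_vals"])
        for name in group["features"]:
            rejected[name] = Counter(v for v in columns[name] if v not in accepted)
    return rejected


# Made-up sample values, only to exercise the helper.
sample = {
    "CVDSTRK3": [1, 2, 2, 7, 9, 2],
    "MENTHLTH": [18, 88, 88, 30, 77, 5],
}
report = count_rejected(sample, feature_rules)
print(report)
```

Looking at such a report first makes it visible which codes a later filter would discard, for instance code 88 in `MENTHLTH`.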

## Cleaning

In [113]:
null_cnts = (
    df
    .null_count()
    .transpose(include_header=True)
    .rename({'column': 'x', 'column_0': 'y'})
)
px.bar(null_cnts, 
       x='x', 
       y='y', 
       log_y=True, 
       text_auto=True,
       title="Logarithmic plot of the null counts"
       )

In [114]:
print(f"Shape before nulls dropping: {df.shape}")
df = df.drop_nulls()
print(f"Shape after nulls dropping: {df.shape}")

Shape before nulls dropping: (441456, 22)
Shape after nulls dropping: (343606, 22)
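To put these shapes in perspective, the short calculation below relates the number of dropped rows to the per-column null counts. It is only an arithmetic sketch: the numbers are copied from the `null_count` row of `df.describe()` and from the printed shapes, and the variable names are illustrative.

```python
# Values copied from the `null_count` row of `df.describe()` and the printed shapes.
n_rows = 441456
rows_after_drop = 343606
null_counts = {
    "TOLDHI2": 59154,
    "_BMI5": 36398,
    "SMOKE100": 14255,
    "DIFFWALK": 12334,
    "_MICHD": 3942,
    "INCOME2": 3301,
    "DIABETE3": 7,
    "GENHLTH": 2,
    "MEDCOST": 1,
    "PHYSHLTH": 1,
}

dropped = n_rows - rows_after_drop
total_null_cells = sum(null_counts.values())
null_share = {name: cnt / n_rows for name, cnt in null_counts.items()}

print(f"Rows dropped: {dropped} ({dropped / n_rows:.1%})")
top3 = sorted(null_share.items(), key=lambda kv: kv[1], reverse=True)[:3]
for name, share in top3:
    print(f"{name}: {share:.1%} of rows are null")
print(f"Null cells in total: {total_null_cells}")
```

The total number of null cells is larger than the number of dropped rows, which means some rows are missing several features at once. `TOLDHI2` and `_BMI5` account for most of the loss.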


In [25]:
duplicated_cnt = (
    df
    .is_duplicated()
    .sum()
)

print(f'There are {duplicated_cnt} duplicates')

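# `is_duplicated()` flags every copy of a repeated row, so this filter
# removes all copies, including the first occurrence.
df = (
    df 
    .filter(
        (~df.is_duplicated())
    )
)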
print(f"Dataframe shape after duplicates removal: {df.shape}")

There are 7885 duplicates
Dataframe shape after duplicates removal: (335721, 22)
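The number of rows removed equals the number of flagged rows (343606 - 7885 = 335721), so every copy of a repeated row is discarded rather than all but one. The plain-Python sketch below contrasts that behaviour with keeping the first occurrence; the rows are made up and the variable names are hypothetical. Whether identical answer vectors from a survey are true duplicates at all is an assumption worth checking.

```python
from collections import Counter

# Made-up rows standing in for survey records (tuples of coded answers).
rows = [(3, 2, 1), (1, 2, 2), (3, 2, 1), (3, 1, 2), (1, 2, 2), (3, 2, 1)]
counts = Counter(rows)

# Number of rows that occur more than once (what a "duplicated" flag counts).
flagged = sum(1 for r in rows if counts[r] > 1)

# Same idea as filtering on the negated flag: every copy of a repeated row goes.
drop_all_copies = [r for r in rows if counts[r] == 1]

# Alternative: keep the first occurrence of each row, preserving order.
keep_first = list(dict.fromkeys(rows))

print(flagged, drop_all_copies, keep_first)
```

With the first strategy only rows that were unique from the start survive; with the second, one representative of each repeated row is kept.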


In [115]:
fig = make_subplots(rows=6, 
                    cols=4,
                    subplot_titles=selected_features)

for idx, f in enumerate(selected_features):    
    data = df.group_by(f).agg(pl.count()).sort(f).to_dict()
    r, c = divmod(idx, 4)
    fig.add_trace(
        go.Bar(
            x=data[f],
            y=data['count'],
            ), 
        row=r+1, 
        col=c+1
        )
    
fig.update_layout(height=1200, 
                  showlegend=False,
                  title="Values count by features"
                  )
fig.update_yaxes(type='log')
fig.show()




First, let's take a look at the binary features: `CVDSTRK3`, `DIFFWALK`, `HLTHPLN1`, `MEDCOST`, `SEX`, `SMOKE100`, `TOLDHI2`, `_CHOLCHK`, `_FRTLT1`, `_MICHD`, `_RFDRHV5`, `_RFHYPE5`, `_TOTINDA` and `_VEGLT1`. These features are supposed to be binary, so every value higher than 2 stands for a missing value or another non-binary answer (see the category descriptions at the beginning of the chapter for the exact meaning of each code). Note that `_CHOLCHK` also has a valid code 3 ("Have never had cholesterol checked"), which is dropped along with the others. We'll offset the values by -1 to get 0/1 data; 0 then stands for raw code 1 and 1 for raw code 2, so the meaning depends on the feature (0 is "Yes" for `CVDSTRK3` but "No" for `_RFHYPE5`). For `MENTHLTH` and `PHYSHLTH`, 77 and 99 are non-answers and 88 means "None" (zero days); all three codes are filtered out here. For `INCOME2`, 77 and 99 are non-answers and are filtered out as well. Finally, `_AGEG5YR` equal to 14, `EDUCA` equal to 9 and `GENHLTH` equal to 7 or 9 are missing values too. We'll focus on the target variable later; for now, let's filter these features and shift their values to start at 0.

In [134]:
df_f = (
    df 
    .filter(
        [
            pl.col(features_dtypes['binary']['features']).is_in(features_dtypes['binary']['acc_raw_vals']),
            (~pl.col(features_dtypes['monthly']['features']).is_in(features_dtypes['monthly']['rej_raw_vals'])),
            (~pl.col('INCOME2').is_in([77, 88, 99])),
            (pl.col('_AGEG5YR') != 14),
            (pl.col('EDUCA') != 9),
            (~pl.col('GENHLTH').is_in([7, 9])),
               
        ]
    )
    .with_columns(
        [
            (pl.col(features_dtypes['binary']['features']) - 1).cast(pl.Int8),
            (pl.col(features_dtypes['monthly']['features']) - 1).cast(pl.Int8),
            (pl.col('INCOME2') - 1 ).cast(pl.Int8),
            (pl.col('_AGEG5YR') - 1 ).cast(pl.Int8),
            (pl.col('EDUCA') - 1 ).cast(pl.Int8),
            (pl.col('GENHLTH') - 1 ).cast(pl.Int8),
        ]
    )
)
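Because code 88 means "None" for `MENTHLTH` and `PHYSHLTH`, an alternative to dropping it would be to recode it as zero days and keep those respondents. The sketch below shows that recoding rule in plain Python; it is not what the cell above does, and `recode_days` is a hypothetical helper.

```python
def recode_days(raw):
    """Map a raw MENTHLTH/PHYSHLTH code to a number of days, or None if unknown."""
    if raw == 88:          # "None": zero days of poor health
        return 0
    if raw in (77, 99):    # "Dont know/Not sure" and "Refused"
        return None
    if 1 <= raw <= 30:     # already a day count
        return int(raw)
    return None            # anything else is treated as invalid


raw_values = [18.0, 88.0, 30.0, 77.0, 99.0, 1.0]
days = [recode_days(v) for v in raw_values]
print(days)
```

With this rule the day counts keep their natural 0 to 30 range, so no -1 shift would be needed for these two features.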

In [135]:
fig = make_subplots(rows=6, 
                    cols=4,
                    subplot_titles=selected_features)

for idx, f in enumerate(selected_features):    
    data = df_f.group_by(f).agg(pl.count()).sort(f).to_dict()
    r, c = divmod(idx, 4)
    fig.add_trace(
        go.Bar(
            x=data[f],
            y=data['count'],
            ), 
        row=r+1, 
        col=c+1
        )
    
fig.update_layout(height=1200, 
                  showlegend=False,
                  title="Values count by features after filtering"
                  )
fig.update_yaxes(type='log')
fig.show()

Now let's take a look at the continuous feature `_BMI5`.

In [158]:
fig = go.Figure(
    data=go.Violin(
        y=df_f.select('_BMI5').to_dict()['_BMI5'], 
        box_visible=True, 
        line_color='black', 
        meanline_visible=True,
        opacity=0.7, 
        x0='BMI',
        fillcolor='lightseagreen'
    )
)
fig.update_layout(
    width=700, 
    title='Violin plot for "_BMI5"', 
)
fig.show()

There are no obvious outliers, so we'll keep all the data for the next steps.
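When reading the plot, keep in mind that the BRFSS codebook stores `_BMI5` with two implied decimal places, so a raw value of 4018 corresponds to a BMI of 40.18. A minimal conversion sketch, assuming that convention (`bmi_from_raw` is a hypothetical helper; the raw values come from the `df.head()` and `df.describe()` outputs above):

```python
def bmi_from_raw(raw_bmi5):
    """Convert a raw `_BMI5` value (two implied decimal places) to kg/m^2."""
    return raw_bmi5 / 100


# First two rows of df.head(), then the min and max from df.describe().
for raw in (4018.0, 2509.0, 1202.0, 9995.0):
    print(raw, "->", bmi_from_raw(raw))
```

On that scale the observed range before cleaning runs from about 12 to almost 100 kg/m^2.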

Last but not least, we have to clean the target variable, `DIABETE3`. As a reminder, here are the values that `DIABETE3` can take:

- 1: "Yes"
- 2: "Yes, but female told only during pregnancy"
- 3: "No"
- 4: "No, pre-diabetes or borderline diabetes"
- 7: "Dont know/Not Sure"
- 9: "Refused"
- .: "Not asked or Missing"
- .D: "DK/NS"
- .R: "REFUSED"

In [168]:
fig = px.bar(
    data_frame=df_f.group_by('DIABETE3').agg(pl.count()).sort('DIABETE3').to_dicts(), 
    x='DIABETE3', 
    y='count',
    title='Diabetes values counts',
    width=600, 
    text_auto=True
)
fig.show()