# Lab Five: Wide and Deep Networks

Team: Miro Ronac, Kirk Watson, Brandon Vincitore

## 1. Preparation

This dataset can help identify prediabetes or diabetes in patients and assist doctors in making accurate assessments from a variety of health indicators.

Every year, the CDC collects data from a health-related telephone survey called the Behavioral Risk Factor Surveillance System. The data gathered from these surveys include information on “health-related risk behaviors, chronic health conditions, and use of preventive services.” This dataset focuses on responses from 2015 and diabetes, a “prevalent chronic disease in the United States.”

The goal of analyzing this dataset is to identify patients with prediabetes or diabetes more efficiently and accurately. With this capability, a likely diabetes case can be flagged faster than when a human makes the diagnosis. Diabetes is extremely common in the US: about 1 in 10 Americans have diabetes, and [about 1 in 5 people with diabetes don’t know they have it](https://www.cdc.gov/diabetes/library/spotlights/diabetes-facts-stats.html#:~:text=37.3%20million%20Americans%E2%80%94about%201,t%20know%20they%20have%20it.). In addition, 1 in 3 Americans have prediabetes, and [more than 8 in 10 adults with prediabetes don’t know they have it](https://www.cdc.gov/diabetes/library/spotlights/diabetes-facts-stats.html#:~:text=37.3%20million%20Americans%E2%80%94about%201,t%20know%20they%20have%20it.). Using this classifier, patients who might be at risk of diabetes can be brought to a doctor’s attention more often, allowing for earlier medical care.

Dataset Source: https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset

### 1.1 Define and prepare class variables

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.read_csv('diabetes_012_health_indicators_BRFSS2015.csv')

# One-hot encode the three-class target
# (0 = no diabetes, 1 = prediabetes, 2 = diabetes)
target_headers = ['NoDiabetes', 'PreDiabetes', 'Diabetes']
for i, name in enumerate(target_headers):
    # Insert each indicator column at the front of the original dataframe
    df.insert(i, name, (df['Diabetes_012'] == i).astype('float64'))
# Remove the original categorical target column
df_target = df.drop('Diabetes_012', axis=1)
# Drop the last 252,000 rows, keeping only the first 1,680
df_target.drop(df_target.tail(252000).index, inplace=True)
    
print(df_target.info())

#show df
pd.set_option('display.max_columns', None)
df_target

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1680 entries, 0 to 1679
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   NoDiabetes            1680 non-null   float64
 1   PreDiabetes           1680 non-null   float64
 2   Diabetes              1680 non-null   float64
 3   HighBP                1680 non-null   float64
 4   HighChol              1680 non-null   float64
 5   CholCheck             1680 non-null   float64
 6   BMI                   1680 non-null   float64
 7   Smoker                1680 non-null   float64
 8   Stroke                1680 non-null   float64
 9   HeartDiseaseorAttack  1680 non-null   float64
 10  PhysActivity          1680 non-null   float64
 11  Fruits                1680 non-null   float64
 12  Veggies               1680 non-null   float64
 13  HvyAlcoholConsump     1680 non-null   float64
 14  AnyHealthcare         1680 non-null   float64
 15  NoDocbcCost          

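| | NoDiabetes | PreDiabetes | Diabetes | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | Veggies | HvyAlcoholConsump | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 40.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 5.0 | 18.0 | 15.0 | 1.0 | 0.0 | 9.0 | 4.0 | 3.0 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 25.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 6.0 | 1.0 |
| 2 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 28.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 5.0 | 30.0 | 30.0 | 1.0 | 0.0 | 9.0 | 4.0 | 8.0 |
| 3 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 27.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 3.0 | 6.0 |
| 4 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 24.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 11.0 | 5.0 | 4.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1675 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 20.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 | 5.0 | 4.0 |
| 1676 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 30.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 4.0 | 0.0 | 25.0 | 1.0 | 0.0 | 9.0 | 5.0 | 3.0 |
| 1677 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 42.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 15.0 | 0.0 | 0.0 | 0.0 | 8.0 | 6.0 | 8.0 |
| 1678 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 28.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 4.0 | 15.0 | 22.0 | 1.0 | 1.0 | 11.0 | 6.0 | 5.0 |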


##### We aim to use our dataset to predict a patient's diabetes status (no diabetes, prediabetes, or diabetes). Each patient was asked 21 questions regarding their health, habits, education, and income history. Although every feature has the same data type (float64), the documentation on Kaggle identifies 4 distinct categorical features: GenHlth, Age, Education, and Income.

##### Each category is described as follows:

###### GenHlth - "General Health"

    1 = Excellent 
    2 = Very Good                            
    3 = Good                            
    4 = Fair                            
    5 = Poor

###### Age - "13-level age category"

    1 = 18 - 24 yrs
    2 = 25 - 29 yrs
    3 = 30 - 34 yrs
    4 = 35 - 39 yrs
    5 = 40 - 44 yrs
    6 = 45 - 49 yrs
    7 = 50 - 54 yrs
    8 = 55 - 59 yrs
    9 = 60 - 64 yrs
    10 = 65 - 69 yrs
    11 = 70 - 74 yrs
    12 = 75 - 79 yrs
    13 = 80 yrs or older

###### Education - "Education Level"

    1 = Never attended school or only kindergarten
    2 = Grades 1 through 8 (Elementary)
    3 = Grades 9 through 11 (Some high school)
    4 = Grade 12 or GED (High school graduate)
    5 = College 1 year to 3 years (Some college or technical school)       
    6 = College 4 years or more (College graduate)
                            
###### Income - "Income Scale"  

    1 = less than $10,000
    2 = less than $15,000, more than $10,000
    3 = less than $20,000, more than $15,000
    4 = less than $25,000, more than $20,000
    5 = less than $35,000, more than $25,000
    6 = less than $50,000, more than $35,000
    7 = less than $75,000, more than $50,000
    8 = more than $75,000
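The code tables above can be kept as lookup dictionaries so that integer codes stay readable during exploration. The following is a minimal sketch under stated assumptions: the `GENHLTH_LABELS` name and the toy rows are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Code-to-label lookup copied from the GenHlth table above
# (GENHLTH_LABELS is a hypothetical name, not from the notebook)
GENHLTH_LABELS = {1: "Excellent", 2: "Very Good", 3: "Good", 4: "Fair", 5: "Poor"}

# Hypothetical toy responses, stored as floats like the raw CSV columns
toy = pd.DataFrame({"GenHlth": [5.0, 3.0, 1.0]})

# Cast to int to make the key type explicit, then map codes to labels
toy["GenHlth_label"] = toy["GenHlth"].astype(int).map(GENHLTH_LABELS)
print(toy)
```

The same pattern would apply to Age, Education, and Income by building one dictionary per table.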

In [2]:
# Check for missing values
df_target.isnull().sum().any()

False

In [3]:
# Encoding categorical features and normalizing numerical variables
from sklearn.preprocessing import LabelEncoder 
from sklearn.preprocessing import StandardScaler
# ========================================================
from copy import deepcopy
df_encode = deepcopy(df_target)
# ========================================================
# for every categorical variable, encode as integer
# Keras will use the integer variable to figure out how to one-hot encode    
encoders = dict() # save each encoder in dictionary
categorical_headers = ['GenHlth','Age','Education','Income']

for col in categorical_headers:
    encoders[col] = LabelEncoder() # save the encoder
    df_encode[col] = encoders[col].fit_transform(df_encode[col])
# ========================================================
# scale the numeric, continuous variables
numerical_headers = ["BMI", "MentHlth", "PhysHlth"]

for col in numerical_headers:
    ss = StandardScaler()
    df_encode[col] = ss.fit_transform(df_encode[col].values.reshape(-1,1))
# ========================================================
# making binary variables integers
binary_headers = ['NoDiabetes','PreDiabetes','Diabetes','HighBP','HighChol','CholCheck','Smoker','Stroke','HeartDiseaseorAttack',
                  'PhysActivity','Fruits','Veggies','HvyAlcoholConsump','AnyHealthcare','NoDocbcCost','DiffWalk','Sex',]
for col in binary_headers:
    df_encode[col] = df_encode[col].astype('int64')
df_encode.head()

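| | NoDiabetes | PreDiabetes | Diabetes | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | Veggies | HvyAlcoholConsump | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1.703564 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 4 | 1.727369 | 0.963016 | 1 | 0 | 8 | 3 | 2 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | -0.672844 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 2 | -0.477318 | -0.564619 | 0 | 0 | 6 | 5 | 0 |
| 2 | 1 | 0 | 0 | 1 | 1 | 1 | -0.197562 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 4 | 3.19716 | 2.490651 | 1 | 0 | 8 | 3 | 7 |
| 3 | 1 | 0 | 0 | 1 | 0 | 1 | -0.35599 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | -0.477318 | -0.564619 | 0 | 0 | 10 | 2 | 5 |
| 4 | 1 | 0 | 0 | 1 | 1 | 1 | -0.831271 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | -0.10987 | -0.564619 | 0 | 0 | 10 | 4 | 3 |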


In [4]:
# ========================================================
# Define features and target
X = df_encode.drop(target_headers, axis=1)
y = df_encode[target_headers]
# ========================================================
X.info()
y.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1680 entries, 0 to 1679
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   HighBP                1680 non-null   int64  
 1   HighChol              1680 non-null   int64  
 2   CholCheck             1680 non-null   int64  
 3   BMI                   1680 non-null   float64
 4   Smoker                1680 non-null   int64  
 5   Stroke                1680 non-null   int64  
 6   HeartDiseaseorAttack  1680 non-null   int64  
 7   PhysActivity          1680 non-null   int64  
 8   Fruits                1680 non-null   int64  
 9   Veggies               1680 non-null   int64  
 10  HvyAlcoholConsump     1680 non-null   int64  
 11  AnyHealthcare         1680 non-null   int64  
 12  NoDocbcCost           1680 non-null   int64  
 13  GenHlth               1680 non-null   int64  
 14  MentHlth              1680 non-null   float64
 15  PhysHlth             

#### Final dataset description

When preparing this dataset, we standardized all continuous data (BMI, MentHlth, PhysHlth) and integer-encoded all categorical data (GenHlth, Age, Education, Income) with `LabelEncoder` so that Keras can one-hot encode it later. In addition, we split our "Diabetes_012" prediction target into 3 one-hot features (NoDiabetes, PreDiabetes, Diabetes). We also checked for null values (none present), converted all binary feature values to int64, and kept only the first 1,680 rows.

We are performing a classification task with this dataset. Our objective is to be able to determine whether a patient might have diabetes or prediabetes.
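To make the two preprocessing steps concrete, here is a minimal sketch on a toy frame. It assumes hypothetical rows that reuse two of the dataset's column names; only `LabelEncoder` and `StandardScaler` come from the notebook. It shows why a GenHlth value of 5.0 in the raw data appears as 4 after encoding: `LabelEncoder` maps the sorted unique values to 0..k-1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy rows (not taken from the dataset)
toy = pd.DataFrame({"BMI": [20.0, 30.0, 40.0], "GenHlth": [5.0, 3.0, 5.0]})

# LabelEncoder maps the sorted unique values to consecutive integers from 0
enc = LabelEncoder()
toy["GenHlth_code"] = enc.fit_transform(toy["GenHlth"])

# StandardScaler centers each column to mean 0 and scales to unit variance
scaler = StandardScaler()
toy["BMI_scaled"] = scaler.fit_transform(toy[["BMI"]]).ravel()

# inverse_transform recovers the original survey codes from the integers
recovered = enc.inverse_transform(toy["GenHlth_code"])
print(toy)
print(recovered)
```

Keeping the fitted encoder around, as the notebook does in its `encoders` dictionary, is what makes the reverse mapping possible.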

### 1.2 Cross-product features

In [5]:
cross_columns = [['Education', 'Income'],
                 ['Education','Income','Age'],
                 ['GenHlth','Education','Age'],
                 ['GenHlth', 'Income']
                ]

cross_col_df_names= []
for cols_list in cross_columns:
    enc = LabelEncoder()
    
    # 1. create crossed labels by join operation
    X_crossed = X[cols_list].astype(str).agg('_'.join, axis=1)
    
    # get a nice name for this new crossed column
    cross_col_name = '_'.join(cols_list)
    # 2. encode as integers, stacking all possibilities
    enc.fit(X_crossed)
    
    # 3. Save into dataframe with new name
    X[cross_col_name] = enc.transform(X_crossed)
    
    # keep track of the new names of the crossed columns
    cross_col_df_names.append(cross_col_name)
    
X.head()

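| | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | Veggies | HvyAlcoholConsump | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income | Education_Income | Education_Income_Age | GenHlth_Education_Age | GenHlth_Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1.703564 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 4 | 1.727369 | 0.963016 | 1 | 0 | 8 | 3 | 2 | 19 | 102 | 210 | 34 |
| 1 | 0 | 0 | 0 | -0.672844 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 2 | -0.477318 | -0.564619 | 0 | 0 | 6 | 5 | 0 | 33 | 254 | 137 | 16 |
| 2 | 1 | 1 | 1 | -0.197562 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 4 | 3.19716 | 2.490651 | 1 | 0 | 8 | 3 | 7 | 24 | 160 | 210 | 39 |
| 3 | 1 | 0 | 1 | -0.35599 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | -0.477318 | -0.564619 | 0 | 0 | 10 | 2 | 5 | 14 | 64 | 47 | 13 |
| 4 | 1 | 1 | 1 | -0.831271 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | -0.10987 | -0.564619 | 0 | 0 | 10 | 4 | 3 | 28 | 194 | 66 | 11 |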
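The crossing step in section 1.2 can be illustrated on a small example. The sketch below uses hypothetical toy rows with already integer-encoded categories and follows the same join-then-encode pattern as the cell above; it is an illustration, not a replacement for that cell.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows with integer-encoded categories (not from the dataset)
toy = pd.DataFrame({"Education": [3, 5, 3, 5], "Income": [2, 0, 2, 7]})

# Join the per-row category values into one string label, e.g. "3_2"
crossed = toy[["Education", "Income"]].astype(str).agg("_".join, axis=1)

# Encode each distinct combination as a single integer ID
toy["Education_Income"] = LabelEncoder().fit_transform(crossed)
print(toy)
```

Rows that share the same Education and Income values receive the same crossed ID, so the wide branch of the network can memorize specific combinations. Because the labels are strings, `LabelEncoder` orders them lexicographically, so the IDs carry no ordinal meaning.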


## 2. Modeling

## 3. Exceptional Work