# NHANES Diabetes Prediction Model
- About 1 in 5 people with diabetes don't know they have it.
- Diabetes is the eighth leading cause of death in the United States.
- People with diabetes have medical costs roughly twice as high as people without it.

Accurate predictive models could help individuals identify diabetes early and get medical attention sooner.

## Read in Libraries

In [92]:
# load in necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from tensorflow import keras
import tensorflow as tf

## Read in the Data
Note: The data were pre-processed in R and saved to CSV for further analysis here.

In [93]:
data = pd.read_csv('nhanes_cleaned.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,blood_cholesterol_mg_per_dc,blood_glycosylated_hemoglobin_volume_percentage,family_income,household_adults_60_years_plus,household_children_6_17_years,household_income,household_ref_person_age_year,household_ref_person_biological_sex,household_ref_person_education,...,subject_education,subject_high_blood_pressure,subject_2_year_weight_screening_time,subject_2_year_mec_weight_mec_time,subject_insurance,subject_masked_variance_pseudo_psu,subject_masked_variance_pseudo_stratum,subject_prescription_coverage,subject_race_ethnicity,diagnosis
0,4331,168,5.5,"$65,000 to $74,999",1,0,"$65,000 to $74,999",61,2,College graduate or above,...,College graduate or above,Yes,60325.09525,61758.65488,Covered by state-sponsored health plan,1,114,Yes,White Hispanic,False
1,4332,168,5.2,"$100,000 and Over",0,0,"$100,000 and Over",26,2,College graduate or above,...,College graduate or above,No,89514.43322,91523.51605,Covered by private insurance,2,113,Yes,White Hispanic,False
2,4333,131,5.0,"$45,000 to $54,999",0,1,"$45,000 to $54,999",33,2,College graduate or above,...,College graduate or above,No,14155.313,15397.21985,Covered by private insurance,2,114,Yes,Non-White,False
3,4334,154,5.1,"$45,000 to $54,999",0,2,"$45,000 to $54,999",35,2,Some college or AA degree,...,9th grade,No,12433.74874,12665.77009,Covered by private insurance,1,104,Yes,White Hispanic,False
4,4336,225,5.8,"$20,000 and Over",0,3,"$20,000 and Over",38,2,High school graduate/GED or equivalent,...,High school graduate/GED or equivalent,No,27388.92069,27196.63839,Covered by private insurance,2,116,Yes,White Hispanic,False
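
The preview shows two columns named `Unnamed: 0.1` and `Unnamed: 0`. These look like row indices written out by earlier CSV exports (an assumption, since the R step is not shown); if they stay in the frame they are scaled and passed to the model as numeric features. A minimal sketch on a made-up toy frame of dropping such columns and checking the share of positive diagnoses:

```python
import pandas as pd

# Toy stand-in for the real file; the values are invented for illustration.
toy = pd.DataFrame({
    "Unnamed: 0.1": [0, 1, 2, 3],
    "Unnamed: 0": [4331, 4332, 4333, 4334],
    "blood_cholesterol_mg_per_dc": [168, 168, 131, 154],
    "diagnosis": [False, False, True, False],
})

# Drop any column whose name starts with "Unnamed", i.e. a saved row index.
index_like = [c for c in toy.columns if c.startswith("Unnamed")]
cleaned = toy.drop(columns=index_like)

# Fraction of rows with a positive diagnosis; a skewed value makes raw
# accuracy harder to interpret later on.
positive_share = cleaned["diagnosis"].mean()

print(cleaned.columns.tolist())
print(positive_share)
```

On the real data the same pattern would apply to `data` before the feature lists are built, so that the dropped columns never reach the preprocessor.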


## Further Data Pre-processing

In [94]:
# Define numerical and categorical features
numerical_cols = data.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols = data.select_dtypes(include=['object']).columns.tolist()

# Preprocessing for numerical data
numerical_transformer = StandardScaler()

# Preprocessing for categorical data
categorical_transformer = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

## Create Model

In [95]:
# Build a feed-forward binary classifier with TensorFlow/Keras
def build_model(input_shape):
    model = keras.Sequential([
        keras.layers.Dense(128, activation='relu', input_shape=[input_shape]),
        keras.layers.Dropout(0.1),
        keras.layers.Dense(128, activation='relu'),
        keras.layers.Dropout(0.1),
        keras.layers.Dense(128, activation='relu'),
        keras.layers.Dropout(0.1),
        keras.layers.Dense(128, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model
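
The model ends in a sigmoid unit and is compiled with `binary_crossentropy`, so each prediction is a probability and the loss penalises confident mistakes heavily. The NumPy sketch below illustrates that loss with a hypothetical helper, `binary_cross_entropy`; it is not Keras's own implementation, and the clipping constant `eps` is an assumption chosen for numerical safety.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps keeps log() away from zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    per_sample = y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)
    return float(-np.mean(per_sample))

labels = [0, 0, 1, 0]

# Probabilities close to the true labels give a small loss.
confident = binary_cross_entropy(labels, [0.1, 0.1, 0.9, 0.1])

# Predicting 0.5 everywhere gives a loss of ln(2), whatever the labels are.
uninformed = binary_cross_entropy(labels, [0.5, 0.5, 0.5, 0.5])

print(confident, uninformed)
```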

## Split Data and Fit

In [96]:
# Split data into X and y
X = data.drop('diagnosis', axis=1)
y = data['diagnosis'].astype(int)  # for binary classification

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the preprocessor to the training data and transform it
X_train_transformed = preprocessor.fit_transform(X_train)
X_test_transformed = preprocessor.transform(X_test)

# Further split the training data into training and validation sets
X_train_final, X_val, y_train_final, y_val = train_test_split(X_train_transformed, y_train, test_size=0.2, random_state=42)

# Get the number of features from the transformed data
input_shape = X_train_transformed.shape[1]
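
Both calls to `train_test_split` above split at random without regard to the label. If positive diagnoses are rare (not checked in this notebook), the splits can end up with different positive rates. One option, sketched here on synthetic data with hypothetical `_toy` names, is to pass `stratify` so that each split keeps roughly the same class balance. Note also that the cell above fits the preprocessor before carving out the validation set, so the scaler has seen the validation rows; splitting first and fitting on the final training portion only would avoid that.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data: 10% positives out of 200 rows.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([1] * 20 + [0] * 180)

# stratify keeps the positive rate (about) equal across the splits.
X_tr_toy, X_te_toy, y_tr_toy, y_te_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)
X_fit_toy, X_val_toy, y_fit_toy, y_val_toy = train_test_split(
    X_tr_toy, y_tr_toy, test_size=0.2, random_state=42, stratify=y_tr_toy
)

print(len(y_fit_toy), len(y_val_toy), len(y_te_toy))
print(y_fit_toy.mean(), y_val_toy.mean(), y_te_toy.mean())
```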

## Build and Fit Model

In [97]:
# Build the model
model = build_model(input_shape)

# Fit the model
model.fit(X_train_final, y_train_final, epochs=10, validation_data=(X_val, y_val))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10



## Evaluation of Model

In [98]:
# Evaluate the model on the test data
loss, accuracy = model.evaluate(X_test_transformed, y_test)
print("Accuracy:", accuracy)

Accuracy: 0.9412628412246704
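
Accuracy alone can be hard to read for a screening task: if most rows are negative, a model that always predicts "no diabetes" can also score highly. The sketch below uses invented labels and probabilities (stand-ins for `y_test` and the model's predicted probabilities, which are assumptions here) to show how precision, recall, ROC AUC, and a confusion matrix could be computed alongside accuracy with scikit-learn.

```python
import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Invented example: 8 negatives, 2 positives, with predicted probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.15, 0.30, 0.60, 0.10, 0.05, 0.80, 0.40])

# Threshold the probabilities at 0.5 to get hard class predictions.
y_pred = (y_prob >= 0.5).astype(int)

accuracy = (y_pred == y_true).mean()
baseline_accuracy = (y_true == 0).mean()   # always predicting the majority class
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob)
cm = confusion_matrix(y_true, y_pred)      # rows: true class, columns: predicted

print(accuracy, baseline_accuracy, precision, recall, auc)
print(cm)
```

In this toy case the classifier's accuracy equals the majority-class baseline even though it finds only half of the positives, which is the kind of gap that recall and the confusion matrix make visible.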
