# Final Project, Part 1: Neural-Network-Based Diabetes Defeater

### Overview

You'll be developing the Diabetes Defeater! This is a neural network that predicts which individuals will be diagnosed as diabetic within 3 years.

<div style="text-align: center;"> <img src = "res/final_project/dd_icon.jpg" width="25%"/> </div>

However, the initial neural network used to detect diabetes is performing terribly. Your job is to fix it. Its diagram appears below.

<div style="text-align: center;"> <img src = "res/final_project/dd_initial_nn.jpg" width="30%"/> </div>

## Codebook

Here are the features and label in the CDC diabetes dataset. Note that the preprocessing will be done for you, so these are less relevant. Also note that this is not the same dataset used in the machine learning course; this one includes far more data.


<strong> Label (Binary) </strong>
<ul>
  <li> <strong>Diabetes_3Years.</strong> Whether the individual will be diagnosed with diabetes within 3 years. 1 = diagnosed with diabetes. 0 = no diagnosis. </li>
</ul>


<strong> Features: Binary</strong>
<ul>
  <li> <strong>HighBP.</strong> Whether the individual currently has high blood pressure (1) or not (0).  </li>
  <li> <strong>HighChol.</strong> Whether the individual currently has high cholesterol (1) or not (0). </li>
  <li> <strong>CholChecked_Last5Years.</strong> Whether the individual had a cholesterol check within the last 5 years (1) or not (0). </li>
  <li> <strong>EverBeenSmoker.</strong> Whether the individual has smoked at least 100 cigarettes (approximately 5 packs) in his/her lifetime (1) or not (0). </li>
    <li> <strong>EverHadStroke.</strong> Whether the individual has ever had a stroke (1) or not (0). </li>
    <li> <strong>EverHadHeartDiseaseOrAttack.</strong> Whether the individual has ever had heart disease or a heart attack (1) or not (0). </li>
  <li> <strong>PhysActivity_Last30Days.</strong> Whether the individual has engaged in physical activity in the last 30 days, excluding work (1) or not (0). </li>
    <li> <strong>Fruits_OncePerDay.</strong> Whether the individual eats fruits once per day (1) or not (0). </li>
    <li> <strong>Veggies_OncePerDay.</strong> Whether the individual eats veggies once per day (1) or not (0). </li>
    <li> <strong>HvyAlcoholConsump.</strong> Whether the individual drinks >= 14 drinks per week for men or >= 7 drinks per week for women (1) or not (0). </li>
    <li> <strong>AnyHealthcare.</strong> Whether the individual has health insurance or other forms of healthcare (1) or not (0). </li>
    <li> <strong>TooPoorForDoctor_Last1Year.</strong> Whether the individual wanted to see a doctor in the past 12 months but didn't because he/she couldn't afford it (1) or not (0). </li>
    <li> <strong>DiffWalk.</strong> Whether the individual has difficulty walking or climbing stairs (1) or not (0). </li>
    <li> <strong>IsMale.</strong> Whether the individual is a man (1) or woman (0). </li>
</ul>

<strong> Features: Categorical </strong>
<ul>
  <li> <strong>HealthSelfAssessment_General.</strong> Individual's self-assessment of his/her general health. 1 = excellent, 2 = very good, 3 = good, 4 = fair, 5 = poor.</li>
    <li> <strong>AgeGroup.</strong> Age group, broken into age bands. 1 = 18-24, 2 = 25-29, [...], 9 = 60-64, [...], 13 = 80 or older.</li>
    <li> <strong>Education.</strong> Education Level. 1 = Never attended school or only kindergarten, 2 = Grades 1 through 8 (Elementary), 3 = Grades 9 through 11 (Some high school), 4 = Grade 12 or GED (High school graduate), 5 = College 1 year to 3 years (Some college or technical school), 6 = College 4 years or more (College graduate). </li>
    <li> <strong>IncomeGroup.</strong> Income group, on scale 1-8. 1 = less than $10,000, [...] 8 = $75,000 or more</li>
</ul>


<strong> Features: Numeric </strong>
<ul>
  <li> <strong>BMI.</strong> Body mass index. < 18.5 indicates underweight. 18.5 - 24.9 indicates normal. 25 - 29.9 indicates overweight. 30 - 34.9 indicates Class 1 obesity. 35 - 39.9 indicates Class 2 obesity, whereas >= 40 indicates Class 3 obesity.</li>
</ul>
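
The BMI bands above can be written as a small lookup. This is only an illustrative sketch: `bmi_category` is a hypothetical helper that is not part of the notebook, and it assumes each band runs up to (but excludes) the next threshold, so a value such as 24.95 counts as normal.

```python
def bmi_category(bmi: float) -> str:
    """Map a raw BMI value to the band named in the codebook (hypothetical helper)."""
    # (exclusive upper bound, label) pairs, checked in ascending order
    bands = [
        (18.5, "Underweight"),
        (25.0, "Normal"),
        (30.0, "Overweight"),
        (35.0, "Class 1 obesity"),
        (40.0, "Class 2 obesity"),
    ]
    for upper, label in bands:
        if bmi < upper:
            return label
    return "Class 3 obesity"


print(bmi_category(27.3))  # Overweight
```

The lookup applies to raw BMI values only. Once BMI has been converted to z-scores during preprocessing, these thresholds no longer apply directly.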


# 0 | Google Colab Setup

In [None]:
import os
import shutil
import stat

In [None]:
def copy_safe(src, dst, max_len=200):
    """Copy files, skip long paths"""
    skipped = 0
    for root, dirs, files in os.walk(src):
        rel_path = os.path.relpath(root, src)
        dst_root = os.path.join(dst, rel_path) if rel_path != '.' else dst
        if len(dst_root) < max_len:
            os.makedirs(dst_root, exist_ok=True)
            for file in files:
                dst_file = os.path.join(dst_root, file)
                if len(dst_file) < max_len:
                    try:
                        shutil.copy2(os.path.join(root, file), dst_file)
                    except OSError:  # e.g., permission or path errors
                        skipped += 1
                else:
                    skipped += 1
        else: skipped += len(files)
    return skipped

In [None]:
# Setup resources if needed
setup_ran = False
if not os.path.exists('res'):
    print("Setting up resources...")
    setup_ran = True
    
    # Cleanup, clone, copy
    repo = 'deep_learning_resources'
    if os.path.exists(repo):
        shutil.rmtree(repo, onerror=lambda f,p,e: os.chmod(p, stat.S_IWRITE) or f(p))
    
    !git clone --depth=1 https://github.com/jjv31/deep_learning_resources
    
    if os.path.exists(f'{repo}/res'):
        skipped = copy_safe(f'{repo}/res', 'res')
        print(f"Setup complete! {'(' + str(skipped) + ' long filenames skipped)' if skipped else ''}")
    
    shutil.rmtree(repo, onerror=lambda f,p,e: os.chmod(p, stat.S_IWRITE) or f(p))

In [None]:
# Only refresh if we just downloaded resources
if setup_ran:
    from IPython.display import Javascript, display
    import time
    
    print("Refreshing images...")
    
    # Try browser refresh + aggressive image reload
    display(Javascript(f'''
    try {{ setTimeout(() => window.location.reload(true), 2000); }} catch(e) {{}}
    
    const t = {int(time.time())};
    document.querySelectorAll('img').forEach((img, i) => {{
        if (img.src.includes('res/')) {{
            const src = img.src.split('?')[0];
            setTimeout(() => img.src = src + '?v=' + t + '_' + i, i * 50);
        }}
    }});
    '''))
    
    print("If images don't appear, press Ctrl+Shift+R to hard refresh!")
else:
    print("Resources already exist, skipping setup.")

# 1 | Introduction: Load & Preprocess Dataset

### 1.0 | Imports & Auxiliary Functions

Just run these. No need to modify them.

In [None]:
#TQDM
%pip install tqdm

In [None]:
#load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#Scikit-learn libraries
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix

#Set plot styles
%matplotlib inline
sns.set_style("whitegrid")
plt.style.use("fivethirtyeight")

# Get Pandas to display all rows/columns
pd.set_option('future.no_silent_downcasting', True)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None) 

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Neural Network
from keras.models import Sequential
from keras.layers import Dense, Input, Dropout
from keras.optimizers import Adam
from keras import regularizers
import keras

In [None]:
def print_univariates_categorical(data):
    data = data.astype(str) # Prevents error when sorting between strings and bools
    numbers = data.value_counts().sort_index()
    percentage = numbers / numbers.sum() * 100
    percentage = percentage.round(2)
    
    for c, i in enumerate(numbers.index):
        print(f"{i}\t\t{numbers.iloc[c]}\t{percentage.iloc[c]}%")

In [None]:
def print_univariates_metric(data, nameToPrint=None):

    # Mode - Handling multimodal cases
    mode_result = data.mode()
    if len(mode_result) == 0:  # No mode found
        mode_result = None
    else:
        mode_result = mode_result[0]

    # Print output
    print(f"Descriptives for {nameToPrint}")
    print(f"Mean = {round(data.mean(),2)} | Median = {round(data.median(),2)} | Mode = {mode_result} | "
          f"Min = {data.min()} | Max = {data.max()} | SD = {round(data.std(),2)} | "
          f"IQR(25) = {data.quantile(0.25)} | IQR(75) = {data.quantile(0.75)}")

In [None]:
#Function to facilitate evaluating our models
def print_score(clf, X, y_true):

    # Gets predicted labels
    if isinstance(clf, keras.models.Sequential): # If the model is a Keras neural network
        y_pred = (clf.predict(X) >= 0.5).astype(int).ravel() # Thresholds probabilities at 0.5 and flattens (n, 1) to 1-D
    else: # Normal scikit-learn model
        y_pred = clf.predict(X)

    # Gets key performance indicators
    accuracy = round(accuracy_score(y_true, y_pred), 4)
    recall = round(recall_score(y_true, y_pred), 4)
    precision = round(precision_score(y_true, y_pred), 4)
    f1 = round(f1_score(y_true, y_pred), 4)

    # Displays them
    print(f"F1 = {f1:.4f} | Recall = {recall* 100:.2f}% | Precision = {precision*100:.2f}%")

In [None]:
# Plots the performance of the neural network
def plot_performance(training_values, validation_values, metric_name = "Recall"):

    epochs = range(1, len(training_values) + 1)
    
    sns.set() 
    plt.plot(epochs, training_values, '-', label=f'Training {metric_name}')
    plt.plot(epochs, validation_values, ':', label=f'Validation {metric_name}')

    plt.title(f'Training and Validation {metric_name}')
    plt.xlabel('Epoch')
    plt.ylabel(metric_name)
    plt.legend(loc='lower right')
    plt.show()

### 1.1 | Import & Explore Data

In [None]:
df = pd.read_csv("res/final_project/cdc_diabetes_v2.csv")
df.head(3)

In [None]:
# Displays the label
print("Diabetes in 3 years?")
print_univariates_categorical(df["Diabetes_3Years"])

In [None]:
# Any missing values?
if df.isnull().sum().sum() > 0:
    print("There are missing values in the DataFrame.")
else:
    print("Congratulations. There are no missing values in the DataFrame.")

### 1.2 | Preprocess Dataset
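
The cells in this section standardize BMI into z-scores, z = (x - mean) / SD, and rescale the ordinal columns to the range 0 to 1, x' = (x - min) / (max - min). The sketch below applies both formulas to made-up toy values (not the CDC data) using only the standard library. `z_scores` and `min_max` are hypothetical helper names, not functions from the notebook.

```python
import statistics


def z_scores(values):
    """Standardize values to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1), the pandas default for .std()
    return [(v - mean) / sd for v in values]


def min_max(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


toy_bmi = [20.0, 25.0, 30.0]   # made-up values, not from the CDC data
toy_age_group = [1, 7, 13]     # lowest, middle, and highest age band

print(z_scores(toy_bmi))       # [-1.0, 0.0, 1.0]
print(min_max(toy_age_group))  # [0.0, 0.5, 1.0]
```

Both transformations keep the ordering of the values and only change their scale, which tends to make gradient-based training better behaved.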

In [None]:
df.head(1)

In [None]:
# Calculates z scores for BMI, thereby transforming it into a much smaller number more conducive for our neural network

# Prints BMI stats
print_univariates_metric(df["BMI"], "BMI")

# Calculates & converts to z scores
df['BMI'] = round( ( df['BMI'] - df['BMI'].mean() ) / df['BMI'].std(), 4)

# Prints BMI post z-scores
print_univariates_metric(df["BMI"], "BMI (z-scores)")

In [None]:
# We have a bunch of ordinal variables.
# Let's rescale them to a number between 0 and 1 rather than between 1 and their max value (e.g., 13)

# We'll use scikit learn's min-max scaler to do this
scaler = MinMaxScaler()

# First, we'll show an example
print("Self assessment scores before scaling")
print_univariates_categorical(df["HealthSelfAssessment_General"])
df['HealthSelfAssessment_General'] = scaler.fit_transform(df[['HealthSelfAssessment_General']])
print("\nSelf assessment scores after scaling")
print_univariates_categorical(df["HealthSelfAssessment_General"])

# Now, we'll do the rest
df['AgeGroup'] = scaler.fit_transform(df[['AgeGroup']])
df['Education'] = scaler.fit_transform(df[['Education']])
df['IncomeGroup'] = scaler.fit_transform(df[['IncomeGroup']])

In [None]:
# Let's display our pre-processed dataframe
df.head(1)

### 1.3 | Split Dataset

In [None]:
X = df.drop(['Diabetes_3Years'], axis=1)
y = df['Diabetes_3Years']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
#Let's examine the sizes of the training and testing sets
print(f"Training set size = {y_train.size}\nTesting set size = {y_test.size}")

# 2 | Construct Neural Network

### 2.0 | Overview

You'll need to construct, evaluate, and iteratively improve a neural network that can detect diabetes. Keep these two steps in mind as you're constructing the neural network, and refer to practicum #1.

<div style="text-align: center;"> <img src = "res/model_building/money_marker_steps_to_fixing_nn.jpg" width="30%"/> </div>

Underfitting is generally addressed by making your network more complex, whereas overfitting is generally addressed by (i) adding regularization, (ii) adding one/more dropout layer(s), and/or (iii) reducing the complexity of your neural network.
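
To make the two overfitting remedies concrete, here is a minimal NumPy sketch of what they compute. It is an illustration under stated assumptions rather than the Keras implementation: the weights, `l2_strength`, and `rate` are made-up values. In Keras, the same ideas are available through the `kernel_regularizer` argument of `Dense` (for example `regularizers.l2(...)`) and the `Dropout` layer, both of which this notebook already imports.

```python
import numpy as np

rng = np.random.default_rng(0)

# L2 regularization: add a penalty proportional to the sum of squared weights,
# which discourages the network from relying on a few very large weights
weights = np.array([0.5, -2.0, 1.5])
l2_strength = 0.01  # hypothetical value; tune it for your own network
l2_penalty = l2_strength * np.sum(weights ** 2)  # 0.01 * 6.5
print(f"L2 penalty added to the loss: {l2_penalty:.3f}")

# Dropout (inverted form): during training, zero each activation with
# probability `rate` and scale the survivors so the expected value is unchanged
activations = np.ones(10)
rate = 0.3  # hypothetical dropout rate
keep_mask = rng.random(activations.shape) >= rate
dropped = activations * keep_mask / (1.0 - rate)
print(dropped)
```

Dropout is only active during training. At prediction time every unit is kept, so no rescaling is needed.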

Because we have a lot of data, we'll use a separate validation set to evaluate our neural network's performance while we tune it. This makes the testing set a less biased estimate of how the model will perform in the real world, because it remains data the model has never seen before.

<div style="text-align: center;"> <img src = "res/cnn_fashion/dataset_split.jpg" width="30%"/> </div>
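
This notebook holds out 20% of the rows as the testing set (`test_size=0.2`) and then passes `validation_split=0.1` to `fit()`, so roughly 72% of the rows are used for training, 8% for validation, and 20% for testing. The sketch below works through that arithmetic. `split_sizes` is a hypothetical helper, and the exact rounding is an assumption, so treat the counts as approximate.

```python
import math


def split_sizes(n_rows, test_size=0.2, validation_split=0.1):
    """Approximate row counts for the training, validation, and testing sets."""
    n_test = math.ceil(n_rows * test_size)   # held out by train_test_split
    n_train_full = n_rows - n_test           # rows passed to fit()
    n_train = math.floor(n_train_full * (1 - validation_split))
    n_val = n_train_full - n_train           # carved off by validation_split
    return n_train, n_val, n_test


print(split_sizes(1000))  # (720, 80, 200)
```

Keras takes the validation rows from the end of the data it is given, before any shuffling, so the validation set is the same on every run.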

In [None]:
# Let's initialize some constants. 
# DO NOT CHANGE THESE CONSTANTS.
NUMBER_OF_FEATURES = X_train.shape[1]
EPOCHES = 100

print(f"There are {NUMBER_OF_FEATURES} features in the dataset. Thus, the neural network will have {NUMBER_OF_FEATURES} input nodes.")

### 2.1 | Create Feedforward Neural Network
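
The `summary()` call in the cell below reports the number of trainable parameters per layer. For a `Dense` layer this is one weight per input-unit pair plus one bias per unit. The sketch below works through the count for the starting network. `dense_param_count` is a hypothetical helper, and the figure of 19 inputs assumes one input per feature listed in the codebook; the actual value comes from `NUMBER_OF_FEATURES`.

```python
def dense_param_count(n_inputs: int, n_units: int) -> int:
    """Trainable parameters in a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units


n_features = 19  # assumption: one input per feature listed in the codebook
first_layer = dense_param_count(n_features, 1)  # 19 weights + 1 bias
output_layer = dense_param_count(1, 1)          # 1 weight + 1 bias
print(first_layer, output_layer, first_layer + output_layer)  # 20 2 22
```

Re-running the same arithmetic after each change to the architecture is a quick way to see how much capacity the change adds.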

In [None]:
# Create neural network
your_neural_network = Sequential() 
your_neural_network.add( Input( shape= (NUMBER_OF_FEATURES,) ) ) 
your_neural_network.add(Dense(1, activation='linear'))
your_neural_network.add(Dense(1, activation='sigmoid')) 

# Compiles the model
my_learning_rate = 0.000000001
your_neural_network.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=my_learning_rate), 
              metrics=[keras.metrics.Precision(name="precision"), keras.metrics.Recall(name="recall"),]) 
your_neural_network.summary()

In [None]:
#Trains the model
your_hist = your_neural_network.fit(X_train, y_train, validation_split=0.1, # Uses 10% of the training data as validation
                 epochs=EPOCHES, batch_size=1024)

### 2.2 | Evaluate Neural Network
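
The scores printed in this section all derive from confusion-matrix counts: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is the harmonic mean of the two. The sketch below uses made-up counts purely for illustration. `precision_recall_f1` is a hypothetical helper, and returning 0 when a ratio is undefined is a choice made here for simplicity.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts for illustration only
p, r, f1 = precision_recall_f1(tp=30, fp=10, fn=30)
print(f"F1 = {f1:.4f} | Recall = {r * 100:.2f}% | Precision = {p * 100:.2f}%")
# F1 = 0.6000 | Recall = 50.00% | Precision = 75.00%
```

A model that never predicts the positive class has no true or false positives, so its precision, recall, and F1 all come out as 0 under this convention.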

In [None]:
print("Training Data Score:")
print_score(your_neural_network, X_train, y_train)

In [None]:
print("Testing Data Score")
print_score(your_neural_network, X_test, y_test)

### 2.3 | Evaluate Training
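
One hedged way to read the curves plotted in this section is to look at where validation loss bottoms out and how far it ends up from training loss. The sketch below does this for a dictionary shaped like `your_hist.history`. `summarize_history` is a hypothetical helper, and the toy curves are made up.

```python
def summarize_history(history: dict) -> dict:
    """Report the best validation-loss epoch and the final train/validation gap."""
    val_loss = history["val_loss"]
    # Index of the lowest validation loss, converted to a 1-based epoch number
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
    final_gap = val_loss[-1] - history["loss"][-1]
    return {
        "best_epoch": best_epoch,
        "best_val_loss": val_loss[best_epoch - 1],
        "final_gap": final_gap,
    }


# Made-up curves: validation loss bottoms out at epoch 3, then rises
toy = {
    "loss": [0.70, 0.55, 0.45, 0.38, 0.33],
    "val_loss": [0.72, 0.60, 0.52, 0.55, 0.61],
}
print(summarize_history(toy))
```

A validation loss that rises while training loss keeps falling suggests overfitting, whereas two curves that stay high and close together suggest underfitting. After training, the helper can be called as `summarize_history(your_hist.history)`.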

In [None]:
loss, val_loss = your_hist.history["loss"], your_hist.history["val_loss"]
plot_performance(loss, val_loss, "Loss (Error)")

In [None]:
prec, val_prec = your_hist.history["precision"], your_hist.history["val_precision"]
plot_performance(prec, val_prec, "Precision")

In [None]:
recall, val_recall = your_hist.history["recall"], your_hist.history["val_recall"]
plot_performance(recall, val_recall, "Recall")

# 3 | Diagram Your Final Neural Network

Finished making your neural network? Create a diagram to present to the class: https://alexlenail.me/NN-SVG/index.html