# CS 4993 Independent Study – Machine Learning with COVID Data

***Professor: [Haiying Shen](https://engineering.virginia.edu/faculty/haiying-shen)***  
***Researcher: [Iain Muir](https://www.linkedin.com/in/iain-muir-b37718164/) | iam9ez***

*Github Project:* https://github.com/iainmuir6/machineLearning_covidData  
*Last Updated: June 14th, 2021*  

*References*
* [CS 4774 ML Material – Professor Rich Nguyen](https://www.cs.virginia.edu/~nn4pj/teaching)
* [Steps to Building Machine Learning Model](https://analyticsindiamag.com/the-7-key-steps-to-build-your-machine-learning-model/)
* [Steps to Data Preprocessing](https://hackernoon.com/what-steps-should-one-take-while-doing-data-preprocessing-502c993e1caa)
* [Handling Missing Values](https://towardsdatascience.com/7-ways-to-handle-missing-values-in-machine-learning-1a6326adf79e)
* [Feature Selection I](https://towardsdatascience.com/the-5-feature-selection-algorithms-every-data-scientist-need-to-know-3a6b566efd2)
* [Feature Selection II](https://machinelearningmastery.com/feature-selection-machine-learning-python/)
* [Keras Neural Network I](https://towardsdatascience.com/3-ways-to-create-a-machine-learning-model-with-keras-and-tensorflow-2-0-de09323af4d3)
* [GAN Github Repository – codyznash](https://github.com/codyznash/GANs_for_Credit_Card_Data/blob/7f7e2dfb6ab15eb0d520fa6611fe03d6f8646141/GAN_171103.py#L47)
* [GAN I](https://datasciencecampus.ons.gov.uk/projects/generative-adversarial-networks-gans-for-synthetic-dataset-generation-with-binary-classes/)
* [GAN II](https://nbviewer.jupyter.org/github/codyznash/GANs_for_Credit_Card_Data/blob/master/GAN_comparisons.ipynb#Generated%20Data%20Testing)

## Table of Contents <a class="anchor" id="toc"></a>
* **[0. Import Packages](#setup)**
    * [0.1 General Imports](#imp1)
    * [0.2 ML Imports](#imp2)
* **[1. Read Excel File](#data)**
    * [1.1 Data Overview](#overview)
    * [1.2 Descriptive Statistics](#stats)
    * [1.3 Inspect Null Data](#null)
* **[2. Data Preparation](#prep)**
    * [2.1 Drop Columns](#drop)
    * [2.2 Handle Categorical Variables](#handle1)
        * *[2.2.1 Manual Conversion](#manual)*
        * *[2.2.2 Encoding](#encoding)*
        * *[2.2.3 Categorical Codes](#codes)*
    * [2.3 Handle Missing Values](#handle2)
    * [2.4 Feature Scaling](#scaling)
    * [2.5 Train / Test Split](#split)
    * [2.6 Final Prepared Data](#final_data)
* **[3. Feature Selection](#feature)**
    * [3.1 Pearson Correlation](#corr)
    * [3.2 Chi-Squared Test](#chi_sq)
    * [3.3 Recursive Feature Elimination](#rfe)
    * [3.4 SelectFromModel: Lasso](#lasso)
    * [3.5 SelectFromModel: Random Forest Classifier](#rfc)
    * [3.6 Cumulative Feature Selection](#cum)
* **[4. Model Selection](#model)**
    * [4.1 Train / Test Data](#tt)
    * [4.2 Model Evaluation Functions](#funcs)
    * [4.3 Model Construction](#models)
        * *[4.3.1 Decision Tree](#dt)*
        * *[4.3.2 Random Forest Classifier](#rfc2)*
        * *[4.3.3 Simple Deep Neural Network](#dnn)*
        * *[4.3.4 Convolutional Neural Network](#cnn)*
    * [4.4 Simultaneous Model Evaluation](#eval)
    * [4.5 RandomizedSearch](#search)
* **[5. Generative Adversarial Networks](#gan)**
    * [5.1 Network Setup](#setup2)
    * [5.2 Training GAN Models](#train)
        * [5.2.1 GAN](#gan2)
        * [5.2.2 CGAN](#cgan)
        * [5.2.3 WGAN + WCGAN](#wgan)
    * [5.3 Loss Information](#loss)
    * [5.4 Generate New Data](#new_data)
    * [5.5 Training Models on New Data](#train_gan)
    * [5.6 Plot Real vs Test Data](#plot)
    * [5.7 Feature Importance](#importance)
* **[6. Retrain Models with GAN Data](#retrain)**
    * [6.1 Re-Prepare Data](#prep2)
    * [6.2 Re-Train Models](#retrain2)
* **[7. Final Model Training with Feature Selection](#final)**
    * [7.1 Define Models and Variables](#define)
    * [7.2 Model Performance w/o GAN](#perf1)
    * [7.3 Model Performance with GAN](#perf2)

## 0. Import Packages <a class="anchor" id="setup"></a>

[Table of Contents](#toc)

#### 0.1 General Imports <a class="anchor" id="imp1"></a>

In [337]:
from IPython.display import Markdown, Image, display
from scipy.stats import reciprocal
import matplotlib.pyplot as plt
from matplotlib import cm
import missingno as msno
import seaborn as sns
import pandas as pd
import numpy as np
import pickle
import random
import os

#### 0.2 ML Imports <a class="anchor" id="imp2"></a>

In [338]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split, RandomizedSearchCV, RepeatedStratifiedKFold
from sklearn.preprocessing import OrdinalEncoder, StandardScaler, MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2, RFE, SelectFromModel
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from io import StringIO  # sklearn.externals.six is deprecated; the standard library provides StringIO
import pydotplus
import sklearn

from keras.layers import Dense, InputLayer, Dropout, Conv2D, MaxPooling2D, Flatten, Embedding, LSTM
from keras.layers import LeakyReLU, PReLU, BatchNormalization, Activation
from keras.wrappers.scikit_learn import KerasRegressor
from keras.optimizers import Adam, SGD, RMSprop
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import tensorflow.compat.v1 as tf
from tensorflow import keras
tf.disable_v2_behavior()

import xgboost as xgb
from GAN import GAN

In [339]:
tf.__version__

'2.3.0'

In [340]:
keras.__version__

'2.4.0'

In [341]:
sklearn.__version__

'0.21.2'

In [342]:
SEED = 0

## 1. Read Excel File <a class="anchor" id="data"></a>

[Table of Contents](#toc)

#### 1.1 Data Overview <a class="anchor" id="overview"></a>

In [343]:
df = pd.read_excel('ed_pred.xlsx')
df = df.reset_index()
df.head(5)


In [None]:
df.shape

In [None]:
target = 'COVIDResult'
df.columns

###### Column Description
Note (src - [Walk-In-Lab](https://www.walkinlab.com/products/view/complete-blood-count-cbc-comprehensive-metabolic-panel-cmp-14-blood-test-panel#:~:text=A%20CBC%20also%20helps%20your,anemia%2C%20and%20several%20other%20disorders.&text=Comprehensive%20Metabolic%20Panel%20)): 

CBC == [Complete Blood Count](https://www.mayoclinic.org/tests-procedures/complete-blood-count/about/pac-20384919)
* A Complete Blood Count (CBC) reports the numbers and kinds of cells in the blood, especially red blood cells, white blood cells, and platelets. It helps a health professional investigate symptoms such as fatigue, weakness, or bruising, and diagnose conditions such as infection, anemia, and several other disorders.

CMP == [Comprehensive Metabolic Panel](https://www.mayocliniclabs.com/test-catalog/Clinical+and+Interpretive/113631)
* The Comprehensive Metabolic Panel (CMP-14) with eGFR is a group of 14 laboratory tests that report the current status of the liver, kidneys, and electrolyte and acid/base balance, as well as blood sugar and blood proteins.

#### 1.2 Descriptive Statistics <a class="anchor" id="stats"></a>

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.corr()

In [None]:
print(df[target].value_counts())

In [None]:
none_detected_dups = sum(df.loc[df[target]=='None Detected'].duplicated())
detected_dups = sum(df.loc[df[target]=='Detected'].duplicated())
total_dups = none_detected_dups + detected_dups

print('None Detected Duplicates:', none_detected_dups)
print('Detected Duplicates:', detected_dups)
print('Total Duplicates:', total_dups)
print('Fraction Duplicated:', total_dups / len(df))

#### 1.3 Inspect Null Data <a class="anchor" id="null"></a>

In [None]:
print('Total Number of NULL Data Points:', df.isnull().sum().sum())
df.isnull().sum()

In [None]:
d = {
    col: round(df[col].isnull().sum() * 100 / len(df[col]), 4)
    for col in df
}
d = dict(sorted(d.items(), key=lambda item: item[1], reverse=True))
majority_null = [k for k, v in d.items() if v > 50.0]

print("Null Data Points by variable")
d

In [None]:
msno.matrix(df)
# msno.heatmap(df)

## 2. Data Preparation <a class="anchor" id="prep"></a>

[Table of Contents](#toc)

#### 2.1 Drop Columns <a class="anchor" id="drop"></a>

In [None]:
drop = False

In [None]:
if drop:
    trim_df = df.drop(columns=majority_null)
    trim_df = trim_df.drop(columns=['index', 'patno'])
    trim_df.head(5)
else:
    trim_df = df

In [None]:
trim_df.isnull().sum().sum()

#### 2.2 Handle Categorical Variables <a class="anchor" id="handle1"></a>

In [None]:
trim_df.select_dtypes(include=['object']).columns

###### 2.2.1 Manual Conversion <a class="anchor" id="manual"></a>

In [None]:
num_cat_cols = [
    'cmp_bicarbonate', 'cmp_bun', 'cmp_creatinine', 'cmp_alt', 'cmp_bilirubin'
]
less_than_list = [
    '<5', '<2', '<0.2', '<6', '<0.1'
]


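def replace_cat(val, less, num):
    # Replace a censored reading such as '<5' with a random value below the limit
    if val == less:
        # random.randint requires integer bounds, so cast the float limit
        return random.uniform(0, num) if "." in less else random.randint(0, int(num))
    else:
        return float(val)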

    
trim_df2 = trim_df.copy()
for col, less_than in zip(num_cat_cols, less_than_list):
    upper_range = float(less_than[1:])
    trim_df2[col] = trim_df2[col].apply(lambda x: replace_cat(x, less_than, upper_range))

trim_df2.head(5)
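
The conversion above draws a random replacement for each censored reading such as `<5`, so the prepared data changes between runs unless `random.seed` is set. A minimal sketch of a deterministic alternative follows, assuming that half the reporting limit is an acceptable stand-in. That rule is a common convention and not what this notebook does, and `replace_censored` is a hypothetical helper.

```python
import math


def replace_censored(val, less):
    """Map a censored reading such as '<5' to half its limit; pass other values through as floats."""
    if val == less:
        # Hypothetical rule: substitute limit / 2 instead of a random draw
        return float(less[1:]) / 2
    return float(val)


readings = ['<5', '12', 7.5, float('nan')]
converted = [replace_censored(v, '<5') for v in readings]
print(converted)
```

Missing values stay `NaN` here, so the later missing-value step still sees them.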

In [None]:
trim_df2.info()

###### 2.2.2 Encoding <a class="anchor" id="encoding"></a>

In [None]:
cat_cols = list(trim_df2.select_dtypes(include=['object']).columns)

trim_df3 = trim_df2.copy()
trim_df3['FirstRace'] = trim_df3['FirstRace'].fillna("Unspecified")
if not drop:
    trim_df3['AdmittingDepartment'] = trim_df3['AdmittingDepartment'].fillna('N/A')
for col in cat_cols:
    enc = OrdinalEncoder()
    y = enc.fit_transform(trim_df3[[col]])
    if col == 'COVIDResult':
        y = 1 - y
    trim_df3[col + "_Encoded"] = y
trim_df3.head(5)
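
The `1 - y` step above is easier to follow with the encoding spelled out. By default `OrdinalEncoder` assigns codes in sorted category order, so `'Detected'` receives 0 and `'None Detected'` receives 1, and the flip makes a positive result the 1 class. The sketch below reproduces that ordering in plain Python; `ordinal_codes` is a hypothetical helper for illustration, not part of scikit-learn.

```python
def ordinal_codes(values):
    """Assign each distinct value its index in sorted order, mirroring OrdinalEncoder's default."""
    categories = sorted(set(values))
    return {cat: float(i) for i, cat in enumerate(categories)}


codes = ordinal_codes(['None Detected', 'Detected', 'None Detected'])
# Flip the binary target so that a positive test is encoded as 1
flipped = {cat: 1 - code for cat, code in codes.items()}
print(codes)
print(flipped)
```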

###### 2.2.3 Categorical Codes <a class="anchor" id="codes"></a>

In [None]:
for col in cat_cols:
    if col == 'AdmittingDepartment':
        continue
    display(Markdown("**{}**".format(col)))
    for each in trim_df3.groupby([col, col + '_Encoded']).indices:
        print(each)
    print()

In [None]:
trim_df3 = trim_df3.drop(columns=cat_cols)
trim_df3.head(5)

In [None]:
trim_df3.shape

In [None]:
trim_df3.info()

#### 2.3 Handle Missing Values <a class="anchor" id="handle2"></a>

In [None]:
fill_option = 'A'

In [None]:
num_cols = list(trim_df2.select_dtypes(include=['float64']).columns)

trim_df4 = trim_df3.copy()
for col in num_cols:
    if fill_option == 'A':
        trim_df4[col] = trim_df4[col].fillna(0)
    else:
        trim_df4[col] = trim_df4[col].fillna(trim_df4[col].mean())
    
trim_df4.head(5)

In [None]:
trim_df3.isnull().sum().sum()

In [None]:
trim_df4.isnull().sum().sum()

#### 2.4 Feature Scaling <a class="anchor" id="scaling"></a>

In [None]:
code_cols = [
    'Admitted', 'FirstRace_Encoded', 'Ethnicity_Encoded', 'Sex_Encoded',
    'AdmittingDepartment_Encoded', 'COVIDResult_Encoded'
]

if drop:
    code_cols.remove('AdmittingDepartment_Encoded')

codes_df = trim_df4[code_cols]
trim_df5 = trim_df4.drop(columns=code_cols)

scaler = StandardScaler()
scaled = scaler.fit_transform(trim_df5)
scaled_df = pd.DataFrame(data=scaled, columns=trim_df5.columns)

In [None]:
merged_df = pd.concat([scaled_df, codes_df], axis=1)
merged_df.head(5)

#### 2.5 Train / Test Split <a class="anchor" id="split"></a>

In [None]:
train, test = train_test_split(merged_df, test_size=0.2, random_state=SEED)
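
Note that the scaler in 2.4 is fit on every row before this split, so the test rows influence the scaling statistics. A minimal sketch of the alternative order (split first, then fit on the training rows only) is shown below on synthetic data. The array shapes and values are assumptions for illustration, and the manual mean/std arithmetic stands in for `StandardScaler`.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # synthetic stand-in for the lab values

# Split first (80/20), then estimate the scaling statistics on the training rows only
train_rows, test_rows = features[:80], features[80:]
mean = train_rows.mean(axis=0)
std = train_rows.std(axis=0)  # population std (ddof=0), as StandardScaler uses

train_scaled = (train_rows - mean) / std
test_scaled = (test_rows - mean) / std  # reuse the training statistics; do not refit on test rows

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```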

#### 2.6 Final Prepared Data <a class="anchor" id="final_data"></a>

In [None]:
final_data = merged_df.copy()
final_train = train.copy()
final_test = test.copy()
target = 'COVIDResult_Encoded'

In [None]:
final_data.shape

In [None]:
final_train.shape

In [None]:
final_test.shape

In [None]:
print(final_train[target].value_counts())
print(final_test[target].value_counts())

## 3. Feature Selection <a class="anchor" id="feature"></a>

[Table of Contents](#toc)

In [None]:
X = final_data.loc[:, final_data.columns != target]
pos_X = trim_df4.loc[:, trim_df4.columns != target]
X_norm = MinMaxScaler().fit_transform(pos_X)
Y = final_data[target]

###### Number of Features

In [None]:
top_n_feats = len(X.columns)
# top_n_feats = 10

In [None]:
def plot_scores(scores, selector):
    plt.bar(range(len(scores)), scores, color='b')
    plt.title('Feature scores: {}'.format(selector))
    plt.show()

#### 3.1 Pearson Correlation <a class="anchor" id="corr"></a>

In [None]:
def correlation_selector(x, y):
    correl_dict = {
        col: np.corrcoef(x[col], y)[0, 1] for col in x.columns.tolist()
    }
    correl_dict = {
        col: 0 if np.isnan(cor) else np.abs(cor) for col, cor in correl_dict.items()
    }
    plot_scores(list(correl_dict.values()), 'correlation')
    
    correl_dict = dict(sorted(correl_dict.items(), key=lambda item: item[1], reverse=True)[:top_n_feats])
    top_n = np.array([col in correl_dict for col in x.columns.tolist()])
    return top_n


corr_top_n = correlation_selector(X, Y)
corr_top_n

#### 3.2 Chi-Squared <a class="anchor" id="chi_sq"></a>

In [None]:
def chi_selector(y):
    chi_sq = SelectKBest(chi2, k=top_n_feats)
    chi_sq.fit(X_norm, y)
    top_n = chi_sq.get_support()
    plot_scores(chi_sq.scores_, 'chi-squared')
    return top_n


chi_top_n = chi_selector(Y)
chi_top_n
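
To make the chi-squared score concrete, the sketch below computes it by hand for a binary target: it compares each feature's total within each class against the total expected if the feature were independent of the class. This reflects my understanding of how `sklearn.feature_selection.chi2` scores non-negative features; `chi2_scores` and the toy data are hypothetical.

```python
import numpy as np


def chi2_scores(x, y):
    """Chi-squared statistic per non-negative feature against a binary (0/1) target."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    indicators = np.column_stack([y == 0, y == 1]).astype(float)  # one column per class
    observed = indicators.T @ x  # feature totals within each class
    expected = np.outer(indicators.mean(axis=0), x.sum(axis=0))  # totals under independence
    return ((observed - expected) ** 2 / expected).sum(axis=0)


x_demo = [[1, 0], [1, 1], [1, 0], [1, 1]]
y_demo = [0, 1, 0, 1]
scores = chi2_scores(x_demo, y_demo)
print(scores)
```

The first toy feature is constant across classes and scores 0, while the second tracks the label exactly and scores higher. This is also why the selector is fit on the min-max scaled `X_norm`: the statistic assumes non-negative inputs.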

#### 3.3 Recursive Feature Elimination <a class="anchor" id="rfe"></a>

In [None]:
def rfe_selector(y):
    rfe = RFE(estimator=LogisticRegression(), n_features_to_select=top_n_feats, step=10, verbose=0)
    rfe.fit(X_norm, y)
    top_n = rfe.get_support()
    return top_n


rfe_top_n = rfe_selector(Y)
rfe_top_n

#### 3.4 Lasso: SelectFromModel <a class="anchor" id="lasso"></a>

In [None]:
def lasso_selector(y):
    # Lasso corresponds to an L1 penalty; liblinear is a solver that supports it
    lasso = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear"), max_features=top_n_feats)
    lasso.fit(X_norm, y)
    top_n = lasso.get_support()
    return top_n


lasso_top_n = lasso_selector(Y)
lasso_top_n

#### 3.5 RandomForestClassifier: SelectFromModel <a class="anchor" id="rfc"></a>

In [None]:
def rfc_selector(x, y):
    rfc = SelectFromModel(RandomForestClassifier(n_estimators=100), max_features=top_n_feats)
    rfc.fit(x, y)
    top_n = rfc.get_support()
    return top_n


rfc_top_n = rfc_selector(X, Y)
rfc_top_n

#### 3.6 Cumulative Feature Selection <a class="anchor" id="cum"></a>

In [None]:
cumm_df = pd.DataFrame({
    'feature': X.columns.tolist(),
    'correlation': corr_top_n,
    'chi-sq': chi_top_n,
    'rfe': rfe_top_n,
    'lasso': lasso_top_n,
    'rfc': rfc_top_n
})
cumm_df['total'] = cumm_df.drop(columns='feature').sum(axis=1)
cumm_df = cumm_df.sort_values(['total', 'feature'], ascending=False)
cumm_df.index = range(1, len(cumm_df) + 1)
cumm_df

In [None]:
top_n_features = list(cumm_df.iloc[:top_n_feats]['feature'])
top_n_features
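
The cumulative step is a simple vote: each selector contributes a boolean mask, and features are ranked by how many selectors kept them. A small sketch with made-up masks is shown below; the feature names and votes are hypothetical, and ties are broken alphabetically here rather than in the reverse order used above.

```python
import numpy as np

feature_names = ['age', 'cbc_wbc', 'cmp_bun', 'Sex_Encoded']  # hypothetical names
votes = {
    'correlation': np.array([True, True, False, False]),
    'chi-sq': np.array([True, False, False, True]),
    'rfe': np.array([True, True, False, False]),
}

# Sum the boolean masks column-wise to count votes per feature
totals = np.sum(list(votes.values()), axis=0)
ranked = sorted(zip(feature_names, totals), key=lambda pair: (-pair[1], pair[0]))
top_two = [name for name, _ in ranked[:2]]
print(ranked)
print(top_two)
```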

## 4. Model Selection <a class="anchor" id="model"></a>

[Table of Contents](#toc)

#### 4.1 Train / Test Data <a class="anchor" id="tt"></a>

In [None]:
X_train = final_train[top_n_features]
y_train = final_train[target]
X_test = final_test[top_n_features]
y_test = final_test[target]

In [None]:
print('X_train', X_train.shape)
print('y_train', y_train.shape)
print()
print('X_test', X_test.shape)
print('y_test', y_test.shape)

#### 4.2 Model Evaluation Functions <a class="anchor" id="funcs"></a>

In [None]:
def plot_metric(hist, metric):
    train_metrics = hist.history[metric]
    val_metrics = hist.history['val_'+metric]
    epochs = range(1, len(train_metrics) + 1)
    plt.plot(epochs, train_metrics, 'bo--')
    plt.plot(epochs, val_metrics, 'ro-')
    plt.title('Training and Validation '+ metric)
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend(["train_"+metric, 'val_'+metric])
    plt.show()

In [None]:
def plot_history(hist):
    display(Markdown('**Training/Validation Loss and Accuracy**'))
    pd.DataFrame(hist.history).plot(figsize=(8,5))
    plt.grid(True)
    plt.gca().set_ylim(0, 1)
    plt.show()

In [None]:
def metric_evaluation(y_test, y_pred, labels=True):
    if labels:
        display(Markdown('**Metric Scores**'))
    print("Accuracy: {:.2f}%".format(accuracy_score(y_test, y_pred) * 100))
    print("Precision: {:.2f}%".format(precision_score(y_test, y_pred) * 100))
    print("Recall: {:.2f}%".format(recall_score(y_test, y_pred) * 100))
    print("F1: {:.2f}%".format(f1_score(y_test, y_pred) * 100))
    if labels:
        display(Markdown('**Confusion Matrix**'))
    print(confusion_matrix(y_test, y_pred))
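
As a reference for reading these scores, the sketch below derives accuracy, precision, recall, and F1 directly from the four confusion-matrix counts, with class 1 as the positive class. `binary_metrics` and the toy labels are hypothetical and only illustrate the arithmetic behind the scikit-learn calls.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}


y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = binary_metrics(y_true_demo, y_pred_demo)
print(metrics)
```

With an imbalanced target such as the COVID result, accuracy alone can look good while recall on the positive class stays low, which is why all four scores are printed.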

In [None]:
def plot_confusion_matrix(y_test, y_pred, labels=['None Detected', 'Detected']):
    cm_df = pd.DataFrame(
        confusion_matrix(y_test, y_pred), columns=labels, index=labels
    )
    ax = sns.heatmap(
        data=cm_df, cmap=cm.Blues, annot=True, fmt='d'
    )
    ax.set(xlabel='Predicted', ylabel='Actual')
    plt.show()

In [None]:
EPOCHS = 30
BATCH_SIZE = 200
VAL_SPLIT = 0.2

In [None]:
def test_models(models, xy, isolate=None):
    X_train, y_train, X_test, y_test = xy
    X_train = np.array(X_train.values.tolist())
    y_train = np.array(y_train)
    for i, m in enumerate(models):
        name, model, loss, optimizer, binary = m.values()
        if isolate is not None and i + 1 != isolate:
            print('Skipping Model {}...'.format(i + 1))
            continue
        
        display(Markdown('### Model {} – {}'.format(i + 1, name)))
        
        # 1. Compile
        model.compile(
            loss=loss, optimizer=optimizer, metrics=['accuracy']
        )
        
        # 2. Fit
        history = model.fit(
            X_train, y_train,
            batch_size=BATCH_SIZE, epochs=EPOCHS, validation_split=VAL_SPLIT, shuffle=True, verbose=0
        )
        
        # 3. Visualize Model
        display(Markdown("**Summary**"))
        model.summary()
        plot_history(history)
        
        # 4. Evaluate
        display(Markdown("**Evaluation and Prediction**"))
        test_loss, accuracy = model.evaluate(x=X_test, y=y_test)
        print("\nLoss: {:.4f}".format(test_loss))
        print("Accuracy: {:.2f}%".format(accuracy * 100))
        
        # 5. Predict
        y_prediction_array = model.predict(X_test)
        
        if not binary:
            y_prediction = np.argmax(y_prediction_array, axis=1)
        else:
            y_prediction = np.round(y_prediction_array)
        
        # 6. Visualize Predictions
        metric_evaluation(y_test, y_prediction)
        plot_confusion_matrix(y_test, y_prediction)
        
        print()
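
The `binary` flag decides how raw network outputs become class labels: a two-unit softmax output is reduced with `argmax`, while a single sigmoid output is rounded. The sketch below uses made-up prediction arrays to show both paths, plus an explicit threshold as an alternative, since `np.round` rounds a probability of exactly 0.5 down to 0.

```python
import numpy as np

softmax_out = np.array([[0.9, 0.1], [0.3, 0.7], [0.5, 0.5]])  # hypothetical softmax outputs
sigmoid_out = np.array([[0.2], [0.8], [0.5]])                 # hypothetical sigmoid outputs

# Softmax head: pick the index of the larger probability (ties go to class 0)
softmax_labels = np.argmax(softmax_out, axis=1)

# Sigmoid head: np.round uses round-half-to-even, so 0.5 becomes 0
rounded = np.round(sigmoid_out).ravel()

# Explicit threshold: 0.5 counts as the positive class
thresholded = (sigmoid_out >= 0.5).astype(int).ravel()

print(softmax_labels, rounded, thresholded)
```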

#### 4.3 Model Construction <a class="anchor" id="models"></a>

In [None]:
tf.keras.backend.set_floatx('float64')
tf.set_random_seed(SEED)
np.random.seed(SEED)

###### 4.3.1 Decision Trees <a class="anchor" id="dt"></a>

In [None]:
dt = DecisionTreeClassifier(
    max_depth=4, criterion="entropy", random_state=SEED
)
dt = dt.fit(X_train, y_train)
y_prediction = dt.predict(X_test)
metric_evaluation(y_test, y_prediction)
plot_confusion_matrix(y_test, y_prediction)

In [None]:
dot_data = StringIO()
export_graphviz(dt, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True, feature_names=top_n_features, class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
# graph.write_png('rfc.png')
Image(graph.create_png())

###### 4.3.2 Random Forest Classifier <a class="anchor" id="rfc2"></a>

In [None]:
rfc = RandomForestClassifier(
    n_estimators=100, bootstrap=True, max_features='sqrt', random_state=SEED
)
rfc.fit(X_train, y_train)
y_prediction = rfc.predict(X_test)
metric_evaluation(y_test, y_prediction)
plot_confusion_matrix(y_test, y_prediction)

In [None]:
feature_importance = pd.Series(rfc.feature_importances_,index=top_n_features).sort_values(ascending=False).iloc[:10]
sns.barplot(x=feature_importance, y=feature_importance.index)
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()

###### 4.3.3 Simple Deep Neural Network <a class="anchor" id="dnn"></a>

In [None]:
dnn = Sequential([
    Dense(top_n_feats//2, activation='relu', input_shape=(top_n_feats,)),
    Dense(8, activation='relu'),
    Dense(8, activation='relu'),
    Dense(8, activation='relu'),
    Dense(4, activation='relu'),
    Dense(2, activation='softmax')
])

In [None]:
test_models([
    {'name': 'Simple Deep Neural Network', 'model': dnn,
     'loss': 'sparse_categorical_crossentropy', 'optimizer': RMSprop(learning_rate=1e-2), 'binary': False}
], (X_train, y_train, X_test, y_test))

###### 4.3.4 Convolutional Neural Network <a class="anchor" id="cnn"></a>

In [None]:
# cnn = Sequential([
#     Conv2D(32, kernel_size=(3, 3), activation='linear', input_shape=(top_n_feats,0,0), padding='same'),
#     LeakyReLU(alpha=0.1),
#     MaxPooling2D((2, 2), padding='same'),
#     Conv2D(64, (3, 3), activation='linear',padding='same'),
#     LeakyReLU(alpha=0.1),
#     MaxPooling2D(pool_size=(2, 2), padding='same'),
#     Conv2D(128, (3, 3), activation='linear',padding='same'),
#     LeakyReLU(alpha=0.1),
#     MaxPooling2D(pool_size=(2, 2), padding='same'),
#     Flatten(),
#     Dense(128, activation='linear'),
#     LeakyReLU(alpha=0.1),
#     Dense(2, activation='softmax')
# ])

###### All Models

In [None]:
model_lst = []

###### Retired Models

In [None]:
# MLPs are more linear...
mlp1 = Sequential([
    Dense(12, input_dim=top_n_feats, kernel_initializer='uniform', activation='relu'),
    Dense(8, kernel_initializer='uniform', activation='relu'),
    Dense(1, kernel_initializer='uniform', activation='sigmoid')
])
mlp2 = Sequential([
    Dense(top_n_feats, activation='relu'),
    Dropout(0.2),
    Dense(10, activation='relu'),
    Dropout(0.2),
    Dense(2, activation='softmax')
])
mlp3 = Sequential([
    Dense(64,activation='relu',input_dim=X_train.shape[1]),
    Dense(1)
])
seq = Sequential([
    Dense(top_n_feats, activation='relu'),
    Dense(6, activation='relu'),
    Dense(2, activation='softmax')
])

# RNNs are more for Time-Series Data
rnn = Sequential([
    Embedding(len(X_train) + 1, 64),
    LSTM(64, dropout=0.2, recurrent_dropout=0.2),
    Dense(1, activation='sigmoid')
])
#model_lst = [
#     {'name': 'MLP: Binary Classification', 'model': mlp1,
#      'loss': 'binary_crossentropy', 'optimizer': Adam(learning_rate=1e-2), 'binary': True},
#     {'name': 'MLP: Multi-Class Classification', 'model': mlp2,
#      'loss': 'sparse_categorical_crossentropy', 'optimizer': RMSprop(learning_rate=1e-2), 'binary': False},
#     {'name': 'MLP: Regression', 'model': mlp3,
#      'loss': 'mse', 'optimizer': RMSprop(learning_rate=1e-2), 'binary': False},
#     {'name': 'Sequential: Dense Layers, ReLU Activation', 'model': seq,
#      'loss': 'sparse_categorical_crossentropy', 'optimizer': Adam(learning_rate=1e-2), 'binary': False},
#     {'name': 'Convolutional Neural Network', 'model': cnn,
#      'loss': 'sparse_categorical_crossentropy', 'optimizer': RMSprop(learning_rate=1e-4, decay=1e-6), 'binary': False},
#     {'name': 'Recurrent Neural Network', 'model': rnn,
#      'loss': 'binary_crossentropy', 'optimizer': Adam(learning_rate=1e-4), 'binary': True},
# ]

#### 4.4 Simultaneous Model Evaluation <a class="anchor" id="eval"></a>

In [None]:
test_models(model_lst, (X_train, y_train, X_test, y_test))

#### 4.5 Randomized Search <a class="anchor" id="search"></a>

In [None]:
def build_model(n_hidden=1, n_neurons=30, learning_rate=3e-3, input_shape=[top_n_feats]):
    model = Sequential([
        InputLayer(input_shape=input_shape)
    ])
    for layer in range(n_hidden):
        model.add(Dense(n_neurons, activation="relu"))
    model.add(Dense(1))
    optimizer = SGD(learning_rate=learning_rate)
    model.compile(loss="mse", optimizer=optimizer)
    return model

In [None]:
keras_reg = KerasRegressor(build_model)

In [None]:
hyperparameters = {
    'n_hidden': [0, 1, 2, 3],
    'n_neurons': np.arange(1, 100),
    'learning_rate': reciprocal(3e-4, 3e-2)
}
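
`reciprocal(3e-4, 3e-2)` is a log-uniform distribution, so the search samples learning rates evenly across orders of magnitude rather than evenly across raw values. A standard-library sketch of the same idea follows; `sample_log_uniform` is a hypothetical helper, not the SciPy implementation.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_log_uniform(3e-4, 3e-2, rng) for _ in range(1000)]

# Roughly half of the draws should fall below the geometric mean, 3e-3
below_geometric_mean = sum(s < 3e-3 for s in samples) / len(samples)
print(min(samples), max(samples), below_geometric_mean)
```

A plain uniform draw over the same range would place about 90% of samples above 3e-3, leaving the small learning rates barely explored.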

In [None]:
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=SEED)

In [None]:
rnd_search_cv = RandomizedSearchCV(keras_reg, hyperparameters, n_iter=10, cv=cv)

In [None]:
run = False

In [None]:
%%time
if run:
    rnd_search_cv.fit(
        X_train, y_train, epochs=EPOCHS, validation_split=VAL_SPLIT, callbacks=[EarlyStopping(patience=10)], verbose=0
    )

In [None]:
def rnd_search_results(rscv):
    global final_top_x_models
    
    # 1. Best Model Params
    display(Markdown("**Best Parameters**"))
    best_params = rscv.best_params_
    for k, v in best_params.items():
        print(k + ":", v)
    print("Best Score: {}".format(rscv.best_score_))
    
    # 2. CV Results (use the search object passed in, not the global)
    cv_results = pd.DataFrame(rscv.cv_results_)
    print(cv_results)
    
    # 3. Top X
    top_x = 3
    top_x_models = cv_results.loc[cv_results['rank_test_score'].isin(range(1, top_x+1))].sort_values(
        by=['rank_test_score']
    )
    final_top_x_models = top_x_models.reset_index()
    print(final_top_x_models)
    
    # 4. Best!
    print(cv_results.iloc[rscv.best_index_])

    
if run:
    rnd_search_results(rnd_search_cv)

In [None]:
def build_model2(n_hidden, n_neurons):
    model = Sequential([
        InputLayer(input_shape=[top_n_feats])
    ])
    for layer in range(n_hidden):
        model.add(Dense(n_neurons, activation="relu"))
    model.add(Dense(1))
    return model

In [None]:
# best_models_list = [{
#     'name': 'Randomized Search: #{} Model'.format(idx + 1),
#     'model': build_model2(n_hidden=model['param_n_hidden'], n_neurons=model['param_n_neurons']),
#     'loss': 'sparse_categorical_crossentropy',
#     'optimizer': SGD(learning_rate=model['param_learning_rate'])
# } for idx, model in final_top_x_models.iterrows()]

In [None]:
# best_models_list

In [None]:
# test_models(best_models_list, (X_train, y_train, X_test, y_test))

## 5. Generative Adversarial Networks (GAN) <a class="anchor" id="gan"></a>

[Table of Contents](#toc)

#### 5.1 Network Setup <a class="anchor" id="setup2"></a>

In [None]:
RAND_DIM = 32
NB_STEPS = 500 + 1
BASE_N_COUNT = 128
BATCH_SIZE = 128
NUM_UPDATES_D = 1         # number of critic network updates per adversarial training step
NUM_UPDATES_G = 1         # number of generator network updates per adversarial training step
NUM_PRE_TRAIN_STEPS = 100 # number of steps to pre-train the critic before starting adversarial training
LOG_INTERVAL = 100        # interval (in steps) at which to log loss summaries and save plots of image samples to disc
LEARNING_RATE = 5e-4 
DIRECTORY = 'GAN/outputs/'
SHOW = True
generator_model_path, discriminator_model_path, loss_pickle_path = None, None, None

arguments = [
    RAND_DIM, NB_STEPS, BATCH_SIZE, NUM_UPDATES_D, NUM_UPDATES_G, NUM_PRE_TRAIN_STEPS, LOG_INTERVAL,
    LEARNING_RATE, BASE_N_COUNT, DIRECTORY, generator_model_path, discriminator_model_path, loss_pickle_path, SHOW
]

In [None]:
all_data = False

In [None]:
target = 'COVIDResult_Encoded'
if all_data:
    train = final_data.copy()
else:
    all_detected = final_data.loc[final_data[target] == 1]
    train = all_detected.copy().reset_index(drop=True)

In [None]:
all_columns = list(train.columns.tolist())
data_cols = all_columns[:-1]
label_cols = [target]
train_no_label = train[data_cols] / 10

In [None]:
train.head(5)

#### 5.2 Train GAN Models <a class="anchor" id="train"></a>

In [None]:
gan_model = 'cgan'

###### 5.2.1 GAN <a class="anchor" id="gan2"></a>

In [None]:
%%time
# GAN
if gan_model == 'gan':
    GAN.adversarial_training_GAN(
        arguments=arguments, train=train_no_label, data_cols=data_cols, seed=SEED
    )

###### 5.2.2 CGAN <a class="anchor" id="cgan"></a>

In [None]:
%%time
# CGAN
if gan_model == 'cgan':
    GAN.adversarial_training_GAN(
        arguments=arguments, train=train, data_cols=data_cols, label_cols=label_cols, seed=SEED
    )

###### 5.2.3 WGAN and WCGAN <a class="anchor" id="wgan"></a>

In [None]:
# %%time
# # WGAN
# GAN.adversarial_training_WGAN(
#     arguments=arguments, train=train_no_label, data_cols=data_cols, seed=SEED
# )
# # WCGAN
# GAN.adversarial_training_WGAN(
#     arguments=arguments, train=train, data_cols=data_cols, label_cols=label_cols, seed=SEED
# )

#### 5.3 Loss Information <a class="anchor" id="loss"></a>

In [None]:
%%time
TYPE_ = 'CGAN'

fig = plt.figure()
fig.subplots_adjust(hspace=0.6, wspace=0.6)
for i, step in zip(range(1, 7), range(0, 600, 100)):
    # Open the pickle in a context manager so the file handle is closed after loading
    with open('{}{}_losses_step_{}.pkl'.format(DIRECTORY, TYPE_, step), 'rb') as f:
        combined_loss, disc_loss_generated, disc_loss_real, xgb_losses = pickle.load(f)
    
    ax = fig.add_subplot(2, 3, i)
    ax.plot(pd.DataFrame(xgb_losses).rolling(10).mean())
    ax.title.set_text('Step {}'.format(step))

#### 5.4 Generate New Data <a class="anchor" id="new_data"></a>

In [None]:
np.random.seed(SEED)
NEW = 6000
DATA_DIM = len(data_cols)
LABEL_DIM = len(label_cols)
WITH_CLASS = LABEL_DIM > 0

In [None]:
def generate_helper(size, generator):
    """
    Draw a batch of real rows and generate synthetic rows with the trained generator.

    Returns the real batch and the generated rows as NumPy arrays.
    """
    x = GAN.get_data_batch(train, size, seed=SEED)
    labels = x[:, -LABEL_DIM:]
    
    if all_data: 
        z = np.random.normal(size=(size, RAND_DIM))
        g_z = generator.predict([z, labels]) if WITH_CLASS else generator.predict(z)
    else:
        # Generate NEW // size batches and stack them into one array
        batches = []
        for _ in range(NEW//size):
            new_z = np.random.normal(size=(size, RAND_DIM))
            batches.append(generator.predict([new_z, labels]) if WITH_CLASS else generator.predict(new_z))
        g_z = np.concatenate(batches, axis=0)
    
    return np.array(x), np.array(g_z)

In [None]:
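def generate():  
    """
    Rebuild the CGAN, load the step-500 generator weights, generate synthetic rows,
    and compare them against a real batch. Returns (real batch, generated rows).
    """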
    # 1. Define Models
    generator, discriminator, combined = GAN.define_models_CGAN(
        RAND_DIM, DATA_DIM, LABEL_DIM, BASE_N_COUNT
    )
    generator.load_weights('GAN/outputs/CGAN_generator_model_weights_step_500.h5')
    
    # 2. Generate Batches of Data
    test_size = len(train)
    x, g_z = generate_helper(test_size, generator)
        
    # 3. Visualize Accuracy + New Data
    print("Accuracy:", GAN.CheckAccuracy(
        x, g_z, data_cols, label_cols, seed=SEED, with_class=WITH_CLASS, data_dim=DATA_DIM
    ))
    GAN.PlotData(
        x, g_z, data_cols, label_cols, seed=SEED, with_class=WITH_CLASS, data_dim=DATA_DIM
    )
    return x, g_z
    
batch, generated = generate()

#### 5.5 Test New Data on Models <a class="anchor" id="train_gan"></a>

In [None]:
real = pd.DataFrame(batch, columns=data_cols+label_cols)
real['syn_label'] = 0
real.head(5)

In [None]:
test = pd.DataFrame(generated, columns=data_cols+label_cols)
test['syn_label'] = 1
test.head(5)

In [None]:
real.shape

In [None]:
test.shape

In [None]:
SPLIT = 0.5

In [None]:
n_real, n_test = int(len(real)*SPLIT), int(len(test)*SPLIT)

In [None]:
train_gan = pd.concat([real[:n_real], test[:n_test]], axis=0)
train_gan = train_gan.sample(frac=1, random_state=SEED).reset_index(drop=True) # shuffle (seeded for reproducibility)
test_gan = pd.concat([real[n_real:], test[n_test:]], axis=0)
test_gan = test_gan.sample(frac=1, random_state=SEED).reset_index(drop=True) # shuffle (seeded for reproducibility)

In [None]:
X = train_gan.columns[:-2]
y = train_gan.columns[-1]
y_true = test_gan[y]
d_train = xgb.DMatrix(train_gan[X], train_gan[y], feature_names=X)
d_test = xgb.DMatrix(test_gan[X], feature_names=X)

In [None]:
parameters = {
    'max_depth': 4,
    'objective': 'binary:logistic',
    'random_state': SEED,
    'eval_metric': 'auc'
}
xgb_clf = xgb.train(parameters, d_train, num_boost_round=10)

In [None]:
y_pred = xgb_clf.predict(d_test)
# Both helpers expect the true labels first and the predictions second
metric_evaluation(y_true, np.round(y_pred))
plot_confusion_matrix(y_true, np.round(y_pred), labels=['Real', 'Fake'])

#### 5.6 Plot Real vs Test Data <a class="anchor" id="plot"></a>

In [None]:
# test_gan was shuffled, so select real and generated rows by syn_label rather than by position
is_real = (test_gan['syn_label'] == 0).values

for i in range(0, len(X)-1, 2):
    fig, ax = plt.subplots(1, 2, figsize=(6,2))

    ax[0].scatter(test_gan[is_real][X[i]], test_gan[is_real][X[i + 1]], c=y_pred[is_real], cmap='plasma')
    ax[0].set_title('real')
    ax[0].set_ylabel(X[i + 1])

    ax[1].scatter(test_gan[~is_real][X[i]], test_gan[~is_real][X[i + 1]], c=y_pred[~is_real], cmap='plasma')
    ax[1].set_title('generated')
    ax[1].set_xlim(ax[0].get_xlim()), ax[1].set_ylim(ax[0].get_ylim())

    for a in ax:
        a.set_xlabel(X[i])

    plt.show()

In [None]:
colors = ['red','blue']
markers = ['o','^']
labels = ['normal','detected']

target = 'COVIDResult_Encoded'

for i in range(0, len(X), 2):
    col1, col2 = i, i + 1
    if col2 >= len(X):
        continue
    
    fig, ax = plt.subplots(1, 2, figsize=(6,2))
    # test_gan was shuffled, so filter on syn_label instead of slicing by position
    real_rows = test_gan[test_gan['syn_label'] == 0]
    generated_rows = test_gan[test_gan['syn_label'] == 1]

    for group, color, marker, label in zip(real_rows.groupby(target), colors, markers, labels):
        ax[0].scatter(group[1][X[col1]], group[1][X[col2]], label=label, c=color, marker=marker, alpha=0.2)
    ax[0].legend()
    ax[0].set_title('real')
    ax[0].set_ylabel(X[col2])

    for group, color, marker, label in zip(generated_rows.groupby(target), colors, markers, labels):
        ax[1].scatter(group[1][X[col1]], group[1][X[col2]], label=label, c=color, marker=marker, alpha=0.2)
    ax[1].set_xlim(ax[0].get_xlim())
    ax[1].set_ylim(ax[0].get_ylim())
    ax[1].legend()
    ax[1].set_title('generated')

    for a in ax:
        a.set_xlabel(X[col1])

    plt.show()

#### 5.7 Feature Importance <a class="anchor" id="importance"></a>

In [None]:
MAX_FEATURES = 20

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
xgb.plot_importance(xgb_clf, max_num_features=MAX_FEATURES, height=0.5, ax=ax)
plt.show()

## 6. Retrain Models with GAN Data <a class="anchor" id="retrain"></a>

[Table of Contents](#toc)

#### 6.1 Re-Prepare Data <a class="anchor" id="prep2"></a>

In [None]:
real_data = final_data.copy()
fake_data = test.copy().drop(columns=['syn_label'])

In [None]:
combined = pd.concat([real_data, fake_data], axis=0)
combined = combined.sample(frac=1, random_state=SEED).reset_index(drop=True) # shuffle (seeded for reproducibility)

In [None]:
combined.shape

In [None]:
retrain, retest = train_test_split(combined, test_size=0.5, random_state=SEED)

In [None]:
target = 'COVIDResult_Encoded'
X_cols = combined.columns.tolist()
X_cols.remove(target)

In [None]:
X_retrain = retrain[X_cols]
y_retrain = retrain[target]
X_retest = retest[X_cols]
y_retest = retest[target]

#### 6.2 Re-Train Models <a class="anchor" id="retrain2"></a>

In [None]:
dt = DecisionTreeClassifier(
    max_depth=4, criterion="entropy", random_state=SEED
)
dt = dt.fit(X_retrain, y_retrain)
y_prediction = dt.predict(X_retest)
metric_evaluation(y_retest, y_prediction)
plot_confusion_matrix(y_retest, y_prediction)

In [None]:
dot_data = StringIO()
# feature_names must match the columns the tree was fit on (X_retrain = retrain[X_cols])
export_graphviz(dt, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names=X_cols, class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
# graph.write_png('dt.png')
Image(graph.create_png())

In [None]:
rfc = RandomForestClassifier(
    n_estimators=100, bootstrap=True, max_features='sqrt', random_state=SEED
)
rfc.fit(X_retrain, y_retrain)
y_prediction = rfc.predict(X_retest)
metric_evaluation(y_retest, y_prediction)
plot_confusion_matrix(y_retest, y_prediction)

In [None]:
# Index by the columns the forest was fit on, then keep the ten most important
feature_importance = (
    pd.Series(rfc.feature_importances_, index=X_cols)
    .sort_values(ascending=False)
    .iloc[:10]
)
sns.barplot(x=feature_importance, y=feature_importance.index)
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()

In [None]:
# Input width must equal the number of columns in X_retrain
n_features = len(X_cols)
dnn = Sequential([
    Dense(n_features//2, activation='relu', input_shape=(n_features,)),
    Dense(8, activation='relu'),
    Dense(8, activation='relu'),
    Dense(8, activation='relu'),
    Dense(4, activation='relu'),
    Dense(2, activation='softmax')  # one probability per class
])
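
The network ends in a two-unit softmax layer, so its predictions come back as one probability per class and are turned into labels with an argmax (this is what `test_final_models` does with `np.argmax(..., axis=1)`). A minimal stdlib sketch of that decoding step follows. `softmax` and `decode` are hypothetical helper names and the probabilities are made up.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1 (shifted for stability)."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(prob_rows):
    """Pick the index of the largest probability in each row (argmax per row)."""
    return [max(range(len(row)), key=row.__getitem__) for row in prob_rows]


print(softmax([0.0, 0.0]))                   # [0.5, 0.5]
print(decode([[0.9, 0.1], [0.2, 0.8]]))      # [0, 1]
```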

In [None]:
test_models([
    {'name': 'Simple Deep Neural Network', 'model': dnn,
     'loss': 'sparse_categorical_crossentropy', 'optimizer': RMSprop(learning_rate=1e-2), 'binary': False}
], (X_retrain, y_retrain, X_retest, y_retest))

## 7. Final Model Training with Feature Selection <a class="anchor" id="final"></a>

[Table of Contents](#toc)

#### 7.1 Define Models and Variables <a class="anchor" id="define"></a>

In [358]:
models = [
    DecisionTreeClassifier(max_depth=4, criterion="entropy", random_state=SEED),
    RandomForestClassifier(n_estimators=100, bootstrap=True, max_features='sqrt', random_state=SEED),
    'Sequential'
]
names = [
    'Decision Tree Classifier',
    'Random Forest Classifier',
    'Deep Neural Network'
]

In [359]:
for model in models:
    print(model)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=0, splitter='best')
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='sqrt', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)
Sequential


In [360]:
drops = [
    ('sbp', 'dbp'),
    'pulse_ox',
    'cmp_glucose',
    'resp_rate',
    None
]

In [361]:
metrics = [
    'Accuracy',
    'Precision',
    'Recall',
    'F1'
]

In [362]:
EPOCHS = 30
BATCH_SIZE = 200
VAL_SPLIT = 0.2

In [363]:
def metric_evaluation2(y_test, y_pred):
    acc = round(accuracy_score(y_test, y_pred) * 100, 2)
    prec = round(precision_score(y_test, y_pred) * 100, 2)
    rec = round(recall_score(y_test, y_pred) * 100, 2)
    f1 = round(f1_score(y_test, y_pred) * 100, 2)
    return [acc, prec, rec, f1]
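
As a sanity check on these four metrics, they can be recomputed by hand from a confusion matrix laid out as `[[TN, FP], [FN, TP]]`. The sketch below uses the 12-feature decision-tree matrix reported in section 7.2 and reproduces the printed scores. `metrics_from_confusion` is a hypothetical helper, and returning 0 when a denominator is zero is an assumption chosen to match the 0.00% values shown for models that never predict the positive class.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Return [accuracy, precision, recall, F1] as percentages rounded to 2 places."""
    def ratio(num, den):
        # Treat an undefined ratio (zero denominator) as 0
        return num / den if den else 0.0

    acc = ratio(tp + tn, tp + tn + fp + fn)
    prec = ratio(tp, tp + fp)
    rec = ratio(tp, tp + fn)
    f1 = ratio(2 * tp, 2 * tp + fp + fn)
    return [round(100 * value, 2) for value in (acc, prec, rec, f1)]


# Decision tree, 12 features, real data: [[1376, 8], [85, 7]]
print(metrics_from_confusion(tn=1376, fp=8, fn=85, tp=7))
# [93.7, 46.67, 7.61, 13.08]
```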

In [369]:
def test_final_models(models_lst, xy):
    X_train, y_train, X_test, y_test = xy
    X_cols = [
        'Age',
        'FirstRace_Encoded',
        'Ethnicity_Encoded',
        'Sex_Encoded',
        'height',
        'wght',
        'heart_rate',
        'sbp',
        'dbp',
        'pulse_ox',
        'resp_rate',
        'cmp_glucose'
    ]
    
    metric_scores_lst = []
    index_tuples = []
    
    for num, drop in enumerate(drops):
        display(Markdown('#### {} Features'.format(len(X_cols))))
        metric_scores = []
        
        for model, name in zip(models_lst, names):
            display(Markdown('<u>{}</u>'.format(name)))
            
            if model == 'Sequential':
                nn = True
                model = Sequential([
                            Dense(len(X_cols)//2, activation='relu', input_shape=(len(X_cols),)),
                            Dense(8, activation='relu'),
                            Dense(8, activation='relu'),
                            Dense(8, activation='relu'),
                            Dense(4, activation='relu'),
                            Dense(2, activation='softmax')
                        ])
                model.compile(
                    loss='sparse_categorical_crossentropy',
                    optimizer=RMSprop(learning_rate=1e-2),
                    metrics=['accuracy']
                )
                model.fit(
                    X_train[X_cols], y_train,
                    batch_size=BATCH_SIZE, epochs=EPOCHS, validation_split=VAL_SPLIT, shuffle=True, verbose=0
                )

            else:
                nn = False
                model.fit(X_train[X_cols], y_train)

            y_prediction = model.predict(X_test[X_cols])
            y_prediction = np.argmax(y_prediction, axis=1) if nn else np.round(y_prediction)
            metric_evaluation(y_test, y_prediction, labels=False)
            
            metric_scores.extend(metric_evaluation2(y_test, y_prediction))
            if num == 0:
                for metric in metrics:
                    index_tuples.append((name, metric))
                
        
        
        metric_scores_lst.append(metric_scores)
        
        if drop is not None:
            if not isinstance(drop, str):
                X_cols.remove(drop[0])
                display(Markdown('*Dropping {}...*'.format(drop[0])))
                X_cols.remove(drop[1])
                display(Markdown('*Dropping {}...*'.format(drop[1])))
                
            else:
                X_cols.remove(drop)
                display(Markdown('*Dropping {}...*'.format(drop)))
           
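    index = pd.MultiIndex.from_tuples(index_tuples)
    # Transpose so rows are (model, metric) pairs and columns are feature counts;
    # materialise the map into a list so the DataFrame constructor gets concrete rows
    transposed = list(map(list, zip(*metric_scores_lst)))
    return pd.DataFrame(transposed, columns=['12', '10', '9', '8', '7'], index=index)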
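
The column labels `'12', '10', '9', '8', '7'` are hard-coded, so they only stay correct while `drops` and the starting feature list are unchanged. The sketch below shows how the same counts fall out of the ablation schedule. `feature_counts` is a hypothetical helper, and the lists are copied from the cells above.

```python
def feature_counts(features, drops):
    """Return the number of features used at each ablation step."""
    remaining = list(features)
    counts = []
    for drop in drops:
        counts.append(len(remaining))  # models are scored before the drop is applied
        if drop is None:
            continue
        to_remove = [drop] if isinstance(drop, str) else list(drop)
        for name in to_remove:
            remaining.remove(name)
    return counts


features = [
    'Age', 'FirstRace_Encoded', 'Ethnicity_Encoded', 'Sex_Encoded',
    'height', 'wght', 'heart_rate', 'sbp', 'dbp',
    'pulse_ox', 'resp_rate', 'cmp_glucose',
]
drops = [('sbp', 'dbp'), 'pulse_ox', 'cmp_glucose', 'resp_rate', None]

print(feature_counts(features, drops))  # [12, 10, 9, 8, 7]
```

Deriving the labels this way would keep the results table in step with any later change to `drops`.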

#### 7.2 Model Performance without GAN (Real Data) <a class="anchor" id="perf1"></a>

In [365]:
%%time
metric_df1 = test_final_models(models, (X_train, y_train, X_test, y_test))

#### 12 Features

<u>Decision Tree Classifier</u>

Accuracy: 93.70%
Precision: 46.67%
Recall: 7.61%
F1: 13.08%
[[1376    8]
 [  85    7]]


<u>Random Forest Classifier</u>

Accuracy: 94.11%
Precision: 77.78%
Recall: 7.61%
F1: 13.86%
[[1382    2]
 [  85    7]]


<u>Deep Neural Network</u>

Accuracy: 93.77%
Precision: 0.00%
Recall: 0.00%
F1: 0.00%
[[1384    0]
 [  92    0]]




*Dropping sbp...*

*Dropping dbp...*

#### 10 Features

<u>Decision Tree Classifier</u>

Accuracy: 93.70%
Precision: 46.67%
Recall: 7.61%
F1: 13.08%
[[1376    8]
 [  85    7]]


<u>Random Forest Classifier</u>

Accuracy: 93.97%
Precision: 61.54%
Recall: 8.70%
F1: 15.24%
[[1379    5]
 [  84    8]]


<u>Deep Neural Network</u>

Accuracy: 93.77%
Precision: 0.00%
Recall: 0.00%
F1: 0.00%
[[1384    0]
 [  92    0]]




*Dropping pulse_ox...*

#### 9 Features

<u>Decision Tree Classifier</u>

Accuracy: 93.77%
Precision: 0.00%
Recall: 0.00%
F1: 0.00%
[[1384    0]
 [  92    0]]




<u>Random Forest Classifier</u>

Accuracy: 93.56%
Precision: 42.11%
Recall: 8.70%
F1: 14.41%
[[1373   11]
 [  84    8]]


<u>Deep Neural Network</u>

Accuracy: 93.77%
Precision: 0.00%
Recall: 0.00%
F1: 0.00%
[[1384    0]
 [  92    0]]




*Dropping cmp_glucose...*

#### 8 Features

<u>Decision Tree Classifier</u>

Accuracy: 93.77%
Precision: 50.00%
Recall: 3.26%
F1: 6.12%
[[1381    3]
 [  89    3]]


<u>Random Forest Classifier</u>

Accuracy: 93.56%
Precision: 40.00%
Recall: 6.52%
F1: 11.21%
[[1375    9]
 [  86    6]]


<u>Deep Neural Network</u>

Accuracy: 93.77%
Precision: 0.00%
Recall: 0.00%
F1: 0.00%
[[1384    0]
 [  92    0]]




*Dropping resp_rate...*

#### 7 Features

<u>Decision Tree Classifier</u>

Accuracy: 93.77%
Precision: 50.00%
Recall: 3.26%
F1: 6.12%
[[1381    3]
 [  89    3]]


<u>Random Forest Classifier</u>

Accuracy: 93.16%
Precision: 28.57%
Recall: 6.52%
F1: 10.62%
[[1369   15]
 [  86    6]]


<u>Deep Neural Network</u>

Accuracy: 93.77%
Precision: 0.00%
Recall: 0.00%
F1: 0.00%
[[1384    0]
 [  92    0]]
CPU times: user 57.5 s, sys: 1.55 s, total: 59



In [366]:
metric_df1

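| Model | Metric (%) | 12 features | 10 features | 9 features | 8 features | 7 features |
|---|---|---|---|---|---|---|
| Decision Tree Classifier | Accuracy | 93.7 | 93.7 | 93.77 | 93.77 | 93.77 |
| Decision Tree Classifier | Precision | 46.67 | 46.67 | 0.0 | 50.0 | 50.0 |
| Decision Tree Classifier | Recall | 7.61 | 7.61 | 0.0 | 3.26 | 3.26 |
| Decision Tree Classifier | F1 | 13.08 | 13.08 | 0.0 | 6.12 | 6.12 |
| Random Forest Classifier | Accuracy | 94.11 | 93.97 | 93.56 | 93.56 | 93.16 |
| Random Forest Classifier | Precision | 77.78 | 61.54 | 42.11 | 40.0 | 28.57 |
| Random Forest Classifier | Recall | 7.61 | 8.7 | 8.7 | 6.52 | 6.52 |
| Random Forest Classifier | F1 | 13.86 | 15.24 | 14.41 | 11.21 | 10.62 |
| Deep Neural Network | Accuracy | 93.77 | 93.77 | 93.77 | 93.77 | 93.77 |
| Deep Neural Network | Precision | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Deep Neural Network | Recall | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Deep Neural Network | F1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |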
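
The 93.77% accuracy reported for the neural network at every feature count equals the score of a model that always predicts the negative class: the real test set holds 1384 negatives and 92 positives, as the confusion matrices above show. A short worked check, with `majority_baseline_accuracy` as a hypothetical helper:

```python
def majority_baseline_accuracy(n_negative, n_positive):
    """Accuracy (in %) of always predicting the more frequent class."""
    total = n_negative + n_positive
    return round(100 * max(n_negative, n_positive) / total, 2)


# Real-data test set from section 7.2: 1384 negatives, 92 positives
print(majority_baseline_accuracy(1384, 92))  # 93.77
```

With classes this imbalanced, accuracy alone says little, which is why recall and F1 are the more informative rows in the table.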


#### 7.3 Model Performance with GAN <a class="anchor" id="perf2"></a>

In [367]:
%%time
metric_df2 = test_final_models(models, (X_retrain, y_retrain, X_retest, y_retest))

#### 12 Features

<u>Decision Tree Classifier</u>

Accuracy: 96.07%
Precision: 99.93%
Recall: 91.75%
F1: 95.67%
[[3429    2]
 [ 254 2825]]


<u>Random Forest Classifier</u>

Accuracy: 96.16%
Precision: 99.75%
Recall: 92.11%
F1: 95.78%
[[3424    7]
 [ 243 2836]]


<u>Deep Neural Network</u>

Accuracy: 95.53%
Precision: 98.88%
Recall: 91.59%
F1: 95.09%
[[3399   32]
 [ 259 2820]]


*Dropping sbp...*

*Dropping dbp...*

#### 10 Features

<u>Decision Tree Classifier</u>

Accuracy: 96.14%
Precision: 100.00%
Recall: 91.85%
F1: 95.75%
[[3431    0]
 [ 251 2828]]


<u>Random Forest Classifier</u>

Accuracy: 96.11%
Precision: 99.44%
Recall: 92.30%
F1: 95.74%
[[3415   16]
 [ 237 2842]]


<u>Deep Neural Network</u>

Accuracy: 95.55%
Precision: 99.61%
Recall: 90.94%
F1: 95.08%
[[3420   11]
 [ 279 2800]]


*Dropping pulse_ox...*

#### 9 Features

<u>Decision Tree Classifier</u>

Accuracy: 96.13%
Precision: 100.00%
Recall: 91.82%
F1: 95.73%
[[3431    0]
 [ 252 2827]]


<u>Random Forest Classifier</u>

Accuracy: 96.07%
Precision: 99.44%
Recall: 92.21%
F1: 95.69%
[[3415   16]
 [ 240 2839]]


<u>Deep Neural Network</u>

Accuracy: 92.15%
Precision: 94.06%
Recall: 89.02%
F1: 91.47%
[[3258  173]
 [ 338 2741]]


*Dropping cmp_glucose...*

#### 8 Features

<u>Decision Tree Classifier</u>

Accuracy: 96.04%
Precision: 99.79%
Recall: 91.82%
F1: 95.64%
[[3425    6]
 [ 252 2827]]


<u>Random Forest Classifier</u>

Accuracy: 95.99%
Precision: 99.13%
Recall: 92.34%
F1: 95.61%
[[3406   25]
 [ 236 2843]]


<u>Deep Neural Network</u>

Accuracy: 85.27%
Precision: 81.45%
Recall: 89.15%
F1: 85.13%
[[2806  625]
 [ 334 2745]]


*Dropping resp_rate...*

#### 7 Features

<u>Decision Tree Classifier</u>

Accuracy: 96.04%
Precision: 99.79%
Recall: 91.82%
F1: 95.64%
[[3425    6]
 [ 252 2827]]


<u>Random Forest Classifier</u>

Accuracy: 95.75%
Precision: 98.44%
Recall: 92.47%
F1: 95.36%
[[3386   45]
 [ 232 2847]]


<u>Deep Neural Network</u>

Accuracy: 83.73%
Precision: 83.40%
Recall: 81.91%
F1: 82.65%
[[2929  502]
 [ 557 2522]]

In [368]:
metric_df2

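| Model | Metric (%) | 12 features | 10 features | 9 features | 8 features | 7 features |
|---|---|---|---|---|---|---|
| Decision Tree Classifier | Accuracy | 96.07 | 96.14 | 96.13 | 96.04 | 96.04 |
| Decision Tree Classifier | Precision | 99.93 | 100.0 | 100.0 | 99.79 | 99.79 |
| Decision Tree Classifier | Recall | 91.75 | 91.85 | 91.82 | 91.82 | 91.82 |
| Decision Tree Classifier | F1 | 95.67 | 95.75 | 95.73 | 95.64 | 95.64 |
| Random Forest Classifier | Accuracy | 96.16 | 96.11 | 96.07 | 95.99 | 95.75 |
| Random Forest Classifier | Precision | 99.75 | 99.44 | 99.44 | 99.13 | 98.44 |
| Random Forest Classifier | Recall | 92.11 | 92.3 | 92.21 | 92.34 | 92.47 |
| Random Forest Classifier | F1 | 95.78 | 95.74 | 95.69 | 95.61 | 95.36 |
| Deep Neural Network | Accuracy | 95.53 | 95.55 | 92.15 | 85.27 | 83.73 |
| Deep Neural Network | Precision | 98.88 | 99.61 | 94.06 | 81.45 | 83.4 |
| Deep Neural Network | Recall | 91.59 | 90.94 | 89.02 | 89.15 | 81.91 |
| Deep Neural Network | F1 | 95.09 | 95.08 | 91.47 | 85.13 | 82.65 |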
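
Sections 7.2 and 7.3 are scored on different test sets (`X_test` versus `X_retest`, which mixes real and generated rows), so the two tables are not a like-for-like comparison. One quick check is the share of positive cases in each test set, read off the 12-feature decision-tree confusion matrices above. `positive_share` is a hypothetical helper, and the `[[TN, FP], [FN, TP]]` layout is assumed from the printed metrics.

```python
def positive_share(confusion):
    """Percentage of truly positive rows, given [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = confusion
    return round(100 * (fn + tp) / (tn + fp + fn + tp), 2)


real_only = [[1376, 8], [85, 7]]        # section 7.2, 12 features
with_gan = [[3429, 2], [254, 2825]]     # section 7.3, 12 features

print(positive_share(real_only))  # 6.23
print(positive_share(with_gan))   # 47.3
```

The jump from about 6% to about 47% positives suggests that much of the metric change may come from the altered class balance and the generated test rows, so a fairer comparison would probably score both sets of models on the same real-only test set.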
