# Introduction

For knowledge discovery in databases (KDD) to succeed, data should be managed with well-defined, formal methods. The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a standard methodology that comprises six phases:
    1. Problem domain understanding
    2. Data Understanding
    3. Data Preparation
    4. Modeling
    5. Evaluation
    6. Deployment

----------------------------------------------------------------------------------------------------------------------

# Part 1 - Problem domain understanding

Cancer remains a challenge worldwide, in both prevention and treatment. However, most cancers are highly curable if they are detected early, so the stage at diagnosis heavily influences survival. Because early warning signs are often absent, routine screening tests are important. For many types of cancer, such as colorectal, lung, and stomach cancer, screening rates remain low because the procedures are unpleasant and expensive. A risk prediction model for cancer could therefore benefit both customers and health institutions. For customers, it encourages people to take screening tests so that cancer risk is detected early and survival rates increase. For health institutions, it allows them to provide more services and hence increase sales.

Electronic medical records have become increasingly available through regular health checkups. Recent research has shown growing interest in finding cancer biomarkers in routine blood tests. In general, blood indices are related to cancer to some extent, but none of them alone exhibits a connection clear enough to be used for diagnostic purposes. Taken together, however, these basic blood indices may reveal converging signs or patterns in an individual for many forms of cancer. By monitoring selected biomarkers routinely measured in primary care, we can learn a lot about the physiological patterns that promote carcinogenesis, proliferation, and progression before tumor markers emerge.

This research aims to use the temporal, longitudinal data accumulated in regular health checkups to explore patterns of change across many biomarkers in common blood tests and to use those patterns to predict cancer.

----------------------------------------------------------------------------------------------------------------------

# Part 2 - Data Exploration & Understanding


## 1. Import Libraries and Define Common Functions

### 1.1. Import Libraries

In [295]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import pylab
from scipy import stats
from sklearn.linear_model import LinearRegression
import seaborn as sns
sns.set()

# Modelling Helpers:
# from sklearn.preprocessing import Imputer, Normalizer, scale
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, ShuffleSplit, cross_validate
from sklearn import model_selection

# Classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from lightgbm import LGBMClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
# Evaluation metrics for Classification
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, classification_report, roc_auc_score, roc_curve, precision_recall_curve, average_precision_score
from sklearn.metrics import mutual_info_score

# Regression
from sklearn.linear_model import LinearRegression,Ridge,Lasso,RidgeCV,ElasticNet,LogisticRegression
from sklearn.ensemble import RandomForestRegressor,BaggingRegressor,GradientBoostingRegressor,AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor
# Evaluation metrics for Regression 
from sklearn.metrics import mean_squared_log_error, mean_squared_error, r2_score, mean_absolute_error
# Additional classification metrics not already imported above
from sklearn.metrics import auc, precision_recall_fscore_support

# Configuration
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)


# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

print("Setup complete...")

Setup complete...


### 1.2. Common Functions

In [296]:
# Distribution plot

def analyse_continuous(df,var,target):
    df = df.copy()
    # df[var] = df[var].fillna(df[var].median())
    plt.figure(figsize=(20,5))
       
    # histogram
    plt.subplot(131)
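    # histplot replaces the deprecated distplot; stat="density" keeps the density scale
    sns.histplot(df[var], bins=30, kde=True, stat="density")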
    #sns.distplot(df[var],hist=True, kde=True,kde_kws={'shade': True, 'linewidth': 3})
    plt.title('Histogram')    
    
    # Q-Q plot
    plt.subplot(132)
    # drop missing values first, otherwise the fitted reference line is NaN
    stats.probplot(df[var].dropna(), dist="norm", plot=pylab)
    plt.ylabel('Quantiles')    
    
    # boxplot
    plt.subplot(133)
    sns.boxplot(x=df[var])
    plt.title('Boxplot')
          
    # skewness and kurtosis
    print('Skewness: %f' % df[var].skew())
    print('Kurtosis: %f' % df[var].kurt())
    plt.show()

In [297]:
def Training_Preparation(df, cont_vars):
    num_df = df[cont_vars].copy()

    # scaling features
    from sklearn.preprocessing import MinMaxScaler
    numdf_norm = pd.DataFrame(MinMaxScaler().fit_transform(df[cont_vars]))
    numdf_norm.columns = num_df.columns
    
    # Define X & y
    X = numdf_norm
    y = df['Class']

    # Split to train and test set
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=90, stratify = y)
    
    # initialize models
    models = []
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('LSVC', SVC(kernel="linear")))  # linear-kernel SVC
    models.append(('SVC', SVC(kernel="rbf")))      # RBF-kernel SVC
    models.append(('LR', LogisticRegression()))
    models.append(('DT', DecisionTreeClassifier()))
    models.append(('GNB', GaussianNB()))
    models.append(('RF', RandomForestClassifier()))
    models.append(('GB', GradientBoostingClassifier()))
    models.append(('LGB',LGBMClassifier()))
    models.append(('ADA',AdaBoostClassifier()))
    models.append(('LDA',LinearDiscriminantAnalysis()))
    models.append(('QDA',QuadraticDiscriminantAnalysis()))
    models.append(('NN',MLPClassifier()))
    models.append(('XGB',XGBClassifier()))
    
    # Test options and evaluation metric
    seed = 9
    scoring = 'recall_macro'

    # evaluate each model in turn
    results = {}
    names = []

    for name, model in models:
        # random_state only takes effect (and is only accepted) when shuffle=True
        kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
        cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
        results[name] = cv_results
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)
        
    results_df = pd.DataFrame(results)
    plt.figure(figsize=(16,8))
    sns.boxplot(data=results_df)
    plt.show()
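
Training_Preparation fits MinMaxScaler on the full frame before the train/test split, so the scaling ranges are computed from rows that later land in the validation folds. The following is a minimal sketch of one way to keep preprocessing inside each fold. It assumes scikit-learn's `Pipeline` and a median `SimpleImputer` for missing values, neither of which appears in the original notebook, and it runs on synthetic data rather than the NHANES frame.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the blood-test features (not the NHANES data)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan  # inject roughly 10% missing values

# Imputer and scaler are refit on the training part of every fold
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=9)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="recall_macro")
print("recall_macro: %f (%f)" % (scores.mean(), scores.std()))
```

Any of the classifiers in the `models` list could take the place of `LogisticRegression` in the last pipeline step.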

In [298]:
from matplotlib.backends.backend_pdf import PdfPages


def DistributionComparison(all_df, selected_vars,name):
    colors = ['#3791D7','#D72626']

    # pdf = matplotlib.backends.backend_pdf.PdfPages(name + '.pdf')
    with PdfPages(name + '.pdf') as pdf_pages:
        for column in selected_vars:    
            fig = plt.figure(figsize=[8,4])
            plt.subplot(121)
            sns.boxplot(x='Class', y=column,data=all_df,palette=colors)
            plt.title(column, fontsize=12)
            plt.subplot(122)
            # bw and shade are deprecated in seaborn; bw_method and fill are their replacements
            sns.kdeplot(all_df[all_df.Class==1][column], bw_method=0.4, label="Cancer", fill=True, color="#D72626", linestyle="--")
            sns.kdeplot(all_df[all_df.Class==0][column], bw_method=0.4, label="NoCancer", fill=True, color="#3791D7", linestyle=":")
            plt.legend()
            plt.title(column, fontsize=12)
            pdf_pages.savefig(fig)                                          
            plt.show()    

    # Write the PDF document to the disk
    #pdf_pages.close()

In [299]:
def ModelEvaluation(df, cont_vars):
    
    num_df = df[cont_vars].copy()

    # scaling features
    from sklearn.preprocessing import MinMaxScaler
    numdf_norm = pd.DataFrame(MinMaxScaler().fit_transform(df[cont_vars]))
    numdf_norm.columns = num_df.columns
    
    # Define X & y
    X = numdf_norm
    y = df['Class']

    # Split to train and test set
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=90, stratify = y)
    
    # initialize models
    models = []
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('LSVC', SVC(kernel="linear")))  # linear-kernel SVC
    models.append(('SVC', SVC(kernel="rbf")))      # RBF-kernel SVC
    models.append(('LR', LogisticRegression()))
    models.append(('DT', DecisionTreeClassifier()))
    models.append(('GNB', GaussianNB()))
    models.append(('RF', RandomForestClassifier()))
    models.append(('GB', GradientBoostingClassifier()))
    models.append(('LGB',LGBMClassifier()))
    models.append(('ADA',AdaBoostClassifier()))
    models.append(('LDA',LinearDiscriminantAnalysis()))
    models.append(('QDA',QuadraticDiscriminantAnalysis()))
    models.append(('NN',MLPClassifier()))
    models.append(('XGB',XGBClassifier()))
    
    for name,model in models:
        print(name)
        model.fit(X_train, y_train)
        
        print('==========================================================')
        print('Train set')
        y_train_pred = model.predict(X_train)
        print('Accuracy: ', accuracy_score(y_train, list(y_train_pred)))
        print('ROC AUC Score: ', roc_auc_score(y_train, list(y_train_pred)))
        cm_df = pd.DataFrame(confusion_matrix(y_train,list(y_train_pred)), index=model.classes_,columns=model.classes_)
        cm_df.index.name = 'True'
        cm_df.columns.name = 'Predicted'
        print('Confusion matrix')
        print(cm_df)
        print(classification_report(y_train, list(y_train_pred)))
  
        print('----------------------------------------------------------')
        print('Test set')
        y_test_pred = model.predict(X_test)
        print('Accuracy: ', accuracy_score(y_test, list(y_test_pred)))
        print('ROC AUC Score: ', roc_auc_score(y_test, list(y_test_pred)))
        cm_df = pd.DataFrame(confusion_matrix(y_test,list(y_test_pred)), index=model.classes_,columns=model.classes_)
        cm_df.index.name = 'True'
        cm_df.columns.name = 'Predicted'
        print('Confusion matrix')
        print(cm_df)
        print(classification_report(y_test, list(y_test_pred)))
        print('==========================================================')
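
ModelEvaluation passes the hard class labels from `model.predict` to `roc_auc_score`. With 0/1 predictions the ROC curve has a single operating point, so the value reported is the balanced accuracy rather than the area under the full curve. The sketch below illustrates the difference on synthetic data, assuming a classifier that exposes `predict_proba`; the data and the choice of `LogisticRegression` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem (not the NHANES data)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=90, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# AUC computed from hard labels collapses to balanced accuracy
auc_from_labels = roc_auc_score(y_test, model.predict(X_test))
bal_acc = balanced_accuracy_score(y_test, model.predict(X_test))

# AUC computed from the positive-class probability uses the full ranking
auc_from_scores = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print("AUC from labels:", auc_from_labels)
print("Balanced accuracy:", bal_acc)
print("AUC from scores:", auc_from_scores)
```

For estimators without `predict_proba`, such as `SVC` with its default settings, `decision_function` provides scores that `roc_auc_score` accepts in the same way.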
        

In [300]:
from sklearn.manifold import TSNE

def tsne_plot(X, y):
       
        
    # scaling features
    from sklearn.preprocessing import MinMaxScaler
    numdf_norm = pd.DataFrame(MinMaxScaler().fit_transform(X))
    numdf_norm.columns = X.columns
    
    tsne = TSNE(n_components=2, random_state=0)
    X_t = tsne.fit_transform(numdf_norm)

    # boolean masks on a plain array select the rows of each class
    y = np.asarray(y)
    plt.figure(figsize=(12, 8))
    plt.scatter(X_t[y == 0, 0], X_t[y == 0, 1], marker='o', color='g', linewidth=1, alpha=0.8, label='No cancer')
    plt.scatter(X_t[y == 1, 0], X_t[y == 1, 1], marker='o', color='r', linewidth=1, alpha=0.8, label='Cancer')

    plt.legend(loc='best');
    plt.show();

In [301]:
# function to find upper and lower boundaries
# for normally distributed variables

def find_normal_boundaries(df, variable):

    # calculate the boundaries outside which sit the outliers
    # for a Gaussian distribution

    upper_boundary = df[variable].mean() + 3 * df[variable].std()
    lower_boundary = df[variable].mean() - 3 * df[variable].std()

    return upper_boundary, lower_boundary

In [302]:
# function to find upper and lower boundaries
# for skewed distributed variables

def find_skewed_boundaries(df, variable, distance):

    # Let's calculate the boundaries outside which sit the outliers
    # for skewed distributions

    # distance passed as an argument, gives us the option to
    # estimate 1.5 times or 3 times the IQR to calculate
    # the boundaries.

    IQR = df[variable].quantile(0.75) - df[variable].quantile(0.25)

    lower_boundary = df[variable].quantile(0.25) - (IQR * distance)
    upper_boundary = df[variable].quantile(0.75) + (IQR * distance)

    return upper_boundary, lower_boundary
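
A small worked example of the IQR rule on toy data may help here. The function is repeated so the block runs by itself, and the values in `toy` are made up for illustration.

```python
import pandas as pd

def find_skewed_boundaries(df, variable, distance):
    # Same logic as above: quartiles plus or minus distance * IQR
    iqr = df[variable].quantile(0.75) - df[variable].quantile(0.25)
    lower_boundary = df[variable].quantile(0.25) - (iqr * distance)
    upper_boundary = df[variable].quantile(0.75) + (iqr * distance)
    return upper_boundary, lower_boundary

# Toy column with one obvious extreme value
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]})

# Q1 = 3.25, Q3 = 7.75, IQR = 4.5, so the 1.5 * IQR fences are -3.5 and 14.5
upper, lower = find_skewed_boundaries(toy, "x", 1.5)
outliers = toy[(toy["x"] > upper) | (toy["x"] < lower)]

print(upper, lower)            # 14.5 -3.5
print(outliers["x"].tolist())  # [100]
```

Passing `distance=3` instead of `1.5` widens the fences and flags only the most extreme values.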

In [303]:
def find_uncorrelated_vars(cancer_df, selected_vars, threshold):

    corrmat = cancer_df[selected_vars].corr()
    corrmat = corrmat.abs().unstack() # absolute value of corr coef
    corrmat = corrmat.sort_values(ascending=False)

    corrmat = pd.DataFrame(corrmat).reset_index()
    corrmat.columns = ['feature1', 'feature2', 'corr']
    corrmat['MissingF1'] = corrmat.feature1.apply(lambda x:MissingPercentage(x))
    corrmat['MissingF2'] = corrmat.feature2.apply(lambda x:MissingPercentage(x))
    
    correlated_groups = corrmat[corrmat['corr'] > threshold]
    
    selected_vars = []
    remaining_vars = correlated_groups.feature1.unique()

    while(len(remaining_vars) > 0):
        feature = remaining_vars[0]
        correlated_block = correlated_groups[correlated_groups.feature1 == feature]
        min_ind = correlated_block[['MissingF2']].idxmin() 
        sel_var = correlated_block.feature2[min_ind].values[0]
        removed_vars = [var for var in list(correlated_block.feature2.values)]
        remaining_vars = [var for var in remaining_vars if var not in removed_vars]
        if sel_var not in selected_vars:
            selected_vars = selected_vars + [sel_var]   
    
    return selected_vars

In [304]:
def analyze_na_values(df, var, target):
    tmp_df = df.copy()
    print(target)
    
    # Flag each observation: 1 if var is missing, 0 otherwise
    tmp_df['Missing'] = np.where(tmp_df[var].isnull(), 1, 0)
    
    # Count observations per (target, Missing) pair; a pair that never occurs gets a count of 0
    counts = tmp_df.groupby([target, 'Missing']).size().unstack(fill_value=0)
    counts = counts.reindex(columns=[0, 1], fill_value=0)
    
    # Convert the counts to proportions within each target class
    per = counts.div(counts.sum(axis=1), axis=0)
    tmp_df = per.reset_index().melt(id_vars=target, var_name='Missing', value_name='Per')
    
    sns.barplot(x=target, y='Per', data=tmp_df, hue='Missing')
    plt.title(var)
    plt.show()

In [305]:
def CategoricalDistribution(df, var, target):
    df = df.copy()
    
    # Count the observations in each category of var, split by target class
    sns.countplot(x=var, data=df, hue=target)
    plt.title(var)
    plt.show()

In [306]:
def CreateDummyVar(df, categorical_list):
    objdf_new = df.copy()
    objdf_dummy =pd.DataFrame()
    i = 0
    for e in categorical_list:
        i = i + 1
        objdf_new[e] = e + '_' + objdf_new[e].astype(str)
        varname= e 
        df_temp = pd.get_dummies(objdf_new[varname], drop_first=True)
        objdf_dummy = pd.concat([objdf_dummy, df_temp], axis=1)
        
    return objdf_dummy

In [307]:
def MissingPercentage(x):
    # Fraction of missing values in column x; relies on the global df loaded in section 2
    return df[x].isnull().sum()/len(df)

## 2. Load raw data

In [308]:
df = pd.read_csv('Nhanes-BreastCancer_merged.csv')

df.shape

(1173, 461)

In [310]:
df.head()

Unnamed: 0,RIDAGEYR,RIDAGEMN,INDFMPIR,LBXSAL,LBDSGLSI,URDTIME2,LBDSBUSI,LBDHDDSI,LBDLYMNO,LBXSCA,URXMBP,LBDSCASI,URXMZP,URDFLOW2,LBXGLU,LBXBGE,LBXBPB,LBDBCDSI,URXUMS,LBXSGL,LBXSOSSI,LBXIHG,URXMHH,URXVOL3,URXMHP,URDTIME1,LBXSZN,LBXSKSI,LBDTHGSI,URXMIB,URDFLOW3,URXMNP,LBXBCD,LBXSATSI,LBXSBU,LBDSTRSI,LBDSCRSI,LBXSUA,LBDSTPSI,LBXSTR,URXVOL1,LBXLYPCT,URXVOL2,LBDSTBSI,URXECP,LBDTCSI,LBXNEPCT,LBXBGM,LBXHGB,LBXSTP,URXCNP,LBXPLTSI,LBDIHGSI,LBDSIRSI,LBXSGTSI,LBXBAPCT,LBXSTB,LBXMCHSI,LBDNENO,LBXSCR,LBDSZNSI,LBXMOPCT,LBXMPSI,LBXSIR,PHAANTMN,LBDSCUSI,PHAALCMN,LBXRBCSI,URXUIO,LBDSUASI,LBDSPHSI,URXCRS,URXMEP,LBXMCVSI,LBXSCH,URXMC1,LBXGLT,WTSVS2YR,LBXSLDSI,LBXRDW,WTFSM,PHAGUMMN,LBXSGB,LBDBPBSI,LBDGLTSI,WTSVOC2Y,URXCOP,LBDGLUSI,LBXSCU,LBXSSE,LBXWBCSI,LBDHDD,LBDSSESI,URDFLOW1,LBXSPH,PHACOFMN,PHASUPMN,WTSOG2YR,URXUMA,LBXSASSI,LBXEOPCT,LBDSGBSI,LBXSAPSI,LBXTC,LBXMC,URDTIME3,LBXTHG,LBDSCHSI,URXMOH,LBDMONO,LBDEONO,LBDBANO,LBDSALSI,PHASUPHR,LBXSCLSI,LBXSNASI,PHAGUMHR,PHAANTHR,PHACOFHR,PHAALCHR,LBXSC3SI,DXDLAPF,DXDTRBMD,DXDSTBMC,DXXTRFAT,DXXHEFAT,DXDSTLE,DXXRALI,DXDTOFAT,DXDTOBMD,DXXLSBMC,DXXTRLI,DXDRATOT,BPXDI3,DXXHEA,DXDTOPF,DXDRLTOT,DXDTRPF,DXXLLBMD,DXXRLLI,BMXARMC,DXDRALE,DXXLLBMC,DXXLALI,DXXLRBMC,DXXPEA,DXDRAPF,BPXDI1,DXXPEBMD,DXDTRTOT,DXXRAA,DXDLLPF,BMXARML,DXXHELI,DXXRRA,DXDSTBMD,BPXDI4,DXXLLLI,DXDRLPF,BPXDI2,DXXLSBMD,BPXPLS,DXDTOBMC,DXDTOLE,DXXLRBMD,DXDTRLE,DXXHEBMC,DXDSTTOT,DXXRLA,BMXLEG,BPXSY1,DXXLAA,DXXLABMC,DXDLALE,DXDTRA,BMXWT,DXXTSBMC,DXXLLA,DXXLLFAT,DXDTOLI,BMXHT,DXXLABMD,DXDSTLI,DXXRRBMD,DXXLSA,DXDSTA,DXDTRBMC,DXXRABMC,DXDHELE,DXDHETOT,DXXHEBMD,DXDTOA,DXXRABMD,DXDRLLE,DXXLRA,DXXRRBMC,BPXSY2,DXDLLLE,DXXTSBMD,BMXBMI,DXDSTFAT,DXXRLFAT,DXXRLBMD,DXXRAFAT,DXDLATOT,BMXWAIST,DXDLLTOT,DXDSTPF,DXDHEPF,DXXLAFAT,DXXRLBMC,BPXSY3,BPXSY4,DXXTSA,DXXPEBMC,DXDTOTOT,DMDYRSUS,DMDEDUC2,DMDHHSIZ,DMDFMSIZ,BPXML1,BPACSZ,FCX10DI,FCX11DI,FCX06DI,FCX08DI,FCX07DI,FCX09DI,RIDRETH1,RIDEXMON,DMDCITZN,DMDMARTL,RIDEXPRG,URDMNPLC,PHQ060,ORXH51,ORXH64,ORXHPC,LBDHEG,LBDWFL,ORXGL,ORXH69,PHQ050,LBDHD,LBXHCR,ORX
H62,URDCNPLC,ORXHPV,ORXH11,ORXH73,URXUTRI,ORXH26,ORXH31,LBDHBG,ORXH82,LBXHE1,ORXH53,ORXH58,LBDHEM,URDMZPLC,PHQ020,ORXH35,ORXH84,LBDIHGLC,URXPREG,PHQ040,LBDBGELC,ORXH68,URDMEPLC,ORXH81,ORXGH,ORXH52,URDMBPLC,ORXH56,LBXHE2,ORXH61,LBDBGMLC,URDECPLC,URDMOHLC,ORXH33,ORXH06,ORXHPI,LBDTHGLC,URDMHPLC,URDCOPLC,URDMC1LC,ORXH45,ORXH55,ORXH71,PHQ030,ORXH40,ORXH39,ORXH83,ORXH66,LBXHBC,ORXH72,ORXH54,ORXH70,URXUCL,LBDWFLLC,LBXHA,ORXH59,ORXH67,ORXH42,ORXH18,URDMHHLC,PHDSESN,URDMIBLC,ORXH16,LBXHBS,LBXHCG,OHARNF,OHAPOS,BMIARML,OHDDESTS,OHAROCGP,OHX23TC,BPAEN2,OHAROCOH,OHAROCDE,OHX02TC,OHX30TC,OHXIMP,OHX14TC,OHAROTH,BMDSTATS,OHX16TC,OHX05TC,BMIARMC,BPAEN3,OHX26TC,OHAREC,BPAARM,OHAROCCI,BMILEG,BPAEN4,OHX01TC,BMXRECUM,OHX09TC,BMIHT,BPXCHR,OHX32TC,OHX18TC,OHDRCSTS,BPXPTY,BMXHEAD,OHX19TC,OHX15TC,OHAROCDT,BMIHEAD,BMIRECUM,OHX03TC,BPXPULS,OHX17TC,BMIWT,BPAEN1,OHX31TC,OHX08TC,BMIWAIST,DXARLBV,DXALLBV,DXARABV,OHX06TC,DXALABV,DXARLTV,OHX04TC,OHX12TC,OHX27TC,OHX13TC,DXAHEBV,OHX22TC,OHX10TC,OHX29TC,OHX21TC,OHX28TC,OHX07TC,DXARATV,OHX24TC,PEASCCT1,DXALLTV,OHX25TC,OHX11TC,DXAHETV,DXAEXSTS,LBD2DFLC,LBDV1DLC,LBDV2ALC,LBDV3BLC,LBDV4CLC,LBDVBZLC,LBDVCBLC,LBDVCTLC,LBDVDBLC,LBDVEBLC,LBDVMCLC,LBDVNBLC,LBDVOXLC,LBDVTCLC,LBDVTELC,LBDVXYLC,LBX2DF,LBXV1D,LBXV2A,LBXV3B,LBXV4C,LBXVBZ,LBXVCB,LBXVCT,LBXVDB,LBXVEB,LBXVMC,LBXVNB,LBXVOX,LBXVTC,LBXVTE,LBXVXY,OHDEXSTS,PHAFSTHR,PHAFSTMN,URDUA3LC,URDUA5LC,URDUABLC,URDUACLC,URDUBALC,URDUCDLC,URDUCOLC,URDUCSLC,URDUDALC,URDUMMAL,URDUMNLC,URDUMOLC,URDUPBLC,URDUSBLC,URDUSNLC,URDUSRLC,URDUTLLC,URDUTULC,URDUURLC,URXUAB,URXUAC,URXUAS,URXUAS3,URXUAS5,URXUBA,URXUCD,URXUCO,URXUCS,URXUDMA,URXUMMA,URXUMN,URXUMO,URXUPB,URXUSB,URXUSN,URXUSR,URXUTL,URXUTU,URXUUR,WTSA2YR,WTSAF2YR,Class
0,72,,5.0,3.7,4.94,,4.64,1.81,1.9,8.9,,2.225,,,,0.11,1.83,6.58,1.5,89.0,275.0,0.46,,,,57.0,,3.9,25.9,,,,0.74,18.0,13.0,1.332,67.18,5.2,69.0,118.0,76.0,33.4,,8.55,,5.17,53.4,5.09,12.4,6.9,,237.0,2.3,11.1,20.0,0.6,0.5,31.2,3.0,0.76,,8.1,9.0,62.0,,,,3.96,,309.3,1.324,3182.4,,90.2,196.0,,,,148.0,13.7,,,3.2,0.088,,,,,,,5.6,70.0,,1.333,4.1,,,,1.5,26.0,4.6,32.0,69.0,200.0,34.6,,5.19,5.069,,0.5,0.3,0.0,37.0,,102.0,138.0,,,,,28.0,,,,,,,,,,,,,68.0,,,,,,,32.0,,,,,,,72.0,,,,,37.0,,,,,,,74.0,,66.0,,,,,,,,33.5,120.0,,,,,71.9,,,,,157.5,,,,,,,,,,,,,,,,116.0,,,29.0,,,,,,103.3,,,,,,118.0,,,,,,4.0,1,1,140.0,4.0,,,,,,,3,2.0,1.0,2.0,,,2.0,,,,2.0,,,,2.0,2.0,,,,,,,,,,2.0,,,,,2.0,,2.0,,,0.0,,2.0,1.0,,,,,,,,,,0.0,,,,,,0.0,,,,,,,2.0,,,,,2.0,,,,,,1.0,,,,,,2.0,,,2.0,,1.0,1.0,,1.0,,2.0,2.0,,,2.0,2.0,2.0,2.0,,1.0,4.0,2.0,,2.0,2.0,4.0,2.0,,,,4.0,,2.0,,,2.0,2.0,1.0,1.0,,2.0,2.0,,,,2.0,1.0,2.0,,2.0,2.0,2.0,,,,,2.0,,,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,2.0,2.0,2.0,,2.0,,,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,2.0,41.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1
1,80,,2.77,4.1,6.49,,5.36,1.97,0.9,9.6,,2.4,,,,,2.4,4.72,7.0,117.0,274.0,,,,,101.0,,4.0,1.6,,,,0.53,15.0,15.0,1.66,79.56,4.0,66.0,147.0,261.0,17.1,,10.26,,6.0,69.6,,13.9,6.6,,209.0,,14.1,20.0,0.2,0.6,31.7,3.8,0.9,,11.8,7.5,79.0,,,,4.36,,237.9,1.162,4420.0,,91.9,235.0,,,,140.0,12.3,,,2.5,0.116,,,,,,,5.5,76.0,,2.584,3.6,,,,7.0,22.0,1.3,25.0,84.0,232.0,34.4,,0.33,6.077,,0.6,0.1,0.0,41.0,,98.0,136.0,,,,,27.0,,,,,,,,,,,,,74.0,,,,,,,32.0,,,,,,,74.0,,,,,36.2,,,,,,,80.0,,72.0,,,,,,,,37.5,152.0,,,,,74.9,,,,,162.5,,,,,,,,,,,,,,,,152.0,,,28.36,,,,,,95.3,,,,,,140.0,,,,,,3.0,1,1,170.0,4.0,,,,,,,3,2.0,1.0,1.0,,,2.0,,,,2.0,,,,2.0,2.0,,,,,,,,,,2.0,,,,,2.0,,2.0,,,,,2.0,,,,,,,,,,,,,,,,,0.0,,,,,,,2.0,,,,,2.0,,,,,,2.0,,,,,,1.0,,,2.0,,,1.0,,1.0,,4.0,2.0,,1.0,4.0,4.0,2.0,4.0,,1.0,4.0,4.0,,2.0,4.0,4.0,1.0,,,,4.0,,4.0,,,4.0,4.0,1.0,1.0,,4.0,4.0,,,,4.0,1.0,4.0,,2.0,4.0,4.0,,,,,4.0,,,4.0,4.0,4.0,4.0,,4.0,4.0,4.0,4.0,4.0,4.0,,4.0,,,4.0,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,2.0,11.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1
2,74,,4.53,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,,,0.0,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,76.0,,,,,,,,,,,,,,82.0,,,,,,,,,,,,70.0,,62.0,,,,,,,,,170.0,,,,,95.1,,,,,171.1,,,,,,,,,,,,,,,,164.0,,,32.5,,,,,,,,,,,,164.0,,,,,,5.0,1,1,190.0,5.0,,,,,,,4,2.0,1.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,,,,,,1.0,1.0,1.0,1.0,2.0,2.0,,,4.0,2.0,1.0,4.0,,2.0,4.0,2.0,1.0,2.0,2.0,3.0,2.0,,1.0,,4.0,,4.0,3.0,,4.0,4.0,1.0,1.0,,4.0,4.0,,,,4.0,1.0,4.0,3.0,2.0,4.0,3.0,1.0,,,,2.0,,,3.0,3.0,2.0,3.0,,2.0,4.0,4.0,2.0,2.0,4.0,,2.0,,,2.0,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,1
3,55,,,4.6,3.22,,4.28,1.58,2.1,9.7,,2.425,,,,0.11,0.74,4.27,0.6,58.0,279.0,0.19,,,,65.0,94.3,3.9,7.3,,,,0.48,18.0,12.0,1.163,75.14,4.2,70.0,103.0,216.0,26.6,,13.68,,4.71,64.3,1.27,13.0,7.0,,282.0,0.95,16.8,8.0,0.4,0.8,32.5,5.1,0.85,14.43,7.8,9.0,94.0,,18.68,,3.99,85.3,249.8,1.421,972.4,,96.7,190.0,,,236518.754751,110.0,13.8,349457.042144,,2.4,0.036,,245238.889006,,,119.0,126.0,8.0,61.0,1.6,3.323,4.4,,,,0.6,20.0,0.9,24.0,62.0,182.0,33.7,,1.47,4.913,,0.6,0.1,0.0,46.0,,104.0,141.0,,,,,28.0,36.4,0.808,1412.69,7752.1,1030.6,33919.9,2292.7,19727.7,1.063,62.17,18116.1,3539.5,70.0,224.79,33.8,10916.7,30.0,1.049,6628.6,29.3,2159.2,346.92,2017.6,56.78,178.46,35.2,72.0,1.076,25868.2,198.2,40.4,36.0,3369.2,95.07,0.872,,6277.6,39.3,74.0,1.102,84.0,1960.44,36741.4,0.512,17658.2,547.75,54029.8,341.62,38.0,118.0,182.49,118.46,1899.1,566.81,58.0,94.0,330.74,4253.6,38701.8,163.5,0.649,35332.6,0.557,56.4,1619.87,457.84,133.53,2821.4,4399.8,2.437,1844.66,0.674,6272.7,110.78,52.91,118.0,5930.7,0.746,21.7,18697.2,4288.1,1.042,1246.8,3174.2,79.6,10531.2,34.6,23.4,1156.6,355.92,116.0,,126.09,191.99,58429.6,,5.0,2,2,140.0,3.0,,,,,,,3,2.0,1.0,1.0,,,2.0,2.0,2.0,2.0,2.0,,1.0,2.0,2.0,2.0,3.0,2.0,,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,,2.0,2.0,2.0,1.0,,2.0,1.0,2.0,,2.0,1.0,2.0,,2.0,,2.0,0.0,,,2.0,2.0,2.0,0.0,,,,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,,,2.0,2.0,2.0,2.0,2.0,,1.0,,2.0,2.0,,1.0,1.0,,1.0,,2.0,2.0,,,2.0,2.0,2.0,2.0,,1.0,4.0,4.0,,2.0,2.0,4.0,1.0,,,,4.0,,2.0,,,4.0,2.0,1.0,1.0,,2.0,2.0,,,,2.0,1.0,4.0,,2.0,2.0,2.0,,0.0,0.0,0.0,2.0,0.0,0.0,2.0,4.0,2.0,2.0,0.0,2.0,2.0,2.0,2.0,2.0,2.0,0.0,2.0,,0.0,2.0,2.0,0.0,1.0,1.0,1.0,1.0,1.0,,,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0078,0.0177,0.0071,0.0177,,,0.0078,0.0035,0.0283,0.017,0.1768,0.2263,0.028,0.0085,0.0071,0.085,1.0,1.0,11.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.82,0.08,0.55,0.08,0.56,0.2,0.025,0.066,0.791,1.35,0.14,0.092,3.9,0.06,0.016,0.064,12.28,0.047,0.013,0.0014,403759.430206,,0
4,51,,2.1,4.0,5.44,,3.57,1.34,2.4,9.9,,2.475,,,,0.11,0.9,0.98,6.8,98.0,278.0,0.19,,,,26.0,,4.5,2.7,,,,0.11,15.0,10.0,1.66,44.2,5.3,69.0,147.0,54.0,27.8,,6.84,,4.37,64.2,0.55,12.7,6.9,,290.0,0.95,10.7,10.0,0.4,0.4,30.9,5.6,0.5,,5.6,7.9,60.0,,,,4.13,,315.2,1.227,7956.0,,94.9,172.0,,,,99.0,13.5,,,2.9,0.043,,,,,,,8.7,52.0,,2.077,3.8,,,,6.8,11.0,2.2,29.0,72.0,169.0,32.5,,0.55,4.448,,0.5,0.2,0.0,40.0,,106.0,140.0,,,,,24.0,,0.786,1562.13,22665.0,1262.6,49636.8,,48128.5,1.111,46.32,27665.8,,80.0,225.63,46.6,,45.0,,,43.6,,,,66.58,174.52,,78.0,1.11,50330.8,,,36.0,3917.6,154.55,0.896,,,,80.0,0.899,60.0,2187.03,52929.4,0.561,27158.8,624.9,98064.9,,37.2,134.0,,,,644.95,102.8,111.89,,,55116.5,161.8,,51198.9,0.573,51.52,1743.16,507.06,,3292.6,5180.1,2.77,1968.79,,,118.61,88.52,132.0,,0.768,39.3,46866.0,,,,,111.4,,47.8,24.4,,,130.0,,145.76,193.75,103245.0,7.0,2.0,3,3,170.0,5.0,,,,,,,2,1.0,1.0,1.0,,,2.0,2.0,2.0,2.0,2.0,,1.0,2.0,2.0,2.0,,2.0,,2.0,2.0,2.0,,2.0,2.0,2.0,2.0,,2.0,2.0,2.0,,2.0,2.0,2.0,1.0,,2.0,1.0,2.0,,2.0,1.0,2.0,,2.0,,2.0,0.0,,,2.0,2.0,2.0,0.0,,,,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,,,1.0,2.0,2.0,2.0,2.0,,1.0,,2.0,2.0,,,1.0,,1.0,,2.0,2.0,,,5.0,4.0,2.0,2.0,,1.0,4.0,2.0,,2.0,2.0,3.0,1.0,,,,2.0,,2.0,,,2.0,2.0,1.0,1.0,,2.0,4.0,1.0,,,2.0,1.0,2.0,,2.0,2.0,2.0,,4.0,4.0,4.0,2.0,4.0,4.0,2.0,2.0,2.0,2.0,0.0,2.0,2.0,2.0,2.0,2.0,2.0,4.0,2.0,,4.0,2.0,2.0,0.0,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,0.0,29.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0


## 4. Target variable analysis

In [311]:
df[df.Class == 1].shape[0]/df.shape[0],df[df.Class == 1].shape[0],df[df.Class == 0].shape[0]

(0.4919011082693947, 577, 596)
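
The same class-balance figures can be read from `value_counts`. This is a minimal sketch on a synthetic `Class` column built to match the counts reported above (577 positives, 596 negatives); the real frame is not loaded here.

```python
import pandas as pd

# Synthetic column with the same class counts as the output above
toy = pd.DataFrame({"Class": [1] * 577 + [0] * 596})

counts = toy["Class"].value_counts()                # absolute counts per class
shares = toy["Class"].value_counts(normalize=True)  # proportions per class

print(counts.to_dict())
print(round(shares.loc[1], 4))  # 0.4919
```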

## 5. Categorize vars

In [312]:
target = ['Class']

cont_vars = ['RIDAGEYR', 'RIDAGEMN', 'INDFMPIR', 'LBXSAL', 'LBDSGLSI', 'URDTIME2', 'LBDSBUSI', 'LBDHDDSI', 'LBDLYMNO', 'LBXSCA', 'URXMBP', 'LBDSCASI', 'URXMZP', 'URDFLOW2', 'LBXGLU', 'LBXBPB', 'LBDBCDSI', 'URXUMS', 'LBXSGL', 'LBXSOSSI', 'URXUAC', 'LBXIHG', 'URXMHH', 'URXMHP', 'URXUBA', 'WTFSM', 'URXUUR', 'URDTIME1', 'URXUSR', 'LBXSZN', 'URXUTU', 'LBXSKSI', 'URXUSN', 'URXUMMA', 'LBDTHGSI', 'URXMIB', 'URXUMO', 'URXMNP', 'LBXBCD', 'URXUDMA', 'LBXSATSI', 'LBXSBU', 'LBXV4C', 'LBDSTRSI', 'LBXVBZ', 'LBDSCRSI', 'LBXSUA', 'LBXVOX', 'URXUAS', 'URXUSB', 'LBDSTPSI', 'LBXVXY', 'LBXSTR', 'WTSA2YR', 'URXVOL1', 'LBXLYPCT', 'URXUTL', 'URXUPB', 'URXVOL2', 'PHAFSTMN', 'LBXVEB', 'URXECP', 'LBDTCSI', 'LBXNEPCT', 'LBXBGM', 'LBXHGB', 'LBXSTP', 'URXCNP', 'LBXPLTSI', 'LBDIHGSI', 'LBDSIRSI', 'LBXSGTSI', 'URXUCO', 'LBXBAPCT', 'LBXMCHSI', 'LBDNENO', 'LBXSCR', 'LBDSZNSI', 'LBXMOPCT', 'LBXMPSI', 'LBXVDB', 'LBXSIR', 'URXUCS', 'LBX2DF', 'URXUAB', 'LBDSCUSI', 'LBXRBCSI', 'URXUIO', 'LBDSUASI', 'LBDSPHSI', 'URXCRS', 'URXMEP', 'URXUCD', 'LBXMCVSI', 'URXUAS3', 'LBXSCH', 'URXMC1', 'LBXGLT', 'WTSVS2YR', 'LBXSLDSI', 'LBXRDW', 'PHAGUMMN', 'LBXSGB', 'LBDBPBSI', 'LBDGLTSI', 'WTSVOC2Y', 'URXCOP', 'LBDGLUSI', 'LBXSCU', 'LBXSSE', 'LBXWBCSI', 'LBDHDD', 'LBDSSESI', 'URDFLOW1', 'WTSAF2YR', 'LBXSPH', 'WTSOG2YR', 'URXUMA', 'LBXSASSI', 'LBXEOPCT', 'LBDSGBSI', 'LBXSAPSI', 'LBXTC', 'LBXMC', 'LBXTHG', 'LBDSCHSI', 'URXMOH', 'LBDSALSI', 'PHAFSTHR', 'DXDLAPF', 'DXDTRBMD', 'DXDSTBMC', 'DXXTRFAT', 'DXXHEFAT', 'DXDSTLE', 'DXXRALI', 'DXDTOFAT', 'DXDTOBMD', 'DXXLSBMC', 'DXXTRLI', 'DXDRATOT', 'BPXDI3', 'DXXHEA', 'DXDTOPF', 'DXDRLTOT', 'DXDTRPF', 'DXXLLBMD', 'DXXRLLI', 'BMXARMC', 'DXDRALE', 'DXXLLBMC', 'DXXLALI', 'DXXLRBMC', 'DXXPEA', 'DXDRAPF', 'BPXDI1', 'DXXPEBMD', 'DXDTRTOT', 'DXXRAA', 'DXDLLPF', 'BMXARML', 'DXXHELI', 'DXXRRA', 'DXDSTBMD', 'BPXDI4', 'DXXLLLI', 'DXDRLPF', 'BPXDI2', 'DXXLSBMD', 'BPXPLS', 'DXDTOBMC', 'DXDTOLE', 'DXXLRBMD', 'DXDTRLE', 'DXXHEBMC', 'DXDSTTOT', 'DXXRLA', 'BMXLEG', 'BPXSY1', 'DXXLAA', 'DXXLABMC', 
'DXDLALE', 'DXDTRA', 'BMXWT', 'DXXTSBMC', 'DXXLLA', 'DXXLLFAT', 'DXDTOLI', 'BMXHT', 'DXXLABMD', 'DXDSTLI', 'DXXRRBMD', 'DXXLSA', 'DXDSTA', 'DXDTRBMC', 'DXXRABMC', 'DXDHELE', 'DXDHETOT', 'DXXHEBMD', 'DXDTOA', 'DXXRABMD', 'DXDRLLE', 'DXXLRA', 'DXXRRBMC', 'BPXSY2', 'DXDLLLE', 'DXXTSBMD', 'BMXBMI', 'DXDSTFAT', 'DXXRLFAT', 'DXXRLBMD', 'DXXRAFAT', 'DXDLATOT', 'BMXWAIST', 'DXDLLTOT', 'DXDSTPF', 'DXDHEPF', 'DXXLAFAT', 'DXXRLBMC', 'BPXSY3', 'BPXSY4', 'DXXTSA', 'DXXPEBMC', 'DXDTOTOT']
print(len(cont_vars))

dis_vars = ['DMDYRSUS', 'DMDEDUC2', 'DMDHHSIZ', 'DMDFMSIZ', 'BPXML1', 'BPACSZ', 'FCX10DI', 'FCX11DI', 'FCX06DI', 'FCX08DI', 'FCX07DI', 'FCX09DI', 'LBXBGE', 'URXVOL3', 'URXUAS5', 'URDFLOW3', 'LBXVTE', 'LBDSTBSI', 'URXUMN', 'LBXVTC', 'LBXSTB', 'PHAANTMN', 'PHAALCMN', 'PHACOFMN', 'PHASUPMN', 'URDTIME3', 'LBXV2A', 'LBXV1D', 'LBXVMC', 'LBXVNB', 'LBXV3B', 'LBXVCB', 'LBXVCT', 'LBDMONO', 'LBDEONO', 'LBDBANO', 'PHASUPHR', 'LBXSCLSI', 'LBXSNASI', 'PHAGUMHR', 'PHAANTHR', 'PHACOFHR', 'PHAALCHR', 'LBXSC3SI']
print(len(dis_vars))

cat_vars = ['RIDRETH1', 'RIDEXMON', 'DMDCITZN', 'DMDMARTL', 'RIDEXPRG', 'URDMNPLC', 'PHQ060', 'ORXH51', 'ORXH64', 'ORXHPC', 'LBDHEG', 'ORXGL', 'ORXH69', 'PHQ050', 'LBDVCTLC', 'LBD2DFLC', 'LBXHCR', 'ORXH62', 'URDCNPLC', 'URDUSNLC', 'ORXHPV', 'LBDVTCLC', 'URDUA3LC', 'URDUTLLC', 'URDUTULC', 'URDUDALC', 'ORXH11', 'URDUCOLC', 'ORXH73', 'URXUTRI', 'ORXH26', 'ORXH31', 'LBDHBG', 'ORXH82', 'LBXHE1', 'ORXH53', 'ORXH58', 'URDUURLC', 'LBDHEM', 'URDMZPLC', 'URDUBALC', 'LBDVTELC', 'LBDV1DLC', 'PHQ020', 'LBDV3BLC', 'URDUACLC', 'URDUSBLC', 'ORXH35', 'ORXH84', 'LBDIHGLC', 'URXPREG', 'LBDV4CLC', 'PHQ040', 'URDUPBLC', 'LBDBGELC', 'URDUMMAL', 'URDUMNLC', 'ORXH68', 'URDUA5LC', 'URDMEPLC', 'ORXH81', 'LBDVOXLC', 'ORXGH', 'LBDVCBLC', 'LBDVBZLC', 'ORXH52', 'URDMBPLC', 'ORXH56', 'LBXHE2', 'ORXH61', 'LBDBGMLC', 'URDECPLC', 'URDUCDLC', 'LBDVNBLC', 'URDMOHLC', 'ORXH33', 'ORXH06', 'ORXHPI', 'LBDTHGLC', 'URDUMOLC', 'LBDVXYLC', 'URDMHPLC', 'URDCOPLC', 'URDMC1LC', 'ORXH45', 'ORXH55', 'ORXH71', 'PHQ030', 'LBDVMCLC', 'URDUSRLC', 'ORXH40', 'ORXH39', 'ORXH83', 'ORXH66', 'LBXHBC', 'LBDVDBLC', 'LBDVEBLC', 'ORXH72', 'ORXH54', 'ORXH70', 'URDUCSLC', 'URXUCL', 'LBDV2ALC', 'LBDWFLLC', 'URDUABLC', 'LBXHA', 'ORXH59', 'ORXH67', 'ORXH42', 'ORXH18', 'URDMHHLC', 'PHDSESN', 'URDMIBLC', 'ORXH16', 'LBXHBS', 'LBXHCG', 'OHARNF', 'OHAPOS', 'BMIARML', 'OHDDESTS', 'OHAROCGP', 'OHX23TC', 'BPAEN2', 'OHAROCOH', 'OHAROCDE', 'OHX02TC', 'OHX30TC', 'OHXIMP', 'OHX14TC', 'OHAROTH', 'BMDSTATS', 'OHX16TC', 'OHX05TC', 'BMIARMC', 'BPAEN3', 'OHX26TC', 'OHAREC', 'BPAARM', 'OHAROCCI', 'BMILEG', 'BPAEN4', 'OHX01TC', 'BMXRECUM', 'OHX09TC', 'BMIHT', 'BPXCHR', 'OHX32TC', 'OHX18TC', 'OHDRCSTS', 'BPXPTY', 'BMXHEAD', 'OHX19TC', 'OHX15TC', 'OHDEXSTS', 'OHAROCDT', 'BMIHEAD', 'BMIRECUM', 'OHX03TC', 'BPXPULS', 'OHX17TC', 'BMIWT', 'BPAEN1', 'OHX31TC', 'OHX08TC', 'BMIWAIST', 'DXARLBV', 'DXALLBV', 'DXARABV', 'FCX10DI', 'OHX06TC', 'DXALABV', 'DXARLTV', 'OHX04TC', 'OHX12TC', 'FCX08DI', 'OHX27TC', 'OHX13TC', 'DXAHEBV', 'OHX22TC', 'OHX10TC', 'OHX29TC', 
'OHX21TC', 'FCX07DI', 'OHX28TC', 'OHX07TC', 'FCX09DI', 'DXARATV', 'OHX24TC', 'PEASCCT1', 'DXALLTV', 'OHX25TC', 'OHX11TC', 'DXAHETV', 'DXAEXSTS']      
print(len(cat_vars))

224
44
194
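
The three list lengths sum to 462, while the frame has 461 columns including the unnamed index column and `Class`, so at least some names are counted in more than one list (`FCX10DI`, for instance, appears in both `dis_vars` and `cat_vars`). Below is a minimal sketch of a check for this, using a hypothetical helper `find_overlaps` and short toy excerpts rather than the full lists.

```python
from itertools import combinations

def find_overlaps(groups):
    """Return {(name_a, name_b): sorted shared items} for every pair of lists that overlap."""
    overlaps = {}
    for (name_a, vars_a), (name_b, vars_b) in combinations(groups.items(), 2):
        shared = sorted(set(vars_a) & set(vars_b))
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps

# Toy excerpts; the real lists are defined in the cell above
groups = {
    "cont_vars": ["RIDAGEYR", "LBXSAL"],
    "dis_vars": ["DMDHHSIZ", "FCX10DI"],
    "cat_vars": ["RIDRETH1", "FCX10DI"],
}

print(find_overlaps(groups))  # {('dis_vars', 'cat_vars'): ['FCX10DI']}
```

Running the helper on the full `cont_vars`, `dis_vars`, and `cat_vars` would list every feature assigned to more than one category, so each can be given a single type before scaling or encoding.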


## 6. Final dataset

In [313]:
all_df = df.copy()
all_df.shape

(1173, 461)

## 7. Check missing data of all features

In [314]:
all_vars = cont_vars+dis_vars+cat_vars
miss_df = pd.DataFrame(all_df[all_vars].isnull().sum(),columns=['Count'])
miss_df['Percentage'] = 100 * all_df[all_vars].isnull().sum()/len(all_df)
miss_df = miss_df.sort_values('Percentage', ascending=True)
miss_df = miss_df.reset_index()
miss_df.columns = ['Feature','Count','Percentage']
miss_df.head()

Unnamed: 0,Feature,Count,Percentage
0,RIDAGEYR,0,0.0
1,DMDMARTL,0,0.0
2,DMDEDUC2,0,0.0
3,DMDHHSIZ,0,0.0
4,DMDFMSIZ,0,0.0


Reference: the mixed-naive-bayes package (MixedNB), used later in this notebook: https://pypi.org/project/mixed-naive-bayes/

## 8. Feature scaling

In [315]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range = (0, 1))

X_scaled = pd.DataFrame(scaler.fit_transform(all_df[cont_vars + dis_vars]))
X_scaled.columns = cont_vars + dis_vars
print(X_scaled.shape)
X_scaled.head()

(1173, 268)


Unnamed: 0,RIDAGEYR,RIDAGEMN,INDFMPIR,LBXSAL,LBDSGLSI,URDTIME2,LBDSBUSI,LBDHDDSI,LBDLYMNO,LBXSCA,URXMBP,LBDSCASI,URXMZP,URDFLOW2,LBXGLU,LBXBPB,LBDBCDSI,URXUMS,LBXSGL,LBXSOSSI,URXUAC,LBXIHG,URXMHH,URXMHP,URXUBA,WTFSM,URXUUR,URDTIME1,URXUSR,LBXSZN,URXUTU,LBXSKSI,URXUSN,URXUMMA,LBDTHGSI,URXMIB,URXUMO,URXMNP,LBXBCD,URXUDMA,LBXSATSI,LBXSBU,LBXV4C,LBDSTRSI,LBXVBZ,LBDSCRSI,LBXSUA,LBXVOX,URXUAS,URXUSB,LBDSTPSI,LBXVXY,LBXSTR,WTSA2YR,URXVOL1,LBXLYPCT,URXUTL,URXUPB,URXVOL2,PHAFSTMN,LBXVEB,URXECP,LBDTCSI,LBXNEPCT,LBXBGM,LBXHGB,LBXSTP,URXCNP,LBXPLTSI,LBDIHGSI,LBDSIRSI,LBXSGTSI,URXUCO,LBXBAPCT,LBXMCHSI,LBDNENO,LBXSCR,LBDSZNSI,LBXMOPCT,LBXMPSI,LBXVDB,LBXSIR,URXUCS,LBX2DF,URXUAB,LBDSCUSI,LBXRBCSI,URXUIO,LBDSUASI,LBDSPHSI,URXCRS,URXMEP,URXUCD,LBXMCVSI,URXUAS3,LBXSCH,URXMC1,LBXGLT,WTSVS2YR,LBXSLDSI,LBXRDW,PHAGUMMN,LBXSGB,LBDBPBSI,LBDGLTSI,WTSVOC2Y,URXCOP,LBDGLUSI,LBXSCU,LBXSSE,LBXWBCSI,LBDHDD,LBDSSESI,URDFLOW1,WTSAF2YR,LBXSPH,WTSOG2YR,URXUMA,LBXSASSI,LBXEOPCT,LBDSGBSI,LBXSAPSI,LBXTC,LBXMC,LBXTHG,LBDSCHSI,URXMOH,LBDSALSI,PHAFSTHR,DXDLAPF,DXDTRBMD,DXDSTBMC,DXXTRFAT,DXXHEFAT,DXDSTLE,DXXRALI,DXDTOFAT,DXDTOBMD,DXXLSBMC,DXXTRLI,DXDRATOT,BPXDI3,DXXHEA,DXDTOPF,DXDRLTOT,DXDTRPF,DXXLLBMD,DXXRLLI,BMXARMC,DXDRALE,DXXLLBMC,DXXLALI,DXXLRBMC,DXXPEA,DXDRAPF,BPXDI1,DXXPEBMD,DXDTRTOT,DXXRAA,DXDLLPF,BMXARML,DXXHELI,DXXRRA,DXDSTBMD,BPXDI4,DXXLLLI,DXDRLPF,BPXDI2,DXXLSBMD,BPXPLS,DXDTOBMC,DXDTOLE,DXXLRBMD,DXDTRLE,DXXHEBMC,DXDSTTOT,DXXRLA,BMXLEG,BPXSY1,DXXLAA,DXXLABMC,DXDLALE,DXDTRA,BMXWT,DXXTSBMC,DXXLLA,DXXLLFAT,DXDTOLI,BMXHT,DXXLABMD,DXDSTLI,DXXRRBMD,DXXLSA,DXDSTA,DXDTRBMC,DXXRABMC,DXDHELE,DXDHETOT,DXXHEBMD,DXDTOA,DXXRABMD,DXDRLLE,DXXLRA,DXXRRBMC,BPXSY2,DXDLLLE,DXXTSBMD,BMXBMI,DXDSTFAT,DXXRLFAT,DXXRLBMD,DXXRAFAT,DXDLATOT,BMXWAIST,DXDLLTOT,DXDSTPF,DXDHEPF,DXXLAFAT,DXXRLBMC,BPXSY3,BPXSY4,DXXTSA,DXXPEBMC,DXDTOTOT,DMDYRSUS,DMDEDUC2,DMDHHSIZ,DMDFMSIZ,BPXML1,BPACSZ,FCX10DI,FCX11DI,FCX06DI,FCX08DI,FCX07DI,FCX09DI,LBXBGE,URXVOL3,URXUAS5,URDFLOW3,LBXVTE,LBDSTBSI,URXUMN,LBXVTC,LBXSTB,PHAANTMN,PHAALCMN,PHACOFMN,PHASUPMN,URDTIME3,LBXV2A,LBXV1D,LBXVMC,LBXVNB,LBXV3B,LBXVCB,LBXVCT,LBDMONO,LBDEONO,LBDBANO,PHASUPHR,LBXSCLSI,LBXSNASI,PHAGUMHR,PHAANTHR,PHACOFHR,PHAALCHR,LBXSC3SI
0,0.666667,,1.0,0.458333,0.10287,,0.138362,0.439344,0.233333,0.288889,,0.288889,,,,0.252427,0.090771,0.000336,0.102941,0.35,,0.134199,,,,,,0.041081,,,,0.4,,,0.230699,,,,0.090786,,0.064327,0.138462,,0.115359,,0.056924,0.411765,,,,0.433333,,0.115433,,0.182254,0.4528,,,,0.694915,,,0.293515,0.460709,0.216602,0.55102,0.433333,,0.367776,0.134432,0.209026,0.027523,,0.157895,0.651042,0.129252,0.056934,,0.160763,0.441558,,0.208511,,,,,0.423077,,0.411788,0.444597,0.052838,,,0.571134,,0.282051,,,,0.147297,0.18,,0.363636,0.258278,,,,,,,0.207547,0.440678,,0.180005,,0.444444,,0.000336,0.147059,0.306667,0.363636,0.182131,0.294118,0.676471,0.230385,0.282142,,0.458333,0.038462,,,,,,,,,,,,,0.653846,,,,,,,0.413889,,,,,,,0.580645,,,,,0.548387,,,,,,,0.587302,,0.340426,,,,,,,,0.4,0.318182,,,,,0.307393,,,,,0.488938,,,,,,,,,,,,,,,,0.267606,,,0.34186,,,,,,0.424865,,,,,,0.25,,,,,,0.375,0.0,0.0,0.157658,0.75,,,,,,,0.230769,,,,,0.277778,,,0.277778,,,,,,,,,,,,,0.121212,0.25,0.0,,0.526316,0.470588,,,,,0.5625
1,0.871795,,0.554,0.625,0.171302,,0.169397,0.491803,0.066667,0.444444,,0.444444,,,,0.34466,0.062443,0.001877,0.171569,0.325,,,,,,,,0.088649,,,,0.428571,,,0.009991,,,,0.062331,,0.046784,0.169231,,0.151812,,0.077369,0.270588,,,,0.333333,,0.151819,,0.625899,0.192,,,,0.186441,,,0.387941,0.710324,,0.704082,0.333333,,0.318739,,0.280285,0.027523,,0.052632,0.677083,0.183673,0.077372,,0.26158,0.246753,,0.280851,,,,,0.54142,,0.27057,0.333104,0.080235,,,0.606186,,0.393162,,,,0.136486,0.086667,,0.204545,0.350993,,,,,,,0.201258,0.491525,,0.349702,,0.333333,,0.001877,0.107843,0.086667,0.204545,0.233677,0.388235,0.647059,0.009977,0.393192,,0.625,0.038462,,,,,,,,,,,,,0.711538,,,,,,,0.413889,,,,,,,0.596774,,,,,0.496774,,,,,,,0.634921,,0.404255,,,,,,,,0.586047,0.560606,,,,,0.330739,,,,,0.599558,,,,,,,,,,,,,,,,0.521127,,,0.326977,,,,,,0.338378,,,,,,0.421875,,,,,,0.25,0.0,0.0,0.191441,0.75,,,,,,,,,,,,0.333333,,,0.333333,,,,,,,,,,,,,0.151515,0.083333,0.0,,0.315789,0.352941,,,,,0.5
2,0.717949,,0.906,,,,,,,,,,,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,,,,,,,0.0,,,,,,,,,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,0.730769,,,,,,,,,,,,,,0.66129,,,,,,,,,,,,0.555556,,0.297872,,,,,,,,,0.69697,,,,,0.487938,,,,,0.789823,,,,,,,,,,,,,,,,0.605634,,,0.423256,,,,,,,,,,,,0.609375,,,,,,0.5,0.0,0.0,0.213964,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,0.230769,,,0.833333,0.026932,,0.122845,0.363934,0.266667,0.466667,,0.466667,,,,0.076052,0.055589,8.4e-05,0.026961,0.45,0.0,0.017316,,,0.024096,0.495048,0.0,0.04973,0.006095,0.541126,0.0,0.4,0.0,0.0,0.061762,,0.001881,,0.055556,0.001318,0.064327,0.123077,,0.096577,,0.07007,0.294118,0.001681,0.0,0.0,0.466667,0.002002,0.096612,0.585874,0.517986,0.344,0.059545,0.002481,,0.186441,0.0,,0.241183,0.628659,0.051448,0.612245,0.466667,,0.446585,0.017346,0.344418,0.005505,0.000508,0.105263,0.71875,0.272109,0.070073,0.541018,0.152589,0.441558,6e-06,0.344681,0.000613,0.0,0.000968,0.197691,0.431953,0.001554,0.294106,0.511356,0.003914,,0.0,0.705155,0.0,0.264957,,,0.481444,0.095946,0.186667,,0.181818,0.086093,,0.529408,,,0.197802,0.233164,0.358491,0.364407,0.232143,0.449946,,0.511111,,8.4e-05,0.088235,0.06,0.181818,0.158076,0.241176,0.544118,0.061678,0.264955,,0.833333,0.019231,0.420225,0.343811,0.363737,0.101648,0.250675,0.183942,0.361441,0.129037,0.460317,0.45941,0.166588,0.17168,0.673077,0.524084,0.406728,0.191466,0.357702,0.48073,0.25245,0.338889,0.35441,0.371894,0.274337,0.201455,0.403774,0.390995,0.580645,0.280066,0.079658,0.435094,0.470414,0.483871,0.387566,0.201785,0.368293,,0.213043,0.417417,0.587302,0.438447,0.531915,0.389319,0.188593,0.147929,0.164431,0.542143,0.103727,0.488364,0.609302,0.30303,0.538009,0.410843,0.274115,0.401779,0.199222,0.303915,0.405509,0.187749,0.196206,0.621681,0.330789,0.188633,0.282918,0.649249,0.470588,0.318083,0.3915,0.304188,0.350762,0.4971,0.471587,0.334152,0.245622,0.306929,0.168637,0.28169,0.209975,0.460289,0.172093,0.129731,0.191416,0.275556,0.14073,0.150368,0.168649,0.175845,0.422287,0.371429,0.146249,0.404916,0.234375,,0.363678,0.343634,0.108313,,0.5,0.166667,0.166667,0.157658,0.5,,,,,,,0.230769,,0.0,,9.2e-05,0.444444,0.0,0.011905,0.444444,,,,,,0.001299,0.0,0.0,1.0,0.0,0.0,0.0,0.151515,0.083333,0.0,,0.631579,0.647059,,,,,0.5625
4,0.128205,,0.42,0.583333,0.124945,,0.092241,0.285246,0.316667,0.511111,,0.511111,,,,0.101942,0.005483,0.001821,0.125,0.425,,0.017316,,,,,,0.007568,,,,0.571429,,,0.019982,,,,0.00542,,0.046784,0.092308,,0.151812,,0.018975,0.423529,,,,0.433333,,0.151819,,0.129496,0.3632,,,,0.491525,,,0.202503,0.627119,0.02032,0.581633,0.433333,,0.460595,0.017346,0.199525,0.009174,,0.105263,0.635417,0.306122,0.018978,,0.092643,0.298701,,0.2,,,,,0.473373,,0.423457,0.377839,0.158513,,,0.668041,,0.213675,,,,0.081081,0.166667,,0.295455,0.109272,,,,,,,0.402516,0.288136,,0.280928,,0.377778,,0.001821,0.0,0.146667,0.295455,0.19244,0.202941,0.367647,0.019955,0.213727,,0.583333,0.0,,0.300589,0.463743,0.474565,0.56385,0.576347,,0.521562,0.544974,0.253539,0.604312,,0.769231,0.534845,0.798165,,0.749347,,,0.736111,,,,0.303326,0.38138,,0.629032,0.336079,0.495756,,,0.483871,0.659671,0.732619,0.426829,,,,0.634921,0.246212,0.276596,0.51248,0.581006,0.244576,0.607091,0.676356,0.515255,,0.572093,0.424242,,,,0.634179,0.54786,0.468724,,,0.582394,0.584071,,0.576085,0.311388,0.502703,0.609445,0.413217,,0.588918,0.63389,0.635046,0.604704,,,0.378221,0.480431,0.380282,,0.5,0.581395,0.520843,,,,,0.512432,,0.809384,0.657143,,,0.34375,,0.543641,0.350518,0.519511,0.061224,0.125,0.333333,0.333333,0.191441,1.0,,,,,,,0.230769,,,,,0.222222,,,0.222222,,,,,,,,,,,,,0.121212,0.166667,0.0,,0.736842,0.588235,,,,,0.3125


In [316]:
all_df[cont_vars + dis_vars] = X_scaled[cont_vars + dis_vars]
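
The scaled table above still contains empty cells because `MinMaxScaler` ignores missing values when fitting and passes them through unchanged. A minimal sketch on made-up data (the `toy` frame and its values are assumptions for illustration, not part of the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame (hypothetical values) with one missing entry per column.
toy = pd.DataFrame({"a": [1.0, 3.0, np.nan, 5.0],
                    "b": [10.0, np.nan, 20.0, 40.0]})

scaler = MinMaxScaler(feature_range=(0, 1))

# Passing columns and index keeps the scaled frame aligned with the original rows.
scaled = pd.DataFrame(scaler.fit_transform(toy),
                      columns=toy.columns, index=toy.index)
print(scaled)
```

Each column is mapped to [0, 1] using its observed minimum and maximum, and the NaN entries remain NaN, so imputation still has to happen in a later step.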

# Part 3: Data Preprocessing

## 1. Cont vars

### 1.1. Handling missing values

In [317]:
sig_cont_vars = cont_vars

miss_cont_df = pd.DataFrame(all_df[sig_cont_vars].isnull().sum(),columns=['Count'])
miss_cont_df['Percentage'] = 100 * all_df[sig_cont_vars].isnull().sum()/len(all_df)
miss_cont_df = miss_cont_df.sort_values('Percentage', ascending=False)
miss_cont_df.tail()

Unnamed: 0,Count,Percentage
BPXPLS,55,4.688832
BMXBMI,45,3.836317
BMXHT,44,3.751066
BMXWT,44,3.751066
RIDAGEYR,0,0.0


#### Select continuous variables with at most 30% missing values

In [318]:
sig_cont_vars = list(miss_cont_df[miss_cont_df.Percentage <= 30].index)
print(len(sig_cont_vars))
print(sig_cont_vars)

79
['URXVOL1', 'LBXBCD', 'LBXTHG', 'LBXBPB', 'LBDBPBSI', 'LBDTHGSI', 'LBDBCDSI', 'BPXSY3', 'BPXDI3', 'LBXSLDSI', 'LBXSASSI', 'LBXSATSI', 'LBXSOSSI', 'LBXSGL', 'LBXSAL', 'LBDSBUSI', 'LBDSPHSI', 'BPXSY2', 'LBXSIR', 'LBDSCASI', 'LBXSPH', 'LBDSUASI', 'LBXSCA', 'BPXDI2', 'LBXSBU', 'LBXSTP', 'LBDSGBSI', 'LBXSAPSI', 'LBDSCHSI', 'LBDSALSI', 'LBXSUA', 'LBDSCRSI', 'LBXSCH', 'LBDSTRSI', 'LBDSIRSI', 'LBDSTPSI', 'LBXSGTSI', 'LBXSKSI', 'LBDSGLSI', 'LBXSTR', 'LBXSCR', 'LBXSGB', 'BPXSY1', 'INDFMPIR', 'BPXDI1', 'LBDTCSI', 'LBDHDD', 'LBXTC', 'LBDHDDSI', 'LBDLYMNO', 'LBXBAPCT', 'LBDNENO', 'LBXNEPCT', 'LBXMOPCT', 'LBXEOPCT', 'LBXLYPCT', 'BMXLEG', 'LBXRDW', 'LBXHGB', 'LBXMC', 'LBXPLTSI', 'LBXWBCSI', 'LBXMCVSI', 'LBXRBCSI', 'LBXMCHSI', 'LBXMPSI', 'BMXWAIST', 'BMXARML', 'BMXARMC', 'PHAFSTHR', 'PHAFSTMN', 'URXCRS', 'URXUMS', 'URXUMA', 'BPXPLS', 'BMXBMI', 'BMXHT', 'BMXWT', 'RIDAGEYR']


#### Remove observations with more than 40% missing data

In [319]:
all_df['FeatureCount'] = all_df[sig_cont_vars].count(axis=1)
all_df['FeatureMissing'] = len(sig_cont_vars) - all_df['FeatureCount']
all_df['MissingPercentage'] = all_df.FeatureMissing/len(sig_cont_vars)
all_df[['FeatureMissing','MissingPercentage']].describe()

Unnamed: 0,FeatureMissing,MissingPercentage
count,1173.0,1173.0
mean,7.884058,0.099798
std,17.917955,0.22681
min,0.0,0.0
25%,0.0,0.0
50%,1.0,0.012658
75%,6.0,0.075949
max,78.0,0.987342


In [320]:
print(all_df[(all_df['MissingPercentage'] > 0.4) & (all_df['Class'] == 1)].shape)
all_df = all_df[all_df['MissingPercentage'] <= 0.4]
all_df = all_df.reset_index(drop=True)

all_df.shape, len(all_df[(all_df['Class'] == 1)])/len(all_df)

(92, 464)


((1054, 464), 0.4601518026565465)
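
The row filter above can be illustrated on a tiny frame. This is a sketch with made-up values; the column names `x1`..`x3` are hypothetical, and the missing fraction is computed with `isnull().mean(axis=1)`, which should be equivalent to the count-based calculation used in the notebook:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): row 1 has every feature missing.
toy = pd.DataFrame({
    "x1": [1.0, np.nan, 3.0, np.nan],
    "x2": [1.0, np.nan, np.nan, 4.0],
    "x3": [1.0, np.nan, 3.0, 4.0],
    "Class": [0, 1, 0, 1],
})
features = ["x1", "x2", "x3"]

# Fraction of missing feature values per row.
row_missing = toy[features].isnull().mean(axis=1)

# Keep rows with at most 40% of their features missing.
kept = toy[row_missing <= 0.4].reset_index(drop=True)
print(kept)
```

It is worth checking, as the notebook does, how many positive-class rows are lost by such a filter, since dropping them changes the class balance.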

#### Fill missing data with the median

In [321]:
for var in sig_cont_vars:
    all_df[var] = all_df[var].fillna(all_df[var].median())
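
For reference, the same median imputation can be written without an explicit loop. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one gap per column.
toy = pd.DataFrame({"x": [1.0, np.nan, 3.0, 10.0],
                    "y": [np.nan, 2.0, 2.0, 8.0]})

# toy.median() returns one median per column (NaN values are skipped);
# fillna then matches each median to its column by name.
filled = toy.fillna(toy.median())
print(filled)
```

The median is less sensitive to extreme values than the mean, which matters here because outliers are only handled in a later step.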

In [322]:
from sklearn.metrics import (accuracy_score, classification_report, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_train, X_test, y_train, y_test = train_test_split(all_df[sig_cont_vars], all_df['Class'], test_size=0.25, random_state=1)

def NaiveBayesPrediction(X_train, y_train, X_test, y_test):
    """Fit a Gaussian naive Bayes model and print test-set metrics."""
    clf = GaussianNB()
    clf.fit(X_train, y_train)

    y_pred_gnb = clf.predict(X_test)
    # how did our model perform?
    count_misclassified = (y_test != y_pred_gnb).sum()

    print("GaussianNB")
    print("=" * 30)
    print('Misclassified samples: {}'.format(count_misclassified))
    accuracy = accuracy_score(y_test, y_pred_gnb)
    print('Accuracy: {:.2f}'.format(accuracy))

    # With average='micro' on a single-label problem, recall, precision and F1
    # all equal accuracy; the per-class values are in the classification report.
    print("Recall score : ", recall_score(y_test, y_pred_gnb , average='micro'))
    print("Precision score : ",precision_score(y_test, y_pred_gnb , average='micro'))
    print("F1 score : ",f1_score(y_test, y_pred_gnb , average='micro'))

    print(classification_report(y_test, y_pred_gnb))

NaiveBayesPrediction(X_train, y_train, X_test, y_test)

GaussianNB
Misclassified samples: 93
Accuracy: 0.65
Recall score :  0.6477272727272727
Precision score :  0.6477272727272727
F1 score :  0.6477272727272727
              precision    recall  f1-score   support

           0       0.62      0.89      0.73       142
           1       0.75      0.36      0.49       122

    accuracy                           0.65       264
   macro avg       0.68      0.63      0.61       264
weighted avg       0.68      0.65      0.62       264
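
The three identical "micro" scores in the output above are expected: for a single-label problem, micro-averaged recall, precision and F1 all reduce to accuracy. A small sketch with made-up labels (not taken from the dataset) shows how the averaging modes differ:

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 4 negatives (3 predicted correctly), 6 positives (2 predicted correctly).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]

acc = accuracy_score(y_true, y_pred)                     # 5 correct out of 10
micro = recall_score(y_true, y_pred, average="micro")    # same as accuracy
positive = recall_score(y_true, y_pred)                  # recall of class 1 only
macro = recall_score(y_true, y_pred, average="macro")    # mean of per-class recalls

print(acc, micro, positive, macro)
```

For a screening-style task, the recall of the positive class is usually the number of interest, which is why the per-class rows of the classification report are more informative than the micro-averaged lines.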



In [323]:
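from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

# random_state is dropped: it has no effect unless shuffle=True, and recent
# scikit-learn versions raise an error when it is set without shuffling.
kfold = KFold(n_splits=10)
scores = cross_val_score(model, X_train, y_train, cv=kfold, scoring='recall_macro')
print(scores)

# Note: scoring is macro-averaged recall, not accuracy, despite the label.
print("Accuracy: %0.2f%% (+/- %0.2f%%)" % (100*scores.mean(), 100*scores.std()))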

[0.64247312 0.58039216 0.66111111 0.59198718 0.62799202 0.6262987
 0.56915584 0.65423387 0.62612613 0.69664032]
Accuracy: 62.76% (+/- 3.71%)


### 1.2. Cap outliers

In [324]:
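# Outliers are capped rather than dropped: values beyond the boundaries returned
# by find_skewed_boundaries (a helper not shown in this section) are set to the boundary.
for var in sig_cont_vars:
    upper_boundary, lower_boundary = find_skewed_boundaries(all_df, var, 3.5)
    all_df.loc[all_df[var] <= lower_boundary,var] = lower_boundary
    all_df.loc[all_df[var] >= upper_boundary,var] = upper_boundary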
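
The helper `find_skewed_boundaries` is not shown in this section. The sketch below is one plausible version, assuming it applies the inter-quartile range (IQR) rule and returns `(upper, lower)` in the order the call above unpacks them; the actual helper may differ. The `toy` frame is made up for illustration.

```python
import pandas as pd

def find_skewed_boundaries(df, variable, distance):
    """Hypothetical IQR-based limits; returns (upper, lower)."""
    q1 = df[variable].quantile(0.25)
    q3 = df[variable].quantile(0.75)
    iqr = q3 - q1
    return q3 + distance * iqr, q1 - distance * iqr

# Toy column (hypothetical values) with one extreme observation.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]})

upper, lower = find_skewed_boundaries(toy, "x", 3.5)

# clip() caps values at the boundaries, the same effect as the two .loc assignments.
capped = toy["x"].clip(lower=lower, upper=upper)
print(upper, lower, capped.tolist())
```

Capping keeps every observation in the dataset, which matters with roughly a thousand rows, at the cost of piling extreme values onto the boundary.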

In [325]:
from sklearn.naive_bayes import GaussianNB

X_train, X_test, y_train, y_test = train_test_split(all_df[sig_cont_vars], all_df['Class'], test_size=0.25, random_state=1)
    
NaiveBayesPrediction(X_train, y_train, X_test, y_test)

GaussianNB
Misclassified samples: 87
Accuracy: 0.67
Recall score :  0.6704545454545454
Precision score :  0.6704545454545454
F1 score :  0.6704545454545454
              precision    recall  f1-score   support

           0       0.67      0.77      0.72       142
           1       0.68      0.55      0.61       122

    accuracy                           0.67       264
   macro avg       0.67      0.66      0.66       264
weighted avg       0.67      0.67      0.67       264



In [326]:
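from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

# random_state is omitted because it has no effect without shuffle=True.
kfold = KFold(n_splits=10)
scores = cross_val_score(model, X_train, y_train, cv=kfold, scoring='recall_macro')
print(scores)

# Note: scoring is macro-averaged recall, not accuracy, despite the label.
print("Accuracy: %0.2f%% (+/- %0.2f%%)" % (100*scores.mean(), 100*scores.std()))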

[0.69758065 0.70522876 0.77124183 0.63108974 0.67420213 0.64642857
 0.63506494 0.75268817 0.67052767 0.71805007]
Accuracy: 69.02% (+/- 4.54%)


### 1.3. Gaussian Transformation - Type 2

In [327]:
df1 = all_df.copy()

Gauss_transformed_vars = ['URDTIME1','URXVOL1','LBXTHG','LBDTHGSI','LBXSGTSI','URXCRS','URXUMS','URXUMA']

In [328]:
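from sklearn.preprocessing import quantile_transform

# NOTE: output_distribution defaults to 'uniform', so each variable is mapped to
# [0, 1]; output_distribution='normal' would give a Gaussian-shaped output.
for var in Gauss_transformed_vars:
    # quantile_transform expects a 2-D array; ravel() restores a 1-D column
    df1[var] = quantile_transform(np.array(df1[var]).reshape(-1,1), n_quantiles=20, random_state=0, copy=True).ravel()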
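
To make the effect of `output_distribution` concrete, the sketch below applies `quantile_transform` to synthetic log-normal data (an assumption for illustration, not a variable from the dataset) with the default setting and with `'normal'`:

```python
import numpy as np
from sklearn.preprocessing import quantile_transform

# Synthetic right-skewed data (hypothetical), shaped as a single 2-D column.
rng = np.random.RandomState(0)
skewed = rng.lognormal(size=(500, 1))

# Default: maps the empirical quantiles onto a uniform distribution on [0, 1].
uniform_out = quantile_transform(skewed, n_quantiles=20,
                                 random_state=0, copy=True)

# 'normal': maps the same quantiles onto a standard normal distribution.
normal_out = quantile_transform(skewed, n_quantiles=20,
                                output_distribution="normal",
                                random_state=0, copy=True)

print(uniform_out.min(), uniform_out.max())
print(normal_out.min(), normal_out.max())
```

Since Gaussian naive Bayes assumes normally distributed features within each class, the `'normal'` option is the one that matches the section title; whether it improves the scores would need to be checked by rerunning the cells.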

In [329]:
X_train, X_test, y_train, y_test = train_test_split(df1[sig_cont_vars], df1['Class'], test_size=0.25, random_state=1)

NaiveBayesPrediction(X_train, y_train, X_test, y_test)

GaussianNB
Misclassified samples: 92
Accuracy: 0.65
Recall score :  0.6515151515151515
Precision score :  0.6515151515151515
F1 score :  0.6515151515151515
              precision    recall  f1-score   support

           0       0.65      0.75      0.70       142
           1       0.65      0.53      0.59       122

    accuracy                           0.65       264
   macro avg       0.65      0.64      0.64       264
weighted avg       0.65      0.65      0.65       264



In [330]:
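from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

# random_state is omitted because it has no effect without shuffle=True.
kfold = KFold(n_splits=10)
scores = cross_val_score(model, X_train, y_train, cv=kfold, scoring='recall_macro')
print(scores)

# Note: scoring is macro-averaged recall, not accuracy, despite the label.
print("Accuracy: %0.2f%% (+/- %0.2f%%)" % (100*scores.mean(), 100*scores.std()))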

[0.69186828 0.63529412 0.78594771 0.68237179 0.66356383 0.68636364
 0.64935065 0.71572581 0.65862291 0.75494071]
Accuracy: 69.24% (+/- 4.52%)


### 1.4. Keep only significant continuous variables

In [331]:
from scipy import stats

ttest_df = pd.DataFrame(columns = ['Feature','FeatureName','t-stats','p-value','Skew','Kurtosis'])
ttest_df['Feature'] = sig_cont_vars
ttest_df['FeatureName'] = sig_cont_vars

# Split the observations by class label
df0 = all_df[all_df['Class'] == 0]
df1 = all_df[all_df['Class'] == 1]

# Two-sample t-test per feature; only the features listed in ttest_df are tested
for var in sig_cont_vars:
    result = stats.ttest_ind(df0[var].dropna(), df1[var].dropna())
    ttest_df.loc[ttest_df['Feature'] == var,'t-stats'] = result[0]
    ttest_df.loc[ttest_df['Feature'] == var,'p-value'] = result[1]
    ttest_df.loc[ttest_df['Feature'] == var,'Skew'] = all_df[var].skew()
    ttest_df.loc[ttest_df['Feature'] == var,'Kurtosis'] = all_df[var].kurt()

# Rank features by the magnitude of the t statistic and attach missing-value counts
ttest_df['abs_tstats'] = np.abs(ttest_df['t-stats'])
ttest_df = ttest_df.sort_values(['abs_tstats'], ascending = False)
ttest_df = ttest_df.merge(miss_df, left_on = 'Feature',right_on='Feature',how='inner')

ttest_df

Unnamed: 0,Feature,FeatureName,t-stats,p-value,Skew,Kurtosis,abs_tstats,Count,Percentage
0,RIDAGEYR,RIDAGEYR,-10.6959,2.03864e-25,0.000859062,-1.18415,10.6959,0,0.0
1,LBDSALSI,LBDSALSI,7.86981,8.78556e-15,-0.347124,1.00179,7.86981,119,10.144928
2,LBXSAL,LBXSAL,7.86981,8.78556e-15,-0.347124,1.00179,7.86981,119,10.144928
3,LBDSGLSI,LBDSGLSI,-7.57897,7.62166e-14,1.68589,2.68658,7.57897,119,10.144928
4,LBXSGL,LBXSGL,-7.57309,7.95584e-14,1.69302,2.7168,7.57309,119,10.144928
5,BMXWAIST,BMXWAIST,-7.55143,9.31728e-14,0.624863,0.646716,7.55143,88,7.502131
6,LBXSUA,LBXSUA,-7.52802,1.10476e-13,0.762941,0.732965,7.52802,119,10.144928
7,LBDSUASI,LBDSUASI,-7.5279,1.10568e-13,0.762869,0.733033,7.5279,119,10.144928
8,LBXLYPCT,LBXLYPCT,6.69582,3.48404e-11,0.303575,0.293664,6.69582,97,8.269395
9,BMXBMI,BMXBMI,-6.58331,7.24781e-11,1.13585,1.8663,6.58331,45,3.836317
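
The selection rule used here (keep a feature when the two-sample t-test between the classes gives p <= 0.05) can be sketched on synthetic data. The feature names `shifted` and `noise` and all values are made up for illustration:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.RandomState(0)

# Synthetic data: "shifted" has different class means, "noise" does not.
toy = pd.DataFrame({
    "Class": np.repeat([0, 1], 100),
    "shifted": np.concatenate([rng.normal(0, 1, 100), rng.normal(1, 1, 100)]),
    "noise": rng.normal(0, 1, 200),
})

rows = []
for var in ["shifted", "noise"]:
    group0 = toy.loc[toy["Class"] == 0, var].dropna()
    group1 = toy.loc[toy["Class"] == 1, var].dropna()
    t_stat, p_value = stats.ttest_ind(group0, group1)
    rows.append({"Feature": var, "t-stats": t_stat, "p-value": p_value})

ttest_toy = pd.DataFrame(rows)
selected = list(ttest_toy.loc[ttest_toy["p-value"] <= 0.05, "Feature"])
print(ttest_toy)
print(selected)
```

With dozens of features tested at the 0.05 level, a few can pass by chance alone, so the resulting list is best treated as a coarse filter rather than proof of association.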


In [332]:
sig_cont_vars = list(ttest_df[ttest_df['p-value'] <= 0.05].sort_values(['abs_tstats'],ascending=False).Feature)
print(sig_cont_vars)
print(len(sig_cont_vars))

['RIDAGEYR', 'LBDSALSI', 'LBXSAL', 'LBDSGLSI', 'LBXSGL', 'BMXWAIST', 'LBXSUA', 'LBDSUASI', 'LBXLYPCT', 'BMXBMI', 'URXUMA', 'URXUMS', 'BMXARMC', 'LBXRDW', 'LBDHDDSI', 'LBDHDD', 'INDFMPIR', 'BMXWT', 'LBXNEPCT', 'BMXLEG', 'LBDSTRSI', 'LBXSTR', 'BMXARML', 'LBDSCRSI', 'LBXSCR', 'LBXSAPSI', 'BPXSY3', 'LBXSOSSI', 'LBXTHG', 'LBDTHGSI', 'LBDNENO', 'LBXSBU', 'LBDSBUSI', 'LBXSIR', 'BPXDI2', 'LBDSIRSI', 'LBXSCH', 'LBDSCHSI', 'LBXSLDSI', 'BPXSY2', 'LBDTCSI', 'LBXTC', 'BPXDI1', 'BPXDI3', 'BMXHT', 'BPXSY1', 'LBXSGTSI', 'URXVOL1', 'LBDLYMNO', 'LBXHGB', 'BPXPLS', 'LBXWBCSI', 'LBXSTP', 'LBDSTPSI', 'LBXSGB', 'LBDSGBSI', 'LBXMOPCT', 'LBXMC', 'LBXRBCSI', 'LBXEOPCT']
60


In [333]:
len(sig_cont_vars)

60

In [334]:
X_train, X_test, y_train, y_test = train_test_split(all_df[sig_cont_vars], all_df['Class'], test_size=0.25, random_state=1)
    
NaiveBayesPrediction(X_train, y_train, X_test, y_test)

GaussianNB
Misclassified samples: 86
Accuracy: 0.67
Recall score :  0.6742424242424242
Precision score :  0.6742424242424242
F1 score :  0.6742424242424242
              precision    recall  f1-score   support

           0       0.67      0.77      0.72       142
           1       0.68      0.56      0.61       122

    accuracy                           0.67       264
   macro avg       0.68      0.67      0.67       264
weighted avg       0.68      0.67      0.67       264



In [335]:
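from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

# random_state is omitted because it has no effect without shuffle=True.
kfold = KFold(n_splits=10)
scores = cross_val_score(model, X_train, y_train, cv=kfold, scoring='recall_macro')
print(scores)

# Note: scoring is macro-averaged recall, not accuracy, despite the label.
print("Accuracy: %0.2f%% (+/- %0.2f%%)" % (100*scores.mean(), 100*scores.std()))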

[0.71370968 0.65359477 0.7748366  0.61891026 0.65857713 0.675
 0.66071429 0.73185484 0.66023166 0.76581028]
Accuracy: 69.13% (+/- 4.96%)


### 1.5. Remove features with high correlation

In [336]:
# build a dataframe with the correlation between features
# remember that the absolute value of the correlation
# coefficient is important and not the sign

corrmat = all_df[sig_cont_vars].corr()
corrmat = corrmat.abs().unstack() # absolute value of corr coef
corrmat = corrmat.sort_values(ascending=False)

corrmat = pd.DataFrame(corrmat).reset_index()
corrmat.columns = ['feature1', 'feature2', 'corr']
corrmat['MissingF1'] = corrmat.feature1.apply(lambda x:MissingPercentage(x))
corrmat['MissingF2'] = corrmat.feature2.apply(lambda x:MissingPercentage(x))
corrmat.head()

Unnamed: 0,feature1,feature2,corr,MissingF1,MissingF2
0,LBXSTP,LBDSTPSI,1.0,0.101449,0.101449
1,LBDSTPSI,LBXSTP,1.0,0.101449,0.101449
2,LBDSALSI,LBXSAL,1.0,0.101449,0.101449
3,LBXSAL,LBDSALSI,1.0,0.101449,0.101449
4,LBDSGBSI,LBXSGB,1.0,0.101449,0.101449
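
`MissingPercentage` is a helper that is not shown in this section. Judging from the output (0.101449 for a feature with 10.144928% missing), it appears to return the missing fraction on a 0 to 1 scale. The sketch below is one hypothetical way to write it as a lookup in `miss_df`; the three rows are copied from outputs shown earlier, and the actual helper may differ:

```python
import pandas as pd

# Stand-in for the notebook's miss_df, using values that appear in the outputs above.
miss_df = pd.DataFrame({
    "Feature": ["RIDAGEYR", "LBXSAL", "BMXBMI"],
    "Count": [0, 119, 45],
    "Percentage": [0.0, 10.144928, 3.836317],
})

def MissingPercentage(feature):
    """Hypothetical helper: missing fraction (0..1) of a feature, looked up in miss_df."""
    pct = miss_df.loc[miss_df["Feature"] == feature, "Percentage"]
    return float(pct.iloc[0]) / 100

print(MissingPercentage("LBXSAL"))
```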


In [337]:
correlated_groups = corrmat[corrmat['corr'] > 0.90]
correlated_groups

Unnamed: 0,feature1,feature2,corr,MissingF1,MissingF2
0,LBXSTP,LBDSTPSI,1.0,0.101449,0.101449
1,LBDSTPSI,LBXSTP,1.0,0.101449,0.101449
2,LBDSALSI,LBXSAL,1.0,0.101449,0.101449
3,LBXSAL,LBDSALSI,1.0,0.101449,0.101449
4,LBDSGBSI,LBXSGB,1.0,0.101449,0.101449
5,LBXSGB,LBDSGBSI,1.0,0.101449,0.101449
6,LBXSCH,LBXSCH,1.0,0.101449,0.101449
7,LBXTC,LBXTC,1.0,0.098039,0.098039
8,LBDTCSI,LBDTCSI,1.0,0.098039,0.098039
9,BPXSY2,BPXSY2,1.0,0.101449,0.101449


In [338]:
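# For each block of highly correlated features (|corr| > 0.90), keep the one with
# the lowest missing-value rate. Every feature correlates perfectly with itself,
# so a feature with no correlated partner forms its own block and is kept.
sig_cont_vars = []
remaining_vars = list(correlated_groups.feature1.unique())

while remaining_vars:
    feature = remaining_vars[0]
    correlated_block = correlated_groups[correlated_groups.feature1 == feature]
    # feature2 entry with the smallest missing rate (the first one on ties)
    sel_var = correlated_block.loc[correlated_block['MissingF2'].idxmin(), 'feature2']
    # the whole block is now handled, so drop its members from the queue
    removed_vars = set(correlated_block.feature2)
    remaining_vars = [var for var in remaining_vars if var not in removed_vars]
    if sel_var not in sig_cont_vars:
        sig_cont_vars.append(sel_var)

print(sig_cont_vars)
len(sig_cont_vars)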

['LBDSTPSI', 'LBXSAL', 'LBXSGB', 'LBDTCSI', 'BPXSY2', 'LBXSLDSI', 'LBXEOPCT', 'LBDSIRSI', 'BPXDI2', 'BPXDI3', 'LBDSBUSI', 'BPXDI1', 'LBXSGTSI', 'BMXHT', 'BPXSY1', 'LBDTHGSI', 'URXVOL1', 'LBDLYMNO', 'LBXHGB', 'BPXPLS', 'LBXWBCSI', 'LBXMC', 'BMXARMC', 'LBDSGLSI', 'BMXWAIST', 'LBXSUA', 'LBXLYPCT', 'BMXWT', 'URXUMA', 'LBXRDW', 'LBXSOSSI', 'LBDHDDSI', 'INDFMPIR', 'BMXLEG', 'LBDSTRSI', 'BMXARML', 'LBDSCRSI', 'LBXSAPSI', 'LBXMOPCT', 'RIDAGEYR', 'LBXRBCSI']


41
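
Many of the perfectly correlated pairs above look like the same measurement reported in two unit systems (for example `LBXSAL` and `LBDSALSI`). The sketch below reproduces that situation on synthetic data and applies a simplified variant of the pruning: it keeps the first feature of each correlated block in column order instead of the one with the fewest missing values. All names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
base = rng.normal(size=200)

# "a_si" is "a" in other units (perfectly correlated); "b" is independent.
toy = pd.DataFrame({
    "a": base,
    "a_si": base * 2.5,
    "b": rng.normal(size=200),
})

# Long-format table of absolute pairwise correlations.
pairs = toy.corr().abs().unstack().reset_index()
pairs.columns = ["feature1", "feature2", "corr"]
high = pairs[pairs["corr"] > 0.90]

kept, dropped = [], set()
for feature in toy.columns:
    if feature in dropped:
        continue
    kept.append(feature)
    # Everything highly correlated with a kept feature is redundant.
    partners = set(high.loc[high["feature1"] == feature, "feature2"]) - {feature}
    dropped |= partners

print(kept)
```

Naive Bayes treats features as conditionally independent, so leaving duplicated measurements in the model effectively counts the same evidence twice.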

In [339]:
X_train, X_test, y_train, y_test = train_test_split(all_df[sig_cont_vars], all_df['Class'], test_size=0.25, random_state=1)
    
NaiveBayesPrediction(X_train, y_train, X_test, y_test)

GaussianNB
Misclassified samples: 85
Accuracy: 0.68
Recall score :  0.678030303030303
Precision score :  0.678030303030303
F1 score :  0.678030303030303
              precision    recall  f1-score   support

           0       0.67      0.80      0.73       142
           1       0.69      0.54      0.61       122

    accuracy                           0.68       264
   macro avg       0.68      0.67      0.67       264
weighted avg       0.68      0.68      0.67       264



In [340]:
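from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()

# random_state is omitted because it has no effect without shuffle=True.
kfold = KFold(n_splits=10)
scores = cross_val_score(model, X_train, y_train, cv=kfold, scoring='recall_macro')
print(scores)

# Note: scoring is macro-averaged recall, not accuracy, despite the label.
print("Accuracy: %0.2f%% (+/- %0.2f%%)" % (100*scores.mean(), 100*scores.std()))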

[0.70799731 0.64248366 0.74901961 0.64391026 0.64793883 0.72922078
 0.66363636 0.69489247 0.64350064 0.7397892 ]
Accuracy: 68.62% (+/- 4.09%)


In [341]:
from mixed_naive_bayes import MixedNB

X_train, X_test, y_train, y_test = train_test_split(all_df[sig_cont_vars], all_df['Class'], test_size=0.25, random_state=1)

def MixedBayesPrediction(X_train, y_train, X_test, y_test):
    """Fit a MixedNB model and print test-set metrics."""
    # No categorical_features are passed, so every feature is modelled as
    # Gaussian, which is consistent with the results matching GaussianNB.
    clf = MixedNB()
    clf.fit(X_train, y_train)

    y_pred_gnb = clf.predict(X_test)
    # how did our model perform?
    count_misclassified = (y_test != y_pred_gnb).sum()

    # Label carried over from the GaussianNB helper; the model here is MixedNB.
    print("GaussianNB")
    print("=" * 30)
    print('Misclassified samples: {}'.format(count_misclassified))
    accuracy = accuracy_score(y_test, y_pred_gnb)
    print('Accuracy: {:.2f}'.format(accuracy))

    # Micro-averaged scores equal accuracy on a single-label problem.
    print("Recall score : ", recall_score(y_test, y_pred_gnb , average='micro'))
    print("Precision score : ",precision_score(y_test, y_pred_gnb , average='micro'))
    print("F1 score : ",f1_score(y_test, y_pred_gnb , average='micro'))

    print(classification_report(y_test, y_pred_gnb))

MixedBayesPrediction(X_train, y_train, X_test, y_test)

GaussianNB
Misclassified samples: 85
Accuracy: 0.68
Recall score :  0.678030303030303
Precision score :  0.678030303030303
F1 score :  0.678030303030303
              precision    recall  f1-score   support

           0       0.67      0.80      0.73       142
           1       0.69      0.54      0.61       122

    accuracy                           0.68       264
   macro avg       0.68      0.67      0.67       264
weighted avg       0.68      0.68      0.67       264

