# Introduction

For knowledge discovery in databases (KDD) to succeed, data should be managed with well-defined, formal methods. The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a standard methodology that consists of six phases:
    1. Problem domain understanding
    2. Data Understanding
    3. Data Preparation
    4. Modeling
    5. Evaluation
    6. Deployment

----------------------------------------------------------------------------------------------------------------------

# Part 1 - Problem domain understanding

Cancer remains a global challenge, both to prevent and to treat. However, most cancers are highly curable when they are detected early, so the stage at diagnosis heavily influences survival. Because early-stage cancer often gives no warning signs, routine screening tests are important. For many cancers, such as colorectal, lung, and stomach cancer, screening rates remain low because the procedures are unpleasant and expensive. A cancer risk prediction model could therefore benefit both customers and health institutions. For customers, it encourages people to take screening tests, so that cancer risk is detected early and the survival rate increases. For health institutions, it supports additional services and hence increases sales.

Electronic medical records have become increasingly available through regular health checkups. Recent research shows a growing interest in finding cancer biomarkers in routine blood tests. In general, blood indices are related to cancer to some extent, but none of them alone shows a connection clear enough to be used for diagnosis. Taken together, however, these basic blood indices may reveal converging signs or patterns in an individual for many forms of cancer. By monitoring selected biomarkers that are routinely measured in primary care, we can learn about the physiological patterns that promote carcinogenesis, proliferation, and progression before tumor markers emerge.

This research aims to use the temporal, longitudinal data accumulated in regular health checkups to explore how biomarkers in common blood tests change over time, and to use those patterns to predict cancer.

----------------------------------------------------------------------------------------------------------------------

# Part 2 - Data Exploration & Understanding


## 1. Import Libraries and Define Common Functions

### 1.1. Import Libraries

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import pylab
from scipy import stats
from sklearn.linear_model import LinearRegression
import seaborn as sns
sns.set()

# Modelling Helpers:
# from sklearn.preprocessing import Imputer, Normalizer, scale
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, ShuffleSplit, cross_validate
from sklearn import model_selection

# Classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from lightgbm import LGBMClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
# Evaluation metrics for Classification
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, classification_report, roc_auc_score, roc_curve, precision_recall_curve, average_precision_score
from sklearn.metrics import mutual_info_score

# Regression
from sklearn.linear_model import LinearRegression,Ridge,Lasso,RidgeCV,ElasticNet,LogisticRegression
from sklearn.ensemble import RandomForestRegressor,BaggingRegressor,GradientBoostingRegressor,AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor
# Evaluation metrics for Regression 
from sklearn.metrics import mean_squared_log_error, mean_squared_error, r2_score, mean_absolute_error
# Additional classification metrics not already imported above
from sklearn.metrics import auc, precision_recall_fscore_support

# Configuration
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)


# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

print("Setup complete...")

Setup complete...


### 1.2. Common Functions

In [2]:
# Distribution plot

def analyse_continuous(df,var,target):
    df = df.copy()
    # df[var] = df[var].fillna(df[var].median())
    plt.figure(figsize=(20,5))
       
    # histogram with a kernel density estimate
    # (sns.distplot is deprecated in recent seaborn releases; histplot replaces it)
    plt.subplot(131)
    sns.histplot(df[var], bins=30, kde=True, stat="density")
    plt.title('Histogram')    
    
    # Q-Q plot against a normal distribution (missing values are dropped first)
    plt.subplot(132)
    stats.probplot(df[var].dropna(), dist="norm", plot=pylab)
    plt.ylabel('Quantiles')    
    
    # boxplot
    plt.subplot(133)
    sns.boxplot(x=df[var])
    plt.title('Boxplot')
          
    # skewness and kurtosis
    print('Skewness: %f' % df[var].skew())
    print('Kurtosis: %f' % df[var].kurt())
    plt.show()

In [3]:
def Training_Preparation(df, cont_vars):
    num_df = df[cont_vars].copy()

    # scaling features
    from sklearn.preprocessing import MinMaxScaler
    numdf_norm = pd.DataFrame(MinMaxScaler().fit_transform(df[cont_vars]))
    numdf_norm.columns = num_df.columns
    
    # Define X & y
    X = numdf_norm
    y = df['Class']

    # Split to train and test set
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=90, stratify = y)
    
    # initialize models
    models = []
    models.append(('KNN', KNeighborsClassifier()))
    # labels match the kernels: 'LSVC' is the linear kernel, 'SVC' the RBF kernel
    models.append(('LSVC', SVC(kernel="linear")))
    models.append(('SVC', SVC(kernel="rbf")))
    models.append(('LR', LogisticRegression()))
    models.append(('DT', DecisionTreeClassifier()))
    models.append(('GNB', GaussianNB()))
    models.append(('RF', RandomForestClassifier()))
    models.append(('GB', GradientBoostingClassifier()))
    models.append(('LGB',LGBMClassifier()))
    models.append(('ADA',AdaBoostClassifier()))
    models.append(('LDA',LinearDiscriminantAnalysis()))
    models.append(('QDA',QuadraticDiscriminantAnalysis()))
    models.append(('NN',MLPClassifier()))
    models.append(('XGB',XGBClassifier()))
    
    # Test options and evaluation metric
    seed = 9
    scoring = 'recall_macro'

    # evaluate each model in turn
    results = {}
    names = []

    for name, model in models:
        # random_state only takes effect (and is only accepted) when shuffle=True
        kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
        cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
        results[name] = cv_results
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)
        
    results_df = pd.DataFrame(results)
    plt.figure(figsize=(16,8))
    sns.boxplot(data=results_df)
    plt.show()
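
`Training_Preparation` fits `MinMaxScaler` on the full data set before splitting, so the held-out folds influence the scaling. One possible alternative, sketched here on synthetic data from `make_classification` (an assumption, not the NHANES data), wraps the scaler and a single classifier in a pipeline so that the scaling is re-fit inside each training fold. It also uses stratified, shuffled folds to keep the class ratio stable. The sample size, class weights, and choice of classifier are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic, imbalanced stand-in for the real feature matrix (assumption)
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.9, 0.1], random_state=0)

# The scaler is part of the estimator, so it is fit on training folds only
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep roughly the same share of positives in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=9)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="recall_macro")

print("recall_macro: %f (%f)" % (scores.mean(), scores.std()))
```

The same pipeline object could take the place of each bare classifier in the `models` list, at the cost of re-fitting the scaler once per fold.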

In [4]:
from matplotlib.backends.backend_pdf import PdfPages


def DistributionComparison(all_df, selected_vars,name):
    colors = ['#3791D7','#D72626']

    # pdf = matplotlib.backends.backend_pdf.PdfPages(name + '.pdf')
    with PdfPages(name + '.pdf') as pdf_pages:
        for column in selected_vars:    
            fig = plt.figure(figsize=[8,4])
            plt.subplot(121)
            sns.boxplot(x='Class', y=column,data=all_df,palette=colors)
            plt.title(column, fontsize=12)
            plt.subplot(122)
            # bw_method and fill replace the deprecated bw and shade arguments
            sns.kdeplot(all_df[all_df.Class==1][column], bw_method = 0.4, label = "Cancer", fill=True, color="#D72626", linestyle="--")
            sns.kdeplot(all_df[all_df.Class==0][column], bw_method = 0.4, label = "NoCancer", fill=True, color= "#3791D7", linestyle=":")
            plt.title(column, fontsize=12)   
            pdf_pages.savefig(fig)                                          
            plt.show()    

    # Write the PDF document to the disk
    #pdf_pages.close()

In [5]:
def ModelEvaluation(df, cont_vars):
    
    num_df = df[cont_vars].copy()

    # scaling features
    from sklearn.preprocessing import MinMaxScaler
    numdf_norm = pd.DataFrame(MinMaxScaler().fit_transform(df[cont_vars]))
    numdf_norm.columns = num_df.columns
    
    # Define X & y
    X = numdf_norm
    y = df['Class']

    # Split to train and test set
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=90, stratify = y)
    
    # initialize models
    models = []
    models.append(('KNN', KNeighborsClassifier()))
    # labels match the kernels: 'LSVC' is the linear kernel, 'SVC' the RBF kernel
    models.append(('LSVC', SVC(kernel="linear")))
    models.append(('SVC', SVC(kernel="rbf")))
    models.append(('LR', LogisticRegression()))
    models.append(('DT', DecisionTreeClassifier()))
    models.append(('GNB', GaussianNB()))
    models.append(('RF', RandomForestClassifier()))
    models.append(('GB', GradientBoostingClassifier()))
    models.append(('LGB',LGBMClassifier()))
    models.append(('ADA',AdaBoostClassifier()))
    models.append(('LDA',LinearDiscriminantAnalysis()))
    models.append(('QDA',QuadraticDiscriminantAnalysis()))
    models.append(('NN',MLPClassifier()))
    models.append(('XGB',XGBClassifier()))
    
    for name,model in models:
        print(name)
        model.fit(X_train, y_train)
        
        print('==========================================================')
        print('Train set')
        y_train_pred = model.predict(X_train)
        print('Accuracy: ', accuracy_score(y_train, list(y_train_pred)))
        print('ROC AUC Score: ', roc_auc_score(y_train, list(y_train_pred)))
        cm_df = pd.DataFrame(confusion_matrix(y_train,list(y_train_pred)), index=model.classes_,columns=model.classes_)
        cm_df.index.name = 'True'
        cm_df.columns.name = 'Predicted'
        print('Confusion matrix')
        print(cm_df)
        print(classification_report(y_train, list(y_train_pred)))
  
        print('----------------------------------------------------------')
        print('Test set')
        y_test_pred = model.predict(X_test)
        print('Accuracy: ', accuracy_score(y_test, list(y_test_pred)))
        print('ROC AUC Score: ', roc_auc_score(y_test, list(y_test_pred)))
        cm_df = pd.DataFrame(confusion_matrix(y_test,list(y_test_pred)), index=model.classes_,columns=model.classes_)
        cm_df.index.name = 'True'
        cm_df.columns.name = 'Predicted'
        print('Confusion matrix')
        print(cm_df)
        print(classification_report(y_test, list(y_test_pred)))
        print('==========================================================')
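
`ModelEvaluation` passes hard class predictions to `roc_auc_score`. As a point of comparison, the following sketch uses made-up labels and scores (not results from this data set) to show how ROC AUC can differ when it is computed from predicted probabilities, such as those returned by `predict_proba` for classifiers that support it, rather than from thresholded labels. The 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical ground truth and predicted probabilities of the positive class
y_true = np.array([0, 0, 0, 0, 1, 1])
y_score = np.array([0.10, 0.20, 0.30, 0.60, 0.40, 0.90])

# Hard labels obtained with a 0.5 threshold, as model.predict() would return
y_label = (y_score >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, y_score)   # ranks all 8 pos/neg pairs
auc_from_labels = roc_auc_score(y_true, y_label)   # only one operating point

print(auc_from_scores)  # 0.875
print(auc_from_labels)  # 0.625
```

With hard labels the ROC curve collapses to a single point, so the value equals (1 + TPR - FPR) / 2 and says nothing about how well the model ranks the remaining cases.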
        

In [6]:
from sklearn.manifold import TSNE

def tsne_plot(X, y):
       
        
    # scaling features
    from sklearn.preprocessing import MinMaxScaler
    numdf_norm = pd.DataFrame(MinMaxScaler().fit_transform(X))
    numdf_norm.columns = X.columns
    
    tsne = TSNE(n_components=2, random_state=0)
    X_t = tsne.fit_transform(numdf_norm)

    # boolean masks select the embedded points of each class
    y = np.asarray(y)
    plt.figure(figsize=(12, 8))
    plt.scatter(X_t[y == 0, 0], X_t[y == 0, 1], marker='o', color='g', linewidth=1, alpha=0.8, label='No cancer')
    plt.scatter(X_t[y == 1, 0], X_t[y == 1, 1], marker='o', color='r', linewidth=1, alpha=0.8, label='Cancer')

    plt.legend(loc='best')
    plt.show()

In [7]:
# function to find upper and lower boundaries
# for normally distributed variables

def find_normal_boundaries(df, variable):

    # calculate the boundaries outside which sit the outliers
    # for a Gaussian distribution

    upper_boundary = df[variable].mean() + 3 * df[variable].std()
    lower_boundary = df[variable].mean() - 3 * df[variable].std()

    return upper_boundary, lower_boundary
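
As a usage illustration, the sketch below applies the same mean ± 3 standard deviations rule to a small made-up column. The column name `marker` and its values are hypothetical, and the function is repeated so that the example runs on its own.

```python
import pandas as pd

def find_normal_boundaries(df, variable):
    # Same rule as in the notebook: mean +/- 3 standard deviations
    upper_boundary = df[variable].mean() + 3 * df[variable].std()
    lower_boundary = df[variable].mean() - 3 * df[variable].std()
    return upper_boundary, lower_boundary

# Hypothetical toy column: 20 typical values plus one extreme value
toy_df = pd.DataFrame({"marker": [10.0] * 10 + [12.0] * 10 + [100.0]})

upper, lower = find_normal_boundaries(toy_df, "marker")

# Rows outside the boundaries are flagged as outliers
outliers = toy_df[(toy_df["marker"] > upper) | (toy_df["marker"] < lower)]
print(outliers)
```

Note that the extreme value itself inflates the mean and standard deviation, so in very small samples this rule may fail to flag anything.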

In [8]:
# function to find upper and lower boundaries
# for variables with skewed distributions

def find_skewed_boundaries(df, variable, distance):

    # Let's calculate the boundaries outside which sit the outliers
    # for skewed distributions

    # distance passed as an argument, gives us the option to
    # estimate 1.5 times or 3 times the IQR to calculate
    # the boundaries.

    IQR = df[variable].quantile(0.75) - df[variable].quantile(0.25)

    lower_boundary = df[variable].quantile(0.25) - (IQR * distance)
    upper_boundary = df[variable].quantile(0.75) + (IQR * distance)

    return upper_boundary, lower_boundary
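
To make the effect of the `distance` argument concrete, the following sketch evaluates the IQR rule with 1.5 and 3 on a small made-up column. The data are hypothetical, and the function is restated in condensed form so that the example is self-contained.

```python
import pandas as pd

def find_skewed_boundaries(df, variable, distance):
    # IQR proximity rule: quartiles extended by `distance` times the IQR
    q1 = df[variable].quantile(0.25)
    q3 = df[variable].quantile(0.75)
    iqr = q3 - q1
    return q3 + iqr * distance, q1 - iqr * distance

# Hypothetical right-skewed toy column
toy_df = pd.DataFrame({"marker": [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]})

upper_15, lower_15 = find_skewed_boundaries(toy_df, "marker", 1.5)
upper_3, lower_3 = find_skewed_boundaries(toy_df, "marker", 3)

print(upper_15, lower_15)  # 14.5 -3.5
print(upper_3, lower_3)    # 21.25 -10.25
```

Here Q1 = 3.25 and Q3 = 7.75, so the IQR is 4.5. The value 50 lies beyond the upper boundary under both settings, while a value of 20 would be flagged only with `distance=1.5`.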

In [9]:
def find_uncorrelated_vars(cancer_df, selected_vars, threshold):

    corrmat = cancer_df[selected_vars].corr()
    corrmat = corrmat.abs().unstack() # absolute value of corr coef
    corrmat = corrmat.sort_values(ascending=False)

    corrmat = pd.DataFrame(corrmat).reset_index()
    corrmat.columns = ['feature1', 'feature2', 'corr']
    corrmat['MissingF1'] = corrmat.feature1.apply(lambda x:MissingPercentage(x))
    corrmat['MissingF2'] = corrmat.feature2.apply(lambda x:MissingPercentage(x))
    
    correlated_groups = corrmat[corrmat['corr'] > threshold]
    
    selected_vars = []
    remaining_vars = correlated_groups.feature1.unique()

    while(len(remaining_vars) > 0):
        feature = remaining_vars[0]
        correlated_block = correlated_groups[correlated_groups.feature1 == feature]
        min_ind = correlated_block[['MissingF2']].idxmin() 
        sel_var = correlated_block.feature2[min_ind].values[0]
        removed_vars = [var for var in list(correlated_block.feature2.values)]
        remaining_vars = [var for var in remaining_vars if var not in removed_vars]
        if sel_var not in selected_vars:
            selected_vars = selected_vars + [sel_var]   
    
    return selected_vars
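
The first step of `find_uncorrelated_vars`, turning a correlation matrix into a sorted table of feature pairs, can be illustrated on a toy frame. The column names and values below are hypothetical. Unlike the function above, this sketch drops the self-pairs (correlation of a feature with itself) for readability.

```python
import pandas as pd

# Hypothetical features: b is an exact multiple of a, c is only loosely related
toy_df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})

# Absolute correlations, unstacked into one row per ordered feature pair
pairs = toy_df.corr().abs().unstack().sort_values(ascending=False).reset_index()
pairs.columns = ["feature1", "feature2", "corr"]

# Drop self-pairs, then keep the pairs above the chosen threshold
pairs = pairs[pairs["feature1"] != pairs["feature2"]]
high = pairs[pairs["corr"] > 0.9]
print(high)
```

Each correlated pair appears twice, once in each order, which is why the function groups the rows by `feature1` before choosing a representative.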

In [10]:
def analyze_na_values(df, var, target):
    tmp_df = df.copy()
    print(target)
    
    # Flag each observation: 1 if the value is missing, 0 otherwise
    tmp_df['Missing'] = np.where(df[var].isnull(),1,0)
    
    # Count observations for each combination of target class and missing flag
    tmp_df =  pd.DataFrame(tmp_df.groupby([target,'Missing'])[target].count())
    tmp_df.columns = ['Count']
    tmp_df = tmp_df.reset_index()
    
    # If a class has only one row (no missing values), add a zero-count
    # 'Missing' row so that both bars are drawn for it
    for cls in (0, 1):
        if len(tmp_df[tmp_df[target] == cls]) == 1:
            extra_row = pd.DataFrame([{target: cls, 'Missing': 1, 'Count': 0}])
            tmp_df = pd.concat([tmp_df, extra_row], ignore_index=True)

    # Share of present / missing observations within each class
    tmp_df['Per'] = tmp_df['Count'] / tmp_df.groupby(target)['Count'].transform('sum')
    sns.barplot(x=target, y = 'Per', data=tmp_df, hue='Missing')
    plt.title(var)
    plt.show()

In [11]:
def CategoricalDistribution(df, var, target):
    df = df.copy()
    
    # Count the observations in each category, split by target class
    sns.countplot(x=var, data=df, hue=target)
    plt.title(var)
    plt.show()

In [12]:
def CreateDummyVar(df, categorical_list):
    objdf_new = df.copy()
    objdf_dummy =pd.DataFrame()
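    for e in categorical_list:
        # Prefix each level with its variable name: '<variable>_<level>'
        objdf_new[e] = e + '_' + objdf_new[e].astype(str)
        # drop_first=True drops one level per variable to avoid redundant columns
        df_temp = pd.get_dummies(objdf_new[e], drop_first=True)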
        objdf_dummy = pd.concat([objdf_dummy, df_temp], axis=1)
        
    return objdf_dummy
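
As a usage illustration, the sketch below reproduces the naming scheme of `CreateDummyVar` by calling `pd.get_dummies` directly on a small hypothetical frame. The column names are borrowed from the NHANES variables above, but the values are invented.

```python
import pandas as pd

# Hypothetical categorical codes
toy_df = pd.DataFrame({
    "RIDRETH1": [1, 2, 3, 1],
    "DMDMARTL": [1, 5, 1, 5],
})

# One indicator column per level, named '<variable>_<level>';
# drop_first=True drops the first level of each variable
dummies = pd.get_dummies(toy_df, columns=["RIDRETH1", "DMDMARTL"], drop_first=True)

print(list(dummies.columns))  # ['RIDRETH1_2', 'RIDRETH1_3', 'DMDMARTL_5']
```

The dropped first level acts as the reference category, so a row with all indicators equal to zero belongs to that level.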

In [13]:
# Note: reads the global DataFrame `df` (loaded in section 2), not an argument
def MissingPercentage(x):
    return df[x].isnull().sum()/len(df)
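
A possible alternative that avoids the dependency on a global variable is to compute the missing fraction of every column in one call. The sketch below does this on a hypothetical frame. The column names come from the NHANES variables, but the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical lab values with some missing entries
toy_df = pd.DataFrame({
    "LBXHGB": [13.7, np.nan, 12.5, 12.9],
    "LBXGLU": [np.nan, np.nan, 90.0, 84.0],
})

# The mean of the boolean mask is the fraction of missing values per column
missing_fraction = toy_df.isnull().mean()

print(missing_fraction["LBXHGB"])  # 0.25
print(missing_fraction["LBXGLU"])  # 0.5
```

The result is a Series indexed by column name, so it can be looked up per feature in the same way `MissingPercentage(x)` is used above.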

## 2. Load raw data

In [240]:
df = pd.read_csv('NHANES-MultipleCycles_merged1.csv')
org_df = df.copy()


df = df[df.RIAGENDR==2]
print(df.shape)

df.head()

(35481, 799)


Unnamed: 0,SIALANG,WTINT2YR,DMDCITZN,WTMEC2YR,SIAINTRP,SIAPROXY,RIDAGEYR,...,Class
1,1.0,9081.700761,1.0,8987.04181,2.0,1.0,11,...,0
3,1.0,29960.839509,1.0,34030.994786,2.0,2.0,85,...,0
4,1.0,26457.70818,1.0,26770.584605,2.0,2.0,44,...,0
6,1.0,5635.221296,1.0,5920.617679,2.0,2.0,16,...,0
9,1.0,26487.611553,1.0,0.0,2.0,2.0,41,...,0

[5 rows x 799 columns; intermediate columns omitted]


## 4. Target variable analysis

In [241]:
df[(df.MCQ230C.isna()== False) | (df.MCQ230D.isna()== False)].shape

(26, 799)

In [242]:
df = df[(df.MCQ230C.isna()) & (df.MCQ230D.isna())]
df.shape

(35455, 799)

In [243]:
df[(df['MCQ230A'] != 14) & (df['MCQ230B'] != 14) & (df.MCQ220 == 1)].shape

(1376, 799)

In [244]:
df = df[(df['MCQ230A'] == 14) | (df['MCQ230B'] == 14) | (df.MCQ220 == 2)]
df.shape

(19079, 799)

In [245]:
# no cancer group, excellent health
df['Class'] = 0

# breast cancer
df.loc[df.MCQ220 == 1,'Class'] = 1

df[df.Class == 1].shape[0]/df.shape[0],df[df.Class == 1].shape[0],df[df.Class == 0].shape[0]

(0.03165784370250013, 604, 18475)
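
The filtering and labelling logic of the cells above can be traced on a toy frame. This is a minimal sketch that assumes, as the cell comments indicate, that `MCQ220 == 1` marks a reported cancer diagnosis, `MCQ220 == 2` marks none, and code 14 in `MCQ230A` or `MCQ230B` refers to breast cancer. The four rows are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical respondents: no cancer, breast cancer (first slot),
# a different cancer, and breast cancer recorded in the second slot
toy_df = pd.DataFrame({
    "MCQ220":  [2, 1, 1, 1],
    "MCQ230A": [np.nan, 14, 10, 32],
    "MCQ230B": [np.nan, np.nan, np.nan, 14],
})

# Keep respondents with code 14 in either slot, or with no cancer at all
keep = (toy_df["MCQ230A"] == 14) | (toy_df["MCQ230B"] == 14) | (toy_df["MCQ220"] == 2)
kept = toy_df[keep].copy()

# Label: 0 = no cancer, 1 = cancer
kept["Class"] = 0
kept.loc[kept["MCQ220"] == 1, "Class"] = 1

print(kept)
```

The third respondent, who reports only a different cancer code, is dropped, so the positive class contains code-14 cases only and the negative class contains respondents without any reported cancer.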

In [246]:
# number of cancer <= 45 and num of cancer > 45
(df[(df.Class == 1) & (df.RIDAGEYR <= 45)].shape, df[(df.Class == 1) & (df.RIDAGEYR > 45)].shape)

((27, 799), (577, 799))

In [247]:
# number of no cancer <= 45 and num of no cancer > 45
(df[(df.Class == 0) & (df.RIDAGEYR <= 45)].shape, df[(df.Class == 0) & (df.RIDAGEYR > 45)].shape)

((8891, 799), (9584, 799))

In [248]:
df = df[df.RIDAGEYR > 45]
df.shape

(10161, 799)

In [249]:
pd.DataFrame([df[df.Class == 0].RIDAGEYR.describe(),df[df.Class == 1].RIDAGEYR.describe()],index=['NoCancer','Cancer'])

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
NoCancer,9584.0,62.505426,10.700078,46.0,53.0,62.0,71.0,85.0
Cancer,577.0,68.532062,10.125914,46.0,61.0,70.0,79.0,85.0


In [250]:
df[df.Class == 1].shape[0]/df.shape[0],df[df.Class == 1].shape[0],df[df.Class == 0].shape[0]

(0.05678574943411081, 577, 9584)

In [251]:
df = df.reset_index(drop = True)
df.shape

(10161, 799)

## 5. Categorize vars

In [252]:
target = ['Class']

cont_vars = ['RIDAGEYR', 'RIDAGEMN', 'INDFMPIR', 'LBXSAL', 'LBDSGLSI', 'URDTIME2', 'LBDSBUSI', 'LBDHDDSI', 'LBDLYMNO', 'LBXSCA', 'URXMBP', 'LBDSCASI', 'URXMZP', 'URDFLOW2', 'LBXGLU', 'LBXBPB', 'LBDBCDSI', 'URXUMS', 'LBXSGL', 'LBXSOSSI', 'URXUAC', 'LBXIHG', 'URXMHH', 'URXMHP', 'URXUBA', 'WTFSM', 'URXUUR', 'URDTIME1', 'URXUSR', 'LBXSZN', 'URXUTU', 'LBXSKSI', 'URXUSN', 'URXUMMA', 'LBDTHGSI', 'URXMIB', 'URXUMO', 'URXMNP', 'LBXBCD', 'URXUDMA', 'LBXSATSI', 'LBXSBU', 'LBXV4C', 'LBDSTRSI', 'LBXVBZ', 'LBDSCRSI', 'LBXSUA', 'LBXVOX', 'URXUAS', 'URXUSB', 'LBDSTPSI', 'LBXVXY', 'LBXSTR', 'WTSA2YR', 'URXVOL1', 'LBXLYPCT', 'URXUTL', 'URXUPB', 'URXVOL2', 'PHAFSTMN', 'LBXVEB', 'URXECP', 'LBDTCSI', 'LBXNEPCT', 'LBXBGM', 'LBXHGB', 'LBXSTP', 'URXCNP', 'LBXPLTSI', 'LBDIHGSI', 'LBDSIRSI', 'LBXSGTSI', 'URXUCO', 'LBXBAPCT', 'LBXMCHSI', 'LBDNENO', 'LBXSCR', 'LBDSZNSI', 'LBXMOPCT', 'LBXMPSI', 'LBXVDB', 'LBXSIR', 'URXUCS', 'LBX2DF', 'URXUAB', 'LBDSCUSI', 'LBXRBCSI', 'URXUIO', 'LBDSUASI', 'LBDSPHSI', 'URXCRS', 'URXMEP', 'URXUCD', 'LBXMCVSI', 'URXUAS3', 'LBXSCH', 'URXMC1', 'LBXGLT', 'WTSVS2YR', 'LBXSLDSI', 'LBXRDW', 'PHAGUMMN', 'LBXSGB', 'LBDBPBSI', 'LBDGLTSI', 'WTSVOC2Y', 'URXCOP', 'LBDGLUSI', 'LBXSCU', 'LBXSSE', 'LBXWBCSI', 'LBDHDD', 'LBDSSESI', 'URDFLOW1', 'WTSAF2YR', 'LBXSPH', 'WTSOG2YR', 'URXUMA', 'LBXSASSI', 'LBXEOPCT', 'LBDSGBSI', 'LBXSAPSI', 'LBXTC', 'LBXMC', 'LBXTHG', 'LBDSCHSI', 'URXMOH', 'LBDSALSI', 'PHAFSTHR', 'DXDLAPF', 'DXDTRBMD', 'DXDSTBMC', 'DXXTRFAT', 'DXXHEFAT', 'DXDSTLE', 'DXXRALI', 'DXDTOFAT', 'DXDTOBMD', 'DXXLSBMC', 'DXXTRLI', 'DXDRATOT', 'BPXDI3', 'DXXHEA', 'DXDTOPF', 'DXDRLTOT', 'DXDTRPF', 'DXXLLBMD', 'DXXRLLI', 'BMXARMC', 'DXDRALE', 'DXXLLBMC', 'DXXLALI', 'DXXLRBMC', 'DXXPEA', 'DXDRAPF', 'BPXDI1', 'DXXPEBMD', 'DXDTRTOT', 'DXXRAA', 'DXDLLPF', 'BMXARML', 'DXXHELI', 'DXXRRA', 'DXDSTBMD', 'BPXDI4', 'DXXLLLI', 'DXDRLPF', 'BPXDI2', 'DXXLSBMD', 'BPXPLS', 'DXDTOBMC', 'DXDTOLE', 'DXXLRBMD', 'DXDTRLE', 'DXXHEBMC', 'DXDSTTOT', 'DXXRLA', 'BMXLEG', 'BPXSY1', 'DXXLAA', 'DXXLABMC', 
'DXDLALE', 'DXDTRA', 'BMXWT', 'DXXTSBMC', 'DXXLLA', 'DXXLLFAT', 'DXDTOLI', 'BMXHT', 'DXXLABMD', 'DXDSTLI', 'DXXRRBMD', 'DXXLSA', 'DXDSTA', 'DXDTRBMC', 'DXXRABMC', 'DXDHELE', 'DXDHETOT', 'DXXHEBMD', 'DXDTOA', 'DXXRABMD', 'DXDRLLE', 'DXXLRA', 'DXXRRBMC', 'BPXSY2', 'DXDLLLE', 'DXXTSBMD', 'BMXBMI', 'DXDSTFAT', 'DXXRLFAT', 'DXXRLBMD', 'DXXRAFAT', 'DXDLATOT', 'BMXWAIST', 'DXDLLTOT', 'DXDSTPF', 'DXDHEPF', 'DXXLAFAT', 'DXXRLBMC', 'BPXSY3', 'BPXSY4', 'DXXTSA', 'DXXPEBMC', 'DXDTOTOT']
print(len(cont_vars))

dis_vars = ['DMDYRSUS', 'DMDEDUC2', 'DMDHHSIZ', 'DMDFMSIZ', 'BPXML1', 'BPACSZ', 'FCX10DI', 'FCX11DI', 'FCX06DI', 'FCX08DI', 'FCX07DI', 'FCX09DI', 'LBXBGE', 'URXVOL3', 'URXUAS5', 'URDFLOW3', 'LBXVTE', 'LBDSTBSI', 'URXUMN', 'LBXVTC', 'LBXSTB', 'PHAANTMN', 'PHAALCMN', 'PHACOFMN', 'PHASUPMN', 'URDTIME3', 'LBXV2A', 'LBXV1D', 'LBXVMC', 'LBXVNB', 'LBXV3B', 'LBXVCB', 'LBXVCT', 'LBDMONO', 'LBDEONO', 'LBDBANO', 'PHASUPHR', 'LBXSCLSI', 'LBXSNASI', 'PHAGUMHR', 'PHAANTHR', 'PHACOFHR', 'PHAALCHR', 'LBXSC3SI']
print(len(dis_vars))

cat_vars = ['RIDRETH1', 'RIDEXMON', 'DMDCITZN', 'DMDMARTL', 'RIDEXPRG', 'URDMNPLC', 'PHQ060', 'ORXH51', 'ORXH64', 'ORXHPC', 'LBDHEG', 'ORXGL', 'ORXH69', 'PHQ050', 'LBDVCTLC', 'LBD2DFLC', 'LBXHCR', 'ORXH62', 'URDCNPLC', 'URDUSNLC', 'ORXHPV', 'LBDVTCLC', 'URDUA3LC', 'URDUTLLC', 'URDUTULC', 'URDUDALC', 'ORXH11', 'URDUCOLC', 'ORXH73', 'URXUTRI', 'ORXH26', 'ORXH31', 'LBDHBG', 'ORXH82', 'LBXHE1', 'ORXH53', 'ORXH58', 'URDUURLC', 'LBDHEM', 'URDMZPLC', 'URDUBALC', 'LBDVTELC', 'LBDV1DLC', 'PHQ020', 'LBDV3BLC', 'URDUACLC', 'URDUSBLC', 'ORXH35', 'ORXH84', 'LBDIHGLC', 'URXPREG', 'LBDV4CLC', 'PHQ040', 'URDUPBLC', 'LBDBGELC', 'URDUMMAL', 'URDUMNLC', 'ORXH68', 'URDUA5LC', 'URDMEPLC', 'ORXH81', 'LBDVOXLC', 'ORXGH', 'LBDVCBLC', 'LBDVBZLC', 'ORXH52', 'URDMBPLC', 'ORXH56', 'LBXHE2', 'ORXH61', 'LBDBGMLC', 'URDECPLC', 'URDUCDLC', 'LBDVNBLC', 'URDMOHLC', 'ORXH33', 'ORXH06', 'ORXHPI', 'LBDTHGLC', 'URDUMOLC', 'LBDVXYLC', 'URDMHPLC', 'URDCOPLC', 'URDMC1LC', 'ORXH45', 'ORXH55', 'ORXH71', 'PHQ030', 'LBDVMCLC', 'URDUSRLC', 'ORXH40', 'ORXH39', 'ORXH83', 'ORXH66', 'LBXHBC', 'LBDVDBLC', 'LBDVEBLC', 'ORXH72', 'ORXH54', 'ORXH70', 'URDUCSLC', 'URXUCL', 'LBDV2ALC', 'LBDWFLLC', 'URDUABLC', 'LBXHA', 'ORXH59', 'ORXH67', 'ORXH42', 'ORXH18', 'URDMHHLC', 'PHDSESN', 'URDMIBLC', 'ORXH16', 'LBXHBS', 'LBXHCG', 'OHARNF', 'OHAPOS', 'BMIARML', 'OHDDESTS', 'OHAROCGP', 'OHX23TC', 'BPAEN2', 'OHAROCOH', 'OHAROCDE', 'OHX02TC', 'OHX30TC', 'OHXIMP', 'OHX14TC', 'OHAROTH', 'BMDSTATS', 'OHX16TC', 'OHX05TC', 'BMIARMC', 'BPAEN3', 'OHX26TC', 'OHAREC', 'BPAARM', 'OHAROCCI', 'BMILEG', 'BPAEN4', 'OHX01TC', 'BMXRECUM', 'OHX09TC', 'BMIHT', 'BPXCHR', 'OHX32TC', 'OHX18TC', 'OHDRCSTS', 'BPXPTY', 'BMXHEAD', 'OHX19TC', 'OHX15TC', 'OHDEXSTS', 'OHAROCDT', 'BMIHEAD', 'BMIRECUM', 'OHX03TC', 'BPXPULS', 'OHX17TC', 'BMIWT', 'BPAEN1', 'OHX31TC', 'OHX08TC', 'BMIWAIST', 'DXARLBV', 'DXALLBV', 'DXARABV', 'FCX10DI', 'OHX06TC', 'DXALABV', 'DXARLTV', 'OHX04TC', 'OHX12TC', 'FCX08DI', 'OHX27TC', 'OHX13TC', 'DXAHEBV', 'OHX22TC', 'OHX10TC', 'OHX29TC', 
'OHX21TC', 'FCX07DI', 'OHX28TC', 'OHX07TC', 'FCX09DI', 'DXARATV', 'OHX24TC', 'PEASCCT1', 'DXALLTV', 'OHX25TC', 'OHX11TC', 'DXAHETV', 'DXAEXSTS']      
print(len(cat_vars))

224
44
194


## 6. Final dataset

In [253]:
HC_df = df.groupby(['HSD010','Class'])['Class'].count().unstack()
HC_df['Percentage_0'] = df[df.Class == 0].groupby(['HSD010'])['HSD010'].count()/df[df.Class==0].shape[0]
HC_df['Percentage_1'] = df[df.Class == 1].groupby(['HSD010'])['HSD010'].count()/df[df.Class==1].shape[0]

HC_df

| HSD010 | Class 0 | Class 1 | Percentage_0 | Percentage_1 |
|---|---|---|---|---|
| 1.0 | 596.0 | 22.0 | 0.062187 | 0.038128 |
| 2.0 | 1941.0 | 147.0 | 0.202525 | 0.254766 |
| 3.0 | 3310.0 | 197.0 | 0.345367 | 0.341421 |
| 4.0 | 2098.0 | 111.0 | 0.218907 | 0.192374 |
| 5.0 | 436.0 | 42.0 | 0.045492 | 0.07279 |
| 9.0 | 2.0 | NaN | 0.000209 | NaN |


In [254]:
df = df[df.HSD010 < 4]
df = df[((df.Class == 0) & (df.HSD010 < 3)) | (df.Class== 1)]

df.shape, df[df.Class == 1].shape, df[df.Class == 0].shape

((2903, 799), (366, 799), (2537, 799))

## 7. Check missing data of all features

In [255]:
all_vars = cont_vars+dis_vars
miss_df = pd.DataFrame(df[all_vars].isnull().sum(),columns=['Count'])
miss_df['Percentage'] = 100 * df[all_vars].isnull().sum()/len(df)
miss_df = miss_df.sort_values('Percentage', ascending=True)
miss_df = miss_df.reset_index()
miss_df.columns = ['Feature','Count','Percentage']
miss_df.head(20)

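| | Feature | Count | Percentage |
|---|---|---|---|
| 0 | RIDAGEYR | 0 | 0.0 |
| 1 | DMDEDUC2 | 0 | 0.0 |
| 2 | DMDHHSIZ | 0 | 0.0 |
| 3 | DMDFMSIZ | 0 | 0.0 |
| 4 | BMXWT | 22 | 0.757837 |
| 5 | BMXHT | 23 | 0.792284 |
| 6 | BMXBMI | 26 | 0.895625 |
| 7 | PHAFSTHR | 29 | 0.998967 |
| 8 | PHAFSTMN | 29 | 0.998967 |
| 9 | URXCRS | 47 | 1.619015 |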


In [256]:
from scipy import stats  # a bare `import scipy` does not reliably expose scipy.stats

# Compare each numeric feature between controls (Class 0) and cases (Class 1)
# with the Wilcoxon rank-sum test. ranksums returns a z-statistic; it is stored
# in the 't-stats' column to keep the existing column names.
ttest_df = pd.DataFrame(columns = ['Feature','FeatureName','t-stats','p-value','Skew','Kurtosis'])
ttest_df['Feature'] = cont_vars + dis_vars
ttest_df['FeatureName'] = cont_vars + dis_vars

df0 = df[df['Class'] == 0]
df1 = df[df['Class'] == 1]

for var in cont_vars + dis_vars:
    # drop missing values per group before testing
    result = stats.ranksums(df0[var].dropna(), df1[var].dropna())
    ttest_df.loc[ttest_df['Feature'] == var,'t-stats'] = result[0]
    ttest_df.loc[ttest_df['Feature'] == var,'p-value'] = result[1]
    ttest_df.loc[ttest_df['Feature'] == var,'Skew'] = df[var].skew()
    ttest_df.loc[ttest_df['Feature'] == var,'Kurtosis'] = df[var].kurt()

# rank features by the magnitude of the test statistic and attach missing-data counts
ttest_df['abs_tstats'] = np.abs(ttest_df['t-stats'])
ttest_df = ttest_df.sort_values(['abs_tstats'], ascending = False)
ttest_df = ttest_df.merge(miss_df, left_on = 'Feature',right_on='Feature',how='inner')

ttest_df

| | Feature | FeatureName | t-stats | p-value | Skew | Kurtosis | abs_tstats | Count | Percentage |
|---|---|---|---|---|---|---|---|---|---|
| 0 | RIDAGEYR | RIDAGEYR | -10.4024 | 2.41838e-25 | 0.169227 | -1.15567 | 10.4024 | 0 | 0.0 |
| 1 | RIDAGEMN | RIDAGEMN | -6.67115 | 2.5381e-11 | 0.214931 | -1.02841 | 6.67115 | 1775 | 61.143645 |
| 2 | LBXLYPCT | LBXLYPCT | 5.77229 | 7.82009e-09 | 0.46378 | 1.17866 | 5.77229 | 96 | 3.306924 |
| 3 | LBXSOSSI | LBXSOSSI | -5.35762 | 8.43275e-08 | -0.381261 | 4.52146 | 5.35762 | 129 | 4.443679 |
| 4 | LBDSUASI | LBDSUASI | -5.22877 | 1.70641e-07 | 0.728258 | 0.861264 | 5.22877 | 129 | 4.443679 |
| 5 | LBXSUA | LBXSUA | -5.22877 | 1.70641e-07 | 0.728288 | 0.86095 | 5.22877 | 129 | 4.443679 |
| 6 | LBXSAL | LBXSAL | 5.13968 | 2.75211e-07 | -0.260504 | 0.647443 | 5.13968 | 129 | 4.443679 |
| 7 | LBDSALSI | LBDSALSI | 5.13968 | 2.75211e-07 | -0.260504 | 0.647443 | 5.13968 | 129 | 4.443679 |
| 8 | LBDSGLSI | LBDSGLSI | -4.86966 | 1.1179e-06 | 4.78565 | 33.5117 | 4.86966 | 129 | 4.443679 |
| 9 | LBXSGL | LBXSGL | -4.86966 | 1.1179e-06 | 4.78576 | 33.5142 | 4.86966 | 129 | 4.443679 |
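
The per-feature comparison above relies on the Wilcoxon rank-sum test. The following is a minimal sketch of that single step on synthetic data (the `toy` frame and its `feature` column are made up for illustration and are not NHANES values). It shows that `scipy.stats.ranksums` returns a z-type statistic and a two-sided p-value, and that missing values need to be dropped per group first.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in for one feature split by class (assumption: not real data).
toy = pd.DataFrame({
    "Class": np.repeat([0, 1], [200, 50]),
    "feature": np.concatenate([rng.normal(50, 10, 200), rng.normal(60, 10, 50)]),
})
toy.loc[::25, "feature"] = np.nan  # inject a few missing values

# Drop missing values within each group before testing.
group0 = toy.loc[toy["Class"] == 0, "feature"].dropna()
group1 = toy.loc[toy["Class"] == 1, "feature"].dropna()

# ranksums uses a normal approximation, so the statistic is a z-score.
statistic, p_value = stats.ranksums(group0, group1)
print(f"z = {statistic:.2f}, p = {p_value:.3g}")
```

A negative statistic here means the first sample (Class 0) tends to have smaller values than the second, which is the same sign convention as the negative value reported for `RIDAGEYR`.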


https://pypi.org/project/mixed-naive-bayes/

# Part 3: Data Preprocessing

#### Select continuous and discrete variables with missing values <= 20%

In [257]:
num_vars = list(miss_df[miss_df.Percentage <= 20].Feature)
print(len(num_vars))
print(num_vars)

91
['RIDAGEYR', 'DMDEDUC2', 'DMDHHSIZ', 'DMDFMSIZ', 'BMXWT', 'BMXHT', 'BMXBMI', 'PHAFSTHR', 'PHAFSTMN', 'URXCRS', 'URXUMS', 'URXUMA', 'BPXPLS', 'BPACSZ', 'BPXML1', 'BMXARML', 'BMXARMC', 'LBXMC', 'LBXPLTSI', 'LBXHGB', 'LBXWBCSI', 'LBXRDW', 'LBXMCHSI', 'LBXMCVSI', 'LBXRBCSI', 'LBXMPSI', 'LBXLYPCT', 'LBXNEPCT', 'LBXMOPCT', 'LBXEOPCT', 'LBDNENO', 'LBXBAPCT', 'LBDBANO', 'LBDLYMNO', 'LBDEONO', 'LBDMONO', 'BMXLEG', 'BMXWAIST', 'LBDTCSI', 'LBDHDDSI', 'LBDHDD', 'LBXTC', 'LBXSPH', 'LBXSGTSI', 'LBXSNASI', 'LBXSCLSI', 'LBXSCR', 'LBXSTP', 'LBXSAL', 'LBDSGLSI', 'LBDSALSI', 'LBDSCHSI', 'LBDSBUSI', 'LBXSOSSI', 'LBXSAPSI', 'LBXSBU', 'LBDSUASI', 'LBXSGB', 'LBDSGBSI', 'LBDSCRSI', 'LBXSUA', 'LBXSCH', 'LBDSTPSI', 'LBXSGL', 'LBDSPHSI', 'LBDSTRSI', 'LBXSTR', 'LBXSKSI', 'LBXSIR', 'LBDSTBSI', 'LBDSIRSI', 'LBDSCASI', 'LBXSTB', 'LBXSCA', 'LBXSC3SI', 'LBXSATSI', 'LBXSASSI', 'LBXSLDSI', 'BPXDI2', 'BPXSY2', 'BPXDI3', 'BPXSY3', 'BPXDI1', 'BPXSY1', 'INDFMPIR', 'LBDBCDSI', 'LBXBPB', 'LBDBPBSI', 'LBDTHGSI', 'LBXTHG', '

#### Remove observations with more than 60% missing data

In [258]:
df['FeatureCount'] = df[num_vars].count(axis=1)
df['FeatureMissing'] = len(num_vars) - df['FeatureCount']
df['MissingPercentage'] = df.FeatureMissing/len(num_vars)
df[['FeatureMissing','MissingPercentage']].describe()

| | FeatureMissing | MissingPercentage |
|---|---|---|
| count | 2903.0 | 2903.0 |
| mean | 4.183948 | 0.045977 |
| std | 12.401837 | 0.136284 |
| min | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 |
| 75% | 3.0 | 0.032967 |
| max | 83.0 | 0.912088 |


In [259]:
print(df[(df['MissingPercentage'] > 0.6) & (df['Class'] == 1)].shape)
df = df[df['MissingPercentage'] <= 0.6]
df = df.reset_index(drop=True)

df.shape, df[df.Class == 1].shape, df[df.Class == 0].shape

(22, 802)


((2821, 802), (344, 802), (2477, 802))

In [260]:
from scipy import stats  # a bare `import scipy` does not reliably expose scipy.stats

# Repeat the rank-sum comparison on the reduced feature list and filtered rows.
# ranksums returns a z-statistic; it is stored in the 't-stats' column.
ttest_df = pd.DataFrame(columns = ['Feature','FeatureName','t-stats','p-value','Skew','Kurtosis'])
ttest_df['Feature'] = num_vars
ttest_df['FeatureName'] = num_vars

df0 = df[df['Class'] == 0]
df1 = df[df['Class'] == 1]

for var in num_vars:
    # drop missing values per group before testing
    result = stats.ranksums(df0[var].dropna(), df1[var].dropna())
    ttest_df.loc[ttest_df['Feature'] == var,'t-stats'] = result[0]
    ttest_df.loc[ttest_df['Feature'] == var,'p-value'] = result[1]
    ttest_df.loc[ttest_df['Feature'] == var,'Skew'] = df[var].skew()
    ttest_df.loc[ttest_df['Feature'] == var,'Kurtosis'] = df[var].kurt()

ttest_df['abs_tstats'] = np.abs(ttest_df['t-stats'])
ttest_df = ttest_df.sort_values(['abs_tstats'], ascending = False)
ttest_df = ttest_df.merge(miss_df, left_on = 'Feature',right_on='Feature',how='inner')

ttest_df

| | Feature | FeatureName | t-stats | p-value | Skew | Kurtosis | abs_tstats | Count | Percentage |
|---|---|---|---|---|---|---|---|---|---|
| 0 | RIDAGEYR | RIDAGEYR | -10.0894 | 6.15494e-24 | 0.172149 | -1.15074 | 10.0894 | 0 | 0.0 |
| 1 | LBXLYPCT | LBXLYPCT | 5.77229 | 7.82009e-09 | 0.46378 | 1.17866 | 5.77229 | 96 | 3.306924 |
| 2 | LBXSOSSI | LBXSOSSI | -5.35762 | 8.43275e-08 | -0.381261 | 4.52146 | 5.35762 | 129 | 4.443679 |
| 3 | LBDSUASI | LBDSUASI | -5.22877 | 1.70641e-07 | 0.728258 | 0.861264 | 5.22877 | 129 | 4.443679 |
| 4 | LBXSUA | LBXSUA | -5.22877 | 1.70641e-07 | 0.728288 | 0.86095 | 5.22877 | 129 | 4.443679 |
| 5 | LBXSAL | LBXSAL | 5.13968 | 2.75211e-07 | -0.260504 | 0.647443 | 5.13968 | 129 | 4.443679 |
| 6 | LBDSALSI | LBDSALSI | 5.13968 | 2.75211e-07 | -0.260504 | 0.647443 | 5.13968 | 129 | 4.443679 |
| 7 | LBDSGLSI | LBDSGLSI | -4.86966 | 1.1179e-06 | 4.78565 | 33.5117 | 4.86966 | 129 | 4.443679 |
| 8 | LBXSGL | LBXSGL | -4.86966 | 1.1179e-06 | 4.78576 | 33.5142 | 4.86966 | 129 | 4.443679 |
| 9 | LBDLYMNO | LBDLYMNO | 4.77344 | 1.81109e-06 | 26.0391 | 1074.53 | 4.77344 | 96 | 3.306924 |


In [325]:
num_vars = list(ttest_df[ttest_df['p-value'] <= 0.2].sort_values(['abs_tstats'],ascending=False).Feature)
print(num_vars)
print(len(num_vars))

['RIDAGEYR', 'LBXLYPCT', 'LBXSOSSI', 'LBDSUASI', 'LBXSUA', 'LBXSAL', 'LBDSALSI', 'LBDSGLSI', 'LBXSGL', 'LBDLYMNO', 'LBDSBUSI', 'LBXSBU', 'LBDEONO', 'DMDHHSIZ', 'LBDSTPSI', 'LBXSTP', 'LBXNEPCT', 'LBDSTRSI', 'LBXSTR', 'BPXDI1', 'DMDFMSIZ', 'LBXEOPCT', 'BPXDI2', 'URXUMS', 'URXUMA', 'BMXARML', 'BPXML1', 'BPXSY2', 'LBXRDW', 'LBDHDDSI', 'LBDHDD', 'BPXSY1', 'LBXSGTSI', 'LBDSCRSI', 'LBXSCR', 'BMXWAIST', 'LBDMONO', 'LBXPLTSI', 'BPXSY3', 'LBDTCSI', 'LBXTC', 'LBXSCH', 'LBDSCHSI', 'LBXMOPCT', 'BPXDI3', 'LBXSLDSI', 'BMXLEG', 'LBXSKSI', 'INDFMPIR', 'BMXHT', 'LBXSNASI', 'BPACSZ', 'LBXSC3SI', 'LBXSTB', 'LBDSTBSI', 'LBXMC', 'LBXBAPCT', 'LBXHGB', 'BMXARMC', 'LBDNENO', 'BMXBMI', 'LBXRBCSI', 'LBDBANO', 'LBXSAPSI', 'LBXSIR', 'LBDSIRSI']
66


#### Filling missing data 

In [326]:
for var in num_vars:
    df[var] = df[var].fillna(df[var].median())
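
The cell above fills each feature with the median of the full dataset before the train/test split. As a possible alternative, the sketch below learns the medians on the training split only and reuses them on the test split, so no test information leaks into the imputation. The `toy` frame and its column names are invented for illustration, and `SimpleImputer` is swapped in here in place of the notebook's `fillna`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy data with missing values (assumption: stand-in for df[num_vars]).
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, 100.0, np.nan, 2.0, 5.0],
    "b": [10.0, 20.0, np.nan, 40.0, 50.0, 60.0, 70.0, np.nan],
    "Class": [0, 0, 0, 0, 1, 1, 0, 1],
})

X_train, X_test, y_train, y_test = train_test_split(
    toy[["a", "b"]], toy["Class"], test_size=0.25, random_state=1
)

# Fit the medians on the training rows only, then apply them to both splits.
imputer = SimpleImputer(strategy="median")
X_train_filled = pd.DataFrame(
    imputer.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)
X_test_filled = pd.DataFrame(
    imputer.transform(X_test), columns=X_test.columns, index=X_test.index
)

print(imputer.statistics_)  # per-column medians learned from the training split
```

Whether this changes the reported metrics on the real data is not something the sketch can show; it only illustrates the order of operations.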

In [327]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, classification_report)

X_train, X_test, y_train, y_test = train_test_split(df[num_vars], df['Class'], test_size=0.25, random_state=1)

def NaiveBayesPrediction(X_train, y_train, X_test, y_test):
    """Fit a Gaussian naive Bayes classifier and print test-set metrics."""
    clf = GaussianNB()
    clf.fit(X_train, y_train)
    
    y_pred_gnb = clf.predict(X_test)
    y_prob_pred_gnb = clf.predict_proba(X_test)
    # how did our model perform?
    count_misclassified = (y_test != y_pred_gnb).sum()
    
    print("GaussianNB")
    print("=" * 30)
    print('Misclassified samples: {}'.format(count_misclassified))
    accuracy = accuracy_score(y_test, y_pred_gnb)
    print('Accuracy: {:.2f}'.format(accuracy))
    
    # With average='micro' on a single-label problem, recall, precision and F1
    # all equal the accuracy; the per-class values are in the report below.
    print("Recall score : ", recall_score(y_test, y_pred_gnb , average='micro'))
    print("Precision score : ",precision_score(y_test, y_pred_gnb , average='micro'))
    print("F1 score : ",f1_score(y_test, y_pred_gnb , average='micro'))
    
    print(classification_report(y_test, y_pred_gnb))
    
NaiveBayesPrediction(X_train, y_train, X_test, y_test)

GaussianNB
Misclassified samples: 165
Accuracy: 0.77
Recall score :  0.7662889518413598
Precision score :  0.7662889518413598
F1 score :  0.7662889518413598
              precision    recall  f1-score   support

           0       0.90      0.83      0.86       617
           1       0.22      0.34      0.27        89

    accuracy                           0.77       706
   macro avg       0.56      0.58      0.56       706
weighted avg       0.81      0.77      0.79       706
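
The recall, precision and F1 lines in the output above are all identical to the accuracy. A small sketch with hand-made labels (not taken from the dataset) shows why: with `average='micro'` on a single-label problem these scores reduce to accuracy, while `average='binary'` reports the positive class only, which is closer to the class `1` row of the classification report.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hand-made labels: 8 negatives and 2 positives (assumption: illustrative only).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])

accuracy = accuracy_score(y_true, y_pred)
micro_recall = recall_score(y_true, y_pred, average="micro")
micro_precision = precision_score(y_true, y_pred, average="micro")

# Positive-class scores: 1 true positive, 1 false negative, 2 false positives.
binary_recall = recall_score(y_true, y_pred, average="binary")
binary_precision = precision_score(y_true, y_pred, average="binary")

print(accuracy, micro_recall, micro_precision)
print(binary_recall, binary_precision)
```

For an imbalanced screening problem like this one, the positive-class recall and precision are usually the more informative numbers.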



In [328]:
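from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()

# random_state only applies when shuffle=True, so it is omitted for unshuffled folds
kfold = KFold(n_splits = 10)
scores = cross_val_score(model,X_train,y_train,cv=kfold,scoring='recall_macro')
print(scores)

# scoring='recall_macro', so this reports macro-averaged recall rather than accuracy
print("Macro recall: %0.2f%% (+/- %0.2f%%)" % (100*scores.mean(), 100*scores.std()))

[0.61174242 0.58027923 0.53978331 0.60777548 0.58396595 0.59908377
 0.52814136 0.64208103 0.62198465 0.53941068]
Macro recall: 58.54% (+/- 3.67%)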
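
Because the positive class is a small share of the rows, plain k-fold splits can give folds with quite different class mixes. The sketch below, on synthetic data rather than the notebook's `X_train`, uses `StratifiedKFold` with shuffling so each fold keeps the same class ratio. The data shape and the shift applied to the positive class are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic imbalanced data: 260 negatives, 40 positives (assumption).
X = rng.normal(size=(300, 5))
y = np.array([0] * 260 + [1] * 40)
X[y == 1] += 0.8  # shift the positive class so there is some signal

# Stratified folds preserve the class ratio; shuffle=True makes random_state meaningful.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=10)
scores = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="recall_macro")

print("Macro recall: %0.2f%% (+/- %0.2f%%)" % (100 * scores.mean(), 100 * scores.std()))
```

Scores from this scheme are not directly comparable to the unshuffled `KFold` numbers in this notebook, since the fold membership differs.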


### 1.2. Cap outliers

In [329]:
# Cap extreme values at the computed boundaries instead of dropping the rows.
for var in num_vars:
    upper_boundary, lower_boundary = find_skewed_boundaries(df, var, 3.5)
    df.loc[df[var] <= lower_boundary,var] = lower_boundary
    df.loc[df[var] >= upper_boundary,var] = upper_boundary
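
`find_skewed_boundaries` is not defined in this part of the notebook. One common way to write such a helper is the inter-quartile range (IQR) proximity rule. The version below is a hypothetical sketch under that assumption, not necessarily the author's implementation. It returns the boundaries in the same `(upper, lower)` order the loop above expects.

```python
import pandas as pd

def find_skewed_boundaries(df, variable, distance):
    """Hypothetical helper: IQR-based limits for a skewed variable."""
    q1 = df[variable].quantile(0.25)
    q3 = df[variable].quantile(0.75)
    iqr = q3 - q1
    lower_boundary = q1 - distance * iqr
    upper_boundary = q3 + distance * iqr
    return upper_boundary, lower_boundary

# Toy column with one extreme value (assumption: illustrative data only).
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]})

upper, lower = find_skewed_boundaries(toy, "x", 3.5)
# Capping is equivalent to clipping the column to [lower, upper].
toy["x_capped"] = toy["x"].clip(lower=lower, upper=upper)

print(upper, lower)
print(toy)
```

A multiplier of 3.5 is wide, so only very extreme values are capped; the conventional 1.5 would affect many more rows.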

## 8. Feature scaling

In [330]:
from sklearn.naive_bayes import GaussianNB

X_train, X_test, y_train, y_test = train_test_split(df[num_vars], df['Class'], test_size=0.25, random_state=1)
    
NaiveBayesPrediction(X_train, y_train, X_test, y_test)

GaussianNB
Misclassified samples: 165
Accuracy: 0.77
Recall score :  0.7662889518413598
Precision score :  0.7662889518413598
F1 score :  0.7662889518413598
              precision    recall  f1-score   support

           0       0.90      0.83      0.86       617
           1       0.22      0.34      0.27        89

    accuracy                           0.77       706
   macro avg       0.56      0.58      0.56       706
weighted avg       0.81      0.77      0.79       706



In [331]:
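from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()

# random_state only applies when shuffle=True, so it is omitted for unshuffled folds
kfold = KFold(n_splits = 10)
scores = cross_val_score(model,X_train,y_train,cv=kfold,scoring='recall_macro')
print(scores)

# scoring='recall_macro', so this reports macro-averaged recall rather than accuracy
print("Macro recall: %0.2f%% (+/- %0.2f%%)" % (100*scores.mean(), 100*scores.std()))

[0.61174242 0.58027923 0.53978331 0.60777548 0.58396595 0.59908377
 0.52814136 0.64208103 0.62198465 0.53941068]
Macro recall: 58.54% (+/- 3.67%)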


### 1.3. Gaussian Transformation - Type 2

In [332]:
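from sklearn.preprocessing import quantile_transform

Gauss_transformed_vars = ['URDTIME1','URXVOL1','LBXTHG','LBDTHGSI','LBXSGTSI','URXCRS','URXUMS','URXUMA']

# quantile_transform defaults to output_distribution='uniform'; pass
# output_distribution='normal' to map the values onto a Gaussian shape.
for var in Gauss_transformed_vars:
    # reshape to a single column for scikit-learn, then flatten back to 1-D
    df[var] = quantile_transform(np.array(df[var]).reshape(-1,1), n_quantiles=20, random_state=0, copy=True).ravel()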
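
To make the effect of the `output_distribution` argument concrete, the sketch below applies `quantile_transform` to a synthetic right-skewed sample (log-normal draws, an arbitrary choice for illustration) with both the default `'uniform'` output and the `'normal'` output, and compares the skewness.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import quantile_transform

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # synthetic, right-skewed

# Default output: values are mapped onto a uniform distribution on [0, 1].
uniform_out = quantile_transform(
    skewed.reshape(-1, 1), n_quantiles=20, random_state=0, copy=True
).ravel()

# Gaussian output: values are mapped onto a standard normal shape.
normal_out = quantile_transform(
    skewed.reshape(-1, 1), n_quantiles=20, output_distribution="normal",
    random_state=0, copy=True
).ravel()

print("skew before:      ", stats.skew(skewed))
print("skew after normal:", stats.skew(normal_out))
print("uniform range:    ", uniform_out.min(), uniform_out.max())
```

Gaussian naive Bayes assumes roughly normal features within each class, so the `'normal'` output is the variant that matches the heading of this step.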

In [333]:
X_train, X_test, y_train, y_test = train_test_split(df[num_vars], df['Class'], test_size=0.25, random_state=1)

NaiveBayesPrediction(X_train, y_train, X_test, y_test)

GaussianNB
Misclassified samples: 165
Accuracy: 0.77
Recall score :  0.7662889518413598
Precision score :  0.7662889518413598
F1 score :  0.7662889518413598
              precision    recall  f1-score   support

           0       0.90      0.83      0.86       617
           1       0.22      0.34      0.27        89

    accuracy                           0.77       706
   macro avg       0.56      0.58      0.56       706
weighted avg       0.81      0.77      0.79       706



In [334]:
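from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()

# random_state only applies when shuffle=True, so it is omitted for unshuffled folds
kfold = KFold(n_splits = 10)
scores = cross_val_score(model,X_train,y_train,cv=kfold,scoring='recall_macro')
print(scores)

# scoring='recall_macro', so this reports macro-averaged recall rather than accuracy
print("Macro recall: %0.2f%% (+/- %0.2f%%)" % (100*scores.mean(), 100*scores.std()))

[0.61174242 0.58027923 0.53978331 0.60777548 0.58396595 0.59908377
 0.52814136 0.64208103 0.62198465 0.53941068]
Macro recall: 58.54% (+/- 3.67%)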


### 1.5. Remove features with high correlation

In [335]:
# build a dataframe with the correlation between features
# remember that the absolute value of the correlation
# coefficient is important and not the sign

corrmat = df[num_vars].corr()
corrmat = corrmat.abs().unstack() # absolute value of corr coef
corrmat = corrmat.sort_values(ascending=False)

corrmat = pd.DataFrame(corrmat).reset_index()
corrmat.columns = ['feature1', 'feature2', 'corr']
corrmat['MissingF1'] = corrmat.feature1.apply(lambda x:MissingPercentage(x))
corrmat['MissingF2'] = corrmat.feature2.apply(lambda x:MissingPercentage(x))
corrmat.head()

| | feature1 | feature2 | corr | MissingF1 | MissingF2 |
|---|---|---|---|---|---|
| 0 | LBXSAL | LBDSALSI | 1.0 | 0.0 | 0.0 |
| 1 | LBDSALSI | LBXSAL | 1.0 | 0.0 | 0.0 |
| 2 | LBDSIRSI | LBDSIRSI | 1.0 | 0.0 | 0.0 |
| 3 | BPXSY3 | BPXSY3 | 1.0 | 0.0 | 0.0 |
| 4 | BPXDI3 | BPXDI3 | 1.0 | 0.0 | 0.0 |


In [336]:
correlated_groups = corrmat[corrmat['corr'] > 0.90]
correlated_groups

| | feature1 | feature2 | corr | MissingF1 | MissingF2 |
|---|---|---|---|---|---|
| 0 | LBXSAL | LBDSALSI | 1.0 | 0.0 | 0.0 |
| 1 | LBDSALSI | LBXSAL | 1.0 | 0.0 | 0.0 |
| 2 | LBDSIRSI | LBDSIRSI | 1.0 | 0.0 | 0.0 |
| 3 | BPXSY3 | BPXSY3 | 1.0 | 0.0 | 0.0 |
| 4 | BPXDI3 | BPXDI3 | 1.0 | 0.0 | 0.0 |
| 5 | LBXMOPCT | LBXMOPCT | 1.0 | 0.0 | 0.0 |
| 6 | LBDSCHSI | LBDSCHSI | 1.0 | 0.0 | 0.0 |
| 7 | LBXSCH | LBXSCH | 1.0 | 0.0 | 0.0 |
| 8 | LBXTC | LBXTC | 1.0 | 0.0 | 0.0 |
| 9 | LBDTCSI | LBDTCSI | 1.0 | 0.0 | 0.0 |


In [337]:
sig_cont_vars = []
remaining_vars = correlated_groups.feature1.unique()

while(len(remaining_vars) > 0):
    feature = remaining_vars[0]
    correlated_block = correlated_groups[correlated_groups.feature1 == feature]
    min_ind = correlated_block[['MissingF2']].idxmin() 
    sel_var = correlated_block.feature2[min_ind].values[0]
    removed_vars = [var for var in list(correlated_block.feature2.values)]
    remaining_vars = [var for var in remaining_vars if var not in removed_vars]
    if sel_var not in sig_cont_vars:
        sig_cont_vars = sig_cont_vars + [sel_var]    
    
print(sig_cont_vars)
len(sig_cont_vars)

['LBDSALSI', 'LBDSIRSI', 'BPXSY3', 'BPXDI3', 'LBXMOPCT', 'LBDSCHSI', 'LBXPLTSI', 'BMXLEG', 'LBDMONO', 'BMXWAIST', 'LBXSCR', 'LBXSGTSI', 'LBXSLDSI', 'INDFMPIR', 'LBXSKSI', 'LBDHDD', 'BMXHT', 'LBXSNASI', 'BPACSZ', 'LBXSC3SI', 'LBXSTB', 'LBXMC', 'LBXBAPCT', 'LBXHGB', 'BMXARMC', 'LBDNENO', 'BMXBMI', 'LBXRBCSI', 'BPXSY1', 'LBXSAPSI', 'LBXRDW', 'LBXLYPCT', 'LBXSOSSI', 'LBDSUASI', 'LBDSGLSI', 'LBDLYMNO', 'LBDSBUSI', 'LBDEONO', 'DMDHHSIZ', 'LBDSTPSI', 'LBDSTRSI', 'BPXDI1', 'LBXEOPCT', 'BPXDI2', 'URXUMS', 'BMXARML', 'BPXML1', 'LBDBANO', 'RIDAGEYR']


49
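
The block-wise selection above keeps, for each group of correlated features, the one with the least missing data. A simpler and commonly used alternative is a greedy filter on the upper triangle of the correlation matrix. The sketch below shows that alternative on toy data. The function name `drop_highly_correlated` and the toy columns are hypothetical, and the rule (keep the first feature of each correlated pair) is not the same as the author's.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(frame, threshold=0.90):
    """Hypothetical helper: keep the first feature of each highly correlated pair."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so self-pairs and mirrored pairs are ignored.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return [col for col in frame.columns if col not in to_drop]

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Toy frame: two columns carry the same quantity in different units (assumption).
toy = pd.DataFrame({
    "LBXSAL": base,
    "LBDSALSI": base * 10.0,
    "other": rng.normal(size=200),
})

kept = drop_highly_correlated(toy, threshold=0.90)
print(kept)
```

Pairs such as `LBXSAL` and `LBDSALSI`, which have a correlation of 1.0 in the table above, are the kind of redundancy either approach removes.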

In [338]:
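# This split still uses num_vars, so the metrics below match the earlier cells;
# use df[sig_cont_vars] to evaluate the de-correlated feature set instead.
X_train, X_test, y_train, y_test = train_test_split(df[num_vars], df['Class'], test_size=0.25, random_state=1)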
    
NaiveBayesPrediction(X_train, y_train, X_test, y_test)

GaussianNB
Misclassified samples: 165
Accuracy: 0.77
Recall score :  0.7662889518413598
Precision score :  0.7662889518413598
F1 score :  0.7662889518413598
              precision    recall  f1-score   support

           0       0.90      0.83      0.86       617
           1       0.22      0.34      0.27        89

    accuracy                           0.77       706
   macro avg       0.56      0.58      0.56       706
weighted avg       0.81      0.77      0.79       706



In [339]:
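from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()

# random_state only applies when shuffle=True, so it is omitted for unshuffled folds
kfold = KFold(n_splits = 10)
scores = cross_val_score(model,X_train,y_train,cv=kfold,scoring='recall_macro')
print(scores)

# scoring='recall_macro', so this reports macro-averaged recall rather than accuracy
print("Macro recall: %0.2f%% (+/- %0.2f%%)" % (100*scores.mean(), 100*scores.std()))

[0.61174242 0.58027923 0.53978331 0.60777548 0.58396595 0.59908377
 0.52814136 0.64208103 0.62198465 0.53941068]
Macro recall: 58.54% (+/- 3.67%)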


In [340]:
from mixed_naive_bayes import MixedNB

X_train, X_test, y_train, y_test = train_test_split(df[num_vars], df['Class'], test_size=0.25, random_state=1)

def MixedBayesPrediction(X_train, y_train, X_test, y_test):
    """Fit a MixedNB classifier and print test-set metrics."""
    # No categorical_features are passed, so per the package documentation every
    # column should be modelled as Gaussian, which would explain the scores
    # matching GaussianNB.
    clf = MixedNB()
    clf.fit(X_train, y_train)
    
    y_pred_gnb = clf.predict(X_test)
    y_prob_pred_gnb = clf.predict_proba(X_test)
    # how did our model perform?
    count_misclassified = (y_test != y_pred_gnb).sum()
    
    print("MixedNB")
    print("=" * 30)
    print('Misclassified samples: {}'.format(count_misclassified))
    accuracy = accuracy_score(y_test, y_pred_gnb)
    print('Accuracy: {:.2f}'.format(accuracy))
    
    # With average='micro' these three scores all equal the accuracy.
    print("Recall score : ", recall_score(y_test, y_pred_gnb , average='micro'))
    print("Precision score : ",precision_score(y_test, y_pred_gnb , average='micro'))
    print("F1 score : ",f1_score(y_test, y_pred_gnb , average='micro'))
    
    print(classification_report(y_test, y_pred_gnb))
    
MixedBayesPrediction(X_train, y_train, X_test, y_test)

MixedNB
Misclassified samples: 165
Accuracy: 0.77
Recall score :  0.7662889518413598
Precision score :  0.7662889518413598
F1 score :  0.7662889518413598
              precision    recall  f1-score   support

           0       0.90      0.83      0.86       617
           1       0.22      0.34      0.27        89

    accuracy                           0.77       706
   macro avg       0.56      0.58      0.56       706
weighted avg       0.81      0.77      0.79       706

