# Risk Factor prediction of Chronic Kidney Disease Data Set

Dataset link - https://archive.ics.uci.edu/ml/datasets/Risk+Factor+prediction+of+Chronic+Kidney+Disease#

I selected this dataset from the UCI Machine Learning Repository. It contains patient records collected for chronic kidney disease risk prediction.
The dataset has the following columns:
1. bp (Diastolic)
2. bp limit
3. sg
4. al
5. class
6. rbc
7. su
8. pc
9. pcc
10. ba
11. bgr
12. bu
13. sod
14. sc
15. pot
16. hemo
17. pcv
18. rbcc
19. wbcc
20. htn
21. dm
22. cad
23. appet
24. pe
25. ane
26. grf
27. stage
28. affected
29. age

For this dataset, I first preprocessed the data because it is noisy, then split it into training and test sets and addressed the class imbalance in the training set. Finally, I fit the ML models on the resulting data.

Since this is a classification problem, I chose Logistic Regression, Decision Tree, Random Forest, AdaBoost and Gradient Boosting models, and also fit neural networks (scikit-learn's MLPClassifier and Keras).

**Importing Packages**

In [8]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score

In [9]:
np.random.seed(1)

**Loading data**

In [10]:
ckd=pd.read_csv("C:/Users/prath/Downloads/ckd-dataset-v2.csv")

**Exploring data**

In [11]:
ckd.head(3)

Unnamed: 0,bp (Diastolic),bp limit,sg,al,class,rbc,su,pc,pcc,ba,...,htn,dm,cad,appet,pe,ane,grf,stage,affected,age
0,0,0,1.019 - 1.021,1-Jan,ckd,0,< 0,0,0,0,...,0,0,0,0,0,0,≥ 227.944,s1,1,< 12
1,0,0,1.009 - 1.011,< 0,ckd,0,< 0,0,0,0,...,0,0,0,0,0,0,≥ 227.944,s1,1,< 12
2,0,0,1.009 - 1.011,≥ 4,ckd,1,< 0,1,0,1,...,0,0,0,1,0,0,127.281 - 152.446,s1,1,< 12


In [12]:
ckd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 29 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   bp (Diastolic)  200 non-null    int64 
 1   bp limit        200 non-null    int64 
 2   sg              200 non-null    object
 3   al              200 non-null    object
 4   class           200 non-null    object
 5   rbc             200 non-null    int64 
 6   su              200 non-null    object
 7   pc              200 non-null    int64 
 8   pcc             200 non-null    int64 
 9   ba              200 non-null    int64 
 10  bgr             200 non-null    object
 11  bu              200 non-null    object
 12  sod             200 non-null    object
 13  sc              200 non-null    object
 14  pot             200 non-null    object
 15  hemo            200 non-null    object
 16  pcv             200 non-null    object
 17  rbcc            200 non-null    object
 18  wbcc      

In [13]:
ckd.describe()

Unnamed: 0,bp (Diastolic),bp limit,rbc,pc,pcc,ba,htn,dm,cad,appet,pe,ane,affected
count,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0,200.0
mean,0.54,0.755,0.125,0.225,0.135,0.055,0.39,0.35,0.11,0.2,0.175,0.16,0.64
std,0.499648,0.805119,0.331549,0.41863,0.342581,0.228552,0.488974,0.478167,0.313675,0.401004,0.380921,0.367526,0.481205
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
max,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Checking for Null Value columns

In [14]:
ckd.isna().sum()

bp (Diastolic)    0
bp limit          0
sg                0
al                0
class             0
rbc               0
su                0
pc                0
pcc               0
ba                0
bgr               0
bu                0
sod               0
sc                0
pot               0
hemo              0
pcv               0
rbcc              0
wbcc              0
htn               0
dm                0
cad               0
appet             0
pe                0
ane               0
grf               0
stage             0
affected          0
age               0
dtype: int64

List of Categorical Variables

In [15]:
category_var_list = list(ckd.select_dtypes(include='object').columns)
category_var_list

['sg',
 'al',
 'class',
 'su',
 'bgr',
 'bu',
 'sod',
 'sc',
 'pot',
 'hemo',
 'pcv',
 'rbcc',
 'wbcc',
 'grf',
 'stage',
 'age']

In [16]:
#to get the unique values in each column
for cat in category_var_list: 
    print(f"Category: {cat} Values: {ckd[cat].unique()}")

Category: sg Values: ['1.019 - 1.021' '1.009 - 1.011' '1.015 - 1.017' '≥ 1.023' '< 1.007']
Category: al Values: ['1-Jan' '< 0' '≥ 4' '3-Mar' '2-Feb']
Category: class Values: ['ckd' 'notckd']
Category: su Values: ['< 0' '4-Apr' '2-Feb' '4-Mar' '2-Jan' '≥ 4']
Category: bgr Values: ['< 112' '112 - 154' '154 - 196' '406 - 448' '238 - 280' '196 - 238'
 '≥ 448' '280 - 322' '364 - 406' '322 - 364']
Category: bu Values: ['< 48.1' '48.1 - 86.2' '200.5 - 238.6' '124.3 - 162.4' '86.2 - 124.3'
 '162.4 - 200.5' '≥ 352.9' '238.6 - 276.7']
Category: sod Values: ['138 - 143' '133 - 138' '123 - 128' '143 - 148' '148 - 153' '< 118'
 '128 - 133' '118 - 123' '≥ 158']
Category: sc Values: ['< 3.65' '3.65 - 6.8' '16.25 - 19.4' '6.8 - 9.95' '13.1 - 16.25'
 '9.95 - 13.1' '≥ 28.85']
Category: pot Values: ['< 7.31' '≥ 42.59' '7.31 - 11.72' '38.18 - 42.59']
Category: hemo Values: ['11.3 - 12.6' '8.7 - 10' '13.9 - 15.2' '≥ 16.5' '10 - 11.3' '7.4 - 8.7'
 '12.6 - 13.9' '15.2 - 16.5' '< 6.1' '6.1 - 7.4']
Category: p

Some columns contain values that do not fit the expected range format (for example '1-Jan' in al or '20-Dec' in age), so these are replaced with NaN. The class column is dropped because it is directly correlated with the target variable 'affected'.
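
Values such as '1-Jan' look like range labels that a spreadsheet turned into dates; that is an assumption, since the source file does not say. A minimal sketch for listing such tokens in every column, using a hypothetical helper `find_date_like` and toy data rather than the real file:

```python
import re
import pandas as pd

# Month abbreviations that appear when a label such as "1 - 1" is auto-converted to a date.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
DATE_LIKE = re.compile(rf"^\d{{1,2}}-({MONTHS})$")

def find_date_like(df):
    """Return {column: sorted list of date-like values} (hypothetical helper)."""
    found = {}
    for col in df.columns:
        values = df[col].dropna().unique()
        hits = sorted(str(v) for v in values if DATE_LIKE.match(str(v)))
        if hits:
            found[col] = hits
    return found

# Toy data that mimics two of the columns printed above.
toy = pd.DataFrame({
    "al": ["1-Jan", "< 0", "≥ 4", "3-Mar"],
    "sg": ["1.019 - 1.021", "< 1.007", "≥ 1.023", "1.009 - 1.011"],
})
date_like = find_date_like(toy)
print(date_like)
```

The resulting lists could then be passed to `replace(..., np.nan)` column by column instead of typing each token by hand.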

In [17]:
ckd.drop(['class'], axis=1, inplace = True)

In [18]:
ckd['al']=ckd.al.replace(['1-Jan','3-Mar','2-Feb'],np.nan)

In [19]:
ckd['su']=ckd.su.replace(['4-Apr','2-Feb','4-Mar','2-Jan'],np.nan)

In [20]:
ckd['grf']=ckd.grf.replace([' p '],np.nan)

In [21]:
ckd['age']=ckd.age.replace(['20-Dec'],np.nan)

In [22]:
#checking for nan values
ckd.isna().sum()

bp (Diastolic)     0
bp limit           0
sg                 0
al                71
rbc                0
su                29
pc                 0
pcc                0
ba                 0
bgr                0
bu                 0
sod                0
sc                 0
pot                0
hemo               0
pcv                0
rbcc               0
wbcc               0
htn                0
dm                 0
cad                0
appet              0
pe                 0
ane                0
grf                1
stage              0
affected           0
age                4
dtype: int64

In [23]:
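#replacing the NaN values with the mode
#mode() returns a Series, so take its first value; passing the Series itself would only fill row 0
ckd['al']=ckd.al.fillna(ckd.al.mode()[0])
ckd['su']=ckd.su.fillna(ckd.su.mode()[0])
ckd['grf']=ckd.grf.fillna(ckd.grf.mode()[0])
ckd['age']=ckd.age.fillna(ckd.age.mode()[0])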
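
`Series.mode()` returns a Series, and `fillna` aligns a Series argument on the index, so it matters whether the whole Series or its first value is passed. A small sketch on toy data (not the CKD columns) showing the difference:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", np.nan, "b", np.nan, "a"])

# mode() returns a Series; index 0 holds the most frequent value.
mode_series = s.mode()

# Passing that Series aligns on index, so only a NaN at index 0 could be filled.
filled_by_series = s.fillna(mode_series)

# Passing the scalar fills every missing entry.
filled_by_scalar = s.fillna(mode_series.iloc[0])

remaining_series = int(filled_by_series.isna().sum())
remaining_scalar = int(filled_by_scalar.isna().sum())
print(remaining_series, remaining_scalar)
```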

In [18]:
#Label encoding the categorical columns since the models can only work with numerical values

In [24]:
labelencoder = LabelEncoder()
# encode every remaining text column; the encoder is refit for each column
encoded_cols = ['sg', 'al', 'su', 'bgr', 'bu', 'sod', 'sc', 'pot', 'hemo',
                'pcv', 'rbcc', 'wbcc', 'grf', 'stage', 'age']
for col in encoded_cols:
    ckd[col] = labelencoder.fit_transform(ckd[col])
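
LabelEncoder numbers classes in sorted label order, which for range labels is not the same as numeric order. The sketch below uses a few of the bgr bins as toy data and contrasts that with an explicit ordering through `pd.Categorical`; the ordered alternative is an illustration, not something the notebook does:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

bins = pd.Series(["< 112", "112 - 154", "154 - 196", "≥ 448"])

# LabelEncoder assigns codes by sorted label text, not by the size of the range.
le_codes = [int(c) for c in LabelEncoder().fit_transform(bins)]

# An explicit order keeps the codes aligned with the magnitude of each bin.
order = ["< 112", "112 - 154", "154 - 196", "≥ 448"]
ordinal_codes = [int(c) for c in pd.Categorical(bins, categories=order, ordered=True).codes]

print(le_codes)       # the lowest bin "< 112" does not get code 0
print(ordinal_codes)  # codes follow the listed order
```

Tree-based models are fairly tolerant of arbitrary codes, while linear models and neural networks read the codes as magnitudes, so the ordering may matter more for them.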

In [25]:
ckd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 28 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   bp (Diastolic)  200 non-null    int64
 1   bp limit        200 non-null    int64
 2   sg              200 non-null    int32
 3   al              200 non-null    int32
 4   rbc             200 non-null    int64
 5   su              200 non-null    int32
 6   pc              200 non-null    int64
 7   pcc             200 non-null    int64
 8   ba              200 non-null    int64
 9   bgr             200 non-null    int32
 10  bu              200 non-null    int32
 11  sod             200 non-null    int32
 12  sc              200 non-null    int32
 13  pot             200 non-null    int32
 14  hemo            200 non-null    int32
 15  pcv             200 non-null    int32
 16  rbcc            200 non-null    int32
 17  wbcc            200 non-null    int32
 18  htn             200 non-null  

**Split Data**

In [26]:
# split the data into training (70%) and test (30%) sets
train_df, test_df = train_test_split(ckd, test_size=0.3)
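
With only 200 rows, the class ratio can drift between the two parts of a plain random split. A sketch of a stratified, reproducible split on toy data; the `stratify` and `random_state` arguments are additions for illustration and are not used in the notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with the same 64% / 36% class ratio as 'affected' in this dataset.
toy = pd.DataFrame({"x": range(100), "affected": [1] * 64 + [0] * 36})

train_part, test_part = train_test_split(
    toy,
    test_size=0.3,
    stratify=toy["affected"],  # keep the class ratio in both parts
    random_state=1,            # make the split repeatable
)

train_ratio = train_part["affected"].mean()
test_ratio = test_part["affected"].mean()
print(len(train_part), len(test_part), train_ratio, test_ratio)
```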

In [27]:
train_df.affected.value_counts()

affected
1    91
0    49
Name: count, dtype: int64

Addressing Data imbalance

In [28]:
#the classes are imbalanced (91 vs 49 rows in the training set), so oversample the minority class in the training data only
cls0 = train_df[train_df['affected']==0]
cls1 = train_df[train_df['affected']==1]
from sklearn.utils import resample
train_df_minority_resampled = resample(cls0, 
                                 replace=True,     
                                 n_samples=91,    
                                 random_state=123)

In [29]:
print(cls1.shape,train_df_minority_resampled.shape)

(91, 28) (91, 28)


In [30]:
train_df=pd.concat([cls1,train_df_minority_resampled])
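
The oversampling step can also be written without hard-coding the class sizes. A minimal sketch on toy data, where `oversample_minority` is a hypothetical helper name:

```python
import pandas as pd
from sklearn.utils import resample

def oversample_minority(df, target, random_state=123):
    """Upsample every smaller class (with replacement) to the size of the largest class."""
    counts = df[target].value_counts()
    largest = int(counts.max())
    parts = []
    for label, n in counts.items():
        part = df[df[target] == label]
        if n < largest:
            part = resample(part, replace=True, n_samples=largest,
                            random_state=random_state)
        parts.append(part)
    return pd.concat(parts)

toy = pd.DataFrame({"x": range(10), "affected": [1] * 7 + [0] * 3})
balanced = oversample_minority(toy, "affected")
balanced_counts = balanced["affected"].value_counts().to_dict()
print(balanced_counts)
```

Applying this only to the training frame keeps duplicated rows out of the test set.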

In [31]:

# to reduce repetition in later code, create variables to represent the columns
# that are our predictors and target
target = 'affected'
predictors = list(ckd.columns)
predictors.remove(target)

In [32]:
train_X=train_df[predictors]
train_y = train_df[target] 
test_X = test_df[predictors]
test_y = test_df[target]

In [28]:
performance = pd.DataFrame({"model": [], "Accuracy": [], "Precision": [], "Recall": [], "F1": []})
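
The cells that follow compute the four metrics by hand from the confusion matrix each time. As an alternative sketch, the same row can be built with scikit-learn's metric functions; `score_row` is a hypothetical helper and the labels below are toy values:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def score_row(name, y_true, y_pred):
    """One-row DataFrame with the same columns as `performance`."""
    return pd.DataFrame({
        "model": [name],
        "Accuracy": [accuracy_score(y_true, y_pred)],
        "Precision": [precision_score(y_true, y_pred)],
        "Recall": [recall_score(y_true, y_pred)],
        "F1": [f1_score(y_true, y_pred)],
    })

# Toy labels: TP=3, FN=1, FP=1, TN=1
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
row = score_row("toy model", y_true, y_pred)
print(row)
```

A row built this way could be appended with `pd.concat([performance, row], ignore_index=True)`, which also avoids every row carrying index 0.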

**Fitting Logistic Regression**

In [132]:
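# no regularization; penalty=None replaces the string 'none', which newer scikit-learn versions no longer accept
log_reg_model = LogisticRegression(penalty=None, max_iter=900)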
_ = log_reg_model.fit(train_X, np.ravel(train_y))
model_preds = log_reg_model.predict(test_X)
c_matrix = confusion_matrix(test_y, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"default logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

In [133]:
#Liblinear Solver
log_reg_liblin_model = LogisticRegression(solver='liblinear').fit(train_X, np.ravel(train_y))
model_preds = log_reg_liblin_model.predict(test_X)
c_matrix = confusion_matrix(test_y, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"liblinear logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

In [134]:
#L2 Regularization
log_reg_L2_model = LogisticRegression(penalty='l2', max_iter=1000)
_ = log_reg_L2_model.fit(train_X, np.ravel(train_y))
model_preds = log_reg_L2_model.predict(test_X)
c_matrix = confusion_matrix(test_y, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"L2 logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

In [135]:
#L1 Regularization
log_reg_L1_model = LogisticRegression(solver='liblinear', penalty='l1')
_ = log_reg_L1_model.fit(train_X, np.ravel(train_y))
model_preds = log_reg_L1_model.predict(test_X)
c_matrix = confusion_matrix(test_y, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"L1 logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

In [136]:
#ElasticNet Regularization
log_reg_elastic_model = LogisticRegression(solver='saga', penalty='elasticnet', l1_ratio=0.5, max_iter=1500)
_ = log_reg_elastic_model.fit(train_X, np.ravel(train_y))
model_preds = log_reg_elastic_model.predict(test_X)
c_matrix = confusion_matrix(test_y, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Elastic logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
import warnings
warnings.filterwarnings("ignore")

**Fitting Decision Tree Classifier**

In [137]:
Dt=DecisionTreeClassifier(max_depth=10)
Dt=Dt.fit(train_X,np.ravel(train_y))
model_preds=Dt.predict(test_X)
c_matrix = confusion_matrix(test_y, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Decision Tree Classifier", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

**Fitting Random Forest Classifier**

In [138]:
rforest = RandomForestClassifier()
_ = rforest.fit(train_X, train_y)
y_pred = rforest.predict(test_X)
c_matrix = confusion_matrix(test_y, y_pred)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Random Forest", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

**Fitting Ada Boost Classifier**

In [139]:
aboost = AdaBoostClassifier()
_ = aboost.fit(train_X, train_y)
y_pred = aboost.predict(test_X)
c_matrix = confusion_matrix(test_y, y_pred)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"AdaBoost Classifier", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

**Fitting Gradient Boost Classifier**

In [140]:
gboost = GradientBoostingClassifier()
_ = gboost.fit(train_X, train_y)
y_pred = gboost.predict(test_X)
c_matrix = confusion_matrix(test_y, y_pred)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Gradient Boost Classifier", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

**Fitting Neural Networks**

In [141]:
ann = MLPClassifier(hidden_layer_sizes=(60,50,40), solver='adam', max_iter=200)
_ = ann.fit(train_X, train_y)
y_pred=ann.predict(test_X)
c_matrix = confusion_matrix(test_y, y_pred)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Neural Networks", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

**Neural Networks using RandomizedSearchCV**

In [142]:
score_measure = "recall"
kfolds = 5

param_grid = {
    'hidden_layer_sizes': [ (50,), (70,),(60,40), (40,20), (60,50,40), (80,60,40)],
    'activation': ['logistic', 'tanh', 'relu'],
    'solver': ['adam', 'sgd'],
    'alpha': [0, .2, .5, .7, 1],
    'learning_rate': ['constant', 'invscaling', 'adaptive'],
    'learning_rate_init': [0.001, 0.01, 0.1, 0.2, 0.5],
    'max_iter': [5000]
}

ann = MLPClassifier()
rand_search = RandomizedSearchCV(estimator = ann, param_distributions=param_grid, cv=kfolds, n_iter=100,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = rand_search.fit(train_X, train_y)

best_recall_mlp_random = rand_search.best_estimator_

print(rand_search.best_params_)

Fitting 5 folds for each of 100 candidates, totalling 500 fits
{'solver': 'adam', 'max_iter': 5000, 'learning_rate_init': 0.01, 'learning_rate': 'adaptive', 'hidden_layer_sizes': (40, 20), 'alpha': 0, 'activation': 'relu'}


In [143]:
c_matrix = confusion_matrix(test_y, rand_search.predict(test_X))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Neural Networks Random Search", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

**Neural Networks using GridSearchCV**

In [144]:
score_measure = "recall"
kfolds = 5

param_grid = {
    'hidden_layer_sizes': [ (20,), (40,), (70,), (90,)],
    'activation': ['logistic', 'relu'],
    'solver': ['adam'],
    'alpha': [.1, .2, 1],
    'learning_rate': ['adaptive', 'constant'],
    'learning_rate_init': [ 0.01,0.2, 0.15],
    'max_iter': [5000]
}

ann = MLPClassifier()
grid_search = GridSearchCV(estimator = ann, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(train_X, train_y)

best_recall_mlp_grid = grid_search.best_estimator_

print(grid_search.best_params_)

Fitting 5 folds for each of 144 candidates, totalling 720 fits
{'activation': 'relu', 'alpha': 0.1, 'hidden_layer_sizes': (20,), 'learning_rate': 'constant', 'learning_rate_init': 0.01, 'max_iter': 5000, 'solver': 'adam'}
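
The candidate counts in the search logs can be checked by counting the combinations in each grid. A short sketch with scikit-learn's `ParameterGrid`, with the two grids copied from the cells above:

```python
from sklearn.model_selection import ParameterGrid

grid_search_space = {
    'hidden_layer_sizes': [(20,), (40,), (70,), (90,)],
    'activation': ['logistic', 'relu'],
    'solver': ['adam'],
    'alpha': [.1, .2, 1],
    'learning_rate': ['adaptive', 'constant'],
    'learning_rate_init': [0.01, 0.2, 0.15],
    'max_iter': [5000],
}
random_search_space = {
    'hidden_layer_sizes': [(50,), (70,), (60, 40), (40, 20), (60, 50, 40), (80, 60, 40)],
    'activation': ['logistic', 'tanh', 'relu'],
    'solver': ['adam', 'sgd'],
    'alpha': [0, .2, .5, .7, 1],
    'learning_rate': ['constant', 'invscaling', 'adaptive'],
    'learning_rate_init': [0.001, 0.01, 0.1, 0.2, 0.5],
    'max_iter': [5000],
}

kfolds = 5
n_grid = len(ParameterGrid(grid_search_space))      # every combination is fitted
n_random = len(ParameterGrid(random_search_space))  # only n_iter=100 of these are sampled
print(n_grid, n_grid * kfolds, n_random)
```

So the grid search covers its whole space, while the randomized search tries 100 of the combinations in its larger space.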


In [145]:
c_matrix = confusion_matrix(test_y, grid_search.predict(test_X))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Neural Networks Grid Search", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

In [146]:
performance.sort_values(by='Recall',ascending=False)

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,L1 logistic,1.0,1.0,1.0,1.0
0,default logistic,0.983333,1.0,0.972973,0.986301
0,liblinear logistic,0.983333,1.0,0.972973,0.986301
0,Elastic logistic,0.983333,1.0,0.972973,0.986301
0,Neural Networks,0.983333,1.0,0.972973,0.986301
0,Neural Networks Grid Search,0.983333,1.0,0.972973,0.986301
0,L2 logistic,0.966667,1.0,0.945946,0.972222
0,Neural Networks Random Search,0.966667,1.0,0.945946,0.972222
0,Decision Tree Classifier,0.9,0.942857,0.891892,0.916667
0,Random Forest,0.9,0.942857,0.891892,0.916667


To identify which model is better for this dataset, Recall is used as the measure. This is a health-related dataset, so False Negatives are the riskiest error: predicting affected as 0 when it is actually 1 means a patient with the disease may go untreated.
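
As a worked check, the 0.972973 Recall in the table is consistent with 36 of the 37 affected test patients being detected. The 37 affected and 23 unaffected test rows come from the classification report further down; the individual counts are inferred from the reported scores and are not printed by the notebook:

$$
\text{Recall} = \frac{TP}{TP + FN} = \frac{36}{36 + 1} \approx 0.973, \qquad
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{36 + 23}{60} \approx 0.983, \qquad
F_1 = \frac{2\,TP}{2\,TP + FP + FN} = \frac{72}{72 + 0 + 1} \approx 0.986
$$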

The table above shows the models sorted by Recall. L1 logistic fit the data best (Recall 1.0), followed by the default, liblinear and Elastic Net logistic models and two of the neural networks (all 0.973). Of the three MLPClassifier networks, the default network and the grid-search network performed better than the random-search network (0.946).


In [29]:
ckd.keys()

Index(['bp (Diastolic)', 'bp limit', 'sg', 'al', 'rbc', 'su', 'pc', 'pcc',
       'ba', 'bgr', 'bu', 'sod', 'sc', 'pot', 'hemo', 'pcv', 'rbcc', 'wbcc',
       'htn', 'dm', 'cad', 'appet', 'pe', 'ane', 'grf', 'stage', 'affected',
       'age'],
      dtype='object')

In [32]:
ckd.shape

(200, 28)

**Wide Network using Keras Network**

In [44]:
import os
import tensorflow as tf
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

In [45]:
from tensorflow import keras

# fix random seed for reproducibility
tf.random.set_seed(1)


In [46]:
%%time

# create model structure
model = keras.models.Sequential()
model.add(keras.layers.Input(shape=(27,)))  # 27 predictor columns
model.add(keras.layers.Dense(256, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid')) # final layer, 1 output unit since it's a binary classification




CPU times: total: 0 ns
Wall time: 56.3 ms


In [47]:
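from keras import backend as K

def recall_m(y_true, y_pred):
    # Keep the real labels. Replacing y_true with K.ones_like(y_true) turns this into
    # the share of samples predicted positive, which is how the recall_m values
    # printed in the outputs below were computed.
    y_true = K.cast(y_true, K.floatx())
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall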
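
To see how much a recall metric depends on using the real labels, this NumPy sketch (toy values, plain NumPy instead of the Keras backend) compares recall with the value obtained when the labels are replaced by ones:

```python
import numpy as np

def recall_np(y_true, y_prob):
    """Recall from the real labels: true positives / actual positives."""
    y_pred = np.round(y_prob)
    return np.sum(y_true * y_pred) / np.sum(y_true)

def predicted_positive_share(y_true, y_prob):
    """Same arithmetic after replacing the labels with ones."""
    ones = np.ones_like(y_true)
    y_pred = np.round(y_prob)
    return np.sum(ones * y_pred) / np.sum(ones)

y_true = np.array([1.0, 1.0, 1.0, 0.0, 0.0])
y_prob = np.array([0.9, 0.8, 0.2, 0.1, 0.7])

true_recall = recall_np(y_true, y_prob)               # 2 of 3 positives found
share = predicted_positive_share(y_true, y_prob)      # 3 of 5 samples predicted positive
print(true_recall, share)
```

The second number depends only on how many samples are predicted positive, so it cannot be read as recall.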

In [48]:
# compile
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=[recall_m])

#binary_crossentropy is used as the loss because this is a binary classification problem
#the metric does not affect fitting; it is only computed and reported alongside the loss,
#and recall is tracked because it is the measure used to pick the best model

In [49]:
%%time

# fit the model
history = model.fit(train_X, train_y, validation_data=(test_X, test_y), 
                    epochs=50)

Epoch 1/50
...
Epoch 50/50
CPU times: total: 484 ms
Wall time: 3.18 s


In [50]:
scores = model.evaluate(test_X, test_y, verbose=0)
scores

[0.1270373910665512, 0.5848214626312256]

In [51]:
print("%s: %.2f" % (model.metrics_names[0], scores[0]))
print("%s: %.2f" % (model.metrics_names[1], scores[1]))

loss: 0.13
recall_m: 0.58


**Deep Network using Keras**

In [52]:
model = keras.models.Sequential()

model.add(keras.layers.Input(shape=(27,)))
model.add(keras.layers.Dense(100, activation='relu'))
model.add(keras.layers.Dense(100, activation='relu'))
model.add(keras.layers.Dense(100, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))

In [53]:
# Compile model

#Optimizer:
# note: this Adam optimizer is created but not used; compile() below is given 'sgd'
adam = keras.optimizers.Adam(learning_rate=0.01)
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=[recall_m])

In [54]:
# Fit the model

history = model.fit(train_X, train_y, 
                    validation_data=(test_X, test_y), 
                    epochs=60)

Epoch 1/60
...
Epoch 60/60


In [55]:

scores = model.evaluate(test_X, test_y, verbose=0)
print("%s: %.2f" % (model.metrics_names[0], scores[0]))
print("%s: %.2f" % (model.metrics_names[1], scores[1]))

loss: 0.08
recall_m: 0.62


**RandomizedSearchCV**
*(Keras with Sklearn tuning)*

In [56]:
%%time
from scikeras.wrappers import KerasClassifier

score_measure = recall_m
kfolds = 5

def build_clf(hidden_layer_sizes, dropout):
    # one Dense + Dropout pair per entry in hidden_layer_sizes, then a sigmoid output
    model = keras.models.Sequential()
    model.add(keras.layers.Input(shape=(27,)))
    for hidden_layer_size in hidden_layer_sizes:
        model.add(keras.layers.Dense(
            hidden_layer_size,
            kernel_initializer=keras.initializers.GlorotUniform(),
            bias_initializer=keras.initializers.RandomNormal(mean=0.0, stddev=0.05, seed=None),
            activation="relu"))
        model.add(keras.layers.Dropout(dropout))
    model.add(keras.layers.Dense(1, activation='sigmoid'))
    # compile() gets no optimizer argument here, so Keras uses its default (rmsprop)
    model.compile(loss='binary_crossentropy', metrics=[recall_m])
    return model

CPU times: total: 0 ns
Wall time: 0 ns


In [57]:
from scikeras.wrappers import KerasClassifier

keras_clf = KerasClassifier(
    model=build_clf,
    hidden_layer_sizes=(150,),  # a tuple, because build_clf iterates over the layer sizes
    dropout = 0.0
)


In [58]:
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import RandomizedSearchCV

params = {
    'optimizer__learning_rate': [0.0005, 0.005, 0.01],
    'model__hidden_layer_sizes': [(70,),(90, ), (100,), (100, 90)],
    'model__dropout': [0, 0.1],
    'batch_size':[20, 50, 100],
    'epochs':[10, 50, 100],
    'optimizer':["adam","sgd"]
}
keras_clf.get_params().keys()

dict_keys(['model', 'build_fn', 'warm_start', 'random_state', 'optimizer', 'loss', 'metrics', 'batch_size', 'validation_batch_size', 'verbose', 'callbacks', 'validation_split', 'shuffle', 'run_eagerly', 'epochs', 'hidden_layer_sizes', 'dropout', 'class_weight'])

In [59]:
rnd_search_cv = RandomizedSearchCV(estimator=keras_clf, param_distributions=params, scoring='recall', n_iter=50, cv=5)

import sys
sys.setrecursionlimit(10000)

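# note: the metric is logged as 'recall_m' and fit() receives no validation data,
# so 'val_recall' is never available and this callback does not stop training early
earlystop = EarlyStopping(monitor='val_recall', patience=5, verbose=0, mode='auto')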
callback = [earlystop]

_ = rnd_search_cv.fit(train_X, train_y, callbacks=callback, verbose=0)

import warnings
warnings.filterwarnings('ignore')



In [60]:
rnd_search_cv.best_params_

{'optimizer__learning_rate': 0.0005,
 'optimizer': 'sgd',
 'model__hidden_layer_sizes': (90,),
 'model__dropout': 0.1,
 'epochs': 100,
 'batch_size': 100}

In [61]:
best_net = rnd_search_cv.best_estimator_
print(rnd_search_cv.best_params_)

{'optimizer__learning_rate': 0.0005, 'optimizer': 'sgd', 'model__hidden_layer_sizes': (90,), 'model__dropout': 0.1, 'epochs': 100, 'batch_size': 100}


In [62]:
%%time
y_pred = best_net.predict(test_X)
print(classification_report(test_y, y_pred))

              precision    recall  f1-score   support

           0       0.96      1.00      0.98        23
           1       1.00      0.97      0.99        37

    accuracy                           0.98        60
   macro avg       0.98      0.99      0.98        60
weighted avg       0.98      0.98      0.98        60

CPU times: total: 0 ns
Wall time: 96.5 ms


In [65]:
recall_score(test_y, best_net.predict(test_X))                                     



0.972972972972973

| NN Model | Recall |
|---|---|
| Wide NN | 0.58 |
| Deep NN | 0.62 |
| NN using RandomizedSearchCV | 0.97 |
As part of Assignment 1, I fit various classification models on the Chronic Kidney Disease dataset. Because this is a health-related dataset, I used Recall as the evaluation metric to find the best-fitting model. L1 Logistic has a Recall of 1 and fit the data better than the remaining models. Among the neural networks (MLPClassifier, the Wide and Deep Keras networks, and the Keras network tuned with RandomizedSearchCV), the tuned Keras network and the MLPClassifier reached the same Recall (0.97) and come next after L1 Logistic, while the Wide and Deep Keras networks reported lower recall_m values (0.58 and 0.62).