# Risk Factor prediction of Chronic Kidney Disease Data Set

Dataset link - https://archive.ics.uci.edu/ml/datasets/Risk+Factor+prediction+of+Chronic+Kidney+Disease#

I selected this dataset from the UCI ML Repository. It contains patient data related to chronic kidney disease.
The dataset has the following columns:
1. bp(Diastolic)
2. bp limit
3. sg
4. al
5. class
6. rbc
7. su
8. pc
9. pcc
10. ba
11. bgr
12. bu
13. sod
14. sc
15. pot
16. hemo
17. pcv
18. rbcc
19. wbcc
20. htn
21. dm
22. cad
23. appet
24. pe
25. ane
26. grf
27. stage
28. affected
29. age

For this dataset, I first preprocessed the data because it is noisy, then split it into training and test sets and addressed the class imbalance in the training set. Finally, I fitted the ML models on the resulting dataset.

Since this is a classification problem, I chose Logistic Regression, Decision Tree, Random Forest, AdaBoost, and Gradient Boosting models.

**Importing Packages**

In [67]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier

In [68]:
np.random.seed(1)

**Loading data**

In [69]:
ckd=pd.read_csv("C:/Users/prath/Downloads/ckd-dataset-v2.csv")

**Exploring data**

In [70]:
ckd.head(3)

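| | bp (Diastolic) | bp limit | sg | al | class | rbc | su | pc | pcc | ba | ... | htn | dm | cad | appet | pe | ane | grf | stage | affected | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1.019 - 1.021 | 1-Jan | ckd | 0 | < 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | ≥ 227.944 | s1 | 1 | < 12 |
| 1 | 0 | 0 | 1.009 - 1.011 | < 0 | ckd | 0 | < 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | ≥ 227.944 | s1 | 1 | < 12 |
| 2 | 0 | 0 | 1.009 - 1.011 | ≥ 4 | ckd | 1 | < 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 127.281 - 152.446 | s1 | 1 | < 12 |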


In [71]:
ckd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 29 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   bp (Diastolic)  200 non-null    int64 
 1   bp limit        200 non-null    int64 
 2   sg              200 non-null    object
 3   al              200 non-null    object
 4   class           200 non-null    object
 5   rbc             200 non-null    int64 
 6   su              200 non-null    object
 7   pc              200 non-null    int64 
 8   pcc             200 non-null    int64 
 9   ba              200 non-null    int64 
 10  bgr             200 non-null    object
 11  bu              200 non-null    object
 12  sod             200 non-null    object
 13  sc              200 non-null    object
 14  pot             200 non-null    object
 15  hemo            200 non-null    object
 16  pcv             200 non-null    object
 17  rbcc            200 non-null    object
 18  wbcc      

In [72]:
ckd.describe()

| | bp (Diastolic) | bp limit | rbc | pc | pcc | ba | htn | dm | cad | appet | pe | ane | affected |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 | 200.0 |
| mean | 0.54 | 0.755 | 0.125 | 0.225 | 0.135 | 0.055 | 0.39 | 0.35 | 0.11 | 0.2 | 0.175 | 0.16 | 0.64 |
| std | 0.499648 | 0.805119 | 0.331549 | 0.41863 | 0.342581 | 0.228552 | 0.488974 | 0.478167 | 0.313675 | 0.401004 | 0.380921 | 0.367526 | 0.481205 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 75% | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| max | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


Checking for Null Value columns

In [73]:
ckd.isna().sum()

bp (Diastolic)    0
bp limit          0
sg                0
al                0
class             0
rbc               0
su                0
pc                0
pcc               0
ba                0
bgr               0
bu                0
sod               0
sc                0
pot               0
hemo              0
pcv               0
rbcc              0
wbcc              0
htn               0
dm                0
cad               0
appet             0
pe                0
ane               0
grf               0
stage             0
affected          0
age               0
dtype: int64

List of Categorical Variables

In [74]:
category_var_list = list(ckd.select_dtypes(include='object').columns)
category_var_list

['sg',
 'al',
 'class',
 'su',
 'bgr',
 'bu',
 'sod',
 'sc',
 'pot',
 'hemo',
 'pcv',
 'rbcc',
 'wbcc',
 'grf',
 'stage',
 'age']

In [75]:
#to get the unique values in each column
for cat in category_var_list: 
    print(f"Category: {cat} Values: {ckd[cat].unique()}")

Category: sg Values: ['1.019 - 1.021' '1.009 - 1.011' '1.015 - 1.017' '≥ 1.023' '< 1.007']
Category: al Values: ['1-Jan' '< 0' '≥ 4' '3-Mar' '2-Feb']
Category: class Values: ['ckd' 'notckd']
Category: su Values: ['< 0' '4-Apr' '2-Feb' '4-Mar' '2-Jan' '≥ 4']
Category: bgr Values: ['< 112' '112 - 154' '154 - 196' '406 - 448' '238 - 280' '196 - 238'
 '≥ 448' '280 - 322' '364 - 406' '322 - 364']
Category: bu Values: ['< 48.1' '48.1 - 86.2' '200.5 - 238.6' '124.3 - 162.4' '86.2 - 124.3'
 '162.4 - 200.5' '≥ 352.9' '238.6 - 276.7']
Category: sod Values: ['138 - 143' '133 - 138' '123 - 128' '143 - 148' '148 - 153' '< 118'
 '128 - 133' '118 - 123' '≥ 158']
Category: sc Values: ['< 3.65' '3.65 - 6.8' '16.25 - 19.4' '6.8 - 9.95' '13.1 - 16.25'
 '9.95 - 13.1' '≥ 28.85']
Category: pot Values: ['< 7.31' '≥ 42.59' '7.31 - 11.72' '38.18 - 42.59']
Category: hemo Values: ['11.3 - 12.6' '8.7 - 10' '13.9 - 15.2' '≥ 16.5' '10 - 11.3' '7.4 - 8.7'
 '12.6 - 13.9' '15.2 - 16.5' '< 6.1' '6.1 - 7.4']
Category: p

Some columns contain irrelevant values, so it is better to replace them with NaN. The 'class' column is also removed because it is directly correlated with the target variable 'affected'.

In [76]:
ckd.drop(['class'], axis=1, inplace = True)

In [77]:
ckd['al']=ckd.al.replace(['1-Jan','3-Mar','2-Feb'],np.nan)

In [78]:
ckd['su']=ckd.su.replace(['4-Apr','2-Feb','4-Mar','2-Jan'],np.nan)

In [79]:
ckd['grf']=ckd.grf.replace([' p '],np.nan)

In [80]:
ckd['age']=ckd.age.replace(['20-Dec'],np.nan)

In [81]:
#checking for nan values
ckd.isna().sum()

bp (Diastolic)     0
bp limit           0
sg                 0
al                71
rbc                0
su                29
pc                 0
pcc                0
ba                 0
bgr                0
bu                 0
sod                0
sc                 0
pot                0
hemo               0
pcv                0
rbcc               0
wbcc               0
htn                0
dm                 0
cad                0
appet              0
pe                 0
ane                0
grf                1
stage              0
affected           0
age                4
dtype: int64

In [82]:
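#replacing the NaN values with each column's mode
#mode() returns a Series, so take its first element to get a scalar fill value
ckd['al']=ckd.al.fillna(ckd.al.mode()[0])
ckd['su']=ckd.su.fillna(ckd.su.mode()[0])
ckd['grf']=ckd.grf.fillna(ckd.grf.mode()[0])
ckd['age']=ckd.age.fillna(ckd.age.mode()[0])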
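A small sketch of how mode imputation behaves in pandas may help here. It uses a toy Series with illustrative values (not the real dataset) and shows the difference between filling with a scalar mode and filling with the Series that `mode()` returns, which is aligned on index labels.

```python
import numpy as np
import pandas as pd

# Toy column standing in for one of the binned columns (values are illustrative).
s = pd.Series(["< 0", np.nan, "< 0", "≥ 4", np.nan])

# Series.mode() returns a Series (there can be ties), so take the first
# element to get a scalar that fillna applies to every missing row.
mode_value = s.mode()[0]
filled = s.fillna(mode_value)

# Passing the mode Series itself aligns on index labels, so only missing rows
# whose label also appears in the mode Series (here just label 0) are filled.
aligned = s.fillna(s.mode())

print("missing after scalar fill:", filled.isna().sum())
print("missing after Series fill:", aligned.isna().sum())
```

Checking `isna().sum()` again after imputation is a cheap way to confirm that no missing values remain before encoding.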

In [83]:
#Label encoding the categorical columns since the models can only work with numerical values

In [84]:
# Encode every remaining object-typed column ('class' has already been dropped)
for col in ckd.select_dtypes(include='object').columns:
    ckd[col] = LabelEncoder().fit_transform(ckd[col])
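`LabelEncoder` assigns integer codes in the sorted order of the string labels, so a bin such as `'< 112'` does not necessarily receive the lowest code. If the numeric order of the bins should be preserved, one option is to sort the labels by their lower bound before mapping them to integers. The following is a minimal sketch under that assumption; the helper names `bin_lower_bound` and `ordinal_encode` are hypothetical, and the sample values are a few of the `bgr` bins shown above.

```python
import re
import pandas as pd

def bin_lower_bound(label):
    """Hypothetical helper: numeric lower bound of a bin label such as '112 - 154'."""
    label = label.strip()
    if label.startswith("<"):
        # An open-ended lower bin sorts before every other bin.
        return float("-inf")
    # Otherwise use the first number in the label ('≥ 448' gives 448.0).
    return float(re.search(r"\d+(?:\.\d+)?", label).group())

def ordinal_encode(series):
    """Map bin labels to integers that follow the numeric order of the bins."""
    ordered = sorted(series.dropna().unique(), key=bin_lower_bound)
    mapping = {label: code for code, label in enumerate(ordered)}
    return series.map(mapping), mapping

bgr = pd.Series(["< 112", "112 - 154", "≥ 448", "406 - 448", "112 - 154"])
codes, mapping = ordinal_encode(bgr)
print(mapping)
print(codes.tolist())
```

This only makes sense for columns whose values are numeric ranges; tree-based models are largely insensitive to the choice, while linear models such as logistic regression treat the codes as magnitudes.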

In [85]:
ckd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 28 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   bp (Diastolic)  200 non-null    int64
 1   bp limit        200 non-null    int64
 2   sg              200 non-null    int32
 3   al              200 non-null    int32
 4   rbc             200 non-null    int64
 5   su              200 non-null    int32
 6   pc              200 non-null    int64
 7   pcc             200 non-null    int64
 8   ba              200 non-null    int64
 9   bgr             200 non-null    int32
 10  bu              200 non-null    int32
 11  sod             200 non-null    int32
 12  sc              200 non-null    int32
 13  pot             200 non-null    int32
 14  hemo            200 non-null    int32
 15  pcv             200 non-null    int32
 16  rbcc            200 non-null    int32
 17  wbcc            200 non-null    int32
 18  htn             200 non-null  

**Split Data**

In [86]:
# split the data into training and test sets (70% / 30%)
train_df, test_df = train_test_split(ckd, test_size=0.3)
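As an optional variation, the split can be made reproducible regardless of cell execution order by passing `random_state`, and the class ratio can be kept similar in both parts with `stratify`. The sketch below uses a toy frame (`demo`, with made-up values) rather than `ckd`; the seed value is an arbitrary assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an imbalanced binary target (values are illustrative only).
demo = pd.DataFrame({
    "feature": np.arange(20),
    "affected": [1] * 12 + [0] * 8,
})

# random_state fixes the shuffle; stratify keeps the class ratio
# approximately equal in the training and test parts.
train_demo, test_demo = train_test_split(
    demo, test_size=0.3, random_state=1, stratify=demo["affected"]
)

print(train_demo["affected"].value_counts())
print(test_demo["affected"].value_counts())
```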

In [87]:
train_df.affected.value_counts()

1    91
0    49
Name: affected, dtype: int64

Addressing Data imbalance

In [88]:
#there is more than 10% difference between the count of classes, so it is a better approach to resample the training data
cls0 = train_df[train_df['affected']==0]
cls1 = train_df[train_df['affected']==1]
from sklearn.utils import resample
train_df_minority_resampled = resample(cls0, 
                                 replace=True,     
                                 n_samples=91,    
                                 random_state=123)

In [89]:
print(cls1.shape,train_df_minority_resampled.shape)

(91, 28) (91, 28)


In [90]:
train_df=pd.concat([cls1,train_df_minority_resampled])
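The upsampling steps can also be wrapped in a small function so the majority-class size is computed from the data instead of being typed in by hand. This is a sketch only; `upsample_minority` is a hypothetical helper name and the demo frame holds made-up values.

```python
import pandas as pd
from sklearn.utils import resample

def upsample_minority(df, target, random_state=123):
    """Hypothetical helper: resample smaller classes with replacement
    until every class has as many rows as the largest one."""
    n_majority = df[target].value_counts().max()
    parts = []
    for _, group in df.groupby(target):
        if len(group) < n_majority:
            group = resample(group, replace=True,
                             n_samples=n_majority,
                             random_state=random_state)
        parts.append(group)
    return pd.concat(parts)

# Toy training frame: five positive rows and two negative rows.
demo_train = pd.DataFrame({"x": range(7), "affected": [1, 1, 1, 1, 1, 0, 0]})
balanced = upsample_minority(demo_train, "affected")
print(balanced["affected"].value_counts())
```

Only the training data should be resampled; the test set keeps its original class distribution.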

In [91]:

# to reduce repetition in later code, create variables to represent the columns
# that are our predictors and target
target = 'affected'
predictors = list(ckd.columns)
predictors.remove(target)

In [92]:
train_X=train_df[predictors]
train_y = train_df[target] 
test_X = test_df[predictors]
test_y = test_df[target]

In [93]:
performance = pd.DataFrame({"model": [], "Accuracy": [], "Precision": [], "Recall": [], "F1": []})

**Fitting Logistic Regression**

In [94]:
log_reg_model = LogisticRegression(penalty='none', max_iter=900)
_ = log_reg_model.fit(train_X, np.ravel(train_y))
model_preds = log_reg_model.predict(test_X)
c_matrix = confusion_matrix(test_y, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"default logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
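Because the same metric calculation is repeated for every model, it could be collected into one function. The sketch below assumes binary labels 0 and 1; `score_predictions` is a hypothetical helper name, and the demo labels are made up.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

def score_predictions(name, y_true, y_pred):
    """Hypothetical helper: one-row DataFrame of metrics for a set of predictions."""
    # With labels=[0, 1], ravel() yields the counts in the order TN, FP, FN, TP.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return pd.DataFrame({
        "model": [name],
        "Accuracy": [(tp + tn) / (tp + tn + fp + fn)],
        "Precision": [tp / (tp + fp)],
        "Recall": [tp / (tp + fn)],
        "F1": [2 * tp / (2 * tp + fp + fn)],
    })

# Toy example: three positives (one missed) and two negatives.
demo_scores = score_predictions("demo", [1, 1, 1, 0, 0], [1, 1, 0, 0, 0])
print(demo_scores)
```

Each model's row could then be appended with `pd.concat([performance, score_predictions("default logistic", test_y, model_preds)], ignore_index=True)`, which also ensures the confusion matrix is always built from that model's own predictions.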

In [95]:
#Liblinear Solver
log_reg_liblin_model = LogisticRegression(solver='liblinear').fit(train_X, np.ravel(train_y))
model_preds = log_reg_liblin_model.predict(test_X)
c_matrix = confusion_matrix(test_y, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"liblinear logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

In [96]:
#L2 Regularization
log_reg_L2_model = LogisticRegression(penalty='l2', max_iter=1000)
_ = log_reg_L2_model.fit(train_X, np.ravel(train_y))
model_preds = log_reg_L2_model.predict(test_X)
c_matrix = confusion_matrix(test_y, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"L2 logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

In [97]:
#L1 Regularization
log_reg_L1_model = LogisticRegression(solver='liblinear', penalty='l1')
_ = log_reg_L1_model.fit(train_X, np.ravel(train_y))
model_preds = log_reg_L1_model.predict(test_X)
c_matrix = confusion_matrix(test_y, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"L1 logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

In [98]:
#ElasticNet Regularization
log_reg_elastic_model = LogisticRegression(solver='saga', penalty='elasticnet', l1_ratio=0.5, max_iter=1500)
_ = log_reg_elastic_model.fit(train_X, np.ravel(train_y))
model_preds = log_reg_elastic_model.predict(test_X)
c_matrix = confusion_matrix(test_y, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Elastic logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
import warnings
warnings.filterwarnings("ignore")

**Fitting Decision Tree Classifier**

In [99]:
Dt=DecisionTreeClassifier(max_depth=10)
Dt=Dt.fit(train_X,np.ravel(train_y))
model_preds=Dt.predict(test_X)
c_matrix = confusion_matrix(test_y, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Decision Tree Classifier", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

**Fitting Random Forest Classifier**

In [100]:
rforest = RandomForestClassifier()
_ = rforest.fit(train_X, train_y)
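y_pred = rforest.predict(test_X)
# Build the confusion matrix from this model's own predictions
c_matrix = confusion_matrix(test_y, y_pred)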
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Random Forest", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

**Fitting Ada Boost Classifier**

In [101]:
aboost = AdaBoostClassifier()
_ = aboost.fit(train_X, train_y)
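y_pred = aboost.predict(test_X)
# Build the confusion matrix from this model's own predictions
c_matrix = confusion_matrix(test_y, y_pred)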
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"AdaBoost Classifier", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

**Fitting Gradient Boost Classifier**

In [102]:
gboost = GradientBoostingClassifier()
_ = gboost.fit(train_X, train_y)
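y_pred = gboost.predict(test_X)
# Build the confusion matrix from this model's own predictions
c_matrix = confusion_matrix(test_y, y_pred)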
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Gradient Boost Classifier", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

In [103]:
performance.sort_values(by='Recall',ascending=False)

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | L1 logistic | 1.0 | 1.0 | 1.0 | 1.0 |
| 0 | default logistic | 0.983333 | 1.0 | 0.972973 | 0.986301 |
| 0 | liblinear logistic | 0.983333 | 1.0 | 0.972973 | 0.986301 |
| 0 | Elastic logistic | 0.983333 | 1.0 | 0.972973 | 0.986301 |
| 0 | L2 logistic | 0.966667 | 1.0 | 0.945946 | 0.972222 |
| 0 | Decision Tree Classifier | 0.933333 | 0.971429 | 0.918919 | 0.944444 |
| 0 | Random Forest | 0.933333 | 0.971429 | 0.918919 | 0.944444 |
| 0 | AdaBoost Classifier | 0.933333 | 0.971429 | 0.918919 | 0.944444 |
| 0 | Gradient Boost Classifier | 0.933333 | 0.971429 | 0.918919 | 0.944444 |


To identify which model is best for this dataset, Recall should be the primary measure. Because this is a health-related dataset, false negatives are the riskiest error: predicting affected as 0 when it is actually 1 could leave a patient untreated, with potentially fatal consequences.
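As a worked check of how Recall penalizes false negatives, the snippet below uses confusion-matrix counts that are consistent with the "default logistic" row, assuming a test set of 60 rows (30% of 200). The counts are inferred from the reported metrics and are not taken from notebook output.

```python
# Counts inferred from the reported metrics (an assumption, not notebook output):
# 37 truly affected patients, one of them missed, and no false alarms.
tp, fn, fp, tn = 36, 1, 0, 23

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)          # share of affected patients that were caught
f1 = 2 * tp / (2 * tp + fp + fn)

print(round(accuracy, 6), round(precision, 6), round(recall, 6), round(f1, 6))
# prints 0.983333 1.0 0.972973 0.986301
```

A single missed patient lowers Recall from 1.0 to about 0.973 while Precision stays at 1.0, which is why Recall is the more informative measure here.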

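The results above are sorted by Recall. L1 logistic fit the test data best (Recall 1.0), followed by default logistic, liblinear logistic, and Elastic logistic, which share the same Recall (0.972973). The Decision Tree, Random Forest, AdaBoost, and Gradient Boost rows are identical in every metric, so those scores should be verified against each model's own predictions before the tree-based models are compared with one another.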
