## Chronic Kidney disease (CKD): Random forest
Members: Meng-Ni Ho, Liying Chen, Mufei Chen, Catharine Zhang
#### Number of Instances:  400 (250 CKD, 150 non-ckd)
#### Number of Attributes: 24 + class = 25 (11 numeric,14 nominal) 
#### Attribute Information :
    1. Age(numerical): age in years
    2. Blood Pressure(numerical): bp in mm/Hg
    3. Specific Gravity(nominal): sg - (1.005,1.010,1.015,1.020,1.025)
    4. Albumin(nominal): al - (0,1,2,3,4,5)
    5. Sugar(nominal): su - (0,1,2,3,4,5)
    6. Red Blood Cells(nominal): rbc - (normal = 0,abnormal = 1)
    7.Pus Cell (nominal): pc - (normal = 0,abnormal = 1)
    8.Pus Cell clumps(nominal): pcc - (present = 1,notpresent = 0)
    9.Bacteria(nominal): ba  - (present = 1,notpresent = 0)
    10.Blood Glucose Random(numerical): bgr in mgs/dl
    11.Blood Urea(numerical): bu in mgs/dl
    12.Serum Creatinine(numerical): sc in mgs/dl
    13.Sodium(numerical): sod in mEq/L
    14.Potassium(numerical): pot in mEq/L
    15.Hemoglobin(numerical): hemo in gms
    16.Packed  Cell Volume(numerical)
    17.White Blood Cell Count(numerical): wc in cells/cumm
    18.Red Blood Cell Count(numerical): rc in millions/cmm
    19.Hypertension(nominal): htn - (yes = 1, no = 0)
    20.Diabetes Mellitus(nominal): dm - (yes = 1,no = 0)
    21.Coronary Artery Disease(nominal): cad - (yes = 1,no = 0)
    22.Appetite(nominal): appet	 - (good = 1, poor = 0)
    23.Pedal Edema(nominal): pe - (yes = 1, no = 0)	
    24.Anemia(nominal): ane	- (yes = 1,no = 0)
    25.Class (nominal): class - (ckd = 1, notckd = 0)
#### Objective: Prediction of CKD using random forest.    
#### Resource: UCI machine learning repository

### Step 1: Data cleaning, missing value imputation: replace all the NaN with median

In [14]:
import pandas as pd
features = pd.read_csv('kidney.csv')
# fill all the NA with median
#features = features.apply(lambda x: x.fillna(x.median()),axis=0)
features.head(5)

Unnamed: 0,age,blood pressure,specific gravity,albumin,sugar,rbc,pus cell,pus cell clumps,bacteric,blood glucose random,...,packed cell volume,wbc count,rbc count,hypertension,diabetes mellitus,coronary artery disease,appetite,pedal edema,anemia,class
0,48.0,80.0,1.02,1.0,0.0,,0.0,0.0,0.0,121.0,...,44.0,7800.0,5.2,1.0,1.0,0.0,1.0,0.0,0.0,1
1,7.0,50.0,1.02,4.0,0.0,,0.0,0.0,0.0,,...,38.0,6000.0,,0.0,0.0,0.0,1.0,0.0,0.0,1
2,62.0,80.0,1.01,2.0,3.0,0.0,0.0,0.0,0.0,423.0,...,31.0,7500.0,,0.0,1.0,0.0,0.0,0.0,1.0,1
3,48.0,70.0,1.005,4.0,0.0,0.0,1.0,1.0,0.0,117.0,...,32.0,6700.0,3.9,1.0,0.0,0.0,0.0,1.0,1.0,1
4,51.0,80.0,1.01,2.0,0.0,0.0,0.0,0.0,0.0,106.0,...,35.0,7300.0,4.6,0.0,0.0,0.0,1.0,0.0,0.0,1


In [None]:
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy='median', axis=0)
imputer = imputer.fit(features)
features = imputer.transform(features)

### Step 2: Split data to training set and testing set

In [2]:
# Labels are the values we want to predict
labels = features[:,24]
# Remove the labels from the kidney data
features_drop = features[:,:24]

In [3]:
# Using Skicit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split
# Split the data into training and testing sets
train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size = 0.4, random_state = 42)

In [4]:
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

Training Features Shape: (240, 25)
Training Labels Shape: (240,)
Testing Features Shape: (160, 25)
Testing Labels Shape: (160,)


### Step 3: Build the model with `RandomForestClassifier` using training set
#### set `bootstrap = True` to expand the samples in order to converge predictions to 0 and 1

In [7]:
from sklearn.ensemble import RandomForestClassifier

# Create the model with 100 trees
model = RandomForestClassifier(n_estimators=500, 
                               bootstrap = True,
                               max_features = 'sqrt')
# Fit on training data
model.fit(train_features, train_labels)

# Import the model we are using
# from sklearn.ensemble import RandomForestRegressor
# Instantiate model with 500 decision trees
# rf = RandomForestRegressor(n_estimators = 500, random_state = 42)
# Train the model on training data
# rf.fit(train_features, train_labels)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

### Step 4: Fit the model on testing set

##### The predicted labels from model, the results will converge to 0 and 1. 

In [18]:
# Use the forest's predict method on the test data
predictions = model.predict(test_features)
print(predictions)

[1. 0. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0. 1. 1. 1. 1. 0. 0. 1. 0. 1. 1. 0. 0.
 1. 0. 1. 1. 0. 1. 0. 1. 0. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1.
 1. 1. 1. 0. 0. 1. 0. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 0. 0. 0. 1. 0. 1.
 1. 1. 0. 1. 1. 0. 1. 0. 1. 1. 1. 0. 0. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 0.
 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0.
 1. 1. 1. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0. 1.
 1. 0. 1. 1. 1. 0. 1. 0. 0. 1. 1. 1. 1. 1. 0. 0.]


##### Original predicted results, before convergence. Results close to 0 (ex. 0.002) will converge to 0, close to 1 (ex. 0.952) will converge to 1.

In [19]:
rf_probs = model.predict_proba(test_features)[:,1]
print(rf_probs)

[0.952 0.    1.    1.    1.    0.996 0.    0.996 0.014 1.    0.998 0.004
 1.    1.    0.998 0.978 0.    0.002 1.    0.    1.    0.996 0.    0.
 0.838 0.002 0.994 1.    0.002 1.    0.004 0.986 0.01  0.996 1.    0.952
 1.    0.002 0.998 1.    0.026 0.996 0.998 0.982 0.978 0.    0.998 0.968
 0.976 0.968 0.978 0.01  0.006 0.97  0.032 0.998 0.996 0.982 0.99  0.978
 0.998 0.054 0.002 1.    0.996 0.994 0.    0.002 0.026 1.    0.    0.95
 0.974 1.    0.    0.996 1.    0.08  0.992 0.    0.998 0.96  1.    0.012
 0.012 1.    1.    0.    0.938 0.988 0.936 0.002 0.99  0.988 1.    0.
 0.    0.006 0.992 0.952 0.92  0.988 0.992 0.998 1.    0.968 0.994 0.998
 0.006 0.002 0.968 0.016 0.002 0.    0.    0.998 0.99  0.012 0.004 0.002
 0.98  1.    0.992 0.    0.994 0.996 0.058 0.    0.724 0.996 0.    1.
 0.002 0.994 0.982 0.994 0.994 0.    0.962 0.008 1.    0.996 0.002 0.992
 0.998 0.004 1.    1.    0.974 0.006 1.    0.002 0.03  0.904 0.998 0.936
 1.    0.998 0.    0.016]


### Step 5: Calculate the accuracy with confusion matrix

* Confusion matrix is a way to calculate accuracy for classification.
* A confusion matrix is a summary of prediction results on a classification problem.
* The number of correct and incorrect predictions are summarized with count values and broken down by each class. 

![](Confusion_Matrix1_1.png)
* Class 1 : Positive  
* Class 2 : Negative  

#### Definition of the Terms:  
* Positive (P) : Observation is positive (for example: is an apple).  
* Negative (N) : Observation is not positive (for example: is not an apple).  
* True Positive (TP) : Observation is positive, and is predicted to be positive.  
* False Negative (FN) : Observation is positive, but is predicted negative.  
* True Negative (TN) : Observation is negative, and is predicted to be negative.  
* False Positive (FP) : Observation is negative, but is predicted positive.  

#### Classification Rate/Accuracy:  
Classification Rate or Accuracy is given by the relation:
![](Confusion_Matrix2_2.png)
Resource: https://www.geeksforgeeks.org/confusion-matrix-machine-learning/

In [20]:
from sklearn.metrics import confusion_matrix 
from sklearn.metrics import accuracy_score 
from sklearn.metrics import classification_report 

results = confusion_matrix(test_labels, predictions) 
print('Confusion Matrix :') 
print(results) 
print('Accuracy Score :',accuracy_score(test_labels, predictions))
print('Report : ') 
print(classification_report(test_labels, predictions)) 

Confusion Matrix :
[[ 58   0]
 [  0 102]]
Accuracy Score : 1.0
Report : 
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        58
         1.0       1.00      1.00      1.00       102

   micro avg       1.00      1.00      1.00       160
   macro avg       1.00      1.00      1.00       160
weighted avg       1.00      1.00      1.00       160

