# Part 2 - Chronic Kidney Diseases

Throughout this question, after importing, cleaning and sorting the data,
I use two approaches to predict the gender of each patient.
- **1st approach** uses a regression model.
- **2nd approach** uses the Keras library to estimate the probability of each patient falling into each of the 5 CKD stages.

### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import math

from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor


### Importing the dataset

##### Dataset Column Features Explained

- 1. **Age** (numerical) - age in years
- 2. **Blood Pressure** (numerical) - bp in mm/Hg
- 3. **Specific Gravity** (nominal) - sg - (1.005,1.010,1.015,1.020,1.025)
- 4. **Albumin** (nominal) - al - (0,1,2,3,4,5)
- 5. **Sugar** (nominal) - su - (0,1,2,3,4,5)
- 6. **Red Blood Cells** (nominal) - rbc - (normal,abnormal)
- 7. **Pus Cell** (nominal) - pc - (normal,abnormal)
- 8. **Pus Cell Clumps** (nominal) - pcc - (present,notpresent)
- 9. **Bacteria** (nominal) - ba - (present,notpresent)
- 10. **Blood Glucose Random** (numerical) - bgr in mgs/dl
- 11. **Blood Urea** (numerical) - bu in mgs/dl
- 12. **Serum Creatinine** (numerical) - sc in mgs/dl
- 13. **Sodium** (numerical) - sod in mEq/L
- 14. **Potassium** (numerical) - pot in mEq/L
- 15. **Hemoglobin** (numerical) - hemo in gms
- 16. **Packed Cell Volume** (numerical)
- 17. **White Blood Cell Count** (numerical) - wc in cells/cumm
- 18. **Red Blood Cell Count** (numerical) - rc in millions/cmm
- 19. **Hypertension** (nominal) - htn - (yes,no)
- 20. **Diabetes Mellitus** (nominal) - dm - (yes,no)
- 21. **Coronary Artery Disease** (nominal) - cad - (yes,no)
- 22. **Appetite** (nominal) - appet - (good,poor)
- 23. **Pedal Edema** (nominal) - pe - (yes,no)
- 24. **Anemia** (nominal) - ane - (yes,no)
- 25. **Class** (nominal) - class - (ckd,notckd)



In [2]:
# read the dataset
header = ['Age','BloodPressure','SpecificGravity','Albumin','Sugar','RedBloodCell',
          'PusCell','PusCellCLumps','Bacteria','BloodGlucoseRandom','Blood Urea',
          'SerumCreatinine','Sodium','Potassium','Hemoglobin','PackedCellVolume',
          'WhiteBloodCell','RedBloodCellCount','Hypertension','DiabetesMellitus',
          'CoronaryArteryDisease','Appetite','PedalEdema','Anemia','Classification']

df = pd.read_csv('/Users/jesskim/Downloads/Chronic_Kidney_Disease/chronic_kidney_disease.arff', 
                 header=None, names=header)

df.head()

Unnamed: 0,Age,BloodPressure,SpecificGravity,Albumin,Sugar,RedBloodCell,PusCell,PusCellCLumps,Bacteria,BloodGlucoseRandom,...,PackedCellVolume,WhiteBloodCell,RedBloodCellCount,Hypertension,DiabetesMellitus,CoronaryArteryDisease,Appetite,PedalEdema,Anemia,Classification
0,@relation Chronic_Kidney_Disease,,,,,,,,,,...,,,,,,,,,,
1,@attribute 'age' numeric,,,,,,,,,,...,,,,,,,,,,
2,@attribute 'bp' numeric,,,,,,,,,,...,,,,,,,,,,
3,@attribute 'sg' {1.005,1.01,1.015,1.02,1.025},,,,,,...,,,,,,,,,,
4,@attribute 'al' {0,1.0,2.0,3.0,4,5},,,,,...,,,,,,,,,,
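
The preview above shows the ARFF header lines (`@relation`, `@attribute`) being read as if they were data rows. A minimal sketch of one way to skip them, assuming the file follows the usual ARFF layout in which the records start after an `@data` line. The sample text and the helper name `read_arff_data` are made up for illustration and are not part of the original notebook:

```python
import io
import pandas as pd

def read_arff_data(text, names):
    """Read only the records that follow the '@data' line of ARFF text."""
    lines = text.splitlines()
    # locate the '@data' marker (case-insensitive) and keep what comes after it
    start = next(i for i, line in enumerate(lines)
                 if line.strip().lower() == '@data') + 1
    body = "\n".join(line for line in lines[start:] if line.strip())
    # '?' marks a missing value in this dataset, so read it as NaN directly
    return pd.read_csv(io.StringIO(body), header=None, names=names,
                       na_values='?', skipinitialspace=True)

# hypothetical three-column sample, not the real file
sample = """@relation demo
@attribute 'age' numeric
@attribute 'bp' numeric
@attribute 'class' {ckd,notckd}
@data
48,80,ckd
?,70,notckd
"""

demo = read_arff_data(sample, names=['Age', 'BloodPressure', 'Classification'])
print(demo)
```

With this kind of loader the header rows never enter the DataFrame, so the later cleaning step only has to deal with genuinely missing values.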


In [3]:
# the dataset uses '?' for missing values; convert these to NaN
df = df.replace('?', np.nan)
# drop every row that contains a NaN
df = df.dropna(axis=0, how="any")

# dropping rows leaves gaps in the index, so reset it to 0..n-1
df['index'] = range(df.shape[0]) 
df = df.set_index("index")

# convert the numerical columns to float
int_col = ['Age','BloodPressure','SpecificGravity','Albumin','Sugar','BloodGlucoseRandom','Blood Urea',
          'SerumCreatinine','Sodium','Potassium','Hemoglobin','PackedCellVolume',
          'WhiteBloodCell','RedBloodCellCount']
for col in int_col:
    df[col] = df[col].astype('float')

# the Mayo Quadratic formula treats serum creatinine below 0.8 mg/dL as 0.8
df.loc[df['SerumCreatinine'] < 0.8, 'SerumCreatinine'] = 0.8

df.head()

index,Age,BloodPressure,SpecificGravity,Albumin,Sugar,RedBloodCell,PusCell,PusCellCLumps,Bacteria,BloodGlucoseRandom,...,PackedCellVolume,WhiteBloodCell,RedBloodCellCount,Hypertension,DiabetesMellitus,CoronaryArteryDisease,Appetite,PedalEdema,Anemia,Classification
0,48.0,70.0,1.005,4.0,0.0,normal,abnormal,present,notpresent,117.0,...,32.0,6700.0,3.9,yes,no,no,poor,yes,yes,ckd
1,53.0,90.0,1.02,2.0,0.0,abnormal,abnormal,present,notpresent,70.0,...,29.0,12100.0,3.7,yes,yes,no,poor,no,yes,ckd
2,63.0,70.0,1.01,3.0,0.0,abnormal,abnormal,present,notpresent,380.0,...,32.0,4500.0,3.8,yes,yes,no,poor,yes,no,ckd
3,68.0,80.0,1.01,3.0,2.0,normal,abnormal,present,present,157.0,...,16.0,11000.0,2.6,yes,yes,yes,poor,yes,no,ckd
4,61.0,80.0,1.015,2.0,0.0,abnormal,abnormal,notpresent,notpresent,173.0,...,24.0,9200.0,3.2,yes,yes,yes,poor,yes,yes,ckd


### Categorical Columns to Dummy Variables

In [4]:
cat_cols = list(df.select_dtypes('object').columns)
for col in cat_cols:
    df = pd.concat([df.drop(col, axis=1), pd.get_dummies(df[col])], axis=1)
    
df.head()

index,Age,BloodPressure,SpecificGravity,Albumin,Sugar,BloodGlucoseRandom,Blood Urea,SerumCreatinine,Sodium,Potassium,...,no,yes,good,poor,no,yes,no,yes,ckd,notckd
0,48.0,70.0,1.005,4.0,0.0,117.0,56.0,3.8,111.0,2.5,...,1,0,0,1,0,1,0,1,1,0
1,53.0,90.0,1.02,2.0,0.0,70.0,107.0,7.2,114.0,3.7,...,1,0,0,1,1,0,0,1,1,0
2,63.0,70.0,1.01,3.0,0.0,380.0,60.0,2.7,131.0,4.2,...,1,0,0,1,0,1,1,0,1,0
3,68.0,80.0,1.01,3.0,2.0,157.0,90.0,4.1,130.0,6.4,...,0,1,0,1,0,1,1,0,1,0
4,61.0,80.0,1.015,2.0,0.0,173.0,148.0,3.9,135.0,5.2,...,0,1,0,1,0,1,0,1,1,0
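
The dummy columns above end up with repeated names (`no`, `yes`, `no`, `yes`, ...), which makes them hard to tell apart. A small sketch of an alternative, assuming it is acceptable for the column names and order to change: passing `columns=` to `pd.get_dummies` prefixes every dummy with the name of its source column. The frame `toy` is made up for illustration:

```python
import pandas as pd

# hypothetical miniature of the cleaned data
toy = pd.DataFrame({
    'Age': [48.0, 53.0],
    'Hypertension': ['yes', 'no'],
    'Anemia': ['yes', 'yes'],
})

cat_cols = ['Hypertension', 'Anemia']
# each dummy column is named <source column>_<value>
encoded = pd.get_dummies(toy, columns=cat_cols)
print(list(encoded.columns))
```

The numeric columns stay in front and the dummies are appended after them, so code that slices by position from the end of the frame would need to be checked against the new layout.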


Using the given **Mayo Quadratic Equation**, I calculated the eGFR score of every patient twice: once as if the patient were male and once as if the patient were female.

The equation is given in section 4.4 of https://en.wikipedia.org/wiki/Renal_function

In [5]:
def get_eGFR_Male(row):
    eGFR_Male = math.exp(1.911 + (5.249/float(row['SerumCreatinine']))-(2.114/(float(row['SerumCreatinine']))**2) -(0.00686*float(row['Age']))) 
    return eGFR_Male

df['eGFR_Male'] =  df.apply(get_eGFR_Male, axis=1)

def get_eGFR_Female(row):
    eGFR_Female = math.exp(1.911 + (5.249/float(row['SerumCreatinine']))-(2.114/(float(row['SerumCreatinine']))**2) -(0.00686*float(row['Age']))-(0.205)) 
    return eGFR_Female

df['eGFR_Female'] =  df.apply(get_eGFR_Female, axis=1)
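
The two row-wise functions above differ only in the `0.205` term. A sketch of the same formula written once and applied to whole columns with NumPy; the function name `mayo_quadratic_egfr` and the frame `toy` are illustrative assumptions, while the coefficients are the ones used in the cell above:

```python
import numpy as np
import pandas as pd

def mayo_quadratic_egfr(serum_creatinine, age, female=False):
    """Mayo Quadratic eGFR for arrays of serum creatinine (mg/dL) and age (years)."""
    # values below 0.8 mg/dL are replaced by 0.8, as in the cleaning step
    scr = np.maximum(np.asarray(serum_creatinine, dtype=float), 0.8)
    age = np.asarray(age, dtype=float)
    exponent = 1.911 + 5.249 / scr - 2.114 / scr**2 - 0.00686 * age
    if female:
        exponent = exponent - 0.205
    return np.exp(exponent)

# first two patients from the table above
toy = pd.DataFrame({'SerumCreatinine': [3.8, 7.2], 'Age': [48.0, 53.0]})
toy['eGFR_Male'] = mayo_quadratic_egfr(toy['SerumCreatinine'], toy['Age'])
toy['eGFR_Female'] = mayo_quadratic_egfr(toy['SerumCreatinine'], toy['Age'], female=True)
print(toy.round(3))
```

Because the female term sits inside the exponent, the female score is always the male score multiplied by `exp(-0.205)`, roughly 0.815.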

## **1st Approach**

One possible method is to use the average of the male and female eGFR scores.

The eGFR scores calculated for males and for females are close to each other, which makes it hard to tell from the scores alone whether a patient is male or female.

I decided to add a column holding the average of the male and female eGFR scores.

In [6]:
df['avg_eGFR'] = (df['eGFR_Male'] + df['eGFR_Female']) / 2
# regression model target = average eGFR score


eGFR_SCORE_df = df.iloc[:, -3:]

eGFR_SCORE_df.head()

index,eGFR_Male,eGFR_Female,avg_eGFR
0,16.720523,13.621329,15.170926
1,9.352732,7.619178,8.485955
2,22.940589,18.688489,20.814539
3,13.450354,10.957295,12.203824
4,14.871549,12.115068,13.493308


In [7]:
df.head()

index,Age,BloodPressure,SpecificGravity,Albumin,Sugar,BloodGlucoseRandom,Blood Urea,SerumCreatinine,Sodium,Potassium,...,poor,no,yes,no,yes,ckd,notckd,eGFR_Male,eGFR_Female,avg_eGFR
0,48.0,70.0,1.005,4.0,0.0,117.0,56.0,3.8,111.0,2.5,...,1,0,1,0,1,1,0,16.720523,13.621329,15.170926
1,53.0,90.0,1.02,2.0,0.0,70.0,107.0,7.2,114.0,3.7,...,1,1,0,0,1,1,0,9.352732,7.619178,8.485955
2,63.0,70.0,1.01,3.0,0.0,380.0,60.0,2.7,131.0,4.2,...,1,0,1,1,0,1,0,22.940589,18.688489,20.814539
3,68.0,80.0,1.01,3.0,2.0,157.0,90.0,4.1,130.0,6.4,...,1,0,1,1,0,1,0,13.450354,10.957295,12.203824
4,61.0,80.0,1.015,2.0,0.0,173.0,148.0,3.9,135.0,5.2,...,1,0,1,0,1,1,0,14.871549,12.115068,13.493308


Now that data processing is finished, it's time to test!

Split the data into a training set and a test set.

In [8]:
# Our target is the very last column of the DataFrame
X = df.iloc[:, :-3].values  # all columns except the male, female and average eGFR scores
y = df.iloc[:, -1].values   # average of the male and female eGFR scores (1-D, as scikit-learn expects)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 0)


The White Blood Cell column has much larger values than the other columns and would dominate them, so the features are standardized.

In [9]:
#feature Scaling
sc_X = StandardScaler() 
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
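
For reference, a NumPy sketch of what the scaling step computes: the mean and standard deviation are estimated on the training rows only and then reused for the test rows. The small arrays are made up from a few values in the tables above, and the split between them is arbitrary:

```python
import numpy as np

# two features per row: [WhiteBloodCell, SpecificGravity]
X_train_demo = np.array([[6700.0, 1.005],
                         [12100.0, 1.020],
                         [4500.0, 1.010],
                         [11000.0, 1.010]])
X_test_demo = np.array([[9200.0, 1.015]])

mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

X_train_scaled = (X_train_demo - mean) / std
X_test_scaled = (X_test_demo - mean) / std  # test rows reuse the training statistics

print(X_train_scaled.round(3))
print(X_test_scaled.round(3))
```

After scaling, both columns have mean 0 and unit variance on the training rows, so the thousands-scale white blood cell counts no longer outweigh the other features.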

In [10]:
#Regression model
REGRESSORS = {
        'rf': RandomForestRegressor(n_estimators=100, random_state=0),
        'dt': DecisionTreeRegressor()}

regressor = REGRESSORS['rf']
regressor.fit(X_train, y_train)
y_test_predict = regressor.predict(X_test)



In [11]:
# compare the actual and predicted values side by side as DataFrames
y_test_predict = pd.DataFrame(y_test_predict)
y_test = pd.DataFrame(y_test)

result = pd.concat([y_test, y_test_predict], axis=1)
result.columns = ['avg_eGFR', 'y_pred']

I'm going to run this regression model on the entire dataset so I can compare the predicted average eGFR scores with the calculated average eGFR scores.

In [12]:
# scale the entire feature set X with the scaler fitted on the training set
X_featured = sc_X.transform(X) 
Y_predict = regressor.predict(X_featured)

In [13]:
df['y_pred'] = Y_predict
df.head()  # y_pred is attached as the very last column

index,Age,BloodPressure,SpecificGravity,Albumin,Sugar,BloodGlucoseRandom,Blood Urea,SerumCreatinine,Sodium,Potassium,...,no,yes,no,yes,ckd,notckd,eGFR_Male,eGFR_Female,avg_eGFR,y_pred
0,48.0,70.0,1.005,4.0,0.0,117.0,56.0,3.8,111.0,2.5,...,0,1,0,1,1,0,16.720523,13.621329,15.170926,14.854403
1,53.0,90.0,1.02,2.0,0.0,70.0,107.0,7.2,114.0,3.7,...,1,0,0,1,1,0,9.352732,7.619178,8.485955,8.496045
2,63.0,70.0,1.01,3.0,0.0,380.0,60.0,2.7,131.0,4.2,...,0,1,1,0,1,0,22.940589,18.688489,20.814539,19.777596
3,68.0,80.0,1.01,3.0,2.0,157.0,90.0,4.1,130.0,6.4,...,0,1,1,0,1,0,13.450354,10.957295,12.203824,13.235092
4,61.0,80.0,1.015,2.0,0.0,173.0,148.0,3.9,135.0,5.2,...,0,1,0,1,1,0,14.871549,12.115068,13.493308,13.851491
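
One way to put a number on how close `y_pred` is to `avg_eGFR` is the mean squared error and the R² score. The sketch below computes both with NumPy on the five rows shown above only, so it illustrates the calculation rather than evaluating the model; a proper evaluation would use the held-out test set, for example with the `mean_squared_error` and `r2_score` functions imported at the top:

```python
import numpy as np

# avg_eGFR and y_pred for the first five patients, copied from the table above
avg_egfr = np.array([15.170926, 8.485955, 20.814539, 12.203824, 13.493308])
y_pred = np.array([14.854403, 8.496045, 19.777596, 13.235092, 13.851491])

residual = avg_egfr - y_pred
mse = np.mean(residual ** 2)
# R^2 = 1 - (residual sum of squares) / (total sum of squares)
r2 = 1 - np.sum(residual ** 2) / np.sum((avg_egfr - avg_egfr.mean()) ** 2)

print(f"MSE: {mse:.3f}, R^2: {r2:.3f}")
```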


In [14]:
def FindGender(row):
    if row['y_pred'] >  row['avg_eGFR']:
        return 'Male'
    else:
        return 'Female'

df['Gender'] =  df.apply(FindGender, axis=1)
df['Gender']

index
0      Female
1        Male
2      Female
3        Male
4        Male
5        Male
6      Female
7      Female
8        Male
9        Male
10       Male
11       Male
12       Male
13       Male
14       Male
15       Male
16       Male
17       Male
18     Female
19       Male
20       Male
21       Male
22       Male
23       Male
24     Female
25       Male
26       Male
27     Female
28     Female
29       Male
        ...  
127    Female
128    Female
129      Male
130    Female
131      Male
132      Male
133      Male
134    Female
135    Female
136    Female
137    Female
138    Female
139    Female
140      Male
141    Female
142    Female
143    Female
144    Female
145    Female
146      Male
147    Female
148    Female
149      Male
150    Female
151    Female
152    Female
153      Male
154    Female
155      Male
156    Female
Name: Gender, Length: 157, dtype: object

In [15]:
df['Gender'].value_counts()

Female    82
Male      75
Name: Gender, dtype: int64

This gives a predicted Gender column based on the eGFR scores.

## 2nd Approach (Classification Model)
##### Stages Grouping 

The grouping uses the CKD stages defined by eGFR, as given on Wikipedia:

0. Normal kidney function – GFR above 90 mL/min/1.73 m² and no proteinuria

1. CKD1 – GFR above 90 mL/min/1.73 m² with evidence of kidney damage

2. CKD2 (mild) – GFR of 60 to 89 mL/min/1.73 m² with evidence of kidney damage

3. CKD3 (moderate) – GFR of 30 to 59 mL/min/1.73 m²

4. CKD4 (severe) – GFR of 15 to 29 mL/min/1.73 m²

5. CKD5 kidney failure – GFR less than 15 mL/min/1.73 m². Some people add CKD5D for those stage 5 patients requiring dialysis; many patients in CKD5 are not yet on dialysis.

| CKD stage | GFR level (mL/min/1.73 m²) |
|-----------|----------------------------|
| Stage 1   | ≥ 90                       |
| Stage 2   | 60–89                      |
| Stage 3   | 30–59                      |
| Stage 4   | 15–29                      |
| Stage 5   | < 15                       |

Group each eGFR score into its CKD stage.

In [16]:
ckd_column = ['eGFR_Male', 'eGFR_Female']
for col in ckd_column:
    df['temp']=np.nan
    
    condition = (df[col].round(0) >= 90)
    df.loc[condition,'temp'] = 'Stage 1'
    
    condition = (df[col].round(0) >= 60) & (df[col].round(0) < 90)
    df.loc[condition,'temp'] = 'Stage 2'
    
    condition = (df[col].round(0) >= 30) & (df[col].round(0) < 60)
    df.loc[condition,'temp'] = 'Stage 3' 

    condition = (df[col].round(0) >= 15) & (df[col].round(0) < 30)
    df.loc[condition,'temp'] = 'Stage 4'
    
    condition = (df[col].round(0) < 15)
    df.loc[condition,'temp'] = 'Stage 5'

    df[col] = df['temp']  
    df.drop('temp',axis=1,inplace=True)


#renames
df.rename(columns={'eGFR_Male':'Male_Stages',
                          'eGFR_Female':'Female_Stages'}, inplace=True)
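
The same grouping can be sketched more compactly with `pd.cut`, which assigns each value to a labelled interval. This is an alternative sketch, not the cell above; the helper name `to_ckd_stage` is made up, and the thresholds and the rounding to whole numbers follow the cell above:

```python
import numpy as np
import pandas as pd

def to_ckd_stage(egfr):
    """Map a Series of eGFR scores to 'Stage 1' .. 'Stage 5' labels."""
    bins = [-np.inf, 15, 30, 60, 90, np.inf]
    labels = ['Stage 5', 'Stage 4', 'Stage 3', 'Stage 2', 'Stage 1']
    # right=False makes every interval [low, high), e.g. 15 <= x < 30 is Stage 4
    return pd.cut(egfr.round(0), bins=bins, labels=labels, right=False).astype(str)

# male eGFR scores of the first five patients
egfr_male = pd.Series([16.720523, 9.352732, 22.940589, 13.450354, 14.871549])
print(to_ckd_stage(egfr_male).tolist())
```

Note that the rounding matters at the boundaries: 14.871549 rounds to 15 and therefore lands in Stage 4 rather than Stage 5.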

I'm going to assume that all patients are male and see which stage each patient is likely to fall into.


In [17]:
df.head()

index,Age,BloodPressure,SpecificGravity,Albumin,Sugar,BloodGlucoseRandom,Blood Urea,SerumCreatinine,Sodium,Potassium,...,yes,no,yes,ckd,notckd,Male_Stages,Female_Stages,avg_eGFR,y_pred,Gender
0,48.0,70.0,1.005,4.0,0.0,117.0,56.0,3.8,111.0,2.5,...,1,0,1,1,0,Stage 4,Stage 5,15.170926,14.854403,Female
1,53.0,90.0,1.02,2.0,0.0,70.0,107.0,7.2,114.0,3.7,...,0,0,1,1,0,Stage 5,Stage 5,8.485955,8.496045,Male
2,63.0,70.0,1.01,3.0,0.0,380.0,60.0,2.7,131.0,4.2,...,1,1,0,1,0,Stage 4,Stage 4,20.814539,19.777596,Female
3,68.0,80.0,1.01,3.0,2.0,157.0,90.0,4.1,130.0,6.4,...,1,1,0,1,0,Stage 5,Stage 5,12.203824,13.235092,Male
4,61.0,80.0,1.015,2.0,0.0,173.0,148.0,3.9,135.0,5.2,...,1,0,1,1,0,Stage 4,Stage 5,13.493308,13.851491,Male


To make the data a little clearer, I'm going to drop the three columns left over from the first approach.

In [18]:
df = df.drop(['avg_eGFR', 'y_pred', 'Gender'],axis=1)
df.head()

index,Age,BloodPressure,SpecificGravity,Albumin,Sugar,BloodGlucoseRandom,Blood Urea,SerumCreatinine,Sodium,Potassium,...,good,poor,no,yes,no,yes,ckd,notckd,Male_Stages,Female_Stages
0,48.0,70.0,1.005,4.0,0.0,117.0,56.0,3.8,111.0,2.5,...,0,1,0,1,0,1,1,0,Stage 4,Stage 5
1,53.0,90.0,1.02,2.0,0.0,70.0,107.0,7.2,114.0,3.7,...,0,1,1,0,0,1,1,0,Stage 5,Stage 5
2,63.0,70.0,1.01,3.0,0.0,380.0,60.0,2.7,131.0,4.2,...,0,1,0,1,1,0,1,0,Stage 4,Stage 4
3,68.0,80.0,1.01,3.0,2.0,157.0,90.0,4.1,130.0,6.4,...,0,1,0,1,1,0,1,0,Stage 5,Stage 5
4,61.0,80.0,1.015,2.0,0.0,173.0,148.0,3.9,135.0,5.2,...,0,1,0,1,0,1,1,0,Stage 4,Stage 5


Now I have everything I need to train on my data.

I'm going to build 2 classification models, one with Male_Stages as the target and the other with Female_Stages.

In [19]:
X = df.iloc[:, :-3].values  # feature set
ym = df[df.columns[-2]]  # target: Male_Stages
yf = df[df.columns[-1]]  # target: Female_Stages

y = df.iloc[:, -2:]  # Male_Stages and Female_Stages together (we are going to compare them later)

# turn the targets into dummy variables because there are 5 possible outcomes
ym = pd.get_dummies(ym) 
yf = pd.get_dummies(yf)

# split the data into a training set and a test set
X_train, X_test, ym_train, ym_test = train_test_split(X, ym, test_size=0.2, random_state = 0) #for Male
X_train, X_test, yf_train, yf_test = train_test_split(X, yf, test_size=0.2, random_state = 0) #for Female

# fit the scaler on the training set, then apply it to the test set
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train) 
X_test = sc_X.transform(X_test)

# X is split with random_state = 0 both times, so X_train and X_test are the same for Male and Female
# and we only need to fit and transform once.
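
The network below has 5 output units, which assumes that each target produces exactly 5 dummy columns. `pd.get_dummies` only creates a column for values that actually occur, so a stage that is missing from the data would give fewer columns. A sketch of one way to guard against that, assuming a fixed list of stage labels (`STAGES` and the sample labels are illustrative):

```python
import pandas as pd

STAGES = ['Stage 1', 'Stage 2', 'Stage 3', 'Stage 4', 'Stage 5']

# sample target in which Stage 1 to Stage 3 never occur
labels = pd.Series(['Stage 4', 'Stage 5', 'Stage 4', 'Stage 5', 'Stage 4'])

# declaring the full category list keeps a column for every stage
as_categorical = labels.astype(pd.CategoricalDtype(categories=STAGES))
one_hot = pd.get_dummies(as_categorical)

print(one_hot.shape)
print(list(one_hot.columns))
```

This also fixes the column order, so output unit `k` of the softmax layer always corresponds to `STAGES[k]`.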

## Using the Keras Library to Make Predictions

In [20]:
import keras
from keras.models import Sequential
from keras.layers import Dense

def baseline_model():
    # create model
    model = Sequential()
    n_cols = X_train.shape[1]
    model.add(Dense(units = 200, kernel_initializer ='random_uniform', activation = 'relu', input_dim = n_cols))
    # hidden layers
    model.add(Dense(units = 200, kernel_initializer ='random_uniform', activation = 'relu'))
    model.add(Dense(units = 100, kernel_initializer ='random_uniform', activation = 'relu'))
    model.add(Dense(units = 50, kernel_initializer ='random_uniform', activation = 'relu'))
    # output layer: one unit per CKD stage, softmax turns the outputs into probabilities
    model.add(Dense(units = 5, kernel_initializer ='random_uniform', activation = 'softmax'))
    # compile model
    model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
    return model

#I would try to add more layers


model_m = baseline_model() #model for males
model_f = baseline_model() #model for females


#Males
model_m.fit(X_train, ym_train, batch_size = 10, epochs = 100)
ym_pred = model_m.predict(X_test)

#Females
model_f.fit(X_train, yf_train, batch_size = 10, epochs = 100)
yf_pred = model_f.predict(X_test)

Using TensorFlow backend.

Epoch 1/100
...
Epoch 100/100

Epoch 1/100
...
Epoch 100/100


In [21]:
#checking the accuracy of the models.
scores_m = model_m.evaluate(X_test, ym_test)
scores_f = model_f.evaluate(X_test, yf_test)

print("\n%s: %.2f%%" % (model_m.metrics_names[1], scores_m[1]*100))
print("\n%s: %.2f%%" % (model_f.metrics_names[1], scores_f[1]*100))


acc: 81.25%

acc: 65.62%


In [22]:
#predicted test values into DataFrames
ym_pred = pd.DataFrame(ym_pred)
yf_pred = pd.DataFrame(yf_pred)

yf_pred.round(5)*100

Unnamed: 0,0,1,2,3,4
0,0.0,0.0,0.82,99.145004,0.035
1,100.0,0.0,0.0,0.0,0.0
2,47.328003,52.671997,0.0,0.0,0.0
3,1.64,98.360001,0.0,0.0,0.0
4,99.950996,0.049,0.0,0.0,0.0
5,99.993004,0.007,0.0,0.0,0.0
6,0.434,99.566002,0.0,0.0,0.0
7,91.978004,8.022,0.0,0.0,0.0
8,99.980995,0.019,0.0,0.0,0.0
9,0.036,99.963997,0.0,0.0,0.0


In [23]:
ym_pred.round(5)*100

Unnamed: 0,0,1,2,3,4
0,0.0,0.048,6.493,93.459999,0.0
1,100.0,0.0,0.0,0.0,0.0
2,100.0,0.0,0.0,0.0,0.0
3,100.0,0.0,0.0,0.0,0.0
4,100.0,0.0,0.0,0.0,0.0
5,100.0,0.0,0.0,0.0,0.0
6,100.0,0.0,0.0,0.0,0.0
7,100.0,0.0,0.0,0.0,0.0
8,100.0,0.0,0.0,0.0,0.0
9,0.003,99.946999,0.014,0.019,0.018


The predictions of the two models give a straightforward probability for each category.

Now I'm going to apply these models to the entire dataset for both the male and the female targets.

In [24]:
# scale the entire feature set X with the scaler fitted on the training set
X_featured = sc_X.transform(X)
Y_predict_Male = model_m.predict(X_featured)
Y_predict_Female = model_f.predict(X_featured)

Check the predicted probabilities:

In [25]:
Male_Probability = pd.DataFrame(Y_predict_Male)
Female_Probability = pd.DataFrame(Y_predict_Female)

In [26]:
Male_Probability.columns = ['stage_1M', 'stage_2M', 'stage_3M', 'stage_4M', 'stage_5M']
Male_Probability.round(5).head()

Unnamed: 0,stage_1M,stage_2M,stage_3M,stage_4M,stage_5M
0,0.0,1e-05,1e-05,0.99997,0.0
1,0.0,5e-05,0.0,0.0,0.99995
2,0.0,0.0,1e-05,0.99998,0.0
3,0.0,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,0.99999,1e-05


In [29]:
Female_Probability.columns = ['stage_1F', 'stage_2F', 'stage_3F', 'stage_4F', 'stage_5F']
Female_Probability.round(5).head()

Unnamed: 0,stage_1F,stage_2F,stage_3F,stage_4F,stage_5F
0,0.0,0.0,0.0,0.00468,0.99532
1,0.0,0.0,0.0,1e-05,0.99999
2,0.0,0.0,0.00056,0.9955,0.00394
3,0.0,0.0,0.0,1e-05,0.99999
4,0.0,0.0,0.0,0.0,1.0


The Male_Probability and Female_Probability DataFrames show the probability of each patient falling into each CKD stage.
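
To read off the most likely stage per patient from such a table, one option is `DataFrame.idxmax(axis=1)`, which returns the name of the column holding the largest value in each row. A sketch using the five Male_Probability rows shown above (the variable `male_prob_demo` is illustrative):

```python
import pandas as pd

# the first five rows of Male_Probability, copied from the output above
male_prob_demo = pd.DataFrame(
    [[0.0, 1e-05, 1e-05, 0.99997, 0.0],
     [0.0, 5e-05, 0.0, 0.0, 0.99995],
     [0.0, 0.0, 1e-05, 0.99998, 0.0],
     [0.0, 0.0, 0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0, 0.99999, 1e-05]],
    columns=['stage_1M', 'stage_2M', 'stage_3M', 'stage_4M', 'stage_5M'])

# column with the highest probability in each row
most_likely = male_prob_demo.idxmax(axis=1)
print(most_likely.tolist())
```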


In [30]:
y.head() 

index,Male_Stages,Female_Stages
0,Stage 4,Stage 5
1,Stage 5,Stage 5
2,Stage 4,Stage 4
3,Stage 5,Stage 5
4,Stage 4,Stage 5


So, looking at these three tables (y, Male_Probability, Female_Probability), I want to return the gender with the highest probability.

For example, at index 0 the male score indicates Stage 4 and the female score indicates Stage 5.
The Male_Probability and Female_Probability tables show that male at Stage 4 has a higher probability (0.99997) than female at Stage 5 (0.99532). Therefore we can predict that the patient at index 0 is likely to be male.


As another example, at index 1 both the male and the female scores indicate Stage 5, so we look at the probability tables to see which gender has the higher probability at Stage 5 and return that gender, which here is Female (0.99999 against 0.99995).

In [31]:
final = pd.concat([y, Male_Probability.round(5), Female_Probability.round(5)], axis = 1)
final.head()

index,Male_Stages,Female_Stages,stage_1M,stage_2M,stage_3M,stage_4M,stage_5M,stage_1F,stage_2F,stage_3F,stage_4F,stage_5F
0,Stage 4,Stage 5,0.0,1e-05,1e-05,0.99997,0.0,0.0,0.0,0.0,0.00468,0.99532
1,Stage 5,Stage 5,0.0,5e-05,0.0,0.0,0.99995,0.0,0.0,0.0,1e-05,0.99999
2,Stage 4,Stage 4,0.0,0.0,1e-05,0.99998,0.0,0.0,0.0,0.00056,0.9955,0.00394
3,Stage 5,Stage 5,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1e-05,0.99999
4,Stage 4,Stage 5,0.0,0.0,0.0,0.99999,1e-05,0.0,0.0,0.0,0.0,1.0


In [32]:
# replace each Male_Stages label with the predicted probability of that stage
colname = ['Male_Stages']
for col in colname:
    final['temp'] = np.nan
        
    condition = (final['Male_Stages'] == 'Stage 1')
    final.loc[condition,'temp'] = final.stage_1M
    
    condition = (final['Male_Stages'] == 'Stage 2')
    final.loc[condition,'temp'] = final.stage_2M
    
    condition = (final['Male_Stages'] == 'Stage 3')
    final.loc[condition,'temp'] = final.stage_3M
    
    condition = (final['Male_Stages'] == 'Stage 4')
    final.loc[condition,'temp'] = final.stage_4M
    
    condition = (final['Male_Stages'] == 'Stage 5')
    final.loc[condition,'temp'] = final.stage_5M
    
    
    final[col] = final['temp']  
    final.drop('temp',axis=1,inplace=True)
    
# replace each Female_Stages label with the predicted probability of that stage
colname = ['Female_Stages']
for col in colname:
    final['temp'] = np.nan
        
    condition = (final['Female_Stages'] == 'Stage 1')
    final.loc[condition,'temp'] = final.stage_1F
    
    condition = (final['Female_Stages'] == 'Stage 2')
    final.loc[condition,'temp'] = final.stage_2F
    
    condition = (final['Female_Stages'] == 'Stage 3')
    final.loc[condition,'temp'] = final.stage_3F
    
    condition = (final['Female_Stages'] == 'Stage 4')
    final.loc[condition,'temp'] = final.stage_4F
    
    condition = (final['Female_Stages'] == 'Stage 5')
    final.loc[condition,'temp'] = final.stage_5F
    
    
    final[col] = final['temp']  
    final.drop('temp',axis=1,inplace=True)
    
final.head()

index,Male_Stages,Female_Stages,stage_1M,stage_2M,stage_3M,stage_4M,stage_5M,stage_1F,stage_2F,stage_3F,stage_4F,stage_5F
0,0.99997,0.99532,0.0,1e-05,1e-05,0.99997,0.0,0.0,0.0,0.0,0.00468,0.99532
1,0.99995,0.99999,0.0,5e-05,0.0,0.0,0.99995,0.0,0.0,0.0,1e-05,0.99999
2,0.99998,0.9955,0.0,0.0,1e-05,0.99998,0.0,0.0,0.0,0.00056,0.9955,0.00394
3,1.0,0.99999,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1e-05,0.99999
4,0.99999,1.0,0.0,0.0,0.0,0.99999,1e-05,0.0,0.0,0.0,0.0,1.0
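
The two loops above repeat the same five conditions. A sketch of a more compact lookup, assuming the probability columns are named `stage_<n><suffix>` as above; the helper name `stage_probability` and the frame `demo` are illustrative:

```python
import numpy as np
import pandas as pd

def stage_probability(frame, label_col, suffix):
    """For each row, return the probability of the stage named in label_col."""
    prob_cols = [f"stage_{n}{suffix}" for n in range(1, 6)]
    probs = frame[prob_cols].to_numpy()
    # 'Stage 4' -> 'stage_4M' (for suffix 'M')
    wanted = "stage_" + frame[label_col].str[-1] + suffix
    col_idx = pd.Index(prob_cols).get_indexer(wanted)
    # pick one column per row
    return probs[np.arange(len(frame)), col_idx]

# male columns of the first five rows of `final` before the labels are replaced
demo = pd.DataFrame({
    'Male_Stages': ['Stage 4', 'Stage 5', 'Stage 4', 'Stage 5', 'Stage 4'],
    'stage_1M': [0.0, 0.0, 0.0, 0.0, 0.0],
    'stage_2M': [1e-05, 5e-05, 0.0, 0.0, 0.0],
    'stage_3M': [1e-05, 0.0, 1e-05, 0.0, 0.0],
    'stage_4M': [0.99997, 0.0, 0.99998, 0.0, 0.99999],
    'stage_5M': [0.0, 0.99995, 0.0, 1.0, 1e-05],
})

print(stage_probability(demo, 'Male_Stages', 'M'))
```

Calling the same helper with `'Female_Stages'` and `'F'` would cover the second loop.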


Now that we have the probabilities, we need to return the gender with the higher probability.

In [33]:
def FindGenderS(final):
    if final['Male_Stages'] >  final['Female_Stages']:
        return 'Male'
    else:
        return 'Female'

final['Gender'] =  final.apply(FindGenderS, axis=1)
final['Gender'].value_counts()

Male      121
Female     36
Name: Gender, dtype: int64
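
The same decision rule can be sketched without `apply` by comparing the two columns directly with `np.where`. The values are the first five rows of `final` shown above; as in `FindGenderS`, a tie falls to 'Female':

```python
import numpy as np
import pandas as pd

# probability of the labelled stage under each gender assumption
demo = pd.DataFrame({
    'Male_Stages': [0.99997, 0.99995, 0.99998, 1.0, 0.99999],
    'Female_Stages': [0.99532, 0.99999, 0.9955, 0.99999, 1.0],
})

# strictly greater male probability -> 'Male', otherwise (including ties) 'Female'
demo['Gender'] = np.where(demo['Male_Stages'] > demo['Female_Stages'], 'Male', 'Female')
print(demo['Gender'].tolist())
```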

In [34]:
final['Gender']

index
0        Male
1      Female
2        Male
3        Male
4      Female
5      Female
6        Male
7      Female
8      Female
9        Male
10       Male
11     Female
12     Female
13       Male
14       Male
15       Male
16       Male
17     Female
18     Female
19     Female
20     Female
21       Male
22       Male
23     Female
24     Female
25     Female
26     Female
27       Male
28       Male
29       Male
        ...  
127      Male
128      Male
129    Female
130      Male
131    Female
132      Male
133      Male
134      Male
135      Male
136      Male
137    Female
138      Male
139      Male
140      Male
141      Male
142      Male
143    Female
144      Male
145      Male
146      Male
147      Male
148      Male
149      Male
150      Male
151      Male
152    Female
153      Male
154      Male
155      Male
156    Female
Name: Gender, Length: 157, dtype: object