## Prediction of patient chance of survival

I am going to train a model that predicts a patient's chance of survival one year after treatment in Greenland.

The predictive model is built following the steps below:

1. Exploratory Data Analysis (EDA) - understanding the dataset and the underlying interactions between the different variables
2. Data Pre-processing - preparing the data for modelling
3. Building the model
4. Evaluating the performance of the model, and fine-tuning it if necessary


The target variable is Survived_1_year: whether the patient survived one year after treatment.

In [166]:
# Loading libraries

import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt

# To ignore warnings
# import warnings
# warnings.filterwarnings("ignore")

In [167]:
# Loading the Training dataset

train_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/pharma_data/Training_set_begs.csv')

train_data.head(10)

Unnamed: 0,ID_Patient_Care_Situation,Diagnosed_Condition,Patient_ID,Treated_with_drugs,Patient_Age,Patient_Body_Mass_Index,Patient_Smoker,Patient_Rural_Urban,Patient_mental_condition,A,B,C,D,E,F,Z,Number_of_prev_cond,Survived_1_year
0,22374,8,3333,DX6,56,18.479385,YES,URBAN,Stable,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,0
1,18164,5,5740,DX2,36,22.945566,YES,RURAL,Stable,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1
2,6283,23,10446,DX6,48,27.510027,YES,RURAL,Stable,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
3,5339,51,12011,DX1,5,19.130976,NO,URBAN,Stable,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1
4,33012,0,12513,,128,1.3484,Cannot say,RURAL,Stable,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1
5,10808,45,7977,DX6,47,26.15512,YES,URBAN,Stable,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
6,5138,52,5296,DX3,53,19.103244,NO,RURAL,Stable,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,1
7,17265,9,5947,DX5,3,18.126976,NO,URBAN,Stable,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,0
8,24349,47,6585,DX4,62,25.074482,NO,URBAN,Stable,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1
9,1647,17,10190,DX6,46,17.663877,NO,RURAL,Stable,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,1


In [168]:
# Loading the Testing dataset

test_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/pharma_data/Testing_set_begs.csv')

test_data.head(10)

Unnamed: 0,ID_Patient_Care_Situation,Diagnosed_Condition,Patient_ID,Treated_with_drugs,Patient_Age,Patient_Body_Mass_Index,Patient_Smoker,Patient_Rural_Urban,Patient_mental_condition,A,B,C,D,E,F,Z,Number_of_prev_cond
0,19150,40,3709,DX3,16,29.443894,NO,RURAL,Stable,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0
1,23216,52,986,DX6,24,26.836321,NO,URBAN,Stable,1.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0
2,11890,50,11821,DX4 DX5,63,25.52328,NO,RURAL,Stable,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0
3,7149,32,3292,DX6,42,27.171155,NO,URBAN,Stable,1.0,0.0,1.0,0.0,1.0,0.0,0.0,3.0
4,22845,20,9959,DX3,50,25.556192,NO,RURAL,Stable,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
5,28169,9,2508,DX1 DX2,40,27.085641,NO,RURAL,Stable,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0
6,5672,4,5467,DX6,3,21.248985,NO,RURAL,Stable,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
7,2325,9,7725,DX3,35,18.42861,YES,RURAL,Stable,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
8,31840,24,122,DX6,23,19.061391,NO,RURAL,Stable,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
9,12699,30,11066,DX6,3,23.265954,NO,URBAN,Stable,1.0,0.0,0.0,1.0,1.0,0.0,0.0,3.0


The variables are:

ID_Patient_Care_Situation: Care situation of a patient during treatment

Diagnosed_Condition: The diagnosed condition of the patient

Patient_ID: Patient identifier number

Treated_with_drugs: Class of drugs used during treatment

Survived_1_year: If the patient survived after one year (0 means did not survive; 1 means survived)

Patient_Age: Age of the patient

Patient_Body_Mass_Index: A calculated value based on the patient’s weight, height, etc.

Patient_Smoker: If the patient was a smoker or not

Patient_Rural_Urban: If the patient stayed in Rural or Urban part of the country

Previous_Condition: Condition of the patient before the start of the treatment. This variable is split into 8 columns: A, B, C, D, E, F, Z and Number_of_prev_cond. Columns A, B, C, D, E, F and Z are flags for the individual previous conditions: an entry of 1 in column A means the patient previously had condition A, and 0 means they did not. The same holds for the other conditions. For example, a patient with previous conditions A and C has 1 in columns A and C and 0 in columns B, D, E, F and Z. Number_of_prev_cond is the sum of the seven flags, in this case 1 + 0 + 1 + 0 + 0 + 0 + 0 = 2.
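The relationship between the seven condition flags and Number_of_prev_cond can be sketched on a toy dataframe. The rows below are made up for illustration and are not taken from the dataset; only the column names come from it.

```python
import pandas as pd

# Toy rows in the same layout as the previous-condition columns (values are made up).
cond_cols = ["A", "B", "C", "D", "E", "F", "Z"]
toy = pd.DataFrame(
    [
        [1, 0, 1, 0, 0, 0, 0],  # previous conditions A and C
        [1, 0, 0, 0, 1, 0, 0],  # previous conditions A and E
        [0, 0, 0, 0, 0, 0, 1],  # previous condition Z only
    ],
    columns=cond_cols,
)

# Number_of_prev_cond is described as the row-wise sum of the seven flags.
toy["Number_of_prev_cond"] = toy[cond_cols].sum(axis=1)
print(toy)
```

The same row-wise sum could be compared against the real Number_of_prev_cond column as a consistency check, assuming the description above holds for every row.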





## 1. Exploratory Data Analysis (EDA) - understanding the dataset and the underlying interactions between the different variables

In [169]:
train_data.shape

(23097, 18)

In [170]:
test_data.shape

(9303, 17)

In [171]:
train_data.columns

Index(['ID_Patient_Care_Situation', 'Diagnosed_Condition', 'Patient_ID',
       'Treated_with_drugs', 'Patient_Age', 'Patient_Body_Mass_Index',
       'Patient_Smoker', 'Patient_Rural_Urban', 'Patient_mental_condition',
       'A', 'B', 'C', 'D', 'E', 'F', 'Z', 'Number_of_prev_cond',
       'Survived_1_year'],
      dtype='object')

In [172]:
test_data.columns

Index(['ID_Patient_Care_Situation', 'Diagnosed_Condition', 'Patient_ID',
       'Treated_with_drugs', 'Patient_Age', 'Patient_Body_Mass_Index',
       'Patient_Smoker', 'Patient_Rural_Urban', 'Patient_mental_condition',
       'A', 'B', 'C', 'D', 'E', 'F', 'Z', 'Number_of_prev_cond'],
      dtype='object')

The training dataframe has 18 columns because it includes the target variable Survived_1_year. The test dataframe has the other 17 columns and no target variable.

In [173]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23097 entries, 0 to 23096
Data columns (total 18 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   ID_Patient_Care_Situation  23097 non-null  int64  
 1   Diagnosed_Condition        23097 non-null  int64  
 2   Patient_ID                 23097 non-null  int64  
 3   Treated_with_drugs         23084 non-null  object 
 4   Patient_Age                23097 non-null  int64  
 5   Patient_Body_Mass_Index    23097 non-null  float64
 6   Patient_Smoker             23097 non-null  object 
 7   Patient_Rural_Urban        23097 non-null  object 
 8   Patient_mental_condition   23097 non-null  object 
 9   A                          21862 non-null  float64
 10  B                          21862 non-null  float64
 11  C                          21862 non-null  float64
 12  D                          21862 non-null  float64
 13  E                          21862 non-null  flo

In [174]:
train_data['Survived_1_year'].value_counts(dropna=False)

1    14603
0     8494
Name: Survived_1_year, dtype: int64

No missing values in the Target Variable


In [175]:
np.mean(train_data['Survived_1_year'])*100 

63.22466121141274

The training dataset contains 23097 patient records.

Of these, 14603 patients (63.22%) survived one year after treatment.
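The class shares can also be read as a baseline: a model that always predicts survival would be right for the majority share of the training records. The sketch below rebuilds the target column from the two counts reported above instead of downloading the data; the variable names are illustrative.

```python
import pandas as pd

# Rebuild the target from the counts reported above (14603 survived, 8494 did not).
target = pd.Series([1] * 14603 + [0] * 8494, name="Survived_1_year")

# normalize=True returns proportions instead of raw counts.
share = target.value_counts(normalize=True) * 100
majority_baseline = share.max()

print(share.round(2))
print("always-predict-majority accuracy (%):", round(majority_baseline, 2))
```

Any trained model should beat this majority share to be considered useful.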

## 2. Data Pre-processing - preparing the data for modelling


Omit irrelevant columns

Remove the ID_Patient_Care_Situation, Patient_ID and Patient_mental_condition columns from both train_data and test_data. Every value of Patient_mental_condition is Stable, so dropping that column cannot affect the model.

In [176]:
columns_to_drop = ['ID_Patient_Care_Situation', 'Patient_ID', 'Patient_mental_condition']

train_data = train_data.drop(columns_to_drop, axis =1)

test_data = test_data.drop(columns_to_drop, axis =1)
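One way to confirm that a column such as Patient_mental_condition carries no information is to count its distinct values. This is a minimal sketch on a made-up dataframe; the notebook does not show this check, so treat it as an optional addition.

```python
import pandas as pd

# Made-up rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Patient_mental_condition": ["Stable", "Stable", "Stable"],
    "Patient_Smoker": ["YES", "NO", "YES"],
    "Patient_Age": [56, 36, 48],
})

# nunique(dropna=False) counts NaN as its own value, so a column holding
# one label plus missing entries is not reported as constant.
n_unique = toy.nunique(dropna=False)
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)
```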

In [177]:
train_data.head()

Unnamed: 0,Diagnosed_Condition,Treated_with_drugs,Patient_Age,Patient_Body_Mass_Index,Patient_Smoker,Patient_Rural_Urban,A,B,C,D,E,F,Z,Number_of_prev_cond,Survived_1_year
0,8,DX6,56,18.479385,YES,URBAN,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,0
1,5,DX2,36,22.945566,YES,RURAL,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1
2,23,DX6,48,27.510027,YES,RURAL,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
3,51,DX1,5,19.130976,NO,URBAN,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1
4,0,,128,1.3484,Cannot say,RURAL,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1


Converting non-numeric variables to numeric and handling missing values

In [178]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23097 entries, 0 to 23096
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Diagnosed_Condition      23097 non-null  int64  
 1   Treated_with_drugs       23084 non-null  object 
 2   Patient_Age              23097 non-null  int64  
 3   Patient_Body_Mass_Index  23097 non-null  float64
 4   Patient_Smoker           23097 non-null  object 
 5   Patient_Rural_Urban      23097 non-null  object 
 6   A                        21862 non-null  float64
 7   B                        21862 non-null  float64
 8   C                        21862 non-null  float64
 9   D                        21862 non-null  float64
 10  E                        21862 non-null  float64
 11  F                        21862 non-null  float64
 12  Z                        21862 non-null  float64
 13  Number_of_prev_cond      21862 non-null  float64
 14  Survived_1_year       

In [179]:
train_data['Patient_Smoker'].value_counts(dropna=False)

NO            13246
YES            9838
Cannot say       13
Name: Patient_Smoker, dtype: int64

In [180]:
test_data['Patient_Smoker'].value_counts(dropna=False)

NO     5333
YES    3970
Name: Patient_Smoker, dtype: int64

In [181]:
# imputation - replacing 'Cannot say' with the mode of the variable, which is NO

train_data['Patient_Smoker'] = train_data['Patient_Smoker'].replace('Cannot say', 'NO')

In [182]:
train_data['Patient_Smoker'].value_counts()

NO     13259
YES     9838
Name: Patient_Smoker, dtype: int64
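The replacement above hard-codes 'NO' as the fill value. A small variation, shown here on a made-up series, computes the mode from the known answers so the fill value follows the data. The names known and fill_value are illustrative.

```python
import pandas as pd

# Made-up answers in the same vocabulary as Patient_Smoker.
smoker = pd.Series(["NO", "YES", "NO", "Cannot say", "NO", "YES"], name="Patient_Smoker")

# Compute the mode over the known answers only, then use it as the fill value.
known = smoker[smoker != "Cannot say"]
fill_value = known.mode()[0]

smoker_imputed = smoker.replace("Cannot say", fill_value)
print(smoker_imputed.value_counts())
```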

In [183]:
train_data['Patient_Smoker'] = train_data['Patient_Smoker'].apply(lambda x: 1 if x == 'YES' else 0)
test_data['Patient_Smoker'] = test_data['Patient_Smoker'].apply(lambda x: 1 if x == 'YES' else 0)

In [184]:
train_data['Patient_Smoker'].value_counts()

0    13259
1     9838
Name: Patient_Smoker, dtype: int64

In [185]:
test_data['Patient_Smoker'].value_counts()

0    5333
1    3970
Name: Patient_Smoker, dtype: int64

In [186]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23097 entries, 0 to 23096
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Diagnosed_Condition      23097 non-null  int64  
 1   Treated_with_drugs       23084 non-null  object 
 2   Patient_Age              23097 non-null  int64  
 3   Patient_Body_Mass_Index  23097 non-null  float64
 4   Patient_Smoker           23097 non-null  int64  
 5   Patient_Rural_Urban      23097 non-null  object 
 6   A                        21862 non-null  float64
 7   B                        21862 non-null  float64
 8   C                        21862 non-null  float64
 9   D                        21862 non-null  float64
 10  E                        21862 non-null  float64
 11  F                        21862 non-null  float64
 12  Z                        21862 non-null  float64
 13  Number_of_prev_cond      21862 non-null  float64
 14  Survived_1_year       

In [187]:
train_data['Patient_Rural_Urban'].value_counts(dropna=False)

RURAL    16134
URBAN     6963
Name: Patient_Rural_Urban, dtype: int64

In [188]:
test_data['Patient_Rural_Urban'].value_counts(dropna=False)

RURAL    6502
URBAN    2801
Name: Patient_Rural_Urban, dtype: int64

In [189]:
train_data['Patient_Rural_Urban'] = train_data['Patient_Rural_Urban'].apply(lambda x: 1 if x == 'RURAL' else 0)
test_data['Patient_Rural_Urban'] = test_data['Patient_Rural_Urban'].apply(lambda x: 1 if x == 'RURAL' else 0)

In [190]:
train_data['Patient_Rural_Urban'].value_counts(dropna=False)

1    16134
0     6963
Name: Patient_Rural_Urban, dtype: int64

In [191]:
test_data['Patient_Rural_Urban'].value_counts(dropna=False)

1    6502
0    2801
Name: Patient_Rural_Urban, dtype: int64

In [192]:
train_data.head()

Unnamed: 0,Diagnosed_Condition,Treated_with_drugs,Patient_Age,Patient_Body_Mass_Index,Patient_Smoker,Patient_Rural_Urban,A,B,C,D,E,F,Z,Number_of_prev_cond,Survived_1_year
0,8,DX6,56,18.479385,1,0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,0
1,5,DX2,36,22.945566,1,1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1
2,23,DX6,48,27.510027,1,1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
3,51,DX1,5,19.130976,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1
4,0,,128,1.3484,0,1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1


In [193]:
train_data.describe()

Unnamed: 0,Diagnosed_Condition,Patient_Age,Patient_Body_Mass_Index,Patient_Smoker,Patient_Rural_Urban,A,B,C,D,E,F,Z,Number_of_prev_cond,Survived_1_year
count,23097.0,23097.0,23097.0,23097.0,23097.0,21862.0,21862.0,21862.0,21862.0,21862.0,21862.0,21862.0,21862.0,23097.0
mean,26.413127,33.209768,23.45482,0.425943,0.698532,0.897905,0.136355,0.18507,0.083615,0.393239,0.0537,0.000595,1.75048,0.632247
std,15.030865,19.549882,3.807661,0.494496,0.458905,0.30278,0.343173,0.388363,0.276817,0.48848,0.225431,0.024379,0.770311,0.482204
min,0.0,0.0,1.0893,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
25%,13.0,16.0,20.20555,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
50%,26.0,33.0,23.386199,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0
75%,39.0,50.0,26.788154,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,1.0
max,52.0,149.0,29.999579,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,5.0,1.0


Looking at the summary of the train_data dataframe, Patient_Age ranges from 0 (min) to 149 (max). Extremely young and extremely old ages affect a patient's health status and therefore the predicted chance of survival, so I will remove the rows with an age under 3 or over 100 from the dataframe.

Patient_Body_Mass_Index ranges from 1.0893 (min) to 29.9996 (max). The maximum is a plausible adult BMI, but 1.0893 is not, so I will also remove the rows with a BMI under 10.

In [194]:
# Patient_Age

train_data = train_data[(train_data['Patient_Age'] >= 3) & (train_data['Patient_Age'] <= 100)]

In [195]:
# Patient_Body_Mass_Index	

train_data = train_data[train_data['Patient_Body_Mass_Index'] >= 10]
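Before filtering, it can help to see how many rows the two rules remove. The sketch below applies the same thresholds (age between 3 and 100, BMI of at least 10) to a made-up dataframe and counts the rows that fail either rule.

```python
import pandas as pd

# Made-up rows; two of them violate the rules on purpose.
toy = pd.DataFrame({
    "Patient_Age": [56, 5, 128, 0, 47],
    "Patient_Body_Mass_Index": [18.48, 19.13, 1.35, 22.10, 26.16],
})

# Same thresholds as in the cells above; between() is inclusive on both ends by default.
age_ok = toy["Patient_Age"].between(3, 100)
bmi_ok = toy["Patient_Body_Mass_Index"] >= 10
keep = age_ok & bmi_ok

print("rows dropped:", int((~keep).sum()))
filtered = toy[keep].copy()
```

Building the boolean mask first keeps the row count and the filter in one place, which makes it easier to report how much data was discarded.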

In [196]:
train_data.describe()

Unnamed: 0,Diagnosed_Condition,Patient_Age,Patient_Body_Mass_Index,Patient_Smoker,Patient_Rural_Urban,A,B,C,D,E,F,Z,Number_of_prev_cond,Survived_1_year
count,21961.0,21961.0,21961.0,21961.0,21961.0,20799.0,20799.0,20799.0,20799.0,20799.0,20799.0,20799.0,20799.0,21961.0
mean,26.389554,34.802149,23.460647,0.447976,0.699513,0.897591,0.13693,0.184528,0.084235,0.394971,0.054089,0.0,1.752344,0.635217
std,15.002317,18.465489,3.776395,0.497297,0.45848,0.303192,0.343781,0.387924,0.277746,0.488856,0.226199,0.0,0.772064,0.48138
min,1.0,3.0,17.000336,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
25%,13.0,19.0,20.205511,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
50%,26.0,35.0,23.367231,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0
75%,39.0,51.0,26.792097,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,1.0
max,52.0,66.0,29.999579,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,5.0,1.0


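Checking the same Patient_Age and BMI rules against test_data. The filters are left commented out below, so no rows are removed from test_data and every test record still receives a prediction.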

In [197]:
# Before applying the rules

test_data.describe()

Unnamed: 0,Diagnosed_Condition,Patient_Age,Patient_Body_Mass_Index,Patient_Smoker,Patient_Rural_Urban,A,B,C,D,E,F,Z,Number_of_prev_cond
count,9303.0,9303.0,9303.0,9303.0,9303.0,9303.0,9303.0,9303.0,9303.0,9303.0,9303.0,9303.0,9303.0
mean,26.680426,33.249059,23.429321,0.426744,0.698914,0.89326,0.14232,0.183167,0.087284,0.399441,0.052886,0.0,1.758358
std,15.097842,19.47792,3.769305,0.494631,0.458755,0.308799,0.349396,0.386824,0.282265,0.48981,0.223818,0.0,0.77123
min,1.0,0.0,17.000695,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,14.0,17.0,20.166849,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,27.0,33.0,23.392495,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
75%,40.0,50.0,26.726929,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0
max,52.0,66.0,29.999579,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,5.0


In [198]:
# Patient_Age

# test_data = test_data[(test_data['Patient_Age'] >= 3) & (test_data['Patient_Age'] <= 100)]

In [199]:
# Patient_Body_Mass_Index	

# test_data = test_data[test_data['Patient_Body_Mass_Index'] >= 10]

In [200]:
# the rules above are commented out, so test_data is unchanged

test_data.describe()

Unnamed: 0,Diagnosed_Condition,Patient_Age,Patient_Body_Mass_Index,Patient_Smoker,Patient_Rural_Urban,A,B,C,D,E,F,Z,Number_of_prev_cond
count,9303.0,9303.0,9303.0,9303.0,9303.0,9303.0,9303.0,9303.0,9303.0,9303.0,9303.0,9303.0,9303.0
mean,26.680426,33.249059,23.429321,0.426744,0.698914,0.89326,0.14232,0.183167,0.087284,0.399441,0.052886,0.0,1.758358
std,15.097842,19.47792,3.769305,0.494631,0.458755,0.308799,0.349396,0.386824,0.282265,0.48981,0.223818,0.0,0.77123
min,1.0,0.0,17.000695,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,14.0,17.0,20.166849,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,27.0,33.0,23.392495,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
75%,40.0,50.0,26.726929,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0
max,52.0,66.0,29.999579,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,5.0


In [201]:
train_data['Treated_with_drugs'].value_counts(dropna=False)

DX6                     8160
DX2                     1827
DX5                     1818
DX1                     1746
DX3                     1740
DX4                     1710
DX1 DX2                  427
DX3 DX4                  427
DX1 DX3                  406
DX2 DX4                  405
DX4 DX5                  404
DX1 DX5                  386
DX3 DX5                  384
DX1 DX4                  382
DX2 DX3                  380
DX2 DX5                  380
DX1 DX2 DX5              101
DX1 DX3 DX5               97
DX1 DX2 DX4               94
DX1 DX2 DX3               90
DX3 DX4 DX5               89
DX2 DX3 DX5               86
DX1 DX3 DX4               85
DX2 DX3 DX4               82
DX2 DX4 DX5               78
DX1 DX4 DX5               77
DX1 DX2 DX3 DX4           23
DX1 DX3 DX4 DX5           22
DX2 DX3 DX4 DX5           22
DX1 DX2 DX4 DX5           17
DX1 DX2 DX3 DX5           13
DX1 DX2 DX3 DX4 DX5        3
Name: Treated_with_drugs, dtype: int64

In [202]:
# imputation - replacing missing value with mode of the variable which is 'DX6'

treated_drug_mode = train_data['Treated_with_drugs'].mode()[0]

train_data['Treated_with_drugs'] = train_data['Treated_with_drugs'].fillna(treated_drug_mode)
train_data['Treated_with_drugs'].value_counts(dropna=False)

DX6                     8160
DX2                     1827
DX5                     1818
DX1                     1746
DX3                     1740
DX4                     1710
DX1 DX2                  427
DX3 DX4                  427
DX1 DX3                  406
DX2 DX4                  405
DX4 DX5                  404
DX1 DX5                  386
DX3 DX5                  384
DX1 DX4                  382
DX2 DX3                  380
DX2 DX5                  380
DX1 DX2 DX5              101
DX1 DX3 DX5               97
DX1 DX2 DX4               94
DX1 DX2 DX3               90
DX3 DX4 DX5               89
DX2 DX3 DX5               86
DX1 DX3 DX4               85
DX2 DX3 DX4               82
DX2 DX4 DX5               78
DX1 DX4 DX5               77
DX1 DX2 DX3 DX4           23
DX1 DX3 DX4 DX5           22
DX2 DX3 DX4 DX5           22
DX1 DX2 DX4 DX5           17
DX1 DX2 DX3 DX5           13
DX1 DX2 DX3 DX4 DX5        3
Name: Treated_with_drugs, dtype: int64

In [203]:
test_data['Treated_with_drugs'].value_counts(dropna=False)

# no missing data for the variable in test_data

DX6                     3462
DX4                      785
DX5                      782
DX1                      753
DX3                      747
DX2                      745
DX2 DX4                  181
DX2 DX3                  179
DX1 DX5                  166
DX2 DX5                  165
DX3 DX5                  161
DX1 DX2                  160
DX4 DX5                  157
DX1 DX4                  153
DX1 DX3                  152
DX3 DX4                  148
DX1 DX3 DX4               41
DX1 DX2 DX5               41
DX2 DX3 DX4               40
DX1 DX2 DX3               40
DX3 DX4 DX5               40
DX1 DX2 DX4               38
DX2 DX3 DX5               37
DX1 DX4 DX5               34
DX2 DX4 DX5               33
DX1 DX3 DX5               23
DX1 DX3 DX4 DX5           11
DX2 DX3 DX4 DX5            8
DX1 DX2 DX4 DX5            8
DX1 DX2 DX3 DX5            6
DX1 DX2 DX3 DX4            5
DX1 DX2 DX3 DX4 DX5        2
Name: Treated_with_drugs, dtype: int64

In [204]:
treated_drugs_dummy = train_data['Treated_with_drugs'].str.get_dummies(" ")
treated_drugs_dummy.head()

Unnamed: 0,DX1,DX2,DX3,DX4,DX5,DX6
0,0,0,0,0,0,1
1,0,1,0,0,0,0
2,0,0,0,0,0,1
3,1,0,0,0,0,0
5,0,0,0,0,0,1
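Because a patient can receive several drugs at once, str.get_dummies(" ") splits each entry on spaces and sets one flag per drug. If the test set happened to lack one of the drugs, its dummy dataframe would have fewer columns than the training one. The sketch below, on made-up series, shows one way to guard against that with reindex; the notebook does not need it because both datasets contain DX1 to DX6.

```python
import pandas as pd

train_drugs = pd.Series(["DX6", "DX1 DX2", "DX3 DX4 DX5"])
test_drugs = pd.Series(["DX2", "DX4 DX5"])  # hypothetical test set that lacks some drugs

# One 0/1 column per drug; multi-drug entries set several flags in the same row.
train_dummies = train_drugs.str.get_dummies(" ")

# Align the test dummies to the training columns; drugs absent from the test set become 0.
test_dummies = test_drugs.str.get_dummies(" ").reindex(
    columns=train_dummies.columns, fill_value=0
)
print(test_dummies)
```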


In [205]:
# drop the Treated_with_drugs column from train_data and concat the treated_drugs_dummy dataframe to train_data

train_data = train_data.drop('Treated_with_drugs', axis = 1)

In [206]:
train_data = pd.concat([train_data, treated_drugs_dummy], axis =1)


In [207]:
train_data.head()

Unnamed: 0,Diagnosed_Condition,Patient_Age,Patient_Body_Mass_Index,Patient_Smoker,Patient_Rural_Urban,A,B,C,D,E,F,Z,Number_of_prev_cond,Survived_1_year,DX1,DX2,DX3,DX4,DX5,DX6
0,8,56,18.479385,1,0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,0,0,0,0,0,0,1
1,5,36,22.945566,1,1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1,0,1,0,0,0,0
2,23,48,27.510027,1,1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0,0,0,0,0,0,1
3,51,5,19.130976,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1,1,0,0,0,0,0
5,45,47,26.15512,1,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0,0,0,0,0,0,1


In [208]:
# dummy variables (treated_drugs_dummy_testdata) for test_data

treated_drugs_dummy_testdata = test_data['Treated_with_drugs'].str.get_dummies(" ")
treated_drugs_dummy_testdata.head()

Unnamed: 0,DX1,DX2,DX3,DX4,DX5,DX6
0,0,0,1,0,0,0
1,0,0,0,0,0,1
2,0,0,0,1,1,0
3,0,0,0,0,0,1
4,0,0,1,0,0,0


In [209]:
# drop the Treated_with_drugs column from test_data and concat the treated_drugs_dummy_testdata dataframe to test_data

test_data = test_data.drop('Treated_with_drugs', axis = 1)

In [210]:
test_data = pd.concat([test_data, treated_drugs_dummy_testdata], axis =1)


In [211]:
test_data.head()

Unnamed: 0,Diagnosed_Condition,Patient_Age,Patient_Body_Mass_Index,Patient_Smoker,Patient_Rural_Urban,A,B,C,D,E,F,Z,Number_of_prev_cond,DX1,DX2,DX3,DX4,DX5,DX6
0,40,16,29.443894,0,1,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,0,0,1,0,0,0
1,52,24,26.836321,0,0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,0,0,0,0,0,1
2,50,63,25.52328,0,1,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,0,0,0,1,1,0
3,32,42,27.171155,0,0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,3.0,0,0,0,0,0,1
4,20,50,25.556192,0,1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0,0,1,0,0,0


In [212]:
# Dealing with NaN values in Previous condition columns

train_data['A'].value_counts(dropna=False)

1.0    18669
0.0     2130
NaN     1162
Name: A, dtype: int64

In [213]:
train_data.isnull().sum()

Diagnosed_Condition           0
Patient_Age                   0
Patient_Body_Mass_Index       0
Patient_Smoker                0
Patient_Rural_Urban           0
A                          1162
B                          1162
C                          1162
D                          1162
E                          1162
F                          1162
Z                          1162
Number_of_prev_cond        1162
Survived_1_year               0
DX1                           0
DX2                           0
DX3                           0
DX4                           0
DX5                           0
DX6                           0
dtype: int64

For the NaN values in the previous-condition columns, I have decided to drop the affected rows from the dataframe (1162 rows, each missing all eight columns), because single imputation may create even more class imbalance in these variables.

In [214]:
train_data = train_data.dropna()
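A plain dropna() removes a row if any column is missing. Here that is equivalent to dropping the rows with missing previous-condition values, since no other column has gaps. A slightly more explicit variant, sketched on made-up data, restricts the check to the columns of interest and reports how many rows are lost.

```python
import numpy as np
import pandas as pd

# Made-up rows; the second row is missing all previous-condition values.
toy = pd.DataFrame({
    "A": [1.0, np.nan, 0.0, 1.0],
    "Z": [0.0, np.nan, 1.0, 0.0],
    "Number_of_prev_cond": [1.0, np.nan, 1.0, 1.0],
    "Survived_1_year": [0, 1, 1, 0],
})

prev_cols = ["A", "Z", "Number_of_prev_cond"]
before = len(toy)

# Restricting dropna to the previous-condition columns avoids removing rows for unrelated gaps.
cleaned = toy.dropna(subset=prev_cols)
print(f"dropped {before - len(cleaned)} of {before} rows")
```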

In [215]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20799 entries, 0 to 23096
Data columns (total 20 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Diagnosed_Condition      20799 non-null  int64  
 1   Patient_Age              20799 non-null  int64  
 2   Patient_Body_Mass_Index  20799 non-null  float64
 3   Patient_Smoker           20799 non-null  int64  
 4   Patient_Rural_Urban      20799 non-null  int64  
 5   A                        20799 non-null  float64
 6   B                        20799 non-null  float64
 7   C                        20799 non-null  float64
 8   D                        20799 non-null  float64
 9   E                        20799 non-null  float64
 10  F                        20799 non-null  float64
 11  Z                        20799 non-null  float64
 12  Number_of_prev_cond      20799 non-null  float64
 13  Survived_1_year          20799 non-null  int64  
 14  DX1                   

In [216]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9303 entries, 0 to 9302
Data columns (total 19 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Diagnosed_Condition      9303 non-null   int64  
 1   Patient_Age              9303 non-null   int64  
 2   Patient_Body_Mass_Index  9303 non-null   float64
 3   Patient_Smoker           9303 non-null   int64  
 4   Patient_Rural_Urban      9303 non-null   int64  
 5   A                        9303 non-null   float64
 6   B                        9303 non-null   float64
 7   C                        9303 non-null   float64
 8   D                        9303 non-null   float64
 9   E                        9303 non-null   float64
 10  F                        9303 non-null   float64
 11  Z                        9303 non-null   float64
 12  Number_of_prev_cond      9303 non-null   float64
 13  DX1                      9303 non-null   int64  
 14  DX2                     

No NaN (missing values) in test_data dataframe

## 3. Building the model

Separating Input variables and output variable

In [217]:
X = train_data.drop('Survived_1_year', axis = 1)
y = train_data['Survived_1_year']

In [218]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
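The split above is a plain random split. Since the target is imbalanced (roughly 63% survived), one option is to pass stratify so that both parts keep the same class proportions. This is a sketch on synthetic data and was not part of the original run; X_toy and y_toy are illustrative names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 4))       # synthetic features
y_toy = np.array([1] * 630 + [0] * 370)  # roughly the 63/37 split seen in the target

# stratify=y_toy keeps the class proportions equal in both parts of the split.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_val.mean())
```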

Developing the model

In [219]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(max_iter=1000)
mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)
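The model above is trained on unscaled features, which mix values such as Patient_Age (3 to 66) with 0/1 flags. The scikit-learn documentation notes that multi-layer perceptrons are sensitive to feature scaling, so a possible refinement is to standardize the inputs inside a pipeline. The sketch below uses synthetic data and is an alternative to consider, not what produced the results in this notebook.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales, loosely mimicking an age and a 0/1 flag.
X_toy = np.column_stack([rng.uniform(3, 66, 400), rng.integers(0, 2, 400)])
y_toy = (X_toy[:, 0] > 35).astype(int)

# The scaler is fitted only on the data passed to fit(), then reused at predict time.
model = make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000, random_state=0))
model.fit(X_toy, y_toy)
preds = model.predict(X_toy)
```

Wrapping the scaler and the classifier in one pipeline means the same transformation is applied automatically to any data passed to predict, including test_data.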

In [220]:
y_pred

array([1, 1, 0, ..., 1, 1, 0])

In [221]:
np.mean(y_pred)*100

69.13461538461539

## 4. Evaluating the performance of the model, and possibly fine-tune and tweak it if necessary

Checking how accurate the trained model is using the confusion matrix and the accuracy_score and f1_score functions from sklearn.metrics

Confusion Matrix

In [222]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

array([[1000,  579],
       [ 284, 2297]])

In [223]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

print("True Positive", tp)
print("True Negative", tn)
print("False Positive", fp)
print("False Negative", fn)

True Positive 2297
True Negative 1000
False Positive 579
False Negative 284


Accuracy

In [224]:
from sklearn.metrics import accuracy_score

In [225]:
acc = accuracy_score(y_test, y_pred)
acc

0.7925480769230769

F1 Score

In [226]:
from sklearn.metrics import f1_score
print("F1 score: ", f1_score(y_test, y_pred))

F1 score:  0.8418544988088693
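The accuracy and F1 score reported above can be reproduced by hand from the four confusion-matrix counts, which makes it clearer what each metric measures. The sketch also computes the accuracy of always predicting survival on the same validation split, as a point of comparison.

```python
# Counts taken from the confusion matrix above.
tn, fp, fn, tp = 1000, 579, 284, 2297
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of predicted survivors who actually survived
recall = tp / (tp + fn)     # share of actual survivors the model found
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that predicts "survived" for every patient.
always_survive_accuracy = (tp + fn) / total

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"always-survive baseline accuracy={always_survive_accuracy:.4f}")
```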


Prediction using the model with test_data

In [227]:
prediction = mlp.predict(test_data)

In [228]:
prediction

array([1, 1, 1, ..., 1, 0, 0])

In [229]:
prediction_series = pd.DataFrame(prediction)
prediction_series = prediction_series.rename(columns = {0:"prediction"})
prediction_series.head()

Unnamed: 0,prediction
0,1
1,1
2,1
3,0
4,1


In [230]:
prediction_series.to_csv('prediction.csv', header=True)
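The to_csv call above also writes the row index as an unnamed first column. If the file is expected to contain only the prediction column (an assumption, since the required format is not stated here), index=False leaves the index out. The sketch writes to an in-memory buffer instead of a file so the result can be inspected directly.

```python
import io

import numpy as np
import pandas as pd

prediction = np.array([1, 1, 1, 0, 1])  # the first five predictions shown above
out = pd.DataFrame({"prediction": prediction})

buffer = io.StringIO()
# index=False leaves out the row index, so the output has a single "prediction" column.
out.to_csv(buffer, index=False)
lines = buffer.getvalue().splitlines()
print(lines)
```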


In [231]:
len(prediction)

9303

In [232]:
np.mean(prediction)*100

68.63377405138128