# Heart Patient Prognosis

Data Scientist:   __Gail Wittich__\
Email:      gwittich@optusnet.com.au \
Website:    www.linkedin.com/in/gail-wittich \
Copyright:  Copyright 2020, Gail Wittich 

# Build Models

In [None]:
!pip install Boruta

Collecting Boruta
  Downloading https://files.pythonhosted.org/packages/b2/11/583f4eac99d802c79af9217e1eff56027742a69e6c866b295cce6a5a8fc2/Boruta-0.3-py3-none-any.whl (56kB)
Installing collected packages: Boruta
Successfully installed Boruta-0.3


In [None]:
from boruta import BorutaPy                            # for feature selection
from google.colab import drive                         # for accessing files
import matplotlib.pyplot as plt                        # for data visualisation
import numpy as np                                     # for numeric computations
import pickle                                          # for file reading and saving
import pandas as pd                                    # for data analysis
import seaborn as sns                                  # for data visualisation
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression    # Logistic regression model
from sklearn.model_selection import train_test_split, GridSearchCV   # to split the dataset into train and test, hyperparameter tuning
from sklearn.metrics import accuracy_score, f1_score   # for model evaluation
from sklearn.metrics import confusion_matrix           # for decision tree
from sklearn.metrics import classification_report      # for LinearSVC
from sklearn import preprocessing                      # for data preprocessing
from sklearn.preprocessing import LabelEncoder         # for converting categorical to numerical data
from sklearn.svm import LinearSVC                      # for LinearSVC
from sklearn.tree import DecisionTreeClassifier        # for decision tree

from sklearn import tree                               # for decision tree
import pydotplus                                       # for decision tree
import matplotlib.image as pltimg                      # for decision tree

import warnings                                        # to ignore warnings
warnings.filterwarnings('ignore')

### **Load Data**

In [None]:
# mount the google drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
# Unpickle Training data
train_df = pd.read_pickle('/content/drive/My Drive/Colab Notebooks/ML Bootcamp/Heart_Patient/Data/train_data_4_model.pkl')

# Unpickle Testing data
test_df = pd.read_pickle('/content/drive/My Drive/Colab Notebooks/ML Bootcamp/Heart_Patient/Data/new_test_data_4_model.pkl')

### **Prepare Train/Test Data**
1. Separate the input and output variables:
- Features (Input variables) are represented by 'X' (capital)
- Target (Output variables) are represented by 'y' (lowercase)

In [None]:
X = train_df.drop('Survived_1_year',axis = 1)
y = train_df['Survived_1_year']

2. Train/test split:
split the training data into training and test sets (80% / 20%).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

### **Model Building**
The target, 'Survived_1_year', has two possible values:
- 0 - meaning the patient DID NOT survive for one year of treatment,
- 1 - meaning the patient DID survive for one year or longer after treatment.

This is therefore a Classification Problem in Supervised Machine Learning.

Classification models include:
- Logistic Regression, 
- Random Forest Classifier, 
- Decision Tree Classifier, 
- etc.

This script tests several models to find the best model for the data.

### 1. Logistic Regression Model

In [None]:
model = LogisticRegression(max_iter = 1000)     # Allow up to 1000 iterations to avoid a convergence warning.
model.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [None]:
pred = model.predict(X_test)

*Logistic Regression Evaluation:*

In [None]:
print(f1_score(y_test,pred))

0.7839179881467243


The f1 score of the Logistic Regression model is about 78%.
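
For reference, the f1 score is the harmonic mean of precision and recall for the positive class (here, patients who survived). The sketch below computes it by hand in plain Python; the function names and the small label lists are made up for illustration and are not the patient data.

```python
def f1_from_counts(tp, fp, fn):
    """Return the F1 score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f1_binary(y_true, y_pred, positive=1):
    """Count TP/FP/FN for the positive class and return the F1 score."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return f1_from_counts(tp, fp, fn)


# Made-up labels for illustration only (not the patient data)
y_true_demo = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred_demo = [1, 0, 1, 0, 1, 1, 0, 1]
print(round(f1_binary(y_true_demo, y_pred_demo), 3))
```

Unlike plain accuracy, this score ignores true negatives, so it is less flattering when one class dominates.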

Try Random Forest Classifier and see if the result is better...

### 2. Random Forest

In [None]:
forest = RandomForestClassifier(random_state=1, n_estimators=1000, max_depth=5)
 
forest.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=5, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=None, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)

##### Evaluating Random Forest on X_test

In [None]:
y_pred = forest.predict(X_test)

fscore = f1_score(y_test ,y_pred)
fscore

0.838659392049883

The f1 score of the Random Forest classifier is 84%, which is better than logistic regression.

There are a lot of features to train the model on.
Feature selection may improve the accuracy of the Random Forest model.
Reducing the complexity of the data may improve the model, but will the improvement be meaningful?

Boruta Feature Selector will be used.

### 3. Random Forest and Boruta

Boruta is an all-relevant feature selection method. Unlike other techniques that select a small set of features to minimize the error, Boruta tries to capture all the important and interesting features in the dataset with respect to the target variable.

Boruta by default uses random forest but can be used with other algorithms such as LightGBM, XGBoost etc.
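
The core idea behind Boruta is to compare each real feature against shuffled "shadow" copies that cannot carry any real signal. The sketch below illustrates only that idea and is not the BorutaPy implementation: it uses absolute correlation as a stand-in importance score (Boruta uses the estimator's feature importances), it skips the statistical test Boruta applies to the hit counts, and the data, function names and feature names are all synthetic assumptions.

```python
import random


def abs_corr(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


def shadow_hits(features, target, n_iter=20, seed=1):
    """Count how often each feature beats the best shuffled (shadow) feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_iter):
        # Shadow features: shuffled copies, so any link to the target is destroyed
        shadow_scores = []
        for values in features.values():
            shadow = values[:]
            rng.shuffle(shadow)
            shadow_scores.append(abs_corr(shadow, target))
        threshold = max(shadow_scores)
        for name, values in features.items():
            if abs_corr(values, target) > threshold:
                hits[name] += 1
    return hits


# Synthetic data for illustration only: one informative and three noise features
rng = random.Random(0)
target = [rng.randint(0, 1) for _ in range(200)]
features = {"signal": [t + rng.gauss(0, 0.3) for t in target]}
for i in range(3):
    features[f"noise_{i}"] = [rng.gauss(0, 1) for _ in target]

n_iter = 20
hits = shadow_hits(features, target, n_iter=n_iter)
print(hits)
```

A feature that beats the best shadow in nearly every iteration is a candidate for "Confirmed"; one that rarely does is a candidate for "Rejected", and anything in between stays "Tentative".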

In [None]:
# initialize the boruta selector
boruta_selector = BorutaPy(forest, n_estimators='auto', verbose=2, random_state=1)

# fit the boruta selector (including converting data to numpy array for Boruta to read it.)
boruta_selector.fit(np.array(X_train), np.array(y_train))

Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	25
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	25
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	25
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	25
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	25
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	25
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	25
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	17
Tentative: 	3
Rejected: 	5
Iteration: 	9 / 100
Confirmed: 	17
Tentative: 	3
Rejected: 	5
Iteration: 	10 / 100
Confirmed: 	17
Tentative: 	3
Rejected: 	5
Iteration: 	11 / 100
Confirmed: 	17
Tentative: 	3
Rejected: 	5
Iteration: 	12 / 100
Confirmed: 	17
Tentative: 	3
Rejected: 	5
Iteration: 	13 / 100
Confirmed: 	17
Tentative: 	3
Rejected: 	5
Iteration: 	14 / 100
Confirmed: 	17
Tentative: 	3
Rejected: 	5
Iteration: 	15 / 100
Confirmed: 	17
Tentative: 	3
Rejected: 	5
Iteration: 	16 / 100
Confirmed: 	17
Tentative: 	2
Rejected: 	6
I

BorutaPy(alpha=0.05,
         estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                          class_weight=None, criterion='gini',
                                          max_depth=5, max_features='auto',
                                          max_leaf_nodes=None, max_samples=None,
                                          min_impurity_decrease=0.0,
                                          min_impurity_split=None,
                                          min_samples_leaf=1,
                                          min_samples_split=2,
                                          min_weight_fraction_leaf=0.0,
                                          n_estimators=123, n_jobs=None,
                                          oob_score=False,
                                          random_state=RandomState(MT19937) at 0x7FF04AF7ADB0,
                                          verbose=0, warm_start=False),
         max_iter=100, n_estimators='a

In [None]:
# check selected features
print("Selected Features: ", boruta_selector.support_)
 
# check ranking of features
print("Ranking: ",boruta_selector.ranking_)               

print("No. of significant features: ", boruta_selector.n_features_)

Selected Features:  [False  True False  True  True  True  True False  True False False False
  True False  True  True  True  True  True  True  True False  True  True
  True]
Ranking:  [3 1 2 1 1 1 1 3 1 5 6 7 1 8 1 1 1 1 1 1 1 9 1 1 1]
No. of significant features:  17


Boruta has selected 17 significant features (Ranking = 1) from the 25 provided. The table below makes this easier to read.
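
The counts can be checked directly from the ranking array printed above. This sketch assumes the usual BorutaPy convention, as I understand it from the package documentation, that rank 1 means confirmed, rank 2 means tentative and higher ranks mean rejected; the variable names are my own.

```python
# Ranking array copied from the Boruta output shown above
ranking = [3, 1, 2, 1, 1, 1, 1, 3, 1, 5, 6, 7, 1, 8, 1, 1, 1, 1, 1, 1, 1, 9, 1, 1, 1]

confirmed = [i for i, r in enumerate(ranking) if r == 1]   # positions where support_ is True
tentative = [i for i, r in enumerate(ranking) if r == 2]   # undecided features
rejected = [i for i, r in enumerate(ranking) if r > 2]     # judged irrelevant

print(len(ranking), len(confirmed), len(tentative), len(rejected))
```

The positions in `confirmed` are column positions in `X_train`, so they can be used to look up the feature names.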

**Display features in Rank order**

In [None]:
selected_rfe_features = pd.DataFrame({'Feature':list(X_train.columns),
                                      'Ranking':boruta_selector.ranking_})
selected_rfe_features.sort_values(by='Ranking')

| | Feature | Ranking |
|---|---|---|
| 12 | Number_of_prev_cond | 1 |
| 22 | Patient_Smoker_YES | 1 |
| 20 | Patient_Smoker_NO | 1 |
| 19 | DX6 | 1 |
| 18 | DX5 | 1 |
| 17 | DX4 | 1 |
| 16 | DX3 | 1 |
| 15 | DX2 | 1 |
| 14 | DX1 | 1 |
| 23 | Patient_Rural_Urban_RURAL | 1 |


Interesting to see that the missing data for Patient Smoker, which was encoded as UNKNOWN, ranks as a significant feature. Filling with the Mode (which is NO) may have diluted the data.
Similarly for '0' (i.e. zero previous conditions). Not filling with the Mode appears to have been a good decision.

#### Create a new subset of the data with only the selected features

In [None]:
X_important_train = boruta_selector.transform(np.array(X_train))
X_important_test = boruta_selector.transform(np.array(X_test))

#### Build model with selected features

In [None]:
# Create a new random forest classifier for the most important features
rf_important = RandomForestClassifier(random_state=1, n_estimators=1000, n_jobs = -1)

# Train the new classifier on the new dataset containing the most important features
rf_important.fit(X_important_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=-1, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)

#### Evaluation

In [None]:
y_important_pred = rf_important.predict(X_important_test)
rf_imp_fscore = f1_score(y_test, y_important_pred)

In [None]:
print(rf_imp_fscore)

0.8614144129010584


Recall that the Random Forest Classifier with all the features gave an f1 score of 84%. After selecting the relevant features with Boruta, the Random Forest Classifier gives an f1 score of 86.1%, which is an improvement in the performance of the model (i.e. the result), and the complexity is also reduced.

Some of the parameters, such as max_depth and n_estimators, were chosen arbitrarily, and there are many other parameters related to the Random Forest model. Recall the discussion of hyperparameter tuning in our session 'Performance Evaluation'. Hyperparameter tuning helps you choose a set of optimal parameters for a model, so let's see whether it further improves the performance of the model.

Grid Search helps you to find the optimal parameter for a model.
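
As a rough illustration of what a grid search enumerates, the sketch below lists every combination of the grid values used in this notebook and counts the model fits needed with 2-fold cross-validation. It only counts candidates and does not train anything; `cv_folds` and `candidates` are names chosen for this example.

```python
from itertools import product

# Same grid values as the param_grid used in this notebook
param_grid = {
    "bootstrap": [True, False],
    "max_depth": [5, 10, 15],
    "n_estimators": [500, 1000],
}
cv_folds = 2

# Every combination of one value per parameter is a candidate
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(candidates), "candidates,", len(candidates) * cv_folds, "fits")
print(candidates[0])
```

By default GridSearchCV ranks candidates for a classifier by accuracy. Passing `scoring='f1'` would rank them by the same metric used for evaluation here.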

### Hyperparameter Tuning

In [None]:
# Create the parameter grid to search
param_grid = {
    'bootstrap': [True, False],
    'max_depth': [5, 10, 15],
    'n_estimators': [500, 1000]}

In [None]:
rf = RandomForestClassifier(random_state = 1)

# Grid search cv
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 2, n_jobs = -1, verbose = 2)

In [None]:
grid_search.fit(X_important_train, y_train)

Fitting 2 folds for each of 12 candidates, totalling 24 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  24 out of  24 | elapsed:  2.2min finished


GridSearchCV(cv=2, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False, random_state=1,
                                   

In [None]:
grid_search.best_params_

{'bootstrap': True, 'max_depth': 15, 'n_estimators': 1000}

In [None]:
pred = grid_search.predict(X_important_test)

In [None]:
f1_score(y_test, pred)

0.864684263280989

As you can see, the f1 score has improved from 84% (Random Forest with all features) to 86.5% by selecting relevant features with Boruta and choosing good parameters with hyperparameter tuning (GridSearchCV).

### **4. We could also try standardizing and normalizing the data, or some other algorithms, and so on.**

### 5. Decision Tree

In [None]:
dtree = DecisionTreeClassifier()
dtree = dtree.fit(X_train, y_train)

# ERROR: dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.191177 to fit
# data = tree.export_graphviz(dtree, out_file=None, feature_names=X_train.columns)
# graph = pydotplus.graph_from_dot_data(data)
# graph.write_png('mydecisiontree.png')

# img=pltimg.imread('mydecisiontree.png')
# imgplot = plt.imshow(img)
# plt.show()

In [None]:
y_pred_dtree = dtree.predict(X_test)     # keep y_test intact; store the predictions separately

*Decision Tree Evaluation:*

In [None]:
# Compare the decision tree predictions with the true labels
print(f1_score(y_test, y_pred_dtree))
print(confusion_matrix(y_test, y_pred_dtree))

### 6. Linear SVC

In [None]:
'''
LSVC.fit(x1,y1)
y2_LSVC_model = LSVC.predict(x2)
print("LSVC Accuracy :", accuracy_score(y2, y2_LSVC_model))
'''

'\nLSVC.fit(x1,y1)\ny2_LSVC_model = LSVC.predict(x2)\nprint("LSVC Accuracy :", accuracy_score(y2, y2_LSVC_model))\n'

In [None]:
LSVC = LinearSVC()
LSVC.fit(X_train, y_train)
y_test_LSVC_model = LSVC.predict(X_test)
print("LSVC Accuracy :", accuracy_score(y_test, y_test_LSVC_model))

*Linear SVC Evaluation:*

# Model Evaluation and Selection
The f1 scores increased at each step:
- logistic regression: 78%
- random forest with full features: 84%
- random forest on the selected features using Boruta: 86.1%
- add hyperparameter tuning: 86.5%

No scores were recorded in this notebook for:
- standardizing and normalizing the data
- Decision Tree
- Linear SVC

### Save Best Model
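
A minimal sketch of saving and reloading a model with `pickle`, which is already imported at the top of this notebook. The helper names, the file name `best_model.pkl` and the stand-in dictionary are assumptions for illustration; in the notebook the object to save would be the fitted estimator, for example `grid_search.best_estimator_`.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialise a fitted model (or any picklable object) to disk."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a model previously written by save_model."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in object so the sketch runs anywhere; in the notebook this would be
# the fitted estimator, e.g. grid_search.best_estimator_
stand_in_model = {"name": "random_forest", "max_depth": 15, "n_estimators": 1000}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "best_model.pkl")  # hypothetical file name
    save_model(stand_in_model, model_path)
    restored = load_model(model_path)

print(restored == stand_in_model)
```

Because the tuned model was trained on the Boruta-selected columns only, the fitted `boruta_selector` would need to be saved alongside it so that new data can be transformed the same way before prediction.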