# **Classification Model and Evaluation**

## Objectives

* Answer business requirement 2:
    * The client is interested in using patient data to predict whether or not a patient is at risk of heart disease.
* Fit and evaluate a classification model to predict if a patient has heart disease or not.

## Inputs

* outputs/datasets/collection/HeartDiseasePrediction.csv
* Instructions on data cleaning and feature engineering from the relevant notebooks

## Outputs

* Data cleaning, feature engineering and modelling pipelines
* Feature importance plot


---

# Change working directory

* The notebooks are stored in a subfolder, so the working directory must be changed when running a notebook in the editor.

The working directory needs to change from the notebook's current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/home/jfpaliga/CVD-predictor/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/home/jfpaliga/CVD-predictor'

# Load Data

Load the raw dataset and replace values of 0 in RestingBP and Cholesterol with NaN, so that the imputation steps in the ML pipeline can treat them as missing.

In [4]:
import pandas as pd
import numpy as np

df = pd.read_csv("outputs/datasets/collection/HeartDiseasePrediction.csv")

for col in ["RestingBP", "Cholesterol"]:
    df[col] = df[col].replace(0, np.nan)

df.isna().sum()

Age                 0
Sex                 0
ChestPainType       0
RestingBP           1
Cholesterol       172
FastingBS           0
RestingECG          0
MaxHR               0
ExerciseAngina      0
Oldpeak             0
ST_Slope            0
HeartDisease        0
dtype: int64
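
A reading of 0 for resting blood pressure or cholesterol is not a plausible measurement, so it is treated as a missing value. The sketch below uses a small made-up frame (not the project data) to show how the same replacement changes the missing-value counts.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; not taken from the project dataset
toy = pd.DataFrame({
    "RestingBP": [120, 0, 140, 130],
    "Cholesterol": [200, 0, 0, 250],
})

# Same replacement as in the cell above: a reading of 0 becomes NaN
for col in ["RestingBP", "Cholesterol"]:
    toy[col] = toy[col].replace(0, np.nan)

missing_counts = toy.isna().sum()
print(missing_counts)
```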

---

# Classification ML Pipeline

## Pipeline for Data Cleaning, Feature Engineering and Modelling

In [5]:
from sklearn.pipeline import Pipeline

# Data Cleaning
from feature_engine.imputation import MeanMedianImputer, RandomSampleImputer

# Feature Engineering
from feature_engine.discretisation import ArbitraryDiscretiser
from feature_engine.encoding import OrdinalEncoder

# Feature Scaling
from sklearn.preprocessing import StandardScaler

# Feature Selection
from sklearn.feature_selection import SelectFromModel

# ML Algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier, AdaBoostClassifier
from xgboost import XGBClassifier


def ClassificationPipeline(model):

    pipeline = Pipeline([
        ("median_imputation", MeanMedianImputer(imputation_method="median",
                                                variables=["RestingBP"])),
        ("random_sample_imputation", RandomSampleImputer(random_state=1,
                                                         seed='general',
                                                         variables=["Cholesterol"])),
        ("arbitrary_discretisation", ArbitraryDiscretiser(binning_dict={"Oldpeak":[-np.inf, 0, 1.5, np.inf]})),
        ("ordinal_encoding", OrdinalEncoder(encoding_method="arbitrary",
                                            variables=["Sex",
                                                       "ChestPainType",
                                                       "FastingBS",
                                                       "RestingECG",
                                                       "ExerciseAngina",
                                                       "ST_Slope"])),
        ("scaler", StandardScaler()),
        ("feat_selection", SelectFromModel(model)),
        ("model", model),
        ])

    return pipeline
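
To illustrate the Oldpeak binning step, the sketch below applies the same bin edges with pandas' own cut function. It approximates what the discretiser does rather than reproducing its implementation, and it assumes right-closed intervals (the pandas default). The Oldpeak values are made up.

```python
import numpy as np
import pandas as pd

# Made-up Oldpeak values for illustration only
oldpeak = pd.Series([-0.5, 0.0, 0.8, 1.5, 2.3])

# Same bin edges as the pipeline's binning_dict. With right-closed
# intervals, 0.0 falls in the first bin and 1.5 in the second.
bins = [-np.inf, 0, 1.5, np.inf]
oldpeak_binned = pd.cut(oldpeak, bins=bins, labels=False)

print(oldpeak_binned.tolist())
```

Under these assumptions the three bins correspond to Oldpeak values at or below 0, above 0 and up to 1.5, and above 1.5.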

## Split Data into Train and Test Sets

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(["HeartDisease"], axis=1),
    df["HeartDisease"],
    test_size=0.2,
    random_state=0,
)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(734, 11) (734,) (184, 11) (184,)
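
The split above does not pass the stratify argument. As an optional variation, the sketch below shows on made-up data how stratify keeps the proportion of positive cases the same in the train and test sets. The array names are hypothetical and the data is not the project dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 60% in the positive class
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 60 + [0] * 40)

# stratify=y_toy asks for the class balance to be preserved in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print(train_pos_rate, test_pos_rate)
```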


## Hyperparameter Optimisation

* Load the custom hyperparameter optimisation class from Code Institute

In [7]:
from sklearn.model_selection import GridSearchCV


class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

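    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        # Note: the refit argument is not forwarded, so GridSearchCV uses
        # its default (refit=True) and best_estimator_ is available later
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")

            # Wrap the algorithm in the full cleaning/engineering pipeline
            model = ClassificationPipeline(self.models[key])
            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                              verbose=verbose, scoring=scoring)
            gs.fit(X, y)
            self.grid_searches[key] = gs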

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)
        columns = ['estimator', 'min_score',
                   'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]
        return df[columns], self.grid_searches
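
The stacking step inside score_summary can be hard to follow from the code alone. The sketch below reproduces it with made-up scores for three candidates over two splits. Each split's scores become a column, so every row of the stacked array holds one candidate's scores across the splits.

```python
import numpy as np

# Made-up per-split test scores for 3 candidates over 2 CV splits
split0 = np.array([0.80, 0.70, 0.90])
split1 = np.array([0.60, 0.90, 0.70])

n_candidates = 3
# Turn each split into a column vector, then place the columns side by side
scores = [s.reshape(n_candidates, 1) for s in (split0, split1)]
all_scores = np.hstack(scores)  # shape: (n_candidates, n_splits)

# One summary value per candidate, as in the 'mean_score' column
mean_per_candidate = all_scores.mean(axis=1)
print(all_scores.shape, mean_per_candidate)
```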

### Finding the most suitable algorithm with HyperparameterOptimizationSearch

In [8]:
models_quick_search = {
    "LogisticRegression": LogisticRegression(random_state=0),
    "XGBClassifier": XGBClassifier(random_state=0),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
    "ExtraTreesClassifier": ExtraTreesClassifier(random_state=0),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
}

params_quick_search = {
    "LogisticRegression": {},
    "XGBClassifier": {},
    "DecisionTreeClassifier": {},
    "RandomForestClassifier": {},
    "GradientBoostingClassifier": {},
    "ExtraTreesClassifier": {},
    "AdaBoostClassifier": {},
}

Use the **default** hyperparameters to find the best-performing algorithms, scored by recall (as per business requirement 2)
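
Recall is the share of actual positive cases that the model identifies: TP / (TP + FN). The sketch below computes it by hand on made-up labels and compares the result with scikit-learn's recall_score, the function wrapped by make_scorer in the next cell.

```python
from sklearn.metrics import recall_score

# Made-up labels: 1 = heart disease, 0 = no heart disease
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

# True positives: actual 1 predicted as 1. False negatives: actual 1 predicted as 0.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

manual_recall = tp / (tp + fn)
sklearn_recall = recall_score(y_true, y_pred, pos_label=1)
print(manual_recall, sklearn_recall)
```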

In [9]:
from sklearn.metrics import make_scorer, recall_score

search = HyperparameterOptimizationSearch(models=models_quick_search, params=params_quick_search)
search.fit(X_train, y_train,
           scoring =  make_scorer(recall_score, pos_label=1),
           n_jobs=-1, cv=5)


Running GridSearchCV for LogisticRegression 

Fitting 5 folds for each of 1 candidates, totalling 5 fits



Running GridSearchCV for XGBClassifier 

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Running GridSearchCV for DecisionTreeClassifier 

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Running GridSearchCV for RandomForestClassifier 

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Running GridSearchCV for GradientBoostingClassifier 

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Running GridSearchCV for ExtraTreesClassifier 

Fitting 5 folds for each of 1 candidates, totalling 5 fits

Running GridSearchCV for AdaBoostClassifier 

Fitting 5 folds for each of 1 candidates, totalling 5 fits




Results of GridSearch

In [10]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

| | estimator | min_score | mean_score | max_score | std_score |
|---|---|---|---|---|---|
| 1 | XGBClassifier | 0.814815 | 0.855463 | 0.8875 | 0.028487 |
| 0 | LogisticRegression | 0.7625 | 0.847778 | 0.888889 | 0.04661 |
| 3 | RandomForestClassifier | 0.8 | 0.83537 | 0.851852 | 0.018686 |
| 4 | GradientBoostingClassifier | 0.7875 | 0.817963 | 0.8625 | 0.024418 |
| 5 | ExtraTreesClassifier | 0.775 | 0.810525 | 0.8625 | 0.030659 |
| 6 | AdaBoostClassifier | 0.592593 | 0.766019 | 0.925 | 0.112828 |
| 2 | DecisionTreeClassifier | 0.691358 | 0.743272 | 0.7875 | 0.030903 |


The top three algorithms ranked by mean recall were XGBClassifier, LogisticRegression and RandomForestClassifier.

These three algorithms were taken forward for extensive hyperparameter optimisation.

In [17]:
models_search = {
    "XGBClassifier":XGBClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
}

# documentation to help on hyperparameter list: 
# https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn

params_search = {
    "XGBClassifier":{
        "model__learning_rate": [1e-1,1e-2,1e-3],
        "model__n_estimators": [10, 100, 1000],
        "model__max_depth": [3,5,7,9,12,15,17,25],
        "model__min_child_weight": [1,3,5,7],
        "model__subsample": [0.6,0.7,0.8,0.9,1.0],
        "model__colsample_bytree": [0.6,0.7,0.8,0.9,1.0],
        "model__reg_lambda": [0.01,0.1,1.0],
        "model__reg_alpha": [0,0.1,0.5,1.0],
    },
    "LogisticRegression":{
        "model__solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga", "newton-cholesky"],
        "model__penalty": ["l1", "l2", "elasticnet", None],
        "model__C": [100,10,1.0,0.1,0.01,0.001],
    },
    "RandomForestClassifier":{
        "model__max_features": ["sqrt","log2",None],
        "model__n_estimators": [120,300,500,800,1200],
        "model__max_depth": [5,8,15,25,30,None],
        "model__min_samples_split": [1.0,2,5,10,15,100],
        "model__min_samples_leaf": [1,2,5,10],
    }
}
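
The cost of an exhaustive grid search grows multiplicatively with each hyperparameter added. The sketch below multiplies the number of options per hyperparameter, counted from params_search, to give the number of candidates and total fits for each algorithm with 5-fold cross-validation. The dictionary names are hypothetical.

```python
from math import prod

# Number of options per hyperparameter, counted from params_search above
grid_sizes = {
    "XGBClassifier": [3, 3, 8, 4, 5, 5, 3, 4],
    "LogisticRegression": [6, 4, 6],
    "RandomForestClassifier": [3, 5, 6, 6, 4],
}
cv_folds = 5

# Candidates = product of option counts; fits = candidates x folds
candidates = {name: prod(sizes) for name, sizes in grid_sizes.items()}
total_fits = {name: n * cv_folds for name, n in candidates.items()}

for name in grid_sizes:
    print(f"{name}: {candidates[name]} candidates, {total_fits[name]} fits")
```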

Using more extensive hyperparameter options

In [18]:
from sklearn.metrics import recall_score, make_scorer

search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train,
           scoring =  make_scorer(recall_score, pos_label=1),
           n_jobs=-1, cv=5)


Running GridSearchCV for XGBClassifier 

Fitting 5 folds for each of 86400 candidates, totalling 432000 fits

Running GridSearchCV for LogisticRegression 

Fitting 5 folds for each of 144 candidates, totalling 720 fits


330 fits failed out of a total of 720.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
30 fits failed with the following error:
Traceback (most recent call last):
  File "/home/jfpaliga/CVD-predictor/.venv/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/jfpaliga/CVD-predictor/.venv/lib/python3.10/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/home/jfpaliga/CVD-predictor/.venv/lib/python3.10/site-packages/sklearn/pipeline.py", line 476, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
  File "/home/jfpaliga/CVD-predictor/.venv/lib/python3.1


Running GridSearchCV for RandomForestClassifier 

Fitting 5 folds for each of 2160 candidates, totalling 10800 fits
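
The traceback for the failed LogisticRegression fits is truncated, so the exact error is not shown. One explanation consistent with the count is that 11 of the 24 solver and penalty pairs in the grid are unsupported, because 11 pairs x 6 values of C x 5 folds gives 330. The sketch below is one possible way to avoid such failures, assuming scikit-learn's documented solver and penalty compatibility: it passes a list of dictionaries so that only supported pairs are searched. The elasticnet penalty is left out because it also needs an l1_ratio value.

```python
from sklearn.model_selection import ParameterGrid

c_values = [100, 10, 1.0, 0.1, 0.01, 0.001]

# Each dictionary is its own sub-grid, so solvers are only paired
# with penalties they are assumed to support
logreg_grid = [
    {"model__solver": ["newton-cg", "lbfgs", "sag", "newton-cholesky"],
     "model__penalty": ["l2", None],
     "model__C": c_values},
    {"model__solver": ["liblinear"],
     "model__penalty": ["l1", "l2"],
     "model__C": c_values},
    {"model__solver": ["saga"],
     "model__penalty": ["l1", "l2", None],
     "model__C": c_values},
]

# ParameterGrid expands the sub-grids the same way GridSearchCV does
n_candidates = len(ParameterGrid(logreg_grid))
print(n_candidates)
```

A grid of this shape could replace the "LogisticRegression" entry in params_search, since GridSearchCV accepts a list of dictionaries as well as a single dictionary.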


Results of GridSearch

In [19]:
extensive_grid_search_summary, extensive_grid_search_pipelines = search.score_summary(sort_by='mean_score')
extensive_grid_search_summary 

Unnamed: 0,estimator,min_score,mean_score,max_score,std_score,model__colsample_bytree,model__learning_rate,model__max_depth,model__min_child_weight,model__n_estimators,model__reg_alpha,model__reg_lambda,model__subsample,model__C,model__penalty,model__solver,model__max_features,model__min_samples_leaf,model__min_samples_split
44319,XGBClassifier,1.0,1.0,1.0,0.0,0.8,0.01,15,5,10,0.5,0.1,1.0,,,,,,
49574,XGBClassifier,1.0,1.0,1.0,0.0,0.8,0.001,12,7,100,0,1.0,1.0,,,,,,
49563,XGBClassifier,1.0,1.0,1.0,0.0,0.8,0.001,12,7,100,0,0.01,0.9,,,,,,
49564,XGBClassifier,1.0,1.0,1.0,0.0,0.8,0.001,12,7,100,0,0.01,1.0,,,,,,
49565,XGBClassifier,1.0,1.0,1.0,0.0,0.8,0.001,12,7,100,0,0.1,0.6,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
86534,LogisticRegression,,,,,,,,,,,,,0.001,elasticnet,liblinear,,,
86535,LogisticRegression,,,,,,,,,,,,,0.001,elasticnet,sag,,,
86536,LogisticRegression,,,,,,,,,,,,,0.001,elasticnet,saga,,,
86537,LogisticRegression,,,,,,,,,,,,,0.001,elasticnet,newton-cholesky,,,
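
A mean recall of exactly 1.0 with zero standard deviation is worth a second look. Recall alone cannot tell a strong model apart from one that labels every patient as positive. The output above does not show whether that is happening here. The sketch below uses made-up labels to show why precision is a useful companion check.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 3 patients with heart disease, 5 without
y_true = [1, 0, 1, 0, 0, 1, 0, 0]

# A degenerate "model" that predicts heart disease for everyone
y_pred_all_positive = [1] * len(y_true)

recall = recall_score(y_true, y_pred_all_positive, pos_label=1)
precision = precision_score(y_true, y_pred_all_positive, pos_label=1)
print(recall, precision)
```

Here recall is perfect because no positive case is missed, while precision is low because most of the positive predictions are wrong.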


Save the best model and parameters

In [20]:
best_model = extensive_grid_search_summary.iloc[0,0]
best_model

'XGBClassifier'

In [21]:
best_parameters = extensive_grid_search_pipelines[best_model].best_params_
best_parameters

{'model__colsample_bytree': 0.6,
 'model__learning_rate': 0.01,
 'model__max_depth': 3,
 'model__min_child_weight': 1,
 'model__n_estimators': 10,
 'model__reg_alpha': 0,
 'model__reg_lambda': 0.01,
 'model__subsample': 0.6}

Define the pipeline using the findings from hyperparameter optimisation

In [22]:
classification_pipeline = extensive_grid_search_pipelines[best_model].best_estimator_
classification_pipeline
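
The best_estimator_ attribute is available because GridSearchCV refits the best candidate on the full training data by default. The sketch below shows the same pattern end to end on synthetic data, with a deliberately small pipeline and grid. All names prefixed with demo are hypothetical, and the data is generated rather than taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the project dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(random_state=0)),
])

# refit defaults to True, so the best candidate is retrained on all of X_tr
demo_search = GridSearchCV(
    demo_pipeline, {"model__C": [0.1, 1.0, 10]}, cv=5, scoring="recall"
)
demo_search.fit(X_tr, y_tr)

# The refitted pipeline can be used directly on held-out data
demo_best = demo_search.best_estimator_
test_recall = recall_score(y_te, demo_best.predict(X_te), pos_label=1)
print(demo_search.best_params_, test_recall)
```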

---

# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
    # create here your folder
    # os.makedirs(name='')
    pass  # placeholder so the try block is valid until a folder is created
except Exception as e:
    print(e)
