<a href="https://colab.research.google.com/github/placeholder2/Heart-Disease-Prediction/blob/main/hd_pred_pynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Heart Disease Prediction**
**Using XGBoost and over-sampling methods** 

 
The main metric I'll use to evaluate the models is recall, although I'll also keep an eye on precision and F1 scores for better comparison. Accuracy is not very reliable because the dataset is imbalanced, but it is still useful when comparing over-sampling techniques.
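
To illustrate why accuracy alone can mislead on an imbalanced dataset, here is a minimal sketch using made-up labels (95 negatives and 5 positives, not taken from this dataset) and a trivial classifier that always predicts the negative class:

```python
# Hypothetical toy labels: 95 negatives, 5 positives (not the real dataset)
y_true = [0] * 95 + [1] * 5
# A trivial classifier that always predicts the majority (negative) class
y_pred = [0] * 100

# Count true positives, false negatives and correct predictions
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
correct = sum(t == p for t, p in zip(y_true, y_pred))

accuracy = correct / len(y_true)  # share of all predictions that are right
recall = tp / (tp + fn)           # share of actual positives that were found

print(f'Accuracy: {accuracy}')
print(f'Recall: {recall}')
```

Accuracy looks high here even though not a single positive case is detected, which is why recall is the better headline metric for this kind of problem.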

## Importing Libraries and Data

In [None]:
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import joblib

from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from xgboost import XGBClassifier

from imblearn.over_sampling import ADASYN,RandomOverSampler, SMOTE
from imblearn.pipeline import make_pipeline

In [None]:
from google.colab import drive
drive.mount('/content/drive');

Mounted at /content/drive


In [None]:
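# Load the pre-split dataset; the context manager closes the file handle
with open('/content/drive/MyDrive/Colab Notebooks/HD/dataset_dict.pkl', 'rb') as file:
  data = joblib.load(file)
# Relies on the dict preserving exactly this insertion order
X_train,y_train,X_test,y_test,X_val,y_val = data.values()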

In [None]:
def predict(model):
  '''Fit the model on the training set and return predictions for the test set'''

  model.fit(X_train, y_train)
  return model.predict(X_test)

In [None]:
def evaluate(y_pred):
  '''Print precision, accuracy, recall and F1 of y_pred against y_test'''

  precision = precision_score(y_test, y_pred)
  recall = recall_score(y_test, y_pred)
  accuracy = accuracy_score(y_test, y_pred)
  f1 = f1_score(y_test, y_pred)

  report = f'Precision: {precision}\nAccuracy: {accuracy}\nRecall: {recall}\nF1: {f1}'
  print(report)

## **Baseline**
**Classification with XGBoost**

In [None]:
xgb_normal = XGBClassifier()

In [None]:
evaluate(predict(xgb_normal))

Precision: 0.5566502463054187
Accuracy: 0.9255780593599509
Recall: 0.05131698455949137
F1: 0.09397089397089396


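Recall is very low: the baseline detects only about 5% of the positive cases. Accuracy looks high (about 0.93) only because the negative class dominates, and the F1 score of about 0.09 shows that overall performance is weak.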

## **Applying over-sampling**
This time I will use the following methods. A few more exist, and under-sampling or a combination of both approaches to balancing the dataset could also be worth trying.

**RandomOverSampler** - Over-samples the minority class by picking samples at random with replacement. \
\
**SMOTE (Synthetic Minority Over-sampling Technique)** - Selects examples that are close in the feature space, draws a line between them and generates a new sample at a point along that line. \
\
**ADASYN (Adaptive Synthetic Sampling Method)** - A modification of SMOTE. Generates a different number of samples depending on an estimate of the local distribution of the class to be over-sampled.
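
The first two ideas described above can be sketched in plain Python. This is only an illustration, not the imblearn implementation: the helper names `random_oversample` and `smote_like_sample` and the toy points are hypothetical, and real SMOTE picks the second point from the k nearest minority-class neighbours instead of receiving it explicitly.

```python
import random

def random_oversample(minority, n_new, rng):
  '''Duplicate existing minority samples, drawn at random with replacement'''
  return [list(rng.choice(minority)) for _ in range(n_new)]

def smote_like_sample(a, b, rng):
  '''Return a synthetic point on the line segment between samples a and b'''
  gap = rng.random()  # random position along the segment, in [0, 1)
  return [ai + gap * (bi - ai) for ai, bi in zip(a, b)]

rng = random.Random(42)
# Hypothetical minority-class points with two features each
minority = [[1.0, 2.0], [2.0, 4.0], [3.0, 3.0]]

# Random over-sampling only repeats points that already exist
duplicates = random_oversample(minority, 4, rng)

# SMOTE-style sampling creates a new point between two nearby samples
synthetic = smote_like_sample(minority[0], minority[1], rng)

print(duplicates)
print(synthetic)
```

The difference matters for the model: random over-sampling repeats exact copies, while SMOTE and ADASYN add points that were never observed.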


In [None]:
models = [
  make_pipeline(RandomOverSampler(random_state=42), xgb_normal),
  make_pipeline(SMOTE(random_state=42), xgb_normal),
  make_pipeline(ADASYN(random_state=42), xgb_normal),
]

In [None]:
for model in models:
  name = list(model.named_steps)[0]
  print('\n' + name + '\n')
  evaluate(predict(model))


randomoversampler

Precision: 0.1920017434891577
Accuracy: 0.7317189794733426
Recall: 0.8001816530426885
F1: 0.30969329466561213

smote

Precision: 0.2588357588357588
Accuracy: 0.861402370299532
Recall: 0.45231607629427795
F1: 0.3292561983471074

adasyn

Precision: 0.25353283458021614
Accuracy: 0.8640322415383039
Recall: 0.41553133514986373
F1: 0.31491997934950955


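RandomOverSampler gives by far the highest recall (about 0.80, versus 0.45 for SMOTE and 0.42 for ADASYN), at the cost of the lowest precision and accuracy. Since recall is the main metric, I'll continue with RandomOverSampler.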
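
As a quick check of this choice, the sketch below ranks the three samplers using the scores printed above, rounded to four decimals. The `results` dictionary is only a convenience for this illustration and is not part of the pipeline.

```python
# Scores copied from the outputs above, rounded to four decimals
results = {
  'randomoversampler': {'precision': 0.1920, 'recall': 0.8002, 'f1': 0.3097},
  'smote': {'precision': 0.2588, 'recall': 0.4523, 'f1': 0.3293},
  'adasyn': {'precision': 0.2535, 'recall': 0.4155, 'f1': 0.3149},
}

# Pick the sampler with the highest value of each metric
best_by_recall = max(results, key=lambda name: results[name]['recall'])
best_by_f1 = max(results, key=lambda name: results[name]['f1'])

print(f'Best by recall: {best_by_recall}')
print(f'Best by F1: {best_by_f1}')
```

The ranking depends on the metric: RandomOverSampler wins on recall, while SMOTE has a slightly higher F1, so the selection here follows directly from treating recall as the main metric.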

## **Hyperparameter tuning**


In [None]:
#Best model
model = models[0]

#Hyperparameters
params = {
    'xgbclassifier__eval_metric' : ['logloss','error'],
    'xgbclassifier__n_estimators': [50,100,150],
    'xgbclassifier__learning_rate': [0.01,0.1],
    'xgbclassifier__max_depth' : [6,8]
}

#PredefinedSplit: -1 keeps a sample in training, 0 puts it in the single validation fold
split_index = [-1]*len(X_train) + [0]*len(X_val)
X = np.concatenate((X_train, X_val), axis=0)
y = np.concatenate((y_train, y_val), axis=0)
pds = PredefinedSplit(test_fold = split_index)
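
To make the split index used here easier to follow, this is a minimal pure-Python sketch of how I understand `PredefinedSplit` to interpret `test_fold`: entries equal to -1 are never used for validation, and entries equal to 0 form the only validation fold. The sizes are made up for illustration.

```python
# Hypothetical sizes, standing in for len(X_train) and len(X_val)
n_train, n_val = 5, 2

# Same construction as split_index: -1 for training rows, 0 for validation rows
test_fold = [-1] * n_train + [0] * n_val

# For fold 0, validation rows are those labelled 0; all other rows are training rows
val_idx = [i for i, fold in enumerate(test_fold) if fold == 0]
train_idx = [i for i, fold in enumerate(test_fold) if fold != 0]

print('train:', train_idx)
print('validation:', val_idx)
```

Because only one fold label is present, the grid search fits each candidate once on the training rows and scores it once on the validation rows.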

In [None]:
grid = GridSearchCV(estimator = model,
                   cv=pds,
                   param_grid=params,
                   scoring='recall',
                   n_jobs= -1,
                   verbose = 3)

grid.fit(X, y)

Fitting 1 folds for each of 24 candidates, totalling 24 fits


GridSearchCV(cv=PredefinedSplit(test_fold=array([-1, -1, ...,  0,  0])),
             estimator=Pipeline(steps=[('randomoversampler',
                                        RandomOverSampler(random_state=42)),
                                       ('xgbclassifier', XGBClassifier())]),
             n_jobs=-1,
             param_grid={'xgbclassifier__eval_metric': ['logloss', 'error'],
                         'xgbclassifier__learning_rate': [0.01, 0.1],
                         'xgbclassifier__max_depth': [6, 8],
                         'xgbclassifier__n_estimators': [50, 100, 150]},
             scoring='recall', verbose=3)

In [None]:
grid.best_params_

{'xgbclassifier__eval_metric': 'logloss',
 'xgbclassifier__learning_rate': 0.01,
 'xgbclassifier__max_depth': 6,
 'xgbclassifier__n_estimators': 100}

In [None]:
best_model = grid.best_estimator_

# Note: predict() refits best_model on X_train only, then predicts on X_test
evaluate(predict(best_model))

Precision: 0.1769727047146402
Accuracy: 0.7024830082994638
Recall: 0.8097184377838329
F1: 0.2904618392115338
