# Diabetes Predictor Model - Modeling

### In this notebook I will:
### 1. Apply 2-3 different modeling methods
### 2. Apply model hyperparameter tuning methods
### 3. Define the metrics I use to choose my final model
### 4. Evaluate the performance of the different models
### 5. Identify one of the models as the best model

##### The main objective is to develop a binary classification model that predicts whether an individual has diabetes from their personal and health indicators

### Imports

In [16]:
!pip install xgboost

Collecting xgboost
  Downloading xgboost-3.0.0-py3-none-macosx_12_0_arm64.whl.metadata (2.1 kB)
Downloading xgboost-3.0.0-py3-none-macosx_12_0_arm64.whl (2.0 MB)
Installing collected packages: xgboost
Successfully installed xgboost-3.0.0


In [72]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.graphics.api import abline_plot
from sklearn.metrics import (
    mean_squared_error, r2_score, accuracy_score, roc_auc_score,
    confusion_matrix, classification_report, precision_score, recall_score,
    f1_score, precision_recall_curve, make_scorer,
)
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV

import xgboost as xgb
from xgboost import XGBClassifier, DMatrix

#show plots inline
%matplotlib inline

##### Because this is a binary classification problem, I am going to use the following modeling methods:
##### 1. XGBoost (tree-based model)
##### 2. K-Nearest Neighbors (KNN) (instance-based model)
##### 3. Support Vector Machine (SVM) (margin-based model)

### Loading the Data

In [9]:
# Reading the saved CSVs
# Using .squeeze() to turn each single-column target DataFrame into a Series

X_train = pd.read_csv('X_train.csv')
X_test = pd.read_csv('X_test.csv')
y_train = pd.read_csv('y_train.csv').squeeze()
y_test = pd.read_csv('y_test.csv').squeeze()

In [11]:
# Ensuring the shapes match

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(202944, 21) (202944,)
(50736, 21) (50736,)


In [28]:
print(X_train.dtypes)

HighBP                  int64
HighChol                int64
CholCheck               int64
BMI                     int64
Smoker                  int64
Stroke                  int64
HeartDiseaseorAttack    int64
PhysActivity            int64
Fruits                  int64
Veggies                 int64
HvyAlcoholConsump       int64
AnyHealthcare           int64
NoDocbcCost             int64
GenHlth                 int64
MentHlth                int64
PhysHlth                int64
DiffWalk                int64
Sex                     int64
Age                     int64
Education               int64
Income                  int64
dtype: object


In [30]:
print(y_train.dtypes)

int64


### XGBoost Model

In [95]:
# The data is imbalanced ('Diabetes' is the minority class), so I compute the negative-to-positive ratio
n_neg = (y_train == 0).sum()
n_pos = (y_train == 1).sum()
imbalance_ratio = n_neg / n_pos

# Instantiating my XGBClassifier model
xgb_clf = XGBClassifier(
    objective='binary:logistic',
    booster='gbtree',
    eval_metric='logloss',
    scale_pos_weight=imbalance_ratio,   # weight the positive class by the negative/positive ratio
    n_estimators=300,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    n_jobs=-1
)
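
The imbalance ratio computed in this cell is the number of negative training labels divided by the number of positive ones. A minimal sketch of that calculation on a toy label list (the labels and the `toy_` names are made up for illustration, not taken from the dataset):

```python
# Toy labels, invented for illustration: 8 negatives (0) and 2 positives (1)
toy_labels = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]

n_neg_toy = sum(1 for label in toy_labels if label == 0)
n_pos_toy = sum(1 for label in toy_labels if label == 1)

# Negatives per positive; a value above 1 means the positive class is the minority
toy_ratio = n_neg_toy / n_pos_toy
print(toy_ratio)  # 4.0
```

With a ratio of 4.0, each positive example would count roughly as much as four negatives when the value is passed as `scale_pos_weight`.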

In [87]:
# Fitting my model
xgb_clf.fit(X_train, y_train)

# Predicting
y_pred = xgb_clf.predict(X_test)
# Keeping only column 1: the probability that each sample belongs to the positive class
y_pred_proba = xgb_clf.predict_proba(X_test)[:, 1]
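
`predict` returns hard labels while `predict_proba` returns probabilities. For a binary classifier the hard label is, as far as I know, equivalent to a 0.5 cut-off on the positive-class probability. The sketch below uses made-up probabilities (not model output) and a hypothetical helper `to_labels` to show how a lower cut-off flags more samples as positive:

```python
# Made-up positive-class probabilities, standing in for y_pred_proba
toy_proba = [0.10, 0.35, 0.48, 0.52, 0.80]

def to_labels(probabilities, threshold):
    """Label a sample 1 when its positive-class probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

labels_default = to_labels(toy_proba, 0.5)   # mimics the usual 0.5 cut-off
labels_lenient = to_labels(toy_proba, 0.3)   # hypothetical lower cut-off

print(labels_default)  # [0, 0, 0, 1, 1]
print(labels_lenient)  # [0, 1, 1, 1, 1]
```

Keeping the probabilities around is what makes threshold-free metrics such as ROC-AUC possible later on.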

In [89]:
# Getting my first baseline
print(f'Accuracy : {accuracy_score(y_test, y_pred)}')
print(f'Precision : {precision_score(y_test, y_pred)}')
print(f'Recall : {recall_score(y_test, y_pred)}')
print(f'F1 : {f1_score(y_test, y_pred)}')
print(f'ROC-AUC : {roc_auc_score(y_test, y_pred_proba)}')

Accuracy : 0.7223470514033428
Precision : 0.33690297708288713
Recall : 0.7869918699186992
F1 : 0.4718233287090848
ROC-AUC : 0.8256563513095054
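
The F1 score is the harmonic mean of precision and recall, so the values printed above can be cross-checked by hand. A small sanity check using the reported numbers:

```python
# Values copied from the baseline output above
precision = 0.33690297708288713
recall = 0.7869918699186992

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.4718
```

The result agrees with the reported F1. It also shows why F1 stays low here: recall is high, but a precision of about 0.34 means roughly two out of three positive predictions are false alarms.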


In [97]:
# Applying hyperparameter tuning
param_grid = {
    'n_estimators': [200, 400, 800],
    'learning_rate': [0.03, 0.1],
    'max_depth': [4, 6, 8],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'scale_pos_weight': [1, imbalance_ratio/2, imbalance_ratio, imbalance_ratio*2]
}
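
To get a feel for the size of this search space, the sketch below counts the combinations in the grid and compares them with what a randomized search fits. The per-parameter counts mirror `param_grid`; `n_iter=20` and `cv=5` are the settings used in the search cell of this notebook:

```python
from math import prod

# Number of candidate values per hyperparameter, mirroring param_grid
values_per_param = {
    'n_estimators': 3,
    'learning_rate': 2,
    'max_depth': 3,
    'subsample': 2,
    'colsample_bytree': 2,
    'scale_pos_weight': 4,
}

total_combinations = prod(values_per_param.values())
print(total_combinations)  # 288

# A randomized search fits n_iter sampled combinations on each of the cv folds
n_iter, cv = 20, 5
cv_fits = n_iter * cv
print(cv_fits)  # 100
```

An exhaustive grid search over the same space would need 288 x 5 = 1440 cross-validation fits, which is why sampling 20 combinations is the cheaper option on a training set of this size.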

In [101]:
# Using RandomizedSearchCV to determine the best hyperparameters for the XGB model
recall_scorer = make_scorer(recall_score)

rand = RandomizedSearchCV(
    estimator=XGBClassifier(eval_metric='logloss'),
    param_distributions=param_grid,
    n_iter=20,
    scoring=recall_scorer,
    cv=5,
    random_state=42,
    n_jobs=1
)

rand.fit(X_train, y_train)

best_xgb = rand.best_estimator_
y_pred = best_xgb.predict(X_test)

print(f'Best scale_pos_weight: {best_xgb.get_params()["scale_pos_weight"]}')
print(f'Recall on test: {recall_score(y_test, y_pred)}')

Best scale_pos_weight: 10.691138765555625
Recall on test: 0.9034396497811132


##### Missing a true diabetic (a false negative) is more harmful than a false alarm (a false positive), so I am using recall as my metric
##### The search picks the hyperparameters that maximize recall on the cross-validation folds (catching as many diabetes cases as possible)
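
Recall on its own can be pushed to 1.0 by flagging everyone as positive, which is why it should be read alongside precision. A toy illustration with invented labels (not the notebook's data):

```python
# Toy ground truth, invented for illustration: 2 positives out of 10
toy_true = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
# A degenerate "model" that flags every sample as positive
toy_pred = [1] * len(toy_true)

tp = sum(1 for t, p in zip(toy_true, toy_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(toy_true, toy_pred) if t == 1 and p == 0)
fp = sum(1 for t, p in zip(toy_true, toy_pred) if t == 0 and p == 1)

toy_recall = tp / (tp + fn)      # share of true positives that were caught
toy_precision = tp / (tp + fp)   # share of positive predictions that were right

print(toy_recall, toy_precision)  # 1.0 0.2
```

A search scored only on recall will tend to drift toward this behaviour, for example by preferring the largest `scale_pos_weight` on offer.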

In [None]:
# Checking my confusion matrix as high recall can come with more false positives
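
One way this check could look is sketched below. The helper `confusion_counts` and the toy labels are hypothetical and only show how the four cells are counted. In the notebook itself, `confusion_matrix(y_test, y_pred)` from scikit-learn (already imported) returns the same counts in the `[[TN, FP], [FN, TP]]` layout.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary 0/1 labels."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp

# Toy labels, invented for illustration (not the notebook's results)
toy_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(toy_true, toy_pred)
print([[tn, fp], [fn, tp]])  # [[3, 3], [1, 3]]
```

In this toy case three of the four positives are caught (recall 0.75), but half of the negatives are flagged as well, which is the trade-off the comment above refers to.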
