## XGBoost Implementation

In this notebook we implement Extreme Gradient Boosting (XGBoost) and predict whether a person is diabetic or not. The data comes from the Pima Indians Diabetes Database, which you can access on Kaggle [here](https://www.kaggle.com/uciml/pima-indians-diabetes-database).

### Load Libraries 

In [28]:
import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

In [7]:
diabetes = pd.read_csv("diabetes.csv",sep=",")

In [8]:
diabetes.head()

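|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |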


In [9]:
X = diabetes.drop('Outcome',axis=1)
y = diabetes['Outcome']

### Train and Test Split 

In [23]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

### Train the Model with default parameters

In [24]:
xgb = XGBClassifier()

In [25]:
xgb.fit(X_train,y_train)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=4, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

### Test the Model 

In [26]:
pred = xgb.predict(X_test)

### Check Performance of the Model 

In [27]:
accuracy_score(y_test,pred)

0.6883116883116883

In [31]:
print(classification_report(y_test,pred))
pd.crosstab(y_test,pred)

              precision    recall  f1-score   support

           0       0.79      0.71      0.74        99
           1       0.55      0.65      0.60        55

    accuracy                           0.69       154
   macro avg       0.67      0.68      0.67       154
weighted avg       0.70      0.69      0.69       154



| Outcome (actual) | Predicted 0 | Predicted 1 |
|---|---|---|
| 0 | 70 | 29 |
| 1 | 19 | 36 |


The model performs poorly. Recall for the diabetic class (1) is only 0.65: 19 of the 55 patients who have diabetes were predicted as non-diabetic, which is a serious miss for this kind of prediction.
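To make the numbers concrete, here is a minimal sketch that recomputes the headline metrics by hand from the four crosstab counts. The variable names are illustrative, and the majority-class baseline is an added point of comparison rather than something computed in the notebook.

```python
# Counts copied from the crosstab above (rows = actual Outcome, columns = predicted).
tn, fp = 70, 29   # actual 0: correctly predicted 0, wrongly predicted 1
fn, tp = 19, 36   # actual 1: wrongly predicted 0, correctly predicted 1

total = tn + fp + fn + tp

accuracy = (tp + tn) / total        # share of all predictions that are correct
recall_pos = tp / (tp + fn)         # share of diabetic patients that were caught
precision_pos = tp / (tp + fp)      # share of "diabetic" predictions that are right

# Accuracy of a trivial model that always predicts the majority class (0).
baseline = max(tn + fp, fn + tp) / total

print(f"accuracy  = {accuracy:.2f}")       # 0.69
print(f"recall    = {recall_pos:.2f}")     # 0.65
print(f"precision = {precision_pos:.2f}")  # 0.55
print(f"baseline  = {baseline:.2f}")       # 0.64
```

The values agree with the classification report, and the baseline shows that the default model is only about five points more accurate than always predicting "non-diabetic".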

### Hyperparameter Tuning

We can try different hyperparameter values to take better advantage of XGBoost.

In [32]:
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

In [36]:

param = {
        'learning_rate'      : [0.05,0.10,0.15,0.20,0.25,0.30],
        'n_estimators'       : [100,150,200,250],
        'max_depth'          : [3,4,5,6,8,10,12,15],
        'gamma'              : [0.0,0.1,0.2,0.3,0.4],
        'min_child_weight'   : [1,3,5,7],
        'colsample_bytree'   : [0.3,0.4,0.5,0.7]
      }

Model_Tuning = RandomizedSearchCV(estimator=XGBClassifier(),param_distributions=param,scoring="roc_auc",n_jobs=-1,cv=5,verbose=3)


In [38]:
Model_Tuning.fit(X_train,y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:    5.1s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:    6.6s finished




RandomizedSearchCV(cv=5,
                   estimator=XGBClassifier(base_score=None, booster=None,
                                           colsample_bylevel=None,
                                           colsample_bynode=None,
                                           colsample_bytree=None, gamma=None,
                                           gpu_id=None, importance_type='gain',
                                           interaction_constraints=None,
                                           learning_rate=None,
                                           max_delta_step=None, max_depth=None,
                                           min_child_weight=None, missing=nan,
                                           monotone_constraints=None,
                                           n_estimators=100,...
                                           scale_pos_weight=None,
                                           subsample=None, tree_method=None,
                                      

In [39]:
Model_Tuning.best_params_,Model_Tuning.best_score_

({'n_estimators': 250,
  'min_child_weight': 3,
  'max_depth': 5,
  'learning_rate': 0.05,
  'gamma': 0.3,
  'colsample_bytree': 0.3},
 0.8215706219323791)
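It helps to keep in mind how small a part of the search space was explored. The log above reports 10 candidates, which matches the default `n_iter=10` of `RandomizedSearchCV`. The sketch below counts how many combinations the `param` dictionary defines; `n_sampled` is simply that candidate count copied from the log.

```python
from math import prod

# Same search space as the `param` dictionary defined earlier.
param = {
    'learning_rate':    [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    'n_estimators':     [100, 150, 200, 250],
    'max_depth':        [3, 4, 5, 6, 8, 10, 12, 15],
    'gamma':            [0.0, 0.1, 0.2, 0.3, 0.4],
    'min_child_weight': [1, 3, 5, 7],
    'colsample_bytree': [0.3, 0.4, 0.5, 0.7],
}

# A full grid search would try every combination of the listed values.
n_combinations = prod(len(values) for values in param.values())

# Number of candidates the randomized search actually evaluated (from the log).
n_sampled = 10

print(n_combinations)  # 15360
print(f"share of grid explored: {n_sampled / n_combinations:.4%}")
```

Since only a tiny fraction of the grid was sampled, the reported best parameters are best among those 10 candidates only. Passing a larger `n_iter` and a fixed `random_state` to `RandomizedSearchCV` would likely give broader and reproducible coverage, at the cost of more fits.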

### Let's train the final model based on these parameters

In [40]:
xgb = XGBClassifier(n_estimators=250,min_child_weight=3,max_depth=5,learning_rate=0.05,gamma=0.3,colsample_bytree=0.3)
xgb.fit(X_train,y_train)





XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.3, gamma=0.3, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.05, max_delta_step=0, max_depth=5,
              min_child_weight=3, missing=nan, monotone_constraints='()',
              n_estimators=250, n_jobs=4, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [41]:
y_pred = xgb.predict(X_test)

In [42]:
accuracy_score(y_test,y_pred)

0.7402597402597403

In [43]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.82      0.77      0.79        99
           1       0.62      0.69      0.66        55

    accuracy                           0.74       154
   macro avg       0.72      0.73      0.72       154
weighted avg       0.75      0.74      0.74       154



In [44]:
pd.crosstab(y_test,y_pred)

| Outcome (actual) | Predicted 0 | Predicted 1 |
|---|---|---|
| 0 | 76 | 23 |
| 1 | 17 | 38 |


Although accuracy and recall have both increased, the model still needs better recall: 17 patients who are actually diabetic are still predicted as non-diabetic.

Similarly, we can tune other parameters and look at feature engineering, which may improve the model's performance further.
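One further option, not used in this notebook, is to lower the decision threshold applied to the predicted probabilities instead of relying on the default cut-off of 0.5. The sketch below shows the idea on a tiny hand-made example; `recall_at_threshold`, `toy_y` and `toy_proba` are hypothetical names and values for illustration only, not results from the diabetes data.

```python
import numpy as np

def recall_at_threshold(y_true, proba_pos, threshold):
    """Recall of class 1 when predicting 1 whenever proba_pos >= threshold."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(proba_pos) >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_hat == 1))  # positives that were caught
    fn = np.sum((y_true == 1) & (y_hat == 0))  # positives that were missed
    return tp / (tp + fn)

# Hypothetical labels and positive-class probabilities (illustration only).
toy_y = [1, 1, 1, 0, 0, 1]
toy_proba = [0.9, 0.45, 0.35, 0.4, 0.1, 0.6]

recall_default = recall_at_threshold(toy_y, toy_proba, 0.5)
recall_lowered = recall_at_threshold(toy_y, toy_proba, 0.3)

print(recall_default)  # 0.5
print(recall_lowered)  # 1.0
```

With the model trained above, the probabilities would come from `xgb.predict_proba(X_test)[:, 1]`. A lower threshold usually raises recall at the expense of precision (in the toy example, the negative with probability 0.4 becomes a false positive), so the threshold is best chosen on validation data rather than on the test set.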