# XGBoost (eXtreme Gradient Boosting)

In this section, __XGBoost__ learning algorithim is leveraged supported by the _Gradient Boosting_ framework to obtimize the parameter tuning in solving this classification problem. Some advantages why we choose this technique is to take advantage of the parallel processing, compare with other models on how boosting regularization parameters reduce overfitting than other experiments, and take advantage of the builtin cross validation parameters. 

![](https://www.kdnuggets.com/wp-content/uploads/xgb1.png)

__Resources for XGBoost__
* [XGBoost: A Scalable Tree Boosting System by Tianqi Chen](https://arxiv.org/pdf/1603.02754.pdf)
* [BoostedTree lecture from U of Washingtion](https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf)
* [Complete Guide ot Parameter Tuning in XGBoost](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)


In [1]:
# Data Processing
import pandas as pd
import time

from sklearn.metrics import classification_report, accuracy_score
import xgboost as xgb

# Ignore the warnings if any
import warnings  
warnings.filterwarnings('ignore')

----
## Predicting Income

In [19]:
# Load the dataset
df = pd.read_csv("../data/lab2_df.csv")
# define variables for classificaiton training

variables_ = ['age', 'fnlwgt', 'educationNum', 'hoursPerWeek', 'netCapital', 'isWhite', 'isMarried', 'isHusband', 'USA', 'sex_Male', 'jobtype_government', 'jobtype_other', 'jobtype_private']
y=df['income']
X=df[variables_]


from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler


# test train split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# test train scaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)


# prepare cross validation
from sklearn.model_selection import StratifiedKFold
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)

# for just k-fold final model testing
X_ = scaler.fit_transform(X)

In [21]:
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score
from xgboost.sklearn import XGBClassifier

#learning_rate = [1, 0.20,0.35,5]
params = {"learning_rate": [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3],
          "max_depth":range(1, 11, 2),
         "objective":["binary:logistic"]}

# Parallel Thread XGBoost and CV
xgb = GridSearchCV(estimator=XGBClassifier(),param_grid=params,scoring='roc_auc',cv=kfold, n_jobs=-1)
xgb.fit(X_train_scaled,y_train, eval_metric="auc")
print("----XGBoost Classifier Hyperparameter tuning Results ----")
print("Best XGBoost Score: ",xgb.best_score_)
print("Best XGBoost Parameters: ",xgb.best_params_)

start = time.time()
xgb_results = cross_val_score(xgb, X_test_scaled, y_test,scoring='roc_auc',cv=kfold,  n_jobs=-1)
end = time.time() - start
print("")
print("")
print("Elapsed Time: XGBoost + Parallel Threat and CV: %f" % (end))
print("---- XGBoost model selection ----")
print("")
print("XGBoost Classifier test CV results: ",xgb_results )
print("XGBoost Classifier test MEAN CV results: ",xgb_results.mean())

grid_search_prediction = xgb.predict(X_test_scaled)
print("")
print("")
print("---- XGBoost Classification Report ----")
print(classification_report(y_test, grid_search_prediction))

----XGBoost Classifier Hyperparameter tuning Results ----
Best XGBoost Score:  0.9081045831272091
Best XGBoost Parameters:  {'learning_rate': 0.1, 'max_depth': 7, 'objective': 'binary:logistic'}


Elapsed Time: XGBoost + Parallel Threat and CV: 698.934785
---- XGBoost model selection ----

XGBoost Classifier test CV results:  [0.88187188 0.92467709 0.8664633  0.88998929 0.88330633 0.91045624
 0.88343522 0.89082986 0.88093132 0.88398153]
XGBoost Classifier test MEAN CV results:  0.8895942068617769


---- XGBoost Classification Report ----
              precision    recall  f1-score   support

           0       0.85      0.94      0.89      4966
           1       0.71      0.48      0.57      1547

    accuracy                           0.83      6513
   macro avg       0.78      0.71      0.73      6513
weighted avg       0.82      0.83      0.82      6513



In [30]:
xgb_best = XGBClassifier(learning_rate=0.1, max_depth=7, objective='binary:logistic')
xgb_best.fit(X_train_scaled, y_train, eval_metric="auc")
xgb_best_results = cross_val_score(xgb_best, X_test_scaled, y_test,scoring='roc_auc',cv=kfold,  n_jobs=-1)
xgb_best_results

array([0.87417408, 0.92235997, 0.85966119, 0.88923149, 0.87990279,
       0.90931302, 0.88283689, 0.883923  , 0.87712669, 0.88040453])

In [24]:
xgb_fit_results = xgb.fit(X_,y, eval_metric="auc")

In [25]:
# summarize results
print("Best: %f using %s" % (xgb_fit_results.best_score_, xgb_fit_results.best_params_))
means, stdevs = [], []
for params, mean_score, scores in xgb_fit_results.grid_scores_:
    stdev = scores.std()
    means.append(mean_score)
    stdevs.append(stdev)
    print("%f (%f) with: %r" % (mean_score, stdev, params))
    
# plot
max_depth = range(1, 11, 2)
pyplot.errorbar(max_depth, means, yerr=stdevs)
pyplot.title("XGBoost max_depth vs Log Loss")
pyplot.xlabel('max_depth')
pyplot.ylabel('Log Loss')
pyplot.savefig('max_depth.png')

Best: 0.908007 using {'learning_rate': 0.1, 'max_depth': 7, 'objective': 'binary:logistic'}


AttributeError: 'GridSearchCV' object has no attribute 'grid_scores_'

In [1]:
from xgboost.sklearn import XGBClassifier
XGBClassifier

In [None]:
xgb1 = XGBClassifier(random_state=0, )
gb.fit(X_train, y_train)

print("Accuracy on training set: {:.3f}".format(gb.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gb.score(X_test, y_test)))

In [4]:
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score
from xgboost.sklearn import XGBClassifier

#learning_rate = [1, 0.20,0.35,5]
params = {"learning_rate": [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3],
         "objective":["binary:logistic"]}

# Parallel Thread XGBoost and CV
xgb = GridSearchCV(estimator=XGBClassifier(),param_grid=params,cv=kfold, n_jobs=-1)
#xgb.fit(X_train_scaled,y_train, eval_metric="auc")
xgb.fit(X_,y, eval_metric="auc")
print("----XGBoost Classifier Hyperparameter tuning Results ----")
print("Best XGBoost Score: ",xgb.best_score_)
print("Best XGBoost Parameters: ",xgb.best_params_)

# Applying k-Fold Cross Validation
#xgb_results = cross_val_score(xgb, X_test_scaled, y_test, cv=10)

#from sklearn.model_selection import KFold
#kfold = KFold(n_splits=10, random_state=7)
start = time.time()
xgb_results = cross_val_score(xgb, X_test_scaled, y_test,scoring='roc_auc',cv=kfold,  n_jobs=-1)
end = time.time() - start
print("")
print("")
print("Elapsed Time: XGBoost + Parallel Threat and CV: %f" % (end))
print("---- XGBoost model selection ----")
print("")
print("XGBoost Classifier test CV results: ",xgb_results )
print("XGBoost Classifier test MEAN CV results: ",xgb_results.mean())

grid_search_prediction = xgb.predict(X_test_scaled)
print("")
print("")
print("---- XGBoost Classification Report ----")
print(classification_report(y_test, grid_search_prediction))

----XGBoost Classifier Hyperparameter tuning Results ----
Best XGBoost Score:  0.860527005551772
Best XGBoost Parameters:  {'learning_rate': 0.2, 'objective': 'binary:logistic'}


Elapsed Time: XGBoost + Parallel Threat and CV: 51.043947
---- XGBoost model selection ----

XGBoost Classifier test CV results:  [0.88297527 0.92467709 0.85927176 0.88638977 0.88613499 0.90795422
 0.88108091 0.88057362 0.88093132 0.88676509]
XGBoost Classifier test MEAN CV results:  0.887675403968468


---- XGBoost Classification Report ----
              precision    recall  f1-score   support

           0       0.85      0.95      0.90      4966
           1       0.74      0.48      0.58      1547

    accuracy                           0.84      6513
   macro avg       0.80      0.71      0.74      6513
weighted avg       0.83      0.84      0.82      6513



In [6]:
# summarize results
print("Best: %f using %s" % (xgb.best_score_, xgb.best_params_))
means = xgb.cv_results_['mean_test_score']
stds = xgb.cv_results_['std_test_score']
params = xgb.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
	print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.860527 using {'learning_rate': 0.2, 'objective': 'binary:logistic'}
0.856535 (0.006213) with: {'learning_rate': 1, 'objective': 'binary:logistic'}
0.860527 (0.004334) with: {'learning_rate': 0.2, 'objective': 'binary:logistic'}
0.860373 (0.004043) with: {'learning_rate': 0.35, 'objective': 'binary:logistic'}
0.624759 (0.197373) with: {'learning_rate': 5, 'objective': 'binary:logistic'}


In [26]:
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot
learning_rate = [1, 0.20,0.35,5]
# plot
pyplot.errorbar(learning_rate, means, yerr=stds)
pyplot.title("XGBoost learning_rate vs Log Loss")
pyplot.xlabel('learning_rate')
pyplot.ylabel('Log Loss')
#pyplot.savefig('learning_rate.png')
pyplot.show()


ValueError: shape mismatch: objects cannot be broadcast to a single shape

In [16]:
# Plot performance for learning_rate=0.1
from matplotlib import pyplot
n_estimators = [100, 200, 300, 400, 500]
loss = [-0.001239, -0.001153, -0.001152, -0.001153, -0.001153]
pyplot.plot(n_estimators, loss)
pyplot.xlabel('n_estimators')
pyplot.ylabel('Log Loss')
pyplot.title('XGBoost learning_rate=0.1 n_estimators vs Log Loss')
pyplot.show()

----
# PREDICTING SEX

In [54]:
# Load the dataset
df2 = pd.read_csv("../data/lab2_df.csv")
# define variables for classificaiton training

variables_2 = ['age', 'fnlwgt', 'educationNum', 'hoursPerWeek', 'netCapital', 'isWhite', 'isMarried', 'isHusband', 'USA', 'income', 'jobtype_government', 'jobtype_other', 'jobtype_private']
y2=df2['sex_Male']
X2=df2[variables_2]


# test train split
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.2, random_state=0)

# test train scaler
scaler = StandardScaler()
X_train_scaled2 = scaler.fit_transform(X_train2)
X_test_scaled2 = scaler.fit_transform(X_test2)

In [58]:
params = {"learning_rate": [1, 0.20,0.35,5],
         "objective":["binary:logistic"]}
grid_search2 = GridSearchCV(estimator=XGBClassifier(),param_grid=params,n_jobs=-1)
grid_search2.fit(X_train_scaled2,y_train2, eval_metric="auc")

GridSearchCV(cv=None, error_score=nan,
             estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                     colsample_bylevel=1, colsample_bynode=1,
                                     colsample_bytree=1, gamma=0,
                                     learning_rate=0.1, max_delta_step=0,
                                     max_depth=3, min_child_weight=1,
                                     missing=None, n_estimators=100, n_jobs=1,
                                     nthread=None, objective='binary:logistic',
                                     random_state=0, reg_alpha=0, reg_lambda=1,
                                     scale_pos_weight=1, seed=None, silent=None,
                                     subsample=1, verbosity=1),
             iid='deprecated', n_jobs=-1,
             param_grid={'learning_rate': [1, 0.2, 0.35, 5],
                         'objective': ['binary:logistic']},
             pre_dispatch='2*n_jobs', refit=True, retur

In [59]:
print(grid_search2.best_score_)
print(grid_search2.best_params_)

0.7820944408559083
{'learning_rate': 0.35, 'objective': 'binary:logistic'}


In [60]:
grid_search_prediction2 = grid_search2.predict(X_test_scaled2)
print(classification_report(y_test2, grid_search_prediction2))

              precision    recall  f1-score   support

           0       0.66      0.75      0.70      2227
           1       0.86      0.80      0.83      4286

    accuracy                           0.78      6513
   macro avg       0.76      0.77      0.76      6513
weighted avg       0.79      0.78      0.78      6513



In [61]:
kfold2 = KFold(n_splits=10, random_state=7)
results2 = cross_val_score(grid_search2, X_test_scaled2, y_test2, cv=kfold)
results2

array([0.7791411 , 0.76993865, 0.78220859, 0.76036866, 0.74961598,
       0.77572965, 0.76651306, 0.76804916, 0.78801843, 0.78801843])

---
### Archive Test ... blahhhh

In [4]:
from sklearn.model_selection import cross_val_score,GridSearchCV
from xgboost.sklearn import XGBClassifier

params = {"learning_rate": [1, 0.20,0.35,5],
         "objective":["binary:logistic"]}
xgb = GridSearchCV(estimator=XGBClassifier(),param_grid=params,n_jobs=-1)
#grid_search.fit(X_train_scaled,y_train, eval_metric="auc")
xgb.fit(X_,y, eval_metric="auc")

GridSearchCV(cv=None, error_score=nan,
             estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                     colsample_bylevel=1, colsample_bynode=1,
                                     colsample_bytree=1, gamma=0,
                                     learning_rate=0.1, max_delta_step=0,
                                     max_depth=3, min_child_weight=1,
                                     missing=None, n_estimators=100, n_jobs=1,
                                     nthread=None, objective='binary:logistic',
                                     random_state=0, reg_alpha=0, reg_lambda=1,
                                     scale_pos_weight=1, seed=None, silent=None,
                                     subsample=1, verbosity=1),
             iid='deprecated', n_jobs=-1,
             param_grid={'learning_rate': [1, 0.2, 0.35, 5],
                         'objective': ['binary:logistic']},
             pre_dispatch='2*n_jobs', refit=True, retur

In [5]:
print(xgb.best_score_)
print(xgb.best_params_)

0.7963832116526728
{'learning_rate': 0.35, 'objective': 'binary:logistic'}


In [10]:
from sklearn.model_selection import KFold
kfold = KFold(n_splits=10, random_state=7)
xgb_results = cross_val_score(xgb, X_, y, cv=kfold)


print("XGBoost Classifier test CV results: ",xgb_results )
print("XGBoost Classifier test MEAN CV results: ",xgb_results.mean())


XGBoost Classifier test CV results:  [0.40374578 0.84213759 0.84244472 0.85657248 0.84336609 0.84674447
 0.8470516  0.85626536 0.84367322 0.84275184]
XGBoost Classifier test MEAN CV results:  0.8024753149330982


In [12]:
from sklearn.model_selection import cross_val_score,GridSearchCV
from xgboost.sklearn import XGBClassifier

params = {"learning_rate": [1, 0.20,0.35,5],
         "objective":["binary:logistic"]}
xgb1 = GridSearchCV(estimator=XGBClassifier(),param_grid=params,cv=10)
xgb1.fit(X_train_scaled,y_train, eval_metric="auc")

GridSearchCV(cv=10, error_score=nan,
             estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                     colsample_bylevel=1, colsample_bynode=1,
                                     colsample_bytree=1, gamma=0,
                                     learning_rate=0.1, max_delta_step=0,
                                     max_depth=3, min_child_weight=1,
                                     missing=None, n_estimators=100, n_jobs=1,
                                     nthread=None, objective='binary:logistic',
                                     random_state=0, reg_alpha=0, reg_lambda=1,
                                     scale_pos_weight=1, seed=None, silent=None,
                                     subsample=1, verbosity=1),
             iid='deprecated', n_jobs=None,
             param_grid={'learning_rate': [1, 0.2, 0.35, 5],
                         'objective': ['binary:logistic']},
             pre_dispatch='2*n_jobs', refit=True, retur

In [13]:
print(xgb1.best_score_)
print(xgb1.best_params_)

0.860527005551772
{'learning_rate': 0.2, 'objective': 'binary:logistic'}


In [None]:
pred = grid_search.predict_proba(X_test_scaled)[:,1]

In [15]:
cross_val_score(XGBClassifier(),X,y)

array([0.7673883 , 0.76842752, 0.80958231, 0.81633907, 0.81188575])

In [14]:
from sklearn.metrics import f1_score

In [15]:
f1_score(y_train, xgb1.predict(X_train_scaled), average='macro')

0.793849544311618

In [16]:
f1_score(y_test, xgb1.predict(X_test_scaled), average='macro')

0.7417322025009255

In [17]:
grid_search_prediction = xgb1.predict(X_test_scaled)
print(classification_report(y_test, grid_search_prediction))

              precision    recall  f1-score   support

           0       0.85      0.95      0.90      4966
           1       0.74      0.48      0.58      1547

    accuracy                           0.84      6513
   macro avg       0.80      0.71      0.74      6513
weighted avg       0.83      0.84      0.82      6513



In [18]:
from sklearn.model_selection import KFold
kfold = KFold(n_splits=10, random_state=7)
xgb_results = cross_val_score(xgb1, X_test_scaled, y_test, cv=kfold)
print("XGBoost Classifier test CV results: ",xgb_results )
print("XGBoost Classifier test MEAN CV results: ",xgb_results.mean())

XGBoost Classifier test CV results:  [0.86042945 0.85122699 0.84969325 0.84639017 0.83102919 0.84485407
 0.85867896 0.85560676 0.88018433 0.83563748]
XGBoost Classifier test MEAN CV results:  0.8513730645632487


In [37]:
results

array([0.86042945, 0.85122699, 0.84969325, 0.85099846, 0.83102919,
       0.84485407, 0.85714286, 0.85560676, 0.88018433, 0.83563748])

In [41]:
from sklearn.model_selection import cross_val_score

cv=KFold(n_splits=10)
acc1 = cross_val_score(grid_search, X_test_scaled, y_test, cv=cv)
acc2 = cross_val_score(grid_search, X_test_scaled, y_test, cv=cv)

In [43]:
import numpy as np

In [44]:
t = 2.26 / np.sqrt(10)

e = (1-acc1)-(1-acc2)
# std1 = np.std(acc1)
# std2 = np.std(acc2)
stdtot = np.std(e)

dbar = np.mean(e)
print ('Range of:', dbar-t*stdtot,dbar+t*stdtot )
print (np.mean(acc1), np.mean(acc2))

Range of: 0.0 0.0
0.851680284225307 0.851680284225307


In [45]:
acc1

array([0.86042945, 0.85122699, 0.84969325, 0.85099846, 0.83102919,
       0.84485407, 0.85714286, 0.85560676, 0.88018433, 0.83563748])