## Hyperparameter Tuning - XGBoost

In [5]:
# data manipulation
import pandas as pd
import os
import numpy as np

# modeling
from xgboost import XGBClassifier

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbPipeline

# parameter searching
from skopt.space import Real, Integer, Categorical

# custom helper functions
from src.models import cross_validate as cv

In [6]:
DATA_PATH = '../data/processed/'
OBS_PATH = os.path.join(DATA_PATH, 'observations_features.csv')
RESULTS_PATH = os.path.join(DATA_PATH, 'results.csv')

### Load Data

In [7]:
obs = pd.read_csv(OBS_PATH)
obs.head()

Unnamed: 0,session_id,seq,buy_event,visitor_id,view_count,session_length,item_views,add_to_cart_count,transaction_count,avg_avail
0,1000001_251341,2.0,0,1000001,1.0,0.0,1.0,0.0,0.0,0.0
1,1000007_251343,2.0,0,1000007,1.0,0.0,1.0,0.0,0.0,0.0
2,1000042_251344,2.0,0,1000042,1.0,0.0,1.0,0.0,0.0,1.0
3,1000057_251346,2.0,0,1000057,1.0,0.0,1.0,0.0,0.0,1.0
4,1000067_251351,2.0,0,1000067,1.0,0.0,1.0,0.0,0.0,0.0


### Parameter Search

We will use `BayesSearchCV` to tune the hyperparameters. `BayesSearchCV` applies Bayesian optimization: it fits a surrogate model of the cross-validation score as a function of the hyperparameters and uses that model to pick the next candidate to evaluate. The goal is to find the set of hyperparameters that maximizes AUC, in our case.
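To make the search objective concrete, here is a minimal sketch of what AUC measures: the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not part of this project's code. In practice a library implementation such as scikit-learn's `roc_auc_score` would be used.

```python
def roc_auc(y_true, y_score):
    """Pairwise (rank-based) ROC AUC for binary labels in {0, 1}.

    Illustrative helper only; quadratic in the number of samples.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_score))  # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so higher is better and the search should maximize this value.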

In [8]:
X_train, X_test, y_train, y_test = cv.create_Xy(obs)

pipe = imbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('xgb', XGBClassifier(random_state=42, n_estimators=500))
])

search_params = {
#     'smote__k_neighbors': Integer(5, 20)#,
    'xgb__max_depth': Integer(3, 10),
    'xgb__learning_rate': Real(0, 1),
    'xgb__gamma': Real(0, 1),
    'xgb__reg_lambda': Real(0, 1)
}

search_results = cv.bayes_search(X_train, y_train, pipe, search_params)

Fitting 3 folds for each of 1 candidates, totalling 3 fits


[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done   3 out of   3 | elapsed:   46.7s finished

In [10]:
search_results.best_score_

0.6246704948506208

In [11]:
pd.DataFrame(search_results.cv_results_).sort_values('mean_test_score', ascending=False)

Unnamed: 0,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_xgb__gamma,param_xgb__learning_rate,param_xgb__max_depth,param_xgb__reg_lambda,params
48,0.635041,0.597432,0.641538,0.62467,0.019442,1,29.739937,0.202219,0.427171,0.037883,1.0,0.100464,3,0.0,"{'xgb__gamma': 1.0, 'xgb__learning_rate': 0.10..."
47,0.635505,0.595208,0.640787,0.623833,0.020356,1,30.838834,0.328485,0.416149,0.046073,1.0,0.105429,3,1.0,"{'xgb__gamma': 1.0, 'xgb__learning_rate': 0.10..."
46,0.633298,0.59589,0.640164,0.623117,0.019456,1,29.259307,0.101379,0.487873,0.024711,1.0,0.106285,3,1.0,"{'xgb__gamma': 1.0, 'xgb__learning_rate': 0.10..."
45,0.63427,0.59151,0.642188,0.622656,0.022259,1,29.589816,0.210543,0.422242,0.036392,1.0,0.106889,3,1.0,"{'xgb__gamma': 1.0, 'xgb__learning_rate': 0.10..."
44,0.629927,0.59692,0.641035,0.622627,0.018735,1,29.491282,0.273699,0.40394,0.059988,1.0,0.109871,3,1.0,"{'xgb__gamma': 1.0, 'xgb__learning_rate': 0.10..."
42,0.632163,0.589279,0.635759,0.619067,0.021114,1,29.528396,0.13607,0.470329,0.024252,0.807917,0.11643,3,0.961514,"{'xgb__gamma': 0.8079170954164776, 'xgb__learn..."
39,0.626942,0.588898,0.635929,0.617256,0.020385,1,29.711836,0.244713,0.445399,0.052004,0.045936,0.119733,3,0.98295,"{'xgb__gamma': 0.045936045180184355, 'xgb__lea..."
33,0.622874,0.576241,0.632731,0.610615,0.024637,1,29.511802,0.156127,0.366997,0.017965,1.0,0.181832,3,1.0,"{'xgb__gamma': 1.0, 'xgb__learning_rate': 0.18..."
40,0.62056,0.575294,0.627152,0.607669,0.02305,1,30.181234,0.29286,0.38751,0.049567,0.0,0.145081,3,1.0,"{'xgb__gamma': 0.0, 'xgb__learning_rate': 0.14..."
24,0.61265,0.580977,0.627074,0.6069,0.019253,1,29.30978,0.326295,0.304638,0.035482,1.0,0.186432,3,1.0,"{'xgb__gamma': 1.0, 'xgb__learning_rate': 0.18..."


In [12]:
search_results.best_params_

{'xgb__gamma': 1.0,
 'xgb__learning_rate': 0.10046431004883158,
 'xgb__max_depth': 3,
 'xgb__reg_lambda': 0.0}

In [13]:
model = 'xgb_tuned'

cv_results = cv.cv_model(X_train, y_train, search_results.best_estimator_)
cv.log_scores(cv_results, model)

Unnamed: 0,avg_accuracy,std_accuracy,avg_precision,std_precision,avg_recall,std_recall,avg_f1,std_f1,avg_auc,std_auc
xgb_tuned,0.93599,0.004699,0.213814,0.027299,0.06102,0.005549,0.094789,0.008549,0.62467,0.019442


In [14]:
results = pd.read_csv(RESULTS_PATH, index_col=0)

results = results.drop(index=model, errors='ignore')
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
results = pd.concat([results, cv.log_scores(cv_results, model)], sort=False)
results.to_csv(RESULTS_PATH)
results

Unnamed: 0,avg_accuracy,std_accuracy,avg_precision,std_precision,avg_recall,std_recall,avg_f1,std_f1,avg_auc,std_auc
log_regression,0.478189,0.003034,0.84024,0.015868,0.024764,0.000464,0.048111,0.0009,0.752486,0.009845
random_forest,0.930277,0.001674,0.148949,0.016731,0.039709,0.003086,0.062687,0.005337,0.531354,0.013361
xgb,0.936546,0.003746,0.211411,0.023783,0.061001,0.005206,0.094584,0.008094,0.619189,0.016401
xgb_SMOTE_Tomek,0.933256,0.007034,0.216817,0.025067,0.059006,0.003155,0.092508,0.004382,0.621756,0.015873
log_reg_SMOTE_Tomek,0.479056,0.003446,0.837838,0.015359,0.024738,0.00048,0.048057,0.000929,0.752859,0.010022
log_reg_tuned,0.479226,0.003956,0.839039,0.013669,0.024781,0.000494,0.04814,0.000953,0.752452,0.010132
xgb_tuned,0.93599,0.004699,0.213814,0.027299,0.06102,0.005549,0.094789,0.008549,0.62467,0.019442


### Save the model

The tuned pipeline (`search_results.best_estimator_`) will now be pickled so it can be reused without repeating the search.

In [15]:
import pickle

MODEL_PATH = '../models/'
XGB_PATH = os.path.join(MODEL_PATH, 'xgb_tuned.pkl')

# use a context manager so the file handle is closed after writing
with open(XGB_PATH, 'wb') as f:
    pickle.dump(search_results.best_estimator_, f)
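To reuse the saved model later, the file can be read back with `pickle.load`. The sketch below shows the round trip using a plain dictionary as a stand-in for the fitted pipeline and a temporary directory in place of `../models/`. The helper names `save_pickle` and `load_pickle` are hypothetical and do not exist in `src.models`.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Write obj to path, closing the file afterwards (hypothetical helper)."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Read a pickled object back from path (hypothetical helper)."""
    with open(path, 'rb') as f:
        return pickle.load(f)


# Stand-in for search_results.best_estimator_ (assumption for illustration)
stand_in = {'xgb__max_depth': 3, 'xgb__gamma': 1.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.pkl')
    save_pickle(stand_in, demo_path)
    restored = load_pickle(demo_path)

print(restored == stand_in)  # True
```

For the real pipeline, loading requires that `xgboost` and `imblearn` are importable in the loading environment, ideally at the same versions used for training. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.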