### Load Libraries

In [7]:

import numpy as np; print('numpy Version:', np.__version__)
import pandas as pd; print('pandas Version:', pd.__version__)
import sklearn; print('Scikit-Learn Version:', sklearn.__version__)
import xgboost as xgb; print('XGBoost Version:', xgb.__version__)

import time 

numpy Version: 1.16.4
pandas Version: 0.24.2
Scikit-Learn Version: 0.21.2
XGBoost Version: 1.0.0rapidsdev0-SNAPSHOT


In the previous notebook we already loaded our data and split it into training and test sets.
We intentionally named the objects so that you can easily distinguish whether you are using a pandas (`pd_`) or a cuDF GPU (`cdf_`) data frame.

In [2]:
start = time.time()
%run files/load_data.py
time_needed = time.time() - start  # elapsed wall-clock seconds

CPU times: user 3min 51s, sys: 14.7 s, total: 4min 6s
Wall time: 4min 8s
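The cell above measures elapsed time by hand with `time.time()`. As a minimal sketch of the same pattern in reusable form, the context manager below stores the wall-clock duration of any block in a dictionary. The names `timer` and `timings` are hypothetical helpers introduced for illustration; they are not part of the notebook or of `load_data.py`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(results, key):
    """Store the elapsed wall-clock seconds of the with-block in results[key]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Always end time minus start time, so the result is non-negative.
        results[key] = time.perf_counter() - start


timings = {}
with timer(timings, "sleep"):
    time.sleep(0.05)  # stand-in for the real work, e.g. loading the data

print(f"elapsed: {timings['sleep']:.3f} s")
```

Collecting the durations in one dictionary makes it easy to compare several steps (loading, GPU training, CPU training) at the end of the notebook.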


Our next step is to convert this data to a `DMatrix` object that XGBoost can work with. We can instantiate an `xgboost.DMatrix` by passing the feature matrix as the first argument, followed by the label vector via the `label=` keyword argument.

In [3]:
dtrain = xgb.DMatrix(pd_X_train, label = np.squeeze(pd_y_train))
dvalidation = xgb.DMatrix(pd_X_test, label = np.squeeze(pd_y_test))

In [4]:

# check dimensions
print('Training data dimension : ', dtrain.num_row(), dtrain.num_col())
print('Validation data dimension:', dvalidation.num_row(), dvalidation.num_col())


Training data dimension :  8000000 100
Validation data dimension: 2000000 100
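The cells above wrap the labels in `np.squeeze` before handing them to `DMatrix`. Assuming the label objects are single-column tables of shape `(n, 1)` (an assumption, since `load_data.py` is not shown here), the sketch below illustrates what `np.squeeze` does: it drops the length-1 axis so the labels become a flat vector. The toy array `y_column` is made up for illustration.

```python
import numpy as np

# Hypothetical toy labels stored as a single column, shape (4, 1).
y_column = np.array([[0], [1], [1], [0]])

# np.squeeze removes axes of length 1, giving a flat vector of shape (4,).
y_flat = np.squeeze(y_column)

print(y_column.shape, "->", y_flat.shape)
```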


### Set up the parameters

In [5]:
# instantiate params
params = {}

# general params
general_params = {'silent': 1}
params.update(general_params)

# booster params
n_gpus = -1  # -1 means "use all the GPUs available". Change it to 1 to use a single GPU or 0 to use the CPU
booster_params = {}

if n_gpus != 0:
    booster_params['tree_method'] = 'gpu_hist'
    booster_params['n_gpus'] = n_gpus   
params.update(booster_params)

# learning task params
learning_task_params = {}
learning_task_params['eval_metric'] = 'auc'
learning_task_params['objective'] = 'binary:logistic'

params.update(learning_task_params)
print(params)

{'silent': 1, 'tree_method': 'gpu_hist', 'n_gpus': -1, 'eval_metric': 'auc', 'objective': 'binary:logistic'}
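Since the same parameter dictionary is rebuilt later with only `n_gpus` changed, one option is to wrap the logic in a small function. The sketch below is one way to do that; `build_params` is a hypothetical helper, and the keys (`silent`, `n_gpus`, `gpu_hist`) are simply the ones this notebook uses with its XGBoost build. Other XGBoost releases may expect different keys.

```python
def build_params(n_gpus):
    """Build the parameter dict used in this notebook.

    n_gpus: -1 for all available GPUs, 1 for a single GPU, 0 for CPU only.
    """
    params = {'silent': 1}  # general params

    # Booster params: only request the GPU tree method when GPUs are used.
    if n_gpus != 0:
        params['tree_method'] = 'gpu_hist'
        params['n_gpus'] = n_gpus

    # Learning task params.
    params['eval_metric'] = 'auc'
    params['objective'] = 'binary:logistic'
    return params


gpu_params = build_params(-1)
cpu_params = build_params(0)
print(gpu_params)
print(cpu_params)
```

With this helper, switching between the GPU and CPU runs is a one-argument change rather than a copy of the whole cell.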


### Train XGBoost model 

**Monitor GPU utilization** by typing `watch -n 0.1 nvidia-smi` in your console.

We need to pass `num_boost_round`, the maximum number of boosting rounds that we allow. We set it to a large value and rely on early stopping to find the optimal number of rounds before reaching it: training stops if the evaluation metric has not improved in `early_stopping_rounds` rounds. Note that when several evaluation sets are passed, XGBoost uses the last one in `evals` for early stopping.

You can try setting a larger value for `num_boost_round`.
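To make the stopping rule concrete, here is a simplified pure-Python sketch of it for a metric where higher is better, such as AUC. It is an illustration of the idea only, not XGBoost's actual implementation; the function name `early_stopping_round` and the score list are made up.

```python
def early_stopping_round(scores, patience):
    """Return (best_round, stop_round) for a higher-is-better metric.

    Training stops at the first round that is `patience` rounds past the
    best round seen so far; otherwise it runs through all given rounds.
    """
    best_round = 0
    best_score = scores[0]
    for i, score in enumerate(scores):
        if score > best_score:
            best_score, best_round = score, i
        elif i - best_round >= patience:
            return best_round, i
    return best_round, len(scores) - 1


# Hypothetical per-round AUC values: the best score within reach is at round 2.
aucs = [0.80, 0.85, 0.88, 0.87, 0.879, 0.878, 0.90]
best, stopped = early_stopping_round(aucs, patience=3)
print(best, stopped)
```

In this toy run the later improvement at round 6 is never reached, which is the trade-off of a small patience value: it saves time but can stop before a late gain.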

In [9]:
%time 
startTime = time.time()

evallist = [(dvalidation, 'validation'), (dtrain, 'train')]

gpu_train = xgb.train(
    params = params, 
    dtrain = dtrain,
    evals = evallist,
    num_boost_round = 100, #maximum number of boosting rounds we allow
    verbose_eval = 5,
    early_stopping_rounds = 10 # The number of rounds without improvements after which we should stop,
) 

gpuTrainingTime = time.time() - startTime 

CPU times: user 9 µs, sys: 0 ns, total: 9 µs
Wall time: 17.4 µs
[0]	validation-auc:0.885117	train-auc:0.885372
Multiple eval metrics have been passed: 'train-auc' will be used for early stopping.

Will train until train-auc hasn't improved in 10 rounds.
[5]	validation-auc:0.956869	train-auc:0.956969
[10]	validation-auc:0.974715	train-auc:0.974869
[15]	validation-auc:0.981039	train-auc:0.981213
[20]	validation-auc:0.984458	train-auc:0.984653
[25]	validation-auc:0.986609	train-auc:0.986795
[30]	validation-auc:0.989126	train-auc:0.989285
[35]	validation-auc:0.990134	train-auc:0.990286
[40]	validation-auc:0.990911	train-auc:0.991067
[45]	validation-auc:0.9918	train-auc:0.991945
[50]	validation-auc:0.992184	train-auc:0.992335
[55]	validation-auc:0.992568	train-auc:0.992718
[60]	validation-auc:0.99295	train-auc:0.993105
[65]	validation-auc:0.993174	train-auc:0.993328
[70]	validation-auc:0.993267	train-auc:0.993428
[75]	validation-auc:0.993331	train-auc:0.993504
[80]	validation-auc:0.993469	t

## Exercise: 

Change your parameter settings to train this model **without using the GPU** and compare the training times to measure the speedup.

In [10]:
## SOLUTION

# instantiate params
params = {}

# general params
general_params = {'silent': 1}
params.update(general_params)

# booster params
n_gpus = 0  # 0 means "use the CPU". Change it to 1 to use a single GPU or -1 to use all the GPUs available
booster_params = {}

if n_gpus != 0:
    booster_params['tree_method'] = 'gpu_hist'
    booster_params['n_gpus'] = n_gpus   
params.update(booster_params)

# learning task params
learning_task_params = {}
learning_task_params['eval_metric'] = 'auc'
learning_task_params['objective'] = 'binary:logistic'

params.update(learning_task_params)
print(params)

{'silent': 1, 'eval_metric': 'auc', 'objective': 'binary:logistic'}


In [14]:
# %time 
# startTime = time.time()
# evallist = [(dvalidation, 'validation'), (dtrain, 'train')]
# cpu_train = xgb.train(
#     params = params, 
#     dtrain = dtrain,
#     evals = evallist,
#     num_boost_round = 10,
#     verbose_eval = 5,
#     early_stopping_rounds = 10 )


# cpuTrainTime = time.time() - startTime 

In [None]:
# cpuTrainTime / gpuTrainingTime  # speedup of GPU over CPU training
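The speedup asked for in the exercise is the ratio of CPU training time to GPU training time. A minimal sketch, using hypothetical timings rather than measured results from this notebook:

```python
def speedup(cpu_seconds, gpu_seconds):
    """Return how many times faster the GPU run was than the CPU run."""
    if gpu_seconds <= 0:
        raise ValueError("gpu_seconds must be positive")
    return cpu_seconds / gpu_seconds


# Hypothetical timings in seconds, for illustration only.
example_speedup = speedup(120.0, 8.0)
print(f"GPU training was {example_speedup:.1f}x faster")
```

For a fair comparison, both runs should use the same `num_boost_round` and early stopping settings.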

https://www.datacamp.com/community/tutorials/xgboost-in-python#what

https://blog.cambridgespark.com/hyperparameter-tuning-in-xgboost-4ff9100a3b2f