### Load Libraries

In [2]:

import numpy as np; print('numpy Version:', np.__version__)
import pandas as pd; print('pandas Version:', pd.__version__)
import sklearn; print('Scikit-Learn Version:', sklearn.__version__)
import xgboost as xgb; print('XGBoost Version:', xgb.__version__)

import time 

numpy Version: 1.16.4
pandas Version: 0.24.2
Scikit-Learn Version: 0.21.2
XGBoost Version: 1.0.0rapidsdev0-SNAPSHOT


In the previous notebook we loaded our data and split it into training and test sets.
We intentionally named the objects so that you can easily distinguish whether you are using a pandas (`pd_`) or a cuDF GPU (`cdf_`) data frame.

In [3]:
start = time.time()
%run ../files/load_data.py
time_needed = time.time() - start  # end minus start, so the elapsed time is positive
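
The timing pattern used in this notebook can be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: `time_call` is a hypothetical name, and it uses `time.perf_counter`, which is better suited to measuring short intervals than `time.time`.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start  # end minus start
    return result, elapsed


# Example: time a simple built-in call
total, seconds = time_call(sum, range(1_000_000))
print(f"sum={total}, took {seconds:.4f} s")
```

Returning the elapsed time alongside the result keeps the measurement next to the call it belongs to, which helps when comparing several runs.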

Our next step is to convert this data to a DMatrix object that XGBoost can work with. We can instantiate an `xgboost.DMatrix` object by passing in the feature matrix as the first argument, followed by the label vector using the `label=` keyword argument.

In [4]:
dtrain = xgb.DMatrix(pd_X_train, label = np.squeeze(pd_y_train))
dvalidation = xgb.DMatrix(pd_X_test, label = np.squeeze(pd_y_test))
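
The `np.squeeze` call above flattens the labels into a one-dimensional vector. The sketch below illustrates the effect on a small NumPy array, assuming the labels are stored as a single column; `y_column` is a made-up stand-in for `pd_y_train`, whose actual shape is not shown in this notebook.

```python
import numpy as np

# Hypothetical labels stored as a single column: shape (4, 1)
y_column = np.array([[0], [1], [1], [0]])

# np.squeeze drops the axes of length one, leaving shape (4,)
y_flat = np.squeeze(y_column)

print(y_column.shape, "->", y_flat.shape)
```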

In [5]:

# check dimensions
print('Training data dimension : ', dtrain.num_row(), dtrain.num_col())
print('Validation data dimension:', dvalidation.num_row(), dvalidation.num_col())


Training data dimension :  800 100
Validation data dimension: 200 100


### Set up the parameters

In [6]:
# instantiate params
params = {}

# general params
general_params = {'silent': 1}
params.update(general_params)

# booster params
n_gpus = -1  # -1 means "use all available GPUs"; set it to 1 to use a single GPU or 0 to use the CPU
booster_params = {}

if n_gpus != 0:
    booster_params['tree_method'] = 'gpu_hist'
    booster_params['n_gpus'] = n_gpus   
params.update(booster_params)

# learning task params
learning_task_params = {}
learning_task_params['eval_metric'] = 'auc'
learning_task_params['objective'] = 'binary:logistic'

params.update(learning_task_params)
print(params)

{'silent': 1, 'tree_method': 'gpu_hist', 'n_gpus': -1, 'eval_metric': 'auc', 'objective': 'binary:logistic'}
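
The same parameter dictionary can be built by a small function, which makes it easier to switch between GPU and CPU runs. This is a sketch only: `build_params` is a hypothetical helper that mirrors the keys and values used in the cell above.

```python
def build_params(n_gpus):
    """Return the notebook's parameter dict for the given n_gpus setting.

    n_gpus = 0 leaves the tree method at its default (CPU);
    any other value selects the 'gpu_hist' tree method.
    """
    params = {"silent": 1}

    if n_gpus != 0:
        params["tree_method"] = "gpu_hist"
        params["n_gpus"] = n_gpus

    params["eval_metric"] = "auc"
    params["objective"] = "binary:logistic"
    return params


print(build_params(-1))  # GPU configuration
print(build_params(0))   # CPU configuration
```

These keys match the XGBoost build used in this notebook. Newer XGBoost releases document `verbosity` rather than `silent`, and GPU selection has changed over time, so check the parameter reference for the version you have installed.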


### Train XGBoost model 

**Monitor GPU utilization** by typing `watch -n 0.1 nvidia-smi` in your console.

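We need to pass `num_boost_round`, the maximum number of boosting rounds that we allow. We set it to a large value and rely on early stopping to find the optimal number of rounds before reaching it: training stops once the monitored metric has not improved for `early_stopping_rounds` rounds. When `evals` contains several datasets, XGBoost uses the last one for early stopping, so with the list below the training set is monitored, as the log confirms. Put the validation set last to stop on validation performance instead.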

You can try setting a larger number.
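
The early-stopping rule described above can be sketched in plain Python. This is a simplified illustration, not XGBoost's implementation: `early_stop` is a hypothetical function, the scores are made up, and it assumes a metric such as AUC where higher is better.

```python
def early_stop(scores, patience):
    """Return (best_round, stop_round) for a metric where higher is better.

    Training stops at the first round that is `patience` rounds past the
    best round seen so far; otherwise it runs through all the scores.
    """
    best_round, best_score = 0, scores[0]
    for rnd, score in enumerate(scores):
        if score > best_score:
            best_round, best_score = rnd, score
        elif rnd - best_round >= patience:
            return best_round, rnd
    return best_round, len(scores) - 1


# Made-up per-round scores: the best value is first reached at round 2
scores = [0.70, 0.80, 0.85, 0.84, 0.85, 0.83, 0.82]
best, stopped = early_stop(scores, patience=3)
print(f"best round: {best}, stopped at round: {stopped}")
```

A tie with the best score does not count as an improvement here, so the best round stays at the first round that reached the top value.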

In [7]:
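# Note: a bare `%time` only times an empty statement (hence the microsecond readings in the output); the training time is measured with time.time() instead
%time 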
startTime = time.time()

evallist = [(dvalidation, 'validation'), (dtrain, 'train')]

gpu_train = xgb.train(
    params = params, 
    dtrain = dtrain,
    evals = evallist,
    num_boost_round = 100,  # maximum number of boosting rounds we allow
    verbose_eval = 5,
    early_stopping_rounds = 10  # number of rounds without improvement after which training stops
) 

gpuTrainTime = time.time() - startTime 

CPU times: user 1 µs, sys: 2 µs, total: 3 µs
Wall time: 6.91 µs
[0]	validation-auc:0.766216	train-auc:0.956076
Multiple eval metrics have been passed: 'train-auc' will be used for early stopping.

Will train until train-auc hasn't improved in 10 rounds.
[5]	validation-auc:0.896942	train-auc:0.999537
[10]	validation-auc:0.918997	train-auc:1
[15]	validation-auc:0.924511	train-auc:1
Stopping. Best iteration:
[9]	validation-auc:0.910576	train-auc:1



## Exercise: 

Change your parameter settings to train this model **without using the GPU** and compare the training times.

In [8]:
## SOLUTION

# instantiate params
params = {}

# general params
general_params = {'silent': 1}
params.update(general_params)

# booster params
n_gpus = 0  # 0 means "use the CPU"; set it to -1 to use all available GPUs or 1 to use a single GPU
booster_params = {}

if n_gpus != 0:
    booster_params['tree_method'] = 'gpu_hist'
    booster_params['n_gpus'] = n_gpus   
params.update(booster_params)

# learning task params
learning_task_params = {}
learning_task_params['eval_metric'] = 'auc'
learning_task_params['objective'] = 'binary:logistic'

params.update(learning_task_params)
print(params)

{'silent': 1, 'eval_metric': 'auc', 'objective': 'binary:logistic'}


In [9]:
%time 
startTime = time.time()
evallist = [(dvalidation, 'validation'), (dtrain, 'train')]
cpu_train = xgb.train(
    params = params, 
    dtrain = dtrain,
    evals = evallist,
    num_boost_round = 100,
    verbose_eval = 5,
    early_stopping_rounds = 10 )


cpuTrainTime = time.time() - startTime 

CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 6.68 µs
[0]	validation-auc:0.795739	train-auc:0.953388
Multiple eval metrics have been passed: 'train-auc' will be used for early stopping.

Will train until train-auc hasn't improved in 10 rounds.
[5]	validation-auc:0.922406	train-auc:0.999318
[10]	validation-auc:0.938145	train-auc:1
[15]	validation-auc:0.936942	train-auc:1
[20]	validation-auc:0.94015	train-auc:1
Stopping. Best iteration:
[10]	validation-auc:0.938145	train-auc:1



In [10]:
cpuTrainTime/gpuTrainTime 

0.02197543131091485
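
A ratio below 1 means the CPU run finished faster than the GPU run, which is what the value above shows for this small dataset (800 training rows). The sketch below turns the two timings into a readable statement; `describe_ratio` is a hypothetical helper and the timings passed to it are made-up values.

```python
def describe_ratio(cpu_seconds, gpu_seconds):
    """Describe which device was faster, given the two training times."""
    ratio = cpu_seconds / gpu_seconds
    if ratio > 1:
        return f"GPU was {ratio:.1f}x faster than CPU"
    return f"CPU was {1 / ratio:.1f}x faster than GPU"


# Made-up timings, in seconds
print(describe_ratio(2.0, 1.0))
print(describe_ratio(1.0, 4.0))
```

In the notebook you would call it as `describe_ratio(cpuTrainTime, gpuTrainTime)`. On datasets this small, fixed GPU overheads can plausibly outweigh any gain in computation, so a larger dataset may give a different picture.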

https://www.datacamp.com/community/tutorials/xgboost-in-python#what

https://blog.cambridgespark.com/hyperparameter-tuning-in-xgboost-4ff9100a3b2f