# Accelerating XGBoost with a GPU

This kernel trains the same XGBoost model on CPU and on GPU. With GPU acceleration, training is ~8.3x faster on an NVIDIA K80 card than on the 2-vCPU virtual machine available in Kaggle (1h 8min 46s vs. 8min 20s). The outputs recorded below come from a session with a Tesla P100, where the wall time drops from 33min 27s to 2min 55s (~11.5x).

The gain on an NVIDIA GTX 1080 Ti card compared to an Intel i7-6900K CPU (8 cores, 16 threads) is ~6.6x.
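The Kaggle speedup figures follow directly from the wall times. The sketch below recomputes them from the durations quoted in the introduction and from the `Wall time` lines in the cell outputs of this notebook. The helper `to_seconds` is a hypothetical name introduced only for this illustration.

```python
import re

def to_seconds(text):
    """Convert a duration such as '1h 8min 46s' to a number of seconds."""
    units = {"h": 3600, "min": 60, "s": 1}
    total = 0
    for value, unit in re.findall(r"(\d+)\s*(h|min|s)", text):
        total += int(value) * units[unit]
    return total

# Durations quoted in the introduction (K80 run).
quoted_speedup = to_seconds("1h 8min 46s") / to_seconds("8min 20s")

# Wall times recorded in the cell outputs of this notebook (Tesla P100 session).
recorded_speedup = to_seconds("33min 27s") / to_seconds("2min 55s")

print("Quoted run:   {:.2f}x".format(quoted_speedup))
print("Recorded run: {:.2f}x".format(recorded_speedup))
```

Comparing wall times rather than CPU times is deliberate: the CPU run uses two threads, so its accumulated user time is roughly twice its elapsed time.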

To turn on GPU support in Kaggle, open the notebook settings and set the **GPU beta** option to "GPU on".

## Notebook Content
1. [Loading the data](#0)
1. [Training the model on CPU](#1)
1. [Training the model on GPU](#2)
1. [Submission](#3)


<a id="0"></a>
## 1. Loading the data

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn import metrics
import gc
import xgboost as xgb

pd.set_option('display.max_columns', 200)

In [2]:
train_df = pd.read_csv('../input/train.csv', engine='python')
test_df = pd.read_csv('../input/test.csv', engine='python')

<a id="1"></a> 
## 2. Training the model on CPU

In [3]:
import subprocess
print((subprocess.check_output("lscpu", shell=True).strip()).decode())

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    2
Core(s) per socket:    1
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU @ 2.30GHz
Stepping:              0
CPU MHz:               2300.000
BogoMIPS:              4600.00
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0,1
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor l

In [4]:
MAX_TREE_DEPTH = 8
TREE_METHOD = 'hist'
ITERATIONS = 1000
SUBSAMPLE = 0.6
REGULARIZATION = 0.1
GAMMA = 0.3
POS_WEIGHT = 1
EARLY_STOP = 10

# Booster parameters. Logging during training is controlled separately,
# through the verbose_eval argument of xgb.train.
params = {'tree_method': TREE_METHOD, 'max_depth': MAX_TREE_DEPTH, 'alpha': REGULARIZATION,
          'gamma': GAMMA, 'subsample': SUBSAMPLE, 'scale_pos_weight': POS_WEIGHT, 'learning_rate': 0.05,
          'silent': 1, 'objective': 'binary:logistic', 'eval_metric': 'auc'}

In [5]:
%%time
nfold = 5
skf = StratifiedKFold(n_splits=nfold, shuffle=True, random_state=2019)

oof = np.zeros(len(train_df))
predictions = np.zeros(len(test_df))

target = 'target'
predictors = train_df.columns.values.tolist()[2:]

i = 1
for train_index, valid_index in skf.split(train_df, train_df.target.values):
    print("\nFold {}".format(i))
    xg_train = xgb.DMatrix(train_df.iloc[train_index][predictors].values,
                           train_df.iloc[train_index][target].values,                           
                           )
    xg_valid = xgb.DMatrix(train_df.iloc[valid_index][predictors].values,
                           train_df.iloc[valid_index][target].values,                           
                           )   

    
    clf = xgb.train(params, xg_train, ITERATIONS, evals=[(xg_train, "train"), (xg_valid, "eval")],
                early_stopping_rounds=EARLY_STOP, verbose_eval=False)
    oof[valid_index] = clf.predict(xgb.DMatrix(train_df.iloc[valid_index][predictors].values)) 
    
    predictions += clf.predict(xgb.DMatrix(test_df[predictors].values)) / nfold
    i = i + 1

print("\n\nCV AUC: {:<0.2f}".format(metrics.roc_auc_score(train_df.target.values, oof)))


Fold 1
[22:49:15] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.

Fold 2
[22:56:07] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.

Fold 3
[23:02:23] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.

Fold 4
[23:10:20] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.

Fold 5
[23:16:04] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.


CV AUC: 0.88
CPU times: user 1h 5min 2s, sys: 23.8 s, total: 1h 5min 26s
Wall time: 33min 27s


<a id="2"></a>
## 3. Training the model on GPU

In [6]:
!nvidia-smi

Thu Mar  7 23:22:42 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.44                 Driver Version: 396.44                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  On   | 00000000:00:04.0 Off |                    0 |
| N/A   47C    P0    33W / 250W |    303MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage    

We now train the model on the GPU available in Kaggle (a Tesla P100 in the session shown above). XGBoost supports single-GPU training out of the box. On a local workstation, a GPU-ready XGBoost Docker image can be obtained from https://hub.docker.com/r/rapidsai/rapidsai/.

The only change needed is to set `TREE_METHOD = 'gpu_hist'`.
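When the same notebook has to run both with and without a GPU, the tree method can be chosen at run time instead of being hard-coded. The following is a minimal sketch, assuming that a working `nvidia-smi` on the `PATH` is a good enough signal that a GPU is present. The helper `pick_tree_method` is a hypothetical name, not part of XGBoost, and the check does not verify that the installed XGBoost build was compiled with GPU support.

```python
import shutil
import subprocess

def pick_tree_method():
    """Return 'gpu_hist' if nvidia-smi lists at least one GPU, else 'hist'."""
    if shutil.which("nvidia-smi") is None:
        return "hist"
    try:
        # `nvidia-smi -L` prints one line per GPU, e.g. "GPU 0: <name> (UUID: ...)".
        result = subprocess.run(["nvidia-smi", "-L"], capture_output=True,
                                text=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        return "hist"
    has_gpu = result.returncode == 0 and "GPU" in result.stdout
    return "gpu_hist" if has_gpu else "hist"

TREE_METHOD = pick_tree_method()
print("Using tree_method =", TREE_METHOD)
```

Falling back to `'hist'` keeps the CPU and GPU runs comparable, since both are histogram-based tree construction methods.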

In [7]:
MAX_TREE_DEPTH = 8
TREE_METHOD = 'gpu_hist'
ITERATIONS = 1000
SUBSAMPLE = 0.6
REGULARIZATION = 0.1
GAMMA = 0.3
POS_WEIGHT = 1
EARLY_STOP = 10

params = {'tree_method': TREE_METHOD, 'max_depth': MAX_TREE_DEPTH, 'alpha': REGULARIZATION,
          'gamma': GAMMA, 'subsample': SUBSAMPLE, 'scale_pos_weight': POS_WEIGHT, 'learning_rate': 0.05, 
          'silent': 1, 'objective':'binary:logistic', 'eval_metric': 'auc',
          'n_gpus': 1}

In [8]:
%%time
nfold = 5
skf = StratifiedKFold(n_splits=nfold, shuffle=True, random_state=2019)

oof = np.zeros(len(train_df))
predictions = np.zeros(len(test_df))

target = 'target'
predictors = train_df.columns.values.tolist()[2:]

i = 1
for train_index, valid_index in skf.split(train_df, train_df.target.values):
    print("\nFold {}".format(i))
    xg_train = xgb.DMatrix(train_df.iloc[train_index][predictors].values,
                           train_df.iloc[train_index][target].values,                           
                           )
    xg_valid = xgb.DMatrix(train_df.iloc[valid_index][predictors].values,
                           train_df.iloc[valid_index][target].values,                           
                           )   

    
    clf = xgb.train(params, xg_train, ITERATIONS, evals=[(xg_train, "train"), (xg_valid, "eval")],
                early_stopping_rounds=EARLY_STOP, verbose_eval=False)
    oof[valid_index] = clf.predict(xgb.DMatrix(train_df.iloc[valid_index][predictors].values)) 
    
    predictions += clf.predict(xgb.DMatrix(test_df[predictors].values)) / nfold
    i = i + 1

print("\n\nCV AUC: {:<0.2f}".format(metrics.roc_auc_score(train_df.target.values, oof)))


Fold 1

Fold 2

Fold 3

Fold 4

Fold 5


CV AUC: 0.88
CPU times: user 3min 2s, sys: 40.7 s, total: 3min 43s
Wall time: 2min 55s
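The `%%time` magic used above only works inside IPython or Jupyter. To repeat the comparison in a plain Python script, the wall-clock time of each run can be captured with a small context manager. This is a sketch under stated assumptions: `wall_timer` is a hypothetical helper, and the `time.sleep` calls are stand-ins for the two training loops.

```python
import time
from contextlib import contextmanager

@contextmanager
def wall_timer(label, results):
    """Store the wall-clock seconds spent inside the `with` block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

timings = {}

with wall_timer("hist", timings):
    time.sleep(0.02)  # stand-in for the CPU cross-validation loop

with wall_timer("gpu_hist", timings):
    time.sleep(0.01)  # stand-in for the GPU cross-validation loop

speedup = timings["hist"] / timings["gpu_hist"]
print("hist: {:.3f}s, gpu_hist: {:.3f}s, speedup: {:.1f}x".format(
    timings["hist"], timings["gpu_hist"], speedup))
```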


<a id="3"></a>
## 4. Submission

In [9]:
sub_df = pd.DataFrame({"ID_code": test_df.ID_code.values})
sub_df["target"] = predictions
sub_df[:10]

| | ID_code | target |
|---|---|---|
| 0 | test_0 | 0.093176 |
| 1 | test_1 | 0.207328 |
| 2 | test_2 | 0.111683 |
| 3 | test_3 | 0.130428 |
| 4 | test_4 | 0.044165 |
| 5 | test_5 | 0.004539 |
| 6 | test_6 | 0.011087 |
| 7 | test_7 | 0.087311 |
| 8 | test_8 | 0.006254 |
| 9 | test_9 | 0.009013 |


In [10]:
sub_df.to_csv("xgboost_gpu.csv", index=False)