In [None]:
# Upgrade Oracle ADS to pick up latest features and maintain compatibility with Oracle Cloud Infrastructure.

!pip install -U oracle-ads

Oracle Data Science service sample notebook.

Copyright (c) 2020, 2022 Oracle, Inc. All rights reserved. Licensed under the [Universal Permissive License v 1.0](https://oss.oracle.com/licenses/upl).

---

# <font color="red">XGBoost with RAPIDS</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color="teal">Oracle Cloud Infrastructure Data Science Service.</font></p>

---

# Overview:

The purpose of this notebook is to compare the training time of an XGBoost model trained on GPUs with that of the same model trained on CPUs, and to see the resulting speedup.

Compatible with: [NVIDIA RAPIDS 21.10](https://docs.oracle.com/en-us/iaas/data-science/using/conda-rapids-fam.htm) for GPU on Python 3.7 (version 1.0)

---

## Contents:

- <a href='#synthetic-data'>Creating a Synthetic Dataset</a>    
    - <a href='#split'>Splitting the Dataset into a Training and a Validation Sample</a>
    - <a href='#convert'>Converting NumPy Arrays to XGBoost DMatrix Data Format</a>
- <a href='#assign'>Assigning Values to XGBoost Hyperparameters</a>
- <a href='#cpu'>Training an XGBoost Model using CPUs</a>
- <a href='#gpu'>Training an XGBoost Model using GPUs</a>
- <a href='#conclusion'>Conclusion</a>

---

**Important:**

Placeholder text for required values is surrounded by angle brackets, which must be removed when you add the indicated content. For example, when adding a database name, `database_name = "<database_name>"` would become `database_name = "production"`.

---

Datasets are provided as a convenience.  Datasets are considered third-party content and are not considered materials 
under your agreement with Oracle.

---


In [None]:
import numpy as np
import pandas as pd
import xgboost as xgb

<a id='synthetic-data'></a>
## Creating a Synthetic Dataset

The first step is to create synthetic training and validation datasets. The features are tabular, with `n_rows` rows and `n_columns` columns. Numerical features are drawn uniformly from [0, 1) and categorical features are binary values; in both cases `simulate_data` returns the whole array as `np.float32`, with the label in the first column. Numerical and categorical features could also be combined, but this experiment uses only one kind at a time.
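
For repeatable timings, it can help to seed the random number generator. The sketch below is one possible variant that assumes NumPy's `default_rng` API; the function name `simulate_data_seeded` and its `seed` argument are illustrative and are not part of this notebook.

```python
import numpy as np


def simulate_data_seeded(m, n, k=2, numerical=False, seed=0):
    """Hypothetical seeded variant of the notebook's data generator."""
    rng = np.random.default_rng(seed)
    if numerical:
        # Uniform floats in [0, 1) for numerical features.
        features = rng.random((m, n))
    else:
        # Binary values (0 or 1) for categorical features.
        features = rng.integers(0, 2, size=(m, n))
    # Class labels in {0, ..., k - 1}.
    labels = rng.integers(0, k, size=m)
    # The label goes in column 0, the features in the remaining n columns.
    return np.c_[labels, features].astype(np.float32)


sample = simulate_data_seeded(1000, 10)
print(sample.shape, sample.dtype)
```

Calling the function twice with the same `seed` returns identical arrays, so CPU and GPU runs can be compared on exactly the same data.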

In [None]:
# Synthetic dataset size (rows and column counts):
n_rows = int(1e5)
n_columns = int(100)
n_categories = 2

In [None]:
# Function to create the synthetic dataset:
def simulate_data(m, n, k=2, numerical=False):
    if numerical:
        features = np.random.rand(m, n)
    else:
        features = np.random.randint(2, size=(m, n))
    labels = np.random.randint(k, size=m)
    return np.c_[labels, features].astype(np.float32)

In [None]:
%%time
dataset = simulate_data(n_rows, n_columns, n_categories)
print(dataset.shape)

<a id='split'></a>
### Splitting the Dataset into a Training and a Validation Sample

We'll split our dataset into an 80% training dataset and a 20% validation dataset.
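
Slicing off the first 80% of the rows works here because the synthetic rows are already in random order. For real data whose rows may be sorted, a shuffled split is a common alternative. The helper below is a sketch only; `shuffled_split` and its arguments are hypothetical names, and it assumes the label sits in column 0 as in this notebook.

```python
import numpy as np


def shuffled_split(data, train_size=0.80, seed=0):
    """Split rows into train/validation parts after a random shuffle.

    Assumes column 0 holds the label and the remaining columns hold features.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(data.shape[0])
    cut = int(data.shape[0] * train_size)
    train, validation = data[order[:cut]], data[order[cut:]]
    return (train[:, 1:], train[:, 0]), (validation[:, 1:], validation[:, 0])


demo = np.arange(40, dtype=np.float32).reshape(10, 4)
(X_tr, y_tr), (X_va, y_va) = shuffled_split(demo)
print(X_tr.shape, y_tr.shape, X_va.shape, y_va.shape)
```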

In [None]:
# identify shape and indices
n_rows, n_columns = dataset.shape
train_size = 0.80
train_index = int(n_rows * train_size)

# split X, y
X, y = dataset[:, 1:], dataset[:, 0]
del dataset

# split train data
X_train, y_train = X[:train_index, :], y[:train_index]

# split validation data
X_validation, y_validation = X[train_index:, :], y[train_index:]

We can check the dimensions and proportions of our training and validation datasets.

In [None]:
# check dimensions
print(
    "X_train: ", X_train.shape, X_train.dtype, "y_train: ", y_train.shape, y_train.dtype
)
print(
    "X_validation",
    X_validation.shape,
    X_validation.dtype,
    "y_validation: ",
    y_validation.shape,
    y_validation.dtype,
)

# check the proportions
total = X_train.shape[0] + X_validation.shape[0]
print("X_train proportion:", X_train.shape[0] / total)
print("X_validation proportion:", X_validation.shape[0] / total)

<a id='convert'></a>
### Converting NumPy Arrays to XGBoost DMatrix Data Format 

The next step is to convert the `X_train`, `y_train`, `X_validation`, and `y_validation` NumPy arrays to the data matrix (`DMatrix`) format that XGBoost supports. We can instantiate an `xgboost.DMatrix` object by passing in the feature matrix as the first argument, followed by the label vector using the `label=` keyword argument.

To learn more about XGBoost's support for data structures see the XGBoost documentation: 
https://xgboost.readthedocs.io/en/latest/python/python_intro.html#data-interface

In [None]:
%%time

# converting both training and validation datasets:
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalidation = xgb.DMatrix(X_validation, label=y_validation)

<a id='assign'></a>
## Assigning Values to XGBoost Hyperparameters

Several groups of parameters can be set before running XGBoost:

* General parameters select which booster is used, commonly a tree or linear model.
* Booster parameters depend on the booster you have chosen.
* Learning task parameters define the learning scenario. For example, regression tasks may use different parameters than ranking tasks.

For more information on the configurable parameters within the XGBoost module, see the documentation here:


https://xgboost.readthedocs.io/en/latest/parameter.html
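
The three parameter groups above all end up in one dictionary that is passed to XGBoost. The sketch below shows one way to keep the groups separate while building that dictionary. The helper name `build_params` is hypothetical, and the booster values shown (`max_depth` of 6 and `eta` of 0.3) are simply XGBoost's documented defaults, written out for illustration.

```python
def build_params(general=None, booster=None, learning_task=None):
    """Merge the three parameter groups into the single dict XGBoost expects."""
    params = {}
    for group in (general, booster, learning_task):
        params.update(group or {})
    return params


example_params = build_params(
    general={"verbosity": 1},
    booster={"max_depth": 6, "eta": 0.3},
    learning_task={"objective": "binary:logistic", "eval_metric": "auc"},
)
print(example_params)
```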

In [None]:
# instantiate params
params = {}

# general params
# "silent" is deprecated in favor of "verbosity" (0 = silent, 3 = debug)
general_params = {"verbosity": 0}
params.update(general_params)

# learning task params
learning_task_params = {"eval_metric": "auc", "objective": "binary:logistic"}
params.update(learning_task_params)
print(params)

<a id='cpu'></a>
## Training an XGBoost Model using CPUs

Now it's time to train the model! You can use the `xgb.train` function and pass in the parameters, training dataset, the number of boosting iterations, and the list of items to be evaluated during training. For more information on the parameters that can be passed into `xgb.train`, check out the documentation:

https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train

In the cell below, set the value of `n_gpus` to 0. This triggers model training on CPUs. XGBoost supports multithreading out of the box, so you should see all the CPUs of your VM in use during training.
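
Because the same `params` dictionary is reused for the CPU and GPU runs, a GPU setting can linger if the cells are run out of order. The sketch below shows one way to build a CPU-only copy and, optionally, pin the thread count with XGBoost's `nthread` parameter. The helper name `cpu_params` and its arguments are hypothetical.

```python
import os


def cpu_params(base_params, n_threads=None):
    """Return a copy of base_params set up for CPU training."""
    params = dict(base_params)
    # Drop a GPU tree method left over from an earlier GPU run.
    if params.get("tree_method") == "gpu_hist":
        del params["tree_method"]
    # Default to every CPU the operating system reports.
    params["nthread"] = n_threads or os.cpu_count() or 1
    return params


stale = {"objective": "binary:logistic", "tree_method": "gpu_hist"}
print(cpu_params(stale, n_threads=4))
```

The input dictionary is left unchanged, so the GPU configuration can still be reused later.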

In [None]:
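# booster params.
n_gpus = 0  # 0 trains on CPUs; any other value trains on a GPU
booster_params = {}

if n_gpus != 0:
    # "gpu_hist" selects the GPU histogram algorithm. The "n_gpus" booster
    # parameter was deprecated and later removed from XGBoost, so it is not
    # passed here.
    booster_params["tree_method"] = "gpu_hist"
params.update(booster_params)
print(params)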

In [None]:
# model training settings
evallist = [(dvalidation, "validation"), (dtrain, "train")]

num_round = 100

In [None]:
%%time

bst = xgb.train(params, dtrain, num_round, evallist)

<a id='gpu'></a>
## Training an XGBoost Model using GPUs

Let's now repeat the same training step but this time we will use GPUs to train the model. 
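
Before switching to GPU training, it can be useful to check that a GPU is visible at all. The sketch below guesses this by looking for the `nvidia-smi` executable on the `PATH`, which is only a heuristic and an assumption of this example. The helper name `gpu_booster_params` is hypothetical.

```python
import shutil


def gpu_booster_params(gpu_available=None):
    """Return the booster params used in this notebook for CPU or GPU training.

    If gpu_available is None, guess by looking for nvidia-smi on the PATH.
    """
    if gpu_available is None:
        gpu_available = shutil.which("nvidia-smi") is not None
    return {"tree_method": "gpu_hist"} if gpu_available else {}


print(gpu_booster_params())
```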

In [None]:
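# GPUs available? (1/0)
n_gpus = 1
booster_params = {}

if n_gpus != 0:
    # "gpu_hist" selects the GPU histogram algorithm. The "n_gpus" booster
    # parameter was deprecated and later removed from XGBoost, so it is not
    # passed here.
    booster_params["tree_method"] = "gpu_hist"
params.update(booster_params)
print(params)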

In [None]:
%%time

bst = xgb.train(params, dtrain, num_round, evallist)

That was fast! Compare the wall time reported by `%%time` for this cell with the one from the CPU run. Speedups of 50x or more are possible, depending on the VM shape and the generation of GPU (for example, P100 versus V100) you are using.
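
To turn the two timings into a single speedup figure, you could measure each run in code rather than reading the `%%time` output. The sketch below is a minimal example; `timed` and `speedup` are hypothetical helpers, and the 120 and 4 second values are made-up inputs, not measurements from this notebook.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def speedup(cpu_seconds, gpu_seconds):
    """How many times faster the GPU run was than the CPU run."""
    return cpu_seconds / gpu_seconds


# Stand-in workload so the sketch runs anywhere. In this notebook the call
# would look like: timed(xgb.train, params, dtrain, num_round, evallist)
_, elapsed = timed(sum, range(100_000))
print(f"elapsed: {elapsed:.4f}s")

# Made-up timings, used only to show the arithmetic.
print(f"example speedup: {speedup(120.0, 4.0):.1f}x")
```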

<a id='conclusion'></a>
## Conclusion

We compared the training time of an XGBoost model trained on CPUs with that of the same model trained on GPUs. For additional resources, see the references below.

<a id='ref'></a>
# References

- [ADS Library Documentation](https://accelerated-data-science.readthedocs.io/en/latest/index.html)
- [Data Science YouTube Videos](https://www.youtube.com/playlist?list=PLKCk3OyNwIzv6CWMhvqSB_8MLJIZdO80L)
- [OCI Data Science Documentation](https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm)
- [Oracle Data & AI Blog](https://blogs.oracle.com/datascience/)
- [NVIDIA RAPIDS](http://rapids.ai)