# Intel AI Kit and XGBoost Using daal4py

### Learning objectives

* Use XGBoost with Intel's AI Kit
* Take advantage of the Intel extensions to scikit-learn by enabling them alongside XGBoost
* Use oneDAL to improve prediction performance
 


In this example, we will use a dataset of particle features, and functions of those features, **to distinguish between a signal process which produces Higgs bosons (1) and a background process which does not (0)**. The Higgs boson is an elementary particle in the Standard Model, produced by the quantum excitation of the Higgs field and named after physicist Peter Higgs.

![image](3D_view_energy_of_8_TeV.png)
[Image Source](https://commons.wikimedia.org/wiki/File:3D_view_of_an_event_recorded_with_the_CMS_detector_in_2012_at_a_proton-proton_centre_of_mass_energy_of_8_TeV.png)

## Import Necessary Libraries

In [None]:
import sklearn
from sklearnex import patch_sklearn
patch_sklearn()
#unpatch_sklearn()
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
from pandas import MultiIndex, Int16Dtype  # import these before xgboost, in this order, to avoid the pandas.Int64Index FutureWarning
import xgboost as xgb
import numpy as np
from time import perf_counter
print("XGB Version          : ", xgb.__version__)
print("Scikit-Learn Version : ", sklearn.__version__)
print("Pandas Version       : ", pd.__version__)

## Import the Data: 

### (If you ran module 03 you should already have the HIGGS data and do not need to execute the following cells to fetch it; however, you do need to create a symbolic link to HIGGS.csv)

```ln -s /home/u12345/AI_Kit_XGBoost_Predictive_Modeling/03_XGBoost/HIGGS.csv HIGGS.csv```

* The first column is the class label (1 for signal, 0 for background), followed by the 28 features (21 low-level features, then 7 high-level features).

* The full dataset has 11 million rows; the loading cell in this notebook reads the first 1.1 million by default. Adjust the __nrows__ value to something manageable for the system you are using. 100K rows is easy for a modern laptop; once you start optimizing, however, much more than that can take some time.

[Data Source](https://archive.ics.uci.edu/ml/datasets/HIGGS)

### To get the data using the Intel DevCloud execute the following cells:

In [None]:
# ! cp /data/oneapi_workshop/big_datasets/xgboost/HIGGS.tar.gz .

In [None]:
# ! tar -xzf HIGGS.tar.gz

### __Do not__ run this if on the Intel DevCloud.  To fetch the data for your local install execute the below two cells.

In [None]:
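# import os
# import requests
# if not os.path.isfile("./HIGGS.csv.gz"):
#     print("Fetching data set from Internet...~2.8GB")
#     url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz"
#     # Stream the download to disk in chunks instead of holding the whole file in memory
#     with requests.get(url, stream=True) as response:
#         response.raise_for_status()
#         with open('./HIGGS.csv.gz', 'wb') as f:
#             for chunk in response.iter_content(chunk_size=1 << 20):
#                 f.write(chunk)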

In [None]:
# ! gunzip HIGGS.csv.gz

### Set the number of rows to read via the nrows= argument. 100K is manageable on a laptop.

In [None]:
filename = 'HIGGS.csv'
names =  ['class_label', 'lepton pT', 'lepton eta', 'lepton phi', 'missing energy magnitude', 'missing energy phi', 'jet 1 pt', 'jet 1 eta', 'jet 1 phi', 'jet 1 b-tag', 'jet 2 pt', 'jet 2 eta', 'jet 2 phi', 'jet 2 b-tag', 'jet 3 pt', 'jet 3 eta', 'jet 3 phi', 'jet 3 b-tag', 'jet 4 pt', 'jet 4 eta', 'jet 4 phi', 'jet 4 b-tag', 'm_jj', 'm_jjj', 'm_lv', 'm_jlv', 'm_bb', 'm_wbb', 'm_wwbb']
#data = pd.read_csv(filename, names=names, delimiter=",", nrows=100000)
data = pd.read_csv(filename, names=names, delimiter=",", nrows=1100000)
print(data.shape)

### Examine the data:

In [None]:
data.head()

### Create your train/test split. 

* Remember that the first column (column 0) holds the class label: 0 = background, 1 = signal. We separate it from the 28 feature columns and use it as the prediction target.

In [None]:
# Features are every column after the first; the label is column 0
X, y = data.iloc[:, 1:], data.iloc[:, 0]

### We use scikit-learn's train_test_split to create the train/test split. Feel free to experiment with the split size and random state; just make sure you use the same random state throughout the notebook.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
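
To illustrate why the random state matters, here is a minimal sketch of a seeded split written in plain Python. It is not scikit-learn's implementation; the helper `split_indices` and its rounding rule are assumptions made for illustration. It only shows that the same seed reproduces the same partition of rows.

```python
import random

def split_indices(n_rows, test_size=0.33, seed=42):
    """Shuffle row indices with a fixed seed and cut them into train/test parts.

    Hypothetical helper for illustration; not scikit-learn's algorithm.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]

# Two splits with the same seed and one with a different seed
train_a, test_a = split_indices(100, seed=42)
train_b, test_b = split_indices(100, seed=42)
train_c, test_c = split_indices(100, seed=7)

print("same seed gives same split:     ", test_a == test_b)
print("different seed gives same split:", test_a == test_c)
print("train rows:", len(train_a), "test rows:", len(test_a))
```

If the seed changed between cells, the XGBoost and oneDAL models would be scored on different rows and their results would no longer be comparable.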

### Get a baseline using the XGBoost defaults.  

Now that the data is split into train and test sets, let's use the default XGBoost parameters to get a baseline result. If you are familiar with the parameters listed here, feel free to add them to the parameters cell below and modify their values. We will explore how to find better results later in the notebook.

* __learning_rate:__ step size shrinkage used to prevent overfitting. The range is 0 to 1; a lower rate usually generalizes better but needs more boosting rounds.
* __max_depth:__ determines how deeply each tree is allowed to grow during any boosting round.
* __subsample:__ fraction of samples used per tree. A low value can lead to underfitting.
* __colsample_bytree:__ fraction of features used per tree. A high value can lead to overfitting.
* __n_estimators:__ number of trees built.
* __objective:__ determines the loss function type:
    * reg:squarederror (formerly reg:linear, which is deprecated) for regression problems.
    * reg:logistic for logistic regression.
    * binary:logistic for binary classification problems; outputs a probability.
    
    [There are many more parameters, here is the reference.](https://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters)
    
* As a baseline we set three parameters: the binary:logistic objective, the cpu_predictor, and disable_default_eval_metric. The last one is set, for now, because of a recent change in XGBoost's behavior that switched the default evaluation metric for this objective from error to logistic loss.
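
To make the `binary:logistic` objective more concrete, here is a minimal sketch of the logistic (sigmoid) transform that maps a raw additive score to a probability, followed by a 0.5 cutoff to obtain a class label. The function names and the example scores are hypothetical and only illustrate the idea; XGBoost performs the equivalent step internally.

```python
import math

def sigmoid(margin):
    """Map a raw score (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))

def to_label(probability, threshold=0.5):
    """Turn a probability into a 0/1 class label using a cutoff."""
    return 1 if probability > threshold else 0

# Hypothetical raw scores, e.g. the summed tree outputs for three events
raw_scores = [-2.0, 0.5, 1.5]
probabilities = [sigmoid(s) for s in raw_scores]
labels = [to_label(p) for p in probabilities]

for score, prob, label in zip(raw_scores, probabilities, labels):
    print(f"score={score:+.1f}  probability={prob:.3f}  label={label}")
```

A negative score maps to a probability below 0.5 (background, 0) and a positive score to a probability above 0.5 (signal, 1).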

## Set XGBoost Training Parameters

In [None]:
# Set XGBoost training parameters
xgb_params = {
    'objective':                    'binary:logistic',
    'predictor':                    'cpu_predictor',
    'disable_default_eval_metric':  'true',
}

# Train the model
warnings.simplefilter(action='ignore', category=UserWarning)
t1_start = perf_counter()  # Time fit function
model_xgb = xgb.XGBClassifier(**xgb_params)
model_xgb.fit(X_train, y_train)
t1_stop = perf_counter()
print("It took", t1_stop - t1_start, "seconds to train the model.")

## Predict Using XGBoost

In [None]:
# Time prediction on the test set
t1_start = perf_counter()
result_predict_xgb_test = model_xgb.predict(X_test)
t1_stop = perf_counter()
print ("It took", t1_stop-t1_start,"seconds for prediction with XGBoost.")

## Accuracy of XGBoost

In [None]:
acc = np.mean(y_test == result_predict_xgb_test)
print("Model accuracy =",acc)
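
The accuracy above is simply the fraction of test rows whose predicted label matches the true label, and the error rate is its complement. A minimal plain-Python sketch on made-up labels (the values below are hypothetical, not results from this notebook):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Hypothetical labels and predictions for eight events
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]

toy_acc = accuracy(toy_true, toy_pred)
toy_error = 1.0 - toy_acc

print("accuracy  =", toy_acc)    # 6 of 8 match -> 0.75
print("error rate =", toy_error)  # 0.25
```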

## Convert an XGBoost model to oneDAL

In [None]:
import daal4py as d4p
clf = xgb.XGBClassifier(**xgb_params)
xgb_model = clf.fit(X_train,y_train)

In [None]:
daal_model = d4p.get_gbt_model_from_xgboost(xgb_model.get_booster())

## Predict using oneDAL

In [None]:
t1_start = perf_counter()  # Time function
daal_prediction = d4p.gbt_classification_prediction(nClasses=2).compute(X_test, daal_model).prediction
t1_stop = perf_counter()
print ("It took", t1_stop-t1_start,"seconds for prediction with oneDAL.")
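
A single `perf_counter` measurement can be noisy, so when comparing the two prediction paths it may help to repeat each call a few times and look at the median. The sketch below is one way to do that; `time_call` and `toy_predict` are hypothetical helpers, and the toy predictor stands in for a real model so the example runs on its own.

```python
from statistics import median
from time import perf_counter

def time_call(func, *args, repeats=5, **kwargs):
    """Call func several times; return the last result and all elapsed times in seconds."""
    timings = []
    result = None
    for _ in range(repeats):
        start = perf_counter()
        result = func(*args, **kwargs)
        timings.append(perf_counter() - start)
    return result, timings

def toy_predict(rows):
    """Stand-in predictor: label a row 1 if its features sum to a positive value."""
    return [1 if sum(row) > 0 else 0 for row in rows]

rows = [[0.5, -1.0], [2.0, 0.1], [-0.3, -0.2]]
predictions, timings = time_call(toy_predict, rows, repeats=5)

print("predictions:", predictions)
print(f"median of {len(timings)} runs: {median(timings):.6f} s")
```

In this notebook the same helper could be given `model_xgb.predict` with `X_test`, or a small function wrapping the daal4py `compute` call, to compare median prediction times.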

## Accuracy of oneDAL

In [None]:
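print("\nXGBoost prediction results (first 10 rows):\n", result_predict_xgb_test[0:10])
print("\ndaal4py prediction results (first 10 rows):\n", daal_prediction[0:10])
print("\nGround truth (first 10 rows):\n", y_test[0:10])

# daal4py returns a column vector of shape (n, 1), so flatten it before comparing with the labels
daal_acc = np.mean(y_test.to_numpy() == daal_prediction.ravel())
print("\noneDAL model accuracy =", daal_acc)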

# Summary:

* We covered how to set parameters for XGBoost.
* We converted an XGBoost model to a oneDAL model.
* We compared oneDAL prediction time with XGBoost prediction time.