# Intel AI Kit and XGBoost

### Learning objectives

* Utilize XGBoost with Intel's AI Kit
* Take advantage of the Intel extensions to scikit-learn by enabling them alongside XGBoost
* Use cross-validation to find better XGBoost hyperparameters
* Use a learning curve to estimate the ideal number of trees
* Improve performance by implementing early stopping


In this example, we will use a dataset with particle features and functions of those features **to distinguish between a signal process which produces Higgs bosons (1) and a background process which does not (0)**. The Higgs boson is an elementary particle in the Standard Model, produced by the quantum excitation of the Higgs field and named after physicist Peter Higgs.

![image](3D_view_energy_of_8_TeV.png)
[Image Source](https://commons.wikimedia.org/wiki/File:3D_view_of_an_event_recorded_with_the_CMS_detector_in_2012_at_a_proton-proton_centre_of_mass_energy_of_8_TeV.png)

## Import Necessary Libraries

In [None]:
import sklearn
from sklearnex import patch_sklearn
patch_sklearn()
#unpatch_sklearn()
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
from pandas import MultiIndex, Int16Dtype # keep this import order; otherwise pandas emits a FutureWarning about pandas.Int64Index.
import xgboost as xgb
import numpy as np
from time import perf_counter
print("XGB Version          : ", xgb.__version__)
print("Scikit-Learn Version : ", sklearn.__version__)
print("Pandas Version       : ", pd.__version__)
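
The patch has to be applied before the scikit-learn estimators are imported, which is why `patch_sklearn()` sits above the `sklearn.model_selection` import. A minimal sketch, assuming you also want the notebook to run on machines where the extension is not installed (the `INTEL_PATCHED` flag is a hypothetical name, not part of any library):

```python
# Hypothetical guard: fall back to stock scikit-learn when the
# Intel extension (sklearnex) is not installed.
try:
    from sklearnex import patch_sklearn
    patch_sklearn()          # apply before importing sklearn estimators
    INTEL_PATCHED = True
except ImportError:
    INTEL_PATCHED = False

print("Intel extension active:", INTEL_PATCHED)
```

Printing the flag makes it obvious in the notebook output which code path the timings below were measured on.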

## Import the Data:

* The first column is the class label (1 for signal, 0 for background), followed by the 28 features (21 low-level features, then 7 high-level features).

* The full dataset has 11 million rows, so adjust the __nrows__ value to something manageable for the system you are using.  100K rows are easy for a modern laptop; however, once you start optimizing, much more than that can take some time.

[Data Source](https://archive.ics.uci.edu/ml/datasets/HIGGS)
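
To pick a value for `nrows`, a rough memory estimate helps. The sketch below assumes every one of the 29 columns is stored as a 64-bit float (which is what `data.dtypes` should confirm later) and ignores pandas overhead; the function name is hypothetical.

```python
N_COLUMNS = 29          # 1 class label + 28 features
BYTES_PER_VALUE = 8     # assumption: each column is stored as float64

def estimated_megabytes(nrows: int) -> float:
    """Rough in-memory size of the loaded DataFrame in MB (10**6 bytes)."""
    return nrows * N_COLUMNS * BYTES_PER_VALUE / 1e6

for nrows in (100_000, 1_100_000):
    print(f"{nrows:>10,} rows -> ~{estimated_megabytes(nrows):,.1f} MB")
```

Training and grid search need several times this amount, because the data is copied for each split and each parallel worker.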

### To get the data using the Intel DevCloud execute the following cells:

In [None]:
# ! cp /data/oneapi_workshop/big_datasets/xgboost/HIGGS.tar.gz .

In [None]:
# ! tar -xzf HIGGS.tar.gz

### __Do not__ run this if you are on the Intel DevCloud.  To fetch the data for a local install, execute the two cells below.

In [None]:
# import os
# import requests
# if not os.path.isfile("./HIGGS.csv.gz"):
#         print("Fetching data set from Internet...~2.8GB")
#         url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz"
#         myfile = requests.get(url)
#         with open('./HIGGS.csv.gz', 'wb') as f:
#             f.write(myfile.content)

In [None]:
# ! gunzip HIGGS.csv.gz

### Set the number of rows to use via nrows= variable.  100K is manageable on a laptop.

In [None]:
filename = 'HIGGS.csv'
names =  ['class_label', 'lepton pT', 'lepton eta', 'lepton phi', 'missing energy magnitude', 'missing energy phi', 'jet 1 pt', 'jet 1 eta', 'jet 1 phi', 'jet 1 b-tag', 'jet 2 pt', 'jet 2 eta', 'jet 2 phi', 'jet 2 b-tag', 'jet 3 pt', 'jet 3 eta', 'jet 3 phi', 'jet 3 b-tag', 'jet 4 pt', 'jet 4 eta', 'jet 4 phi', 'jet 4 b-tag', 'm_jj', 'm_jjj', 'm_lv', 'm_jlv', 'm_bb', 'm_wbb', 'm_wwbb']
#data = pd.read_csv(filename, names=names, delimiter=",", nrows=100000)
data = pd.read_csv(filename, names=names, delimiter=",", nrows=1100000)
print(data.shape)

In [None]:
%time p_df = pd.read_csv("HIGGS.csv")

### Examine the data:

In [None]:
data.head()

* What kind of data is this?

In [None]:
data.dtypes

* Examine the distribution of the Higgs Boson class_label.  Depending on how many rows you load this could change how you choose to split the data.  

In [None]:
data.class_label.value_counts()

* With 100000 rows loaded the balance isn't too skewed. The next cell is optional.

In [None]:
data.class_label.value_counts(normalize=True)

### Create your train/test split. 

* Remember that the first column is the label (0 = background, 1 = signal), so we keep it out of the features and use it as the target to predict.

In [None]:
X, y = data.iloc[:, 1:],data.iloc[:,0]

* These next two cells are optional; they are just a sanity check that the split data actually represents our intentions.

In [None]:
#check split of data.  This is the x variable.
print(data.iloc[:,1:])

In [None]:
# This is the y target vector -- the ones we want to predict.
print(data.iloc[:,0])

### We are using the scikit-learn methodology to create the train/test split.  Feel free to play with the split and random state, just make sure you use the same random state throughout the notebook.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

* Another sanity check, make sure nothing odd happened after splitting the data.

In [None]:
# y_train.value_counts(normalize=True)

In [None]:
# y_test.value_counts(normalize=True)

### Get a baseline using the XGBoost defaults.  

Now that we have our data split into train and test datasets, let's use the default XGBoost parameters to get a baseline result.  If you are familiar with the parameters listed here, feel free to add them to the parameters cell below and modify them.  We will explore how to find better results later in the notebook.

* __learning_rate:__ step size shrinkage used to prevent overfitting. Range is 0 to 1; a lower rate is usually better but needs more trees.
* __max_depth:__ determines how deeply each tree is allowed to grow during any boosting round.
* __subsample:__ fraction of samples used per tree. A low value can lead to underfitting.
* __colsample_bytree:__ fraction of features used per tree. A high value can lead to overfitting.
* __n_estimators:__ number of trees built.
* __objective:__ determines the loss function type:
    * reg:squarederror (formerly reg:linear) for regression problems.
    * reg:logistic for logistic regression.
    * binary:logistic for binary classification; the output is a probability.
    
    [There are many more parameters, here is the reference.](https://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters)
    
* For the baseline we set three parameters: the binary:logistic objective, the cpu_predictor, and disable_default_eval_metric. The last one turns off the default evaluation metric for now, because XGBoost recently changed that default for this objective from error to logloss.

In [None]:
# Set XGBoost parameters
xgb_params = {
    'objective':                    'binary:logistic',
    'predictor':                    'cpu_predictor',
    'disable_default_eval_metric':  'true',
}

# Train the model
warnings.simplefilter(action='ignore', category=UserWarning)
t1_start = perf_counter()  # Time fit function
model_xgb= xgb.XGBClassifier(**xgb_params)
model_xgb.fit(X_train,y_train)
t1_stop = perf_counter()
print("It took", t1_stop - t1_start, "seconds to fit.")

In [None]:
result_predict_xgb_test = model_xgb.predict(X_test)

In [None]:
# Check model accuracy
acc = np.mean(y_test == result_predict_xgb_test)
print("Model accuracy =",acc)

#### Accuracy:

* 100000 rows using defaults achieved ~72% accuracy.  Not bad, but let us see if we can do better.
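
To judge whether ~72% is "not bad", it helps to compare it with the accuracy of always predicting the most common class. A minimal sketch with made-up labels (the function name and the toy data are assumptions for illustration, not values from the HIGGS data):

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Toy labels for illustration only: six signal, four background
toy_labels = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1]
print(majority_baseline_accuracy(toy_labels))
```

In the notebook the same number can be read from `y_test.value_counts(normalize=True).max()`; the model is only useful to the extent that it beats that value.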

In [None]:
# View the settings of the default XGBoost implementation.
model_xgb

### Tune Parameters with GridSearchCV

* As you can see above, there are many parameters that can be modified and tuned, and profiling each one would take a lot of time.  In this exercise we will focus on some of the most frequently tuned parameters. GridSearchCV is an exhaustive search over a set of parameters that fits a separate model for each combination, so it is important to consider how many cores and how much memory you have.

#### Parameters for Tree Booster

__eta__ [default=0.3, alias: learning_rate]  range: [0,1]

* Step size shrinkage used in update to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.
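
The effect of shrinkage can be seen in a toy setting. This is only an illustration, not XGBoost's tree-building code: every "weak learner" here is simply the mean of the current residuals, and `boost_constant` is a hypothetical helper.

```python
def boost_constant(targets, eta, n_rounds):
    """Toy boosting for squared error where every weak learner is a constant.

    Each round fits the mean of the current residuals and adds only
    eta * that value to the running prediction.
    """
    prediction = 0.0
    history = []
    for _ in range(n_rounds):
        residuals = [t - prediction for t in targets]
        step = sum(residuals) / len(residuals)   # best constant fit
        prediction += eta * step                 # shrink the update
        history.append(prediction)
    return history

targets = [2.0, 4.0, 6.0]            # the mean is 4.0
fast = boost_constant(targets, eta=1.0, n_rounds=5)
slow = boost_constant(targets, eta=0.3, n_rounds=5)
print("eta=1.0:", [round(p, 3) for p in fast])
print("eta=0.3:", [round(p, 3) for p in slow])
```

With eta=1.0 the first round already jumps to the mean, while eta=0.3 approaches it gradually. That is why a smaller learning rate usually has to be paired with more boosting rounds.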



__gamma__ [default=0, alias: min_split_loss]  range: [0,∞]

* Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be.


__max_depth__ [default=6]  range: [0,∞]

* Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit. 0 indicates no limit on depth. Beware that XGBoost aggressively consumes memory when training a deep tree. The exact tree method requires a non-zero value.


__subsample__ [default=1] range: [0,1]

* Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost would randomly sample half of the training data prior to growing trees, and this will prevent overfitting. Subsampling will occur once in every boosting iteration.

__colsample_bytree__ [default=1] range: [0,1]

* colsample_bytree is the subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed.

__lambda__ [default=1, alias: reg_lambda]

* L2 regularization term on weights. Increasing this value will make model more conservative.
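
For reference, the split gain given in the XGBoost documentation's "Introduction to Boosted Trees" shows where gamma and lambda enter. Here $G_L, H_L$ and $G_R, H_R$ are the sums of the first and second derivatives of the loss over the instances in the left and right child:

$$
\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma,
\qquad
w^{*} = -\frac{G}{H+\lambda}
$$

A split is only worth keeping when the gain is positive, so a larger $\gamma$ removes more candidate splits, while a larger $\lambda$ shrinks both the gain terms and the leaf weight $w^{*}$.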

__scale_pos_weight__ [default=1]

* Control the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider: sum(negative instances) / sum(positive instances)
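
The typical value mentioned above is a one-line calculation. A minimal sketch with made-up labels (the helper name and the toy data are assumptions for illustration):

```python
def suggested_scale_pos_weight(labels):
    """Return sum(negative instances) / sum(positive instances)."""
    positives = sum(1 for label in labels if label == 1)
    negatives = sum(1 for label in labels if label == 0)
    if positives == 0:
        raise ValueError("no positive instances in labels")
    return negatives / positives

# Toy labels for illustration only: 6 background, 2 signal
toy_labels = [0, 0, 0, 1, 0, 0, 1, 0]
print(suggested_scale_pos_weight(toy_labels))
```

On the notebook's data the equivalent expression would be `(y_train == 0).sum() / (y_train == 1).sum()`; for a roughly balanced sample it comes out close to 1, which is the default.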

[These descriptions come straight from the docs, where you can view the explanations of all parameters.](https://xgboost.readthedocs.io/en/stable/parameter.html)

Feel free to change these values; they are a good starting point for round one.  A word of caution: this takes ~1 to 3 hours on an Intel® Xeon® 6128 running in the Intel DevCloud.

In [None]:
param_grid = {
    "learning_rate": [0.1, 0.3, 0.5],
    "gamma": [0, 0.25, 1],
    "max_depth": [4, 6, 8],
    "subsample": [0.5, 1],
    "colsample_bytree": [0.7, 1],
    "colsample_bynode": [0.7, 1],
    "reg_lambda": [0, 1, 10],
    "scale_pos_weight": [1],
}
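
Before launching the search it is worth counting how many models it will fit. The sketch below repeats the grid so that it runs on its own and assumes 3-fold cross-validation, matching the `cv=3` used in this notebook's GridSearchCV call:

```python
from itertools import product

param_grid = {
    "learning_rate": [0.1, 0.3, 0.5],
    "gamma": [0, 0.25, 1],
    "max_depth": [4, 6, 8],
    "subsample": [0.5, 1],
    "colsample_bytree": [0.7, 1],
    "colsample_bynode": [0.7, 1],
    "reg_lambda": [0, 1, 10],
    "scale_pos_weight": [1],
}
cv_folds = 3  # assumption: same as cv=3 in the grid search

# Every combination of candidate values is one configuration to evaluate
combinations = list(product(*param_grid.values()))
total_fits = len(combinations) * cv_folds
print(len(combinations), "combinations x", cv_folds, "folds =", total_fits, "fits")
```

Adding a single extra value to any parameter multiplies the total, which is why the search takes hours. By default GridSearchCV also refits once more on the full training set with the best combination.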

In [None]:
from sklearn.model_selection import GridSearchCV
xgb_params2 = {
    'objective':                    'binary:logistic',
    'predictor':                    'cpu_predictor',
    'disable_default_eval_metric':  'true',
    'tree_method':                  'hist', 
}
# Necessary for now to suppress multi-threaded FutureWarning messages with respect to pandas and XGBoost
import os
os.environ['PYTHONWARNINGS']='ignore::FutureWarning'

# Train the model
model_xgb= xgb.XGBClassifier(**xgb_params2, use_label_encoder=False)

# Setup grid search n_jobs=-1 uses all cores, reducing cv from 5 to 3 for speed, scoring is done using area under curve.
grid_cv = GridSearchCV(model_xgb, param_grid, n_jobs=-1, cv=3, scoring="roc_auc")

# This fit function takes a while--hours, make sure you are ready.
_ = grid_cv.fit(X_train, y_train)

In [None]:
grid_cv.best_score_

In [None]:
grid_cv.best_params_

### Results

    grid_cv.best_score_ = 0.80

    grid_cv.best_params_ = {'colsample_bytree': 1, 'gamma': 0, 'learning_rate': 0.1, 'max_depth': 8, 'reg_lambda': 10, 'scale_pos_weight': 1, 'subsample': 1}

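As you can see, the best cross-validated score came back at 0.80, a promising improvement over the default settings.  Note that this score is ROC AUC (set via scoring="roc_auc"), whereas the baseline figure was accuracy, so the two are not directly comparable.  The grid used for this experiment was: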

    "max_depth": [2, 4, 6, 8],
    "learning_rate": [0.1, 0.3, 0.5],
    "gamma": [0, 0.25, 1],
    "reg_lambda": [0, 1, 10],
    "scale_pos_weight": [1, 3, 5],
    "subsample": [1],
    "colsample_bytree": [1],
  
The best results from the experiment were as follows, formatted for easy pasting into the above xgb_params kwargs:

    'max_depth': 8,
    'learning_rate': 0.1,
    'gamma': 0,
    'reg_lambda': 10,
    'scale_pos_weight': 1,
    
From this result it would be worth exploring a greater tree depth, since the best max_depth sat at the edge of our grid, and likewise a larger reg_lambda.  As an exercise, fix the best values found so far and add some additional ranges, for example max_depth: [8, 10, 12].
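
Checking for edge values can be automated. A minimal sketch using the grid and best values listed above; `params_on_grid_edge` is a hypothetical helper, not part of scikit-learn or XGBoost:

```python
def params_on_grid_edge(grid, best):
    """Return {name: 'low' | 'high'} for best values at the edge of their range."""
    edges = {}
    for name, value in best.items():
        candidates = sorted(grid[name])
        if len(candidates) < 2:
            continue                      # a single candidate has no edges
        if value == candidates[0]:
            edges[name] = "low"
        elif value == candidates[-1]:
            edges[name] = "high"
    return edges

grid = {
    "max_depth": [2, 4, 6, 8],
    "learning_rate": [0.1, 0.3, 0.5],
    "gamma": [0, 0.25, 1],
    "reg_lambda": [0, 1, 10],
    "scale_pos_weight": [1, 3, 5],
}
best = {"max_depth": 8, "learning_rate": 0.1, "gamma": 0,
        "reg_lambda": 10, "scale_pos_weight": 1}
print(params_on_grid_edge(grid, best))
```

Every tuned parameter lands on an edge here. For gamma, 0 is already the lower bound of its valid range, so the values most worth extending are max_depth and reg_lambda upward, and possibly learning_rate downward.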

## Implement an XGBoost classifier using the above results:  

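* Hint: you can unpack grid_cv.best_params_ with `**`, just like xgb_params above. Do not also pass subsample or colsample_bytree explicitly: they are already in best_params_, and a repeated keyword raises a TypeError.

        model_xgb = xgb.XGBClassifier(
            **grid_cv.best_params_,
            objective="binary:logistic"
        )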
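
When fixed settings and searched settings can share a key, one way to avoid passing a keyword twice is to merge them into a single dictionary first. In this sketch `best_params` is a stand-in for `grid_cv.best_params_`, copied from the results listed above, and `fixed_params` is a hypothetical name:

```python
# Stand-in for grid_cv.best_params_, copied from the results listed above
best_params = {
    "colsample_bytree": 1, "gamma": 0, "learning_rate": 0.1,
    "max_depth": 8, "reg_lambda": 10, "scale_pos_weight": 1, "subsample": 1,
}
# Hypothetical fixed settings that are not part of the search
fixed_params = {"objective": "binary:logistic", "subsample": 1}

# Later dictionaries win on duplicate keys, so nothing is passed twice
merged = {**fixed_params, **best_params}
print(merged)
```

The merged dictionary can then be unpacked in one go, for example `xgb.XGBClassifier(**merged)`.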

## Further Tuning

Another parameter that is frequently tuned is:

     n_estimators, default=100

n_estimators represents the number of trees in the forest.  A good way to see how many trees might be useful is to plot the learning curve.  Since this is a classification problem we will use log loss as our measurement, where lower values are better.

Our original fit call needs to be modified to include eval_metric set to logloss.  In addition, we need to define the evaluation datasets so that the results are evaluated after each round and can be plotted.
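
For intuition about what the curve will show, log loss for binary labels is the mean of -(y·log(p) + (1-y)·log(1-p)). A minimal pure-Python sketch (this is an illustration of the formula, not XGBoost's implementation):

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; lower is better."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)      # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

print(round(log_loss([1, 0], [0.5, 0.5]), 4))   # 0.6931, an uninformed guess
print(round(log_loss([1, 0], [0.9, 0.1]), 4))   # 0.1054, confident and correct
```

A model that always predicts 0.5 scores about 0.693, so that is the value a useful curve should drop below.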


In [None]:
# Datasets used for evaluation after each round
evalset = [(X_train, y_train), (X_test,y_test)]

In [None]:
# Fit the model
model_xgb.fit(X_train, y_train, eval_metric='logloss', eval_set=evalset)

In [None]:
# Check model accuracy
result_predict_xgb_test = model_xgb.predict(X_test)
acc = np.mean(y_test == result_predict_xgb_test)
print("Model accuracy =",acc)

In [None]:
# retrieve performance metrics
results = model_xgb.evals_result()

In [None]:
# Plot learning curves
import matplotlib.pyplot as plt
plt.plot(results['validation_0']['logloss'], label='train')
plt.plot(results['validation_1']['logloss'], label='test')
# display legend
plt.legend()
# render
plt.show()
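
The plot can also be summarized numerically by locating the round with the lowest test log loss. The sketch below uses a made-up dictionary shaped like the output of `evals_result()`; the numbers are hypothetical and `best_round` is an illustrative helper:

```python
def best_round(losses):
    """Return (index, value) of the lowest loss; the index is 0-based."""
    index = min(range(len(losses)), key=losses.__getitem__)
    return index, losses[index]

# Hypothetical curves shaped like evals_result(); the values are made up
results = {
    "validation_0": {"logloss": [0.65, 0.60, 0.56, 0.53, 0.51, 0.49]},
    "validation_1": {"logloss": [0.66, 0.62, 0.59, 0.58, 0.585, 0.59]},
}
round_idx, loss = best_round(results["validation_1"]["logloss"])
print(f"lowest test logloss {loss} at round {round_idx}")
```

If the lowest value falls on the final round, the test curve is still decreasing and more trees may help. If it falls earlier, as in this toy data, the extra rounds only improve the training loss.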

## Put it all together:

* Use the results from the gridsearch and add an additional parameter n_estimators
        
        n_estimators are the number of trees in the forest. default=100    

*   From the curves above you can see that both are still sloping downward when n_estimators reaches 100, so increasing the number of trees might yield a better result. We know that we achieve 72% accuracy when using the default 100 trees, and our best result discovered via the grid search is approximately 80%.  Can we do better?

* Start by setting up a new parameters section and use the values discovered earlier.  In addition set n_estimators to 1000 and see if a better result is achieved.

In [None]:
# Set XGBoost parameters
xgb_params = {
    'objective':                    'binary:logistic',
    'predictor':                    'cpu_predictor',
    'disable_default_eval_metric':  'true',
    'max_depth':                     8,
    'learning_rate':                 0.1,
    'subsample':                     1,
    'gamma':                         0,
    'reg_lambda':                    10,
    'scale_pos_weight':              1,
    'tree_method':                  'hist', 
    'n_estimators':                  1000,
}

# Train the model
t1_start = perf_counter()  # Time fit function
model_xgb= xgb.XGBClassifier(**xgb_params)
model_xgb.fit(X_train,y_train, eval_metric='logloss', eval_set=evalset, verbose=True)
t1_stop = perf_counter()
print ("It took", t1_stop-t1_start,"seconds to fit.")

In [None]:
# Check model accuracy
result_predict_xgb_test = model_xgb.predict(X_test)
acc = np.mean(y_test == result_predict_xgb_test)
print("Model accuracy =",acc)

In [None]:
# retrieve performance metrics
results = model_xgb.evals_result()

In [None]:
# Plot learning curves
import matplotlib.pyplot as plt
plt.plot(results['validation_0']['logloss'], label='train')
plt.plot(results['validation_1']['logloss'], label='test')
# display legend
plt.legend()
# render
plt.show()

## So how many trees do we need really?

* It takes a while to watch 1000 trees get evaluated. A great way to cut that time is to use XGBoost's early stopping capability.

* Modify the fit function to stop the training after 10 to 15 rounds of no improvement.  
        
        model_xgb.fit(X_train,y_train, early_stopping_rounds=10, eval_metric='logloss', eval_set=evalset, verbose=True)

* The round at which training stops will vary with the size of the dataset you used.  There are numerous other optimizations that one can undertake; hopefully this gets you started.
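
The stopping rule itself is simple to reason about. The following is a simplified simulation on a recorded loss curve, not XGBoost's own code; the function name and the loss values are made up for illustration:

```python
def early_stopping_round(losses, patience):
    """Simulate early stopping on a recorded validation loss curve.

    Returns (stop_round, best_round), both 0-based. Training stops once
    `patience` consecutive rounds fail to improve on the best loss so far.
    """
    best_loss = float("inf")
    best_idx = -1
    for idx, loss in enumerate(losses):
        if loss < best_loss:
            best_loss, best_idx = loss, idx
        elif idx - best_idx >= patience:
            return idx, best_idx
    return len(losses) - 1, best_idx

# Made-up validation losses for illustration
curve = [0.60, 0.55, 0.52, 0.51, 0.515, 0.512, 0.513, 0.52, 0.53]
print(early_stopping_round(curve, patience=3))   # (6, 3)
```

Training halts at round 6, three rounds after the best round 3, instead of running to the end. After a fit with early stopping, XGBoost records the best round on the model as `best_iteration`, which is a reasonable starting value for n_estimators.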



# Summary:

* How to set parameters for XGBoost
* How to enable the Intel extensions for scikit-learn
* How to use CV to identify better hyperparameter options
* How to use a learning curve to estimate the number of trees
* How to use early stopping to optimize training time