## XGBoost estimation
## Tutorial

The objective of this project is to understand and apply the **XGBoost library for developing boosted models**, which may be composed of either an ensemble of linear models or an ensemble of trees. The main pieces of code presented here were extracted from the [XGBoost official documentation](https://xgboost.readthedocs.io/en/latest/index.html). Once the library has been [installed](https://xgboost.readthedocs.io/en/latest/install.html), the documentation provides a [simple tutorial](https://xgboost.readthedocs.io/en/latest/get_started.html) on how to train a GBM model using the XGBoost API. In this notebook, the section [train-test estimation](#train_test)<a href='#train_test'></a> applies that basic code to prepare and train an XGBoost model.
<br>
<br>
The use of XGBoost through [Python](https://xgboost.readthedocs.io/en/latest/python/python_intro.html) is somewhat similar to that of [LightGBM](https://lightgbm.readthedocs.io/en/latest/). For instance, one should first create a proper [data object](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.core) to feed the model. This data object is created by calling the *DMatrix* class on a conventional data object, such as a dataframe or an nd-array. The resulting *DMatrix* object holds both the input and the output variables.
<br>
<br>
Another similarity with LightGBM is the definition of a dictionary containing all relevant **parameters** for GBM estimation. This gives more control over the structure of the model under construction, since all parameters are gathered in the same object. Once the data object and the parameters dictionary have been created, the model is trained by passing them to the [training function](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train), *train*. Another crucial parameter passed in this call, this time outside the parameters dictionary, is *num_boost_round*, which sets the number of boosting iterations and hence the number of trees in the GBM ensemble. A customized cost function can be supplied through the argument *obj* of *train*; alternatively, one can choose among the standard cost functions through the *objective* key of the parameters dictionary.
<br>
<br>
**Early stopping** is an estimation strategy that helps define the number of boosting iterations. Its use is straightforward: performance metrics should be defined in the *eval_metric* key of the parameters dictionary, while the arguments *evals* and *early_stopping_rounds* should be passed to *train*. *eval_metric* is given by a string or a list of strings with the names of performance metrics; *evals* receives a list of tuples, where the first element is a *DMatrix* object and the second is a string with its name; and *early_stopping_rounds* is the number of consecutive iterations allowed without any improvement in the validation metric. When several metrics are declared, the last one is used for early stopping.
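The stopping rule itself can be illustrated without the library. The function below is a simplified sketch of the rule, not XGBoost's implementation; the function name, the `patience` argument (standing in for *early_stopping_rounds*) and the AUC values are all made up for illustration.

```python
def early_stopping_round(metric_values, patience, maximize=True):
    """Return (best_iteration, last_iteration_run) for a sequence of metrics.

    Training stops once `patience` iterations have passed since the best
    value seen so far.
    """
    best_value = None
    best_iteration = -1
    for iteration, value in enumerate(metric_values):
        improved = (best_value is None
                    or (value > best_value if maximize else value < best_value))
        if improved:
            best_value, best_iteration = value, iteration
        elif iteration - best_iteration >= patience:
            return best_iteration, iteration
    # The metric sequence ended before the stopping rule was triggered.
    return best_iteration, len(metric_values) - 1

# Made-up validation AUC values, one per boosting iteration.
auc_by_round = [0.90, 0.92, 0.93, 0.93, 0.929, 0.928, 0.931]
best_it, stop_it = early_stopping_round(auc_by_round, patience=3)
print(best_it, stop_it)
```

Here the best value is reached at iteration 2, and training halts at iteration 5 because three iterations have passed without improvement. A larger `patience` would have let the run reach the later improvement at iteration 6.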
<br>
<br>
The function *cv* is an extension of *train* that implements [K-folds cross-validation](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.cv) together with the XGBoost model. Its arguments are basically the same as those of *train*, plus some extra arguments for setting up the K-folds estimation, such as the number of folds, whether data should be shuffled before splitting, and whether stratified (representative) samples should be produced.
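To make the idea of shuffled, stratified folds concrete, the sketch below assigns instance indices to folds class by class. It is a simplified illustration written from scratch, not the splitting code used by XGBoost or scikit-learn; the function name and its arguments are hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, nfold, shuffle=True, seed=0):
    """Assign each instance index to one of `nfold` folds, class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(nfold)]
    for label in sorted(by_class):
        indices = by_class[label]
        if shuffle:
            rng.shuffle(indices)
        # Deal indices round-robin so each fold keeps the class proportions.
        for position, idx in enumerate(indices):
            folds[position % nfold].append(idx)
    return folds

# Toy labels: 9 negatives and 3 positives (25% positive rate).
labels = [0] * 9 + [1] * 3
folds = stratified_folds(labels, nfold=3)
for fold in folds:
    print(sorted(fold), sum(labels[i] for i in fold))
```

Each of the three folds ends up with four instances, exactly one of them positive, which reproduces the 25% positive rate of the full sample.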
<br>
<br>
GBMs are very flexible estimation methods that often outperform alternative learning algorithms. This flexibility comes at the cost of requiring the proper definition of a large collection of parameters. The XGBoost documentation lists [here](https://xgboost.readthedocs.io/en/latest/parameter.html#xgboost-parameters) all parameters available for optimizing model performance. In addition to the [general parameters](https://xgboost.readthedocs.io/en/latest/parameter.html#global-configuration), the lists of parameters for the [tree booster](https://xgboost.readthedocs.io/en/latest/parameter.html#parameters-for-tree-booster), for the [linear booster](https://xgboost.readthedocs.io/en/latest/parameter.html#parameters-for-linear-booster-booster-gblinear) and for the [DART booster](https://xgboost.readthedocs.io/en/latest/parameter.html#additional-parameters-for-dart-booster-booster-dart) are the most relevant parts of the XGBoost documentation. The linear booster and the DART booster are alternative implementations of the boosting principle, since GBMs are generally taken to be ensembles of trees.
<br>
<br>
This tutorial is restricted to presenting and discussing the construction of boosted trees, leaving the implementation of boosted linear models and DART models for further research. The GBM **hyper-parameters** that are particularly helpful for improving model performance are the following:
* Subsample parameter ($subsample \in (0, 1]$): when set below one, stochastic GBM is implemented, meaning that only a random subset of the training data is used for fitting the tree at each boosting iteration.
* Learning rate ($eta \in [0,1]$): when set below one, the contribution of each member of the ensemble to the prediction is shrunk by a factor of $eta$.
* Maximum depth ($max\_depth \in \mathbb{I}$): controls how many levels of splits each tree may grow. The larger $max\_depth$, the more complex each member of the ensemble will be.
* Number of boosting iterations ($num\_boost\_round \in \mathbb{I}_+$): number of members of the ensemble of trees. *Note that this parameter should be passed directly to* train *rather than through the parameters dictionary.*
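The roles of these four hyper-parameters can be seen in a from-scratch sketch of stochastic gradient boosting. It assumes squared-error loss, a single feature with distinct values, and trees of depth one (stumps); it is not XGBoost's algorithm, which also uses second-order information and regularization. All function names are hypothetical.

```python
import random

def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (one split) by minimizing squared error.

    Assumes at least two distinct feature values in `xs`.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, num_boost_round, eta, subsample, seed=0):
    """Return the list of fitted stumps and the final training predictions."""
    rng = random.Random(seed)
    ensemble, preds = [], [0.0] * len(xs)
    n_sub = max(2, int(subsample * len(xs)))
    for _ in range(num_boost_round):
        # subsample: each tree sees only a random subset of the rows.
        rows = rng.sample(range(len(xs)), n_sub)
        stump = fit_stump([xs[i] for i in rows],
                          [ys[i] - preds[i] for i in rows])
        ensemble.append(stump)
        # eta: the contribution of each new tree is shrunk before being added.
        preds = [p + eta * stump(x) for p, x in zip(preds, xs)]
    return ensemble, preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = [i / 10 for i in range(20)]
ys = [x ** 2 for x in xs]
_, preds_5 = boost(xs, ys, num_boost_round=5, eta=0.3, subsample=0.75)
ensemble_50, preds_50 = boost(xs, ys, num_boost_round=50, eta=0.3, subsample=0.75)
print(mse(ys, preds_5), mse(ys, preds_50))
```

On this toy problem the training error keeps falling as more shrunken stumps are added, which is why $eta$ and $num\_boost\_round$ are usually tuned together.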

**Additional parameters** can also contribute to fine-tuning the model performance. They mostly relate to the construction of each tree that composes the boosted model. For instance, $gamma$ ($min\_split\_loss$) controls the minimum reduction in the cost function required for a split to be made. $grow\_policy$ dictates how the tree should grow as splits go on. Parameters such as $colsample\_bytree$, $colsample\_bylevel$ and $colsample\_bynode$ allow the random selection of a subset of features for producing splits, similar to the construction of random forests. Parameters $lambda$ and $alpha$ make it possible to introduce further regularization into the model. Parameter $tree\_method$ gives an even larger degree of control over the construction of the boosted model, since it selects the algorithm used for generating trees.
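A parameters dictionary that sets these additional keys might look like the sketch below. The values are illustrative assumptions, not tuned recommendations, and the comments summarize the role of each key as described in the XGBoost parameter documentation.

```python
# Illustrative values only; none of them was tuned for the dataset of this notebook.
params = {
    'objective': 'binary:logistic',
    'eta': 0.1,
    'max_depth': 3,
    'subsample': 0.75,
    'gamma': 1.0,                # alias of min_split_loss
    'grow_policy': 'depthwise',  # alternative: 'lossguide'
    'colsample_bytree': 0.8,     # share of features sampled for each tree
    'colsample_bylevel': 1.0,    # share of features sampled for each depth level
    'colsample_bynode': 1.0,     # share of features sampled for each split
    'lambda': 1.0,               # L2 regularization on leaf weights
    'alpha': 0.0,                # L1 regularization on leaf weights
    'tree_method': 'hist',       # histogram-based tree construction
}

# Sanity check: all sampling shares must lie in (0, 1].
share_keys = ['subsample', 'colsample_bytree', 'colsample_bylevel', 'colsample_bynode']
shares_ok = all(0 < params[key] <= 1 for key in share_keys)
print(shares_ok)
```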
<br>
<br>
The XGBoost documentation also discusses how to [fine tune](https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html) hyper-parameters in order to reduce overfitting, improve model performance and make estimation faster.
<br>
<br>
The documentation regarding parameters also covers options for the [learning task](https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters), such as the cost function to be used during training and the metric for evaluating the model at each boosting iteration.

Documentation also shows how to select [sub-models](https://xgboost.readthedocs.io/en/latest/python/model.html) from the entire ensemble of trees (for tree booster or DART booster) and how to develop [callback functions](https://xgboost.readthedocs.io/en/latest/python/callbacks.html) (and [here](https://xgboost.readthedocs.io/en/latest/python/python_api.html#callback-api)) for monitoring the model construction. Documentation for [prediction](https://xgboost.readthedocs.io/en/latest/prediction.html), [plotting](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.plotting) model outcomes, and for using XGBoost under [sklearn interface](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn) ends the main documentation, together with several official tutorials ([here](https://github.com/dmlc/xgboost/tree/master/demo) and [here](https://xgboost.readthedocs.io/en/latest/tutorials/index.html)).

**Summary:**
1. [Libraries](#libraries)<a href='#libraries'></a>.
2. [Functions and classes](#functions_classes)<a href='#functions_classes'></a>.
3. [Settings](#settings)<a href='#settings'></a>.
4. [Importing data](#imports)<a href='#imports'></a>.
    * [Features and label](#feats_label)<a href='#feats_label'></a>.
<br>
<br>
5. [Model estimation](#model_estimation)<a href='#model_estimation'></a>.
    * [Train-test estimation](#train_test)<a href='#train_test'></a>.
    * [Early stopping](#es)<a href='#es'></a>.
    * [CV estimation](#cv)<a href='#cv'></a>.
    * [Grid and random searches](#grid_random_searches)<a href='#grid_random_searches'></a>.

<a id='libraries'></a>

## Libraries

In [1]:
import pandas as pd
import numpy as np

import os
import json

import re
# pip install unidecode
from unidecode import unidecode

from datetime import datetime
import time

from scipy.stats import uniform, norm, randint

# pip install xgboost
import xgboost as xgb

from sklearn.metrics import roc_auc_score, average_precision_score, auc, precision_recall_curve, brier_score_loss

<a id='functions_classes'></a>

## Functions and classes

In [2]:
import utils
from utils import loading_data, running_time

In [3]:
import kfolds
from kfolds import Kfolds_fit

<a id='settings'></a>

## Settings

In [4]:
# Define the dataset_id:
dataset_id = 2706

<a id='imports'></a>

## Importing data

<a id='feats_label'></a>

### Features and label

#### Training data

In [5]:
print('----------------------------------------')
print(f'\033[1mDataset {dataset_id}:\033[0m')

df_train = loading_data(path=f'../Datasets/dataset_{dataset_id}_train.csv',
                        dtype={'order_id': str, 'store_id': int, 'epoch': str},
                        id_var='order_id')

print('----------------------------------------')
print('\n')

# Accessory variables:
drop_vars = ['y', 'order_id', 'epoch', 'date']

----------------------------------------
Dataset 2706:
Shape of df: (7217, 1286).
Number of distinct instances: 7217.
Time period: from 2020-12-31 to 2021-02-17.
----------------------------------------




#### Test data

In [6]:
print('----------------------------------------')
print(f'\033[1mDataset {dataset_id}:\033[0m')

df_test = loading_data(path=f'../Datasets/dataset_{dataset_id}_test.csv',
                       dtype={'order_id': str, 'store_id': int, 'epoch': str},
                       id_var='order_id')

print('----------------------------------------')
print('\n')

----------------------------------------
Dataset 2706:
Shape of df: (7217, 1286).
Number of distinct instances: 7217.
Time period: from 2021-02-17 to 2021-03-31.
----------------------------------------




<a id='model_estimation'></a>

## Model estimation

<a id='train_test'></a>

### Train-test estimation

#### Creating data objects

In [7]:
help(xgb.DMatrix)

Help on class DMatrix in module xgboost.core:

class DMatrix(builtins.object)
 |  DMatrix(data, label=None, weight=None, base_margin=None, missing=None, silent=False, feature_names=None, feature_types=None, nthread=None)
 |  
 |  Data Matrix used in XGBoost.
 |  
 |  DMatrix is a internal data structure that used by XGBoost
 |  which is optimized for both memory efficiency and training speed.
 |  You can construct DMatrix from numpy.arrays
 |  
 |  Methods defined here:
 |  
 |  __del__(self)
 |  
 |  __init__(self, data, label=None, weight=None, base_margin=None, missing=None, silent=False, feature_names=None, feature_types=None, nthread=None)
 |      Parameters
 |      ----------
 |      data : os.PathLike/string/numpy.array/scipy.sparse/pd.DataFrame/
 |             dt.Frame/cudf.DataFrame/cupy.array/dlpack
 |          Data source of DMatrix.
 |          When data is string or os.PathLike type, it represents the path
 |          libsvm format txt file, csv file (by specifying uri par

In [8]:
# Creating the training data object:
dtrain = xgb.DMatrix(data=df_train.drop(drop_vars, axis=1),
                     label=df_train['y'])

# Creating the test data object:
dtest = xgb.DMatrix(data=df_test.drop(drop_vars, axis=1))

#### Training the model

In [9]:
help(xgb.train)

Help on function train in module xgboost.training:

train(params, dtrain, num_boost_round=10, evals=(), obj=None, feval=None, maximize=False, early_stopping_rounds=None, evals_result=None, verbose_eval=True, xgb_model=None, callbacks=None)
    Train a booster with given parameters.
    
    Parameters
    ----------
    params : dict
        Booster params.
    dtrain : DMatrix
        Data to be trained.
    num_boost_round: int
        Number of boosting iterations.
    evals: list of pairs (DMatrix, string)
        List of validation sets for which metrics will evaluated during training.
        Validation metrics will help us track the performance of the model.
    obj : function
        Customized objective function.
    feval : function
        Customized evaluation function.
    maximize : bool
        Whether to maximize feval.
    early_stopping_rounds: int
        Activates early stopping. Validation metric needs to improve at least once in
        every **early_stopping_roun

In [10]:
# Declaring dictionary with hyper-parameters:
param = {'subsample': 0.75, 'eta': 0.1, 'max_depth': 3, 'objective': 'binary:logistic'}

In [11]:
# Training the model:
xgb_model = xgb.train(params=param, dtrain=dtrain, num_boost_round=100)

#### Performance metrics

In [12]:
# Predicting scores for test data:
score_pred_test = xgb_model.predict(dtest)

In [13]:
# Performance metrics:
test_roc_auc = roc_auc_score(df_test['y'], score_pred_test)
test_prec_avg = average_precision_score(df_test['y'], score_pred_test)
test_brier = brier_score_loss(df_test['y'], score_pred_test)

print('\033[1mPerformance metrics (test data):\033[0m')
print(f'Test ROC-AUC: {test_roc_auc}.')
print(f'Test average precision score: {test_prec_avg}.')
print(f'Test Brier score: {test_brier}.')

Performance metrics (test data):
Test ROC-AUC: 0.9901764234161989.
Test average precision score: 0.9625007990634594.
Test Brier score: 0.004469397445861866.
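For reference, the Brier score for binary labels is the mean squared difference between the predicted probability and the observed outcome, so lower values are better. The hand-written sketch below uses made-up labels and probabilities; the function name is hypothetical.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

# Made-up labels and predicted probabilities (illustration only).
y_true = [0, 1, 1, 0]
y_prob = [0.1, 0.9, 0.8, 0.3]

# (0.01 + 0.01 + 0.04 + 0.09) / 4
print(brier_score(y_true, y_prob))
```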


#### Relevant attributes or methods of the trained model

Visualizing the ensemble of trees using a dataframe

In [15]:
xgb_model.trees_to_dataframe().head(25)

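| | Tree | Node | ID | Feature | Split | Yes | No | Missing | Gain | Cover |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0-0 | feat_879 | 2.838081 | 0-1 | 0-2 | 0-1 | 571.748047 | 1360.75 |
| 1 | 0 | 1 | 0-1 | feat_886 | 5.626141 | 0-3 | 0-4 | 0-3 | 123.749512 | 1313.0 |
| 2 | 0 | 2 | 0-2 | feat_1204 | 1.0 | 0-5 | 0-6 | 0-5 | 30.966598 | 47.75 |
| 3 | 0 | 3 | 0-3 | feat_175 | 2.996223 | 0-7 | 0-8 | 0-7 | 13.816406 | 1303.5 |
| 4 | 0 | 4 | 0-4 | feat_120 | -0.287084 | 0-9 | 0-10 | 0-9 | 2.566416 | 9.5 |
| 5 | 0 | 5 | 0-5 | feat_137 | -0.076741 | 0-11 | 0-12 | 0-11 | 7.285355 | 44.75 |
| 6 | 0 | 6 | 0-6 | Leaf | | | | | -0.125 | 3.0 |
| 7 | 0 | 7 | 0-7 | Leaf | | | | | -0.197852 | 1302.5 |
| 8 | 0 | 8 | 0-8 | Leaf | | | | | 0.1 | 1.0 |
| 9 | 0 | 9 | 0-9 | Leaf | | | | | -0.0 | 1.0 |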


<a id='es'></a>

### Early stopping

#### Creating data objects

In [45]:
# Creating the training data object:
dtrain = xgb.DMatrix(data=df_train.drop(drop_vars, axis=1),
                     label=df_train['y'])

# Creating the validation data object:
dval = xgb.DMatrix(data=df_test.drop(drop_vars, axis=1),
                   label=df_test['y'])

#### Training the model

In [54]:
# Declaring dictionary with hyper-parameters:
param = {'subsample': 0.75, 'eta': 0.1, 'max_depth': 3, 'objective': 'binary:logistic',
         'eval_metric': ['logloss', 'auc']}

In [58]:
# Training the model:
xgb_model = xgb.train(params=param, dtrain=dtrain, num_boost_round=100,
                      evals=[(dval, 'val_data')], early_stopping_rounds=20)

[0]	val_data-logloss:0.60059	val_data-auc:0.96612
Multiple eval metrics have been passed: 'val_data-auc' will be used for early stopping.

Will train until val_data-auc hasn't improved in 20 rounds.
[1]	val_data-logloss:0.52463	val_data-auc:0.96614
[2]	val_data-logloss:0.46108	val_data-auc:0.96616
[3]	val_data-logloss:0.40777	val_data-auc:0.96611
[4]	val_data-logloss:0.36199	val_data-auc:0.96611
[5]	val_data-logloss:0.32237	val_data-auc:0.96611
[6]	val_data-logloss:0.28813	val_data-auc:0.96611
[7]	val_data-logloss:0.25859	val_data-auc:0.96744
[8]	val_data-logloss:0.23256	val_data-auc:0.96746
[9]	val_data-logloss:0.20987	val_data-auc:0.96744
[10]	val_data-logloss:0.18959	val_data-auc:0.96745
[11]	val_data-logloss:0.17184	val_data-auc:0.96742
[12]	val_data-logloss:0.15580	val_data-auc:0.96742
[13]	val_data-logloss:0.14160	val_data-auc:0.96735
[14]	val_data-logloss:0.12907	val_data-auc:0.96714
[15]	val_data-logloss:0.11793	val_data-auc:0.96716
[16]	val_data-logloss:0.10800	val_data-auc:0.

In [59]:
print(f'Best validation ROC-AUC: {xgb_model.best_score:.4f}\nBest iteration: {xgb_model.best_iteration}')

Best validation ROC-AUC: 0.9941
Best iteration: 41


#### Performance metrics

In [76]:
help(xgb_model.predict)

Help on method predict in module xgboost.core:

predict(data, output_margin=False, ntree_limit=0, pred_leaf=False, pred_contribs=False, approx_contribs=False, pred_interactions=False, validate_features=True, training=False) method of xgboost.core.Booster instance
    Predict with data.
    
    .. note:: This function is not thread safe except for ``gbtree``
              booster.
    
      For ``gbtree`` booster, the thread safety is guaranteed by locks.
      For lock free prediction use ``inplace_predict`` instead.  Also, the
      safety does not hold when used in conjunction with other methods.
    
      When using booster other than ``gbtree``, predict can only be called
      from one thread.  If you want to run prediction using multiple
      thread, call ``bst.copy()`` to make copies of model object and then
      call ``predict()``.
    
    Parameters
    ----------
    data : DMatrix
        The dmatrix storing the input.
    
    output_margin : bool
        Whether to o

In [74]:
# Predicting scores for validation data using model as it was in the best iteration:
score_pred_val = xgb_model.predict(dval, ntree_limit=xgb_model.best_iteration+1)

In [75]:
# Performance metrics:
val_roc_auc = roc_auc_score(df_test['y'], score_pred_val)
val_prec_avg = average_precision_score(df_test['y'], score_pred_val)
val_brier = brier_score_loss(df_test['y'], score_pred_val)

print('\033[1mPerformance metrics (validation data):\033[0m')
print(f'Test ROC-AUC: {val_roc_auc}.')
print(f'Test average precision score: {val_prec_avg}.')
print(f'Test Brier score: {val_brier}.')

Performance metrics (validation data):
Test ROC-AUC: 0.9941393098711168.
Test average precision score: 0.9626073602847985.
Test Brier score: 0.004270969364380872.


<a id='cv'></a>

### CV estimation

#### Creating data objects

In [77]:
# Creating the training data object:
dtrain = xgb.DMatrix(data=df_train.drop(drop_vars, axis=1),
                     label=df_train['y'])

# Creating the test data object:
dtest = xgb.DMatrix(data=df_test.drop(drop_vars, axis=1),
                    label=df_test['y'])

#### Training the model

In [79]:
help(xgb.cv)

Help on function cv in module xgboost.training:

cv(params, dtrain, num_boost_round=10, nfold=3, stratified=False, folds=None, metrics=(), obj=None, feval=None, maximize=False, early_stopping_rounds=None, fpreproc=None, as_pandas=True, verbose_eval=None, show_stdv=True, seed=0, callbacks=None, shuffle=True)
    Cross-validation with given parameters.
    
    Parameters
    ----------
    params : dict
        Booster params.
    dtrain : DMatrix
        Data to be trained.
    num_boost_round : int
        Number of boosting iterations.
    nfold : int
        Number of folds in CV.
    stratified : bool
        Perform stratified sampling.
    folds : a KFold or StratifiedKFold instance or list of fold indices
        Sklearn KFolds or StratifiedKFolds object.
        Alternatively may explicitly pass sample indices for each fold.
        For ``n`` folds, **folds** should be a length ``n`` list of tuples.
        Each tuple is ``(in,out)`` where ``in`` is a list of indices to be used

In [80]:
# Declaring dictionary with hyper-parameters:
param = {'subsample': 0.75, 'eta': 0.1, 'max_depth': 3, 'objective': 'binary:logistic'}

In [81]:
# Running K-folds CV; xgb.cv returns the evaluation history (a dataframe), not a trained model:
cv_results = xgb.cv(params=param, dtrain=dtrain, num_boost_round=100,
                    nfold=3, shuffle=False)

In [83]:
cv_results.head(10)

| | train-error-mean | train-error-std | test-error-mean | test-error-std |
|---|---|---|---|---|
| 0 | 0.00769 | 0.000294 | 0.009977 | 0.001357 |
| 1 | 0.007413 | 0.000197 | 0.009284 | 0.001708 |
| 2 | 0.007621 | 0.000197 | 0.008868 | 0.00153 |
| 3 | 0.007136 | 0.00026 | 0.008314 | 0.002036 |
| 4 | 0.007482 | 0.000171 | 0.008591 | 0.001708 |
| 5 | 0.006997 | 0.000393 | 0.008314 | 0.002036 |
| 6 | 0.006997 | 0.00026 | 0.008314 | 0.002036 |
| 7 | 0.006997 | 0.00026 | 0.008314 | 0.002036 |
| 8 | 0.00672 | 0.000596 | 0.008452 | 0.002208 |
| 9 | 0.006651 | 0.00034 | 0.008452 | 0.002208 |
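
One common use of this evaluation history is to pick the boosting iteration with the lowest average out-of-fold error. The sketch below does this by hand for the `test-error-mean` values of the ten rows displayed above only (the full history has 100 rows); the variable names are illustrative.

```python
# test-error-mean for the first 10 boosting iterations displayed above.
test_error_mean = [0.009977, 0.009284, 0.008868, 0.008314, 0.008591,
                   0.008314, 0.008314, 0.008314, 0.008452, 0.008452]

# Index of the first iteration attaining the minimum average CV error.
best_round = min(range(len(test_error_mean)), key=test_error_mean.__getitem__)
best_error = test_error_mean[best_round]
print(best_round, best_error)

# On the dataframe returned by xgb.cv, the analogous call would be
# <dataframe>['test-error-mean'].idxmin().
```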


<a id='grid_random_searches'></a>

### Grid and random searches

In [8]:
# Grid of hyper-parameters:
grid_param = {'subsample': uniform(0.5, 0.5),
              'eta': uniform(0.0001, 0.1),
              'max_depth': randint(1, 10),
              'num_boost_round': [100, 250, 500]}

# Creating K-folds CV object:
kfolds = Kfolds_fit(task = 'binary:logistic', method = 'xgboost', num_folds = 3, metric = 'roc_auc',
                    random_search = True, n_samples = 10,
                    grid_param = grid_param,
                    default_param = {'subsample': 0.75,
                                     'eta': 0.01,
                                     'max_depth': 10,
                                     'num_boost_round': 100})

# Running K-folds CV:
kfolds.fit(train_inputs = df_train.drop(drop_vars, axis=1), train_output = df_train['y'],
           test_inputs = df_test.drop(drop_vars, axis=1), test_output = df_test['y'])

Grid estimation progress: [----------------------------------------------] 100%

---------------------------------------------------------------------
Train-test estimation outcomes:


Outcomes from K-folds CV estimation:
   Number of data folds: 3.
   Number of samples for random search: 10.
   Estimation method: xgboost.
   Metric for choosing best hyper-parameter: roc_auc.
   Best hyper-parameters: {'subsample': 0.6958157064067576, 'eta': 0.03823028413810232, 'max_depth': 4, 'num_boost_round': 500}.
   CV performance metric associated with best hyper-parameters: 0.9799.


Performance metrics evaluated at test data:
   test_roc_auc = 0.9889
   test_prec_avg = 0.9601
   test_brier = 0.0046
---------------------------------------------------------------------


------------------------------------
Running time: 6.3 minutes.
Start time: 2021-06-13, 18:38:03
End time: 2021-06-13, 18:44:21
------------------------------------
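
When reading the search space declared above, note that the SciPy distributions are parameterized by location and scale: `uniform(loc, scale)` draws from the interval `[loc, loc + scale]`, and `randint(low, high)` draws integers from `low` to `high - 1`. The sketch below only checks the ranges of the three distributions by sampling; how `Kfolds_fit` consumes them is not shown here.

```python
from scipy.stats import randint, uniform

# Same distributions as in the search space declared above.
subsample_dist = uniform(0.5, 0.5)   # loc=0.5, scale=0.5  -> [0.5, 1.0]
eta_dist = uniform(0.0001, 0.1)      # loc=1e-4, scale=0.1 -> [0.0001, 0.1001]
depth_dist = randint(1, 10)          # integers 1, ..., 9 (upper bound excluded)

draws = {
    'subsample': subsample_dist.rvs(size=1000, random_state=0),
    'eta': eta_dist.rvs(size=1000, random_state=1),
    'max_depth': depth_dist.rvs(size=1000, random_state=2),
}

for name, values in draws.items():
    print(name, values.min(), values.max())
```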


#### Performance metrics

In [11]:
# Performance metrics:
test_roc_auc = kfolds.performance_metrics['test_roc_auc']
test_prec_avg = kfolds.performance_metrics['test_prec_avg']
test_brier = kfolds.performance_metrics['test_brier']

print('\033[1mPerformance metrics (test data):\033[0m')
print(f'Test ROC-AUC: {test_roc_auc}.')
print(f'Test average precision score: {test_prec_avg}.')
print(f'Test Brier score: {test_brier}.')
print('\n')

print('\033[1mBest hyper-parameters:\033[0m')
print(kfolds.best_param)

Performance metrics (test data):
Test ROC-AUC: 0.9889312408852977.
Test average precision score: 0.9601162681393539.
Test Brier score: 0.0046171217817736225.


Best hyper-parameters:
{'subsample': 0.6958157064067576, 'eta': 0.03823028413810232, 'max_depth': 4, 'num_boost_round': 500}
