# Introduction

<div class="alert alert-block alert-warning">
<font color=black><br>

**What?** Hyperparameter tuning for regression with NATIVE XGBoost API

**Reference:** https://blog.cambridgespark.com/hyperparameter-tuning-in-xgboost-4ff9100a3b2f<br>

<br></font>
</div>

# Why would you use the native XGBoost API over the scikit-learn API?

<div class="alert alert-block alert-info">
<font color=black><br>

Advantages include:
- Automatically finding the best number of boosting rounds
- Built-in cross-validation
- Custom objective functions

<br></font>
</div>
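
As a rough illustration of the last point: a custom objective for the native API is a callable that receives the current predictions and the training DMatrix, and returns the per-sample gradient and Hessian of the loss. The sketch below writes squared error by hand. `LabelHolder` is a hypothetical stand-in for `xgb.DMatrix` so the snippet runs without XGBoost, and plain lists are used where real code would use NumPy arrays.

```python
def squared_error_objective(preds, dtrain):
    """Gradient and Hessian of 0.5 * (pred - label)**2 for every sample."""
    labels = dtrain.get_label()
    grad = [p - y for p, y in zip(preds, labels)]  # first derivative w.r.t. pred
    hess = [1.0 for _ in labels]                   # second derivative is constant
    return grad, hess


class LabelHolder:
    """Hypothetical stand-in for xgb.DMatrix: only exposes get_label()."""

    def __init__(self, labels):
        self._labels = labels

    def get_label(self):
        return self._labels


grad, hess = squared_error_objective([3.0, 0.0], LabelHolder([1.0, 2.0]))
print(grad, hess)
```

With XGBoost installed, a function of this shape would be passed through the `obj` argument of `xgb.train`; the array types it expects should be checked against the installed version.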

# Import modules

In [1]:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Import dataset

<div class="alert alert-block alert-info">
<font color=black><br>

- Facebook comment volume dataset
- The dataset can be downloaded here: https://archive.ics.uci.edu/ml/datasets/Facebook+Comment+Volume+Dataset
- 53 features describing a Facebook post: the number of likes on the page it was posted, the category of the page, the time and day it was posted, etc. 
- The last column is the target: the number of comments the post received. 
- **GOAL**: predict the number of comments a new post will receive based on all the given features.

<br></font>
</div>

In [2]:
!ls ../DATASETS/Facebook_comment_volume_dataset/Training/

Features_Variant_1.arff Features_Variant_3.arff Features_Variant_5.arff
Features_Variant_1.csv  Features_Variant_3.csv  Features_Variant_5.csv
Features_Variant_2.arff Features_Variant_4.arff
Features_Variant_2.csv  Features_Variant_4.csv


In [3]:
file = "../DATASETS/Facebook_comment_volume_dataset/Training/Features_Variant_1.csv"
df = pd.read_csv(file, header=None)
df.sample(n=5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,44,45,46,47,48,49,50,51,52,53
22151,497,0,0,16,0.0,5.0,0.381818,0.0,0.884018,0.0,...,0,0,1,0,0,0,0,0,0,0
27582,5365996,40729,102442,9,0.0,740.0,60.61215,33.0,90.050007,0.0,...,0,0,0,0,0,1,0,0,0,3
29452,204478,0,3661,92,0.0,368.0,51.688525,13.0,79.076543,0.0,...,0,0,1,0,0,0,0,0,0,7
7224,441897,0,16175,18,8.0,1164.0,99.691275,71.0,118.840986,0.0,...,0,0,0,0,0,0,0,1,0,0
15889,7564986,0,123241,12,0.0,35.0,3.365714,1.0,5.335913,0.0,...,0,0,0,0,0,1,0,0,0,0


In [4]:
print("Dataset has {} entries and {} features".format(*df.shape))

Dataset has 40949 entries and 54 features


# Split the dataset

<div class="alert alert-block alert-info">
<font color=black><br>

- In order to use the native API for XGBoost, we will first need to build DMatrices.

<br></font>
</div>

In [5]:
X, y = df.loc[:,:52].values, df.loc[:,53].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1, random_state = 42)

In [6]:
dtrain = xgb.DMatrix(X_train, label = y_train)
dtest = xgb.DMatrix(X_test, label = y_test)

# Building a baseline model

<div class="alert alert-block alert-info">
<font color=black><br>

- The **baseline model** provides a score that can be achieved with no effort.
- The hope is to beat it with our tuned algorithm.
- The **MAE** (Mean Absolute Error) is chosen here because it is expressed in the same unit as the target, which makes the results easier to interpret.

<br></font>
</div>
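
To make the metric concrete, here is a small worked example in plain Python. The numbers are made up for illustration and are not taken from the dataset; `mae_by_hand` is an illustrative helper, not a library function.

```python
def mae_by_hand(y_true, y_pred):
    """Mean of the absolute differences between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy targets (hypothetical values)
toy_train = [0, 2, 4, 10]
toy_test = [1, 3, 8, 4]

# The baseline "model" always predicts the training mean
toy_mean = sum(toy_train) / len(toy_train)
toy_predictions = [toy_mean] * len(toy_test)

toy_mae = mae_by_hand(toy_test, toy_predictions)
print("Training mean:", toy_mean)
print("Baseline MAE:", toy_mae)
```

The absolute errors are 3, 1, 4 and 0, so the MAE is 8 / 4 = 2 comments, in the same unit as the target.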

In [7]:
# "Learn" the mean from the training data
mean_train = np.mean(y_train)
mean_test = np.mean(y_test)
print("Mean from the training data {:.2f}".format(mean_train))
print("Mean from the test data {:.2f}".format(mean_test))

# Get predictions on the test set
baseline_predictions = np.ones(y_test.shape) * mean_train
# Compute MAE
mae_baseline = mean_absolute_error(y_test, baseline_predictions)
print("Baseline MAE is {:.2f}".format(mae_baseline))

Mean from the training data 7.28
Mean from the test data 7.72
Baseline MAE is 11.31


<div class="alert alert-block alert-info">
<font color=black><br>

- **Is our baseline model good?**
- On average, our prediction is 11.31 comments away from the truth.
- That is poor when compared with the average number of comments per post in the training set (7.28) and the test set (7.72).
- Of course, one quantity is an error and the other is a mean, so they are not directly comparable, but the comparison still gives an order of magnitude.

<br></font>
</div>

# How num_boost_round & early_stopping_rounds are used in tuning

<div class="alert alert-block alert-info">
<font color=black><br>

- Two further parameters are passed to XGBoost as standalone arguments rather than through the params dictionary.
- The first, num_boost_round, is the number of boosting rounds, i.e. the number of trees to build.
- You could tune it together with all the other parameters in a grid search, but that would be **expensive**.
- There is a **more efficient** way. Since trees are built sequentially, instead of fixing the number of rounds at the beginning, we can test our model at each step and see if adding a new tree/round improves performance.
- To do so, we define a test dataset and a metric that is used to assess performance at each round. If performance has not improved for N rounds (N is set by the second argument, early_stopping_rounds), we stop the training and keep the best number of boosting rounds.

<br></font>
</div>

In [8]:
params = {
    # Parameters that we are going to tune.
    'max_depth':6,
    'min_child_weight': 1,
    'eta':.3,
    'subsample': 1,
    'colsample_bytree': 1,
    # Other parameters
    # Note: 'reg:linear' is deprecated in newer XGBoost releases in favour of 'reg:squarederror'
    'objective':'reg:linear',
}

In [9]:
params['eval_metric'] = "mae"
num_boost_round = 999

In [10]:
model = xgb.train(
    params,
    dtrain,
    num_boost_round = num_boost_round,
    evals=[(dtest, "Test")],
    early_stopping_rounds = 10
)

[0]	Test-mae:5.97481
Will train until Test-mae hasn't improved in 10 rounds.
[1]	Test-mae:5.03353
[2]	Test-mae:4.64575
[3]	Test-mae:4.42335
[4]	Test-mae:4.39328
[5]	Test-mae:4.35536
[6]	Test-mae:4.31313
[7]	Test-mae:4.33087
[8]	Test-mae:4.37167
[9]	Test-mae:4.38777
[10]	Test-mae:4.39438
[11]	Test-mae:4.40656
[12]	Test-mae:4.39122
[13]	Test-mae:4.39086
[14]	Test-mae:4.39829
[15]	Test-mae:4.39103
[16]	Test-mae:4.40305
Stopping. Best iteration:
[6]	Test-mae:4.31313



<div class="alert alert-block alert-info">
<font color=black><br>

- As you can see, training stopped before reaching the maximum number of boosting rounds: after the 7th tree, adding more rounds did not improve the MAE on the test dataset.

<br></font>
</div>
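
The stopping rule can be replayed in plain Python on the test MAE values printed above. This is only a sketch of the logic, not XGBoost's implementation; `early_stop` and `patience` are illustrative names.

```python
def early_stop(scores, patience):
    """Return (best_round, stop_round) for a metric that should be minimised.

    Training stops at the first round that lies `patience` rounds past the
    best round seen so far without having improved on it.
    """
    best_round, best_score = 0, float("inf")
    for round_idx, score in enumerate(scores):
        if score < best_score:
            best_round, best_score = round_idx, score
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    # The metric never stalled for `patience` rounds: all rounds were used
    return best_round, len(scores) - 1


# Test MAE per round, copied from the training log above
test_mae = [5.97481, 5.03353, 4.64575, 4.42335, 4.39328, 4.35536, 4.31313,
            4.33087, 4.37167, 4.38777, 4.39438, 4.40656, 4.39122, 4.39086,
            4.39829, 4.39103, 4.40305]

best_round, stop_round = early_stop(test_mae, patience=10)
print("Best round: {}, stopped at round: {}".format(best_round, stop_round))
```

Rounds are numbered from zero, so a best round of 6 corresponds to 7 trees.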

In [11]:
print("Best MAE: {:.2f} with {} rounds".format(
                 model.best_score,
                 model.best_iteration+1))

Best MAE: 4.31 with 7 rounds


# How to use XGBoost native CV

<div class="alert alert-block alert-info">
<font color=black><br>

- We **don’t need** to pass a test dataset here.
- This is because the cross-validation function splits the training dataset into nfold folds and holds out each fold in turn for testing.

<br></font>
</div>
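
The fold mechanics can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: `kfold_indices` is a hypothetical helper, and the way XGBoost shuffles and assigns rows internally may differ.

```python
import random


def kfold_indices(n_samples, n_folds, seed=42):
    """Yield (train_idx, test_idx) pairs; every sample is held out exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds groups of (almost) equal size
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_idx, test_idx


splits = list(kfold_indices(n_samples=10, n_folds=5))
for train_idx, test_idx in splits:
    print("train size:", len(train_idx), "held-out rows:", sorted(test_idx))
```

Each of the 5 iterations trains on 8 rows and evaluates on the remaining 2, which is why no separate test set has to be supplied.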

In [12]:
cv_results = xgb.cv(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    seed=42,
    nfold=5,
    metrics={'mae'},
    early_stopping_rounds=10
)



<div class="alert alert-block alert-info">
<font color=black><br>

- cv returns a table in which each row corresponds to a **number of boosting** rounds (trees) used.
- Importantly, we stopped well before the 999 rounds (fortunately!).

<br></font>
</div>

In [13]:
cv_results

Unnamed: 0,train-mae-mean,train-mae-std,test-mae-mean,test-mae-std
0,5.604948,0.06466,5.689212,0.270147
1,4.622352,0.065104,4.849511,0.271889
2,4.059494,0.065932,4.468344,0.239464
3,3.723084,0.060754,4.268582,0.224425
4,3.510358,0.061148,4.192462,0.18976
5,3.367076,0.060926,4.172847,0.189624
6,3.245542,0.060118,4.15783,0.192568
7,3.151558,0.062634,4.143255,0.194406
8,3.082316,0.058967,4.147838,0.196198
9,3.01686,0.057426,4.144695,0.189789


In [14]:
cv_results['test-mae-mean'].min()

4.0827944

# Final Tuning - bringing everything together

## Tree-related hyperparameter

<div class="alert alert-block alert-info">
<font color=black><br>

- These 2 parameters can be used to control the complexity of the trees.
- **max_depth** is the maximum number of nodes allowed from the root to the farthest leaf of a tree. Deeper trees can model more complex relationships by adding more nodes, but as we go deeper, splits become less relevant and are sometimes only due to noise, causing the model to overfit.
- **min_child_weight** is the minimum weight (or number of samples if all samples have a weight of 1) required in order to create a new node in the tree. A smaller min_child_weight allows the algorithm to create children that correspond to fewer samples, thus allowing for more complex trees, but again, more likely to overfit.
- It is important to tune them together in order to find a good **trade-off** between model bias and variance.

<br></font>
</div>
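
The joint search over both parameters follows a simple pattern: score every combination and keep the best. The sketch below shows that pattern in isolation. `grid_search` and `toy_score` are hypothetical; `toy_score` stands in for the cross-validated MAE so the snippet runs without XGBoost, and its minimum is placed arbitrarily.

```python
from itertools import product


def grid_search(grid, score_fn):
    """Return (best_candidate, best_score) for the lowest score over the grid."""
    best_candidate, best_score = None, float("inf")
    for values in product(*grid.values()):
        candidate = dict(zip(grid.keys(), values))
        score = score_fn(candidate)
        if score < best_score:
            best_candidate, best_score = candidate, score
    return best_candidate, best_score


def toy_score(candidate):
    # Hypothetical stand-in for a cross-validated MAE (lower is better)
    return (candidate["max_depth"] - 10) ** 2 + (candidate["min_child_weight"] - 6) ** 2


grid = {"max_depth": range(9, 12), "min_child_weight": range(5, 8)}
best_candidate, best_score = grid_search(grid, toy_score)
print(best_candidate, best_score)
```

Searching both parameters at once costs 3 x 3 = 9 evaluations here; the cost multiplies with every parameter added, which is one reason to tune related parameters in small groups.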

In [15]:
gridsearch_params = [
    (max_depth, min_child_weight)
    for max_depth in range(9,12)
    for min_child_weight in range(5,8)
]

In [16]:
# List of tuples
gridsearch_params

[(9, 5), (9, 6), (9, 7), (10, 5), (10, 6), (10, 7), (11, 5), (11, 6), (11, 7)]

In [17]:
# Define initial best params and MAE
min_mae = float("Inf")
best_params = None
for max_depth, min_child_weight in gridsearch_params:
    print("CV with max_depth={}, min_child_weight={}".format(
                             max_depth,
                             min_child_weight))

    # Update our parameters
    params['max_depth'] = max_depth
    params['min_child_weight'] = min_child_weight

    # Run CV
    cv_results = xgb.cv(
        params,
        dtrain,
        num_boost_round=num_boost_round,
        seed=42,
        nfold=5,
        metrics={'mae'},
        early_stopping_rounds=10
    )

    # Update best MAE
    mean_mae = cv_results['test-mae-mean'].min()
    # argmin() gives the zero-based index of the best round
    boost_rounds = cv_results['test-mae-mean'].argmin()
    print("\tMAE {} for {} rounds".format(mean_mae, boost_rounds))
    if mean_mae < min_mae:
        min_mae = mean_mae
        best_params = (max_depth,min_child_weight)


CV with max_depth=9, min_child_weight=5
	MAE 4.045249799999999 for 6 rounds
CV with max_depth=9, min_child_weight=6
	MAE 4.07648 for 5 rounds
CV with max_depth=9, min_child_weight=7
	MAE 4.0753965999999995 for 5 rounds
CV with max_depth=10, min_child_weight=5
	MAE 4.0805982 for 5 rounds
CV with max_depth=10, min_child_weight=6
	MAE 4.0351186 for 5 rounds
CV with max_depth=10, min_child_weight=7
	MAE 4.0872286 for 5 rounds
CV with max_depth=11, min_child_weight=5
	MAE 4.0626337999999995 for 5 rounds
CV with max_depth=11, min_child_weight=6
	MAE 4.054813 for 5 rounds
CV with max_depth=11, min_child_weight=7
	MAE 4.0580998 for 5 rounds


In [18]:
print("Best params: {}, {}, MAE: {}".format(best_params[0], best_params[1], min_mae))

Best params: 10, 6, MAE: 4.0351186


In [19]:
# Let's update the parameters dictionary
params['max_depth'] = 10
params['min_child_weight'] = 6

In [20]:
params

{'max_depth': 10,
 'min_child_weight': 6,
 'eta': 0.3,
 'subsample': 1,
 'colsample_bytree': 1,
 'objective': 'reg:linear',
 'eval_metric': 'mae'}

## Parameters subsample and colsample_bytree

<div class="alert alert-block alert-info">
<font color=black><br>

- Those parameters control the sampling of the dataset that is done at each boosting round.
- Instead of using the whole training set every time, we can build a tree on slightly different data at each step, which makes it less likely to overfit to a single sample or feature.
- **subsample** corresponds to the fraction of observations (the rows) to subsample at each step. By default it is set to 1 meaning that we use all rows.
- **colsample_bytree** corresponds to the fraction of features (the columns) to use. By default it is set to 1 meaning that we will use all features

<br></font>
</div>
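
What the two fractions mean can be sketched with the standard library. This is a simplified illustration: `sample_rows_and_columns` is a hypothetical helper, and the way XGBoost rounds the counts and draws the samples internally is an assumption here.

```python
import random


def sample_rows_and_columns(n_rows, n_cols, subsample, colsample_bytree, seed):
    """Pick the row and column indices one tree would be built on."""
    rng = random.Random(seed)
    n_rows_kept = max(1, int(subsample * n_rows))
    n_cols_kept = max(1, int(colsample_bytree * n_cols))
    rows = sorted(rng.sample(range(n_rows), n_rows_kept))
    cols = sorted(rng.sample(range(n_cols), n_cols_kept))
    return rows, cols


# A new draw is made for every boosting round (here: a different seed per round)
for boosting_round in range(3):
    rows, cols = sample_rows_and_columns(
        n_rows=10, n_cols=53, subsample=0.8, colsample_bytree=0.7,
        seed=boosting_round)
    print("round {}: {} rows, {} columns".format(boosting_round, len(rows), len(cols)))
```

With 10 rows and 53 features, each tree in this sketch sees 8 rows and 37 columns, and the selection changes from round to round.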

In [21]:
gridsearch_params = [
    (subsample, colsample)
    for subsample in [i/10. for i in range(7,11)]
    for colsample in [i/10. for i in range(7,11)]
]

In [22]:
gridsearch_params

[(0.7, 0.7),
 (0.7, 0.8),
 (0.7, 0.9),
 (0.7, 1.0),
 (0.8, 0.7),
 (0.8, 0.8),
 (0.8, 0.9),
 (0.8, 1.0),
 (0.9, 0.7),
 (0.9, 0.8),
 (0.9, 0.9),
 (0.9, 1.0),
 (1.0, 0.7),
 (1.0, 0.8),
 (1.0, 0.9),
 (1.0, 1.0)]

In [23]:
min_mae = float("Inf")
best_params = None

# We start by the largest values and go down to the smallest
for subsample, colsample in reversed(gridsearch_params):
    print("CV with subsample={}, colsample={}".format(
                             subsample,
                             colsample))

    # We update our parameters
    params['subsample'] = subsample
    params['colsample_bytree'] = colsample

    # Run CV
    cv_results = xgb.cv(
        params,
        dtrain,
        num_boost_round=num_boost_round,
        seed=42,
        nfold=5,
        metrics={'mae'},
        early_stopping_rounds=10
    )

    # Update best score
    mean_mae = cv_results['test-mae-mean'].min()
    boost_rounds = cv_results['test-mae-mean'].argmin()
    print("\tMAE {} for {} rounds".format(mean_mae, boost_rounds))
    if mean_mae < min_mae:
        min_mae = mean_mae
        best_params = (subsample,colsample)

CV with subsample=1.0, colsample=1.0
	MAE 4.0351186 for 5 rounds
CV with subsample=1.0, colsample=0.9
	MAE 4.086857 for 5 rounds
CV with subsample=1.0, colsample=0.8
	MAE 4.1143116 for 5 rounds
CV with subsample=1.0, colsample=0.7
	MAE 4.2003104 for 5 rounds
CV with subsample=0.9, colsample=1.0
	MAE 4.0303374000000005 for 5 rounds
CV with subsample=0.9, colsample=0.9
	MAE 4.0712008 for 5 rounds
CV with subsample=0.9, colsample=0.8
	MAE 4.1030088 for 4 rounds
CV with subsample=0.9, colsample=0.7
	MAE 4.1687326 for 4 rounds
CV with subsample=0.8, colsample=1.0
	MAE 4.149029999999999 for 5 rounds
CV with subsample=0.8, colsample=0.9
	MAE 4.157543800000001 for 6 rounds
CV with subsample=0.8, colsample=0.8
	MAE 4.1861581999999995 for 7 rounds
CV with subsample=0.8, colsample=0.7
	MAE 4.1982572 for 4 rounds
CV with subsample=0.7, colsample=1.0
	MAE 4.0902796 for 5 rounds
CV with subsample=0.7, colsample=0.9
	MAE 4.112025999999999 for 6 rounds
CV with subsample=0.7, colsample=0.8
	MAE 4.1948692 for 4 rounds
CV with subsample=0.7, colsample=0.7
	MAE 4.19326 for 4 rounds


In [24]:
print("Best params: {}, {}, MAE: {}".format(best_params[0], best_params[1], min_mae))

Best params: 0.9, 1.0, MAE: 4.0303374000000005


In [25]:
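# Let's update the params dict
# Note: the search above selected subsample=0.9, but 0.8 is set here;
# all the results that follow were obtained with subsample=0.8
params['subsample'] = .8
params['colsample_bytree'] = 1.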

In [26]:
params

{'max_depth': 10,
 'min_child_weight': 6,
 'eta': 0.3,
 'subsample': 0.8,
 'colsample_bytree': 1.0,
 'objective': 'reg:linear',
 'eval_metric': 'mae'}

## Hyperparameter ETA

<div class="alert alert-block alert-info">
<font color=black><br>

- The eta parameter controls the learning rate. It corresponds to the shrinkage of the weights associated with features after each round; in other words, it defines the amount of "correction" we make at each step.
- In practice, a lower eta makes our model more robust to overfitting; thus, usually, the lower the learning rate, the better.
- But with a lower eta, we need more boosting rounds, which takes more time to train, sometimes for only marginal improvements.

<br></font>
</div>
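
The trade-off between eta and the number of rounds can be illustrated with a toy calculation. It rests on an idealised assumption that does not hold for real trees: every round fits the current residual exactly, and only a fraction eta of that correction is applied, so the residual shrinks by a factor (1 - eta) per round. `rounds_to_converge` is a hypothetical helper.

```python
def rounds_to_converge(eta, tolerance=0.01):
    """Rounds needed until the residual drops below `tolerance` of its start value."""
    residual, rounds = 1.0, 0
    while residual >= tolerance:
        residual *= (1.0 - eta)  # only a fraction eta of the error is corrected
        rounds += 1
    return rounds


for eta in [0.3, 0.2, 0.1, 0.05, 0.01, 0.005]:
    print("eta={}: {} rounds".format(eta, rounds_to_converge(eta)))
```

Even in this idealised setting, the number of rounds grows quickly as eta decreases, which matches the pattern of longer training for smaller learning rates.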

In [27]:
# This can take some time…
min_mae = float("Inf")
best_params = None

for eta in [.3, .2, .1, .05, .01, .005]:
    print("CV with eta={}".format(eta))

    # We update our parameters
    params['eta'] = eta

    # Run CV
    cv_results = xgb.cv(
        params,
        dtrain,
        num_boost_round=num_boost_round,
        seed=42,
        nfold=5,
        metrics=['mae'],
        early_stopping_rounds=10
    )
    
    # Update best score
    mean_mae = cv_results['test-mae-mean'].min()
    boost_rounds = cv_results['test-mae-mean'].argmin()
    print("\tMAE {} for {} rounds\n".format(mean_mae, boost_rounds))
    if mean_mae < min_mae:
        min_mae = mean_mae
        best_params = eta

CV with eta=0.3
	MAE 4.149029999999999 for 5 rounds

CV with eta=0.2
	MAE 4.0048536 for 10 rounds

CV with eta=0.1
	MAE 3.9243398 for 19 rounds

CV with eta=0.05
	MAE 3.8693966000000004 for 46 rounds

CV with eta=0.01
	MAE 3.8336379999999997 for 235 rounds

CV with eta=0.005
	MAE 3.8281038 for 479 rounds



In [28]:
print("Best params: {}, MAE: {}".format(best_params, min_mae))

Best params: 0.005, MAE: 3.8281038


In [29]:
# Let's update the params dict
params['eta'] = 0.005

In [30]:
params

{'max_depth': 10,
 'min_child_weight': 6,
 'eta': 0.005,
 'subsample': 0.8,
 'colsample_bytree': 1.0,
 'objective': 'reg:linear',
 'eval_metric': 'mae'}

# Train the model and get test set results

<div class="alert alert-block alert-info">
<font color=black><br>

- Please note that the test MAE does not decrease monotonically with the number of boosting rounds.
- This means the best round is not necessarily the last one!
- This is important when we want to save the model.
- In fact, once we know the best number of boosting rounds, we no longer need **early_stopping_rounds**.

<br></font>
</div>

In [31]:
model = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=[(dtest, "Test")],
    early_stopping_rounds=10
)

[0]	Test-mae:7.72516
Will train until Test-mae hasn't improved in 10 rounds.
[1]	Test-mae:7.68929
[2]	Test-mae:7.65557
[3]	Test-mae:7.61957
[4]	Test-mae:7.5839
[5]	Test-mae:7.54856
[6]	Test-mae:7.51315
[7]	Test-mae:7.48022
[8]	Test-mae:7.44596
[9]	Test-mae:7.41321
[10]	Test-mae:7.37869
[11]	Test-mae:7.34604
[12]	Test-mae:7.31248
[13]	Test-mae:7.2791
[14]	Test-mae:7.24591
[15]	Test-mae:7.21429
[16]	Test-mae:7.18185
[17]	Test-mae:7.14933
[18]	Test-mae:7.11726
[19]	Test-mae:7.08604
[20]	Test-mae:7.05399
[21]	Test-mae:7.02381
[22]	Test-mae:6.99194
[23]	Test-mae:6.96059
[24]	Test-mae:6.92885
[25]	Test-mae:6.89883
[26]	Test-mae:6.86869
[27]	Test-mae:6.84059
[28]	Test-mae:6.81215
[29]	Test-mae:6.78097
[30]	Test-mae:6.75338
[31]	Test-mae:6.72323
[32]	Test-mae:6.69574
[33]	Test-mae:6.66586
[34]	Test-mae:6.63746
[35]	Test-mae:6.60968
[36]	Test-mae:6.58174
[37]	Test-mae:6.55553
[38]	Test-mae:6.52827
[39]	Test-mae:6.50107
[40]	Test-mae:6.47334
[41]	Test-mae:6.44917
[42]	Test-mae:6.42616
[43]	Test-

[355]	Test-mae:3.92952
[356]	Test-mae:3.92947
[357]	Test-mae:3.92995
[358]	Test-mae:3.92974
[359]	Test-mae:3.92965
[360]	Test-mae:3.92935
[361]	Test-mae:3.92811
[362]	Test-mae:3.92744
[363]	Test-mae:3.92652
[364]	Test-mae:3.92786
[365]	Test-mae:3.9285
[366]	Test-mae:3.92846
[367]	Test-mae:3.92828
[368]	Test-mae:3.9279
[369]	Test-mae:3.92785
[370]	Test-mae:3.92766
[371]	Test-mae:3.92686
[372]	Test-mae:3.92665
[373]	Test-mae:3.92671
Stopping. Best iteration:
[363]	Test-mae:3.92652



In [32]:
model.best_iteration

363

# Saving our model

In [33]:
num_boost_round = model.best_iteration + 1

best_model = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=[(dtest, "Test")]
)

[0]	Test-mae:7.72516
[1]	Test-mae:7.68929
[2]	Test-mae:7.65557
[3]	Test-mae:7.61957
[4]	Test-mae:7.5839
[5]	Test-mae:7.54856
[6]	Test-mae:7.51315
[7]	Test-mae:7.48022
[8]	Test-mae:7.44596
[9]	Test-mae:7.41321
[10]	Test-mae:7.37869
[11]	Test-mae:7.34604
[12]	Test-mae:7.31248
[13]	Test-mae:7.2791
[14]	Test-mae:7.24591
[15]	Test-mae:7.21429
[16]	Test-mae:7.18185
[17]	Test-mae:7.14933
[18]	Test-mae:7.11726
[19]	Test-mae:7.08604
[20]	Test-mae:7.05399
[21]	Test-mae:7.02381
[22]	Test-mae:6.99194
[23]	Test-mae:6.96059
[24]	Test-mae:6.92885
[25]	Test-mae:6.89883
[26]	Test-mae:6.86869
[27]	Test-mae:6.84059
[28]	Test-mae:6.81215
[29]	Test-mae:6.78097
[30]	Test-mae:6.75338
[31]	Test-mae:6.72323
[32]	Test-mae:6.69574
[33]	Test-mae:6.66586
[34]	Test-mae:6.63746
[35]	Test-mae:6.60968
[36]	Test-mae:6.58174
[37]	Test-mae:6.55553
[38]	Test-mae:6.52827
[39]	Test-mae:6.50107
[40]	Test-mae:6.47334
[41]	Test-mae:6.44917
[42]	Test-mae:6.42616
[43]	Test-mae:6.39965
[44]	Test-mae:6.37425
[45]	Test-mae:6.34918


[358]	Test-mae:3.92974
[359]	Test-mae:3.92965
[360]	Test-mae:3.92935
[361]	Test-mae:3.92811
[362]	Test-mae:3.92744
[363]	Test-mae:3.92652


In [34]:
"""
best_model.save_model("my_model.model")

loaded_model = xgb.Booster()
loaded_model.load_model("my_model.model")
# And use it for predictions.
loaded_model.predict(dtest)
"""

'\nbest_model.save_model("my_model.model")\n\nloaded_model = xgb.Booster()\nloaded_model.load_model("my_model.model")\n# And use it for predictions.\nloaded_model.predict(dtest)\n'

# Make predictions

<div class="alert alert-block alert-info">
<font color=black><br>

- We should obtain the **same** score as promised in the last round of training, let’s check!

<br></font>
</div>

In [35]:
mean_absolute_error(y_test, best_model.predict(dtest))

3.9265177537378957