# 3. Fine-Tuning XGBoost Models

These are my personal notes on the DataCamp course [Extreme Gradient Boosting with XGBoost](https://app.datacamp.com/learn/courses/extreme-gradient-boosting-with-xgboost).

The course has 4 main sections:

1. Classification
2. Regression
3. **Fine-tuning XGBoost**: the current notebook.
4. Using XGBoost in Pipelines

XGBoost is a C++ implementation of the [Gradient Boosting](https://en.wikipedia.org/wiki/Gradient_boosting) algorithm with bindings for other languages, such as Python. It has the following properties:

- Fast.
- Strong predictive performance, often among the best on tabular data.
- Parallelizable, both on a single machine and across a network, so it can handle huge datasets distributed over several nodes/GPUs.
- We can use it for classification and regression.
- The [Python API](https://xgboost.readthedocs.io/en/stable/python/python_api.html) is easy to use and has two major flavors or sub-APIs:
  - The **Scikit-Learn API**: We instantiate `XGBRegressor()` or `XGBClassifier()` and then call `fit()` and `predict()`, using the typical Scikit-Learn parameters; we can even use those objects with other Scikit-Learn modules, such as `GridSearchCV`.
  - The **Learning API**: The native XGBoost Python API requires converting the dataframes into `DMatrix` objects first; then, we have powerful functions which allow for tuning many parameters: `xgb.cv()`, `xgb.train()`. The native/learning API is also easy to use. **Note: the parameter names are different compared to the Scikit-Learn API!**

Classification is the original supervised learning problem addressed by XGBoost, although it can also handle regression problems.

### Installation

```bash
# PIP
pip install xgboost

# Conda: General
conda install -c conda-forge py-xgboost

# Conda: CPU only
conda install -c conda-forge py-xgboost-cpu

# Conda: Use NVIDIA GPU: Linux x86_64
conda install -c conda-forge py-xgboost-gpu

# For tree visualization
pip install graphviz
```

### Table of Contents

- [3.1 Manual Hyperparameter Selection](#3.1-Manual-Hyperparameter-Selection)
    - Example of Manual Parameter Selection
    - Effect of Varying a Hyperparameter: Number of Boosting Rounds
    - Automated Selection with Early Stopping
- [3.2 Most Common Tunable Hyperparameters](#3.2-Most-Common-Tunable-Hyperparameters)
    - Example: Variation of the Learning Rate, Max Depth, Number of Features
- [3.3 Grid Search and Random Search](#3.3-Grid-Search-and-Random-Search)
    - Grid Search
    - Random Search

## 3.1 Manual Hyperparameter Selection

Manual or systematic parameter tuning can significantly improve the results, but it can also be time-consuming, so how much tuning effort is justified depends on the application.

In general, XGBoost hyperparameters are set via the `params` dictionary. Note that some arguments of the `cv()` and `train()` functions are hyperparameters, too! In particular, `num_boost_round`, which specifies the number of weak learners or boosting rounds, is essential. We can also use the Scikit-Learn API; in that case, we can take advantage of `GridSearchCV` or `RandomizedSearchCV`, but note that the parameter names might change, e.g., `num_boost_round` becomes `n_estimators`. For more, check the documentation: [Python API Reference](https://xgboost.readthedocs.io/en/stable/python/python_api.html).
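
The renaming between the two APIs can be captured in a small lookup table. The sketch below covers only a few common names; the helper `to_sklearn_names` and the flat input dictionary are hypothetical conveniences, not part of XGBoost (in the Learning API, `num_boost_round` is an argument of `cv()`/`train()`, not a key of `params`).

```python
# Learning-API name -> Scikit-Learn-API name (only a few common ones)
NATIVE_TO_SKLEARN = {
    "eta": "learning_rate",
    "num_boost_round": "n_estimators",
    "lambda": "reg_lambda",
    "alpha": "reg_alpha",
}


def to_sklearn_names(native_params):
    """Rename known keys; keys without a known alias are kept unchanged."""
    return {NATIVE_TO_SKLEARN.get(key, key): value
            for key, value in native_params.items()}


# Hypothetical flat collection of settings, written with Learning-API names
native_settings = {"eta": 0.1, "max_depth": 5,
                   "colsample_bytree": 0.3, "num_boost_round": 200}
sklearn_settings = to_sklearn_names(native_settings)
print(sklearn_settings)
```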

### Example of Manual Parameter Selection

In [1]:
import pandas as pd
import xgboost as xgb
import numpy as np

In [2]:
housing_data = pd.read_csv("../data/ames_housing_trimmed_processed.csv")
X = housing_data[housing_data.columns.tolist()[:-1]]
y = housing_data[housing_data.columns.tolist()[-1]]
housing_dmatrix = xgb.DMatrix(data=X,label=y)

In [11]:
# Manually set parameter values (not default ones)
params = {"objective":"reg:squarederror",
          'colsample_bytree': 0.3,
          'learning_rate': 0.1,
          'max_depth': 5}
cv_results_rmse = xgb.cv(dtrain=housing_dmatrix,
                         params=params,
                         nfold=4,
                         num_boost_round=200, # THIS IS ALSO A PARAM!
                         metrics="rmse",
                         as_pandas=True,
                         seed=123)

In [12]:
print("Tuned rmse: %f" % cv_results_rmse["test-rmse-mean"].iloc[-1]) # 30370

Tuned rmse: 30370.552735


### Effect of Varying a Hyperparameter: Number of Boosting Rounds

In this example, we try different numbers of boosting rounds, i.e., `num_boost_round`; this value denotes the number of weak learners under the hood. Note that we can use such a loop with any kind of hyperparameter.

In [13]:
# Create list of number of boosting rounds
num_rounds = [150, 200, 250]

In [14]:
# Empty list to store final round rmse per XGBoost model
final_rmse_per_round = []

In [15]:
# Iterate over num_rounds and build one model per num_boost_round parameter
for curr_num_rounds in num_rounds:

    # Perform cross-validation: cv_results
    cv_results = xgb.cv(dtrain=housing_dmatrix,
                        params=params,
                        nfold=3,
                        num_boost_round=curr_num_rounds, # Several values tried in loop
                        metrics="rmse",
                        as_pandas=True,
                        seed=123)
    
    # Append final round RMSE
    final_rmse_per_round.append(cv_results["test-rmse-mean"].tail().values[-1])

In [16]:
# Print the resultant DataFrame
num_rounds_rmses = list(zip(num_rounds, final_rmse_per_round))
print(pd.DataFrame(num_rounds_rmses,columns=["num_boosting_rounds","rmse"]))

   num_boosting_rounds          rmse
0                  150  29763.123698
1                  200  29634.996745
2                  250  29639.554036
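
Reading the best setting off such a table can also be done programmatically. A minimal sketch, reusing the three RMSE values printed above:

```python
# Values copied from the table above
num_rounds = [150, 200, 250]
final_rmse_per_round = [29763.123698, 29634.996745, 29639.554036]

# Pick the (num_boost_round, rmse) pair with the lowest RMSE
best_rounds, best_rmse = min(zip(num_rounds, final_rmse_per_round),
                             key=lambda pair: pair[1])
print(best_rounds, best_rmse)  # 200 29634.996745
```

Going from 200 to 250 rounds does not help here: the RMSE increases slightly, which hints that additional rounds start to overfit.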


### Automated Selection with Early Stopping

We can activate early stopping with `early_stopping_rounds`: boosting can be stopped before completing the total number of boosting rounds given with `num_boost_round`. The validation metric needs to improve at least once in every `early_stopping_rounds` round(s) to avoid stopping.
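
The stopping rule itself is simple enough to simulate in plain Python. The following is only a sketch of the idea, not XGBoost's actual implementation; the function name `simulate_early_stopping` and the RMSE values are made up for illustration.

```python
def simulate_early_stopping(scores, patience):
    """Return (best_round, stop_round) for a metric that should decrease.

    Stops as soon as `patience` consecutive rounds bring no improvement.
    """
    best_score = float("inf")
    best_round = -1
    rounds_without_improvement = 0
    for round_idx, score in enumerate(scores):
        if score < best_score:
            best_score = score
            best_round = round_idx
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                return best_round, round_idx
    # The metric kept improving often enough: all rounds were run
    return best_round, len(scores) - 1


# Made-up validation RMSE per boosting round
made_up_rmse = [100.0, 80.0, 70.0, 65.0, 66.0, 64.0, 64.5, 64.2, 64.1, 60.0]
best_round, stop_round = simulate_early_stopping(made_up_rmse, patience=3)
print(best_round, stop_round)  # 5 8
```

In this made-up sequence the last value (60.0) would have been an improvement, but it is never reached: rounds 6, 7 and 8 do not beat round 5, so training stops at round 8. This is the trade-off of a small `early_stopping_rounds` value.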

In [24]:
# Create the parameter dictionary for each tree: params
params = {"objective":"reg:squarederror", "max_depth":4}

In [41]:
# Perform cross-validation with early stopping: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix,
                         params=params,
                         nfold=3,
                         num_boost_round=50,
                         early_stopping_rounds=10,
                         metrics="rmse",
                         as_pandas=True,
                         seed=123)

In [42]:
# Print cv_results
# We see the results for each boosting round
print(cv_results)

    train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
0     141871.635417      403.636200   142640.651042     705.559164
1     103057.036458       73.769561   104907.664062     111.112417
2      75975.963542      253.726946    79262.054687     563.764349
3      57420.531250      521.656754    61620.135417    1087.693857
4      44552.955729      544.170190    50437.561198    1846.446330
5      35763.949219      681.795751    43035.661458    2034.469207
6      29861.464193      769.571238    38600.880208    2169.796232
7      25994.676432      756.520565    36071.817708    2109.795430
8      23306.835937      759.237670    34383.184896    1934.546688
9      21459.769531      745.624998    33509.141276    1887.375284
10     20148.721354      749.612769    32916.809245    1850.893589
11     19215.382161      641.388291    32197.832682    1734.456935
12     18627.389323      716.256596    31770.852865    1802.155484
13     17960.694661      557.043073    31482.782552    1779.12

## 3.2 Most Common Tunable Hyperparameters

**IMPORTANT**: Have a look at the [Python API](https://xgboost.readthedocs.io/en/stable/python/python_api.html) to see all the parameters for any API (i.e., Scikit-Learn or Learning). In the following, the most common parameters are listed.

Tree weak learner:

- `eta` or `learning_rate`: how quickly we fit the residual error. High values lead to quicker fits.
- `gamma`: min loss reduction required to create a new tree split. Higher value, fewer splits, less complexity, less overfitting.
- `lambda` (`reg_lambda` in the Scikit-Learn API): L2 reg on leaf weights. Higher value, less complexity.
- `alpha` (`reg_alpha` in the Scikit-Learn API): L1 reg on leaf weights. Higher value, less complexity.
- `max_depth`: max depth per tree; how deep each tree is allowed to grow in each round. Higher value, **more** complexity.
- `subsample`: fraction of total samples used per tree; in each boosting round, a tree is trained on a random subset of all data points, and this value is the relative size of that subset. Higher value, **more** complexity.
- `colsample_bytree`: fraction of features used per tree or boosting round. Not all features need to be used by each weak learner; this value is the fraction of all features that is randomly selected for each tree. A low value acts like stronger regularization.

Linear weak learner (far fewer hyperparameters):

- `lambda`: L2 reg on weights. Higher value, less complexity.
- `alpha`: L1 reg on weights. Higher value, less complexity.
- `lambda_bias`: L2 reg term on bias. Higher value, less complexity.

For any type of base/weak learner, recall that we can tune the number of boosting rounds or weak learners in the `cv()` or `train()` call:

- `num_boost_round`
- `early_stopping_rounds`
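
As a rough illustration of where each of these settings goes in the Learning API, the sketch below groups them into dictionaries. All numeric values are arbitrary placeholders, not recommendations, and the variable names are hypothetical.

```python
# Tree weak learner: these go into the `params` dictionary
tree_params = {
    "booster": "gbtree",
    "objective": "reg:squarederror",
    "eta": 0.1,               # learning rate
    "gamma": 1.0,             # min loss reduction to split
    "lambda": 1.0,            # L2 regularization on leaf weights
    "alpha": 0.0,             # L1 regularization on leaf weights
    "max_depth": 4,
    "subsample": 0.8,         # fraction of rows per tree
    "colsample_bytree": 0.5,  # fraction of features per tree
}

# Linear weak learner: a much shorter list of regularization settings
linear_params = {
    "booster": "gblinear",
    "objective": "reg:squarederror",
    "lambda": 1.0,
    "alpha": 0.0,
}

# These are NOT part of `params`; they are arguments of cv() / train(),
# e.g. xgb.cv(dtrain=..., params=tree_params, **boosting_kwargs)
boosting_kwargs = {"num_boost_round": 100, "early_stopping_rounds": 10}

print(sorted(set(tree_params) - set(linear_params)))
```

The printed list contains the tree-specific settings that have no counterpart in the linear booster dictionary above.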


### Example: Variation of the Learning Rate, Max Depth, Number of Features

In [44]:
# Create your housing DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree (boosting round)
params = {"objective":"reg:squarederror", "max_depth":3}

# Create list of eta values and empty list to store final round rmse per xgboost model
vals = [0.001, 0.01, 0.1] # eta
#vals = [2, 5, 10, 20] # max_depth
#vals = [0.1, 0.5, 0.8, 1] # colsample_bytree
best_rmse = []

# Systematically vary the eta 
for curr_val in vals:

    params["eta"] = curr_val
    #params["max_depth"] = curr_val
    #params["colsample_bytree"] = curr_val
    
    # Perform cross-validation: cv_results
    cv_results = xgb.cv(dtrain=housing_dmatrix,
                         params=params,
                         nfold=3,
                         num_boost_round=10,
                         early_stopping_rounds=5,
                         metrics="rmse",
                         as_pandas=True,
                         seed=123)

    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(vals, best_rmse)), columns=["eta","best_rmse"]))

     eta      best_rmse
0  0.001  195736.411458
1  0.010  179932.192708
2  0.100   79759.411458
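
The large RMSE for small `eta` values is mostly a consequence of using only 10 boosting rounds. A back-of-the-envelope sketch, under the idealized assumption that every weak learner fits the current residual perfectly: with squared error, each round then removes a fraction `eta` of the residual, so after `n` rounds a fraction `(1 - eta)**n` remains.

```python
etas = [0.001, 0.01, 0.1]
rounds = 10

# Fraction of the initial residual left after `rounds` idealized rounds
remaining_fraction = {eta: (1.0 - eta) ** rounds for eta in etas}

for eta, fraction in remaining_fraction.items():
    print(f"eta={eta}: {fraction:.3f} of the initial residual remains")
# eta=0.001: 0.990 of the initial residual remains
# eta=0.01: 0.904 of the initial residual remains
# eta=0.1: 0.349 of the initial residual remains
```

Real trees do not fit the residual perfectly, so the numbers in the table above differ, but the trend is the same: a small learning rate needs many more boosting rounds to reach a comparable error.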


## 3.3 Grid Search and Random Search

We can use `GridSearchCV` and `RandomizedSearchCV` from Scikit-Learn to systematically obtain the best parameters. To that end, we need to use the Scikit-Learn API, i.e., we instantiate `XGBRegressor` or `XGBClassifier` and use the parameter names typical of Scikit-Learn. Note that the parameter search space grows exponentially as we add parameters, so:

- With `GridSearchCV` we might require much more time to find the optimum parameter set.
- With `RandomizedSearchCV` we limit the number of sets, but these are random!
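
The size of a grid can be counted before running anything. The sketch below enumerates the same grid that is used in the Grid Search example of this section; it only counts combinations and does not train any model.

```python
from itertools import product

gbm_param_grid = {
    "learning_rate": [0.01, 0.1, 0.5, 0.9],
    "n_estimators": [200],
    "subsample": [0.3, 0.5, 0.9],
}
cv_folds = 4

# One candidate per element of the Cartesian product of all value lists
candidates = [dict(zip(gbm_param_grid, values))
              for values in product(*gbm_param_grid.values())]
total_fits = len(candidates) * cv_folds
print(len(candidates), total_fits)  # 12 48
```

Every additional parameter multiplies the number of candidates by the number of values tried for it; for instance, adding five `max_depth` values to this grid would yield 60 candidates and 240 fits.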

There are more advanced techniques for hyperparameter tuning, such as [Bayesian hyperparameter optimization](https://machinelearningmastery.com/scikit-optimize-for-hyperparameter-tuning-in-machine-learning/).

### Grid Search

In [56]:
import pandas as pd
import xgboost as xgb
import numpy as np
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

In [50]:
housing_data = pd.read_csv("../data/ames_housing_trimmed_processed.csv")
X = housing_data[housing_data.columns.tolist()[:-1]]
y = housing_data[housing_data.columns.tolist()[-1]]
housing_dmatrix = xgb.DMatrix(data=X,label=y)

In [51]:
# The parameter names with the Scikit-Learn API are different
# eta -> learning_rate
# num_boost_round -> n_estimators
gbm_param_grid = {'learning_rate': [0.01,0.1,0.5,0.9],
                  'n_estimators': [200],
                  'subsample': [0.3, 0.5, 0.9]}

In [52]:
gbm = xgb.XGBRegressor()
grid_mse = GridSearchCV(estimator=gbm,
                        param_grid=gbm_param_grid,
                        scoring='neg_mean_squared_error', # negative MSE
                        cv=4,
                        verbose=1)

In [54]:
grid_mse.fit(X, y)

Fitting 4 folds for each of 12 candidates, totalling 48 fits


GridSearchCV(cv=4,
             estimator=XGBRegressor(base_score=None, booster=None,
                                    colsample_bylevel=None,
                                    colsample_bynode=None,
                                    colsample_bytree=None,
                                    enable_categorical=False, gamma=None,
                                    gpu_id=None, importance_type=None,
                                    interaction_constraints=None,
                                    learning_rate=None, max_delta_step=None,
                                    max_depth=None, min_child_weight=None,
                                    missing=nan, monotone_constraints=None,
                                    n_estimators=100, n_jobs=None,
                                    num_parallel_tree=None, predictor=None,
                                    random_state=None, reg_alpha=None,
                                    reg_lambda=None, scale_pos_weight=None,
       

In [55]:
print("Best parameters found: ", grid_mse.best_params_)
# Since we have the negative MSE, we need to compute the RMSE from it
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))

Best parameters found:  {'learning_rate': 0.1, 'n_estimators': 200, 'subsample': 0.5}
Lowest RMSE found:  29105.179169382693
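
Scikit-Learn always maximizes scores, which is why the MSE is reported with a negative sign and has to be converted back. A tiny worked example with made-up prediction errors:

```python
import math

# Made-up prediction errors (y_true - y_pred) for two samples
errors = [3.0, 4.0]

mse = sum(e ** 2 for e in errors) / len(errors)  # (9 + 16) / 2 = 12.5
neg_mse = -mse                                   # what 'neg_mean_squared_error' reports
rmse = math.sqrt(abs(neg_mse))                   # back to the units of the target

print(neg_mse, round(rmse, 4))  # -12.5 3.5355
```

A larger (less negative) score therefore means a smaller error, so `best_score_` corresponds to the lowest RMSE.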


### Random Search

We define the possible hyperparameter values but, unlike in the grid search, we also fix the number of combinations to be tested. Then, for each trial, the hyperparameter values are chosen randomly.

In [58]:
# All possible combinations are: 20 * 1 * 20 = 400
# BUT: we limit to n_iter=25 the number of combinations
# And we will train each of them 4-fold with CV
gbm_param_grid = {'learning_rate': np.arange(0.05,1.05,.05), # arange: 20 values
                  'n_estimators': [200],
                  'subsample': np.arange(0.05,1.05,.05)} # arange: 20 values

gbm = xgb.XGBRegressor()
randomized_mse = RandomizedSearchCV(estimator=gbm,
                                    param_distributions=gbm_param_grid,
                                    n_iter=25, # number of combinations
                                    scoring='neg_mean_squared_error',
                                    cv=4,
                                    verbose=1)

randomized_mse.fit(X, y)
print("Best parameters found: ",randomized_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(randomized_mse.best_score_)))

Fitting 4 folds for each of 25 candidates, totalling 100 fits
Best parameters found:  {'subsample': 0.4, 'n_estimators': 200, 'learning_rate': 0.2}
Lowest RMSE found:  29666.410368346937
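
The sampling step can be mimicked with the standard library. This is a sketch of the idea only, not of Scikit-Learn's internals; it relies on the fact that, when all parameters are given as lists or arrays of values, combinations are drawn without replacement. The seed 123 is an arbitrary choice.

```python
import random
from itertools import product

# Same 20 x 1 x 20 search space as above, written without NumPy
learning_rates = [round(0.05 * i, 2) for i in range(1, 21)]
n_estimators = [200]
subsamples = [round(0.05 * i, 2) for i in range(1, 21)]

all_combinations = list(product(learning_rates, n_estimators, subsamples))

# Draw 25 distinct combinations at random (n_iter=25)
rng = random.Random(123)
sampled = rng.sample(all_combinations, k=25)

cv_folds = 4
print(len(all_combinations), len(sampled), len(sampled) * cv_folds)  # 400 25 100
```

Only 25 of the 400 combinations are evaluated, so the best combination found depends on the draw; fixing `random_state` in `RandomizedSearchCV` makes the search reproducible.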
