Ahh, the dark art of hyperparameter tuning.
It's a key step in the machine learning workflow,
and it's an activity that can easily be overlooked or be overkill.
Therefore, dear reader, it is an art that requires the application of both skill and wisdom to realize its full potential while avoiding its perils.
Today I'll show you my approach for hyperparameter tuning XGBoost, although the principles apply to any GBT framework.
I'll give you some intuition for how to think about the key parameters in XGBoost,
and I'll show you an efficient strategy for parameter tuning GBTs.
I'll be using the optuna Python library to tune parameters with Bayesian optimization, but you can implement my strategy with whatever hyperparameter tuning utility you like.

![Lunar halo on a frosty night in Johnson City, TN](optuna_main.jpg "")

## When should you do hyperparameter tuning?

Hyperparameter tuning can easily be overlooked in the move-fast-and-break-everything hustle of building an ML product, but it can also easily become overkill, depending on the application.
There are two key questions to ask:

1. How much value is created by an incremental gain in model prediction accuracy?
1. What is the cost of increasing model prediction accuracy?

The point is that sometimes a small gain in model prediction performance translates into millions of dollars of impact.
The dream scenario is that you swoop in on some key model in your organization, markedly improve its accuracy with an easy afternoon of hyperparameter tuning, realize massive improvements in your org's KPIs, and earn mad respect, a bonus, and a promotion.
But the reality is that often additional model accuracy doesn't really change business KPIs by very much.
Try to figure out the actual value of improved model accuracy and proceed accordingly.

Remember too that hyperparameter tuning has its costs, most obviously the developer time and compute resources for the search itself.
It can also lead us to deeper and deeper models which take longer to train, occupy larger memory footprints, and have higher prediction latency.
Blindly chasing prediction accuracy can even backfire and make a system worse, e.g. by [degrading causal reasoning in decision-making systems](https://matheusfacure.github.io/python-causality-handbook/11-Propensity-Score.html#common-issues-with-propensity-score).

Moral of the story: think before you tune.

## XGBoost Parameters

Gradient boosting algorithms like XGBoost have two main types of hyperparameters: *tree parameters* which control the decision tree trained at each boosting round and *boosting parameters* which control the boosting procedure itself.
Below I'll highlight my favorite parameters, but you can see the full list in the [documentation](https://xgboost.readthedocs.io/en/stable/parameter.html).

### Tree Parameters
In theory you can use any kind of model as a base learner in a gradient boosting algorithm, but for reasons we discussed before, [decision trees](/posts/consider-the-decision-tree/) 
are typically the best choice.
In XGBoost, we can choose the tree construction algorithm, and we get three types of parameters to control its behavior: tree complexity parameters, sampling parameters, and regularization parameters.

#### Tree construction algorithm
The tree construction algorithm boils down to split finding, and
different algorithms have different ways of generating candidate splits to consider.
In XGBoost we have the parameter:

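* `tree_method` - select tree construction algorithm: `exact`, `approx`, or `hist`. Recent XGBoost releases (2.0 and later) use `hist` by default, while older releases picked a method heuristically via `auto`.
The exact method tends to be slow, so I usually consider approx and hist in parameter searches.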

#### Tree complexity parameters
Tree complexity just means how many leaf nodes the trees have, and therefore how expressive they can be.
I use these two parameters:

* `max_depth` - maximum number of split levels allowed. Reasonable values are usually from 3-12.
* `min_child_weight` - minimum allowable sum of hessian values over data in a node. When using mean squared error as the objective, this is the minimum number of samples allowed in a leaf node. In that case, values in [1, 200] usually work well.

These two parameters oppose each other; increasing `max_depth` allows for more expressive trees, while increasing `min_child_weight` makes trees less expressive and therefore is a powerful way to counter overfitting.
Note that `gamma` (a.k.a. `min_split_loss`) also limits node splitting, but I usually don't use it because `min_child_weight` seems to work well enough on its own.
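
To see why `min_child_weight` behaves like a minimum sample count under squared error, it helps to look at the per-sample hessians. The sketch below is plain Python, not XGBoost internals, and the function names are made up for illustration. It assumes the squared error loss is written as 0.5 * (y - pred)^2, whose second derivative with respect to the prediction is 1, and contrasts it with logistic loss, whose hessian is p * (1 - p).

```python
import math

def squared_error_hessian(y_true, y_pred):
    # second derivative of 0.5 * (y_true - y_pred)**2 w.r.t. y_pred is always 1
    return 1.0

def logistic_hessian(y_true, raw_score):
    # second derivative of log loss w.r.t. the raw score is p * (1 - p)
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return p * (1.0 - p)

# hypothetical regression node containing five samples
labels = [3.2, 1.5, 4.8, 2.2, 3.9]
preds = [3.0, 2.0, 4.0, 2.5, 3.5]
node_weight_mse = sum(squared_error_hessian(y, p) for y, p in zip(labels, preds))
print(node_weight_mse)  # 5.0, i.e. the number of samples in the node

# hypothetical classification node, also with five samples
binary_labels = [0, 0, 1, 1, 1]
raw_scores = [-4.0, -2.0, 0.0, 2.0, 4.0]
node_weight_logistic = sum(
    logistic_hessian(y, s) for y, s in zip(binary_labels, raw_scores)
)
print(node_weight_logistic)  # well below 5, since confident predictions have tiny hessians
```

So for objectives other than squared error, the same `min_child_weight` value can correspond to a very different number of samples per leaf, and the [1, 200] rule of thumb may need rescaling.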

#### Sampling parameters
XGBoost can randomly sample rows and columns to be used for training each tree;
you might think of this as *bagging*.
We have a few parameters:

* `subsample` - proportion of rows to use in each tree. Setting this less than 1.0 results in stochastic gradient boosting, because each tree is trained on only a subset of the entire training dataset. Any value in (0,1] is valid, but it seems like values in [0.7, 1] are usually the best.
* `colsample_bytree`, `colsample_bylevel`, `colsample_bynode` - control the fraction of columns available to each tree, at each split level, or at each split, respectively. I usually use either by level or by node because I like the idea that trees might be forced to learn interactions by having different features available at each subsequent split. Again, values in (0,1] are valid, but values in [0.5,1] usually seem to work best.
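
One thing to keep in mind if you combine the column sampling parameters: according to the XGBoost documentation they work cumulatively, so the fractions multiply. The helper below is a hypothetical back-of-the-envelope calculator, not part of XGBoost, and the exact rounding XGBoost applies internally may differ.

```python
def features_per_split(n_features, bytree=1.0, bylevel=1.0, bynode=1.0):
    # columns are subsampled per tree, then per depth level, then per split,
    # so the three fractions multiply
    return int(n_features * bytree * bylevel * bynode)

# the example from the XGBoost docs: 64 features with all three set to 0.5
print(features_per_split(64, bytree=0.5, bylevel=0.5, bynode=0.5))  # 8

# a dataset with 52 features and only colsample_bynode=0.5
print(features_per_split(52, bynode=0.5))  # 26
```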

#### Regularization parameters
In XGBoost, regularization penalizes the actual values predicted by the individual trees, pushing values toward zero.
I usually use:

* `reg_lambda` - L2 regularization of tree predicted values. Increasing this parameter decreases tree expressiveness and therefore counters overfitting. Valid values are in [0,$\infty$), but good values typically fall in [1,10].

There is also an L1 regularization parameter called `reg_alpha`; feel free to use it instead.
It seems that using one or the other is usually sufficient.
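
To make the "pushing values toward zero" idea concrete, here is a small sketch based on the leaf weight formula from the XGBoost paper, $w^* = -G / (H + \lambda)$, where $G$ and $H$ are the sums of gradients and hessians over the samples in a leaf. It assumes L2 regularization only (no `reg_alpha`, no learning rate scaling), and the data are made up for illustration.

```python
def leaf_value(gradients, hessians, reg_lambda):
    # optimal leaf weight with L2 regularization only: -G / (H + lambda)
    G = sum(gradients)
    H = sum(hessians)
    return -G / (H + reg_lambda)

# hypothetical leaf with 4 samples under squared error,
# where gradient = prediction - label and hessian = 1
gradients = [-2.0, -1.0, -3.0, -2.0]
hessians = [1.0, 1.0, 1.0, 1.0]

leaf_values = [leaf_value(gradients, hessians, lam) for lam in (0.0, 1.0, 10.0)]
for lam, value in zip((0.0, 1.0, 10.0), leaf_values):
    print(f'reg_lambda={lam}: leaf value = {value:.4f}')
```

With no regularization the leaf predicts the mean residual (2.0 here), and the value shrinks toward zero as `reg_lambda` grows. Note that the shrinkage is strongest for leaves with few samples, because there $H$ is small relative to $\lambda$.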

### Boosting Parameters
Trained gradient boosting models take the form:

$$ F(\mathbf{x}) = b + \eta \sum_{k=1}^{K} f_k(\mathbf{x}) $$ 

where $b$ is the constant base predicted value, $f_k(\cdot)$ is the base learner for round $k$, parameter $K$ is the number of boosting rounds, and parameter $\eta$ is the learning rate.
In XGBoost these parameters are controlled by:

* `num_boost_round` - the number of boosting iterations. 
* `learning_rate` - the scaling or "shrinkage" factor applied to the predicted value of each base learner. Valid values are in (0,1]; the default is 0.3. 

These two parameters are very closely linked; the optimal value of one depends on the value of the other,
where smaller learning rates require more boosting rounds to reach optimality.
While training a model with a given learning rate, accuracy tends to increase with additional boosting rounds up to a certain point, but beyond that point it flattens out or even gets worse.
We can leverage this fact to make our tuning more efficient by using XGBoost's `early_stopping_rounds: int` argument, which terminates training after observing the specified number of boosting rounds without any improvement to the evaluation metric on the validation set.
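
The early stopping rule itself is simple. Here is a simplified sketch of the idea in plain Python, assuming a metric where lower is better; it is not XGBoost's actual implementation, and the function name and scores are made up for illustration.

```python
def best_round_with_early_stopping(valid_scores, patience):
    """Return (best_round, rounds_trained) for a lower-is-better metric."""
    best_score = float('inf')
    best_round = -1
    for round_idx, score in enumerate(valid_scores):
        if score < best_score:
            best_score, best_round = score, round_idx
        elif round_idx - best_round >= patience:
            # `patience` rounds have passed without improvement, so stop here
            return best_round, round_idx + 1
    return best_round, len(valid_scores)

# hypothetical validation RMSE after each boosting round
scores = [0.50, 0.40, 0.35, 0.33, 0.34, 0.335, 0.34, 0.36, 0.37]
result = best_round_with_early_stopping(scores, patience=3)
print(result)  # (3, 7): best round has index 3, training stops after 7 rounds
```

The important detail is that training runs a few rounds past the best one before stopping, which is why we read off the best iteration afterward rather than the total number of rounds trained.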

## An Efficient Parameter Search Strategy for XGBoost
Efficiency is the key to effective parameter tuning, because wasting less time means searching more parameter values and finding better models in a given amount of time.
Parameter search involves training models over and over, and what determines training time?
Well, given a training dataset and a tree construction algorithm, by far the most important parameter is the number of boosting rounds.
So we want to avoid any unnecessary boosting rounds during parameter search.

Fortunately, the tree parameters tend to be independent of the boosting parameters, meaning that if we find a good combination of tree parameters, they will usually work well across various boosting parameter values.
This insight leads to a strategy where I first simultaneously tune tree parameters and the learning rate while holding the boosting rounds fixed at a moderately small value for fast training.
Then I can optionally train a model with optimal tree parameters and aggressive boosting parameters.
Specifically, my strategy goes like this:

1. With early stopping enabled, fix the number of boosting rounds at a reasonable value, and perform a parameter search over all other relevant parameters. Note the best iteration of the best model found during the search.
1. Optionally, with early stopping enabled, train a new model using the optimal tree parameter values from stage 1, fix the learning rate at a very small value ($\le 0.01$), and boost until early stopping is invoked.

If using the one-stage procedure, train the final model using the optimal parameter values, and set the number of boosting rounds to the best iteration of the best model from the search.
If using the second aggressive boosting step, train the final model using the optimal tree parameters from stage 1, the small learning rate you chose for stage 2, and the best boosting round from stage 2.

## Tuning XGBoost Parameters with Optuna

[Optuna](https://optuna.readthedocs.io/en/stable/)
is a model-agnostic Python library for hyperparameter tuning.
I like it because it has a flexible API that abstracts away the details of the search algorithm being used.
That means you can use this one library to tune all kinds of different models, and you can easily switch the parameter sampling approach among grid search, random search, the very sensible default Bayesian optimization, and more.
Another massive benefit is that optuna provides a specific [XGBoost integration](https://optuna.readthedocs.io/en/stable/reference/generated/optuna.integration.XGBoostPruningCallback.html) 
which terminates training early on lousy parameter combinations.

You can install optuna with anaconda, e.g.

```.zsh
$ conda install -c conda-forge optuna
```


### Example: Tuning the Bluebook for Bulldozers Regression Model

To illustrate the procedure, we'll tune the parameters for the regression model we built back in the [XGBoost for regression](/posts/xgboost-for-regression-in-python/) post.
First we'll load up the bulldozer data and prepare the features and target just like we did before.

In [1]:
#| code-fold: true
#| output: false
import time 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import optuna 

df = pd.read_csv('../xgboost-for-regression-in-python/Train.csv', parse_dates=['saledate']);

def encode_string_features(df):
    out_df = df.copy()
    for feature, feature_type in df.dtypes.items():
        if feature_type == 'object':
            out_df[feature] = out_df[feature].astype('category')
    return out_df

df = encode_string_features(df)

df['saledate_days_since_epoch'] = (
    df['saledate'] - pd.Timestamp(year=1970, month=1, day=1)
    ).dt.days

df['logSalePrice'] = np.log1p(df['SalePrice'])


features = [
    'SalesID',
    'MachineID',
    'ModelID',
    'datasource',
    'auctioneerID',
    'YearMade',
    'MachineHoursCurrentMeter',
    'UsageBand',
    'fiModelDesc',
    'fiBaseModel',
    'fiSecondaryDesc',
    'fiModelSeries',
    'fiModelDescriptor',
    'ProductSize',
    'fiProductClassDesc',
    'state',
    'ProductGroup',
    'ProductGroupDesc',
    'Drive_System',
    'Enclosure',
    'Forks',
    'Pad_Type',
    'Ride_Control',
    'Stick',
    'Transmission',
    'Turbocharged',
    'Blade_Extension',
    'Blade_Width',
    'Enclosure_Type',
    'Engine_Horsepower',
    'Hydraulics',
    'Pushblock',
    'Ripper',
    'Scarifier',
    'Tip_Control',
    'Tire_Size',
    'Coupler',
    'Coupler_System',
    'Grouser_Tracks',
    'Hydraulics_Flow',
    'Track_Type',
    'Undercarriage_Pad_Width',
    'Stick_Length',
    'Thumb',
    'Pattern_Changer',
    'Grouser_Type',
    'Backhoe_Mounting',
    'Blade_Type',
    'Travel_Controls',
    'Differential_Type',
    'Steering_Controls',
    'saledate_days_since_epoch'
 ]

target = 'logSalePrice'



But this time, since we're going to slam our validation set over and over during hyperparameter search, we want to reserve an actual test set to check how the final model generalizes.
We make four different `xgboost.DMatrix` datasets for this process: training, validation, training+validation, and test. 
Training and validation are for the parameter search, and training+validation and test are for the final model.

In [3]:
n_valid = 12000
n_test = 12000

sorted_df = df.sort_values(by='saledate')
train_df = sorted_df[:-(n_valid + n_test)] 
valid_df = sorted_df[-(n_valid + n_test):-n_test] 
test_df = sorted_df[-n_test:]

dtrain = xgb.DMatrix(data=train_df[features], label=train_df[target], 
                     enable_categorical=True)
dvalid = xgb.DMatrix(data=valid_df[features], label=valid_df[target], 
                     enable_categorical=True)
dtest = xgb.DMatrix(data=test_df[features], label=test_df[target], 
                    enable_categorical=True)
dtrainvalid = xgb.DMatrix(data=pd.concat([train_df, valid_df])[features], 
                          label=pd.concat([train_df, valid_df])[target], 
                          enable_categorical=True)

## Preliminaries: base parameters and scoring function

There are a couple of parameters that we usually want to keep fixed across all trials in a parameter search, including the XGBoost objective for training and the evaluation metric to be used for early stopping.
We'll also want to implement a model scoring function that takes a trained model and a dataset and returns the score, in our case, RMSE.

In [4]:
metric = 'rmse'
base_params = {
    'objective': 'reg:squarederror',
    'eval_metric': metric,
}

In [5]:
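def score_model(model, dmat):
    # RMSE of the model's predictions against the labels stored in the DMatrix
    y_true = dmat.get_label() 
    y_pred = model.predict(dmat) 
    # take the square root explicitly; the `squared=False` argument was
    # deprecated and later removed in newer scikit-learn releases
    return np.sqrt(mean_squared_error(y_true, y_pred))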
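
A note on what this score means: because our target is `log1p(SalePrice)`, the RMSE computed by `score_model` is the root mean squared log error (RMSLE) on the original prices, which is the metric the Kaggle competition uses. Here is a quick check in plain Python with made-up prices; the helper names are for illustration only.

```python
import math

def rmse(y_true, y_pred):
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def rmsle(y_true, y_pred):
    # root mean squared error between log(1 + y) values
    n = len(y_true)
    squared_log_errors = (
        (math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
    )
    return math.sqrt(sum(squared_log_errors) / n)

# hypothetical sale prices and predictions in dollars
prices = [9500.0, 14000.0, 50000.0, 16000.0]
predicted = [10000.0, 12500.0, 47000.0, 18000.0]

log_prices = [math.log1p(v) for v in prices]
log_predicted = [math.log1p(v) for v in predicted]

rmse_on_logs = rmse(log_prices, log_predicted)
rmsle_on_prices = rmsle(prices, predicted)
print(rmse_on_logs, rmsle_on_prices)  # the two values agree
```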

## Stage 1: Tune Tree Parameters with Optuna

Next we implement our optuna objective, a function taking an optuna study `Trial` object and returning the score we want to optimize.
We use the `suggest_categorical`, `suggest_float`, and `suggest_int` methods of the `Trial` object to define the search space for each parameter.
Note the use of the pruning callback function which we pass into the `callbacks` argument of the XGBoost `train` function; this is a must, since it allows optuna to prune lousy models after a few boosting rounds.
Then we train XGBoost using up to 500 boosting rounds with early stopping, which takes only a few seconds on my little laptop.
Finally we return the model's `best_score`, which is the validation RMSE at the best boosting round.

In [6]:
def objective(trial):
    params = {
        'tree_method': trial.suggest_categorical('tree_method', ['approx', 'hist']),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 250),
        'subsample': trial.suggest_float('subsample', 0.1, 1.0),
        'colsample_bynode': trial.suggest_float('colsample_bynode', 0.1, 1.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.001, 25, log=True),
        'learning_rate': trial.suggest_float('learning_rate', 0.05, 0.5, log=True),
    }
    num_boost_round = 500
    params.update(base_params)
    pruning_callback = optuna.integration.XGBoostPruningCallback(trial, f'valid-{metric}')
    model = xgb.train(params=params, dtrain=dtrain, num_boost_round=num_boost_round,
                      evals=[(dtrain, 'train'), (dvalid, 'valid')],
                      early_stopping_rounds=50,
                      verbose_eval=0,
                      callbacks=[pruning_callback])

    return model.best_score


To create a new optuna study and search through 50 parameter combinations, you could just run these two lines.

```python
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=50)
```

But, in practice, I prefer to run these potentially long-running tasks for a pre-specified amount of clock time rather than a specified number of trials (who knows how long 50 trials will take?).
So, to run the optimization for around 600 seconds (long enough to go make a nice cup of tea, stretch, and come back), I do something like this:

In [7]:
#| output: false
study = optuna.create_study(direction='minimize')
tic = time.time()
while time.time() - tic < 600:
    study.optimize(objective, n_trials=1)

[I 2023-12-18 12:26:24,038] A new study created in memory with name: no-name-ad01e2fe-3ccd-490e-b492-54447aa24eb7
[I 2023-12-18 12:26:38,732] Trial 0 finished with value: 0.22716975344231757 and parameters: {'tree_method': 'hist', 'max_depth': 7, 'min_child_weight': 47, 'subsample': 0.644239586943823, 'colsample_bynode': 0.6324287447379262, 'reg_lambda': 0.008813955729768966, 'learning_rate': 0.12041680713201537}. Best is trial 0 with value: 0.22716975344231757.
[I 2023-12-18 12:26:42,022] Trial 1 finished with value: 0.25134318137258616 and parameters: {'tree_method': 'hist', 'max_depth': 7, 'min_child_weight': 22, 'subsample': 0.3972059827740082, 'colsample_bynode': 0.44287882577128346, 'reg_lambda': 0.002971964781260193, 'learning_rate': 0.4677686278959682}. Best is trial 0 with value: 0.22716975344231757.
[I 2023-12-18 12:26:49,585] Trial 2 finished with value: 0.23955581193610404 and parameters: {'tree_method': 'hist', 'max_depth': 10, 'min_child_weight': 159, 'subsample': 0.53978

In [8]:
print('Stage 1 ==============================')
print(f'best score = {study.best_trial.value}')
print('best params --------------------------')
for k, v in study.best_trial.params.items():
    print(k, ':', v)

best score = 0.22716975344231757
best params --------------------------
tree_method : hist
max_depth : 7
min_child_weight : 47
subsample : 0.644239586943823
colsample_bynode : 0.6324287447379262
reg_lambda : 0.008813955729768966
learning_rate : 0.12041680713201537


If we are happy with this result, we can go ahead and train a final model on the training+validation set using these parameter values, setting the number of boosting rounds to the best iteration of the best trial.
If we decide we want to tune the tree parameters a little more, we can just call `study.optimize(...)` again, adding as many trials as we want.
Once we're happy with the tree parameters, if we want more accuracy and are willing to accept longer training time and a bigger model, we can proceed to stage 2.

## Stage 2: Tune Boosting Parameters via Early Stopping

Now we take the optimal tree parameters that we found in stage 1, and we train a new model with a fixed low learning rate&mdash;here I use 0.01, but you could go lower&mdash;and a large number of boosting rounds.
The lower your learning rate, the better your performance (with diminishing returns), but the more boosting rounds you'll need before the evaluation metric on the validation data stops improving.

In [9]:
#| output: false
params = {}
params.update(base_params)
params.update(study.best_trial.params)
params['learning_rate'] = 0.01
model_stage2 = xgb.train(params=params, dtrain=dtrain, 
                         num_boost_round=10000,
                         evals=[(dtrain, 'train'), (dvalid, 'valid')],
                         early_stopping_rounds=50,
                         verbose_eval=0)

In [10]:
print('Stage 2 ==============================')
print(f'best score = {score_model(model_stage2, dvalid)}')
best_iteration = model_stage2.best_iteration
print(f'best iteration = {best_iteration}')

best score = 0.22271610796451569
best iteration = 6115


## Train the Final Model and Evaluate on Test Data

Now we can train our final model on the combined training and validation datasets using the optimal tree parameters from stage 1 and the fixed learning rate and optimal boosting rounds from stage 2.
Then we evaluate on the held out test data.

In [11]:
#| output: false
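params['learning_rate'] = 0.01
# `best_iteration` is the zero-based index of the best stage 2 round, so this
# retrains with essentially the optimal number of rounds (off by one tree)
model_final = xgb.train(params=params, dtrain=dtrainvalid, 
                        num_boost_round=best_iteration,
                        evals=[(dtrain, 'train')],
                        verbose_eval=0)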

In [12]:
print('Final Model ==========================')
print(f'test score = {score_model(model_final, dtest)}')
print('parameters ---------------------------')
for k, v in params.items():
    print(k, ':', v)
print(f'num_boost_round: {best_iteration}')

test score = 0.21926622092723846
parameters ---------------------------
objective : reg:squarederror
eval_metric : rmse
tree_method : hist
max_depth : 7
min_child_weight : 47
subsample : 0.644239586943823
colsample_bynode : 0.6324287447379262
reg_lambda : 0.008813955729768966
learning_rate : 0.01
num_boost_round: 6115


Back in the [regression post](/posts/xgboost-for-regression-in-python/) 
we got an RMSE of about 0.231 just using default parameter values, which put us in about 5th place on the [leaderboard for the Kaggle dozers competition](https://www.kaggle.com/competitions/bluebook-for-bulldozers/leaderboard).
Now with less than 15 minutes of hyperparameter tuning, our RMSE is down to 0.219 which puts us in 1st place by a huge margin.

## Wrapping Up

There it is, an efficient and ridiculously easy hyperparameter tuning strategy for XGBoost using optuna.
If you found this helpful, if you have questions, or if you have your own preferred method for parameter search, let me know about it down in the comments!