# CatBoost Hyperparameter Tuning

### Introduction

In this lesson, we'll work through different hyperparameters that are available to tune in catboost.  As we'll see, catboost is quite sensitive to hyperparameter tuning.  By the end of this lesson, we'll see an improvement in our score from .35 to .52 -- roughly 1.5 times our original score.  Also, we'll see that many of the hyperparameters that we can tune are similar to those available in a random forest.  Let's get started.

### Loading Data

Let's load up our data, and get it ready to train a catboost regressor.

In [17]:
import pandas as pd

df = pd.read_csv('./imdb_movies.csv')
df_sorted = df.sort_values(['year', 'month'])
X = df_sorted.iloc[:, 1:-1]
y = df_sorted['revenue']

genre = X['genre']
genre_filled = genre.fillna(-999).astype('category')
X_updated = X.assign(genre = genre_filled)

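So after replacing our missing values, and changing the type to category, we can split our data.  We set `shuffle = False` so that the rows stay in date order, with the validation and test sets coming after the training set.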

In [18]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_updated, y, shuffle = False, test_size = .2)
X_validate, X_test, y_validate, y_test = train_test_split(X_test, y_test, shuffle = False, test_size = .5)
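To see what the two unshuffled splits above produce, here is a minimal sketch in plain Python.  The `ordered_split` helper is a hypothetical stand-in for `train_test_split(..., shuffle = False)`, and the 100 integers stand in for movies sorted by date; it only illustrates the resulting proportions and ordering.

```python
import math

def ordered_split(rows, test_size):
    """Split rows into (head, tail) without shuffling, so the tail comes last."""
    n_tail = math.ceil(len(rows) * test_size)
    cut = len(rows) - n_tail
    return rows[:cut], rows[cut:]

rows = list(range(100))                       # stand-in for 100 movies in date order
train, holdout = ordered_split(rows, 0.2)     # first 80% for training
validate, test = ordered_split(holdout, 0.5)  # remaining 20% split in half

print(len(train), len(validate), len(test))
print(train[-1], validate[0], validate[-1], test[0])
```

So we end up with roughly an 80/10/10 split, where every validation row comes after every training row, and every test row comes after every validation row.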

Then we identify our categorical columns and create our Pools for the training, validation, and test sets.

In [19]:
import numpy as np
cat_indices = np.where(X.dtypes == object)[0]
cat_indices

array([0])

In [20]:
from catboost import CatBoostRegressor, Pool

In [21]:
train_pool = Pool(X_train, 
                  y_train, 
                  cat_features=[0])

validate_pool = Pool(X_validate, 
                  y_validate, 
                  cat_features=[0])

test_pool = Pool(X_test, 
                  y_test, 
                  cat_features=[0])

Finally, let's train an initial model to check if we need any further modifications to our data.

In [22]:
cbr = CatBoostRegressor(logging_level = 'Silent')
cbr.fit(train_pool)

<catboost.core.CatBoostRegressor at 0x125000590>

### Tuning Hyperparameters

Now there are a number of hyperparameters we can set with catboost.  Many of them have close counterparts in what we saw with random forests (the catboost name is listed first, then the random forest name).

* `iterations`: `n_estimators`
* `min_child_samples`: `min_samples_leaf`
* `colsample_bylevel`: `max_features`

Let's go through these parameters to get familiar with the process of tuning our hyperparameters.  We can view even more hyperparameters by visiting the [parameters in the documentation](https://catboost.ai/docs/concepts/python-reference_parameters-list.html).

We can specify an initial set of hyperparameters as a dictionary.

In [12]:
params = {
    'iterations': 500,
    'learning_rate': 0.1,
    'min_child_samples': 7,
    'eval_metric': 'RMSE',
    'random_seed': 42,
    'logging_level': 'Silent'
}

And we can use the `**` operator to unpack the dictionary into keyword arguments for our regressor.

In [13]:
cbr_2 = CatBoostRegressor(**params)
cbr_2.fit(train_pool)

<catboost.core.CatBoostRegressor at 0x12624fbd0>

In [14]:
cbr_2.score(validate_pool)

0.35573315143164774
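The number returned by `score` on a catboost regressor is the $R^2$ (coefficient of determination).  As a reminder of what that measures, here is a small sketch of the standard formula in plain Python.  The `r_squared` function and the toy numbers are for illustration only, and are not catboost's own code.

```python
def r_squared(y_true, y_pred):
    """1 minus (squared error of the predictions / squared error of predicting the mean)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, y_true)       # predictions match exactly
baseline = r_squared(y_true, [2.5] * 4)   # always predict the mean
print(perfect, baseline)
```

So a score of roughly .36 means our model's squared error is about 36 percent lower than if we always predicted the mean revenue.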

### Hyperparameter Search

Now let's begin the process of trying out different hyperparameters.  Because training a catboost model can be time consuming, there are a couple of techniques that can help speed up training time.

The first is setting `task_type = 'GPU'`.  Now, this will only work if a CUDA-capable Nvidia GPU and its driver are present on the computer (which for a Macbook Pro they are not).  But in Google Colab, we can go to `Runtime > Change Runtime Type` and select a GPU.

Ok, now let's begin with hyperparameter tuning, starting with `max_depth`.

1. Tuning max_depth

We perform hyperparameter tuning the same way we did previously.  We loop through different values of our hyperparameter and evaluate the score.  Catboost doesn't allow max depth to exceed 16, so we'll try values between 5 and 15 going by 2.

In [23]:
max_depths = list(range(5, 16, 2))
max_depths

[5, 7, 9, 11, 13, 15]

> Then we train the model at each of the values.

In [24]:
model_depths = [CatBoostRegressor(iterations=200,
                                  max_depth=max_depth, 
                                  logging_level = 'Silent').fit(train_pool) 
                for max_depth in max_depths]

In [25]:
pd.Series(index = max_depths, data = [model.score(validate_pool) for model in model_depths])

5     0.312708
7     0.353376
9     0.421708
11    0.454850
13    0.470696
15    0.441350
dtype: float64

We see that the top score is at a `max_depth` of 13, so let's also try 12 and 14.
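This "coarse pass, then check the neighbors" step can be written down as a small helper.  The sketch below uses the validation scores printed above; `neighbours_to_try` is a hypothetical name, and this is just one simple way to pick the next values.

```python
# Validation scores from the coarse pass over max_depth (copied from the output above).
depth_scores = {5: 0.312708, 7: 0.353376, 9: 0.421708,
                11: 0.454850, 13: 0.470696, 15: 0.441350}

def neighbours_to_try(scores, step=1):
    """Return the best value so far and its untested neighbours on either side."""
    best = max(scores, key=scores.get)
    candidates = [v for v in (best - step, best + step) if v not in scores]
    return best, candidates

best_depth, next_depths = neighbours_to_try(depth_scores)
print(best_depth, next_depths)
```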

In [26]:
model_depths = [CatBoostRegressor(iterations=200,
                                  max_depth=max_depth, 
                                  logging_level = 'Silent').fit(train_pool) 
                for max_depth in [12, 14]]

In [28]:
[model.score(validate_pool) for model in model_depths]

[0.48008950374793147, 0.44431330627021737]

So we see that the score with `max_depth` at 12 (about .48) is slightly higher than at 13 (about .47), so we'll go with that.

2. Tuning min_child_samples

Because catboost builds symmetric (balanced) trees by default, there is not as strong a benefit to working with `min_child_samples` as in random forests, which can have uneven splits.  So let's just try a few values to see if there is any benefit.  

In [29]:
min_samples = list(range(3, 13, 4))
min_samples

[3, 7, 11]

In [31]:
model_min_samples = [CatBoostRegressor(iterations=200,
                                  max_depth=12, 
                                  min_child_samples = min_sample,
                                  logging_level = 'Silent').fit(train_pool) 
                for min_sample in min_samples]


In [32]:
pd.Series(index = min_samples, data = [model.score(validate_pool) for model in model_min_samples])

3     0.48009
7     0.48009
11    0.48009
dtype: float64

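> So we can see that `min_child_samples` does not have an impact on this model.  This is expected: catboost only applies this hyperparameter with the `Depthwise` and `Lossguide` grow policies, and here we are using the default symmetric trees.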
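It may help to picture what a symmetric tree looks like.  In a symmetric (also called oblivious) tree, every node on the same level uses the same split, so a tree of depth `d` always has `2 ** d` leaves, and an observation's leaf is just the combination of its yes/no answers.  The sketch below illustrates the idea; the feature names and thresholds are made up, and this is not catboost's internal code.

```python
def leaf_index(row, level_splits):
    """Return the leaf of a symmetric tree: one bit per level, set when the row passes that split."""
    index = 0
    for level, (feature, threshold) in enumerate(level_splits):
        if row[feature] > threshold:
            index |= 1 << level
    return index

# Hypothetical splits: the same question is asked of every row at each level.
splits = [('budget', 50), ('runtime', 120)]
n_leaves = 2 ** len(splits)

print(n_leaves)
print(leaf_index({'budget': 80, 'runtime': 90}, splits))
print(leaf_index({'budget': 10, 'runtime': 150}, splits))
```

Because each level's split applies across the whole tree, no single branch can keep splitting on its own, which is the situation that `min_samples_leaf` guards against in a random forest.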

3. Tuning colsample_bylevel

Next, we can try to tune the `colsample_bylevel` hyperparameter.  Remember that this hyperparameter randomly selects a subsample of features before each split.   
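As a rough picture of what sampling features before a split means, here is a minimal sketch in plain Python.  The `sample_features` helper, the rounding rule, and the feature names are assumptions for illustration; catboost's own sampling logic may differ in its details.

```python
import random

def sample_features(feature_names, fraction, rng):
    """Pick a random subset of features to consider for one split."""
    k = max(1, round(fraction * len(feature_names)))  # always keep at least one feature
    return rng.sample(feature_names, k)

rng = random.Random(0)                     # seeded so the sketch is repeatable
features = [f'f{i}' for i in range(10)]    # hypothetical feature names
candidates = sample_features(features, 0.3, rng)
print(candidates)
```

With a fraction of .3 and ten features, each split only gets to choose among three of them, which makes the individual trees less alike.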

In [33]:
import numpy as np
col_sample_pcts = np.linspace(0.1, 1, 10)
col_sample_pcts

array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])

In [34]:

model_sample_pcts = [CatBoostRegressor(iterations=200,
                                  max_depth=12, 
                                  colsample_bylevel = pct,
                                  logging_level = 'Silent').fit(train_pool) 
                for pct in col_sample_pcts]


In [35]:
pd.Series(index = col_sample_pcts, data = [model.score(validate_pool) for model in model_sample_pcts])

0.1    0.501423
0.2    0.470952
0.3    0.426782
0.4    0.457420
0.5    0.496440
0.6    0.483795
0.7    0.471304
0.8    0.486244
0.9    0.429902
1.0    0.470696
dtype: float64

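> Here our ideal choice could be either .1 or .5.  Because the scores bounce around from one value to the next, let's train multiple models with different random seeds at each of these values.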

1. Sample at $.1$

> First, we'll try multiple attempts at setting a `colsample_bylevel` at .1.

In [43]:
random_models_one_tenth = [CatBoostRegressor(iterations=1000, max_depth=12,  
                                colsample_bylevel = .1, random_state = i,
                                logging_level = 'Silent').fit(train_pool) for i in range(5)]

In [44]:
one_tenth_scores = [model.score(validate_pool) for model in random_models_one_tenth]
one_tenth_scores

[0.5064435852154457,
 0.49885520382008475,
 0.4794802112328914,
 0.5152191341463006,
 0.45917077043233556]

In [45]:
one_tenth = pd.Series(one_tenth_scores)
one_tenth.mean(), one_tenth.std()

(0.49183378096941155, 0.02252279092798799)

> Then let's train multiple models with `colsample_bylevel` at `.5`.

In [46]:
random_models_one_half = [CatBoostRegressor(iterations=1000, max_depth=12,  
                                colsample_bylevel = .5, random_state = i,
                                logging_level = 'Silent').fit(train_pool) for i in range(5)]

In [47]:
one_half_scores = [model.score(validate_pool) for model in random_models_one_half]
one_half_scores[:4]

[0.4715609966945278,
 0.4456166963451539,
 0.4766053657585869,
 0.48250848881947983]

In [48]:
one_half = pd.Series(one_half_scores)
one_half.mean(), one_half.std()

(0.46965481592254266, 0.014145759347158896)

So it appears that using `colsample_bylevel = .1` provides the higher average score (about .49 versus .47), although its scores also vary more from seed to seed.
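With only five seeds per setting, it is worth checking how large the gap is relative to the noise.  The sketch below recomputes the mean and standard deviation for `.1` from the scores listed above, and takes the summary for `.5` as reported.  The use of five models for `.5` is an assumption, and the comparison is a rough back-of-the-envelope check rather than a formal test.

```python
import math
import statistics

# Validation scores for colsample_bylevel = .1, copied from the output above.
one_tenth_scores = [0.5064435852154457, 0.49885520382008475, 0.4794802112328914,
                    0.5152191341463006, 0.45917077043233556]
mean_a = statistics.mean(one_tenth_scores)
std_a = statistics.stdev(one_tenth_scores)   # sample standard deviation, like pandas

# Summary for colsample_bylevel = .5 as reported above (assuming five models).
mean_b, std_b, n_b = 0.46965481592254266, 0.014145759347158896, 5

# Standard error of the difference between the two means.
std_err = math.sqrt(std_a ** 2 / len(one_tenth_scores) + std_b ** 2 / n_b)
gap_in_std_errs = (mean_a - mean_b) / std_err
print(mean_a, std_a, gap_in_std_errs)
```

The gap comes out to a bit under two standard errors, so `.1` looks better, but more seeds would make the comparison more convincing.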

### Tuning L2 Leaf Regularization

Now L2 regularization is a hyperparameter that we have not seen before with tree based models.  The basic idea is that there is increased variance as our trees become more complex -- as they have more splits, and so fewer observations in each leaf.  The `l2_leaf_reg` hyperparameter is the coefficient of an L2 penalty on the leaf values in the cost function.  The larger the value, the more the leaf values are shrunk toward zero, so a split needs more support in the data before it can make a large change to the predictions.  The strength of this penalty is determined by the hyperparameter.
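One way to picture the effect of an L2 penalty on leaf values is the textbook estimate for squared error loss, where a leaf's value is the sum of its residuals divided by the number of observations plus the penalty.  The sketch below uses that simplified formula with made-up residuals; it is an illustration of the idea, and not catboost's exact internal calculation.

```python
def leaf_value(residuals, l2_leaf_reg):
    """Textbook L2-penalized leaf estimate under squared error loss."""
    return sum(residuals) / (len(residuals) + l2_leaf_reg)

small_leaf = [10.0] * 2      # hypothetical leaf with only 2 observations
large_leaf = [10.0] * 200    # hypothetical leaf with 200 observations

for reg in [0, 3, 9]:
    print(reg, round(leaf_value(small_leaf, reg), 2), round(leaf_value(large_leaf, reg), 2))
```

With no penalty, both leaves predict 10.  As the penalty grows, the small leaf is pulled strongly toward zero while the large leaf barely moves, so patterns seen in only a few observations have less influence.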

In [52]:
l2_vals = list(range(1, 12, 2))
l2_vals

[1, 3, 5, 7, 9, 11]

In [53]:
l2_models = [CatBoostRegressor(iterations=1000, max_depth=12,  
                                colsample_bylevel = .1,
                                logging_level = 'Silent', l2_leaf_reg = val).fit(train_pool) for val in l2_vals]

In [55]:
[model.score(validate_pool) for model in l2_models]

[0.5210297767132677,
 0.5195998473448362,
 0.5285313071922895,
 0.5272492438928775,
 0.5297997737844614,
 0.5264380609616861]

Here, the scores are highest for values from 5 through 11, with the top score at 9.  Let's try higher numbers.

In [56]:
l2_models = [CatBoostRegressor(iterations=1000, max_depth=12,  
                                colsample_bylevel = .1,
                                logging_level = 'Silent', l2_leaf_reg = val).fit(train_pool) for val in [13, 15, 17]]

In [57]:
[model.score(validate_pool) for model in l2_models]

[0.5246821966369193, 0.5202944091370997, 0.5210612140544018]

It looks like we see a peak right around 9.  Let's now just try values 8 and 10.

In [58]:
l2_models_eight_ten = [CatBoostRegressor(iterations=1000, max_depth=12,  
                                colsample_bylevel = .1,
                                logging_level = 'Silent', l2_leaf_reg = val).fit(train_pool) for val in [8, 10]]

In [59]:
[model.score(validate_pool) for model in l2_models_eight_ten]

[0.5273739387034617, 0.5238129670744792]

So we can see that the top value is when we had `l2_leaf_reg = 9`.

### Setting the Learning Rate

Ok, now it's time to work with setting the learning rate.  The general rule is that the lower the learning rate, the more iterations we'll need to converge.  To prevent our model from overfitting, we can use the overfitting detector, which will stop training when adding more trees no longer improves the score on the validation set.

> To use the overfitting detector we need to pass through the validation pool when we fit the model.  And we have to specify our overfitting detector -- here `od_type = 'Iter'`, and the `od_wait = 40` means we stop training if there is no improvement for 40 trees.
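The logic of an `Iter` style detector can be sketched in a few lines of plain Python: keep track of the best validation loss so far, and stop once `wait` iterations have passed without beating it.  The `stop_iteration` function and the loss values are made up for illustration, and this is a simplified picture rather than catboost's implementation.

```python
def stop_iteration(validation_losses, wait):
    """Return (best_iteration, last_iteration_run) under a patience rule."""
    best_loss = float('inf')
    best_iter = -1
    for i, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_iter = loss, i   # new best, so the wait starts over
        elif i - best_iter >= wait:
            return best_iter, i              # no improvement for `wait` iterations
    return best_iter, len(validation_losses) - 1

# Hypothetical validation losses, one per tree added.
losses = [10, 8, 7, 6.5, 6.6, 6.7, 6.55, 6.8, 6.9]
best, stopped = stop_iteration(losses, wait=3)
print(best, stopped)
```

Here the best loss occurs at iteration 3, and training stops at iteration 6 after three iterations without an improvement.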

In [61]:
regressor_learn = CatBoostRegressor(iterations=5000, learning_rate = .01,
                                max_depth=12, l2_leaf_reg = 9,
                                colsample_bylevel = .1, od_type='Iter', od_wait = 40,
                                logging_level = 'Silent').fit(train_pool, eval_set = validate_pool)

Now we can look at the best score and best iteration.

In [62]:
regressor_learn.best_score_

{'learn': {'RMSE': 135963397.37969565},
 'validation': {'RMSE': 186490337.30068162}}

In [63]:
regressor_learn.best_iteration_

2851

In [65]:
regressor_learn.score(validate_pool)

0.5274538836079761

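> Now that we found the best score for a learning rate of .01, let's cut the learning rate in half.  Because a smaller learning rate generally needs more trees, we raise the cap on iterations to 6000, roughly double the best iteration of 2851.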
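One way to arrive at a cap like this is a simple rule of thumb: scale the best iteration by the ratio of the old learning rate to the new one, and round up.  The `scaled_iterations` helper below is hypothetical, and the rule is only a starting point, since the number of trees actually needed can turn out quite different.

```python
import math

def scaled_iterations(best_iteration, old_lr, new_lr, round_to=1000):
    """Rule-of-thumb iteration cap after changing the learning rate, rounded up."""
    estimate = best_iteration * old_lr / new_lr
    return math.ceil(estimate / round_to) * round_to

new_cap = scaled_iterations(2851, old_lr=0.01, new_lr=0.005)
print(new_cap)
```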

In [66]:
regressor_smaller_learn = CatBoostRegressor(iterations=6000, learning_rate = .005,
                                max_depth=12, l2_leaf_reg = 9,
                                colsample_bylevel = .1, od_type='Iter', od_wait = 40,
                                logging_level = 'Silent').fit(train_pool, eval_set = validate_pool)

In [67]:
regressor_smaller_learn.best_score_

{'learn': {'RMSE': 141581067.4139003},
 'validation': {'RMSE': 188935636.04809418}}

In [69]:
regressor_smaller_learn.best_iteration_

3407

In [70]:
regressor_smaller_learn.tree_count_

3408

In [71]:
regressor_smaller_learn.score(validate_pool)

0.5149804002794596

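Here we see a slightly smaller score (about .515 versus .527).  This may be due to random variation in our trees, or to the overfitting detector stopping training early (at iteration 3407 of 6000), since each tree contributes a smaller improvement at a lower learning rate.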

### Summary

In this lesson, we learned about hyperparameter tuning for the catboost regressor.  We saw that some of the parameters are pretty similar to what we saw with a random forest regressor.

* `iterations`: `n_estimators`
* `min_child_samples`: `min_samples_leaf`
* `colsample_bylevel`: `max_features`

And we tune our hyperparameters in the same way: we loop through different values and assess the score on the validation set.  

Then we explored some hyperparameters that were a bit different.  The first is `l2_leaf_reg`, which penalizes large leaf values, so a split needs sufficient support in the data before it can move the predictions much.  This is another mechanism for balancing the bias-variance tradeoff: adding splits increases the variance, while restraining them limits the tree's ability to find a potential pattern in the data.

The second hyperparameter we worked with was the learning rate, together with the number of trees.  We were able to determine the number of trees to use for an initial learning rate through the overfitting detector, by setting hyperparameters `od_type='Iter', od_wait = 40`.  And then we used that number to choose a larger cap on iterations when we cut our learning rate in half.

### Resources

[Catboost parameter docs](https://catboost.ai/docs/concepts/python-reference_parameters-list.html)

[Catboost Description and Hyperparameters](https://towardsdatascience.com/https-medium-com-talperetz24-mastering-the-new-generation-of-gradient-boosting-db04062a7ea2)

[Catboost Tutorial](https://colab.research.google.com/github/catboost/tutorials/blob/master/python_tutorial.ipynb)