# Hyperparameter Tuning

### Getting set up

Before getting started, go to `Runtime` in Google Colab and change the runtime type to `GPU`.  Let's also install `catboost` in Colab.

In [8]:
!pip install catboost

Collecting catboost
  Downloading https://files.pythonhosted.org/packages/b2/aa/e61819d04ef2bbee778bf4b3a748db1f3ad23512377e43ecfdc3211437a0/catboost-0.23.2-cp36-none-manylinux1_x86_64.whl (64.8MB)
Installing collected packages: catboost
Successfully installed catboost-0.23.2


Now let's load our bulldozers data, which has already been pre-processed.

In [2]:
import pandas as pd
url = "https://raw.githubusercontent.com/jigsawlabs-student/gradient-boosting/master/bulldozers_sorted.csv"
df_sorted = pd.read_csv(url, index_col = 0)
df_sorted[:2]

| | SalesID | SalePrice | MachineID | ModelID | datasource | auctioneerID | YearMade | MachineHoursCurrentMeter | UsageBand | fiModelDesc | ... | saleDay | saleDayofweek | saleDayofyear | saleIs_month_end | saleIs_month_start | saleIs_quarter_end | saleIs_quarter_start | saleIs_year_end | saleIs_year_start | saleElapsed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7648 | 1165000 | 27000.0 | 733687 | 7057 | 121 | 3.0 | 1995 | 4368.0 | Medium | 312 | ... | 5 | 0 | 5 | False | False | False | False | False | False | 1073260800 |
| 8228 | 1166933 | 10750.0 | 1035166 | 8861 | 121 | 3.0 | 2002 | 603.0 | Low | 803 | ... | 9 | 4 | 9 | False | False | False | False | False | False | 1073606400 |


Looking at our data, our sale date column has already been split into multiple numerical columns (`saleDay`, `saleDayofweek`, and so on).  The only features that are not numeric are our categorical features, which is what we want.

Now let's separate our data into the features, $X$, and the target $y$.

In [3]:
X = df_sorted.drop('SalePrice', axis = 1)
y = df_sorted['SalePrice']

In [8]:
'SalePrice' in X.columns 

# False

False

According to the [catboost documentation](https://catboost.ai/docs/concepts/speed-up-training.html), one way to speed up training time is to change our object dtypes to be categorical.  So let's do that.

> First we can select our features of type object.

In [4]:
object_df = None
object_df[:2]

# 	UsageBand	fiModelDesc	fiBaseModel	fiSecondaryDesc	fiModelSeries	fiModelDescriptor	ProductSize	fiProductClassDesc	state	ProductGroup	...	Undercarriage_Pad_Width	Stick_Length	Thumb	Pattern_Changer	Grouser_Type	Backhoe_Mounting	Blade_Type	Travel_Controls	Differential_Type	Steering_Controls
# 7648	Medium	312	312	-999	-999	-999	Small	Hydraulic Excavator, Track - 12.0 to 14.0 Metr...	New York	TEX	...	24 inch	9' 10"	Manual	None or Unspecified	Triple	-999	-999	-999	-999	-999
# 8228	Low	803	803	-999	-999	-999	Mini	Hydraulic Excavator, Track - 2.0 to 3.0 Metric...	Georgia	TEX	...	None or Unspecified	None or Unspecified	None or Unspecified	None or Unspecified	Double	-999	-999	-999	-999	-999
# 2 rows × 44 columns

| | UsageBand | fiModelDesc | fiBaseModel | fiSecondaryDesc | fiModelSeries | fiModelDescriptor | ProductSize | fiProductClassDesc | state | ProductGroup | ... | Undercarriage_Pad_Width | Stick_Length | Thumb | Pattern_Changer | Grouser_Type | Backhoe_Mounting | Blade_Type | Travel_Controls | Differential_Type | Steering_Controls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7648 | Medium | 312 | 312 | -999 | -999 | -999 | Small | Hydraulic Excavator, Track - 12.0 to 14.0 Metr... | New York | TEX | ... | 24 inch | 9' 10" | Manual | None or Unspecified | Triple | -999 | -999 | -999 | -999 | -999 |
| 8228 | Low | 803 | 803 | -999 | -999 | -999 | Mini | Hydraulic Excavator, Track - 2.0 to 3.0 Metric... | Georgia | TEX | ... | None or Unspecified | None or Unspecified | None or Unspecified | None or Unspecified | Double | -999 | -999 | -999 | -999 | -999 |
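One possible way to fill in `object_df` is pandas' `select_dtypes`.  The sketch below runs on a small made-up frame (`toy` and its values are hypothetical, not the bulldozers data); with the lesson's data, the same call would be made on `X`.

```python
import pandas as pd

# Hypothetical stand-in for X. Object dtype is forced on the text columns so
# the sketch behaves the same across pandas versions.
toy = pd.DataFrame({
    "YearMade": [1995, 2002, 1998],
    "UsageBand": pd.Series(["Medium", "Low", "High"], dtype=object),
    "state": pd.Series(["New York", "Georgia", "Texas"], dtype=object),
})

# Keep only the columns whose dtype is object.
toy_object_df = toy.select_dtypes(include="object")
print(list(toy_object_df.columns))
```

The numeric `YearMade` column is dropped, and only the two text columns remain.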


And from there let's set the object columns to type category.

In [5]:
cat_df = object_df.astype('category')


> We confirm that all of our columns are now of type category.

In [7]:
(cat_df.dtypes == 'category').all()
# True

True

Now let's replace our object columns in X with the columns in our `cat_df`.

In [None]:
X_with_cat = X.copy() 
# replace object columns with cat columns

In [None]:
(X_with_cat.dtypes == 'object').any()
# False
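The replacement step could be sketched as follows.  This is one possible approach rather than the lesson's official solution, and the `toy_` names and data are hypothetical.  Assigning one column at a time keeps each column's categorical dtype.

```python
import pandas as pd

# Hypothetical stand-in for X, with object dtype forced on the text columns.
toy_X = pd.DataFrame({
    "YearMade": [1995, 2002, 1998],
    "UsageBand": pd.Series(["Medium", "Low", "High"], dtype=object),
    "state": pd.Series(["New York", "Georgia", "Texas"], dtype=object),
})
toy_cat_df = toy_X.select_dtypes(include="object").astype("category")

# Copy first so the original frame keeps its object columns, then overwrite
# each matching column with its categorical version.
toy_X_with_cat = toy_X.copy()
for col in toy_cat_df.columns:
    toy_X_with_cat[col] = toy_cat_df[col]

print((toy_X_with_cat.dtypes == "object").any())
```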

Then let's split our data into training, validation and test sets.  We set `shuffle = False` as we want our datasets ordered by time.  We'll have an 80-10-10 split.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_with_cat, y, test_size = .2, shuffle = False)
X_validate, X_test, y_validate, y_test = train_test_split(X_test, y_test, test_size = .5, shuffle = False)
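To check that two calls with `shuffle = False` really give a time-ordered 80-10-10 split, we can run the same pattern on made-up data.  In this sketch the 100 rows and the `toy_` names are assumptions for illustration; each row's value is simply its position in time.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical time-ordered data: the single feature is the row's position.
toy_X = np.arange(100).reshape(-1, 1)
toy_y = np.arange(100)

# First cut: the earliest 80% for training, the latest 20% held out.
X_tr, X_hold, y_tr, y_hold = train_test_split(
    toy_X, toy_y, test_size=0.2, shuffle=False)
# Second cut: split the held-out 20% in half for validation and test.
X_val, X_te, y_val, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, shuffle=False)

print(len(X_tr), len(X_val), len(X_te))
print(X_tr.max(), X_val.min(), X_val.max(), X_te.min())
```

Every training row comes before every validation row, and every validation row comes before every test row, so the model is never validated on data older than what it trained on.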

> Let's again find the categorical columns.  

In [19]:
import numpy as np
# np.object was removed in recent NumPy releases; the builtin object works everywhere.
cal_col_idcs = np.where(X.dtypes == object)[0]
cal_col_idcs

# array([ 7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
#        24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
#        41, 42, 43, 44, 45, 46, 47, 48, 49, 50])

array([ 7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
       24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
       41, 42, 43, 44, 45, 46, 47, 48, 49, 50])

And let's create our data pools.

In [None]:
from catboost import Pool, CatBoostRegressor

train_pool = Pool(X_train, y_train, cat_features = cal_col_idcs)
validate_pool = Pool(X_validate, y_validate, cat_features = cal_col_idcs)
test_pool = Pool(X_test, y_test, cat_features = cal_col_idcs)

Finally, let's train a regressor model.  We can do so with the following parameters.  

> Notice that we *splat* our parameters by using the double star.

In [21]:
params = {'iterations': 200, 'depth':6,  
          'logging_level':'Silent', 
          'random_seed':42, "task_type":"GPU"}

cbr = CatBoostRegressor(**params).fit(train_pool)


cbr.score(validate_pool)

0.7770237053135296

### Tuning Hyperparameters

Now it's time to begin working with our hyperparameters.  When setting the hyperparameters below, be sure to always set the following values:

* `iterations`: 200
* `logging_level`: 'Silent'
* `task_type`: 'GPU'

1. Max depth 

Let's start with max depth.  The maximum value that catboost allows for `max_depth` is 16.  Let's try the even values from 4 to 12.  

In [8]:
max_depths = None
max_depths

# [4, 6, 8, 10, 12]

[4, 6, 8, 10, 12]

In [22]:

model_depths = None

1 loop, best of 3: 1min 4s per loop


Next, let's check the validation scores on each of the models we trained above.

In [24]:
depth_scores = None
depth_scores[:2]

# [0.7659049418450385, 0.7775122482929486]

[0.7659049418450385, 0.7775122482929486]

Initialize a series with the labels as the `max_depth` values and the values as the corresponding validation scores.

In [26]:
max_depth_series = None
max_depth_series

# 4     0.765905
# 6     0.777512
# 8     0.783817
# 10    0.760517
# 12    0.751324
# dtype: float64

4     0.765905
6     0.777512
8     0.783817
10    0.760517
12    0.751324
dtype: float64
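Once the scores are in a series indexed by depth, pandas can pick out the best depth for us.  This sketch copies the rounded scores from the output above; the variable names are hypothetical.

```python
import pandas as pd

# Depths tried and the validation scores reported in the output above.
depths = [4, 6, 8, 10, 12]
scores = [0.765905, 0.777512, 0.783817, 0.760517, 0.751324]

# Index by depth so each score is labelled with the setting that produced it.
series = pd.Series(scores, index=depths)

# idxmax returns the index label (here, the depth) of the highest score.
best_depth = series.idxmax()
print(best_depth)
```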

We can see that the validation score peaks at a `max_depth` of 8.  Let's set that as our hyperparameter value going forward.

2. Column Sample by Level

Now let's move on to `colsample_bylevel`.  Remember that this allows us to specify the percentage of features that are considered on each split of the decision tree.  

For catboost, we can pass through any value $x$ with $0 < x \leq 1$.  So let's initialize a list of ten fractions from one tenth to one, and see how they perform.

In [28]:
import numpy as np
col_sample_pcts = None
col_sample_pcts

# array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])

array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
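One way to build these fractions is `np.linspace`, which includes both endpoints, so the last value is exactly 1 and stays inside the allowed range.  This is just one option, and the name `pcts` is hypothetical.

```python
import numpy as np

# Ten evenly spaced values from 0.1 to 1.0, both endpoints included.
pcts = np.linspace(0.1, 1.0, 10)
print(pcts)
```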

An alias for `colsample_bylevel` is the random subspace method, or `rsm`.  Unfortunately, `rsm` is only available on a CPU.  So set the `task_type` to `CPU` to tune this hyperparameter.

In [None]:
model_sample_pcts = None

In [32]:
sample_pct_scores = None
sample_pct_scores

# [0.8132362857514441,
#  0.8120038687449829,
#  0.813863897895586,
#  0.8121112813610925,
#  0.8142525559257833,
#  0.813259623405968,
#  0.8112427106146669,
#  0.8184105335645979,
#  0.8147599987772485,
#  0.8163274725567524]

[0.8132362857514441,
 0.8120038687449829,
 0.813863897895586,
 0.8121112813610925,
 0.8142525559257833,
 0.813259623405968,
 0.8112427106146669,
 0.8184105335645979,
 0.8147599987772485,
 0.8163274725567524]
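A quick way to judge how much this hyperparameter matters is to look at the gap between the best and worst score.  The sketch below copies the scores from the output above; the variable names are hypothetical.

```python
import pandas as pd

# rsm values tried and the validation scores copied from the output above.
rsm_values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
rsm_scores = [0.8132362857514441, 0.8120038687449829, 0.813863897895586,
              0.8121112813610925, 0.8142525559257833, 0.813259623405968,
              0.8112427106146669, 0.8184105335645979, 0.8147599987772485,
              0.8163274725567524]

rsm_series = pd.Series(rsm_scores, index=rsm_values)

# Gap between the best and worst setting: a rough measure of sensitivity.
spread = rsm_series.max() - rsm_series.min()
print(round(spread, 4))
```

The whole range spans less than 0.01, compared with a gap of roughly 0.03 across the `max_depth` values we tried earlier.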

We can see that there does not appear to be much difference as we change this hyperparameter.  So let's leave it out.

> But oddly, there does appear to be a significant increase in our score from using the CPU.  Let's keep that in mind.

3. L2 regularization

Now let's try using L2 regularization, with values between 1 and 5.

In [89]:
l2_vals = None
l2_vals

# array([1, 2, 3, 4, 5])

array([1, 2, 3, 4, 5])

In [None]:
model_l2_vals = None

In [91]:
l2_scores = None

l2_scores

[0.6613824960577861,
 0.6571175886055414,
 0.6557103150694057,
 0.6516318365424489,
 0.6541637682457586]

The best of these scores comes with a value of 1, and none of the values improves on our earlier scores.  So let's try leaving `l2_leaf_reg` out entirely, and see how we perform.

In [None]:
model = None

In [96]:
model.score(validate_pool)

0.8163274725567524

Leaving it out gives us a better validation score, so let's not use `l2_leaf_reg`.

### Learning Rate

Finally, let's move on to the learning rate and number of iterations.  The first step is to find the number of iterations that gives the best validation score for a specific learning rate.   Here, we'll start with 5000 iterations.

We can find how many iterations is best by using the overfitting detector.  We can set this up by using `od_type='Iter'` and `od_wait=40`.  This means training will stop if the validation score shows no improvement for 40 iterations.  

We should pass `eval_set = validate_pool` to the `fit` method, so that it can evaluate on the validation set. 

> We can set the `eval_metric` as "R2", so we can see the comparison in score.  However, this is not available with GPU.  So let's stick with the default of RMSE.
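The idea behind stopping after a fixed number of iterations without improvement can be sketched in plain Python.  This is a simplified illustration of the rule described above, not catboost's actual implementation; the function name and the list of losses are hypothetical.

```python
def best_iteration_with_wait(validation_losses, wait):
    """Return (best_index, stop_index) for a patience-style stopping rule.

    Training stops once `wait` iterations have passed since the last
    new minimum of the validation loss.
    """
    best_index = 0
    stop_index = len(validation_losses) - 1  # default: ran to the end
    for i, loss in enumerate(validation_losses):
        if loss < validation_losses[best_index]:
            best_index = i
        elif i - best_index >= wait:
            stop_index = i
            break
    return best_index, stop_index


# Hypothetical validation RMSE per iteration: improves, then drifts upward.
losses = [10.0, 9.0, 8.5, 8.4, 8.6, 8.7, 8.8, 8.9]
print(best_iteration_with_wait(losses, wait=3))
```

With `wait=3`, the best loss is at index 3 and training stops at index 6, three iterations later.  A larger `wait` tolerates longer plateaus at the cost of extra training time.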

In [None]:
model_learn = None

In [63]:
model_learn.best_score_

# {'learn': {'RMSE': 9145.615675283978},
#  'validation': {'RMSE': 12046.234303548972}}

{'learn': {'RMSE': 9145.615675283978},
 'validation': {'RMSE': 12046.234303548972}}

We can see the validation R2 score with the `score` method.

In [65]:
model_learn.score(validate_pool)

# 0.7643614826582625

0.7643614826582625
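RMSE and R2 are two views of the same squared error, which is why we can stop early on RMSE and still report R2.  The sketch below shows the relationship on a handful of made-up numbers; `y_true` and `y_pred` are hypothetical.

```python
import numpy as np

# Hypothetical targets and predictions.
y_true = np.array([10.0, 12.0, 15.0, 20.0, 23.0])
y_pred = np.array([11.0, 11.5, 16.0, 19.0, 22.0])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R2 compares squared error against always predicting the mean of y_true.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# The same value, written in terms of RMSE and the variance of y_true.
r2_from_rmse = 1 - rmse ** 2 / np.var(y_true)
print(rmse, r2, r2_from_rmse)
```

On a fixed validation set the variance of the target does not change, so a lower RMSE always means a higher R2.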

> Now that we have found the best number of iterations at that learning rate, let's double the number of iterations and cut the learning rate in half.  Because we are using early stopping, we can set our number of iterations to 8000, even though doubling would take us to 7000.  Cutting the learning rate in half leaves it at .005.
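The arithmetic of this step can be written as a small helper.  The function and its `headroom` parameter are hypothetical, and the starting values (a learning rate of 0.01 and roughly 3500 iterations) are inferred from the numbers in the text rather than shown in the lesson's output.

```python
def next_stage(learning_rate, best_iterations, headroom=1000):
    """Halve the learning rate and double the iteration budget.

    `headroom` is a hypothetical extra allowance on top of the doubled count;
    it costs nothing when early stopping is enabled.
    """
    return learning_rate / 2, best_iterations * 2 + headroom


# Inferred from the text: 0.01 halves to 0.005, and about 3500 iterations
# doubles to 7000, which is then raised to a cap of 8000.
new_rate, new_cap = next_stage(0.01, 3500)
print(new_rate, new_cap)
```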

In [None]:
regressor_learn = None

Look at the score on the validation pool.

In [72]:
# validation score

# 0.7658185187953346

0.7658185187953346

> So we see a small improvement in our score.  Now look at the `best_score_`.

In [73]:


# {'learn': {'RMSE': 9040.387276660222},
#  'validation': {'RMSE': 12008.937215923814}}

{'learn': {'RMSE': 9040.387276660222},
 'validation': {'RMSE': 12008.937215923814}}

> And the best iteration.

In [74]:
regressor_learn.best_iteration_

# 7996

7996

Finally, let's try training on the CPU, as that also appeared to improve our score.

In [None]:
regressor_cpu = None

In [101]:
regressor_cpu.score(validate_pool)

# 0.8240710727054391

0.8240710727054391

So again, we see a significant improvement in our score by using the CPU.

### Summary

In this lesson, we learned how to work with the catboost hyperparameters.  We worked with `max_depth`, `colsample_bylevel`, and `l2_leaf_reg`.  Training time for catboost is significant, as it must train each decision tree sequentially.  To try to speed up training time, we used `task_type = 'GPU'` when possible.  

We finished up by tuning the learning rate.  We used early stopping with the overfitting detector by setting the hyperparameters `od_type='Iter'` and `od_wait=40`.

### Resources

[Speeding up Catboost](https://catboost.ai/docs/concepts/speed-up-training.html)