# Hyperparameter Tuning

### Loading our Data

Let's begin by loading our bulldozers data, which has already been pre-processed.

In [1]:
import pandas as pd

df_sorted = pd.read_csv('./bulldozers_sorted.csv', index_col = 0)
df_sorted[:2]

| Unnamed: 0 | SalesID | SalePrice | MachineID | ModelID | datasource | auctioneerID | YearMade | MachineHoursCurrentMeter | UsageBand | fiModelDesc | ... | saleDay | saleDayofweek | saleDayofyear | saleIs_month_end | saleIs_month_start | saleIs_quarter_end | saleIs_quarter_start | saleIs_year_end | saleIs_year_start | saleElapsed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7648 | 1165000 | 27000.0 | 733687 | 7057 | 121 | 3.0 | 1995 | 4368.0 | Medium | 312 | ... | 5 | 0 | 5 | False | False | False | False | False | False | 1073260800 |
| 8228 | 1166933 | 10750.0 | 1035166 | 8861 | 121 | 3.0 | 2002 | 603.0 | Low | 803 | ... | 9 | 4 | 9 | False | False | False | False | False | False | 1073606400 |


So our sale date column has already been expanded into several numeric columns.  The only features that are not numeric are our categorical features, which is what we expect.

In [4]:
# df_sorted.info()

Now let's separate our data into the features, $X$, and the target $y$.

In [2]:
X = df_sorted.drop('SalePrice', axis = 1)
y = df_sorted['SalePrice']

Then let's split our data into training, validation and test sets.  We set `shuffle = False` as we want our datasets ordered by time.

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, shuffle = False)
X_validate, X_test, y_validate, y_test = train_test_split(X_test, y_test, test_size = .5, shuffle = False)
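
To see what `shuffle = False` gives us, here is a minimal sketch on toy data. The `chronological_split` helper and the toy `elapsed` values are illustrative assumptions, not part of the notebook. It slices an already time-sorted sequence into 80/10/10 pieces and checks that every training row comes before every validation row, which in turn comes before every test row.

```python
import math

def chronological_split(rows, test_size):
    """Split an already time-sorted sequence, keeping the last rows as the holdout.

    The holdout size is rounded up (an assumption of this sketch).
    """
    n_holdout = math.ceil(len(rows) * test_size)
    cut = len(rows) - n_holdout
    return rows[:cut], rows[cut:]

# Toy stand-in for a sorted saleElapsed column (hypothetical values).
elapsed = list(range(100))

train, holdout = chronological_split(elapsed, test_size=0.2)
validate, test = chronological_split(holdout, test_size=0.5)

print(len(train), len(validate), len(test))  # 80 10 10
# No period overlaps: train ends before validate starts, and so on.
print(max(train) < min(validate) <= max(validate) < min(test))  # True
```

Had the rows been shuffled, later sales could leak into the training set and make the validation score look better than it should.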

> Let's again find the categorical columns.  

In [4]:
import numpy as np
# `np.object` was removed in NumPy 1.24; the builtin `object` is equivalent here.
cal_col_idcs = np.where(X.dtypes == object)[0]
cal_col_idcs

array([ 7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
       24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
       41, 42, 43, 44, 45, 46, 47, 48, 49, 50])
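
As an alternative that does not depend on comparing against the `object` dtype, the sketch below treats every non-numeric column as categorical. The toy frame `X_toy` and the name `cat_col_idcs` are assumptions made for illustration; on the real data you would run the same two lines against `X`.

```python
import pandas as pd

# Toy frame standing in for X (hypothetical columns and values).
X_toy = pd.DataFrame({
    "YearMade": [1995, 2002],
    "UsageBand": ["Medium", "Low"],
    "MachineHoursCurrentMeter": [4368.0, 603.0],
    "fiModelDesc": ["312", "803"],
})

# Names of the non-numeric columns, then their positional indices.
cat_cols = [col for col in X_toy.columns
            if not pd.api.types.is_numeric_dtype(X_toy[col])]
cat_col_idcs = [X_toy.columns.get_loc(col) for col in cat_cols]

print(cat_cols)      # ['UsageBand', 'fiModelDesc']
print(cat_col_idcs)  # [1, 3]
```

Boolean columns count as numeric under `is_numeric_dtype`, so flags such as `saleIs_month_end` would stay out of the categorical list, matching the result above.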

And let's create our data pools.

In [5]:
from catboost import Pool, CatBoostRegressor

train_pool = Pool(X_train, y_train, cat_features = cal_col_idcs)
validate_pool = Pool(X_validate, y_validate, cat_features = cal_col_idcs)
test_pool = Pool(X_test, y_test, cat_features = cal_col_idcs)

In [8]:
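# Train on the CPU: adding "task_type": "GPU" needs a CUDA-capable device
# and a compatible NVIDIA driver.
params = {'iterations': 200, 'depth': 6,
          'logging_level': 'Silent',
          'random_seed': 42}

cbr = CatBoostRegressor(**params).fit(train_pool)

# R^2 on the validation set
cbr.score(validate_pool)

> With `"task_type": "GPU"` in `params`, this cell failed on our machine with `CatBoostError: catboost/libs/train_lib/train_model.cpp:916: Can't load GPU learning library. Module was not compiled or driver  is incompatible with package. Please install latest NVDIA driver and check again`.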

### Tuning Hyperparameters

Now let's see if we can improve our score by tuning hyperparameters.  

Let's begin with the `min_child_samples` hyperparameter, trying values from 4 to 12 in steps of 2.

In [None]:
# Candidate values: 4, 6, 8, 10, 12
min_samples = list(range(4, 13, 2))
min_samples

1. Min Samples

In [None]:

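# Fit one model per candidate value. Note: per the CatBoost docs,
# min_child_samples (min_data_in_leaf) only applies with
# grow_policy='Depthwise' or 'Lossguide'.
model_min_samples = [CatBoostRegressor(iterations=200,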
                                  max_depth=13, 
                                  min_child_samples = min_sample,
                                  logging_level = 'Silent').fit(train_pool) 
                for min_sample in min_samples]


Initialize a series with the `min_child_samples` values as the labels and the corresponding validation scores as the values.
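
One way to build such a series is sketched below. The helper `scores_by_value` and the stand-in models are hypothetical, and the scores are made-up placeholders rather than results. In the notebook you would pass `model_min_samples`, `min_samples`, and `lambda m: m.score(validate_pool)` instead.

```python
import pandas as pd

def scores_by_value(models, values, score_fn):
    """Return a Series indexed by hyperparameter value, holding each model's score."""
    return pd.Series([score_fn(model) for model in models], index=list(values))

# Hypothetical stand-ins so the sketch runs without training anything.
fake_models = ["m4", "m6", "m8"]
fake_scores = {"m4": 0.81, "m6": 0.83, "m8": 0.82}  # placeholders, not results

min_sample_scores = scores_by_value(fake_models, [4, 6, 8], fake_scores.get)
print(min_sample_scores)
```

Keeping the scoring function as an argument means the same helper works for any of the sweeps in this section.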

So it looks like `min_child_samples` of ?? is the value that optimizes our score.

2. Max depth

Next let's move on to max depth.  The maximum value that CatBoost allows for `max_depth` is 16, so let's try the odd values from 5 to 15.

In [None]:
max_depths = range(5, 16, 2)

In [None]:
model_depths = [CatBoostRegressor(iterations=200,
                                  max_depth=max_depth, 
                                  logging_level = 'Silent').fit(train_pool) 
                for max_depth in max_depths]

Initialize a series with the `max_depth` values as the labels and the corresponding validation scores as the values.
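
With the scores in a series indexed by `max_depth`, picking the best value and ranking the rest takes one line each. The sketch below uses made-up placeholder scores (not results from this notebook) only to show the pandas calls. The name `depth_scores` is an assumption.

```python
import pandas as pd

# Placeholder scores keyed by max_depth (made-up numbers, not results).
depth_scores = pd.Series({5: 0.80, 7: 0.84, 9: 0.83}, name="validation_score")

best_depth = depth_scores.idxmax()                  # label of the highest score
ranked = depth_scores.sort_values(ascending=False)  # best first

print(best_depth)             # 7
print(ranked.index.tolist())  # [7, 9, 5]
```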

3. Column Sample by Level

Now let's move on to `colsample_bylevel`.  Remember that it lets us specify the fraction of features that are considered at each split of the decision tree.  For catboost, it uses a 

In [None]:
import numpy as np
col_sample_pcts = np.linspace(0.1, 1, 10)
col_sample_pcts

In [None]:
model_sample_pcts = [CatBoostRegressor(iterations=200,
                                  max_depth=13, 
                                  colsample_bylevel = pct,
                                  logging_level = 'Silent').fit(train_pool) 
                for pct in col_sample_pcts]
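
To make "the fraction of features considered at each split" concrete, here is a purely conceptual sketch of random feature subsampling. It is not CatBoost's internal implementation, which may differ. The helper `sample_features` and the feature count of 50 are assumptions for illustration.

```python
import random

def sample_features(n_features, pct, rng):
    """Pick a random subset of feature indices covering roughly `pct` of them."""
    k = max(1, round(pct * n_features))
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(42)
n_features = 50  # hypothetical feature count

for pct in (0.1, 0.5, 1.0):
    candidates = sample_features(n_features, pct, rng)
    # Only these candidate features would be evaluated for the split.
    print(pct, len(candidates))
```

Lower fractions mean fewer candidate features per split, which speeds up training and adds randomness between trees.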


### Learning Rate

In [None]:
regressor_learn = CatBoostRegressor(iterations=3000, max_depth=13, learning_rate = .01,
                                colsample_bylevel = .1,
                                logging_level = 'Silent').fit(train_pool)

In [None]:
regressor_learn.score(validate_pool)

> Try more iterations

In [None]:
regressor_learn = CatBoostRegressor(iterations=4000, max_depth=13, learning_rate = .01,
                                colsample_bylevel = .1,
                                logging_level = 'Silent').fit(train_pool)

In [None]:
regressor_learn.score(validate_pool)

> Cut the learning rate in half and double the iterations (relative to the first run).

In [None]:
regressor_learn = CatBoostRegressor(iterations=6000, max_depth=13, learning_rate = .005,
                                colsample_bylevel = .1,
                                logging_level = 'Silent').fit(train_pool)

In [None]:
regressor_learn.score(validate_pool)
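
A rough way to see why a smaller learning rate calls for more iterations is the toy model below. It assumes each boosting step removes a fixed `learning_rate` fraction of the remaining residual, which is a deliberately crude simplification and not how CatBoost actually behaves. The function name `remaining_error_fraction` is hypothetical.

```python
import math

def remaining_error_fraction(learning_rate, iterations):
    """Fraction of the initial residual left if every step removes
    `learning_rate` of what remains (a crude toy model)."""
    return (1 - learning_rate) ** iterations

# The three (learning_rate, iterations) settings tried above.
for lr, n in [(0.01, 3000), (0.01, 4000), (0.005, 6000)]:
    frac = remaining_error_fraction(lr, n)
    print(f"lr={lr}, iterations={n}, log10(remaining)={math.log10(frac):.1f}")
```

Under this toy model, halving the learning rate while doubling the iterations leaves almost the same training residual as the original setting. Any gain on the validation set then comes from taking smaller, smoother steps rather than from fitting the training data harder.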