# Hyperparameter Tuning

In [80]:
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV, RepeatedKFold


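In this toy example we tune a `GradientBoostingRegressor` and a `LinearRegression` (both implemented in `sklearn`) with `GridSearchCV`, scoring each candidate with `RepeatedKFold` cross-validation. The scoring metric is `neg_mean_absolute_error`: scikit-learn treats higher scores as better, so the mean absolute error is negated and scores closer to zero are better.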
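
As a minimal sketch of how the cross-validation scheme and the scorer behave, the snippet below runs `cross_val_score` on synthetic data (the data and the `*_demo` names are assumptions for illustration, not part of this notebook's dataset). It shows that 10 folds repeated 3 times yields 30 scores, and that each score is a negated MAE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic data (an assumption for illustration, not the dataset used here)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=60)

cv_demo = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

# 10 folds repeated 3 times gives 30 train/validation splits
n_splits_total = cv_demo.get_n_splits(X_demo)

# One negated MAE per split; negate the mean to read it as a plain MAE
scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                         scoring="neg_mean_absolute_error", cv=cv_demo)
mean_mae = -scores.mean()
print(n_splits_total, len(scores), mean_mae)
```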

In [81]:
df = pd.read_csv("data/train.csv")
df.head()

Unnamed: 0,Weight,Length1,Length3,Species_Bream,Species_Parkki,Species_Perch,Species_Pike,Species_Roach,Species_Smelt,Species_Whitefish
0,290.0,24.0,29.2,0,0,0,0,1,0,0
1,430.0,26.5,34.0,1,0,0,0,0,0,0
2,700.0,34.0,38.3,0,0,1,0,0,0,0
3,500.0,42.0,48.0,0,0,0,1,0,0,0
4,110.0,19.0,22.5,0,0,1,0,0,0,0


In [82]:
data = df.values

In [83]:
# The first column is the target; the remaining columns are the features
X, y = data[:, 1:], data[:, 0]

In [84]:
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)


In [71]:
model_gb = GradientBoostingRegressor()

In [72]:
# search space
num_estimators = [100, 500, 1000]
learn_rates = [0.02, 0.05, 0.1, 0.2]
max_depths = [1, 2, 5]
min_samples_leaf = [5,10]
min_samples_split = [5,10]
space_gb = {'n_estimators': num_estimators,
            'learning_rate': learn_rates,
            'max_depth': max_depths,
            'min_samples_leaf': min_samples_leaf,
            'min_samples_split': min_samples_split}

In [73]:
search = GridSearchCV(model_gb, space_gb, scoring='neg_mean_absolute_error', n_jobs=-1, cv=cv)


In [74]:
result_gb = search.fit(X, y)
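
To get a feel for the cost of this search, the sketch below counts the candidates in the grid and the number of model fits they imply under the cross-validation scheme defined earlier. It rebuilds the same grid so that it runs on its own; `ParameterGrid` is the scikit-learn helper that enumerates the combinations.

```python
from sklearn.model_selection import ParameterGrid, RepeatedKFold

# Same search space and CV scheme as in the cells above
space_gb = {'n_estimators': [100, 500, 1000],
            'learning_rate': [0.02, 0.05, 0.1, 0.2],
            'max_depth': [1, 2, 5],
            'min_samples_leaf': [5, 10],
            'min_samples_split': [5, 10]}
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

n_candidates = len(ParameterGrid(space_gb))   # 3 * 4 * 3 * 2 * 2 combinations
n_fits = n_candidates * cv.get_n_splits()     # one fit per candidate per split
print(n_candidates, n_fits)  # 144 4320
```

With the default `refit=True`, `GridSearchCV` also performs one further fit of the best candidate on the full training set.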

In [75]:
print('Best Score: %s' % result_gb.best_score_)
print('Best Hyperparameters: %s' % result_gb.best_params_)

Best Score: -67.80347126549573
Best Hyperparameters: {'learning_rate': 0.1, 'max_depth': 1, 'min_samples_leaf': 5, 'min_samples_split': 10, 'n_estimators': 1000}
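
`best_score_` is only the top row of a larger table. A sketch of how the full results might be inspected is shown below, using a small grid on synthetic data so that it runs quickly (the data, the grid, and the `demo_*` names are assumptions for illustration). When several candidates score within a standard deviation of each other, the selected combination can plausibly change from run to run, so it is worth looking at more than the single winner.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data and a deliberately tiny grid (assumptions for illustration)
X_demo, y_demo = make_regression(n_samples=80, n_features=4, noise=5.0, random_state=0)
demo_space = {"n_estimators": [20, 50], "max_depth": [1, 2]}

demo_search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    demo_space,
    scoring="neg_mean_absolute_error",
    cv=KFold(n_splits=3, shuffle=True, random_state=1),
)
demo_search.fit(X_demo, y_demo)

# cv_results_ holds one row per candidate; rank 1 is the best mean score
results = pd.DataFrame(demo_search.cv_results_)
ranked = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
]
print(ranked)
```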


In a real project we might have several rival models here, and we would compare them at this stage and choose the best one. Let's consider Linear Regression, which can be thought of as our baseline.
The only parameter we search over is `fit_intercept`. We almost certainly do want to fit an intercept, but for this toy example we include it in the search anyway.

In [76]:
model_lr = LinearRegression()
space_lr = {'fit_intercept':[True, False]}
search = GridSearchCV(model_lr, space_lr, scoring='neg_mean_absolute_error', n_jobs=-1, cv=cv)
result_lr = search.fit(X,y)
print('Best Score: %s' % result_lr.best_score_)
print('Best Hyperparameters: %s' % result_lr.best_params_)

Best Score: -76.94888187660814
Best Hyperparameters: {'fit_intercept': True}
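
The comparison between the two models can be made explicit with a few lines of arithmetic. The sketch below copies the two `best_score_` values printed above, converts them back to plain MAE, and picks the model with the lower error.

```python
# Scores copied from the two grid searches above (negated MAE)
best_scores = {
    "GradientBoostingRegressor": -67.80347126549573,
    "LinearRegression": -76.94888187660814,
}

# Flip the sign to get the cross-validated MAE, where lower is better
cv_mae = {name: -score for name, score in best_scores.items()}
winner = min(cv_mae, key=cv_mae.get)
improvement = cv_mae["LinearRegression"] - cv_mae["GradientBoostingRegressor"]
print(winner, round(improvement, 2))  # GradientBoostingRegressor 9.15
```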


# Choose final model

For the final model we will use GradientBoostingRegressor with the best hyperparameters found by the grid search above: `{'learning_rate': 0.1, 'max_depth': 1, 'min_samples_leaf': 5, 'min_samples_split': 10, 'n_estimators': 1000}`.

At this stage we can see how this model (GradientBoostingRegressor with the above hyperparameters) performs on the holdout dataset. These are the metrics we would report to stakeholders. They are unbiased because the holdout dataset has not been used anywhere in the process so far. The `best_score_` above, by contrast, is an optimistically biased estimate: the hyperparameters were selected because they scored best on those same cross-validation folds, so the tuning has implicitly used information from across the training set.

It is **absolutely crucial** that we do no further tuning based on the holdout results; otherwise we will bias this estimate.
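
The separation between tuning and final evaluation can be sketched end to end on synthetic data (the data, the small grid, and the variable names below are assumptions for illustration). The search only ever sees the training portion, and the holdout portion is scored exactly once, after the hyperparameters are fixed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic stand-in data (assumption: not the dataset used in this notebook)
X_all, y_all = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_hold, y_tr, y_hold = train_test_split(X_all, y_all, test_size=0.25, random_state=0)

demo_search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    {"n_estimators": [50, 100], "max_depth": [1, 2]},
    scoring="neg_mean_absolute_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=1),
)
demo_search.fit(X_tr, y_tr)  # tuning sees only the training portion

# Cross-validated MAE of the selected candidate (used for model selection)
cv_mae = -demo_search.best_score_
# MAE on data the search never saw (the number to report)
holdout_mae = mean_absolute_error(y_hold, demo_search.predict(X_hold))
print("CV MAE: {:.2f}, holdout MAE: {:.2f}".format(cv_mae, holdout_mae))
```

The two numbers will generally differ; only the holdout figure is free of selection effects.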

In [79]:
test = pd.read_csv('data/test.csv')
test_data = test.values
# Use separate names so the training X, y are not overwritten
X_test, y_test = test_data[:, 1:], test_data[:, 0]


In [87]:
# best_estimator_ has been refit on the full training set (GridSearchCV default refit=True)
best_model = result_gb.best_estimator_
y_pred = best_model.predict(X_test)
R2 = r2_score(y_test, y_pred)
MSE = mean_squared_error(y_test, y_pred)
MAE = mean_absolute_error(y_test, y_pred)
print("These are the statistics we would report to stakeholders: ")
print("Final GBRegressor: R-squared: {}, MAE: {}, MSE: {}".format(R2, MAE, MSE))


These are the statistics we would report to stakeholders: 
Final GBRegressor: R-squared: 0.9463130239173201, MAE: 39.778168094240016, MSE: 7388.717629847771


We can now train the final model on ALL of the data (this is done in another notebook).
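
Since that notebook is not shown here, the following is only a sketch of what such a final fit might look like, under the assumption that "ALL of the data" means the training and holdout sets combined. The synthetic data, the placeholder hyperparameters, and the names `tuned_model` and `final_model` are all hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the train and holdout sets (assumption)
X_all, y_all = make_regression(n_samples=120, n_features=4, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=0.25, random_state=0)

# Stand-in for result_gb.best_estimator_ (placeholder hyperparameters)
tuned_model = GradientBoostingRegressor(n_estimators=50, max_depth=1, random_state=0)
tuned_model.fit(X_train, y_train)

# clone() copies the hyperparameters but not the fitted state
final_model = clone(tuned_model)

# Refit with the same hyperparameters on train and holdout combined
final_model.fit(np.vstack([X_train, X_test]), np.concatenate([y_train, y_test]))
```

Once the holdout data has been used for fitting, there is no untouched data left to score `final_model` on, so the holdout metrics reported above remain the performance estimate.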