# Gradient Boosting Regressor hyperparameter tuning



Our current model is a Gradient Boosting Regressor (GBR), so this notebook will try to **find the best parameters for the task**. Before reading this notebook, I recommend this [video](https://www.youtube.com/watch?v=3CC4N4z3GJc) about the main idea behind the algorithm.

Hyperparameter tuning is very important in machine learning, mainly to optimize **the bias-variance tradeoff**. If you are not familiar with it, I recommend reading [this](https://en.wikipedia.org/wiki/Hyperparameter_optimization) before diving into the analysis in this notebook. We are going to use GridSearchCV for the tuning, so it's also very important to understand the principles behind [cross-validation](https://towardsdatascience.com/why-and-how-to-cross-validate-a-model-d6424b45261f).
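
As a quick refresher on what cross-validation does, here is a minimal sketch on synthetic data. The data, the model settings, and the `*_demo` names are assumptions made for illustration; they are not the dataset or configuration used in the rest of this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption: the notebook itself uses the housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=10.0, random_state=0)

model = GradientBoostingRegressor(n_estimators=50, random_state=0)

# 5 folds: every sample is used for validation exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = cross_val_score(model, X_demo, y_demo, cv=cv,
                              scoring='neg_mean_squared_error')

# scikit-learn maximizes scores, so error metrics come back negated
fold_rmse = np.sqrt(-fold_scores)
print(fold_rmse.mean(), fold_rmse.std())
```

The spread of the per-fold errors gives a rough idea of how much a score can move just because of the split, which is worth keeping in mind when two parameter settings differ only slightly.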

In [3]:
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import plotly.express as px

In [4]:
df = pd.read_csv('../data/processed/train_selected.csv')
df = pd.get_dummies(df)    
y = df['SalePrice']
X = df.drop(columns=['SalePrice'])

In [24]:
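def rmse_model(model, X, y):
    # Per-fold root mean squared log error (RMSLE) from 5-fold cross-validation.
    # scikit-learn returns the negated MSLE, hence the sign flip before the root.
    scores = np.sqrt(-1 * cross_val_score(model, X, y, cv=5,
                                          scoring='neg_mean_squared_log_error'))
    return scores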

def test_paramethers(X, y, paramether_name, paramether_list):
    # Baseline configuration; only `paramether_name` is varied below
    paramethers = {'n_estimators':3500, 'learning_rate':0.01,
                   'max_depth':4, 'max_features':'sqrt',
                   'min_samples_leaf':15, 'min_samples_split':10,
                   'loss':'huber', 'random_state':42}

    score_result = []
    for p in paramether_list:
        paramethers[paramether_name] = p
        model = GradientBoostingRegressor(**paramethers)
        # Mean RMSLE over the 5 folds for this parameter value
        score_result.append(rmse_model(model, X, y).mean())

    return pd.DataFrame(data={paramether_name: paramether_list, 'scores': score_result})


def test_multiple_paramethers(X, y, paramether_names, paramether_lists):
    # NOTE: this one-parameter-at-a-time version is shadowed by the
    # GridSearchCV-based definition of the same name further down.
    base_paramethers = {'n_estimators':3500, 'learning_rate':0.01,
                        'max_depth':4, 'max_features':'sqrt',
                        'min_samples_leaf':15, 'min_samples_split':10,
                        'loss':'huber', 'random_state':42}

    rows = []
    # Pair each parameter name with its own list of candidate values
    for name, values in zip(paramether_names, paramether_lists):
        for p in values:
            # Vary one parameter, keeping the others at the baseline
            paramethers = dict(base_paramethers, **{name: p})
            model = GradientBoostingRegressor(**paramethers)
            score = rmse_model(model, X, y).mean()
            rows.append({'paramether': name, 'value': p, 'scores': score})

    return pd.DataFrame(rows)


def test_multiple_paramethers(X, y, param_test, param_fix):
    # Exhaustive grid search over `param_test`, with `param_fix` held constant.
    # n_jobs=-2 uses all CPU cores but one.
    gsearch = GridSearchCV(estimator=GradientBoostingRegressor(**param_fix),
                           param_grid=param_test,
                           scoring='neg_mean_squared_log_error',
                           n_jobs=-2, cv=5)

    return gsearch.fit(X, y)

# Hyperparameter tuning

## Main parameters analysis

The first two main parameters that this chapter will focus on are:

- **n_estimators:** Number of trees (boosting stages) that the model will create
- **learning_rate:** How much each tree's contribution is shrunk before it is added to the ensemble

These two parameters interact with each other and with the bias-variance tradeoff: a lower learning rate generally needs more trees to reach the same fit, so a poor combination can either overfit or underfit the model. The model currently uses **n_estimators=3500** and **learning_rate=0.01**.
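
To see this interaction on a small scale, the sketch below tracks the validation error after each added tree for two learning rates. It runs on synthetic data with a single hold-out split, both of which are assumptions for illustration and not the setup used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and a simple hold-out split (illustrative only)
X_demo, y_demo = make_regression(n_samples=300, n_features=8,
                                 noise=15.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo,
                                            test_size=0.25, random_state=0)

best_n_trees = {}
for lr in (0.01, 0.1):
    model = GradientBoostingRegressor(n_estimators=300, learning_rate=lr,
                                      max_depth=3, random_state=0)
    model.fit(X_tr, y_tr)
    # staged_predict yields the predictions after 1, 2, ..., n_estimators trees
    val_mse = [mean_squared_error(y_val, pred)
               for pred in model.staged_predict(X_val)]
    # Number of trees at which the validation error was lowest
    best_n_trees[lr] = int(np.argmin(val_mse)) + 1
    print(lr, best_n_trees[lr], min(val_mse))
```

If the lowest validation error for a given learning rate sits at the very last tree, the model is probably still underfitting at that rate and would need more trees.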

### High scale test

The first test searches a wide range of values to find the region where the best parameters lie; for that we're going to use [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). The result will guide the finer-scale search.
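
The coarse-then-fine idea can be sketched as follows. The grid values, the synthetic data, and the way the finer grid is built around the coarse winner are all assumptions chosen to keep the example fast; they are not the ranges used in the cells of this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (illustrative only)
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=10.0, random_state=0)

# Step 1: coarse grid spanning a wide range of values
coarse_grid = {'n_estimators': [25, 50, 100],
               'learning_rate': [0.01, 0.1, 0.5]}
search = GridSearchCV(GradientBoostingRegressor(max_depth=3, random_state=0),
                      param_grid=coarse_grid,
                      scoring='neg_mean_squared_error', cv=3)
search.fit(X_demo, y_demo)
best = search.best_params_

# Step 2: a narrower grid centered on the coarse winner
fine_grid = {
    'n_estimators': [max(1, best['n_estimators'] + d) for d in (-10, 0, 10)],
    'learning_rate': [best['learning_rate'] * f for f in (0.5, 1.0, 1.5)],
}
print(best, fine_grid)
```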

In [41]:
p_fix = {'max_depth':4, 'max_features':'sqrt',
     'min_samples_leaf':15, 'min_samples_split':10,
     'loss':'huber', 'random_state':42}

grid = test_multiple_paramethers(X, y,
                                 {'n_estimators':range(500,6001,500),
                                  'learning_rate':np.arange(0.001, 0.11, 0.005).tolist()},
                                 p_fix)

print('Best score: {0}'.format(np.sqrt(-1 * grid.best_score_)))
print('Best parameters: {0}'.format(grid.best_params_))
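
One detail to keep in mind when comparing numbers across cells: `np.sqrt(-1 * grid.best_score_)` takes the root of the fold-averaged error, while `rmse_model(...).mean()` averages the per-fold roots. The two are close but not identical, as the sketch below shows with made-up fold values (the numbers are hypothetical and not results from this notebook).

```python
import numpy as np

# Hypothetical per-fold neg_mean_squared_log_error values (illustrative only)
neg_msle_folds = np.array([-0.014, -0.018, -0.016, -0.021, -0.015])

# What rmse_model(...).mean() computes: root of each fold, then the mean
mean_of_roots = np.sqrt(-neg_msle_folds).mean()

# What np.sqrt(-grid.best_score_) computes: mean over folds, then the root
root_of_mean = np.sqrt(-neg_msle_folds.mean())

print(mean_of_roots, root_of_mean)
```

Because the square root is concave, the mean of the roots can never exceed the root of the mean, so scores from the grid search will tend to look slightly worse than scores from the single-parameter sweeps for the same configuration.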

**Test Insights**

- The chosen parameters performed better in the high scale test; now we are going to analyze a smaller range to pick the best values.
- Open question: is this test enough to detect overfitting? Probably not, so how can we make sure that the model isn't overfitting?
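
One possible way to approach the overfitting question is to compare training and validation error from the same cross-validation run. This is a sketch on synthetic data with assumed model settings, not a check that was run in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (illustrative only)
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=20.0, random_state=0)

model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                  max_depth=4, random_state=0)

# return_train_score=True also scores each model on its own training folds
results = cross_validate(model, X_demo, y_demo, cv=5,
                         scoring='neg_mean_squared_error',
                         return_train_score=True)

train_rmse = np.sqrt(-results['train_score']).mean()
val_rmse = np.sqrt(-results['test_score']).mean()
print(train_rmse, val_rmse, val_rmse - train_rmse)
```

A validation error far above the training error suggests the model is fitting noise; a gap that keeps growing as `n_estimators` increases is a sign to stop adding trees or to lower the learning rate.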

In [None]:
df_learning = test_paramethers(X, y, 'learning_rate', np.arange(0.004, 0.02, 0.002).tolist())
fig = px.line(df_learning, x='learning_rate', y='scores')
fig.show()

In [31]:
df_tree_numbers = test_paramethers(X, y, 'n_estimators', np.arange(3000, 4001, 100).tolist())
fig = px.line(df_tree_numbers, x='n_estimators', y='scores')
fig.show()

**Conclusion**

The best parameters found in this analysis were:
- **n_estimators**: 3400
- **learning_rate**: 0.008

## Tree parameters

We have already defined the two principal parameters of the Gradient Boosting Regressor, so now we will focus on the tree-specific parameters. These **parameters** are:

- **min_samples_split**: The minimum number of samples required to split an internal node.
- **min_samples_leaf**: The minimum number of samples required to be at a leaf node
- **max_depth**: The maximum depth of the individual regression estimators

The **ranges** that we are going to test follow these guidelines:

- **min_samples_split**: This should be ~0.5-1% of total values.
- **min_samples_leaf**: Can be selected based on intuition.
- **max_depth**: Should be chosen (5-8) based on the number of observations and predictors.
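
The 0.5-1% guideline for min_samples_split can be turned into concrete candidate values with a little arithmetic. In the sketch below the row count is a hypothetical placeholder (use `len(X)` in the notebook), and the step of 2 between candidates is an arbitrary choice.

```python
import math

n_samples = 1460  # hypothetical row count; replace with len(X)

# 0.5% and 1% of the rows, rounded up to whole samples
low = math.ceil(0.005 * n_samples)
high = math.ceil(0.01 * n_samples)

# Candidate values between the two bounds (the step of 2 is arbitrary)
candidate_splits = list(range(low, high + 1, 2))
print(low, high, candidate_splits)
```

scikit-learn also accepts a float for `min_samples_split`, which it interprets as a fraction of the training samples, so passing values such as 0.005 or 0.01 directly is an alternative to computing the counts by hand.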

In [22]:
p_fix = {
    'n_estimators':3400,
    'learning_rate':0.008,
    'max_features':'sqrt',
    'loss':'huber', 'random_state':42
}

grid = test_multiple_paramethers(X, y,
                                 {'min_samples_split':range(2,12),
                                  'max_depth':range(3, 9),
                                  'min_samples_leaf':range(10,31,5)},
                                 p_fix)


print('Best score: {0}'.format(np.sqrt(-1 * grid.best_score_)))
print('Best parameters: {0}'.format(grid.best_params_))

Best score: 0.12848252489350628
Best parameters: {'max_depth': 4, 'min_samples_leaf': 15, 'min_samples_split': 2}


**Test insights**

- The test showed that the max_depth and min_samples_leaf parameters were already optimized: the grid search selected the values the model was using (4 and 15).
- The test showed that **maybe** we can change min_samples_split (the search selected 2 instead of 10), but this needs more analysis before we can be sure.
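
One way to judge whether min_samples_split really matters is to look at every grid result, not only the best one. The sketch below builds a small table from `cv_results_` on synthetic data; the grid values and the `*_demo` names are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a deliberately small grid (illustrative only)
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=10.0, random_state=0)
grid_demo = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    param_grid={'min_samples_split': [2, 6, 10], 'max_depth': [2, 4]},
    scoring='neg_mean_squared_error', cv=3)
grid_demo.fit(X_demo, y_demo)

# One row per parameter combination, with its mean and std across folds
results = pd.DataFrame(grid_demo.cv_results_['params'])
results['mean_test_score'] = grid_demo.cv_results_['mean_test_score']
results['std_test_score'] = grid_demo.cv_results_['std_test_score']

# Best score reached for each min_samples_split value
summary = results.groupby('min_samples_split')['mean_test_score'].max()
print(results.sort_values('mean_test_score', ascending=False))
print(summary)
```

If the differences between the rows of `summary` are smaller than the typical `std_test_score`, the parameter has little measurable effect in that range and the choice can be made on other grounds.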

In [28]:
df_tree_numbers = test_paramethers(X, y, 'min_samples_split', np.arange(2, 15, 1).tolist())
fig = px.line(df_tree_numbers, x='min_samples_split', y='scores')
fig.show()

**Conclusions**

- **max_depth**: 4 (selected by the grid search, unchanged from the baseline)
- **min_samples_leaf**: 15 (selected by the grid search, unchanged from the baseline)
- **min_samples_split**: 2 in the grid search; the single-parameter sweep above is the check on whether moving away from the baseline value of 10 is justified