## Tune XGBoost model

- Make `XGBoost` models as performant as possible.
- Learn about the parameters that can be adjusted to alter the behavior of `XGBoost`.
- Learn how to tune them efficiently to improve model performance.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV


In [3]:
df = pd.read_csv('ames_housing_trimmed_processed.csv')
X, y = df.iloc[:, :-1], df.iloc[:, -1]

In [7]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree: params 
params = {"objective":"reg:squarederror", "max_depth":3}

# Create list of number of boosting rounds
num_rounds = [5, 10, 15]

# Empty list to store final round rmse per XGBoost model
final_rmse_per_round = []

# Iterate over num_rounds and build one model per num_boost_round parameter
for curr_num_rounds in num_rounds:

    # Perform cross-validation: cv_results
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3, num_boost_round=curr_num_rounds, metrics="rmse", as_pandas=True, seed=123)
    
    # Append final round RMSE
    final_rmse_per_round.append(cv_results["test-rmse-mean"].iloc[-1])

# Print the resultant DataFrame
num_rounds_rmses = list(zip(num_rounds, final_rmse_per_round))
print(pd.DataFrame(num_rounds_rmses,columns=["num_boosting_rounds","rmse"]))

   num_boosting_rounds          rmse
0                    5  50903.300782
1                   10  34774.192709
2                   15  32895.096354


> Over this range, increasing the number of boosting rounds decreases the cross-validated RMSE.

### Automated boosting round selection using early_stopping
- Let XGBoost select the number of boosting rounds automatically within `xgb.cv()`, using a technique called `early stopping`.
- Early stopping works by:
    - testing the XGBoost model after every boosting round against a hold-out dataset
    - and stopping the creation of additional boosting rounds (thereby finishing training of the model early)
        - if the hold-out metric (e.g. "rmse") does not improve for a given number of rounds.
        - Ex: use the `early_stopping_rounds` parameter in `xgb.cv()` with a large possible number of boosting rounds (50).
        - If the hold-out metric keeps improving until `num_boost_round` is reached, early stopping does not occur.
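
The stopping rule described above can be sketched in plain Python. This is a minimal illustration of the idea, not XGBoost's internal implementation: the function name `early_stopping_round`, the `patience` argument (playing the role of `early_stopping_rounds`), and the toy scores are all assumptions made for the example.

```python
def early_stopping_round(holdout_scores, patience):
    """Return (best_round, last_round_run) for a metric where lower is better.

    Training stops once `patience` consecutive rounds have passed
    without improving on the best hold-out score seen so far.
    """
    best_score = float("inf")
    best_round = -1
    for round_idx, score in enumerate(holdout_scores):
        if score < best_score:
            # New best hold-out score: remember where it happened.
            best_score = score
            best_round = round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds: stop early.
            return best_round, round_idx
    # The metric kept improving often enough: all rounds were run.
    return best_round, len(holdout_scores) - 1


# Hypothetical hold-out RMSE values, one per boosting round.
toy_scores = [50.0, 40.0, 35.0, 36.0, 37.0, 38.0, 34.0]
best_round, last_round = early_stopping_round(toy_scores, patience=3)
print("best round:", best_round, "| last round run:", last_round)
```

Note that in the toy data a better score would have appeared one round later; a small patience value can stop training before such a late improvement is seen.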

In [11]:
# Create your housing DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree: params
params = {"objective":"reg:squarederror", "max_depth":4}

# Perform cross-validation with early stopping: cv_results
cv_results = xgb.cv(params=params,
                    dtrain=housing_dmatrix,
                    nfold=3,
                    metrics="rmse",
                    seed=123,
                    num_boost_round=50,
                    early_stopping_rounds=10,
                    as_pandas=True)

# Print cv_results
print(cv_results)

    train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
0     141871.630208      403.632409   142640.645833     705.565658
1     103057.031250       73.772931   104907.666667     111.114933
2      75975.963542      253.734987    79262.059896     563.766991
3      57420.528646      521.655155    61620.135417    1087.690754
4      44552.955729      544.169200    50437.562500    1846.448017
5      35763.950521      681.797429    43035.660156    2034.469858
6      29861.464844      769.570645    38600.881511    2169.800969
7      25994.673177      756.520694    36071.817708    2109.795430
8      23306.833333      759.237086    34383.184896    1934.546688
9      21459.768880      745.624404    33509.141927    1887.377720
10     20148.720703      749.612103    32916.806641    1850.893136
11     19215.382161      641.388291    32197.834635    1734.458508
12     18627.388021      716.257152    31770.853516    1802.155409
13     17960.695312      557.043498    31482.782552    1779.12

### Tuning eta
- Tuning `eta`, also known as the learning rate.
- The learning rate in XGBoost ranges between 0 and 1. It scales the contribution of each new tree, so lower values of `eta` shrink each update more strongly (stronger regularization) and usually need more boosting rounds to reach the same error.
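
To see how the learning rate interacts with the number of rounds, consider an idealized sketch in which every boosting round fits the current residual perfectly and the update is scaled by `eta`. The helper `remaining_residual` is hypothetical and deliberately ignores trees altogether; it only illustrates how much error is left after a fixed number of scaled updates.

```python
def remaining_residual(initial_residual, eta, num_rounds):
    """Error left after `num_rounds` updates, each scaled by `eta`."""
    residual = initial_residual
    for _ in range(num_rounds):
        # Idealized learner (assumption): it predicts the residual exactly,
        # and the learning rate decides how much of that prediction is applied.
        residual -= eta * residual
    return residual


for eta in [0.001, 0.01, 0.1]:
    left = remaining_residual(1.0, eta, num_rounds=10)
    print(f"eta={eta}: {left:.3f} of the initial error remains after 10 rounds")
```

Under this simplification the remaining error is `(1 - eta) ** num_rounds` times the starting error, so a very small `eta` combined with only 10 rounds leaves almost all of the error in place.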

In [12]:
# Create your housing DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree (boosting round)
params = {"objective":"reg:squarederror", "max_depth":3}

# Create list of eta values and empty list to store final round rmse per xgboost model
eta_vals = [0.001, 0.01, 0.1]
best_rmse = []

# Systematically vary the eta
for curr_val in eta_vals:

    params["eta"] = curr_val
    
    # Perform cross-validation: cv_results
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3,
                        num_boost_round=10, early_stopping_rounds=5,
                        metrics="rmse", as_pandas=True, seed=123)
    
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].iloc[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(eta_vals, best_rmse)), columns=["eta","best_rmse"]))

     eta      best_rmse
0  0.001  195736.401042
1  0.010  179932.182292
2  0.100   79759.414063


### Tuning max_depth
- Tuning `max_depth`, the parameter that sets the maximum depth each tree in a boosting round can grow to. Smaller values lead to shallower trees, and larger values to deeper trees.

In [14]:
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X,label=y)

# Create the parameter dictionary
params = {"objective":"reg:squarederror", 'max_depth':20}

# Create list of max_depth values
max_depths = [2,5,10,20]
best_rmse = []

# Systematically vary the max_depth
for curr_val in max_depths:

    params["max_depth"] = curr_val
    
    # Perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix,
                        params=params,
                        metrics='rmse',
                        seed=123, 
                        num_boost_round=10,
                        nfold=2,
                        early_stopping_rounds=5 ,
                        as_pandas=True)
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].iloc[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(max_depths, best_rmse)),columns=["max_depth","best_rmse"]))

   max_depth     best_rmse
0          2  37957.468750
1          5  35596.599610
2         10  36065.546875
3         20  36739.578125


### Tuning colsample_bytree
- Tuning `colsample_bytree`.
- `scikit-learn`'s `RandomForestClassifier` and `RandomForestRegressor` have a related parameter called `max_features`.
- Both parameters control how many features a tree may use, but they are applied at different points: `max_features` limits the features considered at every split, while `colsample_bytree` is the fraction of features sampled once for each tree (xgboost's per-split counterpart is `colsample_bynode`). In xgboost, `colsample_bytree` must be a float greater than 0 and at most 1.
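
XGBoost documents `colsample_bytree` as the subsample ratio of columns used when constructing each tree. The sketch below illustrates that idea in plain Python; it is not XGBoost's sampling code. The function `sample_columns_per_tree`, the feature names, and the rounding rule are assumptions made for the example.

```python
import random


def sample_columns_per_tree(feature_names, colsample_bytree, num_trees, seed=0):
    """Draw one random feature subset per tree."""
    rng = random.Random(seed)
    # Assumption: round the column count down, but always keep at least one.
    n_keep = max(1, int(colsample_bytree * len(feature_names)))
    return [sorted(rng.sample(feature_names, n_keep)) for _ in range(num_trees)]


# Hypothetical feature names, used only for illustration.
features = ["f0", "f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8", "f9"]
subsets = sample_columns_per_tree(features, colsample_bytree=0.5, num_trees=3)
for tree_idx, cols in enumerate(subsets):
    print(f"tree {tree_idx}: {cols}")
```

Each tree sees a different half of the columns, which is why values well below 1 can act as a regularizer, while very small values (such as 0.1) may hide too many useful features from each tree.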

In [16]:
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X,label=y)

# Create the parameter dictionary
params={"objective":"reg:squarederror","max_depth":3}

# Create list of hyperparameter values: colsample_bytree_vals
colsample_bytree_vals = [0.1,0.5,0.8,1]
best_rmse = []

# Systematically vary the hyperparameter value 
for curr_val in colsample_bytree_vals:

    params['colsample_bytree'] = curr_val
    
    # Perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,
                 num_boost_round=10, early_stopping_rounds=5,
                 metrics="rmse", as_pandas=True, seed=123)
    
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].iloc[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(colsample_bytree_vals, best_rmse)), columns=["colsample_bytree","best_rmse"]))

   colsample_bytree     best_rmse
0               0.1  40918.115235
1               0.5  35813.906250
2               0.8  35995.679688
3               1.0  35836.044922


### Review of grid search and random search

#### Grid search with XGBoost
- Take parameter tuning to the next level with scikit-learn's grid search and random search, which run internal cross-validation through the `GridSearchCV` and `RandomizedSearchCV` classes.
- `GridSearchCV` exhaustively evaluates every combination in a collection of possible parameter values, so multiple parameters are tuned simultaneously.
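
A rough sketch of what "every combination" means: the candidates are the Cartesian product of the value lists, and each candidate is fitted once per cross-validation fold. The snippet uses only the standard library and the same grid values as the grid-search cell in this section; it illustrates the counting and is not how scikit-learn is implemented internally.

```python
from itertools import product

# Grid values taken from the grid-search cell in this section.
gbm_param_grid = {
    "colsample_bytree": [0.3, 0.7],
    "n_estimators": [50],
    "max_depth": [2, 5],
}

# Build one dict per combination of parameter values.
names = list(gbm_param_grid)
candidates = [dict(zip(names, values))
              for values in product(*gbm_param_grid.values())]

cv_folds = 4
total_fits = len(candidates) * cv_folds

for candidate in candidates:
    print(candidate)
print(len(candidates), "candidates x", cv_folds, "folds =", total_fits, "fits")
```

The number of candidates is the product of the list lengths, so adding values to any parameter multiplies the total work.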

In [21]:
# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
    'colsample_bytree': [0.3, 0.7],
    'n_estimators': [50],
    'max_depth': [2, 5]
}

# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor()

# Perform grid search: grid_mse
grid_mse = GridSearchCV(estimator=gbm,
                        param_grid= gbm_param_grid,
                        scoring='neg_mean_squared_error',
                        cv=4,
                        verbose=1)


# Fit grid_mse to the data
grid_mse.fit(X, y)

# Print the best parameters and lowest RMSE
print("Best parameters found: ", grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))

Fitting 4 folds for each of 4 candidates, totalling 16 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Best parameters found:  {'colsample_bytree': 0.3, 'max_depth': 5, 'n_estimators': 50}
Lowest RMSE found:  28986.18703093561


[Parallel(n_jobs=1)]: Done  16 out of  16 | elapsed:    2.7s finished


#### Random search with XGBoost
- `GridSearchCV` can be really time consuming,
- so in practice you may use `RandomizedSearchCV` instead, which evaluates only `n_iter` randomly sampled parameter combinations.
- The key difference in the call is that you specify a `param_distributions` parameter instead of a `param_grid` parameter.
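
The idea behind random search can be sketched with the standard library: build the full list of candidates, then evaluate only a random sample of them. This is a simplified illustration that assumes every parameter is given as a finite list of values, so candidates can be drawn without replacement; the seed and variable names are arbitrary choices for the example.

```python
import random
from itertools import product

# Same search space as the random-search cell in this section.
param_distributions = {
    "n_estimators": [25],
    "max_depth": list(range(2, 12)),
}

# Enumerate every possible candidate (10 in total here).
names = list(param_distributions)
all_candidates = [dict(zip(names, values))
                  for values in product(*param_distributions.values())]

# Evaluate only n_iter of them, chosen at random without replacement.
n_iter = 5
rng = random.Random(123)  # arbitrary seed so the sample is repeatable
sampled = rng.sample(all_candidates, n_iter)

print(len(all_candidates), "possible candidates,", len(sampled), "sampled")
for candidate in sampled:
    print(candidate)
```

The cost is controlled by `n_iter` rather than by the size of the search space, at the price of possibly missing the best combination.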

In [28]:

# Create the parameter grid: gbm_param_grid 
gbm_param_grid = {
    'n_estimators': [25],
    'max_depth': range(2, 12)
}

# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor(n_estimators=10)

# Perform random search: randomized_mse
randomized_mse = RandomizedSearchCV(param_distributions=gbm_param_grid,                                           
                                    estimator=gbm,
                                    scoring= 'neg_mean_squared_error',
                                    n_iter=5,
                                    cv=4,
                                    verbose =1)


# Fit randomized_mse to the data
randomized_mse.fit(X,y)

# Print the best parameters and lowest RMSE
print("Best parameters found: ", randomized_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(randomized_mse.best_score_)))

Fitting 4 folds for each of 10 candidates, totalling 40 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Best parameters found:  {'n_estimators': 25, 'max_depth': 4}
Lowest RMSE found:  29998.4522530019


[Parallel(n_jobs=1)]: Done  40 out of  40 | elapsed:    5.8s finished
