### Tuning the number of boosting rounds

In [None]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree: params 
params = {"objective":"reg:squarederror", "max_depth":3}

# Create list of number of boosting rounds
num_rounds = [5, 10, 15]

# Empty list to store final round rmse per XGBoost model
final_rmse_per_round = []

# Iterate over num_rounds and build one model per num_boost_round parameter
for curr_num_rounds in num_rounds:

    # Perform cross-validation: cv_results
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3,
                        num_boost_round=curr_num_rounds, metrics="rmse",
                        as_pandas=True, seed=123)
    
    # Append final round RMSE
    final_rmse_per_round.append(cv_results["test-rmse-mean"].iloc[-1])

# Print the resultant DataFrame
num_rounds_rmses = list(zip(num_rounds, final_rmse_per_round))
print(pd.DataFrame(num_rounds_rmses,columns=["num_boosting_rounds","rmse"]))

   num_boosting_rounds          rmse
0                    5  50903.298177
1                   10  34774.191406
2                   15  32895.098958

### Automated boosting round selection using early_stopping


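Instead of hand-picking the number of boosting rounds, you can have XGBoost select it for you within xgb.cv() using a technique called early stopping. After every boosting round, the model is evaluated on the hold-out folds; if the hold-out metric ("rmse" in this case) has not improved for early_stopping_rounds consecutive rounds, training stops. If num_boost_round is reached first, training simply ends there.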
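
The stopping rule itself can be illustrated without XGBoost. The following is a minimal sketch in plain Python; the helper name `rounds_until_stop` and the list of hold-out scores are made up for illustration, and the logic mirrors the idea behind early_stopping_rounds rather than XGBoost's actual implementation.

```python
def rounds_until_stop(holdout_scores, patience):
    """Return (best_round, rounds_run) for a lower-is-better metric.

    Training stops once `patience` rounds have passed without any
    improvement on the best hold-out score seen so far.
    """
    best_round = 0
    best_score = float("inf")
    for round_idx, score in enumerate(holdout_scores):
        if score < best_score:
            best_round, best_score = round_idx, score
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds: stop here
            return best_round, round_idx + 1
    # The round budget ran out before the stopping rule triggered
    return best_round, len(holdout_scores)


# Hypothetical hold-out RMSE per boosting round (illustrative values only)
scores = [10.0, 8.0, 7.0, 7.5, 7.2, 7.1, 6.9]

print(rounds_until_stop(scores, patience=3))
print(rounds_until_stop(scores, patience=5))
```

With `patience=3` the loop stops before it reaches the improvement in the last round, while `patience=5` keeps going long enough to find it. A patience value that is too small can therefore end training prematurely.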

In [None]:
# Create your housing DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X,label=y)

# Create the parameter dictionary for each tree: params
params = {"objective":"reg:squarederror", "max_depth":4}

# Perform cross-validation with early stopping: cv_results
cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3,
                    num_boost_round=50, early_stopping_rounds=10,
                    metrics="rmse", as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

    train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
0     141871.635417      403.636200   142640.656250     705.559400
1     103057.033854       73.772531   104907.664062     111.112417
2      75975.966146      253.726099    79262.057291     563.767892
3      57420.532552      521.658345    61620.134115    1087.692518
4      44552.955729      544.170190    50437.561198    1846.446330
5      35763.946615      681.797429    43035.661458    2034.469207
6      29861.463542      769.572072    38600.880208    2169.796232
7      25994.676432      756.520565    36071.817708    2109.795430
8      23306.836588      759.238254    34383.184896    1934.546688
9      21459.770182      745.625311    33509.141276    1887.375284
10     20148.721354      749.612769    32916.808594    1850.894249
11     19215.382162      641.387014    32197.832682    1734.456935
12     18627.388672      716.257510    31770.852865    1802.155484
13     17960.694661      557.043073    31482.781250    1779.124592
14     17559.736328      631.412555    31389.992188    1892.321520
15     17205.712565      590.171852    31302.882162    1955.165902
16     16876.571940      703.631755    31234.058594    1880.705796
17     16597.662110      703.677609    31318.348308    1828.860391
18     16330.460937      607.274494    31323.634766    1775.910706
19     16005.972331      520.470326    31204.134766    1739.077059
20     15814.301432      518.604477    31089.862630    1756.021674
21     15493.405924      505.615987    31047.996094    1624.673955
22     15270.733724      502.019237    31056.916015    1668.042812
23     15086.381836      503.912899    31024.984375    1548.985605
24     14917.608399      486.206187    30983.684896    1663.130201
25     14709.589518      449.668010    30989.477214    1686.666560
26     14457.286133      376.787666    30952.113932    1613.172643
27     14185.567383      383.102234    31066.902344    1648.534310
28     13934.067383      473.465256    31095.640625    1709.225072
29     13749.644857      473.670302    31103.886719    1778.879529
30     13549.836914      454.898141    30976.085287    1744.514533
31     13413.485351      399.603618    30938.469401    1746.052597
32     13275.916341      415.408908    30931.001953    1772.469510
33     13085.877930      493.792427    30929.057291    1765.540568
34     12947.181640      517.789318    30890.630208    1786.511479
35     12846.027018      547.732021    30884.492187    1769.728223
36     12702.379232      505.523221    30833.541667    1691.002563
37     12532.243490      508.298594    30856.688151    1771.446377
38     12384.054362      536.225108    30818.016276    1782.784214
39     12198.444010      545.165502    30839.392578    1847.327022
40     12054.583659      508.841772    30776.966146    1912.780507
41     11897.036458      477.177568    30794.702474    1919.674832
42     11756.221680      502.992782    30780.955078    1906.820029
43     11618.846029      519.836706    30783.755860    1951.260705
44     11484.080404      578.427828    30776.731120    1953.446309
45     11356.553060      565.368827    30758.543620    1947.454953
46     11193.558594      552.298906    30729.971354    1985.700237
47     11071.315429      604.089960    30732.663411    1966.997822
48     10950.778646      574.863135    30712.241536    1957.751573
49     10824.865885      576.665756    30720.854818    1950.511514

### Tuning eta

It's time to practice tuning other XGBoost hyperparameters in earnest and observe their effect on model performance. We'll begin with "eta", also known as the learning rate.

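The learning rate in XGBoost is a parameter that can range between 0 and 1. It scales the contribution of each new tree before that tree is added to the model: lower values of "eta" shrink each tree's contribution more strongly, which makes boosting more conservative (stronger regularization) but typically requires more boosting rounds to reach the same fit.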
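
To make the shrinkage effect concrete, here is a toy sketch in plain Python. It assumes a squared-error setting and a deliberately trivial weak learner (one that predicts the mean residual for every row); the function name `rmse_after_boosting` and the target values are hypothetical, and this is not how XGBoost builds its trees.

```python
def rmse_after_boosting(targets, eta, num_rounds):
    """Boost a constant predictor and return the training RMSE."""
    predictions = [0.0] * len(targets)
    for _ in range(num_rounds):
        residuals = [t - p for t, p in zip(targets, predictions)]
        # Toy weak learner: predicts the mean residual for every row
        update = sum(residuals) / len(residuals)
        # eta shrinks the update before it is added to the model
        predictions = [p + eta * update for p in predictions]
    squared_errors = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return (sum(squared_errors) / len(targets)) ** 0.5


targets = [100.0, 200.0, 300.0]  # illustrative values only
for eta in [0.001, 0.01, 0.1, 1.0]:
    print(eta, round(rmse_after_boosting(targets, eta, num_rounds=10), 2))
```

With only 10 rounds, the smallest eta values barely move the predictions away from their starting point, so the error stays high. This is worth keeping in mind when comparing very small eta values under a small num_boost_round budget.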

In [None]:
# Create your housing DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree (boosting round)
params = {"objective":"reg:squarederror", "max_depth":3}

# Create list of eta values and empty list to store final round rmse per xgboost model
eta_vals = [0.001, 0.01, 0.1]
best_rmse = []

# Systematically vary the eta
for curr_val in eta_vals:

    params["eta"] = curr_val
    
    # Perform cross-validation: cv_results
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3,
                        num_boost_round=10, early_stopping_rounds=5,
                        metrics="rmse", as_pandas=True, seed=123)
    
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].iloc[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(eta_vals, best_rmse)), columns=["eta","best_rmse"]))

### Tuning max_depth

Next, tune max_depth, the parameter that sets the maximum depth each tree can grow to in a boosting round. Smaller values produce shallower trees, and larger values produce deeper trees that can fit more complex patterns but are more prone to overfitting.
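
As a rough guide to how quickly tree capacity grows with depth, the short sketch below computes the upper bound on leaves per tree for a few depth values. The bound follows from the trees being binary. Real trees are usually much smaller, because a node is split only when the split is worthwhile and there is enough data to split.

```python
max_depths = [2, 5, 10, 20]

# A binary tree of depth d has at most 2**d leaves
leaf_limits = {depth: 2 ** depth for depth in max_depths}

for depth, max_leaves in leaf_limits.items():
    print(f"max_depth={depth:>2} -> at most {max_leaves} leaves")
```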

In [None]:
# Create your housing DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X,label=y)

# Create the parameter dictionary
params = {"objective":"reg:squarederror"}

# Create list of max_depth values
max_depths = [2, 5, 10, 20]
best_rmse = []

# Systematically vary the max_depth
for curr_val in max_depths:

    params["max_depth"] = curr_val
    
    # Perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,
                 num_boost_round=10, early_stopping_rounds=5,
                 metrics="rmse", as_pandas=True, seed=123)
    
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].iloc[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(max_depths, best_rmse)),columns=["max_depth","best_rmse"]))

### Tuning colsample_bytree


Now tune "colsample_bytree". You may already be familiar with a similar parameter if you have worked with scikit-learn's RandomForestClassifier or RandomForestRegressor, where it is called max_features. Both parameters limit the features a tree may use, but they act at different points: scikit-learn's max_features draws a new subset of features at every split, whereas XGBoost's colsample_bytree draws one subset per tree (the per-split equivalent in XGBoost is colsample_bynode). In xgboost, colsample_bytree must be specified as a float greater than 0 and at most 1.
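
The sampling idea can be sketched in plain Python. According to the XGBoost documentation, colsample_bytree subsamples the columns once for every tree that is constructed. The helper `sample_columns`, the feature names, and the rounding rule below are assumptions made for illustration and are not XGBoost's internal code.

```python
import random


def sample_columns(columns, fraction, rng):
    """Pick a random subset holding roughly `fraction` of the columns."""
    # Keep at least one column; the exact rounding is an assumption of this sketch
    n_keep = max(1, int(fraction * len(columns)))
    return sorted(rng.sample(columns, n_keep))


# Hypothetical feature names, for illustration only
columns = ["f0", "f1", "f2", "f3", "f4", "f5", "f6", "f7"]
rng = random.Random(123)

# One draw per tree: every split in that tree chooses among the same subset
for tree_idx in range(3):
    print(tree_idx, sample_columns(columns, fraction=0.5, rng=rng))
```

Because each tree sees a different subset, the trees become less correlated with one another. With a very small fraction, however, an individual tree may never see the most informative features.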

In [None]:
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X,label=y)

# Create the parameter dictionary
params = {"objective":"reg:squarederror", "max_depth":3}

# Create list of hyperparameter values
colsample_bytree_vals = [0.1, 0.5, 0.8, 1]
best_rmse = []

# Systematically vary the hyperparameter value 
for curr_val in colsample_bytree_vals:

    params["colsample_bytree"] = curr_val
    
    # Perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,
                 num_boost_round=10, early_stopping_rounds=5,
                 metrics="rmse", as_pandas=True, seed=123)
    
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].iloc[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(colsample_bytree_vals, best_rmse)), columns=["colsample_bytree","best_rmse"]))

   colsample_bytree     best_rmse
0               0.1  48193.451172
1               0.5  36013.544922
2               0.8  35932.962891
3               1.0  35836.044922

### Grid search with XGBoost

In [None]:
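# Imports used by the grid search
import numpy as np
from sklearn.model_selection import GridSearchCV

# Create the parameter grid: gbm_param_grid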
gbm_param_grid = {
    'colsample_bytree': [0.3, 0.7],
    'n_estimators': [50],
    'max_depth': [2, 5]
}

# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor()

# Perform grid search: grid_mse
grid_mse = GridSearchCV(estimator=gbm, param_grid=gbm_param_grid,
                        scoring='neg_mean_squared_error', cv=4, verbose=1)
grid_mse.fit(X, y)

# Print the best parameters and lowest RMSE
print("Best parameters found: ", grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))

Best parameters found:  {'colsample_bytree': 0.7, 'max_depth': 5, 'n_estimators': 50}
Lowest RMSE found:  29916.562522854438

### Random search with XGBoost

GridSearchCV can be very time-consuming because it evaluates every combination in the grid, so in practice you may want to use RandomizedSearchCV instead, as you will do in this exercise. RandomizedSearchCV evaluates only n_iter randomly drawn combinations. The good news is that you only need a few modifications to your GridSearchCV code: the key difference is that you specify a param_distributions parameter instead of a param_grid parameter.
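
The cost difference between the two strategies can be estimated in plain Python before running anything. The sketch below uses the search space and settings from this exercise (n_iter=5, cv=4). It only mimics the idea of exhaustive versus sampled search and is not scikit-learn's implementation.

```python
import itertools
import random

param_grid = {
    "n_estimators": [25],
    "max_depth": list(range(2, 12)),
}

# Grid search: every combination of the listed values
names = sorted(param_grid)
all_combinations = [
    dict(zip(names, values))
    for values in itertools.product(*(param_grid[name] for name in names))
]

# Random search: only n_iter combinations, drawn without replacement
n_iter = 5
rng = random.Random(123)
sampled_combinations = rng.sample(all_combinations, n_iter)

cv = 4  # number of folds, so each candidate is fitted cv times
print("grid search fits:  ", len(all_combinations) * cv)
print("random search fits:", len(sampled_combinations) * cv)
```

Here the grid has 10 combinations, so an exhaustive search would need 40 fits, while the random search needs 20 (not counting the final refit on the full data). The trade-off is that the random search may miss the best combination.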

In [None]:
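# Imports used by the random search
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

# Create the parameter grid: gbm_param_grid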
gbm_param_grid = {
    'n_estimators': [25],
    'max_depth': range(2, 12)
}

# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor(n_estimators=10)

# Perform random search: randomized_mse
randomized_mse = RandomizedSearchCV(estimator=gbm, param_distributions=gbm_param_grid,
                                    n_iter=5, scoring='neg_mean_squared_error', cv=4, verbose=1)
randomized_mse.fit(X, y)

# Print the best parameters and lowest RMSE
print("Best parameters found: ",randomized_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(randomized_mse.best_score_)))

Best parameters found:  {'n_estimators': 25, 'max_depth': 6}
Lowest RMSE found:  36909.98213965752