## Gradient boosting

Random forests are an example of an 'ensemble method'. 

**Ensemble methods** combine the predictions of several models, e.g., several trees in the case of random forests. 

Gradient boosting is another type of ensemble method, one that iteratively adds models to an ensemble.

The current ensemble is used to generate predictions for each observation in the dataset. To make a prediction, the predictions from all models in the ensemble are added together. A loss function is then calculated from these predictions.

Then, the loss function is used to fit a new model to be added to the ensemble. Gradient descent on the loss function determines the parameters of the new model.

The new model is then added to the ensemble and the process is repeated. 
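The cycle above can be sketched in a few lines of pure Python. This is only an illustrative toy, not how XGBoost is implemented: it assumes a squared-error loss (for which the negative gradient is simply the residual), a single feature, and one-split "stumps" as the models. The names `fit_stump` and `gradient_boost`, the `learning_rate` shrinkage factor, and the toy data are all made up for this example.

```python
def fit_stump(x, residuals):
    """Fit a one-split tree to the residuals by least squares (needs >= 2 distinct x values)."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def mse(y, preds):
    """Mean squared error between targets and predictions."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


def gradient_boost(x, y, n_rounds=20, learning_rate=0.1):
    base = sum(y) / len(y)           # start the ensemble with a constant prediction
    stumps = []
    preds = [base] * len(y)
    losses = [mse(y, preds)]
    for _ in range(n_rounds):
        # For squared error, the negative gradient of the loss is the residual.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)      # fit a new model to the residuals
        stumps.append(stump)                 # add it to the ensemble
        preds = [pi + learning_rate * stump(xi) for xi, pi in zip(x, preds)]
        losses.append(mse(y, preds))

    def predict(xi):
        # The ensemble prediction is the sum of all models' (scaled) predictions.
        return base + learning_rate * sum(s(xi) for s in stumps)

    return predict, losses


# Toy data: a noisy step function.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
predict, losses = gradient_boost(x, y)
print("training loss before/after:", losses[0], losses[-1])
```

The training loss shrinks with every round because each new stump is fitted to whatever error the current ensemble still makes.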

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('../data/melb_data.csv')

# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

# Select target
y = data.Price

# Separate data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y)

XGBoost Python API reference: https://xgboost.readthedocs.io/en/latest/python/python_api.html

In [27]:
from xgboost import XGBRegressor

my_model = XGBRegressor()
my_model.fit(X_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)

In [28]:
from sklearn.metrics import mean_absolute_error

predictions = my_model.predict(X_valid)
# scikit-learn's convention is (y_true, y_pred); MAE is symmetric, so the value is unchanged
print("Mean Absolute Error: " + '{:,}'.format(mean_absolute_error(y_valid, predictions)))

Mean Absolute Error: 271,751.1572210972


## Parameter tuning
`n_estimators` specifies how many times to go through the modeling cycle, and is equal to the number of models included in the ensemble. 

If `n_estimators` is too low, we risk underfitting, and if it is too high we risk overfitting.

In [38]:
my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=500,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)

In [39]:
predictions = my_model.predict(X_valid)
print("Mean Absolute Error: " + '{:,}'.format(mean_absolute_error(y_valid, predictions)))

Mean Absolute Error: 244,125.02549015096


### How to find the ideal value for `n_estimators`?
With `early_stopping_rounds`, we can automatically find the ideal value for `n_estimators`, because early stopping causes the model to stop iterating when the validation score stops improving, even if `n_estimators` hasn't been reached. We can set a high value for `n_estimators` and use `early_stopping_rounds` to find the optimal time to stop iterating. 

By chance, there can be a single round where the validation score doesn't improve, so we need to specify how many consecutive rounds without improvement to allow before stopping. By setting `early_stopping_rounds=5`, we stop after 5 consecutive rounds in which the validation score fails to improve.

When using `early_stopping_rounds`, we're taking the unusual step of calculating validation scores during fitting, as just described, so we need to set aside some data for calculating those scores.
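The stopping rule itself is simple enough to sketch in plain Python. The function below is a hypothetical helper written for illustration (it is not XGBoost's code, and the score list is made up): it walks through a sequence of validation errors, remembers the best one, and stops once `patience` rounds in a row have failed to beat it.

```python
def early_stopping_round(scores, patience=5):
    """Return (best_iteration, last_round_run) for validation errors (lower is better)."""
    best_iteration = 0
    best_score = scores[0]
    for i, score in enumerate(scores):
        if score < best_score:
            # New best score: remember it and reset the patience window.
            best_score, best_iteration = score, i
        elif i - best_iteration >= patience:
            # `patience` rounds have passed without beating the best score.
            return best_iteration, i
    # Ran out of rounds before the stopping rule triggered.
    return best_iteration, len(scores) - 1


# Made-up validation errors: round 3 is worse than round 2, but training continues.
scores = [10, 9, 8, 8.5, 7.9, 8.1, 8.2, 8.0, 8.3, 8.4, 7.0]
print(early_stopping_round(scores, patience=5))  # → (4, 9)
```

Note that the final score of 7.0 is never reached: training stops at round 9, five rounds after the best score at round 4. This is the trade-off behind the patience value: too small and a temporary plateau ends training early, too large and extra rounds are wasted.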

In [45]:
my_model = XGBRegressor(n_estimators=500)

my_model.fit(X_train, y_train, early_stopping_rounds=5,
            eval_set=[(X_valid, y_valid)], verbose=True)

[0]	validation_0-rmse:1.1427e+06
Will train until validation_0-rmse hasn't improved in 5 rounds.
[1]	validation_0-rmse:1.04957e+06
[2]	validation_0-rmse:967366
[3]	validation_0-rmse:896919
[4]	validation_0-rmse:833499
[5]	validation_0-rmse:778415
[6]	validation_0-rmse:729782
[7]	validation_0-rmse:688190
[8]	validation_0-rmse:652339
[9]	validation_0-rmse:621969
[10]	validation_0-rmse:595039
[11]	validation_0-rmse:570100
[12]	validation_0-rmse:551016
[13]	validation_0-rmse:533835
[14]	validation_0-rmse:520226
[15]	validation_0-rmse:507822
[16]	validation_0-rmse:496726
[17]	validation_0-rmse:486550
[18]	validation_0-rmse:479013
[19]	validation_0-rmse:472543
[20]	validation_0-rmse:465436
[21]	validation_0-rmse:459176
[22]	validation_0-rmse:454678
[23]	validation_0-rmse:450637
[24]	validation_0-rmse:447132
[25]	validation_0-rmse:444344
[26]	validation_0-rmse:442032
[27]	validation_0-rmse:439186
[28]	validation_0-rmse:437198
[29]	validation_0-rmse:435299
[30]	validation_0-rmse:432899
...

[262]	validation_0-rmse:380344
[263]	validation_0-rmse:380283
[264]	validation_0-rmse:380168
[265]	validation_0-rmse:380214
[266]	validation_0-rmse:380072
[267]	validation_0-rmse:379867
[268]	validation_0-rmse:379691
[269]	validation_0-rmse:379650
[270]	validation_0-rmse:379513
[271]	validation_0-rmse:379546
[272]	validation_0-rmse:379394
[273]	validation_0-rmse:379307
[274]	validation_0-rmse:379342
[275]	validation_0-rmse:379162
[276]	validation_0-rmse:379176
[277]	validation_0-rmse:379069
[278]	validation_0-rmse:379053
[279]	validation_0-rmse:379027
[280]	validation_0-rmse:378953
[281]	validation_0-rmse:379008
[282]	validation_0-rmse:379000
[283]	validation_0-rmse:378750
[284]	validation_0-rmse:378651
[285]	validation_0-rmse:378473
[286]	validation_0-rmse:378327
[287]	validation_0-rmse:378241
[288]	validation_0-rmse:378239
[289]	validation_0-rmse:378272
[290]	validation_0-rmse:378154
[291]	validation_0-rmse:378107
[292]	validation_0-rmse:378041
[293]	validation_0-rmse:378037
...

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=500,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)

In [46]:
predictions = my_model.predict(X_valid)
print("Mean Absolute Error: " + '{:,}'.format(mean_absolute_error(y_valid, predictions)))

Mean Absolute Error: 247,374.91476752347


In that run, the best iteration was 359, which suggests a value of about 359 for `n_estimators`.

### Learning rate

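`learning_rate` is another parameter worth tuning.

Instead of adding each new model's predictions to the ensemble at full strength, we multiply them by the learning rate first. Each model then contributes less, which helps prevent overfitting. StatQuest provides a good explanation.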

In general, a small learning rate combined with a large number of estimators yields more accurate XGBoost models, though training takes longer because the model goes through more iterations of the cycle. In the version used here, `XGBRegressor` defaults to `learning_rate=0.1`.
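In symbols, the role of the learning rate can be sketched as follows. The notation is introduced here for illustration only: $F_{m-1}(x)$ is the ensemble's prediction after $m-1$ rounds, $h_m(x)$ is the prediction of the newly fitted model, and $\eta$ is `learning_rate`.

$$
F_m(x) = F_{m-1}(x) + \eta \, h_m(x), \qquad 0 < \eta \le 1
$$

With $\eta = 0.1$, each new model contributes only a tenth of its prediction, so more rounds are needed to reach the same fit. That is why a smaller learning rate is usually paired with a larger `n_estimators`.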


In [51]:
%%time
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
my_model.fit(X_train, y_train, 
             early_stopping_rounds=5, 
             eval_set=[(X_valid, y_valid)], 
             verbose=True)

[0]	validation_0-rmse:1.195e+06
Will train until validation_0-rmse hasn't improved in 5 rounds.
[1]	validation_0-rmse:1.1452e+06
[2]	validation_0-rmse:1.09824e+06
[3]	validation_0-rmse:1.05399e+06
[4]	validation_0-rmse:1.01249e+06
[5]	validation_0-rmse:974113
[6]	validation_0-rmse:937231
[7]	validation_0-rmse:903537
[8]	validation_0-rmse:871526
[9]	validation_0-rmse:841147
[10]	validation_0-rmse:812972
[11]	validation_0-rmse:786414
[12]	validation_0-rmse:761652
[13]	validation_0-rmse:738611
[14]	validation_0-rmse:716961
[15]	validation_0-rmse:697022
[16]	validation_0-rmse:678241
[17]	validation_0-rmse:660187
[18]	validation_0-rmse:644124
[19]	validation_0-rmse:628778
[20]	validation_0-rmse:614148
[21]	validation_0-rmse:601000
[22]	validation_0-rmse:589297
[23]	validation_0-rmse:577831
[24]	validation_0-rmse:567657
[25]	validation_0-rmse:557744
[26]	validation_0-rmse:549083
[27]	validation_0-rmse:540133
[28]	validation_0-rmse:532617
[29]	validation_0-rmse:524669
...

[262]	validation_0-rmse:397490
[263]	validation_0-rmse:397476
[264]	validation_0-rmse:397481
[265]	validation_0-rmse:397475
[266]	validation_0-rmse:397420
[267]	validation_0-rmse:397372
[268]	validation_0-rmse:397163
[269]	validation_0-rmse:397159
[270]	validation_0-rmse:396983
[271]	validation_0-rmse:396825
[272]	validation_0-rmse:396787
[273]	validation_0-rmse:396645
[274]	validation_0-rmse:396530
[275]	validation_0-rmse:396455
[276]	validation_0-rmse:396454
[277]	validation_0-rmse:396255
[278]	validation_0-rmse:396059
[279]	validation_0-rmse:395835
[280]	validation_0-rmse:395802
[281]	validation_0-rmse:395704
[282]	validation_0-rmse:395706
[283]	validation_0-rmse:395676
[284]	validation_0-rmse:395552
[285]	validation_0-rmse:395540
[286]	validation_0-rmse:395492
[287]	validation_0-rmse:395544
[288]	validation_0-rmse:395537
[289]	validation_0-rmse:395473
[290]	validation_0-rmse:395276
[291]	validation_0-rmse:395205
[292]	validation_0-rmse:395192
[293]	validation_0-rmse:395196
...

Stopping. Best iteration:
[521]	validation_0-rmse:383520

CPU times: user 3.79 s, sys: 240 ms, total: 4.03 s
Wall time: 6.08 s


XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.05, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=1000,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)

Validation scores stopped improving after iteration 521.

### Number of jobs

For larger datasets, parallelism can be used to build the models faster. 

`n_jobs` can be set to the number of cores on the machine (2 in my case).
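If the core count isn't known in advance, one way to look it up is Python's standard library. This is a small sketch; the variable name `n_cores` is just for illustration.

```python
import os

# os.cpu_count() returns the number of logical CPUs, or None if it cannot be determined,
# so fall back to 1 in that case.
n_cores = os.cpu_count() or 1
print(n_cores)

# The value could then be passed to the model, e.g. XGBRegressor(..., n_jobs=n_cores).
```

Keep in mind that this counts logical CPUs, which may be more than the number of physical cores.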

In [52]:
%%time
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=2)
my_model.fit(X_train, y_train,
            early_stopping_rounds=5,
            eval_set=[(X_valid, y_valid)], verbose=True)

[0]	validation_0-rmse:1.19499e+06
Will train until validation_0-rmse hasn't improved in 5 rounds.
[1]	validation_0-rmse:1.1452e+06
[2]	validation_0-rmse:1.09824e+06
[3]	validation_0-rmse:1.05399e+06
[4]	validation_0-rmse:1.01249e+06
[5]	validation_0-rmse:974112
[6]	validation_0-rmse:937231
[7]	validation_0-rmse:903537
[8]	validation_0-rmse:871527
[9]	validation_0-rmse:841147
[10]	validation_0-rmse:812973
[11]	validation_0-rmse:786414
[12]	validation_0-rmse:761652
[13]	validation_0-rmse:738612
[14]	validation_0-rmse:716962
[15]	validation_0-rmse:697022
[16]	validation_0-rmse:678241
[17]	validation_0-rmse:660188
[18]	validation_0-rmse:644124
[19]	validation_0-rmse:628778
[20]	validation_0-rmse:614147
[21]	validation_0-rmse:601000
[22]	validation_0-rmse:589297
[23]	validation_0-rmse:577831
[24]	validation_0-rmse:567657
[25]	validation_0-rmse:557743
[26]	validation_0-rmse:549083
[27]	validation_0-rmse:540132
[28]	validation_0-rmse:532617
[29]	validation_0-rmse:524669
...

[262]	validation_0-rmse:397491
[263]	validation_0-rmse:397476
[264]	validation_0-rmse:397481
[265]	validation_0-rmse:397476
[266]	validation_0-rmse:397419
[267]	validation_0-rmse:397372
[268]	validation_0-rmse:397163
[269]	validation_0-rmse:397160
[270]	validation_0-rmse:396983
[271]	validation_0-rmse:396825
[272]	validation_0-rmse:396787
[273]	validation_0-rmse:396645
[274]	validation_0-rmse:396530
[275]	validation_0-rmse:396456
[276]	validation_0-rmse:396454
[277]	validation_0-rmse:396254
[278]	validation_0-rmse:396059
[279]	validation_0-rmse:395835
[280]	validation_0-rmse:395802
[281]	validation_0-rmse:395704
[282]	validation_0-rmse:395706
[283]	validation_0-rmse:395676
[284]	validation_0-rmse:395552
[285]	validation_0-rmse:395540
[286]	validation_0-rmse:395493
[287]	validation_0-rmse:395544
[288]	validation_0-rmse:395537
[289]	validation_0-rmse:395473
[290]	validation_0-rmse:395275
[291]	validation_0-rmse:395205
[292]	validation_0-rmse:395192
[293]	validation_0-rmse:395196
...

Stopping. Best iteration:
[521]	validation_0-rmse:383520

CPU times: user 4.92 s, sys: 266 ms, total: 5.19 s
Wall time: 5.8 s


XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.05, max_delta_step=0,
             max_depth=3, min_child_weight=1, missing=None, n_estimators=1000,
             n_jobs=2, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=1, verbosity=1)