# Introduction


**What?** Building a gradient boosting algorithm from scratch



# Import modules

In [35]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error as MSE
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor

# Load dataset

In [3]:
df_bikes = pd.read_csv('../DATASETS/bike_rentals_cleaned.csv')
df_bikes.head(3)

Unnamed: 0,instant,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,cnt
0,1,1.0,0.0,1,0.0,6.0,0.0,2,0.344167,0.363625,0.805833,0.160446,985
1,2,1.0,0.0,1,0.0,0.0,0.0,2,0.363478,0.353739,0.696087,0.248539,801
2,3,1.0,0.0,1,0.0,1.0,1.0,1,0.196364,0.189405,0.437273,0.248309,1349


# Data pre-processing

In [7]:
# Split data into X and y
X_bikes = df_bikes.iloc[:,:-1]
y_bikes = df_bikes.iloc[:,-1]

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_bikes, y_bikes, random_state=2)

# Build a gradient boosting model from scratch in 7 steps!


- The initial decision tree, called a base learner, should not be fine-tuned for accuracy.
- We want a model that focuses on learning from errors, not a model that relies heavily on the base learner.



In [9]:
# Step 1 - Initialize Decision Tree Regressor and fit to training data
tree_1 = DecisionTreeRegressor(max_depth=2, random_state=2)
tree_1.fit(X_train, y_train)

DecisionTreeRegressor(max_depth=2, random_state=2)


- Make predictions with the training set: Instead of making predictions with the test set, predictions in gradient boosting are initially made with the training set.
- Why? To compute the residuals, we need to compare the predictions with the known targets while still in the training phase.
- The test phase of the model build comes at the end, after all the trees have been constructed.



In [10]:
# Step 2 - Make predictions on training set
y_train_pred = tree_1.predict(X_train)


- Compute the residuals: The residuals are the differences between the target column and the predictions (target minus prediction).
- The residuals are stored as y2_train because they are the new target column for the next tree.



In [12]:
# Step 3 - Compute residuals
y2_train = y_train - y_train_pred
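
As a small worked example of the residual computation, the sketch below uses three made-up target values and predictions (the numbers and the names `y_true` and `y_hat` are hypothetical, not taken from the bike rentals data). It only illustrates the sign convention of Step 3.

```python
import numpy as np

# Hypothetical toy values, not taken from the bike rentals data
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 30.0])

# Residual = target minus prediction, the same order used in Step 3
residuals = y_true - y_hat

# An over-prediction gives a negative residual, an under-prediction a positive one
print(residuals)
```

A residual of zero means the prediction already matches the target, so the next tree has nothing left to correct for that row.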


- Fit the new tree on the residuals: Fitting a new tree on the residuals is different from fitting a model on the original target. The primary difference is in what is predicted: the new tree predicts the error left by the previous tree, not the rental count itself.
- In the bike rentals dataset, the residuals should get progressively smaller in magnitude as new trees are fitted.



In [13]:
# Step 4 - Initialize Decision Tree Regressor and fit tree to the residuals
tree_2 = DecisionTreeRegressor(max_depth=2, random_state=2)
tree_2.fit(X_train, y2_train)

DecisionTreeRegressor(max_depth=2, random_state=2)


- Repeat steps 2-4: As the process continues, the residuals should gradually approach 0 from the positive and negative direction. 
- This process may continue for dozens, hundreds, or thousands of trees. 
- Under normal circumstances, you would certainly keep going. 
- It will take more than a few trees to transform a weak learner into a strong learner. 
- Since our goal is to understand how gradient boosting works behind the scenes, however, we will move on now that the general idea has been covered.



In [15]:
# Step 5 - Repeat steps from 2 to 4

# Make predictions on training set
y2_train_pred = tree_2.predict(X_train)
# Compute residuals
y3_train = y2_train - y2_train_pred
# Initialize Decision Tree Regressor
tree_3 = DecisionTreeRegressor(max_depth=2, random_state=2)
# Fit tree to training data
tree_3.fit(X_train, y3_train)

DecisionTreeRegressor(max_depth=2, random_state=2)
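
Steps 2 to 4 can also be written as a loop. The following is a minimal sketch, assuming synthetic data from `make_regression` in place of the bike rentals CSV. The names `X_demo`, `y_demo`, `n_trees` and `residual_rmse` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): the notebook uses the bike rentals CSV
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=2)

n_trees = 5  # hypothetical choice for illustration
trees = []
residual_rmse = []
target = y_demo.copy()  # the first tree is fit on the original target

for _ in range(n_trees):
    tree = DecisionTreeRegressor(max_depth=2, random_state=2)
    tree.fit(X_demo, target)
    trees.append(tree)
    # The residuals become the target for the next tree
    target = target - tree.predict(X_demo)
    # Track the size of the training residuals after this tree
    residual_rmse.append(float(np.sqrt(np.mean(target ** 2))))

print(residual_rmse)
```

On the training set the recorded values should not increase from one tree to the next, which is the "residuals approach 0" behaviour described above.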


- Sum the results: Summing the results requires making predictions for each tree with the test set. 
- Since the predictions are positive and negative differences, summing the predictions should result in predictions that are closer to the target.
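
Why summing works can be seen by unrolling the residual definitions from Steps 3 and 5. The symbols here are shorthand introduced for this note, not notation from the original: $f_1, f_2, f_3$ stand for the predictions of tree_1, tree_2 and tree_3, and $r_k$ for the residuals left after tree $k$ (so $r_1$ is y2_train and $r_2$ is y3_train).

$$
\begin{aligned}
r_1 &= y - f_1(x) \\
r_2 &= r_1 - f_2(x) \\
r_3 &= r_2 - f_3(x) \\
\Rightarrow \quad y &= f_1(x) + f_2(x) + f_3(x) + r_3
\end{aligned}
$$

On the training set, the sum of the three trees' predictions therefore differs from the target only by the final residual $r_3$. On the test set the same sum is used as the prediction, with no guarantee that the error shrinks in the same way.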



In [17]:
# Step 6 - Sum the results
y1_pred = tree_1.predict(X_test)
y2_pred = tree_2.predict(X_test)
y3_pred = tree_3.predict(X_test)
y_pred = y1_pred + y2_pred + y3_pred

In [18]:
# Step 7 - Compute root mean squared error (rmse)
MSE(y_test, y_pred)**0.5

911.0479538776444

# Build a gradient boosting model via scikit-learn


- Let's see whether we can obtain the same result as in the previous section using scikit-learn's GradientBoostingRegressor. 
- To obtain the same results, it's essential to match max_depth=2 and random_state=2. 
- Furthermore, since there are only three trees, we must have n_estimators=3. 
- Finally, we must set the learning_rate=1.0 hyperparameter. 
- As you can see, the result is practically the same!



In [21]:
# Instantiate the method
gbr = GradientBoostingRegressor(max_depth=2, n_estimators=3, random_state=2, learning_rate=1.0)
# Fit on the training data
gbr.fit(X_train, y_train)
# Predict test data
y_pred = gbr.predict(X_test)
# Compute root mean squared error (rmse)
MSE(y_test, y_pred)**0.5

911.0479538776439
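
The comparison can be reproduced without the CSV file. The sketch below assumes synthetic data from `make_regression` and compares the predictions element by element. The names `X_demo`, `scratch_pred` and `gbr_demo` are illustrative. The two approaches are expected to agree up to floating-point rounding, barring ties between equally good splits.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): the notebook uses the bike rentals CSV
X_demo, y_demo = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=2)

# From scratch: three depth-2 trees, each fit on the previous residuals
target = y_tr.copy()
scratch_pred = np.zeros(len(y_te))
for _ in range(3):
    tree = DecisionTreeRegressor(max_depth=2, random_state=2).fit(X_tr, target)
    target = target - tree.predict(X_tr)
    scratch_pred += tree.predict(X_te)

# scikit-learn: same depth, same number of trees, no shrinkage
gbr_demo = GradientBoostingRegressor(max_depth=2, n_estimators=3, random_state=2, learning_rate=1.0)
sklearn_pred = gbr_demo.fit(X_tr, y_tr).predict(X_te)

# Largest absolute difference between the two sets of test predictions
max_gap = float(np.max(np.abs(scratch_pred - sklearn_pred)))
print(max_gap)
```

With `learning_rate=1.0` each tree's correction is added in full, which is what the manual sum in Step 6 does.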


- Recall that the point of gradient boosting is to build a model with enough trees to transform a weak learner into a strong learner. 
- This is easily done by changing n_estimators, the number of iterations, to a much larger number.
- Let's build and score gradient boosting regressors with 30 and then 300 estimators:



In [22]:
gbr = GradientBoostingRegressor(max_depth=2, n_estimators=30, random_state=2, learning_rate=1.0)
gbr.fit(X_train, y_train)
y_pred = gbr.predict(X_test)
MSE(y_test, y_pred)**0.5

857.1072323426944

In [23]:
gbr = GradientBoostingRegressor(max_depth=2, n_estimators=300, random_state=2, learning_rate=1.0)
gbr.fit(X_train, y_train)
y_pred = gbr.predict(X_test)
MSE(y_test, y_pred)**0.5

936.3617413678853
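
The score is better with 30 trees than with 300, so more trees do not always help at this learning rate. One way to look at this is `staged_predict`, which yields the ensemble's prediction after each added tree. The sketch below assumes synthetic data from `make_regression`. The names `stage_rmse` and `best_n_trees` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): the notebook uses the bike rentals CSV
X_demo, y_demo = make_regression(n_samples=300, n_features=4, noise=20.0, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=2)

gbr_demo = GradientBoostingRegressor(max_depth=2, n_estimators=50, random_state=2, learning_rate=1.0)
gbr_demo.fit(X_tr, y_tr)

# staged_predict yields the prediction after 1, 2, ..., n_estimators trees
stage_rmse = [mean_squared_error(y_te, stage_pred) ** 0.5
              for stage_pred in gbr_demo.staged_predict(X_te)]

# Number of trees with the lowest held-out RMSE (index is zero-based, so add 1)
best_n_trees = int(np.argmin(stage_rmse)) + 1
print('Lowest RMSE at', best_n_trees, 'trees:', stage_rmse[best_n_trees - 1])
```

In practice the number of trees would be chosen on a separate validation split or by cross-validation, not on the test set. The test set is used here only to keep the sketch short.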


- Now, we changed learning_rate without saying much about it. 
- So, what happens if we remove learning_rate=1.0 and use the scikit-learn defaults?



In [25]:
gbr = GradientBoostingRegressor(max_depth=2, n_estimators=300, random_state=2)
gbr.fit(X_train, y_train)
y_pred = gbr.predict(X_test)
MSE(y_test, y_pred)**0.5

653.7456840231495

# Modifying gradient boosting hyperparameters


- learning_rate, also known as shrinkage, scales down the contribution of each individual tree so that no single tree has too much influence when building the model.
- If an entire ensemble is built from the errors of one base learner, without careful adjustment of hyperparameters, early trees in the model can have too much influence on subsequent development.



In [26]:
learning_rate_values = [0.001, 0.01, 0.05, 0.1, 0.15, 0.2, 0.3, 0.5, 1.0]
for value in learning_rate_values:
    gbr = GradientBoostingRegressor(max_depth=2, n_estimators=300, random_state=2, learning_rate=value)
    gbr.fit(X_train, y_train)
    y_pred = gbr.predict(X_test)
    rmse = MSE(y_test, y_pred)**0.5
    print('Learning Rate:', value, ', Score:', rmse)

Learning Rate: 0.001 , Score: 1633.0261400367258
Learning Rate: 0.01 , Score: 831.5430182728547
Learning Rate: 0.05 , Score: 685.0192988749717
Learning Rate: 0.1 , Score: 653.7456840231495
Learning Rate: 0.15 , Score: 687.666134269379
Learning Rate: 0.2 , Score: 664.312804425697
Learning Rate: 0.3 , Score: 689.4190385930236
Learning Rate: 0.5 , Score: 693.8856905068778
Learning Rate: 1.0 , Score: 936.3617413678853
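
To make the shrinkage idea concrete, here is a minimal from-scratch sketch in which each tree's correction is multiplied by the learning rate before being added. It assumes synthetic data and starts from the mean of the target as a constant first guess. The function `boosted_train_rmse` is hypothetical and is not scikit-learn's implementation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): the notebook uses the bike rentals CSV
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=2)

def boosted_train_rmse(learning_rate, n_trees=20):
    """Return the training RMSE after each tree for a given learning rate."""
    # Start from the mean of the target as a constant first guess (assumption)
    prediction = np.full(len(y_demo), y_demo.mean())
    history = []
    for _ in range(n_trees):
        residuals = y_demo - prediction
        tree = DecisionTreeRegressor(max_depth=2, random_state=2).fit(X_demo, residuals)
        # Shrinkage: only a fraction of each tree's correction is added
        prediction = prediction + learning_rate * tree.predict(X_demo)
        history.append(float(np.sqrt(np.mean((y_demo - prediction) ** 2))))
    return history

for lr in (0.1, 1.0):
    print('learning_rate', lr, 'final training RMSE', boosted_train_rmse(lr)[-1])
```

A small learning rate lowers the training error in smaller steps, so it usually needs more trees. This matches the poor score for 0.001 with only 300 estimators in the results above.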


In [27]:
depths = [None, 1, 2, 3, 4]
for depth in depths:
    gbr = GradientBoostingRegressor(max_depth=depth, n_estimators=300, random_state=2)
    gbr.fit(X_train, y_train)
    y_pred = gbr.predict(X_test)
    rmse = MSE(y_test, y_pred)**0.5
    print('Max Depth:', depth, ', Score:', rmse)

Max Depth: None , Score: 869.2783041945797
Max Depth: 1 , Score: 707.8261886858736
Max Depth: 2 , Score: 653.7456840231495
Max Depth: 3 , Score: 646.4045923317708
Max Depth: 4 , Score: 663.048387855927



- subsample sets the fraction of samples used to build each tree.
- Since samples are the rows, a subset of rows means that not all rows are necessarily included when building each tree.
- By changing subsample from 1.0 to a smaller decimal, each tree is built on only that fraction of the samples.
- For example, subsample=0.8 would select 80% of samples for each tree.



In [28]:
samples = [1, 0.9, 0.8, 0.7, 0.6, 0.5]
for sample in samples:
    gbr = GradientBoostingRegressor(max_depth=3, n_estimators=300, subsample=sample, random_state=2)
    gbr.fit(X_train, y_train)
    y_pred = gbr.predict(X_test)
    rmse = MSE(y_test, y_pred)**0.5
    print('Subsample:', sample, ', Score:', rmse)

Subsample: 1 , Score: 646.4045923317708
Subsample: 0.9 , Score: 620.1819001443569
Subsample: 0.8 , Score: 617.2355650565677
Subsample: 0.7 , Score: 612.9879156983139
Subsample: 0.6 , Score: 622.6385116402317
Subsample: 0.5 , Score: 626.9974073227554
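
The row subsampling described above can be sketched from scratch. This is an illustration under stated assumptions, not scikit-learn's implementation: the data is synthetic, rows are drawn without replacement, and the learning rate of 0.1 and the names `rng`, `n_rows` and `rows_used` are choices made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): the notebook uses the bike rentals CSV
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=2)

rng = np.random.default_rng(2)
subsample = 0.8
n_rows = int(round(subsample * len(y_demo)))  # 160 of the 200 rows

prediction = np.full(len(y_demo), y_demo.mean())
rows_used = []
for _ in range(3):
    residuals = y_demo - prediction
    # Draw a fresh random subset of rows, without replacement, for this tree
    idx = rng.choice(len(y_demo), size=n_rows, replace=False)
    tree = DecisionTreeRegressor(max_depth=2, random_state=2).fit(X_demo[idx], residuals[idx])
    # The fitted tree still updates the prediction for every row
    prediction = prediction + 0.1 * tree.predict(X_demo)
    rows_used.append(len(np.unique(idx)))

print(rows_used)
```

Because every tree sees a different subset, the trees are less alike, which is one common explanation for the lower test error at subsample values below 1 in the results above.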


# RandomizedSearchCV


- We can use what we learned above to fix some starting values for a more thorough search.
- With 27 possible combinations of hyperparameters, we use RandomizedSearchCV to try 10 of these combinations in the hopes of finding a good model. 
- While 27 combinations are feasible with GridSearchCV, at some point you will end up with too many possibilities and RandomizedSearchCV will become essential.
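
The count of 27 comes from three hyperparameters with three candidate values each. The sketch below checks this with `ParameterGrid` and draws 10 combinations with `ParameterSampler`, both from `sklearn.model_selection`. The names `params_demo` and `sampled` are illustrative, and the draw is not claimed to match the combinations that RandomizedSearchCV tries in the notebook.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

params_demo = {'subsample': [0.65, 0.7, 0.75],
               'n_estimators': [300, 500, 1000],
               'learning_rate': [0.05, 0.075, 0.1]}

# Every combination of the three lists: 3 * 3 * 3
n_combinations = len(ParameterGrid(params_demo))

# A random draw of 10 combinations, similar in spirit to n_iter=10
sampled = list(ParameterSampler(params_demo, n_iter=10, random_state=2))

print(n_combinations, len(sampled))
for combo in sampled[:3]:
    print(combo)
```

A grid search would fit all 27 candidates per fold, while the randomized search fits only the sampled ones, which is where the time saving comes from as the grid grows.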



In [30]:
params = {'subsample': [0.65, 0.7, 0.75],
          'n_estimators': [300, 500, 1000],
          'learning_rate': [0.05, 0.075, 0.1]}

# Import RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV

gbr = GradientBoostingRegressor(max_depth=3, random_state=2)


# Instantiate RandomizedSearchCV as rand_reg
rand_reg = RandomizedSearchCV(gbr, params, n_iter=10, scoring='neg_mean_squared_error', 
                              cv=5, n_jobs=-1, random_state=2)

# Fit rand_reg on X_train and y_train
rand_reg.fit(X_train, y_train)
# Extract best estimator
best_model = rand_reg.best_estimator_
# Extract best params
best_params = rand_reg.best_params_
# Print best params
print("Best params:", best_params)
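# Compute best score: square root of the mean cross-validated MSE
# (this is a cross-validation score, printed below under the label "Training score")
best_score = np.sqrt(-rand_reg.best_score_)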

# Print best score
print("Training score: {:.3f}".format(best_score))
# Predict test set labels
y_pred = best_model.predict(X_test)
# Compute rmse_test
rmse_test = MSE(y_test, y_pred)**0.5
# Print rmse_test
print('Test set score: {:.3f}'.format(rmse_test))

Best params: {'subsample': 0.65, 'n_estimators': 300, 'learning_rate': 0.05}
Training score: 636.200
Test set score: 625.985


In [34]:
# After a few rounds of experimentation, we obtained the following model.
gbr = GradientBoostingRegressor(max_depth=3, n_estimators=1600, subsample=0.75, learning_rate=0.02, random_state=2)
gbr.fit(X_train, y_train)
y_pred = gbr.predict(X_test)
print("Test error: ", MSE(y_test, y_pred)**0.5)

y_pred1 = gbr.predict(X_train)
print("Train error: ", MSE(y_train, y_pred1)**0.5)

Test error:  596.9544588974487
Train error:  159.90255545058218


# Regressor in XGBoost


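- XGBoost is often preferred over scikit-learn's gradient boosting because it is an optimized, regularized implementation that tends to train faster and frequently scores better, although this is not guaranteed on every dataset.
- In that sense it can be seen as a refined version of gradient boosting.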



In [36]:
# Instantiate the XGBRegressor, xg_reg
xg_reg = XGBRegressor(max_depth=3, n_estimators=1600, eta=0.02, subsample=0.75, random_state=2)

# Fit xg_reg to training set
xg_reg.fit(X_train, y_train)

# Predict labels of test set, y_pred
y_pred = xg_reg.predict(X_test)

# Compute root mean squared error (rmse)
MSE(y_test, y_pred)**0.5

584.339544309016

# References


- Corey Wade. "Hands-On Gradient Boosting with XGBoost and scikit-learn"
- https://github.com/PacktPublishing/Hands-On-Gradient-Boosting-with-XGBoost-and-Scikit-learn
    
