# Gradient Boosting

### Gradient boosting is a boosting method that combines several weak learners into a single strong learner. At each step, a new estimator h is added to improve the current model: Fn+1(X) = Fn(X) + h(X). Ideally Fn+1(X) = y, so h is fit to the residuals of the current model: h(X) = y - Fn(X).
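
The "gradient" in the name can be made concrete with a short derivation. The notebook does not state a loss function, so the squared-error loss below is an assumption; under it, the residual that h is fit to is exactly the negative gradient of the loss with respect to the current prediction:

```latex
\begin{align}
L\bigl(y, F_n(x)\bigr) &= \tfrac{1}{2}\,\bigl(y - F_n(x)\bigr)^2 \\
-\frac{\partial L}{\partial F_n(x)} &= y - F_n(x) \\
F_{n+1}(x) &= F_n(x) + h(x), \qquad h(x) \approx y - F_n(x)
\end{align}
```

Read this way, each new estimator takes a step in function space in the direction that reduces the loss, which is why fitting residuals counts as a form of gradient descent.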


In [21]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_moons

# Generating a dataset
X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)

# Splitting the dataset into training and test samples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

## Creating an ensemble of 3 trees
### Determining h1, h2, and h3 so that F(X) = h1(X) + h2(X) + h3(X)

In [22]:
# Creating an ensemble of 3 trees
tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X_train, y_train)

# Training a second DecisionTreeRegressor on the residual errors from the first predictor
y2 = y_train - tree_reg1.predict(X_train)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X_train, y2)

# Training a third DecisionTreeRegressor on the residuals made by the second predictor
y3 = y2 - tree_reg2.predict(X_train)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X_train, y3)

# Make predictions on new instances by adding up all of the predictions of all the trees
y_pred_manual = sum(tree.predict(X_test) for tree in (tree_reg1, tree_reg2, tree_reg3))


## The same approach using sklearn's GradientBoostingRegressor class (here with 120 trees) rather than coding it manually

In [23]:
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120, learning_rate=1.0)
gbrt.fit(X_train, y_train)
y_pred = gbrt.predict(X_test)
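
One difference from the manual version is worth seeing in code: with its default settings, `GradientBoostingRegressor` starts from a constant initial prediction (the mean of the training targets for squared-error loss) and adds the trees on top of it, whereas the manual ensemble starts directly from the first tree. The sketch below assumes a three-tree model for comparison; the name `gbrt_small` and the `random_state` value are illustrative choices, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Same dataset and split as in the notebook
X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Three trees to mirror the manual ensemble (assumed for illustration)
gbrt_small = GradientBoostingRegressor(
    max_depth=2, n_estimators=3, learning_rate=1.0, random_state=42
)
gbrt_small.fit(X_train, y_train)

# Constant starting prediction learned by the default init estimator
init_pred = np.ravel(gbrt_small.init_.predict(X_test))

# estimators_ has shape (n_estimators, 1) for regression; sum the trees by hand.
# With learning_rate=1.0 no extra scaling of the tree outputs is needed.
tree_sum = sum(tree.predict(X_test) for tree in gbrt_small.estimators_[:, 0])
manual_sum = init_pred + tree_sum

print(np.allclose(manual_sum, gbrt_small.predict(X_test)))
```

With a learning rate other than 1.0, each tree's contribution would be multiplied by that rate before being added.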


## Optimizing the number of estimators to be used

In [24]:
# staged_predict yields the predictions after each boosting stage (1 tree, 2 trees, ...)
errors = [mean_squared_error(y_test, y_pred_stage) for y_pred_stage in gbrt.staged_predict(X_test)]
# Index 0 corresponds to 1 tree, so add 1 to get the number of trees
best_n_estimator = int(np.argmin(errors)) + 1

# Creating a new regressor with the optimal number of trees
gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=best_n_estimator, learning_rate=1.0)
gbrt_best.fit(X_train, y_train)
y_pred_best = gbrt_best.predict(X_test)
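
The selection above picks the number of trees by looking at the test set, so the reported test error of the chosen model is likely to be optimistic. A common alternative is to choose the number of trees on a separate validation split and keep the test set for the final score only. The sketch below shows one way this could look; the extra split and the names `X_fit`, `X_val`, `gbrt_val`, and `best_n_val` are assumptions added for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Same dataset and split as in the notebook
X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Hypothetical extra split: hold out part of the training data for validation
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)

gbrt_val = GradientBoostingRegressor(
    max_depth=2, n_estimators=120, learning_rate=1.0, random_state=42
)
gbrt_val.fit(X_fit, y_fit)

# Validation MSE after each boosting stage; index 0 corresponds to 1 tree
val_errors = [
    mean_squared_error(y_val, stage_pred)
    for stage_pred in gbrt_val.staged_predict(X_val)
]
best_n_val = int(np.argmin(val_errors)) + 1

# Refit on the full training set with the chosen size, then score once on the test set
gbrt_val_best = GradientBoostingRegressor(
    max_depth=2, n_estimators=best_n_val, learning_rate=1.0, random_state=42
)
gbrt_val_best.fit(X_train, y_train)
test_mse = mean_squared_error(y_test, gbrt_val_best.predict(X_test))
print(best_n_val, test_mse)
```

`GradientBoostingRegressor` also has built-in early stopping through its `n_iter_no_change` and `validation_fraction` parameters, which may be simpler when a manual sweep is not needed.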

## Results

In [25]:
print("MSE_manual: ", mean_squared_error(y_test, y_pred_manual))
print("MSE: ", mean_squared_error(y_test, y_pred))
print("MSE_best: ", mean_squared_error(y_test, y_pred_best))
print("best_n_estimator: ", best_n_estimator)

MSE_manual:  0.108687010459
MSE:  0.114829985941
MSE_best:  0.114829985941
best_n_estimator:  2


In [26]:
### The manual implementation has the worst MSE because it does not benefit from 