### XGBoost
Testing XGBoost. XGBoost provides regularization hyper parameters to guard against overfitting.
See 
- https://towardsdatascience.com/xgboost-python-example-42777d01001e
- https://towardsdatascience.com/machine-learning-part-18-boosting-algorithms-gradient-boosting-in-python-ef5ae6965be4


#### Partition
Given a tree, calculate the optimal gain (non-negative) given left-right split across a dataset.

*Gain = (G_left^2 / H_left + lambda) + (G_right^2 / H_right + lamba) - (G_right + G_left)^2/(H_left + H_right + lambda) - gamma*

- G => sum of residuals
- H => # of residuals
- lambda => regularization, redudce sensitivity to individual observations
- gamma => regularization, minimum loss reguired to advance partition


#### Calculate output at leaf-nodes
Calculate output for iterative learning.

*Output = (Sum of residuals) / (# of residuals + lambda)*

Then

*Predication = initial_predication + learning_rate * Output*


#### Learning
Use the residuals from above to create another decision and repeat the process.

The final prediction is:

*final_prediction = initial_prediction + learning_rate * prediction_1 + learning_rate * prediction_2 + learning_rate ...*


In [29]:
import pandas as pd
import xgboost as xgb
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [30]:
# Load example data into pandas

# Load california housing, apparently the boston-housing dataset has ethical issus
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target)

In [31]:
X.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25


In [32]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [33]:
# configure xgboost
regressor = xgb.XGBRegressor(
    n_estimators=50,
    reg_lambda=1,
    gamma=0,
    max_depth=3
)

In [34]:
regressor.fit(X_train, y_train)
pd.DataFrame(regressor.feature_importances_.reshape(1, -1), columns=housing.feature_names)

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,0.554588,0.069037,0.042508,0.019759,0.017297,0.152479,0.056506,0.087826


In [35]:
y_pred = regressor.predict(X_test)

In [36]:
## Check result, convert pandas series to list, zip with prediction and print out first few entries
y_test_list = y_test.to_list()

combined = zip(y_test, y_pred)
list(combined)[0: 5]

[(2.929, 2.3515584),
 (3.333, 3.5468132),
 (2.524, 2.7701283),
 (1.597, 1.6726933),
 (2.16, 2.7044654)]

#### vanilla gradient boost
Now do the same thing with vanilla gradient boost and check the results

In [40]:
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
regressor = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=3,
    learning_rate=1.0
)
regressor.fit(X_train, y_train)

In [41]:
errors = [mean_squared_error(y_test, y_pred) for y_pred in regressor.staged_predict(X_test)]
best_n_estimators = np.argmin(errors)

# Remake using optimal numbers
best_regressor = GradientBoostingRegressor(
    max_depth=2,
    n_estimators=best_n_estimators,
    learning_rate=1.0
)
best_regressor.fit(X_train, y_train)

In [42]:
y_pred_vanilla = best_regressor.predict(X_test)

In [48]:
from sklearn.metrics import mean_absolute_error

print(f"Vanilla {mean_absolute_error(y_test, y_pred_vanilla)}")
print(f"XGBoost {mean_absolute_error(y_test, y_pred)}")

Vanilla 0.5823327559359661
XGBoost 0.3619540448257128


In [49]:
combined2 = zip(y_test, y_pred, y_pred_vanilla)
list(combined2)[0: 5]

[(2.929, 2.3515584, 1.998267964565343),
 (3.333, 3.5468132, 4.196437363415366),
 (2.524, 2.7701283, 1.998267964565343),
 (1.597, 1.6726933, 1.998267964565343),
 (2.16, 2.7044654, 1.998267964565343)]