# XGBoost

Notes from [DataCamp's XGBoost Course](https://campus.datacamp.com/courses/extreme-gradient-boosting-with-xgboost/classification-with-xgboost?ex=5)

* Ideal when you have 1k+ training samples and fewer than 100 features
* But in general if you have fewer features than training samples it'll be fine
* Does well with mix of numeric and categorical features, or just numeric features
* Not ideal for computer vision or NLP (deep learning is better)
* Not ideal when you have fewer than 100 training examples, or when the number of training examples is much smaller than the number of features
* XGBoost is an ensemble model: it combines many individual models, called base learners, into a single prediction. Each base learner should be good at distinguishing or predicting a different part of the dataset. There are two kinds of base learners: tree and linear.


In [None]:
import xgboost as xgb
import pandas as pd
import numpy as np

## Classification

In [None]:
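from sklearn.model_selection import train_test_split

# churn_data is a DataFrame whose last column is the label
X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123)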

xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)
xg_cl.fit(X_train, y_train)

preds = xg_cl.predict(X_test)
accuracy = float(np.sum(preds == y_test)) / y_test.shape[0]

print("accuracy: %f" % (accuracy))

## Simple Decision Tree

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

dt_clf_4 = DecisionTreeClassifier(max_depth=4)
dt_clf_4.fit(X_train, y_train)

y_pred_4 = dt_clf_4.predict(X_test)

accuracy = float(np.sum(y_pred_4==y_test))/y_test.shape[0]
print("accuracy:", accuracy)

## Boosting

* Weak learners are ML algorithms that are only slightly better than chance
* With boosting, we can convert a collection of weak learners into a strong learner
* Strong learners can be tuned to achieve high performance
* To boost, we iteratively train weak models on subsets of the data, then weight each weak model's prediction according to its performance. Combining the weighted predictions gives a single prediction that is much better than any of the individual ones.
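
To make the idea concrete, here is a minimal pure-Python sketch of AdaBoost-style boosting with decision stumps. It is not what XGBoost does internally (XGBoost uses gradient boosting), and it reweights the samples each round instead of drawing subsets. The toy dataset and all function names are made up for illustration.

```python
import math

# Toy 1-D dataset, made up for illustration: no single threshold separates the classes
X = list(range(10))
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]


def stump_predict(x, threshold, polarity):
    """Weak learner: predict `polarity` below the threshold and `-polarity` above it."""
    return polarity if x < threshold else -polarity


def fit_stump(X, y, weights):
    """Return (weighted_error, threshold, polarity) for the best decision stump."""
    best = None
    for threshold in [v + 0.5 for v in range(-1, 10)]:
        for polarity in (1, -1):
            error = sum(w for x, label, w in zip(X, y, weights)
                        if stump_predict(x, threshold, polarity) != label)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(X, y, n_rounds):
    """Train n_rounds stumps, reweighting the samples after each round."""
    weights = [1.0 / len(X)] * len(X)
    ensemble = []  # (alpha, threshold, polarity) per weak learner
    for _ in range(n_rounds):
        error, threshold, polarity = fit_stump(X, y, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - error) / error)  # better stumps get a larger say
        ensemble.append((alpha, threshold, polarity))
        # Misclassified samples gain weight, correctly classified ones lose weight
        weights = [w * math.exp(-alpha * label * stump_predict(x, threshold, polarity))
                   for x, label, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def ensemble_predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score > 0 else -1


def accuracy(ensemble):
    return sum(ensemble_predict(ensemble, x) == label for x, label in zip(X, y)) / len(X)


print("1 stump: ", accuracy(adaboost(X, y, n_rounds=1)))
print("3 stumps:", accuracy(adaboost(X, y, n_rounds=3)))
```

On this toy data the best single stump gets 7 of 10 points right, while the weighted vote of three stumps classifies all 10 correctly.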

## DMatrix

* XGBoost gets its performance and efficiency gains by utilizing its own optimized data structure for datasets called a DMatrix.
* If we instantiate XGBClassifier like in the first example, the inputs are converted to DMatrix objects automatically
* But if we want to use XGBoost's built-in cross-validation, we have to create the DMatrix ourselves

In [None]:
X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]

churn_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary: params
params = {"objective":"reg:logistic", "max_depth":3}

# Perform cross-validation: cv_results
# "nfold" is the number of cross-validation folds
# "num_boost_round" is the number of trees we want to build
# "metrics" is the metric you want to compute (this will be "error", which we will convert to an accuracy)
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, 
                  nfold=3, num_boost_round=5, 
                  metrics="error", as_pandas=True, seed=123)

print(cv_results)

print(((1 - cv_results["test-error-mean"]).iloc[-1]))

# Could also use `metrics="auc"` and check cv_results["test-auc-mean"]
# Or `metrics="rmse"` and check cv_results["test-rmse-mean"]
# Or `metrics="mae"` and check cv_results["test-mae-mean"]
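
As a rough picture of what `nfold` means, the sketch below splits the sample indices into folds, evaluates on each fold in turn, and averages the per-fold error, which is then turned into an accuracy as above. It ignores shuffling and stratification, and the "model" is a stand-in majority-class predictor, not XGBoost. The labels and the helper name `kfold_indices` are hypothetical.

```python
def kfold_indices(n_samples, nfold):
    """Split range(n_samples) into nfold contiguous (train, test) index pairs."""
    fold_sizes = [n_samples // nfold + (1 if i < n_samples % nfold else 0)
                  for i in range(nfold)]
    splits, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        splits.append((train, test))
        start += size
    return splits


# Hypothetical binary labels, made up for illustration
labels = [0, 0, 1, 0, 1, 1, 0, 0, 1]

fold_errors = []
for train_idx, test_idx in kfold_indices(len(labels), nfold=3):
    train_labels = [labels[i] for i in train_idx]
    # Stand-in "model": always predict the majority class of the training folds
    majority = 1 if sum(train_labels) * 2 > len(train_labels) else 0
    n_wrong = sum(labels[i] != majority for i in test_idx)
    fold_errors.append(n_wrong / len(test_idx))

mean_error = sum(fold_errors) / len(fold_errors)
print("per-fold error:", fold_errors)
print("accuracy:", 1 - mean_error)
```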

## Regression

* Objective (aka loss) functions quantify how far off the predictions are from the actual results
* For any ML algorithm we want to minimize the loss function value
* XGBoost has its own naming conventions for objectives:
  * Regression: `reg:linear` (newer XGBoost releases deprecate this name in favor of `reg:squarederror`)
  * Binary classification when you just need a decision (and not a probability), per the course: `reg:logistic`
  * Binary classification when you want the probability: `binary:logistic`
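
For intuition, here is a small pure-Python sketch of the two losses behind these objectives: mean squared error for regression and logistic (log) loss for binary classification. The function names and sample values are illustrative, not XGBoost APIs.

```python
import math


def squared_error(y_true, y_pred):
    """Mean squared error: the loss behind the squared-error regression objective."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean logistic loss for 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


print(squared_error([3.0, 5.0], [2.0, 7.0]))  # (1 + 4) / 2
print(log_loss([1, 0], [0.9, 0.1]))           # confident and right: small loss
print(log_loss([1, 0], [0.6, 0.4]))           # less confident: larger loss
```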

## Base Learners

### 1. Tree Base Learner

* Uses decision trees as the base model
* Boosted model is a weighted sum of decision trees (which are nonlinear)
* Almost exclusively used in XGBoost
* Hyperparameters:
  * learning rate/eta - how quickly the model fits the residual error using base learners. Lower values need more boosting rounds.
  * gamma - for tree-based learners. Minimum loss reduction for a split to occur. Higher values lead to fewer splits.
  * alpha - l1 regularization on leaf weights; larger values mean more regularization. Not a penalty on feature weights as is the case in linear or logistic regression. Higher alpha values cause many leaf weights to go to zero. Values to try: 1, 10, and 100.
  * lambda - l2 regularization on leaf weights. Much smoother than l1: causes leaf weights to smoothly decrease.
  * max_depth - how deeply each tree can grow
  * subsample - the fraction of the training data that is used during any given boosting round. If very low, can lead to underfitting; if very high, can lead to overfitting.
  * colsample_bytree - the fraction of features that can be selected from during any given boosting round. A large value means almost all features can be used in any boosting round. In general, smaller values can be thought of as providing additional regularization. Using all columns will tend to overfit.
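
To see how gamma, alpha, and lambda act on a tree, here is a hedged pure-Python sketch based on the leaf-weight and split-gain formulas from the XGBoost paper, with the l1 term handled by soft-thresholding. `G` and `H` are the sums of gradients and Hessians of the samples in a leaf. The function names and numbers are illustrative, not part of the XGBoost API, and the library's internals may differ in detail.

```python
def soft_threshold(g, alpha):
    """Shrink g toward zero by alpha (the effect of an l1 penalty)."""
    if g > alpha:
        return g - alpha
    if g < -alpha:
        return g + alpha
    return 0.0


def leaf_weight(G, H, reg_lambda=1.0, alpha=0.0):
    """Optimal leaf weight given gradient sum G and Hessian sum H."""
    return soft_threshold(-G, alpha) / (H + reg_lambda)


def leaf_score(G, H, reg_lambda, alpha):
    return soft_threshold(G, alpha) ** 2 / (H + reg_lambda)


def split_gain(GL, HL, GR, HR, reg_lambda=1.0, alpha=0.0, gamma=0.0):
    """Loss reduction from splitting a node into left/right children, minus gamma."""
    parent = leaf_score(GL + GR, HL + HR, reg_lambda, alpha)
    children = leaf_score(GL, HL, reg_lambda, alpha) + leaf_score(GR, HR, reg_lambda, alpha)
    return 0.5 * (children - parent) - gamma


# lambda shrinks the leaf weight smoothly
for lam in [0.0, 2.0, 10.0]:
    print("lambda", lam, "->", leaf_weight(G=-4.0, H=2.0, reg_lambda=lam))

# alpha pushes small leaf weights all the way to zero
for a in [0.0, 2.0, 5.0]:
    print("alpha", a, "->", leaf_weight(G=-4.0, H=2.0, reg_lambda=0.0, alpha=a))

# gamma: a split is only worth making if its gain stays positive
for g in [0.0, 10.0]:
    print("gamma", g, "->", split_gain(-4.0, 2.0, 4.0, 2.0, reg_lambda=0.0, gamma=g))
```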

In [None]:
from sklearn.metrics import mean_squared_error

# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123)

# By default, XGBoost uses trees as base learners, so you don't have to specify
# booster="gbtree" here.
# "reg:squarederror" is the current name for the deprecated "reg:linear" objective
xg_reg = xgb.XGBRegressor(objective="reg:squarederror", n_estimators=10, seed=123)

# Fit the regressor to the training set
xg_reg.fit(X_train, y_train)

# Predict the labels of the test set: preds
preds = xg_reg.predict(X_test)

# Compute the rmse: rmse
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

### 2. Linear Base Learner

* Sum of linear terms
* Boosted model is a weighted sum of linear models (which is itself linear)
* Rarely used, because you don't get any nonlinear combinations (interactions) of features. Performance is similar to a regularized linear model.
* Hyperparameters:
  * lambda: l2 regularization on weights
  * alpha: l1 regularization on weights
  * lambda_bias: l2 regularization applied to the model bias
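
A quick way to convince yourself that a weighted sum of linear models is itself linear: add up the coefficients and you get one equivalent linear model. The sketch below is only an illustration of that point, not how the `gblinear` booster is implemented. The base learners and `eta` are made-up values.

```python
# Three hypothetical linear base learners, each (weights, bias) over two features
base_learners = [([0.5, -1.0], 0.2), ([0.25, 0.5], -0.1), ([1.0, 0.0], 0.4)]
eta = 0.5  # shrinkage applied to every base learner


def predict_linear(weights, bias, x):
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict_boosted(x):
    """Prediction of the ensemble: shrunken sum of all base learners."""
    return sum(eta * predict_linear(w, b, x) for w, b in base_learners)


# Collapse the whole ensemble into a single linear model
combined_weights = [eta * sum(w[j] for w, _ in base_learners) for j in range(2)]
combined_bias = eta * sum(b for _, b in base_learners)

x = [2.0, 3.0]
print(predict_boosted(x))
print(predict_linear(combined_weights, combined_bias, x))  # same value up to rounding
```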


In [None]:
# To use a linear base learner with xgb.train(), set the booster type in the parameter dictionary

# Convert the training and testing sets into DMatrix objects: DM_train, DM_test
DM_train = xgb.DMatrix(data=X_train, label=y_train)
DM_test = xgb.DMatrix(data=X_test, label=y_test)

# Create the parameter dictionary: params
# ("reg:squarederror" is the current name for the deprecated "reg:linear" objective)
params = {"booster":"gblinear", "objective":"reg:squarederror"}

# Train the model: xg_reg
xg_reg = xgb.train(params=params, dtrain=DM_train, num_boost_round=5)

# Predict the labels of the test set: preds
preds = xg_reg.predict(DM_test)

# Compute and print the RMSE
rmse = np.sqrt(mean_squared_error(y_test,preds))
print("RMSE: %f" % (rmse))

## Regularization

* Regularization controls model complexity
* We want models that are as accurate and as simple as possible. We penalize models that are too complex.
* For XGBoost we use gamma, lambda, and alpha (see hyperparameter list above)
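
One way to write this down, following the regularized objective in the XGBoost paper (the $\alpha$ term is the l1 penalty exposed by the library, so treat the exact form as a sketch):

$$
\text{Obj} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|
$$

Here $l$ is the loss, $f_k$ is the $k$-th tree, $T$ is its number of leaves, and $w_j$ are its leaf weights. gamma penalizes each extra leaf, while lambda and alpha penalize large leaf weights.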

## Hyperparameter tuning

### Manually tuning lambda example

In [None]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

reg_params = [1, 10, 100]

# Create the initial parameter dictionary for varying l2 strength: params
# ("reg:squarederror" is the current name for the deprecated "reg:linear" objective)
params = {"objective":"reg:squarederror", "max_depth":3}

# Create an empty list for storing rmses as a function of l2 complexity
rmses_l2 = []

# Iterate over reg_params
for reg in reg_params:

    # Update l2 strength
    params["lambda"] = reg
    
    # Pass this updated param dictionary into cv
    cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2, num_boost_round=5, metrics="rmse", as_pandas=True, seed=123)
    
    # Append best rmse (final round) to rmses_l2
    rmses_l2.append(cv_results_rmse["test-rmse-mean"].tail(1).values[0])

# Look at best rmse per l2 param
print("Best rmse as a function of l2:")
print(pd.DataFrame(list(zip(reg_params, rmses_l2)), columns=["l2", "rmse"]))

### Manually tuning eta example

In [None]:
# Create your housing DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree (boosting round)
# ("reg:squarederror" is the current name for the deprecated "reg:linear" objective)
params = {"objective":"reg:squarederror", "max_depth":3}

# Create list of eta values and empty list to store final round rmse per xgboost model
eta_vals = [0.001, 0.01, 0.1]
best_rmse = []

# Systematically vary the eta 
for curr_val in eta_vals:

    params["eta"] = curr_val
    
    # Perform cross-validation: cv_results
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3, num_boost_round=10, early_stopping_rounds=5, metrics="rmse", seed=123, as_pandas=True)

    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(eta_vals, best_rmse)), columns=["eta","best_rmse"]))
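
Why very small eta values tend to look bad with only 10 boosting rounds: each round only adds `eta` times what the base learner fitted, so the residual shrinks slowly. Below is a toy pure-Python sketch of gradient boosting on squared error where every base learner is just a constant (the mean residual). The targets, the starting prediction of 0, and the function name are assumptions for illustration, not XGBoost behavior.

```python
def boost_constant_learner(targets, eta, n_rounds):
    """Boost a constant base learner on squared error and return the final RMSE."""
    predictions = [0.0] * len(targets)  # assumed starting prediction
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(targets, predictions)]
        # "Fit" the base learner: the best constant for squared error is the mean residual
        step = sum(residuals) / len(residuals)
        # Shrink the learner's contribution by eta before adding it to the model
        predictions = [p + eta * step for p in predictions]
    mse = sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)
    return mse ** 0.5


targets = [8.0, 10.0, 12.0]  # hypothetical target values
for eta in [0.001, 0.01, 0.1]:
    print(eta, round(boost_constant_learner(targets, eta, n_rounds=10), 3))
```

After `k` rounds the mean residual is `(1 - eta) ** k` times its starting value, so a smaller eta needs many more rounds to reach the same error.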

## Grid Search

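* Tries every combination of the listed values, so the number of models grows exponentially with the number of hyperparameters being tuned and the search can be slow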

In [None]:
from sklearn.model_selection import GridSearchCV

# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
    'colsample_bytree': [0.3, 0.7],
    'n_estimators': [50],
    'max_depth': [2, 5]
}

# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor()

# Perform grid search: grid_mse
grid_mse = GridSearchCV(param_grid=gbm_param_grid, estimator=gbm, scoring="neg_mean_squared_error", cv=4, verbose=1)

# Fit grid_mse to the data
grid_mse.fit(X, y)

# Print the best parameters and lowest RMSE
print("Best parameters found: ", grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))
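
To see how quickly the cost adds up, the sketch below enumerates the combinations in the grid above with `itertools.product` and counts the cross-validation fits. It only counts; it does not train anything.

```python
from itertools import product

gbm_param_grid = {
    'colsample_bytree': [0.3, 0.7],
    'n_estimators': [50],
    'max_depth': [2, 5]
}
cv = 4

# Every combination of one value per hyperparameter
names = list(gbm_param_grid)
combinations = [dict(zip(names, values))
                for values in product(*gbm_param_grid.values())]

for combo in combinations:
    print(combo)

n_fits = len(combinations) * cv
print(len(combinations), "candidates x", cv, "folds =", n_fits, "fits")
```

Adding a third value to each of the two tuned parameters would already give 3 x 1 x 3 = 9 candidates, or 36 fits with 4 folds.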

## Random Search

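* Tries only a fixed number (`n_iter`) of randomly drawn hyperparameter combinations instead of all of them, so it is cheaper than grid search, but you are left hoping one of the random draws is a good one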

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
    'n_estimators': [25],
    'max_depth': range(2, 12)
}

# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor(n_estimators=10)

# Perform random search: randomized_mse
randomized_mse = RandomizedSearchCV(param_distributions=gbm_param_grid, estimator=gbm, scoring="neg_mean_squared_error", n_iter=5, cv=4, verbose=1)

# Fit randomized_mse to the data
randomized_mse.fit(X, y)

# Print the best parameters and lowest RMSE
print("Best parameters found: ", randomized_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(randomized_mse.best_score_)))
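
The idea behind the random draw can be sketched in a few lines: list the candidate combinations and sample `n_iter` of them. This is a simplified stand-in for what `RandomizedSearchCV` does, sampling without replacement from the same search space as above. The seed value is an arbitrary choice for reproducibility.

```python
import random
from itertools import product

gbm_param_grid = {
    'n_estimators': [25],
    'max_depth': range(2, 12)
}
n_iter = 5

# All 10 possible combinations (1 value of n_estimators x 10 values of max_depth)
names = list(gbm_param_grid)
all_combinations = [dict(zip(names, values))
                    for values in product(*gbm_param_grid.values())]

rng = random.Random(123)  # arbitrary seed, for reproducibility only
sampled = rng.sample(all_combinations, n_iter)  # draw without replacement

print("trying", len(sampled), "of", len(all_combinations), "combinations:")
for combo in sampled:
    print(combo)
```

Half of the candidate depths are never tried here, which is the trade-off: fewer fits, but the best setting may be among the ones that were skipped.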

## Plotting trees

In [None]:
import matplotlib.pyplot as plt

xg_reg = xgb.train(params=params, dtrain=housing_dmatrix, num_boost_round=10)

# "They provide insight into how the model arrived at its final decisions and 
# what splits it made to arrive at those decisions. This allows us to identify
# which features are the most important in determining house price. "

# num_trees is the index of the tree to plot
# 0 = first tree, 1 = second tree, etc
# (plot_tree needs the graphviz package to be installed)
xgb.plot_tree(xg_reg, num_trees=0)
plt.show()

## Plotting feature importance

In [None]:
xgb.plot_importance(xg_reg)
plt.show()

## Pipeline

In [None]:
# Import necessary modules
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Fill LotFrontage missing values with 0
X.LotFrontage = X.LotFrontage.fillna(0)

# Set up the pipeline steps: steps
# ("reg:squarederror" is the current name for the deprecated "reg:linear" objective)
steps = [("ohe_onestep", DictVectorizer(sparse=False)),
         ("xgb_model", xgb.XGBRegressor(max_depth=2, objective="reg:squarederror"))]

# Create the pipeline: xgb_pipeline
xgb_pipeline = Pipeline(steps)

# Cross-validate the model
cross_val_scores = cross_val_score(xgb_pipeline, X.to_dict("records"), y, scoring="neg_mean_squared_error", cv=10)

# Print the 10-fold RMSE
print("10-fold RMSE: ", np.mean(np.sqrt(np.abs(cross_val_scores))))
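
The `DictVectorizer` step is what turns the list of row dictionaries from `X.to_dict("records")` into a numeric matrix: numeric values pass through, and each string value becomes its own 0/1 column named `feature=value`. The pure-Python sketch below mimics that behavior on two made-up records. `Neighborhood` and its values are hypothetical examples, and `one_hot_records` is an illustrative helper, not a scikit-learn function.

```python
def one_hot_records(records):
    """Mimic DictVectorizer: numbers pass through, strings become 0/1 columns."""
    feature_names = set()
    for record in records:
        for key, value in record.items():
            feature_names.add(f"{key}={value}" if isinstance(value, str) else key)
    feature_names = sorted(feature_names)
    index = {name: i for i, name in enumerate(feature_names)}

    rows = []
    for record in records:
        row = [0.0] * len(feature_names)
        for key, value in record.items():
            if isinstance(value, str):
                row[index[f"{key}={value}"]] = 1.0
            else:
                row[index[key]] = float(value)
        rows.append(row)
    return feature_names, rows


# Made-up records in the shape produced by X.to_dict("records")
records = [
    {"LotFrontage": 65.0, "Neighborhood": "CollgCr"},
    {"LotFrontage": 0.0, "Neighborhood": "Veenker"},
]

names, matrix = one_hot_records(records)
print(names)
for row in matrix:
    print(row)
```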