In [6]:
import sklearn
import numpy
import xgboost

In this section, we will use the XGBoost library, which makes use of gradient boosted trees for classification and regression. 

The problem with regular trees is often that they lack complexity and sophistication. They're often unable to capture the intricacies of large datasets. We could continuously increase the maximum depth of a decision tree to fit larger datasets, but decision trees with many nodes tend to overfit the data.

Instead, we make use of gradient boosting to combine many decision trees into a single model for classification or regression. Gradient boosting starts with a single decision tree, then adds more decision trees to the overal model that correct the existing model's errors on the training dataset. Each tree added is "built off the errors of the others", so to speak. 

(can't install xgboost on my system? Wonder why... some error keeps popping up and I'm not sure how to fix it)

### XGBoost Basics

The basic data structure in XGBoost is the DMatrix, which is just a data matrix. The DMatrix can be used to train a Booster() object, which represents the gradient boosted decision tree. 

In [7]:
"""
data = np.array([
  [1.2, 3.3, 1.4],
  [5.1, 2.2, 6.6]])

import xgboost as xgb
dmat1 = xgb.DMatrix(data)

labels = np.array([0, 1])
dmat2 = xgb.DMatrix(data, label=labels)
"""

'\ndata = np.array([\n  [1.2, 3.3, 1.4],\n  [5.1, 2.2, 6.6]])\n\nimport xgboost as xgb\ndmat1 = xgb.DMatrix(data)\n\nlabels = np.array([0, 1])\ndmat2 = xgb.DMatrix(data, label=labels)\n'

In [None]:
"""
# predefined data and labels
print('Data shape: {}'.format(data.shape))
print('Labels shape: {}'.format(labels.shape))
dtrain = xgb.DMatrix(data, label=labels)

# training parameters
params = {
  'max_depth': 0,
  'objective': 'binary:logistic'
}
print('Start training')
bst = xgb.train(params, dtrain)  # booster
print('Finish training')
"""

In the example above, we set the 'max_depth' parameter to 0 (which means no limit on the tree depths, equivalent to None in scikit-learn). We also set the 'objective' parameter (the objective function) to binary classification via logistic regression. For the remaining available parameters, we used their default settings (so we didn't include them in params).

After we train a booster, we can use it to make predictions. 

In [None]:
"""
# predefined evaluation data and labels
print('Data shape: {}'.format(eval_data.shape))
print('Labels shape: {}'.format(eval_labels.shape))
deval = xgb.DMatrix(eval_data, label=eval_labels)

# Trained bst from previous code
print(bst.eval(deval))  # evaluation

# new_data contains 2 new data observations
dpred = xgb.DMatrix(new_data)
# predictions represents probabilities
predictions = bst.predict(dpred)
print('{}\n'.format(predictions))
"""

"""
OUTPUT:

Data shape: (119, 30)
Labels shape: (119,)
[0]	eval-error:0.226891
[0.6236573 0.6236573]
"""

# for binary classification, the default metric is eval-error, which is the classification error. 
# Note that the model's predictions (from the predict function) are probabilities, rather than class labels.  
# The actual label classifications are just the rounded probabilities. In the example above, 
# the Booster predicts classes of 0 and 1, respectively.

We can cross-validate XGBoost using xgb.cv. The keyword 'num_boost_round' specifies the number of boosting iterations, where each boosting iteration will try to improve the model through boosting. 

You can specify n_fold (default = 3) and num_boost_round (default = 10)

In [8]:
"""
# predefined data and labels
dtrain = xgb.DMatrix(data, label=labels)
params = {
  'max_depth': 2,
  'lambda': 1.5,
  'objective':'binary:logistic'
}
cv_results = xgb.cv(params, dtrain, num_boost_round=5)
print('CV Results:\n{}'.format(cv_results))
"""

"\n# predefined data and labels\ndtrain = xgb.DMatrix(data, label=labels)\nparams = {\n  'max_depth': 2,\n  'lambda': 1.5,\n  'objective':'binary:logistic'\n}\ncv_results = xgb.cv(params, dtrain, num_boost_round=5)\nprint('CV Results:\n{}'.format(cv_results))\n"

### Saving and Loading Boosters

We can save our trained Booster objects by using the .save_model() function for the object. This saves the model's binary data into an input file, with a .bin extension. We can restore a trained Booster object using a booster object's .load_model() function (we call the function after creating a new Booster instance). 

In [9]:
"""
SAVING BOOSTERS:

# predefined data and labels
dtrain = xgb.DMatrix(data, label=labels)
params = {
  'max_depth': 3,
  'objective':'binary:logistic'
}
bst = xgb.train(params, dtrain)

# 2 new data observations
dpred = xgb.DMatrix(new_data)
print('Probabilities:\n{}'.format(
  repr(bst.predict(dpred))))

bst.save_model('model.bin')
"""

"""
LOADING BOOSTERS: 

# Load saved Booster
new_bst = xgb.Booster()
new_bst.load_model('model.bin')

# Same dpred from before
print('Probabilities:\n{}'.format(
  repr(new_bst.predict(dpred))))
"""

"\nLOADING BOOSTERS: \n\n# Load saved Booster\nnew_bst = xgb.Booster()\nnew_bst.load_model('model.bin')\n\n# Same dpred from before\nprint('Probabilities:\n{}'.format(\n  repr(new_bst.predict(dpred))))\n"

### Classification and Regression with XGBoost

XGBoost also provides a wrapper that functions like the sklearn models. It functions in the same way as boosters, but does so in a more familiar syntax. For classification, the XGBoost wrapper is XGBClassifier(). For regression, the XGBoost wrapper is XGBRegressor(). Like regular scikit-learn models, it can be trained with a simple call to fit with NumPy arrays as input arguments.


In [None]:
"""
model = xgb.XGBClassifier()
# predefined data and labels
model.fit(data, labels)

# new_data contains 2 new data observations
predictions = model.predict(new_data)
print('Predictions:\n{}'.format(repr(predictions)))
"""

All the parameters for the original Booster object are now keyword arguments for the XGBClassifier. For instance, we can specify the type of classification, i.e. the 'objective' parameter for Booster objects, with the objective keyword argument (the default is binary classification).

In [10]:
"""
model = xgb.XGBClassifier(objective='multi:softmax')
# predefined data and labels (multiclass dataset)
model.fit(data, labels)

# new_data contains 2 new data observations
predictions = model.predict(new_data)
print('Predictions:\n{}'.format(repr(predictions)))
"""

"\nmodel = xgb.XGBClassifier(objective='multi:softmax')\n# predefined data and labels (multiclass dataset)\nmodel.fit(data, labels)\n\n# new_data contains 2 new data observations\npredictions = model.predict(new_data)\nprint('Predictions:\n{}'.format(repr(predictions)))\n"

### Feature Importance

Not every feature is equally important in helping a boosted tree make decisions. Certain features are more important than others. 

We can view the relative/proportional importance of each dataset feature using the feature_importances_ property of the model. 

By default, the plot_importance() function looks at feature weight as the importance metric (i.e., how often does the feature appear in the boosted decision tree?). You can change to a different importance metric with the "importance_type" keyword arg. We can set, for example, importance_type equal to 'gain', which means that we use information gain as the importance metric. Information gain is a commonly used metric for determining how good a feature is at differentiating the dataset, which is important in making predictions with a decision tree.

In [12]:
"""
model = xgb.XGBClassifier()
# predefined data and labels
model.fit(data, labels)

# Array of feature importances
print('Feature importances:\n{}'.format(
  repr(model.feature_importances_)))
"""

"""
OUTPUT:

Feature importances:
array([0.17941953, 0.11345647, 0.41556728, 0.29155672], dtype=float32)
"""

"""
# plot of importances

model = xgb.XGBRegressor()
# predefined data and labels (for regression)
model.fit(data, labels)

xgb.plot_importance(model, importance_type='gain')
plt.show() # matplotlib plot
"""

# The resulting plot is a bar graph of the F-scores ( F1-scores) for each feature
# (the number next to each bar is the exact F-score). Note that the features are 
# labeled as "fN", where N is the index of the column in the dataset. The F-score 
# is a standardized measurement of a feature's importance, based on the specified importance metric.

"\n# plot of importances\n\nmodel = xgb.XGBRegressor()\n# predefined data and labels (for regression)\nmodel.fit(data, labels)\n\nxgb.plot_importance(model, importance_type='gain')\nplt.show() # matplotlib plot\n"

### Hyperparameter tuning with grid-search cross-validation

We can use sklearn's GridSearchCV in order to perform hyperparameter tuning. Below is an example:

In [13]:
"""
model = xgb.XGBClassifier()
params = {'max_depth': range(2, 5)}

from sklearn.model_selection import GridSearchCV
cv_model = GridSearchCV(model, params, cv=4, iid=False)

# predefined data and labels
cv_model.fit(data, labels)
print('Best max_depth: {}\n'.format(
  cv_model.best_params_['max_depth']))

# new_data contains 2 new data observations
print('Predictions:\n{}'.format(
  repr(cv_model.predict(new_data))))
"""

"\nmodel = xgb.XGBClassifier()\nparams = {'max_depth': range(2, 5)}\n\nfrom sklearn.model_selection import GridSearchCV\ncv_model = GridSearchCV(model, params, cv=4, iid=False)\n\n# predefined data and labels\ncv_model.fit(data, labels)\nprint('Best max_depth: {}\n'.format(\n  cv_model.best_params_['max_depth']))\n\n# new_data contains 2 new data observations\nprint('Predictions:\n{}'.format(\n  repr(cv_model.predict(new_data))))\n"

### Saving and loading XGBoost models

As is the case with other sklearn models, we can save and load them with the joblib API. 

Below is an example:

In [14]:
"""
from joblib import dump, load
dump(clf, 'filename.joblib') 
clf = load('filename.joblib') 
"""

"\nfrom joblib import dump, load\ndump(clf, 'filename.joblib') \nclf = load('filename.joblib') \n"