# XGBoost #

In <b>boosting</b>, models are sequentially created such that model $M_t$ created at time step $t$ learns from its predecessor model $M_{t-1}$. This is generally achieved by learning a model on the residual errors from the previous model.

For example, suppose a model $M_1$ produces predictions $\hat{y}$ on the training data. The residuals are then $y - \hat{y}$, with $y$ being the actual training labels. We can then train a model $h$ to predict the residuals of $M_1$. The next model $M_2$ combines the predictions of $M_1$ with the residuals predicted by $h$. Specifically:

<center> $M_2(x) = M_1(x) + h(x)$ </center>

The models learned are generally considered "weak" models: they don't have high predictive power on their own, but they can perform well when many of them are combined. Boosting is commonly used with decision trees.
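The two-stage update $M_2(x) = M_1(x) + h(x)$ can be sketched in plain Python. This is only a minimal illustration under stated assumptions: the toy data, the choice of a constant mean predictor as $M_1$, and the helper names `fit_stump` and `mse` are made up here and do not come from any library.

```python
# Toy 1-D regression data (made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 0.9, 1.1, 3.0, 3.2, 2.9]


def mean(values):
    return sum(values) / len(values)


def fit_stump(xs, targets):
    """Fit a one-split regression stump by trying every threshold (hypothetical helper)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # drop the max so the right side is never empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        sse = sum((t - left_value) ** 2 for t in left)
        sse += sum((t - right_value) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value


def mse(model, xs, ys):
    return mean([(y - model(x)) ** 2 for x, y in zip(xs, ys)])


# M_1: a deliberately weak model that always predicts the mean label.
y_bar = mean(ys)


def M1(x):
    return y_bar


# Residuals y - M_1(x), and a stump h trained to predict them.
residuals = [y - M1(x) for x, y in zip(xs, ys)]
h = fit_stump(xs, residuals)


# M_2(x) = M_1(x) + h(x)
def M2(x):
    return M1(x) + h(x)


print("MSE of M1:", mse(M1, xs, ys))
print("MSE of M2:", mse(M2, xs, ys))
```

On this toy data the combined model has a lower training error than $M_1$ alone, which is the behavior the residual-fitting step is meant to produce.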

<b>Gradient boosting</b> generalizes this idea by using the gradient of the loss function instead of residuals. In gradient boosting, the first model is a constant prediction $\gamma$ chosen to minimize the total loss:

<center> $M_0(x) = \arg\min_{\gamma} \sum_{i=1}^n L(y_i, \gamma)$ </center>

Then we can calculate the gradient as:

<center> $\frac{\partial L(y_i, M_0(x_i))}{\partial M_0(x_i)}$ </center>
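
As a worked case, assume the squared-error loss $L(y, M) = \frac{1}{2}(y - M)^2$ (chosen here purely for illustration). Differentiating with respect to the prediction gives:

<center> $\frac{\partial L(y_i, M_0(x_i))}{\partial M_0(x_i)} = -(y_i - M_0(x_i))$ </center>

so the negative gradient is exactly the residual $y_i - M_0(x_i)$. Under this particular loss, fitting a model to the negative gradient is the same as fitting it to the residuals; other losses give different targets.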

Assuming we fit some model $h_1$ to the negative gradient (the direction in which the loss decreases), we can calculate the next model, $M_1$, as:

<center> $M_1(x) = M_0(x) + \alpha h_1(x)$ </center>

where $\alpha$ is the learning rate parameter. More generally, each step is:

<center> $M_{t+1}(x) = M_t(x) + \alpha h_{t+1}(x)$ </center>
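
The update rule above can be sketched as a short loop. This is a minimal illustration, not how XGBoost is implemented: it assumes a logistic loss on raw scores, uses made-up toy data, and relies on a hypothetical `fit_stump` helper as the weak learner. The values of `alpha` and `n_rounds` are arbitrary.

```python
import math

# Toy binary-classification data (made up for illustration).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 0, 1, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(scores, ys):
    """Mean logistic loss for raw scores M(x_i) and labels in {0, 1}."""
    total = 0.0
    for s, y in zip(scores, ys):
        p = sigmoid(s)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(ys)


def fit_stump(xs, targets):
    """Least-squares one-split stump (hypothetical helper, not a library call)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lv) ** 2 for t in left) + sum((t - rv) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lv, rv)
    _, threshold, lv, rv = best
    return lambda x: lv if x <= threshold else rv


alpha = 0.5     # learning rate (arbitrary choice)
n_rounds = 20   # number of boosting steps (arbitrary choice)

# M_0: the constant score that minimizes the log-loss, i.e. the log-odds of the positive rate.
p_pos = sum(ys) / len(ys)
m0 = math.log(p_pos / (1 - p_pos))
scores = [m0 for _ in xs]  # current M_t(x_i) on the training points
stumps = []
losses = [log_loss(scores, ys)]

for t in range(n_rounds):
    # Negative gradient of the log-loss with respect to the score: y_i - sigmoid(M_t(x_i)).
    neg_grad = [y - sigmoid(s) for s, y in zip(scores, ys)]
    h = fit_stump(xs, neg_grad)
    stumps.append(h)
    # M_{t+1}(x) = M_t(x) + alpha * h_{t+1}(x)
    scores = [s + alpha * h(x) for s, x in zip(scores, xs)]
    losses.append(log_loss(scores, ys))


def predict_score(x):
    """Raw score of the final model: M_0 plus the scaled sum of all stumps."""
    return m0 + alpha * sum(h(x) for h in stumps)


print("Initial loss:", losses[0])
print("Final loss:", losses[-1])
```

Here the targets `y_i - sigmoid(M_t(x_i))` are not plain residuals of the labels against class predictions; they are the negative gradient of the chosen loss, which is the distinction the text draws between boosting and gradient boosting.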

<b>XGBoost</b> is gradient boosting with some additional features. Some notable ones include regularization to avoid overfitting, built-in handling of sparse data, and parallelization.


In [2]:
# import packages
from sklearn import datasets, model_selection, metrics

import xgboost as xgb

### Import Data ###

Importing the built-in breast cancer data set with 2 classes (212 malignant, 357 benign) and 30 features.

In [4]:
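# Load the features (X) and binary labels (Y: 0 = malignant, 1 = benign)
data = datasets.load_breast_cancer()
X = data['data']
Y = data['target']

# Hold out 30% of the samples as a test set
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, train_size=0.7)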


### Create Model ###

Construct the XGBoost classifier with its default settings. Parameters that can be tuned include learning_rate, n_estimators, and early_stopping_rounds. To choose them, a validation set can be split off from the training set and used to compare settings.

In [8]:
# Fit a classifier with default hyperparameters
model = xgb.XGBClassifier()
model.fit(X_train, Y_train)

# Predict class labels for the held-out test set
Y_pred = model.predict(X_test)



### Analyze Model ###

Analyze the model performance using accuracy and F1 score (which takes into account precision and recall).

In [14]:
acc = metrics.accuracy_score(Y_test, Y_pred)
print('Accuracy:', acc)

f1 = metrics.f1_score(Y_test, Y_pred)
print('F1 score:', f1)

matrix = metrics.confusion_matrix(Y_test, Y_pred)
print('Confusion matrix:')
print(matrix)

Accuracy: 0.9766081871345029
F1 score: 0.9824561403508771
Confusion matrix:
[[ 55   2]
 [  2 112]]


As mentioned earlier, some parameters of the model could be tuned. Even so, the accuracy and F1 score show that the model built with default settings already performs very well.

For context, this data set was tested with two other algorithms:

| algorithm | accuracy | f1 score |
| --------- | -------- | -------- |
| logistic regression | 0.68 | 0.64 |
| linear SVM | 0.95 | 0.96 |

Comparing them, XGBoost outperforms both previously tested methods on this data set: it is slightly better than linear SVM and much better than logistic regression.

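The confusion matrix shows the true (rows) vs predicted (columns) labels. In absolute terms, the incorrect predictions are spread evenly between the two classes, with 2 samples mislabeled in each. Because class 0 (malignant) has 57 test samples and class 1 (benign) has 114, the error rate is about 3.5% for class 0 and about 1.8% for class 1.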
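
The reported accuracy and F1 score can be recomputed by hand from the confusion matrix, which also makes the role of precision and recall explicit. The sketch below copies the four counts from the output above and assumes scikit-learn's layout (rows are true labels, columns are predicted labels, label 0 first) with class 1 treated as the positive class.

```python
# Counts copied from the confusion matrix printed earlier
# (rows = true label, columns = predicted label, label order 0 then 1).
matrix = [[55, 2],
          [2, 112]]

(tn, fp), (fn, tp) = matrix  # class 1 is treated as the positive class

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Per-class error rates: share of each true class that was mislabeled.
error_rate_class0 = fp / (tn + fp)
error_rate_class1 = fn / (fn + tp)

print("Accuracy:", accuracy)
print("Precision:", precision, "Recall:", recall)
print("F1 score:", f1)
print("Error rate, class 0:", error_rate_class0)
print("Error rate, class 1:", error_rate_class1)
```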