# 1. Classification with XGBoost

These are my personal notes on the DataCamp course [Extreme Gradient Boosting with XGBoost](https://app.datacamp.com/learn/courses/extreme-gradient-boosting-with-xgboost).

The course has 4 main sections:

1. **Classification**: the current notebook.
2. Regression
3. Fine-tuning XGBoost
4. Using XGBoost in Pipelines

XGBoost is an implementation of the [Gradient Boosting](https://en.wikipedia.org/wiki/Gradient_boosting) algorithm in C++ which has bindings to other languages, such as Python. It has the following properties:

- Fast to train.
- Strong predictive performance, particularly on tabular data.
- Parallelizable, both on a single machine and across a network, so it can work with huge datasets distributed over several nodes/GPUs.
- It can be used for both classification and regression.
- The [Python API](https://xgboost.readthedocs.io/en/stable/python/python_api.html) is easy to use and has two major flavors or sub-APIs:
  - The **Scikit-Learn API**: We instantiate `XGBRegressor()` or `XGBClassifier()` and then call `fit()` and `predict()`, using the typical Scikit-Learn parameters; we can even use those objects with other Scikit-Learn modules, such as `GridSearchCV`.
  - The **Learning API**: The native XGBoost Python API requires converting the dataframes into `DMatrix` objects first; in return, we get powerful functions that allow tuning many parameters: `xgb.cv()`, `xgb.train()`. The native/learning API is also easy to use. **Note: the parameter names are different compared to the Scikit-Learn API!**
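
To make the naming difference concrete, the sketch below maps a few Scikit-Learn-style parameter names to their native equivalents. The helper `to_native_params` is hypothetical (it is not part of XGBoost) and covers only a handful of names; the default of 10 rounds is assumed to mirror the default of `xgb.train()`.

```python
# Hypothetical helper, not part of XGBoost: translates a few
# Scikit-Learn-style names into the names used by the native Learning API.
SKLEARN_TO_NATIVE = {
    "learning_rate": "eta",
    "reg_alpha": "alpha",
    "reg_lambda": "lambda",
    "random_state": "seed",
}

def to_native_params(sklearn_params):
    """Return (params, num_boost_round) in the shape xgb.train() / xgb.cv() expect."""
    params = dict(sklearn_params)  # copy so the input is not modified
    # n_estimators is not a booster parameter in the native API;
    # it is passed as the num_boost_round argument instead.
    num_boost_round = params.pop("n_estimators", 10)
    native = {SKLEARN_TO_NATIVE.get(k, k): v for k, v in params.items()}
    return native, num_boost_round

native_params, rounds = to_native_params(
    {"objective": "binary:logistic", "n_estimators": 10,
     "learning_rate": 0.3, "max_depth": 4, "random_state": 123}
)
print(native_params)
print(rounds)
```

Names without an entry in the mapping (such as `objective` or `max_depth`) are the same in both APIs and pass through unchanged.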

Classification is the original supervised learning problem addressed by XGBoost, although it can also handle regression problems.

### Installation

```bash
# pip
pip install xgboost

# Conda: General
conda install -c conda-forge py-xgboost

# Conda: CPU only
conda install -c conda-forge py-xgboost-cpu

# Conda: Use NVIDIA GPU: Linux x86_64
conda install -c conda-forge py-xgboost-gpu

# For tree visualization
pip install graphviz
```

### Table of Contents

- [1.1 Introduction: Churn Classification Example](#1.1-Introduction:-Churn-Classification-Example)
- [1.2 How Does It Work?](#1.2-How-Does-It-Work?)
- [1.3 Cross Validation](#1.3-Cross-Validation)
- [1.4 When to Use XGBoost](#1.4-When-to-Use-XGBoost)

## 1.1 Introduction: Churn Classification Example

In [1]:
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [4]:
# Get data, split
class_data = pd.read_csv("../data/ChurnData.csv")

In [5]:
class_data.head()

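| Unnamed: 0 | tenure | age | address | income | ed | employ | equip | callcard | wireless | longmon | ... | pager | internet | callwait | confer | ebill | loglong | logtoll | lninc | custcat | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.0 | 33.0 | 7.0 | 136.0 | 5.0 | 5.0 | 0.0 | 1.0 | 1.0 | 4.4 | ... | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.482 | 3.033 | 4.913 | 4.0 | 1.0 |
| 1 | 33.0 | 33.0 | 12.0 | 33.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.45 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.246 | 3.24 | 3.497 | 1.0 | 1.0 |
| 2 | 23.0 | 30.0 | 9.0 | 30.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 6.3 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.841 | 3.24 | 3.401 | 3.0 | 0.0 |
| 3 | 38.0 | 35.0 | 5.0 | 76.0 | 2.0 | 10.0 | 1.0 | 1.0 | 1.0 | 6.05 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.8 | 3.807 | 4.331 | 4.0 | 0.0 |
| 4 | 7.0 | 35.0 | 14.0 | 80.0 | 2.0 | 15.0 | 0.0 | 1.0 | 0.0 | 7.1 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.96 | 3.091 | 4.382 | 3.0 | 0.0 |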


In [12]:
# Features: all columns except the last; target: the last column (churn)
X, y = class_data.iloc[:,:-1], class_data.iloc[:,-1].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

In [13]:
# XGBoost Classifier instance
# Parameters:
# https://xgboost.readthedocs.io/en/stable/parameter.html
# Objective functions (selection):
# reg:squarederror - regression with squared loss (replaces the deprecated reg:linear)
# reg:logistic - logistic regression, probability output
# binary:logistic - binary classification, probability output
# Note: XGBClassifier.predict() returns class labels; predict_proba() returns probabilities
xg_cl = xgb.XGBClassifier(objective='binary:logistic',
                          n_estimators=10,
                          random_state=123)

In [14]:
# Train/Fit
xg_cl.fit(X_train, y_train)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=10, n_jobs=8,
              num_parallel_tree=1, predictor='auto', random_state=123,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=123,
              subsample=1, tree_method='exact', validate_parameters=1,
              verbosity=None)

In [15]:
# Predict
preds = xg_cl.predict(X_test)

In [17]:
# Evaluate
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("Accuracy: %f" % (accuracy))

Accuracy: 0.750000


## 1.2 How Does It Work?

XGBoost builds on *weak* base learners; usually, these are **decision trees**, specifically **CARTs: Classification and Regression Trees**.

A decision tree is a binary tree in which each node splits the dataset in two using one feature; each split corresponds to a question about that feature. The leaves of the tree contain either a class or a value to be predicted. In particular, CARTs always contain a continuous value in the leaves, which can be turned into a class label by applying a threshold.
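
As a rough illustration of how continuous leaf values become a class decision: with the `binary:logistic` objective, the leaf values that the trees return for a sample are summed into a raw score, passed through the logistic (sigmoid) function, and thresholded. The leaf values below are made up, and the handling of the base score is omitted for simplicity.

```python
import math

def sigmoid(z):
    """Logistic function: maps a raw score (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Made-up leaf values that three trees return for one sample
leaf_values = [0.4, -0.1, 0.3]

raw_score = sum(leaf_values)              # additive ensemble output (log-odds)
probability = sigmoid(raw_score)          # probability of the positive class
predicted_class = int(probability > 0.5)  # apply a 0.5 threshold

print(f"raw={raw_score:.2f}, p={probability:.3f}, class={predicted_class}")
```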

Therefore, XGBoost is an **ensemble learning** method: many models are combined to yield a result. The underlying *weak* learners can in principle be any algorithm, although CARTs are usually employed. A *weak* learner is any model that performs slightly better than random chance, i.e., >50% accuracy in a binary classification. XGBoost then combines those *weak* learners into a **strong learner**: their individual errors tend to cancel out, while the patterns they agree on are reinforced.

*Weak* learners are trained with **boosting**:

- Iteratively learn models on subsets of data.
- Weight each weak prediction based on the learner's performance.
- Combine weighted predictions to obtain a single prediction.
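
The sketch below illustrates the boosting idea in its gradient boosting form, where each new weak learner is fitted to the residual errors of the ensemble built so far and its contribution is scaled by a learning rate. It is a simplified, pure-Python toy with decision stumps on 1-D data and squared error, assuming at least two distinct `x` values; it is not XGBoost's actual algorithm (no row subsampling, no second-order gradients, no regularization), and the data is made up.

```python
def fit_stump(xs, residuals):
    """Find the single-threshold split on 1-D data that minimizes squared error."""
    best = None
    # Skip the largest x: splitting there would leave the right side empty
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=10, learning_rate=0.3):
    """Fit one stump per round to the current residuals and add it to the ensemble."""
    preds = [0.0] * len(xs)
    stumps, mse_per_round = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each learner's contribution by the learning rate
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        mse_per_round.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return stumps, mse_per_round

# Made-up data with a step between x=3 and x=4
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 0.9, 1.1, 3.0, 3.2, 2.9]
stumps, mse_per_round = boost(xs, ys)
print([round(m, 3) for m in mse_per_round])
```

The training error shrinks with every round, even though each individual stump is a very weak model.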

The XGBoost implementation supports two types of weak learners:

- The mentioned CART trees (`booster="gbtree"`); these should be used in most cases, because they capture non-linearities.
- Linear learners (`booster="gblinear"`).

Notes on the general boosting algorithm:

- Each weak learner is created in one boosting round and can be trained on a random subset of the total dataset.
- If we use trees, we can choose how many features are randomly sampled to build each tree.
- We can apply regularization if the model is overfitting.
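
In the native API, these knobs correspond roughly to the parameters below. The names are taken from the XGBoost parameter documentation; the values are purely illustrative and not tuned for any dataset.

```python
# Illustrative values only, not tuned
params = {
    "booster": "gbtree",             # tree base learners ("gblinear" for linear ones)
    "objective": "binary:logistic",
    "subsample": 0.8,                # fraction of rows sampled in each boosting round
    "colsample_bytree": 0.5,         # fraction of features sampled for each tree
    "max_depth": 4,                  # shallower trees are weaker and overfit less
    "gamma": 1.0,                    # minimum loss reduction required to make a split
    "alpha": 0.0,                    # L1 regularization on leaf weights
    "lambda": 1.0,                   # L2 regularization on leaf weights
}
print(params)
```

Such a dictionary would be passed as `params` to `xgb.train()` or `xgb.cv()`.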

## 1.3 Cross Validation


We can use cross-validation with XGBoost, but the API usage is a bit different:

- We need to define `DMatrix` objects.
- We call `xgb.cv()`.
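
Conceptually, k-fold cross-validation splits the row indices into k non-overlapping folds and uses each fold once as the held-out test set. `xgb.cv()` handles this internally; the sketch below only shows the idea, using a hypothetical helper `kfold_indices` that skips shuffling and stratification.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs for contiguous, non-overlapping folds."""
    indices = list(range(n_samples))
    # Distribute the remainder so that fold sizes differ by at most one
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(n_samples=10, n_folds=4))
for train_idx, test_idx in folds:
    print("test:", test_idx, "train size:", len(train_idx))
```

Every sample ends up in exactly one test fold, so the averaged metric uses each row once for evaluation.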

In [18]:
import xgboost as xgb
import pandas as pd

In [19]:
churn_data = pd.read_csv("../data/ChurnData.csv")

In [21]:
# DMatrix is XGBoost's internal data structure, optimized for speed and memory
# The Scikit-Learn API (XGBClassifier, XGBRegressor) creates it automatically,
# but the native API (xgb.cv(), xgb.train()) requires us to build it explicitly
churn_dmatrix = xgb.DMatrix(data=churn_data.iloc[:,:-1], # X
                            label=churn_data.churn) # y

In [25]:
# Define params
# Objective functions (selection):
# reg:squarederror - regression with squared loss (replaces the deprecated reg:linear)
# reg:logistic - logistic regression, probability output
# binary:logistic - binary classification, probability output
params = {"objective":"binary:logistic",
          "max_depth":4}

In [26]:
# Fit with CV and get results of CV
# Parameters:
# https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.cv
cv_results = xgb.cv(dtrain=churn_dmatrix, # DMatrix
                    params=params, # parameters dictionary
                    nfold=4, # number of non-overlapping folds
                    num_boost_round=10, # number of trees
                    metrics="error", # classification error rate (accuracy = 1 - error); "rmse" or "mae" for regression
                    as_pandas=True) # if we want results as a pandas object

In [24]:
print("Accuracy: %f" %((1-cv_results["test-error-mean"]).iloc[-1]))

Accuracy: 0.705000


In [27]:
# Perform cross-validation with another metric: AUC
cv_results = xgb.cv(dtrain=churn_dmatrix,
                    params=params, 
                    nfold=3,
                    num_boost_round=5, 
                    metrics="auc",
                    as_pandas=True,
                    seed=123)

# Print cv_results
print(cv_results)

# Print the AUC
print((cv_results["test-auc-mean"]).iloc[-1])

   train-auc-mean  train-auc-std  test-auc-mean  test-auc-std
0        0.907307       0.025788       0.694683      0.057410
1        0.951466       0.017800       0.720245      0.032604
2        0.975673       0.009259       0.722732      0.018837
3        0.982302       0.006991       0.735959      0.038124
4        0.988113       0.005642       0.732957      0.040420
0.732957
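
Note that the last boosting round is not necessarily the best one: in the table above, the test AUC peaks at round index 3. A small sketch that picks the best round from the printed `test-auc-mean` values, which are copied by hand here; with the real `cv_results` DataFrame, `cv_results["test-auc-mean"].idxmax()` should give the same index.

```python
# test-auc-mean values copied from the output above
test_auc_mean = [0.694683, 0.720245, 0.722732, 0.735959, 0.732957]

# Index of the boosting round with the highest mean test AUC
best_round = max(range(len(test_auc_mean)), key=lambda i: test_auc_mean[i])
best_auc = test_auc_mean[best_round]

print(f"Best round index: {best_round} ({best_round + 1} trees), AUC = {best_auc:.6f}")
```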


## 1.4 When to Use XGBoost

We can use XGBoost:

- With large datasets:
  - more than 1,000 data points with fewer than 100 features each;
  - in general, as long as the number of features is smaller than the number of data points, everything should be fine.
- With numerical or categorical features, or a mixture of both.

XGBoost is suboptimal for:

- Computer vision and image recognition (deep learning works better).
- NLP (deep learning works better).
- Datasets where the number of features is larger than the number of data points.