## Extreme Gradient Boosting for Class Imbalance

The Kaggle Credit Card Fraud Detection dataset can be downloaded here:
https://github.com/nsethi31/Kaggle-Data-Credit-Card-Fraud-Detection/blob/master/creditcard.csv?raw=true

In [None]:
# In your command line issue this command:
# conda install -c conda-forge xgboost

In [1]:
%%time
import pandas as pd
data = pd.read_csv('creditcard.csv')

CPU times: user 2.63 s, sys: 542 ms, total: 3.17 s
Wall time: 5.83 s


In [None]:
data.head()

In [None]:
data.shape

In [None]:
data['Class'].value_counts()

In [2]:
%%time
# Split data into train and test splits

from sklearn.model_selection import train_test_split

# retrieve numpy array
data = data.values
# split into input and output elements
# (the first column, Time, is left out of the inputs; the last column, Class, is the target)
X, y = data[:, 1:-1], data[:, -1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

CPU times: user 873 ms, sys: 334 ms, total: 1.21 s
Wall time: 3.18 s


In [4]:
# Count the samples in each class, first for the full data set, then for the test split
import numpy as np
unique, counts = np.unique(y, return_counts=True)
print(np.asarray((unique, counts)).T)

unique, counts = np.unique(y_test, return_counts=True)
print(np.asarray((unique, counts)).T)

[[0.00000e+00 2.84315e+05]
 [1.00000e+00 4.92000e+02]]
[[0.0000e+00 9.3838e+04]
 [1.0000e+00 1.4900e+02]]


In [5]:
counts

array([93838,   149])

In [3]:
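# calculate heuristic class weighting
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# calculate class weighting according to training data
# (recent scikit-learn versions require keyword arguments here)
weighting = compute_class_weight(
    class_weight='balanced', classes=np.array([0, 1]), y=y_train)
print(weighting)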

[  0.50090037 278.16326531]
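The `'balanced'` heuristic can be reproduced by hand, which makes the two numbers above easier to read. The sketch below assumes the formula given in the scikit-learn documentation, `n_samples / (n_classes * count_of_class)`, and uses a made-up toy label list rather than the fraud data. The helper name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * count_of_class)}."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt)
            for cls, cnt in sorted(counts.items())}


# toy labels: 8 negatives and 2 positives (illustrative only)
toy_y = [0] * 8 + [1] * 2
weights = balanced_class_weights(toy_y)
print(weights)  # {0: 0.625, 1: 2.5}

# the ratio of the two weights equals the negative-to-positive count ratio
ratio = weights[1] / weights[0]
print(ratio)  # 4.0
```

The same relationship holds for the output above: 278.163 / 0.501 is roughly 555, the ratio of negative to positive examples in the training split.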


### For XGBoost:

XGBoost is trained to minimize a loss function, and the “gradient” in gradient boosting refers to the steepness of this loss function, i.e. how strongly the error responds to a change in the prediction. A small gradient indicates a small error and, in turn, a small change to the model to correct it. A large gradient during training results in a large correction.

* *Small Gradient:* Small error, so a small correction to the model
* *Large Gradient:* Large error, so a large correction to the model


The **scale_pos_weight** value is used to scale the gradient for the positive class.


By default, the scale_pos_weight hyperparameter is set to 1.0, which weights positive and negative examples equally when boosting decision trees. Values above 1.0 give the positive examples proportionally more weight.


Scaling the gradient for the positive class scales the errors the model makes on positive examples during training and encourages the model to over-correct them. In turn, this can help the model perform better when making predictions on the positive class.
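To make the scaling concrete, here is a minimal sketch of the gradient and Hessian of the logistic loss for a single example, with the positive class multiplied by a weight. It assumes the standard log loss and is an illustration of the idea, not XGBoost's source code. The function name `logistic_grad_hess` is hypothetical.

```python
import math


def logistic_grad_hess(margin, label, scale_pos_weight=1.0):
    """Gradient and Hessian of log loss w.r.t. the raw margin for one example.

    Examples with label == 1 are multiplied by scale_pos_weight;
    negative examples are left unchanged.
    """
    p = 1.0 / (1.0 + math.exp(-margin))  # predicted probability
    w = scale_pos_weight if label == 1 else 1.0
    grad = w * (p - label)
    hess = w * p * (1.0 - p)
    return grad, hess


# a positive example the model is unsure about (margin 0 -> p = 0.5)
g_default, h_default = logistic_grad_hess(0.0, 1)
g_scaled, h_scaled = logistic_grad_hess(0.0, 1, scale_pos_weight=10.0)
# a negative example is unaffected by scale_pos_weight
g_neg, h_neg = logistic_grad_hess(0.0, 0, scale_pos_weight=10.0)

print(g_default, g_scaled, g_neg)  # -0.5 -5.0 0.5
```

With a weight of 10, the same mistake on a positive example produces a gradient ten times larger, so the next tree is pushed harder to correct it.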




#### Observe how the estimated value of scale_pos_weight is calculated

In [7]:
%%time
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

unique, counts = np.unique(y_train, return_counts=True)

# estimate scale_pos_weight value
estimate = counts[0] / counts[1]
print('Estimate: %.3f' % estimate)

# define model
# try with and without scale_pos_weight
#model = XGBClassifier(scale_pos_weight=estimate)
model = XGBClassifier()
# fit model
model.fit(X_train, y_train)

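# evaluate model
# note: predict() returns hard 0/1 class labels, so the AUC below is computed
# from labels; model.predict_proba(X_test)[:, 1] would give a score-based AUC
y_pred = model.predict(X_test)

auc = roc_auc_score(y_test, y_pred)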

# summarize performance
print(' ROC AUC = %.3f' % auc)

Estimate: 555.327
 ROC AUC = 0.913
CPU times: user 2min 1s, sys: 2.65 s, total: 2min 3s
Wall time: 2min 58s
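The evaluation cell passes the output of `model.predict` to `roc_auc_score`. Because `predict` returns hard class labels, the resulting AUC can differ from one computed on predicted probabilities. The sketch below shows the effect on made-up toy numbers, using a small pairwise AUC helper (the name `roc_auc` and the 0.5 threshold are assumptions for illustration).

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# toy ground truth and predicted probabilities (illustrative only)
y_true = [0, 0, 0, 1, 1]
probs = [0.1, 0.2, 0.4, 0.3, 0.9]

# thresholding at 0.5 throws away the ranking information
labels = [1 if p >= 0.5 else 0 for p in probs]

auc_scores = roc_auc(y_true, probs)
auc_labels = roc_auc(y_true, labels)
print(round(auc_scores, 3), auc_labels)  # 0.833 0.75
```

When comparing runs with and without `scale_pos_weight`, it helps to compute the metric the same way for both.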


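In [None]:
# ROC AUC values noted from the two runs of the cell above
0.954  # presumably with scale_pos_weight=estimate
0.913  # default XGBClassifier(), matching the output above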