<a href="https://colab.research.google.com/github/kiserran/academic-analytics/blob/main/xgboostbinaryclassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### XGBoost for binary classification
### The data describes invoice payments: whether each invoice was paid on time and, if it was late, how long the delay was. The decision column (invoice_risk_decision) is 0 if the invoice was paid on time or the delay was small, and 1 otherwise.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.metrics import accuracy_score
from matplotlib import pyplot

import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/wangx346/MAS651/main/invoice2.csv')
df.head()

,invoice_risk_decision,customer,payment_due_date,payment_date,grand_total
0,0,id_24,2,2,64.25
1,0,id_11,3,3,50.34
2,0,id_29,4,4,40.03
3,0,id_28,4,2,94.86
4,1,id_13,2,8,65.15


In [None]:
# Class balance: number of rows for each value of the target (first) column
print('Number of rows in dataset:')
print(df[df.columns[0]].value_counts())

##### XGBoost works with numerical data, so categorical features must be translated to a numeric representation. The Pandas library provides the get_dummies function, which encodes each categorical value as its own 0/1 indicator column. Here it is used to encode the categorical feature customer:

In [None]:
# create dummy variables
encoded_data = pd.get_dummies(df)
encoded_data.head()

,invoice_risk_decision,payment_due_date,payment_date,grand_total,customer_id_11,customer_id_12,customer_id_13,customer_id_14,customer_id_15,customer_id_18,...,customer_id_45,customer_id_46,customer_id_47,customer_id_48,customer_id_49,customer_id_50,customer_id_6,customer_id_7,customer_id_8,customer_id_9
0,0,2,2,64.25,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,3,3,50.34,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,4,4,40.03,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,4,2,94.86,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,2,8,65.15,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
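
To make the encoding step concrete, here is a minimal pure-Python sketch of what one-hot encoding does to the customer column. The `one_hot` helper is hypothetical and written only for illustration. It is not how pandas implements get_dummies, although the column names it produces follow the same customer_id_… pattern seen above.

```python
def one_hot(values, prefix):
    """Return {column_name: [0/1, ...]} with one indicator column per distinct value."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }

# Customer values from the first five rows shown by df.head()
customers = ["id_24", "id_11", "id_29", "id_28", "id_13"]
encoded = one_hot(customers, prefix="customer")

for name, column in encoded.items():
    print(name, column)
```

Each row ends up with exactly one 1 across the indicator columns, which is the property the model relies on to tell customers apart.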


In [None]:
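# Split the encoded data into features X and target Y.
# Column 0 is invoice_risk_decision; columns 1-43 are the 3 numeric features plus the customer dummies.
X = encoded_data.iloc[:, 1:44]
Y = encoded_data.iloc[:, 0:1]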

In [None]:
## get a list of all the column names in a Pandas DataFrame:
headers = list(X)


## create training (90%) and testing (10%) data sets, stratified on the target
train_x, test_x, train_Y, test_Y = train_test_split(X, Y, test_size=0.1, stratify=Y, random_state=0)

print(train_x.shape, test_x.shape)
print()
print('Number of rows in Train dataset:')
print(train_Y['invoice_risk_decision'].value_counts())
print()
print('Number of rows in Test dataset:')
print(test_Y['invoice_risk_decision'].value_counts())

(275, 43) (31, 43)

Number of rows in Train dataset:
0    154
1    121
Name: invoice_risk_decision, dtype: int64

Number of rows in Test dataset:
0    17
1    14
Name: invoice_risk_decision, dtype: int64
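
As a quick check on what `stratify=Y` achieves, the class counts printed above can be turned into class shares. This is only arithmetic on the numbers already shown; the `positive_share` helper is a hypothetical name introduced for this sketch.

```python
# Class counts printed above for the stratified 90/10 split
train_counts = {0: 154, 1: 121}
test_counts = {0: 17, 1: 14}

def positive_share(counts):
    """Fraction of rows labelled 1 (risky invoices)."""
    return counts[1] / (counts[0] + counts[1])

train_share = positive_share(train_counts)
test_share = positive_share(test_counts)
print(f"train: {train_share:.3f}, test: {test_share:.3f}")
```

The two shares are close (about 44% and 45%), which is what a stratified split is meant to deliver on a small data set.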


#####  Use XGBoost's built-in evaluation and early stopping to limit overfitting. Along with the training data, the test data is passed to model.fit through eval_set. With early_stopping_rounds=10, training stops once the evaluation metric has not improved for 10 consecutive rounds, and the best iteration is kept. Because logloss is the last metric in eval_metric and the test set is the last entry in eval_set, early stopping tracks logloss on the test set. verbose=True prints the metrics for every training iteration:

In [None]:
model = xgb.XGBClassifier(max_depth=4,
                        subsample=0.9,
                        objective='binary:logistic',
                        n_estimators=200,
                        learning_rate = 0.1)
# Evaluate on both sets each round; the last entry (the test set) drives early stopping
eval_set = [(train_x, train_Y), (test_x, test_Y)]

# Note: XGBoost 1.6 and later deprecate passing early_stopping_rounds and eval_metric to fit();
# there they are given to the XGBClassifier constructor instead.
model.fit(train_x, train_Y.values.ravel(), early_stopping_rounds=10, eval_metric=["error", "logloss"], eval_set=eval_set, verbose=True)

[0]	validation_0-error:0.08727	validation_0-logloss:0.63186	validation_1-error:0.19355	validation_1-logloss:0.64377
[1]	validation_0-error:0.08727	validation_0-logloss:0.58163	validation_1-error:0.19355	validation_1-logloss:0.60726
[2]	validation_0-error:0.09091	validation_0-logloss:0.53791	validation_1-error:0.22581	validation_1-logloss:0.57581
[3]	validation_0-error:0.09091	validation_0-logloss:0.50050	validation_1-error:0.22581	validation_1-logloss:0.55041
[4]	validation_0-error:0.09091	validation_0-logloss:0.46685	validation_1-error:0.22581	validation_1-logloss:0.52272
[5]	validation_0-error:0.09091	validation_0-logloss:0.43845	validation_1-error:0.22581	validation_1-logloss:0.50777
[6]	validation_0-error:0.09091	validation_0-logloss:0.41247	validation_1-error:0.22581	validation_1-logloss:0.48711
[7]	validation_0-error:0.09091	validation_0-logloss:0.38838	validation_1-error:0.22581	validation_1-logloss:0.46815
[8]	validation_0-error:0.09091	validation_0-logloss:0.36835	validation_1

[71]	validation_0-error:0.01818	validation_0-logloss:0.09703	validation_1-error:0.09677	validation_1-logloss:0.22033
[72]	validation_0-error:0.01455	validation_0-logloss:0.09607	validation_1-error:0.09677	validation_1-logloss:0.22085
[73]	validation_0-error:0.01455	validation_0-logloss:0.09542	validation_1-error:0.09677	validation_1-logloss:0.21858
[74]	validation_0-error:0.01455	validation_0-logloss:0.09462	validation_1-error:0.09677	validation_1-logloss:0.21853
[75]	validation_0-error:0.01455	validation_0-logloss:0.09419	validation_1-error:0.09677	validation_1-logloss:0.21885
[76]	validation_0-error:0.01455	validation_0-logloss:0.09361	validation_1-error:0.09677	validation_1-logloss:0.22028
[77]	validation_0-error:0.01455	validation_0-logloss:0.09292	validation_1-error:0.09677	validation_1-logloss:0.21975
[78]	validation_0-error:0.01455	validation_0-logloss:0.09188	validation_1-error:0.09677	validation_1-logloss:0.22108
[79]	validation_0-error:0.01455	validation_0-logloss:0.09123	val

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.1, max_delta_step=0,
              max_depth=4, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=200, n_jobs=8,
              num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=0.9,
              tree_method='exact', validate_parameters=1, verbosity=None)
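
The stopping rule described above can be sketched in a few lines of plain Python. This is an illustrative simplification, not XGBoost's own code: the function name `run_with_early_stopping` is hypothetical, and the loss values are made up for the example rather than taken from the training log.

```python
def run_with_early_stopping(val_losses, patience):
    """Return (best_round, stopped_round) for a sequence of validation losses.

    Training stops once `patience` consecutive rounds pass without a new best loss.
    """
    best_round, best_loss = 0, float("inf")
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_round, best_loss = round_idx, loss
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    return best_round, len(val_losses) - 1

# Hypothetical validation losses, not taken from the log above
losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.48, 0.44]
best_round, stopped_round = run_with_early_stopping(losses, patience=3)
print(best_round, stopped_round)
```

With a patience of 3, the loop stops at round 5 and never sees the final value of 0.44. The same trade-off applies to the 10 rounds used in the notebook: a larger patience tolerates more noise in the metric but trains longer.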

###### To evaluate accuracy on the test set, I call model.predict with the test features (test_x). The function returns one prediction per row. Each prediction is then compared with the actual decision value for that row, and accuracy is the share of rows that match:

In [None]:
# make predictions for test data
y_pred = model.predict(test_x)
# predict() already returns class labels (0 or 1), so round() leaves them unchanged
predictions = [round(value) for value in y_pred]

In [None]:
# evaluate predictions
accuracy = accuracy_score(test_Y, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 90.32%


In [None]:
from sklearn.model_selection import GridSearchCV
parameters = {
    'max_depth': range (2, 10, 1),
    'n_estimators': range(60, 220, 40),
    'learning_rate': [0.1, 0.01, 0.05]
}

#Initialise XGBoost Model
xgb1 = xgb.XGBClassifier(subsample=0.9, objective='binary:logistic')

# Define grid search
grid_search= GridSearchCV(
    estimator=xgb1,
    param_grid=parameters,
    scoring = 'roc_auc',
    n_jobs = 10,
    cv = 10,
    verbose=True
)

# Fit grid search (ravel turns the single-column target frame into a 1-D array)
model1 = grid_search.fit(train_x, train_Y.values.ravel())
print("Best parameters:", model1.best_params_)

Fitting 10 folds for each of 96 candidates, totalling 960 fits
Best parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}
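
The "96 candidates, totalling 960 fits" line follows directly from the grid definition. A small sketch that reproduces the count, using only the standard library and the same `parameters` dictionary as the cell above (`cv_folds` is a name introduced here for the `cv = 10` setting):

```python
from itertools import product

# Same grid as in the cell above
parameters = {
    'max_depth': range(2, 10, 1),        # 2..9  -> 8 values
    'n_estimators': range(60, 220, 40),  # 60, 100, 140, 180 -> 4 values
    'learning_rate': [0.1, 0.01, 0.05],  # 3 values
}
cv_folds = 10

# Every combination of the three parameter lists is one candidate
candidates = list(product(*parameters.values()))
total_fits = len(candidates) * cv_folds
print(len(candidates), total_fits)
```

Adding one more value to any parameter multiplies the cost, so it is worth counting candidates before launching a larger search.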


In [None]:
from sklearn.metrics import confusion_matrix
predict1 = model1.predict(test_x)
print('Best AUC Score: {}'.format(model1.best_score_))
print('Accuracy: {}'.format(accuracy_score(test_Y, predict1)))
print(confusion_matrix(test_Y,predict1))

Best AUC Score: 0.9767094017094017
Accuracy: 0.9032258064516129
[[15  2]
 [ 1 13]]
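
The accuracy figure can be recovered by hand from the confusion matrix, along with precision and recall for the risky class. The sketch below uses only the four numbers printed above and relies on scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
# Confusion matrix printed above: rows are true classes, columns are predicted classes
cm = [[15, 2],
      [1, 13]]

tn, fp = cm[0]   # true class 0: predicted 0, predicted 1
fn, tp = cm[1]   # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted-risky invoices that were risky
recall = tp / (tp + fn)      # share of risky invoices that were caught
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

On this test set, 28 of 31 rows are classified correctly, and one risky invoice out of 14 is missed.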


In [None]:
test_probs = model1.predict_proba(test_x)[:,1]
print(test_probs)

[1.5100325e-02 8.8246174e-02 9.2017984e-01 1.3721527e-02 2.2550551e-03
 1.5940575e-01 8.5871282e-04 9.9002153e-01 6.1288571e-01 1.8745381e-02
 3.6624399e-01 7.4888176e-01 6.1174178e-01 9.8159790e-01 9.5425004e-01
 2.2163162e-01 2.2163318e-02 1.1720129e-02 9.1634393e-01 9.0719599e-01
 9.9571818e-01 1.5629703e-02 9.7032243e-01 3.9174223e-01 9.5023459e-01
 9.5023459e-01 2.1503465e-02 6.1680382e-01 9.5023459e-01 9.1174664e-03
 1.1916337e-02]
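
These probabilities map onto class labels through a threshold. As a sketch, assuming the usual 0.5 cut-off for a binary classifier, the values printed above can be converted to labels in plain Python (`threshold` and `labels` are names introduced for this example):

```python
# Probabilities of class 1 copied from the output above
test_probs = [
    1.5100325e-02, 8.8246174e-02, 9.2017984e-01, 1.3721527e-02, 2.2550551e-03,
    1.5940575e-01, 8.5871282e-04, 9.9002153e-01, 6.1288571e-01, 1.8745381e-02,
    3.6624399e-01, 7.4888176e-01, 6.1174178e-01, 9.8159790e-01, 9.5425004e-01,
    2.2163162e-01, 2.2163318e-02, 1.1720129e-02, 9.1634393e-01, 9.0719599e-01,
    9.9571818e-01, 1.5629703e-02, 9.7032243e-01, 3.9174223e-01, 9.5023459e-01,
    9.5023459e-01, 2.1503465e-02, 6.1680382e-01, 9.5023459e-01, 9.1174664e-03,
    1.1916337e-02,
]

threshold = 0.5  # assumed default cut-off; raise it to flag fewer invoices as risky
labels = [1 if p > threshold else 0 for p in test_probs]
print(labels.count(0), labels.count(1))
```

The resulting 16 zeros and 15 ones agree with the column totals of the confusion matrix above (15 + 1 predicted as 0, 2 + 13 predicted as 1).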


#####  Prediction on the test data is complete, but how do you run model.predict on new data? The example below feeds model.predict a Pandas data frame built from static values. The payment is one day late (made 4 days after invoicing against an expected 3 days), but since the amount is less than 80, this delay is not considered risky. model.predict returns the decision; it is often useful to call model.predict_proba as well, which returns the probability of each class:

In [None]:
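# Predict on a new observation: payment_due_date=3, payment_date=4, grand_total=50.75,
# the first customer dummy (customer_id_11) set to 1 and all other dummies set to 0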
headers = list(X)
input_variables = pd.DataFrame([[3,4,50.75,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]],
                                columns=headers,
                                dtype=float,
                                index=['input'])

# Get the model's prediction
prediction = model1.predict(input_variables)
print("Prediction: ", prediction)
prediction_proba = model1.predict_proba(input_variables)
print("Probabilities: ", prediction_proba)

Prediction:  [0]
Probabilities:  [[0.81825465 0.18174534]]
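
Typing 43 positional values by hand is error-prone. One possible alternative, sketched here in plain Python, is to fill the row by column name. The `make_row` helper and the shortened `headers` list are hypothetical; in the notebook the full list would come from `list(X)`.

```python
# Hypothetical, shortened header list; in the notebook this would be list(X)
headers = ["payment_due_date", "payment_date", "grand_total",
           "customer_id_11", "customer_id_12", "customer_id_13"]

def make_row(headers, payment_due_date, payment_date, grand_total, customer):
    """Build one feature row as a dict keyed by column name, with all dummies at 0 by default."""
    row = {name: 0.0 for name in headers}
    row["payment_due_date"] = float(payment_due_date)
    row["payment_date"] = float(payment_date)
    row["grand_total"] = float(grand_total)
    dummy = f"customer_{customer}"
    if dummy not in row:
        # A customer unseen during training has no indicator column
        raise KeyError(f"no column named {dummy}")
    row[dummy] = 1.0
    return row

row = make_row(headers, payment_due_date=3, payment_date=4,
               grand_total=50.75, customer="id_11")
values = [row[name] for name in headers]
print(values)
```

Passing `[row]` with `columns=headers` to `pd.DataFrame` would then give a one-row frame in the same column order the model was trained on.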
