<a href="https://colab.research.google.com/github/krish-94/XGBoost/blob/main/XGBoost_with_BayesianOptimization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **CREDIT CARD FRAUD DETECTION USING XGBOOST AND BAYESIAN OPTIMIZATION**

Kaggle dataset - https://www.kaggle.com/mlg-ulb/creditcardfraud/home

The dataset contains transactions made by credit cards in September 2013 by European cardholders.
It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation. Due to confidentiality issues, the original features and further background information about the data are not provided. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; it can be used, for instance, for example-dependent cost-sensitive learning. Feature 'Class' is the response variable: it takes value 1 in case of fraud and 0 otherwise.

Given the class imbalance ratio, model quality is best measured with the Area Under the Precision-Recall Curve (AUPRC). Plain accuracy computed from the confusion matrix is not meaningful for such unbalanced classification, because the majority class dominates the score.
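To make the point about accuracy concrete, the following sketch uses only the counts quoted above (492 frauds out of 284,807 transactions) and a hypothetical classifier that labels every transaction as legitimate. It is plain arithmetic, not part of the notebook's pipeline.

```python
# Counts taken from the dataset description above
n_total = 284_807
n_fraud = 492

# Hypothetical "always legitimate" classifier: every fraud is missed
true_negatives = n_total - n_fraud
true_positives = 0

accuracy = (true_negatives + true_positives) / n_total
fraud_recall = true_positives / n_fraud

print(f"accuracy:     {accuracy:.4%}")
print(f"fraud recall: {fraud_recall:.0%}")
```

The accuracy comes out above 99.8% even though not a single fraud is caught, which is why a precision-recall based metric is the more honest yardstick here.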

In [1]:
# Importing the libraries

import numpy as np
import pandas as pd

import xgboost as xgb
from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score
from sklearn.metrics import precision_score, recall_score, precision_recall_curve

In [2]:
# Preparing the train and test data

data = pd.read_csv('/content/creditcard.csv')
X = data.drop('Class', axis = 1)
Y = data['Class']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, stratify = Y, test_size = 0.2, random_state = 42)

# Converting the data frames into XGBoost DMatrix objects

dtrain = xgb.DMatrix(X_train, label= Y_train)
dtest = xgb.DMatrix(X_test)
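
As a quick sanity check on the split above, the expected sizes can be worked out by hand. The sketch below assumes that `train_test_split` rounds a fractional `test_size` up to a whole number of rows and that stratification keeps the fraud share of the test set close to 20% of all frauds. Both are assumptions about scikit-learn's behaviour, not something this notebook verifies.

```python
import math

n_total = 284_807   # transactions in the dataset
n_fraud = 492       # fraudulent transactions
test_size = 0.2     # same fraction as passed to train_test_split above

# Assumed rounding: test rows are rounded up, frauds to the nearest integer
n_test = math.ceil(test_size * n_total)
n_test_fraud = round(test_size * n_fraud)

print(n_test, n_test_fraud)
```

These numbers line up with the support column of the classification reports further down (56,962 test rows, 98 of them frauds).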

In [3]:
!pip install bayesian-optimization
from bayes_opt import BayesianOptimization



In [4]:
# max_depth - maximum tree depth for the base learners
# gamma - tree complexity parameter: a leaf node is split only if the gain from the split exceeds gamma
# learning_rate - step-size shrinkage applied to each boosting update
# subsample - ratio of the training instances sampled. Setting it to 0.5 means that XGBoost randomly samples
#     half of the training data prior to growing trees,
#     which helps prevent overfitting. Subsampling occurs once in every boosting iteration.
#     It is fixed at 0.8 below rather than tuned.

def xgb_tune(max_depth, gamma, learning_rate):
  params = {'eval_metric': 'logloss',
            'max_depth': int(max_depth),
            'learning_rate': learning_rate,
            'subsample': 0.8,
            'gamma': gamma}

  # Cross-validation of XGBoost. Here num_boost_round is equivalent to n_estimators
  # nfold - number of folds in cross-validation
  cv_xgboost = xgb.cv(params, dtrain, num_boost_round=70, nfold=5)

  # cv_xgboost holds the mean and std of the train and test logloss for every boosting round.
  # We take test-logloss-mean of the last round and multiply it by -1,
  # because BayesianOptimization maximizes the function

  return -1.0 * cv_xgboost['test-logloss-mean'].iloc[-1]
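
The sign flip in the return value can be illustrated in isolation. The sketch below uses made-up per-round logloss values (they are not results from this notebook) and a hypothetical helper, `objective_from_history`, that mirrors the last line of `xgb_tune`: take the test logloss of the final round and negate it, so that a lower loss becomes a higher score.

```python
def objective_from_history(test_logloss_per_round):
    """Negated logloss of the last boosting round (higher is better)."""
    return -1.0 * test_logloss_per_round[-1]

# Hypothetical CV histories for two hyperparameter candidates
candidate_a = [0.300, 0.050, 0.0040, 0.0031]
candidate_b = [0.250, 0.040, 0.0045, 0.0044]

score_a = objective_from_history(candidate_a)
score_b = objective_from_history(candidate_b)

# A maximizer prefers the candidate with the smaller final loss
best = "a" if score_a > score_b else "b"
print(score_a, score_b, best)
```

Note that only the last round counts in this scheme: a candidate whose loss was lowest at an earlier round and then rose again would be judged by its final value.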


In [5]:
# Instantiate the BayesianOptimization object with bounds for the parameters to be tuned
# Here the objective function is the negative cross-validated logloss
# The surrogate model is a Gaussian Process

xgb_bo = BayesianOptimization(xgb_tune, {'max_depth': (3, 10),
                                         'gamma': (0, 1),
                                         'learning_rate':(0,1)
                                        })

In [6]:
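# Performing Bayesian Optimization with Expected Improvement (EI) as the acquisition function
# Note: the acq argument of maximize() is accepted by older bayesian-optimization releases;
# newer releases may require configuring the acquisition function differently

xgb_bo.maximize(acq='ei', n_iter=5, init_points=8)  # total iterations = 8 + 5 = 13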

|   iter    |  target   |   gamma   | learni... | max_depth |
-------------------------------------------------------------
| 1         | -0.003678 | 0.9896    | 0.8962    | 8.76      |
| 2         | -0.004345 | 0.6556    | 0.7407    | 5.354     |
| 3         | -0.003999 | 0.8957    | 0.491     | 3.041     |
| 4         | -0.00377  | 0.1263    | 0.4624    | 6.926     |
| 5         | -0.003642 | 0.6449    | 0.5245    | 8.597     |
| 6         | -0.003119 | 0.2351    | 0.1817    | 9.524     |
| 7         | -0.004475 | 0.164     | 0.6953    | 9.874     |
| 8         | -0.003633 | 0.5873    | 0.4334    | 8.972     |
| 9         | -0.003716 | 0.2
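
For readers unfamiliar with Expected Improvement, here is a minimal sketch of the textbook closed form for a maximization problem, given the Gaussian Process posterior mean `mu` and standard deviation `sigma` at a candidate point and the best target seen so far. The function name and the `xi` exploration offset are illustrative choices; the `bayes_opt` package's internal implementation may differ in detail.

```python
import math

def expected_improvement(mu, sigma, best, xi=0.0):
    """Textbook EI for maximization under a Gaussian posterior N(mu, sigma^2)."""
    improvement = mu - best - xi
    if sigma <= 0.0:
        # No uncertainty left: the improvement is deterministic
        return max(improvement, 0.0)
    z = improvement / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    return improvement * cdf + sigma * pdf

# Same predicted mean, different uncertainty: the more uncertain point scores higher
ei_certain = expected_improvement(mu=-0.0035, sigma=0.0001, best=-0.0031)
ei_uncertain = expected_improvement(mu=-0.0035, sigma=0.0010, best=-0.0031)
print(ei_certain, ei_uncertain)
```

The comparison at the end shows the exploration effect: a point predicted to be slightly worse than the current best can still be worth evaluating when the surrogate is unsure about it.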

In [7]:
parameters = xgb_bo.max['params']
parameters['max_depth'] = int(parameters['max_depth'])
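
The cast above matters because the optimizer searches a continuous range and therefore proposes fractional depths. The sketch below reuses the values from iteration 6 of the log purely as an illustration (the log is cut off, so this is not necessarily the best point found) and shows that `int()` truncates rather than rounds.

```python
# Illustrative suggestion, copied from iteration 6 of the optimization log
suggested = {'gamma': 0.2351, 'learning_rate': 0.1817, 'max_depth': 9.524}

params = dict(suggested)
params['max_depth'] = int(params['max_depth'])  # truncates toward zero

print(params['max_depth'])
print(int(9.874))  # a value close to the upper bound still maps to 9
```

With bounds of (3, 10), a depth of 10 is therefore only reachable when the optimizer proposes exactly 10.0. Rounding with `round()` instead would be one way to spread the integer depths more evenly, if that matters for the search.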

In [8]:
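# Train an XGBoost classifier with the hyperparameters obtained from Bayesian Optimization
# Note: n_estimators=250 here, whereas the cross-validation above used 70 boosting rounds
from xgboost import XGBClassifier
final_model = XGBClassifier(**parameters, n_estimators=250).fit(X_train, Y_train)
# Baseline for comparison: XGBoost with default hyperparameters
default_model = XGBClassifier().fit(X_train, Y_train)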

In [9]:
from sklearn.metrics import classification_report
final_predict = final_model.predict(X_test)
print(classification_report(Y_test,final_predict))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.93      0.82      0.87        98

    accuracy                           1.00     56962
   macro avg       0.96      0.91      0.93     56962
weighted avg       1.00      1.00      1.00     56962



In [10]:
default_predict = default_model.predict(X_test)
print(classification_report(Y_test,default_predict))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.89      0.80      0.84        98

    accuracy                           1.00     56962
   macro avg       0.94      0.90      0.92     56962
weighted avg       1.00      1.00      1.00     56962
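
The f1-score column in the two reports can be reproduced from the rounded precision and recall of the fraud class, since F1 is the harmonic mean of the two. A small check, using only the figures printed above (`f1` is a helper defined here for illustration):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

# Fraud class (label 1), values read from the classification reports
f1_tuned = f1(0.93, 0.82)
f1_default = f1(0.89, 0.80)

print(round(f1_tuned, 2), round(f1_default, 2))
```

Both values round to the figures in the reports (0.87 for the tuned model, 0.84 for the default one). Because the inputs are themselves rounded, the unrounded results are only approximate.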



In [19]:
from sklearn.metrics import confusion_matrix, accuracy_score, auc

# confusion_matrix expects (y_true, y_pred)
cm = confusion_matrix(Y_test, final_predict)
acc = cm.diagonal().sum()/cm.sum()
print(acc)
print(accuracy_score(Y_test, final_predict))

# predict probabilities and keep those of the positive (fraud) class only
final_probs = final_model.predict_proba(X_test)[:, 1]

# area under the precision-recall curve (AUPRC)
precision_final, recall_final, _ = precision_recall_curve(Y_test, final_probs)
print(auc(recall_final, precision_final))

0.9995786664794073
0.9995786664794073
0.8743059593128896
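
The AUPRC printed above comes from `sklearn.metrics.auc`, which integrates the curve with the trapezoidal rule. A minimal pure-Python sketch of that rule follows, run on made-up (recall, precision) points rather than on the notebook's curve; `trapezoid_area` is a hypothetical helper name.

```python
def trapezoid_area(xs, ys):
    """Area under a piecewise-linear curve; xs may be increasing or decreasing."""
    area = 0.0
    for i in range(len(xs) - 1):
        width = xs[i + 1] - xs[i]
        mean_height = (ys[i] + ys[i + 1]) / 2.0
        area += width * mean_height
    # A decreasing x axis yields a negative sum, so report the magnitude
    return abs(area)

# Toy curve, ordered by decreasing recall
recall = [1.0, 0.5, 0.0]
precision = [0.5, 1.0, 1.0]

area = trapezoid_area(recall, precision)
print(area)
```

Taking the absolute value lets the helper accept recall values in decreasing order, which is the order in which `precision_recall_curve` returns them.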


In [20]:
# confusion_matrix expects (y_true, y_pred)
cm2 = confusion_matrix(Y_test, default_predict)
acc2 = cm2.diagonal().sum()/cm2.sum()
print(acc2)
print(accuracy_score(Y_test, default_predict))

# predict probabilities and keep those of the positive (fraud) class only
default_probs = default_model.predict_proba(X_test)[:, 1]

# area under the precision-recall curve (AUPRC)
precision_default, recall_default, _ = precision_recall_curve(Y_test, default_probs)
print(auc(recall_default, precision_default))

0.9994733330992591
0.9994733330992591
0.8587418472035832
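
The two accuracy values look almost identical, so it helps to convert them back into counts of misclassified transactions. The sketch below uses only the printed accuracies and the 56,962 test rows from the classification reports.

```python
n_test = 56_962  # support reported for the test set

accuracy_tuned = 0.9995786664794073
accuracy_default = 0.9994733330992591

# Misclassified rows = error rate times number of test rows
errors_tuned = round((1 - accuracy_tuned) * n_test)
errors_default = round((1 - accuracy_default) * n_test)

print(errors_tuned, errors_default)
```

The tuned model misclassifies 24 test transactions and the default model 30. A difference of six rows is invisible in accuracy at two decimals but shows up clearly in the AUPRC (0.874 versus 0.859).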
