# Credit Card Fraud Classification with XGBoost and Sklearn

## Import Libraries

The libraries used are:
* Numpy: https://numpy.org/doc/stable/
* Pandas: https://pandas.pydata.org/docs/
* XGBoost: https://xgboost.readthedocs.io/en/stable/
* Sklearn: https://scikit-learn.org/stable/index.html 

The dataset is publicly available on Kaggle: https://www.kaggle.com/datasets/nelgiriyewithana/credit-card-fraud-detection-dataset-2023

In [1]:
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

## Understanding the Data

In [2]:
# Use pandas to read in the dataset
df = pd.read_csv('creditcard_2023.csv')

In [3]:
# Print the first 5 rows to inspect some values
df.head()

Unnamed: 0,id,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-0.260648,-0.469648,2.496266,-0.083724,0.129681,0.732898,0.519014,-0.130006,0.727159,...,-0.110552,0.217606,-0.134794,0.165959,0.12628,-0.434824,-0.08123,-0.151045,17982.1,0
1,1,0.9851,-0.356045,0.558056,-0.429654,0.27714,0.428605,0.406466,-0.133118,0.347452,...,-0.194936,-0.605761,0.079469,-0.577395,0.19009,0.296503,-0.248052,-0.064512,6531.37,0
2,2,-0.260272,-0.949385,1.728538,-0.457986,0.074062,1.419481,0.743511,-0.095576,-0.261297,...,-0.00502,0.702906,0.945045,-1.154666,-0.605564,-0.312895,-0.300258,-0.244718,2513.54,0
3,3,-0.152152,-0.508959,1.74684,-1.090178,0.249486,1.143312,0.518269,-0.06513,-0.205698,...,-0.146927,-0.038212,-0.214048,-1.893131,1.003963,-0.51595,-0.165316,0.048424,5384.44,0
4,4,-0.20682,-0.16528,1.527053,-0.448293,0.106125,0.530549,0.658849,-0.21266,1.049921,...,-0.106984,0.729727,-0.161666,0.312561,-0.414116,1.071126,0.023712,0.419117,14278.97,0


In [4]:
# Check the dataset size to help decide which algorithms are practical
df.shape

(568630, 31)

In [5]:
# Check for any missing values
df.isnull().sum()

id        0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

In [6]:
# Create our features (X) and target (y)
# Drop 'Class' as it is the target and drop 'id' as that is the index
X = df.drop(columns=['Class', 'id'])

# Set y equal to the target 'Class'
y = df['Class']

In [7]:
# Check the type of y (it is a pandas Series)
type(y)

pandas.core.series.Series

In [8]:
# Convert it to a DataFrame
y = y.to_frame()

# Verify it worked 
type(y)

pandas.core.frame.DataFrame

## Split the data and understand the values included in y_train and y_test

In [9]:
# Use sklearn to split the data
# Pass X and y as the features and target
# Keep 20% of the data for testing
# Apply a random state of 42 for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [10]:
# See all possible outcomes for 'Class' in the training set for y
y_train['Class'].unique()

array([1, 0], dtype=int64)

In [11]:
# See all possible outcomes for 'Class' in the testing set for y
y_test['Class'].unique()

array([1, 0], dtype=int64)
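
`unique()` confirms that both classes appear in each split, but not in what proportion. Below is a minimal sketch of checking the proportions with `value_counts(normalize=True)` and keeping them equal across splits via the `stratify` argument of `train_test_split`. The toy data (`toy_X`, `toy_y`) is made up for illustration and is not drawn from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows, 10% labelled as fraud (1)
toy_y = pd.DataFrame({'Class': [1] * 100 + [0] * 900})
toy_X = pd.DataFrame({'feature': np.arange(len(toy_y))})

# stratify keeps the class proportions the same in both splits
tX_train, tX_test, ty_train, ty_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y['Class'])

# Share of each class in the training and testing targets
train_share = ty_train['Class'].value_counts(normalize=True)
test_share = ty_test['Class'].value_counts(normalize=True)
print(train_share)
print(test_share)
```

Running the same `value_counts` call on `y_train['Class']` and `y_test['Class']` would show how the real split is distributed.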

## Create, train and test the model

In [12]:
# Initialize XGBClassifier with the evaluation metric set to logloss
# Logloss suits this problem because the target is binary
xgb_classifier = XGBClassifier(eval_metric='logloss')

In [13]:
# Fit model 
xgb_classifier.fit(X_train, y_train)

In [14]:
# Make predictions
y_pred = xgb_classifier.predict(X_test)

In [15]:
# Compute the accuracy on the test set, as a percentage
accuracy = accuracy_score(y_test, y_pred) * 100
accuracy

99.97098288869739
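
Accuracy is a single summary number. For a fraud classifier it can help to see which kind of error is being made. The sketch below uses scikit-learn's `confusion_matrix`, `precision_score` and `recall_score` on a handful of made-up labels (`toy_true`, `toy_pred`); these are hypothetical values, not results from the model above.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical labels and predictions, for illustration only
toy_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
toy_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(toy_true, toy_pred)
tn, fp, fn, tp = cm.ravel()

toy_accuracy = accuracy_score(toy_true, toy_pred)
toy_precision = precision_score(toy_true, toy_pred)  # tp / (tp + fp)
toy_recall = recall_score(toy_true, toy_pred)        # tp / (tp + fn)

print(cm)
print(toy_accuracy, toy_precision, toy_recall)
```

The same calls accept `y_test` and `y_pred` from the cells above.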

## Hyperparameter tuning

In [16]:
# Define a parameter grid to search over
param_grid = {
    # Max depth is the maximum depth of each tree
    'max_depth': [3, 5, 7],
    # Step size (shrinkage) applied at each boosting round
    'learning_rate': [0.01, 0.1, 1],
    # Number of boosting rounds (trees)
    'n_estimators': [10, 25, 50, 100],
    # Fraction of features randomly sampled for each tree, which can reduce overfitting
    'colsample_bytree': [0.5, 0.8, 1]
}
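
A grid search tries every combination of the listed values, so the cost grows multiplicatively with each added parameter. As a quick sanity check, the sketch below counts the candidates in this grid and the resulting number of model fits; `cv_folds` is an illustrative name for the fold count passed to `GridSearchCV` in the next cell.

```python
import math

param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 1],
    'n_estimators': [10, 25, 50, 100],
    'colsample_bytree': [0.5, 0.8, 1]
}

# Number of candidates is the product of the list lengths: 3 * 3 * 4 * 3
n_candidates = math.prod(len(values) for values in param_grid.values())

# Each candidate is fitted once per cross-validation fold
cv_folds = 2
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 108 216
```

With the default `refit=True`, `GridSearchCV` also fits the best candidate once more on the full training set after the search.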

In [17]:

# Initialize GridSearchCV with the XGBoost classifier and the parameter grid
grid_search = GridSearchCV(estimator=XGBClassifier(eval_metric='logloss'), 
                           # Search over the parameter grid defined above
                           param_grid=param_grid,
                           # Use only 2 cross-validation folds because the dataset is large
                           cv=2, 
                           # Use all available CPU cores
                           n_jobs=-1, 
                           # Print progress messages
                           verbose=1)

In [18]:
# Fit the grid search
grid_search.fit(X_train, y_train)

Fitting 2 folds for each of 108 candidates, totalling 216 fits


In [19]:
# Get the best parameters and the best model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Make predictions with the best model
y_pred_best = best_model.predict(X_test)

# Evaluate the best model
accuracy_after_tuning = accuracy_score(y_test, y_pred_best) * 100
print("Best Parameters:", best_params, "Best Accuracy: ", accuracy_after_tuning)


Best Parameters: {'colsample_bytree': 0.8, 'learning_rate': 1, 'max_depth': 7, 'n_estimators': 100} Best Accuracy:  99.97713803351917
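
The "Best Accuracy" printed above is measured on the test set. The score the grid search itself used to pick these parameters is the mean cross-validated score, available as `best_score_`, with per-candidate details in `cv_results_`. Here is a minimal sketch on synthetic data, using scikit-learn's `GradientBoostingClassifier` as a lightweight stand-in for `XGBClassifier` and a much smaller, hypothetical grid (`demo_grid`):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the credit card dataset
demo_X, demo_y = make_classification(n_samples=300, n_features=8, random_state=42)

# Hypothetical small grid so the sketch runs quickly
demo_grid = {'max_depth': [2, 3], 'n_estimators': [10, 25]}
demo_search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                           param_grid=demo_grid, cv=2)
demo_search.fit(demo_X, demo_y)

# One row per candidate, best-ranked first
demo_results = (pd.DataFrame(demo_search.cv_results_)
                .sort_values('rank_test_score')
                [['params', 'mean_test_score', 'rank_test_score']])

print(demo_results)
print(demo_search.best_params_, demo_search.best_score_)
```

Applied to the notebook's own `grid_search` object, the same attributes would show how close the runner-up candidates were to the selected one.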


In [20]:
# Difference in test accuracy, in percentage points
accuracy_after_tuning - accuracy

0.006155144821775593

This shows that after tuning the parameters, the model's test accuracy rose by just over 0.006 percentage points, from about 99.971% to about 99.977%.
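
To put that difference in concrete terms, the rough sketch below converts the two accuracy values into counts of misclassified test transactions. The test-set size is an assumption: it is taken as one fifth of the 568630 rows reported by `df.shape`, matching `test_size=0.2`.

```python
n_rows = 568630        # from df.shape
n_test = n_rows // 5   # assumed test-set size for test_size=0.2

# Accuracy values (in percent) copied from the outputs above
accuracy = 99.97098288869739
accuracy_after_tuning = 99.97713803351917

# Misclassified test rows implied by each accuracy value
errors_before = round((100 - accuracy) / 100 * n_test)
errors_after = round((100 - accuracy_after_tuning) / 100 * n_test)

print(n_test, errors_before, errors_after, errors_before - errors_after)
```

Under that assumption, the figures correspond to 33 misclassified test transactions before tuning and 26 after, a difference of 7 out of 113726.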