# Credit Card Fraud Detection (ML)

Following up on the results from the previous notebook, we will now build a model that predicts whether a transaction is fraudulent.

## Import libraries and Load data

In [53]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize'] = (5, 5)

In [54]:
# load data
df = pd.read_csv('../data/processed/creditcard_new.csv')

In [55]:
df.head()

Unnamed: 0,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,...,v23,v24,v25,v26,v27,v28,amount,fraud,hour,amount_logged
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0,0.0,5.008166
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0,0.0,0.993252
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0,0.0,5.936665
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0,0.0,4.816322
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0,0.0,4.248495


In [71]:
df.var(axis=0)

v1               3.836489
v2               2.726820
v3               2.299029
v4               2.004684
v5               1.905081
v6               1.774946
v7               1.530401
v8               1.426479
v9               1.206992
v10              1.185594
v11              1.041855
v12              0.998403
v13              0.990571
v14              0.918906
v15              0.837803
v16              0.767819
v17              0.721373
v18              0.702539
v19              0.662662
v20              0.594325
v21              0.539526
v22              0.526643
v23              0.389951
v24              0.366808
v25              0.271731
v26              0.232543
v27              0.162919
v28              0.108955
fraud            0.001725
amount_logged    3.805565
dtype: float64

We will start by looking at which features are the most important in our dataset. Methods for selecting them include variance thresholding, forward selection, backward elimination, and recursive feature elimination.  
First, we drop the hour feature, which showed no predictive power (including it in some models did not change the score at all), and the amount feature, since we will use its log-transformed version (amount_logged) instead.

In [57]:
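# drop columns (not rows): without `columns=` or `axis=1`, drop() looks for row labels and raises a KeyError
df.drop(columns=['amount', 'hour'], inplace=True)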
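
As a rough illustration of the variance thresholding mentioned above, here is a minimal sketch using scikit-learn's `VarianceThreshold`. The `demo` frame and its column names are made up for the example, and the 0.5 cutoff is simply the value quoted later in this notebook, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Synthetic stand-in for the feature matrix (hypothetical data, not the real dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'high_var': rng.normal(scale=2.0, size=1000),  # variance close to 4
    'mid_var': rng.normal(scale=1.0, size=1000),   # variance close to 1
    'low_var': rng.normal(scale=0.3, size=1000),   # variance close to 0.09
})

# Keep only the columns whose variance is above the cutoff.
selector = VarianceThreshold(threshold=0.5)
selector.fit(demo)
kept = demo.columns[selector.get_support()].tolist()
print(kept)
```

Applied to the variances printed earlier, a 0.5 cutoff would remove v23 through v28. The selector has to see the unscaled features: after `StandardScaler` every column has unit variance, so a variance cutoff would no longer separate them.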

## Model Selection

Given the size of the dataset and the nature of the problem (binary classification), we will try the following models:

- Logistic Regression
- K-Nearest Neighbors
- Support Vector Classifier
- Decision Tree
- Random Forest
- AdaBoost Classifier
- Stacking Classifier
- XGBoost Classifier

In [89]:
# create X and y
X = df.drop(['fraud'], axis=1)
y = df['fraud']

# split data into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
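
To see what `stratify=y` buys us on a heavily imbalanced target, the sketch below splits a small synthetic label vector and counts the positives on each side. The arrays `X_demo` and `y_demo` are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 20 positives among 10,000 rows (0.2%).
y_demo = np.zeros(10_000, dtype=int)
y_demo[:20] = 1
X_demo = np.arange(10_000).reshape(-1, 1)

# Same split settings as in the cell above, applied to the toy data.
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print('train positives:', y_tr.sum(), 'of', len(y_tr))
print('test positives: ', y_te.sum(), 'of', len(y_te))
```

With stratification, both parts keep the 0.2% positive rate of the full vector. Without it, a random split could leave the test set with too few positives to evaluate the minority class reliably.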

### Logistic Regression

The first model that we will try is logistic regression, which is a good starting point due to its speed and simplicity.

In [100]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# create pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=1000))
])

# fit model
pipe.fit(X_train, y_train)

# predict on train and test
y_train_pred = pipe.predict(X_train)
y_test_pred = pipe.predict(X_test)

# evaluate model
from sklearn.metrics import f1_score, classification_report

print('Train f1 score: ', f1_score(y_train, y_train_pred))
print('Test f1 score: ', f1_score(y_test, y_test_pred))

Train f1 score:  0.7452135493372607
Test f1 score:  0.7314285714285713


Since the fraudulent class makes up a very small proportion of the dataset, we evaluate the model with the f1 score, which reflects both precision and recall on the minority class.  
The model does not appear to overfit: the f1 score is 0.75 on the training set and 0.73 on the test set, which is a good starting point.  
The model was a bit slow to train because of the large size of the dataset.

In [101]:
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.83      0.65      0.73        98

    accuracy                           1.00     56962
   macro avg       0.92      0.83      0.87     56962
weighted avg       1.00      1.00      1.00     56962



Notice that the predictions for class 0 are almost perfect, but that is not what we care about. What matters is how the model handles class 1, the minority class.  
With a test f1 score of 0.73 on that class (precision 0.83, recall 0.65), the model does a reasonable job on the minority class, and we can adjust the probability threshold to give more weight to recall.  
Before doing that, let's see if we can simplify the model by reducing the number of features, and whether tuning the hyperparameters improves the f1 score.  
To keep the running time manageable, we will only use variance thresholding to eliminate features (with a threshold of 0.5 we still obtained a good f1 score).
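
Adjusting the probability threshold could look roughly like the sketch below. It runs on synthetic data from `make_classification`, not on the credit card dataset, and the 0.2 threshold is an arbitrary value chosen for illustration. All `*_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced problem: roughly 5% positives (hypothetical data).
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.95], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=1000)),
])
demo_pipe.fit(X_tr, y_tr)

# Probability of the positive class for each test row.
proba = demo_pipe.predict_proba(X_te)[:, 1]

# Compare the default 0.5 cutoff with a lower, recall-friendly one.
recalls = {}
for threshold in (0.5, 0.2):
    pred = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, pred)
    precision = precision_score(y_te, pred, zero_division=0)
    print(f'threshold={threshold}: precision={precision:.2f}, recall={recalls[threshold]:.2f}')
```

Lowering the threshold can only flag more rows as positive, so recall never decreases, while precision usually drops. In practice the threshold would be chosen on a validation set rather than on the test set.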

In [6]:
df.columns

Index(['time', 'v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7', 'v8', 'v9', 'v10',
       'v11', 'v12', 'v13', 'v14', 'v15', 'v16', 'v17', 'v18', 'v19', 'v20',
       'v21', 'v22', 'v23', 'v24', 'v25', 'v26', 'v27', 'v28', 'amount',
       'fraud', 'hour', 'amount_logged'],
      dtype='object')