## Bank Fraud Dataset Practice
I will be using the dataset from https://www.kaggle.com/volodymyrgavrysh/fraud-detection-bank-dataset-20k-records-binary

This dataset contains 20k+ transactions with 112 numerical features.

For this practice, I will use the TPOT library to build an automated pipeline and find the best model.

In [8]:
# import all the required libraries
# libraries for data analysis and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# libraries for evaluation metrics and train/test splitting
from sklearn.metrics import recall_score, precision_score, confusion_matrix, classification_report, roc_auc_score, accuracy_score
from sklearn.model_selection import train_test_split

# tpot libraries
from tpot import TPOTClassifier

In [9]:
# read the data from csv file
df = pd.read_csv('fraud_detection_bank_dataset.csv', index_col=0)
df.head()

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9,...,col_103,col_104,col_105,col_106,col_107,col_108,col_109,col_110,col_111,targets
0,9,1354,0,18,0,1,7,9,0,0,...,0,0,0,1,1,0,0,0,49,1
1,0,239,0,1,0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,55,1
2,0,260,0,4,0,3,6,0,0,0,...,0,0,0,1,1,0,0,0,56,1
3,17,682,0,1,0,0,8,17,0,0,...,0,1,0,1,1,0,0,0,65,1
4,1,540,0,2,0,1,7,1,0,0,...,0,0,0,1,1,0,0,0,175,1


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20468 entries, 0 to 20467
Columns: 113 entries, col_0 to targets
dtypes: float64(1), int64(112)
memory usage: 17.8 MB


The dataset has 113 columns (including the targets column) and a total of 20468 rows.

The columns are anonymized (col_0 to col_111), so there is no context for interpreting individual features, which makes exploratory analysis hard.

In [6]:
# check whether any column contains missing values
withna_col = df.columns[df.isna().any()].tolist()
withna_col

[]

An empty list means no column contains null values.

In [15]:
# check the composition of the targets column
df['targets'].value_counts()/len(df['targets'])

0    0.734317
1    0.265683
Name: targets, dtype: float64

The classes are split roughly 73/27, which is only a moderate imbalance, so I will skip oversampling and use accuracy as the general evaluation metric.
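Because about 73% of the rows belong to class 0, accuracy is easiest to read against the score of a model that always predicts the majority class. The sketch below computes that baseline in plain Python from the test-set class counts printed in the classification report further down (3758 and 1359). The helper name `majority_baseline_accuracy` is my own and not part of any library.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy of a model that always predicts the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total

# test-set support per class, taken from the classification report
test_counts = {0: 3758, 1: 1359}
baseline = majority_baseline_accuracy(test_counts)
print(f"Majority-class baseline accuracy: {baseline:.4f}")
```

Any model worth keeping should beat this figure by a clear margin, and recall and precision on class 1 remain worth checking alongside accuracy.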

______

TPOT can include preprocessing steps such as feature scaling in the pipelines it searches, and every column in this dataset is already numeric, so I don't need to encode or scale anything by hand. Hence, I'll go straight to splitting the dataset into train and test sets.

In [16]:
x = df.drop(columns='targets')
y = df['targets']

In [17]:
xtrain, xtest, ytrain, ytest = train_test_split(x, y, stratify=y, random_state=69)
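`train_test_split` holds out 25% of the rows when neither `test_size` nor `train_size` is given, and `stratify=y` keeps the class proportions roughly equal in both parts. As a quick arithmetic check in plain Python, using only numbers printed elsewhere in this notebook, the expected sizes can be worked out like this (variable names are mine):

```python
import math

n_rows = 20468            # total rows from df.info()
test_fraction = 0.25      # scikit-learn's default when test_size is not set

n_test = math.ceil(n_rows * test_fraction)   # the test size is rounded up
n_train = n_rows - n_test
print(n_train, n_test)

# share of class 1 in the full data (from value_counts) vs. the test set
overall_share = 0.265683
test_share = 1359 / n_test   # 1359 = class 1 support in the classification report
print(round(overall_share, 3), round(test_share, 3))
```

The test size agrees with the total support of 5117 shown in the classification report, and the class 1 share in the test set stays close to the share in the full dataset.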

In [19]:
# create the TPOT Classifier pipeline
opt = TPOTClassifier(max_time_mins=30, max_eval_time_mins=5, cv=5, random_state=69, verbosity=2, scoring='accuracy', n_jobs=-1)

Parameter explanation:
- max_time_mins: How many minutes TPOT has to optimize the pipeline
- max_eval_time_mins: How many minutes TPOT has to evaluate a single pipeline 
- cv: Cross-validation strategy used when evaluating pipelines
- random_state: The seed of the pseudo random number generator used in TPOT
- verbosity: How much information TPOT communicates while it's running
- scoring: Function used to evaluate the quality of a given pipeline for the classification problem
- n_jobs: Number of processes to use in parallel for evaluating pipelines during the TPOT optimization process.

Please refer to http://epistasislab.github.io/tpot/api/ for further information on the TPOTClassifier parameters and explanation

For this case, I give TPOT at most 30 minutes to optimize the pipeline, with up to 5 minutes to evaluate any single pipeline. I use 5-fold cross-validation with accuracy as the scoring metric, run on all of my CPU cores (n_jobs=-1), and set verbosity=2 so that progress is shown with a progress bar.

In [20]:
# Execute the pipeline
opt.fit(xtrain, ytrain)


30.38 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.


TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: ExtraTreesClassifier(input_matrix, bootstrap=False, criterion=gini, max_features=0.4, min_samples_leaf=3, min_samples_split=9, n_estimators=100)


TPOTClassifier(max_time_mins=30, n_jobs=-1, random_state=69, scoring='accuracy',
               verbosity=2)

In [21]:
# create ypred variable to contain prediction result
ypred = opt.predict(xtest)

In [22]:
# confusion matrix of ytest and ypred
cm = confusion_matrix(ytest, ypred, labels=[1,0])
result = pd.DataFrame(cm, index = ['Act1', 'Act0'], columns=['Pred1', 'Pred0'])
result

      Pred1  Pred0
Act1   1158    201
Act0    143   3615
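The metrics in the next cell can be derived by hand from these four counts, which is a useful sanity check on the `labels=[1,0]` ordering. A minimal plain-Python sketch, with variable names of my own choosing and class 1 treated as the positive class:

```python
# counts read from the confusion matrix (rows = actual, columns = predicted)
tp, fn = 1158, 201    # actual 1: predicted 1, predicted 0
fp, tn = 143, 3615    # actual 0: predicted 1, predicted 0

accuracy = (tp + tn) / (tp + fn + fp + tn)
recall = tp / (tp + fn)        # share of actual class-1 rows that were found
precision = tp / (tp + fp)     # share of class-1 predictions that were correct

print(f"accuracy={accuracy:.6f} recall={recall:.6f} precision={precision:.6f}")
```

If the rows and columns were read in the wrong order, recall and precision would swap, so agreement with `recall_score` and `precision_score` confirms the layout.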


In [29]:
# check the accuracy, precision, and recall
ev = []
ev.append(accuracy_score(ytest,ypred))
ev.append(recall_score(ytest,ypred))
ev.append(precision_score(ytest,ypred))
eval_metrics = pd.DataFrame(ev, index=['Accuracy', 'Recall', 'Precision'], columns=['TPOT Classifier'])
eval_metrics

           TPOT Classifier
Accuracy          0.932773
Recall            0.852097
Precision         0.890085


The result is quite good even though TPOT hit the 30-minute limit and stopped in the middle of evaluating a generation. With a larger time budget it might find a better pipeline.

In [33]:
# I use classification report for further details about precision, recall, and f1 score
print(classification_report(ytest,ypred))

              precision    recall  f1-score   support

           0       0.95      0.96      0.95      3758
           1       0.89      0.85      0.87      1359

    accuracy                           0.93      5117
   macro avg       0.92      0.91      0.91      5117
weighted avg       0.93      0.93      0.93      5117
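The per-class rows and the two averages in this report follow from the same confusion-matrix counts shown earlier. The sketch below recomputes them in plain Python. The helper `prf` is a hypothetical name, and the code is a hand calculation rather than scikit-learn's implementation.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 for one class from its error counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# counts from the confusion matrix; class 0's "true positives" are the
# rows correctly predicted as 0, and its errors mirror those of class 1
class1 = prf(tp=1158, fp=143, fn=201)
class0 = prf(tp=3615, fp=201, fn=143)
support = {0: 3758, 1: 1359}
total = sum(support.values())

# macro average: unweighted mean; weighted average: weighted by support
macro = [(a + b) / 2 for a, b in zip(class0, class1)]
weighted = [(a * support[0] + b * support[1]) / total
            for a, b in zip(class0, class1)]

for name, row in [("0", class0), ("1", class1),
                  ("macro avg", macro), ("weighted avg", weighted)]:
    print(f"{name:>12}", " ".join(f"{v:.2f}" for v in row))
```

The macro average treats both classes equally, so it sits slightly below the weighted average, which is pulled toward the larger and better-predicted class 0.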



In [31]:
# create the output file for the best model
opt.export('fraud_detection.py')
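
The exported script is meant to be edited and rerun on its own. As a rough idea of what rebuilding the winning model by hand might look like, the sketch below instantiates scikit-learn's `ExtraTreesClassifier` with the hyperparameters printed in the "Best pipeline" line. It is not the content of `fraud_detection.py`: the synthetic data from `make_classification` and the `random_state` values stand in for the real dataset and are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# synthetic stand-in for the bank dataset (assumption, not the real data)
X, y = make_classification(n_samples=500, n_features=20, random_state=69)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=69)

# hyperparameters copied from TPOT's reported best pipeline
model = ExtraTreesClassifier(
    bootstrap=False,
    criterion="gini",
    max_features=0.4,
    min_samples_leaf=3,
    min_samples_split=9,
    n_estimators=100,
    random_state=69,
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))   # accuracy on the synthetic hold-out set
```

To reproduce the notebook's numbers, the synthetic arrays would be replaced by `xtrain`, `xtest`, `ytrain`, and `ytest` from the cells above.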