# Problem Statement
<p> Publicly available datasets on financial services are scarce, especially in the emerging domain of mobile money transactions. Financial datasets matter to many researchers, and in particular to those of us working on fraud detection. Part of the problem is the intrinsically private nature of financial transactions, which prevents such datasets from being published.

We present a synthetic dataset generated with the PaySim simulator as one approach to this problem. PaySim uses aggregated data from a private dataset to generate a synthetic dataset that resembles the normal operation of transactions, and it injects malicious behaviour so that the performance of fraud detection methods can be evaluated.</p>

# Content
<p> PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company that operates the mobile financial service, which currently runs in more than 14 countries around the world.

This synthetic dataset is scaled down to 1/4 of the original dataset and was created just for Kaggle.</p>

# Data Dictionary

<p> step - maps a unit of time in the real world. In this case 1 step is 1 hour of time. There are 744 steps in total (744 hours, i.e. a 31-day simulation).

type - CASH_IN, CASH_OUT, DEBIT, PAYMENT and TRANSFER.

amount - amount of the transaction in local currency.

nameOrig - customer who started the transaction

oldbalanceOrg - initial balance before the transaction

newbalanceOrig - new balance after the transaction

nameDest - customer who is the recipient of the transaction

oldbalanceDest - initial balance of the recipient before the transaction. Note that there is no information for recipients whose name starts with M (merchants).

newbalanceDest - new balance of the recipient after the transaction. Note that there is no information for recipients whose name starts with M (merchants).

isFraud - marks the transactions made by the fraudulent agents inside the simulation. In this specific dataset, the fraudulent agents aim to profit by taking control of customers' accounts and trying to empty the funds by transferring them to another account and then cashing out of the system.

isFlaggedFraud - The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.</p>
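
The `step` and `isFlaggedFraud` definitions above can be sketched in plain Python. This is only an illustration, not PaySim's implementation: the helper names `step_to_day_hour` and `is_flagged` are hypothetical, step 1 is assumed to be the first hour of day 1, and the flag is assumed to apply only to `TRANSFER` transactions.

```python
FLAG_THRESHOLD = 200_000  # from the data dictionary: larger transfers are flagged


def step_to_day_hour(step):
    """Convert a 1-based hourly step into (day, hour_of_day).

    Assumption: step 1 is hour 0 of day 1.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    day, hour = divmod(step - 1, 24)
    return day + 1, hour


def is_flagged(tx_type, amount):
    """Hypothetical reading of the isFlaggedFraud rule (TRANSFER only)."""
    return tx_type == "TRANSFER" and amount > FLAG_THRESHOLD


first = step_to_day_hour(1)
last = step_to_day_hour(744)
print(first, last)
print(is_flagged("TRANSFER", 181.0), is_flagged("TRANSFER", 250_000.0))
```

With 24 steps per day, step 744 falls on hour 23 of day 31.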

## Data Loading

In [1]:
import pandas as pd


In [2]:
raw_data = pd.read_csv("../dataset/PS_20174392719_1491204439457_log.csv")
raw_data.head(3)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0


In [3]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


In [4]:
raw_data.isnull().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

## Data Exploration and Analysis

#### How many data samples are there?

In [5]:
raw_data.shape[0]

6362620

#### How many transactions are fraudulent?

In [6]:
raw_data.loc[raw_data.isFraud==1].shape[0]

8213
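
The two counts above imply a heavily imbalanced dataset. The short calculation below uses only the totals printed in this notebook (6,362,620 rows, 8,213 of them fraudulent); the variable names are illustrative.

```python
TOTAL_TRANSACTIONS = 6_362_620  # raw_data.shape[0]
FRAUD_TRANSACTIONS = 8_213      # rows with isFraud == 1

legit_transactions = TOTAL_TRANSACTIONS - FRAUD_TRANSACTIONS
fraud_rate = FRAUD_TRANSACTIONS / TOTAL_TRANSACTIONS
legit_per_fraud = legit_transactions / FRAUD_TRANSACTIONS
# Accuracy of a trivial classifier that always predicts "not fraud"
baseline_accuracy = legit_transactions / TOTAL_TRANSACTIONS

print(f"fraud rate:        {fraud_rate:.4%}")
print(f"legit per fraud:   {legit_per_fraud:.1f}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
```

Because a constant "not fraud" prediction is already right more than 99.8% of the time, accuracy alone says little on the full dataset.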

#### What kinds of transactions are there ?

In [7]:
raw_data.type.unique().tolist()

['PAYMENT', 'TRANSFER', 'CASH_OUT', 'DEBIT', 'CASH_IN']

#### Which kinds of transactions have fraudulent records?

In [8]:
raw_data.loc[raw_data.isFraud==1].type.unique().tolist()

['TRANSFER', 'CASH_OUT']

#### How many unique origin accounts are in the records?

In [9]:
raw_data.nameOrig.unique().shape[0]

6353307

#### How many origin accounts are involved in fraudulent transactions?

In [10]:
raw_data.loc[raw_data.isFraud == 1].nameOrig.unique().shape[0]

8213

#### How many destination accounts are involved in fraudulent transactions?

In [11]:
raw_data.loc[raw_data.isFraud == 1].nameDest.unique().shape[0]

8169
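
There are 8,213 fraudulent transactions but only 8,169 distinct destination accounts, so some destinations must receive more than one fraudulent transaction. A small sketch of how such repeats could be counted, using made-up account names rather than the real data:

```python
from collections import Counter

# Toy (made-up) destination accounts of fraudulent transactions
fraud_destinations = ["C10", "C11", "C10", "C12", "C13", "C12", "C10"]

counts = Counter(fraud_destinations)
# Accounts that appear as a destination more than once
repeated = {account: n for account, n in counts.items() if n > 1}
# Transactions beyond the first one per destination
extra_transactions = len(fraud_destinations) - len(counts)

# The same difference, computed from the two counts printed in this notebook
notebook_gap = 8_213 - 8_169

print(repeated, extra_transactions, notebook_gap)
```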

# Preprocess Data

In [12]:
raw_data_copy = raw_data.copy() # copy raw_data

In [13]:
raw_data_copy.head(3)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0


In [14]:
# Drop columns not needed in the analysis
raw_data_copy = raw_data_copy.drop(['nameOrig', 'nameDest'], axis=1)

In [15]:
raw_data_copy.head(3)

Unnamed: 0,step,type,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,170136.0,160296.36,0.0,0.0,0,0
1,1,PAYMENT,1864.28,21249.0,19384.72,0.0,0.0,0,0
2,1,TRANSFER,181.0,181.0,0.0,0.0,0.0,1,0


In [16]:
from sklearn import preprocessing
import pickle

# Encode the categorical transaction type as integers
enc = preprocessing.LabelEncoder()
enc.fit(raw_data_copy["type"])
raw_data_copy["type"] = enc.transform(raw_data_copy["type"])

# Persist the fitted encoder so the same mapping can be reused later
with open('../model/fraud_detection_model_v1.pkl', 'wb') as f:
    pickle.dump(enc, f)

raw_data_copy.head()

In [33]:

# Reload the pickled LabelEncoder; the context manager closes the file afterwards
with open('../model/fraud_detection_model_v1.pkl', 'rb') as f:
    label = pickle.load(f)

In [26]:
enc.classes_

array(['CASH_IN', 'CASH_OUT', 'DEBIT', 'PAYMENT', 'TRANSFER'],
      dtype=object)
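
`LabelEncoder` numbers the distinct labels in sorted order, which is the order `enc.classes_` shows above. The plain-Python sketch below reproduces that mapping; the `encode` helper is hypothetical and only illustrates the idea.

```python
# Transaction types in the order they first appear in the data
observed_types = ["PAYMENT", "TRANSFER", "CASH_OUT", "DEBIT", "CASH_IN"]

# Sort the distinct labels, then number them; this mirrors enc.classes_
classes = sorted(set(observed_types))
type_to_code = {name: code for code, name in enumerate(classes)}


def encode(tx_type):
    """Map a transaction type to its integer code; fail loudly on unseen types."""
    try:
        return type_to_code[tx_type]
    except KeyError:
        raise ValueError(f"unknown transaction type: {tx_type!r}") from None


print(type_to_code)
```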

In [None]:
raw_data_copy.dtypes

In [None]:
# Split the data 50:50 between non-fraud and fraud transactions

raw_data_fraud = raw_data_copy.loc[raw_data_copy.isFraud == 1]
raw_data_fraud.shape

In [None]:
raw_data_legit = raw_data_copy.loc[raw_data_copy.isFraud != 1]
raw_data_legit.head()

In [None]:
# Randomly select as many non-fraud transactions as there are fraud
# transactions; a fixed random_state makes the draw repeatable

fraud_transaction_count = len(raw_data_fraud)
raw_data_legit_selected = raw_data_legit.sample(fraud_transaction_count, random_state=99)
raw_data_legit_selected.shape

In [None]:
final_raw_data_trimmed = pd.concat([raw_data_legit_selected, raw_data_fraud])
final_raw_data_trimmed.shape
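
The balancing step above is random undersampling of the majority class. A minimal stand-alone sketch of the same idea, on made-up labels and with the standard library's `random` module standing in for `DataFrame.sample`:

```python
import random

# Toy (made-up) labels: 1 = fraud, 0 = legitimate
labels = [0] * 95 + [1] * 5
rows = list(enumerate(labels))  # (row_id, label) pairs

fraud_rows = [row for row in rows if row[1] == 1]
legit_rows = [row for row in rows if row[1] == 0]

# Draw as many legitimate rows as there are fraud rows, without replacement
rng = random.Random(99)  # fixed seed so the draw is repeatable
legit_sample = rng.sample(legit_rows, k=len(fraud_rows))

balanced = legit_sample + fraud_rows
print(len(balanced), sum(label for _, label in balanced))
```

On the real data this gives 2 × 8,213 = 16,426 rows and discards most of the legitimate transactions.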

In [None]:
final_raw_data_trimmed.head(1)

In [None]:
# Separate the input features (type, amount and the four balance columns) from the target

X = final_raw_data_trimmed.iloc[:, [1,2,3,4,5,6]]
X.head(1)

In [None]:
y = pd.DataFrame(final_raw_data_trimmed["isFraud"])
y.head(1)

In [None]:
# Split the data into training (70%) and test (30%) sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=99)

In [None]:
X_train

In [None]:
y_train

# Neural Network Modeling

In [None]:
# Only the names used in the cells below are imported
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adagrad

In [None]:
def baseline_model(input_dim, optimizer, activation="relu", dropout=0.1):
    """Build a small fully connected binary classifier."""
    model = Sequential()
    # Only the first layer needs the input dimension; later layers infer it
    model.add(Dense(80, input_dim=input_dim, activation=activation))
    model.add(Dropout(dropout))
    model.add(Dense(40, activation=activation))
    model.add(Dropout(dropout))
    model.add(Dense(60, activation=activation))
    model.add(Dropout(dropout))
    # A single sigmoid unit outputs the estimated probability of fraud
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

In [None]:
opt = Adagrad(learning_rate=0.001)
input_dim = X_train.shape[1]
model = baseline_model(input_dim=input_dim, optimizer=opt)

In [None]:
model.summary()
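
As a cross-check on the summary, the number of trainable parameters can be worked out by hand. The sketch below assumes six input features (the columns selected into `X`) and the layer widths used in `baseline_model`; `dense_params` is a hypothetical helper, and `Dropout` layers contribute no parameters.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units


n_features = 6                 # assumption: the six columns selected into X
layer_units = [80, 40, 60, 1]  # Dense layer widths in baseline_model

per_layer = []
previous = n_features
for units in layer_units:
    per_layer.append(dense_params(previous, units))
    previous = units

total_params = sum(per_layer)
print(per_layer, total_params)
```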

In [None]:
history = model.fit(X_train, y_train, epochs=100, verbose=1, batch_size=32, validation_data=(X_test, y_test))

In [None]:
import seaborn as sns
import numpy as np

# Collect the per-epoch training metrics into a DataFrame
history_df = pd.DataFrame.from_dict(history.history)
history_df["epochs"] = np.arange(len(history_df))
history_df.head()

In [None]:
import plotly.express as px

# Plot training and validation accuracy per epoch
fig = px.line()
fig.add_scatter(x=history_df['epochs'], y=history_df['accuracy'], name="Training Accuracy")
fig.add_scatter(x=history_df['epochs'], y=history_df['val_accuracy'], name="Validation Accuracy")

fig.show()


In [None]:
# Threshold the sigmoid output at 0.5 to obtain 0/1 class predictions
predicted_nn = (model.predict(X_test.values) > 0.5).astype("int32")
predicted_df = pd.DataFrame(predicted_nn, columns=["isFraud"])

## Model Evaluation

In [None]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test["isFraud"], predicted_df)
accuracy # overall accuracy

In [None]:
# filter fraudulent transactions in y_test
y_test_fraud = y_test[y_test["isFraud"]==1]

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, predicted_df)
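
For binary labels, `confusion_matrix` returns rows for the true class and columns for the predicted class, i.e. `[[TN, FP], [FN, TP]]`. The sketch below rebuilds that layout in plain Python on made-up labels; `binary_confusion_matrix` is a hypothetical helper, not part of scikit-learn.

```python
def binary_confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for labels in {0, 1}."""
    matrix = [[0, 0], [0, 0]]
    for truth, guess in zip(y_true, y_pred):
        matrix[truth][guess] += 1  # row = true class, column = predicted class
    return matrix


# Toy (made-up) labels: 1 = fraud, 0 = non-fraud
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 1, 0, 0, 1, 0, 1, 0]

cm_toy = binary_confusion_matrix(y_true_toy, y_pred_toy)
print(cm_toy)
```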

In [None]:
import numpy as np
from matplotlib.pyplot import figure

figure(figsize=(12, 14), dpi=200)

def plot_confusion_matrix(cm,
                          target_names,
                          title='Confusion matrix',
                          cmap=None,
                          normalize=True):
    """
    given a sklearn confusion matrix (cm), make a nice plot

    Arguments
    ---------
    cm:           confusion matrix from sklearn.metrics.confusion_matrix

    target_names: given classification classes such as [0, 1, 2]
                  the class names, for example: ['high', 'medium', 'low']

    title:        the text to display at the top of the matrix

    cmap:         the gradient of the values displayed from matplotlib.pyplot.cm
                  see http://matplotlib.org/examples/color/colormaps_reference.html
                  plt.get_cmap('jet') or plt.cm.Blues

    normalize:    If False, plot the raw numbers
                  If True, plot the proportions

    Usage
    -----
    plot_confusion_matrix(cm           = cm,                  # confusion matrix created by
                                                              # sklearn.metrics.confusion_matrix
                          normalize    = True,                # show proportions
                          target_names = y_labels_vals,       # list of names of the classes
                          title        = best_estimator_name) # title of graph

    Citation
    ---------
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html

    """
    import matplotlib.pyplot as plt
    import numpy as np
    import itertools

    accuracy = np.trace(cm) / float(np.sum(cm))
    misclass = 1 - accuracy

    if cmap is None:
        cmap = plt.get_cmap('Blues')

    plt.figure(figsize=(8, 6))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()

    if target_names is not None:
        tick_marks = np.arange(len(target_names))
        plt.xticks(tick_marks, target_names, rotation=45)
        plt.yticks(tick_marks, target_names)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]


    thresh = cm.max() / 1.5 if normalize else cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        if normalize:
            plt.text(j, i, "{:0.4f}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
        else:
            plt.text(j, i, "{:,}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")


    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label\naccuracy={:0.4f}; misclass={:0.4f}'.format(accuracy, misclass))
    plt.show()

## Confusion Matrix


In [None]:
plot_confusion_matrix(cm           = cm, 
                      normalize    = True,
                      target_names = ['Non-Fraud', "Fraud"],
                      title        = "Confusion Matrix")
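
With `normalize=True`, each row of the matrix is divided by its row sum, so the diagonal shows the fraction of each true class that was predicted correctly. A small sketch of that normalization on made-up counts (`normalize_rows` is a hypothetical helper):

```python
def normalize_rows(matrix):
    """Divide each row by its sum so every row adds up to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


# Toy (made-up) counts in sklearn's layout: [[TN, FP], [FN, TP]]
counts_toy = [[90, 10], [5, 15]]
proportions = normalize_rows(counts_toy)
print(proportions)
```

In this toy case the bottom-right entry is the recall on the fraud class and the top-left entry is the recall on the non-fraud class.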

# Metrics

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import matthews_corrcoef
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

def metrics(true, predicted):
    # Note: for single-label data, average="micro" makes F1, precision and
    # recall all equal to accuracy; average="binary" scores the fraud class only
    accuracy = accuracy_score(true, predicted)
    f1 = f1_score(true, predicted, average="micro")
    mcc = matthews_corrcoef(true, predicted)
    precision = precision_score(true, predicted, average="micro")
    recall = recall_score(true, predicted, average="micro")

    print("Accuracy: ", accuracy)
    print("F1 Score: ", f1)
    print("MCC: ", mcc)
    print("Precision: ", precision)
    print("Recall: ", recall)

In [None]:
metrics(y_test, predicted_df)
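
The `metrics` function uses `average="micro"`, which pools true and false positives over both classes. For single-label classification that pooling collapses precision and recall to plain accuracy. The sketch below shows this on made-up confusion counts and contrasts it with the positive-class ("binary") scores; `binary_and_micro_scores` is a hypothetical helper.

```python
def binary_and_micro_scores(tn, fp, fn, tp):
    """Positive-class precision/recall, their micro averages, and accuracy."""
    # Scores for the positive (fraud) class only
    precision_pos = tp / (tp + fp) if (tp + fp) else 0.0
    recall_pos = tp / (tp + fn) if (tp + fn) else 0.0

    # Micro averaging pools the counts of both classes.
    # Seen from class 0: its TP is tn, its FP is fn, its FN is fp.
    pooled_tp = tp + tn
    pooled_fp = fp + fn
    pooled_fn = fn + fp
    precision_micro = pooled_tp / (pooled_tp + pooled_fp)
    recall_micro = pooled_tp / (pooled_tp + pooled_fn)

    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return precision_pos, recall_pos, precision_micro, recall_micro, accuracy


# Toy (made-up) confusion counts
p_pos, r_pos, p_micro, r_micro, acc = binary_and_micro_scores(tn=90, fp=10, fn=5, tp=15)
print(p_pos, r_pos, p_micro, r_micro, acc)
```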

# Preparation for Deployment


### Export Model to file

In [None]:
model.save("../model/fraud_detection_model_v1.h5")

### Testing the model file

In [38]:
from tensorflow.keras.models import load_model
fraud_detection_model = load_model("../model/fraud_detection_model_v1.h5")



### Preprocess Function

In [49]:
# Read the raw data
raw_data = pd.read_csv("../dataset/PS_20174392719_1491204439457_log.csv")

In [58]:
raw_data.columns.tolist()

['step',
 'type',
 'amount',
 'nameOrig',
 'oldbalanceOrg',
 'newbalanceOrig',
 'nameDest',
 'oldbalanceDest',
 'newbalanceDest',
 'isFraud',
 'isFlaggedFraud']

In [50]:
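from sklearn.preprocessing import LabelEncoder

def preprocess_(data):
    # Features: type, amount, oldbalanceOrg, newbalanceOrig, oldbalanceDest, newbalanceDest
    # .copy() avoids pandas' SettingWithCopyWarning when the column is overwritten
    X_ = data.iloc[:, [1, 2, 4, 5, 7, 8]].copy()
    # Refitting reproduces the training codes only if all five types are present
    label = LabelEncoder()
    X_["type"] = label.fit_transform(X_["type"])
    y_ = pd.DataFrame(data["isFraud"])
    return X_, y_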

In [51]:
X_, y_ = preprocess_(raw_data)
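
`preprocess_` fits a fresh encoder on whatever data it receives. That matches the training codes on the full dataset, but not necessarily on a batch in which some transaction types are missing. The sketch below contrasts a fixed mapping with refitting; `TYPE_CODES` copies the order reported by `enc.classes_`, and both helper functions are hypothetical.

```python
# Codes in the order reported by enc.classes_ during training
TYPE_CODES = {"CASH_IN": 0, "CASH_OUT": 1, "DEBIT": 2, "PAYMENT": 3, "TRANSFER": 4}


def encode_types(types):
    """Encode transaction types with the fixed training-time mapping."""
    unknown = sorted(set(types) - set(TYPE_CODES))
    if unknown:
        raise ValueError(f"unseen transaction types: {unknown}")
    return [TYPE_CODES[t] for t in types]


def refit_codes(types):
    """Codes obtained by refitting on this batch alone (sorted distinct labels)."""
    classes = sorted(set(types))
    return [classes.index(t) for t in types]


batch = ["TRANSFER", "PAYMENT", "TRANSFER"]  # toy batch missing three types
fixed = encode_types(batch)
refit = refit_codes(batch)
print(fixed, refit)
```

Loading the pickled encoder and calling its `transform` method would be the scikit-learn equivalent of the fixed mapping.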

## Prediction


In [52]:
prediction = (fraud_detection_model.predict(X_) > 0.5).astype("int32")



## Metrics

In [56]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import matthews_corrcoef
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

def metrics(true, predicted):
    # Note: for single-label data, average="micro" makes F1, precision and
    # recall all equal to accuracy; average="binary" scores the fraud class only
    accuracy = accuracy_score(true, predicted)
    f1 = f1_score(true, predicted, average="micro")
    mcc = matthews_corrcoef(true, predicted)
    precision = precision_score(true, predicted, average="micro")
    recall = recall_score(true, predicted, average="micro")

    print("Accuracy: ", accuracy)
    print("F1 Score: ", f1)
    print("MCC: ", mcc)
    print("Precision: ", precision)
    print("Recall: ", recall)

In [57]:
metrics(y_.values, prediction)

Accuracy:  0.8638351811046393
F1 Score:  0.8638351811046393
MCC:  0.08438865843386466
Precision:  0.8638351811046393
Recall:  0.8638351811046393
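
The output above combines a fairly high accuracy with an MCC close to zero. On heavily imbalanced data the two can diverge widely, because MCC uses all four cells of the confusion matrix. The sketch below computes both from confusion counts; the counts are hypothetical and are not this model's results.

```python
import math


def accuracy_and_mcc(tn, fp, fn, tp):
    """Accuracy and Matthews correlation coefficient from confusion counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report MCC as 0 when the denominator is zero
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return accuracy, mcc


# Hypothetical counts for an imbalanced test set, NOT taken from this notebook
acc_toy, mcc_toy = accuracy_and_mcc(tn=8600, fp=1390, fn=2, tp=8)
print(round(acc_toy, 4), round(mcc_toy, 4))
```

Here most positives are caught, yet the many false positives keep the MCC low while accuracy stays high.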
