<h1>Lab Five: Wide and Deep Networks</h1>
<b>By Michael Watts, Maya Muralidhar, Nora Potenti, and Adam Ashcraft </b>

<h2> 1.0 Preparation </h2>

<h3> 1.1 Business Understanding </h3>

Our data set for this lab is a collection of synthetic online transactions produced by the PaySim simulator and collected by the Norwegian University of Science and Technology. The simulation is based on real financial data from a multinational company. Each record gives the type of transaction (e.g. a payment or a transfer), the balance of the source account before and after the transaction, the balance of the destination account before and after the transaction, and whether the transaction was actually fraudulent. It also carries several other fields that are not useful for this use case. As more transactions shift from physical to digital channels, it becomes more important for financial institutions to detect and deny fraudulent charges. The volume of data that must be examined to judge each transaction's legitimacy is growing to the point where these companies cannot afford to have humans do the fraud detection. This is where our model comes in: an efficient learning tool able to detect and flag fraud for these institutions. Romexsoft, a company that helps develop fraud detection models, boasts a 98% fraud detection rate. For our model to be a success, it must detect fraud at or above this rate.
<hr>
Kaggle link: https://www.kaggle.com/ntnu-testimon/paysim1/home  <br>
Romexsoft: https://www.romexsoft.com/blog/credit-card-fraud-detection-in-banking/

<h3> 1.2 Data Cleaning </h3>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as skl
import pickle
import warnings
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from scipy.special import expit
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
import missingno as mn
import sys

In [2]:
finData = pd.read_csv('data/PS_20174392719_1491204439457_log.csv') #load the data
finData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
step              int64
type              object
amount            float64
nameOrig          object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest          object
oldbalanceDest    float64
newbalanceDest    float64
isFraud           int64
isFlaggedFraud    int64
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


Let's begin by removing the columns we will not be using. After this, we will use oldbalanceOrg and newbalanceOrig to build one column, Org_Account_Delta, that reflects the direction of the change in the origin account balance: 1 for an increase, -1 for a decrease, and 0 for no change. We will do the same for oldbalanceDest and newbalanceDest in Dest_Account_Delta. Finally, we will encode the transaction type as an integer, with one code each for cash-in, cash-out, debit, payment, and transfer, and a sixth code for any unrecognized or missing value. In the isFraud column, a fraudulent transaction is marked as 1.

In [None]:
# Drop identifier and bookkeeping columns that the model will not use
finData.drop(columns=['step', 'nameOrig', 'nameDest', 'isFlaggedFraud'], inplace=True)
# Sign of the balance change: 1 = increase, -1 = decrease, 0 = no change.
# np.sign works on whole columns, which avoids a slow row-by-row apply on 6M+ rows.
finData['Org_Account_Delta'] = np.sign(finData['newbalanceOrig'] - finData['oldbalanceOrg']).astype(int)
finData['Dest_Account_Delta'] = np.sign(finData['newbalanceDest'] - finData['oldbalanceDest']).astype(int)
finData.drop(columns=['newbalanceOrig', 'newbalanceDest'], inplace=True)
# Integer-encode the transaction type; any unrecognized or missing value becomes 5.
# The keys must match the strings in the CSV exactly; PaySim spells them with
# underscores (CASH_IN, CASH_OUT), so hyphenated keys would all fall through to 5.
type_codes = {'CASH_IN': 0, 'CASH_OUT': 1, 'DEBIT': 2, 'PAYMENT': 3, 'TRANSFER': 4}
finData['type'] = finData['type'].map(type_codes).fillna(5).astype(int)
finData.info()
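
The cell above stores the transaction type as a single integer code. If a one-hot representation is wanted instead (one 0/1 column per type), a minimal sketch on toy data could look like the following. The `toy` frame and its values are made up for illustration and are not part of the lab's pipeline.

```python
import pandas as pd

# Toy frame standing in for finData; the codes are illustrative only.
toy = pd.DataFrame({"type": [0, 1, 4, 3, 1]})

# One indicator column per distinct code; dtype=int gives 0/1 rather than booleans.
one_hot = pd.get_dummies(toy["type"], prefix="type", dtype=int)
print(one_hot)
```

Each row has exactly one indicator set, so the encoding no longer implies an ordering between transaction types the way integer codes can.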

Now let's pickle our data for faster retrieval.

In [None]:
# Serialize the cleaned frame; the pickledData directory must already exist
with open('pickledData/finData.p', 'wb') as f:
    pickle.dump(finData, f)

In [None]:
# Reload the cleaned frame without repeating the cleaning steps
with open('pickledData/finData.p', 'rb') as f:
    finData = pickle.load(f)

In [None]:
finData.head()

In [None]:
# Visualize missing values; a fully filled matrix means nothing is missing
mn.matrix(finData)
plt.show()

The data set has no missing values, so nothing needs to be imputed. Our target, y, is isFraud. We also have the type and the amount of each transaction, the generated columns showing the direction of the balance change in each account, and the balance of the origin and destination accounts before each transaction.

<h3> 1.3 Cross-Product Features </h3>

First we will cross Dest_Account_Delta and type. This could reveal a correlation between the type of transaction and money arriving in a destination account. If scammers found a way to execute a certain kind of transaction on customer accounts, they would use the same transaction type over and over to move money from a customer account into their own. Next we will cross Org_Account_Delta and type. This captures the same kind of correlation from the origin side. It can also reveal cases where scammers draw directly from customer accounts without using a secondary account. Scammers would plausibly prefer this method, as it avoids a secondary account whose paper trail could lead back to them. If these kinds of fraudulent transactions are common, this cross will reveal it. Finally, we will cross Dest_Account_Delta and Org_Account_Delta. This helps the model learn the relationship between money leaving a customer account and money arriving in another account. Ultimately, this cross helps distinguish whether fraud is dominated by direct fraudulent withdrawals or by fraudulent transfers.

 


In [None]:
cross_product_sets = [['type', 'Dest_Account_Delta'],
                      ['type', 'Org_Account_Delta'],
                      ['Dest_Account_Delta', 'Org_Account_Delta']]
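
The list above only names the column pairs. One common way to turn each pair into a single categorical feature is to join the two values per row and give every distinct combination its own integer id. The sketch below does that on a toy frame; the `toy` values, the `_x_` naming scheme, and the use of pandas category codes are assumptions for illustration, not necessarily what the rest of the lab does.

```python
import pandas as pd

# Toy rows standing in for finData; values are illustrative only.
toy = pd.DataFrame({
    "type": [1, 4, 1, 3],
    "Org_Account_Delta": [-1, -1, -1, 0],
    "Dest_Account_Delta": [1, 1, 1, 0],
})

cross_product_sets = [["type", "Dest_Account_Delta"],
                      ["type", "Org_Account_Delta"],
                      ["Dest_Account_Delta", "Org_Account_Delta"]]

crossed = pd.DataFrame(index=toy.index)
for left, right in cross_product_sets:
    # Join the two values into one string per row, e.g. "1_-1"
    combined = toy[left].astype(str) + "_" + toy[right].astype(str)
    # Give each distinct combination an integer id (ids follow sorted string order)
    crossed[left + "_x_" + right] = combined.astype("category").cat.codes

print(crossed)
```

Rows that share the same combination of values receive the same id, which is what lets a wide model memorize specific pairings such as "transfer with an increasing destination balance".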

<h3> 1.4 Evaluation Criteria </h3>

For our evaluation criterion, we will focus on recall. Recall is the fraction of actual fraudulent transactions that the model catches, TP / (TP + FN), so it drops with every false negative. In this setting a false negative means marking a fraudulent transaction as genuine, and a false positive means marking a genuine transaction as fraudulent. In a false positive scenario, the client may be slightly inconvenienced: the client would have to call the bank to have the funds released. In a false negative scenario, the client's money has been illegally transferred out of the account and is most likely lost for good. The bank will have to spend time both reimbursing the account and filing the paperwork about the fraud. The client will be unhappy that the bank did not secure the money, and the criminal has successfully conned the bank. To prevent what would be the worst-case outcome for a mislabeled transaction, we will focus on keeping recall high and, consequently, the false negative rate low.


In [None]:
my_scorer = make_scorer(recall_score)
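
As a small worked example of what this scorer measures, the sketch below computes recall by hand on made-up labels and compares it with `recall_score`. The label arrays are illustrative and do not come from the data set.

```python
import numpy as np
from sklearn.metrics import recall_score

# Toy labels: 1 = fraud, 0 = genuine (illustrative values only).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # fraud correctly flagged
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # fraud missed

manual_recall = tp / (tp + fn)
false_negative_rate = 1 - manual_recall

print(manual_recall, false_negative_rate)
print(recall_score(y_true, y_pred))
```

The single false positive in `y_pred` does not affect recall at all; only the missed fraud case does.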

<h3> 1.5 Data Division </h3>

For our data, we will use a stratified shuffle split. Stratification ensures that each split has the same percentage of fraudulent transactions as the data set as a whole. This prevents the model from ever being trained or tested on a split with no fraudulent transactions present. If that happened, our network could simply label everything as genuine and still receive a good evaluation score. Shuffling prevents any one account from being disproportionately present in a split. As this data was originally ordered in time, one account may make several hundred transactions in sequence; for instance, a company may be restocking all of its inventory at once. Shuffling breaks up such runs. While 10 splits would have been the better choice, for the sake of computation we will limit ourselves to 5.


In [None]:
#first we will divide out our X and Y data
y = finData.isFraud
finData.drop(columns=['isFraud'], inplace=True)
X = finData
cv = StratifiedShuffleSplit(n_splits=5, random_state=1) 
# X_train_set = []
# X_test_set = []
# y_train_set = []
# y_test_set = []
# numerical_headers = ['amount', 'oldbalanceOrg', 'oldbalanceDest']
# for train_index, test_index in cv.split(X, y):
#     # cv.split yields positional indices, so index the pandas objects with .iloc
#     X_train, X_test = X.iloc[train_index].copy(), X.iloc[test_index].copy()
#     y_train, y_test = y.iloc[train_index], y.iloc[test_index]
#     y_train_set.append(y_train)
#     y_test_set.append(y_test)
#     for col in numerical_headers: #scale our data
#         ss = StandardScaler()
#         X_train[col] = ss.fit_transform(X_train[col].values.reshape(-1, 1))
#         X_test[col] = ss.transform(X_test[col].values.reshape(-1, 1))
#     X_train_set.append(X_train)
#     X_test_set.append(X_test)
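
To illustrate what stratification buys us, the sketch below runs a `StratifiedShuffleSplit` on a toy target and checks the fraud rate in each test split. The 2% fraud rate, the array sizes, and the `_toy` names are assumptions chosen for illustration; the real data set is far larger and more imbalanced.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy target: 1000 rows, 20 of them fraudulent (an assumed 2% rate).
y_toy = np.zeros(1000, dtype=int)
y_toy[:20] = 1
X_toy = np.arange(1000).reshape(-1, 1)

cv_toy = StratifiedShuffleSplit(n_splits=5, test_size=0.1, random_state=1)

test_rates = []
for train_idx, test_idx in cv_toy.split(X_toy, y_toy):
    # Fraction of fraudulent rows in this test split
    test_rates.append(float(y_toy[test_idx].mean()))

# Accuracy of a model that labels every row as genuine
baseline_accuracy = float((y_toy == 0).mean())

print(test_rates)
print(baseline_accuracy)
```

Every test split keeps the overall fraud rate, and the all-genuine baseline shows why accuracy alone is a misleading score on data this imbalanced.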

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Activation, Input
from keras.layers import Embedding, Flatten, Concatenate
from keras.models import Model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
import scipy
# Hold out 20% of the rows for testing; stratify=y keeps the fraud rate equal in both parts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y)
print(X_train.shape)
# Standardize the continuous columns, fitting each scaler on the training rows only
for col in ['amount', 'oldbalanceOrg', 'oldbalanceDest']:
    # np.float was removed in NumPy 1.24; np.float64 is the equivalent dtype
    X_train[col] = X_train[col].astype(np.float64)
    X_test[col] = X_test[col].astype(np.float64)
    ss = StandardScaler()
    X_train[col] = ss.fit_transform(X_train[col].values.reshape(-1, 1)).ravel()
    X_test[col] = ss.transform(X_test[col].values.reshape(-1, 1)).ravel()
# Convert to NumPy arrays for Keras
X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values
X_train.shape

In [None]:
# The next cell creates a model that includes
# the Input layer, one hidden Dense layer, and a sigmoid output layer

In [None]:
import tensorflow as tf
from keras.wrappers.scikit_learn import KerasClassifier
# From Stack Overflow: 
# https://stackoverflow.com/questions/43076609/how-to-calculate-precision-and-recall-in-keras
def as_keras_metric(method):
    import functools
    from keras import backend as K
    import tensorflow as tf
    @functools.wraps(method)
    def wrapper(self, args, **kwargs):
        """ Wrapper for turning tensorflow metrics into keras metrics """
        value, update_op = method(self, args, **kwargs)
        K.get_session().run(tf.local_variables_initializer())
        with tf.control_dependencies([update_op]):
            value = tf.identity(value)
        return value
    return wrapper
def create_model():
    # One input per feature column
    inputs = Input(shape=(X_train.shape[1],))
    # a layer instance is callable on a tensor, and returns a tensor
    x = Dense(units=6, activation='relu')(inputs)
    # Single sigmoid unit: the predicted probability that a transaction is fraudulent
    predictions = Dense(1, activation='sigmoid')(x)
    model = Model(inputs=inputs, outputs=predictions)
    model.compile(optimizer='sgd',
                  loss='mean_squared_error',
                  metrics=['accuracy'])
    return model

model = KerasClassifier(build_fn=create_model,
                        epochs=5, 
                        batch_size=500,
                        verbose=1)

In [None]:
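# Score each split with the recall scorer defined in section 1.4;
# without scoring=, cross_val_score falls back to the classifier's default accuracy
results = cross_val_score(model, X, y, cv=cv, scoring=my_scorer)
print(results.mean())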