**Hongqiang Zhou**

Silver Spring, MD

This notebook continues my work on credit card fraud detection. In a previous notebook (https://www.kaggle.com/zhouhq/credit-fraud-detection-the-power-of-ensemble), we practiced logistic regression, classification trees, random forests, and AdaBoost. Instead of expanding that notebook, we start a new one.

In this notebook, we build a neural network with the help of TensorFlow. The model initializes its weights with the Xavier initializer, minimizes the cost with the Adam optimizer on mini-batches, and penalizes large weights through L2-regularization. A two-layer network is then constructed, and its hyper-parameters are tuned by cross-validation on the training data.

The data is highly skewed. In the previous notebook, we over-sampled the minority class of the training data with the SMOTE technique. In this notebook, we try a different way to balance the classes in the training data: under-sampling. This approach removes most instances of the majority class from the training data set. The advantage of this approach is that it balances the classes without introducing hypothetical data. The disadvantage is that we lose a great amount of data. To compensate for the lack of data, cross-validation is employed when tuning the model.

First, let us import modules to be employed in this project.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.Session, tf.contrib)
from tensorflow.python.framework import ops
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, roc_curve, auc, roc_auc_score
from itertools import cycle

plt.rcParams['figure.figsize'] = (7, 4)
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'
%matplotlib inline

**Construct a neural network through TensorFlow**

The following functions compose a general neural network model.

In [None]:
def create_placeholders(n_x, n_y):
    X = tf.placeholder(dtype = tf.float32, shape = (n_x, None), name = 'X')
    Y = tf.placeholder(dtype = tf.float32, shape = (n_y, None), name = 'Y')
    return X, Y

In [None]:
def initialize_parameters(layers_dims):
    num_layers = len(layers_dims) - 1
    parameters = {}
    for l in range(1, num_layers + 1):
        parameters['W' + str(l)] = tf.get_variable('W' + str(l), [layers_dims[l], layers_dims[l - 1]],\
                            initializer = tf.contrib.layers.xavier_initializer(seed = next(seeds)))
        parameters['b' + str(l)] = tf.get_variable('b' + str(l), [layers_dims[l], 1], \
                                                   initializer = tf.zeros_initializer())
    
    return parameters   
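
As a rough illustration of what the Xavier (Glorot) initializer does in its default uniform mode, each weight is drawn from U(-r, r) with r = sqrt(6 / (fan_in + fan_out)). The sketch below uses only the standard library. The helper name `xavier_uniform` and the fixed seed are hypothetical and are not part of TensorFlow.

```python
import math
import random

def xavier_uniform(fan_out, fan_in, seed=0):
    """Return a fan_out x fan_in weight matrix (list of rows) and its sampling limit."""
    rng = random.Random(seed)
    # fan_in + fan_out is symmetric, so the orientation of the matrix
    # does not change the limit.
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    weights = [[rng.uniform(-limit, limit) for _ in range(fan_in)]
               for _ in range(fan_out)]
    return weights, limit

# Hidden layer of the [29, 16, 1] network used later: W1 has shape 16 x 29.
W1, limit1 = xavier_uniform(16, 29)
print('limit for W1 = {:.4f}'.format(limit1))
```

Wider layers get a smaller limit, which keeps the scale of the activations roughly constant from layer to layer.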

In [None]:
def forward_propagation(X, parameters):
    L = len(parameters) // 2
    A = X
    for l in range(1, L):
        Z = tf.add(tf.matmul(parameters['W' + str(l)], A), parameters['b' + str(l)])
        A = tf.nn.relu(Z)
    ZL = tf.add(tf.matmul(parameters['W' + str(L)], A), parameters['b' + str(L)])
    
    return ZL
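
The forward pass above can be traced by hand on a toy network. The following is a minimal pure-Python sketch of the same idea (ReLU on hidden layers, raw logits from the last layer, one example per column). The helpers `matmul` and `forward` and all the numbers are hypothetical, made up only for illustration.

```python
import math

def matmul(W, A):
    """Multiply W (n_out x n_in) by A (n_in x m); both are lists of rows."""
    return [[sum(w * a for w, a in zip(row, col)) for col in zip(*A)] for row in W]

def forward(X, weights, biases):
    """Apply ReLU on every hidden layer and return raw logits from the last one."""
    A = X
    last = len(weights) - 1
    for i, (W, b) in enumerate(zip(weights, biases)):
        # Add one bias per output unit, broadcast across all columns (examples).
        Z = [[z + b_j for z in row] for row, b_j in zip(matmul(W, A), b)]
        A = Z if i == last else [[max(0.0, z) for z in row] for row in Z]
    return A

# Toy 2-2-1 network; each COLUMN of X is one example.
X = [[1.0, -1.0],
     [2.0,  0.5]]
weights = [[[1.0, 0.0], [0.0, -1.0]], [[1.0, 1.0]]]
biases = [[0.0, 0.0], [-0.5]]

logits = forward(X, weights, biases)
probs = [[1.0 / (1.0 + math.exp(-z)) for z in row] for row in logits]
print(logits)  # [[0.5, -0.5]]
```

The sigmoid is applied outside the forward pass, which mirrors how the notebook keeps logits for the cost function and converts them to probabilities only at prediction time.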

In [None]:
def compute_l2_regularization_cost(parameters, l2):
    L = len(parameters) // 2
    cost = 0.0
    for l in range(1, L + 1):
        cost += tf.nn.l2_loss(parameters['W' + str(l)])  # l2_loss already returns the scalar sum(W ** 2) / 2
    l2_regularization_cost = cost * l2
    
    return l2_regularization_cost
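
For reference, `tf.nn.l2_loss(t)` returns `sum(t ** 2) / 2`, so with the penalty coefficient $\lambda$ (the argument `l2`) the function above evaluates the expression below. The symbols $L$ (number of layers) and $W^{[l]}$ (weight matrix of layer $l$) are notation introduced here for illustration.

$$
J_{\mathrm{L2}} = \lambda \sum_{l=1}^{L} \frac{1}{2} \left\lVert W^{[l]} \right\rVert_F^2 = \frac{\lambda}{2} \sum_{l=1}^{L} \sum_{j,k} \left( W^{[l]}_{jk} \right)^2
$$

Only the weight matrices are penalized; the biases are left out. Because of the factor 1/2, a given `l2` value corresponds to half the coefficient in formulations that penalize $\lambda \lVert W \rVert^2$ directly.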

In [None]:
def compute_cross_entropy_cost(ZL, Y):
    cross_entropy_cost = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(logits = ZL, \
                                                                               labels = Y))
    
    return cross_entropy_cost                                   
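
TensorFlow documents `tf.nn.sigmoid_cross_entropy_with_logits` as evaluating the numerically stable form max(x, 0) - x * z + log(1 + exp(-|x|)) for a logit x and a label z. The sketch below is a plain-Python illustration of that formula, compared with the textbook definition. The function names and sample values are hypothetical.

```python
import math

def sigmoid_cross_entropy(logit, label):
    """Stable form: max(x, 0) - x * z + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def naive_cross_entropy(logit, label):
    """Textbook form; breaks down when the sigmoid saturates to exactly 0 or 1."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -label * math.log(p) - (1.0 - label) * math.log(1.0 - p)

logits = [2.0, -1.0, 0.0]
labels = [1.0, 0.0, 1.0]

# Summed (not averaged) over the examples, as in compute_cross_entropy_cost.
total = sum(sigmoid_cross_entropy(x, z) for x, z in zip(logits, labels))
print('total cost = {:.4f}'.format(total))
```

The stable form stays finite for very large logits, where the textbook form would try to take the logarithm of zero.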

In [None]:
def random_mini_batches(X, Y, minibatch_size = 64):
    m = X.shape[1]
    minibatches = []
    
    np.random.seed(next(seeds))
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((Y.shape[0], m))
    
    num_complete_minibatches = m // minibatch_size
    for k in range(0, num_complete_minibatches):
        minibatch_X = shuffled_X[:, k * minibatch_size : (k + 1) * minibatch_size]
        minibatch_Y = shuffled_Y[:, k * minibatch_size : (k + 1) * minibatch_size]
        minibatch = (minibatch_X, minibatch_Y)
        minibatches.append(minibatch)
    
    if m % minibatch_size != 0:
        minibatch_X = shuffled_X[:, num_complete_minibatches * minibatch_size :]
        minibatch_Y = shuffled_Y[:, num_complete_minibatches * minibatch_size :]
        minibatch = (minibatch_X, minibatch_Y)
        minibatches.append(minibatch)
        
    return minibatches

In [None]:
def model(X_train, Y_train, layers_dims, l2 = 1e-6, learning_rate = 0.0001, 
          num_epochs = 1500, minibatch_size = 64, print_cost = True):
    ops.reset_default_graph()
    #tf.set_random_seed(seed)
    (n_x, m) = X_train.shape
    n_y = Y_train.shape[0]
    costs = []
    
    X, Y = create_placeholders(n_x, n_y)
    parameters = initialize_parameters(layers_dims)
    ZL = forward_propagation(X, parameters)
    cross_entropy_cost = compute_cross_entropy_cost(ZL, Y)
    l2_regularization_cost = compute_l2_regularization_cost(parameters, l2)
    cost = cross_entropy_cost + l2_regularization_cost 
    optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)
    init = tf.global_variables_initializer()
    
    with tf.Session() as sess:
        sess.run(init)
        
        for epoch in range(num_epochs):
            epoch_cost = 0.0
            minibatches = random_mini_batches(X_train, Y_train, minibatch_size)
            num_minibatches = len(minibatches)
            
            for minibatch in minibatches:
                (minibatch_X, minibatch_Y) = minibatch
                _, minibatch_cost = sess.run([optimizer, cost], feed_dict = {X: minibatch_X, 
                                                                             Y: minibatch_Y})
                epoch_cost += minibatch_cost
                
            epoch_cost = epoch_cost / m    
            costs.append(epoch_cost)    
            
            if print_cost and epoch % 100 == 0:
                print('Cost after epoch {}: {}'.format(epoch, float(epoch_cost)))
        else:
            # for-else: runs once after the last epoch to report the final cost
            if print_cost:
                print('Cost after epoch {}: {}'.format(epoch, float(epoch_cost)))
                
        parameters = sess.run(parameters)
        return parameters, costs
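
To give a feel for what the Adam optimizer does at each step, here is a minimal sketch of the textbook update rule applied to a one-dimensional toy problem. It is not TensorFlow's implementation, which places epsilon slightly differently. The function `adam_minimize`, the toy objective, and the learning rate of 0.1 are assumptions chosen for illustration.

```python
import math

def adam_minimize(grad, x0, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  epsilon=1e-8, num_steps=1000):
    """Minimize a scalar function given its gradient, using textbook Adam updates."""
    x, m, v = x0, 0.0, 0.0
    path = [x]
    for t in range(1, num_steps + 1):
        g = grad(x)
        m = beta1 * m + (1.0 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1.0 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1.0 - beta1 ** t)           # bias correction for the zero start
        v_hat = v / (1.0 - beta2 ** t)
        x -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
        path.append(x)
    return x, path

# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3); its minimum is at x = 3.
x_final, path = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
print('first step: {:.4f}, final x: {:.4f}'.format(path[1], x_final))
```

The first step has a size close to the learning rate regardless of the gradient magnitude, which is one reason Adam is relatively insensitive to the scale of the cost.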

In [None]:
def predict(parameters, X):
    nx = X.shape[0]
    params = {}
    L = len(parameters) // 2
    for l in range(1, L+1):
        params['W' + str(l)] = tf.convert_to_tensor(parameters['W' + str(l)])
        params['b' + str(l)] = tf.convert_to_tensor(parameters['b' + str(l)])
    
    x = tf.placeholder(dtype = tf.float32, shape = (nx, None))
    z = forward_propagation(x, params) 
    a = tf.sigmoid(z)
    
    with tf.Session() as sess:
        proba = sess.run(a, feed_dict = {x: X})
        
    return proba

In [None]:
def model_evaluation(parameters, feature_matrix, target):
    probs = predict(parameters, feature_matrix)
    (fpr, tpr, thresholds) = roc_curve(y_true = target.ravel(), y_score = probs.ravel())
    auc_score = auc(x = fpr, y = tpr)
    fig, ax = plt.subplots()
    ax.plot(fpr, tpr, 'r-', linewidth = 2)
    ax.plot([0, 1], [0, 1], 'k--', linewidth = 1)
    plt.title('ROC curve with AUC = {0:.3f}'.format(auc_score))
    plt.xlabel('fpr')
    plt.ylabel('tpr')
    plt.axis([-0.01, 1.01, -0.01, 1.01])
    plt.tight_layout()
    
    return {'fpr': fpr, 'tpr': tpr, 'thresholds': thresholds, 'auc': auc_score}

In [None]:
def k_fold_cross_validation(train_data, k, n_h, l2):
    layers_dims = [29, n_h, 1]
    fold_size = train_data.shape[0] // k
    np.random.seed(next(seeds))
    permutation = list(np.random.permutation(train_data.shape[0]))
    shuffled_data = train_data.values[permutation, :]
    shuffled_data = shuffled_data.T
 
    error = 0.0
    for i in range(k):
        val_X = shuffled_data[:-1, i * fold_size : (i + 1) * fold_size]
        val_y = shuffled_data[-1, i * fold_size : (i + 1) * fold_size].reshape(1, -1)
        
        train_X = np.concatenate([shuffled_data[:-1, 0 : i * fold_size],  \
                                  shuffled_data[:-1, (i + 1) * fold_size :]], axis = 1)
        train_y = np.concatenate([shuffled_data[-1, 0 : i * fold_size], \
                                 shuffled_data[-1, (i + 1) * fold_size :]])
        train_y = train_y.reshape(1, -1)
        
        parameters, _ = model(train_X, train_y, layers_dims, l2, learning_rate = 1e-4, \
                                   num_epochs = 1500, minibatch_size = 16, print_cost = False)
        
        probs = predict(parameters, val_X)
        preds = np.where(probs > 0.5, 1, 0)
        error += np.sum(preds != val_y)
        
    accuracy = 1.0 - error / float(k * fold_size)  # only k * fold_size rows are ever validated
    return accuracy 
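
To make the fold arithmetic above concrete, here is a small standalone sketch of how the validation ranges are laid out. The helper `fold_slices` and the row count of 23 are hypothetical, chosen only for illustration.

```python
def fold_slices(n_rows, k):
    """Return the (start, stop) validation range of each of the k equal-sized folds."""
    fold_size = n_rows // k
    return [(i * fold_size, (i + 1) * fold_size) for i in range(k)]

slices = fold_slices(23, 5)
validated = sum(stop - start for start, stop in slices)
print(slices)     # [(0, 4), (4, 8), (8, 12), (12, 16), (16, 20)]
print(validated)  # 20
```

With 23 rows and k = 5, each fold holds 4 rows, so 20 rows are validated and the last 3 only ever appear in the training part.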

The function in the cell below will be employed to tune the hyper-parameters: the number of neurons in the hidden layer and the L2-penalty coefficient.

In [None]:
def tune_hparams(train_data, hparams):
    n_h = hparams['n_h'] 
    l2 = hparams['l2'] 
    accuracies = []
    for n in range(len(n_h)):
        accuracy = k_fold_cross_validation(train_data, 5, int(n_h[n]), l2[n])  # layer sizes must be integers
        accuracies.append(accuracy)
        print('Trial = {}, n_h = {}, l2 = {}, accuracy = {}'.format(n, n_h[n], l2[n], accuracy))
    
    return accuracies 

**Prepare the data**

In [None]:
data = pd.read_csv('../input/creditcard.csv')

In [None]:
features = data.columns
features = [str(s) for s in features]
label = features[-1]
features = features[1 : -1]
data = data[features + [label]]

In [None]:
scaler = StandardScaler().fit(data[features])
scaler_mean = scaler.mean_
scaler_scale = scaler.scale_
data[features] = scaler.transform(data[features])

In [None]:
train_data, test_data = train_test_split(data, test_size = 0.2, random_state = 1)

In [None]:
np.random.seed(1)
train_positive = train_data[train_data[label] == 1]
train_negative = train_data[train_data[label] == 0]
indices = np.random.choice(a = train_negative.index, size = train_positive.shape[0], replace = False)
sample_negative = train_negative.loc[indices, :]
sample = pd.concat([train_positive, sample_negative], axis = 0)

**Train the model**

In [None]:
np.random.seed(1)
seeds = np.random.randint(0, 10000, 10000)
seeds = cycle(seeds)

In the two cells below, I demonstrate that the present setup of mini-batch size, learning rate, and number of epochs is good enough to minimize the cost.

In [None]:
parameters, costs = model(sample[features].values.T, sample[label].values.reshape(1, -1), [29, 10, 1], 
                          l2 = 0.001, learning_rate = 0.0001, num_epochs = 1500, minibatch_size = 16, 
                          print_cost = True)

In [None]:
plt.plot(costs)
plt.xlabel('epoch')
plt.ylabel('cost')
plt.title('The cost converges within a few epochs.')
plt.tight_layout()

In the cell below, we tune a two-layer network. This process took a very long time on my PC. My experiment shows that employing 16 neurons in the hidden layer is likely to yield the best performance. In fact, model performance is not very sensitive to this parameter: with all other factors unchanged, varying it changes the model accuracy by less than 5%.

In [None]:
#n_h = np.arange(4, 22, 2)
#l2 = 10 ** np.linspace(-3, 3, 7)
#n_h, l2 = np.meshgrid(n_h, l2)
#hparams = {'n_h': n_h.ravel(), 'l2': l2.ravel()}
#accuracies = tune_hparams(sample, hparams)                

In the cell below, we tune the L2-penalty coefficient for a two-layer network with 16 neurons in the hidden layer. Once again, I comment out the code, as it can take a very long time to execute. My experiment indicates that the best value of this coefficient is around 0.25.

In [None]:
#n_h = np.ones(21) * 16
#l2 = 10 ** np.linspace(-2, 2, 21)
#hparams = {'n_h': n_h, 'l2': l2}
#accuracies = tune_hparams(sample, hparams)     

Finally, we train the model with the optimized hyper-parameters, and evaluate model performance on the test data.

In [None]:
parameters, costs = model(sample[features].values.T, sample[label].values.reshape(1, -1), [29, 16, 1], 
                          l2 = 0.25, learning_rate = 0.0001, num_epochs = 1500, minibatch_size = 16, 
                          print_cost = True)

In [None]:
metrics = model_evaluation(parameters, test_data[features].T, test_data[label])

We also investigate the variation of TPR and FPR as a function of threshold. 

In [None]:
plt.plot(metrics['thresholds'], metrics['tpr'], 'r-', linewidth = 2, label = 'tpr')
plt.plot(metrics['thresholds'], metrics['fpr'], 'b-', linewidth = 2, label = 'fpr')
plt.legend(loc = 'best')
plt.axis([0, 1, 0, 1])
plt.xlabel('threshold')
plt.tight_layout()
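
Each point on the curves above comes from counting true and false positives at one threshold. The following sketch shows that computation on a handful of made-up scores. The helper `rates_at_threshold`, the labels, and the scores are hypothetical, and the strict `>` comparison is an assumption chosen to match the way predictions are thresholded elsewhere in this notebook.

```python
def rates_at_threshold(y_true, y_score, threshold):
    """Return (tpr, fpr) when scores strictly above `threshold` are flagged positive."""
    tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s > threshold)
    fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s > threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return tp / positives, fp / negatives

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_score = [0.95, 0.80, 0.40, 0.92, 0.60, 0.30, 0.20, 0.05]

rates = {t: rates_at_threshold(y_true, y_score, t) for t in (0.5, 0.9)}
for t in (0.5, 0.9):
    print('threshold = {}: tpr = {:.3f}, fpr = {:.3f}'.format(t, *rates[t]))
# threshold = 0.5: tpr = 0.667, fpr = 0.400
# threshold = 0.9: tpr = 0.333, fpr = 0.200
```

Raising the threshold lowers both rates, so the choice is a trade-off between missed frauds and false alarms.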

If we set the threshold to 0.9, the metrics of model performance are computed as follows.

In [None]:
probs = predict(parameters, test_data[features].T)
preds = np.where(probs > 0.9, 1, 0)
tn, fp, fn, tp = confusion_matrix(y_true = test_data[label].values.ravel(), \
                                  y_pred = preds.ravel()).ravel()
print ('(tn, fp, fn, tp) = ({}, {}, {}, {})'.format(tn, fp, fn, tp))
print ('precision = {}'.format(tp / (tp + fp)))
print ('recall = {}'.format(tp / (tp + fn)))
print ('accuracy = {}'.format((tp + tn) / float(len(test_data))))

Among the 87 fraudulent transactions, the model correctly identifies 72 and misses 15, at the cost of misclassifying 189 genuine transactions as positive. 
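
As a quick check of the arithmetic, the counts quoted above translate into recall and precision as follows. The F1 score is an extra summary added here for illustration and is not computed in the original cell.

```python
tp, fn, fp = 72, 15, 189  # counts quoted in the text above

recall = tp / (tp + fn)     # 72 / 87
precision = tp / (tp + fp)  # 72 / 261
f1 = 2 * precision * recall / (precision + recall)

print('recall = {:.3f}'.format(recall))        # recall = 0.828
print('precision = {:.3f}'.format(precision))  # precision = 0.276
print('f1 = {:.3f}'.format(f1))                # f1 = 0.414
```

In other words, the model catches roughly five out of six frauds, while only about one in four flagged transactions is actually fraudulent.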

**A brief summary**

In this notebook, we construct a general neural network model using the TensorFlow library. A two-layer model is then trained on the under-sampled training data and evaluated on the test data set. In general, the present model shows some improvement over the logistic regression and tree-ensemble classifiers of the previous notebook. Neural networks are a powerful tool for problems with a large number of variables. The present problem has only 29 features, so it may not be a good application for a neural network.

 *Copyright reserved to Hongqiang Zhou (hongqiang.zhou@hotmail.com)*
 
*Last updated 25 Oct. 2017* 