# Introduction

In this notebook we look at the credit card fraud data and try to use an autoencoder to identify fraudulent transactions. The data analysis has already been described in another notebook, so that part is skipped here.

I chose the following approach:
1. Split the fraud cases into a training set and a test set
2. Use the same ratio for the normal cases
3. Oversample the fraud training set to match the number of cases in the normal training set
4. Define an autoencoder with a hidden layer of two nodes, reducing the data to 2 dimensions
5. Train the autoencoder on the oversampled set
6. Evaluate the model on the test set using the trained hidden layer

By reducing the data to two dimensions while the model learns what frauds look like, we hope to see a clear separation in the 2-dimensional space.

The choice for oversampling is based on the wish to include as many normal transactions as possible. This gives the model the best opportunity to learn what a normal transaction looks like. The oversampled fraud transactions are copies: a large group with little variation.
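
To make the oversampling idea concrete, here is a minimal sketch on made-up row indices (the `toy_` names and the fixed seed are assumptions for this example only, not part of the notebook's pipeline). Drawing with replacement from a small fraud group yields a group as large as the normal one, but it necessarily consists of repeated rows.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is repeatable

# Hypothetical toy row indices: 3 fraud rows and 10 normal rows.
toy_fraud_idx = np.array([2, 7, 11])
toy_normal_idx = np.array([0, 1, 3, 4, 5, 6, 8, 9, 10, 12])

# Draw with replacement until the fraud group is as large as the normal group.
oversampled_fraud_idx = rng.choice(toy_fraud_idx, size=len(toy_normal_idx), replace=True)

# The balanced training index holds every normal row once and fraud rows repeatedly.
balanced_idx = np.concatenate([toy_normal_idx, oversampled_fraud_idx])
print(len(oversampled_fraud_idx), "fraud draws from",
      len(np.unique(oversampled_fraud_idx)), "distinct fraud rows")
```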

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import seaborn as sns
import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
%matplotlib inline

# Getting the data

First let's fetch the data and describe it.

In [None]:
df = pd.read_csv("../input/creditcard.csv")
df.describe()

In [None]:
# reshuffle the data
df=df.sample(frac=1).reset_index(drop=True)

## Sampling

We select a training set of frauds, leaving 30% as a test set. We then oversample the fraud training set to match the size of the normal training set.

In [None]:
fraud_indices = np.array(df[df.Class == 1].index)
number_records_fraud = len(fraud_indices)

# Picking the indices of the normal classes
normal_indices = np.array(df[df.Class == 0].index)
number_records_normal = len(normal_indices)

trainingratio = 0.7
training_n_normal = round(number_records_normal*trainingratio)
training_n_fraude = round(number_records_fraud*trainingratio)

# Select the fraud cases for the training set (without replacement)
random_fraud_indices = np.random.choice(fraud_indices, training_n_fraude, replace = False)

# Out of the training fraud indices, pick training_n_normal cases with replacement to oversample
duplicated_fraud_indices = np.random.choice(random_fraud_indices, training_n_normal, replace = True)

# Randomly select the normal training cases without replacement
random_normal_indices = np.random.choice(normal_indices, training_n_normal, replace = False)

# Appending the 2 indices
sample_indices = np.concatenate([random_normal_indices,duplicated_fraud_indices])

# Sample dataset
sample_data = df.iloc[sample_indices,:]
test_data = df.drop(sample_indices,axis=0)

# sort on Class for the scatter plots at the end, to make sure that Frauds are drawn last
test_data=test_data.sort_values(['Class'], ascending=[True])

# shuffle the data, because the frauds were appended at the tail
sample_data=sample_data.sample(frac=1).reset_index(drop=True)

print("Normal transactions:                     ", number_records_normal)
print("Fraud  transactions:                     ", number_records_fraud)
print("Fraud  transactions for training:        ", len(random_fraud_indices))

print("Selected normal transactions:            ", len(random_normal_indices))
print("Selected oversampled fraud transactions: ",  len(duplicated_fraud_indices))

print("Fraud  transactions selected for test:   ", len(test_data[test_data.Class == 1]))

print("Normal transactions selected for test:   ", len(test_data[test_data.Class == 0]))

The fraud data is now oversampled.

Next the data can be scaled. The scaler is fitted on the whole data set (without the `Class` and `Time` columns), and the same transformation is then applied to both the training and the test set.
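
As a rough illustration of what the scaling step computes, here is a minimal NumPy sketch of min-max scaling to the 0 to 1 range: each column is shifted by its minimum and divided by its range. The helper names `fit_min_max` and `apply_min_max` and the toy array are made up for this example; the notebook itself uses `MinMaxScaler`.

```python
import numpy as np

def fit_min_max(data):
    """Return the per-column minimum and range of a 2-D float array."""
    col_min = data.min(axis=0)
    col_range = data.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero for constant columns
    return col_min, col_range

def apply_min_max(data, col_min, col_range):
    """Scale data with statistics that were fitted beforehand."""
    return (data - col_min) / col_range

# Fit on the whole toy set, then transform a subset with the same statistics.
toy_all = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 40.0]])
col_min, col_range = fit_min_max(toy_all)
toy_scaled = apply_min_max(toy_all[:2], col_min, col_range)
print(toy_scaled)
```

Because the statistics come from the whole set, every scaled value stays between 0 and 1 in both subsets.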





In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(df.drop(['Class','Time'],axis=1))

scaled_data = scaler.transform(sample_data.drop(['Class','Time'],axis=1))
scaled_test_data = scaler.transform(test_data.drop(['Class','Time'],axis=1))
print("Size training data: ", len(scaled_data))
print("Size test data:     ", len(scaled_test_data))

# Encoder
We use *Tensorflow* to define and train an autoencoder. There are 3 layers: an input layer, an output layer, and in the middle a hidden layer with 2 nodes. We train the network to represent the training data with this hidden layer of 2 nodes, hoping that frauds and non-frauds can be distinguished from each other.
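
The shape of this network can be sketched without Tensorflow. The NumPy snippet below is only an illustration under assumed toy sizes (5 input columns, random untrained weights): the input is squeezed through 2 hidden nodes and then expanded back to the input width.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_hidden = 5, 2                 # toy sizes, chosen for this sketch only
toy_X = rng.random((4, n_inputs))         # 4 toy rows with values in [0, 1)

# Randomly initialised (untrained) weights and zero biases.
W = rng.normal(scale=0.1, size=(n_inputs, n_hidden))
b = np.zeros(n_hidden)
W_out = rng.normal(scale=0.1, size=(n_hidden, n_inputs))
b_out = np.zeros(n_inputs)

code = np.tanh(toy_X @ W + b)             # 2-D representation of each row
reconstruction = code @ W_out + b_out     # linear output layer, same width as the input
print(code.shape, reconstruction.shape)
```

Training adjusts the weights so that `reconstruction` approaches the input; the 2-D `code` is what gets plotted at the end.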

In [None]:
import tensorflow as tf

num_inputs = scaled_data.shape[1]
num_hidden = 2  
num_outputs = num_inputs 

learning_rate = 0.001
keep_prob = 0.5
tf.reset_default_graph() 

## Placeholders and layers
We define the placeholders and the layers. We use a simple model. I tried some other activations, but tanh gave the best results. A dropout layer is also defined to force the network to generalize.
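
For readers unfamiliar with dropout, the following is a minimal NumPy sketch of the idea: each activation is kept with probability `keep_prob` and scaled up by `1 / keep_prob`, otherwise it is set to zero. The function `dropout` and the toy activations are illustrative assumptions, not the Tensorflow implementation.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Keep each element with probability keep_prob and rescale it, else zero it."""
    mask = rng.random(activations.shape) < keep_prob
    return np.where(mask, activations / keep_prob, 0.0)

rng = np.random.default_rng(0)
toy_hidden = np.tanh(rng.normal(size=(6, 2)))   # toy tanh activations of a 2-node layer
dropped = dropout(toy_hidden, keep_prob=0.5, rng=rng)
print(dropped)
```

With only 2 hidden nodes and `keep_prob = 0.5`, a large share of rows loses one or both coordinates in each training step, so the decoder has to cope with very noisy codes.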

In [None]:
# placeholder X
X = tf.placeholder(tf.float32, shape=[None, num_inputs])

# weights
initializer = tf.variance_scaling_initializer()
w = tf.Variable(initializer([num_inputs, num_hidden]), dtype=tf.float32)
w_out = tf.Variable(initializer([num_hidden, num_outputs]), dtype=tf.float32)

# bias
b = tf.Variable(tf.zeros(num_hidden))
b_out = tf.Variable(tf.zeros(num_outputs))

#activation
act_func = tf.nn.tanh

# layers
hidden_layer = act_func(tf.matmul(X, w) + b)
dropout_layer= tf.nn.dropout(hidden_layer,keep_prob=keep_prob)
output_layer = tf.matmul(dropout_layer, w_out) + b_out

## Functions 
The loss and the optimizer have to be defined. For the loss we want the output (`output_layer`) to be as close to the input (`X`) as possible, so we use the mean absolute difference between the two. Because of the min-max scaling, the values of X lie between 0 and 1.

We specify a function to create a new batch of randomly selected rows.
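
As a small worked example of this loss, the sketch below computes the mean absolute difference between a toy input and a toy reconstruction in NumPy. The function name `mean_absolute_error` and the toy values are assumptions for illustration.

```python
import numpy as np

def mean_absolute_error(reconstruction, original):
    """Average absolute difference over all rows and columns."""
    return np.mean(np.abs(reconstruction - original))

toy_input = np.array([[0.0, 0.5], [1.0, 0.25]])
toy_output = np.array([[0.1, 0.5], [0.8, 0.0]])

# Absolute differences are 0.1, 0.0, 0.2 and 0.25, so the mean is 0.55 / 4.
toy_loss = mean_absolute_error(toy_output, toy_input)
print(toy_loss)
```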

In [None]:
loss = tf.reduce_mean(tf.abs(output_layer - X))
optimizer = tf.train.AdamOptimizer(learning_rate)
train  = optimizer.minimize( loss)
init = tf.global_variables_initializer()

def next_batch(x_data,batch_size):
    
    rindx = np.random.choice(x_data.shape[0], batch_size, replace=False)
    x_batch = x_data[rindx,:]
    return x_batch

## Training
Let's define the training loop and evaluate the loss after each epoch.

In [None]:
num_steps = 10
batch_size = 150
num_batches = len(scaled_data) // batch_size

with tf.Session() as sess:
    sess.run(init)
    for step in range(num_steps):        
        for iteration in range(num_batches):
            X_batch = next_batch(scaled_data,batch_size)
            sess.run(train,feed_dict={X: X_batch})
        
        if step % 1 == 0:
            err = loss.eval(feed_dict={X: scaled_data})
            print(step, "\tLoss:", err)
            output_2d = hidden_layer.eval(feed_dict={X: scaled_data})
    
    output_2d_test = hidden_layer.eval(feed_dict={X: scaled_test_data})

### Training results

The hidden layer is trained. Let's see where the frauds (yellow) and non-frauds (purple) are located in the 2-dimensional space.



In [None]:
plt.figure(figsize=(20,8))
plt.scatter(output_2d[:,0],output_2d[:,1],c=sample_data['Class'],alpha=0.7)

That looks promising. Most frauds (yellow) are separated from the purple (normal) cloud. However, some of the normal transactions at the far end share more with the frauds than with the other normal transactions.

### Test results
At the end of the training we can evaluate the test data. Let's see what the scatter plot looks like.

In [None]:
plt.figure(figsize=(20,8))
plt.scatter(output_2d_test[:,0],output_2d_test[:,1],c=test_data['Class'],alpha=1)

There is a separation between the normal and fraud transactions. Based on the values in the hidden layer a fraud can be predicted, with a slight chance of false positives.

## Things left to do
A few things would be interesting to look into:
1. Define a formula based on the hidden layer to predict whether a transaction is a fraud
2. Redesign the layers to establish better separation in the hidden layer
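
One possible starting point for the first item, sketched here on synthetic 2-D points rather than on the real hidden-layer output, is a nearest-centroid rule: compute the mean position of each class in the hidden space and assign a transaction to the closer one. The functions `fit_centroids` and `predict_nearest_centroid` and the toy clusters are hypothetical; whether such a rule works on the actual data has not been tested here.

```python
import numpy as np

def fit_centroids(points_2d, labels):
    """Mean hidden-space position per class (0 = normal, 1 = fraud)."""
    return {cls: points_2d[labels == cls].mean(axis=0) for cls in (0, 1)}

def predict_nearest_centroid(points_2d, centroids):
    """Label a point as fraud (1) when it lies closer to the fraud centroid."""
    d_normal = np.linalg.norm(points_2d - centroids[0], axis=1)
    d_fraud = np.linalg.norm(points_2d - centroids[1], axis=1)
    return (d_fraud < d_normal).astype(int)

# Synthetic, well separated clusters standing in for the hidden-layer output.
rng = np.random.default_rng(0)
toy_normal = rng.normal(loc=[-0.5, -0.5], scale=0.05, size=(50, 2))
toy_fraud = rng.normal(loc=[0.5, 0.5], scale=0.05, size=(50, 2))
toy_points = np.vstack([toy_normal, toy_fraud])
toy_labels = np.array([0] * 50 + [1] * 50)

centroids = fit_centroids(toy_points, toy_labels)
toy_pred = predict_nearest_centroid(toy_points, centroids)
print("Accuracy on the toy clusters:", (toy_pred == toy_labels).mean())
```

In the notebook the centroids would be fitted on `output_2d` with the training labels and the rule applied to `output_2d_test`.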