# Predicting Credit Card Fraud with SMOTE and Deep Learning

The goal of this analysis is to predict credit card fraud in real data. I'll build a Deep Learning model and pre-process the available data with SMOTE oversampling to compensate for the imbalance between non-fraudulent and fraudulent transactions.

## Importing data

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
import keras

np.random.seed(9)

data = pd.read_csv('../input/creditcard.csv')
data.head()

## Normalization
I standardize the `Amount` column with a `StandardScaler` and store the result as `normAmount`. I then drop the `Time` and original `Amount` columns, as I don't consider them necessary at this point. Finally, I split the dataset into X, which holds the features without the class label, and y, which holds only the label.

In [2]:
data['normAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time','Amount'],axis=1)

X = data.iloc[:, data.columns != 'Class']
y = data.iloc[:, data.columns == 'Class']
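
As a rough illustration of what the scaling step does to `Amount`, the sketch below standardizes a small list of made-up amounts in plain Python: subtract the mean, then divide by the population standard deviation (the convention `StandardScaler` uses). The `standardize` helper and the `amounts` values are hypothetical and only serve the example.

```python
import statistics

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std (ddof=0)
    return [(v - mean) / std for v in values]

# Hypothetical transaction amounts, not taken from the dataset
amounts = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = standardize(amounts)
print([round(s, 2) for s in scaled])
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

After scaling, the values have mean 0 and standard deviation 1, which keeps the amount on a scale comparable to the other inputs of the network.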

In [3]:
# Count fraudulent and non-fraudulent transactions
all_records = len(data)
number_records_fraud = len(data[data.Class == 1])
print(all_records, number_records_fraud)

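# Apply SMOTE (fit_resample replaces the older fit_sample, which recent imbalanced-learn releases no longer provide)
X_resample, y_resample = SMOTE().fit_resample(X, y.values.ravel())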
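
To give an intuition for what SMOTE does, here is a deliberately simplified sketch of its core idea: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line segment between them. This is not the imbalanced-learn implementation (which, for example, uses 5 neighbours by default and works on whole arrays); the `smote_like_sample` helper and the `fraud_points` values are hypothetical.

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample
    and one of its k nearest minority neighbours."""
    if rng is None:
        rng = random.Random()
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to base
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]

# Hypothetical minority-class (fraud) samples with two features each
fraud_points = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [8.0, 9.0]]
rng = random.Random(9)
synthetic = [smote_like_sample(fraud_points, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print([round(v, 3) for v in point])
```

Because every synthetic point is an interpolation between two real minority samples, the new points stay inside the region already covered by the fraud class rather than being exact duplicates.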

In [6]:
# Transform into pandas DataFrames
y_resample = pd.DataFrame(y_resample)
X_resample = pd.DataFrame(X_resample)
# Split into training and test datasets
X_train, X_test, Y_train, Y_test = train_test_split(X_resample, y_resample, test_size=0.3, random_state=0)

In [7]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout


model = Sequential([
    Dense(units=16, kernel_initializer='uniform', input_dim=29, activation='relu'),
    Dense(units=18, kernel_initializer='uniform', activation='relu'),
    Dropout(0.25),
    Dense(20, kernel_initializer='uniform', activation='relu'),
    Dense(24, kernel_initializer='uniform', activation='relu'),
    Dense(1, kernel_initializer='uniform', activation='sigmoid')
])

model.summary()

In [8]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(np.array(X_train), np.array(Y_train), batch_size=15, epochs=5)

## Evaluate model

In [10]:
score = model.evaluate(np.array(X_test), np.array(Y_test), batch_size=128)
print('\nScore is ', score[1] * 100, '%')

In [12]:
# Function for plotting the confusion matrix

import itertools
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix,precision_recall_curve,auc,roc_auc_score,roc_curve,recall_score,classification_report 

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        # Normalize each row (true class) before drawing,
        # so the colors and the cell text show the same values
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    # Write each cell's value, white on dark cells and black on light ones
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [13]:
y_pred = model.predict(np.array(X_test))
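
The sigmoid output layer returns a probability for each transaction, and the cells that follow turn it into a 0/1 label with `y_pred.round()`, which amounts to a cut-off at 0.5. The sketch below makes that cut-off explicit and shows how a lower threshold flags more transactions as fraud. The `to_labels` helper and the `probs` values are hypothetical.

```python
def to_labels(probabilities, threshold=0.5):
    """Map predicted probabilities to 0/1 labels using a cut-off."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Hypothetical sigmoid outputs for four transactions
probs = [0.02, 0.40, 0.55, 0.97]
print(to_labels(probs))       # [0, 0, 1, 1]
print(to_labels(probs, 0.3))  # [0, 1, 1, 1]
```

One small difference from `round()`: NumPy rounds an exact 0.5 to 0 (round half to even), while this helper maps it to 1. Lowering the threshold generally trades more false alarms for fewer missed frauds.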

In [15]:
Y_test = pd.DataFrame(Y_test)
Y_test.shape

In [16]:
# Compute confusion matrix
cnf_matrix = confusion_matrix(Y_test,y_pred.round())
np.set_printoptions(precision=2)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()
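
To make the recall computation above easier to follow, here is a small self-contained sketch that builds a 2x2 confusion matrix by hand. It uses the same layout as `confusion_matrix` for labels 0 and 1: rows are true labels, columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`. The `confusion_counts` helper and the label lists are hypothetical.

```python
def confusion_counts(true_labels, predicted_labels):
    """Return [[TN, FP], [FN, TP]] for binary 0/1 labels."""
    cm = [[0, 0], [0, 0]]
    for t, p in zip(true_labels, predicted_labels):
        cm[t][p] += 1  # row = true label, column = predicted label
    return cm

# Hypothetical labels for eight transactions
true_labels = [0, 0, 0, 0, 1, 1, 1, 0]
predicted_labels = [0, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_counts(true_labels, predicted_labels)
recall = cm[1][1] / (cm[1][0] + cm[1][1])     # TP / (FN + TP)
precision = cm[1][1] / (cm[0][1] + cm[1][1])  # TP / (FP + TP)
print(cm)  # [[4, 1], [1, 2]]
print(round(recall, 2), round(precision, 2))  # 0.67 0.67
```

Recall answers "how many of the real frauds did we catch?", which is why it is the number reported here instead of plain accuracy.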

## Further testing
We applied SMOTE to the dataset to balance the classes, but we also want to know how the trained model behaves on real data. We therefore test the model on the original dataset. The aim is to minimize the number of fraudulent transactions predicted as non-fraudulent, which is the bottom-left cell of the confusion matrix. Let's see how our model behaves.

We do this because we have seen some kernels that reach 99.9% accuracy, but that figure can easily be achieved by predicting only non-fraudulent transactions.
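
A quick back-of-the-envelope calculation shows why accuracy alone is misleading on such imbalanced data. The counts below are made up for illustration and are not the real figures of this dataset.

```python
total = 10_000  # hypothetical number of transactions
frauds = 17     # hypothetical number of fraudulent ones

# A "model" that always predicts non-fraudulent gets every
# legitimate transaction right and every fraud wrong.
true_negatives = total - frauds
accuracy = true_negatives / total
recall = 0 / frauds  # no fraud is ever caught

print(f"accuracy = {accuracy:.2%}, recall = {recall:.0%}")
# accuracy = 99.83%, recall = 0%
```

The accuracy looks excellent even though not a single fraud is detected, so the confusion matrix and the recall are the figures to watch.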

In [None]:
y_pred = model.predict(np.array(X))

In [None]:
# Compute confusion matrix

cnf_matrix = confusion_matrix(y,y_pred.round())
np.set_printoptions(precision=2)

print("Recall metric on the original dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()

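Great! We have misclassified only two fraudulent transactions. Keep in mind that the original dataset overlaps with the data used for training, since SMOTE keeps the original rows and adds synthetic ones, so this estimate is likely optimistic.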