This code trains a decision tree, a neural network and a random forest to distinguish between fraud and non-fraud credit card transactions. The code is divided into three parts:

 1. Analyse the data directly and study the accuracy, confusion matrix and F1 score
 2. Analyse part of the data, generating a symmetric data set (# fraud = # non-fraud)
 3. Drop features that add no information to the data

The fit quality of parts 2 and 3 is also estimated using cross-validation for each method.

# Part 1:
Analyse the data directly and study the accuracy, confusion matrix and F1 score

In [None]:
import numpy as np
import sklearn
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt

Open the data file

In [None]:
data_df = pd.read_csv('../input/creditcard.csv')

In [None]:
data_df.head(2)

In [None]:
#check if the df contains any NaN values

data_df.isnull().values.any()

First, let's identify how many classes there are and how many transactions belong to each

In [None]:
classes = data_df.Class.unique()
print(classes)

In [None]:
print('class corresponding to non-fraud', classes[0],': ', len(data_df[data_df.Class==classes[0]]))
print('class corresponding to fraud', classes[1],': ', len(data_df[data_df.Class==classes[1]]))

The asymmetry between 'fraud' and 'non-fraud' is so large that the algorithm will be heavily biased towards class 0. This will give us an amazing accuracy, since predicting 'non-fraud' is correct for almost every data point, even when fraud cases are missed. The problem becomes visible in the confusion matrix and the F1 score
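
To make this concrete, here is a minimal sketch with made-up counts (they are not taken from the data set): a classifier that labels every transaction as 'non-fraud' reaches a very high accuracy, while its F1 score for the fraud class is zero. The helper `metrics_from_counts` and all numbers are hypothetical and only serve as an illustration.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) for the positive (fraud) class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Hypothetical test set: 10,000 transactions, 20 of them fraud.
# An "always non-fraud" classifier misses every fraud case.
acc, prec, rec, f1 = metrics_from_counts(tp=0, fp=0, fn=20, tn=9980)
print('Accuracy:', acc)
print('F1:', f1)
```

This is why the notebook reports the confusion matrix and the F1 score next to the accuracy.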

In [None]:
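#transform the dataframe to an array
#(as_matrix() was removed from pandas; to_numpy() is the replacement)
data = data_df.to_numpy()

#features are all columns but the last; the last column ('Class') is the label
X_data = data[:, :-1]
y_data = data[:, -1]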

In [None]:
#split the data into training and test data
from sklearn.model_selection import train_test_split

X_data_train, X_data_test, y_data_train, y_data_test = train_test_split(X_data, y_data, test_size=0.25)

In [None]:
#scale the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

#fit the scaler on the training data only and reuse it for the test data
X_data_train = scaler.fit_transform(X_data_train)
X_data_test = scaler.transform(X_data_test)
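
As a rough illustration of what standard scaling does, the sketch below standardises synthetic data by hand with NumPy. It assumes that the mean and standard deviation are taken from the training split only and then reused for the test split, so that no information from the test data enters the preprocessing. The arrays `X_train_demo` and `X_test_demo` are invented stand-ins, not the credit card data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for two feature columns
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 2))

# Statistics come from the training split only
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)

X_train_scaled = (X_train_demo - mean) / std
X_test_scaled = (X_test_demo - mean) / std  # reuse the training statistics

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled.mean(axis=0), X_test_scaled.std(axis=0))
```

The scaled training columns have mean 0 and standard deviation 1 by construction; the test columns are only approximately standardised, which is expected.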

In [None]:
#train using neural networks
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score


#fit neural network classifier (one hidden layer with 2 units)
model_nn = MLPClassifier(hidden_layer_sizes=(2,), max_iter=2000)
model_nn.fit(X_data_train, y_data_train)

#predict 'y' for test data
y_data_pred_test = model_nn.predict(X_data_test)

#score
print('Accuracy: ', accuracy_score(y_data_test, y_data_pred_test))
print('confusion matrix:', confusion_matrix(y_data_test, y_data_pred_test))
print('F1:', f1_score(y_data_test, y_data_pred_test))

The accuracy is pretty good (as expected). Nevertheless, the F1 score can be substantially improved.
But what happens if we reduce the amount of 'non-fraud' data so that both classes have the same size?

# Part 2:
Analyse part of the data, generating a symmetric data set (# fraud = # non-fraud)

In [None]:
#this is the data corresponding to fraud
data_fraud_df = data_df[data_df.Class==classes[1]]
data_fraud_df = data_fraud_df.reset_index(drop=True)
data_fraud = data_fraud_df.to_numpy()


#this is the data corresponding to non-fraud
data_nonfraud_df = data_df[data_df.Class==classes[0]]
data_nonfraud_df = data_nonfraud_df.reset_index(drop=True)
#draw as many random non-fraud rows as there are fraud rows
#(.ix and np.random.random_integers were removed; sample() draws without replacement)
data_red_nonfraud_df = data_nonfraud_df.sample(n=len(data_fraud_df))
data_red_nonfraud_df = data_red_nonfraud_df.reset_index(drop=True)
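
When drawing the random subset of non-fraud rows, it matters whether the indices are drawn with or without replacement: with replacement, the same transaction can end up in the reduced data several times. The sketch below draws unique row indices with NumPy; the class sizes are hypothetical and the variable names are only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nonfraud, n_fraud = 1000, 50  # hypothetical class sizes

# Indices of the non-fraud rows to keep, drawn without replacement
keep_idx = rng.choice(n_nonfraud, size=n_fraud, replace=False)

# Both numbers are equal, i.e. no row was picked twice
print(len(keep_idx), len(np.unique(keep_idx)))
```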

In [None]:
#now let's join both fraud and non-fraud of the same length
data_red_df = pd.concat([data_red_nonfraud_df, data_fraud_df])
data_red = data_red_df.to_numpy()

X_data_red = data_red[:, :-1]
y_data_red = data_red[:, -1]

In [None]:
#define train and test of the symmetric data
X_data_red_train, X_data_red_test, y_data_red_train, y_data_red_test =\
            train_test_split(X_data_red, y_data_red, test_size=0.25)

In [None]:
#scale the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

#fit the scaler on the training data only and reuse it for the test data
X_data_red_train = scaler.fit_transform(X_data_red_train)
X_data_red_test = scaler.transform(X_data_red_test)

In [None]:
#again, let's use neural networks (fit() retrains the model from scratch)

model_nn.fit(X_data_red_train, y_data_red_train)

#predict 'y' for test data
y_data_red_pred_test = model_nn.predict(X_data_red_test)

#score
print('Accuracy: ', accuracy_score(y_data_red_test, y_data_red_pred_test))
print('confusion matrix:', confusion_matrix(y_data_red_test, y_data_red_pred_test))
print('F1:', f1_score(y_data_red_test, y_data_red_pred_test))

In [None]:
#let's look at the cross validation
#(sklearn.cross_validation was replaced by sklearn.model_selection)
from sklearn.model_selection import cross_val_score, KFold
from scipy.stats import sem

cv = KFold(n_splits=5, shuffle=True, random_state=0)
# by default the score used is the one returned by score method of the estimator (accuracy)
# score against the true test labels, not against the model's own predictions
scores = cross_val_score(model_nn, X_data_red_test, y_data_red_test, cv=cv)
print(scores)
print("Mean score: {0:.3f} (+/-{1:.3f})".format(np.mean(scores), sem(scores)))

The F1 score and the confusion matrix have already improved. Furthermore, the cross-validation score looks pretty good already! Nevertheless, let's see whether this can be improved further.
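
The cross-validation above reports accuracy, the default score. As a possible variation, the sketch below cross-validates the F1 score instead and puts the scaler inside a pipeline, so that each fold is scaled with the statistics of its own training part. It runs on synthetic data from `make_classification`; the names `X_demo`, `y_demo`, `pipe` and `cv_demo` are assumptions made for this example and are not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, roughly balanced stand-in for the reduced data set
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

# Scaling is part of the pipeline, so it is refitted on every training fold
pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(2,), max_iter=2000, random_state=0),
)

# Stratified folds keep the class proportions similar in every fold
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
f1_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv_demo, scoring='f1')
print(f1_scores)
```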

It is interesting to visualise the data to see if there are particular features which can improve the classification. Hence

 - We will reduce the dimension of the features to visualise how good the classification currently is
 - We will visualise the different features, choose the ones which can improve the classification and drop all the other features

In [None]:
from sklearn.manifold import Isomap
iso = Isomap(n_neighbors=30, n_components=2)

#project the data to 2-dimension features
iso.fit(X_data_red_train[:50,:])
Xdata_red_projected = iso.transform(X_data_red_train)

#visualise the data
plt.scatter(Xdata_red_projected[:, 0], Xdata_red_projected[:, 1], c=y_data_red_train,
            edgecolor='none', alpha=0.5,
            cmap='nipy_spectral')  # plt.cm.get_cmap was removed in recent matplotlib

plt.clim(-0.5, 9.5);

# Part 3: Feature reduction
Will this improve our result?

In [None]:
%matplotlib inline
import seaborn as sns; sns.set()
sns.pairplot(data_red_df, hue='Class',vars=['Time', 'Amount']);

In [None]:
sns.pairplot(data_red_df, hue='Class',vars=['V1','V2','V3','V4', 'V5', 'V6','V7','V8','V9','V10']);

In [None]:
sns.pairplot(data_red_df, hue='Class',vars=[ 'V11', 'V12','V13','V14','V15','V16','V17','V18','V19','V20','V21']);

In [None]:
sns.pairplot(data_red_df, hue='Class',vars=[ 'V22','V23','V24','V25','V26','V27','V28']);

It seems that the best features are 'V10', 'V14', 'V16', 'V17'
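
The selection above is made by eye from the pair plots. One possible way to back it up with a number is to score each feature by how far apart the two class means are, measured in units of the pooled standard deviation. The sketch below does this on synthetic data; `separation_score` is a hypothetical helper and the two columns are invented, so the output says nothing about the real features.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
# Synthetic stand-in: column 0 separates the classes, column 1 does not
y_demo = np.repeat([0, 1], n)
X_demo = np.column_stack([
    np.concatenate([rng.normal(0, 1, n), rng.normal(3, 1, n)]),
    rng.normal(0, 1, 2 * n),
])

def separation_score(X, y):
    """Absolute difference of the class means divided by the pooled standard deviation."""
    X0, X1 = X[y == 0], X[y == 1]
    pooled_std = np.sqrt((X0.var(axis=0) + X1.var(axis=0)) / 2)
    return np.abs(X0.mean(axis=0) - X1.mean(axis=0)) / pooled_std

sep_scores = separation_score(X_demo, y_demo)
print(sep_scores)  # the first column scores much higher than the second
```

Features with a high score are the ones whose class distributions overlap least, which is what the pair plots show visually.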

In [None]:
sns.pairplot(data_red_df, hue='Class',vars=['V10','V14','V16','V17']);

Let's take just these columns to fit our algorithms

In [None]:
#keep only the selected features and the class label
data_card_df = data_red_df[['V10', 'V16', 'V14', 'V17', 'Class']]


data_card = data_card_df.to_numpy()

X = data_card[:, :-1]
y = data_card[:, -1]


In [None]:
iso.fit(X[:50,:])
data_projected = iso.transform(X)
data_projected.shape

In [None]:
plt.scatter(data_projected[:, 0], data_projected[:, 1], c=y,
            edgecolor='none', alpha=0.5,
            cmap='nipy_spectral')  # plt.cm.get_cmap was removed in recent matplotlib

plt.clim(-0.5, 9.5);

This does not seem to have improved the result much... it actually looks worse!!
Let's now divide the data into training and test data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Decision Tree classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier

#fit decision tree classifier
model_dt = DecisionTreeClassifier()
model_dt.fit(X_train, y_train)

#predict 'y' for test data
y_pred_test_dt = model_dt.predict(X_test)

#score
print(confusion_matrix(y_test, y_pred_test_dt))
print(f1_score(y_test, y_pred_test_dt))

# Neural networks


In [None]:
#fit data
model_nn.fit(X_train, y_train)

#predict y
y_pred_test_nn = model_nn.predict(X_test)

#score
print(confusion_matrix(y_test, y_pred_test_nn))
print(f1_score(y_test, y_pred_test_nn))

# Random forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

#fit
model_rf = RandomForestClassifier(criterion='entropy')
model_rf.fit(X_train, y_train)

#predict y
y_pred_test_rf = model_rf.predict(X_test)

#score
print(confusion_matrix(y_test, y_pred_test_rf))
print(f1_score(y_test, y_pred_test_rf))


As expected, the results don't improve much.
Let's look at the cross-validation accuracy:

In [None]:
from sklearn.model_selection import cross_val_score, KFold

#KFold now takes the number of splits instead of the number of samples
cv = KFold(n_splits=5, shuffle=True, random_state=0)

scores_dt = cross_val_score(model_dt, X, y, cv=cv)
print(scores_dt)
print("Mean score decision tree: {0:.3f} (+/-{1:.3f})".format(np.mean(scores_dt), sem(scores_dt)))


scores_nn = cross_val_score(model_nn, X, y, cv=cv)
print(scores_nn)
print("Mean score neural networks: {0:.3f} (+/-{1:.3f})".format(np.mean(scores_nn), sem(scores_nn)))

scores_rf = cross_val_score(model_rf, X, y, cv=cv)
print(scores_rf)
print("Mean score random forest: {0:.3f} (+/-{1:.3f})".format(np.mean(scores_rf), sem(scores_rf)))



Hence, just by generating a symmetric data set, the result is already very good!