The goal of this notebook is to showcase the strength of random forests. We found that this algorithm is especially well suited to this data set, returning 100% accuracy. We see two likely reasons for this:
1) The outcome depends strongly on a subset of features that clearly separate the positive and negative samples. This makes decision-tree-based methods a natural choice, because they partition the feature space directly, without having to choose a kernel (as SVM-based methods do).
2) Random forests handle the imbalance of the data set very well. A single decision tree does not, because it tends to overfit.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
import seaborn as sns

In [None]:
df = pd.read_csv('../input/creditcard.csv')
df.head()

Looking at the features (especially V1 to V28), we see that most of them are already normalized, so no further preprocessing is needed.

In [None]:
df.describe()

The data set is very skewed: the positive class (Class == 1) is far smaller than the negative class.

In [None]:
df.groupby('Class').size()
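
To make the skew concrete, the following sketch computes the share of each class and the accuracy of a trivial classifier that always predicts the majority class. The labels are hypothetical toy values chosen for illustration, not the counts of the actual data set.

```python
from collections import Counter

# Hypothetical toy labels (assumption): 998 negatives and 2 positives
labels = [0] * 998 + [1] * 2

counts = Counter(labels)
total = sum(counts.values())
shares = {cls: n / total for cls, n in counts.items()}

# A classifier that always predicts the majority class
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / total

print(shares)
print(majority_class, baseline_accuracy)
```

On the notebook's data, the class shares can be read from `df.Class.value_counts(normalize=True)`.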

This is the central idea of this notebook: we use scikit-learn's random forest implementation to show the relative importance of the features.

In [None]:
from sklearn import ensemble

# Fit on the first 30 columns (the features); Class is the target
deci = ensemble.RandomForestClassifier()
deci.fit(df.iloc[:, 0:30], df.Class)
importances = deci.feature_importances_
sorted_idx = np.argsort(importances)  # ascending, so the top feature is drawn last

# Horizontal bar chart with one bar per feature
padding = np.arange(30) + 0.5
pl.barh(padding, importances[sorted_idx], align='center')
pl.yticks(padding, df.columns[sorted_idx])
pl.xlabel("Relative Importance")
pl.title("Variable Importance")
pl.show()

Here we see that V17 yields the largest decrease in impurity (scikit-learn uses Gini impurity by default) and is hence the most useful feature when splitting the feature space. This ranking is not absolute, though: each split in a random forest considers only a random subset of the features, so the ranking can change from run to run. In contrast, a single decision tree considers every feature at each split and picks the most useful one.
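
The importance score rests on how much a split on a feature reduces node impurity. As a minimal sketch, assuming scikit-learn's default Gini criterion, the impurity decrease of a single split can be computed as follows. `gini` and `impurity_decrease` are illustrative helpers, not scikit-learn functions, and the labels are toy values.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

# Hypothetical toy node with 4 negatives and 4 positives
parent = [0, 0, 0, 0, 1, 1, 1, 1]

# A split that separates the classes perfectly
clean = impurity_decrease(parent, [0, 0, 0, 0], [1, 1, 1, 1])
# A split that leaves both children as mixed as the parent
useless = impurity_decrease(parent, [0, 0, 1, 1], [0, 0, 1, 1])

print(clean, useless)
```

`feature_importances_` aggregates such decreases over all splits and all trees, weighted by the number of samples reaching each node, and normalizes the result.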

Here are some plots showing the distributions of several features for each class, in decreasing order of usefulness.

In [None]:
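# seaborn is already imported as sns in the first cell
for feature in ['V14', 'V10', 'V17', 'V26', 'V7', 'V20']:
    pl.title(feature)
    # Overlay the distributions of positive (Class == 1) and negative (Class == 0) samples.
    # Note: distplot is deprecated since seaborn 0.11; histplot and kdeplot are its replacements.
    sns.distplot(df[feature][df.Class==1])
    sns.distplot(df[feature][df.Class==0])
    pl.show()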

Based on this, we keep only the top 10 features.

In [None]:
features = ['V7', 'V10', 'V18', 'V4', 'V9', 'V16', 'V14', 'V11', 'V17', 'V12']
all_features = features + ['Class']
truncated = df[all_features]
truncated.tail()
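
The feature list above is written out by hand. As a hedged sketch, a list of this kind could also be derived from the importance scores. `top_k_features` is a hypothetical helper introduced here for illustration, and the scores below are made-up values, not output of the fitted model.

```python
def top_k_features(names, importances, k):
    """Return the k feature names with the largest importance, best first."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical importances (assumption), not values from the fitted model
names = ['V4', 'V10', 'V14', 'V17', 'V26']
scores = [0.05, 0.20, 0.25, 0.40, 0.10]

print(top_k_features(names, scores, 3))
```

In the notebook, this would be called as `top_k_features(df.columns[:30], importances, 10)`. Because the forest is randomized, the resulting list may differ between runs.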

Because of the imbalance, we give the test set and the training set the same ratio of positive to negative samples.

In [None]:
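positive = truncated[truncated.Class==1]
negative = truncated[truncated.Class==0]
# Within each class, even-positioned rows go to the test set and odd-positioned rows to the training set
test_set = pd.concat((positive.iloc[::2, :], negative.iloc[::2, :]), axis=0)
train_set = pd.concat((positive.iloc[1::2, :], negative.iloc[1::2, :]), axis=0)
# Shuffle the rows; the two prints show the tail of the test set before and after shuffling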
print(test_set.tail(2))
test_set = test_set.sample(frac=1).reset_index(drop=True)
print(test_set.tail(2))
train_set = train_set.sample(frac=1).reset_index(drop=True)

In [None]:
# Group by each set's own Class column; df.Class no longer lines up after reset_index
print(test_set.groupby('Class').size())
print(train_set.groupby('Class').size())
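
The alternating-row split above can also be expressed with scikit-learn's `train_test_split`, whose `stratify` argument keeps the class ratio equal in both parts. This is a sketch of an alternative, not the split used in this notebook, and it runs on made-up toy data rather than on the notebook's data frame.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data (assumption): 1000 samples, 20 of them positive
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 20 + [0] * 980)

# stratify=y keeps the positive/negative ratio the same in both halves;
# shuffling is on by default
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

print(y_train.sum(), y_test.sum())
```

Fixing `random_state` makes the split reproducible, which the `sample(frac=1)` shuffle above is not unless it is seeded as well.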

We now train and test several classifiers and print a confusion matrix for each.

In [None]:
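def train_and_test_with(mlfunc_list):
    """Fit each (estimator, name) pair on train_set and print its confusion matrix on test_set."""
    for func, name in mlfunc_list:
        func.fit(train_set[features], train_set.Class)
        # Rows are true classes, columns are predicted classes
        print(name+'\n', pd.crosstab(test_set.Class, func.predict(test_set[features])), '\n')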

from sklearn import svm
from sklearn import tree
train_and_test_with([(ensemble.RandomForestClassifier(), 'random forest'), (ensemble.AdaBoostClassifier(), 'adaptive boosting'),
                     (ensemble.BaggingClassifier(), 'bagging'), (ensemble.ExtraTreesClassifier(), 'extra tree'),
                     (ensemble.GradientBoostingClassifier(), 'gradient boosting')])

train_and_test_with([(tree.DecisionTreeClassifier(), 'decision tree'), (svm.SVC(), 'support vector machine')])
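
Each crosstab printed above is a confusion matrix with true classes as rows and predicted classes as columns. On a data set this skewed, the precision and recall of the positive class can make the methods easier to compare than overall accuracy. The sketch below shows the arithmetic. `precision_recall` is a hypothetical helper, and the counts are made-up values, not results from this notebook.

```python
def precision_recall(tn, fp, fn, tp):
    """Precision and recall of the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts (assumption), laid out as in the crosstab:
# rows are true classes, columns are predicted classes
tn, fp = 990, 2
fn, tp = 2, 6

precision, recall = precision_recall(tn, fp, fn, tp)
accuracy = (tn + tp) / (tn + fp + fn + tp)
print(precision, recall, accuracy)
```

With these toy counts the accuracy is close to 1 even though a quarter of the positive samples are missed, which is why the per-class numbers are worth reading alongside it.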