**Approach**
Feature Engineering: A single feature is derived from 'Amount' (its fractional part). The feature is tested and turns out to be slightly significant.

Cleaning data: None other than removing duplicates (undersampling with duplicates present hampers the training of classifiers).

Classification: KNeighborsClassifier (other nearest-neighbor variations were tried, but only k-neighbors is showcased) and RandomForestClassifier.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input/creditcard.csv"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

#pd options
pd.options.display.max_columns = None
pd.options.display.max_colwidth = None  # no truncation; -1 is deprecated in recent pandas versions

In [2]:
ccard = pd.read_csv("../input/creditcard.csv")

In [3]:
ccard.describe()

In [4]:
len(ccard[ccard['Class']==1])

#Heavily Unbalanced dataset

In [5]:
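# Split the data into training, validation and test sets.
# .copy() makes each subset an independent frame, so adding columns later
# does not trigger pandas' SettingWithCopyWarning.

mask = np.random.rand(len(ccard)) < 0.8
ccard_trainVal = ccard[mask]
ccard_test = ccard[~mask].copy()
mask2 = np.random.rand(len(ccard_trainVal)) < 0.85
ccard_train = ccard_trainVal[mask2].copy()
ccard_val = ccard_trainVal[~mask2].copy()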

print (len(ccard_train), len(ccard_val), len(ccard_test))
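
With so few frauds, a purely random mask can leave the three subsets with noticeably different fraud rates from run to run. As a possible alternative (not what this notebook does), here is a minimal sketch of a stratified split with scikit-learn's `train_test_split`. The `demo` frame is synthetic stand-in data and the `random_state` values are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for ccard: about 2% positives (hypothetical data, not the real file)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Amount": rng.uniform(0, 100, size=5000).round(2),
    "Class": (rng.random(5000) < 0.02).astype(int),
})

# 80/20 train+val vs. test, then 85/15 train vs. val, mirroring the proportions above
demo_trainval, demo_test = train_test_split(
    demo, test_size=0.2, stratify=demo["Class"], random_state=42)
demo_train, demo_val = train_test_split(
    demo_trainval, test_size=0.15, stratify=demo_trainval["Class"], random_state=42)

# Each subset should show nearly the same fraud rate
for name, part in [("train", demo_train), ("val", demo_val), ("test", demo_test)]:
    print(name, len(part), round(part["Class"].mean(), 4))
```

Stratifying on `Class` keeps the fraud proportion nearly identical in every subset, which may make validation numbers easier to compare across runs.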

In [6]:
len(ccard_train[ccard_train['Class']==1]) #~300 - 340 of total 492 frauds

In [7]:
ccard_train.columns

In [8]:
print (len(ccard_train[['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',\
                        'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 
                        'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount']].drop_duplicates()))
print (len(ccard_train.drop_duplicates()))

# Equal counts indicate that no two observations share identical feature values but differ in class.
# Duplicates should be removed before undersampling.
# They could equally be removed before the training-validation-test split.
ccard_train = ccard_train.drop_duplicates()  # reassign instead of inplace=True on a slice

In [9]:
def Amount_fractional(r):
    return r['Amount'] - int(r['Amount'])

ccard_train['Amount_fractional'] = ccard_train.apply(Amount_fractional, axis=1)
ccard_ints = ccard_train[ccard_train['Amount_fractional']==0]
ccard_floats = ccard_train[ccard_train['Amount_fractional']>0]
print(len(ccard_ints[ccard_ints['Class']==1]),len(ccard_ints))
print(len(ccard_floats[ccard_floats['Class']==1]),len(ccard_floats))

The feature appears informative: the fraud rate differs by almost a factor of two between the two groups. Transactions with a fractional amount are less likely to be fraudulent. Expect this feature to be significant for the random forest classifier.
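
The comparison of fraud rates between whole-number and fractional amounts can also be done in a vectorized way. The sketch below uses a tiny made-up frame in place of `ccard_train`; the `has_fraction` column name is an assumption introduced for illustration, and the floor-based formula matches `Amount_fractional` only for non-negative amounts.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame standing in for ccard_train
demo = pd.DataFrame({
    "Amount": [1.00, 99.99, 20.00, 3.50, 100.00, 7.25],
    "Class":  [1,    0,     1,     0,    0,      0],
})

# Vectorized equivalent of Amount_fractional for non-negative amounts
demo["Amount_fractional"] = demo["Amount"] - np.floor(demo["Amount"])
demo["has_fraction"] = demo["Amount_fractional"] > 0

# Fraud rate (mean of Class) and group size for whole vs. fractional amounts
rates = demo.groupby("has_fraction")["Class"].agg(["mean", "count"])
print(rates)
```

On the real data, the `mean` column gives the fraud rate per group directly, so the two groups can be compared without counting rows by hand.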

In [10]:
len(ccard_train[ccard_train['Class']==1])

In [11]:
ccard_train.columns

In [12]:
#Undersampling
fraud = ccard_train[ccard_train['Class']==1]
nonFraud = ccard_train[ccard_train['Class']==0].sample(n=len(fraud)) # perfectly balanced
frame = [fraud, nonFraud]
ccardUSample = pd.concat(frame)
print("Size of sample", len(ccardUSample))
ccard_features = ccardUSample[['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',\
                               'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 
                               'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Amount_fractional']]
ccard_labels = ccardUSample['Class'].tolist()

The heavily unbalanced nature of the dataset is what prompts the undersampling above.
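
For reuse, the undersampling step could be wrapped in a small helper. This is only a sketch: the function name `undersample`, its parameters, and the fixed `random_state` are assumptions, and it presumes that label 1 is the minority class.

```python
import pandas as pd

def undersample(df, label_col="Class", random_state=0):
    """Return all minority (label 1) rows plus an equal-size random draw of label 0 rows."""
    minority = df[df[label_col] == 1]
    majority = df[df[label_col] == 0].sample(n=len(minority), random_state=random_state)
    # Shuffle so the two classes are not stored as contiguous blocks
    return pd.concat([minority, majority]).sample(frac=1, random_state=random_state)

# Hypothetical frame: 5 positives among 100 rows
demo = pd.DataFrame({"x": range(100), "Class": [1] * 5 + [0] * 95})
balanced = undersample(demo)
print(balanced["Class"].value_counts())
```

Fixing `random_state` makes the draw repeatable; varying it across runs would show how sensitive the results are to which non-fraud rows happen to be kept.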

In [13]:
from sklearn.neighbors import KNeighborsClassifier
NN = KNeighborsClassifier(n_neighbors = 5, weights = 'distance')
NN.fit(ccard_features, ccard_labels)

In [14]:
ccard_val['Amount_fractional'] = ccard_val.apply(Amount_fractional, axis=1)
valid_features = ccard_val[['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',\
                               'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 
                               'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Amount_fractional']]
valid_labels = ccard_val['Class'].tolist()

In [15]:
pred = NN.predict_proba(valid_features).tolist()
prediction = NN.predict(valid_features).tolist()

In [16]:
prob_fraud = [x[1] for x in pred]

In [17]:
pred_prob = [1*(x>0.8) for x in prob_fraud]

In [18]:
from sklearn import metrics
precision = metrics.precision_score(valid_labels,prediction)
recall = metrics.recall_score(valid_labels,prediction)

In [19]:
print (precision, recall) # A random allotment under an identical recall would have half the precision (23 TP instead of 44 out of 63 positives)
#precision = 44/12489 ; recall = 44/63 in one of the runs

In [20]:
def evaluatePerf(true_labels, predicted_labels):
    type2 = 0
    type1 = 0
    true_positive = 0
    true_negative = 0
    for x,y in zip(true_labels, predicted_labels):
        if x == y:
            if x == 1:
                true_positive+=1
            else:
                true_negative+=1
        elif x == 1:
            type2 += 1
        elif x == 0:
            type1 += 1

    print("TP:", true_positive, " TN:", true_negative, " T1Err:", type1, " T2Err:", type2)


In [21]:
evaluatePerf(valid_labels,prediction)
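
The four counts printed by `evaluatePerf` are the cells of a confusion matrix, so they can be cross-checked against scikit-learn. A minimal sketch on made-up labels (the `y_true` and `y_pred` lists are hypothetical):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels standing in for valid_labels / prediction
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0, 1, 1]

# With labels=[0, 1], ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print("TP:", tp, " TN:", tn, " T1Err:", fp, " T2Err:", fn)
```

Here a Type 1 error is a false positive (a legitimate transaction flagged as fraud) and a Type 2 error is a false negative (a missed fraud).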

In [22]:
precision = metrics.precision_score(valid_labels,pred_prob)
recall = metrics.recall_score(valid_labels,pred_prob)

print (precision, recall)
#precision = 29/5446; recall = 29/63

In [23]:
evaluatePerf(valid_labels,pred_prob)
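
Rather than comparing only the default prediction with a single 0.8 cutoff, the whole precision/recall trade-off can be listed at once. The sketch below uses `precision_recall_curve` on hypothetical scores in place of `valid_labels` and `prob_fraud`. Note that it treats a score as positive when it is greater than or equal to the threshold, whereas the cell above uses a strict `>`.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and fraud probabilities standing in for valid_labels / prob_fraud
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.7, 0.2, 0.05])

precisions, recalls, thresholds = precision_recall_curve(y_true, scores)

# precisions and recalls have one more entry than thresholds:
# the final point is fixed at precision=1, recall=0
for p, r, t in zip(precisions[:-1], recalls[:-1], thresholds):
    print(f"threshold >= {t:.2f}: precision={p:.2f} recall={r:.2f}")
```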

In [24]:
from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier(n_estimators = 50)
RFC.fit(ccard_features, ccard_labels)

In [25]:
prediction = RFC.predict(valid_features)
from sklearn import metrics
precision = metrics.precision_score(valid_labels,prediction)
recall = metrics.recall_score(valid_labels,prediction)

print (precision, recall)
print( "reported: ", sum(prediction), ' and total:', len(prediction))

In [26]:
evaluatePerf(valid_labels,prediction)

Precision and recall by n_estimators:

| n_estimators | precision | recall |
|---|---|---|
| 20 | 0.046439628483 | 0.952380952381 |
| 50 | 0.0505297473513 | 0.984126984127 |

Random Forest outperforms k-nearest neighbors.

Let's look at the feature importances to see whether they make sense, and check whether the 'Amount' or the fractional-amount feature shows up.

In [27]:
RFC.feature_importances_

In [28]:
ccard_features.columns
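
Reading the importance array next to a separate column list is error-prone. One way to pair them is a labeled, sorted `pandas.Series`. The sketch below trains a small forest on synthetic data in which only column `a` carries signal; all names in it are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only column "a" determines the label (for illustration only)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(400, 3)), columns=["a", "b", "c"])
y = (X["a"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Label each importance with its column name and sort, largest first
importances = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)
```

Applied to this notebook, the same pattern would be `pd.Series(RFC.feature_importances_, index=ccard_features.columns)`.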

In [29]:
ccard_test['Amount_fractional'] = ccard_test.apply(Amount_fractional, axis=1)
test_feature = ccard_test[['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12',\
                           'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24',
                           'V25', 'V26', 'V27', 'V28', 'Amount', 'Amount_fractional']]
test_labels = ccard_test['Class'].tolist()

In [30]:
pred_test = RFC.predict(test_feature)
precision = metrics.precision_score(test_labels,pred_test)
recall = metrics.recall_score(test_labels,pred_test)

print (precision, recall)
print( "reported: ", sum(pred_test), ' and total:', len(pred_test))

In [31]:
evaluatePerf(test_labels,pred_test)

The 'Amount' feature does turn out to be significant. 'Amount_fractional' does not seem very significant, though results without it were slightly worse. RFC outperforms KNN. Some runs of this notebook see a 
Precision (Type 1 error) remains a problem, as one would not expect a credit card to be blocked around 3% of the time.

Cross-validation could be used to tune the parameters. Because the training set becomes very small after undersampling, consistency problems arise. Suggestions to improve on those fronts are welcome. The approach strictly balances the training set by undersampling.
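
One possible shape for such a cross-validation is sketched below: the undersampling is applied only to each training fold, while each validation fold keeps its natural imbalance. The data is synthetic, and the fold count, forest size, and seeds are arbitrary assumptions rather than tuned values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data (hypothetical): about 5% positives, shifted in the first feature
rng = np.random.default_rng(0)
y = (rng.random(2000) < 0.05).astype(int)
X = rng.normal(size=(2000, 4))
X[:, 0] += 2.0 * y

scores = []
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in folds.split(X, y):
    # Undersample only the training fold; the validation fold stays imbalanced
    pos = train_idx[y[train_idx] == 1]
    neg = rng.choice(train_idx[y[train_idx] == 0], size=len(pos), replace=False)
    keep = np.concatenate([pos, neg])

    model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[keep], y[keep])
    pred = model.predict(X[val_idx])
    scores.append((precision_score(y[val_idx], pred, zero_division=0),
                   recall_score(y[val_idx], pred)))

print(np.mean(scores, axis=0))  # mean precision and recall across folds
```

The spread of the per-fold scores gives a rough sense of how much the small undersampled training set affects consistency.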

Suggestions are welcomed. Thanks :)