**Credit Card Fraud Detection**

Outcome: Pending...

Edit: (28th Sep 2017) - I have received feedback that I was only testing on a small subset of the overall data. I will be rewriting this notebook to take this into account. I expect my precision to drop dramatically, but my overall SVC score to trend up (which demonstrates why you should never trust this score in isolation).
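
To make that expectation concrete, here is a minimal sketch of the arithmetic. It assumes the classifier's hit rate on fraud and its false-alarm rate on legitimate transactions stay fixed, and that only the number of legitimate rows in the test set changes. The function name and every number in it are hypothetical, chosen for illustration; none of them are results from this notebook.

```python
def precision_and_accuracy(n_fraud, n_legit, hit_rate, false_alarm_rate):
    """Expected precision and accuracy for a classifier with fixed per-class rates."""
    true_pos = hit_rate * n_fraud
    false_pos = false_alarm_rate * n_legit
    true_neg = n_legit - false_pos
    precision = true_pos / (true_pos + false_pos)
    accuracy = (true_pos + true_neg) / (n_fraud + n_legit)
    return precision, accuracy


# Hypothetical rates, assumed only for this illustration.
HIT_RATE = 0.85
FALSE_ALARM_RATE = 0.003

# Same number of fraud rows, but 100 times more legitimate rows in the second test set.
small = precision_and_accuracy(150, 840, HIT_RATE, FALSE_ALARM_RATE)
large = precision_and_accuracy(150, 84_000, HIT_RATE, FALSE_ALARM_RATE)

print("small test set: precision=%.3f accuracy=%.3f" % small)
print("large test set: precision=%.3f accuracy=%.3f" % large)
```

Under these assumptions the false alarms grow with the number of legitimate rows while the true hits do not, so precision falls even as accuracy rises.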

I have tangled with the Credit Card Fraud Detection dataset before with little success. In past attempts my accuracy was equivalent to a coin toss and my precision was a crapshoot. The challenges of the dataset are extensive - a vastly skewed dataset, inscrutable feature labels and haphazardly applied algorithms.

If you have not seen this dataset before, it contains more than 280,000 records of credit card transactions. I suspect that the data was heavily obfuscated (for obvious reasons) before it was submitted to the site, so little intuitive knowledge can be gleaned from the features. The feature labels are hidden, replaced with simple titles of V1 through to V28, alongside two named features (Amount and Time) and a Class label of 1 (meaning a fraudulent transaction) or 0 (meaning a legitimate transaction). The vast majority of the data is legitimate, with fewer than 500 instances of fraud scattered amongst the numbers.

In previous attempts I tried to grapple with the data by throwing algorithms at it -  a random forest here, a logistic regression there, with little thought given to the type of problem I was facing let alone the structure of the data presented.

This time, however, I decided to approach this with a little more structure. I reviewed the data structure for simple correlations and plotted these patterns against the positive and negative examples to determine what patterns, if any, indicated fraudulent data.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed.
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load in:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.utils import shuffle
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))


# Load the transactions from the credit card fraud file
transactions = pd.read_csv('../input/creditcard.csv') 


I've loaded the dataset but I want to read it. Let's look at the actual data.

The data represents a lot of credit card transactions: 280,000+. Nearly 500 of these are fraudulent. Yeah, less than 0.2% of the transactions are fraudulent.
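
As a quick check on that proportion, here is a minimal sketch. The helper `fraud_share` is a hypothetical name, and the counts are the round figures quoted above rather than exact values read from the file.

```python
def fraud_share(labels):
    """Return the fraction of labels equal to 1 (fraud)."""
    labels = list(labels)
    return sum(1 for label in labels if label == 1) / len(labels)


# Round figures from the text: about 280,000 rows, about 500 of them fraud.
labels = [1] * 500 + [0] * (280_000 - 500)
share = fraud_share(labels)
print("fraud share: {:.3%}".format(share))
```

On the real data the same helper could be called as `fraud_share(transactions['Class'])`, and `transactions['Class'].value_counts()` gives the raw counts per class.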

First, let's perform a correlation on the legitimate transactions to see if there are any commonalities to how this data appears which might not be represented in the illegitimate dataset.

In [None]:
# Produce a correlation heat map of the negative class (Legitimate transactions)
sample = transactions[transactions['Class']==0]
normcorr = sample.corr()
sns.heatmap(normcorr, cbar=True, square=True, annot=False, fmt='.2f',
            annot_kws={'size': 15}, cmap='coolwarm')
plt.show()

In my last iteration of this notebook I made fun of the blandness of this heat map. This is unfair. I said it had less taste than frozen tofu. Again, I am better than this. This is not the nearly blank square of the perfectly distributed dataset - this is the blank face of the psychopath. This is the smooth-faced criminal who stares at you impassively just before he shivs you in the stomach. There are hints, that little chequerboard of negative correlation in the V1-V17 square for instance, which suggest something might lurk beneath, but its pure evil is swept aside by a smooth facade.

Let's look at the positive dataset - the fraudulent transactions.

In [None]:
# Produce a correlation heat map of the fraudulent transactions.

fraud = transactions[transactions['Class']==1]

fraudcorr = fraud.corr()
sns.heatmap(fraudcorr, cbar=True, square=True, annot=False, fmt='.2f',
            annot_kws={'size': 15}, cmap='coolwarm')
plt.show()

This! This is the cocky face of the little bugger who swipes your credit card and spends it on soda and pixie-sticks. You can see in this heat-map the pock-marked visage between V1 and V18 that the previous graph only hinted at. It sneers at you. 

I'm feeling a sudden swirl of something resembling... hope.

I can see a few areas of high heat or burning cold on this map. Of particular note is the V1 through V3 square, and the red and blue lines between V9 and V18. 

It is important to note at this point that the number of positive examples is so small compared to the negative examples that even strong correlations such as these might be washed away by the sheer size of the negatively labelled data.
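
A minimal sketch of that worry, using entirely synthetic points (nothing here comes from the real dataset): three minority points lie exactly on a line, but once they are pooled with a thousand unrelated majority points the correlation all but vanishes. The `pearson` helper is written out by hand purely for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Synthetic majority class: a symmetric grid with no linear relationship, repeated 40 times.
grid = [(x, y) for x in range(-2, 3) for y in range(-2, 3)]
legit_points = grid * 40  # 1,000 points
# Synthetic minority class: three points lying exactly on a line.
fraud_points = [(-1, -1), (0, 0), (1, 1)]

r_fraud = pearson([x for x, _ in fraud_points], [y for _, y in fraud_points])
pooled = legit_points + fraud_points
r_pooled = pearson([x for x, _ in pooled], [y for _, y in pooled])

print("fraud only:", round(r_fraud, 3))
print("pooled:    ", round(r_pooled, 3))
```

This is the reason for computing the heat maps per class rather than on the whole table at once.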

To assuage my fears I shall map some of these correlations out and see how the distribution fares. Let's start with my favourites - V9 and V10.

In [None]:
print('V9 - V10')
plt.scatter(fraud['V9'], fraud['V10'],s=1, color='r')
plt.scatter(sample['V9'], sample['V10'], s=1, color='g')
plt.show()
plt.clf()

A nice clear distribution. The positive results (red points) sweep up from the lower left upwards at a nice angle while the green dots (the negative results) clump in the centre. A few more of these clean delineations and we might be onto something.

Let's try V16 and V17.

In [None]:
print('V16-V17')
plt.scatter(sample['V16'], sample['V17'], s=1, color = 'g')
plt.scatter(fraud['V16'], fraud['V17'], s=1, color = 'r')
plt.show()
plt.clf()

Again, a nice correlation and a nice clump, but this time the positive results and the negative results share some graph-space. Let's move on to V17 and V18.


In [None]:
print('V17 - V18')
plt.scatter(sample['V18'], sample['V17'], s=1, color = 'g')
plt.scatter(fraud['V18'], fraud['V17'], s=1, color = 'r')
plt.show()
plt.clf()

And again. We might be in luck here... Let's try some of the lower features now. V1 and V3 had a nice, strong correlation.

In [None]:
print('V1 - V3')
plt.scatter(sample['V1'], sample['V3'], s=1, color = 'g')
plt.scatter(fraud['V1'], fraud['V3'], s=1, color = 'r')
plt.show()
plt.clf()

I... uh.... Ergh. While the correlation did not show strongly on the negative data, you can almost imagine the sound of the data being smeared across the surface. A slow, intentional smear. Not too much, just enough to overwhelm the near straight-line correlation of the positive results. 

Hopefully we will have a better result with V1 and V2.

In [None]:
print('V1 - V2')
plt.scatter(sample['V1'], sample['V2'], s=1, color = 'g')
plt.scatter(fraud['V1'], fraud['V2'], s=1, color = 'r')
plt.show()
plt.clf()

No. This seems much the same, only at a higher speed, as if the smearer realised we were hot on his tail and just hurled the data at the graph as he scarpered out the door. 

I don't think I'll include the smears. 

Anyway - I have a small number of apparently useful data features, enough at least for a rough SVC. You know, just to see if I'm on the right path.

In [None]:
transactions = transactions[['Class', 'V9', 'V10', 'V16', 'V17', 'V18','Amount']]


sample = transactions[transactions['Class']==0]
fraud = transactions[transactions['Class'] == 1]

# need a very small but random sample of the legitimate data since it is massively over represented.
ignore_me, sample = train_test_split(sample, test_size = 0.01)

I have used the train_test_split randomness to extract a 1% chunk of the negative data to overcome some of the skewage.
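
A rough sketch of what that 1% draw does to the class balance. It uses the round figures quoted earlier (about 280,000 legitimate rows and about 500 fraudulent ones), not exact counts from the file, and the helper name is hypothetical.

```python
def downsampled_balance(n_legit, n_fraud, keep_fraction):
    """Return (legitimate rows kept, fraud share of the combined sample)."""
    kept = round(n_legit * keep_fraction)
    return kept, n_fraud / (kept + n_fraud)


# Round figures from the text: roughly 280,000 legitimate rows and 500 fraud rows.
kept, share = downsampled_balance(280_000, 500, 0.01)
print(kept, "legitimate rows kept; fraud share {:.1%}".format(share))
```

With these figures fraud goes from under 0.2% of the rows to roughly 15% of them, which makes the classes far easier to learn but also means the sample no longer has the class mix of the full dataset.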

Now that I have both sample (negative data) and fraud (positive data) I need to concatenate them back together so I can break them apart into a training and test set.

In [None]:
import warnings
warnings.filterwarnings("ignore")

sample = pd.concat([sample, fraud])

# Break into train and test units.
train, test = train_test_split(sample, test_size = 0.3)

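# Separate the labels from the features.
trainy = train['Class']
testy = test['Class']
# drop() returns a new frame; naming the axis by keyword also works on recent pandas versions.
train = train.drop(columns='Class')
test = test.drop(columns='Class')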

You know how your programming lecturer tells you to never **NEVER** turn off all warnings when compiling? Yeah, don't do what I did. 

In [None]:
scaler = StandardScaler()
scaler.fit(train)
train = scaler.transform(train)
test = scaler.transform(test)

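Scaling because that nice Mr Ng told me to. He recommends it for smooth, easy gradient descent; SVC is not trained that way, but its default RBF kernel works on distances between points, so an unscaled Amount column would otherwise dominate the V features.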
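
For reference, a minimal pure-Python sketch of what the scaling cell does to a single feature. `StandardScaler` subtracts the mean and divides by the standard deviation it learned in `fit`; the helper names and the numbers below are made up for illustration.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the mean and population standard deviation from the training values."""
    return mean(values), pstdev(values)


def transform(values, mu, sigma):
    """Standardise values using statistics learned elsewhere."""
    return [(v - mu) / sigma for v in values]


# Made-up amounts, for illustration only.
train_amounts = [1.0, 5.0, 20.0, 100.0, 250.0]
test_amounts = [3.0, 80.0]

mu, sigma = fit_scaler(train_amounts)             # fit on the training data only
train_scaled = transform(train_amounts, mu, sigma)
test_scaled = transform(test_amounts, mu, sigma)  # reuse the training statistics

print(train_scaled)
print(test_scaled)
```

Fitting on the training rows only, then reusing the same mean and deviation on the test rows, mirrors the `fit` then `transform` order in the cell above and keeps information about the test set out of the preprocessing.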

In [None]:
clf = SVC()
clf.fit(train, trainy)
outcome = list(clf.predict(test))
testy = list(testy)

I've trained my reduced set and thrown the test set at the algorithm. Now I just need to change these ones and zeros into some kind of score.

In [None]:
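# Tally the confusion-matrix cells by comparing each prediction with its true label.
count = 0
falsepos = 0
truepos = 0
falseneg = 0
trueneg = 0

# Start at index 0 so that the first test example is not skipped.
for i in range(len(testy)):
    if outcome[i] == 1:
        if testy[i] == 1:
            truepos = truepos + 1
        else:
            falsepos = falsepos + 1
    else:
        if testy[i] == 0:
            trueneg = trueneg + 1
        else:
            falseneg = falseneg + 1
    count = count + 1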



The Precision score is high compared to my previous attempts (meaning the items we predicted to be fraudulent really were fraudulent 98% of the time). The Recall is not so high, meaning some attempts at fraud slipped under the radar. The F1 is a respectable 90%. Good, but not great.

Overall, I am happy with the success of this approach and will likely return to see if I can improve.

I hope you enjoyed reading this.

In [None]:

precision = truepos / (truepos + falsepos)
recall = truepos / (truepos + falseneg)
F1 = 2*((precision * recall ) / (precision + recall))

print("Precision = " + str(precision))
print("Recall = " + str(recall))
print("F1 = " + str(F1))
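
The tally and the three formulas above can also be written as two small functions. This is only a sketch: the function names and the toy labels are made up for illustration, and returning 0.0 when a ratio is undefined (for example, when nothing is predicted as fraud) is one possible convention rather than the only one.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives, treating 1 as fraud."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(actual, predicted):
        if guess == 1 and truth == 1:
            tp += 1
        elif guess == 1:
            fp += 1
        elif truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1), using 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels, made up for illustration.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(actual, predicted)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, tn, fn)
print(precision, recall, f1)
```

Pairing the two lists with `zip` avoids index arithmetic altogether. scikit-learn also ships `precision_score`, `recall_score` and `f1_score` in `sklearn.metrics`, which could be called on `testy` and `outcome` directly.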
