# Detecting Credit Card Fraud With Supervised Learning

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

%matplotlib inline
sns.set_style('white')

In [2]:
raw_data = pd.read_csv('creditcard.csv')

In [3]:
raw_data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [4]:
(raw_data['Class']==1).sum(), len(raw_data)

(492, 284807)

The biggest hurdle here is likely to be the class imbalance: only 492 of the 284,807 records (about 0.17%) are fraudulent. We can try a few things to make this work. Let's start by resampling the classes.
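
To see why plain accuracy is a poor yardstick at this level of imbalance, here is a quick calculation using the counts from the cell above. The variable names are only illustrative.

```python
# Counts taken from the output above: 492 fraud records out of 284,807.
n_fraud = 492
n_total = 284807

fraud_rate = n_fraud / n_total
# A "model" that always predicts non-fraud is right on every negative example.
baseline_accuracy = 1 - fraud_rate

print(f"fraud rate: {fraud_rate:.4%}")
print(f"accuracy of always predicting non-fraud: {baseline_accuracy:.4%}")
```

Any model we build has to beat that do-nothing baseline on something other than raw accuracy.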

In [5]:
X_full = raw_data.loc[:, ~raw_data.columns.isin(['Class'])]
Y_full = raw_data['Class']
X_full.shape, Y_full.shape

((284807, 30), (284807,))

In [6]:
from sklearn.utils import resample

In [7]:
data_pos = raw_data[raw_data['Class']==1]
data_neg = raw_data[raw_data['Class']==0]

data_pos.shape, data_neg.shape

((492, 31), (284315, 31))

In [8]:
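# Resample each class (with replacement) to 20,000 rows: upsamples fraud, downsamples non-fraud
data_pos_resamp = resample(data_pos, n_samples=20000)
data_neg_resamp = resample(data_neg, n_samples=20000)

# Combine the two classes and shuffle the rows
data_resamp = pd.concat([data_pos_resamp, data_neg_resamp])
data_resamp = data_resamp.sample(frac=1)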

data_resamp.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
214662,139767.0,0.467992,1.100118,-5.607145,2.204714,-0.578539,-0.1742,-3.454201,1.102823,-1.065016,...,0.983481,0.899876,-0.285103,-1.929717,0.319869,0.170636,0.851798,0.372098,120.54,1
11841,20332.0,-15.271362,8.326581,-22.338591,11.885313,-8.721334,-2.324307,-16.196419,0.512882,-6.333685,...,-2.356896,1.068019,1.085617,-1.039797,-0.182006,0.649921,2.149247,-1.406811,1.0,1
154371,101313.0,-25.825982,19.167239,-25.390229,11.125435,-16.682644,3.933699,-37.060311,-28.759799,-11.126624,...,-16.922016,5.703684,3.510019,0.05433,-0.671983,-0.209431,-4.950022,-0.448413,2.28,1
233258,147501.0,-1.611877,-0.40841,-3.829762,6.249462,-3.360922,1.147964,1.858425,0.474858,-3.838399,...,1.245582,0.616383,2.251439,-0.066096,0.53871,0.541325,-0.136243,-0.009852,996.27,1
172787,121238.0,-2.628922,2.275636,-3.745369,1.226948,-1.132966,-1.256353,-1.75242,0.281736,-1.792343,...,0.87073,1.269473,-0.265494,-0.480549,0.169665,0.096081,0.070036,0.063768,144.62,1


And let's try running some models on it. The V1 through V28 features of this data set are the result of a PCA, so they should already be orthogonal to each other (Time and Amount are not PCA components). We may find it useful to run some feature selection, but for now let's use them all.
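
As a quick illustration of the orthogonality claim, here is a minimal sketch on synthetic data (not the credit card data) showing that PCA-projected features have essentially zero covariance with each other. The PCA is done by hand with an SVD, and all names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, deliberately correlated data (a stand-in, not the credit card data).
n_samples, n_features = 1000, 5
mixing = rng.normal(size=(n_features, n_features))
data = rng.normal(size=(n_samples, n_features)) @ mixing

# PCA via SVD of the centered data: project onto the right singular vectors.
centered = data - data.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
components = centered @ vt.T

# Covariance of the projected data should be diagonal (up to float error).
cov = np.cov(components, rowvar=False)
off_diagonal = cov - np.diag(np.diag(cov))
max_off_diagonal = np.abs(off_diagonal).max()
print(max_off_diagonal)
```

The same kind of check could be run on the V columns with `raw_data.corr()`.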

In [9]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix

In [10]:
X = data_resamp.loc[:, ~data_resamp.columns.isin(['Class'])]
Y = data_resamp['Class']

X.shape, Y.shape

((40000, 30), (40000,))

In [11]:
bnb = BernoulliNB()

cross_val_score(bnb, X, Y, cv=5)

array([ 0.902   ,  0.91025 ,  0.908625,  0.90825 ,  0.902875])

In [12]:
dtc = DecisionTreeClassifier(
        max_depth=10,
        max_features=4
)

cross_val_score(dtc, X, Y, cv=5)

array([ 0.97075 ,  0.982625,  0.983375,  0.979   ,  0.97575 ])

In [13]:
rfc = RandomForestClassifier(
        n_estimators=20,
        max_depth=6
)

cross_val_score(rfc, X, Y, cv=5)

array([ 0.960875,  0.96225 ,  0.964375,  0.96125 ,  0.96075 ])

Decision Tree runs a lot faster than Random Forest, and is scoring slightly higher at the moment. Let's fit it and see what it does on the actual dataset.
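
One caveat on the cross-validation scores above: the positives were upsampled before the folds were made, so copies of the same fraud record can land in both the training and validation folds, which tends to flatter the scores. A rough sketch of one alternative, splitting first and upsampling only the training fold, is below. It uses synthetic data as a stand-in, and `upsample_minority` is a hypothetical helper written for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Synthetic imbalanced data standing in for the credit card set (assumption).
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.98, 0.02], random_state=0
)

def upsample_minority(X, y, random_state=0):
    """Hypothetical helper: resample class 1 with replacement to match class 0."""
    X_pos, X_neg = X[y == 1], X[y == 0]
    X_pos_up = resample(X_pos, n_samples=len(X_neg), random_state=random_state)
    X_bal = np.vstack([X_neg, X_pos_up])
    y_bal = np.concatenate([np.zeros(len(X_neg), dtype=int),
                            np.ones(len(X_pos_up), dtype=int)])
    return X_bal, y_bal

fold_recalls = []
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in splitter.split(X_demo, y_demo):
    # Upsample only the training fold; the validation fold keeps its original rows.
    X_bal, y_bal = upsample_minority(X_demo[train_idx], y_demo[train_idx])
    model = DecisionTreeClassifier(max_depth=10, max_features=4, random_state=0)
    model.fit(X_bal, y_bal)
    y_valid_pred = model.predict(X_demo[valid_idx])
    fold_recalls.append(recall_score(y_demo[valid_idx], y_valid_pred))

print(fold_recalls)
```

Recall is reported here instead of accuracy because the validation folds keep the original imbalance.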

In [14]:
dtc.fit(X, Y)

Y_pred = dtc.predict(X_full)

confusion_matrix(Y_full, Y_pred)

array([[279637,   4678],
       [    12,    480]])

Hey, this is actually pretty great! (Also, I realize a confusion matrix isn't the best way to assess this result.) But let's just take a really quick look at it. In fraud detection, we are WAY more concerned about missing fraud than we are about misclassifying non-fraud. And even with this incredibly simple model, we only missed 12 of the 492 examples of fraud. We also sent nearly 4,700 emails alerting people of potential fraud that didn't happen, but hey, they can just confirm that it wasn't fraud, right? One problem here is that we fit on resampled copies of the entire set of positive examples and then evaluated on those same examples, so this result is likely overfit and optimistic. It's a shame we have so few examples, but we really do have to pull some of those out to act as a test set.
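
For reference, recall and precision can be read straight off the confusion matrix above. This is just arithmetic on the printed numbers, using scikit-learn's layout where rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the output above: rows are true classes, columns are predictions.
cm = np.array([[279637, 4678],
               [12, 480]])
tn, fp, fn, tp = cm.ravel()

recall = float(tp / (tp + fn))      # share of fraud cases that were flagged
precision = float(tp / (tp + fp))   # share of alerts that were really fraud

print(f"recall:    {recall:.3f}")
print(f"precision: {precision:.3f}")
```

High recall with low precision is exactly the "lots of false alarm emails" trade-off described above.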

Let's try this again, but holding out some of the original data to act as a test set.

In [15]:
# Separate out n_test positive examples for the test set. 150 (30%) seems to give the most stable results.
n_test=150

data_pos = data_pos.sample(frac=1)
data_pos_train = data_pos[n_test:]
data_pos_test = data_pos[:n_test]

# Separate out 20000 negative examples for training set
data_neg = data_neg.sample(frac=1)
data_neg_train = data_neg[:20000]
data_neg_test = data_neg[20000:]

# Upsample the positive training examples so we're back to our 1:1 ratio
data_pos_train_resamp = resample(data_pos_train, n_samples=20000)

# Recombine to make training and test datasets. Test set now only has n_test (150) positive examples.
data_train = pd.concat([data_pos_train_resamp, data_neg_train]).sample(frac=1)
data_test = pd.concat([data_pos_test, data_neg_test]).sample(frac=1)

X_train = data_train.loc[:, ~data_train.columns.isin(['Class'])]
Y_train = data_train['Class']
X_test = data_test.loc[:, ~data_test.columns.isin(['Class'])]
Y_test = data_test['Class']

X_train.shape, Y_train.shape, X_test.shape, Y_test.shape

((40000, 30), (40000,), (264465, 30), (264465,))
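
The manual split above could also be written with `train_test_split` and its `stratify` argument, which keeps the fraud rate the same in both parts and makes the split reproducible through `random_state`. The sketch below is an alternative on a small synthetic frame, not what the notebook ran; the `demo` frame and its columns are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Small synthetic frame standing in for raw_data (assumption): 1,000 rows, 30 fraud.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "V1": rng.normal(size=1000),
    "Amount": rng.exponential(scale=50.0, size=1000),
    "Class": np.r_[np.ones(30, dtype=int), np.zeros(970, dtype=int)],
})

# Stratified split keeps the fraud rate the same in both parts.
train, test = train_test_split(
    demo, test_size=0.3, stratify=demo["Class"], random_state=0
)

# Upsample fraud rows in the training part only; the test part is left untouched.
train_pos = train[train["Class"] == 1]
train_neg = train[train["Class"] == 0]
train_pos_up = resample(train_pos, n_samples=len(train_neg), random_state=0)
train_balanced = pd.concat([train_pos_up, train_neg]).sample(frac=1, random_state=0)

print(train_balanced["Class"].value_counts().to_dict())
print(test["Class"].value_counts().to_dict())
```

Unlike the cell above, this keeps all of the negative examples in play instead of capping the training negatives at 20,000, so the sizes would need adjusting to match.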

In [16]:
dtc = DecisionTreeClassifier(
        max_depth=10,
        max_features=4
)

cross_val_score(dtc, X_train, Y_train, cv=5)

array([ 0.97725 ,  0.989125,  0.9925  ,  0.9865  ,  0.99175 ])

In [17]:
dtc.fit(X_train,Y_train)
Y_pred = dtc.predict(X_test)
confusion_matrix(Y_test, Y_pred)

array([[261986,   2329],
       [    31,    119]])

Okay, we didn't do quite as well here, but we're still doing something at least! We classified just under 80% of the positive examples correctly (119 of 150), while getting better than 99% accuracy on negative examples. Let's take a look at the area under the precision-recall curve (AUPRC).

In [18]:
from sklearn.metrics import auc, precision_recall_curve

In [19]:
precision, recall, thresholds = precision_recall_curve(Y_test, Y_pred)
auc(precision, recall)

0.42046364830884991
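
A note on this number: it comes from hard 0/1 predictions, so `precision_recall_curve` only sees a single operating point, and scikit-learn's `auc` takes x before y, so the usual call for a PR curve is `auc(recall, precision)`. Below is a hedged sketch of a score-based version using `predict_proba`, shown on synthetic data with a stand-in model. It is not a rerun of the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the credit card set (assumption).
X_demo, y_demo = make_classification(
    n_samples=4000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

model = RandomForestClassifier(n_estimators=30, max_depth=6, random_state=0)
model.fit(X_tr, y_tr)

# Probability of the positive class gives the curve many thresholds to sweep over.
scores = model.predict_proba(X_te)[:, 1]

precision, recall, _ = precision_recall_curve(y_te, scores)
auprc_trapezoid = auc(recall, precision)             # x = recall, y = precision
auprc_step = average_precision_score(y_te, scores)   # step-wise summary of the same curve

print(auprc_trapezoid, auprc_step)
```

The two summaries usually land close together; `average_precision_score` avoids the linear interpolation that the trapezoid rule applies between points.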

Now let's see if we can do better. First, how does our original result compare to this?

In [20]:
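def auprc(Y, Y_pred):
    # Y_pred holds hard 0/1 labels here, so the curve has a single operating point.
    # Note: auc() takes (x, y); the conventional PR-curve area is auc(recall, precision).
    precision, recall, thresholds = precision_recall_curve(Y, Y_pred)
    return auc(precision, recall)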

In [21]:
auprc(Y_full, dtc.predict(X_full))

0.53894141354577974

This is with roughly 99% recall on the positive examples (most of which the tree was trained on), so doing better will mean fewer false positives.

In [25]:
rfc = RandomForestClassifier(
        n_estimators=30,
        max_depth=20
)

rfc.fit(X_train, Y_train)
Y_pred = rfc.predict(X_test)
print('AUPRC: ' + str(auprc(Y_test, Y_pred)))
print(confusion_matrix(Y_test, Y_pred))

AUPRC: 0.712763972445
[[264223     92]
 [    23    127]]


Wow, this is amazing! Using Random Forest, we still miss some of the positives in the test set (23 of 150), but we've cut false positives from 2,329 to just 92. Let's see how an SVC does.
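
If catching more of the missed positives matters more than the extra alerts, one option is to lower the decision threshold on the predicted probabilities instead of using the default `predict`. Here is a minimal sketch on made-up labels and scores (as if they came from `predict_proba(...)[:, 1]`), picking the highest threshold that still reaches a target recall. The `target_recall` value is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and scores (made up for illustration).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.55, 0.70, 0.90])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# precision and recall have one more entry than thresholds; drop the final (1, 0) point.
target_recall = 1.0
meets_target = recall[:-1] >= target_recall
# Highest threshold that still reaches the target recall.
chosen_threshold = thresholds[meets_target].max()

# Flag everything scoring at or above the chosen threshold.
y_pred = (scores >= chosen_threshold).astype(int)
print(chosen_threshold, y_pred)
```

In practice the threshold should be chosen on a validation split, not on the test set it will be judged on.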

In [24]:
from sklearn.svm import SVC

In [27]:
svc = SVC(C=1)

svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
print('AUPRC: ' + str(auprc(Y_test, Y_pred)))
print(confusion_matrix(Y_test, Y_pred))

AUPRC: 0.411925396555
[[264313      2]
 [   143      7]]


Ugh, that takes so much longer to run than random forest and barely caught anything: only 7 of the 150 fraud cases in the test set. I don't even want to bother tuning it.
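
One possible reason for the weak result, offered as a guess and not tested here: `SVC` with its default RBF kernel is sensitive to feature scale, and `Time` and `Amount` sit on very different scales from the V columns. A sketch of standardizing inside a pipeline is below, using synthetic data with one column inflated to mimic an unscaled feature.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (assumption); one column is blown up to mimic an unscaled feature like Time.
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=0)
X_demo[:, 0] *= 10000.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# The scaler is fit on the training data only, then applied before the SVC.
scaled_svc = make_pipeline(StandardScaler(), SVC(C=1))
scaled_svc.fit(X_tr, y_tr)

test_accuracy = scaled_svc.score(X_te, y_te)
print(test_accuracy)
```

Whether this would rescue the SVC on the real data is an open question, and it would still be slow on 40,000 training rows.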