# Detecting Credit Card Fraud With Supervised Learning

The data set we'll be using is available on Kaggle. The features have already been run through PCA, so no further decomposition is necessary. Let's take a look at what we have and see how we can model it.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

%matplotlib inline
sns.set_style('white')

In [3]:
raw_data = pd.read_csv('creditcard.csv')
raw_data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [4]:
(raw_data['Class']==1).sum(), len(raw_data)

(492, 284807)

The biggest hurdle here is likely to be the class imbalance. Only 492 of the 284,807 records are fraudulent. We can try a few things to make this work. Let's start by resampling the classes.
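To put the imbalance in numbers, here is a quick calculation using the counts from the cell above. The "always predict non-fraud" baseline is an illustrative assumption added here, not something from the original analysis.

```python
# Counts taken from the output above.
n_fraud = 492
n_total = 284807

fraud_rate = n_fraud / n_total
# A model that always predicts "not fraud" is right on every negative example.
baseline_accuracy = (n_total - n_fraud) / n_total

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Accuracy of always predicting non-fraud: {baseline_accuracy:.4%}")
```

The fraud rate works out to roughly 0.17%, so a model that never flags anything would still score above 99.8% accuracy. That is why plain accuracy tells us very little on this data.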

In [5]:
X_full = raw_data.loc[:, ~raw_data.columns.isin(['Class'])]
Y_full = raw_data['Class']
X_full.shape, Y_full.shape

((284807, 30), (284807,))

In [6]:
from sklearn.utils import resample

In [7]:
data_pos = raw_data[raw_data['Class']==1]
data_neg = raw_data[raw_data['Class']==0]

data_pos.shape, data_neg.shape

((492, 31), (284315, 31))

In [8]:
data_pos_resamp = resample(data_pos, n_samples=20000)
data_neg_resamp = resample(data_neg, n_samples=20000)

data_resamp = pd.concat([data_pos_resamp, data_neg_resamp])
data_resamp = data_resamp.sample(frac=1)

data_resamp.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
29687,35585.0,-2.019001,1.49127,0.005222,0.817253,0.973252,-0.639268,-0.974073,-3.146929,-0.003159,...,2.839596,-1.185443,-0.142812,-0.086103,-0.329113,0.523601,0.626283,0.15244,0.76,1
64329,51112.0,-9.848776,7.365546,-12.898538,4.273323,-7.611991,-3.427045,-8.350808,6.863604,-2.387567,...,0.931958,-0.874467,-0.192639,-0.035426,0.538665,-0.263934,1.134095,0.225973,99.99,1
42887,41285.0,-12.83576,6.574615,-12.788462,8.786257,-10.723121,-2.813536,-14.248847,7.960521,-7.718751,...,2.67949,-0.047335,-0.836982,0.625349,0.125865,0.177624,-0.81768,-0.52103,37.32,1
157591,110099.0,2.100253,0.198889,-1.79435,0.561776,0.448126,-0.954524,-0.011194,-0.30015,1.840955,...,-0.020492,0.244313,0.062023,0.4531,0.055294,0.622047,-0.102058,-0.045644,1.5,0
249828,154599.0,0.667714,3.041502,-5.845112,5.967587,0.213863,-1.462923,-2.688761,0.677764,-3.447596,...,0.32976,-0.941383,-0.006075,-0.958925,0.239298,-0.067356,0.821048,0.426175,6.74,1


And let's try running some models on it. The features of this data set are the result of a PCA, so they should already be orthogonal to each other. We may find it useful to run some feature selection, but for now let's use them all.
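As a sanity check on the orthogonality claim, here is a minimal sketch on synthetic data (the arrays `base`, `mixing`, and `X_raw` are made up for illustration). It shows that the components PCA produces are uncorrelated with each other on the data they were fit to. On the real data, the same `np.corrcoef` check could be applied to the `V1` to `V28` columns; `Time` and `Amount` do not appear to be PCA outputs, so they may well correlate with the rest.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic correlated features standing in for raw transaction data.
base = rng.normal(size=(1000, 3))
mixing = rng.normal(size=(3, 5))
X_raw = base @ mixing + 0.1 * rng.normal(size=(1000, 5))

X_pca = PCA(n_components=5).fit_transform(X_raw)

# Correlations between different principal components should be ~0.
corr = np.corrcoef(X_pca, rowvar=False)
off_diag = corr - np.diag(np.diag(corr))
print(np.abs(off_diag).max())
```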

In [9]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

In [10]:
X = data_resamp.loc[:, ~data_resamp.columns.isin(['Class'])]
Y = data_resamp['Class']

X.shape, Y.shape

((40000, 30), (40000,))

In [11]:
bnb = BernoulliNB()

cross_val_score(bnb, X, Y, cv=5)

array([ 0.909   ,  0.897375,  0.910125,  0.906125,  0.90575 ])

In [12]:
dtc = DecisionTreeClassifier(
        max_depth=10,
        max_features=4
)

cross_val_score(dtc, X, Y, cv=5)

array([ 0.981375,  0.979375,  0.987125,  0.97975 ,  0.972625])

In [13]:
rfc = RandomForestClassifier(
        n_estimators=20,
        max_depth=6
)

cross_val_score(rfc, X, Y, cv=5)

array([ 0.97025 ,  0.965   ,  0.971125,  0.97    ,  0.967   ])

Decision Tree runs a lot faster than Random Forest, and is doing just as well at the moment. Let's fit it and see what it does on the actual dataset.
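One caveat about the cross-validation scores above: the positives were upsampled before splitting into folds, so copies of the same fraud record can land in both the training and validation folds, which tends to inflate the scores. A minimal sketch of the alternative, on synthetic data, is to upsample inside each fold. The names `X_demo`, `y_demo`, and `fold_recalls` are hypothetical, and recall is used here only as an example metric.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Synthetic imbalanced data (about 3% positives) standing in for the real set.
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=10, weights=[0.97, 0.03], random_state=0
)

fold_recalls = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X_demo, y_demo):
    X_tr, y_tr = X_demo[train_idx], y_demo[train_idx]

    # Upsample positives using only this fold's training rows.
    pos_idx = np.where(y_tr == 1)[0]
    neg_idx = np.where(y_tr == 0)[0]
    pos_up = resample(pos_idx, n_samples=len(neg_idx), replace=True, random_state=0)
    bal_idx = np.concatenate([pos_up, neg_idx])

    model = DecisionTreeClassifier(max_depth=10, max_features=4, random_state=0)
    model.fit(X_tr[bal_idx], y_tr[bal_idx])

    # Validate on untouched rows that were never duplicated into training.
    y_val_pred = model.predict(X_demo[val_idx])
    fold_recalls.append(recall_score(y_demo[val_idx], y_val_pred))

print(np.round(fold_recalls, 3))
```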

In [14]:
dtc.fit(X, Y)

Y_pred = dtc.predict(X_full)

confusion_matrix(Y_full, Y_pred)

array([[280479,   3836],
       [     4,    488]])

Hey, this is actually pretty great! (Also, I realize a confusion matrix isn't the best way to assess this result.) But let's just take a really quick look at it. In fraud detection, we are WAY more concerned about missing fraud than we are about misclassifying non-fraud. And even with this incredibly simple model, we only missed 4 of the 492 examples of fraud. We also sent nearly 4,000 emails (3,836 false positives) alerting people of potential fraud that didn't happen, but hey, they can just confirm that it wasn't fraud, right? One problem here is that we fit on the entire set of positive examples, so we're prone to overfitting. It's a shame we have so few examples, but we really do have to pull some of those out to act as a test set.
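The trade-off described above can be read straight off the confusion matrix as recall and precision. This is just arithmetic on the numbers printed above, assuming scikit-learn's layout (rows are true classes, columns are predicted classes).

```python
import numpy as np

# Confusion matrix from the output above.
cm = np.array([[280479, 3836],
               [4, 488]])
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)      # share of real fraud that was caught
precision = tp / (tp + fp)   # share of fraud alerts that were real

print(f"recall = {recall:.3f}, precision = {precision:.3f}")
```

Recall is above 99%, but precision is only about 11%: roughly eight out of nine alerts are false alarms.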

Let's try this again, but holding out some of the original data to act as a test set.

In [15]:
# Separate out n_test positive examples for the test set. 150 (30%) seems to give the most stable results.
n_test = 150
n_train = 50000

data_pos = data_pos.sample(frac=1)
data_pos_train = data_pos[n_test:]
data_pos_test = data_pos[:n_test]

# Separate out n_train negative examples for training set
data_neg = data_neg.sample(frac=1)
data_neg_train = data_neg[:n_train]
data_neg_test = data_neg[n_train:]

# Upsample the positive training examples so we're back to our 1:1 ratio
data_pos_train_resamp = resample(data_pos_train, n_samples=n_train)

# Recombine to make training and test datasets. Test set now only has n_test (150) positive examples.
data_train = pd.concat([data_pos_train_resamp, data_neg_train]).sample(frac=1)
data_test = pd.concat([data_pos_test, data_neg_test]).sample(frac=1)

X_train = data_train.loc[:, ~data_train.columns.isin(['Class'])]
Y_train = data_train['Class']
X_test = data_test.loc[:, ~data_test.columns.isin(['Class'])]
Y_test = data_test['Class']

X_train.shape, Y_train.shape, X_test.shape, Y_test.shape

((100000, 30), (100000,), (234465, 30), (234465,))
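For comparison, here is a hedged sketch of another way to build a leak-free split: a stratified `train_test_split` followed by upsampling the training positives only. It runs on a synthetic frame (`df` and its `f1` to `f4` columns are made up), and unlike the cell above it keeps all training negatives rather than capping them at `n_train`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Synthetic stand-in for the credit card data: about 2% positives.
df = pd.DataFrame(rng.normal(size=(5000, 4)), columns=["f1", "f2", "f3", "f4"])
df["Class"] = (rng.random(5000) < 0.02).astype(int)

# Hold out 30% of rows, keeping the class ratio the same in both parts.
train, test = train_test_split(
    df, test_size=0.3, stratify=df["Class"], random_state=0
)

# Upsample positives in the training part only.
pos = train[train["Class"] == 1]
neg = train[train["Class"] == 0]
pos_up = resample(pos, n_samples=len(neg), replace=True, random_state=0)
train_bal = pd.concat([pos_up, neg]).sample(frac=1, random_state=0)

# No original positive row appears in both the training and test sets.
overlap = set(pos_up.index) & set(test.index)
print(train_bal["Class"].value_counts().to_dict(), len(overlap))
```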

In [16]:
dtc = DecisionTreeClassifier(
        max_depth=10,
        max_features=4
)

cross_val_score(dtc, X_train, Y_train, cv=5)

array([ 0.99235,  0.98305,  0.98655,  0.98455,  0.9757 ])

In [17]:
dtc.fit(X_train,Y_train)
Y_pred = dtc.predict(X_test)
confusion_matrix(Y_test, Y_pred)

array([[232795,   1520],
       [    33,    117]])

Okay, we didn't do quite as well here, but we're still doing something at least! We classified 78% of the positive examples correctly (117 of 150), while getting better than 99% accuracy on negative examples. Let's take a look at the area under the precision-recall curve (AUPRC).

In [18]:
from sklearn.metrics import auc, precision_recall_curve

In [19]:
precision, recall, thresholds = precision_recall_curve(Y_test, Y_pred)
auc(precision, recall)

0.42516672126919758
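A note on how this number is computed: the cell above passes hard 0/1 predictions, so the precision-recall curve has a single operating point. A more common setup, sketched below on synthetic data, passes predicted probabilities so the curve sweeps over many thresholds. It also puts recall on the x-axis, since `auc` takes `x` first and then `y`. The names `X_demo`, `y_demo`, and `clf` are hypothetical, and the scores this produces are not comparable to the ones reported in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 5% positives).
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)

# Probability of the positive class gives many thresholds instead of one.
scores = clf.predict_proba(X_te)[:, 1]
precision, recall, _ = precision_recall_curve(y_te, scores)

# Trapezoidal area with recall on the x-axis and precision on the y-axis.
auprc_trapz = auc(recall, precision)
# Average precision is a step-wise summary of the same curve.
ap = average_precision_score(y_te, scores)

print(auprc_trapz, ap)
```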

Now let's see if we can do better. First, how does our original result compare to this?

In [20]:
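def auprc(Y, Y_pred):
    # Area under the precision-recall curve, computed from hard 0/1 predictions.
    # Note: auc(x, y) takes the x-axis first, so the conventional call is
    # auc(recall, precision); the scores below use the order shown here.
    precision, recall, thresholds = precision_recall_curve(Y, Y_pred)
    return auc(precision, recall)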

In [21]:
auprc(Y_full, dtc.predict(X_full))

0.55804757064528576

This is with 99% accuracy on positive examples, so doing better will mean fewer false positives.

In [22]:
rfc = RandomForestClassifier(
        n_estimators=30,
        max_depth=20
)

rfc.fit(X_train, Y_train)
Y_pred = rfc.predict(X_test)
print('AUPRC: ' + str(auprc(Y_test, Y_pred)))
print(confusion_matrix(Y_test, Y_pred))

AUPRC: 0.731748973522
[[234271     44]
 [    38    112]]


Wow, this is amazing! Using Random Forest, we still aren't getting some of those positives in the test set, but we've reduced false positives to barely any. We actually get fewer false negatives with n_train of a couple thousand, but our highest AUPRC is with n_train about 50,000. Let's see how other models compare to this result, starting with an SVC.

In [23]:
from sklearn.svm import SVC

In [None]:
#svc = SVC(C=1)
#
#svc.fit(X_train, Y_train)
#Y_pred = svc.predict(X_test)
#print('AUPRC: ' + str(auprc(Y_test, Y_pred)))
#print(confusion_matrix(Y_test, Y_pred))

Ugh, that takes so much longer to run than random forest and didn't do anything good. 

In [29]:
bnb = BernoulliNB()

bnb.fit(X_train,Y_train)
Y_pred = bnb.predict(X_test)
print('AUPRC: ' + str(auprc(Y_test, Y_pred)))
print(confusion_matrix(Y_test, Y_pred))

AUPRC: 0.417386416985
[[232927   1388]
 [    36    114]]


In [30]:
lgr = LogisticRegression()

lgr.fit(X_train,Y_train)
Y_pred = lgr.predict(X_test)
print('AUPRC: ' + str(auprc(Y_test, Y_pred)))
print(confusion_matrix(Y_test, Y_pred))

AUPRC: 0.449141282402
[[229436   4879]
 [    19    131]]


In [31]:
dtc = DecisionTreeClassifier(max_depth=20, max_features=4)

dtc.fit(X_train,Y_train)
Y_pred = dtc.predict(X_test)
print('AUPRC: ' + str(auprc(Y_test, Y_pred)))
print(confusion_matrix(Y_test, Y_pred))

AUPRC: 0.467966366829
[[233977    338]
 [    45    105]]


In [32]:
knn = KNeighborsClassifier()

knn.fit(X_train,Y_train)
Y_pred = knn.predict(X_test)
print('AUPRC: ' + str(auprc(Y_test, Y_pred)))
print(confusion_matrix(Y_test, Y_pred))

AUPRC: 0.116774317811
[[232045   2270]
 [   117     33]]


And finally, let's do some feature selection and see if we can keep the quality of our results with fewer features. Note that PCA has already been run on our starting feature set. We'll do RFE, SelectKBest, and Random Forest importances, and run them on our best model, RFC.

In [33]:
from sklearn.feature_selection import RFE
from sklearn.feature_selection import SelectKBest

In [42]:
rfe = RFE(dtc, n_features_to_select=8)

rfe.fit(X_train, Y_train)
X_rfe_train = rfe.transform(X_train)
X_rfe_test = rfe.transform(X_test)

rfc.fit(X_rfe_train, Y_train)
Y_pred = rfc.predict(X_rfe_test)
print('AUPRC: ' + str(auprc(Y_test, Y_pred)))
print(confusion_matrix(Y_test, Y_pred))

AUPRC: 0.712036006322
[[234264     51]
 [    39    111]]


Keeping the best 8 features with RFE seems to maintain our result -- pretty good reduction from 30!
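To see which columns RFE actually kept, the fitted selector exposes a boolean mask (`support_`) and a ranking (`ranking_`, where 1 means kept). A minimal sketch on synthetic data, with made-up column names `V1` to `V12` and a hypothetical `rfe_demo` selector:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X_arr, y_demo = make_classification(
    n_samples=1000, n_features=12, n_informative=4, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"V{i + 1}" for i in range(12)])

rfe_demo = RFE(
    DecisionTreeClassifier(max_depth=5, random_state=0),
    n_features_to_select=4,
)
rfe_demo.fit(X_demo, y_demo)

# support_ is a boolean mask over the input columns.
selected = list(X_demo.columns[rfe_demo.support_])
print(selected)
```

On the notebook's data, the equivalent would presumably be `X_train.columns[rfe.support_]`.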

In [50]:
skb = SelectKBest(k=8)

skb.fit(X_train, Y_train)
X_skb_train = skb.transform(X_train)
X_skb_test = skb.transform(X_test)

rfc.fit(X_skb_train, Y_train)
Y_pred = rfc.predict(X_skb_test)
print('AUPRC: ' + str(auprc(Y_test, Y_pred)))
print(confusion_matrix(Y_test, Y_pred))

AUPRC: 0.72913026772
[[234266     49]
 [    36    114]]


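Eight features from SelectKBest do the job as well. Next up is Random Forest importances. Since `rfc` has been refit on the reduced feature sets, we have to re-run our model on the full feature set to get our original feature_importances.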
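For reference, `SelectKBest` with no arguments scores each feature with `f_classif` (an ANOVA F-test against the class label) and keeps the `k` highest. The sketch below, on synthetic data with a hypothetical `skb_demo` selector, makes that default explicit and shows how to recover the kept column indices.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=12, n_informative=4, random_state=0
)

# score_func=f_classif is the default; it is spelled out here for clarity.
skb_demo = SelectKBest(score_func=f_classif, k=4).fit(X_demo, y_demo)

# Indices of the columns that were kept, and their F-scores.
kept = skb_demo.get_support(indices=True)
print(kept, skb_demo.scores_[kept])
```

Unlike RFE, this scores each feature on its own, so it cannot account for interactions between features.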

In [62]:
rfc_feat = RandomForestClassifier(
        n_estimators=30,
        max_depth=20
)

rfc_feat.fit(X_train, Y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=20, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=30, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [95]:
n_feat = 7
X_rfc_train = X_train.loc[:, X_train.columns[np.argsort(rfc_feat.feature_importances_)[-n_feat:]]]
X_rfc_test = X_test.loc[:, X_test.columns[np.argsort(rfc_feat.feature_importances_)[-n_feat:]]]

rfc = RandomForestClassifier(
        n_estimators=30,
        max_depth=20
)

rfc.fit(X_rfc_train, Y_train)
Y_pred = rfc.predict(X_rfc_test)
print('AUPRC: ' + str(auprc(Y_test, Y_pred)))
print(confusion_matrix(Y_test, Y_pred))

AUPRC: 0.716715122695
[[234260     55]
 [    36    114]]


Same results here -- need about 7-8 features to get the full accuracy. Finally, let's try gradient boosting and see how that does. We'll just use the feature_importance result for this one.
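The `np.argsort` indexing used above can be easier to read as a labeled Series. A minimal sketch on synthetic data, where `forest`, `X_demo`, and the `V1` to `V12` column names are made up for illustration:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_arr, y_demo = make_classification(
    n_samples=1000, n_features=12, n_informative=4, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"V{i + 1}" for i in range(12)])

forest = RandomForestClassifier(n_estimators=30, max_depth=20, random_state=0)
forest.fit(X_demo, y_demo)

# Label each importance with its column name and sort, largest first.
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)

n_feat = 4
top_features = importances.head(n_feat).index.tolist()
X_top = X_demo[top_features]
print(importances.head(n_feat))
```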

In [96]:
from sklearn.ensemble import GradientBoostingClassifier

In [103]:
gbc = GradientBoostingClassifier(
            n_estimators=500,
            learning_rate=.3,
            max_depth=4
)

gbc.fit(X_rfc_train, Y_train)
Y_pred = gbc.predict(X_rfc_test)
print('AUPRC: ' + str(auprc(Y_test, Y_pred)))
print(confusion_matrix(Y_test, Y_pred))

AUPRC: 0.663019105738
[[234228     87]
 [    36    114]]


Gradient boosting is starting to get up into the neighborhood of our Random Forest classifier, but it is already slower. We could tune it with a grid search, but I'm pretty happy with RFC for now. 
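If we did want to tune it, a grid search might look something like the sketch below. The parameter grid, the small synthetic data set, and the choice of `scoring="average_precision"` are all assumptions for illustration; they are not settings that were run in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic imbalanced data set so the search runs quickly.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Hypothetical grid; real values would depend on the time budget.
param_grid = {"learning_rate": [0.1, 0.3], "max_depth": [2, 4]}

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    param_grid,
    scoring="average_precision",  # a precision-recall based score
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

Each extra grid value multiplies the number of fits, so on the full training set this would be considerably slower than a single Random Forest fit.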

## Conclusion

Although we have fewer than 500 examples of fraud in our data set, we're able to classify quite well by upsampling those positive examples. We can use feature selection to reduce our initial (already decomposed) feature set from 30 to 8 without losing much performance. Because 30 still isn't that many features, I would recommend keeping all of them for the minor improvement in our result.

Because of the extreme class imbalance, accuracy wasn't a suitable measure for our models, so we used the area under the precision-recall curve. Of our models, random forest performed best, with an AUPRC score of 0.73. Gradient boosting did fairly well (0.66), but the other models did not perform well (< 0.50), and most of them continued to produce a large number (> 1,000) of false positives.

Our random forest would likely do even better if we had more positive examples of fraud. There are surely more than 500 ways a transaction can be fraudulent, and unfortunately random forest is weak at extrapolating to examples it hasn't seen before. 