# Credit Card Fraud Detection

***By Joe Corliss***

[Data source](https://www.kaggle.com/mlg-ulb/creditcardfraud)

I'm going to try to keep this kernel short and sweet, just to test out a few models.

This is my second machine learning project on Kaggle. I would really appreciate any feedback!

## Table of Contents

1. [Check the Data](#1)
2. [Pre-processing](#2)
3. [Predictive Modeling](#3)
    1. [Linear Classification with Stochastic Gradient Descent (SGD)](#3.1)
    2. [Random Forest Classifier](#3.2)

# Check the Data
<a id='1'></a>

In [1]:
import numpy as np
import pandas as pd

Read in the data to a pandas dataframe.

In [2]:
df = pd.read_csv('creditcard.csv', low_memory=False)

Check basic metadata.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time      284807 non-null float64
V1        284807 non-null float64
V2        284807 non-null float64
V3        284807 non-null float64
V4        284807 non-null float64
V5        284807 non-null float64
V6        284807 non-null float64
V7        284807 non-null float64
V8        284807 non-null float64
V9        284807 non-null float64
V10       284807 non-null float64
V11       284807 non-null float64
V12       284807 non-null float64
V13       284807 non-null float64
V14       284807 non-null float64
V15       284807 non-null float64
V16       284807 non-null float64
V17       284807 non-null float64
V18       284807 non-null float64
V19       284807 non-null float64
V20       284807 non-null float64
V21       284807 non-null float64
V22       284807 non-null float64
V23       284807 non-null float64
V24       284807 non-null float64
V25       284807 non-null float64
V26  

Check the first few rows of the data.

In [5]:
df.head()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


Get descriptive statistics.

In [6]:
df.describe()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,...,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,3.91956e-15,5.688174e-16,-8.769071e-15,2.782312e-15,-1.552563e-15,2.010663e-15,-1.694249e-15,-1.927028e-16,-3.137024e-15,...,1.537294e-16,7.959909e-16,5.36759e-16,4.458112e-15,1.453003e-15,1.699104e-15,-3.660161e-16,-1.206049e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,...,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,...,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,...,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,...,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,...,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,...,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


Note that `Time` and `Amount` are on much larger scales than `V1` to `V28`, so we will need to standardize the data (rescale each column to zero mean and unit variance) to aid the learning algorithms.

How balanced are the classes, i.e. how common are fraudulent transactions?

In [7]:
df['Class'].value_counts() / df.shape[0]

0    0.998273
1    0.001727
Name: Class, dtype: float64

Only about 0.173% of transactions are fraudulent.
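
With classes this imbalanced, plain accuracy is a misleading metric. As a minimal sketch (the fraud count is an assumption, reconstructed by rounding the class fraction above, and the labels are synthetic), a trivial model that marks every transaction as legitimate already scores about 99.83% accuracy while catching no fraud at all:

```python
import numpy as np

n_total = 284807                      # number of rows reported by df.info()
n_fraud = round(n_total * 0.001727)   # assumed fraud count, rounded from the class fraction

# Synthetic labels with the same class balance as the dataset
y_true = np.zeros(n_total, dtype=int)
y_true[:n_fraud] = 1

# A "model" that always predicts the majority class (not fraud)
y_always_legit = np.zeros(n_total, dtype=int)

accuracy = (y_true == y_always_legit).mean()
fraud_recall = y_always_legit[y_true == 1].mean()  # fraction of frauds that get flagged

print(f"accuracy = {accuracy:.6f}, fraud recall = {fraud_recall:.1f}")
```

Metrics that account for both classes, such as precision, recall, and the Matthews correlation coefficient used later, are more informative here.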

Is there any missing data?

In [8]:
df.isnull().sum().sum()

0

No. Great!

# Pre-processing
<a id='2'></a>

Split the data into features and target variable.

In [9]:
X = df.drop(columns=['Class']) # Features
y = df['Class'] # Target
df = None # Delete original dataframe to save memory

## Train/test split

In [10]:
from sklearn.model_selection import train_test_split

Do an 80/20 train/test split.

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
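
The split above is purely random, so the number of fraud cases that land in the test set varies from run to run. A possible alternative, shown here as a sketch on synthetic stand-in data (the arrays `X_demo` and `y_demo` are hypothetical and this is not what the notebook ran), is to pass `stratify` to keep the class proportions equal in both splits and `random_state` to make the split reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10,000 rows with 20 positives (0.2% positive rate)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:20] = 1

# stratify keeps the class proportions equal in both splits;
# random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

print("train positive rate:", y_tr.mean())
print("test positive rate: ", y_te.mean())
```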

## Standardization

In [12]:
from sklearn.preprocessing import StandardScaler

Fit the scaler only to the training data to avoid test set leakage.

In [13]:
scaler = StandardScaler().fit(X_train)

Scale the data.

In [14]:
X_train = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns)
X_test  = pd.DataFrame(scaler.transform(X_test),  columns=X_test.columns)
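
To illustrate what fitting on the training data only means in practice, here is a small sketch on synthetic data (the arrays and the name `demo_scaler` are hypothetical). The scaled training set has exactly zero mean and unit variance per column, while the test set is only approximately standardized because it reuses the training statistics:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: two features on very different scales
train = rng.normal(loc=[0.0, 90.0], scale=[1.0, 250.0], size=(1000, 2))
test = rng.normal(loc=[0.0, 90.0], scale=[1.0, 250.0], size=(250, 2))

demo_scaler = StandardScaler().fit(train)    # statistics come from the training set only
train_scaled = demo_scaler.transform(train)
test_scaled = demo_scaler.transform(test)    # reuses the training mean and std

print("train:", train_scaled.mean(axis=0), train_scaled.std(axis=0))
print("test: ", test_scaled.mean(axis=0), test_scaled.std(axis=0))
```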

# Predictive Modeling
<a id='3'></a>

In [15]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import matthews_corrcoef, make_scorer, average_precision_score, confusion_matrix, classification_report

## Linear Classification with Stochastic Gradient Descent (SGD)
<a id='3.1'></a>

In [16]:
from sklearn.linear_model import SGDClassifier

The estimator `SGDClassifier` ([documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)) implements several linear classifiers trained with SGD. This makes it very fast on large datasets, possibly at some cost in accuracy because the optimization is only approximate.

### Train with grid search

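The linear classifier is chosen with the `loss` hyperparameter. We'll try a linear SVC (`loss = 'hinge'`) and logistic regression (`loss = 'log'`; newer scikit-learn releases call this `'log_loss'`). Note that `class_weight` is only varied for the hinge loss.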

In [22]:
param_grid = [{'loss': ['hinge'],
               'alpha': [10.0**k for k in range(-6,4)],
               'max_iter': [1000],
               'tol': [1e-3],
               'class_weight': [None, 'balanced'],
               'warm_start': [True]},
              {'loss': ['log'],
               'penalty': ['l2', 'l1'],
               'alpha': [10.0**k for k in range(-6,4)],
               'max_iter': [1000],
               'tol': [1e-3],
               'warm_start': [True]}]
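
As a quick sanity check on the size of this search, the number of candidates can be counted with `ParameterGrid`. The sketch below rebuilds the grid under the hypothetical name `demo_grid`, leaving out the single-valued entries (`max_iter`, `tol`, `warm_start`) because they do not change the count. The fold count of 3 is an assumption taken from the fit log further down:

```python
from sklearn.model_selection import ParameterGrid

alphas = [10.0**k for k in range(-6, 4)]  # ten values, from 1e-6 to 1e3

demo_grid = [
    # hinge loss: 10 alphas x 2 class_weight settings
    {'loss': ['hinge'], 'alpha': alphas, 'class_weight': [None, 'balanced']},
    # logistic loss: 10 alphas x 2 penalties
    {'loss': ['log'], 'penalty': ['l2', 'l1'], 'alpha': alphas},
]

n_candidates = len(ParameterGrid(demo_grid))
n_folds = 3  # assumption: matches the "3 folds" reported in the fit log
print(n_candidates, "candidates,", n_candidates * n_folds, "fits")
```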

We'll use the Matthews correlation coefficient (MCC) ([documentation](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.matthews_corrcoef.html)) as our scoring metric.

In [23]:
grid = GridSearchCV(estimator=SGDClassifier(), param_grid=param_grid, scoring=make_scorer(matthews_corrcoef), 
                    n_jobs=4, verbose=3, return_train_score=True)

Perform the grid search:

In [24]:
grid.fit(X_train, y_train)

Fitting 3 folds for each of 40 candidates, totalling 120 fits


[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:   10.8s
[Parallel(n_jobs=4)]: Done 120 out of 120 | elapsed:   46.9s finished


GridSearchCV(cv=None, error_score='raise',
       estimator=SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=None, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=4,
       param_grid=[{'loss': ['hinge'], 'alpha': [1e-06, 1e-05, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0], 'max_iter': [1000], 'tol': [0.001], 'class_weight': [None, 'balanced'], 'warm_start': [True]}, {'loss': ['log'], 'penalty': ['l2', 'l1'], 'alpha': [1e-06, 1e-05, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0], 'max_iter': [1000], 'tol': [0.001], 'warm_start': [True]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=make_scorer(matthews_corrcoef), verbose=3)

What are the optimal parameters from our chosen grid?

In [26]:
grid.best_params_

{'alpha': 100.0,
 'class_weight': 'balanced',
 'loss': 'hinge',
 'max_iter': 1000,
 'tol': 0.001,
 'warm_start': True}

Looks like a linear SVC (`loss = 'hinge'`) with strong regularization (`alpha = 100.0`) and balanced class weights gets the best cross-validated MCC.

### Test set evaluation

Make test set predictions.

In [28]:
y_pred = grid.predict(X_test)

**Evaluation metrics:**

In [29]:
print(confusion_matrix(y_test, y_pred))

[[56841    19]
 [   22    80]]


Not too shabby: 80 of the 102 fraudulent transactions in the test set are caught, with 19 false alarms.

In [30]:
print(classification_report(y_test, y_pred, digits=3))

             precision    recall  f1-score   support

          0      1.000     1.000     1.000     56860
          1      0.808     0.784     0.796       102

avg / total      0.999     0.999     0.999     56962



The F1-score for the fraud class is 0.796.

In [31]:
matthews_corrcoef(y_test, y_pred)

0.7957484074536006

Informally, an MCC of about 0.80 puts the model roughly 80% of the way from random guessing (MCC = 0) to a perfect predictor (MCC = 1).
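
For reference, the MCC can be computed by hand from the four cells of the confusion matrix printed above. The sketch below uses the standard binary formula, (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), and should agree with the `matthews_corrcoef` output up to floating-point rounding:

```python
import math

# Confusion matrix from the test set evaluation above: [[TN, FP], [FN, TP]]
tn, fp = 56841, 19
fn, tp = 22, 80

numerator = tp * tn - fp * fn
denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = numerator / denominator

print(f"MCC = {mcc:.4f}")
```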

The `average_precision_score` ([documentation](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html)) summarizes the precision-recall curve in a single number and approximates the AUPRC (Area Under the Precision-Recall Curve).

In [32]:
average_precision_score(y_test, y_pred)

0.6341750914768634

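Not great, but we did not optimize the hyperparameters with respect to this metric. Note also that we passed hard class predictions (`y_pred`) rather than continuous scores such as `grid.decision_function(X_test)`, so the precision-recall curve here consists of a single operating point and the value understates what a proper AUPRC would show.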
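
To see where 0.634 comes from: with 0/1 predictions there are only two thresholds, so average precision reduces to a two-term sum of (recall step) × (precision). A short worked check using the confusion matrix above:

```python
# Confusion matrix of the linear model on the test set: [[TN, FP], [FN, TP]]
tn, fp = 56841, 19
fn, tp = 22, 80

precision = tp / (tp + fp)                     # precision at the model's operating point
recall = tp / (tp + fn)                        # recall at the model's operating point
prevalence = (tp + fn) / (tn + fp + fn + tp)   # precision if everything is flagged as fraud

# Recall goes 0 -> recall at the model's precision,
# then recall -> 1 at the base-rate precision
ap_from_labels = recall * precision + (1.0 - recall) * prevalence

print(f"average precision from hard labels = {ap_from_labels:.4f}")
```

The first term is just precision times recall, which is why this number sits well below both of them.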

## Random Forest Classifier
<a id='3.2'></a>

In [33]:
from sklearn.ensemble import RandomForestClassifier

### Train

The random forest takes much longer to train on this fairly large dataset, so we won't tune the hyperparameters with a grid search. We'll just use default values, except for using 100 estimators (the default is 10 in the scikit-learn version used here; newer releases default to 100) and building trees in parallel (`n_jobs = -1`).

In [48]:
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)

In [49]:
rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

### Test set evaluation

Make test set predictions.

In [50]:
y_pred = rf.predict(X_test)

**Evaluation metrics:**

In [51]:
print(confusion_matrix(y_test, y_pred))

[[56854     6]
 [   25    77]]


Mostly an improvement over the linear SVC: false positives drop from 19 to 6, although false negatives rise slightly from 22 to 25.
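
The trade-off between the two models is easier to see side by side. This small sketch derives fraud-class precision and recall from the two test set confusion matrices reported above (the dictionary names are just for illustration):

```python
# Test set confusion matrices reported above, laid out as [[TN, FP], [FN, TP]]
matrices = {
    'linear SVC (SGD)': [[56841, 19], [22, 80]],
    'random forest':    [[56854, 6], [25, 77]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in matrices.items():
    summary[name] = {
        'false_positives': fp,
        'false_negatives': fn,
        'precision': tp / (tp + fp),  # share of flagged transactions that are fraud
        'recall': tp / (tp + fn),     # share of frauds that get flagged
    }
    print(name, summary[name])
```

The random forest raises fewer false alarms (higher precision) but misses three more frauds (lower recall).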

In [52]:
print(classification_report(y_test, y_pred, digits=3))

             precision    recall  f1-score   support

          0      1.000     1.000     1.000     56860
          1      0.928     0.755     0.832       102

avg / total      0.999     0.999     0.999     56962



The F1-score for the fraud class is 0.832.

In [53]:
matthews_corrcoef(y_test, y_pred)

0.8366009115262671

In [54]:
average_precision_score(y_test, y_pred)  # the 'average' argument has no effect for binary targets

0.7007696237874707

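The random forest did a little better than the linear SVC on every summary metric: MCC 0.837 vs. 0.796, F1-score 0.832 vs. 0.796, and average precision 0.701 vs. 0.634.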
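
Both average precision values above were computed from hard class labels. A possible refinement, sketched here on synthetic data rather than the real dataset (`X_demo`, `y_demo`, and `demo_rf` are hypothetical names, and the numbers it prints say nothing about the fraud data), is to score with predicted probabilities so that the full precision-recall curve is used:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 5% positives), a stand-in for the real dataset
X_demo, y_demo = make_classification(n_samples=4000, n_features=10,
                                     weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          stratify=y_demo, random_state=0)

demo_rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Hard 0/1 labels give a single operating point
ap_labels = average_precision_score(y_te, demo_rf.predict(X_te))
# Positive-class probabilities give the full precision-recall curve
ap_scores = average_precision_score(y_te, demo_rf.predict_proba(X_te)[:, 1])

print(f"AP from labels: {ap_labels:.3f}, AP from probabilities: {ap_scores:.3f}")
```

For the linear model, `decision_function` would play the role of `predict_proba` here, since a hinge-loss `SGDClassifier` does not provide probabilities.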