# Credit Card Fraud Detection

***By Joe Corliss***

**July 12, 2018**

## Contents

* [1. Introduction](#1)
    * [1.1 References](#1.1)
* [2. Getting Started](#2)
* [3. Train/Test Split](#3)
* [4. Exploratory Data Analysis](#4)
    * [4.1 Time](#4.1)
    * [4.2 Amount](#4.2)
    * [4.3 Time vs. Amount](#4.3)
    * [4.4 V1-V28](#4.4)
* [5. Mutual Information between Fraud and the Predictors](#5)
* [6. Modeling](#6)
    * [6.1 Logistic Regression and Support Vector Classifier](#6.1)
    * [6.2 Random Forest](#6.2)
* [7. Test Set Evaluation of the Best Model](#7)
* [8. Conclusion](#8)

# 1. Introduction
<a id='1'></a>

In this project we analyze a dataset of credit card transactions made over a two-day period in September 2013 by European cardholders. The dataset contains 284,807 transactions, of which 492 (0.17%) are fraudulent.

Each transaction has 30 features, all of which are numerical. The features `V1, V2, ..., V28` are the result of a PCA transformation. To protect confidentiality, background information on these features is not available. The `Time` feature contains the time elapsed since the first transaction, and the `Amount` feature contains the transaction amount. The response variable, `Class`, is 1 in the case of fraud, and 0 otherwise.

Our goal in this project is to construct models to predict whether a credit card transaction is fraudulent. We'll attempt a supervised learning approach. We'll also create visualizations to help us understand the structure of the data and unearth any interesting patterns.

## 1.1 References
<a id='1.1'></a>

* [Kaggle Dataset](https://www.kaggle.com/mlg-ulb/creditcardfraud)
* [Kaggle Notebook](https://www.kaggle.com/pileatedperch/linear-svc-random-forest-mcc-0-77-0-88) - A version of this notebook hosted on Kaggle
* [GitHub Repository](https://github.com/jgcorliss/credit-card-fraud) - The GitHub repository for this project

# 2. Getting Started
<a id='2'></a>

Import basic libraries:

In [None]:
import numpy as np
import scipy as sp
import scipy.stats  # Explicitly load the submodule so that sp.stats is available
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Pandas options
pd.set_option('display.max_colwidth', 1000, 'display.max_rows', None, 'display.max_columns', None)

# Plotting options
%matplotlib inline
mpl.style.use('ggplot')
sns.set(style='whitegrid')

Read the data into a pandas dataframe.

In [None]:
transactions = pd.read_csv('../input/creditcard.csv')

Check basic metadata.

In [None]:
transactions.shape

In [None]:
transactions.info()

Are there any variables with missing data?

In [None]:
transactions.isnull().any().any()

No! Let's view five randomly chosen transactions.

In [None]:
transactions.sample(5)

How balanced are the classes, i.e. how common are fraudulent transactions?

In [None]:
transactions['Class'].value_counts()

In [None]:
transactions['Class'].value_counts(normalize=True)

Only 0.17% of transactions (492 out of 284,807) are fraudulent.

# 3. Train/Test Split
<a id='3'></a>

Before we begin preprocessing, we split off a test data set. First split the data into features and response variable:

In [None]:
X = transactions.drop(labels='Class', axis=1) # Features
y = transactions.loc[:,'Class']               # Response
del transactions                              # Delete the original data

We'll use a test size of 20%. We also stratify the split on the response variable, which is very important to do because there are so few fraudulent transactions.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)
del X, y

In [None]:
X_train.shape

In [None]:
X_test.shape
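
To illustrate why stratification matters with so few positives, here is a minimal sketch on synthetic labels (the arrays `X_demo` and `y_demo` are hypothetical and not part of the dataset). With `stratify`, the test set receives its proportional share of positives; without it, the count is left to chance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced labels: 100 positives out of 10,000 (1%)
rng = np.random.default_rng(0)
y_demo = np.zeros(10_000, dtype=int)
y_demo[:100] = 1
X_demo = rng.normal(size=(10_000, 3))

# Stratified split: class proportions are preserved in both partitions
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo)

# Unstratified split: the number of positives in the test set varies by chance
_, _, y_tr_u, y_te_u = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1)

print('stratified   test positives:', y_te_s.sum())
print('unstratified test positives:', y_te_u.sum())
```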

In [None]:
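# Work on explicit copies to prevent SettingWithCopyWarning
# (the is_copy attribute is no longer available in recent pandas versions)
X_train = X_train.copy()
X_test = X_test.copy()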

# 4. Exploratory Data Analysis
<a id='4'></a>

We conduct EDA only on the training set, and leave the test set unseen.

## 4.1 Time
<a id='4.1'></a>

A few descriptive statistics for the `Time` variable:

In [None]:
X_train['Time'].describe()

Let's convert the time from seconds to hours to ease the interpretation.

In [None]:
X_train.loc[:,'Time'] = X_train.Time / 3600
X_test.loc[:,'Time'] = X_test.Time / 3600

What is the time of the last transaction, in days?

In [None]:
X_train['Time'].max() / 24

So the transactions indeed occur over a two-day period. Next let's plot a histogram of transaction times, with one bin per hour:

In [None]:
plt.figure(figsize=(12,4), dpi=80)
sns.distplot(X_train['Time'], bins=48, kde=False)
plt.xlim([0,48])
plt.xticks(np.arange(0,54,6))
plt.xlabel('Time After First Transaction (hr)')
plt.ylabel('Count')
plt.title('Transaction Times')

It looks like there is a lull in credit card transactions during the nighttime on each of the two days.

## 4.2 Amount
<a id='4.2'></a>

Summary statistics:

In [None]:
X_train['Amount'].describe()

Comparing the different quantiles, it looks like the amounts are very right-skewed. To verify this, plot a histogram of the transaction amounts:

In [None]:
plt.figure(figsize=(12,4), dpi=80)
sns.distplot(X_train['Amount'], bins=300, kde=False)
plt.ylabel('Count')
plt.title('Transaction Amounts')

The histogram is hard to read due to some outliers we can't see. A boxplot will show the outliers:

In [None]:
plt.figure(figsize=(12,4), dpi=80)
sns.boxplot(x=X_train['Amount'])
plt.title('Transaction Amounts')

We can see there are no outliers on the left and many outliers on the right. So the amounts certainly seem right-skewed. We can calculate the skewness to be sure:

In [None]:
X_train['Amount'].skew()

That's a strong right skew. Let's use a power transform to bring the transaction amounts closer to a normal distribution. We'll use the [Box-Cox transform in SciPy](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox.html), but some of the amounts are zero (min = 0 above), so we need to shift the amounts first to make them positive. We'll shift by a very small amount, just $10^{-9}$.

In [None]:
X_train.loc[:,'Amount'] = X_train['Amount'] + 1e-9 # Shift all amounts by 1e-9

Perform the Box-Cox transform:

In [None]:
X_train.loc[:,'Amount'], maxlog, (min_ci, max_ci) = sp.stats.boxcox(X_train['Amount'], alpha=0.01)

The maximum likelihood estimate of $\lambda$ in the Box-Cox transform:

In [None]:
maxlog

The 99% confidence interval for $\lambda$:

In [None]:
(min_ci, max_ci)

Plot the newly transformed amounts:

In [None]:
plt.figure(figsize=(12,4), dpi=80)
sns.distplot(X_train['Amount'], kde=False)
plt.xlabel('Transformed Amount')
plt.ylabel('Count')
plt.title('Transaction Amounts (Box-Cox Transformed)')

Much better. The distribution appears to be bimodal, suggesting a divide between "small" and "large" purchases. Now let's check the descriptive stats of the transformed amounts:

In [None]:
X_train['Amount'].describe()

What's the skewness of the transformed amounts? (Zero skewness would be ideal.)

In [None]:
X_train['Amount'].skew()

So our power transform removed most of the skewness in the `Amount` variable. Now we need to compute the Box-Cox transform on the test data amounts as well, using the $\lambda$ value estimated on the training data.

In [None]:
X_test.loc[:,'Amount'] = X_test['Amount'] + 1e-9 # Shift all amounts by 1e-9

In [None]:
X_test.loc[:,'Amount'] = sp.stats.boxcox(X_test['Amount'], lmbda=maxlog)
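
For reference, the Box-Cox transform maps $x > 0$ to $(x^\lambda - 1)/\lambda$ when $\lambda \neq 0$ and to $\log x$ when $\lambda = 0$. The fit-on-train, apply-to-test pattern used above can be sketched on synthetic data as follows (the log-normal "amounts" are an assumption made for illustration, not the real transaction amounts):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic right-skewed, strictly positive "amounts"
train_amt = rng.lognormal(mean=3.0, sigma=1.5, size=5000)
test_amt = rng.lognormal(mean=3.0, sigma=1.5, size=1000)

# Estimate lambda on the training data only
train_bc, lam = stats.boxcox(train_amt)

# Reuse that lambda on the test data; with lmbda given, only the array is returned
test_bc = stats.boxcox(test_amt, lmbda=lam)

print('lambda:', lam)
print('train skew before/after:', stats.skew(train_amt), stats.skew(train_bc))
print('test skew after:', stats.skew(test_bc))
```

Estimating $\lambda$ on the training data alone keeps information from the test set out of the preprocessing step.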

## 4.3 Time vs. Amount
<a id='4.3'></a>

Is there a relationship between the transaction amounts and the time of day? Let's explore this question with a joint histogram using hexagonal bins. For this plot, we convert each transaction time to the hour of the day on which it occurred.

In [None]:
# Keyword arguments x/y and height (formerly size); stat_func is no longer supported in recent seaborn versions
sns.jointplot(x=X_train['Time'] % 24, y=X_train['Amount'], kind='hex', height=12, xlim=(0,24), ylim=(-7.5,14)).set_axis_labels('Time of Day (hr)','Transformed Amount')

The transaction amounts appear to be similarly distributed throughout the daytime hours. However, in the earliest hours of the day, around 5-7 AM, amounts around 2.5 are the most common (recall this is a Box-Cox transformed value). Perhaps everyone's buying their morning coffee?

## 4.4 V1-V28
<a id='4.4'></a>

Let's compare the descriptive stats of the PCA variables `V1-V28`.

In [None]:
pca_vars = ['V%i' % k for k in range(1,29)]

Full table of descriptive stats:

In [None]:
X_train[pca_vars].describe()

It's tedious to interpret this table, so let's make some visualizations. We'll start by plotting the means:

In [None]:
plt.figure(figsize=(12,4), dpi=80)
sns.barplot(x=pca_vars, y=X_train[pca_vars].mean(), color='darkblue')
plt.xlabel('Column')
plt.ylabel('Mean')
plt.title('V1-V28 Means')

All of `V1-V28` have approximately zero mean. Now plot the standard deviations:

In [None]:
plt.figure(figsize=(12,4), dpi=80)
sns.barplot(x=pca_vars, y=X_train[pca_vars].std(), color='darkred')
plt.xlabel('Column')
plt.ylabel('Standard Deviation')
plt.title('V1-V28 Standard Deviations')

The PCA variables have standard deviations of roughly 1, ranging from as low as ~0.3 to as high as ~1.9. Plot the skewnesses next:

In [None]:
plt.figure(figsize=(12,4), dpi=80)
sns.barplot(x=pca_vars, y=X_train[pca_vars].skew(), color='darkgreen')
plt.xlabel('Column')
plt.ylabel('Skewness')
plt.title('V1-V28 Skewnesses')

A few of the PCA variables are significantly skewed. Let's plot a histogram of one of the particularly skewed variables, `V8`, to see the distribution in detail.

In [None]:
plt.figure(figsize=(12,4), dpi=80)
sns.distplot(X_train['V8'], bins=300, kde=False)
plt.ylabel('Count')
plt.title('V8')

The histogram doesn't show us outliers. Let's try a boxplot:

In [None]:
plt.figure(figsize=(12,4), dpi=80)
sns.boxplot(x=X_train['V8'])
plt.title('V8')

The boxplot is also hard to read due to the large number of outliers, which indicates high kurtosis in `V8`. This motivates us to plot the kurtoses of the PCA variables. The kurtosis method employed in pandas is Fisher’s definition, for which the standard normal distribution has kurtosis 0.
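
As a quick sanity check of Fisher's definition, the sketch below compares the pandas kurtosis of synthetic normal samples with that of a heavier-tailed Student's t sample (both samples are made up for illustration). SciPy's `kurtosis` with `fisher=False` gives Pearson's definition, under which the normal distribution has kurtosis 3.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
normal = pd.Series(rng.normal(size=100_000))
heavy = pd.Series(rng.standard_t(df=5, size=100_000))  # heavier tails than normal

# pandas uses Fisher's (excess) kurtosis: close to 0 for normal data
print('normal, Fisher:', normal.kurtosis())
print('Student t (df=5), Fisher:', heavy.kurtosis())

# Pearson's definition does not subtract 3
print('normal, Pearson:', stats.kurtosis(normal, fisher=False))
```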

Note the log scale on the y-axis in the plot below:

In [None]:
plt.figure(figsize=(12,4), dpi=80)
plt.yscale('log')
sns.barplot(x=pca_vars, y=X_train[pca_vars].kurtosis(), color='darkorange')
plt.xlabel('Column')
plt.ylabel('Kurtosis')
plt.title('V1-V28 Kurtoses')

We've learned that many of the PCA variables are heavy-tailed. The large numbers of outliers in `V1-V28` motivate us to consider robust descriptive statistics. Let's plot the medians:

In [None]:
plt.figure(figsize=(12,4), dpi=80)
sns.barplot(x=pca_vars, y=X_train[pca_vars].median(), color='darkblue')
plt.xlabel('Column')
plt.ylabel('Median')
plt.title('V1-V28 Medians')

The medians are also roughly zero. Next let's look at the interquartile ranges (IQR)*:

*Pandas does not have a built-in IQR method, but we can use the [`quantile`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.quantile.html) method to calculate the IQR.

In [None]:
plt.figure(figsize=(12,4), dpi=80)
sns.barplot(x=pca_vars, y=X_train[pca_vars].quantile(0.75) - X_train[pca_vars].quantile(0.25), color='darkred')
plt.xlabel('Column')
plt.ylabel('IQR')
plt.title('V1-V28 IQRs')

The IQRs of `V1-V28` are on a similar scale to the standard deviations.

# 5. Mutual Information between Fraud and the Predictors
<a id='5'></a>

[Mutual information](https://en.wikipedia.org/wiki/Mutual_information) is a non-parametric method to estimate the mutual dependence between two variables. Mutual information of 0 indicates no dependence, and higher values indicate higher dependence. According to the [sklearn User Guide](http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection), "mutual information methods can capture any kind of statistical dependency, but being nonparametric, they require more samples for accurate estimation." We have 227,845 training samples, so mutual information should work well. Because the target variable is discrete, we use [`mutual_info_classif`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html#sklearn.feature_selection.mutual_info_classif) (as opposed to [`mutual_info_regression`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html#sklearn.feature_selection.mutual_info_regression) for a continuous target).
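
To get a feel for how the estimate behaves, here is a small sketch on synthetic data (the features `informative` and `noise` are hypothetical). A feature that depends on the target should receive a clearly higher mutual information than one that is independent of it.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000
y_demo = rng.integers(0, 2, size=n)                    # binary target
informative = y_demo + rng.normal(scale=0.5, size=n)   # depends on the target
noise = rng.normal(size=n)                             # independent of the target
X_demo = np.column_stack([informative, noise])

mi = mutual_info_classif(X_demo, y_demo, discrete_features=False, random_state=1)
print(dict(zip(['informative', 'noise'], mi)))
```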

In [None]:
from sklearn.feature_selection import mutual_info_classif

In [None]:
mutual_infos = pd.Series(data=mutual_info_classif(X_train, y_train, discrete_features=False, random_state=1), index=X_train.columns)

The calculated mutual informations of each variable with `Class`, in descending order:

In [None]:
mutual_infos.sort_values(ascending=False)

The five variables with the highest mutual information with `Class` are, in decreasing order, V17, V14, V10, V12, and V11.

# 6. Modeling
<a id='6'></a>

Now we're ready to build machine learning models to predict whether a transaction is fraudulent. We'll train the following models:
* Logistic regression
* Support vector classifier
* Random forest

## 6.1 Logistic Regression and Support Vector Classifier
<a id='6.1'></a>

The class [`SGDClassifier`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) implements multiple linear classifiers with SGD training, which makes learning much faster on large datasets. We'll implement the model as a machine learning pipeline that includes [`StandardScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) for data standardization (rescaling each variable to zero mean and unit variance).

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

In [None]:
pipeline_sgd = Pipeline([
    ('scaler', StandardScaler(copy=False)),
    ('model', SGDClassifier(max_iter=1000, tol=1e-3, random_state=1, warm_start=True))
])

We'll conduct a grid search over several hyperparameter choices. The search uses 5-fold cross-validation with stratified folds. The type of linear classifier is chosen with the `loss` hyperparameter. For a linear SVC we set `loss = 'hinge'`, and for logistic regression we set `loss = 'log'` (renamed to `'log_loss'` in scikit-learn 1.1).

Set the hyperparameter grids to search over, one grid for logistic regression and one for the linear SVC:

In [None]:
param_grid_sgd = [{
    'model__loss': ['log_loss'], # Use 'log' instead on scikit-learn versions older than 1.1
    'model__penalty': ['l1', 'l2'],
    'model__alpha': np.logspace(start=-3, stop=3, num=20)
}, {
    'model__loss': ['hinge'],
    'model__alpha': np.logspace(start=-3, stop=3, num=20),
    'model__class_weight': [None, 'balanced']
}]

The grid search, implemented by [`GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), uses [`StratifiedKFold`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold) with 5 folds for the train/validation splits. We'll use [`matthews_corrcoef`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.matthews_corrcoef.html) (the [Matthews correlation coefficient](https://en.wikipedia.org/wiki/Matthews_correlation_coefficient), MCC) as our scoring metric.
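
For binary classification, the MCC is $(TP \cdot TN - FP \cdot FN) / \sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}$. The sketch below, on a made-up confusion matrix, shows why it is a more informative metric than accuracy when the classes are imbalanced: the hypothetical classifier is 99% accurate yet misses half of the frauds, and the MCC reflects that.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, matthews_corrcoef

# Hypothetical imbalanced example: 990 legitimate (0) and 10 fraudulent (1) transactions
y_true = np.array([0] * 990 + [1] * 10)
# A classifier that wrongly flags 5 legitimate transactions and catches 5 of the 10 frauds
y_pred = np.array([1] * 5 + [0] * 985 + [1] * 5 + [0] * 5)

# Compute the MCC by hand from the confusion matrix entries
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
mcc_manual = (tp * tn - fp * fn) / np.sqrt(
    float(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print('Accuracy:     ', accuracy_score(y_true, y_pred))
print('MCC (manual): ', mcc_manual)
print('MCC (sklearn):', matthews_corrcoef(y_true, y_pred))
```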

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, matthews_corrcoef

In [None]:
MCC_scorer = make_scorer(matthews_corrcoef)
grid_sgd = GridSearchCV(estimator=pipeline_sgd, param_grid=param_grid_sgd, scoring=MCC_scorer, n_jobs=-1, pre_dispatch='2*n_jobs', cv=5, verbose=1, return_train_score=False)

Perform the grid search:

In [None]:
import warnings
with warnings.catch_warnings(): # Suppress warnings from the matthews_corrcoef function
    warnings.simplefilter("ignore")
    grid_sgd.fit(X_train, y_train)

Mean cross-validated MCC score of the best estimator found:

In [None]:
grid_sgd.best_score_

This is a pretty good MCC score: random guessing has a score of 0, and a perfect predictor has a score of 1. Now check the best hyperparameters found in the grid search:

In [None]:
grid_sgd.best_params_

So the linear SVC performed better than logistic regression, and with a high level of regularization ($\alpha\approx 483$).
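
To place this value on the search grid, the short sketch below rebuilds the same `np.logspace` grid used for `model__alpha` and inspects it. The points are evenly spaced on a log scale, and the value near 483 is the second-largest of the 20 candidates.

```python
import numpy as np

# Same grid as used for model__alpha in the grid search
alphas = np.logspace(start=-3, stop=3, num=20)

# Consecutive grid points differ by a constant factor of 10**(6/19)
ratio = alphas[1] / alphas[0]

print('number of candidates:', len(alphas))
print('multiplicative step: ', ratio)
print('second-largest alpha:', alphas[-2])
```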

## 6.2 Random Forest
<a id='6.2'></a>

Next we'll try a random forest model, implemented in [`RandomForestClassifier`](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In [None]:
from sklearn.ensemble import RandomForestClassifier

We do not need to rescale the data for tree-based models, so our pipeline will simply consist of the random forest model. We'll leave the pipeline implementation in place in case we want to add preprocessing steps in the future.

In [None]:
pipeline_rf = Pipeline([
    ('model', RandomForestClassifier(n_jobs=-1, random_state=1))
])

The random forest takes much longer to train on this fairly large dataset, so we don't actually do a hyperparameter grid search, only specifying the number of estimators. We'll leave the grid search implemented in case we decide to try different hyperparameter values in the future.

In [None]:
param_grid_rf = {'model__n_estimators': [75]}

In [None]:
grid_rf = GridSearchCV(estimator=pipeline_rf, param_grid=param_grid_rf, scoring=MCC_scorer, n_jobs=-1, pre_dispatch='2*n_jobs', cv=5, verbose=1, return_train_score=False)

Perform the grid search:

In [None]:
grid_rf.fit(X_train, y_train)

In [None]:
grid_rf.best_score_

The random forest performed much better than the linear SVC, and without any hyperparameter tweaking!

In [None]:
grid_rf.best_params_

# 7. Test Set Evaluation of the Best Model
<a id='7'></a>

According to the cross-validated MCC scores, the random forest is the best-performing model, so now let's evaluate its performance on the test set.

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, matthews_corrcoef, cohen_kappa_score, accuracy_score, average_precision_score, roc_auc_score

In [None]:
def classification_eval(estimator, X_test, y_test):
    """
    Print several metrics of classification performance of an estimator, given features X_test and true labels y_test.
    
    Input: estimator or GridSearchCV instance, X_test, y_test
    Returns: text printout of metrics
    """
    y_pred = estimator.predict(X_test)
    
    # Number of decimal places based on number of samples
    dec = np.int64(np.ceil(np.log10(len(y_test))))
    
    print('CONFUSION MATRIX')
    print(confusion_matrix(y_test, y_pred), '\n')
    
    print('CLASSIFICATION REPORT')
    print(classification_report(y_test, y_pred, digits=dec))
    
    print('SCALAR METRICS')
    format_str = '%%13s = %%.%if' % dec
    print(format_str % ('MCC', matthews_corrcoef(y_test, y_pred)))
    if y_test.nunique() <= 2: # Additional metrics for binary classification
        try:
            y_score = estimator.predict_proba(X_test)[:,1]
        except AttributeError: # No predict_proba available; fall back to decision_function
            y_score = estimator.decision_function(X_test)
        print(format_str % ('AUPRC', average_precision_score(y_test, y_score)))
        print(format_str % ('AUROC', roc_auc_score(y_test, y_score)))
    print(format_str % ("Cohen's kappa", cohen_kappa_score(y_test, y_pred)))
    print(format_str % ('Accuracy', accuracy_score(y_test, y_pred)))

Evaluate the random forest on the test set:

In [None]:
classification_eval(grid_rf, X_test, y_test)

According to the MCC, the random forest performed better on the test set than in cross-validation. This is probably due to the refit model being trained on the entire training data set, and not on the smaller CV folds.

# 8. Conclusion
<a id='8'></a>

We were able to accurately identify fraudulent credit card transactions using a random forest model. We found that the five variables with the highest mutual information with fraud are, in decreasing order, V17, V14, V10, V12, and V11. Only a few preprocessing steps were necessary before constructing predictive models:
* Split the data using a random, stratified train/test split with a test size of 20%
* Box-Cox power transform of the transaction amounts to remove skewness in the data
* Mean and variance standardization of all features as part of a machine learning pipeline (for the linear models only)

We used the [Matthews correlation coefficient (MCC)](https://en.wikipedia.org/wiki/Matthews_correlation_coefficient) to compare the performance of different models. In cross-validation, the best linear model (the better of logistic regression and linear SVC) achieved an MCC score of 0.807, and the random forest achieved an MCC score of 0.856. We therefore chose the random forest as the better model, and it obtained an MCC of 0.869 on the test set.

To improve a chosen model, we searched over a grid of hyperparameters and compared performance with cross-validation. It may be possible to improve the random forest model by further tweaking the hyperparameters, given additional time and/or computational power.