# Credit card fraud detection
---

In this kernel, we will explore the credit card transaction [data](https://www.kaggle.com/dalpozz/creditcardfraud) and detect frauds using classification models. The kernel also demonstrates the basic pipeline of data analysis and modeling.

**TODO**
1. Evaluate different models, including XGBoost, LightGBM
2. Tune hyperparameters using Hyperopt
3. Try stacking, using StackNet

In [2]:
%matplotlib inline
import os

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from imblearn.combine import SMOTEENN 

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score

np.random.seed(5)

## 1. Obtain the data
---

The [dataset](https://www.kaggle.com/dalpozz/creditcardfraud) contains transactions made with credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions.

In [3]:
# Load csv data to data frame
file_path = '../input/creditcard.csv'
df = pd.read_csv(file_path, sep=",")

## 2. Scrub the data
---

Before modeling, we check whether the data needs cleaning, for example because of missing values or outliers.

### 2.1. Take a look at the data
---

In [4]:
df.head()

### 2.2. Move Class to the front of the data frame
---

In [5]:
class_ = df.Class # since class is a reserved word in Python, use class_ instead
df.drop('Class', axis=1, inplace=True)
df.insert(0, 'Class', class_)
df.head()

### 2.3. Check missing values
---

In [6]:
df.isnull().any()

The data is clean. Go ahead to explore the data.
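
`df.isnull().any()` only reports whether a column contains at least one missing value. The sketch below, on a toy frame whose column values are made up for illustration, shows how per-column counts can be obtained as well:

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; these are not rows from the credit card data
toy = pd.DataFrame({'Amount': [10.0, np.nan, 3.5],
                    'Time': [0.0, 1.0, np.nan],
                    'Class': [0, 0, 1]})

missing_counts = toy.isnull().sum()   # number of missing values per column
has_missing = toy.isnull().any()      # True for columns with at least one NaN
print(missing_counts)
print(has_missing)
```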

## 3. Explore the data
---

In this section, we try to answer these questions:
* How many data points are there?
* How many features are there?
* What are their respective types?
* How many classes are there and what are their counts?
* What are descriptive statistics of the data?
* What are the relations between the class and features, as well as between one feature and another?

### 3.1. Statistical overview
---

In [7]:
df.shape

There are 284807 rows and 31 columns: 30 features plus the dependent variable, 'Class'.

In [8]:
df.dtypes

In [9]:
fraud_rate = df.Class.value_counts() / df.shape[0]
fraud_rate

There are two classes, 0 for normal and 1 for fraud. According to the counts, the two classes are extremely imbalanced: only 0.1727% of the transactions are frauds.
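
As a quick check, the fraud percentage follows directly from the counts given in the dataset description (492 frauds out of 284,807 transactions). A short worked calculation:

```python
# Counts taken from the dataset description above
n_fraud = 492
n_total = 284807

fraud_pct = 100.0 * n_fraud / n_total   # share of fraud transactions, in percent
print('{:.4f}%'.format(fraud_pct))
```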

In [10]:
df.describe()

Features V1, V2, ..., V28 are principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Since the components have no direct interpretation, there is little to infer from them individually. One observation is that V1 to V28 are roughly centered around zero, while 'Amount' does not follow a similar statistical pattern. It may therefore be sensible to preprocess the features, for example by standardization, to facilitate machine learning approaches.
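
To illustrate what standardization does to a skewed column such as 'Amount', here is a minimal sketch on made-up values. `StandardScaler` subtracts the column mean and divides by the column standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical transaction amounts, chosen only to show a skewed scale
amount = np.array([[1.0], [2.5], [10.0], [99.0], [1500.0]])

scaled = StandardScaler().fit_transform(amount)  # (x - mean) / std, per column
print(scaled.mean(axis=0), scaled.std(axis=0))   # approximately 0 and 1
```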

In [11]:
# Overview of fraud and normal transactions
fraud_summary = df.groupby('Class')
fraud_summary.mean().T

For V1, V2, ..., V28, the mean values differ between the two classes: these features have means of opposite sign in the two classes. This might indicate that the data distribution differs between the classes.

### 3.2. Correlation matrix
---

We now explore the relationship between one variable and another using a correlation matrix.

In [12]:
corr = df.corr()
# plot heat map
fig, ax = plt.subplots()
# the size of A4 paper
fig.set_size_inches(11.7, 8.27)
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values,
            ax = ax,
            cmap='YlGnBu')
plt.title('Heatmap of Correlation Matrix')

In [13]:
corr

### 3.3. Statistical test for correlation
---

A one-sample t-test checks whether a sample mean differs from the population mean. Let us test whether the average amount of transactions classified as fraud differs from that of the entire population.

In [14]:
amount_population = df.Amount.mean()
amount_fraud = df[df.Class == 1].Amount.mean()
print('mean amount of population: {}, mean amount of fraud transaction: {}'.format(amount_population, amount_fraud))

In [15]:
import scipy.stats as stats
stats.ttest_1samp(a=df[df['Class']==1]['Amount'], 
                  popmean=amount_population)

If the t-statistic calculated above lies outside the range between the lower and upper quantiles of the t-distribution, we can reject the null hypothesis.

In [16]:
n_fraud = len(df[df['Class']==1])
degree_freedom = n_fraud - 1  # a one-sample t-test has n - 1 degrees of freedom
conf_level = 0.95

LQ = stats.t.ppf((1-conf_level)/2, degree_freedom)  # lower critical value (2.5% quantile)

RQ = stats.t.ppf((1+conf_level)/2, degree_freedom)  # upper critical value (97.5% quantile)

print('The t-distribution lower critical value is: ' + str(LQ))
print('The t-distribution upper critical value is: ' + str(RQ))

The result shows that we should reject the null hypothesis that the average amount of transactions classified as fraud is the same as that of the population, since the t-statistic lies outside the critical values at the 95% confidence level.
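
For reference, the one-sample t-statistic is the difference between the sample mean and the population mean, divided by the standard error of the sample mean. The sketch below recomputes it by hand on synthetic data and compares it with `stats.ttest_1samp`. The sample and the population mean here are made-up stand-ins, not the values from the dataset:

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(5)
sample = rng.exponential(scale=120.0, size=492)  # synthetic stand-in for fraud amounts
popmean = 88.0                                   # hypothetical population mean

n = sample.shape[0]
# t = (sample mean - population mean) / (sample std / sqrt(n))
t_manual = (sample.mean() - popmean) / (sample.std(ddof=1) / np.sqrt(n))
t_scipy, p_value = stats.ttest_1samp(sample, popmean)

# Two-sided critical value at the 95% level with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, n - 1)
reject = abs(t_scipy) > t_crit
print(t_manual, t_scipy, p_value, t_crit, reject)
```

Comparing the t-statistic with the critical value is equivalent to comparing the two-sided p-value with 0.05.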

### 3.4. Distribution plots
---

We explore data distribution through various approaches of visualization.

#### 3.4.0. Visualizing pairwise relationships in a dataset

In [17]:
# For computational efficiency, only visualize pairwise relationships among several features, 
# including two principal components
sns.pairplot(df.loc[:, ['Class', 'Amount', 'Time', 'V1', 'V2']], hue='Class')

#### 3.4.1. Class vs Amount

In [18]:
# Kernel Density Plot
fig = plt.figure(figsize=(16,9),)
ax=sns.kdeplot(df.loc[(df['Class'] == 0), 'Amount'] , color='b', shade=True,label='normal transaction')
ax=sns.kdeplot(df.loc[(df['Class'] == 1), 'Amount'] , color='r', shade=True, label='fraud transaction')
plt.title('Transaction amount distribution - normal V.S. fraud')

The distribution of the amount of fraud is long-tailed.

#### 3.4.2. Class vs Time

In [19]:
# Kernel Density Plot
fig = plt.figure(figsize=(16,9),)
ax=sns.kdeplot(df.loc[(df['Class'] == 0), 'Time'] , color='b', shade=True,label='normal transaction')
ax=sns.kdeplot(df.loc[(df['Class'] == 1), 'Time'] , color='r', shade=True, label='fraud transaction')
plt.title('Transaction time distribution - normal V.S. fraud')

There is a bimodal distribution for the time of normal transactions. This pattern matches the problem description, which states that the transactions in the dataset occur over two days: the two ridges correspond to frequent transactions in the daytime, while the trough corresponds to infrequent ones during the night.
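
To relate 'Time' to the two-day window, it can be split into a day index and an hour within each 24-hour block. The sketch below assumes that 'Time' holds seconds elapsed since the first transaction, as the dataset description states. The values are made up, and the resulting hours are relative to the first transaction rather than wall-clock hours:

```python
import numpy as np

# Hypothetical 'Time' values, assumed to be seconds since the first transaction
time_s = np.array([0, 3600, 86399, 86400, 100000, 172792])

day_index = time_s // 86400               # 0 for the first 24h block, 1 for the second
hour_in_day = (time_s % 86400) // 3600    # whole hours since the start of that block
print(day_index, hour_in_day)
```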

#### 3.4.3. Time vs Amount

In [20]:
sns.lmplot(x='Time', y='Amount', data=df,
           fit_reg=False, # No regression line
           hue='Class')   # Color by class

Frauds are spread evenly over time.

We can also visualize this bivariate distribution using a scatterplot with histogram.

In [21]:
sns.jointplot(x='Time', y='Amount', data=df[df['Class']==0], color='b')
sns.jointplot(x='Time', y='Amount', data=df[df['Class']==1], color='r')

## 4. Classification
---

### 4.1. Prepare train data and test data (with stratified sampling)
---

When dealing with imbalanced cases, it is often advantageous to sample each subpopulation (stratum) independently.

In [22]:
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=5)
X = df.drop(['Class', 'Time'], axis=1)
X = StandardScaler().fit_transform(X.values)
y = df['Class'].values
for train_index, test_index in sss.split(X, y):
    X_train_ = X[train_index, :]
    y_train_ = y[train_index]
    X_test = X[test_index, :]
    y_test = y[test_index]

Verify if the dataset is well stratified:

In [23]:
y_train_pos = y_train_[y_train_ == 1]
y_test_pos = y_test[y_test == 1]
print('# positive in train data: {}, {}%'.format(y_train_pos.shape[0], y_train_pos.shape[0]*100. / y_train_.shape[0]))
print('# positive in test data: {}, {}%'.format(y_test_pos.shape[0], y_test_pos.shape[0]*100. / y_test.shape[0]))
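
One design point worth noting: the cell above fits `StandardScaler` on all rows before splitting, so the test rows contribute to the scaling statistics. A possible alternative, sketched here on synthetic data with hypothetical variable names, is to split first and fit the scaler on the training rows only. `train_test_split` with `stratify` is used as a shorthand for a single stratified shuffle split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X_toy = rng.normal(loc=3.0, scale=2.0, size=(1000, 3))  # synthetic features
y_toy = np.zeros(1000, dtype=int)
y_toy[:20] = 1                                          # 2% positives

# Stratified 80/20 split: class proportions are preserved in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=5)

scaler = StandardScaler().fit(X_tr)   # statistics come from training rows only
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)       # test rows reuse the training statistics
print(y_tr.mean(), y_te.mean())
```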

### 4.2. Build a helper function for cross validation
---

In order to facilitate model selection and further processing, we create a helper function that conducts cross validation and tunes the regularization parameter C. We select the optimal C based on the recall metric. In addition, we allow the option of using [SMOTE](http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.over_sampling.SMOTE.html), an imbalanced learning tool, to boost performance on imbalanced datasets. The code uses `SMOTEENN`, which combines SMOTE oversampling with Edited Nearest Neighbours cleaning.

In [24]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold

def kfold_cv(Model, X, y, n_splits=10, smote=False, verbose=False):
    """
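    Args:
        Model: classifier class accepting class_weight and C; instances need fit, predict, predict_proba
        X: array of features
        y: array of labels
        n_splits: number of splits
        smote: if True, resample each training fold with SMOTEENN
        verbose: if True, print the AUC score of each fold
    Returns:
        the best C, with the mean AUC and mean recall obtained for it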
    """
    skf = StratifiedKFold(n_splits, random_state=5, shuffle=True)
    C = np.logspace(-3, 3, num=7, base=10)
    def sub_cv(model):
        kfold = skf.split(X, y)
        scores = 0
        recall = 0
        if smote:
            sme = SMOTEENN(random_state=5)
        i = 0
        for train_index, test_index in kfold:
            X_train_ = X[train_index, :]
            y_train_ = y[train_index]
            X_test = X[test_index, :]
            y_test = y[test_index]
            if smote:
                X_train, y_train = sme.fit_resample(X_train_, y_train_)  # fit_sample in older imblearn releases
            else:
                X_train = X_train_
                y_train = y_train_
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            y_score = model.predict_proba(X_test)[:, 1]
            score = roc_auc_score(y_test, y_score, average='micro')
            if verbose:
                print('Trained {} th model, AUC score: {}'.format(i+1, score))
            scores += score
            recall += recall_score(y_test, y_pred)
            i += 1
        return scores / i, recall / i
    bestC = 0
    bestauc = 0
    bestrecall = 0
    for c in C:
        model = Model(class_weight='balanced', C=c)
        auc, recall = sub_cv(model)
        if recall > bestrecall:
            bestauc = auc
            bestC = c
            bestrecall = recall
        print('C: {}, AUC: {}, recall: {}, best C: {}'.format(c, auc, recall, bestC))
    return bestC, bestauc, bestrecall
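
For the case without resampling, a similar search over C can be sketched with scikit-learn's `GridSearchCV`. This is an alternative to the hand-written loop rather than a drop-in replacement: it runs on synthetic data here, and it does not apply SMOTEENN inside the folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for the training set
X_toy, y_toy = make_classification(n_samples=2000, n_features=10,
                                   weights=[0.95, 0.05], random_state=5)

search = GridSearchCV(
    LogisticRegression(class_weight='balanced', max_iter=1000),
    param_grid={'C': np.logspace(-3, 3, num=7)},
    scoring={'recall': 'recall', 'auc': 'roc_auc'},
    refit='recall',   # choose C by mean cross-validated recall
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=5))
search.fit(X_toy, y_toy)

best_idx = search.best_index_
print(search.best_params_,
      search.cv_results_['mean_test_recall'][best_idx],
      search.cv_results_['mean_test_auc'][best_idx])
```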

### 4.3. Baseline: logistic regression with balanced class weight
---

In [25]:
Model = LogisticRegression
bestC, bestauc, bestrecall = kfold_cv(Model, X_train_, y_train_, n_splits=5, verbose=False)
print('Best C: {}'.format(bestC))

In [26]:
baseline = Model(class_weight='balanced', C=bestC)
baseline.fit(X_train_, y_train_)
y_pred = baseline.predict(X_test)
y_score = baseline.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_score, average='micro')
recall = recall_score(y_test, y_pred)
print('AUC: {}, recall: {}'.format(auc, recall))
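
Recall is the share of actual frauds that the model flags, that is TP / (TP + FN). A small sketch on made-up labels shows how it relates to the confusion matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels for illustration: 1 = fraud, 0 = normal
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])

# For binary labels, ravel() yields TN, FP, FN, TP in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
recall_manual = tp / (tp + fn)      # share of frauds that were caught
print(tn, fp, fn, tp, recall_manual, recall_score(y_true, y_hat))
```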

### 4.4 Compare different strategies to balance data
---
1. Use class_weight in logistic regression (already done)
2. Naively undersample majority class
3. Use imbalanced learning: SMOTE

#### 4.4.1. Naively undersample majority class
Undersample the data so that each class has a comparable number of data points. First, create a helper function for conducting undersampling.

In [27]:
def undersample(X_train_, y_train_, n_major):
    """
    In y_train_, positive class is far fewer than negative one.
    """
    X_train_pos = X_train_[y_train_ == 1]
    X_train_neg = X_train_[y_train_ == 0]
    y_train_pos = y_train_[y_train_ == 1]
    y_train_neg = y_train_[y_train_ == 0]
    undersample_y_train_neg_index = np.random.choice(y_train_neg.shape[0], n_major, replace=False)
    undersample_y_train_ = np.concatenate((y_train_pos, y_train_neg[undersample_y_train_neg_index]), axis=0)
    undersample_X_train_ = np.concatenate((X_train_pos, X_train_neg[undersample_y_train_neg_index]), axis=0)
    indices = np.arange(undersample_X_train_.shape[0])
    np.random.shuffle(indices)
    return undersample_X_train_[indices, :], undersample_y_train_[indices]

We evaluate the effect of undersampling by varying the number of data points belonging to normal transactions.

In [28]:
n_majority = np.arange(1, 50, 10) * y_train_pos.shape[0]
res = pd.DataFrame(data=np.zeros((len(n_majority), 3)), columns=['best_c', 'auc', 'recall'])
for i, n_major in enumerate(n_majority):
    undersample_X_train_, undersample_y_train_ = undersample(X_train_, y_train_, n_major)
    bestC, auc, recall = kfold_cv(Model, undersample_X_train_, undersample_y_train_, n_splits=5)
    res.loc[i, 'best_c'] = bestC
    res.loc[i, 'auc'] = auc
    res.loc[i, 'recall'] = recall
    print('undersample number: {}, best C: {}'.format(n_major, bestC))

In [29]:
res['maj_class_num'] = n_majority
res

In [30]:
fig = plt.figure(figsize=(16, 9))
plt.plot(res['maj_class_num'], res['auc'], 'b', label='AUC')
plt.plot(res['maj_class_num'], res['recall'], 'r', label='Recall')
plt.xlabel('maj_class_num')
plt.legend()
plt.grid()
plt.title('AUC and recall of logistic regression with balanced class weight trained on undersampled majority class of different numbers')
plt.show()

Note that the more imbalanced the data, the lower the recall tends to be.

#### 4.4.2. Use imbalanced learning: SMOTE

Here, we combine undersampling with SMOTE.
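
For intuition, the core SMOTE step creates a synthetic minority point by interpolating between a minority sample and one of its k nearest minority neighbours. The sketch below illustrates that single step with NumPy and scikit-learn on made-up data. The function name `smote_like_samples` is hypothetical, and this is not the imblearn implementation. `SMOTEENN` additionally cleans the result with Edited Nearest Neighbours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_min, n_new, k=5, seed=5):
    """Illustrative SMOTE-style interpolation between minority points."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because each point is returned as its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neigh_idx = nn.kneighbors(X_min)
    base = rng.integers(0, X_min.shape[0], size=n_new)   # random base points
    pick = rng.integers(1, k + 1, size=n_new)            # skip column 0 (the point itself)
    neighbours = X_min[neigh_idx[base, pick]]
    gap = rng.random((n_new, 1))                         # interpolation factor in [0, 1)
    return X_min[base] + gap * (neighbours - X_min[base])

# Made-up minority samples in two dimensions
X_min = np.random.default_rng(0).normal(size=(20, 2))
X_new = smote_like_samples(X_min, n_new=30)
print(X_new.shape)
```

Each synthetic point lies on the segment between two existing minority points, so the new samples stay inside the region already covered by the minority class.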

In [None]:
# Applying SMOTE on the entire training data set is really time-consuming. 
# Uncomment the following lines to evaluate effect of SMOTE on entire training data.
'''
model = LogisticRegression
# kfold_cv returns three values: best C, mean AUC, mean recall
bestC, auc, recall = kfold_cv(model, X_train_, y_train_, n_splits=10, smote=True, verbose=True)
print('Best C: {}, AUC score: {}, recall: {}'.format(bestC, auc, recall))
'''

In [None]:
n_majority = np.arange(1, 50, 10) * y_train_pos.shape[0]
res = pd.DataFrame(data=np.zeros((len(n_majority), 3)), columns=['best_c', 'auc', 'recall'])
for i, n_major in enumerate(n_majority):
    undersample_X_train_, undersample_y_train_ = undersample(X_train_, y_train_, n_major)
    bestC, auc, recall = kfold_cv(Model, undersample_X_train_, undersample_y_train_, n_splits=5, smote=True, verbose=False)
    res.loc[i, 'best_c'] = bestC
    res.loc[i, 'auc'] = auc
    res.loc[i, 'recall'] = recall
    print('undersample number: {}, best C: {}'.format(n_major, bestC))
res['maj_class_num'] = n_majority
print(res)
fig = plt.figure(figsize=(16, 9))
plt.plot(res['maj_class_num'], res['auc'], 'b', label='AUC')
plt.plot(res['maj_class_num'], res['recall'], 'r', label='Recall')
plt.xlabel('maj_class_num')
plt.legend()
plt.grid()
plt.title('AUC and recall of logistic regression with balanced class weight trained on undersampled majority class of different numbers and smoted data')
plt.show()

There seems to be no obvious difference in performance when applying SMOTE.

#### 4.4.3. Final evaluation on test data
Train model on undersampled data with optimal parameters. Evaluate model on test data set.

##### 4.4.3.1. Logistic regression with balanced class weight, undersampled training data

In [None]:
bestmodel = LogisticRegression(class_weight='balanced', C=0.001)
undersample_X_train_, undersample_y_train_ = undersample(X_train_, y_train_, y_train_pos.shape[0])
bestmodel.fit(undersample_X_train_, undersample_y_train_)
y_pred = bestmodel.predict(X_test)
y_score = bestmodel.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_score, average='micro')
recall = recall_score(y_test, y_pred)
print('AUC: {}, recall: {}'.format(auc, recall))

##### 4.4.3.2. Logistic regression with balanced class weight, undersampled training data, and SMOTE processing

In [None]:
bestmodel = LogisticRegression(class_weight='balanced', C=0.001)
undersample_X_train_, undersample_y_train_ = undersample(X_train_, y_train_, y_train_pos.shape[0])
sme = SMOTEENN(random_state=5)
X_res, y_res = sme.fit_resample(undersample_X_train_, undersample_y_train_)  # fit_sample in older imblearn releases
bestmodel.fit(X_res, y_res)
y_pred = bestmodel.predict(X_test)
y_score = bestmodel.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_score, average='micro')
recall = recall_score(y_test, y_pred)
print('AUC: {}, recall: {}'.format(auc, recall))

## 5. Summary

1. We explored the data through statistical analysis and visualization.
2. Compared to naively using the entire imbalanced training data, undersampling noticeably improves recall, which increases from 0.9082 to 0.9592, while AUC changes only slightly, from 0.9771 to 0.9798.
3. In contrast to undersampling, SMOTE makes no obvious difference in this case.