Undersampling in Credit Card Fraud Detection

The input data set is highly skewed towards non-fraudulent transactions, which makes classification tricky. In this kernel we explore how undersampling can help to learn a better classifier. We use recall as the evaluation metric because, on data this imbalanced, it is much more useful than the accuracy score.
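
To see why accuracy can mislead on skewed data, here is a small sketch on made-up labels (not the real data set). The `accuracy` and `recall` helpers are illustrative stand-ins for `accuracy_score` and `recall_score`, and the 990/10 class split is an assumption chosen only to make the arithmetic obvious.

```python
# Toy labels: 1 = fraud, 0 = non-fraud (made-up numbers, not the real data set).
y_true = [0] * 990 + [1] * 10

# A "classifier" that always predicts the majority class.
y_pred = [0] * len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos

print(accuracy(y_true, y_pred))  # 0.99
print(recall(y_true, y_pred))    # 0.0
```

The model that never flags a fraud still scores 99% accuracy, while its recall on the fraud class is zero.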


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [None]:
## Read the data
df = pd.read_csv("../input/creditcard.csv")
df.head()

In [None]:
## Plot the distribution of data
%matplotlib inline
sns.countplot(x='Class', data=df)

From the above graph you can observe that the data is heavily skewed towards class 0, which indicates the non-fraudulent transactions.

In [None]:
from sklearn.preprocessing import StandardScaler
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

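# Standardise the transaction amount and drop the raw Amount and Time columns
df['normal_amount'] = StandardScaler().fit_transform(df['Amount'].values.reshape(-1,1))
df = df.drop(['Amount','Time'], axis=1)
# Features are every column except the label; the label is the Class column
X = df.loc[:,df.columns != 'Class']
y = df.loc[:,df.columns == 'Class']
# Hold out 30% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)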
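
The cell above fits the scaler on the whole Amount column before splitting, so test-set statistics feed into the training features. A minimal sketch of one alternative, assuming synthetic amounts and a plain shuffled split instead of `train_test_split`: estimate the mean and standard deviation on the training part only and reuse them for the test part. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the Amount column (values are made up).
amount = rng.exponential(scale=80.0, size=1000)

# 70/30 split on shuffled positions, mirroring test_size=0.3 above.
order = rng.permutation(len(amount))
n_test = 300
test_idx, train_idx = order[:n_test], order[n_test:]

# Estimate mean and standard deviation from the training part only.
mean = amount[train_idx].mean()
std = amount[train_idx].std()

# Apply the same training statistics to both parts.
train_scaled = (amount[train_idx] - mean) / std
test_scaled = (amount[test_idx] - mean) / std

print(train_scaled.mean(), train_scaled.std())
```

With scikit-learn the same idea would be to call `fit` on the training rows and `transform` on both parts.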

The code below trains a logistic regression model on the original data. As you can observe from the output,
recall for the fraud class is pretty poor, even though accuracy is very high.

In [None]:
# Calculate the recall score for logistic Regression on Skewed data
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score,accuracy_score
lr = LogisticRegression()
lr.fit(X_train,y_train)
y_pred = lr.predict(X_test)
print(recall_score(y_test,y_pred,average=None))
print(accuracy_score(y_test,y_pred))


To improve the recall, let's implement undersampling. The code below randomly drops non-fraudulent
transactions until their number equals the number of fraudulent ones.

In [None]:
# Undersample the data
no_frauds = len(df[df['Class'] == 1])
non_fraud_indices = df[df.Class == 0].index
random_indices = np.random.choice(non_fraud_indices,no_frauds, replace=False)
fraud_indices = df[df.Class == 1].index
under_sample_indices = np.concatenate([fraud_indices,random_indices])
under_sample = df.loc[under_sample_indices]
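
The undersampling step above draws a different random sample on every run. A minimal sketch of the same idea wrapped in a helper with a fixed seed, so the result is reproducible; `undersample_indices` and the toy labels are hypothetical names introduced here, not part of the original kernel.

```python
import numpy as np

def undersample_indices(labels, seed=0):
    """Return positions of all minority-class (1) rows plus an equally
    sized random sample of majority-class (0) rows."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    fraud_pos = np.flatnonzero(labels == 1)
    non_fraud_pos = np.flatnonzero(labels == 0)
    # Sample without replacement so no majority row is picked twice.
    sampled = rng.choice(non_fraud_pos, size=len(fraud_pos), replace=False)
    return np.concatenate([fraud_pos, sampled])

# Toy labels, not the real data: 95 non-fraud rows and 5 fraud rows.
toy_labels = np.array([0] * 95 + [1] * 5)
picked = undersample_indices(toy_labels, seed=42)
print(np.bincount(toy_labels[picked]))  # [5 5]
```

The helper returns row positions, so with a DataFrame one would select rows by position, for example `df.iloc[picked]`.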

In [None]:
## Plot the distribution of data for undersampling
%matplotlib inline
sns.countplot(x='Class', data=under_sample)

In [None]:
X_under = under_sample.loc[:,under_sample.columns != 'Class']
y_under = under_sample.loc[:,under_sample.columns == 'Class']
X_under_train, X_under_test, y_under_train, y_under_test = train_test_split(X_under,y_under,test_size = 0.3, random_state = 0)

The code below trains logistic regression on the undersampled data. From the result, you can observe that the recall is much better.

In [None]:
lr_under = LogisticRegression()
lr_under.fit(X_under_train,y_under_train)
y_under_pred = lr_under.predict(X_under_test)
print(recall_score(y_under_test,y_under_pred))
print(accuracy_score(y_under_test,y_under_pred))

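The model trained on the undersampled data also generalises reasonably well to the test split of the full data, although some undersampled rows may also appear in X_test.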

In [None]:
## Recall for the full data
y_pred_full = lr_under.predict(X_test)
print(recall_score(y_test,y_pred_full))
print(accuracy_score(y_test,y_pred_full))
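
Because under_sample was drawn from the whole of df, rows used for training can also sit in X_test, which can flatter the scores. A rough sketch, on made-up labels and a simplified split, of how one might count that overlap and of a variant that undersamples only the training rows. All names (`labels`, `train_idx`, `under_train`, ...) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 1000 row positions, the first 20 are "fraud" (made-up data).
n_rows = 1000
labels = np.zeros(n_rows, dtype=int)
labels[:20] = 1

# Hold out 30% of the rows as a test set.
order = rng.permutation(n_rows)
test_idx, train_idx = order[:300], order[300:]

# Undersample from ALL rows (as in the cells above) ...
fraud_all = np.flatnonzero(labels == 1)
non_fraud_all = np.flatnonzero(labels == 0)
under_all = np.concatenate(
    [fraud_all, rng.choice(non_fraud_all, size=len(fraud_all), replace=False)]
)
# ... and count how many of those rows also sit in the held-out test set.
overlap = np.intersect1d(under_all, test_idx)
print("rows shared with the test set:", len(overlap))

# Alternative: undersample from the training rows only, so nothing is shared.
fraud_train = train_idx[labels[train_idx] == 1]
non_fraud_train = train_idx[labels[train_idx] == 0]
under_train = np.concatenate(
    [fraud_train, rng.choice(non_fraud_train, size=len(fraud_train), replace=False)]
)
print("rows shared with the test set:", len(np.intersect1d(under_train, test_idx)))
```

The second count is zero by construction, since every row in `under_train` comes from `train_idx`.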

Rather than sampling explicitly, we can use the class_weight parameter to achieve a similar effect.

In [None]:
lr_balanced = LogisticRegression(class_weight = 'balanced')
lr_balanced.fit(X_train,y_train)
y_balanced_pred = lr_balanced.predict(X_test)
print(recall_score(y_test,y_balanced_pred))
print(accuracy_score(y_test,y_balanced_pred))
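
Unlike undersampling, `class_weight='balanced'` keeps every row and reweights the classes instead. The scikit-learn documentation gives the weights as `n_samples / (n_classes * np.bincount(y))`. A small sketch of that formula on made-up labels (the 990/10 split is an assumption for illustration):

```python
import numpy as np

# Toy labels (made up): 990 non-fraud rows and 10 fraud rows.
y = np.array([0] * 990 + [1] * 10)

n_samples = len(y)
n_classes = len(np.unique(y))
counts = np.bincount(y)

# Heuristic documented for class_weight='balanced'.
weights = n_samples / (n_classes * counts)
print(weights)

# Weight times count is the same for both classes, so each class
# contributes equally to the loss in total.
print(weights * counts)
```

Here a fraud row weighs about 99 times as much as a non-fraud row, which is the inverse of the class ratio.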

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix_value = confusion_matrix(y_test,y_balanced_pred)

In [None]:
sns.set(font_scale=1.4)
confusion_matrix_value
#sns.heatmap(confusion_matrix_value, annot=True)
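
The confusion matrix carries everything needed to recompute the metrics used in this kernel. A short sketch with a hypothetical 2x2 matrix (the counts are made up, not the output of the cell above), using the layout of `sklearn.metrics.confusion_matrix`: rows are true classes, columns are predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix (made-up counts):
# rows = true class (0, 1), columns = predicted class (0, 1).
cm = np.array([[950, 40],
               [  2,  8]])

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)          # share of frauds that were caught
precision = tp / (tp + fp)       # share of fraud alerts that were real
accuracy = (tp + tn) / cm.sum()  # share of all rows classified correctly

print(f"recall={recall:.3f} precision={precision:.3f} accuracy={accuracy:.3f}")
# recall=0.800 precision=0.167 accuracy=0.958
```

In this made-up case recall is high but precision is low, which is the usual trade-off when the fraud class is weighted up or the majority class is undersampled.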