# Credit Card Fraud Detection

This is a binary classification problem with a huge imbalance in class sizes, which is common in fraud detection.

We will handle this unbalanced data using various oversampling and undersampling techniques.

1. Undersampling the majority class
2. Oversampling the minority class 
3. Using K-means clustering to undersample
4. Using SMOTE (Synthetic Minority Over-Sampling Technique) to oversample

Credit to: https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html for the ideas.

We will shy away from tedious repetitions of hyperparameter tuning using cross-validation and focus on the effects of these sampling techniques.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.






In [None]:
data = pd.read_csv("../input/creditcard.csv")

data.sample(5)

In [None]:
data.info()

We have 28 features + time stamp + amount and a class label.
Let's scale and split the data into training and test sets.

In [None]:
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
import numpy as np 
%matplotlib inline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [None]:
# Pass the column as a keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x=data.Class)
d = data.Class.value_counts()
print(d)
print('Fraud cases are only {:f}% of all cases.'.format(d[1]*100/len(data)))

In [None]:
sns.kdeplot(data = data[data.Class == 1].Amount, label = 'Fraud', bw=50)
sns.kdeplot(data = data[data.Class == 0].Amount, label = 'Normal', bw=50)
plt.legend();


We see that a large number of fraud transactions are for small amounts.

For simplicity, we'll ignore timestamps.

In [None]:
data.drop("Time", axis = 1, inplace = True)
data.columns

In [None]:
X = data.drop("Class", axis = 1)
y = data.Class

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=333, stratify = y)



In [None]:
print(len(y_train[y_train == 1])/len(y_train))
print(len(y_test[y_test == 1])/len(y_test))

In [None]:
X_train.Amount.shape

In [None]:
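scaler = StandardScaler()
# Work on copies so the assignments below do not touch slices of `data`
X_train, X_test = X_train.copy(), X_test.copy()
# Fit on the training amounts only, then apply the same scaling to both sets.
# Series.reshape no longer exists in pandas, so go through the NumPy array.
X_train["Amount"] = scaler.fit_transform(X_train["Amount"].values.reshape(-1, 1)).ravel()
X_test["Amount"] = scaler.transform(X_test["Amount"].values.reshape(-1, 1)).ravel()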

In [None]:
X_train.Amount.describe(), X_test.Amount.describe()

In [None]:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# 0. No under or oversampling
First, we need a baseline metric with a simple logistic regression trained on the original unbalanced data.

Since we want to avoid false negatives (frauds that go undetected), recall is the metric we should look at.
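
As a rough illustration of what recall measures, the sketch below counts true positives, false negatives, and false positives by hand. The labels are hypothetical toy values, not taken from the dataset, and the variable names are made up for this example.

```python
import numpy as np

# Hypothetical toy labels (1 = fraud, 0 = normal); not taken from the dataset.
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

tp = int(np.sum((y_true_toy == 1) & (y_pred_toy == 1)))  # frauds caught
fn = int(np.sum((y_true_toy == 1) & (y_pred_toy == 0)))  # frauds missed
fp = int(np.sum((y_true_toy == 0) & (y_pred_toy == 1)))  # false alarms

recall_toy = tp / (tp + fn)        # share of actual frauds that were detected
precision_toy = tp / (tp + fp)     # share of fraud predictions that were correct
print(f"recall={recall_toy:.2f}, precision={precision_toy:.2f}")
```

Recall only penalizes missed frauds, which is why it is the primary metric here; precision is reported alongside it to show how many false alarms that recall costs.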

In [None]:

clf = LogisticRegression()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [None]:


cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, range(2), range(2))
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=True)

print(classification_report(y_test,y_pred))

In [None]:

clf = RandomForestClassifier(n_estimators = 10)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [None]:

cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, range(2), range(2))
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=True)

print(classification_report(y_test,y_pred))

Without undersampling or oversampling, logistic regression achieves a mere 65% recall on the test set.
As expected, the random forest classifier does better than logistic regression, even without hyperparameter tuning, thanks to its ensemble nature.

# 1. Undersampling

Here we will undersample the majority class (normal) without replacement to match the number of the minority class (fraud).




In [None]:
from sklearn.utils import resample

X_train_normal = X_train[y_train == 0]
y_train_normal = y_train[y_train == 0]
X_train_fraud = X_train[y_train == 1]
y_train_fraud = y_train[y_train == 1]

X_train_normal, y_train_normal = resample(X_train_normal, y_train_normal, n_samples = len(y_train_fraud), replace = False, random_state = 333)

X_train_undersample = pd.concat([X_train_normal, X_train_fraud], ignore_index=True)
y_train_undersample = pd.concat([y_train_normal, y_train_fraud], ignore_index=True)


In [None]:
print(type(y_train_undersample))
print(y_train_undersample.value_counts())
print(len(X_train_undersample))

In [None]:

clf = LogisticRegression()
clf.fit(X_train_undersample, y_train_undersample)
y_pred = clf.predict(X_test)

In [None]:

cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, range(2), range(2))
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=True)

print(classification_report(y_test,y_pred))

In [None]:

clf = RandomForestClassifier(n_estimators = 100)
clf.fit(X_train_undersample, y_train_undersample)
y_pred = clf.predict(X_test)



In [None]:

cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, range(2), range(2))
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=True)

print(classification_report(y_test,y_pred))

Even with simple undersampling, recall improved greatly.
However, precision dropped significantly. This means a lot of normal cases are predicted as frauds.
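
The drop in precision is largely a consequence of the imbalance in the test set: even a small false positive rate on the normal class produces many more false alarms than there are frauds. The arithmetic below is a sketch with hypothetical counts and rates chosen only for illustration; they are not results from this notebook.

```python
# Hypothetical test-set composition and error rates, chosen only for illustration.
n_normal = 70_000            # assumed number of normal transactions
n_fraud = 120                # assumed number of fraud transactions
recall = 0.90                # assumed share of frauds that get flagged
false_positive_rate = 0.03   # assumed share of normal cases flagged as fraud

true_positives = recall * n_fraud
false_positives = false_positive_rate * n_normal

# Precision = correct fraud alerts / all fraud alerts
precision = true_positives / (true_positives + false_positives)
print(f"precision = {precision:.3f}")
```

Under these assumed numbers, most fraud alerts are false alarms even though only 3% of normal transactions are misclassified.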



# 2. Oversampling

Here we will oversample the minority class (fraud) with replacement to match the number of the majority class (normal).

In [None]:
X_train_normal = X_train[y_train == 0]
y_train_normal = y_train[y_train == 0]
X_train_fraud = X_train[y_train == 1]
y_train_fraud = y_train[y_train == 1]

X_train_fraud, y_train_fraud = resample(X_train_fraud, y_train_fraud, n_samples = len(y_train_normal), replace = True, random_state = 333)

X_train_oversample = pd.concat([X_train_normal, X_train_fraud], ignore_index=True)
y_train_oversample = pd.concat([y_train_normal, y_train_fraud], ignore_index=True)


In [None]:
print(y_train_oversample.value_counts())
print(len(X_train_oversample))

In [None]:
clf = LogisticRegression()
clf.fit(X_train_oversample, y_train_oversample)
y_pred = clf.predict(X_test)

In [None]:
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, range(2), range(2))
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=True)

print(classification_report(y_test,y_pred))

In [None]:

clf = RandomForestClassifier(n_estimators = 100)
clf.fit(X_train_oversample, y_train_oversample)
y_pred = clf.predict(X_test)

In [None]:
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, range(2), range(2))
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=True)

print(classification_report(y_test,y_pred))

Oversampling results in recall similar to undersampling in this case.

# 3. Using K-means clustering to undersample

We will use K-means clustering to reduce the number of majority class instances to match the number of minority class instances.
The process can be seen as vector quantizing the normal cases.
The number of clusters should be equal to the number of minority class instances (frauds).


In [None]:
from sklearn.cluster import KMeans

X_train_normal = X_train[y_train == 0]
y_train_normal = y_train[y_train == 0]
X_train_fraud = X_train[y_train == 1]
y_train_fraud = y_train[y_train == 1]

len(y_train_normal), len(y_train_fraud)

In [None]:
n_clusters = len(y_train_fraud)
kmeans = KMeans(n_clusters = n_clusters, random_state = 333).fit(X_train_normal)
X_train_normal = kmeans.cluster_centers_

X_train_normal.shape

In [None]:
# Wrap the centroids in a DataFrame so they can be concatenated with the fraud rows
X_train_normal = pd.DataFrame(kmeans.cluster_centers_, columns = X_train.columns)
X_train_normal.sample(5)

In [None]:
# Every label in y_train_normal is 0 (normal), so keep one label per centroid
y_train_normal = y_train_normal[:n_clusters]
y_train_normal.shape

In [None]:
X_train_kmeans = pd.concat([X_train_normal, X_train_fraud], ignore_index=True)
y_train_kmeans = pd.concat([y_train_normal, y_train_fraud], ignore_index=True)

print(y_train_kmeans.value_counts())
print(len(X_train_kmeans))

X_train_kmeans.isnull().values.any(), y_train_kmeans.isnull().values.any()

In [None]:
clf = LogisticRegression()
clf.fit(X_train_kmeans, y_train_kmeans)
y_pred = clf.predict(X_test)

In [None]:
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, range(2), range(2))
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=True)

print(classification_report(y_test,y_pred))

In [None]:
clf = RandomForestClassifier(n_estimators = 100)
clf.fit(X_train_kmeans, y_train_kmeans)
y_pred = clf.predict(X_test)

In [None]:
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, range(2), range(2))
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=True)

print(classification_report(y_test,y_pred))

For logistic regression, recall stayed the same.
For the random forest classifier, recall improved significantly, to 92 percent, at the cost of precision.

# 4. Using SMOTE (Synthetic Minority Over-Sampling Technique) to oversample

To oversample, SMOTE picks a minority sample and one of its k nearest minority-class neighbors, then synthesizes a new sample at a random point on the line segment between them. It is one of the most popular ways to deal with imbalanced data.

Source: https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume16/chawla02a-html/chawla2002.html
http://contrib.scikit-learn.org/imbalanced-learn/stable/generated/imblearn.over_sampling.SMOTE.html
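
The core interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration, not the imbalanced-learn implementation: the function name `smote_sketch`, its parameters, and the toy data are all hypothetical, and it assumes at least two minority samples and a dataset small enough for a full pairwise distance matrix.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)

    # Pairwise Euclidean distances between minority samples (fine for small n)
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :min(k, n - 1)]

    base = rng.integers(0, n, size=n_new)                   # random minority samples
    pick = rng.integers(0, neighbors.shape[1], size=n_new)  # random neighbor slot
    nb = neighbors[base, pick]
    gap = rng.random((n_new, 1))                            # position on the segment, in [0, 1)

    # New sample = base point moved a random fraction of the way toward its neighbor
    return X_minority[base] + gap * (X_minority[nb] - X_minority[base])

# Hypothetical toy minority class: the four corners of the unit square
X_min_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_sketch(X_min_toy, n_new=6, k=2)
print(X_new.shape)
```

Because each synthetic row lies between two existing minority samples, the new points stay inside the region already occupied by the minority class rather than being exact duplicates, as they are in plain oversampling with replacement.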

In [None]:
from imblearn.over_sampling import SMOTE
os = SMOTE(random_state=0)

In [None]:
# fit_resample replaces fit_sample, which newer imbalanced-learn releases no longer provide
X_train_smote, y_train_smote = os.fit_resample(X_train, y_train)
type(X_train_smote), type(y_train_smote)

X_train_smote = pd.DataFrame(X_train_smote, columns= X_train.columns)
y_train_smote= pd.Series(y_train_smote)

In [None]:
#check the number of each class 
y_train_smote.value_counts()

In [None]:
clf = LogisticRegression()
clf.fit(X_train_smote, y_train_smote)
y_pred = clf.predict(X_test)

In [None]:
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, range(2), range(2))
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=True)

print(classification_report(y_test,y_pred))

In [None]:

clf = RandomForestClassifier(n_estimators = 100)
clf.fit(X_train_smote, y_train_smote)
y_pred = clf.predict(X_test)

In [None]:
cm = confusion_matrix(y_test, y_pred)
df_cm = pd.DataFrame(cm, range(2), range(2))
plt.figure(figsize = (10,7))
sns.heatmap(df_cm, annot=True)

print(classification_report(y_test,y_pred))

Recall suffered but precision greatly improved.
Using SMOTE with the random forest classifier yields the best F1 score (the harmonic mean of precision and recall).
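
For reference, the F1 score is the harmonic mean of precision and recall, so it stays low whenever either of the two is low. The helper `f1_from` and the precision/recall pairs below are hypothetical and only illustrate the formula.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical precision/recall pairs, for illustration only
balanced = f1_from(0.9, 0.8)     # both metrics high
lopsided = f1_from(0.05, 0.9)    # high recall, very low precision
print(f"balanced: {balanced:.3f}, lopsided: {lopsided:.3f}")
```

The second pair shows why a method with excellent recall but very low precision still ends up with a poor F1 score.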

# Summary

With some threshold tweaking and hyperparameter tuning, SMOTE looks the most promising, since it achieved both high recall and high precision.

Undersampling by using K-means centroids yields the best recall at the cost of very low precision.
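
The threshold tweaking mentioned above could look roughly like the sketch below. The helper `predict_with_threshold` and the probabilities are hypothetical; with a fitted scikit-learn classifier, the fraud probabilities would typically come from `clf.predict_proba(X_test)[:, 1]`.

```python
import numpy as np

def predict_with_threshold(fraud_probs, threshold=0.5):
    """Label a case as fraud (1) when its fraud probability reaches the threshold."""
    return (np.asarray(fraud_probs) >= threshold).astype(int)

# Hypothetical probabilities; in the notebook they could come from
# clf.predict_proba(X_test)[:, 1] for a fitted classifier.
probs_toy = np.array([0.05, 0.20, 0.35, 0.60, 0.90])

default_labels = predict_with_threshold(probs_toy, 0.5)
lenient_labels = predict_with_threshold(probs_toy, 0.3)
print(default_labels, lenient_labels)
```

Lowering the threshold flags more cases as fraud, which tends to raise recall and lower precision; raising it has the opposite effect.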