In this notebook I'm going to explore class imbalance and how it affects a classification algorithm. Then I show two strategies for dealing with the imbalance and how to tune them.

Mainly, this was an interesting way for me to learn how this works :-)

We start by loading and scaling the data, and creating the training and test sets.

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split

data = pd.read_csv('../input/creditcard.csv')

# Separate data into X/y
y = data['Class'].values
X = data.drop(['Class', 'Time'], axis=1).values

num_neg = (y==0).sum()
num_pos = (y==1).sum()

# Scaling..
scaler = RobustScaler()
X = scaler.fit_transform(X)

# Split into train/test set (stratified, so both sets keep the same fraud share)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

First, let's summarize the data a bit. Most important is the class distribution as we'll see further down.

There are very few fraud cases (class=1) vs. very many non-fraud cases (class=0).

In [None]:
import seaborn as sns

print(data.groupby('Class').size())

sns.countplot(x="Class", data=data)
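
To see why this distribution matters so much, here is a minimal sketch with toy labels (the 995/5 split is an assumption for illustration, not the real dataset). It shows that a "classifier" which always answers non-fraud already gets a very high accuracy, which is why plain accuracy tells us little here.

```python
from collections import Counter

# Toy labels (an assumption for illustration, not the real dataset):
# 995 non-fraud cases and 5 fraud cases.
y_toy = [0] * 995 + [1] * 5

counts = Counter(y_toy)
fraud_share = counts[1] / len(y_toy)

# A "classifier" that always answers non-fraud is right on every
# negative and wrong on every positive.
baseline_accuracy = counts[0] / len(y_toy)

print("class counts      :", dict(counts))
print("fraud share       : %.3f" % fraud_share)
print("always-0 accuracy : %.3f" % baseline_accuracy)
```

So any model we train has to be judged on how it handles the rare class, not on overall accuracy.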

First attempt at predicting credit card fraud: use a plain logistic regression.

This fails pretty badly: only part of the fraud is detected and there are many false fraud reports.

Clearly the simple logistic regression is not good enough...

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from mlxtend.plotting import plot_confusion_matrix
from matplotlib import pyplot as plt

lr = LogisticRegression()

# Fit..
lr.fit(X_train, y_train)

# Predict..
y_pred = lr.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))
plot_confusion_matrix(confusion_matrix(y_test, y_pred))
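
The two complaints above map onto two numbers in the classification report: recall (how much of the fraud is detected) and precision (how many fraud reports are real). Here is a small sketch of how they follow from the confusion matrix. The counts are hypothetical, made up only to show the arithmetic.

```python
def precision_recall(tn, fp, fn, tp):
    """Precision and recall for the positive (fraud) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts, laid out like sklearn's confusion_matrix
# for labels [0, 1]: [[tn, fp], [fn, tp]].
tn, fp, fn, tp = 9950, 10, 16, 24

precision, recall = precision_recall(tn, fp, fn, tp)
print("precision: %.3f (share of fraud reports that are real)" % precision)
print("recall   : %.3f (share of real fraud that is detected)" % recall)
```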

One possibility is to tell the logistic regression about the class imbalance, so that errors are weighted in inverse proportion to the class frequencies. The documentation suggests that this should help.

However, this ends up tipping the scale too far in the other direction: almost all fraud is detected, but there are way too many false positives...

In [None]:
lr = LogisticRegression(class_weight='balanced')

# Fit..
lr.fit(X_train, y_train)

# Predict..
y_pred = lr.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))
plot_confusion_matrix(confusion_matrix(y_test, y_pred))
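
To get a feeling for how strong the 'balanced' setting is, here is a sketch of the formula given in the scikit-learn documentation, n_samples / (n_classes * count_per_class), written in plain Python. The helper name and the toy labels are my own assumptions for illustration.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weights as n_samples / (n_classes * count_of_that_class)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Toy labels (hypothetical): 990 non-fraud, 10 fraud.
y_toy = [0] * 990 + [1] * 10

toy_weights = balanced_class_weights(y_toy)
print(toy_weights)
print("fraud errors count %.0fx as much" % (toy_weights[1] / toy_weights[0]))
```

With an imbalance this large, every error on a fraud case is weighted far more heavily than an error on a non-fraud case, which fits the flood of false fraud reports seen above.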

We can also tune the class weights manually to find a better trade-off between false positives, false negatives and detected fraud cases. The F1 score, the harmonic mean of precision and recall, is a metric that attempts to capture that trade-off.

Below we explore the effect of weighting on F1 score to figure out the optimum.

In [None]:
from sklearn.model_selection import GridSearchCV

weights = np.linspace(0.05, 0.95, 20)

gsc = GridSearchCV(
    estimator=LogisticRegression(),
    param_grid={
        'class_weight': [{0: x, 1: 1.0-x} for x in weights]
    },
    scoring='f1',
    cv=3
)
# Search on the training split only, so the test set stays unseen
grid_result = gsc.fit(X_train, y_train)

print("Best parameters : %s" % grid_result.best_params_)

# Plot the weights vs f1 score
dataz = pd.DataFrame({ 'score': grid_result.cv_results_['mean_test_score'],
                       'weight': weights })
dataz.plot(x='weight')
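
The F1 score used for the search above can be written out by hand. This small sketch (the precision and recall inputs are made-up numbers) shows why it rewards a balanced model: a very high precision cannot make up for a very low recall.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up inputs: one balanced model, one lopsided model.
balanced_model = f1(0.90, 0.90)
lopsided_model = f1(0.99, 0.10)

print("balanced : %.3f" % balanced_model)
print("lopsided : %.3f" % lopsided_model)
```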

Now we create a logistic regression with the optimum parameters we discovered above and plot the results again. This version results in a more balanced trade-off between false positives, false negatives and detected fraud cases.

In [None]:
lr = LogisticRegression(**grid_result.best_params_)

# Fit..
lr.fit(X_train, y_train)

# Predict..
y_pred = lr.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))
plot_confusion_matrix(confusion_matrix(y_test, y_pred))

Another approach is to re-sample the data to balance the positive and negative classes. This should give results similar to the weighting. Let's see what this does:

In [None]:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline

pipe = make_pipeline(
    SMOTE(),
    LogisticRegression()
)

# Fit..
pipe.fit(X_train, y_train)

# Predict..
y_pred = pipe.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))
plot_confusion_matrix(confusion_matrix(y_test, y_pred))
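
For intuition about what SMOTE adds to the training data, here is a simplified sketch of its core idea: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line between them. This is not imblearn's implementation, and the function name, `k` and the toy points are assumptions for illustration.

```python
import random

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority points, sorted by squared distance to base
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the line, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(base, neighbour)))
    return synthetic

# Toy 2-D fraud samples (hypothetical)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_samples(minority, n_new=6)
for point in new_points:
    print(point)
```

The synthetic points always lie between existing fraud samples, so the classifier sees a denser fraud region rather than exact copies.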

So the SMOTE approach suffers from a similar issue to the auto-balancing of weights in sklearn's LogisticRegression: it results in many false fraud reports.

We can tune the SMOTE re-sampling ratio, just like the class weights, and achieve a similar improvement...

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

pipe = make_pipeline(
    SMOTE(),
    LogisticRegression()
)

weights = np.linspace(0.005, 0.05, 10)

gsc = GridSearchCV(
    estimator=pipe,
    param_grid={
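        #'smote__ratio': [{0: int(num_neg), 1: int(num_neg * w) } for w in weights]
        # Note: imbalanced-learn 0.4+ calls this parameter `sampling_strategy`
        # (`ratio` was removed in 0.6), i.e. 'smote__sampling_strategy'.
        'smote__ratio': weights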
    },
    scoring='f1',
    cv=3
)
# Search on the training split only, so the test set stays unseen
grid_result = gsc.fit(X_train, y_train)

print("Best parameters : %s" % grid_result.best_params_)

# Plot the weights vs f1 score
dataz = pd.DataFrame({ 'score': grid_result.cv_results_['mean_test_score'],
                       'weight': weights })
dataz.plot(x='weight')
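
To make the ratios in this search more concrete, here is a rough sketch of how many synthetic fraud samples a given ratio asks for. It assumes a float ratio means "minority count divided by majority count after re-sampling", which is how the imbalanced-learn documentation describes the float form of `sampling_strategy`. The helper and the class counts are hypothetical.

```python
def synthetic_samples_needed(n_majority, n_minority, ratio):
    """Synthetic minority samples needed so that minority/majority ~= ratio."""
    target_minority = round(n_majority * ratio)
    return max(target_minority - n_minority, 0)

# Hypothetical class counts, chosen only for illustration.
n_majority, n_minority = 100000, 170

for ratio in (0.005, 0.015, 0.05):
    extra = synthetic_samples_needed(n_majority, n_minority, ratio)
    print("ratio=%.3f -> %d synthetic fraud samples" % (ratio, extra))
```

Even the largest ratio in the grid stays far away from a full 1:1 balance, which is the point of the tuning: a little over-sampling instead of a lot.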

So now let's train a model with the best discovered params and see how it does!

In [None]:
pipe = make_pipeline(
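    # Use the ratio found by the grid search (0.015 in the original run)
    SMOTE(ratio=grid_result.best_params_['smote__ratio']),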
    LogisticRegression()
)

# Fit..
pipe.fit(X_train, y_train)

# Predict..
y_pred = pipe.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))
plot_confusion_matrix(confusion_matrix(y_test, y_pred))

So, to conclude this investigation: you can detect a decent number of fraud cases using a logistic regression classifier, but you need to deal with the class imbalance (many non-fraud vs. few fraud cases). I showed two successful strategies for dealing with that, class weighting and SMOTE re-sampling, and how to tune them.