
# Sampling Methods
This notebook introduces two sampling methods for dealing with imbalanced data: over-sampling with SMOTE and random under-sampling.

In [1]:
import pandas as pd
import numpy as np

seed = 42 # Random seed for replicability

## Introduction
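The EDA already showed that the data is heavily imbalanced: fraudulent transactions make up only a tiny fraction of all records. Sampling methods address this by rebalancing the classes, either by adding minority-class samples (over-sampling) or by removing majority-class samples (under-sampling).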

In [2]:
data = pd.read_csv("../data/creditcard.csv")
X = data.drop('Class', axis=1)
y = data['Class']
print(X.shape)

(284807, 30)


In [3]:
print("Distribution before sampling:\n")
print(f"Normal transactions: {round(y.value_counts(normalize = True)[0] * 100, 2)} %")
print(f"Fraudulent transactions: {round(y.value_counts(normalize = True)[1] * 100, 2)} %")

Distribution before sampling:

Normal transactions: 99.83 %
Fraudulent transactions: 0.17 %


## SMOTE: Synthetic Minority Over-sampling Technique

In [4]:
from imblearn.over_sampling import SMOTE # ref: https://imbalanced-learn.readthedocs.io/

Using TensorFlow backend.


In [5]:
# Pass the seed so the synthetic samples are reproducible
X_smote, y_smote = SMOTE(sampling_strategy='minority', random_state=seed).fit_resample(X, y)
print(X_smote.shape)

(568630, 30)


In [6]:
print("Distribution after SMOTE (Oversampling):\n")
print(f"No fraud: {round(np.bincount(y_smote)[0]/y_smote.shape[0] * 100, 2)} %")
print(f"Fraud: {round(np.bincount(y_smote)[1]/y_smote.shape[0] * 100, 2)} %")

Distribution after SMOTE (Oversampling):

No fraud: 50.0 %
Fraud: 50.0 %


Now we have a perfectly balanced dataset with 568,630 samples (284,315 per class). Note that all but 492 of the fraud samples are synthetic.
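
To make the over-sampling step less of a black box, the cell below is a minimal sketch of the interpolation idea behind SMOTE: pick a minority sample, pick one of its nearest minority neighbors, and place a new point somewhere on the line segment between them. It is not imblearn's implementation; the function name `smote_sketch`, its parameters, and the toy data are hypothetical and only serve as illustration.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=42):
    """Illustrative only: create n_new synthetic points from minority samples X_min."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = X_min.shape[0]
    k = min(k, n - 1)  # a point cannot be its own neighbor

    # Pairwise Euclidean distances between all minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # exclude each point from its own neighbor list
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbors

    # For each new point: a random base sample and one of its k neighbors
    base = rng.integers(0, n, size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]

    # Interpolate at a random position between base sample and neighbor
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Toy minority class: the four corners of the unit square (made-up data)
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_sketch(X_toy, n_new=10, k=2)
print(X_new.shape)
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already covered by the minority class and never duplicate a sample exactly unless the random gap is zero.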

## Under-sampling

In [7]:
# Separate fraudulent and normal transactions
data_fraud = data[data['Class'] == 1]
data_normal = data[data['Class'] == 0]

# Sample as many normal transactions as there are fraud cases (seeded for reproducibility)
data_normal_sub = data_normal.sample(data_fraud.shape[0], random_state=seed)

# Merge the fraud cases with the normal subsample and shuffle the rows
df_sub = pd.concat([data_fraud, data_normal_sub]).sample(frac=1, random_state=seed)
print(df_sub.shape)

(984, 31)


In [9]:
print("Distribution after sub-sampling:\n")
print(f"Normal transactions: {round(df_sub['Class'].value_counts(normalize = True)[0] * 100, 2)} %")
print(f"Fraudulent transactions: {round(df_sub['Class'].value_counts(normalize = True)[1] * 100, 2)} %")

Distribution after sub-sampling:

Normal transactions: 50.0 %
Fraudulent transactions: 50.0 %
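
The manual steps above can also be wrapped in a small helper that works for any number of classes. The cell below is a sketch under assumptions: the function name `undersample` and the toy DataFrame are hypothetical, and it relies on `DataFrameGroupBy.sample`, which needs pandas 1.1 or newer. imblearn's `RandomUnderSampler` offers comparable functionality through the same `fit_resample` interface used for SMOTE.

```python
import pandas as pd

def undersample(df, label_col, seed=42):
    """Illustrative only: downsample every class to the size of the smallest class."""
    n_min = df[label_col].value_counts().min()
    # Draw n_min rows from each class without replacement
    balanced = df.groupby(label_col).sample(n=n_min, random_state=seed)
    # Shuffle so the classes are not stored in contiguous blocks
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy data with a 95:5 imbalance (made-up values, not the credit card dataset)
toy = pd.DataFrame({"Amount": range(100), "Class": [1] * 5 + [0] * 95})
toy_sub = undersample(toy, "Class")
print(toy_sub["Class"].value_counts().to_dict())
```

Applied to the credit card data, `undersample(data, 'Class', seed)` should give the same 984-row shape as the cell above, though not necessarily the same rows, since the random draws are made differently.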
