# 1. Undersampling and Oversampling

In this notebook, I'll explore ways to improve our classifier's poor predictions. Although our accuracy was high, our precision and recall were not. As previously explained, this behaviour is expected with an imbalanced dataset: a classifier that mostly predicts the majority class still scores a high accuracy.

Two common ways to keep a classifier from generalising badly on imbalanced data are undersampling and oversampling. I'll explore what these concepts mean and how to use them properly.

Undersampling reduces the imbalance in a dataset by removing data points from the classes that are over-represented (the majority classes). Oversampling, by contrast, produces more data points for the class that is under-represented (the minority class) until the dataset is balanced.
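
To make the two ideas concrete, here is a minimal sketch in plain Python, using only the standard library. It is not how imbalanced-learn implements its samplers; the helper names `random_oversample` and `random_undersample` and the toy data are hypothetical and only for illustration. Oversampling duplicates randomly chosen rows of the smaller class, while undersampling keeps only a random subset of the larger class.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until
    every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Sample with replacement: the same row may be copied many times.
        extra = rng.choices(idx, k=target - n)
        X_out.extend(X[i] for i in extra)
        y_out.extend(y[i] for i in extra)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class so that every class has
    as many rows as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Sample without replacement: each kept row appears once.
        keep.extend(rng.sample(idx, target))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: 8 rows of class 0 and 2 rows of class 1.
toy_X = [[i] for i in range(10)]
toy_y = [0] * 8 + [1] * 2

over_X, over_y = random_oversample(toy_X, toy_y)
under_X, under_y = random_undersample(toy_X, toy_y)

print(Counter(over_y))   # Counter({0: 8, 1: 8})
print(Counter(under_y))  # Counter({0: 2, 1: 2})
```

Note the trade-off the sketch exposes: oversampling keeps every original row but adds exact duplicates, whereas undersampling throws away most of the majority class.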

For a simple overview of how both work, see: https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis

I'll be using the undersampling and oversampling implementations from imbalanced-learn (`imblearn`), a library designed to work alongside scikit-learn.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

from collections import Counter
from imblearn import over_sampling
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate, cross_val_predict
from sklearn.metrics import recall_score, precision_score, accuracy_score, make_scorer, confusion_matrix

%matplotlib inline

In [2]:
cc_df = pd.read_csv('../../../data/raw/kaggle/creditcard.csv')
X = cc_df.drop(['Time', 'Class'], axis=1)
y = cc_df['Class']

In [3]:
# Randomly duplicate minority-class rows until both classes have the same count
ROS = over_sampling.RandomOverSampler(random_state=0)
sampled_X, sampled_y = ROS.fit_resample(X, y)

In [4]:
Counter(sampled_y)

Counter({0: 284315, 1: 284315})