# 1. Undersampling and Oversampling

In this notebook, I'll explore ways to improve our classifiers' poor predictions. Although our accuracy was high, our precision and recall were not. As previously explained, this behaviour is expected because our dataset is unbalanced.
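To make the accuracy problem concrete, here is a minimal sketch in plain Python. The class counts (10,000 transactions, 20 of them fraudulent) are made up for illustration and are not taken from our dataset, and the "classifier" simply predicts the majority class every time.

```python
# Hypothetical, made-up labels: 10,000 transactions, 20 of them fraud (class 1).
y_true = [1] * 20 + [0] * 9980

# A useless "classifier" that always predicts the majority class (0).
y_pred = [0] * len(y_true)

# Count the four confusion-matrix cells by hand.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
# Precision is undefined when nothing is predicted positive; report 0.0 here.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print("accuracy:", accuracy)
print("precision:", precision)
print("recall:", recall)
```

The accuracy looks excellent even though not a single fraud is detected, which is why precision and recall are the metrics to watch on data like ours.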

Two ways to help prevent classifiers from generalising badly on such data are undersampling and oversampling. I'll explore what these concepts mean and how to use them properly.

Undersampling reduces the imbalance in a dataset by removing data points from the classes that are over-represented (the majority classes). Oversampling, meanwhile, produces more data points for the class that is under-represented (the minority class) in order to balance the dataset.

A simple overview of how they work is available here: https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis
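As a rough illustration of the two ideas in their simplest (random) form, here is a plain-Python sketch. The function names and the toy data are hypothetical, and this is not how any particular library implements it. It only shows the principle: drop majority samples at random, or duplicate minority samples at random.

```python
import random
from collections import Counter


def _indices_by_class(y):
    """Group sample indices by their class label."""
    groups = {}
    for idx, label in enumerate(y):
        groups.setdefault(label, []).append(idx)
    return groups


def random_undersample(X, y, seed=0):
    """Randomly drop samples until every class is as small as the rarest one."""
    rng = random.Random(seed)
    groups = _indices_by_class(y)
    target = min(len(idxs) for idxs in groups.values())
    keep = []
    for idxs in groups.values():
        keep.extend(rng.sample(idxs, target))  # sampling without replacement
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


def random_oversample(X, y, seed=0):
    """Randomly duplicate samples until every class is as large as the biggest one."""
    rng = random.Random(seed)
    groups = _indices_by_class(y)
    target = max(len(idxs) for idxs in groups.values())
    keep = []
    for idxs in groups.values():
        keep.extend(idxs)  # keep every original sample
        keep.extend(rng.choices(idxs, k=target - len(idxs)))  # duplicates, with replacement
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (made up): 95 samples of class 0 and 5 samples of class 1.
X_toy = [[i] for i in range(100)]
y_toy = [0] * 95 + [1] * 5

X_under, y_under = random_undersample(X_toy, y_toy)
X_over, y_over = random_oversample(X_toy, y_toy)

print("original:    ", Counter(y_toy))
print("undersampled:", Counter(y_under))
print("oversampled: ", Counter(y_over))
```

Undersampling throws information away (here 90 of the 95 majority samples), while oversampling keeps everything but repeats the same few minority samples many times.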

I'll be using the undersampling and oversampling implementations from imbalanced-learn (`imblearn`), a library designed to work alongside scikit-learn.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

from collections import Counter
from imblearn import over_sampling, under_sampling
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate, cross_val_predict, StratifiedShuffleSplit, train_test_split
from sklearn.metrics import recall_score, precision_score, accuracy_score, make_scorer, confusion_matrix

import fraudutils as futils
import warnings

warnings.filterwarnings(action='once')

%load_ext autoreload
%autoreload 2
%matplotlib inline

I'll load the same data as before, but this time I'll split it into train and test sets while maintaining the class distribution. I'll then apply undersampling and oversampling techniques to the training set only. This way we can use the newly generated training set to train our algorithms and see how they perform on the untouched test set. I'll compare the results of simple ML algorithms on three metrics: accuracy, precision and recall.

In [2]:
cc_df = pd.read_csv('../../../data/raw/kaggle/creditcard.csv')
X_ = cc_df.drop(['Time', 'Class'], axis=1)
y_ = cc_df['Class'].values

X_train, X_test, y_train, y_test = train_test_split(X_, y_, test_size=0.2, random_state=0, stratify=y_)
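The `stratify=y_` argument is what keeps the class distribution the same in both sets. Below is a rough plain-Python sketch of the idea. It is not scikit-learn's actual implementation, and the function name and toy labels are hypothetical.

```python
import random
from collections import Counter


def stratified_split_indices(y, test_size=0.2, seed=0):
    """Split indices so each class contributes the same fraction to the test set."""
    rng = random.Random(seed)
    groups = {}
    for idx, label in enumerate(y):
        groups.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for idxs in groups.values():
        shuffled = idxs[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_size)  # per-class test quota
        test_idx.extend(shuffled[:n_test])
        train_idx.extend(shuffled[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (made up): 5% of the samples belong to class 1.
y_toy = [1] * 50 + [0] * 950

train_idx, test_idx = stratified_split_indices(y_toy, test_size=0.2)
train_counts = Counter(y_toy[i] for i in train_idx)
test_counts = Counter(y_toy[i] for i in test_idx)

print("train:", train_counts)
print("test: ", test_counts)
```

Both sets keep the same 5% share of class 1, so the test set stays representative of the original imbalance.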

I'll be comparing classifiers before and after applying undersampling and oversampling, as shown below:

In [3]:
def classify(X_train, X_test, y_train, y_test, random_state=0, classifier=LogisticRegression):
    """Fit `classifier` on the train set and score its predictions on the test set."""
    clf = classifier(random_state=random_state)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    
    print("Mean accuracy: {}".format(accuracy))
    print("Mean precision: {}".format(precision))
    print("Mean recall: {}".format(recall))
    
    return {'accuracy': accuracy, 
            'precision': precision,
            'recall': recall
           }

##### No sampling applied

In [4]:
logistic_regression_scores = {}
decision_tree_scores = {}

In [5]:
print("Logistic regression results:")
logistic_regression_scores['normal'] = classify(X_train, X_test, y_train, y_test, classifier=LogisticRegression)

Logistic regression results:
Mean accuracy: 0.9991748885221726
Mean precision: 0.8493150684931506
Mean recall: 0.6326530612244898


In [6]:
print("Decision tree results:")
decision_tree_scores['normal'] = classify(X_train, X_test, y_train, y_test, classifier=DecisionTreeClassifier)

Decision tree results:
Mean accuracy: 0.9991222218320986
Mean precision: 0.75
Mean recall: 0.7346938775510204


##### Random oversampling applied

In [7]:
ros = over_sampling.RandomOverSampler(random_state=0)
X_oversampled, y_oversampled = ros.fit_resample(X_train, y_train)

In [8]:
print("Logistic regression results:")
logistic_regression_scores['oversampled'] = classify(X_oversampled, X_test, y_oversampled, y_test, classifier=LogisticRegression)

Logistic regression results:
Mean accuracy: 0.9780379902391068
Mean precision: 0.0649056603773585
Mean recall: 0.8775510204081632


In [9]:
print("Decision tree results:")
decision_tree_scores['oversampled'] = classify(X_oversampled, X_test, y_oversampled, y_test, classifier=DecisionTreeClassifier)

Decision tree results:
Mean accuracy: 0.9991222218320986
Mean precision: 0.75
Mean recall: 0.7346938775510204


##### Random undersampling applied

In [10]:
rus = under_sampling.RandomUnderSampler(random_state=0)
X_undersampled, y_undersampled = rus.fit_resample(X_train, y_train)

In [11]:
print("Logistic regression results:")
logistic_regression_scores['undersampled'] = classify(X_undersampled, X_test, y_undersampled, y_test, classifier=LogisticRegression)

Logistic regression results:
Mean accuracy: 0.9702784312348584
Mean precision: 0.04918032786885246
Mean recall: 0.8877551020408163


In [12]:
print("Decision tree results:")
decision_tree_scores['undersampled'] = classify(X_undersampled, X_test, y_undersampled, y_test, classifier=DecisionTreeClassifier)

Decision tree results:
Mean accuracy: 0.9086408482848215
Mean precision: 0.01611068991660349
Mean recall: 0.8673469387755102


### Logistic regression scores

In [13]:
logistic_regression_df = pd.DataFrame(logistic_regression_scores)
logistic_regression_df

| | normal | oversampled | undersampled |
|---|---|---|---|
| accuracy | 0.999175 | 0.978038 | 0.970278 |
| precision | 0.849315 | 0.064906 | 0.04918 |
| recall | 0.632653 | 0.877551 | 0.887755 |
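Precision and recall are ratios of counts, so the scores above can be turned back into approximate confusion-matrix counts. The sketch below assumes the test set contains 98 fraud cases. That number is my assumption (it is the denominator that makes the reported recall values come out as whole-number ratios) and is not printed anywhere in the outputs above. The helper name `counts_from_scores` is hypothetical.

```python
def counts_from_scores(precision, recall, n_positive):
    """Recover (true positives, false positives) from precision and recall."""
    tp = round(recall * n_positive)             # recall = TP / (TP + FN)
    predicted_positive = round(tp / precision)  # precision = TP / (TP + FP)
    return tp, predicted_positive - tp


N_FRAUD_TEST = 98  # assumption: number of fraud cases in the test set

# (precision, recall) pairs copied from the logistic regression scores.
scores = {
    'normal': (0.849315, 0.632653),
    'oversampled': (0.064906, 0.877551),
    'undersampled': (0.04918, 0.887755),
}

counts = {name: counts_from_scores(p, r, N_FRAUD_TEST) for name, (p, r) in scores.items()}

for name, (tp, fp) in counts.items():
    print("{}: {} frauds caught, {} false alarms".format(name, tp, fp))
```

Under that assumption, resampling lets logistic regression catch noticeably more frauds, but the number of false alarms grows from about a dozen to well over a thousand, which is what drives precision down.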


### Decision tree scores

In [14]:
decision_tree_df = pd.DataFrame(decision_tree_scores)
decision_tree_df

| | normal | oversampled | undersampled |
|---|---|---|---|
| accuracy | 0.999122 | 0.999122 | 0.908641 |
| precision | 0.75 | 0.75 | 0.016111 |
| recall | 0.734694 | 0.734694 | 0.867347 |
