# Dealing with Imbalanced Class Variable

In classification problems, we sometimes encounter datasets with a class imbalance: one class value dominates the others. This poses a significant challenge for classification algorithms that work best with balanced classes. Class imbalance is common in areas such as medical diagnosis, spam filtering, and fraud detection.

In this lab, we'll look at possible ways to handle an imbalanced class problem using a credit card dataset, in which each transaction is tagged as fraudulent (1) or not fraudulent (0). Our objective is to correctly classify the minority class of fraudulent transactions.

Important Note: This guide focuses solely on addressing imbalanced classes and does not address other important machine learning steps including, but not limited to, feature selection and hyperparameter tuning.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, recall_score, classification_report
from sklearn.metrics import fbeta_score

%matplotlib inline

In [2]:
# read in data
datapath = "./credit_card.csv"
df = pd.read_csv(datapath)

print(df.shape)

# show random 5 rows from this dataset
df.sample(n=5, replace=False)

(28480, 31)


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
3395,159245.0,2.006879,-0.400549,-0.511934,0.344015,-0.387092,-0.016655,-0.640259,0.033004,1.430189,...,0.171556,0.730852,0.113104,0.505969,-0.026404,-0.231335,0.038433,-0.031646,9.99,0
4737,159850.0,-0.855281,1.007698,0.89551,-0.761136,0.247561,0.279321,0.066651,0.629354,0.020545,...,-0.200186,-0.611384,-0.052361,0.013455,-0.200052,0.029018,0.212686,0.126544,9.99,0
3725,159381.0,-0.176637,1.223412,-1.157242,-0.84751,1.152529,0.011672,0.539591,0.502167,-0.470087,...,-0.303905,-0.882418,0.027147,-0.444602,-0.279758,0.15658,0.092327,0.002596,10.02,0
25910,170753.0,1.414668,-0.952525,-2.831745,0.389489,0.428029,-1.578949,1.546694,-0.780668,-0.086361,...,0.533662,0.912288,-0.463213,1.285918,0.615379,0.359092,-0.172319,-0.012868,378.0,0
2,152165.0,-4.673231,4.195976,-8.392423,7.743215,-1.138803,-2.094899,-3.839487,0.543053,-1.528448,...,0.554185,0.656076,0.482417,-0.624399,-0.296289,0.374802,-2.678544,0.412368,1.0,1


In [3]:
# print the class counts
df.Class.value_counts()

0    28431
1       49
Name: Class, dtype: int64

In [6]:
print(49/28431)
# note: this is the ratio of fraud rows to non-fraud rows, not the fraud share of all rows

0.0017234708592733284


In [5]:
# percentage of transactions where Class == 1
fraud_ratio = df[df.Class==1].shape[0] * 100 / df.shape[0]
fraud_ratio

0.1720505617977528

We can see we have a very imbalanced class: just 0.17% of the transactions in our dataset are fraudulent!

This is a problem because many machine learning models are designed to maximize overall accuracy, which may not be the best metric to use, especially with imbalanced classes. Classification accuracy is defined as the number of correct predictions divided by the total number of predictions, times 100. For example, if we simply predicted that all transactions are not fraud, we would get a classification accuracy score of over 99%!
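To make the accuracy arithmetic concrete, here is a small sketch that uses the class counts printed above (28,431 non-fraud and 49 fraud rows). The helper name `accuracy_percent` is only illustrative and is not part of the lab code.

```python
# Class counts taken from df.Class.value_counts() above
n_not_fraud = 28431
n_fraud = 49

def accuracy_percent(n_correct, n_total):
    """Classification accuracy: correct predictions / total predictions * 100."""
    return n_correct * 100 / n_total

# A model that labels every transaction "not fraud" gets every non-fraud
# row right and every fraud row wrong.
always_not_fraud = accuracy_percent(n_not_fraud, n_not_fraud + n_fraud)
print(f"Accuracy of always predicting 0: {always_not_fraud:.2f}%")
```

The score is high even though this "model" never identifies a single fraudulent transaction.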

# Create Train and Test Sets
The training set is used to build and validate the model, while the test set is reserved for testing the model on unseen data.

In [7]:
# Prepare data for modeling
# Separate input features and target
y = df.Class
X = df.drop('Class', axis=1)

# setting up testing and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=27)

In [8]:
y_train.value_counts()[1] * 100 / y_train.shape[0]

0.17322097378277154

In [9]:
y_test.value_counts()[1] * 100 / y_test.shape[0]

0.16853932584269662
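The two splits above happen to have similar fraud rates (about 0.173% and 0.169%). With so few positive rows, a purely random split could just as easily leave the test set with very few of them. One option, not used in this lab, is the `stratify` parameter of `train_test_split`, which keeps the class proportions the same in both splits. The sketch below uses made-up toy data (`X_toy`, `y_toy`) rather than the credit card dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 10,000 rows, 20 positives (0.2%)
rng = np.random.default_rng(27)
X_toy = rng.normal(size=(10_000, 3))
y_toy = np.zeros(10_000, dtype=int)
y_toy[:20] = 1

# stratify=y_toy asks for the same class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=27
)

print("train positives:", y_tr.sum(), "of", len(y_tr))
print("test positives: ", y_te.sum(), "of", len(y_te))
```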

# Baseline Models

In [10]:
# DummyClassifier to predict only target 0
dummy = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
dummy_pred = dummy.predict(X_test)

# checking unique labels
print('The predicted labels: ', np.unique(dummy_pred))

# checking accuracy
print('Accuracy score: ', accuracy_score(y_test, dummy_pred))

print(classification_report(y_test, dummy_pred))

The predicted labels:  [0]
Accuracy score:  0.998314606741573
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      7108
           1       0.00      0.00      0.00        12

    accuracy                           1.00      7120
   macro avg       0.50      0.50      0.50      7120
weighted avg       1.00      1.00      1.00      7120





The accuracy score for this Dummy Classifier is 99.8%, but this high accuracy is misleading. Because the Dummy Classifier predicts only Class 0, it is clearly not a good option for our objective, which is correctly classifying fraudulent transactions.

Let's see how logistic regression performs on this dataset.

In [11]:
# Modeling the data as is
# Train model
lr = LogisticRegression(solver='liblinear').fit(X_train, y_train)
 
# Predict on test set
lr_pred = lr.predict(X_test)

In [12]:
# print confusion matrix
pd.DataFrame(confusion_matrix(y_test, lr_pred))

Unnamed: 0,0,1
0,7108,0
1,8,4


In [13]:
print(classification_report(y_test, lr_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      7108
           1       1.00      0.33      0.50        12

    accuracy                           1.00      7120
   macro avg       1.00      0.67      0.75      7120
weighted avg       1.00      1.00      1.00      7120



In [14]:
# Checking accuracy to see the first few digits after decimal point
accuracy_score(y_test, lr_pred)

0.998876404494382

In [15]:
# Checking unique values
predictions = pd.DataFrame(lr_pred)
predictions[0].value_counts()

0    7116
1       4
Name: 0, dtype: int64

Logistic Regression outperformed the Dummy Classifier. We can see that it predicted 4 instances of class 1, so this is definitely an improvement. But can we do better?

Let's see if we can apply some techniques for dealing with class imbalance to improve these results. 

### 1. Change the evaluation measure

We should not use accuracy as the default measure for evaluating a classifier on imbalanced classes. Metrics that can provide better insight include:
 
* **Confusion Matrix:** a table showing the number of correct and incorrect predictions
* **Precision:** the number of true positives divided by all positive predictions. Precision is also called Positive Predictive Value. It is a measure of a classifier's exactness. Low precision indicates a high number of false positives.
* **Recall:** the number of true positives divided by the number of positive values in the test data. Recall is also called Sensitivity or the True Positive Rate. It is a measure of a classifier's completeness. Low recall indicates a high number of false negatives.
* **F1-Score:** the harmonic mean of precision and recall
* **F-beta Score:** a generalization of the F1-score. The F1-score gives equal weight to precision and recall, but in some cases precision is more important than recall, or vice versa. Check this short [article](https://onlinehelp.explorance.com/blueml/Content/articles/getstarted/mlcalculations.htm?TocPath=Get%20started%7C_____3) on the F-beta measure.

If our main objective is to catch the fraud cases, the recall score (or F2 score) can be considered a key metric for evaluating predictions.

In [16]:
# confusion matrix
pd.DataFrame(confusion_matrix(y_test, lr_pred))

Unnamed: 0,0,1
0,7108,0
1,8,4


In [17]:
# recall score
recall_score(y_test, lr_pred)

0.3333333333333333

In [18]:
# f1 score
f1_score(y_test, lr_pred)

0.5

In [19]:
# F2 Score
fbeta_score(y_test, lr_pred, beta=2)

0.3846153846153846
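The scores above can be reproduced by hand from the four cells of the logistic regression confusion matrix. The sketch below applies the standard precision, recall, and F-beta formulas. The helper `f_beta` is written here for illustration and is not how scikit-learn implements `fbeta_score` internally.

```python
# Counts read off the logistic regression confusion matrix above
# (rows are true classes, columns are predicted classes)
tn, fp = 7108, 0
fn, tp = 8, 4

precision = tp / (tp + fp)   # share of predicted frauds that are real frauds
recall = tp / (tp + fn)      # share of real frauds that were caught

def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 favours recall, beta < 1 favours precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

f1 = f_beta(precision, recall, beta=1)
f2 = f_beta(precision, recall, beta=2)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f} F2={f2:.3f}")
```

Because F2 weights recall more heavily than precision, the low recall pulls it below the F1 score.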

We have a very high accuracy score of 0.999 but an F1 score of only 0.5 for the fraud class. From the confusion matrix, we can see that we are misclassifying 8 of the 12 fraudulent transactions, leading to a recall score of only 0.33.

### 2. Change the learner

For every machine learning problem, it's a good rule of thumb to try a variety of algorithms, and this can be especially beneficial with imbalanced datasets. Decision trees frequently perform well on imbalanced data. They work by learning a hierarchy of if/else questions, which can force both classes to be addressed.

In [20]:
from sklearn.ensemble import RandomForestClassifier

In [21]:
# train model
rfc = RandomForestClassifier(n_estimators=10).fit(X_train, y_train)

# predict on test set
rfc_pred = rfc.predict(X_test)

print(classification_report(y_test, rfc_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      7108
           1       1.00      0.92      0.96        12

    accuracy                           1.00      7120
   macro avg       1.00      0.96      0.98      7120
weighted avg       1.00      1.00      1.00      7120



In [22]:
# confusion matrix
pd.DataFrame(confusion_matrix(y_test, rfc_pred))

Unnamed: 0,0,1
0,7108,0
1,1,11


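The random forest (an ensemble of decision trees) handled the class imbalance well on this dataset: on the test set it reached a recall of 0.92 and an F1-score of 0.96 for the fraud class, compared to 0.33 and 0.50 for logistic regression. Note that `RandomForestClassifier` is not seeded here, so the exact numbers may differ from run to run.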

### 3. Resampling Techniques

#### Oversampling Minority Class
Oversampling can be defined as adding more copies of the minority class. It can be a good choice when you don't have a lot of data to work with. A drawback to consider when oversampling is that it can cause overfitting and poor generalization to your test set.

We will use the `resample` utility from Scikit-Learn (`sklearn.utils`) to randomly replicate samples from the minority class.

Important Note
Always split into test and train sets BEFORE trying any resampling techniques! Oversampling before splitting the data can allow the exact same observations to be present in both the test and train sets! This can allow our model to simply memorize specific data points and cause overfitting.
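The leakage described in this note can be shown with a tiny standard-library sketch. Everything in it is an assumption made for illustration: the rows are just integer ids (ids 0 to 9 play the minority class), and `split` is a hypothetical stand-in for `train_test_split`.

```python
import random

random.seed(27)

# Toy dataset: row ids 0..999, where ids 0..9 are the minority class
minority = list(range(10))
majority = list(range(10, 1000))

def split(rows, test_fraction=0.25):
    """Shuffle a copy of the rows and return (train, test) lists."""
    rows = rows[:]
    random.shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

# Wrong order: oversample first, then split.
oversampled = majority + random.choices(minority, k=len(majority))
train_bad, test_bad = split(oversampled)
leaked = set(train_bad) & set(test_bad)

# Right order: split first, then oversample only the training rows.
train_ok, test_ok = split(majority + minority)
train_minority = [r for r in train_ok if r < 10]
train_majority = [r for r in train_ok if r >= 10]
train_ok = train_majority + random.choices(train_minority, k=len(train_majority))
leaked_ok = set(train_ok) & set(test_ok)

print("ids in both train and test (oversample, then split):", len(leaked))
print("ids in both train and test (split, then oversample):", len(leaked_ok))
```

In the first ordering, copies of the same minority rows end up on both sides of the split, so the test score no longer measures performance on unseen data.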

In [23]:
from sklearn.utils import resample

In [24]:
# concatenate our training data back together before doing the resampling

train_set = pd.concat([X_train, y_train], axis=1)
train_set.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
13707,163892.0,1.794257,-0.277896,-2.381295,0.227035,1.181389,0.610771,0.134762,0.097202,0.646861,...,0.27143,0.845002,-0.143285,-2.082546,0.116228,0.046821,0.031267,-0.032991,104.71,0
20703,167418.0,-2.425979,2.399782,-0.779091,-1.365301,0.260095,-0.643864,1.098915,-0.142483,2.054626,...,-0.747394,-0.596796,0.001252,-0.452732,-0.161442,-0.003178,-0.337996,-1.084199,9.99,0
10572,162529.0,0.072139,1.001944,-0.401543,-0.541423,1.116271,-0.754787,0.995224,-0.139046,-0.205811,...,-0.315193,-0.734425,0.115187,0.528007,-0.388949,0.11561,0.227464,0.087912,8.99,0
4755,159856.0,2.006937,0.630784,-1.091179,3.554664,0.775172,-0.093498,0.311137,-0.177965,-1.054687,...,0.175626,0.52031,-0.026899,-0.731376,0.221499,0.177702,-0.047755,-0.060361,2.28,0
8068,161407.0,-0.004653,1.075463,-0.341932,-0.621717,0.996624,-0.781242,1.112333,-0.269529,0.355222,...,-0.422035,-0.835089,0.09616,0.553738,-0.427988,0.093356,0.167179,-0.080997,1.78,0


In [25]:
# separate minority and majority classes
train_set_not_fraud = train_set[train_set.Class==0]
train_set_fraud = train_set[train_set.Class==1]

# upsample minority
fraud_upsampled = resample(train_set_fraud,
                          replace=True, # sample with replacement
                          n_samples=train_set_not_fraud.shape[0], # match number in majority class
                          random_state=27) # reproducible results

# combine majority and upsampled minority
train_set_upsampled = pd.concat([train_set_not_fraud, fraud_upsampled])

# check new class counts
train_set_upsampled.Class.value_counts()

0    21323
1    21323
Name: Class, dtype: int64

In [26]:
# trying logistic regression again with the balanced dataset
X_train_upsampled = train_set_upsampled.drop('Class', axis=1)
y_train_upsampled = train_set_upsampled.Class


upsampled = LogisticRegression(solver='liblinear').fit(X_train_upsampled, y_train_upsampled)
upsampled_pred = upsampled.predict(X_test)

print(classification_report(y_test, upsampled_pred))

              precision    recall  f1-score   support

           0       1.00      0.95      0.98      7108
           1       0.03      1.00      0.07        12

    accuracy                           0.95      7120
   macro avg       0.52      0.98      0.52      7120
weighted avg       1.00      0.95      0.97      7120



In [27]:
# confusion matrix
pd.DataFrame(confusion_matrix(y_test, upsampled_pred))

Unnamed: 0,0,1
0,6776,332
1,0,12


The accuracy score dropped to 0.95 after upsampling, but the model now catches all 12 fraud instances in the test set, an improvement in recall over our plain logistic regression above. The cost is 332 false positives, which pulls precision for the fraud class down to 0.03.



#### Undersampling Majority Class

Undersampling can be defined as removing some observations of the majority class. Undersampling can be a good choice when you have a lot of data - think millions of rows. But a drawback to undersampling is that we are removing information that may be valuable.

We will again use the `resample` utility from Scikit-Learn, this time to randomly remove samples from the majority class.

In [28]:
# still using our separated classes fraud and not_fraud from above

# downsample majority
not_fraud_downsampled = resample(train_set_not_fraud,
                                replace = False, # sample without replacement
                                n_samples = train_set_fraud.shape[0], # match minority n
                                random_state = 27) # reproducible results

# combine minority and downsampled majority
train_set_downsampled = pd.concat([not_fraud_downsampled, train_set_fraud])

# checking counts
train_set_downsampled.Class.value_counts()

0    37
1    37
Name: Class, dtype: int64

In [29]:
# trying logistic regression again with the undersampled dataset

y_train_downsampled = train_set_downsampled.Class
X_train_downsampled = train_set_downsampled.drop('Class', axis=1)

undersampled = LogisticRegression(solver='liblinear').fit(X_train_downsampled, y_train_downsampled)
undersampled_pred = undersampled.predict(X_test)

print(classification_report(y_test, undersampled_pred))

              precision    recall  f1-score   support

           0       1.00      0.85      0.92      7108
           1       0.01      1.00      0.02        12

    accuracy                           0.85      7120
   macro avg       0.51      0.93      0.47      7120
weighted avg       1.00      0.85      0.92      7120



In [30]:
# confusion matrix
pd.DataFrame(confusion_matrix(y_test, undersampled_pred))

Unnamed: 0,0,1
0,6060,1048
1,0,12


The accuracy score dropped to 0.85 after downsampling. The model still catches all 12 fraud instances, but it produces 1,048 false positives, so it performs worse than the upsampled model.



#### Generate Synthetic Samples

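SMOTE (Synthetic Minority Oversampling Technique) is a popular algorithm for creating synthetic observations of the minority class. Rather than duplicating existing rows, it generates new points by interpolating between a minority-class observation and its nearest minority-class neighbors.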
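The core idea behind SMOTE can be sketched in a few lines of NumPy: pick a minority point, pick one of its nearest minority neighbours, and place a new point somewhere on the line segment between them. This is a simplified illustration, not the `imblearn` implementation. The function name `smote_like_samples` and the toy array `X_min` are assumptions made for this example.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Pairwise Euclidean distances between minority points
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                       # random minority points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour for each
    gap = rng.random((n_new, 1))                                # position along the segment
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

# Toy minority class with two features
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]])
synthetic = smote_like_samples(X_min, n_new=4, k=3)
print(synthetic)
```

Every synthetic row lies between two real minority rows, which is why SMOTE adds variety that plain duplication does not.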

In [31]:
from imblearn.over_sampling import SMOTE

In [32]:
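sm = SMOTE()
# store the resampled data under new names so the original training split stays intact
X_train_smote, y_train_smote = sm.fit_resample(X_train, y_train)

In [33]:
smote = LogisticRegression(solver='liblinear').fit(X_train_smote, y_train_smote)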
smote_pred = smote.predict(X_test)

In [34]:
# confusion matrix
pd.DataFrame(confusion_matrix(y_test, smote_pred))

Unnamed: 0,0,1
0,6972,136
1,0,12


In [35]:
print(classification_report(y_test, smote_pred))

              precision    recall  f1-score   support

           0       1.00      0.98      0.99      7108
           1       0.08      1.00      0.15        12

    accuracy                           0.98      7120
   macro avg       0.54      0.99      0.57      7120
weighted avg       1.00      0.98      0.99      7120



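On this test set, SMOTE performs better than the random upsampling and downsampling techniques: all three reach a recall of 1.00 for the fraud class, but SMOTE produces the fewest false positives (136, versus 332 and 1,048).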
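To compare all the approaches side by side, the sketch below collects the confusion matrices printed in this notebook and derives fraud-class precision and recall from each. The dictionary labels and the helper `fraud_metrics` are illustrative names. The random forest and SMOTE numbers come from unseeded runs, so a rerun may give slightly different counts.

```python
# Confusion matrices copied from the outputs above: [[TN, FP], [FN, TP]]
results = {
    "logistic regression": [[7108, 0], [8, 4]],
    "random forest": [[7108, 0], [1, 11]],
    "upsampled LR": [[6776, 332], [0, 12]],
    "downsampled LR": [[6060, 1048], [0, 12]],
    "SMOTE LR": [[6972, 136], [0, 12]],
}

def fraud_metrics(cm):
    """Return (precision, recall, false positives) for the fraud class."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn)
    return precision, recall, fp

summary = {name: fraud_metrics(cm) for name, cm in results.items()}
for name, (precision, recall, fp) in summary.items():
    print(f"{name:<20} precision={precision:.2f} recall={recall:.2f} false positives={fp}")
```

The resampled logistic regression models trade precision for recall, so which approach is preferable depends on how costly a false alarm is relative to a missed fraud.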

# Save your notebook, then `File > Close and Halt`