# Dealing with Imbalanced Data in Python

---
## Introduction

### What is imbalanced data and why is it a problem?
Imbalanced data is a frequent and significant challenge for classification problems.

Imbalanced datasets are those in which the classes are not equally represented. One example is a dataset where 95% of observations come from one class and only 5% belong to the other. Imbalanced data commonly arises in fraud detection, medical diagnosis, and anomaly detection. Unless properly accounted for, this imbalance can bias the model towards the majority class and make it perform much more poorly on the minority class.

In this tutorial, we will cover various methods of dealing with imbalanced data in Python.

### Methods of dealing with imbalanced data
There are several ways to deal with imbalanced data in Python. Some of the most common are the following:

1. **Undersampling**: This involves reducing the occurrences of the majority class to match the number of those in the minority class. This is typically done by randomly removing occurrences of the majority class until the classes are equally represented in the data. A significant drawback of this approach is that, by removing a large proportion of the data, we lose valuable information about the majority class.

2. **Oversampling**: This is, in many ways, the opposite of undersampling. Instead of removing occurrences of the majority class, we increase the occurrences of the minority class. The simplest approach is to duplicate existing samples. However, other techniques have been developed to generate synthetic examples of the minority class. One popular example of this kind of oversampling is **SMOTE** (see item 3 below). If the duplicated observations or synthetic examples are not representative of the minority class, this can lead to overfitting and poor generalization of the trained model.

3. **Synthetic Minority Over-Sampling Technique (SMOTE)**: As explained above, this is a popular oversampling technique that generates synthetic examples for the minority class by interpolating between existing observations in the dataset. This reduces the risk of overfitting because the examples it generates are more representative of the minority class than exact duplicates of existing observations would be. However, this method can take a long time to run on large datasets, since it has to generate a large amount of new data.

4. **A combination of undersampling and oversampling**: This technique uses a combination of undersampling and oversampling to achieve a more balanced dataset. In the `imbalanced-learn` library, this is done by first using an oversampling technique to increase the occurrences of the minority class, followed by an undersampling technique that cleans the resampled data. One common method in that library is SMOTEENN.

5. **Class Weighting**: By giving the observations of the minority class more weight (often done by setting the `class_weight` parameter of the classifier to `'balanced'`), the classifier will pay more attention to the minority class during training, thus improving its ability to predict the minority class. However, this often results in more false positives for the minority class (that is, a worse ability to predict the majority class).

In this tutorial, I will take you through some basic examples of each of these using a logistic regression classifier.

---

## Tutorial
For illustration purposes, I will use the flights dataset from the last project. In this dataset, the feature 'DEP_DEL15' is the response variable, where 1 indicates that a flight was delayed by more than 15 minutes and 0 otherwise. As shown below, there are only about 237,000 delayed flights compared to just over 1 million flights that were not delayed.

### Load the data
Feel free to go back to the repository, where you can find the flights dataset, and load it into your Python environment.

In [1]:
import pandas as pd
df = pd.read_csv("train_data.csv")

### Explore the data
Before fitting any models, let us take a look at the data. According to the description in the repository, there are several columns that describe the flight logistics and plane details. 

In [2]:
df["DEP_DEL15"].value_counts()

0    1028946
1     237295
Name: DEP_DEL15, dtype: int64

In [3]:
df.head()

Unnamed: 0,MONTH,DAY_OF_WEEK,DEP_DEL15,DEP_TIME_BLK,DISTANCE_GROUP,SEGMENT_NUMBER,CONCURRENT_FLIGHTS,NUMBER_OF_SEATS,CARRIER_NAME,AIRPORT_FLIGHTS_MONTH,...,PLANE_AGE,DEPARTING_AIRPORT,LATITUDE,LONGITUDE,PREVIOUS_AIRPORT,PRCP,SNOW,SNWD,TMAX,AWND
0,3,4,0,2200-2259,2,8,21,90,Comair Inc.,11965,...,5,Ronald Reagan Washington National,38.852,-77.037,Memphis International,0.0,0.0,0.0,64.0,8.5
1,2,1,0,2000-2059,2,6,44,180,Delta Air Lines Inc.,10714,...,6,Minneapolis-St Paul International,44.886,-93.218,Stapleton International,0.01,0.0,0.0,81.0,6.93
2,9,1,1,1800-1859,3,8,92,50,American Eagle Airlines Inc.,28583,...,21,Chicago O'Hare International,41.978,-87.906,Rochester Municipal,0.0,0.0,0.0,74.0,7.83
3,5,3,0,1600-1659,3,3,72,129,Delta Air Lines Inc.,34238,...,11,Atlanta Municipal,33.641,-84.427,Jacksonville International,0.0,0.0,0.0,84.0,6.71
4,6,7,0,1900-1959,2,1,56,173,United Air Lines Inc.,28904,...,6,Chicago O'Hare International,41.978,-87.906,NONE,0.38,0.0,0.0,81.0,10.29


### Create a training and test set

Using the sklearn `train_test_split` function, we can reserve 20% of the data to obtain unbiased metrics of our model's performance. As the dataset is greatly imbalanced, we will shuffle the data so that random allocation is preserved. Additionally, we will scale the data and change all the categorical data into 1s and 0s.

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

df = df.sample(frac=1, random_state=27)

X = df.drop('DEP_DEL15', axis=1)
y = df['DEP_DEL15']

Xtrain, Xtest, ytrain, ytest = train_test_split(X,
                                                y,
                                                test_size=0.2,
                                                random_state=20)

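# Columns with string (object) dtype are treated as categorical
categorical_features = Xtrain.select_dtypes('object').columns

# One-hot encode the categorical columns; categories not seen during fitting are encoded as all zeros
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Note: ColumnTransformer defaults to remainder='drop', so columns not listed
# here (the numeric ones) are not passed on to the model
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features)])

# Fit the encoder on the training data only, then apply it to the test data
Xtrain_c = preprocessor.fit_transform(Xtrain)
Xtest_c = preprocessor.transform(Xtest)

# with_mean=False lets the scaler work on the sparse one-hot matrix without centering it
scaler = StandardScaler(with_mean=False)
Xtrain = scaler.fit_transform(Xtrain_c)
Xtest = scaler.transform(Xtest_c)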


### Fitting our first model
Here we will fit a simple logistic regression model to obtain a baseline performance on this data.

In [5]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter=500)

clf.fit(Xtrain, ytrain)

### Evaluating model performance
When dealing with imbalanced data, accuracy is not always a good metric for evaluating model performance. This is because a model can achieve high accuracy by simply predicting the majority class every time.
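To make this concrete, here is a minimal sketch in plain Python using the 95%/5% split mentioned in the introduction. The labels are made up for illustration, and the "model" is simply a rule that always predicts class 0.

```python
# Hypothetical labels: 95 majority-class (0) and 5 minority-class (1) observations
y_true = [0] * 95 + [1] * 5

# A "model" that ignores its input and always predicts the majority class
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: share of actual 1s that were predicted as 1
minority_hits = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = minority_hits / y_true.count(1)

print(f"accuracy = {accuracy:.2f}, minority recall = {minority_recall:.2f}")
```

The accuracy looks excellent even though not a single minority observation is found.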

Instead, the following metrics will be more useful to us:  
- Precision: This is the proportion of true positives among all predicted positives.  
$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$  
- Recall: This is the proportion of true positives among all positives in the dataset.  
$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$  
- F1 Score: This is the harmonic mean of precision and recall.  
$\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$  
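As a quick check of these definitions, the following sketch computes all three metrics from raw confusion-matrix counts. The function name `precision_recall_f1` and the counts are hypothetical, and returning 0.0 when a denominator is zero is a choice made for this sketch.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts: 30 true positives, 70 false positives, 20 false negatives
precision, recall, f1 = precision_recall_f1(tp=30, fp=70, fn=20)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is a harmonic mean, it is pulled towards the smaller of the two values, so a model cannot score well on it by doing well on only precision or only recall.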

Let's now evaluate how well the above model performed.

In [6]:
from sklearn.metrics import classification_report

def evaluate(model, Xtest, ytest):
    ypred = model.predict(Xtest)
    report = classification_report(ytest, ypred, digits=4)
    print(report)

In [7]:
evaluate(clf, Xtest, ytest)

              precision    recall  f1-score   support

           0     0.8130    1.0000    0.8968    205876
           1     0.9091    0.0002    0.0004     47373

    accuracy                         0.8130    253249
   macro avg     0.8610    0.5001    0.4486    253249
weighted avg     0.8310    0.8130    0.7292    253249



As this report shows, the model almost always predicts that a flight will not be delayed: recall for the delayed class is 0.0002, which gives an f1-score of essentially 0. To fix this, we can try out various techniques for dealing with imbalanced data.

### Dealing with imbalanced data  
Now that we have our dataset and evaluation metrics set up, let's explore the techniques discussed earlier for dealing with imbalanced data.

#### Undersampling
As a reminder, undersampling involves reducing the number of examples in the majority class to match the number of examples in the minority class. 

We will first do this using the `RandomUnderSampler` from the `imbalanced-learn` library. This will randomly remove instances from the majority class.
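Conceptually, random undersampling amounts to the following. This is a simplified sketch in plain Python on made-up data, not the library's implementation; the helper name `random_undersample` is hypothetical.

```python
import random
from collections import Counter

def random_undersample(rows, labels, seed=27):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    n_min = min(counts.values())
    keep = []
    for cls in counts:
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        # Sample without replacement so no observation is kept twice
        keep.extend(rng.sample(idx, n_min))
    keep.sort()
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy data: 8 majority-class rows and 2 minority-class rows
rows = list(range(10))
labels = [0] * 8 + [1] * 2
rows_u, labels_u = random_undersample(rows, labels)
print(Counter(labels_u))
```

All minority rows survive, while six of the eight majority rows are discarded, which is where the information loss mentioned earlier comes from.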

In [8]:
from imblearn.under_sampling import RandomUnderSampler
Xtrain_under_sampled, ytrain_under_sampled = RandomUnderSampler(random_state=27).fit_resample(Xtrain, ytrain)

ytrain_under_sampled.value_counts()

0    189922
1    189922
Name: DEP_DEL15, dtype: int64

As we can see, the `RandomUnderSampler` removed enough observations from the majority class (no delay) such that there is now an equal occurrence of both classes. 
Let's now see how the model performs on this resampled training set.

In [9]:
log_under_sampled = LogisticRegression(max_iter=500)
log_under_sampled.fit(Xtrain_under_sampled, ytrain_under_sampled)
evaluate(log_under_sampled, Xtest, ytest)

              precision    recall  f1-score   support

           0     0.8744    0.5737    0.6928    205876
           1     0.2573    0.6418    0.3673     47373

    accuracy                         0.5864    253249
   macro avg     0.5658    0.6077    0.5301    253249
weighted avg     0.7589    0.5864    0.6319    253249



Immediately, we can see that the model is now picking up instances of the minority class: recall for delayed flights is 0.64 and the f1-score is 0.37. However, this has come at the cost of lower overall accuracy (0.59 compared to 0.81 for the baseline).

#### Oversampling
As mentioned before, oversampling involves duplicating (or creating new) instances of the minority class to match the number of instances of the majority class. 

We will do this using the `RandomOverSampler` from the `imbalanced-learn` library. This will randomly duplicate instances from the minority class.
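The idea behind random oversampling can be sketched in plain Python as follows. This is an illustrative simplification on made-up data rather than the library's implementation; the helper name `random_oversample` is hypothetical.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=27):
    """Duplicate randomly chosen rows of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    n_max = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        # Sample with replacement: the same row may be duplicated several times
        extra = rng.choices(idx, k=n_max - n)
        out_rows.extend(rows[i] for i in extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Toy data: 8 majority-class rows and 2 minority-class rows
rows = list(range(10))
labels = [0] * 8 + [1] * 2
rows_o, labels_o = random_oversample(rows, labels)
print(Counter(labels_o))
```

No new information is created here: the six added minority rows are copies of the two original ones.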

In [10]:
from imblearn.over_sampling import RandomOverSampler
Xtrain_over_sampled, ytrain_over_sampled = RandomOverSampler(random_state=27).fit_resample(Xtrain, ytrain)
ytrain_over_sampled.value_counts()

0    823070
1    823070
Name: DEP_DEL15, dtype: int64

In [11]:
log_over_sampled = LogisticRegression()
log_over_sampled.fit(Xtrain_over_sampled, ytrain_over_sampled)
evaluate(log_over_sampled, Xtest, ytest)

              precision    recall  f1-score   support

           0     0.8739    0.5760    0.6943    205876
           1     0.2574    0.6387    0.3669     47373

    accuracy                         0.5877    253249
   macro avg     0.5656    0.6074    0.5306    253249
weighted avg     0.7586    0.5877    0.6331    253249



Oversampling in this manner has also improved on the baseline model fit before resampling. It has achieved nearly identical results to undersampling.

#### SMOTE
SMOTE (Synthetic Minority Over-Sampling Technique) is a method that creates synthetic minority class observations until the class distribution is equal.
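The interpolation step at the heart of SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration on made-up 2-D points, not the `imbalanced-learn` implementation; the function name `smote_like_sample` and the choice of `k=2` neighbours are assumptions made for this example.

```python
import math
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority point and one of its k nearest minority neighbours."""
    if rng is None:
        rng = random.Random(0)
    base = rng.choice(minority)
    # k nearest minority neighbours of the base point (excluding the point itself)
    neighbours = sorted(
        (p for p in minority if p is not base),
        key=lambda p: math.dist(base, p),
    )[:k]
    neighbour = rng.choice(neighbours)
    lam = rng.random()  # interpolation factor in [0, 1)
    # New point lies on the line segment from base towards neighbour
    return tuple(b + lam * (n - b) for b, n in zip(base, neighbour))

# Hypothetical minority-class observations with two features each
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
rng = random.Random(27)
synthetic = [smote_like_sample(minority, rng=rng) for _ in range(5)]
print(synthetic)
```

Each synthetic point falls between two existing minority observations, which is why the new data resembles the minority class without being an exact duplicate.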

In [12]:
from imblearn.over_sampling import SMOTE
Xtrain_smote, ytrain_smote = SMOTE(random_state=27).fit_resample(Xtrain, ytrain)

ytrain_smote.value_counts()

0    823070
1    823070
Name: DEP_DEL15, dtype: int64

In [13]:
log_smote = LogisticRegression(max_iter=500)
log_smote.fit(Xtrain_smote, ytrain_smote)
evaluate(log_smote, Xtest, ytest)

              precision    recall  f1-score   support

           0     0.8740    0.5778    0.6957    205876
           1     0.2580    0.6381    0.3675     47373

    accuracy                         0.5891    253249
   macro avg     0.5660    0.6080    0.5316    253249
weighted avg     0.7588    0.5891    0.6343    253249



#### A combination of over- and under-sampling  
Another way to deal with imbalanced data is to combine over- and under-sampling. One such method is called SMOTEENN. This is a method in the `imbalanced-learn` library that uses SMOTE to over-sample and another method called Edited Nearest Neighbours to clean (under-sample) the data.
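To give a feel for the cleaning step, here is a simplified sketch of the Edited Nearest Neighbours idea in plain Python: drop any observation whose label disagrees with the majority of its nearest neighbours. It is an illustration on made-up points, not the library's implementation; the function name `edited_nearest_neighbours` and the choice of `k=3` are assumptions made for this example.

```python
import math
from collections import Counter

def edited_nearest_neighbours(points, labels, k=3):
    """Drop every point whose label disagrees with the majority of its k nearest neighbours."""
    keep = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Distance from point i to every other point, paired with that point's label
        others = [(math.dist(p, q), labels[j]) for j, q in enumerate(points) if j != i]
        others.sort(key=lambda pair: pair[0])
        neighbour_labels = [lab for _, lab in others[:k]]
        majority_label = Counter(neighbour_labels).most_common(1)[0][0]
        if majority_label == label:
            keep.append(i)
    return [points[i] for i in keep], [labels[i] for i in keep]

# Toy data: two clusters plus one class-1 point sitting inside the class-0 cluster
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]
clean_points, clean_labels = edited_nearest_neighbours(points, labels)
print(Counter(clean_labels))
```

Only the stray class-1 point inside the class-0 cluster is removed. Because every point needs a nearest-neighbour search, this kind of cleaning can become expensive on a dataset with over a million rows.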

The SMOTEENN syntax is as follows (this took my computer 15 hours to run):

In [15]:
from imblearn.combine import SMOTEENN
Xtrain_sme, ytrain_sme = SMOTEENN(random_state=27).fit_resample(Xtrain, ytrain)

# SMOTEENN's cleaning step removes observations, so the classes need not end up exactly equal
ytrain_sme.value_counts()

In [16]:
log_sme = LogisticRegression(max_iter=500)
log_sme.fit(Xtrain_sme, ytrain_sme)
evaluate(log_sme, Xtest, ytest)

              precision    recall  f1-score   support

           0     0.8239    0.9364    0.8766    205876
           1     0.3208    0.1305    0.1855     47373

    accuracy                         0.7857    253249
   macro avg     0.5724    0.5334    0.5310    253249
weighted avg     0.7298    0.7857    0.7473    253249



Although the SMOTEENN resampling took many hours to run, it did not result in a significantly better model than the other methods discussed above. It did give better precision on the minority class than the other resampling techniques (0.32 compared to about 0.26), but much lower recall (0.13) and f1-score (0.19).

#### Class Weighting

In many classification models, there is the option to specify a `class_weight` parameter. If this is set to `'balanced'`, the model will give more weight to the minority class and less to the majority class. This means that the model will penalize the misclassification of the minority class more than that of the majority class, leading to better predictions of the minority class.
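For reference, scikit-learn documents the `'balanced'` heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python on made-up labels; the helper name `balanced_class_weights` is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights following n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Hypothetical labels with an 80/20 split
labels = [0] * 80 + [1] * 20
weights = balanced_class_weights(labels)
print(weights)  # {0: 0.625, 1: 2.5}
```

With these weights, each class contributes the same total weight to the loss (80 × 0.625 = 20 × 2.5 = 50), so the minority class is no longer drowned out by sheer numbers.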

For logistic regression, we do the following:


In [17]:
log_class_weights = LogisticRegression(class_weight='balanced')
log_class_weights.fit(Xtrain, ytrain)
evaluate(log_class_weights, Xtest, ytest)

              precision    recall  f1-score   support

           0     0.8744    0.5758    0.6943    205876
           1     0.2578    0.6405    0.3677     47373

    accuracy                         0.5879    253249
   macro avg     0.5661    0.6082    0.5310    253249
weighted avg     0.7591    0.5879    0.6332    253249



Setting `class_weight='balanced'` substantially improved minority-class recall and f1-score compared to the baseline model without it, with results nearly identical to those of random resampling. It is a very simple and fast way to adjust for imbalanced data.

### Wrapping Up
Surprisingly, all of the techniques except SMOTEENN yielded very similar results. The relative performance of these methods will likely vary depending on the data. Additionally, these comparisons were only performed with a simple logistic regression model. It is very likely that the effect of resampling will be different for different models as well.

I encourage you to try these methods on your own data. What is best in one case may not be best in every circumstance, so it is a good idea to become familiar with the possibilities. For more information, check out the [imbalanced-learn documentation](https://imbalanced-learn.org/stable/user_guide.html#user-guide).

As always, thank you for reading and feel free to let me know your thoughts, questions, ideas to improve, etc.