# Classification with Imbalanced Data

The dataset contains information about customer transactions and is available on [Kaggle](https://www.kaggle.com/datasets/computingvictor/transactions-fraud-datasets). It has been modified for the purpose of this notebook.

The objective is to build a predictive model that accurately classifies whether a customer will churn, using features such as transaction counts and amounts, age, income, debt, and credit score. This notebook follows an example from the [imbalanced-learn documentation](https://imbalanced-learn.org/stable/auto_examples/applications/plot_impact_imbalanced_classes.html#).

In [1]:
# Core libraries
import pandas as pd

%config InlineBackend.figure_format = "retina"

## Data

In [2]:
# Read data
path_data = "https://github.com/pabloestradac/causalml-basics/raw/main/data/"
df = pd.read_csv(path_data + 'churn.csv')
df.describe().round(2)

Statistic,num_transactions,total_amount,avg_amount,std_amount,churn,current_age,yearly_income,total_debt,credit_score,num_credit_cards
count,1219.0,1219.0,1219.0,1219.0,1219.0,1219.0,1219.0,1219.0,1219.0,1219.0
mean,10915.43,469102.15,43.91,78.71,0.03,53.05,45117.95,58196.86,713.22,3.7
std,5607.36,292602.85,17.48,27.27,0.18,15.66,23292.43,51255.43,65.55,1.57
min,760.0,26605.34,5.34,9.71,0.0,23.0,1.0,0.0,488.0,1.0
25%,7223.5,269693.84,32.22,61.25,0.0,41.0,32075.0,18527.5,684.0,3.0
50%,9832.0,398837.16,40.72,74.25,0.0,51.0,40012.0,51984.0,715.0,4.0
75%,13349.0,597774.02,52.55,91.87,0.0,63.0,52176.5,84080.5,755.0,5.0
max,48479.0,2445773.25,147.24,279.91,1.0,101.0,280199.0,461854.0,850.0,9.0


Only about 3% of the customers in the dataset have churned (the mean of `churn` is 0.03), so the dataset is highly imbalanced.
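To illustrate why this imbalance matters, here is a minimal sketch on made-up labels with the same 3% positive rate. The arrays `X_toy` and `y_toy` are illustrative assumptions, not the churn data. A model that always predicts the majority class looks accurate while detecting no churners at all.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Toy labels (assumption, not the churn data): 970 non-churners, 30 churners
y_toy = np.array([0] * 970 + [1] * 30)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit
majority_clf = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_pred_toy = majority_clf.predict(X_toy)

toy_accuracy = accuracy_score(y_toy, y_pred_toy)  # share of correct predictions
toy_recall = recall_score(y_toy, y_pred_toy)      # share of churners detected
print(f"accuracy={toy_accuracy:.2f}, churn recall={toy_recall:.2f}")
# prints: accuracy=0.97, churn recall=0.00
```

This is why plain accuracy alone is a poor guide on this kind of data.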

## Logit

In [3]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

num_pipe = make_pipeline(
    StandardScaler(), SimpleImputer(strategy="mean", add_indicator=True)
)
cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)

preprocessor_linear = make_column_transformer(
    (num_pipe, make_column_selector(dtype_include="number")),
    (cat_pipe, make_column_selector(dtype_include="category")),
    n_jobs=2,
)

lr_clf = make_pipeline(preprocessor_linear, LogisticRegression(max_iter=1000))

We start with `LogisticRegression` as a linear baseline. Categorical columns are one-hot encoded, and numerical columns are standardized, with missing values imputed by the mean.

In [4]:
index = []
scores = {"Accuracy": [], "Balanced accuracy": []}
scoring = ["accuracy", "balanced_accuracy"]

X = df.drop(columns=['churn'])
y = df['churn']

In [5]:
index += ["Logistic regression"]
# Note: 3-fold CV here, whereas later cells use the cross_validate default (5-fold)
cv_result = cross_validate(lr_clf, X, y, scoring=scoring, cv=3)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

df_scores = pd.DataFrame(scores, index=index).round(3)
df_scores

Model,Accuracy,Balanced accuracy
Logistic regression,0.966,0.512


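The balanced accuracy score is the macro-average of the per-class recall scores or, equivalently, raw accuracy where each sample is weighted by the inverse prevalence of its true class. For balanced datasets, the score therefore equals accuracy. In a binary problem, always predicting the majority class scores 0.5, so the 0.512 above shows that the baseline detects almost no churners despite its 96.6% accuracy.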
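The two equivalent definitions can be checked on a small example. The sketch below uses hand-made labels (`y_true_toy` and `y_hat_toy` are illustrative assumptions) and compares macro-averaged recall, inverse-prevalence weighted accuracy, and scikit-learn's `balanced_accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Toy example (assumption): 8 negatives, 2 positives, one positive is missed
y_true_toy = np.array([0] * 8 + [1] * 2)
y_hat_toy = np.array([0] * 8 + [1, 0])

# Definition 1: unweighted mean of the per-class recalls (1.0 and 0.5)
macro_recall = recall_score(y_true_toy, y_hat_toy, average="macro")

# Definition 2: accuracy with each sample weighted by 1 / (size of its true class)
class_counts = np.bincount(y_true_toy)
inv_prev_weights = 1.0 / class_counts[y_true_toy]
weighted_acc = accuracy_score(y_true_toy, y_hat_toy, sample_weight=inv_prev_weights)

bal_acc = balanced_accuracy_score(y_true_toy, y_hat_toy)
plain_acc = accuracy_score(y_true_toy, y_hat_toy)
print(f"macro recall={macro_recall:.2f}, weighted acc={weighted_acc:.2f}, "
      f"balanced acc={bal_acc:.2f}, plain acc={plain_acc:.2f}")
# prints: macro recall=0.75, weighted acc=0.75, balanced acc=0.75, plain acc=0.90
```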

## Random Forest

In [6]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder

num_pipe = SimpleImputer(strategy="mean", add_indicator=True)
cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
)

preprocessor_tree = make_column_transformer(
    (num_pipe, make_column_selector(dtype_include="number")),
    (cat_pipe, make_column_selector(dtype_include="category")),
    n_jobs=2,
)

rf_clf = make_pipeline(
    preprocessor_tree, RandomForestClassifier(random_state=42, n_jobs=2)
)

We can also set up a tree-based model with `RandomForestClassifier`. Tree-based models do not require scaled numerical features, and ordinal encoding is sufficient for the categorical columns.

In [7]:
index += ["Random forest"]
cv_result = cross_validate(rf_clf, X, y, scoring=scoring)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

df_scores = pd.DataFrame(scores, index=index).round(3)
df_scores

Model,Accuracy,Balanced accuracy
Logistic regression,0.966,0.512
Random forest,0.962,0.509


## Using weights

We can set `class_weight="balanced"` so that each sample's weight in the loss is inversely proportional to the frequency of its class.
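For intuition, the sketch below computes the weights that the `"balanced"` heuristic assigns, using scikit-learn's `compute_class_weight` on made-up labels (`y_toy` is an illustrative assumption with a 3% positive rate). The heuristic is `n_samples / (n_classes * np.bincount(y))`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (assumption, not the churn data): 97 negatives, 3 positives
y_toy = np.array([0] * 97 + [1] * 3)
classes_toy = np.unique(y_toy)

# Weights as computed by scikit-learn for class_weight="balanced"
balanced_weights = compute_class_weight(
    class_weight="balanced", classes=classes_toy, y=y_toy
)

# The same heuristic written out by hand
manual_weights = len(y_toy) / (len(classes_toy) * np.bincount(y_toy))

print(dict(zip(classes_toy.tolist(), balanced_weights.round(3).tolist())))
# prints: {0: 0.515, 1: 16.667}
```

With these weights, each class contributes the same total weight to the loss (97 × 0.515 ≈ 3 × 16.667 ≈ 50).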

In [8]:
lr_clf.set_params(logisticregression__class_weight="balanced")

index += ["Logistic regression with balanced class weights"]
cv_result = cross_validate(lr_clf, X, y, scoring=scoring)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

df_scores = pd.DataFrame(scores, index=index).round(3)
df_scores

Model,Accuracy,Balanced accuracy
Logistic regression,0.966,0.512
Random forest,0.962,0.509
Logistic regression with balanced class weights,0.709,0.71


In [9]:
rf_clf.set_params(randomforestclassifier__class_weight="balanced")

index += ["Random forest with balanced class weights"]
cv_result = cross_validate(rf_clf, X, y, scoring=scoring)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

df_scores = pd.DataFrame(scores, index=index).round(3)
df_scores

Model,Accuracy,Balanced accuracy
Logistic regression,0.966,0.512
Random forest,0.962,0.509
Logistic regression with balanced class weights,0.709,0.71
Random forest with balanced class weights,0.963,0.499


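Reweighting was effective for logistic regression: balanced accuracy rose from 0.512 to 0.710, although plain accuracy fell from 0.966 to 0.709 because more majority-class customers are now flagged as churners. Random forest, however, remains biased toward the majority class (balanced accuracy of 0.499), mainly because its splitting criterion is not well suited to handling class imbalance.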

## Undersampling


In [10]:
from imblearn.pipeline import make_pipeline as make_pipeline_with_sampler
from imblearn.under_sampling import RandomUnderSampler

lr_clf = make_pipeline_with_sampler(
    preprocessor_linear,
    RandomUnderSampler(random_state=42),
    LogisticRegression(max_iter=1000),
)

In [11]:
index += ["Under-sampling + Logistic regression"]
cv_result = cross_validate(lr_clf, X, y, scoring=scoring)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores

Model,Accuracy,Balanced accuracy
Logistic regression,0.966367,0.511905
Random forest,0.962265,0.508987
Logistic regression with balanced class weights,0.708774,0.709851
Random forest with balanced class weights,0.963088,0.498729
Under-sampling + Logistic regression,0.688278,0.664385


In [12]:
rf_clf = make_pipeline_with_sampler(
    preprocessor_tree,
    RandomUnderSampler(random_state=42),
    RandomForestClassifier(random_state=42, n_jobs=2),
)

In [13]:
index += ["Under-sampling + Random forest"]
cv_result = cross_validate(rf_clf, X, y, scoring=scoring)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores

Model,Accuracy,Balanced accuracy
Logistic regression,0.966367,0.511905
Random forest,0.962265,0.508987
Logistic regression with balanced class weights,0.708774,0.709851
Random forest with balanced class weights,0.963088,0.498729
Under-sampling + Logistic regression,0.688278,0.664385
Under-sampling + Random forest,0.712052,0.668787


## Oversampling

In [15]:
from imblearn.over_sampling import SMOTE

lr_clf = make_pipeline_with_sampler(
    preprocessor_linear,
    SMOTE(random_state=42),
    LogisticRegression(max_iter=1000),
)

In [16]:
index += ["SMOTE + Logistic regression"]
cv_result = cross_validate(lr_clf, X, y, scoring=scoring)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores

Model,Accuracy,Balanced accuracy
Logistic regression,0.966367,0.511905
Random forest,0.962265,0.508987
Logistic regression with balanced class weights,0.708774,0.709851
Random forest with balanced class weights,0.963088,0.498729
Under-sampling + Logistic regression,0.688278,0.664385
Under-sampling + Random forest,0.712052,0.668787
SMOTE + Logistic regression,0.721902,0.683199


In [17]:
rf_clf = make_pipeline_with_sampler(
    preprocessor_tree,
    SMOTE(random_state=42),
    RandomForestClassifier(random_state=42, n_jobs=2),
)

In [18]:
index += ["SMOTE + Random forest"]
cv_result = cross_validate(rf_clf, X, y, scoring=scoring)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores

Model,Accuracy,Balanced accuracy
Logistic regression,0.966367,0.511905
Random forest,0.962265,0.508987
Logistic regression with balanced class weights,0.708774,0.709851
Random forest with balanced class weights,0.963088,0.498729
Under-sampling + Logistic regression,0.688278,0.664385
Under-sampling + Random forest,0.712052,0.668787
SMOTE + Logistic regression,0.721902,0.683199
SMOTE + Random forest,0.930284,0.527269
