# Dealing with class imbalance

If you have worked on classification problems, you have probably run into imbalanced classes. This occurs in datasets where the classes have very different numbers of observations. In a binary classification problem, you'd have many observations from one class and very few from the other. The same can happen in a multi-class problem when the vast majority of the observations fall into one category, or when one category is highly under-represented compared with the rest.

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

## Load the dataset

In [7]:
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV, cross_val_score

In [8]:
df = pd.read_csv('wine_data/winequality_merged.csv')
df.columns = [col.replace(' ','_') for col in df.columns]
df.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,red_wine
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,1
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,1
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,1
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1


In [9]:
df.red_wine.value_counts()

0    4898
1    1599
Name: red_wine, dtype: int64

We want to predict whether a wine is of good or bad quality, where bad means a quality score below 5.

In [10]:
y = df.pop('quality')

In [11]:
y.value_counts()

6    2836
5    2138
7    1079
4     216
8     193
3      30
9       5
Name: quality, dtype: int64

In [12]:
y = (y > 4)*1
X = df

In [13]:
y.value_counts(normalize=True)

1    0.962136
0    0.037864
Name: quality, dtype: float64

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                        test_size=0.2, stratify=y, random_state=1)

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

X_train = pd.DataFrame(X_train, columns=X.columns, index=y_train.index)
X_test = pd.DataFrame(X_test,  columns=X.columns, index=y_test.index)



## Fit the model

In [15]:
def fitter(model, X_train, y_train, X_test, y_test):
    """
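    Takes a model plus training and test sets, fits the model on the training set
    and prints, in order: training accuracy, mean 5-fold cross-validated accuracy,
    test accuracy, then the confusion matrix and classification report for the
    training set and for the test set.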
    """
    model.fit(X_train, y_train)
    print(model.score(X_train, y_train))
    print(cross_val_score(model, X_train, y_train, cv=5).mean())
    print(model.score(X_test, y_test))
    print()
    print(confusion_matrix(y_train, model.predict(X_train)))
    print()
    print(classification_report(y_train, model.predict(X_train)))
    print()
    print(confusion_matrix(y_test, model.predict(X_test)))
    print()
    print(classification_report(y_test, model.predict(X_test)))

In [16]:
from sklearn.linear_model import LogisticRegression

In [17]:
model = LogisticRegression(solver='lbfgs', multi_class='ovr', max_iter=1000)

In [18]:
model.fit(X_train, y_train)
print(model.score(X_train, y_train))
print(model.score(X_test, y_test))

0.9630556090051953
0.9615384615384616


In [19]:
fitter(model, X_train, y_train, X_test, y_test)

0.9630556090051953
0.9628635152143333
0.9615384615384616

[[   9  188]
 [   4 4996]]

              precision    recall  f1-score   support

           0       0.69      0.05      0.09       197
           1       0.96      1.00      0.98      5000

   micro avg       0.96      0.96      0.96      5197
   macro avg       0.83      0.52      0.53      5197
weighted avg       0.95      0.96      0.95      5197


[[   0   49]
 [   1 1250]]

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        49
           1       0.96      1.00      0.98      1251

   micro avg       0.96      0.96      0.96      1300
   macro avg       0.48      0.50      0.49      1300
weighted avg       0.93      0.96      0.94      1300
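
To see why an accuracy of about 0.96 is less impressive than it looks, it helps to compare against a classifier that always predicts the majority class. The sketch below is an illustration, not part of the original analysis: it rebuilds labels with the test-set class counts reported above (49 bad, 1251 good) and uses scikit-learn's `DummyClassifier`. The names `y_true` and `X_dummy` are made-up stand-ins for the real data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Synthetic labels with the same class counts as the test set above:
# 49 bad wines (class 0) and 1251 good wines (class 1).
y_true = np.array([0] * 49 + [1] * 1251)
X_dummy = np.zeros((len(y_true), 1))  # placeholder features, ignored by the dummy model

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_pred = baseline.predict(X_dummy)

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)
recall_minority = recall_score(y_true, y_pred, pos_label=0)

print(round(acc, 4), bal_acc, recall_minority)
```

The baseline reaches 1251/1300 ≈ 0.962 accuracy with zero recall on class 0, so the logistic regression's test accuracy of 0.9615 is no better than always guessing "good". Balanced accuracy or the per-class recall is a more informative yardstick for this problem.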



We can bundle the scaler and the model into a neat pipeline. The data were already standardized above, so the scores do not change.

In [20]:
from sklearn.pipeline import make_pipeline

In [21]:
pipe = make_pipeline(StandardScaler(),model)

pipe.fit(X_train, y_train)

fitter(pipe, X_train, y_train, X_test, y_test)

0.9630556090051953
0.9628635152143333
0.9615384615384616

[[   9  188]
 [   4 4996]]

              precision    recall  f1-score   support

           0       0.69      0.05      0.09       197
           1       0.96      1.00      0.98      5000

   micro avg       0.96      0.96      0.96      5197
   macro avg       0.83      0.52      0.53      5197
weighted avg       0.95      0.96      0.95      5197


[[   0   49]
 [   1 1250]]

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        49
           1       0.96      1.00      0.98      1251

   micro avg       0.96      0.96      0.96      1300
   macro avg       0.48      0.50      0.49      1300
weighted avg       0.93      0.96      0.94      1300



## Balance the class weights

One way of treating class imbalances when using logistic regression is to use `class_weight='balanced'`.

In [22]:
model_balanced = LogisticRegression(solver='lbfgs', 
                                    multi_class='ovr', 
                                    max_iter=1000,
                                    class_weight='balanced')


fitter(model_balanced, X_train, y_train, X_test, y_test)

0.7523571291129498
0.7490886207151847
0.7676923076923077

[[ 136   61]
 [1226 3774]]

              precision    recall  f1-score   support

           0       0.10      0.69      0.17       197
           1       0.98      0.75      0.85      5000

   micro avg       0.75      0.75      0.75      5197
   macro avg       0.54      0.72      0.51      5197
weighted avg       0.95      0.75      0.83      5197


[[ 30  19]
 [283 968]]

              precision    recall  f1-score   support

           0       0.10      0.61      0.17        49
           1       0.98      0.77      0.87      1251

   micro avg       0.77      0.77      0.77      1300
   macro avg       0.54      0.69      0.52      1300
weighted avg       0.95      0.77      0.84      1300
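
As far as the scikit-learn documentation describes it, `class_weight='balanced'` weights each class by `n_samples / (n_classes * np.bincount(y))`, so errors on the rare class cost more. The sketch below checks that formula against `compute_class_weight`. The array `y_demo` is synthetic, built from the training-set class counts visible in the reports above (197 bad, 5000 good).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with the training-set class counts from this notebook.
y_demo = np.array([0] * 197 + [1] * 5000)

# Weights scikit-learn derives for class_weight='balanced'.
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]),
                               y=y_demo)

# The same weights computed by hand: n_samples / (n_classes * class_count).
manual = len(y_demo) / (2 * np.bincount(y_demo))

print(weights)
print(manual)
```

With these counts a misclassified bad wine weighs roughly 25 times as much as a misclassified good one (5197/394 ≈ 13.19 versus 5197/10000 ≈ 0.52), which is consistent with recall on class 0 rising at the expense of overall accuracy.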



## Resampling using [imbalanced-learn](http://contrib.scikit-learn.org/imbalanced-learn/stable/index.html)

Alternatively, we can resample the training data: undersample the majority class, oversample the minority class, or generate new synthetic samples.

### Undersampling

We create class balance by selecting a random subset of the majority class.
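As a rough sketch of what random undersampling amounts to, the following NumPy-only example keeps every minority row and draws an equally sized subset of the majority rows without replacement. The toy arrays `X_toy` and `y_toy` are hypothetical, and this is not a description of how imbalanced-learn implements its sampler internally.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data: 20 minority rows (class 0) and 500 majority rows (class 1).
y_toy = np.array([0] * 20 + [1] * 500)
X_toy = rng.normal(size=(len(y_toy), 3))

minority_idx = np.flatnonzero(y_toy == 0)
majority_idx = np.flatnonzero(y_toy == 1)

# Keep every minority row and draw an equally sized subset of the
# majority class without replacement.
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
keep = np.sort(np.concatenate([minority_idx, kept_majority]))

X_under, y_under = X_toy[keep], y_toy[keep]
print(np.bincount(y_under))
```

The price of this balance is that most majority-class rows are thrown away, which is why the undersampled training set in this notebook shrinks to a few hundred observations.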

In [23]:
from imblearn.under_sampling import RandomUnderSampler

In [24]:
sampler = RandomUnderSampler(random_state=1)
X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)

pd.Series(y_resampled).value_counts()

1    197
0    197
dtype: int64

In [25]:
fitter(model, X_resampled, y_resampled, X_test, y_test)

0.7309644670050761
0.6804487179487179
0.7515384615384615

[[135  62]
 [ 44 153]]

              precision    recall  f1-score   support

           0       0.75      0.69      0.72       197
           1       0.71      0.78      0.74       197

   micro avg       0.73      0.73      0.73       394
   macro avg       0.73      0.73      0.73       394
weighted avg       0.73      0.73      0.73       394


[[ 29  20]
 [303 948]]

              precision    recall  f1-score   support

           0       0.09      0.59      0.15        49
           1       0.98      0.76      0.85      1251

   micro avg       0.75      0.75      0.75      1300
   macro avg       0.53      0.67      0.50      1300
weighted avg       0.95      0.75      0.83      1300



Undersampling can be used during gridsearch.

In [26]:
model = LogisticRegression(solver='lbfgs', multi_class='ovr', max_iter=1000)

model_params = {'penalty': ['l2'],
                'C': np.logspace(-2, 2, 5)}

gs = GridSearchCV(model, model_params, cv=5)

fitter(gs, X_resampled, y_resampled, X_test, y_test)

0.7157360406091371
0.6753205128205128
0.75

[[133  64]
 [ 48 149]]

              precision    recall  f1-score   support

           0       0.73      0.68      0.70       197
           1       0.70      0.76      0.73       197

   micro avg       0.72      0.72      0.72       394
   macro avg       0.72      0.72      0.72       394
weighted avg       0.72      0.72      0.72       394


[[ 27  22]
 [303 948]]

              precision    recall  f1-score   support

           0       0.08      0.55      0.14        49
           1       0.98      0.76      0.85      1251

   micro avg       0.75      0.75      0.75      1300
   macro avg       0.53      0.65      0.50      1300
weighted avg       0.94      0.75      0.83      1300



### Oversampling

We create class balance by sampling the minority class with replacement (bootstrapping).

Be careful when combining resampling with cross-validation. **If you oversample before cross-validation, copies of the same observation can end up in both the training and the validation folds, which makes the cross-validated score too optimistic.**
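The leak is easy to make visible on toy data. The sketch below oversamples a hypothetical minority class with plain NumPy before splitting, remembers which original row each sample came from (`source_row`), and counts, per fold, how many validation rows have a copy of themselves in the training part. All names and sizes are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(1)

# Hypothetical toy data: 20 minority rows (class 0), 200 majority rows (class 1).
X_toy = rng.normal(size=(220, 2))
y_toy = np.array([0] * 20 + [1] * 200)

# Oversample the minority class with replacement BEFORE splitting.
minority_idx = np.flatnonzero(y_toy == 0)
extra = rng.choice(minority_idx, size=180, replace=True)
source_row = np.concatenate([np.arange(len(y_toy)), extra])  # original row of each sample
X_over, y_over = X_toy[source_row], y_toy[source_row]

leaked_per_fold = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_over, y_over):
    train_sources = set(source_row[train_idx])
    # A validation row leaks if a copy of the same original row is in training.
    leaked = sum(source_row[i] in train_sources for i in val_idx)
    leaked_per_fold.append(leaked)

print(leaked_per_fold)
```

Every non-zero entry is a validation row the model has effectively already seen during training, so the validation score for the minority class is measured partly on memorised data.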

In [20]:
from imblearn.over_sampling import RandomOverSampler

In [21]:
sampler = RandomOverSampler(random_state=1)
X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)

print(pd.Series(y_resampled).value_counts())

1    5000
0    5000
dtype: int64


In [22]:
fitter(model, X_resampled, y_resampled, X_test, y_test)

0.7202
0.7217
0.7638461538461538

[[3430 1570]
 [1228 3772]]

              precision    recall  f1-score   support

           0       0.74      0.69      0.71      5000
           1       0.71      0.75      0.73      5000

   micro avg       0.72      0.72      0.72     10000
   macro avg       0.72      0.72      0.72     10000
weighted avg       0.72      0.72      0.72     10000


[[ 29  20]
 [287 964]]

              precision    recall  f1-score   support

           0       0.09      0.59      0.16        49
           1       0.98      0.77      0.86      1251

   micro avg       0.76      0.76      0.76      1300
   macro avg       0.54      0.68      0.51      1300
weighted avg       0.95      0.76      0.84      1300



If we want to use cross-validation, we have to set it up by hand and do the oversampling **after** the folds have been created, resampling only the training part of each split.

In [23]:
from sklearn.model_selection import GridSearchCV

In [27]:
"""---------------------"""
# wrong way: the data were resampled before cross-validation
"""---------------------"""

model = LogisticRegression(solver='lbfgs', multi_class='ovr', max_iter=1000)

model_params = {'penalty': ['l2'],
                'C': np.logspace(-2, 2, 5)}

gs = GridSearchCV(model, model_params, cv=5)
fitter(gs, X_resampled, y_resampled, X_test, y_test)

0.7157360406091371
0.6753205128205128
0.75

[[133  64]
 [ 48 149]]

              precision    recall  f1-score   support

           0       0.73      0.68      0.70       197
           1       0.70      0.76      0.73       197

   micro avg       0.72      0.72      0.72       394
   macro avg       0.72      0.72      0.72       394
weighted avg       0.72      0.72      0.72       394


[[ 27  22]
 [303 948]]

              precision    recall  f1-score   support

           0       0.08      0.55      0.14        49
           1       0.98      0.76      0.85      1251

   micro avg       0.75      0.75      0.75      1300
   macro avg       0.53      0.65      0.50      1300
weighted avg       0.94      0.75      0.83      1300



In [28]:
from sklearn.model_selection import StratifiedKFold

In [26]:
"""---------------------"""
# correct way: create the folds first, then oversample only the training part of each fold
"""---------------------"""

kf = StratifiedKFold(n_splits=5)

for C_current in np.logspace(-2, 4, 10):
    model = LogisticRegression(C=C_current, solver='lbfgs', multi_class='ovr', max_iter=1000)
    
    scores = []
    for train, test in kf.split(X_train, y_train):
        X_train_now, X_test_now = X_train.iloc[train, :], X_train.iloc[test, :]
        y_train_now, y_test_now = y_train.iloc[train], y_train.iloc[test]

        X_resampled, y_resampled = sampler.fit_resample(X_train_now, y_train_now)

        model.fit(X_resampled, y_resampled)
        scores.append(model.score(X_test_now, y_test_now))

    print(np.round(C_current, 3), '\t', 
          np.round(np.mean(scores), 3), '\t', 
          np.round(model.score(X_test, y_test), 3))
    print()

0.01 	 0.749 	 0.768

0.046 	 0.75 	 0.772

0.215 	 0.748 	 0.778

1.0 	 0.747 	 0.776

4.642 	 0.747 	 0.777

21.544 	 0.747 	 0.777

100.0 	 0.747 	 0.777

464.159 	 0.747 	 0.777

2154.435 	 0.747 	 0.777

10000.0 	 0.747 	 0.777



### Using a pipeline

With imbalanced-learn's pipeline wrapper, the sampler is applied only when the pipeline is fitted, so resampling can be combined with cross-validation without danger.

In [27]:
from imblearn.pipeline import make_pipeline as make_pipeline_imb

In [28]:
pipe = make_pipeline_imb(RandomOverSampler(random_state=1),
                         LogisticRegression(solver='lbfgs', multi_class='ovr', max_iter=1000))

pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)


fitter(pipe, X_train, y_train, X_test, y_test)

0.751779873003656
0.7471638779891908
0.7638461538461538

[[ 135   62]
 [1228 3772]]

              precision    recall  f1-score   support

           0       0.10      0.69      0.17       197
           1       0.98      0.75      0.85      5000

   micro avg       0.75      0.75      0.75      5197
   macro avg       0.54      0.72      0.51      5197
weighted avg       0.95      0.75      0.83      5197


[[ 29  20]
 [287 964]]

              precision    recall  f1-score   support

           0       0.09      0.59      0.16        49
           1       0.98      0.77      0.86      1251

   micro avg       0.76      0.76      0.76      1300
   macro avg       0.54      0.68      0.51      1300
weighted avg       0.95      0.76      0.84      1300



The next two methods, SMOTE and ADASYN, restore class balance by creating synthetic samples, i.e. new minority-class observations that were not contained in the original dataset. They are created by looking at the k-nearest neighbors of existing minority observations.
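The core idea behind SMOTE can be sketched in a few lines: pick a minority point, pick one of its k nearest minority neighbors, and place a new point at a random position on the line segment between them. The example below is a simplified illustration with hypothetical data (`X_min`), not imbalanced-learn's actual implementation. ADASYN builds on the same interpolation but concentrates new points around minority observations that have many majority-class neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)

# Hypothetical minority-class points in two dimensions.
X_min = rng.normal(size=(30, 2))
k = 5

# Ask for k + 1 neighbors because each point is its own nearest neighbor.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
_, neighbor_idx = nn.kneighbors(X_min)

n_new = 10
base = rng.integers(0, len(X_min), size=n_new)   # randomly chosen minority points
which = rng.integers(1, k + 1, size=n_new)       # column 0 is the point itself
picked = neighbor_idx[base, which]               # one of the k neighbors of each base point
gap = rng.random(size=(n_new, 1))                # position along the segment, in [0, 1)

# Interpolate between each base point and its chosen neighbor.
X_synth = X_min[base] + gap * (X_min[picked] - X_min[base])
print(X_synth.shape)
```

Since every synthetic point lies between two existing minority points, the new samples stay inside the region the minority class already occupies rather than being exact copies of existing rows.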

### Oversampling with SMOTE

In [29]:
from imblearn.over_sampling import SMOTE

In [30]:
sampler = SMOTE(random_state=1)
X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)

pd.Series(y_resampled).value_counts()

1    5000
0    5000
dtype: int64

In [31]:
pipe = make_pipeline_imb(SMOTE(random_state=1),
                         LogisticRegression(solver='lbfgs', multi_class='ovr', max_iter=1000))

pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)


fitter(pipe, X_train, y_train, X_test, y_test)

0.7600538772368675
0.7571685052195157
0.7784615384615384

[[ 134   63]
 [1184 3816]]

              precision    recall  f1-score   support

           0       0.10      0.68      0.18       197
           1       0.98      0.76      0.86      5000

   micro avg       0.76      0.76      0.76      5197
   macro avg       0.54      0.72      0.52      5197
weighted avg       0.95      0.76      0.83      5197


[[ 30  19]
 [269 982]]

              precision    recall  f1-score   support

           0       0.10      0.61      0.17        49
           1       0.98      0.78      0.87      1251

   micro avg       0.78      0.78      0.78      1300
   macro avg       0.54      0.70      0.52      1300
weighted avg       0.95      0.78      0.85      1300



### Oversampling with ADASYN

In [32]:
from imblearn.over_sampling import ADASYN

In [33]:
sampler = ADASYN(random_state=1, n_neighbors=5)
X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)

pd.Series(y_resampled).value_counts()

1    5000
0    4965
dtype: int64

In [34]:
pipe = make_pipeline_imb(ADASYN(random_state=1, n_neighbors=5),
                         LogisticRegression(solver='lbfgs', multi_class='ovr', max_iter=1000))

pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)


fitter(pipe, X_train, y_train, X_test, y_test)

0.7479314989416971
0.7421594358480788
0.7715384615384615

[[ 138   59]
 [1251 3749]]

              precision    recall  f1-score   support

           0       0.10      0.70      0.17       197
           1       0.98      0.75      0.85      5000

   micro avg       0.75      0.75      0.75      5197
   macro avg       0.54      0.73      0.51      5197
weighted avg       0.95      0.75      0.83      5197


[[ 29  20]
 [277 974]]

              precision    recall  f1-score   support

           0       0.09      0.59      0.16        49
           1       0.98      0.78      0.87      1251

   micro avg       0.77      0.77      0.77      1300
   macro avg       0.54      0.69      0.52      1300
weighted avg       0.95      0.77      0.84      1300

