## Handling imbalanced datasets


Consider a classification problem where we have a dataset (1000 data points) and the output is a category (yes/no).

Suppose we have 900 yes and 100 no data points (9:1)
===> Imbalance in the dataset

A model trained on this data tends to become biased toward the majority class (yes), because predicting yes for every point is already correct 90% of the time.
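
To make the bias concrete, here is a minimal sketch (NumPy only) of a "model" that always predicts the majority class on the 900/100 split described above. The names `y_true`, `y_pred`, and `minority_recall` are illustrative and not part of the notebook.

```python
import numpy as np

# 900 "yes" (1) and 100 "no" (0) labels, matching the 9:1 example
y_true = np.array([1] * 900 + [0] * 100)

# A trivial "model" that ignores its input and always predicts the majority class
y_pred = np.ones_like(y_true)

# Overall accuracy looks good even though the model has learned nothing
accuracy = (y_pred == y_true).mean()

# Recall on the minority class: share of true "no" points predicted as "no"
minority_recall = (y_pred[y_true == 0] == 0).mean()

print(accuracy, minority_recall)  # 0.9 0.0
```

The high accuracy hides the fact that not a single minority point is identified, which is why the class counts are balanced before training.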

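Two techniques -> up sampling and down sampling

In up sampling we increase the number of minority-class points (by drawing from them with replacement) until it matches the majority class. In down sampling we reduce the majority class to the size of the minority class.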
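
The two ideas can be sketched on a toy example before touching a DataFrame. This is a NumPy-only illustration, not the `sklearn` call used later in the notebook; the arrays `majority` and `minority` are stand-ins for row indices and are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for row indices: 90 majority rows and 10 minority rows
majority = np.arange(0, 90)
minority = np.arange(90, 100)

# Up sampling: draw minority rows WITH replacement until there are as many
# as majority rows (rows are necessarily repeated)
minority_up = rng.choice(minority, size=len(majority), replace=True)

# Down sampling: draw majority rows WITHOUT replacement until there are as
# few as minority rows (the rest are discarded)
majority_down = rng.choice(majority, size=len(minority), replace=False)

print(len(minority_up), len(majority_down))  # 90 10
```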

In [1]:
import numpy as np
import pandas as pd

np.random.seed(123)

n_sample = 1000
class_0_ratio = 0.9 ##90 percent class ratio
n_class_0 = int(n_sample * class_0_ratio)
n_class_1 = n_sample - n_class_0

In [2]:
n_class_0, n_class_1

(900, 100)

In [7]:
##CREATE DATAFRAME WITH IMBALANCED DATASET
class_0 = pd.DataFrame({
    "feature_1": np.random.normal(loc=2, scale=1, size=n_class_0),
    "feature_2": np.random.normal(loc=2, scale=1, size=n_class_0),
    "target": [0]*n_class_0
})
class_1 = pd.DataFrame({
    "feature_1": np.random.normal(loc=2, scale=1, size=n_class_1),
    "feature_2": np.random.normal(loc=2, scale=1, size=n_class_1),
    "target": [1]*n_class_1
})

In [9]:
df = pd.concat([class_0, class_1]).reset_index(drop=True)

In [10]:
df.head()

   feature_1  feature_2  target
0   0.225776   2.285744       0
1   0.798623   2.333279       0
2   3.096257   2.531807       0
3   2.861037   1.645234       0
4   0.479633   0.879185       0

In [11]:
df['target'].value_counts()

target
0    900
1    100
Name: count, dtype: int64

In [12]:
##Up Sampling
df_minority = df[df['target']==1]
df_majority = df[df['target']==0]

In [17]:
##resample is part of scikit-learn itself (the sklearn.utils module);
##the "sklearn.utils" name on PyPI is a separate third-party package and is not needed
!pip install scikit-learn

In [19]:
from sklearn.utils import resample
##up sampling: draw minority rows with replacement (replace=True)
##until the minority class is as large as the majority class
df_minority_upsampled = resample(df_minority, replace=True, n_samples=len(df_majority), random_state=42)

In [20]:
df_minority_upsampled.shape

(900, 3)
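
It is worth noting what those 900 rows contain: because `replace=True` draws from only 100 original rows, the up sampled frame consists of repeated copies and adds no new information. The sketch below checks this on a stand-in minority frame; the frame `minority` and the variable names are assumptions for illustration, not the notebook's `df_minority`.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Stand-in minority class: 100 rows with one feature
minority = pd.DataFrame({"feature_1": rng.normal(size=100), "target": 1})

# Same call pattern as the notebook: sample with replacement up to 900 rows
upsampled = resample(minority, replace=True, n_samples=900, random_state=42)

n_rows = len(upsampled)
# The index still points at the original rows, so this counts distinct originals
n_distinct = upsampled.index.nunique()

print(n_rows, n_distinct)
```

The number of distinct original rows can never exceed 100, so every remaining row is a duplicate of one of them.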

In [21]:
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

In [23]:
df_upsampled['target'].value_counts()

target
0    900
1    900
Name: count, dtype: int64

## Down sampling

Down sampling is often considered the weaker option because we discard majority-class data points that the model could otherwise have learned from.
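
For the 900/100 example, the cost of down sampling is easy to quantify. The short calculation below is only arithmetic on the class counts; the variable names are illustrative.

```python
n_majority, n_minority = 900, 100

# Majority rows thrown away when the majority is cut down to the minority size
n_discarded = n_majority - n_minority

# Share of the majority class that is lost
frac_majority_discarded = n_discarded / n_majority

# Share of the full dataset that remains after balancing
frac_total_kept = (2 * n_minority) / (n_majority + n_minority)

print(n_discarded, round(frac_majority_discarded, 3), frac_total_kept)  # 800 0.889 0.2
```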

In [24]:
import numpy as np
import pandas as pd

np.random.seed(123)

n_sample = 1000
class_0_ratio = 0.9 ##90 percent class ratio
n_class_0 = int(n_sample * class_0_ratio)
n_class_1 = n_sample - n_class_0

In [None]:

##CREATE DATAFRAME WITH IMBALANCED DATASET
class_0 = pd.DataFrame({
    "feature_1": np.random.normal(loc=2, scale=1, size=n_class_0),
    "feature_2": np.random.normal(loc=2, scale=1, size=n_class_0),
    "target": [0]*n_class_0
})
class_1 = pd.DataFrame({
    "feature_1": np.random.normal(loc=2, scale=1, size=n_class_1),
    "feature_2": np.random.normal(loc=2, scale=1, size=n_class_1),
    "target": [1]*n_class_1
})

In [27]:
df = pd.concat([class_0, class_1]).reset_index(drop=True)
df.head()
df['target'].value_counts()

target
0    900
1    100
Name: count, dtype: int64

In [28]:
##DOWN SAMPLING
df_minority = df[df['target']==1]
df_majority = df[df['target']==0]


In [29]:
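from sklearn.utils import resample
##down sampling: draw majority rows without replacement (replace=False),
##so that no row is repeated, until the majority class is as small as the minority class
df_majority_downsampled = resample(df_majority, replace=False, n_samples=len(df_minority), random_state=42)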

df_majority_downsampled.shape

(100, 3)

In [33]:
df_downsampled = pd.concat([df_minority, df_majority_downsampled])

In [34]:
df_downsampled['target'].value_counts()

target
1    100
0    100
Name: count, dtype: int64
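
Both procedures follow the same pattern (split by class, resample, concatenate), so they can be wrapped in one helper. The function below is a hypothetical sketch, not something provided by scikit-learn or pandas; `balance_classes`, its `strategy` argument, and the toy frame are assumptions for illustration.

```python
import pandas as pd
from sklearn.utils import resample


def balance_classes(df, target="target", strategy="up", random_state=42):
    """Return a copy of df in which every class has the same number of rows.

    strategy="up":   smaller classes are sampled with replacement up to the largest class.
    strategy="down": larger classes are sampled without replacement down to the smallest class.
    """
    if strategy not in ("up", "down"):
        raise ValueError("strategy must be 'up' or 'down'")

    counts = df[target].value_counts()
    n_target = counts.max() if strategy == "up" else counts.min()

    parts = []
    for _, group in df.groupby(target):
        if len(group) == n_target:
            parts.append(group)  # already the right size, keep as is
        else:
            parts.append(
                resample(
                    group,
                    replace=(strategy == "up"),  # duplicates only when up sampling
                    n_samples=n_target,
                    random_state=random_state,
                )
            )

    # Shuffle so the classes are not stored in contiguous blocks
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)


# Toy usage: 90 rows of class 0 and 10 rows of class 1
toy = pd.DataFrame({"x": range(100), "target": [0] * 90 + [1] * 10})
up = balance_classes(toy, strategy="up")
down = balance_classes(toy, strategy="down")
print(up["target"].value_counts().to_dict(), down["target"].value_counts().to_dict())
```

Shuffling at the end is a design choice rather than a requirement; it only matters for models or splits that are sensitive to row order.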