# Handling Imbalanced Datasets


An imbalanced dataset is one in which the target classes are unevenly distributed, i.e. one class label has a very high number of observations and the other has a very low number. An example makes this easier to understand.

Let’s assume that XYZ is a bank that issues credit cards to its customers. The bank is concerned about fraudulent transactions, and when it checks its data it finds that out of every 2,000 transactions only 30 are recorded as fraud. The fraud rate is therefore 1.5% (less than 2%), which means more than 98% of transactions are “No Fraud”. Here, the “No Fraud” class is called the majority class, and the much smaller “Fraud” class is called the minority class.
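
The arithmetic behind the bank example can be checked with a few lines of Python. This is only a minimal sketch; the variable names are illustrative and are not used elsewhere in this notebook.

```python
# Numbers taken from the bank example above
n_transactions = 2000
n_fraud = 30

# Share of each class
fraud_rate = n_fraud / n_transactions
no_fraud_rate = 1 - fraud_rate

# Majority rows per minority row
imbalance_ratio = (n_transactions - n_fraud) / n_fraud

print(f"Fraud: {fraud_rate:.2%}")          # Fraud: 1.50%
print(f"No Fraud: {no_fraud_rate:.2%}")    # No Fraud: 98.50%
print(f"Majority rows per minority row: {imbalance_ratio:.1f}")  # 65.7
```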

### Ways of Handling Imbalanced Datasets

##### 1. Upsampling
##### 2. Downsampling

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Set the random seed for reproducibility
np.random.seed(123)
# Class sizes for a two-class dataset with a 90/10 split
n_samples=1000
class_0_ratio=0.9
n_class_0=int(n_samples*class_0_ratio)
n_class_1=n_samples-n_class_0

In [3]:
n_class_0,n_class_1

(900, 100)

In [4]:
class_0=pd.DataFrame({
    'feature_1':np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2':np.random.normal(loc=0, scale=1, size=n_class_0),
    'target':[0]*n_class_0
})

class_1=pd.DataFrame({
    'feature_1':np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2':np.random.normal(loc=2, scale=1, size=n_class_1),
    'target':[1]*n_class_1
})

In [5]:
df=pd.concat([class_0,class_1]).reset_index(drop=True)

In [6]:
df.head()

   feature_1  feature_2  target
0  -1.085631   0.551302       0
1   0.997345   0.419589       0
2   0.282978   1.815652       0
3  -1.506295  -0.252750       0
4  -0.578600  -0.292004       0


In [7]:
df['target'].value_counts()

0    900
1    100
Name: target, dtype: int64
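
The raw counts can also be expressed as proportions, which makes the degree of imbalance easier to read. The sketch below rebuilds only the `target` column with the same 900/100 split so that it runs on its own; `proportions` and `imbalance_ratio` are illustrative names.

```python
import pandas as pd

# Stand-in for df['target']: 900 rows of class 0 and 100 rows of class 1
target = pd.Series([0] * 900 + [1] * 100, name='target')

# normalize=True returns the fraction of rows in each class instead of counts
proportions = target.value_counts(normalize=True)
print(proportions)

# Majority-to-minority ratio computed from the raw counts
counts = target.value_counts()
imbalance_ratio = counts.max() / counts.min()
print(imbalance_ratio)  # 9.0
```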

#### Upsampling

In [8]:
df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

In [9]:
df_majority.head()

   feature_1  feature_2  target
0  -1.085631   0.551302       0
1   0.997345   0.419589       0
2   0.282978   1.815652       0
3  -1.506295  -0.252750       0
4  -0.578600  -0.292004       0


In [10]:
df_minority.head()

     feature_1  feature_2  target
900   1.699768   2.139033       1
901   1.367739   2.025577       1
902   1.795683   1.803557       1
903   2.213696   3.312255       1
904   3.033878   3.187417       1


In [11]:
# Performing Upsampling
from sklearn.utils import resample

In [21]:
df_minority_upsample=resample(df_minority,
                              replace=True,  # sample with replacement
                              n_samples=len(df_majority),  # draw as many rows as the majority class has
                              random_state=42)

In [20]:
df_minority_upsample.shape

(900, 3)

In [14]:
df_minority_upsample.head()

     feature_1  feature_2  target
951   1.125854   1.843917       1
992   2.196570   1.397425       1
914   1.932170   2.998053       1
971   2.272825   3.034197       1
960   2.870056   1.550485       1


In [22]:
df_minority_upsample['target'].value_counts()

1    900
Name: target, dtype: int64

In [24]:
df_upsampled=pd.concat([df_majority,df_minority_upsample])

In [25]:
df_upsampled

     feature_1  feature_2  target
0    -1.085631   0.551302       0
1     0.997345   0.419589       0
2     0.282978   1.815652       0
3    -1.506295  -0.252750       0
4    -0.578600  -0.292004       0
..         ...        ...     ...
952   1.188902   2.189189       1
965   3.919526   1.980541       1
976   2.810326   3.604614       1
942   3.621531   2.168229       1


In [26]:
df_upsampled['target'].value_counts()

0    900
1    900
Name: target, dtype: int64

In [27]:
df_upsampled.shape

(1800, 3)
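
Because the minority class is sampled with replacement, the 900 upsampled rows can only be copies of the original 100 minority rows; no new information is created. The sketch below illustrates this on a small stand-in frame (the `minority` data and the seed are assumptions made so the cell runs on its own) by counting how many distinct original rows appear in the upsampled result.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Stand-in minority class: 100 rows, indexed 0..99
rng = np.random.default_rng(0)
minority = pd.DataFrame({
    'feature_1': rng.normal(loc=2, scale=1, size=100),
    'target': 1,
})

# Same call pattern as in the upsampling step above
upsampled = resample(minority, replace=True, n_samples=900, random_state=42)

# resample keeps the original index labels, so repeated labels are repeated rows
n_unique_rows = upsampled.index.nunique()
n_duplicates = len(upsampled) - n_unique_rows
print(len(upsampled), n_unique_rows, n_duplicates)
```

At most 100 distinct rows can appear, so at least 800 of the 900 upsampled rows are duplicates.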

#### Downsampling

In [34]:
class_0=pd.DataFrame({
    'feature_1':np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2':np.random.normal(loc=0, scale=1, size=n_class_0),
    'target':[0]*n_class_0
})

class_1=pd.DataFrame({
    'feature_1':np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2':np.random.normal(loc=2, scale=1, size=n_class_1),
    'target':[1]*n_class_1
})

In [35]:
df=pd.concat([class_0,class_1]).reset_index(drop=True)

In [36]:
df_minority=df[df['target']==1]
df_majority=df[df['target']==0]

In [37]:
df_majority_downsample=resample(df_majority,
                                replace=False,  # sample without replacement
                                n_samples=len(df_minority),  # keep as many rows as the minority class has
                                random_state=42)

In [38]:
df_majority_downsample.shape

(100, 3)

In [41]:
df_downsample=pd.concat([df_minority,df_majority_downsample])

In [43]:
df_downsample['target'].value_counts()

1    100
0    100
Name: target, dtype: int64
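
Both strategies follow the same pattern: split by class, resample each class to a common size, and concatenate. As a rough sketch, they can be wrapped in one helper. `balance_classes` is a hypothetical function written for this illustration, not part of pandas or scikit-learn, and it uses pandas' `DataFrame.sample` as a stand-in for `sklearn.utils.resample`. The final shuffle is an added convenience and is not part of the steps above.

```python
import numpy as np
import pandas as pd

def balance_classes(df, target_col='target', strategy='upsample', random_state=42):
    """Hypothetical helper: equalize class sizes by random resampling."""
    counts = df[target_col].value_counts()
    if strategy == 'upsample':
        n_per_class = int(counts.max())   # grow every class to the largest size
    elif strategy == 'downsample':
        n_per_class = int(counts.min())   # shrink every class to the smallest size
    else:
        raise ValueError("strategy must be 'upsample' or 'downsample'")

    parts = []
    for _, group in df.groupby(target_col):
        if len(group) == n_per_class:
            parts.append(group)           # already the right size, keep as-is
        else:
            # Replacement is needed only when more rows are requested than exist
            need_replacement = len(group) < n_per_class
            parts.append(group.sample(n=n_per_class,
                                      replace=need_replacement,
                                      random_state=random_state))

    # Shuffle so the classes are not stored in contiguous blocks
    balanced = pd.concat(parts).sample(frac=1, random_state=random_state)
    return balanced.reset_index(drop=True)

# Usage on a stand-in dataset with the same 900/100 split
rng = np.random.default_rng(123)
demo = pd.DataFrame({
    'feature_1': np.concatenate([rng.normal(0, 1, 900), rng.normal(2, 1, 100)]),
    'target': [0] * 900 + [1] * 100,
})
up = balance_classes(demo, strategy='upsample')
down = balance_classes(demo, strategy='downsample')
print(up['target'].value_counts().to_dict())
print(down['target'].value_counts().to_dict())
```

Upsampling keeps every original row but repeats minority rows, while downsampling discards 800 of the 900 majority rows; which trade-off is acceptable depends on how much data is available.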