## Handling Imbalanced Datasets

**1. Upsampling:** We increase the number of **minority** records, by sampling them with replacement, until it matches the number of **majority** records.

**2. Downsampling:** We reduce the number of **majority** records, by sampling them without replacement, until it matches the number of **minority** records.
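The two definitions above can be sketched at the level of row indices before any DataFrame is involved. This is only an illustrative sketch: the toy label array, the seed, and the variable names are assumptions, not part of the notebook that follows.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch

# Hypothetical toy labels: 9 majority (0) records and 3 minority (1) records
labels = np.array([0] * 9 + [1] * 3)
majority_idx = np.flatnonzero(labels == 0)
minority_idx = np.flatnonzero(labels == 1)

# Upsampling: draw minority indices WITH replacement up to the majority count
upsampled_minority = rng.choice(minority_idx, size=len(majority_idx), replace=True)

# Downsampling: draw majority indices WITHOUT replacement down to the minority count
downsampled_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)

print(len(majority_idx), len(upsampled_minority))    # 9 9
print(len(downsampled_majority), len(minority_idx))  # 3 3
```

Upsampling necessarily repeats some minority rows, while downsampling keeps each selected majority row at most once.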

In [1]:
import numpy as np
import pandas as pd

# Set a random seed for reproducibility

np.random.seed(123)  # Makes np.random functions produce the same values on every run

# Create a dataframe with two classes
 
n_samples=1000 
class_0_ratio=0.9
n_class_0=int(n_samples* class_0_ratio)
n_class_1=n_samples-n_class_0

class_0 has **90%** of the data and class_1 has **10%** of the data.

In [2]:
n_class_0, n_class_1

(900, 100)
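As a quick check of what the seed does, the sketch below draws three values twice, resetting the seed in between. The variable names `first` and `second` are illustrative only.

```python
import numpy as np

np.random.seed(123)
first = np.random.normal(size=3)

np.random.seed(123)  # resetting the seed restarts the same random sequence
second = np.random.normal(size=3)

print(np.array_equal(first, second))  # True
```

This is why the notebook calls `np.random.seed(123)` again before generating the data: the generated features then come out the same on every run.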

In [3]:
# Create an imbalanced dataset (the target column holds the class labels 0 and 1)

np.random.seed(123)

class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

# feature_1 and feature_2 are drawn from normal distributions (mean 0 for class 0, mean 2 for class 1).

In [4]:
class_0

Unnamed: 0,feature_1,feature_2,target
0,-1.085631,0.551302,0
1,0.997345,0.419589,0
2,0.282978,1.815652,0
3,-1.506295,-0.252750,0
4,-0.578600,-0.292004,0
...,...,...,...
895,0.238761,-0.003155,0
896,-1.106386,-0.430660,0
897,0.366732,-0.146416,0
898,1.023906,1.160176,0


In [5]:
class_1

Unnamed: 0,feature_1,feature_2,target
0,1.699768,2.139033,1
1,1.367739,2.025577,1
2,1.795683,1.803557,1
3,2.213696,3.312255,1
4,3.033878,3.187417,1
...,...,...,...
95,1.376371,2.845701,1
96,2.239810,0.880077,1
97,1.131760,1.640703,1
98,2.902006,0.390305,1
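One way to sanity-check the two generated frames is to compare the per-class feature means with the `loc` values passed to `np.random.normal`. The sketch below rebuilds the same data and is only a rough check; the name `summary` is an assumption, and the sample means will be close to 0 and 2 rather than exactly equal.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
n_class_0, n_class_1 = 900, 100

class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})
class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

# Mean of each feature per class; expect values near 0 for class 0 and near 2 for class 1
summary = (
    pd.concat([class_0, class_1])
    .groupby('target')[['feature_1', 'feature_2']]
    .mean()
)
print(summary.round(2))
```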


In [6]:
# Concatenate both DataFrames

pd.concat([class_0,class_1])

Unnamed: 0,feature_1,feature_2,target
0,-1.085631,0.551302,0
1,0.997345,0.419589,0
2,0.282978,1.815652,0
3,-1.506295,-0.252750,0
4,-0.578600,-0.292004,0
...,...,...,...
95,1.376371,2.845701,1
96,2.239810,0.880077,1
97,1.131760,1.640703,1
98,2.902006,0.390305,1


In [7]:
# Reset the index so it runs continuously from 0 (compare the index values with the previous output)

pd.concat([class_0,class_1]).reset_index(drop=True)

Unnamed: 0,feature_1,feature_2,target
0,-1.085631,0.551302,0
1,0.997345,0.419589,0
2,0.282978,1.815652,0
3,-1.506295,-0.252750,0
4,-0.578600,-0.292004,0
...,...,...,...
995,1.376371,2.845701,1
996,2.239810,0.880077,1
997,1.131760,1.640703,1
998,2.902006,0.390305,1
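The effect of `reset_index(drop=True)` is easier to see on a tiny example. The sketch below uses two hypothetical frames, `a` and `b`, and also shows `ignore_index=True`, an option of `pd.concat` that renumbers the rows in one step.

```python
import pandas as pd

a = pd.DataFrame({"x": [10, 20]})
b = pd.DataFrame({"x": [30]})

stacked = pd.concat([a, b])                         # keeps each frame's own labels
renumbered = pd.concat([a, b], ignore_index=True)   # assigns fresh labels 0..n-1

print(stacked.index.tolist())                         # [0, 1, 0]
print(stacked.reset_index(drop=True).index.tolist())  # [0, 1, 2]
print(renumbered.index.tolist())                      # [0, 1, 2]
```

Duplicate labels such as the repeated `0` above can make label-based lookups with `.loc` return several rows, which is why the index is reset here.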


In [8]:
df=pd.concat([class_0,class_1]).reset_index(drop=True)

In [9]:
df

Unnamed: 0,feature_1,feature_2,target
0,-1.085631,0.551302,0
1,0.997345,0.419589,0
2,0.282978,1.815652,0
3,-1.506295,-0.252750,0
4,-0.578600,-0.292004,0
...,...,...,...
995,1.376371,2.845701,1
996,2.239810,0.880077,1
997,1.131760,1.640703,1
998,2.902006,0.390305,1


In [10]:
df.head()

Unnamed: 0,feature_1,feature_2,target
0,-1.085631,0.551302,0
1,0.997345,0.419589,0
2,0.282978,1.815652,0
3,-1.506295,-0.25275,0
4,-0.5786,-0.292004,0


In [11]:
df.target.value_counts()

target
0    900
1    100
Name: count, dtype: int64

In [12]:
df["target"].value_counts()

target
0    900
1    100
Name: count, dtype: int64

Class "1" has only 100 records (fewer than class "0"), so we call these the "minority records".

Class "0" has 900 records (more than class "1"), so we call these the "majority records".
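The degree of imbalance can also be expressed as class proportions or as a ratio. The following is a minimal sketch on a stand-in `target` Series with the same 900/100 split; the names `proportions` and `imbalance_ratio` are illustrative.

```python
import pandas as pd

# Stand-in for df["target"]: 900 zeros and 100 ones
target = pd.Series([0] * 900 + [1] * 100, name="target")

counts = target.value_counts()
proportions = target.value_counts(normalize=True)  # share of each class
imbalance_ratio = counts.max() / counts.min()      # majority count / minority count

print(f"minority share: {proportions.loc[1]:.1%}")  # minority share: 10.0%
print(f"imbalance ratio: {imbalance_ratio:.1f}:1")  # imbalance ratio: 9.0:1
```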

## Upsampling

In [13]:
# Create a separate DataFrame for each class.

df_minority = df[df["target"]==1]
df_majority = df[df["target"]==0]

In [14]:
from sklearn.utils import resample

# Syntax: resample(minority DataFrame, replace=True, n_samples=<target count>, random_state=42)
# Here n_samples is the number of majority records, so both classes end up the same size.
# replace=True draws samples with replacement, so the same minority record can appear more than once.
resample(df_minority,replace=True,n_samples=len(df_majority),random_state=42)

Unnamed: 0,feature_1,feature_2,target
951,1.125854,1.843917,1
992,2.196570,1.397425,1
914,1.932170,2.998053,1
971,2.272825,3.034197,1
960,2.870056,1.550485,1
...,...,...,...
952,1.188902,2.189189,1
965,3.919526,1.980541,1
976,2.810326,3.604614,1
942,3.621531,2.168229,1


In [15]:
df_minority_upsampled= resample(df_minority,replace=True,n_samples=len(df_majority),random_state=42)

In [16]:
df_minority_upsampled

Unnamed: 0,feature_1,feature_2,target
951,1.125854,1.843917,1
992,2.196570,1.397425,1
914,1.932170,2.998053,1
971,2.272825,3.034197,1
960,2.870056,1.550485,1
...,...,...,...
952,1.188902,2.189189,1
965,3.919526,1.980541,1
976,2.810326,3.604614,1
942,3.621531,2.168229,1


In [17]:
df_minority_upsampled.shape

# The minority records are upsampled: their count now equals the count of majority records (see below)

(900, 3)

In [18]:
# After upsampling, concatenate df_majority and df_minority_upsampled

pd.concat([df_majority,df_minority_upsampled])

Unnamed: 0,feature_1,feature_2,target
0,-1.085631,0.551302,0
1,0.997345,0.419589,0
2,0.282978,1.815652,0
3,-1.506295,-0.252750,0
4,-0.578600,-0.292004,0
...,...,...,...
952,1.188902,2.189189,1
965,3.919526,1.980541,1
976,2.810326,3.604614,1
942,3.621531,2.168229,1


In [19]:
# The concatenated DataFrame above has 1800 records in total (900 per class)

df_upsampled=pd.concat([df_majority,df_minority_upsampled])

In [20]:
df_upsampled

Unnamed: 0,feature_1,feature_2,target
0,-1.085631,0.551302,0
1,0.997345,0.419589,0
2,0.282978,1.815652,0
3,-1.506295,-0.252750,0
4,-0.578600,-0.292004,0
...,...,...,...
952,1.188902,2.189189,1
965,3.919526,1.980541,1
976,2.810326,3.604614,1
942,3.621531,2.168229,1


In [21]:
df_upsampled["target"].value_counts()

target
0    900
1    900
Name: count, dtype: int64

In [22]:
# The minority class was upsampled so that its count equals the majority class count.
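The same upsampling step can also be written with pandas alone, using `DataFrame.sample` with `replace=True`. This is an alternative sketch on a small hypothetical frame, not the method used in the cells above, and it is not guaranteed to pick the same rows as `sklearn.utils.resample` for a given `random_state`.

```python
import pandas as pd

# Hypothetical toy data: 8 majority rows and 2 minority rows
df = pd.DataFrame({"feature_1": range(10), "target": [0] * 8 + [1] * 2})
df_majority = df[df["target"] == 0]
df_minority = df[df["target"] == 1]

# Draw minority rows with replacement until they match the majority count
df_minority_upsampled = df_minority.sample(
    n=len(df_majority), replace=True, random_state=42
)
df_upsampled = pd.concat([df_majority, df_minority_upsampled], ignore_index=True)

print(df_upsampled["target"].value_counts().sort_index().tolist())  # [8, 8]
```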

## Downsampling

In [23]:

np.random.seed(123)

class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0, scale=1, size=n_class_0),
    'feature_2': np.random.normal(loc=0, scale=1, size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=2, scale=1, size=n_class_1),
    'feature_2': np.random.normal(loc=2, scale=1, size=n_class_1),
    'target': [1] * n_class_1
})

In [24]:
df=pd.concat([class_0,class_1]).reset_index(drop=True)

In [25]:
# Create a separate DataFrame for each class.

df_minority = df[df["target"]==1]
df_majority = df[df["target"]==0]

In [26]:
df_majority.shape, df_minority.shape

((900, 3), (100, 3))

In [27]:
from sklearn.utils import resample

# Syntax: resample(majority DataFrame, replace=False, n_samples=<target count>, random_state=42)
# Here n_samples is the number of minority records, so both classes end up the same size.
# replace=False draws samples without replacement, so each majority record is picked at most once.
resample(df_majority,replace=False,n_samples=len(df_minority),random_state=42)

Unnamed: 0,feature_1,feature_2,target
70,0.468439,1.720920,0
827,1.089165,-0.464899,0
231,0.753869,-0.969798,0
588,0.588686,-0.704720,0
39,0.283627,1.012868,0
...,...,...,...
398,-0.168426,0.553775,0
76,-0.403366,0.081491,0
196,-0.269293,0.611238,0
631,-0.295829,0.671673,0


In [28]:
df_majority_downsampled= resample(df_majority,replace=False,n_samples=len(df_minority),random_state=42)

In [29]:
df_majority_downsampled

Unnamed: 0,feature_1,feature_2,target
70,0.468439,1.720920,0
827,1.089165,-0.464899,0
231,0.753869,-0.969798,0
588,0.588686,-0.704720,0
39,0.283627,1.012868,0
...,...,...,...
398,-0.168426,0.553775,0
76,-0.403366,0.081491,0
196,-0.269293,0.611238,0
631,-0.295829,0.671673,0


In [30]:
df_majority_downsampled.shape

(100, 3)

In [31]:
df_majority_downsampled["target"].unique()

array([0], dtype=int64)

In [32]:
# The majority class is now reduced to the minority class count. Concatenate df_majority_downsampled and df_minority.
# Note: reset_index() without drop=True keeps the old index as an extra 'index' column.

pd.concat([df_majority_downsampled,df_minority]).reset_index()

Unnamed: 0,index,feature_1,feature_2,target
0,70,0.468439,1.720920,0
1,827,1.089165,-0.464899,0
2,231,0.753869,-0.969798,0
3,588,0.588686,-0.704720,0
4,39,0.283627,1.012868,0
...,...,...,...,...
195,995,1.376371,2.845701,1
196,996,2.239810,0.880077,1
197,997,1.131760,1.640703,1
198,998,2.902006,0.390305,1
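The extra `index` column in the output above comes from calling `reset_index()` without `drop=True`. A small sketch on a hypothetical frame shows the difference:

```python
import pandas as pd

# Hypothetical frame with non-consecutive index labels, as left behind by resampling
df = pd.DataFrame({"x": [1, 2, 3]}, index=[70, 827, 231])

kept = df.reset_index()              # old labels are moved into an 'index' column
dropped = df.reset_index(drop=True)  # old labels are discarded

print(kept.columns.tolist())     # ['index', 'x']
print(dropped.columns.tolist())  # ['x']
```

Keeping the old labels can be useful for tracing rows back to the original data, but the `index` column should not be passed to a model as a feature.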


In [33]:
df_downsampled=pd.concat([df_majority_downsampled,df_minority]).reset_index()

In [34]:
df_downsampled

Unnamed: 0,index,feature_1,feature_2,target
0,70,0.468439,1.720920,0
1,827,1.089165,-0.464899,0
2,231,0.753869,-0.969798,0
3,588,0.588686,-0.704720,0
4,39,0.283627,1.012868,0
...,...,...,...,...
195,995,1.376371,2.845701,1
196,996,2.239810,0.880077,1
197,997,1.131760,1.640703,1
198,998,2.902006,0.390305,1


In [35]:
df_downsampled["target"].value_counts()

target
0    100
1    100
Name: count, dtype: int64

In [36]:
# The majority count is now reduced to the minority count (100)

### Downsampling is used less often because it discards most of the majority records (here, 800 of the 900). Upsampling keeps every original record, so it is the more common choice when training a machine learning model.
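Both workflows in this notebook follow the same pattern, so they can be folded into one helper. The function below is a hypothetical sketch (the name `rebalance` and its parameters are assumptions, not an existing library API); it relies only on `sklearn.utils.resample` as used above.

```python
import pandas as pd
from sklearn.utils import resample


def rebalance(df, target_col="target", strategy="up", random_state=42):
    """Hypothetical helper: equalize class counts by up- or downsampling."""
    if strategy not in ("up", "down"):
        raise ValueError("strategy must be 'up' or 'down'")

    counts = df[target_col].value_counts()
    # Upsampling targets the largest class size, downsampling the smallest
    n_target = int(counts.max() if strategy == "up" else counts.min())

    parts = []
    for _, group in df.groupby(target_col):
        if len(group) == n_target:
            parts.append(group)  # already at the target size
        else:
            parts.append(
                resample(
                    group,
                    replace=(strategy == "up"),  # with replacement only when growing
                    n_samples=n_target,
                    random_state=random_state,
                )
            )
    return pd.concat(parts, ignore_index=True)


# Usage on a toy frame with 9 majority and 3 minority rows
toy = pd.DataFrame({"feature_1": range(12), "target": [0] * 9 + [1] * 3})
up = rebalance(toy, strategy="up")
down = rebalance(toy, strategy="down")

print(up["target"].value_counts().sort_index().tolist())    # [9, 9]
print(down["target"].value_counts().sort_index().tolist())  # [3, 3]
```

The toy output makes the trade-off concrete: upsampling keeps all 12 original rows and adds 6 repeated ones, while downsampling keeps only 6 of the 12.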