### Handling imbalanced datasets
A key part of machine learning classification tasks is handling imbalanced data, which is characterized by a skewed class distribution in which one class is heavily overrepresented relative to the others. The difficulty is that a model trained on such data tends to be biased towards the majority class and may perform poorly on the minority class.
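
To make the bias concrete, here is a minimal sketch with hypothetical labels (900 majority, 100 minority). The "model" is an assumption for illustration only: it always predicts the majority class, yet its accuracy looks good while it never finds a minority example.

```python
import numpy as np

# Hypothetical labels: 900 majority (0) and 100 minority (1) examples
y_true = np.array([0] * 900 + [1] * 100)

# A stand-in "model" that ignores its input and always predicts the majority class
y_pred = np.zeros_like(y_true)

# Overall accuracy: fraction of predictions that match the labels
accuracy = (y_pred == y_true).mean()

# Recall on the minority class: fraction of true 1s that were predicted as 1
minority_recall = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy: {accuracy:.2f}")
print(f"minority recall: {minority_recall:.2f}")
```

High accuracy together with zero minority recall is the typical symptom that resampling tries to address.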

### Resampling methods
Resampling is a statistical method that generates new samples by randomly drawing data points from an existing dataset. It can be used to build new training sets for machine learning models and to estimate properties of a population when its distribution is unknown or hard to estimate, or when the sample is small.


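Two resampling techniques are covered here:

1. Upsampling: draw extra samples from the minority class until it matches the majority class in size
2. Downsampling: draw a smaller sample from the majority class until it matches the minority class in size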
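
The general idea of resampling, drawing new samples at random from existing data, can be sketched with a small bootstrap. The sample values, the seed, and the number of resamples below are all hypothetical choices, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator; the seed value is arbitrary

# Hypothetical small sample
sample = np.array([2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7])

# Draw 1000 resamples of the same size, with replacement, and record each mean
n_resamples = 1000
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(n_resamples)
])

# The spread of the resampled means approximates the standard error of the mean
print("sample mean:", sample.mean())
print("bootstrap standard error:", boot_means.std(ddof=1))
```

Upsampling and downsampling apply the same random drawing to one class at a time in order to change the class proportions.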

In [14]:
import numpy as np
import pandas as pd

# set the random seed for reproducibility
np.random.seed(123)

# class sizes for a two-class dataset with a 90/10 split
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

In [15]:
n_class_0, n_class_1

(900, 100)

In [16]:
## create one dataframe per class; together they form the imbalanced dataset
class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0,scale=1,size=n_class_0),
    'feature_2': np.random.normal(loc=0,scale=1,size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0,scale=1,size=n_class_1),
    'feature_2': np.random.normal(loc=0,scale=1,size=n_class_1),
    'target': [1] * n_class_1
})

## reset_index in pandas
`pd.concat` keeps the original row indices of both dataframes, so the combined index runs 0, 1, 2, ..., 899, 0, 1, 2, ..., 99. `reset_index` replaces it with a new continuous index from 0 to total_rows - 1. With `drop=True`, the old index is discarded instead of being kept as a separate column.
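
A small sketch of that behaviour, using two hypothetical toy dataframes `a` and `b` rather than the notebook's data:

```python
import pandas as pd

# Two tiny hypothetical dataframes, each with its own 0-based index
a = pd.DataFrame({"x": [10, 20, 30]})
b = pd.DataFrame({"x": [40, 50]})

# concat keeps the original indices, so labels repeat
stacked = pd.concat([a, b])
print(list(stacked.index))

# drop=True discards the old index and builds a fresh 0..n-1 index
fresh = stacked.reset_index(drop=True)
print(list(fresh.index), list(fresh.columns))

# drop=False (the default) keeps the old index as a column named "index"
kept = stacked.reset_index(drop=False)
print(list(kept.columns))

# ignore_index=True on concat gives the same fresh index in one step
direct = pd.concat([a, b], ignore_index=True)
print(list(direct.index))
```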

In [17]:
df = pd.concat([class_0,class_1]).reset_index(drop=True)

In [18]:
df.head()

|   | feature_1 | feature_2 | target |
|---|-----------|-----------|--------|
| 0 | -1.085631 | 0.551302 | 0 |
| 1 | 0.997345 | 0.419589 | 0 |
| 2 | 0.282978 | 1.815652 | 0 |
| 3 | -1.506295 | -0.25275 | 0 |
| 4 | -0.5786 | -0.292004 | 0 |


In [19]:
df.tail()

|     | feature_1 | feature_2 | target |
|-----|-----------|-----------|--------|
| 995 | -0.623629 | 0.845701 | 1 |
| 996 | 0.23981 | -1.119923 | 1 |
| 997 | -0.86824 | -0.359297 | 1 |
| 998 | 0.902006 | -1.609695 | 1 |
| 999 | 0.69749 | 0.01357 | 1 |


In [20]:
df['target'].value_counts()

target
0    900
1    100
Name: count, dtype: int64

In [21]:
## upsampling
df_minority = df[df['target'] == 1]
df_majority = df[df['target'] == 0]

## Random state and seeding for reproducibility
Software random number generators do not produce truly random numbers. Instead, they use deterministic algorithms to produce sequences of numbers that only appear random. These are called pseudorandom numbers, and the algorithm that produces them is a pseudorandom number generator (PRNG).
- seed: the `random_state` parameter supplies the initial value (the "seed") for the PRNG. If the same seed is used, the PRNG produces exactly the same sequence of "random" numbers every time the code is executed.
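
A minimal check of this, assuming scikit-learn is installed. The array `data` and the seed values are hypothetical; only `resample` and its `replace`, `n_samples`, and `random_state` parameters come from the notebook.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical data to draw from
data = np.arange(10)

# Two draws with the same seed
first = resample(data, replace=True, n_samples=5, random_state=42)
second = resample(data, replace=True, n_samples=5, random_state=42)

# One draw with a different seed, for comparison
other = resample(data, replace=True, n_samples=5, random_state=7)

print(first, second, other)
print("same seed, same draw:", np.array_equal(first, second))
```

Fixing `random_state` is what lets the upsampled and downsampled frames in this notebook come out identical on every run.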

In [29]:
from sklearn.utils import resample

df_minority_upsampled = resample(
    df_minority,
    replace=True,                # sample with replacement
    n_samples=len(df_majority),  # make the minority class as large as the majority class
    random_state=42,             # seed for reproducibility
)

In [23]:
df_minority_upsampled.shape

(900, 3)

In [24]:
df_minority_upsampled.head()

|     | feature_1 | feature_2 | target |
|-----|-----------|-----------|--------|
| 951 | -0.874146 | -0.156083 | 1 |
| 992 | 0.19657 | -0.602575 | 1 |
| 914 | -0.06783 | 0.998053 | 1 |
| 971 | 0.272825 | 1.034197 | 1 |
| 960 | 0.870056 | -0.449515 | 1 |


In [26]:
df_upsampled = pd.concat([df_majority,df_minority_upsampled])

In [27]:
df_upsampled['target'].value_counts()

target
0    900
1    900
Name: count, dtype: int64
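
Because upsampling draws with replacement, the extra minority rows are copies of existing rows, not new observations. The sketch below shows this on a hypothetical four-row minority frame. The frame and the target size of 12 are assumptions for illustration.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical minority class with only four distinct rows
minority = pd.DataFrame({"feature_1": [0.1, 0.2, 0.3, 0.4], "target": [1] * 4})

# Upsample to 12 rows; with replacement, rows must repeat
upsampled = resample(minority, replace=True, n_samples=12, random_state=42)

print(len(upsampled))                  # 12 rows after upsampling
print(upsampled.index.nunique())       # number of distinct original rows drawn
print(upsampled.index.value_counts())  # how often each original row was drawn
```

The repeated index labels are also why a `reset_index(drop=True)` is often applied after concatenating the upsampled frame.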

### Downsampling

In [28]:
## Down sampling
import numpy as np
import pandas as pd

# set the random seed for reproducibility
np.random.seed(123)

# create a dataframe with two classes
n_samples = 1000
class_0_ratio = 0.9
n_class_0 = int(n_samples * class_0_ratio)
n_class_1 = n_samples - n_class_0

## create one dataframe per class; together they form the imbalanced dataset
class_0 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0,scale=1,size=n_class_0),
    'feature_2': np.random.normal(loc=0,scale=1,size=n_class_0),
    'target': [0] * n_class_0
})

class_1 = pd.DataFrame({
    'feature_1': np.random.normal(loc=0,scale=1,size=n_class_1),
    'feature_2': np.random.normal(loc=0,scale=1,size=n_class_1),
    'target': [1] * n_class_1
})

df = pd.concat([class_0, class_1]).reset_index(drop=True)

# check the class distribution
print(df['target'].value_counts())

target
0    900
1    100
Name: count, dtype: int64


In [30]:
## downsampling
df_minority = df[df['target']==1]
df_majority = df[df['target']==0]

In [31]:
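from sklearn.utils import resample
# replace=True samples with replacement, so a majority row can be picked more than once;
# replace=False would keep every selected row distinct, but would generally pick
# different rows than the output shown below
df_majority_downsampled = resample(df_majority, replace=True,
                                   n_samples=len(df_minority),  # shrink the majority class to the minority size
                                   random_state=42)             # seed for reproducibility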

In [32]:
df_majority_downsampled.shape

(100, 3)
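
Downsampling is more commonly done without replacement, so that no majority row is used twice. Here is a minimal sketch of that variant on a hypothetical `toy` frame. It is not the notebook's `df`, and the seeds are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Hypothetical 900/100 dataset with a single feature
rng = np.random.default_rng(123)
toy = pd.DataFrame({
    "feature_1": rng.normal(size=1000),
    "target": [0] * 900 + [1] * 100,
})
majority = toy[toy["target"] == 0]
minority = toy[toy["target"] == 1]

# replace=False: each majority row can be selected at most once
majority_down = resample(
    majority,
    replace=False,
    n_samples=len(minority),
    random_state=42,
)

# Every selected row is distinct
print(majority_down.index.is_unique)

# Combine into a balanced frame with a fresh index
balanced = pd.concat([minority, majority_down]).reset_index(drop=True)
print(balanced["target"].value_counts())
```

With `replace=False`, `n_samples` cannot exceed the number of rows available, which always holds when shrinking the majority class to the minority size.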

In [35]:
df_downsampled = pd.concat([df_minority,df_majority_downsampled])

In [37]:
df_downsampled.reset_index(drop=True)

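|     | feature_1 | feature_2 | target |
|-----|-----------|-----------|--------|
| 0 | -0.300232 | 0.139033 | 1 |
| 1 | -0.632261 | 0.025577 | 1 |
| 2 | -0.204317 | -0.196443 | 1 |
| 3 | 0.213696 | 1.312255 | 1 |
| 4 | 1.033878 | 1.187417 | 1 |
| ... | ... | ... | ... |
| 195 | -0.598105 | 1.575650 | 0 |
| 196 | 0.420180 | 0.570631 | 0 |
| 197 | -0.392309 | 0.446491 | 0 |
| 198 | -0.148405 | -0.457929 | 0 |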


In [38]:
df_downsampled.target.value_counts()

target
1    100
0    100
Name: count, dtype: int64