**<ins>Balancing a Dataset with Downsampling</ins>**

Imagine we have a dataset for a binary classification task in which the class labels are imbalanced, and we want to downsample the majority class to balance it.


In [3]:
import pandas as pd
from sklearn.utils import resample

In [7]:
df = pd.DataFrame({
    'Age':[22,25,27,28,30,35,40,45,50,55,60,65,70],
    'Income':[2000,2500,2700,3200,3500,3800,4000,4200,4500,5000,5500,6000,6500],
    'Class':['High','Low','Low','High','High','Low','High','High','Low','Low','High','High','Low']
})

The High class has 7 instances and the Low class has 6, so High is the majority class and Low is the minority class.

In [36]:

#Separate majority and minority classes
df_high = df[df['Class'] == 'High']
df_low = df[df['Class'] == 'Low']
print(df_low)
print(df_high)

    Age  Income Class
1    25    2500   Low
2    27    2700   Low
5    35    3800   Low
8    50    4500   Low
9    55    5000   Low
12   70    6500   Low
    Age  Income Class
0    22    2000  High
3    28    3200  High
4    30    3500  High
6    40    4000  High
7    45    4200  High
10   60    5500  High
11   65    6000  High


In [22]:
# Downsample the majority class: draw len(df_low) rows without replacement
df_high_downsample = resample(df_high, replace=False, n_samples=len(df_low), random_state=42)

In [40]:
# Combine the downsampled majority class with the minority class
df_balanced = pd.concat([df_high_downsample, df_low])
print(df_balanced['Class'].value_counts())

Class
High    6
Low     6
Name: count, dtype: int64
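The cells above hard-code which class is the majority. As a minimal sketch, the same idea can be wrapped in a small helper that finds the smallest class on its own. The name `downsample_to_smallest` is hypothetical and not part of pandas or scikit-learn; it only combines `groupby`, `resample`, and `concat`.

```python
import pandas as pd
from sklearn.utils import resample

def downsample_to_smallest(df, label_col, random_state=42):
    """Hypothetical helper: shrink every class to the size of the smallest one."""
    smallest = df[label_col].value_counts().min()
    parts = [
        # replace=False means no row can be picked twice
        resample(group, replace=False, n_samples=smallest, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts)

df = pd.DataFrame({
    'Age': [22, 25, 27, 28, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [2000, 2500, 2700, 3200, 3500, 3800, 4000, 4200, 4500, 5000, 5500, 6000, 6500],
    'Class': ['High', 'Low', 'Low', 'High', 'High', 'Low', 'High', 'High', 'Low', 'Low', 'High', 'High', 'Low']
})

balanced = downsample_to_smallest(df, 'Class')
counts = balanced['Class'].value_counts()
print(counts)
```

Because sampling is done without replacement, every row in the result is a distinct original row; one High row is simply dropped.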


<b>Upsampling the Minority Class</b>

Now consider a binary classification dataset in which the minority class has fewer instances than the majority class. We upsample the minority class by sampling it with replacement until it matches the majority class in size.

In [56]:
df = pd.DataFrame({
    'Age':[22,25,27,28,30,35,40,45,50,55,60,65,70],
    'Income':[2000,2500,2700,3200,3500,3800,4000,4200,4500,5000,5500,6000,6500],
    'Class':['Minority','Majority','Majority','Majority','Majority','Minority','Minority','Minority','Majority','Majority','Majority','Majority','Majority']
})

Majority class has 9 instances

Minority class has 4 instances

In [61]:
# Separate majority and minority classes
df_majority = df[df['Class'] == 'Majority']
df_minority = df[df['Class'] == 'Minority']
print(df_majority)
print(df_minority)

    Age  Income     Class
1    25    2500  Majority
2    27    2700  Majority
3    28    3200  Majority
4    30    3500  Majority
8    50    4500  Majority
9    55    5000  Majority
10   60    5500  Majority
11   65    6000  Majority
12   70    6500  Majority
   Age  Income     Class
0   22    2000  Minority
5   35    3800  Minority
6   40    4000  Minority
7   45    4200  Minority


In [73]:
# Upsample the minority class: draw len(df_majority) rows with replacement
df_minority_upsampled = resample(df_minority, replace=True, n_samples=len(df_majority), random_state=42)

In [75]:
# Combine the upsampled minority class with the majority class
df_balanced = pd.concat([df_majority, df_minority_upsampled])

In [77]:
print(df_balanced['Class'].value_counts())

Class
Majority    9
Minority    9
Name: count, dtype: int64
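Upsampling with `replace=True` creates no new information: the 9 minority rows are repeats of the 4 originals. A small sketch, reusing the same data and `resample` call as above, that counts the repeated rows (the variable names `n_unique` and `n_duplicates` are illustrative):

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    'Age': [22, 25, 27, 28, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [2000, 2500, 2700, 3200, 3500, 3800, 4000, 4200, 4500, 5000, 5500, 6000, 6500],
    'Class': ['Minority', 'Majority', 'Majority', 'Majority', 'Majority', 'Minority', 'Minority',
              'Minority', 'Majority', 'Majority', 'Majority', 'Majority', 'Majority']
})
df_majority = df[df['Class'] == 'Majority']
df_minority = df[df['Class'] == 'Minority']

df_minority_upsampled = resample(df_minority, replace=True,
                                 n_samples=len(df_majority), random_state=42)

# The original index labels survive resampling, so repeated labels reveal duplicated rows
n_unique = df_minority_upsampled.index.nunique()
n_duplicates = len(df_minority_upsampled) - n_unique
print(f"{len(df_minority_upsampled)} rows, {n_unique} distinct, {n_duplicates} duplicates")
```

With only 4 distinct minority rows available, at least 5 of the 9 sampled rows must be duplicates.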


SMOTE (Synthetic Minority Over-sampling Technique) is another way to balance the dataset. Instead of duplicating existing minority rows, SMOTE generates synthetic examples.
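To make "synthetic" concrete, here is a rough sketch of the core interpolation idea in plain NumPy: pick a minority point, pick one of its nearest minority neighbours, and place a new point somewhere on the line segment between them. This is a simplified illustration, not imbalanced-learn's actual implementation; the function name `synthesize_one` and the sample array are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Four minority points as (Age, Income) pairs, used only for illustration
minority = np.array([[22, 2000], [35, 3800], [40, 4000], [45, 4200]], dtype=float)

def synthesize_one(points, k, rng):
    """Hypothetical helper: create one synthetic point by interpolation."""
    x = points[rng.integers(len(points))]
    dists = np.linalg.norm(points - x, axis=1)
    neighbor_idx = np.argsort(dists)[1:k + 1]    # position 0 is x itself (distance 0)
    neighbor = points[rng.choice(neighbor_idx)]
    lam = rng.random()                           # random fraction in [0, 1)
    return x + lam * (neighbor - x)              # point on the segment from x to neighbor

synthetic = np.array([synthesize_one(minority, 3, rng) for _ in range(5)])
print(synthetic)
```

Each generated point lies between two real minority points, so it stays inside the range the minority class already covers.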

In [80]:
pip install imbalanced-learn

Note: you may need to restart the kernel to use updated packages.


1. Use SMOTE to generate synthetic samples instead of duplicating existing ones.

2. Convert the categorical class labels to numeric form before applying SMOTE.

3. Apply SMOTE to balance the dataset.

4. Convert the numeric labels back to the original categorical labels.

5. Combine the resampled data into a final balanced dataset.

In [84]:
import pandas as pd
from imblearn.over_sampling import SMOTE

In [92]:
df = pd.DataFrame({
    'Age':[22,25,27,28,30,35,40,45,50,55,60,65,70],
    'Income':[2000,2500,2700,3200,3500,3800,4000,4200,4500,5000,5500,6000,6500],
    'Class':['Minority','Majority','Majority','Majority','Majority','Minority','Minority','Minority','Majority','Majority','Majority','Majority','Majority']
})

In [94]:
# Step 1: Convert categorical labels to numerical values
df['Class'] = df['Class'].map({'Majority': 0, 'Minority': 1})

# Step 2: Split features (X) and target variable (Y)
X = df[['Age', 'Income']]
Y = df['Class']

# Step 3: Apply SMOTE with k_neighbors=3 instead of the default 5.
# The minority class has only 4 samples, so each one has at most 3 neighbours.
smote = SMOTE(sampling_strategy='auto', random_state=42, k_neighbors=3)
X_resampled, Y_resampled = smote.fit_resample(X, Y)

# Step 4: Convert numeric labels back to categorical
Y_resampled = Y_resampled.map({0: 'Majority', 1: 'Minority'})

# Step 5: Combine the resampled features and labels into one DataFrame
df_balanced = pd.concat([pd.DataFrame(X_resampled, columns=['Age', 'Income']), pd.DataFrame(Y_resampled, columns=['Class'])], axis=1)

# Step 6: Print the class distribution
print(df_balanced['Class'].value_counts())

# Step 7: Display the upsampled dataset
print(df_balanced)

Class
Minority    9
Majority    9
Name: count, dtype: int64
    Age  Income     Class
0    22    2000  Minority
1    25    2500  Majority
2    27    2700  Majority
3    28    3200  Majority
4    30    3500  Majority
5    35    3800  Minority
6    40    4000  Minority
7    45    4200  Minority
8    50    4500  Majority
9    55    5000  Majority
10   60    5500  Majority
11   65    6000  Majority
12   70    6500  Majority
13   40    4031  Minority
14   35    3831  Minority
15   44    4176  Minority
16   35    3826  Minority
17   41    4040  Minority
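Rows 0 to 12 in the output above are the original data, and rows 13 to 17 are the synthetic minority rows appended by SMOTE. As a quick sanity check, the sketch below copies those five rows from the printed output and confirms that each falls within the range of the original minority values (the variable names are illustrative):

```python
import pandas as pd

# The four original minority rows and the five appended rows (index 13-17) from the output above
minority = pd.DataFrame({'Age': [22, 35, 40, 45],
                         'Income': [2000, 3800, 4000, 4200]})
synthetic = pd.DataFrame({'Age': [40, 35, 44, 35, 41],
                          'Income': [4031, 3831, 4176, 3826, 4040]})

# Compare each column against the minority class's per-column minimum and maximum
within_range = (synthetic >= minority.min()) & (synthetic <= minority.max())
all_within = bool(within_range.all().all())
print(all_within)  # True
```

None of the synthetic rows repeats an original row exactly, which is the practical difference from upsampling with `resample`.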
