***Balancing a Dataset with Downsampling***
    
Suppose we have a binary classification dataset whose class labels are imbalanced, and we want to downsample the majority class so that both classes end up with the same number of rows.

In [8]:
import pandas as pd
from sklearn.utils import resample
df = pd.DataFrame({
    'Age':[22,25,27,28,30,35,40,45,50,55,60,65,70],
    'Income':[2000,2500,2700,3200,3500,3800,4000,4200,4300,4500,5000,5500,6000],
    'Class':['High','Low','Low','High','High','Low','High','High','Low','Low','High','High','Low']
})

High class has 7 instances.

Low class has 6 instances.
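The counts above can be checked in code rather than by hand. The following is a minimal sketch that rebuilds the same DataFrame and reads the class sizes with `value_counts()`; the variable names `class_counts`, `majority_label`, and `minority_label` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Same dataset as in the cell above
df = pd.DataFrame({
    'Age': [22, 25, 27, 28, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [2000, 2500, 2700, 3200, 3500, 3800, 4000, 4200, 4300, 4500, 5000, 5500, 6000],
    'Class': ['High', 'Low', 'Low', 'High', 'High', 'Low', 'High',
              'High', 'Low', 'Low', 'High', 'High', 'Low']
})

# Number of rows per class label
class_counts = df['Class'].value_counts()
print(class_counts)

# Labels of the largest and smallest classes (illustrative names)
majority_label = class_counts.idxmax()
minority_label = class_counts.idxmin()
print(f"majority: {majority_label}, minority: {minority_label}")
```

Reading the labels from the data this way avoids hard-coding which class is the majority.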

In [25]:
#Separate majority and minority classes
df_high = df[df['Class'] == 'High']
df_low  = df[df['Class'] == 'Low']
print(f"df high:{df_high}")
print(f"df low:{df_low}")

df high:    Age  Income Class
0    22    2000  High
3    28    3200  High
4    30    3500  High
6    40    4000  High
7    45    4200  High
10   60    5000  High
11   65    5500  High
df low:    Age  Income Class
1    25    2500   Low
2    27    2700   Low
5    35    3800   Low
8    50    4300   Low
9    55    4500   Low
12   70    6000   Low


In [17]:
#Downsample the majority class without replacement to the size of the minority class
df_high_downsampled = resample(df_high, replace=False, n_samples=len(df_low), random_state=42)

In [19]:
#Combine downsampled majority with minority class
df_balanced = pd.concat([df_high_downsampled,df_low])

In [23]:
print(df_balanced['Class'].value_counts())

Class
High    6
Low     6
Name: count, dtype: int64
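The same downsampling idea can be written without naming the majority class in advance. The sketch below is one possible generalization, not the notebook's method: `downsample_to_smallest` is a hypothetical helper that uses pandas `groupby(...).sample(...)` (available in pandas 1.1 and later) to keep as many rows per class as the rarest class has, then shuffles the result.

```python
import pandas as pd

def downsample_to_smallest(df, label_col, random_state=42):
    """Hypothetical helper: keep n rows per class, where n is the rarest class size."""
    n_min = int(df[label_col].value_counts().min())
    # Draw n_min rows without replacement from every class
    balanced = df.groupby(label_col).sample(n=n_min, random_state=random_state)
    # Shuffle so the classes are not stored in contiguous blocks
    return balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)

df = pd.DataFrame({
    'Age': [22, 25, 27, 28, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [2000, 2500, 2700, 3200, 3500, 3800, 4000, 4200, 4300, 4500, 5000, 5500, 6000],
    'Class': ['High', 'Low', 'Low', 'High', 'High', 'Low', 'High',
              'High', 'Low', 'Low', 'High', 'High', 'Low']
})

balanced = downsample_to_smallest(df, 'Class')
print(balanced['Class'].value_counts())
```

Because the rows are drawn with a different routine, the specific High rows that are kept may differ from those chosen by `resample` even with the same seed.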


***Upsampling the Minority Class***

Now consider a binary classification dataset in which the Minority class has fewer instances than the Majority class. Here we upsample the minority class until it matches the majority class in size.

In [52]:
import pandas as pd
from sklearn.utils import resample

#Sample dataset
df = pd.DataFrame({
    'Age':[22,25,27,28,30,35,40,45,50,55,60,65],
    'Income':[2000,2500,2700,3200,3500,3800,4000,4300,4500,5000,5500,6000],
    'Class':['Minority','Majority','Majority','Majority','Minority','Majority','Minority','Majority','Majority','Minority','Majority','Majority']
})

Majority class has 8 instances

Minority class has 4 instances

In [55]:
#Separate majority and minority classes
df_majority = df[df['Class'] == 'Majority']
df_minority = df[df['Class'] == 'Minority']


In [57]:
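#Upsample the minority class with replacement to the size of the majority class
df_minority_upsampled = resample(df_minority, replace=True, n_samples=len(df_majority), random_state=42)
#Combine the majority class with the upsampled minority class
df_balanced = pd.concat([df_majority, df_minority_upsampled])
print(df_balanced['Class'].value_counts())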

Class
Majority    8
Minority    8
Name: count, dtype: int64
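Since `replace=True` draws rows with replacement, the 8 upsampled minority rows are copies of the 4 original ones, not new observations. A minimal sketch that makes this visible by counting how often each original index label was drawn; the names `draw_counts`, `n_distinct`, and `n_repeats` are illustrative.

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    'Age': [22, 25, 27, 28, 30, 35, 40, 45, 50, 55, 60, 65],
    'Income': [2000, 2500, 2700, 3200, 3500, 3800, 4000, 4300, 4500, 5000, 5500, 6000],
    'Class': ['Minority', 'Majority', 'Majority', 'Majority', 'Minority', 'Majority',
              'Minority', 'Majority', 'Majority', 'Minority', 'Majority', 'Majority']
})

df_majority = df[df['Class'] == 'Majority']
df_minority = df[df['Class'] == 'Minority']

df_minority_upsampled = resample(df_minority, replace=True,
                                 n_samples=len(df_majority), random_state=42)

# The index labels are preserved, so they show which original rows were drawn
draw_counts = df_minority_upsampled.index.value_counts()
print(draw_counts)

n_distinct = df_minority_upsampled.index.nunique()
n_repeats = len(df_minority_upsampled) - n_distinct
print(f"{n_distinct} distinct original rows, {n_repeats} repeated draws")
```

At least 4 of the 8 draws have to be repeats, because only 4 distinct minority rows exist.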


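***SMOTE (Synthetic Minority Over-sampling Technique)***

SMOTE is a technique for addressing class imbalance in machine learning, particularly in classification tasks. Rather than duplicating existing minority rows, it generates new synthetic minority samples by interpolating between a minority sample and one of its nearest minority-class neighbors.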

In [59]:
%pip install imbalanced-learn

Note: you may need to restart the kernel to use updated packages.
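SMOTE searches for `k_neighbors` nearest neighbors inside the minority class, so `k_neighbors` has to be smaller than the number of minority samples. The 12-row dataset from the upsampling example has only 4 minority rows, so the default of 5 cannot work there and 3 is the largest usable value. A minimal sketch of that check, where `choose_k_neighbors` is a hypothetical helper and not part of imbalanced-learn:

```python
import pandas as pd

def choose_k_neighbors(y, default_k=5):
    """Hypothetical helper: largest usable k_neighbors, capped at default_k."""
    n_minority = int(y.value_counts().min())
    if n_minority < 2:
        raise ValueError("at least 2 minority samples are needed to interpolate")
    # A sample cannot be its own neighbor, hence n_minority - 1
    return min(default_k, n_minority - 1)

# Labels of the 12-row dataset (4 Minority, 8 Majority)
y = pd.Series(['Minority', 'Majority', 'Majority', 'Majority', 'Minority', 'Majority',
               'Minority', 'Majority', 'Majority', 'Minority', 'Majority', 'Majority'])

k = choose_k_neighbors(y)
print(k)
```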


In [66]:
import pandas as pd
from imblearn.over_sampling import SMOTE

#Sample dataset
df = pd.DataFrame({
    'Age':[22,25,27,28,30,35,40,45,50,55,60,65],
    'Income':[2000,2500,2700,3200,3500,3800,4000,4300,4500,5000,5500,6000],
    'Class':['Minority','Majority','Majority','Majority','Minority','Majority','Minority','Majority','Majority','Minority','Majority','Majority']
})
#Step 1:Convert categorical labels to numerical values
df['Class'] = df['Class'].map({'Majority':0,'Minority':1})

#Step 2:Split features(X) and target variable(y)
X = df[['Age','Income']]
y = df['Class']

#Step 3: Apply SMOTE with k_neighbors=3 (the default of 5 would fail because the minority class has only 4 samples)
smote = SMOTE(sampling_strategy='auto', random_state=42, k_neighbors=3)
X_resampled, y_resampled = smote.fit_resample(X, y)

#Step 4: Convert numeric labels back to categorical
y_sampled = y_resampled.map({0:'Majority',1:'Minority'})

#Step 5: Combine the resampled features with the converted labels
df_balanced = pd.concat([pd.DataFrame(X_resampled,columns=['Age','Income']),pd.DataFrame(y_sampled,columns=['Class'])],axis=1)

#Step 6: Print class distribution
print(df_balanced['Class'].value_counts())
print(df_balanced)

Class
Minority    8
Majority    8
Name: count, dtype: int64
    Age  Income     Class
0    22    2000  Minority
1    25    2500  Majority
2    27    2700  Majority
3    28    3200  Majority
4    30    3500  Minority
5    35    3800  Majority
6    40    4000  Minority
7    45    4300  Majority
8    50    4500  Majority
9    55    5000  Minority
10   60    5500  Majority
11   65    6000  Majority
12   34    3700  Minority
13   31    3578  Minority
14   51    4766  Minority
15   40    4058  Minority
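The rows at index 12 to 15 above are the synthetic minority samples. SMOTE creates each one by taking a minority sample, choosing one of its nearest minority neighbors, and placing a new point at a random position on the segment between the two. The sketch below illustrates only that interpolation step with NumPy. The two points come from the dataset, while `interpolate`, the variable names, and the fixed value of `lam` are illustrative assumptions; imbalanced-learn's neighbor search and dtype handling are not reproduced.

```python
import numpy as np

# Two original minority samples from the dataset, as (Age, Income)
x_i = np.array([30.0, 3500.0])
x_neighbor = np.array([40.0, 4000.0])

def interpolate(x, neighbor, lam):
    """Return a point on the segment from x to neighbor, with lam in [0, 1]."""
    return x + lam * (neighbor - x)

# lam is fixed here for illustration; SMOTE draws it at random
x_new = interpolate(x_i, x_neighbor, lam=0.4)
print(x_new)

# With a random lam the new point still lies between the two samples
rng = np.random.default_rng(42)
x_random = interpolate(x_i, x_neighbor, lam=rng.uniform(0.0, 1.0))
print(x_random)
```

With `lam=0.4` the result is Age 34 and Income 3700, which is consistent with row 12 in the output, although the value that SMOTE actually drew is not known from the output alone.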
