In [29]:
import pandas as pd
import kagglehub as kh
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

In [3]:
path = kh.dataset_download("rakeshrau/social-network-ads")

dataset = pd.read_csv(f"{path}/Social_Network_Ads.csv")

# Drop the identifier and the unused categorical column; `columns=` already implies axis=1.
dataset = dataset.drop(columns=["User ID", "Gender"])
dataset.tail()

     Age  EstimatedSalary  Purchased
395   46            41000          1
396   51            23000          1
397   50            20000          1
398   36            33000          0
399   49            36000          1


In [4]:
dataset["Purchased"].value_counts()

Purchased
0    257
1    143
Name: count, dtype: int64

In [5]:
x = dataset.drop("Purchased", axis=1)
y = dataset["Purchased"]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)


In [6]:
lr = LogisticRegression()
lr.fit(x_train,y_train)
lr.score(x_test,y_test)

0.88

In [13]:

sample = pd.DataFrame([[46, 41000]], columns=x.columns)
lr.predict(sample)

array([0])

Undersampling


In [21]:
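# No random_state is set, so the sampled rows (and the score below) vary between runs.
ru = RandomUnderSampler()
# Note: this resamples the full dataset, before the train/test split.
ru_x, ru_y = ru.fit_resample(x, y)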

ru_y.value_counts()

Purchased
0    143
1    143
Name: count, dtype: int64

In [26]:
x_train,x_test,y_train,y_test = train_test_split(ru_x,ru_y,test_size=.25,random_state=42)

lr.fit(x_train,y_train)
lr.score(x_test,y_test)

0.8194444444444444

In [28]:
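# Pass a DataFrame with the training column names to avoid scikit-learn's feature-name warning.
lr.predict(pd.DataFrame([[36, 33000]], columns=x.columns))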



array([0])

Oversampling

In [31]:
# Named `ros` rather than `os` to avoid shadowing the standard-library `os` module.
ros = RandomOverSampler()

os_x, os_y = ros.fit_resample(x, y)


x_train,x_test,y_train,y_test = train_test_split(os_x,os_y,test_size=.25,random_state=42)

lr.fit(x_train,y_train)
lr.score(x_test,y_test)

0.875968992248062

In [36]:
os_y.value_counts()

Purchased
0    257
1    257
Name: count, dtype: int64

In [32]:
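# Pass a DataFrame with the training column names to avoid scikit-learn's feature-name warning.
lr.predict(pd.DataFrame([[36, 33000]], columns=x.columns))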



array([0])

### Notes on Imbalanced Datasets

- **Definition:** An imbalanced dataset is one where the classes are not represented equally. Typically, one class (the majority class) has significantly more samples than the other class(es) (the minority class).

- **Challenges:**
    - Standard machine learning algorithms may be biased towards the majority class.
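    - Metrics like accuracy can be misleading: with the 257/143 split above, a model that always predicts class 0 would score about 64% accuracy while never identifying a single purchase. Alternative metrics such as precision, recall, F1-score, and ROC-AUC are preferred.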
    - The minority class, often the class of interest, may be under-predicted.

- **Common Solutions:**
    - **Resampling Techniques:** 
        - *Oversampling* the minority class (e.g., SMOTE).
        - *Undersampling* the majority class.
    - **Algorithmic Approaches:** 
        - Use algorithms that tend to cope better with class imbalance (e.g., tree-based ensembles), keeping in mind that they can still favor the majority class.
        - Adjust class weights in the loss function.
    - **Evaluation Metrics:** 
        - Use confusion matrix, precision, recall, F1-score, ROC-AUC, and PR-AUC for better assessment.

- **Applications:** Fraud detection, medical diagnosis, anomaly detection, etc., where rare events are of high importance.
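
To make the points about misleading accuracy and class weights concrete, here is a small worked example using only the class counts reported earlier (257 vs. 143). It is a sketch, not part of the original analysis: the "always predict 0" model is hypothetical, and the weight formula is the `n_samples / (n_classes * class_count)` heuristic that scikit-learn applies for `class_weight="balanced"`.

```python
# Class counts taken from the value_counts() output above.
n_negative, n_positive = 257, 143
n_total = n_negative + n_positive

# A trivial, hypothetical "model" that always predicts the majority class (0).
tp, fp = 0, 0                    # it never predicts class 1
tn, fn = n_negative, n_positive  # every positive sample is missed

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)
# Precision is undefined when nothing is predicted positive; use 0.0 by convention.
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"accuracy={accuracy:.4f} recall={recall:.2f} precision={precision:.2f}")
# accuracy=0.6425 recall=0.00 precision=0.00

# "Balanced" class weights: n_samples / (n_classes * class_count).
n_classes = 2
weights = {
    0: n_total / (n_classes * n_negative),
    1: n_total / (n_classes * n_positive),
}
print({label: round(w, 3) for label, w in weights.items()})
# {0: 0.778, 1: 1.399}
```

The minority class ends up with roughly 1.8 times the weight of the majority class, which is how a weighted loss compensates without changing the data.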

### Resampling Techniques for Imbalanced Datasets

Resampling techniques adjust the class distribution of a dataset to address class imbalance. The goal is a more balanced representation of the classes, so that a model gives the minority class more attention during training. This can improve minority-class recall, often at some cost in precision.

#### 1. Oversampling

- **Definition:** Increases the number of samples in the minority class by replicating existing samples or generating new synthetic samples.
- **Methods:**
    - **Random Oversampling:** Randomly duplicates examples from the minority class.
    - **SMOTE (Synthetic Minority Over-sampling Technique):** Generates synthetic samples by interpolating between existing minority class samples.
- **Pros:** Helps models learn more about the minority class.
- **Cons:** May lead to overfitting, especially with simple duplication.
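
The interpolation idea behind SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the `imblearn` implementation: the function name `smote_like`, its parameters, and the sample points are all hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical, already-scaled minority samples (two features each).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
new_points = smote_like(minority, n_new=4)
print(new_points)
```

Because the method relies on distances, features on very different scales (such as `Age` and `EstimatedSalary`) would normally be scaled first; otherwise the larger-valued feature dominates the neighbour search.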

#### 2. Undersampling

- **Definition:** Reduces the number of samples in the majority class, either by removing samples or by condensing them into fewer representatives.
- **Methods:**
    - **Random Undersampling:** Randomly removes examples from the majority class.
    - **Cluster Centroids:** Replaces clusters of majority samples with their centroids.
- **Pros:** Reduces training time and memory usage.
- **Cons:** Risk of losing important information from the majority class.
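
As a rough picture of what random undersampling does, the sketch below keeps every minority row and draws an equally sized random subset of the majority rows. The helper `random_undersample` and the stand-in rows are hypothetical; only the 257/143 label counts come from the notebook.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=42):
    """Keep an equal number of rows per class by randomly dropping
    rows from the larger classes."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = min(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, target))  # sample without replacement
    kept.sort()  # preserve the original row order
    return [rows[i] for i in kept], [labels[i] for i in kept]


labels = [0] * 257 + [1] * 143
rows = list(range(len(labels)))  # stand-in for the feature rows
_, balanced_labels = random_undersample(rows, labels)
print(Counter(balanced_labels))
```

Both classes end up with 143 rows, matching the `value_counts()` output of the undersampling cell, and 114 majority rows are discarded in the process.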

#### 3. Combination (Hybrid) Methods

- **Definition:** Combines both oversampling and undersampling to balance the dataset.
- **Example:** First applies SMOTE to oversample the minority class, then randomly undersamples the majority class.

#### 4. Advanced Techniques

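- **Tomek Links:** A Tomek link is a pair of samples from opposite classes that are each other's nearest neighbors. Removing the majority-class member of each pair (or both members) cleans up the decision boundary.
- **Edited Nearest Neighbors (ENN):** Removes samples whose class label differs from the majority of their nearest neighbors; it is typically applied to the majority class.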
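
The Tomek-link definition (mutual nearest neighbours with different labels) translates directly into code. The following is a brute-force sketch for illustration only; the function name and the toy points are assumptions, and a real workflow would more likely rely on `imblearn`.

```python
import math


def tomek_links(points, labels):
    """Return index pairs (i, j) that are mutual nearest neighbours
    and carry different class labels."""

    def nearest(i):
        return min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )

    links = []
    for i in range(len(points)):
        j = nearest(i)
        # i < j avoids reporting the same pair twice.
        if i < j and labels[i] != labels[j] and nearest(j) == i:
            links.append((i, j))
    return links


# Toy data: two tight same-class pairs and one cross-class pair in between.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (0.55, 0.5), (0.6, 0.5)]
labels = [0, 0, 1, 1, 0, 1]
print(tomek_links(points, labels))  # [(4, 5)]
```

Only the pair at indices 4 and 5 qualifies, since the other nearest-neighbour pairs share a label. Undersampling with Tomek links would then drop the majority-class member of that pair.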

#### When to Use Resampling?

- When the dataset is highly imbalanced and standard algorithms perform poorly on the minority class.
- When evaluation metrics (like recall or F1-score) for the minority class are low.

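**Note:** Apply resampling only to the training set, after the train/test split. The cells above resample the full dataset before splitting, so duplicated rows can appear in both the training and test sets and the test set no longer reflects the original class distribution. Treat those scores as illustrative rather than as reliable estimates.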