# **Imbalanced Data**

Imbalanced data is a common challenge in machine learning where one class significantly outnumbers the other. This imbalance can cause models to perform poorly on the minority class, because a classifier can reach a high accuracy simply by predicting the majority class most of the time. Techniques like **Random Under-Sampling** and **Random Over-Sampling** address this issue by balancing the dataset.
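To make the point about accuracy concrete, the short sketch below computes the score of a hypothetical classifier that always predicts the majority class. The class counts are the ones reported for the dataset in this notebook (257 vs. 143); the variable names are illustrative only.

```python
# Class counts taken from the dataset used later in this notebook.
class_counts = {0: 257, 1: 143}  # 0 = Not Purchased, 1 = Purchased

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# A model that always predicts the majority class gets every majority
# sample right and every minority sample wrong.
baseline_accuracy = class_counts[majority_class] / total
imbalance_ratio = class_counts[majority_class] / min(class_counts.values())

print(f"Majority class: {majority_class}")
print(f"Always-majority accuracy: {baseline_accuracy:.2%}")
print(f"Imbalance ratio: {imbalance_ratio:.2f}:1")
```

Such a classifier would score 64.25% on this dataset without ever identifying a single purchaser, which is why accuracy should be read alongside per-class results.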

---

## **1. Random Under-Sampling (RUS)**

**Definition**:  
Random under-sampling involves reducing the number of samples in the majority class to match the minority class size. This creates a balanced dataset but may result in loss of potentially valuable data from the majority class.

**Steps**:  
1. Identify the imbalance in class distribution.
2. Randomly remove samples from the majority class until its count matches the minority class.
3. Train the model on the balanced dataset.
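The first two steps can be sketched in plain Python as follows. This is a minimal illustration of the idea, not how `imblearn` implements it; the helper `random_under_sample`, its `seed` parameter, and the toy data are assumptions made for the example.

```python
import random
from collections import Counter

def random_under_sample(rows, labels, seed=0):
    """Randomly drop rows so every class has as many rows as the smallest class."""
    rng = random.Random(seed)

    # Step 1: group rows by class to see the imbalance.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(group) for group in by_class.values())

    # Step 2: keep a random subset (without replacement) of each class.
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        out_rows.extend(rng.sample(group, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Toy data with the class counts used in this notebook (257 vs. 143).
labels = [0] * 257 + [1] * 143
rows = list(range(len(labels)))  # stand-in for feature rows

rus_rows, rus_labels = random_under_sample(rows, labels)
print(Counter(labels))
print(Counter(rus_labels))
```

Every minority row survives; only majority rows are discarded, which is where the information loss comes from.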

**Example from Code**:  
- Original class distribution:
  - Class 0 (Not Purchased): 257
  - Class 1 (Purchased): 143
- After under-sampling:
  - Class 0: 143
  - Class 1: 143
- Model accuracy: 75.86% (logistic regression, scored on a 20% test split of the under-sampled data)

**Advantages**:  
- Balances the dataset quickly.
- Reduces computational cost by reducing the dataset size.

**Disadvantages**:  
- May lose important information from the majority class.

---

## **2. Random Over-Sampling (ROS)**

**Definition**:  
Random over-sampling involves duplicating randomly chosen samples from the minority class until it matches the size of the majority class. This balances the dataset without discarding any data.

**Steps**:  
1. Identify the imbalance in class distribution.
2. Duplicate random samples from the minority class until its count matches the majority class.
3. Train the model on the balanced dataset.
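The first two steps can be sketched in plain Python as follows. This is a minimal illustration of the idea rather than the `imblearn` implementation; the helper `random_over_sample`, its `seed` parameter, and the toy data are assumptions made for the example.

```python
import random
from collections import Counter

def random_over_sample(rows, labels, seed=0):
    """Duplicate random rows so every class has as many rows as the largest class."""
    rng = random.Random(seed)

    # Step 1: group rows by class to see the imbalance.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())

    # Step 2: keep every original row, then draw extras with replacement.
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extras = rng.choices(group, k=target - len(group))
        out_rows.extend(group + extras)
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Toy data with the class counts used in this notebook (257 vs. 143).
labels = [0] * 257 + [1] * 143
rows = list(range(len(labels)))  # stand-in for feature rows

ros_rows, ros_labels = random_over_sample(rows, labels)
print(Counter(labels))
print(Counter(ros_labels))
```

No new information is created: the 114 extra minority rows are exact copies of existing ones, which is the source of the overfitting risk.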

**Example from Code**:  
- Original class distribution:
  - Class 0 (Not Purchased): 257
  - Class 1 (Purchased): 143
- After over-sampling:
  - Class 0: 257
  - Class 1: 257
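- Model accuracy: 86.41% (logistic regression, scored on a 20% test split of the over-sampled data; because the split is made after over-sampling, copies of the same row can appear in both the training and test sets, so this score may be optimistic)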

**Advantages**:  
- Preserves all original data.
- Can improve model performance on the minority class.

**Disadvantages**:  
- May lead to overfitting as the model sees duplicated samples multiple times.

---

## **When to Use Under-Sampling vs. Over-Sampling**

- **Under-Sampling**: Use when the dataset is large enough that discarding majority-class samples is affordable, and you want to reduce dataset size or computational cost.
- **Over-Sampling**: Use when the dataset is small or preserving all original data is essential, so that no information from the majority class is lost.

---

## **Conclusion**

Addressing imbalanced data is crucial for building effective models. Both **under-sampling** and **over-sampling** are simple ways to balance the class distribution, and the choice between them depends on the dataset size, computational constraints, and the importance of retaining all data. Balancing often improves predictions for the minority class even when overall accuracy does not rise: in the notebook below, test accuracy is 88.75% on the original data, 75.86% after under-sampling, and 86.41% after over-sampling (each measured on a different test split), while one more of the Age 45 purchasers is predicted correctly after resampling.


In [500]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from mlxtend.plotting import plot_decision_regions

In [501]:
ads_data = pd.read_csv('Social_Network_Ads.csv')
ads_data.head()

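|     | User ID  | Gender | Age | EstimatedSalary | Purchased |
| --- | -------- | ------ | --- | --------------- | --------- |
| 0   | 15624510 | Male   | 19  | 19000           | 0         |
| 1   | 15810944 | Male   | 35  | 20000           | 0         |
| 2   | 15668575 | Female | 26  | 43000           | 0         |
| 3   | 15603246 | Female | 27  | 57000           | 0         |
| 4   | 15804002 | Male   | 19  | 76000           | 0         |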


In [502]:
ads_data.drop(columns=['User ID', 'Gender'], inplace=True)
ads_data.head(3)

|     | Age | EstimatedSalary | Purchased |
| --- | --- | --------------- | --------- |
| 0   | 19  | 19000           | 0         |
| 1   | 35  | 20000           | 0         |
| 2   | 26  | 43000           | 0         |


In [503]:
ads_data['Purchased'].value_counts()

Purchased
0    257
1    143
Name: count, dtype: int64

In [504]:
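# Features: every column except the last (Age, EstimatedSalary)
x = ads_data.iloc[:, :-1]
# Target: 1 if the user purchased, 0 otherwise
y = ads_data['Purchased']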

In [505]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [506]:
lr = LogisticRegression()
lr.fit(x_train, y_train)
lr.score(x_test, y_test)*100

88.75

In [507]:
true_val = ads_data[ads_data['Age']==45]
true_val

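|     | Age | EstimatedSalary | Purchased |
| --- | --- | --------------- | --------- |
| 17  | 45  | 26000           | 1         |
| 20  | 45  | 22000           | 1         |
| 23  | 45  | 22000           | 1         |
| 259 | 45  | 131000          | 1         |
| 298 | 45  | 79000           | 0         |
| 318 | 45  | 32000           | 1         |
| 392 | 45  | 45000           | 1         |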


### On the imbalanced data, the model misclassifies most class 1 (Purchased) samples

In [508]:
lr.predict(true_val[['Age', 'EstimatedSalary']])

array([0, 0, 0, 1, 1, 0, 0], dtype=int64)
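To quantify the heading's claim, the sketch below tallies the predictions against the true `Purchased` values for the seven Age 45 rows. Both lists are copied from the outputs of the two cells above; the variable names are illustrative.

```python
# True labels and predictions copied from the two cells above.
y_true = [1, 1, 1, 1, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # purchasers predicted as purchasers
fn = sum(t == 1 and p == 0 for t, p in pairs)  # purchasers the model missed
fp = sum(t == 0 and p == 1 for t, p in pairs)  # non-purchasers flagged as purchasers
tn = sum(t == 0 and p == 0 for t, p in pairs)  # non-purchasers predicted correctly

recall_class_1 = tp / (tp + fn)
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"Recall for class 1: {recall_class_1:.2f}")
```

Only one of the six purchasers in this slice is identified, and the single non-purchaser is also misclassified.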

### Now let's balance the data and retrain the model

#### 1. Using Under-Sampling

In [509]:
from imblearn.under_sampling import RandomUnderSampler

In [510]:
# No random_state is set, so the rows kept (and the score below) can vary between runs
rus = RandomUnderSampler()
rus_x, rus_y = rus.fit_resample(x, y)

In [511]:
y.value_counts(), rus_y.value_counts()

(Purchased
 0    257
 1    143
 Name: count, dtype: int64,
 Purchased
 0    143
 1    143
 Name: count, dtype: int64)

In [512]:
x_train, x_test, y_train, y_test = train_test_split(rus_x, rus_y, test_size=0.2, random_state=42)
lr = LogisticRegression()
lr.fit(x_train, y_train)
lr.score(x_test, y_test)*100

75.86206896551724

In [513]:
true_val = ads_data[ads_data['Age']==45]
true_val

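|     | Age | EstimatedSalary | Purchased |
| --- | --- | --------------- | --------- |
| 17  | 45  | 26000           | 1         |
| 20  | 45  | 22000           | 1         |
| 23  | 45  | 22000           | 1         |
| 259 | 45  | 131000          | 1         |
| 298 | 45  | 79000           | 0         |
| 318 | 45  | 32000           | 1         |
| 392 | 45  | 45000           | 1         |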


In [514]:
lr.predict(true_val[['Age', 'EstimatedSalary']])

array([0, 0, 0, 1, 1, 0, 1], dtype=int64)

#### 2. Using Over-Sampling

In [515]:
from imblearn.over_sampling import RandomOverSampler

In [516]:
# No random_state is set, so the rows duplicated (and the score below) can vary between runs
ros = RandomOverSampler()
ros_x, ros_y = ros.fit_resample(x, y)

In [517]:
y.value_counts(), ros_y.value_counts()

(Purchased
 0    257
 1    143
 Name: count, dtype: int64,
 Purchased
 0    257
 1    257
 Name: count, dtype: int64)

In [518]:
x_train, x_test, y_train, y_test = train_test_split(ros_x, ros_y, test_size=0.2, random_state=42)
lr1 = LogisticRegression()
lr1.fit(x_train, y_train)
lr1.score(x_test, y_test)*100

86.40776699029125
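One caveat about this score: the cells above over-sample the full dataset and only then split it, so a duplicated row can end up in both the training and the test set. The sketch below illustrates the effect on toy data with the same class counts and compares it with splitting first and over-sampling only the training rows. It uses the standard library instead of `imblearn` and scikit-learn, and the helpers `oversample`, `split`, and `leaked` are hypothetical names introduced for this example.

```python
import random

def oversample(rows, seed=0):
    """Duplicate random class 1 rows until both classes are the same size.

    Each row is a (row_id, label) pair; row_id stands in for the features.
    Assumes class 0 is the larger class, as in this notebook.
    """
    rng = random.Random(seed)
    majority = [r for r in rows if r[1] == 0]
    minority = [r for r in rows if r[1] == 1]
    extras = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extras

def split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and return (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def leaked(train, test):
    """Count test rows whose original row also appears in the training set."""
    train_ids = {row_id for row_id, _ in train}
    return sum(row_id in train_ids for row_id, _ in test)

data = [(i, 0) for i in range(257)] + [(i, 1) for i in range(257, 400)]

# Order used in this notebook: resample first, then split.
train_a, test_a = split(oversample(data))

# Alternative order: split first, then resample only the training rows.
train_b, test_b = split(data)
train_b = oversample(train_b)

print("Leaked test rows, resample then split:", leaked(train_a, test_a))
print("Leaked test rows, split then resample:", leaked(train_b, test_b))
```

With the second ordering the count is zero by construction, and the test set keeps the original class distribution, so the score reflects data the model has not seen.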

In [519]:
true_val = ads_data[ads_data['Age']==45]
true_val

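|     | Age | EstimatedSalary | Purchased |
| --- | --- | --------------- | --------- |
| 17  | 45  | 26000           | 1         |
| 20  | 45  | 22000           | 1         |
| 23  | 45  | 22000           | 1         |
| 259 | 45  | 131000          | 1         |
| 298 | 45  | 79000           | 0         |
| 318 | 45  | 32000           | 1         |
| 392 | 45  | 45000           | 1         |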


In [520]:
lr1.predict(true_val[['Age', 'EstimatedSalary']])

array([0, 0, 0, 1, 1, 0, 1], dtype=int64)
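As a compact comparison, the sketch below computes the class 1 recall on the seven Age 45 rows for each of the three models. The prediction arrays are copied from the cell outputs in this notebook; the helper `recall_for_class` is a hypothetical name introduced for the example, and seven rows are far too few to draw general conclusions.

```python
# True labels for the Age 45 rows, in the order shown in the tables above.
y_true = [1, 1, 1, 1, 0, 1, 1]

# Predictions copied from the three predict() outputs in this notebook.
predictions = {
    "imbalanced": [0, 0, 0, 1, 1, 0, 0],
    "under-sampled": [0, 0, 0, 1, 1, 0, 1],
    "over-sampled": [0, 0, 0, 1, 1, 0, 1],
}

def recall_for_class(y_true, y_pred, positive=1):
    """Fraction of samples of the positive class that were predicted as positive."""
    hits = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    total = sum(t == positive for t in y_true)
    return hits / total

recalls = {name: recall_for_class(y_true, y_pred) for name, y_pred in predictions.items()}
for name, value in recalls.items():
    print(f"{name:>13}: recall for class 1 = {value:.2f}")
```

Both resampled models recover one additional purchaser (row 392) compared with the model trained on the imbalanced data.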