# **Problem Statement**  
## **31. Use SMOTE to handle class imbalance in a binary classification problem.**

Handle class imbalance in a binary classification problem using SMOTE (Synthetic Minority Over-sampling Technique).

The objectives are to:
- Balance the class distribution of the training data
- Improve how well the model learns the minority class
- Validate the results with before/after comparisons

### Constraints & Example Inputs/Outputs

### Constraints
- Binary classification problem
- Minority class has significantly fewer samples
- Features are numerical (SMOTE interpolates in feature space, so categorical features need a variant such as SMOTENC)

### Example Input
```text
Class 0: 90 samples
Class 1: 10 samples
```

### Expected Output
```text
Class 0: 90 samples
Class 1: 90 samples (after SMOTE)
```

### Solution Approach

**Step 1: Understand Class Imbalance**
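- Most classifiers become biased toward the majority class
- Accuracy becomes misleading (always predicting the majority class already scores 90% on a 90/10 split)
- Minority-class recall and F1-score suffer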
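
To make the accuracy point concrete, here is a minimal sketch using the 90/10 split from the example input. The "model" is a hypothetical baseline that always predicts class 0; the variable names are illustrative and not part of the solution code.

```python
# Hypothetical baseline: always predict the majority class (0).
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

# Accuracy: fraction of labels predicted correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Minority recall: fraction of actual class-1 samples predicted as class 1.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / y_true.count(1)

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
# accuracy=0.90, minority recall=0.00
```

The baseline looks strong on accuracy while never finding a single minority sample, which is why the comparisons later in this notebook rely on per-class recall and F1.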

**Step 2: Baseline (Brute Force) Oversampling**
- Randomly duplicates minority samples
- Can lead to overfitting, because the model sees exact copies of the same points

**Step 3: SMOTE (Optimized Approach)**
- Generates synthetic minority samples
- Interpolates between a minority sample and one of its k nearest minority-class neighbors
- Avoids exact duplication
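
The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration, not imbalanced-learn's implementation: the helper name `smote_like_sample`, its parameters, and the toy points are assumptions made for this example, and it assumes the minority points are distinct.

```python
import numpy as np

def smote_like_sample(X_min, k=2, rng=None):
    """Create one synthetic point from minority samples X_min (illustrative only)."""
    rng = np.random.default_rng(rng)
    # Pick a random minority sample as the base point.
    base = X_min[rng.integers(len(X_min))]
    # Find its k nearest minority neighbors by Euclidean distance.
    # Index 0 of the sorted order is the base point itself (distance 0), so skip it.
    dists = np.linalg.norm(X_min - base, axis=1)
    neighbors = np.argsort(dists)[1:k + 1]
    # Pick one neighbor and move a random fraction of the way toward it.
    neighbor = X_min[rng.choice(neighbors)]
    gap = rng.random()
    return base + gap * (neighbor - base)

# Four toy minority points on the corners of a unit square.
X_min = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
synthetic = smote_like_sample(X_min, k=2, rng=0)
print(synthetic)
```

Each synthetic point lies on the line segment between two existing minority points, so it is new data rather than an exact copy.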

**Step 4: Apply SMOTE**
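- Fit and resample on the training data only
- Never resample the test data: it must keep the original class distribution, and synthetic points built from it would leak information into the evaluation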
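
The ordering matters more than the resampler, so the sketch below shows only the ordering: split first, then resample the training portion. It uses plain NumPy, a hand-rolled stratified split, and random duplication as a stand-in for SMOTE; the toy data and all variable names are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced data: 90 majority rows (label 0) and 10 minority rows (label 1).
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

# 1) Split first (30% of each class goes to the test set).
test_idx = []
for label in (0, 1):
    label_idx = rng.permutation(np.flatnonzero(y == label))
    n_test = int(round(0.3 * len(label_idx)))
    test_idx.extend(label_idx[:n_test])
test_mask = np.zeros(len(y), dtype=bool)
test_mask[test_idx] = True

X_train, y_train = X[~test_mask], y[~test_mask]
X_test, y_test = X[test_mask], y[test_mask]

# 2) Resample the training portion only (random duplication stands in for SMOTE).
minority_rows = np.flatnonzero(y_train == 1)
n_extra = int((y_train == 0).sum() - (y_train == 1).sum())
extra = rng.choice(minority_rows, size=n_extra, replace=True)
X_train_bal = np.concatenate([X_train, X_train[extra]])
y_train_bal = np.concatenate([y_train, y_train[extra]])

# The training set is balanced; the test set keeps its original 27/3 distribution.
print(np.bincount(y_train_bal), np.bincount(y_test))
```

Swapping the duplication step for `SMOTE().fit_resample(X_train, y_train)` keeps the same guarantee, because the test rows were set aside before any resampling happened.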

**Step 5: Validate**
- Compare class distribution
- Train model before & after SMOTE

### Solution Code

In [6]:
# Approach 1: Brute Force Approach: Random Oversampling
# Logic:
# - Randomly duplicate minority class samples
# - Simple but risky

import numpy as np
from collections import Counter

def random_oversample(X, y, random_state=None):
    """Duplicate random minority rows until both classes have the same count."""
    # A seeded generator makes the resampling reproducible
    rng = np.random.default_rng(random_state)
    X, y = np.asarray(X), np.asarray(y)

    counter = Counter(y.tolist())
    majority_class = max(counter, key=counter.get)
    minority_class = min(counter, key=counter.get)

    # Number of extra minority rows needed to match the majority class
    diff = counter[majority_class] - counter[minority_class]

    # Sample minority row indices with replacement
    minority_indices = np.flatnonzero(y == minority_class)
    extra = rng.choice(minority_indices, size=diff, replace=True)

    return np.concatenate([X, X[extra]]), np.concatenate([y, y[extra]])


### Alternative Solution

In [7]:
!pip3 install imbalanced-learn



In [8]:
# Approach 2: Optimized Approach: SMOTE
from imblearn.over_sampling import SMOTE

def apply_smote(X, y, random_state=42):
    smote = SMOTE(random_state=random_state)
    X_resampled, y_resampled = smote.fit_resample(X, y)
    return X_resampled, y_resampled


### Alternative Approaches

**Brute Force**
- Random Oversampling
- Random Undersampling
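
Random undersampling is not implemented elsewhere in this notebook, so here is a minimal sketch of it. The helper `random_undersample` and the demo arrays are hypothetical names introduced for this example; the function mirrors the interface of `random_oversample` but discards majority rows instead of duplicating minority rows.

```python
import numpy as np
from collections import Counter

def random_undersample(X, y, random_state=None):
    """Drop rows at random until every class matches the minority count (illustrative helper)."""
    rng = np.random.default_rng(random_state)
    X, y = np.asarray(X), np.asarray(y)
    counts = Counter(y.tolist())
    n_keep = min(counts.values())

    keep = []
    for label in counts:
        label_idx = np.flatnonzero(y == label)
        # Sampling n_keep rows without replacement keeps all minority rows
        # and a random subset of each larger class.
        keep.extend(rng.choice(label_idx, size=n_keep, replace=False))

    keep = np.sort(keep)
    return X[keep], y[keep]

X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 90 + [1] * 10)
X_under, y_under = random_undersample(X_demo, y_demo, random_state=0)
print(Counter(y_under.tolist()))
```

The trade-off is the opposite of oversampling: nothing is duplicated, but on a 90/10 dataset 80 of the 90 majority rows are thrown away.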

**Optimized**
- SMOTE ✅
- Borderline-SMOTE
- ADASYN
- Class weights (reweights the loss instead of resampling the data)
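
For the class-weight option, a small sketch of the usual "balanced" weighting may help. The helper `balanced_class_weights` is a hypothetical name for this example; the formula `n_samples / (n_classes * class_count)` is the one scikit-learn documents for `class_weight="balanced"`.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Same 90/10 distribution as the example input.
weights = balanced_class_weights([0] * 90 + [1] * 10)
print(weights)
```

On the 90/10 example each minority sample counts nine times as much as a majority sample in the loss. In scikit-learn the same effect is available without resampling, for instance via `LogisticRegression(class_weight="balanced")`.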

### Test Case

In [19]:
# Test Case 1: Create Imbalanced Dataset
from sklearn.datasets import make_classification
from collections import Counter

X, y = make_classification(
    n_samples=100,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    weights=[0.9, 0.1],
    random_state=42
)

Counter(y)


Counter({np.int64(0): 90, np.int64(1): 10})

In [20]:
# Test Case 2: Apply Brute Force Oversampling
X_brute, y_brute = random_oversample(X, y)
Counter(y_brute)


Counter({np.int64(0): 90, np.int64(1): 90})

In [21]:
# Test Case 3: Apply SMOTE
X_smote, y_smote = apply_smote(X, y)
Counter(y_smote)


Counter({np.int64(0): 90, np.int64(1): 90})

In [22]:
# Test Case 4: Train Model Before & After SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


In [23]:
# Without SMOTE
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.93      1.00      0.96        27
           1       1.00      0.33      0.50         3

    accuracy                           0.93        30
   macro avg       0.97      0.67      0.73        30
weighted avg       0.94      0.93      0.92        30



In [24]:
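# With SMOTE (resample the training split only; X_test and y_test stay untouched)
X_train_sm, y_train_sm = apply_smote(X_train, y_train)

# Refit the same estimator on the resampled training data
model.fit(X_train_sm, y_train_sm)
y_pred_sm = model.predict(X_test)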

print(classification_report(y_test, y_pred_sm))


              precision    recall  f1-score   support

           0       1.00      0.93      0.96        27
           1       0.60      1.00      0.75         3

    accuracy                           0.93        30
   macro avg       0.80      0.96      0.86        30
weighted avg       0.96      0.93      0.94        30



## Complexity Analysis

### Random Oversampling
- Time: O(n)
- Space: O(n)

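### SMOTE
- Time: O(n × k) to generate the synthetic samples, plus the cost of the nearest-neighbor search over the minority class (roughly O(m² × d) with a brute-force search)
- Space: O(n)

Where n = number of samples, k = number of nearest neighbors, m = number of minority samples, and d = number of features.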
