# **06 - SMOTE Adjustment & Hybrid Resampling**

## **Objective**
In this notebook, we aim to **address the overfitting issue** found in the **SMOTE XGBoost model** by adjusting the **SMOTE ratio** and experimenting with **hybrid resampling techniques**.

---

## **Why Are We Doing This?**
From our **overfitting analysis (Notebook 05)**, we observed that:
- **SMOTE XGBoost achieved perfect recall (100%)**, but its precision dropped significantly.
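- The **learning curve showed a minimal generalization gap**, which by itself would not indicate overfitting; combined with perfect recall on data that contains synthetic samples, we treated it as a sign of a possibly **overfitted model**.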
- Since **SMOTE generates synthetic fraud cases**, it might have **introduced unrealistic patterns**, making the model **too confident**.
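
To make the last point concrete, here is a simplified, pure-Python sketch of how a SMOTE-style synthetic sample is produced: pick a minority point, pick one of its nearest minority neighbours, and interpolate between them. The function name `smote_like_sample`, the toy points, and `k=2` are illustrative assumptions; this is not imbalanced-learn's implementation.

```python
import random


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and one of its
    k nearest minority neighbours (simplified illustration only)."""
    rng = rng or random.Random()
    base = rng.choice(minority)

    # Rank the other minority points by squared Euclidean distance to the base
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))

    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation factor in [0, 1)

    # The synthetic point lies on the segment from base to neighbour
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


# Hypothetical toy minority class: the four corners of the unit square
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rng = random.Random(42)
synthetic = [smote_like_sample(minority_points, k=2, rng=rng) for _ in range(5)]

for point in synthetic:
    print(point)
```

Every synthetic point sits on a straight line between two real minority points. If the real fraud cases are scattered, those lines can pass through regions where no genuine fraud occurs, which is one way unrealistic patterns may enter the training data.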

---

## **What We Will Do**
1. **Reduce the SMOTE Ratio** → Instead of a strict **1:1 balance**, try lower minority-to-majority ratios (**0.5, 0.3, and 0.2**).
2. **Re-train & Evaluate New SMOTE Models** → Check if we reduce overfitting while maintaining good recall.

---

## **Final Goal**
- **Find the best SMOTE ratio** that balances **generalization and recall**.
- **Prepare a final resampled dataset** that will be used for model training.

**Imports**:

In [22]:
# Standard libraries
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Data Preprocessing
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Warnings
import warnings
warnings.filterwarnings("ignore")

print("All libraries imported successfully!")

All libraries imported successfully!


**Dataset Preparation**:

In [8]:
# Load the original scaled dataset
X = pd.read_csv("../datasets/X_scaled.csv")
y = pd.read_csv("../datasets/y.csv")

y = y.squeeze()

# Display basic information
print(f"Original Scaled Data Shape: {X.shape}")
print(f"Original Target Shape: {y.shape}")

# Check class distribution before resampling
class_distribution = y.value_counts(normalize=True)
print("\nClass distribution before resampling:")
print(class_distribution)

print("\nData loaded successfully!")

Original Scaled Data Shape: (284807, 30)
Original Target Shape: (284807,)

Class distribution before resampling:
Class
0    0.998273
1    0.001727
Name: proportion, dtype: float64

Data loaded successfully!


**Resampling Strategy Update**:
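In the previous approach, **SMOTE fully balanced the dataset (1:1 ratio)**, which likely contributed to overfitting. To improve generalization, we now test **lower SMOTE ratios**: `sampling_strategy` values of **0.5, 0.3, and 0.2**. In imbalanced-learn, a float `sampling_strategy` is the minority-to-majority ratio after resampling, so these values correspond to roughly **67:33, 77:23, and 83:17** majority:minority splits. The cell below applies SMOTE only; slightly undersampling the majority class as well is the hybrid option.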

In [21]:
# Define the SMOTE sampling_strategy values to test.
# A float sampling_strategy is the minority:majority ratio after resampling,
# so 0.5, 0.3, 0.2 give roughly 67:33, 77:23, 83:17 majority:minority splits.
smote_ratios = [0.5, 0.3, 0.2]

# Dictionary to store resampled datasets, keyed by sampling_strategy
smote_datasets = {}

for ratio in smote_ratios:
    print(f"\nApplying SMOTE with sampling_strategy={ratio} (minority:majority = {ratio}:1)...")

    # Apply SMOTE with the given ratio
    smote = SMOTE(sampling_strategy=ratio, random_state=42)
    X_resampled, y_resampled = smote.fit_resample(X, y)

    # Store resampled dataset
    smote_datasets[ratio] = (X_resampled, y_resampled)

    # Print new class distribution
    class_distribution = y_resampled.value_counts(normalize=True)
    print(f"New class distribution after SMOTE (sampling_strategy={ratio}):")
    print(class_distribution)

print("\nSMOTE resampling completed for all ratios!")


Applying SMOTE with sampling_strategy=0.5 (minority:majority = 0.5:1)...
New class distribution after SMOTE (sampling_strategy=0.5):
Class
0    0.666667
1    0.333333
Name: proportion, dtype: float64

Applying SMOTE with sampling_strategy=0.3 (minority:majority = 0.3:1)...
New class distribution after SMOTE (sampling_strategy=0.3):
Class
0    0.769232
1    0.230768
Name: proportion, dtype: float64

Applying SMOTE with sampling_strategy=0.2 (minority:majority = 0.2:1)...
New class distribution after SMOTE (sampling_strategy=0.2):
Class
0    0.833333
1    0.166667
Name: proportion, dtype: float64

SMOTE resampling completed for all ratios!
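
The class shares printed above follow directly from how a float `sampling_strategy` is defined: it is the minority-to-majority ratio after resampling, so the minority share of the whole dataset is `r / (1 + r)`. The helper name `minority_share` below is a hypothetical one used only for this arithmetic check.

```python
def minority_share(sampling_strategy):
    """Minority fraction of the resampled data when
    minority_count / majority_count == sampling_strategy."""
    return sampling_strategy / (1.0 + sampling_strategy)


for r in (0.5, 0.3, 0.2):
    share = minority_share(r)
    print(f"sampling_strategy={r}: minority ~ {share:.4f}, majority ~ {1 - share:.4f}")
```

The idealized values (about 0.3333, 0.2308, and 0.1667) agree with the printed distributions; small differences in the last digit, as in the 0.3 case, come from sample counts being whole numbers.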


The SMOTE-resampled datasets are now ready in `smote_datasets`, keyed by their `sampling_strategy` value.