# **06 - SMOTE Adjustment & Hybrid Resampling**

## **Objective**
In this notebook, we aim to **address the overfitting issue** found in the **SMOTE XGBoost model** by adjusting the **SMOTE ratio** and experimenting with **hybrid resampling techniques**.

---

## **Why Are We Doing This?**
From our **overfitting analysis (Notebook 05)**, we observed that:
- **SMOTE XGBoost achieved perfect recall (100%)**, but its precision dropped significantly.
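- The **learning curve showed a minimal generalization gap**. A small gap alone does not indicate overfitting, but it can be misleading when the validation folds also contain SMOTE-generated samples, so we still suspect an **overfitted model**.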
- Since **SMOTE generates synthetic fraud cases**, it might have **introduced unrealistic patterns**, making the model **too confident**.

---

## **What We Will Do**
1. **Reduce the SMOTE Ratio** → Instead of a strict **1:1 balance**, try a majority:minority ratio of **7:3 or 8:2**.
2. **Combine SMOTE with Undersampling** → Reduce the majority class slightly **before** applying SMOTE.
3. **Re-train & Evaluate New SMOTE Models** → Check if we reduce overfitting while maintaining good recall.
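
The ratios in step 1 are written as majority:minority, whereas imbalanced-learn samplers take a float `sampling_strategy` that expresses the minority count divided by the majority count after resampling. The helper below is a minimal sketch of that conversion; `ratio_to_sampling_strategy` is a hypothetical name introduced here for illustration, not part of any library.

```python
from fractions import Fraction


def ratio_to_sampling_strategy(ratio: str) -> float:
    """Convert a 'majority:minority' ratio such as '7:3' into the
    minority/majority float used as `sampling_strategy`."""
    majority_part, minority_part = (int(part) for part in ratio.split(":"))
    if majority_part <= 0 or minority_part <= 0:
        raise ValueError("both parts of the ratio must be positive")
    if minority_part > majority_part:
        raise ValueError("the minority part must not exceed the majority part")
    return float(Fraction(minority_part, majority_part))


# Ratios mentioned in this notebook, plus the original 1:1 setting.
for ratio in ("1:1", "7:3", "8:2"):
    print(ratio, "->", round(ratio_to_sampling_strategy(ratio), 4))
```

A 7:3 target therefore corresponds to `sampling_strategy=3/7` (about 0.43), not 0.3 or 0.7.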

---

## **Final Goal**
- **Find the best SMOTE ratio** that balances **generalization and recall**.
- **Prepare a final resampled dataset** that will be used for model training.

**Imports**:

In [1]:
# Standard libraries
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Data Preprocessing
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Warnings
import warnings
warnings.filterwarnings("ignore")

print("All libraries imported successfully!")

All libraries imported successfully!


**Dataset Preparation**:

In [4]:
# Load the original scaled dataset
X = pd.read_csv("../datasets/X_scaled.csv")
y = pd.read_csv("../datasets/y.csv")

# Display basic information
print(f"Original Scaled Data Shape: {X.shape}")
print(f"Original Target Shape: {y.shape}")

# Check class distribution before resampling
class_distribution = y.value_counts(normalize=True)
print("\nClass distribution before resampling:")
print(class_distribution)

print("\nData loaded successfully!")

Original Scaled Data Shape: (284807, 30)
Original Target Shape: (284807, 1)

Class distribution before resampling:
Class
0        0.998273
1        0.001727
Name: proportion, dtype: float64

Data loaded successfully!
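
To make the imbalance concrete, the proportions printed above can be turned into approximate row counts, along with the number of synthetic fraud rows SMOTE would need to create for a given target ratio. This is a rough sketch: the fraud count is derived from the rounded proportion, `synthetic_needed` is a hypothetical helper, and the figures assume SMOTE is applied to the full dataset rather than to a training split.

```python
# Figures taken from the notebook output above.
n_rows = 284_807
fraud_share = 0.001727  # rounded proportion, so the derived count is approximate

n_fraud = round(n_rows * fraud_share)  # estimated number of fraud rows
n_legit = n_rows - n_fraud             # estimated number of non-fraud rows


def synthetic_needed(n_majority: int, n_minority: int, strategy: float) -> int:
    """Synthetic minority rows needed so that minority/majority reaches `strategy`."""
    return max(int(strategy * n_majority) - n_minority, 0)


print(f"fraud rows: {n_fraud:,}, non-fraud rows: {n_legit:,}")
for label, strategy in [("1:1", 1.0), ("70:30", 3 / 7), ("60:40", 4 / 6)]:
    extra = synthetic_needed(n_legit, n_fraud, strategy)
    print(f"{label}: {extra:,} synthetic rows ({extra / n_fraud:.0f}x the real fraud rows)")
```

Even at 70:30 the synthetic rows outnumber the real fraud rows by a wide margin, which is one reason to also shrink the majority class before oversampling.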


**Resampling Strategy Update**:
In the previous approach, **SMOTE fully balanced the dataset (1:1 ratio)**, which likely contributed to overfitting. To improve generalization, we will now test **several majority:minority ratios (e.g., 70:30, 60:40)** and slightly undersample the majority class before applying SMOTE, looking for the ratio that best balances recall and generalization.