# SMOTE Deep Analysis - Oversampling Techniques for Imbalanced Fraud Data (No Modeling)

### What This Notebook Covers
This notebook is a **visual and conceptual exploration** of oversampling techniques for imbalanced datasets,  
using the **Credit Card Fraud dataset** (creditcard.csv from Kaggle).  

I do **not train models** here. Instead, the focus is on understanding **how different oversampling methods reshape the dataset**  
and what these changes mean for downstream modeling.

### Pipeline
1. **SMOTE (Regular)** — synthetic samples between minority neighbors  
2. **Random Oversampling (ROS)** — simple duplication of minority class  
3. **Borderline-SMOTE (1 & 2)** — focus on decision boundary (“danger zone”)  
4. **KMeans-SMOTE** — cluster-based adaptive oversampling  

### Goal
- Show **class counts** before & after each technique  
- Visualize how each method changes the **data distribution**  
- Provide **practical guidance** on when to use each oversampling method  

By the end, you’ll have a clear **intuition** of how oversampling works without running a single model.


## SMOTE

Creates synthetic minority samples by interpolating between a sample and its nearest minority neighbors.
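The interpolation idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the imbalanced-learn implementation: the helper `smote_point` and the toy `minority` points are hypothetical names made up for this example.

```python
import math
import random

def smote_point(x, minority, k, rng):
    """Return one synthetic point on the segment between x and one of its k nearest minority neighbors."""
    # k nearest minority neighbors of x, excluding x itself
    neighbors = sorted((p for p in minority if p is not x),
                       key=lambda p: math.dist(x, p))[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # random position along the segment, in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(x, neighbor))

rng = random.Random(42)
# toy minority points (hypothetical, for illustration only)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (5.0, 5.0)]
synthetic = [smote_point(minority[0], minority, k=3, rng=rng) for _ in range(5)]
print(synthetic)
```

With `k=3`, the far-away point `(5.0, 5.0)` is never picked as a neighbor of `(0.0, 0.0)`, so every synthetic point stays inside the unit square spanned by the three closest neighbors.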

In [1]:
CSV_PATH = "data/creditcard.csv"
TARGET_COL = "Class"

import pandas as pd
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# load
df = pd.read_csv(CSV_PATH)
y = df[TARGET_COL].astype(int)
X = df.drop(columns=[TARGET_COL])

# split so I don't oversample test (I'll look only at train)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

print("Original TRAIN distribution:", Counter(ytr))

# define parameter grids
sampling_list = [0.1, 0.25, 0.5, 1.0, 'minority', 'not majority', 'all']
smote_k_list = [3, 5, 7, 10]

# dict example: force minority to be ~80% of majority
dict_target = {1: int(0.8 * Counter(ytr)[0])}
sampling_list.append(dict_target)

# try combos and show class counts
for ss in sampling_list:
    for sk in smote_k_list:
        try:
            sm = SMOTE(sampling_strategy=ss, k_neighbors=sk, random_state=42)
            X_res, y_res = sm.fit_resample(Xtr, ytr)
            print(f"\n[sampling_strategy={ss}, k_neighbors={sk}]")
            print(" Resampled distribution:", Counter(y_res))
        except Exception as e:
            print(f"\n[sampling_strategy={ss}, k_neighbors={sk}] -> ERROR: {e}")


Original TRAIN distribution: Counter({0: 227451, 1: 394})

[sampling_strategy=0.1, k_neighbors=3]
 Resampled distribution: Counter({0: 227451, 1: 22745})

[sampling_strategy=0.1, k_neighbors=5]
 Resampled distribution: Counter({0: 227451, 1: 22745})

[sampling_strategy=0.1, k_neighbors=7]
 Resampled distribution: Counter({0: 227451, 1: 22745})

[sampling_strategy=0.1, k_neighbors=10]
 Resampled distribution: Counter({0: 227451, 1: 22745})

[sampling_strategy=0.25, k_neighbors=3]
 Resampled distribution: Counter({0: 227451, 1: 56862})

[sampling_strategy=0.25, k_neighbors=5]
 Resampled distribution: Counter({0: 227451, 1: 56862})

[sampling_strategy=0.25, k_neighbors=7]
 Resampled distribution: Counter({0: 227451, 1: 56862})

[sampling_strategy=0.25, k_neighbors=10]
 Resampled distribution: Counter({0: 227451, 1: 56862})

[sampling_strategy=0.5, k_neighbors=3]
 Resampled distribution: Counter({0: 227451, 1: 113725})

[sampling_strategy=0.5, k_neighbors=5]
 Resampled distribution: Counte

### Analysis of SMOTE Parameter Experiments

**Original train distribution:**  
- Majority (0): **227,451**  
- Minority (1): **394** → extremely imbalanced.

---

#### 1. sampling_strategy (float values)
- **0.1** → minority = 22,745 (≈10% of majority)  
- **0.25** → minority = 56,862 (≈25% of majority)  
- **0.5** → minority = 113,725 (≈50% of majority)  
- **1.0** → minority = 227,451 (balanced 1:1)  

Floats directly control the **ratio** of minority to majority after resampling.
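The counts above can be reproduced with simple arithmetic. The sketch below assumes the target is the majority count times the ratio, truncated to an integer; truncation is an assumption that happens to match the printed counts, not a statement about the library internals.

```python
n_majority, n_minority = 227451, 394  # train counts printed above

targets = {}
for ratio in (0.1, 0.25, 0.5, 1.0):
    target = int(ratio * n_majority)   # minority size after resampling (truncated)
    n_synthetic = target - n_minority  # points the sampler has to create
    targets[ratio] = target
    print(f"ratio={ratio}: target={target}, synthetic={n_synthetic}")
```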

---

#### 2. sampling_strategy (string values)
- 'minority', 'not majority', 'all' → all ended up balancing the classes (**227,451 vs 227,451**).  

In binary problems, these string options all act like 1.0.

---

#### 3. k_neighbors
- Tried 3, 5, 7, 10 → **counts did not change**.  
- This parameter affects **where the synthetic samples are generated**:  
  - Small k → local interpolation (tighter clusters).  
  - Large k → smoother, more spread out synthetic points (but can risk crossing into majority space).

k_neighbors changes the **geometry**, not the **number** of samples.
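A rough way to see the geometry effect without any plotting is to measure how far the candidate neighbors are as k grows. The sketch below uses randomly generated 2-D points (hypothetical data, not the fraud dataset) and a made-up helper `mean_neighbor_distance`.

```python
import math
import random

rng = random.Random(0)
# hypothetical 2-D minority points, for illustration only
minority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]

def mean_neighbor_distance(points, k):
    """Average distance from each point to its k nearest other points."""
    total = 0.0
    for i, x in enumerate(points):
        dists = sorted(math.dist(x, p) for j, p in enumerate(points) if j != i)
        total += sum(dists[:k]) / k
    return total / len(points)

spread = {k: mean_neighbor_distance(minority, k) for k in (3, 5, 7, 10)}
for k, d in spread.items():
    print(f"k={k}: mean distance to candidate neighbors = {d:.3f}")
```

The average distance can only grow with k, so synthetic points are drawn along longer segments, while the number of points generated is untouched.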

---

#### 4. sampling_strategy (dict example)
- {1: 181960} → minority forced to exactly 181,960 samples.  
- Majority stayed 227,451 → ratio ≈ 0.8.  

Dict gives **absolute control** over final class sizes, independent of ratios.

---

### Conclusions
- **sampling_strategy** = decides **how many synthetic points**.  
- **k_neighbors** = decides **where synthetic points are placed**.  
- **dict** = lets you hard-set target counts.  
- On creditcard data, I can scale minority from a few hundred → hundreds of thousands just by changing the parameter.


### SMOTE API mini-experiments (why & what to expect)

**fit(X, y)**  
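- Only validates the inputs and resolves `sampling_strategy` into the number of samples to generate per class (stored as `sampling_strategy_`); no neighbors are searched and no samples are generated.  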

**fit_resample(X, y)**  
- Does everything `fit` does **and** returns `(X_resampled, y_resampled)`.  

**get_feature_names_out(input_features=None)**  
- Returns feature names for the output.  
- If X is a DataFrame with string column names, you’ll get them back. Otherwise it falls back to **["x0", "x1", ...]**.

**get_params(deep=True) / set_params(\*\*kwargs)**  
- Standard sklearn-style param access/update.  
- Useful to **log** the config you ran, or to change **k_neighbors** / **sampling_strategy** between runs without creating a new object.

**get_metadata_routing()**  
- Advanced plumbing for scikit-learn’s metadata routing.  
- Not needed for normal SMOTE use, but you can call it and see the object returned.

### What is get_metadata_routing()?
This is an internal utility in scikit-learn for **advanced pipelines**. It controls how extra information (metadata like sample_weight or groups) flows between different steps.  
For SMOTE we don’t need it at all, since SMOTE only uses X and y. We can safely ignore it; it’s just part of scikit-learn’s plumbing.

**What we’ll do now**  
1) Inspect params → 2) call fit (no resample) → 3) call fit_resample and check counts → 4) change params with set_params and resample again → 5) try get_feature_names_out and get_metadata_routing.


In [2]:
# SMOTE API mini-experiments (no modeling) 
CSV_PATH = "data/creditcard.csv"
TARGET_COL = "Class"

import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

#load + split
df = pd.read_csv(CSV_PATH)
y = df[TARGET_COL].astype(int)
X = df.drop(columns=[TARGET_COL])
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

print("Original TRAIN distribution:", Counter(ytr))

#create SMOTE and inspect params
sm = SMOTE(sampling_strategy=0.25, k_neighbors=5, random_state=42)
print("\nget_params (short):", {k: sm.get_params()[k] for k in ['sampling_strategy','k_neighbors','random_state']})

#fit() only (no resample) — just to show it runs
sm.fit(Xtr, ytr)
print("fit() done (no resampling returned).")

#fit_resample() — actual resampling
X_res, y_res = sm.fit_resample(Xtr, ytr)
print("After fit_resample with sampling_strategy=0.25, k_neighbors=5 ->", Counter(y_res))

#set_params() — change k_neighbors and sampling_strategy, resample again
sm.set_params(k_neighbors=7, sampling_strategy=0.5)
X_res2, y_res2 = sm.fit_resample(Xtr, ytr)
print("After set_params(k=7, ss=0.5) ->", Counter(y_res2))

#feature names out
try:
    names = sm.get_feature_names_out(getattr(Xtr, 'columns', None))
    print("\nget_feature_names_out (first 10):", list(names)[:10], "…")
except Exception as e:
    print("\nget_feature_names_out -> ERROR:", e)

#metadata routing (just to see it's there)
try:
    routing = sm.get_metadata_routing()
    print("get_metadata_routing() type:", type(routing).__name__)
except Exception as e:
    print("get_metadata_routing -> ERROR:", e)


Original TRAIN distribution: Counter({0: 227451, 1: 394})

get_params (short): {'sampling_strategy': 0.25, 'k_neighbors': 5, 'random_state': 42}
fit() done (no resampling returned).
After fit_resample with sampling_strategy=0.25, k_neighbors=5 -> Counter({0: 227451, 1: 56862})
After set_params(k=7, ss=0.5) -> Counter({0: 227451, 1: 113725})

get_feature_names_out (first 10): ['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9'] …
get_metadata_routing() type: MetadataRequest


## Random Oversampling (ROS)

In [3]:
from imblearn.over_sampling import RandomOverSampler
from collections import Counter

ros = RandomOverSampler(sampling_strategy='minority', random_state=42)
X_ros, y_ros = ros.fit_resample(X, y)

print("Before ROS:", Counter(y))
print("After ROS:", Counter(y_ros))


Before ROS: Counter({0: 284315, 1: 492})
After ROS: Counter({0: 284315, 1: 284315})


### Random Oversampling (ROS)

Random Oversampling is the most straightforward balancing technique.  
Instead of generating synthetic samples, it simply **duplicates existing minority class samples** until the dataset is balanced.  

- **Before ROS:** 0 : 284,315 | 1 : 492 (full dataset: this cell resamples `X, y`, not the train split used in the other cells)  
- **After ROS:** 0 : 284,315 | 1 : 284,315

**Pros:**  
- Very simple and effective at balancing.  
- Ensures the minority class has equal representation.  

**Cons:**  
- Can cause **overfitting**, since the model may see the same minority examples multiple times.  
- Adds no new points: unlike SMOTE or Borderline-SMOTE, it only repeats existing rows.  

ROS serves as a **baseline** method to compare with more advanced oversampling techniques (SMOTE, Borderline-SMOTE).
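Duplication is easy to see in a toy version. The sketch below is a stand-alone illustration for the binary case, not the `RandomOverSampler` code; `random_oversample` and the tiny dataset are hypothetical.

```python
import random
from collections import Counter

def random_oversample(rows, labels, minority_label, rng):
    """Duplicate random minority rows until both classes have the same size (binary case)."""
    minority_rows = [r for r, l in zip(rows, labels) if l == minority_label]
    n_majority = sum(1 for l in labels if l != minority_label)
    n_extra = n_majority - len(minority_rows)
    extra = rng.choices(minority_rows, k=n_extra)  # sampling with replacement
    return rows + extra, labels + [minority_label] * n_extra

rng = random.Random(42)
rows = [(i,) for i in range(10)]   # 10 distinct toy rows
labels = [0] * 8 + [1] * 2         # 8 majority, 2 minority
rows_ros, labels_ros = random_oversample(rows, labels, minority_label=1, rng=rng)

print(Counter(labels_ros))                      # class counts are now equal
print("distinct rows:", len(set(rows_ros)))     # still only the original rows
```

The class counts are balanced, but the number of distinct rows does not change, which is exactly why ROS can encourage overfitting.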


## BorderlineSMOTE

In [4]:
# BorderlineSMOTE parameter experiments: print class counts only 
CSV_PATH = "data/creditcard.csv"   # Kaggle credit card fraud dataset
TARGET_COL = "Class"

import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import BorderlineSMOTE

# 1) load + split (resample ONLY on train)
df = pd.read_csv(CSV_PATH)
y = df[TARGET_COL].astype(int)
X = df.drop(columns=[TARGET_COL])
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

print("Original TRAIN distribution:", Counter(ytr))

# 2) parameter grids
# sampling_strategy: floats (ratio minority/majority), strings, and a dict example
sampling_list = [0.25, 0.5, 1.0, 'minority']  # keep it short
dict_target = {1: int(0.8 * Counter(ytr)[0])}  # force minority ≈ 80% of majority
sampling_list.append(dict_target)

k_list = [3, 5, 7]        # k_neighbors: synthesis neighborhood
m_list = [5, 10, 15]      # m_neighbors: "danger" neighborhood
kinds = ['borderline-1', 'borderline-2']

# 3) run and print counts
for ss in sampling_list:
    for k in k_list:
        for m in m_list:
            for kd in kinds:
                try:
                    bls = BorderlineSMOTE(
                        sampling_strategy=ss,
                        k_neighbors=k,
                        m_neighbors=m,
                        kind=kd,
                        random_state=42
                    )
                    X_res, y_res = bls.fit_resample(Xtr, ytr)
                    print(f"\n[sampling_strategy={ss}, k_neighbors={k}, m_neighbors={m}, kind={kd}]")
                    print(" Resampled distribution:", Counter(y_res))
                except Exception as e:
                    print(f"\n[sampling_strategy={ss}, k_neighbors={k}, m_neighbors={m}, kind={kd}] -> ERROR: {e}")


Original TRAIN distribution: Counter({0: 227451, 1: 394})

[sampling_strategy=0.25, k_neighbors=3, m_neighbors=5, kind=borderline-1]
 Resampled distribution: Counter({0: 227451, 1: 56862})

[sampling_strategy=0.25, k_neighbors=3, m_neighbors=5, kind=borderline-2]
 Resampled distribution: Counter({0: 227451, 1: 56862})

[sampling_strategy=0.25, k_neighbors=3, m_neighbors=10, kind=borderline-1]
 Resampled distribution: Counter({0: 227451, 1: 56862})

[sampling_strategy=0.25, k_neighbors=3, m_neighbors=10, kind=borderline-2]
 Resampled distribution: Counter({0: 227451, 1: 56862})

[sampling_strategy=0.25, k_neighbors=3, m_neighbors=15, kind=borderline-1]
 Resampled distribution: Counter({0: 227451, 1: 56862})

[sampling_strategy=0.25, k_neighbors=3, m_neighbors=15, kind=borderline-2]
 Resampled distribution: Counter({0: 227451, 1: 56862})

[sampling_strategy=0.25, k_neighbors=5, m_neighbors=5, kind=borderline-1]
 Resampled distribution: Counter({0: 227451, 1: 56862})

[sampling_strategy=0

### Analysis of BorderlineSMOTE Parameter Experiments

**Original train distribution:**  
- Majority (0): **227,451**  
- Minority (1): **394** → extremely imbalanced.

---

#### 1. sampling_strategy
- **0.25** → minority scaled to ~56,862 (≈25% of majority).  
- **0.5** → minority scaled to ~113,725 (≈50% of majority).  
- **1.0** → fully balanced at 227,451 vs 227,451.  
- **'minority'** → same as 1.0 (balances the dataset).  
- **dict {1: 181960}** → minority forced to exactly 181,960 (≈80% of majority).  

Same pattern as classic SMOTE: sampling_strategy **controls the final counts**, regardless of other parameters.

---

#### 2. k_neighbors
- Tested 3, 5, 7 → **class counts stayed the same**.  
- Like in SMOTE, this parameter controls **where synthetic samples are generated**, not how many.  
- Smaller k → more local interpolation.  
- Larger k → more spread-out neighbors.

---

#### 3. m_neighbors
- Tried 5, 10, 15 → **no change in counts** either.  
- **m_neighbors** controls how “danger” samples (minority points near the majority) are detected.  
- Affects **which minority samples are oversampled**, not the final count.

---

#### 4. kind (borderline-1 vs borderline-2)
- Both **borderline-1** and **borderline-2** gave **identical distributions** in terms of counts.  
- The difference is *which minority points* are chosen:
  - **borderline-1**: interpolates each “danger” sample only toward its minority neighbors.  
  - **borderline-2**: also interpolates “danger” samples toward their majority neighbors, generating slightly more diverse synthetic data nearer the majority class.  
- The effect is **qualitative (geometry of samples)**, not quantitative (counts).

---

### Conclusions
- **sampling_strategy**: the only parameter that changes **how many synthetic samples** are created.  
- **k_neighbors**: defines the interpolation neighborhood, influences the **distribution shape** of synthetic samples.  
- **m_neighbors**: defines how we detect borderline/danger points, again affects *where* new points are made.  
- **kind**: changes the algorithm’s definition of “borderline” (1 vs 2) but **not the class counts**.  

On our dataset, counts didn’t change with `k_neighbors`, `m_neighbors`, or `kind`. These parameters change the **placement** of the synthetic minority points, which the class counts printed here cannot show.


### What is m_neighbors in BorderlineSMOTE?

- BorderlineSMOTE focuses on **“danger” minority samples** → these are minority points that sit close to majority points (on the decision boundary).
- Why? Because misclassifications usually happen at the boundary, so oversampling there is more useful than duplicating safe points.

---

#### Role of m_neighbors
- **m_neighbors** = how many nearest neighbors (from both classes) we check around each minority point to decide if it’s:
  - **Safe** → fewer than half of those neighbors are majority (no need to oversample).
  - **Danger** → at least half, but not all, of those neighbors are majority (good candidate to oversample).
  - **Noise** → all of those neighbors are majority (ignored, not oversampled).
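The three zones can be sketched as a small classifier over the m nearest neighbors. The thresholds below follow the rule as described in the original Borderline-SMOTE paper (all majority = noise, at least half = danger, otherwise safe); the function `classify_minority` and the 1-D toy points are hypothetical, and the library may differ in details.

```python
import math

def classify_minority(x, X, y, m, minority_label=1):
    """Label one minority point as 'safe', 'danger', or 'noise' from its m nearest neighbors."""
    others = [(math.dist(x, p), label) for p, label in zip(X, y) if p is not x]
    nearest = sorted(others, key=lambda t: t[0])[:m]
    n_majority = sum(1 for _, label in nearest if label != minority_label)
    if n_majority == m:
        return "noise"    # every neighbor is majority
    if n_majority >= m / 2:
        return "danger"   # at least half, but not all, are majority
    return "safe"

# toy 1-D data: a minority cluster, one borderline point, one isolated point
minority = [(0.0,), (0.1,), (0.2,), (0.3,), (1.0,), (3.0,)]
majority = [(1.1,), (1.2,), (2.9,), (3.1,), (3.2,)]
X = minority + majority
y = [1] * len(minority) + [0] * len(majority)

zones = {x[0]: classify_minority(x, X, y, m=3) for x in minority}
print(zones)
```

Only the point at 1.0 would be used as a seed for new samples here; the cluster near 0 is safe and the isolated point at 3.0 is treated as noise.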

---

#### Why it helps
- If **m_neighbors** is **small (e.g., 5)** → the safe/danger/noise label depends on a handful of nearby points, so it is very local and sensitive: one stray majority neighbor can flip a point from safe to danger, or from danger to noise.
- If **m_neighbors** is **large (e.g., 15)** → the label is based on a wider neighborhood and is smoother. Whether this marks more or fewer points as danger depends on the data, and the total number of synthetic samples stays fixed by sampling_strategy either way.

---

### Takeaway
- **m_neighbors controls how sensitive BorderlineSMOTE is when finding borderline points.**
- It doesn’t change how many samples are generated overall (that’s sampling_strategy), but it changes **which minority points** are used to generate them.
- The “best” value depends on the dataset → small values react to very local structure, large values smooth over it, so it is worth comparing the resulting danger sets rather than assuming one setting is more aggressive.

Without oversampling → a model trained on this data tends to be biased toward the majority and can largely ignore the minority.

With BorderlineSMOTE → synthetic minority samples fill in the “hard-to-learn” boundary zones.

The aim is to help a model recognize fraud better, instead of just predicting “not fraud” all the time.

![m_neighbors](example.png)

BorderlineSMOTE focuses on the orange “danger” points and generates synthetic samples around them to strengthen the decision boundary.

### BorderlineSMOTE — kind parameter

BorderlineSMOTE has two algorithm variants: **borderline-1** and **borderline-2**.  
Both focus on oversampling the **danger samples** (minority points close to majority points), but they handle the neighborhood slightly differently:

---

#### 1. borderline-1
- Only oversamples **danger minority samples**.  
- Generates new synthetic points **between the danger sample and its minority neighbors**.  
- Goal: reinforce the minority exactly where it’s under threat from the majority.  
- More conservative → sticks closer to existing minority points.

---

#### 2. borderline-2
- Also oversamples danger samples, **but**:  
- Generates synthetic points not only towards other minority neighbors, but also **towards majority neighbors**.  
- This spreads synthetic samples wider, covering the whole boundary area.  
- More aggressive → can help when classes overlap strongly, but may risk creating noisier samples.
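The difference between the two kinds comes down to which neighbor a danger point is interpolated toward, and how far. The sketch below assumes a full step (0 to 1) toward minority neighbors and a half step (0 to 0.5) toward majority neighbors, as in the original paper's description of borderline-2; the helper `interpolate` and the points are hypothetical.

```python
import random

def interpolate(x, neighbor, max_step, rng):
    """Move from x toward neighbor by a random fraction in [0, max_step]."""
    gap = rng.uniform(0, max_step)
    return tuple(a + gap * (b - a) for a, b in zip(x, neighbor))

rng = random.Random(42)
danger = (0.0, 0.0)              # a minority point near the boundary
minority_neighbor = (-1.0, 0.0)  # lies on the minority side
majority_neighbor = (1.0, 0.0)   # lies on the majority side

# borderline-1: only toward minority neighbors
b1 = [interpolate(danger, minority_neighbor, 1.0, rng) for _ in range(100)]
# borderline-2: additionally toward majority neighbors, at most halfway
b2_extra = [interpolate(danger, majority_neighbor, 0.5, rng) for _ in range(100)]

print("borderline-1 x-range:", min(p[0] for p in b1), max(p[0] for p in b1))
print("borderline-2 extra x-range:", min(p[0] for p in b2_extra), max(p[0] for p in b2_extra))
```

The half step keeps the extra borderline-2 points closer to the danger sample than to the majority neighbor, which limits how far they reach into majority territory.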

---

### Takeaway
- **Both kinds** create the same *number* of samples (controlled by **sampling_strategy**), but they differ in *where* the new points are placed.  
- **borderline-1** → safer, tighter oversampling around existing minority.  
- **borderline-2** → wider oversampling, may better handle overlaps but can introduce more noise.  
- Which one works better depends on the dataset → you typically try both and compare model performance.

![borderline-2](borderline-2.png)

## KMeans SMOTE

In [5]:
# Robust KMeansSMOTE run on creditcard.csv: print class counts only 

CSV_PATH = "data/creditcard.csv"
TARGET_COL = "Class"

import os

# (Windows MKL tip) Reduce threads to avoid the MiniBatchKMeans warning.
# Set this before importing numpy/scikit-learn (in a fresh kernel):
# the thread setting is typically read when those libraries load.
os.environ.setdefault("OMP_NUM_THREADS", "4")

import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.cluster import MiniBatchKMeans
from imblearn.over_sampling import KMeansSMOTE

# 1) load + split
df = pd.read_csv(CSV_PATH)
y = df[TARGET_COL].astype(int)
X = df.drop(columns=[TARGET_COL])
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

print("Original TRAIN distribution:", Counter(ytr))

# choose conservative, workable params for extreme imbalance
# - few clusters so each cluster has several minority points
# - k_neighbors=2 (default)
# - cluster_balance_threshold relaxed to 0.0
# - pass MiniBatchKMeans with large batch_size to avoid warning
minority_count = Counter(ytr)[1]
n_clusters = max(5, min(12, minority_count // 20))  # heuristic: aim ~≥20 minority per cluster

kmeans_obj = MiniBatchKMeans(
    n_clusters=n_clusters,
    batch_size=4096,   # >= 3584 to avoid MKL warning; adjust if needed
    n_init="auto",
    random_state=42
)

configs = [
    # (sampling_strategy, k_neighbors, cluster_balance_threshold, density_exponent)
    (0.25, 2, 0.0, "auto"),
    (0.50, 2, 0.0, "auto"),
    (1.00, 2, 0.0, "auto"),
    ("minority", 2, 0.0, "auto"),
]

for ss, kn, cbt, dexp in configs:
    try:
        kms = KMeansSMOTE(
            sampling_strategy=ss,
            k_neighbors=kn,
            kmeans_estimator=kmeans_obj,
            cluster_balance_threshold=cbt,
            density_exponent=dexp,
            random_state=42
        )
        X_res, y_res = kms.fit_resample(Xtr, ytr)
        print(f"\n[sampling_strategy={ss}, k_neighbors={kn}, kmeans_k={n_clusters}, "
              f"cluster_balance_threshold={cbt}, density_exponent={dexp}]")
        print(" Resampled distribution:", Counter(y_res))
    except Exception as e:
        print(f"\n[ss={ss}, kn={kn}, kmeans_k={n_clusters}, cbt={cbt}, dexp={dexp}] -> ERROR: {e}")

Original TRAIN distribution: Counter({0: 227451, 1: 394})

[sampling_strategy=0.25, k_neighbors=2, kmeans_k=12, cluster_balance_threshold=0.0, density_exponent=auto]
 Resampled distribution: Counter({0: 227451, 1: 56872})

[sampling_strategy=0.5, k_neighbors=2, kmeans_k=12, cluster_balance_threshold=0.0, density_exponent=auto]
 Resampled distribution: Counter({0: 227451, 1: 113734})

[sampling_strategy=1.0, k_neighbors=2, kmeans_k=12, cluster_balance_threshold=0.0, density_exponent=auto]
 Resampled distribution: Counter({1: 227460, 0: 227451})

[sampling_strategy=minority, k_neighbors=2, kmeans_k=12, cluster_balance_threshold=0.0, density_exponent=auto]
 Resampled distribution: Counter({1: 227460, 0: 227451})


### KMeansSMOTE — What your output shows & what each parameter did

**Original TRAIN distribution:**  
- Majority (0): **227,451**  
- Minority (1): **394**  → extremely imbalanced.

---

#### 1) sampling_strategy (changed): **controls how many minority samples are created**
Your runs:

- **sampling_strategy = 0.25** → **1 = 56,872** vs 0 = 227,451  
- **sampling_strategy = 0.5**  → **1 = 113,734** vs 0 = 227,451  
- **sampling_strategy = 1.0**  → **1 = 227,460** vs 0 = 227,451 (balanced, tiny off-by-few)  
- **sampling_strategy = "minority"** → **1 = 227,460** vs 0 = 227,451 (same as 1.0 for binary)

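**Why the small mismatch (227,460 vs 227,451)?**  
KMeansSMOTE allocates synthetic samples **per cluster**. Each cluster's quota is rounded to a whole number, so the final minority count can differ by a few samples from the “exact” target. Here every run lands 9 or 10 above the plain SMOTE count for the same ratio (56,872 vs 56,862; 113,734 vs 113,725; 227,460 vs 227,451), which is what rounding each cluster's quota up would produce. This is normal.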
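The overshoot is easy to reproduce with made-up numbers. The sketch below assumes each cluster's share is rounded up with `math.ceil`; the cluster weights are hypothetical and only illustrate the arithmetic, they are not taken from the actual run.

```python
import math

n_synthetic = 56862 - 394               # exact target for ratio 0.25 minus existing minority
weights = [0.3, 0.25, 0.2, 0.15, 0.1]   # hypothetical per-cluster shares, summing to 1

# each cluster gets its share, rounded up to a whole number of samples
quotas = [math.ceil(n_synthetic * w) for w in weights]
overshoot = sum(quotas) - n_synthetic

print("quotas:", quotas)
print("overshoot:", overshoot)
```

Rounding up can add at most one extra sample per cluster, so the overshoot is always smaller than the number of clusters that receive samples.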

**Takeaway:** **sampling_strategy** is the **only** thing changing the **counts**. All other params affect **where** new points go, not how many.

---

#### 2) k_neighbors = 2 (fixed in your runs): **local interpolation inside clusters**
- Controls how SMOTE interpolates **within each cluster** (how “local” the synthesis is).  
- Does **not** change counts; only the geometry/placement of synthetic samples.  
- With very few minority per cluster, small values (2–3) are safer.

---

#### 3) kmeans_k = 12 : **how finely we partition the space**
- `kmeans_k` is only the label printed by the cell above; the actual setting is `n_clusters` on the `MiniBatchKMeans` object passed as `kmeans_estimator`.  
- KMeansSMOTE first clusters the data, then oversamples **within** clusters that contain enough minority.  
- More clusters = finer regions, but **risk** empty/single-minority clusters (can’t interpolate).  
- Fewer clusters = coarser regions, often **more stable** with tiny minority (like this dataset).  
- The choice of **12** worked here: the run did not raise the “No clusters found…” error.

---

#### 4) cluster_balance_threshold = 0.0 : **which clusters are eligible**
- With **0.0**, we essentially **allow all clusters** to be considered for oversampling.  
- If this threshold is too **strict** (e.g., **'auto'** on a dataset with ultra-few minority), many clusters get excluded → we can get **“No clusters found…”** errors.

---

#### 5) density_exponent = "auto": **how per-cluster density weights are computed**
- Used to **distribute** the total number of synthetic samples **across clusters**: clusters where the minority points are sparser get a larger share.  
- Doesn’t change overall counts (apart from the small per-cluster rounding noted above), only **where** samples are allocated.
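The allocation idea can be sketched as a sparsity score per cluster: mean pairwise distance between minority points, raised to an exponent, divided by the number of minority points. This is a simplified reading of the approach, and the exact formula and the fixed `exponent = 2` are assumptions for illustration; both clusters below are hypothetical.

```python
import math
from itertools import combinations

def sparsity(points, exponent):
    """Mean pairwise distance raised to `exponent`, divided by the point count."""
    pairs = list(combinations(points, 2))
    mean_dist = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    return mean_dist ** exponent / len(points)

# minority points of two hypothetical clusters: one tight, one spread out
tight = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
spread_out = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]

exponent = 2  # assumed value, chosen only for this example
raw = [sparsity(tight, exponent), sparsity(spread_out, exponent)]
weights = [s / sum(raw) for s in raw]  # share of synthetic samples per cluster

print("share for tight cluster:", round(weights[0], 4))
print("share for spread-out cluster:", round(weights[1], 4))
```

A larger exponent widens the gap between the two shares, so the sparsest cluster takes even more of the budget.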

![kmeans](kmeans.png)

## 1. Why KMeans before SMOTE?
- Standard **SMOTE** creates synthetic minority samples by interpolating between neighbors, but it doesn’t consider the global structure of the data.  
- This can generate unrealistic samples in areas where majority and minority overlap.  
- **KMeans-SMOTE** first clusters the dataset → oversampling happens *within each cluster* instead of across the whole dataset.  
- Result: synthetic samples are **more realistic and cluster-aware**.

---

## 2. How It Works Step by Step
1. **Clustering**  
   - Data is divided into clusters (**kmeans_k** here, i.e. the `n_clusters` of the `kmeans_estimator`).  
   - Each cluster has its own mix of majority and minority points.  

2. **Cluster Selection**  
   - Clusters whose minority share reaches **cluster_balance_threshold** are kept for oversampling.  
   - Clusters dominated by majority are skipped to avoid noisy samples; among the kept clusters, those with sparser minority points receive more synthetic samples.  

3. **Sample Creation**  
   - Within chosen clusters, SMOTE generates new points by interpolating between minority neighbors.  
   - This ensures new samples respect the cluster’s structure.
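The cluster selection step can be sketched as a filter on each cluster's minority share. The helper `eligible_clusters` and the cluster assignments below are hypothetical, and the threshold is passed explicitly because how the library derives its `'auto'` value is not covered here.

```python
from collections import Counter

def eligible_clusters(cluster_ids, labels, threshold, minority_label=1):
    """Return ids of clusters whose minority share is at least `threshold`."""
    sizes = Counter(cluster_ids)
    minority = Counter(c for c, l in zip(cluster_ids, labels) if l == minority_label)
    return sorted(c for c in sizes if minority[c] / sizes[c] >= threshold)

# three toy clusters of 10 points each, with 50%, 20% and 0% minority
cluster_ids = [0] * 10 + [1] * 10 + [2] * 10
labels = [1] * 5 + [0] * 5 + [1] * 2 + [0] * 8 + [0] * 10

for threshold in (0.0, 0.1, 0.5):
    print(threshold, eligible_clusters(cluster_ids, labels, threshold))
```

With a threshold of 0.0 even the cluster without any minority point passes the filter, but nothing can be interpolated there, so in practice such a cluster contributes no synthetic samples.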

---

## 3. Key Parameters
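- **kmeans_k**  
  - Number of clusters. This is the notebook's label, not a `KMeansSMOTE` argument: it is set through `kmeans_estimator` (an integer or a KMeans-style estimator).  
  - Larger **kmeans_k** → finer segmentation of data.  
  - Helps focus oversampling on local minority “pockets.”  

- **cluster_balance_threshold**  
  - Decides when a cluster is eligible for oversampling.  
  - Example:  
    - **0.1** → clusters with ≥10% minority get oversampled.  
    - **0.0** → every cluster passes the threshold, even majority-heavy ones (a cluster still needs enough minority points to interpolate between).  

- **density_exponent**  
  - Controls how strongly minority sparsity affects the allocation across clusters.  
  - **"auto"** = chosen by the library (described in its docs as based on the number of features).  
  - Higher values → the sparsest clusters get an even larger share of the synthetic points.  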

---

## Summary
KMeans-SMOTE = SMOTE with clustering.  
It oversamples **only in clusters with enough minority share**, skips **majority-dominated clusters**, and gives more synthetic samples to clusters where the minority is sparse, aiming for balanced and realistic synthetic samples.


## Comparison of SMOTE Variants

- **No SMOTE (Baseline):**  
  The dataset stays highly imbalanced. A model trained on it tends to struggle with minority (fraud) patterns and to predict majority (non-fraud) more often.

- **SMOTE (Regular):**  
  Creates synthetic samples by interpolating between minority neighbors. This balances the dataset and is usually aimed at better recall, but it can generate less realistic points in sparse regions.

- **Random Oversampling (ROS):**  
  Duplicates existing minority class samples until the dataset is balanced.  
  Simple and effective for balancing.  
  Risk of **overfitting**, since no new information is added.

- **Borderline-SMOTE:**  
  Focuses on samples near the decision boundary (danger zone).  
  - **Borderline-1:** Generates minority samples near the boundary, between danger points and their minority neighbors.  
  - **Borderline-2:** Also generates samples toward majority neighbors, giving “harder” minority samples closer to the majority, intended to help detect fraud in tricky regions.

- **KMeans-SMOTE:**  
  Uses clustering to decide where to add samples. More samples are generated in eligible clusters where minority points are sparse. This avoids over-sampling already dense, “easy” areas and aims for more diverse, useful samples.

### Takeaway
- **ROS** → Quick fix, but prone to overfitting.  
- **SMOTE** → Good starting point, creates synthetic variation.  
- **Borderline-SMOTE** → Strengthens decision boundaries, especially for fraud detection.  
- **KMeans-SMOTE** → Most adaptive, balances intelligently based on data distribution.  

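Overall, all methods rebalance the training data relative to the baseline. **Borderline-SMOTE and KMeans-SMOTE are the most targeted** options for imbalanced fraud datasets, but since no model is trained in this notebook, that ranking is an expectation to verify with real evaluation metrics, not a measured result.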