
# 03 – SMOTE and Variants

**Module:** Anomaly & Fraud Detection  
**Topic:** Imbalanced Learning Strategies

> This notebook covers **synthetic oversampling techniques** for rare-event problems,
> focusing on SMOTE and its most common variants. These methods are treated as
> *training-time data augmentation*, not as evaluation shortcuts.


## Objective

Build a leakage-free and production-aware workflow that:

- Applies SMOTE-based oversampling **only on training data**
- Compares multiple SMOTE variants
- Evaluates models on the original, imbalanced distribution
- Highlights precision–recall trade-offs for fraud detection

Oversampling is treated as **signal amplification**, not class equalization.

## Pipeline Design Principles

✔ Oversampling applied only after train split  
✔ Validation reflects real-world prevalence  
✔ Synthetic data never leaks into test data  
✔ Metrics aligned with rare-event detection

## High-Level Workflow

Imbalanced Dataset  
&nbsp;&nbsp;&nbsp;&nbsp;↓  
Train / Test Split (Stratified)  
&nbsp;&nbsp;&nbsp;&nbsp;↓  
SMOTE / Variant (Train Only)  
&nbsp;&nbsp;&nbsp;&nbsp;↓  
Model Training  
&nbsp;&nbsp;&nbsp;&nbsp;↓  
Evaluation on Original Distribution

## Imports and Setup


In [43]:
!pip install imbalanced-learn





In [47]:
import numpy as np
import pandas as pd

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

from imblearn.over_sampling import SMOTE, BorderlineSMOTE
from imblearn.combine import SMOTETomek  # hybrid samplers live in imblearn.combine

np.random.seed(2010)


## Dataset Assumptions

- Binary fraud label: `fraud`
- Extreme class imbalance (≈ 1–2% fraud)
- Tabular, numeric features


## Simulated Imbalanced Fraud Dataset



In [49]:
X, y = make_classification(
    n_samples=12000,
    n_features=12,
    n_informative=5,
    n_redundant=3,
    weights=[0.985, 0.015],
    flip_y=0.001,
    random_state=42
)

df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
df["fraud"] = y

In [54]:
df.head()

| | feature_0 | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 | feature_10 | feature_11 | fraud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.77026 | -2.625638 | -0.828707 | 0.814636 | -0.768503 | -0.134738 | 1.282337 | -1.326938 | 2.413959 | -0.233218 | 0.383257 | 0.57185 | 0 |
| 1 | -0.238001 | 0.050574 | -2.266612 | 0.252346 | 1.036836 | -0.81188 | 1.578604 | -0.622636 | -0.075854 | 0.348804 | -1.705429 | 0.076863 | 0 |
| 2 | -2.429437 | -1.190506 | 0.174269 | -0.106405 | 1.060758 | -0.057729 | 0.73401 | 0.103039 | 1.623791 | 1.582132 | 0.0348 | 0.074928 | 0 |
| 3 | 1.567401 | -2.035331 | -0.706796 | 1.717783 | 0.371393 | 0.410368 | 1.312859 | 0.305656 | 1.243666 | 0.614772 | -0.108857 | 1.43018 | 0 |
| 4 | -0.752606 | -4.368473 | -1.825955 | -1.574827 | -0.729408 | -0.792788 | 2.31869 | -2.927283 | 2.667991 | 0.947658 | 0.065568 | -1.396407 | 0 |


Class distribution

In [52]:
df["fraud"].value_counts(normalize=True)

fraud
0    0.984833
1    0.015167
Name: proportion, dtype: float64


## Leakage-Free Train / Test Split


In [56]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="fraud"),
    df["fraud"],
    test_size=0.3,
    stratify=df["fraud"],
    random_state=42
)


## Define Oversampling Strategies

We compare:
- SMOTE (standard)
- Borderline-SMOTE
- SMOTE + Tomek Links (hybrid cleaning)



In [59]:
# Samplers to compare; each one is fitted on the training split only.
oversamplers = {
    "SMOTE": SMOTE(random_state=42),
    "BorderlineSMOTE": BorderlineSMOTE(random_state=2010),
    "SMOTETomek": SMOTETomek(random_state=2010)  # SMOTE followed by Tomek-link cleaning
}


## Model Training with Oversampling

A separate model is trained for each oversampling strategy.



In [62]:
results = {}

for name, sampler in oversamplers.items():
    # Resample the training split only; the test set is never touched.
    X_resampled, y_resampled = sampler.fit_resample(X_train, y_train)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_resampled, y_resampled)

    # Predict on the original, imbalanced test set.
    y_pred = model.predict(X_test)

    results[name] = classification_report(y_test, y_pred, output_dict=True)


## Evaluation on Original Distribution

All models are evaluated on the **unchanged test set**, which keeps the original class prevalence.

In [65]:
for name, report in results.items():
    print(f"=== {name} ===")
    print(pd.DataFrame(report).T)
    print("\n")


## Interpretation of Results

Typical observations:
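- SMOTE usually raises recall on the fraud class relative to training on the raw imbalanced data
- Borderline-SMOTE synthesizes points only from minority samples that lie near the class boundary
- Hybrid methods such as SMOTE + Tomek Links remove Tomek links (opposite-class nearest-neighbour pairs) after oversampling, which reduces noisy overlap
- Precision often decreases as recall improves, because more legitimate transactions get flagged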

## Risks and Anti-Patterns

❌ Applying SMOTE before train/test split  
❌ Oversampling validation or test sets  
❌ Using synthetic balance as a success metric  
❌ Ignoring feature space geometry

## When SMOTE Makes Sense

- Minority class has sufficient density
- Feature space is continuous and meaningful
- Combined with regularization or ensembles
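
The need for a continuous, meaningful feature space follows from how SMOTE builds samples: each synthetic point is a random interpolation between a minority sample and one of its minority-class nearest neighbours. The function below is a simplified illustration of that idea, not imbalanced-learn's implementation; `smote_like_samples` and its parameters are hypothetical names introduced here.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Generate n_new points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)

    # Ask for k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)

    # Pick a random base point and one of its k true neighbours (columns 1..k).
    base = rng.integers(0, len(X_minority), size=n_new)
    neighbor = neighbor_idx[base, rng.integers(1, k + 1, size=n_new)]

    # Interpolate: x_new = x_base + lambda * (x_neighbor - x_base), lambda in [0, 1).
    lam = rng.random((n_new, 1))
    return X_minority[base] + lam * (X_minority[neighbor] - X_minority[base])


# Toy minority class: 30 points in 4 continuous dimensions.
X_min = np.random.default_rng(1).normal(size=(30, 4))
synthetic = smote_like_samples(X_min, n_new=50)
print(synthetic.shape)
```

Since every synthetic point lies on a line segment between two real minority points, interpolating integer codes or one-hot columns produces values that do not correspond to any valid category, which is why plain SMOTE is a poor fit for such features.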

## Production Checklist

✔ SMOTE applied only on training data  
✔ Evaluation done on real distribution  
✔ Model stability verified  
✔ Recall–precision trade-offs understood

## Key Takeaways

- SMOTE generates *plausible* but not *real* data
- Variants behave differently near class boundaries
- Oversampling is a modeling choice, not a fix-all


## Next Steps

- Compare with class-weighted learning
- Apply SMOTE inside cross-validation folds
- Evaluate using PR-AUC and cost-sensitive metrics