
# 01 – Random Oversampling

**Module:** Anomaly & Fraud Detection  
**Topic:** Imbalanced Data Handling

This notebook demonstrates **random oversampling** for rare-event datasets.
The goal is to increase minority-class representation in the training set,
without leaking resampled rows into the evaluation data, so that the model
sees enough fraud examples to learn from.


## Objective

Build a pipeline that:
- Applies random oversampling only on training data
- Preserves test distribution
- Integrates with downstream modeling and evaluation
- Supports probabilistic thresholding


## Design Principles

✔ Oversampling only on training set (no leakage)  
✔ Original distribution in test/validation preserved  
✔ Probabilistic outputs thresholded for deployment  
✔ Modular for integration into pipelines


## Imports and Setup

In [12]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, classification_report, auc
from imblearn.over_sampling import RandomOverSampler

np.random.seed(2010)

## Simulated Imbalanced Fraud Dataset

In [14]:
X, y = make_classification(
    n_samples=10000,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    weights=[0.985, 0.015],
    flip_y=0.001,
    random_state=2010
)

df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
df["fraud"] = y

## Leakage-Free Train/Test Split

In [16]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="fraud"), df["fraud"],
    test_size=0.3, stratify=df["fraud"], random_state=42
)

## Apply Random Oversampling on Training Set Only

In [21]:
ros = RandomOverSampler(random_state=42)
X_train_res, y_train_res = ros.fit_resample(X_train, y_train)

print(f"Original training set class distribution:\n{np.bincount(y_train)}")
print(f"Resampled training set class distribution:\n{np.bincount(y_train_res)}")

Original training set class distribution:
[6891  109]
Resampled training set class distribution:
[6891 6891]
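
Mechanically, random oversampling draws minority rows with replacement until the classes match. In the output above, 6,782 duplicates of the 109 fraud rows are appended, so each fraud row appears about 63 times on average. A dependency-free sketch of the idea follows; the function name `random_oversample` and the toy data are illustrative, and this is not imblearn's actual implementation.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows (with replacement) until every
    class is as large as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        # k is 0 for the majority class, so nothing is added for it
        extra = rng.choices(idx, k=target - n)
        out_rows.extend(rows[i] for i in extra)
        out_labels.extend(labels[i] for i in extra)
    return out_rows, out_labels


# Toy data: 8 legitimate rows, 2 fraud rows
rows = [[float(i)] for i in range(10)]
labels = [0] * 8 + [1] * 2

res_rows, res_labels = random_oversample(rows, labels, seed=42)
print(Counter(labels))      # 8 legitimate, 2 fraud
print(Counter(res_labels))  # 8 legitimate, 8 fraud
```

Every added row is an exact copy of an existing fraud row, which is why the technique adds no new information about the minority class.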




## Train Model on Resampled Data

In [24]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train_res, y_train_res)

## Predict Probabilities on Original Test Set

In [27]:
y_probs = model.predict_proba(X_test)[:,1]

## Threshold Selection via F1 Score

In [30]:
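# precision and recall have one more element than thresholds (a final
# point with recall 0 and no threshold), so drop it before the argmax.
# NOTE: the threshold is picked on the test set here for brevity; in
# production, tune it on a separate validation split.
precision, recall, thresholds = precision_recall_curve(y_test, y_probs)
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-9)
best_idx = np.argmax(f1_scores)
best_threshold = thresholds[best_idx]
y_pred = (y_probs >= best_threshold).astype(int)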

print(f"Optimal threshold (max F1): {best_threshold:.3f}")
print(classification_report(y_test, y_pred))

Optimal threshold (max F1): 0.734
              precision    recall  f1-score   support

           0       0.99      0.93      0.96      2953
           1       0.07      0.32      0.12        47

    accuracy                           0.92      3000
   macro avg       0.53      0.63      0.54      3000
weighted avg       0.97      0.92      0.95      3000




## Interpretation

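- Random oversampling duplicates minority rows (sampling with replacement) until the classes are balanced: here 109 fraud rows become 6,891. This gives the fraud class equal weight in training but adds no new information  
- Test distribution remains unchanged (47 frauds in 3,000 rows) so the metrics reflect real-world performance  
- At the F1-optimal threshold of 0.734, the model recovers 32% of test frauds (15 of 47) at 7% precision, roughly 13 false alarms per fraud caught; threshold tuning moves along this recall/precision trade-off  
- Tends to work best with simple classifiers and moderate imbalance; with more flexible models, exact duplicates increase the risk of overfitting to the repeated rows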


## Production Checklist

✔ Oversampling applied only on training set  
✔ Threshold tuned on a validation set (not on the test set, which this notebook uses for brevity)  
✔ Evaluation metrics computed on original distribution  
✔ Pipeline modular for reuse and monitoring
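
The threshold item in this checklist can be sketched as follows: pick the threshold on validation scores, then apply it unchanged to the test scores. The helper names and the toy scores are hypothetical; in practice the scores would come from `model.predict_proba` on a validation split carved out of the training data before oversampling.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 for the positive class when scores >= threshold are flagged."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def pick_threshold(y_val, val_scores):
    """Return the candidate threshold (one per distinct validation score)
    with the highest validation F1."""
    candidates = sorted(set(val_scores))
    return max(candidates, key=lambda t: f1_at_threshold(y_val, val_scores, t))


# Hypothetical validation and test scores (not notebook output)
y_val = [0, 0, 0, 0, 1, 0, 1, 1]
val_scores = [0.10, 0.20, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90]
y_test = [0, 0, 1, 0, 1]
test_scores = [0.15, 0.50, 0.65, 0.30, 0.80]

threshold = pick_threshold(y_val, val_scores)                 # chosen on validation only
test_f1 = f1_at_threshold(y_test, test_scores, threshold)     # reported on test
print(threshold, test_f1)
```

Keeping selection and reporting on separate data means the test F1 is not inflated by the threshold search itself.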


## Key Takeaways

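- Random oversampling is a simple baseline for small minority classes; on this dataset it yields a fraud-class F1 of 0.12, so treat it as a starting point rather than a complete fix  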
- Never oversample before the train/test split: duplicated minority rows would land in both sets and inflate test scores  
- Integrates with threshold tuning, PR curve evaluation, and cost-sensitive metrics
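
To illustrate the cost-sensitive angle, the sketch below picks a threshold by minimizing total misclassification cost instead of maximizing F1. The cost values, the helper name `total_cost`, and the toy scores are all assumptions made for the example.

```python
def total_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Total cost of false alarms and missed frauds at a given threshold."""
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    return fp * cost_fp + fn * cost_fn


# Hypothetical labels and model scores (not notebook output)
y_true = [0, 1, 0, 0, 1, 0, 1, 1]
scores = [0.10, 0.20, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90]

COST_FP = 5.0    # assumed cost of manually reviewing a false alarm
COST_FN = 200.0  # assumed loss from one missed fraud

costs = {
    t: total_cost(y_true, scores, t, COST_FP, COST_FN)
    for t in sorted(set(scores))
}
best_threshold = min(costs, key=costs.get)
print(best_threshold, costs[best_threshold])
```

When a missed fraud is assumed to cost far more than a review, the cost-minimizing threshold moves lower than an F1-based choice would, trading extra false alarms for fewer misses.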


## Next Steps

- Compare performance against random undersampling and class weighting  
- Integrate into full anomaly/fraud detection pipeline  
- Monitor rare-event performance over time in production
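
As a starting point for the class-weighting comparison: scikit-learn's `class_weight="balanced"` option weights each class by `n_samples / (n_classes * class_count)`. The sketch below recomputes those weights by hand for the training counts reported earlier (6,891 legitimate, 109 fraud); the helper name `balanced_class_weights` is illustrative.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Per-class weights following n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Training-set class counts from the resampling cell
labels = [0] * 6891 + [1] * 109
weights = balanced_class_weights(labels)

print(weights)                  # about 0.508 for class 0 and 32.1 for class 1
print(weights[1] / weights[0])  # about 63.2
```

The weight ratio of about 63 matches the average number of copies each fraud row receives under random oversampling, which is why the two approaches are natural candidates for a side-by-side comparison.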