# Anomaly & Fraud Detection – Imbalanced Learning
**Technique:** Random Undersampling

This notebook demonstrates how random undersampling can be used as a **training-time strategy**
for handling extreme class imbalance in fraud detection problems.

## 1. Objective

- Handle severe class imbalance in fraud datasets
- Apply random undersampling **only on training data**
- Compare model behavior with and without undersampling

> Random undersampling is treated as a *risk trade-off tool*, not a default preprocessing step.

## 2. Conceptual Overview

Fraud datasets are typically dominated by the non-fraud class (often >99%).

**Random undersampling:**
- Removes samples from the majority class
- Produces a balanced or semi-balanced training set
- Can improve recall for rare events

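**Key risks:**
- Information loss: discarded majority samples may carry useful signal
- Poor generalization if applied incorrectly, for example when the balanced class ratio leaks into validation or test data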

**Golden rule:**
> Never undersample validation or test data.


## 3. Imports and Setup



In [57]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

np.random.seed(42)


## 4. Simulated Imbalanced Fraud Dataset

- 99% non-fraud
- 1% fraud


In [64]:
X, y = make_classification(
    n_samples=10000,
    n_features=10,
    n_informative=4,
    n_redundant=2,
    weights=[0.99, 0.01],
    flip_y=0.001,
    random_state=42
)

In [66]:
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
df["fraud"] = y
df.head()

| | feature_0 | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 | fraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.142482 | 6.302694 | 0.708933 | -0.595658 | -4.494622 | 3.763959 | -1.182141 | -0.706649 | 0.87765 | 3.460011 | 0 |
| 1 | 1.459805 | -0.982961 | 0.274762 | 1.327741 | 0.274597 | 0.406654 | -1.424216 | -0.541076 | -0.458819 | -1.361351 | 0 |
| 2 | 0.771083 | -2.29665 | 0.773686 | 0.711279 | 2.0957 | -2.179456 | 0.284796 | -2.635911 | -1.545472 | 0.635664 | 0 |
| 3 | 0.837445 | -1.187392 | 1.558968 | 1.434297 | 1.37859 | -0.374378 | -0.532691 | -2.025809 | -1.42541 | -0.28803 | 0 |
| 4 | 0.82229 | -1.625413 | 0.23738 | 0.473487 | 0.92797 | -1.028673 | -0.6474 | -1.510206 | -0.510055 | -0.195119 | 0 |


### Class distribution

In [30]:
df["fraud"].value_counts(normalize=True)

fraud
0    0.9895
1    0.0105
Name: proportion, dtype: float64

## 5. Train / Test Split (Leakage-Safe)

In [33]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="fraud"),
    df["fraud"],
    test_size=0.3,
    stratify=df["fraud"],
    random_state=42
)

## 6. Random Undersampling (Training Set Only)

In [36]:
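# Recombine features and labels so rows can be filtered by class.
train_df = pd.concat([X_train, y_train], axis=1)

fraud_df = train_df[train_df["fraud"] == 1]
non_fraud_df = train_df[train_df["fraud"] == 0]

# Keep as many non-fraud rows as there are fraud rows (1:1 ratio).
non_fraud_sampled = non_fraud_df.sample(
    n=len(fraud_df),
    random_state=42
)

undersampled_train_df = pd.concat([fraud_df, non_fraud_sampled])

# Confirm the resulting class counts.
undersampled_train_df["fraud"].value_counts()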

fraud
1    74
0    74
Name: count, dtype: int64
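
Section 2 mentions that undersampling can produce a balanced *or semi-balanced* training set, while the cell above always samples at 1:1. The following is a minimal sketch of a reusable helper that takes the majority-to-minority ratio as a parameter. The function name `random_undersample`, its `ratio` argument, and the toy frame are illustrative assumptions, not part of the notebook.

```python
import pandas as pd


def random_undersample(train_df, target="fraud", ratio=1.0, random_state=42):
    """Return train_df with the majority class randomly downsampled.

    ratio is the number of majority rows kept per minority row
    (1.0 gives a balanced set, 4.0 keeps four non-fraud rows per fraud row).
    Assumes a binary target column.
    """
    counts = train_df[target].value_counts()
    minority_label = counts.idxmin()
    minority = train_df[train_df[target] == minority_label]
    majority = train_df[train_df[target] != minority_label]

    # Never request more majority rows than exist.
    n_keep = min(len(majority), int(round(ratio * len(minority))))
    majority_sampled = majority.sample(n=n_keep, random_state=random_state)

    # Shuffle so the two classes are not stored as contiguous blocks.
    combined = pd.concat([minority, majority_sampled])
    return combined.sample(frac=1.0, random_state=random_state)


# Toy frame standing in for train_df: 95 non-fraud rows, 5 fraud rows.
toy = pd.DataFrame({"feature_0": range(100), "fraud": [0] * 95 + [1] * 5})
balanced = random_undersample(toy, ratio=1.0)
semi_balanced = random_undersample(toy, ratio=4.0)
print(balanced["fraud"].value_counts().to_dict())
print(semi_balanced["fraud"].value_counts().to_dict())
```

A ratio above 1.0 keeps more of the majority class, which trades some of the recall gain for less information loss.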

## 7. Model Training

Two models are trained:
1. On the original imbalanced dataset
2. On the undersampled training dataset

In [39]:
model_imbalanced = LogisticRegression(max_iter=1000)
model_imbalanced.fit(X_train, y_train)

X_us = undersampled_train_df.drop(columns="fraud")
y_us = undersampled_train_df["fraud"]

model_undersampled = LogisticRegression(max_iter=1000)
model_undersampled.fit(X_us, y_us)

## 8. Evaluation on Original Test Set

> Evaluation is always performed on the **unchanged, real distribution**.
    

In [53]:
y_pred_imbalanced = model_imbalanced.predict(X_test)
y_pred_undersampled = model_undersampled.predict(X_test)

print("=== Model Trained on Imbalanced Data ===")
print(classification_report(y_test, y_pred_imbalanced))

=== Model Trained on Imbalanced Data ===
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      2969
           1       1.00      0.10      0.18        31

    accuracy                           0.99      3000
   macro avg       1.00      0.55      0.59      3000
weighted avg       0.99      0.99      0.99      3000



In [51]:
print("=== Model Trained with Random Undersampling ===")
print(classification_report(y_test, y_pred_undersampled))

=== Model Trained with Random Undersampling ===
              precision    recall  f1-score   support

           0       0.99      0.74      0.85      2969
           1       0.02      0.61      0.05        31

    accuracy                           0.74      3000
   macro avg       0.51      0.68      0.45      3000
weighted avg       0.98      0.74      0.84      3000



## 9. Interpretation of Results

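Behavior observed on this test set (31 fraud cases out of 3,000):
- Undersampling raises **recall for fraud** from 0.10 to 0.61
- Precision for fraud falls from 1.00 to 0.02, so nearly all alerts are false positives
- Accuracy is misleading in both cases: the imbalanced model scores 0.99 while missing about 90% of fraud, so accuracy should not be used to compare the models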
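
Since accuracy is not a useful yardstick here, raw confusion-matrix counts and a threshold-independent score can make the comparison more concrete. The sketch below rebuilds the same synthetic data with NumPy arrays instead of the notebook's DataFrame and evaluates only the model trained on imbalanced data. Using `average_precision_score` as the summary metric is an assumption of this example, not something the notebook prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Same synthetic setup as the notebook: roughly 1% fraud.
X, y = make_classification(
    n_samples=10000, n_features=10, n_informative=4, n_redundant=2,
    weights=[0.99, 0.01], flip_y=0.001, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
print(f"caught fraud: {tp}, missed fraud: {fn}, false alarms: {fp}")

# Average precision summarizes the precision-recall curve over all thresholds,
# so it does not depend on the default 0.5 cut-off used by predict().
fraud_scores = model.predict_proba(X_test)[:, 1]
ap = average_precision_score(y_test, fraud_scores)
print(f"average precision: {ap:.3f}")
```

The same two calls can be repeated for the undersampled model to compare both on equal terms.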


## 10. Risks and Anti-Patterns

- ❌ Applying undersampling before the train/test split  
- ❌ Evaluating on undersampled data  
- ❌ Using undersampling as a default solution
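
The second anti-pattern can be illustrated with a short experiment: score one model against the real test set and against an undersampled copy of it. This is a sketch on the same synthetic data, using NumPy arrays; the helper `balanced_indices` is a hypothetical name introduced only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=10000, n_features=10, n_informative=4, n_redundant=2,
    weights=[0.99, 0.01], flip_y=0.001, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
rng = np.random.default_rng(42)


def balanced_indices(labels):
    """All fraud rows plus an equally sized random sample of non-fraud rows."""
    fraud_idx = np.flatnonzero(labels == 1)
    non_fraud_idx = np.flatnonzero(labels == 0)
    sampled = rng.choice(non_fraud_idx, size=len(fraud_idx), replace=False)
    return np.concatenate([fraud_idx, sampled])


train_idx = balanced_indices(y_train)
model = LogisticRegression(max_iter=1000).fit(X_train[train_idx], y_train[train_idx])

# Correct: score against the untouched test set.
precision_real = precision_score(y_test, model.predict(X_test), zero_division=0)

# Anti-pattern: score against a test set that was also undersampled.
test_idx = balanced_indices(y_test)
precision_balanced = precision_score(
    y_test[test_idx], model.predict(X_test[test_idx]), zero_division=0
)

print(f"precision on real test set:         {precision_real:.3f}")
print(f"precision on undersampled test set: {precision_balanced:.3f}")
```

The undersampled test set keeps every fraud row but drops most non-fraud rows, and with them most false positives. Its precision can therefore only match or exceed the real figure, which is why it overstates how the model would behave in production.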


## 11. When Random Undersampling Makes Sense

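- Extremely large majority class, where discarding rows still leaves ample training data
- Clear signal separation, so a small majority sample still represents the class well
- Used inside ensembles or bagging strategies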
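
One way to read the ensemble point is to train several models, each on all fraud rows plus a different random slice of the majority class, and average their predicted probabilities. The sketch below does this with plain scikit-learn and NumPy. The ensemble size of 10 and the 0.5 threshold are untuned, illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=10000, n_features=10, n_informative=4, n_redundant=2,
    weights=[0.99, 0.01], flip_y=0.001, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

rng = np.random.default_rng(42)
fraud_idx = np.flatnonzero(y_train == 1)
non_fraud_idx = np.flatnonzero(y_train == 0)

n_models = 10  # illustrative choice, not tuned
models = []
for _ in range(n_models):
    # Each model sees all fraud rows plus a different random non-fraud sample.
    sampled = rng.choice(non_fraud_idx, size=len(fraud_idx), replace=False)
    idx = np.concatenate([fraud_idx, sampled])
    models.append(LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx]))

# Average the fraud probabilities across the ensemble, then threshold.
avg_proba = np.mean([m.predict_proba(X_test)[:, 1] for m in models], axis=0)
y_pred = (avg_proba >= 0.5).astype(int)
print("fraud recall:", recall_score(y_test, y_pred))
```

Because each member draws a different majority sample, the ensemble as a whole uses more of the non-fraud data than any single undersampled model, which partly offsets the information loss.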


## 12. Key Takeaways

- Random undersampling is simple, but it discards data and can sharply reduce precision
- Always preserve the real data distribution for evaluation
- Prefer combining with class weighting or ensembles
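
As a point of comparison for the class-weighting suggestion, scikit-learn's `LogisticRegression` accepts `class_weight="balanced"`, which reweights the loss inversely to class frequency without discarding any rows. The following is a minimal sketch on the same synthetic data, using NumPy arrays. It is one possible baseline, not a recommendation from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=10000, n_features=10, n_informative=4, n_redundant=2,
    weights=[0.99, 0.01], flip_y=0.001, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# All training rows are kept; fraud errors simply cost more in the loss.
model_weighted = LogisticRegression(max_iter=1000, class_weight="balanced")
model_weighted.fit(X_train, y_train)

y_pred_weighted = model_weighted.predict(X_test)
print(classification_report(y_test, y_pred_weighted, zero_division=0))
```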


## 13. Next Steps

- Compare with random oversampling and SMOTE
- Apply undersampling inside cross-validation folds
- Explore cost-sensitive learning
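
Applying undersampling inside cross-validation folds could look like the sketch below: each training fold is undersampled independently, and each validation fold keeps its real class ratio. The choice of 5 folds and of average precision as the score are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(
    n_samples=10000, n_features=10, n_informative=4, n_redundant=2,
    weights=[0.99, 0.01], flip_y=0.001, random_state=42,
)

rng = np.random.default_rng(42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in cv.split(X, y):
    # Undersample the training fold only; the validation fold is left untouched.
    fraud_idx = train_idx[y[train_idx] == 1]
    non_fraud_idx = train_idx[y[train_idx] == 0]
    sampled = rng.choice(non_fraud_idx, size=len(fraud_idx), replace=False)
    fold_idx = np.concatenate([fraud_idx, sampled])

    model = LogisticRegression(max_iter=1000).fit(X[fold_idx], y[fold_idx])
    val_scores = model.predict_proba(X[val_idx])[:, 1]
    scores.append(average_precision_score(y[val_idx], val_scores))

print("average precision per fold:", np.round(scores, 3))
```

Resampling inside the loop keeps the golden rule intact: no validation row is ever dropped or rebalanced.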