
# 04 – Pipeline Integration for Imbalanced Data

**Module:** Anomaly & Fraud Detection  
**Topic:** Production-Ready Imbalance Handling

This notebook shows how to integrate class weighting, threshold adjustment, and evaluation metrics
into a **reusable ML pipeline** for rare-event detection such as fraud.


## Objective

Build an end-to-end pipeline that:
- Handles class imbalance
- Supports threshold tuning
- Incorporates evaluation metrics (PR, F1, cost-sensitive)
- Is ready for deployment and monitoring


## Design Principles

✔ Leakage-free pipeline  
✔ Reusable feature + model integration  
✔ Threshold tuning for rare events  
✔ Business-aware metrics incorporated  
✔ Modular and production-ready


## High-Level Architecture
Raw Data  
&nbsp;&nbsp;&nbsp;&nbsp;↓  
Preprocessing & Feature Engineering  
&nbsp;&nbsp;&nbsp;&nbsp;↓  
Class-Balanced Model Training  
&nbsp;&nbsp;&nbsp;&nbsp;↓  
Probability Prediction  
&nbsp;&nbsp;&nbsp;&nbsp;↓  
Threshold Adjustment  
&nbsp;&nbsp;&nbsp;&nbsp;↓  
Evaluation Metrics (PR, F1, Cost)  
&nbsp;&nbsp;&nbsp;&nbsp;↓  
Deployment / Monitoring


## Imports and Setup


In [13]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, classification_report, auc

np.random.seed(42)

## Simulated Imbalanced Dataset 

In [15]:
X, y = make_classification(
    n_samples=10000,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    weights=[0.985, 0.015],
    flip_y=0.001,
    random_state=42
)

df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
df["fraud"] = y

In [18]:
df.head()

|   | feature_0 | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 | fraud |
|---|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-------|
| 0 | -1.18532 | -1.981989 | -1.345503 | 2.099419 | -0.52637 | -3.270027 | 1.025845 | 0.411469 | -1.834725 | -1.303835 | 0 |
| 1 | 0.059089 | 1.694271 | 1.981215 | 2.087415 | -0.515182 | -2.652186 | 0.241717 | 1.68561 | 2.010614 | -3.767967 | 0 |
| 2 | 0.990966 | -0.840328 | -1.302106 | 0.058121 | 0.467136 | -0.658101 | -1.764214 | -0.630063 | -0.094083 | -1.085228 | 0 |
| 3 | -1.023847 | -0.774103 | 0.18269 | 1.272535 | 0.587931 | -2.153325 | -2.414405 | -0.944846 | 1.098097 | -3.621804 | 0 |
| 4 | 0.928713 | 1.429745 | 2.065031 | 0.103607 | -0.271382 | 1.297563 | 1.666663 | -0.493448 | 0.618468 | 0.686027 | 0 |


## Train/Test Split

In [21]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="fraud"), df["fraud"],
    test_size=0.3, stratify=df["fraud"], random_state=42
)
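As a quick check on what `stratify` does, the sketch below regenerates the same simulated data and compares the fraud rate overall, in the training split, and in the test split. The names `X_demo`, `y_demo`, `y_tr`, and `y_te` are illustrative and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same generator settings as the simulated dataset in this notebook
X_demo, y_demo = make_classification(
    n_samples=10000, n_features=10, n_informative=5, n_redundant=2,
    weights=[0.985, 0.015], flip_y=0.001, random_state=42,
)

# Stratifying on the label keeps the fraud rate nearly identical in both splits
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

print(f"overall: {y_demo.mean():.4f}  train: {y_tr.mean():.4f}  test: {y_te.mean():.4f}")
```

Without stratification, a 1.5% positive class can drift noticeably between splits by chance, which makes test metrics harder to compare with training metrics.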

## Preprocessing Pipeline

In [24]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

## Train Class-Balanced Model

In [27]:
model = LogisticRegression(max_iter=1000, class_weight='balanced')
model.fit(X_train_scaled, y_train)
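For reference, scikit-learn's `class_weight='balanced'` assigns each class the weight `n_samples / (n_classes * count(class))`. The sketch below reproduces that formula on a hypothetical toy label vector (`y_toy`, not the notebook's data) with roughly the same class ratio.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy labels: 985 legitimate, 15 fraud
y_toy = np.array([0] * 985 + [1] * 15)

classes = np.unique(y_toy)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same formula written out by hand: n_samples / (n_classes * count per class)
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
# {0: 0.508, 1: 33.333}
```

Each fraud example therefore contributes about 66 times as much to the loss as a legitimate one, which is why the model's default 0.5 cutoff tends to flag many more cases than an unweighted model would.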

## Predict Probabilities 

In [30]:
y_probs = model.predict_proba(X_test_scaled)[:,1]

## Threshold Tuning

In [33]:
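# Note: the threshold is tuned on the test set here for brevity,
# which makes the metrics reported below optimistic.
precision, recall, thresholds = precision_recall_curve(y_test, y_probs)
# precision and recall have one more entry than thresholds;
# drop the final (precision=1, recall=0) point so indices line up.
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-9)
best_idx = np.argmax(f1_scores)
best_threshold = thresholds[best_idx]
y_pred = (y_probs >= best_threshold).astype(int)
print(f"Optimal threshold: {best_threshold:.3f}")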

Optimal threshold: 0.734
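The cell above picks the threshold using test-set labels. A leakage-free variant would carve a validation split out of the training data, tune the threshold there, and touch the test set only once at the end. The following is a minimal sketch under that assumption; the helper `pick_f1_threshold` and the names `X_fit`, `X_val`, and `val_threshold` are hypothetical and not part of the original notebook, and no results are claimed for it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def pick_f1_threshold(y_true, y_score):
    """Return the score threshold that maximizes F1 on the given data."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # Drop the final (precision=1, recall=0) point, which has no threshold.
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / (p + r + 1e-9)
    return float(thresholds[np.argmax(f1)])


X, y = make_classification(
    n_samples=10000, n_features=10, n_informative=5, n_redundant=2,
    weights=[0.985, 0.015], flip_y=0.001, random_state=42,
)

# Hold out the test set first, then split the remainder into fit/validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42
)

# Fit the scaler and the model on the fit split only.
scaler = StandardScaler().fit(X_fit)
model = LogisticRegression(max_iter=1000, class_weight="balanced")
model.fit(scaler.transform(X_fit), y_fit)

# Tune on validation scores only; the test set stays untouched until the end.
val_scores = model.predict_proba(scaler.transform(X_val))[:, 1]
val_threshold = pick_f1_threshold(y_val, val_scores)

# Apply the frozen threshold to the test set for the final evaluation.
test_scores = model.predict_proba(scaler.transform(X_test))[:, 1]
y_pred_test = (test_scores >= val_threshold).astype(int)
```

With so few positives in a validation split of this size, the chosen threshold can be noisy; cross-validated tuning is one option if that becomes a problem.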


## Evaluation Metrics

In [36]:
pr_auc = auc(recall, precision)
print(f"PR-AUC: {pr_auc:.4f}")
print(classification_report(y_test, y_pred))

PR-AUC: 0.0854
              precision    recall  f1-score   support

           0       0.99      0.93      0.96      2954
           1       0.11      0.57      0.18        46

    accuracy                           0.92      3000
   macro avg       0.55      0.75      0.57      3000
weighted avg       0.98      0.92      0.95      3000
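The objective also lists cost-sensitive metrics. One simple way to express them is to attach a cost to each error type and total it at a given threshold. The sketch below assumes hypothetical unit costs (`COST_FALSE_NEGATIVE`, `COST_FALSE_POSITIVE`) and uses toy arrays rather than the notebook's predictions; real costs would have to come from the business.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical unit costs, chosen only for illustration
COST_FALSE_NEGATIVE = 100.0  # a missed fraud case
COST_FALSE_POSITIVE = 1.0    # an unnecessary manual review


def total_cost(y_true, y_score, threshold):
    """Total misclassification cost at a given decision threshold."""
    y_hat = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
    return fn * COST_FALSE_NEGATIVE + fp * COST_FALSE_POSITIVE


def min_cost_threshold(y_true, y_score, grid=None):
    """Scan a threshold grid and return (best_threshold, best_cost)."""
    grid = np.linspace(0.01, 0.99, 99) if grid is None else np.asarray(grid)
    costs = [total_cost(y_true, y_score, t) for t in grid]
    best = int(np.argmin(costs))
    return float(grid[best]), float(costs[best])


# Toy labels and scores (not notebook output)
y_true_toy = np.array([0, 0, 0, 0, 1, 1])
y_score_toy = np.array([0.1, 0.2, 0.6, 0.3, 0.4, 0.9])

print(total_cost(y_true_toy, y_score_toy, 0.5))
# 101.0: one false negative (score 0.4) plus one false positive (score 0.6)
```

When false negatives are much more expensive than false positives, the cost-minimizing threshold is usually lower than the F1-optimal one, so the two criteria can point to different operating points.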



## Integration Notes

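- Class weighting (`class_weight='balanced'`) makes minority-class errors count more in the loss, without resampling or altering the training data distribution
- Threshold tuning selects the operating point on the precision-recall curve; the F1-optimal threshold here (0.734) sits above the 0.5 default, trading some recall for precision
- Evaluation uses PR-AUC and the fraud-class F1 score, which are more informative than accuracy when positives are rare
- The scaler and model are fit on training data only and applied in a fixed order, so they can be wrapped in a single pipeline object for production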
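The notebook applies the scaler and the model as separate steps. One way to bundle them is scikit-learn's `Pipeline`, sketched below on the same simulated data. The name `fraud_pipeline` and the step names `"scaler"` and `"clf"` are illustrative choices, not taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = make_classification(
    n_samples=10000, n_features=10, n_informative=5, n_redundant=2,
    weights=[0.985, 0.015], flip_y=0.001, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, stratify=y_all, random_state=42
)

# A single object that scales and then classifies; fit() learns the
# scaling statistics from the training data only.
fraud_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
fraud_pipeline.fit(X_tr, y_tr)

# predict_proba applies the fitted scaler before scoring.
pipeline_probs = fraud_pipeline.predict_proba(X_te)[:, 1]
```

Because the scaler lives inside the pipeline, cross-validation and serialization operate on one object, which removes the risk of scoring with a scaler that was fit on different data.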

## Production Checklist

✔ Pipeline modular and reproducible  
✔ Preprocessing deterministic  
✔ Threshold tuned on a validation split, not the test set  
✔ PR, F1, and cost metrics monitored  
✔ Ready for deployment

## Key Takeaways

- Integrating preprocessing, modeling, and evaluation into a single pipeline ensures consistency  
- Class weighting plus threshold tuning is often sufficient for imbalanced fraud datasets  
- Monitoring thresholds and metrics post-deployment is critical for maintaining model performance

## Next Steps

- Extend pipeline with resampling or focal loss modules  
- Automate threshold re-calibration over time  
- Integrate alerting and real-time scoring mechanisms
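As a starting point for automated re-calibration, a monitoring job could compare the live alert rate against the rate observed when the threshold was chosen and flag large deviations. The sketch below is one possible check, not a prescribed method; the function names and the 50% relative tolerance are assumptions made for illustration, and the score arrays are toy values.

```python
import numpy as np


def alert_rate(scores, threshold):
    """Fraction of scored cases that would be flagged at this threshold."""
    return float(np.mean(np.asarray(scores) >= threshold))


def needs_recalibration(baseline_rate, current_rate, tolerance=0.5):
    """Flag when the alert rate moves more than `tolerance` (relative) from baseline."""
    if baseline_rate == 0:
        return current_rate > 0
    return abs(current_rate - baseline_rate) / baseline_rate > tolerance


# Toy score batches (hypothetical, not model output)
baseline_scores = np.array([0.1, 0.2, 0.3, 0.8, 0.1, 0.2, 0.4, 0.1, 0.3, 0.2])
drifted_scores = np.array([0.1, 0.9, 0.8, 0.8, 0.1, 0.9, 0.4, 0.1, 0.3, 0.2])

threshold = 0.734
base = alert_rate(baseline_scores, threshold)    # 1 of 10 flagged
current = alert_rate(drifted_scores, threshold)  # 4 of 10 flagged

print(needs_recalibration(base, current))
# True: the alert rate has quadrupled relative to the baseline
```

The alert rate needs no labels, so it can be checked immediately; label-dependent metrics such as precision and recall can only be tracked once confirmed fraud outcomes arrive.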