
# 01 – Class Weighting

**Module:** Anomaly & Fraud Detection  
**Topic:** Imbalanced Learning Strategies

This notebook demonstrates **class weighting** as a principled alternative to resampling
for handling extreme class imbalance in fraud detection problems.

Instead of altering the data distribution, class weighting alters the **loss function**,
so that errors on the minority class are penalized more heavily during training.
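
To make the idea concrete, here is a minimal NumPy sketch of a class-weighted binary cross-entropy. The function name `weighted_log_loss`, the toy arrays, and the 1:20 weight ratio are illustrative assumptions. scikit-learn's internal loss also includes regularization and may normalize differently, so treat this as a picture of the mechanism rather than the library's exact objective.

```python
import numpy as np

def weighted_log_loss(y_true, p_pred, w_neg=1.0, w_pos=1.0):
    """Mean binary cross-entropy where each sample's term is scaled by its class weight."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so log() never receives exactly 0 or 1
    p_pred = np.clip(np.asarray(p_pred, dtype=float), 1e-12, 1 - 1e-12)
    per_sample = -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
    weights = np.where(y_true == 1, w_pos, w_neg)
    return float(np.mean(weights * per_sample))

# Toy data (assumption): 9 legitimate transactions, 1 fraud,
# and a model that assigns every transaction a fraud probability of 0.05
y_toy = np.array([0] * 9 + [1])
p_toy = np.full(10, 0.05)

unweighted = weighted_log_loss(y_toy, p_toy)
weighted = weighted_log_loss(y_toy, p_toy, w_neg=1.0, w_pos=20.0)
print(f"unweighted loss: {unweighted:.3f}")
print(f"weighted loss:   {weighted:.3f}")
```

With the fraud weight raised, the single missed fraud case dominates the loss, so an optimizer has a much stronger incentive to move that prediction upward.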

    
## Objective

Build a leakage-free and production-ready workflow that:

- Handles severe class imbalance via weighted loss functions
- Preserves the original data distribution
- Compares weighted vs unweighted models
- Uses evaluation metrics aligned with rare-event detection

Fraud modeling is treated as **cost-sensitive decision making**, not accuracy optimization.
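
As a rough illustration of why accuracy is a poor fit for rare events, the sketch below scores a trivial "never fraud" predictor on simulated labels with about 1% positives. The label vector and the seed are assumptions chosen only for this demonstration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, recall_score

rng = np.random.default_rng(0)
# Simulated labels (assumption): roughly 1% of 10,000 transactions are fraud
y_true = (rng.random(10_000) < 0.01).astype(int)

# A "model" that never flags fraud
y_all_negative = np.zeros_like(y_true)
constant_scores = np.zeros(len(y_true), dtype=float)

acc = accuracy_score(y_true, y_all_negative)
rec = recall_score(y_true, y_all_negative)
ap = average_precision_score(y_true, constant_scores)

print(f"accuracy:          {acc:.4f}")
print(f"fraud recall:      {rec:.4f}")
print(f"average precision: {ap:.4f}  (fraud rate: {y_true.mean():.4f})")
```

Accuracy looks excellent even though no fraud is caught, while recall and average precision expose the failure. For uninformative constant scores, average precision collapses to the fraud rate, which makes it a useful floor when comparing models.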

## Design Principles

✔ No synthetic data generation  
✔ No majority-class removal  
✔ Original distribution preserved end-to-end  
✔ Explicit control of error costs

## High-Level Workflow

Imbalanced Dataset  
&nbsp;&nbsp;&nbsp;&nbsp;↓  
Train / Test Split (Stratified)  
&nbsp;&nbsp;&nbsp;&nbsp;↓  
Cost-Sensitive Model Training  
&nbsp;&nbsp;&nbsp;&nbsp;↓  
Evaluation on Original Distribution

## Imports and Setup


In [22]:
import numpy as np
import pandas as pd

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

np.random.seed(2010)

## Dataset Assumptions

- Binary target: `fraud`  
- Extreme imbalance (≈ 1–2% fraud)  
- Tabular numeric features

## Simulated Imbalanced Fraud Dataset

In [50]:
X, y = make_classification(
    n_samples=15000,
    n_features=10,
    n_informative=4,
    n_redundant=2,
    weights=[0.99, 0.01],
    flip_y=0.001,
    random_state=2010
)

In [52]:
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
df["fraud"] = y

## Class Distribution

In [55]:
df["fraud"].value_counts(normalize=True)

fraud
0    0.989533
1    0.010467
Name: proportion, dtype: float64

## Leakage-Free Train / Test Split



In [58]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="fraud"),
    df["fraud"],
    test_size=0.3,
    stratify=df["fraud"],
    random_state=42
)
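
One quick sanity check on the split is to confirm that stratification kept the fraud rate nearly identical in both partitions. The sketch below regenerates the same simulated data so it runs on its own. The variable names `X_tr`, `X_te`, `y_tr`, and `y_te` are placeholders, not names from the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same simulated dataset parameters as in this notebook
X, y = make_classification(
    n_samples=15000, n_features=10, n_informative=4, n_redundant=2,
    weights=[0.99, 0.01], flip_y=0.001, random_state=2010,
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train fraud rate: {train_rate:.4%}")
print(f"test fraud rate:  {test_rate:.4%}")
```

If the two rates diverge noticeably, the test metrics no longer reflect the original distribution, which undermines the comparison between weighted and unweighted models.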

## Baseline Model (No Class Weighting)


In [61]:
baseline_model = LogisticRegression(max_iter=1000)
baseline_model.fit(X_train, y_train)

y_pred_baseline = baseline_model.predict(X_test)

print("=== Baseline Logistic Regression ===")
print(classification_report(y_test, y_pred_baseline))

=== Baseline Logistic Regression ===
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      4453
           1       0.86      0.13      0.22        47

    accuracy                           0.99      4500
   macro avg       0.92      0.56      0.61      4500
weighted avg       0.99      0.99      0.99      4500



## Class-Weighted Model

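`class_weight='balanced'` sets each class weight to `n_samples / (n_classes * np.bincount(y))`,
so penalties scale inversely with class frequency. With the class proportions above
(about 98.95% vs 1.05%), an error on a fraud case is weighted roughly 95 times more
heavily than an error on a legitimate transaction.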
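
The weights behind the `balanced` setting can be inspected directly with `compute_class_weight` from `sklearn.utils.class_weight`. The label vector `y_demo` below is a made-up example with a 99:1 ratio, used only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels (assumption): 990 legitimate, 10 fraud
y_demo = np.array([0] * 990 + [1] * 10)
classes = np.array([0, 1])

balanced = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# The same heuristic written out by hand
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

print(dict(zip(classes.tolist(), balanced.round(4).tolist())))
print("fraud-to-legitimate weight ratio:", balanced[1] / balanced[0])
```

The ratio of the two weights equals the ratio of class counts, which is why `balanced` can be very aggressive under extreme imbalance.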


In [64]:
weighted_model = LogisticRegression(
    max_iter=1000,
    class_weight='balanced'
)

weighted_model.fit(X_train, y_train)

In [66]:
y_pred_weighted = weighted_model.predict(X_test)

print("=== Class-Weighted Logistic Regression ===")
print(classification_report(y_test, y_pred_weighted))

=== Class-Weighted Logistic Regression ===
              precision    recall  f1-score   support

           0       0.99      0.77      0.86      4453
           1       0.03      0.60      0.05        47

    accuracy                           0.76      4500
   macro avg       0.51      0.68      0.46      4500
weighted avg       0.98      0.76      0.86      4500



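## Manual Cost Configuration (Optional)

In real fraud systems, costs are often asymmetric and business-driven: a missed fraud case
typically costs more than a false alarm. Explicit weights encode that ratio directly
instead of deriving it from class frequencies.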

In [69]:
custom_weights = {
    0: 1,    # non-fraud
    1: 20    # fraud (higher penalty)
}

custom_weighted_model = LogisticRegression(
    max_iter=1000,
    class_weight=custom_weights
)

custom_weighted_model.fit(X_train, y_train)

In [71]:
y_pred_custom = custom_weighted_model.predict(X_test)

In [73]:
print("=== Custom Cost-Sensitive Model ===")
print(classification_report(y_test, y_pred_custom))

=== Custom Cost-Sensitive Model ===
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      4453
           1       0.23      0.38      0.29        47

    accuracy                           0.98      4500
   macro avg       0.61      0.68      0.64      4500
weighted avg       0.99      0.98      0.98      4500
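
One way to compare the three configurations on business terms is to convert each confusion matrix into a total cost. The sketch below is self-contained and regenerates the same simulated data. The unit costs `COST_FALSE_NEGATIVE` and `COST_FALSE_POSITIVE` and the helper `total_cost` are hypothetical placeholders. Real values would come from the business, and the ranking of models can change with them.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical unit costs (assumptions, not taken from any real system)
COST_FALSE_NEGATIVE = 500.0  # a missed fraud case
COST_FALSE_POSITIVE = 10.0   # a legitimate transaction flagged for review

def total_cost(y_true, y_pred):
    """Total cost of errors under the hypothetical unit costs above."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return float(fn * COST_FALSE_NEGATIVE + fp * COST_FALSE_POSITIVE)

X, y = make_classification(
    n_samples=15000, n_features=10, n_informative=4, n_redundant=2,
    weights=[0.99, 0.01], flip_y=0.001, random_state=2010,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

configs = {
    "unweighted": None,
    "balanced": "balanced",
    "custom 1:20": {0: 1, 1: 20},
}

costs = {}
for name, class_weight in configs.items():
    model = LogisticRegression(max_iter=1000, class_weight=class_weight)
    model.fit(X_tr, y_tr)
    costs[name] = total_cost(y_te, model.predict(X_te))
    print(f"{name:<12} total cost: {costs[name]:,.0f}")
```

Because the costs are assumptions, the useful output here is the comparison procedure, not the specific numbers.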




## Interpretation of Results

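On this test set (47 fraud cases out of 4,500 transactions):

- Class weighting increases fraud recall: 0.13 unweighted, 0.38 with 1:20 weights, 0.60 with `balanced`
- Precision falls as recall rises: 0.86, 0.23, and 0.03 respectively
- Overall accuracy is uninformative: the unweighted model reaches 0.99 while missing 87% of fraud cases

The optimal configuration depends on **business cost trade-offs**.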


## Risks and Anti-Patterns

❌ Using accuracy (the default `score` metric) for model selection  
❌ Ignoring business cost asymmetry  
❌ Treating class weighting as a silver bullet
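
To avoid the first anti-pattern, model selection can be driven by a rare-event metric instead of accuracy. The sketch below uses cross-validation with `scoring="average_precision"` on the training portion only, so the test set stays untouched. The candidate dictionary and the fold settings are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X, y = make_classification(
    n_samples=15000, n_features=10, n_informative=4, n_redundant=2,
    weights=[0.99, 0.01], flip_y=0.001, random_state=2010,
)
# Hold out the test set first; selection happens on the training portion only
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Stratified folds keep the fraud rate similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "unweighted": None,
    "balanced": "balanced",
    "custom 1:20": {0: 1, 1: 20},
}

ap_scores = {}
for name, class_weight in candidates.items():
    model = LogisticRegression(max_iter=1000, class_weight=class_weight)
    fold_scores = cross_val_score(
        model, X_tr, y_tr, cv=cv, scoring="average_precision"
    )
    ap_scores[name] = float(fold_scores.mean())
    print(f"{name:<12} mean average precision: {ap_scores[name]:.3f}")
```

Average precision measures ranking quality across all thresholds, so it is worth pairing with an explicit threshold choice before comparing precision and recall at a fixed operating point.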


## When Class Weighting Is the Best Choice

- Dataset is small or medium-sized
- Synthetic data is undesirable or risky
- Strong baseline model already exists
- Business costs are well understood


## Production Checklist

✔ Original distribution preserved  
✔ Costs explicitly controlled  
✔ Evaluation leakage-free  
✔ Threshold tuning planned


## Key Takeaways

- Class weighting modifies the loss, not the data
- It is often safer than resampling because no synthetic points are created and no majority-class rows are discarded
- Best results come from combining with threshold tuning
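
A minimal sketch of threshold tuning on top of a weighted model follows. It assumes the threshold is chosen on a held-out validation split (not the final test set) and uses F1 as the selection criterion. Both choices are assumptions, and a cost-based criterion may suit a real fraud system better.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=15000, n_features=10, n_informative=4, n_redundant=2,
    weights=[0.99, 0.01], flip_y=0.001, random_state=2010,
)
# Validation split used only for picking the threshold (assumption)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000, class_weight="balanced")
model.fit(X_tr, y_tr)
val_scores = model.predict_proba(X_val)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_val, val_scores)

# precision and recall have one more entry than thresholds;
# drop the final (precision=1, recall=0) point so the arrays align
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.clip(p + r, 1e-12, None)

best = int(np.argmax(f1))
best_threshold = float(thresholds[best])
y_val_pred = (val_scores >= best_threshold).astype(int)

print(f"chosen threshold: {best_threshold:.3f}")
print(f"precision: {p[best]:.3f}, recall: {r[best]:.3f}, F1: {f1[best]:.3f}")
```

Weighted models tend to produce inflated fraud probabilities, so the default 0.5 cutoff used by `predict` is rarely the right operating point.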


## Next Steps

- Compare with SMOTE and undersampling
- Tune decision thresholds explicitly
- Combine with ensemble models