# Outlier and Anomaly Detection

___`Exploratory Risk Assessment notebook`___

# Objective
This notebook performs __systematic outlier and anomaly detection__ to:
- Identify extreme and rare observations

- Distinguish noise from business-critical signals

- Quantify modeling and operational risk

- Decide `what to keep, transform, cap, or isolate`

This notebook answers:

___`Which observations can materially distort analysis, models, or decisions?`___


## Why Outlier & Anomaly Detection Is a Risk Exercise
Outliers are not just statistical artifacts. They may represent:

- Fraud or abuse

- VIP or high-value customers

- System failures or data corruption

- Edge cases driving revenue or loss

Incorrect handling leads to:

- Biased models

- Fragile performance

- Regulatory exposure



## Imports and Configuration

In [3]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor


# Step 1 – Load Dataset

For continuity with the previous EDA notebooks, we regenerate the same synthetic dataset from a fixed random seed.

In [4]:
np.random.seed(2010)

N = 5000

df = pd.DataFrame({
    "age": np.random.randint(18, 75, size=N),
    "income": np.random.lognormal(mean=10.8, sigma=0.6, size=N),
    "tenure_years": np.random.exponential(scale=6, size=N),
    "transactions_last_30d": np.random.poisson(lam=4, size=N),
    "region": np.random.choice(
        ["North", "South", "East", "West"],
        size=N,
        p=[0.35, 0.25, 0.25, 0.15]
    ),
    "churn": np.random.binomial(1, 0.28, size=N)
})

df.head()


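|   | age | income | tenure_years | transactions_last_30d | region | churn |
|---|-----|--------|--------------|-----------------------|--------|-------|
| 0 | 18 | 45868.374647 | 5.749047 | 4 | North | 0 |
| 1 | 18 | 74287.388492 | 1.537824 | 5 | South | 0 |
| 2 | 67 | 78586.352313 | 24.502748 | 3 | North | 0 |
| 3 | 64 | 56102.92543 | 1.502888 | 3 | South | 0 |
| 4 | 37 | 25639.985952 | 3.9506 | 6 | North | 0 |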


# Step 2 – Univariate Statistical Outliers
## 2.1 IQR Method
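
For reference, the IQR rule used in this step can be written as follows, where $Q_1$ and $Q_3$ are the 25th and 75th percentiles of a column and 1.5 is the conventional fence multiplier (a convention, not a tuned value):

$$
\mathrm{IQR} = Q_3 - Q_1, \qquad x \text{ is flagged if } x < Q_1 - 1.5\,\mathrm{IQR} \ \text{ or } \ x > Q_3 + 1.5\,\mathrm{IQR}
$$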

In [5]:
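def iqr_outliers(series):
    """Return a boolean mask marking values outside the 1.5 * IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    # Tukey-style fences: 1.5 IQRs below Q1 and above Q3
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return (series < lower) | (series > upper)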
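
To see the rule on a small case, here is a minimal sketch on a hypothetical eight-value series (`toy` is illustrative and not part of the dataset). The function is repeated so the snippet runs on its own.

```python
import pandas as pd

def iqr_outliers(series):
    # Same rule as the notebook cell: flag values outside the 1.5 * IQR fences
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)

# Hypothetical toy data: seven tightly grouped values and one extreme value
toy = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])

# With pandas' default linear interpolation: Q1 = 12.75, Q3 = 16.5, IQR = 3.75,
# so the fences are 7.125 and 22.125
flags = iqr_outliers(toy)
print(toy[flags].tolist())  # [95]
```

Only the value 95 falls outside the fences. The extreme value barely moves the quartiles, which is why quartile-based fences are more robust than mean and standard deviation cutoffs.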


In [6]:
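# Flag IQR outliers column by column for the continuous and count features
outlier_flags = pd.DataFrame({
    col: iqr_outliers(df[col])
    for col in ["income", "tenure_years", "transactions_last_30d"]
})

# The mean of a boolean column is the share of rows flagged
outlier_flags.mean()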


income                   0.0470
tenure_years             0.0498
transactions_last_30d    0.0230
dtype: float64
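
The rates above do not say which tail the flagged points fall in. For a strictly positive, right-skewed variable the lower fence can fall below zero, in which case every flag sits in the upper tail. The following is a minimal sketch of that check, assuming a hypothetical helper `iqr_tail_shares` (not part of the original notebook) and a freshly drawn lognormal sample standing in for `income`, so its numbers will not match the notebook's exactly.

```python
import numpy as np
import pandas as pd

def iqr_tail_shares(series):
    """Hypothetical helper: share of values beyond each 1.5 * IQR fence."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return pd.Series({
        "below_lower": (series < q1 - 1.5 * iqr).mean(),
        "above_upper": (series > q3 + 1.5 * iqr).mean(),
    })

# Stand-in sample using the same lognormal parameters as `income`
# (an independent draw, not the notebook's data)
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=10.8, sigma=0.6, size=5000))

shares = iqr_tail_shares(sample)
print(shares)
```

In the notebook itself, the same helper could be applied to `df["income"]`, `df["tenure_years"]`, and `df["transactions_last_30d"]`. If the flags are all on one side, that suggests the rate reflects the shape of the distribution rather than a set of isolated anomalies, which matters when choosing between capping, transforming, and isolating.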