# Banking Fraud Analytics Showcase (IEEE‑CIS)  

### Agenda
1. **Business framing**: what “fraud detection” means operationally (alerts, false positives, costs).
2. **Data understanding**: what’s in the dataset, how transactions relate to identity signals.
3. **Data quality checks**: duplicates, missingness, cardinality, outliers, leakage watchlist.
4. **EDA (on a safe sample)**: patterns by amount, time-of-day, device, email domain, etc.
5. **Feature design**: practical features banks actually use (time buckets, velocity proxies, frequency encodings).
6. **Modeling**:
   - Baseline model (fast sanity check)
   - Neural model (GPU-friendly) on engineered tabular features
7. **Decisioning**: threshold by **alert rate** and by **cost**.
8. **Benchmark**: demonstrate **CPU vs GPU** training time (epochs=2) using CUDA if available.
9. **What “production readiness” looks like**: monitoring, drift, governance, privacy constraints.


## 0) Setup and environment checks

In [6]:
# Step 1: Install CUDA-enabled PyTorch (explicit)
# %pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# import torch
# print("CUDA available:", torch.cuda.is_available())

# Step 2: Install everything else
# %pip install -r requirements.txt


### Environment Validation and Dependency Management

This cell imports the core dependencies up front, so a missing package fails here rather than midway through the analysis:
- **Core libraries**: NumPy and Pandas for data manipulation (Scikit-learn is imported in the modeling cells)
- **Visualization**: Matplotlib and Seaborn for exploratory data analysis
- **Data processing**: Polars, imported in the join cell, for high-performance streaming operations

Warnings are suppressed to keep the output clean. If any packages are missing, uncomment the pip install commands in the previous cell or at the top of this one.

In [4]:
# If needed (uncomment):
# %pip -q install -U numpy pandas scipy scikit-learn matplotlib seaborn polars pyarrow psutil
# %pip -q install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

import os, sys, time, subprocess, platform, pathlib, warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### System Resource Assessment

This diagnostic cell performs a thorough system resource assessment to understand the compute environment:

**Memory Detection:**
- Checks both host-level RAM (via psutil) and container-level limits (via cgroups)
- Critical for preventing out-of-memory crashes in containerized environments
- Helps determine appropriate batch sizes and streaming strategies

**GPU Availability:**
- Detects NVIDIA GPUs using nvidia-smi
- Essential for determining whether GPU-accelerated training is possible
- Informs decisions about PyTorch device placement (CPU vs CUDA)

**Shared Memory (shm):**
- Checks /dev/shm capacity for DataLoader workers
- Low shm can cause PyTorch multiprocessing errors
- We'll configure num_workers=0 if shm is limited

This information guides our downstream choices for data loading, batch sizing, and parallelization strategies.
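
As a rough sketch of how the shared-memory check could feed into the loader configuration, the helper below picks a worker count from the free space in `/dev/shm`. The names `pick_num_workers` and `shm_free_bytes`, the 1 GiB cutoff, and the cap of 4 workers are illustrative assumptions, not values used elsewhere in this notebook.

```python
import os
import shutil

def pick_num_workers(shm_free, cpu_count, min_shm_bytes=1 << 30):
    """Return a DataLoader worker count; fall back to 0 when shm is scarce or unknown."""
    if shm_free is None or shm_free < min_shm_bytes:
        return 0  # single-process loading does not rely on shm-backed worker queues
    return max(1, min(4, (cpu_count or 1) // 2))  # hypothetical cap of 4 workers

def shm_free_bytes(path="/dev/shm"):
    """Free bytes on the shm mount, or None when the path does not exist (e.g. macOS)."""
    return shutil.disk_usage(path).free if os.path.isdir(path) else None

print("num_workers:", pick_num_workers(shm_free_bytes(), os.cpu_count()))
```

On a machine without `/dev/shm`, the helper returns 0, which matches the conservative `num_workers=0` setting used later.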

In [5]:
import psutil

def bytes_to_gb(b): 
    return b / (1024**3)

def read_cgroup_limit_bytes():
    p = "/sys/fs/cgroup/memory.max"
    if pathlib.Path(p).exists():
        v = open(p).read().strip()
        if v.isdigit():
            return int(v)
    p = "/sys/fs/cgroup/memory/memory.limit_in_bytes"
    if pathlib.Path(p).exists():
        v = open(p).read().strip()
        if v.isdigit():
            return int(v)
    return None

print("Python:", sys.version.split()[0])
print("Platform:", platform.platform())
print("CPU cores:", os.cpu_count())
print("Host RAM total (GB):", round(bytes_to_gb(psutil.virtual_memory().total), 2))
print("Host RAM avail (GB):", round(bytes_to_gb(psutil.virtual_memory().available), 2))

CGROUP_LIMIT = read_cgroup_limit_bytes()
print("cgroup memory cap (GB):", round(bytes_to_gb(CGROUP_LIMIT), 2) if CGROUP_LIMIT else "not detected")

try:
    print(subprocess.check_output(["bash","-lc","nvidia-smi -L"], text=True))
    GPU_VISIBLE = True
except Exception:
    print("No nvidia-smi detected (or no GPU visible).")
    GPU_VISIBLE = False

print("GPU_VISIBLE:", GPU_VISIBLE)

try:
    shm = subprocess.check_output(["bash","-lc","df -h /dev/shm | tail -n 1"], text=True).strip()
    print("/dev/shm:", shm)
except Exception:
    pass

Python: 3.11.13
Platform: macOS-15.7.3-arm64-arm-64bit
CPU cores: 11
Host RAM total (GB): 36.0
Host RAM avail (GB): 13.65
cgroup memory cap (GB): not detected


bash: nvidia-smi: command not found


No nvidia-smi detected (or no GPU visible).
GPU_VISIBLE: False
/dev/shm: 


df: /dev/shm: No such file or directory


## 0.1) Extract Data

### Data Extraction from Archive

This cell extracts the IEEE-CIS fraud detection dataset from a compressed archive:

**Dataset Source:**
- Kaggle competition: [IEEE-CIS Fraud Detection](https://www.kaggle.com/competitions/ieee-fraud-detection)
- Contains real-world transaction data with identity information

**Files Expected:**
- `train_transaction.csv`: Transaction-level features (amount, time, card info, etc.)
- `train_identity.csv`: Identity-level features (device type and info, plus anonymized `id_*` signals such as network and browser attributes)

The extraction creates a `data/` directory structure that subsequent cells will reference. This one-time operation prepares the raw data for processing.

In [None]:
import os, zipfile

# Data source: https://www.kaggle.com/competitions/ieee-fraud-detection

os.makedirs("data", exist_ok=True)  # extraction target; later cells read from data/

with zipfile.ZipFile("data.zip", "r") as z:
    z.extractall("data")

print("Extracted files:", os.listdir("data"))


## 1) Business framing: what we’re optimizing for

Fraud modeling in banks usually becomes an **operations optimization** problem:

- **False Positive (FP)** = customer friction + analyst time + manual review cost  
- **False Negative (FN)** = fraud loss + chargebacks + regulatory/audit concerns

That means:
- AUC is useful, but **alert-rate decisions** (top X% flagged) often matter more.
- We typically present results at **fixed alert rates** (e.g., 0.5%, 1%, 2%).
- We may also present a **cost curve**: cost(FN) vs cost(FP).

### Data File Validation

Performs basic validation to ensure the required CSV files exist at the expected paths:

- Confirms presence of transaction data (`train_transaction.csv`)
- Confirms presence of identity data (`train_identity.csv`)
- Raises clear errors if files are missing, preventing silent failures downstream

This is a defensive programming practice that catches data pipeline issues early.

## 2) Data access: robust join (streaming) to avoid RAM crashes

### Memory-Efficient Data Join Using Polars Streaming

This cell performs a **left join** of transaction and identity data using Polars' streaming engine:

**Why Polars Streaming?**
- Processes data in chunks rather than loading everything into RAM
- Critical for environments with cgroup memory limits (containers, cloud notebooks)
- Prevents OOM crashes that would occur with pandas on large datasets

**Join Strategy:**
- Left join on `TransactionID` preserves all transactions
- Identity features are optional enrichment (some transactions lack identity data)
- Result is written to Parquet for efficient columnar storage

**Caching:**
- If the joined Parquet file already exists, we skip this step
- Saves time on repeated runs and notebook restarts

This approach scales beyond the demo: the streaming engine can process datasets larger than RAM, and the Parquet output is a durable, compressed artifact.

In [None]:
DATA_DIR = "data"
trx_csv = os.path.join(DATA_DIR, "train_transaction.csv")
id_csv  = os.path.join(DATA_DIR, "train_identity.csv")

assert os.path.exists(trx_csv), f"Missing {trx_csv}"
assert os.path.exists(id_csv),  f"Missing {id_csv}"

print("OK:", trx_csv)
print("OK:", id_csv)

In [None]:
import polars as pl

JOINED_PARQ = os.path.join(DATA_DIR, "train_joined.parquet")

if not os.path.exists(JOINED_PARQ):
    print("Building:", JOINED_PARQ)
    trx = pl.scan_csv(trx_csv)
    idd = pl.scan_csv(id_csv)
    trx.join(idd, on="TransactionID", how="left").sink_parquet(JOINED_PARQ)
    print("Wrote:", JOINED_PARQ)
else:
    print("Exists:", JOINED_PARQ)

### Dataset Profiling: Shape and Target Distribution

Computes high-level statistics about the dataset:

**Key Metrics:**
- **Row count**: Total number of transactions
- **Fraud rate**: Baseline prevalence of fraud (class balance)
- **Column count**: Dimensionality of the feature space

**Why This Matters:**
- Low fraud rates (~3-4%) indicate severe class imbalance
- Imbalanced data requires careful metric selection (AUC, precision-recall, alert rate)
- High dimensionality suggests need for feature selection or dimensionality reduction

**Schema Inspection:**
- Shows first 8 columns with data types
- Helps identify categorical vs. numerical features
- Informs feature engineering strategy

## 3) Data understanding (schema, size, target)

In [None]:
scan = pl.scan_parquet(JOINED_PARQ)
rows = scan.select(pl.len()).collect().item()
fraud_rate = scan.select(pl.col("isFraud").mean()).collect().item()

print("Rows:", rows)
print("Fraud rate:", round(fraud_rate*100, 3), "%")

schema = pl.read_parquet_schema(JOINED_PARQ)
print("Columns:", len(schema))
list(schema.items())[:8]

### Duplicate Detection

Checks for duplicate `TransactionID` values in the dataset:

**Purpose:**
- Duplicates can indicate data quality issues or extraction errors
- May lead to data leakage if duplicates appear in both train and validation sets
- Can artificially inflate model performance metrics

**Expected Result:**
- Clean datasets should have zero duplicates
- Non-zero count requires investigation and potential deduplication
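
A small sketch of two ways to count duplicates, since the numbers differ. Polars' `is_duplicated()` marks every row whose value occurs more than once, so its sum is the first figure below; the second is the number of rows a deduplication would remove. The function name `duplicate_summary` and the toy IDs are made up for illustration.

```python
from collections import Counter

def duplicate_summary(ids):
    """Return (rows belonging to a duplicated ID, surplus rows beyond the first occurrence)."""
    counts = Counter(ids)
    rows_in_dup_groups = sum(n for n in counts.values() if n > 1)
    surplus_rows = sum(n - 1 for n in counts.values() if n > 1)
    return rows_in_dup_groups, surplus_rows

print(duplicate_summary([101, 102, 102, 103, 103, 103]))  # (5, 3)
print(duplicate_summary([101, 102, 103]))                 # (0, 0)
```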

## 4) Data quality checks (duplicates, missingness, cardinality)

### Missingness Analysis

Quantifies missing values across all columns:

**Analysis:**
- Computes null count and null percentage for each column
- Sorts by missingness to identify problematic features
- Displays top 20 columns with highest missing rates

**Decision Criteria:**
- Features with >80% missingness may have limited predictive value
- High missingness in categorical features suggests encoding challenges
- Missingness that is not completely at random (MAR or MNAR rather than MCAR) can itself be an informative signal

**Production Implications:**
- High-missingness features require imputation strategy or exclusion
- Missingness patterns should be monitored for drift in production

In [None]:
dup_count = (
    scan.select(pl.col("TransactionID"))
        .collect()
        .to_series()
        .is_duplicated()
        .sum()
)
print("Duplicate TransactionID rows:", int(dup_count))

### Cardinality Profiling for Categorical Features

Analyzes the unique value counts of string (categorical) columns:

**Purpose:**
- High-cardinality features (e.g., DeviceInfo) may require grouping or embeddings
- Low-cardinality features are candidates for one-hot encoding
- Extremely high cardinality may indicate PII or unique identifiers (potential leakage)

**Sampling Strategy:**
- Uses a 120K-row sample for computational efficiency
- Provides approximate cardinality estimates sufficient for feature engineering decisions

**Interpretation:**
- Features with 1000+ unique values may benefit from frequency encoding
- Features with <20 unique values are ideal for categorical encoding
- Single-value features (zero variance) should be dropped

In [None]:
null_counts = scan.select([pl.all().null_count()]).collect().row(0)
cols = list(schema.keys())
null_df = pd.DataFrame({"column": cols, "null_count": null_counts})
null_df["null_pct"] = null_df["null_count"] / rows * 100
null_df = null_df.sort_values("null_pct", ascending=False)
display(null_df.head(20))

In [None]:
EDA_N = 120_000
candidate_cats = [c for c, dt in schema.items() if dt == pl.Utf8][:50]
if candidate_cats:
    df_cat = scan.select(candidate_cats).collect().sample(n=min(EDA_N, rows), seed=42).to_pandas()
    nunq = df_cat.nunique(dropna=True).sort_values(ascending=False).head(15)
    display(nunq.to_frame("n_unique (sampled)"))
else:
    print("No Utf8/object columns detected in schema sample.")

### EDA Sample Construction

Creates a manageable sample for exploratory data analysis:

**Selected Features:**
- Core transaction attributes: amount, time, product
- Card identifiers: card4 (provider), card6 (type)
- Identity signals: email domains, device type/info, addresses

**Sampling Strategy:**
- 100K-row sample balances statistical power with computational speed
- Fixed random seed (7) ensures reproducibility
- Sample is small enough to visualize without performance issues

**Purpose:**
- Enables rapid hypothesis testing and pattern discovery
- Reduces iteration time during exploratory phase
- Sample-based insights are validated on full data during modeling

## 5) EDA on a safe sample (banking-relevant insights)

### Temporal and Magnitude Feature Engineering for EDA

Derives interpretable features from raw transaction data:

**Time-Based Features:**
- `TransactionDay`: Converts `TransactionDT` (seconds elapsed since an undisclosed reference time, not a Unix timestamp) to a day index
- `TransactionHour`: Extracts an hour-of-day bucket (0-23) relative to that reference, for circadian fraud patterns

**Amount Binning:**
- `AmtBin`: Quantile-based discretization into 10 bins
- Enables visualization of fraud rate by amount tier
- Handles skewed distributions better than uniform bins

**Banking Rationale:**
- Fraud risk varies by time (off-hours, weekends)
- Small vs. large transactions have different fraud profiles
- These patterns inform rule-based alerts and model features
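
The two derivations above can be sketched in plain Python. This assumes `TransactionDT` is a count of elapsed seconds; `time_buckets` and `rank_deciles` are hypothetical helper names, and the decile rule only approximates what `pd.qcut` on ranks produces.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

def time_buckets(transaction_dt):
    """Split an elapsed-seconds value into (day index, hour-of-day bucket)."""
    day = transaction_dt // SECONDS_PER_DAY
    hour = (transaction_dt // SECONDS_PER_HOUR) % 24
    return day, hour

def rank_deciles(amounts):
    """Assign each amount a decile 0..9 by rank; ties are broken by input order."""
    order = sorted(range(len(amounts)), key=lambda i: amounts[i])  # stable sort
    deciles = [0] * len(amounts)
    for rank, i in enumerate(order):
        deciles[i] = rank * 10 // len(amounts)
    return deciles

print(time_buckets(86_400))  # (1, 0)
print(time_buckets(93_600))  # (1, 2)
print(rank_deciles([5.0] * 10))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Ranking first is what keeps the bins equally sized even when many transactions share the same amount, as the last line shows.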

In [None]:
EDA_COLS = [
    "isFraud","TransactionAmt","TransactionDT",
    "ProductCD","card4","card6",
    "P_emaildomain","R_emaildomain",
    "DeviceType","DeviceInfo",
    "addr1","addr2"
]
EDA_COLS = [c for c in EDA_COLS if c in schema]

df_eda = scan.select(EDA_COLS).collect().sample(n=min(100_000, rows), seed=7).to_pandas()
print("EDA sample:", df_eda.shape)
df_eda.head(3)

### Target Variable Distribution Visualization

Displays the class balance between fraudulent and legitimate transactions:

**Key Insights:**
- Visual confirmation of severe class imbalance
- In this dataset, fraud represents roughly 3-4% of transactions
- Imbalance necessitates careful metric selection:
  - **Avoid**: Accuracy (misleading with imbalance)
  - **Use**: AUC-ROC, Precision-Recall, Alert Rate metrics

**Business Context:**
- High imbalance reflects real-world fraud rates
- Models must achieve high recall without excessive false positives
- Cost-benefit analysis is more relevant than F1 score

In [None]:
if "TransactionDT" in df_eda.columns:
    df_eda["TransactionDay"]  = (df_eda["TransactionDT"] // (3600*24)).astype("int32")
    df_eda["TransactionHour"] = ((df_eda["TransactionDT"] // 3600) % 24).astype("int16")
if "TransactionAmt" in df_eda.columns:
    df_eda["AmtBin"] = pd.qcut(df_eda["TransactionAmt"].rank(method="first"), q=10, duplicates="drop")

### Circadian Pattern Analysis: Fraud Rate by Hour

Analyzes how fraud risk varies throughout the 24-hour cycle:

**Expected Patterns:**
- Higher fraud during off-hours (late night, early morning) when monitoring is minimal
- Lower fraud during business hours when customers are active
- Weekend vs. weekday differences may also emerge

**Actionable Insights:**
- Temporal features are strong fraud signals
- Can inform dynamic threshold adjustments
- Helps schedule analyst coverage for high-risk periods

This type of analysis is standard in fraud operations and directly translates to model features.

In [None]:
plt.figure()
df_eda["isFraud"].value_counts().plot(kind="bar")
plt.title("Class distribution (EDA sample)")
plt.xlabel("isFraud")
plt.ylabel("Count")
plt.show()

### Transaction Amount Risk Stratification

Examines fraud rate across transaction amount deciles:

**Analysis Technique:**
- Rank-based quantile binning avoids issues with skewed distributions
- Each bin contains ~10% of transactions
- Compares fraud rates across amount tiers

**Typical Findings:**
- Very small amounts (testing stolen cards) may show elevated fraud
- Very large amounts (high-value theft) also show elevated fraud
- Mid-range amounts often have lower fraud rates

**Business Application:**
- Informs amount-based risk scoring rules
- Helps set dynamic transaction limits
- Identifies sweet spots for fraudsters

In [None]:
if "TransactionHour" in df_eda.columns:
    grp = df_eda.groupby("TransactionHour")["isFraud"].mean()
    plt.figure()
    grp.plot(kind="line", marker="o")
    plt.title("Fraud rate by hour-of-day (EDA sample)")
    plt.xlabel("Hour")
    plt.ylabel("Fraud rate")
    plt.show()

### Device Type Fraud Risk Profiling

Analyzes fraud rate by device type (mobile, desktop, etc.):

**Banking Context:**
- Mobile devices may have different fraud profiles than desktops
- Unknown/missing device types are often high-risk
- Helps identify compromised device populations

**Visualization Strategy:**
- Focuses on top 10 most common device types
- Rare device types are excluded to reduce noise
- Sorted by fraud rate to highlight highest-risk segments

**Operational Use:**
- High-risk device types can trigger additional authentication
- Device fingerprinting becomes a key fraud signal
- Informs device-based blocking rules

In [None]:
if "AmtBin" in df_eda.columns:
    grp = df_eda.groupby("AmtBin")["isFraud"].mean()
    plt.figure(figsize=(10,4))
    grp.plot(kind="bar")
    plt.title("Fraud rate by TransactionAmt decile (EDA sample)")
    plt.xlabel("Amount decile (rank-based)")
    plt.ylabel("Fraud rate")
    plt.tight_layout()
    plt.show()

### Email Domain Risk Analysis

Examines fraud rates by purchaser (`P_emaildomain`) and recipient (`R_emaildomain`) email domains:

**Key Patterns:**
- Free email providers (gmail, yahoo) vs. corporate domains
- Disposable/temporary email services show elevated fraud
- Missing email domains are often high-risk

**Banking Intelligence:**
- Email domain is a strong identity verification signal
- Mismatch between purchaser and recipient domains can indicate fraud
- Domain reputation lists inform real-time decisioning

**Visualization:**
- Top 12 most common domains to focus on significant traffic
- Rare domains are excluded to reduce noise
- Separate charts for purchaser and recipient for comparison

In [None]:
if "DeviceType" in df_eda.columns:
    top = df_eda["DeviceType"].value_counts(dropna=False).head(10).index
    sub = df_eda[df_eda["DeviceType"].isin(top)]
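    # dropna=False keeps a missing-DeviceType group (if present in `sub`) instead of dropping it
    grp = sub.groupby("DeviceType", dropna=False)["isFraud"].mean().sort_values(ascending=False)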
    plt.figure(figsize=(8,4))
    grp.plot(kind="bar")
    plt.title("Fraud rate by DeviceType (top 10, EDA sample)")
    plt.xlabel("DeviceType")
    plt.ylabel("Fraud rate")
    plt.tight_layout()
    plt.show()

### Comprehensive Missingness Profiling

Provides dual view of missing data patterns:

**Table View:**
- Top 20 columns by missingness percentage
- Identifies features requiring imputation or exclusion
- Flags columns with near-complete missingness

**Distribution View:**
- Histogram of missingness rates across the columns in the EDA sample
- Shows whether missingness is concentrated or widespread
- Helps assess overall data quality

**Strategic Decisions:**
- Features with >80% missing: typically excluded
- Features with 20-80% missing: require sophisticated imputation
- Features with <20% missing: simple imputation (median, mode) often sufficient

Missing data handling is critical for model robustness and production reliability.
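
A minimal sketch of how the three thresholds above could be applied per column. The helper names `missing_pct` and `missingness_action` and the toy records are invented for illustration; the cutoffs are the rules of thumb listed above, not tuned values.

```python
def missing_pct(records, columns):
    """Percent of records in which each column is None."""
    n = len(records)
    return {c: 100.0 * sum(r.get(c) is None for r in records) / n for c in columns}

def missingness_action(pct):
    """Map a missing percentage to the handling suggested by the rules of thumb."""
    if pct > 80:
        return "exclude"
    if pct >= 20:
        return "sophisticated imputation"
    return "simple imputation"

# Made-up records: DeviceType is missing in 9 of 10 rows, addr1 in 1 of 10.
toy_records = [
    {"DeviceType": "mobile" if i == 0 else None, "addr1": None if i == 9 else 300.0 + i}
    for i in range(10)
]

for col, pct in missing_pct(toy_records, ["DeviceType", "addr1"]).items():
    print(col, pct, missingness_action(pct))
# DeviceType 90.0 exclude
# addr1 10.0 simple imputation
```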

In [None]:
for col in ["P_emaildomain","R_emaildomain"]:
    if col in df_eda.columns:
        top = df_eda[col].value_counts(dropna=False).head(12).index
        sub = df_eda[df_eda[col].isin(top)]
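        # dropna=False keeps a missing-domain group (if present in `sub`) instead of dropping it
        grp = sub.groupby(col, dropna=False)["isFraud"].mean().sort_values(ascending=False)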
        plt.figure(figsize=(10,4))
        grp.plot(kind="bar")
        plt.title(f"Fraud rate by {col} (top 12, EDA sample)")
        plt.xlabel(col)
        plt.ylabel("Fraud rate")
        plt.tight_layout()
        plt.show()

In [None]:
miss = (df_eda.isna().mean()*100).sort_values(ascending=False).head(20)
display(miss.to_frame("missing_% (EDA sample)"))

plt.figure(figsize=(8,4))
plt.hist(df_eda.isna().mean(axis=0).values, bins=30)
plt.title("Missingness distribution across columns (EDA sample)")
plt.xlabel("Fraction missing")
plt.ylabel("# columns")
plt.tight_layout()
plt.show()

### Production-Grade Feature Engineering Pipeline

Transforms raw data into model-ready numeric features using Polars streaming:

**Feature Selection:**
- Keeps **V columns** (V1-V250): the first 250 of the dataset's anonymized Vesta-engineered features (V1-V339)
- Keeps **categorical columns**: ProductCD, card identifiers, email domains, device info, addresses

**Feature Engineering Operations:**

1. **Frequency Encoding** (for categorical features):
   - Counts occurrences of each category value
   - High-frequency values often indicate legitimate patterns
   - Low-frequency values can signal novel fraud attempts

2. **Amount Transformation**:
   - `TransactionAmt_log1p`: Log transform to handle skewness and outliers
   - Improves model convergence and reduces sensitivity to extreme values

3. **Temporal Features**:
   - `TransactionDay`: Day index (days elapsed since the reference time); a day-of-week signal would need a further modulo-7 step
   - `TransactionHour`: Hour-of-day for circadian patterns

4. **Interaction Features**:
   - `Amt_x_Hour`: Amount-time interaction (large late-night transactions are risky)
   - `Amt_div_Day`: Amount divided by (day index + 1); a crude velocity proxy, not a per-card spending rate

**Pipeline Characteristics:**
- **Streaming**: Processes data in chunks to avoid OOM
- **Type Safety**: Casts all features to Float32 for memory efficiency
- **Caching**: Writes to Parquet for reuse across experiments
- **Reproducibility**: Deterministic transformations, no randomness

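This pipeline is a reasonable starting point for production: it handles large datasets and produces consistent outputs. One caveat is that the frequency counts are computed over the full dataset before the train/validation split, so a deployed batch or real-time scoring pipeline would need to fit them on training data only.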
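
A minimal sketch of frequency encoding with the counts learned on one split and reused on another. This is a deliberate variation on the pipeline here, which counts over the full joined dataset; the helper names `fit_frequency_map` and `apply_frequency_map`, the toy email domains, and the choice of 0 for unseen values are illustrative assumptions.

```python
from collections import Counter

def fit_frequency_map(values):
    """Count occurrences of each non-missing category value."""
    return Counter(v for v in values if v is not None)

def apply_frequency_map(values, freq_map, unseen=0):
    """Replace each category by its learned count; unseen or missing values get `unseen`."""
    return [freq_map.get(v, unseen) for v in values]

train = ["gmail.com", "gmail.com", "yahoo.com", None, "gmail.com"]
valid = ["yahoo.com", "protonmail.com", None]

fmap = fit_frequency_map(train)
print(apply_frequency_map(train, fmap))  # [3, 3, 1, 0, 3]
print(apply_frequency_map(valid, fmap))  # [1, 0, 0]
```

Fitting the map on the training rows only keeps validation rows from influencing their own features, at the cost of mapping categories first seen at scoring time to the fallback value.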

## 6) Feature engineering pipeline (streaming) → compact numeric feature Parquet

In [None]:
FEAT_PARQ = os.path.join(DATA_DIR, "train_features.parquet")

V_MAX = 250
CAT_CANDS = ["ProductCD","card4","card6","P_emaildomain","R_emaildomain","DeviceType","DeviceInfo","addr1","addr2"]

schema = pl.read_parquet_schema(JOINED_PARQ)
v_cols = [f"V{i}" for i in range(1, V_MAX+1) if f"V{i}" in schema]
cat_cols = [c for c in CAT_CANDS if c in schema]

base_cols = ["TransactionID","isFraud","TransactionDT","TransactionAmt"]
keep_cols = base_cols + v_cols + cat_cols

print("Keeping V cols:", len(v_cols))
print("Freq-encoding cats:", cat_cols)

if not os.path.exists(FEAT_PARQ):
    base = pl.scan_parquet(JOINED_PARQ).select(keep_cols)
    feat = base

    for c in cat_cols:
        ft = base.group_by(c).agg(pl.len().alias(f"{c}_freq"))
        feat = feat.join(ft, on=c, how="left")

    feat = feat.with_columns([
        pl.col("TransactionAmt").clip(0).log1p().alias("TransactionAmt_log1p"),
        (pl.col("TransactionDT") // (3600*24)).cast(pl.Int32).alias("TransactionDay"),
        ((pl.col("TransactionDT") // 3600) % 24).cast(pl.Int16).alias("TransactionHour"),
        (pl.col("TransactionAmt").fill_null(0) * ((pl.col("TransactionDT") // 3600) % 24).cast(pl.Float32)).alias("Amt_x_Hour"),
        (pl.col("TransactionAmt").fill_null(0) / ((pl.col("TransactionDT") // (3600*24)).cast(pl.Float32) + 1.0)).alias("Amt_div_Day"),
    ])

    feat = feat.drop(cat_cols)
    feat = feat.with_columns([pl.all().exclude(["TransactionID","isFraud"]).cast(pl.Float32, strict=False)])
    feat.sink_parquet(FEAT_PARQ)
    print("Wrote:", FEAT_PARQ)
else:
    print("Exists:", FEAT_PARQ)

### Baseline Model Training and Evaluation

Trains a fast baseline classifier to establish performance benchmarks:

**Data Preparation:**
- Loads engineered features from Parquet
- Separates target (`isFraud`) from feature matrix
- Handles infinite values and missing data (fill with sentinel value -999)
- Converts to Float32 for memory efficiency

**Train-Validation Split:**
- 80/20 split with stratification to preserve class balance
- Fixed random seed (42) for reproducibility
- Ensures both sets have similar fraud rates

**Model Selection:**
- **SGDClassifier** with log loss (logistic regression trained via SGD)
- Fast to train, interpretable, serves as sanity check
- Light L2 regularization (alpha=1e-5, with scikit-learn's default `l2` penalty) limits overfitting

**Feature Scaling:**
- StandardScaler (zero mean, unit variance)
- Critical for SGD convergence
- Fit on training data, applied to validation (no leakage)

**Evaluation Metrics:**
- **ROC-AUC**: Overall discriminative ability (threshold-agnostic)
- **PR-AUC**: Precision-recall trade-off (better for imbalanced data)
- Training time benchmarked for comparison

This baseline establishes the minimum performance bar. More complex models must beat this to justify their cost.

## 7) Baseline model + Decisioning (alert-rate & cost)

### Alert-Rate Decisioning

Applies an **operational decisioning strategy** based on alert rate:

**Business Context:**
- Banks have finite capacity to review alerts (analyst hours, customer friction)
- Alert rate = percentage of transactions flagged for review
- Common targets: 0.5%, 1%, 2% depending on review capacity

**Threshold Selection:**
- Sets threshold to flag top 1% of riskiest transactions
- Equivalent to 99th percentile of model scores

**Performance Evaluation:**
- **Confusion Matrix**: Shows TP, FP, TN, FN at this operating point
- **Precision**: What % of alerts are actual fraud?
- **Recall**: What % of fraud is caught?

**Operational Interpretation:**
- High precision = efficient use of analyst time
- High recall = effective fraud prevention
- Trade-off is managed via alert rate constraint

This approach is how fraud models are actually deployed: not as classifiers, but as ranking systems with operational constraints.
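
To make the ranking view concrete, here is a dependency-free sketch that flags the top share of scores and reports precision and recall at that operating point. The function name `alert_rate_report` and the toy scores are made up; the notebook's own cell uses `np.quantile`, which can pick a slightly different threshold when scores tie or the quantile falls between two values.

```python
def alert_rate_report(scores, labels, alert_rate):
    """Flag the top `alert_rate` share of scores; return (threshold, precision, recall)."""
    n_alerts = max(1, round(alert_rate * len(scores)))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    flagged = ranked[:n_alerts]
    threshold = flagged[-1][0]          # lowest score that still raises an alert
    tp = sum(y for _, y in flagged)     # flagged transactions that are fraud
    total_fraud = sum(labels)
    precision = tp / n_alerts
    recall = tp / total_fraud if total_fraud else 0.0
    return threshold, precision, recall

# Toy data: 10 transactions, 2 of them fraudulent.
scores = [0.95, 0.90, 0.40, 0.35, 0.30, 0.20, 0.15, 0.10, 0.05, 0.01]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

print(alert_rate_report(scores, labels, 0.2))  # (0.9, 0.5, 0.5)
```

Raising the alert rate can only keep or increase recall, which is why review capacity, not a fixed probability cutoff, ends up setting the operating point.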

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score, average_precision_score, confusion_matrix, classification_report

df_feat = pd.read_parquet(FEAT_PARQ)
print("Feature df:", df_feat.shape)

y = df_feat["isFraud"].astype(np.int64).values
X = (df_feat.drop(columns=["isFraud","TransactionID"])
     .replace([np.inf,-np.inf], np.nan)
     .fillna(-999)
     .astype(np.float32)
     .values)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train).astype(np.float32)
X_valid = scaler.transform(X_valid).astype(np.float32)

baseline = SGDClassifier(loss="log_loss", alpha=1e-5, max_iter=2000, tol=1e-3, random_state=42)

t0 = time.time()
baseline.fit(X_train, y_train)
t1 = time.time()

scores = baseline.predict_proba(X_valid)[:, 1]
print("Baseline time (s):", round(t1-t0, 2))
print("Baseline ROC-AUC :", round(roc_auc_score(y_valid, scores), 4))
print("Baseline PR-AUC  :", round(average_precision_score(y_valid, scores), 4))

### Cost-Based Threshold Optimization

Optimizes decision threshold based on **economic costs** rather than statistical metrics:

**Cost Model:**
- **False Negative (missed fraud)**: $100 per incident (fraud loss + chargeback)
- **False Positive (false alarm)**: $1 per incident (review cost + customer friction)
- These ratios are business-specific and should be calibrated to actual operations

**Optimization Process:**
- Sweeps 200 candidate thresholds from 0.001 to 0.999
- Computes expected cost at each threshold
- Selects threshold minimizing total cost

**Cost Function:**
```
Expected Cost = (FN × $100) + (FP × $1)
```

**Results:**
- Optimal threshold balances fraud losses against review costs
- Confusion matrix at optimal point
- Cost curve visualization shows sensitivity to threshold choice

**Business Value:**
- Directly ties model decisions to P&L impact
- Supports ROI calculations for fraud prevention programs
- Enables what-if analysis (e.g., "What if review costs double?")

This cost-based approach aligns ML directly with business objectives, making it easier to justify investments and operational changes.
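
A small worked example of the sweep, in plain Python so the arithmetic is easy to follow. The scores, labels, and candidate thresholds are toy values, and `best_threshold` is a hypothetical helper; only the cost formula comes from the description above.

```python
def expected_cost(scores, labels, threshold, cost_fn=100.0, cost_fp=1.0):
    """Total cost = missed frauds * cost_fn + false alarms * cost_fp."""
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return fn * cost_fn + fp * cost_fp

def best_threshold(scores, labels, thresholds, cost_fn=100.0, cost_fp=1.0):
    """Return the candidate threshold with the lowest expected cost."""
    return min(thresholds, key=lambda t: expected_cost(scores, labels, t, cost_fn, cost_fp))

scores = [0.9, 0.6, 0.4, 0.2, 0.1]
labels = [1, 0, 1, 0, 0]
candidates = [0.1, 0.3, 0.5, 0.7]

print([expected_cost(scores, labels, t) for t in candidates])  # [3.0, 1.0, 101.0, 100.0]
print(best_threshold(scores, labels, candidates))               # 0.3
# What-if: false alarms become far more expensive than in the base case.
print(best_threshold(scores, labels, candidates, cost_fp=200.0))  # 0.7
```

With a 100:1 cost ratio the cheapest option is a low threshold that catches both frauds; once a false alarm costs more than a missed fraud, the optimum moves to a stricter threshold.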

In [None]:
# Alert-rate threshold (top 1% flagged)
alert_rate = 0.01
thr_alert = float(np.quantile(scores, 1 - alert_rate))
y_hat_alert = (scores >= thr_alert).astype(int)

print("Alert rate:", alert_rate, "Threshold:", thr_alert)
print("Confusion matrix:\n", confusion_matrix(y_valid, y_hat_alert))
print(classification_report(y_valid, y_hat_alert, digits=4))

In [None]:
# Cost-based threshold
COST_FN = 100.0
COST_FP = 1.0

def expected_cost(y_true, y_pred, cost_fn, cost_fp):
    # labels=[0, 1] guarantees a 2x2 matrix even if only one class is present
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fn*cost_fn + fp*cost_fp

thresholds = np.linspace(0.001, 0.999, 200)
costs = []
for thr in thresholds:
    y_hat = (scores >= thr).astype(int)
    costs.append(expected_cost(y_valid, y_hat, COST_FN, COST_FP))

best_i = int(np.argmin(costs))
thr_cost = float(thresholds[best_i])
print("Best threshold (cost-based):", thr_cost, "Expected cost:", round(costs[best_i],2))

y_hat_cost = (scores >= thr_cost).astype(int)
print("Confusion matrix:\n", confusion_matrix(y_valid, y_hat_cost))
print(classification_report(y_valid, y_hat_cost, digits=4))

plt.figure()
plt.plot(thresholds, costs)
plt.title("Expected cost vs threshold (baseline)")
plt.xlabel("Threshold")
plt.ylabel("Expected cost")
plt.tight_layout()
plt.show()

### GPU Acceleration Setup and Data Loading

Prepares PyTorch infrastructure for CPU vs. GPU benchmarking:

**Environment Check:**
- Validates PyTorch installation and version
- Detects CUDA availability (GPU acceleration)
- Identifies specific GPU model if available

**Data Preparation:**
- Converts NumPy arrays to PyTorch tensors
- Creates `TensorDataset` for efficient data access
- Wraps in `DataLoader` for batch iteration

**DataLoader Configuration:**
- **Batch size**: 8192 (large batches maximize GPU utilization)
- **num_workers**: 0 (avoids multiprocessing issues with limited shm)
- **pin_memory**: False (disabled for safety in containerized environments)
- **shuffle**: True for training, False for validation

**Why These Settings?**
- Large batches amortize GPU kernel launch overhead
- Single-process loading avoids shm exhaustion
- Skipping pinned (page-locked) host memory reduces pressure on host RAM in memory-constrained environments, at the cost of slower host-to-GPU copies

This configuration is deliberately conservative: it favors stability in memory-constrained or containerized environments, with or without a GPU, over maximum loader throughput.

## 8) CPU vs GPU Benchmark (PyTorch MLP) — epochs=2 (shm-safe)

### Neural Network Architecture Definition

Defines a Multi-Layer Perceptron (MLP) for fraud classification:

**Architecture:**
```
Input → [4096 ReLU Dropout(0.1)] → [2048 ReLU Dropout(0.1)] → [1024 ReLU] → [2 logits]
```

**Design Rationale:**
- **Large layers**: 4096 and 2048 neurons leverage GPU parallel compute
- **ReLU activation**: Fast, gradient-friendly, GPU-optimized
- **Dropout**: 10% regularization prevents overfitting on imbalanced data
- **Binary output**: 2 logits for CrossEntropyLoss (fraud vs. legitimate)

**GPU Suitability:**
- Large matrix multiplications (4096×2048) saturate GPU cores
- Batch processing (8192 samples) maximizes throughput
- Simple operations (ReLU, Dropout) have efficient CUDA kernels

**Why Not Tree-Based Models for Benchmarking?**
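- Boosted trees (XGBoost, LightGBM) are built sequentially; both libraries support GPU training, but the speedup depends heavily on data size and histogram settings
- Dense neural networks are dominated by large matrix multiplications, which parallelize well and make the CPU-GPU gap easy to see
- This architecture demonstrates GPU advantage at scale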

This model is deliberately overparameterized for benchmarking purposes—it showcases GPU compute advantage.
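
To give a sense of scale, the parameter count of the architecture above can be worked out from the layer widths alone. The input width of 100 is a placeholder assumption (the real value depends on how many feature columns survive the pipeline), and `mlp_param_count` is a hypothetical helper.

```python
def mlp_param_count(d_in, hidden=(4096, 2048, 1024), n_out=2):
    """Weights plus biases of a fully connected stack; ReLU and Dropout add no parameters."""
    sizes = [d_in, *hidden, n_out]
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))

print(f"{mlp_param_count(100):,}")  # 10,904,578
```

Most of the parameters sit in the 4096-to-2048 layer (about 8.4 million), which is also where the largest matrix multiplication happens.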

In [None]:
import torch
from torch.utils.data import TensorDataset, DataLoader
import torch.nn as nn
from sklearn.metrics import roc_auc_score

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

train_ds = TensorDataset(torch.from_numpy(X_train), torch.from_numpy(y_train))
valid_ds = TensorDataset(torch.from_numpy(X_valid), torch.from_numpy(y_valid))

BATCH_SIZE = 8192
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, num_workers=0, pin_memory=False)
valid_loader = DataLoader(valid_ds, batch_size=BATCH_SIZE, shuffle=False, num_workers=0, pin_memory=False)

print("DataLoader: batch_size=", BATCH_SIZE, "num_workers=0 pin_memory=False")

### CPU vs. GPU Training Benchmark

Runs a controlled performance comparison between CPU and GPU training:

**Benchmark Design:**

1. **Training Function** (`train_eval`):
   - Accepts device ('cpu' or 'cuda'), epochs, learning rate
   - Uses AdamW optimizer (standard for deep learning)
   - CrossEntropyLoss for binary classification
   - Mixed precision (AMP) on GPU only: autocast runs eligible ops in FP16 while weights stay in FP32, and a gradient scaler limits FP16 underflow

2. **Timing Methodology**:
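   - **Warmup pass**: One untimed training step absorbs cold-start overhead (CUDA context initialization, kernel loading, cache warming)
   - **Synchronized timing**: `torch.cuda.synchronize()` ensures accurate GPU measurements
   - The timed region covers the full training loop, including batch iteration and host-to-device copies; model setup, the warmup step, and validation are excluded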

3. **Controlled Variables**:
   - Same model architecture
   - Same number of epochs (2 — short to keep demo fast)
   - Same batch size and learning rate
   - Same data and random seed

4. **Evaluation**:
   - Computes ROC-AUC on validation set after training
   - Lets you confirm that the CPU and GPU runs reach similar quality
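The role of the `GradScaler` that accompanies mixed precision can be illustrated without a GPU. The sketch below uses the standard library's half-precision `struct` format to round values to FP16; `to_fp16` is a hypothetical helper, the gradient value is made up, and 65536 is `GradScaler`'s default initial scale. It is a simplified picture of loss scaling, not what PyTorch does internally.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest FP16 value and return it as a float."""
    return struct.unpack("e", struct.pack("e", x))[0]


tiny_grad = 1e-8  # made-up gradient, below FP16's smallest positive value (~6e-8)
scale = 65536.0   # 2**16

unscaled = to_fp16(tiny_grad)                    # underflows to 0.0 in FP16
recovered = to_fp16(tiny_grad * scale) / scale   # scaled value survives the round trip

print(unscaled, recovered)
```

In the training loop the same idea is applied to the loss: the scaled loss is backpropagated, and the gradients are unscaled before the optimizer step.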

**Expected Results:**
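- CPU: Slower but always available
- GPU: Expected to be several times faster (5-20x is a rough guide, not a measured result); the actual speedup depends on hardware (A100, V100, T4, etc.), model size, and batch size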

**Why This Matters:**
- Faster iteration = more experiments = better models
- Higher scoring throughput for batch and near-real-time fraud detection
- ROI justification for GPU infrastructure

**GPU Fallback:**
- If no CUDA detected, GPU benchmark is skipped gracefully
- CPU-only mode still produces valid results

This benchmark demonstrates tangible business value of GPU acceleration: time-to-market and throughput gains.
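The timing methodology above (untimed warmup, then synchronize before reading the clock) can be sketched as a small reusable helper. `benchmark`, `step`, and `sync` are hypothetical names; with PyTorch on CUDA one would pass `torch.cuda.synchronize` as `sync`. Taking the median over several repeats is an addition here, since the notebook times a single run.

```python
import statistics
import time


def benchmark(step, warmup=1, repeats=3, sync=None):
    """Return the median wall-clock seconds of step() after untimed warmup runs."""
    def wait():
        if sync is not None:
            sync()  # block until queued asynchronous work has finished

    for _ in range(warmup):
        step()  # cold-start costs land here, outside the timed region

    durations = []
    for _ in range(repeats):
        wait()
        start = time.perf_counter()
        step()
        wait()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Toy usage: time a small CPU-bound function (no GPU involved, so sync is omitted).
calls = []
elapsed = benchmark(lambda: calls.append(sum(range(10_000))), warmup=2, repeats=5)
print(f"median step time: {elapsed:.6f} s over {len(calls)} calls")
```

Without the synchronization call, a GPU timer would mostly measure how long it takes to queue work, not how long the work takes to run.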

In [None]:
class MLP(nn.Module):
    def __init__(self, d_in):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 4096), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(4096, 2048), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(2048, 1024), nn.ReLU(),
            nn.Linear(1024, 2),
        )
    def forward(self, x):
        return self.net(x)

In [None]:
import time

import numpy as np


def train_eval(device="cpu", epochs=2, lr=1e-3, amp=True, seed=42):
    """Train the MLP on `device`; return (training seconds, validation ROC-AUC)."""
    # Seed so every run starts from the same initial weights.
    torch.manual_seed(seed)
    dev = torch.device(device)
    model = MLP(X_train.shape[1]).to(dev)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    crit = nn.CrossEntropyLoss()

    # Mixed precision is only enabled on CUDA; on CPU autocast and the scaler are no-ops.
    use_amp = (amp and dev.type == "cuda")
    scaler = torch.amp.GradScaler("cuda", enabled=use_amp)

    # Warmup: one untimed training step to absorb cold-start overhead.
    model.train()
    xb, yb = next(iter(train_loader))
    xb, yb = xb.to(dev), yb.to(dev)
    opt.zero_grad(set_to_none=True)
    with torch.amp.autocast("cuda", enabled=use_amp):
        loss = crit(model(xb), yb)
    scaler.scale(loss).backward()
    scaler.step(opt)
    scaler.update()

    # CUDA kernels run asynchronously, so synchronize before reading the clock.
    if dev.type == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()

    # Timed region: includes batch iteration and host-to-device copies.
    for _ in range(epochs):
        model.train()
        for xb, yb in train_loader:
            xb, yb = xb.to(dev), yb.to(dev)
            opt.zero_grad(set_to_none=True)
            with torch.amp.autocast("cuda", enabled=use_amp):
                logits = model(xb)
                loss = crit(logits, yb)
            scaler.scale(loss).backward()
            scaler.step(opt)
            scaler.update()

    if dev.type == "cuda":
        torch.cuda.synchronize()
    t1 = time.perf_counter()

    # Validation (untimed): collect the fraud-class probability for each row.
    model.eval()
    ps, ys = [], []
    with torch.no_grad():
        for xb, yb in valid_loader:
            xb = xb.to(dev)
            prob = torch.softmax(model(xb), dim=1)[:, 1].cpu().numpy()
            ps.append(prob)
            ys.append(yb.numpy())
    p = np.concatenate(ps)
    yt = np.concatenate(ys)
    auc = roc_auc_score(yt, p)

    return (t1 - t0), auc

cpu_t, cpu_auc = train_eval(device="cpu", epochs=2, amp=False)
print("CPU  time:", round(cpu_t, 2), "AUC:", round(cpu_auc, 4))

if torch.cuda.is_available():
    gpu_t, gpu_auc = train_eval(device="cuda", epochs=2, amp=True)
    print("GPU  time:", round(gpu_t, 2), "AUC:", round(gpu_auc, 4))
    print("Speedup:", round(cpu_t/gpu_t, 2), "x")
else:
    gpu_t, gpu_auc = None, None
    print("CUDA not available; GPU benchmark skipped.")

### Summary Dashboard

**Summary Table:**
- Dataset size and split (train/validation rows)
- Feature dimensionality (number of numeric features)
- Class balance (fraud rate)
- Baseline model performance (ROC-AUC, PR-AUC)
- Neural model benchmarks:
  - CPU training time and AUC
  - GPU training time and AUC (if available)
  - Speedup factor (CPU time / GPU time)

**Operational Metrics Table:**
- **Alert-rate threshold** (top 1% flagged):
  - Threshold value
  - Precision, recall, alert rate
  - Confusion matrix (TP, FP, TN, FN)
  
- **Cost-based threshold** (minimizing expected cost):
  - Optimal threshold value
  - Precision, recall, alert rate
  - Confusion matrix
  - Expected cost at optimal operating point
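The two operating points above could be derived along the following lines. The notebook computes `thr_alert` and `thr_cost` in an earlier section not shown here, so the functions below are a hypothetical reconstruction on toy data, and the costs (1 per false positive, 50 per missed fraud) are made-up placeholders. The code is plain Python for clarity; on the full dataset a vectorized version would be preferable.

```python
def alert_rate_threshold(scores, alert_rate=0.01):
    """Score cutoff such that roughly the top `alert_rate` share is flagged."""
    ranked = sorted(scores, reverse=True)
    k = max(1, round(alert_rate * len(ranked)))  # number of alerts to raise
    return ranked[k - 1]


def total_cost(y_true, scores, thr, cost_fp, cost_fn):
    """Cost of flagging every score >= thr, given per-error costs."""
    fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < thr and y == 1)
    return cost_fp * fp + cost_fn * fn


def cost_based_threshold(y_true, scores, cost_fp=1.0, cost_fn=50.0):
    """Candidate cutoff (an observed score) with the lowest total cost."""
    return min(sorted(set(scores)),
               key=lambda t: total_cost(y_true, scores, t, cost_fp, cost_fn))


# Toy data: ten transactions, two of them fraudulent (label 1).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

thr_alert_demo = alert_rate_threshold(scores, alert_rate=0.2)
thr_cost_demo = cost_based_threshold(y_true, scores)
print(thr_alert_demo, thr_cost_demo)
```

The alert-rate cutoff fixes analyst workload regardless of outcome, while the cost-based cutoff moves with the assumed ratio between false-positive and false-negative costs.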

## 9) Summary

In [None]:
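import pandas as pd
from sklearn.metrics import average_precision_score, confusion_matrix, roc_auc_score

# Baseline metrics on the validation set (scores come from the baseline model).
baseline_auc = roc_auc_score(y_valid, scores)
baseline_prauc = average_precision_score(y_valid, scores)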

def summarize_threshold(name, thr, y_true, scores):
    y_hat = (scores >= thr).astype(int)
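    # labels=[0, 1] keeps the matrix 2x2 even if a threshold yields a single predicted class.
    tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()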
    return {
        "threshold_name": name,
        "threshold": float(thr),
        "TP": int(tp), "FP": int(fp), "FN": int(fn), "TN": int(tn),
        "precision": float(tp/(tp+fp+1e-9)),
        "recall": float(tp/(tp+fn+1e-9)),
        "alert_rate_empirical": float(y_hat.mean()),
    }

ops = pd.DataFrame([
    summarize_threshold("alert_rate_1pct", thr_alert, y_valid, scores),
    summarize_threshold("cost_based", thr_cost, y_valid, scores),
])

summary = pd.DataFrame([{
    "rows_total": int(rows),
    "rows_train": int(X_train.shape[0]),
    "rows_valid": int(X_valid.shape[0]),
    "num_features": int(X_train.shape[1]),
    "fraud_rate_%": float(fraud_rate*100),
    "baseline_ROC_AUC": float(baseline_auc),
    "baseline_PR_AUC": float(baseline_prauc),
    "mlp_epochs": 2,
    "cpu_time_sec": float(cpu_t),
    "cpu_auc": float(cpu_auc),
    "gpu_time_sec": (float(gpu_t) if gpu_t is not None else None),
    "gpu_auc": (float(gpu_auc) if gpu_auc is not None else None),
    "speedup_cpu_over_gpu": (float(cpu_t/gpu_t) if gpu_t is not None else None),
}])

display(summary)
display(ops)