# 01 – Feature Engineering for Fraud Detection

**Module:** Anomaly & Fraud Detection  
**Folder:** Fraud Detection Workflows

This notebook focuses on **domain-informed feature engineering** for fraud detection,
where signal creation is often more important than model choice.

## Objective

Build a reusable feature engineering workflow that:
- Extracts behavioral signals from transactional data
- Encodes velocity, frequency, and monetary patterns
- Is leakage-safe and time-aware
- Produces high-signal inputs for fraud models

## Design Principles

✔ Behavior-first feature design  
✔ Time-aware aggregations  
✔ Leakage prevention  
✔ Reusable and pipeline-ready

## Imports and Setup


In [13]:
import numpy as np
import pandas as pd

np.random.seed(2010)


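## Simulated Transaction Dataset

The dataset below simulates 5,000 transactions, one per minute starting 2023-01-01, spread across user IDs 1 to 499, with exponentially distributed amounts (mean 50).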

In [16]:
n_rows = 5000

df = pd.DataFrame({
    "transaction_id": range(n_rows),
    "user_id": np.random.randint(1, 500, size=n_rows),
    "amount": np.random.exponential(scale=50, size=n_rows),
    "timestamp": pd.date_range("2023-01-01", periods=n_rows, freq="min")
})

df = df.sort_values("timestamp")

## Basic Transaction-Level Features

In [19]:
df["log_amount"] = np.log1p(df["amount"])

df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek

## Velocity Features (Time-Aware)

In [22]:
df = df.set_index("timestamp")

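# Transactions per user in the trailing 1h / 24h windows.
# Time-based windows are right-closed by default, so each count includes the current transaction.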
df["tx_count_1h"] = (
    df.groupby("user_id")["transaction_id"]
      .rolling("1h").count()
      .reset_index(level=0, drop=True)
)

df["tx_count_24h"] = (
    df.groupby("user_id")["transaction_id"]
      .rolling("24h").count()
      .reset_index(level=0, drop=True)
)
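To make the window semantics concrete, here is a minimal sketch on a hypothetical four-row frame (`toy` and the `tx_count_1h_prior` column are illustrative names, not part of the notebook). It contrasts the default right-closed window with `closed="left"`, which counts only strictly earlier transactions. It assumes unique timestamps, as in the simulated data.

```python
import pandas as pd

# Toy data (hypothetical): user 1 has three transactions, user 2 has one.
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2],
        "transaction_id": [10, 11, 12, 13],
    },
    index=pd.to_datetime(
        [
            "2023-01-01 00:00",
            "2023-01-01 00:30",
            "2023-01-01 02:00",
            "2023-01-01 00:45",
        ]
    ),
).sort_index()

# Default window (t - 1h, t]: the current transaction is counted.
count_incl = (
    toy.groupby("user_id")["transaction_id"]
       .rolling("1h").count()
       .reset_index(level=0, drop=True)
)

# closed="left" gives [t - 1h, t): only strictly earlier transactions.
# Empty windows can come back as NaN, so treat them as zero prior transactions.
count_prior = (
    toy.groupby("user_id")["transaction_id"]
       .rolling("1h", closed="left").count()
       .reset_index(level=0, drop=True)
       .fillna(0)
)

toy["tx_count_1h"] = count_incl
toy["tx_count_1h_prior"] = count_prior
print(toy)
```

Whether the current transaction belongs in the window is a design choice: including it is safe when the transaction's own fields are known at scoring time, while the left-closed variant describes only the history that preceded it.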

## Monetary Aggregations

In [25]:
df["amount_sum_1h"] = (
    df.groupby("user_id")["amount"]
      .rolling("1h").sum()
      .reset_index(level=0, drop=True)
)

df["amount_mean_24h"] = (
    df.groupby("user_id")["amount"]
      .rolling("24h").mean()
      .reset_index(level=0, drop=True)
)

## Ratio and Normalized Features

In [28]:
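# Caveat: transform("mean") averages each user's entire history, including
# later transactions, so this ratio is not strictly backward-looking.
df["amount_vs_user_mean"] = (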
    df["amount"] /
    df.groupby("user_id")["amount"].transform("mean")
)
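The ratio above divides by each user's mean over the whole dataset, which includes transactions that happen later in time. A strictly backward-looking variant could use an expanding mean shifted by one row, so each transaction is compared only with that user's earlier amounts. The sketch below shows the idea on a hypothetical toy frame; `toy`, `prior_mean`, and `amount_vs_prior_mean` are illustrative names, and rows are assumed to be sorted by time.

```python
import pandas as pd

# Toy data (hypothetical), already in chronological order.
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2],
        "amount": [10.0, 30.0, 20.0, 5.0, 15.0],
    },
    index=pd.date_range("2023-01-01", periods=5, freq="min"),
)

# Per-user mean of earlier amounts only: the expanding mean includes the
# current row, so shift(1) moves it back to exclude that row.
prior_mean = toy.groupby("user_id")["amount"].transform(
    lambda s: s.expanding().mean().shift(1)
)

# A user's first transaction has no history, so its ratio is NaN.
toy["amount_vs_prior_mean"] = toy["amount"] / prior_mean
print(toy)
```

The first transaction of each user yields NaN here. How to handle that (drop, impute, or flag as a new user) depends on the downstream model and is left open.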

## Final Feature Set

In [33]:
FEATURES = [
    "log_amount",
    "hour",
    "day_of_week",
    "tx_count_1h",
    "tx_count_24h",
    "amount_sum_1h",
    "amount_mean_24h",
    "amount_vs_user_mean"
]

feature_df = df[FEATURES].dropna()

feature_df.head()

| timestamp | log_amount | hour | day_of_week | tx_count_1h | tx_count_24h | amount_sum_1h | amount_mean_24h | amount_vs_user_mean |
|---|---|---|---|---|---|---|---|---|
| 2023-01-01 00:00:00 | 1.14672 | 0 | 6 | 1.0 | 1.0 | 2.14785 | 2.14785 | 0.031834 |
| 2023-01-01 00:01:00 | 2.441591 | 0 | 6 | 1.0 | 1.0 | 10.491305 | 10.491305 | 0.240696 |
| 2023-01-01 00:02:00 | 4.590196 | 0 | 6 | 1.0 | 1.0 | 97.513752 | 97.513752 | 1.615487 |
| 2023-01-01 00:03:00 | 4.100415 | 0 | 6 | 1.0 | 1.0 | 59.365337 | 59.365337 | 1.359203 |
| 2023-01-01 00:04:00 | 4.084069 | 0 | 6 | 1.0 | 1.0 | 58.386647 | 58.386647 | 1.52785 |



## Integration Notes

- Fraud signals usually come from behavior patterns, not from raw values alone  
- Well-designed velocity and ratio features often matter more than switching to a more complex model  
- All aggregations must be backward-looking to avoid leakage  
- Feature functions should be reused in training and inference
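The last two points can be sketched together: wrap the rolling aggregations in one function that both training and inference call, then check that features for earlier rows do not change when later rows are appended. The function name `add_rolling_user_features`, its `windows` parameter, and the synthetic frame are hypothetical and shown only for illustration. The sketch assumes unique timestamps and non-null amounts, and counts transactions via the `amount` column.

```python
import numpy as np
import pandas as pd


def add_rolling_user_features(tx, windows=("1h", "24h")):
    """Return a copy of tx with per-user trailing count and sum features.

    Expects columns user_id, amount, timestamp (timestamps assumed unique).
    """
    out = tx.sort_values("timestamp").set_index("timestamp")
    features = {}
    for window in windows:
        rolled = out.groupby("user_id")["amount"].rolling(window)
        features[f"tx_count_{window}"] = rolled.count().reset_index(level=0, drop=True)
        features[f"amount_sum_{window}"] = rolled.sum().reset_index(level=0, drop=True)
    # assign aligns each feature Series on the timestamp index.
    return out.assign(**features).reset_index()


# Synthetic data in the same shape as the notebook's simulated frame.
rng = np.random.default_rng(0)
n = 300
tx = pd.DataFrame(
    {
        "user_id": rng.integers(1, 6, size=n),
        "amount": rng.exponential(scale=50, size=n),
        "timestamp": pd.date_range("2023-01-01", periods=n, freq="min"),
    }
)

# Leakage check: features for the first 200 rows should be the same whether
# or not the 100 later rows are present.
full = add_rolling_user_features(tx)
past_only = add_rolling_user_features(tx.iloc[:200])

feature_cols = [c for c in full.columns if c not in tx.columns]
leak_free = np.allclose(
    full.loc[:199, feature_cols].to_numpy(),
    past_only[feature_cols].to_numpy(),
)
print("backward-looking:", leak_free)
```

A check like this is cheap to run as a unit test. A feature computed from a user's full-history mean would fail it, because appending later rows changes the values for earlier ones.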


## Production Checklist
