# 02 – Feature Engineering (Fraud Detection)

**Project:** Secure AI Fraud Detection Pipeline  
**Purpose:** Robust feature pipeline for fraud detection, built on privacy-by-design principles.

**Outputs of this notebook**
- Time, amount, frequency, and context features
- Scaler/encoder pipeline (`models/feature_pipeline.pkl`)
- Feature names (`models/feature_names.json`)
- Preprocessed data (`data/processed/features.parquet`)
- Config (`models/feature_config.json`)

> Notes:  
> - The notebook reads `data/processed/fraud_cleaned.csv` if it exists, otherwise `data/raw/fraud_simulated.csv`.  
> - If both are missing, a **synthetic demo dataset** is generated with a fixed seed, so runs are reproducible.
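
The feature groups listed under the outputs above can be illustrated with a minimal sketch. The helper `add_basic_features` and its output columns (`hour`, `day_of_week`, `is_night`, `amount_log`, `user_prior_tx_count`) are hypothetical names chosen for illustration, not necessarily the features this notebook builds. The sketch only assumes the columns `timestamp`, `amount`, and `user_id`. Context columns such as `country` or `channel` would typically go to a one-hot encoder and are left out here.

```python
import numpy as np
import pandas as pd


def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: derive a few time, amount, and frequency features."""
    # Sort per user and time so that cumulative counts follow chronological order.
    out = df.sort_values(["user_id", "timestamp"]).copy()

    # Time features
    out["hour"] = out["timestamp"].dt.hour
    out["day_of_week"] = out["timestamp"].dt.dayofweek  # Monday = 0
    out["is_night"] = out["hour"].between(0, 5).astype(int)

    # Amount feature: log1p dampens the long right tail of transaction amounts.
    out["amount_log"] = np.log1p(out["amount"])

    # Frequency feature: number of earlier transactions by the same user.
    out["user_prior_tx_count"] = out.groupby("user_id").cumcount()

    # Restore the original row order.
    return out.sort_index()


demo = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-01 03:00", "2024-01-01 14:30", "2024-01-02 09:15"]
    ),
    "amount": [0.0, 99.0, 20.0],
    "user_id": [1, 1, 2],
})
features = add_basic_features(demo)
print(features)
```

Counting only earlier transactions (rather than all transactions of a user) keeps the frequency feature free of information from the future, which matters once the data is split by time.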


## Block 2 – Imports & Project Paths


In [10]:
# Imports and paths
import os
import json
import warnings
import joblib
import pandas as pd
import numpy as np
from pathlib import Path
from datetime import datetime, timedelta

from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

warnings.filterwarnings("ignore")

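# Resolve the project root: use the cwd if it contains a data/ folder,
# otherwise go up one level when the notebook runs from notebooks/.
# Caveat: the data/ check comes first, so a data/ folder inside notebooks/
# (for example one created by the mkdir calls below) keeps the root at notebooks/.
cwd = Path.cwd()
PROJECT_ROOT = cwd if (cwd / "data").exists() else (cwd.parent if cwd.name == "notebooks" else cwd)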

DATA_PROCESSED = PROJECT_ROOT / "data" / "processed"
DATA_RAW = PROJECT_ROOT / "data" / "raw"
MODELS = PROJECT_ROOT / "models"

DATA_PROCESSED.mkdir(parents=True, exist_ok=True)
DATA_RAW.mkdir(parents=True, exist_ok=True)
MODELS.mkdir(parents=True, exist_ok=True)

CLEAN_PATH = DATA_PROCESSED / "fraud_cleaned.csv"
RAW_PATH = DATA_RAW / "fraud_simulated.csv"

print(f"PROJECT_ROOT = {PROJECT_ROOT}")


PROJECT_ROOT = C:\Users\admin\Desktop\AI Sec Project\GitHub\secure-ai-fraud-detection-pipeline\notebooks


## Block 2a – Diagnosis & Root Fix (when CSV files are "not found")
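
One way to diagnose "not found" CSV files is to resolve the project root from a repository marker instead of the current working directory, and then report which input files exist below it. The following is a minimal sketch under assumptions: the helpers `find_project_root` and `describe_paths` are hypothetical, and the `.git` marker assumes the project is a Git checkout. The demo runs in a temporary directory, so it does not touch the real project.

```python
import tempfile
from pathlib import Path


def find_project_root(start, markers=(".git",)):
    """Hypothetical helper: walk upwards from `start` until a marker is found."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    # No marker found: fall back to the "one level above notebooks/" rule.
    return start.parent if start.name == "notebooks" else start


def describe_paths(root):
    """Report which of the notebook's two input files exist below `root`."""
    candidates = {
        "cleaned": root / "data" / "processed" / "fraud_cleaned.csv",
        "raw": root / "data" / "raw" / "fraud_simulated.csv",
    }
    return {name: path.exists() for name, path in candidates.items()}


# Demo in a throwaway directory tree that mimics the repository layout.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp).resolve() / "repo"
    (repo / ".git").mkdir(parents=True)
    (repo / "notebooks" / "data").mkdir(parents=True)  # stray folder from an earlier run
    (repo / "data" / "raw").mkdir(parents=True)
    (repo / "data" / "raw" / "fraud_simulated.csv").write_text("amount\n1.0\n")

    found_root = find_project_root(repo / "notebooks")
    found_inputs = describe_paths(found_root)
    root_is_repo = found_root == repo

print(root_is_repo, found_inputs)
```

Because the marker lookup ignores `data/` folders, a stray `notebooks/data` directory no longer changes which root is chosen.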


## Block 3 – Load Data (with Fallback & Synthetic Demo Dataset)


In [13]:
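def _generate_synthetic(n=5000, seed=42):
    """Create a reproducible synthetic transaction set and save it to RAW_PATH."""
    rng = np.random.default_rng(seed)
    # Random timestamps within the 30 days after 2024-01-01, at minute resolution
    start = datetime(2024, 1, 1)
    ts = [start + timedelta(minutes=int(x)) for x in rng.integers(0, 60*24*30, size=n)]
    # Right-skewed amounts (gamma distribution, mean 100), rounded to two decimals
    amount = np.round(rng.gamma(shape=2.0, scale=50.0, size=n), 2)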
    user_id = rng.integers(1000, 2000, size=n)
    country = rng.choice(
        ["DE","AT","CH","FR","IT","ES","NL","PL","US","GB"],
        size=n,
        p=[.22,.08,.05,.12,.08,.08,.08,.09,.1,.1]
    )
    channel = rng.choice(["app","web","pos"], size=n, p=[.4,.4,.2])
    merchant_category = rng.choice(["grocery","electronics","travel","gaming","fashion","other"], size=n)

    # Simulate the fraud label: 2% base rate plus additive risk surcharges
    fraud_prob = (
        0.02
        + 0.03 * np.isin(country, ["US", "GB"])
        + 0.02 * (channel == "web")
        + 0.04 * (merchant_category == "gaming")
        + 0.03 * (amount > 300)
    )
    fraud = (rng.random(size=n) < fraud_prob).astype(int)

    df = pd.DataFrame({
        "timestamp": ts,
        "amount": amount,
        "user_id": user_id,
        "country": country,
        "channel": channel,
        "merchant_category": merchant_category,
        "is_fraud": fraud,
    })
    # Persist the frame so that later runs load the same data from RAW_PATH
    RAW_PATH.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(RAW_PATH, index=False)
    return df


def load_data():
    """Load the cleaned data, fall back to the raw data, else generate synthetic data."""
    if CLEAN_PATH.exists():
        path = CLEAN_PATH
    elif RAW_PATH.exists():
        path = RAW_PATH
    else:
        print("Neither cleaned nor raw data found – generating synthetic demo dataset…")
        return _generate_synthetic()
    print(f"Loading data from: {path}")
    return pd.read_csv(path)


df = load_data()

# Convert the timestamp column; unparseable values become NaT
if "timestamp" in df.columns:
    df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
    n_bad = int(df["timestamp"].isna().sum())
    if n_bad:
        print(f"Warning: {n_bad} timestamps could not be parsed (set to NaT)")

print(df.head())
print(df.dtypes)


Neither cleaned nor raw data found – generating synthetic demo dataset…
            timestamp  amount  user_id country channel merchant_category  \
0 2024-01-03 16:15:00   61.34     1508      US     app            gaming   
1 2024-01-24 05:14:00  143.48     1504      DE     web           grocery   
2 2024-01-20 15:17:00   53.35     1993      DE     app           grocery   
3 2024-01-14 03:59:00  123.37     1599      GB     pos            travel   
4 2024-01-13 23:46:00   59.54     1597      DE     web            gaming   

   is_fraud  
0         0  
1         0  
2         0  
3         0  
4         0  
timestamp            datetime64[ns]
amount                      float64
user_id                       int64
country                      object
channel                      object
merchant_category            object
is_fraud                      int64
dtype: object
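
Before building features, a quick sanity check on the loaded frame can catch schema drift between the cleaned, raw, and synthetic sources. The following is a minimal sketch: `summarize_transactions` is a hypothetical helper, and `EXPECTED_COLUMNS` simply mirrors the columns printed above. The demo frame is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Columns as shown in the dtypes output of the loading cell.
EXPECTED_COLUMNS = [
    "timestamp", "amount", "user_id", "country",
    "channel", "merchant_category", "is_fraud",
]


def summarize_transactions(df: pd.DataFrame) -> dict:
    """Hypothetical helper: validate the schema and return basic health metrics."""
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return {
        "rows": int(len(df)),
        "fraud_rate": float(df["is_fraud"].mean()),
        "missing_timestamps": int(df["timestamp"].isna().sum()),
        "negative_amounts": int((df["amount"] < 0).sum()),
    }


# Small made-up frame; the second timestamp is deliberately missing.
demo = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-03 16:15", None, "2024-01-20 15:17", "2024-01-14 03:59"]
    ),
    "amount": [61.34, 143.48, 53.35, 123.37],
    "user_id": [1508, 1504, 1993, 1599],
    "country": ["US", "DE", "DE", "GB"],
    "channel": ["app", "web", "app", "pos"],
    "merchant_category": ["gaming", "grocery", "grocery", "travel"],
    "is_fraud": [0, 1, 0, 0],
})
summary = summarize_transactions(demo)
print(summary)
```

In the notebook, the same call would be `summarize_transactions(df)` right after the timestamp conversion.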
