
# 🛡️ SentinelPOS — AI/ML SIEM for Real‑Time POS Fraud

This notebook is a **professional demo** of the project built for the ESTEEC Olympics Hackathon.

**What you'll see:**
- Architecture overview of our **end-to-end SIEM**
- Load & explore a small sample of the POS dataset
- **History-aware feature engineering** (velocity, distance, novelty)
- Model **training & evaluation** (80/20 split)
- **Precision–Recall** curve and **threshold selection**
- Save artifacts for the real-time API: `model.pkl`, `features.json`, `threshold.json`
- (Optional) **Live prediction** call to our FastAPI endpoint

> Tip: For a smooth live run, keep the dataset sample at **50,000 rows** or fewer, and train the full model offline.



## 🧩 System Architecture

```
           ┌────────────────────────────┐
           │  Hackathon Stream (SSE)    │
           └────────────┬───────────────┘
                        │ events
                        ▼
               sse_to_predict.py
        ┌───────────────┼────────────────┐
        │               │                │
        ▼               ▼                ▼
 FastAPI /predict   /api/flag (judge)   Next.js /api/stream
   (SQLite + ML)     ← response rate      (SSE dashboard)
        │
        ▼
  model.pkl + features.json + threshold.json
```

- **Collector:** `sse_to_predict.py`
- **Analyzer:** FastAPI + scikit-learn model
- **Feature Store:** SQLite (user history → behavioral features)
- **Alerting:** POST to `/api/flag` with decision within 30s
- **Visualization:** Next.js SSE dashboard (live alerts & charts)
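
As a rough illustration of the collector stage, the sketch below parses `data:` lines from a Server-Sent Events stream into JSON events. The helper name `iter_sse_events` and the sample payload are assumptions made for this example; the actual `sse_to_predict.py` is not shown in this notebook and may work differently.

```python
import json

def iter_sse_events(lines):
    """Yield one decoded JSON object per SSE event (hypothetical helper)."""
    buf = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            # Drop the field name and at most one leading space.
            value = line[5:]
            if value.startswith(" "):
                value = value[1:]
            buf.append(value)
        elif line == "" and buf:
            # A blank line ends the event; multi-line data is joined with "\n".
            try:
                yield json.loads("\n".join(buf))
            except json.JSONDecodeError:
                pass  # skip malformed payloads instead of stopping the stream
            buf = []
        # Comment lines (":...") and other fields ("event:", "id:") are ignored.

sample = [
    ": keep-alive\n",
    'data: {"transaction_id": "demo-1", "amt": 39.64}\n',
    "\n",
    "data: not-json\n",
    "\n",
]
events = list(iter_sse_events(sample))
print(events)
```

In a real collector, each yielded event would presumably be forwarded to `/predict` and the decision posted to `/api/flag`; that wiring is left out here.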


In [3]:

# %% Imports & configuration
import os, json, math, sqlite3, statistics, datetime as dt
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (classification_report, confusion_matrix,
                             precision_recall_curve, average_precision_score,
                             roc_auc_score)

plt.rcParams["figure.figsize"] = (6, 4)
plt.rcParams["axes.grid"] = True

DATA_PATH = os.getenv("DATA_PATH", "../hackathon_train.csv")   # pipe-delimited dataset
MODEL_PATH = "model.pkl"
FEATURES_PATH = "features.json"
THRESHOLD_PATH = "threshold.json"

NB_DEMO_ROWS = int(os.getenv("NB_DEMO_ROWS", "50000"))



## 1) Load a Sample of the Dataset

We load a manageable slice and **sort by `unix_time`** to preserve causality, so that each history feature is computed only from earlier transactions.


In [4]:

# %% Load sample
df = pd.read_csv(DATA_PATH, sep="|", nrows=NB_DEMO_ROWS)

for col in ["amt","lat","long","merch_lat","merch_long","city_pop","unix_time"]:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")

if "unix_time" in df.columns:
    df = df.sort_values("unix_time").reset_index(drop=True)

print("Rows loaded:", len(df))
df.head(3)


Rows loaded: 50000


Unnamed: 0,ssn,cc_num,first,last,gender,street,city,state,zip,lat,...,trans_date,trans_time,unix_time,category,amt,is_fraud,merchant,merch_lat,merch_long,transaction_id
0,576-07-9241,180059854173232,Edward,Moore,M,86175 Barnes Circles,Lake Wales,FL,33898,27.8643,...,2025-09-26,00:00:00,1758834000,gas_transport,39.64,0,"fraud_Reilly, Heaney and Cole",28.548351,-81.596743,d9e602e3-a45d-45d1-bad2-c1534ed6c4f6
1,727-46-9662,347151813192013,Christopher,Williams,M,5073 Gill Dale,Torrance,CA,90501,33.8268,...,2025-09-26,00:00:01,1758834001,gas_transport,65.55,0,"fraud_Zieme, Bode and Dooley",34.241997,-118.355662,0eb36227-59d1-435b-a446-c3d4dde749ca
2,684-83-0896,375537948969934,Aaron,Patton,M,68081 Ferrell Station,Chaparral,NM,88081,32.2239,...,2025-09-26,00:00:01,1758834001,gas_transport,34.12,0,"fraud_Greenholt, Jacobi and Gleason",33.189298,-105.553728,32a08dd2-f4aa-43df-98fa-aa964f267160



## 2) History‑Aware Feature Engineering

We compute features that reflect **user behavior and context**:
- **Velocity**: transactions in the last 60s / 5m / 15m / 1h
- **Uniqueness**: unique merchants & categories in the last 15m
- **Novelty**: whether the card has used this merchant in the past 24h
- **Temporal**: hour, day of week, night flag
- **Distances**: previous merchant → current merchant (stored as `user_merchant_dist_km`); the user → merchant distance is computed but not stored
- **User profile**: mean/std of spend in the last 24h, z‑score of the current amount
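
To make the velocity and z‑score features concrete, here is a minimal standard-library sketch on toy data. The function names `velocity` and `amount_z` are hypothetical and the numbers are made up; the notebook's own implementation uses NumPy arrays per card, but the counting rule (only earlier transactions, inside the window) is intended to be the same.

```python
from bisect import bisect_left
import statistics

def velocity(times, i, window_s):
    """Count earlier transactions of one card within `window_s` seconds of times[i].

    `times` must be sorted ascending; only indices < i (the past) are counted.
    """
    lo = bisect_left(times, times[i] - window_s, 0, i)
    return i - lo

def amount_z(past_amts, amt):
    """z-score of `amt` against past amounts; 0.0 when it is undefined."""
    if not past_amts:
        return 0.0
    mean = statistics.fmean(past_amts)
    std = statistics.pstdev(past_amts)  # population std, i.e. ddof=0
    return (amt - mean) / std if std > 0 else 0.0

times = [0, 30, 50, 100, 400]            # one card's unix times (toy data)
amts = [10.0, 12.0, 8.0, 10.0, 100.0]

v60 = velocity(times, 3, 60)             # only t=50 falls in [40, 100)
v5m = velocity(times, 3, 300)            # t=0, 30 and 50 all fall in the window
z = amount_z(amts[:4], amts[4])          # large positive z for the 100.0 spend
print(v60, v5m, round(z, 2))
```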


In [None]:

# %% Feature engineering helpers
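def _utc(u):
    """Convert a unix timestamp to a timezone-aware UTC datetime.

    Replaces datetime.utcfromtimestamp, which is deprecated since Python 3.12.
    """
    return dt.datetime.fromtimestamp(int(u), tz=dt.timezone.utc)

def fast_age(dob_str, at_unix):
    """Age in whole years at transaction time; -1 if the DOB cannot be parsed."""
    try:
        y, m, d = map(int, str(dob_str)[:10].split("-"))
        t = _utc(at_unix)
        return max(0, t.year - y - ((t.month, t.day) < (m, d)))
    except Exception:
        return -1

def hour_from_unix(u):
    """Hour of day (UTC, 0-23); -1 on invalid input."""
    try:
        return _utc(u).hour
    except Exception:
        return -1

def dow_from_unix(u):
    """Day of week (UTC, Monday=0); -1 on invalid input."""
    try:
        return _utc(u).weekday()
    except Exception:
        return -1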

def is_night(h):
    return 1 if (h <= 6 or h >= 22) else 0

def haversine_km(lat1, lon1, lat2, lon2):
    try:
        R = 6371.0
        phi1 = math.radians(float(lat1)); phi2 = math.radians(float(lat2))
        dphi = math.radians(float(lat2) - float(lat1))
        dlmb = math.radians(float(lon2) - float(lon1))
        a = math.sin(dphi/2)**2 + math.cos(phi1)*math.cos(phi2)*math.sin(dlmb/2)**2
        return 2 * R * math.asin(math.sqrt(a))
    except Exception:
        return np.nan

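def add_base_columns(df):
    """Add per-row features that need no transaction history."""
    df = df.copy()
    # Fall back to empty strings when the optional "dob" column is absent
    # (df.get("dob", "") would return a plain str and break zip/astype).
    dob = df["dob"] if "dob" in df.columns else pd.Series("", index=df.index)
    df["age"] = [fast_age(d, u) for d, u in zip(dob, df["unix_time"])]
    df["hour"] = [hour_from_unix(u) for u in df["unix_time"]]
    df["dow"] = [dow_from_unix(u) for u in df["unix_time"]]
    df["is_night"] = df["hour"].apply(is_night)
    df["log_amt"] = np.log1p(df["amt"].fillna(0.0))
    for col in ["lat","long","merch_lat","merch_long","city_pop"]:
        if col not in df.columns:
            df[col] = 0.0
        df[col] = df[col].fillna(0.0)
    # One-hot encode gender; unknown or missing values leave both flags at 0.
    gender = df["gender"] if "gender" in df.columns else pd.Series("", index=df.index)
    g = gender.astype(str).str.upper()
    df["gender_M"] = (g == "M").astype(int)
    df["gender_F"] = (g == "F").astype(int)
    return df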

def add_history_features(df):
    df = df.copy()
    cols = ["velocity_60s","velocity_5m","velocity_15m","velocity_1h",
            "unique_merchants_15m","unique_categories_15m",
            "seen_merchant_before","user_merchant_dist_km",
            "time_since_last_s","time_since_last_merchant_s",
            "user_mean_amt_24h","user_std_amt_24h","user_amt_delta","amt_z_user"]
    for c in cols: df[c] = 0.0

    for cc, g in df.groupby("cc_num", sort=False):
        idx = g.index.values
        times = g["unix_time"].values.astype(np.int64)
        amts = g["amt"].fillna(0.0).values.astype(float)
        merch = g["merchant"].astype(str).values
        # Convert categories once per card instead of on every loop iteration.
        cats = g["category"].astype(str).values

        lat_u = g["lat"].values; lon_u = g["long"].values
        lat_m = g["merch_lat"].values; lon_m = g["merch_long"].values

        start = 0
        for i in range(len(g)):
            t = times[i]
            # Slide the window start so that `past` covers the previous 24h only.
            while start < i and times[start] < t - 24*3600:
                start += 1
            past = slice(start, i)

            v60  = np.sum(times[past] >= t - 60)
            v5   = np.sum(times[past] >= t - 5*60)
            v15  = np.sum(times[past] >= t - 15*60)
            v1h  = np.sum(times[past] >= t - 60*60)

            mask15 = times[past] >= t - 15*60
            uniq_merch = len(set(merch[past][mask15]))
            uniq_cat   = len(set(cats[past][mask15]))

            seen_before = 1 if (merch[i] in set(merch[past])) else 0

            tsl = (t - times[i-1]) if i > 0 else 10**9

            prev_same = None
            for j in range(i-1, start-1, -1):
                if merch[j] == merch[i]:
                    prev_same = times[j]; break
            tslm = (t - prev_same) if prev_same is not None else 10**9

            if i > 0:
                mlat_prev = lat_m[i-1] if not np.isnan(lat_m[i-1]) else lat_u[i-1]
                mlon_prev = lon_m[i-1] if not np.isnan(lon_m[i-1]) else lon_u[i-1]
                dist_prev_to_now = haversine_km(mlat_prev, mlon_prev,
                                                lat_m[i] if not np.isnan(lat_m[i]) else lat_u[i],
                                                lon_m[i] if not np.isnan(lon_m[i]) else lon_u[i])
            else:
                dist_prev_to_now = 0.0

            dist_user_to_merchant = haversine_km(lat_u[i], lon_u[i], lat_m[i], lon_m[i])
            if np.isnan(dist_user_to_merchant): dist_user_to_merchant = 0.0

            past_amts = amts[past]
            if past_amts.size:
                mean24 = float(past_amts.mean())
                std24  = float(past_amts.std(ddof=0))
            else:
                mean24, std24 = 0.0, 0.0
            delta = float(amts[i] - mean24)
            z = float(delta / std24) if std24 > 0 else 0.0

            df.loc[idx[i], ["velocity_60s","velocity_5m","velocity_15m","velocity_1h"]] = [v60, v5, v15, v1h]
            df.loc[idx[i], ["unique_merchants_15m","unique_categories_15m"]] = [uniq_merch, uniq_cat]
            df.loc[idx[i], "seen_merchant_before"] = float(seen_before)
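            # Despite its name, this column stores the previous-merchant → current-merchant
            # distance; dist_user_to_merchant is computed above but not used.
            df.loc[idx[i], "user_merchant_dist_km"] = float(0.0 if np.isnan(dist_prev_to_now) else dist_prev_to_now)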
            df.loc[idx[i], "time_since_last_s"] = float(tsl)
            df.loc[idx[i], "time_since_last_merchant_s"] = float(tslm)
            df.loc[idx[i], ["user_mean_amt_24h","user_std_amt_24h","user_amt_delta","amt_z_user"]] = [mean24, std24, delta, z]
    return df

FEATURE_ORDER = [
    "age","log_amt","hour","dow","is_night",
    "city_pop","lat","long","merch_lat","merch_long",
    "velocity_60s","velocity_5m","velocity_15m","velocity_1h",
    "unique_merchants_15m","unique_categories_15m",
    "seen_merchant_before","user_merchant_dist_km",
    "time_since_last_s","time_since_last_merchant_s",
    "user_mean_amt_24h","user_std_amt_24h","user_amt_delta","amt_z_user",
    "gender_M","gender_F"
]
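
A quick sanity check for the great-circle distance helper: with an Earth radius of 6371 km, one degree of latitude along a meridian should come out at roughly 111.19 km. The block below restates the haversine formula so that it runs on its own; the `radius_km` parameter and the test coordinates are choices made for this example.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)                 # along a meridian
same_point = haversine_km(27.8643, -81.0, 27.8643, -81.0)     # identical points
print(round(one_degree, 2), same_point)
```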



### Build the feature matrix


In [None]:

# %% Build features
base = add_base_columns(df)
feat_df = add_history_features(base)

if "is_fraud" not in feat_df.columns:
    raise ValueError("Column 'is_fraud' is required in the dataset.")

y = feat_df["is_fraud"].astype(str).str.lower().map(
    {"1":1,"0":0,"true":1,"false":0,"t":1,"f":0,"yes":1,"no":0}
).fillna(0).astype(int)

X = feat_df[FEATURE_ORDER].fillna(0.0).astype(float).values
X.shape, y.shape



## 3) Train / Evaluate (80/20 split)
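We bias toward **recall** by giving fraud samples 5× the weight of legitimate ones (fraud is rare). Note that `train_test_split` shuffles rows by default, so this is a stratified random split rather than a time-ordered one.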
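
For context, the fixed `{0: 1.0, 1: 5.0}` weights in the training cell can be compared with what scikit-learn's `class_weight="balanced"` heuristic would produce, namely `n_samples / (n_classes * count_per_class)`. The sketch below computes that formula in plain Python; the label counts are made up and are not taken from the hackathon dataset.

```python
from collections import Counter

def balanced_class_weights(labels):
    """n_samples / (n_classes * count) per class, the 'balanced' heuristic."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

labels = [0] * 990 + [1] * 10           # toy labels: 1% fraud (made up)
weights = balanced_class_weights(labels)
ratio = weights[1] / weights[0]         # relative emphasis on the fraud class
print(weights, ratio)
```

On labels this imbalanced, the balanced heuristic weights fraud far more heavily than 5×. A smaller hand-picked weight is a milder push toward recall, which typically means fewer false alerts.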


In [None]:

# %% Split & train
strat = y if (y.sum() and y.sum() != len(y)) else None
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=strat
)

model = RandomForestClassifier(
    n_estimators=160,
    max_depth=None,
    min_samples_leaf=1,
    n_jobs=-1,
    random_state=42,
    class_weight={0:1.0, 1:5.0}
)
model.fit(X_train, y_train)

y_prob = model.predict_proba(X_test)[:,1]
y_pred_50 = (y_prob >= 0.50).astype(int)

print("=== Metrics @ 0.50 threshold ===")
print(classification_report(y_test, y_pred_50, digits=4))
print("Confusion matrix @ 0.50:")
print(confusion_matrix(y_test, y_pred_50))

print("ROC-AUC:", roc_auc_score(y_test, y_prob))
print("PR-AUC:", average_precision_score(y_test, y_prob))

prec, rec, thr = precision_recall_curve(y_test, y_prob)
f1 = (2*prec*rec) / (prec + rec + 1e-12)
best_idx = int(np.argmax(f1))
best_thr = float(thr[best_idx]) if best_idx < len(thr) else 0.5
print(f"Suggested threshold (max F1): {best_thr:.3f}")
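
The threshold search above can be illustrated without scikit-learn. The sketch below tries every distinct score as a cutoff and keeps the one with the highest F1, using F1 = 2·TP / (2·TP + FP + FN). The function name `best_f1_threshold` and the toy scores are assumptions for illustration. It is a brute-force version meant for reading, not a replacement for `precision_recall_curve`.

```python
def best_f1_threshold(y_true, scores):
    """Return (threshold, f1) maximizing F1 for the rule `score >= threshold`."""
    best_thr, best_f1 = 0.5, 0.0        # fall back to 0.5 if nothing scores above 0
    for thr in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        fn = sum(1 for y, s in zip(y_true, scores) if s < thr and y == 1)
        denom = 2 * tp + fp + fn
        f1 = (2 * tp / denom) if denom else 0.0
        if f1 > best_f1:
            best_thr, best_f1 = thr, f1
    return best_thr, best_f1

y_true = [0, 0, 1, 0, 1, 1]
scores = [0.05, 0.20, 0.35, 0.40, 0.70, 0.90]   # toy fraud probabilities
thr, f1 = best_f1_threshold(y_true, scores)
print(thr, round(f1, 3))
```

On this toy data the cutoff 0.35 catches all three frauds at the cost of one false positive, which beats the stricter cutoff 0.70 that misses one fraud.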



### Precision–Recall curve


In [None]:

# %% Plot PR curve
plt.figure()
plt.plot(rec, prec, linewidth=2)
if 0 <= best_idx < len(rec):
    plt.scatter(rec[best_idx], prec[best_idx], s=40)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision–Recall Curve")
plt.show()



### Metrics at the suggested threshold


In [None]:

# %% Evaluate at best threshold
y_pred_best = (y_prob >= best_thr).astype(int)
print("=== Metrics @ best threshold ===")
print(classification_report(y_test, y_pred_best, digits=4))
print("Confusion matrix @ best threshold:")
print(confusion_matrix(y_test, y_pred_best))



## 4) Save Artifacts (for API + Streamer)


In [None]:

# %% Save model + features + threshold
import joblib, json

joblib.dump(model, MODEL_PATH)
with open(FEATURES_PATH, "w", encoding="utf-8") as f:
    json.dump(FEATURE_ORDER, f, indent=2)
with open(THRESHOLD_PATH, "w", encoding="utf-8") as f:
    json.dump({"threshold": float(best_thr)}, f, indent=2)

print("Saved:", MODEL_PATH, FEATURES_PATH, THRESHOLD_PATH)
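
A consumer of these artifacts needs to rebuild feature vectors in exactly the saved order and apply the saved threshold. The sketch below shows one way that could look, assuming the file layouts written above (a JSON list of names and a `{"threshold": ...}` object). The helper names `load_scoring_config` and `to_vector` are hypothetical, and the demo writes throwaway files so that it runs without the real artifacts or the model.

```python
import json
import os
import tempfile

def load_scoring_config(features_path, threshold_path):
    """Read the feature order and decision threshold saved by the notebook."""
    with open(features_path, encoding="utf-8") as f:
        feature_order = json.load(f)
    with open(threshold_path, encoding="utf-8") as f:
        threshold = float(json.load(f)["threshold"])
    return feature_order, threshold

def to_vector(features, feature_order):
    """Order a feature dict like the training matrix; missing values become 0.0."""
    return [float(features.get(name, 0.0)) for name in feature_order]

# Demo with temporary stand-in files (made-up contents).
with tempfile.TemporaryDirectory() as tmp:
    fp = os.path.join(tmp, "features.json")
    tp = os.path.join(tmp, "threshold.json")
    with open(fp, "w", encoding="utf-8") as f:
        json.dump(["log_amt", "hour", "velocity_60s"], f)
    with open(tp, "w", encoding="utf-8") as f:
        json.dump({"threshold": 0.42}, f)
    order, thr = load_scoring_config(fp, tp)

vec = to_vector({"hour": 23, "log_amt": 3.7}, order)
flagged = 0.61 >= thr   # 0.61 stands in for model.predict_proba(...)[0, 1]
print(vec, flagged)
```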



## 5) (Optional) Live Predict via FastAPI

Start your API in a terminal:

```bash
uvicorn fraud_api:app --host 127.0.0.1 --port 8000
```

Then run this cell to score a sample transaction.
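
Rows taken from a DataFrame can contain `NaN` values, which are not valid in strict JSON. One option is to sanitize the record before posting it. `json_safe` below is a hypothetical helper, not part of the project, and whether the API accepts `null` for a missing field is an assumption to verify against `fraud_api`.

```python
import json
import math

def json_safe(record):
    """Replace NaN/inf floats with None so the payload is strict JSON."""
    clean = {}
    for key, value in record.items():
        if isinstance(value, float) and not math.isfinite(value):
            clean[key] = None
        else:
            clean[key] = value
    return clean

tx = {"amt": 39.64, "merch_lat": float("nan"), "category": "gas_transport"}
# allow_nan=False makes json.dumps raise if a NaN slipped through.
payload = json.dumps(json_safe(tx), allow_nan=False)
print(payload)
```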


In [None]:

# %% Live predict demo (optional)
import requests

FASTAPI_URL = os.getenv("FASTAPI_URL", "http://127.0.0.1:8000/predict?store=0")

try:
    sample_tx = df.iloc[len(df)//2].to_dict()
    r = requests.post(FASTAPI_URL, json=sample_tx, timeout=5)
    print("HTTP", r.status_code)
    print(r.text[:600])
    try:
        print(json.dumps(r.json(), indent=2))
    except Exception:
        pass
except Exception as e:
    print("Live predict skipped or failed:", e)



## 6) Feature Importance (Explainability)


In [None]:

# %% Plot top feature importances
import numpy as np

importances = model.feature_importances_
order = np.argsort(importances)[::-1][:15]
names = [FEATURE_ORDER[i] for i in order]
vals = importances[order]

plt.figure()
plt.bar(range(len(vals)), vals)
plt.xticks(range(len(vals)), names, rotation=45, ha="right")
plt.title("Top Feature Importances (RandomForest)")
plt.tight_layout()
plt.show()



## 🏁 Conclusion

- **Training:** uses history-aware signals (velocity, novelty, distance, temporal) to predict fraud.
- **Thresholding:** selects an operating point from the **Precision–Recall** curve to balance recall vs precision.
- **Artifacts:** saved to disk so the FastAPI + streamer can operate in real time.
- **Dashboard:** Next.js SSE shows live alerts and trends during the hackathon.

**Next ideas**
- Add anomaly scores (e.g., IsolationForest)
- Geo heatmap view of frauds
- Nightly re-training job with fresh labels
