# Real-Time Streaming Fraud Detection Pipeline

This notebook demonstrates an end-to-end streaming fraud detection pipeline combining:
- **Stateless features**: Time, amount, merchant, device, network (from `AdvancedFeatureEngineering`)
- **Stateful features**: Transaction velocity, IP tracking, location anomalies (from `FraudDetectionFeaturesProcessor`)

## Architecture

```
Streaming Source (Rate)
    ↓
Generate Transactions (TransactionDataGenerator)
    ↓
Apply Stateless Features (AdvancedFeatureEngineering)
    ↓
Apply Stateful Fraud Detection (transformWithStateInPandas)
    ↓
Write to Lakebase PostgreSQL (foreachBatch)
    ↓
Real-Time Feature Serving (<10ms query latency)
```

## Prerequisites

- Run `00_setup.ipynb` to create feature tables
- Databricks Runtime 17.3+ with Spark 4.0+
- Lakebase PostgreSQL instance provisioned

## Features Generated

**Stateless (~40 columns):**
- Time-based: hour, day, business hours, cyclical encodings
- Amount-based: log, sqrt, categories, z-scores
- Merchant: risk scores
- Location, Device, Network: risk indicators
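
As a rough illustration of the time-based and amount-based items above, the sketch below computes a few of them for a single transaction in plain Python. The function name, the output keys, and the 09:00 to 17:00 weekday definition of business hours are assumptions made for this example; the actual `AdvancedFeatureEngineering` class operates on Spark columns and may define these features differently.

```python
import math
from datetime import datetime


def stateless_features(ts: datetime, amount: float) -> dict:
    """Illustrative per-row features; names and thresholds are hypothetical."""
    # Map the hour onto a circle so that 23:00 and 00:00 end up close together
    hour_angle = 2 * math.pi * ts.hour / 24
    return {
        "hour": ts.hour,
        "day_of_week": ts.weekday(),  # Monday = 0
        # Assumed definition: weekdays, 09:00 (inclusive) to 17:00 (exclusive)
        "is_business_hours": int(ts.weekday() < 5 and 9 <= ts.hour < 17),
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
        # log1p stays finite for a zero amount
        "amount_log": math.log1p(amount),
        "amount_sqrt": math.sqrt(amount),
    }


# 2024-01-01 is a Monday; 06:00 falls outside the assumed business hours
features = stateless_features(datetime(2024, 1, 1, 6, 0), 99.0)
```

Because each value depends only on the current row, features of this kind need no state between micro-batches.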

**Stateful (~25 columns):**
- Velocity: transaction counts in time windows
- IP tracking: IP change detection
- Location anomalies: impossible travel detection
- Amount anomalies: statistical outliers
- Fraud scoring: composite 0-100 score
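
To make the location anomaly item concrete, here is a minimal sketch of one common way to flag impossible travel: compute the great-circle (haversine) distance between two consecutive transactions and divide by the elapsed time. The helper names and the 900 km/h threshold are assumptions for illustration only. The output schema later in this notebook exposes `distance_from_last_km` and `velocity_kmh`, but how `FraudDetectionFeaturesProcessor` actually derives them is not shown here.

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_SPEED_KMH = 900.0  # assumed threshold, roughly airliner cruise speed


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def velocity_kmh(distance_km, seconds_elapsed):
    """Implied travel speed, or None when no time has elapsed."""
    if seconds_elapsed <= 0:
        return None
    return distance_km / (seconds_elapsed / 3600.0)


def is_impossible_travel(distance_km, seconds_elapsed):
    """Flag a pair of transactions whose implied speed exceeds the assumed threshold."""
    speed = velocity_kmh(distance_km, seconds_elapsed)
    return speed is not None and speed > MAX_PLAUSIBLE_SPEED_KMH


# New York to London, with the two transactions one hour apart
ny_to_london_km = haversine_km(40.7128, -74.0060, 51.5074, -0.1278)
flagged = is_impossible_travel(ny_to_london_km, 3600)
```

The check needs the previous transaction's location and timestamp for the same user, which is why it belongs with the stateful features.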


In [0]:
# Import required libraries
# Namespaced imports avoid shadowing Python built-ins such as sum, max, and round
from pyspark.sql import functions as F
from pyspark.sql import types as T
import pandas as pd
import logging

# Import utility modules
from utils.data_generator import TransactionDataGenerator
from utils.feature_engineering import (
    AdvancedFeatureEngineering, 
    FraudDetectionFeaturesProcessor,
    get_fraud_detection_output_schema
)
from utils.lakebase_client import LakebaseClient

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

print("All modules imported successfully")
print(f"Spark version: {spark.version}")

All modules imported successfully
Spark version: 4.0.0


## Step 1: Configuration

Configure Lakebase connection and initialize components.

In [0]:
# Lakebase connection configuration
LAKEBASE_CONFIG = {
    "instance_name": "rtm-lakebase-demo",
    "database": "databricks_postgres"
}

# Initialize components
data_gen = TransactionDataGenerator(spark)
feature_engineer = AdvancedFeatureEngineering(spark)
lakebase = LakebaseClient(**LAKEBASE_CONFIG)

# Test Lakebase connection
if lakebase.test_connection():
    print("Connected to Lakebase PostgreSQL")
else:
    raise ConnectionError("Failed to connect to Lakebase")

# Verify fraud_features table exists
print("\nVerifying fraud_features table...")
try:
    stats = lakebase.get_table_stats("fraud_features")
    print(f"  Table exists with {stats['total_rows']:,} rows")
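except Exception as e:
    # Any failure here is treated as a missing table; log the cause so other errors stay visible
    logger.warning(f"Could not read table stats: {e}")
    print("  Table not found. Creating it now...")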
    lakebase.create_feature_table("fraud_features")
    print("  Table created successfully!")

INFO:utils.lakebase_client:Lakebase connection test successful


Connected to Lakebase PostgreSQL

Verifying fraud_features table...
INFO:utils.lakebase_client:Table stats: 0 rows


  Table exists with 0 rows


## Step 2: Generate Streaming Transaction Data

Create a streaming source that continuously generates synthetic transactions.

In [0]:
# Generate streaming transaction data
df_transactions = data_gen.generate_transaction_data(
    num_users=50,           # 50 unique users
    num_merchants=100,      # 100 unique merchants
    rows_per_second=10      # 10 transactions per second
)

print("Streaming data source created")
print("\nTransaction schema:")
df_transactions.printSchema()

INFO:utils.data_generator:Creating streaming transaction source...
INFO:utils.data_generator:   Rate: 10 transactions/second
INFO:utils.data_generator:   Users: 50, Merchants: 100
INFO:utils.data_generator:Streaming source created successfully


Streaming data source created

Transaction schema:
root
 |-- timestamp: timestamp (nullable = true)
 |-- transaction_id: string (nullable = true)
 |-- user_id: string (nullable = true)
 |-- merchant_id: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- currency: string (nullable = false)
 |-- merchant_category: string (nullable = false)
 |-- payment_method: string (nullable = false)
 |-- ip_address: string (nullable = true)
 |-- device_id: string (nullable = true)
 |-- location_lat: double (nullable = false)
 |-- location_lon: double (nullable = false)
 |-- card_type: string (nullable = false)



## Step 3: Apply Stateless Features

Apply time-based, amount-based, merchant, location, device, and network features.

In [0]:
df_with_stateless_features = feature_engineer.apply_all_features(df_transactions)

INFO:utils.feature_engineering:Applying streaming-compatible feature engineering...
INFO:utils.feature_engineering:Creating time-based features...
INFO:utils.feature_engineering:Creating amount-based features...
INFO:utils.feature_engineering:Creating merchant features (streaming-only)...
INFO:utils.feature_engineering:Creating location features (streaming-only)...
INFO:utils.feature_engineering:Creating device features (streaming-only)...
INFO:utils.feature_engineering:Creating network features (streaming-only)...
INFO:utils.feature_engineering:Streaming feature engineering completed!


## Step 4: Apply Stateful Fraud Detection

Use `transformWithStateInPandas` to maintain per-user state and detect fraud patterns.
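
The idea of per-user state can be sketched without Spark. The plain-Python class below keeps a sliding window of timestamps for each user and derives the three velocity counters that appear in the output schema (`user_transaction_count`, `transactions_last_hour`, `transactions_last_10min`). It is a stand-in for illustration, not the `FraudDetectionFeaturesProcessor` implementation: the class name, the in-memory dictionary, and the assumption that events arrive in timestamp order per user are all hypothetical.

```python
from collections import deque


class UserVelocityState:
    """Plain-Python stand-in for the state kept for one user."""

    def __init__(self, horizon_seconds=3600):
        self.horizon_seconds = horizon_seconds
        self.timestamps = deque()  # timestamps within the last hour, oldest first
        self.total_count = 0

    def update(self, ts_seconds):
        """Record one transaction and return the velocity counters after it."""
        self.total_count += 1
        self.timestamps.append(ts_seconds)
        # Evict entries older than the horizon (assumes in-order arrival)
        while self.timestamps and ts_seconds - self.timestamps[0] > self.horizon_seconds:
            self.timestamps.popleft()
        last_10min = sum(1 for t in self.timestamps if ts_seconds - t <= 600)
        return {
            "user_transaction_count": self.total_count,
            "transactions_last_hour": len(self.timestamps),
            "transactions_last_10min": last_10min,
        }


# One state object per grouping key, analogous to groupBy("user_id")
states = {}


def process(user_id, ts_seconds):
    state = states.setdefault(user_id, UserVelocityState())
    return state.update(ts_seconds)


events = [("u1", 0), ("u1", 300), ("u2", 400), ("u1", 1000), ("u1", 4000)]
results = [process(user_id, ts) for user_id, ts in events]
```

In the real pipeline the state lives in Spark's state store rather than in a Python dictionary, so it survives across micro-batches and restarts from the checkpoint.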

In [0]:
# Apply stateful fraud detection using transformWithStateInPandas
df_with_fraud_features = df_with_stateless_features \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy("user_id") \
    .transformWithStateInPandas(
        statefulProcessor=FraudDetectionFeaturesProcessor(),
        outputStructType=get_fraud_detection_output_schema(),
        outputMode="Append",
        timeMode="processingTime"
    )

print("Stateful fraud detection configured")
print("\nFinal schema with all features:")
df_with_fraud_features.printSchema()

Stateful fraud detection configured

Final schema with all features:
root
 |-- transaction_id: string (nullable = false)
 |-- user_id: string (nullable = false)
 |-- timestamp: timestamp (nullable = false)
 |-- amount: double (nullable = false)
 |-- merchant_id: string (nullable = false)
 |-- ip_address: string (nullable = false)
 |-- latitude: double (nullable = false)
 |-- longitude: double (nullable = false)
 |-- user_transaction_count: integer (nullable = false)
 |-- transactions_last_hour: integer (nullable = false)
 |-- transactions_last_10min: integer (nullable = false)
 |-- ip_changed: integer (nullable = false)
 |-- ip_change_count_total: integer (nullable = false)
 |-- distance_from_last_km: double (nullable = true)
 |-- velocity_kmh: double (nullable = true)
 |-- amount_vs_user_avg_ratio: double (nullable = true)
 |-- amount_vs_user_max_ratio: double (nullable = true)
 |-- amount_zscore: double (nullable = true)
 |-- seconds_since_last_transaction: double (nullable = true)
 

## Step 6: Write to Lakebase PostgreSQL

Stream all features to Lakebase for real-time serving.
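
`foreachBatch` gives at-least-once delivery by default, so a micro-batch can be replayed after a failure. One way to keep replays harmless is to write with a PostgreSQL upsert keyed on a unique column. The sketch below only builds such a statement; it assumes `transaction_id` carries a unique or primary-key constraint on `fraud_features`, and the helper name is hypothetical. How `lakebase.write_streaming_batch` writes rows internally is not shown in this notebook.

```python
def build_upsert_sql(table, columns, key_column):
    """Build a parameterized PostgreSQL upsert statement.

    Identifiers are interpolated directly, so they are assumed to be trusted
    constants; row values are passed separately through %s placeholders.
    """
    placeholders = ", ".join(["%s"] * len(columns))
    column_list = ", ".join(columns)
    # On a key collision, overwrite every non-key column with the incoming value
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in columns if c != key_column)
    return (
        f"INSERT INTO {table} ({column_list}) VALUES ({placeholders}) "
        f"ON CONFLICT ({key_column}) DO UPDATE SET {updates}"
    )


sql = build_upsert_sql(
    "fraud_features",
    ["transaction_id", "user_id", "amount"],
    "transaction_id",
)
```

With a statement of this shape, writing the same batch twice leaves one row per transaction instead of duplicates.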

In [0]:
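# Define foreachBatch function
def write_to_lakebase(batch_df, batch_id):
    """Write one micro-batch of features to the fraud_features table."""
    # Skip empty micro-batches
    if batch_df.isEmpty():
        return

    # Cache the batch so the count and the write do not each recompute it
    batch_df.persist()
    try:
        logger.info(f"Processing batch {batch_id} with {batch_df.count()} rows")
        lakebase.write_streaming_batch(batch_df, batch_id, "fraud_features")
        logger.info(f"Batch {batch_id} written to Lakebase")
    finally:
        batch_df.unpersist()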

# Start streaming query
query = df_with_fraud_features \
    .writeStream \
    .outputMode("append") \
    .foreachBatch(write_to_lakebase) \
    .option("checkpointLocation", "/tmp/fraud_pipeline_checkpoint") \
    .trigger(processingTime="10 seconds") \
    .start()

print("Streaming pipeline started!")
print(f"Query ID: {query.id}")
print(f"Status: {query.status}")
print("\nPipeline: Transactions → Stateless Features → Stateful Fraud Detection → Lakebase")

Streaming pipeline started!
Query ID: 3f87cc9d-4be3-45ee-bafc-864e3f9d2481
Status: {'message': 'Initializing sources', 'isDataAvailable': False, 'isTriggerActive': False}

Pipeline: Transactions → Stateless Features → Stateful Fraud Detection → Lakebase


## Step 7: Monitor and Query Results

Query fraud features from Lakebase for real-time insights.
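
The cell below waits a fixed 30 seconds before querying, which can be too short if the first micro-batch is slow. A possible alternative is to poll a readiness condition with a timeout. The helper below is a generic sketch; `wait_until` and its parameters are hypothetical names, not part of this project's utilities.

```python
import time


def wait_until(predicate, timeout_seconds=60.0, poll_interval_seconds=1.0):
    """Poll predicate() until it returns True or the timeout expires.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_seconds)


# Demo with a condition that becomes true on the third check
calls = {"n": 0}


def ready_on_third_call():
    calls["n"] += 1
    return calls["n"] >= 3


became_ready = wait_until(ready_on_third_call, timeout_seconds=5.0, poll_interval_seconds=0.01)
timed_out = wait_until(lambda: False, timeout_seconds=0.05, poll_interval_seconds=0.01)
```

In this notebook the predicate could, for example, check that `query.lastProgress` is not `None`, or that a row count against `fraud_features` is greater than zero.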

In [0]:
# Wait for data to process
import time
print("Waiting 30 seconds for data to process...")
time.sleep(30)

# Query top fraud users
fraud_query = """
SELECT 
    user_id,
    COUNT(*) as total_transactions,
    SUM(is_fraud_prediction) as predicted_frauds,
    AVG(fraud_score) as avg_fraud_score,
    MAX(fraud_score) as max_fraud_score,
    SUM(is_rapid_transaction) as rapid_transactions,
    SUM(is_impossible_travel) as impossible_travels
FROM fraud_features
GROUP BY user_id
HAVING SUM(is_fraud_prediction) > 0
ORDER BY predicted_frauds DESC
LIMIT 10
"""

with lakebase.get_connection() as conn:
    fraud_df = pd.read_sql(fraud_query, conn)

print("\nTop 10 Users by Fraud Predictions:")
display(fraud_df)

Waiting 30 seconds for data to process...


ERROR:py4j.clientserver:There was an exception while executing the Python Proxy on the Python Side.
Traceback (most recent call last):
  File "/databricks/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/clientserver.py", line 644, in _call_proxy
    return_value = getattr(self.pool[obj_id], method)(*params)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/databricks/spark/python/pyspark/sql/utils.py", line 173, in call
    raise e
  File "/databricks




Top 10 Users by Fraud Predictions:


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File <command-4593315501084477>, line 27
     24     fraud_df = pd.read_sql(fraud_query, conn)
     26 print("\nTop 10 Users by Fraud Predictions:")
---> 27 display(fraud_df)

File /databricks/python_shell/lib/dbruntime/display.py:156, in Display.display(self, input, *args, **kwargs)
    154     self.display(input.make_dataframe())
    155 elif type(input).__module__ == 'pandas.core.frame'

## Step 8: Stop Streaming Query

Stop the streaming pipeline when done.

In [0]:
# Stop streaming query
if query.isActive:
    query.stop()
    print("Streaming query stopped")

print("\nPipeline complete!")

