# Real-Time Streaming Fraud Detection Pipeline

This notebook demonstrates an end-to-end streaming fraud detection pipeline that combines stateless and stateful features.

## Features

**Stateless Features (~40 columns):**
- Time-based: hour, day, business hours, cyclical encodings
- Amount-based: log, sqrt, categories, z-scores
- Merchant: risk scores based on category
- Location: risk indicators, region classification
- Device: device type detection
- Network: IP classification
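
Stateless features depend only on the current row, so each one can be computed without any history. As a rough illustration, the time-based and amount-based transforms might look like the plain-Python sketch below. The function name, the output keys, and the business-hours definition (weekdays, 09:00 to 16:59) are assumptions made for this example; the actual `AdvancedFeatureEngineering` implementation is not shown in this notebook.

```python
import math
from datetime import datetime


def stateless_features(ts: datetime, amount: float) -> dict:
    """Derive a few per-row features; no per-user history is needed."""
    # Map the hour onto a circle so 23:00 and 00:00 end up close together.
    hour_angle = 2 * math.pi * ts.hour / 24
    return {
        "hour": ts.hour,
        "day_of_week": ts.weekday(),  # Monday = 0
        # Assumed definition: weekdays between 09:00 and 16:59.
        "is_business_hours": int(ts.weekday() < 5 and 9 <= ts.hour < 17),
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
        # Assumes a non-negative amount.
        "amount_log": math.log1p(amount),
        "amount_sqrt": math.sqrt(amount),
    }


features = stateless_features(datetime(2024, 1, 15, 6, 0), 99.0)
```

The sine/cosine pair is what "cyclical encodings" usually refers to: it avoids the artificial jump between hour 23 and hour 0 that a raw integer would introduce.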

**Stateful Features (~15 columns):**
- Velocity: transaction counts in time windows (10 min, 1 hour)
- IP tracking: IP change detection and counts
- Location anomalies: impossible travel detection (velocity > 800 km/h)
- Amount anomalies: z-score calculation vs user history
- Fraud scoring: composite 0-100 score with prediction flag
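
To make the impossible-travel rule concrete, here is a minimal sketch that derives a velocity from two consecutive transactions using the haversine great-circle distance and compares it with the 800 km/h threshold listed above. The helper names and the `(lat, lon, epoch_seconds)` tuple layout are hypothetical; the notebook's own processor may compute this differently.

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 800.0  # threshold from the feature list


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def travel_velocity_kmh(prev, curr):
    """prev and curr are (lat, lon, epoch_seconds); None if no time elapsed."""
    elapsed_hours = (curr[2] - prev[2]) / 3600.0
    if elapsed_hours <= 0:
        return None
    distance = haversine_km(prev[0], prev[1], curr[0], curr[1])
    return distance / elapsed_hours


def is_impossible_travel(prev, curr):
    velocity = travel_velocity_kmh(prev, curr)
    return velocity is not None and velocity > MAX_PLAUSIBLE_KMH


new_york = (40.7128, -74.0060, 0)
london_1h_later = (51.5074, -0.1278, 3600)
london_10h_later = (51.5074, -0.1278, 36000)
```

With these inputs, New York to London within one hour is flagged, while the same trip over ten hours stays below the threshold.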

## Architecture

```
Streaming Source (rate)
    ↓
Generate Transactions (synthetic data)
    ↓
Apply Stateless Features (AdvancedFeatureEngineering)
    ↓
Apply Stateful Fraud Detection (transformWithStateInPandas)
    ↓
Write to Lakebase PostgreSQL (foreachBatch)
    ↓
Real-Time Feature Serving (<10ms query latency)
```

## Prerequisites

- Run `00_setup.ipynb` first to create the `transaction_features` table
- Databricks Runtime 17.3+ (Spark 4.0+ is required for `transformWithStateInPandas`)
- Lakebase PostgreSQL instance provisioned

## Output

All features (stateless + stateful) are written to:
- **Table**: `transaction_features` (~70+ columns)
- **Write latency**: 50-100ms per micro-batch
- **Query latency**: <10ms for real-time serving


In [0]:
dbutils.library.restartPython()

In [0]:
# Import required libraries
# Namespaced imports avoid shadowing Python builtins such as sum, min, max and round
from pyspark.sql import functions as F
from pyspark.sql import types as T
import pandas as pd
import logging

# Import utility modules
from utils.data_generator import TransactionDataGenerator
from utils.feature_engineering import (
    AdvancedFeatureEngineering, 
    FraudDetectionFeaturesProcessor,
    get_fraud_detection_output_schema
)
from utils.lakebase_client import LakebaseClient

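# Configure logging
logging.basicConfig(level=logging.INFO)
# Keep py4j's per-call INFO messages out of the cell output
logging.getLogger("py4j").setLevel(logging.WARNING)
logger = logging.getLogger(__name__)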

print("All modules imported successfully")
print(f"Spark version: {spark.version}")


All modules imported successfully
Spark version: 4.0.0


## Step 1: Configuration

Configure Lakebase connection and initialize components.

In [0]:
# Lakebase connection configuration
LAKEBASE_CONFIG = {
    "instance_name": "rtm-lakebase-demo",
    "database": "databricks_postgres"
}

# Initialize components
data_gen = TransactionDataGenerator(spark)
feature_engineer = AdvancedFeatureEngineering(spark)
lakebase = LakebaseClient(**LAKEBASE_CONFIG)

# Test Lakebase connection
if lakebase.test_connection():
    print("Connected to Lakebase PostgreSQL")
else:
    raise ConnectionError("Failed to connect to Lakebase")

# Verify that the transaction_features table exists
print("\nVerifying transaction_features table...")
try:
    stats = lakebase.get_table_stats("transaction_features")
    print(f"  Table exists with {stats['total_rows']:,} rows")
except Exception as e:
    print("  Table not found. Please run 00_setup.ipynb first!")
    # Chain the original error so the root cause stays visible in the traceback
    raise RuntimeError("transaction_features table does not exist") from e

0.68.0


INFO:utils.lakebase_client:Lakebase connection test successful


Connected to Lakebase PostgreSQL

Verifying fraud_features table...
0.68.0


ERROR:utils.lakebase_client:Database error: relation "fraud_features" does not exist
LINE 9:             FROM fraud_features
                         ^

ERROR:utils.lakebase_client:Error getting table stats: relation "fraud_features" does not exist
LINE 9:             FROM fraud_features
                         ^



  Table not found. Creating it now...
0.68.0


INFO:utils.lakebase_client:Created unified feature table: transaction_features (~70+ columns)


  Table created successfully!


## Step 2: Generate Streaming Transaction Data

Create a streaming source that continuously generates synthetic transactions.
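
Spark's built-in `rate` source emits only a `timestamp` and a monotonically increasing `value` column, so a synthetic generator has to derive every other field from those. The sketch below shows one way to do that deterministically in plain Python by hashing the counter. The function name, ID formats, amount range, and category list are all hypothetical; `TransactionDataGenerator` may use a different scheme.

```python
import hashlib

# Hypothetical category values; the real generator's list is not shown here.
CATEGORIES = ["grocery", "electronics", "travel", "restaurant"]


def synth_transaction(value: int, num_users: int = 50, num_merchants: int = 100) -> dict:
    """Map a rate-source counter to a deterministic pseudo-random transaction."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()

    def chunk(i: int) -> int:
        # Read the i-th 4-byte slice of the digest as an unsigned integer.
        return int.from_bytes(digest[4 * i : 4 * i + 4], "big")

    return {
        "transaction_id": f"txn_{value:012d}",
        "user_id": f"user_{chunk(0) % num_users:04d}",
        "merchant_id": f"merchant_{chunk(1) % num_merchants:04d}",
        "amount": round(1 + (chunk(2) % 100_000) / 100, 2),  # 1.00 to 1000.99
        "merchant_category": CATEGORIES[chunk(3) % len(CATEGORIES)],
    }


sample = synth_transaction(7)
```

Because the output depends only on `value`, replaying the same counter range reproduces the same transactions, which is convenient when a stream restarts from a checkpoint.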

In [0]:
# Generate streaming transaction data
df_transactions = data_gen.generate_transaction_data(
    num_users=50,           # 50 unique users
    num_merchants=100,      # 100 unique merchants
    rows_per_second=10      # 10 transactions per second
)

print("Streaming data source created")
print("\nTransaction schema:")
df_transactions.printSchema()

INFO:utils.data_generator:Creating streaming transaction source...
INFO:utils.data_generator:   Rate: 10 transactions/second
INFO:utils.data_generator:   Users: 50, Merchants: 100
INFO:utils.data_generator:Streaming source created successfully


Streaming data source created

Transaction schema:
root
 |-- timestamp: timestamp (nullable = true)
 |-- transaction_id: string (nullable = true)
 |-- user_id: string (nullable = true)
 |-- merchant_id: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- currency: string (nullable = false)
 |-- merchant_category: string (nullable = false)
 |-- payment_method: string (nullable = false)
 |-- ip_address: string (nullable = true)
 |-- device_id: string (nullable = true)
 |-- location_lat: double (nullable = false)
 |-- location_lon: double (nullable = false)
 |-- card_type: string (nullable = false)



## Step 3: Apply Stateless Features

Apply time-based, amount-based, merchant, location, device, and network features.
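
As one example of a network feature, IP classification can be done per row with Python's standard `ipaddress` module. This is only a sketch: the label names are assumptions, and `apply_all_features` may well express the same idea with Spark column expressions instead.

```python
import ipaddress


def classify_ip(ip: str) -> str:
    """Return a coarse label for an IPv4/IPv6 address string."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return "invalid"
    # Check loopback first: Python also reports loopback addresses as private.
    if addr.is_loopback:
        return "loopback"
    if addr.is_private:
        return "private"
    return "public"


labels = [classify_ip(ip) for ip in ["192.168.1.10", "8.8.8.8", "127.0.0.1", "not-an-ip"]]
```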

In [0]:
df_with_stateless_features = feature_engineer.apply_all_features(df_transactions)

INFO:utils.feature_engineering:Applying streaming-compatible feature engineering...
INFO:utils.feature_engineering:Creating time-based features...
INFO:utils.feature_engineering:Creating amount-based features...
INFO:utils.feature_engineering:Creating merchant features (streaming-only)...
INFO:utils.feature_engineering:Creating location features (streaming-only)...
INFO:utils.feature_engineering:Creating device features (streaming-only)...
INFO:utils.feature_engineering:Creating network features (streaming-only)...
INFO:utils.feature_engineering:Streaming feature engineering completed!


## Step 4: Apply Stateful Features

Use `transformWithStateInPandas` to maintain per-user state and detect fraud patterns.
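
The per-user state logic can be pictured independently of Spark. The following plain-Python sketch keeps a sliding window of timestamps for the velocity counts and running statistics (Welford's algorithm) for the amount z-score. It assumes events arrive in timestamp order for each user, and the `UserState` class and `update` function are hypothetical names; `FraudDetectionFeaturesProcessor` lives in `utils.feature_engineering` and is not reproduced here.

```python
import math
from collections import deque
from dataclasses import dataclass, field


@dataclass
class UserState:
    recent: deque = field(default_factory=deque)  # timestamps in the last hour
    count: int = 0      # total transactions seen for this user
    mean: float = 0.0   # running mean of amounts
    m2: float = 0.0     # running sum of squared deviations (Welford)


def update(state: UserState, ts: float, amount: float) -> dict:
    """Fold one transaction (epoch seconds, amount) into the user's state."""
    # Score the amount against history seen *before* this transaction.
    zscore = None
    if state.count >= 2:
        std = math.sqrt(state.m2 / (state.count - 1))
        if std > 0:
            zscore = (amount - state.mean) / std

    # Slide the one-hour window, then count the last ten minutes inside it.
    state.recent.append(ts)
    while state.recent and state.recent[0] <= ts - 3600:
        state.recent.popleft()
    last_10min = sum(1 for t in state.recent if t > ts - 600)

    # Welford's online update of mean and squared deviations.
    state.count += 1
    delta = amount - state.mean
    state.mean += delta / state.count
    state.m2 += delta * (amount - state.mean)

    return {
        "user_transaction_count": state.count,
        "transactions_last_hour": len(state.recent),
        "transactions_last_10min": last_10min,
        "amount_zscore": zscore,
    }


state = UserState()
r1 = update(state, 0, 10.0)
r2 = update(state, 300, 20.0)
r3 = update(state, 900, 30.0)
r4 = update(state, 4000, 20.0)
```

In the streaming job, the equivalent of `UserState` would be held in Spark's state store and keyed by `user_id`, so it survives across micro-batches.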

In [0]:
# Apply stateful fraud detection using transformWithStateInPandas
df_with_fraud_features = df_with_stateless_features \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy("user_id") \
    .transformWithStateInPandas(
        statefulProcessor=FraudDetectionFeaturesProcessor(),
        outputStructType=get_fraud_detection_output_schema(),
        outputMode="Update",
        timeMode="processingTime"
    )

print("Stateful fraud detection configured")
print("\nFinal schema with all features:")
df_with_fraud_features.printSchema()

Stateful fraud detection configured

Final schema with all features:
root
 |-- transaction_id: string (nullable = false)
 |-- user_id: string (nullable = false)
 |-- timestamp: timestamp (nullable = false)
 |-- amount: double (nullable = false)
 |-- merchant_id: string (nullable = false)
 |-- ip_address: string (nullable = false)
 |-- latitude: double (nullable = false)
 |-- longitude: double (nullable = false)
 |-- user_transaction_count: integer (nullable = false)
 |-- transactions_last_hour: integer (nullable = false)
 |-- transactions_last_10min: integer (nullable = false)
 |-- ip_changed: integer (nullable = false)
 |-- ip_change_count_total: integer (nullable = false)
 |-- distance_from_last_km: double (nullable = true)
 |-- velocity_kmh: double (nullable = true)
 |-- amount_vs_user_avg_ratio: double (nullable = true)
 |-- amount_vs_user_max_ratio: double (nullable = true)
 |-- amount_zscore: double (nullable = true)
 |-- seconds_since_last_transaction: double (nullable = true)
 

## Step 5: Write to Lakebase

Stream all features to Lakebase for real-time serving.
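
Because `foreachBatch` offers at-least-once delivery by default, a micro-batch can be replayed after a failure, so the write into PostgreSQL should ideally be idempotent. One common way to get there is an upsert keyed on `transaction_id`. The helper below is a hypothetical sketch that only builds the SQL text; it assumes a unique constraint on the key column and trusted identifier names, and it says nothing about how `LakebaseClient.write_streaming_batch` actually works.

```python
def build_upsert_sql(table: str, columns: list[str], key: str = "transaction_id") -> str:
    """Build a PostgreSQL upsert statement with %s placeholders."""
    if key not in columns:
        raise ValueError(f"key column {key!r} must be one of the columns")
    # Identifiers are interpolated directly, so they must come from trusted code.
    col_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in columns if c != key)
    # With no non-key columns there is nothing to update on conflict.
    action = f"DO UPDATE SET {updates}" if updates else "DO NOTHING"
    return (
        f"INSERT INTO {table} ({col_list}) VALUES ({placeholders}) "
        f"ON CONFLICT ({key}) {action}"
    )


sql = build_upsert_sql(
    "transaction_features",
    ["transaction_id", "amount", "velocity_kmh"],
)
```

Replaying a batch with such a statement overwrites the existing rows instead of duplicating them.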

In [0]:
# Define the foreachBatch function that writes to Lakebase PostgreSQL
def write_to_lakebase(batch_df, batch_id):
    """Write one micro-batch to the transaction_features table."""
    lakebase.write_streaming_batch(batch_df, batch_id, "transaction_features")
    # Lazy %-style arguments avoid formatting the message when INFO is disabled
    logger.info("Batch %s written to Lakebase", batch_id)

# Start streaming query
query = df_with_fraud_features \
    .writeStream \
    .outputMode("update") \
    .foreachBatch(write_to_lakebase) \
    .option("checkpointLocation", "/tmp/fraud_pipeline_checkpoint") \
    .trigger(processingTime="10 seconds") \
    .start()

print("Streaming pipeline started!")
print(f"Query ID: {query.id}")
print(f"Status: {query.status}")
print("\nPipeline flow:")
print("  Transactions → Stateless Features → Stateful Fraud Detection → Lakebase")
print("\nWriting to:")
print("  Table: transaction_features")
print("  Trigger: 10 seconds")
print("  Checkpoint: /tmp/fraud_pipeline_checkpoint")

Streaming pipeline started!
Query ID: 51b7ac8b-ddeb-453d-aa10-8208583c2212
Status: {'message': 'Initializing sources', 'isDataAvailable': False, 'isTriggerActive': False}

Pipeline: Transactions → Stateless Features → Stateful Fraud Detection → Lakebase


## Step 6: Stop Streaming Query

Stop the streaming pipeline when done.

In [0]:
# Stop streaming query
if query.isActive:
    query.stop()
    print("Streaming query stopped")

print("\nPipeline complete!")

