# Real-Time Streaming Fraud Detection Pipeline

This notebook demonstrates an end-to-end streaming fraud detection pipeline combining:
- **Stateless features**: Time, amount, merchant, device, network (from `AdvancedFeatureEngineering`)
- **Stateful features**: Transaction velocity, IP tracking, location anomalies (from `FraudDetectionFeaturesProcessor`)

## Architecture

```
Streaming Source (Rate)
    ↓
Generate Transactions (TransactionDataGenerator)
    ↓
Apply Stateless Features (AdvancedFeatureEngineering)
    ↓
Apply Stateful Fraud Detection (transformWithStateInPandas)
    ↓
Write to Lakebase PostgreSQL (foreachBatch)
    ↓
Real-Time Feature Serving (<10ms query latency)
```

## Prerequisites

- Run `00_setup.ipynb` to create feature tables
- Databricks Runtime 17.3+ with Spark 4.0+
- Lakebase PostgreSQL instance provisioned

## Features Generated

**Stateless (~40 columns):**
- Time-based: hour, day, business hours, cyclical encodings
- Amount-based: log, sqrt, categories, z-scores
- Merchant: risk scores
- Location, Device, Network: risk indicators
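
Stateless features depend only on the current row, so each one can be computed without any history. The sketch below illustrates a few of the time-based and amount-based features listed above in plain Python. The function name and output keys are hypothetical, the 09:00-17:00 weekday definition of business hours is an assumption, and `AdvancedFeatureEngineering` may compute these differently.

```python
import math
from datetime import datetime


def stateless_features(ts: datetime, amount: float) -> dict:
    """Derive per-row features from one transaction (assumes amount >= 0)."""
    hour = ts.hour
    # Map the hour onto a circle so that 23:00 and 00:00 end up close together
    angle = 2 * math.pi * hour / 24
    return {
        "hour": hour,
        # Assumed definition: Monday-Friday, 09:00 up to (not including) 17:00
        "is_business_hours": int(ts.weekday() < 5 and 9 <= hour < 17),
        "hour_sin": math.sin(angle),
        "hour_cos": math.cos(angle),
        # log1p handles amount == 0 without raising
        "amount_log": math.log1p(amount),
        "amount_sqrt": math.sqrt(amount),
    }


features = stateless_features(datetime(2024, 1, 3, 6, 0), 99.0)
```

The sine/cosine pair is what "cyclical encoding" refers to: a model sees late-night and early-morning hours as neighbours instead of as the two ends of a 0-23 scale.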

**Stateful (~25 columns):**
- Velocity: transaction counts in time windows
- IP tracking: IP change detection
- Location anomalies: impossible travel detection
- Amount anomalies: statistical outliers
- Fraud scoring: composite 0-100 score
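
Impossible travel detection compares the current transaction's location with the user's previous one. A minimal sketch, assuming a haversine great-circle distance and a hypothetical 800 km/h speed limit (neither the formula nor the threshold is confirmed by `FraudDetectionFeaturesProcessor`):

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 800.0  # hypothetical threshold, roughly airliner cruise speed


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def travel_features(prev, curr):
    """prev and curr are (latitude, longitude, epoch_seconds) tuples."""
    distance = haversine_km(prev[0], prev[1], curr[0], curr[1])
    elapsed_h = (curr[2] - prev[2]) / 3600.0
    # Velocity is undefined when no time has elapsed
    velocity = distance / elapsed_h if elapsed_h > 0 else None
    return {
        "distance_from_last_km": distance,
        "velocity_kmh": velocity,
        "is_impossible_travel": int(velocity is not None
                                    and velocity > MAX_PLAUSIBLE_KMH),
    }


# New York followed by London one hour later
nyc = (40.7128, -74.0060, 0)
london = (51.5074, -0.1278, 3600)
result = travel_features(nyc, london)
```

Because the previous location has to be remembered between transactions, this feature is stateful, unlike the per-row features in the stateless group.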


In [None]:
# Import required libraries
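# Namespaced and explicit imports avoid shadowing Python builtins such as sum, min, and max
from pyspark.sql import functions as F
from pyspark.sql.types import (
    StructType, StructField, StringType, TimestampType, DoubleType, IntegerType
)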
import pandas as pd
import logging

# Import utility modules
from utils.data_generator import TransactionDataGenerator
from utils.feature_engineering import (
    AdvancedFeatureEngineering, 
    FraudDetectionFeaturesProcessor,
    get_fraud_detection_output_schema
)
from utils.lakebase_client import LakebaseClient

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

print("All modules imported successfully")
print(f"Spark version: {spark.version}")

## Step 1: Configuration

Configure Lakebase connection and initialize components.

In [None]:
# Lakebase connection configuration
LAKEBASE_CONFIG = {
    "instance_name": "neha-lakebase-demo",
    "database": "databricks_postgres"
}

# Initialize components
data_gen = TransactionDataGenerator(spark)
feature_engineer = AdvancedFeatureEngineering(spark)
lakebase = LakebaseClient(**LAKEBASE_CONFIG)

# Test Lakebase connection
if lakebase.test_connection():
    print("Connected to Lakebase PostgreSQL")
else:
    raise ConnectionError("Failed to connect to Lakebase")

# Verify fraud_features table exists
print("\nVerifying fraud_features table...")
try:
    stats = lakebase.get_table_stats("fraud_features")
    print(f"  Table exists with {stats['total_rows']:,} rows")
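except Exception as e:
    # Any failure is treated as "table missing"; print the error so other causes stay visible
    print(f"  Could not read table stats ({e}). Creating the table now...")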
    lakebase.create_feature_table("fraud_features")
    print("  Table created successfully!")

## Step 2: Generate Streaming Transaction Data

Create a streaming source that continuously generates synthetic transactions.
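
Spark's rate source emits rows with a `timestamp` and a monotonically increasing `value` counter, so a generator has to derive every transaction field from those two columns. The sketch below shows one way to do that deterministically in plain Python. The function name, ID formats, and amount range are hypothetical, and `TransactionDataGenerator` may work differently.

```python
import hashlib


def synth_transaction(value: int, num_users: int = 50, num_merchants: int = 100) -> dict:
    """Derive a repeatable synthetic transaction from a rate-source counter value."""
    # Hashing spreads consecutive counter values across users and merchants
    digest = hashlib.sha256(str(value).encode()).digest()
    user = int.from_bytes(digest[0:4], "big") % num_users
    merchant = int.from_bytes(digest[4:8], "big") % num_merchants
    # Amount between 1.00 and 500.99, built from whole cents
    amount_cents = 100 + int.from_bytes(digest[8:12], "big") % 50_000
    return {
        "transaction_id": f"txn_{value:012d}",
        "user_id": f"user_{user:04d}",
        "merchant_id": f"merchant_{merchant:04d}",
        "amount": amount_cents / 100.0,
    }


sample = [synth_transaction(v) for v in range(1000)]
```

Keeping `num_users` small relative to the row rate means each user receives many transactions per minute, which gives the stateful velocity features something to measure.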

In [None]:
# Generate streaming transaction data
df_transactions = data_gen.generate_transaction_data(
    num_users=50,           # 50 unique users
    num_merchants=100,      # 100 unique merchants
    rows_per_second=10      # 10 transactions per second
)

print("Streaming data source created")
print("\nTransaction schema:")
df_transactions.printSchema()

## Step 3: Apply Stateless Features

Apply time-based, amount-based, merchant, location, device, and network features.

In [None]:
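# Apply stateless features to the transaction stream.
# Assumption: AdvancedFeatureEngineering exposes apply_all_features();
# check utils/feature_engineering.py and adjust the method name if it differs.
df_with_stateless_features = feature_engineer.apply_all_features(df_transactions)

# Load the full output schema defined in the utils.feature_engineering module
output_schema = get_fraud_detection_output_schema()

print("Stateless features applied")
print(f"Full output schema loaded with {len(output_schema.fields)} fields")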

## Step 4: Define Stateful Fraud Detection Schema

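Define the output schema of the stateful processor: the core transaction fields plus the stateful fraud detection features. This simplified schema replaces the `output_schema` loaded in Step 3.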
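
The schema carries several 0/1 flag columns alongside `fraud_score` and `is_fraud_prediction`. One way such flags could be combined into a 0-100 score is a capped weighted sum, sketched below. The weights and the decision threshold are hypothetical and are not taken from `FraudDetectionFeaturesProcessor`.

```python
# Hypothetical weights per flag column; they sum to 100
WEIGHTS = {
    "is_rapid_transaction": 20.0,
    "is_impossible_travel": 40.0,
    "is_amount_anomaly": 25.0,
    "ip_changed": 15.0,
}
FRAUD_THRESHOLD = 50.0  # hypothetical cut-off for is_fraud_prediction


def score(row: dict) -> tuple:
    """Return (fraud_score, is_fraud_prediction) for a row of 0/1 flags."""
    raw = 0.0
    for column, weight in WEIGHTS.items():
        # Missing flags count as 0
        raw += weight * row.get(column, 0)
    # Keep the score inside the 0-100 range
    fraud_score = 100.0 if raw > 100.0 else (0.0 if raw < 0.0 else raw)
    return fraud_score, int(fraud_score >= FRAUD_THRESHOLD)


flagged = score({"is_impossible_travel": 1, "ip_changed": 1})
clean = score({})
```

Under these assumed weights, impossible travel plus an IP change scores 55 and crosses the threshold, while a single IP change (15) does not.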

In [None]:
# Define the processor's output schema (core transaction fields + stateful features)
# Note: this simplified schema overrides the one loaded earlier; the full schema matches the fraud_features table
output_schema = StructType([
    # Core transaction fields
    StructField("transaction_id", StringType(), False),
    StructField("user_id", StringType(), False),
    StructField("timestamp", TimestampType(), False),
    StructField("amount", DoubleType(), False),
    StructField("merchant_id", StringType(), False),
    StructField("ip_address", StringType(), False),
    StructField("latitude", DoubleType(), False),
    StructField("longitude", DoubleType(), False),
    
    # Stateful fraud detection features
    StructField("user_transaction_count", IntegerType(), False),
    StructField("transactions_last_hour", IntegerType(), False),
    StructField("transactions_last_10min", IntegerType(), False),
    StructField("ip_changed", IntegerType(), False),
    StructField("ip_change_count_total", IntegerType(), False),
    StructField("distance_from_last_km", DoubleType(), True),
    StructField("velocity_kmh", DoubleType(), True),
    StructField("amount_vs_user_avg_ratio", DoubleType(), True),
    StructField("amount_vs_user_max_ratio", DoubleType(), True),
    StructField("amount_zscore", DoubleType(), True),
    StructField("seconds_since_last_transaction", DoubleType(), True),
    StructField("is_rapid_transaction", IntegerType(), False),
    StructField("is_impossible_travel", IntegerType(), False),
    StructField("is_amount_anomaly", IntegerType(), False),
    StructField("fraud_score", DoubleType(), False),
    StructField("is_fraud_prediction", IntegerType(), False)
])

print("Output schema defined with stateful fraud features")

## Step 5: Apply Stateful Fraud Detection

Use `transformWithStateInPandas` to maintain per-user state and detect fraud patterns.
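
To illustrate what "per-user state" means here, the sketch below keeps a small state object per user and derives the velocity counts from it. It is a plain-Python stand-in: a dictionary plays the role of Spark's state store, timestamps are assumed to arrive in order for each user, and the class name and window handling are hypothetical and do not reflect the internals of `FraudDetectionFeaturesProcessor`.

```python
from collections import defaultdict, deque

WINDOW_10MIN = 600   # seconds
WINDOW_1H = 3600     # seconds


class VelocityState:
    """Recent transaction timestamps (epoch seconds) for a single user."""

    def __init__(self):
        self.recent = deque()
        self.total = 0

    def update(self, ts: int) -> dict:
        self.total += 1
        self.recent.append(ts)
        # Evict everything that has fallen out of the largest window
        while self.recent and self.recent[0] <= ts - WINDOW_1H:
            self.recent.popleft()
        last_10min = len([t for t in self.recent if t > ts - WINDOW_10MIN])
        return {
            "user_transaction_count": self.total,
            "transactions_last_hour": len(self.recent),
            "transactions_last_10min": last_10min,
        }


# One state object per key, analogous to grouping the stream by user_id
state_by_user = defaultdict(VelocityState)

events = [("u1", 0), ("u1", 300), ("u2", 400), ("u1", 900), ("u1", 4000)]
outputs = [(user, state_by_user[user].update(ts)) for user, ts in events]
```

In the streaming job the equivalent state is managed by Spark and recovered from the checkpoint after a restart, which an in-memory dictionary cannot do.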

In [None]:
# Apply stateful fraud detection using transformWithStateInPandas
df_with_fraud_features = df_with_stateless_features \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy("user_id") \
    .transformWithStateInPandas(
        statefulProcessor=FraudDetectionFeaturesProcessor(),
        outputStructType=output_schema,
        outputMode="Append",
        timeMode="None"
    )

print("Stateful fraud detection configured")
print("\nFinal schema with all features:")
df_with_fraud_features.printSchema()

## Step 6: Write to Lakebase PostgreSQL

Stream all features to Lakebase for real-time serving.
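
`foreachBatch` gives at-least-once delivery by default, so a micro-batch can be replayed after a failure. Writes therefore work best when they are idempotent. The sketch below builds a PostgreSQL upsert statement for that purpose. It assumes that `transaction_id` is a unique key on `fraud_features` and that table and column names come from trusted code, and it does not describe what `LakebaseClient.write_streaming_batch` actually does.

```python
def build_upsert_sql(table: str, columns: list, key: str = "transaction_id") -> str:
    """Build a parameterised PostgreSQL upsert keyed on `key`.

    Identifiers are interpolated directly, so they must not come from user input.
    """
    col_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    # On conflict, overwrite every non-key column with the incoming value
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in columns if c != key)
    return (
        f"INSERT INTO {table} ({col_list}) VALUES ({placeholders}) "
        f"ON CONFLICT ({key}) DO UPDATE SET {updates}"
    )


sql = build_upsert_sql("fraud_features", ["transaction_id", "user_id", "fraud_score"])
```

With a statement of this shape, replaying a batch rewrites the same rows instead of inserting duplicates.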

In [None]:
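# Define foreachBatch function
def write_to_lakebase(batch_df, batch_id):
    """Write one micro-batch to the fraud_features table; empty batches are skipped."""
    if batch_df.isEmpty():
        return

    # Note: count() runs an extra Spark job per batch; drop it from the log if throughput matters
    logger.info(f"Processing batch {batch_id} with {batch_df.count()} rows")
    lakebase.write_streaming_batch(batch_df, batch_id, "fraud_features")
    logger.info(f"Batch {batch_id} written to Lakebase")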

# Start streaming query
query = df_with_fraud_features \
    .writeStream \
    .outputMode("append") \
    .foreachBatch(write_to_lakebase) \
    .option("checkpointLocation", "/tmp/fraud_pipeline_checkpoint") \
    .trigger(processingTime="10 seconds") \
    .start()

print("Streaming pipeline started!")
print(f"Query ID: {query.id}")
print(f"Status: {query.status}")
print("\nPipeline: Transactions → Stateless Features → Stateful Fraud Detection → Lakebase")

## Step 7: Monitor and Query Results

Query fraud features from Lakebase for real-time insights.

In [None]:
# Wait for data to process
import time
print("Waiting 30 seconds for data to process...")
time.sleep(30)

# Query top fraud users
fraud_query = """
SELECT 
    user_id,
    COUNT(*) as total_transactions,
    SUM(is_fraud_prediction) as predicted_frauds,
    AVG(fraud_score) as avg_fraud_score,
    MAX(fraud_score) as max_fraud_score,
    SUM(is_rapid_transaction) as rapid_transactions,
    SUM(is_impossible_travel) as impossible_travels
FROM fraud_features
GROUP BY user_id
HAVING SUM(is_fraud_prediction) > 0
ORDER BY predicted_frauds DESC
LIMIT 10
"""

with lakebase.get_connection() as conn:
    fraud_df = pd.read_sql(fraud_query, conn)

print("\nTop 10 Users by Fraud Predictions:")
display(fraud_df)

## Step 8: Stop Streaming Query

Stop the streaming pipeline when done.

In [None]:
# Stop streaming query
if query.isActive:
    query.stop()
    print("Streaming query stopped")

print("\nPipeline complete!")