# Real-Time Feature Engineering for Fraud Detection

This notebook demonstrates an end-to-end [Real-Time streaming](https://www.databricks.com/blog/introducing-real-time-mode-apache-sparktm-structured-streaming) feature engineering pipeline. It computes stateless and stateful features from credit card transaction data and publishes them to a Lakebase PostgreSQL table with sub-second latency.

## Features

**Stateless Features:**
- Time-based: hour, day, business hours, cyclical encodings
- Amount-based: log, sqrt, categories, z-scores
- Merchant: risk scores based on category
- Location: risk indicators, region classification
- Device: device type detection
- Network: IP classification
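
As a rough illustration of what "stateless" means here, the sketch below derives a few time-based and amount-based features from a single transaction with no history. It is plain Python, not the notebook's `AdvancedFeatureEngineering` class (whose implementation is not shown), and the function name, output keys, and the 09:00 to 17:00 weekday definition of business hours are assumptions.

```python
import math
from datetime import datetime


def stateless_features(amount: float, ts: datetime) -> dict:
    """Compute per-row features that need no history (illustrative only)."""
    hour = ts.hour
    angle = 2 * math.pi * hour / 24  # map the hour onto a 24-hour circle
    return {
        "hour": hour,
        "day_of_week": ts.weekday(),  # Monday = 0
        # Assumed definition of business hours: 09:00-16:59 on weekdays.
        "is_business_hours": ts.weekday() < 5 and 9 <= hour < 17,
        # Cyclical encoding keeps 23:00 and 00:00 close together.
        "hour_sin": math.sin(angle),
        "hour_cos": math.cos(angle),
        # log1p and sqrt compress the long right tail of amounts.
        "amount_log": math.log1p(amount),
        "amount_sqrt": math.sqrt(amount),
    }


features = stateless_features(120.0, datetime(2024, 1, 15, 6, 30))
```

Because each output depends only on the current row, features of this kind can be expressed as ordinary column expressions on a streaming DataFrame without any state store.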

**Stateful Features:**
- Velocity: transaction counts in time windows (10 min, 1 hour)
- IP tracking: IP change detection and counts
- Location anomalies: impossible travel detection (velocity > 800 km/h)
- Amount anomalies: z-score calculation vs user history
- Fraud scoring: composite 0-100 score with prediction flag
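
The impossible-travel check above can be sketched as a great-circle distance divided by elapsed time, compared against the 800 km/h threshold. This is a minimal plain-Python sketch under stated assumptions: the function names and the `(lat, lon, epoch_seconds)` tuple layout are hypothetical, and the notebook's own processor may compute the distance differently.

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_SPEED_KMH = 800.0  # threshold taken from the feature list above


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def is_impossible_travel(prev, curr):
    """prev and curr are (lat, lon, epoch_seconds) tuples (assumed layout)."""
    distance_km = haversine_km(prev[0], prev[1], curr[0], curr[1])
    elapsed_h = (curr[2] - prev[2]) / 3600.0
    if elapsed_h <= 0:
        # Same or out-of-order timestamps: flag only if the location changed.
        return distance_km > 0
    return distance_km / elapsed_h > MAX_PLAUSIBLE_SPEED_KMH


new_york = (40.7128, -74.0060)
london = (51.5074, -0.1278)
one_hour_apart = is_impossible_travel((*new_york, 0), (*london, 3600))
ten_hours_apart = is_impossible_travel((*new_york, 0), (*london, 36000))
```

The check needs the previous transaction's location and time for the same user, which is why it belongs with the stateful features.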

## Architecture

```
Kafka Topic (synthetic transactions written by 01_generate_streaming_data.ipynb)
    ↓
Read and Parse Transactions (from_json)
    ↓
Apply Stateless Features (AdvancedFeatureEngineering)
    ↓
Apply Stateful Fraud Detection (transformWithState)
    ↓
Write to Lakebase PostgreSQL in Realtime Mode (foreach)
```

## Prerequisites

- Lakebase PostgreSQL instance provisioned
  - Configure the Lakebase connection details in `utils/config.py`
- Databricks cluster configured
  - with Databricks Runtime 17.3+
  - to support [Real-Time streaming](https://docs.databricks.com/aws/en/structured-streaming/real-time#cluster-configuration)
  - with enough task slots/cores (see [cluster size requirements](https://docs.databricks.com/aws/en/structured-streaming/real-time#cluster-size-requirements))
- Kafka topic for writing and reading the streaming data
  - Configure the Kafka topic details in `utils/config.py`
- Run `00_setup.ipynb` to create the `transaction_features` table
- Run `01_generate_streaming_data.ipynb` to generate transaction data and write it to the Kafka topic

## Output

All features (stateless + stateful) are written to:
- **Table**: `transaction_features`
- **Write latency**: 400ms

In [0]:
# Restart the Python environment to clear config and code caches.
dbutils.library.restartPython()

# Set up Spark streaming configs
spark.conf.set("spark.sql.streaming.stateStore.providerClass", "com.databricks.sql.streaming.state.RocksDBStateStoreProvider")
spark.conf.set("spark.databricks.streaming.statefulOperator.asyncCheckpoint.enabled", "false")
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled", "true")

# Adjust shuffle partitions to suit your workload.
spark.conf.set("spark.sql.shuffle.partitions", "4")



In [0]:
# Import required libraries
from pyspark.sql.functions import col, from_json  # explicit imports avoid shadowing Python builtins
from pyspark.sql.types import *
import pandas as pd
import logging

# Import utility modules
from utils.config import Config
from utils.data_generator import TransactionDataGenerator
from utils.feature_engineering import (
    AdvancedFeatureEngineering, 
    FraudDetectionFeaturesProcessor,
    get_fraud_detection_output_schema
)
from utils.lakebase_client import LakebaseClient

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

INFO:dbldatagen._version:Version : VersionInfo(major='0', minor='4', patch='0', release='post', build='1')


## Step 1: Configuration

Configure Lakebase connection and initialize components.

In [0]:
#Initialize Config
config = Config()

# Initialize components
data_gen = TransactionDataGenerator(spark)
feature_engineer = AdvancedFeatureEngineering(spark)
lakebase = LakebaseClient(**config.lakebase_config)

# Test Lakebase connection
if lakebase.test_connection():
    print("Connected to Lakebase PostgreSQL")
else:
    raise Exception("Failed to connect to Lakebase")

# Verify transaction_features table exists
print("\nVerifying transaction_features table...")
try:
    stats = lakebase.get_table_stats("transaction_features")
    print(f"  Table exists with {stats['total_rows']:,} rows")
except Exception as e:
    print("  Table not found. Please run 00_setup.ipynb first!")
    # Chain the original error so the root cause stays visible in the traceback
    raise RuntimeError("transaction_features table does not exist") from e

0.68.0


INFO:utils.lakebase_client:Lakebase connection test successful


Connected to Lakebase PostgreSQL

Verifying transaction_features table...
0.68.0


INFO:utils.lakebase_client:Table stats: 1,400,707 rows


  Table exists with 1,400,707 rows


## Step 2: Read Data from Kafka Topic

In [0]:
# Retrieve Kafka credentials from Databricks secrets
kafka_credentials_secrets = config.kafka_config["kafka_credentials_secrets"]
scope = kafka_credentials_secrets["scope"]

# Retrieve secrets from the Databricks secret scope.
# Note: the secrets must already be stored in the scope; creating them is not covered in this example.
# More information about Databricks secrets: https://docs.databricks.com/aws/en/security/secrets/#secrets-overview
KAFKA_USERNAME = dbutils.secrets.get(scope = scope, key = kafka_credentials_secrets["username"])
KAFKA_SECRET = dbutils.secrets.get(scope = scope, key = kafka_credentials_secrets["secret"])
KAFKA_SERVER = dbutils.secrets.get(scope = scope, key = kafka_credentials_secrets["server"])
KAFKA_TOPIC = config.kafka_config["kafka_topic"]

# Define the schema for the Kafka value column
value_schema = "STRUCT<transaction_id: STRING, user_id: STRING, merchant_id: STRING, amount: DOUBLE, currency: STRING, merchant_category: STRING, payment_method: STRING, ip_address: STRING, device_id: STRING, latitude: DOUBLE, longitude: DOUBLE, card_type: STRING, timestamp: TIMESTAMP>"

# Read streaming data from the Kafka topic with SASL_SSL authentication
kafka_df = (
    spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", KAFKA_SERVER)
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.jaas.config", f"kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username='{KAFKA_USERNAME}' password='{KAFKA_SECRET}';")
    .option("kafka.ssl.endpoint.identification.algorithm", "https")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("subscribe", KAFKA_TOPIC)
    .option("startingOffsets", "latest")
    .option("failOnDataLoss", "false")
    .load()
    # Parse the JSON payload and keep the Kafka record timestamp alongside it
    .select(
        from_json(col("value").cast("string"), value_schema).alias("value"),
        col("timestamp").alias("kafka_timestamp"),
    )
    .selectExpr("value.*", "kafka_timestamp")
    # Replace the payload timestamp with the Kafka record timestamp
    .drop("timestamp")
    .withColumnRenamed("kafka_timestamp", "timestamp")
)
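
To make the projection in this cell easier to follow, here is a plain-Python analogue of what happens to one Kafka record: the JSON value is parsed into the fields of `value_schema`, missing fields become nulls, and the payload's own `timestamp` is replaced by the Kafka record timestamp. The `decode_message` helper and the sample payload values are hypothetical, and treating unparseable values as all-null rows is an assumption about how `from_json` behaves in its default mode.

```python
import json
from datetime import datetime, timezone

# Field names copied from value_schema above.
PAYLOAD_FIELDS = [
    "transaction_id", "user_id", "merchant_id", "amount", "currency",
    "merchant_category", "payment_method", "ip_address", "device_id",
    "latitude", "longitude", "card_type", "timestamp",
]


def decode_message(value: bytes, kafka_timestamp: datetime) -> dict:
    """Plain-Python analogue of the from_json + select chain (illustrative)."""
    try:
        payload = json.loads(value.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        payload = {}  # assumed: malformed values yield a row of nulls
    if not isinstance(payload, dict):
        payload = {}
    record = {name: payload.get(name) for name in PAYLOAD_FIELDS}
    # The producer-side timestamp is dropped in favor of the Kafka record timestamp.
    record["timestamp"] = kafka_timestamp
    return record


broker_ts = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
good = decode_message(b'{"transaction_id": "t1", "user_id": "u1", "amount": 42.5}', broker_ts)
bad = decode_message(b"not json", broker_ts)
```

One consequence of this design is that downstream time-based features reflect when the broker received the record rather than when the producer stamped it.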

## Step 3: Apply Stateless Features

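Apply the stateless features. The feature set covers time-based, amount-based, merchant, location, device, and network features; in this run, the log output below shows the stateless location and device features being skipped.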

In [0]:
df_with_stateless_features = feature_engineer.apply_all_features(kafka_df)

INFO:utils.feature_engineering:Applying streaming-compatible feature engineering...
INFO:utils.feature_engineering:Creating time-based features...
INFO:utils.feature_engineering:Creating amount-based features...
INFO:utils.feature_engineering:Creating merchant features (streaming-only)...
INFO:utils.feature_engineering:Skipping stateless location features (optimized out)
INFO:utils.feature_engineering:Skipping device features (optimized out)
INFO:utils.feature_engineering:Creating network features (streaming-only)...
INFO:utils.feature_engineering:Streaming feature engineering completed!


## Step 4: Apply Stateful Features

Use `transformWithState` to maintain per-user state and detect fraud patterns.

In [0]:
# Apply stateful fraud detection using transformWithState
df_with_fraud_features = df_with_stateless_features \
    .withWatermark("timestamp", "1 minute") \
    .groupBy("user_id") \
    .transformWithState(
        statefulProcessor=FraudDetectionFeaturesProcessor(),
        outputStructType=get_fraud_detection_output_schema(),
        outputMode="Update",
        timeMode="processingTime"
    )
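
The implementation of `FraudDetectionFeaturesProcessor` is not shown in this notebook. As a rough idea of the per-user bookkeeping such a processor might keep, the plain-Python sketch below tracks recent transaction timestamps for the velocity counts (10 minutes and 1 hour) and running amount statistics for the z-score, using Welford's online algorithm. The `UserState` class, its field names, and the output keys are hypothetical, and the sketch assumes transactions arrive in timestamp order.

```python
import math
from collections import deque


class UserState:
    """Per-user state: recent timestamps plus running amount statistics."""

    def __init__(self, max_window_s: int = 3600):
        self.max_window_s = max_window_s
        self.timestamps = deque()  # epoch seconds of recent transactions
        self.count = 0             # number of amounts seen so far
        self.mean = 0.0
        self.m2 = 0.0              # sum of squared deviations (Welford)

    def update(self, ts: float, amount: float) -> dict:
        # Score the new amount against history before folding it in.
        if self.count >= 2:
            std = math.sqrt(self.m2 / (self.count - 1))
            zscore = (amount - self.mean) / std if std > 0 else 0.0
        else:
            zscore = 0.0

        # Welford's online update of mean and variance.
        self.count += 1
        delta = amount - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (amount - self.mean)

        # Evict timestamps older than the largest window, then count.
        self.timestamps.append(ts)
        while self.timestamps and ts - self.timestamps[0] > self.max_window_s:
            self.timestamps.popleft()
        return {
            "txn_count_10m": sum(1 for t in self.timestamps if ts - t <= 600),
            "txn_count_1h": len(self.timestamps),
            "amount_zscore": zscore,
        }


state = UserState()
events = [(0, 10.0), (300, 12.0), (900, 11.0), (4000, 50.0)]
results = [state.update(ts, amt) for ts, amt in events]
```

In a real `transformWithState` processor, values like these would live in the state store keyed by `user_id` rather than in a Python object, so that they survive restarts through the checkpoint.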

## Step 5: Write to Lakebase

Stream all features to Lakebase for real-time serving.

In [0]:
CHECKPOINT_BASE_PATH = config.kafka_config["checkpoint_base_path"]
FRAUD_PIPELINE_CHECKPOINT_LOCATION = f"{CHECKPOINT_BASE_PATH}/fraud-pipeline-checkpoint"
# Delete any existing checkpoint so the query starts fresh (this discards saved offsets and state)
dbutils.fs.rm(FRAUD_PIPELINE_CHECKPOINT_LOCATION, True)

True

In [0]:
# Get the schema from the DataFrame
table_schema = df_with_fraud_features.schema

# Initialize the Lakebase writer
lakebase_writer = lakebase.get_foreach_writer(column_names=table_schema.names, batch_size=5)

# Start the streaming query in real-time mode; the trigger interval sets how often the query checkpoints
query = df_with_fraud_features \
    .writeStream \
    .outputMode("update") \
    .foreach(lakebase_writer) \
    .option("checkpointLocation", FRAUD_PIPELINE_CHECKPOINT_LOCATION) \
    .trigger(realTime="5 minutes") \
    .start()

0.68.0
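
The object passed to `.foreach()` follows the PySpark writer protocol: an optional `open(partition_id, epoch_id)`, a required `process(row)`, and an optional `close(error)`. The sketch below shows how a batching writer of that shape might buffer rows and flush them in groups, similar in spirit to `batch_size=5` above. It is not the implementation behind `lakebase.get_foreach_writer`, which is not shown; the `BufferedRowWriter` class, the `flush_fn` callback, and the `fraud_score` column name are hypothetical.

```python
class BufferedRowWriter:
    """Hypothetical batching writer following the PySpark foreach protocol."""

    def __init__(self, column_names, batch_size, flush_fn):
        self.column_names = list(column_names)
        self.batch_size = batch_size
        self.flush_fn = flush_fn  # callable that receives a list of tuples
        self.buffer = []

    def open(self, partition_id, epoch_id):
        self.buffer = []
        return True  # True tells Spark to process this partition

    def process(self, row):
        # Works with pyspark Row objects and plain dicts alike.
        self.buffer.append(tuple(row[name] for name in self.column_names))
        if len(self.buffer) >= self.batch_size:
            self._flush()

    def close(self, error):
        # Flush the remainder only when the partition finished without error.
        if error is None and self.buffer:
            self._flush()

    def _flush(self):
        self.flush_fn(self.buffer)
        self.buffer = []


written = []
writer = BufferedRowWriter(["transaction_id", "fraud_score"], batch_size=2, flush_fn=written.append)
writer.open(partition_id=0, epoch_id=0)
for i in range(3):
    writer.process({"transaction_id": f"t{i}", "fraud_score": i * 10})
writer.close(None)
```

A small batch size keeps per-row latency low at the cost of more round trips to the database; a production writer would typically open its database connection in `open` and issue a batched insert in place of `flush_fn`.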


## Step 6: Stop Streaming Query

The stop logic below is commented out; uncomment and run it to stop the streaming pipeline when done.

In [0]:
# # Stop streaming query
# if query.isActive:
#     query.stop()
#     print("Streaming query stopped")

# print("\nPipeline complete!")