# Streaming Feature Engineering Examples and Usage

This notebook demonstrates how to use Spark Structured Streaming to generate real-time features from a synthetic credit card transaction stream and publish them to Lakebase.

**Note**:
- All feature engineering methods are implemented in `utils/feature_engineering.py`
- Streaming data is generated by `TransactionDataGenerator` from `utils/data_generator.py`, which uses a rate source
- This notebook only demonstrates how to call them

## Prerequisites

- Databricks Runtime 17.3+
- Lakebase PostgreSQL instance provisioned
- Run `00_setup.ipynb` first to create the feature table

## Generated Features

- **Time-based**: Hour, day of week, business hours, cyclical encoding (year, month, day, etc.)
- **Amount-based**: Log transformations, categories, z-scores
- **Merchant**: Risk scores based on merchant category
- **Location**: Risk indicators, region classification (if location data available)
- **Device**: Device type detection (if device data available)
- **Network**: IP classification, private/public indicators (if IP data available)
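
As an illustration of two of the feature families listed above, here is a minimal plain-Python sketch of cyclical time encoding and a log amount transform. The function names `cyclical_encode` and `log_amount` are hypothetical, and the actual implementation in `utils/feature_engineering.py` may differ.

```python
import math


def cyclical_encode(value: float, period: float) -> tuple:
    """Map a periodic value (e.g. hour 0-23 with period 24) onto the unit circle."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


def log_amount(amount: float) -> float:
    """log(1 + amount): compresses large amounts and is defined at zero."""
    return math.log1p(max(amount, 0.0))


# 23:00 and 00:00 end up close together on the circle,
# which a raw hour value (23 vs 0) would not capture.
late_evening = cyclical_encode(23, 24)
midnight = cyclical_encode(0, 24)
```

Encoding each periodic field as a sine/cosine pair keeps the wrap-around (hour 23 next to hour 0, December next to January) visible to a downstream model.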


In [0]:
dbutils.library.restartPython()

## Initial Setup

In [0]:
# Import necessary libraries
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import *
from pyspark.sql.types import *
import logging

# Import the AdvancedFeatureEngineering class from feature_engineering.py
import sys
from utils.feature_engineering import AdvancedFeatureEngineering

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

print(f"Spark version: {spark.version}")
print("Feature engineering module imported from utils/feature_engineering.py")

# Lakebase connection configuration
LAKEBASE_CONFIG = {
    "instance_name": "neha-lakebase-demo",
    "database": "databricks_postgres",    
}


# Initialize the AdvancedFeatureEngineering class
feature_engineer = AdvancedFeatureEngineering()

## Create Streaming Transaction Data Source

Import the data generator and create a **streaming DataFrame** that continuously generates realistic transaction data.

**Note**: The generator is built on a PySpark Structured Streaming rate source, which makes it convenient for testing streaming feature engineering without an external data feed.
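
Spark's rate source emits rows with only two columns, `timestamp` and `value` (a monotonically increasing counter). A generator therefore has to derive every transaction field from those two values. The sketch below shows one way that mapping could look in plain Python. It is an assumption for illustration only: `synth_transaction`, the field names, and the category list are hypothetical and are not taken from `utils/data_generator.py`.

```python
import random

# Hypothetical category list, for illustration only
MERCHANT_CATEGORIES = ["grocery", "travel", "electronics", "restaurant"]


def synth_transaction(value: int, num_users: int, num_merchants: int) -> dict:
    """Derive one synthetic transaction from the rate source's `value` counter."""
    # Seeding with the counter makes each record reproducible for a given value
    rng = random.Random(value)
    return {
        "transaction_id": f"txn_{value:010d}",
        "user_id": f"user_{rng.randrange(num_users):04d}",
        "merchant_id": f"merchant_{rng.randrange(num_merchants):04d}",
        "merchant_category": rng.choice(MERCHANT_CATEGORIES),
        "amount": round(rng.uniform(1.0, 500.0), 2),
    }


sample = synth_transaction(42, num_users=10, num_merchants=20)
```

Deriving fields deterministically from the counter keeps test runs repeatable, which helps when comparing feature values across runs.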


In [0]:
# Import the data generator class
from utils.data_generator import TransactionDataGenerator

# Initialize generator
generator = TransactionDataGenerator()

# Create a STREAMING DataFrame
# This continuously generates transactions at the specified rate
df_streaming = generator.generate_transaction_data(
    num_users=10,           # 10 unique users
    num_merchants=20,       # 20 unique merchants  
    rows_per_second=5       # Generate 5 transactions per second
)

print("Created streaming data source")

print("\nStreaming DataFrame schema:")
df_streaming.printSchema()


## Streaming Feature Engineering with Real-Time Features

Apply all stateless feature engineering transformations to the streaming DataFrame.

This includes time-based, amount-based, merchant, location, device, and network features.
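
A transformation is stateless when its output depends only on the current row, so it needs no watermark or state store. As a sketch of that idea, the plain-Python function below classifies an IP address as private or public using the standard `ipaddress` module. The name `classify_ip` and the output keys are hypothetical; the network features produced by `apply_all_features` may be computed differently.

```python
import ipaddress


def classify_ip(ip: str) -> dict:
    """Classify a single IP address string; depends only on its input."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Malformed addresses are flagged instead of raising
        return {"ip_valid": False, "ip_is_private": None, "ip_version": None}
    return {
        "ip_valid": True,
        "ip_is_private": addr.is_private,
        "ip_version": addr.version,
    }


internal = classify_ip("10.0.0.5")
external = classify_ip("8.8.8.8")
```

Because the function has no memory of earlier rows, it could be applied to each micro-batch independently, for example wrapped as a UDF.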


In [0]:
# Apply ALL features to streaming DataFrame
df_with_features = feature_engineer.apply_all_features(df_streaming)

print("Features applied to streaming data")
print("\nSchema with engineered features:")
df_with_features.printSchema()

#display(df_with_features)

In [0]:
from utils.lakebase_client import LakebaseClient
import time

lakebase = LakebaseClient(**LAKEBASE_CONFIG)

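# Write to Lakebase PostgreSQL in micro-batches.
# Keep the query handle so the stream can be stopped later.
query = df_with_features.writeStream.foreachBatch(lakebase.write_streaming_batch).start()

print("Streaming to Lakebase PostgreSQL...")
# Uncomment to stop the stream after 60 seconds:
# time.sleep(60)
# query.stop()
# print("Done")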
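
`foreachBatch` gives at-least-once delivery by default, so a micro-batch can be replayed after a failure. One common way to keep the PostgreSQL write idempotent is an upsert keyed on a unique column. The sketch below builds such a statement. It is an assumption for illustration: `build_upsert_sql` is hypothetical, the key column `transaction_id` is assumed, and `lakebase.write_streaming_batch` may work differently.

```python
def build_upsert_sql(table: str, columns: list, key: str) -> str:
    """Build a parameterized PostgreSQL upsert statement.

    Table and column names are interpolated directly, so they must come
    from trusted code, never from user input. Values use %s placeholders.
    """
    col_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in columns if c != key)
    return (
        f"INSERT INTO {table} ({col_list}) VALUES ({placeholders}) "
        f"ON CONFLICT ({key}) DO UPDATE SET {updates}"
    )


# Hypothetical column names, for illustration only
sql = build_upsert_sql(
    "transaction_features",
    ["transaction_id", "amount", "amount_log"],
    key="transaction_id",
)
```

With a statement of this shape, replaying the same batch overwrites existing rows instead of inserting duplicates.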

In [0]:
from utils.lakebase_client import LakebaseClient
import time

lakebase = LakebaseClient(**LAKEBASE_CONFIG)
for_each_writer = lakebase.get_foreach_writer(
    creds=lakebase.get_credentials(), table_name="transaction_features", batch_size=2
)

# Write to Lakebase PostgreSQL record by record.
# Keep the query handle so the stream can be stopped later.
foreach_query = (
    df_with_features.writeStream
    .foreach(for_each_writer)
    .outputMode("update")
    .trigger(realTime="5 seconds")
    .start()
)

print("Streaming to Lakebase PostgreSQL...")
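
An object passed to `foreach` is expected to provide `open(partition_id, epoch_id)`, `process(row)`, and `close(error)` methods. The sketch below shows how a writer with a `batch_size` could buffer rows and flush them in small groups. `BufferedRowWriter` and its `sink` callable are hypothetical stand-ins; the writer returned by `lakebase.get_foreach_writer` may be implemented differently.

```python
class BufferedRowWriter:
    """Sketch of the foreach writer contract with small-batch buffering."""

    def __init__(self, sink, batch_size=2):
        self.sink = sink              # callable that receives a list of rows
        self.batch_size = batch_size
        self.buffer = []

    def open(self, partition_id, epoch_id):
        # Called once per partition and epoch; returning True accepts the rows
        self.buffer = []
        return True

    def process(self, row):
        # Called once per row; flush when the buffer reaches batch_size
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self._flush()

    def close(self, error):
        # Flush the remainder only if the partition finished without error
        if error is None:
            self._flush()

    def _flush(self):
        if self.buffer:
            self.sink(list(self.buffer))
            self.buffer.clear()


# Local usage with an in-memory sink instead of a database
written = []
writer = BufferedRowWriter(written.append, batch_size=2)
writer.open(partition_id=0, epoch_id=0)
for row in ["r1", "r2", "r3"]:
    writer.process(row)
writer.close(None)
```

In a real job Spark serializes the writer and runs it on executors, so the sink would need to be picklable and would typically open its database connection inside `open`.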

In [0]:
# Query features
stats = lakebase.get_table_stats()
print(f"Total rows: {stats['total_rows']:,}")

recent = lakebase.read_features('SELECT * FROM transaction_features ORDER BY timestamp DESC LIMIT 10')
display(recent)