# Setup and Configuration for the Real-Time Streaming Credit Card Fraud Feature Engineering Pipeline Demo

This notebook performs the initial setup and configuration for the Real-Time streaming feature engineering pipeline, which publishes features to a Databricks Lakebase PostgreSQL table.

## Prerequisites
- Databricks Runtime 17.3+ (with Spark 4.0+ for transformWithStateInPandas)
- A cluster configured
  - to support [Real-Time Streaming](https://docs.databricks.com/aws/en/structured-streaming/real-time#cluster-configuration)
  - with enough task slots/cores to meet the [cluster size requirements](https://docs.databricks.com/aws/en/structured-streaming/real-time#cluster-size-requirements)
- Databricks Python SDK 0.65.0 or above installed on the cluster
- dbldatagen library installed on the cluster
- Access to an existing Lakebase PostgreSQL instance

## Setup Tasks
1. **Import Required Libraries**: Import the notebook's dependencies and check the installed `databricks-sdk` and `dbldatagen` versions
2. **Configuration**: Set up Lakebase PostgreSQL connection
3. **Database Setup**: Create the unified `transaction_features` table
4. **Validation**: Test connection and verify table creation
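
Steps 2 to 4 above can be sketched as one small function. The three method names (`test_connection`, `create_feature_table`, `get_table_stats`) are the ones this notebook calls on `LakebaseClient`. Their return types, the `FeatureStoreClient` protocol, the `run_setup` helper, and the in-memory `FakeClient` are assumptions made only for illustration; the real client in `utils/lakebase_client.py` may behave differently.

```python
from typing import Protocol


class FeatureStoreClient(Protocol):
    """Assumed shape of the client, inferred from the calls in this notebook."""

    def test_connection(self) -> bool: ...
    def create_feature_table(self, table_name: str) -> None: ...
    def get_table_stats(self, table_name: str) -> dict: ...


def run_setup(client: FeatureStoreClient, table_name: str) -> int:
    """Connect, create the table, then verify it. Returns the row count."""
    if not client.test_connection():
        raise ConnectionError("Lakebase connection failed")
    client.create_feature_table(table_name)
    stats = client.get_table_stats(table_name)
    return stats["total_rows"]


class FakeClient:
    """Hypothetical in-memory stand-in so the flow runs without Lakebase."""

    def __init__(self, reachable: bool = True) -> None:
        self.reachable = reachable
        self.tables: dict = {}

    def test_connection(self) -> bool:
        return self.reachable

    def create_feature_table(self, table_name: str) -> None:
        # Creating an existing table is treated as a no-op (an assumption).
        self.tables.setdefault(table_name, [])

    def get_table_stats(self, table_name: str) -> dict:
        return {"total_rows": len(self.tables[table_name])}


# A freshly created table has no rows yet.
print(run_setup(FakeClient(), "transaction_features"))
```

Failing fast on the connection test keeps the table-creation step from producing a less direct error when the instance is unreachable.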

## What Gets Created
- **transaction_features table**: Stores both stateless and stateful fraud detection features

## Post-Setup
After running this notebook, proceed with:
1. Run the **01_generate_streaming_data** notebook to generate synthetic streaming credit card transaction data
2. Run the **02_streaming_fraud_detection_pipeline** notebook to generate streaming fraud detection features


In [0]:
# Import required libraries
from pyspark.sql.functions import *
from pyspark.sql.types import *
import logging
from datetime import datetime

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

In [0]:
# Validate that databricks-sdk >= 0.65.0 is installed to support the Lakebase SDK
%pip show databricks-sdk | grep -oP '(?<=Version: )\S+'

0.68.0
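
The shell check above only prints the installed version. It does not compare it with the 0.65.0 minimum from the prerequisites. A minimal sketch of that comparison in Python follows, using only the standard library. The helper names `parse_version`, `meets_minimum`, and `check_package` are hypothetical. The parser reads only the leading numeric part of a version string, so suffixes such as `post1` are ignored.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(text: str) -> tuple:
    """Turn the leading numeric part of '0.68.0' into (0, 68, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"Unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two versions numerically, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_package(name: str, minimum: str) -> bool:
    """Report whether an installed package satisfies the minimum version."""
    try:
        installed = version(name)
    except PackageNotFoundError:
        print(f"{name} is not installed")
        return False
    ok = meets_minimum(installed, minimum)
    print(f"{name} {installed} (minimum {minimum}): {'OK' if ok else 'too old'}")
    return ok


check_package("databricks-sdk", "0.65.0")
```

Comparing integer tuples rather than raw strings avoids the case where `"0.9.0"` sorts after `"0.65.0"` lexically.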


In [0]:
# Validate that dbldatagen is installed for Kafka data generation
import dbldatagen as dg
print("dbldatagen version:", dg.__version__)

dbldatagen version: 0.4.0post1


In [0]:
# Import Lakebase client
from utils.lakebase_client import LakebaseClient
from utils.config import Config

#initialize Config
config = Config()

print("Connecting to Lakebase PostgreSQL...\n")

# Initialize Lakebase client
lakebase = LakebaseClient(**config.lakebase_config)

# Test connection
print("Testing Lakebase connection...")
if lakebase.test_connection():
    print("Successfully connected to Lakebase PostgreSQL!")    
else:
    print("Failed to connect to Lakebase")
    print("  Please check:")
    print("  1. Lakebase instance is provisioned")
    print("  2. Instance name is correct")
    print("  3. Database name is correct")
    raise ConnectionError("Lakebase connection failed")

# Create unified feature table
print("\nCreating unified feature table in Lakebase...")
print("  Table: transaction_features")
print("  Includes: stateless + stateful fraud detection features")

lakebase.create_feature_table("transaction_features")

print("Table created successfully!")

# Verify table exists
print("\nVerifying table...")
try:
    stats_txn = lakebase.get_table_stats("transaction_features")
    print(f"  transaction_features: {stats_txn['total_rows']:,} rows")
except Exception as e:
    # Report the actual error instead of assuming the table is merely empty
    print(f"  Could not read table stats: {e}")

print("\n" + "="*60)
print("LAKEBASE POSTGRESQL SETUP COMPLETE")
print("="*60)
print("\nNext steps:")
print("  1. Run 01_generate_streaming_data notebook to generate synthetic streaming credit card transaction data")
print("  2. Run 02_streaming_fraud_detection_pipeline notebook to generate streaming fraud detection features")

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
File <command-6712667302074722>, line 2
      1 # Import Lakebase client
----> 2 from utils.lakebase_client import LakebaseClient
      3 from utils.config import Config
      5 #initialize Config

FileNotFoundError: [Errno 2] No such file or directory: '/Workspace/Users/jay.palaniappan@databricks.com/databricks-blogposts/2025-10-realtime-streaming-feature-enginnering-with-lakebase/utils/lakebase_client.py'
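
The traceback above shows the import failing because `utils/lakebase_client.py` does not exist at the expected workspace path. A small pre-flight check, sketched below, could report missing helper files before the import is attempted. The helper name `find_missing_files` is hypothetical. The file list assumes both modules are plain `.py` files in a `utils` folder, which is inferred from the import statements and the path in the error. Using the current working directory as the base also assumes the notebook runs from its own folder.

```python
from pathlib import Path


def find_missing_files(base_dir, relative_paths):
    """Return the relative paths that are not regular files under base_dir."""
    base = Path(base_dir)
    return [rel for rel in relative_paths if not (base / rel).is_file()]


# Assumed locations of the helper modules imported by this notebook.
REQUIRED = ["utils/lakebase_client.py", "utils/config.py"]

missing = find_missing_files(Path.cwd(), REQUIRED)
if missing:
    print("Missing helper modules:", ", ".join(missing))
else:
    print("All helper modules found")
```

If a file is reported missing, the usual causes are that the `utils` folder was not copied with the notebook or that the notebook runs from a different directory.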