# Setup and Configuration for Streaming Feature Engineering Pipeline

This notebook handles the initial setup and configuration for the streaming feature engineering pipeline with Databricks Lakebase PostgreSQL.

## Prerequisites
- Databricks Runtime 17.3+ (Spark 4.0+, required for `transformWithStateInPandas`)
- Databricks Python SDK 0.65.0 or above installed on the cluster
- Access to an existing Lakebase PostgreSQL instance
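
The SDK version requirement above can be checked before the rest of the notebook runs. The following is a minimal sketch, assuming the SDK is installed under the distribution name `databricks-sdk`; the helper `version_at_least` is a hypothetical name introduced here for illustration.

```python
import re
from importlib import metadata

MIN_SDK_VERSION = "0.65.0"  # minimum taken from the prerequisites list


def version_at_least(installed: str, minimum: str) -> bool:
    """Compare the leading numeric (major, minor, patch) parts of two version strings."""
    def parse(version: str) -> tuple:
        nums = [int(part) for part in re.findall(r"\d+", version)[:3]]
        nums += [0] * (3 - len(nums))  # pad short versions such as "1.0"
        return tuple(nums)
    return parse(installed) >= parse(minimum)


try:
    # Assumption: the SDK's distribution name is "databricks-sdk"
    installed = metadata.version("databricks-sdk")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("databricks-sdk is not installed in this environment")
elif version_at_least(installed, MIN_SDK_VERSION):
    print(f"databricks-sdk {installed} meets the {MIN_SDK_VERSION} minimum")
else:
    print(f"databricks-sdk {installed} is older than {MIN_SDK_VERSION}; upgrade before continuing")
```

Comparing integer tuples rather than raw strings matters here: as strings, `"0.9.0"` would sort above `"0.65.0"`.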

## Setup Tasks
1. **Import Libraries**: Import required dependencies
2. **Configuration**: Set up Lakebase PostgreSQL connection
3. **Database Setup**: Create the unified `transaction_features` table (~70+ columns)
4. **Validation**: Test connection and verify table creation
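
The tasks above run in order, and a failed connection stops the run before any table is created. The sketch below shows that control flow using the three client methods this notebook calls (`test_connection`, `create_feature_table`, `get_table_stats`). `InMemoryClient` and `run_setup` are hypothetical names used so the example runs without a Lakebase instance; they are not part of `utils.lakebase_client`.

```python
class InMemoryClient:
    """Hypothetical stand-in that mimics the client methods used in this notebook."""

    def __init__(self, reachable: bool = True):
        self.reachable = reachable
        self.tables = {}

    def test_connection(self) -> bool:
        return self.reachable

    def create_feature_table(self, name: str) -> None:
        # Idempotent: an existing table is left untouched
        self.tables.setdefault(name, [])

    def get_table_stats(self, name: str) -> dict:
        return {"total_rows": len(self.tables[name])}


def run_setup(client, table_name: str) -> int:
    """Validate the connection, create the table, and return its row count."""
    if not client.test_connection():
        raise ConnectionError("Lakebase connection failed")
    client.create_feature_table(table_name)
    return client.get_table_stats(table_name)["total_rows"]


rows = run_setup(InMemoryClient(), "transaction_features")
print(f"transaction_features: {rows:,} rows")
```

Wrapping the steps in one function makes the fail-fast behavior easy to exercise against a stub before pointing it at a real instance.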

## What Gets Created
- **transaction_features table**: Stores both stateless and stateful fraud detection features
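
To illustrate the distinction: a stateless feature is computed from a single transaction, while a stateful feature also depends on earlier transactions. The sketch below is plain Python, not the streaming implementation, and the feature names (`amount_log`, `hour_of_day`, `txn_count_1h`, `amount_sum_1h`) are hypothetical examples rather than actual columns of `transaction_features`.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Txn:
    user_id: str
    amount: float
    ts: datetime


def stateless_features(txn: Txn) -> dict:
    """Features derived from the current transaction alone."""
    return {"amount_log": math.log1p(txn.amount), "hour_of_day": txn.ts.hour}


def stateful_features(txn: Txn, history: list, window: timedelta = timedelta(hours=1)) -> dict:
    """Features that need the same user's earlier transactions within the window."""
    recent = [
        h for h in history
        if h.user_id == txn.user_id and timedelta(0) <= txn.ts - h.ts <= window
    ]
    return {
        "txn_count_1h": len(recent),
        "amount_sum_1h": sum(h.amount for h in recent),
    }


now = datetime(2024, 1, 6, 12, 0)
history = [
    Txn("u1", 20.0, now - timedelta(minutes=30)),  # inside the window
    Txn("u1", 50.0, now - timedelta(hours=3)),     # too old
    Txn("u2", 5.0, now - timedelta(minutes=15)),   # different user
]
current = Txn("u1", 100.0, now)

features = {**stateless_features(current), **stateful_features(current, history)}
print(features)
```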

## Post-Setup
After running this notebook, proceed with:
- `01_streaming_fraud_detection_pipeline.ipynb` - End-to-end streaming fraud detection pipeline


In [0]:
# Import required libraries
import logging
from datetime import datetime

# Namespaced imports avoid shadowing Python builtins such as sum, min, and max
from pyspark.sql import functions as F
from pyspark.sql import types as T

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

In [0]:
# Import Lakebase client
from utils.lakebase_client import LakebaseClient

# Authentication: Databricks handles the OAuth token automatically, so none is fetched here.
# In production, if you manage credentials yourself, read them from a secret scope, e.g.:
# token = dbutils.secrets.get(scope="lakebase", key="token")

# Lakebase connection configuration
LAKEBASE_CONFIG = {
    "instance_name": "rtm-lakebase-demo",
    "database": "databricks_postgres"
}

print("Connecting to Lakebase PostgreSQL...\n")

# Initialize Lakebase client
lakebase = LakebaseClient(**LAKEBASE_CONFIG)

# Test connection
print("Testing Lakebase connection...")
if lakebase.test_connection():
    print("Successfully connected to Lakebase PostgreSQL!")
else:
    print("Failed to connect to Lakebase")
    print("  Please check:")
    print("  1. Lakebase instance is provisioned")
    print("  2. Instance name is correct")
    print("  3. Database name is correct")
    # Raise a specific exception type so callers can handle connection failures explicitly
    raise ConnectionError("Lakebase connection failed")

# Create unified feature table
print("\nCreating unified feature table in Lakebase...")
print("  Table: transaction_features (~70+ columns)")
print("  Includes: stateless + stateful fraud detection features")

lakebase.create_feature_table("transaction_features")

print("Table created successfully!")

# Verify table exists
print("\nVerifying table...")
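try:
    stats_txn = lakebase.get_table_stats("transaction_features")
    print(f"  transaction_features: {stats_txn['total_rows']:,} rows")
except Exception as e:
    # Do not assume the table is merely empty; surface the actual failure
    logger.warning("Could not read stats for transaction_features: %s", e)
    print("  Could not verify transaction_features; see the log for details")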

print("\n" + "="*60)
print("LAKEBASE POSTGRESQL SETUP COMPLETE")
print("="*60)
print("\nNext steps:")
print("  1. Run 01_streaming_fraud_detection_pipeline.ipynb")
print("  2. Features will be written to: transaction_features table")
print("  3. Target query latency: <10 ms for real-time serving")


INFO:py4j.clientserver:Received command c on object id p0


Connecting to Lakebase PostgreSQL...

Testing Lakebase connection...
0.68.0


INFO:utils.lakebase_client:Lakebase connection test successful


Successfully connected to Lakebase PostgreSQL!

Creating unified feature tables in Lakebase...
  • transaction_features (for features)
0.68.0


INFO:utils.lakebase_client:Created unified feature table: transaction_features (~70+ columns)


Tables created successfully!

Verifying tables...
0.68.0


INFO:utils.lakebase_client:Table stats: 0 rows


  transaction_features: 0 rows

LAKEBASE POSTGRESQL SETUP COMPLETE

Next steps:
  Run 01_streaming_fraud_detection_pipeline.ipynb
