# Setup and Configuration for Streaming Feature Engineering Pipeline

This notebook handles the initial setup and configuration for the streaming feature engineering pipeline with Lakebase.  

## Prerequisites
- Databricks Runtime 17.3+
- Databricks Python SDK 0.65.0 or later installed on the cluster (required for the Lakebase APIs)
- Access to an existing Lakebase Postgres Database
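The SDK version requirement above can be checked before running the rest of the notebook. The following is a minimal sketch, not part of the original notebook: the helper names `parse_version`, `meets_minimum`, and `check_sdk` are hypothetical, and it assumes the SDK is installed under its PyPI distribution name `databricks-sdk`.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '0.68.0' into (0, 68, 0); a non-numeric suffix such as 'rc1' is ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, required="0.65.0"):
    """Compare two dotted version strings numerically, component by component."""
    return parse_version(installed) >= parse_version(required)


def check_sdk(package="databricks-sdk", required="0.65.0"):
    """Return (ok, installed_version); installed_version is None if the package is missing."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return False, None
    return meets_minimum(installed, required), installed
```

Calling `check_sdk()` in the first cell returns a pair such as `(True, "0.68.0")` when the requirement is met. If it is not, one option is to run `%pip install "databricks-sdk>=0.65.0"` and restart the Python process.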


## Setup Tasks
1. **Configuration**: Set up the Lakebase connection
2. **Database Setup**: Create the feature tables (`transaction_features` and `fraud_features`) that store the features
3. **Validation**: Test all components and connections

## Post-Setup
After running this notebook, you can proceed with:
- `01_streaming_features.ipynb` - Streaming feature engineering pipeline that walks through real-time feature engineering and publishes the features to Lakebase


In [0]:
# Import required libraries
# Note: 'spark' session is already available in Databricks
from pyspark.sql import Window
from pyspark.sql.functions import *
from pyspark.sql.types import *
from delta.tables import DeltaTable
import logging
from datetime import datetime

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
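With the root logger at INFO, Py4J's own loggers (for example `py4j.clientserver`) may also emit INFO lines into cell output. A minimal sketch of one way to quiet them while keeping INFO messages from the notebook's own modules; the choice of `WARNING` as the threshold is an assumption, not something the original notebook sets.

```python
import logging

# Same root configuration as the cell above
logging.basicConfig(level=logging.INFO)

# Raise the threshold for the whole "py4j" logger hierarchy only.
# Child loggers such as "py4j.clientserver" inherit this level.
logging.getLogger("py4j").setLevel(logging.WARNING)
```

Loggers outside the `py4j` hierarchy, such as `utils.lakebase_client`, are unaffected and continue to log at INFO.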

In [0]:

# Import Lakebase client
from utils.lakebase_client import LakebaseClient

# Get OAuth token for authentication
# Note: `token` is not passed to LakebaseClient below; the client is built
# from LAKEBASE_CONFIG only.
token = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()

# Alternatively, read credentials from a secret scope (recommended for production)
# token = dbutils.secrets.get(scope="lakebase", key="token")
# host = dbutils.secrets.get(scope="lakebase", key="host")

# Lakebase connection configuration
LAKEBASE_CONFIG = {
    "instance_name": "rtm-lakebase-demo",
    "database": "databricks_postgres"
}

print("Connecting to Lakebase PostgreSQL...\n")

# Initialize Lakebase client
lakebase = LakebaseClient(**LAKEBASE_CONFIG)

# Test connection
print("Testing Lakebase connection...")
if lakebase.test_connection():
    print("Successfully connected to Lakebase PostgreSQL!")
else:
    print("Failed to connect to Lakebase")
    print("  Please check:")
    print("  1. Lakebase instance is provisioned")
    print("  2. Host is correct")
    print("  3. OAuth token is valid")
    # Raise a specific built-in exception type instead of the generic Exception
    raise ConnectionError("Lakebase connection failed")

# Create feature tables
print("\nCreating unified feature tables in Lakebase...")
print("  • transaction_features (for stateless features)")
print("  • fraud_features (for stateful fraud detection)")

lakebase.create_feature_table("transaction_features")
lakebase.create_feature_table("fraud_features")

print("Tables created successfully!")

# Verify tables exist
print("\nVerifying tables...")
try:
    stats_txn = lakebase.get_table_stats("transaction_features")
    stats_fraud = lakebase.get_table_stats("fraud_features")
    print(f"  transaction_features: {stats_txn['total_rows']:,} rows")
    print(f"  fraud_features: {stats_fraud['total_rows']:,} rows")
except Exception as e:
    # Report the actual failure; an error here does not imply the tables are merely empty
    print(f"  Could not read table stats: {e}")

print("\n" + "="*60)
print("LAKEBASE POSTGRESQL SETUP COMPLETE")
print("="*60)
print("\nNext steps:")
print("  Run streaming_fraud_detection_pipeline.ipynb")


Connecting to Lakebase PostgreSQL...

Testing Lakebase connection...
0.68.0


INFO:utils.lakebase_client:Lakebase connection test successful


Successfully connected to Lakebase PostgreSQL!

Creating unified feature tables in Lakebase...
  • transaction_features (for stateless features)
  • fraud_features (for stateful fraud detection)
0.68.0


INFO:utils.lakebase_client:Created unified feature table: transaction_features (~70+ columns)


0.68.0


INFO:utils.lakebase_client:Created unified feature table: fraud_features (~70+ columns)


Tables created successfully!

Verifying tables...
0.68.0


INFO:utils.lakebase_client:Table stats: 0 rows


0.68.0


INFO:utils.lakebase_client:Table stats: 0 rows


  transaction_features: 0 rows
  fraud_features: 0 rows

LAKEBASE POSTGRESQL SETUP COMPLETE

Next steps:
  Run streaming_fraud_detection_pipeline.ipynb
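
The table verification step can also be written so that a failed stats query is reported separately from an empty table. This is a minimal sketch, assuming only that the client exposes `get_table_stats(name)` returning a dict with a `total_rows` key, as used in this notebook. The helper name `collect_row_counts` is hypothetical.

```python
def collect_row_counts(client, table_names):
    """Return (counts, errors): row counts for readable tables, error text for the rest."""
    counts, errors = {}, {}
    for name in table_names:
        try:
            counts[name] = client.get_table_stats(name)["total_rows"]
        except Exception as exc:
            # Keep the failure visible instead of treating it as "empty table"
            errors[name] = repr(exc)
    return counts, errors
```

For example, `counts, errors = collect_row_counts(lakebase, ["transaction_features", "fraud_features"])` keeps a count of 0 for a newly created table in `counts`, while a missing table or a broken connection appears in `errors`.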
