# Feature Engineering Examples and Usage

This notebook demonstrates how to use the `AdvancedFeatureEngineering` class from `feature_engineering.py` for transaction data.

**Important**: 
- All feature engineering methods are implemented in `feature_engineering.py` (single source of truth)
- Sample data is generated using `TransactionDataGenerator` from `data_generator.py` 
- This notebook only demonstrates usage

## Prerequisites

Install `dbldatagen` for data generation:
```python
%pip install dbldatagen
dbutils.library.restartPython()
```

## Available Features

- **Time-based**: Hour, day of week, business hours, cyclical encoding
- **Amount-based**: Log transformations, categories, statistical features
- **Velocity**: Transaction counts/amounts over time windows
- **Behavioral**: User patterns, merchant switching
- **Location**: Distance calculations, velocity
- **Risk**: Risk scoring and anomaly detection
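
The implementations live in `feature_engineering.py` and are not reproduced in this notebook. As a rough illustration of three of the feature families listed above, the sketch below computes a cyclical hour encoding, a log-transformed amount, and a great-circle distance in plain Python. The function names and formulas are assumptions chosen for illustration; the class operates on Spark DataFrames rather than Python scalars and may compute these features differently.

```python
import math

def cyclical_hour(hour):
    """Map an hour (0-23) onto the unit circle so that 23:00 and 00:00 are neighbours."""
    angle = 2 * math.pi * (hour % 24) / 24
    return math.sin(angle), math.cos(angle)

def log_amount(amount):
    """Compress a skewed, non-negative amount with log(1 + x), which is defined at 0."""
    return math.log1p(amount)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    earth_radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

# Hypothetical sample values, for illustration only
hour_sin, hour_cos = cyclical_hour(23)
amount_log = log_amount(250.0)
distance_km = haversine_km(40.7128, -74.0060, 34.0522, -118.2437)
```

The sine/cosine pair avoids the artificial jump between hour 23 and hour 0 that a raw integer hour would introduce.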


In [0]:
dbutils.library.restartPython()

In [0]:
# Import necessary libraries
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import *
from pyspark.sql.types import *
import logging

# Import the AdvancedFeatureEngineering class from feature_engineering.py
from feature_engineering import AdvancedFeatureEngineering

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

print(f"✅ Spark version: {spark.version}")
print("✅ Feature engineering module imported from feature_engineering.py")

# Initialize the AdvancedFeatureEngineering class
feature_engineer = AdvancedFeatureEngineering()

## Create Streaming Transaction Data Source

Import the data generator and create a **streaming DataFrame** that continuously generates realistic transaction data.

**Note**: This uses PySpark Structured Streaming with a rate source, which generates rows internally and is therefore convenient for testing streaming feature engineering without an external data feed.
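
For context, the rate source is a built-in Structured Streaming source that emits rows with a `timestamp` column and a monotonically increasing `value` column at a configurable rate. The sketch below shows one way synthetic transactions could be derived from it. It is not the implementation of `TransactionDataGenerator`; the helper name `build_rate_stream` and the columns `user_id`, `merchant_id`, and `amount` are assumptions made for illustration.

```python
def build_rate_stream(spark, rows_per_second=5, num_users=10, num_merchants=20):
    """Hypothetical helper: derive synthetic transactions from Spark's rate source."""
    raw = (
        spark.readStream
        .format("rate")                            # built-in test source
        .option("rowsPerSecond", rows_per_second)  # rows generated per second
        .load()                                    # columns: timestamp, value
    )
    # Column names below are illustrative assumptions, not the generator's schema.
    return raw.selectExpr(
        "timestamp",
        f"concat('user_', cast(value % {num_users} AS string)) AS user_id",
        f"concat('merchant_', cast(pmod(hash(value), {num_merchants}) AS string)) AS merchant_id",
        "round(rand() * 500, 2) AS amount",
    )
```

In a Databricks notebook, `build_rate_stream(spark)` would return a DataFrame whose `isStreaming` property is `True`, and it can be passed to `display()` like any other streaming DataFrame.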


In [0]:
# Import the data generator class
from data_generator import TransactionDataGenerator

# Initialize generator
generator = TransactionDataGenerator()

# Create a STREAMING DataFrame
# This continuously generates transactions at the specified rate
df_streaming = generator.generate_transaction_data(
    num_users=10,           # 10 unique users
    num_merchants=20,       # 20 unique merchants  
    rows_per_second=5       # Generate 5 transactions per second
)

print("✅ Created streaming data source")

print("\n📋 Streaming DataFrame Schema:")
df_streaming.printSchema()


## Streaming Feature Engineering with Real-Time Features

The cell below calls `apply_all_features()` on the **streaming DataFrame** created above and displays the result as it updates.

To extract only time-related features, use `create_time_based_features()` instead.


In [0]:
# Apply ALL features to streaming DataFrame
df_with_features = feature_engineer.apply_all_features(df_streaming)

print("✅ Features applied to streaming data")
print("\n📋 Schema with engineered features:")
df_with_features.printSchema()

display(df_with_features)

In [None]:
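# Write to Lakebase PostgreSQL
# NOTE: `lakebase` must already be an initialized Lakebase client;
# it is not defined in the cells shown here.
import time

query = feature_engineer.write_features_to_lakebase(
    df=df_with_features,
    lakebase_client=lakebase,
    table_name="transaction_features"
)

print("🚀 Streaming to Lakebase PostgreSQL...")
time.sleep(60)  # let the stream run for 60 seconds
query.stop()
print("✅ Done")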
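
A fixed `time.sleep` followed by `stop()` leaves the query running if the cell is interrupted or raises in between. Assuming `write_features_to_lakebase` returns a standard PySpark `StreamingQuery` (the call to `query.stop()` suggests so, but this is not confirmed here), a bounded run can be sketched with `awaitTermination(timeout=...)` inside `try`/`finally`. The helper name `run_stream_for` is hypothetical.

```python
def run_stream_for(query, seconds):
    """Let a streaming query run for at most `seconds`, then always stop it.

    Returns the result of awaitTermination: True if the query ended on its
    own within the timeout, False if it was still running.
    """
    try:
        finished = query.awaitTermination(timeout=seconds)
    finally:
        # Runs on normal completion, on errors, and on keyboard interrupts.
        query.stop()
    return finished
```

With this helper, the sleep-and-stop lines in the cell above could be replaced by `run_stream_for(query, 60)`.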

In [None]:
# Query features
stats = lakebase.get_table_stats()
print(f"📊 Total rows: {stats['total_rows']:,}")

recent = lakebase.read_features('SELECT * FROM transaction_features ORDER BY timestamp DESC LIMIT 10')
display(recent)