# Financial Services: Delta Liquid Clustering Demo


## Overview


This notebook demonstrates **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using a financial services analytics use case. Liquid clustering organizes the data layout around the clustering columns you choose, improving query performance without manual partitioning or Z-Ordering.

### What is Liquid Clustering?

Liquid clustering groups rows with similar values in the clustering columns you define into the same files. The layout is maintained during data ingestion and maintenance operations, providing:

- **Automatic optimization**: No manual tuning required
- **Improved query performance**: Faster queries on clustered columns
- **Reduced maintenance**: No need for manual repartitioning
- **Adaptive clustering**: Adjusts as data patterns change

### Use Case: Transaction Fraud Detection and Customer Analytics

We'll analyze financial transaction records from a bank. Our clustering strategy will optimize for:

- **Customer-specific queries**: Fast lookups by account ID
- **Time-based analysis**: Efficient filtering by transaction date
- **Fraud pattern detection**: Quick aggregation by transaction type and risk scores

## Step 1: AIDP Environment Setup

This notebook leverages the existing Spark session in your AIDP environment.

In [None]:
# Create financial services catalog and analytics schema
# In AIDP, catalogs provide data isolation and governance
spark.sql("CREATE CATALOG IF NOT EXISTS finance")
spark.sql("CREATE SCHEMA IF NOT EXISTS finance.analytics")

print("Financial services catalog and analytics schema created successfully!")

## Step 2: Create Delta Table with Liquid Clustering

### Table Design

Our `account_transactions` table will store:

- **account_id**: Unique account identifier
- **transaction_date**: Date and time of transaction
- **transaction_type**: Type (Deposit, Withdrawal, Transfer, Payment, etc.)
- **amount**: Transaction amount
- **merchant_category**: Merchant type (Retail, Restaurant, Online, etc.)
- **location**: Transaction location
- **risk_score**: Fraud risk assessment (0-100)

### Clustering Strategy

We'll cluster by `account_id` and `transaction_date` because:

- **account_id**: Each customer has many transactions, so clustering on this column keeps an account's activity together
- **transaction_date**: Time-based queries are critical for fraud detection, spending analysis, and regulatory reporting
- This combination optimizes for both customer account analysis and temporal fraud pattern detection

In [None]:
# Create Delta table with liquid clustering
# CLUSTER BY defines the columns for automatic optimization
spark.sql("""
CREATE TABLE IF NOT EXISTS finance.analytics.account_transactions (
    account_id STRING,
    transaction_date TIMESTAMP,
    transaction_type STRING,
    amount DECIMAL(15,2),
    merchant_category STRING,
    location STRING,
    risk_score INT
)
USING DELTA
CLUSTER BY (account_id, transaction_date)
""")

print("Delta table with liquid clustering created successfully!")
print("Clustering will automatically optimize data layout for queries on account_id and transaction_date.")

Delta table with liquid clustering created successfully!
Clustering will automatically optimize data layout for queries on account_id and transaction_date.


## Step 3: Generate Financial Services Sample Data

### Data Generation Strategy

We'll create realistic financial transaction data including:

- **5,000 accounts** with multiple transactions over time
- **Transaction types**: Deposits, withdrawals, transfers, payments, ATM withdrawals
- **Temporal spread**: Timestamps drawn uniformly across 2024 at hourly granularity (no weekday or weekend weighting)
- **Merchant categories**: Retail, restaurants, online shopping, utilities, entertainment

### Why This Data Pattern?

This data simulates real financial scenarios where:

- Customers perform multiple transactions daily/weekly
- Fraud patterns emerge over time
- Regulatory reporting requires temporal analysis
- Risk scoring enables real-time fraud prevention
- Customer spending analysis drives personalized financial services
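
The generator in the next cell calls the module-level `random` functions without a seed, so each run produces different records. The following is a minimal sketch of one way to make a simplified version of that logic reproducible. The helper name `make_transactions` and the seed value are assumptions for illustration, not part of the notebook.

```python
import random
from datetime import datetime, timedelta

TRANSACTION_TYPES = ["Deposit", "Withdrawal", "Transfer", "Payment", "ATM"]


def make_transactions(num_accounts, seed=42):
    """Hypothetical helper: seeded, simplified version of the generator."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    base_date = datetime(2024, 1, 1)
    rows = []
    for account_num in range(1, num_accounts + 1):
        account_id = f"ACC{account_num:08d}"
        for _ in range(rng.randint(10, 50)):
            when = base_date + timedelta(days=rng.randint(0, 365),
                                         hours=rng.randint(0, 23))
            tx_type = rng.choice(TRANSACTION_TYPES)
            # Inflows are positive, outflows negative, as in the notebook.
            if tx_type in ("Deposit", "Transfer"):
                amount = round(rng.uniform(100, 10000), 2)
            elif tx_type == "ATM":
                amount = -round(rng.uniform(20, 500), 2)
            else:
                amount = -round(rng.uniform(10, 2000), 2)
            rows.append({
                "account_id": account_id,
                "transaction_date": when,
                "transaction_type": tx_type,
                "amount": amount,
                "risk_score": rng.randint(0, 100),
            })
    return rows


first = make_transactions(3)
second = make_transactions(3)
print(first == second)  # identical seed, identical records
```

Using a dedicated `random.Random` instance rather than `random.seed()` keeps the reproducibility local to the generator and avoids affecting other code that relies on the global random state.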

In [None]:
# Generate sample financial transaction data
import random
from datetime import datetime, timedelta

# Define financial data constants
TRANSACTION_TYPES = ['Deposit', 'Withdrawal', 'Transfer', 'Payment', 'ATM']
MERCHANT_CATEGORIES = ['Retail', 'Restaurant', 'Online', 'Utilities', 'Entertainment', 'Groceries', 'Healthcare', 'Transportation']
LOCATIONS = ['New York, NY', 'Los Angeles, CA', 'Chicago, IL', 'Houston, TX', 'Miami, FL', 'Online', 'ATM']

# Generate account transaction records
transaction_data = []
base_date = datetime(2024, 1, 1)

# Create 5,000 accounts with 10-50 transactions each
for account_num in range(1, 5001):
    account_id = f"ACC{account_num:08d}"

    # Each account gets 10-50 transactions over 12 months
    num_transactions = random.randint(10, 50)

    for i in range(num_transactions):
        # Spread transactions uniformly over 2024 (day offset 0-365, hour 0-23)
        days_offset = random.randint(0, 365)
        hours_offset = random.randint(0, 23)
        transaction_date = base_date + timedelta(days=days_offset, hours=hours_offset)

        # Select transaction type
        transaction_type = random.choice(TRANSACTION_TYPES)

        # Amount based on transaction type (inflows positive, outflows negative)
        if transaction_type in ['Deposit', 'Transfer']:
            amount = round(random.uniform(100, 10000), 2)
        elif transaction_type == 'ATM':
            amount = round(random.uniform(20, 500), 2) * -1
        else:
            amount = round(random.uniform(10, 2000), 2) * -1

        # Select merchant category and location
        merchant_category = random.choice(MERCHANT_CATEGORIES)
        if transaction_type == 'ATM':
            location = 'ATM'
        # NOTE: 'Online' is a merchant category, not a transaction type,
        # so this branch is never taken as written.
        elif transaction_type == 'Online':
            location = 'Online'
        else:
            location = random.choice(LOCATIONS)

        # Risk score (0-100, higher = more suspicious), drawn independently of all other fields
        risk_score = random.randint(0, 100)

        transaction_data.append({
            "account_id": account_id,
            "transaction_date": transaction_date,
            "transaction_type": transaction_type,
            "amount": amount,
            "merchant_category": merchant_category,
            "location": location,
            "risk_score": risk_score
        })

print(f"Generated {len(transaction_data)} account transaction records")
print("Sample record:", transaction_data[0])

Generated 149912 account transaction records
Sample record: {'account_id': 'ACC00000001', 'transaction_date': datetime.datetime(2024, 6, 2, 10, 0), 'transaction_type': 'Deposit', 'amount': 6959.24, 'merchant_category': 'Online', 'location': 'ATM', 'risk_score': 71}


## Step 4: Insert Data Using PySpark

### Data Insertion Strategy

We'll use PySpark to:

1. **Create DataFrame** from our generated data
2. **Insert into Delta table** with liquid clustering
3. **Verify the insertion** with a sample query

### Why PySpark for Insertion?

- **Distributed processing**: Handles large datasets efficiently
- **Type safety**: Ensures data integrity
- **Optimization**: Leverages Spark's query optimization
- **Liquid clustering**: Automatically applies clustering during insertion
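
When `spark.createDataFrame` receives plain dictionaries it infers the schema: the printed schema later in this step shows alphabetical column order, `amount` as `double`, and `risk_score` as `long`, which differs from the `DECIMAL(15,2)` and `INT` columns declared in Step 2. The sketch below shows one way to align the records with the declared table before writing. The helper names `to_table_row` and `append_transactions` are hypothetical, and whether an append is preferable to the overwrite used in this notebook depends on your workflow.

```python
from decimal import Decimal

# Column order of the CREATE TABLE statement in Step 2.
TABLE_COLUMNS = ["account_id", "transaction_date", "transaction_type",
                 "amount", "merchant_category", "location", "risk_score"]


def to_table_row(record):
    """Convert one generated dict into a tuple matching the table definition."""
    fixed = dict(record)
    # DECIMAL(15,2) columns expect decimal.Decimal values, not floats.
    fixed["amount"] = Decimal(str(record["amount"])).quantize(Decimal("0.01"))
    fixed["risk_score"] = int(record["risk_score"])
    return tuple(fixed[name] for name in TABLE_COLUMNS)


def append_transactions(spark, records,
                        table="finance.analytics.account_transactions"):
    """Hypothetical helper: build a typed DataFrame and append it to the table."""
    # Imported lazily so the conversion helper above works without PySpark.
    from pyspark.sql.types import (DecimalType, IntegerType, StringType,
                                   StructField, StructType, TimestampType)
    schema = StructType([
        StructField("account_id", StringType()),
        StructField("transaction_date", TimestampType()),
        StructField("transaction_type", StringType()),
        StructField("amount", DecimalType(15, 2)),
        StructField("merchant_category", StringType()),
        StructField("location", StringType()),
        StructField("risk_score", IntegerType()),
    ])
    df = spark.createDataFrame([to_table_row(r) for r in records], schema=schema)
    df.write.mode("append").saveAsTable(table)
    return df
```

In the notebook this could be called as `append_transactions(spark, transaction_data)`. Declaring the schema explicitly keeps the written data consistent with the table definition instead of relying on inference.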

In [None]:
# Insert data using PySpark DataFrame operations

# Create DataFrame from generated data
# Note: the schema is inferred from the dicts (amount as double, risk_score as long;
# see the printed schema), which differs from the DECIMAL/INT columns declared in Step 2.
df_transactions = spark.createDataFrame(transaction_data)

# Display schema and sample data
print("DataFrame Schema:")
df_transactions.printSchema()

print("\nSample Data:")
df_transactions.show(5)

# Insert data into Delta table with liquid clustering
# The CLUSTER BY (account_id, transaction_date) will automatically optimize the data layout
df_transactions.write.mode("overwrite").saveAsTable("finance.analytics.account_transactions")

print(f"\nSuccessfully inserted {df_transactions.count()} records into finance.analytics.account_transactions")
print("Liquid clustering automatically optimized the data layout during insertion!")

DataFrame Schema:
root
 |-- account_id: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- location: string (nullable = true)
 |-- merchant_category: string (nullable = true)
 |-- risk_score: long (nullable = true)
 |-- transaction_date: timestamp (nullable = true)
 |-- transaction_type: string (nullable = true)


Sample Data:


+-----------+-------+------------+-----------------+----------+-------------------+----------------+
| account_id| amount|    location|merchant_category|risk_score|   transaction_date|transaction_type|
+-----------+-------+------------+-----------------+----------+-------------------+----------------+
|ACC00000001|6959.24|         ATM|           Online|        71|2024-06-02 10:00:00|         Deposit|
|ACC00000001| 171.51|         ATM|    Entertainment|        67|2024-04-21 17:00:00|         Deposit|
|ACC00000001|4679.81| Chicago, IL|       Restaurant|        67|2024-05-11 20:00:00|         Deposit|
|ACC00000001|8596.92|         ATM|       Restaurant|         5|2024-07-25 19:00:00|         Deposit|
|ACC00000001|-999.96|New York, NY|    Entertainment|        61|2024-09-18 01:00:00|      Withdrawal|
+-----------+-------+------------+-----------------+----------+-------------------+----------------+
only showing top 5 rows




Successfully inserted 149912 records into finance.analytics.account_transactions
Liquid clustering automatically optimized the data layout during insertion!


## Step 5: Demonstrate Liquid Clustering Benefits

### Query Performance Analysis

Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:

1. **Account transaction history** (clustered by account_id)
2. **Time-based fraud analysis** (clustered by transaction_date)
3. **Combined account + time queries** (optimal for our clustering)
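
Query 2 in the cell below filters with `DATE(transaction_date) = CURRENT_DATE`. Wrapping a clustered column in a function can make it harder for some engines to skip files using column statistics, so a half-open range on the raw column is a common alternative. The sketch below builds such a predicate; the helper name `day_range_predicate` is hypothetical, and the date is pinned inside 2024 because the synthetic data only covers that year.

```python
from datetime import date, datetime, timedelta


def day_range_predicate(day, column="transaction_date"):
    """Return a half-open range filter covering one calendar day."""
    start = datetime(day.year, day.month, day.day)
    end = start + timedelta(days=1)
    return (f"{column} >= TIMESTAMP '{start:%Y-%m-%d %H:%M:%S}' "
            f"AND {column} < TIMESTAMP '{end:%Y-%m-%d %H:%M:%S}'")


predicate = day_range_predicate(date(2024, 6, 2))
print(predicate)
```

The resulting string can be placed in the `WHERE` clause of a `spark.sql` query together with the `risk_score > 70` condition.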

### Expected Performance Benefits

With liquid clustering, queries that filter on the clustering columns should read fewer files because:

- **Data locality**: Related records are physically grouped together
- **Reduced I/O**: Less data needs to be read from disk
- **Automatic optimization**: No manual tuning required
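
This notebook does not time its queries, so the expected benefits are not measured here. A minimal timing sketch, assuming a hypothetical helper `time_action`, could be used to compare a query before and after clustering or against an unclustered copy of the table.

```python
import statistics
import time


def time_action(action, repeats=3):
    """Run a zero-argument callable several times; return the median seconds."""
    durations = []
    for _ in range(repeats):
        started = time.perf_counter()
        action()
        durations.append(time.perf_counter() - started)
    return statistics.median(durations)


# Stand-in workload so the sketch runs anywhere. In the notebook you might pass
# something like: lambda: spark.sql(query).collect()
elapsed = time_action(lambda: sum(range(100_000)))
print(f"median: {elapsed:.6f} s")
```

The median of several runs is less sensitive to warm-up and caching effects than a single measurement, although on a 150,000-row table the differences may be too small to observe reliably.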

In [None]:
# Demonstrate liquid clustering benefits with optimized queries

# Query 1: Account transaction history - benefits from account_id clustering
print("=== Query 1: Account Transaction History ===")
account_history = spark.sql("""
SELECT account_id, transaction_date, transaction_type, amount, merchant_category
FROM finance.analytics.account_transactions
WHERE account_id = 'ACC00000001'
ORDER BY transaction_date DESC
""")

account_history.show()
print(f"Records found: {account_history.count()}")

# Query 2: Time-based fraud analysis - benefits from transaction_date clustering
# Note: the synthetic data covers 2024 only, so filtering on CURRENT_DATE
# returns no rows when the notebook is run outside that year.
print("\n=== Query 2: High-Risk Transactions Today ===")
high_risk_today = spark.sql("""
SELECT transaction_date, account_id, transaction_type, amount, risk_score
FROM finance.analytics.account_transactions
WHERE DATE(transaction_date) = CURRENT_DATE AND risk_score > 70
ORDER BY risk_score DESC, transaction_date DESC
""")

high_risk_today.show()
print(f"High-risk transactions found: {high_risk_today.count()}")

# Query 3: Combined account + time query - optimal for our clustering strategy
# LIKE 'ACC0000001%' matches the ten accounts ACC00000010 through ACC00000019.
print("\n=== Query 3: Account Fraud Pattern Analysis ===")
fraud_patterns = spark.sql("""
SELECT account_id, transaction_date, transaction_type, amount, risk_score
FROM finance.analytics.account_transactions
WHERE account_id LIKE 'ACC0000001%' AND transaction_date >= '2024-06-01'
ORDER BY account_id, transaction_date
""")

fraud_patterns.show()
print(f"Pattern records found: {fraud_patterns.count()}")

=== Query 1: Account Transaction History ===


+-----------+-------------------+----------------+--------+-----------------+
| account_id|   transaction_date|transaction_type|  amount|merchant_category|
+-----------+-------------------+----------------+--------+-----------------+
|ACC00000001|2024-10-18 16:00:00|      Withdrawal|-1915.87|           Retail|
|ACC00000001|2024-09-29 10:00:00|         Payment| -962.07|        Groceries|
|ACC00000001|2024-09-18 01:00:00|      Withdrawal| -999.96|    Entertainment|
|ACC00000001|2024-09-04 05:00:00|        Transfer|  231.83|   Transportation|
|ACC00000001|2024-09-02 22:00:00|        Transfer|  345.69|       Healthcare|
|ACC00000001|2024-08-09 22:00:00|         Deposit| 7166.14|       Healthcare|
|ACC00000001|2024-07-25 19:00:00|         Deposit| 8596.92|       Restaurant|
|ACC00000001|2024-07-12 08:00:00|         Payment| -715.38|       Restaurant|
|ACC00000001|2024-06-28 10:00:00|             ATM| -173.86|           Retail|
|ACC00000001|2024-06-02 10:00:00|         Deposit| 6959.24|     

Records found: 13

=== Query 2: High-Risk Transactions Today ===


+----------------+----------+----------------+------+----------+
|transaction_date|account_id|transaction_type|amount|risk_score|
+----------------+----------+----------------+------+----------+
+----------------+----------+----------------+------+----------+



High-risk transactions found: 0

=== Query 3: Account Fraud Pattern Analysis ===


+-----------+-------------------+----------------+--------+----------+
| account_id|   transaction_date|transaction_type|  amount|risk_score|
+-----------+-------------------+----------------+--------+----------+
|ACC00000010|2024-06-09 10:00:00|         Payment| -551.26|         6|
|ACC00000010|2024-06-10 17:00:00|         Deposit| 1617.89|        15|
|ACC00000010|2024-07-08 15:00:00|         Deposit| 2775.23|        75|
|ACC00000010|2024-07-20 07:00:00|             ATM| -486.48|        65|
|ACC00000010|2024-07-22 15:00:00|         Payment| -701.06|        46|
|ACC00000010|2024-07-30 14:00:00|         Payment| -814.33|        10|
|ACC00000010|2024-08-06 22:00:00|         Payment|  -10.83|        51|
|ACC00000010|2024-09-10 22:00:00|             ATM| -275.27|        20|
|ACC00000010|2024-09-27 11:00:00|      Withdrawal|-1623.57|        61|
|ACC00000010|2024-11-08 01:00:00|         Deposit| 9571.56|        96|
|ACC00000010|2024-11-24 20:00:00|         Payment| -148.81|         8|
|ACC00

Pattern records found: 153


## Step 6: Analyze Clustering Effectiveness

### Understanding the Impact

Let's examine how liquid clustering has organized our data and analyze some aggregate statistics to demonstrate the financial insights possible with this optimized structure.

### Key Analytics

- **Transaction volume** by type and risk patterns
- **Customer spending analysis** and account segmentation
- **Fraud detection metrics** and risk scoring effectiveness
- **Merchant category trends** and spending patterns
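
As a sanity check for the risk distribution query in this step: the generator draws `risk_score` with `random.randint(0, 100)`, so each of the 101 scores is equally likely and the expected share of each band can be computed directly. The band boundaries below mirror the `CASE` expression in the query; the variable names are illustrative.

```python
from fractions import Fraction

# Inclusive score ranges matching the CASE expression in the risk query.
BANDS = {
    "Very High Risk": (80, 100),
    "High Risk": (60, 79),
    "Medium Risk": (40, 59),
    "Low Risk": (20, 39),
    "Very Low Risk": (0, 19),
}

TOTAL_SCORES = 101  # random.randint(0, 100) has 101 equally likely outcomes

expected_share = {
    name: Fraction(high - low + 1, TOTAL_SCORES)
    for name, (low, high) in BANDS.items()
}

for name, share in expected_share.items():
    print(f"{name}: {float(share) * 100:.2f}%")
```

The top band covers 21 scores (80 through 100) and every other band covers 20, so "Very High Risk" is expected at about 20.79% and each remaining band at about 19.80%. The percentages reported by the query are consistent with a uniform random score.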

In [None]:
# Analyze clustering effectiveness and financial insights

# Transaction analysis by type
print("=== Transaction Analysis by Type ===")
transaction_analysis = spark.sql("""
SELECT transaction_type, COUNT(*) as total_transactions,
       ROUND(SUM(amount), 2) as total_amount,
       ROUND(AVG(amount), 2) as avg_amount,
       ROUND(AVG(risk_score), 2) as avg_risk_score
FROM finance.analytics.account_transactions
GROUP BY transaction_type
ORDER BY total_transactions DESC
""")

transaction_analysis.show()

# Risk score distribution
print("\n=== Risk Score Distribution ===")
risk_distribution = spark.sql("""
SELECT 
    CASE 
        WHEN risk_score >= 80 THEN 'Very High Risk'
        WHEN risk_score >= 60 THEN 'High Risk'
        WHEN risk_score >= 40 THEN 'Medium Risk'
        WHEN risk_score >= 20 THEN 'Low Risk'
        ELSE 'Very Low Risk'
    END as risk_category,
    COUNT(*) as transaction_count,
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) as percentage
FROM finance.analytics.account_transactions
GROUP BY 
    CASE 
        WHEN risk_score >= 80 THEN 'Very High Risk'
        WHEN risk_score >= 60 THEN 'High Risk'
        WHEN risk_score >= 40 THEN 'Medium Risk'
        WHEN risk_score >= 20 THEN 'Low Risk'
        ELSE 'Very Low Risk'
    END
ORDER BY transaction_count DESC
""")

risk_distribution.show()

# Merchant category spending
print("\n=== Merchant Category Spending Analysis ===")
merchant_analysis = spark.sql("""
SELECT merchant_category, COUNT(*) as transactions,
       ROUND(SUM(CASE WHEN amount > 0 THEN amount ELSE 0 END), 2) as deposits,
       ROUND(SUM(CASE WHEN amount < 0 THEN ABS(amount) ELSE 0 END), 2) as spending,
       ROUND(AVG(risk_score), 2) as avg_risk
FROM finance.analytics.account_transactions
GROUP BY merchant_category
ORDER BY spending DESC
""")

merchant_analysis.show()

# Monthly transaction trends
print("\n=== Monthly Transaction Trends ===")
monthly_trends = spark.sql("""
SELECT DATE_FORMAT(transaction_date, 'yyyy-MM') as month,
       COUNT(*) as transactions,
       ROUND(SUM(amount), 2) as net_flow,
       COUNT(DISTINCT account_id) as active_accounts,
       ROUND(AVG(risk_score), 2) as avg_risk_score
FROM finance.analytics.account_transactions
GROUP BY DATE_FORMAT(transaction_date, 'yyyy-MM')
ORDER BY month
""")

monthly_trends.show()

=== Transaction Analysis by Type ===


+----------------+------------------+--------------+----------+--------------+
|transaction_type|total_transactions|  total_amount|avg_amount|avg_risk_score|
+----------------+------------------+--------------+----------+--------------+
|         Deposit|             30198| 1.523849763E8|   5046.19|         50.01|
|        Transfer|             30132|1.5222047821E8|   5051.79|          49.9|
|      Withdrawal|             29909|-3.008644349E7|  -1005.93|         50.26|
|             ATM|             29878|   -7736558.47|   -258.94|         50.26|
|         Payment|             29795|-3.004764726E7|  -1008.48|         49.77|
+----------------+------------------+--------------+----------+--------------+


=== Risk Score Distribution ===


+--------------+-----------------+----------+
| risk_category|transaction_count|percentage|
+--------------+-----------------+----------+
|Very High Risk|            31430|     20.97|
|   Medium Risk|            29815|     19.89|
| Very Low Risk|            29684|     19.80|
|      Low Risk|            29522|     19.69|
|     High Risk|            29461|     19.65|
+--------------+-----------------+----------+


=== Merchant Category Spending Analysis ===


+-----------------+------------+-------------+----------+--------+
|merchant_category|transactions|     deposits|  spending|avg_risk|
+-----------------+------------+-------------+----------+--------+
|       Healthcare|       18816|3.831436558E7|8594934.96|   50.27|
|       Restaurant|       18781|3.805589073E7|8550856.19|   50.26|
|           Online|       18735|3.803928903E7|8516905.61|   49.61|
|   Transportation|       18710|3.769609547E7|8514449.46|   50.05|
|        Groceries|       18549|3.736572863E7|8481252.91|   49.76|
|    Entertainment|       18810|3.822328708E7|8449057.55|   50.52|
|           Retail|       18722|3.857212447E7|8400978.27|   49.96|
|        Utilities|       18789|3.833867352E7|8362214.27|   49.88|
+-----------------+------------+-------------+----------+--------+


=== Monthly Transaction Trends ===


+-------+------------+-------------+---------------+--------------+
|  month|transactions|     net_flow|active_accounts|avg_risk_score|
+-------+------------+-------------+---------------+--------------+
|2024-01|       12828|2.046173847E7|           4447|          50.0|
|2024-02|       11919|1.904988781E7|           4366|         50.15|
|2024-03|       12702|1.998205525E7|           4401|         49.94|
|2024-04|       12268|1.934449189E7|           4384|          50.2|
|2024-05|       12598| 1.92428523E7|           4439|         49.85|
|2024-06|       12202|1.893913387E7|           4347|          50.1|
|2024-07|       12636| 2.06894535E7|           4380|         49.79|
|2024-08|       12708|2.005541701E7|           4419|         50.02|
|2024-09|       12431|1.910335389E7|           4412|         49.74|
|2024-10|       12613|2.006110277E7|           4432|         50.46|
|2024-11|       12216| 1.94978334E7|           4343|         50.26|
|2024-12|       12791|2.030748513E7|           4

## Step 7: Train Financial Services Fraud Detection Model

### Machine Learning for Financial Services Business Improvement

Now we'll train a machine learning model to predict fraudulent transactions. This model can help financial institutions:

- **Reduce fraud losses** by identifying suspicious transactions
- **Improve customer experience** by reducing false positives in fraud alerts
- **Automate transaction monitoring** for real-time risk assessment
- **Optimize compliance** by prioritizing high-risk transactions for review

### Model Approach

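We'll use a **Random Forest Classifier** to predict a synthetic fraud label (`is_fraud`, set to 1 when `risk_score > 60`) based on:

- Transaction amount and type
- Merchant category and location
- Temporal patterns (month, day of week, hour of day)

Account transaction history features are not engineered in this notebook; they would be a natural extension.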

### Business Impact

- **Fraud Detection**: Automated identification of potentially fraudulent transactions
- **Cost Savings**: Reduced chargeback losses and investigation costs
- **Efficiency**: Faster processing of legitimate transactions
- **Customer Trust**: Better balance between security and convenience
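
One caveat about the demo data: the `is_fraud` label is derived from `risk_score`, which the generator draws independently of every other column. Features that carry no information about the label cannot separate the classes, so an AUC near 0.5 is the expected outcome. The sketch below illustrates this with a small pure-Python AUC calculation (the rank-sum formulation); the function name `roc_auc` and the sample size are assumptions for illustration.

```python
import random


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    ranked = sorted(zip(scores, labels))
    positives = sum(labels)
    negatives = len(labels) - positives
    rank_sum = 0.0
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average 1-based rank across tied scores
        rank_sum += avg_rank * sum(label for _, label in ranked[i:j])
        i = j
    return (rank_sum - positives * (positives + 1) / 2) / (positives * negatives)


rng = random.Random(0)
risk_scores = [rng.randint(0, 100) for _ in range(5000)]
labels = [1 if s > 60 else 0 for s in risk_scores]

# Scores that, like the notebook's features, carry no information about risk_score.
unrelated_scores = [rng.random() for _ in labels]

print(f"AUC with unrelated scores: {roc_auc(labels, unrelated_scores):.3f}")
print(f"AUC when the score is risk_score itself: {roc_auc(labels, risk_scores):.3f}")
```

With real transaction data, labels would come from confirmed fraud cases and features would need to correlate with them for the model to be useful.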

In [None]:
# Prepare data for machine learning

from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

# Load data for ML
ml_data = spark.sql("""
SELECT 
    account_id,
    transaction_date,
    transaction_type,
    amount,
    merchant_category,
    location,
    risk_score,
    CASE WHEN risk_score > 60 THEN 1 ELSE 0 END as is_fraud
FROM finance.analytics.account_transactions
""")

print(f"Loaded {ml_data.count()} records for ML training")
ml_data.groupBy("is_fraud").count().show()

Loaded 149912 records for ML training


+--------+-----+
|is_fraud|count|
+--------+-----+
|       1|59447|
|       0|90465|
+--------+-----+



In [None]:
# Feature engineering

# Extract temporal features
ml_data = ml_data.withColumn("month", F.month("transaction_date")) \
                 .withColumn("day_of_week", F.dayofweek("transaction_date")) \
                 .withColumn("hour", F.hour("transaction_date"))

# Create indexers for categorical variables
transaction_type_indexer = StringIndexer(inputCol="transaction_type", outputCol="transaction_type_index")
merchant_category_indexer = StringIndexer(inputCol="merchant_category", outputCol="merchant_category_index")
location_indexer = StringIndexer(inputCol="location", outputCol="location_index")

# Assemble features
assembler = VectorAssembler(
    inputCols=["amount", "month", "day_of_week", "hour", 
               "transaction_type_index", "merchant_category_index", "location_index"],
    outputCol="features"
)

# Scale features
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")

# Create and train the model
rf = RandomForestClassifier(
    labelCol="is_fraud", 
    featuresCol="scaled_features",
    numTrees=100,
    maxDepth=10
)

# Create pipeline
pipeline = Pipeline(stages=[transaction_type_indexer, merchant_category_indexer, location_indexer, assembler, scaler, rf])

# Split data
train_data, test_data = ml_data.randomSplit([0.8, 0.2], seed=42)

print(f"Training set: {train_data.count()} records")
print(f"Test set: {test_data.count()} records")

Training set: 120010 records


Test set: 29902 records


In [None]:
# Train the model

print("Training fraud detection model...")
model = pipeline.fit(train_data)

# Make predictions
predictions = model.transform(test_data)

# Evaluate the model
evaluator = BinaryClassificationEvaluator(labelCol="is_fraud", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)

print(f"Model AUC: {auc:.4f}")

# Show prediction results
predictions.select("account_id", "amount", "risk_score", "is_fraud", "prediction", "probability").show(10)

# Calculate confusion matrix
confusion_matrix = predictions.groupBy("is_fraud", "prediction").count()
confusion_matrix.show()

Training fraud detection model...


Model AUC: 0.4990


+-----------+--------+----------+--------+----------+--------------------+
| account_id|  amount|risk_score|is_fraud|prediction|         probability|
+-----------+--------+----------+--------+----------+--------------------+
|ACC00001244| -373.02|        51|       0|       0.0|[0.59525626208791...|
|ACC00001244| 7497.15|        61|       1|       0.0|[0.60731110280234...|
|ACC00001244| 5562.12|         9|       0|       0.0|[0.58394451331490...|
|ACC00001245| 8329.91|        11|       0|       0.0|[0.60644649521069...|
|ACC00001245|-1785.75|        38|       0|       0.0|[0.61275539606750...|
|ACC00001245| 4444.54|        69|       1|       0.0|[0.60175865368915...|
|ACC00001245| 7921.03|        27|       0|       0.0|[0.59757790918507...|
|ACC00001245| -179.47|        63|       1|       0.0|[0.59647284495004...|
|ACC00001246| 6077.13|        72|       1|       0.0|[0.59795782635330...|
|ACC00001246| -121.63|        34|       0|       0.0|[0.57918215597959...|
+-----------+--------+---

+--------+----------+-----+
|is_fraud|prediction|count|
+--------+----------+-----+
|       1|       0.0|11937|
|       0|       0.0|17959|
|       1|       1.0|    4|
|       0|       1.0|    2|
+--------+----------+-----+



In [None]:
# Model interpretation and business insights

# Feature importance (names listed in the same order as the VectorAssembler inputs)
rf_model = model.stages[-1]
feature_importance = rf_model.featureImportances.toArray()
feature_names = ["amount", "month", "day_of_week", "hour", "transaction_type", "merchant_category", "location"]

print("=== Feature Importance ===")
for name, importance in zip(feature_names, feature_importance):
    print(f"{name}: {importance:.4f}")

# Business impact analysis
print("\n=== Business Impact Analysis ===")

# Calculate potential savings from fraud detection
fraud_predictions = predictions.filter("prediction = 1")
high_risk_transactions = fraud_predictions.count()
total_flagged_amount = fraud_predictions.agg(F.sum(F.abs("amount"))).collect()[0][0] or 0

total_test_amount = test_data.agg(F.sum(F.abs("amount"))).collect()[0][0] or 0

print(f"Total test transactions: {test_data.count()}")
print(f"Transactions flagged as high-risk: {high_risk_transactions}")
print(f"Percentage flagged: {(high_risk_transactions/test_data.count())*100:.1f}%")
print(f"Total amount of flagged transactions: ${total_flagged_amount:,.2f}")

# Accuracy metrics (each count is computed once to avoid repeated Spark jobs)
total = predictions.count()
true_pos = predictions.filter("prediction = 1 AND is_fraud = 1").count()
pred_pos = predictions.filter("prediction = 1").count()
actual_pos = predictions.filter("is_fraud = 1").count()
accuracy = predictions.filter("is_fraud = prediction").count() / total
precision = true_pos / pred_pos if pred_pos > 0 else 0
recall = true_pos / actual_pos if actual_pos > 0 else 0

print(f"\nModel Performance:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"AUC: {auc:.4f}")

=== Feature Importance ===
amount: 0.2093
month: 0.1688
day_of_week: 0.1236
hour: 0.2047
transaction_type: 0.0573
merchant_category: 0.1204
location: 0.1160

=== Business Impact Analysis ===


Total test transactions: 29902
Transactions flagged as high-risk: 6


Percentage flagged: 0.0%
Total amount of flagged transactions: $8,551.12



Model Performance:
Accuracy: 0.6007
Precision: 0.6667
Recall: 0.0003
AUC: 0.4990
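
The reported metrics follow directly from the confusion matrix shown earlier in this step. The sketch below recomputes them and adds a majority-class baseline for comparison; the helper name `metrics_from_confusion` is hypothetical.

```python
def metrics_from_confusion(tp, fp, fn, tn):
    """Derive standard metrics from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        # Accuracy of always predicting the more frequent class.
        "majority_baseline": max(tp + fn, fp + tn) / total,
    }


# Counts read from the confusion matrix output in this step.
metrics = metrics_from_confusion(tp=4, fp=2, fn=11937, tn=17959)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The model's accuracy and the majority-class baseline both round to 0.6007, so the model performs no better than always predicting "not fraud". The precision of 0.6667 rests on only six flagged transactions.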


## Key Takeaways: Delta Liquid Clustering + ML in AIDP

### What We Demonstrated

1. **Automatic Optimization**: Created a table with `CLUSTER BY (account_id, transaction_date)` and let Delta automatically optimize data layout

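2. **Performance Benefits**: Queries that filter on the clustered columns (`account_id`, `transaction_date`) can read fewer files thanks to data locality; this notebook did not time the queries, so measure the gain on your own workload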

3. **Low Maintenance**: No manual partitioning, bucketing, or Z-Ordering required; Delta manages the layout around the clustering columns

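4. **Machine Learning Integration**: Trained a Random Forest pipeline on the clustered table. Because the `is_fraud` label comes from randomly generated risk scores, the model reaches an AUC of about 0.50 and illustrates the workflow rather than real fraud detection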

5. **Real-World Use Case**: Financial services analytics where fraud detection and risk assessment are critical

### AIDP Advantages

- **Unified Analytics**: Seamlessly integrates data optimization with ML
- **Governance**: Catalog and schema isolation for financial data
- **Performance**: Optimized for both analytical queries and ML training
- **Scalability**: Runs on distributed Spark, so the same code applies to larger transaction volumes

### Business Benefits for Financial Services

1. **Fraud Reduction**: Automated detection of suspicious transactions
2. **Cost Savings**: Reduced chargeback losses and investigation costs
3. **Operational Efficiency**: Faster processing of legitimate transactions
4. **Customer Experience**: Better balance between security and convenience
5. **Regulatory Compliance**: Improved monitoring and reporting capabilities

### Best Practices for Financial Services Analytics

1. **Choose clustering columns** based on your most common query patterns
2. **Start with 1-4 columns** - too many can reduce effectiveness
3. **Consider cardinality** - high-cardinality columns work best
4. **Monitor and adjust** as query patterns evolve
5. **Combine with ML** for predictive analytics and automation
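
To act on "monitor and adjust", Delta Lake provides SQL statements for changing the clustering columns of an existing table and for triggering a reclustering pass. The sketch below only builds the statements; the helper name `clustering_maintenance_sql` is hypothetical, and whether both statements are supported in your AIDP environment is an assumption to verify.

```python
def clustering_maintenance_sql(table, cluster_columns):
    """Build statements that change the clustering keys and recluster the table."""
    if not 1 <= len(cluster_columns) <= 4:
        raise ValueError("expected between 1 and 4 clustering columns")
    column_list = ", ".join(cluster_columns)
    return [
        f"ALTER TABLE {table} CLUSTER BY ({column_list})",
        f"OPTIMIZE {table}",
    ]


statements = clustering_maintenance_sql(
    "finance.analytics.account_transactions",
    ["account_id", "transaction_date"],
)
for statement in statements:
    print(statement)  # in the notebook: spark.sql(statement)
```

Changing the clustering columns affects how new and reoptimized data is laid out, so it is worth re-running representative queries afterwards to confirm the new keys help.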

### Next Steps

- Explore other AIDP ML features like AutoML
- Try liquid clustering with different column combinations
- Scale up to larger financial datasets
- Integrate with real banking systems and fraud detection platforms
- Deploy models for real-time fraud scoring

This notebook demonstrates how Oracle AI Data Platform makes advanced financial services analytics accessible while maintaining enterprise-grade performance and governance.