# Financial Services: Iceberg and Liquid Clustering Demo


## Overview


This notebook demonstrates the power of **Iceberg and Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using a financial services analytics use case. Liquid clustering automatically optimizes data layout for query performance without requiring manual partitioning or Z-Ordering.

### What is Iceberg?

Apache Iceberg is an open table format for huge analytic datasets that provides:

- **Schema evolution**: Add, drop, rename, update columns without rewriting data
- **Partition evolution**: Change partitioning without disrupting queries
- **Time travel**: Query historical data snapshots for auditing and rollback
- **ACID transactions**: Reliable concurrent read/write operations
- **Cross-engine compatibility**: Works with Spark, Flink, Presto, Hive, and more
- **Open ecosystem**: Apache 2.0 licensed, community-driven development

### Delta Universal Format with Iceberg

Delta Universal Format (UniForm) generates Iceberg metadata alongside the Delta transaction log, so Iceberg-compatible engines can read the same data files while the table keeps Delta features such as liquid clustering. This combination provides:

- **Best of both worlds**: Delta's performance optimizations with Iceberg's openness
- **Multi-engine access**: Query the same data from different analytics engines
- **Future-proof architecture**: Standards-based approach for long-term data investments
- **Enhanced governance**: Rich metadata and catalog integration

### What is Liquid Clustering?

Liquid clustering automatically identifies and groups similar data together based on clustering columns you define. This optimization happens automatically during data ingestion and maintenance operations, providing:

- **Automatic optimization**: No manual tuning required
- **Improved query performance**: Faster queries on clustered columns
- **Reduced maintenance**: No need for manual repartitioning
- **Adaptive clustering**: Adjusts as data patterns change
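
To make the data-locality idea concrete, here is a minimal pure-Python sketch of file-level min/max skipping. It is an illustration under simplifying assumptions, not how Delta implements clustering: the helper name `files_touched`, the 100-row "files", and the single-column sort are all hypothetical stand-ins for the real multi-column layout.

```python
import random

def files_touched(rows, key, target, rows_per_file=100, clustered=False):
    """Return (files a point lookup must open, total files) using min/max stats."""
    # Clustering is approximated here by a plain sort on one key (an assumption).
    ordered = sorted(rows, key=lambda r: r[key]) if clustered else list(rows)
    files = [ordered[i:i + rows_per_file] for i in range(0, len(ordered), rows_per_file)]
    touched = 0
    for file_rows in files:
        lowest = min(r[key] for r in file_rows)
        highest = max(r[key] for r in file_rows)
        # A reader can skip any file whose [min, max] range excludes the target.
        if lowest <= target <= highest:
            touched += 1
    return touched, len(files)

rng = random.Random(42)
rows = [{"account_id": f"ACC{rng.randint(1, 500):08d}"} for _ in range(10_000)]
target = "ACC00000250"

unclustered = files_touched(rows, "account_id", target)
clustered = files_touched(rows, "account_id", target, clustered=True)
print("unclustered (touched, total):", unclustered)
print("clustered   (touched, total):", clustered)
```

When rows arrive in random order, nearly every file's range spans the target, so few files can be skipped. Once rows are grouped by key, only the one or two files that hold the target account need to be read.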

### Use Case: Transaction Fraud Detection and Customer Analytics

We'll analyze financial transaction records from a bank. Our clustering strategy will optimize for:

- **Customer-specific queries**: Fast lookups by account ID
- **Time-based analysis**: Efficient filtering by transaction date
- **Fraud pattern detection**: Quick aggregation by transaction type and risk scores

## Step 1: Set Up the AIDP Environment

This notebook leverages the existing Spark session in your AIDP environment.

In [1]:
# Create financial services catalog and analytics schema

# In AIDP, catalogs provide data isolation and governance

spark.sql("CREATE CATALOG IF NOT EXISTS finance")

spark.sql("CREATE SCHEMA IF NOT EXISTS finance.analytics")

print("Financial services catalog and analytics schema created successfully!")

Financial services catalog and analytics schema created successfully!


## Step 2: Create Delta Table with Liquid Clustering

### Table Design

Our `account_transactions_uf` table will store:

- **account_id**: Unique account identifier
- **transaction_date**: Date and time of transaction
- **transaction_type**: Type (Deposit, Withdrawal, Transfer, Payment, etc.)
- **amount**: Transaction amount
- **merchant_category**: Merchant type (Retail, Restaurant, Online, etc.)
- **location**: Transaction location
- **risk_score**: Fraud risk assessment (0-100)

### Clustering Strategy

We'll cluster by `account_id` and `transaction_date` because:

- **account_id**: Customers often have multiple transactions, grouping their financial activity together
- **transaction_date**: Time-based queries are critical for fraud detection, spending analysis, and regulatory reporting
- This combination optimizes for both customer account analysis and temporal fraud pattern detection

In [1]:
# Schema for the DataFrame that Step 4 builds from the generated records.
# amount is a double here but DECIMAL(15,2) in the table, so the insert relies on a cast.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType

data_schema = StructType([
    StructField("account_id", StringType(), True),
    StructField("transaction_date", TimestampType(), True),
    StructField("transaction_type", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("merchant_category", StringType(), True),
    StructField("location", StringType(), True),
    StructField("risk_score", IntegerType(), True)
])

# Create Delta table with liquid clustering
# CLUSTER BY defines the columns for automatic optimization
spark.sql("""

CREATE TABLE IF NOT EXISTS finance.analytics.account_transactions_uf (
    account_id STRING,
    transaction_date TIMESTAMP,
    transaction_type STRING,
    amount DECIMAL(15,2),
    merchant_category STRING,
    location STRING,
    risk_score INT
)

USING DELTA

TBLPROPERTIES('delta.universalFormat.enabledFormats' = 'iceberg') CLUSTER BY (account_id, transaction_date)

""")

print("Delta table with Iceberg compatibility and liquid clustering created successfully!")

print("Universal format enables Iceberg features while CLUSTER BY (columns) optimizes data layout.")

Delta table with Iceberg compatibility and liquid clustering created successfully!
Universal format enables Iceberg features while CLUSTER BY (columns) optimizes data layout.


## Step 3: Generate Financial Services Sample Data

### Data Generation Strategy

We'll create synthetic financial transaction data including:

- **5,000 accounts** with 10-50 transactions each, spread across 2024
- **Transaction types**: Deposits, withdrawals, transfers, payments, ATM withdrawals
- **Temporal spread**: Timestamps drawn uniformly at random across the year, with no weekday or weekend weighting
- **Merchant categories**: Retail, restaurants, online shopping, utilities, entertainment, groceries, healthcare, transportation

### Why This Data Pattern?

This data simulates real financial scenarios where:

- Customers perform multiple transactions daily/weekly
- Fraud patterns emerge over time
- Regulatory reporting requires temporal analysis
- Risk scoring enables real-time fraud prevention
- Customer spending analysis drives personalized financial services

In [1]:
# Generate sample financial transaction data

# Uses only the Python standard library (random, datetime)

import random

from datetime import datetime, timedelta


# Define financial data constants

TRANSACTION_TYPES = ['Deposit', 'Withdrawal', 'Transfer', 'Payment', 'ATM']

MERCHANT_CATEGORIES = ['Retail', 'Restaurant', 'Online', 'Utilities', 'Entertainment', 'Groceries', 'Healthcare', 'Transportation']

LOCATIONS = ['New York, NY', 'Los Angeles, CA', 'Chicago, IL', 'Houston, TX', 'Miami, FL', 'Online', 'ATM']


# Generate account transaction records

transaction_data = []

base_date = datetime(2024, 1, 1)


# Create 5,000 accounts with 10-50 transactions each

for account_num in range(1, 5001):

    account_id = f"ACC{account_num:08d}"
    
    # Each account gets 10-50 transactions over 12 months

    num_transactions = random.randint(10, 50)
    
    for i in range(num_transactions):

        # Spread transactions over 12 months with realistic timing

        days_offset = random.randint(0, 365)

        hours_offset = random.randint(0, 23)

        transaction_date = base_date + timedelta(days=days_offset, hours=hours_offset)
        
        # Select transaction type

        transaction_type = random.choice(TRANSACTION_TYPES)
        
        # Amount based on transaction type

        if transaction_type in ['Deposit', 'Transfer']:

            amount = round(random.uniform(100, 10000), 2)

        elif transaction_type == 'ATM':

            amount = round(random.uniform(20, 500), 2) * -1

        else:

            amount = round(random.uniform(10, 2000), 2) * -1
        
        # Select merchant category and location

        merchant_category = random.choice(MERCHANT_CATEGORIES)

        if transaction_type == 'ATM':

            location = 'ATM'

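        # Note: 'Online' is a merchant category, not an entry in TRANSACTION_TYPES, so this branch never matches as written
        elif transaction_type == 'Online':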

            location = 'Online'

        else:

            location = random.choice(LOCATIONS)
        
        # Risk score (0-100, higher = more suspicious)

        risk_score = random.randint(0, 100)
        
        transaction_data.append({

            "account_id": account_id,

            "transaction_date": transaction_date,

            "transaction_type": transaction_type,

            "amount": amount,

            "merchant_category": merchant_category,

            "location": location,

            "risk_score": risk_score

        })



print(f"Generated {len(transaction_data)} account transaction records")

print("Sample record:", transaction_data[0])

Generated 149143 account transaction records
Sample record: {'account_id': 'ACC00000001', 'transaction_date': datetime.datetime(2024, 3, 12, 1, 0), 'transaction_type': 'Transfer', 'amount': 9780.6, 'merchant_category': 'Groceries', 'location': 'Online', 'risk_score': 8}
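
The generation cell above draws from the global `random` module without a seed, so each run produces a different record count and different rows. A minimal sketch of a reproducible variant follows, assuming a hypothetical helper `generate_transactions` and a trimmed-down record with only four of the seven columns.

```python
import random
from datetime import datetime, timedelta

TRANSACTION_TYPES = ['Deposit', 'Withdrawal', 'Transfer', 'Payment', 'ATM']

def generate_transactions(num_accounts, seed):
    """Return synthetic transaction dicts; the same seed yields the same data."""
    rng = random.Random(seed)  # private generator, leaves the global random state alone
    base_date = datetime(2024, 1, 1)
    records = []
    for account_num in range(1, num_accounts + 1):
        account_id = f"ACC{account_num:08d}"
        for _ in range(rng.randint(10, 50)):
            when = base_date + timedelta(days=rng.randint(0, 365), hours=rng.randint(0, 23))
            records.append({
                "account_id": account_id,
                "transaction_date": when,
                "transaction_type": rng.choice(TRANSACTION_TYPES),
                "risk_score": rng.randint(0, 100),
            })
    return records

first = generate_transactions(num_accounts=3, seed=7)
second = generate_transactions(num_accounts=3, seed=7)
print("identical runs:", first == second)
print("records:", len(first))
```

Passing the seed explicitly keeps the demo's counts and sample rows stable between runs, which makes the printed outputs easier to compare.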


## Step 4: Insert Data Using PySpark

### Data Insertion Strategy

We'll use PySpark to:

1. **Create DataFrame** from our generated data
2. **Insert into Delta table** with liquid clustering
3. **Verify the insertion** with a sample query

### Why PySpark for Insertion?

- **Distributed processing**: Handles large datasets efficiently
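- **Explicit schema**: The DataFrame is built against `data_schema`, so column names and types are checked when it is created
- **Optimization**: Leverages Spark's query optimization
- **Liquid clustering**: The table's `CLUSTER BY` definition applies to data written through the DataFrame API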
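
One detail worth noting: the generated `amount` values are Python floats and `data_schema` declares the column as a double, while the table stores `DECIMAL(15,2)`. The insert therefore relies on a cast. If you would rather control rounding yourself, one possible approach is to convert amounts with Python's `decimal` module before building the DataFrame. The helper name `to_decimal_15_2` is hypothetical, and using it would also mean declaring the field as `DecimalType(15, 2)` in the schema.

```python
from decimal import Decimal, ROUND_HALF_UP

def to_decimal_15_2(value):
    """Convert a float to a two-decimal Decimal that fits DECIMAL(15,2)."""
    # Going through str() avoids carrying binary floating-point noise into the Decimal.
    quantized = Decimal(str(value)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    # DECIMAL(15,2) holds at most 15 digits in total, 2 of them after the point.
    if len(quantized.as_tuple().digits) > 15:
        raise ValueError(f"{value} does not fit DECIMAL(15,2)")
    return quantized

print(to_decimal_15_2(9780.6))
print(to_decimal_15_2(-286.9))
```

Keeping money values as `Decimal` end to end avoids surprises when sums of many float amounts are later compared against decimal totals from the table.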

In [1]:
# Insert data using PySpark DataFrame operations

# data_schema was defined in Step 2 alongside the table definition


# Create DataFrame from generated data

df_transactions = spark.createDataFrame(transaction_data, schema=data_schema)


# Display schema and sample data

print("DataFrame Schema:")

df_transactions.printSchema()



print("\nSample Data:")

df_transactions.show(5)


# Insert data into Delta table with liquid clustering

# The table was created with CLUSTER BY (account_id, transaction_date), which governs how the written data is laid out

df_transactions.write.mode("overwrite").insertInto("finance.analytics.account_transactions_uf")


print(f"\nSuccessfully inserted {df_transactions.count()} records into finance.analytics.account_transactions_uf")

print("Liquid clustering automatically optimized the data layout during insertion!")

DataFrame Schema:
root
 |-- account_id: string (nullable = true)
 |-- transaction_date: timestamp (nullable = true)
 |-- transaction_type: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- merchant_category: string (nullable = true)
 |-- location: string (nullable = true)
 |-- risk_score: integer (nullable = true)


Sample Data:


+-----------+-------------------+----------------+--------+-----------------+------------+----------+
| account_id|   transaction_date|transaction_type|  amount|merchant_category|    location|risk_score|
+-----------+-------------------+----------------+--------+-----------------+------------+----------+
|ACC00000001|2024-03-12 01:00:00|        Transfer|  9780.6|        Groceries|      Online|         8|
|ACC00000001|2024-01-08 13:00:00|        Transfer| 3075.42|       Healthcare| Chicago, IL|        30|
|ACC00000001|2024-05-24 12:00:00|         Payment|-1475.39|   Transportation| Chicago, IL|        99|
|ACC00000001|2024-10-08 15:00:00|      Withdrawal| -595.12|           Online| Chicago, IL|        81|
|ACC00000001|2024-09-19 09:00:00|        Transfer| 3645.06|    Entertainment|New York, NY|        98|
+-----------+-------------------+----------------+--------+-----------------+------------+----------+
only showing top 5 rows




Successfully inserted 149143 records into finance.analytics.account_transactions_uf
Liquid clustering automatically optimized the data layout during insertion!


## Step 5: Demonstrate Liquid Clustering Benefits

### Query Performance Analysis

Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:

1. **Account transaction history** (clustered by account_id)
2. **Time-based fraud analysis** (clustered by transaction_date)
3. **Combined account + time queries** (optimal for our clustering)

### Expected Performance Benefits

With liquid clustering, queries that filter on the clustering columns can run faster because:

- **Data locality**: Related records are physically grouped together
- **Reduced I/O**: Less data needs to be read from disk
- **Automatic optimization**: No manual tuning required

In [1]:
# Demonstrate liquid clustering benefits with optimized queries


# Query 1: Account transaction history - benefits from account_id clustering

print("=== Query 1: Account Transaction History ===")

account_history = spark.sql("""

SELECT account_id, transaction_date, transaction_type, amount, merchant_category

FROM finance.analytics.account_transactions_uf

WHERE account_id = 'ACC00000001'

ORDER BY transaction_date DESC

""")



account_history.show()

print(f"Records found: {account_history.count()}")



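# Query 2: Time-based fraud analysis - benefits from transaction_date clustering
# Note: the sample data covers 2024 only, so the CURRENT_DATE filter returns rows only when the notebook runs on a date inside that range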

print("\n=== Query 2: High-Risk Transactions Today ===")

high_risk_today = spark.sql("""

SELECT transaction_date, account_id, transaction_type, amount, risk_score

FROM finance.analytics.account_transactions_uf

WHERE DATE(transaction_date) = CURRENT_DATE AND risk_score > 70

ORDER BY risk_score DESC, transaction_date DESC

""")



high_risk_today.show()

print(f"High-risk transactions found: {high_risk_today.count()}")



# Query 3: Combined account + time query - optimal for our clustering strategy
# The LIKE prefix matches accounts ACC00000010 through ACC00000019

print("\n=== Query 3: Account Fraud Pattern Analysis ===")

fraud_patterns = spark.sql("""

SELECT account_id, transaction_date, transaction_type, amount, risk_score

FROM finance.analytics.account_transactions_uf

WHERE account_id LIKE 'ACC0000001%' AND transaction_date >= '2024-06-01'

ORDER BY account_id, transaction_date

""")



fraud_patterns.show()

print(f"Pattern records found: {fraud_patterns.count()}")

=== Query 1: Account Transaction History ===


+-----------+-------------------+----------------+--------+-----------------+
| account_id|   transaction_date|transaction_type|  amount|merchant_category|
+-----------+-------------------+----------------+--------+-----------------+
|ACC00000001|2024-12-27 03:00:00|             ATM| -103.15|       Healthcare|
|ACC00000001|2024-12-26 02:00:00|         Payment|-1591.98|   Transportation|
|ACC00000001|2024-12-19 17:00:00|         Payment| -446.84|        Groceries|
|ACC00000001|2024-11-30 18:00:00|        Transfer| 8528.64|    Entertainment|
|ACC00000001|2024-11-06 19:00:00|      Withdrawal|  -37.15|       Healthcare|
|ACC00000001|2024-10-14 09:00:00|        Transfer| 5505.77|           Retail|
|ACC00000001|2024-10-08 15:00:00|      Withdrawal| -595.12|           Online|
|ACC00000001|2024-09-27 07:00:00|             ATM| -458.07|    Entertainment|
|ACC00000001|2024-09-19 09:00:00|        Transfer| 3645.06|    Entertainment|
|ACC00000001|2024-09-12 08:00:00|             ATM| -286.90|     

Records found: 36

=== Query 2: High-Risk Transactions Today ===


+----------------+----------+----------------+------+----------+
|transaction_date|account_id|transaction_type|amount|risk_score|
+----------------+----------+----------------+------+----------+
+----------------+----------+----------------+------+----------+



High-risk transactions found: 0

=== Query 3: Account Fraud Pattern Analysis ===


+-----------+-------------------+----------------+--------+----------+
| account_id|   transaction_date|transaction_type|  amount|risk_score|
+-----------+-------------------+----------------+--------+----------+
|ACC00000010|2024-07-21 12:00:00|        Transfer| 1382.46|        77|
|ACC00000010|2024-07-28 18:00:00|      Withdrawal|-1227.64|        67|
|ACC00000010|2024-08-13 19:00:00|             ATM|  -52.76|        52|
|ACC00000010|2024-08-19 09:00:00|         Deposit| 5522.80|        61|
|ACC00000010|2024-08-31 17:00:00|         Payment|  -17.65|        17|
|ACC00000010|2024-09-25 23:00:00|             ATM|  -61.76|        55|
|ACC00000010|2024-11-01 09:00:00|         Payment| -465.37|        71|
|ACC00000010|2024-11-10 20:00:00|         Deposit| 2367.70|         1|
|ACC00000010|2024-11-17 10:00:00|      Withdrawal| -470.32|        43|
|ACC00000010|2024-12-06 19:00:00|         Payment|-1910.33|         9|
|ACC00000011|2024-06-06 06:00:00|        Transfer| 2805.70|        60|
|ACC00

Pattern records found: 146
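
Query 2 returned no rows because the sample data stops at the end of 2024 and the filter compares against `CURRENT_DATE`. It also wraps the clustering column in `DATE(...)`, which may limit how well file statistics can be used for skipping. A sketch of an alternative follows: pick a day inside the data range and filter on a half-open timestamp interval. The helper names `day_bounds` and `high_risk_query` and the example date are assumptions made for illustration.

```python
from datetime import date, datetime, timedelta

def day_bounds(day):
    """Return the [start, end) timestamps that cover one calendar day."""
    start = datetime(day.year, day.month, day.day)
    return start, start + timedelta(days=1)

def high_risk_query(day, min_risk=70):
    """Build the Query 2 SQL with a range predicate on the raw timestamp column."""
    start, end = day_bounds(day)
    return f"""
SELECT transaction_date, account_id, transaction_type, amount, risk_score
FROM finance.analytics.account_transactions_uf
WHERE transaction_date >= TIMESTAMP '{start:%Y-%m-%d %H:%M:%S}'
  AND transaction_date <  TIMESTAMP '{end:%Y-%m-%d %H:%M:%S}'
  AND risk_score > {int(min_risk)}
ORDER BY risk_score DESC, transaction_date DESC
"""

query = high_risk_query(date(2024, 6, 15))
print(query)
# In the notebook, run it with: spark.sql(query).show()
```

The half-open interval keeps the comparison on the stored column itself, so the predicate lines up with the min/max ranges of the clustered files.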


## Step 6: Analyze Clustering Effectiveness

### Understanding the Impact

The queries in this step do not inspect the physical file layout directly. Instead, they run aggregate statistics over the clustered table to demonstrate the financial insights this structure supports.

### Key Analytics

- **Transaction volume** by type and risk patterns
- **Customer spending analysis** and account segmentation
- **Fraud detection metrics** and risk scoring effectiveness
- **Merchant category trends** and spending patterns

In [1]:
# Analyze clustering effectiveness and financial insights


# Transaction analysis by type

print("=== Transaction Analysis by Type ===")

transaction_analysis = spark.sql("""

SELECT transaction_type, COUNT(*) as total_transactions,

       ROUND(SUM(amount), 2) as total_amount,

       ROUND(AVG(amount), 2) as avg_amount,

       ROUND(AVG(risk_score), 2) as avg_risk_score

FROM finance.analytics.account_transactions_uf

GROUP BY transaction_type

ORDER BY total_transactions DESC

""")



transaction_analysis.show()


# Risk score distribution

print("\n=== Risk Score Distribution ===")

risk_distribution = spark.sql("""

SELECT 

    CASE 

        WHEN risk_score >= 80 THEN 'Very High Risk'

        WHEN risk_score >= 60 THEN 'High Risk'

        WHEN risk_score >= 40 THEN 'Medium Risk'

        WHEN risk_score >= 20 THEN 'Low Risk'

        ELSE 'Very Low Risk'

    END as risk_category,

    COUNT(*) as transaction_count,

    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) as percentage

FROM finance.analytics.account_transactions_uf

GROUP BY 

    CASE 

        WHEN risk_score >= 80 THEN 'Very High Risk'

        WHEN risk_score >= 60 THEN 'High Risk'

        WHEN risk_score >= 40 THEN 'Medium Risk'

        WHEN risk_score >= 20 THEN 'Low Risk'

        ELSE 'Very Low Risk'

    END

ORDER BY transaction_count DESC

""")



risk_distribution.show()


# Merchant category spending

print("\n=== Merchant Category Spending Analysis ===")

merchant_analysis = spark.sql("""

SELECT merchant_category, COUNT(*) as transactions,

       ROUND(SUM(CASE WHEN amount > 0 THEN amount ELSE 0 END), 2) as deposits,

       ROUND(SUM(CASE WHEN amount < 0 THEN ABS(amount) ELSE 0 END), 2) as spending,

       ROUND(AVG(risk_score), 2) as avg_risk

FROM finance.analytics.account_transactions_uf

GROUP BY merchant_category

ORDER BY spending DESC

""")



merchant_analysis.show()


# Monthly transaction trends

print("\n=== Monthly Transaction Trends ===")

monthly_trends = spark.sql("""

SELECT DATE_FORMAT(transaction_date, 'yyyy-MM') as month,

       COUNT(*) as transactions,

       ROUND(SUM(amount), 2) as net_flow,

       COUNT(DISTINCT account_id) as active_accounts,

       ROUND(AVG(risk_score), 2) as avg_risk_score

FROM finance.analytics.account_transactions_uf

GROUP BY DATE_FORMAT(transaction_date, 'yyyy-MM')

ORDER BY month

""")



monthly_trends.show()

=== Transaction Analysis by Type ===


+----------------+------------------+------------+----------+--------------+
|transaction_type|total_transactions|total_amount|avg_amount|avg_risk_score|
+----------------+------------------+------------+----------+--------------+
|        Transfer|             29999|151108162.38|   5037.11|         49.99|
|         Deposit|             29923|150962629.64|   5045.04|         50.22|
|      Withdrawal|             29921|-29988461.60|  -1002.25|         50.01|
|         Payment|             29789|-29893428.71|  -1003.51|         50.11|
|             ATM|             29511| -7619730.14|   -258.20|         49.97|
+----------------+------------------+------------+----------+--------------+


=== Risk Score Distribution ===


+--------------+-----------------+----------+
| risk_category|transaction_count|percentage|
+--------------+-----------------+----------+
|Very High Risk|            30987|     20.78|
|   Medium Risk|            29797|     19.98|
|     High Risk|            29627|     19.86|
| Very Low Risk|            29423|     19.73|
|      Low Risk|            29309|     19.65|
+--------------+-----------------+----------+


=== Merchant Category Spending Analysis ===


+-----------------+------------+-----------+----------+--------+
|merchant_category|transactions|   deposits|  spending|avg_risk|
+-----------------+------------+-----------+----------+--------+
|        Groceries|       18642|37772542.91|8494798.48|   49.94|
|           Retail|       18604|37468897.74|8487432.78|   50.04|
|       Restaurant|       18672|38179209.41|8480727.72|    50.4|
|           Online|       18515|37231692.24|8472075.04|   49.91|
|    Entertainment|       18636|37411438.43|8461974.74|   49.76|
|   Transportation|       18651|37290951.61|8443335.98|   50.34|
|        Utilities|       18638|37986761.82|8356903.09|   50.23|
|       Healthcare|       18785|38729297.86|8304372.62|   49.85|
+-----------------+------------+-----------+----------+--------+


=== Monthly Transaction Trends ===


+-------+------------+-----------+---------------+--------------+
|  month|transactions|   net_flow|active_accounts|avg_risk_score|
+-------+------------+-----------+---------------+--------------+
|2024-01|       12624|19939024.66|           4398|         49.73|
|2024-02|       11944|19328144.90|           4373|          50.3|
|2024-03|       12510|19632464.84|           4402|         49.95|
|2024-04|       12288|19264206.65|           4405|         50.09|
|2024-05|       12554|19137998.88|           4398|         49.88|
|2024-06|       12423|20206469.72|           4397|         50.25|
|2024-07|       12397|19601966.76|           4399|          49.8|
|2024-08|       12614|20083080.90|           4427|         50.11|
|2024-09|       12243|18520818.85|           4349|         50.03|
|2024-10|       12758|20407110.51|           4428|         50.19|
|2024-11|       12199|18622669.90|           4388|         50.07|
|2024-12|       12589|19825215.00|           4447|         50.32|
+-------+------------+-----------+---------------+--------------+
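
The Risk Score Distribution output above shows "Very High Risk" slightly ahead of the other categories. That follows from the bucket boundaries rather than from any fraud signal: `random.randint(0, 100)` has 101 possible values, and the top bucket covers 21 of them while each other bucket covers 20. The short check below mirrors the SQL `CASE` expression in Python; the function name `risk_category` is introduced only for this illustration.

```python
from collections import Counter

def risk_category(score):
    """Mirror the CASE expression used in the risk distribution query."""
    if score >= 80:
        return "Very High Risk"
    if score >= 60:
        return "High Risk"
    if score >= 40:
        return "Medium Risk"
    if score >= 20:
        return "Low Risk"
    return "Very Low Risk"

# Count how many of the 101 possible scores fall into each bucket.
counts = Counter(risk_category(score) for score in range(0, 101))
total = sum(counts.values())
for category, n in counts.most_common():
    print(f"{category}: {n} of {total} scores, expected share {n / total:.2%}")
```

With uniformly random scores, the expected shares are 21/101 for the top bucket and 20/101 for each of the others, which is consistent with the percentages in the recorded output.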

## Key Takeaways: Iceberg and Liquid Clustering in AIDP

### What We Demonstrated

1. **Automatic Optimization**: Created a table with `TBLPROPERTIES('delta.universalFormat.enabledFormats' = 'iceberg') CLUSTER BY (account_id, transaction_date)` and let Delta automatically optimize data layout

2. **Performance Benefits**: Queries that filter on the clustering columns (account_id, transaction_date) can skip unrelated files thanks to data locality; this notebook did not measure timings, so benchmark on your own workload

3. **Low Maintenance**: No manual partitioning, bucketing, or Z-Ordering required; Delta manages the layout during ingestion and table maintenance operations

4. **Real-World Use Case**: Financial services analytics where fraud detection and customer analysis are critical

### AIDP Advantages

- **Unified Analytics**: Seamlessly integrates with other AIDP services
- **Governance**: Catalog and schema isolation for financial data
- **Performance**: Optimized for both OLAP and OLTP workloads
- **Scalability**: Handles financial-scale data volumes effortlessly

### Best Practices for Iceberg and Liquid Clustering

1. **Choose clustering columns** based on your most common query patterns
2. **Start with 1-4 columns** - too many can reduce effectiveness
3. **Consider cardinality** - high-cardinality columns work best
4. **Monitor and adjust** as query patterns evolve
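
As a rough illustration of the first three practices, the sketch below ranks candidate columns by how often queries filter on them and then by the number of distinct values in a sample. Everything here is hypothetical: the helper `rank_clustering_candidates`, the filter counts, and the tiny sample are made up for the example, and a real decision should rest on your actual query history.

```python
def rank_clustering_candidates(rows, filter_counts, max_columns=4):
    """Rank columns by filter frequency, breaking ties by distinct value count."""
    stats = []
    for column, uses in filter_counts.items():
        distinct = len({row[column] for row in rows})
        stats.append((column, uses, distinct))
    # Most frequently filtered first; higher cardinality wins ties.
    stats.sort(key=lambda item: (item[1], item[2]), reverse=True)
    return stats[:max_columns]

sample_rows = [
    {"account_id": "ACC00000001", "transaction_type": "Transfer", "location": "Online"},
    {"account_id": "ACC00000002", "transaction_type": "Payment", "location": "Chicago, IL"},
    {"account_id": "ACC00000003", "transaction_type": "Transfer", "location": "Online"},
    {"account_id": "ACC00000004", "transaction_type": "ATM", "location": "ATM"},
]
# Hypothetical tally of how many recent queries filtered on each column.
filter_counts = {"account_id": 12, "transaction_type": 3, "location": 1}

ranked = rank_clustering_candidates(sample_rows, filter_counts, max_columns=2)
for column, uses, distinct in ranked:
    print(f"{column}: filtered in {uses} queries, {distinct} distinct values in sample")
```

The `max_columns` default of 4 follows the 1-4 column guidance above; lowering it is a simple way to keep the clustering key focused.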

### Next Steps

- Explore other AIDP features like AI/ML integration
- Try liquid clustering with different column combinations
- Scale up to larger financial datasets
- Integrate with real banking systems and fraud detection platforms

This notebook demonstrates how Oracle AI Data Platform combines Delta's advanced liquid clustering with Iceberg's open, future-proof architecture to deliver enterprise-grade analytics that are both high-performance and standards-compliant.