# Insurance: Iceberg and Liquid Clustering Demo


## Overview


This notebook demonstrates **Iceberg and Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using an insurance analytics use case. Liquid clustering organizes the data layout around the clustering columns you choose, so query performance does not depend on manual partitioning or Z-Ordering.

### What is Iceberg?

Apache Iceberg is an open table format for huge analytic datasets that provides:

- **Schema evolution**: Add, drop, rename, update columns without rewriting data
- **Partition evolution**: Change partitioning without disrupting queries
- **Time travel**: Query historical data snapshots for auditing and rollback
- **ACID transactions**: Reliable concurrent read/write operations
- **Cross-engine compatibility**: Works with Spark, Flink, Presto, Hive, and more
- **Open ecosystem**: Apache 2.0 licensed, community-driven development

### Delta Universal Format with Iceberg

Delta Universal Format enables Iceberg compatibility while maintaining Delta's advanced features like liquid clustering. This combination provides:

- **Best of both worlds**: Delta's performance optimizations with Iceberg's openness
- **Multi-engine access**: Query the same data from different analytics engines
- **Future-proof architecture**: Standards-based approach for long-term data investments
- **Enhanced governance**: Rich metadata and catalog integration

### What is Liquid Clustering?

Liquid clustering automatically identifies and groups similar data together based on clustering columns you define. This optimization happens automatically during data ingestion and maintenance operations, providing:

- **Automatic optimization**: No manual tuning required
- **Improved query performance**: Faster queries on clustered columns
- **Reduced maintenance**: No need for manual repartitioning
- **Adaptive clustering**: Adjusts as data patterns change
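To make the "faster queries on clustered columns" point concrete, here is a toy model of statistics-based file skipping in plain Python. It is only a sketch: the 100-row "files" and the plain sort are assumptions standing in for the engine's real file sizes and clustering algorithm, and nothing here calls Delta or Spark.

```python
import random

ROWS_PER_FILE = 100  # toy "file" size, chosen only for illustration


def files_scanned(rows, target):
    """Return (files_read, total_files) for an equality filter on customer_id.

    A file is read only if its min/max range could contain the target,
    which is the idea behind statistics-based data skipping.
    """
    files = [rows[i:i + ROWS_PER_FILE] for i in range(0, len(rows), ROWS_PER_FILE)]
    files_read = sum(1 for f in files if min(f) <= target <= max(f))
    return files_read, len(files)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
customer_ids = [f"CUST{rng.randint(1, 8000):06d}" for _ in range(10_000)]

# Pick a mid-range customer so the comparison is not skewed by an extreme key
target = sorted(customer_ids)[len(customer_ids) // 2]

unclustered = files_scanned(customer_ids, target)      # rows in arrival order
clustered = files_scanned(sorted(customer_ids), target)  # rows grouped by key

print("arrival order (files read, total):", unclustered)
print("sorted by key (files read, total):", clustered)
```

When rows arrive in random order, almost every file's min/max range spans the target, so almost every file must be read. Once rows are grouped by key, only the one or two files that actually hold the customer qualify.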

### Use Case: Claims Processing and Risk Assessment

We'll analyze insurance claims and policy data. Our clustering strategy will optimize for:

- **Policyholder-specific queries**: Fast lookups by customer ID
- **Time-based analysis**: Efficient filtering by claim and policy dates
- **Risk patterns**: Quick aggregation by claim type and risk scores

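## Step 1: AIDP Environment Setup

This notebook uses the Spark session that already exists in your AIDP environment, exposed as `spark`. The first cell creates the catalog and schema that hold the demo table.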

In [1]:
# Create insurance catalog and analytics schema

# In AIDP, catalogs provide data isolation and governance

spark.sql("CREATE CATALOG IF NOT EXISTS insurance")

spark.sql("CREATE SCHEMA IF NOT EXISTS insurance.analytics")

print("Insurance catalog and analytics schema created successfully!")

Insurance catalog and analytics schema created successfully!


## Step 2: Create Delta Table with Liquid Clustering

### Table Design

Our `insurance_claims_uf` table will store:

- **customer_id**: Unique policyholder identifier
- **claim_date**: Date claim was filed
- **policy_type**: Type of insurance (Auto, Home, Health, etc.)
- **claim_amount**: Claim payout amount
- **risk_score**: Customer risk assessment (1-100)
- **processing_time**: Days to process claim
- **claim_status**: Approved, Denied, Pending

### Clustering Strategy

We'll cluster by `customer_id` and `claim_date` because:

- **customer_id**: Policyholders often file multiple claims, grouping their insurance history together
- **claim_date**: Time-based queries are critical for fraud detection, seasonal analysis, and regulatory reporting
- This combination optimizes for both customer risk profiling and temporal claims analysis

In [1]:
# Define the DataFrame schema, then create the Delta table with liquid clustering

# CLUSTER BY names the clustering columns; the table property enables Iceberg-compatible metadata
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType
data_schema = StructType([
    StructField("customer_id", StringType(), True),
    StructField("claim_date", DateType(), True),
    StructField("policy_type", StringType(), True),
    # Double here; the table column is DECIMAL(10,2), so values are cast on insert
    StructField("claim_amount", DoubleType(), True),
    StructField("risk_score", IntegerType(), True),
    StructField("processing_time", IntegerType(), True),
    StructField("claim_status", StringType(), True)
])

spark.sql("""

CREATE TABLE IF NOT EXISTS insurance.analytics.insurance_claims_uf (
    customer_id STRING,
    claim_date DATE,
    policy_type STRING,
    claim_amount DECIMAL(10,2),
    risk_score INT,
    processing_time INT,
    claim_status STRING
)

USING DELTA

TBLPROPERTIES('delta.universalFormat.enabledFormats' = 'iceberg') CLUSTER BY (customer_id, claim_date)

""")

print("Delta table with Iceberg compatibility and liquid clustering created successfully!")

print("Universal format enables Iceberg features while CLUSTER BY (columns) optimizes data layout.")

Delta table with Iceberg compatibility and liquid clustering created successfully!
Universal format enables Iceberg features while CLUSTER BY (columns) optimizes data layout.


## Step 3: Generate Insurance Sample Data

### Data Generation Strategy

We'll create realistic insurance claims data including:

- **8,000 customers**, about 30% with no claims and the rest with 1 to 12 claims during 2024
- **Policy types**: Auto, Home, Health, Life, Property
- **Realistic claim patterns**: Seasonal variations, claim frequencies, processing times
- **Risk scoring**: Customer risk assessment and fraud indicators

### Why This Data Pattern?

This data simulates real insurance scenarios where:

- Customer claims history affects risk assessment
- Seasonal patterns impact claim volumes
- Processing efficiency affects customer satisfaction
- Fraud detection requires pattern analysis
- Regulatory reporting demands temporal analysis
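The generator in the next cell draws from Python's global `random` module without setting a seed, so record counts and sample rows differ on every run. If repeatable data matters, one option is a seeded local generator. The sketch below is an illustration, not part of the notebook: `sample_claim_amounts` is a hypothetical helper, and it reuses only the 0.1 to 3.0 variation range and the Auto average of 3500 from the cell.

```python
import random


def sample_claim_amounts(seed, n=5, avg_claim=3500):
    """Draw n claim amounts from a generator seeded with `seed`."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    return [round(avg_claim * rng.uniform(0.1, 3.0), 2) for _ in range(n)]


first = sample_claim_amounts(seed=42)
second = sample_claim_amounts(seed=42)

print(first)
print("Same seed, same data:", first == second)
```

Calling `random.seed(42)` once at the top of the generation cell would have a similar effect on the existing code, at the cost of sharing state with anything else that uses the global generator.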

In [1]:
# Generate sample insurance claims data

# Standard-library imports only; no random seed is set, so results vary between runs

import random

from datetime import datetime, timedelta


# Define insurance data constants

POLICY_TYPES = ['Auto', 'Home', 'Health', 'Life', 'Property']

CLAIM_STATUSES = ['Approved', 'Denied', 'Pending']

# Base claim parameters by policy type

CLAIM_PARAMS = {

    'Auto': {'avg_claim': 3500, 'frequency': 3, 'processing_days': 14},

    'Home': {'avg_claim': 8500, 'frequency': 1, 'processing_days': 21},

    'Health': {'avg_claim': 1200, 'frequency': 8, 'processing_days': 7},

    'Life': {'avg_claim': 25000, 'frequency': 0.5, 'processing_days': 30},

    'Property': {'avg_claim': 15000, 'frequency': 1.5, 'processing_days': 18}

}


# Generate insurance claims records

claims_data = []

base_date = datetime(2024, 1, 1)


# Create 8,000 customers; about 30% file no claims, the rest file 1-12 (based on frequency)

for customer_num in range(1, 8001):

    customer_id = f"CUST{customer_num:06d}"
    
    # Assign a primary policy type for this customer

    primary_policy = random.choice(POLICY_TYPES)

    params = CLAIM_PARAMS[primary_policy]
    
    # Determine number of claims based on frequency (some customers have no claims)

    if random.random() < 0.3:  # 30% of customers have no claims

        num_claims = 0

    else:

        num_claims = max(1, int(random.gauss(params['frequency'], params['frequency'] * 0.5)))
        num_claims = min(num_claims, 12)  # Cap at 12 claims
    
    # Generate claims

    for i in range(num_claims):

        # Spread claims over 2024 (a leap year, so offsets 0-365 cover Jan 1 through Dec 31)

        days_offset = random.randint(0, 365)

        claim_date = base_date + timedelta(days=days_offset)
        
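        # Sometimes use a different policy type for the same customer.
        # params is reassigned here and not reset in the else branch, so later claims
        # for this customer keep the switched policy's amount and processing parameters.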

        if random.random() < 0.2:

            policy_type = random.choice(POLICY_TYPES)

            params = CLAIM_PARAMS[policy_type]

        else:

            policy_type = primary_policy
        
        # Calculate claim amount with variation

        amount_variation = random.uniform(0.1, 3.0)

        claim_amount = round(params['avg_claim'] * amount_variation, 2)
        
        # Risk score (random base, raised for larger claim amounts)

        base_risk = random.randint(20, 80)

        risk_adjustment = min(20, int(claim_amount / 1000))  # Higher amounts increase risk

        risk_score = min(100, base_risk + risk_adjustment)
        
        # Processing time (varies by claim type and some randomness)

        time_variation = random.uniform(0.5, 2.0)

        processing_time = max(1, int(params['processing_days'] * time_variation))
        
        # Claim status (most approved, some denied, few pending)

        status_weights = [0.75, 0.15, 0.10]  # Approved, Denied, Pending

        claim_status = random.choices(CLAIM_STATUSES, weights=status_weights)[0]
        
        claims_data.append({

            "customer_id": customer_id,

            "claim_date": claim_date.date(),

            "policy_type": policy_type,

            "claim_amount": claim_amount,

            "risk_score": risk_score,

            "processing_time": processing_time,

            "claim_status": claim_status

        })



print(f"Generated {len(claims_data)} insurance claims records")

print("Sample record:", claims_data[0])

Generated 14904 insurance claims records
Sample record: {'customer_id': 'CUST000001', 'claim_date': datetime.date(2024, 1, 11), 'policy_type': 'Property', 'claim_amount': 7632.8, 'risk_score': 46, 'processing_time': 29, 'claim_status': 'Approved'}


## Step 4: Insert Data Using PySpark

### Data Insertion Strategy

We'll use PySpark to:

1. **Create DataFrame** from our generated data
2. **Insert into Delta table** with liquid clustering
3. **Verify the insertion** with a sample query

### Why PySpark for Insertion?

- **Distributed processing**: Handles large datasets efficiently
- **Type safety**: Ensures data integrity
- **Optimization**: Leverages Spark's query optimization
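- **Liquid clustering**: The table's clustering columns guide data layout; how much clustering happens at write time depends on the engine, and running `OPTIMIZE` on the table reclusters data explicitly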
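On the type-safety point: the table declares `claim_amount` as `DECIMAL(10,2)`, while the DataFrame schema uses `DoubleType`, so the insert relies on an implicit cast. One possible alternative, shown here as a sketch rather than the notebook's method, is to convert amounts to exact two-place decimals before building the DataFrame and to declare the field as `DecimalType(10, 2)`. The helper name `to_money` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def to_money(value):
    """Convert a float to an exact two-decimal-place Decimal."""
    # Going through str() avoids carrying binary floating-point noise into the Decimal
    return Decimal(str(value)).quantize(CENT, rounding=ROUND_HALF_UP)


amounts = [7632.8, 38039.82, 912.54]
converted = [to_money(a) for a in amounts]

for raw, exact in zip(amounts, converted):
    print(raw, "->", exact)
```

Values converted this way line up with a `DECIMAL(10,2)` column without rounding surprises, which matters once amounts are summed across many claims.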

In [1]:
# Insert data using PySpark DataFrame operations


# Create DataFrame from generated data

df_claims = spark.createDataFrame(claims_data, schema=data_schema)


# Display schema and sample data

print("DataFrame Schema:")

df_claims.printSchema()



print("\nSample Data:")

df_claims.show(5)


# Insert data into the Delta table; insertInto resolves columns by position, not by name

# The table was created with CLUSTER BY (customer_id, claim_date), which tells Delta how to lay out the data

df_claims.write.mode("overwrite").insertInto("insurance.analytics.insurance_claims_uf")


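# Verify the insertion by counting rows in the table itself rather than in the source DataFrame
inserted_count = spark.table("insurance.analytics.insurance_claims_uf").count()

print(f"\nSuccessfully inserted {inserted_count} records into insurance.analytics.insurance_claims_uf")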

print("Liquid clustering automatically optimized the data layout during insertion!")

DataFrame Schema:
root
 |-- customer_id: string (nullable = true)
 |-- claim_date: date (nullable = true)
 |-- policy_type: string (nullable = true)
 |-- claim_amount: double (nullable = true)
 |-- risk_score: integer (nullable = true)
 |-- processing_time: integer (nullable = true)
 |-- claim_status: string (nullable = true)


Sample Data:


+-----------+----------+-----------+------------+----------+---------------+------------+
|customer_id|claim_date|policy_type|claim_amount|risk_score|processing_time|claim_status|
+-----------+----------+-----------+------------+----------+---------------+------------+
| CUST000001|2024-01-11|   Property|      7632.8|        46|             29|    Approved|
| CUST000001|2024-08-29|   Property|    38039.82|        42|             17|    Approved|
| CUST000002|2024-04-10|     Health|      3507.8|        64|             11|    Approved|
| CUST000002|2024-09-03|     Health|     1839.22|        32|             11|    Approved|
| CUST000002|2024-05-12|     Health|      912.54|        50|             12|    Approved|
+-----------+----------+-----------+------------+----------+---------------+------------+
only showing top 5 rows




Successfully inserted 14904 records into insurance.analytics.insurance_claims_uf
Liquid clustering automatically optimized the data layout during insertion!


## Step 5: Demonstrate Liquid Clustering Benefits

### Query Performance Analysis

The queries below filter on the clustering columns, which is the access pattern liquid clustering is designed for. We'll run three queries that match our clustering strategy:

1. **Customer claims history** (clustered by customer_id)
2. **Time-based claims analysis** (clustered by claim_date)
3. **Combined customer + time queries** (optimal for our clustering)

### Expected Performance Benefits

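On large tables, liquid clustering can make these queries faster because:

- **Data locality**: Related records are physically grouped together
- **Reduced I/O**: File-level statistics let the engine skip files that cannot match the filter
- **Automatic optimization**: No manual tuning required

At this demo's scale (about 15,000 rows) the difference will be small; the benefit grows with table size.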
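The notebook does not time its queries, so the expected benefits above remain expectations. A minimal timing harness such as the one below could be used to measure them. It is a sketch: `time_action` is a hypothetical helper, and the summation is only a stand-in workload so the block runs without Spark.

```python
import statistics
import time


def time_action(action, repeats=5):
    """Run `action` several times and return the median wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
    # The median is less sensitive than the mean to a slow first (cold) run
    return statistics.median(durations)


# Stand-in workload; replace with a real query in the notebook
elapsed = time_action(lambda: sum(range(100_000)))
print(f"Median time: {elapsed:.6f} s")
```

In the notebook, the action would be something like `lambda: spark.sql(query).collect()`, so that Spark actually executes the query instead of only building a plan. Comparing a clustered table against an unclustered copy of the same data gives the most direct evidence.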

In [1]:
# Demonstrate liquid clustering benefits with optimized queries


# Query 1: Customer claims history - benefits from customer_id clustering

print("=== Query 1: Customer Claims History ===")

customer_history = spark.sql("""

SELECT customer_id, claim_date, policy_type, claim_amount, claim_status

FROM insurance.analytics.insurance_claims_uf

WHERE customer_id = 'CUST000001'

ORDER BY claim_date DESC

""")



customer_history.show()

print(f"Records found: {customer_history.count()}")



# Query 2: Time-based high-value claims analysis - benefits from claim_date clustering

print("\n=== Query 2: Recent High-Value Claims ===")

high_value_claims = spark.sql("""

SELECT claim_date, customer_id, policy_type, claim_amount, risk_score

FROM insurance.analytics.insurance_claims_uf

WHERE claim_date >= '2024-06-01' AND claim_amount > 10000

ORDER BY claim_amount DESC, claim_date DESC

""")



high_value_claims.show()

print(f"High-value claims found: {high_value_claims.count()}")



# Query 3: Combined customer + time query - optimal for our clustering strategy

print("\n=== Query 3: Customer Claims Trends ===")

claims_trends = spark.sql("""

SELECT customer_id, claim_date, policy_type, claim_amount, risk_score

FROM insurance.analytics.insurance_claims_uf

WHERE customer_id LIKE 'CUST000%' AND claim_date >= '2024-04-01'

ORDER BY customer_id, claim_date

""")



claims_trends.show()

print(f"Claims trend records found: {claims_trends.count()}")

=== Query 1: Customer Claims History ===


+-----------+----------+-----------+------------+------------+
|customer_id|claim_date|policy_type|claim_amount|claim_status|
+-----------+----------+-----------+------------+------------+
| CUST000001|2024-08-29|   Property|    38039.82|    Approved|
| CUST000001|2024-01-11|   Property|     7632.80|    Approved|
+-----------+----------+-----------+------------+------------+



Records found: 2

=== Query 2: Recent High-Value Claims ===


+----------+-----------+-----------+------------+----------+
|claim_date|customer_id|policy_type|claim_amount|risk_score|
+----------+-----------+-----------+------------+----------+
|2024-06-02| CUST002941|       Life|    74994.62|        59|
|2024-12-21| CUST005262|       Auto|    74980.50|        80|
|2024-07-16| CUST000987|       Life|    74936.48|        74|
|2024-08-17| CUST007202|       Life|    74880.06|        55|
|2024-12-18| CUST001214|       Life|    74858.45|        41|
|2024-07-07| CUST007810|     Health|    74856.28|        45|
|2024-07-17| CUST004966|       Life|    74849.51|        72|
|2024-12-31| CUST000026|       Auto|    74809.23|        68|
|2024-08-15| CUST003894|       Life|    74794.83|        75|
|2024-11-18| CUST005686|       Life|    74788.74|        98|
|2024-06-08| CUST000119|     Health|    74753.89|        84|
|2024-10-21| CUST006486|       Life|    74748.00|        64|
|2024-10-11| CUST006655|       Life|    74706.78|        87|
|2024-11-20| CUST000052|

High-value claims found: 3247

=== Query 3: Customer Claims Trends ===


+-----------+----------+-----------+------------+----------+
|customer_id|claim_date|policy_type|claim_amount|risk_score|
+-----------+----------+-----------+------------+----------+
| CUST000001|2024-08-29|   Property|    38039.82|        42|
| CUST000002|2024-04-10|     Health|     3507.80|        64|
| CUST000002|2024-05-12|     Health|      912.54|        50|
| CUST000002|2024-05-20|     Health|    45930.57|        96|
| CUST000002|2024-06-18|     Health|     3362.74|        58|
| CUST000002|2024-08-31|       Life|     6042.58|        64|
| CUST000002|2024-09-02|     Health|     2378.79|        73|
| CUST000002|2024-09-03|     Health|     1839.22|        32|
| CUST000002|2024-10-04|     Health|      546.56|        71|
| CUST000002|2024-11-21|     Health|      381.53|        30|
| CUST000002|2024-12-06|     Health|     2737.99|        43|
| CUST000004|2024-06-12|       Life|    33700.70|        40|
| CUST000004|2024-12-05|   Property|    10192.90|        74|
| CUST000005|2024-09-16|

Claims trend records found: 1320


## Step 6: Analyze Clustering Effectiveness

### Understanding the Impact

This step does not inspect the physical data layout directly. Instead, it runs aggregate queries over the clustered table to show the kinds of insurance insights this structure supports.

### Key Analytics

- **Customer risk profiling**: claim counts, totals, and average risk score per customer
- **Policy performance**: claim volume, payouts, and processing time by policy type
- **Claims processing efficiency**: claims grouped into processing-time bands
- **Monthly claims trends**: volume, payout, and unique claimants per month
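The processing-efficiency query in the next cell assigns each claim to a band with a SQL `CASE` expression. The same banding logic can be sketched in plain Python, which makes the boundaries easy to check. The function name `processing_category` and the sample values are illustrative assumptions; the labels and thresholds are copied from the query.

```python
from collections import Counter


def processing_category(days):
    """Mirror the CASE expression: band a claim by days to process."""
    if days <= 7:
        return "Fast (1-7 days)"
    if days <= 14:
        return "Normal (8-14 days)"
    if days <= 21:
        return "Slow (15-21 days)"
    return "Very Slow (22+ days)"


# Made-up processing times that hit every boundary
sample_days = [5, 7, 8, 14, 15, 21, 22, 29]
band_counts = Counter(processing_category(d) for d in sample_days)

for band, count in band_counts.items():
    print(f"{band}: {count}")
```

Because the conditions are evaluated in order, each upper bound is inclusive: a 7-day claim is "Fast" and a 21-day claim is "Slow", matching the SQL.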

In [1]:
# Analyze clustering effectiveness and insurance insights


# Customer risk analysis

print("=== Customer Risk Analysis ===")

customer_risk = spark.sql("""

SELECT customer_id, COUNT(*) as total_claims,

       ROUND(SUM(claim_amount), 2) as total_claimed,

       ROUND(AVG(claim_amount), 2) as avg_claim_amount,

       ROUND(AVG(risk_score), 2) as avg_risk_score,

       ROUND(AVG(processing_time), 2) as avg_processing_days

FROM insurance.analytics.insurance_claims_uf

GROUP BY customer_id

ORDER BY total_claimed DESC

LIMIT 10

""")



customer_risk.show()


# Policy type performance

print("\n=== Policy Type Performance ===")

policy_performance = spark.sql("""

SELECT policy_type, COUNT(*) as total_claims,

       ROUND(SUM(claim_amount), 2) as total_payout,

       ROUND(AVG(claim_amount), 2) as avg_claim_amount,

       ROUND(AVG(processing_time), 2) as avg_processing_days,

       COUNT(DISTINCT customer_id) as affected_customers

FROM insurance.analytics.insurance_claims_uf

GROUP BY policy_type

ORDER BY total_payout DESC

""")



policy_performance.show()


# Claims processing efficiency

print("\n=== Claims Processing Efficiency ===")

processing_efficiency = spark.sql("""

SELECT 

    CASE 

        WHEN processing_time <= 7 THEN 'Fast (1-7 days)'

        WHEN processing_time <= 14 THEN 'Normal (8-14 days)'

        WHEN processing_time <= 21 THEN 'Slow (15-21 days)'

        ELSE 'Very Slow (22+ days)'

    END as processing_category,

    COUNT(*) as claim_count,

    ROUND(AVG(processing_time), 2) as avg_days,

    ROUND(SUM(claim_amount), 2) as total_amount

FROM insurance.analytics.insurance_claims_uf

GROUP BY 

    CASE 

        WHEN processing_time <= 7 THEN 'Fast (1-7 days)'

        WHEN processing_time <= 14 THEN 'Normal (8-14 days)'

        WHEN processing_time <= 21 THEN 'Slow (15-21 days)'

        ELSE 'Very Slow (22+ days)'

    END

ORDER BY avg_days

""")



processing_efficiency.show()


# Monthly claims trends

print("\n=== Monthly Claims Trends ===")

monthly_trends = spark.sql("""

SELECT DATE_FORMAT(claim_date, 'yyyy-MM') as month,

       COUNT(*) as total_claims,

       ROUND(SUM(claim_amount), 2) as monthly_payout,

       ROUND(AVG(claim_amount), 2) as avg_claim_amount,

       ROUND(AVG(risk_score), 2) as avg_risk_score,

       COUNT(DISTINCT customer_id) as unique_claimants

FROM insurance.analytics.insurance_claims_uf

GROUP BY DATE_FORMAT(claim_date, 'yyyy-MM')

ORDER BY month

""")



monthly_trends.show()

=== Customer Risk Analysis ===


+-----------+------------+-------------+----------------+--------------+-------------------+
|customer_id|total_claims|total_claimed|avg_claim_amount|avg_risk_score|avg_processing_days|
+-----------+------------+-------------+----------------+--------------+-------------------+
| CUST006889|          12|    512096.09|        42674.67|          71.0|              36.42|
| CUST003087|          12|    494079.73|        41173.31|         63.08|              34.42|
| CUST004252|          12|    480623.64|        40051.97|         62.83|               39.0|
| CUST003861|          12|    463411.73|        38617.64|         56.67|              39.42|
| CUST007598|          12|    429344.95|        35778.75|         66.42|              33.67|
| CUST007746|          12|    429089.80|        35757.48|         59.83|              28.25|
| CUST000950|          10|    406357.34|        40635.73|          70.5|               29.3|
| CUST006165|          12|    391941.88|        32661.82|         68.0

+-----------+------------+------------+----------------+-------------------+------------------+
|policy_type|total_claims|total_payout|avg_claim_amount|avg_processing_days|affected_customers|
+-----------+------------+------------+----------------+-------------------+------------------+
|     Health|        7257| 59359391.63|         8179.60|               14.2|              1371|
|       Life|        1460| 56015836.12|        38367.01|              37.45|              1416|
|   Property|        1718| 39668505.50|        23089.93|              21.87|              1445|
|       Auto|        2905| 21194047.58|         7295.71|               17.9|              1471|
|       Home|        1564| 20049577.32|        12819.42|              25.39|              1504|
+-----------+------------+------------+----------------+-------------------+------------------+


=== Claims Processing Efficiency ===


+--------------------+-----------+--------+------------+
| processing_category|claim_count|avg_days|total_amount|
+--------------------+-----------+--------+------------+
|     Fast (1-7 days)|       2115|    5.33|  4439848.71|
|  Normal (8-14 days)|       4835|   10.84| 28009319.75|
|   Slow (15-21 days)|       2495|   18.02| 39898654.54|
|Very Slow (22+ days)|       5459|   32.68|123939535.15|
+--------------------+-----------+--------+------------+


=== Monthly Claims Trends ===


+-------+------------+--------------+----------------+--------------+----------------+
|  month|total_claims|monthly_payout|avg_claim_amount|avg_risk_score|unique_claimants|
+-------+------------+--------------+----------------+--------------+----------------+
|2024-01|        1272|   16483541.24|        12958.76|         58.53|            1045|
|2024-02|        1157|   14759706.28|        12756.88|         57.83|             969|
|2024-03|        1275|   16761344.67|        13146.15|         58.69|            1065|
|2024-04|        1249|   15710918.29|        12578.80|         58.15|            1030|
|2024-05|        1273|   16723419.76|        13137.01|          58.7|            1042|
|2024-06|        1203|   16722261.68|        13900.47|          59.7|             998|
|2024-07|        1248|   17641965.71|        14136.19|         58.69|            1051|
|2024-08|        1217|   15749763.34|        12941.47|         58.69|            1006|
|2024-09|        1266|   17168161.17|      

## Key Takeaways: Iceberg and Liquid Clustering in AIDP

### What We Demonstrated

1. **Automatic Optimization**: Created a table with `TBLPROPERTIES('delta.universalFormat.enabledFormats' = 'iceberg') CLUSTER BY (customer_id, claim_date)` so that Delta manages data layout around the clustering columns

2. **Performance Benefits**: Queries that filter on the clustering columns (customer_id, claim_date) benefit from data locality and file skipping; this notebook did not time the queries, so measure on your own data volumes

3. **Low Maintenance**: No manual partitioning, bucketing, or Z-Ordering required; clustering is maintained by the engine and by `OPTIMIZE` runs

4. **Real-World Use Case**: Insurance analytics where claims processing and risk assessment are critical

### AIDP Advantages

- **Unified Analytics**: Seamlessly integrates with other AIDP services
- **Governance**: Catalog and schema isolation for insurance data
- **Performance**: Optimized for both OLAP and OLTP workloads
- **Scalability**: Handles insurance-scale data volumes effortlessly

### Best Practices for Iceberg and Liquid Clustering

1. **Choose clustering columns** based on your most common query patterns
2. **Start with 1-4 columns** - too many can reduce effectiveness
3. **Consider cardinality** - high-cardinality columns work best
4. **Monitor and adjust** as query patterns evolve
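For the "monitor and adjust" practice, Delta Lake provides `ALTER TABLE ... CLUSTER BY` to change clustering columns and `OPTIMIZE` to recluster existing data. The sketch below only builds those statements as strings. Whether the Delta version in your AIDP environment accepts them is an assumption to verify, and the helper names and the choice of `claim_date` as the new clustering column are hypothetical.

```python
MAX_CLUSTER_COLUMNS = 4  # mirrors the "1-4 columns" guidance above


def optimize_statement(table):
    """Build an OPTIMIZE statement, which reclusters a liquid-clustered table."""
    return f"OPTIMIZE {table}"


def recluster_statement(table, columns):
    """Build an ALTER TABLE statement that changes the clustering columns."""
    if not 1 <= len(columns) <= MAX_CLUSTER_COLUMNS:
        raise ValueError(f"expected 1-{MAX_CLUSTER_COLUMNS} clustering columns")
    return f"ALTER TABLE {table} CLUSTER BY ({', '.join(columns)})"


TABLE = "insurance.analytics.insurance_claims_uf"

# Hypothetical adjustment: queries have shifted to be purely date-driven
statements = [
    recluster_statement(TABLE, ["claim_date"]),
    optimize_statement(TABLE),
]

for stmt in statements:
    print(stmt)  # in the notebook, pass each statement to spark.sql(stmt)
```

Changing the clustering columns affects how new and rewritten data is laid out. Existing files keep their old layout until they are rewritten, which is why the `OPTIMIZE` statement follows the `ALTER TABLE`.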

### Next Steps

- Explore other AIDP features like AI/ML integration
- Try liquid clustering with different column combinations
- Scale up to larger insurance datasets
- Integrate with real claims processing systems

This notebook demonstrates how Oracle AI Data Platform combines Delta's advanced liquid clustering with Iceberg's open, future-proof architecture to deliver enterprise-grade analytics that are both high-performance and standards-compliant.