# Healthcare: Iceberg and Liquid Clustering Demo

## Overview

This notebook demonstrates **Iceberg and Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using a healthcare analytics use case. Liquid clustering organizes the data layout around the clustering columns you choose, so queries that filter on those columns can skip unrelated files without manual partitioning or Z-Ordering. The same table is also exposed to Iceberg clients through Delta Universal Format (UniForm).

### What is Iceberg?

Apache Iceberg is an open table format for huge analytic datasets that provides:

- **Schema evolution**: Add, drop, rename, update columns without rewriting data
- **Partition evolution**: Change partitioning without disrupting queries
- **Time travel**: Query historical data snapshots for auditing and rollback
- **ACID transactions**: Reliable concurrent read/write operations
- **Cross-engine compatibility**: Works with Spark, Flink, Presto, Hive, and more
- **Open ecosystem**: Apache 2.0 licensed, community-driven development

### Delta Universal Format with Iceberg

Delta Universal Format enables Iceberg compatibility while maintaining Delta's advanced features like liquid clustering. This combination provides:

- **Best of both worlds**: Delta's performance optimizations with Iceberg's openness
- **Multi-engine access**: Query the same data from different analytics engines
- **Future-proof architecture**: Standards-based approach for long-term data investments
- **Enhanced governance**: Rich metadata and catalog integration

### What is Liquid Clustering?

Liquid clustering groups rows with similar values in the clustering columns you define into the same data files. The layout is applied during data ingestion and maintenance operations (such as `OPTIMIZE`), providing:

- **Automatic optimization**: No manual tuning required
- **Improved query performance**: Faster queries on clustered columns
- **Reduced maintenance**: No need for manual repartitioning
- **Adaptive clustering**: Adjusts as data patterns change
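
To make the "faster queries on clustered columns" point concrete, here is a minimal pure-Python sketch of file skipping based on per-file min/max statistics. It is an illustration only: the file size, row counts, and helper names (`make_files`, `files_to_scan`) are hypothetical, and a plain sort stands in for whatever layout algorithm the engine actually uses.

```python
import random

def make_files(keys, rows_per_file):
    """Split keys into fixed-size 'files' and record each file's (min, max) key."""
    files = []
    for start in range(0, len(keys), rows_per_file):
        chunk = keys[start:start + rows_per_file]
        files.append((min(chunk), max(chunk)))
    return files

def files_to_scan(files, wanted):
    """Count files whose [min, max] range could contain the wanted key."""
    return sum(1 for lo, hi in files if lo <= wanted <= hi)

rng = random.Random(0)
# 10,000 rows spread over 100 hypothetical patient IDs, in arrival order
keys = [f"PAT{rng.randint(1, 100):04d}" for _ in range(10_000)]

unclustered = make_files(keys, rows_per_file=500)        # arrival order
clustered = make_files(sorted(keys), rows_per_file=500)  # grouped by key

print("files scanned, unclustered:", files_to_scan(unclustered, "PAT0042"))
print("files scanned, clustered:  ", files_to_scan(clustered, "PAT0042"))
```

When rows arrive in random order, almost every file's min/max range covers the requested patient, so nearly all files must be read. Once rows are grouped by key, only the one or two files holding that patient qualify.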

### Use Case: Patient Diagnosis Analytics

We'll analyze patient diagnosis records from a healthcare system. The workload includes:
- **Patient-specific queries**: Fast lookups by patient ID (a clustering column)
- **Time-based analysis**: Efficient filtering by diagnosis date (a clustering column)
- **Diagnosis patterns**: Aggregation by diagnosis type (not a clustering column, so these queries scan the full table)

## Step 1: AIDP Environment Setup

This notebook leverages the existing Spark session in your AIDP environment.

In [1]:
# Create healthcare catalog and gold schema
# In AIDP, catalogs provide data isolation and governance

spark.sql("CREATE CATALOG IF NOT EXISTS healthcare")
spark.sql("CREATE SCHEMA IF NOT EXISTS healthcare.gold")

print("Healthcare catalog and gold schema created successfully!")

Healthcare catalog and gold schema created successfully!


## Step 2: Create Delta Table with Liquid Clustering

### Table Design

Our `patient_diagnoses_uf` table will store:
- **patient_id**: Unique patient identifier
- **diagnosis_date**: When the diagnosis was made
- **diagnosis_code**: ICD-10 diagnosis code
- **diagnosis_description**: Human-readable diagnosis
- **severity_level**: Critical, High, Medium, Low
- **treating_physician**: Physician ID
- **facility_id**: Healthcare facility

### Clustering Strategy

We'll cluster by `patient_id` and `diagnosis_date` because:
- **patient_id**: Patients often have multiple visits, grouping their records together
- **diagnosis_date**: Time-based queries are common in healthcare analytics
- This combination optimizes for both patient history lookups and temporal analysis

In [1]:
# Create Delta table with Iceberg compatibility via Universal Format and liquid clustering
# TBLPROPERTIES enables Delta Universal Format for Iceberg compatibility

# Only the types used by the schema below are imported
from pyspark.sql.types import StructType, StructField, StringType, DateType
data_schema = StructType([
    StructField("patient_id", StringType(), True),
    StructField("diagnosis_date", DateType(), True),
    StructField("diagnosis_code", StringType(), True),
    StructField("diagnosis_description", StringType(), True),
    StructField("severity_level", StringType(), True),
    StructField("treating_physician", StringType(), True),
    StructField("facility_id", StringType(), True)
])

spark.sql("""
CREATE TABLE IF NOT EXISTS healthcare.gold.patient_diagnoses_uf (
    patient_id STRING,
    diagnosis_date DATE,
    diagnosis_code STRING,
    diagnosis_description STRING,
    severity_level STRING,
    treating_physician STRING,
    facility_id STRING
)
USING DELTA
TBLPROPERTIES('delta.universalFormat.enabledFormats' = 'iceberg')
CLUSTER BY (patient_id, diagnosis_date)
""")

print("Delta table with Iceberg compatibility and liquid clustering created successfully!")
print("Universal format enables Iceberg features while CLUSTER BY (patient_id, diagnosis_date) optimizes data layout.")

Delta table with Iceberg compatibility and liquid clustering created successfully!
Universal format enables Iceberg features while CLUSTER BY (patient_id, diagnosis_date) optimizes data layout.
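
The printed messages above do not by themselves confirm that the property was applied. A minimal sketch for checking it is shown below, assuming the AIDP Spark session supports the standard `SHOW TBLPROPERTIES` statement (which returns `key` and `value` columns). The helper names `table_properties` and `uniform_iceberg_enabled` are hypothetical. Depending on the Delta Lake version, UniForm may also need further properties (for example `delta.enableIcebergCompatV2`); whether AIDP sets these for you is not stated in this notebook.

```python
def table_properties(spark, table_name):
    """Return the table's properties as a dict using SHOW TBLPROPERTIES."""
    rows = spark.sql(f"SHOW TBLPROPERTIES {table_name}").collect()
    return {row["key"]: row["value"] for row in rows}

def uniform_iceberg_enabled(properties):
    """True if 'iceberg' is listed in delta.universalFormat.enabledFormats."""
    formats = properties.get("delta.universalFormat.enabledFormats", "")
    return "iceberg" in [f.strip().lower() for f in formats.split(",")]

# Usage in the notebook (requires the live `spark` session):
# props = table_properties(spark, "healthcare.gold.patient_diagnoses_uf")
# print(uniform_iceberg_enabled(props))
```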


## Step 3: Generate Healthcare Sample Data

### Data Generation Strategy

We'll create realistic healthcare data including:
- **100 patients** with multiple diagnoses over time
- **Common diagnoses**: Diabetes, Hypertension, Asthma, etc.
- **Realistic temporal patterns**: Follow-up visits, chronic condition management
- **Multiple facilities**: Different hospitals/clinics

### Why This Data Pattern?

This data simulates real healthcare scenarios where:
- Patients have multiple encounters
- Chronic conditions require ongoing monitoring
- Time-based analysis reveals treatment effectiveness
- Facility-level reporting is needed

In [1]:
# Generate sample healthcare diagnosis data
# Uses only the Python standard library (random, datetime); no Spark functions are needed here

import random
from datetime import datetime, timedelta

# Define healthcare data constants
DIAGNOSES = [
    ("E11.9", "Type 2 diabetes mellitus without complications", "Medium"),
    ("I10", "Essential hypertension", "High"),
    ("J45.909", "Unspecified asthma, uncomplicated", "Medium"),
    ("M54.5", "Low back pain", "Low"),
    ("N39.0", "Urinary tract infection, site not specified", "Medium"),
    ("Z51.11", "Encounter for antineoplastic chemotherapy", "Critical"),
    ("I25.10", "Atherosclerotic heart disease of native coronary artery without angina pectoris", "High"),
    ("F41.9", "Anxiety disorder, unspecified", "Medium"),
    ("M79.3", "Panniculitis, unspecified", "Low"),
    ("Z00.00", "Encounter for general adult medical examination without abnormal findings", "Low")
]

FACILITIES = ["HOSP001", "HOSP002", "CLINIC001", "CLINIC002", "URGENT001"]
PHYSICIANS = ["DR_SMITH", "DR_JOHNSON", "DR_WILLIAMS", "DR_BROWN", "DR_JONES", "DR_GARCIA", "DR_MILLER", "DR_DAVIS"]

# Generate patient diagnosis records
patient_data = []
base_date = datetime(2024, 1, 1)

# Create 100 patients with 2-5 diagnoses each
for patient_num in range(1, 101):
    patient_id = f"PAT{patient_num:04d}"
    
    # Each patient gets 2-5 diagnoses over several months
    num_diagnoses = random.randint(2, 5)
    
    for i in range(num_diagnoses):
        # Spread diagnoses over 6 months
        days_offset = random.randint(0, 180)
        diagnosis_date = base_date + timedelta(days=days_offset)
        
        # Select random diagnosis
        diagnosis_code, description, severity = random.choice(DIAGNOSES)
        
        # Select random facility and physician
        facility = random.choice(FACILITIES)
        physician = random.choice(PHYSICIANS)
        
        patient_data.append({
            "patient_id": patient_id,
            "diagnosis_date": diagnosis_date.date(),
            "diagnosis_code": diagnosis_code,
            "diagnosis_description": description,
            "severity_level": severity,
            "treating_physician": physician,
            "facility_id": facility
        })

print(f"Generated {len(patient_data)} patient diagnosis records")
print("Sample record:", patient_data[0])

Generated 349 patient diagnosis records
Sample record: {'patient_id': 'PAT0001', 'diagnosis_date': datetime.date(2024, 4, 25), 'diagnosis_code': 'Z51.11', 'diagnosis_description': 'Encounter for antineoplastic chemotherapy', 'severity_level': 'Critical', 'treating_physician': 'DR_WILLIAMS', 'facility_id': 'HOSP002'}
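
Because the generator above is unseeded, the record count (349 here) and the sample record change on every run. If reproducible output matters, one option is a dedicated seeded generator. The sketch below is a reduced version of the loop above with only two fields; the function name `generate_records` and the seed value are illustrative assumptions.

```python
import random
from datetime import date, timedelta

def generate_records(seed, num_patients=100):
    """Generate 2-5 records per patient using a private, seeded random generator."""
    rng = random.Random(seed)  # isolated from the global random state
    base = date(2024, 1, 1)
    records = []
    for n in range(1, num_patients + 1):
        for _ in range(rng.randint(2, 5)):
            records.append({
                "patient_id": f"PAT{n:04d}",
                "diagnosis_date": base + timedelta(days=rng.randint(0, 180)),
            })
    return records

first = generate_records(seed=42)
second = generate_records(seed=42)
print(first == second)            # True: same seed, same data
print(200 <= len(first) <= 500)   # True: 100 patients x 2-5 records each
```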


## Step 4: Insert Data Using PySpark

### Data Insertion Strategy

We'll use PySpark to:
1. **Create DataFrame** from our generated data
2. **Insert into Delta table** with liquid clustering
3. **Verify the insertion** with a sample query

### Why PySpark for Insertion?

- **Distributed processing**: Handles large datasets efficiently
- **Type safety**: Ensures data integrity
- **Optimization**: Leverages Spark's query optimization
- **Liquid clustering**: Automatically applies clustering during insertion
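
One caveat on the type-safety point: `insertInto` resolves columns by position rather than by name, so a DataFrame whose columns are in a different order can silently write values into the wrong columns when the types happen to match. A small guard such as the following can be run before the write. The helper name `check_column_order` is hypothetical.

```python
def check_column_order(df_columns, table_columns):
    """Raise ValueError if the DataFrame columns differ in name or order from the table."""
    if list(df_columns) != list(table_columns):
        raise ValueError(
            f"Column mismatch: DataFrame {list(df_columns)} vs table {list(table_columns)}"
        )

# Usage in the notebook (requires the live `spark` session):
# check_column_order(
#     df_diagnoses.columns,
#     spark.table("healthcare.gold.patient_diagnoses_uf").columns,
# )
```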

In [1]:
# Insert data using PySpark DataFrame operations
# insertInto matches columns by position, so the DataFrame column order must match the table

# Create DataFrame from generated data
df_diagnoses = spark.createDataFrame(patient_data, schema=data_schema)

# Display schema and sample data
print("DataFrame Schema:")
df_diagnoses.printSchema()

print("\nSample Data:")
df_diagnoses.show(5)

# Insert data into Delta table with liquid clustering
# The CLUSTER BY (patient_id, diagnosis_date) will automatically optimize the data layout
df_diagnoses.write.mode("overwrite").insertInto("healthcare.gold.patient_diagnoses_uf")

print(f"\nSuccessfully inserted {df_diagnoses.count()} records into healthcare.gold.patient_diagnoses_uf")
print("Liquid clustering automatically optimized the data layout during insertion!")

DataFrame Schema:
root
 |-- patient_id: string (nullable = true)
 |-- diagnosis_date: date (nullable = true)
 |-- diagnosis_code: string (nullable = true)
 |-- diagnosis_description: string (nullable = true)
 |-- severity_level: string (nullable = true)
 |-- treating_physician: string (nullable = true)
 |-- facility_id: string (nullable = true)


Sample Data:


+----------+--------------+--------------+---------------------+--------------+------------------+-----------+
|patient_id|diagnosis_date|diagnosis_code|diagnosis_description|severity_level|treating_physician|facility_id|
+----------+--------------+--------------+---------------------+--------------+------------------+-----------+
|   PAT0001|    2024-04-25|        Z51.11| Encounter for ant...|      Critical|       DR_WILLIAMS|    HOSP002|
|   PAT0001|    2024-05-09|         E11.9| Type 2 diabetes m...|        Medium|       DR_WILLIAMS|    HOSP002|
|   PAT0001|    2024-06-19|         M79.3| Panniculitis, uns...|           Low|          DR_SMITH|  CLINIC002|
|   PAT0001|    2024-04-13|           I10| Essential hyperte...|          High|          DR_BROWN|  CLINIC001|
|   PAT0002|    2024-04-10|         M79.3| Panniculitis, uns...|           Low|        DR_JOHNSON|    HOSP001|
+----------+--------------+--------------+---------------------+--------------+------------------+-----------+


Successfully inserted 349 records into healthcare.gold.patient_diagnoses_uf
Liquid clustering automatically optimized the data layout during insertion!


## Step 5: Demonstrate Liquid Clustering Benefits

### Query Performance Analysis

Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:

1. **Patient history lookup** (clustered by patient_id)
2. **Time-based analysis** (clustered by diagnosis_date)
3. **Combined patient + time queries** (optimal for our clustering)

### Expected Performance Benefits

With liquid clustering, queries like these can read far less data on large tables (the 349-row sample used here is too small to show a measurable difference) because:
- **Data locality**: Related records are physically grouped together
- **Reduced I/O**: Less data needs to be read from disk
- **Automatic optimization**: No manual tuning required
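
These benefits are only meaningful when measured. A minimal timing helper is sketched below; it takes any zero-argument callable, so it can wrap a Spark query in the notebook. The name `time_query` and the repeat count are assumptions, and wall-clock timing on a shared cluster is noisy, so treat the numbers as rough indicators.

```python
import statistics
import time

def time_query(run_query, repeats=5):
    """Run a zero-argument callable several times; return the median wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)

# Stand-in workload so the sketch runs anywhere
print(f"median seconds: {time_query(lambda: sum(range(100_000))):.6f}")

# Usage in the notebook (requires the live `spark` session):
# time_query(lambda: spark.sql(
#     "SELECT * FROM healthcare.gold.patient_diagnoses_uf WHERE patient_id = 'PAT0001'"
# ).collect())
```

The median is used rather than the mean so that a single slow first run (for example, a cold cache) does not dominate the result.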

In [1]:
# Demonstrate liquid clustering benefits with optimized queries

# Query 1: Patient history - benefits from patient_id clustering
print("=== Query 1: Patient Diagnosis History ===")
patient_history = spark.sql("""
SELECT patient_id, diagnosis_date, diagnosis_code, diagnosis_description, severity_level
FROM healthcare.gold.patient_diagnoses_uf
WHERE patient_id = 'PAT0001'
ORDER BY diagnosis_date
""")

patient_history.show()
print(f"Records found: {patient_history.count()}")

# Query 2: Time-based analysis - benefits from diagnosis_date clustering
print("\n=== Query 2: Recent Critical Diagnoses ===")
recent_critical = spark.sql("""
SELECT diagnosis_date, patient_id, diagnosis_code, diagnosis_description, treating_physician
FROM healthcare.gold.patient_diagnoses_uf
WHERE diagnosis_date >= '2024-04-01' AND severity_level = 'Critical'
ORDER BY diagnosis_date DESC
""")

recent_critical.show()
print(f"Critical diagnoses found: {recent_critical.count()}")

# Query 3: Combined patient + time query - optimal for our clustering strategy
print("\n=== Query 3: Patient Timeline Analysis ===")
patient_timeline = spark.sql("""
SELECT patient_id, diagnosis_date, diagnosis_code, severity_level, facility_id
FROM healthcare.gold.patient_diagnoses_uf
WHERE patient_id LIKE 'PAT001%' AND diagnosis_date >= '2024-03-01'
ORDER BY patient_id, diagnosis_date
""")

patient_timeline.show()
print(f"Timeline records found: {patient_timeline.count()}")

=== Query 1: Patient Diagnosis History ===


+----------+--------------+--------------+---------------------+--------------+
|patient_id|diagnosis_date|diagnosis_code|diagnosis_description|severity_level|
+----------+--------------+--------------+---------------------+--------------+
|   PAT0001|    2024-04-13|           I10| Essential hyperte...|          High|
|   PAT0001|    2024-04-25|        Z51.11| Encounter for ant...|      Critical|
|   PAT0001|    2024-05-09|         E11.9| Type 2 diabetes m...|        Medium|
|   PAT0001|    2024-06-19|         M79.3| Panniculitis, uns...|           Low|
+----------+--------------+--------------+---------------------+--------------+



Records found: 4

=== Query 2: Recent Critical Diagnoses ===


+--------------+----------+--------------+---------------------+------------------+
|diagnosis_date|patient_id|diagnosis_code|diagnosis_description|treating_physician|
+--------------+----------+--------------+---------------------+------------------+
|    2024-06-26|   PAT0007|        Z51.11| Encounter for ant...|          DR_DAVIS|
|    2024-06-24|   PAT0097|        Z51.11| Encounter for ant...|         DR_MILLER|
|    2024-06-18|   PAT0082|        Z51.11| Encounter for ant...|          DR_JONES|
|    2024-06-08|   PAT0014|        Z51.11| Encounter for ant...|          DR_BROWN|
|    2024-05-30|   PAT0075|        Z51.11| Encounter for ant...|         DR_GARCIA|
|    2024-05-26|   PAT0015|        Z51.11| Encounter for ant...|          DR_SMITH|
|    2024-05-24|   PAT0069|        Z51.11| Encounter for ant...|          DR_SMITH|
|    2024-04-28|   PAT0027|        Z51.11| Encounter for ant...|         DR_MILLER|
|    2024-04-25|   PAT0001|        Z51.11| Encounter for ant...|       DR_WI

Critical diagnoses found: 19

=== Query 3: Patient Timeline Analysis ===


+----------+--------------+--------------+--------------+-----------+
|patient_id|diagnosis_date|diagnosis_code|severity_level|facility_id|
+----------+--------------+--------------+--------------+-----------+
|   PAT0011|    2024-06-15|           I10|          High|  CLINIC002|
|   PAT0012|    2024-05-14|        I25.10|          High|    HOSP002|
|   PAT0012|    2024-05-23|        I25.10|          High|  CLINIC002|
|   PAT0012|    2024-06-18|        Z00.00|           Low|  URGENT001|
|   PAT0013|    2024-05-17|         N39.0|        Medium|    HOSP001|
|   PAT0014|    2024-05-24|        I25.10|          High|  CLINIC002|
|   PAT0014|    2024-06-08|         N39.0|        Medium|  CLINIC002|
|   PAT0014|    2024-06-08|        Z51.11|      Critical|  URGENT001|
|   PAT0015|    2024-04-16|       J45.909|        Medium|  CLINIC002|
|   PAT0015|    2024-05-26|        Z51.11|      Critical|  CLINIC002|
|   PAT0015|    2024-06-06|       J45.909|        Medium|  CLINIC001|
|   PAT0016|    2024

Timeline records found: 22


## Step 6: Analyze Clustering Effectiveness

### Understanding the Impact

Let's examine how liquid clustering has organized our data and analyze some aggregate statistics to demonstrate the healthcare insights possible with this optimized structure.

### Key Analytics

- **Diagnosis frequency** by type
- **Severity distribution** across facilities
- **Physician workload** analysis
- **Temporal patterns** in diagnoses

In [1]:
# Analyze clustering effectiveness and healthcare insights

# Diagnosis frequency analysis
print("=== Diagnosis Frequency Analysis ===")
diagnosis_freq = spark.sql("""
SELECT diagnosis_code, diagnosis_description, COUNT(*) as frequency,
       ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) as percentage
FROM healthcare.gold.patient_diagnoses_uf
GROUP BY diagnosis_code, diagnosis_description
ORDER BY frequency DESC
""")

diagnosis_freq.show(truncate=False)

# Severity distribution by facility
print("\n=== Severity Distribution by Facility ===")
severity_by_facility = spark.sql("""
SELECT facility_id, severity_level, COUNT(*) as count
FROM healthcare.gold.patient_diagnoses_uf
GROUP BY facility_id, severity_level
ORDER BY facility_id, severity_level
""")

severity_by_facility.show()

# Physician workload analysis
print("\n=== Physician Workload Analysis ===")
physician_workload = spark.sql("""
SELECT treating_physician, COUNT(*) as total_diagnoses,
       COUNT(DISTINCT patient_id) as unique_patients,
       ROUND(AVG(CASE WHEN severity_level = 'Critical' THEN 1 ELSE 0 END), 3) as critical_case_ratio
FROM healthcare.gold.patient_diagnoses_uf
GROUP BY treating_physician
ORDER BY total_diagnoses DESC
""")

physician_workload.show()

=== Diagnosis Frequency Analysis ===


+--------------+-------------------------------------------------------------------------------+---------+----------+
|diagnosis_code|diagnosis_description                                                          |frequency|percentage|
+--------------+-------------------------------------------------------------------------------+---------+----------+
|J45.909       |Unspecified asthma, uncomplicated                                              |43       |12.32     |
|Z00.00        |Encounter for general adult medical examination without abnormal findings      |42       |12.03     |
|I25.10        |Atherosclerotic heart disease of native coronary artery without angina pectoris|41       |11.75     |
|M79.3         |Panniculitis, unspecified                                                      |38       |10.89     |
|M54.5         |Low back pain                                                                  |37       |10.60     |
|E11.9         |Type 2 diabetes mellitus without complic

+-----------+--------------+-----+
|facility_id|severity_level|count|
+-----------+--------------+-----+
|  CLINIC001|      Critical|    6|
|  CLINIC001|          High|   16|
|  CLINIC001|           Low|   26|
|  CLINIC001|        Medium|   22|
|  CLINIC002|      Critical|    5|
|  CLINIC002|          High|   19|
|  CLINIC002|           Low|   24|
|  CLINIC002|        Medium|   26|
|    HOSP001|      Critical|    9|
|    HOSP001|          High|    9|
|    HOSP001|           Low|   23|
|    HOSP001|        Medium|   25|
|    HOSP002|      Critical|    7|
|    HOSP002|          High|   11|
|    HOSP002|           Low|   24|
|    HOSP002|        Medium|   25|
|  URGENT001|      Critical|    6|
|  URGENT001|          High|   14|
|  URGENT001|           Low|   20|
|  URGENT001|        Medium|   32|
+-----------+--------------+-----+


=== Physician Workload Analysis ===


+------------------+---------------+---------------+-------------------+
|treating_physician|total_diagnoses|unique_patients|critical_case_ratio|
+------------------+---------------+---------------+-------------------+
|       DR_WILLIAMS|             51|             39|              0.078|
|          DR_BROWN|             49|             41|              0.082|
|          DR_SMITH|             47|             37|              0.149|
|        DR_JOHNSON|             44|             35|              0.091|
|         DR_MILLER|             43|             38|               0.14|
|          DR_DAVIS|             41|             37|              0.024|
|          DR_JONES|             38|             32|              0.079|
|         DR_GARCIA|             36|             33|              0.111|
+------------------+---------------+---------------+-------------------+
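
As a sanity check on the outputs above, the `percentage` column can be reproduced by hand: the window expression `COUNT(*) * 100.0 / SUM(COUNT(*)) OVER ()` divides each group's count by the grand total. The sketch below uses the visible frequencies and the 349-record total from this run, and also confirms that the physician totals add up to the same 349 records.

```python
# Frequencies visible in the diagnosis frequency output of this run
frequencies = {"J45.909": 43, "Z00.00": 42, "I25.10": 41, "M79.3": 38, "M54.5": 37}
total_records = 349

def percentage(count, total):
    """Share of the total as a percentage, rounded to two decimals."""
    return round(count * 100.0 / total, 2)

for code, count in frequencies.items():
    print(code, percentage(count, total_records))

# Cross-check: total_diagnoses per physician from the workload output
physician_totals = [51, 49, 47, 44, 43, 41, 38, 36]
print(sum(physician_totals) == total_records)  # True
```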



## Key Takeaways: Iceberg and Liquid Clustering in AIDP

### What We Demonstrated

1. **Iceberg Compatibility**: Enabled Delta Universal Format with `'delta.universalFormat.enabledFormats' = 'iceberg'` for cross-engine access

2. **Liquid Clustering**: Created a table with `CLUSTER BY (patient_id, diagnosis_date)` for automatic data optimization

3. **Performance Benefits**: Queries that filter on clustered columns can skip unrelated files thanks to data locality (this notebook does not benchmark the effect)

4. **Low Maintenance**: No manual partitioning, bucketing, or Z-Ordering required

5. **Real-World Use Case**: Healthcare analytics where patient history lookups and temporal analysis are critical

### Iceberg Advantages

- **Open Standard**: Apache 2.0 licensed, community-driven table format
- **Schema Evolution**: Add, drop, rename columns without expensive data rewrites
- **Partition Evolution**: Change partitioning schemes without disrupting workflows
- **Time Travel**: Query historical data snapshots for auditing and reproducibility
- **ACID Transactions**: Reliable concurrent read/write operations across engines
- **Multi-Engine Support**: Query same data from Spark, Presto, Flink, Hive, and more
- **Future-Proof**: Standards-based approach protects your data investments

### AIDP Advantages

- **Unified Analytics**: Seamlessly integrates with other AIDP services
- **Governance**: Catalog and schema isolation for healthcare data
- **Performance**: Optimized for both OLAP and OLTP workloads
- **Scalability**: Handles healthcare-scale data volumes effortlessly

### Best Practices for Iceberg and Liquid Clustering

1. **Choose clustering columns** based on your most common query patterns
2. **Start with 1-4 columns** - too many can reduce effectiveness
3. **Consider cardinality** - high-cardinality columns work best
4. **Leverage Iceberg features** like schema evolution for changing requirements
5. **Monitor and adjust** as query patterns and schema evolve
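
For the cardinality guideline, a quick way to compare candidate columns is the ratio of distinct values to rows in a sample. The sketch below works on a list of dictionaries such as the `patient_data` list built earlier; the helper name `distinct_ratio` and the tiny sample are illustrative.

```python
def distinct_ratio(records, column):
    """Fraction of distinct values in a column; closer to 1.0 means higher cardinality."""
    values = [record[column] for record in records]
    if not values:
        raise ValueError("records must not be empty")
    return len(set(values)) / len(values)

sample = [
    {"patient_id": "PAT0001", "severity_level": "High"},
    {"patient_id": "PAT0002", "severity_level": "High"},
    {"patient_id": "PAT0003", "severity_level": "Low"},
    {"patient_id": "PAT0003", "severity_level": "Low"},
]

print(distinct_ratio(sample, "patient_id"))      # 0.75
print(distinct_ratio(sample, "severity_level"))  # 0.5
```

On this sample, `patient_id` has the higher ratio, which is consistent with choosing it over `severity_level` as a clustering column.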

### Next Steps

- Explore time travel with `SELECT * FROM <table> TIMESTAMP AS OF '<timestamp>'`
- Try schema evolution by adding new columns without data migration
- Query the same data from different engines like Presto or Trino
- Integrate with real healthcare systems
- Scale up to larger healthcare datasets across multiple clusters
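
As a starting point for the time travel item, the sketch below builds the two standard Delta Lake SQL forms (`TIMESTAMP AS OF` and `VERSION AS OF`). It only constructs query strings. Whether a given timestamp or version is available depends on the table's history, and the helper name `time_travel_query` is hypothetical.

```python
def time_travel_query(table_name, timestamp=None, version=None):
    """Build a time travel SELECT for exactly one of timestamp or version."""
    if (timestamp is None) == (version is None):
        raise ValueError("Provide exactly one of timestamp or version")
    if timestamp is not None:
        return f"SELECT * FROM {table_name} TIMESTAMP AS OF '{timestamp}'"
    return f"SELECT * FROM {table_name} VERSION AS OF {int(version)}"

table = "healthcare.gold.patient_diagnoses_uf"
print(time_travel_query(table, version=0))
print(time_travel_query(table, timestamp="2024-07-01 00:00:00"))

# Usage in the notebook (requires the live `spark` session):
# spark.sql(time_travel_query(table, version=0)).show(5)
```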

This notebook demonstrates how Oracle AI Data Platform combines Delta's advanced liquid clustering with Iceberg's open, future-proof architecture to deliver enterprise-grade analytics that are both high-performance and standards-compliant.