# Energy: Iceberg and Liquid Clustering Demo


## Overview


This notebook demonstrates **Iceberg and Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using an energy and utilities analytics use case. Liquid clustering optimizes data layout for query performance without manual partitioning or Z-Ordering, and Delta Universal Format makes the same table readable by Iceberg-compatible engines.

### What is Iceberg?

Apache Iceberg is an open table format for huge analytic datasets that provides:

- **Schema evolution**: Add, drop, rename, or update columns without rewriting data
- **Partition evolution**: Change partitioning without disrupting queries
- **Time travel**: Query historical data snapshots for auditing and rollback
- **ACID transactions**: Reliable concurrent read/write operations
- **Cross-engine compatibility**: Works with Spark, Flink, Presto, Hive, and more
- **Open ecosystem**: Apache 2.0 licensed, community-driven development

### Delta Universal Format with Iceberg

Delta Universal Format enables Iceberg compatibility while maintaining Delta's advanced features like liquid clustering. This combination provides:

- **Best of both worlds**: Delta's performance optimizations with Iceberg's openness
- **Multi-engine access**: Query the same data from different analytics engines
- **Future-proof architecture**: Standards-based approach for long-term data investments
- **Enhanced governance**: Rich metadata and catalog integration

### What is Liquid Clustering?

Liquid clustering automatically identifies and groups similar data together based on clustering columns you define. This optimization happens automatically during data ingestion and maintenance operations, providing:

- **Automatic optimization**: No manual tuning required
- **Improved query performance**: Faster queries on clustered columns
- **Reduced maintenance**: No need for manual repartitioning
- **Adaptive clustering**: Adjusts as data patterns change

### Use Case: Smart Grid Monitoring and Energy Consumption Analytics

We'll analyze energy consumption and smart grid performance data. Our clustering strategy will optimize for:

- **Meter-specific queries**: Fast lookups by meter ID
- **Time-based analysis**: Efficient filtering by reading date and time
- **Consumption patterns**: Quick aggregation by location and energy type

### AIDP Environment Setup

This notebook leverages the existing Spark session in your AIDP environment.

In [1]:
# Create energy catalog and analytics schema

# In AIDP, catalogs provide data isolation and governance

spark.sql("CREATE CATALOG IF NOT EXISTS energy")

spark.sql("CREATE SCHEMA IF NOT EXISTS energy.analytics")

print("Energy catalog and analytics schema created successfully!")

Energy catalog and analytics schema created successfully!


## Step 2: Create Delta Table with Liquid Clustering

### Table Design

Our `energy_readings_uf` table will store:

- **meter_id**: Unique smart meter identifier
- **reading_date**: Date and time of meter reading
- **energy_type**: Type (Electricity, Gas, Water, Solar)
- **consumption**: Amount consumed (kWh, therms, or gallons, depending on energy type); negative for solar generation
- **location**: Geographic location/region
- **peak_demand**: Peak usage during interval
- **efficiency_rating**: System efficiency (0-100)

### Clustering Strategy

We'll cluster by `meter_id` and `reading_date` because:

- **meter_id**: Each meter produces a regular stream of readings, so clustering on it keeps a meter's consumption history together
- **reading_date**: Time-based queries are critical for billing cycles, demand analysis, and seasonal patterns
- This combination optimizes for both meter monitoring and temporal energy consumption analysis

In [1]:
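# Create Delta table with Iceberg compatibility via Universal Format and liquid clustering

# TBLPROPERTIES in the CREATE TABLE statement below enables Delta Universal Format for Iceberg compatibility

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType

# DataFrame schema reused in Step 4; consumption and peak_demand are doubles here,
# while the table below declares them as DECIMAL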
data_schema = StructType([
    StructField("meter_id", StringType(), True),
    StructField("reading_date", TimestampType(), True),
    StructField("energy_type", StringType(), True),
    StructField("consumption", DoubleType(), True),
    StructField("location", StringType(), True),
    StructField("peak_demand", DoubleType(), True),
    StructField("efficiency_rating", IntegerType(), True)
])

spark.sql("""

CREATE TABLE IF NOT EXISTS energy.analytics.energy_readings_uf (
    meter_id STRING,
    reading_date TIMESTAMP,
    energy_type STRING,
    consumption DECIMAL(10,3),
    location STRING,
    peak_demand DECIMAL(8,2),
    efficiency_rating INT

)

USING DELTA

TBLPROPERTIES('delta.universalFormat.enabledFormats' = 'iceberg')

CLUSTER BY (meter_id, reading_date)

""")

print("Delta table with Iceberg compatibility and liquid clustering created successfully!")

print("Universal format enables Iceberg features while CLUSTER BY (meter_id, reading_date) optimizes data layout.")

Delta table with Iceberg compatibility and liquid clustering created successfully!
Universal format enables Iceberg features while CLUSTER BY (meter_id, reading_date) optimizes data layout.


## Step 3: Generate Energy Sample Data

### Data Generation Strategy

We'll create realistic energy consumption data including:

- **2,000 smart meters** with hourly readings over 90 days (4,320,000 records)
- **Energy types**: Electricity, Natural Gas, Water, Solar generation
- **Realistic consumption patterns**: Seasonal variations, peak usage times, efficiency differences
- **Geographic diversity**: Different locations with varying consumption profiles

### Why This Data Pattern?

This data simulates real energy scenarios where:

- Consumption varies by time of day and season
- Peak demand impacts grid stability
- Efficiency ratings affect sustainability goals
- Geographic patterns drive infrastructure planning
- Real-time monitoring enables demand response programs
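
The generator in the next cell multiplies a base consumption by a seasonal factor, a time-of-day factor, and a random noise term. As a minimal sketch of that arithmetic, the hypothetical helpers below fix the noise term at 1.0 so the result is deterministic. The parameter values are copied from the `Electricity` / `Residential_NYC` entry used in this notebook; the function names are illustrative and not part of the notebook.

```python
# Sketch of the consumption formula. These helpers are hypothetical and only
# mirror the arithmetic of the data generator; noise is fixed at 1.0 by default.

PEAK_HOURS = {6, 7, 8, 17, 18, 19}
OFF_PEAK_HOURS = {2, 3, 4, 5}
SEASONAL_TYPES = {"Electricity", "Natural Gas"}


def seasonal_factor(energy_type: str, month: int) -> float:
    """Winter and summer uplift applies only to electricity and natural gas."""
    if energy_type not in SEASONAL_TYPES:
        return 1.0
    if month in (12, 1, 2):
        return 1.4
    if month in (6, 7, 8):
        return 1.3
    return 1.0


def hour_factor(hour: int, peak_factor: float) -> float:
    """Peak hours use the peak factor; early-morning hours drop to 0.4."""
    if hour in PEAK_HOURS:
        return peak_factor
    if hour in OFF_PEAK_HOURS:
        return 0.4
    return 1.0


def expected_consumption(base, peak_factor, energy_type, month, hour, variation=1.0):
    """base * seasonal factor * hour factor * noise, rounded to 3 decimals."""
    value = base * seasonal_factor(energy_type, month) * hour_factor(hour, peak_factor) * variation
    return round(value, 3)


# Electricity, Residential_NYC: base 15, peak factor 2.5
print(expected_consumption(15, 2.5, "Electricity", month=1, hour=7))   # 52.5
print(expected_consumption(15, 2.5, "Electricity", month=1, hour=3))   # 8.4
print(expected_consumption(15, 2.5, "Electricity", month=4, hour=12))  # 15.0
```

With the noise term drawn from 0.8 to 1.2, a January peak-hour reading for this profile would fall roughly between 42 and 63.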

In [1]:
# Generate sample energy consumption data

# Uses only the Python standard library (random, datetime)

import random

from datetime import datetime, timedelta


# Define energy data constants

ENERGY_TYPES = ['Electricity', 'Natural Gas', 'Water', 'Solar']

LOCATIONS = ['Residential_NYC', 'Commercial_CHI', 'Industrial_HOU', 'Residential_LAX', 'Commercial_SFO']

# Base consumption parameters by energy type and location

CONSUMPTION_PARAMS = {

    'Electricity': {

        'Residential_NYC': {'base_consumption': 15, 'peak_factor': 2.5, 'efficiency': 85},

        'Commercial_CHI': {'base_consumption': 150, 'peak_factor': 3.0, 'efficiency': 78},

        'Industrial_HOU': {'base_consumption': 500, 'peak_factor': 2.2, 'efficiency': 92},

        'Residential_LAX': {'base_consumption': 12, 'peak_factor': 2.8, 'efficiency': 88},

        'Commercial_SFO': {'base_consumption': 180, 'peak_factor': 2.7, 'efficiency': 82}

    },

    'Natural Gas': {

        'Residential_NYC': {'base_consumption': 25, 'peak_factor': 1.8, 'efficiency': 90},

        'Commercial_CHI': {'base_consumption': 80, 'peak_factor': 2.1, 'efficiency': 85},

        'Industrial_HOU': {'base_consumption': 200, 'peak_factor': 1.9, 'efficiency': 95},

        'Residential_LAX': {'base_consumption': 20, 'peak_factor': 2.0, 'efficiency': 87},

        'Commercial_SFO': {'base_consumption': 95, 'peak_factor': 2.3, 'efficiency': 83}

    },

    'Water': {

        'Residential_NYC': {'base_consumption': 180, 'peak_factor': 1.5, 'efficiency': 88},

        'Commercial_CHI': {'base_consumption': 450, 'peak_factor': 1.7, 'efficiency': 82},

        'Industrial_HOU': {'base_consumption': 1200, 'peak_factor': 1.6, 'efficiency': 91},

        'Residential_LAX': {'base_consumption': 160, 'peak_factor': 1.8, 'efficiency': 85},

        'Commercial_SFO': {'base_consumption': 380, 'peak_factor': 1.9, 'efficiency': 79}

    },

    'Solar': {

        'Residential_NYC': {'base_consumption': -8, 'peak_factor': 3.5, 'efficiency': 78},

        'Commercial_CHI': {'base_consumption': -75, 'peak_factor': 4.0, 'efficiency': 85},

        'Industrial_HOU': {'base_consumption': -250, 'peak_factor': 3.8, 'efficiency': 88},

        'Residential_LAX': {'base_consumption': -12, 'peak_factor': 4.2, 'efficiency': 82},

        'Commercial_SFO': {'base_consumption': -95, 'peak_factor': 3.9, 'efficiency': 86}

    }

}


# Generate energy reading records

reading_data = []

base_date = datetime(2024, 1, 1)


# Create 2,000 meters with hourly readings for 90 days (2024-01-01 through 2024-03-30)

for meter_num in range(1, 2001):

    meter_id = f"MTR{meter_num:06d}"
    
    # Each meter gets readings for 90 days (hourly)

    for day in range(90):

        for hour in range(24):

            reading_date = base_date + timedelta(days=day, hours=hour)
            
            # Pick a random energy type and location for each reading (they are not fixed per meter)

            energy_type = random.choice(ENERGY_TYPES)

            location = random.choice(LOCATIONS)
            
            params = CONSUMPTION_PARAMS[energy_type][location]
            
            # Calculate consumption with time-based variations

            # Seasonal variation (higher in winter for heating, summer for cooling)

            month = reading_date.month

            if energy_type in ['Electricity', 'Natural Gas']:

                if month in [12, 1, 2]:  # Winter

                    seasonal_factor = 1.4

                elif month in [6, 7, 8]:  # Summer

                    seasonal_factor = 1.3

                else:

                    seasonal_factor = 1.0

            else:

                seasonal_factor = 1.0
            
            # Time-of-day variation

            hour_factor = 1.0

            if hour in [6, 7, 8, 17, 18, 19]:  # Peak hours

                hour_factor = params['peak_factor']

            elif hour in [2, 3, 4, 5]:  # Off-peak

                hour_factor = 0.4

            
            # Calculate consumption

            consumption_variation = random.uniform(0.8, 1.2)

            consumption = round(params['base_consumption'] * seasonal_factor * hour_factor * consumption_variation, 3)
            
            # Peak demand: 1.1x to 1.5x the absolute consumption, so it rises with consumption during peak hours

            peak_demand = round(abs(consumption) * random.uniform(1.1, 1.5), 2)
            
            # Efficiency rating with some variation

            efficiency_variation = random.randint(-5, 3)

            efficiency_rating = max(0, min(100, params['efficiency'] + efficiency_variation))
            
            reading_data.append({

                "meter_id": meter_id,

                "reading_date": reading_date,

                "energy_type": energy_type,

                "consumption": consumption,

                "location": location,

                "peak_demand": peak_demand,

                "efficiency_rating": efficiency_rating

            })



print(f"Generated {len(reading_data)} energy reading records")

print("Sample record:", reading_data[0])

Generated 4320000 energy reading records
Sample record: {'meter_id': 'MTR000001', 'reading_date': datetime.datetime(2024, 1, 1, 0, 0), 'energy_type': 'Electricity', 'consumption': 22.63, 'location': 'Residential_NYC', 'peak_demand': 30.37, 'efficiency_rating': 84}
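
The record count follows directly from the loop bounds. A quick check, assuming the same `base_date` and 90-day window as the cell above:

```python
from datetime import datetime, timedelta

METERS = 2_000
DAYS = 90
HOURS_PER_DAY = 24

# One record per meter per hour.
total_records = METERS * DAYS * HOURS_PER_DAY
print(total_records)  # 4320000

# Last timestamp produced by the loop: day index 89, hour 23.
base_date = datetime(2024, 1, 1)
last_reading = base_date + timedelta(days=DAYS - 1, hours=HOURS_PER_DAY - 1)
print(last_reading)  # 2024-03-30 23:00:00
```

Because 2024 is a leap year, the 90-day window ends on March 30, which matches the latest reading returned by Query 1 in Step 5.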


## Step 4: Insert Data Using PySpark

### Data Insertion Strategy

We'll use PySpark to:

1. **Create DataFrame** from our generated data
2. **Insert into Delta table** with liquid clustering
3. **Verify the insertion** with a sample query

### Why PySpark for Insertion?

- **Distributed processing**: Handles large datasets efficiently
- **Type safety**: Ensures data integrity
- **Optimization**: Leverages Spark's query optimization
- **Liquid clustering**: Automatically applies clustering during insertion
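
On type safety: the generated values are Python floats, while the table declares `consumption` as `DECIMAL(10,3)` and `peak_demand` as `DECIMAL(8,2)`. The sketch below uses Python's `decimal` module to approximate what fitting a value into such a column involves. The helper `fit_decimal` is hypothetical, and half-up rounding is an assumption about the engine's cast behavior rather than something this notebook verifies.

```python
from decimal import Decimal, ROUND_HALF_UP


def fit_decimal(value: float, precision: int, scale: int) -> Decimal:
    """Round value to `scale` fractional digits and check it fits DECIMAL(precision, scale).

    Hypothetical helper for illustration; the rounding mode is an assumption.
    """
    quantum = Decimal(10) ** -scale  # scale=3 gives Decimal('0.001')
    rounded = Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP)
    max_integer_digits = precision - scale
    if abs(rounded) >= Decimal(10) ** max_integer_digits:
        raise ValueError(f"{value} does not fit DECIMAL({precision},{scale})")
    return rounded


print(fit_decimal(22.63, 10, 3))    # 22.630
print(fit_decimal(-84.605, 10, 3))  # -84.605
print(fit_decimal(3413.45, 8, 2))   # 3413.45
```

This is consistent with the query results in Step 5, where `consumption` is displayed with three decimal places (for example `449.370`).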

In [1]:
# Insert data using PySpark DataFrame operations

# insertInto matches columns by position, so data_schema lists them in table order


# Create DataFrame from generated data

df_readings = spark.createDataFrame(reading_data, schema=data_schema)


# Display schema and sample data

print("DataFrame Schema:")

df_readings.printSchema()



print("\nSample Data:")

df_readings.show(5)


# Insert data into Delta table with liquid clustering

# The CLUSTER BY (meter_id, reading_date) will automatically optimize the data layout

df_readings.write.mode("overwrite").insertInto("energy.analytics.energy_readings_uf")


print(f"\nSuccessfully inserted {df_readings.count()} records into energy.analytics.energy_readings_uf")

print("Liquid clustering automatically optimized the data layout during insertion!")

DataFrame Schema:
root
 |-- meter_id: string (nullable = true)
 |-- reading_date: timestamp (nullable = true)
 |-- energy_type: string (nullable = true)
 |-- consumption: double (nullable = true)
 |-- location: string (nullable = true)
 |-- peak_demand: double (nullable = true)
 |-- efficiency_rating: integer (nullable = true)


Sample Data:


+---------+-------------------+-----------+-----------+---------------+-----------+-----------------+
| meter_id|       reading_date|energy_type|consumption|       location|peak_demand|efficiency_rating|
+---------+-------------------+-----------+-----------+---------------+-----------+-----------------+
|MTR000001|2024-01-01 00:00:00|Electricity|      22.63|Residential_NYC|      30.37|               84|
|MTR000001|2024-01-01 01:00:00|Natural Gas|     32.813|Residential_LAX|       39.8|               84|
|MTR000001|2024-01-01 02:00:00|      Solar|    -25.746| Commercial_CHI|       31.4|               86|
|MTR000001|2024-01-01 03:00:00|      Solar|    -84.605| Industrial_HOU|     108.68|               86|
|MTR000001|2024-01-01 04:00:00|Electricity|     87.225| Commercial_CHI|     100.48|               74|
+---------+-------------------+-----------+-----------+---------------+-----------+-----------------+
only showing top 5 rows




Successfully inserted 4320000 records into energy.analytics.energy_readings_uf
Liquid clustering automatically optimized the data layout during insertion!


## Step 5: Demonstrate Liquid Clustering Benefits

### Query Performance Analysis

Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:

1. **Meter reading history** (clustered by meter_id)
2. **Time-based consumption analysis** (clustered by reading_date)
3. **Combined meter + time queries** (optimal for our clustering)

### Expected Performance Benefits

With liquid clustering, these queries are expected to read less data than they would on an unclustered table (this notebook does not time them) because:

- **Data locality**: Related records are physically grouped together
- **Reduced I/O**: Less data needs to be read from disk
- **Automatic optimization**: No manual tuning required
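
To make the data-locality and reduced-I/O points concrete, here is a toy model of file skipping. It is a simplification, not how Delta implements liquid clustering: the hypothetical helpers below split rows into fixed-size "files", record each file's minimum and maximum `meter_id`, and count how many files a point lookup would have to open. Sorting by `meter_id` stands in for clustering.

```python
import random


def write_files(rows, rows_per_file):
    """Split rows into fixed-size chunks and keep min/max meter_id per chunk."""
    files = []
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        meter_ids = [meter_id for meter_id, _hour in chunk]
        files.append({"min": min(meter_ids), "max": max(meter_ids), "rows": chunk})
    return files


def files_to_scan(files, meter_id):
    """Count files whose [min, max] range could contain meter_id."""
    return sum(1 for f in files if f["min"] <= meter_id <= f["max"])


# Toy data: 100 meters x 24 hourly readings, arriving in random order.
rows = [(f"MTR{m:06d}", h) for m in range(1, 101) for h in range(24)]
random.Random(42).shuffle(rows)

unclustered = write_files(rows, rows_per_file=240)
clustered = write_files(sorted(rows), rows_per_file=240)

target = "MTR000050"
print("files total:", len(clustered))
print("unclustered files scanned:", files_to_scan(unclustered, target))
print("clustered files scanned:", files_to_scan(clustered, target))
```

In the sorted layout exactly one of the ten files can contain `MTR000050`, while in the shuffled layout most or all file ranges typically overlap it. Real engines apply the same idea using per-file column statistics.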

In [1]:
# Demonstrate liquid clustering benefits with optimized queries


# Query 1: Meter reading history - benefits from meter_id clustering

print("=== Query 1: Meter Reading History ===")

meter_history = spark.sql("""

SELECT meter_id, reading_date, energy_type, consumption, peak_demand, efficiency_rating

FROM energy.analytics.energy_readings_uf

WHERE meter_id = 'MTR000001'

ORDER BY reading_date DESC

LIMIT 24

""")



meter_history.show()

print(f"Records found: {meter_history.count()}")



# Query 2: Time-based peak demand analysis - benefits from reading_date clustering

print("\n=== Query 2: Recent Peak Demand Issues ===")

peak_demand = spark.sql("""

SELECT reading_date, meter_id, location, peak_demand, energy_type

FROM energy.analytics.energy_readings_uf

WHERE DATE(reading_date) = '2024-02-15' AND peak_demand > 200

ORDER BY peak_demand DESC

""")



peak_demand.show()

print(f"Peak demand issues found: {peak_demand.count()}")



# Query 3: Combined meter + time query - optimal for our clustering strategy

print("\n=== Query 3: Meter Consumption Trends ===")

consumption_trends = spark.sql("""

SELECT meter_id, reading_date, energy_type, consumption, efficiency_rating

FROM energy.analytics.energy_readings_uf

WHERE meter_id LIKE 'MTR000%' AND reading_date >= '2024-02-01'

ORDER BY meter_id, reading_date

LIMIT 50

""")



consumption_trends.show()

print(f"Consumption trend records found: {consumption_trends.count()}")

=== Query 1: Meter Reading History ===


+---------+-------------------+-----------+-----------+-----------+-----------------+
| meter_id|       reading_date|energy_type|consumption|peak_demand|efficiency_rating|
+---------+-------------------+-----------+-----------+-----------+-----------------+
|MTR000001|2024-03-30 23:00:00|      Water|    449.370|     590.19|               79|
|MTR000001|2024-03-30 22:00:00|      Water|    133.854|     162.45|               88|
|MTR000001|2024-03-30 21:00:00|      Water|    965.564|    1381.94|               87|
|MTR000001|2024-03-30 20:00:00|      Solar|    -69.952|      98.65|               83|
|MTR000001|2024-03-30 19:00:00|Natural Gas|    426.459|     493.05|               93|
|MTR000001|2024-03-30 18:00:00|Natural Gas|    150.884|     212.86|               83|
|MTR000001|2024-03-30 17:00:00|      Water|    238.527|     285.26|               83|
|MTR000001|2024-03-30 16:00:00|      Water|    510.061|     578.26|               78|
|MTR000001|2024-03-30 15:00:00|Natural Gas|    101.266

Records found: 24

=== Query 2: Recent Peak Demand Issues ===


+-------------------+---------+--------------+-----------+-----------+
|       reading_date| meter_id|      location|peak_demand|energy_type|
+-------------------+---------+--------------+-----------+-----------+
|2024-02-15 07:00:00|MTR000459|Industrial_HOU|    3413.45|      Water|
|2024-02-15 17:00:00|MTR001112|Industrial_HOU|    3379.53|      Water|
|2024-02-15 19:00:00|MTR000446|Industrial_HOU|    3330.99|      Water|
|2024-02-15 08:00:00|MTR001640|Industrial_HOU|    3328.78|      Water|
|2024-02-15 07:00:00|MTR001351|Industrial_HOU|    3328.77|      Water|
|2024-02-15 08:00:00|MTR001003|Industrial_HOU|    3304.94|      Water|
|2024-02-15 18:00:00|MTR000009|Industrial_HOU|    3299.35|      Water|
|2024-02-15 06:00:00|MTR001749|Industrial_HOU|    3294.04|      Water|
|2024-02-15 07:00:00|MTR001703|Industrial_HOU|    3293.23|      Water|
|2024-02-15 07:00:00|MTR000170|Industrial_HOU|    3291.36|      Water|
|2024-02-15 07:00:00|MTR000318|Industrial_HOU|    3287.42|      Water|
|2024-

Peak demand issues found: 22867

=== Query 3: Meter Consumption Trends ===


+---------+-------------------+-----------+-----------+-----------------+
| meter_id|       reading_date|energy_type|consumption|efficiency_rating|
+---------+-------------------+-----------+-----------+-----------------+
|MTR000001|2024-02-01 00:00:00|      Water|    516.579|               81|
|MTR000001|2024-02-01 01:00:00|      Solar|   -217.083|               88|
|MTR000001|2024-02-01 02:00:00|      Solar|    -38.785|               81|
|MTR000001|2024-02-01 03:00:00|Electricity|    322.577|               92|
|MTR000001|2024-02-01 04:00:00|      Water|     79.454|               90|
|MTR000001|2024-02-01 05:00:00|Natural Gas|     11.338|               86|
|MTR000001|2024-02-01 06:00:00|      Solar|   -330.425|               80|
|MTR000001|2024-02-01 07:00:00|      Solar|    -33.084|               74|
|MTR000001|2024-02-01 08:00:00|      Solar|    -50.256|               82|
|MTR000001|2024-02-01 09:00:00|      Water|    481.211|               81|
|MTR000001|2024-02-01 10:00:00|Natural

Consumption trend records found: 50


## Step 6: Analyze Clustering Effectiveness

### Understanding the Impact

The queries below compute aggregate statistics over the clustered table to show the energy insights this structure supports. They do not inspect the physical file layout directly.

### Key Analytics

- **Meter performance** and consumption patterns
- **Location-based energy usage** and demand analysis
- **Energy type efficiency** and sustainability metrics
- **Peak demand patterns** and grid optimization
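
Most of the aggregate queries in this step wrap `consumption` in `ABS(...)` because solar readings are stored as negative values (generation). A small pure-Python sketch on made-up records shows how the net and absolute totals differ; the sample rows are illustrative and not taken from the table.

```python
from collections import defaultdict

# Illustrative rows: (location, energy_type, consumption)
sample_rows = [
    ("Industrial_HOU", "Electricity", 500.0),
    ("Industrial_HOU", "Solar", -250.0),
    ("Residential_NYC", "Electricity", 15.0),
    ("Residential_NYC", "Solar", -8.0),
]

net_total = defaultdict(float)       # equivalent of SUM(consumption)
absolute_total = defaultdict(float)  # equivalent of SUM(ABS(consumption))

for location, _energy_type, consumption in sample_rows:
    net_total[location] += consumption
    absolute_total[location] += abs(consumption)

for location in sorted(absolute_total):
    print(location, net_total[location], absolute_total[location])
# Industrial_HOU 250.0 750.0
# Residential_NYC 7.0 23.0
```

The absolute total measures energy moving through the meter in either direction, while the net total offsets consumption with generation. Note that `AVG(consumption)` in the meter performance query is a net figure.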

In [1]:
# Analyze clustering effectiveness and energy insights


# Meter performance analysis

print("=== Meter Performance Analysis ===")

meter_performance = spark.sql("""

SELECT meter_id, COUNT(*) as total_readings,

       ROUND(AVG(consumption), 3) as avg_consumption,

       ROUND(MAX(peak_demand), 2) as max_peak_demand,

       ROUND(AVG(efficiency_rating), 2) as avg_efficiency,

       ROUND(SUM(ABS(consumption)), 3) as total_absolute_consumption

FROM energy.analytics.energy_readings_uf

GROUP BY meter_id

ORDER BY total_absolute_consumption DESC

LIMIT 10

""")



meter_performance.show()


# Location-based consumption analysis

print("\n=== Location-Based Consumption Analysis ===")

location_analysis = spark.sql("""

SELECT location, COUNT(*) as total_readings,

       ROUND(SUM(ABS(consumption)), 3) as total_consumption,

       ROUND(AVG(peak_demand), 2) as avg_peak_demand,

       ROUND(AVG(efficiency_rating), 2) as avg_efficiency,

       COUNT(DISTINCT meter_id) as active_meters

FROM energy.analytics.energy_readings_uf

GROUP BY location

ORDER BY total_consumption DESC

""")



location_analysis.show()


# Energy type efficiency analysis

print("\n=== Energy Type Efficiency Analysis ===")

energy_efficiency = spark.sql("""

SELECT energy_type, COUNT(*) as total_readings,

       ROUND(AVG(ABS(consumption)), 3) as avg_consumption,

       ROUND(AVG(efficiency_rating), 2) as avg_efficiency,

       ROUND(MAX(peak_demand), 2) as max_peak_demand,

       COUNT(DISTINCT meter_id) as unique_meters

FROM energy.analytics.energy_readings_uf

GROUP BY energy_type

ORDER BY avg_consumption DESC

""")



energy_efficiency.show()


# Daily consumption patterns

print("\n=== Daily Consumption Patterns ===")

daily_patterns = spark.sql("""

SELECT DATE(reading_date) as date, HOUR(reading_date) as hour,

       ROUND(SUM(ABS(consumption)), 3) as total_consumption,

       ROUND(AVG(peak_demand), 2) as avg_peak_demand,

       COUNT(*) as reading_count

FROM energy.analytics.energy_readings_uf

WHERE DATE(reading_date) = '2024-02-01'

GROUP BY DATE(reading_date), HOUR(reading_date)

ORDER BY hour

""")



daily_patterns.show()


# Monthly consumption trends

print("\n=== Monthly Consumption Trends ===")

monthly_trends = spark.sql("""

SELECT DATE_FORMAT(reading_date, 'yyyy-MM') as month,

       ROUND(SUM(ABS(consumption)), 3) as monthly_consumption,

       ROUND(AVG(peak_demand), 2) as avg_peak_demand,

       ROUND(AVG(efficiency_rating), 2) as avg_efficiency,

       COUNT(DISTINCT meter_id) as active_meters

FROM energy.analytics.energy_readings_uf

GROUP BY DATE_FORMAT(reading_date, 'yyyy-MM')

ORDER BY month

""")



monthly_trends.show()

=== Meter Performance Analysis ===


+---------+--------------+---------------+---------------+--------------+--------------------------+
| meter_id|total_readings|avg_consumption|max_peak_demand|avg_efficiency|total_absolute_consumption|
+---------+--------------+---------------+---------------+--------------+--------------------------+
|MTR001916|          2160|        208.707|        2985.44|         84.67|                620282.246|
|MTR000157|          2160|        222.600|        3256.98|         84.54|                617825.590|
|MTR000565|          2160|        216.394|        3126.39|         84.58|                617018.134|
|MTR000576|          2160|        222.458|        3253.59|         84.43|                616379.022|
|MTR001948|          2160|        215.884|        3181.42|         84.48|                612046.301|
|MTR001293|          2160|        212.167|        3383.61|         84.46|                611384.167|
|MTR001205|          2160|        210.094|        3373.42|          84.8|                61

=== Location-Based Consumption Analysis ===


+---------------+--------------+-----------------+---------------+--------------+-------------+
|       location|total_readings|total_consumption|avg_peak_demand|avg_efficiency|active_meters|
+---------------+--------------+-----------------+---------------+--------------+-------------+
| Industrial_HOU|        863742|    584055626.834|         878.91|          90.5|         2000|
| Commercial_SFO|        864534|    222997383.493|         335.34|          81.5|         2000|
| Commercial_CHI|        863343|    214004206.401|         322.18|          81.5|         2000|
|Residential_NYC|        863822|     55232027.994|          83.12|         84.26|         2000|
|Residential_LAX|        864559|     53280630.620|          80.09|         84.51|         2000|
+---------------+--------------+-----------------+---------------+--------------+-------------+


=== Energy Type Efficiency Analysis ===


+-----------+--------------+---------------+--------------+---------------+-------------+
|energy_type|total_readings|avg_consumption|avg_efficiency|max_peak_demand|unique_meters|
+-----------+--------------+---------------+--------------+---------------+-------------+
|      Water|       1079790|        506.430|          84.0|        3450.26|         2000|
|Electricity|       1079834|        274.144|          84.0|        2765.18|         2000|
|      Solar|       1078228|        142.394|         82.81|        1708.15|         2000|
|Natural Gas|       1082148|        123.059|          87.0|         956.89|         2000|
+-----------+--------------+---------------+--------------+---------------+-------------+


=== Daily Consumption Patterns ===


+----------+----+-----------------+---------------+-------------+
|      date|hour|total_consumption|avg_peak_demand|reading_count|
+----------+----+-----------------+---------------+-------------+
|2024-02-01|   0|       458563.466|         299.21|         2000|
|2024-02-01|   1|       464051.363|         302.08|         2000|
|2024-02-01|   2|       189948.747|         123.93|         2000|
|2024-02-01|   3|       182712.027|         119.21|         2000|
|2024-02-01|   4|       183636.554|         119.11|         2000|
|2024-02-01|   5|       186385.233|         121.65|         2000|
|2024-02-01|   6|       950079.665|         616.24|         2000|
|2024-02-01|   7|       953973.543|         619.71|         2000|
|2024-02-01|   8|      1013545.270|         659.98|         2000|
|2024-02-01|   9|       457371.632|         296.82|         2000|
|2024-02-01|  10|       458012.733|         298.76|         2000|
|2024-02-01|  11|       466826.138|         304.09|         2000|
|2024-02-0

=== Monthly Consumption Trends ===


+-------+-------------------+---------------+--------------+-------------+
|  month|monthly_consumption|avg_peak_demand|avg_efficiency|active_meters|
+-------+-------------------+---------------+--------------+-------------+
|2024-01|      404896870.569|         353.69|         84.45|         2000|
|2024-02|      378261676.011|         353.22|         84.45|         2000|
|2024-03|      346411328.762|         312.71|         84.46|         2000|
+-------+-------------------+---------------+--------------+-------------+



## Key Takeaways: Iceberg and Liquid Clustering in AIDP

### What We Demonstrated

1. **Iceberg Compatibility**: Enabled Delta Universal Format with `'delta.universalFormat.enabledFormats' = 'iceberg'` for cross-engine access

2. **Liquid Clustering**: Created a table with `CLUSTER BY (meter_id, reading_date)` for automatic data optimization

3. **Performance Benefits**: Queries that filter on clustered columns can skip unrelated data thanks to data locality, although this notebook does not measure the speedup

4. **Zero Maintenance**: No manual partitioning, bucketing, or Z-Ordering required

5. **Real-World Use Case**: Energy analytics where smart grid monitoring and consumption analysis are critical

### Iceberg Advantages

- **Open Standard**: Apache 2.0 licensed, community-driven table format
- **Schema Evolution**: Add, drop, rename columns without expensive data rewrites
- **Partition Evolution**: Change partitioning schemes without disrupting workflows
- **Time Travel**: Query historical data snapshots for auditing and reproducibility
- **ACID Transactions**: Reliable concurrent read/write operations across engines
- **Multi-Engine Support**: Query same data from Spark, Presto, Flink, Hive, and more
- **Future-Proof**: Standards-based approach protects your data investments

### AIDP Advantages

- **Unified Analytics**: Seamlessly integrates with other AIDP services
- **Governance**: Catalog and schema isolation for energy data
- **Performance**: Optimized for both OLAP and OLTP workloads
- **Scalability**: Handles energy-scale data volumes effortlessly

### Best Practices for Iceberg and Liquid Clustering

1. **Choose clustering columns** based on your most common query patterns
2. **Start with 1-4 columns** - too many can reduce effectiveness
3. **Consider cardinality** - high-cardinality columns work best
4. **Leverage Iceberg features** like schema evolution for changing requirements
5. **Monitor and adjust** as query patterns and schema evolve
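
For the cardinality guideline, a quick way to compare candidate columns is to count distinct values on a sample. The sketch below does this in plain Python on a few illustrative rows; `column_cardinality` is a hypothetical helper, and on the real table the same idea could be expressed with `COUNT(DISTINCT ...)`.

```python
def column_cardinality(rows, columns):
    """Return {column: number of distinct values} for a list of dict rows."""
    return {col: len({row[col] for row in rows}) for col in columns}


# Illustrative sample: 50 meters x 24 hourly readings, 4 energy types.
ENERGY_TYPES = ["Electricity", "Natural Gas", "Water", "Solar"]
sample = [
    {
        "meter_id": f"MTR{m:06d}",
        "reading_hour": h,
        "energy_type": ENERGY_TYPES[(m + h) % len(ENERGY_TYPES)],
    }
    for m in range(1, 51)
    for h in range(24)
]

print(column_cardinality(sample, ["meter_id", "reading_hour", "energy_type"]))
# {'meter_id': 50, 'reading_hour': 24, 'energy_type': 4}
```

Here `meter_id` separates rows far more finely than `energy_type`, which is in line with this notebook's choice of `meter_id` and `reading_date` as clustering columns.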

### Next Steps

- Explore time travel capabilities with `SELECT * FROM energy.analytics.energy_readings_uf TIMESTAMP AS OF '<timestamp>'`
- Try schema evolution by adding new columns without data migration
- Query the same data from different engines like Presto or Trino
- Integrate with real smart meter and IoT sensor data
- Scale up to larger energy datasets across multiple clusters

This notebook demonstrates how Oracle AI Data Platform combines Delta's advanced liquid clustering with Iceberg's open, future-proof architecture to deliver enterprise-grade analytics that are both high-performance and standards-compliant.