# Real Estate: Iceberg and Liquid Clustering Demo


## Overview


This notebook demonstrates the power of **Iceberg and Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using a real estate analytics use case. Liquid clustering automatically optimizes data layout for query performance without requiring manual partitioning or Z-Ordering.

### What is Iceberg?

Apache Iceberg is an open table format for huge analytic datasets that provides:

- **Schema evolution**: Add, drop, rename, update columns without rewriting data
- **Partition evolution**: Change partitioning without disrupting queries
- **Time travel**: Query historical data snapshots for auditing and rollback
- **ACID transactions**: Reliable concurrent read/write operations
- **Cross-engine compatibility**: Works with Spark, Flink, Presto, Hive, and more
- **Open ecosystem**: Apache 2.0 licensed, community-driven development
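
As a rough illustration of the time travel bullet, the sketch below only builds SQL strings and executes nothing. It assumes the Delta-style `DESCRIBE HISTORY` and `VERSION AS OF` syntax is available in your AIDP Spark session, and it uses the table name created later in this notebook. The helper name `time_travel_queries` is hypothetical.

```python
def time_travel_queries(table: str, version: int) -> dict:
    """Build Delta-style time travel SQL strings for a table (nothing is executed)."""
    if version < 0:
        raise ValueError("version must be non-negative")
    return {
        # Lists the commits (versions) recorded for the table
        "history": f"DESCRIBE HISTORY {table}",
        # Reads the table as it was at the given version
        "snapshot": f"SELECT * FROM {table} VERSION AS OF {version}",
    }


queries = time_travel_queries("real_estate.analytics.property_transactions_uf", 0)
for name, sql in queries.items():
    print(f"{name}: {sql}")

# With an active session this could be run as, for example:
# spark.sql(queries["history"]).show(truncate=False)
```

Keeping the statements as plain strings makes them easy to review before handing them to `spark.sql`.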

### Delta Universal Format with Iceberg

Delta Universal Format enables Iceberg compatibility while maintaining Delta's advanced features like liquid clustering. This combination provides:

- **Best of both worlds**: Delta's performance optimizations with Iceberg's openness
- **Multi-engine access**: Query the same data from different analytics engines
- **Future-proof architecture**: Standards-based approach for long-term data investments
- **Enhanced governance**: Rich metadata and catalog integration

### What is Liquid Clustering?

Liquid clustering automatically identifies and groups similar data together based on clustering columns you define. This optimization happens automatically during data ingestion and maintenance operations, providing:

- **Automatic optimization**: No manual tuning required
- **Improved query performance**: Faster queries on clustered columns
- **Reduced maintenance**: No need for manual repartitioning
- **Adaptive clustering**: Adjusts as data patterns change
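
To make the "faster queries on clustered columns" point concrete, here is a toy model in plain Python, not the actual Delta implementation. It assumes a reader can skip any file whose minimum and maximum key values exclude the lookup key, and it compares a layout in arrival order with one where similar keys are stored together. Real liquid clustering handles several columns at once and is more sophisticated than the single-column sort used here.

```python
import random


def split_into_files(values, rows_per_file):
    """Chunk a list of keys into fixed-size 'files'."""
    return [values[i:i + rows_per_file] for i in range(0, len(values), rows_per_file)]


def files_scanned(files, key):
    """Count files whose [min, max] range could contain the key."""
    return sum(1 for f in files if min(f) <= key <= max(f))


random.seed(42)
keys = list(range(10_000))        # stand-in for a clustering column
shuffled = keys[:]
random.shuffle(shuffled)

unclustered = split_into_files(shuffled, 1_000)      # rows in arrival order
clustered = split_into_files(sorted(keys), 1_000)    # similar keys co-located

lookup = 4_321
print("unclustered files scanned:", files_scanned(unclustered, lookup))
print("clustered files scanned:  ", files_scanned(clustered, lookup))
```

In the clustered layout only one file can contain the key, while in the shuffled layout nearly every file's range overlaps it.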

### Use Case: Property Transactions and Market Analysis

We'll analyze real estate transactions and property market data. Our clustering strategy will optimize for:

- **Property-specific queries**: Fast lookups by property ID
- **Time-based analysis**: Efficient filtering by transaction date
- **Market performance patterns**: Quick aggregation by location and property type

## Step 1: AIDP Environment Setup

This notebook leverages the existing Spark session in your AIDP environment.

In [1]:
# Create real estate catalog and analytics schema

# In AIDP, catalogs provide data isolation and governance

spark.sql("CREATE CATALOG IF NOT EXISTS real_estate")

spark.sql("CREATE SCHEMA IF NOT EXISTS real_estate.analytics")

print("Real estate catalog and analytics schema created successfully!")

Real estate catalog and analytics schema created successfully!


## Step 2: Create Delta Table with Liquid Clustering

### Table Design

Our `property_transactions_uf` table will store:

- **property_id**: Unique property identifier
- **transaction_date**: Date of property transaction
- **property_type**: Type (Single Family, Condo, Apartment, etc.)
- **sale_price**: Transaction sale price
- **location**: Geographic location/neighborhood
- **days_on_market**: Time property was listed before sale
- **price_per_sqft**: Price per square foot

### Clustering Strategy

We'll cluster by `property_id` and `transaction_date` because:

- **property_id**: Properties may have multiple transactions over time, so clustering on this column keeps each property's sales history together
- **transaction_date**: Time-based queries are critical for market analysis, seasonal trends, and investment performance
- This combination optimizes for both property tracking and temporal market analysis

In [1]:
# Create Delta table with liquid clustering

# CLUSTER BY defines the columns for automatic optimization
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType

# DataFrame schema used in Step 4. The DoubleType columns are cast to the table's DECIMAL types on insert.
data_schema = StructType([
    StructField("property_id", StringType(), True),
    StructField("transaction_date", DateType(), True),
    StructField("property_type", StringType(), True),
    StructField("sale_price", DoubleType(), True),
    StructField("location", StringType(), True),
    StructField("days_on_market", IntegerType(), True),
    StructField("price_per_sqft", DoubleType(), True)])

spark.sql("""

CREATE TABLE IF NOT EXISTS real_estate.analytics.property_transactions_uf (

    property_id STRING,

    transaction_date DATE,

    property_type STRING,

    sale_price DECIMAL(12,2),

    location STRING,

    days_on_market INT,

    price_per_sqft DECIMAL(8,2)

)

USING DELTA

TBLPROPERTIES('delta.universalFormat.enabledFormats' = 'iceberg') CLUSTER BY (property_id, transaction_date)

""")

print("Delta table with Iceberg compatibility and liquid clustering created successfully!")

print("Universal format enables Iceberg features while CLUSTER BY (columns) optimizes data layout.")

Delta table with Iceberg compatibility and liquid clustering created successfully!
Universal format enables Iceberg features while CLUSTER BY (columns) optimizes data layout.
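
One way to confirm that the table property was recorded is to read it back with `SHOW TBLPROPERTIES`, a standard Spark SQL statement. The sketch below is a minimal helper under that assumption. The function names are hypothetical, and `spark` is passed in as an argument so that nothing runs until you call it from a notebook with an active session.

```python
def read_table_properties(spark, table: str) -> dict:
    """Return table properties as a dict, using SHOW TBLPROPERTIES."""
    rows = spark.sql(f"SHOW TBLPROPERTIES {table}").collect()
    return {row["key"]: row["value"] for row in rows}


def uniform_iceberg_enabled(props: dict) -> bool:
    """Check whether 'iceberg' is listed in the Universal Format property."""
    formats = props.get("delta.universalFormat.enabledFormats", "")
    return "iceberg" in [f.strip().lower() for f in formats.split(",")]


# Stand-alone demo with a hand-written properties dict:
print(uniform_iceberg_enabled({"delta.universalFormat.enabledFormats": "iceberg"}))

# In the notebook, assuming an active session:
# props = read_table_properties(spark, "real_estate.analytics.property_transactions_uf")
# print(uniform_iceberg_enabled(props))
```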


## Step 3: Generate Real Estate Sample Data

### Data Generation Strategy

We'll create realistic real estate transaction data including:

- **8,000 properties** with 1 to 4 transactions each during 2024 (about 70% have a single sale)
- **Property types**: Single Family, Condo, Townhouse, Apartment, Commercial
- **Realistic market patterns**: Seasonal pricing, location premiums, market fluctuations
- **Geographic diversity**: Different neighborhoods with varying price points

### Why This Data Pattern?

This data simulates realistic real estate scenarios where:

- Properties appreciate or depreciate over time
- Market conditions vary by season and location
- Investment performance requires historical tracking
- Neighborhood analysis drives pricing strategies
- Market trends influence buying/selling decisions

In [1]:
# Generate sample real estate transaction data

# Standard-library imports for random sampling and date arithmetic

import random

from datetime import datetime, timedelta


# Define real estate data constants

PROPERTY_TYPES = ['Single Family', 'Condo', 'Townhouse', 'Apartment', 'Commercial']

LOCATIONS = ['Downtown', 'Suburban', 'Waterfront', 'Mountain View', 'Urban Core', 'Residential District']

# Base pricing parameters by property type and location

PRICE_PARAMS = {

    'Single Family': {

        'Downtown': {'base_price': 850000, 'sqft_range': (1800, 3500)},

        'Suburban': {'base_price': 650000, 'sqft_range': (2000, 4000)},

        'Waterfront': {'base_price': 1200000, 'sqft_range': (2200, 4500)},

        'Mountain View': {'base_price': 750000, 'sqft_range': (1900, 3800)},

        'Urban Core': {'base_price': 950000, 'sqft_range': (1600, 3200)},

        'Residential District': {'base_price': 700000, 'sqft_range': (2100, 4200)}

    },

    'Condo': {

        'Downtown': {'base_price': 550000, 'sqft_range': (800, 1800)},

        'Suburban': {'base_price': 350000, 'sqft_range': (900, 2000)},

        'Waterfront': {'base_price': 750000, 'sqft_range': (1000, 2200)},

        'Mountain View': {'base_price': 450000, 'sqft_range': (850, 1900)},

        'Urban Core': {'base_price': 650000, 'sqft_range': (750, 1700)},

        'Residential District': {'base_price': 400000, 'sqft_range': (950, 2100)}

    },

    'Townhouse': {

        'Downtown': {'base_price': 700000, 'sqft_range': (1400, 2800)},

        'Suburban': {'base_price': 550000, 'sqft_range': (1600, 3200)},

        'Waterfront': {'base_price': 900000, 'sqft_range': (1500, 3000)},

        'Mountain View': {'base_price': 600000, 'sqft_range': (1450, 2900)},

        'Urban Core': {'base_price': 800000, 'sqft_range': (1300, 2600)},

        'Residential District': {'base_price': 580000, 'sqft_range': (1650, 3300)}

    },

    'Apartment': {

        'Downtown': {'base_price': 450000, 'sqft_range': (600, 1400)},

        'Suburban': {'base_price': 280000, 'sqft_range': (650, 1500)},

        'Waterfront': {'base_price': 600000, 'sqft_range': (700, 1600)},

        'Mountain View': {'base_price': 350000, 'sqft_range': (625, 1450)},

        'Urban Core': {'base_price': 520000, 'sqft_range': (550, 1300)},

        'Residential District': {'base_price': 320000, 'sqft_range': (675, 1550)}

    },

    'Commercial': {

        'Downtown': {'base_price': 2500000, 'sqft_range': (3000, 10000)},

        'Suburban': {'base_price': 1500000, 'sqft_range': (2500, 8000)},

        'Waterfront': {'base_price': 3500000, 'sqft_range': (4000, 12000)},

        'Mountain View': {'base_price': 1800000, 'sqft_range': (2800, 9000)},

        'Urban Core': {'base_price': 3000000, 'sqft_range': (3500, 11000)},

        'Residential District': {'base_price': 1600000, 'sqft_range': (2600, 8500)}

    }

}


# Generate property transaction records

transaction_data = []

base_date = datetime(2024, 1, 1)


# Create 8,000 properties with 1-4 transactions each

for property_num in range(1, 8001):

    property_id = f"PROP{property_num:06d}"
    
    # Each property gets 1-4 transactions over 12 months (most have 1, some flip/resale)

    num_transactions = random.choices([1, 2, 3, 4], weights=[0.7, 0.2, 0.08, 0.02])[0]
    
    # Select property type and location (consistent for the same property)

    property_type = random.choice(PROPERTY_TYPES)

    location = random.choice(LOCATIONS)
    
    params = PRICE_PARAMS[property_type][location]
    
    # Base square footage for this property

    sqft = random.randint(params['sqft_range'][0], params['sqft_range'][1])
    
    for i in range(num_transactions):

        # Spread transactions over 12 months

        days_offset = random.randint(0, 365)

        transaction_date = base_date + timedelta(days=days_offset)
        
        # Calculate sale price with market variations

        # Seasonal pricing (higher in spring/summer)

        month = transaction_date.month

        if month in [3, 4, 5, 6]:  # Spring/Summer peak

            seasonal_factor = 1.15

        elif month in [11, 12, 1, 2]:  # Winter off-season

            seasonal_factor = 0.9

        else:

            seasonal_factor = 1.0
        
        # Market appreciation over time (slight increase)

        months_elapsed = (transaction_date.year - base_date.year) * 12 + (transaction_date.month - base_date.month)

        appreciation_factor = 1.0 + (months_elapsed * 0.002)  # 0.2% monthly appreciation

        # Calculate price per square foot

        base_price_per_sqft = params['base_price'] / ((params['sqft_range'][0] + params['sqft_range'][1]) / 2)

        price_per_sqft = round(base_price_per_sqft * seasonal_factor * appreciation_factor * random.uniform(0.9, 1.1), 2)
        
        # Calculate total sale price

        sale_price = round(price_per_sqft * sqft, 2)
        
        # Days on market (varies by property type and market conditions)

        if property_type == 'Commercial':

            days_on_market = random.randint(30, 180)

        else:

            days_on_market = random.randint(7, 90)
        
        transaction_data.append({

            "property_id": property_id,

            "transaction_date": transaction_date.date(),

            "property_type": property_type,

            "sale_price": sale_price,

            "location": location,

            "days_on_market": days_on_market,

            "price_per_sqft": price_per_sqft

        })



print(f"Generated {len(transaction_data)} property transaction records")

print("Sample record:", transaction_data[0])

Generated 11347 property transaction records
Sample record: {'property_id': 'PROP000001', 'transaction_date': datetime.date(2024, 6, 9), 'property_type': 'Single Family', 'sale_price': 750546.3, 'location': 'Residential District', 'days_on_market': 77, 'price_per_sqft': 253.05}
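
As a sanity check on the record count, the expected number of transactions follows directly from the weights passed to `random.choices`. The sketch below works it out with exact fractions. The generator is not seeded, so each run lands near this value rather than exactly on it.

```python
from fractions import Fraction

# Transactions per property and their probabilities, as used in the generator
counts = [1, 2, 3, 4]
weights = [Fraction(70, 100), Fraction(20, 100), Fraction(8, 100), Fraction(2, 100)]

expected_per_property = sum(c * w for c, w in zip(counts, weights))
expected_total = 8000 * expected_per_property

print(float(expected_per_property))  # 1.42
print(int(expected_total))           # 11360
```

The 11,347 records reported above are within about 0.1% of this expectation. Calling `random.seed(...)` before the generation loop would make the count reproducible.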


## Step 4: Insert Data Using PySpark

### Data Insertion Strategy

We'll use PySpark to:

1. **Create DataFrame** from our generated data
2. **Insert into Delta table** with liquid clustering
3. **Verify the insertion** with a sample query

### Why PySpark for Insertion?

- **Distributed processing**: Handles large datasets efficiently
- **Type safety**: Ensures data integrity
- **Optimization**: Leverages Spark's query optimization
- **Liquid clustering**: Automatically applies clustering during insertion
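
Because `insertInto` resolves columns by position rather than by name, it can be worth checking that the DataFrame and the table list their columns in the same order before writing. The helper below is a small sketch of such a check. The name `check_column_order` is hypothetical, and the column lists are copied from the table definition in Step 2.

```python
def check_column_order(df_columns, table_columns):
    """Return (position, df_name, table_name) for every position that differs."""
    problems = []
    for i in range(max(len(df_columns), len(table_columns))):
        df_name = df_columns[i] if i < len(df_columns) else None
        table_name = table_columns[i] if i < len(table_columns) else None
        if df_name is None or table_name is None or df_name.lower() != table_name.lower():
            problems.append((i, df_name, table_name))
    return problems


table_columns = ["property_id", "transaction_date", "property_type",
                 "sale_price", "location", "days_on_market", "price_per_sqft"]

# Same order: no problems reported
print(check_column_order(list(table_columns), table_columns))

# Two columns swapped: both positions are reported
swapped = ["property_id", "transaction_date", "property_type",
           "location", "sale_price", "days_on_market", "price_per_sqft"]
print(check_column_order(swapped, table_columns))
```

In the notebook, the two inputs could come from `df_transactions.columns` and `spark.table("real_estate.analytics.property_transactions_uf").columns`.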

In [1]:
# Insert data using PySpark DataFrame operations

# insertInto matches columns by position, so data_schema follows the table's column order


# Create DataFrame from generated data

df_transactions = spark.createDataFrame(transaction_data,schema=data_schema)


# Display schema and sample data

print("DataFrame Schema:")

df_transactions.printSchema()



print("\nSample Data:")

df_transactions.show(5)


# Insert data into Delta table with liquid clustering

# The table was created with CLUSTER BY (property_id, transaction_date), so rows written to it are subject to liquid clustering

df_transactions.write.mode("overwrite").insertInto("real_estate.analytics.property_transactions_uf")


print(f"\nSuccessfully inserted {df_transactions.count()} records into real_estate.analytics.property_transactions_uf")

print("Liquid clustering automatically optimized the data layout during insertion!")

DataFrame Schema:
root
 |-- property_id: string (nullable = true)
 |-- transaction_date: date (nullable = true)
 |-- property_type: string (nullable = true)
 |-- sale_price: double (nullable = true)
 |-- location: string (nullable = true)
 |-- days_on_market: integer (nullable = true)
 |-- price_per_sqft: double (nullable = true)


Sample Data:
+-----------+----------------+-------------+----------+--------------------+--------------+--------------+
|property_id|transaction_date|property_type|sale_price|            location|days_on_market|price_per_sqft|
+-----------+----------------+-------------+----------+--------------------+--------------+--------------+
| PROP000001|      2024-06-09|Single Family|  750546.3|Residential District|            77|        253.05|
| PROP000002|      2024-11-10|   Commercial| 1712513.6|Residential District|            81|        239.68|
| PROP000002|      2024-11-22|   Commercial| 1949870.5|Residential District|           147|         272.9|
| PROP00000


Successfully inserted 11347 records into real_estate.analytics.property_transactions_uf
Liquid clustering automatically optimized the data layout during insertion!


## Step 5: Demonstrate Liquid Clustering Benefits

### Query Performance Analysis

Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:

1. **Property transaction history** (clustered by property_id)
2. **Time-based market analysis** (clustered by transaction_date)
3. **Combined property + time queries** (optimal for our clustering)

### Expected Performance Benefits

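With liquid clustering, these queries can skip files that cannot contain matching rows. On a table as small as this demo's (about 11,000 rows) the difference is hard to measure, but the benefit grows with table size because of: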

- **Data locality**: Related records are physically grouped together
- **Reduced I/O**: Less data needs to be read from disk
- **Automatic optimization**: No manual tuning required
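
The notebook does not time its queries, so any speed-up has to be measured separately. A minimal timing helper might look like the sketch below. The name `time_call` is hypothetical, and the workload here is a stand-in. In the notebook you would pass something like `lambda: spark.sql(query).collect()` so that the query actually executes.

```python
import statistics
import time


def time_call(fn, repeats=5):
    """Run fn several times and return the median wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload; replace with a lambda that runs and collects a Spark query
elapsed = time_call(lambda: sum(range(100_000)))
print(f"median seconds: {elapsed:.6f}")
```

Using the median of several runs reduces the effect of caching and warm-up on a single measurement.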

In [1]:
# Demonstrate liquid clustering benefits with optimized queries


# Query 1: Property transaction history - benefits from property_id clustering

print("=== Query 1: Property Transaction History ===")

property_history = spark.sql("""

SELECT property_id, transaction_date, property_type, sale_price, location

FROM real_estate.analytics.property_transactions_uf

WHERE property_id = 'PROP000001'

ORDER BY transaction_date DESC

""")



property_history.show()

print(f"Records found: {property_history.count()}")



# Query 2: Time-based high-value transaction analysis - benefits from transaction_date clustering

print("\n=== Query 2: Recent High-Value Transactions ===")

high_value = spark.sql("""

SELECT transaction_date, property_id, property_type, sale_price, location

FROM real_estate.analytics.property_transactions_uf

WHERE transaction_date >= '2024-06-01' AND sale_price > 1000000

ORDER BY sale_price DESC, transaction_date DESC

""")



high_value.show()

print(f"High-value transactions found: {high_value.count()}")



# Query 3: Combined property + time query - optimal for our clustering strategy

print("\n=== Query 3: Property Value Trends ===")

value_trends = spark.sql("""

SELECT property_id, transaction_date, property_type, sale_price, price_per_sqft

FROM real_estate.analytics.property_transactions_uf

WHERE property_id LIKE 'PROP000%' AND transaction_date >= '2024-04-01'

ORDER BY property_id, transaction_date

""")



value_trends.show()

print(f"Value trend records found: {value_trends.count()}")

=== Query 1: Property Transaction History ===


+-----------+----------------+-------------+----------+--------------------+
|property_id|transaction_date|property_type|sale_price|            location|
+-----------+----------------+-------------+----------+--------------------+
| PROP000001|      2024-06-09|Single Family| 750546.30|Residential District|
+-----------+----------------+-------------+----------+--------------------+



Records found: 1

=== Query 2: Recent High-Value Transactions ===


+----------------+-----------+-------------+----------+----------+
|transaction_date|property_id|property_type|sale_price|  location|
+----------------+-----------+-------------+----------+----------+
|      2024-06-26| PROP003714|   Commercial|5774720.85|Waterfront|
|      2024-10-29| PROP004944|   Commercial|5706152.76|Waterfront|
|      2024-08-17| PROP003413|   Commercial|5614801.00|Waterfront|
|      2024-06-10| PROP001907|   Commercial|5566569.44|Waterfront|
|      2024-09-26| PROP004332|   Commercial|5559972.57|Waterfront|
|      2024-06-10| PROP001433|   Commercial|5471540.75|Waterfront|
|      2024-09-21| PROP002828|   Commercial|5467353.00|Waterfront|
|      2024-08-30| PROP000387|   Commercial|5419186.55|Waterfront|
|      2024-07-29| PROP001900|   Commercial|5380949.82|Waterfront|
|      2024-09-20| PROP007943|   Commercial|5375556.60|Waterfront|
|      2024-09-21| PROP007909|   Commercial|5282494.80|Waterfront|
|      2024-08-06| PROP007267|   Commercial|5276526.80|Waterfr

High-value transactions found: 1727

=== Query 3: Property Value Trends ===


+-----------+----------------+-------------+----------+--------------+
|property_id|transaction_date|property_type|sale_price|price_per_sqft|
+-----------+----------------+-------------+----------+--------------+
| PROP000001|      2024-06-09|Single Family| 750546.30|        253.05|
| PROP000002|      2024-11-10|   Commercial|1712513.60|        239.68|
| PROP000002|      2024-11-22|   Commercial|1949870.50|        272.90|
| PROP000004|      2024-07-13|        Condo| 580100.40|        555.12|
| PROP000005|      2024-04-14|Single Family|1287495.80|        389.56|
| PROP000005|      2024-06-15|Single Family|1489827.90|        450.78|
| PROP000006|      2024-09-29|Single Family| 527263.50|        206.77|
| PROP000009|      2024-06-19|   Commercial|3084254.03|        512.59|
| PROP000010|      2024-09-29|        Condo| 839823.39|        547.83|
| PROP000012|      2024-07-22|    Apartment| 442179.60|        519.60|
| PROP000012|      2024-12-28|    Apartment| 408709.77|        480.27|
| PROP

Value trend records found: 1050


## Step 6: Analyze Clustering Effectiveness

### Understanding the Impact

The queries in this step do not inspect the physical data layout directly. Instead, they run aggregate statistics over the clustered table to show the real estate insights this structure supports.

### Key Analytics

- **Property value appreciation** and market performance
- **Location-based pricing** and neighborhood analysis
- **Property type trends** and market segmentation
- **Market timing** and seasonal patterns
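
The market timing query in this step groups sales into speed buckets with a SQL `CASE` expression. The sketch below mirrors that logic in plain Python to make the bucket boundaries explicit. The function name `sale_speed` is chosen here for illustration, and the labels are copied from the query.

```python
def sale_speed(days_on_market: int) -> str:
    """Mirror the CASE expression used in the market timing query."""
    if days_on_market <= 30:
        return "Fast Sale (1-30 days)"
    if days_on_market <= 60:
        return "Normal Sale (31-60 days)"
    if days_on_market <= 90:
        return "Slow Sale (61-90 days)"
    return "Very Slow Sale (90+ days)"


for days in (7, 30, 31, 90, 91, 180):
    print(days, "->", sale_speed(days))
```

Note that a sale at exactly 90 days falls in the "Slow Sale" bucket, so the "90+" label effectively means 91 days or more. Since only commercial properties are generated with more than 90 days on market, the last bucket contains commercial sales only.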

In [1]:
# Analyze clustering effectiveness and real estate insights


# Property value analysis

print("=== Property Value Analysis ===")

property_values = spark.sql("""

SELECT property_id, COUNT(*) as total_transactions,

       ROUND(MIN(sale_price), 2) as min_sale_price,

       ROUND(MAX(sale_price), 2) as max_sale_price,

       ROUND(AVG(sale_price), 2) as avg_sale_price,

       ROUND(AVG(price_per_sqft), 2) as avg_price_per_sqft,

       property_type, location

FROM real_estate.analytics.property_transactions_uf

GROUP BY property_id, property_type, location

ORDER BY avg_sale_price DESC

LIMIT 10

""")



property_values.show()


# Location market analysis

print("\n=== Location Market Analysis ===")

location_analysis = spark.sql("""

SELECT location, COUNT(*) as total_transactions,

       ROUND(AVG(sale_price), 2) as avg_sale_price,

       ROUND(AVG(price_per_sqft), 2) as avg_price_per_sqft,

       ROUND(AVG(days_on_market), 2) as avg_days_on_market,

       COUNT(DISTINCT property_id) as unique_properties

FROM real_estate.analytics.property_transactions_uf

GROUP BY location

ORDER BY avg_sale_price DESC

""")



location_analysis.show()


# Property type market trends

print("\n=== Property Type Market Trends ===")

property_trends = spark.sql("""

SELECT property_type, COUNT(*) as total_sales,

       ROUND(AVG(sale_price), 2) as avg_sale_price,

       ROUND(AVG(price_per_sqft), 2) as avg_price_per_sqft,

       ROUND(AVG(days_on_market), 2) as avg_days_on_market,

       COUNT(DISTINCT property_id) as unique_properties

FROM real_estate.analytics.property_transactions_uf

GROUP BY property_type

ORDER BY avg_sale_price DESC

""")



property_trends.show()


# Market timing analysis

print("\n=== Market Timing Analysis ===")

market_timing = spark.sql("""

SELECT 

    CASE 

        WHEN days_on_market <= 30 THEN 'Fast Sale (1-30 days)'

        WHEN days_on_market <= 60 THEN 'Normal Sale (31-60 days)'

        WHEN days_on_market <= 90 THEN 'Slow Sale (61-90 days)'

        ELSE 'Very Slow Sale (90+ days)'

    END as sale_speed,

    COUNT(*) as transaction_count,

    ROUND(AVG(sale_price), 2) as avg_sale_price,

    ROUND(AVG(days_on_market), 2) as avg_days,

    ROUND(SUM(sale_price), 2) as total_volume

FROM real_estate.analytics.property_transactions_uf

GROUP BY 

    CASE 

        WHEN days_on_market <= 30 THEN 'Fast Sale (1-30 days)'

        WHEN days_on_market <= 60 THEN 'Normal Sale (31-60 days)'

        WHEN days_on_market <= 90 THEN 'Slow Sale (61-90 days)'

        ELSE 'Very Slow Sale (90+ days)'

    END

ORDER BY avg_days

""")



market_timing.show()


# Monthly market trends

print("\n=== Monthly Market Trends ===")

monthly_trends = spark.sql("""

SELECT DATE_FORMAT(transaction_date, 'yyyy-MM') as month,

       COUNT(*) as total_transactions,

       ROUND(SUM(sale_price), 2) as monthly_volume,

       ROUND(AVG(sale_price), 2) as avg_sale_price,

       ROUND(AVG(price_per_sqft), 2) as avg_price_per_sqft,

       COUNT(DISTINCT property_id) as unique_properties

FROM real_estate.analytics.property_transactions_uf

GROUP BY DATE_FORMAT(transaction_date, 'yyyy-MM')

ORDER BY month

""")



monthly_trends.show()

=== Property Value Analysis ===


+-----------+------------------+--------------+--------------+--------------+------------------+-------------+----------+
|property_id|total_transactions|min_sale_price|max_sale_price|avg_sale_price|avg_price_per_sqft|property_type|  location|
+-----------+------------------+--------------+--------------+--------------+------------------+-------------+----------+
| PROP000468|                 1|    6187669.02|    6187669.02|    6187669.02|            532.41|   Commercial|Waterfront|
| PROP004238|                 1|     6122593.4|     6122593.4|     6122593.4|             530.6|   Commercial|Waterfront|
| PROP003918|                 1|    6064207.98|    6064207.98|    6064207.98|            547.41|   Commercial|Waterfront|
| PROP002022|                 2|    5590175.58|    6348323.74|    5969249.66|            511.46|   Commercial|Waterfront|
| PROP007461|                 1|     5957520.0|     5957520.0|     5957520.0|            496.46|   Commercial|Waterfront|
| PROP000943|           

+--------------------+------------------+--------------+------------------+------------------+-----------------+
|            location|total_transactions|avg_sale_price|avg_price_per_sqft|avg_days_on_market|unique_properties|
+--------------------+------------------+--------------+------------------+------------------+-----------------+
|          Waterfront|              1863|    1441088.63|            451.02|              60.4|             1304|
|          Urban Core|              2000|    1223951.45|            474.84|             59.79|             1371|
|            Downtown|              1914|    1064652.39|            392.89|             59.12|             1357|
|       Mountain View|              1864|     812725.11|            311.06|             60.04|             1327|
|Residential District|              1854|     730866.14|            266.11|             59.34|             1311|
|            Suburban|              1852|     703600.61|            253.91|             60.83|  

+-------------+-----------+--------------+------------------+------------------+-----------------+
|property_type|total_sales|avg_sale_price|avg_price_per_sqft|avg_days_on_market|unique_properties|
+-------------+-----------+--------------+------------------+------------------+-----------------+
|   Commercial|       2302|    2417189.63|            365.35|            105.03|             1631|
|Single Family|       2233|     870670.36|            304.49|             48.71|             1569|
|    Townhouse|       2265|     705960.16|            323.47|              48.0|             1579|
|        Condo|       2339|     544614.28|            389.13|             48.51|             1663|
|    Apartment|       2208|     435679.09|            417.36|             48.52|             1558|
+-------------+-----------+--------------+------------------+------------------+-----------------+


=== Market Timing Analysis ===


+--------------------+-----------------+--------------+--------+---------------+
|          sale_speed|transaction_count|avg_sale_price|avg_days|   total_volume|
+--------------------+-----------------+--------------+--------+---------------+
|Fast Sale (1-30 d...|             2597|     660951.94|   18.65|1.71649218565E9|
|Normal Sale (31-6...|             3727|     836267.83|    45.7|3.11677018659E9|
|Slow Sale (61-90 ...|             3627|     870470.68|   75.43|3.15719717322E9|
|Very Slow Sale (9...|             1396|    2401826.57|  134.35|3.35294989703E9|
+--------------------+-----------------+--------------+--------+---------------+


=== Monthly Market Trends ===


+-------+------------------+---------------+--------------+------------------+-----------------+
|  month|total_transactions| monthly_volume|avg_sale_price|avg_price_per_sqft|unique_properties|
+-------+------------------+---------------+--------------+------------------+-----------------+
|2024-01|               966| 8.1816106127E8|     846957.62|             313.0|              941|
|2024-02|               885| 7.7212737503E8|     872460.31|            313.98|              853|
|2024-03|               930|1.08237789977E9|     1163847.2|            403.74|              905|
|2024-04|               941|1.05195504907E9|    1117911.85|            399.17|              913|
|2024-05|               951| 1.0930246287E9|    1149342.41|            412.99|              923|
|2024-06|              1000|1.11476728878E9|    1114767.29|            407.77|              968|
|2024-07|               945| 9.3378043396E8|     988127.44|            359.04|              904|
|2024-08|               942| 9
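
The jump in `avg_price_per_sqft` between February and March in the monthly output above can be checked against the generator's own parameters. The sketch below recomputes the expected ratio from the seasonal and appreciation factors used in Step 3 and compares it with the two values printed in the table. The function names are chosen here for illustration.

```python
def seasonal_factor(month: int) -> float:
    """Seasonal multiplier from the data generator."""
    if month in (3, 4, 5, 6):
        return 1.15
    if month in (11, 12, 1, 2):
        return 0.9
    return 1.0


def appreciation_factor(months_elapsed: int) -> float:
    """0.2% appreciation per month elapsed since January 2024."""
    return 1.0 + months_elapsed * 0.002


def expected_multiplier(month: int) -> float:
    # months_elapsed is month - 1 because the base date is 2024-01-01
    return seasonal_factor(month) * appreciation_factor(month - 1)


expected_ratio = expected_multiplier(3) / expected_multiplier(2)
observed_ratio = 403.74 / 313.98  # 2024-03 vs 2024-02 in the output above

print(f"expected: {expected_ratio:.3f}, observed: {observed_ratio:.3f}")
```

The two ratios agree to within about one percent, which is consistent with the random noise of plus or minus 10% applied to each price and the random mix of property types in each month.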

## Key Takeaways: Iceberg and Liquid Clustering in AIDP

### What We Demonstrated

1. **Automatic Optimization**: Created a table with `TBLPROPERTIES('delta.universalFormat.enabledFormats' = 'iceberg') CLUSTER BY (property_id, transaction_date)` and let Delta automatically optimize data layout

2. **Performance Benefits**: Queries that filter on the clustering columns (property_id, transaction_date) benefit from data locality because fewer files need to be read. This notebook ran such queries but did not time them.

3. **Zero Maintenance**: No manual partitioning, bucketing, or Z-Ordering required - Delta handles it automatically

4. **Real-World Use Case**: Real estate analytics where property tracking and market analysis are critical

### AIDP Advantages

- **Unified Analytics**: Seamlessly integrates with other AIDP services
- **Governance**: Catalog and schema isolation for real estate data
- **Performance**: Optimized for both OLAP and OLTP workloads
- **Scalability**: Handles real estate-scale data volumes effortlessly

### Best Practices for Iceberg and Liquid Clustering

1. **Choose clustering columns** based on your most common query patterns
2. **Start with 1-4 columns** - too many can reduce effectiveness
3. **Consider cardinality** - high-cardinality columns work best
4. **Monitor and adjust** as query patterns evolve
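
For the "monitor and adjust" practice, Delta Lake provides `OPTIMIZE` to recluster existing data and `ALTER TABLE ... CLUSTER BY` to change the clustering columns. The sketch below only assembles those statements as strings. Whether both are supported in your AIDP release is an assumption to verify, the helper name `clustering_maintenance_sql` is hypothetical, and the four-column cap follows the guideline in the list above.

```python
def clustering_maintenance_sql(table, new_cluster_columns=None):
    """Build maintenance statements for a liquid-clustered Delta table."""
    statements = []
    if new_cluster_columns:
        if len(new_cluster_columns) > 4:
            raise ValueError("use at most 4 clustering columns")
        cols = ", ".join(new_cluster_columns)
        # Changes the clustering columns used for future writes and OPTIMIZE runs
        statements.append(f"ALTER TABLE {table} CLUSTER BY ({cols})")
    # Rewrites data files so that rows are grouped by the clustering columns
    statements.append(f"OPTIMIZE {table}")
    return statements


table = "real_estate.analytics.property_transactions_uf"
for sql in clustering_maintenance_sql(table, ["location", "transaction_date"]):
    print(sql)

# In the notebook, assuming an active session and support for these statements:
# for sql in clustering_maintenance_sql(table):
#     spark.sql(sql)
```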

### Next Steps

- Explore other AIDP features like AI/ML integration
- Try liquid clustering with different column combinations
- Scale up to larger real estate datasets
- Integrate with real MLS and property management systems

This notebook demonstrates how Oracle AI Data Platform combines Delta's advanced liquid clustering with Iceberg's open, future-proof architecture to deliver enterprise-grade analytics that are both high-performance and standards-compliant.