# Real Estate: Delta Liquid Clustering Demo


## Overview


This notebook demonstrates the power of **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using a real estate analytics use case. Liquid clustering automatically optimizes data layout for query performance without requiring manual partitioning or Z-Ordering.

### What is Liquid Clustering?

Liquid clustering groups rows with similar values together on disk, based on the clustering columns you define. The layout is maintained during data ingestion and maintenance operations, providing:

- **Automatic optimization**: No manual tuning required
- **Improved query performance**: Faster queries on clustered columns
- **Reduced maintenance**: No need for manual repartitioning
- **Adaptive clustering**: Adjusts as data patterns change
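To make the "faster queries on clustered columns" point concrete, here is a minimal pure-Python sketch of file-level data skipping. It is an illustration under stated assumptions, not how Delta implements clustering: the 500-rows-per-file layout is hypothetical, and plain sorting stands in for the real clustering algorithm. Each "file" keeps only the minimum and maximum of its values, and a lookup must open every file whose range could contain the target.

```python
import random

def build_files(values, rows_per_file):
    """Split values into consecutive 'files' and record each file's (min, max)."""
    files = []
    for start in range(0, len(values), rows_per_file):
        chunk = values[start:start + rows_per_file]
        files.append((min(chunk), max(chunk)))
    return files

def files_to_read(files, target):
    """Count files whose [min, max] range could contain the target value."""
    return sum(1 for lo, hi in files if lo <= target <= hi)

# Hypothetical layout: 8,000 property IDs written 500 rows per file
property_ids = [f"PROP{n:06d}" for n in range(1, 8001)]

shuffled = property_ids[:]
random.Random(0).shuffle(shuffled)                   # arrival order, no clustering
unclustered = build_files(shuffled, 500)
clustered = build_files(sorted(property_ids), 500)   # similar values grouped together

target = "PROP004000"
unclustered_scans = files_to_read(unclustered, target)
clustered_scans = files_to_read(clustered, target)
print("files scanned without clustering:", unclustered_scans)
print("files scanned with clustering:   ", clustered_scans)
```

With the grouped layout, only one file's range can contain the target. With the shuffled layout, nearly every file spans most of the ID range, so most files must be opened.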

### Use Case: Property Transactions and Market Analysis

We'll analyze real estate transactions and property market data. Our clustering strategy will optimize for:

- **Property-specific queries**: Fast lookups by property ID
- **Time-based analysis**: Efficient filtering by transaction and listing dates
- **Market performance patterns**: Quick aggregation by location and property type

## Step 1: AIDP Environment Setup

This notebook uses the existing Spark session in your AIDP environment. The cell below creates the catalog and schema that hold the demo table.

In [None]:
# Create real estate catalog and analytics schema

# In AIDP, catalogs provide data isolation and governance

spark.sql("CREATE CATALOG IF NOT EXISTS real_estate")

spark.sql("CREATE SCHEMA IF NOT EXISTS real_estate.analytics")

print("Real estate catalog and analytics schema created successfully!")

## Step 2: Create Delta Table with Liquid Clustering

### Table Design

Our `property_transactions` table will store:

- **property_id**: Unique property identifier
- **transaction_date**: Date of property transaction
- **property_type**: Type (Single Family, Condo, Apartment, etc.)
- **sale_price**: Transaction sale price
- **location**: Geographic location/neighborhood
- **days_on_market**: Time property was listed before sale
- **price_per_sqft**: Price per square foot

### Clustering Strategy

We'll cluster by `property_id` and `transaction_date` because:

- **property_id**: A property may be sold more than once, so clustering on this column keeps each property's sales history together
- **transaction_date**: Time-based queries are critical for market analysis, seasonal trends, and investment performance
- Together, these columns support both per-property tracking and temporal market analysis

In [None]:
# Create Delta table with liquid clustering

# CLUSTER BY defines the columns for automatic optimization

spark.sql("""

CREATE TABLE IF NOT EXISTS real_estate.analytics.property_transactions (

    property_id STRING,

    transaction_date DATE,

    property_type STRING,

    sale_price DECIMAL(12,2),

    location STRING,

    days_on_market INT,

    price_per_sqft DECIMAL(8,2)

)

USING DELTA

CLUSTER BY (property_id, transaction_date)

""")

print("Delta table with liquid clustering created successfully!")

print("Clustering will automatically optimize data layout for queries on property_id and transaction_date.")

Delta table with liquid clustering created successfully!
Clustering will automatically optimize data layout for queries on property_id and transaction_date.


## Step 3: Generate Real Estate Sample Data

### Data Generation Strategy

We'll create realistic real estate transaction data including:

- **8,000 properties**, each with 1 to 4 transactions during 2024 (about 70% have exactly one)
- **Property types**: Single Family, Condo, Townhouse, Apartment, Commercial
- **Realistic market patterns**: Seasonal pricing, location premiums, market fluctuations
- **Geographic diversity**: Different neighborhoods with varying price points

### Why This Data Pattern?

This data simulates real-world real estate scenarios where:

- Properties appreciate or depreciate over time
- Market conditions vary by season and location
- Investment performance requires historical tracking
- Neighborhood analysis drives pricing strategies
- Market trends influence buying/selling decisions
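The generator cell in this step derives price per square foot from a base price, the midpoint of the square-footage range, a seasonal factor, and 0.2% monthly appreciation, then applies ±10% random noise. The sketch below reproduces only the deterministic part of that formula for one combination (Single Family, Downtown, March 2024). The function names are illustrative and do not appear in the notebook.

```python
def seasonal_factor(month):
    """Seasonal multiplier used by the generator: peak in March-June, dip in November-February."""
    if month in (3, 4, 5, 6):
        return 1.15
    if month in (11, 12, 1, 2):
        return 0.9
    return 1.0

def expected_price_per_sqft(base_price, sqft_range, month, base_month=1):
    """Price per square foot before random noise, for a transaction in 2024."""
    midpoint_sqft = (sqft_range[0] + sqft_range[1]) / 2
    appreciation = 1.0 + (month - base_month) * 0.002   # 0.2% per elapsed month
    return base_price / midpoint_sqft * seasonal_factor(month) * appreciation

# Single Family / Downtown parameters from PRICE_PARAMS, transaction in March
center = expected_price_per_sqft(850000, (1800, 3500), month=3)
low, high = center * 0.9, center * 1.1                  # bounds of the +/-10% noise
print(f"center {center:.2f}, range {low:.2f} to {high:.2f}")
```

The sample record printed after generation (a March sale at 374.56 per square foot) falls inside this range.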

In [None]:
# Generate sample real estate transaction data

# Only standard-library modules are needed to generate the data

import random

from datetime import datetime, timedelta


# Define real estate data constants

PROPERTY_TYPES = ['Single Family', 'Condo', 'Townhouse', 'Apartment', 'Commercial']

LOCATIONS = ['Downtown', 'Suburban', 'Waterfront', 'Mountain View', 'Urban Core', 'Residential District']

# Base pricing parameters by property type and location

PRICE_PARAMS = {

    'Single Family': {

        'Downtown': {'base_price': 850000, 'sqft_range': (1800, 3500)},

        'Suburban': {'base_price': 650000, 'sqft_range': (2000, 4000)},

        'Waterfront': {'base_price': 1200000, 'sqft_range': (2200, 4500)},

        'Mountain View': {'base_price': 750000, 'sqft_range': (1900, 3800)},

        'Urban Core': {'base_price': 950000, 'sqft_range': (1600, 3200)},

        'Residential District': {'base_price': 700000, 'sqft_range': (2100, 4200)}

    },

    'Condo': {

        'Downtown': {'base_price': 550000, 'sqft_range': (800, 1800)},

        'Suburban': {'base_price': 350000, 'sqft_range': (900, 2000)},

        'Waterfront': {'base_price': 750000, 'sqft_range': (1000, 2200)},

        'Mountain View': {'base_price': 450000, 'sqft_range': (850, 1900)},

        'Urban Core': {'base_price': 650000, 'sqft_range': (750, 1700)},

        'Residential District': {'base_price': 400000, 'sqft_range': (950, 2100)}

    },

    'Townhouse': {

        'Downtown': {'base_price': 700000, 'sqft_range': (1400, 2800)},

        'Suburban': {'base_price': 550000, 'sqft_range': (1600, 3200)},

        'Waterfront': {'base_price': 900000, 'sqft_range': (1500, 3000)},

        'Mountain View': {'base_price': 600000, 'sqft_range': (1450, 2900)},

        'Urban Core': {'base_price': 800000, 'sqft_range': (1300, 2600)},

        'Residential District': {'base_price': 580000, 'sqft_range': (1650, 3300)}

    },

    'Apartment': {

        'Downtown': {'base_price': 450000, 'sqft_range': (600, 1400)},

        'Suburban': {'base_price': 280000, 'sqft_range': (650, 1500)},

        'Waterfront': {'base_price': 600000, 'sqft_range': (700, 1600)},

        'Mountain View': {'base_price': 350000, 'sqft_range': (625, 1450)},

        'Urban Core': {'base_price': 520000, 'sqft_range': (550, 1300)},

        'Residential District': {'base_price': 320000, 'sqft_range': (675, 1550)}

    },

    'Commercial': {

        'Downtown': {'base_price': 2500000, 'sqft_range': (3000, 10000)},

        'Suburban': {'base_price': 1500000, 'sqft_range': (2500, 8000)},

        'Waterfront': {'base_price': 3500000, 'sqft_range': (4000, 12000)},

        'Mountain View': {'base_price': 1800000, 'sqft_range': (2800, 9000)},

        'Urban Core': {'base_price': 3000000, 'sqft_range': (3500, 11000)},

        'Residential District': {'base_price': 1600000, 'sqft_range': (2600, 8500)}

    }

}



# Generate property transaction records

transaction_data = []

base_date = datetime(2024, 1, 1)


# Create 8,000 properties with 1-4 transactions each

for property_num in range(1, 8001):

    property_id = f"PROP{property_num:06d}"
    
    # Each property gets 1-4 transactions over 12 months (most have 1, some flip/resale)

    num_transactions = random.choices([1, 2, 3, 4], weights=[0.7, 0.2, 0.08, 0.02])[0]
    
    # Select property type and location (consistent for the same property)

    property_type = random.choice(PROPERTY_TYPES)

    location = random.choice(LOCATIONS)
    
    params = PRICE_PARAMS[property_type][location]
    
    # Base square footage for this property

    sqft = random.randint(params['sqft_range'][0], params['sqft_range'][1])
    
    for i in range(num_transactions):

        # Spread transactions over 12 months

        days_offset = random.randint(0, 365)

        transaction_date = base_date + timedelta(days=days_offset)
        
        # Calculate sale price with market variations

        # Seasonal pricing (higher in spring/summer)

        month = transaction_date.month

        if month in [3, 4, 5, 6]:  # Spring/Summer peak

            seasonal_factor = 1.15

        elif month in [11, 12, 1, 2]:  # Winter off-season

            seasonal_factor = 0.9

        else:

            seasonal_factor = 1.0
        
        # Market appreciation over time (slight increase)

        months_elapsed = (transaction_date.year - base_date.year) * 12 + (transaction_date.month - base_date.month)

        appreciation_factor = 1.0 + (months_elapsed * 0.002)  # 0.2% monthly appreciation
        
        # Calculate price per square foot

        base_price_per_sqft = params['base_price'] / ((params['sqft_range'][0] + params['sqft_range'][1]) / 2)

        price_per_sqft = round(base_price_per_sqft * seasonal_factor * appreciation_factor * random.uniform(0.9, 1.1), 2)
        
        # Calculate total sale price

        sale_price = round(price_per_sqft * sqft, 2)
        
        # Days on market (varies by property type and market conditions)

        if property_type == 'Commercial':

            days_on_market = random.randint(30, 180)

        else:

            days_on_market = random.randint(7, 90)
        
        transaction_data.append({

            "property_id": property_id,

            "transaction_date": transaction_date.date(),

            "property_type": property_type,

            "sale_price": sale_price,

            "location": location,

            "days_on_market": days_on_market,

            "price_per_sqft": price_per_sqft

        })



print(f"Generated {len(transaction_data)} property transaction records")

print("Sample record:", transaction_data[0])

Generated 11453 property transaction records
Sample record: {'property_id': 'PROP000001', 'transaction_date': datetime.date(2024, 3, 24), 'property_type': 'Single Family', 'sale_price': 806427.68, 'location': 'Downtown', 'days_on_market': 68, 'price_per_sqft': 374.56}
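The record count changes from run to run because the number of transactions per property is drawn at random. As a quick sanity check, the expected total follows directly from the weights passed to `random.choices`:

```python
# Probability of 1, 2, 3, or 4 transactions per property (from the generator)
weights = {1: 0.70, 2: 0.20, 3: 0.08, 4: 0.02}
num_properties = 8000

expected_per_property = sum(count * weight for count, weight in weights.items())
expected_total = num_properties * expected_per_property

print(f"{expected_per_property:.2f} transactions per property on average")
print(f"about {expected_total:,.0f} records expected")
```

The 11,453 records generated in this run are within about 1% of that expectation.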


## Step 4: Insert Data Using PySpark

### Data Insertion Strategy

We'll use PySpark to:

1. **Create DataFrame** from our generated data
2. **Insert into Delta table** with liquid clustering
3. **Verify the insertion** with a sample query

### Why PySpark for Insertion?

- **Distributed processing**: Handles large datasets efficiently
- **Type safety**: Ensures data integrity
- **Optimization**: Leverages Spark's query optimization
- **Liquid clustering**: Automatically applies clustering during insertion
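One caveat on type safety: the schema printed in this step shows that Spark infers `long` and `double` from the Python dictionaries, while the table in Step 2 declares `INT` and `DECIMAL` columns. The sketch below shows one possible way to align the two, by converting prices to two-decimal `Decimal` values and passing a DDL schema string to `spark.createDataFrame`. The helper names and `TABLE_SCHEMA` are illustrative, and how AIDP handles the mismatch on write is not something this notebook establishes.

```python
from datetime import date
from decimal import Decimal, ROUND_HALF_UP

# DDL string mirroring the CREATE TABLE statement in Step 2
TABLE_SCHEMA = (
    "property_id STRING, transaction_date DATE, property_type STRING, "
    "sale_price DECIMAL(12,2), location STRING, "
    "days_on_market INT, price_per_sqft DECIMAL(8,2)"
)

def to_decimal(value):
    """Convert a float to a Decimal rounded to two places."""
    return Decimal(str(value)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

def to_typed_row(record):
    """Return a tuple in table column order, with prices as two-decimal Decimals."""
    return (
        record["property_id"],
        record["transaction_date"],
        record["property_type"],
        to_decimal(record["sale_price"]),
        record["location"],
        int(record["days_on_market"]),
        to_decimal(record["price_per_sqft"]),
    )

def create_typed_dataframe(spark, records):
    """Build a DataFrame with the table's schema; `spark` is the notebook's session."""
    return spark.createDataFrame([to_typed_row(r) for r in records], schema=TABLE_SCHEMA)

# Example with the sample record shown in Step 3
sample = {
    "property_id": "PROP000001", "transaction_date": date(2024, 3, 24),
    "property_type": "Single Family", "sale_price": 806427.68,
    "location": "Downtown", "days_on_market": 68, "price_per_sqft": 374.56,
}
typed_row = to_typed_row(sample)
print(typed_row)
```

In the notebook, `create_typed_dataframe(spark, transaction_data)` could replace the plain `spark.createDataFrame(transaction_data)` call.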

In [None]:
# Insert data using PySpark DataFrame operations

# `spark` is the session provided by the AIDP notebook environment


# Create DataFrame from generated data

df_transactions = spark.createDataFrame(transaction_data)


# Display schema and sample data

print("DataFrame Schema:")

df_transactions.printSchema()



print("\nSample Data:")

df_transactions.show(5)


# Insert data into Delta table with liquid clustering

# The CLUSTER BY (property_id, transaction_date) will automatically optimize the data layout

df_transactions.write.mode("overwrite").saveAsTable("real_estate.analytics.property_transactions")


print(f"\nSuccessfully inserted {df_transactions.count()} records into real_estate.analytics.property_transactions")

print("Liquid clustering automatically optimized the data layout during insertion!")

DataFrame Schema:
root
 |-- days_on_market: long (nullable = true)
 |-- location: string (nullable = true)
 |-- price_per_sqft: double (nullable = true)
 |-- property_id: string (nullable = true)
 |-- property_type: string (nullable = true)
 |-- sale_price: double (nullable = true)
 |-- transaction_date: date (nullable = true)


Sample Data:


+--------------+--------------------+--------------+-----------+-------------+----------+----------------+
|days_on_market|            location|price_per_sqft|property_id|property_type|sale_price|transaction_date|
+--------------+--------------------+--------------+-----------+-------------+----------+----------------+
|            68|            Downtown|        374.56| PROP000001|Single Family| 806427.68|      2024-03-24|
|            53|            Downtown|        277.53| PROP000001|Single Family| 597522.09|      2024-02-21|
|            19|            Downtown|        351.79| PROP000001|Single Family| 757403.87|      2024-10-07|
|            56|Residential District|        236.95| PROP000002|Single Family| 523896.45|      2024-06-15|
|           168|          Waterfront|        364.25| PROP000003|   Commercial| 3345272.0|      2024-02-15|
+--------------+--------------------+--------------+-----------+-------------+----------+----------------+
only showing top 5 rows




Successfully inserted 11453 records into real_estate.analytics.property_transactions
Liquid clustering automatically optimized the data layout during insertion!
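How much clustering is applied at write time can depend on the engine and on the size of the write, and this notebook does not verify it. In Delta Lake, `OPTIMIZE` is the maintenance command that rewrites files according to the table's clustering columns. A minimal helper for running it explicitly might look like the sketch below. It assumes that AIDP supports `OPTIMIZE` on this table, and the function name is illustrative.

```python
def optimize_table(spark, table_name):
    """Run Delta's OPTIMIZE on a table so its files are rewritten by clustering columns.

    `spark` is the notebook's existing session; `table_name` is a fully
    qualified name such as catalog.schema.table.
    """
    return spark.sql(f"OPTIMIZE {table_name}")

# Usage in the notebook:
# optimize_table(spark, "real_estate.analytics.property_transactions").show(truncate=False)
```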


## Step 5: Demonstrate Liquid Clustering Benefits

### Query Performance Analysis

The queries below filter on the clustering columns, which is where liquid clustering helps most. We'll run three queries that match our clustering strategy:

1. **Property transaction history** (clustered by property_id)
2. **Time-based market analysis** (clustered by transaction_date)
3. **Combined property + time queries** (optimal for our clustering)

### Expected Performance Benefits

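With liquid clustering, queries like these should read less data and run faster on large tables (on a demo table of about 11,000 rows the difference may be too small to notice) because: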

- **Data locality**: Related records are physically grouped together
- **Reduced I/O**: Less data needs to be read from disk
- **Automatic optimization**: No manual tuning required
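The notebook does not measure query time itself. To check the expected benefit on your own data, a small timing helper such as the sketch below can wrap any query. The helper name is illustrative, and best-of-N wall-clock time is only a rough measure, since caching and cluster load also affect it.

```python
import time

def time_action(action, repeats=3):
    """Run a zero-argument callable several times; return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        best = min(best, time.perf_counter() - start)
    return best

# Usage in the notebook (assumes the existing `spark` session):
# elapsed = time_action(lambda: spark.sql(
#     "SELECT * FROM real_estate.analytics.property_transactions "
#     "WHERE property_id = 'PROP000001'"
# ).collect())
# print(f"best of 3: {elapsed:.3f} s")
```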

In [None]:
# Demonstrate liquid clustering benefits with optimized queries


# Query 1: Property transaction history - benefits from property_id clustering

print("=== Query 1: Property Transaction History ===")

property_history = spark.sql("""

SELECT property_id, transaction_date, property_type, sale_price, location

FROM real_estate.analytics.property_transactions

WHERE property_id = 'PROP000001'

ORDER BY transaction_date DESC

""")



property_history.show()

print(f"Records found: {property_history.count()}")



# Query 2: Time-based high-value transaction analysis - benefits from transaction_date clustering

print("\n=== Query 2: Recent High-Value Transactions ===")

high_value = spark.sql("""

SELECT transaction_date, property_id, property_type, sale_price, location

FROM real_estate.analytics.property_transactions

WHERE transaction_date >= '2024-06-01' AND sale_price > 1000000

ORDER BY sale_price DESC, transaction_date DESC

""")



high_value.show()

print(f"High-value transactions found: {high_value.count()}")



# Query 3: Combined property + time query - optimal for our clustering strategy

print("\n=== Query 3: Property Value Trends ===")

value_trends = spark.sql("""

SELECT property_id, transaction_date, property_type, sale_price, price_per_sqft

FROM real_estate.analytics.property_transactions

WHERE property_id LIKE 'PROP000%' AND transaction_date >= '2024-04-01'

ORDER BY property_id, transaction_date

""")



value_trends.show()

print(f"Value trend records found: {value_trends.count()}")

=== Query 1: Property Transaction History ===


+-----------+----------------+-------------+----------+--------+
|property_id|transaction_date|property_type|sale_price|location|
+-----------+----------------+-------------+----------+--------+
| PROP000001|      2024-10-07|Single Family| 757403.87|Downtown|
| PROP000001|      2024-03-24|Single Family| 806427.68|Downtown|
| PROP000001|      2024-02-21|Single Family| 597522.09|Downtown|
+-----------+----------------+-------------+----------+--------+



Records found: 3

=== Query 2: Recent High-Value Transactions ===


+----------------+-----------+-------------+----------+----------+
|transaction_date|property_id|property_type|sale_price|  location|
+----------------+-----------+-------------+----------+----------+
|      2024-06-19| PROP003631|   Commercial| 6544467.3|Waterfront|
|      2024-06-04| PROP007792|   Commercial|6278450.24|Waterfront|
|      2024-06-29| PROP007076|   Commercial| 6236953.2|Waterfront|
|      2024-06-30| PROP000596|   Commercial| 6223248.0|Waterfront|
|      2024-06-06| PROP006735|   Commercial|5965073.15|Waterfront|
|      2024-06-14| PROP004288|   Commercial|5899989.06|Waterfront|
|      2024-06-30| PROP000038|   Commercial|5654482.84|Waterfront|
|      2024-06-07| PROP004068|   Commercial|5538127.68|Waterfront|
|      2024-10-08| PROP003766|   Commercial|5463452.56|Waterfront|
|      2024-06-27| PROP001261|   Commercial| 5399924.8|Waterfront|
|      2024-06-01| PROP003919|   Commercial| 5306833.6|Urban Core|
|      2024-10-05| PROP003631|   Commercial|5267238.69|Waterfr

High-value transactions found: 1717

=== Query 3: Property Value Trends ===


+-----------+----------------+-------------+----------+--------------+
|property_id|transaction_date|property_type|sale_price|price_per_sqft|
+-----------+----------------+-------------+----------+--------------+
| PROP000001|      2024-10-07|Single Family| 757403.87|        351.79|
| PROP000002|      2024-06-15|Single Family| 523896.45|        236.95|
| PROP000004|      2024-05-02|Single Family|1086423.42|        312.46|
| PROP000004|      2024-12-24|Single Family| 815113.11|        234.43|
| PROP000005|      2024-04-12|    Apartment| 483395.85|        351.05|
| PROP000007|      2024-07-15|Single Family| 686516.25|        261.53|
| PROP000008|      2024-06-29|        Condo| 950510.94|        505.86|
| PROP000009|      2024-05-10|    Apartment|  232365.5|         311.9|
| PROP000009|      2024-09-04|    Apartment|  189155.5|         253.9|
| PROP000009|      2024-10-23|    Apartment|  215901.0|         289.8|
| PROP000010|      2024-09-11|   Commercial|2769922.86|        335.22|
| PROP

Value trend records found: 1104
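Query 3 filters with `LIKE 'PROP000%'`. A prefix match like this is equivalent to a range condition on the string, which is why it can, in principle, benefit from data that is grouped by `property_id`. The pure-Python sketch below demonstrates the equivalence on the generated ID format. The helper name is illustrative, and it assumes a non-empty prefix.

```python
def prefix_range(prefix):
    """Return (lower, upper) such that lower <= s < upper exactly when s starts with prefix."""
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)   # bump the last character by one
    return prefix, upper

property_ids = [f"PROP{n:06d}" for n in range(1, 8001)]

lower, upper = prefix_range("PROP000")
by_prefix = [p for p in property_ids if p.startswith("PROP000")]
by_range = [p for p in property_ids if lower <= p < upper]

print(lower, upper)                            # PROP000 PROP001
print(len(by_prefix), by_prefix == by_range)   # 999 True
```

So the prefix in Query 3 selects properties `PROP000001` through `PROP000999`.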


## Step 6: Analyze Clustering Effectiveness

### Understanding the Impact

The queries below compute aggregate statistics over the clustered table to show the kinds of real estate insights this structure supports.

### Key Analytics

- **Property value appreciation** and market performance
- **Location-based pricing** and neighborhood analysis
- **Property type trends** and market segmentation
- **Market timing** and seasonal patterns
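The market timing query in this step buckets transactions by `days_on_market` with a SQL `CASE` expression. The sketch below mirrors those boundaries in Python and checks them against the ranges the generator uses (7 to 90 days for non-commercial properties, 30 to 180 days for commercial ones). The function name is illustrative.

```python
def sale_speed(days_on_market):
    """Mirror of the CASE expression used in the market timing query."""
    if days_on_market <= 30:
        return "Fast Sale (1-30 days)"
    if days_on_market <= 60:
        return "Normal Sale (31-60 days)"
    if days_on_market <= 90:
        return "Slow Sale (61-90 days)"
    return "Very Slow Sale (90+ days)"

# Ranges used by the data generator
non_commercial_buckets = {sale_speed(d) for d in range(7, 91)}
commercial_buckets = {sale_speed(d) for d in range(30, 181)}

print("Very Slow Sale (90+ days)" in non_commercial_buckets)  # False
print("Very Slow Sale (90+ days)" in commercial_buckets)      # True
```

Because exactly 90 days still counts as "Slow Sale", only commercial properties can land in the slowest bucket, which explains why that bucket's average sale price is so much higher than the others.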

In [None]:
# Analyze clustering effectiveness and real estate insights


# Property value analysis

print("=== Property Value Analysis ===")

property_values = spark.sql("""

SELECT property_id, COUNT(*) as total_transactions,

       ROUND(MIN(sale_price), 2) as min_sale_price,

       ROUND(MAX(sale_price), 2) as max_sale_price,

       ROUND(AVG(sale_price), 2) as avg_sale_price,

       ROUND(AVG(price_per_sqft), 2) as avg_price_per_sqft,

       property_type, location

FROM real_estate.analytics.property_transactions

GROUP BY property_id, property_type, location

ORDER BY avg_sale_price DESC

LIMIT 10

""")



property_values.show()


# Location market analysis

print("\n=== Location Market Analysis ===")

location_analysis = spark.sql("""

SELECT location, COUNT(*) as total_transactions,

       ROUND(AVG(sale_price), 2) as avg_sale_price,

       ROUND(AVG(price_per_sqft), 2) as avg_price_per_sqft,

       ROUND(AVG(days_on_market), 2) as avg_days_on_market,

       COUNT(DISTINCT property_id) as unique_properties

FROM real_estate.analytics.property_transactions

GROUP BY location

ORDER BY avg_sale_price DESC

""")



location_analysis.show()


# Property type market trends

print("\n=== Property Type Market Trends ===")

property_trends = spark.sql("""

SELECT property_type, COUNT(*) as total_sales,

       ROUND(AVG(sale_price), 2) as avg_sale_price,

       ROUND(AVG(price_per_sqft), 2) as avg_price_per_sqft,

       ROUND(AVG(days_on_market), 2) as avg_days_on_market,

       COUNT(DISTINCT property_id) as unique_properties

FROM real_estate.analytics.property_transactions

GROUP BY property_type

ORDER BY avg_sale_price DESC

""")



property_trends.show()


# Market timing analysis

print("\n=== Market Timing Analysis ===")

market_timing = spark.sql("""

SELECT 

    CASE 

        WHEN days_on_market <= 30 THEN 'Fast Sale (1-30 days)'

        WHEN days_on_market <= 60 THEN 'Normal Sale (31-60 days)'

        WHEN days_on_market <= 90 THEN 'Slow Sale (61-90 days)'

        ELSE 'Very Slow Sale (90+ days)'

    END as sale_speed,

    COUNT(*) as transaction_count,

    ROUND(AVG(sale_price), 2) as avg_sale_price,

    ROUND(AVG(days_on_market), 2) as avg_days,

    ROUND(SUM(sale_price), 2) as total_volume

FROM real_estate.analytics.property_transactions

GROUP BY 

    CASE 

        WHEN days_on_market <= 30 THEN 'Fast Sale (1-30 days)'

        WHEN days_on_market <= 60 THEN 'Normal Sale (31-60 days)'

        WHEN days_on_market <= 90 THEN 'Slow Sale (61-90 days)'

        ELSE 'Very Slow Sale (90+ days)'

    END

ORDER BY avg_days

""")



market_timing.show()


# Monthly market trends

print("\n=== Monthly Market Trends ===")

monthly_trends = spark.sql("""

SELECT DATE_FORMAT(transaction_date, 'yyyy-MM') as month,

       COUNT(*) as total_transactions,

       ROUND(SUM(sale_price), 2) as monthly_volume,

       ROUND(AVG(sale_price), 2) as avg_sale_price,

       ROUND(AVG(price_per_sqft), 2) as avg_price_per_sqft,

       COUNT(DISTINCT property_id) as unique_properties

FROM real_estate.analytics.property_transactions

GROUP BY DATE_FORMAT(transaction_date, 'yyyy-MM')

ORDER BY month

""")



monthly_trends.show()

=== Property Value Analysis ===


+-----------+------------------+--------------+--------------+--------------+------------------+-------------+----------+
|property_id|total_transactions|min_sale_price|max_sale_price|avg_sale_price|avg_price_per_sqft|property_type|  location|
+-----------+------------------+--------------+--------------+--------------+------------------+-------------+----------+
| PROP006849|                 1|    6433466.97|    6433466.97|    6433466.97|            550.01|   Commercial|Waterfront|
| PROP002526|                 1|     6338081.7|     6338081.7|     6338081.7|             541.3|   Commercial|Waterfront|
| PROP007076|                 1|     6236953.2|     6236953.2|     6236953.2|             548.4|   Commercial|Waterfront|
| PROP004086|                 1|    6048936.12|    6048936.12|    6048936.12|            535.02|   Commercial|Waterfront|
| PROP003631|                 2|    5267238.69|     6544467.3|     5905853.0|            494.26|   Commercial|Waterfront|
| PROP004288|           

+--------------------+------------------+--------------+------------------+------------------+-----------------+
|            location|total_transactions|avg_sale_price|avg_price_per_sqft|avg_days_on_market|unique_properties|
+--------------------+------------------+--------------+------------------+------------------+-----------------+
|          Waterfront|              1838|    1429878.15|             448.7|             60.88|             1288|
|          Urban Core|              1975|    1231699.93|            480.15|             60.26|             1373|
|            Downtown|              1857|    1041112.68|            389.81|             59.72|             1286|
|       Mountain View|              1876|     826320.39|            309.67|             61.02|             1338|
|Residential District|              1949|     729886.77|            264.81|             59.57|             1361|
|            Suburban|              1958|     654614.89|            253.83|             57.33|  

+-------------+-----------+--------------+------------------+------------------+-----------------+
|property_type|total_sales|avg_sale_price|avg_price_per_sqft|avg_days_on_market|unique_properties|
+-------------+-----------+--------------+------------------+------------------+-----------------+
|   Commercial|       2244|    2403707.17|            363.58|            105.78|             1578|
|Single Family|       2290|     865272.73|            301.28|             48.25|             1625|
|    Townhouse|       2303|     710431.48|            323.01|              48.1|             1617|
|        Condo|       2399|     539901.22|            386.65|             48.51|             1663|
|    Apartment|       2217|      424804.6|            412.12|             49.45|             1517|
+-------------+-----------+--------------+------------------+------------------+-----------------+


=== Market Timing Analysis ===


+--------------------+-----------------+--------------+--------+---------------+
|          sale_speed|transaction_count|avg_sale_price|avg_days|   total_volume|
+--------------------+-----------------+--------------+--------+---------------+
|Fast Sale (1-30 d...|             2580|     647319.13|   18.46|1.67008335987E9|
|Normal Sale (31-6...|             3777|     833196.52|    45.4|3.14698327446E9|
|Slow Sale (61-90 ...|             3727|     847142.28|    75.3|3.15729926552E9|
|Very Slow Sale (9...|             1369|    2391647.98|  135.05|3.27416607939E9|
+--------------------+-----------------+--------------+--------+---------------+


=== Monthly Market Trends ===


+-------+------------------+---------------+--------------+------------------+-----------------+
|  month|total_transactions| monthly_volume|avg_sale_price|avg_price_per_sqft|unique_properties|
+-------+------------------+---------------+--------------+------------------+-----------------+
|2024-01|               949| 8.0059691024E8|     843621.61|            314.25|              920|
|2024-02|               874|  7.458980738E8|     853430.29|            315.36|              841|
|2024-03|               927| 9.6453802836E8|     1040494.1|            402.79|              901|
|2024-04|               964|1.07271778159E9|    1112777.78|            402.09|              927|
|2024-05|               997|1.09480884955E9|    1098103.16|            397.14|              976|
|2024-06|               960|1.12414774258E9|    1170987.23|             403.9|              924|
|2024-07|               933| 8.9032841699E8|     954264.11|            346.99|              900|
|2024-08|               966| 9

## Step 7: Train Real Estate Price Prediction Model

### Machine Learning for Real Estate Business Improvement

Now we'll train a machine learning model to predict property sale prices. This model can help real estate companies:

- **Predict market values** for pricing strategy optimization
- **Identify undervalued properties** for investment opportunities
- **Optimize listing prices** to maximize seller returns
- **Provide market insights** for buyers and sellers

### Model Approach

We'll use a **Random Forest Regressor** to predict property sale prices based on:

- Property characteristics (type, location, price per square foot)
- Market conditions (spring/summer and winter season flags)
- Transaction timing (month, quarter, day of week)
- Market liquidity (days on market and a derived sale-speed bucket)
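The table stores no square-footage column, but the generator computes `sale_price` as `price_per_sqft * sqft`, so a property's size is implied by the two price columns. The sketch below recovers it for the three `PROP000001` transactions shown earlier. The helper name is illustrative.

```python
def implied_sqft(sale_price, price_per_sqft):
    """Recover square footage from the two price columns (rounded to a whole number)."""
    return round(sale_price / price_per_sqft)

# (sale_price, price_per_sqft) for the three PROP000001 transactions
prop_000001 = [(806427.68, 374.56), (597522.09, 277.53), (757403.87, 351.79)]

sizes = [implied_sqft(price, ppsf) for price, ppsf in prop_000001]
print(sizes)   # [2153, 2153, 2153]
```

The same relationship means `price_per_sqft` is computed directly from the quantity being predicted, which is worth keeping in mind when interpreting the model's metrics.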

### Business Impact

- **Pricing Optimization**: Better pricing strategies for faster sales
- **Investment Decisions**: Data-driven property valuation
- **Market Intelligence**: Competitive advantage through predictive analytics
- **Revenue Growth**: Improved transaction success rates

In [None]:
# Prepare data for machine learning - create property price prediction features

from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

# Create property-level features for price prediction
property_features = spark.sql("""
SELECT 
    property_id,
    transaction_date,
    property_type,
    sale_price,
    location,
    days_on_market,
    price_per_sqft,
    -- Market timing features
    MONTH(transaction_date) as transaction_month,
    QUARTER(transaction_date) as transaction_quarter,
    DAYOFWEEK(transaction_date) as transaction_day_of_week,
    -- Market conditions
    CASE WHEN MONTH(transaction_date) IN (3,4,5,6) THEN 1 ELSE 0 END as spring_summer_season,
    CASE WHEN MONTH(transaction_date) IN (11,12,1,2) THEN 1 ELSE 0 END as winter_season,
    -- Market speed indicators
    CASE WHEN days_on_market <= 30 THEN 'fast' 
         WHEN days_on_market <= 60 THEN 'normal' 
         WHEN days_on_market <= 90 THEN 'slow' 
         ELSE 'very_slow' END as market_speed
FROM real_estate.analytics.property_transactions
""")

print(f"Created property features for {property_features.count()} transactions")
property_features.show(5)

Created property features for 11453 transactions


+-----------+----------------+-------------+----------+--------------------+--------------+--------------+-----------------+-------------------+-----------------------+--------------------+-------------+------------+
|property_id|transaction_date|property_type|sale_price|            location|days_on_market|price_per_sqft|transaction_month|transaction_quarter|transaction_day_of_week|spring_summer_season|winter_season|market_speed|
+-----------+----------------+-------------+----------+--------------------+--------------+--------------+-----------------+-------------------+-----------------------+--------------------+-------------+------------+
| PROP000001|      2024-03-24|Single Family| 806427.68|            Downtown|            68|        374.56|                3|                  1|                      1|                   1|            0|        slow|
| PROP000001|      2024-02-21|Single Family| 597522.09|            Downtown|            53|        277.53|                2|        

In [None]:
# Feature engineering for price prediction

# Create indexers for categorical features
property_type_indexer = StringIndexer(inputCol="property_type", outputCol="property_type_index")
location_indexer = StringIndexer(inputCol="location", outputCol="location_index")
market_speed_indexer = StringIndexer(inputCol="market_speed", outputCol="market_speed_index")

# Assemble features for the model
feature_cols = ["days_on_market", "price_per_sqft", "transaction_month", "transaction_quarter", 
                "transaction_day_of_week", "spring_summer_season", "winter_season", 
                "property_type_index", "location_index", "market_speed_index"]

assembler = VectorAssembler(
    inputCols=feature_cols,
    outputCol="features"
)

# Scale features (optional here: tree-based models such as random forests are insensitive to feature scaling)
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")

# Define the model (training happens later, when the pipeline is fitted)
rf = RandomForestRegressor(
    labelCol="sale_price", 
    featuresCol="scaled_features",
    numTrees=100,
    maxDepth=10
)

# Create pipeline
pipeline = Pipeline(stages=[property_type_indexer, location_indexer, market_speed_indexer, assembler, scaler, rf])

# Split data
train_data, test_data = property_features.randomSplit([0.8, 0.2], seed=42)

print(f"Training set: {train_data.count()} transactions")
print(f"Test set: {test_data.count()} transactions")

Training set: 9254 transactions


Test set: 2199 transactions
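The evaluation cell that follows reports RMSE and R² and then adds a percentage error per transaction. As a reference for how these metrics are defined, here is a minimal pure-Python sketch. The three toy values are made up for illustration, while the final check uses the first row of the notebook's error table.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def r2(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def pct_error(actual, predicted):
    """Absolute error as a percentage of the actual value."""
    return abs(actual - predicted) / actual * 100

# Toy values (not model output)
toy_actual = [100.0, 200.0, 300.0]
toy_predicted = [110.0, 190.0, 310.0]
print(rmse(toy_actual, toy_predicted))   # 10.0
print(r2(toy_actual, toy_predicted))     # approximately 0.985

# First row of the notebook's error table: PROP000001
first_row_pct = pct_error(757403.87, 916323.7333174192)
print(round(first_row_pct, 2))           # 20.98
```

RMSE is expressed in the same units as the sale price, so it is best read relative to typical prices in the test set. R² is unitless.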


In [None]:
# Train the property price prediction model

print("Training property price prediction model...")
model = pipeline.fit(train_data)

# Make predictions
predictions = model.transform(test_data)

# Evaluate the model
evaluator = RegressionEvaluator(labelCol="sale_price", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)

evaluator_r2 = RegressionEvaluator(labelCol="sale_price", predictionCol="prediction", metricName="r2")
r2 = evaluator_r2.evaluate(predictions)

print(f"Model RMSE: ${rmse:,.2f}")
print(f"Model R²: {r2:.4f}")

# Show prediction results
predictions.select("property_id", "property_type", "location", "sale_price", "prediction").show(10)

# Calculate the absolute and percentage prediction error for each transaction
predictions_with_accuracy = predictions.withColumn(
    "prediction_error", 
    F.abs(F.col("sale_price") - F.col("prediction"))
).withColumn(
    "prediction_error_pct", 
    F.abs(F.col("sale_price") - F.col("prediction")) / F.col("sale_price") * 100
)

predictions_with_accuracy.select("property_id", "sale_price", "prediction", "prediction_error", "prediction_error_pct").show(10)

Training property price prediction model...


Model RMSE: $381,124.39
Model R²: 0.8281


+-----------+-------------+--------------------+----------+------------------+
|property_id|property_type|            location|sale_price|        prediction|
+-----------+-------------+--------------------+----------+------------------+
| PROP000001|Single Family|            Downtown| 757403.87| 916323.7333174192|
| PROP000004|Single Family|       Mountain View|1086423.42|  814643.688558861|
| PROP000005|    Apartment|Residential District| 462369.06|344412.41506012395|
| PROP000009|    Apartment|            Suburban|  232365.5|339209.18379322864|
| PROP000013|    Apartment|          Urban Core| 748281.03| 648853.1630104046|
| PROP000015|        Condo|          Urban Core| 782683.95| 621021.3716854007|
| PROP000019|   Commercial|       Mountain View| 1485473.6| 2235534.442801727|
| PROP000023|    Townhouse|       Mountain View| 486548.68|  557960.266391651|
| PROP000030|Single Family|            Suburban| 743770.75| 597364.0121666738|
| PROP000031|Single Family|            Downtown|  97

+-----------+----------+------------------+------------------+--------------------+
|property_id|sale_price|        prediction|  prediction_error|prediction_error_pct|
+-----------+----------+------------------+------------------+--------------------+
| PROP000001| 757403.87| 916323.7333174192| 158919.8633174192|  20.982182638889764|
| PROP000004|1086423.42|  814643.688558861| 271779.7314411389|   25.01600448204061|
| PROP000005| 462369.06|344412.41506012395|117956.64493987605|  25.511362057806387|
| PROP000009|  232365.5|339209.18379322864|106843.68379322864|  45.980872286646964|
| PROP000013| 748281.03| 648853.1630104046| 99427.86698959547|  13.287503358142791|
| PROP000015| 782683.95| 621021.3716854007|161662.57831459923|  20.654898866215316|
| PROP000019| 1485473.6| 2235534.442801727| 750060.8428017269|  50.493044292522384|
| PROP000023| 486548.68|  557960.266391651| 71411.58639165101|   14.67717195156115|
| PROP000030| 743770.75| 597364.0121666738|146406.73783332622|  19.684390362
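
The `prediction_error_pct` column is the absolute error divided by the actual sale price. As a quick sanity check, the plain-Python sketch below recomputes it for three rows copied from the output above. The helper name `error_metrics` is hypothetical and not part of the notebook's pipeline.

```python
# Rows copied from the sample output above: (property_id, sale_price, prediction)
sample_rows = [
    ("PROP000001", 757403.87, 916323.7333174192),
    ("PROP000004", 1086423.42, 814643.688558861),
    ("PROP000009", 232365.5, 339209.18379322864),
]


def error_metrics(actual, predicted):
    """Return (absolute error, absolute error as a percentage of the actual price)."""
    abs_error = abs(actual - predicted)
    pct_error = abs_error / actual * 100
    return abs_error, pct_error


for property_id, actual, predicted in sample_rows:
    abs_error, pct_error = error_metrics(actual, predicted)
    print(f"{property_id}: off by ${abs_error:,.0f} ({pct_error:.2f}% of sale price)")
```

Because the error is measured relative to the sale price, PROP000009 has the smallest dollar error of these three rows but the largest percentage error.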

In [None]:
# Model interpretation and business insights

# Feature importances from the random forest (last pipeline stage);
# values are returned in feature_cols order
rf_model = model.stages[-1]
feature_importance = rf_model.featureImportances
feature_names = feature_cols

print("=== Feature Importance for Price Prediction ===")
for name, importance in zip(feature_names, feature_importance):
    print(f"{name}: {importance:.4f}")

# Business impact analysis
print("\n=== Business Impact Analysis ===")

# Calculate prediction accuracy metrics
avg_prediction_error = predictions_with_accuracy.agg(F.avg("prediction_error")).collect()[0][0]
avg_prediction_error_pct = predictions_with_accuracy.agg(F.avg("prediction_error_pct")).collect()[0][0]
median_error_pct = predictions_with_accuracy.approxQuantile("prediction_error_pct", [0.5], 0.01)[0]

print(f"Average prediction error: ${avg_prediction_error:,.0f}")
print(f"Average prediction error percentage: {avg_prediction_error_pct:.2f}%")
print(f"Median prediction error percentage: {median_error_pct:.2f}%")

# Calculate potential value for pricing optimization
total_test_properties = test_data.count()
avg_property_value = test_data.agg(F.avg("sale_price")).collect()[0][0]

# Illustrative estimate only: 1% of the total sale value in the test set
# (transaction count x average sale price)
price_optimization_value = total_test_properties * avg_property_value * 0.01

print(f"\nEstimated value of 1% price optimization: ${price_optimization_value:,.0f}")

# Market timing insights
seasonal_performance = predictions_with_accuracy.groupBy("spring_summer_season").agg(
    F.avg("prediction_error_pct").alias("avg_error_pct"),
    F.count("*").alias("transaction_count")
).orderBy("spring_summer_season")

print("\n=== Seasonal Prediction Performance ===")
seasonal_performance.show()

# Property type performance
property_type_performance = predictions_with_accuracy.groupBy("property_type").agg(
    F.avg("prediction_error_pct").alias("avg_error_pct"),
    F.count("*").alias("transaction_count")
).orderBy("avg_error_pct")

print("\n=== Property Type Prediction Performance ===")
property_type_performance.show()

# Location performance
location_performance = predictions_with_accuracy.groupBy("location").agg(
    F.avg("prediction_error_pct").alias("avg_error_pct"),
    F.count("*").alias("transaction_count")
).orderBy("avg_error_pct")

print("\n=== Location Prediction Performance ===")
location_performance.show()

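# Accuracy buckets based on the realized error on the test set
# (not a confidence score reported by the model).
# Note: orderBy sorts the bucket labels alphabetically (High, Low, Medium).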
confidence_analysis = predictions_with_accuracy.withColumn(
    "prediction_confidence", 
    F.when(F.col("prediction_error_pct") <= 5, "High")
     .when(F.col("prediction_error_pct") <= 10, "Medium")
     .otherwise("Low")
).groupBy("prediction_confidence").count().orderBy("prediction_confidence")

print("\n=== Model Confidence Analysis ===")
confidence_analysis.show()

print("\nModel Summary:")
print(f"RMSE: ${rmse:,.0f}")
print(f"R² Score: {r2:.4f}")
print(f"Median Error: {median_error_pct:.2f}%")

=== Feature Importance for Price Prediction ===
days_on_market: 0.1554
price_per_sqft: 0.1510
transaction_month: 0.0162
transaction_quarter: 0.0057
transaction_day_of_week: 0.0166
spring_summer_season: 0.0087
winter_season: 0.0048
property_type_index: 0.3887
location_index: 0.0628
market_speed_index: 0.1901

=== Business Impact Analysis ===


Average prediction error: $224,089
Average prediction error percentage: 23.06%
Median prediction error percentage: 18.48%



Estimated value of 1% price optimization: $21,817,600

=== Seasonal Prediction Performance ===


+--------------------+------------------+-----------------+
|spring_summer_season|     avg_error_pct|transaction_count|
+--------------------+------------------+-----------------+
|                   0|22.980314545347937|             1471|
|                   1|23.211318379074374|              728|
+--------------------+------------------+-----------------+


=== Property Type Prediction Performance ===


+-------------+------------------+-----------------+
|property_type|     avg_error_pct|transaction_count|
+-------------+------------------+-----------------+
|Single Family|17.728367455144397|              451|
|    Townhouse| 18.17706464360844|              454|
|        Condo|23.799995270422027|              436|
|    Apartment| 25.11799341558691|              425|
|   Commercial|30.951631099714014|              433|
+-------------+------------------+-----------------+


=== Location Prediction Performance ===


+--------------------+------------------+-----------------+
|            location|     avg_error_pct|transaction_count|
+--------------------+------------------+-----------------+
|Residential District|22.094311927352077|              376|
|       Mountain View| 22.14760592467116|              372|
|          Waterfront|22.579055181882183|              351|
|            Downtown|23.344850130414347|              375|
|            Suburban| 23.66362669208197|              371|
|          Urban Core|24.567059652549364|              354|
+--------------------+------------------+-----------------+


=== Model Confidence Analysis ===


+---------------------+-----+
|prediction_confidence|count|
+---------------------+-----+
|                 High|  291|
|                  Low| 1633|
|               Medium|  275|
+---------------------+-----+


Model Summary:
RMSE: $381,124
R² Score: 0.8281
Median Error: 18.48%
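
The importances above are printed in `feature_cols` order, and the confidence buckets come back sorted alphabetically. As a small sketch that uses only the numbers printed above, the plain-Python snippet below ranks the features and reports each bucket's share of the 2,199 test transactions. The variable names are illustrative.

```python
# Values copied from the printed output above.
feature_importances = {
    "days_on_market": 0.1554,
    "price_per_sqft": 0.1510,
    "transaction_month": 0.0162,
    "transaction_quarter": 0.0057,
    "transaction_day_of_week": 0.0166,
    "spring_summer_season": 0.0087,
    "winter_season": 0.0048,
    "property_type_index": 0.3887,
    "location_index": 0.0628,
    "market_speed_index": 0.1901,
}
# Buckets listed in logical order rather than alphabetical order.
confidence_counts = {"High": 291, "Medium": 275, "Low": 1633}

# Rank features from most to least important.
ranked = sorted(feature_importances.items(), key=lambda item: item[1], reverse=True)
for name, importance in ranked:
    print(f"{name:<24} {importance:.4f}")

# Convert bucket counts into percentages of the test set.
total = sum(confidence_counts.values())
shares = {bucket: count / total * 100 for bucket, count in confidence_counts.items()}
for bucket, share in shares.items():
    print(f"{bucket:<6} {confidence_counts[bucket]:>5}  {share:5.1f}%")
```

In this run the top three features account for about 73% of total importance, and roughly three out of four test predictions miss the sale price by more than 10%.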


## Key Takeaways: Delta Liquid Clustering + ML in AIDP

### What We Demonstrated

1. **Automatic Optimization**: Created a table with `CLUSTER BY (property_id, transaction_date)` and let Delta automatically optimize data layout

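2. **Performance Benefits**: Queries that filter on the clustering columns (`property_id`, `transaction_date`) can read fewer files, because rows with similar key values are stored together and Delta can skip files whose min/max statistics rule them out

3. **Low Maintenance**: No manual partitioning, bucketing, or Z-Ordering is required. Delta manages the file layout, typically as part of `OPTIMIZE` runs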

4. **Machine Learning Integration**: Trained a property price prediction model using the optimized data

5. **Real-World Use Case**: Real estate analytics where property valuation and market analysis are critical
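
In Delta Lake, existing data is reclustered when `OPTIMIZE` runs. Whether AIDP schedules this for you is not shown in this notebook, so here is a minimal sketch of triggering it by hand. It assumes the table is `real_estate.analytics.property_transactions`, that the notebook's `spark` session is available, and that your runtime supports Delta's `OPTIMIZE` and `DESCRIBE DETAIL` commands. The helper name `maintain_clustered_table` is hypothetical.

```python
# Assumed fully qualified table name, based on the catalog and schema created earlier.
TABLE = "real_estate.analytics.property_transactions"


def maintain_clustered_table(spark, table=TABLE):
    """Hypothetical helper: recluster a liquid-clustered Delta table, then return its metadata."""
    # OPTIMIZE rewrites data files so rows with similar clustering-key values sit together.
    spark.sql(f"OPTIMIZE {table}")
    # DESCRIBE DETAIL returns one row of table metadata such as file count and size.
    return spark.sql(f"DESCRIBE DETAIL {table}")
```

In a notebook cell you would call `maintain_clustered_table(spark).show(truncate=False)` after a large ingest, then compare the file count before and after.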

### AIDP Advantages

- **Unified Analytics**: Seamlessly integrates data optimization with ML
- **Governance**: Catalog and schema isolation for real estate data
- **Performance**: Optimized for both analytical queries and ML training
- **Scalability**: The same Spark and Delta workflow applies to datasets far larger than the roughly 11,000 transactions modeled here

### Business Benefits for Real Estate

1. **Pricing Optimization**: AI-driven pricing strategies for faster sales
2. **Market Intelligence**: Predictive analytics for investment decisions
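3. **Competitive Advantage**: Valuation accuracy that improves as features and data are refined (this demo model's median error is 18.48%, so treat it as a starting point)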
4. **Revenue Growth**: Better pricing leading to higher transaction values
5. **Risk Reduction**: Data-driven market timing and valuation

### Best Practices for Real Estate Analytics

1. **Choose clustering columns** based on your most common query patterns
2. **Use one to four clustering columns** - Delta allows at most four, and each extra column dilutes the benefit for the others
3. **Consider cardinality** - high-cardinality columns such as `property_id` work best
4. **Monitor and adjust** as query patterns evolve
5. **Combine with ML** for predictive analytics and automation
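
The first three practices can be sketched as a simple selection rule. This is an illustration only, not a Delta or AIDP API: the `ColumnProfile` class, the `pick_clustering_columns` helper, the thresholds, and the workload numbers are all hypothetical. In practice you would take filter frequencies from your query history and cardinalities from the table itself.

```python
from dataclasses import dataclass

MAX_CLUSTERING_COLUMNS = 4  # Delta liquid clustering accepts at most four columns


@dataclass(frozen=True)
class ColumnProfile:
    name: str
    filter_share: float   # fraction of queries that filter on this column (0 to 1)
    distinct_values: int  # approximate cardinality


def pick_clustering_columns(profiles, min_filter_share=0.10, min_distinct=100,
                            limit=MAX_CLUSTERING_COLUMNS):
    """Keep frequently filtered, high-cardinality columns, most-filtered first."""
    if not 1 <= limit <= MAX_CLUSTERING_COLUMNS:
        raise ValueError(f"limit must be between 1 and {MAX_CLUSTERING_COLUMNS}")
    eligible = [
        p for p in profiles
        if p.filter_share >= min_filter_share and p.distinct_values >= min_distinct
    ]
    ranked = sorted(eligible, key=lambda p: p.filter_share, reverse=True)
    return [p.name for p in ranked[:limit]]


# Hypothetical workload profile for the property_transactions table.
profiles = [
    ColumnProfile("property_id", filter_share=0.45, distinct_values=10_000),
    ColumnProfile("transaction_date", filter_share=0.35, distinct_values=700),
    ColumnProfile("location", filter_share=0.15, distinct_values=6),
    ColumnProfile("property_type", filter_share=0.05, distinct_values=5),
]

chosen = pick_clustering_columns(profiles)
print(f"CLUSTER BY ({', '.join(chosen)})")
```

With these illustrative numbers the rule selects `property_id` and `transaction_date`, the same pair used in this notebook. `location` is excluded for low cardinality and `property_type` for being rarely filtered.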

### Next Steps

- Explore other AIDP ML features like AutoML
- Try liquid clustering with different column combinations
- Scale up to larger real estate datasets
- Integrate with real MLS and appraisal systems
- Deploy models for real-time property valuation

This notebook demonstrates how Oracle AI Data Platform makes advanced real estate analytics accessible while maintaining enterprise-grade performance and governance.