# Real Estate: Medallion Architecture Demo

## Overview

This notebook demonstrates a **Medallion Architecture** implementation in Oracle AI Data Platform (AIDP) Workbench using a real estate analytics use case. The medallion architecture organizes data into three layers:

- **Bronze Layer**: Raw data ingestion and storage
- **Silver Layer**: Cleaned, validated, and structured data
- **Gold Layer**: Aggregated, analytics-ready data with ML models

### What is Medallion Architecture?

The medallion architecture provides a structured approach to data processing:

- **Bronze**: Raw, unprocessed data as ingested
- **Silver**: Cleansed, validated, and enriched data
- **Gold**: Business-ready data for analytics and ML
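
As a rough illustration of how the three layers relate, the flow can be sketched in plain Python before any Spark code is involved. The function names and the four sample records below are hypothetical and exist only for this sketch; the notebook itself implements each layer as a Delta table.

```python
# Toy medallion flow on in-memory records (illustrative only, not the notebook's Spark code).

def to_bronze(raw_records):
    """Bronze: keep records exactly as received."""
    return list(raw_records)

def to_silver(bronze_records):
    """Silver: drop records that fail basic validation."""
    return [
        r for r in bronze_records
        if r.get("property_id") is not None
        and r.get("sale_price") is not None
        and r["sale_price"] > 0
    ]

def to_gold(silver_records):
    """Gold: aggregate to one average price per property type."""
    totals = {}
    for r in silver_records:
        count, total = totals.get(r["property_type"], (0, 0.0))
        totals[r["property_type"]] = (count + 1, total + r["sale_price"])
    return {ptype: total / count for ptype, (count, total) in totals.items()}

raw = [
    {"property_id": "PROP000001", "property_type": "Condo", "sale_price": 500000.0},
    {"property_id": "PROP000002", "property_type": "Condo", "sale_price": 700000.0},
    {"property_id": None, "property_type": "Condo", "sale_price": 650000.0},          # fails validation
    {"property_id": "PROP000003", "property_type": "Townhouse", "sale_price": -1.0},  # fails validation
]

bronze = to_bronze(raw)
silver = to_silver(bronze)
gold = to_gold(silver)
print(len(bronze), len(silver), gold)
```

Bronze keeps all four records, Silver keeps the two that pass validation, and Gold reduces them to a single aggregate per property type.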

### Use Case: Property Transaction Analytics and Price Prediction

We'll analyze real estate transactions and property market data across all three layers, culminating in ML-powered price prediction for property valuation.

### AIDP Environment Setup

This notebook uses the Spark session that your AIDP environment already provides as `spark`, so no session setup is needed.

## Setup: Create Real Estate Catalog and Medallion Schemas

### Catalog and Schema Design

We'll create:

- `real_estate.bronze`: Raw transaction data
- `real_estate.silver`: Cleaned and validated transactions
- `real_estate.gold`: Analytics and ML-ready data

This structure provides data isolation and governance across layers.

In [None]:
# Create real estate catalog and medallion schemas

spark.sql("CREATE CATALOG IF NOT EXISTS real_estate")

spark.sql("CREATE SCHEMA IF NOT EXISTS real_estate.bronze")
spark.sql("CREATE SCHEMA IF NOT EXISTS real_estate.silver")
spark.sql("CREATE SCHEMA IF NOT EXISTS real_estate.gold")

print("Real estate catalog and medallion schemas created successfully!")
print("- real_estate.bronze: Raw transaction data")
print("- real_estate.silver: Cleaned and validated data")
print("- real_estate.gold: Analytics and ML-ready data")

Real estate catalog and medallion schemas created successfully!
- real_estate.bronze: Raw transaction data
- real_estate.silver: Cleaned and validated data
- real_estate.gold: Analytics and ML-ready data


## Bronze Layer: Raw Data Ingestion

### Bronze Layer Design

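The bronze layer stores raw property transaction data as ingested, with minimal processing. We'll use Delta tables with liquid clustering, which keeps rows with similar clustering-key values stored together so that queries filtering on those keys read less data.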

### Table: `property_transactions_bronze`

- Raw transaction records with all original fields
- Liquid clustering on `property_id` and `transaction_date`
- Preserves data integrity and auditability

In [None]:
# Create Bronze Layer Delta table with liquid clustering

spark.sql("""
CREATE TABLE IF NOT EXISTS real_estate.bronze.property_transactions_bronze (
    property_id STRING,
    transaction_date DATE,
    property_type STRING,
    sale_price DECIMAL(12,2),
    location STRING,
    days_on_market INT,
    price_per_sqft DECIMAL(8,2),
    ingestion_timestamp TIMESTAMP
)
USING DELTA
CLUSTER BY (property_id, transaction_date)
""")

print("Bronze layer table created successfully!")
print("Liquid clustering will automatically optimize data layout for property_id and transaction_date queries.")

Bronze layer table created successfully!
Liquid clustering will automatically optimize data layout for property_id and transaction_date queries.


In [None]:
# Generate sample real estate transaction data for Bronze layer

import random
from datetime import datetime, timedelta

# Define real estate data constants

PROPERTY_TYPES = ['Single Family', 'Condo', 'Townhouse', 'Apartment', 'Commercial']
LOCATIONS = ['Downtown', 'Suburban', 'Waterfront', 'Mountain View', 'Urban Core', 'Residential District']

# Base pricing parameters by property type and location
PRICE_PARAMS = {
    'Single Family': {
        'Downtown': {'base_price': 850000, 'sqft_range': (1800, 3500)},
        'Suburban': {'base_price': 650000, 'sqft_range': (2000, 4000)},
        'Waterfront': {'base_price': 1200000, 'sqft_range': (2200, 4500)},
        'Mountain View': {'base_price': 750000, 'sqft_range': (1900, 3800)},
        'Urban Core': {'base_price': 950000, 'sqft_range': (1600, 3200)},
        'Residential District': {'base_price': 700000, 'sqft_range': (2100, 4200)}
    },
    'Condo': {
        'Downtown': {'base_price': 550000, 'sqft_range': (800, 1800)},
        'Suburban': {'base_price': 350000, 'sqft_range': (900, 2000)},
        'Waterfront': {'base_price': 750000, 'sqft_range': (1000, 2200)},
        'Mountain View': {'base_price': 450000, 'sqft_range': (850, 1900)},
        'Urban Core': {'base_price': 650000, 'sqft_range': (750, 1700)},
        'Residential District': {'base_price': 400000, 'sqft_range': (950, 2100)}
    },
    'Townhouse': {
        'Downtown': {'base_price': 700000, 'sqft_range': (1400, 2800)},
        'Suburban': {'base_price': 550000, 'sqft_range': (1600, 3200)},
        'Waterfront': {'base_price': 900000, 'sqft_range': (1500, 3000)},
        'Mountain View': {'base_price': 600000, 'sqft_range': (1450, 2900)},
        'Urban Core': {'base_price': 800000, 'sqft_range': (1300, 2600)},
        'Residential District': {'base_price': 580000, 'sqft_range': (1650, 3300)}
    },
    'Apartment': {
        'Downtown': {'base_price': 450000, 'sqft_range': (600, 1400)},
        'Suburban': {'base_price': 280000, 'sqft_range': (650, 1500)},
        'Waterfront': {'base_price': 600000, 'sqft_range': (700, 1600)},
        'Mountain View': {'base_price': 350000, 'sqft_range': (625, 1450)},
        'Urban Core': {'base_price': 520000, 'sqft_range': (550, 1300)},
        'Residential District': {'base_price': 320000, 'sqft_range': (675, 1550)}
    },
    'Commercial': {
        'Downtown': {'base_price': 2500000, 'sqft_range': (3000, 10000)},
        'Suburban': {'base_price': 1500000, 'sqft_range': (2500, 8000)},
        'Waterfront': {'base_price': 3500000, 'sqft_range': (4000, 12000)},
        'Mountain View': {'base_price': 1800000, 'sqft_range': (2800, 9000)},
        'Urban Core': {'base_price': 3000000, 'sqft_range': (3500, 11000)},
        'Residential District': {'base_price': 1600000, 'sqft_range': (2600, 8500)}
    }
}

# Generate property transaction records
transaction_data = []
base_date = datetime(2024, 1, 1)

# Create 8,000 properties, each with one or more transactions over time
for property_num in range(1, 8001):
    property_id = f"PROP{property_num:06d}"
    
    # Each property gets 1-4 transactions over 12 months (most have 1, some flip/resale)
    num_transactions = random.choices([1, 2, 3, 4], weights=[0.7, 0.2, 0.08, 0.02])[0]
    
    # Select property type and location (consistent for the same property)
    property_type = random.choice(PROPERTY_TYPES)
    location = random.choice(LOCATIONS)
    
    params = PRICE_PARAMS[property_type][location]
    
    # Base square footage for this property
    sqft = random.randint(params['sqft_range'][0], params['sqft_range'][1])
    
    for i in range(num_transactions):
        # Spread transactions over 12 months
        days_offset = random.randint(0, 365)
        transaction_date = base_date + timedelta(days=days_offset)
        
        # Calculate sale price with market variations
        # Seasonal pricing (higher in spring/summer)
        month = transaction_date.month
        if month in [3, 4, 5, 6]:  # Spring/Summer peak
            seasonal_factor = 1.15
        elif month in [11, 12, 1, 2]:  # Winter off-season
            seasonal_factor = 0.9
        else:
            seasonal_factor = 1.0
        
        # Market appreciation over time (slight increase)
        months_elapsed = (transaction_date.year - base_date.year) * 12 + (transaction_date.month - base_date.month)
        appreciation_factor = 1.0 + (months_elapsed * 0.002)  # 0.2% monthly appreciation
        
        # Calculate price per square foot
        base_price_per_sqft = params['base_price'] / ((params['sqft_range'][0] + params['sqft_range'][1]) / 2)
        price_per_sqft = round(base_price_per_sqft * seasonal_factor * appreciation_factor * random.uniform(0.9, 1.1), 2)
        
        # Calculate total sale price
        sale_price = round(price_per_sqft * sqft, 2)
        
        # Days on market (varies by property type and market conditions)
        if property_type == 'Commercial':
            days_on_market = random.randint(30, 180)
        else:
            days_on_market = random.randint(7, 90)
        
        transaction_data.append({
            "property_id": property_id,
            "transaction_date": transaction_date.date(),
            "property_type": property_type,
            "sale_price": sale_price,
            "location": location,
            "days_on_market": days_on_market,
            "price_per_sqft": price_per_sqft
        })

print(f"Generated {len(transaction_data)} raw property transaction records for Bronze layer")
print("Sample record:", transaction_data[0])

Generated 11339 raw property transaction records for Bronze layer
Sample record: {'property_id': 'PROP000001', 'transaction_date': datetime.date(2024, 5, 17), 'property_type': 'Single Family', 'sale_price': 1313507.88, 'location': 'Waterfront', 'days_on_market': 61, 'price_per_sqft': 409.32}
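
To see how the generator arrives at a price, the deterministic part of the calculation can be reproduced for the sample record above. The helper name `expected_price_per_sqft` is introduced here for illustration only; it omits the random ±10% noise term, so it returns the midpoint around which generated values scatter.

```python
from datetime import date

def expected_price_per_sqft(base_price, sqft_range, transaction_date, base_date=date(2024, 1, 1)):
    """Deterministic part of the generator's price-per-sqft formula (no random noise)."""
    # Seasonal factor, mirroring the generator's month buckets
    if transaction_date.month in (3, 4, 5, 6):
        seasonal_factor = 1.15
    elif transaction_date.month in (11, 12, 1, 2):
        seasonal_factor = 0.9
    else:
        seasonal_factor = 1.0

    # 0.2% appreciation per calendar month elapsed since base_date
    months_elapsed = (
        (transaction_date.year - base_date.year) * 12
        + (transaction_date.month - base_date.month)
    )
    appreciation_factor = 1.0 + months_elapsed * 0.002

    # Base price spread over the midpoint of the square-footage range
    midpoint_sqft = (sqft_range[0] + sqft_range[1]) / 2
    return base_price / midpoint_sqft * seasonal_factor * appreciation_factor

# Single Family / Waterfront parameters from PRICE_PARAMS, sale date of the sample record
midpoint = expected_price_per_sqft(1200000, (2200, 4500), date(2024, 5, 17))
low, high = midpoint * 0.9, midpoint * 1.1
print(f"midpoint={midpoint:.2f}, noise band=[{low:.2f}, {high:.2f}]")
```

The sample record's `price_per_sqft` of 409.32 falls inside this band, and dividing its `sale_price` by that value (1313507.88 / 409.32) gives 3209 square feet, which lies within the (2200, 4500) range for that property type and location.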


In [None]:
# Insert raw data into Bronze layer

# Create DataFrame from generated data
df_bronze = spark.createDataFrame(transaction_data)

# Display schema and sample data
print("Bronze Layer DataFrame Schema:")
df_bronze.printSchema()

print("\nSample Bronze Data:")
df_bronze.show(5)

# Insert data into Bronze table
df_bronze.write.mode("overwrite").saveAsTable("real_estate.bronze.property_transactions_bronze")

print(f"\nSuccessfully inserted {df_bronze.count()} raw records into Bronze layer")
print("Data is now available for Silver layer processing.")

Bronze Layer DataFrame Schema:
root
 |-- days_on_market: long (nullable = true)
 |-- location: string (nullable = true)
 |-- price_per_sqft: double (nullable = true)
 |-- property_id: string (nullable = true)
 |-- property_type: string (nullable = true)
 |-- sale_price: double (nullable = true)
 |-- transaction_date: date (nullable = true)


Sample Bronze Data:


+--------------+----------+--------------+-----------+-------------+----------+----------------+
|days_on_market|  location|price_per_sqft|property_id|property_type|sale_price|transaction_date|
+--------------+----------+--------------+-----------+-------------+----------+----------------+
|            61|Waterfront|        409.32| PROP000001|Single Family|1313507.88|      2024-05-17|
|            54|Urban Core|        359.99| PROP000002|    Townhouse| 858576.15|      2024-11-08|
|            21|Waterfront|        441.11| PROP000003|    Apartment| 663870.55|      2024-01-27|
|            66|Urban Core|        400.79| PROP000004|    Townhouse| 648077.43|      2024-01-20|
|            68|Urban Core|        344.44| PROP000004|    Townhouse| 556959.48|      2024-11-30|
+--------------+----------+--------------+-----------+-------------+----------+----------------+
only showing top 5 rows




Successfully inserted 11339 raw records into Bronze layer
Data is now available for Silver layer processing.
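
The printed schema shows inferred `long` and `double` columns and no `ingestion_timestamp`, whereas the Bronze DDL declares `INT`, `DECIMAL`, and a timestamp column. One possible way to narrow that gap, sketched here in plain Python, is to normalize each record before it reaches `spark.createDataFrame`. The helper `normalize_record` is hypothetical, and how Spark maps these Python values to column types still depends on the schema you supply when creating the DataFrame.

```python
from datetime import date, datetime, timezone
from decimal import Decimal, ROUND_HALF_UP

CENTS = Decimal("0.01")

def normalize_record(record, ingested_at=None):
    """Return a copy of a generated record with fixed-point prices and an ingestion timestamp."""
    ingested_at = ingested_at or datetime.now(timezone.utc)
    normalized = dict(record)
    # Convert floats via str() so the Decimal reflects the printed value, then round to cents
    for field in ("sale_price", "price_per_sqft"):
        normalized[field] = Decimal(str(record[field])).quantize(CENTS, rounding=ROUND_HALF_UP)
    normalized["days_on_market"] = int(record["days_on_market"])
    normalized["ingestion_timestamp"] = ingested_at
    return normalized

# Sample record from the generator output above
sample = {
    "property_id": "PROP000001",
    "transaction_date": date(2024, 5, 17),
    "property_type": "Single Family",
    "sale_price": 1313507.88,
    "location": "Waterfront",
    "days_on_market": 61,
    "price_per_sqft": 409.32,
}
normalized_sample = normalize_record(sample, ingested_at=datetime(2024, 6, 1, tzinfo=timezone.utc))
print(normalized_sample["sale_price"], normalized_sample["ingestion_timestamp"])
```

The original record is left untouched, so the same list of dictionaries can still be inspected or reused as generated.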


## Silver Layer: Data Cleaning and Validation

### Silver Layer Design

The silver layer provides cleaned, validated, and enriched property data. We'll:

- Remove invalid records
- Standardize data formats
- Add data quality metrics
- Enrich with temporal features

### Table: `property_transactions_silver`

- Cleaned transaction data with validation flags
- Enhanced with market timing features
- Ready for analytical processing

In [None]:
# Create Silver Layer Delta table

spark.sql("""
CREATE TABLE IF NOT EXISTS real_estate.silver.property_transactions_silver (
    property_id STRING,
    transaction_date DATE,
    property_type STRING,
    sale_price DECIMAL(12,2),
    location STRING,
    days_on_market INT,
    price_per_sqft DECIMAL(8,2),
    month INT,
    quarter INT,
    day_of_week INT,
    is_spring_summer BOOLEAN,
    is_winter BOOLEAN,
    market_speed STRING,
    is_valid BOOLEAN,
    data_quality_score DOUBLE,
    processed_timestamp TIMESTAMP
)
USING DELTA
CLUSTER BY (property_id, transaction_date)
""")

print("Silver layer table created successfully!")

Silver layer table created successfully!


In [None]:
# Process Bronze data to Silver layer

from pyspark.sql.functions import col, when, month, quarter, dayofweek, lit

# Read from Bronze layer
bronze_df = spark.table("real_estate.bronze.property_transactions_bronze")

print(f"Read {bronze_df.count()} records from Bronze layer")

# Data validation and cleaning
silver_df = bronze_df \
    .withColumn("month", month(col("transaction_date"))) \
    .withColumn("quarter", quarter(col("transaction_date"))) \
    .withColumn("day_of_week", dayofweek(col("transaction_date"))) \
    .withColumn("is_spring_summer", when(col("month").isin([3, 4, 5, 6]), True).otherwise(False)) \
    .withColumn("is_winter", when(col("month").isin([11, 12, 1, 2]), True).otherwise(False)) \
    .withColumn("market_speed", 
                when(col("days_on_market") <= 30, "fast")
                 .when(col("days_on_market") <= 60, "normal")
                 .when(col("days_on_market") <= 90, "slow")
                 .otherwise("very_slow")) \
    .withColumn("is_valid", 
                when((col("sale_price").isNotNull()) & 
                     (col("property_id").isNotNull()) & 
                     (col("transaction_date").isNotNull()) &
                     (col("sale_price") > 0), True).otherwise(False)) \
    .withColumn("data_quality_score", 
                when(col("is_valid"), lit(0.8)).otherwise(0.0))

# Filter out invalid records (is_valid is already a boolean column)
valid_silver_df = silver_df.filter(col("is_valid"))

print(f"After validation: {valid_silver_df.count()} valid records")
print(f"Filtered out {bronze_df.count() - valid_silver_df.count()} invalid records")

# Show sample cleaned data
print("\nSample Silver Layer Data:")
valid_silver_df.select("property_id", "transaction_date", "property_type", "sale_price", "days_on_market", "market_speed", "is_valid", "data_quality_score").show(5)

Read 11339 records from Bronze layer


After validation: 11339 valid records


Filtered out 0 invalid records

Sample Silver Layer Data:


+-----------+----------------+-------------+----------+--------------+------------+--------+------------------+
|property_id|transaction_date|property_type|sale_price|days_on_market|market_speed|is_valid|data_quality_score|
+-----------+----------------+-------------+----------+--------------+------------+--------+------------------+
| PROP002173|      2024-04-18|    Townhouse| 763922.25|            18|        fast|    true|               0.8|
| PROP002173|      2024-09-17|    Townhouse|  598718.9|            62|        slow|    true|               0.8|
| PROP002173|      2024-01-28|    Townhouse| 542460.75|            68|        slow|    true|               0.8|
| PROP002174|      2024-03-17|    Townhouse| 929576.63|            11|        fast|    true|               0.8|
| PROP002174|      2024-09-20|    Townhouse| 760890.12|            57|      normal|    true|               0.8|
+-----------+----------------+-------------+----------+--------------+------------+--------+------------------+

In [None]:
# Insert cleaned data into Silver layer

valid_silver_df.write.mode("overwrite").saveAsTable("real_estate.silver.property_transactions_silver")

print(f"Successfully inserted {valid_silver_df.count()} cleaned records into Silver layer")
print("Data is now validated, enriched, and ready for Gold layer analytics.")

Successfully inserted 11339 cleaned records into Silver layer
Data is now validated, enriched, and ready for Gold layer analytics.
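
The Silver rules are simple enough to restate in plain Python, which can help when checking individual rows by hand. This is a sketch that mirrors the thresholds and validity conditions used in the transformation above; the function names are chosen for illustration, and it assumes `days_on_market` is present on every record.

```python
from datetime import date

def market_speed(days_on_market):
    """Bucket days on market using the same thresholds as the Silver transformation."""
    if days_on_market <= 30:
        return "fast"
    if days_on_market <= 60:
        return "normal"
    if days_on_market <= 90:
        return "slow"
    return "very_slow"

def is_valid(record):
    """A record is valid when the key fields are present and the price is positive."""
    return (
        record.get("sale_price") is not None
        and record.get("property_id") is not None
        and record.get("transaction_date") is not None
        and record["sale_price"] > 0
    )

# days_on_market values taken from the Silver sample output
for days in (18, 57, 62):
    print(days, market_speed(days))

good = {"property_id": "PROP002173", "transaction_date": date(2024, 4, 18), "sale_price": 763922.25}
bad = {"property_id": "PROP002173", "transaction_date": date(2024, 4, 18), "sale_price": 0.0}
print(is_valid(good), is_valid(bad))
```

Because the generated data always has a positive price and complete keys, every record passes `is_valid`, which is consistent with the "Filtered out 0 invalid records" message above.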


## Gold Layer: Analytics and ML-Ready Data

### Gold Layer Design

The gold layer provides business-ready analytics and ML features. We'll create:

- Aggregated analytics tables
- ML-ready feature engineering
- Price prediction model training

### Tables in Gold Layer

- `property_analytics_gold`: Aggregated property metrics
- `price_prediction_model_gold`: ML-ready features for price prediction

In [None]:
# Create Gold Layer Analytics Table

spark.sql("""
CREATE TABLE IF NOT EXISTS real_estate.gold.property_analytics_gold (
    property_id STRING,
    property_type STRING,
    location STRING,
    total_transactions BIGINT,
    min_sale_price DECIMAL(12,2),
    max_sale_price DECIMAL(12,2),
    avg_sale_price DECIMAL(12,2),
    avg_price_per_sqft DECIMAL(8,2),
    avg_days_on_market DOUBLE,
    market_speed_distribution MAP<STRING, BIGINT>,
    transaction_months ARRAY<INT>,
    created_timestamp TIMESTAMP
)
USING DELTA
CLUSTER BY (property_id)
""")

print("Gold layer analytics table created successfully!")

Gold layer analytics table created successfully!


In [None]:
# Create Gold Layer ML Features Table

spark.sql("""
CREATE TABLE IF NOT EXISTS real_estate.gold.price_prediction_model_gold (
    property_id STRING,
    transaction_date DATE,
    property_type STRING,
    sale_price DECIMAL(12,2),
    location STRING,
    days_on_market INT,
    price_per_sqft DECIMAL(8,2),
    month INT,
    quarter INT,
    day_of_week INT,
    is_spring_summer BOOLEAN,
    is_winter BOOLEAN,
    market_speed STRING,
    created_timestamp TIMESTAMP
)
USING DELTA
CLUSTER BY (property_id, transaction_date)
""")

print("Gold layer ML features table created successfully!")

Gold layer ML features table created successfully!


In [None]:
# Generate Gold Layer Analytics from Silver Data

from pyspark.sql.functions import (
    array_distinct, avg, col, collect_list, count, map_from_entries, struct,
    max as spark_max, min as spark_min,  # aliased so Python's built-in min/max stay usable
)

# Read Silver data
silver_df = spark.table("real_estate.silver.property_transactions_silver")

# Create property-level aggregations
property_analytics = silver_df \
    .groupBy("property_id", "property_type", "location") \
    .agg(
        count("*").alias("total_transactions"),
        spark_min("sale_price").alias("min_sale_price"),
        spark_max("sale_price").alias("max_sale_price"),
        avg("sale_price").alias("avg_sale_price"),
        avg("price_per_sqft").alias("avg_price_per_sqft"),
        avg("days_on_market").alias("avg_days_on_market")
    )

# Add market speed distribution
speed_dist = silver_df \
    .groupBy("property_id", "market_speed") \
    .agg(count("*").alias("count")) \
    .groupBy("property_id") \
    .agg(collect_list(struct("market_speed", "count")).alias("speed_distribution_list"))

# Add transaction months
month_dist = silver_df \
    .groupBy("property_id") \
    .agg(collect_list("month").alias("transaction_months_raw")) \
    .withColumn("transaction_months", array_distinct(col("transaction_months_raw"))) \
    .drop("transaction_months_raw")

# Join aggregations
gold_analytics = property_analytics \
    .join(speed_dist, "property_id", "left") \
    .join(month_dist, "property_id", "left") \
    .withColumn("market_speed_distribution", map_from_entries(col("speed_distribution_list"))) \
    .drop("speed_distribution_list")

print(f"Generated {gold_analytics.count()} property analytics records")
print("\nSample Gold Analytics:")
gold_analytics.select("property_id", "property_type", "location", "total_transactions", "avg_sale_price", "avg_price_per_sqft").show(5)

Generated 8000 property analytics records

Sample Gold Analytics:


+-----------+-------------+--------------------+------------------+--------------+------------------+
|property_id|property_type|            location|total_transactions|avg_sale_price|avg_price_per_sqft|
+-----------+-------------+--------------------+------------------+--------------+------------------+
| PROP002398|   Commercial|            Downtown|                 1|    3736670.79|            450.69|
| PROP002487|        Condo|          Waterfront|                 1|     754224.73|            565.81|
| PROP002837|    Apartment|            Suburban|                 1|     305504.96|            242.08|
| PROP003367|        Condo|Residential District|                 1|     457692.34|            238.63|
| PROP003859|Single Family|       Mountain View|                 1|     585933.04|            224.84|
+-----------+-------------+--------------------+------------------+--------------+------------------+
only showing top 5 rows



In [None]:
# Insert analytics into Gold layer

gold_analytics.write.mode("overwrite").saveAsTable("real_estate.gold.property_analytics_gold")

print(f"Successfully inserted {gold_analytics.count()} analytics records into Gold layer")

Successfully inserted 8000 analytics records into Gold layer
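
To make the shape of a Gold analytics row concrete, the same aggregation can be sketched in plain Python for a single property. The rows below are the three `PROP002173` transactions shown in the sample outputs; `aggregate_by_property` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter, defaultdict

# Silver rows for one property, taken from the sample outputs above
silver_rows = [
    {"property_id": "PROP002173", "sale_price": 763922.25, "month": 4, "market_speed": "fast"},
    {"property_id": "PROP002173", "sale_price": 598718.90, "month": 9, "market_speed": "slow"},
    {"property_id": "PROP002173", "sale_price": 542460.75, "month": 1, "market_speed": "slow"},
]

def aggregate_by_property(rows):
    """Build one summary per property: price stats, speed counts, distinct months."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["property_id"]].append(row)

    summaries = {}
    for property_id, items in grouped.items():
        prices = [r["sale_price"] for r in items]
        summaries[property_id] = {
            "total_transactions": len(items),
            "min_sale_price": min(prices),
            "max_sale_price": max(prices),
            "avg_sale_price": sum(prices) / len(prices),
            # Plays the role of the MAP<STRING, BIGINT> column
            "market_speed_distribution": dict(Counter(r["market_speed"] for r in items)),
            # Distinct months in first-seen order, like array_distinct over a collected list
            "transaction_months": list(dict.fromkeys(r["month"] for r in items)),
        }
    return summaries

summary = aggregate_by_property(silver_rows)["PROP002173"]
print(summary)
```

Spark's `collect_list` does not guarantee element order, so the month order in the real table may differ from this sketch even though the set of months is the same.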


In [None]:
# Prepare ML Features for Gold Layer

# Read Silver data for ML
ml_data = silver_df.select(
    "property_id", "transaction_date", "property_type", "sale_price", 
    "location", "days_on_market", "price_per_sqft", "month", 
    "quarter", "day_of_week", "is_spring_summer", "is_winter", "market_speed"
)

print(f"Prepared {ml_data.count()} records for ML feature engineering")
ml_data.show(5)

Prepared 11339 records for ML feature engineering


+-----------+----------------+-------------+----------+--------------------+--------------+--------------+-----+-------+-----------+----------------+---------+------------+
|property_id|transaction_date|property_type|sale_price|            location|days_on_market|price_per_sqft|month|quarter|day_of_week|is_spring_summer|is_winter|market_speed|
+-----------+----------------+-------------+----------+--------------------+--------------+--------------+-----+-------+-----------+----------------+---------+------------+
| PROP002173|      2024-04-18|    Townhouse| 763922.25|Residential District|            18|        286.65|    4|      2|          5|            true|    false|        fast|
| PROP002173|      2024-09-17|    Townhouse|  598718.9|Residential District|            62|        224.66|    9|      3|          3|           false|    false|        slow|
| PROP002173|      2024-01-28|    Townhouse| 542460.75|Residential District|            68|        203.55|    1|      1|          1|   

In [None]:
# Insert into Gold layer

ml_data.write.mode("overwrite").saveAsTable("real_estate.gold.price_prediction_model_gold")

print(f"Successfully inserted {ml_data.count()} ML-ready records into Gold layer")

Successfully inserted 11339 ML-ready records into Gold layer


## Gold Layer: ML Model Training and Evaluation

### Price Prediction Model

Now we'll train a Random Forest regression model using the ML-ready data from the Gold layer to predict property sale prices.

In [None]:
# Load Gold layer ML data for training

from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

# Load data from Gold layer
gold_ml_data = spark.table("real_estate.gold.price_prediction_model_gold")

print(f"Loaded {gold_ml_data.count()} records from Gold layer for ML training")
gold_ml_data.show(5)

Loaded 11339 records from Gold layer for ML training


+-----------+----------------+-------------+----------+----------+--------------+--------------+-----+-------+-----------+----------------+---------+------------+
|property_id|transaction_date|property_type|sale_price|  location|days_on_market|price_per_sqft|month|quarter|day_of_week|is_spring_summer|is_winter|market_speed|
+-----------+----------------+-------------+----------+----------+--------------+--------------+-----+-------+-----------+----------------+---------+------------+
| PROP000001|      2024-05-17|Single Family|1313507.88|Waterfront|            61|        409.32|    5|      2|          6|            true|    false|        slow|
| PROP000002|      2024-11-08|    Townhouse| 858576.15|Urban Core|            54|        359.99|   11|      4|          6|           false|     true|      normal|
| PROP000003|      2024-01-27|    Apartment| 663870.55|Waterfront|            21|        441.11|    1|      1|          7|           false|     true|        fast|
| PROP000004|      202

In [None]:
# Feature engineering pipeline

# Create indexers for categorical variables
property_type_indexer = StringIndexer(inputCol="property_type", outputCol="property_type_index")
location_indexer = StringIndexer(inputCol="location", outputCol="location_index")
market_speed_indexer = StringIndexer(inputCol="market_speed", outputCol="market_speed_index")

# Assemble features
assembler = VectorAssembler(
    inputCols=["days_on_market", "price_per_sqft", "month", "quarter", "day_of_week", 
               "property_type_index", "location_index", "market_speed_index",
               "is_spring_summer", "is_winter"],
    outputCol="features"
)

# Scale features
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")

# Create and train the model
rf = RandomForestRegressor(
    labelCol="sale_price", 
    featuresCol="scaled_features",
    numTrees=100,
    maxDepth=10
)

# Create pipeline
pipeline = Pipeline(stages=[property_type_indexer, location_indexer, market_speed_indexer, assembler, scaler, rf])

# Split data
train_data, test_data = gold_ml_data.randomSplit([0.8, 0.2], seed=42)

print(f"Training set: {train_data.count()} records")
print(f"Test set: {test_data.count()} records")

Training set: 9154 records


Test set: 2185 records
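
`StringIndexer` replaces each category with a numeric index, and by default it orders labels by descending frequency so the most common label gets index 0. The plain-Python sketch below mimics that default on a handful of values. `fit_string_index` is a hypothetical helper, and the alphabetical tie-break is an assumption based on documented Spark 3.x behavior.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct label to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}

speeds = ["slow", "fast", "slow", "normal", "slow", "fast"]
index = fit_string_index(speeds)
print(index)
print([index[s] for s in speeds])
```

The resulting indices are arbitrary codes rather than a meaningful ordering: with these sample values `slow` gets 0 and `normal` gets 2, which says nothing about how quickly a property sold. Tree-based models can still split on such codes, but the numeric order should not be interpreted.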


In [None]:
# Train the price prediction model

print("Training property price prediction model...")
model = pipeline.fit(train_data)

# Make predictions
predictions = model.transform(test_data)

# Evaluate the model
evaluator = RegressionEvaluator(labelCol="sale_price", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)

evaluator_r2 = RegressionEvaluator(labelCol="sale_price", predictionCol="prediction", metricName="r2")
r2 = evaluator_r2.evaluate(predictions)

print(f"Model RMSE: ${rmse:,.2f}")
print(f"Model R²: {r2:.4f}")

# Show prediction results
predictions.select("property_id", "property_type", "location", "sale_price", "prediction").show(10)

Training property price prediction model...


Model RMSE: $358,913.39
Model R²: 0.8300


+-----------+-------------+--------------------+----------+------------------+
|property_id|property_type|            location|sale_price|        prediction|
+-----------+-------------+--------------------+----------+------------------+
| PROP002173|    Townhouse|Residential District|  598718.9| 589832.7164126843|
| PROP002175|    Apartment|          Waterfront| 456298.92| 754067.7592078398|
| PROP002176|    Townhouse|          Urban Core|  565739.2| 771930.9673399623|
| PROP002180|Single Family|          Urban Core|1122442.56|1025470.5435646757|
| PROP002183|    Townhouse|            Suburban| 685427.28| 559209.2103593717|
| PROP002186|    Apartment|            Downtown| 569329.92| 467594.5879715044|
| PROP002189|    Apartment|          Urban Core| 497292.66|472475.34698397503|
| PROP002195|        Condo|          Waterfront|  740775.6| 662730.9836915237|
| PROP002200|        Condo|            Suburban| 401808.64| 407720.5569539201|
| PROP002201|   Commercial|          Waterfront|2204
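
The two metrics reported above can be computed by hand on a few rows to make their meaning concrete. This is a minimal sketch of the standard RMSE and R² definitions in plain Python; it uses only the first three rows of the prediction output, so its numbers will not match the full test-set metrics.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: typical error size in the label's units."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# First three rows of the prediction output above (sale_price, prediction)
actual = [598718.90, 456298.92, 565739.20]
predicted = [589832.7164126843, 754067.7592078398, 771930.9673399623]
print(f"RMSE on these three rows: {rmse(actual, predicted):,.2f}")
```

Because errors are squared before averaging, a few large misses on expensive properties can dominate RMSE, so it is worth reading it alongside the percentage errors reported in the next cell.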

In [None]:
# Model evaluation and business insights

# Feature importance
rf_model = model.stages[-1]
feature_importance = rf_model.featureImportances
feature_names = ["days_on_market", "price_per_sqft", "month", "quarter", "day_of_week", 
                 "property_type", "location", "market_speed", "is_spring_summer", "is_winter"]

print("\n=== Feature Importance for Price Prediction ===")
for name, importance in zip(feature_names, feature_importance):
    print(f"{name}: {importance:.4f}")

# Business impact analysis
print("\n=== Business Impact Analysis ===")

# Calculate prediction accuracy metrics
predictions_with_accuracy = predictions.withColumn(
    "prediction_error", 
    F.abs(F.col("sale_price") - F.col("prediction"))
).withColumn(
    "prediction_error_pct", 
    F.abs(F.col("sale_price") - F.col("prediction")) / F.col("sale_price") * 100
)

avg_prediction_error = predictions_with_accuracy.agg(F.avg("prediction_error")).collect()[0][0]
avg_prediction_error_pct = predictions_with_accuracy.agg(F.avg("prediction_error_pct")).collect()[0][0]
median_error_pct = predictions_with_accuracy.approxQuantile("prediction_error_pct", [0.5], 0.01)[0]

print(f"Average prediction error: ${avg_prediction_error:,.0f}")
print(f"Average prediction error percentage: {avg_prediction_error_pct:.2f}%")
print(f"Median prediction error percentage: {median_error_pct:.2f}%")

# Calculate potential value for pricing optimization
total_test_properties = test_data.count()
avg_property_value = test_data.agg(F.avg("sale_price")).collect()[0][0]

# Estimate potential value of better pricing (assuming 1% improvement in sale price)
price_optimization_value = total_test_properties * avg_property_value * 0.01

print(f"\nEstimated value of 1% price optimization: ${price_optimization_value:,.0f}")

# Market timing insights
seasonal_performance = predictions_with_accuracy.groupBy("is_spring_summer").agg(
    F.avg("prediction_error_pct").alias("avg_error_pct"),
    F.count("*").alias("transaction_count")
).orderBy("is_spring_summer")

print("\n=== Seasonal Prediction Performance ===")
seasonal_performance.show()

# Property type performance
property_type_performance = predictions_with_accuracy.groupBy("property_type").agg(
    F.avg("prediction_error_pct").alias("avg_error_pct"),
    F.count("*").alias("transaction_count")
).orderBy("avg_error_pct")

print("\n=== Property Type Prediction Performance ===")
property_type_performance.show()

# Location performance
location_performance = predictions_with_accuracy.groupBy("location").agg(
    F.avg("prediction_error_pct").alias("avg_error_pct"),
    F.count("*").alias("transaction_count")
).orderBy("avg_error_pct")

print("\n=== Location Prediction Performance ===")
location_performance.show()


=== Feature Importance for Price Prediction ===
days_on_market: 0.1411
price_per_sqft: 0.1700
month: 0.0158
quarter: 0.0051
day_of_week: 0.0172
property_type: 0.4074
location: 0.0492
market_speed: 0.1839
is_spring_summer: 0.0058
is_winter: 0.0045

=== Business Impact Analysis ===


Average prediction error: $213,288
Average prediction error percentage: 22.17%
Median prediction error percentage: 17.96%



Estimated value of 1% price optimization: $21,376,477

=== Seasonal Prediction Performance ===


+----------------+------------------+-----------------+
|is_spring_summer|     avg_error_pct|transaction_count|
+----------------+------------------+-----------------+
|           false|22.316770187327403|             1456|
|            true|21.868353999311417|              729|
+----------------+------------------+-----------------+


=== Property Type Prediction Performance ===


+-------------+------------------+-----------------+
|property_type|     avg_error_pct|transaction_count|
+-------------+------------------+-----------------+
|Single Family|16.518508570487484|              445|
|    Townhouse| 19.36782164357186|              455|
|        Condo| 22.24487225971695|              442|
|    Apartment|22.574195821436938|              422|
|   Commercial|30.673653494330907|              421|
+-------------+------------------+-----------------+


=== Location Prediction Performance ===


+--------------------+------------------+-----------------+
|            location|     avg_error_pct|transaction_count|
+--------------------+------------------+-----------------+
|Residential District|20.978593517531923|              352|
|          Waterfront|21.378073549031242|              373|
|            Downtown|21.873278961979548|              366|
|          Urban Core|22.288514105850624|              371|
|            Suburban|22.806133551135137|              388|
|       Mountain View|23.741261059974047|              335|
+--------------------+------------------+-----------------+



## Query Examples Across Medallion Layers

### Bronze Layer Queries

Query the raw data directly for auditing and debugging.

In [None]:
# Bronze Layer: Raw transaction data queries

print("=== Bronze Layer: Raw Property Transaction Data ===")
bronze_sample = spark.sql("""
SELECT property_id, transaction_date, property_type, sale_price, location
FROM real_estate.bronze.property_transactions_bronze
WHERE property_id = 'PROP000001'
ORDER BY transaction_date DESC
LIMIT 5
""")
bronze_sample.show()

=== Bronze Layer: Raw Property Transaction Data ===


+-----------+----------------+-------------+----------+----------+
|property_id|transaction_date|property_type|sale_price|  location|
+-----------+----------------+-------------+----------+----------+
| PROP000001|      2024-05-17|Single Family|1313507.88|Waterfront|
+-----------+----------------+-------------+----------+----------+



### Silver Layer Queries

Validated data access for analysis

In [None]:
# Silver Layer: Cleaned data queries

print("=== Silver Layer: Validated Property Transaction Data ===")
silver_sample = spark.sql("""
SELECT property_id, transaction_date, property_type, sale_price, days_on_market, market_speed, is_valid, data_quality_score
FROM real_estate.silver.property_transactions_silver
WHERE property_id = 'PROP000001' AND is_valid = true
ORDER BY transaction_date DESC
LIMIT 5
""")
silver_sample.show()

=== Silver Layer: Validated Property Transaction Data ===


+-----------+----------------+-------------+----------+--------------+------------+--------+------------------+
|property_id|transaction_date|property_type|sale_price|days_on_market|market_speed|is_valid|data_quality_score|
+-----------+----------------+-------------+----------+--------------+------------+--------+------------------+
| PROP000001|      2024-05-17|Single Family|1313507.88|            61|        slow|    true|               0.8|
+-----------+----------------+-------------+----------+--------------+------------+--------+------------------+



### Gold Layer Queries

Aggregated data access for analytics and reporting

In [None]:
# Gold Layer: Analytics queries

print("=== Gold Layer: Property Analytics ===")
gold_sample = spark.sql("""
SELECT property_id, property_type, location, total_transactions, avg_sale_price, avg_price_per_sqft, avg_days_on_market
FROM real_estate.gold.property_analytics_gold
WHERE property_id LIKE 'PROP000%'
ORDER BY avg_sale_price DESC
LIMIT 5
""")
gold_sample.show()

=== Gold Layer: Property Analytics ===


+-----------+-------------+----------+------------------+--------------+------------------+------------------+
|property_id|property_type|  location|total_transactions|avg_sale_price|avg_price_per_sqft|avg_days_on_market|
+-----------+-------------+----------+------------------+--------------+------------------+------------------+
| PROP000992|   Commercial|Waterfront|                 1|    5247559.45|            507.55|              78.0|
| PROP000370|   Commercial|Waterfront|                 1|    4979861.68|            528.76|             105.0|
| PROP000352|   Commercial|Waterfront|                 1|     4889264.7|            464.98|              38.0|
| PROP000148|   Commercial|Waterfront|                 1|     4863300.0|            543.75|             111.0|
| PROP000696|   Commercial|Waterfront|                 1|    4814185.84|            414.73|             147.0|
+-----------+-------------+----------+------------------+--------------+------------------+------------------+



## Key Takeaways: Medallion Architecture in AIDP for Real Estate

### What We Demonstrated

1. **Bronze Layer**: Raw property transaction data ingestion with Delta liquid clustering
2. **Silver Layer**: Data validation, cleaning, and market feature enrichment
3. **Gold Layer**: Property analytics aggregation and ML model training for price prediction
4. **End-to-End Pipeline**: Complete medallion architecture in a single notebook
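
The flow in the list above can be sketched without Spark. This is a toy illustration in plain Python, not the notebook's actual pipeline: the first row is the `PROP000001` sample shown earlier, the other rows are made up, and the validation rule (a positive `sale_price`) and the `to_silver` and `to_gold` helpers are assumptions introduced here.

```python
from collections import defaultdict
from statistics import mean

# Bronze: rows kept exactly as ingested, including bad ones.
bronze = [
    {"property_id": "PROP000001", "property_type": "Single Family", "sale_price": 1313507.88},
    {"property_id": "PROP900001", "property_type": "Condo", "sale_price": 400000.0},  # hypothetical row
    {"property_id": "PROP900002", "property_type": "Condo", "sale_price": 600000.0},  # hypothetical row
    {"property_id": "PROP900003", "property_type": "Condo", "sale_price": -1.0},      # hypothetical bad row
]


def to_silver(rows):
    """Flag each row with is_valid rather than dropping it (assumed rule: positive price)."""
    return [{**row, "is_valid": row["sale_price"] > 0} for row in rows]


def to_gold(rows):
    """Aggregate valid rows only: average sale price per property type."""
    prices = defaultdict(list)
    for row in rows:
        if row["is_valid"]:
            prices[row["property_type"]].append(row["sale_price"])
    return {ptype: round(mean(values), 2) for ptype, values in prices.items()}


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Flagging invalid rows in silver instead of deleting them keeps the bronze-to-silver row count auditable, while gold only ever aggregates rows that passed validation.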

### AIDP Advantages

- **Unified Platform**: All three layers live in one catalog and are processed from the same Spark session
- **Governance**: Catalog and schema isolation
- **Performance**: Optimized with liquid clustering
- **ML Integration**: Built-in ML capabilities for price prediction

### Business Benefits for Real Estate

1. **Data Quality**: Progressive improvement through layers
2. **Price Prediction**: ML-based property valuation, with an average prediction error of about 22% (median about 18%) on this demo dataset
3. **Market Intelligence**: Data-driven pricing strategies
4. **Investment Decisions**: Analytics for property performance
5. **Governance**: Audit trails and data lineage

### Best Practices

1. **Layer Isolation**: Keep raw data separate from processed data
2. **Incremental Processing**: Build each layer only from the validated output of the layer below it
3. **Business Alignment**: Shape gold tables around the questions the business needs answered
4. **Performance Optimization**: Use clustering strategically
5. **ML Integration**: Include predictive analytics in gold layer

This notebook shows how Oracle AI Data Platform supports real estate analytics end to end: a layered data architecture, catalog-level governance, and ML-based price prediction.