# Energy Analytics: Medallion Architecture with Delta Liquid Clustering Demo

## Overview

This notebook demonstrates the **Medallion Architecture** combined with **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using an energy and utilities analytics use case. 

### What is Medallion Architecture?

The Medallion Architecture is a data design pattern used to logically organize data in a lakehouse:

- **Bronze Layer**: Raw data as ingested from sources
- **Silver Layer**: Cleaned, validated, and enriched data
- **Gold Layer**: Business-level aggregates and ML-ready features

### What is Liquid Clustering?

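Liquid clustering is a Delta Lake feature that manages data layout in place of manual partitioning or Z-Ordering. You declare clustering columns with `CLUSTER BY`, and rows with similar values in those columns are stored together, so queries that filter on them can skip unrelated files.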
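To make the file-skipping idea concrete, here is a toy model in plain Python. It is only a sketch: the fixed-size "files", the min/max statistics, and the sample rows are illustrative assumptions, not how Delta Lake is implemented internally.

```python
import random

def files_scanned(rows, key, target, rows_per_file=100):
    """Count the fixed-size 'files' that must be read to find rows where row[key] == target.

    Each file keeps min/max statistics for the key, and a file is skipped
    when the target falls outside that range.
    """
    scanned = 0
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        lowest = min(r[key] for r in chunk)
        highest = max(r[key] for r in chunk)
        if lowest <= target <= highest:
            scanned += 1
    return scanned

random.seed(0)
# Hypothetical readings: 50 meters x 40 readings each, arriving in random order
rows = [{"meter_id": f"MTR{m:06d}", "reading": i} for m in range(1, 51) for i in range(40)]
random.shuffle(rows)

unclustered = files_scanned(rows, "meter_id", "MTR000007")
clustered = files_scanned(sorted(rows, key=lambda r: r["meter_id"]), "meter_id", "MTR000007")
print(f"files scanned without clustering: {unclustered}, with clustering: {clustered}")
```

When rows for one meter sit in the same file, the min/max range of most files excludes that meter, so far fewer files need to be opened.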

### Use Case: Smart Grid Monitoring and Energy Consumption Analytics

We'll analyze energy consumption and smart grid performance data through the medallion layers, with liquid clustering optimizing performance at each stage.

### AIDP Environment Setup

This notebook leverages the existing Spark session in your AIDP environment.

## Setup: Create Energy Catalog and Schemas

### Multi-Layer Schema Design

We'll create separate schemas for each medallion layer to maintain data isolation and governance:

In [None]:
# Create energy catalog with bronze, silver, and gold schemas
# In AIDP, catalogs provide data isolation and governance

spark.sql("CREATE CATALOG IF NOT EXISTS energy")

spark.sql("CREATE SCHEMA IF NOT EXISTS energy.bronze")
spark.sql("CREATE SCHEMA IF NOT EXISTS energy.silver")
spark.sql("CREATE SCHEMA IF NOT EXISTS energy.gold")

print("Energy catalog with bronze, silver, and gold schemas created successfully!")

## 🥉 BRONZE LAYER: Raw Data Ingestion

### Bronze Layer Principles

- **Raw data storage**: Store data as-is from source systems
- **Minimal processing**: No transformations or cleaning
- **Append-only**: Historical record preservation
- **Schema enforcement**: Basic structure validation

### Table Design for Energy Readings

Our `energy.bronze.energy_readings` table stores raw meter data:

- **meter_id**: Raw meter identifier from source
- **reading_date**: Timestamp as received
- **energy_type**: Energy type string
- **consumption**: Raw consumption value
- **location**: Geographic location string
- **peak_demand**: Peak usage value
- **efficiency_rating**: Raw efficiency score
- **ingestion_timestamp**: When data was ingested

### Bronze Layer Clustering Strategy

Cluster by `meter_id` and `reading_date` for efficient raw data queries and incremental processing.

In [None]:
# Create Bronze layer Delta table with liquid clustering
# CLUSTER BY optimizes for incremental data ingestion and querying

spark.sql("""
CREATE TABLE IF NOT EXISTS energy.bronze.energy_readings (
    meter_id STRING,
    reading_date TIMESTAMP,
    energy_type STRING,
    consumption DECIMAL(10,3),
    location STRING,
    peak_demand DECIMAL(8,2),
    efficiency_rating INT,
    ingestion_timestamp TIMESTAMP
)
USING DELTA
CLUSTER BY (meter_id, reading_date)
""")

print("Bronze layer Delta table with liquid clustering created successfully!")
print("Ready to ingest raw energy meter data...")

### Generate and Ingest Raw Energy Data

#### Data Generation Strategy

We'll simulate raw data from smart meters including realistic variations:

- **Data quality issues**: Missing values and outliers (about 2% missing consumption, roughly 5% outliers, about 3% missing efficiency ratings)
- **Realistic patterns**: Seasonal variations, peak usage times
- **Source diversity**: Different meter types and locations

#### Why Raw Data Ingestion?

The bronze layer preserves the original state of data for:

- **Audit trails**: Complete historical record
- **Reprocessing**: Ability to re-run transformations
- **Data lineage**: Track data from source to insights

In [None]:
# Generate realistic raw energy consumption data with quality issues
# This simulates data as it would come from various meter sources

import random
from datetime import datetime, timedelta

# Define energy data constants
ENERGY_TYPES = ['Electricity', 'Natural Gas', 'Water', 'Solar']
LOCATIONS = ['Residential_NYC', 'Commercial_CHI', 'Industrial_HOU', 'Residential_LAX', 'Commercial_SFO']

# Base consumption parameters by energy type and location
CONSUMPTION_PARAMS = {
    'Electricity': {
        'Residential_NYC': {'base_consumption': 15, 'peak_factor': 2.5, 'efficiency': 85},
        'Commercial_CHI': {'base_consumption': 150, 'peak_factor': 3.0, 'efficiency': 78},
        'Industrial_HOU': {'base_consumption': 500, 'peak_factor': 2.2, 'efficiency': 92},
        'Residential_LAX': {'base_consumption': 12, 'peak_factor': 2.8, 'efficiency': 88},
        'Commercial_SFO': {'base_consumption': 180, 'peak_factor': 2.7, 'efficiency': 82}
    },
    'Natural Gas': {
        'Residential_NYC': {'base_consumption': 25, 'peak_factor': 1.8, 'efficiency': 90},
        'Commercial_CHI': {'base_consumption': 80, 'peak_factor': 2.1, 'efficiency': 85},
        'Industrial_HOU': {'base_consumption': 200, 'peak_factor': 1.9, 'efficiency': 95},
        'Residential_LAX': {'base_consumption': 20, 'peak_factor': 2.0, 'efficiency': 87},
        'Commercial_SFO': {'base_consumption': 95, 'peak_factor': 2.3, 'efficiency': 83}
    },
    'Water': {
        'Residential_NYC': {'base_consumption': 180, 'peak_factor': 1.5, 'efficiency': 88},
        'Commercial_CHI': {'base_consumption': 450, 'peak_factor': 1.7, 'efficiency': 82},
        'Industrial_HOU': {'base_consumption': 1200, 'peak_factor': 1.6, 'efficiency': 91},
        'Residential_LAX': {'base_consumption': 160, 'peak_factor': 1.8, 'efficiency': 85},
        'Commercial_SFO': {'base_consumption': 380, 'peak_factor': 1.9, 'efficiency': 79}
    },
    'Solar': {
        'Residential_NYC': {'base_consumption': -8, 'peak_factor': 3.5, 'efficiency': 78},
        'Commercial_CHI': {'base_consumption': -75, 'peak_factor': 4.0, 'efficiency': 85},
        'Industrial_HOU': {'base_consumption': -250, 'peak_factor': 3.8, 'efficiency': 88},
        'Residential_LAX': {'base_consumption': -12, 'peak_factor': 4.2, 'efficiency': 82},
        'Commercial_SFO': {'base_consumption': -95, 'peak_factor': 3.9, 'efficiency': 86}
    }
}

# Generate raw energy reading records with data quality variations
bronze_data = []
base_date = datetime(2024, 1, 1)

# Create 2,000 meters with hourly readings for 90 days (2,000 x 90 x 24 = 4,320,000 rows)
for meter_num in range(1, 2001):
    meter_id = f"MTR{meter_num:06d}"
    
    # Each meter gets readings for 90 days (hourly)
    for day in range(90):
        for hour in range(24):
            reading_date = base_date + timedelta(days=day, hours=hour)
            ingestion_timestamp = reading_date + timedelta(minutes=random.randint(0, 30))  # Simulated ingestion delay
            
            # Select energy type and location. Both are drawn independently for every
            # reading, so a meter's type and location can change from hour to hour.
            energy_type = random.choice(ENERGY_TYPES)
            location = random.choice(LOCATIONS)
            
            params = CONSUMPTION_PARAMS[energy_type][location]
            
            # Calculate consumption with variations
            month = reading_date.month
            if energy_type in ['Electricity', 'Natural Gas']:
                if month in [12, 1, 2]:  # Winter
                    seasonal_factor = 1.4
                elif month in [6, 7, 8]:  # Summer
                    seasonal_factor = 1.3
                else:
                    seasonal_factor = 1.0
            else:
                seasonal_factor = 1.0
            
            hour_factor = 1.0
            if hour in [6, 7, 8, 17, 18, 19]:  # Peak hours
                hour_factor = params['peak_factor']
            elif hour in [2, 3, 4, 5]:  # Off-peak
                hour_factor = 0.4
            
            consumption_variation = random.uniform(0.8, 1.2)
            consumption = round(params['base_consumption'] * seasonal_factor * hour_factor * consumption_variation, 3)
            
            # Add data quality issues (simulating real-world raw data)
            if random.random() < 0.02:  # 2% missing consumption
                consumption = None
            elif random.random() < 0.05:  # 5% outliers
                consumption = consumption * random.uniform(5, 10)
            
            # Peak demand is 10-50% above the absolute consumption; missing when consumption is missing
            peak_demand = round(abs(consumption) * random.uniform(1.1, 1.5), 2) if consumption else None
            
            efficiency_variation = random.randint(-5, 3)
            efficiency_rating = max(0, min(100, params['efficiency'] + efficiency_variation))
            
            # Occasional missing efficiency ratings
            if random.random() < 0.03:
                efficiency_rating = None
            
            bronze_data.append({
                "meter_id": meter_id,
                "reading_date": reading_date,
                "energy_type": energy_type,
                "consumption": consumption,
                "location": location,
                "peak_demand": peak_demand,
                "efficiency_rating": efficiency_rating,
                "ingestion_timestamp": ingestion_timestamp
            })

print(f"Generated {len(bronze_data)} raw energy reading records for bronze layer")
print("Sample raw record:", bronze_data[0])
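As a worked check of the generator's arithmetic, the deterministic part of the formula can be isolated from the random variation. The sketch below reuses the seasonal and hourly factors from the cell above; the function name `expected_consumption` is a hypothetical helper, not part of the notebook.

```python
def expected_consumption(base, peak_factor, energy_type, month, hour):
    """Deterministic part of a simulated reading (no random variation, no outliers)."""
    if energy_type in ("Electricity", "Natural Gas"):
        if month in (12, 1, 2):      # Winter
            seasonal = 1.4
        elif month in (6, 7, 8):     # Summer
            seasonal = 1.3
        else:
            seasonal = 1.0
    else:
        seasonal = 1.0

    if hour in (6, 7, 8, 17, 18, 19):   # Peak hours
        hourly = peak_factor
    elif hour in (2, 3, 4, 5):          # Off-peak
        hourly = 0.4
    else:
        hourly = 1.0
    return round(base * seasonal * hourly, 3)

# Electricity, Residential_NYC (base 15, peak factor 2.5), January
winter_peak = expected_consumption(15, 2.5, "Electricity", month=1, hour=7)
winter_night = expected_consumption(15, 2.5, "Electricity", month=1, hour=3)
# Solar, Industrial_HOU (base -250, peak factor 3.8): negative values model generation
solar_peak = expected_consumption(-250, 3.8, "Solar", month=1, hour=18)
print(winter_peak, winter_night, solar_peak)
```

Before outliers are injected, each generated reading falls within 20% of this value, because the generator multiplies it by a random factor between 0.8 and 1.2.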

In [None]:
# Insert raw data into Bronze layer using PySpark
# Bronze layer preserves data as-is from sources
from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType, IntegerType

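# Create DataFrame from generated raw data. Python floats and ints are inferred
# as double and bigint, so cast them to the types declared on the bronze table.
df_bronze = (
    spark.createDataFrame(bronze_data)
    .withColumn("consumption", col("consumption").cast(DecimalType(10, 3)))
    .withColumn("peak_demand", col("peak_demand").cast(DecimalType(8, 2)))
    .withColumn("efficiency_rating", col("efficiency_rating").cast(IntegerType()))
)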

# Display schema and sample data
print("Bronze Layer DataFrame Schema:")
df_bronze.printSchema()

print("\nSample Raw Data (including quality issues):")
df_bronze.show(5)

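# Write to the Bronze layer Delta table. "overwrite" keeps this demo re-runnable;
# an append-only bronze table would use mode("append") instead.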
df_bronze.write.mode("overwrite").saveAsTable("energy.bronze.energy_readings")

print(f"\nSuccessfully ingested {df_bronze.count()} raw records into bronze layer")
print("Data includes missing values and outliers, as would be found in raw sources")

## 🥈 SILVER LAYER: Data Cleaning and Enrichment

### Silver Layer Principles

- **Data quality**: Clean, validate, and standardize data
- **Business rules**: Apply transformations and enrichments
- **Deduplication**: Remove duplicates and handle conflicts
- **Schema evolution**: Consistent data structure

### Silver Layer Transformations

From our bronze `energy_readings` table, we'll create `energy.silver.energy_readings` with:

- **Data validation**: Remove invalid records, handle missing values
- **Standardization**: Consistent formats and units
- **Enrichment**: Add derived fields like consumption categories
- **Quality metrics**: Data quality scores and validation flags
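The rules in this list can be sketched for a single reading in plain Python. The thresholds (10,000 for outliers, 75 as the default efficiency, 50 and 200 as category boundaries) mirror the transformation cell in this section; the function name `clean_reading` is hypothetical, and scoring quality from the raw values before defaults are filled in is an assumption about the intent.

```python
def clean_reading(consumption, efficiency_rating):
    """Apply the silver rules to one raw reading and score its original quality."""
    # Score the raw inputs first, before any default replaces a missing value
    if consumption is not None and efficiency_rating is not None:
        score = 100
    elif consumption is not None:
        score = 80
    else:
        score = 60

    # Replace missing consumption and extreme outliers with 0
    if consumption is None or consumption > 10000:
        consumption = 0
    # Default missing efficiency to 75, then clamp to the 0..100 range
    if efficiency_rating is None:
        efficiency_rating = 75
    efficiency_rating = max(0, min(100, efficiency_rating))

    if consumption <= 0:
        category = "Generation"
    elif consumption < 50:
        category = "Low"
    elif consumption < 200:
        category = "Medium"
    else:
        category = "High"

    return {
        "consumption": consumption,
        "efficiency_rating": efficiency_rating,
        "consumption_category": category,
        "data_quality_score": score,
    }

print(clean_reading(120.5, None))   # missing efficiency rating
print(clean_reading(None, 88))      # missing consumption
```

Note that a reading whose consumption was replaced with 0 lands in the "Generation" category under these rules, so the quality score is the only signal that the value was imputed.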

### Silver Layer Clustering Strategy

Maintain clustering by `meter_id` and `reading_date` for efficient processing and querying of cleaned data.

In [None]:
# Create Silver layer Delta table with liquid clustering
# Silver layer applies data quality rules and business transformations

spark.sql("""
CREATE TABLE IF NOT EXISTS energy.silver.energy_readings (
    meter_id STRING,
    reading_date TIMESTAMP,
    energy_type STRING,
    consumption DECIMAL(10,3),
    location STRING,
    peak_demand DECIMAL(8,2),
    efficiency_rating INT,
    ingestion_timestamp TIMESTAMP,
    -- Derived fields
    consumption_category STRING,
    is_peak_hour BOOLEAN,
    seasonal_factor DECIMAL(3,2),
    data_quality_score INT,
    processing_timestamp TIMESTAMP
)
USING DELTA
CLUSTER BY (meter_id, reading_date)
""")

print("Silver layer Delta table with liquid clustering created successfully!")
print("Ready to transform and clean bronze layer data...")

In [None]:
# Transform bronze data to silver layer with cleaning and enrichment
# This demonstrates data quality improvements and business logic application

from pyspark.sql.functions import col, when, hour, month, current_timestamp, lit

# Read bronze data
df_bronze = spark.table("energy.bronze.energy_readings")

# Apply data cleaning and enrichment transformations
df_silver = df_bronze \
    .filter(col("reading_date").isNotNull()) \
    .filter(col("meter_id").isNotNull()) \
    .withColumn("data_quality_score",
                # Score the raw values here, before nulls are replaced below;
                # scoring after imputation would always yield 100
                when(col("consumption").isNotNull() & col("efficiency_rating").isNotNull(), 100)
                .when(col("consumption").isNotNull(), 80)
                .otherwise(60)) \
    .withColumn("consumption", 
                when(col("consumption").isNull(), 0)
                .when(col("consumption") > 10000, 0)  # Zero out extreme outliers
                .otherwise(col("consumption"))) \
    .withColumn("efficiency_rating",
                when(col("efficiency_rating").isNull(), 75)  # Default efficiency
                .when(col("efficiency_rating") < 0, 0)
                .when(col("efficiency_rating") > 100, 100)
                .otherwise(col("efficiency_rating"))) \
    .withColumn("peak_demand",
                when(col("peak_demand").isNull(), col("consumption") * 1.2)
                .otherwise(col("peak_demand"))) \
    .withColumn("consumption_category",
                when(col("consumption") <= 0, "Generation")
                .when(col("consumption") < 50, "Low")
                .when(col("consumption") < 200, "Medium")
                .otherwise("High")) \
    .withColumn("is_peak_hour",
                when(hour(col("reading_date")).isin([6,7,8,17,18,19]), True)
                .otherwise(False)) \
    .withColumn("seasonal_factor",
                when(month(col("reading_date")).isin([12,1,2]), 1.4)
                .when(month(col("reading_date")).isin([6,7,8]), 1.3)
                .otherwise(1.0)) \
    .withColumn("processing_timestamp", current_timestamp())

print("Silver layer transformations applied:")
print("- Null value handling")
print("- Outlier removal")
print("- Data standardization")
print("- Business rule enrichment")
print("- Quality scoring")

print(f"\nBronze records: {df_bronze.count()}")
print(f"Silver records after cleaning: {df_silver.count()}")

In [None]:
# Insert cleaned and enriched data into Silver layer

df_silver.write.mode("overwrite").saveAsTable("energy.silver.energy_readings")

print("Successfully transformed and loaded data into silver layer")
print("\nSample silver layer data:")
spark.table("energy.silver.energy_readings").select(
    "meter_id", "reading_date", "consumption", "consumption_category", 
    "is_peak_hour", "data_quality_score"
).show(10)

# Data quality comparison
bronze_nulls = df_bronze.filter(col("consumption").isNull() | col("efficiency_rating").isNull()).count()
silver_nulls = df_silver.filter(col("consumption").isNull() | col("efficiency_rating").isNull()).count()

print(f"\nData Quality Improvement:")
print(f"Bronze layer null values: {bronze_nulls}")
print(f"Silver layer null values: {silver_nulls}")
print(f"Quality improvement: {((bronze_nulls - silver_nulls) / bronze_nulls * 100):.1f}%")

## 🥇 GOLD LAYER: Business Analytics and ML Features

### Gold Layer Principles

- **Business focus**: Aggregated metrics and KPIs
- **Performance**: Optimized for reporting and analytics
- **ML-ready**: Feature engineering for machine learning
- **Governance**: Curated datasets for enterprise use

### Gold Layer Tables

From our silver `energy_readings` table, we'll create:

1. **`energy.gold.meter_analytics`**: Aggregated meter-level analytics
2. **`energy.gold.energy_forecasting_features`**: ML-ready features for demand forecasting

### Gold Layer Clustering Strategy

Different clustering strategies optimized for specific use cases:
- Meter analytics: Cluster by `meter_id` and `month`
- Forecasting features: Cluster by `meter_id` and `reading_date`

In [None]:
# Create Gold layer tables for business analytics and ML

# Meter analytics table - business KPIs
spark.sql("""
CREATE TABLE IF NOT EXISTS energy.gold.meter_analytics (
    meter_id STRING,
    month STRING,
    energy_type STRING,
    location STRING,
    total_consumption DECIMAL(12,3),
    avg_consumption DECIMAL(10,3),
    max_peak_demand DECIMAL(8,2),
    avg_efficiency DECIMAL(5,2),
    reading_count INT,
    peak_hours_count INT,
    data_quality_avg DECIMAL(5,2),
    last_reading_date TIMESTAMP
)
USING DELTA
CLUSTER BY (meter_id, month)
""")

# ML features table for energy demand forecasting
spark.sql("""
CREATE TABLE IF NOT EXISTS energy.gold.energy_forecasting_features (
    meter_id STRING,
    reading_date TIMESTAMP,
    energy_type STRING,
    location STRING,
    consumption DECIMAL(10,3),
    -- Temporal features
    hour_of_day INT,
    day_of_week INT,
    month_of_year INT,
    is_weekend BOOLEAN,
    -- Lagged features
    prev_hour_consumption DECIMAL(10,3),
    prev_day_consumption DECIMAL(10,3),
    -- Statistical features
    rolling_24h_avg DECIMAL(10,3),
    rolling_7d_avg DECIMAL(10,3),
    -- Other features
    peak_demand DECIMAL(8,2),
    efficiency_rating INT,
    seasonal_factor DECIMAL(3,2),
    consumption_category STRING
)
USING DELTA
CLUSTER BY (meter_id, reading_date)
""")

print("Gold layer tables created successfully!")
print("- meter_analytics: Business KPIs and aggregations")
print("- energy_forecasting_features: ML-ready feature set")

In [None]:
# Populate meter analytics gold table
# Business-focused aggregations for reporting and dashboards

from pyspark.sql.functions import avg, col, count, date_format, last, when
# Import sum/max under aliases so they do not shadow Python's builtins
from pyspark.sql.functions import max as spark_max, sum as spark_sum

meter_analytics = spark.table("energy.silver.energy_readings") \
    .groupBy("meter_id", date_format("reading_date", "yyyy-MM").alias("month")) \
    .agg(
        # last() without an ordering returns an arbitrary value from each group
        last("energy_type").alias("energy_type"),
        last("location").alias("location"),
        spark_sum("consumption").alias("total_consumption"),
        avg("consumption").alias("avg_consumption"),
        spark_max("peak_demand").alias("max_peak_demand"),
        avg("efficiency_rating").alias("avg_efficiency"),
        count("*").alias("reading_count"),
        spark_sum(when(col("is_peak_hour"), 1).otherwise(0)).alias("peak_hours_count"),
        avg("data_quality_score").alias("data_quality_avg"),
        spark_max("reading_date").alias("last_reading_date")
    )

meter_analytics.write.mode("overwrite").saveAsTable("energy.gold.meter_analytics")

print("Meter analytics gold table populated")
print("\nSample meter analytics:")
spark.table("energy.gold.meter_analytics").select(
    "meter_id", "month", "total_consumption", "avg_consumption", 
    "max_peak_demand", "avg_efficiency"
).show(10)

In [None]:
# Create ML-ready features for energy demand forecasting
# Advanced feature engineering for predictive analytics

from pyspark.sql.functions import avg, col, dayofweek, hour, lag, month, when
from pyspark.sql.window import Window

df_silver = spark.table("energy.silver.energy_readings")

# Create time-based features
df_features = df_silver \
    .withColumn("hour_of_day", hour("reading_date")) \
    .withColumn("day_of_week", dayofweek("reading_date")) \
    .withColumn("month_of_year", month("reading_date")) \
    .withColumn("is_weekend", when(col("day_of_week").isin([1,7]), True).otherwise(False))

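# Add lagged and rolling features. The row-based frames assume one reading per meter
# per hour (24 rows = 1 day, 168 rows = 7 days) and include the current row.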
window_spec_1h = Window.partitionBy("meter_id").orderBy("reading_date")
window_spec_24h = Window.partitionBy("meter_id").orderBy("reading_date").rowsBetween(-23, 0)
window_spec_7d = Window.partitionBy("meter_id").orderBy("reading_date").rowsBetween(-167, 0)

df_features = df_features \
    .withColumn("prev_hour_consumption", lag("consumption", 1).over(window_spec_1h)) \
    .withColumn("prev_day_consumption", lag("consumption", 24).over(window_spec_1h)) \
    .withColumn("rolling_24h_avg", avg("consumption").over(window_spec_24h)) \
    .withColumn("rolling_7d_avg", avg("consumption").over(window_spec_7d))

# Select final feature set
df_gold_features = df_features.select(
    "meter_id", "reading_date", "energy_type", "location", "consumption",
    "hour_of_day", "day_of_week", "month_of_year", "is_weekend",
    "prev_hour_consumption", "prev_day_consumption",
    "rolling_24h_avg", "rolling_7d_avg",
    "peak_demand", "efficiency_rating", "seasonal_factor", "consumption_category"
)

df_gold_features.write.mode("overwrite").saveAsTable("energy.gold.energy_forecasting_features")

print("Energy forecasting features gold table populated")
print("\nSample ML features:")
spark.table("energy.gold.energy_forecasting_features").select(
    "meter_id", "reading_date", "consumption", "prev_hour_consumption", 
    "rolling_24h_avg", "hour_of_day", "is_weekend"
).show(10)
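To show what the lag and rolling-window columns contain, here is a plain-Python sketch of the same computation for one meter's ordered readings. The function name `lag_and_rolling` and the sample values are illustrative assumptions, and the sketch assumes no missing values in the series.

```python
from collections import deque
from statistics import fmean

def lag_and_rolling(values, lag=1, window=24):
    """Return (lagged value, trailing mean) pairs for an ordered series.

    The trailing mean covers the current value and up to window-1 earlier
    ones, like rowsBetween(-(window - 1), 0). The lagged value is None until
    enough history exists, like lag() at the start of a partition.
    """
    recent = deque(maxlen=window)
    features = []
    for i, value in enumerate(values):
        recent.append(value)
        lagged = values[i - lag] if i >= lag else None
        features.append((lagged, fmean(recent)))
    return features

hourly = [10.0, 12.0, 14.0, 16.0]   # hypothetical readings for one meter
features = lag_and_rolling(hourly, lag=1, window=3)
for reading, (previous, rolling) in zip(hourly, features):
    print(reading, previous, rolling)
```

Because the frame ends at the current row, each rolling mean includes the reading it sits next to. If the features should use only past information, a frame that ends one row earlier, such as `rowsBetween(-24, -1)`, is one option to consider.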

## Liquid Clustering Performance Demonstration

### Query Performance Across Medallion Layers

Now let's run representative queries at each layer. Each one filters on `meter_id`, a clustering key, which is the access pattern liquid clustering is designed to speed up:

1. **Bronze**: Raw data scanning and filtering
2. **Silver**: Cleaned data analytics
3. **Gold**: Business intelligence and ML feature extraction

### Performance Benefits

Liquid clustering provides:
- **Automatic optimization**: No manual tuning required
- **Query acceleration**: Faster aggregations and joins
- **Storage efficiency**: Better compression and layout
- **Adaptive performance**: Adjusts as data patterns change
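In Delta Lake, the `OPTIMIZE` command rewrites files according to a table's clustering columns. Whether AIDP runs this automatically is not stated in this notebook, so the sketch below only builds the statements for the four tables created here; executing them is left commented out as an assumption about the runtime.

```python
# Tables created in this notebook
CLUSTERED_TABLES = [
    "energy.bronze.energy_readings",
    "energy.silver.energy_readings",
    "energy.gold.meter_analytics",
    "energy.gold.energy_forecasting_features",
]

def optimize_statements(tables):
    """Build one Delta Lake OPTIMIZE statement per table name."""
    return [f"OPTIMIZE {table}" for table in tables]

statements = optimize_statements(CLUSTERED_TABLES)
for statement in statements:
    print(statement)
    # In the notebook, assuming the runtime's Delta version supports OPTIMIZE:
    # spark.sql(statement)
```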

In [None]:
# Demonstrate liquid clustering benefits across medallion layers

print("=== Liquid Clustering Performance Demonstration ===\n")

# Bronze layer: Raw data exploration
print("🥉 BRONZE LAYER - Raw Data Queries")
bronze_query = spark.sql("""
SELECT meter_id, reading_date, energy_type, consumption, peak_demand
FROM energy.bronze.energy_readings
WHERE meter_id = 'MTR000001'
ORDER BY reading_date DESC
LIMIT 10
""")
bronze_query.show()
print(f"Bronze query returned {bronze_query.count()} records\n")

# Silver layer: Cleaned data analytics
print("🥈 SILVER LAYER - Enhanced Analytics")
silver_query = spark.sql("""
SELECT meter_id, reading_date, consumption, consumption_category, 
       is_peak_hour, data_quality_score
FROM energy.silver.energy_readings
WHERE meter_id = 'MTR000001' AND consumption_category = 'High'
ORDER BY reading_date DESC
LIMIT 10
""")
silver_query.show()
print(f"Silver query returned {silver_query.count()} records\n")

# Gold layer: Business intelligence
print("🥇 GOLD LAYER - Business Analytics")
gold_analytics = spark.sql("""
SELECT meter_id, month, total_consumption, avg_consumption, 
       max_peak_demand, avg_efficiency
FROM energy.gold.meter_analytics
WHERE meter_id LIKE 'MTR000%'
ORDER BY meter_id, month
LIMIT 15
""")
gold_analytics.show()
print(f"Gold analytics returned {gold_analytics.count()} records\n")

# Gold layer: ML feature extraction
print("🥇 GOLD LAYER - ML Feature Extraction")
gold_ml = spark.sql("""
SELECT meter_id, reading_date, consumption, prev_hour_consumption,
       rolling_24h_avg, hour_of_day, is_weekend
FROM energy.gold.energy_forecasting_features
WHERE meter_id = 'MTR000001'
ORDER BY reading_date DESC
LIMIT 10
""")
gold_ml.show()
print(f"ML features query returned {gold_ml.count()} records")
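The cell above runs the queries but does not time them. A minimal timing helper, sketched here in plain Python, could wrap each query to collect wall-clock numbers. The name `timed` and the stand-in workload are hypothetical; in the notebook the callable would be a Spark action instead.

```python
import time

def timed(label, action, repeats=3):
    """Run a zero-argument callable `repeats` times; return (last result, best seconds)."""
    best = None
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = action()
        elapsed = time.perf_counter() - start
        if best is None or elapsed < best:
            best = elapsed
    print(f"{label}: best of {repeats} runs = {best:.4f}s")
    return result, best

# Stand-in workload so the sketch runs anywhere. In the notebook, the callable
# could instead be: lambda: spark.sql("<query>").collect()
result, seconds = timed("sort 100k integers", lambda: sorted(range(100_000), reverse=True))
```

Timings like these only attribute a difference to clustering when the same query is also run against an unclustered copy of the table, and the first run of a query often includes warm-up cost.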

## Energy Demand Forecasting Model (Gold Layer)

### Business Value of Predictive Analytics

Energy demand forecasting enables utilities to:

- **Optimize grid operations**: Predict and prevent peak demand issues
- **Improve pricing strategies**: Dynamic pricing based on predicted demand
- **Enable demand response**: Encourage conservation during peak times
- **Reduce infrastructure costs**: Better capacity planning

### ML Pipeline Using Gold Layer Features

We'll train a Random Forest model using our gold layer forecasting features to predict hourly energy consumption.
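For forecasting, the notebook splits training and test data by date rather than at random, so the model is always evaluated on readings later than those it was trained on. A plain-Python sketch of that split, using the same 2024-03-01 cutoff and hypothetical sample rows:

```python
from datetime import datetime

def chronological_split(rows, cutoff, time_key="reading_date"):
    """Split rows into (train, test): train strictly before cutoff, test at or after it."""
    train = [r for r in rows if r[time_key] < cutoff]
    test = [r for r in rows if r[time_key] >= cutoff]
    return train, test

# Hypothetical readings around the cutoff
rows = [
    {"reading_date": datetime(2024, 2, 28, 23), "consumption": 14.2},
    {"reading_date": datetime(2024, 2, 29, 23), "consumption": 15.1},
    {"reading_date": datetime(2024, 3, 1, 0), "consumption": 13.7},
    {"reading_date": datetime(2024, 3, 14, 23), "consumption": 16.0},
]
train, test = chronological_split(rows, cutoff=datetime(2024, 3, 1))
print(len(train), "training rows;", len(test), "test rows")
```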

In [None]:
# Train energy demand forecasting model using gold layer features
# Demonstrates end-to-end ML pipeline with medallion architecture

from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline
from pyspark.sql.functions import col

print("=== Energy Demand Forecasting Model ===\n")

# Load gold layer ML features and filter out null values for training
df_ml = spark.table("energy.gold.energy_forecasting_features") \
    .filter(col("prev_hour_consumption").isNotNull()) \
    .filter(col("reading_date") < "2024-03-15") \
    .na.drop()  # Drop any remaining null values

print(f"ML dataset: {df_ml.count()} records after null removal")

# Split data for training and testing
train_data = df_ml.filter("reading_date < '2024-03-01'")
test_data = df_ml.filter("reading_date >= '2024-03-01'")

print(f"Training set: {train_data.count()} records")
print(f"Testing set: {test_data.count()} records")

# Prepare features for ML
feature_cols = [
    "hour_of_day", "day_of_week", "month_of_year", "is_weekend",
    "prev_hour_consumption", "prev_day_consumption", 
    "rolling_24h_avg", "rolling_7d_avg",
    "peak_demand", "efficiency_rating", "seasonal_factor",
    "energy_type_index", "location_index", "consumption_category_index"
]

# Encode categorical variables
energy_type_indexer = StringIndexer(inputCol="energy_type", outputCol="energy_type_index")
location_indexer = StringIndexer(inputCol="location", outputCol="location_index")
category_indexer = StringIndexer(inputCol="consumption_category", outputCol="consumption_category_index")

# Create pipeline with handleInvalid="skip" to handle any remaining nulls
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features", handleInvalid="skip")
rf = RandomForestRegressor(featuresCol="features", labelCol="consumption", numTrees=50, seed=42)

pipeline = Pipeline(stages=[energy_type_indexer, location_indexer, category_indexer, assembler, rf])

# Train model
print("\nTraining Random Forest model...")
model = pipeline.fit(train_data)
print("Model training completed!")

# Make predictions
predictions = model.transform(test_data)
print(f"\nGenerated predictions for {predictions.count()} test records")

# Evaluate model
evaluator_rmse = RegressionEvaluator(labelCol="consumption", predictionCol="prediction", metricName="rmse")
evaluator_r2 = RegressionEvaluator(labelCol="consumption", predictionCol="prediction", metricName="r2")

rmse = evaluator_rmse.evaluate(predictions)
r2 = evaluator_r2.evaluate(predictions)

print(f"\nModel Performance:")
print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.4f}")

# Show sample predictions
print("\nSample Predictions:")
predictions.select("meter_id", "reading_date", "consumption", "prediction", "energy_type").show(10)

=== Energy Demand Forecasting Model ===

ML dataset: 3504000 records after null removal
Training set: 2832000 records
Testing set: 672000 records

Training Random Forest model...
Model training completed!

Generated predictions for 672000 test records

Model Performance:
RMSE: 265.71
R² Score: 0.8226

Sample Predictions:


+---------+-------------------+-----------+-------------------+-----------+
| meter_id|       reading_date|consumption|         prediction|energy_type|
+---------+-------------------+-----------+-------------------+-----------+
|MTR000004|2024-03-01 00:00:00|   -204.261|-193.54698996455508|      Solar|
|MTR000004|2024-03-01 01:00:00|    147.416| 128.95653049569464|Electricity|
|MTR000004|2024-03-01 02:00:00|     68.205|  87.40272657988953|      Water|
|MTR000004|2024-03-01 03:00:00|    121.664| 115.12839032370157|      Water|
|MTR000004|2024-03-01 04:00:00|    555.751|  654.6475498955411|      Water|
|MTR000004|2024-03-01 05:00:00|     34.831|  31.40105961988671|Natural Gas|
|MTR000004|2024-03-01 06:00:00|    825.667|  628.9479734113935|      Water|
|MTR000004|2024-03-01 07:00:00|    656.253|  603.6565043644263|      Water|
|MTR000004|2024-03-01 08:00:00|   -309.556|-221.52657571314506|      Solar|
|MTR000004|2024-03-01 09:00:00|     -8.386| -32.25039479438769|      Solar|
+---------+-
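For reference, the two metrics reported above can be computed by hand. The sketch below implements the standard RMSE and R² definitions in plain Python on a handful of hypothetical values; it does not reproduce the notebook's results.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: typical error size, in the units of the label."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(math.fsum(squared_errors) / len(squared_errors))

def r2(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean = math.fsum(actual) / len(actual)
    ss_res = math.fsum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = math.fsum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical labels and predictions
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
print(f"RMSE: {rmse(actual, predicted):.4f}")
print(f"R2:   {r2(actual, predicted):.4f}")
```

Because RMSE is expressed in the label's units, a single figure pooled across energy types with very different scales (for example industrial water versus residential electricity) is likely dominated by the largest consumers; computing it per energy type may be more informative.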

## Key Takeaways: Medallion Architecture with Liquid Clustering

### Architecture Benefits Demonstrated

1. **Bronze Layer**: Raw data ingestion with data quality issues preserved
2. **Silver Layer**: Data cleaning, validation, and business rule application
3. **Gold Layer**: Business analytics and ML-ready feature engineering

### Liquid Clustering Advantages

- **Automatic optimization**: No manual partitioning or Z-Ordering required
- **Query performance**: Fast aggregations across all medallion layers
- **Storage efficiency**: Optimized data layout for each layer's access patterns
- **Scalability**: Handles large-scale energy datasets efficiently

### Business Value Delivered

- **Data governance**: Clear data lineage from raw ingestion to business insights
- **Analytics acceleration**: Fast queries for real-time dashboards and reporting
- **ML readiness**: Feature engineering optimized for predictive modeling
- **Operational efficiency**: Automated data quality and transformation pipelines

### AIDP Platform Advantages

- **Unified analytics**: Seamless data engineering and ML workflows
- **Performance optimization**: Delta tables with liquid clustering
- **Enterprise governance**: Multi-layer data organization
- **Scalable processing**: Distributed computing for large energy datasets

This notebook demonstrates how Oracle AI Data Platform combines medallion architecture principles with advanced Delta Lake features to deliver a complete data analytics solution for energy and utilities use cases.