# Transportation Analytics: Medallion Architecture Demo with Delta Liquid Clustering

## Overview

This notebook demonstrates the **Medallion Architecture** in Oracle AI Data Platform (AIDP) Workbench using a transportation and logistics analytics use case. The medallion architecture organizes data into Bronze, Silver, and Gold layers, with Delta Liquid Clustering providing automatic optimization at each layer.

### What is Medallion Architecture?

- **Bronze Layer**: Raw, unprocessed data as ingested from source systems
- **Silver Layer**: Cleaned, enriched, and transformed data with business logic applied
- **Gold Layer**: Aggregated, business-ready data optimized for analytics and ML

### What is Liquid Clustering?

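Liquid clustering co-locates rows that share similar values in the clustering columns you declare with `CLUSTER BY`, so queries that filter on those columns can skip files that cannot contain matching rows. The layout is maintained automatically during data ingestion and maintenance operations, without a fixed partition scheme to manage.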
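To build intuition for why co-locating similar rows helps, here is a minimal pure-Python sketch. It is a simplified stand-in, not how Delta implements clustering: it uses a plain one-dimensional sort, and the file size, vehicle IDs, and `files_to_scan` helper are all assumptions made for illustration. It counts how many "files" have a min/max range that could contain a given `vehicle_id`.

```python
import random

def files_to_scan(rows, rows_per_file, wanted_key):
    """Count files whose [min, max] key range could contain wanted_key."""
    files = [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]
    return sum(1 for f in files if min(f) <= wanted_key <= max(f))

random.seed(0)
# 1,000 rows spread over 50 hypothetical vehicle IDs, arriving in random order
vehicle_ids = [f"VH{n:04d}" for n in range(1, 51)]
rows = [random.choice(vehicle_ids) for _ in range(1000)]

# Same rows, same file size: only the physical ordering differs
unclustered = files_to_scan(rows, rows_per_file=100, wanted_key="VH0025")
clustered = files_to_scan(sorted(rows), rows_per_file=100, wanted_key="VH0025")

print(f"Files to scan without clustering: {unclustered} of 10")
print(f"Files to scan with clustering:    {clustered} of 10")
```

With randomly ordered rows, nearly every file's key range brackets the wanted vehicle, so nearly every file must be read. Once rows for the same vehicle sit together, only the one or two files that actually hold them qualify.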

### Use Case: Fleet Management and Route Optimization

We'll analyze transportation fleet operations and logistics data across all three medallion layers, incorporating machine learning for predictive maintenance.

### AIDP Environment Setup

This notebook uses the Spark session (`spark`) that your AIDP environment already provides, so no session setup is needed.

## Step 1: Create Transportation Catalog and Schemas

### Medallion Schema Structure

We'll create separate schemas for each medallion layer:
- `transportation.bronze`: Raw data storage
- `transportation.silver`: Cleaned and enriched data
- `transportation.gold`: Business analytics and ML-ready data

In [None]:
# Create transportation catalog and medallion schemas

# In AIDP, catalogs provide data isolation and governance

spark.sql("CREATE CATALOG IF NOT EXISTS transportation")

spark.sql("CREATE SCHEMA IF NOT EXISTS transportation.bronze")
spark.sql("CREATE SCHEMA IF NOT EXISTS transportation.silver")
spark.sql("CREATE SCHEMA IF NOT EXISTS transportation.gold")

print("Transportation catalog and medallion schemas created successfully!")
print("- Bronze: Raw data ingestion")
print("- Silver: Cleaned and enriched data")
print("- Gold: Analytics and ML-ready data")

## Step 2: Bronze Layer - Raw Data Ingestion

### Bronze Layer Design

The bronze layer stores raw data as received, applying only minimal transformations:

- **Data quality checks**: Basic validation
- **Partitioning/Clustering**: Optimized for ingestion patterns
- **Retention**: Raw data preserved for reprocessing
- **Schema**: Flexible to accommodate source system changes
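The design points above can be sketched in plain Python before moving to Spark. This is an illustrative sketch only: `to_bronze`, `missing_required`, and the choice of `REQUIRED_FIELDS` are hypothetical helpers, not part of the notebook or of AIDP. The idea is that Bronze attaches ingestion metadata and runs only basic checks, while leaving imperfect values such as a null fuel reading in place.

```python
from datetime import datetime, timezone

# Assumed minimum set of fields for a record to be traceable (hypothetical choice)
REQUIRED_FIELDS = ("vehicle_id", "trip_date", "route_id")

def to_bronze(record, source_system):
    """Return a copy of the raw record with ingestion metadata attached."""
    bronze = dict(record)  # keep every source field untouched
    bronze["source_system"] = source_system
    bronze["ingestion_timestamp"] = datetime.now(timezone.utc)
    return bronze

def missing_required(record):
    """List required fields that are absent or None (basic validation only)."""
    return [field for field in REQUIRED_FIELDS if record.get(field) is None]

raw = {
    "vehicle_id": "VH0001",
    "trip_date": datetime(2024, 8, 26, 14, 18),
    "route_id": "RT_NYC_MAN_001",
    "distance": 52.92,
    "fuel_consumed": None,  # missing sensor reading, preserved as-is in Bronze
}

row = to_bronze(raw, "fleet_sensor")
print(missing_required(row))  # [] because the null fuel value is not a required field
print(sorted(row))
```

Rejecting or repairing the null fuel reading is deliberately left to the Silver layer, so the raw record remains available for reprocessing.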

In [None]:
# Create Bronze Layer: Raw fleet trips table

# CLUSTER BY vehicle_id for efficient ingestion and vehicle-based queries

spark.sql("""
CREATE TABLE IF NOT EXISTS transportation.bronze.fleet_trips_raw (
    vehicle_id STRING,
    trip_date TIMESTAMP,
    route_id STRING,
    distance DECIMAL(8,2),
    duration DECIMAL(6,2),
    fuel_consumed DECIMAL(6,2),
    load_factor INT,
    ingestion_timestamp TIMESTAMP,
    source_system STRING
)
USING DELTA
CLUSTER BY (vehicle_id)
""")

print("Bronze layer table created successfully!")
print("Liquid clustering by vehicle_id optimizes for ingestion patterns.")

In [None]:
# Generate and ingest raw transportation fleet data

# Standard-library imports used by the data generator

import random
from datetime import datetime, timedelta

# Vehicle performance profiles to create more interesting data
VEHICLE_PROFILES = {
    'high_performer': {
        'fuel_multiplier': 0.8,  # Better fuel efficiency
        'efficiency_variation': 0.1,  # Low variation
        'load_factor_bonus': 10,
        'maintenance_risk': 0.05,  # 5% chance of issues
        'trip_frequency': 'high'
    },
    'medium_performer': {
        'fuel_multiplier': 1.0,  # Average efficiency
        'efficiency_variation': 0.3,  # Medium variation
        'load_factor_bonus': 0,
        'maintenance_risk': 0.3,  # 30% chance of issues
        'trip_frequency': 'medium'
    },
    'low_performer': {
        'fuel_multiplier': 1.4,  # Poor fuel efficiency
        'efficiency_variation': 0.5,  # High variation
        'load_factor_bonus': -5,
        'maintenance_risk': 0.7,  # 70% chance of issues
        'trip_frequency': 'low'
    }
}

# Base trip parameters by route type (simulating raw sensor data with some noise)
TRIP_PARAMS = {
    'Urban Delivery': {'avg_distance': 45, 'avg_duration': 120, 'avg_fuel': 8.5, 'load_factor': 85},
    'Long-haul': {'avg_distance': 450, 'avg_duration': 480, 'avg_fuel': 65.0, 'load_factor': 92},
    'Local Transport': {'avg_distance': 120, 'avg_duration': 180, 'avg_fuel': 15.2, 'load_factor': 78},
    'Express Delivery': {'avg_distance': 80, 'avg_duration': 90, 'avg_fuel': 12.8, 'load_factor': 95}
}

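# Route types and route IDs referenced by the generator below
# (route IDs are the five that appear in this notebook's output)
ROUTE_TYPES = list(TRIP_PARAMS.keys())
ROUTES = ["RT_NYC_MAN_001", "RT_LAX_SFO_002", "RT_CHI_DET_003", "RT_HOU_DAL_004", "RT_MIA_ORL_005"]

# Generate raw fleet trip records (simulating IoT sensor data)
raw_trip_data = []
base_date = datetime(2024, 1, 1)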

# Create 500 vehicles with different performance profiles
for vehicle_num in range(1, 501):
    vehicle_id = f"VH{vehicle_num:04d}"
    
    # Assign vehicle to performance profile (weighted distribution)
    profile_weights = [0.3, 0.5, 0.2]  # 30% high, 50% medium, 20% low performers
    profile_name = random.choices(list(VEHICLE_PROFILES.keys()), weights=profile_weights)[0]
    profile = VEHICLE_PROFILES[profile_name]
    
    # Determine number of trips based on profile
    if profile['trip_frequency'] == 'high':
        num_trips = random.randint(45, 60)
    elif profile['trip_frequency'] == 'medium':
        num_trips = random.randint(30, 45)
    else:  # low
        num_trips = random.randint(20, 30)
    
    for i in range(num_trips):
        # Spread trips over 12 months
        days_offset = random.randint(0, 365)
        trip_date = base_date + timedelta(days=days_offset)
        
        # Add realistic timing (more trips during business hours)
        hour_weights = [1, 1, 1, 1, 1, 3, 8, 10, 12, 10, 8, 6, 8, 9, 8, 7, 6, 5, 3, 2, 2, 1, 1, 1]
        hours_offset = random.choices(range(24), weights=hour_weights)[0]
        trip_date = trip_date.replace(hour=hours_offset, minute=random.randint(0, 59), second=0, microsecond=0)
        
        # Select route type
        route_type = random.choice(ROUTE_TYPES)
        params = TRIP_PARAMS[route_type]
        
        # Apply vehicle profile adjustments
        base_fuel = params['avg_fuel'] * profile['fuel_multiplier']
        base_load_factor = params['load_factor'] + profile['load_factor_bonus']
        
        # Calculate trip metrics with vehicle-specific variation
        distance_variation = random.uniform(0.7, 1.4)  # Moderate variation
        distance = round(params['avg_distance'] * distance_variation, 2)
        
        duration_variation = random.uniform(0.8, 1.6)  # Moderate variation
        duration = round(params['avg_duration'] * duration_variation, 2)
        
        fuel_variation = random.uniform(0.8, 1.3)  # Vehicle performance variation
        fuel_consumed = round(base_fuel * fuel_variation, 2)
        
        load_factor_variation = random.randint(-8, 8)  # Smaller variation
        load_factor = max(0, min(100, base_load_factor + load_factor_variation))
        
        # Select specific route
        route_id = random.choice(ROUTES)
        
        # Add some raw data characteristics (nulls, duplicates possible)
        if random.random() < 0.02:  # 2% chance of missing fuel data
            fuel_consumed = None
        
        raw_trip_data.append({
            "vehicle_id": vehicle_id,
            "trip_date": trip_date,
            "route_id": route_id,
            "distance": distance,
            "duration": duration,
            "fuel_consumed": fuel_consumed,
            "load_factor": load_factor,
            "source_system": "fleet_sensor"
        })

print(f"Generated {len(raw_trip_data)} raw fleet trip records (simulating IoT sensor data)")
print("Raw data includes noise, outliers, and potential missing values as would be expected from sensors.")
print("Vehicle profiles: High performers (30%), Medium performers (50%), Low performers (20%)")

Generated 20077 raw fleet trip records (simulating IoT sensor data)
Raw data includes noise, outliers, and potential missing values as would be expected from sensors.
Vehicle profiles: High performers (30%), Medium performers (50%), Low performers (20%)


In [None]:
# Insert raw data into Bronze layer

# Create DataFrame from generated raw data
df_raw_trips = spark.createDataFrame(raw_trip_data)

# Display schema and sample data
print("Bronze Layer DataFrame Schema:")
df_raw_trips.printSchema()

print("\nSample Raw Data:")
df_raw_trips.show(5)

# Insert data into Bronze table
df_raw_trips.write.mode("overwrite").saveAsTable("transportation.bronze.fleet_trips_raw")

print(f"\nSuccessfully ingested {df_raw_trips.count()} raw records into Bronze layer")
print("Bronze layer preserves raw data with all its imperfections for auditability.")

Bronze Layer DataFrame Schema:
root
 |-- distance: double (nullable = true)
 |-- duration: double (nullable = true)
 |-- fuel_consumed: double (nullable = true)
 |-- load_factor: long (nullable = true)
 |-- route_id: string (nullable = true)
 |-- source_system: string (nullable = true)
 |-- trip_date: timestamp (nullable = true)
 |-- vehicle_id: string (nullable = true)


Sample Raw Data:
+--------+--------+-------------+-----------+--------------+-------------+-------------------+----------+
|distance|duration|fuel_consumed|load_factor|      route_id|source_system|          trip_date|vehicle_id|
+--------+--------+-------------+-----------+--------------+-------------+-------------------+----------+
|   52.92|  104.41|         6.96|         98|RT_NYC_MAN_001| fleet_sensor|2024-08-26 14:18:00|    VH0001|
|  139.76|  273.15|        15.31|         86|RT_MIA_ORL_005| fleet_sensor|2024-04-29 07:36:00|    VH0001|
|    56.5|  121.74|         8.77|        100|RT_MIA_ORL_005| fleet_sensor|2024


Successfully ingested 20077 raw records into Bronze layer
Bronze layer preserves raw data with all its imperfections for auditability.


## Step 3: Silver Layer - Data Cleaning and Enrichment

### Silver Layer Design

The silver layer transforms raw data into a clean, enriched, business-ready format:

- **Data Quality**: Validation, deduplication, outlier treatment
- **Business Logic**: Enrichment with derived metrics
- **Standardization**: Consistent formats and units
- **Optimization**: Clustering for analytical query patterns
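As a worked example of these rules, the sketch below applies the notebook's Bronze-to-Silver logic (null and positivity checks, the `distance * 0.25` fuel fallback, the derived `fuel_efficiency` and `trip_speed`, and the two range filters) to single records in plain Python. The `clean_trip` helper is hypothetical and written only for illustration; the first sample record reuses values shown in the Bronze sample output.

```python
from datetime import datetime

def clean_trip(trip):
    """Apply the Silver rules to one trip dict; return None if it is rejected."""
    if trip["vehicle_id"] is None or trip["trip_date"] is None:
        return None
    if trip["distance"] <= 0 or trip["duration"] <= 0:
        return None
    fuel = trip["fuel_consumed"]
    if fuel is None:
        fuel = trip["distance"] * 0.25  # same fallback the notebook uses
    fuel_efficiency = round(trip["distance"] / fuel, 2)                # distance per unit of fuel
    trip_speed = round(trip["distance"] / (trip["duration"] / 60), 2)  # duration is in minutes
    if not (5 <= fuel_efficiency <= 50) or not (10 <= trip_speed <= 120):
        return None
    return {**trip, "fuel_consumed": fuel, "fuel_efficiency": fuel_efficiency,
            "trip_speed": trip_speed, "is_valid": True}

observed = {"vehicle_id": "VH0001", "trip_date": datetime(2024, 8, 26, 14, 18),
            "distance": 52.92, "duration": 104.41, "fuel_consumed": 6.96}
no_fuel = {**observed, "fuel_consumed": None}

cleaned = clean_trip(observed)
print(cleaned["fuel_efficiency"], cleaned["trip_speed"])  # 7.6 30.41
print(clean_trip(no_fuel))                                # None
```

The second call shows a side effect worth knowing about: a fuel value imputed as `distance * 0.25` always gives a `fuel_efficiency` of exactly 4.0, which falls below the lower bound of 5, so records with missing fuel data are imputed and then filtered out.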

In [None]:
# Create Silver Layer: Cleaned and enriched fleet trips

# CLUSTER BY (vehicle_id, trip_date) for optimal analytical queries

spark.sql("""
CREATE TABLE IF NOT EXISTS transportation.silver.fleet_trips_clean (
    vehicle_id STRING,
    trip_date TIMESTAMP,
    route_id STRING,
    distance DECIMAL(8,2),
    duration DECIMAL(6,2),
    fuel_consumed DECIMAL(6,2),
    load_factor INT,
    fuel_efficiency DECIMAL(6,2),  
    trip_speed DECIMAL(6,2),       
    is_valid BOOLEAN,            
    processing_timestamp TIMESTAMP
)
USING DELTA
CLUSTER BY (vehicle_id, trip_date)
""")

print("Silver layer table created successfully!")
print("Clustering by (vehicle_id, trip_date) optimizes for analytical queries.")

In [None]:
# Process Bronze to Silver: Clean and enrich the data
from pyspark.sql import functions as F

# Read from Bronze layer
bronze_data = spark.table("transportation.bronze.fleet_trips_raw")

print(f"Processing {bronze_data.count()} records from Bronze layer")

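# Silver layer transformations
# Note: a missing fuel value imputed as distance * 0.25 yields fuel_efficiency = 4.0,
# so the BETWEEN 5 AND 50 filter below removes every imputed row.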
silver_data = bronze_data \
    .filter("vehicle_id IS NOT NULL") \
    .filter("trip_date IS NOT NULL") \
    .filter("distance > 0") \
    .filter("duration > 0") \
    .withColumn("fuel_consumed", 
                F.when(F.col("fuel_consumed").isNull(), 
                       F.col("distance") * 0.25).otherwise(F.col("fuel_consumed"))) \
    .withColumn("fuel_efficiency", F.round(F.col("distance") / F.col("fuel_consumed"), 2)) \
    .withColumn("trip_speed", F.round(F.col("distance") / (F.col("duration") / 60), 2)) \
    .withColumn("is_valid", F.lit(True)) \
    .filter("fuel_efficiency BETWEEN 5 AND 50") \
    .filter("trip_speed BETWEEN 10 AND 120") \
    .dropDuplicates(["vehicle_id", "trip_date"])  # Remove duplicates

print(f"After cleaning: {silver_data.count()} valid records")

# Show data quality improvements
print("\nData Quality Summary:")
silver_data.select("vehicle_id", "trip_date", "distance", "fuel_efficiency", "trip_speed", "is_valid").show(5)

Processing 20077 records from Bronze layer
After cleaning: 16135 valid records

Data Quality Summary:
+----------+-------------------+--------+---------------+----------+--------+
|vehicle_id|          trip_date|distance|fuel_efficiency|trip_speed|is_valid|
+----------+-------------------+--------+---------------+----------+--------+
|    VH0002|2024-01-21 07:27:00|   152.8|          11.62|     51.69|    true|
|    VH0015|2024-06-23 06:08:00|  447.69|           5.48|     58.69|    true|
|    VH0022|2024-03-21 17:26:00|   80.66|           7.84|     35.71|    true|
|    VH0022|2024-06-16 20:52:00|   57.04|           8.36|     22.71|    true|
|    VH0033|2024-08-28 14:10:00|  138.74|           6.71|     36.35|    true|
+----------+-------------------+--------+---------------+----------+--------+
only showing top 5 rows



In [None]:
# Insert cleaned data into Silver layer

silver_data.write.mode("overwrite").saveAsTable("transportation.silver.fleet_trips_clean")

print(f"Successfully processed and stored {silver_data.count()} cleaned records in Silver layer")
print("Silver layer provides validated, enriched data for downstream analytics.")

Successfully processed and stored 16135 cleaned records in Silver layer
Silver layer provides validated, enriched data for downstream analytics.


## Step 4: Gold Layer - Business Analytics and ML-Ready Data

### Gold Layer Design

The gold layer provides aggregated, business-ready data optimized for:

- **Analytics**: Pre-computed aggregates and KPIs
- **ML Training**: Feature engineering and model-ready datasets
- **Reporting**: Business metrics and performance indicators
- **Optimization**: Clustering for specific analytical workloads
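To show what a pre-computed aggregate looks like for one entity, here is a small pure-Python sketch of the per-vehicle rollup, including the rule-based maintenance flag with the same thresholds the notebook applies. The `vehicle_summary` helper and both sample vehicles are hypothetical; `statistics.stdev` is used because it computes the sample standard deviation, which is also what Spark's `stddev` returns.

```python
from statistics import mean, stdev

def vehicle_summary(vehicle_id, trips):
    """Aggregate one vehicle's cleaned trips into Gold-style metrics."""
    efficiencies = [t["fuel_efficiency"] for t in trips]
    total_distance = round(sum(t["distance"] for t in trips), 2)
    avg_efficiency = round(mean(efficiencies), 2)
    efficiency_stddev = round(stdev(efficiencies), 2)  # sample standard deviation
    # Rule-based flag: any one of the three conditions marks the vehicle
    flagged = total_distance > 80000 or avg_efficiency < 8 or efficiency_stddev > 4
    return {
        "vehicle_id": vehicle_id,
        "total_trips": len(trips),
        "total_distance": total_distance,
        "avg_fuel_efficiency": avg_efficiency,
        "efficiency_stddev": efficiency_stddev,
        "maintenance_score": 1.0 if flagged else 0.0,
    }

# Hypothetical cleaned trips for two vehicles
steady = [{"distance": 100.0, "fuel_efficiency": 8.5},
          {"distance": 200.0, "fuel_efficiency": 9.0},
          {"distance": 300.0, "fuel_efficiency": 9.5}]
thirsty = [{"distance": 100.0, "fuel_efficiency": 6.0},
           {"distance": 200.0, "fuel_efficiency": 7.0},
           {"distance": 300.0, "fuel_efficiency": 8.0}]

print(vehicle_summary("VH_A", steady))
print(vehicle_summary("VH_B", thirsty))
```

Only the second vehicle is flagged, because its average efficiency of 7.0 is below the threshold of 8.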

In [None]:
# Create Gold Layer: Vehicle performance analytics

# CLUSTER BY vehicle_id for vehicle-centric analytics and ML

spark.sql("""
CREATE TABLE IF NOT EXISTS transportation.gold.vehicle_performance (
    vehicle_id STRING,
    total_trips INT,
    total_distance DECIMAL(10,2),
    total_duration DECIMAL(10,2),
    total_fuel DECIMAL(10,2),
    avg_fuel_efficiency DECIMAL(6,2),
    avg_load_factor DECIMAL(6,2),
    efficiency_stddev DECIMAL(6,2),
    routes_used INT,
    active_days INT,
    avg_daily_trips DECIMAL(6,2),
    maintenance_score DECIMAL(6,2),
    last_trip_date TIMESTAMP,
    created_at TIMESTAMP 
)
USING DELTA
CLUSTER BY (vehicle_id)
""")

print("Gold layer vehicle performance table created successfully!")

In [None]:
# Create Gold Layer: Route performance analytics

spark.sql("""
CREATE TABLE IF NOT EXISTS transportation.gold.route_analytics (
    route_id STRING,
    total_trips INT,
    avg_distance DECIMAL(8,2),
    avg_duration DECIMAL(8,2),
    avg_speed DECIMAL(6,2),
    avg_load_factor DECIMAL(6,2),
    avg_fuel_efficiency DECIMAL(6,2),
    total_distance DECIMAL(12,2),
    total_fuel DECIMAL(10,2),
    vehicles_used INT,
    efficiency_rank INT,
    created_at TIMESTAMP
)
USING DELTA
CLUSTER BY (route_id)
""")

print("Gold layer route analytics table created successfully!")

Gold layer route analytics table created successfully!


In [None]:
# Process Silver to Gold: Create vehicle performance aggregates

# Read from Silver layer
silver_trips = spark.table("transportation.silver.fleet_trips_clean")

# Create vehicle performance aggregates
vehicle_performance = silver_trips.groupBy("vehicle_id").agg(
    F.count("*").alias("total_trips"),
    F.round(F.sum("distance"), 2).alias("total_distance"),
    F.round(F.sum("duration"), 2).alias("total_duration"),
    F.round(F.sum("fuel_consumed"), 2).alias("total_fuel"),
    F.round(F.avg("fuel_efficiency"), 2).alias("avg_fuel_efficiency"),
    F.round(F.avg("load_factor"), 2).alias("avg_load_factor"),
    F.round(F.stddev("fuel_efficiency"), 2).alias("efficiency_stddev"),
    F.countDistinct("route_id").alias("routes_used"),
    F.countDistinct(F.date_format("trip_date", "yyyy-MM-dd")).alias("active_days"),
    F.round(F.count("*") / F.countDistinct(F.date_format("trip_date", "yyyy-MM-dd")), 2).alias("avg_daily_trips"),
    F.max("trip_date").alias("last_trip_date")
).withColumn(
    "maintenance_score",
    F.when(
        (F.col("total_distance") > 80000) |  # High cumulative distance
        (F.col("avg_fuel_efficiency") < 8) |  # Low average fuel efficiency
        (F.col("efficiency_stddev") > 4),     # Erratic fuel efficiency across trips
        F.lit(1.0)
    ).otherwise(F.lit(0.0))
)

print(f"Created performance metrics for {vehicle_performance.count()} vehicles")
vehicle_performance.show(5)

Created performance metrics for 500 vehicles


+----------+-----------+--------------+--------------+----------+-------------------+---------------+-----------------+-----------+-----------+---------------+-------------------+-----------------+
|vehicle_id|total_trips|total_distance|total_duration|total_fuel|avg_fuel_efficiency|avg_load_factor|efficiency_stddev|routes_used|active_days|avg_daily_trips|     last_trip_date|maintenance_score|
+----------+-----------+--------------+--------------+----------+-------------------+---------------+-----------------+-----------+-----------+---------------+-------------------+-----------------+
|    VH0387|         49|       9798.69|       13893.5|   1166.11|               8.63|          95.29|             2.32|          5|         48|           1.02|2024-12-28 15:31:00|              0.0|
|    VH0083|         44|       7702.21|       11508.6|    937.15|               8.12|          94.93|             2.29|          5|         41|           1.07|2024-12-28 08:33:00|              0.0|
|    VH035

In [None]:
# Process Silver to Gold: Create route analytics
from pyspark.sql.window import Window

route_analytics = silver_trips.groupBy("route_id").agg(
    F.count("*").alias("total_trips"),
    F.round(F.avg("distance"), 2).alias("avg_distance"),
    F.round(F.avg("duration"), 2).alias("avg_duration"),
    F.round(F.avg("trip_speed"), 2).alias("avg_speed"),
    F.round(F.avg("load_factor"), 2).alias("avg_load_factor"),
    F.round(F.avg("fuel_efficiency"), 2).alias("avg_fuel_efficiency"),
    F.round(F.sum("distance"), 2).alias("total_distance"),
    F.round(F.sum("fuel_consumed"), 2).alias("total_fuel"),
    F.countDistinct("vehicle_id").alias("vehicles_used")
).withColumn(
    "efficiency_rank",
    F.row_number().over(
        Window.orderBy(F.desc("avg_fuel_efficiency"))
    )
)

print(f"Created analytics for {route_analytics.count()} routes")
route_analytics.orderBy("efficiency_rank").show()

Created analytics for 5 routes


+--------------+-----------+------------+------------+---------+---------------+-------------------+--------------+----------+-------------+---------------+
|      route_id|total_trips|avg_distance|avg_duration|avg_speed|avg_load_factor|avg_fuel_efficiency|total_distance|total_fuel|vehicles_used|efficiency_rank|
+--------------+-----------+------------+------------+---------+---------------+-------------------+--------------+----------+-------------+---------------+
|RT_LAX_SFO_002|       3210|      200.28|      273.02|    41.79|          90.95|               7.97|     642898.98|   82176.9|          492|              1|
|RT_HOU_DAL_004|       3241|       199.0|      271.76|    41.57|          90.76|               7.96|      644968.9|  82156.58|          485|              2|
|RT_MIA_ORL_005|       3239|      193.26|      264.67|     41.6|          91.01|                7.9|     625966.89|  80728.67|          486|              3|
|RT_CHI_DET_003|       3204|      203.44|      275.12|    

In [None]:
# Insert Gold layer data

vehicle_performance.write.mode("overwrite").saveAsTable("transportation.gold.vehicle_performance")
route_analytics.write.mode("overwrite").saveAsTable("transportation.gold.route_analytics")

print("Gold layer tables populated successfully!")
print(f"- Vehicle performance: {vehicle_performance.count()} records")
print(f"- Route analytics: {route_analytics.count()} records")
print("Gold layer provides optimized analytics and ML-ready data.")

Gold layer tables populated successfully!
- Vehicle performance: 500 records
- Route analytics: 5 records
Gold layer provides optimized analytics and ML-ready data.


## Step 5: Demonstrate Medallion Architecture Benefits

### Query Performance Across Layers

The queries below show how each layer serves a different analytical need: auditing raw records in Bronze, analyzing clean trips in Silver, and reading pre-computed KPIs in Gold.

In [None]:
# Bronze Layer: Raw data audit and investigation

print("=== Bronze Layer: Raw Data Investigation ===")
bronze_audit = spark.sql("""
SELECT 
    source_system,
    COUNT(*) as total_records,
    COUNT(CASE WHEN fuel_consumed IS NULL THEN 1 END) as missing_fuel,
    ROUND(AVG(distance), 2) as avg_distance
FROM transportation.bronze.fleet_trips_raw
GROUP BY source_system
""")
bronze_audit.show()

=== Bronze Layer: Raw Data Investigation ===


+-------------+-------------+------------+------------+
|source_system|total_records|missing_fuel|avg_distance|
+-------------+-------------+------------+------------+
| fleet_sensor|        20077|         400|      183.37|
+-------------+-------------+------------+------------+



In [None]:
# Silver Layer: Clean analytical queries

print("=== Silver Layer: Clean Data Analytics ===")
silver_analysis = spark.sql("""
SELECT 
    vehicle_id,
    COUNT(*) as trips_this_month,
    ROUND(SUM(distance), 2) as monthly_distance,
    ROUND(AVG(fuel_efficiency), 2) as avg_efficiency,
    ROUND(AVG(trip_speed), 2) as avg_speed
FROM transportation.silver.fleet_trips_clean
WHERE DATE_FORMAT(trip_date, 'yyyy-MM') = '2024-06'
GROUP BY vehicle_id
ORDER BY monthly_distance DESC
LIMIT 10
""")
silver_analysis.show()

=== Silver Layer: Clean Data Analytics ===


+----------+----------------+----------------+--------------+---------+
|vehicle_id|trips_this_month|monthly_distance|avg_efficiency|avg_speed|
+----------+----------------+----------------+--------------+---------+
|    VH0022|               9|         2630.95|          8.62|    46.02|
|    VH0458|               7|          2361.6|          8.59|    41.74|
|    VH0215|              11|         2289.59|          7.95|    41.32|
|    VH0264|               6|         2177.42|          8.62|    53.75|
|    VH0214|               6|         2085.41|          8.61|    48.32|
|    VH0050|               5|         1928.98|         10.47|    58.37|
|    VH0205|               6|         1892.79|          8.07|     48.4|
|    VH0489|               6|         1858.77|         10.88|    40.32|
|    VH0471|               3|         1834.45|         12.44|     60.1|
|    VH0493|               5|         1830.21|          9.14|    42.22|
+----------+----------------+----------------+--------------+---

In [None]:
# Gold Layer: Business intelligence and KPIs

print("=== Gold Layer: Business KPIs ===")
gold_kpis = spark.sql("""
SELECT 
    COUNT(*) as total_vehicles,
    ROUND(AVG(avg_fuel_efficiency), 2) as fleet_avg_efficiency,
    ROUND(SUM(total_distance), 0) as total_fleet_distance,
    ROUND(AVG(avg_load_factor), 2) as fleet_avg_load_factor,
    COUNT(CASE WHEN maintenance_score = 1 THEN 1 END) as vehicles_needing_maintenance
FROM transportation.gold.vehicle_performance
""")
gold_kpis.show()

print("\n=== Top Performing Routes ===")
top_routes = spark.sql("""
SELECT route_id, total_trips, avg_fuel_efficiency, efficiency_rank
FROM transportation.gold.route_analytics
ORDER BY efficiency_rank
LIMIT 5
""")
top_routes.show()

=== Gold Layer: Business KPIs ===


+--------------+--------------------+--------------------+---------------------+----------------------------+
|total_vehicles|fleet_avg_efficiency|total_fleet_distance|fleet_avg_load_factor|vehicles_needing_maintenance|
+--------------+--------------------+--------------------+---------------------+----------------------------+
|           500|                7.52|           3224501.0|                88.76|                         329|
+--------------+--------------------+--------------------+---------------------+----------------------------+


=== Top Performing Routes ===


+--------------+-----------+-------------------+---------------+
|      route_id|total_trips|avg_fuel_efficiency|efficiency_rank|
+--------------+-----------+-------------------+---------------+
|RT_LAX_SFO_002|       3210|               7.97|              1|
|RT_HOU_DAL_004|       3241|               7.96|              2|
|RT_MIA_ORL_005|       3239|                7.9|              3|
|RT_CHI_DET_003|       3204|               7.88|              4|
|RT_NYC_MAN_001|       3241|               7.87|              5|
+--------------+-----------+-------------------+---------------+



## Step 6: Machine Learning on Gold Layer Data

### Predictive Maintenance Model

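We use the Gold layer's vehicle performance aggregates to train a predictive maintenance model. Because `maintenance_score` is itself computed in Step 4 from `total_distance`, `avg_fuel_efficiency`, and `efficiency_stddev`, all of which are also model features, the model is learning a known rule. Treat this step as a demonstration of the pipeline mechanics rather than as evidence of predictive power.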
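To make the area-under-ROC metric used in this step concrete, here is a minimal pure-Python sketch that computes AUC by pairwise comparison. The `auc` helper and the six vehicles are hypothetical. The labels follow the notebook's efficiency and variability thresholds (the distance condition is omitted for brevity), and the "model" is just a score that ranks vehicles by low fuel efficiency.

```python
def auc(labels, scores):
    """Area under the ROC curve: fraction of (positive, negative) pairs
    where the positive gets the higher score; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical vehicles: (avg_fuel_efficiency, efficiency_stddev)
vehicles = [(8.6, 2.3), (6.2, 1.9), (8.2, 2.3), (7.4, 2.0), (8.8, 2.1), (5.9, 1.5)]

# Rule-based label using the same thresholds as the Gold layer
labels = [1 if eff < 8 or sd > 4 else 0 for eff, sd in vehicles]

# A trivial scorer: lower efficiency means higher maintenance risk
scores = [-eff for eff, _ in vehicles]

print(auc(labels, scores))  # 1.0
```

A score built from the same column that defines the label separates the classes perfectly, which is why an AUC of 1.0 on a rule-derived label says little about how a model would perform on real maintenance outcomes.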

In [None]:
# Prepare Gold layer data for ML training

from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

# Read vehicle performance data from Gold layer
vehicle_data = spark.table("transportation.gold.vehicle_performance")

print(f"Loading {vehicle_data.count()} vehicle records from Gold layer for ML training")

# Prepare features for maintenance prediction
feature_cols = [
    "total_trips", "total_distance", "total_duration", "total_fuel",
    "avg_fuel_efficiency", "avg_load_factor", "efficiency_stddev",
    "routes_used", "active_days", "avg_daily_trips"
]

# Split data for training
train_data, test_data = vehicle_data.randomSplit([0.8, 0.2], seed=42)

print(f"Training set: {train_data.count()} vehicles")
print(f"Test set: {test_data.count()} vehicles")

# Show maintenance distribution
train_data.groupBy("maintenance_score").count().show()

Loading 500 vehicle records from Gold layer for ML training
Training set: 426 vehicles
Test set: 74 vehicles


+-----------------+-----+
|maintenance_score|count|
+-----------------+-----+
|              0.0|  147|
|              1.0|  279|
+-----------------+-----+



In [None]:
# Train predictive maintenance model

# Feature engineering pipeline
assembler = VectorAssembler(
    inputCols=feature_cols,
    outputCol="features"
)

scaler = StandardScaler(inputCol="features", outputCol="scaled_features")

# Random Forest model
rf = RandomForestClassifier(
    labelCol="maintenance_score",
    featuresCol="scaled_features",
    numTrees=100,
    maxDepth=10
)

# Create and fit pipeline
pipeline = Pipeline(stages=[assembler, scaler, rf])

print("Training predictive maintenance model...")
model = pipeline.fit(train_data)

# Make predictions
predictions = model.transform(test_data)

# Evaluate model
evaluator = BinaryClassificationEvaluator(labelCol="maintenance_score", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)

print(f"Model AUC: {auc:.4f}")

# Show predictions
predictions.select(
    "vehicle_id", "total_distance", "avg_fuel_efficiency", 
    "maintenance_score", "prediction", "probability"
).show(10)

Training predictive maintenance model...


Model AUC: 1.0000


+----------+--------------+-------------------+-----------------+----------+-----------+
|vehicle_id|total_distance|avg_fuel_efficiency|maintenance_score|prediction|probability|
+----------+--------------+-------------------+-----------------+----------+-----------+
|    VH0003|       8485.22|               8.46|              0.0|       0.0|  [1.0,0.0]|
|    VH0007|       1812.16|               6.24|              1.0|       1.0|  [0.0,1.0]|
|    VH0009|      10061.57|               8.19|              0.0|       0.0|  [1.0,0.0]|
|    VH0014|       2822.83|               6.21|              1.0|       1.0|  [0.0,1.0]|
|    VH0020|       4554.17|               5.85|              1.0|       1.0|  [0.0,1.0]|
|    VH0024|       6996.05|               7.36|              1.0|       1.0|  [0.0,1.0]|
|    VH0030|      11055.55|               8.75|              0.0|       0.0|  [1.0,0.0]|
|    VH0036|       7073.37|               7.25|              1.0|       1.0|[0.04,0.96]|
|    VH0046|       55

In [None]:
# Model interpretation and business impact

# Feature importance
rf_model = model.stages[-1]
feature_importance = rf_model.featureImportances

print("=== Feature Importance for Maintenance Prediction ===")
for name, importance in zip(feature_cols, feature_importance):
    print(f"{name}: {importance:.4f}")

# Business impact analysis
maintenance_predictions = predictions.filter("prediction = 1")
vehicles_needing_maintenance = maintenance_predictions.count()
total_test_vehicles = test_data.count()

print(f"\n=== Business Impact Analysis ===")
print(f"Total test vehicles: {total_test_vehicles}")
print(f"Vehicles predicted to need maintenance: {vehicles_needing_maintenance}")
print(f"Percentage flagged for maintenance: {(vehicles_needing_maintenance/total_test_vehicles)*100:.1f}%")

# Cost savings calculation
avg_maintenance_cost = 2500
preventive_maintenance_savings = 0.6
potential_savings = vehicles_needing_maintenance * avg_maintenance_cost * preventive_maintenance_savings

print(f"\nEstimated cost per maintenance event: ${avg_maintenance_cost:,}")
print(f"Potential annual savings from preventive maintenance: ${potential_savings:,.0f}")

# Model performance metrics (each count is computed once and reused)
total_predictions = predictions.count()
true_positives = predictions.filter("prediction = 1 AND maintenance_score = 1").count()
predicted_positives = predictions.filter("prediction = 1").count()
actual_positives = predictions.filter("maintenance_score = 1").count()

accuracy = predictions.filter("maintenance_score = prediction").count() / total_predictions
precision = true_positives / predicted_positives if predicted_positives > 0 else 0
recall = true_positives / actual_positives if actual_positives > 0 else 0

print(f"\nModel Performance:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"AUC: {auc:.4f}")

=== Feature Importance for Maintenance Prediction ===
total_trips: 0.2166
total_distance: 0.0010
total_duration: 0.0156
total_fuel: 0.0003
avg_fuel_efficiency: 0.4895
avg_load_factor: 0.1330
efficiency_stddev: 0.0198
routes_used: 0.0000
active_days: 0.1237
avg_daily_trips: 0.0005



=== Business Impact Analysis ===
Total test vehicles: 74
Vehicles predicted to need maintenance: 51
Percentage flagged for maintenance: 68.9%

Estimated cost per maintenance event: $2,500
Potential annual savings from preventive maintenance: $76,500



Model Performance:
Accuracy: 0.9865
Precision: 0.9804
Recall: 1.0000
AUC: 1.0000
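The reported metrics can be cross-checked with simple arithmetic. The confusion-matrix counts below are inferred from the printed output (74 test vehicles, 51 flagged, precision 0.9804, recall 1.0000), not read from the model, so treat them as a reconstruction.

```python
# Confusion-matrix cells inferred from the reported output (an assumption)
tp, fp, fn, tn = 50, 1, 0, 23
total = tp + fp + fn + tn            # 74 test vehicles

accuracy = (tp + tn) / total         # 73 / 74
precision = tp / (tp + fp)           # 50 / 51
recall = tp / (tp + fn)              # 50 / 50

# Savings estimate uses the notebook's own inputs
flagged = tp + fp                    # 51 vehicles predicted to need maintenance
avg_maintenance_cost = 2500
preventive_maintenance_savings = 0.6
potential_savings = flagged * avg_maintenance_cost * preventive_maintenance_savings

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"Savings:   ${potential_savings:,.0f}")
```

Note that the savings figure multiplies by every flagged vehicle, including the one false positive, and that it depends entirely on the assumed cost and savings rate.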


## Key Takeaways: Medallion Architecture + Liquid Clustering + ML in AIDP

### What We Demonstrated

1. **Medallion Architecture**: Three-layer data organization (Bronze → Silver → Gold)
   - Bronze: Raw data ingestion with realistic vehicle performance profiles
   - Silver: Cleaned, enriched data with business logic
   - Gold: Aggregated analytics and ML-ready datasets

2. **Liquid Clustering**: Automatic optimization at each layer
   - Bronze: Clustered by `vehicle_id` for ingestion patterns
   - Silver: Clustered by `(vehicle_id, trip_date)` for analytical queries
   - Gold: Clustered by entity keys for specific workloads

3. **Progressive Data Quality**: Each layer improves data quality
   - Bronze preserves raw data with vehicle performance variations
   - Silver applies validation and enrichment
   - Gold provides business-ready aggregates

4. **ML Integration**: End-to-end ML pipeline using Gold layer data
   - Feature engineering on aggregated metrics
   - Predictive maintenance model with per-feature importance scores
   - Business impact quantification

### AIDP Advantages

- **Unified Platform**: Seamless data flow from ingestion to ML
- **Governance**: Catalog and schema isolation by layer
- **Performance**: Liquid clustering optimizes each layer's access patterns
- **Scalability**: Handles complex transformations at scale
- **Integration**: Native ML capabilities on processed data

### Business Benefits for Transportation

1. **Data Quality**: Progressive improvement from raw to refined
2. **Analytical Flexibility**: Different layers serve different use cases
3. **Performance**: Optimized queries at each layer
4. **ML Readiness**: Gold layer provides aggregated, model-ready features
5. **Cost Reduction**: Predictive maintenance prevents breakdowns
6. **Operational Excellence**: Data-driven fleet management

### Best Practices for Medallion Architecture

1. **Layer Purpose**: Keep each layer focused on its role
2. **Clustering Strategy**: Choose columns based on primary access patterns
3. **Data Quality**: Implement validation progressively
4. **Schema Evolution**: Allow flexibility in Bronze, standardize in Silver/Gold
5. **Monitoring**: Track data quality and performance metrics
6. **ML Integration**: Use Gold layer for model training and features
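As one way to act on the monitoring practice, the sketch below computes layer-to-layer retention from the record counts printed earlier in this notebook. The `ratio` helper and the `MIN_SILVER_RETENTION` threshold are hypothetical choices for illustration, not AIDP features.

```python
# Record counts reported by this notebook's cells
bronze_records = 20077
silver_records = 16135
missing_fuel_records = 400  # from the Bronze audit query

MIN_SILVER_RETENTION = 0.75  # hypothetical alert threshold

def ratio(part, whole):
    """Share of `whole` represented by `part`, rounded to four decimals."""
    return round(part / whole, 4)

silver_retention = ratio(silver_records, bronze_records)
missing_fuel_rate = ratio(missing_fuel_records, bronze_records)
dropped = bronze_records - silver_records

print(f"Silver retention:  {silver_retention:.1%}")
print(f"Missing fuel rate: {missing_fuel_rate:.1%}")
print(f"Records dropped between Bronze and Silver: {dropped}")

if silver_retention < MIN_SILVER_RETENTION:
    print("WARNING: Silver retention fell below the expected threshold")
```

Tracking these ratios per run makes it easier to notice when a source change or an overly strict filter starts discarding more data than expected.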

### Next Steps

- Implement real-time data ingestion into Bronze layer
- Add more sophisticated Silver layer transformations
- Create additional Gold layer aggregations and KPIs
- Deploy ML models for real-time predictions
- Integrate with fleet management systems
- Scale to production workloads with AIDP

This notebook demonstrates how Oracle AI Data Platform enables sophisticated data architecture patterns while maintaining enterprise-grade performance and governance.