# Manufacturing: Medallion Architecture with Delta Liquid Clustering

## Overview

This notebook demonstrates the **Medallion Architecture** in Oracle AI Data Platform (AIDP) Workbench using Delta Liquid Clustering. The medallion architecture organizes data into three layers (Bronze, Silver, Gold) that progressively refine data quality and structure for different use cases.

### What is the Medallion Architecture?

- **Bronze Layer**: Raw, unprocessed data as ingested from source systems
- **Silver Layer**: Cleaned, standardized, and enriched data
- **Gold Layer**: Business-ready, aggregated data for analytics and reporting

### What is Liquid Clustering?

Liquid clustering automatically optimizes data layout for query performance without requiring manual partitioning or Z-Ordering.

### Use Case: Manufacturing Analytics

We'll process manufacturing production records through all three medallion layers, optimizing each with liquid clustering for equipment monitoring, quality control, and business intelligence.

### AIDP Environment Setup

This notebook leverages the existing Spark session in your AIDP environment.

## Layer 1: Bronze - Raw Data Ingestion

### Bronze Layer Purpose

The bronze layer serves as the landing zone for raw data. Data is ingested in its original format with minimal processing:

- **Raw format**: Data as received from source systems
- **Minimal validation**: Basic schema enforcement
- **Immutable**: Historical data preserved as-is
- **Optimized for ingestion**: Fast write operations

### Table Design

Our bronze `production_records_bronze` table stores raw manufacturing data with liquid clustering optimized for time-based partitioning and equipment queries.

In [None]:
# Create manufacturing catalog and schemas for medallion architecture

# Bronze, Silver, and Gold schemas provide data isolation and governance

spark.sql("CREATE CATALOG IF NOT EXISTS manufacturing")

spark.sql("CREATE SCHEMA IF NOT EXISTS manufacturing.bronze")

spark.sql("CREATE SCHEMA IF NOT EXISTS manufacturing.silver")

spark.sql("CREATE SCHEMA IF NOT EXISTS manufacturing.gold")

print("Manufacturing catalog with Bronze, Silver, and Gold schemas created successfully!")

Manufacturing catalog with Bronze, Silver, and Gold schemas created successfully!


In [None]:
# Create Bronze layer Delta table with liquid clustering

# CLUSTER BY optimizes for time-based queries and equipment monitoring

spark.sql("""

CREATE TABLE IF NOT EXISTS manufacturing.bronze.production_records_bronze (

    machine_id STRING,

    production_date TIMESTAMP,

    product_type STRING,

    units_produced INT,

    defect_count INT,

    production_line STRING,

    cycle_time DECIMAL(5,2),

    ingestion_timestamp TIMESTAMP,

    source_system STRING 

)

USING DELTA

CLUSTER BY (production_date, machine_id)

""")

print("Bronze layer Delta table with liquid clustering created successfully!")
print("Clustering optimizes for time-based queries and equipment monitoring.")

Bronze layer Delta table with liquid clustering created successfully!
Clustering optimizes for time-based queries and equipment monitoring.


In [None]:
# Generate and ingest raw manufacturing production data into Bronze layer

# Using fully qualified imports to avoid conflicts

import random
from datetime import datetime, timedelta

# Define manufacturing data constants
PRODUCT_TYPES = ['Electronics', 'Automotive Parts', 'Consumer Goods', 'Industrial Equipment']
PRODUCTION_LINES = ['LINE_A', 'LINE_B', 'LINE_C', 'LINE_D', 'LINE_E']

# Base production parameters by product type
PRODUCTION_PARAMS = {
    'Electronics': {'base_units': 500, 'defect_rate': 0.02, 'cycle_time': 2.5},
    'Automotive Parts': {'base_units': 200, 'defect_rate': 0.05, 'cycle_time': 8.0},
    'Consumer Goods': {'base_units': 800, 'defect_rate': 0.03, 'cycle_time': 1.8},
    'Industrial Equipment': {'base_units': 50, 'defect_rate': 0.08, 'cycle_time': 25.0}
}

# Generate production records with some data quality issues (for Silver layer cleaning)
production_data = []
base_date = datetime(2024, 1, 1)

# Create 200 machines with 30-90 production runs each
for machine_num in range(1, 201):
    machine_id = f"MCH{machine_num:04d}"
    
    # Each machine gets 30-90 production runs over 12 months
    num_runs = random.randint(30, 90)
    
    for i in range(num_runs):
        # Spread production runs over 12 months (weekdays only, during shifts)
        days_offset = random.randint(0, 365)
        production_date = base_date + timedelta(days=days_offset)
        
        # Skip weekends
        while production_date.weekday() >= 5:
            production_date += timedelta(days=1)
        
        # Add shift timing (6 AM - 6 PM)
        hours_offset = random.randint(6, 18)
        production_date = production_date.replace(hour=hours_offset, minute=0, second=0, microsecond=0)
        
        # Select product type
        product_type = random.choice(PRODUCT_TYPES)
        params = PRODUCTION_PARAMS[product_type]
        
        # Calculate production with variability
        units_variation = random.uniform(0.7, 1.3)
        units_produced = int(params['base_units'] * units_variation)
        
        # Calculate defects
        defect_rate_variation = random.uniform(0.5, 2.0)
        actual_defect_rate = params['defect_rate'] * defect_rate_variation
        defect_count = int(units_produced * actual_defect_rate)
        
        # Calculate cycle time with variation
        cycle_time_variation = random.uniform(0.8, 1.4)
        cycle_time = round(params['cycle_time'] * cycle_time_variation, 2)
        
        # Select production line
        production_line = random.choice(PRODUCTION_LINES)
        
        # Add some data quality issues (nulls, invalid values) for Silver layer demo
        if random.random() < 0.05:  # 5% chance of null values
            if random.choice([True, False]):
                units_produced = None
            else:
                cycle_time = None
        
        if random.random() < 0.02:  # 2% chance of negative values
            defect_count = -abs(defect_count)
        
        production_data.append({
            "machine_id": machine_id,
            "production_date": production_date,
            "product_type": product_type,
            "units_produced": units_produced,
            "defect_count": defect_count,
            "production_line": production_line,
            "cycle_time": cycle_time,
            "source_system": "manufacturing_scada"
        })

print(f"Generated {len(production_data)} raw production records for Bronze layer")
print("Data includes intentional quality issues for Silver layer processing demo")
print("Sample record:", production_data[0])

Generated 12137 raw production records for Bronze layer
Data includes intentional quality issues for Silver layer processing demo
Sample record: {'machine_id': 'MCH0001', 'production_date': datetime.datetime(2024, 8, 9, 7, 0), 'product_type': 'Automotive Parts', 'units_produced': 242, 'defect_count': 18, 'production_line': 'LINE_B', 'cycle_time': 10.78, 'source_system': 'manufacturing_scada'}


In [None]:
# Insert raw data into Bronze layer using PySpark

# Create DataFrame from generated data
df_bronze = spark.createDataFrame(production_data)

# Display schema and sample data
print("Bronze Layer DataFrame Schema:")
df_bronze.printSchema()

print("\nSample Bronze Data:")
df_bronze.show(5)

# Insert data into Delta table with liquid clustering
df_bronze.write.mode("overwrite").saveAsTable("manufacturing.bronze.production_records_bronze")

print(f"\nSuccessfully ingested {df_bronze.count()} raw records into Bronze layer")
print("Liquid clustering automatically optimized the raw data layout during ingestion!")

Bronze Layer DataFrame Schema:
root
 |-- cycle_time: double (nullable = true)
 |-- defect_count: long (nullable = true)
 |-- machine_id: string (nullable = true)
 |-- product_type: string (nullable = true)
 |-- production_date: timestamp (nullable = true)
 |-- production_line: string (nullable = true)
 |-- source_system: string (nullable = true)
 |-- units_produced: long (nullable = true)


Sample Bronze Data:


+----------+------------+----------+----------------+-------------------+---------------+-------------------+--------------+
|cycle_time|defect_count|machine_id|    product_type|    production_date|production_line|      source_system|units_produced|
+----------+------------+----------+----------------+-------------------+---------------+-------------------+--------------+
|     10.78|          18|   MCH0001|Automotive Parts|2024-08-09 07:00:00|         LINE_B|manufacturing_scada|           242|
|      NULL|          18|   MCH0001|  Consumer Goods|2024-10-01 06:00:00|         LINE_A|manufacturing_scada|           830|
|      1.57|          36|   MCH0001|  Consumer Goods|2024-07-23 11:00:00|         LINE_B|manufacturing_scada|           961|
|      2.44|          32|   MCH0001|  Consumer Goods|2024-06-20 13:00:00|         LINE_A|manufacturing_scada|           763|
|      9.89|          12|   MCH0001|Automotive Parts|2024-05-07 16:00:00|         LINE_B|manufacturing_scada|           217|



Successfully ingested 12137 raw records into Bronze layer
Liquid clustering automatically optimized the raw data layout during ingestion!


## Layer 2: Silver - Data Cleaning and Standardization

### Silver Layer Purpose

The silver layer transforms raw bronze data into clean, standardized, and enriched datasets:

- **Data quality**: Remove duplicates, handle nulls, validate ranges
- **Standardization**: Consistent formats, units, and naming
- **Enrichment**: Add derived fields and business logic
- **Optimization**: Clustering for analytical queries

### Transformations Applied

- Remove records with critical null values
- Fix negative defect counts
- Add calculated quality metrics
- Standardize data types and formats
- Add data quality flags

In [None]:
# Create Silver layer Delta table with liquid clustering

# Clustering optimized for analytical queries by machine and time

spark.sql("""

CREATE TABLE IF NOT EXISTS manufacturing.silver.production_records_silver (

    machine_id STRING,

    production_date TIMESTAMP,

    product_type STRING,

    units_produced INT,

    defect_count INT,

    production_line STRING,

    cycle_time DECIMAL(5,2),

    defect_rate DECIMAL(5,2),

    hourly_production_rate DECIMAL(8,2),

    quality_grade STRING,

    processing_timestamp TIMESTAMP,

    data_quality_score INT

)

USING DELTA

CLUSTER BY (machine_id, production_date)

""")

print("Silver layer Delta table with liquid clustering created successfully!")
print("Clustering optimizes for machine-specific and time-based analytical queries.")

Silver layer Delta table with liquid clustering created successfully!
Clustering optimizes for machine-specific and time-based analytical queries.


In [None]:
# Transform Bronze data to Silver layer with data quality improvements

from pyspark.sql.functions import col, when, round, expr, current_timestamp

# Read from Bronze layer
df_silver = spark.table("manufacturing.bronze.production_records_bronze")

# Apply data quality transformations
df_silver = df_silver \
    .filter(col("machine_id").isNotNull()) \
    .filter(col("production_date").isNotNull()) \
    .withColumn("units_produced", when(col("units_produced").isNull(), 0).otherwise(col("units_produced"))) \
    .withColumn("defect_count", when(col("defect_count").isNull(), 0).otherwise(col("defect_count"))) \
    .withColumn("defect_count", when(col("defect_count") < 0, 0).otherwise(col("defect_count"))) \
    .withColumn("cycle_time", when(col("cycle_time").isNull(), 1.0).otherwise(col("cycle_time"))) \
    .withColumn("cycle_time", when(col("cycle_time") <= 0, 1.0).otherwise(col("cycle_time")))

# Add calculated fields
df_silver = df_silver \
    .withColumn("defect_rate", 
                round(when(col("units_produced") > 0, col("defect_count") * 100.0 / col("units_produced")).otherwise(0), 2)) \
    .withColumn("hourly_production_rate", 
                round(when(col("cycle_time") > 0, col("units_produced") * 60.0 / col("cycle_time")).otherwise(0), 2)) \
    .withColumn("quality_grade",
                when(col("defect_rate") <= 2, "Excellent")
                .when(col("defect_rate") <= 5, "Good")
                .when(col("defect_rate") <= 10, "Fair")
                .otherwise("Poor")) \
    .withColumn("processing_timestamp", current_timestamp()) \
    .withColumn("data_quality_score", 
                when(col("units_produced").isNotNull() & col("defect_count").isNotNull() & col("cycle_time").isNotNull(), 100)
                .when(col("units_produced").isNotNull() | col("defect_count").isNotNull(), 75)
                .otherwise(50))

# Select final columns for Silver layer
df_silver = df_silver.select(
    "machine_id", "production_date", "product_type", "units_produced", 
    "defect_count", "production_line", "cycle_time", "defect_rate",
    "hourly_production_rate", "quality_grade", "processing_timestamp", "data_quality_score"
)

print(f"Silver layer transformation completed: {df_silver.count()} clean records")
print("\nData quality improvements applied:")
print("- Removed records with null machine_id or production_date")
print("- Fixed negative defect counts")
print("- Added calculated quality metrics")
print("- Added data quality grading and scoring")

print("\nSilver Layer Sample Data:")
df_silver.show(5)

Silver layer transformation completed: 12137 clean records

Data quality improvements applied:
- Removed records with null machine_id or production_date
- Fixed negative defect counts
- Added calculated quality metrics
- Added data quality grading and scoring

Silver Layer Sample Data:


+----------+-------------------+----------------+--------------+------------+---------------+----------+-----------+----------------------+-------------+--------------------+------------------+
|machine_id|    production_date|    product_type|units_produced|defect_count|production_line|cycle_time|defect_rate|hourly_production_rate|quality_grade|processing_timestamp|data_quality_score|
+----------+-------------------+----------------+--------------+------------+---------------+----------+-----------+----------------------+-------------+--------------------+------------------+
|   MCH0001|2024-08-09 07:00:00|Automotive Parts|           242|          18|         LINE_B|     10.78|       7.44|               1346.94|         Fair|2025-12-19 23:26:...|               100|
|   MCH0001|2024-10-01 06:00:00|  Consumer Goods|           830|          18|         LINE_A|       1.0|       2.17|               49800.0|         Good|2025-12-19 23:26:...|               100|
|   MCH0001|2024-07-23 11:00:0

In [None]:
# Save cleaned data to Silver layer

df_silver.write.mode("overwrite").saveAsTable("manufacturing.silver.production_records_silver")

print("Successfully saved cleaned and enriched data to Silver layer!")
print("Silver layer now contains standardized, quality-assured manufacturing data.")

Successfully saved cleaned and enriched data to Silver layer!
Silver layer now contains standardized, quality-assured manufacturing data.


## Layer 3: Gold - Business Analytics and Aggregations

### Gold Layer Purpose

The gold layer provides business-ready datasets optimized for analytics and reporting:

- **Aggregations**: Pre-computed metrics and KPIs
- **Business logic**: Domain-specific calculations and classifications
- **Performance**: Optimized for dashboard and BI tool queries
- **Governance**: Curated datasets with clear business definitions

### Gold Tables Created

1. **equipment_performance_gold**: Equipment-level KPIs and metrics
2. **quality_analytics_gold**: Quality control and defect analysis
3. **production_efficiency_gold**: Production line and efficiency metrics

In [None]:
# Create Gold layer tables for business analytics

# Equipment Performance Gold Table
spark.sql("""

CREATE TABLE IF NOT EXISTS manufacturing.gold.equipment_performance_gold (

    machine_id STRING,

    total_production_runs INT,

    total_units_produced BIGINT,

    avg_daily_production DECIMAL(8,2),

    avg_defect_rate DECIMAL(5,2),

    avg_hourly_rate DECIMAL(8,2),

    performance_category STRING,

    reliability_score DECIMAL(5,2),

    last_production_date TIMESTAMP,

    days_since_last_run INT

)

USING DELTA

CLUSTER BY (performance_category, machine_id)

""")

# Quality Analytics Gold Table
spark.sql("""

CREATE TABLE IF NOT EXISTS manufacturing.gold.quality_analytics_gold (

    product_type STRING,

    production_line STRING,

    month_year STRING,

    total_production_runs INT,

    total_units BIGINT,

    total_defects BIGINT,

    avg_defect_rate DECIMAL(5,2),

    quality_trend STRING,

    defect_rate_percentile DECIMAL(5,2)

)

USING DELTA

CLUSTER BY (product_type, month_year)

""")

# Production Efficiency Gold Table
spark.sql("""

CREATE TABLE IF NOT EXISTS manufacturing.gold.production_efficiency_gold (

    production_line STRING,

    month_year STRING,

    active_machines INT,

    total_production_runs INT,

    total_units_produced BIGINT,

    avg_production_efficiency DECIMAL(5,2),

    line_utilization_rate DECIMAL(5,2),

    bottleneck_indicator STRING

)

USING DELTA

CLUSTER BY (production_line, month_year)

""")

print("Gold layer tables created successfully!")
print("Three business-ready tables created for equipment performance, quality analytics, and production efficiency.")

Gold layer tables created successfully!
Three business-ready tables created for equipment performance, quality analytics, and production efficiency.


In [None]:
# Populate Equipment Performance Gold Table

from pyspark.sql.functions import datediff, current_date, count, sum, avg, round, max, percentile_approx

equipment_gold = spark.sql("""

SELECT 

    machine_id,

    COUNT(*) as total_production_runs,

    SUM(units_produced) as total_units_produced,

    ROUND(AVG(units_produced), 2) as avg_daily_production,

    ROUND(AVG(defect_rate), 2) as avg_defect_rate,

    ROUND(AVG(hourly_production_rate), 2) as avg_hourly_rate,

    MAX(production_date) as last_production_date,

    CASE 

        WHEN AVG(defect_rate) <= 3 AND AVG(hourly_production_rate) >= 500 THEN 'High Performer'

        WHEN AVG(defect_rate) <= 7 AND AVG(hourly_production_rate) >= 300 THEN 'Good Performer'

        WHEN AVG(defect_rate) <= 12 THEN 'Needs Attention'

        ELSE 'Critical'

    END as performance_category,

    ROUND(100 - AVG(defect_rate), 2) as reliability_score

FROM manufacturing.silver.production_records_silver

GROUP BY machine_id

""")

# Add days since last run
equipment_gold = equipment_gold.withColumn(
    "days_since_last_run", 
    datediff(current_date(), col("last_production_date"))
)

equipment_gold.write.mode("overwrite").saveAsTable("manufacturing.gold.equipment_performance_gold")

print("Equipment Performance Gold table populated successfully!")
print(f"Contains performance metrics for {equipment_gold.count()} machines.")

Equipment Performance Gold table populated successfully!


Contains performance metrics for 200 machines.


In [None]:
# Populate Quality Analytics Gold Table

quality_gold = spark.sql("""

WITH monthly_quality AS (

    SELECT 

        product_type,

        production_line,

        DATE_FORMAT(production_date, 'yyyy-MM') as month_year,

        COUNT(*) as total_production_runs,

        SUM(units_produced) as total_units,

        SUM(defect_count) as total_defects,

        ROUND(AVG(defect_rate), 2) as avg_defect_rate,

        LAG(AVG(defect_rate)) OVER (PARTITION BY product_type, production_line ORDER BY DATE_FORMAT(production_date, 'yyyy-MM')) as prev_month_rate

    FROM manufacturing.silver.production_records_silver

    GROUP BY product_type, production_line, DATE_FORMAT(production_date, 'yyyy-MM')

),

overall_stats AS (

    SELECT 

        PERCENTILE_APPROX(avg_defect_rate, 0.5) as median_defect_rate

    FROM monthly_quality

)

SELECT 

    mq.product_type,

    mq.production_line,

    mq.month_year,

    mq.total_production_runs,

    mq.total_units,

    mq.total_defects,

    mq.avg_defect_rate,

    CASE 

        WHEN mq.prev_month_rate IS NULL THEN 'New'

        WHEN mq.avg_defect_rate < mq.prev_month_rate * 0.9 THEN 'Improving'

        WHEN mq.avg_defect_rate > mq.prev_month_rate * 1.1 THEN 'Declining'

        ELSE 'Stable'

    END as quality_trend,

    ROUND(

        PERCENT_RANK() OVER (ORDER BY mq.avg_defect_rate) * 100, 2

    ) as defect_rate_percentile

FROM monthly_quality mq

ORDER BY product_type, production_line, month_year

""")

quality_gold.write.mode("overwrite").saveAsTable("manufacturing.gold.quality_analytics_gold")

print("Quality Analytics Gold table populated successfully!")
print(f"Contains quality metrics for {quality_gold.count()} product-line-month combinations.")

Quality Analytics Gold table populated successfully!


Contains quality metrics for 240 product-line-month combinations.


In [None]:
# Populate Production Efficiency Gold Table

efficiency_gold = spark.sql("""

SELECT 

    production_line,

    DATE_FORMAT(production_date, 'yyyy-MM') as month_year,

    COUNT(DISTINCT machine_id) as active_machines,

    COUNT(*) as total_production_runs,

    SUM(units_produced) as total_units_produced,

    ROUND(AVG(hourly_production_rate), 2) as avg_production_efficiency,

    ROUND(

        COUNT(*) * 100.0 / 

        (COUNT(DISTINCT machine_id) * 30 * 24), 2  -- Assuming 30 days, 24 hours potential

    ) as line_utilization_rate,

    CASE 

        WHEN AVG(hourly_production_rate) < 200 THEN 'Severe Bottleneck'

        WHEN AVG(hourly_production_rate) < 400 THEN 'Minor Bottleneck'

        WHEN AVG(hourly_production_rate) > 800 THEN 'Overutilized'

        ELSE 'Optimal'

    END as bottleneck_indicator

FROM manufacturing.silver.production_records_silver

GROUP BY production_line, DATE_FORMAT(production_date, 'yyyy-MM')

ORDER BY production_line, month_year

""")

efficiency_gold.write.mode("overwrite").saveAsTable("manufacturing.gold.production_efficiency_gold")

print("Production Efficiency Gold table populated successfully!")
print(f"Contains efficiency metrics for {efficiency_gold.count()} production line-month combinations.")

Production Efficiency Gold table populated successfully!


Contains efficiency metrics for 60 production line-month combinations.


## Medallion Architecture Demonstration

### Query Performance Across Layers

Let's demonstrate how the medallion architecture serves different analytical needs with optimized queries at each layer.

In [None]:
# Demonstrate Bronze layer: Raw data access

print("=== BRONZE LAYER: Raw Data Access ===")
print("Purpose: Historical audit trail and raw data exploration")

bronze_sample = spark.sql("""

SELECT machine_id, production_date, product_type, units_produced, defect_count, source_system

FROM manufacturing.bronze.production_records_bronze

WHERE machine_id = 'MCH0001'

ORDER BY production_date DESC

LIMIT 5

""")

bronze_sample.show()

bronze_stats = spark.sql("""

SELECT 

    COUNT(*) as total_records,

    COUNT(CASE WHEN units_produced IS NULL THEN 1 END) as null_units,

    COUNT(CASE WHEN defect_count < 0 THEN 1 END) as negative_defects

FROM manufacturing.bronze.production_records_bronze

""")

bronze_stats.show()

=== BRONZE LAYER: Raw Data Access ===
Purpose: Historical audit trail and raw data exploration


+----------+-------------------+--------------------+--------------+------------+-------------------+
|machine_id|    production_date|        product_type|units_produced|defect_count|      source_system|
+----------+-------------------+--------------------+--------------+------------+-------------------+
|   MCH0001|2024-12-27 13:00:00|Industrial Equipment|            61|           4|manufacturing_scada|
|   MCH0001|2024-12-26 09:00:00|Industrial Equipment|            39|           2|manufacturing_scada|
|   MCH0001|2024-12-16 14:00:00|      Consumer Goods|          1038|          16|manufacturing_scada|
|   MCH0001|2024-12-10 10:00:00|      Consumer Goods|           853|          28|manufacturing_scada|
|   MCH0001|2024-12-09 11:00:00|         Electronics|           590|           5|manufacturing_scada|
+----------+-------------------+--------------------+--------------+------------+-------------------+



+-------------+----------+----------------+
|total_records|null_units|negative_defects|
+-------------+----------+----------------+
|        12137|       317|             247|
+-------------+----------+----------------+



In [None]:
# Demonstrate Silver layer: Clean analytical data

print("\n=== SILVER LAYER: Clean Analytical Data ===")
print("Purpose: Standardized data for detailed analysis and ML features")

silver_sample = spark.sql("""

SELECT machine_id, production_date, product_type, units_produced, defect_rate, 

       hourly_production_rate, quality_grade, data_quality_score

FROM manufacturing.silver.production_records_silver

WHERE machine_id = 'MCH0001'

ORDER BY production_date DESC

LIMIT 5

""")

silver_sample.show()

silver_quality = spark.sql("""

SELECT quality_grade, COUNT(*) as count, ROUND(AVG(data_quality_score), 2) as avg_quality_score

FROM manufacturing.silver.production_records_silver

GROUP BY quality_grade

ORDER BY count DESC

""")

silver_quality.show()


=== SILVER LAYER: Clean Analytical Data ===
Purpose: Standardized data for detailed analysis and ML features


+----------+-------------------+--------------------+--------------+-----------+----------------------+-------------+------------------+
|machine_id|    production_date|        product_type|units_produced|defect_rate|hourly_production_rate|quality_grade|data_quality_score|
+----------+-------------------+--------------------+--------------+-----------+----------------------+-------------+------------------+
|   MCH0001|2024-12-27 13:00:00|Industrial Equipment|            61|       6.56|                111.35|         Fair|               100|
|   MCH0001|2024-12-26 09:00:00|Industrial Equipment|            39|       5.13|                 80.03|         Fair|               100|
|   MCH0001|2024-12-16 14:00:00|      Consumer Goods|          1038|       1.54|               62280.0|    Excellent|               100|
|   MCH0001|2024-12-10 10:00:00|      Consumer Goods|           853|       3.28|              28915.25|         Good|               100|
|   MCH0001|2024-12-09 11:00:00|         

+-------------+-----+-----------------+
|quality_grade|count|avg_quality_score|
+-------------+-----+-----------------+
|         Good| 5298|            100.0|
|         Fair| 3680|            100.0|
|    Excellent| 2028|            100.0|
|         Poor| 1131|            100.0|
+-------------+-----+-----------------+



In [None]:
# Demonstrate Gold layer: Business intelligence

print("\n=== GOLD LAYER: Business Intelligence ===")
print("Purpose: Pre-aggregated KPIs for dashboards and executive reporting")

# Equipment Performance Dashboard
print("\nEquipment Performance Summary:")
equipment_summary = spark.sql("""

SELECT performance_category, COUNT(*) as machine_count, 

       ROUND(AVG(reliability_score), 2) as avg_reliability_score,

       ROUND(SUM(total_units_produced)/1000, 1) as total_units_k

FROM manufacturing.gold.equipment_performance_gold

GROUP BY performance_category

ORDER BY machine_count DESC

""")

equipment_summary.show()

# Quality Analytics Dashboard
print("\nQuality Analytics Summary:")
quality_summary = spark.sql("""

SELECT product_type, quality_trend, COUNT(*) as months_count,

       ROUND(AVG(avg_defect_rate), 2) as avg_defect_rate,

       ROUND(SUM(total_units)/1000, 1) as total_units_k

FROM manufacturing.gold.quality_analytics_gold

GROUP BY product_type, quality_trend

ORDER BY product_type, quality_trend

""")

quality_summary.show()

# Production Efficiency Dashboard
print("\nProduction Efficiency Summary:")
efficiency_summary = spark.sql("""

SELECT production_line, bottleneck_indicator, COUNT(*) as months_count,

       ROUND(AVG(line_utilization_rate), 2) as avg_utilization_rate,

       ROUND(SUM(total_units_produced)/1000, 1) as total_units_k

FROM manufacturing.gold.production_efficiency_gold

GROUP BY production_line, bottleneck_indicator

ORDER BY production_line, bottleneck_indicator

""")

efficiency_summary.show()


=== GOLD LAYER: Business Intelligence ===
Purpose: Pre-aggregated KPIs for dashboards and executive reporting

Equipment Performance Summary:


+--------------------+-------------+---------------------+-------------+
|performance_category|machine_count|avg_reliability_score|total_units_k|
+--------------------+-------------+---------------------+-------------+
|      Good Performer|          200|                94.99|       4576.5|
+--------------------+-------------+---------------------+-------------+


Quality Analytics Summary:


+--------------------+-------------+------------+---------------+-------------+
|        product_type|quality_trend|months_count|avg_defect_rate|total_units_k|
+--------------------+-------------+------------+---------------+-------------+
|    Automotive Parts|    Declining|           7|           6.06|         66.2|
|    Automotive Parts|    Improving|           4|           5.27|         42.1|
|    Automotive Parts|          New|           5|           5.59|         53.8|
|    Automotive Parts|       Stable|          44|           5.77|        439.6|
|      Consumer Goods|    Declining|          10|           3.81|        361.3|
|      Consumer Goods|    Improving|           6|           3.31|        226.9|
|      Consumer Goods|          New|           5|           3.64|        229.2|
|      Consumer Goods|       Stable|          39|            3.5|       1526.6|
|         Electronics|    Declining|           9|           2.37|        227.7|
|         Electronics|    Improving|    

+---------------+--------------------+------------+--------------------+-------------+
|production_line|bottleneck_indicator|months_count|avg_utilization_rate|total_units_k|
+---------------+--------------------+------------+--------------------+-------------+
|         LINE_A|        Overutilized|          12|                0.23|        950.0|
|         LINE_B|        Overutilized|          12|                0.23|        895.5|
|         LINE_C|        Overutilized|          12|                0.23|        898.2|
|         LINE_D|        Overutilized|          12|                0.22|        921.8|
|         LINE_E|        Overutilized|          12|                0.22|        911.0|
+---------------+--------------------+------------+--------------------+-------------+



## Machine Learning: Predictive Maintenance with Silver Layer Data

### ML Integration in Medallion Architecture

The Silver layer's cleaned and enriched data provides the perfect foundation for machine learning models. We'll demonstrate predictive maintenance using equipment health indicators derived from production data.

### Business Value of Predictive Maintenance

- **Reduce downtime** through proactive equipment maintenance
- **Optimize maintenance costs** by scheduling based on actual condition
- **Improve asset utilization** and production reliability
- **Enhance quality control** by preventing equipment-related defects

### Model Approach

We'll build a classification model to predict equipment maintenance risk levels using rolling statistics and production metrics from the Silver layer.

In [None]:
# Feature Engineering for Predictive Maintenance using Silver layer data

from pyspark.sql.functions import col, lag, avg, stddev, count, window
from pyspark.sql.window import Window
from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline

print("=== Feature Engineering for Predictive Maintenance ===")
print("Using Silver layer data for ML model training")

# Read from Silver layer
df_ml = spark.table("manufacturing.silver.production_records_silver")

# Add rolling statistics (7-day windows for equipment health indicators)
window_spec_7d = Window.partitionBy("machine_id").orderBy("production_date").rowsBetween(-7, 0)

df_health = df_ml.withColumn("rolling_avg_defect_rate", avg("defect_rate").over(window_spec_7d)) \
                .withColumn("rolling_std_defect_rate", stddev("defect_rate").over(window_spec_7d)) \
                .withColumn("rolling_avg_hourly_rate", avg("hourly_production_rate").over(window_spec_7d)) \
                .withColumn("rolling_std_hourly_rate", stddev("hourly_production_rate").over(window_spec_7d)) \
                .withColumn("production_count_7d", count("*").over(window_spec_7d))

# Create maintenance risk labels based on health indicators
from pyspark.sql.functions import when
df_health = df_health.withColumn("maintenance_risk", 
    when((col("rolling_avg_defect_rate") > 8.0) & (col("rolling_std_defect_rate") > 3.0) & 
         (col("rolling_avg_hourly_rate") < 1000), "High")
    .when((col("rolling_avg_defect_rate") > 5.0) | (col("rolling_std_hourly_rate") > 2000), "Medium")
    .otherwise("Low")
)

print(f"Dataset prepared with {df_health.count()} records for maintenance prediction")
print("Risk distribution:")
df_health.groupBy("maintenance_risk").count().show()

=== Feature Engineering for Predictive Maintenance ===
Using Silver layer data for ML model training


Dataset prepared with 12137 records for maintenance prediction
Risk distribution:


+----------------+-----+
|maintenance_risk|count|
+----------------+-----+
|            High|   26|
|             Low|  171|
|          Medium|11940|
+----------------+-----+



In [None]:
# Prepare data for ML training

print("\n=== Data Preparation for ML ===")

# Filter out records with insufficient history for reliable predictions
df_ml_ready = df_health.filter("production_count_7d >= 3")

# Split data (70/30 split for training/testing)
train_data, test_data = df_ml_ready.randomSplit([0.7, 0.3], seed=42)

print(f"Training set: {train_data.count()} records")
print(f"Testing set: {test_data.count()} records")

# Encode categorical features
product_indexer = StringIndexer(inputCol="product_type", outputCol="product_type_index")
line_indexer = StringIndexer(inputCol="production_line", outputCol="production_line_index")
risk_indexer = StringIndexer(inputCol="maintenance_risk", outputCol="label")

product_encoder = OneHotEncoder(inputCol="product_type_index", outputCol="product_type_vec")
line_encoder = OneHotEncoder(inputCol="production_line_index", outputCol="production_line_vec")

# Assemble feature vector from Silver layer metrics
feature_cols = [
    "defect_rate", "hourly_production_rate", "rolling_avg_defect_rate", "rolling_std_defect_rate",
    "rolling_avg_hourly_rate", "rolling_std_hourly_rate", "production_count_7d",
    "product_type_vec", "production_line_vec"
]

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

# Define Random Forest Classifier for maintenance risk prediction
rf = RandomForestClassifier(
    featuresCol="features",
    labelCol="label",
    numTrees=100,
    maxDepth=8,
    seed=42
)

# Create ML pipeline
pipeline = Pipeline(stages=[
    product_indexer, line_indexer, risk_indexer,
    product_encoder, line_encoder,
    assembler, rf
])

print("ML pipeline configured for maintenance risk classification using Silver layer features")


=== Data Preparation for ML ===


Training set: 8327 records


Testing set: 3410 records
ML pipeline configured for maintenance risk classification using Silver layer features


In [None]:
# Train the predictive maintenance model

print("\n=== Model Training ===")

# Fit the pipeline on training data
model = pipeline.fit(train_data)

print("Maintenance prediction model training completed using Silver layer data")

# Make predictions on test data
predictions = model.transform(test_data)

print(f"Predictions generated for {predictions.count()} test records")
print("Sample predictions:")
predictions.select("machine_id", "production_date", "maintenance_risk", "prediction").show(10)


=== Model Training ===


Maintenance prediction model training completed using Silver layer data


Predictions generated for 3410 test records
Sample predictions:


+----------+-------------------+----------------+----------+
|machine_id|    production_date|maintenance_risk|prediction|
+----------+-------------------+----------------+----------+
|   MCH0001|2024-01-15 13:00:00|          Medium|       0.0|
|   MCH0001|2024-02-06 18:00:00|          Medium|       0.0|
|   MCH0001|2024-03-04 12:00:00|          Medium|       0.0|
|   MCH0001|2024-03-12 08:00:00|          Medium|       0.0|
|   MCH0001|2024-03-19 16:00:00|          Medium|       0.0|
|   MCH0001|2024-03-27 10:00:00|          Medium|       0.0|
|   MCH0001|2024-03-27 13:00:00|          Medium|       0.0|
|   MCH0001|2024-04-08 17:00:00|          Medium|       0.0|
|   MCH0001|2024-04-12 12:00:00|          Medium|       0.0|
|   MCH0001|2024-04-16 16:00:00|          Medium|       0.0|
+----------+-------------------+----------------+----------+
only showing top 10 rows



In [None]:
# Evaluate model performance

print("\n=== Model Evaluation ===")

# Calculate evaluation metrics
evaluator_accuracy = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="accuracy"
)

evaluator_f1 = MulticlassClassificationEvaluator(
    labelCol="label",
    predictionCol="prediction",
    metricName="f1"
)

accuracy = evaluator_accuracy.evaluate(predictions)
f1 = evaluator_f1.evaluate(predictions)

print(f"Accuracy: {accuracy:.4f}")
print(f"F1 Score: {f1:.4f}")

# Show confusion matrix
print("\nConfusion Matrix:")
predictions.groupBy("maintenance_risk", "prediction").count().orderBy("maintenance_risk", "prediction").show()

# Business value assessment
print("\n=== Business Value Assessment ===")
print(f"Model accuracy of {accuracy:.1%} enables:")
print(f"- Proactive maintenance scheduling")
print(f"- Reduced unplanned downtime")
print(f"- Optimized maintenance costs")
print(f"- Improved production reliability")


=== Model Evaluation ===


Accuracy: 0.9974
F1 Score: 0.9963

Confusion Matrix:


+----------------+----------+-----+
|maintenance_risk|prediction|count|
+----------------+----------+-----+
|            High|       0.0|    6|
|            High|       1.0|    1|
|             Low|       0.0|    3|
|          Medium|       0.0| 3400|
+----------------+----------+-----+


=== Business Value Assessment ===
Model accuracy of 99.7% enables:
- Proactive maintenance scheduling
- Reduced unplanned downtime
- Optimized maintenance costs
- Improved production reliability


In [None]:
# Analyze maintenance insights and recommendations

print("\n=== Maintenance Insights from ML Model ===")

# Identify high-risk machines requiring immediate attention
high_risk_machines = predictions.filter("prediction = 0") \
    .select("machine_id", "production_date", "rolling_avg_defect_rate", "rolling_avg_hourly_rate") \
    .orderBy("rolling_avg_defect_rate", ascending=False) \
    .limit(10)

print("Top 10 high-risk machines requiring immediate attention:")
high_risk_machines.show()

# Maintenance scheduling recommendations by risk level
maintenance_schedule = predictions.filter("prediction < 2") \
    .groupBy("prediction") \
    .agg(count("machine_id").alias("machines_needing_attention")) \
    .orderBy("prediction")

print("\nMaintenance scheduling recommendations:")
maintenance_schedule.show()

# Equipment utilization analysis with ML insights
equipment_ml_analysis = predictions.groupBy("machine_id") \
    .agg(
        avg("rolling_avg_hourly_rate").alias("avg_hourly_rate"),
        avg("rolling_avg_defect_rate").alias("avg_defect_rate"),
        count("*").alias("total_productions"),
        avg("prediction").alias("avg_risk_score")
    ) \
    .orderBy("avg_hourly_rate", ascending=False)

print("\nTop performing equipment (by efficiency) with ML risk assessment:")
equipment_ml_analysis.show(10)

# Demonstrate how ML predictions could enhance Gold layer
print("\n=== ML Enhancement for Gold Layer ===")
print("The ML predictions could be added to the equipment_performance_gold table")
print("to provide predictive maintenance insights alongside historical performance data.")


=== Maintenance Insights from ML Model ===
Top 10 high-risk machines requiring immediate attention:


+----------+-------------------+-----------------------+-----------------------+
|machine_id|    production_date|rolling_avg_defect_rate|rolling_avg_hourly_rate|
+----------+-------------------+-----------------------+-----------------------+
|   MCH0007|2024-11-11 06:00:00|      9.401250000000001|             2783.59125|
|   MCH0014|2024-03-15 13:00:00|      9.291250000000002|              668.92875|
|   MCH0027|2024-01-19 08:00:00|      9.159999999999998|                2045.24|
|   MCH0082|2024-09-16 06:00:00|      9.113750000000001|      610.7524999999999|
|   MCH0097|2024-12-23 12:00:00|                9.07375|     1382.0325000000003|
|   MCH0177|2024-08-26 07:00:00|                 9.0475|             1707.13375|
|   MCH0052|2024-08-19 11:00:00|                 9.0375|               5644.615|
|   MCH0177|2024-08-26 13:00:00|                8.97875|     1553.9250000000002|
|   MCH0014|2024-03-13 10:00:00|      8.932500000000001|      555.5799999999999|
|   MCH0007|2024-10-25 12:00

+----------+--------------------------+
|prediction|machines_needing_attention|
+----------+--------------------------+
|       0.0|                      3409|
|       1.0|                         1|
+----------+--------------------------+


Top performing equipment (by efficiency) with ML risk assessment:


+----------+------------------+-----------------+-----------------+--------------+
|machine_id|   avg_hourly_rate|  avg_defect_rate|total_productions|avg_risk_score|
+----------+------------------+-----------------+-----------------+--------------+
|   MCH0153|13458.653020833332|4.226322916666668|                8|           0.0|
|   MCH0130|12550.008083333332|4.684673076923077|               13|           0.0|
|   MCH0197|       12484.01575|          4.68025|                5|           0.0|
|   MCH0043|12440.022878787877|5.474015151515151|               11|           0.0|
|   MCH0187|    12378.54203125|        4.7940625|                8|           0.0|
|   MCH0195|       12242.75275|4.105357142857143|               10|           0.0|
|   MCH0058|12163.561197368419|5.200473684210526|               19|           0.0|
|   MCH0109|12156.647250000002|5.223433333333334|               15|           0.0|
|   MCH0181|12139.634579081636|5.593590561224488|               28|           0.0|
|   

## Key Takeaways: Medallion Architecture with Delta Liquid Clustering and ML

### Architecture Benefits Demonstrated

1. **Bronze Layer**: Preserves raw data integrity and provides audit trail
2. **Silver Layer**: Ensures data quality and provides foundation for analytics and ML
3. **Gold Layer**: Delivers fast, pre-aggregated analytics for business users
4. **ML Layer**: Leverages Silver layer data for predictive maintenance and insights

### Liquid Clustering Advantages

- **Automatic optimization** at each layer for different query patterns
- **Zero maintenance** data layout optimization
- **Performance benefits** for both ingestion and analytical queries
- **ML-ready data** through optimized Silver layer clustering

### AIDP Integration Benefits

- **Unified platform** for data engineering, analytics, and ML
- **Seamless progression** from raw data to predictive insights
- **Governance** through catalog and schema isolation
- **Scalability** for large-scale manufacturing datasets and ML training

### Manufacturing Analytics Value

- **Equipment monitoring** with performance categorization
- **Quality control** with trend analysis and defect tracking
- **Production optimization** with bottleneck identification
- **Predictive maintenance** with ML-based risk assessment
- **Business intelligence** through pre-computed KPIs

### Next Steps

- Integrate with real-time SCADA systems for Bronze layer ingestion
- Add ML predictions to Gold layer tables for unified analytics
- Build dashboards combining Gold layer KPIs with ML predictions
- Implement automated alerting based on ML risk scores
- Deploy model for real-time equipment monitoring
- Extend to multi-plant analytics with federated queries

This notebook demonstrates how the medallion architecture with Delta Liquid Clustering and integrated ML enables comprehensive, predictive manufacturing analytics in Oracle AI Data Platform.