# Manufacturing: Iceberg and Liquid Clustering Demo


## Overview


This notebook demonstrates the power of **Iceberg and Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using a manufacturing analytics use case. Liquid clustering automatically optimizes data layout for query performance without requiring manual partitioning or Z-Ordering.

### What is Iceberg?

Apache Iceberg is an open table format for huge analytic datasets that provides:

- **Schema evolution**: Add, drop, rename, update columns without rewriting data
- **Partition evolution**: Change partitioning without disrupting queries
- **Time travel**: Query historical data snapshots for auditing and rollback
- **ACID transactions**: Reliable concurrent read/write operations
- **Cross-engine compatibility**: Works with Spark, Flink, Presto, Hive, and more
- **Open ecosystem**: Apache 2.0 licensed, community-driven development

### Delta Universal Format with Iceberg

Delta Universal Format enables Iceberg compatibility while maintaining Delta's advanced features like liquid clustering. This combination provides:

- **Best of both worlds**: Delta's performance optimizations with Iceberg's openness
- **Multi-engine access**: Query the same data from different analytics engines
- **Future-proof architecture**: Standards-based approach for long-term data investments
- **Enhanced governance**: Rich metadata and catalog integration
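
One way to confirm that Universal Format is active on a table is to read its properties back. The sketch below assumes a Spark session such as the notebook's `spark`; the helper names `get_table_properties` and `iceberg_enabled` are hypothetical. It relies on `SHOW TBLPROPERTIES` returning `key` and `value` columns, as it does in Spark SQL.

```python
def get_table_properties(spark, table_name):
    """Return a table's properties as a plain dict (hypothetical helper)."""
    rows = spark.sql(f"SHOW TBLPROPERTIES {table_name}").collect()
    return {row["key"]: row["value"] for row in rows}


def iceberg_enabled(properties):
    """True if 'iceberg' is listed in delta.universalFormat.enabledFormats."""
    formats = properties.get("delta.universalFormat.enabledFormats", "")
    return "iceberg" in [f.strip().lower() for f in formats.split(",")]
```

In this notebook the check could be run as `iceberg_enabled(get_table_properties(spark, "manufacturing.analytics.production_records_uf"))` once the table exists.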

### What is Liquid Clustering?

Liquid clustering automatically identifies and groups similar data together based on clustering columns you define. This optimization happens automatically during data ingestion and maintenance operations, providing:

- **Automatic optimization**: No manual tuning required
- **Improved query performance**: Faster queries on clustered columns
- **Reduced maintenance**: No need for manual repartitioning
- **Adaptive clustering**: Adjusts as data patterns change
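
To see why grouping similar rows helps, the toy simulation below splits rows into "files", keeps a min/max `machine_id` per file, and counts how many files a point lookup would have to open. It is a pure-Python sketch with hypothetical helper names; a plain sort stands in for the real clustering algorithm, which is more sophisticated.

```python
import random


def split_into_files(rows, rows_per_file):
    """Chunk rows into 'files' and record min/max machine_id per file."""
    files = []
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        ids = [r["machine_id"] for r in chunk]
        files.append({"min": min(ids), "max": max(ids), "rows": chunk})
    return files


def files_to_scan(files, machine_id):
    """Count files whose min/max range could contain machine_id."""
    return sum(1 for f in files if f["min"] <= machine_id <= f["max"])


rng = random.Random(0)  # seeded so the simulation is repeatable
rows = [{"machine_id": f"MCH{rng.randint(1, 200):04d}"} for _ in range(10_000)]

# Arrival order: every file holds a random mix of machines
unclustered = split_into_files(rows, rows_per_file=500)
# Clustered: rows for the same machine sit next to each other
clustered = split_into_files(sorted(rows, key=lambda r: r["machine_id"]), rows_per_file=500)

print("files scanned (arrival order):", files_to_scan(unclustered, "MCH0042"))
print("files scanned (clustered):    ", files_to_scan(clustered, "MCH0042"))
```

With the sorted layout, the lookup touches only the one or two files whose range covers the requested machine, while the arrival-order layout leaves almost every file as a candidate.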

### Use Case: Production Quality Control and Equipment Monitoring

We'll analyze manufacturing production records from a factory. Our clustering strategy will optimize for:

- **Equipment-specific queries**: Fast lookups by machine ID
- **Time-based analysis**: Efficient filtering by production date
- **Quality control patterns**: Quick aggregation by product type and defect rates

## Step 1: AIDP Environment Setup

This notebook uses the existing Spark session in your AIDP environment. The first cell creates the catalog and schema used throughout.

In [1]:
# Create manufacturing catalog and analytics schema

# In AIDP, catalogs provide data isolation and governance

spark.sql("CREATE CATALOG IF NOT EXISTS manufacturing")

spark.sql("CREATE SCHEMA IF NOT EXISTS manufacturing.analytics")

print("Manufacturing catalog and analytics schema created successfully!")

Manufacturing catalog and analytics schema created successfully!


## Step 2: Create Delta Table with Liquid Clustering

### Table Design

Our `production_records_uf` table will store:

- **machine_id**: Unique equipment identifier
- **production_date**: Date and time of production
- **product_type**: Type of product manufactured
- **units_produced**: Number of units produced
- **defect_count**: Number of defective units
- **production_line**: Assembly line identifier
- **cycle_time**: Time to produce one unit (minutes)

### Clustering Strategy

We'll cluster by `machine_id` and `production_date` because:

- **machine_id**: Equipment often produces multiple batches, grouping maintenance and performance data together
- **production_date**: Time-based queries are essential for shift analysis, maintenance scheduling, and quality trending
- This combination optimizes for both equipment monitoring and temporal production analysis

In [1]:
# Create Delta table with liquid clustering

# CLUSTER BY defines the columns for automatic optimization
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType

# DataFrame schema used when loading the generated records in Step 4.
# Note: cycle_time is a double here but DECIMAL(5,2) in the table below,
# so values are cast to DECIMAL(5,2) on insert.
data_schema = StructType([
    StructField("machine_id", StringType(), True),
    StructField("production_date", TimestampType(), True),
    StructField("product_type", StringType(), True),
    StructField("units_produced", IntegerType(), True),
    StructField("defect_count", IntegerType(), True),
    StructField("production_line", StringType(), True),
    StructField("cycle_time", DoubleType(), True)
])

spark.sql("""

CREATE TABLE IF NOT EXISTS manufacturing.analytics.production_records_uf (
    machine_id STRING,
    production_date TIMESTAMP,
    product_type STRING,
    units_produced INT,
    defect_count INT,
    production_line STRING,
    cycle_time DECIMAL(5,2)
)
USING DELTA
TBLPROPERTIES('delta.universalFormat.enabledFormats' = 'iceberg')
CLUSTER BY (machine_id, production_date)

""")

print("Delta table with Iceberg compatibility and liquid clustering created successfully!")

print("Universal format enables Iceberg features while CLUSTER BY (columns) optimizes data layout.")

Delta table with Iceberg compatibility and liquid clustering created successfully!
Universal format enables Iceberg features while CLUSTER BY (columns) optimizes data layout.
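
The table declares `cycle_time` as `DECIMAL(5,2)`, while the DataFrame schema above uses `DoubleType`. If you prefer to control the rounding yourself instead of relying on the cast at insert time, Python's `decimal` module can do it. This is a minimal sketch; the helper name `to_decimal_5_2` is hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_decimal_5_2(value):
    """Round a number to 2 decimal places and check it fits DECIMAL(5,2)."""
    quantized = Decimal(str(value)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    # DECIMAL(5,2) holds at most 3 integer digits: -999.99 .. 999.99
    if abs(quantized) > Decimal("999.99"):
        raise ValueError(f"{value} does not fit DECIMAL(5,2)")
    return quantized


print(to_decimal_5_2(7.89))  # 7.89
print(to_decimal_5_2(35.0))  # 35.00
```

Records converted this way could be paired with `DecimalType(5, 2)` in the DataFrame schema so that both sides agree on the column type.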


## Step 3: Generate Manufacturing Sample Data

### Data Generation Strategy

We'll create realistic manufacturing production data including:

- **200 machines** with multiple production runs over time
- **Product types**: Electronics, Automotive Parts, Consumer Goods, Industrial Equipment
- **Realistic production patterns**: Shift-based operations, maintenance downtime, quality variations
- **Multiple production lines**: Different assembly areas and facilities

### Why This Data Pattern?

This data simulates real manufacturing scenarios where:

- Equipment performance varies over time
- Quality control requires tracking defects and yields
- Maintenance scheduling depends on usage patterns
- Production optimization drives efficiency improvements
- Supply chain visibility requires real-time production data

In [1]:
# Generate sample manufacturing production data

# Only standard-library modules are needed for data generation

import random

from datetime import datetime, timedelta


# Define manufacturing data constants

PRODUCT_TYPES = ['Electronics', 'Automotive Parts', 'Consumer Goods', 'Industrial Equipment']

PRODUCTION_LINES = ['LINE_A', 'LINE_B', 'LINE_C', 'LINE_D', 'LINE_E']

# Base production parameters by product type

PRODUCTION_PARAMS = {

    'Electronics': {'base_units': 500, 'defect_rate': 0.02, 'cycle_time': 2.5},

    'Automotive Parts': {'base_units': 200, 'defect_rate': 0.05, 'cycle_time': 8.0},

    'Consumer Goods': {'base_units': 800, 'defect_rate': 0.03, 'cycle_time': 1.8},

    'Industrial Equipment': {'base_units': 50, 'defect_rate': 0.08, 'cycle_time': 25.0}

}


# Generate production records

production_data = []

base_date = datetime(2024, 1, 1)


# Create 200 machines with 30-90 production runs each

for machine_num in range(1, 201):

    machine_id = f"MCH{machine_num:04d}"
    
    # Each machine gets 30-90 production runs over 12 months

    num_runs = random.randint(30, 90)
    
    for i in range(num_runs):

        # Spread production runs over 12 months (weekdays only, during shifts)

        days_offset = random.randint(0, 365)

        production_date = base_date + timedelta(days=days_offset)
        
        # Skip weekends

        while production_date.weekday() >= 5:

            production_date += timedelta(days=1)
        
        # Add shift timing (6 AM - 6 PM)

        hours_offset = random.randint(6, 18)

        production_date = production_date.replace(hour=hours_offset, minute=0, second=0, microsecond=0)
        
        # Select product type

        product_type = random.choice(PRODUCT_TYPES)

        params = PRODUCTION_PARAMS[product_type]
        
        # Calculate production with variability

        units_variation = random.uniform(0.7, 1.3)

        units_produced = int(params['base_units'] * units_variation)
        
        # Calculate defects

        defect_rate_variation = random.uniform(0.5, 2.0)

        actual_defect_rate = params['defect_rate'] * defect_rate_variation

        defect_count = int(units_produced * actual_defect_rate)
        
        # Calculate cycle time with variation

        cycle_time_variation = random.uniform(0.8, 1.4)

        cycle_time = round(params['cycle_time'] * cycle_time_variation, 2)
        
        # Select production line

        production_line = random.choice(PRODUCTION_LINES)
        
        production_data.append({

            "machine_id": machine_id,

            "production_date": production_date,

            "product_type": product_type,

            "units_produced": units_produced,

            "defect_count": defect_count,

            "production_line": production_line,

            "cycle_time": cycle_time

        })



print(f"Generated {len(production_data)} production records")

print("Sample record:", production_data[0])

Generated 11874 production records
Sample record: {'machine_id': 'MCH0001', 'production_date': datetime.datetime(2024, 8, 5, 11, 0), 'product_type': 'Consumer Goods', 'units_produced': 803, 'defect_count': 34, 'production_line': 'LINE_A', 'cycle_time': 1.46}
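
Because the cell above draws from the unseeded global `random` module, the record count and sample values change on every run. A minimal sketch of a reproducible variant follows, using a private `random.Random(seed)` instance; the helper names `next_weekday` and `sample_production_dates` are hypothetical and only the date logic is shown.

```python
import random
from datetime import datetime, timedelta


def next_weekday(day):
    """Move a date forward until it falls on Monday-Friday."""
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day += timedelta(days=1)
    return day


def sample_production_dates(seed, count, base_date=datetime(2024, 1, 1)):
    """Draw `count` weekday timestamps between 06:00 and 18:00, reproducibly."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    dates = []
    for _ in range(count):
        day = next_weekday(base_date + timedelta(days=rng.randint(0, 365)))
        dates.append(day.replace(hour=rng.randint(6, 18)))
    return dates


first = sample_production_dates(seed=42, count=5)
second = sample_production_dates(seed=42, count=5)
print(first == second)  # True: same seed, same dates
```

The same idea applies to the full generator: replace each `random.` call with a call on one seeded `rng` object.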


## Step 4: Insert Data Using PySpark

### Data Insertion Strategy

We'll use PySpark to:

1. **Create DataFrame** from our generated data
2. **Insert into Delta table** with liquid clustering
3. **Verify the insertion** with a sample query

### Why PySpark for Insertion?

- **Distributed processing**: Handles large datasets efficiently
- **Type safety**: Ensures data integrity
- **Optimization**: Leverages Spark's query optimization
- **Liquid clustering**: The table's `CLUSTER BY` definition governs data layout, so the insert code needs no partitioning logic

In [1]:
# Insert data using PySpark DataFrame operations


# Create DataFrame from generated data

df_production = spark.createDataFrame(production_data, schema=data_schema)


# Display schema and sample data

print("DataFrame Schema:")

df_production.printSchema()



print("\nSample Data:")

df_production.show(5)


# Insert data into the Delta table

# The table's CLUSTER BY (machine_id, production_date) definition governs the data layout.
# mode("overwrite") replaces existing rows, so re-running this cell does not duplicate data.
# insertInto matches columns by position, so the DataFrame column order must match the table.

df_production.write.mode("overwrite").insertInto("manufacturing.analytics.production_records_uf")


print(f"\nSuccessfully inserted {df_production.count()} records into manufacturing.analytics.production_records_uf")

print("Liquid clustering automatically optimized the data layout during insertion!")

DataFrame Schema:
root
 |-- machine_id: string (nullable = true)
 |-- production_date: timestamp (nullable = true)
 |-- product_type: string (nullable = true)
 |-- units_produced: integer (nullable = true)
 |-- defect_count: integer (nullable = true)
 |-- production_line: string (nullable = true)
 |-- cycle_time: double (nullable = true)


Sample Data:


+----------+-------------------+----------------+--------------+------------+---------------+----------+
|machine_id|    production_date|    product_type|units_produced|defect_count|production_line|cycle_time|
+----------+-------------------+----------------+--------------+------------+---------------+----------+
|   MCH0001|2024-08-05 11:00:00|  Consumer Goods|           803|          34|         LINE_A|      1.46|
|   MCH0001|2024-12-26 14:00:00|     Electronics|           631|          18|         LINE_B|      2.58|
|   MCH0001|2024-03-21 11:00:00|  Consumer Goods|           810|          34|         LINE_E|      1.83|
|   MCH0001|2024-04-01 16:00:00|Automotive Parts|           188|          17|         LINE_C|      7.89|
|   MCH0001|2024-04-09 11:00:00|     Electronics|           609|          18|         LINE_B|       3.1|
+----------+-------------------+----------------+--------------+------------+---------------+----------+
only showing top 5 rows




Successfully inserted 11874 records into manufacturing.analytics.production_records_uf
Liquid clustering automatically optimized the data layout during insertion!
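
How much clustering happens at write time depends on the engine and the size of the write. In Delta Lake, the `OPTIMIZE` command is what reclusters data that has already been written. The sketch below assumes `OPTIMIZE` is available in your AIDP environment, which this notebook does not confirm; the helper name `optimize_table` is hypothetical.

```python
def optimize_table(spark, table_name):
    """Run Delta's OPTIMIZE on a table and return the resulting DataFrame."""
    # Assumption: the environment supports Delta's OPTIMIZE command.
    return spark.sql(f"OPTIMIZE {table_name}")
```

A typical call would be `optimize_table(spark, "manufacturing.analytics.production_records_uf").show(truncate=False)`, run after large or repeated inserts.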


## Step 5: Demonstrate Liquid Clustering Benefits

### Query Performance Analysis

Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:

1. **Machine performance history** (clustered by machine_id)
2. **Time-based production analysis** (clustered by production_date)
3. **Combined machine + time queries** (optimal for our clustering)

### Expected Performance Benefits

With liquid clustering, filters on the clustering columns can read fewer files. The effect grows with table size and will be modest on this 11,874-row sample. The reasons are:

- **Data locality**: Related records are physically grouped together
- **Reduced I/O**: Less data needs to be read from disk
- **Automatic optimization**: No manual tuning required
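
Claims about speed are easier to trust when measured. The helper below is a small sketch for timing any callable; the name `time_call` is hypothetical, and a stand-in workload is used so the block runs without Spark.

```python
import time


def time_call(action, repeats=3):
    """Run `action` several times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload so the sketch runs anywhere; in the notebook the action
# could be something like: lambda: spark.sql(query).count()
elapsed = time_call(lambda: sum(range(100_000)))
print(f"fastest run: {elapsed:.6f} s")
```

Taking the minimum of several runs reduces noise from caching and scheduling, which matters when comparing a clustered table against an unclustered copy.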

In [1]:
# Demonstrate liquid clustering benefits with optimized queries


# Query 1: Machine performance history - benefits from machine_id clustering

print("=== Query 1: Machine Performance History ===")

machine_history = spark.sql("""

SELECT machine_id, production_date, product_type, units_produced, defect_count,

       ROUND(defect_count * 100.0 / units_produced, 2) as defect_rate_percent

FROM manufacturing.analytics.production_records_uf

WHERE machine_id = 'MCH0001'

ORDER BY production_date DESC

""")



machine_history.show()

print(f"Records found: {machine_history.count()}")



# Query 2: Time-based quality analysis - benefits from production_date clustering

print("\n=== Query 2: Recent Quality Issues ===")

quality_issues = spark.sql("""

SELECT production_date, machine_id, product_type, units_produced, defect_count,

       ROUND(defect_count * 100.0 / units_produced, 2) as defect_rate_percent

FROM manufacturing.analytics.production_records_uf

WHERE production_date >= '2024-06-01' AND (defect_count * 100.0 / units_produced) > 5.0

ORDER BY defect_rate_percent DESC, production_date DESC

""")



quality_issues.show()

print(f"Quality issues found: {quality_issues.count()}")



# Query 3: Combined machine + time query - optimal for our clustering strategy

print("\n=== Query 3: Equipment Performance Trends ===")

performance_trends = spark.sql("""

SELECT machine_id, production_date, product_type, units_produced, cycle_time,

       ROUND(units_produced * 60.0 / cycle_time, 2) as hourly_rate

FROM manufacturing.analytics.production_records_uf

WHERE machine_id LIKE 'MCH000%' AND production_date >= '2024-04-01'

ORDER BY machine_id, production_date

""")



performance_trends.show()

print(f"Performance records found: {performance_trends.count()}")

=== Query 1: Machine Performance History ===


+----------+-------------------+--------------------+--------------+------------+-------------------+
|machine_id|    production_date|        product_type|units_produced|defect_count|defect_rate_percent|
+----------+-------------------+--------------------+--------------+------------+-------------------+
|   MCH0001|2024-12-26 17:00:00|      Consumer Goods|           957|          50|               5.22|
|   MCH0001|2024-12-26 14:00:00|         Electronics|           631|          18|               2.85|
|   MCH0001|2024-12-23 16:00:00|         Electronics|           452|           4|               0.88|
|   MCH0001|2024-12-19 14:00:00|         Electronics|           570|          18|               3.16|
|   MCH0001|2024-12-09 12:00:00|         Electronics|           569|          21|               3.69|
|   MCH0001|2024-12-09 09:00:00|Industrial Equipment|            63|           9|              14.29|
|   MCH0001|2024-12-09 08:00:00|      Consumer Goods|          1005|          49| 

Records found: 77

=== Query 2: Recent Quality Issues ===


+-------------------+----------+--------------------+--------------+------------+-------------------+
|    production_date|machine_id|        product_type|units_produced|defect_count|defect_rate_percent|
+-------------------+----------+--------------------+--------------+------------+-------------------+
|2024-07-08 11:00:00|   MCH0062|Industrial Equipment|            38|           6|              15.79|
|2024-10-08 06:00:00|   MCH0063|Industrial Equipment|            51|           8|              15.69|
|2024-07-10 09:00:00|   MCH0200|Industrial Equipment|            64|          10|              15.63|
|2024-12-12 16:00:00|   MCH0153|Industrial Equipment|            45|           7|              15.56|
|2024-08-21 09:00:00|   MCH0114|Industrial Equipment|            45|           7|              15.56|
|2024-11-25 14:00:00|   MCH0084|Industrial Equipment|            58|           9|              15.52|
|2024-09-18 10:00:00|   MCH0163|Industrial Equipment|            58|           9| 

Quality issues found: 2830

=== Query 3: Equipment Performance Trends ===


+----------+-------------------+--------------------+--------------+----------+-----------+
|machine_id|    production_date|        product_type|units_produced|cycle_time|hourly_rate|
+----------+-------------------+--------------------+--------------+----------+-----------+
|   MCH0001|2024-04-01 16:00:00|    Automotive Parts|           188|      7.89|    1429.66|
|   MCH0001|2024-04-01 16:00:00|         Electronics|           587|      2.22|   15864.86|
|   MCH0001|2024-04-02 13:00:00|         Electronics|           490|      2.31|   12727.27|
|   MCH0001|2024-04-09 11:00:00|         Electronics|           609|      3.10|   11787.10|
|   MCH0001|2024-04-16 06:00:00|Industrial Equipment|            55|     34.90|      94.56|
|   MCH0001|2024-05-07 12:00:00|      Consumer Goods|          1007|      2.15|   28102.33|
|   MCH0001|2024-05-14 17:00:00|Industrial Equipment|            52|     24.72|     126.21|
|   MCH0001|2024-05-20 11:00:00|         Electronics|           648|      3.10| 

Performance records found: 461
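
Step 2 defines `cycle_time` as the minutes needed to produce one unit. Under that definition, throughput in units per hour is `60 / cycle_time` and does not depend on run size. The `hourly_rate` column in Query 3 also multiplies by `units_produced`, so it is best read as a relative score, not as literal units per hour. The arithmetic can be sketched as follows, with hypothetical helper names:

```python
def units_per_hour(cycle_time_minutes):
    """Units produced per hour when one unit takes `cycle_time_minutes`."""
    if cycle_time_minutes <= 0:
        raise ValueError("cycle_time must be positive")
    return 60.0 / cycle_time_minutes


def run_duration_hours(units_produced, cycle_time_minutes):
    """Hours needed to produce a run at the given cycle time."""
    return units_produced * cycle_time_minutes / 60.0


# First row of the Query 3 output: 188 units at 7.89 minutes per unit
print(round(units_per_hour(7.89), 2))
print(round(run_duration_hours(188, 7.89), 2))
```

In SQL, the equivalent throughput expression would be `ROUND(60.0 / cycle_time, 2)`.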


## Step 6: Analyze Clustering Effectiveness

### Understanding the Impact

Let's examine how liquid clustering has organized our data and analyze some aggregate statistics to demonstrate the manufacturing insights possible with this optimized structure.

### Key Analytics

- **Equipment utilization** and performance metrics
- **Quality control analysis** and defect patterns
- **Production line efficiency** and bottleneck identification
- **Product type performance** and optimization opportunities
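
The queries in this step report `avg_defect_rate` as the mean of per-run rates, `AVG(defect_count * 100.0 / units_produced)`, which weights every run equally regardless of size. A volume-weighted rate divides total defects by total units instead. The two can differ noticeably when run sizes vary, as this sketch with made-up numbers shows:

```python
runs = [
    # (units_produced, defect_count) -- illustrative values, not notebook data
    (1000, 20),  # large run, 2% defects
    (50, 5),     # small run, 10% defects
]

# Mean of per-run rates: each run counts equally
mean_of_rates = sum(d * 100.0 / u for u, d in runs) / len(runs)
# Pooled rate: total defects over total units
pooled_rate = sum(d for _, d in runs) * 100.0 / sum(u for u, _ in runs)

print(f"mean of per-run rates: {mean_of_rates:.2f}%")  # 6.00%
print(f"pooled defect rate:    {pooled_rate:.2f}%")    # 2.38%
```

In SQL, the pooled form would be `SUM(defect_count) * 100.0 / SUM(units_produced)`. Which one is appropriate depends on whether the question is about typical runs or about overall output.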

In [1]:
# Analyze clustering effectiveness and manufacturing insights


# Equipment performance analysis

print("=== Equipment Performance Analysis ===")

equipment_performance = spark.sql("""

SELECT machine_id, COUNT(*) as total_runs,

       ROUND(AVG(units_produced), 2) as avg_units_produced,

       ROUND(AVG(defect_count * 100.0 / units_produced), 2) as avg_defect_rate,

       ROUND(AVG(cycle_time), 2) as avg_cycle_time,

       ROUND(SUM(units_produced), 0) as total_units

FROM manufacturing.analytics.production_records_uf

GROUP BY machine_id

ORDER BY total_units DESC

""")



equipment_performance.show()


# Quality analysis by product type

print("\n=== Quality Analysis by Product Type ===")

quality_by_product = spark.sql("""

SELECT product_type, COUNT(*) as production_runs,

       ROUND(SUM(units_produced), 0) as total_units,

       ROUND(SUM(defect_count), 0) as total_defects,

       ROUND(AVG(defect_count * 100.0 / units_produced), 2) as avg_defect_rate,

       ROUND(AVG(cycle_time), 2) as avg_cycle_time

FROM manufacturing.analytics.production_records_uf

GROUP BY product_type

ORDER BY total_units DESC

""")



quality_by_product.show()


# Production line efficiency

print("\n=== Production Line Efficiency ===")

line_efficiency = spark.sql("""

SELECT production_line, COUNT(*) as total_runs,

       COUNT(DISTINCT machine_id) as machines_used,

       ROUND(SUM(units_produced), 0) as total_production,

       ROUND(AVG(units_produced), 2) as avg_run_size,

       ROUND(SUM(defect_count * 100.0 / units_produced) / COUNT(*), 2) as avg_defect_rate

FROM manufacturing.analytics.production_records_uf

GROUP BY production_line

ORDER BY total_production DESC

""")



line_efficiency.show()


# Monthly production trends

print("\n=== Monthly Production Trends ===")

monthly_trends = spark.sql("""

SELECT DATE_FORMAT(production_date, 'yyyy-MM') as month,

       COUNT(*) as production_runs,

       ROUND(SUM(units_produced), 0) as total_units,

       ROUND(AVG(defect_count * 100.0 / units_produced), 2) as avg_defect_rate,

       COUNT(DISTINCT machine_id) as active_machines

FROM manufacturing.analytics.production_records_uf

GROUP BY DATE_FORMAT(production_date, 'yyyy-MM')

ORDER BY month

""")



monthly_trends.show()

=== Equipment Performance Analysis ===


+----------+----------+------------------+---------------+--------------+-----------+
|machine_id|total_runs|avg_units_produced|avg_defect_rate|avg_cycle_time|total_units|
+----------+----------+------------------+---------------+--------------+-----------+
|   MCH0084|        86|            455.69|           5.06|          8.74|      39189|
|   MCH0001|        77|            490.01|           5.28|          8.67|      37731|
|   MCH0154|        88|             419.1|           4.83|          9.26|      36881|
|   MCH0184|        89|            408.88|           5.10|         10.63|      36390|
|   MCH0069|        82|            443.55|           4.71|          7.88|      36371|
|   MCH0009|        85|            425.96|           4.54|          9.42|      36207|
|   MCH0061|        88|            411.17|           4.67|          8.84|      36183|
|   MCH0044|        87|            411.02|           4.85|         10.48|      35759|
|   MCH0007|        88|            405.73|           5


=== Quality Analysis by Product Type ===


+--------------------+---------------+-----------+-------------+---------------+--------------+
|        product_type|production_runs|total_units|total_defects|avg_defect_rate|avg_cycle_time|
+--------------------+---------------+-----------+-------------+---------------+--------------+
|      Consumer Goods|           2948|    2368488|        87160|           3.68|          1.98|
|         Electronics|           3064|    1538208|        36684|           2.38|          2.74|
|    Automotive Parts|           2962|     590410|        35701|           6.02|          8.75|
|Industrial Equipment|           2900|     144483|        13002|           8.96|         27.50|
+--------------------+---------------+-----------+-------------+---------------+--------------+


=== Production Line Efficiency ===


+---------------+----------+-------------+----------------+------------+---------------+
|production_line|total_runs|machines_used|total_production|avg_run_size|avg_defect_rate|
+---------------+----------+-------------+----------------+------------+---------------+
|         LINE_B|      2434|          200|          957386|      393.34|           5.25|
|         LINE_D|      2395|          200|          954864|      398.69|           5.22|
|         LINE_C|      2442|          200|          939830|      384.86|           5.19|
|         LINE_A|      2290|          200|          907895|      396.46|           5.17|
|         LINE_E|      2313|          200|          881614|      381.16|           5.27|
+---------------+----------+-------------+----------------+------------+---------------+


=== Monthly Production Trends ===


+-------+---------------+-----------+---------------+---------------+
|  month|production_runs|total_units|avg_defect_rate|active_machines|
+-------+---------------+-----------+---------------+---------------+
|2024-01|           1020|     406833|           5.11|            199|
|2024-02|            927|     363057|           5.30|            198|
|2024-03|            966|     389557|           5.22|            195|
|2024-04|           1046|     401987|           5.16|            200|
|2024-05|           1021|     393406|           5.27|            194|
|2024-06|            897|     348198|           5.20|            197|
|2024-07|           1054|     411739|           5.22|            199|
|2024-08|            990|     387198|           5.18|            198|
|2024-09|            968|     362704|           5.31|            194|
|2024-10|           1025|     397135|           5.23|            197|
|2024-11|            915|     364122|           5.24|            193|
|2024-12|           

## Key Takeaways: Iceberg and Liquid Clustering in AIDP

### What We Demonstrated

1. **Automatic Optimization**: Created a table with `TBLPROPERTIES('delta.universalFormat.enabledFormats' = 'iceberg')` and `CLUSTER BY (machine_id, production_date)`, leaving the data layout to Delta

2. **Performance Benefits**: Queries that filter on the clustering columns (machine_id, production_date) benefit from data locality, although this notebook does not benchmark them

3. **Zero Maintenance**: No manual partitioning, bucketing, or Z-Ordering required - Delta handles it automatically

4. **Real-World Use Case**: Manufacturing analytics where equipment monitoring and quality control are critical

### AIDP Advantages

- **Unified Analytics**: Seamlessly integrates with other AIDP services
- **Governance**: Catalog and schema isolation for manufacturing data
- **Performance**: Optimized for both OLAP and OLTP workloads
- **Scalability**: Handles manufacturing-scale data volumes effortlessly

### Best Practices for Iceberg and Liquid Clustering

1. **Choose clustering columns** based on your most common query patterns
2. **Start with 1-4 columns** - too many can reduce effectiveness
3. **Consider cardinality** - high-cardinality columns work best
4. **Monitor and adjust** as query patterns evolve
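
To compare candidate clustering columns by cardinality, a quick distinct-value count is often enough. The sketch below works on a list of dict records shaped like the notebook's `production_data`; the helper name `column_cardinality` and the sample rows are illustrative only.

```python
def column_cardinality(records, columns):
    """Count distinct values per column in a list of dict records."""
    return {col: len({rec[col] for rec in records}) for col in columns}


# Tiny illustrative sample shaped like the notebook's records
sample = [
    {"machine_id": "MCH0001", "product_type": "Electronics", "production_line": "LINE_A"},
    {"machine_id": "MCH0002", "product_type": "Electronics", "production_line": "LINE_B"},
    {"machine_id": "MCH0003", "product_type": "Consumer Goods", "production_line": "LINE_A"},
]

print(column_cardinality(sample, ["machine_id", "product_type", "production_line"]))
# {'machine_id': 3, 'product_type': 2, 'production_line': 2}
```

On the generated data, `column_cardinality(production_data, ["machine_id", "product_type", "production_line"])` would give the same kind of summary.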

### Next Steps

- Explore other AIDP features like AI/ML integration
- Try liquid clustering with different column combinations
- Scale up to larger manufacturing datasets
- Integrate with real SCADA systems and IoT sensors

This notebook demonstrates how Oracle AI Data Platform combines Delta's advanced liquid clustering with Iceberg's open, future-proof architecture to deliver enterprise-grade analytics that are both high-performance and standards-compliant.