# Transportation: Delta Liquid Clustering Demo


## Overview


This notebook demonstrates **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using a transportation and logistics analytics use case. Liquid clustering organizes the data layout around the clustering columns you declare, improving query performance without manual partitioning or Z-Ordering.

### What is Liquid Clustering?

Liquid clustering groups rows with similar values in the clustering columns you define into the same data files. The layout is maintained incrementally during data ingestion and maintenance operations, providing:

- **Automatic optimization**: No manual tuning required
- **Improved query performance**: Faster queries on clustered columns
- **Reduced maintenance**: No need for manual repartitioning
- **Adaptive clustering**: Adjusts as data patterns change

### Use Case: Fleet Management and Route Optimization

We'll analyze transportation fleet operations and logistics data. Our clustering strategy will optimize for:

- **Vehicle-specific queries**: Fast lookups by vehicle ID
- **Time-based analysis**: Efficient filtering by trip date and time
- **Route performance patterns**: Quick aggregation by route and operational metrics

### AIDP Environment Setup

This notebook leverages the existing Spark session in your AIDP environment.

In [None]:
# Create transportation catalog and analytics schema
# In AIDP, catalogs provide data isolation and governance
spark.sql("CREATE CATALOG IF NOT EXISTS transportation")
spark.sql("CREATE SCHEMA IF NOT EXISTS transportation.analytics")

print("Transportation catalog and analytics schema created successfully!")

Transportation catalog and analytics schema created successfully!


## Step 2: Create Delta Table with Liquid Clustering

### Table Design

Our `fleet_trips` table will store:

- **vehicle_id**: Unique vehicle identifier
- **trip_date**: Date and time of trip start
- **route_id**: Route identifier
- **distance**: Distance traveled (miles/km)
- **duration**: Trip duration (minutes)
- **fuel_consumed**: Fuel used (gallons/liters)
- **load_factor**: Capacity utilization (0-100)

### Clustering Strategy

We'll cluster by `vehicle_id` and `trip_date` because:

- **vehicle_id**: Vehicles generate multiple trips, grouping maintenance and performance data together
- **trip_date**: Time-based queries are essential for scheduling, fuel analysis, and operational reporting
- This combination optimizes for both vehicle monitoring and temporal fleet performance analysis
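
The table design and clustering choice above can be sketched as a small helper that assembles the `CREATE TABLE ... CLUSTER BY` statement and rejects invalid clustering choices before anything is sent to Spark. The helper name `build_clustered_ddl` and the constant `MAX_CLUSTER_COLUMNS` are hypothetical, introduced only for illustration; the limit of four clustering columns reflects the documented Delta limit and is assumed to apply in AIDP as well.

```python
# Hypothetical helper (not part of AIDP or Delta): builds the DDL string only.
MAX_CLUSTER_COLUMNS = 4  # assumed limit, taken from the Delta liquid clustering docs


def build_clustered_ddl(table, columns, cluster_by):
    """Return a CREATE TABLE ... USING DELTA CLUSTER BY statement."""
    if not 1 <= len(cluster_by) <= MAX_CLUSTER_COLUMNS:
        raise ValueError(f"expected 1-{MAX_CLUSTER_COLUMNS} clustering columns")
    unknown = [c for c in cluster_by if c not in columns]
    if unknown:
        raise ValueError(f"clustering columns not in table: {unknown}")
    column_sql = ",\n    ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n    {column_sql}\n)\n"
        f"USING DELTA\n"
        f"CLUSTER BY ({', '.join(cluster_by)})"
    )


# Column names and types as listed in the table design above
FLEET_TRIPS_COLUMNS = {
    "vehicle_id": "STRING",
    "trip_date": "TIMESTAMP",
    "route_id": "STRING",
    "distance": "DECIMAL(8,2)",
    "duration": "DECIMAL(6,2)",
    "fuel_consumed": "DECIMAL(6,2)",
    "load_factor": "INT",
}

ddl = build_clustered_ddl(
    "transportation.analytics.fleet_trips",
    FLEET_TRIPS_COLUMNS,
    ["vehicle_id", "trip_date"],
)
print(ddl)
```

In the notebook, the resulting string could be passed to `spark.sql(ddl)`; validating first keeps a typo in a clustering column from surfacing as a less readable SQL error.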

In [None]:
# Create Delta table with liquid clustering
# CLUSTER BY defines the columns for automatic optimization
spark.sql("""
CREATE TABLE IF NOT EXISTS transportation.analytics.fleet_trips (
    vehicle_id STRING,
    trip_date TIMESTAMP,
    route_id STRING,
    distance DECIMAL(8,2),
    duration DECIMAL(6,2),
    fuel_consumed DECIMAL(6,2),
    load_factor INT
)
USING DELTA
CLUSTER BY (vehicle_id, trip_date)
""")

print("Delta table with liquid clustering created successfully!")
print("Clustering will automatically optimize data layout for queries on vehicle_id and trip_date.")

Delta table with liquid clustering created successfully!
Clustering will automatically optimize data layout for queries on vehicle_id and trip_date.


## Step 3: Generate Transportation Sample Data

### Data Generation Strategy

We'll create realistic transportation fleet data including:

- **500 vehicles** with multiple trips over time
- **Route types**: Urban delivery, Long-haul, Local transport, Express delivery
- **Realistic operational patterns**: Peak hours, route variations, fuel efficiency differences
- **Fleet diversity**: Different vehicle types with varying capacities and fuel consumption
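
The "peak hours" pattern comes from weighted sampling of the trip start hour. The sketch below takes the hour weights used by the generator in this step and computes the share of trips they place in the 06:00-17:59 window, then checks it against a seeded sample. The function name `business_hours_share` and the window boundaries are assumptions chosen for illustration.

```python
import random

# Hour-of-day weights as used by the data generator in this step (index = hour 0-23)
HOUR_WEIGHTS = [1, 1, 1, 1, 1, 3, 8, 10, 12, 10, 8, 6, 8, 9, 8, 7, 6, 5, 3, 2, 2, 1, 1, 1]


def business_hours_share(weights, start=6, end=18):
    """Expected fraction of trips whose start hour falls in [start, end)."""
    return sum(weights[start:end]) / sum(weights)


expected_share = business_hours_share(HOUR_WEIGHTS)

# Seeded only so this illustration is repeatable; the notebook itself is unseeded
rng = random.Random(42)
sampled_hours = rng.choices(range(24), weights=HOUR_WEIGHTS, k=10_000)
observed_share = sum(6 <= hour < 18 for hour in sampled_hours) / len(sampled_hours)

print(f"expected share 06:00-17:59: {expected_share:.3f}")
print(f"observed share in sample:   {observed_share:.3f}")
```

With these weights, 97 of the 115 weight units fall in that window, so roughly 84% of generated trips start during the day.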

### Why This Data Pattern?

This data simulates real transportation scenarios where:

- Vehicle performance varies by route and time of day
- Fuel efficiency impacts operational costs
- Route optimization requires historical performance data
- Capacity utilization affects profitability
- Maintenance scheduling depends on usage patterns

In [None]:
# Generate sample transportation fleet data
# Only standard-library modules are needed
import random
from datetime import datetime, timedelta

# Define transportation data constants
ROUTE_TYPES = ['Urban Delivery', 'Long-haul', 'Local Transport', 'Express Delivery']
ROUTES = ['RT_NYC_MAN_001', 'RT_LAX_SFO_002', 'RT_CHI_DET_003', 'RT_HOU_DAL_004', 'RT_MIA_ORL_005']

# Base trip parameters by route type
TRIP_PARAMS = {
    'Urban Delivery': {'avg_distance': 45, 'avg_duration': 120, 'avg_fuel': 8.5, 'load_factor': 85},
    'Long-haul': {'avg_distance': 450, 'avg_duration': 480, 'avg_fuel': 65.0, 'load_factor': 92},
    'Local Transport': {'avg_distance': 120, 'avg_duration': 180, 'avg_fuel': 15.2, 'load_factor': 78},
    'Express Delivery': {'avg_distance': 80, 'avg_duration': 90, 'avg_fuel': 12.8, 'load_factor': 95}
}

# Relative likelihood of a trip starting in each hour (index 0-23); business hours dominate
HOUR_WEIGHTS = [1, 1, 1, 1, 1, 3, 8, 10, 12, 10, 8, 6, 8, 9, 8, 7, 6, 5, 3, 2, 2, 1, 1, 1]

# Generate fleet trip records
trip_data = []
base_date = datetime(2024, 1, 1)

# Create 500 vehicles with 20-60 trips each
for vehicle_num in range(1, 501):
    vehicle_id = f"VH{vehicle_num:04d}"

    # Each vehicle gets 20-60 trips over 12 months
    num_trips = random.randint(20, 60)

    for _ in range(num_trips):
        # Spread trips over 2024 (a leap year, so offsets 0-365 cover all 366 days)
        days_offset = random.randint(0, 365)
        trip_date = base_date + timedelta(days=days_offset)

        # Add realistic timing (more trips during business hours)
        hours_offset = random.choices(range(24), weights=HOUR_WEIGHTS)[0]
        trip_date = trip_date.replace(hour=hours_offset, minute=random.randint(0, 59), second=0, microsecond=0)

        # Select route type (drawn independently of route_id below,
        # so a route_id does not determine trip length)
        route_type = random.choice(ROUTE_TYPES)
        params = TRIP_PARAMS[route_type]

        # Calculate trip metrics with variability
        distance_variation = random.uniform(0.7, 1.4)
        distance = round(params['avg_distance'] * distance_variation, 2)

        duration_variation = random.uniform(0.8, 1.6)
        duration = round(params['avg_duration'] * duration_variation, 2)

        fuel_variation = random.uniform(0.85, 1.25)
        fuel_consumed = round(params['avg_fuel'] * fuel_variation, 2)

        load_factor_variation = random.randint(-10, 8)
        load_factor = max(0, min(100, params['load_factor'] + load_factor_variation))

        # Select specific route
        route_id = random.choice(ROUTES)

        trip_data.append({
            "vehicle_id": vehicle_id,
            "trip_date": trip_date,
            "route_id": route_id,
            "distance": distance,
            "duration": duration,
            "fuel_consumed": fuel_consumed,
            "load_factor": load_factor
        })

print(f"Generated {len(trip_data)} fleet trip records")
print("Sample record:", trip_data[0])

Generated 20176 fleet trip records
Sample record: {'vehicle_id': 'VH0001', 'trip_date': datetime.datetime(2024, 9, 21, 14, 44), 'route_id': 'RT_HOU_DAL_004', 'distance': 48.18, 'duration': 107.57, 'fuel_consumed': 8.54, 'load_factor': 79}


## Step 4: Insert Data Using PySpark

### Data Insertion Strategy

We'll use PySpark to:

1. **Create DataFrame** from our generated data
2. **Insert into Delta table** with liquid clustering
3. **Verify the insertion** with a sample query

### Why PySpark for Insertion?

- **Distributed processing**: Handles large datasets efficiently
- **Type safety**: Ensures data integrity
- **Optimization**: Leverages Spark's query optimization
- **Liquid clustering**: Writes to a clustered table need no extra clustering code; the engine applies clustering on write or during later `OPTIMIZE` runs

In [None]:
# Insert data using PySpark DataFrame operations

# Create DataFrame from generated data
# (the schema is inferred from the Python dicts: floats become double, ints become long)
df_trips = spark.createDataFrame(trip_data)

# Display schema and sample data
print("DataFrame Schema:")
df_trips.printSchema()

print("\nSample Data:")
df_trips.show(5)

# Insert data into the Delta table with liquid clustering
# Note: mode("overwrite") replaces any existing rows, and the inferred double/long
# columns differ from the DECIMAL/INT types declared in Step 2
df_trips.write.mode("overwrite").saveAsTable("transportation.analytics.fleet_trips")

print(f"\nSuccessfully inserted {df_trips.count()} records into transportation.analytics.fleet_trips")
print("Liquid clustering automatically optimized the data layout during insertion!")

DataFrame Schema:
root
 |-- distance: double (nullable = true)
 |-- duration: double (nullable = true)
 |-- fuel_consumed: double (nullable = true)
 |-- load_factor: long (nullable = true)
 |-- route_id: string (nullable = true)
 |-- trip_date: timestamp (nullable = true)
 |-- vehicle_id: string (nullable = true)


Sample Data:


+--------+--------+-------------+-----------+--------------+-------------------+----------+
|distance|duration|fuel_consumed|load_factor|      route_id|          trip_date|vehicle_id|
+--------+--------+-------------+-----------+--------------+-------------------+----------+
|   48.18|  107.57|         8.54|         79|RT_HOU_DAL_004|2024-09-21 14:44:00|    VH0001|
|   71.26|  122.74|        14.88|         87|RT_HOU_DAL_004|2024-12-01 05:12:00|    VH0001|
|  136.21|  266.74|        18.61|         81|RT_NYC_MAN_001|2024-11-22 09:04:00|    VH0001|
|   488.8|  544.36|        62.62|         96|RT_HOU_DAL_004|2024-12-13 20:05:00|    VH0001|
|  417.19|  437.07|        72.73|         96|RT_MIA_ORL_005|2024-12-22 06:01:00|    VH0001|
+--------+--------+-------------+-----------+--------------+-------------------+----------+
only showing top 5 rows




Successfully inserted 20176 records into transportation.analytics.fleet_trips
Liquid clustering automatically optimized the data layout during insertion!
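
The schema printed above shows `double` and `long` columns, whereas Step 2 declared `DECIMAL` and `INT`. One way to keep the declared types, sketched here under assumptions, is to cast each column before writing. The helper `cast_expressions` is hypothetical; the commented usage relies on `DataFrame.selectExpr` and an append into the empty table, and whether that preserves the table definition in AIDP has not been verified here.

```python
# Declared column types, copied from the CREATE TABLE statement in Step 2
DECLARED_TYPES = {
    "vehicle_id": "STRING",
    "trip_date": "TIMESTAMP",
    "route_id": "STRING",
    "distance": "DECIMAL(8,2)",
    "duration": "DECIMAL(6,2)",
    "fuel_consumed": "DECIMAL(6,2)",
    "load_factor": "INT",
}


def cast_expressions(declared_types):
    """Build one SQL CAST expression per column, in declared column order."""
    return [f"CAST({name} AS {dtype}) AS {name}" for name, dtype in declared_types.items()]


exprs = cast_expressions(DECLARED_TYPES)
for expr in exprs:
    print(expr)

# Hypothetical usage in the notebook (needs the live `spark` session, so left commented out):
# df_typed = df_trips.selectExpr(*exprs)
# df_typed.write.mode("append").saveAsTable("transportation.analytics.fleet_trips")
```

Selecting the columns in declared order also avoids relying on the alphabetical column order that schema inference produced.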


## Step 5: Demonstrate Liquid Clustering Benefits

### Query Performance Analysis

Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:

1. **Vehicle trip history** (clustered by vehicle_id)
2. **Time-based fleet analysis** (clustered by trip_date)
3. **Combined vehicle + time queries** (optimal for our clustering)

### Expected Performance Benefits

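With liquid clustering, these queries should scan fewer files than they would on an unclustered table (the effect grows with table size and is unlikely to be measurable at this demo's roughly 20,000 rows) because: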

- **Data locality**: Related records are physically grouped together
- **Reduced I/O**: Less data needs to be read from disk
- **Automatic optimization**: No manual tuning required
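
The data-locality and reduced-I/O points above can be illustrated with a toy simulation in plain Python. It splits rows into fixed-size "files", records each file's minimum and maximum `vehicle_id`, and counts how many files a lookup would have to open. This is a deliberate simplification: sorting on one column stands in for clustering, and Delta's actual multi-column clustering and file statistics are more involved. All names here are hypothetical.

```python
import random

ROWS_PER_FILE = 1_000  # arbitrary file size for the simulation


def write_files(vehicle_ids, rows_per_file=ROWS_PER_FILE):
    """Split rows into fixed-size 'files' and keep each file's (min, max) vehicle_id."""
    files = []
    for start in range(0, len(vehicle_ids), rows_per_file):
        chunk = vehicle_ids[start:start + rows_per_file]
        files.append((min(chunk), max(chunk)))
    return files


def files_to_read(files, target):
    """Count files whose [min, max] range could contain the target value."""
    return sum(low <= target <= high for low, high in files)


# 500 vehicles x 40 trips each, similar in size to the demo table
rows = [f"VH{n:04d}" for n in range(1, 501) for _ in range(40)]

unclustered = rows[:]
random.Random(0).shuffle(unclustered)  # rows land in files in arbitrary order
clustered = sorted(rows)               # rows for one vehicle sit next to each other

unclustered_reads = files_to_read(write_files(unclustered), "VH0250")
clustered_reads = files_to_read(write_files(clustered), "VH0250")

print(f"files read without clustering: {unclustered_reads} of {len(rows) // ROWS_PER_FILE}")
print(f"files read with clustering:    {clustered_reads} of {len(rows) // ROWS_PER_FILE}")
```

In the shuffled layout almost every file's min/max range spans the target, so min/max statistics cannot rule any file out; in the sorted layout the lookup is confined to the file that holds that vehicle.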

In [None]:
# Demonstrate liquid clustering benefits with optimized queries

# Query 1: Vehicle trip history - benefits from vehicle_id clustering
print("=== Query 1: Vehicle Trip History ===")
vehicle_history = spark.sql("""
SELECT vehicle_id, trip_date, route_id, distance, fuel_consumed, load_factor
FROM transportation.analytics.fleet_trips
WHERE vehicle_id = 'VH0001'
ORDER BY trip_date DESC
""")

vehicle_history.show()
print(f"Records found: {vehicle_history.count()}")

# Query 2: Time-based fuel efficiency analysis - benefits from trip_date clustering
print("\n=== Query 2: Recent Fuel Efficiency Issues ===")
fuel_efficiency = spark.sql("""
SELECT trip_date, vehicle_id, route_id, distance, fuel_consumed,
       ROUND(distance / fuel_consumed, 2) as mpg
FROM transportation.analytics.fleet_trips
WHERE trip_date >= '2024-06-01' AND (distance / fuel_consumed) < 15
ORDER BY mpg ASC, trip_date DESC
""")

fuel_efficiency.show()
print(f"Fuel efficiency issues found: {fuel_efficiency.count()}")

# Query 3: Combined vehicle + time query - optimal for our clustering strategy
print("\n=== Query 3: Vehicle Performance Trends ===")
performance_trends = spark.sql("""
SELECT vehicle_id, trip_date, route_id, distance, duration, load_factor
FROM transportation.analytics.fleet_trips
WHERE vehicle_id LIKE 'VH000%' AND trip_date >= '2024-04-01'
ORDER BY vehicle_id, trip_date
""")

performance_trends.show()
print(f"Performance trend records found: {performance_trends.count()}")

=== Query 1: Vehicle Trip History ===


+----------+-------------------+--------------+--------+-------------+-----------+
|vehicle_id|          trip_date|      route_id|distance|fuel_consumed|load_factor|
+----------+-------------------+--------------+--------+-------------+-----------+
|    VH0001|2024-12-22 06:01:00|RT_MIA_ORL_005|  417.19|        72.73|         96|
|    VH0001|2024-12-15 12:53:00|RT_LAX_SFO_002|   99.26|        16.85|         84|
|    VH0001|2024-12-13 20:05:00|RT_HOU_DAL_004|   488.8|        62.62|         96|
|    VH0001|2024-12-03 11:07:00|RT_HOU_DAL_004|  519.16|        69.75|         99|
|    VH0001|2024-12-01 05:12:00|RT_HOU_DAL_004|   71.26|        14.88|         87|
|    VH0001|2024-11-23 06:58:00|RT_LAX_SFO_002|  348.19|        68.93|         98|
|    VH0001|2024-11-22 09:04:00|RT_NYC_MAN_001|  136.21|        18.61|         81|
|    VH0001|2024-11-20 13:03:00|RT_CHI_DET_003|   89.91|        16.35|         82|
|    VH0001|2024-11-16 11:09:00|RT_HOU_DAL_004|  605.19|        67.39|         97|
|   

Records found: 31

=== Query 2: Recent Fuel Efficiency Issues ===


+-------------------+----------+--------------+--------+-------------+----+
|          trip_date|vehicle_id|      route_id|distance|fuel_consumed| mpg|
+-------------------+----------+--------------+--------+-------------+----+
|2024-08-03 07:41:00|    VH0114|RT_NYC_MAN_001|   31.71|        10.62|2.99|
|2024-07-04 14:10:00|    VH0416|RT_LAX_SFO_002|   31.57|        10.57|2.99|
|2024-11-10 09:49:00|    VH0444|RT_MIA_ORL_005|   31.83|         10.6| 3.0|
|2024-09-30 16:50:00|    VH0362|RT_LAX_SFO_002|   31.78|        10.61| 3.0|
|2024-11-29 13:49:00|    VH0117|RT_LAX_SFO_002|   31.71|        10.54|3.01|
|2024-06-03 13:03:00|    VH0413|RT_NYC_MAN_001|    31.9|        10.58|3.02|
|2024-10-03 08:02:00|    VH0452|RT_NYC_MAN_001|   31.58|        10.35|3.05|
|2024-10-19 18:05:00|    VH0274|RT_MIA_ORL_005|   32.63|        10.58|3.08|
|2024-08-03 14:49:00|    VH0058|RT_CHI_DET_003|   31.61|        10.27|3.08|
|2024-07-14 08:26:00|    VH0118|RT_MIA_ORL_005|   32.02|        10.39|3.08|
|2024-11-23 

Fuel efficiency issues found: 11804

=== Query 3: Vehicle Performance Trends ===


+----------+-------------------+--------------+--------+--------+-----------+
|vehicle_id|          trip_date|      route_id|distance|duration|load_factor|
+----------+-------------------+--------------+--------+--------+-----------+
|    VH0001|2024-04-05 10:59:00|RT_LAX_SFO_002|   59.35|  187.65|         80|
|    VH0001|2024-04-14 23:04:00|RT_LAX_SFO_002|   46.08|  173.74|         91|
|    VH0001|2024-05-01 14:40:00|RT_CHI_DET_003|  108.14|  217.92|         77|
|    VH0001|2024-05-11 07:48:00|RT_NYC_MAN_001|  603.41|  701.52|         91|
|    VH0001|2024-06-16 11:51:00|RT_LAX_SFO_002|   554.8|  481.12|         88|
|    VH0001|2024-06-24 13:48:00|RT_HOU_DAL_004|    89.9|  160.75|         77|
|    VH0001|2024-07-18 11:37:00|RT_CHI_DET_003|  418.77|   679.2|         97|
|    VH0001|2024-07-23 07:31:00|RT_HOU_DAL_004|  316.56|  767.37|         99|
|    VH0001|2024-08-12 06:57:00|RT_CHI_DET_003|    98.6|   88.27|         99|
|    VH0001|2024-08-16 04:26:00|RT_LAX_SFO_002|   42.08|  127.55

Performance trend records found: 236
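
Query 2 flags trips below 15 MPG. A quick worked calculation from the generator's parameters in Step 3 shows which fuel-economy values the synthetic data can produce at all. The sketch copies the average distance and fuel figures and the variation ranges from that cell; the function name `mpg_bounds` is hypothetical, and rounding in the generator can shift the bounds slightly.

```python
# Average distance and fuel per route type, copied from TRIP_PARAMS in Step 3
TRIP_PARAMS = {
    "Urban Delivery": {"avg_distance": 45, "avg_fuel": 8.5},
    "Long-haul": {"avg_distance": 450, "avg_fuel": 65.0},
    "Local Transport": {"avg_distance": 120, "avg_fuel": 15.2},
    "Express Delivery": {"avg_distance": 80, "avg_fuel": 12.8},
}
DISTANCE_RANGE = (0.7, 1.4)   # bounds of distance_variation in the generator
FUEL_RANGE = (0.85, 1.25)     # bounds of fuel_variation in the generator


def mpg_bounds(params):
    """Lowest and highest distance/fuel ratio the generator can produce."""
    lowest = params["avg_distance"] * DISTANCE_RANGE[0] / (params["avg_fuel"] * FUEL_RANGE[1])
    highest = params["avg_distance"] * DISTANCE_RANGE[1] / (params["avg_fuel"] * FUEL_RANGE[0])
    return lowest, highest


bounds = {route_type: mpg_bounds(params) for route_type, params in TRIP_PARAMS.items()}
for route_type, (lowest, highest) in bounds.items():
    print(f"{route_type:<17} {lowest:5.2f} - {highest:5.2f} MPG")

best_possible = max(highest for _, highest in bounds.values())
print(f"best possible MPG in this dataset: {best_possible:.2f}")
```

Because the best achievable value stays below 15, the `< 15` filter in Query 2 does not narrow the result beyond the date filter, and only the two lowest efficiency categories can appear in the Step 6 breakdown. A threshold inside the computed ranges would make the query more selective.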


## Step 6: Analyze Clustering Effectiveness

### Understanding the Impact

The cell below runs aggregate queries against the clustered table to show the kinds of transportation insights this structure supports. Grouping by `vehicle_id` aligns with the clustering columns, while the route and monthly aggregations scan the whole table.

### Key Analytics

- **Vehicle utilization** and performance metrics
- **Route efficiency** and fuel consumption analysis
- **Fleet capacity utilization** and load factors
- **Operational cost trends** and optimization opportunities
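
One detail matters when reading the fuel-economy metrics in this step: the queries compute `AVG(distance / fuel_consumed)`, the mean of per-trip ratios, which is not the same as total distance divided by total fuel. The sketch below uses two illustrative trips based on the Long-haul and Urban Delivery averages from Step 3 to show the difference; the variable names are hypothetical.

```python
# Two illustrative trips as (distance, fuel_consumed) pairs
trips = [(450.0, 65.0), (45.0, 8.5)]

# What AVG(distance / fuel_consumed) computes: every trip counts equally
mean_of_ratios = sum(distance / fuel for distance, fuel in trips) / len(trips)

# What SUM(distance) / SUM(fuel_consumed) computes: long trips weigh more
ratio_of_totals = sum(distance for distance, _ in trips) / sum(fuel for _, fuel in trips)

print(f"mean of per-trip MPG: {mean_of_ratios:.2f}")
print(f"fleet-level MPG:      {ratio_of_totals:.2f}")
```

For cost reporting, the ratio of totals is usually the figure that matches actual fuel spend; the per-trip mean is more useful for spotting individual inefficient trips.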

In [None]:
# Analyze clustering effectiveness and transportation insights

# Vehicle performance analysis
print("=== Vehicle Performance Analysis ===")
vehicle_performance = spark.sql("""
SELECT vehicle_id, COUNT(*) as total_trips,
       ROUND(SUM(distance), 2) as total_distance,
       ROUND(SUM(fuel_consumed), 2) as total_fuel,
       ROUND(AVG(distance / fuel_consumed), 2) as avg_mpg,
       ROUND(AVG(load_factor), 2) as avg_load_factor,
       ROUND(SUM(distance), 0) as total_miles
FROM transportation.analytics.fleet_trips
GROUP BY vehicle_id
ORDER BY total_miles DESC
""")

vehicle_performance.show()

# Route efficiency analysis
print("\n=== Route Efficiency Analysis ===")
route_efficiency = spark.sql("""
SELECT route_id, COUNT(*) as total_trips,
       ROUND(AVG(distance), 2) as avg_distance,
       ROUND(AVG(duration), 2) as avg_duration,
       ROUND(AVG(distance / duration * 60), 2) as avg_speed,
       ROUND(AVG(load_factor), 2) as avg_load_factor
FROM transportation.analytics.fleet_trips
GROUP BY route_id
ORDER BY total_trips DESC
""")

route_efficiency.show()

# Fleet fuel consumption analysis
print("\n=== Fleet Fuel Consumption Analysis ===")
fuel_analysis = spark.sql("""
SELECT
    CASE
        WHEN distance / fuel_consumed >= 25 THEN 'Excellent (25+ MPG)'
        WHEN distance / fuel_consumed >= 20 THEN 'Good (20-24 MPG)'
        WHEN distance / fuel_consumed >= 15 THEN 'Average (15-19 MPG)'
        WHEN distance / fuel_consumed >= 10 THEN 'Poor (10-14 MPG)'
        ELSE 'Very Poor (<10 MPG)'
    END as fuel_efficiency_category,
    COUNT(*) as trip_count,
    ROUND(AVG(distance / fuel_consumed), 2) as avg_mpg,
    ROUND(SUM(fuel_consumed), 2) as total_fuel_used
FROM transportation.analytics.fleet_trips
GROUP BY
    CASE
        WHEN distance / fuel_consumed >= 25 THEN 'Excellent (25+ MPG)'
        WHEN distance / fuel_consumed >= 20 THEN 'Good (20-24 MPG)'
        WHEN distance / fuel_consumed >= 15 THEN 'Average (15-19 MPG)'
        WHEN distance / fuel_consumed >= 10 THEN 'Poor (10-14 MPG)'
        ELSE 'Very Poor (<10 MPG)'
    END
ORDER BY avg_mpg DESC
""")

fuel_analysis.show()

# Monthly operational trends
print("\n=== Monthly Operational Trends ===")
monthly_trends = spark.sql("""
SELECT DATE_FORMAT(trip_date, 'yyyy-MM') as month,
       COUNT(*) as total_trips,
       ROUND(SUM(distance), 2) as monthly_distance,
       ROUND(SUM(fuel_consumed), 2) as monthly_fuel,
       ROUND(AVG(load_factor), 2) as avg_load_factor,
       COUNT(DISTINCT vehicle_id) as active_vehicles
FROM transportation.analytics.fleet_trips
GROUP BY DATE_FORMAT(trip_date, 'yyyy-MM')
ORDER BY month
""")

monthly_trends.show()

=== Vehicle Performance Analysis ===


+----------+-----------+--------------+----------+-------+---------------+-----------+
|vehicle_id|total_trips|total_distance|total_fuel|avg_mpg|avg_load_factor|total_miles|
+----------+-----------+--------------+----------+-------+---------------+-----------+
|    VH0051|         59|       13834.0|   2033.34|   6.57|          87.44|    13834.0|
|    VH0123|         60|      13633.12|   1879.83|   7.08|          84.27|    13633.0|
|    VH0453|         57|      12890.51|    1937.5|   6.63|          86.86|    12891.0|
|    VH0343|         54|      12846.02|   1855.01|   6.95|          87.28|    12846.0|
|    VH0088|         57|      12547.45|   1816.14|   6.88|          85.05|    12547.0|
|    VH0238|         59|      12448.77|    1814.0|   7.05|          86.02|    12449.0|
|    VH0278|         53|      12418.19|   1824.83|   6.77|          87.13|    12418.0|
|    VH0427|         54|      12406.78|   1753.24|   6.86|          87.31|    12407.0|
|    VH0406|         60|      12304.12|   1

=== Route Efficiency Analysis ===

+--------------+-----------+------------+------------+---------+---------------+
|      route_id|total_trips|avg_distance|avg_duration|avg_speed|avg_load_factor|
+--------------+-----------+------------+------------+---------+---------------+
|RT_NYC_MAN_001|       4111|      177.28|      255.39|    38.95|          86.55|
|RT_CHI_DET_003|       4060|      185.45|      265.02|    39.14|          86.34|
|RT_LAX_SFO_002|       4029|      180.58|       258.5|    39.27|          86.44|
|RT_MIA_ORL_005|       3991|      181.71|      260.37|    39.22|          86.31|
|RT_HOU_DAL_004|       3985|      182.39|      261.78|    39.02|          86.49|
+--------------+-----------+------------+------------+---------+---------------+


=== Fleet Fuel Consumption Analysis ===


+------------------------+----------+-------+---------------+
|fuel_efficiency_category|trip_count|avg_mpg|total_fuel_used|
+------------------------+----------+-------+---------------+
|        Poor (10-14 MPG)|       941|  10.79|       21486.63|
|     Very Poor (<10 MPG)|     19235|   6.46|      514816.86|
+------------------------+----------+-------+---------------+


=== Monthly Operational Trends ===


+-------+-----------+----------------+------------+---------------+---------------+
|  month|total_trips|monthly_distance|monthly_fuel|avg_load_factor|active_vehicles|
+-------+-----------+----------------+------------+---------------+---------------+
|2024-01|       1756|       319156.46|    46719.73|          86.38|            479|
|2024-02|       1559|       268888.27|    39614.96|          86.14|            467|
|2024-03|       1703|       318040.34|    46279.08|          86.41|            478|
|2024-04|       1641|       303054.54|    44354.61|          86.33|            477|
|2024-05|       1713|       316645.41|    46364.37|          86.38|            476|
|2024-06|       1700|       312248.26|    46181.31|          86.65|            474|
|2024-07|       1637|       291667.53|    43144.47|          86.67|            482|
|2024-08|       1704|       302704.22|    44418.71|           86.5|            481|
|2024-09|       1640|       295376.12|    42881.73|          86.49|         

## Key Takeaways: Delta Liquid Clustering in AIDP

### What We Demonstrated

1. **Automatic Optimization**: Created a table with `CLUSTER BY (vehicle_id, trip_date)` and let Delta automatically optimize data layout

2. **Performance Benefits**: Queries that filter on the clustering columns (vehicle_id, trip_date) can skip unrelated files thanks to data locality; the gain grows with table size

3. **Low Maintenance**: No manual partitioning, bucketing, or Z-Ordering required - Delta manages the data layout

4. **Real-World Use Case**: Transportation analytics where fleet monitoring and route optimization are critical

### AIDP Advantages

- **Unified Analytics**: Seamlessly integrates with other AIDP services
- **Governance**: Catalog and schema isolation for transportation data
- **Performance**: Optimized for both OLAP and OLTP workloads
- **Scalability**: Handles transportation-scale data volumes effortlessly

### Best Practices for Liquid Clustering

1. **Choose clustering columns** based on your most common query patterns
2. **Use 1-4 columns** - Delta allows at most four clustering columns, and a few well-chosen ones are usually more effective than many
3. **Consider cardinality** - high-cardinality columns work best
4. **Monitor and adjust** as query patterns evolve
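
The "monitor and adjust" practice can be sketched as a helper that prepares the statements for switching clustering columns. `ALTER TABLE ... CLUSTER BY` and `OPTIMIZE` are Delta SQL commands, but their availability in a given AIDP runtime is an assumption here, and the helper name `recluster_statements` and the example columns are hypothetical.

```python
def recluster_statements(table, new_cluster_by):
    """Return SQL statements that change a table's clustering columns and re-optimize it."""
    if not 1 <= len(new_cluster_by) <= 4:  # assumed Delta limit of four clustering columns
        raise ValueError("expected between 1 and 4 clustering columns")
    column_list = ", ".join(new_cluster_by)
    return [
        f"ALTER TABLE {table} CLUSTER BY ({column_list})",
        f"OPTIMIZE {table}",
    ]


# Example: route-centric queries have become more common than vehicle lookups
statements = recluster_statements(
    "transportation.analytics.fleet_trips",
    ["route_id", "trip_date"],
)
for statement in statements:
    print(statement)

# Hypothetical usage in the notebook (needs the live `spark` session):
# for statement in statements:
#     spark.sql(statement)
```

Changing the clustering columns typically affects newly written data first; files written earlier may keep their old layout until they are rewritten.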

### Next Steps

- Explore other AIDP features like AI/ML integration
- Try liquid clustering with different column combinations
- Scale up to larger transportation datasets
- Integrate with real GPS tracking and IoT sensor data

This notebook demonstrates how Oracle AI Data Platform makes advanced transportation analytics accessible while maintaining enterprise-grade performance and governance.