# Media: Delta Liquid Clustering Demo


## Overview


This notebook demonstrates the power of **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using a media and entertainment analytics use case. Liquid clustering automatically optimizes data layout for query performance without requiring manual partitioning or Z-Ordering.

### What is Liquid Clustering?

Liquid clustering automatically identifies and groups similar data together based on clustering columns you define. This optimization happens automatically during data ingestion and maintenance operations, providing:

- **Automatic optimization**: No manual tuning required
- **Improved query performance**: Faster queries on clustered columns
- **Reduced maintenance**: No need for manual repartitioning
- **Adaptive clustering**: Adjusts as data patterns change

### Use Case: Content Performance and User Engagement Analytics

We'll analyze media content consumption and user engagement data. Our clustering strategy will optimize for:

- **User-specific queries**: Fast lookups by user ID
- **Time-based analysis**: Efficient filtering by viewing and engagement dates
- **Content performance patterns**: Quick aggregation by content type and engagement metrics

## Step 1: AIDP Environment Setup

This notebook leverages the existing Spark session in your AIDP environment.

In [None]:
# Create media catalog and analytics schema

# In AIDP, catalogs provide data isolation and governance

spark.sql("CREATE CATALOG IF NOT EXISTS media")

spark.sql("CREATE SCHEMA IF NOT EXISTS media.analytics")

print("Media catalog and analytics schema created successfully!")

Media catalog and analytics schema created successfully!


## Step 2: Create Delta Table with Liquid Clustering

### Table Design

Our `content_engagement` table will store:

- **user_id**: Unique user identifier
- **engagement_date**: Date and time of engagement
- **content_type**: Type (Video, Article, Podcast, Live Stream)
- **watch_time**: Time spent consuming content (minutes)
- **content_id**: Specific content identifier
- **engagement_score**: User engagement metric (0-100)
- **device_type**: Device used (Mobile, Desktop, TV, etc.)

### Clustering Strategy

We'll cluster by `user_id` and `engagement_date` because:

- **user_id**: Users consume multiple pieces of content, so clustering on this column groups each user's viewing history together
- **engagement_date**: Time-based queries are critical for content performance analysis, recommendation systems, and user behavior trends
- This combination optimizes for both personalized content recommendations and temporal engagement analysis

In [None]:
# Create Delta table with liquid clustering

# CLUSTER BY defines the columns for automatic optimization

spark.sql("""

CREATE TABLE IF NOT EXISTS media.analytics.content_engagement (

    user_id STRING,

    engagement_date TIMESTAMP,

    content_type STRING,

    watch_time DECIMAL(8,2),

    content_id STRING,

    engagement_score INT,

    device_type STRING

)

USING DELTA

CLUSTER BY (user_id, engagement_date)

""")

print("Delta table with liquid clustering created successfully!")

print("Clustering will automatically optimize data layout for queries on user_id and engagement_date.")

Delta table with liquid clustering created successfully!
Clustering will automatically optimize data layout for queries on user_id and engagement_date.


## Step 3: Generate Media Sample Data

### Data Generation Strategy

We'll create realistic media engagement data including:

- **12,000 users** with multiple content interactions over time
- **Content types**: Video, Article, Podcast, Live Stream
- **Realistic engagement patterns**: Hour-of-day weighting for peak viewing times and per-device engagement multipliers
- **Engagement metrics**: Watch time (minutes) and an engagement score (0-100)

### Why This Data Pattern?

This data simulates real media scenarios where:

- User preferences drive content recommendations
- Engagement metrics determine content success
- Device usage affects viewing experience
- Time-based patterns influence programming decisions
- Personalization requires historical user behavior
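
The generator in this step draws from Python's global random state, so each run produces different records and different counts. A minimal sketch of a reproducible variant follows, assuming a private seeded `random.Random` instance is acceptable; `make_events` and its fields are hypothetical names for illustration, not part of the notebook.

```python
import random
from datetime import datetime, timedelta

def make_events(seed: int, n: int = 5):
    """Generate n toy engagement events from a private, seeded RNG."""
    rng = random.Random(seed)  # independent of the global random state
    base_date = datetime(2024, 1, 1)
    events = []
    for _ in range(n):
        day = base_date + timedelta(days=rng.randint(0, 365))
        events.append({
            "engagement_date": day.replace(hour=rng.randrange(24)),
            "content_type": rng.choice(["Video", "Article", "Podcast", "Live Stream"]),
        })
    return events

# The same seed always yields the same events.
assert make_events(42) == make_events(42)
```

Passing the seed explicitly keeps the demo data stable between runs, which makes query results easier to compare.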

In [None]:
# Generate sample media engagement data

# Using fully qualified imports to avoid conflicts

import random

from datetime import datetime, timedelta


# Define media data constants

CONTENT_TYPES = ['Video', 'Article', 'Podcast', 'Live Stream']

DEVICE_TYPES = ['Mobile', 'Desktop', 'Tablet', 'Smart TV', 'Gaming Console']

# Base engagement parameters by content type

ENGAGEMENT_PARAMS = {

    'Video': {'avg_watch_time': 15, 'engagement_base': 75, 'frequency': 12},

    'Article': {'avg_watch_time': 8, 'engagement_base': 65, 'frequency': 8},

    'Podcast': {'avg_watch_time': 25, 'engagement_base': 70, 'frequency': 6},

    'Live Stream': {'avg_watch_time': 45, 'engagement_base': 80, 'frequency': 4}

}

# Device engagement multipliers

DEVICE_MULTIPLIERS = {

    'Mobile': 0.9, 'Desktop': 1.0, 'Tablet': 0.95, 'Smart TV': 1.1, 'Gaming Console': 1.05

}


# Generate content engagement records

engagement_data = []

base_date = datetime(2024, 1, 1)


# Create 12,000 users with 10-40 engagement events each

for user_num in range(1, 12001):

    user_id = f"USER{user_num:06d}"
    
    # Each user gets 10-40 engagement events over 12 months

    num_engagements = random.randint(10, 40)
    
    for i in range(num_engagements):

        # Spread engagements over 12 months

        days_offset = random.randint(0, 365)

        engagement_date = base_date + timedelta(days=days_offset)
        
        # Add realistic timing (more engagement during certain hours)

        hour_weights = [2, 1, 1, 1, 1, 1, 3, 6, 8, 7, 6, 7, 8, 9, 10, 9, 8, 10, 12, 9, 7, 5, 4, 3]

        hours_offset = random.choices(range(24), weights=hour_weights)[0]

        engagement_date = engagement_date.replace(hour=hours_offset, minute=random.randint(0, 59), second=0, microsecond=0)
        
        # Select content type

        content_type = random.choice(CONTENT_TYPES)

        params = ENGAGEMENT_PARAMS[content_type]
        
        # Select device type

        device_type = random.choice(DEVICE_TYPES)

        device_multiplier = DEVICE_MULTIPLIERS[device_type]
        
        # Calculate watch time with variations

        time_variation = random.uniform(0.3, 2.5)

        watch_time = round(params['avg_watch_time'] * time_variation * device_multiplier, 2)
        
        # Content ID

        content_id = f"{content_type[:3].upper()}{random.randint(10000, 99999)}"
        
        # Engagement score (based on content type, device, and some randomness)

        engagement_variation = random.randint(-15, 15)

        engagement_score = max(0, min(100, int(params['engagement_base'] * device_multiplier) + engagement_variation))
        
        engagement_data.append({

            "user_id": user_id,

            "engagement_date": engagement_date,

            "content_type": content_type,

            "watch_time": watch_time,

            "content_id": content_id,

            "engagement_score": engagement_score,

            "device_type": device_type

        })



print(f"Generated {len(engagement_data)} content engagement records")

print("Sample record:", engagement_data[0])

Generated 299540 content engagement records
Sample record: {'user_id': 'USER000001', 'engagement_date': datetime.datetime(2024, 8, 13, 17, 29), 'content_type': 'Podcast', 'watch_time': 34.22, 'content_id': 'POD96528', 'engagement_score': 74, 'device_type': 'Desktop'}


## Step 4: Insert Data Using PySpark

### Data Insertion Strategy

We'll use PySpark to:

1. **Create DataFrame** from our generated data
2. **Insert into Delta table** with liquid clustering
3. **Verify the insertion** with a sample query

### Why PySpark for Insertion?

- **Distributed processing**: Handles large datasets efficiently
- **Type safety**: Ensures data integrity
- **Optimization**: Leverages Spark's query optimization
- **Liquid clustering**: Automatically applies clustering during insertion
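
On the type-safety point: the table declares `watch_time` as `DECIMAL(8,2)`, while the generated records hold Python floats. The sketch below shows one way to normalise a value on the Python side before building the DataFrame. It is an illustration only: `to_decimal_8_2` is a hypothetical helper, and the notebook does not confirm whether the platform casts such values implicitly on write.

```python
from decimal import Decimal, ROUND_HALF_UP

def to_decimal_8_2(value: float) -> Decimal:
    """Round to two decimal places and check the value fits DECIMAL(8,2)."""
    d = Decimal(str(value)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    if abs(d) >= Decimal("1000000"):  # DECIMAL(8,2) leaves 6 integer digits
        raise ValueError(f"{value} does not fit DECIMAL(8,2)")
    return d

record = {"watch_time": 34.22, "engagement_score": 74}
record["watch_time"] = to_decimal_8_2(record["watch_time"])
print(record)
```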

In [None]:
# Insert data using PySpark DataFrame operations

# Using fully qualified function references to avoid conflicts


# Create DataFrame from generated data

df_engagement = spark.createDataFrame(engagement_data)


# Display schema and sample data

print("DataFrame Schema:")

df_engagement.printSchema()



print("\nSample Data:")

df_engagement.show(5)


# Insert data into the Delta table defined in Step 2 with CLUSTER BY (user_id, engagement_date)

# mode("overwrite") replaces any existing rows, so re-running this cell does not duplicate data

df_engagement.write.mode("overwrite").saveAsTable("media.analytics.content_engagement")


print(f"\nSuccessfully inserted {df_engagement.count()} records into media.analytics.content_engagement")

print("Liquid clustering automatically optimized the data layout during insertion!")

DataFrame Schema:
root
 |-- content_id: string (nullable = true)
 |-- content_type: string (nullable = true)
 |-- device_type: string (nullable = true)
 |-- engagement_date: timestamp (nullable = true)
 |-- engagement_score: long (nullable = true)
 |-- user_id: string (nullable = true)
 |-- watch_time: double (nullable = true)


Sample Data:


+----------+------------+--------------+-------------------+----------------+----------+----------+
|content_id|content_type|   device_type|    engagement_date|engagement_score|   user_id|watch_time|
+----------+------------+--------------+-------------------+----------------+----------+----------+
|  POD96528|     Podcast|       Desktop|2024-08-13 17:29:00|              74|USER000001|     34.22|
|  VID98484|       Video|        Mobile|2024-09-04 00:59:00|              81|USER000001|     13.27|
|  VID15293|       Video|        Tablet|2024-01-01 10:39:00|              84|USER000001|      9.75|
|  POD83689|     Podcast|        Mobile|2024-06-04 20:33:00|              76|USER000001|     41.79|
|  POD56644|     Podcast|Gaming Console|2024-02-19 13:31:00|              63|USER000001|      27.7|
+----------+------------+--------------+-------------------+----------------+----------+----------+
only showing top 5 rows




Successfully inserted 299540 records into media.analytics.content_engagement
Liquid clustering automatically optimized the data layout during insertion!


## Step 5: Demonstrate Liquid Clustering Benefits

### Query Performance Analysis

Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:

1. **User engagement history** (clustered by user_id)
2. **Time-based content analysis** (clustered by engagement_date)
3. **Combined user + time queries** (optimal for our clustering)

### Expected Performance Benefits

With liquid clustering, these queries should be significantly faster because:

- **Data locality**: Related records are physically grouped together
- **Reduced I/O**: Less data needs to be read from disk
- **Automatic optimization**: No manual tuning required
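
The data locality and reduced I/O points can be illustrated with a toy model in plain Python. This is a sketch under simplifying assumptions, not Delta's actual implementation: "files" are fixed-size slices of a list, each file keeps only a min/max range for the key, and clustering is approximated by a simple sort. The helper `files_to_read` is hypothetical.

```python
import random

def files_to_read(rows, key, target, rows_per_file=100):
    """Count files whose [min, max] range on `key` could contain `target`."""
    files = [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]
    hits = 0
    for f in files:
        values = [r[key] for r in f]
        if min(values) <= target <= max(values):
            hits += 1  # this file cannot be skipped
    return hits, len(files)

rng = random.Random(0)
rows = [{"user_id": f"USER{rng.randint(1, 500):06d}"} for _ in range(10_000)]

# Same rows, two layouts: arrival order versus sorted by the clustering key.
unclustered = files_to_read(rows, "user_id", "USER000042")
clustered = files_to_read(sorted(rows, key=lambda r: r["user_id"]), "user_id", "USER000042")
print("files read (unclustered, total):", unclustered)
print("files read (clustered, total):", clustered)
```

In the unsorted layout almost every file's range spans the target user, so nearly all files must be read. After sorting, only the few files that actually hold that user remain candidates.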

In [None]:
# Demonstrate liquid clustering benefits with optimized queries


# Query 1: User engagement history - benefits from user_id clustering

print("=== Query 1: User Engagement History ===")

user_history = spark.sql("""

SELECT user_id, engagement_date, content_type, watch_time, engagement_score

FROM media.analytics.content_engagement

WHERE user_id = 'USER000001'

ORDER BY engagement_date DESC

LIMIT 10

""")



user_history.show()

print(f"Records found: {user_history.count()}")



# Query 2: Time-based high-engagement content analysis - benefits from engagement_date clustering

print("\n=== Query 2: Recent High-Engagement Content ===")

high_engagement = spark.sql("""

SELECT engagement_date, user_id, content_id, content_type, engagement_score, watch_time

FROM media.analytics.content_engagement

WHERE engagement_date >= '2024-02-15' AND engagement_date < '2024-02-16' AND engagement_score > 85

ORDER BY engagement_score DESC, watch_time DESC

""")



high_engagement.show()

print(f"High-engagement records found: {high_engagement.count()}")



# Query 3: Combined user + time query - optimal for our clustering strategy

print("\n=== Query 3: User Content Preferences ===")

user_preferences = spark.sql("""

SELECT user_id, engagement_date, content_type, watch_time, device_type

FROM media.analytics.content_engagement

WHERE user_id LIKE 'USER000%' AND engagement_date >= '2024-02-01'

ORDER BY user_id, engagement_date

LIMIT 25

""")



user_preferences.show()

print(f"User preference records found: {user_preferences.count()}")

=== Query 1: User Engagement History ===


+----------+-------------------+------------+----------+----------------+
|   user_id|    engagement_date|content_type|watch_time|engagement_score|
+----------+-------------------+------------+----------+----------------+
|USER000001|2024-12-30 07:16:00|     Podcast|     41.06|              83|
|USER000001|2024-12-08 17:18:00|     Podcast|     13.61|              75|
|USER000001|2024-11-27 07:56:00|     Article|     18.44|              63|
|USER000001|2024-10-15 15:23:00| Live Stream|     111.8|              80|
|USER000001|2024-09-04 00:59:00|       Video|     13.27|              81|
|USER000001|2024-09-03 23:01:00| Live Stream|      65.6|              88|
|USER000001|2024-09-03 14:35:00| Live Stream|     44.77|              91|
|USER000001|2024-08-20 19:50:00|     Podcast|     40.36|              67|
|USER000001|2024-08-13 17:29:00|     Podcast|     34.22|              74|
|USER000001|2024-07-17 23:14:00| Live Stream|     113.5|              74|
+----------+-------------------+------------+----------+----------------+

Records found: 10

=== Query 2: Recent High-Engagement Content ===


+-------------------+----------+----------+------------+----------------+----------+
|    engagement_date|   user_id|content_id|content_type|engagement_score|watch_time|
+-------------------+----------+----------+------------+----------------+----------+
|2024-02-15 16:33:00|USER004701|  LIV23443| Live Stream|             100|     111.0|
|2024-02-15 15:56:00|USER009133|  LIV37632| Live Stream|             100|    107.46|
|2024-02-15 06:56:00|USER005956|  LIV52538| Live Stream|             100|    102.42|
|2024-02-15 15:32:00|USER002011|  LIV53566| Live Stream|             100|     57.66|
|2024-02-15 10:38:00|USER004131|  LIV78476| Live Stream|             100|     21.97|
|2024-02-15 07:53:00|USER001098|  LIV42709| Live Stream|             100|     21.52|
|2024-02-15 15:50:00|USER011262|  LIV59439| Live Stream|              99|     74.89|
|2024-02-15 13:38:00|USER006084|  LIV42623| Live Stream|              98|    110.39|
|2024-02-15 02:57:00|USER010226|  LIV65581| Live Stream|         

High-engagement records found: 106

=== Query 3: User Content Preferences ===


+----------+-------------------+------------+----------+--------------+
|   user_id|    engagement_date|content_type|watch_time|   device_type|
+----------+-------------------+------------+----------+--------------+
|USER000001|2024-02-19 13:31:00|     Podcast|      27.7|Gaming Console|
|USER000001|2024-03-06 18:48:00| Live Stream|     93.56|        Mobile|
|USER000001|2024-03-19 21:42:00|       Video|     32.25|       Desktop|
|USER000001|2024-03-26 07:32:00|     Podcast|      17.3|      Smart TV|
|USER000001|2024-04-02 12:00:00|     Podcast|     40.56|      Smart TV|
|USER000001|2024-04-02 13:07:00|     Podcast|     24.74|       Desktop|
|USER000001|2024-04-27 14:31:00|     Podcast|     32.07|        Tablet|
|USER000001|2024-05-05 23:26:00|       Video|     11.33|        Tablet|
|USER000001|2024-05-06 18:32:00|     Podcast|      17.0|        Tablet|
|USER000001|2024-06-04 20:33:00|     Podcast|     41.79|        Mobile|
|USER000001|2024-06-06 13:12:00|       Video|     30.08|      Sm

User preference records found: 25


## Step 6: Analyze Clustering Effectiveness

### Understanding the Impact

Let's run aggregate queries over the clustered table to demonstrate the media insights possible with this optimized structure.

### Key Analytics

- **User engagement patterns** and content preferences
- **Content performance** by type and popularity metrics
- **Device usage trends** and platform optimization
- **Time-based consumption patterns** and programming insights
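
Before running the analytics, it can be useful to confirm which columns the table is actually clustered on. The helper below is a hedged sketch: it assumes `DESCRIBE DETAIL` returns a `clusteringColumns` field, as recent Delta Lake releases do, which the notebook does not confirm for AIDP. `clustering_columns` is a hypothetical name, and the function returns `None` when the field is absent.

```python
def clustering_columns(spark, table_name):
    """Return the table's clustering columns, or None if the field is absent.

    Assumption: DESCRIBE DETAIL exposes a `clusteringColumns` field on this
    platform. `spark` is the notebook's existing Spark session.
    """
    detail = spark.sql(f"DESCRIBE DETAIL {table_name}").collect()[0]
    cols = detail.asDict().get("clusteringColumns")
    return list(cols) if cols is not None else None

# In the notebook (requires the live `spark` session):
# print(clustering_columns(spark, "media.analytics.content_engagement"))
```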

In [None]:
# Analyze clustering effectiveness and media insights


# User engagement analysis

print("=== User Engagement Analysis ===")

user_engagement = spark.sql("""

SELECT user_id, COUNT(*) as total_sessions,

       ROUND(SUM(watch_time), 2) as total_watch_time,

       ROUND(AVG(watch_time), 2) as avg_session_time,

       ROUND(AVG(engagement_score), 2) as avg_engagement,

       COUNT(DISTINCT content_type) as content_types_used

FROM media.analytics.content_engagement

GROUP BY user_id

ORDER BY total_watch_time DESC

LIMIT 10

""")



user_engagement.show()


# Content type performance

print("\n=== Content Type Performance ===")

content_performance = spark.sql("""

SELECT content_type, COUNT(*) as total_engagements,

       ROUND(SUM(watch_time), 2) as total_watch_time,

       ROUND(AVG(watch_time), 2) as avg_watch_time,

       ROUND(AVG(engagement_score), 2) as avg_engagement,

       COUNT(DISTINCT user_id) as unique_users,

       COUNT(DISTINCT content_id) as unique_content

FROM media.analytics.content_engagement

GROUP BY content_type

ORDER BY total_watch_time DESC

""")



content_performance.show()


# Device usage analysis

print("\n=== Device Usage Analysis ===")

device_analysis = spark.sql("""

SELECT device_type, COUNT(*) as total_sessions,

       ROUND(SUM(watch_time), 2) as total_watch_time,

       ROUND(AVG(watch_time), 2) as avg_session_time,

       ROUND(AVG(engagement_score), 2) as avg_engagement,

       COUNT(DISTINCT user_id) as unique_users

FROM media.analytics.content_engagement

GROUP BY device_type

ORDER BY total_watch_time DESC

""")



device_analysis.show()


# Hourly engagement patterns

print("\n=== Hourly Engagement Patterns ===")

hourly_patterns = spark.sql("""

SELECT HOUR(engagement_date) as hour_of_day, COUNT(*) as engagement_events,

       ROUND(SUM(watch_time), 2) as total_watch_time,

       ROUND(AVG(engagement_score), 2) as avg_engagement,

       COUNT(DISTINCT user_id) as active_users

FROM media.analytics.content_engagement

WHERE engagement_date >= '2024-02-01' AND engagement_date < '2024-02-02'

GROUP BY HOUR(engagement_date)

ORDER BY hour_of_day

""")



hourly_patterns.show()


# Monthly engagement trends

print("\n=== Monthly Engagement Trends ===")

monthly_trends = spark.sql("""

SELECT DATE_FORMAT(engagement_date, 'yyyy-MM') as month,

       COUNT(*) as total_engagements,

       ROUND(SUM(watch_time), 2) as monthly_watch_time,

       ROUND(AVG(watch_time), 2) as avg_session_time,

       ROUND(AVG(engagement_score), 2) as avg_engagement,

       COUNT(DISTINCT user_id) as active_users

FROM media.analytics.content_engagement

GROUP BY DATE_FORMAT(engagement_date, 'yyyy-MM')

ORDER BY month

""")



monthly_trends.show()

=== User Engagement Analysis ===


+----------+--------------+----------------+----------------+--------------+------------------+
|   user_id|total_sessions|total_watch_time|avg_session_time|avg_engagement|content_types_used|
+----------+--------------+----------------+----------------+--------------+------------------+
|USER007579|            40|         1877.93|           46.95|         75.13|                 4|
|USER005840|            37|         1833.53|           49.55|         74.32|                 4|
|USER001865|            38|         1811.01|           47.66|         74.92|                 4|
|USER004356|            38|         1750.62|           46.07|         72.79|                 4|
|USER007922|            36|         1738.63|            48.3|         75.08|                 4|
|USER002936|            35|         1729.81|           49.42|         69.69|                 4|
|USER002713|            40|         1712.54|           42.81|         71.73|                 4|
|USER007310|            40|         1705

+------------+-----------------+----------------+--------------+--------------+------------+--------------+
|content_type|total_engagements|total_watch_time|avg_watch_time|avg_engagement|unique_users|unique_content|
+------------+-----------------+----------------+--------------+--------------+------------+--------------+
| Live Stream|            75054|      4737522.68|         63.12|         79.97|       11912|         50853|
|     Podcast|            75096|      2632220.72|         35.05|         69.87|       11904|         51028|
|       Video|            74449|      1568878.01|         21.07|         74.64|       11906|         50616|
|     Article|            74941|       839239.02|          11.2|         64.59|       11923|         50708|
+------------+-----------------+----------------+--------------+--------------+------------+--------------+


=== Device Usage Analysis ===


+--------------+--------------+----------------+----------------+--------------+------------+
|   device_type|total_sessions|total_watch_time|avg_session_time|avg_engagement|unique_users|
+--------------+--------------+----------------+----------------+--------------+------------+
|      Smart TV|         60108|      2160351.14|           35.94|         79.43|       11778|
|Gaming Console|         59734|      2028688.05|           33.96|         75.74|       11802|
|       Desktop|         59949|      1969632.73|           32.86|          72.5|       11783|
|        Tablet|         60175|      1869267.79|           31.06|         68.54|       11804|
|        Mobile|         59574|      1749920.72|           29.37|         65.08|       11784|
+--------------+--------------+----------------+----------------+--------------+------------+


=== Hourly Engagement Patterns ===


+-----------+-----------------+----------------+--------------+------------+
|hour_of_day|engagement_events|total_watch_time|avg_engagement|active_users|
+-----------+-----------------+----------------+--------------+------------+
|          0|               15|           472.0|         71.47|          15|
|          1|                7|          158.18|         73.71|           7|
|          2|                8|           322.8|         68.25|           8|
|          3|                6|          199.68|          68.0|           6|
|          4|                8|          219.29|         68.88|           8|
|          5|                3|          116.65|         76.33|           3|
|          6|               18|           568.5|         72.56|          18|
|          7|               42|         1211.49|         71.38|          42|
|          8|               43|         1407.64|         73.84|          43|
|          9|               47|          1604.9|         70.06|          47|

+-------+-----------------+------------------+----------------+--------------+------------+
|  month|total_engagements|monthly_watch_time|avg_session_time|avg_engagement|active_users|
+-------+-----------------+------------------+----------------+--------------+------------+
|2024-01|            25159|         827121.13|           32.88|         72.26|       10203|
|2024-02|            23872|         772994.45|           32.38|         72.24|       10000|
|2024-03|            25510|         827291.65|           32.43|         72.29|       10244|
|2024-04|            24519|          798865.9|           32.58|         72.23|       10145|
|2024-05|            25288|         829255.26|           32.79|         72.26|       10225|
|2024-06|            24308|         794100.99|           32.67|         72.17|       10062|
|2024-07|            25428|         832311.23|           32.73|         72.25|       10260|
|2024-08|            25603|         833486.22|           32.55|         72.34|  

## Key Takeaways: Delta Liquid Clustering in AIDP

### What We Demonstrated

1. **Automatic Optimization**: Created a table with `CLUSTER BY (user_id, engagement_date)` and let Delta automatically optimize data layout

2. **Performance Benefits**: Queries that filter on the clustering columns (user_id, engagement_date) can read fewer files because related rows are stored together

3. **Zero Maintenance**: No manual partitioning, bucketing, or Z-Ordering required - Delta handles it automatically

4. **Real-World Use Case**: Media analytics where content engagement and user behavior analysis are critical

### AIDP Advantages

- **Unified Analytics**: Seamlessly integrates with other AIDP services
- **Governance**: Catalog and schema isolation for media data
- **Performance**: Optimized for both OLAP and OLTP workloads
- **Scalability**: Handles media-scale data volumes effortlessly

### Best Practices for Liquid Clustering

1. **Choose clustering columns** based on your most common query patterns
2. **Use 1-4 clustering columns** - Delta Lake allows at most four, and each additional column dilutes the benefit for the others
3. **Consider cardinality** - high-cardinality columns work best
4. **Monitor and adjust** as query patterns evolve
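
To make the cardinality practice concrete, the sketch below counts distinct values per candidate column over a list of records. It is a rough screening aid under the assumption that a representative sample fits in memory; `cardinality_report` is a hypothetical helper, not a Delta or AIDP API.

```python
def cardinality_report(records, columns):
    """Return {column: (distinct_count, distinct_ratio)} for a list of dicts."""
    total = len(records)
    report = {}
    for col in columns:
        distinct = len({r[col] for r in records})
        report[col] = (distinct, distinct / total if total else 0.0)
    return report

sample = [
    {"user_id": "USER000001", "device_type": "Mobile"},
    {"user_id": "USER000002", "device_type": "Mobile"},
    {"user_id": "USER000003", "device_type": "Desktop"},
    {"user_id": "USER000003", "device_type": "Desktop"},
]
for col, (n, ratio) in cardinality_report(sample, ["user_id", "device_type"]).items():
    print(f"{col}: {n} distinct values ({ratio:.0%} of rows)")
```

In this notebook the same function could be called on `engagement_data` from Step 3. A column such as `device_type`, with only five values, would rank far below `user_id` as a clustering candidate.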

### Next Steps

- Explore other AIDP features like AI/ML integration
- Try liquid clustering with different column combinations
- Scale up to larger media datasets
- Integrate with real content management and streaming platforms

This notebook demonstrates how Oracle AI Data Platform makes advanced media analytics accessible while maintaining enterprise-grade performance and governance.