# Media: Delta Liquid Clustering Demo


## Overview


This notebook demonstrates **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using a media and entertainment analytics use case. Liquid clustering organizes a Delta table's data layout around the columns you filter on most, without requiring manual partitioning or Z-Ordering.

### What is Liquid Clustering?

Liquid clustering groups rows with similar values in the clustering columns you define into the same files. Clustering is applied during data ingestion and maintenance operations, providing:

- **Automatic optimization**: No manual tuning required
- **Improved query performance**: Faster queries on clustered columns
- **Reduced maintenance**: No need for manual repartitioning
- **Adaptive clustering**: Adjusts as data patterns change
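
The query benefit comes from file skipping: when each file covers a narrow range of a clustering column, a filter on that column can rule out most files using per-file min/max statistics. The sketch below is a simplified pure-Python model of that idea, not Delta's actual algorithm. The helper names (`build_files`, `files_to_scan`) and the file size are hypothetical, and a plain sort stands in for the real clustering step.

```python
import random

def build_files(rows, rows_per_file):
    """Split rows into fixed-size 'files' and record min/max user_id per file."""
    files = []
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        user_ids = [r["user_id"] for r in chunk]
        files.append({"min": min(user_ids), "max": max(user_ids), "rows": chunk})
    return files

def files_to_scan(files, user_id):
    """Count files whose [min, max] range could contain user_id."""
    return sum(1 for f in files if f["min"] <= user_id <= f["max"])

rng = random.Random(0)  # seeded so the sketch is repeatable
rows = [{"user_id": f"USER{rng.randint(1, 1000):06d}"} for _ in range(10_000)]

# Arrival order: every file holds a random mix of users
unclustered = build_files(rows, rows_per_file=500)
# Clustered: a sort by user_id stands in for the clustering step
clustered = build_files(sorted(rows, key=lambda r: r["user_id"]), rows_per_file=500)

target = "USER000500"
print("files scanned, arrival order:", files_to_scan(unclustered, target))
print("files scanned, clustered:    ", files_to_scan(clustered, target))
```

With the sorted layout, a single-user lookup touches only the one or two files whose range covers that user, while in arrival order almost every file's range spans the whole user population.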

### Use Case: Content Performance and User Engagement Analytics

We'll analyze media content consumption and user engagement data. Our clustering strategy will optimize for:

- **User-specific queries**: Fast lookups by user ID
- **Time-based analysis**: Efficient filtering by viewing and engagement dates
- **Content performance patterns**: Quick aggregation by content type and engagement metrics

## Step 1: AIDP Environment Setup

This notebook leverages the existing Spark session in your AIDP environment.

In [None]:
# Create media catalog and analytics schema

# In AIDP, catalogs provide data isolation and governance

spark.sql("CREATE CATALOG IF NOT EXISTS media")

spark.sql("CREATE SCHEMA IF NOT EXISTS media.analytics")

print("Media catalog and analytics schema created successfully!")

## Step 2: Create Delta Table with Liquid Clustering

### Table Design

Our `content_engagement` table will store:

- **user_id**: Unique user identifier
- **engagement_date**: Date and time of engagement
- **content_type**: Type (Video, Article, Podcast, Live Stream)
- **watch_time**: Time spent consuming content (minutes)
- **content_id**: Specific content identifier
- **engagement_score**: User engagement metric (0-100)
- **device_type**: Device used (Mobile, Desktop, TV, etc.)

### Clustering Strategy

We'll cluster by `user_id` and `engagement_date` because:

- **user_id**: Each user generates many engagement events, so clustering on this column keeps a user's viewing history in the same files
- **engagement_date**: Time-based queries are critical for content performance analysis, recommendation systems, and user behavior trends
- This combination optimizes for both personalized content recommendations and temporal engagement analysis

In [None]:
# Create Delta table with liquid clustering

# CLUSTER BY defines the columns for automatic optimization

spark.sql("""

CREATE TABLE IF NOT EXISTS media.analytics.content_engagement (

    user_id STRING,

    engagement_date TIMESTAMP,

    content_type STRING,

    watch_time DECIMAL(8,2),

    content_id STRING,

    engagement_score INT,

    device_type STRING

)

USING DELTA

CLUSTER BY (user_id, engagement_date)

""")

print("Delta table with liquid clustering created successfully!")

print("Clustering will automatically optimize data layout for queries on user_id and engagement_date.")

Delta table with liquid clustering created successfully!
Clustering will automatically optimize data layout for queries on user_id and engagement_date.
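
To confirm that the table picked up its clustering keys, and to cluster rows that are already stored, helpers along these lines may be useful. This is a sketch under two assumptions that should be verified in your AIDP environment: that `DESCRIBE DETAIL` reports a `clusteringColumns` field and that `OPTIMIZE` is available, as described in the Delta Lake documentation. The function names are hypothetical, and `spark` is the notebook's existing session.

```python
def clustering_columns(spark, table_name):
    """Return the clustering columns reported by DESCRIBE DETAIL.

    Assumption: the runtime exposes a `clusteringColumns` field in the
    DESCRIBE DETAIL result. `spark` is the notebook's SparkSession.
    """
    detail = spark.sql(f"DESCRIBE DETAIL {table_name}").collect()[0]
    return list(detail["clusteringColumns"])


def cluster_existing_data(spark, table_name):
    """Run OPTIMIZE so rows already in the table are rewritten by clustering key.

    Assumption: the runtime supports the Delta OPTIMIZE command.
    """
    return spark.sql(f"OPTIMIZE {table_name}")
```

In the notebook, `clustering_columns(spark, "media.analytics.content_engagement")` would be expected to list `user_id` and `engagement_date` if the assumptions hold.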


## Step 3: Generate Media Sample Data

### Data Generation Strategy

We'll create realistic media engagement data including:

- **12,000 users** with multiple content interactions over time
- **Content types**: Video, Article, Podcast, Live Stream
- **Engagement patterns**: Weighted peak viewing hours, with content and device types sampled uniformly
- **Engagement metrics**: Watch time (minutes) and engagement scores (0-100)

### Why This Data Pattern?

This data simulates real media scenarios where:

- User preferences drive content recommendations
- Engagement metrics determine content success
- Device usage affects viewing experience
- Time-based patterns influence programming decisions
- Personalization requires historical user behavior
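
The time-based pattern is produced by weighted sampling of the hour of day. The sketch below isolates that step and uses a seeded local generator so that repeated runs return the same sample; the generation cell itself is unseeded, so its record counts vary from run to run. The function name `sample_hours` and the seed value are illustrative choices, not part of the notebook.

```python
import random
from collections import Counter

# Same 24 hourly weights as the generation cell (index = hour of day)
HOUR_WEIGHTS = [2, 1, 1, 1, 1, 1, 3, 6, 8, 7, 6, 7, 8, 9, 10, 9, 8, 10, 12, 9, 7, 5, 4, 3]

def sample_hours(n, seed):
    """Draw n hours of day with the given weights from a seeded generator."""
    rng = random.Random(seed)  # local generator: leaves the global random state alone
    return rng.choices(range(24), weights=HOUR_WEIGHTS, k=n)

hours = sample_hours(10_000, seed=42)
counts = Counter(hours)

# Hour 18 has the largest weight (12 out of a total of 138)
expected_share_18 = HOUR_WEIGHTS[18] / sum(HOUR_WEIGHTS)
print(f"expected share at 18:00: {expected_share_18:.3f}")
print(f"observed share at 18:00: {counts[18] / len(hours):.3f}")
```

Calling `random.seed(...)` once at the top of the generation cell would give the same reproducibility for the full dataset.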

In [None]:
# Generate sample media engagement data

# Standard-library imports only; data generation does not use Spark functions

import random

from datetime import datetime, timedelta


# Define media data constants

CONTENT_TYPES = ['Video', 'Article', 'Podcast', 'Live Stream']

DEVICE_TYPES = ['Mobile', 'Desktop', 'Tablet', 'Smart TV', 'Gaming Console']

# Base engagement parameters by content type

ENGAGEMENT_PARAMS = {

    'Video': {'avg_watch_time': 15, 'engagement_base': 75, 'frequency': 12},

    'Article': {'avg_watch_time': 8, 'engagement_base': 65, 'frequency': 8},

    'Podcast': {'avg_watch_time': 25, 'engagement_base': 70, 'frequency': 6},

    'Live Stream': {'avg_watch_time': 45, 'engagement_base': 80, 'frequency': 4}

}

# Device engagement multipliers

DEVICE_MULTIPLIERS = {

    'Mobile': 0.9, 'Desktop': 1.0, 'Tablet': 0.95, 'Smart TV': 1.1, 'Gaming Console': 1.05

}


# Generate content engagement records

engagement_data = []

base_date = datetime(2024, 1, 1)


# Create 12,000 users with 10-40 engagement events each

for user_num in range(1, 12001):

    user_id = f"USER{user_num:06d}"
    
    # Each user gets 10-40 engagement events over 12 months

    num_engagements = random.randint(10, 40)
    
    for i in range(num_engagements):

        # Spread engagements over 12 months

        days_offset = random.randint(0, 365)

        engagement_date = base_date + timedelta(days=days_offset)
        
        # Add realistic timing (more engagement during certain hours)

        hour_weights = [2, 1, 1, 1, 1, 1, 3, 6, 8, 7, 6, 7, 8, 9, 10, 9, 8, 10, 12, 9, 7, 5, 4, 3]

        hours_offset = random.choices(range(24), weights=hour_weights)[0]

        engagement_date = engagement_date.replace(hour=hours_offset, minute=random.randint(0, 59), second=0, microsecond=0)
        
        # Select content type

        content_type = random.choice(CONTENT_TYPES)

        params = ENGAGEMENT_PARAMS[content_type]
        
        # Select device type

        device_type = random.choice(DEVICE_TYPES)

        device_multiplier = DEVICE_MULTIPLIERS[device_type]
        
        # Calculate watch time with variations

        time_variation = random.uniform(0.3, 2.5)

        watch_time = round(params['avg_watch_time'] * time_variation * device_multiplier, 2)
        
        # Content ID

        content_id = f"{content_type[:3].upper()}{random.randint(10000, 99999)}"
        
        # Engagement score (based on content type, device, and some randomness)

        engagement_variation = random.randint(-15, 15)

        engagement_score = max(0, min(100, int(params['engagement_base'] * device_multiplier) + engagement_variation))
        
        engagement_data.append({

            "user_id": user_id,

            "engagement_date": engagement_date,

            "content_type": content_type,

            "watch_time": watch_time,

            "content_id": content_id,

            "engagement_score": engagement_score,

            "device_type": device_type

        })



print(f"Generated {len(engagement_data)} content engagement records")

print("Sample record:", engagement_data[0])

Generated 299696 content engagement records
Sample record: {'user_id': 'USER000001', 'engagement_date': datetime.datetime(2024, 5, 12, 21, 57), 'content_type': 'Article', 'watch_time': 10.07, 'content_id': 'ART59438', 'engagement_score': 64, 'device_type': 'Gaming Console'}
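
Because the score is `int(engagement_base * device_multiplier)` plus a uniform integer in [-15, 15], the chance of exceeding the `engagement_score > 80` threshold used for the Step 7 label can be worked out exactly for each content and device pair. The sketch below does that arithmetic; the constants are copied from the generation cell, and the function name is hypothetical.

```python
ENGAGEMENT_BASE = {"Video": 75, "Article": 65, "Podcast": 70, "Live Stream": 80}
DEVICE_MULTIPLIERS = {"Mobile": 0.9, "Desktop": 1.0, "Tablet": 0.95,
                      "Smart TV": 1.1, "Gaming Console": 1.05}
VARIATIONS = range(-15, 16)  # random.randint(-15, 15) is inclusive: 31 values

def high_engagement_share(content_type, device_type, threshold=80):
    """Fraction of the 31 equally likely variations that give score > threshold."""
    base = int(ENGAGEMENT_BASE[content_type] * DEVICE_MULTIPLIERS[device_type])
    scores = [max(0, min(100, base + v)) for v in VARIATIONS]
    return sum(s > threshold for s in scores) / len(scores)

for content_type in ENGAGEMENT_BASE:
    row = {d: round(high_engagement_share(content_type, d), 2) for d in DEVICE_MULTIPLIERS}
    print(content_type, row)

# Content and device types are sampled uniformly, so all 20 pairs are equally likely
overall = sum(
    high_engagement_share(c, d) for c in ENGAGEMENT_BASE for d in DEVICE_MULTIPLIERS
) / (len(ENGAGEMENT_BASE) * len(DEVICE_MULTIPLIERS))
print(f"overall expected share of high engagement: {overall:.3f}")
```

The expected overall share is 158/620, about 25.5%, which is close to the 76,722 of 299,696 rows (about 25.6%) labeled high engagement in Step 7. Since only content type and device enter the formula, those two features are the ones a model can learn from.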


## Step 4: Insert Data Using PySpark

### Data Insertion Strategy

We'll use PySpark to:

1. **Create DataFrame** from our generated data
2. **Insert into Delta table** with liquid clustering
3. **Verify the insertion** with a record count

### Why PySpark for Insertion?

- **Distributed processing**: Handles large datasets efficiently
- **Type safety**: Ensures data integrity
- **Optimization**: Leverages Spark's query optimization
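- **Liquid clustering**: Writes to a clustered table are laid out by the clustering columns where the runtime supports clustering on write; a later `OPTIMIZE` clusters any data the write left unclustered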

In [None]:
# Insert data using PySpark DataFrame operations

# spark.createDataFrame infers column types from the Python values


# Create DataFrame from generated data

df_engagement = spark.createDataFrame(engagement_data)


# Display schema and sample data

print("DataFrame Schema:")

df_engagement.printSchema()



print("\nSample Data:")

df_engagement.show(5)


# Insert data into Delta table with liquid clustering

# The CLUSTER BY (user_id, engagement_date) will automatically optimize the data layout

df_engagement.write.mode("overwrite").saveAsTable("media.analytics.content_engagement")


print(f"\nSuccessfully inserted {df_engagement.count()} records into media.analytics.content_engagement")

print("Liquid clustering automatically optimized the data layout during insertion!")

DataFrame Schema:
root
 |-- content_id: string (nullable = true)
 |-- content_type: string (nullable = true)
 |-- device_type: string (nullable = true)
 |-- engagement_date: timestamp (nullable = true)
 |-- engagement_score: long (nullable = true)
 |-- user_id: string (nullable = true)
 |-- watch_time: double (nullable = true)


Sample Data:


+----------+------------+--------------+-------------------+----------------+----------+----------+
|content_id|content_type|   device_type|    engagement_date|engagement_score|   user_id|watch_time|
+----------+------------+--------------+-------------------+----------------+----------+----------+
|  ART59438|     Article|Gaming Console|2024-05-12 21:57:00|              64|USER000001|     10.07|
|  VID93820|       Video|        Mobile|2024-10-15 06:04:00|              56|USER000001|     33.09|
|  ART16141|     Article|        Tablet|2024-09-23 16:09:00|              59|USER000001|     11.63|
|  LIV44087| Live Stream|        Tablet|2024-12-28 16:21:00|              79|USER000001|      13.6|
|  POD85603|     Podcast|      Smart TV|2024-10-13 09:41:00|              68|USER000001|      29.2|
+----------+------------+--------------+-------------------+----------------+----------+----------+
only showing top 5 rows




Successfully inserted 299696 records into media.analytics.content_engagement
Liquid clustering automatically optimized the data layout during insertion!
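
The inferred DataFrame schema above reports `engagement_score` as `long` and `watch_time` as `double`, while the table in Step 2 declares `INT` and `DECIMAL(8,2)`. One way to keep the declared types is to cast explicitly and append to the existing table instead of overwriting it. The following is a minimal sketch of that alternative; the helper names are hypothetical, and whether `mode("overwrite")` preserves the declared types and clustering keys depends on the runtime, so check that in your environment.

```python
# Column types declared by CREATE TABLE in Step 2
TABLE_SCHEMA = {
    "user_id": "STRING",
    "engagement_date": "TIMESTAMP",
    "content_type": "STRING",
    "watch_time": "DECIMAL(8,2)",
    "content_id": "STRING",
    "engagement_score": "INT",
    "device_type": "STRING",
}

def cast_expressions(schema):
    """Build one `CAST(col AS type) AS col` expression per column, in table order."""
    return [f"CAST({name} AS {sql_type}) AS {name}" for name, sql_type in schema.items()]

def append_with_table_types(df, table_name, schema=TABLE_SCHEMA):
    """Cast a Spark DataFrame to the declared types, then append to the existing table."""
    typed = df.selectExpr(*cast_expressions(schema))
    typed.write.mode("append").saveAsTable(table_name)
    return typed

print(cast_expressions(TABLE_SCHEMA)[3])
```

In the notebook this would be called as `append_with_table_types(df_engagement, "media.analytics.content_engagement")` in place of the overwrite, on a table that does not already hold the same rows.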


## Step 5: Demonstrate Liquid Clustering Benefits

### Query Performance Analysis

Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:

1. **User engagement history** (clustered by user_id)
2. **Time-based content analysis** (clustered by engagement_date)
3. **Combined user + time queries** (optimal for our clustering)

### Expected Performance Benefits

With liquid clustering, these queries should be significantly faster because:

- **Data locality**: Related records are physically grouped together
- **Reduced I/O**: Per-file min/max statistics let the engine skip files that cannot contain matching rows, so less data is read from storage
- **Automatic optimization**: No manual tuning required
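
The queries in this step are not timed by the notebook, so the expected speed-up is best checked with a measurement. A small helper like the one below (the name `time_action` is hypothetical) could wrap any query; it takes the fastest of several runs to reduce noise from caching and cluster warm-up.

```python
import time

def time_action(action, repeats=3):
    """Run `action` several times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Stand-in workload so the sketch runs anywhere. In the notebook, pass a query instead:
# time_action(lambda: spark.sql(
#     "SELECT * FROM media.analytics.content_engagement WHERE user_id = 'USER000001'"
# ).collect())
elapsed = time_action(lambda: sum(range(100_000)))
print(f"fastest of 3 runs: {elapsed:.6f} s")
```

Comparing the same query against an unclustered copy of the table would show how much of the time is attributable to clustering.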

In [None]:
# Demonstrate liquid clustering benefits with optimized queries


# Query 1: User engagement history - benefits from user_id clustering

print("=== Query 1: User Engagement History ===")

user_history = spark.sql("""

SELECT user_id, engagement_date, content_type, watch_time, engagement_score

FROM media.analytics.content_engagement

WHERE user_id = 'USER000001'

ORDER BY engagement_date DESC

LIMIT 10

""")



user_history.show()

print(f"Records found: {user_history.count()} (showing first 10)")



# Query 2: Time-based high-engagement content analysis - benefits from engagement_date clustering

print("\n=== Query 2: Recent High-Engagement Content ===")

high_engagement = spark.sql("""

SELECT engagement_date, user_id, content_id, content_type, engagement_score, watch_time

FROM media.analytics.content_engagement

WHERE engagement_date >= '2024-02-15' AND engagement_date < '2024-02-16' AND engagement_score > 85

ORDER BY engagement_score DESC, watch_time DESC

""")



high_engagement.show()

print(f"High-engagement records found: {high_engagement.count()} (showing first 20)")



# Query 3: Combined user + time query - optimal for our clustering strategy

print("\n=== Query 3: User Content Preferences ===")

user_preferences = spark.sql("""

SELECT user_id, engagement_date, content_type, watch_time, device_type

FROM media.analytics.content_engagement

WHERE user_id LIKE 'USER000%' AND engagement_date >= '2024-02-01'

ORDER BY user_id, engagement_date

LIMIT 25

""")



user_preferences.show()

print(f"User preference records found: {user_preferences.count()} (showing first 25)")

=== Query 1: User Engagement History ===


+----------+-------------------+------------+----------+----------------+
|   user_id|    engagement_date|content_type|watch_time|engagement_score|
+----------+-------------------+------------+----------+----------------+
|USER000001|2024-12-29 19:51:00| Live Stream|     67.09|              63|
|USER000001|2024-12-28 16:21:00| Live Stream|      13.6|              79|
|USER000001|2024-12-24 17:40:00|       Video|     20.14|              75|
|USER000001|2024-12-05 09:50:00|       Video|     27.32|              73|
|USER000001|2024-12-03 14:50:00|     Podcast|     50.85|              70|
|USER000001|2024-11-26 11:18:00| Live Stream|     79.14|             100|
|USER000001|2024-11-09 12:07:00| Live Stream|     49.07|              66|
|USER000001|2024-10-15 06:04:00|       Video|     33.09|              56|
|USER000001|2024-10-13 09:41:00|     Podcast|      29.2|              68|
|USER000001|2024-10-08 16:00:00|     Article|      8.59|              66|
+----------+-------------------+------------+----------+----------------+

Records found: 10 (showing first 10)

=== Query 2: Recent High-Engagement Content ===


+-------------------+----------+----------+------------+----------------+----------+
|    engagement_date|   user_id|content_id|content_type|engagement_score|watch_time|
+-------------------+----------+----------+------------+----------------+----------+
|2024-02-15 07:41:00|USER001708|  LIV38799| Live Stream|             100|    105.07|
|2024-02-15 08:23:00|USER009654|  LIV43097| Live Stream|             100|     99.65|
|2024-02-15 12:52:00|USER010253|  LIV92793| Live Stream|             100|     77.38|
|2024-02-15 08:33:00|USER001622|  LIV68096| Live Stream|             100|     69.99|
|2024-02-15 22:50:00|USER011218|  LIV95921| Live Stream|             100|     54.75|
|2024-02-15 07:07:00|USER005461|  LIV57619| Live Stream|             100|     44.22|
|2024-02-15 08:45:00|USER006405|  LIV69306| Live Stream|             100|     43.79|
|2024-02-15 20:49:00|USER001080|  LIV83472| Live Stream|             100|     28.92|
|2024-02-15 12:32:00|USER006006|  LIV82269| Live Stream|         

High-engagement records found: 112 (showing first 20)

=== Query 3: User Content Preferences ===


+----------+-------------------+------------+----------+--------------+
|   user_id|    engagement_date|content_type|watch_time|   device_type|
+----------+-------------------+------------+----------+--------------+
|USER000001|2024-02-06 22:09:00|     Article|     13.05|       Desktop|
|USER000001|2024-03-12 06:14:00|     Podcast|     31.97|Gaming Console|
|USER000001|2024-03-19 14:51:00| Live Stream|     39.39|      Smart TV|
|USER000001|2024-05-12 21:57:00|     Article|     10.07|Gaming Console|
|USER000001|2024-05-23 19:15:00|     Article|      5.61|      Smart TV|
|USER000001|2024-06-28 19:36:00| Live Stream|     49.28|Gaming Console|
|USER000001|2024-07-01 21:31:00|       Video|     16.08|Gaming Console|
|USER000001|2024-07-09 06:32:00|     Podcast|     25.73|      Smart TV|
|USER000001|2024-07-11 02:53:00|     Article|     12.19|        Mobile|
|USER000001|2024-08-10 16:31:00|     Podcast|     54.28|        Mobile|
|USER000001|2024-08-16 20:11:00|     Podcast|     62.21|       D

User preference records found: 25 (showing first 25)
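
Single-day filters on a timestamp clustering column are generally safest when written as a half-open range on the raw column, because wrapping the column in a function such as `DATE(...)` may prevent the engine from using file statistics. The helper below is a sketch with a hypothetical name that builds such a predicate for any day.

```python
from datetime import date, timedelta

def day_range_predicate(column, day):
    """Return a half-open [day, day + 1) filter on a TIMESTAMP column."""
    next_day = day + timedelta(days=1)
    return (
        f"{column} >= TIMESTAMP '{day.isoformat()} 00:00:00' "
        f"AND {column} < TIMESTAMP '{next_day.isoformat()} 00:00:00'"
    )

predicate = day_range_predicate("engagement_date", date(2024, 2, 15))
print(predicate)
```

The resulting string can be placed in a `WHERE` clause, and `timedelta` takes care of month and leap-day boundaries.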


## Step 6: Analyze Clustering Effectiveness

### Understanding the Impact

Liquid clustering changes the physical file layout, not query results, so this step does not inspect the layout directly. Instead, it runs aggregate queries over the clustered table to show the media insights the data supports.

### Key Analytics

- **User engagement patterns** and content preferences
- **Content performance** by type and popularity metrics
- **Device usage trends** and platform optimization
- **Time-based consumption patterns** and programming insights

In [None]:
# Analyze clustering effectiveness and media insights


# User engagement analysis

print("=== User Engagement Analysis ===")

user_engagement = spark.sql("""

SELECT user_id, COUNT(*) as total_sessions,

       ROUND(SUM(watch_time), 2) as total_watch_time,

       ROUND(AVG(watch_time), 2) as avg_session_time,

       ROUND(AVG(engagement_score), 2) as avg_engagement,

       COUNT(DISTINCT content_type) as content_types_used

FROM media.analytics.content_engagement

GROUP BY user_id

ORDER BY total_watch_time DESC

LIMIT 10

""")



user_engagement.show()


# Content type performance

print("\n=== Content Type Performance ===")

content_performance = spark.sql("""

SELECT content_type, COUNT(*) as total_engagements,

       ROUND(SUM(watch_time), 2) as total_watch_time,

       ROUND(AVG(watch_time), 2) as avg_watch_time,

       ROUND(AVG(engagement_score), 2) as avg_engagement,

       COUNT(DISTINCT user_id) as unique_users,

       COUNT(DISTINCT content_id) as unique_content

FROM media.analytics.content_engagement

GROUP BY content_type

ORDER BY total_watch_time DESC

""")



content_performance.show()


# Device usage analysis

print("\n=== Device Usage Analysis ===")

device_analysis = spark.sql("""

SELECT device_type, COUNT(*) as total_sessions,

       ROUND(SUM(watch_time), 2) as total_watch_time,

       ROUND(AVG(watch_time), 2) as avg_session_time,

       ROUND(AVG(engagement_score), 2) as avg_engagement,

       COUNT(DISTINCT user_id) as unique_users

FROM media.analytics.content_engagement

GROUP BY device_type

ORDER BY total_watch_time DESC

""")



device_analysis.show()


# Hourly engagement patterns

print("\n=== Hourly Engagement Patterns ===")

hourly_patterns = spark.sql("""

SELECT HOUR(engagement_date) as hour_of_day, COUNT(*) as engagement_events,

       ROUND(SUM(watch_time), 2) as total_watch_time,

       ROUND(AVG(engagement_score), 2) as avg_engagement,

       COUNT(DISTINCT user_id) as active_users

FROM media.analytics.content_engagement

WHERE engagement_date >= '2024-02-01' AND engagement_date < '2024-02-02'

GROUP BY HOUR(engagement_date)

ORDER BY hour_of_day

""")



hourly_patterns.show()


# Monthly engagement trends

print("\n=== Monthly Engagement Trends ===")

monthly_trends = spark.sql("""

SELECT DATE_FORMAT(engagement_date, 'yyyy-MM') as month,

       COUNT(*) as total_engagements,

       ROUND(SUM(watch_time), 2) as monthly_watch_time,

       ROUND(AVG(watch_time), 2) as avg_session_time,

       ROUND(AVG(engagement_score), 2) as avg_engagement,

       COUNT(DISTINCT user_id) as active_users

FROM media.analytics.content_engagement

GROUP BY DATE_FORMAT(engagement_date, 'yyyy-MM')

ORDER BY month

""")



monthly_trends.show()

=== User Engagement Analysis ===


+----------+--------------+----------------+----------------+--------------+------------------+
|   user_id|total_sessions|total_watch_time|avg_session_time|avg_engagement|content_types_used|
+----------+--------------+----------------+----------------+--------------+------------------+
|USER005870|            40|         1904.54|           47.61|         77.88|                 4|
|USER008764|            38|         1788.89|           47.08|         75.37|                 4|
|USER005341|            40|         1779.68|           44.49|         73.75|                 4|
|USER000055|            40|         1775.95|            44.4|         73.28|                 4|
|USER009434|            39|         1770.35|           45.39|         72.87|                 4|
|USER000927|            37|         1769.41|           47.82|         78.35|                 4|
|USER008715|            40|         1743.27|           43.58|         73.33|                 4|
|USER001049|            38|         1738

=== Content Type Performance ===


+------------+-----------------+----------------+--------------+--------------+------------+--------------+
|content_type|total_engagements|total_watch_time|avg_watch_time|avg_engagement|unique_users|unique_content|
+------------+-----------------+----------------+--------------+--------------+------------+--------------+
| Live Stream|            74856|      4709012.02|         62.91|         80.03|       11920|         50734|
|     Podcast|            74797|      2617116.68|         34.99|         69.86|       11912|         50808|
|       Video|            75029|      1572204.49|         20.95|         74.59|       11921|         50963|
|     Article|            75014|       840406.44|          11.2|         64.63|       11918|         50808|
+------------+-----------------+----------------+--------------+--------------+------------+--------------+


=== Device Usage Analysis ===


+--------------+--------------+----------------+----------------+--------------+------------+
|   device_type|total_sessions|total_watch_time|avg_session_time|avg_engagement|unique_users|
+--------------+--------------+----------------+----------------+--------------+------------+
|      Smart TV|         59789|      2133021.69|           35.68|         79.48|       11786|
|Gaming Console|         60021|      2044411.01|           34.06|         75.85|       11797|
|       Desktop|         60280|      1959091.98|            32.5|         72.52|       11790|
|        Tablet|         59771|      1848360.38|           30.92|         68.53|       11818|
|        Mobile|         59835|      1753854.57|           29.31|         64.98|       11783|
+--------------+--------------+----------------+----------------+--------------+------------+


=== Hourly Engagement Patterns ===


+-----------+-----------------+----------------+--------------+------------+
|hour_of_day|engagement_events|total_watch_time|avg_engagement|active_users|
+-----------+-----------------+----------------+--------------+------------+
|          0|               14|          442.86|         77.71|          14|
|          1|                8|          241.31|         69.88|           8|
|          2|                5|          178.38|          67.0|           5|
|          3|                9|          331.53|         73.44|           9|
|          4|                8|          292.48|         73.13|           8|
|          5|                3|          256.24|         68.33|           3|
|          6|               16|          524.01|         73.38|          16|
|          7|               31|         1149.06|         72.55|          31|
|          8|               38|         1430.55|         73.53|          38|
|          9|               46|         1390.35|         71.78|          46|

=== Monthly Engagement Trends ===


+-------+-----------------+------------------+----------------+--------------+------------+
|  month|total_engagements|monthly_watch_time|avg_session_time|avg_engagement|active_users|
+-------+-----------------+------------------+----------------+--------------+------------+
|2024-01|            25180|         813811.06|           32.32|         72.23|       10218|
|2024-02|            23629|         770696.13|           32.62|         72.29|        9999|
|2024-03|            25417|         819129.81|           32.23|         72.09|       10227|
|2024-04|            24549|         798241.78|           32.52|          72.2|       10148|
|2024-05|            25286|         823505.15|           32.57|         72.27|       10283|
|2024-06|            24567|         797466.01|           32.46|         72.36|       10148|
|2024-07|            25411|         829033.74|           32.62|         72.41|       10222|
|2024-08|            25545|         828501.89|           32.43|         72.27|  

## Step 7: Train Media Content Recommendation Model

### Machine Learning for Media Business Improvement

Now we'll train a machine learning model to predict content engagement and enable personalized recommendations. This model can help media companies:

- **Personalize content recommendations** for better user engagement
- **Optimize content discovery** and reduce user churn
- **Maximize watch time** through intelligent content suggestions
- **Improve content production** decisions based on engagement predictions

### Model Approach

We'll use a **Random Forest Classifier** to predict high engagement (engagement_score > 80) based on:

- User behavior patterns (content preferences, device usage)
- Content characteristics (type, timing)
- Contextual factors (time of day, user engagement history)
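
The user engagement history features must use only events that happened before the one being predicted; otherwise the label leaks into its own features. The feature query achieves this with the window frame `ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING`. The pure-Python sketch below (the function name is hypothetical) shows the same prior-only average for one user's scores.

```python
def prior_averages(scores):
    """For each event, average the scores of strictly earlier events (None if none)."""
    averages, running_total = [], 0
    for index, score in enumerate(scores):
        averages.append(running_total / index if index else None)
        running_total += score  # the current score is added only after its own feature
    return averages

# One user's engagement scores in chronological order
print(prior_averages([60, 80, 70]))  # prints [None, 60.0, 70.0]
```

The first event has no history, which is why the notebook fills the resulting nulls before training.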

### Business Impact

- **Engagement Boost**: Personalized recommendations increase watch time
- **Retention Improvement**: Better content discovery reduces user churn
- **Revenue Growth**: Higher engagement drives advertising and subscription revenue
- **Content Strategy**: Data-driven decisions for content creation and acquisition

In [None]:
# Prepare data for machine learning - create user-content engagement features

from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

# Create engagement prediction features
engagement_features = spark.sql("""
SELECT 
    user_id,
    engagement_date,
    content_type,
    watch_time,
    content_id,
    device_type,
    HOUR(engagement_date) as engagement_hour,
    DAYOFWEEK(engagement_date) as engagement_day_of_week,
    AVG(engagement_score) OVER (PARTITION BY user_id ORDER BY engagement_date ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) as user_avg_engagement,
    COUNT(*) OVER (PARTITION BY user_id ORDER BY engagement_date ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) as user_prior_engagements,
    COUNT(CASE WHEN content_type = 'Video' THEN 1 END) OVER (PARTITION BY user_id ORDER BY engagement_date ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) / NULLIF(COUNT(*) OVER (PARTITION BY user_id ORDER BY engagement_date ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) as video_preference,
    CASE WHEN engagement_score > 80 THEN 1 ELSE 0 END as high_engagement
FROM media.analytics.content_engagement
""")

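# A user's first event has an empty window frame, so AVG and the video ratio are NULL there; fill them with 0
engagement_features = engagement_features.fillna(0, subset=['user_avg_engagement', 'video_preference'])
# COUNT(*) over an empty frame returns 0 rather than NULL, so this fill is a no-op kept as a safeguard
engagement_features = engagement_features.fillna(1, subset=['user_prior_engagements'])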

print(f"Created engagement prediction features for {engagement_features.count()} interactions")
engagement_features.groupBy("high_engagement").count().show()

Created engagement prediction features for 299696 interactions


+---------------+------+
|high_engagement| count|
+---------------+------+
|              1| 76722|
|              0|222974|
+---------------+------+



In [None]:
# Feature engineering for engagement prediction

# Create indexers for categorical features
content_type_indexer = StringIndexer(inputCol="content_type", outputCol="content_type_index")
device_type_indexer = StringIndexer(inputCol="device_type", outputCol="device_type_index")

# Assemble features for the model
feature_cols = ["watch_time", "engagement_hour", "engagement_day_of_week", 
                "user_avg_engagement", "user_prior_engagements", "video_preference", 
                "content_type_index", "device_type_index"]

assembler = VectorAssembler(
    inputCols=feature_cols,
    outputCol="features"
)

# Scale features
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")

# Create and train the model
rf = RandomForestClassifier(
    labelCol="high_engagement", 
    featuresCol="scaled_features",
    numTrees=100,
    maxDepth=10
)

# Create pipeline
pipeline = Pipeline(stages=[content_type_indexer, device_type_indexer, assembler, scaler, rf])

# Split data
train_data, test_data = engagement_features.randomSplit([0.8, 0.2], seed=42)

print(f"Training set: {train_data.count()} interactions")
print(f"Test set: {test_data.count()} interactions")

Training set: 239756 interactions


Test set: 59940 interactions


In [None]:
# Train the engagement prediction model

print("Training content engagement prediction model...")
model = pipeline.fit(train_data)

# Make predictions
predictions = model.transform(test_data)

# Evaluate the model
evaluator = BinaryClassificationEvaluator(labelCol="high_engagement", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)

print(f"Model AUC: {auc:.4f}")

# Show prediction results
predictions.select("user_id", "content_type", "watch_time", "high_engagement", "prediction", "probability").show(15)

# Calculate confusion matrix
confusion_matrix = predictions.groupBy("high_engagement", "prediction").count()
confusion_matrix.show()

Training content engagement prediction model...


Model AUC: 0.8165


+----------+------------+----------+---------------+----------+--------------------+
|   user_id|content_type|watch_time|high_engagement|prediction|         probability|
+----------+------------+----------+---------------+----------+--------------------+
|USER000004| Live Stream|     94.86|              1|       1.0|[0.41142401408534...|
|USER000004|     Podcast|     64.35|              0|       0.0|[0.62779094770207...|
|USER000004|     Article|     12.64|              0|       0.0|[0.82686425125836...|
|USER000004|     Article|      7.65|              0|       0.0|[0.96838202801927...|
|USER000004|       Video|     10.76|              0|       0.0|[0.70597107984159...|
|USER000004|     Article|       2.7|              0|       0.0|[0.97082017356439...|
|USER000004|     Podcast|     44.58|              0|       0.0|[0.92652018088369...|
|USER000004|     Podcast|     64.41|              0|       0.0|[0.76833594624079...|
|USER000007|     Podcast|     16.51|              0|       0.0|[0

+---------------+----------+-----+
|high_engagement|prediction|count|
+---------------+----------+-----+
|              1|       0.0| 8732|
|              0|       0.0|40513|
|              1|       1.0| 6514|
|              0|       1.0| 4181|
+---------------+----------+-----+



In [None]:
# Model interpretation and business insights

# Feature importance (approximate)
rf_model = model.stages[-1]
feature_importance = rf_model.featureImportances
feature_names = feature_cols

print("=== Feature Importance for Engagement Prediction ===")
for name, importance in zip(feature_names, feature_importance):
    print(f"{name}: {importance:.4f}")

# Business impact analysis
print("\n=== Business Impact Analysis ===")

# Calculate potential impact of personalized recommendations
high_engagement_predictions = predictions.filter("prediction = 1")
recommended_content = high_engagement_predictions.count()
total_test_content = test_data.count()

print(f"Total test interactions: {total_test_content}")
print(f"Content predicted for high engagement: {recommended_content}")
print(f"Recommendation coverage: {(recommended_content/total_test_content)*100:.1f}%")

# Compare watch time for predicted-high interactions with the overall average (a selection effect, not a measured causal lift)
avg_watch_time_recommended = high_engagement_predictions.agg(F.avg("watch_time")).collect()[0][0] or 0
avg_watch_time_all = test_data.agg(F.avg("watch_time")).collect()[0][0] or 0
engagement_lift = ((avg_watch_time_recommended - avg_watch_time_all) / avg_watch_time_all) * 100

print(f"\nAverage watch time for recommended content: {avg_watch_time_recommended:.2f} minutes")
print(f"Average watch time overall: {avg_watch_time_all:.2f} minutes")
print(f"Potential engagement lift: {engagement_lift:.1f}%")

# Revenue impact estimation
avg_rpm = 25  # Assumed illustrative revenue (in dollars) per thousand minutes watched
additional_minutes = recommended_content * (avg_watch_time_recommended - avg_watch_time_all)
additional_revenue = (additional_minutes / 1000) * avg_rpm

print(f"\nEstimated additional watch minutes: {additional_minutes:,.0f}")
print(f"Potential additional revenue: ${additional_revenue:,.0f}")

# Accuracy metrics
accuracy = predictions.filter("high_engagement = prediction").count() / predictions.count()
precision = predictions.filter("prediction = 1 AND high_engagement = 1").count() / predictions.filter("prediction = 1").count() if predictions.filter("prediction = 1").count() > 0 else 0
recall = predictions.filter("prediction = 1 AND high_engagement = 1").count() / predictions.filter("high_engagement = 1").count() if predictions.filter("high_engagement = 1").count() > 0 else 0

print(f"\nModel Performance:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"AUC: {auc:.4f}")

=== Feature Importance for Engagement Prediction ===
watch_time: 0.1027
engagement_hour: 0.0104
engagement_day_of_week: 0.0057
user_avg_engagement: 0.0100
user_prior_engagements: 0.0102
video_preference: 0.0091
content_type_index: 0.4600
device_type_index: 0.3918

=== Business Impact Analysis ===


Total test interactions: 59940
Content predicted for high engagement: 10695
Recommendation coverage: 17.8%



Average watch time for recommended content: 56.16 minutes
Average watch time overall: 32.59 minutes
Potential engagement lift: 72.3%

Estimated additional watch minutes: 252,134
Potential additional revenue: $6,303



Model Performance:
Accuracy: 0.7846
Precision: 0.6091
Recall: 0.4273
AUC: 0.8165
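
The reported accuracy, precision, and recall follow directly from the four confusion-matrix counts, and deriving them from a single aggregation avoids re-scanning the predictions for every metric. The sketch below reproduces the figures from the printed counts and adds two reference points that the notebook does not print: the F1 score and the accuracy of always predicting the majority class. The function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive standard metrics from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        # Accuracy obtained by always predicting the larger class
        "majority_baseline": max(tp + fn, fp + tn) / total,
    }

# Counts from the confusion matrix printed in the training cell
metrics = classification_metrics(tp=6514, fp=4181, fn=8732, tn=40513)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The majority-class baseline is about 0.7456, so the model's 0.7846 accuracy is a modest gain; AUC, precision, and recall are the more informative figures for this imbalanced label.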


## Key Takeaways: Delta Liquid Clustering + ML in AIDP

### What We Demonstrated

1. **Automatic Optimization**: Created a table with `CLUSTER BY (user_id, engagement_date)` and let Delta automatically optimize data layout

2. **Performance Benefits**: Queries that filter on the clustering columns (user_id, engagement_date) can skip files that hold no matching rows; this notebook does not time the queries, so measure the gain on your own workload

3. **Zero Maintenance**: No manual partitioning, bucketing, or Z-Ordering required - Delta handles it automatically

4. **Machine Learning Integration**: Trained a content engagement prediction model using the optimized data

5. **Real-World Use Case**: Media analytics where content engagement and user behavior analysis are critical

### AIDP Advantages

- **Unified Analytics**: Seamlessly integrates data optimization with ML
- **Governance**: Catalog and schema isolation for media data
- **Performance**: Optimized for both analytical queries and ML training
- **Scalability**: Handles media-scale data volumes effortlessly

### Business Benefits for Media

1. **Personalization**: AI-driven content recommendations increase engagement
2. **Revenue Growth**: Higher watch time drives advertising and subscription revenue
3. **User Retention**: Better content discovery reduces churn
4. **Content Strategy**: Data-driven decisions for content creation and acquisition
5. **Platform Optimization**: Device-specific recommendations improve user experience

### Best Practices for Media Analytics

1. **Choose clustering columns** based on your most common query patterns
2. **Start with 1-4 columns** - Delta Lake accepts at most four clustering columns, and each additional column dilutes locality for the others
3. **Consider cardinality** - high-cardinality columns work best
4. **Monitor and adjust** as query patterns evolve
5. **Combine with ML** for predictive analytics and automation
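
When query patterns change, the clustering keys can be changed without recreating the table. The sketch below builds the `ALTER TABLE ... CLUSTER BY` statement that Delta Lake documents for this and enforces the four-column limit. The helper name and the example column choice are hypothetical, and support for the statement in AIDP is an assumption to verify.

```python
MAX_CLUSTERING_COLUMNS = 4  # limit documented for Delta liquid clustering

def alter_clustering_sql(table_name, columns):
    """Build an ALTER TABLE ... CLUSTER BY statement for new clustering keys."""
    if not 1 <= len(columns) <= MAX_CLUSTERING_COLUMNS:
        raise ValueError(
            f"expected 1-{MAX_CLUSTERING_COLUMNS} clustering columns, got {len(columns)}"
        )
    return f"ALTER TABLE {table_name} CLUSTER BY ({', '.join(columns)})"

statement = alter_clustering_sql(
    "media.analytics.content_engagement", ["content_id", "engagement_date"]
)
print(statement)
```

The statement would be run with `spark.sql(statement)`. Existing files are typically not rewritten until a later `OPTIMIZE`, so new keys mainly affect data written afterwards.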

### Next Steps

- Explore other AIDP ML features like AutoML
- Try liquid clustering with different column combinations
- Scale up to larger media datasets
- Integrate with real content management and streaming platforms
- Deploy models for real-time content recommendations

This notebook demonstrates how Oracle AI Data Platform makes advanced media analytics accessible while maintaining enterprise-grade performance and governance.