# Media: Medallion Architecture Demo with Delta Liquid Clustering

## Overview

This notebook demonstrates a **Medallion Architecture** implementation in Oracle AI Data Platform (AIDP) Workbench using a media and entertainment analytics use case. The medallion architecture organizes data into three layers:

- **Bronze Layer**: Raw, unprocessed data as ingested
- **Silver Layer**: Cleaned, validated, and transformed data
- **Gold Layer**: Aggregated, business-ready data with analytics and ML insights

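Every table is created with **Delta Liquid Clustering** so that rows sharing clustering-key values are stored together, and the Gold layer adds a machine learning model that predicts high-engagement events as an input for content recommendation.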

### Key Technologies
- **Delta Lake**: ACID transactions, time travel, schema enforcement
- **Liquid Clustering**: Data layout driven by `CLUSTER BY` columns instead of fixed partitions
- **Medallion Architecture**: Progressive data refinement
- **PySpark ML**: Random forest classifier for engagement prediction

### Use Case: Content Performance and User Engagement Analytics

We'll analyze media content consumption patterns to optimize content recommendations, improve user engagement, and drive business insights.

### AIDP Environment Setup

This notebook leverages the existing Spark session in your AIDP environment.

In [None]:
# Create media catalog with bronze, silver, and gold schemas
# In AIDP, catalogs provide data isolation and governance

spark.sql("CREATE CATALOG IF NOT EXISTS media")
spark.sql("CREATE SCHEMA IF NOT EXISTS media.bronze")
spark.sql("CREATE SCHEMA IF NOT EXISTS media.silver")
spark.sql("CREATE SCHEMA IF NOT EXISTS media.gold")

print("Media catalog with bronze, silver, and gold schemas created successfully!")

Media catalog with bronze, silver, and gold schemas created successfully!


## Bronze Layer: Raw Data Ingestion

### Purpose
The Bronze layer stores raw data exactly as received, without any transformations. This preserves data integrity and enables reprocessing if needed.

### Table Design
Our `content_engagement_raw` table will store:

- **user_id**: Raw user identifier
- **engagement_date**: Raw timestamp
- **content_type**: Content type as received
- **watch_time**: Raw watch time data
- **content_id**: Raw content identifier
- **engagement_score**: Raw engagement metric
- **device_type**: Device information
- **ingestion_timestamp**: When data was ingested

### Clustering Strategy
We'll cluster by `user_id` and `ingestion_timestamp` for efficient data management and historical tracking.

In [None]:
# Create Bronze layer Delta table with liquid clustering

spark.sql("""
CREATE TABLE IF NOT EXISTS media.bronze.content_engagement_raw (
    user_id STRING,
    engagement_date STRING,
    content_type STRING,
    watch_time STRING,
    content_id STRING,
    engagement_score STRING,
    device_type STRING,
    ingestion_timestamp TIMESTAMP
)
USING DELTA
CLUSTER BY (user_id, ingestion_timestamp)
""")

print("Bronze layer table created successfully!")
print("Clustering will optimize data layout for user-based queries and temporal analysis.")
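
In Delta Lake, declaring `CLUSTER BY` sets the clustering keys, and existing files are regrouped when `OPTIMIZE` runs on the table. Whether AIDP schedules that step automatically is not covered in this notebook, so the sketch below assumes it is run manually. The helper name `optimize_statement` and the `CLUSTERED_TABLES` list are illustrative, not part of any AIDP API.

```python
import re

# Tables created so far in this notebook; extend the list as layers are added.
CLUSTERED_TABLES = [
    "media.bronze.content_engagement_raw",
]

# Accept one-, two-, or three-part names made of plain identifiers only.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*){0,2}$")


def optimize_statement(table_name: str) -> str:
    """Build an OPTIMIZE statement after a basic identifier check."""
    if not _IDENTIFIER.match(table_name):
        raise ValueError(f"Unexpected table name: {table_name!r}")
    return f"OPTIMIZE {table_name}"


statements = [optimize_statement(name) for name in CLUSTERED_TABLES]
for stmt in statements:
    print(stmt)
    # In the notebook, run it against the live session (assumes `spark` exists):
    # spark.sql(stmt)
```

Validating the name before interpolating it keeps the helper safe to reuse with table names that come from configuration.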

In [None]:
# Generate and ingest raw media engagement data
# Only the Python standard library is needed for data generation

import random
from datetime import datetime, timedelta

# Define media data constants
CONTENT_TYPES = ['Video', 'Article', 'Podcast', 'Live Stream']
DEVICE_TYPES = ['Mobile', 'Desktop', 'Tablet', 'Smart TV', 'Gaming Console']

# Generate sample raw data (simulating various data quality issues)
raw_engagement_data = []
base_date = datetime(2024, 1, 1)
ingestion_time = datetime.now()

# Create 15,000 users with varying data quality
for user_num in range(1, 15001):
    user_id = f"USER{user_num:06d}"
    
    # Each user gets 8-35 engagement events
    num_engagements = random.randint(8, 35)
    
    for i in range(num_engagements):
        # Spread engagements over 12 months
        days_offset = random.randint(0, 365)
        engagement_date = base_date + timedelta(days=days_offset)
        
        # Add realistic timing
        hour_weights = [2, 1, 1, 1, 1, 1, 3, 6, 8, 7, 6, 7, 8, 9, 10, 9, 8, 10, 12, 9, 7, 5, 4, 3]
        hours_offset = random.choices(range(24), weights=hour_weights)[0]
        engagement_datetime = engagement_date.replace(hour=hours_offset, minute=random.randint(0, 59), second=0, microsecond=0)
        
        # Select content type
        content_type = random.choice(CONTENT_TYPES)
        
        # Select device type
        device_type = random.choice(DEVICE_TYPES)
        
        # Generate watch time with some data quality issues
        base_watch_time = {'Video': 15, 'Article': 8, 'Podcast': 25, 'Live Stream': 45}[content_type]
        watch_time = round(base_watch_time * random.uniform(0.1, 3.0), 2)
        
        # Content ID
        content_id = f"{content_type[:3].upper()}{random.randint(10000, 99999)}"
        
        # Engagement score with some outliers - increased range to ensure more high engagement examples
        base_score = {'Video': 75, 'Article': 65, 'Podcast': 70, 'Live Stream': 80}[content_type]
        engagement_score = random.randint(max(0, base_score - 15), min(100, base_score + 35))
        
        # Simulate data quality issues in roughly 5% of raw records.
        # Within that 5%: 30% missing watch time, 35% out-of-range score, 35% lowercase content type
        if random.random() < 0.05:
            if random.random() < 0.3:
                watch_time = "NULL"  # Missing data
            elif random.random() < 0.5:
                engagement_score = str(random.randint(200, 500))  # Out of range
            else:
                content_type = content_type.lower()  # Inconsistent casing
        
        raw_engagement_data.append({
            "user_id": user_id,
            "engagement_date": engagement_datetime.isoformat(),
            "content_type": str(content_type),
            "watch_time": str(watch_time),
            "content_id": content_id,
            "engagement_score": str(engagement_score),
            "device_type": device_type,
            "ingestion_timestamp": ingestion_time
        })

print(f"Generated {len(raw_engagement_data)} raw content engagement records")
print("Sample raw record:", raw_engagement_data[0])

In [None]:
# Insert raw data into Bronze layer

df_raw = spark.createDataFrame(raw_engagement_data)

print("Raw DataFrame Schema:")
df_raw.printSchema()

print("\nSample Raw Data:")
df_raw.show(5)

# Insert into Bronze table
df_raw.write.mode("overwrite").saveAsTable("media.bronze.content_engagement_raw")

print(f"\nSuccessfully ingested {df_raw.count()} raw records into Bronze layer")
print("Data stored exactly as received, preserving original quality and format")
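
Before cleaning, it can help to measure how many raw records carry each injected issue. The following is a minimal pure-Python sketch that profiles a list of Bronze-style dictionaries; the function name `profile_raw_records` and the issue labels are hypothetical.

```python
from collections import Counter

VALID_CONTENT_TYPES = {"Video", "Article", "Podcast", "Live Stream"}


def profile_raw_records(records):
    """Count data quality issues in a list of Bronze-style dicts of strings."""
    issues = Counter()
    for rec in records:
        if rec["watch_time"] == "NULL":
            issues["missing_watch_time"] += 1
        try:
            score = int(rec["engagement_score"])
        except ValueError:
            issues["unparseable_score"] += 1
        else:
            if not 0 <= score <= 100:
                issues["score_out_of_range"] += 1
        if rec["content_type"] not in VALID_CONTENT_TYPES:
            issues["nonstandard_content_type"] += 1
    return dict(issues)


sample = [
    {"watch_time": "12.5", "engagement_score": "80", "content_type": "Video"},
    {"watch_time": "NULL", "engagement_score": "75", "content_type": "Podcast"},
    {"watch_time": "30.0", "engagement_score": "350", "content_type": "Article"},
    {"watch_time": "44.1", "engagement_score": "90", "content_type": "live stream"},
]
print(profile_raw_records(sample))
```

In the notebook, calling `profile_raw_records(raw_engagement_data)` on the generated list gives a baseline to compare against the Silver layer after cleaning.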

## Silver Layer: Data Cleaning and Transformation

### Purpose
The Silver layer contains cleaned, validated, and standardized data. This layer:
- Removes or corrects invalid data
- Standardizes formats and types
- Enriches data with derived fields
- Prepares data for analytical use

### Transformations Applied
- **Type casting**: Convert strings to appropriate data types
- **Data validation**: Remove/correct invalid values
- **Standardization**: Consistent formatting
- **Enrichment**: Add derived fields like engagement categories
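
The four transformations above can be sketched for a single record in plain Python, which is handy as a reference when unit-testing the Spark version. The thresholds (80 and 60) and the default score of 50 come from this notebook; title-casing `content_type` is an assumption about what "consistent formatting" should mean, and `clean_record` is a hypothetical helper.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

DEFAULT_SCORE = 50  # same fallback the notebook uses for bad scores


def engagement_category(score: int) -> str:
    """Bucket a 0-100 score using the notebook's thresholds."""
    if score >= 80:
        return "High"
    if score >= 60:
        return "Medium"
    return "Low"


def clean_record(raw: dict) -> dict:
    """Apply the Silver rules to one Bronze-style record of strings."""
    try:
        watch_time = Decimal(raw["watch_time"]).quantize(Decimal("0.01"))
    except InvalidOperation:
        watch_time = Decimal("0.00")  # "NULL" or other unparseable text

    try:
        score = int(raw["engagement_score"])
    except ValueError:
        score = DEFAULT_SCORE
    if not 0 <= score <= 100:
        score = DEFAULT_SCORE  # out-of-range values are replaced, not clipped

    return {
        "engagement_date": datetime.fromisoformat(raw["engagement_date"]),
        "content_type": raw["content_type"].strip().title(),  # assumed casing rule
        "watch_time": watch_time,
        "engagement_score": score,
        "engagement_category": engagement_category(score),
    }


dirty = {
    "engagement_date": "2024-08-16T08:36:00",
    "content_type": "live stream",
    "watch_time": "NULL",
    "engagement_score": "350",
}
print(clean_record(dirty))
```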

In [None]:
# Create Silver layer table

spark.sql("""
CREATE TABLE IF NOT EXISTS media.silver.content_engagement_clean (
    user_id STRING,
    engagement_date TIMESTAMP,
    content_type STRING,
    watch_time DECIMAL(8,2),
    content_id STRING,
    engagement_score INT,
    device_type STRING,
    engagement_category STRING,
    ingestion_timestamp TIMESTAMP,
    processing_timestamp TIMESTAMP
)
USING DELTA
CLUSTER BY (user_id, engagement_date)
""")

print("Silver layer table created successfully!")

In [None]:
# Transform Bronze data to Silver layer
from pyspark.sql.functions import col, when, lit, initcap, trim, current_timestamp
from pyspark.sql.types import IntegerType, DecimalType, TimestampType

# Read Bronze data
bronze_df = spark.table("media.bronze.content_engagement_raw")

# Data cleaning and transformation
silver_df = bronze_df \
    .withColumn("engagement_date_clean", 
                when(col("engagement_date").isNotNull(), col("engagement_date").cast(TimestampType()))
                .otherwise(current_timestamp())) \
    .withColumn("watch_time_clean",
                # Missing or unparseable watch times become 0.00; the final cast keeps DECIMAL(8,2)
                when((col("watch_time") != "NULL") & (col("watch_time").cast("float").isNotNull()), 
                     col("watch_time").cast(DecimalType(8,2)))
                .otherwise(lit(0))
                .cast(DecimalType(8,2))) \
    .withColumn("engagement_score_clean",
                when(col("engagement_score").cast("int").isNotNull(), col("engagement_score").cast("int"))
                .otherwise(50)  # Default score
                .cast(IntegerType())) \
    .withColumn("engagement_score_clean",
                # Out-of-range scores are replaced with the default rather than clipped
                when(col("engagement_score_clean").between(0, 100), col("engagement_score_clean"))
                .otherwise(50)) \
    .withColumn("content_type_clean",
                # initcap restores consistent casing, e.g. "live stream" -> "Live Stream"
                when(col("content_type").isNotNull(), initcap(trim(col("content_type"))))
                .otherwise("Unknown")) \
    .withColumn("engagement_category",
                when(col("engagement_score_clean") >= 80, "High")
                .when(col("engagement_score_clean") >= 60, "Medium")
                .otherwise("Low")) \
    .withColumn("processing_timestamp", current_timestamp()) \
    .select(
        col("user_id"),
        col("engagement_date_clean").alias("engagement_date"),
        col("content_type_clean").alias("content_type"),
        col("watch_time_clean").alias("watch_time"),
        col("content_id"),
        col("engagement_score_clean").alias("engagement_score"),
        col("device_type"),
        col("engagement_category"),
        col("ingestion_timestamp"),
        col("processing_timestamp")
    )

print("Silver layer transformation completed")
print(f"Processed {silver_df.count()} records")

print("\nSilver Data Schema:")
silver_df.printSchema()

print("\nSample Clean Data:")
silver_df.show(10)

In [None]:
# Insert cleaned data into Silver layer

silver_df.write.mode("overwrite").saveAsTable("media.silver.content_engagement_clean")

print("Successfully transformed and loaded data into Silver layer")
print("Data is now cleaned, validated, and enriched for analysis")

## Gold Layer: Business Analytics and ML Insights

### Purpose
The Gold layer contains aggregated, business-ready data optimized for:
- **Analytics dashboards** and reporting
- **Machine learning** model training and scoring
- **Business intelligence** and decision making

### Tables in Gold Layer
- **content_analytics**: Aggregated metrics and KPIs
- **user_profiles**: User behavior summaries
- **engagement_predictions**: ML model predictions

In [None]:
# Create Gold layer analytics table

spark.sql("""
CREATE TABLE IF NOT EXISTS media.gold.content_analytics (
    content_type STRING,
    device_type STRING,
    date DATE,
    total_engagements BIGINT,
    total_watch_time DECIMAL(12,2),
    avg_watch_time DECIMAL(8,2),
    avg_engagement_score DECIMAL(5,2),
    unique_users BIGINT,
    high_engagement_rate DECIMAL(5,4),
    created_at TIMESTAMP
)
USING DELTA
CLUSTER BY (content_type, date)
""")

print("Gold layer analytics table created successfully!")

In [None]:
# Create user profiles table in Gold layer

spark.sql("""
CREATE TABLE IF NOT EXISTS media.gold.user_profiles (
    user_id STRING,
    total_sessions BIGINT,
    total_watch_time DECIMAL(10,2),
    avg_session_time DECIMAL(8,2),
    avg_engagement_score DECIMAL(5,2),
    preferred_content_type STRING,
    preferred_device STRING,
    engagement_trend STRING,
    last_engagement_date TIMESTAMP,
    user_segment STRING,
    created_at TIMESTAMP
)
USING DELTA
CLUSTER BY (user_segment, last_engagement_date)
""")

print("Gold layer user profiles table created successfully!")

In [None]:
# Aggregate data for Gold layer analytics
# Note: this import shadows Python's built-in sum and round for the rest of the session
from pyspark.sql.functions import date_format, count, sum, avg, countDistinct, round, current_timestamp

# Read Silver data
silver_data = spark.table("media.silver.content_engagement_clean")

# Create content analytics aggregations
content_analytics = silver_data \
    .withColumn("date", date_format("engagement_date", "yyyy-MM-dd").cast("date")) \
    .groupBy("content_type", "device_type", "date") \
    .agg(
        count("*").alias("total_engagements"),
        sum("watch_time").alias("total_watch_time"),
        avg("watch_time").alias("avg_watch_time"),
        avg("engagement_score").alias("avg_engagement_score"),
        countDistinct("user_id").alias("unique_users"),
        (count(when(col("engagement_category") == "High", 1)) / count("*")).alias("high_engagement_rate")
    ) \
    .withColumn("created_at", current_timestamp())

# Round decimal columns
content_analytics = content_analytics \
    .withColumn("total_watch_time", round("total_watch_time", 2)) \
    .withColumn("avg_watch_time", round("avg_watch_time", 2)) \
    .withColumn("avg_engagement_score", round("avg_engagement_score", 2)) \
    .withColumn("high_engagement_rate", round("high_engagement_rate", 4))

print("Content analytics aggregations created")
content_analytics.show(10)
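
To make the `high_engagement_rate` column concrete, here is a small pure-Python sketch of the same grouping logic: count events per (content type, device, date) and divide the "High" events by the total. The function name and the sample tuples are illustrative only.

```python
from collections import defaultdict


def high_engagement_rate(events):
    """events: iterable of (content_type, device_type, date, engagement_category)."""
    totals = defaultdict(int)
    highs = defaultdict(int)
    for content_type, device_type, date, category in events:
        key = (content_type, device_type, date)
        totals[key] += 1
        if category == "High":
            highs[key] += 1
    # One rate per group, mirroring count(when(High)) / count(*)
    return {key: highs[key] / totals[key] for key in totals}


events = [
    ("Video", "Mobile", "2024-12-31", "High"),
    ("Video", "Mobile", "2024-12-31", "Medium"),
    ("Video", "Mobile", "2024-12-31", "Low"),
    ("Video", "Desktop", "2024-12-31", "High"),
]
for key, rate in high_engagement_rate(events).items():
    print(key, f"{rate:.4f}")
```

Because the grouping key includes `device_type`, a single content type and date can produce several rows, one per device.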

In [None]:
# Create user profiles for Gold layer
from pyspark.sql.functions import max, first, when, col, row_number
from pyspark.sql.window import Window

# User behavior aggregations
user_profiles_base = silver_data \
    .groupBy("user_id") \
    .agg(
        count("*").alias("total_sessions"),
        sum("watch_time").alias("total_watch_time"),
        avg("watch_time").alias("avg_session_time"),
        avg("engagement_score").alias("avg_engagement_score"),
        max("engagement_date").alias("last_engagement_date")
    )

# Get preferred content type and device per user
user_preferences = silver_data \
    .groupBy("user_id", "content_type") \
    .agg(count("*").alias("content_count")) \
    .withColumn("rank", row_number().over(Window.partitionBy("user_id").orderBy(col("content_count").desc()))) \
    .filter("rank = 1") \
    .select("user_id", "content_type")

user_device_prefs = silver_data \
    .groupBy("user_id", "device_type") \
    .agg(count("*").alias("device_count")) \
    .withColumn("rank", row_number().over(Window.partitionBy("user_id").orderBy(col("device_count").desc()))) \
    .filter("rank = 1") \
    .select("user_id", "device_type")

# Combine user profiles
user_profiles = user_profiles_base \
    .join(user_preferences, "user_id", "left") \
    .join(user_device_prefs, "user_id", "left") \
    .withColumn("preferred_content_type", col("content_type")) \
    .withColumn("preferred_device", col("device_type")) \
    .withColumn("engagement_trend", 
                when(col("avg_engagement_score") >= 70, "High Performer")
                .when(col("avg_engagement_score") >= 60, "Good Engagement")
                .otherwise("Needs Attention")) \
    .withColumn("user_segment",
                when(col("total_sessions") >= 25, "Power User")
                .when(col("total_sessions") >= 15, "Regular User")
                .otherwise("Casual User")) \
    .withColumn("created_at", current_timestamp()) \
    .drop("content_type", "device_type")

# Round decimal columns
user_profiles = user_profiles \
    .withColumn("total_watch_time", round("total_watch_time", 2)) \
    .withColumn("avg_session_time", round("avg_session_time", 2)) \
    .withColumn("avg_engagement_score", round("avg_engagement_score", 2))

print("User profiles created")
user_profiles.show(10)

User profiles created


+----------+--------------+----------------+----------------+--------------------+--------------------+----------------------+----------------+----------------+------------+--------------------+
|   user_id|total_sessions|total_watch_time|avg_session_time|avg_engagement_score|last_engagement_date|preferred_content_type|preferred_device|engagement_trend|user_segment|          created_at|
+----------+--------------+----------------+----------------+--------------------+--------------------+----------------------+----------------+----------------+------------+--------------------+
|USER003726|            31|         1330.09|           42.91|               72.16| 2024-12-25 11:20:00|           Live Stream|  Gaming Console|  High Performer|  Power User|2026-01-02 20:20:...|
|USER003738|            21|          944.71|           44.99|               61.43| 2024-12-31 16:15:00|           Live Stream|        Smart TV| Good Engagement|Regular User|2026-01-02 20:20:...|
|USER003806|            3
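
The two labels attached to each profile are simple threshold rules. A minimal sketch, using the same cut-offs as the cell above (the function names are hypothetical):

```python
def user_segment(total_sessions: int) -> str:
    """Segment by activity volume: 25+ sessions, 15-24, or fewer."""
    if total_sessions >= 25:
        return "Power User"
    if total_sessions >= 15:
        return "Regular User"
    return "Casual User"


def engagement_trend(avg_engagement_score: float) -> str:
    """Label by average score: 70+, 60-69.99, or below 60."""
    if avg_engagement_score >= 70:
        return "High Performer"
    if avg_engagement_score >= 60:
        return "Good Engagement"
    return "Needs Attention"


# Values taken from the first two rows of the output above
for sessions, score in [(31, 72.16), (21, 61.43)]:
    print(sessions, score, user_segment(sessions), engagement_trend(score))
```

Since the generator gives every user between 8 and 35 events, "Casual User" here means 8 to 14 sessions.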

In [None]:
# Load aggregated data into Gold layer tables

content_analytics.write.mode("overwrite").saveAsTable("media.gold.content_analytics")
user_profiles.write.mode("overwrite").saveAsTable("media.gold.user_profiles")

print("Successfully loaded aggregated analytics into Gold layer")
print(f"Content analytics: {content_analytics.count()} records")
print(f"User profiles: {user_profiles.count()} records")

Successfully loaded aggregated analytics into Gold layer
Content analytics: 11190 records
User profiles: 15000 records


## Machine Learning: Content Engagement Prediction

### ML in the Gold Layer
We'll train a random forest classifier to predict whether an engagement event will score above 70, then store its predictions in the Gold layer as an input for personalized recommendations.

### Business Value
- **Personalized Recommendations**: Increase user engagement and watch time
- **Content Optimization**: Identify high-performing content patterns
- **Revenue Growth**: Better engagement drives advertising and subscription revenue

In [None]:
# Prepare data for ML model training
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.functions import vector_to_array
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

# Read Silver layer data for ML
ml_data = spark.table("media.silver.content_engagement_clean")

# Create engagement prediction features
engagement_features = ml_data \
    .withColumn("high_engagement", when(col("engagement_score") > 70, 1).otherwise(0)) \
    .withColumn("engagement_hour", F.hour("engagement_date")) \
    .withColumn("engagement_day_of_week", F.dayofweek("engagement_date")) \
    .withColumn("user_avg_engagement", 
                F.avg("engagement_score").over(Window.partitionBy("user_id").orderBy("engagement_date").rowsBetween(-10, -1))) \
    .withColumn("user_prior_engagements", 
                F.count("*").over(Window.partitionBy("user_id").orderBy("engagement_date").rowsBetween(-10, -1))) \
    .fillna(0, subset=["user_avg_engagement"]) \
    .fillna(1, subset=["user_prior_engagements"])

print(f"Prepared {engagement_features.count()} records for ML training")
engagement_features.groupBy("high_engagement").count().show()

Prepared 321321 records for ML training


+---------------+------+
|high_engagement| count|
+---------------+------+
|              1|139466|
|              0|181855|
+---------------+------+
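
The `rowsBetween(-10, -1)` window uses only the ten events *before* the current one, so the label of the current event never leaks into its own features. A pure-Python sketch of that trailing window for one user's ordered scores (the function name `trailing_features` is hypothetical):

```python
from statistics import fmean


def trailing_features(scores, window=10):
    """For each event, return (mean of prior scores, number of prior scores).

    `scores` must already be ordered by engagement date for a single user.
    """
    features = []
    for i in range(len(scores)):
        start = i - window if i > window else 0
        prior = scores[start:i]  # excludes the current event
        features.append((fmean(prior) if prior else 0.0, len(prior)))
    return features


for row in trailing_features([80, 60, 70, 90], window=2):
    print(row)
```

The first event has no history, so its average falls back to 0.0 and its prior count is 0. In Spark, `count` over an empty frame also returns 0 rather than null, so the `fillna(1, ...)` on `user_prior_engagements` should have no effect in practice.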



In [None]:
# Feature engineering and model training

# Index categorical features
content_type_indexer = StringIndexer(inputCol="content_type", outputCol="content_type_index")
device_type_indexer = StringIndexer(inputCol="device_type", outputCol="device_type_index")

# Assemble features
feature_cols = ["watch_time", "engagement_hour", "engagement_day_of_week", 
                "user_avg_engagement", "user_prior_engagements", 
                "content_type_index", "device_type_index"]

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")

# Random Forest model
rf = RandomForestClassifier(
    labelCol="high_engagement", 
    featuresCol="scaled_features",
    numTrees=50,
    maxDepth=8
)

# Create pipeline
pipeline = Pipeline(stages=[content_type_indexer, device_type_indexer, assembler, scaler, rf])

# Split data
train_data, test_data = engagement_features.randomSplit([0.8, 0.2], seed=42)

print(f"Training set: {train_data.count()} interactions")
print(f"Test set: {test_data.count()} interactions")

Training set: 257009 interactions
Test set: 64312 interactions


In [None]:
# Train the engagement prediction model

print("Training content engagement prediction model...")
model = pipeline.fit(train_data)

# Make predictions
predictions = model.transform(test_data)

# Evaluate model
evaluator = BinaryClassificationEvaluator(labelCol="high_engagement", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)

print(f"Model AUC: {auc:.4f}")

# Show prediction results
predictions.select("user_id", "content_type", "watch_time", "high_engagement", "prediction", "probability").show(15)

# Create predictions table in Gold layer
spark.sql("""
CREATE TABLE IF NOT EXISTS media.gold.engagement_predictions (
    user_id STRING,
    content_type STRING,
    watch_time DECIMAL(8,2),
    engagement_score INT,
    predicted_high_engagement INT,
    prediction_probability DECIMAL(5,4),
    prediction_timestamp TIMESTAMP
)
USING DELTA
CLUSTER BY (user_id, predicted_high_engagement)
""")

print("Gold layer predictions table created!")

Training content engagement prediction model...
Model AUC: 0.6207


+----------+------------+----------+---------------+----------+--------------------+
|   user_id|content_type|watch_time|high_engagement|prediction|         probability|
+----------+------------+----------+---------------+----------+--------------------+
|USER000004| Live Stream|     18.84|              0|       1.0|[0.45896525024308...|
|USER000004|       Video|     26.57|              1|       1.0|[0.49641008043898...|
|USER000004|     Article|      8.29|              1|       0.0|[0.70666455872038...|
|USER000007| Live Stream|    103.29|              1|       1.0|[0.42414598085538...|
|USER000007| Live Stream|     28.55|              1|       1.0|[0.43578712043250...|
|USER000011| Live Stream|    119.85|              0|       1.0|[0.42343499054838...|
|USER000011| Live Stream|     33.56|              1|       1.0|[0.43529629589007...|
|USER000022|     Article|     22.15|              1|       0.0|[0.69395640594806...|
|USER000022|       Video|     24.93|              1|       0.0|[0
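
As a reading aid for the AUC figure: area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by brute force on a toy list; it is an illustration of the metric, not how `BinaryClassificationEvaluator` is implemented.

```python
def roc_auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]
print(f"AUC: {roc_auc(labels, scores):.2f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the result above in context.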

In [None]:
# Save predictions to Gold layer

predictions_for_gold = predictions \
    .select(
        "user_id",
        "content_type", 
        "watch_time",
        "engagement_score",
        # Cast to match the INT and DECIMAL(5,4) columns declared for the Gold table
        F.col("prediction").cast("int").alias("predicted_high_engagement"),
        vector_to_array("probability")[1].cast("decimal(5,4)").alias("prediction_probability"),
        F.current_timestamp().alias("prediction_timestamp")
    )

predictions_for_gold.write.mode("overwrite").saveAsTable("media.gold.engagement_predictions")

print(f"Successfully saved {predictions_for_gold.count()} predictions to Gold layer")
print("ML predictions are now available for business analysis and recommendations")

Successfully saved 64312 predictions to Gold layer
ML predictions are now available for business analysis and recommendations


In [None]:
# Model interpretation and business insights

# Feature importance
rf_model = model.stages[-1]
feature_importance = rf_model.featureImportances
feature_names = feature_cols

print("=== Feature Importance for Engagement Prediction ===")
for name, importance in zip(feature_names, feature_importance):
    print(f"{name}: {importance:.4f}")

print("\n=== Business Impact Analysis ===")

# Calculate potential impact
high_engagement_predictions = predictions.filter("prediction = 1")
total_predictions = predictions.count()

print(f"Total predictions: {total_predictions}")
print(f"Predicted high engagement content: {high_engagement_predictions.count()}")
print(f"Recommendation coverage: {(high_engagement_predictions.count()/total_predictions)*100:.1f}%")

# Watch-time comparison (descriptive only: watch_time is also a model input,
# so this is not a causal estimate of revenue or engagement impact)
avg_watch_time_predicted = high_engagement_predictions.agg(F.avg("watch_time")).collect()[0][0] or 0
avg_watch_time_all = predictions.agg(F.avg("watch_time")).collect()[0][0] or 0
engagement_lift = ((avg_watch_time_predicted - avg_watch_time_all) / avg_watch_time_all) * 100 if avg_watch_time_all else 0

print(f"\nAverage watch time for recommended content: {avg_watch_time_predicted:.2f} minutes")
print(f"Average watch time overall: {avg_watch_time_all:.2f} minutes")
print(f"Potential engagement lift: {engagement_lift:.1f}%")

# Model accuracy metrics (each count is computed once and reused)
accuracy = predictions.filter("high_engagement = prediction").count() / total_predictions
predicted_positive = high_engagement_predictions.count()
true_positive = high_engagement_predictions.filter("high_engagement = 1").count()
precision = true_positive / predicted_positive if predicted_positive > 0 else 0

print(f"\nModel Performance:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"AUC: {auc:.4f}")

=== Feature Importance for Engagement Prediction ===
watch_time: 0.1994
engagement_hour: 0.0163
engagement_day_of_week: 0.0085
user_avg_engagement: 0.0155
user_prior_engagements: 0.0090
content_type_index: 0.7440
device_type_index: 0.0073

=== Business Impact Analysis ===
Total predictions: 64312
Predicted high engagement content: 16963
Recommendation coverage: 26.4%


Average watch time for recommended content: 65.75 minutes
Average watch time overall: 35.22 minutes
Potential engagement lift: 86.7%



Model Performance:
Accuracy: 0.6039
Precision: 0.5702
AUC: 0.6207
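
Accuracy is easier to judge against a majority-class baseline, that is, the accuracy of always predicting the more common label. The sketch below shows that baseline together with the standard accuracy, precision, and recall formulas. The class counts are the ones printed earlier for the full dataset (the test split is assumed to have a similar balance), and the confusion-matrix numbers are toy values, not results from this model.

```python
def majority_baseline(negatives: int, positives: int) -> float:
    """Accuracy obtained by always predicting the more frequent class."""
    larger = negatives if negatives >= positives else positives
    return larger / (negatives + positives)


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Class counts from the high_engagement distribution shown earlier
baseline = majority_baseline(negatives=181855, positives=139466)
print(f"Majority-class baseline accuracy: {baseline:.4f}")

# Toy confusion matrix, for illustration only
print(classification_metrics(tp=3, fp=1, tn=4, fn=2))
```

The baseline comes out at about 0.566, so the reported accuracy of 0.6039 is only a modest improvement over always predicting "not high engagement".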


## Querying the Medallion Architecture

### Demonstrating Data Flow and Optimization

Let's run queries across all layers to show how the medallion architecture enables different types of analysis.

In [None]:
# Query Bronze layer - raw data inspection
print("=== Bronze Layer: Raw Data ===")
spark.sql("""
SELECT user_id, engagement_date, content_type, watch_time, engagement_score
FROM media.bronze.content_engagement_raw
WHERE user_id = 'USER000001'
ORDER BY ingestion_timestamp DESC
LIMIT 5
""").show()

=== Bronze Layer: Raw Data ===


+----------+-------------------+------------+----------+----------------+
|   user_id|    engagement_date|content_type|watch_time|engagement_score|
+----------+-------------------+------------+----------+----------------+
|USER000001|2024-08-16T08:36:00|     Podcast|     15.01|              75|
|USER000001|2024-02-06T20:47:00|       Video|     13.89|              59|
|USER000001|2024-12-31T13:58:00|     Article|     19.38|              66|
|USER000001|2024-07-19T14:43:00|       Video|     30.19|              94|
|USER000001|2024-04-13T07:42:00|     Podcast|     57.78|              84|
+----------+-------------------+------------+----------+----------------+



In [None]:
# Query Silver layer - cleaned data analysis
print("=== Silver Layer: Cleaned Data ===")
spark.sql("""
SELECT user_id, engagement_date, content_type, watch_time, engagement_score, engagement_category
FROM media.silver.content_engagement_clean
WHERE user_id = 'USER000001'
ORDER BY engagement_date DESC
LIMIT 5
""").show()

=== Silver Layer: Cleaned Data ===


+----------+-------------------+------------+----------+----------------+-------------------+
|   user_id|    engagement_date|content_type|watch_time|engagement_score|engagement_category|
+----------+-------------------+------------+----------+----------------+-------------------+
|USER000001|2024-12-31 13:58:00|     Article|     19.38|              66|             Medium|
|USER000001|2024-11-15 17:00:00| Live Stream|     32.38|              50|                Low|
|USER000001|2024-10-20 17:29:00|     Podcast|     56.98|              46|                Low|
|USER000001|2024-10-10 18:42:00|     Article|     16.54|              48|                Low|
|USER000001|2024-09-28 06:49:00|       Video|     34.18|              77|             Medium|
+----------+-------------------+------------+----------+----------------+-------------------+



In [None]:
# Query Gold layer - business analytics
print("=== Gold Layer: Business Analytics ===")
spark.sql("""
SELECT content_type, date, total_engagements, avg_watch_time, avg_engagement_score, high_engagement_rate
FROM media.gold.content_analytics
WHERE content_type = 'Video'
ORDER BY date DESC
LIMIT 5
""").show()

=== Gold Layer: Business Analytics ===


+------------+----------+-----------------+--------------+--------------------+--------------------+
|content_type|      date|total_engagements|avg_watch_time|avg_engagement_score|high_engagement_rate|
+------------+----------+-----------------+--------------+--------------------+--------------------+
|       Video|2024-12-31|               42|         25.03|               70.07|              0.3333|
|       Video|2024-12-31|               45|         22.04|                68.6|              0.2889|
|       Video|2024-12-31|               41|         21.96|               65.98|              0.2195|
|       Video|2024-12-31|               48|         24.05|               73.25|               0.375|
|       Video|2024-12-31|               50|         20.02|                68.8|                0.28|
+------------+----------+-----------------+--------------+--------------------+--------------------+



In [None]:
# Query Gold layer - user profiles
print("=== Gold Layer: User Profiles ===")
spark.sql("""
SELECT user_id, total_sessions, avg_engagement_score, preferred_content_type, user_segment
FROM media.gold.user_profiles
ORDER BY total_sessions DESC
LIMIT 5
""").show()

=== Gold Layer: User Profiles ===


+----------+--------------+--------------------+----------------------+------------+
|   user_id|total_sessions|avg_engagement_score|preferred_content_type|user_segment|
+----------+--------------+--------------------+----------------------+------------+
|USER005822|            35|               71.77|           Live Stream|  Power User|
|USER006268|            35|               63.14|                 Video|  Power User|
|USER004568|            35|               68.43|           Live Stream|  Power User|
|USER006265|            35|               69.09|                 Video|  Power User|
|USER005047|            35|               64.77|                 Video|  Power User|
+----------+--------------+--------------------+----------------------+------------+



In [None]:
# Query Gold layer - ML predictions
print("=== Gold Layer: ML Predictions ===")
spark.sql("""
SELECT user_id, content_type, engagement_score, predicted_high_engagement, prediction_probability
FROM media.gold.engagement_predictions
WHERE predicted_high_engagement = 1
ORDER BY prediction_probability DESC
LIMIT 5
""").show()

=== Gold Layer: ML Predictions ===


+----------+------------+----------------+-------------------------+----------------------+
|   user_id|content_type|engagement_score|predicted_high_engagement|prediction_probability|
+----------+------------+----------------+-------------------------+----------------------+
|USER006220| live stream|              55|                      1.0|                0.7102|
|USER002796| live stream|              54|                      1.0|                 0.682|
|USER009903| live stream|              95|                      1.0|                0.6719|
|USER003578| live stream|              52|                      1.0|                0.6697|
|USER010685| live stream|              59|                      1.0|                0.6669|
+----------+------------+----------------+-------------------------+----------------------+



## Key Takeaways: Medallion Architecture with Delta Liquid Clustering

### Architecture Benefits

1. **Progressive Data Refinement**: Each layer serves specific analytical needs
   - Bronze: Data preservation and auditability
   - Silver: Clean, validated data for operational analytics
   - Gold: Business-ready aggregations and ML insights

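2. **Performance Optimization**: Liquid clustering stores rows with similar clustering-key values in the same files, so queries that filter on those keys can skip unrelated files
   - No manual partitioning or Z-Ordering required
   - Clustering columns can be redefined as query patterns change
   - The benefit depends on queries filtering on the clustering columns; this demo does not benchmark it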

3. **Data Governance**: Clear separation of concerns and data quality management
   - Schema enforcement prevents data corruption
   - Time travel enables historical analysis
   - Catalog isolation provides security and governance
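
As a sketch of the time travel point above: Delta Lake records every write as a numbered table version, which `DESCRIBE HISTORY` lists and `VERSION AS OF` reads back. The helpers below only build the SQL text; the function names are hypothetical, and support for these statements in a given AIDP release is an assumption.

```python
def history_statement(table_name: str) -> str:
    """SQL that lists the recorded versions of a Delta table."""
    return f"DESCRIBE HISTORY {table_name}"


def version_query(table_name: str, version: int) -> str:
    """SQL that reads a Delta table as of a specific version number."""
    if version < 0:
        raise ValueError("Delta table versions start at 0")
    return f"SELECT * FROM {table_name} VERSION AS OF {version}"


table = "media.silver.content_engagement_clean"
print(history_statement(table))
print(version_query(table, 0))
# In the notebook (assumes `spark` exists):
# spark.sql(version_query(table, 0)).show(5)
```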

### Business Impact for Media Companies

1. **Personalized Content Discovery**: ML-driven recommendations increase engagement
2. **Data-Driven Content Strategy**: Analytics guide content creation and acquisition
3. **User Retention**: Better understanding of user behavior improves retention
4. **Revenue Optimization**: Higher engagement drives subscription and advertising revenue
5. **Operational Efficiency**: Automated data processing reduces manual effort

### Technical Advantages

- **Unified Analytics**: Seamless integration of data processing and ML
- **Scalability**: The same Spark and Delta code applies as data grows beyond this 15,000-user sample
- **Cost Efficiency**: File skipping on clustered columns can reduce the data scanned per query
- **Developer Productivity**: Focus on business logic, not infrastructure

### Best Practices

1. **Layer Design**: Clearly define the purpose of each medallion layer
2. **Clustering Strategy**: Choose clustering columns based on query patterns
3. **Data Quality**: Implement comprehensive validation in the Silver layer
4. **Incremental Processing**: Use Delta's capabilities for incremental updates
5. **Monitoring**: Track data quality and pipeline performance
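
For the incremental processing practice, one common Delta pattern is an upsert with `MERGE INTO` instead of the full `overwrite` used throughout this demo. The sketch below builds such a statement. The staging view name `silver_updates` and the choice of key columns are assumptions for illustration; this notebook does not define a unique key for engagement events.

```python
def merge_statement(target: str, source: str, keys: list) -> str:
    """Build a Delta MERGE that updates matching rows and inserts new ones."""
    if not keys:
        raise ValueError("At least one key column is required")
    condition = " AND ".join(f"t.{key} = s.{key}" for key in keys)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {condition}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


sql = merge_statement(
    target="media.silver.content_engagement_clean",
    source="silver_updates",  # hypothetical temp view holding the new batch
    keys=["user_id", "content_id", "engagement_date"],  # assumed event key
)
print(sql)
# In the notebook (assumes `spark` exists and the view is registered):
# spark.sql(sql)
```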

### Next Steps

- Implement real-time data ingestion pipelines
- Add more sophisticated ML models (recommendation systems, churn prediction)
- Integrate with content management systems
- Deploy models for production recommendations
- Scale to larger datasets and more complex analytics

This medallion architecture demonstrates how Oracle AI Data Platform enables sophisticated media analytics while maintaining enterprise-grade performance, governance, and scalability.