# Retail Analytics: Medallion Architecture Demo with Delta Liquid Clustering

This notebook demonstrates the **Medallion Architecture** in Oracle AI Data Platform (AIDP) Workbench using a retail analytics use case with Delta Liquid Clustering for optimal performance.

## Medallion Architecture Overview

The Medallion Architecture organizes data into three layers:

- **Bronze Layer**: Raw, unprocessed data as ingested from sources
- **Silver Layer**: Cleaned, enriched, and standardized data
- **Gold Layer**: Business-ready data with aggregations and ML features

## Delta Liquid Clustering

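Liquid clustering replaces manual partitioning and Z-Ordering: you declare clustering columns with `CLUSTER BY`, and Delta lays out the data around those columns to speed up queries that filter on them. The clustering columns can be changed later without rewriting existing data.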

## Use Case: Customer Purchase Analytics with Churn Prediction

We'll analyze customer purchase records from a retail company and build a churn prediction model using the medallion architecture.

## AIDP Environment Setup

This notebook uses the Spark session (`spark`) that already exists in your AIDP environment, so no session setup is required.

## Step 1: Create Retail Catalog and Medallion Schemas

In AIDP, catalogs provide data isolation and governance. We'll create separate schemas for each medallion layer.

In [None]:
# Create retail catalog and medallion schemas
# In AIDP, catalogs provide data isolation and governance

spark.sql("CREATE CATALOG IF NOT EXISTS retail")

spark.sql("CREATE SCHEMA IF NOT EXISTS retail.bronze")
spark.sql("CREATE SCHEMA IF NOT EXISTS retail.silver")
spark.sql("CREATE SCHEMA IF NOT EXISTS retail.gold")

print("Retail catalog and medallion schemas (bronze, silver, gold) created successfully!")

Retail catalog and medallion schemas (bronze, silver, gold) created successfully!


## Bronze Layer: Raw Data Ingestion

### Table Design

Our `customer_purchases_raw` table stores raw purchase data as ingested from source systems:

- **customer_id**: Raw customer identifier
- **purchase_date**: Raw purchase date
- **product_id**: Raw product identifier
- **product_category**: Raw category string
- **purchase_amount**: Raw transaction amount
- **store_id**: Raw store identifier
- **payment_method**: Raw payment type
- **ingestion_timestamp**: When data was ingested

### Clustering Strategy

We'll cluster by `customer_id` and `purchase_date` to optimize for:
- Customer-specific queries
- Time-based analysis
- Purchase pattern analysis

In [None]:
# Create Bronze layer Delta table with liquid clustering

# CLUSTER BY defines the columns for automatic optimization

spark.sql("""
CREATE TABLE IF NOT EXISTS retail.bronze.customer_purchases_raw (
    customer_id STRING,
    purchase_date DATE,
    product_id STRING,
    product_category STRING,
    purchase_amount DECIMAL(10,2),
    store_id STRING,
    payment_method STRING,
    ingestion_timestamp TIMESTAMP
)
USING DELTA
CLUSTER BY (customer_id, purchase_date)
""")

print("Bronze layer Delta table with liquid clustering created successfully!")
print("Clustering will automatically optimize data layout for queries on customer_id and purchase_date.")

Bronze layer Delta table with liquid clustering created successfully!
Clustering will automatically optimize data layout for queries on customer_id and purchase_date.
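As a rough sketch of how clustering is usually maintained on a Delta table, the helper below only builds SQL strings. `OPTIMIZE` and `ALTER TABLE ... CLUSTER BY` are standard Delta Lake statements, but whether AIDP needs them run manually or schedules them itself is an assumption to verify in your environment. The helper name `clustering_statements` is hypothetical.

```python
def clustering_statements(table, new_keys=None):
    """Return Delta SQL statements for maintaining a liquid-clustered table.

    Hypothetical helper: it only builds strings, so they can be reviewed
    before anything is passed to spark.sql().
    """
    statements = []
    if new_keys:
        # Changing the keys affects how data is laid out from now on;
        # existing files are not rewritten by this statement alone.
        statements.append(f"ALTER TABLE {table} CLUSTER BY ({', '.join(new_keys)})")
    # OPTIMIZE incrementally clusters data that has not been clustered yet
    statements.append(f"OPTIMIZE {table}")
    return statements


for stmt in clustering_statements("retail.bronze.customer_purchases_raw"):
    print(stmt)  # in the notebook this would be: spark.sql(stmt)
```

Keeping the statements as plain strings makes it easy to log them or run them from a scheduled job after large writes.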


## Step 3: Generate and Ingest Raw Retail Data

### Data Generation Strategy

We'll create realistic raw retail purchase data including:

- **1,000 customers** with multiple purchases over time
- **Product categories**: Electronics, Clothing, Home & Garden, Books, Sports
- **Temporal spread**: Purchase dates drawn uniformly at random across 2024, with repeat purchases per customer
- **Multiple stores**: Different retail locations
- **Data quality issues**: Some missing values, inconsistent formats (simulating real-world data)
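The generation cell in this notebook draws purchase days uniformly across the year. If seasonal shopping patterns are wanted, one option is weighted month sampling, sketched below with the standard library only. The weights in `MONTH_WEIGHTS` and the function name `seasonal_purchase_date` are illustrative assumptions, not values from a real retailer.

```python
import calendar
import random
from datetime import date

# Hypothetical relative weights per month (index 0 = January); November and
# December are weighted up to mimic holiday shopping.
MONTH_WEIGHTS = [1.0, 0.8, 0.9, 0.9, 1.0, 1.0, 0.9, 0.9, 1.0, 1.0, 1.6, 2.0]


def seasonal_purchase_date(rng, year=2024):
    """Draw a purchase date whose month follows MONTH_WEIGHTS."""
    month = rng.choices(range(1, 13), weights=MONTH_WEIGHTS, k=1)[0]
    days_in_month = calendar.monthrange(year, month)[1]
    return date(year, month, rng.randint(1, days_in_month))


rng = random.Random(42)  # seeded so the sketch is reproducible
sample_dates = [seasonal_purchase_date(rng) for _ in range(5000)]
december_share = sum(d.month == 12 for d in sample_dates) / len(sample_dates)
print(f"Share of purchases in December: {december_share:.1%}")
```

With uniform sampling December would receive roughly one twelfth of the purchases; the weights above push it noticeably higher.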

In [None]:
# Generate sample retail purchase data with some data quality issues
# Only standard-library modules are needed here; no PySpark functions are used

import random
from datetime import datetime, timedelta

# Define retail data constants
PRODUCTS = {
    "Electronics": [
        ("ELE001", "Smartphone", 599.99),
        ("ELE002", "Laptop", 1299.99),
        ("ELE003", "Headphones", 149.99),
        ("ELE004", "Smart TV", 799.99),
        ("ELE005", "Tablet", 399.99)
    ],
    "Clothing": [
        ("CLO001", "T-Shirt", 19.99),
        ("CLO002", "Jeans", 79.99),
        ("CLO003", "Jacket", 129.99),
        ("CLO004", "Sneakers", 89.99),
        ("CLO005", "Dress", 59.99)
    ],
    "Home & Garden": [
        ("HOM001", "Blender", 79.99),
        ("HOM002", "Coffee Maker", 49.99),
        ("HOM003", "Garden Tools Set", 39.99),
        ("HOM004", "Bedding Set", 89.99),
        ("HOM005", "Decorative Pillow", 24.99)
    ],
    "Books": [
        ("BOK001", "Fiction Novel", 14.99),
        ("BOK002", "Cookbook", 24.99),
        ("BOK003", "Biography", 19.99),
        ("BOK004", "Self-Help Book", 16.99),
        ("BOK005", "Children's Book", 9.99)
    ],
    "Sports": [
        ("SPO001", "Yoga Mat", 29.99),
        ("SPO002", "Dumbbells", 49.99),
        ("SPO003", "Running Shoes", 119.99),
        ("SPO004", "Basketball", 24.99),
        ("SPO005", "Tennis Racket", 89.99)
    ]
}

STORES = ["STORE_NYC_001", "STORE_LAX_002", "STORE_CHI_003", "STORE_HOU_004", "STORE_MIA_005"]
PAYMENT_METHODS = ["Credit Card", "Debit Card", "Cash", "Digital Wallet", "Buy Now Pay Later"]

# Generate customer purchase records with some data quality issues
purchase_data = []
base_date = datetime(2024, 1, 1)

# Create 1,000 customers with 3-8 purchases each
for customer_num in range(1, 1001):
    customer_id = f"CUST{customer_num:06d}"
    
    # Each customer gets 3-8 purchases over 12 months
    num_purchases = random.randint(3, 8)
    
    for i in range(num_purchases):
        # Spread purchases over 12 months
        days_offset = random.randint(0, 365)
        purchase_date = base_date + timedelta(days=days_offset)
        
        # Select random category and product
        category = random.choice(list(PRODUCTS.keys()))
        product_id, product_name, base_price = random.choice(PRODUCTS[category])
        
        # Add some price variation (±20%)
        price_variation = random.uniform(0.8, 1.2)
        purchase_amount = round(base_price * price_variation, 2)
        
        # Select random store and payment method
        store_id = random.choice(STORES)
        payment_method = random.choice(PAYMENT_METHODS)
        
        # Simulate data quality issues (5% chance)
        if random.random() < 0.05:
            # Introduce some missing or invalid data
            if random.random() < 0.3:
                purchase_amount = None  # Missing amount
            elif random.random() < 0.5:
                category = None  # Missing category (the record below reads `category`)
            else:
                payment_method = "Unknown"  # Invalid payment method
        
        purchase_data.append({
            "customer_id": customer_id,
            "purchase_date": purchase_date.date(),
            "product_id": product_id,
            "product_category": category,
            "purchase_amount": purchase_amount,
            "store_id": store_id,
            "payment_method": payment_method,
            "ingestion_timestamp": datetime.now()
        })

print(f"Generated {len(purchase_data)} raw customer purchase records (with simulated data quality issues)")
print("Sample record:", purchase_data[0])

Generated 5475 raw customer purchase records (with simulated data quality issues)
Sample record: {'customer_id': 'CUST000001', 'purchase_date': datetime.date(2024, 1, 12), 'product_id': 'BOK003', 'product_category': 'Books', 'purchase_amount': 19.58, 'store_id': 'STORE_MIA_005', 'payment_method': 'Digital Wallet', 'ingestion_timestamp': datetime.datetime(2026, 1, 2, 20, 28, 11, 313848)}


In [None]:
# Insert raw data into Bronze layer using PySpark

# Create DataFrame from generated data
df_purchases_raw = spark.createDataFrame(purchase_data)

# Display schema and sample data
print("Bronze Layer DataFrame Schema:")
df_purchases_raw.printSchema()

print("\nSample Raw Data:")
df_purchases_raw.show(5)

# Insert data into Bronze table with liquid clustering
df_purchases_raw.write.mode("overwrite").saveAsTable("retail.bronze.customer_purchases_raw")

print(f"\nSuccessfully inserted {df_purchases_raw.count()} raw records into retail.bronze.customer_purchases_raw")
print("Bronze layer now contains raw, unprocessed data with potential quality issues.")

Bronze Layer DataFrame Schema:
root
 |-- customer_id: string (nullable = true)
 |-- ingestion_timestamp: timestamp (nullable = true)
 |-- payment_method: string (nullable = true)
 |-- product_category: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- purchase_amount: double (nullable = true)
 |-- purchase_date: date (nullable = true)
 |-- store_id: string (nullable = true)


Sample Raw Data:


+-----------+--------------------+-----------------+----------------+----------+---------------+-------------+-------------+
|customer_id| ingestion_timestamp|   payment_method|product_category|product_id|purchase_amount|purchase_date|     store_id|
+-----------+--------------------+-----------------+----------------+----------+---------------+-------------+-------------+
| CUST000001|2026-01-02 20:28:...|   Digital Wallet|           Books|    BOK003|          19.58|   2024-01-12|STORE_MIA_005|
| CUST000001|2026-01-02 20:28:...|      Credit Card|        Clothing|    CLO003|         132.31|   2024-01-09|STORE_HOU_004|
| CUST000001|2026-01-02 20:28:...|   Digital Wallet|        Clothing|    CLO005|          64.72|   2024-02-02|STORE_NYC_001|
| CUST000001|2026-01-02 20:28:...|Buy Now Pay Later|        Clothing|    CLO003|         145.23|   2024-01-22|STORE_HOU_004|
| CUST000001|2026-01-02 20:28:...|   Digital Wallet|          Sports|    SPO001|          26.21|   2024-10-14|STORE_LAX_002|



Successfully inserted 5475 raw records into retail.bronze.customer_purchases_raw
Bronze layer now contains raw, unprocessed data with potential quality issues.


## Silver Layer: Data Cleansing and Enrichment

### Transformation Logic

The Silver layer transforms raw Bronze data into:

- **Cleaned data**: Handle missing values, standardize formats
- **Enriched data**: Add derived features and business logic
- **Validated data**: Apply data quality rules

### Silver Table Design

`customer_purchases_clean` table includes:

- All original fields (cleaned)
- Data quality flags
- Derived features (e.g., purchase_amount_category)
- Standardized categories
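The cleansing rules can be tried on a single record without a Spark session. The plain-Python sketch below mirrors the rules this notebook applies in the Silver transformation (scores of 100 or 75, amount thresholds of 500 and 100). The function names are hypothetical, and it deliberately scores the raw values before imputing them so that a missing amount still lowers the score.

```python
def quality_score(record):
    """Return 100 when the raw record passes every check, otherwise 75."""
    checks = [
        record.get("purchase_amount") is not None,
        record.get("product_category") is not None,
        record.get("payment_method") not in (None, "Unknown"),
    ]
    return 100 if all(checks) else 75


def amount_category(amount):
    """Bucket an amount: >= 500 is High, >= 100 is Medium, otherwise Low."""
    if amount >= 500:
        return "High"
    if amount >= 100:
        return "Medium"
    return "Low"


def cleanse(record):
    """Return a cleaned copy of a raw record; the input dict is not modified."""
    cleaned = dict(record)
    # Score the raw values before imputing, so missing fields lower the score
    cleaned["data_quality_score"] = quality_score(record)
    if cleaned.get("purchase_amount") is None:
        cleaned["purchase_amount"] = 0.0
    if cleaned.get("product_category") is None:
        cleaned["product_category"] = "Unknown"
    if cleaned.get("payment_method") == "Unknown":
        cleaned["payment_method"] = "Other"
    cleaned["purchase_amount_category"] = amount_category(cleaned["purchase_amount"])
    return cleaned


raw = {"customer_id": "CUST000001", "purchase_amount": None,
       "product_category": "Books", "payment_method": "Cash"}
print(cleanse(raw))
```

Imputing a missing amount as 0.0 keeps the row but drags down averages, so downstream aggregations may want to filter on `data_quality_score`.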

In [None]:
# Create Silver layer table for cleaned and enriched data

spark.sql("""
CREATE TABLE IF NOT EXISTS retail.silver.customer_purchases_clean (
    customer_id STRING,
    purchase_date DATE,
    product_id STRING,
    product_category STRING,
    purchase_amount DECIMAL(10,2),
    store_id STRING,
    payment_method STRING,
    ingestion_timestamp TIMESTAMP,
    data_quality_score INT,
    purchase_amount_category STRING,
    is_high_value_customer BOOLEAN,
    processing_timestamp TIMESTAMP
)
USING DELTA
CLUSTER BY (customer_id, purchase_date)
""")

print("Silver layer Delta table created successfully!")

Silver layer Delta table created successfully!


In [None]:
# Transform Bronze data to Silver layer with cleansing and enrichment

from pyspark.sql.functions import col, when, lit, current_timestamp, expr

# Read from Bronze layer
bronze_df = spark.table("retail.bronze.customer_purchases_raw")

# Apply data cleansing and enrichment
# Score quality on the raw values first: once nulls are imputed below,
# an isNotNull() check on purchase_amount would always pass.
silver_df = bronze_df.withColumn(
    "data_quality_score",
    when(
        (col("purchase_amount").isNotNull()) & 
        (col("product_category").isNotNull()) & 
        (col("payment_method") != "Unknown"),
        100
    ).otherwise(75)
).withColumn(
    "purchase_amount",
    when(col("purchase_amount").isNull(), 0.0).otherwise(col("purchase_amount"))
).withColumn(
    "product_category",
    when(col("product_category").isNull(), "Unknown").otherwise(col("product_category"))
).withColumn(
    "payment_method",
    when(col("payment_method") == "Unknown", "Other").otherwise(col("payment_method"))
).withColumn(
    "purchase_amount_category",
    when(col("purchase_amount") >= 500, "High")
    .when(col("purchase_amount") >= 100, "Medium")
    .otherwise("Low")
).withColumn(
    "is_high_value_customer",
    lit(False)  # Will be updated in Gold layer based on aggregations
).withColumn(
    "processing_timestamp",
    current_timestamp()
)

# Show transformation results
print("Silver Layer Transformation Results:")
silver_df.select(
    "customer_id", "purchase_date", "product_category", 
    "purchase_amount", "data_quality_score", "purchase_amount_category"
).show(10)

# Write to Silver layer
silver_df.write.mode("overwrite").saveAsTable("retail.silver.customer_purchases_clean")

print(f"\nSuccessfully transformed and inserted {silver_df.count()} cleaned records into retail.silver.customer_purchases_clean")

Silver Layer Transformation Results:


+-----------+-------------+----------------+---------------+------------------+------------------------+
|customer_id|purchase_date|product_category|purchase_amount|data_quality_score|purchase_amount_category|
+-----------+-------------+----------------+---------------+------------------+------------------------+
| CUST000183|   2024-01-07|          Sports|          44.05|               100|                     Low|
| CUST000183|   2024-09-02|        Clothing|          19.22|               100|                     Low|
| CUST000183|   2024-07-06|        Clothing|          55.25|               100|                     Low|
| CUST000183|   2024-07-14|     Electronics|         664.13|               100|                    High|
| CUST000184|   2024-12-13|          Sports|          56.46|               100|                     Low|
| CUST000184|   2024-12-15|   Home & Garden|          52.62|               100|                     Low|
| CUST000184|   2024-05-03|        Clothing|         12


Successfully transformed and inserted 5475 cleaned records into retail.silver.customer_purchases_clean


## Gold Layer: Business Analytics and ML Features

### Gold Layer Purpose

The Gold layer provides:

- **Aggregated metrics** for business reporting
- **Customer analytics** with lifetime value calculations
- **ML-ready features** for predictive modeling

### Tables in Gold Layer

1. `customer_analytics`: Aggregated customer metrics
2. `sales_analytics`: Business performance metrics
3. `churn_prediction_features`: ML features for churn modeling
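To make the customer-level metrics concrete, here is a plain-Python sketch for one customer, using the five CUST000001 purchases that appear in this notebook's sample output. The segment thresholds (2000 and 500) are the ones used in the customer analytics cell. The function names are hypothetical, and measuring the average gap between the first and last purchase is an assumption about what `avg_days_between_purchases` is meant to capture.

```python
from datetime import date


def segment(total_spent):
    """Map total spend to a customer segment."""
    if total_spent >= 2000:
        return "High Value"
    if total_spent >= 500:
        return "Medium Value"
    return "Low Value"


def avg_days_between(purchase_dates):
    """Mean gap in days between consecutive purchases; None for fewer than two."""
    ordered = sorted(purchase_dates)
    if len(ordered) < 2:
        return None
    # The gaps sum to (last - first), and there are n - 1 of them
    return (ordered[-1] - ordered[0]).days / (len(ordered) - 1)


# Purchases of CUST000001 as shown in the sample output
purchases = [
    (date(2024, 1, 9), 132.31),
    (date(2024, 1, 12), 19.58),
    (date(2024, 1, 22), 145.23),
    (date(2024, 2, 2), 64.72),
    (date(2024, 10, 14), 26.21),
]
total_spent = round(sum(amount for _, amount in purchases), 2)
print(total_spent, segment(total_spent))
print(avg_days_between([d for d, _ in purchases]))
```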

In [None]:
# Create Gold layer tables for business analytics

# Customer analytics table
spark.sql("""
CREATE TABLE IF NOT EXISTS retail.gold.customer_analytics (
    customer_id STRING,
    total_purchases INT,
    total_spent DECIMAL(10,2),
    avg_purchase_value DECIMAL(10,2),
    purchase_variability DECIMAL(10,2),
    categories_purchased INT,
    stores_used INT,
    payment_methods_used INT,
    active_months INT,
    days_since_last_purchase INT,
    customer_tenure_days INT,
    avg_days_between_purchases DECIMAL(10,2),
    customer_segment STRING,
    lifetime_value DECIMAL(10,2),
    last_updated TIMESTAMP
)
USING DELTA
CLUSTER BY (customer_segment, customer_id)
""")

# Sales analytics table
spark.sql("""
CREATE TABLE IF NOT EXISTS retail.gold.sales_analytics (
    period STRING,
    total_transactions INT,
    total_revenue DECIMAL(10,2),
    avg_transaction_value DECIMAL(10,2),
    unique_customers INT,
    top_category STRING,
    top_store STRING,
    last_updated TIMESTAMP
)
USING DELTA
CLUSTER BY (period)
""")

print("Gold layer tables created successfully!")

Gold layer tables created successfully!


In [None]:
# Generate Gold layer customer analytics from Silver data

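# Note: this import shadows Python's built-in sum, max, min, and round for the rest of the session,
# so re-running the data generation cell afterwards would call the Spark versions instead.
from pyspark.sql.functions import count, sum, avg, stddev, countDistinct, max, min, datediff, current_date, round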

silver_df = spark.table("retail.silver.customer_purchases_clean")

# Calculate customer-level aggregations
customer_analytics = silver_df.groupBy("customer_id").agg(
    count("*").alias("total_purchases"),
    round(sum("purchase_amount"), 2).alias("total_spent"),
    round(avg("purchase_amount"), 2).alias("avg_purchase_value"),
    round(stddev("purchase_amount"), 2).alias("purchase_variability"),
    countDistinct("product_category").alias("categories_purchased"),
    countDistinct("store_id").alias("stores_used"),
    countDistinct("payment_method").alias("payment_methods_used"),
    countDistinct(expr("DATE_FORMAT(purchase_date, 'yyyy-MM')")).alias("active_months"),
    datediff(current_date(), max("purchase_date")).alias("days_since_last_purchase"),
    datediff(current_date(), min("purchase_date")).alias("customer_tenure_days"),
    round(avg("purchase_amount"), 2).alias("lifetime_value"),  # Simplified CLV placeholder: same value as avg_purchase_value
    current_timestamp().alias("last_updated")
).withColumn(
    "customer_segment",
    when(col("total_spent") >= 2000, "High Value")
    .when(col("total_spent") >= 500, "Medium Value")
    .otherwise("Low Value")
).withColumn(
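    "avg_days_between_purchases",
    # Days from first to last purchase (tenure minus recency) divided by the number of gaps
    when(col("total_purchases") > 1, 
         round((col("customer_tenure_days") - col("days_since_last_purchase")) / (col("total_purchases") - 1), 2)
    ).otherwise(col("customer_tenure_days"))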
)

# Write to Gold layer
customer_analytics.write.mode("overwrite").saveAsTable("retail.gold.customer_analytics")

print(f"Customer analytics generated for {customer_analytics.count()} customers")
customer_analytics.select(
    "customer_id", "total_purchases", "total_spent", 
    "customer_segment", "days_since_last_purchase"
).show(10)

Customer analytics generated for 1000 customers


+-----------+---------------+-----------+----------------+------------------------+
|customer_id|total_purchases|total_spent|customer_segment|days_since_last_purchase|
+-----------+---------------+-----------+----------------+------------------------+
| CUST000184|              3|     237.51|       Low Value|                     383|
| CUST000543|              4|     876.03|    Medium Value|                     503|
| CUST000492|              3|     215.61|       Low Value|                     528|
| CUST000217|              5|     272.42|       Low Value|                     427|
| CUST000338|              8|     481.32|       Low Value|                     455|
| CUST000435|              3|     867.26|    Medium Value|                     374|
| CUST000406|              3|    1546.72|    Medium Value|                     370|
| CUST000193|              3|      80.36|       Low Value|                     490|
| CUST000331|              6|     651.72|    Medium Value|                  

In [None]:
# Generate Gold layer sales analytics

# Monthly sales analytics
monthly_sales = silver_df.withColumn(
    "period", expr("DATE_FORMAT(purchase_date, 'yyyy-MM')")
).groupBy("period").agg(
    count("*").alias("total_transactions"),
    round(sum("purchase_amount"), 2).alias("total_revenue"),
    round(avg("purchase_amount"), 2).alias("avg_transaction_value"),
    countDistinct("customer_id").alias("unique_customers"),
    current_timestamp().alias("last_updated")
).orderBy("period")

# Add top category and store for each period (simplified)
category_sales = silver_df.withColumn(
    "period", expr("DATE_FORMAT(purchase_date, 'yyyy-MM')")
).groupBy("period", "product_category").agg(
    sum("purchase_amount").alias("category_revenue")
).orderBy("period", col("category_revenue").desc())

# Get top category per period
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
window_spec = Window.partitionBy("period").orderBy(col("category_revenue").desc())
top_categories = category_sales.withColumn(
    "rank", row_number().over(window_spec)
).filter("rank = 1").select("period", "product_category")

# Join with monthly sales
sales_analytics = monthly_sales.join(
    top_categories, "period", "left"
).withColumnRenamed("product_category", "top_category").withColumn(
    "top_store", lit("STORE_NYC_001")  # Simplified - would calculate actual top store
)

# Write to Gold layer
sales_analytics.write.mode("overwrite").saveAsTable("retail.gold.sales_analytics")

print("Sales analytics generated by period")
sales_analytics.show()

Sales analytics generated by period


+-------+------------------+-------------+---------------------+----------------+--------------------+------------+-------------+
| period|total_transactions|total_revenue|avg_transaction_value|unique_customers|        last_updated|top_category|    top_store|
+-------+------------------+-------------+---------------------+----------------+--------------------+------------+-------------+
|2024-09|               469|      82519.0|               175.95|             385|2026-01-02 20:30:...| Electronics|STORE_NYC_001|
|2024-02|               399|     67535.15|               169.26|             325|2026-01-02 20:30:...| Electronics|STORE_NYC_001|
|2024-08|               444|     68654.21|               154.63|             367|2026-01-02 20:30:...| Electronics|STORE_NYC_001|
|2024-06|               457|     80212.57|               175.52|             367|2026-01-02 20:30:...| Electronics|STORE_NYC_001|
|2024-12|               454|     86359.89|               190.22|             372|2026-01-0

## Step 7: Machine Learning - Customer Churn Prediction

### ML in Gold Layer

Using the Gold layer customer analytics to train a churn prediction model:

- **Target**: Predict customers at risk of churning
- **Features**: Customer behavior metrics from Gold layer
- **Model**: Random Forest Classifier
- **Business Impact**: Enable proactive retention campaigns
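One thing to watch with a recency-based churn label is the date it is measured from. The plain-Python sketch below applies the same three rules as this notebook's label (more than 60 days inactive, fewer than 4 purchases, or average purchase value below 50) but takes the reference date as a parameter. The function name, the `as_of` parameter, and the example customer are hypothetical.

```python
from datetime import date


def churn_label(last_purchase, total_purchases, avg_purchase_value,
                as_of, inactivity_days=60):
    """Return 1 if any churn rule fires, else 0, measured as of a given date."""
    days_inactive = (as_of - last_purchase).days
    at_risk = (
        days_inactive > inactivity_days
        or total_purchases < 4
        or avg_purchase_value < 50
    )
    return int(at_risk)


# Hypothetical customer: 6 purchases, average 120.00, last seen 2024-12-20
last_seen = date(2024, 12, 20)

# Measured at the end of the data window, the customer looks active
print(churn_label(last_seen, 6, 120.0, as_of=date(2024, 12, 31)))

# Measured from a run date a year later, the recency rule fires for everyone
print(churn_label(last_seen, 6, 120.0, as_of=date(2026, 1, 2)))
```

Because the sample purchases are all dated 2024, anchoring the label to the end of the data window rather than the run date is one way to keep both classes present.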

In [None]:
# Prepare ML features from Gold layer customer analytics

from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

# Read customer analytics from Gold layer
customer_analytics_df = spark.table("retail.gold.customer_analytics")

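# Create churn risk label (business logic)
# Caution: days_since_last_purchase is measured from current_date(). The sample purchases are all
# dated 2024, so when this runs much later every customer exceeds 60 days and gets churn_risk = 1.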
ml_features_df = customer_analytics_df.withColumn(
    "churn_risk",
    when(
        (col("days_since_last_purchase") > 60) | 
        (col("total_purchases") < 4) | 
        (col("avg_purchase_value") < 50),
        1
    ).otherwise(0)
)

print(f"ML dataset prepared with {ml_features_df.count()} customer records")
ml_features_df.groupBy("churn_risk").count().show()

ML dataset prepared with 1000 customer records


+----------+-----+
|churn_risk|count|
+----------+-----+
|         1| 1000|
+----------+-----+



In [None]:
# Feature engineering and model training

# Select features for the model
feature_cols = [
    "total_purchases", "total_spent", "avg_purchase_value", "purchase_variability", 
    "categories_purchased", "stores_used", "payment_methods_used", 
    "active_months", "days_since_last_purchase", "customer_tenure_days", 
    "avg_days_between_purchases"
]

# Handle missing values
ml_features_df = ml_features_df.fillna(30, subset=['avg_days_between_purchases'])
ml_features_df = ml_features_df.fillna(0, subset=['purchase_variability'])

# Assemble features
assembler = VectorAssembler(
    inputCols=feature_cols,
    outputCol="features"
)

# Scale features
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")

# Create and train the model
rf = RandomForestClassifier(
    labelCol="churn_risk", 
    featuresCol="scaled_features",
    numTrees=100,
    maxDepth=10
)

# Create pipeline
pipeline = Pipeline(stages=[assembler, scaler, rf])

# Split data
train_data, test_data = ml_features_df.randomSplit([0.8, 0.2], seed=42)

print(f"Training set: {train_data.count()} customers")
print(f"Test set: {test_data.count()} customers")

# Train the model
print("Training customer churn prediction model...")
model = pipeline.fit(train_data)

Training set: 838 customers


Test set: 162 customers
Training customer churn prediction model...


In [None]:
# Model evaluation and business insights

# Make predictions
predictions = model.transform(test_data)

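# Evaluate the model
# Note: the label is derived from three of the input features, and in this run the test set
# contains only churn_risk = 1, so AUC and the metrics below do not measure predictive skill.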
evaluator = BinaryClassificationEvaluator(labelCol="churn_risk", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)

print(f"Model AUC: {auc:.4f}")

# Show prediction results
predictions.select(
    "customer_id", "total_purchases", "total_spent", "churn_risk", 
    "prediction", "probability"
).show(10)

# Calculate confusion matrix
confusion_matrix = predictions.groupBy("churn_risk", "prediction").count()
confusion_matrix.show()

# Business impact analysis
churn_predictions = predictions.filter("prediction = 1")
customers_at_risk = churn_predictions.count()
total_test_customers = test_data.count()

print(f"\nBusiness Impact Analysis:")
print(f"Total test customers: {total_test_customers}")
print(f"Customers predicted to be at churn risk: {customers_at_risk}")
print(f"Percentage flagged for retention: {(customers_at_risk/total_test_customers)*100:.1f}%")

# Calculate potential revenue impact
avg_customer_value = test_data.agg(F.avg("total_spent")).collect()[0][0] or 0
potential_lost_revenue = customers_at_risk * avg_customer_value

print(f"Average customer lifetime value: ${avg_customer_value:,.2f}")
print(f"Potential revenue at risk: ${potential_lost_revenue:,.0f}")

# Retention program value
retention_success_rate = 0.35
avg_retention_cost = 25
saved_revenue = (customers_at_risk * retention_success_rate) * avg_customer_value
retention_roi = (saved_revenue - (customers_at_risk * avg_retention_cost)) / (customers_at_risk * avg_retention_cost) * 100 if customers_at_risk > 0 else 0

print(f"Estimated retention success rate: {retention_success_rate*100:.0f}%")
print(f"Potential revenue saved: ${saved_revenue:,.0f}")
print(f"Retention program ROI: {retention_roi:.1f}%")

# Model performance metrics
accuracy = predictions.filter("churn_risk = prediction").count() / predictions.count()
precision = predictions.filter("prediction = 1 AND churn_risk = 1").count() / predictions.filter("prediction = 1").count() if predictions.filter("prediction = 1").count() > 0 else 0
recall = predictions.filter("prediction = 1 AND churn_risk = 1").count() / predictions.filter("churn_risk = 1").count() if predictions.filter("churn_risk = 1").count() > 0 else 0

print(f"\nModel Performance:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"AUC: {auc:.4f}")

Model AUC: 1.0000


+-----------+---------------+-----------+----------+----------+-----------+
|customer_id|total_purchases|total_spent|churn_risk|prediction|probability|
+-----------+---------------+-----------+----------+----------+-----------+
| CUST000003|              5|     172.23|         1|       1.0|  [0.0,1.0]|
| CUST000007|              8|     364.11|         1|       1.0|  [0.0,1.0]|
| CUST000009|              8|     579.61|         1|       1.0|  [0.0,1.0]|
| CUST000014|              5|    1090.84|         1|       1.0|  [0.0,1.0]|
| CUST000020|              3|     258.81|         1|       1.0|  [0.0,1.0]|
| CUST000024|              8|     1555.7|         1|       1.0|  [0.0,1.0]|
| CUST000030|              7|    1542.91|         1|       1.0|  [0.0,1.0]|
| CUST000036|              8|     451.76|         1|       1.0|  [0.0,1.0]|
| CUST000046|              8|     349.46|         1|       1.0|  [0.0,1.0]|
| CUST000047|              5|     1306.6|         1|       1.0|  [0.0,1.0]|
+-----------+---------------+-----------+----------+----------+-----------+

+----------+----------+-----+
|churn_risk|prediction|count|
+----------+----------+-----+
|         1|       1.0|  162|
+----------+----------+-----+




Business Impact Analysis:
Total test customers: 162
Customers predicted to be at churn risk: 162
Percentage flagged for retention: 100.0%


Average customer lifetime value: $857.03
Potential revenue at risk: $138,838
Estimated retention success rate: 35%
Potential revenue saved: $48,593
Retention program ROI: 1099.8%



Model Performance:
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
AUC: 1.0000
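The retention figures above can be reproduced by hand. The sketch below repeats the arithmetic with the printed inputs (162 flagged customers, average value 857.03, 35% success rate, cost of 25 per customer). The function name `retention_roi` is hypothetical, and because the average is taken from the rounded printout, the result can differ from the notebook's figures in the last digit.

```python
def retention_roi(customers_at_risk, avg_customer_value,
                  success_rate=0.35, cost_per_customer=25):
    """Return (revenue saved, ROI in percent) for a retention campaign."""
    if customers_at_risk == 0:
        return 0.0, 0.0
    saved = customers_at_risk * success_rate * avg_customer_value
    cost = customers_at_risk * cost_per_customer
    return saved, (saved - cost) / cost * 100


saved, roi = retention_roi(162, 857.03)
print(f"Potential revenue saved: ${saved:,.0f}")
print(f"Retention program ROI: {roi:.1f}%")
```

The customer count cancels out of the ROI, which reduces to `(success_rate * avg_customer_value - cost_per_customer) / cost_per_customer`. The percentage therefore depends only on the assumed success rate, the average value, and the per-customer cost, not on how many customers the model flags.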


## Step 8: Demonstrate Medallion Architecture Benefits

### Query Performance Across Layers

Let's demonstrate how each layer serves different analytical needs with optimized queries.

In [None]:
# Demonstrate Bronze layer: Raw data inspection
print("=== Bronze Layer: Raw Data Inspection ===")
spark.sql("""
SELECT customer_id, purchase_date, product_category, purchase_amount, 
       ingestion_timestamp,
       CASE WHEN purchase_amount IS NULL THEN 'Missing Amount' 
            WHEN product_category IS NULL THEN 'Missing Category'
            ELSE 'Valid' END as data_quality_flag
FROM retail.bronze.customer_purchases_raw 
WHERE customer_id = 'CUST000001'
ORDER BY purchase_date
""").show()

# Demonstrate Silver layer: Cleaned and enriched data
print("\n=== Silver Layer: Cleaned and Enriched Data ===")
spark.sql("""
SELECT customer_id, purchase_date, product_category, purchase_amount,
       data_quality_score, purchase_amount_category,
       processing_timestamp
FROM retail.silver.customer_purchases_clean
WHERE customer_id = 'CUST000001'
ORDER BY purchase_date
""").show()

# Demonstrate Gold layer: Business analytics
print("\n=== Gold Layer: Customer Analytics ===")
spark.sql("""
SELECT customer_id, total_purchases, total_spent, customer_segment,
       days_since_last_purchase, lifetime_value
FROM retail.gold.customer_analytics
ORDER BY total_spent DESC
LIMIT 10
""").show()

print("\n=== Gold Layer: Sales Analytics ===")
spark.sql("""
SELECT period, total_transactions, total_revenue, unique_customers, top_category
FROM retail.gold.sales_analytics
ORDER BY period
""").show()

=== Bronze Layer: Raw Data Inspection ===


+-----------+-------------+----------------+---------------+--------------------+-----------------+
|customer_id|purchase_date|product_category|purchase_amount| ingestion_timestamp|data_quality_flag|
+-----------+-------------+----------------+---------------+--------------------+-----------------+
| CUST000001|   2024-01-09|        Clothing|         132.31|2026-01-02 20:28:...|            Valid|
| CUST000001|   2024-01-12|           Books|          19.58|2026-01-02 20:28:...|            Valid|
| CUST000001|   2024-01-22|        Clothing|         145.23|2026-01-02 20:28:...|            Valid|
| CUST000001|   2024-02-02|        Clothing|          64.72|2026-01-02 20:28:...|            Valid|
| CUST000001|   2024-10-14|          Sports|          26.21|2026-01-02 20:28:...|            Valid|
+-----------+-------------+----------------+---------------+--------------------+-----------------+


=== Silver Layer: Cleaned and Enriched Data ===


+-----------+-------------+----------------+---------------+------------------+------------------------+--------------------+
|customer_id|purchase_date|product_category|purchase_amount|data_quality_score|purchase_amount_category|processing_timestamp|
+-----------+-------------+----------------+---------------+------------------+------------------------+--------------------+
| CUST000001|   2024-01-09|        Clothing|         132.31|               100|                  Medium|2026-01-02 20:28:...|
| CUST000001|   2024-01-12|           Books|          19.58|               100|                     Low|2026-01-02 20:28:...|
| CUST000001|   2024-01-22|        Clothing|         145.23|               100|                  Medium|2026-01-02 20:28:...|
| CUST000001|   2024-02-02|        Clothing|          64.72|               100|                     Low|2026-01-02 20:28:...|
| CUST000001|   2024-10-14|          Sports|          26.21|               100|                     Low|2026-01-02 20:

+-----------+---------------+-----------+----------------+------------------------+--------------+
|customer_id|total_purchases|total_spent|customer_segment|days_since_last_purchase|lifetime_value|
+-----------+---------------+-----------+----------------+------------------------+--------------+
| CUST000747|              8|     4655.1|      High Value|                     406|        581.89|
| CUST000263|              8|    4593.14|      High Value|                     369|        574.14|
| CUST000280|              8|    4469.72|      High Value|                     410|        558.71|
| CUST000128|              7|    3894.54|      High Value|                     382|        556.36|
| CUST000808|              7|    3883.85|      High Value|                     373|        554.84|
| CUST000289|              7|    3581.88|      High Value|                     415|         511.7|
| CUST000992|              8|    3523.17|      High Value|                     537|         440.4|
| CUST0008

+-------+------------------+-------------+----------------+------------+
| period|total_transactions|total_revenue|unique_customers|top_category|
+-------+------------------+-------------+----------------+------------+
|2024-01|               487|     87135.12|             400| Electronics|
|2024-02|               399|     67535.15|             325| Electronics|
|2024-03|               456|     70601.22|             377| Electronics|
|2024-04|               440|     74152.53|             355| Electronics|
|2024-05|               522|     85645.91|             430| Electronics|
|2024-06|               457|     80212.57|             367| Electronics|
|2024-07|               425|     70450.22|             361| Electronics|
|2024-08|               444|     68654.21|             367| Electronics|
|2024-09|               469|      82519.0|             385| Electronics|
|2024-10|               452|     80140.87|             360| Electronics|
|2024-11|               470|     86633.08|         

## Key Takeaways: Medallion Architecture with Delta Liquid Clustering + ML

### What We Demonstrated

1. **Medallion Architecture**: Bronze (raw) → Silver (cleaned) → Gold (analytics/ML)
2. **Delta Liquid Clustering**: Tables in every layer declare `CLUSTER BY` columns matched to that layer's query patterns
3. **Data Quality Improvement**: Progressive cleansing and enrichment
4. **Business Value**: Analytics and ML for customer churn prediction

### Layer Benefits

- **Bronze**: Preserves raw data integrity, audit trail
- **Silver**: Clean, standardized data for consistent analysis
- **Gold**: Business-ready insights and ML features

### AIDP Advantages

- **Unified Platform**: Seamless data flow between layers
- **Performance**: Liquid clustering optimizes each layer
- **Governance**: Schema isolation and data quality controls
- **ML Integration**: Direct path from data to predictive models

### Business Impact

1. **Data Quality**: Systematic approach to clean and validate data
2. **Analytics Efficiency**: Faster insights with optimized structures
3. **ML Readiness**: Features engineered for predictive modeling
4. **Customer Retention**: Proactive churn prevention
5. **Revenue Protection**: Data-driven business decisions

### Best Practices

1. **Layer Progression**: Always maintain clear Bronze → Silver → Gold flow
2. **Data Quality**: Implement validation rules in Silver layer
3. **Clustering Strategy**: Choose columns based on query patterns per layer
4. **Schema Evolution**: Plan for changing business requirements
5. **Governance**: Maintain data lineage and quality metrics

### Next Steps

- Implement real-time data ingestion into Bronze layer
- Add more sophisticated data quality validations
- Extend ML models (recommendation, lifetime value prediction)
- Build automated pipelines for layer updates
- Integrate with business intelligence tools

This notebook demonstrates how Oracle AI Data Platform enables sophisticated data architectures while maintaining performance, governance, and analytical power.