# Retail Analytics: Delta Liquid Clustering Demo


## Overview


This notebook demonstrates the power of **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using a retail analytics use case. Liquid clustering automatically optimizes data layout for query performance without requiring manual partitioning or Z-Ordering.

### What is Liquid Clustering?

Liquid clustering groups rows with similar values in the clustering columns you define into the same files. Delta applies this layout during data ingestion and maintenance operations, providing:

- **Automatic optimization**: No manual tuning required
- **Improved query performance**: Faster queries on clustered columns
- **Reduced maintenance**: No need for manual repartitioning
- **Adaptive clustering**: Adjusts as data patterns change
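
To make the performance claim concrete, here is a minimal pure-Python sketch of data skipping with per-file min/max statistics. It is an illustration under simplifying assumptions, not how Delta is implemented: the "files" are plain lists, the helper names `build_files` and `files_to_scan` are hypothetical, and a simple sort on one column stands in for real multi-column clustering.

```python
import random

def build_files(rows, rows_per_file):
    """Split rows into fixed-size 'files' and record min/max customer_id per file."""
    files = []
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        files.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return files

def files_to_scan(files, key):
    """Count files whose [min, max] range could contain the key."""
    return sum(1 for f in files if f["min"] <= key <= f["max"])

rng = random.Random(0)  # fixed seed so the illustration is repeatable

# 1,000 customers with 5 rows each (a simplification of the notebook's data)
customer_ids = [f"CUST{n:06d}" for n in range(1, 1001) for _ in range(5)]

unclustered = customer_ids[:]
rng.shuffle(unclustered)         # rows land in files in arbitrary order
clustered = sorted(customer_ids) # rows with the same customer_id sit together

unclustered_files = build_files(unclustered, rows_per_file=500)
clustered_files = build_files(clustered, rows_per_file=500)

target = "CUST000500"
print("Files scanned without clustering:", files_to_scan(unclustered_files, target))
print("Files scanned with clustering:   ", files_to_scan(clustered_files, target))
```

In the clustered layout only one file's min/max range contains the target customer, so a reader could skip the rest. In the shuffled layout almost every file's range spans the target, so little can be skipped.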

### Use Case: Customer Purchase Analytics

We'll analyze customer purchase records from a retail company. Our clustering strategy will optimize for:

- **Customer-specific queries**: Fast lookups by customer ID
- **Time-based analysis**: Efficient filtering by purchase date
- **Purchase patterns**: Quick aggregation by product category and customer segments

## Step 1: AIDP Environment Setup

This notebook leverages the existing Spark session in your AIDP environment.

In [None]:
# Create retail catalog and analytics schema

# In AIDP, catalogs provide data isolation and governance

spark.sql("CREATE CATALOG IF NOT EXISTS retail")

spark.sql("CREATE SCHEMA IF NOT EXISTS retail.analytics")

print("Retail catalog and analytics schema created successfully!")

Retail catalog and analytics schema created successfully!


## Step 2: Create Delta Table with Liquid Clustering

### Table Design

Our `customer_purchases` table will store:

- **customer_id**: Unique customer identifier
- **purchase_date**: Date of purchase
- **product_id**: Product identifier
- **product_category**: Category (Electronics, Clothing, Home, etc.)
- **purchase_amount**: Transaction amount
- **store_id**: Store location identifier
- **payment_method**: Payment type (Credit, Debit, Cash, etc.)

### Clustering Strategy

We'll cluster by `customer_id` and `purchase_date` because:

- **customer_id**: Customers often make multiple purchases, so clustering on this column keeps each customer's transaction history together
- **purchase_date**: Time-based queries are common for sales analysis, seasonality, and trends
- This combination optimizes for both customer behavior analysis and temporal sales reporting

In [None]:
# Create Delta table with liquid clustering

# CLUSTER BY defines the columns for automatic optimization

spark.sql("""

CREATE TABLE IF NOT EXISTS retail.analytics.customer_purchases (

    customer_id STRING,

    purchase_date DATE,

    product_id STRING,

    product_category STRING,

    purchase_amount DECIMAL(10,2),

    store_id STRING,

    payment_method STRING

)

USING DELTA

CLUSTER BY (customer_id, purchase_date)

""")

print("Delta table with liquid clustering created successfully!")

print("Clustering will automatically optimize data layout for queries on customer_id and purchase_date.")

Delta table with liquid clustering created successfully!
Clustering will automatically optimize data layout for queries on customer_id and purchase_date.


## Step 3: Generate Retail Sample Data

### Data Generation Strategy

We'll create realistic retail purchase data including:

- **1,000 customers** with multiple purchases over time
- **Product categories**: Electronics, Clothing, Home & Garden, Books, Sports
- **Realistic temporal patterns**: Seasonal shopping, repeat purchases, varying amounts
- **Multiple stores**: Different retail locations across regions

### Why This Data Pattern?

This data simulates real retail scenarios where:

- Customers make multiple purchases over time
- Seasonal trends affect buying patterns
- Product categories drive different analytics needs
- Store-level performance analysis is required
- Customer segmentation enables personalized marketing
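
Because the generator in this step relies on `random`, each run produces different records and counts. A minimal sketch of one way to make the data reproducible follows, assuming a private `random.Random` instance with a fixed seed. The function name `generate_purchases` and the seed value are hypothetical, and only two of the notebook's fields are generated to keep it short.

```python
import random
from datetime import date, timedelta

def generate_purchases(num_customers, seed):
    """Generate (customer_id, purchase_date) pairs deterministically from a seed."""
    rng = random.Random(seed)  # private generator; the global random state is untouched
    base_date = date(2024, 1, 1)
    records = []
    for customer_num in range(1, num_customers + 1):
        customer_id = f"CUST{customer_num:06d}"
        for _ in range(rng.randint(3, 8)):
            # 2024 is a leap year, so offsets 0..365 cover Jan 1 through Dec 31
            purchase_date = base_date + timedelta(days=rng.randint(0, 365))
            records.append((customer_id, purchase_date))
    return records

first = generate_purchases(10, seed=42)
second = generate_purchases(10, seed=42)
print("Same seed gives identical data:", first == second)
```

Passing the generator explicitly, rather than calling `random.seed`, keeps the reproducibility local to this function.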

In [None]:
# Generate sample retail purchase data

# Standard-library imports only; no Spark functions are needed to build the records

import random

from datetime import datetime, timedelta


# Define retail data constants

PRODUCTS = {

    "Electronics": [

        ("ELE001", "Smartphone", 599.99),

        ("ELE002", "Laptop", 1299.99),

        ("ELE003", "Headphones", 149.99),

        ("ELE004", "Smart TV", 799.99),

        ("ELE005", "Tablet", 399.99)

    ],

    "Clothing": [

        ("CLO001", "T-Shirt", 19.99),

        ("CLO002", "Jeans", 79.99),

        ("CLO003", "Jacket", 129.99),

        ("CLO004", "Sneakers", 89.99),

        ("CLO005", "Dress", 59.99)

    ],

    "Home & Garden": [

        ("HOM001", "Blender", 79.99),

        ("HOM002", "Coffee Maker", 49.99),

        ("HOM003", "Garden Tools Set", 39.99),

        ("HOM004", "Bedding Set", 89.99),

        ("HOM005", "Decorative Pillow", 24.99)

    ],

    "Books": [

        ("BOK001", "Fiction Novel", 14.99),

        ("BOK002", "Cookbook", 24.99),

        ("BOK003", "Biography", 19.99),

        ("BOK004", "Self-Help Book", 16.99),

        ("BOK005", "Children's Book", 9.99)

    ],

    "Sports": [

        ("SPO001", "Yoga Mat", 29.99),

        ("SPO002", "Dumbbells", 49.99),

        ("SPO003", "Running Shoes", 119.99),

        ("SPO004", "Basketball", 24.99),

        ("SPO005", "Tennis Racket", 89.99)

    ]

}



STORES = ["STORE_NYC_001", "STORE_LAX_002", "STORE_CHI_003", "STORE_HOU_004", "STORE_MIA_005"]

PAYMENT_METHODS = ["Credit Card", "Debit Card", "Cash", "Digital Wallet", "Buy Now Pay Later"]


# Generate customer purchase records

purchase_data = []

base_date = datetime(2024, 1, 1)


# Create 1,000 customers with 3-8 purchases each

for customer_num in range(1, 1001):

    customer_id = f"CUST{customer_num:06d}"
    
    # Each customer gets 3-8 purchases over 12 months

    num_purchases = random.randint(3, 8)
    
    for i in range(num_purchases):

        # Spread purchases over 12 months

        days_offset = random.randint(0, 365)

        purchase_date = base_date + timedelta(days=days_offset)
        
        # Select random category and product

        category = random.choice(list(PRODUCTS.keys()))

        product_id, product_name, base_price = random.choice(PRODUCTS[category])
        
        # Add some price variation (±20%)

        price_variation = random.uniform(0.8, 1.2)

        purchase_amount = round(base_price * price_variation, 2)
        
        # Select random store and payment method

        store_id = random.choice(STORES)

        payment_method = random.choice(PAYMENT_METHODS)
        
        purchase_data.append({

            "customer_id": customer_id,

            "purchase_date": purchase_date.date(),

            "product_id": product_id,

            "product_category": category,

            "purchase_amount": purchase_amount,

            "store_id": store_id,

            "payment_method": payment_method

        })



print(f"Generated {len(purchase_data)} customer purchase records")

print("Sample record:", purchase_data[0])

Generated 5544 customer purchase records
Sample record: {'customer_id': 'CUST000001', 'purchase_date': datetime.date(2024, 9, 19), 'product_id': 'BOK003', 'product_category': 'Books', 'purchase_amount': 22.1, 'store_id': 'STORE_CHI_003', 'payment_method': 'Debit Card'}


## Step 4: Insert Data Using PySpark

### Data Insertion Strategy

We'll use PySpark to:

1. **Create DataFrame** from our generated data
2. **Insert into Delta table** with liquid clustering
3. **Verify the insertion** with a sample query

### Why PySpark for Insertion?

- **Distributed processing**: Handles large datasets efficiently
- **Type safety**: Ensures data integrity
- **Optimization**: Leverages Spark's query optimization
- **Liquid clustering**: Automatically applies clustering during insertion
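
On the type-safety point: the generated `purchase_amount` values are Python floats, which Spark infers as `double` (as the schema printed in this step shows), while the table declares `DECIMAL(10,2)`. The sketch below shows one way to carry exact cents in Python before building the DataFrame. The helper name `to_cents` is hypothetical, and Spark would likely still need an explicit schema (for example a `DecimalType(10, 2)` field) to match the declared precision. How AIDP handles the mismatch on write is not something this notebook verifies.

```python
from decimal import Decimal, ROUND_HALF_UP

def to_cents(amount):
    """Convert a float amount to an exact two-decimal Decimal."""
    # Going through str() avoids carrying binary float noise into the Decimal
    return Decimal(str(amount)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Sample values taken from the first generated record shown earlier
sample = {"customer_id": "CUST000001", "purchase_amount": 22.1}
sample["purchase_amount"] = to_cents(sample["purchase_amount"])
print(sample)
```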

In [None]:
# Insert data using PySpark DataFrame operations

# No Spark function imports are needed for this cell


# Create DataFrame from generated data

df_purchases = spark.createDataFrame(purchase_data)


# Display schema and sample data

print("DataFrame Schema:")

df_purchases.printSchema()



print("\nSample Data:")

df_purchases.show(5)


# Insert data into Delta table with liquid clustering

# The table's CLUSTER BY (customer_id, purchase_date) definition governs the data layout;
# depending on the runtime, small writes may only be fully clustered once OPTIMIZE runs

df_purchases.write.mode("overwrite").saveAsTable("retail.analytics.customer_purchases")


print(f"\nSuccessfully inserted {df_purchases.count()} records into retail.analytics.customer_purchases")

print("Liquid clustering automatically optimized the data layout during insertion!")

DataFrame Schema:
root
 |-- customer_id: string (nullable = true)
 |-- payment_method: string (nullable = true)
 |-- product_category: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- purchase_amount: double (nullable = true)
 |-- purchase_date: date (nullable = true)
 |-- store_id: string (nullable = true)


Sample Data:


+-----------+-----------------+----------------+----------+---------------+-------------+-------------+
|customer_id|   payment_method|product_category|product_id|purchase_amount|purchase_date|     store_id|
+-----------+-----------------+----------------+----------+---------------+-------------+-------------+
| CUST000001|       Debit Card|           Books|    BOK003|           22.1|   2024-09-19|STORE_CHI_003|
| CUST000001|      Credit Card|          Sports|    SPO004|          23.78|   2024-10-29|STORE_CHI_003|
| CUST000001|Buy Now Pay Later|          Sports|    SPO004|           20.7|   2024-03-20|STORE_LAX_002|
| CUST000001|             Cash|     Electronics|    ELE003|         153.44|   2024-11-07|STORE_HOU_004|
| CUST000001|             Cash|   Home & Garden|    HOM005|          21.11|   2024-05-11|STORE_HOU_004|
+-----------+-----------------+----------------+----------+---------------+-------------+-------------+
only showing top 5 rows




Successfully inserted 5544 records into retail.analytics.customer_purchases
Liquid clustering automatically optimized the data layout during insertion!


## Step 5: Demonstrate Liquid Clustering Benefits

### Query Performance Analysis

Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:

1. **Customer purchase history** (clustered by customer_id)
2. **Time-based sales analysis** (clustered by purchase_date)
3. **Combined customer + time queries** (optimal for our clustering)

### Expected Performance Benefits

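With liquid clustering, queries that filter on the clustering columns can skip files that cannot contain matching rows. On a sample table of a few thousand rows the difference is unlikely to be noticeable, but on larger tables the gains come from:

- **Data locality**: Related records are physically grouped together
- **Reduced I/O**: Fewer files need to be read from storage
- **Automatic optimization**: No manual tuning required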
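
The notebook does not time its queries, so any speedup is an expectation rather than a measurement. A minimal sketch of a timing helper that could wrap each query is shown here. The name `time_call` is hypothetical, and the workload is a stand-in so the block runs without a Spark session.

```python
import time

def time_call(label, fn, repeats=3):
    """Run fn several times and return (last_result, best_elapsed_seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        best = min(best, time.perf_counter() - start)
    print(f"{label}: best of {repeats} runs = {best:.4f}s")
    return result, best

# Stand-in workload; in the notebook this could be something like
# lambda: spark.sql("SELECT ... WHERE customer_id = 'CUST000001'").count()
result, elapsed = time_call("demo workload", lambda: sum(range(100_000)))
```

Taking the best of several runs reduces noise from caching and warm-up. A fair comparison would also need a table large enough for file skipping to matter.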

In [None]:
# Demonstrate liquid clustering benefits with optimized queries


# Query 1: Customer purchase history - benefits from customer_id clustering

print("=== Query 1: Customer Purchase History ===")

customer_history = spark.sql("""

SELECT customer_id, purchase_date, product_category, purchase_amount, store_id

FROM retail.analytics.customer_purchases

WHERE customer_id = 'CUST000001'

ORDER BY purchase_date

""")



customer_history.show()

print(f"Records found: {customer_history.count()}")



# Query 2: Time-based sales analysis - benefits from purchase_date clustering

print("\n=== Query 2: High-Value Purchases This Month ===")

high_value_recent = spark.sql("""

SELECT purchase_date, customer_id, product_category, purchase_amount, payment_method

FROM retail.analytics.customer_purchases

WHERE purchase_date >= '2024-06-01' AND purchase_amount > 500

ORDER BY purchase_date DESC, purchase_amount DESC

""")



high_value_recent.show()

print(f"High-value purchases found: {high_value_recent.count()}")



# Query 3: Combined customer + time query - optimal for our clustering strategy

print("\n=== Query 3: Customer Spending Trends ===")

customer_trends = spark.sql("""

SELECT customer_id, purchase_date, product_category, purchase_amount

FROM retail.analytics.customer_purchases

WHERE customer_id LIKE 'CUST0001%' AND purchase_date >= '2024-04-01'

ORDER BY customer_id, purchase_date

""")



customer_trends.show()

print(f"Trend records found: {customer_trends.count()}")

=== Query 1: Customer Purchase History ===


+-----------+-------------+----------------+---------------+-------------+
|customer_id|purchase_date|product_category|purchase_amount|     store_id|
+-----------+-------------+----------------+---------------+-------------+
| CUST000001|   2024-03-20|          Sports|           20.7|STORE_LAX_002|
| CUST000001|   2024-05-11|   Home & Garden|          21.11|STORE_HOU_004|
| CUST000001|   2024-09-19|           Books|           22.1|STORE_CHI_003|
| CUST000001|   2024-10-29|          Sports|          23.78|STORE_CHI_003|
| CUST000001|   2024-11-07|     Electronics|         153.44|STORE_HOU_004|
+-----------+-------------+----------------+---------------+-------------+



Records found: 5

=== Query 2: High-Value Purchases This Month ===


+-------------+-----------+----------------+---------------+-----------------+
|purchase_date|customer_id|product_category|purchase_amount|   payment_method|
+-------------+-----------+----------------+---------------+-----------------+
|   2024-12-31| CUST000360|     Electronics|        1539.12|       Debit Card|
|   2024-12-31| CUST000133|     Electronics|         941.32|   Digital Wallet|
|   2024-12-31| CUST000989|     Electronics|         708.76|Buy Now Pay Later|
|   2024-12-31| CUST000279|     Electronics|         691.22|   Digital Wallet|
|   2024-12-31| CUST000047|     Electronics|         561.09|Buy Now Pay Later|
|   2024-12-30| CUST000366|     Electronics|        1413.89|             Cash|
|   2024-12-30| CUST000560|     Electronics|          900.6|             Cash|
|   2024-12-29| CUST000006|     Electronics|         896.14|       Debit Card|
|   2024-12-27| CUST000861|     Electronics|         546.64|             Cash|
|   2024-12-26| CUST000858|     Electronics|        

High-value purchases found: 385

=== Query 3: Customer Spending Trends ===


+-----------+-------------+----------------+---------------+
|customer_id|purchase_date|product_category|purchase_amount|
+-----------+-------------+----------------+---------------+
| CUST000100|   2024-05-05|     Electronics|        1507.71|
| CUST000100|   2024-05-15|   Home & Garden|          90.15|
| CUST000100|   2024-06-03|          Sports|         125.47|
| CUST000100|   2024-10-27|          Sports|          24.08|
| CUST000100|   2024-11-14|        Clothing|          85.26|
| CUST000100|   2024-12-02|           Books|          10.94|
| CUST000101|   2024-06-15|          Sports|          28.67|
| CUST000101|   2024-08-02|        Clothing|          85.37|
| CUST000101|   2024-08-10|          Sports|         128.44|
| CUST000101|   2024-09-03|          Sports|          79.51|
| CUST000102|   2024-05-28|          Sports|          24.17|
| CUST000102|   2024-06-17|           Books|          22.35|
| CUST000102|   2024-07-16|        Clothing|          80.71|
| CUST000102|   2024-09-

Trend records found: 400


## Step 6: Analyze Clustering Effectiveness

### Understanding the Impact

Let's examine how liquid clustering has organized our data and analyze some aggregate statistics to demonstrate the retail insights possible with this optimized structure.

### Key Analytics

- **Sales by category** and performance trends
- **Customer segmentation** by spending patterns
- **Store performance** analysis
- **Payment method preferences** and seasonal trends
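
The customer segmentation in this step buckets each customer by total spend, with thresholds of 2,000 and 500 taken from the SQL `CASE` expression. As a plain-Python sketch of the same rule, the block below uses made-up totals. The names `segment` and `summarize` and the customer IDs are hypothetical.

```python
from collections import defaultdict

def segment(total_spent):
    """Mirror the SQL CASE thresholds: >= 2000 high, >= 500 medium, else low."""
    if total_spent >= 2000:
        return "High Value"
    if total_spent >= 500:
        return "Medium Value"
    return "Low Value"

def summarize(totals_by_customer):
    """Return {segment: (customer_count, segment_revenue)}."""
    counts = defaultdict(int)
    revenue = defaultdict(float)
    for total in totals_by_customer.values():
        name = segment(total)
        counts[name] += 1
        revenue[name] += total
    return {name: (counts[name], round(revenue[name], 2)) for name in counts}

# Hypothetical per-customer totals, made up for illustration
totals = {"CUST_A": 2500.00, "CUST_B": 500.00, "CUST_C": 499.99, "CUST_D": 1200.50}
summary = summarize(totals)
print(summary)
```

Both boundaries are inclusive on the lower end, so a customer who spent exactly 500 is "Medium Value" and one who spent exactly 2,000 is "High Value".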

In [None]:
# Analyze clustering effectiveness and retail insights


# Sales by category analysis

print("=== Sales by Category Analysis ===")

category_sales = spark.sql("""

SELECT product_category, COUNT(*) as total_purchases,

       ROUND(SUM(purchase_amount), 2) as total_revenue,

       ROUND(AVG(purchase_amount), 2) as avg_purchase,

       ROUND(SUM(purchase_amount) * 100.0 / SUM(SUM(purchase_amount)) OVER (), 2) as revenue_percentage

FROM retail.analytics.customer_purchases

GROUP BY product_category

ORDER BY total_revenue DESC

""")



category_sales.show()



# Customer segmentation by spending

print("\n=== Customer Segmentation Analysis ===")

customer_segments = spark.sql("""

SELECT 

    CASE 

        WHEN total_spent >= 2000 THEN 'High Value'

        WHEN total_spent >= 500 THEN 'Medium Value'

        ELSE 'Low Value'

    END as customer_segment,

    COUNT(*) as customer_count,

    ROUND(AVG(total_spent), 2) as avg_total_spent,

    ROUND(SUM(total_spent), 2) as segment_revenue

FROM (

    SELECT customer_id, SUM(purchase_amount) as total_spent

    FROM retail.analytics.customer_purchases

    GROUP BY customer_id

) customer_totals

GROUP BY 

    CASE 

        WHEN total_spent >= 2000 THEN 'High Value'

        WHEN total_spent >= 500 THEN 'Medium Value'

        ELSE 'Low Value'

    END

ORDER BY segment_revenue DESC

""")



customer_segments.show()



# Store performance analysis

print("\n=== Store Performance Analysis ===")

store_performance = spark.sql("""

SELECT store_id, COUNT(*) as total_transactions,

       COUNT(DISTINCT customer_id) as unique_customers,

       ROUND(SUM(purchase_amount), 2) as total_revenue,

       ROUND(AVG(purchase_amount), 2) as avg_transaction_value

FROM retail.analytics.customer_purchases

GROUP BY store_id

ORDER BY total_revenue DESC

""")



store_performance.show()



# Monthly sales trends

print("\n=== Monthly Sales Trends ===")

monthly_trends = spark.sql("""

SELECT DATE_FORMAT(purchase_date, 'yyyy-MM') as month,

       COUNT(*) as transactions,

       ROUND(SUM(purchase_amount), 2) as monthly_revenue,

       COUNT(DISTINCT customer_id) as active_customers

FROM retail.analytics.customer_purchases

GROUP BY DATE_FORMAT(purchase_date, 'yyyy-MM')

ORDER BY month

""")



monthly_trends.show()

=== Sales by Category Analysis ===


+----------------+---------------+-------------+------------+------------------+
|product_category|total_purchases|total_revenue|avg_purchase|revenue_percentage|
+----------------+---------------+-------------+------------+------------------+
|     Electronics|           1069|    700376.34|      655.17|             74.54|
|        Clothing|           1134|     85054.48|        75.0|              9.05|
|          Sports|           1104|     69841.08|       63.26|              7.43|
|   Home & Garden|           1116|     64960.34|       58.21|              6.91|
|           Books|           1121|     19371.41|       17.28|              2.06|
+----------------+---------------+-------------+------------+------------------+


=== Customer Segmentation Analysis ===


+----------------+--------------+---------------+---------------+
|customer_segment|customer_count|avg_total_spent|segment_revenue|
+----------------+--------------+---------------+---------------+
|    Medium Value|           511|        1133.51|      579225.07|
|      High Value|            94|        2668.46|      250835.62|
|       Low Value|           395|         277.32|      109542.96|
+----------------+--------------+---------------+---------------+


=== Store Performance Analysis ===


+-------------+------------------+----------------+-------------+---------------------+
|     store_id|total_transactions|unique_customers|total_revenue|avg_transaction_value|
+-------------+------------------+----------------+-------------+---------------------+
|STORE_MIA_005|              1144|             691|    204945.24|               179.15|
|STORE_LAX_002|              1180|             710|    195725.56|               165.87|
|STORE_HOU_004|              1042|             654|     181276.2|               173.97|
|STORE_CHI_003|              1106|             698|     180939.6|                163.6|
|STORE_NYC_001|              1072|             680|    176717.05|               164.85|
+-------------+------------------+----------------+-------------+---------------------+


=== Monthly Sales Trends ===


+-------+------------+---------------+----------------+
|  month|transactions|monthly_revenue|active_customers|
+-------+------------+---------------+----------------+
|2024-01|         485|       77417.29|             389|
|2024-02|         422|       67018.43|             350|
|2024-03|         447|       74457.04|             375|
|2024-04|         458|       82553.07|             380|
|2024-05|         469|       79372.22|             369|
|2024-06|         477|       91938.76|             384|
|2024-07|         466|       75765.05|             382|
|2024-08|         477|       71764.42|             392|
|2024-09|         473|       86854.52|             377|
|2024-10|         442|       82179.17|             358|
|2024-11|         457|       71592.74|             373|
|2024-12|         471|       78690.94|             378|
+-------+------------+---------------+----------------+



## Key Takeaways: Delta Liquid Clustering in AIDP

### What We Demonstrated

1. **Automatic Optimization**: Created a table with `CLUSTER BY (customer_id, purchase_date)` and let Delta automatically optimize data layout

2. **Performance Benefits**: Queries that filter on the clustered columns (customer_id, purchase_date) can benefit from data locality and file skipping, although this notebook does not measure query times

3. **Zero Maintenance**: No manual partitioning, bucketing, or Z-Ordering required - Delta handles it automatically

4. **Real-World Use Case**: Retail analytics where customer behavior analysis and sales reporting are critical

### AIDP Advantages

- **Unified Analytics**: Seamlessly integrates with other AIDP services
- **Governance**: Catalog and schema isolation for retail data
- **Performance**: Optimized for both OLAP and OLTP workloads
- **Scalability**: Handles retail-scale data volumes effortlessly

### Best Practices for Liquid Clustering

1. **Choose clustering columns** based on your most common query patterns
2. **Use 1-4 columns** - Delta allows at most four clustering columns, and adding more can reduce effectiveness
3. **Consider cardinality** - high-cardinality columns work best
4. **Monitor and adjust** as query patterns evolve
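
As a sketch of the "monitor and adjust" practice, the helper below builds an `ALTER TABLE ... CLUSTER BY` statement after checking the column count. It assumes the Delta version in your AIDP environment accepts this syntax, which this notebook does not verify. The function name `cluster_by_statement` is hypothetical, and the block only builds the SQL string so it runs without a Spark session.

```python
MAX_CLUSTER_COLUMNS = 4  # assumed limit on liquid clustering columns

def cluster_by_statement(table, columns):
    """Build an ALTER TABLE ... CLUSTER BY statement after basic validation."""
    if not 1 <= len(columns) <= MAX_CLUSTER_COLUMNS:
        raise ValueError(f"expected 1-{MAX_CLUSTER_COLUMNS} columns, got {len(columns)}")
    if len(set(columns)) != len(columns):
        raise ValueError("clustering columns must be unique")
    return f"ALTER TABLE {table} CLUSTER BY ({', '.join(columns)})"

stmt = cluster_by_statement(
    "retail.analytics.customer_purchases",
    ["customer_id", "purchase_date"],
)
print(stmt)

# In the notebook this could then be run as (assumption, not executed here):
#   spark.sql(stmt)
#   spark.sql("OPTIMIZE retail.analytics.customer_purchases")
```

Changing the clustering columns typically affects only newly written data until a maintenance operation such as `OPTIMIZE` rewrites existing files.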

### Next Steps

- Explore other AIDP features like AI/ML integration
- Try liquid clustering with different column combinations
- Scale up to larger retail datasets
- Integrate with real POS systems and e-commerce platforms

This notebook demonstrates how Oracle AI Data Platform makes advanced retail analytics accessible while maintaining enterprise-grade performance and governance.