# Telecommunications: Delta Liquid Clustering Demo


## Overview


This notebook demonstrates **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using a telecommunications analytics use case. Liquid clustering organizes the data layout around the columns you query most, without requiring manual partitioning or Z-Ordering.

### What is Liquid Clustering?

Liquid clustering groups rows with similar values in the clustering columns you define into the same data files. The layout is maintained during data ingestion and maintenance operations, providing:

- **Automatic optimization**: No manual tuning required
- **Improved query performance**: Faster queries on clustered columns
- **Reduced maintenance**: No need for manual repartitioning
- **Adaptive clustering**: Adjusts as data patterns change
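
To make the query-performance point concrete, here is a toy pure-Python sketch (not Delta code). It assumes each "file" exposes the minimum and maximum key it contains, and it uses a plain sort as a simplified stand-in for clustering; the helper names `chunk` and `files_to_scan` are illustrative only.

```python
import random


def chunk(values, size):
    """Split a list into consecutive 'files' of at most `size` values."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def files_to_scan(files, key):
    """Count files whose [min, max] range could contain `key`."""
    return sum(1 for f in files if min(f) <= key <= max(f))


rng = random.Random(0)
ids = [rng.randrange(1000) for _ in range(10_000)]  # toy subscriber IDs

unclustered = chunk(ids, 500)        # arrival order: each file spans most IDs
clustered = chunk(sorted(ids), 500)  # grouped by ID: each file covers a narrow range

lookup = 421
print("unclustered files scanned:", files_to_scan(unclustered, lookup))
print("clustered files scanned:  ", files_to_scan(clustered, lookup))
```

In the unclustered layout almost every file's range contains the lookup key, so nothing can be skipped; after grouping by key, only the one or two files that cover that key need to be read.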

### Use Case: Network Performance Monitoring and Customer Experience Analytics

We'll analyze telecommunications network performance and customer usage data. Our clustering strategy will optimize for:

- **Customer-specific queries**: Fast lookups by subscriber ID
- **Time-based analysis**: Efficient filtering by call/service date
- **Network performance patterns**: Quick aggregation by cell tower and service quality metrics

## Step 1: AIDP Environment Setup

This notebook leverages the existing Spark session in your AIDP environment.

In [None]:
# Create telecommunications catalog and analytics schema

# In AIDP, catalogs provide data isolation and governance

spark.sql("CREATE CATALOG IF NOT EXISTS telecom")

spark.sql("CREATE SCHEMA IF NOT EXISTS telecom.analytics")

print("Telecommunications catalog and analytics schema created successfully!")

Telecommunications catalog and analytics schema created successfully!


## Step 2: Create Delta Table with Liquid Clustering

### Table Design

Our `network_usage` table will store:

- **subscriber_id**: Unique customer identifier
- **usage_date**: Date and time of service usage
- **service_type**: Type (Voice, Data, SMS, Streaming)
- **data_volume**: Data consumed (GB)
- **call_duration**: Call length (minutes)
- **cell_tower_id**: Network cell tower identifier
- **signal_quality**: Network signal strength (0-100)

### Clustering Strategy

We'll cluster by `subscriber_id` and `usage_date` because:

- **subscriber_id**: Each customer generates many usage events, so keeping them together speeds up per-subscriber lookups
- **usage_date**: Time-based queries are critical for billing cycles, network planning, and customer behavior analysis
- Together, these columns serve both customer analytics and time-based network performance monitoring

In [None]:
# Create Delta table with liquid clustering

# CLUSTER BY defines the columns for automatic optimization

spark.sql("""

CREATE TABLE IF NOT EXISTS telecom.analytics.network_usage (

    subscriber_id STRING,

    usage_date TIMESTAMP,

    service_type STRING,

    data_volume DECIMAL(10,3),

    call_duration DECIMAL(8,2),

    cell_tower_id STRING,

    signal_quality INT

)

USING DELTA

CLUSTER BY (subscriber_id, usage_date)

""")

print("Delta table with liquid clustering created successfully!")

print("Clustering will automatically optimize data layout for queries on subscriber_id and usage_date.")

Delta table with liquid clustering created successfully!
Clustering will automatically optimize data layout for queries on subscriber_id and usage_date.


## Step 3: Generate Telecommunications Sample Data

### Data Generation Strategy

We'll create realistic telecommunications usage data including:

- **10,000 subscribers**, each with 20-100 usage events spread across 2024
- **Service types**: Voice calls, Data usage, SMS, Video streaming (chosen uniformly at random)
- **Time-of-day pattern**: Events are weighted toward business hours and evenings
- **Network infrastructure**: Seven cell towers, with a randomly drawn signal quality for each event

### Why This Data Pattern?

This data simulates real telecommunications scenarios where:

- Customer usage varies by time of day and service type
- Network performance impacts customer experience
- Billing and service quality require temporal analysis
- Capacity planning depends on usage patterns
- Fraud detection needs real-time monitoring

In [None]:
# Generate sample telecommunications usage data

import random
from datetime import datetime, timedelta

# Define telecommunications data constants
SERVICE_TYPES = ['Voice', 'Data', 'SMS', 'Streaming']
CELL_TOWERS = ['TOWER_NYC_001', 'TOWER_LAX_002', 'TOWER_CHI_003', 'TOWER_HOU_004', 'TOWER_MIA_005', 'TOWER_SFO_006', 'TOWER_SEA_007']

# Base usage parameters by service type
# (only avg_duration and data_volume are read below; frequency is currently unused)
USAGE_PARAMS = {
    'Voice': {'avg_duration': 5.0, 'frequency': 8, 'data_volume': 0.0},
    'Data': {'avg_duration': 0.0, 'frequency': 15, 'data_volume': 0.5},
    'SMS': {'avg_duration': 0.0, 'frequency': 12, 'data_volume': 0.0},
    'Streaming': {'avg_duration': 0.0, 'frequency': 6, 'data_volume': 2.0}
}

# Relative likelihood of a usage event in each hour of the day (index = hour);
# business hours and evenings carry the highest weights
HOUR_WEIGHTS = [1, 1, 1, 1, 1, 2, 4, 6, 8, 7, 6, 8, 9, 8, 7, 6, 8, 9, 10, 8, 6, 4, 3, 2]

# Generate network usage records
usage_data = []
base_date = datetime(2024, 1, 1)

# Create 10,000 subscribers with 20-100 usage events each
for subscriber_num in range(1, 10001):
    subscriber_id = f"SUB{subscriber_num:08d}"
    num_events = random.randint(20, 100)

    for _ in range(num_events):
        # Pick a day in 2024 (offsets 0-365 cover the whole leap year)
        days_offset = random.randint(0, 365)
        usage_date = base_date + timedelta(days=days_offset)

        # Pick the hour using the weights above, then a random minute
        hours_offset = random.choices(range(24), weights=HOUR_WEIGHTS)[0]
        usage_date = usage_date.replace(hour=hours_offset, minute=random.randint(0, 59), second=0, microsecond=0)

        # Select service type (uniformly at random)
        service_type = random.choice(SERVICE_TYPES)
        params = USAGE_PARAMS[service_type]

        # Calculate usage metrics with variability
        if service_type == 'Voice':
            duration_variation = random.uniform(0.3, 3.0)
            call_duration = round(params['avg_duration'] * duration_variation, 2)
            data_volume = 0.0
        elif service_type == 'Data':
            data_variation = random.uniform(0.1, 5.0)
            data_volume = round(params['data_volume'] * data_variation, 3)
            call_duration = 0.0
        elif service_type == 'SMS':
            data_volume = 0.0
            call_duration = 0.0
        else:  # Streaming
            data_variation = random.uniform(0.5, 8.0)
            data_volume = round(params['data_volume'] * data_variation, 3)
            call_duration = 0.0

        # Select cell tower
        cell_tower_id = random.choice(CELL_TOWERS)

        # Signal quality is drawn at random, independent of tower and time:
        # a base of 60-95 plus an adjustment of -15 to +5, clamped to 0-100
        base_signal = random.randint(60, 95)
        signal_variation = random.randint(-15, 5)
        signal_quality = max(0, min(100, base_signal + signal_variation))

        usage_data.append({
            "subscriber_id": subscriber_id,
            "usage_date": usage_date,
            "service_type": service_type,
            "data_volume": data_volume,
            "call_duration": call_duration,
            "cell_tower_id": cell_tower_id,
            "signal_quality": signal_quality
        })

print(f"Generated {len(usage_data)} network usage records")
print("Sample record:", usage_data[0])

Generated 603319 network usage records
Sample record: {'subscriber_id': 'SUB00000001', 'usage_date': datetime.datetime(2024, 3, 4, 13, 52), 'service_type': 'Voice', 'data_volume': 0.0, 'call_duration': 11.72, 'cell_tower_id': 'TOWER_SFO_006', 'signal_quality': 64}
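
As a quick sanity check on the generator, the sketch below works out what the hour weights and the `randint(20, 100)` event count imply. The weights are copied from the generation cell; the variable names here are illustrative.

```python
# Weights copied from the generation cell; index = hour of day.
hour_weights = [1, 1, 1, 1, 1, 2, 4, 6, 8, 7, 6, 8, 9, 8, 7, 6, 8, 9, 10, 8, 6, 4, 3, 2]

total_weight = sum(hour_weights)
hour_share = [w / total_weight for w in hour_weights]
peak_hour = max(range(24), key=lambda h: hour_weights[h])

# randint(20, 100) is uniform over the integers 20..100, so its mean is (20 + 100) / 2.
expected_events_per_subscriber = (20 + 100) / 2
expected_total = 10_000 * expected_events_per_subscriber

print(f"expected total records: {expected_total:,.0f}")
print(f"peak hour: {peak_hour}:00 with share {hour_share[peak_hour]:.1%}")
print(f"share of each hour from 00:00 to 04:59: {hour_share[0]:.2%}")
```

The recorded run above produced 603,319 records, which is close to the 600,000 the generator yields on average.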


## Step 4: Insert Data Using PySpark

### Data Insertion Strategy

We'll use PySpark to:

1. **Create DataFrame** from our generated data
2. **Insert into Delta table** with liquid clustering
3. **Verify the insertion** with a sample query

### Why PySpark for Insertion?

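- **Distributed processing**: Handles large datasets efficiently
- **Typed columns**: DataFrames carry a schema, although `createDataFrame` infers it from the Python values unless you supply one
- **Optimization**: Leverages Spark's query optimization
- **Liquid clustering**: Writes go to the clustered table; depending on the engine, rows are clustered at write time or by a later maintenance run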
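
On the typed-columns point: `createDataFrame` maps Python `float` to `double` and `int` to `long`, which is what the schema printed later in this step shows, whereas the table was declared with `DECIMAL` and `INT` columns. The sketch below is one possible way to align the two, under the assumption that your Spark version accepts a DDL schema string together with dictionary rows. `USAGE_SCHEMA_DDL`, `to_typed_record`, and `build_typed_frame` are hypothetical names, not part of the notebook.

```python
from decimal import Decimal

# DDL string mirroring the CREATE TABLE statement in Step 2.
USAGE_SCHEMA_DDL = (
    "subscriber_id string, usage_date timestamp, service_type string, "
    "data_volume decimal(10,3), call_duration decimal(8,2), "
    "cell_tower_id string, signal_quality int"
)


def to_typed_record(record):
    """Return a copy of `record` with the float metrics as fixed-scale Decimals."""
    typed = dict(record)
    typed["data_volume"] = Decimal(str(record["data_volume"])).quantize(Decimal("0.001"))
    typed["call_duration"] = Decimal(str(record["call_duration"])).quantize(Decimal("0.01"))
    return typed


def build_typed_frame(spark, records):
    """Hypothetical helper: build a DataFrame whose types match the table."""
    return spark.createDataFrame(
        [to_typed_record(r) for r in records],
        schema=USAGE_SCHEMA_DDL,
    )
```

With this in place, `build_typed_frame(spark, usage_data)` could replace the plain `spark.createDataFrame(usage_data)` call; whether that is worthwhile depends on how strictly you need the declared column types.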

In [None]:
# Insert data using PySpark DataFrame operations

# Create DataFrame from generated data
# (column types are inferred from the Python values)
df_usage = spark.createDataFrame(usage_data)

# Display schema and sample data
print("DataFrame Schema:")
df_usage.printSchema()

print("\nSample Data:")
df_usage.show(5)

# Write the data into the Delta table created in Step 2
# mode("overwrite") replaces existing rows, so re-running this cell does not duplicate data
# Note: the inferred numeric types (double/long) differ from the declared DECIMAL/INT columns
df_usage.write.mode("overwrite").saveAsTable("telecom.analytics.network_usage")


print(f"\nSuccessfully inserted {df_usage.count()} records into telecom.analytics.network_usage")

print("Liquid clustering automatically optimized the data layout during insertion!")

DataFrame Schema:
root
 |-- call_duration: double (nullable = true)
 |-- cell_tower_id: string (nullable = true)
 |-- data_volume: double (nullable = true)
 |-- service_type: string (nullable = true)
 |-- signal_quality: long (nullable = true)
 |-- subscriber_id: string (nullable = true)
 |-- usage_date: timestamp (nullable = true)


Sample Data:


+-------------+-------------+-----------+------------+--------------+-------------+-------------------+
|call_duration|cell_tower_id|data_volume|service_type|signal_quality|subscriber_id|         usage_date|
+-------------+-------------+-----------+------------+--------------+-------------+-------------------+
|        11.72|TOWER_SFO_006|        0.0|       Voice|            64|  SUB00000001|2024-03-04 13:52:00|
|          0.0|TOWER_NYC_001|        0.0|         SMS|            62|  SUB00000001|2024-04-30 15:44:00|
|         2.56|TOWER_NYC_001|        0.0|       Voice|            85|  SUB00000001|2024-01-14 04:37:00|
|          0.0|TOWER_LAX_002|     14.926|   Streaming|            71|  SUB00000001|2024-09-13 12:56:00|
|          0.0|TOWER_SEA_007|      8.358|   Streaming|            88|  SUB00000001|2024-03-16 16:04:00|
+-------------+-------------+-----------+------------+--------------+-------------+-------------------+
only showing top 5 rows




Successfully inserted 603319 records into telecom.analytics.network_usage
Liquid clustering automatically optimized the data layout during insertion!
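
Because the write used `mode("overwrite")` and the DataFrame's inferred types differ from the declared columns, it may be worth confirming that the table still reports its clustering columns. The sketch below assumes that `DESCRIBE DETAIL` exposes a `clusteringColumns` field, as recent Delta Lake releases do; whether the Delta version in your AIDP environment does the same is an assumption. The helper name `clustering_columns` is hypothetical.

```python
def clustering_columns(spark, table_name):
    """Return the clustering columns reported by DESCRIBE DETAIL.

    Returns None when the field is not available in this Delta version,
    and an empty list when the table has no clustering columns.
    """
    detail = spark.sql(f"DESCRIBE DETAIL {table_name}")
    if "clusteringColumns" not in detail.columns:
        return None
    value = detail.select("clusteringColumns").first()[0]
    return list(value) if value is not None else []
```

Calling `clustering_columns(spark, "telecom.analytics.network_usage")` should return `['subscriber_id', 'usage_date']` if the clustering definition survived the write.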


## Step 5: Demonstrate Liquid Clustering Benefits

### Query Performance Analysis

Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:

1. **Subscriber usage history** (clustered by subscriber_id)
2. **Time-based network analysis** (clustered by usage_date)
3. **Combined subscriber + time queries** (optimal for our clustering)

### Expected Performance Benefits

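With liquid clustering, these queries should read less data than they would on an unclustered table because:

- **Data locality**: Rows with nearby `subscriber_id` and `usage_date` values are stored in the same files
- **Reduced I/O**: File-level statistics let the engine skip files that cannot match the filter
- **Automatic optimization**: No manual tuning required

On a table this small (about 600,000 rows) the difference may be hard to notice; the effect grows with data volume.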
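
The notebook does not time its queries, so the expected benefit is not measured here. A minimal way to measure it yourself is sketched below: a small wall-clock timer that takes any zero-argument callable. The name `time_action` is hypothetical, and a fair comparison would also need an unclustered copy of the table as a baseline.

```python
import time


def time_action(label, action, repeats=3):
    """Run `action` several times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    best = min(timings)
    print(f"{label}: best of {repeats} runs = {best:.3f}s")
    return best
```

For example, `time_action("subscriber lookup", lambda: spark.sql("SELECT * FROM telecom.analytics.network_usage WHERE subscriber_id = 'SUB00000001'").collect())` forces the query to execute inside the timed region. Taking the best of several runs reduces the influence of caching and cluster warm-up, but it does not remove it.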

In [None]:
# Demonstrate liquid clustering benefits with optimized queries


# Query 1: Subscriber usage history - benefits from subscriber_id clustering

print("=== Query 1: Subscriber Usage History ===")

subscriber_history = spark.sql("""

SELECT subscriber_id, usage_date, service_type, data_volume, call_duration, signal_quality

FROM telecom.analytics.network_usage

WHERE subscriber_id = 'SUB00000001'

ORDER BY usage_date DESC

""")



subscriber_history.show()

print(f"Records found: {subscriber_history.count()}")



# Query 2: Time-based network quality analysis - benefits from usage_date clustering

print("\n=== Query 2: Recent Network Quality Issues ===")

network_quality = spark.sql("""

SELECT usage_date, subscriber_id, cell_tower_id, signal_quality, service_type

FROM telecom.analytics.network_usage

WHERE usage_date >= '2024-06-01' AND signal_quality < 50

ORDER BY signal_quality ASC, usage_date DESC

""")



network_quality.show()

print(f"Network quality issues found: {network_quality.count()}")



# Query 3: Combined subscriber + time query - optimal for our clustering strategy

print("\n=== Query 3: Subscriber Data Usage Trends ===")

usage_trends = spark.sql("""

SELECT subscriber_id, usage_date, service_type, data_volume, call_duration

FROM telecom.analytics.network_usage

WHERE subscriber_id LIKE 'SUB000000%' AND usage_date >= '2024-04-01'

ORDER BY subscriber_id, usage_date

""")



usage_trends.show()

print(f"Usage trend records found: {usage_trends.count()}")

=== Query 1: Subscriber Usage History ===


+-------------+-------------------+------------+-----------+-------------+--------------+
|subscriber_id|         usage_date|service_type|data_volume|call_duration|signal_quality|
+-------------+-------------------+------------+-----------+-------------+--------------+
|  SUB00000001|2024-12-22 16:14:00|         SMS|        0.0|          0.0|            72|
|  SUB00000001|2024-12-08 17:36:00|        Data|      0.108|          0.0|            77|
|  SUB00000001|2024-12-06 15:00:00|        Data|      0.056|          0.0|            85|
|  SUB00000001|2024-11-23 13:11:00|   Streaming|     14.654|          0.0|            84|
|  SUB00000001|2024-11-07 18:22:00|         SMS|        0.0|          0.0|            95|
|  SUB00000001|2024-10-24 20:26:00|         SMS|        0.0|          0.0|            75|
|  SUB00000001|2024-10-08 19:32:00|   Streaming|      6.947|          0.0|            74|
|  SUB00000001|2024-09-25 19:05:00|        Data|      1.264|          0.0|            78|
|  SUB0000

Records found: 33

=== Query 2: Recent Network Quality Issues ===


+-------------------+-------------+-------------+--------------+------------+
|         usage_date|subscriber_id|cell_tower_id|signal_quality|service_type|
+-------------------+-------------+-------------+--------------+------------+
|2024-12-31 13:12:00|  SUB00009850|TOWER_SEA_007|            45|       Voice|
|2024-12-31 07:42:00|  SUB00001957|TOWER_NYC_001|            45|   Streaming|
|2024-12-30 17:24:00|  SUB00009189|TOWER_MIA_005|            45|   Streaming|
|2024-12-30 17:12:00|  SUB00009185|TOWER_CHI_003|            45|        Data|
|2024-12-28 11:49:00|  SUB00002129|TOWER_HOU_004|            45|         SMS|
|2024-12-26 17:32:00|  SUB00006483|TOWER_SFO_006|            45|        Data|
|2024-12-26 16:21:00|  SUB00000968|TOWER_CHI_003|            45|         SMS|
|2024-12-26 15:30:00|  SUB00007641|TOWER_NYC_001|            45|       Voice|
|2024-12-26 11:30:00|  SUB00007019|TOWER_SEA_007|            45|   Streaming|
|2024-12-25 19:01:00|  SUB00009049|TOWER_NYC_001|            45|

Network quality issues found: 7091

=== Query 3: Subscriber Data Usage Trends ===


+-------------+-------------------+------------+-----------+-------------+
|subscriber_id|         usage_date|service_type|data_volume|call_duration|
+-------------+-------------------+------------+-----------+-------------+
|  SUB00000001|2024-04-01 19:20:00|         SMS|        0.0|          0.0|
|  SUB00000001|2024-04-12 10:27:00|   Streaming|      8.765|          0.0|
|  SUB00000001|2024-04-13 16:43:00|   Streaming|      8.624|          0.0|
|  SUB00000001|2024-04-19 10:25:00|       Voice|        0.0|        13.93|
|  SUB00000001|2024-04-30 15:44:00|         SMS|        0.0|          0.0|
|  SUB00000001|2024-05-03 23:01:00|       Voice|        0.0|        12.49|
|  SUB00000001|2024-05-31 19:01:00|        Data|      1.873|          0.0|
|  SUB00000001|2024-06-17 20:17:00|   Streaming|     11.564|          0.0|
|  SUB00000001|2024-06-28 09:56:00|   Streaming|     15.775|          0.0|
|  SUB00000001|2024-07-09 22:13:00|        Data|      1.784|          0.0|
|  SUB00000001|2024-07-21

Usage trend records found: 4537
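
The result counts above can be cross-checked against the generator without Spark. The sketch below enumerates which subscriber IDs the `LIKE 'SUB000000%'` prefix in Query 3 matches, and which `signal_quality` values the generator can produce, which bounds what Query 2's `signal_quality < 50` filter can return. Variable names are illustrative.

```python
from itertools import product

# Subscriber IDs as produced by the generator: "SUB" followed by an 8-digit number.
all_ids = [f"SUB{n:08d}" for n in range(1, 10001)]
matching_ids = [s for s in all_ids if s.startswith("SUB000000")]  # LIKE 'SUB000000%'

# signal_quality = clamp(base + variation), base in 60..95, variation in -15..5,
# with every (base, variation) pair equally likely.
outcomes = [max(0, min(100, b + v)) for b, v in product(range(60, 96), range(-15, 6))]
weak = sum(1 for q in outcomes if q < 50)

print("subscribers matched by the prefix:", len(matching_ids))
print("lowest possible signal_quality:", min(outcomes))
print(f"P(signal_quality < 50) = {weak}/{len(outcomes)}")
print("mean signal_quality:", sum(outcomes) / len(outcomes))
```

This is consistent with the recorded output: the weakest signal shown for Query 2 is 45, and Query 3 covers only the small group of subscribers whose numbers are below 100.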


## Step 6: Analyze Clustering Effectiveness

### Understanding the Impact

The queries below do not inspect the physical file layout directly. Instead, they compute aggregate statistics over the clustered table to show the kind of telecommunications insight it supports.

### Key Analytics

- **Subscriber usage patterns** and data consumption analysis
- **Network performance metrics** and signal quality trends
- **Service type adoption** and usage distribution
- **Cell tower utilization** and capacity planning

In [None]:
# Analyze clustering effectiveness and telecommunications insights


# Subscriber usage analysis

print("=== Subscriber Usage Analysis ===")

subscriber_usage = spark.sql("""

SELECT subscriber_id, COUNT(*) as total_sessions,

       ROUND(SUM(data_volume), 3) as total_data_gb,

       ROUND(SUM(call_duration), 2) as total_call_minutes,

       ROUND(AVG(signal_quality), 2) as avg_signal_quality,

       COUNT(DISTINCT service_type) as services_used

FROM telecom.analytics.network_usage

GROUP BY subscriber_id

ORDER BY total_data_gb DESC

""")



subscriber_usage.show()


# Service type usage patterns

print("\n=== Service Type Usage Patterns ===")

service_patterns = spark.sql("""

SELECT service_type, COUNT(*) as total_usage,

       ROUND(SUM(data_volume), 3) as total_data_gb,

       ROUND(SUM(call_duration), 2) as total_call_minutes,

       ROUND(AVG(signal_quality), 2) as avg_signal_quality,

       COUNT(DISTINCT subscriber_id) as unique_subscribers

FROM telecom.analytics.network_usage

GROUP BY service_type

ORDER BY total_usage DESC

""")



service_patterns.show()


# Cell tower performance

print("\n=== Cell Tower Performance ===")

tower_performance = spark.sql("""

SELECT cell_tower_id, COUNT(*) as total_connections,

       COUNT(DISTINCT subscriber_id) as unique_subscribers,

       ROUND(AVG(signal_quality), 2) as avg_signal_quality,

       ROUND(SUM(data_volume), 3) as total_data_gb,

       ROUND(SUM(call_duration), 2) as total_call_minutes

FROM telecom.analytics.network_usage

GROUP BY cell_tower_id

ORDER BY total_connections DESC

""")



tower_performance.show()


# Hourly usage patterns

print("\n=== Hourly Usage Patterns ===")

hourly_patterns = spark.sql("""

SELECT HOUR(usage_date) as hour_of_day, COUNT(*) as usage_events,

       ROUND(SUM(data_volume), 3) as data_volume_gb,

       ROUND(SUM(call_duration), 2) as call_minutes,

       ROUND(AVG(signal_quality), 2) as avg_signal_quality

FROM telecom.analytics.network_usage

GROUP BY HOUR(usage_date)

ORDER BY hour_of_day

""")



hourly_patterns.show()


# Monthly network trends

print("\n=== Monthly Network Trends ===")

monthly_trends = spark.sql("""

SELECT DATE_FORMAT(usage_date, 'yyyy-MM') as month,

       COUNT(*) as total_usage,

       ROUND(SUM(data_volume), 3) as monthly_data_gb,

       ROUND(SUM(call_duration), 2) as monthly_call_minutes,

       ROUND(AVG(signal_quality), 2) as avg_signal_quality,

       COUNT(DISTINCT subscriber_id) as active_subscribers

FROM telecom.analytics.network_usage

GROUP BY DATE_FORMAT(usage_date, 'yyyy-MM')

ORDER BY month

""")



monthly_trends.show()

=== Subscriber Usage Analysis ===


+-------------+--------------+-------------+------------------+------------------+-------------+
|subscriber_id|total_sessions|total_data_gb|total_call_minutes|avg_signal_quality|services_used|
+-------------+--------------+-------------+------------------+------------------+-------------+
|  SUB00002907|            98|      374.303|            122.78|             72.31|            4|
|  SUB00003041|            97|      374.246|            151.33|              72.2|            4|
|  SUB00005923|            89|      371.788|            121.21|             75.24|            4|
|  SUB00009440|            95|      370.988|              66.9|             74.13|            4|
|  SUB00000337|            90|      365.707|            162.18|             73.34|            4|
|  SUB00007490|            98|      364.002|            179.59|             73.59|            4|
|  SUB00004805|            93|      348.482|            113.78|             71.77|            4|
|  SUB00009257|           100|

+------------+-----------+-------------+------------------+------------------+------------------+
|service_type|total_usage|total_data_gb|total_call_minutes|avg_signal_quality|unique_subscribers|
+------------+-----------+-------------+------------------+------------------+------------------+
|        Data|     151453|   193415.991|               0.0|             72.51|              9998|
|       Voice|     151003|          0.0|        1246808.73|              72.5|              9999|
|         SMS|     150646|          0.0|               0.0|             72.51|             10000|
|   Streaming|     150217|  1278512.672|               0.0|             72.48|             10000|
+------------+-----------+-------------+------------------+------------------+------------------+


=== Cell Tower Performance ===


+-------------+-----------------+------------------+------------------+-------------+------------------+
|cell_tower_id|total_connections|unique_subscribers|avg_signal_quality|total_data_gb|total_call_minutes|
+-------------+-----------------+------------------+------------------+-------------+------------------+
|TOWER_CHI_003|            86783|              9968|             72.47|   212538.431|         177926.82|
|TOWER_HOU_004|            86557|              9965|             72.56|   210666.317|         177868.82|
|TOWER_SFO_006|            86185|              9960|             72.46|   212706.208|         179430.77|
|TOWER_MIA_005|            86174|              9964|             72.55|   210610.422|         178815.47|
|TOWER_NYC_001|            86160|              9962|             72.49|   210421.392|         176386.73|
|TOWER_LAX_002|            85784|              9953|             72.49|   206811.988|         180098.11|
|TOWER_SEA_007|            85676|              9966|   

+-----------+------------+--------------+------------+------------------+
|hour_of_day|usage_events|data_volume_gb|call_minutes|avg_signal_quality|
+-----------+------------+--------------+------------+------------------+
|          0|        4792|     11788.074|    10100.05|             72.24|
|          1|        4796|     11492.777|    10334.85|             72.51|
|          2|        4849|     12027.327|    10040.76|             72.46|
|          3|        4802|     11230.272|     9795.77|              72.6|
|          4|        4786|     11925.201|     9794.47|             72.43|
|          5|        9544|     22804.363|    19789.58|             72.47|
|          6|       19183|     46809.918|    39957.14|             72.46|
|          7|       28830|     70531.404|    59719.89|             72.52|
|          8|       37931|     92887.894|    77744.11|             72.57|
|          9|       33653|      81507.36|    70418.82|             72.45|
|         10|       28633|     69056.1

+-------+-----------+---------------+--------------------+------------------+------------------+
|  month|total_usage|monthly_data_gb|monthly_call_minutes|avg_signal_quality|active_subscribers|
+-------+-----------+---------------+--------------------+------------------+------------------+
|2024-01|      51136|     125284.093|           104712.14|             72.54|              9738|
|2024-02|      47931|     117988.209|            99280.97|             72.52|              9701|
|2024-03|      50911|     122055.327|           104858.54|             72.56|              9787|
|2024-04|      49399|     120953.517|           102123.58|             72.53|              9726|
|2024-05|      51122|     124817.533|           106054.49|             72.49|              9748|
|2024-06|      49539|     119047.661|           103015.89|             72.54|              9713|
|2024-07|      50844|     124592.418|           104429.72|             72.45|              9760|
|2024-08|      51173|     1257

## Key Takeaways: Delta Liquid Clustering in AIDP

### What We Demonstrated

1. **Automatic Optimization**: Created a table with `CLUSTER BY (subscriber_id, usage_date)` so that Delta organizes the data layout around those columns

2. **Performance Benefits**: Queries that filter on the clustering columns (subscriber_id, usage_date) can skip unrelated files thanks to data locality; this notebook did not time them against an unclustered baseline

3. **Low Maintenance**: No manual partitioning, bucketing, or Z-Ordering required - layout upkeep is left to Delta's clustering and maintenance operations

4. **Real-World Use Case**: Telecommunications analytics where network monitoring and customer experience are critical
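
On the maintenance point: in open-source Delta Lake, data written to a liquid-clustered table is clustered when `OPTIMIZE` runs. Whether AIDP schedules that for you is not stated in this notebook, so the sketch below assumes you trigger it yourself. `run_maintenance` is a hypothetical helper; it runs `OPTIMIZE` and then lists the most recent operations from `DESCRIBE HISTORY`.

```python
def run_maintenance(spark, table_name, history_limit=5):
    """Run OPTIMIZE on a Delta table and return its most recent operation names."""
    # OPTIMIZE rewrites data files; on a liquid-clustered table this applies clustering.
    spark.sql(f"OPTIMIZE {table_name}")
    # DESCRIBE HISTORY lists the table's commits, newest first.
    history = spark.sql(f"DESCRIBE HISTORY {table_name} LIMIT {history_limit}")
    return [row[0] for row in history.select("operation").collect()]
```

A call such as `run_maintenance(spark, "telecom.analytics.network_usage")` would typically list an `OPTIMIZE` entry first, followed by the earlier write and create operations.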

### AIDP Advantages

- **Unified Analytics**: Seamlessly integrates with other AIDP services
- **Governance**: Catalog and schema isolation for telecommunications data
- **Performance**: Optimized for both OLAP and OLTP workloads
- **Scalability**: Handles telecommunications-scale data volumes effortlessly

### Best Practices for Liquid Clustering

1. **Choose clustering columns** based on your most common query patterns
2. **Use one to four columns** - Delta allows at most four clustering columns, and each added column dilutes locality for the others
3. **Consider cardinality** - high-cardinality columns work best
4. **Monitor and adjust** as query patterns evolve
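
Adjusting the clustering columns later does not require recreating the table: Delta accepts `ALTER TABLE ... CLUSTER BY (...)`, and `CLUSTER BY NONE` turns clustering off. The sketch below only builds the statement and enforces the four-column limit; `alter_clustering_sql` is a hypothetical helper, and the column choice in the example is illustrative rather than a recommendation.

```python
MAX_CLUSTERING_COLUMNS = 4  # documented limit for Delta liquid clustering


def alter_clustering_sql(table_name, columns):
    """Build an ALTER TABLE statement that changes a table's clustering columns."""
    columns = list(columns)
    if not columns:
        return f"ALTER TABLE {table_name} CLUSTER BY NONE"
    if len(columns) > MAX_CLUSTERING_COLUMNS:
        raise ValueError(
            f"at most {MAX_CLUSTERING_COLUMNS} clustering columns are allowed, "
            f"got {len(columns)}"
        )
    return f"ALTER TABLE {table_name} CLUSTER BY ({', '.join(columns)})"


print(alter_clustering_sql("telecom.analytics.network_usage",
                           ["cell_tower_id", "usage_date"]))
```

Passing the resulting string to `spark.sql(...)` changes how future writes and `OPTIMIZE` runs lay out data; files that already exist keep their layout until they are rewritten.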

### Next Steps

- Explore other AIDP features like AI/ML integration
- Try liquid clustering with different column combinations
- Scale up to larger telecommunications datasets
- Integrate with real network monitoring systems and CDR data

This notebook demonstrates how Oracle AI Data Platform makes advanced telecommunications analytics accessible while maintaining enterprise-grade performance and governance.