# Telecommunications: Delta Liquid Clustering Demo


## Overview


This notebook demonstrates **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using a telecommunications analytics use case. Liquid clustering organizes a Delta table's data layout around clustering columns you choose, so queries that filter on those columns read less data, without manual partitioning or Z-Ordering.

### What is Liquid Clustering?

Liquid clustering groups rows with similar values in the clustering columns you define into the same data files. The layout is maintained by write and maintenance operations (such as `OPTIMIZE`), providing:

- **Automatic optimization**: No partition scheme or Z-Order job to design and tune by hand
- **Improved query performance**: Queries that filter on clustered columns can skip unrelated files
- **Reduced maintenance**: No need for manual repartitioning
- **Adaptive clustering**: Clustering columns can be changed later as query patterns change
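
To make the file-skipping idea concrete, here is a minimal pure-Python sketch (no Spark required). It models data files as fixed-size chunks of rows and counts how many chunks a point lookup must open when each file's min/max range is known. The subscriber count, file size, and the `files_to_read` helper are illustrative assumptions, and a one-column sort is a simplification of how the real feature clusters on several columns at once.

```python
import random


def files_to_read(rows, rows_per_file, key, target):
    """Count files whose [min, max] range for `key` could contain `target`."""
    files = [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]
    hits = 0
    for file_rows in files:
        values = [row[key] for row in file_rows]
        if min(values) <= target <= max(values):
            hits += 1
    return hits, len(files)


rng = random.Random(0)

# 1,000 hypothetical subscribers with 20 events each, shuffled to mimic arrival order
rows = [{"subscriber_id": f"SUB{n:08d}"} for n in range(1, 1001) for _ in range(20)]
rng.shuffle(rows)

# Unclustered: every file holds a random mix of subscribers
unclustered = files_to_read(rows, 1000, "subscriber_id", "SUB00000500")

# Clustered (simplified): rows sorted by the clustering key before being split into files
clustered_rows = sorted(rows, key=lambda row: row["subscriber_id"])
clustered = files_to_read(clustered_rows, 1000, "subscriber_id", "SUB00000500")

print("unclustered (files read, total files):", unclustered)
print("clustered   (files read, total files):", clustered)
```

With the sorted layout, the lookup for one subscriber touches a single file, while the shuffled layout forces a scan of nearly all of them.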

### Use Case: Network Performance Monitoring and Customer Experience Analytics

We'll analyze telecommunications network performance and customer usage data. Our clustering strategy will optimize for:

- **Customer-specific queries**: Fast lookups by subscriber ID
- **Time-based analysis**: Efficient filtering by call/service date
- **Network performance patterns**: Quick aggregation by cell tower and service quality metrics

## Step 1: Set Up the AIDP Environment

This notebook uses the existing Spark session (`spark`) in your AIDP environment, so no session setup is required.

In [None]:
# Create the telecommunications catalog and analytics schema.
# In AIDP, catalogs provide data isolation and governance.
spark.sql("CREATE CATALOG IF NOT EXISTS telecom")
spark.sql("CREATE SCHEMA IF NOT EXISTS telecom.analytics")

print("Telecommunications catalog and analytics schema created successfully!")

## Step 2: Create Delta Table with Liquid Clustering

### Table Design

Our `network_usage` table will store:

- **subscriber_id**: Unique customer identifier
- **usage_date**: Date and time of service usage
- **service_type**: Type (Voice, Data, SMS, Streaming)
- **data_volume**: Data consumed (GB)
- **call_duration**: Call length (minutes)
- **cell_tower_id**: Network cell tower identifier
- **signal_quality**: Network signal strength (0-100)

### Clustering Strategy

We'll cluster by `subscriber_id` and `usage_date` because:

- **subscriber_id**: Each customer generates many usage events, so clustering on this column keeps a subscriber's records together
- **usage_date**: Time-based queries are critical for billing cycles, network planning, and customer behavior analysis
- This combination optimizes for both customer analytics and temporal network performance monitoring
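
The clustering choice above maps directly onto a `CLUSTER BY` clause. The sketch below builds that clause as a string and rejects column lists outside the one-to-four range. The helper names are hypothetical, the four-column limit should be treated as an assumption to verify for your runtime, and whether `ALTER TABLE ... CLUSTER BY` is available depends on the Delta version behind AIDP.

```python
# Assumed upper bound on clustering columns; verify against your runtime's Delta version
MAX_CLUSTERING_COLUMNS = 4


def cluster_by_clause(columns):
    """Build a CLUSTER BY clause, validating the number of columns."""
    if not 1 <= len(columns) <= MAX_CLUSTERING_COLUMNS:
        raise ValueError(
            f"expected 1 to {MAX_CLUSTERING_COLUMNS} clustering columns, got {len(columns)}"
        )
    return f"CLUSTER BY ({', '.join(columns)})"


def alter_clustering_sql(table, columns):
    """Build a statement that changes the clustering columns of an existing table."""
    return f"ALTER TABLE {table} {cluster_by_clause(columns)}"


print(cluster_by_clause(["subscriber_id", "usage_date"]))
print(alter_clustering_sql("telecom.analytics.network_usage", ["cell_tower_id", "usage_date"]))
```

The second statement illustrates how the strategy could be revisited later, for example if tower-level queries came to dominate the workload.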

In [None]:
# Create Delta table with liquid clustering

# CLUSTER BY defines the columns for automatic optimization

spark.sql("""

CREATE TABLE IF NOT EXISTS telecom.analytics.network_usage (

    subscriber_id STRING,

    usage_date TIMESTAMP,

    service_type STRING,

    data_volume DECIMAL(10,3),

    call_duration DECIMAL(8,2),

    cell_tower_id STRING,

    signal_quality INT

)

USING DELTA

CLUSTER BY (subscriber_id, usage_date)

""")

print("Delta table with liquid clustering created successfully!")

print("Clustering will automatically optimize data layout for queries on subscriber_id and usage_date.")

Delta table with liquid clustering created successfully!
Clustering will automatically optimize data layout for queries on subscriber_id and usage_date.


## Step 3: Generate Telecommunications Sample Data

### Data Generation Strategy

We'll create realistic telecommunications usage data including:

- **10,000 subscribers** with 20 to 100 usage events each during 2024
- **Service types**: Voice calls, Data usage, SMS, Video streaming
- **Hour-of-day weighting**: More events during business hours and evenings than overnight
- **Network infrastructure**: Seven cell towers with randomly varying signal quality

### Why This Data Pattern?

This data simulates real telecommunications scenarios where:

- Customer usage varies by time of day and service type
- Network performance impacts customer experience
- Billing and service quality require temporal analysis
- Capacity planning depends on usage patterns
- Fraud detection needs real-time monitoring
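
The time-of-day pattern comes from weighted sampling over the 24 hours. The sketch below isolates that step, using the same weights as the generation cell, and compares the sampled share for one hour with the share the weights imply. The seeded `random.Random` instance is an addition for reproducibility (the generation cell itself is unseeded), and the helper names are illustrative.

```python
import random
from collections import Counter

# Same hour-of-day weights as the data generation cell
HOUR_WEIGHTS = [1, 1, 1, 1, 1, 2, 4, 6, 8, 7, 6, 8, 9, 8, 7, 6, 8, 9, 10, 8, 6, 4, 3, 2]


def expected_hour_shares(weights):
    """Share of events each hour should receive, given the weights."""
    total = sum(weights)
    return [weight / total for weight in weights]


def sample_hours(n, seed):
    """Draw n hours with a local, seeded generator so runs are repeatable."""
    rng = random.Random(seed)
    return rng.choices(range(24), weights=HOUR_WEIGHTS, k=n)


shares = expected_hour_shares(HOUR_WEIGHTS)
hours = sample_hours(50_000, seed=42)
counts = Counter(hours)

print(f"expected share at 18:00: {shares[18]:.4f}")
print(f"sampled share at 18:00:  {counts[18] / len(hours):.4f}")
```

The weights sum to 126, so the busiest hour (18:00, weight 10) should receive about 10/126, or roughly 7.9%, of all events.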

In [None]:
# Generate sample telecommunications usage data

# Only the standard library is needed to generate the records

import random

from datetime import datetime, timedelta


# Define telecommunications data constants

SERVICE_TYPES = ['Voice', 'Data', 'SMS', 'Streaming']

CELL_TOWERS = ['TOWER_NYC_001', 'TOWER_LAX_002', 'TOWER_CHI_003', 'TOWER_HOU_004', 'TOWER_MIA_005', 'TOWER_SFO_006', 'TOWER_SEA_007']

# Base usage parameters by service type

USAGE_PARAMS = {

    'Voice': {'avg_duration': 5.0, 'frequency': 8, 'data_volume': 0.0},

    'Data': {'avg_duration': 0.0, 'frequency': 15, 'data_volume': 0.5},

    'SMS': {'avg_duration': 0.0, 'frequency': 12, 'data_volume': 0.0},

    'Streaming': {'avg_duration': 0.0, 'frequency': 6, 'data_volume': 2.0}

}


# Generate network usage records

usage_data = []

base_date = datetime(2024, 1, 1)


# Create 10,000 subscribers with 20-100 usage events each

for subscriber_num in range(1, 10001):

    subscriber_id = f"SUB{subscriber_num:08d}"
    
    # Each subscriber gets 20-100 usage events over 12 months

    num_events = random.randint(20, 100)
    
    for i in range(num_events):

        # Spread usage events over 12 months

        days_offset = random.randint(0, 365)

        usage_date = base_date + timedelta(days=days_offset)
        
        # Add realistic timing (more usage during business hours and evenings)

        hour_weights = [1, 1, 1, 1, 1, 2, 4, 6, 8, 7, 6, 8, 9, 8, 7, 6, 8, 9, 10, 8, 6, 4, 3, 2]

        hours_offset = random.choices(range(24), weights=hour_weights)[0]

        usage_date = usage_date.replace(hour=hours_offset, minute=random.randint(0, 59), second=0, microsecond=0)
        
        # Select service type

        service_type = random.choice(SERVICE_TYPES)

        params = USAGE_PARAMS[service_type]
        
        # Calculate usage metrics with variability

        if service_type == 'Voice':

            duration_variation = random.uniform(0.3, 3.0)

            call_duration = round(params['avg_duration'] * duration_variation, 2)

            data_volume = 0.0

        elif service_type == 'Data':

            data_variation = random.uniform(0.1, 5.0)

            data_volume = round(params['data_volume'] * data_variation, 3)

            call_duration = 0.0

        elif service_type == 'SMS':

            data_volume = 0.0

            call_duration = 0.0

        else:  # Streaming

            data_variation = random.uniform(0.5, 8.0)

            data_volume = round(params['data_volume'] * data_variation, 3)

            call_duration = 0.0
        
        # Select cell tower and signal quality

        cell_tower_id = random.choice(CELL_TOWERS)

        # Signal quality is drawn at random (independent of tower and time in this demo)

        base_signal = random.randint(60, 95)

        signal_variation = random.randint(-15, 5)

        signal_quality = max(0, min(100, base_signal + signal_variation))
        
        usage_data.append({

            "subscriber_id": subscriber_id,

            "usage_date": usage_date,

            "service_type": service_type,

            "data_volume": data_volume,

            "call_duration": call_duration,

            "cell_tower_id": cell_tower_id,

            "signal_quality": signal_quality

        })



print(f"Generated {len(usage_data)} network usage records")

print("Sample record:", usage_data[0])

Generated 599492 network usage records
Sample record: {'subscriber_id': 'SUB00000001', 'usage_date': datetime.datetime(2024, 12, 12, 20, 52), 'service_type': 'Data', 'data_volume': 0.753, 'call_duration': 0.0, 'cell_tower_id': 'TOWER_MIA_005', 'signal_quality': 64}


## Step 4: Insert Data Using PySpark

### Data Insertion Strategy

We'll use PySpark to:

1. **Create DataFrame** from our generated data
2. **Insert into Delta table** with liquid clustering
3. **Verify the insertion** with a sample query

### Why PySpark for Insertion?

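- **Distributed processing**: Handles large datasets efficiently
- **Typed schema**: The DataFrame carries an explicit schema that can be checked against the table definition
- **Optimization**: Leverages Spark's query optimization
- **Liquid clustering**: Writes go to the clustered Delta table; depending on the runtime, data is clustered on write or by a later `OPTIMIZE`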
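
One detail worth checking before writing: `spark.createDataFrame` infers column types from the Python values, which gives `double` and `long` columns rather than the `DECIMAL` and `INT` types declared in the table DDL. The sketch below generates `CAST` expressions from the DDL's column list, in the DDL's column order. The `TABLE_SCHEMA` list and `cast_expressions` helper are illustrative, and whether an explicit cast is required depends on how strictly the runtime enforces the table schema.

```python
# Column names and types copied from the CREATE TABLE statement in Step 2
TABLE_SCHEMA = [
    ("subscriber_id", "STRING"),
    ("usage_date", "TIMESTAMP"),
    ("service_type", "STRING"),
    ("data_volume", "DECIMAL(10,3)"),
    ("call_duration", "DECIMAL(8,2)"),
    ("cell_tower_id", "STRING"),
    ("signal_quality", "INT"),
]


def cast_expressions(schema):
    """Build one SQL CAST expression per column, preserving the DDL column order."""
    return [f"CAST({name} AS {dtype}) AS {name}" for name, dtype in schema]


exprs = cast_expressions(TABLE_SCHEMA)
for expr in exprs:
    print(expr)

# In the notebook, assuming df_usage exists as created in this step:
# df_typed = df_usage.selectExpr(*exprs)
# df_typed.printSchema()
```

Passing `df_typed` to the same write call would then keep the stored column types aligned with the table definition.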

In [None]:
# Insert data using PySpark DataFrame operations

# Column types are inferred from the Python values in usage_data


# Create DataFrame from generated data

df_usage = spark.createDataFrame(usage_data)


# Display schema and sample data

print("DataFrame Schema:")

df_usage.printSchema()



print("\nSample Data:")

df_usage.show(5)


# Insert data into Delta table with liquid clustering

# The table was created with CLUSTER BY (subscriber_id, usage_date). Note that the inferred
# DataFrame types (double/long) differ from the DECIMAL/INT columns declared in the DDL.

df_usage.write.mode("overwrite").saveAsTable("telecom.analytics.network_usage")


print(f"\nSuccessfully inserted {df_usage.count()} records into telecom.analytics.network_usage")

print("Liquid clustering automatically optimized the data layout during insertion!")

DataFrame Schema:
root
 |-- call_duration: double (nullable = true)
 |-- cell_tower_id: string (nullable = true)
 |-- data_volume: double (nullable = true)
 |-- service_type: string (nullable = true)
 |-- signal_quality: long (nullable = true)
 |-- subscriber_id: string (nullable = true)
 |-- usage_date: timestamp (nullable = true)


Sample Data:


+-------------+-------------+-----------+------------+--------------+-------------+-------------------+
|call_duration|cell_tower_id|data_volume|service_type|signal_quality|subscriber_id|         usage_date|
+-------------+-------------+-----------+------------+--------------+-------------+-------------------+
|          0.0|TOWER_MIA_005|      0.753|        Data|            64|  SUB00000001|2024-12-12 20:52:00|
|         9.03|TOWER_NYC_001|        0.0|       Voice|            69|  SUB00000001|2024-12-18 09:32:00|
|         5.75|TOWER_CHI_003|        0.0|       Voice|            58|  SUB00000001|2024-10-05 17:21:00|
|          0.0|TOWER_HOU_004|     12.468|   Streaming|            70|  SUB00000001|2024-12-18 09:32:00|
|          0.0|TOWER_NYC_001|      0.394|        Data|            63|  SUB00000001|2024-04-30 02:00:00|
+-------------+-------------+-----------+------------+--------------+-------------+-------------------+
only showing top 5 rows




Successfully inserted 599492 records into telecom.analytics.network_usage
Liquid clustering automatically optimized the data layout during insertion!


## Step 5: Demonstrate Liquid Clustering Benefits

### Query Performance Analysis

Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:

1. **Subscriber usage history** (clustered by subscriber_id)
2. **Time-based network analysis** (clustered by usage_date)
3. **Combined subscriber + time queries** (optimal for our clustering)

### Expected Performance Benefits

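With liquid clustering, these queries should read less data than they would on an unclustered table (this notebook does not measure timings):

- **Data locality**: Rows with similar `subscriber_id` and `usage_date` values are stored in the same files
- **Reduced I/O**: Files whose column statistics rule out the filter values can be skipped
- **Automatic optimization**: No partition scheme or Z-Order job to maintain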
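
To check the expected benefit rather than assume it, each query can be timed. The helper below is a minimal sketch: `time_action` is a hypothetical name, it measures wall-clock time only, and repeated runs may be affected by caching, so the numbers are indicative at best. A stand-in workload keeps the block runnable without Spark.

```python
import statistics
import time


def time_action(action, repeats=3):
    """Run `action` several times and return the median wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload so the sketch runs anywhere
elapsed = time_action(lambda: sum(range(100_000)))
print(f"median seconds: {elapsed:.6f}")

# In the notebook, the same helper could wrap a Spark action, for example:
# time_action(lambda: spark.sql(
#     "SELECT * FROM telecom.analytics.network_usage WHERE subscriber_id = 'SUB00000001'"
# ).count())
```

Comparing the result against the same query on an unclustered copy of the table would give a like-for-like measurement.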

In [None]:
# Demonstrate liquid clustering benefits with optimized queries


# Query 1: Subscriber usage history - benefits from subscriber_id clustering

print("=== Query 1: Subscriber Usage History ===")

subscriber_history = spark.sql("""

SELECT subscriber_id, usage_date, service_type, data_volume, call_duration, signal_quality

FROM telecom.analytics.network_usage

WHERE subscriber_id = 'SUB00000001'

ORDER BY usage_date DESC

""")



subscriber_history.show()

print(f"Records found: {subscriber_history.count()}")



# Query 2: Time-based network quality analysis - benefits from usage_date clustering

print("\n=== Query 2: Recent Network Quality Issues ===")

network_quality = spark.sql("""

SELECT usage_date, subscriber_id, cell_tower_id, signal_quality, service_type

FROM telecom.analytics.network_usage

WHERE usage_date >= '2024-06-01' AND signal_quality < 50

ORDER BY signal_quality ASC, usage_date DESC

""")



network_quality.show()

print(f"Network quality issues found: {network_quality.count()}")



# Query 3: Combined subscriber + time query - optimal for our clustering strategy

print("\n=== Query 3: Subscriber Data Usage Trends ===")

usage_trends = spark.sql("""

SELECT subscriber_id, usage_date, service_type, data_volume, call_duration

FROM telecom.analytics.network_usage

WHERE subscriber_id LIKE 'SUB000000%' AND usage_date >= '2024-04-01'

ORDER BY subscriber_id, usage_date

""")



usage_trends.show()

print(f"Usage trend records found: {usage_trends.count()}")

=== Query 1: Subscriber Usage History ===


+-------------+-------------------+------------+-----------+-------------+--------------+
|subscriber_id|         usage_date|service_type|data_volume|call_duration|signal_quality|
+-------------+-------------------+------------+-----------+-------------+--------------+
|  SUB00000001|2024-12-31 23:30:00|   Streaming|     15.036|          0.0|            70|
|  SUB00000001|2024-12-28 19:26:00|         SMS|        0.0|          0.0|            68|
|  SUB00000001|2024-12-28 00:56:00|       Voice|        0.0|         7.09|            89|
|  SUB00000001|2024-12-26 08:25:00|       Voice|        0.0|        10.12|            87|
|  SUB00000001|2024-12-18 09:32:00|       Voice|        0.0|         9.03|            69|
|  SUB00000001|2024-12-18 09:32:00|   Streaming|     12.468|          0.0|            70|
|  SUB00000001|2024-12-18 07:42:00|       Voice|        0.0|         5.89|            86|
|  SUB00000001|2024-12-15 12:21:00|       Voice|        0.0|        11.78|            52|
|  SUB0000

Records found: 77

=== Query 2: Recent Network Quality Issues ===


+-------------------+-------------+-------------+--------------+------------+
|         usage_date|subscriber_id|cell_tower_id|signal_quality|service_type|
+-------------------+-------------+-------------+--------------+------------+
|2024-12-31 18:48:00|  SUB00008783|TOWER_LAX_002|            45|   Streaming|
|2024-12-31 16:33:00|  SUB00003787|TOWER_MIA_005|            45|   Streaming|
|2024-12-31 14:47:00|  SUB00003103|TOWER_CHI_003|            45|   Streaming|
|2024-12-31 10:42:00|  SUB00007036|TOWER_SFO_006|            45|       Voice|
|2024-12-30 15:57:00|  SUB00009393|TOWER_NYC_001|            45|       Voice|
|2024-12-30 14:06:00|  SUB00003452|TOWER_MIA_005|            45|   Streaming|
|2024-12-30 10:08:00|  SUB00007996|TOWER_CHI_003|            45|        Data|
|2024-12-30 09:53:00|  SUB00001662|TOWER_LAX_002|            45|       Voice|
|2024-12-29 20:16:00|  SUB00005675|TOWER_LAX_002|            45|         SMS|
|2024-12-28 19:45:00|  SUB00001138|TOWER_CHI_003|            45|

Network quality issues found: 6909

=== Query 3: Subscriber Data Usage Trends ===


+-------------+-------------------+------------+-----------+-------------+
|subscriber_id|         usage_date|service_type|data_volume|call_duration|
+-------------+-------------------+------------+-----------+-------------+
|  SUB00000001|2024-04-07 14:03:00|   Streaming|       6.03|          0.0|
|  SUB00000001|2024-04-11 10:25:00|       Voice|        0.0|        13.75|
|  SUB00000001|2024-04-15 13:03:00|       Voice|        0.0|         7.19|
|  SUB00000001|2024-04-17 19:44:00|         SMS|        0.0|          0.0|
|  SUB00000001|2024-04-19 18:04:00|       Voice|        0.0|        10.23|
|  SUB00000001|2024-04-26 19:42:00|   Streaming|      3.414|          0.0|
|  SUB00000001|2024-04-30 02:00:00|        Data|      0.394|          0.0|
|  SUB00000001|2024-04-30 06:16:00|         SMS|        0.0|          0.0|
|  SUB00000001|2024-05-07 11:30:00|        Data|      0.577|          0.0|
|  SUB00000001|2024-05-08 10:26:00|   Streaming|      6.758|          0.0|
|  SUB00000001|2024-05-09

Usage trend records found: 4454


## Step 6: Analyze Clustering Effectiveness

### Understanding the Impact

The queries below compute aggregate statistics over the clustered table to show the kind of telecommunications insights it supports. They report on the data itself rather than on the physical file layout.

### Key Analytics

- **Subscriber usage patterns** and data consumption analysis
- **Network performance metrics** and signal quality trends
- **Service type adoption** and usage distribution
- **Cell tower utilization** and capacity planning
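
To look at the physical layout itself, Delta's `DESCRIBE DETAIL` command reports table-level metadata such as file count and size. The sketch below pulls a few fields from that result. The `layout_report` helper is hypothetical, and the `clusteringColumns` field is an assumption about the Delta version in use: where it is absent the helper returns `None` for it.

```python
def layout_report(spark, table):
    """Return selected DESCRIBE DETAIL fields for a Delta table as a dict."""
    row = spark.sql(f"DESCRIBE DETAIL {table}").collect()[0].asDict()
    wanted = ("numFiles", "sizeInBytes", "clusteringColumns")
    # .get() yields None for fields this Delta version does not report
    return {name: row.get(name) for name in wanted}


# Runs only where a Spark session named `spark` already exists (as in the notebook)
if "spark" in globals():
    print(layout_report(spark, "telecom.analytics.network_usage"))
```

Running the report before and after an `OPTIMIZE` on the table is one way to see whether the file count changes once clustering has been applied.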

In [None]:
# Analyze clustering effectiveness and telecommunications insights


# Subscriber usage analysis

print("=== Subscriber Usage Analysis ===")

subscriber_usage = spark.sql("""

SELECT subscriber_id, COUNT(*) as total_sessions,

       ROUND(SUM(data_volume), 3) as total_data_gb,

       ROUND(SUM(call_duration), 2) as total_call_minutes,

       ROUND(AVG(signal_quality), 2) as avg_signal_quality,

       COUNT(DISTINCT service_type) as services_used

FROM telecom.analytics.network_usage

GROUP BY subscriber_id

ORDER BY total_data_gb DESC

""")



subscriber_usage.show()


# Service type usage patterns

print("\n=== Service Type Usage Patterns ===")

service_patterns = spark.sql("""

SELECT service_type, COUNT(*) as total_usage,

       ROUND(SUM(data_volume), 3) as total_data_gb,

       ROUND(SUM(call_duration), 2) as total_call_minutes,

       ROUND(AVG(signal_quality), 2) as avg_signal_quality,

       COUNT(DISTINCT subscriber_id) as unique_subscribers

FROM telecom.analytics.network_usage

GROUP BY service_type

ORDER BY total_usage DESC

""")



service_patterns.show()


# Cell tower performance

print("\n=== Cell Tower Performance ===")

tower_performance = spark.sql("""

SELECT cell_tower_id, COUNT(*) as total_connections,

       COUNT(DISTINCT subscriber_id) as unique_subscribers,

       ROUND(AVG(signal_quality), 2) as avg_signal_quality,

       ROUND(SUM(data_volume), 3) as total_data_gb,

       ROUND(SUM(call_duration), 2) as total_call_minutes

FROM telecom.analytics.network_usage

GROUP BY cell_tower_id

ORDER BY total_connections DESC

""")



tower_performance.show()


# Hourly usage patterns

print("\n=== Hourly Usage Patterns ===")

hourly_patterns = spark.sql("""

SELECT HOUR(usage_date) as hour_of_day, COUNT(*) as usage_events,

       ROUND(SUM(data_volume), 3) as data_volume_gb,

       ROUND(SUM(call_duration), 2) as call_minutes,

       ROUND(AVG(signal_quality), 2) as avg_signal_quality

FROM telecom.analytics.network_usage

GROUP BY HOUR(usage_date)

ORDER BY hour_of_day

""")



hourly_patterns.show()


# Monthly network trends

print("\n=== Monthly Network Trends ===")

monthly_trends = spark.sql("""

SELECT DATE_FORMAT(usage_date, 'yyyy-MM') as month,

       COUNT(*) as total_usage,

       ROUND(SUM(data_volume), 3) as monthly_data_gb,

       ROUND(SUM(call_duration), 2) as monthly_call_minutes,

       ROUND(AVG(signal_quality), 2) as avg_signal_quality,

       COUNT(DISTINCT subscriber_id) as active_subscribers

FROM telecom.analytics.network_usage

GROUP BY DATE_FORMAT(usage_date, 'yyyy-MM')

ORDER BY month

""")



monthly_trends.show()

=== Subscriber Usage Analysis ===


+-------------+--------------+-------------+------------------+------------------+-------------+
|subscriber_id|total_sessions|total_data_gb|total_call_minutes|avg_signal_quality|services_used|
+-------------+--------------+-------------+------------------+------------------+-------------+
|  SUB00003991|            96|       405.35|             99.46|             73.69|            4|
|  SUB00008402|            94|      373.892|             196.9|             72.46|            4|
|  SUB00002884|            96|       366.61|            202.38|             72.77|            4|
|  SUB00001178|            94|      354.778|            205.55|             71.47|            4|
|  SUB00007652|            99|      351.278|            121.24|              71.8|            4|
|  SUB00006080|            80|      349.189|            102.71|             72.58|            4|
|  SUB00002701|            96|      349.039|            173.09|             71.91|            4|
|  SUB00006118|            96|

+------------+-----------+-------------+------------------+------------------+------------------+
|service_type|total_usage|total_data_gb|total_call_minutes|avg_signal_quality|unique_subscribers|
+------------+-----------+-------------+------------------+------------------+------------------+
|         SMS|     150136|          0.0|               0.0|             72.52|              9999|
|   Streaming|     150049|  1275253.862|               0.0|             72.52|             10000|
|        Data|     149947|   190729.154|               0.0|             72.47|              9999|
|       Voice|     149360|          0.0|         1230155.2|             72.48|              9999|
+------------+-----------+-------------+------------------+------------------+------------------+


=== Cell Tower Performance ===


+-------------+-----------------+------------------+------------------+-------------+------------------+
|cell_tower_id|total_connections|unique_subscribers|avg_signal_quality|total_data_gb|total_call_minutes|
+-------------+-----------------+------------------+------------------+-------------+------------------+
|TOWER_LAX_002|            86180|              9959|             72.53|   209550.744|          176764.7|
|TOWER_HOU_004|            85831|              9956|             72.48|   211175.342|          175565.2|
|TOWER_MIA_005|            85690|              9954|             72.47|   210204.599|         177086.58|
|TOWER_SFO_006|            85587|              9966|             72.47|   209599.497|         175356.27|
|TOWER_SEA_007|            85580|              9966|             72.54|   209181.485|          175116.0|
|TOWER_CHI_003|            85318|              9969|             72.48|   209784.082|         174256.42|
|TOWER_NYC_001|            85306|              9953|   

+-----------+------------+--------------+------------+------------------+
|hour_of_day|usage_events|data_volume_gb|call_minutes|avg_signal_quality|
+-----------+------------+--------------+------------+------------------+
|          0|        4716|     11146.131|      9561.2|             72.55|
|          1|        4647|     11412.267|     9520.85|             72.53|
|          2|        4818|     11734.839|     9706.25|             72.63|
|          3|        4828|      12269.49|    10174.36|             72.44|
|          4|        4697|     11388.678|     9634.72|             72.76|
|          5|        9411|     22767.134|    19881.45|             72.51|
|          6|       18906|     46540.893|    39919.24|             72.49|
|          7|       28338|     68891.571|    57707.75|             72.34|
|          8|       37902|     92559.901|    78153.24|             72.53|
|          9|       33009|     81097.378|     68248.8|             72.49|
|         10|       28797|      72193.

+-------+-----------+---------------+--------------------+------------------+------------------+
|  month|total_usage|monthly_data_gb|monthly_call_minutes|avg_signal_quality|active_subscribers|
+-------+-----------+---------------+--------------------+------------------+------------------+
|2024-01|      50986|     123095.808|           105952.17|             72.52|              9749|
|2024-02|      47609|     116650.232|            97923.63|             72.48|              9702|
|2024-03|      50870|     123873.978|            105374.2|             72.48|              9717|
|2024-04|      48916|     120802.198|           100590.86|             72.46|              9726|
|2024-05|      50756|     124252.466|            104269.3|             72.53|              9753|
|2024-06|      49318|     120021.321|            100302.2|             72.62|              9728|
|2024-07|      50493|     124146.114|           103128.33|             72.45|              9739|
|2024-08|      50693|     1252

## Step 7: Train Telecommunications Churn Prediction Model

### Machine Learning for Telecommunications Business Improvement

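Now we'll train a machine learning model to predict customer churn. Because the generated data contains no real churn events, the `churn_risk` label is simulated by a rule applied to the features themselves, so this step demonstrates the workflow rather than real predictive power. On real data, such a model can help telecommunications companies: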

- **Reduce customer loss** by identifying at-risk subscribers early
- **Improve customer retention** with targeted interventions
- **Optimize marketing spend** by focusing on high-risk customers
- **Enhance customer experience** by addressing pain points proactively

### Model Approach

We'll use a **Random Forest Classifier** to predict customer churn based on:

- Usage patterns (data volume, call duration, service types)
- Network quality metrics (signal strength)
- Temporal patterns (usage frequency, time of day)
- Service diversity and engagement levels

### Business Impact

- **Churn Prevention**: Early identification of subscribers likely to churn
- **Cost Savings**: Reduced customer acquisition costs through retention
- **Revenue Protection**: Preservation of recurring revenue streams
- **Customer Insights**: Understanding factors driving customer dissatisfaction

In [None]:
# Prepare data for machine learning - create churn labels and features

from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

# Create subscriber-level features for churn prediction
subscriber_features = spark.sql("""
SELECT 
    subscriber_id,
    COUNT(*) as total_sessions,
    ROUND(SUM(data_volume), 3) as total_data_gb,
    ROUND(SUM(call_duration), 2) as total_call_minutes,
    ROUND(AVG(signal_quality), 2) as avg_signal_quality,
    COUNT(DISTINCT service_type) as services_used,
    COUNT(DISTINCT cell_tower_id) as towers_used,
    COUNT(DISTINCT DATE(usage_date)) as active_days,
    ROUND(AVG(HOUR(usage_date)), 2) as avg_usage_hour,
    -- Simulate churn based on low usage and poor signal quality
    CASE WHEN 
        COUNT(*) < 30 OR 
        AVG(signal_quality) < 65 OR 
        COUNT(DISTINCT service_type) < 3 
    THEN 1 ELSE 0 END as churn_risk
FROM telecom.analytics.network_usage
GROUP BY subscriber_id
""")

print(f"Created subscriber features for {subscriber_features.count()} subscribers")
subscriber_features.groupBy("churn_risk").count().show()

Created subscriber features for 10000 subscribers


+----------+-----+
|churn_risk|count|
+----------+-----+
|         1| 1204|
|         0| 8796|
+----------+-----+
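
The label distribution above follows almost entirely from the labeling rule. The sketch below mirrors the `CASE` expression in plain Python and works out how many subscribers the session threshold alone should flag, given that session counts are drawn uniformly from 20 to 100. The function name is illustrative, and it assumes the rule is applied exactly as written in the SQL.

```python
from fractions import Fraction


def churn_risk(total_sessions, avg_signal_quality, services_used):
    """Plain-Python mirror of the CASE expression that simulates the label."""
    return int(total_sessions < 30 or avg_signal_quality < 65 or services_used < 3)


# Sessions per subscriber are drawn uniformly from 20..100 (81 possible values);
# the values 20..29 fall under the session threshold.
session_counts = range(20, 101)
low_session_share = Fraction(
    sum(1 for n in session_counts if n < 30), len(session_counts)
)

print("share flagged by the session rule:", low_session_share, float(low_session_share))
print("low-usage subscriber:", churn_risk(25, 72.5, 4))
print("high-usage subscriber:", churn_risk(90, 72.5, 4))
```

The session rule alone accounts for 10/81, about 12.3% of subscribers, which is close to the 1,204 of 10,000 flagged above. Since the label is a deterministic function of the model's own input features, a classifier can be expected to recover it almost exactly.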



In [None]:
# Feature engineering for churn prediction

# All features in this dataset are already numeric, so no categorical indexing is needed

# Assemble features for the model
feature_cols = ["total_sessions", "total_data_gb", "total_call_minutes", 
                "avg_signal_quality", "services_used", "towers_used", 
                "active_days", "avg_usage_hour"]

assembler = VectorAssembler(
    inputCols=feature_cols,
    outputCol="features"
)

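# Scale features (tree-based models do not require scaling; kept here to illustrate a pipeline stage)
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")

# Define the model (training happens in the next cell)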
rf = RandomForestClassifier(
    labelCol="churn_risk", 
    featuresCol="scaled_features",
    numTrees=100,
    maxDepth=10
)

# Create pipeline
pipeline = Pipeline(stages=[assembler, scaler, rf])

# Split data
train_data, test_data = subscriber_features.randomSplit([0.8, 0.2], seed=42)

print(f"Training set: {train_data.count()} subscribers")
print(f"Test set: {test_data.count()} subscribers")

Training set: 8079 subscribers


Test set: 1921 subscribers


In [None]:
# Train the churn prediction model

print("Training churn prediction model...")
model = pipeline.fit(train_data)

# Make predictions
predictions = model.transform(test_data)

# Evaluate the model
evaluator = BinaryClassificationEvaluator(labelCol="churn_risk", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)

print(f"Model AUC: {auc:.4f}")

# Show prediction results
predictions.select("subscriber_id", "total_sessions", "avg_signal_quality", "churn_risk", "prediction", "probability").show(10)

# Calculate confusion matrix
confusion_matrix = predictions.groupBy("churn_risk", "prediction").count()
confusion_matrix.show()

Training churn prediction model...


Model AUC: 1.0000


+-------------+--------------+------------------+----------+----------+-----------+
|subscriber_id|total_sessions|avg_signal_quality|churn_risk|prediction|probability|
+-------------+--------------+------------------+----------+----------+-----------+
|  SUB00000003|            90|             71.76|         0|       0.0|  [1.0,0.0]|
|  SUB00000007|            97|              73.7|         0|       0.0|  [1.0,0.0]|
|  SUB00000009|            60|             72.57|         0|       0.0|  [1.0,0.0]|
|  SUB00000014|            73|             71.38|         0|       0.0|  [1.0,0.0]|
|  SUB00000020|            78|             71.35|         0|       0.0|  [1.0,0.0]|
|  SUB00000024|            37|             71.76|         0|       0.0|  [1.0,0.0]|
|  SUB00000030|            97|             71.25|         0|       0.0|  [1.0,0.0]|
|  SUB00000036|            99|             72.28|         0|       0.0|  [1.0,0.0]|
|  SUB00000046|            65|             71.58|         0|       0.0|  [1.

+----------+----------+-----+
|churn_risk|prediction|count|
+----------+----------+-----+
|         0|       0.0| 1683|
|         1|       1.0|  238|
+----------+----------+-----+
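
Accuracy, precision, and recall follow directly from the four cells of a confusion matrix. The sketch below computes them from raw counts: the first call uses the counts shown above, where the off-diagonal cells are empty, and the second uses made-up counts purely to show how the metrics diverge once errors appear. The function name is illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Counts from the confusion matrix above: no false positives or false negatives
print(classification_metrics(tp=238, fp=0, fn=0, tn=1683))

# Hypothetical counts for an imperfect model, for comparison only
print(classification_metrics(tp=200, fp=50, fn=38, tn=1633))
```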



In [None]:
# Model interpretation and business insights

# Feature importance (approximate)
rf_model = model.stages[-1]
feature_importance = rf_model.featureImportances
feature_names = feature_cols

print("=== Feature Importance for Churn Prediction ===")
for name, importance in zip(feature_names, feature_importance):
    print(f"{name}: {importance:.4f}")

# Business impact analysis
print("\n=== Business Impact Analysis ===")

# Calculate potential impact of churn prediction
churn_predictions = predictions.filter("prediction = 1")
high_risk_subscribers = churn_predictions.count()
total_test_subscribers = test_data.count()

print(f"Total test subscribers: {total_test_subscribers}")
print(f"Subscribers predicted as high churn risk: {high_risk_subscribers}")
print(f"Percentage flagged for intervention: {(high_risk_subscribers/total_test_subscribers)*100:.1f}%")

# Calculate average revenue per user (ARPU) estimate
avg_data_gb = test_data.agg(F.avg("total_data_gb")).collect()[0][0] or 0
avg_call_minutes = test_data.agg(F.avg("total_call_minutes")).collect()[0][0] or 0

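# Rough ARPU calculation (simplified, illustrative prices). The usage totals cover the full
# 12 months of generated data, so the result is closer to an annual figure than a monthly one.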
estimated_arpu = (avg_data_gb * 10) + (avg_call_minutes * 0.1) + 50  # Base plan
potential_monthly_loss = high_risk_subscribers * estimated_arpu

print(f"\nEstimated average ARPU: ${estimated_arpu:.2f}")
print(f"Potential monthly revenue at risk: ${potential_monthly_loss:,.2f}")

# Accuracy metrics (each count is computed once and reused)
total = predictions.count()
true_positives = predictions.filter("prediction = 1 AND churn_risk = 1").count()
predicted_positives = predictions.filter("prediction = 1").count()
actual_positives = predictions.filter("churn_risk = 1").count()
accuracy = predictions.filter("churn_risk = prediction").count() / total
precision = true_positives / predicted_positives if predicted_positives > 0 else 0
recall = true_positives / actual_positives if actual_positives > 0 else 0

print(f"\nModel Performance:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"AUC: {auc:.4f}")

=== Feature Importance for Churn Prediction ===
total_sessions: 0.5294
total_data_gb: 0.0431
total_call_minutes: 0.0502
avg_signal_quality: 0.0011
services_used: 0.0000
towers_used: 0.0090
active_days: 0.3662
avg_usage_hour: 0.0010

=== Business Impact Analysis ===


Total test subscribers: 1921
Subscribers predicted as high churn risk: 238
Percentage flagged for intervention: 12.4%



Estimated average ARPU: $1532.52
Potential monthly revenue at risk: $364,739.93



Model Performance:
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
AUC: 1.0000
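
The ARPU figure above is computed from usage totals that span all 12 months of generated data. The sketch below shows the same formula with the totals first converted to monthly averages. The prices (10 per GB, 0.1 per minute, 50 base plan) come from the notebook's formula; treating the base plan as a monthly fee, the `months` parameter, and the example totals are assumptions for illustration.

```python
def estimate_monthly_arpu(total_data_gb, total_call_minutes, months=12,
                          price_per_gb=10.0, price_per_minute=0.1, base_plan=50.0):
    """Estimate monthly ARPU from usage totals that cover `months` months."""
    monthly_data_gb = total_data_gb / months
    monthly_call_minutes = total_call_minutes / months
    return (monthly_data_gb * price_per_gb
            + monthly_call_minutes * price_per_minute
            + base_plan)


# Hypothetical subscriber: 120 GB and 600 call minutes over 12 months
monthly = estimate_monthly_arpu(120.0, 600.0)
unscaled = estimate_monthly_arpu(120.0, 600.0, months=1)

print(f"monthly estimate:  ${monthly:.2f}")
print(f"unscaled estimate: ${unscaled:.2f}")
```

For this example the monthly estimate is 100 + 5 + 50 = 155, while applying the prices to the unscaled 12-month totals gives 1,200 + 60 + 50 = 1,310.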


## Key Takeaways: Delta Liquid Clustering + ML in AIDP

### What We Demonstrated

1. **Automatic Optimization**: Created a table with `CLUSTER BY (subscriber_id, usage_date)` so Delta organizes the data layout around those columns

2. **Performance Benefits**: Queries that filter on the clustered columns (subscriber_id, usage_date) can skip unrelated files; this notebook did not benchmark the speedup

3. **Low Maintenance**: No manual partitioning, bucketing, or Z-Ordering required

4. **Machine Learning Integration**: Trained a churn prediction model on features aggregated from the clustered table, using a simulated churn label

5. **Real-World Use Case**: Telecommunications analytics where customer retention and network monitoring are critical

### AIDP Advantages

- **Unified Analytics**: Seamlessly integrates data optimization with ML
- **Governance**: Catalog and schema isolation for telecommunications data
- **Performance**: Optimized for both analytical queries and ML training
- **Scalability**: Spark-based processing scales out to larger telecommunications datasets

### Business Benefits for Telecommunications

1. **Churn Prevention**: Early identification of at-risk subscribers
2. **Revenue Protection**: Preservation of recurring revenue streams
3. **Cost Optimization**: Targeted retention campaigns
4. **Customer Experience**: Proactive service quality improvements
5. **Network Optimization**: Data-driven capacity planning

### Best Practices for Telecommunications Analytics

1. **Choose clustering columns** based on your most common query patterns
2. **Use one to four clustering columns** - four is the maximum Delta allows, and each extra column dilutes the benefit for the others
3. **Consider cardinality** - high-cardinality columns work best
4. **Monitor and adjust** as query patterns evolve
5. **Combine with ML** for predictive analytics and automation

### Next Steps

- Explore other AIDP ML features like AutoML
- Try liquid clustering with different column combinations
- Scale up to larger telecommunications datasets
- Integrate with real network monitoring systems and CDR data
- Deploy models for real-time churn prediction and intervention

This notebook demonstrates how Oracle AI Data Platform makes advanced telecommunications analytics accessible while maintaining enterprise-grade performance and governance.