# Telecommunications Medallion Architecture Demo

## Overview

This notebook demonstrates a **Medallion Architecture** implementation in Oracle AI Data Platform (AIDP) Workbench using a telecommunications analytics use case. The medallion architecture organizes data into three layers:

- **Bronze Layer**: Raw data ingestion and storage
- **Silver Layer**: Cleaned, transformed, and standardized data
- **Gold Layer**: Aggregated, business-ready data and analytics

The notebook also includes machine learning for customer churn prediction, showcasing how the medallion architecture supports advanced analytics.

### What is Medallion Architecture?

Each layer refines the output of the one before it, so consumers can pick the level of processing they need:

- **Bronze**: Source data stored unchanged, which keeps it available for reprocessing
- **Silver**: Cleansed, deduplicated, and standardized records
- **Gold**: Curated aggregates ready for business intelligence and ML

### Use Case: Telecommunications Analytics

We'll analyze telecommunications network performance and customer usage data across all three layers, culminating in churn prediction modeling.

### AIDP Environment Setup

This notebook uses the Spark session that the AIDP environment already exposes as `spark`, so no session setup is needed.

## Bronze Layer: Raw Data Ingestion

### Purpose
- Store raw telecommunications data as-is
- Act as the landing zone for source data
- Enable reprocessing if needed

### Schema Design
- Raw network usage events
- No transformations applied
- Delta table with liquid clustering for performance

In [None]:
# Create telecommunications catalog and bronze schema

spark.sql("CREATE CATALOG IF NOT EXISTS telecom")
spark.sql("CREATE SCHEMA IF NOT EXISTS telecom.bronze")

print("Telecommunications catalog and bronze schema created successfully!")

Telecommunications catalog and bronze schema created successfully!


In [None]:
# Create bronze layer Delta table with liquid clustering

spark.sql("""
CREATE TABLE IF NOT EXISTS telecom.bronze.network_usage_raw (
    subscriber_id STRING,
    usage_date TIMESTAMP,
    service_type STRING,
    data_volume DECIMAL(10,3),
    call_duration DECIMAL(8,2),
    cell_tower_id STRING,
    signal_quality INT,
    ingestion_timestamp TIMESTAMP
)
USING DELTA
CLUSTER BY (subscriber_id, usage_date)
""")

print("Bronze layer Delta table created successfully!")
print("Liquid clustering will optimize data layout for subscriber and time-based queries.")

Bronze layer Delta table created successfully!
Liquid clustering will optimize data layout for subscriber and time-based queries.


In [None]:
# Generate and insert raw telecommunications data

import random
from datetime import datetime, timedelta

# Define telecommunications data constants
SERVICE_TYPES = ['Voice', 'Data', 'SMS', 'Streaming']
CELL_TOWERS = ['TOWER_NYC_001', 'TOWER_LAX_002', 'TOWER_CHI_003', 'TOWER_HOU_004', 'TOWER_MIA_005', 'TOWER_SFO_006', 'TOWER_SEA_007']

# Base usage parameters by service type
USAGE_PARAMS = {
    'Voice': {'avg_duration': 5.0, 'frequency': 8, 'data_volume': 0.0},
    'Data': {'avg_duration': 0.0, 'frequency': 15, 'data_volume': 0.5},
    'SMS': {'avg_duration': 0.0, 'frequency': 12, 'data_volume': 0.0},
    'Streaming': {'avg_duration': 0.0, 'frequency': 6, 'data_volume': 2.0}
}

# Generate raw network usage records (including some with potential data quality issues)
usage_data = []
base_date = datetime(2024, 1, 1)

# Create 10,000 subscribers with 20-100 usage events each
for subscriber_num in range(1, 10001):
    subscriber_id = f"SUB{subscriber_num:08d}"
    
    # Each subscriber gets 20-100 usage events over 12 months
    num_events = random.randint(20, 100)
    
    for i in range(num_events):
        # Spread usage events over 12 months
        days_offset = random.randint(0, 365)
        usage_date = base_date + timedelta(days=days_offset)
        
        # Add realistic timing
        hour_weights = [1, 1, 1, 1, 1, 2, 4, 6, 8, 7, 6, 8, 9, 8, 7, 6, 8, 9, 10, 8, 6, 4, 3, 2]
        hours_offset = random.choices(range(24), weights=hour_weights)[0]
        usage_date = usage_date.replace(hour=hours_offset, minute=random.randint(0, 59), second=0, microsecond=0)
        
        # Select service type
        service_type = random.choice(SERVICE_TYPES)
        params = USAGE_PARAMS[service_type]
        
        # Calculate usage metrics with random variability, by service type
        if service_type == 'Voice':
            duration_variation = random.uniform(0.3, 3.0)
            call_duration = round(params['avg_duration'] * duration_variation, 2)
            data_volume = 0.0
        elif service_type == 'Data':
            data_variation = random.uniform(0.1, 5.0)
            data_volume = round(params['data_volume'] * data_variation, 3)
            call_duration = 0.0
        elif service_type == 'SMS':
            data_volume = 0.0
            call_duration = 0.0
        else:  # Streaming
            data_variation = random.uniform(0.5, 8.0)
            data_volume = round(params['data_volume'] * data_variation, 3)
            call_duration = 0.0
        
        # Select cell tower and signal quality
        cell_tower_id = random.choice(CELL_TOWERS)
        
        # Signal quality is drawn at random, independent of tower and time (occasional nulls for data quality demo)
        if random.random() < 0.02:  # 2% null values
            signal_quality = None
        else:
            base_signal = random.randint(60, 95)
            signal_variation = random.randint(-15, 5)
            signal_quality = max(0, min(100, base_signal + signal_variation))
        
        usage_data.append({
            "subscriber_id": subscriber_id,
            "usage_date": usage_date,
            "service_type": service_type,
            "data_volume": data_volume,
            "call_duration": call_duration,
            "cell_tower_id": cell_tower_id,
            "signal_quality": signal_quality
        })

print(f"Generated {len(usage_data)} raw network usage records")
print("Sample record:", usage_data[0])

Generated 599191 raw network usage records
Sample record: {'subscriber_id': 'SUB00000001', 'usage_date': datetime.datetime(2024, 6, 21, 19, 52), 'service_type': 'Voice', 'data_volume': 0.0, 'call_duration': 12.51, 'cell_tower_id': 'TOWER_NYC_001', 'signal_quality': 82}
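
The generation cell draws from the module-level `random` state without a seed, so the record count (599191 in the run shown) will differ between runs. If reproducible data is wanted, one option is a dedicated seeded generator. The sketch below reuses the hourly weights from the cell; `sample_usage_hours` and the seed value are illustrative names, not part of the notebook.

```python
import random

# Same hourly weights as the generation cell: the peak is at 18:00, the trough overnight
HOUR_WEIGHTS = [1, 1, 1, 1, 1, 2, 4, 6, 8, 7, 6, 8, 9, 8, 7, 6, 8, 9, 10, 8, 6, 4, 3, 2]


def sample_usage_hours(n, seed=42):
    """Draw n hours of day (0-23) from a private, seeded generator (hypothetical helper)."""
    rng = random.Random(seed)  # isolated from the global random state
    return rng.choices(range(24), weights=HOUR_WEIGHTS, k=n)


first = sample_usage_hours(1000)
second = sample_usage_hours(1000)
print(first == second)  # both calls use the same seed, so the draws are identical
```

Passing one `random.Random` instance through the whole generation loop would make every field reproducible, not only the hour.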


In [None]:
# Insert raw data into bronze layer

df_bronze = spark.createDataFrame(usage_data)

print("Bronze Layer DataFrame Schema:")
df_bronze.printSchema()

print("\nSample Bronze Data:")
df_bronze.show(5)

# Write to the bronze table (overwrite mode replaces any existing rows)
df_bronze.write.mode("overwrite").saveAsTable("telecom.bronze.network_usage_raw")

bronze_count = spark.sql("SELECT COUNT(*) FROM telecom.bronze.network_usage_raw").collect()[0][0]
print(f"\nBronze layer: Successfully ingested {bronze_count} raw records")

Bronze Layer DataFrame Schema:
root
 |-- call_duration: double (nullable = true)
 |-- cell_tower_id: string (nullable = true)
 |-- data_volume: double (nullable = true)
 |-- service_type: string (nullable = true)
 |-- signal_quality: long (nullable = true)
 |-- subscriber_id: string (nullable = true)
 |-- usage_date: timestamp (nullable = true)


Sample Bronze Data:


+-------------+-------------+-----------+------------+--------------+-------------+-------------------+
|call_duration|cell_tower_id|data_volume|service_type|signal_quality|subscriber_id|         usage_date|
+-------------+-------------+-----------+------------+--------------+-------------+-------------------+
|        12.51|TOWER_NYC_001|        0.0|       Voice|            82|  SUB00000001|2024-06-21 19:52:00|
|          0.0|TOWER_SFO_006|        0.0|         SMS|            95|  SUB00000001|2024-07-10 13:56:00|
|          0.0|TOWER_HOU_004|        0.0|         SMS|            70|  SUB00000001|2024-10-07 08:58:00|
|          0.0|TOWER_HOU_004|     11.144|   Streaming|            64|  SUB00000001|2024-05-18 20:56:00|
|          0.0|TOWER_SEA_007|     12.774|   Streaming|            83|  SUB00000001|2024-12-12 13:59:00|
+-------------+-------------+-----------+------------+--------------+-------------+-------------------+
only showing top 5 rows




Bronze layer: Successfully ingested 599191 raw records
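
The inferred schema printed above uses `double` and `long` and has no `ingestion_timestamp`, while the bronze DDL declares `DECIMAL(10,3)`, `DECIMAL(8,2)`, `INT`, and an `ingestion_timestamp` column. One way to narrow that gap is to coerce each record before building the DataFrame. The following is a minimal sketch in plain Python; `to_bronze_row` is a hypothetical helper, and an explicit schema passed to `spark.createDataFrame` would likely still be needed, since inference from Python values does not know the declared precision.

```python
from datetime import datetime, timezone
from decimal import Decimal


def to_bronze_row(record, ingested_at):
    """Coerce one generated record toward the types in the bronze DDL (hypothetical helper)."""
    signal = record["signal_quality"]
    return {
        "subscriber_id": record["subscriber_id"],
        "usage_date": record["usage_date"],
        "service_type": record["service_type"],
        # Build decimals from str to avoid binary floating-point noise
        "data_volume": Decimal(str(record["data_volume"])).quantize(Decimal("0.001")),
        "call_duration": Decimal(str(record["call_duration"])).quantize(Decimal("0.01")),
        "cell_tower_id": record["cell_tower_id"],
        "signal_quality": None if signal is None else int(signal),
        "ingestion_timestamp": ingested_at,
    }


sample = {
    "subscriber_id": "SUB00000001",
    "usage_date": datetime(2024, 6, 21, 19, 52),
    "service_type": "Voice",
    "data_volume": 0.0,
    "call_duration": 12.51,
    "cell_tower_id": "TOWER_NYC_001",
    "signal_quality": 82,
}
row = to_bronze_row(sample, ingested_at=datetime.now(timezone.utc))
print(row["call_duration"], row["data_volume"], row["ingestion_timestamp"])
```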


## Silver Layer: Data Cleaning and Standardization

### Purpose
- Clean and validate data quality
- Standardize formats and handle missing values
- Remove duplicates and apply business rules
- Prepare data for downstream analytics

### Transformations Applied
- Handle missing signal_quality values
- Standardize service types
- Remove invalid data points
- Add derived columns for analysis
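
The first and last of these transformations can be checked without Spark. The sketch below is a plain-Python approximation of the tower-average imputation and the 80/60/40 signal thresholds used in the silver cell. The function names and the sample rows are illustrative only.

```python
from collections import defaultdict


def impute_signal_by_tower(rows):
    """Fill missing signal_quality with the rounded mean of the same tower's known readings."""
    known = defaultdict(list)
    for r in rows:
        if r["signal_quality"] is not None:
            known[r["cell_tower_id"]].append(r["signal_quality"])
    # int(x + 0.5) rounds half up for non-negative values, like Spark's round()
    means = {tower: int(sum(v) / len(v) + 0.5) for tower, v in known.items()}
    filled = []
    for r in rows:
        value = r["signal_quality"]
        if value is None:
            value = means.get(r["cell_tower_id"])  # stays None if the tower has no readings
        filled.append({**r, "signal_quality": value})
    return filled


def signal_category(signal_quality):
    """Classify a signal reading with the thresholds from the silver transformation."""
    if signal_quality >= 80:
        return "Excellent"
    if signal_quality >= 60:
        return "Good"
    if signal_quality >= 40:
        return "Fair"
    return "Poor"


rows = [
    {"cell_tower_id": "TOWER_NYC_001", "signal_quality": 70},
    {"cell_tower_id": "TOWER_NYC_001", "signal_quality": 80},
    {"cell_tower_id": "TOWER_NYC_001", "signal_quality": None},
    {"cell_tower_id": "TOWER_LAX_002", "signal_quality": 90},
]
filled = impute_signal_by_tower(rows)
print([(r["signal_quality"], signal_category(r["signal_quality"])) for r in filled])
```

A tower with no valid readings at all would leave its nulls in place, which is why the silver cell still filters on `signal_quality IS NOT NULL` after imputing.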

In [None]:
# Create silver schema

spark.sql("CREATE SCHEMA IF NOT EXISTS telecom.silver")
print("Silver schema created successfully!")

Silver schema created successfully!


In [None]:
# Create silver layer table with cleaned and standardized data

spark.sql("""
CREATE TABLE IF NOT EXISTS telecom.silver.network_usage_clean (
    subscriber_id STRING,
    usage_date TIMESTAMP,
    service_type STRING,
    data_volume DECIMAL(10,3),
    call_duration DECIMAL(8,2),
    cell_tower_id STRING,
    signal_quality INT,
    signal_category STRING,
    usage_hour INT,
    usage_day_of_week INT,
    is_business_hours BOOLEAN,
    processed_timestamp TIMESTAMP
)
USING DELTA
CLUSTER BY (subscriber_id, usage_date)
""")

print("Silver layer Delta table created successfully!")

Silver layer Delta table created successfully!


In [None]:
# Transform bronze data to silver layer

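# Import only the functions used below. A wildcard import would shadow Python's
# built-in round(), min(), and max(), which the data generation cell relies on.
from pyspark.sql.functions import avg, col, dayofweek, hour, when
from pyspark.sql.functions import round as spark_round
from pyspark.sql.window import Window

# Read bronze data
bronze_df = spark.table("telecom.bronze.network_usage_raw")

# Apply silver layer transformations
silver_df = bronze_df \
    .filter("subscriber_id IS NOT NULL") \
    .withColumn("signal_quality", 
                when(col("signal_quality").isNull(), 
                     spark_round(avg("signal_quality").over(Window.partitionBy("cell_tower_id")), 0).cast("int")
                ).otherwise(col("signal_quality"))) \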
    .withColumn("signal_category",
                when(col("signal_quality") >= 80, "Excellent")
                .when(col("signal_quality") >= 60, "Good")
                .when(col("signal_quality") >= 40, "Fair")
                .otherwise("Poor")) \
    .withColumn("usage_hour", hour("usage_date")) \
    .withColumn("usage_day_of_week", dayofweek("usage_date")) \
    .withColumn("is_business_hours", 
                when((col("usage_hour") >= 9) & (col("usage_hour") <= 17), True)
                .otherwise(False)) \
    .filter("signal_quality IS NOT NULL") \
    .dropDuplicates(["subscriber_id", "usage_date", "service_type"])

print("Silver Layer Transformations Applied:")
print("- Handled missing signal_quality values with tower averages")
print("- Added signal_category classification")
print("- Added temporal features (usage_hour, usage_day_of_week, is_business_hours)")
print("- Removed duplicates and invalid records")

Silver Layer Transformations Applied:
- Handled missing signal_quality values with tower averages
- Added signal_category classification
- Added temporal features (usage_hour, usage_day_of_week, is_business_hours)
- Removed duplicates and invalid records
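
The temporal features are easy to misread, because Spark's `dayofweek` numbers days from 1 = Sunday to 7 = Saturday. The sketch below reproduces the three derived columns in plain Python so the convention can be checked against a calendar. `temporal_features` is an illustrative helper, not a notebook function.

```python
from datetime import datetime


def temporal_features(ts):
    """Derive usage_hour, usage_day_of_week, and is_business_hours from a timestamp."""
    usage_hour = ts.hour
    # Python's isoweekday() is 1 = Monday ... 7 = Sunday; shift it to
    # Spark's dayofweek convention of 1 = Sunday ... 7 = Saturday.
    usage_day_of_week = ts.isoweekday() % 7 + 1
    # Inclusive bounds on the hour, so 17:59 still counts as business hours
    is_business_hours = 9 <= usage_hour <= 17
    return usage_hour, usage_day_of_week, is_business_hours


print(temporal_features(datetime(2024, 1, 15, 20, 16)))  # (20, 2, False): a Monday evening
```

The rule looks only at the hour, so weekend activity between 09:00 and 17:59 is also flagged as business hours. Adding a weekday condition would be a small change if that is not the intent.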


In [None]:
# Insert transformed data into silver layer

print("Silver Layer DataFrame Schema:")
silver_df.printSchema()

print("\nSample Silver Data:")
silver_df.show(5)

# Insert into silver table
silver_df.write.mode("overwrite").saveAsTable("telecom.silver.network_usage_clean")

silver_count = spark.sql("SELECT COUNT(*) FROM telecom.silver.network_usage_clean").collect()[0][0]
print(f"\nSilver layer: Successfully processed {silver_count} cleaned records")

# Quality check
null_check = spark.sql("""
SELECT 
    COUNT(*) as total_records,
    COUNT(CASE WHEN signal_quality IS NULL THEN 1 END) as null_signals
FROM telecom.silver.network_usage_clean
""").collect()[0]

print(f"Data quality check - Total records: {null_check['total_records']}, Null signals: {null_check['null_signals']}")

Silver Layer DataFrame Schema:
root
 |-- call_duration: double (nullable = true)
 |-- cell_tower_id: string (nullable = true)
 |-- data_volume: double (nullable = true)
 |-- service_type: string (nullable = true)
 |-- signal_quality: long (nullable = true)
 |-- subscriber_id: string (nullable = true)
 |-- usage_date: timestamp (nullable = true)
 |-- signal_category: string (nullable = false)
 |-- usage_hour: integer (nullable = true)
 |-- usage_day_of_week: integer (nullable = true)
 |-- is_business_hours: boolean (nullable = false)


Sample Silver Data:


+-------------+-------------+-----------+------------+--------------+-------------+-------------------+---------------+----------+-----------------+-----------------+
|call_duration|cell_tower_id|data_volume|service_type|signal_quality|subscriber_id|         usage_date|signal_category|usage_hour|usage_day_of_week|is_business_hours|
+-------------+-------------+-----------+------------+--------------+-------------+-------------------+---------------+----------+-----------------+-----------------+
|          0.0|TOWER_SEA_007|      7.295|   Streaming|            72|  SUB00000001|2024-01-15 20:16:00|           Good|        20|                2|            false|
|          0.0|TOWER_MIA_005|        0.0|         SMS|            67|  SUB00000001|2024-03-08 12:40:00|           Good|        12|                6|             true|
|          0.0|TOWER_CHI_003|      5.815|   Streaming|            72|  SUB00000001|2024-03-23 10:13:00|           Good|        10|                7|             true


Silver layer: Successfully processed 599179 cleaned records


Data quality check - Total records: 599179, Null signals: 0
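
Comparing row counts between layers is a cheap sanity check. In this run the silver table holds 12 fewer rows than bronze. Since the generator never emits a null `subscriber_id` and the imputation fills missing signals, those rows were most likely removed by `dropDuplicates` (same subscriber, minute, and service type), though the notebook does not confirm this. A minimal sketch of the reconciliation, with `reconcile_counts` as a hypothetical helper:

```python
def reconcile_counts(bronze_count, silver_count):
    """Return the rows dropped between layers and the drop rate in percent."""
    dropped = bronze_count - silver_count
    drop_pct = 100.0 * dropped / bronze_count if bronze_count else 0.0
    return dropped, drop_pct


# Counts reported by the bronze and silver cells in this run
dropped, drop_pct = reconcile_counts(599191, 599179)
print(f"Dropped {dropped} rows ({drop_pct:.4f}%)")  # Dropped 12 rows (0.0020%)
```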


## Gold Layer: Business Analytics and Aggregations

### Purpose
- Provide business-ready aggregations
- Enable fast queries for dashboards and reports
- Support advanced analytics and ML

### Analytics Included
- Subscriber-level metrics
- Service type performance
- Network infrastructure analytics
- Temporal usage patterns

In [None]:
# Create gold schema

spark.sql("CREATE SCHEMA IF NOT EXISTS telecom.gold")
print("Gold schema created successfully!")

Gold schema created successfully!


In [None]:
# Create gold layer subscriber analytics table

spark.sql("""
CREATE TABLE IF NOT EXISTS telecom.gold.subscriber_analytics (
    subscriber_id STRING,
    total_sessions BIGINT,
    total_data_gb DECIMAL(10,3),
    total_call_minutes DECIMAL(8,2),
    avg_signal_quality DECIMAL(5,2),
    services_used INT,
    towers_used INT,
    active_days INT,
    avg_usage_hour DECIMAL(5,2),
    business_hours_pct DECIMAL(5,2),
    primary_service_type STRING,
    signal_category STRING,
    last_activity_date DATE,
    subscriber_segment STRING,
    created_at TIMESTAMP
)
USING DELTA
CLUSTER BY (subscriber_segment, avg_signal_quality)
""")

print("Gold layer subscriber analytics table created successfully!")

Gold layer subscriber analytics table created successfully!


In [None]:
# Create gold layer aggregations from silver data

subscriber_gold = spark.sql("""
WITH subscriber_metrics AS (
    SELECT 
        subscriber_id,
        COUNT(*) as total_sessions,
        ROUND(SUM(data_volume), 3) as total_data_gb,
        ROUND(SUM(call_duration), 2) as total_call_minutes,
        ROUND(AVG(signal_quality), 2) as avg_signal_quality,
        COUNT(DISTINCT service_type) as services_used,
        COUNT(DISTINCT cell_tower_id) as towers_used,
        COUNT(DISTINCT DATE(usage_date)) as active_days,
        ROUND(AVG(usage_hour), 2) as avg_usage_hour,
        ROUND(AVG(CASE WHEN is_business_hours THEN 100.0 ELSE 0.0 END), 2) as business_hours_pct,
        MAX(usage_date) as last_activity_date
    FROM telecom.silver.network_usage_clean
    GROUP BY subscriber_id
),
service_preferences AS (
    -- Most-used service per subscriber; ties are broken alphabetically so reruns are deterministic
    SELECT 
        subscriber_id,
        service_type as primary_service_type
    FROM (
        SELECT subscriber_id, service_type, COUNT(*) as usage_count,
               ROW_NUMBER() OVER (PARTITION BY subscriber_id ORDER BY COUNT(*) DESC, service_type) as rn
        FROM telecom.silver.network_usage_clean
        GROUP BY subscriber_id, service_type
    )
    WHERE rn = 1
)
SELECT 
    s.subscriber_id,
    s.total_sessions,
    s.total_data_gb,
    s.total_call_minutes,
    s.avg_signal_quality,
    CASE WHEN s.avg_signal_quality >= 80 THEN 'Excellent'
         WHEN s.avg_signal_quality >= 60 THEN 'Good'
         WHEN s.avg_signal_quality >= 40 THEN 'Fair'
         ELSE 'Poor' END as signal_category,
    s.services_used,
    s.towers_used,
    s.active_days,
    s.avg_usage_hour,
    s.business_hours_pct,
    sp.primary_service_type,
    DATE(s.last_activity_date) as last_activity_date,
    CASE WHEN s.total_data_gb > 50 AND s.services_used >= 3 THEN 'High-Value'
         WHEN s.total_data_gb > 20 OR s.services_used >= 2 THEN 'Medium-Value'
         ELSE 'Low-Value' END as subscriber_segment
FROM subscriber_metrics s
LEFT JOIN service_preferences sp ON s.subscriber_id = sp.subscriber_id
""")

# Insert into gold layer
subscriber_gold.write.mode("overwrite").saveAsTable("telecom.gold.subscriber_analytics")

gold_count = spark.sql("SELECT COUNT(*) FROM telecom.gold.subscriber_analytics").collect()[0][0]
print(f"Gold layer: Successfully created analytics for {gold_count} subscribers")

Gold layer: Successfully created analytics for 10000 subscribers


In [None]:
# Create additional gold layer tables for network and service analytics

# Network infrastructure analytics
spark.sql("""
CREATE TABLE IF NOT EXISTS telecom.gold.network_infrastructure (
    cell_tower_id STRING,
    total_connections BIGINT,
    unique_subscribers BIGINT,
    avg_signal_quality DECIMAL(5,2),
    total_data_gb DECIMAL(10,3),
    total_call_minutes DECIMAL(8,2),
    signal_category STRING,
    utilization_rank INT,
    created_at TIMESTAMP 
)
USING DELTA
""")

# Service performance analytics
spark.sql("""
CREATE TABLE IF NOT EXISTS telecom.gold.service_performance (
    service_type STRING,
    total_usage BIGINT,
    total_data_gb DECIMAL(10,3),
    total_call_minutes DECIMAL(8,2),
    avg_signal_quality DECIMAL(5,2),
    unique_subscribers BIGINT,
    avg_sessions_per_subscriber DECIMAL(5,2),
    revenue_potential DECIMAL(10,2),
    created_at TIMESTAMP 
)
USING DELTA
""")

print("Gold layer infrastructure and service analytics tables created!")

Gold layer infrastructure and service analytics tables created!


In [None]:
# Populate network infrastructure analytics

network_gold = spark.sql("""
SELECT 
    cell_tower_id,
    COUNT(*) as total_connections,
    COUNT(DISTINCT subscriber_id) as unique_subscribers,
    ROUND(AVG(signal_quality), 2) as avg_signal_quality,
    ROUND(SUM(data_volume), 3) as total_data_gb,
    ROUND(SUM(call_duration), 2) as total_call_minutes,
    CASE WHEN AVG(signal_quality) >= 80 THEN 'Excellent'
         WHEN AVG(signal_quality) >= 60 THEN 'Good'
         WHEN AVG(signal_quality) >= 40 THEN 'Fair'
         ELSE 'Poor' END as signal_category,
    ROW_NUMBER() OVER (ORDER BY COUNT(*) DESC) as utilization_rank
FROM telecom.silver.network_usage_clean
GROUP BY cell_tower_id
ORDER BY total_connections DESC
""")

network_gold.write.mode("overwrite").saveAsTable("telecom.gold.network_infrastructure")
print("Network infrastructure analytics populated!")

Network infrastructure analytics populated!


In [None]:
# Populate service performance analytics

service_gold = spark.sql("""
SELECT 
    service_type,
    COUNT(*) as total_usage,
    ROUND(SUM(data_volume), 3) as total_data_gb,
    ROUND(SUM(call_duration), 2) as total_call_minutes,
    ROUND(AVG(signal_quality), 2) as avg_signal_quality,
    COUNT(DISTINCT subscriber_id) as unique_subscribers,
    ROUND(COUNT(*) * 1.0 / COUNT(DISTINCT subscriber_id), 2) as avg_sessions_per_subscriber,
    -- Simplified revenue calculation
    ROUND(
        SUM(data_volume) * 10 + 
        SUM(call_duration) * 0.1 + 
        COUNT(CASE WHEN service_type = 'SMS' THEN 1 END) * 0.02 +
        COUNT(CASE WHEN service_type = 'Voice' THEN 1 END) * 0.5 +
        COUNT(CASE WHEN service_type = 'Streaming' THEN 1 END) * 2.0 +
        COUNT(CASE WHEN service_type = 'Data' THEN 1 END) * 0.8
    , 2) as revenue_potential
FROM telecom.silver.network_usage_clean
GROUP BY service_type
ORDER BY total_usage DESC
""")

service_gold.write.mode("overwrite").saveAsTable("telecom.gold.service_performance")
print("Service performance analytics populated!")

Service performance analytics populated!
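
Because the query groups by `service_type`, only one of the four `COUNT(CASE ...)` terms is non-zero in each group. The simplified revenue expression therefore reduces to a data charge, a call charge, and a per-event fee for that service. The sketch below restates it in plain Python using the coefficients from the SQL; `revenue_potential` and `EVENT_FEE` are illustrative names.

```python
# Per-event fees by service type, taken from the gold-layer SQL
EVENT_FEE = {"SMS": 0.02, "Voice": 0.5, "Streaming": 2.0, "Data": 0.8}


def revenue_potential(service_type, events, data_gb, call_minutes):
    """Simplified revenue expression for a single service_type group."""
    return round(data_gb * 10 + call_minutes * 0.1 + events * EVENT_FEE[service_type], 2)


# SMS events carry no data volume or call duration, so only the per-event fee applies
print(revenue_potential("SMS", events=149976, data_gb=0.0, call_minutes=0.0))  # 2999.52
```

The SMS figure matches the `revenue_potential` value that the gold-layer query reports for 149976 SMS events, which is a quick way to confirm the formula behaves as intended.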


In [None]:
# Demonstrate gold layer analytics queries

print("=== Gold Layer Analytics ===")

# Top subscribers by data usage
print("\nTop Subscribers by Data Usage:")
spark.sql("""
SELECT subscriber_id, subscriber_segment, total_data_gb, services_used, signal_category
FROM telecom.gold.subscriber_analytics
ORDER BY total_data_gb DESC
LIMIT 5
""").show()

# Network tower performance
print("\nNetwork Tower Performance:")
spark.sql("""
SELECT cell_tower_id, signal_category, total_connections, unique_subscribers, utilization_rank
FROM telecom.gold.network_infrastructure
ORDER BY utilization_rank
LIMIT 5
""").show()

# Service revenue potential
print("\nService Revenue Potential:")
spark.sql("""
SELECT service_type, total_usage, revenue_potential, avg_sessions_per_subscriber
FROM telecom.gold.service_performance
ORDER BY revenue_potential DESC
""").show()

# Subscriber segmentation
print("\nSubscriber Segmentation:")
spark.sql("""
SELECT subscriber_segment, COUNT(*) as subscriber_count, 
       ROUND(AVG(total_data_gb), 2) as avg_data_gb,
       ROUND(AVG(avg_signal_quality), 2) as avg_signal
FROM telecom.gold.subscriber_analytics
GROUP BY subscriber_segment
ORDER BY subscriber_count DESC
""").show()

=== Gold Layer Analytics ===

Top Subscribers by Data Usage:


+-------------+------------------+-------------+-------------+---------------+
|subscriber_id|subscriber_segment|total_data_gb|services_used|signal_category|
+-------------+------------------+-------------+-------------+---------------+
|  SUB00008666|        High-Value|      407.254|            4|           Good|
|  SUB00004805|        High-Value|      372.894|            4|           Good|
|  SUB00005566|        High-Value|       369.98|            4|           Good|
|  SUB00000090|        High-Value|      354.836|            4|           Good|
|  SUB00000759|        High-Value|      348.745|            4|           Good|
+-------------+------------------+-------------+-------------+---------------+


Network Tower Performance:


+-------------+---------------+-----------------+------------------+----------------+
|cell_tower_id|signal_category|total_connections|unique_subscribers|utilization_rank|
+-------------+---------------+-----------------+------------------+----------------+
|TOWER_HOU_004|           Good|            86137|              9970|               1|
|TOWER_CHI_003|           Good|            85871|              9968|               2|
|TOWER_MIA_005|           Good|            85647|              9978|               3|
|TOWER_LAX_002|           Good|            85639|              9965|               4|
|TOWER_SEA_007|           Good|            85528|              9966|               5|
+-------------+---------------+-----------------+------------------+----------------+


Service Revenue Potential:


+------------+-----------+-----------------+---------------------------+
|service_type|total_usage|revenue_potential|avg_sessions_per_subscriber|
+------------+-----------+-----------------+---------------------------+
|   Streaming|     148979|    1.297370033E7|                      14.90|
|        Data|     150714|       2040220.65|                      15.07|
|       Voice|     149510|        197796.31|                      14.95|
|         SMS|     149976|          2999.52|                      15.00|
+------------+-----------+-----------------+---------------------------+


Subscriber Segmentation:


+------------------+----------------+-----------+----------+
|subscriber_segment|subscriber_count|avg_data_gb|avg_signal|
+------------------+----------------+-----------+----------+
|        High-Value|            9475|     151.94|     72.48|
|      Medium-Value|             525|       37.9|      72.5|
+------------------+----------------+-----------+----------+
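
No Low-Value segment appears in this output. Under the segmentation rule, a subscriber is Low-Value only with at most 20 GB of data and a single service type. With at least 20 events drawn uniformly from four service types, that combination is vanishingly unlikely in the generated data. The rule can be restated in plain Python as a sketch; the inputs below are illustrative.

```python
def subscriber_segment(total_data_gb, services_used):
    """Segmentation rule from the gold-layer subscriber query."""
    if total_data_gb > 50 and services_used >= 3:
        return "High-Value"
    if total_data_gb > 20 or services_used >= 2:
        return "Medium-Value"
    return "Low-Value"


print(subscriber_segment(407.254, 4))  # High-Value
print(subscriber_segment(37.9, 4))     # Medium-Value: enough services, too little data
print(subscriber_segment(5.0, 1))      # Low-Value: only reachable with a single service
```

Real subscriber data would likely need thresholds tuned to the observed distribution so that all three segments are populated.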



## Machine Learning: Customer Churn Prediction

### Business Value
- Predict subscribers likely to churn
- Enable proactive retention strategies
- Optimize marketing spend

### ML Approach
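- Use gold layer subscriber analytics as features
- Label churn risk with a rule over those same metrics, since the generated data contains no observed churn
- Random Forest classifier for churn prediction
- Include network quality and usage patterns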

In [None]:
# Prepare data for churn prediction model using gold layer analytics

from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

# Read gold layer subscriber data
subscriber_data = spark.table("telecom.gold.subscriber_analytics")

# Create synthetic churn-risk labels with a rule over gold layer metrics
# (the generated data contains no observed churn events)
ml_data = subscriber_data.withColumn(
    "churn_risk",
    F.when(
        (F.col("total_sessions") < 30) | 
        (F.col("avg_signal_quality") < 65) | 
        (F.col("services_used") < 3) |
        (F.col("subscriber_segment") == "Low-Value"),
        1
    ).otherwise(0)
)

print(f"Prepared ML dataset with {ml_data.count()} subscribers")
print("Churn risk distribution:")
ml_data.groupBy("churn_risk").count().show()

Prepared ML dataset with 10000 subscribers
Churn risk distribution:


+----------+-----+
|churn_risk|count|
+----------+-----+
|         1| 1213|
|         0| 8787|
+----------+-----+
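
The label is a deterministic function of columns that are also used as model features, so a tree-based classifier can recover the rule almost exactly. Near-perfect scores on this data reflect that, not real-world predictive power. The sketch below restates the labeling rule in plain Python; the input values are illustrative.

```python
def churn_risk(total_sessions, avg_signal_quality, services_used, subscriber_segment):
    """Rule-based churn-risk label, mirroring the notebook's labeling expression."""
    at_risk = (
        total_sessions < 30
        or avg_signal_quality < 65
        or services_used < 3
        or subscriber_segment == "Low-Value"
    )
    return 1 if at_risk else 0


print(churn_risk(23, 68.7, 4, "High-Value"))   # 1: fewer than 30 sessions
print(churn_risk(93, 71.69, 4, "High-Value"))  # 0: no condition is met
```

With real churn outcomes, the label would come from observed events such as cancellations, and the features would be computed from a window that ends before the outcome.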



In [None]:
# Feature engineering for churn prediction

# Index categorical features
segment_indexer = StringIndexer(inputCol="subscriber_segment", outputCol="segment_index")
signal_indexer = StringIndexer(inputCol="signal_category", outputCol="signal_index")
service_indexer = StringIndexer(inputCol="primary_service_type", outputCol="service_index")

# Assemble features
feature_cols = [
    "total_sessions", "total_data_gb", "total_call_minutes", 
    "avg_signal_quality", "services_used", "towers_used", 
    "active_days", "avg_usage_hour", "business_hours_pct",
    "segment_index", "signal_index", "service_index"
]

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")

# Create and train Random Forest model
rf = RandomForestClassifier(
    labelCol="churn_risk", 
    featuresCol="scaled_features",
    numTrees=100,
    maxDepth=10,
    seed=42
)

# Create pipeline
pipeline = Pipeline(stages=[segment_indexer, signal_indexer, service_indexer, assembler, scaler, rf])

# Split data
train_data, test_data = ml_data.randomSplit([0.8, 0.2], seed=42)

print(f"Training set: {train_data.count()} subscribers")
print(f"Test set: {test_data.count()} subscribers")

Training set: 8079 subscribers


Test set: 1921 subscribers


In [None]:
# Train the churn prediction model

print("Training churn prediction model...")
model = pipeline.fit(train_data)

# Make predictions
predictions = model.transform(test_data)

# Evaluate model
evaluator = BinaryClassificationEvaluator(labelCol="churn_risk", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)

print(f"\nModel Performance - AUC: {auc:.4f}")

# Show predictions
predictions.select(
    "subscriber_id", "subscriber_segment", "total_sessions", 
    "avg_signal_quality", "churn_risk", "prediction", "probability"
).show(10)

Training churn prediction model...



Model Performance - AUC: 1.0000


+-------------+------------------+--------------+------------------+----------+----------+--------------------+
|subscriber_id|subscriber_segment|total_sessions|avg_signal_quality|churn_risk|prediction|         probability|
+-------------+------------------+--------------+------------------+----------+----------+--------------------+
|  SUB00000003|        High-Value|            93|             71.69|         0|       0.0|[0.99999567979812...|
|  SUB00000007|        High-Value|            69|             73.16|         0|       0.0|[0.99999567979812...|
|  SUB00000009|        High-Value|            97|             71.69|         0|       0.0|[0.99999567979812...|
|  SUB00000014|        High-Value|            77|             72.01|         0|       0.0|[0.99999567979812...|
|  SUB00000020|        High-Value|            23|              68.7|         1|       1.0|         [0.01,0.99]|
|  SUB00000024|        High-Value|            62|             71.76|         0|       0.0|[0.99999567979

In [None]:
# Model interpretation and business impact analysis

# Feature importance
rf_model = model.stages[-1]
feature_names = feature_cols

print("=== Feature Importance for Churn Prediction ===")
for name, importance in zip(feature_names, rf_model.featureImportances):
    print(f"{name}: {importance:.4f}")

print("\n=== Business Impact Analysis ===")

# Calculate potential impact
churn_predictions = predictions.filter("prediction = 1")
high_risk_subscribers = churn_predictions.count()
total_test_subscribers = test_data.count()

print(f"Total test subscribers: {total_test_subscribers}")
print(f"Subscribers predicted as high churn risk: {high_risk_subscribers}")
print(f"Percentage flagged for intervention: {(high_risk_subscribers/total_test_subscribers)*100:.1f}%")

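# Revenue impact calculation. The usage totals cover the full 12-month window,
# so this estimate is closer to annual revenue per subscriber than to a monthly ARPU.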
avg_data_gb = test_data.agg(F.avg("total_data_gb")).collect()[0][0] or 0
avg_call_minutes = test_data.agg(F.avg("total_call_minutes")).collect()[0][0] or 0
estimated_arpu = (avg_data_gb * 10) + (avg_call_minutes * 0.1) + 50
potential_monthly_loss = high_risk_subscribers * estimated_arpu

print(f"\nEstimated average ARPU: ${estimated_arpu:.2f}")
print(f"Potential monthly revenue at risk: ${potential_monthly_loss:,.2f}")

# Model metrics (each count is computed once and reused)
total_predictions = predictions.count()
correct = predictions.filter("churn_risk = prediction").count()
true_positives = predictions.filter("prediction = 1 AND churn_risk = 1").count()
predicted_positives = predictions.filter("prediction = 1").count()
actual_positives = predictions.filter("churn_risk = 1").count()

accuracy = correct / total_predictions
precision = true_positives / predicted_positives if predicted_positives > 0 else 0
recall = true_positives / actual_positives if actual_positives > 0 else 0

print(f"\nDetailed Model Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"AUC: {auc:.4f}")

=== Feature Importance for Churn Prediction ===
total_sessions: 0.4583
total_data_gb: 0.0865
total_call_minutes: 0.0499
avg_signal_quality: 0.0013
services_used: 0.0000
towers_used: 0.0064
active_days: 0.3676
avg_usage_hour: 0.0015
business_hours_pct: 0.0022
segment_index: 0.0251
signal_index: 0.0000
service_index: 0.0011

=== Business Impact Analysis ===


Total test subscribers: 1921
Subscribers predicted as high churn risk: 223
Percentage flagged for intervention: 11.6%



Estimated average ARPU: $1537.63
Potential monthly revenue at risk: $342,891.99



Detailed Model Metrics:
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
AUC: 1.0000
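
The three headline metrics follow directly from the confusion-matrix counts. As a sketch, the helper below computes them in plain Python. The counts are the ones implied by this run (1921 test rows, 223 flagged, precision and recall of 1.0); `classification_metrics` is an illustrative name.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# 223 true positives and 1698 true negatives account for all 1921 test rows
print(classification_metrics(tp=223, fp=0, fn=0, tn=1698))  # (1.0, 1.0, 1.0)
```

Perfect scores are expected here because the model is learning a label defined by a rule over its own input features.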


## Key Takeaways: Medallion Architecture + ML in AIDP

### What We Demonstrated

1. **Bronze Layer**: Raw data ingestion with liquid clustering for performance
2. **Silver Layer**: Data cleaning, standardization, and enrichment
3. **Gold Layer**: Business-ready aggregations and analytics
4. **Machine Learning**: Churn prediction using curated gold layer data

### Medallion Architecture Benefits

- **Data Quality**: Progressive improvement from raw to business-ready
- **Performance**: Liquid clustering on the bronze, silver, and subscriber-level gold tables
- **Governance**: Clear data lineage and catalog organization
- **Flexibility**: Reprocessing capability from bronze layer

### AIDP Advantages

- **Unified Platform**: Seamless data processing to ML
- **Liquid Clustering**: Automatic optimization without manual tuning
- **Enterprise Ready**: Governance, security, and scalability

### Business Impact for Telecommunications

1. **Data-Driven Insights**: Comprehensive analytics across all layers
2. **Predictive Analytics**: ML-powered churn prevention
3. **Operational Efficiency**: Automated data processing pipelines
4. **Customer Experience**: Proactive service improvements

### Best Practices

1. **Layer Progression**: Always maintain clear bronze → silver → gold flow
2. **Clustering Strategy**: Choose columns based on query patterns
3. **Data Quality**: Implement validation at each layer
4. **ML Integration**: Use gold layer for training production models

### Next Steps

- Deploy medallion pipelines in production
- Add real-time streaming to bronze layer
- Implement automated data quality monitoring
- Scale ML models for real-time predictions
- Integrate with customer service systems