# Hospitality: Medallion Architecture Demo

## Overview

This notebook demonstrates the **Medallion Architecture** in Oracle AI Data Platform (AIDP) Workbench using a hospitality and tourism analytics use case. The medallion architecture organizes data into bronze (raw), silver (cleaned), and gold (aggregated/analytics) layers, providing a clear data processing pipeline.

### What is Medallion Architecture?

The medallion architecture is a data design pattern that organizes data into three layers:

- **Bronze Layer**: Raw data as ingested, minimal processing
- **Silver Layer**: Cleaned, transformed, and validated data
- **Gold Layer**: Business-level aggregates and analytics-ready data
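
As a rough illustration, the three layers can be mimicked on a handful of in-memory records with plain Python. The record fields and the helper names `to_silver` and `to_gold` are invented for this sketch and are not part of the notebook's tables.

```python
# Bronze: raw records exactly as received (strings, inconsistent formats).
bronze = [
    {"guest": "G1", "revenue": "$120.50"},
    {"guest": "G1", "revenue": "80"},
    {"guest": "G2", "revenue": ""},        # missing value
    {"guest": "G2", "revenue": "200 USD"},
]

def to_silver(rows):
    """Clean and type the raw rows; unparseable revenue becomes None."""
    cleaned = []
    for row in rows:
        text = row["revenue"].replace("$", "").replace("USD", "").strip()
        try:
            revenue = float(text)
        except ValueError:
            revenue = None
        cleaned.append({"guest": row["guest"], "revenue": revenue})
    return cleaned

def to_gold(rows):
    """Aggregate cleaned rows into per-guest totals, skipping missing revenue."""
    totals = {}
    for row in rows:
        if row["revenue"] is not None:
            totals[row["guest"]] = totals.get(row["guest"], 0.0) + row["revenue"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Bronze keeps the awkward strings untouched, silver turns them into typed values (or `None`), and gold only ever reads from silver.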

### Use Case: Hotel Guest Experience and Revenue Management

We'll analyze hotel booking and guest experience data across all three layers, incorporating machine learning for churn prediction in the gold layer.

### AIDP Environment Setup

This notebook uses the Spark session that the AIDP environment already provides (referenced as `spark` in the cells below), so no session setup is required.

In [None]:
# Create hospitality catalog and schemas for medallion architecture

# In AIDP, catalogs provide data isolation and governance

spark.sql("CREATE CATALOG IF NOT EXISTS hospitality")

spark.sql("CREATE SCHEMA IF NOT EXISTS hospitality.bronze")

spark.sql("CREATE SCHEMA IF NOT EXISTS hospitality.silver")

spark.sql("CREATE SCHEMA IF NOT EXISTS hospitality.gold")

print("Hospitality catalog and bronze/silver/gold schemas created successfully!")

Hospitality catalog and bronze/silver/gold schemas created successfully!


## Bronze Layer: Raw Data Ingestion

### Purpose
The bronze layer stores raw data as ingested from source systems, with minimal processing. This provides an immutable audit trail of all incoming data.

### Table Design
Our bronze `raw_guest_bookings` table will store raw booking data with:

- Raw field names and types as they come from source systems
- No data quality checks or transformations
- Timestamps for data lineage
- Source system metadata

In [None]:
# Create bronze layer table for raw guest booking data

spark.sql("""
CREATE TABLE IF NOT EXISTS hospitality.bronze.raw_guest_bookings (
    guest_id_raw STRING,
    booking_timestamp_raw STRING,
    checkin_timestamp_raw STRING,
    room_type_raw STRING,
    booking_channel_raw STRING,
    revenue_amount_raw STRING,
    satisfaction_score_raw STRING,
    source_system STRING,
    ingestion_timestamp TIMESTAMP,
    raw_data_quality_score DOUBLE
)
USING DELTA
CLUSTER BY (guest_id_raw, ingestion_timestamp)
""")

print("Bronze layer table created for raw guest bookings data!")

Bronze layer table created for raw guest bookings data!


### Generate Raw Hospitality Sample Data

#### Data Generation Strategy
We'll create realistic raw hotel booking data including:

- **5,000 guests** with 2-8 bookings each over 12 months
- **Simulated data quality issues**: missing values, null categories, and inconsistent formats (currency symbols, `N/A`, scores written as `8/10`)
- **Multiple source systems**: PMS, OTA API, Mobile App, Call Center

In [None]:
# Generate raw hospitality booking data with quality issues

import random
from datetime import datetime, timedelta
import uuid

# Define raw data constants with potential quality issues
ROOM_TYPES_RAW = ['standard', 'deluxe', 'SUITE', 'executive', 'Standard Room', 'Deluxe King', None]
BOOKING_CHANNELS_RAW = ['direct', 'ota', 'corporate', 'walk-in', 'Direct Booking', 'Online Travel', None]
SOURCE_SYSTEMS = ['PMS', 'OTA_API', 'MOBILE_APP', 'CALL_CENTER']

# Generate raw booking records with data quality issues
raw_booking_data = []
base_date = datetime(2024, 1, 1)

# Create 5,000 guests with 2-8 bookings each
for guest_num in range(1, 5001):
    guest_id = f"GST{guest_num:06d}"
    
    # Each guest gets 2-8 bookings over 12 months
    num_bookings = random.randint(2, 8)
    
    for i in range(num_bookings):
        # Spread bookings over 12 months
        days_offset = random.randint(0, 365)
        booking_date = base_date + timedelta(days=days_offset)
        
        # Check-in date (usually within 1-30 days of booking)
        checkin_offset = random.randint(1, 30)
        check_in_date = booking_date + timedelta(days=checkin_offset)
        
        # Select room type (with potential quality issues)
        room_type = random.choice(ROOM_TYPES_RAW)
        
        # Select booking channel (with potential quality issues)
        booking_channel = random.choice(BOOKING_CHANNELS_RAW)
        
        # Source system
        source_system = random.choice(SOURCE_SYSTEMS)
        
        # Generate revenue with format inconsistencies
        base_revenue = random.uniform(100, 2000)
        if random.random() < 0.1:  # 10% chance of format issues
            revenue_str = f"${base_revenue:.2f}"  # Currency symbol
        elif random.random() < 0.1:
            revenue_str = f"{base_revenue:.2f} USD"  # With currency code
        elif random.random() < 0.05:
            revenue_str = ""  # Missing value
        else:
            revenue_str = str(round(base_revenue, 2))  # Normal numeric
        
        # Generate satisfaction score with inconsistencies
        if random.random() < 0.1:
            satisfaction_str = str(random.randint(1, 10)) + "/10"  # With denominator
        elif random.random() < 0.05:
            satisfaction_str = ""  # Missing value
        elif random.random() < 0.05:
            satisfaction_str = "N/A"  # Text value
        else:
            satisfaction_str = str(random.randint(1, 10))  # Normal integer
        
        # Data quality score (simulated)
        quality_issues = sum([
            room_type is None,
            booking_channel is None,
            revenue_str == "",
            satisfaction_str in ["", "N/A"],
            "/" in satisfaction_str
        ])
        quality_score = max(0.1, 1.0 - (quality_issues * 0.2))
        
        raw_booking_data.append({
            "guest_id_raw": guest_id,
            "booking_timestamp_raw": booking_date.isoformat(),
            "checkin_timestamp_raw": check_in_date.isoformat(),
            "room_type_raw": room_type,
            "booking_channel_raw": booking_channel,
            "revenue_amount_raw": revenue_str,
            "satisfaction_score_raw": satisfaction_str,
            "source_system": source_system,
            "ingestion_timestamp": datetime.now(),
            "raw_data_quality_score": quality_score
        })

print(f"Generated {len(raw_booking_data)} raw guest booking records with simulated data quality issues")
print("Sample raw record:", raw_booking_data[0])

Generated 25088 raw guest booking records with simulated data quality issues
Sample raw record: {'guest_id_raw': 'GST000001', 'booking_timestamp_raw': '2024-10-17T00:00:00', 'checkin_timestamp_raw': '2024-10-31T00:00:00', 'room_type_raw': 'Deluxe King', 'booking_channel_raw': 'direct', 'revenue_amount_raw': '1488.17', 'satisfaction_score_raw': '6/10', 'source_system': 'MOBILE_APP', 'ingestion_timestamp': datetime.datetime(2025, 12, 20, 2, 36, 11, 264132), 'raw_data_quality_score': 0.8}
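
Because each `elif` in the generator draws a fresh random number, a later branch is reached only when every earlier branch was skipped. The effective share of each revenue format therefore differs slightly from the thresholds written in the code. The arithmetic below is a sketch of that calculation; the variable names are illustrative.

```python
# Thresholds used by the revenue branches in the generator, in order.
thresholds = [("currency symbol", 0.10), ("currency code", 0.10), ("missing", 0.05)]

shares = {}
remaining = 1.0  # probability that no earlier branch has fired
for label, p in thresholds:
    shares[label] = remaining * p
    remaining *= 1.0 - p
shares["plain numeric"] = remaining

for label, share in shares.items():
    print(f"{label}: {share:.4f}")
```

Under these thresholds roughly 10% of revenue values carry a currency symbol, 9% a currency code, about 4% are missing, and the remaining 77% are plain numbers.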


In [None]:
# Insert raw data into bronze layer

df_raw_bookings = spark.createDataFrame(raw_booking_data)

# Display schema and sample data
print("Bronze Layer DataFrame Schema:")
df_raw_bookings.printSchema()

print("\nSample Raw Data:")
df_raw_bookings.show(5)

# Insert data into bronze table
df_raw_bookings.write.mode("overwrite").saveAsTable("hospitality.bronze.raw_guest_bookings")

print(f"\nSuccessfully inserted {df_raw_bookings.count()} raw records into bronze layer")
print("Bronze layer preserves raw data with all quality issues intact")

Bronze Layer DataFrame Schema:
root
 |-- booking_channel_raw: string (nullable = true)
 |-- booking_timestamp_raw: string (nullable = true)
 |-- checkin_timestamp_raw: string (nullable = true)
 |-- guest_id_raw: string (nullable = true)
 |-- ingestion_timestamp: timestamp (nullable = true)
 |-- raw_data_quality_score: double (nullable = true)
 |-- revenue_amount_raw: string (nullable = true)
 |-- room_type_raw: string (nullable = true)
 |-- satisfaction_score_raw: string (nullable = true)
 |-- source_system: string (nullable = true)


Sample Raw Data:


+-------------------+---------------------+---------------------+------------+--------------------+----------------------+------------------+-------------+----------------------+-------------+
|booking_channel_raw|booking_timestamp_raw|checkin_timestamp_raw|guest_id_raw| ingestion_timestamp|raw_data_quality_score|revenue_amount_raw|room_type_raw|satisfaction_score_raw|source_system|
+-------------------+---------------------+---------------------+------------+--------------------+----------------------+------------------+-------------+----------------------+-------------+
|             direct|  2024-10-17T00:00:00|  2024-10-31T00:00:00|   GST000001|2025-12-20 02:36:...|                   0.8|           1488.17|  Deluxe King|                  6/10|   MOBILE_APP|
|                ota|  2024-04-15T00:00:00|  2024-04-24T00:00:00|   GST000001|2025-12-20 02:36:...|                   0.8|            856.31|         NULL|                     9|   MOBILE_APP|
|             direct|  2024-09-24T0


Successfully inserted 25088 raw records into bronze layer
Bronze layer preserves raw data with all quality issues intact


## Silver Layer: Data Cleaning and Transformation

### Purpose
The silver layer contains cleaned, validated, and transformed data. Raw data from bronze is processed to:

- Standardize formats and data types
- Handle missing values and outliers
- Apply data quality rules
- Create derived fields
- Ensure referential integrity

### Transformation Logic
- Standardize room types and booking channels
- Parse and validate revenue amounts
- Clean satisfaction scores
- Add data quality metrics
- Create business-relevant derived fields
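
The derived fields can be checked on a single booking with plain Python before they are applied at scale. This sketch assumes the same rules as the Spark transformation in the next cell (a weekend booking is made on a Saturday or Sunday; peak season means a check-in in June, July, August, November, or December). The function name `derive_fields` is hypothetical.

```python
from datetime import date

PEAK_MONTHS = {6, 7, 8, 11, 12}

def derive_fields(booking_date: date, check_in_date: date) -> dict:
    """Compute the silver-layer derived fields for one booking."""
    # Spark's dayofweek() numbers days 1=Sunday .. 7=Saturday, while
    # Python's isoweekday() uses 1=Monday .. 7=Sunday; convert for comparison.
    spark_dayofweek = booking_date.isoweekday() % 7 + 1
    return {
        "booking_lead_time_days": (check_in_date - booking_date).days,
        "weekend_booking": spark_dayofweek in (1, 7),
        "peak_season": check_in_date.month in PEAK_MONTHS,
    }

# 2024-09-07 was a Saturday; a September check-in falls outside the peak months.
fields = derive_fields(date(2024, 9, 7), date(2024, 9, 8))
print(fields)
```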

In [None]:
# Create silver layer table for cleaned guest booking data

spark.sql("""
CREATE TABLE IF NOT EXISTS hospitality.silver.cleaned_guest_bookings (
    guest_id STRING,
    booking_date DATE,
    check_in_date DATE,
    room_type STRING,
    booking_channel STRING,
    total_revenue DECIMAL(8,2),
    guest_satisfaction INT,
    source_system STRING,
    ingestion_timestamp TIMESTAMP,
    data_quality_score DOUBLE,
    booking_lead_time_days INT,
    weekend_booking BOOLEAN,
    peak_season BOOLEAN,
    processed_timestamp TIMESTAMP
)
USING DELTA
CLUSTER BY (guest_id, booking_date)
""")

print("Silver layer table created for cleaned guest bookings data!")

Silver layer table created for cleaned guest bookings data!


In [None]:
# Transform bronze data to silver layer

from pyspark.sql.functions import *
from pyspark.sql.types import *

# Read bronze data
bronze_df = spark.table("hospitality.bronze.raw_guest_bookings")

# Define UDFs for data cleaning
def standardize_room_type(room_type):
    if room_type is None:
        return "Unknown"
    room_lower = str(room_type).lower().strip()
    if "standard" in room_lower:
        return "Standard"
    elif "deluxe" in room_lower:
        return "Deluxe"
    elif "suite" in room_lower:
        return "Suite"
    elif "executive" in room_lower:
        return "Executive"
    else:
        return "Other"

def standardize_booking_channel(channel):
    if channel is None:
        return "Unknown"
    channel_lower = str(channel).lower().strip()
    if "direct" in channel_lower:
        return "Direct"
    elif "ota" in channel_lower or "online" in channel_lower or "travel" in channel_lower:
        return "Online Travel Agency"
    elif "corporate" in channel_lower:
        return "Corporate"
    elif "walk" in channel_lower:
        return "Walk-in"
    else:
        return "Other"

def parse_revenue(revenue_str):
    if not revenue_str or revenue_str.strip() == "":
        return None
    try:
        # Remove currency symbols and extra text
        clean_str = str(revenue_str).replace("$", "").replace("USD", "").replace(" ", "").strip()
        return float(clean_str)
    except (ValueError, TypeError):
        return None

def parse_satisfaction(satisfaction_str):
    if not satisfaction_str or satisfaction_str.strip() in ["", "N/A"]:
        return None
    try:
        # Handle formats like "8/10"
        if "/" in str(satisfaction_str):
            parts = str(satisfaction_str).split("/")
            return int(parts[0])
        return int(float(satisfaction_str))
    except (ValueError, TypeError):
        return None

# Register UDFs
standardize_room_udf = udf(standardize_room_type, StringType())
standardize_channel_udf = udf(standardize_booking_channel, StringType())
parse_revenue_udf = udf(parse_revenue, DoubleType())
parse_satisfaction_udf = udf(parse_satisfaction, IntegerType())

# Transform bronze to silver
silver_df = bronze_df \
    .withColumn("guest_id", col("guest_id_raw")) \
    .withColumn("booking_date", to_date(col("booking_timestamp_raw"))) \
    .withColumn("check_in_date", to_date(col("checkin_timestamp_raw"))) \
    .withColumn("room_type", standardize_room_udf(col("room_type_raw"))) \
    .withColumn("booking_channel", standardize_channel_udf(col("booking_channel_raw"))) \
    .withColumn("revenue_parsed", parse_revenue_udf(col("revenue_amount_raw"))) \
    .withColumn("satisfaction_parsed", parse_satisfaction_udf(col("satisfaction_score_raw"))) \
    .withColumn("total_revenue", round(col("revenue_parsed"), 2).cast(DecimalType(8,2))) \
    .withColumn("guest_satisfaction", col("satisfaction_parsed")) \
    .withColumn("booking_lead_time_days", datediff(col("check_in_date"), col("booking_date"))) \
    .withColumn("weekend_booking", dayofweek(col("booking_date")).isin([1,7])) \
    .withColumn("peak_season", month(col("check_in_date")).isin([6,7,8,11,12])) \
    .withColumn("processed_timestamp", current_timestamp()) \
    .withColumn("data_quality_score", 
                when(col("total_revenue").isNull() | col("guest_satisfaction").isNull(), 0.7)
                .otherwise(col("raw_data_quality_score"))) \
    .select(
        "guest_id",
        "booking_date",
        "check_in_date",
        "room_type",
        "booking_channel",
        "total_revenue",
        "guest_satisfaction",
        "source_system",
        "ingestion_timestamp",
        "data_quality_score",
        "booking_lead_time_days",
        "weekend_booking",
        "peak_season",
        "processed_timestamp"
    )

# Filter out records with critical data issues (optional - could be quarantined)
silver_df_filtered = silver_df.filter(
    col("guest_id").isNotNull() & 
    col("booking_date").isNotNull()
)

print("Silver layer transformation completed")
print(f"Bronze records: {bronze_df.count()}")
print(f"Silver records after cleaning: {silver_df_filtered.count()}")

# Show sample transformed data
silver_df_filtered.show(5)

Silver layer transformation completed
Bronze records: 25088
Silver records after cleaning: 25088


+---------+------------+-------------+---------+---------------+-------------+------------------+-------------+--------------------+------------------+----------------------+---------------+-----------+--------------------+
| guest_id|booking_date|check_in_date|room_type|booking_channel|total_revenue|guest_satisfaction|source_system| ingestion_timestamp|data_quality_score|booking_lead_time_days|weekend_booking|peak_season| processed_timestamp|
+---------+------------+-------------+---------+---------------+-------------+------------------+-------------+--------------------+------------------+----------------------+---------------+-----------+--------------------+
|GST003660|  2024-09-27|   2024-10-24|  Unknown|        Unknown|      1002.37|                 4|   MOBILE_APP|2025-12-20 02:36:...|               0.6|                    27|          false|      false|2025-12-20 02:36:...|
|GST003660|  2024-09-07|   2024-09-08| Standard|      Corporate|      1489.75|                 3|  CALL_

In [None]:
# Save silver layer data

silver_df_filtered.write.mode("overwrite").saveAsTable("hospitality.silver.cleaned_guest_bookings")

print(f"Successfully saved {silver_df_filtered.count()} cleaned records to silver layer")
print("Silver layer provides standardized, validated data for downstream analytics")

Successfully saved 25088 cleaned records to silver layer
Silver layer provides standardized, validated data for downstream analytics


## Gold Layer: Analytics and Machine Learning

### Purpose
The gold layer contains business-ready aggregates and analytics data optimized for:

- Reporting and dashboards
- Business intelligence
- Machine learning model training
- API endpoints

### Analytics Tables
We'll create several gold layer tables:

- `guest_analytics`: Guest-level aggregates and KPIs
- `revenue_analytics`: Revenue performance by various dimensions
- `churn_predictions`: ML model predictions and insights

### Machine Learning Integration
We'll train a guest churn prediction model using the cleaned silver data and store predictions in the gold layer.

In [None]:
# Create gold layer tables for analytics

# Guest analytics table
spark.sql("""
CREATE TABLE IF NOT EXISTS hospitality.gold.guest_analytics (
    guest_id STRING,
    total_bookings INT,
    total_spent DECIMAL(10,2),
    avg_booking_value DECIMAL(8,2),
    avg_satisfaction DECIMAL(4,2),
    satisfaction_variability DECIMAL(3,2),
    room_types_used INT,
    channels_used INT,
    active_months INT,
    days_since_last_booking INT,
    customer_tenure_days INT,
    avg_advance_booking_days DECIMAL(5,2),
    preferred_room_type STRING,
    preferred_channel STRING,
    lifetime_value_segment STRING,
    updated_timestamp TIMESTAMP
)
USING DELTA
CLUSTER BY (lifetime_value_segment, total_spent)
""")

# Revenue analytics table
spark.sql("""
CREATE TABLE IF NOT EXISTS hospitality.gold.revenue_analytics (
    date_dimension DATE,
    dimension_type STRING,
    dimension_value STRING,
    total_bookings INT,
    total_revenue DECIMAL(12,2),
    avg_revenue DECIMAL(8,2),
    avg_satisfaction DECIMAL(4,2),
    unique_guests INT,
    booking_channel_mix MAP<STRING, INT>,
    room_type_mix MAP<STRING, INT>,
    updated_timestamp TIMESTAMP
)
USING DELTA
CLUSTER BY (dimension_type, date_dimension)
""")

# Churn predictions table
spark.sql("""
CREATE TABLE IF NOT EXISTS hospitality.gold.churn_predictions (
    guest_id STRING,
    churn_probability DECIMAL(4,3),
    churn_risk_level STRING,
    predicted_churn BOOLEAN,
    feature_importance MAP<STRING, DECIMAL(5,4)>,
    intervention_recommendation STRING,
    potential_lifetime_value DECIMAL(10,2),
    prediction_timestamp TIMESTAMP,
    model_version STRING
)
USING DELTA
CLUSTER BY (churn_risk_level, churn_probability)
""")

print("Gold layer analytics tables created successfully!")

Gold layer analytics tables created successfully!
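
When sizing `DECIMAL(p, s)` columns, note that the type holds `p` digits in total with `s` of them after the decimal point, so the largest storable value is 10^(p-s) - 10^(-s). The helper below is a small sketch (the names `decimal_max` and `fits` are made up) for checking whether a column can hold the values it is expected to receive, such as a satisfaction average of 10.00 or a probability of 1.000.

```python
from decimal import Decimal

def decimal_max(precision: int, scale: int) -> Decimal:
    """Largest value representable by DECIMAL(precision, scale)."""
    return Decimal(10) ** (precision - scale) - Decimal(10) ** (-scale)

def fits(value: str, precision: int, scale: int) -> bool:
    """True if abs(value) does not exceed the DECIMAL(precision, scale) maximum."""
    return abs(Decimal(value)) <= decimal_max(precision, scale)

print(decimal_max(3, 2), fits("10.00", 3, 2))  # an average score of 10 does not fit
print(decimal_max(4, 2), fits("10.00", 4, 2))
print(decimal_max(3, 3), fits("1.000", 3, 3))  # a probability of 1.0 does not fit
print(decimal_max(4, 3), fits("1.000", 4, 3))
```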


In [None]:
# Generate guest analytics from silver layer

from pyspark.sql import Window

silver_df = spark.table("hospitality.silver.cleaned_guest_bookings")

# Guest analytics
guest_analytics = silver_df.groupBy("guest_id").agg(
    count("*").alias("total_bookings"),
    round(sum("total_revenue"), 2).alias("total_spent"),
    round(avg("total_revenue"), 2).alias("avg_booking_value"),
    round(avg("guest_satisfaction"), 2).alias("avg_satisfaction"),
    round(stddev("guest_satisfaction"), 2).alias("satisfaction_variability"),
    countDistinct("room_type").alias("room_types_used"),
    countDistinct("booking_channel").alias("channels_used"),
    countDistinct(date_format("check_in_date", "yyyy-MM")).alias("active_months"),
    datediff(current_date(), max("booking_date")).alias("days_since_last_booking"),
    datediff(current_date(), min("booking_date")).alias("customer_tenure_days"),
    round(avg("booking_lead_time_days"), 2).alias("avg_advance_booking_days")
)

# Add derived fields
guest_analytics = guest_analytics.withColumn(
    "lifetime_value_segment",
    expr("""
    CASE 
        WHEN total_spent > 5000 THEN 'High-Value'
        WHEN total_spent > 2000 THEN 'Medium-Value'
        ELSE 'Standard-Value'
    END
    """)
).withColumn("updated_timestamp", current_timestamp())

# Add preferred room type and channel using window functions.
# groupBy(...).count() names its output column "count", so the windows order by that column.
# row_number() with a secondary sort key keeps exactly one row per guest, even on ties.
room_window = Window.partitionBy("guest_id").orderBy(desc("count"), col("room_type"))
channel_window = Window.partitionBy("guest_id").orderBy(desc("count"), col("booking_channel"))

room_prefs = silver_df.groupBy("guest_id", "room_type").count() \
    .withColumn("rank", row_number().over(room_window)) \
    .filter("rank = 1") \
    .select("guest_id", col("room_type").alias("preferred_room_type"))

channel_prefs = silver_df.groupBy("guest_id", "booking_channel").count() \
    .withColumn("rank", row_number().over(channel_window)) \
    .filter("rank = 1") \
    .select("guest_id", col("booking_channel").alias("preferred_channel"))

# Join preferences
guest_analytics = guest_analytics \
    .join(room_prefs, "guest_id", "left") \
    .join(channel_prefs, "guest_id", "left")

print(f"Generated guest analytics for {guest_analytics.count()} guests")
guest_analytics.show(5)

In [None]:
# Generate revenue analytics

# Monthly revenue analytics
room_analytics = silver_df.groupBy(
    trunc(col("booking_date"), "month").alias("date_dimension")  # first day of the month, as a DATE
).agg(
    count("*").alias("total_bookings"),
    round(sum("total_revenue"), 2).alias("total_revenue"),
    round(avg("total_revenue"), 2).alias("avg_revenue"),
    round(avg("guest_satisfaction"), 2).alias("avg_satisfaction"),
    countDistinct("guest_id").alias("unique_guests")
).withColumn("dimension_type", lit("monthly")) \
 .withColumn("dimension_value", date_format(col("date_dimension"), "yyyy-MM")) \
 .withColumn("updated_timestamp", current_timestamp())

# Add mix data (simplified)
room_analytics = room_analytics.withColumn("booking_channel_mix", lit(None).cast("map<string,int>")) \
    .withColumn("room_type_mix", lit(None).cast("map<string,int>"))

print(f"Generated monthly revenue analytics for {room_analytics.count()} months")
room_analytics.show(5)

In [None]:
# Save gold layer analytics data

guest_analytics.write.mode("overwrite").saveAsTable("hospitality.gold.guest_analytics")
room_analytics.write.mode("overwrite").saveAsTable("hospitality.gold.revenue_analytics")

print("Gold layer analytics tables populated successfully!")

In [None]:
# Train churn prediction model and generate predictions

from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

# Read guest analytics for ML
guest_ml_data = spark.table("hospitality.gold.guest_analytics")

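# Create churn risk label (simulated).
# Note: the label is derived from columns that are also model features,
# so the AUC reported below will be optimistic.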
guest_ml_data = guest_ml_data.withColumn(
    "churn_risk",
    when(
        (col("days_since_last_booking") > 90) | 
        (col("avg_satisfaction") < 7) | 
        (col("total_bookings") < 3),
        1
    ).otherwise(0)
)

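# Feature engineering
feature_cols = [
    "total_bookings", "total_spent", "avg_booking_value", "avg_satisfaction", 
    "satisfaction_variability", "room_types_used", "channels_used", 
    "active_months", "days_since_last_booking", "customer_tenure_days", 
    "avg_advance_booking_days"
]

# VectorAssembler rejects nulls by default. Guests with too few parseable revenue or
# satisfaction values have null aggregates, so fill missing features with 0 for this demo.
guest_ml_data = guest_ml_data.fillna(0, subset=feature_cols)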

assembler = VectorAssembler(
    inputCols=feature_cols,
    outputCol="features"
)

scaler = StandardScaler(inputCol="features", outputCol="scaled_features")

# Train model
rf = RandomForestClassifier(
    labelCol="churn_risk", 
    featuresCol="scaled_features",
    numTrees=100,
    maxDepth=10,
    seed=42
)

pipeline = Pipeline(stages=[assembler, scaler, rf])

# Split data
train_data, test_data = guest_ml_data.randomSplit([0.8, 0.2], seed=42)

print(f"Training set: {train_data.count()} guests")
print(f"Test set: {test_data.count()} guests")

# Train model
print("Training churn prediction model...")
model = pipeline.fit(train_data)

# Make predictions
predictions = model.transform(test_data)

# Evaluate
evaluator = BinaryClassificationEvaluator(labelCol="churn_risk", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)

print(f"Model AUC: {auc:.4f}")

# Generate churn predictions for all guests
all_predictions = model.transform(guest_ml_data)

# Create churn predictions table data
churn_predictions = all_predictions.select(
    "guest_id",
    round(col("probability").getItem(1), 3).alias("churn_probability"),
    when(col("prediction") == 1, "High").when(col("probability").getItem(1) > 0.3, "Medium").otherwise("Low").alias("churn_risk_level"),
    (col("prediction") == 1).alias("predicted_churn"),
    lit(None).cast("map<string,decimal(5,4)>").alias("feature_importance"),  # Simplified: left null in this demo
    when(col("prediction") == 1, "Urgent retention campaign needed")
    .when(col("probability").getItem(1) > 0.3, "Monitor closely and send loyalty offers")
    .otherwise("Maintain regular communication").alias("intervention_recommendation"),
    col("total_spent").alias("potential_lifetime_value"),
    current_timestamp().alias("prediction_timestamp"),
    lit("v1.0").alias("model_version")
)

print(f"Generated churn predictions for {churn_predictions.count()} guests")
churn_predictions.show(5)
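
The risk-level rule in the `when` chain above can be restated as a plain function, which makes the thresholds easy to unit test. This is a sketch that mirrors the rule as written (a positive prediction is always High; otherwise a churn probability above 0.3 is Medium). `risk_level` is a hypothetical helper name, and the sample inputs are made up.

```python
def risk_level(prediction: int, churn_probability: float) -> str:
    """Map a model prediction and churn probability to a risk label."""
    if prediction == 1:
        return "High"
    if churn_probability > 0.3:
        return "Medium"
    return "Low"

for pred, prob in [(1, 0.82), (0, 0.45), (0, 0.30), (0, 0.05)]:
    print(pred, prob, risk_level(pred, prob))
```

Note that a probability of exactly 0.3 maps to Low, because the comparison is strict.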

In [None]:
# Save churn predictions to gold layer

churn_predictions.write.mode("overwrite").saveAsTable("hospitality.gold.churn_predictions")

print("Churn predictions saved to gold layer!")
print("Gold layer now contains complete analytics and ML predictions")

# Business impact summary
high_risk_guests = churn_predictions.filter("churn_risk_level = 'High'").count()
total_guests = churn_predictions.count()
avg_lifetime_value = churn_predictions.agg(avg("potential_lifetime_value")).collect()[0][0]

print(f"\nBusiness Impact Summary:")
print(f"Total guests analyzed: {total_guests}")
print(f"High-risk churn guests identified: {high_risk_guests}")
print(f"Average lifetime value per guest: ${avg_lifetime_value:,.2f}")
print(f"Potential revenue at risk: ${(high_risk_guests * avg_lifetime_value):,.0f}")

## Key Takeaways: Medallion Architecture in AIDP

### What We Demonstrated

1. **Bronze Layer**: Raw data ingestion with data quality issues preserved
2. **Silver Layer**: Data cleaning, standardization, and validation
3. **Gold Layer**: Business analytics and machine learning predictions
4. **End-to-End Pipeline**: Complete data processing from raw to insights

### Medallion Architecture Benefits

- **Data Quality**: Progressive improvement through layers
- **Governance**: Clear data lineage and audit trails
- **Performance**: Optimized clustering for different access patterns
- **Flexibility**: Each layer serves different use cases
- **Maintainability**: Clear separation of concerns

### Business Value for Hospitality

1. **Data Quality Management**: Track and improve data quality from source
2. **Customer Insights**: Rich analytics on guest behavior and preferences
3. **Revenue Optimization**: ML-driven churn prevention and lifetime value
4. **Operational Intelligence**: Multi-dimensional analytics for decision making
5. **Scalability**: Architecture scales with business growth

### AIDP Advantages

- **Unified Platform**: Single environment for all data processing layers
- **Delta Lake**: ACID transactions, time travel, and optimized performance
- **ML Integration**: Seamless ML training and deployment
- **Governance**: Catalog and schema isolation
- **Performance**: Automatic optimization and clustering

### Best Practices

1. **Layer Design**: Each layer has a specific purpose and audience
2. **Data Contracts**: Define clear schemas and expectations per layer
3. **Quality Gates**: Implement data quality checks between layers
4. **Access Control**: Different permissions for different layers
5. **Monitoring**: Track data quality and pipeline health
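
A quality gate between layers can be as simple as a null-rate check that blocks promotion when too many values are missing. The sketch below works on a list of dictionaries. The function name `null_rate_gate`, the 20% threshold, and the sample batch are assumptions chosen for illustration, not values taken from this notebook's pipeline.

```python
def null_rate_gate(rows, column, max_null_rate=0.20):
    """Return (passed, null_rate) for one column of a batch of records."""
    if not rows:
        return False, 0.0  # an empty batch should not be promoted silently
    nulls = sum(1 for row in rows if row.get(column) is None)
    rate = nulls / len(rows)
    return rate <= max_null_rate, rate

# Illustrative batch: one of four revenue values is missing.
batch = [
    {"guest_id": "GST000001", "total_revenue": 1488.17},
    {"guest_id": "GST000001", "total_revenue": 856.31},
    {"guest_id": "GST000002", "total_revenue": None},
    {"guest_id": "GST000003", "total_revenue": 410.00},
]

passed, rate = null_rate_gate(batch, "total_revenue")
print(f"null rate {rate:.2f}, gate passed: {passed}")
```

In a Spark pipeline the same check would typically run as an aggregation over the silver DataFrame before the gold tables are written.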

### Next Steps

- Add real-time data ingestion to bronze layer
- Implement automated data quality monitoring
- Deploy ML models for real-time predictions
- Create APIs for gold layer analytics
- Add more sophisticated ML models and features
- Integrate with actual hospitality systems

This notebook demonstrates how a medallion architecture on Oracle AI Data Platform takes hospitality data from raw bookings to analytics-ready tables and churn predictions.