# Hospitality: Iceberg and Liquid Clustering Demo


## Overview


This notebook demonstrates the power of **Iceberg and Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using a hospitality and tourism analytics use case. Liquid clustering automatically optimizes data layout for query performance without requiring manual partitioning or Z-Ordering.

### What is Iceberg?

Apache Iceberg is an open table format for huge analytic datasets that provides:

- **Schema evolution**: Add, drop, rename, update columns without rewriting data
- **Partition evolution**: Change partitioning without disrupting queries
- **Time travel**: Query historical data snapshots for auditing and rollback
- **ACID transactions**: Reliable concurrent read/write operations
- **Cross-engine compatibility**: Works with Spark, Flink, Presto, Hive, and more
- **Open ecosystem**: Apache 2.0 licensed, community-driven development

### Delta Universal Format with Iceberg

Delta Universal Format (UniForm) lets Iceberg clients read a Delta table while the table keeps Delta's advanced features, such as liquid clustering. This combination provides:

- **Best of both worlds**: Delta's performance optimizations with Iceberg's openness
- **Multi-engine access**: Query the same data from different analytics engines
- **Future-proof architecture**: Standards-based approach for long-term data investments
- **Enhanced governance**: Rich metadata and catalog integration

### What is Liquid Clustering?

Liquid clustering groups rows with similar values in the clustering columns you define into the same files. This optimization happens automatically during data ingestion and maintenance operations, providing:

- **Automatic optimization**: No manual tuning required
- **Improved query performance**: Faster queries on clustered columns
- **Reduced maintenance**: No need for manual repartitioning
- **Adaptive clustering**: Adjusts as data patterns change
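
To make the idea of grouping similar data concrete, here is a minimal, Spark-free sketch of file skipping. It assumes each "file" carries min/max statistics for `guest_id` and that a reader skips any file whose range cannot contain the requested value. The helpers `build_files` and `files_scanned` are hypothetical, and the sorted layout is a one-column simplification of what liquid clustering does, not the actual algorithm.

```python
import random

def build_files(values, rows_per_file):
    """Split values into fixed-size 'files' and record min/max stats for each."""
    files = []
    for start in range(0, len(values), rows_per_file):
        chunk = values[start:start + rows_per_file]
        files.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return files

def files_scanned(files, target):
    """Count files whose [min, max] range could contain the target value."""
    return sum(1 for f in files if f["min"] <= target <= f["max"])

random.seed(42)

# 1,000 hypothetical guests with 5 bookings each
guest_ids = [f"GST{n:06d}" for n in range(1, 1001) for _ in range(5)]

# Unclustered layout: rows arrive in random order
unclustered = guest_ids[:]
random.shuffle(unclustered)

# Clustered layout (simplified): rows for the same guest sit next to each other
clustered = sorted(guest_ids)

unclustered_files = build_files(unclustered, rows_per_file=500)
clustered_files = build_files(clustered, rows_per_file=500)

target = "GST000500"
scanned_unclustered = files_scanned(unclustered_files, target)
scanned_clustered = files_scanned(clustered_files, target)

print(f"Unclustered: {scanned_unclustered} of {len(unclustered_files)} files scanned")
print(f"Clustered:   {scanned_clustered} of {len(clustered_files)} files scanned")
```

In the clustered layout only the file that holds the requested guest has a matching range, which is the effect the clustering columns are meant to produce.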

### Use Case: Hotel Guest Experience and Revenue Management

We'll analyze hotel booking and guest experience data. Our clustering strategy will optimize for:

- **Guest-specific queries**: Fast lookups by guest ID
- **Time-based analysis**: Efficient filtering by booking and stay dates
- **Revenue patterns**: Quick aggregation by room type and booking channels

## Step 1: AIDP Environment Setup

This notebook leverages the existing Spark session in your AIDP environment.

In [1]:
# Create hospitality catalog and analytics schema

# In AIDP, catalogs provide data isolation and governance

spark.sql("CREATE CATALOG IF NOT EXISTS hospitality")

spark.sql("CREATE SCHEMA IF NOT EXISTS hospitality.analytics")

print("Hospitality catalog and analytics schema created successfully!")

Hospitality catalog and analytics schema created successfully!


## Step 2: Create Delta Table with Liquid Clustering

### Table Design

Our `guest_stays_uf` table will store:

- **guest_id**: Unique guest identifier
- **booking_date**: Date booking was made
- **check_in_date**: Guest arrival date
- **room_type**: Type of room booked
- **booking_channel**: How booking was made (OTA, Direct, etc.)
- **total_revenue**: Total booking revenue
- **guest_satisfaction**: Guest satisfaction score (1-10)

### Clustering Strategy

We'll cluster by `guest_id` and `booking_date` because:

- **guest_id**: Guests often make multiple bookings, grouping their stay history together
- **booking_date**: Time-based queries are critical for revenue analysis, seasonal trends, and booking patterns
- This combination optimizes for both guest relationship management and temporal revenue analytics
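
One way to sanity-check candidate clustering columns is to compare how many distinct values each one has in a sample of records. The sketch below is a plain-Python illustration under that assumption; `cardinality_report` is a hypothetical helper and the sample rows are made up for the example.

```python
def cardinality_report(records, columns):
    """Return {column: (distinct_count, distinct_ratio)} for a list of dict records."""
    total = len(records)
    report = {}
    for col in columns:
        distinct = len({row[col] for row in records})
        ratio = distinct / total if total else 0.0
        report[col] = (distinct, ratio)
    return report

# Illustrative sample only; not taken from the real table
sample = [
    {"guest_id": "GST000001", "booking_date": "2024-01-09", "room_type": "Suite"},
    {"guest_id": "GST000001", "booking_date": "2024-03-05", "room_type": "Standard"},
    {"guest_id": "GST000002", "booking_date": "2024-04-23", "room_type": "Suite"},
    {"guest_id": "GST000002", "booking_date": "2024-07-01", "room_type": "Standard"},
    {"guest_id": "GST000003", "booking_date": "2024-04-13", "room_type": "Suite"},
    {"guest_id": "GST000003", "booking_date": "2024-11-09", "room_type": "Standard"},
]

report = cardinality_report(sample, ["guest_id", "booking_date", "room_type"])
for column, (distinct, ratio) in report.items():
    print(f"{column}: {distinct} distinct values ({ratio:.0%} of rows)")
```

Columns such as `guest_id` and `booking_date` have many distinct values relative to the row count, while `room_type` has only a handful, which is one reason the latter is a weaker clustering candidate.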

In [1]:
# Create Delta table with liquid clustering

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType

# DataFrame schema used when loading the generated data in Step 4.
# total_revenue is a double here but DECIMAL(8,2) in the table below,
# so values are cast when they are inserted.
data_schema = StructType([
    StructField("guest_id", StringType(), True),
    StructField("booking_date", DateType(), True),
    StructField("check_in_date", DateType(), True),
    StructField("room_type", StringType(), True),
    StructField("booking_channel", StringType(), True),
    StructField("total_revenue", DoubleType(), True),
    StructField("guest_satisfaction", IntegerType(), True)
])

# CLUSTER BY defines the columns for automatic optimization;
# the universalFormat table property enables Iceberg compatibility

spark.sql("""

CREATE TABLE IF NOT EXISTS hospitality.analytics.guest_stays_uf (
    guest_id STRING,
    booking_date DATE,
    check_in_date DATE,
    room_type STRING,
    booking_channel STRING,
    total_revenue DECIMAL(8,2),
    guest_satisfaction INT
)

USING DELTA

TBLPROPERTIES('delta.universalFormat.enabledFormats' = 'iceberg') CLUSTER BY (guest_id, booking_date)

""")

print("Delta table with Iceberg compatibility and liquid clustering created successfully!")

print("Universal format enables Iceberg features while CLUSTER BY (columns) optimizes data layout.")

Delta table with Iceberg compatibility and liquid clustering created successfully!
Universal format enables Iceberg features while CLUSTER BY (columns) optimizes data layout.


## Step 3: Generate Hospitality Sample Data

### Data Generation Strategy

We'll create realistic hotel booking and guest data including:

- **5,000 guests** with multiple bookings over time
- **Room types**: Standard, Deluxe, Suite, Executive
- **Booking channels**: Direct, Online Travel Agency, Corporate, Walk-in
- **Seasonal patterns**: Peak seasons, weekend vs weekday pricing

### Why This Data Pattern?

This data simulates real hospitality scenarios where:

- Guest loyalty programs require historical booking tracking
- Revenue management depends on booking channel analysis
- Seasonal pricing strategies drive occupancy optimization
- Guest satisfaction impacts reputation and repeat business
- Channel performance requires continuous monitoring
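
The seasonal and weekend pricing described above can be sketched as two small functions. The factors (1.3 for summer, 1.4 for the holiday season, 1.2 for weekend check-ins) mirror the ones used in the generator cell that follows; the function names and the `expected_revenue` helper are hypothetical, and the random variation applied by the generator is left out so the result is deterministic.

```python
from datetime import date

def seasonal_factor(check_in):
    """Price multiplier based on the check-in month."""
    if check_in.month in (6, 7, 8):    # summer peak
        return 1.3
    if check_in.month in (11, 12):     # holiday season
        return 1.4
    return 1.0

def weekend_factor(check_in):
    """Price multiplier for Saturday (5) and Sunday (6) check-ins."""
    return 1.2 if check_in.weekday() >= 5 else 1.0

def expected_revenue(base_rate, nights, check_in, channel_margin):
    """Revenue before random variation, rounded to cents."""
    return round(
        base_rate * nights * seasonal_factor(check_in)
        * weekend_factor(check_in) * channel_margin,
        2,
    )

# Suite (base rate 350), 3 nights, Saturday check-in in July, booked Direct
summer_weekend = expected_revenue(350, 3, date(2024, 7, 6), 1.0)

# Standard (base rate 120), 2 nights, Tuesday check-in in March, booked via an OTA
offpeak_weekday = expected_revenue(120, 2, date(2024, 3, 5), 0.85)

print(summer_weekend, offpeak_weekday)
```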

In [1]:
# Generate sample hospitality guest booking data

# Only standard-library modules are needed to generate the records

import random

from datetime import datetime, timedelta


# Define hospitality data constants

ROOM_TYPES = ['Standard', 'Deluxe', 'Suite', 'Executive']

BOOKING_CHANNELS = ['Direct', 'Online Travel Agency', 'Corporate', 'Walk-in']

# Base revenue parameters by room type

REVENUE_PARAMS = {

    'Standard': {'base_rate': 120, 'satisfaction': 7.8},

    'Deluxe': {'base_rate': 200, 'satisfaction': 8.2},

    'Suite': {'base_rate': 350, 'satisfaction': 8.8},

    'Executive': {'base_rate': 280, 'satisfaction': 8.5}

}

# Channel margins (affect final revenue)

CHANNEL_MARGINS = {

    'Direct': 1.0,

    'Online Travel Agency': 0.85,

    'Corporate': 0.90,

    'Walk-in': 0.95

}


# Generate guest booking records

booking_data = []

base_date = datetime(2024, 1, 1)


# Create 5,000 guests with 2-8 bookings each

for guest_num in range(1, 5001):

    guest_id = f"GST{guest_num:06d}"
    
    # Each guest gets 2-8 bookings over 12 months

    num_bookings = random.randint(2, 8)
    
    for i in range(num_bookings):

        # Spread bookings over 12 months

        days_offset = random.randint(0, 365)

        booking_date = base_date + timedelta(days=days_offset)
        
        # Check-in date (1-30 days after the booking date)

        checkin_offset = random.randint(1, 30)

        check_in_date = booking_date + timedelta(days=checkin_offset)
        
        # Select room type

        room_type = random.choice(ROOM_TYPES)

        params = REVENUE_PARAMS[room_type]
        
        # Select booking channel

        booking_channel = random.choice(BOOKING_CHANNELS)

        channel_margin = CHANNEL_MARGINS[booking_channel]
        
        # Calculate revenue with variations

        # Seasonal pricing (higher in peak season)

        month = check_in_date.month

        if month in [6, 7, 8]:  # Summer peak

            seasonal_factor = 1.3

        elif month in [11, 12]:  # Holiday season

            seasonal_factor = 1.4

        else:

            seasonal_factor = 1.0
        
        # Weekend pricing

        if check_in_date.weekday() >= 5:  # Saturday = 5, Sunday = 6

            weekend_factor = 1.2

        else:

            weekend_factor = 1.0
        
        # Stay length (1-7 nights)

        stay_length = random.randint(1, 7)
        
        # Calculate total revenue

        revenue_variation = random.uniform(0.9, 1.1)

        total_revenue = round(params['base_rate'] * stay_length * seasonal_factor * weekend_factor * channel_margin * revenue_variation, 2)
        
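        # Guest satisfaction: room-type baseline plus a random offset, clamped to 1-10.
        # The int() cast applied when the record is built truncates the fraction
        # (for example, 7.8 + 1 = 8.8 is stored as 8).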

        satisfaction_variation = random.randint(-2, 2)

        guest_satisfaction = max(1, min(10, params['satisfaction'] + satisfaction_variation))
        
        booking_data.append({

            "guest_id": guest_id,

            "booking_date": booking_date.date(),

            "check_in_date": check_in_date.date(),

            "room_type": room_type,

            "booking_channel": booking_channel,

            "total_revenue": float(total_revenue),

            "guest_satisfaction": int(guest_satisfaction)

        })



print(f"Generated {len(booking_data)} guest booking records")

print("Sample record:", booking_data[0])

Generated 24914 guest booking records
Sample record: {'guest_id': 'GST000001', 'booking_date': datetime.date(2024, 1, 9), 'check_in_date': datetime.date(2024, 2, 1), 'room_type': 'Suite', 'booking_channel': 'Corporate', 'total_revenue': 2041.43, 'guest_satisfaction': 10}
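
A side effect of the generator is worth spelling out: the satisfaction baselines are fractional (7.8 to 8.8), but the stored value passes through `int()`, which truncates. The sketch below enumerates the five equally likely offsets from `random.randint(-2, 2)` and computes the mean stored score per room type. The baselines are copied from `REVENUE_PARAMS`; the helper names are hypothetical.

```python
# Satisfaction baselines copied from REVENUE_PARAMS in the generator cell
BASELINES = {"Standard": 7.8, "Deluxe": 8.2, "Suite": 8.8, "Executive": 8.5}

def stored_score(baseline, variation):
    """Replicate the generator: clamp to 1-10, then truncate with int()."""
    return int(max(1, min(10, baseline + variation)))

def expected_stored_mean(baseline):
    """Mean stored score over the equally likely offsets -2..2."""
    scores = [stored_score(baseline, v) for v in range(-2, 3)]
    return sum(scores) / len(scores)

expected = {room: expected_stored_mean(b) for room, b in BASELINES.items()}
for room, mean in expected.items():
    print(f"{room}: expected stored mean = {mean}")
```

This is why the room-type averages reported in Step 6 sit near 7 for Standard and near 8 for the other room types rather than at the configured baselines. If preserving the baselines matters, rounding instead of truncating (or storing a decimal score) would be one option.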


## Step 4: Insert Data Using PySpark

### Data Insertion Strategy

We'll use PySpark to:

1. **Create DataFrame** from our generated data
2. **Insert into Delta table** with liquid clustering
3. **Verify the insertion** with a sample query

### Why PySpark for Insertion?

- **Distributed processing**: Handles large datasets efficiently
- **Type safety**: Ensures data integrity
- **Optimization**: Leverages Spark's query optimization
- **Liquid clustering**: Automatically applies clustering during insertion
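
Because the table declares `total_revenue` as `DECIMAL(8,2)` while the generated values are Python floats, a quick pre-insert check can confirm that every value fits. This is a minimal sketch using only the standard library; `to_decimal_8_2` is a hypothetical helper, and the half-up rounding mode is an assumption rather than a statement about how the platform casts values.

```python
from decimal import Decimal, ROUND_HALF_UP

# Largest magnitude representable in DECIMAL(8,2): 6 integer digits, 2 fractional
DECIMAL_8_2_MAX = Decimal("999999.99")

def to_decimal_8_2(value):
    """Convert a float to a 2-decimal value, raising if it exceeds DECIMAL(8,2)."""
    # str() avoids carrying binary floating-point noise into the Decimal
    d = Decimal(str(value)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    if abs(d) > DECIMAL_8_2_MAX:
        raise ValueError(f"{value} does not fit in DECIMAL(8,2)")
    return d

# Values taken from the sample output in this notebook
revenues = [2041.43, 856.96, 246.68, 2471.3, 191.46]
checked = [to_decimal_8_2(r) for r in revenues]
print(checked)
```

Running a check like this over `booking_data` before the insert surfaces out-of-range values early instead of at write time.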

In [1]:
# Insert data using PySpark DataFrame operations

# data_schema is the StructType defined in the Step 2 table-creation cell


# Create DataFrame from generated data

df_bookings = spark.createDataFrame(booking_data, schema=data_schema)


# Display schema and sample data

print("DataFrame Schema:")

df_bookings.printSchema()



print("\nSample Data:")

df_bookings.show(5)


# Insert data into Delta table with liquid clustering

# insertInto matches columns by position, so the DataFrame column order must match the table definition.
# The table's CLUSTER BY (guest_id, booking_date) setting governs how the written data is laid out.

df_bookings.write.mode("overwrite").insertInto("hospitality.analytics.guest_stays_uf")


print(f"\nSuccessfully inserted {df_bookings.count()} records into hospitality.analytics.guest_stays_uf")

print("Liquid clustering automatically optimized the data layout during insertion!")

DataFrame Schema:
root
 |-- guest_id: string (nullable = true)
 |-- booking_date: date (nullable = true)
 |-- check_in_date: date (nullable = true)
 |-- room_type: string (nullable = true)
 |-- booking_channel: string (nullable = true)
 |-- total_revenue: double (nullable = true)
 |-- guest_satisfaction: integer (nullable = true)


Sample Data:


+---------+------------+-------------+---------+--------------------+-------------+------------------+
| guest_id|booking_date|check_in_date|room_type|     booking_channel|total_revenue|guest_satisfaction|
+---------+------------+-------------+---------+--------------------+-------------+------------------+
|GST000001|  2024-01-09|   2024-02-01|    Suite|           Corporate|      2041.43|                10|
|GST000001|  2024-05-10|   2024-05-18|Executive|           Corporate|       856.96|                 6|
|GST000001|  2024-03-05|   2024-03-23| Standard|Online Travel Agency|       246.68|                 9|
|GST000001|  2024-02-14|   2024-03-10|    Suite|           Corporate|       2471.3|                 7|
|GST000001|  2024-01-21|   2024-02-14| Standard|Online Travel Agency|       191.46|                 9|
+---------+------------+-------------+---------+--------------------+-------------+------------------+
only showing top 5 rows




Successfully inserted 24914 records into hospitality.analytics.guest_stays_uf
Liquid clustering automatically optimized the data layout during insertion!


## Step 5: Demonstrate Liquid Clustering Benefits

### Query Performance Analysis

Now let's run queries that match our clustering strategy and should therefore benefit from it:

1. **Guest booking history** (clustered by guest_id)
2. **Time-based revenue analysis** (clustered by booking_date)
3. **Combined guest + time queries** (optimal for our clustering)

### Expected Performance Benefits

On a table this small (about 25,000 rows) the speedup may be hard to measure, but on larger tables these queries should read far less data because:

- **Data locality**: Related records are physically grouped together
- **Reduced I/O**: Less data needs to be read from disk
- **Automatic optimization**: No manual tuning required
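
The notebook does not time its queries, so any performance claim is easier to trust with a measurement. The helper below is a small sketch for timing an arbitrary action several times and reporting the median; `time_action` is a hypothetical name, and the example uses a pure-Python stand-in so it runs without a Spark session.

```python
import statistics
import time

def time_action(action, repeats=3):
    """Run a zero-argument callable several times; return (last result, median seconds)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = action()
        durations.append(time.perf_counter() - start)
    return result, statistics.median(durations)

# Stand-in workload; in the notebook this would be a query action
result, seconds = time_action(lambda: sum(range(100_000)))
print(f"result={result}, median time={seconds:.6f}s")
```

In the notebook, the same helper could wrap a query action, for example `time_action(lambda: spark.sql(query).count())`, where `query` holds one of the SQL strings from the next cell. Repeating the run helps separate caching effects from the layout itself.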

In [1]:
# Demonstrate liquid clustering benefits with optimized queries


# Query 1: Guest booking history - benefits from guest_id clustering

print("=== Query 1: Guest Booking History ===")

guest_history = spark.sql("""

SELECT guest_id, booking_date, room_type, total_revenue, guest_satisfaction

FROM hospitality.analytics.guest_stays_uf

WHERE guest_id = 'GST000001'

ORDER BY booking_date DESC

""")



guest_history.show()

print(f"Records found: {guest_history.count()}")



# Query 2: Time-based revenue analysis - benefits from booking_date clustering

print("\n=== Query 2: Recent High-Value Bookings ===")

high_value = spark.sql("""

SELECT booking_date, guest_id, room_type, total_revenue, booking_channel

FROM hospitality.analytics.guest_stays_uf

WHERE booking_date >= '2024-06-01' AND total_revenue > 1000

ORDER BY total_revenue DESC, booking_date DESC

""")



high_value.show()

print(f"High-value bookings found: {high_value.count()}")



# Query 3: Combined guest + time query - optimal for our clustering strategy

print("\n=== Query 3: Guest Spending Trends ===")

spending_trends = spark.sql("""

SELECT guest_id, booking_date, room_type, total_revenue, guest_satisfaction

FROM hospitality.analytics.guest_stays_uf

WHERE guest_id LIKE 'GST000%' AND booking_date >= '2024-04-01'

ORDER BY guest_id, booking_date

""")



spending_trends.show()

print(f"Spending trend records found: {spending_trends.count()}")

=== Query 1: Guest Booking History ===


+---------+------------+---------+-------------+------------------+
| guest_id|booking_date|room_type|total_revenue|guest_satisfaction|
+---------+------------+---------+-------------+------------------+
|GST000001|  2024-05-10|Executive|       856.96|                 6|
|GST000001|  2024-04-18|   Deluxe|      1232.28|                10|
|GST000001|  2024-03-05| Standard|       246.68|                 9|
|GST000001|  2024-02-14|    Suite|      2471.30|                 7|
|GST000001|  2024-01-21| Standard|       191.46|                 9|
|GST000001|  2024-01-09|    Suite|      2041.43|                10|
+---------+------------+---------+-------------+------------------+



Records found: 6

=== Query 2: Recent High-Value Bookings ===


+------------+---------+---------+-------------+---------------+
|booking_date| guest_id|room_type|total_revenue|booking_channel|
+------------+---------+---------+-------------+---------------+
|  2024-11-04|GST002768|    Suite|      4489.98|         Direct|
|  2024-12-03|GST003830|    Suite|      4472.90|         Direct|
|  2024-12-14|GST003968|    Suite|      4393.87|         Direct|
|  2024-12-17|GST003888|    Suite|      4392.39|         Direct|
|  2024-11-07|GST004989|    Suite|      4315.92|         Direct|
|  2024-12-02|GST002698|    Suite|      4292.21|        Walk-in|
|  2024-11-26|GST003979|    Suite|      4255.33|         Direct|
|  2024-10-28|GST001861|    Suite|      4251.18|        Walk-in|
|  2024-06-17|GST002087|    Suite|      4186.43|         Direct|
|  2024-12-20|GST001747|    Suite|      4173.54|         Direct|
|  2024-07-04|GST003253|    Suite|      4163.35|         Direct|
|  2024-07-04|GST001727|    Suite|      4139.95|         Direct|
|  2024-11-22|GST004743| 
[output truncated]

High-value bookings found: 6899

=== Query 3: Guest Spending Trends ===


+---------+------------+---------+-------------+------------------+
| guest_id|booking_date|room_type|total_revenue|guest_satisfaction|
+---------+------------+---------+-------------+------------------+
|GST000001|  2024-04-18|   Deluxe|      1232.28|                10|
|GST000001|  2024-05-10|Executive|       856.96|                 6|
|GST000002|  2024-04-23|Executive|       549.34|                 9|
|GST000002|  2024-07-01|Executive|      2438.54|                10|
|GST000002|  2024-07-03|   Deluxe|       940.31|                 7|
|GST000002|  2024-08-22| Standard|       331.96|                 8|
|GST000003|  2024-04-13|Executive|      1380.31|                 6|
|GST000003|  2024-11-09| Standard|       546.23|                 6|
|GST000004|  2024-05-13|   Deluxe|       806.09|                 8|
|GST000004|  2024-10-17| Standard|       461.22|                 9|
|GST000004|  2024-10-29|   Deluxe|       346.74|                 9|
|GST000005|  2024-05-18|    Suite|      2579.93|
[output truncated]

Spending trend records found: 3735


## Step 6: Analyze Clustering Effectiveness

### Understanding the Impact

Let's examine how liquid clustering has organized our data and analyze some aggregate statistics to demonstrate the hospitality insights possible with this optimized structure.

### Key Analytics

- **Guest loyalty patterns** and repeat booking analysis
- **Revenue performance** by room type and booking channel
- **Seasonal trends** and occupancy optimization
- **Guest satisfaction** and service quality metrics

In [1]:
# Analyze clustering effectiveness and hospitality insights


# Guest loyalty analysis

print("=== Guest Loyalty Analysis ===")

guest_loyalty = spark.sql("""

SELECT guest_id, COUNT(*) as total_bookings,

       ROUND(SUM(total_revenue), 2) as total_spent,

       ROUND(AVG(total_revenue), 2) as avg_booking_value,

       ROUND(AVG(guest_satisfaction), 2) as avg_satisfaction,

       MAX(booking_date) as last_booking_date

FROM hospitality.analytics.guest_stays_uf

GROUP BY guest_id

ORDER BY total_spent DESC

LIMIT 10

""")



guest_loyalty.show()


# Room type performance

print("\n=== Room Type Performance ===")

room_performance = spark.sql("""

SELECT room_type, COUNT(*) as total_bookings,

       ROUND(SUM(total_revenue), 2) as total_revenue,

       ROUND(AVG(total_revenue), 2) as avg_revenue_per_booking,

       ROUND(AVG(guest_satisfaction), 2) as avg_satisfaction,

       COUNT(DISTINCT guest_id) as unique_guests

FROM hospitality.analytics.guest_stays_uf

GROUP BY room_type

ORDER BY total_revenue DESC

""")



room_performance.show()


# Booking channel analysis

print("\n=== Booking Channel Performance ===")

channel_analysis = spark.sql("""

SELECT booking_channel, COUNT(*) as total_bookings,

       ROUND(SUM(total_revenue), 2) as total_revenue,

       ROUND(AVG(total_revenue), 2) as avg_revenue,

       ROUND(AVG(guest_satisfaction), 2) as avg_satisfaction,

       COUNT(DISTINCT guest_id) as unique_guests

FROM hospitality.analytics.guest_stays_uf

GROUP BY booking_channel

ORDER BY total_revenue DESC

""")



channel_analysis.show()


# Monthly revenue trends

print("\n=== Monthly Revenue Trends ===")

monthly_trends = spark.sql("""

SELECT DATE_FORMAT(booking_date, 'yyyy-MM') as month,

       COUNT(*) as total_bookings,

       ROUND(SUM(total_revenue), 2) as monthly_revenue,

       ROUND(AVG(total_revenue), 2) as avg_booking_value,

       ROUND(AVG(guest_satisfaction), 2) as avg_satisfaction,

       COUNT(DISTINCT guest_id) as unique_guests

FROM hospitality.analytics.guest_stays_uf

GROUP BY DATE_FORMAT(booking_date, 'yyyy-MM')

ORDER BY month

""")



monthly_trends.show()

=== Guest Loyalty Analysis ===


+---------+--------------+-----------+-----------------+----------------+-----------------+
| guest_id|total_bookings|total_spent|avg_booking_value|avg_satisfaction|last_booking_date|
+---------+--------------+-----------+-----------------+----------------+-----------------+
|GST001701|             8|   13802.68|          1725.34|            8.25|       2024-11-27|
|GST000443|             7|   13726.36|          1960.91|            8.43|       2024-12-30|
|GST000301|             8|   13549.10|          1693.64|             7.5|       2024-10-13|
|GST004536|             8|   13531.14|          1691.39|             8.0|       2024-12-13|
|GST003692|             8|   13463.98|          1683.00|            7.88|       2024-11-23|
|GST000836|             7|   13454.50|          1922.07|            8.14|       2024-12-06|
|GST002452|             8|   13357.35|          1669.67|            8.13|       2024-08-18|
|GST001623|             8|   13340.68|          1667.59|             8.5|       
[output truncated]

=== Room Type Performance ===


+---------+--------------+-------------+-----------------------+----------------+-------------+
|room_type|total_bookings|total_revenue|avg_revenue_per_booking|avg_satisfaction|unique_guests|
+---------+--------------+-------------+-----------------------+----------------+-------------+
|    Suite|          6282|   9867195.31|                1570.71|            7.99|         3604|
|Executive|          6171|   7658389.55|                1241.03|            7.99|         3603|
|   Deluxe|          6223|   5603556.19|                 900.46|            7.99|         3602|
| Standard|          6238|   3361711.16|                 538.91|            6.99|         3592|
+---------+--------------+-------------+-----------------------+----------------+-------------+


=== Booking Channel Performance ===


+--------------------+--------------+-------------+-----------+----------------+-------------+
|     booking_channel|total_bookings|total_revenue|avg_revenue|avg_satisfaction|unique_guests|
+--------------------+--------------+-------------+-----------+----------------+-------------+
|              Direct|          6167|   7084093.36|    1148.71|            7.74|         3580|
|             Walk-in|          6254|   6843609.33|    1094.28|            7.75|         3616|
|           Corporate|          6366|   6542250.30|    1027.69|            7.72|         3641|
|Online Travel Agency|          6127|   6020899.22|     982.68|            7.74|         3548|
+--------------------+--------------+-------------+-----------+----------------+-------------+


=== Monthly Revenue Trends ===


+-------+--------------+---------------+-----------------+----------------+-------------+
|  month|total_bookings|monthly_revenue|avg_booking_value|avg_satisfaction|unique_guests|
+-------+--------------+---------------+-----------------+----------------+-------------+
|2024-01|          2180|     2026243.57|           929.47|            7.72|         1774|
|2024-02|          2042|     1915938.84|           938.27|            7.74|         1690|
|2024-03|          2097|     1957015.98|           933.25|            7.75|         1712|
|2024-04|          2028|     1857348.21|           915.85|            7.75|         1687|
|2024-05|          2115|     2311742.48|          1093.02|            7.78|         1733|
|2024-06|          2029|     2470203.58|          1217.45|            7.71|         1689|
|2024-07|          2094|     2495579.66|          1191.78|            7.73|         1749|
|2024-08|          2156|     2358394.01|          1093.87|            7.71|         1744|
|2024-09| 
[output truncated]

## Key Takeaways: Iceberg and Liquid Clustering in AIDP

### What We Demonstrated

1. **Automatic Optimization**: Created a table with `TBLPROPERTIES('delta.universalFormat.enabledFormats' = 'iceberg') CLUSTER BY (guest_id, booking_date)` and let Delta automatically optimize data layout

2. **Performance Benefits**: Queries that filter on the clustering columns (`guest_id`, `booking_date`) can skip unrelated files thanks to data locality, a benefit that grows with table size

3. **Low Maintenance**: No manual partitioning, bucketing, or Z-Ordering required; Delta manages the data layout

4. **Real-World Use Case**: Hospitality analytics where guest experience and revenue management are critical

### AIDP Advantages

- **Unified Analytics**: Seamlessly integrates with other AIDP services
- **Governance**: Catalog and schema isolation for hospitality data
- **Performance**: Optimized for large-scale analytical (OLAP) workloads
- **Scalability**: Handles hospitality-scale data volumes effortlessly

### Best Practices for Iceberg and Liquid Clustering

1. **Choose clustering columns** based on your most common query patterns
2. **Start with 1-4 columns** - too many can reduce effectiveness
3. **Consider cardinality** - high-cardinality columns work best
4. **Monitor and adjust** as query patterns evolve
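
When query patterns change, the clustering columns can be revised. The sketch below builds an `ALTER TABLE ... CLUSTER BY` statement while enforcing the 1-4 column guideline above. `cluster_by_statement` is a hypothetical helper, and whether your AIDP workspace accepts this Delta Lake statement as written is an assumption to verify before relying on it.

```python
import re

MAX_CLUSTER_COLUMNS = 4
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def cluster_by_statement(table, columns):
    """Build an ALTER TABLE ... CLUSTER BY statement after basic validation."""
    columns = list(columns)
    if not 1 <= len(columns) <= MAX_CLUSTER_COLUMNS:
        raise ValueError(f"Expected 1-{MAX_CLUSTER_COLUMNS} clustering columns, got {len(columns)}")
    if len(set(columns)) != len(columns):
        raise ValueError("Clustering columns must be unique")
    # Reject anything that is not a plain identifier to avoid malformed SQL
    for name in columns + table.split("."):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"Invalid identifier: {name!r}")
    return f"ALTER TABLE {table} CLUSTER BY ({', '.join(columns)})"

statement = cluster_by_statement(
    "hospitality.analytics.guest_stays_uf",
    ["guest_id", "check_in_date"],
)
print(statement)
```

The resulting string could then be passed to `spark.sql(statement)`. Changing the clustering columns affects how data is organized going forward, so it is worth re-running representative queries afterwards.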

### Next Steps

- Explore other AIDP features like AI/ML integration
- Try liquid clustering with different column combinations
- Scale up to larger hospitality datasets
- Integrate with real property management systems (PMS) and booking platforms

This notebook demonstrates how Oracle AI Data Platform combines Delta's advanced liquid clustering with Iceberg's open, future-proof architecture to deliver enterprise-grade analytics that are both high-performance and standards-compliant.