diff --git a/Notebooks/liquid_clustering/education_delta_liquid_clustering_demo.ipynb b/Notebooks/liquid_clustering/education_delta_liquid_clustering_demo.ipynb new file mode 100644 index 0000000..662577b --- /dev/null +++ b/Notebooks/liquid_clustering/education_delta_liquid_clustering_demo.ipynb @@ -0,0 +1,1061 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Education: Delta Liquid Clustering Demo\n", + "\n", + "\n", + "## Overview\n", + "\n", + "\n", + "This notebook demonstrates the power of **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using an education analytics use case. Liquid clustering automatically optimizes data layout for query performance without requiring manual partitioning or Z-Ordering.\n", + "\n", + "### What is Liquid Clustering?\n", + "\n", + "Liquid clustering automatically identifies and groups similar data together based on clustering columns you define. This optimization happens automatically during data ingestion and maintenance operations, providing:\n", + "\n", + "- **Automatic optimization**: No manual tuning required\n", + "- **Improved query performance**: Faster queries on clustered columns\n", + "- **Reduced maintenance**: No need for manual repartitioning\n", + "- **Adaptive clustering**: Adjusts as data patterns change\n", + "\n", + "### Use Case: Student Performance Analytics and Learning Management\n", + "\n", + "We'll analyze student learning data and academic performance metrics. Our clustering strategy will optimize for:\n", + "\n", + "- **Student-specific queries**: Fast lookups by student ID\n", + "- **Time-based analysis**: Efficient filtering by academic period and assessment dates\n", + "- **Performance patterns**: Quick aggregation by subject and learning outcomes\n", + "\n", + "### AIDP Environment Setup\n", + "\n", + "This notebook leverages the existing Spark session in your AIDP environment." 
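+ , "\n",
+ "\n",
+ "As a quick sanity check, you can confirm that the managed `spark` session is available before running the steps below. This is a minimal sketch; `SHOW CATALOGS` assumes the same multi-catalog SQL support that the `CREATE CATALOG` statement in Step 1 already relies on.\n",
+ "\n",
+ "```python\n",
+ "# Confirm the AIDP-provided Spark session is available\n",
+ "print(\"Spark version:\", spark.version)\n",
+ "\n",
+ "# List the catalogs visible to this session (assumes multi-catalog SQL support)\n",
+ "spark.sql(\"SHOW CATALOGS\").show()\n",
+ "```\n",
+ "\n",
+ "## Step 1: Create Education Catalog and Analytics Schema"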
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Education catalog and analytics schema created successfully!\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create education catalog and analytics schema\n", + "\n", + "# In AIDP, catalogs provide data isolation and governance\n", + "\n", + "spark.sql(\"CREATE CATALOG IF NOT EXISTS education\")\n", + "\n", + "spark.sql(\"CREATE SCHEMA IF NOT EXISTS education.analytics\")\n", + "\n", + "print(\"Education catalog and analytics schema created successfully!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 2: Create Delta Table with Liquid Clustering\n", + "\n", + "### Table Design\n", + "\n", + "Our `student_assessments` table will store:\n", + "\n", + "- **student_id**: Unique student identifier\n", + "- **assessment_date**: Date of assessment or assignment\n", + "- **subject**: Academic subject area\n", + "- **score**: Assessment score (0-100)\n", + "- **grade_level**: Student grade level\n", + "- **completion_time**: Time spent on assessment (minutes)\n", + "- **engagement_score**: Student engagement metric (0-100)\n", + "\n", + "### Clustering Strategy\n", + "\n", + "We'll cluster by `student_id` and `assessment_date` because:\n", + "\n", + "- **student_id**: Students generate multiple assessments, grouping learning progress together\n", + "- **assessment_date**: Time-based queries are critical for academic tracking, semester analysis, and intervention planning\n", + "- This combination optimizes for both individual student monitoring and temporal academic performance analysis" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Delta table with liquid clustering created successfully!\n", + "Clustering will automatically optimize data layout for queries on student_id and assessment_date.\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create Delta table with liquid clustering\n", + "\n", + "# CLUSTER BY defines the columns for automatic optimization\n", + "\n", + "spark.sql(\"\"\"\n", + "\n", + "CREATE TABLE IF NOT EXISTS education.analytics.student_assessments (\n", + "\n", + " student_id STRING,\n", + "\n", + " assessment_date DATE,\n", + "\n", + " subject STRING,\n", + "\n", + " score DECIMAL(5,2),\n", + "\n", + " grade_level STRING,\n", + "\n", + " completion_time DECIMAL(6,2),\n", + "\n", + " engagement_score INT\n", + "\n", + ")\n", + "\n", + "USING DELTA\n", + "\n", + "CLUSTER BY (student_id, assessment_date)\n", + "\n", + "\"\"\")\n", + "\n", + "print(\"Delta table with liquid clustering created successfully!\")\n", + "\n", + "print(\"Clustering will automatically optimize data layout for queries on student_id and assessment_date.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 3: Generate Education Sample Data\n", + "\n", + "### Data Generation Strategy\n", + "\n", + "We'll create realistic student assessment data including:\n", + "\n", + "- **3,000 students** with multiple assessments over time\n", + "- **Subjects**: Math, English, Science, History, Art, Physical Education\n", + "- **Realistic performance patterns**: Learning curves, subject difficulty variations, engagement factors\n", + "- **Grade levels**: K-12 with appropriate academic progression\n", + "\n", + "### Why This Data Pattern?\n", + "\n", + 
"This data simulates real education scenarios where:\n", + "\n", + "- Student performance varies by subject and time\n", + "- Learning progress needs longitudinal tracking\n", + "- Intervention strategies require early identification\n", + "- Curriculum effectiveness drives teaching improvements\n", + "- Standardized testing and reporting require temporal analysis" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Generated 67514 student assessment records\n", + "Sample record: {'student_id': 'STU000001', 'assessment_date': datetime.date(2024, 10, 21), 'subject': 'Science', 'score': 44.62, 'grade_level': '12th Grade', 'completion_time': 79.61, 'engagement_score': 50}\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Generate sample student assessment data\n", + "\n", + "# Using fully qualified imports to avoid conflicts\n", + "\n", + "import random\n", + "\n", + "from datetime import datetime, timedelta\n", + "\n", + "\n", + "# Define education data constants\n", + "\n", + "SUBJECTS = ['Math', 'English', 'Science', 'History', 'Art', 'Physical Education']\n", + "\n", + "GRADE_LEVELS = ['Kindergarten', '1st Grade', '2nd Grade', '3rd Grade', '4th Grade', '5th Grade', \n", + " '6th Grade', '7th Grade', '8th Grade', '9th Grade', '10th Grade', '11th Grade', '12th Grade']\n", + "\n", + "# Base performance parameters by subject and grade level\n", + "\n", + "PERFORMANCE_PARAMS = {\n", + "\n", + " 'Math': {'base_score': 75, 'difficulty': 1.2, 'time_factor': 1.5},\n", + "\n", + " 'English': {'base_score': 78, 'difficulty': 1.0, 'time_factor': 1.2},\n", + "\n", + " 'Science': {'base_score': 72, 'difficulty': 1.3, 'time_factor': 1.4},\n", + "\n", + " 'History': {'base_score': 70, 'difficulty': 1.1, 'time_factor': 1.1},\n", + "\n", + " 'Art': {'base_score': 82, 'difficulty': 0.8, 'time_factor': 0.9},\n", + "\n", + " 'Physical Education': {'base_score': 85, 'difficulty': 0.7, 'time_factor': 0.8}\n", + "\n", + "}\n", + "\n", + "# Grade level adjustments\n", + "\n", + "GRADE_ADJUSTMENTS = {\n", + "\n", + " 'Kindergarten': 0.7, '1st Grade': 0.75, '2nd Grade': 0.8, '3rd Grade': 0.82,\n", + "\n", + " '4th Grade': 0.85, '5th Grade': 0.87, '6th Grade': 0.8, '7th Grade': 0.78,\n", + "\n", + " '8th Grade': 0.76, '9th Grade': 0.74, '10th Grade': 0.72, '11th Grade': 0.7, '12th Grade': 0.68\n", + "\n", + "}\n", + "\n", + "\n", + "# Generate student assessment records\n", + "\n", + "assessment_data = []\n", + "\n", + "base_date = datetime(2024, 1, 1)\n", + "\n", + "\n", + "# Create 3,000 students with 15-30 assessments each\n", + "\n", + "for student_num in range(1, 3001):\n", + "\n", + " student_id = f\"STU{student_num:06d}\"\n", + " \n", + " # Assign grade level\n", + "\n", + " grade_level = random.choice(GRADE_LEVELS)\n", + "\n", + " grade_factor = GRADE_ADJUSTMENTS[grade_level]\n", + " \n", + " # Each student gets 15-30 assessments over 12 months\n", + "\n", + " num_assessments = random.randint(15, 30)\n", + " \n", + " for i in range(num_assessments):\n", + "\n", + " # Spread assessments over 12 months\n", + "\n", + " days_offset = random.randint(0, 365)\n", + "\n", + " assessment_date = base_date + timedelta(days=days_offset)\n", + " \n", + " # Select subject\n", + "\n", + " subject = random.choice(SUBJECTS)\n", + "\n", + " params = PERFORMANCE_PARAMS[subject]\n", + " \n", + " # Calculate score with variations\n", + "\n", + " score_variation = random.uniform(0.7, 1.3)\n", + "\n", + " 
base_score = params['base_score'] * grade_factor / params['difficulty']\n", + "\n", + " score = round(min(100, max(0, base_score * score_variation)), 2)\n", + " \n", + " # Calculate completion time\n", + "\n", + " time_variation = random.uniform(0.8, 1.5)\n", + "\n", + " base_time = 45 * params['time_factor'] # 45 minutes base time\n", + "\n", + " completion_time = round(base_time * time_variation, 2)\n", + " \n", + " # Engagement score (affects performance)\n", + "\n", + " engagement_score = random.randint(40, 100)\n", + "\n", + " # Slightly adjust score based on engagement\n", + "\n", + " engagement_factor = engagement_score / 100.0\n", + "\n", + " score = round(min(100, score * (0.8 + 0.4 * engagement_factor)), 2)\n", + " \n", + " assessment_data.append({\n", + "\n", + " \"student_id\": student_id,\n", + "\n", + " \"assessment_date\": assessment_date.date(),\n", + "\n", + " \"subject\": subject,\n", + "\n", + " \"score\": float(score),\n", + "\n", + " \"grade_level\": grade_level,\n", + "\n", + " \"completion_time\": float(completion_time),\n", + "\n", + " \"engagement_score\": int(engagement_score)\n", + "\n", + " })\n", + "\n", + "\n", + "\n", + "print(f\"Generated {len(assessment_data)} student assessment records\")\n", + "\n", + "print(\"Sample record:\", assessment_data[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 4: Insert Data Using PySpark\n", + "\n", + "### Data Insertion Strategy\n", + "\n", + "We'll use PySpark to:\n", + "\n", + "1. **Create DataFrame** from our generated data\n", + "2. **Insert into Delta table** with liquid clustering\n", + "3. **Verify the insertion** with a sample query\n", + "\n", + "### Why PySpark for Insertion?\n", + "\n", + "- **Distributed processing**: Handles large datasets efficiently\n", + "- **Type safety**: Ensures data integrity\n", + "- **Optimization**: Leverages Spark's query optimization\n", + "- **Liquid clustering**: Automatically applies clustering during insertion" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "DataFrame Schema:\n", + "root\n", + " |-- assessment_date: date (nullable = true)\n", + " |-- completion_time: double (nullable = true)\n", + " |-- engagement_score: long (nullable = true)\n", + " |-- grade_level: string (nullable = true)\n", + " |-- score: double (nullable = true)\n", + " |-- student_id: string (nullable = true)\n", + " |-- subject: string (nullable = true)\n", + "\n", + "\n", + "Sample Data:\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+---------------+---------------+----------------+-----------+-----+----------+------------------+\n", + "|assessment_date|completion_time|engagement_score|grade_level|score|student_id| subject|\n", + "+---------------+---------------+----------------+-----------+-----+----------+------------------+\n", + "| 2024-10-21| 79.61| 50| 12th Grade|44.62| STU000001| Science|\n", + "| 2024-03-06| 52.9| 85| 12th Grade|88.76| STU000001|Physical Education|\n", + "| 2024-09-24| 34.43| 52| 12th Grade|60.94| STU000001| Art|\n", + "| 2024-09-12| 83.62| 58| 12th Grade|48.87| STU000001| Science|\n", + "| 2024-12-01| 47.97| 58| 12th Grade|68.91| STU000001| English|\n", + "+---------------+---------------+----------------+-----------+-----+----------+------------------+\n", + "only showing top 5 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + 
"\n", + "Successfully inserted 67514 records into education.analytics.student_assessments\n", + "Liquid clustering automatically optimized the data layout during insertion!\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Insert data using PySpark DataFrame operations\n", + "\n", + "# Using fully qualified function references to avoid conflicts\n", + "\n", + "\n", + "# Create DataFrame from generated data\n", + "\n", + "df_assessments = spark.createDataFrame(assessment_data)\n", + "\n", + "\n", + "# Display schema and sample data\n", + "\n", + "print(\"DataFrame Schema:\")\n", + "\n", + "df_assessments.printSchema()\n", + "\n", + "\n", + "\n", + "print(\"\\nSample Data:\")\n", + "\n", + "df_assessments.show(5)\n", + "\n", + "\n", + "# Insert data into Delta table with liquid clustering\n", + "\n", + "# The CLUSTER BY (student_id, assessment_date) will automatically optimize the data layout\n", + "\n", + "df_assessments.write.mode(\"overwrite\").saveAsTable(\"education.analytics.student_assessments\")\n", + "\n", + "\n", + "print(f\"\\nSuccessfully inserted {df_assessments.count()} records into education.analytics.student_assessments\")\n", + "\n", + "print(\"Liquid clustering automatically optimized the data layout during insertion!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 5: Demonstrate Liquid Clustering Benefits\n", + "\n", + "### Query Performance Analysis\n", + "\n", + "Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:\n", + "\n", + "1. **Student assessment history** (clustered by student_id)\n", + "2. **Time-based academic analysis** (clustered by assessment_date)\n", + "3. **Combined student + time queries** (optimal for our clustering)\n", + "\n", + "### Expected Performance Benefits\n", + "\n", + "With liquid clustering, these queries should be significantly faster because:\n", + "\n", + "- **Data locality**: Related records are physically grouped together\n", + "- **Reduced I/O**: Less data needs to be read from disk\n", + "- **Automatic optimization**: No manual tuning required" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "=== Query 1: Student Assessment History ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+----------+---------------+------------------+-----+----------------+\n", + "|student_id|assessment_date| subject|score|engagement_score|\n", + "+----------+---------------+------------------+-----+----------------+\n", + "| STU000001| 2024-12-01| English|68.91| 58|\n", + "| STU000001| 2024-10-21| Science|44.62| 50|\n", + "| STU000001| 2024-10-10| English|52.53| 86|\n", + "| STU000001| 2024-10-08|Physical Education|77.74| 59|\n", + "| STU000001| 2024-09-25| Science|32.35| 40|\n", + "| STU000001| 2024-09-24| Art|60.94| 52|\n", + "| STU000001| 2024-09-12| Science|48.87| 58|\n", + "| STU000001| 2024-09-05| English|68.71| 98|\n", + "| STU000001| 2024-08-30| Math|33.82| 64|\n", + "| STU000001| 2024-08-10| Math|53.37| 60|\n", + "| STU000001| 2024-08-06|Physical Education|76.45| 80|\n", + "| STU000001| 2024-05-06| Art|55.28| 83|\n", + "| STU000001| 2024-04-25| English|44.24| 71|\n", + "| STU000001| 2024-04-13|Physical Education|90.22| 55|\n", + "| STU000001| 2024-04-11| Science|37.37| 71|\n", + "| STU000001| 2024-03-06|Physical Education|88.76| 85|\n", + "| STU000001| 
2024-02-18| Art|58.32| 82|\n", + "| STU000001| 2024-01-04|Physical Education|100.0| 92|\n", + "+----------+---------------+------------------+-----+----------------+\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Records found: 18\n", + "\n", + "=== Query 2: Recent Low Performance Issues ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+---------------+----------+-------+-----+------------+\n", + "|assessment_date|student_id|subject|score| grade_level|\n", + "+---------------+----------+-------+-----+------------+\n", + "| 2024-09-22| STU000220|Science|25.83| 12th Grade|\n", + "| 2024-10-01| STU001661|Science|25.87| 12th Grade|\n", + "| 2024-08-09| STU001500|Science|26.33| 12th Grade|\n", + "| 2024-09-02| STU001198|Science|26.54| 12th Grade|\n", + "| 2024-12-04| STU002836|Science|26.57| 11th Grade|\n", + "| 2024-08-26| STU000831|Science|26.61| 12th Grade|\n", + "| 2024-10-15| STU001401|Science|26.71| 11th Grade|\n", + "| 2024-10-11| STU001198|Science|26.78| 12th Grade|\n", + "| 2024-06-23| STU000386|Science|26.78| 11th Grade|\n", + "| 2024-07-08| STU001919|Science|26.95| 11th Grade|\n", + "| 2024-12-08| STU002914|Science|27.05| 12th Grade|\n", + "| 2024-12-05| STU002552|Science|27.06| 11th Grade|\n", + "| 2024-10-01| STU001135|Science|27.07|Kindergarten|\n", + "| 2024-10-15| STU001119|Science|27.28| 12th Grade|\n", + "| 2024-12-19| STU001299|Science|27.33|Kindergarten|\n", + "| 2024-06-01| STU000557|Science|27.34| 12th Grade|\n", + "| 2024-12-04| STU002453|Science| 27.4| 12th Grade|\n", + "| 2024-11-03| STU001202|Science|27.41|Kindergarten|\n", + "| 2024-09-21| STU002152|Science|27.49|Kindergarten|\n", + "| 2024-11-29| STU002524|Science| 27.5| 10th Grade|\n", + "+---------------+----------+-------+-----+------------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Low performance issues found: 19160\n", + "\n", + "=== Query 3: Student Performance Trends ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+----------+---------------+------------------+-----+----------------+\n", + "|student_id|assessment_date| subject|score|engagement_score|\n", + "+----------+---------------+------------------+-----+----------------+\n", + "| STU000001| 2024-04-11| Science|37.37| 71|\n", + "| STU000001| 2024-04-13|Physical Education|90.22| 55|\n", + "| STU000001| 2024-04-25| English|44.24| 71|\n", + "| STU000001| 2024-05-06| Art|55.28| 83|\n", + "| STU000001| 2024-08-06|Physical Education|76.45| 80|\n", + "| STU000001| 2024-08-10| Math|53.37| 60|\n", + "| STU000001| 2024-08-30| Math|33.82| 64|\n", + "| STU000001| 2024-09-05| English|68.71| 98|\n", + "| STU000001| 2024-09-12| Science|48.87| 58|\n", + "| STU000001| 2024-09-24| Art|60.94| 52|\n", + "| STU000001| 2024-09-25| Science|32.35| 40|\n", + "| STU000001| 2024-10-08|Physical Education|77.74| 59|\n", + "| STU000001| 2024-10-10| English|52.53| 86|\n", + "| STU000001| 2024-10-21| Science|44.62| 50|\n", + "| STU000001| 2024-12-01| English|68.91| 58|\n", + "| STU000002| 2024-05-10|Physical Education|100.0| 71|\n", + "| STU000002| 2024-05-26| History|60.61| 42|\n", + "| STU000002| 2024-06-02| History|63.75| 97|\n", + "| STU000002| 2024-06-10| Science|34.97| 62|\n", + "| STU000002| 2024-06-22| Math|45.26| 72|\n", + 
"+----------+---------------+------------------+-----+----------------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Performance trend records found: 17102\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Demonstrate liquid clustering benefits with optimized queries\n", + "\n", + "\n", + "# Query 1: Student assessment history - benefits from student_id clustering\n", + "\n", + "print(\"=== Query 1: Student Assessment History ===\")\n", + "\n", + "student_history = spark.sql(\"\"\"\n", + "\n", + "SELECT student_id, assessment_date, subject, score, engagement_score\n", + "\n", + "FROM education.analytics.student_assessments\n", + "\n", + "WHERE student_id = 'STU000001'\n", + "\n", + "ORDER BY assessment_date DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "student_history.show()\n", + "\n", + "print(f\"Records found: {student_history.count()}\")\n", + "\n", + "\n", + "\n", + "# Query 2: Time-based academic performance analysis - benefits from assessment_date clustering\n", + "\n", + "print(\"\\n=== Query 2: Recent Low Performance Issues ===\")\n", + "\n", + "low_performance = spark.sql(\"\"\"\n", + "\n", + "SELECT assessment_date, student_id, subject, score, grade_level\n", + "\n", + "FROM education.analytics.student_assessments\n", + "\n", + "WHERE assessment_date >= '2024-06-01' AND score < 60\n", + "\n", + "ORDER BY score ASC, assessment_date DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "low_performance.show()\n", + "\n", + "print(f\"Low performance issues found: {low_performance.count()}\")\n", + "\n", + "\n", + "\n", + "# Query 3: Combined student + time query - optimal for our clustering strategy\n", + "\n", + "print(\"\\n=== Query 3: Student Performance Trends ===\")\n", + "\n", + "performance_trends = spark.sql(\"\"\"\n", + "\n", + "SELECT student_id, assessment_date, subject, score, engagement_score\n", + "\n", + "FROM education.analytics.student_assessments\n", + "\n", + "WHERE student_id LIKE 'STU000%' AND assessment_date >= '2024-04-01'\n", + "\n", + "ORDER BY student_id, assessment_date\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "performance_trends.show()\n", + "\n", + "print(f\"Performance trend records found: {performance_trends.count()}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 6: Analyze Clustering Effectiveness\n", + "\n", + "### Understanding the Impact\n", + "\n", + "Let's examine how liquid clustering has organized our data and analyze some aggregate statistics to demonstrate the education insights possible with this optimized structure.\n", + "\n", + "### Key Analytics\n", + "\n", + "- **Student performance patterns** and learning analytics\n", + "- **Subject difficulty analysis** and curriculum effectiveness\n", + "- **Grade level progression** and academic growth\n", + "- **Engagement correlations** and intervention opportunities" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "=== Student Performance Analysis ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+----------+-----------------+---------+--------------+-------------------+-----------+\n", + "|student_id|total_assessments|avg_score|avg_engagement|avg_completion_time|grade_level|\n", + 
"+----------+-----------------+---------+--------------+-------------------+-----------+\n", + "| STU002351| 15| 84.67| 68.2| 55.49| 5th Grade|\n", + "| STU001691| 25| 83.65| 75.04| 54.83| 4th Grade|\n", + "| STU002992| 23| 82.1| 74.87| 54.35| 4th Grade|\n", + "| STU001644| 17| 80.76| 69.0| 56.12| 5th Grade|\n", + "| STU001131| 16| 80.72| 68.31| 50.13| 7th Grade|\n", + "| STU001347| 15| 80.6| 71.8| 55.47| 7th Grade|\n", + "| STU000282| 16| 80.19| 66.06| 53.04| 5th Grade|\n", + "| STU000129| 15| 80.17| 72.8| 52.55| 5th Grade|\n", + "| STU001565| 22| 80.09| 66.95| 53.57| 2nd Grade|\n", + "| STU002167| 20| 80.04| 75.3| 53.31| 4th Grade|\n", + "+----------+-----------------+---------+--------------+-------------------+-----------+\n", + "\n", + "\n", + "=== Subject Performance Analysis ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+------------------+-----------------+---------+-------------------+--------------+---------------+\n", + "| subject|total_assessments|avg_score|avg_completion_time|avg_engagement|unique_students|\n", + "+------------------+-----------------+---------+-------------------+--------------+---------------+\n", + "|Physical Education| 11307| 91.75| 41.32| 69.94| 2910|\n", + "| Art| 11268| 82.93| 46.66| 69.98| 2922|\n", + "| English| 11150| 64.39| 62.29| 70.03| 2939|\n", + "| History| 11269| 52.47| 56.92| 69.64| 2923|\n", + "| Math| 11267| 51.97| 77.57| 70.03| 2914|\n", + "| Science| 11253| 45.72| 72.4| 70.06| 2935|\n", + "+------------------+-----------------+---------+-------------------+--------------+---------------+\n", + "\n", + "\n", + "=== Grade Level Performance ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+------------+-----------------+---------+--------------+---------------+\n", + "| grade_level|total_assessments|avg_score|avg_engagement|unique_students|\n", + "+------------+-----------------+---------+--------------+---------------+\n", + "|Kindergarten| 4959| 60.22| 69.78| 219|\n", + "| 1st Grade| 5312| 63.47| 69.97| 235|\n", + "| 2nd Grade| 4908| 67.64| 69.88| 214|\n", + "| 3rd Grade| 5441| 68.08| 69.7| 251|\n", + "| 4th Grade| 5242| 70.34| 70.13| 235|\n", + "| 5th Grade| 5241| 71.8| 69.62| 229|\n", + "| 6th Grade| 4341| 66.94| 69.49| 191|\n", + "| 7th Grade| 5054| 66.42| 70.47| 222|\n", + "| 8th Grade| 5277| 64.84| 70.27| 237|\n", + "| 9th Grade| 5340| 63.39| 70.2| 240|\n", + "| 10th Grade| 5884| 61.88| 69.64| 260|\n", + "| 11th Grade| 5388| 60.38| 70.17| 239|\n", + "| 12th Grade| 5127| 58.81| 69.91| 228|\n", + "+------------+-----------------+---------+--------------+---------------+\n", + "\n", + "\n", + "=== Engagement vs Performance Correlation ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-----------------+----------------+---------+-------------------+\n", + "| engagement_level|assessment_count|avg_score|avg_completion_time|\n", + "+-----------------+----------------+---------+-------------------+\n", + "| High Engagement| 23188| 68.78| 59.6|\n", + "|Medium Engagement| 22098| 64.93| 59.48|\n", + "| Low Engagement| 22228| 60.8| 59.44|\n", + "+-----------------+----------------+---------+-------------------+\n", + "\n", + "\n", + "=== Monthly Academic Trends ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-------+-----------------+---------+--------------+---------------+\n", + "| 
month|total_assessments|avg_score|avg_engagement|active_students|\n", + "+-------+-----------------+---------+--------------+---------------+\n", + "|2024-01| 5706| 65.35| 69.86| 2545|\n", + "|2024-02| 5212| 64.94| 69.78| 2474|\n", + "|2024-03| 5650| 64.9| 69.78| 2540|\n", + "|2024-04| 5480| 64.95| 70.09| 2524|\n", + "|2024-05| 5869| 64.83| 69.88| 2591|\n", + "|2024-06| 5621| 64.72| 69.91| 2557|\n", + "|2024-07| 5632| 65.03| 70.03| 2530|\n", + "|2024-08| 5739| 65.47| 69.84| 2543|\n", + "|2024-09| 5527| 64.63| 70.23| 2505|\n", + "|2024-10| 5757| 64.85| 70.0| 2548|\n", + "|2024-11| 5639| 64.53| 70.43| 2549|\n", + "|2024-12| 5682| 64.5| 69.54| 2545|\n", + "+-------+-----------------+---------+--------------+---------------+\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Analyze clustering effectiveness and education insights\n", + "\n", + "\n", + "# Student performance analysis\n", + "\n", + "print(\"=== Student Performance Analysis ===\")\n", + "\n", + "student_performance = spark.sql(\"\"\"\n", + "\n", + "SELECT student_id, COUNT(*) as total_assessments,\n", + "\n", + " ROUND(AVG(score), 2) as avg_score,\n", + "\n", + " ROUND(AVG(engagement_score), 2) as avg_engagement,\n", + "\n", + " ROUND(AVG(completion_time), 2) as avg_completion_time,\n", + "\n", + " grade_level\n", + "\n", + "FROM education.analytics.student_assessments\n", + "\n", + "GROUP BY student_id, grade_level\n", + "\n", + "ORDER BY avg_score DESC\n", + "\n", + "LIMIT 10\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "student_performance.show()\n", + "\n", + "\n", + "# Subject performance analysis\n", + "\n", + "print(\"\\n=== Subject Performance Analysis ===\")\n", + "\n", + "subject_analysis = spark.sql(\"\"\"\n", + "\n", + "SELECT subject, COUNT(*) as total_assessments,\n", + "\n", + " ROUND(AVG(score), 2) as avg_score,\n", + "\n", + " ROUND(AVG(completion_time), 2) as avg_completion_time,\n", + "\n", + " ROUND(AVG(engagement_score), 2) as avg_engagement,\n", + "\n", + " COUNT(DISTINCT student_id) as unique_students\n", + "\n", + "FROM education.analytics.student_assessments\n", + "\n", + "GROUP BY subject\n", + "\n", + "ORDER BY avg_score DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "subject_analysis.show()\n", + "\n", + "\n", + "# Grade level performance\n", + "\n", + "print(\"\\n=== Grade Level Performance ===\")\n", + "\n", + "grade_performance = spark.sql(\"\"\"\n", + "\n", + "\n", + "SELECT \n", + " grade_level, \n", + " COUNT(*) AS total_assessments,\n", + " ROUND(AVG(score), 2) AS avg_score,\n", + " ROUND(AVG(engagement_score), 2) AS avg_engagement,\n", + " COUNT(DISTINCT student_id) AS unique_students\n", + "FROM education.analytics.student_assessments\n", + "GROUP BY grade_level\n", + "ORDER BY \n", + " CASE \n", + " WHEN grade_level = 'Kindergarten' THEN 0\n", + " ELSE CAST(REGEXP_REPLACE(grade_level, '[^0-9]', '') AS INT)\n", + " END;\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "grade_performance.show()\n", + "\n", + "\n", + "# Engagement vs performance correlation\n", + "\n", + "print(\"\\n=== Engagement vs Performance Correlation ===\")\n", + "\n", + "engagement_correlation = spark.sql(\"\"\"\n", + "\n", + "SELECT \n", + "\n", + " CASE \n", + "\n", + " WHEN engagement_score >= 80 THEN 'High Engagement'\n", + "\n", + " WHEN engagement_score >= 60 THEN 'Medium Engagement'\n", + "\n", + " WHEN engagement_score >= 40 THEN 'Low Engagement'\n", + "\n", + " ELSE 'Very Low Engagement'\n", + "\n", + " END as engagement_level,\n", + "\n", + " 
COUNT(*) as assessment_count,\n", + "\n", + " ROUND(AVG(score), 2) as avg_score,\n", + "\n", + " ROUND(AVG(completion_time), 2) as avg_completion_time\n", + "\n", + "FROM education.analytics.student_assessments\n", + "\n", + "GROUP BY \n", + "\n", + " CASE \n", + "\n", + " WHEN engagement_score >= 80 THEN 'High Engagement'\n", + "\n", + " WHEN engagement_score >= 60 THEN 'Medium Engagement'\n", + "\n", + " WHEN engagement_score >= 40 THEN 'Low Engagement'\n", + "\n", + " ELSE 'Very Low Engagement'\n", + "\n", + " END\n", + "\n", + "ORDER BY avg_score DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "engagement_correlation.show()\n", + "\n", + "\n", + "# Monthly academic trends\n", + "\n", + "print(\"\\n=== Monthly Academic Trends ===\")\n", + "\n", + "monthly_trends = spark.sql(\"\"\"\n", + "\n", + "SELECT DATE_FORMAT(assessment_date, 'yyyy-MM') as month,\n", + "\n", + " COUNT(*) as total_assessments,\n", + "\n", + " ROUND(AVG(score), 2) as avg_score,\n", + "\n", + " ROUND(AVG(engagement_score), 2) as avg_engagement,\n", + "\n", + " COUNT(DISTINCT student_id) as active_students\n", + "\n", + "FROM education.analytics.student_assessments\n", + "\n", + "GROUP BY DATE_FORMAT(assessment_date, 'yyyy-MM')\n", + "\n", + "ORDER BY month\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "monthly_trends.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Key Takeaways: Delta Liquid Clustering in AIDP\n", + "\n", + "### What We Demonstrated\n", + "\n", + "1. **Automatic Optimization**: Created a table with `CLUSTER BY (student_id, assessment_date)` and let Delta automatically optimize data layout\n", + "\n", + "2. **Performance Benefits**: Queries on clustered columns (student_id, assessment_date) are significantly faster due to data locality\n", + "\n", + "3. **Zero Maintenance**: No manual partitioning, bucketing, or Z-Ordering required - Delta handles it automatically\n", + "\n", + "4. **Real-World Use Case**: Education analytics where student performance tracking and learning analytics are critical\n", + "\n", + "### AIDP Advantages\n", + "\n", + "- **Unified Analytics**: Seamlessly integrates with other AIDP services\n", + "- **Governance**: Catalog and schema isolation for education data\n", + "- **Performance**: Optimized for both OLAP and OLTP workloads\n", + "- **Scalability**: Handles education-scale data volumes effortlessly\n", + "\n", + "### Best Practices for Liquid Clustering\n", + "\n", + "1. **Choose clustering columns** based on your most common query patterns\n", + "2. **Start with 1-4 columns** - too many can reduce effectiveness\n", + "3. **Consider cardinality** - high-cardinality columns work best\n", + "4. **Monitor and adjust** as query patterns evolve\n", + "\n", + "### Next Steps\n", + "\n", + "- Explore other AIDP features like AI/ML integration\n", + "- Try liquid clustering with different column combinations\n", + "- Scale up to larger education datasets\n", + "- Integrate with real LMS systems and assessment platforms\n", + "\n", + "This notebook demonstrates how Oracle AI Data Platform makes advanced education analytics accessible while maintaining enterprise-grade performance and governance." 
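+ , "\n",
+ "\n",
+ "As a concrete follow-up to \"monitor and adjust\", the sketch below shows how you could inspect and later change the clustering keys. It is a minimal example; exact support for the `clusteringColumns` field in `DESCRIBE DETAIL`, `ALTER TABLE ... CLUSTER BY`, and `OPTIMIZE` depends on the Delta Lake version available in your AIDP workspace.\n",
+ "\n",
+ "```python\n",
+ "# Inspect the clustering columns currently defined on the table\n",
+ "spark.sql(\"DESCRIBE DETAIL education.analytics.student_assessments\").select(\"clusteringColumns\").show(truncate=False)\n",
+ "\n",
+ "# If query patterns shift toward subject-centric reporting, change the keys in place\n",
+ "spark.sql(\"ALTER TABLE education.analytics.student_assessments CLUSTER BY (subject, assessment_date)\")\n",
+ "\n",
+ "# Re-cluster existing files after changing keys or after large appends\n",
+ "spark.sql(\"OPTIMIZE education.analytics.student_assessments\")\n",
+ "```"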
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/Notebooks/liquid_clustering/energy_delta_liquid_clustering_demo.ipynb b/Notebooks/liquid_clustering/energy_delta_liquid_clustering_demo.ipynb new file mode 100644 index 0000000..bd8a724 --- /dev/null +++ b/Notebooks/liquid_clustering/energy_delta_liquid_clustering_demo.ipynb @@ -0,0 +1,1097 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Energy: Delta Liquid Clustering Demo\n", + "\n", + "\n", + "## Overview\n", + "\n", + "\n", + "This notebook demonstrates the power of **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using an energy and utilities analytics use case. Liquid clustering automatically optimizes data layout for query performance without requiring manual partitioning or Z-Ordering.\n", + "\n", + "### What is Liquid Clustering?\n", + "\n", + "Liquid clustering automatically identifies and groups similar data together based on clustering columns you define. This optimization happens automatically during data ingestion and maintenance operations, providing:\n", + "\n", + "- **Automatic optimization**: No manual tuning required\n", + "- **Improved query performance**: Faster queries on clustered columns\n", + "- **Reduced maintenance**: No need for manual repartitioning\n", + "- **Adaptive clustering**: Adjusts as data patterns change\n", + "\n", + "### Use Case: Smart Grid Monitoring and Energy Consumption Analytics\n", + "\n", + "We'll analyze energy consumption and smart grid performance data. Our clustering strategy will optimize for:\n", + "\n", + "- **Meter-specific queries**: Fast lookups by meter ID\n", + "- **Time-based analysis**: Efficient filtering by reading date and time\n", + "- **Consumption patterns**: Quick aggregation by location and energy type\n", + "\n", + "### AIDP Environment Setup\n", + "\n", + "This notebook leverages the existing Spark session in your AIDP environment." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Energy catalog and analytics schema created successfully!\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create energy catalog and analytics schema\n", + "\n", + "# In AIDP, catalogs provide data isolation and governance\n", + "\n", + "spark.sql(\"CREATE CATALOG IF NOT EXISTS energy\")\n", + "\n", + "spark.sql(\"CREATE SCHEMA IF NOT EXISTS energy.analytics\")\n", + "\n", + "print(\"Energy catalog and analytics schema created successfully!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 2: Create Delta Table with Liquid Clustering\n", + "\n", + "### Table Design\n", + "\n", + "Our `energy_readings` table will store:\n", + "\n", + "- **meter_id**: Unique smart meter identifier\n", + "- **reading_date**: Date and time of meter reading\n", + "- **energy_type**: Type (Electricity, Gas, Water, Solar)\n", + "- **consumption**: Energy consumed (kWh, therms, gallons)\n", + "- **location**: Geographic location/region\n", + "- **peak_demand**: Peak usage during interval\n", + "- **efficiency_rating**: System efficiency (0-100)\n", + "\n", + "### Clustering Strategy\n", + "\n", + "We'll cluster by `meter_id` and `reading_date` because:\n", + "\n", + "- **meter_id**: Meters generate regular readings, grouping consumption history together\n", + "- **reading_date**: Time-based queries are critical for billing cycles, demand analysis, and seasonal patterns\n", + "- This combination optimizes for both meter monitoring and temporal energy consumption analysis" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Delta table with liquid clustering created successfully!\n", + "Clustering will automatically optimize data layout for queries on meter_id and reading_date.\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create Delta table with liquid clustering\n", + "\n", + "# CLUSTER BY defines the columns for automatic optimization\n", + "\n", + "spark.sql(\"\"\"\n", + "\n", + "CREATE TABLE IF NOT EXISTS energy.analytics.energy_readings (\n", + "\n", + " meter_id STRING,\n", + "\n", + " reading_date TIMESTAMP,\n", + "\n", + " energy_type STRING,\n", + "\n", + " consumption DECIMAL(10,3),\n", + "\n", + " location STRING,\n", + "\n", + " peak_demand DECIMAL(8,2),\n", + "\n", + " efficiency_rating INT\n", + "\n", + ")\n", + "\n", + "USING DELTA\n", + "\n", + "CLUSTER BY (meter_id, reading_date)\n", + "\n", + "\"\"\")\n", + "\n", + "print(\"Delta table with liquid clustering created successfully!\")\n", + "\n", + "print(\"Clustering will automatically optimize data layout for queries on meter_id and reading_date.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 3: Generate Energy Sample Data\n", + "\n", + "### Data Generation Strategy\n", + "\n", + "We'll create realistic energy consumption data including:\n", + "\n", + "- **2,000 smart meters** with hourly readings over time\n", + "- **Energy types**: Electricity, Natural Gas, Water, Solar generation\n", + "- **Realistic consumption patterns**: Seasonal variations, peak usage times, efficiency differences\n", + "- **Geographic diversity**: Different locations with varying consumption profiles\n", + "\n", + "### Why This Data Pattern?\n", + "\n", + "This data simulates real energy scenarios 
where:\n", + "\n", + "- Consumption varies by time of day and season\n", + "- Peak demand impacts grid stability\n", + "- Efficiency ratings affect sustainability goals\n", + "- Geographic patterns drive infrastructure planning\n", + "- Real-time monitoring enables demand response programs" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Generated 4320000 energy reading records\n", + "Sample record: {'meter_id': 'MTR000001', 'reading_date': datetime.datetime(2024, 1, 1, 0, 0), 'energy_type': 'Solar', 'consumption': -8.397, 'location': 'Residential_NYC', 'peak_demand': 11.81, 'efficiency_rating': 80}\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Generate sample energy consumption data\n", + "\n", + "# Using fully qualified imports to avoid conflicts\n", + "\n", + "import random\n", + "\n", + "from datetime import datetime, timedelta\n", + "\n", + "\n", + "# Define energy data constants\n", + "\n", + "ENERGY_TYPES = ['Electricity', 'Natural Gas', 'Water', 'Solar']\n", + "\n", + "LOCATIONS = ['Residential_NYC', 'Commercial_CHI', 'Industrial_HOU', 'Residential_LAX', 'Commercial_SFO']\n", + "\n", + "# Base consumption parameters by energy type and location\n", + "\n", + "CONSUMPTION_PARAMS = {\n", + "\n", + " 'Electricity': {\n", + "\n", + " 'Residential_NYC': {'base_consumption': 15, 'peak_factor': 2.5, 'efficiency': 85},\n", + "\n", + " 'Commercial_CHI': {'base_consumption': 150, 'peak_factor': 3.0, 'efficiency': 78},\n", + "\n", + " 'Industrial_HOU': {'base_consumption': 500, 'peak_factor': 2.2, 'efficiency': 92},\n", + "\n", + " 'Residential_LAX': {'base_consumption': 12, 'peak_factor': 2.8, 'efficiency': 88},\n", + "\n", + " 'Commercial_SFO': {'base_consumption': 180, 'peak_factor': 2.7, 'efficiency': 82}\n", + "\n", + " },\n", + "\n", + " 'Natural Gas': {\n", + "\n", + " 'Residential_NYC': {'base_consumption': 25, 'peak_factor': 1.8, 'efficiency': 90},\n", + "\n", + " 'Commercial_CHI': {'base_consumption': 80, 'peak_factor': 2.1, 'efficiency': 85},\n", + "\n", + " 'Industrial_HOU': {'base_consumption': 200, 'peak_factor': 1.9, 'efficiency': 95},\n", + "\n", + " 'Residential_LAX': {'base_consumption': 20, 'peak_factor': 2.0, 'efficiency': 87},\n", + "\n", + " 'Commercial_SFO': {'base_consumption': 95, 'peak_factor': 2.3, 'efficiency': 83}\n", + "\n", + " },\n", + "\n", + " 'Water': {\n", + "\n", + " 'Residential_NYC': {'base_consumption': 180, 'peak_factor': 1.5, 'efficiency': 88},\n", + "\n", + " 'Commercial_CHI': {'base_consumption': 450, 'peak_factor': 1.7, 'efficiency': 82},\n", + "\n", + " 'Industrial_HOU': {'base_consumption': 1200, 'peak_factor': 1.6, 'efficiency': 91},\n", + "\n", + " 'Residential_LAX': {'base_consumption': 160, 'peak_factor': 1.8, 'efficiency': 85},\n", + "\n", + " 'Commercial_SFO': {'base_consumption': 380, 'peak_factor': 1.9, 'efficiency': 79}\n", + "\n", + " },\n", + "\n", + " 'Solar': {\n", + "\n", + " 'Residential_NYC': {'base_consumption': -8, 'peak_factor': 3.5, 'efficiency': 78},\n", + "\n", + " 'Commercial_CHI': {'base_consumption': -75, 'peak_factor': 4.0, 'efficiency': 85},\n", + "\n", + " 'Industrial_HOU': {'base_consumption': -250, 'peak_factor': 3.8, 'efficiency': 88},\n", + "\n", + " 'Residential_LAX': {'base_consumption': -12, 'peak_factor': 4.2, 'efficiency': 82},\n", + "\n", + " 'Commercial_SFO': {'base_consumption': -95, 'peak_factor': 3.9, 'efficiency': 86}\n", + "\n", + " }\n", + "\n", + "}\n", + "\n", + "\n", + "# 
Generate energy reading records\n", + "\n", + "reading_data = []\n", + "\n", + "base_date = datetime(2024, 1, 1)\n", + "\n", + "\n", + "# Create 2,000 meters with hourly readings for 3 months\n", + "\n", + "for meter_num in range(1, 2001):\n", + "\n", + " meter_id = f\"MTR{meter_num:06d}\"\n", + " \n", + " # Each meter gets readings for 90 days (hourly)\n", + "\n", + " for day in range(90):\n", + "\n", + " for hour in range(24):\n", + "\n", + " reading_date = base_date + timedelta(days=day, hours=hour)\n", + " \n", + " # Select energy type and location for this meter\n", + "\n", + " energy_type = random.choice(ENERGY_TYPES)\n", + "\n", + " location = random.choice(LOCATIONS)\n", + " \n", + " params = CONSUMPTION_PARAMS[energy_type][location]\n", + " \n", + " # Calculate consumption with time-based variations\n", + "\n", + " # Seasonal variation (higher in winter for heating, summer for cooling)\n", + "\n", + " month = reading_date.month\n", + "\n", + " if energy_type in ['Electricity', 'Natural Gas']:\n", + "\n", + " if month in [12, 1, 2]: # Winter\n", + "\n", + " seasonal_factor = 1.4\n", + "\n", + " elif month in [6, 7, 8]: # Summer\n", + "\n", + " seasonal_factor = 1.3\n", + "\n", + " else:\n", + "\n", + " seasonal_factor = 1.0\n", + "\n", + " else:\n", + "\n", + " seasonal_factor = 1.0\n", + " \n", + " # Time-of-day variation\n", + "\n", + " hour_factor = 1.0\n", + "\n", + " if hour in [6, 7, 8, 17, 18, 19]: # Peak hours\n", + "\n", + " hour_factor = params['peak_factor']\n", + "\n", + " elif hour in [2, 3, 4, 5]: # Off-peak\n", + "\n", + " hour_factor = 0.4\n", + "\n", + " \n", + " # Calculate consumption\n", + "\n", + " consumption_variation = random.uniform(0.8, 1.2)\n", + "\n", + " consumption = round(params['base_consumption'] * seasonal_factor * hour_factor * consumption_variation, 3)\n", + " \n", + " # Peak demand (higher during peak hours)\n", + "\n", + " peak_demand = round(abs(consumption) * random.uniform(1.1, 1.5), 2)\n", + " \n", + " # Efficiency rating with some variation\n", + "\n", + " efficiency_variation = random.randint(-5, 3)\n", + "\n", + " efficiency_rating = max(0, min(100, params['efficiency'] + efficiency_variation))\n", + " \n", + " reading_data.append({\n", + "\n", + " \"meter_id\": meter_id,\n", + "\n", + " \"reading_date\": reading_date,\n", + "\n", + " \"energy_type\": energy_type,\n", + "\n", + " \"consumption\": consumption,\n", + "\n", + " \"location\": location,\n", + "\n", + " \"peak_demand\": peak_demand,\n", + "\n", + " \"efficiency_rating\": efficiency_rating\n", + "\n", + " })\n", + "\n", + "\n", + "\n", + "print(f\"Generated {len(reading_data)} energy reading records\")\n", + "\n", + "print(\"Sample record:\", reading_data[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 4: Insert Data Using PySpark\n", + "\n", + "### Data Insertion Strategy\n", + "\n", + "We'll use PySpark to:\n", + "\n", + "1. **Create DataFrame** from our generated data\n", + "2. **Insert into Delta table** with liquid clustering\n", + "3. 
**Verify the insertion** with a sample query\n", + "\n", + "### Why PySpark for Insertion?\n", + "\n", + "- **Distributed processing**: Handles large datasets efficiently\n", + "- **Type safety**: Ensures data integrity\n", + "- **Optimization**: Leverages Spark's query optimization\n", + "- **Liquid clustering**: Automatically applies clustering during insertion" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "DataFrame Schema:\n", + "root\n", + " |-- consumption: double (nullable = true)\n", + " |-- efficiency_rating: long (nullable = true)\n", + " |-- energy_type: string (nullable = true)\n", + " |-- location: string (nullable = true)\n", + " |-- meter_id: string (nullable = true)\n", + " |-- peak_demand: double (nullable = true)\n", + " |-- reading_date: timestamp (nullable = true)\n", + "\n", + "\n", + "Sample Data:\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-----------+-----------------+-----------+---------------+---------+-----------+-------------------+\n", + "|consumption|efficiency_rating|energy_type| location| meter_id|peak_demand| reading_date|\n", + "+-----------+-----------------+-----------+---------------+---------+-----------+-------------------+\n", + "| -8.397| 80| Solar|Residential_NYC|MTR000001| 11.81|2024-01-01 00:00:00|\n", + "| -13.923| 83| Solar|Residential_LAX|MTR000001| 18.41|2024-01-01 01:00:00|\n", + "| 113.538| 83|Electricity| Commercial_SFO|MTR000001| 125.27|2024-01-01 02:00:00|\n", + "| 145.708| 78| Water| Commercial_CHI|MTR000001| 196.87|2024-01-01 03:00:00|\n", + "| 489.841| 86| Water| Industrial_HOU|MTR000001| 611.02|2024-01-01 04:00:00|\n", + "+-----------+-----------------+-----------+---------------+---------+-----------+-------------------+\n", + "only showing top 5 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "\n", + "Successfully inserted 4320000 records into energy.analytics.energy_readings\n", + "Liquid clustering automatically optimized the data layout during insertion!\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Insert data using PySpark DataFrame operations\n", + "\n", + "# Using fully qualified function references to avoid conflicts\n", + "\n", + "\n", + "# Create DataFrame from generated data\n", + "\n", + "df_readings = spark.createDataFrame(reading_data)\n", + "\n", + "\n", + "# Display schema and sample data\n", + "\n", + "print(\"DataFrame Schema:\")\n", + "\n", + "df_readings.printSchema()\n", + "\n", + "\n", + "\n", + "print(\"\\nSample Data:\")\n", + "\n", + "df_readings.show(5)\n", + "\n", + "\n", + "# Insert data into Delta table with liquid clustering\n", + "\n", + "# The CLUSTER BY (meter_id, reading_date) will automatically optimize the data layout\n", + "\n", + "df_readings.write.mode(\"overwrite\").saveAsTable(\"energy.analytics.energy_readings\")\n", + "\n", + "\n", + "print(f\"\\nSuccessfully inserted {df_readings.count()} records into energy.analytics.energy_readings\")\n", + "\n", + "print(\"Liquid clustering automatically optimized the data layout during insertion!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 5: Demonstrate Liquid Clustering Benefits\n", + "\n", + "### Query Performance Analysis\n", + "\n", + "Now let's see how liquid clustering improves query performance. 
We'll run queries that benefit from our clustering strategy:\n", + "\n", + "1. **Meter reading history** (clustered by meter_id)\n", + "2. **Time-based consumption analysis** (clustered by reading_date)\n", + "3. **Combined meter + time queries** (optimal for our clustering)\n", + "\n", + "### Expected Performance Benefits\n", + "\n", + "With liquid clustering, these queries should be significantly faster because:\n", + "\n", + "- **Data locality**: Related records are physically grouped together\n", + "- **Reduced I/O**: Less data needs to be read from disk\n", + "- **Automatic optimization**: No manual tuning required" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "=== Query 1: Meter Reading History ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+---------+-------------------+-----------+-----------+-----------+-----------------+\n", + "| meter_id| reading_date|energy_type|consumption|peak_demand|efficiency_rating|\n", + "+---------+-------------------+-----------+-----------+-----------+-----------------+\n", + "|MTR000001|2024-03-30 23:00:00| Solar| -76.497| 100.24| 83|\n", + "|MTR000001|2024-03-30 22:00:00| Water| 1110.183| 1612.46| 89|\n", + "|MTR000001|2024-03-30 21:00:00|Natural Gas| 20.917| 24.2| 88|\n", + "|MTR000001|2024-03-30 20:00:00| Water| 1129.645| 1513.78| 92|\n", + "|MTR000001|2024-03-30 19:00:00| Solar| -311.465| 355.24| 81|\n", + "|MTR000001|2024-03-30 18:00:00|Electricity| 1126.97| 1515.54| 92|\n", + "|MTR000001|2024-03-30 17:00:00| Water| 2149.727| 2591.09| 88|\n", + "|MTR000001|2024-03-30 16:00:00|Electricity| 188.143| 262.5| 84|\n", + "|MTR000001|2024-03-30 15:00:00|Electricity| 579.727| 817.15| 95|\n", + "|MTR000001|2024-03-30 14:00:00| Water| 404.661| 538.24| 78|\n", + "|MTR000001|2024-03-30 13:00:00|Electricity| 149.379| 182.73| 79|\n", + "|MTR000001|2024-03-30 12:00:00|Electricity| 149.926| 213.39| 76|\n", + "|MTR000001|2024-03-30 11:00:00| Solar| -243.733| 293.81| 88|\n", + "|MTR000001|2024-03-30 10:00:00|Natural Gas| 215.168| 322.07| 94|\n", + "|MTR000001|2024-03-30 09:00:00| Solar| -6.953| 7.84| 80|\n", + "|MTR000001|2024-03-30 08:00:00|Natural Gas| 233.304| 271.79| 84|\n", + "|MTR000001|2024-03-30 07:00:00|Natural Gas| 399.802| 554.49| 95|\n", + "|MTR000001|2024-03-30 06:00:00|Electricity| 1137.001| 1620.21| 91|\n", + "|MTR000001|2024-03-30 05:00:00| Solar| -3.402| 4.63| 80|\n", + "|MTR000001|2024-03-30 04:00:00| Solar| -29.498| 33.89| 85|\n", + "+---------+-------------------+-----------+-----------+-----------+-----------------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Records found: 24\n", + "\n", + "=== Query 2: Recent Peak Demand Issues ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-------------------+---------+--------------+-----------+-----------+\n", + "| reading_date| meter_id| location|peak_demand|energy_type|\n", + "+-------------------+---------+--------------+-----------+-----------+\n", + "|2024-02-15 19:00:00|MTR000069|Industrial_HOU| 3390.26| Water|\n", + "|2024-02-15 18:00:00|MTR001732|Industrial_HOU| 3384.62| Water|\n", + "|2024-02-15 17:00:00|MTR001502|Industrial_HOU| 3349.98| Water|\n", + "|2024-02-15 17:00:00|MTR000428|Industrial_HOU| 3312.01| Water|\n", + "|2024-02-15 19:00:00|MTR001003|Industrial_HOU| 3282.39| Water|\n", + 
"|2024-02-15 17:00:00|MTR000272|Industrial_HOU| 3274.09| Water|\n", + "|2024-02-15 06:00:00|MTR000513|Industrial_HOU| 3273.61| Water|\n", + "|2024-02-15 06:00:00|MTR000856|Industrial_HOU| 3258.9| Water|\n", + "|2024-02-15 19:00:00|MTR000552|Industrial_HOU| 3237.69| Water|\n", + "|2024-02-15 19:00:00|MTR000486|Industrial_HOU| 3231.59| Water|\n", + "|2024-02-15 07:00:00|MTR001437|Industrial_HOU| 3226.26| Water|\n", + "|2024-02-15 19:00:00|MTR000779|Industrial_HOU| 3217.28| Water|\n", + "|2024-02-15 18:00:00|MTR001101|Industrial_HOU| 3204.85| Water|\n", + "|2024-02-15 08:00:00|MTR001956|Industrial_HOU| 3203.88| Water|\n", + "|2024-02-15 06:00:00|MTR000745|Industrial_HOU| 3199.02| Water|\n", + "|2024-02-15 06:00:00|MTR001977|Industrial_HOU| 3197.43| Water|\n", + "|2024-02-15 06:00:00|MTR001795|Industrial_HOU| 3196.6| Water|\n", + "|2024-02-15 17:00:00|MTR001725|Industrial_HOU| 3188.73| Water|\n", + "|2024-02-15 08:00:00|MTR000494|Industrial_HOU| 3185.26| Water|\n", + "|2024-02-15 18:00:00|MTR001679|Industrial_HOU| 3178.45| Water|\n", + "+-------------------+---------+--------------+-----------+-----------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Peak demand issues found: 23244\n", + "\n", + "=== Query 3: Meter Consumption Trends ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+---------+-------------------+-----------+-----------+-----------------+\n", + "| meter_id| reading_date|energy_type|consumption|efficiency_rating|\n", + "+---------+-------------------+-----------+-----------+-----------------+\n", + "|MTR000001|2024-02-01 00:00:00|Electricity| 19.291| 85|\n", + "|MTR000001|2024-02-01 01:00:00|Electricity| 18.316| 82|\n", + "|MTR000001|2024-02-01 02:00:00|Natural Gas| 45.141| 81|\n", + "|MTR000001|2024-02-01 03:00:00|Natural Gas| 12.816| 88|\n", + "|MTR000001|2024-02-01 04:00:00| Water| 413.259| 88|\n", + "|MTR000001|2024-02-01 05:00:00| Water| 124.545| 82|\n", + "|MTR000001|2024-02-01 06:00:00|Electricity| 59.509| 84|\n", + "|MTR000001|2024-02-01 07:00:00|Natural Gas| 267.85| 81|\n", + "|MTR000001|2024-02-01 08:00:00|Electricity| 597.628| 77|\n", + "|MTR000001|2024-02-01 09:00:00|Natural Gas| 32.049| 85|\n", + "|MTR000001|2024-02-01 10:00:00| Solar| -10.908| 80|\n", + "|MTR000001|2024-02-01 11:00:00| Water| 432.552| 85|\n", + "|MTR000001|2024-02-01 12:00:00|Natural Gas| 261.021| 98|\n", + "|MTR000001|2024-02-01 13:00:00| Water| 529.122| 81|\n", + "|MTR000001|2024-02-01 14:00:00|Electricity| 677.571| 87|\n", + "|MTR000001|2024-02-01 15:00:00|Natural Gas| 32.76| 86|\n", + "|MTR000001|2024-02-01 16:00:00|Natural Gas| 269.902| 91|\n", + "|MTR000001|2024-02-01 17:00:00|Natural Gas| 46.793| 87|\n", + "|MTR000001|2024-02-01 18:00:00| Water| 344.857| 80|\n", + "|MTR000001|2024-02-01 19:00:00| Water| 674.861| 76|\n", + "+---------+-------------------+-----------+-----------+-----------------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Consumption trend records found: 50\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Demonstrate liquid clustering benefits with optimized queries\n", + "\n", + "\n", + "# Query 1: Meter reading history - benefits from meter_id clustering\n", + "\n", + "print(\"=== Query 1: Meter Reading History ===\")\n", + "\n", + "meter_history = spark.sql(\"\"\"\n", + "\n", + 
"SELECT meter_id, reading_date, energy_type, consumption, peak_demand, efficiency_rating\n", + "\n", + "FROM energy.analytics.energy_readings\n", + "\n", + "WHERE meter_id = 'MTR000001'\n", + "\n", + "ORDER BY reading_date DESC\n", + "\n", + "LIMIT 24\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "meter_history.show()\n", + "\n", + "print(f\"Records found: {meter_history.count()}\")\n", + "\n", + "\n", + "\n", + "# Query 2: Time-based peak demand analysis - benefits from reading_date clustering\n", + "\n", + "print(\"\\n=== Query 2: Recent Peak Demand Issues ===\")\n", + "\n", + "peak_demand = spark.sql(\"\"\"\n", + "\n", + "SELECT reading_date, meter_id, location, peak_demand, energy_type\n", + "\n", + "FROM energy.analytics.energy_readings\n", + "\n", + "WHERE DATE(reading_date) = '2024-02-15' AND peak_demand > 200\n", + "\n", + "ORDER BY peak_demand DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "peak_demand.show()\n", + "\n", + "print(f\"Peak demand issues found: {peak_demand.count()}\")\n", + "\n", + "\n", + "\n", + "# Query 3: Combined meter + time query - optimal for our clustering strategy\n", + "\n", + "print(\"\\n=== Query 3: Meter Consumption Trends ===\")\n", + "\n", + "consumption_trends = spark.sql(\"\"\"\n", + "\n", + "SELECT meter_id, reading_date, energy_type, consumption, efficiency_rating\n", + "\n", + "FROM energy.analytics.energy_readings\n", + "\n", + "WHERE meter_id LIKE 'MTR000%' AND reading_date >= '2024-02-01'\n", + "\n", + "ORDER BY meter_id, reading_date\n", + "\n", + "LIMIT 50\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "consumption_trends.show()\n", + "\n", + "print(f\"Consumption trend records found: {consumption_trends.count()}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 6: Analyze Clustering Effectiveness\n", + "\n", + "### Understanding the Impact\n", + "\n", + "Let's examine how liquid clustering has organized our data and analyze some aggregate statistics to demonstrate the energy insights possible with this optimized structure.\n", + "\n", + "### Key Analytics\n", + "\n", + "- **Meter performance** and consumption patterns\n", + "- **Location-based energy usage** and demand analysis\n", + "- **Energy type efficiency** and sustainability metrics\n", + "- **Peak demand patterns** and grid optimization" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "=== Meter Performance Analysis ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+---------+--------------+---------------+---------------+--------------+--------------------------+\n", + "| meter_id|total_readings|avg_consumption|max_peak_demand|avg_efficiency|total_absolute_consumption|\n", + "+---------+--------------+---------------+---------------+--------------+--------------------------+\n", + "|MTR001031| 2160| 217.68| 3438.64| 84.45| 618706.118|\n", + "|MTR001601| 2160| 212.481| 3290.42| 84.41| 615332.819|\n", + "|MTR000731| 2160| 208.168| 3384.62| 84.58| 613149.576|\n", + "|MTR000498| 2160| 218.411| 3269.85| 84.62| 610811.8|\n", + "|MTR001677| 2160| 214.368| 3111.93| 84.52| 610499.368|\n", + "|MTR000756| 2160| 207.871| 3170.51| 84.52| 609804.32|\n", + "|MTR000738| 2160| 212.499| 3368.23| 84.66| 608161.062|\n", + "|MTR001445| 2160| 211.693| 3419.94| 84.66| 605353.179|\n", + "|MTR000672| 2160| 199.036| 3233.82| 84.43| 605183.137|\n", + "|MTR000638| 2160| 204.707| 3185.66| 84.55| 
605092.247|\n", + "+---------+--------------+---------------+---------------+--------------+--------------------------+\n", + "\n", + "\n", + "=== Location-Based Consumption Analysis ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+---------------+--------------+-----------------+---------------+--------------+-------------+\n", + "| location|total_readings|total_consumption|avg_peak_demand|avg_efficiency|active_meters|\n", + "+---------------+--------------+-----------------+---------------+--------------+-------------+\n", + "| Industrial_HOU| 862832| 5.83683373747E8| 879.43| 90.51| 2000|\n", + "| Commercial_SFO| 862761| 2.22486257475E8| 335.26| 81.5| 2000|\n", + "| Commercial_CHI| 864849| 2.14778356813E8| 322.87| 81.5| 2000|\n", + "|Residential_NYC| 865090| 5.5338107161E7| 83.17| 84.25| 2000|\n", + "|Residential_LAX| 864468| 5.3089571892E7| 79.83| 84.5| 2000|\n", + "+---------------+--------------+-----------------+---------------+--------------+-------------+\n", + "\n", + "\n", + "=== Energy Type Efficiency Analysis ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-----------+--------------+---------------+--------------+---------------+-------------+\n", + "|energy_type|total_readings|avg_consumption|avg_efficiency|max_peak_demand|unique_meters|\n", + "+-----------+--------------+---------------+--------------+---------------+-------------+\n", + "| Water| 1080209| 506.2| 84.0| 3450.33| 2000|\n", + "|Electricity| 1080461| 274.209| 83.99| 2765.99| 2000|\n", + "| Solar| 1079655| 141.955| 82.8| 1705.38| 2000|\n", + "|Natural Gas| 1079675| 123.221| 87.0| 955.52| 2000|\n", + "+-----------+--------------+---------------+--------------+---------------+-------------+\n", + "\n", + "\n", + "=== Daily Consumption Patterns ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+----------+----+-----------------+---------------+-------------+\n", + "| date|hour|total_consumption|avg_peak_demand|reading_count|\n", + "+----------+----+-----------------+---------------+-------------+\n", + "|2024-02-01| 0| 451485.469| 293.78| 2000|\n", + "|2024-02-01| 1| 456081.564| 296.89| 2000|\n", + "|2024-02-01| 2| 186286.651| 121.16| 2000|\n", + "|2024-02-01| 3| 184992.821| 120.54| 2000|\n", + "|2024-02-01| 4| 188674.42| 122.68| 2000|\n", + "|2024-02-01| 5| 186561.449| 122.52| 2000|\n", + "|2024-02-01| 6| 984331.46| 640.24| 2000|\n", + "|2024-02-01| 7| 981030.505| 639.02| 2000|\n", + "|2024-02-01| 8| 975953.168| 633.09| 2000|\n", + "|2024-02-01| 9| 466503.579| 303.06| 2000|\n", + "|2024-02-01| 10| 445455.596| 289.59| 2000|\n", + "|2024-02-01| 11| 467970.723| 305.52| 2000|\n", + "|2024-02-01| 12| 448383.798| 292.21| 2000|\n", + "|2024-02-01| 13| 455059.613| 297.21| 2000|\n", + "|2024-02-01| 14| 439676.638| 286.3| 2000|\n", + "|2024-02-01| 15| 448438.104| 291.13| 2000|\n", + "|2024-02-01| 16| 454646.561| 293.63| 2000|\n", + "|2024-02-01| 17| 992647.303| 643.7| 2000|\n", + "|2024-02-01| 18| 989921.013| 640.01| 2000|\n", + "|2024-02-01| 19| 975567.504| 635.82| 2000|\n", + "+----------+----+-----------------+---------------+-------------+\n", + "only showing top 20 rows\n", + "\n", + "\n", + "=== Monthly Consumption Trends ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-------+-------------------+---------------+--------------+-------------+\n", + "| 
month|monthly_consumption|avg_peak_demand|avg_efficiency|active_meters|\n", + "+-------+-------------------+---------------+--------------+-------------+\n", + "|2024-01| 4.04155944396E8| 353.1| 84.45| 2000|\n", + "|2024-02| 3.78452396762E8| 353.45| 84.45| 2000|\n", + "|2024-03| 3.4676732593E8| 313.08| 84.45| 2000|\n", + "+-------+-------------------+---------------+--------------+-------------+\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Analyze clustering effectiveness and energy insights\n", + "\n", + "\n", + "# Meter performance analysis\n", + "\n", + "print(\"=== Meter Performance Analysis ===\")\n", + "\n", + "meter_performance = spark.sql(\"\"\"\n", + "\n", + "SELECT meter_id, COUNT(*) as total_readings,\n", + "\n", + " ROUND(AVG(consumption), 3) as avg_consumption,\n", + "\n", + " ROUND(MAX(peak_demand), 2) as max_peak_demand,\n", + "\n", + " ROUND(AVG(efficiency_rating), 2) as avg_efficiency,\n", + "\n", + " ROUND(SUM(ABS(consumption)), 3) as total_absolute_consumption\n", + "\n", + "FROM energy.analytics.energy_readings\n", + "\n", + "GROUP BY meter_id\n", + "\n", + "ORDER BY total_absolute_consumption DESC\n", + "\n", + "LIMIT 10\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "meter_performance.show()\n", + "\n", + "\n", + "# Location-based consumption analysis\n", + "\n", + "print(\"\\n=== Location-Based Consumption Analysis ===\")\n", + "\n", + "location_analysis = spark.sql(\"\"\"\n", + "\n", + "SELECT location, COUNT(*) as total_readings,\n", + "\n", + " ROUND(SUM(ABS(consumption)), 3) as total_consumption,\n", + "\n", + " ROUND(AVG(peak_demand), 2) as avg_peak_demand,\n", + "\n", + " ROUND(AVG(efficiency_rating), 2) as avg_efficiency,\n", + "\n", + " COUNT(DISTINCT meter_id) as active_meters\n", + "\n", + "FROM energy.analytics.energy_readings\n", + "\n", + "GROUP BY location\n", + "\n", + "ORDER BY total_consumption DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "location_analysis.show()\n", + "\n", + "\n", + "# Energy type efficiency analysis\n", + "\n", + "print(\"\\n=== Energy Type Efficiency Analysis ===\")\n", + "\n", + "energy_efficiency = spark.sql(\"\"\"\n", + "\n", + "SELECT energy_type, COUNT(*) as total_readings,\n", + "\n", + " ROUND(AVG(ABS(consumption)), 3) as avg_consumption,\n", + "\n", + " ROUND(AVG(efficiency_rating), 2) as avg_efficiency,\n", + "\n", + " ROUND(MAX(peak_demand), 2) as max_peak_demand,\n", + "\n", + " COUNT(DISTINCT meter_id) as unique_meters\n", + "\n", + "FROM energy.analytics.energy_readings\n", + "\n", + "GROUP BY energy_type\n", + "\n", + "ORDER BY avg_consumption DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "energy_efficiency.show()\n", + "\n", + "\n", + "# Daily consumption patterns\n", + "\n", + "print(\"\\n=== Daily Consumption Patterns ===\")\n", + "\n", + "daily_patterns = spark.sql(\"\"\"\n", + "\n", + "SELECT DATE(reading_date) as date, HOUR(reading_date) as hour,\n", + "\n", + " ROUND(SUM(ABS(consumption)), 3) as total_consumption,\n", + "\n", + " ROUND(AVG(peak_demand), 2) as avg_peak_demand,\n", + "\n", + " COUNT(*) as reading_count\n", + "\n", + "FROM energy.analytics.energy_readings\n", + "\n", + "WHERE DATE(reading_date) = '2024-02-01'\n", + "\n", + "GROUP BY DATE(reading_date), HOUR(reading_date)\n", + "\n", + "ORDER BY hour\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "daily_patterns.show()\n", + "\n", + "\n", + "# Monthly consumption trends\n", + "\n", + "print(\"\\n=== Monthly Consumption Trends ===\")\n", + "\n", + 
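"# DATE_FORMAT('yyyy-MM') below truncates each reading timestamp to its calendar month, rolling the hourly readings up into monthly totals\n",
+ "\n",
+ 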
"monthly_trends = spark.sql(\"\"\"\n", + "\n", + "SELECT DATE_FORMAT(reading_date, 'yyyy-MM') as month,\n", + "\n", + " ROUND(SUM(ABS(consumption)), 3) as monthly_consumption,\n", + "\n", + " ROUND(AVG(peak_demand), 2) as avg_peak_demand,\n", + "\n", + " ROUND(AVG(efficiency_rating), 2) as avg_efficiency,\n", + "\n", + " COUNT(DISTINCT meter_id) as active_meters\n", + "\n", + "FROM energy.analytics.energy_readings\n", + "\n", + "GROUP BY DATE_FORMAT(reading_date, 'yyyy-MM')\n", + "\n", + "ORDER BY month\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "monthly_trends.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Key Takeaways: Delta Liquid Clustering in AIDP\n", + "\n", + "### What We Demonstrated\n", + "\n", + "1. **Automatic Optimization**: Created a table with `CLUSTER BY (meter_id, reading_date)` and let Delta automatically optimize data layout\n", + "\n", + "2. **Performance Benefits**: Queries on clustered columns (meter_id, reading_date) are significantly faster due to data locality\n", + "\n", + "3. **Zero Maintenance**: No manual partitioning, bucketing, or Z-Ordering required - Delta handles it automatically\n", + "\n", + "4. **Real-World Use Case**: Energy analytics where smart grid monitoring and consumption analysis are critical\n", + "\n", + "### AIDP Advantages\n", + "\n", + "- **Unified Analytics**: Seamlessly integrates with other AIDP services\n", + "- **Governance**: Catalog and schema isolation for energy data\n", + "- **Performance**: Optimized for both OLAP and OLTP workloads\n", + "- **Scalability**: Handles energy-scale data volumes effortlessly\n", + "\n", + "### Best Practices for Liquid Clustering\n", + "\n", + "1. **Choose clustering columns** based on your most common query patterns\n", + "2. **Start with 1-4 columns** - too many can reduce effectiveness\n", + "3. **Consider cardinality** - high-cardinality columns work best\n", + "4. **Monitor and adjust** as query patterns evolve\n", + "\n", + "### Next Steps\n", + "\n", + "- Explore other AIDP features like AI/ML integration\n", + "- Try liquid clustering with different column combinations\n", + "- Scale up to larger energy datasets\n", + "- Integrate with real smart meter and IoT sensor data\n", + "\n", + "This notebook demonstrates how Oracle AI Data Platform makes advanced energy analytics accessible while maintaining enterprise-grade performance and governance." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/Notebooks/liquid_clustering/financial_services_delta_liquid_clustering_demo.ipynb b/Notebooks/liquid_clustering/financial_services_delta_liquid_clustering_demo.ipynb new file mode 100644 index 0000000..17f1f68 --- /dev/null +++ b/Notebooks/liquid_clustering/financial_services_delta_liquid_clustering_demo.ipynb @@ -0,0 +1,948 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Financial Services: Delta Liquid Clustering Demo\n", + "\n", + "\n", + "## Overview\n", + "\n", + "\n", + "This notebook demonstrates the power of **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using a financial services analytics use case. 
Liquid clustering automatically optimizes data layout for query performance without requiring manual partitioning or Z-Ordering.\n", + "\n", + "### What is Liquid Clustering?\n", + "\n", + "Liquid clustering automatically identifies and groups similar data together based on clustering columns you define. This optimization happens automatically during data ingestion and maintenance operations, providing:\n", + "\n", + "- **Automatic optimization**: No manual tuning required\n", + "- **Improved query performance**: Faster queries on clustered columns\n", + "- **Reduced maintenance**: No need for manual repartitioning\n", + "- **Adaptive clustering**: Adjusts as data patterns change\n", + "\n", + "### Use Case: Transaction Fraud Detection and Customer Analytics\n", + "\n", + "We'll analyze financial transaction records from a bank. Our clustering strategy will optimize for:\n", + "\n", + "- **Customer-specific queries**: Fast lookups by account ID\n", + "- **Time-based analysis**: Efficient filtering by transaction date\n", + "- **Fraud pattern detection**: Quick aggregation by transaction type and risk scores\n", + "\n", + "### AIDP Environment Setup\n", + "\n", + "This notebook leverages the existing Spark session in your AIDP environment." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Financial services catalog and analytics schema created successfully!\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create financial services catalog and analytics schema\n", + "\n", + "# In AIDP, catalogs provide data isolation and governance\n", + "\n", + "spark.sql(\"CREATE CATALOG IF NOT EXISTS finance\")\n", + "\n", + "spark.sql(\"CREATE SCHEMA IF NOT EXISTS finance.analytics\")\n", + "\n", + "print(\"Financial services catalog and analytics schema created successfully!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 2: Create Delta Table with Liquid Clustering\n", + "\n", + "### Table Design\n", + "\n", + "Our `account_transactions` table will store:\n", + "\n", + "- **account_id**: Unique account identifier\n", + "- **transaction_date**: Date and time of transaction\n", + "- **transaction_type**: Type (Deposit, Withdrawal, Transfer, Payment, etc.)\n", + "- **amount**: Transaction amount\n", + "- **merchant_category**: Merchant type (Retail, Restaurant, Online, etc.)\n", + "- **location**: Transaction location\n", + "- **risk_score**: Fraud risk assessment (0-100)\n", + "\n", + "### Clustering Strategy\n", + "\n", + "We'll cluster by `account_id` and `transaction_date` because:\n", + "\n", + "- **account_id**: Customers often have multiple transactions, grouping their financial activity together\n", + "- **transaction_date**: Time-based queries are critical for fraud detection, spending analysis, and regulatory reporting\n", + "- This combination optimizes for both customer account analysis and temporal fraud pattern detection" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Delta table with liquid clustering created successfully!\n", + "Clustering will automatically optimize data layout for queries on account_id and transaction_date.\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create Delta table with liquid clustering\n", + "\n", + "# CLUSTER BY defines the columns for automatic optimization\n", + "\n", + 
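"# Clustering keys can be revised later without an immediate rewrite of existing data, e.g.\n",
+ "\n",
+ "#   ALTER TABLE finance.analytics.account_transactions CLUSTER BY (account_id)\n",
+ "\n",
+ "# (syntax per Delta Lake liquid clustering docs; confirm support in your AIDP runtime)\n",
+ "\n",
+ 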
"spark.sql(\"\"\"\n", + "\n", + "CREATE TABLE IF NOT EXISTS finance.analytics.account_transactions (\n", + "\n", + " account_id STRING,\n", + "\n", + " transaction_date TIMESTAMP,\n", + "\n", + " transaction_type STRING,\n", + "\n", + " amount DECIMAL(15,2),\n", + "\n", + " merchant_category STRING,\n", + "\n", + " location STRING,\n", + "\n", + " risk_score INT\n", + "\n", + ")\n", + "\n", + "USING DELTA\n", + "\n", + "CLUSTER BY (account_id, transaction_date)\n", + "\n", + "\"\"\")\n", + "\n", + "print(\"Delta table with liquid clustering created successfully!\")\n", + "\n", + "print(\"Clustering will automatically optimize data layout for queries on account_id and transaction_date.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 3: Generate Financial Services Sample Data\n", + "\n", + "### Data Generation Strategy\n", + "\n", + "We'll create realistic financial transaction data including:\n", + "\n", + "- **5,000 accounts** with multiple transactions over time\n", + "- **Transaction types**: Deposits, withdrawals, transfers, payments, ATM withdrawals\n", + "- **Realistic temporal patterns**: Daily banking activity, weekend vs weekday patterns\n", + "- **Merchant categories**: Retail, restaurants, online shopping, utilities, entertainment\n", + "\n", + "### Why This Data Pattern?\n", + "\n", + "This data simulates real financial scenarios where:\n", + "\n", + "- Customers perform multiple transactions daily/weekly\n", + "- Fraud patterns emerge over time\n", + "- Regulatory reporting requires temporal analysis\n", + "- Risk scoring enables real-time fraud prevention\n", + "- Customer spending analysis drives personalized financial services" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Generated 150600 account transaction records\n", + "Sample record: {'account_id': 'ACC00000001', 'transaction_date': datetime.datetime(2024, 1, 5, 13, 0), 'transaction_type': 'ATM', 'amount': -412.88, 'merchant_category': 'Entertainment', 'location': 'ATM', 'risk_score': 27}\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Generate sample financial transaction data\n", + "\n", + "# Using fully qualified imports to avoid conflicts\n", + "\n", + "import random\n", + "\n", + "from datetime import datetime, timedelta\n", + "\n", + "\n", + "# Define financial data constants\n", + "\n", + "TRANSACTION_TYPES = ['Deposit', 'Withdrawal', 'Transfer', 'Payment', 'ATM']\n", + "\n", + "MERCHANT_CATEGORIES = ['Retail', 'Restaurant', 'Online', 'Utilities', 'Entertainment', 'Groceries', 'Healthcare', 'Transportation']\n", + "\n", + "LOCATIONS = ['New York, NY', 'Los Angeles, CA', 'Chicago, IL', 'Houston, TX', 'Miami, FL', 'Online', 'ATM']\n", + "\n", + "\n", + "# Generate account transaction records\n", + "\n", + "transaction_data = []\n", + "\n", + "base_date = datetime(2024, 1, 1)\n", + "\n", + "\n", + "# Create 5,000 accounts with 10-50 transactions each\n", + "\n", + "for account_num in range(1, 5001):\n", + "\n", + " account_id = f\"ACC{account_num:08d}\"\n", + " \n", + " # Each account gets 10-50 transactions over 12 months\n", + "\n", + " num_transactions = random.randint(10, 50)\n", + " \n", + " for i in range(num_transactions):\n", + "\n", + " # Spread transactions over 12 months with realistic timing\n", + "\n", + " days_offset = random.randint(0, 365)\n", + "\n", + " hours_offset = random.randint(0, 23)\n", + "\n", + " transaction_date = base_date 
+ timedelta(days=days_offset, hours=hours_offset)\n", + " \n", + " # Select transaction type\n", + "\n", + " transaction_type = random.choice(TRANSACTION_TYPES)\n", + " \n", + " # Amount based on transaction type\n", + "\n", + " if transaction_type in ['Deposit', 'Transfer']:\n", + "\n", + " amount = round(random.uniform(100, 10000), 2)\n", + "\n", + " elif transaction_type == 'ATM':\n", + "\n", + " amount = round(random.uniform(20, 500), 2) * -1\n", + "\n", + " else:\n", + "\n", + " amount = round(random.uniform(10, 2000), 2) * -1\n", + " \n", + " # Select merchant category and location\n", + "\n", + " merchant_category = random.choice(MERCHANT_CATEGORIES)\n", + "\n", + " if transaction_type == 'ATM':\n", + "\n", + " location = 'ATM'\n", + "\n", + " elif transaction_type == 'Online':\n", + "\n", + " location = 'Online'\n", + "\n", + " else:\n", + "\n", + " location = random.choice(LOCATIONS)\n", + " \n", + " # Risk score (0-100, higher = more suspicious)\n", + "\n", + " risk_score = random.randint(0, 100)\n", + " \n", + " transaction_data.append({\n", + "\n", + " \"account_id\": account_id,\n", + "\n", + " \"transaction_date\": transaction_date,\n", + "\n", + " \"transaction_type\": transaction_type,\n", + "\n", + " \"amount\": amount,\n", + "\n", + " \"merchant_category\": merchant_category,\n", + "\n", + " \"location\": location,\n", + "\n", + " \"risk_score\": risk_score\n", + "\n", + " })\n", + "\n", + "\n", + "\n", + "print(f\"Generated {len(transaction_data)} account transaction records\")\n", + "\n", + "print(\"Sample record:\", transaction_data[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 4: Insert Data Using PySpark\n", + "\n", + "### Data Insertion Strategy\n", + "\n", + "We'll use PySpark to:\n", + "\n", + "1. **Create DataFrame** from our generated data\n", + "2. **Insert into Delta table** with liquid clustering\n", + "3. 
**Verify the insertion** with a sample query\n", + "\n", + "### Why PySpark for Insertion?\n", + "\n", + "- **Distributed processing**: Handles large datasets efficiently\n", + "- **Type safety**: Ensures data integrity\n", + "- **Optimization**: Leverages Spark's query optimization\n", + "- **Liquid clustering**: Automatically applies clustering during insertion" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "DataFrame Schema:\n", + "root\n", + " |-- account_id: string (nullable = true)\n", + " |-- amount: double (nullable = true)\n", + " |-- location: string (nullable = true)\n", + " |-- merchant_category: string (nullable = true)\n", + " |-- risk_score: long (nullable = true)\n", + " |-- transaction_date: timestamp (nullable = true)\n", + " |-- transaction_type: string (nullable = true)\n", + "\n", + "\n", + "Sample Data:\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-----------+--------+-----------+-----------------+----------+-------------------+----------------+\n", + "| account_id| amount| location|merchant_category|risk_score| transaction_date|transaction_type|\n", + "+-----------+--------+-----------+-----------------+----------+-------------------+----------------+\n", + "|ACC00000001| -412.88| ATM| Entertainment| 27|2024-01-05 13:00:00| ATM|\n", + "|ACC00000001| -372.97| ATM| Entertainment| 3|2024-04-15 21:00:00| ATM|\n", + "|ACC00000001|-1117.24|Houston, TX| Transportation| 32|2024-01-16 12:00:00| Withdrawal|\n", + "|ACC00000001| -1733.0|Houston, TX| Restaurant| 8|2024-12-20 09:00:00| Payment|\n", + "|ACC00000001| -164.06| ATM| Entertainment| 2|2024-02-12 12:00:00| ATM|\n", + "+-----------+--------+-----------+-----------------+----------+-------------------+----------------+\n", + "only showing top 5 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "\n", + "Successfully inserted 150600 records into finance.analytics.account_transactions\n", + "Liquid clustering automatically optimized the data layout during insertion!\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Insert data using PySpark DataFrame operations\n", + "\n", + "# Using fully qualified function references to avoid conflicts\n", + "\n", + "\n", + "# Create DataFrame from generated data\n", + "\n", + "df_transactions = spark.createDataFrame(transaction_data)\n", + "\n", + "\n", + "# Display schema and sample data\n", + "\n", + "print(\"DataFrame Schema:\")\n", + "\n", + "df_transactions.printSchema()\n", + "\n", + "\n", + "\n", + "print(\"\\nSample Data:\")\n", + "\n", + "df_transactions.show(5)\n", + "\n", + "\n", + "# Insert data into Delta table with liquid clustering\n", + "\n", + "# The CLUSTER BY (account_id, transaction_date) will automatically optimize the data layout\n", + "\n", + "df_transactions.write.mode(\"overwrite\").saveAsTable(\"finance.analytics.account_transactions\")\n", + "\n", + "\n", + "print(f\"\\nSuccessfully inserted {df_transactions.count()} records into finance.analytics.account_transactions\")\n", + "\n", + "print(\"Liquid clustering automatically optimized the data layout during insertion!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 5: Demonstrate Liquid Clustering Benefits\n", + "\n", + "### Query Performance Analysis\n", + "\n", + "Now let's see how liquid clustering improves query 
performance. We'll run queries that benefit from our clustering strategy:\n", + "\n", + "1. **Account transaction history** (clustered by account_id)\n", + "2. **Time-based fraud analysis** (clustered by transaction_date)\n", + "3. **Combined account + time queries** (optimal for our clustering)\n", + "\n", + "### Expected Performance Benefits\n", + "\n", + "With liquid clustering, these queries should be significantly faster because:\n", + "\n", + "- **Data locality**: Related records are physically grouped together\n", + "- **Reduced I/O**: Less data needs to be read from disk\n", + "- **Automatic optimization**: No manual tuning required" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "=== Query 1: Account Transaction History ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-----------+-------------------+----------------+--------+-----------------+\n", + "| account_id| transaction_date|transaction_type| amount|merchant_category|\n", + "+-----------+-------------------+----------------+--------+-----------------+\n", + "|ACC00000001|2024-12-20 09:00:00| Payment| -1733.0| Restaurant|\n", + "|ACC00000001|2024-10-24 09:00:00| Payment|-1689.16| Healthcare|\n", + "|ACC00000001|2024-09-03 20:00:00| ATM| -270.98| Entertainment|\n", + "|ACC00000001|2024-06-28 15:00:00| Transfer| 1288.76| Healthcare|\n", + "|ACC00000001|2024-06-25 15:00:00| Withdrawal|-1109.31| Entertainment|\n", + "|ACC00000001|2024-05-31 21:00:00| Withdrawal|-1323.84| Entertainment|\n", + "|ACC00000001|2024-04-15 21:00:00| ATM| -372.97| Entertainment|\n", + "|ACC00000001|2024-04-05 17:00:00| Withdrawal|-1532.56| Online|\n", + "|ACC00000001|2024-03-11 06:00:00| Deposit| 2533.68| Restaurant|\n", + "|ACC00000001|2024-02-29 17:00:00| Deposit| 3042.86| Entertainment|\n", + "|ACC00000001|2024-02-12 12:00:00| ATM| -164.06| Entertainment|\n", + "|ACC00000001|2024-01-16 12:00:00| Withdrawal|-1117.24| Transportation|\n", + "|ACC00000001|2024-01-05 13:00:00| ATM| -412.88| Entertainment|\n", + "+-----------+-------------------+----------------+--------+-----------------+\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Records found: 13\n", + "\n", + "=== Query 2: High-Risk Transactions Today ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+----------------+----------+----------------+------+----------+\n", + "|transaction_date|account_id|transaction_type|amount|risk_score|\n", + "+----------------+----------+----------------+------+----------+\n", + "+----------------+----------+----------------+------+----------+\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "High-risk transactions found: 0\n", + "\n", + "=== Query 3: Account Fraud Pattern Analysis ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-----------+-------------------+----------------+--------+----------+\n", + "| account_id| transaction_date|transaction_type| amount|risk_score|\n", + "+-----------+-------------------+----------------+--------+----------+\n", + "|ACC00000010|2024-06-04 14:00:00| Transfer| 1938.69| 89|\n", + "|ACC00000010|2024-07-11 20:00:00| Deposit| 8968.96| 53|\n", + "|ACC00000010|2024-07-19 14:00:00| Payment|-1926.25| 47|\n", + "|ACC00000010|2024-08-29 12:00:00| Deposit| 
9483.61| 28|\n", + "|ACC00000010|2024-09-20 23:00:00| Transfer| 7191.52| 73|\n", + "|ACC00000010|2024-09-22 11:00:00| Deposit| 5494.43| 25|\n", + "|ACC00000010|2024-10-05 21:00:00| Transfer| 6472.15| 4|\n", + "|ACC00000010|2024-10-15 18:00:00| Transfer| 5734.37| 90|\n", + "|ACC00000010|2024-11-24 09:00:00| Deposit| 4922.75| 53|\n", + "|ACC00000010|2024-12-17 07:00:00| Transfer| 5578.49| 63|\n", + "|ACC00000011|2024-06-11 18:00:00| Payment| -500.16| 98|\n", + "|ACC00000011|2024-06-26 07:00:00| ATM| -336.53| 89|\n", + "|ACC00000011|2024-08-26 02:00:00| Transfer| 9392.47| 82|\n", + "|ACC00000011|2024-09-15 16:00:00| Transfer| 1028.15| 54|\n", + "|ACC00000011|2024-09-16 21:00:00| Payment|-1566.64| 92|\n", + "|ACC00000011|2024-09-22 08:00:00| Deposit| 9293.03| 79|\n", + "|ACC00000011|2024-10-03 15:00:00| ATM| -186.99| 31|\n", + "|ACC00000011|2024-10-29 14:00:00| Deposit| 3884.05| 71|\n", + "|ACC00000011|2024-11-07 01:00:00| ATM| -160.3| 25|\n", + "|ACC00000011|2024-12-24 06:00:00| Withdrawal| -284.68| 3|\n", + "+-----------+-------------------+----------------+--------+----------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Pattern records found: 135\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Demonstrate liquid clustering benefits with optimized queries\n", + "\n", + "\n", + "# Query 1: Account transaction history - benefits from account_id clustering\n", + "\n", + "print(\"=== Query 1: Account Transaction History ===\")\n", + "\n", + "account_history = spark.sql(\"\"\"\n", + "\n", + "SELECT account_id, transaction_date, transaction_type, amount, merchant_category\n", + "\n", + "FROM finance.analytics.account_transactions\n", + "\n", + "WHERE account_id = 'ACC00000001'\n", + "\n", + "ORDER BY transaction_date DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "account_history.show()\n", + "\n", + "print(f\"Records found: {account_history.count()}\")\n", + "\n", + "\n", + "\n", + "# Query 2: Time-based fraud analysis - benefits from transaction_date clustering\n", + "\n", + "print(\"\\n=== Query 2: High-Risk Transactions Today ===\")\n", + "\n", + "high_risk_today = spark.sql(\"\"\"\n", + "\n", + "SELECT transaction_date, account_id, transaction_type, amount, risk_score\n", + "\n", + "FROM finance.analytics.account_transactions\n", + "\n", + "WHERE DATE(transaction_date) = CURRENT_DATE AND risk_score > 70\n", + "\n", + "ORDER BY risk_score DESC, transaction_date DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "high_risk_today.show()\n", + "\n", + "print(f\"High-risk transactions found: {high_risk_today.count()}\")\n", + "\n", + "\n", + "\n", + "# Query 3: Combined account + time query - optimal for our clustering strategy\n", + "\n", + "print(\"\\n=== Query 3: Account Fraud Pattern Analysis ===\")\n", + "\n", + "fraud_patterns = spark.sql(\"\"\"\n", + "\n", + "SELECT account_id, transaction_date, transaction_type, amount, risk_score\n", + "\n", + "FROM finance.analytics.account_transactions\n", + "\n", + "WHERE account_id LIKE 'ACC0000001%' AND transaction_date >= '2024-06-01'\n", + "\n", + "ORDER BY account_id, transaction_date\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "fraud_patterns.show()\n", + "\n", + "print(f\"Pattern records found: {fraud_patterns.count()}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 6: Analyze Clustering Effectiveness\n", + "\n", + "### Understanding the 
Impact\n", + "\n", + "Let's examine how liquid clustering has organized our data and analyze some aggregate statistics to demonstrate the financial insights possible with this optimized structure.\n", + "\n", + "### Key Analytics\n", + "\n", + "- **Transaction volume** by type and risk patterns\n", + "- **Customer spending analysis** and account segmentation\n", + "- **Fraud detection metrics** and risk scoring effectiveness\n", + "- **Merchant category trends** and spending patterns" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "=== Transaction Analysis by Type ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+----------------+------------------+--------------+----------+--------------+\n", + "|transaction_type|total_transactions| total_amount|avg_amount|avg_risk_score|\n", + "+----------------+------------------+--------------+----------+--------------+\n", + "| Deposit| 30187|1.5221104223E8| 5042.27| 49.62|\n", + "| Withdrawal| 30174|-3.033876498E7| -1005.46| 50.26|\n", + "| Transfer| 30136|1.5238560295E8| 5056.6| 49.78|\n", + "| ATM| 30066| -7801233.33| -259.47| 50.03|\n", + "| Payment| 30037|-3.010117823E7| -1002.14| 50.05|\n", + "+----------------+------------------+--------------+----------+--------------+\n", + "\n", + "\n", + "=== Risk Score Distribution ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+--------------+-----------------+----------+\n", + "| risk_category|transaction_count|percentage|\n", + "+--------------+-----------------+----------+\n", + "|Very High Risk| 31042| 20.61|\n", + "| Medium Risk| 30113| 20.00|\n", + "| Low Risk| 30068| 19.97|\n", + "| High Risk| 29761| 19.76|\n", + "| Very Low Risk| 29616| 19.67|\n", + "+--------------+-----------------+----------+\n", + "\n", + "\n", + "=== Merchant Category Spending Analysis ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-----------------+------------+-------------+----------+--------+\n", + "|merchant_category|transactions| deposits| spending|avg_risk|\n", + "+-----------------+------------+-------------+----------+--------+\n", + "| Restaurant| 19064|3.827775625E7|8691779.96| 49.79|\n", + "| Groceries| 18890|3.820906261E7|8631780.31| 49.87|\n", + "| Retail| 18742|3.759042825E7|8619076.98| 49.87|\n", + "| Transportation| 18829|3.861681326E7|8509561.01| 50.29|\n", + "| Online| 18699|3.762347824E7|8507451.77| 50.11|\n", + "| Entertainment| 18728|3.803600515E7|8477999.58| 50.05|\n", + "| Utilities| 18728|3.746450578E7|8462294.11| 49.36|\n", + "| Healthcare| 18920|3.877859564E7|8341232.82| 50.25|\n", + "+-----------------+------------+-------------+----------+--------+\n", + "\n", + "\n", + "=== Monthly Transaction Trends ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-------+------------+-------------+---------------+--------------+\n", + "| month|transactions| net_flow|active_accounts|avg_risk_score|\n", + "+-------+------------+-------------+---------------+--------------+\n", + "|2024-01| 12934| 2.01199504E7| 4456| 50.27|\n", + "|2024-02| 11801|1.821607335E7| 4352| 50.08|\n", + "|2024-03| 12678|2.072403875E7| 4416| 49.11|\n", + "|2024-04| 12451|1.985937352E7| 4395| 50.24|\n", + "|2024-05| 12799| 1.98355254E7| 4452| 49.9|\n", + "|2024-06| 12219|1.878310794E7| 4383| 49.46|\n", + "|2024-07| 
12846|2.066751421E7| 4446| 49.88|\n", + "|2024-08| 12749|1.995521618E7| 4439| 50.16|\n", + "|2024-09| 12277|1.939366962E7| 4382| 49.9|\n", + "|2024-10| 12796|2.047159483E7| 4448| 49.95|\n", + "|2024-11| 12464|1.915816532E7| 4404| 50.27|\n", + "|2024-12| 12586|1.917123912E7| 4414| 50.16|\n", + "+-------+------------+-------------+---------------+--------------+\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Analyze clustering effectiveness and financial insights\n", + "\n", + "\n", + "# Transaction analysis by type\n", + "\n", + "print(\"=== Transaction Analysis by Type ===\")\n", + "\n", + "transaction_analysis = spark.sql(\"\"\"\n", + "\n", + "SELECT transaction_type, COUNT(*) as total_transactions,\n", + "\n", + " ROUND(SUM(amount), 2) as total_amount,\n", + "\n", + " ROUND(AVG(amount), 2) as avg_amount,\n", + "\n", + " ROUND(AVG(risk_score), 2) as avg_risk_score\n", + "\n", + "FROM finance.analytics.account_transactions\n", + "\n", + "GROUP BY transaction_type\n", + "\n", + "ORDER BY total_transactions DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "transaction_analysis.show()\n", + "\n", + "\n", + "# Risk score distribution\n", + "\n", + "print(\"\\n=== Risk Score Distribution ===\")\n", + "\n", + "risk_distribution = spark.sql(\"\"\"\n", + "\n", + "SELECT \n", + "\n", + " CASE \n", + "\n", + " WHEN risk_score >= 80 THEN 'Very High Risk'\n", + "\n", + " WHEN risk_score >= 60 THEN 'High Risk'\n", + "\n", + " WHEN risk_score >= 40 THEN 'Medium Risk'\n", + "\n", + " WHEN risk_score >= 20 THEN 'Low Risk'\n", + "\n", + " ELSE 'Very Low Risk'\n", + "\n", + " END as risk_category,\n", + "\n", + " COUNT(*) as transaction_count,\n", + "\n", + " ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) as percentage\n", + "\n", + "FROM finance.analytics.account_transactions\n", + "\n", + "GROUP BY \n", + "\n", + " CASE \n", + "\n", + " WHEN risk_score >= 80 THEN 'Very High Risk'\n", + "\n", + " WHEN risk_score >= 60 THEN 'High Risk'\n", + "\n", + " WHEN risk_score >= 40 THEN 'Medium Risk'\n", + "\n", + " WHEN risk_score >= 20 THEN 'Low Risk'\n", + "\n", + " ELSE 'Very Low Risk'\n", + "\n", + " END\n", + "\n", + "ORDER BY transaction_count DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "risk_distribution.show()\n", + "\n", + "\n", + "# Merchant category spending\n", + "\n", + "print(\"\\n=== Merchant Category Spending Analysis ===\")\n", + "\n", + "merchant_analysis = spark.sql(\"\"\"\n", + "\n", + "SELECT merchant_category, COUNT(*) as transactions,\n", + "\n", + " ROUND(SUM(CASE WHEN amount > 0 THEN amount ELSE 0 END), 2) as deposits,\n", + "\n", + " ROUND(SUM(CASE WHEN amount < 0 THEN ABS(amount) ELSE 0 END), 2) as spending,\n", + "\n", + " ROUND(AVG(risk_score), 2) as avg_risk\n", + "\n", + "FROM finance.analytics.account_transactions\n", + "\n", + "GROUP BY merchant_category\n", + "\n", + "ORDER BY spending DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "merchant_analysis.show()\n", + "\n", + "\n", + "# Monthly transaction trends\n", + "\n", + "print(\"\\n=== Monthly Transaction Trends ===\")\n", + "\n", + "monthly_trends = spark.sql(\"\"\"\n", + "\n", + "SELECT DATE_FORMAT(transaction_date, 'yyyy-MM') as month,\n", + "\n", + " COUNT(*) as transactions,\n", + "\n", + " ROUND(SUM(amount), 2) as net_flow,\n", + "\n", + " COUNT(DISTINCT account_id) as active_accounts,\n", + "\n", + " ROUND(AVG(risk_score), 2) as avg_risk_score\n", + "\n", + "FROM finance.analytics.account_transactions\n", + "\n", + "GROUP BY 
DATE_FORMAT(transaction_date, 'yyyy-MM')\n", + "\n", + "ORDER BY month\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "monthly_trends.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Key Takeaways: Delta Liquid Clustering in AIDP\n", + "\n", + "### What We Demonstrated\n", + "\n", + "1. **Automatic Optimization**: Created a table with `CLUSTER BY (account_id, transaction_date)` and let Delta automatically optimize data layout\n", + "\n", + "2. **Performance Benefits**: Queries on clustered columns (account_id, transaction_date) are significantly faster due to data locality\n", + "\n", + "3. **Zero Maintenance**: No manual partitioning, bucketing, or Z-Ordering required - Delta handles it automatically\n", + "\n", + "4. **Real-World Use Case**: Financial services analytics where fraud detection and customer analysis are critical\n", + "\n", + "### AIDP Advantages\n", + "\n", + "- **Unified Analytics**: Seamlessly integrates with other AIDP services\n", + "- **Governance**: Catalog and schema isolation for financial data\n", + "- **Performance**: Optimized for both OLAP and OLTP workloads\n", + "- **Scalability**: Handles financial-scale data volumes effortlessly\n", + "\n", + "### Best Practices for Liquid Clustering\n", + "\n", + "1. **Choose clustering columns** based on your most common query patterns\n", + "2. **Start with 1-4 columns** - too many can reduce effectiveness\n", + "3. **Consider cardinality** - high-cardinality columns work best\n", + "4. **Monitor and adjust** as query patterns evolve\n", + "\n", + "### Next Steps\n", + "\n", + "- Explore other AIDP features like AI/ML integration\n", + "- Try liquid clustering with different column combinations\n", + "- Scale up to larger financial datasets\n", + "- Integrate with real banking systems and fraud detection platforms\n", + "\n", + "This notebook demonstrates how Oracle AI Data Platform makes advanced financial analytics accessible while maintaining enterprise-grade performance and governance." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/Notebooks/liquid_clustering/healthcare_delta_liquid_clustering_demo.ipynb b/Notebooks/liquid_clustering/healthcare_delta_liquid_clustering_demo.ipynb new file mode 100644 index 0000000..cf8f100 --- /dev/null +++ b/Notebooks/liquid_clustering/healthcare_delta_liquid_clustering_demo.ipynb @@ -0,0 +1,711 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Healthcare Analytics: Delta Liquid Clustering Demo\n", + "\n", + "## Overview\n", + "\n", + "This notebook demonstrates the power of **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using a healthcare analytics use case. Liquid clustering is a revolutionary feature that automatically optimizes data layout for query performance without requiring manual partitioning or Z-Ordering.\n", + "\n", + "### What is Liquid Clustering?\n", + "\n", + "Liquid clustering automatically identifies and groups similar data together based on clustering columns you define. 
This optimization happens automatically during data ingestion and maintenance operations, providing:\n", + "\n", + "- **Automatic optimization**: No manual tuning required\n", + "- **Improved query performance**: Faster queries on clustered columns\n", + "- **Reduced maintenance**: No need for manual repartitioning\n", + "- **Adaptive clustering**: Adjusts as data patterns change\n", + "\n", + "### Use Case: Patient Diagnosis Analytics\n", + "\n", + "We'll analyze patient diagnosis records from a healthcare system. Our clustering strategy will optimize for:\n", + "- **Patient-specific queries**: Fast lookups by patient ID\n", + "- **Time-based analysis**: Efficient filtering by diagnosis date\n", + "- **Diagnosis patterns**: Quick aggregation by diagnosis type\n", + "\n", + "### AIDP Environment Setup\n", + "\n", + "This notebook leverages the existing Spark session in your AIDP environment." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Create healthcare catalog and gold schema\n", + "# In AIDP, catalogs provide data isolation and governance\n", + "\n", + "spark.sql(\"CREATE CATALOG IF NOT EXISTS healthcare\")\n", + "spark.sql(\"CREATE SCHEMA IF NOT EXISTS healthcare.gold\")\n", + "\n", + "print(\"Healthcare catalog and gold schema created successfully!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 2: Create Delta Table with Liquid Clustering\n", + "\n", + "### Table Design\n", + "\n", + "Our `patient_diagnoses` table will store:\n", + "- **patient_id**: Unique patient identifier\n", + "- **diagnosis_date**: When the diagnosis was made\n", + "- **diagnosis_code**: ICD-10 diagnosis code\n", + "- **diagnosis_description**: Human-readable diagnosis\n", + "- **severity_level**: Critical, High, Medium, Low\n", + "- **treating_physician**: Physician ID\n", + "- **facility_id**: Healthcare facility\n", + "\n", + "### Clustering Strategy\n", + "\n", + "We'll cluster by `patient_id` and `diagnosis_date` because:\n", + "- **patient_id**: Patients often have multiple visits, grouping their records together\n", + "- **diagnosis_date**: Time-based queries are common in healthcare analytics\n", + "- This combination optimizes for both patient history lookups and temporal analysis" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Delta table with liquid clustering created successfully!\n", + "Clustering will automatically optimize data layout for queries on patient_id and diagnosis_date.\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create Delta table with liquid clustering\n", + "# CLUSTER BY defines the columns for automatic optimization\n", + "\n", + "spark.sql(\"\"\"\n", + "CREATE TABLE IF NOT EXISTS healthcare.gold.patient_diagnoses (\n", + " patient_id STRING,\n", + " diagnosis_date DATE,\n", + " diagnosis_code STRING,\n", + " diagnosis_description STRING,\n", + " severity_level STRING,\n", + " treating_physician STRING,\n", + " facility_id STRING\n", + ")\n", + "USING DELTA\n", + "CLUSTER BY (patient_id, diagnosis_date)\n", + "\"\"\")\n", + "\n", + "print(\"Delta table with liquid clustering created successfully!\")\n", + "print(\"Clustering will automatically optimize data layout for queries on patient_id and diagnosis_date.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 3: Generate Healthcare Sample Data\n", + "\n", + 
"### Data Generation Strategy\n", + "\n", + "We'll create realistic healthcare data including:\n", + "- **100 patients** with multiple diagnoses over time\n", + "- **Common diagnoses**: Diabetes, Hypertension, Asthma, etc.\n", + "- **Realistic temporal patterns**: Follow-up visits, chronic condition management\n", + "- **Multiple facilities**: Different hospitals/clinics\n", + "\n", + "### Why This Data Pattern?\n", + "\n", + "This data simulates real healthcare scenarios where:\n", + "- Patients have multiple encounters\n", + "- Chronic conditions require ongoing monitoring\n", + "- Time-based analysis reveals treatment effectiveness\n", + "- Facility-level reporting is needed" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Generated 350 patient diagnosis records\n", + "Sample record: {'patient_id': 'PAT0001', 'diagnosis_date': datetime.date(2024, 2, 17), 'diagnosis_code': 'F41.9', 'diagnosis_description': 'Anxiety disorder, unspecified', 'severity_level': 'Medium', 'treating_physician': 'DR_SMITH', 'facility_id': 'CLINIC002'}\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Generate sample healthcare diagnosis data\n", + "# Using fully qualified pyspark.sql.functions to avoid conflicts\n", + "\n", + "import random\n", + "from datetime import datetime, timedelta\n", + "\n", + "# Define healthcare data constants\n", + "DIAGNOSES = [\n", + " (\"E11.9\", \"Type 2 diabetes mellitus without complications\", \"Medium\"),\n", + " (\"I10\", \"Essential hypertension\", \"High\"),\n", + " (\"J45.909\", \"Unspecified asthma, uncomplicated\", \"Medium\"),\n", + " (\"M54.5\", \"Low back pain\", \"Low\"),\n", + " (\"N39.0\", \"Urinary tract infection, site not specified\", \"Medium\"),\n", + " (\"Z51.11\", \"Encounter for antineoplastic chemotherapy\", \"Critical\"),\n", + " (\"I25.10\", \"Atherosclerotic heart disease of native coronary artery without angina pectoris\", \"High\"),\n", + " (\"F41.9\", \"Anxiety disorder, unspecified\", \"Medium\"),\n", + " (\"M79.3\", \"Panniculitis, unspecified\", \"Low\"),\n", + " (\"Z00.00\", \"Encounter for general adult medical examination without abnormal findings\", \"Low\")\n", + "]\n", + "\n", + "FACILITIES = [\"HOSP001\", \"HOSP002\", \"CLINIC001\", \"CLINIC002\", \"URGENT001\"]\n", + "PHYSICIANS = [\"DR_SMITH\", \"DR_JOHNSON\", \"DR_WILLIAMS\", \"DR_BROWN\", \"DR_JONES\", \"DR_GARCIA\", \"DR_MILLER\", \"DR_DAVIS\"]\n", + "\n", + "# Generate patient diagnosis records\n", + "patient_data = []\n", + "base_date = datetime(2024, 1, 1)\n", + "\n", + "# Create 100 patients with 2-5 diagnoses each\n", + "for patient_num in range(1, 101):\n", + " patient_id = f\"PAT{patient_num:04d}\"\n", + " \n", + " # Each patient gets 2-5 diagnoses over several months\n", + " num_diagnoses = random.randint(2, 5)\n", + " \n", + " for i in range(num_diagnoses):\n", + " # Spread diagnoses over 6 months\n", + " days_offset = random.randint(0, 180)\n", + " diagnosis_date = base_date + timedelta(days=days_offset)\n", + " \n", + " # Select random diagnosis\n", + " diagnosis_code, description, severity = random.choice(DIAGNOSES)\n", + " \n", + " # Select random facility and physician\n", + " facility = random.choice(FACILITIES)\n", + " physician = random.choice(PHYSICIANS)\n", + " \n", + " patient_data.append({\n", + " \"patient_id\": patient_id,\n", + " \"diagnosis_date\": diagnosis_date.date(),\n", + " \"diagnosis_code\": diagnosis_code,\n", + " 
\"diagnosis_description\": description,\n", + " \"severity_level\": severity,\n", + " \"treating_physician\": physician,\n", + " \"facility_id\": facility\n", + " })\n", + "\n", + "print(f\"Generated {len(patient_data)} patient diagnosis records\")\n", + "print(\"Sample record:\", patient_data[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 4: Insert Data Using PySpark\n", + "\n", + "### Data Insertion Strategy\n", + "\n", + "We'll use PySpark to:\n", + "1. **Create DataFrame** from our generated data\n", + "2. **Insert into Delta table** with liquid clustering\n", + "3. **Verify the insertion** with a sample query\n", + "\n", + "### Why PySpark for Insertion?\n", + "\n", + "- **Distributed processing**: Handles large datasets efficiently\n", + "- **Type safety**: Ensures data integrity\n", + "- **Optimization**: Leverages Spark's query optimization\n", + "- **Liquid clustering**: Automatically applies clustering during insertion" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "DataFrame Schema:\n", + "root\n", + " |-- diagnosis_code: string (nullable = true)\n", + " |-- diagnosis_date: date (nullable = true)\n", + " |-- diagnosis_description: string (nullable = true)\n", + " |-- facility_id: string (nullable = true)\n", + " |-- patient_id: string (nullable = true)\n", + " |-- severity_level: string (nullable = true)\n", + " |-- treating_physician: string (nullable = true)\n", + "\n", + "\n", + "Sample Data:\n", + "+--------------+--------------+---------------------+-----------+----------+--------------+------------------+\n", + "|diagnosis_code|diagnosis_date|diagnosis_description|facility_id|patient_id|severity_level|treating_physician|\n", + "+--------------+--------------+---------------------+-----------+----------+--------------+------------------+\n", + "| F41.9| 2024-02-17| Anxiety disorder,...| CLINIC002| PAT0001| Medium| DR_SMITH|\n", + "| I10| 2024-01-15| Essential hyperte...| HOSP002| PAT0001| High| DR_JOHNSON|\n", + "| J45.909| 2024-02-13| Unspecified asthm...| HOSP002| PAT0001| Medium| DR_JONES|\n", + "| Z00.00| 2024-06-25| Encounter for gen...| URGENT001| PAT0002| Low| DR_DAVIS|\n", + "| Z00.00| 2024-01-24| Encounter for gen...| HOSP002| PAT0002| Low| DR_JONES|\n", + "+--------------+--------------+---------------------+-----------+----------+--------------+------------------+\n", + "only showing top 5 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "\n", + "Successfully inserted 350 records into healthcare.gold.patient_diagnoses\n", + "Liquid clustering automatically optimized the data layout during insertion!\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Insert data using PySpark DataFrame operations\n", + "# Using fully qualified function references to avoid conflicts\n", + "\n", + "# Create DataFrame from generated data\n", + "df_diagnoses = spark.createDataFrame(patient_data)\n", + "\n", + "# Display schema and sample data\n", + "print(\"DataFrame Schema:\")\n", + "df_diagnoses.printSchema()\n", + "\n", + "print(\"\\nSample Data:\")\n", + "df_diagnoses.show(5)\n", + "\n", + "# Insert data into Delta table with liquid clustering\n", + "# The CLUSTER BY (patient_id, diagnosis_date) will automatically optimize the data layout\n", + "df_diagnoses.write.mode(\"overwrite\").saveAsTable(\"healthcare.gold.patient_diagnoses\")\n", + "\n", + 
"print(f\"\\nSuccessfully inserted {df_diagnoses.count()} records into healthcare.gold.patient_diagnoses\")\n", + "print(\"Liquid clustering automatically optimized the data layout during insertion!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 5: Demonstrate Liquid Clustering Benefits\n", + "\n", + "### Query Performance Analysis\n", + "\n", + "Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:\n", + "\n", + "1. **Patient history lookup** (clustered by patient_id)\n", + "2. **Time-based analysis** (clustered by diagnosis_date)\n", + "3. **Combined patient + time queries** (optimal for our clustering)\n", + "\n", + "### Expected Performance Benefits\n", + "\n", + "With liquid clustering, these queries should be significantly faster because:\n", + "- **Data locality**: Related records are physically grouped together\n", + "- **Reduced I/O**: Less data needs to be read from disk\n", + "- **Automatic optimization**: No manual tuning required" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "=== Query 1: Patient Diagnosis History ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+----------+--------------+--------------+---------------------+--------------+\n", + "|patient_id|diagnosis_date|diagnosis_code|diagnosis_description|severity_level|\n", + "+----------+--------------+--------------+---------------------+--------------+\n", + "| PAT0001| 2024-01-15| I10| Essential hyperte...| High|\n", + "| PAT0001| 2024-02-13| J45.909| Unspecified asthm...| Medium|\n", + "| PAT0001| 2024-02-17| F41.9| Anxiety disorder,...| Medium|\n", + "+----------+--------------+--------------+---------------------+--------------+\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Records found: 3\n", + "\n", + "=== Query 2: Recent Critical Diagnoses ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+--------------+----------+--------------+---------------------+------------------+\n", + "|diagnosis_date|patient_id|diagnosis_code|diagnosis_description|treating_physician|\n", + "+--------------+----------+--------------+---------------------+------------------+\n", + "| 2024-06-25| PAT0061| Z51.11| Encounter for ant...| DR_WILLIAMS|\n", + "| 2024-06-24| PAT0099| Z51.11| Encounter for ant...| DR_GARCIA|\n", + "| 2024-06-19| PAT0082| Z51.11| Encounter for ant...| DR_BROWN|\n", + "| 2024-06-18| PAT0018| Z51.11| Encounter for ant...| DR_DAVIS|\n", + "| 2024-06-16| PAT0091| Z51.11| Encounter for ant...| DR_WILLIAMS|\n", + "| 2024-06-05| PAT0056| Z51.11| Encounter for ant...| DR_JOHNSON|\n", + "| 2024-06-03| PAT0042| Z51.11| Encounter for ant...| DR_JONES|\n", + "| 2024-05-31| PAT0062| Z51.11| Encounter for ant...| DR_SMITH|\n", + "| 2024-05-24| PAT0023| Z51.11| Encounter for ant...| DR_SMITH|\n", + "| 2024-05-24| PAT0088| Z51.11| Encounter for ant...| DR_BROWN|\n", + "| 2024-05-22| PAT0096| Z51.11| Encounter for ant...| DR_BROWN|\n", + "| 2024-05-14| PAT0097| Z51.11| Encounter for ant...| DR_SMITH|\n", + "| 2024-05-10| PAT0019| Z51.11| Encounter for ant...| DR_JONES|\n", + "| 2024-04-30| PAT0009| Z51.11| Encounter for ant...| DR_JOHNSON|\n", + "| 2024-04-24| PAT0026| Z51.11| Encounter for ant...| DR_SMITH|\n", + "| 2024-04-12| PAT0100| Z51.11| Encounter 
for ant...| DR_DAVIS|\n", + "| 2024-04-10| PAT0052| Z51.11| Encounter for ant...| DR_DAVIS|\n", + "| 2024-04-10| PAT0069| Z51.11| Encounter for ant...| DR_GARCIA|\n", + "| 2024-04-04| PAT0053| Z51.11| Encounter for ant...| DR_MILLER|\n", + "| 2024-04-03| PAT0057| Z51.11| Encounter for ant...| DR_SMITH|\n", + "+--------------+----------+--------------+---------------------+------------------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Critical diagnoses found: 21\n", + "\n", + "=== Query 3: Patient Timeline Analysis ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+----------+--------------+--------------+--------------+-----------+\n", + "|patient_id|diagnosis_date|diagnosis_code|severity_level|facility_id|\n", + "+----------+--------------+--------------+--------------+-----------+\n", + "| PAT0010| 2024-05-05| E11.9| Medium| HOSP002|\n", + "| PAT0010| 2024-05-21| M79.3| Low| URGENT001|\n", + "| PAT0010| 2024-06-28| M54.5| Low| HOSP002|\n", + "| PAT0011| 2024-03-09| F41.9| Medium| HOSP002|\n", + "| PAT0011| 2024-03-29| N39.0| Medium| CLINIC001|\n", + "| PAT0012| 2024-04-14| M54.5| Low| URGENT001|\n", + "| PAT0012| 2024-04-17| M79.3| Low| CLINIC002|\n", + "| PAT0012| 2024-06-03| I10| High| HOSP002|\n", + "| PAT0013| 2024-06-18| E11.9| Medium| CLINIC001|\n", + "| PAT0014| 2024-04-04| J45.909| Medium| HOSP001|\n", + "| PAT0014| 2024-05-13| N39.0| Medium| HOSP002|\n", + "| PAT0014| 2024-05-24| M54.5| Low| CLINIC002|\n", + "| PAT0015| 2024-04-16| N39.0| Medium| HOSP001|\n", + "| PAT0015| 2024-04-18| Z00.00| Low| URGENT001|\n", + "| PAT0015| 2024-04-27| F41.9| Medium| CLINIC002|\n", + "| PAT0016| 2024-04-30| E11.9| Medium| URGENT001|\n", + "| PAT0016| 2024-06-21| J45.909| Medium| HOSP002|\n", + "| PAT0017| 2024-05-24| Z00.00| Low| CLINIC001|\n", + "| PAT0018| 2024-05-01| M54.5| Low| HOSP002|\n", + "| PAT0018| 2024-06-18| Z51.11| Critical| HOSP002|\n", + "+----------+--------------+--------------+--------------+-----------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Timeline records found: 25\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Demonstrate liquid clustering benefits with optimized queries\n", + "\n", + "# Query 1: Patient history - benefits from patient_id clustering\n", + "print(\"=== Query 1: Patient Diagnosis History ===\")\n", + "patient_history = spark.sql(\"\"\"\n", + "SELECT patient_id, diagnosis_date, diagnosis_code, diagnosis_description, severity_level\n", + "FROM healthcare.gold.patient_diagnoses\n", + "WHERE patient_id = 'PAT0001'\n", + "ORDER BY diagnosis_date\n", + "\"\"\")\n", + "\n", + "patient_history.show()\n", + "print(f\"Records found: {patient_history.count()}\")\n", + "\n", + "# Query 2: Time-based analysis - benefits from diagnosis_date clustering\n", + "print(\"\\n=== Query 2: Recent Critical Diagnoses ===\")\n", + "recent_critical = spark.sql(\"\"\"\n", + "SELECT diagnosis_date, patient_id, diagnosis_code, diagnosis_description, treating_physician\n", + "FROM healthcare.gold.patient_diagnoses\n", + "WHERE diagnosis_date >= '2024-04-01' AND severity_level = 'Critical'\n", + "ORDER BY diagnosis_date DESC\n", + "\"\"\")\n", + "\n", + "recent_critical.show()\n", + "print(f\"Critical diagnoses found: {recent_critical.count()}\")\n", + "\n", + "# Query 3: Combined 
patient + time query - optimal for our clustering strategy\n", + "print(\"\\n=== Query 3: Patient Timeline Analysis ===\")\n", + "patient_timeline = spark.sql(\"\"\"\n", + "SELECT patient_id, diagnosis_date, diagnosis_code, severity_level, facility_id\n", + "FROM healthcare.gold.patient_diagnoses\n", + "WHERE patient_id LIKE 'PAT001%' AND diagnosis_date >= '2024-03-01'\n", + "ORDER BY patient_id, diagnosis_date\n", + "\"\"\")\n", + "\n", + "patient_timeline.show()\n", + "print(f\"Timeline records found: {patient_timeline.count()}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 6: Analyze Clustering Effectiveness\n", + "\n", + "### Understanding the Impact\n", + "\n", + "Let's examine how liquid clustering has organized our data and analyze some aggregate statistics to demonstrate the healthcare insights possible with this optimized structure.\n", + "\n", + "### Key Analytics\n", + "\n", + "- **Diagnosis frequency** by type\n", + "- **Severity distribution** across facilities\n", + "- **Physician workload** analysis\n", + "- **Temporal patterns** in diagnoses" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "=== Diagnosis Frequency Analysis ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+--------------+-------------------------------------------------------------------------------+---------+----------+\n", + "|diagnosis_code|diagnosis_description |frequency|percentage|\n", + "+--------------+-------------------------------------------------------------------------------+---------+----------+\n", + "|Z00.00 |Encounter for general adult medical examination without abnormal findings |43 |12.29 |\n", + "|N39.0 |Urinary tract infection, site not specified |40 |11.43 |\n", + "|M54.5 |Low back pain |40 |11.43 |\n", + "|Z51.11 |Encounter for antineoplastic chemotherapy |38 |10.86 |\n", + "|J45.909 |Unspecified asthma, uncomplicated |37 |10.57 |\n", + "|F41.9 |Anxiety disorder, unspecified |36 |10.29 |\n", + "|E11.9 |Type 2 diabetes mellitus without complications |33 |9.43 |\n", + "|M79.3 |Panniculitis, unspecified |30 |8.57 |\n", + "|I10 |Essential hypertension |28 |8.00 |\n", + "|I25.10 |Atherosclerotic heart disease of native coronary artery without angina pectoris|25 |7.14 |\n", + "+--------------+-------------------------------------------------------------------------------+---------+----------+\n", + "\n", + "\n", + "=== Severity Distribution by Facility ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-----------+--------------+-----+\n", + "|facility_id|severity_level|count|\n", + "+-----------+--------------+-----+\n", + "| CLINIC001| Critical| 8|\n", + "| CLINIC001| High| 8|\n", + "| CLINIC001| Low| 20|\n", + "| CLINIC001| Medium| 25|\n", + "| CLINIC002| Critical| 8|\n", + "| CLINIC002| High| 9|\n", + "| CLINIC002| Low| 24|\n", + "| CLINIC002| Medium| 23|\n", + "| HOSP001| Critical| 6|\n", + "| HOSP001| High| 8|\n", + "| HOSP001| Low| 18|\n", + "| HOSP001| Medium| 30|\n", + "| HOSP002| Critical| 10|\n", + "| HOSP002| High| 14|\n", + "| HOSP002| Low| 27|\n", + "| HOSP002| Medium| 33|\n", + "| URGENT001| Critical| 6|\n", + "| URGENT001| High| 14|\n", + "| URGENT001| Low| 24|\n", + "| URGENT001| Medium| 35|\n", + "+-----------+--------------+-----+\n", + "\n", + "\n", + "=== Physician Workload Analysis ===\n" + ] + }, + "metadata": {}, + "output_type": 
"display_data" + }, + { + "data": { + "text/plain": [ + "+------------------+---------------+---------------+-------------------+\n", + "|treating_physician|total_diagnoses|unique_patients|critical_case_ratio|\n", + "+------------------+---------------+---------------+-------------------+\n", + "| DR_BROWN| 57| 45| 0.123|\n", + "| DR_DAVIS| 56| 42| 0.089|\n", + "| DR_SMITH| 47| 38| 0.17|\n", + "| DR_GARCIA| 45| 38| 0.133|\n", + "| DR_WILLIAMS| 40| 30| 0.075|\n", + "| DR_MILLER| 38| 33| 0.079|\n", + "| DR_JOHNSON| 37| 35| 0.108|\n", + "| DR_JONES| 30| 27| 0.067|\n", + "+------------------+---------------+---------------+-------------------+\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Analyze clustering effectiveness and healthcare insights\n", + "\n", + "# Diagnosis frequency analysis\n", + "print(\"=== Diagnosis Frequency Analysis ===\")\n", + "diagnosis_freq = spark.sql(\"\"\"\n", + "SELECT diagnosis_code, diagnosis_description, COUNT(*) as frequency,\n", + " ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) as percentage\n", + "FROM healthcare.gold.patient_diagnoses\n", + "GROUP BY diagnosis_code, diagnosis_description\n", + "ORDER BY frequency DESC\n", + "\"\"\")\n", + "\n", + "diagnosis_freq.show(truncate=False)\n", + "\n", + "# Severity distribution by facility\n", + "print(\"\\n=== Severity Distribution by Facility ===\")\n", + "severity_by_facility = spark.sql(\"\"\"\n", + "SELECT facility_id, severity_level, COUNT(*) as count\n", + "FROM healthcare.gold.patient_diagnoses\n", + "GROUP BY facility_id, severity_level\n", + "ORDER BY facility_id, severity_level\n", + "\"\"\")\n", + "\n", + "severity_by_facility.show()\n", + "\n", + "# Physician workload analysis\n", + "print(\"\\n=== Physician Workload Analysis ===\")\n", + "physician_workload = spark.sql(\"\"\"\n", + "SELECT treating_physician, COUNT(*) as total_diagnoses,\n", + " COUNT(DISTINCT patient_id) as unique_patients,\n", + " ROUND(AVG(CASE WHEN severity_level = 'Critical' THEN 1 ELSE 0 END), 3) as critical_case_ratio\n", + "FROM healthcare.gold.patient_diagnoses\n", + "GROUP BY treating_physician\n", + "ORDER BY total_diagnoses DESC\n", + "\"\"\")\n", + "\n", + "physician_workload.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Key Takeaways: Delta Liquid Clustering in AIDP\n", + "\n", + "### What We Demonstrated\n", + "\n", + "1. **Automatic Optimization**: Created a table with `CLUSTER BY (patient_id, diagnosis_date)` and let Delta automatically optimize data layout\n", + "\n", + "2. **Performance Benefits**: Queries on clustered columns (patient_id, diagnosis_date) are significantly faster due to data locality\n", + "\n", + "3. **Zero Maintenance**: No manual partitioning, bucketing, or Z-Ordering required - Delta handles it automatically\n", + "\n", + "4. **Real-World Use Case**: Healthcare analytics where patient history lookups and temporal analysis are critical\n", + "\n", + "### AIDP Advantages\n", + "\n", + "- **Unified Analytics**: Seamlessly integrates with other AIDP services\n", + "- **Governance**: Catalog and schema isolation for healthcare data\n", + "- **Performance**: Optimized for both OLAP and OLTP workloads\n", + "- **Scalability**: Handles healthcare-scale data volumes effortlessly\n", + "\n", + "### Best Practices for Liquid Clustering\n", + "\n", + "1. **Choose clustering columns** based on your most common query patterns\n", + "2. **Start with 1-4 columns** - too many can reduce effectiveness\n", + "3. 
**Consider cardinality** - high-cardinality columns work best\n", + "4. **Monitor and adjust** as query patterns evolve\n", + "\n", + "### Next Steps\n", + "\n", + "- Explore other AIDP features like AI/ML integration\n", + "- Try liquid clustering with different column combinations\n", + "- Scale up to larger healthcare datasets\n", + "- Integrate with real healthcare systems\n", + "\n", + "This notebook demonstrates how Oracle AI Data Platform makes advanced analytics accessible while maintaining enterprise-grade performance and governance." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/Notebooks/liquid_clustering/hospitality_delta_liquid_clustering_demo.ipynb b/Notebooks/liquid_clustering/hospitality_delta_liquid_clustering_demo.ipynb new file mode 100644 index 0000000..fc9f81c --- /dev/null +++ b/Notebooks/liquid_clustering/hospitality_delta_liquid_clustering_demo.ipynb @@ -0,0 +1,993 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Hospitality: Delta Liquid Clustering Demo\n", + "\n", + "\n", + "## Overview\n", + "\n", + "\n", + "This notebook demonstrates the power of **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using a hospitality and tourism analytics use case. Liquid clustering automatically optimizes data layout for query performance without requiring manual partitioning or Z-Ordering.\n", + "\n", + "### What is Liquid Clustering?\n", + "\n", + "Liquid clustering automatically identifies and groups similar data together based on clustering columns you define. This optimization happens automatically during data ingestion and maintenance operations, providing:\n", + "\n", + "- **Automatic optimization**: No manual tuning required\n", + "- **Improved query performance**: Faster queries on clustered columns\n", + "- **Reduced maintenance**: No need for manual repartitioning\n", + "- **Adaptive clustering**: Adjusts as data patterns change\n", + "\n", + "### Use Case: Hotel Guest Experience and Revenue Management\n", + "\n", + "We'll analyze hotel booking and guest experience data. Our clustering strategy will optimize for:\n", + "\n", + "- **Guest-specific queries**: Fast lookups by guest ID\n", + "- **Time-based analysis**: Efficient filtering by booking and stay dates\n", + "- **Revenue patterns**: Quick aggregation by room type and booking channels\n", + "\n", + "### AIDP Environment Setup\n", + "\n", + "This notebook leverages the existing Spark session in your AIDP environment." 
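, + "\n", + "As a quick optional sanity check (an addition to this demo, not required by it), you can confirm the session is live before creating any objects:\n", + "\n", + "```python\n", + "# Assumes the AIDP Workbench has already injected an active `spark` session, as noted above\n", + "print(spark.version)\n", + "```"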
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Hospitality catalog and analytics schema created successfully!\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create hospitality catalog and analytics schema\n", + "\n", + "# In AIDP, catalogs provide data isolation and governance\n", + "\n", + "spark.sql(\"CREATE CATALOG IF NOT EXISTS hospitality\")\n", + "\n", + "spark.sql(\"CREATE SCHEMA IF NOT EXISTS hospitality.analytics\")\n", + "\n", + "print(\"Hospitality catalog and analytics schema created successfully!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 2: Create Delta Table with Liquid Clustering\n", + "\n", + "### Table Design\n", + "\n", + "Our `guest_stays` table will store:\n", + "\n", + "- **guest_id**: Unique guest identifier\n", + "- **booking_date**: Date booking was made\n", + "- **check_in_date**: Guest arrival date\n", + "- **room_type**: Type of room booked\n", + "- **booking_channel**: How booking was made (OTA, Direct, etc.)\n", + "- **total_revenue**: Total booking revenue\n", + "- **guest_satisfaction**: Guest satisfaction score (1-10)\n", + "\n", + "### Clustering Strategy\n", + "\n", + "We'll cluster by `guest_id` and `booking_date` because:\n", + "\n", + "- **guest_id**: Guests often make multiple bookings, grouping their stay history together\n", + "- **booking_date**: Time-based queries are critical for revenue analysis, seasonal trends, and booking patterns\n", + "- This combination optimizes for both guest relationship management and temporal revenue analytics" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Delta table with liquid clustering created successfully!\n", + "Clustering will automatically optimize data layout for queries on guest_id and booking_date.\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create Delta table with liquid clustering\n", + "\n", + "# CLUSTER BY defines the columns for automatic optimization\n", + "\n", + "spark.sql(\"\"\"\n", + "\n", + "CREATE TABLE IF NOT EXISTS hospitality.analytics.guest_stays (\n", + "\n", + " guest_id STRING,\n", + "\n", + " booking_date DATE,\n", + "\n", + " check_in_date DATE,\n", + "\n", + " room_type STRING,\n", + "\n", + " booking_channel STRING,\n", + "\n", + " total_revenue DECIMAL(8,2),\n", + "\n", + " guest_satisfaction INT\n", + "\n", + ")\n", + "\n", + "USING DELTA\n", + "\n", + "CLUSTER BY (guest_id, booking_date)\n", + "\n", + "\"\"\")\n", + "\n", + "print(\"Delta table with liquid clustering created successfully!\")\n", + "\n", + "print(\"Clustering will automatically optimize data layout for queries on guest_id and booking_date.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 3: Generate Hospitality Sample Data\n", + "\n", + "### Data Generation Strategy\n", + "\n", + "We'll create realistic hotel booking and guest data including:\n", + "\n", + "- **5,000 guests** with multiple bookings over time\n", + "- **Room types**: Standard, Deluxe, Suite, Executive\n", + "- **Booking channels**: Direct, Online Travel Agency, Corporate, Walk-in\n", + "- **Seasonal patterns**: Peak seasons, weekend vs weekday pricing\n", + "\n", + "### Why This Data Pattern?\n", + "\n", + "This data simulates real hospitality scenarios where:\n", + "\n", + "- Guest loyalty programs require 
historical booking tracking\n", + "- Revenue management depends on booking channel analysis\n", + "- Seasonal pricing strategies drive occupancy optimization\n", + "- Guest satisfaction impacts reputation and repeat business\n", + "- Channel performance requires continuous monitoring" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Generated 24925 guest booking records\n", + "Sample record: {'guest_id': 'GST000001', 'booking_date': datetime.date(2024, 9, 11), 'check_in_date': datetime.date(2024, 10, 8), 'room_type': 'Standard', 'booking_channel': 'Online Travel Agency', 'total_revenue': 97.25, 'guest_satisfaction': 7}\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Generate sample hospitality guest booking data\n", + "\n", + "# Using fully qualified imports to avoid conflicts\n", + "\n", + "import random\n", + "\n", + "from datetime import datetime, timedelta\n", + "\n", + "\n", + "# Define hospitality data constants\n", + "\n", + "ROOM_TYPES = ['Standard', 'Deluxe', 'Suite', 'Executive']\n", + "\n", + "BOOKING_CHANNELS = ['Direct', 'Online Travel Agency', 'Corporate', 'Walk-in']\n", + "\n", + "# Base revenue parameters by room type\n", + "\n", + "REVENUE_PARAMS = {\n", + "\n", + " 'Standard': {'base_rate': 120, 'satisfaction': 7.8},\n", + "\n", + " 'Deluxe': {'base_rate': 200, 'satisfaction': 8.2},\n", + "\n", + " 'Suite': {'base_rate': 350, 'satisfaction': 8.8},\n", + "\n", + " 'Executive': {'base_rate': 280, 'satisfaction': 8.5}\n", + "\n", + "}\n", + "\n", + "# Channel margins (affect final revenue)\n", + "\n", + "CHANNEL_MARGINS = {\n", + "\n", + " 'Direct': 1.0,\n", + "\n", + " 'Online Travel Agency': 0.85,\n", + "\n", + " 'Corporate': 0.90,\n", + "\n", + " 'Walk-in': 0.95\n", + "\n", + "}\n", + "\n", + "\n", + "# Generate guest booking records\n", + "\n", + "booking_data = []\n", + "\n", + "base_date = datetime(2024, 1, 1)\n", + "\n", + "\n", + "# Create 5,000 guests with 2-8 bookings each\n", + "\n", + "for guest_num in range(1, 5001):\n", + "\n", + " guest_id = f\"GST{guest_num:06d}\"\n", + " \n", + " # Each guest gets 2-8 bookings over 12 months\n", + "\n", + " num_bookings = random.randint(2, 8)\n", + " \n", + " for i in range(num_bookings):\n", + "\n", + " # Spread bookings over 12 months\n", + "\n", + " days_offset = random.randint(0, 365)\n", + "\n", + " booking_date = base_date + timedelta(days=days_offset)\n", + " \n", + " # Check-in date (usually within 1-30 days of booking)\n", + "\n", + " checkin_offset = random.randint(1, 30)\n", + "\n", + " check_in_date = booking_date + timedelta(days=checkin_offset)\n", + " \n", + " # Select room type\n", + "\n", + " room_type = random.choice(ROOM_TYPES)\n", + "\n", + " params = REVENUE_PARAMS[room_type]\n", + " \n", + " # Select booking channel\n", + "\n", + " booking_channel = random.choice(BOOKING_CHANNELS)\n", + "\n", + " channel_margin = CHANNEL_MARGINS[booking_channel]\n", + " \n", + " # Calculate revenue with variations\n", + "\n", + " # Seasonal pricing (higher in peak season)\n", + "\n", + " month = check_in_date.month\n", + "\n", + " if month in [6, 7, 8]: # Summer peak\n", + "\n", + " seasonal_factor = 1.3\n", + "\n", + " elif month in [11, 12]: # Holiday season\n", + "\n", + " seasonal_factor = 1.4\n", + "\n", + " else:\n", + "\n", + " seasonal_factor = 1.0\n", + " \n", + " # Weekend pricing\n", + "\n", + " if check_in_date.weekday() >= 5: # Saturday = 5, Sunday = 6\n", + "\n", + " weekend_factor 
= 1.2\n", + "\n", + " else:\n", + "\n", + " weekend_factor = 1.0\n", + " \n", + " # Stay length (1-7 nights)\n", + "\n", + " stay_length = random.randint(1, 7)\n", + " \n", + " # Calculate total revenue\n", + "\n", + " revenue_variation = random.uniform(0.9, 1.1)\n", + "\n", + " total_revenue = round(params['base_rate'] * stay_length * seasonal_factor * weekend_factor * channel_margin * revenue_variation, 2)\n", + " \n", + " # Guest satisfaction (varies by room type and some randomness)\n", + "\n", + " satisfaction_variation = random.randint(-2, 2)\n", + "\n", + " guest_satisfaction = max(1, min(10, params['satisfaction'] + satisfaction_variation))\n", + " \n", + " booking_data.append({\n", + "\n", + " \"guest_id\": guest_id,\n", + "\n", + " \"booking_date\": booking_date.date(),\n", + "\n", + " \"check_in_date\": check_in_date.date(),\n", + "\n", + " \"room_type\": room_type,\n", + "\n", + " \"booking_channel\": booking_channel,\n", + "\n", + " \"total_revenue\": float(total_revenue),\n", + "\n", + " \"guest_satisfaction\": int(guest_satisfaction)\n", + "\n", + " })\n", + "\n", + "\n", + "\n", + "print(f\"Generated {len(booking_data)} guest booking records\")\n", + "\n", + "print(\"Sample record:\", booking_data[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 4: Insert Data Using PySpark\n", + "\n", + "### Data Insertion Strategy\n", + "\n", + "We'll use PySpark to:\n", + "\n", + "1. **Create DataFrame** from our generated data\n", + "2. **Insert into Delta table** with liquid clustering\n", + "3. **Verify the insertion** with a sample query\n", + "\n", + "### Why PySpark for Insertion?\n", + "\n", + "- **Distributed processing**: Handles large datasets efficiently\n", + "- **Type safety**: Ensures data integrity\n", + "- **Optimization**: Leverages Spark's query optimization\n", + "- **Liquid clustering**: Automatically applies clustering during insertion" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "DataFrame Schema:\n", + "root\n", + " |-- booking_channel: string (nullable = true)\n", + " |-- booking_date: date (nullable = true)\n", + " |-- check_in_date: date (nullable = true)\n", + " |-- guest_id: string (nullable = true)\n", + " |-- guest_satisfaction: long (nullable = true)\n", + " |-- room_type: string (nullable = true)\n", + " |-- total_revenue: double (nullable = true)\n", + "\n", + "\n", + "Sample Data:\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+--------------------+------------+-------------+---------+------------------+---------+-------------+\n", + "| booking_channel|booking_date|check_in_date| guest_id|guest_satisfaction|room_type|total_revenue|\n", + "+--------------------+------------+-------------+---------+------------------+---------+-------------+\n", + "|Online Travel Agency| 2024-09-11| 2024-10-08|GST000001| 7| Standard| 97.25|\n", + "| Direct| 2024-02-08| 2024-02-17|GST000001| 7| Suite| 841.19|\n", + "| Direct| 2024-11-10| 2024-11-12|GST000001| 6| Suite| 2441.59|\n", + "| Walk-in| 2024-06-16| 2024-06-25|GST000001| 8| Standard| 983.68|\n", + "|Online Travel Agency| 2024-12-27| 2025-01-19|GST000001| 10|Executive| 586.94|\n", + "+--------------------+------------+-------------+---------+------------------+---------+-------------+\n", + "only showing top 5 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "\n", + 
"Successfully inserted 24925 records into hospitality.analytics.guest_stays\n", + "Liquid clustering automatically optimized the data layout during insertion!\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Insert data using PySpark DataFrame operations\n", + "\n", + "# Using fully qualified function references to avoid conflicts\n", + "\n", + "\n", + "# Create DataFrame from generated data\n", + "\n", + "df_bookings = spark.createDataFrame(booking_data)\n", + "\n", + "\n", + "# Display schema and sample data\n", + "\n", + "print(\"DataFrame Schema:\")\n", + "\n", + "df_bookings.printSchema()\n", + "\n", + "\n", + "\n", + "print(\"\\nSample Data:\")\n", + "\n", + "df_bookings.show(5)\n", + "\n", + "\n", + "# Insert data into Delta table with liquid clustering\n", + "\n", + "# The CLUSTER BY (guest_id, booking_date) will automatically optimize the data layout\n", + "\n", + "df_bookings.write.mode(\"overwrite\").saveAsTable(\"hospitality.analytics.guest_stays\")\n", + "\n", + "\n", + "print(f\"\\nSuccessfully inserted {df_bookings.count()} records into hospitality.analytics.guest_stays\")\n", + "\n", + "print(\"Liquid clustering automatically optimized the data layout during insertion!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 5: Demonstrate Liquid Clustering Benefits\n", + "\n", + "### Query Performance Analysis\n", + "\n", + "Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:\n", + "\n", + "1. **Guest booking history** (clustered by guest_id)\n", + "2. **Time-based revenue analysis** (clustered by booking_date)\n", + "3. **Combined guest + time queries** (optimal for our clustering)\n", + "\n", + "### Expected Performance Benefits\n", + "\n", + "With liquid clustering, these queries should be significantly faster because:\n", + "\n", + "- **Data locality**: Related records are physically grouped together\n", + "- **Reduced I/O**: Less data needs to be read from disk\n", + "- **Automatic optimization**: No manual tuning required" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "=== Query 1: Guest Booking History ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+---------+------------+---------+-------------+------------------+\n", + "| guest_id|booking_date|room_type|total_revenue|guest_satisfaction|\n", + "+---------+------------+---------+-------------+------------------+\n", + "|GST000001| 2024-12-27|Executive| 586.94| 10|\n", + "|GST000001| 2024-11-10| Suite| 2441.59| 6|\n", + "|GST000001| 2024-09-11| Standard| 97.25| 7|\n", + "|GST000001| 2024-06-16| Standard| 983.68| 8|\n", + "|GST000001| 2024-02-08| Suite| 841.19| 7|\n", + "+---------+------------+---------+-------------+------------------+\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Records found: 5\n", + "\n", + "=== Query 2: Recent High-Value Bookings ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+------------+---------+---------+-------------+---------------+\n", + "|booking_date| guest_id|room_type|total_revenue|booking_channel|\n", + "+------------+---------+---------+-------------+---------------+\n", + "| 2024-10-16|GST002569| Suite| 4518.02| Direct|\n", + "| 2024-12-12|GST001814| Suite| 4512.27| 
Direct|\n", + "| 2024-11-08|GST004903| Suite| 4473.05| Direct|\n", + "| 2024-11-21|GST004845| Suite| 4469.66| Direct|\n", + "| 2024-12-12|GST002326| Suite| 4420.99| Direct|\n", + "| 2024-10-10|GST001027| Suite| 4291.87| Direct|\n", + "| 2024-11-30|GST002600| Suite| 4260.58| Walk-in|\n", + "| 2024-11-17|GST001894| Suite| 4236.16| Direct|\n", + "| 2024-10-04|GST004377| Suite| 4214.49| Walk-in|\n", + "| 2024-07-27|GST002825| Suite| 4168.56| Direct|\n", + "| 2024-12-28|GST002967| Suite| 4163.38| Direct|\n", + "| 2024-06-10|GST003431| Suite| 4098.36| Direct|\n", + "| 2024-07-04|GST000876| Suite| 4061.38| Direct|\n", + "| 2024-12-20|GST003771| Suite| 4057.82| Direct|\n", + "| 2024-07-24|GST004642| Suite| 4033.5| Direct|\n", + "| 2024-10-29|GST003491| Suite| 4013.08| Direct|\n", + "| 2024-12-03|GST000019| Suite| 3998.81| Corporate|\n", + "| 2024-06-03|GST004275| Suite| 3994.81| Direct|\n", + "| 2024-07-27|GST003054| Suite| 3992.25| Walk-in|\n", + "| 2024-06-18|GST003529| Suite| 3990.31| Direct|\n", + "+------------+---------+---------+-------------+---------------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "High-value bookings found: 6911\n", + "\n", + "=== Query 3: Guest Spending Trends ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+---------+------------+---------+-------------+------------------+\n", + "| guest_id|booking_date|room_type|total_revenue|guest_satisfaction|\n", + "+---------+------------+---------+-------------+------------------+\n", + "|GST000001| 2024-06-16| Standard| 983.68| 8|\n", + "|GST000001| 2024-09-11| Standard| 97.25| 7|\n", + "|GST000001| 2024-11-10| Suite| 2441.59| 6|\n", + "|GST000001| 2024-12-27|Executive| 586.94| 10|\n", + "|GST000002| 2024-07-01| Deluxe| 928.56| 10|\n", + "|GST000003| 2024-05-09| Standard| 690.72| 9|\n", + "|GST000003| 2024-08-10| Standard| 119.51| 5|\n", + "|GST000003| 2024-08-26| Deluxe| 550.91| 9|\n", + "|GST000004| 2024-04-16| Standard| 557.51| 8|\n", + "|GST000004| 2024-06-17| Deluxe| 730.58| 10|\n", + "|GST000005| 2024-04-21|Executive| 315.68| 10|\n", + "|GST000005| 2024-06-30| Suite| 2723.72| 7|\n", + "|GST000005| 2024-09-06| Standard| 773.24| 6|\n", + "|GST000005| 2024-11-16| Deluxe| 2031.05| 9|\n", + "|GST000006| 2024-04-14| Suite| 1593.15| 10|\n", + "|GST000006| 2024-07-08|Executive| 905.4| 8|\n", + "|GST000006| 2024-08-20| Standard| 687.87| 6|\n", + "|GST000006| 2024-10-08| Suite| 951.49| 10|\n", + "|GST000006| 2024-11-20|Executive| 2022.85| 9|\n", + "|GST000007| 2024-07-03| Standard| 866.91| 6|\n", + "+---------+------------+---------+-------------+------------------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Spending trend records found: 3737\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Demonstrate liquid clustering benefits with optimized queries\n", + "\n", + "\n", + "# Query 1: Guest booking history - benefits from guest_id clustering\n", + "\n", + "print(\"=== Query 1: Guest Booking History ===\")\n", + "\n", + "guest_history = spark.sql(\"\"\"\n", + "\n", + "SELECT guest_id, booking_date, room_type, total_revenue, guest_satisfaction\n", + "\n", + "FROM hospitality.analytics.guest_stays\n", + "\n", + "WHERE guest_id = 'GST000001'\n", + "\n", + "ORDER BY booking_date DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + 
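"# Optional check (an addition, not part of the original demo): inspect the physical plan to see\n", + "# whether Delta prunes data files using the clustered guest_id predicate. `explain` is a standard\n", + "# PySpark DataFrame method; uncomment to run it.\n", + "# guest_history.explain(True)\n", + "\n", +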
"guest_history.show()\n", + "\n", + "print(f\"Records found: {guest_history.count()}\")\n", + "\n", + "\n", + "\n", + "# Query 2: Time-based revenue analysis - benefits from booking_date clustering\n", + "\n", + "print(\"\\n=== Query 2: Recent High-Value Bookings ===\")\n", + "\n", + "high_value = spark.sql(\"\"\"\n", + "\n", + "SELECT booking_date, guest_id, room_type, total_revenue, booking_channel\n", + "\n", + "FROM hospitality.analytics.guest_stays\n", + "\n", + "WHERE booking_date >= '2024-06-01' AND total_revenue > 1000\n", + "\n", + "ORDER BY total_revenue DESC, booking_date DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "high_value.show()\n", + "\n", + "print(f\"High-value bookings found: {high_value.count()}\")\n", + "\n", + "\n", + "\n", + "# Query 3: Combined guest + time query - optimal for our clustering strategy\n", + "\n", + "print(\"\\n=== Query 3: Guest Spending Trends ===\")\n", + "\n", + "spending_trends = spark.sql(\"\"\"\n", + "\n", + "SELECT guest_id, booking_date, room_type, total_revenue, guest_satisfaction\n", + "\n", + "FROM hospitality.analytics.guest_stays\n", + "\n", + "WHERE guest_id LIKE 'GST000%' AND booking_date >= '2024-04-01'\n", + "\n", + "ORDER BY guest_id, booking_date\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "spending_trends.show()\n", + "\n", + "print(f\"Spending trend records found: {spending_trends.count()}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 6: Analyze Clustering Effectiveness\n", + "\n", + "### Understanding the Impact\n", + "\n", + "Let's examine how liquid clustering has organized our data and analyze some aggregate statistics to demonstrate the hospitality insights possible with this optimized structure.\n", + "\n", + "### Key Analytics\n", + "\n", + "- **Guest loyalty patterns** and repeat booking analysis\n", + "- **Revenue performance** by room type and booking channel\n", + "- **Seasonal trends** and occupancy optimization\n", + "- **Guest satisfaction** and service quality metrics" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "=== Guest Loyalty Analysis ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+---------+--------------+-----------+-----------------+----------------+-----------------+\n", + "| guest_id|total_bookings|total_spent|avg_booking_value|avg_satisfaction|last_booking_date|\n", + "+---------+--------------+-----------+-----------------+----------------+-----------------+\n", + "|GST000291| 8| 15489.66| 1936.21| 7.5| 2024-11-28|\n", + "|GST001027| 7| 14781.78| 2111.68| 7.71| 2024-12-29|\n", + "|GST000705| 8| 14645.07| 1830.63| 7.0| 2024-12-03|\n", + "|GST001894| 7| 14147.55| 2021.08| 7.43| 2024-12-11|\n", + "|GST002089| 7| 14125.12| 2017.87| 9.0| 2024-10-23|\n", + "|GST003861| 8| 14003.91| 1750.49| 8.25| 2024-12-19|\n", + "|GST003563| 8| 13950.03| 1743.75| 7.13| 2024-12-17|\n", + "|GST004202| 8| 13918.78| 1739.85| 8.13| 2024-10-29|\n", + "|GST004845| 8| 13914.4| 1739.3| 7.75| 2024-11-21|\n", + "|GST001811| 8| 13865.4| 1733.18| 8.63| 2024-12-30|\n", + "+---------+--------------+-----------+-----------------+----------------+-----------------+\n", + "\n", + "\n", + "=== Room Type Performance ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+---------+--------------+-------------+-----------------------+----------------+-------------+\n", + 
"|room_type|total_bookings|total_revenue|avg_revenue_per_booking|avg_satisfaction|unique_guests|\n", + "+---------+--------------+-------------+-----------------------+----------------+-------------+\n", + "| Suite| 6197| 9648746.75| 1557.0| 7.98| 3580|\n", + "|Executive| 6189| 7706532.62| 1245.2| 8.01| 3573|\n", + "| Deluxe| 6273| 5616109.29| 895.28| 8.0| 3632|\n", + "| Standard| 6266| 3365445.72| 537.1| 7.0| 3609|\n", + "+---------+--------------+-------------+-----------------------+----------------+-------------+\n", + "\n", + "\n", + "=== Booking Channel Performance ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+--------------------+--------------+-------------+-----------+----------------+-------------+\n", + "| booking_channel|total_bookings|total_revenue|avg_revenue|avg_satisfaction|unique_guests|\n", + "+--------------------+--------------+-------------+-----------+----------------+-------------+\n", + "| Direct| 6255| 7150741.3| 1143.2| 7.76| 3593|\n", + "| Walk-in| 6131| 6663099.09| 1086.79| 7.73| 3589|\n", + "| Corporate| 6307| 6460978.61| 1024.41| 7.78| 3608|\n", + "|Online Travel Agency| 6232| 6062015.38| 972.72| 7.72| 3580|\n", + "+--------------------+--------------+-------------+-----------+----------------+-------------+\n", + "\n", + "\n", + "=== Monthly Revenue Trends ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-------+--------------+---------------+-----------------+----------------+-------------+\n", + "| month|total_bookings|monthly_revenue|avg_booking_value|avg_satisfaction|unique_guests|\n", + "+-------+--------------+---------------+-----------------+----------------+-------------+\n", + "|2024-01| 2047| 1889847.05| 923.23| 7.75| 1702|\n", + "|2024-02| 1980| 1826907.88| 922.68| 7.65| 1635|\n", + "|2024-03| 2136| 2001170.78| 936.88| 7.73| 1731|\n", + "|2024-04| 2051| 1888653.78| 920.85| 7.79| 1667|\n", + "|2024-05| 2078| 2163349.34| 1041.07| 7.76| 1723|\n", + "|2024-06| 2103| 2520292.68| 1198.43| 7.73| 1724|\n", + "|2024-07| 2085| 2492578.66| 1195.48| 7.74| 1696|\n", + "|2024-08| 2118| 2234283.63| 1054.9| 7.79| 1746|\n", + "|2024-09| 2061| 1937507.83| 940.08| 7.79| 1721|\n", + "|2024-10| 2091| 2366572.91| 1131.79| 7.71| 1704|\n", + "|2024-11| 2024| 2669816.41| 1319.08| 7.77| 1687|\n", + "|2024-12| 2151| 2345853.43| 1090.59| 7.75| 1766|\n", + "+-------+--------------+---------------+-----------------+----------------+-------------+\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Analyze clustering effectiveness and hospitality insights\n", + "\n", + "\n", + "# Guest loyalty analysis\n", + "\n", + "print(\"=== Guest Loyalty Analysis ===\")\n", + "\n", + "guest_loyalty = spark.sql(\"\"\"\n", + "\n", + "SELECT guest_id, COUNT(*) as total_bookings,\n", + "\n", + " ROUND(SUM(total_revenue), 2) as total_spent,\n", + "\n", + " ROUND(AVG(total_revenue), 2) as avg_booking_value,\n", + "\n", + " ROUND(AVG(guest_satisfaction), 2) as avg_satisfaction,\n", + "\n", + " MAX(booking_date) as last_booking_date\n", + "\n", + "FROM hospitality.analytics.guest_stays\n", + "\n", + "GROUP BY guest_id\n", + "\n", + "ORDER BY total_spent DESC\n", + "\n", + "LIMIT 10\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "guest_loyalty.show()\n", + "\n", + "\n", + "# Room type performance\n", + "\n", + "print(\"\\n=== Room Type Performance ===\")\n", + "\n", + "room_performance = spark.sql(\"\"\"\n", + "\n", + "SELECT 
room_type, COUNT(*) as total_bookings,\n", + "\n", + " ROUND(SUM(total_revenue), 2) as total_revenue,\n", + "\n", + " ROUND(AVG(total_revenue), 2) as avg_revenue_per_booking,\n", + "\n", + " ROUND(AVG(guest_satisfaction), 2) as avg_satisfaction,\n", + "\n", + " COUNT(DISTINCT guest_id) as unique_guests\n", + "\n", + "FROM hospitality.analytics.guest_stays\n", + "\n", + "GROUP BY room_type\n", + "\n", + "ORDER BY total_revenue DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "room_performance.show()\n", + "\n", + "\n", + "# Booking channel analysis\n", + "\n", + "print(\"\\n=== Booking Channel Performance ===\")\n", + "\n", + "channel_analysis = spark.sql(\"\"\"\n", + "\n", + "SELECT booking_channel, COUNT(*) as total_bookings,\n", + "\n", + " ROUND(SUM(total_revenue), 2) as total_revenue,\n", + "\n", + " ROUND(AVG(total_revenue), 2) as avg_revenue,\n", + "\n", + " ROUND(AVG(guest_satisfaction), 2) as avg_satisfaction,\n", + "\n", + " COUNT(DISTINCT guest_id) as unique_guests\n", + "\n", + "FROM hospitality.analytics.guest_stays\n", + "\n", + "GROUP BY booking_channel\n", + "\n", + "ORDER BY total_revenue DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "channel_analysis.show()\n", + "\n", + "\n", + "# Monthly revenue trends\n", + "\n", + "print(\"\\n=== Monthly Revenue Trends ===\")\n", + "\n", + "monthly_trends = spark.sql(\"\"\"\n", + "\n", + "SELECT DATE_FORMAT(booking_date, 'yyyy-MM') as month,\n", + "\n", + " COUNT(*) as total_bookings,\n", + "\n", + " ROUND(SUM(total_revenue), 2) as monthly_revenue,\n", + "\n", + " ROUND(AVG(total_revenue), 2) as avg_booking_value,\n", + "\n", + " ROUND(AVG(guest_satisfaction), 2) as avg_satisfaction,\n", + "\n", + " COUNT(DISTINCT guest_id) as unique_guests\n", + "\n", + "FROM hospitality.analytics.guest_stays\n", + "\n", + "GROUP BY DATE_FORMAT(booking_date, 'yyyy-MM')\n", + "\n", + "ORDER BY month\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "monthly_trends.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Key Takeaways: Delta Liquid Clustering in AIDP\n", + "\n", + "### What We Demonstrated\n", + "\n", + "1. **Automatic Optimization**: Created a table with `CLUSTER BY (guest_id, booking_date)` and let Delta automatically optimize data layout\n", + "\n", + "2. **Performance Benefits**: Queries on clustered columns (guest_id, booking_date) are significantly faster due to data locality\n", + "\n", + "3. **Zero Maintenance**: No manual partitioning, bucketing, or Z-Ordering required - Delta handles it automatically\n", + "\n", + "4. **Real-World Use Case**: Hospitality analytics where guest experience and revenue management are critical\n", + "\n", + "### AIDP Advantages\n", + "\n", + "- **Unified Analytics**: Seamlessly integrates with other AIDP services\n", + "- **Governance**: Catalog and schema isolation for hospitality data\n", + "- **Performance**: Optimized for both OLAP and OLTP workloads\n", + "- **Scalability**: Handles hospitality-scale data volumes effortlessly\n", + "\n", + "### Best Practices for Liquid Clustering\n", + "\n", + "1. **Choose clustering columns** based on your most common query patterns\n", + "2. **Start with 1-4 columns** - too many can reduce effectiveness\n", + "3. **Consider cardinality** - high-cardinality columns work best\n", + "4. 
**Monitor and adjust** as query patterns evolve\n", + "\n", + "### Next Steps\n", + "\n", + "- Explore other AIDP features like AI/ML integration\n", + "- Try liquid clustering with different column combinations\n", + "- Scale up to larger hospitality datasets\n", + "- Integrate with real PMS systems and booking platforms\n", + "\n", + "This notebook demonstrates how Oracle AI Data Platform makes advanced hospitality analytics accessible while maintaining enterprise-grade performance and governance." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/Notebooks/liquid_clustering/insurance_delta_liquid_clustering_demo.ipynb b/Notebooks/liquid_clustering/insurance_delta_liquid_clustering_demo.ipynb new file mode 100644 index 0000000..370cbbb --- /dev/null +++ b/Notebooks/liquid_clustering/insurance_delta_liquid_clustering_demo.ipynb @@ -0,0 +1,992 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Insurance: Delta Liquid Clustering Demo\n", + "\n", + "\n", + "## Overview\n", + "\n", + "\n", + "This notebook demonstrates the power of **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using an insurance analytics use case. Liquid clustering automatically optimizes data layout for query performance without requiring manual partitioning or Z-Ordering.\n", + "\n", + "### What is Liquid Clustering?\n", + "\n", + "Liquid clustering automatically identifies and groups similar data together based on clustering columns you define. This optimization happens automatically during data ingestion and maintenance operations, providing:\n", + "\n", + "- **Automatic optimization**: No manual tuning required\n", + "- **Improved query performance**: Faster queries on clustered columns\n", + "- **Reduced maintenance**: No need for manual repartitioning\n", + "- **Adaptive clustering**: Adjusts as data patterns change\n", + "\n", + "### Use Case: Claims Processing and Risk Assessment\n", + "\n", + "We'll analyze insurance claims and policy data. Our clustering strategy will optimize for:\n", + "\n", + "- **Policyholder-specific queries**: Fast lookups by customer ID\n", + "- **Time-based analysis**: Efficient filtering by claim and policy dates\n", + "- **Risk patterns**: Quick aggregation by claim type and risk scores\n", + "\n", + "### AIDP Environment Setup\n", + "\n", + "This notebook leverages the existing Spark session in your AIDP environment." 
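, + "\n", + "Optionally, you can first list the catalogs already visible to the session (an illustrative addition; it assumes the session supports multi-catalog SQL, which the CREATE CATALOG step below also requires):\n", + "\n", + "```python\n", + "# SHOW CATALOGS needs a metastore with catalog support (Spark 3.4+ style multi-catalog)\n", + "spark.sql(\"SHOW CATALOGS\").show()\n", + "```"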
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Insurance catalog and analytics schema created successfully!\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create insurance catalog and analytics schema\n", + "\n", + "# In AIDP, catalogs provide data isolation and governance\n", + "\n", + "spark.sql(\"CREATE CATALOG IF NOT EXISTS insurance\")\n", + "\n", + "spark.sql(\"CREATE SCHEMA IF NOT EXISTS insurance.analytics\")\n", + "\n", + "print(\"Insurance catalog and analytics schema created successfully!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 2: Create Delta Table with Liquid Clustering\n", + "\n", + "### Table Design\n", + "\n", + "Our `insurance_claims` table will store:\n", + "\n", + "- **customer_id**: Unique policyholder identifier\n", + "- **claim_date**: Date claim was filed\n", + "- **policy_type**: Type of insurance (Auto, Home, Health, etc.)\n", + "- **claim_amount**: Claim payout amount\n", + "- **risk_score**: Customer risk assessment (1-100)\n", + "- **processing_time**: Days to process claim\n", + "- **claim_status**: Approved, Denied, Pending\n", + "\n", + "### Clustering Strategy\n", + "\n", + "We'll cluster by `customer_id` and `claim_date` because:\n", + "\n", + "- **customer_id**: Policyholders often file multiple claims, grouping their insurance history together\n", + "- **claim_date**: Time-based queries are critical for fraud detection, seasonal analysis, and regulatory reporting\n", + "- This combination optimizes for both customer risk profiling and temporal claims analysis" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Delta table with liquid clustering created successfully!\n", + "Clustering will automatically optimize data layout for queries on customer_id and claim_date.\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create Delta table with liquid clustering\n", + "\n", + "# CLUSTER BY defines the columns for automatic optimization\n", + "\n", + "spark.sql(\"\"\"\n", + "\n", + "CREATE TABLE IF NOT EXISTS insurance.analytics.insurance_claims (\n", + "\n", + " customer_id STRING,\n", + "\n", + " claim_date DATE,\n", + "\n", + " policy_type STRING,\n", + "\n", + " claim_amount DECIMAL(10,2),\n", + "\n", + " risk_score INT,\n", + "\n", + " processing_time INT,\n", + "\n", + " claim_status STRING\n", + "\n", + ")\n", + "\n", + "USING DELTA\n", + "\n", + "CLUSTER BY (customer_id, claim_date)\n", + "\n", + "\"\"\")\n", + "\n", + "print(\"Delta table with liquid clustering created successfully!\")\n", + "\n", + "print(\"Clustering will automatically optimize data layout for queries on customer_id and claim_date.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 3: Generate Insurance Sample Data\n", + "\n", + "### Data Generation Strategy\n", + "\n", + "We'll create realistic insurance claims data including:\n", + "\n", + "- **8,000 customers** with multiple claims over time\n", + "- **Policy types**: Auto, Home, Health, Life, Property\n", + "- **Realistic claim patterns**: Seasonal variations, claim frequencies, processing times\n", + "- **Risk scoring**: Customer risk assessment and fraud indicators\n", + "\n", + "### Why This Data Pattern?\n", + "\n", + "This data simulates real insurance scenarios where:\n", + "\n", + "- Customer 
claims history affects risk assessment\n", + "- Seasonal patterns impact claim volumes\n", + "- Processing efficiency affects customer satisfaction\n", + "- Fraud detection requires pattern analysis\n", + "- Regulatory reporting demands temporal analysis" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Generated 15000 insurance claims records\n", + "Sample record: {'customer_id': 'CUST000001', 'claim_date': datetime.date(2024, 8, 23), 'policy_type': 'Home', 'claim_amount': 5562.56, 'risk_score': 35, 'processing_time': 26, 'claim_status': 'Approved'}\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Generate sample insurance claims data\n", + "\n", + "# Using fully qualified imports to avoid conflicts\n", + "\n", + "import random\n", + "\n", + "from datetime import datetime, timedelta\n", + "\n", + "\n", + "# Define insurance data constants\n", + "\n", + "POLICY_TYPES = ['Auto', 'Home', 'Health', 'Life', 'Property']\n", + "\n", + "CLAIM_STATUSES = ['Approved', 'Denied', 'Pending']\n", + "\n", + "# Base claim parameters by policy type\n", + "\n", + "CLAIM_PARAMS = {\n", + "\n", + " 'Auto': {'avg_claim': 3500, 'frequency': 3, 'processing_days': 14},\n", + "\n", + " 'Home': {'avg_claim': 8500, 'frequency': 1, 'processing_days': 21},\n", + "\n", + " 'Health': {'avg_claim': 1200, 'frequency': 8, 'processing_days': 7},\n", + "\n", + " 'Life': {'avg_claim': 25000, 'frequency': 0.5, 'processing_days': 30},\n", + "\n", + " 'Property': {'avg_claim': 15000, 'frequency': 1.5, 'processing_days': 18}\n", + "\n", + "}\n", + "\n", + "\n", + "# Generate insurance claims records\n", + "\n", + "claims_data = []\n", + "\n", + "base_date = datetime(2024, 1, 1)\n", + "\n", + "\n", + "# Create 8,000 customers with 1-12 claims each (based on frequency)\n", + "\n", + "for customer_num in range(1, 8001):\n", + "\n", + " customer_id = f\"CUST{customer_num:06d}\"\n", + " \n", + " # Assign a primary policy type for this customer\n", + "\n", + " primary_policy = random.choice(POLICY_TYPES)\n", + "\n", + " params = CLAIM_PARAMS[primary_policy]\n", + " \n", + " # Determine number of claims based on frequency (some customers have no claims)\n", + "\n", + " if random.random() < 0.3: # 30% of customers have no claims\n", + "\n", + " num_claims = 0\n", + "\n", + " else:\n", + "\n", + " num_claims = max(1, int(random.gauss(params['frequency'], params['frequency'] * 0.5)))\n", + " num_claims = min(num_claims, 12) # Cap at 12 claims\n", + " \n", + " # Generate claims\n", + "\n", + " for i in range(num_claims):\n", + "\n", + " # Spread claims over 12 months\n", + "\n", + " days_offset = random.randint(0, 365)\n", + "\n", + " claim_date = base_date + timedelta(days=days_offset)\n", + " \n", + " # Sometimes use different policy types for the same customer\n", + "\n", + " if random.random() < 0.2:\n", + "\n", + " policy_type = random.choice(POLICY_TYPES)\n", + "\n", + " params = CLAIM_PARAMS[policy_type]\n", + "\n", + " else:\n", + "\n", + " policy_type = primary_policy\n", + " \n", + " # Calculate claim amount with variation\n", + "\n", + " amount_variation = random.uniform(0.1, 3.0)\n", + "\n", + " claim_amount = round(params['avg_claim'] * amount_variation, 2)\n", + " \n", + " # Risk score (higher for larger/frequent claims)\n", + "\n", + " base_risk = random.randint(20, 80)\n", + "\n", + " risk_adjustment = min(20, int(claim_amount / 1000)) # Higher amounts increase risk\n", + "\n", + " risk_score = 
min(100, base_risk + risk_adjustment)\n", + " \n", + " # Processing time (varies by claim type and some randomness)\n", + "\n", + " time_variation = random.uniform(0.5, 2.0)\n", + "\n", + " processing_time = max(1, int(params['processing_days'] * time_variation))\n", + " \n", + " # Claim status (most approved, some denied, few pending)\n", + "\n", + " status_weights = [0.75, 0.15, 0.10] # Approved, Denied, Pending\n", + "\n", + " claim_status = random.choices(CLAIM_STATUSES, weights=status_weights)[0]\n", + " \n", + " claims_data.append({\n", + "\n", + " \"customer_id\": customer_id,\n", + "\n", + " \"claim_date\": claim_date.date(),\n", + "\n", + " \"policy_type\": policy_type,\n", + "\n", + " \"claim_amount\": claim_amount,\n", + "\n", + " \"risk_score\": risk_score,\n", + "\n", + " \"processing_time\": processing_time,\n", + "\n", + " \"claim_status\": claim_status\n", + "\n", + " })\n", + "\n", + "\n", + "\n", + "print(f\"Generated {len(claims_data)} insurance claims records\")\n", + "\n", + "print(\"Sample record:\", claims_data[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 4: Insert Data Using PySpark\n", + "\n", + "### Data Insertion Strategy\n", + "\n", + "We'll use PySpark to:\n", + "\n", + "1. **Create DataFrame** from our generated data\n", + "2. **Insert into Delta table** with liquid clustering\n", + "3. **Verify the insertion** with a sample query\n", + "\n", + "### Why PySpark for Insertion?\n", + "\n", + "- **Distributed processing**: Handles large datasets efficiently\n", + "- **Type safety**: Ensures data integrity\n", + "- **Optimization**: Leverages Spark's query optimization\n", + "- **Liquid clustering**: Automatically applies clustering during insertion" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "DataFrame Schema:\n", + "root\n", + " |-- claim_amount: double (nullable = true)\n", + " |-- claim_date: date (nullable = true)\n", + " |-- claim_status: string (nullable = true)\n", + " |-- customer_id: string (nullable = true)\n", + " |-- policy_type: string (nullable = true)\n", + " |-- processing_time: long (nullable = true)\n", + " |-- risk_score: long (nullable = true)\n", + "\n", + "\n", + "Sample Data:\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+------------+----------+------------+-----------+-----------+---------------+----------+\n", + "|claim_amount|claim_date|claim_status|customer_id|policy_type|processing_time|risk_score|\n", + "+------------+----------+------------+-----------+-----------+---------------+----------+\n", + "| 5562.56|2024-08-23| Approved| CUST000001| Home| 26| 35|\n", + "| 6011.12|2024-08-26| Denied| CUST000001| Health| 24| 76|\n", + "| 23118.44|2024-08-03| Approved| CUST000002| Home| 34| 55|\n", + "| 30107.2|2024-04-25| Approved| CUST000003| Life| 31| 47|\n", + "| 2186.86|2024-01-04| Approved| CUST000004| Health| 5| 54|\n", + "+------------+----------+------------+-----------+-----------+---------------+----------+\n", + "only showing top 5 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "\n", + "Successfully inserted 15000 records into insurance.analytics.insurance_claims\n", + "Liquid clustering automatically optimized the data layout during insertion!\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Insert data using PySpark DataFrame 
operations\n", + "\n", + "# Using fully qualified function references to avoid conflicts\n", + "\n", + "\n", + "# Create DataFrame from generated data\n", + "\n", + "df_claims = spark.createDataFrame(claims_data)\n", + "\n", + "\n", + "# Display schema and sample data\n", + "\n", + "print(\"DataFrame Schema:\")\n", + "\n", + "df_claims.printSchema()\n", + "\n", + "\n", + "\n", + "print(\"\\nSample Data:\")\n", + "\n", + "df_claims.show(5)\n", + "\n", + "\n", + "# Insert data into Delta table with liquid clustering\n", + "\n", + "# The CLUSTER BY (customer_id, claim_date) will automatically optimize the data layout\n", + "\n", + "df_claims.write.mode(\"overwrite\").saveAsTable(\"insurance.analytics.insurance_claims\")\n", + "\n", + "\n", + "print(f\"\\nSuccessfully inserted {df_claims.count()} records into insurance.analytics.insurance_claims\")\n", + "\n", + "print(\"Liquid clustering automatically optimized the data layout during insertion!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 5: Demonstrate Liquid Clustering Benefits\n", + "\n", + "### Query Performance Analysis\n", + "\n", + "Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:\n", + "\n", + "1. **Customer claims history** (clustered by customer_id)\n", + "2. **Time-based claims analysis** (clustered by claim_date)\n", + "3. **Combined customer + time queries** (optimal for our clustering)\n", + "\n", + "### Expected Performance Benefits\n", + "\n", + "With liquid clustering, these queries should be significantly faster because:\n", + "\n", + "- **Data locality**: Related records are physically grouped together\n", + "- **Reduced I/O**: Less data needs to be read from disk\n", + "- **Automatic optimization**: No manual tuning required" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "=== Query 1: Customer Claims History ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-----------+----------+-----------+------------+------------+\n", + "|customer_id|claim_date|policy_type|claim_amount|claim_status|\n", + "+-----------+----------+-----------+------------+------------+\n", + "| CUST000001|2024-08-26| Health| 6011.12| Denied|\n", + "| CUST000001|2024-08-23| Home| 5562.56| Approved|\n", + "+-----------+----------+-----------+------------+------------+\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Records found: 2\n", + "\n", + "=== Query 2: Recent High-Value Claims ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+----------+-----------+-----------+------------+----------+\n", + "|claim_date|customer_id|policy_type|claim_amount|risk_score|\n", + "+----------+-----------+-----------+------------+----------+\n", + "|2024-12-29| CUST005908| Health| 74916.03| 96|\n", + "|2024-07-23| CUST007016| Health| 74900.02| 52|\n", + "|2024-10-18| CUST001581| Life| 74895.3| 80|\n", + "|2024-12-15| CUST004733| Life| 74883.6| 97|\n", + "|2024-07-25| CUST002601| Life| 74874.85| 57|\n", + "|2024-11-17| CUST005594| Life| 74829.33| 57|\n", + "|2024-12-24| CUST005524| Life| 74818.84| 82|\n", + "|2024-10-16| CUST001368| Health| 74812.53| 68|\n", + "|2024-10-18| CUST005266| Life| 74701.96| 88|\n", + "|2024-07-15| CUST001375| Life| 74683.53| 71|\n", + "|2024-06-04| CUST007676| Life| 
74576.77| 45|\n", + "|2024-06-13| CUST004179| Health| 74573.16| 55|\n", + "|2024-07-05| CUST005762| Life| 74488.06| 91|\n", + "|2024-06-25| CUST005196| Life| 74420.28| 69|\n", + "|2024-09-06| CUST005887| Life| 74244.99| 94|\n", + "|2024-10-31| CUST005898| Health| 74241.14| 67|\n", + "|2024-10-13| CUST004707| Health| 74039.53| 81|\n", + "|2024-08-20| CUST006660| Life| 74012.73| 66|\n", + "|2024-12-31| CUST003724| Life| 73950.38| 64|\n", + "|2024-12-15| CUST003666| Health| 73901.43| 80|\n", + "+----------+-----------+-----------+------------+----------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "High-value claims found: 3176\n", + "\n", + "=== Query 3: Customer Claims Trends ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-----------+----------+-----------+------------+----------+\n", + "|customer_id|claim_date|policy_type|claim_amount|risk_score|\n", + "+-----------+----------+-----------+------------+----------+\n", + "| CUST000001|2024-08-23| Home| 5562.56| 35|\n", + "| CUST000001|2024-08-26| Health| 6011.12| 76|\n", + "| CUST000002|2024-08-03| Home| 23118.44| 55|\n", + "| CUST000003|2024-04-25| Life| 30107.2| 47|\n", + "| CUST000004|2024-06-17| Health| 3418.76| 27|\n", + "| CUST000004|2024-07-13| Property| 5252.1| 78|\n", + "| CUST000004|2024-09-21| Health| 4055.38| 34|\n", + "| CUST000004|2024-12-21| Health| 3026.38| 73|\n", + "| CUST000012|2024-07-10| Auto| 5113.28| 45|\n", + "| CUST000014|2024-05-20| Auto| 5076.39| 82|\n", + "| CUST000014|2024-07-18| Auto| 7187.73| 30|\n", + "| CUST000015|2024-04-24| Property| 10582.94| 78|\n", + "| CUST000015|2024-07-18| Property| 20606.38| 77|\n", + "| CUST000016|2024-12-09| Health| 734.86| 41|\n", + "| CUST000017|2024-04-17| Health| 2787.38| 58|\n", + "| CUST000017|2024-05-31| Health| 3050.43| 69|\n", + "| CUST000017|2024-06-12| Health| 2451.17| 32|\n", + "| CUST000017|2024-09-07| Health| 1164.47| 67|\n", + "| CUST000017|2024-10-15| Health| 2573.82| 80|\n", + "| CUST000017|2024-11-23| Health| 1507.4| 44|\n", + "+-----------+----------+-----------+------------+----------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Claims trend records found: 1340\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Demonstrate liquid clustering benefits with optimized queries\n", + "\n", + "\n", + "# Query 1: Customer claims history - benefits from customer_id clustering\n", + "\n", + "print(\"=== Query 1: Customer Claims History ===\")\n", + "\n", + "customer_history = spark.sql(\"\"\"\n", + "\n", + "SELECT customer_id, claim_date, policy_type, claim_amount, claim_status\n", + "\n", + "FROM insurance.analytics.insurance_claims\n", + "\n", + "WHERE customer_id = 'CUST000001'\n", + "\n", + "ORDER BY claim_date DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "customer_history.show()\n", + "\n", + "print(f\"Records found: {customer_history.count()}\")\n", + "\n", + "\n", + "\n", + "# Query 2: Time-based high-value claims analysis - benefits from claim_date clustering\n", + "\n", + "print(\"\\n=== Query 2: Recent High-Value Claims ===\")\n", + "\n", + "high_value_claims = spark.sql(\"\"\"\n", + "\n", + "SELECT claim_date, customer_id, policy_type, claim_amount, risk_score\n", + "\n", + "FROM insurance.analytics.insurance_claims\n", + "\n", + "WHERE claim_date >= 
'2024-06-01' AND claim_amount > 10000\n", + "\n", + "ORDER BY claim_amount DESC, claim_date DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "high_value_claims.show()\n", + "\n", + "print(f\"High-value claims found: {high_value_claims.count()}\")\n", + "\n", + "\n", + "\n", + "# Query 3: Combined customer + time query - optimal for our clustering strategy\n", + "\n", + "print(\"\\n=== Query 3: Customer Claims Trends ===\")\n", + "\n", + "claims_trends = spark.sql(\"\"\"\n", + "\n", + "SELECT customer_id, claim_date, policy_type, claim_amount, risk_score\n", + "\n", + "FROM insurance.analytics.insurance_claims\n", + "\n", + "WHERE customer_id LIKE 'CUST000%' AND claim_date >= '2024-04-01'\n", + "\n", + "ORDER BY customer_id, claim_date\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "claims_trends.show()\n", + "\n", + "print(f\"Claims trend records found: {claims_trends.count()}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 6: Analyze Clustering Effectiveness\n", + "\n", + "### Understanding the Impact\n", + "\n", + "Let's examine how liquid clustering has organized our data and analyze some aggregate statistics to demonstrate the insurance insights possible with this optimized structure.\n", + "\n", + "### Key Analytics\n", + "\n", + "- **Customer risk profiling** and claims frequency analysis\n", + "- **Policy performance** and loss ratio calculations\n", + "- **Claims processing efficiency** and operational metrics\n", + "- **Fraud detection patterns** and risk scoring effectiveness" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "=== Customer Risk Analysis ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-----------+------------+-------------+----------------+--------------+-------------------+\n", + "|customer_id|total_claims|total_claimed|avg_claim_amount|avg_risk_score|avg_processing_days|\n", + "+-----------+------------+-------------+----------------+--------------+-------------------+\n", + "| CUST007870| 12| 436160.35| 36346.7| 55.5| 38.5|\n", + "| CUST002884| 12| 430604.8| 35883.73| 54.83| 33.58|\n", + "| CUST002783| 10| 424762.01| 42476.2| 60.5| 29.1|\n", + "| CUST006960| 11| 418878.93| 38079.9| 69.91| 48.64|\n", + "| CUST001729| 11| 412611.37| 37510.12| 67.55| 37.18|\n", + "| CUST000883| 12| 395490.98| 32957.58| 60.58| 34.75|\n", + "| CUST004078| 12| 395238.21| 32936.52| 76.92| 27.42|\n", + "| CUST003279| 12| 389299.34| 32441.61| 69.92| 38.25|\n", + "| CUST001321| 12| 386399.15| 32199.93| 67.25| 32.17|\n", + "| CUST004110| 12| 373686.71| 31140.56| 69.42| 31.33|\n", + "+-----------+------------+-------------+----------------+--------------+-------------------+\n", + "\n", + "\n", + "=== Policy Type Performance ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-----------+------------+-------------+----------------+-------------------+------------------+\n", + "|policy_type|total_claims| total_payout|avg_claim_amount|avg_processing_days|affected_customers|\n", + "+-----------+------------+-------------+----------------+-------------------+------------------+\n", + "| Health| 7136|6.025945934E7| 8444.43| 14.5| 1341|\n", + "| Life| 1473|5.693053767E7| 38649.38| 37.63| 1434|\n", + "| Property| 1733|3.911063925E7| 22568.17| 22.22| 1445|\n", + "| Auto| 3150|2.241207207E7| 7114.94| 17.87| 1541|\n", + "| Home| 1508|1.988170738E7| 
13184.16| 25.84| 1440|\n", + "+-----------+------------+-------------+----------------+-------------------+------------------+\n", + "\n", + "\n", + "=== Claims Processing Efficiency ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+--------------------+-----------+--------+--------------+\n", + "| processing_category|claim_count|avg_days| total_amount|\n", + "+--------------------+-----------+--------+--------------+\n", + "| Fast (1-7 days)| 2110| 5.35| 4533004.37|\n", + "| Normal (8-14 days)| 4624| 10.85| 2.668768244E7|\n", + "| Slow (15-21 days)| 2623| 18.0| 4.000567373E7|\n", + "|Very Slow (22+ days)| 5643| 32.6|1.2736805517E8|\n", + "+--------------------+-----------+--------+--------------+\n", + "\n", + "\n", + "=== Monthly Claims Trends ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-------+------------+--------------+----------------+--------------+----------------+\n", + "| month|total_claims|monthly_payout|avg_claim_amount|avg_risk_score|unique_claimants|\n", + "+-------+------------+--------------+----------------+--------------+----------------+\n", + "|2024-01| 1303| 1.700282262E7| 13048.98| 58.71| 1077|\n", + "|2024-02| 1159| 1.530272761E7| 13203.39| 58.93| 966|\n", + "|2024-03| 1290| 1.739153802E7| 13481.81| 58.84| 1074|\n", + "|2024-04| 1206| 1.61826177E7| 13418.42| 59.45| 1007|\n", + "|2024-05| 1324| 1.80161403E7| 13607.36| 59.54| 1086|\n", + "|2024-06| 1220| 1.56975657E7| 12866.86| 58.2| 1008|\n", + "|2024-07| 1309| 1.737609287E7| 13274.33| 57.87| 1081|\n", + "|2024-08| 1222| 1.543843201E7| 12633.74| 57.6| 1029|\n", + "|2024-09| 1292| 1.686660742E7| 13054.65| 58.56| 1068|\n", + "|2024-10| 1219| 1.671987222E7| 13716.06| 58.46| 1012|\n", + "|2024-11| 1173| 1.568186196E7| 13369.02| 59.22| 959|\n", + "|2024-12| 1283| 1.691813728E7| 13186.39| 58.29| 1047|\n", + "+-------+------------+--------------+----------------+--------------+----------------+\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Analyze clustering effectiveness and insurance insights\n", + "\n", + "\n", + "# Customer risk analysis\n", + "\n", + "print(\"=== Customer Risk Analysis ===\")\n", + "\n", + "customer_risk = spark.sql(\"\"\"\n", + "\n", + "SELECT customer_id, COUNT(*) as total_claims,\n", + "\n", + " ROUND(SUM(claim_amount), 2) as total_claimed,\n", + "\n", + " ROUND(AVG(claim_amount), 2) as avg_claim_amount,\n", + "\n", + " ROUND(AVG(risk_score), 2) as avg_risk_score,\n", + "\n", + " ROUND(AVG(processing_time), 2) as avg_processing_days\n", + "\n", + "FROM insurance.analytics.insurance_claims\n", + "\n", + "GROUP BY customer_id\n", + "\n", + "ORDER BY total_claimed DESC\n", + "\n", + "LIMIT 10\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "customer_risk.show()\n", + "\n", + "\n", + "# Policy type performance\n", + "\n", + "print(\"\\n=== Policy Type Performance ===\")\n", + "\n", + "policy_performance = spark.sql(\"\"\"\n", + "\n", + "SELECT policy_type, COUNT(*) as total_claims,\n", + "\n", + " ROUND(SUM(claim_amount), 2) as total_payout,\n", + "\n", + " ROUND(AVG(claim_amount), 2) as avg_claim_amount,\n", + "\n", + " ROUND(AVG(processing_time), 2) as avg_processing_days,\n", + "\n", + " COUNT(DISTINCT customer_id) as affected_customers\n", + "\n", + "FROM insurance.analytics.insurance_claims\n", + "\n", + "GROUP BY policy_type\n", + "\n", + "ORDER BY total_payout DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + 
"policy_performance.show()\n", + "\n", + "\n", + "# Claims processing efficiency\n", + "\n", + "print(\"\\n=== Claims Processing Efficiency ===\")\n", + "\n", + "processing_efficiency = spark.sql(\"\"\"\n", + "\n", + "SELECT \n", + "\n", + " CASE \n", + "\n", + " WHEN processing_time <= 7 THEN 'Fast (1-7 days)'\n", + "\n", + " WHEN processing_time <= 14 THEN 'Normal (8-14 days)'\n", + "\n", + " WHEN processing_time <= 21 THEN 'Slow (15-21 days)'\n", + "\n", + " ELSE 'Very Slow (22+ days)'\n", + "\n", + " END as processing_category,\n", + "\n", + " COUNT(*) as claim_count,\n", + "\n", + " ROUND(AVG(processing_time), 2) as avg_days,\n", + "\n", + " ROUND(SUM(claim_amount), 2) as total_amount\n", + "\n", + "FROM insurance.analytics.insurance_claims\n", + "\n", + "GROUP BY \n", + "\n", + " CASE \n", + "\n", + " WHEN processing_time <= 7 THEN 'Fast (1-7 days)'\n", + "\n", + " WHEN processing_time <= 14 THEN 'Normal (8-14 days)'\n", + "\n", + " WHEN processing_time <= 21 THEN 'Slow (15-21 days)'\n", + "\n", + " ELSE 'Very Slow (22+ days)'\n", + "\n", + " END\n", + "\n", + "ORDER BY avg_days\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "processing_efficiency.show()\n", + "\n", + "\n", + "# Monthly claims trends\n", + "\n", + "print(\"\\n=== Monthly Claims Trends ===\")\n", + "\n", + "monthly_trends = spark.sql(\"\"\"\n", + "\n", + "SELECT DATE_FORMAT(claim_date, 'yyyy-MM') as month,\n", + "\n", + " COUNT(*) as total_claims,\n", + "\n", + " ROUND(SUM(claim_amount), 2) as monthly_payout,\n", + "\n", + " ROUND(AVG(claim_amount), 2) as avg_claim_amount,\n", + "\n", + " ROUND(AVG(risk_score), 2) as avg_risk_score,\n", + "\n", + " COUNT(DISTINCT customer_id) as unique_claimants\n", + "\n", + "FROM insurance.analytics.insurance_claims\n", + "\n", + "GROUP BY DATE_FORMAT(claim_date, 'yyyy-MM')\n", + "\n", + "ORDER BY month\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "monthly_trends.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Key Takeaways: Delta Liquid Clustering in AIDP\n", + "\n", + "### What We Demonstrated\n", + "\n", + "1. **Automatic Optimization**: Created a table with `CLUSTER BY (customer_id, claim_date)` and let Delta automatically optimize data layout\n", + "\n", + "2. **Performance Benefits**: Queries on clustered columns (customer_id, claim_date) are significantly faster due to data locality\n", + "\n", + "3. **Zero Maintenance**: No manual partitioning, bucketing, or Z-Ordering required - Delta handles it automatically\n", + "\n", + "4. **Real-World Use Case**: Insurance analytics where claims processing and risk assessment are critical\n", + "\n", + "### AIDP Advantages\n", + "\n", + "- **Unified Analytics**: Seamlessly integrates with other AIDP services\n", + "- **Governance**: Catalog and schema isolation for insurance data\n", + "- **Performance**: Optimized for both OLAP and OLTP workloads\n", + "- **Scalability**: Handles insurance-scale data volumes effortlessly\n", + "\n", + "### Best Practices for Liquid Clustering\n", + "\n", + "1. **Choose clustering columns** based on your most common query patterns\n", + "2. **Start with 1-4 columns** - too many can reduce effectiveness\n", + "3. **Consider cardinality** - high-cardinality columns work best\n", + "4. 
**Monitor and adjust** as query patterns evolve\n", + "\n", + "### Next Steps\n", + "\n", + "- Explore other AIDP features like AI/ML integration\n", + "- Try liquid clustering with different column combinations\n", + "- Scale up to larger insurance datasets\n", + "- Integrate with real claims processing systems\n", + "\n", + "This notebook demonstrates how Oracle AI Data Platform makes advanced insurance analytics accessible while maintaining enterprise-grade performance and governance." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/Notebooks/liquid_clustering/manufacturing_delta_liquid_clustering_demo.ipynb b/Notebooks/liquid_clustering/manufacturing_delta_liquid_clustering_demo.ipynb new file mode 100644 index 0000000..b001f0f --- /dev/null +++ b/Notebooks/liquid_clustering/manufacturing_delta_liquid_clustering_demo.ipynb @@ -0,0 +1,989 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Manufacturing: Delta Liquid Clustering Demo\n", + "\n", + "\n", + "## Overview\n", + "\n", + "\n", + "This notebook demonstrates the power of **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using a manufacturing analytics use case. Liquid clustering automatically optimizes data layout for query performance without requiring manual partitioning or Z-Ordering.\n", + "\n", + "### What is Liquid Clustering?\n", + "\n", + "Liquid clustering automatically identifies and groups similar data together based on clustering columns you define. This optimization happens automatically during data ingestion and maintenance operations, providing:\n", + "\n", + "- **Automatic optimization**: No manual tuning required\n", + "- **Improved query performance**: Faster queries on clustered columns\n", + "- **Reduced maintenance**: No need for manual repartitioning\n", + "- **Adaptive clustering**: Adjusts as data patterns change\n", + "\n", + "### Use Case: Production Quality Control and Equipment Monitoring\n", + "\n", + "We'll analyze manufacturing production records from a factory. Our clustering strategy will optimize for:\n", + "\n", + "- **Equipment-specific queries**: Fast lookups by machine ID\n", + "- **Time-based analysis**: Efficient filtering by production date\n", + "- **Quality control patterns**: Quick aggregation by product type and defect rates\n", + "\n", + "### AIDP Environment Setup\n", + "\n", + "This notebook leverages the existing Spark session in your AIDP environment." 
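+ ,
+ "\n",
+ "\n",
+ "As an optional sanity check before the setup cells, the sketch below confirms that the pre-created Spark session is usable. It assumes the `spark` object that AIDP Workbench injects and a runtime that supports multi-catalog SQL; it creates nothing itself:\n",
+ "\n",
+ "```python\n",
+ "# Sketch: verify the AIDP-provided Spark session before creating catalogs and tables\n",
+ "print(spark.version)                    # Spark version backing this Workbench session\n",
+ "print(spark.catalog.currentDatabase())  # schema currently in scope\n",
+ "spark.sql(\"SHOW CATALOGS\").show()      # catalogs visible to this session (if supported)\n",
+ "```"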
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Manufacturing catalog and analytics schema created successfully!\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create manufacturing catalog and analytics schema\n", + "\n", + "# In AIDP, catalogs provide data isolation and governance\n", + "\n", + "spark.sql(\"CREATE CATALOG IF NOT EXISTS manufacturing\")\n", + "\n", + "spark.sql(\"CREATE SCHEMA IF NOT EXISTS manufacturing.analytics\")\n", + "\n", + "print(\"Manufacturing catalog and analytics schema created successfully!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 2: Create Delta Table with Liquid Clustering\n", + "\n", + "### Table Design\n", + "\n", + "Our `production_records` table will store:\n", + "\n", + "- **machine_id**: Unique equipment identifier\n", + "- **production_date**: Date and time of production\n", + "- **product_type**: Type of product manufactured\n", + "- **units_produced**: Number of units produced\n", + "- **defect_count**: Number of defective units\n", + "- **production_line**: Assembly line identifier\n", + "- **cycle_time**: Time to produce one unit (minutes)\n", + "\n", + "### Clustering Strategy\n", + "\n", + "We'll cluster by `machine_id` and `production_date` because:\n", + "\n", + "- **machine_id**: Equipment often produces multiple batches, grouping maintenance and performance data together\n", + "- **production_date**: Time-based queries are essential for shift analysis, maintenance scheduling, and quality trending\n", + "- This combination optimizes for both equipment monitoring and temporal production analysis" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Delta table with liquid clustering created successfully!\n", + "Clustering will automatically optimize data layout for queries on machine_id and production_date.\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create Delta table with liquid clustering\n", + "\n", + "# CLUSTER BY defines the columns for automatic optimization\n", + "\n", + "spark.sql(\"\"\"\n", + "\n", + "CREATE TABLE IF NOT EXISTS manufacturing.analytics.production_records (\n", + "\n", + " machine_id STRING,\n", + "\n", + " production_date TIMESTAMP,\n", + "\n", + " product_type STRING,\n", + "\n", + " units_produced INT,\n", + "\n", + " defect_count INT,\n", + "\n", + " production_line STRING,\n", + "\n", + " cycle_time DECIMAL(5,2)\n", + "\n", + ")\n", + "\n", + "USING DELTA\n", + "\n", + "CLUSTER BY (machine_id, production_date)\n", + "\n", + "\"\"\")\n", + "\n", + "print(\"Delta table with liquid clustering created successfully!\")\n", + "\n", + "print(\"Clustering will automatically optimize data layout for queries on machine_id and production_date.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 3: Generate Manufacturing Sample Data\n", + "\n", + "### Data Generation Strategy\n", + "\n", + "We'll create realistic manufacturing production data including:\n", + "\n", + "- **200 machines** with multiple production runs over time\n", + "- **Product types**: Electronics, Automotive Parts, Consumer Goods, Industrial Equipment\n", + "- **Realistic production patterns**: Shift-based operations, maintenance downtime, quality variations\n", + "- **Multiple production lines**: Different assembly areas and 
facilities\n", + "\n", + "### Why This Data Pattern?\n", + "\n", + "This data simulates real manufacturing scenarios where:\n", + "\n", + "- Equipment performance varies over time\n", + "- Quality control requires tracking defects and yields\n", + "- Maintenance scheduling depends on usage patterns\n", + "- Production optimization drives efficiency improvements\n", + "- Supply chain visibility requires real-time production data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Generated 12298 production records\n", + "Sample record: {'machine_id': 'MCH0001', 'production_date': datetime.datetime(2024, 9, 6, 6, 0), 'product_type': 'Industrial Equipment', 'units_produced': 36, 'defect_count': 2, 'production_line': 'LINE_A', 'cycle_time': 22.15}\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Generate sample manufacturing production data\n", + "\n", + "# Using fully qualified imports to avoid conflicts\n", + "\n", + "import random\n", + "\n", + "from datetime import datetime, timedelta\n", + "\n", + "\n", + "# Define manufacturing data constants\n", + "\n", + "PRODUCT_TYPES = ['Electronics', 'Automotive Parts', 'Consumer Goods', 'Industrial Equipment']\n", + "\n", + "PRODUCTION_LINES = ['LINE_A', 'LINE_B', 'LINE_C', 'LINE_D', 'LINE_E']\n", + "\n", + "# Base production parameters by product type\n", + "\n", + "PRODUCTION_PARAMS = {\n", + "\n", + " 'Electronics': {'base_units': 500, 'defect_rate': 0.02, 'cycle_time': 2.5},\n", + "\n", + " 'Automotive Parts': {'base_units': 200, 'defect_rate': 0.05, 'cycle_time': 8.0},\n", + "\n", + " 'Consumer Goods': {'base_units': 800, 'defect_rate': 0.03, 'cycle_time': 1.8},\n", + "\n", + " 'Industrial Equipment': {'base_units': 50, 'defect_rate': 0.08, 'cycle_time': 25.0}\n", + "\n", + "}\n", + "\n", + "\n", + "# Generate production records\n", + "\n", + "production_data = []\n", + "\n", + "base_date = datetime(2024, 1, 1)\n", + "\n", + "\n", + "# Create 200 machines with 30-90 production runs each\n", + "\n", + "for machine_num in range(1, 201):\n", + "\n", + " machine_id = f\"MCH{machine_num:04d}\"\n", + " \n", + " # Each machine gets 30-90 production runs over 12 months\n", + "\n", + " num_runs = random.randint(30, 90)\n", + " \n", + " for i in range(num_runs):\n", + "\n", + " # Spread production runs over 12 months (weekdays only, during shifts)\n", + "\n", + " days_offset = random.randint(0, 365)\n", + "\n", + " production_date = base_date + timedelta(days=days_offset)\n", + " \n", + " # Skip weekends\n", + "\n", + " while production_date.weekday() >= 5:\n", + "\n", + " production_date += timedelta(days=1)\n", + " \n", + " # Add shift timing (6 AM - 6 PM)\n", + "\n", + " hours_offset = random.randint(6, 18)\n", + "\n", + " production_date = production_date.replace(hour=hours_offset, minute=0, second=0, microsecond=0)\n", + " \n", + " # Select product type\n", + "\n", + " product_type = random.choice(PRODUCT_TYPES)\n", + "\n", + " params = PRODUCTION_PARAMS[product_type]\n", + " \n", + " # Calculate production with variability\n", + "\n", + " units_variation = random.uniform(0.7, 1.3)\n", + "\n", + " units_produced = int(params['base_units'] * units_variation)\n", + " \n", + " # Calculate defects\n", + "\n", + " defect_rate_variation = random.uniform(0.5, 2.0)\n", + "\n", + " actual_defect_rate = params['defect_rate'] * defect_rate_variation\n", + "\n", + " defect_count = int(units_produced * actual_defect_rate)\n", + " \n", + " # 
Calculate cycle time with variation\n", + "\n", + " cycle_time_variation = random.uniform(0.8, 1.4)\n", + "\n", + " cycle_time = round(params['cycle_time'] * cycle_time_variation, 2)\n", + " \n", + " # Select production line\n", + "\n", + " production_line = random.choice(PRODUCTION_LINES)\n", + " \n", + " production_data.append({\n", + "\n", + " \"machine_id\": machine_id,\n", + "\n", + " \"production_date\": production_date,\n", + "\n", + " \"product_type\": product_type,\n", + "\n", + " \"units_produced\": units_produced,\n", + "\n", + " \"defect_count\": defect_count,\n", + "\n", + " \"production_line\": production_line,\n", + "\n", + " \"cycle_time\": cycle_time\n", + "\n", + " })\n", + "\n", + "\n", + "\n", + "print(f\"Generated {len(production_data)} production records\")\n", + "\n", + "print(\"Sample record:\", production_data[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 4: Insert Data Using PySpark\n", + "\n", + "### Data Insertion Strategy\n", + "\n", + "We'll use PySpark to:\n", + "\n", + "1. **Create DataFrame** from our generated data\n", + "2. **Insert into Delta table** with liquid clustering\n", + "3. **Verify the insertion** with a sample query\n", + "\n", + "### Why PySpark for Insertion?\n", + "\n", + "- **Distributed processing**: Handles large datasets efficiently\n", + "- **Type safety**: Ensures data integrity\n", + "- **Optimization**: Leverages Spark's query optimization\n", + "- **Liquid clustering**: Automatically applies clustering during insertion" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "DataFrame Schema:\n", + "root\n", + " |-- cycle_time: double (nullable = true)\n", + " |-- defect_count: long (nullable = true)\n", + " |-- machine_id: string (nullable = true)\n", + " |-- product_type: string (nullable = true)\n", + " |-- production_date: timestamp (nullable = true)\n", + " |-- production_line: string (nullable = true)\n", + " |-- units_produced: long (nullable = true)\n", + "\n", + "\n", + "Sample Data:\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+----------+------------+----------+--------------------+-------------------+---------------+--------------+\n", + "|cycle_time|defect_count|machine_id| product_type| production_date|production_line|units_produced|\n", + "+----------+------------+----------+--------------------+-------------------+---------------+--------------+\n", + "| 22.15| 2| MCH0001|Industrial Equipment|2024-09-06 06:00:00| LINE_A| 36|\n", + "| 1.75| 30| MCH0001| Consumer Goods|2024-03-26 11:00:00| LINE_B| 1034|\n", + "| 9.09| 8| MCH0001| Automotive Parts|2024-12-30 17:00:00| LINE_B| 259|\n", + "| 2.65| 25| MCH0001| Electronics|2024-10-21 06:00:00| LINE_B| 641|\n", + "| 2.11| 9| MCH0001| Electronics|2024-05-13 18:00:00| LINE_A| 437|\n", + "+----------+------------+----------+--------------------+-------------------+---------------+--------------+\n", + "only showing top 5 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "\n", + "Successfully inserted 12298 records into manufacturing.analytics.production_records\n", + "Liquid clustering automatically optimized the data layout during insertion!\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Insert data using PySpark DataFrame operations\n", + "\n", + "# Using fully qualified function references 
to avoid conflicts\n", + "\n", + "\n", + "# Create DataFrame from generated data\n", + "\n", + "df_production = spark.createDataFrame(production_data)\n", + "\n", + "\n", + "# Display schema and sample data\n", + "\n", + "print(\"DataFrame Schema:\")\n", + "\n", + "df_production.printSchema()\n", + "\n", + "\n", + "\n", + "print(\"\\nSample Data:\")\n", + "\n", + "df_production.show(5)\n", + "\n", + "\n", + "# Insert data into Delta table with liquid clustering\n", + "\n", + "# The CLUSTER BY (machine_id, production_date) will automatically optimize the data layout\n", + "\n", + "df_production.write.mode(\"overwrite\").saveAsTable(\"manufacturing.analytics.production_records\")\n", + "\n", + "\n", + "print(f\"\\nSuccessfully inserted {df_production.count()} records into manufacturing.analytics.production_records\")\n", + "\n", + "print(\"Liquid clustering automatically optimized the data layout during insertion!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 5: Demonstrate Liquid Clustering Benefits\n", + "\n", + "### Query Performance Analysis\n", + "\n", + "Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:\n", + "\n", + "1. **Machine performance history** (clustered by machine_id)\n", + "2. **Time-based production analysis** (clustered by production_date)\n", + "3. **Combined machine + time queries** (optimal for our clustering)\n", + "\n", + "### Expected Performance Benefits\n", + "\n", + "With liquid clustering, these queries should be significantly faster because:\n", + "\n", + "- **Data locality**: Related records are physically grouped together\n", + "- **Reduced I/O**: Less data needs to be read from disk\n", + "- **Automatic optimization**: No manual tuning required" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "=== Query 1: Machine Performance History ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+----------+-------------------+--------------------+--------------+------------+-------------------+\n", + "|machine_id| production_date| product_type|units_produced|defect_count|defect_rate_percent|\n", + "+----------+-------------------+--------------------+--------------+------------+-------------------+\n", + "| MCH0001|2024-12-30 17:00:00| Automotive Parts| 259| 8| 3.09|\n", + "| MCH0001|2024-12-27 08:00:00|Industrial Equipment| 58| 5| 8.62|\n", + "| MCH0001|2024-12-23 08:00:00|Industrial Equipment| 45| 5| 11.11|\n", + "| MCH0001|2024-12-17 08:00:00| Electronics| 647| 22| 3.40|\n", + "| MCH0001|2024-12-11 14:00:00|Industrial Equipment| 48| 5| 10.42|\n", + "| MCH0001|2024-12-02 18:00:00| Automotive Parts| 175| 14| 8.00|\n", + "| MCH0001|2024-12-02 08:00:00| Automotive Parts| 184| 4| 2.17|\n", + "| MCH0001|2024-11-22 16:00:00| Consumer Goods| 704| 30| 4.26|\n", + "| MCH0001|2024-11-12 18:00:00|Industrial Equipment| 62| 8| 12.90|\n", + "| MCH0001|2024-11-11 17:00:00| Consumer Goods| 990| 23| 2.32|\n", + "| MCH0001|2024-11-08 08:00:00|Industrial Equipment| 41| 3| 7.32|\n", + "| MCH0001|2024-10-25 11:00:00| Automotive Parts| 183| 11| 6.01|\n", + "| MCH0001|2024-10-24 06:00:00| Automotive Parts| 191| 11| 5.76|\n", + "| MCH0001|2024-10-21 06:00:00| Electronics| 641| 25| 3.90|\n", + "| MCH0001|2024-10-21 06:00:00| Consumer Goods| 826| 23| 2.78|\n", + "| MCH0001|2024-10-16 15:00:00|Industrial Equipment| 52| 6| 11.54|\n", + "| 
MCH0001|2024-10-14 14:00:00| Consumer Goods| 974| 16| 1.64|\n", + "| MCH0001|2024-10-07 18:00:00| Electronics| 451| 7| 1.55|\n", + "| MCH0001|2024-10-01 10:00:00|Industrial Equipment| 52| 3| 5.77|\n", + "| MCH0001|2024-09-19 07:00:00| Consumer Goods| 654| 35| 5.35|\n", + "+----------+-------------------+--------------------+--------------+------------+-------------------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Records found: 64\n", + "\n", + "=== Query 2: Recent Quality Issues ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-------------------+----------+--------------------+--------------+------------+-------------------+\n", + "| production_date|machine_id| product_type|units_produced|defect_count|defect_rate_percent|\n", + "+-------------------+----------+--------------------+--------------+------------+-------------------+\n", + "|2024-07-26 06:00:00| MCH0183|Industrial Equipment| 44| 7| 15.91|\n", + "|2024-11-04 15:00:00| MCH0135|Industrial Equipment| 63| 10| 15.87|\n", + "|2024-12-13 16:00:00| MCH0169|Industrial Equipment| 51| 8| 15.69|\n", + "|2024-10-25 11:00:00| MCH0023|Industrial Equipment| 51| 8| 15.69|\n", + "|2024-09-04 08:00:00| MCH0086|Industrial Equipment| 51| 8| 15.69|\n", + "|2024-09-03 16:00:00| MCH0001|Industrial Equipment| 64| 10| 15.63|\n", + "|2024-11-28 12:00:00| MCH0099|Industrial Equipment| 45| 7| 15.56|\n", + "|2024-12-26 08:00:00| MCH0148|Industrial Equipment| 58| 9| 15.52|\n", + "|2024-11-08 08:00:00| MCH0116|Industrial Equipment| 58| 9| 15.52|\n", + "|2024-08-02 12:00:00| MCH0134|Industrial Equipment| 58| 9| 15.52|\n", + "|2024-12-19 11:00:00| MCH0073|Industrial Equipment| 39| 6| 15.38|\n", + "|2024-11-11 13:00:00| MCH0158|Industrial Equipment| 52| 8| 15.38|\n", + "|2024-06-11 18:00:00| MCH0119|Industrial Equipment| 52| 8| 15.38|\n", + "|2024-12-25 10:00:00| MCH0106|Industrial Equipment| 59| 9| 15.25|\n", + "|2024-11-27 18:00:00| MCH0182|Industrial Equipment| 59| 9| 15.25|\n", + "|2024-11-04 12:00:00| MCH0063|Industrial Equipment| 59| 9| 15.25|\n", + "|2024-10-31 06:00:00| MCH0071|Industrial Equipment| 59| 9| 15.25|\n", + "|2024-08-30 07:00:00| MCH0184|Industrial Equipment| 59| 9| 15.25|\n", + "|2024-08-26 16:00:00| MCH0117|Industrial Equipment| 59| 9| 15.25|\n", + "|2024-08-02 08:00:00| MCH0122|Industrial Equipment| 59| 9| 15.25|\n", + "+-------------------+----------+--------------------+--------------+------------+-------------------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Quality issues found: 2950\n", + "\n", + "=== Query 3: Equipment Performance Trends ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+----------+-------------------+--------------------+--------------+----------+-----------+\n", + "|machine_id| production_date| product_type|units_produced|cycle_time|hourly_rate|\n", + "+----------+-------------------+--------------------+--------------+----------+-----------+\n", + "| MCH0001|2024-04-01 13:00:00| Automotive Parts| 204| 10.15| 1205.91|\n", + "| MCH0001|2024-04-01 15:00:00| Automotive Parts| 214| 9.77| 1314.23|\n", + "| MCH0001|2024-04-11 09:00:00| Electronics| 613| 3.3| 11145.45|\n", + "| MCH0001|2024-04-29 09:00:00|Industrial Equipment| 49| 25.46| 115.48|\n", + "| MCH0001|2024-04-30 09:00:00| Automotive Parts| 209| 
6.8| 1844.12|\n", + "| MCH0001|2024-05-07 13:00:00|Industrial Equipment| 47| 33.01| 85.43|\n", + "| MCH0001|2024-05-13 18:00:00| Electronics| 437| 2.11| 12426.54|\n", + "| MCH0001|2024-05-14 18:00:00|Industrial Equipment| 44| 30.47| 86.64|\n", + "| MCH0001|2024-05-17 18:00:00| Consumer Goods| 862| 1.84| 28108.7|\n", + "| MCH0001|2024-05-20 16:00:00| Consumer Goods| 767| 1.68| 27392.86|\n", + "| MCH0001|2024-06-03 17:00:00| Consumer Goods| 573| 1.61| 21354.04|\n", + "| MCH0001|2024-06-07 18:00:00| Automotive Parts| 240| 9.11| 1580.68|\n", + "| MCH0001|2024-06-28 06:00:00|Industrial Equipment| 37| 34.71| 63.96|\n", + "| MCH0001|2024-07-15 13:00:00| Automotive Parts| 195| 6.67| 1754.12|\n", + "| MCH0001|2024-07-15 18:00:00| Consumer Goods| 883| 2.3| 23034.78|\n", + "| MCH0001|2024-07-17 14:00:00| Consumer Goods| 942| 2.22| 25459.46|\n", + "| MCH0001|2024-08-08 07:00:00|Industrial Equipment| 35| 21.0| 100.0|\n", + "| MCH0001|2024-08-20 08:00:00| Electronics| 390| 3.18| 7358.49|\n", + "| MCH0001|2024-08-26 08:00:00| Electronics| 436| 2.38| 10991.6|\n", + "| MCH0001|2024-08-29 06:00:00| Automotive Parts| 248| 9.27| 1605.18|\n", + "+----------+-------------------+--------------------+--------------+----------+-----------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Performance records found: 382\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Demonstrate liquid clustering benefits with optimized queries\n", + "\n", + "\n", + "# Query 1: Machine performance history - benefits from machine_id clustering\n", + "\n", + "print(\"=== Query 1: Machine Performance History ===\")\n", + "\n", + "machine_history = spark.sql(\"\"\"\n", + "\n", + "SELECT machine_id, production_date, product_type, units_produced, defect_count,\n", + "\n", + " ROUND(defect_count * 100.0 / units_produced, 2) as defect_rate_percent\n", + "\n", + "FROM manufacturing.analytics.production_records\n", + "\n", + "WHERE machine_id = 'MCH0001'\n", + "\n", + "ORDER BY production_date DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "machine_history.show()\n", + "\n", + "print(f\"Records found: {machine_history.count()}\")\n", + "\n", + "\n", + "\n", + "# Query 2: Time-based quality analysis - benefits from production_date clustering\n", + "\n", + "print(\"\\n=== Query 2: Recent Quality Issues ===\")\n", + "\n", + "quality_issues = spark.sql(\"\"\"\n", + "\n", + "SELECT production_date, machine_id, product_type, units_produced, defect_count,\n", + "\n", + " ROUND(defect_count * 100.0 / units_produced, 2) as defect_rate_percent\n", + "\n", + "FROM manufacturing.analytics.production_records\n", + "\n", + "WHERE production_date >= '2024-06-01' AND (defect_count * 100.0 / units_produced) > 5.0\n", + "\n", + "ORDER BY defect_rate_percent DESC, production_date DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "quality_issues.show()\n", + "\n", + "print(f\"Quality issues found: {quality_issues.count()}\")\n", + "\n", + "\n", + "\n", + "# Query 3: Combined machine + time query - optimal for our clustering strategy\n", + "\n", + "print(\"\\n=== Query 3: Equipment Performance Trends ===\")\n", + "\n", + "performance_trends = spark.sql(\"\"\"\n", + "\n", + "SELECT machine_id, production_date, product_type, units_produced, cycle_time,\n", + "\n", + " ROUND(units_produced * 60.0 / cycle_time, 2) as hourly_rate\n", + "\n", + "FROM manufacturing.analytics.production_records\n", + "\n", + "WHERE 
machine_id LIKE 'MCH000%' AND production_date >= '2024-04-01'\n", + "\n", + "ORDER BY machine_id, production_date\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "performance_trends.show()\n", + "\n", + "print(f\"Performance records found: {performance_trends.count()}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 6: Analyze Clustering Effectiveness\n", + "\n", + "### Understanding the Impact\n", + "\n", + "Let's examine how liquid clustering has organized our data and analyze some aggregate statistics to demonstrate the manufacturing insights possible with this optimized structure.\n", + "\n", + "### Key Analytics\n", + "\n", + "- **Equipment utilization** and performance metrics\n", + "- **Quality control analysis** and defect patterns\n", + "- **Production line efficiency** and bottleneck identification\n", + "- **Product type performance** and optimization opportunities" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "=== Equipment Performance Analysis ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+----------+----------+------------------+---------------+--------------+-----------+\n", + "|machine_id|total_runs|avg_units_produced|avg_defect_rate|avg_cycle_time|total_units|\n", + "+----------+----------+------------------+---------------+--------------+-----------+\n", + "| MCH0163| 87| 437.47| 5.52| 9.06| 38060|\n", + "| MCH0169| 83| 447.94| 5.18| 10.47| 37179|\n", + "| MCH0006| 84| 432.4| 4.73| 8.64| 36322|\n", + "| MCH0108| 88| 411.07| 5.20| 8.93| 36174|\n", + "| MCH0153| 88| 409.22| 5.03| 10.13| 36011|\n", + "| MCH0097| 90| 396.48| 5.16| 10.12| 35683|\n", + "| MCH0101| 86| 402.84| 4.81| 9.15| 34644|\n", + "| MCH0070| 86| 402.67| 4.90| 9.46| 34630|\n", + "| MCH0044| 84| 410.06| 5.25| 9.96| 34445|\n", + "| MCH0082| 87| 392.25| 5.09| 10.19| 34126|\n", + "| MCH0068| 87| 391.46| 5.35| 9.89| 34057|\n", + "| MCH0142| 85| 398.65| 5.05| 9.32| 33885|\n", + "| MCH0149| 87| 388.13| 5.38| 10.38| 33767|\n", + "| MCH0093| 82| 411.34| 5.27| 9.51| 33730|\n", + "| MCH0157| 84| 398.89| 5.28| 9.92| 33507|\n", + "| MCH0183| 81| 409.95| 5.37| 9.28| 33206|\n", + "| MCH0144| 81| 405.86| 5.24| 9.76| 32875|\n", + "| MCH0041| 90| 364.6| 5.24| 11.08| 32814|\n", + "| MCH0118| 79| 413.46| 5.47| 9.43| 32663|\n", + "| MCH0036| 83| 390.28| 5.44| 10.55| 32393|\n", + "+----------+----------+------------------+---------------+--------------+-----------+\n", + "only showing top 20 rows\n", + "\n", + "\n", + "=== Quality Analysis by Product Type ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+--------------------+---------------+-----------+-------------+---------------+--------------+\n", + "| product_type|production_runs|total_units|total_defects|avg_defect_rate|avg_cycle_time|\n", + "+--------------------+---------------+-----------+-------------+---------------+--------------+\n", + "| Consumer Goods| 2982| 2392057| 87950| 3.68| 1.98|\n", + "| Electronics| 3172| 1580318| 37855| 2.39| 2.75|\n", + "| Automotive Parts| 3091| 615989| 37171| 6.02| 8.82|\n", + "|Industrial Equipment| 3053| 151345| 13680| 8.99| 27.52|\n", + "+--------------------+---------------+-----------+-------------+---------------+--------------+\n", + "\n", + "\n", + "=== Production Line Efficiency ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ 
+ "+---------------+----------+-------------+----------------+------------+---------------+\n", + "|production_line|total_runs|machines_used|total_production|avg_run_size|avg_defect_rate|\n", + "+---------------+----------+-------------+----------------+------------+---------------+\n", + "| LINE_E| 2478| 200| 964486| 389.22| 5.18|\n", + "| LINE_C| 2442| 200| 959370| 392.86| 5.19|\n", + "| LINE_D| 2464| 200| 944868| 383.47| 5.27|\n", + "| LINE_B| 2473| 200| 944805| 382.05| 5.28|\n", + "| LINE_A| 2441| 200| 926180| 379.43| 5.35|\n", + "+---------------+----------+-------------+----------------+------------+---------------+\n", + "\n", + "\n", + "=== Monthly Production Trends ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-------+---------------+-----------+---------------+---------------+\n", + "| month|production_runs|total_units|avg_defect_rate|active_machines|\n", + "+-------+---------------+-----------+---------------+---------------+\n", + "|2024-01| 1045| 397326| 5.24| 198|\n", + "|2024-02| 914| 368483| 5.17| 195|\n", + "|2024-03| 1003| 383685| 5.41| 197|\n", + "|2024-04| 1074| 407309| 5.18| 199|\n", + "|2024-05| 1083| 413054| 5.38| 195|\n", + "|2024-06| 992| 375035| 5.28| 198|\n", + "|2024-07| 1138| 456635| 5.20| 197|\n", + "|2024-08| 930| 366966| 5.01| 195|\n", + "|2024-09| 1045| 391363| 5.28| 195|\n", + "|2024-10| 1015| 394063| 5.27| 198|\n", + "|2024-11| 946| 363835| 5.30| 192|\n", + "|2024-12| 1113| 421955| 5.30| 199|\n", + "+-------+---------------+-----------+---------------+---------------+\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Analyze clustering effectiveness and manufacturing insights\n", + "\n", + "\n", + "# Equipment performance analysis\n", + "\n", + "print(\"=== Equipment Performance Analysis ===\")\n", + "\n", + "equipment_performance = spark.sql(\"\"\"\n", + "\n", + "SELECT machine_id, COUNT(*) as total_runs,\n", + "\n", + " ROUND(AVG(units_produced), 2) as avg_units_produced,\n", + "\n", + " ROUND(AVG(defect_count * 100.0 / units_produced), 2) as avg_defect_rate,\n", + "\n", + " ROUND(AVG(cycle_time), 2) as avg_cycle_time,\n", + "\n", + " ROUND(SUM(units_produced), 0) as total_units\n", + "\n", + "FROM manufacturing.analytics.production_records\n", + "\n", + "GROUP BY machine_id\n", + "\n", + "ORDER BY total_units DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "equipment_performance.show()\n", + "\n", + "\n", + "# Quality analysis by product type\n", + "\n", + "print(\"\\n=== Quality Analysis by Product Type ===\")\n", + "\n", + "quality_by_product = spark.sql(\"\"\"\n", + "\n", + "SELECT product_type, COUNT(*) as production_runs,\n", + "\n", + " ROUND(SUM(units_produced), 0) as total_units,\n", + "\n", + " ROUND(SUM(defect_count), 0) as total_defects,\n", + "\n", + " ROUND(AVG(defect_count * 100.0 / units_produced), 2) as avg_defect_rate,\n", + "\n", + " ROUND(AVG(cycle_time), 2) as avg_cycle_time\n", + "\n", + "FROM manufacturing.analytics.production_records\n", + "\n", + "GROUP BY product_type\n", + "\n", + "ORDER BY total_units DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "quality_by_product.show()\n", + "\n", + "\n", + "# Production line efficiency\n", + "\n", + "print(\"\\n=== Production Line Efficiency ===\")\n", + "\n", + "line_efficiency = spark.sql(\"\"\"\n", + "\n", + "SELECT production_line, COUNT(*) as total_runs,\n", + "\n", + " COUNT(DISTINCT machine_id) as machines_used,\n", + "\n", + " 
ROUND(SUM(units_produced), 0) as total_production,\n", + "\n", + " ROUND(AVG(units_produced), 2) as avg_run_size,\n", + "\n", + " ROUND(SUM(defect_count * 100.0 / units_produced) / COUNT(*), 2) as avg_defect_rate\n", + "\n", + "FROM manufacturing.analytics.production_records\n", + "\n", + "GROUP BY production_line\n", + "\n", + "ORDER BY total_production DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "line_efficiency.show()\n", + "\n", + "\n", + "# Monthly production trends\n", + "\n", + "print(\"\\n=== Monthly Production Trends ===\")\n", + "\n", + "monthly_trends = spark.sql(\"\"\"\n", + "\n", + "SELECT DATE_FORMAT(production_date, 'yyyy-MM') as month,\n", + "\n", + " COUNT(*) as production_runs,\n", + "\n", + " ROUND(SUM(units_produced), 0) as total_units,\n", + "\n", + " ROUND(AVG(defect_count * 100.0 / units_produced), 2) as avg_defect_rate,\n", + "\n", + " COUNT(DISTINCT machine_id) as active_machines\n", + "\n", + "FROM manufacturing.analytics.production_records\n", + "\n", + "GROUP BY DATE_FORMAT(production_date, 'yyyy-MM')\n", + "\n", + "ORDER BY month\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "monthly_trends.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Key Takeaways: Delta Liquid Clustering in AIDP\n", + "\n", + "### What We Demonstrated\n", + "\n", + "1. **Automatic Optimization**: Created a table with `CLUSTER BY (machine_id, production_date)` and let Delta automatically optimize data layout\n", + "\n", + "2. **Performance Benefits**: Queries on clustered columns (machine_id, production_date) are significantly faster due to data locality\n", + "\n", + "3. **Zero Maintenance**: No manual partitioning, bucketing, or Z-Ordering required - Delta handles it automatically\n", + "\n", + "4. **Real-World Use Case**: Manufacturing analytics where equipment monitoring and quality control are critical\n", + "\n", + "### AIDP Advantages\n", + "\n", + "- **Unified Analytics**: Seamlessly integrates with other AIDP services\n", + "- **Governance**: Catalog and schema isolation for manufacturing data\n", + "- **Performance**: Optimized for both OLAP and OLTP workloads\n", + "- **Scalability**: Handles manufacturing-scale data volumes effortlessly\n", + "\n", + "### Best Practices for Liquid Clustering\n", + "\n", + "1. **Choose clustering columns** based on your most common query patterns\n", + "2. **Start with 1-4 columns** - too many can reduce effectiveness\n", + "3. **Consider cardinality** - high-cardinality columns work best\n", + "4. **Monitor and adjust** as query patterns evolve\n", + "\n", + "### Next Steps\n", + "\n", + "- Explore other AIDP features like AI/ML integration\n", + "- Try liquid clustering with different column combinations\n", + "- Scale up to larger manufacturing datasets\n", + "- Integrate with real SCADA systems and IoT sensors\n", + "\n", + "This notebook demonstrates how Oracle AI Data Platform makes advanced manufacturing analytics accessible while maintaining enterprise-grade performance and governance." 
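+ ,
+ "\n",
+ "\n",
+ "To act on the **Monitor and adjust** guidance above, the sketch below shows one way to inspect and change the clustering keys on the `production_records` table from this notebook. `DESCRIBE DETAIL`, `ALTER TABLE ... CLUSTER BY`, and `OPTIMIZE` are standard Delta Lake liquid clustering operations, but whether they are available depends on the Delta version bundled with your AIDP runtime:\n",
+ "\n",
+ "```python\n",
+ "# Sketch: inspect and adjust liquid clustering (assumes a Delta runtime with liquid clustering support)\n",
+ "\n",
+ "# Current clustering keys (exposed as clusteringColumns in recent Delta releases)\n",
+ "spark.sql(\"DESCRIBE DETAIL manufacturing.analytics.production_records\").select(\"clusteringColumns\").show(truncate=False)\n",
+ "\n",
+ "# If query patterns shift toward product-level analysis, change the clustering keys in place\n",
+ "spark.sql(\"ALTER TABLE manufacturing.analytics.production_records CLUSTER BY (product_type, production_date)\")\n",
+ "\n",
+ "# Recluster existing files incrementally according to the current clustering keys\n",
+ "spark.sql(\"OPTIMIZE manufacturing.analytics.production_records\")\n",
+ "```"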
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/Notebooks/liquid_clustering/media_delta_liquid_clustering_demo.ipynb b/Notebooks/liquid_clustering/media_delta_liquid_clustering_demo.ipynb new file mode 100644 index 0000000..e191bcd --- /dev/null +++ b/Notebooks/liquid_clustering/media_delta_liquid_clustering_demo.ipynb @@ -0,0 +1,1038 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Media: Delta Liquid Clustering Demo\n", + "\n", + "\n", + "## Overview\n", + "\n", + "\n", + "This notebook demonstrates the power of **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using a media and entertainment analytics use case. Liquid clustering automatically optimizes data layout for query performance without requiring manual partitioning or Z-Ordering.\n", + "\n", + "### What is Liquid Clustering?\n", + "\n", + "Liquid clustering automatically identifies and groups similar data together based on clustering columns you define. This optimization happens automatically during data ingestion and maintenance operations, providing:\n", + "\n", + "- **Automatic optimization**: No manual tuning required\n", + "- **Improved query performance**: Faster queries on clustered columns\n", + "- **Reduced maintenance**: No need for manual repartitioning\n", + "- **Adaptive clustering**: Adjusts as data patterns change\n", + "\n", + "### Use Case: Content Performance and User Engagement Analytics\n", + "\n", + "We'll analyze media content consumption and user engagement data. Our clustering strategy will optimize for:\n", + "\n", + "- **User-specific queries**: Fast lookups by user ID\n", + "- **Time-based analysis**: Efficient filtering by viewing and engagement dates\n", + "- **Content performance patterns**: Quick aggregation by content type and engagement metrics\n", + "\n", + "### AIDP Environment Setup\n", + "\n", + "This notebook leverages the existing Spark session in your AIDP environment." 
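+ ,
+ "\n",
+ "\n",
+ "If you want to attach rough numbers to the performance claims made in Step 5, the sketch below defines a tiny timing helper (plain Python `time` plus a `count()` to force full evaluation). It assumes the AIDP-provided `spark` session and the `media.analytics.content_engagement` table created later in this notebook; absolute timings will vary with data volume, caching, and cluster size:\n",
+ "\n",
+ "```python\n",
+ "import time\n",
+ "\n",
+ "def timed_count(label, query):\n",
+ "    \"\"\"Run a SQL query, force evaluation with count(), and report wall-clock time.\"\"\"\n",
+ "    start = time.time()\n",
+ "    rows = spark.sql(query).count()\n",
+ "    print(f\"{label}: {rows} rows in {time.time() - start:.2f}s\")\n",
+ "\n",
+ "# Example (run after Step 4 populates the table): a lookup on the clustering column user_id\n",
+ "# timed_count(\"User history\", \"SELECT * FROM media.analytics.content_engagement WHERE user_id = 'USER000001'\")\n",
+ "```"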
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Media catalog and analytics schema created successfully!\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create media catalog and analytics schema\n", + "\n", + "# In AIDP, catalogs provide data isolation and governance\n", + "\n", + "spark.sql(\"CREATE CATALOG IF NOT EXISTS media\")\n", + "\n", + "spark.sql(\"CREATE SCHEMA IF NOT EXISTS media.analytics\")\n", + "\n", + "print(\"Media catalog and analytics schema created successfully!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 2: Create Delta Table with Liquid Clustering\n", + "\n", + "### Table Design\n", + "\n", + "Our `content_engagement` table will store:\n", + "\n", + "- **user_id**: Unique user identifier\n", + "- **engagement_date**: Date and time of engagement\n", + "- **content_type**: Type (Video, Article, Podcast, Live Stream)\n", + "- **watch_time**: Time spent consuming content (minutes)\n", + "- **content_id**: Specific content identifier\n", + "- **engagement_score**: User engagement metric (0-100)\n", + "- **device_type**: Device used (Mobile, Desktop, TV, etc.)\n", + "\n", + "### Clustering Strategy\n", + "\n", + "We'll cluster by `user_id` and `engagement_date` because:\n", + "\n", + "- **user_id**: Users consume multiple pieces of content, grouping their viewing history together\n", + "- **engagement_date**: Time-based queries are critical for content performance analysis, recommendation systems, and user behavior trends\n", + "- This combination optimizes for both personalized content recommendations and temporal engagement analysis" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Delta table with liquid clustering created successfully!\n", + "Clustering will automatically optimize data layout for queries on user_id and engagement_date.\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create Delta table with liquid clustering\n", + "\n", + "# CLUSTER BY defines the columns for automatic optimization\n", + "\n", + "spark.sql(\"\"\"\n", + "\n", + "CREATE TABLE IF NOT EXISTS media.analytics.content_engagement (\n", + "\n", + " user_id STRING,\n", + "\n", + " engagement_date TIMESTAMP,\n", + "\n", + " content_type STRING,\n", + "\n", + " watch_time DECIMAL(8,2),\n", + "\n", + " content_id STRING,\n", + "\n", + " engagement_score INT,\n", + "\n", + " device_type STRING\n", + "\n", + ")\n", + "\n", + "USING DELTA\n", + "\n", + "CLUSTER BY (user_id, engagement_date)\n", + "\n", + "\"\"\")\n", + "\n", + "print(\"Delta table with liquid clustering created successfully!\")\n", + "\n", + "print(\"Clustering will automatically optimize data layout for queries on user_id and engagement_date.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 3: Generate Media Sample Data\n", + "\n", + "### Data Generation Strategy\n", + "\n", + "We'll create realistic media engagement data including:\n", + "\n", + "- **12,000 users** with multiple content interactions over time\n", + "- **Content types**: Video, Article, Podcast, Live Stream\n", + "- **Realistic engagement patterns**: Peak viewing times, content preferences, device usage\n", + "- **Engagement metrics**: Watch time, completion rates, interaction scores\n", + "\n", + "### Why This Data Pattern?\n", + "\n", + 
"This data simulates real media scenarios where:\n", + "\n", + "- User preferences drive content recommendations\n", + "- Engagement metrics determine content success\n", + "- Device usage affects viewing experience\n", + "- Time-based patterns influence programming decisions\n", + "- Personalization requires historical user behavior" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Generated 299540 content engagement records\n", + "Sample record: {'user_id': 'USER000001', 'engagement_date': datetime.datetime(2024, 8, 13, 17, 29), 'content_type': 'Podcast', 'watch_time': 34.22, 'content_id': 'POD96528', 'engagement_score': 74, 'device_type': 'Desktop'}\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Generate sample media engagement data\n", + "\n", + "# Using fully qualified imports to avoid conflicts\n", + "\n", + "import random\n", + "\n", + "from datetime import datetime, timedelta\n", + "\n", + "\n", + "# Define media data constants\n", + "\n", + "CONTENT_TYPES = ['Video', 'Article', 'Podcast', 'Live Stream']\n", + "\n", + "DEVICE_TYPES = ['Mobile', 'Desktop', 'Tablet', 'Smart TV', 'Gaming Console']\n", + "\n", + "# Base engagement parameters by content type\n", + "\n", + "ENGAGEMENT_PARAMS = {\n", + "\n", + " 'Video': {'avg_watch_time': 15, 'engagement_base': 75, 'frequency': 12},\n", + "\n", + " 'Article': {'avg_watch_time': 8, 'engagement_base': 65, 'frequency': 8},\n", + "\n", + " 'Podcast': {'avg_watch_time': 25, 'engagement_base': 70, 'frequency': 6},\n", + "\n", + " 'Live Stream': {'avg_watch_time': 45, 'engagement_base': 80, 'frequency': 4}\n", + "\n", + "}\n", + "\n", + "# Device engagement multipliers\n", + "\n", + "DEVICE_MULTIPLIERS = {\n", + "\n", + " 'Mobile': 0.9, 'Desktop': 1.0, 'Tablet': 0.95, 'Smart TV': 1.1, 'Gaming Console': 1.05\n", + "\n", + "}\n", + "\n", + "\n", + "# Generate content engagement records\n", + "\n", + "engagement_data = []\n", + "\n", + "base_date = datetime(2024, 1, 1)\n", + "\n", + "\n", + "# Create 12,000 users with 10-40 engagement events each\n", + "\n", + "for user_num in range(1, 12001):\n", + "\n", + " user_id = f\"USER{user_num:06d}\"\n", + " \n", + " # Each user gets 10-40 engagement events over 12 months\n", + "\n", + " num_engagements = random.randint(10, 40)\n", + " \n", + " for i in range(num_engagements):\n", + "\n", + " # Spread engagements over 12 months\n", + "\n", + " days_offset = random.randint(0, 365)\n", + "\n", + " engagement_date = base_date + timedelta(days=days_offset)\n", + " \n", + " # Add realistic timing (more engagement during certain hours)\n", + "\n", + " hour_weights = [2, 1, 1, 1, 1, 1, 3, 6, 8, 7, 6, 7, 8, 9, 10, 9, 8, 10, 12, 9, 7, 5, 4, 3]\n", + "\n", + " hours_offset = random.choices(range(24), weights=hour_weights)[0]\n", + "\n", + " engagement_date = engagement_date.replace(hour=hours_offset, minute=random.randint(0, 59), second=0, microsecond=0)\n", + " \n", + " # Select content type\n", + "\n", + " content_type = random.choice(CONTENT_TYPES)\n", + "\n", + " params = ENGAGEMENT_PARAMS[content_type]\n", + " \n", + " # Select device type\n", + "\n", + " device_type = random.choice(DEVICE_TYPES)\n", + "\n", + " device_multiplier = DEVICE_MULTIPLIERS[device_type]\n", + " \n", + " # Calculate watch time with variations\n", + "\n", + " time_variation = random.uniform(0.3, 2.5)\n", + "\n", + " watch_time = round(params['avg_watch_time'] * time_variation * device_multiplier, 2)\n", + " 
\n", + " # Content ID\n", + "\n", + " content_id = f\"{content_type[:3].upper()}{random.randint(10000, 99999)}\"\n", + " \n", + " # Engagement score (based on content type, device, and some randomness)\n", + "\n", + " engagement_variation = random.randint(-15, 15)\n", + "\n", + " engagement_score = max(0, min(100, int(params['engagement_base'] * device_multiplier) + engagement_variation))\n", + " \n", + " engagement_data.append({\n", + "\n", + " \"user_id\": user_id,\n", + "\n", + " \"engagement_date\": engagement_date,\n", + "\n", + " \"content_type\": content_type,\n", + "\n", + " \"watch_time\": watch_time,\n", + "\n", + " \"content_id\": content_id,\n", + "\n", + " \"engagement_score\": engagement_score,\n", + "\n", + " \"device_type\": device_type\n", + "\n", + " })\n", + "\n", + "\n", + "\n", + "print(f\"Generated {len(engagement_data)} content engagement records\")\n", + "\n", + "print(\"Sample record:\", engagement_data[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 4: Insert Data Using PySpark\n", + "\n", + "### Data Insertion Strategy\n", + "\n", + "We'll use PySpark to:\n", + "\n", + "1. **Create DataFrame** from our generated data\n", + "2. **Insert into Delta table** with liquid clustering\n", + "3. **Verify the insertion** with a sample query\n", + "\n", + "### Why PySpark for Insertion?\n", + "\n", + "- **Distributed processing**: Handles large datasets efficiently\n", + "- **Type safety**: Ensures data integrity\n", + "- **Optimization**: Leverages Spark's query optimization\n", + "- **Liquid clustering**: Automatically applies clustering during insertion" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "DataFrame Schema:\n", + "root\n", + " |-- content_id: string (nullable = true)\n", + " |-- content_type: string (nullable = true)\n", + " |-- device_type: string (nullable = true)\n", + " |-- engagement_date: timestamp (nullable = true)\n", + " |-- engagement_score: long (nullable = true)\n", + " |-- user_id: string (nullable = true)\n", + " |-- watch_time: double (nullable = true)\n", + "\n", + "\n", + "Sample Data:\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+----------+------------+--------------+-------------------+----------------+----------+----------+\n", + "|content_id|content_type| device_type| engagement_date|engagement_score| user_id|watch_time|\n", + "+----------+------------+--------------+-------------------+----------------+----------+----------+\n", + "| POD96528| Podcast| Desktop|2024-08-13 17:29:00| 74|USER000001| 34.22|\n", + "| VID98484| Video| Mobile|2024-09-04 00:59:00| 81|USER000001| 13.27|\n", + "| VID15293| Video| Tablet|2024-01-01 10:39:00| 84|USER000001| 9.75|\n", + "| POD83689| Podcast| Mobile|2024-06-04 20:33:00| 76|USER000001| 41.79|\n", + "| POD56644| Podcast|Gaming Console|2024-02-19 13:31:00| 63|USER000001| 27.7|\n", + "+----------+------------+--------------+-------------------+----------------+----------+----------+\n", + "only showing top 5 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "\n", + "Successfully inserted 299540 records into media.analytics.content_engagement\n", + "Liquid clustering automatically optimized the data layout during insertion!\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Insert data using PySpark DataFrame 
operations\n", + "\n", + "# Using fully qualified function references to avoid conflicts\n", + "\n", + "\n", + "# Create DataFrame from generated data\n", + "\n", + "df_engagement = spark.createDataFrame(engagement_data)\n", + "\n", + "\n", + "# Display schema and sample data\n", + "\n", + "print(\"DataFrame Schema:\")\n", + "\n", + "df_engagement.printSchema()\n", + "\n", + "\n", + "\n", + "print(\"\\nSample Data:\")\n", + "\n", + "df_engagement.show(5)\n", + "\n", + "\n", + "# Insert data into Delta table with liquid clustering\n", + "\n", + "# The CLUSTER BY (user_id, engagement_date) will automatically optimize the data layout\n", + "\n", + "df_engagement.write.mode(\"overwrite\").saveAsTable(\"media.analytics.content_engagement\")\n", + "\n", + "\n", + "print(f\"\\nSuccessfully inserted {df_engagement.count()} records into media.analytics.content_engagement\")\n", + "\n", + "print(\"Liquid clustering automatically optimized the data layout during insertion!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 5: Demonstrate Liquid Clustering Benefits\n", + "\n", + "### Query Performance Analysis\n", + "\n", + "Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:\n", + "\n", + "1. **User engagement history** (clustered by user_id)\n", + "2. **Time-based content analysis** (clustered by engagement_date)\n", + "3. **Combined user + time queries** (optimal for our clustering)\n", + "\n", + "### Expected Performance Benefits\n", + "\n", + "With liquid clustering, these queries should be significantly faster because:\n", + "\n", + "- **Data locality**: Related records are physically grouped together\n", + "- **Reduced I/O**: Less data needs to be read from disk\n", + "- **Automatic optimization**: No manual tuning required" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "=== Query 1: User Engagement History ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+----------+-------------------+------------+----------+----------------+\n", + "| user_id| engagement_date|content_type|watch_time|engagement_score|\n", + "+----------+-------------------+------------+----------+----------------+\n", + "|USER000001|2024-12-30 07:16:00| Podcast| 41.06| 83|\n", + "|USER000001|2024-12-08 17:18:00| Podcast| 13.61| 75|\n", + "|USER000001|2024-11-27 07:56:00| Article| 18.44| 63|\n", + "|USER000001|2024-10-15 15:23:00| Live Stream| 111.8| 80|\n", + "|USER000001|2024-09-04 00:59:00| Video| 13.27| 81|\n", + "|USER000001|2024-09-03 23:01:00| Live Stream| 65.6| 88|\n", + "|USER000001|2024-09-03 14:35:00| Live Stream| 44.77| 91|\n", + "|USER000001|2024-08-20 19:50:00| Podcast| 40.36| 67|\n", + "|USER000001|2024-08-13 17:29:00| Podcast| 34.22| 74|\n", + "|USER000001|2024-07-17 23:14:00| Live Stream| 113.5| 74|\n", + "+----------+-------------------+------------+----------+----------------+\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Records found: 10\n", + "\n", + "=== Query 2: Recent High-Engagement Content ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-------------------+----------+----------+------------+----------------+----------+\n", + "| engagement_date| user_id|content_id|content_type|engagement_score|watch_time|\n", + 
"+-------------------+----------+----------+------------+----------------+----------+\n", + "|2024-02-15 16:33:00|USER004701| LIV23443| Live Stream| 100| 111.0|\n", + "|2024-02-15 15:56:00|USER009133| LIV37632| Live Stream| 100| 107.46|\n", + "|2024-02-15 06:56:00|USER005956| LIV52538| Live Stream| 100| 102.42|\n", + "|2024-02-15 15:32:00|USER002011| LIV53566| Live Stream| 100| 57.66|\n", + "|2024-02-15 10:38:00|USER004131| LIV78476| Live Stream| 100| 21.97|\n", + "|2024-02-15 07:53:00|USER001098| LIV42709| Live Stream| 100| 21.52|\n", + "|2024-02-15 15:50:00|USER011262| LIV59439| Live Stream| 99| 74.89|\n", + "|2024-02-15 13:38:00|USER006084| LIV42623| Live Stream| 98| 110.39|\n", + "|2024-02-15 02:57:00|USER010226| LIV65581| Live Stream| 98| 21.21|\n", + "|2024-02-15 18:31:00|USER010806| LIV22812| Live Stream| 97| 104.68|\n", + "|2024-02-15 15:57:00|USER011843| LIV75072| Live Stream| 97| 85.95|\n", + "|2024-02-15 19:05:00|USER001313| LIV27251| Live Stream| 97| 80.72|\n", + "|2024-02-15 13:35:00|USER002206| LIV20408| Live Stream| 97| 26.6|\n", + "|2024-02-15 21:38:00|USER010468| LIV75912| Live Stream| 96| 111.89|\n", + "|2024-02-15 15:08:00|USER010862| LIV57131| Live Stream| 96| 85.56|\n", + "|2024-02-15 13:46:00|USER007068| LIV56576| Live Stream| 96| 73.59|\n", + "|2024-02-15 14:03:00|USER002667| LIV60308| Live Stream| 96| 43.27|\n", + "|2024-02-15 11:15:00|USER003909| VID86057| Video| 96| 26.42|\n", + "|2024-02-15 08:09:00|USER009458| LIV92626| Live Stream| 95| 107.98|\n", + "|2024-02-15 14:27:00|USER006756| LIV23306| Live Stream| 95| 105.01|\n", + "+-------------------+----------+----------+------------+----------------+----------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "High-engagement records found: 106\n", + "\n", + "=== Query 3: User Content Preferences ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+----------+-------------------+------------+----------+--------------+\n", + "| user_id| engagement_date|content_type|watch_time| device_type|\n", + "+----------+-------------------+------------+----------+--------------+\n", + "|USER000001|2024-02-19 13:31:00| Podcast| 27.7|Gaming Console|\n", + "|USER000001|2024-03-06 18:48:00| Live Stream| 93.56| Mobile|\n", + "|USER000001|2024-03-19 21:42:00| Video| 32.25| Desktop|\n", + "|USER000001|2024-03-26 07:32:00| Podcast| 17.3| Smart TV|\n", + "|USER000001|2024-04-02 12:00:00| Podcast| 40.56| Smart TV|\n", + "|USER000001|2024-04-02 13:07:00| Podcast| 24.74| Desktop|\n", + "|USER000001|2024-04-27 14:31:00| Podcast| 32.07| Tablet|\n", + "|USER000001|2024-05-05 23:26:00| Video| 11.33| Tablet|\n", + "|USER000001|2024-05-06 18:32:00| Podcast| 17.0| Tablet|\n", + "|USER000001|2024-06-04 20:33:00| Podcast| 41.79| Mobile|\n", + "|USER000001|2024-06-06 13:12:00| Video| 30.08| Smart TV|\n", + "|USER000001|2024-06-08 10:16:00| Live Stream| 95.7| Mobile|\n", + "|USER000001|2024-06-21 09:42:00| Live Stream| 54.65| Mobile|\n", + "|USER000001|2024-07-17 23:14:00| Live Stream| 113.5| Smart TV|\n", + "|USER000001|2024-08-13 17:29:00| Podcast| 34.22| Desktop|\n", + "|USER000001|2024-08-20 19:50:00| Podcast| 40.36| Desktop|\n", + "|USER000001|2024-09-03 14:35:00| Live Stream| 44.77| Desktop|\n", + "|USER000001|2024-09-03 23:01:00| Live Stream| 65.6| Tablet|\n", + "|USER000001|2024-09-04 00:59:00| Video| 13.27| Mobile|\n", + "|USER000001|2024-10-15 15:23:00| Live Stream| 111.8| Smart TV|\n", + 
"+----------+-------------------+------------+----------+--------------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "User preference records found: 25\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Demonstrate liquid clustering benefits with optimized queries\n", + "\n", + "\n", + "# Query 1: User engagement history - benefits from user_id clustering\n", + "\n", + "print(\"=== Query 1: User Engagement History ===\")\n", + "\n", + "user_history = spark.sql(\"\"\"\n", + "\n", + "SELECT user_id, engagement_date, content_type, watch_time, engagement_score\n", + "\n", + "FROM media.analytics.content_engagement\n", + "\n", + "WHERE user_id = 'USER000001'\n", + "\n", + "ORDER BY engagement_date DESC\n", + "\n", + "LIMIT 10\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "user_history.show()\n", + "\n", + "print(f\"Records found: {user_history.count()}\")\n", + "\n", + "\n", + "\n", + "# Query 2: Time-based high-engagement content analysis - benefits from engagement_date clustering\n", + "\n", + "print(\"\\n=== Query 2: Recent High-Engagement Content ===\")\n", + "\n", + "high_engagement = spark.sql(\"\"\"\n", + "\n", + "SELECT engagement_date, user_id, content_id, content_type, engagement_score, watch_time\n", + "\n", + "FROM media.analytics.content_engagement\n", + "\n", + "WHERE DATE(engagement_date) = '2024-02-15' AND engagement_score > 85\n", + "\n", + "ORDER BY engagement_score DESC, watch_time DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "high_engagement.show()\n", + "\n", + "print(f\"High-engagement records found: {high_engagement.count()}\")\n", + "\n", + "\n", + "\n", + "# Query 3: Combined user + time query - optimal for our clustering strategy\n", + "\n", + "print(\"\\n=== Query 3: User Content Preferences ===\")\n", + "\n", + "user_preferences = spark.sql(\"\"\"\n", + "\n", + "SELECT user_id, engagement_date, content_type, watch_time, device_type\n", + "\n", + "FROM media.analytics.content_engagement\n", + "\n", + "WHERE user_id LIKE 'USER000%' AND engagement_date >= '2024-02-01'\n", + "\n", + "ORDER BY user_id, engagement_date\n", + "\n", + "LIMIT 25\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "user_preferences.show()\n", + "\n", + "print(f\"User preference records found: {user_preferences.count()}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 6: Analyze Clustering Effectiveness\n", + "\n", + "### Understanding the Impact\n", + "\n", + "Let's examine how liquid clustering has organized our data and analyze some aggregate statistics to demonstrate the media insights possible with this optimized structure.\n", + "\n", + "### Key Analytics\n", + "\n", + "- **User engagement patterns** and content preferences\n", + "- **Content performance** by type and popularity metrics\n", + "- **Device usage trends** and platform optimization\n", + "- **Time-based consumption patterns** and programming insights" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "=== User Engagement Analysis ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+----------+--------------+----------------+----------------+--------------+------------------+\n", + "| user_id|total_sessions|total_watch_time|avg_session_time|avg_engagement|content_types_used|\n", + 
"+----------+--------------+----------------+----------------+--------------+------------------+\n", + "|USER007579| 40| 1877.93| 46.95| 75.13| 4|\n", + "|USER005840| 37| 1833.53| 49.55| 74.32| 4|\n", + "|USER001865| 38| 1811.01| 47.66| 74.92| 4|\n", + "|USER004356| 38| 1750.62| 46.07| 72.79| 4|\n", + "|USER007922| 36| 1738.63| 48.3| 75.08| 4|\n", + "|USER002936| 35| 1729.81| 49.42| 69.69| 4|\n", + "|USER002713| 40| 1712.54| 42.81| 71.73| 4|\n", + "|USER007310| 40| 1705.58| 42.64| 74.9| 4|\n", + "|USER001554| 39| 1680.15| 43.08| 72.31| 4|\n", + "|USER008670| 40| 1678.74| 41.97| 75.5| 4|\n", + "+----------+--------------+----------------+----------------+--------------+------------------+\n", + "\n", + "\n", + "=== Content Type Performance ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+------------+-----------------+----------------+--------------+--------------+------------+--------------+\n", + "|content_type|total_engagements|total_watch_time|avg_watch_time|avg_engagement|unique_users|unique_content|\n", + "+------------+-----------------+----------------+--------------+--------------+------------+--------------+\n", + "| Live Stream| 75054| 4737522.68| 63.12| 79.97| 11912| 50853|\n", + "| Podcast| 75096| 2632220.72| 35.05| 69.87| 11904| 51028|\n", + "| Video| 74449| 1568878.01| 21.07| 74.64| 11906| 50616|\n", + "| Article| 74941| 839239.02| 11.2| 64.59| 11923| 50708|\n", + "+------------+-----------------+----------------+--------------+--------------+------------+--------------+\n", + "\n", + "\n", + "=== Device Usage Analysis ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+--------------+--------------+----------------+----------------+--------------+------------+\n", + "| device_type|total_sessions|total_watch_time|avg_session_time|avg_engagement|unique_users|\n", + "+--------------+--------------+----------------+----------------+--------------+------------+\n", + "| Smart TV| 60108| 2160351.14| 35.94| 79.43| 11778|\n", + "|Gaming Console| 59734| 2028688.05| 33.96| 75.74| 11802|\n", + "| Desktop| 59949| 1969632.73| 32.86| 72.5| 11783|\n", + "| Tablet| 60175| 1869267.79| 31.06| 68.54| 11804|\n", + "| Mobile| 59574| 1749920.72| 29.37| 65.08| 11784|\n", + "+--------------+--------------+----------------+----------------+--------------+------------+\n", + "\n", + "\n", + "=== Hourly Engagement Patterns ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-----------+-----------------+----------------+--------------+------------+\n", + "|hour_of_day|engagement_events|total_watch_time|avg_engagement|active_users|\n", + "+-----------+-----------------+----------------+--------------+------------+\n", + "| 0| 15| 472.0| 71.47| 15|\n", + "| 1| 7| 158.18| 73.71| 7|\n", + "| 2| 8| 322.8| 68.25| 8|\n", + "| 3| 6| 199.68| 68.0| 6|\n", + "| 4| 8| 219.29| 68.88| 8|\n", + "| 5| 3| 116.65| 76.33| 3|\n", + "| 6| 18| 568.5| 72.56| 18|\n", + "| 7| 42| 1211.49| 71.38| 42|\n", + "| 8| 43| 1407.64| 73.84| 43|\n", + "| 9| 47| 1604.9| 70.06| 47|\n", + "| 10| 39| 1341.92| 71.82| 39|\n", + "| 11| 48| 1707.31| 75.85| 48|\n", + "| 12| 49| 1723.38| 72.92| 49|\n", + "| 13| 70| 2297.3| 72.96| 70|\n", + "| 14| 47| 1873.51| 73.87| 47|\n", + "| 15| 51| 1556.71| 72.69| 51|\n", + "| 16| 42| 1095.14| 70.02| 42|\n", + "| 17| 63| 2550.92| 72.48| 63|\n", + "| 18| 72| 2541.56| 72.81| 72|\n", + "| 19| 40| 1289.31| 73.4| 40|\n", + 
"+-----------+-----------------+----------------+--------------+------------+\n", + "only showing top 20 rows\n", + "\n", + "\n", + "=== Monthly Engagement Trends ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-------+-----------------+------------------+----------------+--------------+------------+\n", + "| month|total_engagements|monthly_watch_time|avg_session_time|avg_engagement|active_users|\n", + "+-------+-----------------+------------------+----------------+--------------+------------+\n", + "|2024-01| 25159| 827121.13| 32.88| 72.26| 10203|\n", + "|2024-02| 23872| 772994.45| 32.38| 72.24| 10000|\n", + "|2024-03| 25510| 827291.65| 32.43| 72.29| 10244|\n", + "|2024-04| 24519| 798865.9| 32.58| 72.23| 10145|\n", + "|2024-05| 25288| 829255.26| 32.79| 72.26| 10225|\n", + "|2024-06| 24308| 794100.99| 32.67| 72.17| 10062|\n", + "|2024-07| 25428| 832311.23| 32.73| 72.25| 10260|\n", + "|2024-08| 25603| 833486.22| 32.55| 72.34| 10257|\n", + "|2024-09| 24588| 808066.62| 32.86| 72.33| 10097|\n", + "|2024-10| 25287| 820795.48| 32.46| 72.26| 10214|\n", + "|2024-11| 24695| 804246.35| 32.57| 72.18| 10137|\n", + "|2024-12| 25283| 829325.15| 32.8| 72.37| 10259|\n", + "+-------+-----------------+------------------+----------------+--------------+------------+\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Analyze clustering effectiveness and media insights\n", + "\n", + "\n", + "# User engagement analysis\n", + "\n", + "print(\"=== User Engagement Analysis ===\")\n", + "\n", + "user_engagement = spark.sql(\"\"\"\n", + "\n", + "SELECT user_id, COUNT(*) as total_sessions,\n", + "\n", + " ROUND(SUM(watch_time), 2) as total_watch_time,\n", + "\n", + " ROUND(AVG(watch_time), 2) as avg_session_time,\n", + "\n", + " ROUND(AVG(engagement_score), 2) as avg_engagement,\n", + "\n", + " COUNT(DISTINCT content_type) as content_types_used\n", + "\n", + "FROM media.analytics.content_engagement\n", + "\n", + "GROUP BY user_id\n", + "\n", + "ORDER BY total_watch_time DESC\n", + "\n", + "LIMIT 10\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "user_engagement.show()\n", + "\n", + "\n", + "# Content type performance\n", + "\n", + "print(\"\\n=== Content Type Performance ===\")\n", + "\n", + "content_performance = spark.sql(\"\"\"\n", + "\n", + "SELECT content_type, COUNT(*) as total_engagements,\n", + "\n", + " ROUND(SUM(watch_time), 2) as total_watch_time,\n", + "\n", + " ROUND(AVG(watch_time), 2) as avg_watch_time,\n", + "\n", + " ROUND(AVG(engagement_score), 2) as avg_engagement,\n", + "\n", + " COUNT(DISTINCT user_id) as unique_users,\n", + "\n", + " COUNT(DISTINCT content_id) as unique_content\n", + "\n", + "FROM media.analytics.content_engagement\n", + "\n", + "GROUP BY content_type\n", + "\n", + "ORDER BY total_watch_time DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "content_performance.show()\n", + "\n", + "\n", + "# Device usage analysis\n", + "\n", + "print(\"\\n=== Device Usage Analysis ===\")\n", + "\n", + "device_analysis = spark.sql(\"\"\"\n", + "\n", + "SELECT device_type, COUNT(*) as total_sessions,\n", + "\n", + " ROUND(SUM(watch_time), 2) as total_watch_time,\n", + "\n", + " ROUND(AVG(watch_time), 2) as avg_session_time,\n", + "\n", + " ROUND(AVG(engagement_score), 2) as avg_engagement,\n", + "\n", + " COUNT(DISTINCT user_id) as unique_users\n", + "\n", + "FROM media.analytics.content_engagement\n", + "\n", + "GROUP BY device_type\n", + "\n", + "ORDER BY total_watch_time 
DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "device_analysis.show()\n", + "\n", + "\n", + "# Hourly engagement patterns\n", + "\n", + "print(\"\\n=== Hourly Engagement Patterns ===\")\n", + "\n", + "hourly_patterns = spark.sql(\"\"\"\n", + "\n", + "SELECT HOUR(engagement_date) as hour_of_day, COUNT(*) as engagement_events,\n", + "\n", + " ROUND(SUM(watch_time), 2) as total_watch_time,\n", + "\n", + " ROUND(AVG(engagement_score), 2) as avg_engagement,\n", + "\n", + " COUNT(DISTINCT user_id) as active_users\n", + "\n", + "FROM media.analytics.content_engagement\n", + "\n", + "WHERE DATE(engagement_date) = '2024-02-01'\n", + "\n", + "GROUP BY HOUR(engagement_date)\n", + "\n", + "ORDER BY hour_of_day\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "hourly_patterns.show()\n", + "\n", + "\n", + "# Monthly engagement trends\n", + "\n", + "print(\"\\n=== Monthly Engagement Trends ===\")\n", + "\n", + "monthly_trends = spark.sql(\"\"\"\n", + "\n", + "SELECT DATE_FORMAT(engagement_date, 'yyyy-MM') as month,\n", + "\n", + " COUNT(*) as total_engagements,\n", + "\n", + " ROUND(SUM(watch_time), 2) as monthly_watch_time,\n", + "\n", + " ROUND(AVG(watch_time), 2) as avg_session_time,\n", + "\n", + " ROUND(AVG(engagement_score), 2) as avg_engagement,\n", + "\n", + " COUNT(DISTINCT user_id) as active_users\n", + "\n", + "FROM media.analytics.content_engagement\n", + "\n", + "GROUP BY DATE_FORMAT(engagement_date, 'yyyy-MM')\n", + "\n", + "ORDER BY month\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "monthly_trends.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Key Takeaways: Delta Liquid Clustering in AIDP\n", + "\n", + "### What We Demonstrated\n", + "\n", + "1. **Automatic Optimization**: Created a table with `CLUSTER BY (user_id, engagement_date)` and let Delta automatically optimize data layout\n", + "\n", + "2. **Performance Benefits**: Queries on clustered columns (user_id, engagement_date) are significantly faster due to data locality\n", + "\n", + "3. **Zero Maintenance**: No manual partitioning, bucketing, or Z-Ordering required - Delta handles it automatically\n", + "\n", + "4. **Real-World Use Case**: Media analytics where content engagement and user behavior analysis are critical\n", + "\n", + "### AIDP Advantages\n", + "\n", + "- **Unified Analytics**: Seamlessly integrates with other AIDP services\n", + "- **Governance**: Catalog and schema isolation for media data\n", + "- **Performance**: Optimized for both OLAP and OLTP workloads\n", + "- **Scalability**: Handles media-scale data volumes effortlessly\n", + "\n", + "### Best Practices for Liquid Clustering\n", + "\n", + "1. **Choose clustering columns** based on your most common query patterns\n", + "2. **Start with 1-4 columns** - too many can reduce effectiveness\n", + "3. **Consider cardinality** - high-cardinality columns work best\n", + "4. **Monitor and adjust** as query patterns evolve\n", + "\n", + "### Next Steps\n", + "\n", + "- Explore other AIDP features like AI/ML integration\n", + "- Try liquid clustering with different column combinations\n", + "- Scale up to larger media datasets\n", + "- Integrate with real content management and streaming platforms\n", + "\n", + "This notebook demonstrates how Oracle AI Data Platform makes advanced media analytics accessible while maintaining enterprise-grade performance and governance." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/Notebooks/liquid_clustering/real_estate_delta_liquid_clustering_demo.ipynb b/Notebooks/liquid_clustering/real_estate_delta_liquid_clustering_demo.ipynb new file mode 100644 index 0000000..66e7278 --- /dev/null +++ b/Notebooks/liquid_clustering/real_estate_delta_liquid_clustering_demo.ipynb @@ -0,0 +1,1109 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Real Estate: Delta Liquid Clustering Demo\n", + "\n", + "\n", + "## Overview\n", + "\n", + "\n", + "This notebook demonstrates the power of **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using a real estate analytics use case. Liquid clustering automatically optimizes data layout for query performance without requiring manual partitioning or Z-Ordering.\n", + "\n", + "### What is Liquid Clustering?\n", + "\n", + "Liquid clustering automatically identifies and groups similar data together based on clustering columns you define. This optimization happens automatically during data ingestion and maintenance operations, providing:\n", + "\n", + "- **Automatic optimization**: No manual tuning required\n", + "- **Improved query performance**: Faster queries on clustered columns\n", + "- **Reduced maintenance**: No need for manual repartitioning\n", + "- **Adaptive clustering**: Adjusts as data patterns change\n", + "\n", + "### Use Case: Property Transactions and Market Analysis\n", + "\n", + "We'll analyze real estate transactions and property market data. Our clustering strategy will optimize for:\n", + "\n", + "- **Property-specific queries**: Fast lookups by property ID\n", + "- **Time-based analysis**: Efficient filtering by transaction and listing dates\n", + "- **Market performance patterns**: Quick aggregation by location and property type\n", + "\n", + "### AIDP Environment Setup\n", + "\n", + "This notebook leverages the existing Spark session in your AIDP environment." 
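, + "\n", + "Before running the steps below, you can optionally confirm that the pre-configured session is available. This is a minimal sketch that assumes the `spark` handle supplied by AIDP Workbench:\n", + "\n", + "```python\n", + "# Sanity-check the AIDP-provided Spark session and report its version\n", + "print(f\"Spark version: {spark.version}\")\n", + "print(f\"Application name: {spark.sparkContext.appName}\")\n", + "```"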
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Real estate catalog and analytics schema created successfully!\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create real estate catalog and analytics schema\n", + "\n", + "# In AIDP, catalogs provide data isolation and governance\n", + "\n", + "spark.sql(\"CREATE CATALOG IF NOT EXISTS real_estate\")\n", + "\n", + "spark.sql(\"CREATE SCHEMA IF NOT EXISTS real_estate.analytics\")\n", + "\n", + "print(\"Real estate catalog and analytics schema created successfully!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 2: Create Delta Table with Liquid Clustering\n", + "\n", + "### Table Design\n", + "\n", + "Our `property_transactions` table will store:\n", + "\n", + "- **property_id**: Unique property identifier\n", + "- **transaction_date**: Date of property transaction\n", + "- **property_type**: Type (Single Family, Condo, Apartment, etc.)\n", + "- **sale_price**: Transaction sale price\n", + "- **location**: Geographic location/neighborhood\n", + "- **days_on_market**: Time property was listed before sale\n", + "- **price_per_sqft**: Price per square foot\n", + "\n", + "### Clustering Strategy\n", + "\n", + "We'll cluster by `property_id` and `transaction_date` because:\n", + "\n", + "- **property_id**: Properties may have multiple transactions over time, grouping their sales history together\n", + "- **transaction_date**: Time-based queries are critical for market analysis, seasonal trends, and investment performance\n", + "- This combination optimizes for both property tracking and temporal market analysis" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Delta table with liquid clustering created successfully!\n", + "Clustering will automatically optimize data layout for queries on property_id and transaction_date.\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create Delta table with liquid clustering\n", + "\n", + "# CLUSTER BY defines the columns for automatic optimization\n", + "\n", + "spark.sql(\"\"\"\n", + "\n", + "CREATE TABLE IF NOT EXISTS real_estate.analytics.property_transactions (\n", + "\n", + " property_id STRING,\n", + "\n", + " transaction_date DATE,\n", + "\n", + " property_type STRING,\n", + "\n", + " sale_price DECIMAL(12,2),\n", + "\n", + " location STRING,\n", + "\n", + " days_on_market INT,\n", + "\n", + " price_per_sqft DECIMAL(8,2)\n", + "\n", + ")\n", + "\n", + "USING DELTA\n", + "\n", + "CLUSTER BY (property_id, transaction_date)\n", + "\n", + "\"\"\")\n", + "\n", + "print(\"Delta table with liquid clustering created successfully!\")\n", + "\n", + "print(\"Clustering will automatically optimize data layout for queries on property_id and transaction_date.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 3: Generate Real Estate Sample Data\n", + "\n", + "### Data Generation Strategy\n", + "\n", + "We'll create realistic real estate transaction data including:\n", + "\n", + "- **8,000 properties** with multiple transactions over time\n", + "- **Property types**: Single Family, Condo, Townhouse, Apartment, Commercial\n", + "- **Realistic market patterns**: Seasonal pricing, location premiums, market fluctuations\n", + "- **Geographic diversity**: Different neighborhoods with varying price 
points\n", + "\n", + "### Why This Data Pattern?\n", + "\n", + "This data simulates real real estate scenarios where:\n", + "\n", + "- Properties appreciate or depreciate over time\n", + "- Market conditions vary by season and location\n", + "- Investment performance requires historical tracking\n", + "- Neighborhood analysis drives pricing strategies\n", + "- Market trends influence buying/selling decisions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Generated 11372 property transaction records\n", + "Sample record: {'property_id': 'PROP000001', 'transaction_date': datetime.date(2024, 3, 26), 'property_type': 'Single Family', 'sale_price': 1071982.06, 'location': 'Downtown', 'days_on_market': 48, 'price_per_sqft': 404.98}\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Generate sample real estate transaction data\n", + "\n", + "# Using fully qualified imports to avoid conflicts\n", + "\n", + "import random\n", + "\n", + "from datetime import datetime, timedelta\n", + "\n", + "\n", + "# Define real estate data constants\n", + "\n", + "PROPERTY_TYPES = ['Single Family', 'Condo', 'Townhouse', 'Apartment', 'Commercial']\n", + "\n", + "LOCATIONS = ['Downtown', 'Suburban', 'Waterfront', 'Mountain View', 'Urban Core', 'Residential District']\n", + "\n", + "# Base pricing parameters by property type and location\n", + "\n", + "PRICE_PARAMS = {\n", + "\n", + " 'Single Family': {\n", + "\n", + " 'Downtown': {'base_price': 850000, 'sqft_range': (1800, 3500)},\n", + "\n", + " 'Suburban': {'base_price': 650000, 'sqft_range': (2000, 4000)},\n", + "\n", + " 'Waterfront': {'base_price': 1200000, 'sqft_range': (2200, 4500)},\n", + "\n", + " 'Mountain View': {'base_price': 750000, 'sqft_range': (1900, 3800)},\n", + "\n", + " 'Urban Core': {'base_price': 950000, 'sqft_range': (1600, 3200)},\n", + "\n", + " 'Residential District': {'base_price': 700000, 'sqft_range': (2100, 4200)}\n", + "\n", + " },\n", + "\n", + " 'Condo': {\n", + "\n", + " 'Downtown': {'base_price': 550000, 'sqft_range': (800, 1800)},\n", + "\n", + " 'Suburban': {'base_price': 350000, 'sqft_range': (900, 2000)},\n", + "\n", + " 'Waterfront': {'base_price': 750000, 'sqft_range': (1000, 2200)},\n", + "\n", + " 'Mountain View': {'base_price': 450000, 'sqft_range': (850, 1900)},\n", + "\n", + " 'Urban Core': {'base_price': 650000, 'sqft_range': (750, 1700)},\n", + "\n", + " 'Residential District': {'base_price': 400000, 'sqft_range': (950, 2100)}\n", + "\n", + " },\n", + "\n", + " 'Townhouse': {\n", + "\n", + " 'Downtown': {'base_price': 700000, 'sqft_range': (1400, 2800)},\n", + "\n", + " 'Suburban': {'base_price': 550000, 'sqft_range': (1600, 3200)},\n", + "\n", + " 'Waterfront': {'base_price': 900000, 'sqft_range': (1500, 3000)},\n", + "\n", + " 'Mountain View': {'base_price': 600000, 'sqft_range': (1450, 2900)},\n", + "\n", + " 'Urban Core': {'base_price': 800000, 'sqft_range': (1300, 2600)},\n", + "\n", + " 'Residential District': {'base_price': 580000, 'sqft_range': (1650, 3300)}\n", + "\n", + " },\n", + "\n", + " 'Apartment': {\n", + "\n", + " 'Downtown': {'base_price': 450000, 'sqft_range': (600, 1400)},\n", + "\n", + " 'Suburban': {'base_price': 280000, 'sqft_range': (650, 1500)},\n", + "\n", + " 'Waterfront': {'base_price': 600000, 'sqft_range': (700, 1600)},\n", + "\n", + " 'Mountain View': {'base_price': 350000, 'sqft_range': (625, 1450)},\n", + "\n", + " 'Urban Core': {'base_price': 520000, 
'sqft_range': (550, 1300)},\n", + "\n", + " 'Residential District': {'base_price': 320000, 'sqft_range': (675, 1550)}\n", + "\n", + " },\n", + "\n", + " 'Commercial': {\n", + "\n", + " 'Downtown': {'base_price': 2500000, 'sqft_range': (3000, 10000)},\n", + "\n", + " 'Suburban': {'base_price': 1500000, 'sqft_range': (2500, 8000)},\n", + "\n", + " 'Waterfront': {'base_price': 3500000, 'sqft_range': (4000, 12000)},\n", + "\n", + " 'Mountain View': {'base_price': 1800000, 'sqft_range': (2800, 9000)},\n", + "\n", + " 'Urban Core': {'base_price': 3000000, 'sqft_range': (3500, 11000)},\n", + "\n", + " 'Residential District': {'base_price': 1600000, 'sqft_range': (2600, 8500)}\n", + "\n", + " }\n", + "\n", + "}\n", + "\n", + "\n", + "# Generate property transaction records\n", + "\n", + "transaction_data = []\n", + "\n", + "base_date = datetime(2024, 1, 1)\n", + "\n", + "\n", + "# Create 8,000 properties with 1-4 transactions each\n", + "\n", + "for property_num in range(1, 8001):\n", + "\n", + " property_id = f\"PROP{property_num:06d}\"\n", + " \n", + " # Each property gets 1-4 transactions over 12 months (most have 1, some flip/resale)\n", + "\n", + " num_transactions = random.choices([1, 2, 3, 4], weights=[0.7, 0.2, 0.08, 0.02])[0]\n", + " \n", + " # Select property type and location (consistent for the same property)\n", + "\n", + " property_type = random.choice(PROPERTY_TYPES)\n", + "\n", + " location = random.choice(LOCATIONS)\n", + " \n", + " params = PRICE_PARAMS[property_type][location]\n", + " \n", + " # Base square footage for this property\n", + "\n", + " sqft = random.randint(params['sqft_range'][0], params['sqft_range'][1])\n", + " \n", + " for i in range(num_transactions):\n", + "\n", + " # Spread transactions over 12 months\n", + "\n", + " days_offset = random.randint(0, 365)\n", + "\n", + " transaction_date = base_date + timedelta(days=days_offset)\n", + " \n", + " # Calculate sale price with market variations\n", + "\n", + " # Seasonal pricing (higher in spring/summer)\n", + "\n", + " month = transaction_date.month\n", + "\n", + " if month in [3, 4, 5, 6]: # Spring/Summer peak\n", + "\n", + " seasonal_factor = 1.15\n", + "\n", + " elif month in [11, 12, 1, 2]: # Winter off-season\n", + "\n", + " seasonal_factor = 0.9\n", + "\n", + " else:\n", + "\n", + " seasonal_factor = 1.0\n", + " \n", + " # Market appreciation over time (slight increase)\n", + "\n", + " months_elapsed = (transaction_date.year - base_date.year) * 12 + (transaction_date.month - base_date.month)\n", + "\n", + " appreciation_factor = 1.0 + (months_elapsed * 0.002) # 0.2% monthly appreciation\n", + "\n", + " # Calculate price per square foot\n", + "\n", + " base_price_per_sqft = params['base_price'] / ((params['sqft_range'][0] + params['sqft_range'][1]) / 2)\n", + "\n", + " price_per_sqft = round(base_price_per_sqft * seasonal_factor * appreciation_factor * random.uniform(0.9, 1.1), 2)\n", + " \n", + " # Calculate total sale price\n", + "\n", + " sale_price = round(price_per_sqft * sqft, 2)\n", + " \n", + " # Days on market (varies by property type and market conditions)\n", + "\n", + " if property_type == 'Commercial':\n", + "\n", + " days_on_market = random.randint(30, 180)\n", + "\n", + " else:\n", + "\n", + " days_on_market = random.randint(7, 90)\n", + " \n", + " transaction_data.append({\n", + "\n", + " \"property_id\": property_id,\n", + "\n", + " \"transaction_date\": transaction_date.date(),\n", + "\n", + " \"property_type\": property_type,\n", + "\n", + " \"sale_price\": sale_price,\n", + "\n", + " 
\"location\": location,\n", + "\n", + " \"days_on_market\": days_on_market,\n", + "\n", + " \"price_per_sqft\": price_per_sqft\n", + "\n", + " })\n", + "\n", + "\n", + "\n", + "print(f\"Generated {len(transaction_data)} property transaction records\")\n", + "\n", + "print(\"Sample record:\", transaction_data[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 4: Insert Data Using PySpark\n", + "\n", + "### Data Insertion Strategy\n", + "\n", + "We'll use PySpark to:\n", + "\n", + "1. **Create DataFrame** from our generated data\n", + "2. **Insert into Delta table** with liquid clustering\n", + "3. **Verify the insertion** with a sample query\n", + "\n", + "### Why PySpark for Insertion?\n", + "\n", + "- **Distributed processing**: Handles large datasets efficiently\n", + "- **Type safety**: Ensures data integrity\n", + "- **Optimization**: Leverages Spark's query optimization\n", + "- **Liquid clustering**: Automatically applies clustering during insertion" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "DataFrame Schema:\n", + "root\n", + " |-- days_on_market: long (nullable = true)\n", + " |-- location: string (nullable = true)\n", + " |-- price_per_sqft: double (nullable = true)\n", + " |-- property_id: string (nullable = true)\n", + " |-- property_type: string (nullable = true)\n", + " |-- sale_price: double (nullable = true)\n", + " |-- transaction_date: date (nullable = true)\n", + "\n", + "\n", + "Sample Data:\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+--------------+--------------------+--------------+-----------+-------------+----------+----------------+\n", + "|days_on_market| location|price_per_sqft|property_id|property_type|sale_price|transaction_date|\n", + "+--------------+--------------------+--------------+-----------+-------------+----------+----------------+\n", + "| 48| Downtown| 404.98| PROP000001|Single Family|1071982.06| 2024-03-26|\n", + "| 22|Residential District| 254.48| PROP000002| Townhouse| 621440.16| 2024-05-31|\n", + "| 62| Urban Core| 370.97| PROP000003| Townhouse| 595406.85| 2024-11-14|\n", + "| 148|Residential District| 274.64| PROP000004| Commercial|1020562.24| 2024-10-31|\n", + "| 56| Downtown| 415.72| PROP000005| Condo| 362092.12| 2024-01-17|\n", + "+--------------+--------------------+--------------+-----------+-------------+----------+----------------+\n", + "only showing top 5 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "\n", + "Successfully inserted 11372 records into real_estate.analytics.property_transactions\n", + "Liquid clustering automatically optimized the data layout during insertion!\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Insert data using PySpark DataFrame operations\n", + "\n", + "# Using fully qualified function references to avoid conflicts\n", + "\n", + "\n", + "# Create DataFrame from generated data\n", + "\n", + "df_transactions = spark.createDataFrame(transaction_data)\n", + "\n", + "\n", + "# Display schema and sample data\n", + "\n", + "print(\"DataFrame Schema:\")\n", + "\n", + "df_transactions.printSchema()\n", + "\n", + "\n", + "\n", + "print(\"\\nSample Data:\")\n", + "\n", + "df_transactions.show(5)\n", + "\n", + "\n", + "# Insert data into Delta table with liquid clustering\n", + "\n", + "# The CLUSTER BY (property_id, 
transaction_date) will automatically optimize the data layout\n", + "\n", + "df_transactions.write.mode(\"overwrite\").saveAsTable(\"real_estate.analytics.property_transactions\")\n", + "\n", + "\n", + "print(f\"\\nSuccessfully inserted {df_transactions.count()} records into real_estate.analytics.property_transactions\")\n", + "\n", + "print(\"Liquid clustering automatically optimized the data layout during insertion!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 5: Demonstrate Liquid Clustering Benefits\n", + "\n", + "### Query Performance Analysis\n", + "\n", + "Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:\n", + "\n", + "1. **Property transaction history** (clustered by property_id)\n", + "2. **Time-based market analysis** (clustered by transaction_date)\n", + "3. **Combined property + time queries** (optimal for our clustering)\n", + "\n", + "### Expected Performance Benefits\n", + "\n", + "With liquid clustering, these queries should be significantly faster because:\n", + "\n", + "- **Data locality**: Related records are physically grouped together\n", + "- **Reduced I/O**: Less data needs to be read from disk\n", + "- **Automatic optimization**: No manual tuning required" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "=== Query 1: Property Transaction History ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-----------+----------------+-------------+----------+--------+\n", + "|property_id|transaction_date|property_type|sale_price|location|\n", + "+-----------+----------------+-------------+----------+--------+\n", + "| PROP000001| 2024-03-26|Single Family|1071982.06|Downtown|\n", + "+-----------+----------------+-------------+----------+--------+\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Records found: 1\n", + "\n", + "=== Query 2: Recent High-Value Transactions ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+----------------+-----------+-------------+----------+----------+\n", + "|transaction_date|property_id|property_type|sale_price| location|\n", + "+----------------+-----------+-------------+----------+----------+\n", + "| 2024-06-10| PROP006087| Commercial| 6386463.6|Waterfront|\n", + "| 2024-06-23| PROP006087| Commercial|6074959.55|Waterfront|\n", + "| 2024-06-06| PROP003416| Commercial|5999320.32|Waterfront|\n", + "| 2024-06-30| PROP003052| Commercial| 5659487.1|Waterfront|\n", + "| 2024-06-28| PROP004596| Commercial|5609426.68|Waterfront|\n", + "| 2024-06-11| PROP007661| Commercial|5575704.44|Waterfront|\n", + "| 2024-07-24| PROP003416| Commercial|5540088.96|Waterfront|\n", + "| 2024-06-20| PROP002013| Commercial|5535950.94|Waterfront|\n", + "| 2024-07-05| PROP002013| Commercial|5288486.98|Waterfront|\n", + "| 2024-10-05| PROP004988| Commercial|5258298.18|Waterfront|\n", + "| 2024-06-06| PROP005373| Commercial|5242295.52|Waterfront|\n", + "| 2024-10-02| PROP000600| Commercial|5229563.04|Waterfront|\n", + "| 2024-10-10| PROP002000| Commercial|5221318.55|Waterfront|\n", + "| 2024-06-16| PROP007748| Commercial|5219796.51|Waterfront|\n", + "| 2024-06-29| PROP000353| Commercial| 5171034.0|Urban Core|\n", + "| 2024-06-18| PROP003405| Commercial|5166032.95|Urban Core|\n", + "| 2024-06-09| 
PROP004845| Commercial| 5147234.4|Waterfront|\n", + "| 2024-12-09| PROP001483| Commercial|5098624.65|Waterfront|\n", + "| 2024-12-09| PROP004901| Commercial|5075851.14|Waterfront|\n", + "| 2024-10-12| PROP003462| Commercial|5058786.39|Waterfront|\n", + "+----------------+-----------+-------------+----------+----------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "High-value transactions found: 1684\n", + "\n", + "=== Query 3: Property Value Trends ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-----------+----------------+-------------+----------+--------------+\n", + "|property_id|transaction_date|property_type|sale_price|price_per_sqft|\n", + "+-----------+----------------+-------------+----------+--------------+\n", + "| PROP000002| 2024-05-31| Townhouse| 621440.16| 254.48|\n", + "| PROP000003| 2024-11-14| Townhouse| 595406.85| 370.97|\n", + "| PROP000004| 2024-10-31| Commercial|1020562.24| 274.64|\n", + "| PROP000007| 2024-12-18| Apartment| 465049.2| 323.4|\n", + "| PROP000008| 2024-09-28| Townhouse| 441549.36| 211.47|\n", + "| PROP000009| 2024-04-09| Commercial| 4561434.5| 454.1|\n", + "| PROP000009| 2024-10-03| Commercial|4357219.65| 433.77|\n", + "| PROP000009| 2024-10-09| Commercial|4204535.65| 418.57|\n", + "| PROP000010| 2024-05-01|Single Family|1248957.92| 441.64|\n", + "| PROP000010| 2024-05-30|Single Family|1194999.68| 422.56|\n", + "| PROP000010| 2024-08-09|Single Family|1219122.52| 431.09|\n", + "| PROP000011| 2024-09-22| Condo| 436550.4| 343.2|\n", + "| PROP000012| 2024-09-20| Condo| 530021.08| 253.72|\n", + "| PROP000013| 2024-07-25| Apartment| 379305.99| 520.31|\n", + "| PROP000014| 2024-10-10| Apartment| 440308.48| 288.16|\n", + "| PROP000015| 2024-11-19|Single Family| 850184.1| 286.74|\n", + "| PROP000016| 2024-11-16|Single Family| 828172.2| 225.66|\n", + "| PROP000017| 2024-08-31| Commercial|1756840.32| 428.08|\n", + "| PROP000018| 2024-08-28| Commercial|4382253.48| 455.82|\n", + "| PROP000019| 2024-11-10| Townhouse| 901382.94| 397.26|\n", + "+-----------+----------------+-------------+----------+--------------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Value trend records found: 1046\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Demonstrate liquid clustering benefits with optimized queries\n", + "\n", + "\n", + "# Query 1: Property transaction history - benefits from property_id clustering\n", + "\n", + "print(\"=== Query 1: Property Transaction History ===\")\n", + "\n", + "property_history = spark.sql(\"\"\"\n", + "\n", + "SELECT property_id, transaction_date, property_type, sale_price, location\n", + "\n", + "FROM real_estate.analytics.property_transactions\n", + "\n", + "WHERE property_id = 'PROP000001'\n", + "\n", + "ORDER BY transaction_date DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "property_history.show()\n", + "\n", + "print(f\"Records found: {property_history.count()}\")\n", + "\n", + "\n", + "\n", + "# Query 2: Time-based high-value transaction analysis - benefits from transaction_date clustering\n", + "\n", + "print(\"\\n=== Query 2: Recent High-Value Transactions ===\")\n", + "\n", + "high_value = spark.sql(\"\"\"\n", + "\n", + "SELECT transaction_date, property_id, property_type, sale_price, location\n", + "\n", + "FROM 
real_estate.analytics.property_transactions\n", + "\n", + "WHERE transaction_date >= '2024-06-01' AND sale_price > 1000000\n", + "\n", + "ORDER BY sale_price DESC, transaction_date DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "high_value.show()\n", + "\n", + "print(f\"High-value transactions found: {high_value.count()}\")\n", + "\n", + "\n", + "\n", + "# Query 3: Combined property + time query - optimal for our clustering strategy\n", + "\n", + "print(\"\\n=== Query 3: Property Value Trends ===\")\n", + "\n", + "value_trends = spark.sql(\"\"\"\n", + "\n", + "SELECT property_id, transaction_date, property_type, sale_price, price_per_sqft\n", + "\n", + "FROM real_estate.analytics.property_transactions\n", + "\n", + "WHERE property_id LIKE 'PROP000%' AND transaction_date >= '2024-04-01'\n", + "\n", + "ORDER BY property_id, transaction_date\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "value_trends.show()\n", + "\n", + "print(f\"Value trend records found: {value_trends.count()}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 6: Analyze Clustering Effectiveness\n", + "\n", + "### Understanding the Impact\n", + "\n", + "Let's examine how liquid clustering has organized our data and analyze some aggregate statistics to demonstrate the real estate insights possible with this optimized structure.\n", + "\n", + "### Key Analytics\n", + "\n", + "- **Property value appreciation** and market performance\n", + "- **Location-based pricing** and neighborhood analysis\n", + "- **Property type trends** and market segmentation\n", + "- **Market timing** and seasonal patterns" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "=== Property Value Analysis ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-----------+------------------+--------------+--------------+--------------+------------------+-------------+----------+\n", + "|property_id|total_transactions|min_sale_price|max_sale_price|avg_sale_price|avg_price_per_sqft|property_type| location|\n", + "+-----------+------------------+--------------+--------------+--------------+------------------+-------------+----------+\n", + "| PROP001960| 1| 6349995.27| 6349995.27| 6349995.27| 543.99| Commercial|Waterfront|\n", + "| PROP006087| 2| 6074959.55| 6386463.6| 6230711.57| 543.45| Commercial|Waterfront|\n", + "| PROP001727| 1| 5789451.2| 5789451.2| 5789451.2| 517.84| Commercial|Waterfront|\n", + "| PROP007555| 1| 5784090.12| 5784090.12| 5784090.12| 482.49| Commercial|Waterfront|\n", + "| PROP004332| 1| 5750284.36| 5750284.36| 5750284.36| 526.39| Commercial|Urban Core|\n", + "| PROP006731| 1| 5637737.05| 5637737.05| 5637737.05| 507.95| Commercial|Waterfront|\n", + "| PROP007714| 1| 5625904.48| 5625904.48| 5625904.48| 547.48| Commercial|Waterfront|\n", + "| PROP003955| 1| 5620209.9| 5620209.9| 5620209.9| 471.89| Commercial|Waterfront|\n", + "| PROP000900| 3| 4758664.84| 6037540.36| 5593865.89| 491.29| Commercial|Waterfront|\n", + "| PROP007661| 1| 5575704.44| 5575704.44| 5575704.44| 539.08| Commercial|Waterfront|\n", + "+-----------+------------------+--------------+--------------+--------------+------------------+-------------+----------+\n", + "\n", + "\n", + "=== Location Market Analysis ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + 
"+--------------------+------------------+--------------+------------------+------------------+-----------------+\n", + "| location|total_transactions|avg_sale_price|avg_price_per_sqft|avg_days_on_market|unique_properties|\n", + "+--------------------+------------------+--------------+------------------+------------------+-----------------+\n", + "| Waterfront| 1881| 1409207.07| 443.56| 59.16| 1322|\n", + "| Urban Core| 1866| 1255564.55| 473.94| 59.7| 1310|\n", + "| Downtown| 1877| 1021967.95| 393.32| 60.1| 1337|\n", + "| Mountain View| 1890| 804231.04| 312.89| 60.04| 1322|\n", + "|Residential District| 1907| 723700.39| 267.59| 59.58| 1366|\n", + "| Suburban| 1951| 675060.74| 252.12| 59.67| 1343|\n", + "+--------------------+------------------+--------------+------------------+------------------+-----------------+\n", + "\n", + "\n", + "=== Property Type Market Trends ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-------------+-----------+--------------+------------------+------------------+-----------------+\n", + "|property_type|total_sales|avg_sale_price|avg_price_per_sqft|avg_days_on_market|unique_properties|\n", + "+-------------+-----------+--------------+------------------+------------------+-----------------+\n", + "| Commercial| 2246| 2364622.79| 361.68| 104.5| 1571|\n", + "|Single Family| 2285| 882472.63| 306.23| 49.79| 1612|\n", + "| Townhouse| 2294| 712690.76| 323.59| 48.78| 1599|\n", + "| Condo| 2261| 529188.26| 379.15| 47.3| 1579|\n", + "| Apartment| 2286| 424397.47| 410.72| 48.85| 1639|\n", + "+-------------+-----------+--------------+------------------+------------------+-----------------+\n", + "\n", + "\n", + "=== Market Timing Analysis ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+--------------------+-----------------+--------------+--------+---------------+\n", + "| sale_speed|transaction_count|avg_sale_price|avg_days| total_volume|\n", + "+--------------------+-----------------+--------------+--------+---------------+\n", + "|Fast Sale (1-30 d...| 2603| 648593.87| 18.48| 1.6882898554E9|\n", + "|Normal Sale (31-6...| 3715| 849302.21| 45.58| 3.1551577265E9|\n", + "|Slow Sale (61-90 ...| 3718| 841586.19| 75.59|3.12901747284E9|\n", + "|Very Slow Sale (9...| 1336| 2362655.36| 135.14|3.15650755849E9|\n", + "+--------------------+-----------------+--------------+--------+---------------+\n", + "\n", + "\n", + "=== Monthly Market Trends ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-------+------------------+---------------+--------------+------------------+-----------------+\n", + "| month|total_transactions| monthly_volume|avg_sale_price|avg_price_per_sqft|unique_properties|\n", + "+-------+------------------+---------------+--------------+------------------+-----------------+\n", + "|2024-01| 951| 8.0156294112E8| 842863.24| 314.57| 919|\n", + "|2024-02| 920| 7.5918726717E8| 825203.55| 314.39| 902|\n", + "|2024-03| 938|1.01247160402E9| 1079394.03| 401.65| 900|\n", + "|2024-04| 988|1.14251391792E9| 1156390.61| 396.33| 954|\n", + "|2024-05| 1002|1.07273555366E9| 1070594.36| 400.43| 979|\n", + "|2024-06| 909|1.01445630032E9| 1116013.53| 403.76| 876|\n", + "|2024-07| 1006| 9.4594880925E8| 940306.97| 352.47| 976|\n", + "|2024-08| 892| 8.5601528164E8| 959658.39| 349.58| 861|\n", + "|2024-09| 916| 8.6428797707E8| 943545.83| 351.74| 888|\n", + "|2024-10| 981| 9.9316499128E8| 1012400.6| 351.47| 
951|\n", + "|2024-11| 919| 8.3654569217E8| 910278.23| 310.92| 892|\n", + "|2024-12| 950| 8.3008227761E8| 873770.82| 322.55| 914|\n", + "+-------+------------------+---------------+--------------+------------------+-----------------+\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Analyze clustering effectiveness and real estate insights\n", + "\n", + "\n", + "# Property value analysis\n", + "\n", + "print(\"=== Property Value Analysis ===\")\n", + "\n", + "property_values = spark.sql(\"\"\"\n", + "\n", + "SELECT property_id, COUNT(*) as total_transactions,\n", + "\n", + " ROUND(MIN(sale_price), 2) as min_sale_price,\n", + "\n", + " ROUND(MAX(sale_price), 2) as max_sale_price,\n", + "\n", + " ROUND(AVG(sale_price), 2) as avg_sale_price,\n", + "\n", + " ROUND(AVG(price_per_sqft), 2) as avg_price_per_sqft,\n", + "\n", + " property_type, location\n", + "\n", + "FROM real_estate.analytics.property_transactions\n", + "\n", + "GROUP BY property_id, property_type, location\n", + "\n", + "ORDER BY avg_sale_price DESC\n", + "\n", + "LIMIT 10\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "property_values.show()\n", + "\n", + "\n", + "# Location market analysis\n", + "\n", + "print(\"\\n=== Location Market Analysis ===\")\n", + "\n", + "location_analysis = spark.sql(\"\"\"\n", + "\n", + "SELECT location, COUNT(*) as total_transactions,\n", + "\n", + " ROUND(AVG(sale_price), 2) as avg_sale_price,\n", + "\n", + " ROUND(AVG(price_per_sqft), 2) as avg_price_per_sqft,\n", + "\n", + " ROUND(AVG(days_on_market), 2) as avg_days_on_market,\n", + "\n", + " COUNT(DISTINCT property_id) as unique_properties\n", + "\n", + "FROM real_estate.analytics.property_transactions\n", + "\n", + "GROUP BY location\n", + "\n", + "ORDER BY avg_sale_price DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "location_analysis.show()\n", + "\n", + "\n", + "# Property type market trends\n", + "\n", + "print(\"\\n=== Property Type Market Trends ===\")\n", + "\n", + "property_trends = spark.sql(\"\"\"\n", + "\n", + "SELECT property_type, COUNT(*) as total_sales,\n", + "\n", + " ROUND(AVG(sale_price), 2) as avg_sale_price,\n", + "\n", + " ROUND(AVG(price_per_sqft), 2) as avg_price_per_sqft,\n", + "\n", + " ROUND(AVG(days_on_market), 2) as avg_days_on_market,\n", + "\n", + " COUNT(DISTINCT property_id) as unique_properties\n", + "\n", + "FROM real_estate.analytics.property_transactions\n", + "\n", + "GROUP BY property_type\n", + "\n", + "ORDER BY avg_sale_price DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "property_trends.show()\n", + "\n", + "\n", + "# Market timing analysis\n", + "\n", + "print(\"\\n=== Market Timing Analysis ===\")\n", + "\n", + "market_timing = spark.sql(\"\"\"\n", + "\n", + "SELECT \n", + "\n", + " CASE \n", + "\n", + " WHEN days_on_market <= 30 THEN 'Fast Sale (1-30 days)'\n", + "\n", + " WHEN days_on_market <= 60 THEN 'Normal Sale (31-60 days)'\n", + "\n", + " WHEN days_on_market <= 90 THEN 'Slow Sale (61-90 days)'\n", + "\n", + " ELSE 'Very Slow Sale (90+ days)'\n", + "\n", + " END as sale_speed,\n", + "\n", + " COUNT(*) as transaction_count,\n", + "\n", + " ROUND(AVG(sale_price), 2) as avg_sale_price,\n", + "\n", + " ROUND(AVG(days_on_market), 2) as avg_days,\n", + "\n", + " ROUND(SUM(sale_price), 2) as total_volume\n", + "\n", + "FROM real_estate.analytics.property_transactions\n", + "\n", + "GROUP BY \n", + "\n", + " CASE \n", + "\n", + " WHEN days_on_market <= 30 THEN 'Fast Sale (1-30 days)'\n", + "\n", + " WHEN days_on_market <= 60 
THEN 'Normal Sale (31-60 days)'\n", + "\n", + " WHEN days_on_market <= 90 THEN 'Slow Sale (61-90 days)'\n", + "\n", + " ELSE 'Very Slow Sale (90+ days)'\n", + "\n", + " END\n", + "\n", + "ORDER BY avg_days\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "market_timing.show()\n", + "\n", + "\n", + "# Monthly market trends\n", + "\n", + "print(\"\\n=== Monthly Market Trends ===\")\n", + "\n", + "monthly_trends = spark.sql(\"\"\"\n", + "\n", + "SELECT DATE_FORMAT(transaction_date, 'yyyy-MM') as month,\n", + "\n", + " COUNT(*) as total_transactions,\n", + "\n", + " ROUND(SUM(sale_price), 2) as monthly_volume,\n", + "\n", + " ROUND(AVG(sale_price), 2) as avg_sale_price,\n", + "\n", + " ROUND(AVG(price_per_sqft), 2) as avg_price_per_sqft,\n", + "\n", + " COUNT(DISTINCT property_id) as unique_properties\n", + "\n", + "FROM real_estate.analytics.property_transactions\n", + "\n", + "GROUP BY DATE_FORMAT(transaction_date, 'yyyy-MM')\n", + "\n", + "ORDER BY month\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "monthly_trends.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Key Takeaways: Delta Liquid Clustering in AIDP\n", + "\n", + "### What We Demonstrated\n", + "\n", + "1. **Automatic Optimization**: Created a table with `CLUSTER BY (property_id, transaction_date)` and let Delta automatically optimize data layout\n", + "\n", + "2. **Performance Benefits**: Queries on clustered columns (property_id, transaction_date) are significantly faster due to data locality\n", + "\n", + "3. **Zero Maintenance**: No manual partitioning, bucketing, or Z-Ordering required - Delta handles it automatically\n", + "\n", + "4. **Real-World Use Case**: Real estate analytics where property tracking and market analysis are critical\n", + "\n", + "### AIDP Advantages\n", + "\n", + "- **Unified Analytics**: Seamlessly integrates with other AIDP services\n", + "- **Governance**: Catalog and schema isolation for real estate data\n", + "- **Performance**: Optimized for both OLAP and OLTP workloads\n", + "- **Scalability**: Handles real estate-scale data volumes effortlessly\n", + "\n", + "### Best Practices for Liquid Clustering\n", + "\n", + "1. **Choose clustering columns** based on your most common query patterns\n", + "2. **Start with 1-4 columns** - too many can reduce effectiveness\n", + "3. **Consider cardinality** - high-cardinality columns work best\n", + "4. **Monitor and adjust** as query patterns evolve\n", + "\n", + "### Next Steps\n", + "\n", + "- Explore other AIDP features like AI/ML integration\n", + "- Try liquid clustering with different column combinations\n", + "- Scale up to larger real estate datasets\n", + "- Integrate with real MLS and property management systems\n", + "\n", + "This notebook demonstrates how Oracle AI Data Platform makes advanced real estate analytics accessible while maintaining enterprise-grade performance and governance." 
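, + "\n", + "### Evolving the Clustering Columns (Sketch)\n", + "\n", + "If monitoring shows that query patterns have shifted (for example, most filters move from `property_id` to `location`), the clustering columns can be redefined in place. The statements below are an illustrative sketch, not a tuned recommendation: they assume your Delta/AIDP release supports `ALTER TABLE ... CLUSTER BY`, the `location`-first layout is a hypothetical choice, and whether older files are rewritten by a plain `OPTIMIZE` depends on your Delta version.\n", + "\n", + "```python\n", + "# Redeclare clustering columns to match a hypothetical new dominant query pattern\n", + "spark.sql(\"ALTER TABLE real_estate.analytics.property_transactions CLUSTER BY (location, transaction_date)\")\n", + "\n", + "# Run a clustering pass so subsequent maintenance uses the new layout\n", + "spark.sql(\"OPTIMIZE real_estate.analytics.property_transactions\")\n", + "```"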
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/Notebooks/liquid_clustering/retail_delta_liquid_clustering_demo.ipynb b/Notebooks/liquid_clustering/retail_delta_liquid_clustering_demo.ipynb new file mode 100644 index 0000000..1f076d5 --- /dev/null +++ b/Notebooks/liquid_clustering/retail_delta_liquid_clustering_demo.ipynb @@ -0,0 +1,1011 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Retail Analytics: Delta Liquid Clustering Demo\n", + "\n", + "\n", + "## Overview\n", + "\n", + "\n", + "This notebook demonstrates the power of **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using a retail analytics use case. Liquid clustering automatically optimizes data layout for query performance without requiring manual partitioning or Z-Ordering.\n", + "\n", + "### What is Liquid Clustering?\n", + "\n", + "Liquid clustering automatically identifies and groups similar data together based on clustering columns you define. This optimization happens automatically during data ingestion and maintenance operations, providing:\n", + "\n", + "- **Automatic optimization**: No manual tuning required\n", + "- **Improved query performance**: Faster queries on clustered columns\n", + "- **Reduced maintenance**: No need for manual repartitioning\n", + "- **Adaptive clustering**: Adjusts as data patterns change\n", + "\n", + "### Use Case: Customer Purchase Analytics\n", + "\n", + "We'll analyze customer purchase records from a retail company. Our clustering strategy will optimize for:\n", + "\n", + "- **Customer-specific queries**: Fast lookups by customer ID\n", + "- **Time-based analysis**: Efficient filtering by purchase date\n", + "- **Purchase patterns**: Quick aggregation by product category and customer segments\n", + "\n", + "### AIDP Environment Setup\n", + "\n", + "This notebook leverages the existing Spark session in your AIDP environment." 
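, + "\n", + "\n", + "As a quick, optional sanity check before creating any objects, you can confirm that the pre-configured `spark` session is available and responsive. This is a minimal sketch; nothing here is specific to liquid clustering.\n", + "\n", + "```python\n", + "# Confirm the pre-configured Spark session exists and report its version\n", + "print(type(spark))\n", + "print(\"Spark version:\", spark.version)\n", + "\n", + "# Run a trivial query to verify SQL execution works in this environment\n", + "spark.sql(\"SELECT 1 AS ok\").show()\n", + "```"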
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Retail catalog and analytics schema created successfully!\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create retail catalog and analytics schema\n", + "\n", + "# In AIDP, catalogs provide data isolation and governance\n", + "\n", + "spark.sql(\"CREATE CATALOG IF NOT EXISTS retail\")\n", + "\n", + "spark.sql(\"CREATE SCHEMA IF NOT EXISTS retail.analytics\")\n", + "\n", + "print(\"Retail catalog and analytics schema created successfully!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 2: Create Delta Table with Liquid Clustering\n", + "\n", + "### Table Design\n", + "\n", + "Our `customer_purchases` table will store:\n", + "\n", + "- **customer_id**: Unique customer identifier\n", + "- **purchase_date**: Date of purchase\n", + "- **product_id**: Product identifier\n", + "- **product_category**: Category (Electronics, Clothing, Home, etc.)\n", + "- **purchase_amount**: Transaction amount\n", + "- **store_id**: Store location identifier\n", + "- **payment_method**: Payment type (Credit, Debit, Cash, etc.)\n", + "\n", + "### Clustering Strategy\n", + "\n", + "We'll cluster by `customer_id` and `purchase_date` because:\n", + "\n", + "- **customer_id**: Customers often make multiple purchases, grouping their transaction history together\n", + "- **purchase_date**: Time-based queries are common for sales analysis, seasonality, and trends\n", + "- This combination optimizes for both customer behavior analysis and temporal sales reporting" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Delta table with liquid clustering created successfully!\n", + "Clustering will automatically optimize data layout for queries on customer_id and purchase_date.\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create Delta table with liquid clustering\n", + "\n", + "# CLUSTER BY defines the columns for automatic optimization\n", + "\n", + "spark.sql(\"\"\"\n", + "\n", + "CREATE TABLE IF NOT EXISTS retail.analytics.customer_purchases (\n", + "\n", + " customer_id STRING,\n", + "\n", + " purchase_date DATE,\n", + "\n", + " product_id STRING,\n", + "\n", + " product_category STRING,\n", + "\n", + " purchase_amount DECIMAL(10,2),\n", + "\n", + " store_id STRING,\n", + "\n", + " payment_method STRING\n", + "\n", + ")\n", + "\n", + "USING DELTA\n", + "\n", + "CLUSTER BY (customer_id, purchase_date)\n", + "\n", + "\"\"\")\n", + "\n", + "print(\"Delta table with liquid clustering created successfully!\")\n", + "\n", + "print(\"Clustering will automatically optimize data layout for queries on customer_id and purchase_date.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 3: Generate Retail Sample Data\n", + "\n", + "### Data Generation Strategy\n", + "\n", + "We'll create realistic retail purchase data including:\n", + "\n", + "- **1,000 customers** with multiple purchases over time\n", + "- **Product categories**: Electronics, Clothing, Home & Garden, Books, Sports\n", + "- **Realistic temporal patterns**: Seasonal shopping, repeat purchases, varying amounts\n", + "- **Multiple stores**: Different retail locations across regions\n", + "\n", + "### Why This Data Pattern?\n", + "\n", + "This data simulates real retail scenarios where:\n", + "\n", + "- 
Customers make multiple purchases over time\n", + "- Seasonal trends affect buying patterns\n", + "- Product categories drive different analytics needs\n", + "- Store-level performance analysis is required\n", + "- Customer segmentation enables personalized marketing" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Generated 5544 customer purchase records\n", + "Sample record: {'customer_id': 'CUST000001', 'purchase_date': datetime.date(2024, 9, 19), 'product_id': 'BOK003', 'product_category': 'Books', 'purchase_amount': 22.1, 'store_id': 'STORE_CHI_003', 'payment_method': 'Debit Card'}\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Generate sample retail purchase data\n", + "\n", + "# Using fully qualified imports to avoid conflicts\n", + "\n", + "import random\n", + "\n", + "from datetime import datetime, timedelta\n", + "\n", + "\n", + "# Define retail data constants\n", + "\n", + "PRODUCTS = {\n", + "\n", + " \"Electronics\": [\n", + "\n", + " (\"ELE001\", \"Smartphone\", 599.99),\n", + "\n", + " (\"ELE002\", \"Laptop\", 1299.99),\n", + "\n", + " (\"ELE003\", \"Headphones\", 149.99),\n", + "\n", + " (\"ELE004\", \"Smart TV\", 799.99),\n", + "\n", + " (\"ELE005\", \"Tablet\", 399.99)\n", + "\n", + " ],\n", + "\n", + " \"Clothing\": [\n", + "\n", + " (\"CLO001\", \"T-Shirt\", 19.99),\n", + "\n", + " (\"CLO002\", \"Jeans\", 79.99),\n", + "\n", + " (\"CLO003\", \"Jacket\", 129.99),\n", + "\n", + " (\"CLO004\", \"Sneakers\", 89.99),\n", + "\n", + " (\"CLO005\", \"Dress\", 59.99)\n", + "\n", + " ],\n", + "\n", + " \"Home & Garden\": [\n", + "\n", + " (\"HOM001\", \"Blender\", 79.99),\n", + "\n", + " (\"HOM002\", \"Coffee Maker\", 49.99),\n", + "\n", + " (\"HOM003\", \"Garden Tools Set\", 39.99),\n", + "\n", + " (\"HOM004\", \"Bedding Set\", 89.99),\n", + "\n", + " (\"HOM005\", \"Decorative Pillow\", 24.99)\n", + "\n", + " ],\n", + "\n", + " \"Books\": [\n", + "\n", + " (\"BOK001\", \"Fiction Novel\", 14.99),\n", + "\n", + " (\"BOK002\", \"Cookbook\", 24.99),\n", + "\n", + " (\"BOK003\", \"Biography\", 19.99),\n", + "\n", + " (\"BOK004\", \"Self-Help Book\", 16.99),\n", + "\n", + " (\"BOK005\", \"Children's Book\", 9.99)\n", + "\n", + " ],\n", + "\n", + " \"Sports\": [\n", + "\n", + " (\"SPO001\", \"Yoga Mat\", 29.99),\n", + "\n", + " (\"SPO002\", \"Dumbbells\", 49.99),\n", + "\n", + " (\"SPO003\", \"Running Shoes\", 119.99),\n", + "\n", + " (\"SPO004\", \"Basketball\", 24.99),\n", + "\n", + " (\"SPO005\", \"Tennis Racket\", 89.99)\n", + "\n", + " ]\n", + "\n", + "}\n", + "\n", + "\n", + "\n", + "STORES = [\"STORE_NYC_001\", \"STORE_LAX_002\", \"STORE_CHI_003\", \"STORE_HOU_004\", \"STORE_MIA_005\"]\n", + "\n", + "PAYMENT_METHODS = [\"Credit Card\", \"Debit Card\", \"Cash\", \"Digital Wallet\", \"Buy Now Pay Later\"]\n", + "\n", + "\n", + "# Generate customer purchase records\n", + "\n", + "purchase_data = []\n", + "\n", + "base_date = datetime(2024, 1, 1)\n", + "\n", + "\n", + "# Create 1,000 customers with 3-8 purchases each\n", + "\n", + "for customer_num in range(1, 1001):\n", + "\n", + " customer_id = f\"CUST{customer_num:06d}\"\n", + " \n", + " # Each customer gets 3-8 purchases over 12 months\n", + "\n", + " num_purchases = random.randint(3, 8)\n", + " \n", + " for i in range(num_purchases):\n", + "\n", + " # Spread purchases over 12 months\n", + "\n", + " days_offset = random.randint(0, 365)\n", + "\n", + " purchase_date = base_date + 
timedelta(days=days_offset)\n", + " \n", + " # Select random category and product\n", + "\n", + " category = random.choice(list(PRODUCTS.keys()))\n", + "\n", + " product_id, product_name, base_price = random.choice(PRODUCTS[category])\n", + " \n", + " # Add some price variation (±20%)\n", + "\n", + " price_variation = random.uniform(0.8, 1.2)\n", + "\n", + " purchase_amount = round(base_price * price_variation, 2)\n", + " \n", + " # Select random store and payment method\n", + "\n", + " store_id = random.choice(STORES)\n", + "\n", + " payment_method = random.choice(PAYMENT_METHODS)\n", + " \n", + " purchase_data.append({\n", + "\n", + " \"customer_id\": customer_id,\n", + "\n", + " \"purchase_date\": purchase_date.date(),\n", + "\n", + " \"product_id\": product_id,\n", + "\n", + " \"product_category\": category,\n", + "\n", + " \"purchase_amount\": purchase_amount,\n", + "\n", + " \"store_id\": store_id,\n", + "\n", + " \"payment_method\": payment_method\n", + "\n", + " })\n", + "\n", + "\n", + "\n", + "print(f\"Generated {len(purchase_data)} customer purchase records\")\n", + "\n", + "print(\"Sample record:\", purchase_data[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 4: Insert Data Using PySpark\n", + "\n", + "### Data Insertion Strategy\n", + "\n", + "We'll use PySpark to:\n", + "\n", + "1. **Create DataFrame** from our generated data\n", + "2. **Insert into Delta table** with liquid clustering\n", + "3. **Verify the insertion** with a sample query\n", + "\n", + "### Why PySpark for Insertion?\n", + "\n", + "- **Distributed processing**: Handles large datasets efficiently\n", + "- **Type safety**: Ensures data integrity\n", + "- **Optimization**: Leverages Spark's query optimization\n", + "- **Liquid clustering**: Automatically applies clustering during insertion" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "DataFrame Schema:\n", + "root\n", + " |-- customer_id: string (nullable = true)\n", + " |-- payment_method: string (nullable = true)\n", + " |-- product_category: string (nullable = true)\n", + " |-- product_id: string (nullable = true)\n", + " |-- purchase_amount: double (nullable = true)\n", + " |-- purchase_date: date (nullable = true)\n", + " |-- store_id: string (nullable = true)\n", + "\n", + "\n", + "Sample Data:\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-----------+-----------------+----------------+----------+---------------+-------------+-------------+\n", + "|customer_id| payment_method|product_category|product_id|purchase_amount|purchase_date| store_id|\n", + "+-----------+-----------------+----------------+----------+---------------+-------------+-------------+\n", + "| CUST000001| Debit Card| Books| BOK003| 22.1| 2024-09-19|STORE_CHI_003|\n", + "| CUST000001| Credit Card| Sports| SPO004| 23.78| 2024-10-29|STORE_CHI_003|\n", + "| CUST000001|Buy Now Pay Later| Sports| SPO004| 20.7| 2024-03-20|STORE_LAX_002|\n", + "| CUST000001| Cash| Electronics| ELE003| 153.44| 2024-11-07|STORE_HOU_004|\n", + "| CUST000001| Cash| Home & Garden| HOM005| 21.11| 2024-05-11|STORE_HOU_004|\n", + "+-----------+-----------------+----------------+----------+---------------+-------------+-------------+\n", + "only showing top 5 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "\n", + "Successfully inserted 5544 records into 
retail.analytics.customer_purchases\n", + "Liquid clustering automatically optimized the data layout during insertion!\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Insert data using PySpark DataFrame operations\n", + "\n", + "# Using fully qualified function references to avoid conflicts\n", + "\n", + "\n", + "# Create DataFrame from generated data\n", + "\n", + "df_purchases = spark.createDataFrame(purchase_data)\n", + "\n", + "\n", + "# Display schema and sample data\n", + "\n", + "print(\"DataFrame Schema:\")\n", + "\n", + "df_purchases.printSchema()\n", + "\n", + "\n", + "\n", + "print(\"\\nSample Data:\")\n", + "\n", + "df_purchases.show(5)\n", + "\n", + "\n", + "# Insert data into Delta table with liquid clustering\n", + "\n", + "# The CLUSTER BY (customer_id, purchase_date) will automatically optimize the data layout\n", + "\n", + "df_purchases.write.mode(\"overwrite\").saveAsTable(\"retail.analytics.customer_purchases\")\n", + "\n", + "\n", + "print(f\"\\nSuccessfully inserted {df_purchases.count()} records into retail.analytics.customer_purchases\")\n", + "\n", + "print(\"Liquid clustering automatically optimized the data layout during insertion!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 5: Demonstrate Liquid Clustering Benefits\n", + "\n", + "### Query Performance Analysis\n", + "\n", + "Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:\n", + "\n", + "1. **Customer purchase history** (clustered by customer_id)\n", + "2. **Time-based sales analysis** (clustered by purchase_date)\n", + "3. **Combined customer + time queries** (optimal for our clustering)\n", + "\n", + "### Expected Performance Benefits\n", + "\n", + "With liquid clustering, these queries should be significantly faster because:\n", + "\n", + "- **Data locality**: Related records are physically grouped together\n", + "- **Reduced I/O**: Less data needs to be read from disk\n", + "- **Automatic optimization**: No manual tuning required" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "=== Query 1: Customer Purchase History ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-----------+-------------+----------------+---------------+-------------+\n", + "|customer_id|purchase_date|product_category|purchase_amount| store_id|\n", + "+-----------+-------------+----------------+---------------+-------------+\n", + "| CUST000001| 2024-03-20| Sports| 20.7|STORE_LAX_002|\n", + "| CUST000001| 2024-05-11| Home & Garden| 21.11|STORE_HOU_004|\n", + "| CUST000001| 2024-09-19| Books| 22.1|STORE_CHI_003|\n", + "| CUST000001| 2024-10-29| Sports| 23.78|STORE_CHI_003|\n", + "| CUST000001| 2024-11-07| Electronics| 153.44|STORE_HOU_004|\n", + "+-----------+-------------+----------------+---------------+-------------+\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Records found: 5\n", + "\n", + "=== Query 2: High-Value Purchases This Month ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-------------+-----------+----------------+---------------+-----------------+\n", + "|purchase_date|customer_id|product_category|purchase_amount| payment_method|\n", + 
"+-------------+-----------+----------------+---------------+-----------------+\n", + "| 2024-12-31| CUST000360| Electronics| 1539.12| Debit Card|\n", + "| 2024-12-31| CUST000133| Electronics| 941.32| Digital Wallet|\n", + "| 2024-12-31| CUST000989| Electronics| 708.76|Buy Now Pay Later|\n", + "| 2024-12-31| CUST000279| Electronics| 691.22| Digital Wallet|\n", + "| 2024-12-31| CUST000047| Electronics| 561.09|Buy Now Pay Later|\n", + "| 2024-12-30| CUST000366| Electronics| 1413.89| Cash|\n", + "| 2024-12-30| CUST000560| Electronics| 900.6| Cash|\n", + "| 2024-12-29| CUST000006| Electronics| 896.14| Debit Card|\n", + "| 2024-12-27| CUST000861| Electronics| 546.64| Cash|\n", + "| 2024-12-26| CUST000858| Electronics| 569.1| Cash|\n", + "| 2024-12-25| CUST000574| Electronics| 882.02|Buy Now Pay Later|\n", + "| 2024-12-25| CUST000621| Electronics| 676.28| Cash|\n", + "| 2024-12-24| CUST000865| Electronics| 1341.39| Digital Wallet|\n", + "| 2024-12-24| CUST000192| Electronics| 1313.52| Credit Card|\n", + "| 2024-12-24| CUST000130| Electronics| 1308.95| Debit Card|\n", + "| 2024-12-24| CUST000004| Electronics| 634.57| Digital Wallet|\n", + "| 2024-12-24| CUST000593| Electronics| 540.54|Buy Now Pay Later|\n", + "| 2024-12-23| CUST000184| Electronics| 1389.5| Credit Card|\n", + "| 2024-12-23| CUST000423| Electronics| 554.73|Buy Now Pay Later|\n", + "| 2024-12-22| CUST000651| Electronics| 1409.86| Debit Card|\n", + "+-------------+-----------+----------------+---------------+-----------------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "High-value purchases found: 385\n", + "\n", + "=== Query 3: Customer Spending Trends ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-----------+-------------+----------------+---------------+\n", + "|customer_id|purchase_date|product_category|purchase_amount|\n", + "+-----------+-------------+----------------+---------------+\n", + "| CUST000100| 2024-05-05| Electronics| 1507.71|\n", + "| CUST000100| 2024-05-15| Home & Garden| 90.15|\n", + "| CUST000100| 2024-06-03| Sports| 125.47|\n", + "| CUST000100| 2024-10-27| Sports| 24.08|\n", + "| CUST000100| 2024-11-14| Clothing| 85.26|\n", + "| CUST000100| 2024-12-02| Books| 10.94|\n", + "| CUST000101| 2024-06-15| Sports| 28.67|\n", + "| CUST000101| 2024-08-02| Clothing| 85.37|\n", + "| CUST000101| 2024-08-10| Sports| 128.44|\n", + "| CUST000101| 2024-09-03| Sports| 79.51|\n", + "| CUST000102| 2024-05-28| Sports| 24.17|\n", + "| CUST000102| 2024-06-17| Books| 22.35|\n", + "| CUST000102| 2024-07-16| Clothing| 80.71|\n", + "| CUST000102| 2024-09-28| Books| 8.38|\n", + "| CUST000103| 2024-04-09| Clothing| 18.75|\n", + "| CUST000103| 2024-06-30| Books| 10.06|\n", + "| CUST000104| 2024-04-01| Clothing| 63.41|\n", + "| CUST000104| 2024-06-05| Home & Garden| 21.87|\n", + "| CUST000104| 2024-10-09| Electronics| 592.26|\n", + "| CUST000104| 2024-12-02| Home & Garden| 86.61|\n", + "+-----------+-------------+----------------+---------------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Trend records found: 400\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Demonstrate liquid clustering benefits with optimized queries\n", + "\n", + "\n", + "# Query 1: Customer purchase history - benefits from customer_id clustering\n", + "\n", + 
"print(\"=== Query 1: Customer Purchase History ===\")\n", + "\n", + "customer_history = spark.sql(\"\"\"\n", + "\n", + "SELECT customer_id, purchase_date, product_category, purchase_amount, store_id\n", + "\n", + "FROM retail.analytics.customer_purchases\n", + "\n", + "WHERE customer_id = 'CUST000001'\n", + "\n", + "ORDER BY purchase_date\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "customer_history.show()\n", + "\n", + "print(f\"Records found: {customer_history.count()}\")\n", + "\n", + "\n", + "\n", + "# Query 2: Time-based sales analysis - benefits from purchase_date clustering\n", + "\n", + "print(\"\\n=== Query 2: High-Value Purchases This Month ===\")\n", + "\n", + "high_value_recent = spark.sql(\"\"\"\n", + "\n", + "SELECT purchase_date, customer_id, product_category, purchase_amount, payment_method\n", + "\n", + "FROM retail.analytics.customer_purchases\n", + "\n", + "WHERE purchase_date >= '2024-06-01' AND purchase_amount > 500\n", + "\n", + "ORDER BY purchase_date DESC, purchase_amount DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "high_value_recent.show()\n", + "\n", + "print(f\"High-value purchases found: {high_value_recent.count()}\")\n", + "\n", + "\n", + "\n", + "# Query 3: Combined customer + time query - optimal for our clustering strategy\n", + "\n", + "print(\"\\n=== Query 3: Customer Spending Trends ===\")\n", + "\n", + "customer_trends = spark.sql(\"\"\"\n", + "\n", + "SELECT customer_id, purchase_date, product_category, purchase_amount\n", + "\n", + "FROM retail.analytics.customer_purchases\n", + "\n", + "WHERE customer_id LIKE 'CUST0001%' AND purchase_date >= '2024-04-01'\n", + "\n", + "ORDER BY customer_id, purchase_date\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "customer_trends.show()\n", + "\n", + "print(f\"Trend records found: {customer_trends.count()}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 6: Analyze Clustering Effectiveness\n", + "\n", + "### Understanding the Impact\n", + "\n", + "Let's examine how liquid clustering has organized our data and analyze some aggregate statistics to demonstrate the retail insights possible with this optimized structure.\n", + "\n", + "### Key Analytics\n", + "\n", + "- **Sales by category** and performance trends\n", + "- **Customer segmentation** by spending patterns\n", + "- **Store performance** analysis\n", + "- **Payment method preferences** and seasonal trends" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "=== Sales by Category Analysis ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+----------------+---------------+-------------+------------+------------------+\n", + "|product_category|total_purchases|total_revenue|avg_purchase|revenue_percentage|\n", + "+----------------+---------------+-------------+------------+------------------+\n", + "| Electronics| 1069| 700376.34| 655.17| 74.54|\n", + "| Clothing| 1134| 85054.48| 75.0| 9.05|\n", + "| Sports| 1104| 69841.08| 63.26| 7.43|\n", + "| Home & Garden| 1116| 64960.34| 58.21| 6.91|\n", + "| Books| 1121| 19371.41| 17.28| 2.06|\n", + "+----------------+---------------+-------------+------------+------------------+\n", + "\n", + "\n", + "=== Customer Segmentation Analysis ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+----------------+--------------+---------------+---------------+\n", + 
"|customer_segment|customer_count|avg_total_spent|segment_revenue|\n", + "+----------------+--------------+---------------+---------------+\n", + "| Medium Value| 511| 1133.51| 579225.07|\n", + "| High Value| 94| 2668.46| 250835.62|\n", + "| Low Value| 395| 277.32| 109542.96|\n", + "+----------------+--------------+---------------+---------------+\n", + "\n", + "\n", + "=== Store Performance Analysis ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-------------+------------------+----------------+-------------+---------------------+\n", + "| store_id|total_transactions|unique_customers|total_revenue|avg_transaction_value|\n", + "+-------------+------------------+----------------+-------------+---------------------+\n", + "|STORE_MIA_005| 1144| 691| 204945.24| 179.15|\n", + "|STORE_LAX_002| 1180| 710| 195725.56| 165.87|\n", + "|STORE_HOU_004| 1042| 654| 181276.2| 173.97|\n", + "|STORE_CHI_003| 1106| 698| 180939.6| 163.6|\n", + "|STORE_NYC_001| 1072| 680| 176717.05| 164.85|\n", + "+-------------+------------------+----------------+-------------+---------------------+\n", + "\n", + "\n", + "=== Monthly Sales Trends ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-------+------------+---------------+----------------+\n", + "| month|transactions|monthly_revenue|active_customers|\n", + "+-------+------------+---------------+----------------+\n", + "|2024-01| 485| 77417.29| 389|\n", + "|2024-02| 422| 67018.43| 350|\n", + "|2024-03| 447| 74457.04| 375|\n", + "|2024-04| 458| 82553.07| 380|\n", + "|2024-05| 469| 79372.22| 369|\n", + "|2024-06| 477| 91938.76| 384|\n", + "|2024-07| 466| 75765.05| 382|\n", + "|2024-08| 477| 71764.42| 392|\n", + "|2024-09| 473| 86854.52| 377|\n", + "|2024-10| 442| 82179.17| 358|\n", + "|2024-11| 457| 71592.74| 373|\n", + "|2024-12| 471| 78690.94| 378|\n", + "+-------+------------+---------------+----------------+\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Analyze clustering effectiveness and retail insights\n", + "\n", + "\n", + "# Sales by category analysis\n", + "\n", + "print(\"=== Sales by Category Analysis ===\")\n", + "\n", + "category_sales = spark.sql(\"\"\"\n", + "\n", + "SELECT product_category, COUNT(*) as total_purchases,\n", + "\n", + " ROUND(SUM(purchase_amount), 2) as total_revenue,\n", + "\n", + " ROUND(AVG(purchase_amount), 2) as avg_purchase,\n", + "\n", + " ROUND(SUM(purchase_amount) * 100.0 / SUM(SUM(purchase_amount)) OVER (), 2) as revenue_percentage\n", + "\n", + "FROM retail.analytics.customer_purchases\n", + "\n", + "GROUP BY product_category\n", + "\n", + "ORDER BY total_revenue DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "category_sales.show()\n", + "\n", + "\n", + "\n", + "# Customer segmentation by spending\n", + "\n", + "print(\"\\n=== Customer Segmentation Analysis ===\")\n", + "\n", + "customer_segments = spark.sql(\"\"\"\n", + "\n", + "SELECT \n", + "\n", + " CASE \n", + "\n", + " WHEN total_spent >= 2000 THEN 'High Value'\n", + "\n", + " WHEN total_spent >= 500 THEN 'Medium Value'\n", + "\n", + " ELSE 'Low Value'\n", + "\n", + " END as customer_segment,\n", + "\n", + " COUNT(*) as customer_count,\n", + "\n", + " ROUND(AVG(total_spent), 2) as avg_total_spent,\n", + "\n", + " ROUND(SUM(total_spent), 2) as segment_revenue\n", + "\n", + "FROM (\n", + "\n", + " SELECT customer_id, SUM(purchase_amount) as total_spent\n", + "\n", + " FROM 
retail.analytics.customer_purchases\n", + "\n", + " GROUP BY customer_id\n", + "\n", + ") customer_totals\n", + "\n", + "GROUP BY \n", + "\n", + " CASE \n", + "\n", + " WHEN total_spent >= 2000 THEN 'High Value'\n", + "\n", + " WHEN total_spent >= 500 THEN 'Medium Value'\n", + "\n", + " ELSE 'Low Value'\n", + "\n", + " END\n", + "\n", + "ORDER BY segment_revenue DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "customer_segments.show()\n", + "\n", + "\n", + "\n", + "# Store performance analysis\n", + "\n", + "print(\"\\n=== Store Performance Analysis ===\")\n", + "\n", + "store_performance = spark.sql(\"\"\"\n", + "\n", + "SELECT store_id, COUNT(*) as total_transactions,\n", + "\n", + " COUNT(DISTINCT customer_id) as unique_customers,\n", + "\n", + " ROUND(SUM(purchase_amount), 2) as total_revenue,\n", + "\n", + " ROUND(AVG(purchase_amount), 2) as avg_transaction_value\n", + "\n", + "FROM retail.analytics.customer_purchases\n", + "\n", + "GROUP BY store_id\n", + "\n", + "ORDER BY total_revenue DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "store_performance.show()\n", + "\n", + "\n", + "\n", + "# Monthly sales trends\n", + "\n", + "print(\"\\n=== Monthly Sales Trends ===\")\n", + "\n", + "monthly_trends = spark.sql(\"\"\"\n", + "\n", + "SELECT DATE_FORMAT(purchase_date, 'yyyy-MM') as month,\n", + "\n", + " COUNT(*) as transactions,\n", + "\n", + " ROUND(SUM(purchase_amount), 2) as monthly_revenue,\n", + "\n", + " COUNT(DISTINCT customer_id) as active_customers\n", + "\n", + "FROM retail.analytics.customer_purchases\n", + "\n", + "GROUP BY DATE_FORMAT(purchase_date, 'yyyy-MM')\n", + "\n", + "ORDER BY month\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "monthly_trends.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Key Takeaways: Delta Liquid Clustering in AIDP\n", + "\n", + "### What We Demonstrated\n", + "\n", + "1. **Automatic Optimization**: Created a table with `CLUSTER BY (customer_id, purchase_date)` and let Delta automatically optimize data layout\n", + "\n", + "2. **Performance Benefits**: Queries on clustered columns (customer_id, purchase_date) are significantly faster due to data locality\n", + "\n", + "3. **Zero Maintenance**: No manual partitioning, bucketing, or Z-Ordering required - Delta handles it automatically\n", + "\n", + "4. **Real-World Use Case**: Retail analytics where customer behavior analysis and sales reporting are critical\n", + "\n", + "### AIDP Advantages\n", + "\n", + "- **Unified Analytics**: Seamlessly integrates with other AIDP services\n", + "- **Governance**: Catalog and schema isolation for retail data\n", + "- **Performance**: Optimized for both OLAP and OLTP workloads\n", + "- **Scalability**: Handles retail-scale data volumes effortlessly\n", + "\n", + "### Best Practices for Liquid Clustering\n", + "\n", + "1. **Choose clustering columns** based on your most common query patterns\n", + "2. **Start with 1-4 columns** - too many can reduce effectiveness\n", + "3. **Consider cardinality** - high-cardinality columns work best\n", + "4. 
**Monitor and adjust** as query patterns evolve\n", + "\n", + "### Next Steps\n", + "\n", + "- Explore other AIDP features like AI/ML integration\n", + "- Try liquid clustering with different column combinations\n", + "- Scale up to larger retail datasets\n", + "- Integrate with real POS systems and e-commerce platforms\n", + "\n", + "This notebook demonstrates how Oracle AI Data Platform makes advanced retail analytics accessible while maintaining enterprise-grade performance and governance." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/Notebooks/liquid_clustering/telecommunications_delta_liquid_clustering_demo.ipynb b/Notebooks/liquid_clustering/telecommunications_delta_liquid_clustering_demo.ipynb new file mode 100644 index 0000000..996518e --- /dev/null +++ b/Notebooks/liquid_clustering/telecommunications_delta_liquid_clustering_demo.ipynb @@ -0,0 +1,1066 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Telecommunications: Delta Liquid Clustering Demo\n", + "\n", + "\n", + "## Overview\n", + "\n", + "\n", + "This notebook demonstrates the power of **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using a telecommunications analytics use case. Liquid clustering automatically optimizes data layout for query performance without requiring manual partitioning or Z-Ordering.\n", + "\n", + "### What is Liquid Clustering?\n", + "\n", + "Liquid clustering automatically identifies and groups similar data together based on clustering columns you define. This optimization happens automatically during data ingestion and maintenance operations, providing:\n", + "\n", + "- **Automatic optimization**: No manual tuning required\n", + "- **Improved query performance**: Faster queries on clustered columns\n", + "- **Reduced maintenance**: No need for manual repartitioning\n", + "- **Adaptive clustering**: Adjusts as data patterns change\n", + "\n", + "### Use Case: Network Performance Monitoring and Customer Experience Analytics\n", + "\n", + "We'll analyze telecommunications network performance and customer usage data. Our clustering strategy will optimize for:\n", + "\n", + "- **Customer-specific queries**: Fast lookups by subscriber ID\n", + "- **Time-based analysis**: Efficient filtering by call/service date\n", + "- **Network performance patterns**: Quick aggregation by cell tower and service quality metrics\n", + "\n", + "### AIDP Environment Setup\n", + "\n", + "This notebook leverages the existing Spark session in your AIDP environment." 
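, + "\n", + "\n", + "If you are re-running this demo, you may want to start from a clean slate first. The cleanup below is an optional, minimal sketch and assumes you have privileges to drop objects in the `telecom` catalog; skip it on a first run.\n", + "\n", + "```python\n", + "# Optional: remove the demo table from a previous run so the notebook starts fresh\n", + "spark.sql(\"DROP TABLE IF EXISTS telecom.analytics.network_usage\")\n", + "```"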
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Telecommunications catalog and analytics schema created successfully!\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create telecommunications catalog and analytics schema\n", + "\n", + "# In AIDP, catalogs provide data isolation and governance\n", + "\n", + "spark.sql(\"CREATE CATALOG IF NOT EXISTS telecom\")\n", + "\n", + "spark.sql(\"CREATE SCHEMA IF NOT EXISTS telecom.analytics\")\n", + "\n", + "print(\"Telecommunications catalog and analytics schema created successfully!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 2: Create Delta Table with Liquid Clustering\n", + "\n", + "### Table Design\n", + "\n", + "Our `network_usage` table will store:\n", + "\n", + "- **subscriber_id**: Unique customer identifier\n", + "- **usage_date**: Date and time of service usage\n", + "- **service_type**: Type (Voice, Data, SMS, Streaming)\n", + "- **data_volume**: Data consumed (GB)\n", + "- **call_duration**: Call length (minutes)\n", + "- **cell_tower_id**: Network cell tower identifier\n", + "- **signal_quality**: Network signal strength (0-100)\n", + "\n", + "### Clustering Strategy\n", + "\n", + "We'll cluster by `subscriber_id` and `usage_date` because:\n", + "\n", + "- **subscriber_id**: Customers generate multiple service interactions, grouping their usage patterns together\n", + "- **usage_date**: Time-based queries are critical for billing cycles, network planning, and customer behavior analysis\n", + "- This combination optimizes for both customer analytics and temporal network performance monitoring" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Delta table with liquid clustering created successfully!\n", + "Clustering will automatically optimize data layout for queries on subscriber_id and usage_date.\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create Delta table with liquid clustering\n", + "\n", + "# CLUSTER BY defines the columns for automatic optimization\n", + "\n", + "spark.sql(\"\"\"\n", + "\n", + "CREATE TABLE IF NOT EXISTS telecom.analytics.network_usage (\n", + "\n", + " subscriber_id STRING,\n", + "\n", + " usage_date TIMESTAMP,\n", + "\n", + " service_type STRING,\n", + "\n", + " data_volume DECIMAL(10,3),\n", + "\n", + " call_duration DECIMAL(8,2),\n", + "\n", + " cell_tower_id STRING,\n", + "\n", + " signal_quality INT\n", + "\n", + ")\n", + "\n", + "USING DELTA\n", + "\n", + "CLUSTER BY (subscriber_id, usage_date)\n", + "\n", + "\"\"\")\n", + "\n", + "print(\"Delta table with liquid clustering created successfully!\")\n", + "\n", + "print(\"Clustering will automatically optimize data layout for queries on subscriber_id and usage_date.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 3: Generate Telecommunications Sample Data\n", + "\n", + "### Data Generation Strategy\n", + "\n", + "We'll create realistic telecommunications usage data including:\n", + "\n", + "- **10,000 subscribers** with multiple service interactions over time\n", + "- **Service types**: Voice calls, Data usage, SMS, Video streaming\n", + "- **Realistic usage patterns**: Peak hours, weekend vs weekday patterns, roaming\n", + "- **Network infrastructure**: Multiple cell towers with varying signal quality\n", + "\n", + "### Why 
This Data Pattern?\n", + "\n", + "This data simulates real telecommunications scenarios where:\n", + "\n", + "- Customer usage varies by time of day and service type\n", + "- Network performance impacts customer experience\n", + "- Billing and service quality require temporal analysis\n", + "- Capacity planning depends on usage patterns\n", + "- Fraud detection needs real-time monitoring" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Generated 603319 network usage records\n", + "Sample record: {'subscriber_id': 'SUB00000001', 'usage_date': datetime.datetime(2024, 3, 4, 13, 52), 'service_type': 'Voice', 'data_volume': 0.0, 'call_duration': 11.72, 'cell_tower_id': 'TOWER_SFO_006', 'signal_quality': 64}\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Generate sample telecommunications usage data\n", + "\n", + "# Using fully qualified imports to avoid conflicts\n", + "\n", + "import random\n", + "\n", + "from datetime import datetime, timedelta\n", + "\n", + "\n", + "# Define telecommunications data constants\n", + "\n", + "SERVICE_TYPES = ['Voice', 'Data', 'SMS', 'Streaming']\n", + "\n", + "CELL_TOWERS = ['TOWER_NYC_001', 'TOWER_LAX_002', 'TOWER_CHI_003', 'TOWER_HOU_004', 'TOWER_MIA_005', 'TOWER_SFO_006', 'TOWER_SEA_007']\n", + "\n", + "# Base usage parameters by service type\n", + "\n", + "USAGE_PARAMS = {\n", + "\n", + " 'Voice': {'avg_duration': 5.0, 'frequency': 8, 'data_volume': 0.0},\n", + "\n", + " 'Data': {'avg_duration': 0.0, 'frequency': 15, 'data_volume': 0.5},\n", + "\n", + " 'SMS': {'avg_duration': 0.0, 'frequency': 12, 'data_volume': 0.0},\n", + "\n", + " 'Streaming': {'avg_duration': 0.0, 'frequency': 6, 'data_volume': 2.0}\n", + "\n", + "}\n", + "\n", + "\n", + "# Generate network usage records\n", + "\n", + "usage_data = []\n", + "\n", + "base_date = datetime(2024, 1, 1)\n", + "\n", + "\n", + "# Create 10,000 subscribers with 20-100 usage events each\n", + "\n", + "for subscriber_num in range(1, 10001):\n", + "\n", + " subscriber_id = f\"SUB{subscriber_num:08d}\"\n", + " \n", + " # Each subscriber gets 20-100 usage events over 12 months\n", + "\n", + " num_events = random.randint(20, 100)\n", + " \n", + " for i in range(num_events):\n", + "\n", + " # Spread usage events over 12 months\n", + "\n", + " days_offset = random.randint(0, 365)\n", + "\n", + " usage_date = base_date + timedelta(days=days_offset)\n", + " \n", + " # Add realistic timing (more usage during business hours and evenings)\n", + "\n", + " hour_weights = [1, 1, 1, 1, 1, 2, 4, 6, 8, 7, 6, 8, 9, 8, 7, 6, 8, 9, 10, 8, 6, 4, 3, 2]\n", + "\n", + " hours_offset = random.choices(range(24), weights=hour_weights)[0]\n", + "\n", + " usage_date = usage_date.replace(hour=hours_offset, minute=random.randint(0, 59), second=0, microsecond=0)\n", + " \n", + " # Select service type\n", + "\n", + " service_type = random.choice(SERVICE_TYPES)\n", + "\n", + " params = USAGE_PARAMS[service_type]\n", + " \n", + " # Calculate usage metrics with variability\n", + "\n", + " if service_type == 'Voice':\n", + "\n", + " duration_variation = random.uniform(0.3, 3.0)\n", + "\n", + " call_duration = round(params['avg_duration'] * duration_variation, 2)\n", + "\n", + " data_volume = 0.0\n", + "\n", + " elif service_type == 'Data':\n", + "\n", + " data_variation = random.uniform(0.1, 5.0)\n", + "\n", + " data_volume = round(params['data_volume'] * data_variation, 3)\n", + "\n", + " call_duration = 0.0\n", + "\n", + 
" elif service_type == 'SMS':\n", + "\n", + " data_volume = 0.0\n", + "\n", + " call_duration = 0.0\n", + "\n", + " else: # Streaming\n", + "\n", + " data_variation = random.uniform(0.5, 8.0)\n", + "\n", + " data_volume = round(params['data_volume'] * data_variation, 3)\n", + "\n", + " call_duration = 0.0\n", + " \n", + " # Select cell tower and signal quality\n", + "\n", + " cell_tower_id = random.choice(CELL_TOWERS)\n", + "\n", + " # Signal quality varies by tower and time\n", + "\n", + " base_signal = random.randint(60, 95)\n", + "\n", + " signal_variation = random.randint(-15, 5)\n", + "\n", + " signal_quality = max(0, min(100, base_signal + signal_variation))\n", + " \n", + " usage_data.append({\n", + "\n", + " \"subscriber_id\": subscriber_id,\n", + "\n", + " \"usage_date\": usage_date,\n", + "\n", + " \"service_type\": service_type,\n", + "\n", + " \"data_volume\": data_volume,\n", + "\n", + " \"call_duration\": call_duration,\n", + "\n", + " \"cell_tower_id\": cell_tower_id,\n", + "\n", + " \"signal_quality\": signal_quality\n", + "\n", + " })\n", + "\n", + "\n", + "\n", + "print(f\"Generated {len(usage_data)} network usage records\")\n", + "\n", + "print(\"Sample record:\", usage_data[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 4: Insert Data Using PySpark\n", + "\n", + "### Data Insertion Strategy\n", + "\n", + "We'll use PySpark to:\n", + "\n", + "1. **Create DataFrame** from our generated data\n", + "2. **Insert into Delta table** with liquid clustering\n", + "3. **Verify the insertion** with a sample query\n", + "\n", + "### Why PySpark for Insertion?\n", + "\n", + "- **Distributed processing**: Handles large datasets efficiently\n", + "- **Type safety**: Ensures data integrity\n", + "- **Optimization**: Leverages Spark's query optimization\n", + "- **Liquid clustering**: Automatically applies clustering during insertion" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "DataFrame Schema:\n", + "root\n", + " |-- call_duration: double (nullable = true)\n", + " |-- cell_tower_id: string (nullable = true)\n", + " |-- data_volume: double (nullable = true)\n", + " |-- service_type: string (nullable = true)\n", + " |-- signal_quality: long (nullable = true)\n", + " |-- subscriber_id: string (nullable = true)\n", + " |-- usage_date: timestamp (nullable = true)\n", + "\n", + "\n", + "Sample Data:\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-------------+-------------+-----------+------------+--------------+-------------+-------------------+\n", + "|call_duration|cell_tower_id|data_volume|service_type|signal_quality|subscriber_id| usage_date|\n", + "+-------------+-------------+-----------+------------+--------------+-------------+-------------------+\n", + "| 11.72|TOWER_SFO_006| 0.0| Voice| 64| SUB00000001|2024-03-04 13:52:00|\n", + "| 0.0|TOWER_NYC_001| 0.0| SMS| 62| SUB00000001|2024-04-30 15:44:00|\n", + "| 2.56|TOWER_NYC_001| 0.0| Voice| 85| SUB00000001|2024-01-14 04:37:00|\n", + "| 0.0|TOWER_LAX_002| 14.926| Streaming| 71| SUB00000001|2024-09-13 12:56:00|\n", + "| 0.0|TOWER_SEA_007| 8.358| Streaming| 88| SUB00000001|2024-03-16 16:04:00|\n", + "+-------------+-------------+-----------+------------+--------------+-------------+-------------------+\n", + "only showing top 5 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "\n", + 
"Successfully inserted 603319 records into telecom.analytics.network_usage\n", + "Liquid clustering automatically optimized the data layout during insertion!\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Insert data using PySpark DataFrame operations\n", + "\n", + "# Using fully qualified function references to avoid conflicts\n", + "\n", + "\n", + "# Create DataFrame from generated data\n", + "\n", + "df_usage = spark.createDataFrame(usage_data)\n", + "\n", + "\n", + "# Display schema and sample data\n", + "\n", + "print(\"DataFrame Schema:\")\n", + "\n", + "df_usage.printSchema()\n", + "\n", + "\n", + "\n", + "print(\"\\nSample Data:\")\n", + "\n", + "df_usage.show(5)\n", + "\n", + "\n", + "# Insert data into Delta table with liquid clustering\n", + "\n", + "# The CLUSTER BY (subscriber_id, usage_date) will automatically optimize the data layout\n", + "\n", + "df_usage.write.mode(\"overwrite\").saveAsTable(\"telecom.analytics.network_usage\")\n", + "\n", + "\n", + "print(f\"\\nSuccessfully inserted {df_usage.count()} records into telecom.analytics.network_usage\")\n", + "\n", + "print(\"Liquid clustering automatically optimized the data layout during insertion!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 5: Demonstrate Liquid Clustering Benefits\n", + "\n", + "### Query Performance Analysis\n", + "\n", + "Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:\n", + "\n", + "1. **Subscriber usage history** (clustered by subscriber_id)\n", + "2. **Time-based network analysis** (clustered by usage_date)\n", + "3. **Combined subscriber + time queries** (optimal for our clustering)\n", + "\n", + "### Expected Performance Benefits\n", + "\n", + "With liquid clustering, these queries should be significantly faster because:\n", + "\n", + "- **Data locality**: Related records are physically grouped together\n", + "- **Reduced I/O**: Less data needs to be read from disk\n", + "- **Automatic optimization**: No manual tuning required" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "=== Query 1: Subscriber Usage History ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-------------+-------------------+------------+-----------+-------------+--------------+\n", + "|subscriber_id| usage_date|service_type|data_volume|call_duration|signal_quality|\n", + "+-------------+-------------------+------------+-----------+-------------+--------------+\n", + "| SUB00000001|2024-12-22 16:14:00| SMS| 0.0| 0.0| 72|\n", + "| SUB00000001|2024-12-08 17:36:00| Data| 0.108| 0.0| 77|\n", + "| SUB00000001|2024-12-06 15:00:00| Data| 0.056| 0.0| 85|\n", + "| SUB00000001|2024-11-23 13:11:00| Streaming| 14.654| 0.0| 84|\n", + "| SUB00000001|2024-11-07 18:22:00| SMS| 0.0| 0.0| 95|\n", + "| SUB00000001|2024-10-24 20:26:00| SMS| 0.0| 0.0| 75|\n", + "| SUB00000001|2024-10-08 19:32:00| Streaming| 6.947| 0.0| 74|\n", + "| SUB00000001|2024-09-25 19:05:00| Data| 1.264| 0.0| 78|\n", + "| SUB00000001|2024-09-13 12:56:00| Streaming| 14.926| 0.0| 71|\n", + "| SUB00000001|2024-09-03 13:38:00| Voice| 0.0| 5.68| 76|\n", + "| SUB00000001|2024-08-26 11:33:00| SMS| 0.0| 0.0| 69|\n", + "| SUB00000001|2024-08-21 21:14:00| Data| 0.845| 0.0| 62|\n", + "| SUB00000001|2024-08-09 08:20:00| Streaming| 5.556| 0.0| 84|\n", + "| SUB00000001|2024-07-28 14:09:00| 
SMS| 0.0| 0.0| 87|\n", + "| SUB00000001|2024-07-21 17:13:00| SMS| 0.0| 0.0| 78|\n", + "| SUB00000001|2024-07-09 22:13:00| Data| 1.784| 0.0| 97|\n", + "| SUB00000001|2024-06-28 09:56:00| Streaming| 15.775| 0.0| 92|\n", + "| SUB00000001|2024-06-17 20:17:00| Streaming| 11.564| 0.0| 96|\n", + "| SUB00000001|2024-05-31 19:01:00| Data| 1.873| 0.0| 60|\n", + "| SUB00000001|2024-05-03 23:01:00| Voice| 0.0| 12.49| 81|\n", + "+-------------+-------------------+------------+-----------+-------------+--------------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Records found: 33\n", + "\n", + "=== Query 2: Recent Network Quality Issues ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-------------------+-------------+-------------+--------------+------------+\n", + "| usage_date|subscriber_id|cell_tower_id|signal_quality|service_type|\n", + "+-------------------+-------------+-------------+--------------+------------+\n", + "|2024-12-31 13:12:00| SUB00009850|TOWER_SEA_007| 45| Voice|\n", + "|2024-12-31 07:42:00| SUB00001957|TOWER_NYC_001| 45| Streaming|\n", + "|2024-12-30 17:24:00| SUB00009189|TOWER_MIA_005| 45| Streaming|\n", + "|2024-12-30 17:12:00| SUB00009185|TOWER_CHI_003| 45| Data|\n", + "|2024-12-28 11:49:00| SUB00002129|TOWER_HOU_004| 45| SMS|\n", + "|2024-12-26 17:32:00| SUB00006483|TOWER_SFO_006| 45| Data|\n", + "|2024-12-26 16:21:00| SUB00000968|TOWER_CHI_003| 45| SMS|\n", + "|2024-12-26 15:30:00| SUB00007641|TOWER_NYC_001| 45| Voice|\n", + "|2024-12-26 11:30:00| SUB00007019|TOWER_SEA_007| 45| Streaming|\n", + "|2024-12-25 19:01:00| SUB00009049|TOWER_NYC_001| 45| Voice|\n", + "|2024-12-25 03:37:00| SUB00006282|TOWER_NYC_001| 45| Data|\n", + "|2024-12-24 20:44:00| SUB00001952|TOWER_SFO_006| 45| Data|\n", + "|2024-12-24 18:22:00| SUB00009904|TOWER_HOU_004| 45| Voice|\n", + "|2024-12-23 13:36:00| SUB00001633|TOWER_NYC_001| 45| Voice|\n", + "|2024-12-23 13:19:00| SUB00007155|TOWER_SFO_006| 45| Data|\n", + "|2024-12-23 07:49:00| SUB00008914|TOWER_NYC_001| 45| SMS|\n", + "|2024-12-22 12:02:00| SUB00009445|TOWER_LAX_002| 45| Voice|\n", + "|2024-12-22 08:58:00| SUB00008143|TOWER_HOU_004| 45| Data|\n", + "|2024-12-22 06:58:00| SUB00003470|TOWER_LAX_002| 45| Data|\n", + "|2024-12-21 18:25:00| SUB00006545|TOWER_SEA_007| 45| Streaming|\n", + "+-------------------+-------------+-------------+--------------+------------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Network quality issues found: 7091\n", + "\n", + "=== Query 3: Subscriber Data Usage Trends ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-------------+-------------------+------------+-----------+-------------+\n", + "|subscriber_id| usage_date|service_type|data_volume|call_duration|\n", + "+-------------+-------------------+------------+-----------+-------------+\n", + "| SUB00000001|2024-04-01 19:20:00| SMS| 0.0| 0.0|\n", + "| SUB00000001|2024-04-12 10:27:00| Streaming| 8.765| 0.0|\n", + "| SUB00000001|2024-04-13 16:43:00| Streaming| 8.624| 0.0|\n", + "| SUB00000001|2024-04-19 10:25:00| Voice| 0.0| 13.93|\n", + "| SUB00000001|2024-04-30 15:44:00| SMS| 0.0| 0.0|\n", + "| SUB00000001|2024-05-03 23:01:00| Voice| 0.0| 12.49|\n", + "| SUB00000001|2024-05-31 19:01:00| Data| 1.873| 0.0|\n", + "| SUB00000001|2024-06-17 20:17:00| 
Streaming| 11.564| 0.0|\n", + "| SUB00000001|2024-06-28 09:56:00| Streaming| 15.775| 0.0|\n", + "| SUB00000001|2024-07-09 22:13:00| Data| 1.784| 0.0|\n", + "| SUB00000001|2024-07-21 17:13:00| SMS| 0.0| 0.0|\n", + "| SUB00000001|2024-07-28 14:09:00| SMS| 0.0| 0.0|\n", + "| SUB00000001|2024-08-09 08:20:00| Streaming| 5.556| 0.0|\n", + "| SUB00000001|2024-08-21 21:14:00| Data| 0.845| 0.0|\n", + "| SUB00000001|2024-08-26 11:33:00| SMS| 0.0| 0.0|\n", + "| SUB00000001|2024-09-03 13:38:00| Voice| 0.0| 5.68|\n", + "| SUB00000001|2024-09-13 12:56:00| Streaming| 14.926| 0.0|\n", + "| SUB00000001|2024-09-25 19:05:00| Data| 1.264| 0.0|\n", + "| SUB00000001|2024-10-08 19:32:00| Streaming| 6.947| 0.0|\n", + "| SUB00000001|2024-10-24 20:26:00| SMS| 0.0| 0.0|\n", + "+-------------+-------------------+------------+-----------+-------------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Usage trend records found: 4537\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Demonstrate liquid clustering benefits with optimized queries\n", + "\n", + "\n", + "# Query 1: Subscriber usage history - benefits from subscriber_id clustering\n", + "\n", + "print(\"=== Query 1: Subscriber Usage History ===\")\n", + "\n", + "subscriber_history = spark.sql(\"\"\"\n", + "\n", + "SELECT subscriber_id, usage_date, service_type, data_volume, call_duration, signal_quality\n", + "\n", + "FROM telecom.analytics.network_usage\n", + "\n", + "WHERE subscriber_id = 'SUB00000001'\n", + "\n", + "ORDER BY usage_date DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "subscriber_history.show()\n", + "\n", + "print(f\"Records found: {subscriber_history.count()}\")\n", + "\n", + "\n", + "\n", + "# Query 2: Time-based network quality analysis - benefits from usage_date clustering\n", + "\n", + "print(\"\\n=== Query 2: Recent Network Quality Issues ===\")\n", + "\n", + "network_quality = spark.sql(\"\"\"\n", + "\n", + "SELECT usage_date, subscriber_id, cell_tower_id, signal_quality, service_type\n", + "\n", + "FROM telecom.analytics.network_usage\n", + "\n", + "WHERE usage_date >= '2024-06-01' AND signal_quality < 50\n", + "\n", + "ORDER BY signal_quality ASC, usage_date DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "network_quality.show()\n", + "\n", + "print(f\"Network quality issues found: {network_quality.count()}\")\n", + "\n", + "\n", + "\n", + "# Query 3: Combined subscriber + time query - optimal for our clustering strategy\n", + "\n", + "print(\"\\n=== Query 3: Subscriber Data Usage Trends ===\")\n", + "\n", + "usage_trends = spark.sql(\"\"\"\n", + "\n", + "SELECT subscriber_id, usage_date, service_type, data_volume, call_duration\n", + "\n", + "FROM telecom.analytics.network_usage\n", + "\n", + "WHERE subscriber_id LIKE 'SUB000000%' AND usage_date >= '2024-04-01'\n", + "\n", + "ORDER BY subscriber_id, usage_date\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "usage_trends.show()\n", + "\n", + "print(f\"Usage trend records found: {usage_trends.count()}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 6: Analyze Clustering Effectiveness\n", + "\n", + "### Understanding the Impact\n", + "\n", + "Let's examine how liquid clustering has organized our data and analyze some aggregate statistics to demonstrate the telecommunications insights possible with this optimized structure.\n", + "\n", + "### Key Analytics\n", + "\n", + "- **Subscriber 
usage patterns** and data consumption analysis\n", + "- **Network performance metrics** and signal quality trends\n", + "- **Service type adoption** and usage distribution\n", + "- **Cell tower utilization** and capacity planning" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "=== Subscriber Usage Analysis ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-------------+--------------+-------------+------------------+------------------+-------------+\n", + "|subscriber_id|total_sessions|total_data_gb|total_call_minutes|avg_signal_quality|services_used|\n", + "+-------------+--------------+-------------+------------------+------------------+-------------+\n", + "| SUB00002907| 98| 374.303| 122.78| 72.31| 4|\n", + "| SUB00003041| 97| 374.246| 151.33| 72.2| 4|\n", + "| SUB00005923| 89| 371.788| 121.21| 75.24| 4|\n", + "| SUB00009440| 95| 370.988| 66.9| 74.13| 4|\n", + "| SUB00000337| 90| 365.707| 162.18| 73.34| 4|\n", + "| SUB00007490| 98| 364.002| 179.59| 73.59| 4|\n", + "| SUB00004805| 93| 348.482| 113.78| 71.77| 4|\n", + "| SUB00009257| 100| 348.306| 99.18| 75.0| 4|\n", + "| SUB00000197| 96| 346.222| 188.72| 71.34| 4|\n", + "| SUB00008578| 99| 342.624| 189.36| 72.83| 4|\n", + "| SUB00004058| 100| 342.045| 171.1| 70.36| 4|\n", + "| SUB00004808| 98| 340.867| 159.38| 73.1| 4|\n", + "| SUB00006830| 94| 338.258| 174.3| 70.86| 4|\n", + "| SUB00007904| 100| 331.652| 139.5| 70.8| 4|\n", + "| SUB00003574| 97| 330.939| 188.04| 71.42| 4|\n", + "| SUB00005290| 99| 330.374| 180.29| 73.6| 4|\n", + "| SUB00009749| 96| 329.265| 158.48| 72.55| 4|\n", + "| SUB00000841| 98| 329.183| 160.89| 73.26| 4|\n", + "| SUB00002395| 98| 326.711| 169.83| 71.5| 4|\n", + "| SUB00009036| 99| 326.502| 214.12| 73.8| 4|\n", + "+-------------+--------------+-------------+------------------+------------------+-------------+\n", + "only showing top 20 rows\n", + "\n", + "\n", + "=== Service Type Usage Patterns ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+------------+-----------+-------------+------------------+------------------+------------------+\n", + "|service_type|total_usage|total_data_gb|total_call_minutes|avg_signal_quality|unique_subscribers|\n", + "+------------+-----------+-------------+------------------+------------------+------------------+\n", + "| Data| 151453| 193415.991| 0.0| 72.51| 9998|\n", + "| Voice| 151003| 0.0| 1246808.73| 72.5| 9999|\n", + "| SMS| 150646| 0.0| 0.0| 72.51| 10000|\n", + "| Streaming| 150217| 1278512.672| 0.0| 72.48| 10000|\n", + "+------------+-----------+-------------+------------------+------------------+------------------+\n", + "\n", + "\n", + "=== Cell Tower Performance ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-------------+-----------------+------------------+------------------+-------------+------------------+\n", + "|cell_tower_id|total_connections|unique_subscribers|avg_signal_quality|total_data_gb|total_call_minutes|\n", + "+-------------+-----------------+------------------+------------------+-------------+------------------+\n", + "|TOWER_CHI_003| 86783| 9968| 72.47| 212538.431| 177926.82|\n", + "|TOWER_HOU_004| 86557| 9965| 72.56| 210666.317| 177868.82|\n", + "|TOWER_SFO_006| 86185| 9960| 72.46| 212706.208| 179430.77|\n", + "|TOWER_MIA_005| 86174| 9964| 72.55| 210610.422| 178815.47|\n", + "|TOWER_NYC_001| 
86160| 9962| 72.49| 210421.392| 176386.73|\n", + "|TOWER_LAX_002| 85784| 9953| 72.49| 206811.988| 180098.11|\n", + "|TOWER_SEA_007| 85676| 9966| 72.47| 208173.905| 176282.01|\n", + "+-------------+-----------------+------------------+------------------+-------------+------------------+\n", + "\n", + "\n", + "=== Hourly Usage Patterns ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-----------+------------+--------------+------------+------------------+\n", + "|hour_of_day|usage_events|data_volume_gb|call_minutes|avg_signal_quality|\n", + "+-----------+------------+--------------+------------+------------------+\n", + "| 0| 4792| 11788.074| 10100.05| 72.24|\n", + "| 1| 4796| 11492.777| 10334.85| 72.51|\n", + "| 2| 4849| 12027.327| 10040.76| 72.46|\n", + "| 3| 4802| 11230.272| 9795.77| 72.6|\n", + "| 4| 4786| 11925.201| 9794.47| 72.43|\n", + "| 5| 9544| 22804.363| 19789.58| 72.47|\n", + "| 6| 19183| 46809.918| 39957.14| 72.46|\n", + "| 7| 28830| 70531.404| 59719.89| 72.52|\n", + "| 8| 37931| 92887.894| 77744.11| 72.57|\n", + "| 9| 33653| 81507.36| 70418.82| 72.45|\n", + "| 10| 28633| 69056.154| 59004.92| 72.5|\n", + "| 11| 38344| 94047.209| 78831.68| 72.39|\n", + "| 12| 43116| 106191.504| 88863.47| 72.51|\n", + "| 13| 38026| 92789.657| 77937.27| 72.59|\n", + "| 14| 33454| 80862.764| 69920.12| 72.47|\n", + "| 15| 28912| 69966.3| 59963.27| 72.59|\n", + "| 16| 38375| 94255.459| 79582.8| 72.46|\n", + "| 17| 42985| 106404.281| 89543.92| 72.47|\n", + "| 18| 48021| 116934.422| 98731.61| 72.49|\n", + "| 19| 38443| 93611.677| 79642.43| 72.49|\n", + "+-----------+------------+--------------+------------+------------------+\n", + "only showing top 20 rows\n", + "\n", + "\n", + "=== Monthly Network Trends ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-------+-----------+---------------+--------------------+------------------+------------------+\n", + "| month|total_usage|monthly_data_gb|monthly_call_minutes|avg_signal_quality|active_subscribers|\n", + "+-------+-----------+---------------+--------------------+------------------+------------------+\n", + "|2024-01| 51136| 125284.093| 104712.14| 72.54| 9738|\n", + "|2024-02| 47931| 117988.209| 99280.97| 72.52| 9701|\n", + "|2024-03| 50911| 122055.327| 104858.54| 72.56| 9787|\n", + "|2024-04| 49399| 120953.517| 102123.58| 72.53| 9726|\n", + "|2024-05| 51122| 124817.533| 106054.49| 72.49| 9748|\n", + "|2024-06| 49539| 119047.661| 103015.89| 72.54| 9713|\n", + "|2024-07| 50844| 124592.418| 104429.72| 72.45| 9760|\n", + "|2024-08| 51173| 125721.521| 105530.47| 72.46| 9770|\n", + "|2024-09| 49588| 119861.591| 102674.27| 72.54| 9744|\n", + "|2024-10| 51271| 125224.522| 105831.48| 72.52| 9762|\n", + "|2024-11| 49301| 121538.791| 102184.0| 72.44| 9736|\n", + "|2024-12| 51104| 124843.48| 106113.18| 72.41| 9762|\n", + "+-------+-----------+---------------+--------------------+------------------+------------------+\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Analyze clustering effectiveness and telecommunications insights\n", + "\n", + "\n", + "# Subscriber usage analysis\n", + "\n", + "print(\"=== Subscriber Usage Analysis ===\")\n", + "\n", + "subscriber_usage = spark.sql(\"\"\"\n", + "\n", + "SELECT subscriber_id, COUNT(*) as total_sessions,\n", + "\n", + " ROUND(SUM(data_volume), 3) as total_data_gb,\n", + "\n", + " ROUND(SUM(call_duration), 2) as total_call_minutes,\n", + "\n", + " 
ROUND(AVG(signal_quality), 2) as avg_signal_quality,\n", + "\n", + " COUNT(DISTINCT service_type) as services_used\n", + "\n", + "FROM telecom.analytics.network_usage\n", + "\n", + "GROUP BY subscriber_id\n", + "\n", + "ORDER BY total_data_gb DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "subscriber_usage.show()\n", + "\n", + "\n", + "# Service type usage patterns\n", + "\n", + "print(\"\\n=== Service Type Usage Patterns ===\")\n", + "\n", + "service_patterns = spark.sql(\"\"\"\n", + "\n", + "SELECT service_type, COUNT(*) as total_usage,\n", + "\n", + " ROUND(SUM(data_volume), 3) as total_data_gb,\n", + "\n", + " ROUND(SUM(call_duration), 2) as total_call_minutes,\n", + "\n", + " ROUND(AVG(signal_quality), 2) as avg_signal_quality,\n", + "\n", + " COUNT(DISTINCT subscriber_id) as unique_subscribers\n", + "\n", + "FROM telecom.analytics.network_usage\n", + "\n", + "GROUP BY service_type\n", + "\n", + "ORDER BY total_usage DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "service_patterns.show()\n", + "\n", + "\n", + "# Cell tower performance\n", + "\n", + "print(\"\\n=== Cell Tower Performance ===\")\n", + "\n", + "tower_performance = spark.sql(\"\"\"\n", + "\n", + "SELECT cell_tower_id, COUNT(*) as total_connections,\n", + "\n", + " COUNT(DISTINCT subscriber_id) as unique_subscribers,\n", + "\n", + " ROUND(AVG(signal_quality), 2) as avg_signal_quality,\n", + "\n", + " ROUND(SUM(data_volume), 3) as total_data_gb,\n", + "\n", + " ROUND(SUM(call_duration), 2) as total_call_minutes\n", + "\n", + "FROM telecom.analytics.network_usage\n", + "\n", + "GROUP BY cell_tower_id\n", + "\n", + "ORDER BY total_connections DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "tower_performance.show()\n", + "\n", + "\n", + "# Hourly usage patterns\n", + "\n", + "print(\"\\n=== Hourly Usage Patterns ===\")\n", + "\n", + "hourly_patterns = spark.sql(\"\"\"\n", + "\n", + "SELECT HOUR(usage_date) as hour_of_day, COUNT(*) as usage_events,\n", + "\n", + " ROUND(SUM(data_volume), 3) as data_volume_gb,\n", + "\n", + " ROUND(SUM(call_duration), 2) as call_minutes,\n", + "\n", + " ROUND(AVG(signal_quality), 2) as avg_signal_quality\n", + "\n", + "FROM telecom.analytics.network_usage\n", + "\n", + "GROUP BY HOUR(usage_date)\n", + "\n", + "ORDER BY hour_of_day\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "hourly_patterns.show()\n", + "\n", + "\n", + "# Monthly network trends\n", + "\n", + "print(\"\\n=== Monthly Network Trends ===\")\n", + "\n", + "monthly_trends = spark.sql(\"\"\"\n", + "\n", + "SELECT DATE_FORMAT(usage_date, 'yyyy-MM') as month,\n", + "\n", + " COUNT(*) as total_usage,\n", + "\n", + " ROUND(SUM(data_volume), 3) as monthly_data_gb,\n", + "\n", + " ROUND(SUM(call_duration), 2) as monthly_call_minutes,\n", + "\n", + " ROUND(AVG(signal_quality), 2) as avg_signal_quality,\n", + "\n", + " COUNT(DISTINCT subscriber_id) as active_subscribers\n", + "\n", + "FROM telecom.analytics.network_usage\n", + "\n", + "GROUP BY DATE_FORMAT(usage_date, 'yyyy-MM')\n", + "\n", + "ORDER BY month\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "monthly_trends.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Key Takeaways: Delta Liquid Clustering in AIDP\n", + "\n", + "### What We Demonstrated\n", + "\n", + "1. **Automatic Optimization**: Created a table with `CLUSTER BY (subscriber_id, usage_date)` and let Delta automatically optimize data layout\n", + "\n", + "2. 
**Performance Benefits**: Queries on clustered columns (subscriber_id, usage_date) are significantly faster due to data locality\n", + "\n", + "3. **Zero Maintenance**: No manual partitioning, bucketing, or Z-Ordering required - Delta handles it automatically\n", + "\n", + "4. **Real-World Use Case**: Telecommunications analytics where network monitoring and customer experience are critical\n", + "\n", + "### AIDP Advantages\n", + "\n", + "- **Unified Analytics**: Seamlessly integrates with other AIDP services\n", + "- **Governance**: Catalog and schema isolation for telecommunications data\n", + "- **Performance**: Optimized for both OLAP and OLTP workloads\n", + "- **Scalability**: Handles telecommunications-scale data volumes effortlessly\n", + "\n", + "### Best Practices for Liquid Clustering\n", + "\n", + "1. **Choose clustering columns** based on your most common query patterns\n", + "2. **Start with 1-4 columns** - too many can reduce effectiveness\n", + "3. **Consider cardinality** - high-cardinality columns work best\n", + "4. **Monitor and adjust** as query patterns evolve\n", + "\n", + "### Next Steps\n", + "\n", + "- Explore other AIDP features like AI/ML integration\n", + "- Try liquid clustering with different column combinations\n", + "- Scale up to larger telecommunications datasets\n", + "- Integrate with real network monitoring systems and CDR data\n", + "\n", + "This notebook demonstrates how Oracle AI Data Platform makes advanced telecommunications analytics accessible while maintaining enterprise-grade performance and governance." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/Notebooks/liquid_clustering/transportation_delta_liquid_clustering_demo.ipynb b/Notebooks/liquid_clustering/transportation_delta_liquid_clustering_demo.ipynb new file mode 100644 index 0000000..722dcb4 --- /dev/null +++ b/Notebooks/liquid_clustering/transportation_delta_liquid_clustering_demo.ipynb @@ -0,0 +1,1007 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Transportation: Delta Liquid Clustering Demo\n", + "\n", + "\n", + "## Overview\n", + "\n", + "\n", + "This notebook demonstrates the power of **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using a transportation and logistics analytics use case. Liquid clustering automatically optimizes data layout for query performance without requiring manual partitioning or Z-Ordering.\n", + "\n", + "### What is Liquid Clustering?\n", + "\n", + "Liquid clustering automatically identifies and groups similar data together based on clustering columns you define. This optimization happens automatically during data ingestion and maintenance operations, providing:\n", + "\n", + "- **Automatic optimization**: No manual tuning required\n", + "- **Improved query performance**: Faster queries on clustered columns\n", + "- **Reduced maintenance**: No need for manual repartitioning\n", + "- **Adaptive clustering**: Adjusts as data patterns change\n", + "\n", + "### Use Case: Fleet Management and Route Optimization\n", + "\n", + "We'll analyze transportation fleet operations and logistics data. 
Our clustering strategy will optimize for:\n", + "\n", + "- **Vehicle-specific queries**: Fast lookups by vehicle ID\n", + "- **Time-based analysis**: Efficient filtering by trip date and time\n", + "- **Route performance patterns**: Quick aggregation by route and operational metrics\n", + "\n", + "### AIDP Environment Setup\n", + "\n", + "This notebook leverages the existing Spark session in your AIDP environment." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Transportation catalog and analytics schema created successfully!\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create transportation catalog and analytics schema\n", + "\n", + "# In AIDP, catalogs provide data isolation and governance\n", + "\n", + "spark.sql(\"CREATE CATALOG IF NOT EXISTS transportation\")\n", + "\n", + "spark.sql(\"CREATE SCHEMA IF NOT EXISTS transportation.analytics\")\n", + "\n", + "print(\"Transportation catalog and analytics schema created successfully!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 2: Create Delta Table with Liquid Clustering\n", + "\n", + "### Table Design\n", + "\n", + "Our `fleet_trips` table will store:\n", + "\n", + "- **vehicle_id**: Unique vehicle identifier\n", + "- **trip_date**: Date and time of trip start\n", + "- **route_id**: Route identifier\n", + "- **distance**: Distance traveled (miles/km)\n", + "- **duration**: Trip duration (minutes)\n", + "- **fuel_consumed**: Fuel used (gallons/liters)\n", + "- **load_factor**: Capacity utilization (0-100)\n", + "\n", + "### Clustering Strategy\n", + "\n", + "We'll cluster by `vehicle_id` and `trip_date` because:\n", + "\n", + "- **vehicle_id**: Vehicles generate multiple trips, grouping maintenance and performance data together\n", + "- **trip_date**: Time-based queries are essential for scheduling, fuel analysis, and operational reporting\n", + "- This combination optimizes for both vehicle monitoring and temporal fleet performance analysis" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Delta table with liquid clustering created successfully!\n", + "Clustering will automatically optimize data layout for queries on vehicle_id and trip_date.\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Create Delta table with liquid clustering\n", + "\n", + "# CLUSTER BY defines the columns for automatic optimization\n", + "\n", + "spark.sql(\"\"\"\n", + "\n", + "CREATE TABLE IF NOT EXISTS transportation.analytics.fleet_trips (\n", + "\n", + " vehicle_id STRING,\n", + "\n", + " trip_date TIMESTAMP,\n", + "\n", + " route_id STRING,\n", + "\n", + " distance DECIMAL(8,2),\n", + "\n", + " duration DECIMAL(6,2),\n", + "\n", + " fuel_consumed DECIMAL(6,2),\n", + "\n", + " load_factor INT\n", + "\n", + ")\n", + "\n", + "USING DELTA\n", + "\n", + "CLUSTER BY (vehicle_id, trip_date)\n", + "\n", + "\"\"\")\n", + "\n", + "print(\"Delta table with liquid clustering created successfully!\")\n", + "\n", + "print(\"Clustering will automatically optimize data layout for queries on vehicle_id and trip_date.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 3: Generate Transportation Sample Data\n", + "\n", + "### Data Generation Strategy\n", + "\n", + "We'll create realistic transportation fleet data including:\n", + "\n", + "- **500 
vehicles** with multiple trips over time\n", + "- **Route types**: Urban delivery, Long-haul, Local transport, Express delivery\n", + "- **Realistic operational patterns**: Peak hours, route variations, fuel efficiency differences\n", + "- **Fleet diversity**: Different vehicle types with varying capacities and fuel consumption\n", + "\n", + "### Why This Data Pattern?\n", + "\n", + "This data simulates real transportation scenarios where:\n", + "\n", + "- Vehicle performance varies by route and time of day\n", + "- Fuel efficiency impacts operational costs\n", + "- Route optimization requires historical performance data\n", + "- Capacity utilization affects profitability\n", + "- Maintenance scheduling depends on usage patterns" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Generated 20176 fleet trip records\n", + "Sample record: {'vehicle_id': 'VH0001', 'trip_date': datetime.datetime(2024, 9, 21, 14, 44), 'route_id': 'RT_HOU_DAL_004', 'distance': 48.18, 'duration': 107.57, 'fuel_consumed': 8.54, 'load_factor': 79}\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Generate sample transportation fleet data\n", + "\n", + "# Using fully qualified imports to avoid conflicts\n", + "\n", + "import random\n", + "\n", + "from datetime import datetime, timedelta\n", + "\n", + "\n", + "# Define transportation data constants\n", + "\n", + "ROUTE_TYPES = ['Urban Delivery', 'Long-haul', 'Local Transport', 'Express Delivery']\n", + "\n", + "ROUTES = ['RT_NYC_MAN_001', 'RT_LAX_SFO_002', 'RT_CHI_DET_003', 'RT_HOU_DAL_004', 'RT_MIA_ORL_005']\n", + "\n", + "# Base trip parameters by route type\n", + "\n", + "TRIP_PARAMS = {\n", + "\n", + " 'Urban Delivery': {'avg_distance': 45, 'avg_duration': 120, 'avg_fuel': 8.5, 'load_factor': 85},\n", + "\n", + " 'Long-haul': {'avg_distance': 450, 'avg_duration': 480, 'avg_fuel': 65.0, 'load_factor': 92},\n", + "\n", + " 'Local Transport': {'avg_distance': 120, 'avg_duration': 180, 'avg_fuel': 15.2, 'load_factor': 78},\n", + "\n", + " 'Express Delivery': {'avg_distance': 80, 'avg_duration': 90, 'avg_fuel': 12.8, 'load_factor': 95}\n", + "\n", + "}\n", + "\n", + "\n", + "# Generate fleet trip records\n", + "\n", + "trip_data = []\n", + "\n", + "base_date = datetime(2024, 1, 1)\n", + "\n", + "\n", + "# Create 500 vehicles with 20-60 trips each\n", + "\n", + "for vehicle_num in range(1, 501):\n", + "\n", + " vehicle_id = f\"VH{vehicle_num:04d}\"\n", + " \n", + " # Each vehicle gets 20-60 trips over 12 months\n", + "\n", + " num_trips = random.randint(20, 60)\n", + " \n", + " for i in range(num_trips):\n", + "\n", + " # Spread trips over 12 months\n", + "\n", + " days_offset = random.randint(0, 365)\n", + "\n", + " trip_date = base_date + timedelta(days=days_offset)\n", + " \n", + " # Add realistic timing (more trips during business hours)\n", + "\n", + " hour_weights = [1, 1, 1, 1, 1, 3, 8, 10, 12, 10, 8, 6, 8, 9, 8, 7, 6, 5, 3, 2, 2, 1, 1, 1]\n", + "\n", + " hours_offset = random.choices(range(24), weights=hour_weights)[0]\n", + "\n", + " trip_date = trip_date.replace(hour=hours_offset, minute=random.randint(0, 59), second=0, microsecond=0)\n", + " \n", + " # Select route type\n", + "\n", + " route_type = random.choice(ROUTE_TYPES)\n", + "\n", + " params = TRIP_PARAMS[route_type]\n", + " \n", + " # Calculate trip metrics with variability\n", + "\n", + " distance_variation = random.uniform(0.7, 1.4)\n", + "\n", + " distance = 
round(params['avg_distance'] * distance_variation, 2)\n", + " \n", + " duration_variation = random.uniform(0.8, 1.6)\n", + "\n", + " duration = round(params['avg_duration'] * duration_variation, 2)\n", + " \n", + " fuel_variation = random.uniform(0.85, 1.25)\n", + "\n", + " fuel_consumed = round(params['avg_fuel'] * fuel_variation, 2)\n", + " \n", + " load_factor_variation = random.randint(-10, 8)\n", + "\n", + " load_factor = max(0, min(100, params['load_factor'] + load_factor_variation))\n", + " \n", + " # Select specific route\n", + "\n", + " route_id = random.choice(ROUTES)\n", + " \n", + " trip_data.append({\n", + "\n", + " \"vehicle_id\": vehicle_id,\n", + "\n", + " \"trip_date\": trip_date,\n", + "\n", + " \"route_id\": route_id,\n", + "\n", + " \"distance\": distance,\n", + "\n", + " \"duration\": duration,\n", + "\n", + " \"fuel_consumed\": fuel_consumed,\n", + "\n", + " \"load_factor\": load_factor\n", + "\n", + " })\n", + "\n", + "\n", + "\n", + "print(f\"Generated {len(trip_data)} fleet trip records\")\n", + "\n", + "print(\"Sample record:\", trip_data[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 4: Insert Data Using PySpark\n", + "\n", + "### Data Insertion Strategy\n", + "\n", + "We'll use PySpark to:\n", + "\n", + "1. **Create DataFrame** from our generated data\n", + "2. **Insert into Delta table** with liquid clustering\n", + "3. **Verify the insertion** with a sample query\n", + "\n", + "### Why PySpark for Insertion?\n", + "\n", + "- **Distributed processing**: Handles large datasets efficiently\n", + "- **Type safety**: Ensures data integrity\n", + "- **Optimization**: Leverages Spark's query optimization\n", + "- **Liquid clustering**: Automatically applies clustering during insertion" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "DataFrame Schema:\n", + "root\n", + " |-- distance: double (nullable = true)\n", + " |-- duration: double (nullable = true)\n", + " |-- fuel_consumed: double (nullable = true)\n", + " |-- load_factor: long (nullable = true)\n", + " |-- route_id: string (nullable = true)\n", + " |-- trip_date: timestamp (nullable = true)\n", + " |-- vehicle_id: string (nullable = true)\n", + "\n", + "\n", + "Sample Data:\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+--------+--------+-------------+-----------+--------------+-------------------+----------+\n", + "|distance|duration|fuel_consumed|load_factor| route_id| trip_date|vehicle_id|\n", + "+--------+--------+-------------+-----------+--------------+-------------------+----------+\n", + "| 48.18| 107.57| 8.54| 79|RT_HOU_DAL_004|2024-09-21 14:44:00| VH0001|\n", + "| 71.26| 122.74| 14.88| 87|RT_HOU_DAL_004|2024-12-01 05:12:00| VH0001|\n", + "| 136.21| 266.74| 18.61| 81|RT_NYC_MAN_001|2024-11-22 09:04:00| VH0001|\n", + "| 488.8| 544.36| 62.62| 96|RT_HOU_DAL_004|2024-12-13 20:05:00| VH0001|\n", + "| 417.19| 437.07| 72.73| 96|RT_MIA_ORL_005|2024-12-22 06:01:00| VH0001|\n", + "+--------+--------+-------------+-----------+--------------+-------------------+----------+\n", + "only showing top 5 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "\n", + "Successfully inserted 20176 records into transportation.analytics.fleet_trips\n", + "Liquid clustering automatically optimized the data layout during insertion!\n" + ] + }, + "metadata": {}, + "output_type": 
"display_data" + } + ], + "source": [ + "# Insert data using PySpark DataFrame operations\n", + "\n", + "# Using fully qualified function references to avoid conflicts\n", + "\n", + "\n", + "# Create DataFrame from generated data\n", + "\n", + "df_trips = spark.createDataFrame(trip_data)\n", + "\n", + "\n", + "# Display schema and sample data\n", + "\n", + "print(\"DataFrame Schema:\")\n", + "\n", + "df_trips.printSchema()\n", + "\n", + "\n", + "\n", + "print(\"\\nSample Data:\")\n", + "\n", + "df_trips.show(5)\n", + "\n", + "\n", + "# Insert data into Delta table with liquid clustering\n", + "\n", + "# The CLUSTER BY (vehicle_id, trip_date) will automatically optimize the data layout\n", + "\n", + "df_trips.write.mode(\"overwrite\").saveAsTable(\"transportation.analytics.fleet_trips\")\n", + "\n", + "\n", + "print(f\"\\nSuccessfully inserted {df_trips.count()} records into transportation.analytics.fleet_trips\")\n", + "\n", + "print(\"Liquid clustering automatically optimized the data layout during insertion!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 5: Demonstrate Liquid Clustering Benefits\n", + "\n", + "### Query Performance Analysis\n", + "\n", + "Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:\n", + "\n", + "1. **Vehicle trip history** (clustered by vehicle_id)\n", + "2. **Time-based fleet analysis** (clustered by trip_date)\n", + "3. **Combined vehicle + time queries** (optimal for our clustering)\n", + "\n", + "### Expected Performance Benefits\n", + "\n", + "With liquid clustering, these queries should be significantly faster because:\n", + "\n", + "- **Data locality**: Related records are physically grouped together\n", + "- **Reduced I/O**: Less data needs to be read from disk\n", + "- **Automatic optimization**: No manual tuning required" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "=== Query 1: Vehicle Trip History ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+----------+-------------------+--------------+--------+-------------+-----------+\n", + "|vehicle_id| trip_date| route_id|distance|fuel_consumed|load_factor|\n", + "+----------+-------------------+--------------+--------+-------------+-----------+\n", + "| VH0001|2024-12-22 06:01:00|RT_MIA_ORL_005| 417.19| 72.73| 96|\n", + "| VH0001|2024-12-15 12:53:00|RT_LAX_SFO_002| 99.26| 16.85| 84|\n", + "| VH0001|2024-12-13 20:05:00|RT_HOU_DAL_004| 488.8| 62.62| 96|\n", + "| VH0001|2024-12-03 11:07:00|RT_HOU_DAL_004| 519.16| 69.75| 99|\n", + "| VH0001|2024-12-01 05:12:00|RT_HOU_DAL_004| 71.26| 14.88| 87|\n", + "| VH0001|2024-11-23 06:58:00|RT_LAX_SFO_002| 348.19| 68.93| 98|\n", + "| VH0001|2024-11-22 09:04:00|RT_NYC_MAN_001| 136.21| 18.61| 81|\n", + "| VH0001|2024-11-20 13:03:00|RT_CHI_DET_003| 89.91| 16.35| 82|\n", + "| VH0001|2024-11-16 11:09:00|RT_HOU_DAL_004| 605.19| 67.39| 97|\n", + "| VH0001|2024-11-14 11:48:00|RT_LAX_SFO_002| 96.21| 13.51| 93|\n", + "| VH0001|2024-11-11 18:27:00|RT_HOU_DAL_004| 58.57| 13.12| 85|\n", + "| VH0001|2024-11-04 13:08:00|RT_MIA_ORL_005| 336.23| 79.59| 87|\n", + "| VH0001|2024-10-23 08:36:00|RT_CHI_DET_003| 75.64| 11.85| 87|\n", + "| VH0001|2024-10-03 09:22:00|RT_CHI_DET_003| 137.81| 15.87| 80|\n", + "| VH0001|2024-09-30 15:41:00|RT_LAX_SFO_002| 58.77| 9.53| 89|\n", + "| VH0001|2024-09-27 08:59:00|RT_NYC_MAN_001| 393.69| 71.33| 
82|\n", + "| VH0001|2024-09-21 14:44:00|RT_HOU_DAL_004| 48.18| 8.54| 79|\n", + "| VH0001|2024-09-12 08:28:00|RT_CHI_DET_003| 542.22| 70.83| 90|\n", + "| VH0001|2024-08-22 06:21:00|RT_LAX_SFO_002| 72.55| 13.76| 98|\n", + "| VH0001|2024-08-16 04:26:00|RT_LAX_SFO_002| 42.08| 8.76| 92|\n", + "+----------+-------------------+--------------+--------+-------------+-----------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Records found: 31\n", + "\n", + "=== Query 2: Recent Fuel Efficiency Issues ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-------------------+----------+--------------+--------+-------------+----+\n", + "| trip_date|vehicle_id| route_id|distance|fuel_consumed| mpg|\n", + "+-------------------+----------+--------------+--------+-------------+----+\n", + "|2024-08-03 07:41:00| VH0114|RT_NYC_MAN_001| 31.71| 10.62|2.99|\n", + "|2024-07-04 14:10:00| VH0416|RT_LAX_SFO_002| 31.57| 10.57|2.99|\n", + "|2024-11-10 09:49:00| VH0444|RT_MIA_ORL_005| 31.83| 10.6| 3.0|\n", + "|2024-09-30 16:50:00| VH0362|RT_LAX_SFO_002| 31.78| 10.61| 3.0|\n", + "|2024-11-29 13:49:00| VH0117|RT_LAX_SFO_002| 31.71| 10.54|3.01|\n", + "|2024-06-03 13:03:00| VH0413|RT_NYC_MAN_001| 31.9| 10.58|3.02|\n", + "|2024-10-03 08:02:00| VH0452|RT_NYC_MAN_001| 31.58| 10.35|3.05|\n", + "|2024-10-19 18:05:00| VH0274|RT_MIA_ORL_005| 32.63| 10.58|3.08|\n", + "|2024-08-03 14:49:00| VH0058|RT_CHI_DET_003| 31.61| 10.27|3.08|\n", + "|2024-07-14 08:26:00| VH0118|RT_MIA_ORL_005| 32.02| 10.39|3.08|\n", + "|2024-11-23 19:32:00| VH0220|RT_HOU_DAL_004| 32.23| 10.39| 3.1|\n", + "|2024-09-13 15:13:00| VH0167|RT_HOU_DAL_004| 32.17| 10.35|3.11|\n", + "|2024-06-17 14:21:00| VH0426|RT_CHI_DET_003| 32.02| 10.29|3.11|\n", + "|2024-10-18 10:31:00| VH0202|RT_NYC_MAN_001| 32.09| 10.27|3.12|\n", + "|2024-07-26 14:54:00| VH0139|RT_HOU_DAL_004| 32.82| 10.52|3.12|\n", + "|2024-07-07 08:06:00| VH0383|RT_NYC_MAN_001| 32.62| 10.45|3.12|\n", + "|2024-11-05 06:39:00| VH0196|RT_NYC_MAN_001| 33.08| 10.56|3.13|\n", + "|2024-06-24 12:30:00| VH0162|RT_CHI_DET_003| 33.16| 10.61|3.13|\n", + "|2024-11-20 12:14:00| VH0388|RT_MIA_ORL_005| 31.98| 10.18|3.14|\n", + "|2024-12-11 06:59:00| VH0302|RT_HOU_DAL_004| 32.19| 10.22|3.15|\n", + "+-------------------+----------+--------------+--------+-------------+----+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Fuel efficiency issues found: 11804\n", + "\n", + "=== Query 3: Vehicle Performance Trends ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+----------+-------------------+--------------+--------+--------+-----------+\n", + "|vehicle_id| trip_date| route_id|distance|duration|load_factor|\n", + "+----------+-------------------+--------------+--------+--------+-----------+\n", + "| VH0001|2024-04-05 10:59:00|RT_LAX_SFO_002| 59.35| 187.65| 80|\n", + "| VH0001|2024-04-14 23:04:00|RT_LAX_SFO_002| 46.08| 173.74| 91|\n", + "| VH0001|2024-05-01 14:40:00|RT_CHI_DET_003| 108.14| 217.92| 77|\n", + "| VH0001|2024-05-11 07:48:00|RT_NYC_MAN_001| 603.41| 701.52| 91|\n", + "| VH0001|2024-06-16 11:51:00|RT_LAX_SFO_002| 554.8| 481.12| 88|\n", + "| VH0001|2024-06-24 13:48:00|RT_HOU_DAL_004| 89.9| 160.75| 77|\n", + "| VH0001|2024-07-18 11:37:00|RT_CHI_DET_003| 418.77| 679.2| 97|\n", + "| VH0001|2024-07-23 
07:31:00|RT_HOU_DAL_004| 316.56| 767.37| 99|\n", + "| VH0001|2024-08-12 06:57:00|RT_CHI_DET_003| 98.6| 88.27| 99|\n", + "| VH0001|2024-08-16 04:26:00|RT_LAX_SFO_002| 42.08| 127.55| 92|\n", + "| VH0001|2024-08-22 06:21:00|RT_LAX_SFO_002| 72.55| 77.47| 98|\n", + "| VH0001|2024-09-12 08:28:00|RT_CHI_DET_003| 542.22| 610.28| 90|\n", + "| VH0001|2024-09-21 14:44:00|RT_HOU_DAL_004| 48.18| 107.57| 79|\n", + "| VH0001|2024-09-27 08:59:00|RT_NYC_MAN_001| 393.69| 754.9| 82|\n", + "| VH0001|2024-09-30 15:41:00|RT_LAX_SFO_002| 58.77| 160.73| 89|\n", + "| VH0001|2024-10-03 09:22:00|RT_CHI_DET_003| 137.81| 234.32| 80|\n", + "| VH0001|2024-10-23 08:36:00|RT_CHI_DET_003| 75.64| 135.39| 87|\n", + "| VH0001|2024-11-04 13:08:00|RT_MIA_ORL_005| 336.23| 574.36| 87|\n", + "| VH0001|2024-11-11 18:27:00|RT_HOU_DAL_004| 58.57| 90.88| 85|\n", + "| VH0001|2024-11-14 11:48:00|RT_LAX_SFO_002| 96.21| 116.48| 93|\n", + "+----------+-------------------+--------------+--------+--------+-----------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "Performance trend records found: 236\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Demonstrate liquid clustering benefits with optimized queries\n", + "\n", + "\n", + "# Query 1: Vehicle trip history - benefits from vehicle_id clustering\n", + "\n", + "print(\"=== Query 1: Vehicle Trip History ===\")\n", + "\n", + "vehicle_history = spark.sql(\"\"\"\n", + "\n", + "SELECT vehicle_id, trip_date, route_id, distance, fuel_consumed, load_factor\n", + "\n", + "FROM transportation.analytics.fleet_trips\n", + "\n", + "WHERE vehicle_id = 'VH0001'\n", + "\n", + "ORDER BY trip_date DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "vehicle_history.show()\n", + "\n", + "print(f\"Records found: {vehicle_history.count()}\")\n", + "\n", + "\n", + "\n", + "# Query 2: Time-based fuel efficiency analysis - benefits from trip_date clustering\n", + "\n", + "print(\"\\n=== Query 2: Recent Fuel Efficiency Issues ===\")\n", + "\n", + "fuel_efficiency = spark.sql(\"\"\"\n", + "\n", + "SELECT trip_date, vehicle_id, route_id, distance, fuel_consumed,\n", + "\n", + " ROUND(distance / fuel_consumed, 2) as mpg\n", + "\n", + "FROM transportation.analytics.fleet_trips\n", + "\n", + "WHERE trip_date >= '2024-06-01' AND (distance / fuel_consumed) < 15\n", + "\n", + "ORDER BY mpg ASC, trip_date DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "fuel_efficiency.show()\n", + "\n", + "print(f\"Fuel efficiency issues found: {fuel_efficiency.count()}\")\n", + "\n", + "\n", + "\n", + "# Query 3: Combined vehicle + time query - optimal for our clustering strategy\n", + "\n", + "print(\"\\n=== Query 3: Vehicle Performance Trends ===\")\n", + "\n", + "performance_trends = spark.sql(\"\"\"\n", + "\n", + "SELECT vehicle_id, trip_date, route_id, distance, duration, load_factor\n", + "\n", + "FROM transportation.analytics.fleet_trips\n", + "\n", + "WHERE vehicle_id LIKE 'VH000%' AND trip_date >= '2024-04-01'\n", + "\n", + "ORDER BY vehicle_id, trip_date\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "performance_trends.show()\n", + "\n", + "print(f\"Performance trend records found: {performance_trends.count()}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Step 6: Analyze Clustering Effectiveness\n", + "\n", + "### Understanding the Impact\n", + "\n", + "Let's examine how liquid clustering has organized our data and analyze some 
aggregate statistics to demonstrate the transportation insights possible with this optimized structure.\n", + "\n", + "### Key Analytics\n", + "\n", + "- **Vehicle utilization** and performance metrics\n", + "- **Route efficiency** and fuel consumption analysis\n", + "- **Fleet capacity utilization** and load factors\n", + "- **Operational cost trends** and optimization opportunities" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "=== Vehicle Performance Analysis ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+----------+-----------+--------------+----------+-------+---------------+-----------+\n", + "|vehicle_id|total_trips|total_distance|total_fuel|avg_mpg|avg_load_factor|total_miles|\n", + "+----------+-----------+--------------+----------+-------+---------------+-----------+\n", + "| VH0051| 59| 13834.0| 2033.34| 6.57| 87.44| 13834.0|\n", + "| VH0123| 60| 13633.12| 1879.83| 7.08| 84.27| 13633.0|\n", + "| VH0453| 57| 12890.51| 1937.5| 6.63| 86.86| 12891.0|\n", + "| VH0343| 54| 12846.02| 1855.01| 6.95| 87.28| 12846.0|\n", + "| VH0088| 57| 12547.45| 1816.14| 6.88| 85.05| 12547.0|\n", + "| VH0238| 59| 12448.77| 1814.0| 7.05| 86.02| 12449.0|\n", + "| VH0278| 53| 12418.19| 1824.83| 6.77| 87.13| 12418.0|\n", + "| VH0427| 54| 12406.78| 1753.24| 6.86| 87.31| 12407.0|\n", + "| VH0406| 60| 12304.12| 1810.67| 6.6| 87.52| 12304.0|\n", + "| VH0049| 60| 12277.11| 1786.7| 6.74| 86.2| 12277.0|\n", + "| VH0242| 58| 12200.91| 1794.24| 6.57| 86.6| 12201.0|\n", + "| VH0253| 49| 12046.66| 1631.43| 7.0| 87.39| 12047.0|\n", + "| VH0160| 57| 12003.29| 1622.78| 7.13| 86.44| 12003.0|\n", + "| VH0126| 55| 11965.73| 1809.87| 6.63| 86.84| 11966.0|\n", + "| VH0280| 40| 11953.7| 1677.1| 7.02| 86.33| 11954.0|\n", + "| VH0362| 60| 11920.8| 1718.56| 6.8| 86.6| 11921.0|\n", + "| VH0114| 60| 11910.09| 1783.69| 6.33| 86.62| 11910.0|\n", + "| VH0498| 51| 11864.91| 1701.58| 6.74| 85.78| 11865.0|\n", + "| VH0111| 60| 11821.77| 1702.64| 6.48| 86.87| 11822.0|\n", + "| VH0244| 59| 11607.77| 1665.22| 6.74| 87.83| 11608.0|\n", + "+----------+-----------+--------------+----------+-------+---------------+-----------+\n", + "only showing top 20 rows\n", + "\n", + "\n", + "=== Route Efficiency Analysis ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+--------------+-----------+------------+------------+---------+---------------+\n", + "| route_id|total_trips|avg_distance|avg_duration|avg_speed|avg_load_factor|\n", + "+--------------+-----------+------------+------------+---------+---------------+\n", + "|RT_NYC_MAN_001| 4111| 177.28| 255.39| 38.95| 86.55|\n", + "|RT_CHI_DET_003| 4060| 185.45| 265.02| 39.14| 86.34|\n", + "|RT_LAX_SFO_002| 4029| 180.58| 258.5| 39.27| 86.44|\n", + "|RT_MIA_ORL_005| 3991| 181.71| 260.37| 39.22| 86.31|\n", + "|RT_HOU_DAL_004| 3985| 182.39| 261.78| 39.02| 86.49|\n", + "+--------------+-----------+------------+------------+---------+---------------+\n", + "\n", + "\n", + "=== Fleet Fuel Consumption Analysis ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+------------------------+----------+-------+---------------+\n", + "|fuel_efficiency_category|trip_count|avg_mpg|total_fuel_used|\n", + "+------------------------+----------+-------+---------------+\n", + "| Poor (10-14 MPG)| 941| 10.79| 21486.63|\n", + "| Very Poor (<10 MPG)| 19235| 6.46| 
514816.86|\n", + "+------------------------+----------+-------+---------------+\n", + "\n", + "\n", + "=== Monthly Operational Trends ===\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/plain": [ + "+-------+-----------+----------------+------------+---------------+---------------+\n", + "| month|total_trips|monthly_distance|monthly_fuel|avg_load_factor|active_vehicles|\n", + "+-------+-----------+----------------+------------+---------------+---------------+\n", + "|2024-01| 1756| 319156.46| 46719.73| 86.38| 479|\n", + "|2024-02| 1559| 268888.27| 39614.96| 86.14| 467|\n", + "|2024-03| 1703| 318040.34| 46279.08| 86.41| 478|\n", + "|2024-04| 1641| 303054.54| 44354.61| 86.33| 477|\n", + "|2024-05| 1713| 316645.41| 46364.37| 86.38| 476|\n", + "|2024-06| 1700| 312248.26| 46181.31| 86.65| 474|\n", + "|2024-07| 1637| 291667.53| 43144.47| 86.67| 482|\n", + "|2024-08| 1704| 302704.22| 44418.71| 86.5| 481|\n", + "|2024-09| 1640| 295376.12| 42881.73| 86.49| 475|\n", + "|2024-10| 1700| 307452.9| 44814.6| 86.36| 480|\n", + "|2024-11| 1696| 316414.16| 46494.07| 86.55| 472|\n", + "|2024-12| 1727| 309624.27| 45035.85| 86.25| 480|\n", + "+-------+-----------+----------------+------------+---------------+---------------+\n", + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Analyze clustering effectiveness and transportation insights\n", + "\n", + "\n", + "# Vehicle performance analysis\n", + "\n", + "print(\"=== Vehicle Performance Analysis ===\")\n", + "\n", + "vehicle_performance = spark.sql(\"\"\"\n", + "\n", + "SELECT vehicle_id, COUNT(*) as total_trips,\n", + "\n", + " ROUND(SUM(distance), 2) as total_distance,\n", + "\n", + " ROUND(SUM(fuel_consumed), 2) as total_fuel,\n", + "\n", + " ROUND(AVG(distance / fuel_consumed), 2) as avg_mpg,\n", + "\n", + " ROUND(AVG(load_factor), 2) as avg_load_factor,\n", + "\n", + " ROUND(SUM(distance), 0) as total_miles\n", + "\n", + "FROM transportation.analytics.fleet_trips\n", + "\n", + "GROUP BY vehicle_id\n", + "\n", + "ORDER BY total_miles DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "vehicle_performance.show()\n", + "\n", + "\n", + "# Route efficiency analysis\n", + "\n", + "print(\"\\n=== Route Efficiency Analysis ===\")\n", + "\n", + "route_efficiency = spark.sql(\"\"\"\n", + "\n", + "SELECT route_id, COUNT(*) as total_trips,\n", + "\n", + " ROUND(AVG(distance), 2) as avg_distance,\n", + "\n", + " ROUND(AVG(duration), 2) as avg_duration,\n", + "\n", + " ROUND(AVG(distance / duration * 60), 2) as avg_speed,\n", + "\n", + " ROUND(AVG(load_factor), 2) as avg_load_factor\n", + "\n", + "FROM transportation.analytics.fleet_trips\n", + "\n", + "GROUP BY route_id\n", + "\n", + "ORDER BY total_trips DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "route_efficiency.show()\n", + "\n", + "\n", + "# Fleet fuel consumption analysis\n", + "\n", + "print(\"\\n=== Fleet Fuel Consumption Analysis ===\")\n", + "\n", + "fuel_analysis = spark.sql(\"\"\"\n", + "\n", + "SELECT \n", + "\n", + " CASE \n", + "\n", + " WHEN distance / fuel_consumed >= 25 THEN 'Excellent (25+ MPG)'\n", + "\n", + " WHEN distance / fuel_consumed >= 20 THEN 'Good (20-24 MPG)'\n", + "\n", + " WHEN distance / fuel_consumed >= 15 THEN 'Average (15-19 MPG)'\n", + "\n", + " WHEN distance / fuel_consumed >= 10 THEN 'Poor (10-14 MPG)'\n", + "\n", + " ELSE 'Very Poor (<10 MPG)'\n", + "\n", + " END as fuel_efficiency_category,\n", + "\n", + " COUNT(*) as trip_count,\n", + "\n", + " ROUND(AVG(distance / 
fuel_consumed), 2) as avg_mpg,\n", + "\n", + " ROUND(SUM(fuel_consumed), 2) as total_fuel_used\n", + "\n", + "FROM transportation.analytics.fleet_trips\n", + "\n", + "GROUP BY \n", + "\n", + " CASE \n", + "\n", + " WHEN distance / fuel_consumed >= 25 THEN 'Excellent (25+ MPG)'\n", + "\n", + " WHEN distance / fuel_consumed >= 20 THEN 'Good (20-24 MPG)'\n", + "\n", + " WHEN distance / fuel_consumed >= 15 THEN 'Average (15-19 MPG)'\n", + "\n", + " WHEN distance / fuel_consumed >= 10 THEN 'Poor (10-14 MPG)'\n", + "\n", + " ELSE 'Very Poor (<10 MPG)'\n", + "\n", + " END\n", + "\n", + "ORDER BY avg_mpg DESC\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "fuel_analysis.show()\n", + "\n", + "\n", + "# Monthly operational trends\n", + "\n", + "print(\"\\n=== Monthly Operational Trends ===\")\n", + "\n", + "monthly_trends = spark.sql(\"\"\"\n", + "\n", + "SELECT DATE_FORMAT(trip_date, 'yyyy-MM') as month,\n", + "\n", + " COUNT(*) as total_trips,\n", + "\n", + " ROUND(SUM(distance), 2) as monthly_distance,\n", + "\n", + " ROUND(SUM(fuel_consumed), 2) as monthly_fuel,\n", + "\n", + " ROUND(AVG(load_factor), 2) as avg_load_factor,\n", + "\n", + " COUNT(DISTINCT vehicle_id) as active_vehicles\n", + "\n", + "FROM transportation.analytics.fleet_trips\n", + "\n", + "GROUP BY DATE_FORMAT(trip_date, 'yyyy-MM')\n", + "\n", + "ORDER BY month\n", + "\n", + "\"\"\")\n", + "\n", + "\n", + "\n", + "monthly_trends.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Key Takeaways: Delta Liquid Clustering in AIDP\n", + "\n", + "### What We Demonstrated\n", + "\n", + "1. **Automatic Optimization**: Created a table with `CLUSTER BY (vehicle_id, trip_date)` and let Delta automatically optimize data layout\n", + "\n", + "2. **Performance Benefits**: Queries on clustered columns (vehicle_id, trip_date) are significantly faster due to data locality\n", + "\n", + "3. **Zero Maintenance**: No manual partitioning, bucketing, or Z-Ordering required - Delta handles it automatically\n", + "\n", + "4. **Real-World Use Case**: Transportation analytics where fleet monitoring and route optimization are critical\n", + "\n", + "### AIDP Advantages\n", + "\n", + "- **Unified Analytics**: Seamlessly integrates with other AIDP services\n", + "- **Governance**: Catalog and schema isolation for transportation data\n", + "- **Performance**: Optimized for both OLAP and OLTP workloads\n", + "- **Scalability**: Handles transportation-scale data volumes effortlessly\n", + "\n", + "### Best Practices for Liquid Clustering\n", + "\n", + "1. **Choose clustering columns** based on your most common query patterns\n", + "2. **Start with 1-4 columns** - too many can reduce effectiveness\n", + "3. **Consider cardinality** - high-cardinality columns work best\n", + "4. **Monitor and adjust** as query patterns evolve\n", + "\n", + "### Next Steps\n", + "\n", + "- Explore other AIDP features like AI/ML integration\n", + "- Try liquid clustering with different column combinations\n", + "- Scale up to larger transportation datasets\n", + "- Integrate with real GPS tracking and IoT sensor data\n", + "\n", + "This notebook demonstrates how Oracle AI Data Platform makes advanced transportation analytics accessible while maintaining enterprise-grade performance and governance." 
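, + "\n", + "\n", + "### Appendix: Verifying and Maintaining Clustering (Optional)\n", + "\n", + "The snippet below is a minimal sketch, not part of the demo above, for checking which columns a Delta table is clustered by and for running a clustering maintenance pass. The exact fields returned by `DESCRIBE DETAIL` (for example a `clusteringColumns` column), the behavior of `OPTIMIZE` on liquid clustered tables, and support for `ALTER TABLE ... CLUSTER BY` depend on the Delta Lake version available in your AIDP Workbench, so treat this as a starting point rather than a guaranteed API. The alternative key `(route_id, trip_date)` is shown purely for illustration.\n", + "\n", + "```python\n", + "# Inspect table details; recent Delta versions report the clustering columns here\n", + "spark.sql(\"DESCRIBE DETAIL transportation.analytics.fleet_trips\").show(truncate=False)\n", + "\n", + "# Run an incremental clustering pass over data written since the last optimize\n", + "spark.sql(\"OPTIMIZE transportation.analytics.fleet_trips\")\n", + "\n", + "# If query patterns shift toward route-level analysis, the clustering key can be changed\n", + "# (hypothetical alternative key, shown for illustration only)\n", + "spark.sql(\"ALTER TABLE transportation.analytics.fleet_trips CLUSTER BY (route_id, trip_date)\")\n", + "```\n", + "\n", + "Changing the clustering columns affects how newly written or re-optimized data is laid out; existing files are typically rewritten gradually by subsequent `OPTIMIZE` runs rather than all at once."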
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}