# Education: Delta Liquid Clustering Demo


## Overview


This notebook demonstrates **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using an education analytics use case. Liquid clustering organizes the data layout around the clustering columns you declare, so queries on those columns run efficiently without manual partitioning or Z-Ordering.

### What is Liquid Clustering?

Liquid clustering groups related rows together in storage based on the clustering columns you define. The layout is maintained during data ingestion and maintenance operations, providing:

- **Automatic optimization**: No manual tuning required
- **Improved query performance**: Faster queries on clustered columns
- **Reduced maintenance**: No need for manual repartitioning
- **Adaptive clustering**: Adjusts as data patterns change

### Use Case: Student Performance Analytics and Learning Management

We'll analyze student learning data and academic performance metrics. Our clustering strategy will optimize for:

- **Student-specific queries**: Fast lookups by student ID
- **Time-based analysis**: Efficient filtering by academic period and assessment dates
- **Performance patterns**: Quick aggregation by subject and learning outcomes

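## Step 1: Set Up the AIDP Environment

This notebook uses the Spark session (`spark`) that already exists in your AIDP environment. The first cell creates the catalog and schema that hold the demo table.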

In [None]:
# Create education catalog and analytics schema

# In AIDP, catalogs provide data isolation and governance

spark.sql("CREATE CATALOG IF NOT EXISTS education")

spark.sql("CREATE SCHEMA IF NOT EXISTS education.analytics")

print("Education catalog and analytics schema created successfully!")

Education catalog and analytics schema created successfully!


## Step 2: Create Delta Table with Liquid Clustering

### Table Design

Our `student_assessments` table will store:

- **student_id**: Unique student identifier
- **assessment_date**: Date of assessment or assignment
- **subject**: Academic subject area
- **score**: Assessment score (0-100)
- **grade_level**: Student grade level
- **completion_time**: Time spent on assessment (minutes)
- **engagement_score**: Student engagement metric (0-100)

### Clustering Strategy

We'll cluster by `student_id` and `assessment_date` because:

- **student_id**: Each student has many assessments, so clustering on this column keeps a student's learning history together
- **assessment_date**: Time-based queries are critical for academic tracking, semester analysis, and intervention planning
- This combination optimizes for both individual student monitoring and temporal academic performance analysis

In [None]:
# Create Delta table with liquid clustering

# CLUSTER BY defines the columns for automatic optimization

spark.sql("""

CREATE TABLE IF NOT EXISTS education.analytics.student_assessments (

    student_id STRING,

    assessment_date DATE,

    subject STRING,

    score DECIMAL(5,2),

    grade_level STRING,

    completion_time DECIMAL(6,2),

    engagement_score INT

)

USING DELTA

CLUSTER BY (student_id, assessment_date)

""")

print("Delta table with liquid clustering created successfully!")

print("Clustering will automatically optimize data layout for queries on student_id and assessment_date.")

Delta table with liquid clustering created successfully!
Clustering will automatically optimize data layout for queries on student_id and assessment_date.
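
To confirm that the table picked up the clustering definition, Delta's `DESCRIBE DETAIL` command can be queried. This is a minimal sketch, assuming the Delta runtime behind AIDP reports a `clusteringColumns` field as recent Delta Lake releases do. The helper name `get_clustering_columns` is hypothetical and not part of any library.

```python
def get_clustering_columns(spark, table_name):
    """Return the clustering columns reported by DESCRIBE DETAIL (hypothetical helper)."""
    detail = spark.sql(f"DESCRIBE DETAIL {table_name}").collect()[0].asDict()
    # Assumption: the runtime exposes `clusteringColumns`; fall back to an empty list if not.
    return list(detail.get("clusteringColumns") or [])

# In the notebook, where `spark` already exists:
# get_clustering_columns(spark, "education.analytics.student_assessments")
```

If the field is present, the expected result for this table would be the two columns named in `CLUSTER BY`.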


## Step 3: Generate Education Sample Data

### Data Generation Strategy

We'll create realistic student assessment data including:

- **3,000 students** with 15-30 assessments each, spread across 2024
- **Subjects**: Math, English, Science, History, Art, Physical Education
- **Performance patterns**: Subject difficulty, grade-level adjustments, random variation, and engagement effects
- **Grade levels**: Kindergarten through 12th grade, with one randomly assigned level per student

### Why This Data Pattern?

This data simulates real education scenarios where:

- Student performance varies by subject and time
- Learning progress needs longitudinal tracking
- Intervention strategies require early identification
- Curriculum effectiveness drives teaching improvements
- Standardized testing and reporting require temporal analysis

In [None]:
# Generate sample student assessment data

# Only standard-library modules are needed to generate the data

import random

from datetime import datetime, timedelta


# Define education data constants

SUBJECTS = ['Math', 'English', 'Science', 'History', 'Art', 'Physical Education']

GRADE_LEVELS = ['Kindergarten', '1st Grade', '2nd Grade', '3rd Grade', '4th Grade', '5th Grade', 
                '6th Grade', '7th Grade', '8th Grade', '9th Grade', '10th Grade', '11th Grade', '12th Grade']

# Base performance parameters by subject

PERFORMANCE_PARAMS = {

    'Math': {'base_score': 75, 'difficulty': 1.2, 'time_factor': 1.5},

    'English': {'base_score': 78, 'difficulty': 1.0, 'time_factor': 1.2},

    'Science': {'base_score': 72, 'difficulty': 1.3, 'time_factor': 1.4},

    'History': {'base_score': 70, 'difficulty': 1.1, 'time_factor': 1.1},

    'Art': {'base_score': 82, 'difficulty': 0.8, 'time_factor': 0.9},

    'Physical Education': {'base_score': 85, 'difficulty': 0.7, 'time_factor': 0.8}

}

# Grade level adjustments

GRADE_ADJUSTMENTS = {

    'Kindergarten': 0.7, '1st Grade': 0.75, '2nd Grade': 0.8, '3rd Grade': 0.82,

    '4th Grade': 0.85, '5th Grade': 0.87, '6th Grade': 0.8, '7th Grade': 0.78,

    '8th Grade': 0.76, '9th Grade': 0.74, '10th Grade': 0.72, '11th Grade': 0.7, '12th Grade': 0.68

}


# Generate student assessment records

assessment_data = []

base_date = datetime(2024, 1, 1)


# Create 3,000 students with 15-30 assessments each

for student_num in range(1, 3001):

    student_id = f"STU{student_num:06d}"
    
    # Assign grade level

    grade_level = random.choice(GRADE_LEVELS)

    grade_factor = GRADE_ADJUSTMENTS[grade_level]
    
    # Each student gets 15-30 assessments over 12 months

    num_assessments = random.randint(15, 30)
    
    for i in range(num_assessments):

        # Spread assessments over 12 months

        days_offset = random.randint(0, 365)

        assessment_date = base_date + timedelta(days=days_offset)
        
        # Select subject

        subject = random.choice(SUBJECTS)

        params = PERFORMANCE_PARAMS[subject]
        
        # Calculate score with variations

        score_variation = random.uniform(0.7, 1.3)

        base_score = params['base_score'] * grade_factor / params['difficulty']

        score = round(min(100, max(0, base_score * score_variation)), 2)
        
        # Calculate completion time

        time_variation = random.uniform(0.8, 1.5)

        base_time = 45 * params['time_factor']  # 45 minutes base time

        completion_time = round(base_time * time_variation, 2)
        
        # Engagement score (affects performance)

        engagement_score = random.randint(40, 100)

        # Slightly adjust score based on engagement

        engagement_factor = engagement_score / 100.0

        score = round(min(100, score * (0.8 + 0.4 * engagement_factor)), 2)
        
        assessment_data.append({

            "student_id": student_id,

            "assessment_date": assessment_date.date(),

            "subject": subject,

            "score": float(score),

            "grade_level": grade_level,

            "completion_time": float(completion_time),

            "engagement_score": int(engagement_score)

        })



print(f"Generated {len(assessment_data)} student assessment records")

print("Sample record:", assessment_data[0])

Generated 67514 student assessment records
Sample record: {'student_id': 'STU000001', 'assessment_date': datetime.date(2024, 10, 21), 'subject': 'Science', 'score': 44.62, 'grade_level': '12th Grade', 'completion_time': 79.61, 'engagement_score': 50}
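
The scoring logic in the generator can be checked by hand. With the random variation fixed at its midpoint of 1.0, the score is `base_score * grade_factor / difficulty`, clipped to 0-100 and then scaled by `0.8 + 0.4 * engagement_score / 100`. The sketch below repeats only a subset of the constants, and the function name `midpoint_score` is introduced here for illustration.

```python
# Subset of the constants from the generator cell, repeated so the sketch is self-contained.
PARAMS = {
    "Science": {"base_score": 72, "difficulty": 1.3},
    "Physical Education": {"base_score": 85, "difficulty": 0.7},
}
GRADE_FACTORS = {"5th Grade": 0.87, "12th Grade": 0.68}

def midpoint_score(subject, grade_level, engagement_score):
    """Score with score_variation fixed at 1.0, mirroring the generator's two clipping steps."""
    p = PARAMS[subject]
    base = p["base_score"] * GRADE_FACTORS[grade_level] / p["difficulty"]
    score = round(min(100, max(0, base)), 2)
    return round(min(100, score * (0.8 + 0.4 * engagement_score / 100.0)), 2)

print(midpoint_score("Science", "12th Grade", 50))
print(midpoint_score("Physical Education", "5th Grade", 50))
```

Science for a 12th grader centers well below 50, while Physical Education for a 5th grader reaches the cap of 100. This is consistent with Science having the lowest and Physical Education the highest subject averages in the outputs later in the notebook.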


## Step 4: Insert Data Using PySpark

### Data Insertion Strategy

We'll use PySpark to:

1. **Create DataFrame** from our generated data
2. **Insert into Delta table** with liquid clustering
3. **Verify the insertion** with a sample query

### Why PySpark for Insertion?

- **Distributed processing**: Handles large datasets efficiently
- **Type safety**: Ensures data integrity
- **Optimization**: Leverages Spark's query optimization
- **Liquid clustering**: Automatically applies clustering during insertion

In [None]:
# Insert data using PySpark DataFrame operations

# assessment_data is the list of dictionaries generated in the previous cell


# Create DataFrame from generated data

df_assessments = spark.createDataFrame(assessment_data)


# Display schema and sample data

print("DataFrame Schema:")

df_assessments.printSchema()



print("\nSample Data:")

df_assessments.show(5)


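# Insert data into the Delta table created with CLUSTER BY (student_id, assessment_date)

# mode("overwrite") replaces existing rows, so re-running this cell does not duplicate data

df_assessments.write.mode("overwrite").saveAsTable("education.analytics.student_assessments")


# Verify the insertion by counting rows in the table itself rather than in the source DataFrame

inserted_count = spark.table("education.analytics.student_assessments").count()

print(f"\nSuccessfully inserted {inserted_count} records into education.analytics.student_assessments")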

print("Liquid clustering automatically optimized the data layout during insertion!")

DataFrame Schema:
root
 |-- assessment_date: date (nullable = true)
 |-- completion_time: double (nullable = true)
 |-- engagement_score: long (nullable = true)
 |-- grade_level: string (nullable = true)
 |-- score: double (nullable = true)
 |-- student_id: string (nullable = true)
 |-- subject: string (nullable = true)


Sample Data:


+---------------+---------------+----------------+-----------+-----+----------+------------------+
|assessment_date|completion_time|engagement_score|grade_level|score|student_id|           subject|
+---------------+---------------+----------------+-----------+-----+----------+------------------+
|     2024-10-21|          79.61|              50| 12th Grade|44.62| STU000001|           Science|
|     2024-03-06|           52.9|              85| 12th Grade|88.76| STU000001|Physical Education|
|     2024-09-24|          34.43|              52| 12th Grade|60.94| STU000001|               Art|
|     2024-09-12|          83.62|              58| 12th Grade|48.87| STU000001|           Science|
|     2024-12-01|          47.97|              58| 12th Grade|68.91| STU000001|           English|
+---------------+---------------+----------------+-----------+-----+----------+------------------+
only showing top 5 rows




Successfully inserted 67514 records into education.analytics.student_assessments
Liquid clustering automatically optimized the data layout during insertion!
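
How much clustering happens at write time depends on the Delta Lake version behind the platform. Open-source Delta Lake documents `OPTIMIZE` as the command that triggers clustering on a liquid-clustered table. The sketch below assumes the AIDP runtime supports that command. This notebook does not confirm whether the step is required, and the helper name `optimize_table` is hypothetical.

```python
def optimize_table(spark, table_name):
    """Run OPTIMIZE so Delta rewrites files according to the table's clustering columns."""
    # Assumption: the runtime supports the Delta OPTIMIZE command for this table.
    return spark.sql(f"OPTIMIZE {table_name}")

# In the notebook, where `spark` already exists:
# optimize_table(spark, "education.analytics.student_assessments").show(truncate=False)
```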


## Step 5: Demonstrate Liquid Clustering Benefits

### Query Performance Analysis

The queries below filter on the clustering columns, which is where liquid clustering helps most. We'll run three query shapes that match our clustering strategy:

1. **Student assessment history** (clustered by student_id)
2. **Time-based academic analysis** (clustered by assessment_date)
3. **Combined student + time queries** (optimal for our clustering)

### Expected Performance Benefits

With liquid clustering, these queries can read far fewer files than they would on an unclustered table, and the gap widens as the table grows, because:

- **Data locality**: Related records are physically grouped together
- **Reduced I/O**: Less data needs to be read from disk
- **Automatic optimization**: No manual tuning required
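
The data locality and reduced I/O points can be illustrated without Spark. Delta keeps per-file minimum and maximum statistics, and a filter can skip any file whose range cannot contain the requested value. The following pure-Python simulation is a simplification: sorting by one column stands in for the multi-column clustering Delta actually performs, and the file size of 1,000 rows is an arbitrary choice.

```python
import random

def build_file_stats(rows, rows_per_file, clustered):
    """Split rows into 'files' and record (min, max) student_id per file."""
    ordered = sorted(rows) if clustered else rows
    files = [ordered[i:i + rows_per_file] for i in range(0, len(ordered), rows_per_file)]
    return [(min(f), max(f)) for f in files]

def files_to_read(file_stats, student_id):
    """Count files whose [min, max] range could contain the requested student_id."""
    return sum(1 for lo, hi in file_stats if lo <= student_id <= hi)

random.seed(0)
# 3,000 students with 20 rows each, written in random arrival order.
rows = [f"STU{n:06d}" for n in range(1, 3001) for _ in range(20)]
random.shuffle(rows)

unclustered_stats = build_file_stats(rows, rows_per_file=1000, clustered=False)
clustered_stats = build_file_stats(rows, rows_per_file=1000, clustered=True)

print("total files:", len(clustered_stats))
print("unclustered files read:", files_to_read(unclustered_stats, "STU001500"))
print("clustered files read:", files_to_read(clustered_stats, "STU001500"))
```

In the clustered layout a single file covers the requested student, while the shuffled layout leaves almost every file with a range that overlaps the lookup value.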

In [None]:
# Demonstrate liquid clustering benefits with optimized queries


# Query 1: Student assessment history - benefits from student_id clustering

print("=== Query 1: Student Assessment History ===")

student_history = spark.sql("""

SELECT student_id, assessment_date, subject, score, engagement_score

FROM education.analytics.student_assessments

WHERE student_id = 'STU000001'

ORDER BY assessment_date DESC

""")



student_history.show()

print(f"Records found: {student_history.count()}")



# Query 2: Time-based academic performance analysis - benefits from assessment_date clustering

print("\n=== Query 2: Recent Low Performance Issues ===")

low_performance = spark.sql("""

SELECT assessment_date, student_id, subject, score, grade_level

FROM education.analytics.student_assessments

WHERE assessment_date >= '2024-06-01' AND score < 60

ORDER BY score ASC, assessment_date DESC

""")



low_performance.show()

print(f"Low performance issues found: {low_performance.count()}")



# Query 3: Combined student + time query - optimal for our clustering strategy

print("\n=== Query 3: Student Performance Trends ===")

performance_trends = spark.sql("""

SELECT student_id, assessment_date, subject, score, engagement_score

FROM education.analytics.student_assessments

WHERE student_id LIKE 'STU000%' AND assessment_date >= '2024-04-01'

ORDER BY student_id, assessment_date

""")



performance_trends.show()

print(f"Performance trend records found: {performance_trends.count()}")

=== Query 1: Student Assessment History ===


+----------+---------------+------------------+-----+----------------+
|student_id|assessment_date|           subject|score|engagement_score|
+----------+---------------+------------------+-----+----------------+
| STU000001|     2024-12-01|           English|68.91|              58|
| STU000001|     2024-10-21|           Science|44.62|              50|
| STU000001|     2024-10-10|           English|52.53|              86|
| STU000001|     2024-10-08|Physical Education|77.74|              59|
| STU000001|     2024-09-25|           Science|32.35|              40|
| STU000001|     2024-09-24|               Art|60.94|              52|
| STU000001|     2024-09-12|           Science|48.87|              58|
| STU000001|     2024-09-05|           English|68.71|              98|
| STU000001|     2024-08-30|              Math|33.82|              64|
| STU000001|     2024-08-10|              Math|53.37|              60|
| STU000001|     2024-08-06|Physical Education|76.45|              80|
| STU0

Records found: 18

=== Query 2: Recent Low Performance Issues ===


+---------------+----------+-------+-----+------------+
|assessment_date|student_id|subject|score| grade_level|
+---------------+----------+-------+-----+------------+
|     2024-09-22| STU000220|Science|25.83|  12th Grade|
|     2024-10-01| STU001661|Science|25.87|  12th Grade|
|     2024-08-09| STU001500|Science|26.33|  12th Grade|
|     2024-09-02| STU001198|Science|26.54|  12th Grade|
|     2024-12-04| STU002836|Science|26.57|  11th Grade|
|     2024-08-26| STU000831|Science|26.61|  12th Grade|
|     2024-10-15| STU001401|Science|26.71|  11th Grade|
|     2024-10-11| STU001198|Science|26.78|  12th Grade|
|     2024-06-23| STU000386|Science|26.78|  11th Grade|
|     2024-07-08| STU001919|Science|26.95|  11th Grade|
|     2024-12-08| STU002914|Science|27.05|  12th Grade|
|     2024-12-05| STU002552|Science|27.06|  11th Grade|
|     2024-10-01| STU001135|Science|27.07|Kindergarten|
|     2024-10-15| STU001119|Science|27.28|  12th Grade|
|     2024-12-19| STU001299|Science|27.33|Kinder

Low performance issues found: 19160

=== Query 3: Student Performance Trends ===


+----------+---------------+------------------+-----+----------------+
|student_id|assessment_date|           subject|score|engagement_score|
+----------+---------------+------------------+-----+----------------+
| STU000001|     2024-04-11|           Science|37.37|              71|
| STU000001|     2024-04-13|Physical Education|90.22|              55|
| STU000001|     2024-04-25|           English|44.24|              71|
| STU000001|     2024-05-06|               Art|55.28|              83|
| STU000001|     2024-08-06|Physical Education|76.45|              80|
| STU000001|     2024-08-10|              Math|53.37|              60|
| STU000001|     2024-08-30|              Math|33.82|              64|
| STU000001|     2024-09-05|           English|68.71|              98|
| STU000001|     2024-09-12|           Science|48.87|              58|
| STU000001|     2024-09-24|               Art|60.94|              52|
| STU000001|     2024-09-25|           Science|32.35|              40|
| STU0

Performance trend records found: 17102
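
The queries above are not timed, so any performance claim is best checked by measuring. A minimal helper using only the standard library is sketched below. The name `timed` is introduced here, and on a table this small the timings are likely dominated by Spark overhead rather than by file layout.

```python
import time

def timed(label, action):
    """Run a zero-argument callable, print elapsed wall-clock seconds, and return both."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return result, elapsed

# In the notebook, wrap an action that forces execution, for example:
# timed("Query 1", lambda: student_history.count())
result, elapsed = timed("demo", lambda: sum(range(1000)))
```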


## Step 6: Analyze Clustering Effectiveness

### Understanding the Impact

The queries below compute aggregate statistics over the clustered table to show the education insights this structure supports. They scan the whole table, so they illustrate analytics breadth rather than file skipping.

### Key Analytics

- **Student performance patterns** and learning analytics
- **Subject difficulty analysis** and curriculum effectiveness
- **Grade level progression** and academic growth
- **Engagement correlations** and intervention opportunities
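
The grade-level query in this step orders rows with a `CASE` expression that maps Kindergarten to 0 and strips non-digits from the other labels. The same ordering rule in plain Python, which may be handy when sorting collected results, looks like this (the function name `grade_sort_key` is illustrative):

```python
import re

def grade_sort_key(grade_level):
    """Mirror the SQL ordering: Kindergarten first, then the numeric part of the label."""
    if grade_level == "Kindergarten":
        return 0
    return int(re.sub(r"[^0-9]", "", grade_level))

grades = ["10th Grade", "2nd Grade", "Kindergarten", "1st Grade"]
ordered_grades = sorted(grades, key=grade_sort_key)
print(ordered_grades)
```

A plain alphabetical sort would place "10th Grade" before "1st Grade", which is why the numeric key is needed.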

In [None]:
# Analyze clustering effectiveness and education insights


# Student performance analysis

print("=== Student Performance Analysis ===")

student_performance = spark.sql("""

SELECT student_id, COUNT(*) as total_assessments,

       ROUND(AVG(score), 2) as avg_score,

       ROUND(AVG(engagement_score), 2) as avg_engagement,

       ROUND(AVG(completion_time), 2) as avg_completion_time,

       grade_level

FROM education.analytics.student_assessments

GROUP BY student_id, grade_level

ORDER BY avg_score DESC

LIMIT 10

""")



student_performance.show()


# Subject performance analysis

print("\n=== Subject Performance Analysis ===")

subject_analysis = spark.sql("""

SELECT subject, COUNT(*) as total_assessments,

       ROUND(AVG(score), 2) as avg_score,

       ROUND(AVG(completion_time), 2) as avg_completion_time,

       ROUND(AVG(engagement_score), 2) as avg_engagement,

       COUNT(DISTINCT student_id) as unique_students

FROM education.analytics.student_assessments

GROUP BY subject

ORDER BY avg_score DESC

""")



subject_analysis.show()


# Grade level performance

print("\n=== Grade Level Performance ===")

grade_performance = spark.sql("""


SELECT 
    grade_level, 
    COUNT(*) AS total_assessments,
    ROUND(AVG(score), 2) AS avg_score,
    ROUND(AVG(engagement_score), 2) AS avg_engagement,
    COUNT(DISTINCT student_id) AS unique_students
FROM education.analytics.student_assessments
GROUP BY grade_level
ORDER BY 
    CASE 
        WHEN grade_level = 'Kindergarten' THEN 0
        ELSE CAST(REGEXP_REPLACE(grade_level, '[^0-9]', '') AS INT)
    END
""")



grade_performance.show()


# Engagement vs performance correlation

print("\n=== Engagement vs Performance Correlation ===")

engagement_correlation = spark.sql("""

SELECT 

    CASE 

        WHEN engagement_score >= 80 THEN 'High Engagement'

        WHEN engagement_score >= 60 THEN 'Medium Engagement'

        WHEN engagement_score >= 40 THEN 'Low Engagement'

        ELSE 'Very Low Engagement'

    END as engagement_level,

    COUNT(*) as assessment_count,

    ROUND(AVG(score), 2) as avg_score,

    ROUND(AVG(completion_time), 2) as avg_completion_time

FROM education.analytics.student_assessments

GROUP BY 

    CASE 

        WHEN engagement_score >= 80 THEN 'High Engagement'

        WHEN engagement_score >= 60 THEN 'Medium Engagement'

        WHEN engagement_score >= 40 THEN 'Low Engagement'

        ELSE 'Very Low Engagement'

    END

ORDER BY avg_score DESC

""")



engagement_correlation.show()


# Monthly academic trends

print("\n=== Monthly Academic Trends ===")

monthly_trends = spark.sql("""

SELECT DATE_FORMAT(assessment_date, 'yyyy-MM') as month,

       COUNT(*) as total_assessments,

       ROUND(AVG(score), 2) as avg_score,

       ROUND(AVG(engagement_score), 2) as avg_engagement,

       COUNT(DISTINCT student_id) as active_students

FROM education.analytics.student_assessments

GROUP BY DATE_FORMAT(assessment_date, 'yyyy-MM')

ORDER BY month

""")



monthly_trends.show()

=== Student Performance Analysis ===


+----------+-----------------+---------+--------------+-------------------+-----------+
|student_id|total_assessments|avg_score|avg_engagement|avg_completion_time|grade_level|
+----------+-----------------+---------+--------------+-------------------+-----------+
| STU002351|               15|    84.67|          68.2|              55.49|  5th Grade|
| STU001691|               25|    83.65|         75.04|              54.83|  4th Grade|
| STU002992|               23|     82.1|         74.87|              54.35|  4th Grade|
| STU001644|               17|    80.76|          69.0|              56.12|  5th Grade|
| STU001131|               16|    80.72|         68.31|              50.13|  7th Grade|
| STU001347|               15|     80.6|          71.8|              55.47|  7th Grade|
| STU000282|               16|    80.19|         66.06|              53.04|  5th Grade|
| STU000129|               15|    80.17|          72.8|              52.55|  5th Grade|
| STU001565|               22|  

+------------------+-----------------+---------+-------------------+--------------+---------------+
|           subject|total_assessments|avg_score|avg_completion_time|avg_engagement|unique_students|
+------------------+-----------------+---------+-------------------+--------------+---------------+
|Physical Education|            11307|    91.75|              41.32|         69.94|           2910|
|               Art|            11268|    82.93|              46.66|         69.98|           2922|
|           English|            11150|    64.39|              62.29|         70.03|           2939|
|           History|            11269|    52.47|              56.92|         69.64|           2923|
|              Math|            11267|    51.97|              77.57|         70.03|           2914|
|           Science|            11253|    45.72|               72.4|         70.06|           2935|
+------------------+-----------------+---------+-------------------+--------------+---------------+


+------------+-----------------+---------+--------------+---------------+
| grade_level|total_assessments|avg_score|avg_engagement|unique_students|
+------------+-----------------+---------+--------------+---------------+
|Kindergarten|             4959|    60.22|         69.78|            219|
|   1st Grade|             5312|    63.47|         69.97|            235|
|   2nd Grade|             4908|    67.64|         69.88|            214|
|   3rd Grade|             5441|    68.08|          69.7|            251|
|   4th Grade|             5242|    70.34|         70.13|            235|
|   5th Grade|             5241|     71.8|         69.62|            229|
|   6th Grade|             4341|    66.94|         69.49|            191|
|   7th Grade|             5054|    66.42|         70.47|            222|
|   8th Grade|             5277|    64.84|         70.27|            237|
|   9th Grade|             5340|    63.39|          70.2|            240|
|  10th Grade|             5884|    61

+-----------------+----------------+---------+-------------------+
| engagement_level|assessment_count|avg_score|avg_completion_time|
+-----------------+----------------+---------+-------------------+
|  High Engagement|           23188|    68.78|               59.6|
|Medium Engagement|           22098|    64.93|              59.48|
|   Low Engagement|           22228|     60.8|              59.44|
+-----------------+----------------+---------+-------------------+


=== Monthly Academic Trends ===


+-------+-----------------+---------+--------------+---------------+
|  month|total_assessments|avg_score|avg_engagement|active_students|
+-------+-----------------+---------+--------------+---------------+
|2024-01|             5706|    65.35|         69.86|           2545|
|2024-02|             5212|    64.94|         69.78|           2474|
|2024-03|             5650|     64.9|         69.78|           2540|
|2024-04|             5480|    64.95|         70.09|           2524|
|2024-05|             5869|    64.83|         69.88|           2591|
|2024-06|             5621|    64.72|         69.91|           2557|
|2024-07|             5632|    65.03|         70.03|           2530|
|2024-08|             5739|    65.47|         69.84|           2543|
|2024-09|             5527|    64.63|         70.23|           2505|
|2024-10|             5757|    64.85|          70.0|           2548|
|2024-11|             5639|    64.53|         70.43|           2549|
|2024-12|             5682|     64
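
The engagement output above shows only three levels. The bucketing rule from the SQL `CASE` expression, rewritten as a small Python function, makes the reason visible: the generator draws `engagement_score` from `random.randint(40, 100)`, so the final branch is never reached for this dataset.

```python
def engagement_level(engagement_score):
    """Mirror the SQL CASE expression used in the engagement query."""
    if engagement_score >= 80:
        return "High Engagement"
    if engagement_score >= 60:
        return "Medium Engagement"
    if engagement_score >= 40:
        return "Low Engagement"
    return "Very Low Engagement"

# Every score the generator can produce falls in 40..100 inclusive.
levels = {engagement_level(score) for score in range(40, 101)}
print(sorted(levels))
```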

## Key Takeaways: Delta Liquid Clustering in AIDP

### What We Demonstrated

1. **Automatic Optimization**: Created a table with `CLUSTER BY (student_id, assessment_date)` and let Delta automatically optimize data layout

2. **Performance Benefits**: Queries that filter on the clustered columns (student_id, assessment_date) can skip unrelated files because matching rows are stored together

3. **Low Maintenance**: No manual partitioning, bucketing, or Z-Ordering required; Delta manages the layout from the clustering columns

4. **Real-World Use Case**: Education analytics where student performance tracking and learning analytics are critical

### AIDP Advantages

- **Unified Analytics**: Seamlessly integrates with other AIDP services
- **Governance**: Catalog and schema isolation for education data
- **Performance**: Suited to analytical workloads such as the lookups and aggregations shown here
- **Scalability**: Handles education-scale data volumes effortlessly

### Best Practices for Liquid Clustering

1. **Choose clustering columns** based on your most common query patterns
2. **Start with 1-4 columns** - Delta Lake accepts at most four clustering columns, and too many can reduce effectiveness
3. **Consider cardinality** - high-cardinality columns work best
4. **Monitor and adjust** as query patterns evolve
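
Cardinality can be estimated before committing to clustering columns. The sketch below counts distinct values per column over a list of record dictionaries shaped like `assessment_data`. The helper name `distinct_counts` and the three-row sample are made up for illustration.

```python
def distinct_counts(records, columns):
    """Return the number of distinct values per column for a list of dict records."""
    return {col: len({rec[col] for rec in records}) for col in columns}

sample = [
    {"student_id": "STU000001", "subject": "Math", "grade_level": "5th Grade"},
    {"student_id": "STU000002", "subject": "Math", "grade_level": "5th Grade"},
    {"student_id": "STU000003", "subject": "Art", "grade_level": "5th Grade"},
]
counts = distinct_counts(sample, ["student_id", "subject", "grade_level"])
print(counts)
```

On the full dataset, `student_id` has thousands of distinct values while `subject` has six, which matches the choice of `student_id` as a clustering column.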

### Next Steps

- Explore other AIDP features like AI/ML integration
- Try liquid clustering with different column combinations
- Scale up to larger education datasets
- Integrate with real LMS and assessment platforms

This notebook demonstrates how Oracle AI Data Platform makes advanced education analytics accessible while maintaining enterprise-grade performance and governance.