# Education: Delta Liquid Clustering Demo


## Overview


This notebook demonstrates the power of **Delta Liquid Clustering** in Oracle AI Data Platform (AIDP) Workbench using an education analytics use case. Liquid clustering automatically optimizes data layout for query performance without requiring manual partitioning or Z-Ordering.

### What is Liquid Clustering?

Liquid clustering groups rows with similar values in the clustering columns you define, so that related data lands in the same files. The layout is maintained incrementally as data is written and during maintenance operations, providing:

- **Automatic optimization**: No manual tuning required
- **Improved query performance**: Faster queries on clustered columns
- **Reduced maintenance**: No need for manual repartitioning
- **Adaptive clustering**: Adjusts as data patterns change
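
To make the data-locality idea concrete, here is a minimal pure-Python sketch. It is not Delta's actual implementation: it assumes each data file records the minimum and maximum `student_id` it contains, and it uses a plain sort as a stand-in for clustering. The row counts, file size, and helper names (`make_files`, `files_to_scan`) are illustrative choices.

```python
import random

def make_files(rows, rows_per_file):
    """Split rows into fixed-size 'files' and keep only (min, max) per file."""
    files = []
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        files.append((min(chunk), max(chunk)))
    return files

def files_to_scan(files, key):
    """Count files whose [min, max] range could contain the key."""
    return sum(1 for lo, hi in files if lo <= key <= hi)

random.seed(0)
# 1,000 hypothetical students with 20 rows each (20,000 rows in total)
ids = [f"STU{n:06d}" for n in range(1, 1001) for _ in range(20)]

unclustered = ids[:]
random.shuffle(unclustered)   # rows arrive in arbitrary order
clustered = sorted(ids)       # stand-in for a clustered layout

target = "STU000500"
scanned_unclustered = files_to_scan(make_files(unclustered, 1000), target)
scanned_clustered = files_to_scan(make_files(clustered, 1000), target)

print(f"Unclustered layout: {scanned_unclustered} of 20 files may contain {target}")
print(f"Clustered layout:   {scanned_clustered} of 20 files may contain {target}")
```

With shuffled rows almost every file spans the full ID range, so a lookup has to open nearly all of them. With sorted rows, only the file whose range covers the target needs to be read.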

### Use Case: Student Performance Analytics and Learning Management

We'll analyze student learning data and academic performance metrics. Our clustering strategy will optimize for:

- **Student-specific queries**: Fast lookups by student ID
- **Time-based analysis**: Efficient filtering by academic period and assessment dates
- **Performance patterns**: Quick aggregation by subject and learning outcomes

## Step 1: AIDP Environment Setup

This notebook leverages the existing Spark session in your AIDP environment.

In [None]:
# Create education catalog and analytics schema

# In AIDP, catalogs provide data isolation and governance

spark.sql("CREATE CATALOG IF NOT EXISTS education")

spark.sql("CREATE SCHEMA IF NOT EXISTS education.analytics")

print("Education catalog and analytics schema created successfully!")

## Step 2: Create Delta Table with Liquid Clustering

### Table Design

Our `student_assessments` table will store:

- **student_id**: Unique student identifier
- **assessment_date**: Date of assessment or assignment
- **subject**: Academic subject area
- **score**: Assessment score (0-100)
- **grade_level**: Student grade level
- **completion_time**: Time spent on assessment (minutes)
- **engagement_score**: Student engagement metric (0-100)

### Clustering Strategy

We'll cluster by `student_id` and `assessment_date` because:

- **student_id**: Each student has many assessments, so clustering on this column keeps a student's learning history together
- **assessment_date**: Time-based queries are critical for academic tracking, semester analysis, and intervention planning
- This combination optimizes for both individual student monitoring and temporal academic performance analysis

In [None]:
# Create Delta table with liquid clustering

# CLUSTER BY defines the columns for automatic optimization

spark.sql("""

CREATE TABLE IF NOT EXISTS education.analytics.student_assessments (

    student_id STRING,

    assessment_date DATE,

    subject STRING,

    score DECIMAL(5,2),

    grade_level STRING,

    completion_time DECIMAL(6,2),

    engagement_score INT

)

USING DELTA

CLUSTER BY (student_id, assessment_date)

""")

print("Delta table with liquid clustering created successfully!")

print("Clustering will automatically optimize data layout for queries on student_id and assessment_date.")

Delta table with liquid clustering created successfully!
Clustering will automatically optimize data layout for queries on student_id and assessment_date.


## Step 3: Generate Education Sample Data

### Data Generation Strategy

We'll create realistic student assessment data including:

- **3,000 students** with multiple assessments over time
- **Subjects**: Math, English, Science, History, Art, Physical Education
- **Realistic performance patterns**: Learning curves, subject difficulty variations, engagement factors
- **Grade levels**: K-12 with appropriate academic progression

### Why This Data Pattern?

This data simulates real education scenarios where:

- Student performance varies by subject and time
- Learning progress needs longitudinal tracking
- Intervention strategies require early identification
- Curriculum effectiveness drives teaching improvements
- Standardized testing and reporting require temporal analysis

In [None]:
# Generate sample student assessment data

# Only standard-library modules are needed for data generation

import random

from datetime import datetime, timedelta


# Define education data constants

SUBJECTS = ['Math', 'English', 'Science', 'History', 'Art', 'Physical Education']

GRADE_LEVELS = ['Kindergarten', '1st Grade', '2nd Grade', '3rd Grade', '4th Grade', '5th Grade', 
                '6th Grade', '7th Grade', '8th Grade', '9th Grade', '10th Grade', '11th Grade', '12th Grade']

# Base performance parameters by subject

PERFORMANCE_PARAMS = {

    'Math': {'base_score': 75, 'difficulty': 1.2, 'time_factor': 1.5},

    'English': {'base_score': 78, 'difficulty': 1.0, 'time_factor': 1.2},

    'Science': {'base_score': 72, 'difficulty': 1.3, 'time_factor': 1.4},

    'History': {'base_score': 70, 'difficulty': 1.1, 'time_factor': 1.1},

    'Art': {'base_score': 82, 'difficulty': 0.8, 'time_factor': 0.9},

    'Physical Education': {'base_score': 85, 'difficulty': 0.7, 'time_factor': 0.8}

}

# Grade level adjustments

GRADE_ADJUSTMENTS = {

    'Kindergarten': 0.7, '1st Grade': 0.75, '2nd Grade': 0.8, '3rd Grade': 0.82,

    '4th Grade': 0.85, '5th Grade': 0.87, '6th Grade': 0.8, '7th Grade': 0.78,

    '8th Grade': 0.76, '9th Grade': 0.74, '10th Grade': 0.72, '11th Grade': 0.7, '12th Grade': 0.68

}


# Generate student assessment records

assessment_data = []

base_date = datetime(2024, 1, 1)


# Create 3,000 students with 15-30 assessments each

for student_num in range(1, 3001):

    student_id = f"STU{student_num:06d}"
    
    # Assign grade level

    grade_level = random.choice(GRADE_LEVELS)

    grade_factor = GRADE_ADJUSTMENTS[grade_level]
    
    # Each student gets 15-30 assessments over 12 months

    num_assessments = random.randint(15, 30)
    
    for i in range(num_assessments):

        # Spread assessments over 12 months

        days_offset = random.randint(0, 365)

        assessment_date = base_date + timedelta(days=days_offset)
        
        # Select subject

        subject = random.choice(SUBJECTS)

        params = PERFORMANCE_PARAMS[subject]
        
        # Calculate score with variations

        score_variation = random.uniform(0.7, 1.3)

        base_score = params['base_score'] * grade_factor / params['difficulty']

        score = round(min(100, max(0, base_score * score_variation)), 2)
        
        # Calculate completion time

        time_variation = random.uniform(0.8, 1.5)

        base_time = 45 * params['time_factor']  # 45 minutes base time

        completion_time = round(base_time * time_variation, 2)
        
        # Engagement score (affects performance)

        engagement_score = random.randint(40, 100)

        # Slightly adjust score based on engagement

        engagement_factor = engagement_score / 100.0

        score = round(min(100, score * (0.8 + 0.4 * engagement_factor)), 2)
        
        assessment_data.append({

            "student_id": student_id,

            "assessment_date": assessment_date.date(),

            "subject": subject,

            "score": float(score),

            "grade_level": grade_level,

            "completion_time": float(completion_time),

            "engagement_score": int(engagement_score)

        })



print(f"Generated {len(assessment_data)} student assessment records")

print("Sample record:", assessment_data[0])

Generated 67233 student assessment records
Sample record: {'student_id': 'STU000001', 'assessment_date': datetime.date(2024, 9, 25), 'subject': 'History', 'score': 47.58, 'grade_level': '5th Grade', 'completion_time': 71.9, 'engagement_score': 51}
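
As a sanity check on the generator, the expected mean score per subject can be worked out by hand. The sketch below assumes the random factors are independent, so the mean of their product is the product of their means: the grade factor (mean of `GRADE_ADJUSTMENTS`), `score_variation` (mean 1.0), and the engagement multiplier `0.8 + 0.4 * engagement / 100` (mean 1.08 for engagement uniform on 40 to 100). It ignores the `min(100, ...)` cap, so it only applies to subjects whose scores never reach 100, which is why Art, Physical Education, and English are left out.

```python
# (base_score, difficulty) copied from PERFORMANCE_PARAMS in the generator
BASE = {
    'Math': (75, 1.2),
    'Science': (72, 1.3),
    'History': (70, 1.1),
}

# Values copied from GRADE_ADJUSTMENTS; grades are chosen uniformly at random
GRADE_FACTORS = [0.7, 0.75, 0.8, 0.82, 0.85, 0.87, 0.8,
                 0.78, 0.76, 0.74, 0.72, 0.7, 0.68]

mean_grade_factor = sum(GRADE_FACTORS) / len(GRADE_FACTORS)
mean_score_variation = 1.0                        # mean of uniform(0.7, 1.3)
mean_engagement_multiplier = 0.8 + 0.4 * (70 / 100)  # engagement mean is 70

def expected_mean_score(subject):
    """Expected score for a subject, ignoring rounding and the 100-point cap."""
    base_score, difficulty = BASE[subject]
    return (base_score / difficulty * mean_grade_factor
            * mean_score_variation * mean_engagement_multiplier)

for subject in BASE:
    print(f"{subject}: expected mean score ~ {expected_mean_score(subject):.2f}")

# Expected record count: 3,000 students, 15 to 30 assessments each (mean 22.5)
expected_records = 3000 * (15 + 30) / 2
print(f"Expected number of records ~ {expected_records:,.0f}")
```

The Math and Science values come out at roughly 51.8 and 45.9, and the expected record count of 67,500 is close to the 67,233 records generated in this run.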


## Step 4: Insert Data Using PySpark

### Data Insertion Strategy

We'll use PySpark to:

1. **Create DataFrame** from our generated data
2. **Insert into Delta table** with liquid clustering
3. **Verify the insertion** with a sample query

### Why PySpark for Insertion?

- **Distributed processing**: Handles large datasets efficiently
- **Type safety**: Ensures data integrity
- **Optimization**: Leverages Spark's query optimization
- **Liquid clustering**: Automatically applies clustering during insertion
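
On the type-safety point: `spark.createDataFrame` infers column types from the Python values, so plain floats typically become `double` rather than the `DECIMAL` columns declared in Step 2. One way to line the types up is to coerce each record before creating the DataFrame. This sketch assumes an explicit schema will also be supplied; the helper name `to_table_types` is hypothetical.

```python
from decimal import Decimal

TWO_PLACES = Decimal("0.01")

def to_table_types(record: dict) -> dict:
    """Return a copy of record whose Python types match the table DDL."""
    coerced = dict(record)
    # DECIMAL(5,2) and DECIMAL(6,2) columns: convert via str() to avoid
    # carrying binary floating-point noise into the Decimal value
    coerced["score"] = Decimal(str(record["score"])).quantize(TWO_PLACES)
    coerced["completion_time"] = Decimal(str(record["completion_time"])).quantize(TWO_PLACES)
    # INT column
    coerced["engagement_score"] = int(record["engagement_score"])
    return coerced

# Abbreviated record in the shape produced by the generator
sample = {
    "student_id": "STU000001",
    "subject": "History",
    "score": 47.58,
    "completion_time": 71.9,
    "engagement_score": 51,
}
print(to_table_types(sample))
```

The coerced records could then be passed to `spark.createDataFrame` together with a `StructType` that uses `DecimalType(5, 2)`, `DecimalType(6, 2)`, and `IntegerType()` for these columns.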

In [None]:
# Insert data using PySpark DataFrame operations

# createDataFrame infers column types from the Python values in each record


# Create DataFrame from generated data

df_assessments = spark.createDataFrame(assessment_data)


# Display schema and sample data

print("DataFrame Schema:")

df_assessments.printSchema()



print("\nSample Data:")

df_assessments.show(5)


# Insert data into Delta table with liquid clustering

# The CLUSTER BY (student_id, assessment_date) will automatically optimize the data layout

df_assessments.write.mode("overwrite").saveAsTable("education.analytics.student_assessments")


print(f"\nSuccessfully inserted {df_assessments.count()} records into education.analytics.student_assessments")

print("Liquid clustering automatically optimized the data layout during insertion!")

DataFrame Schema:
root
 |-- assessment_date: date (nullable = true)
 |-- completion_time: double (nullable = true)
 |-- engagement_score: long (nullable = true)
 |-- grade_level: string (nullable = true)
 |-- score: double (nullable = true)
 |-- student_id: string (nullable = true)
 |-- subject: string (nullable = true)


Sample Data:


+---------------+---------------+----------------+-----------+-----+----------+-------+
|assessment_date|completion_time|engagement_score|grade_level|score|student_id|subject|
+---------------+---------------+----------------+-----------+-----+----------+-------+
|     2024-09-25|           71.9|              51|  5th Grade|47.58| STU000001|History|
|     2024-11-26|          59.47|              52|  5th Grade|70.08| STU000001|History|
|     2024-12-17|          62.23|              89|  5th Grade|76.43| STU000001|English|
|     2024-04-23|          75.37|              62|  5th Grade|40.01| STU000001|   Math|
|     2024-06-29|          73.66|              79|  5th Grade|38.55| STU000001|Science|
+---------------+---------------+----------------+-----------+-----+----------+-------+
only showing top 5 rows




Successfully inserted 67233 records into education.analytics.student_assessments
Liquid clustering automatically optimized the data layout during insertion!


## Step 5: Demonstrate Liquid Clustering Benefits

### Query Performance Analysis

Now let's see how liquid clustering improves query performance. We'll run queries that benefit from our clustering strategy:

1. **Student assessment history** (clustered by student_id)
2. **Time-based academic analysis** (clustered by assessment_date)
3. **Combined student + time queries** (optimal for our clustering)

### Expected Performance Benefits

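With liquid clustering, these queries can skip files that cannot contain matching rows, which should make them faster than on an unclustered table (the gain grows with table size) because: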

- **Data locality**: Related records are physically grouped together
- **Reduced I/O**: Less data needs to be read from disk
- **Automatic optimization**: No manual tuning required
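
These benefits are easier to judge when measured. The following is a minimal timing sketch, not part of the original notebook: `timed` is a hypothetical helper that runs any callable a few times and reports the best wall-clock time. A dummy workload stands in for a query so the snippet runs anywhere.

```python
import time

def timed(label, action, repeats=3):
    """Run action() several times; return (last result, best time in seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = action()
        best = min(best, time.perf_counter() - start)
    print(f"{label}: best of {repeats} runs = {best:.4f}s")
    return result, best

# Stand-in workload; in the notebook you might pass something like
#   lambda: spark.sql("SELECT ... WHERE student_id = 'STU000001'").collect()
value, seconds = timed("dummy workload", lambda: sum(range(100_000)))
```

Taking the best of several runs reduces the influence of caching and cluster warm-up. On a table of this size, differences between layouts may be too small to measure reliably.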

In [None]:
# Demonstrate liquid clustering benefits with optimized queries


# Query 1: Student assessment history - benefits from student_id clustering

print("=== Query 1: Student Assessment History ===")

student_history = spark.sql("""

SELECT student_id, assessment_date, subject, score, engagement_score

FROM education.analytics.student_assessments

WHERE student_id = 'STU000001'

ORDER BY assessment_date DESC

""")



student_history.show()

print(f"Records found: {student_history.count()}")



# Query 2: Time-based academic performance analysis - benefits from assessment_date clustering

print("\n=== Query 2: Recent Low Performance Issues ===")

low_performance = spark.sql("""

SELECT assessment_date, student_id, subject, score, grade_level

FROM education.analytics.student_assessments

WHERE assessment_date >= '2024-06-01' AND score < 60

ORDER BY score ASC, assessment_date DESC

""")



low_performance.show()

print(f"Low performance issues found: {low_performance.count()}")



# Query 3: Combined student + time query - optimal for our clustering strategy

print("\n=== Query 3: Student Performance Trends ===")

performance_trends = spark.sql("""

SELECT student_id, assessment_date, subject, score, engagement_score

FROM education.analytics.student_assessments

WHERE student_id LIKE 'STU000%' AND assessment_date >= '2024-04-01'

ORDER BY student_id, assessment_date

""")



performance_trends.show()

print(f"Performance trend records found: {performance_trends.count()}")

=== Query 1: Student Assessment History ===


+----------+---------------+------------------+-----+----------------+
|student_id|assessment_date|           subject|score|engagement_score|
+----------+---------------+------------------+-----+----------------+
| STU000001|     2024-12-31|           History|66.76|              65|
| STU000001|     2024-12-17|           English|76.43|              89|
| STU000001|     2024-12-04|           History|45.62|              41|
| STU000001|     2024-11-26|           History|70.08|              52|
| STU000001|     2024-11-17|           Science| 68.0|              92|
| STU000001|     2024-11-17|           English| 56.2|              70|
| STU000001|     2024-09-27|           History| 40.9|              40|
| STU000001|     2024-09-25|           History|47.58|              51|
| STU000001|     2024-09-23|               Art| 96.8|              42|
| STU000001|     2024-08-29|               Art|100.0|              53|
| STU000001|     2024-08-11|           History|74.15|             100|
| STU0

Records found: 19

=== Query 2: Recent Low Performance Issues ===


+---------------+----------+-------+-----+------------+
|assessment_date|student_id|subject|score| grade_level|
+---------------+----------+-------+-----+------------+
|     2024-10-27| STU002987|Science|26.09|  12th Grade|
|     2024-12-04| STU002647|Science|26.43|  12th Grade|
|     2024-06-10| STU001734|Science| 26.7|  11th Grade|
|     2024-08-18| STU000133|Science|26.85|Kindergarten|
|     2024-10-24| STU001404|Science|26.86|Kindergarten|
|     2024-10-29| STU000258|Science|26.89|Kindergarten|
|     2024-10-08| STU001735|Science| 27.0|  11th Grade|
|     2024-08-09| STU001150|Science|27.03|  12th Grade|
|     2024-12-27| STU002481|Science|27.07|  12th Grade|
|     2024-09-19| STU001845|Science| 27.1|  11th Grade|
|     2024-07-17| STU001826|Science|27.18|  12th Grade|
|     2024-07-19| STU000634|Science|27.19|  12th Grade|
|     2024-10-14| STU000273|Science| 27.2|  11th Grade|
|     2024-11-24| STU001154|Science|27.22|  12th Grade|
|     2024-09-09| STU001574|Science|27.26|  12th

Low performance issues found: 18954

=== Query 3: Student Performance Trends ===


+----------+---------------+------------------+-----+----------------+
|student_id|assessment_date|           subject|score|engagement_score|
+----------+---------------+------------------+-----+----------------+
| STU000001|     2024-04-20|               Art|100.0|              50|
| STU000001|     2024-04-23|              Math|40.01|              62|
| STU000001|     2024-04-23|           Science|70.84|             100|
| STU000001|     2024-05-26|Physical Education|94.59|              83|
| STU000001|     2024-06-29|           Science|38.55|              79|
| STU000001|     2024-07-26|           English|51.46|              66|
| STU000001|     2024-08-11|           History|74.15|             100|
| STU000001|     2024-08-29|               Art|100.0|              53|
| STU000001|     2024-09-23|               Art| 96.8|              42|
| STU000001|     2024-09-25|           History|47.58|              51|
| STU000001|     2024-09-27|           History| 40.9|              40|
| STU0

Performance trend records found: 16972


## Step 6: Analyze Clustering Effectiveness

### Understanding the Impact

The aggregate queries below show the kinds of education insights this table supports. None of them filters on a clustering column, so each one scans the whole table; they illustrate the analytics rather than measure the clustering itself.

### Key Analytics

- **Student performance patterns** and learning analytics
- **Subject difficulty analysis** and curriculum effectiveness
- **Grade level progression** and academic growth
- **Engagement correlations** and intervention opportunities
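
The engagement relationship can also be summarized as a single coefficient. The sketch below computes a Pearson correlation in pure Python on a small synthetic sample that reuses the generator's engagement multiplier (`0.8 + 0.4 * engagement / 100`). The fixed base score of 60, the sample size, and the seed are assumptions made only for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

random.seed(7)
engagement, scores = [], []
for _ in range(2000):
    e = random.randint(40, 100)
    raw = 60 * random.uniform(0.7, 1.3)   # hypothetical fixed base score of 60
    scores.append(min(100, raw * (0.8 + 0.4 * e / 100)))
    engagement.append(e)

r = pearson(engagement, scores)
print(f"Pearson r between engagement and score: {r:.2f}")
```

On the real table, Spark SQL's `corr` aggregate function should give the equivalent figure without collecting rows to the driver.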

In [None]:
# Analyze clustering effectiveness and education insights


# Student performance analysis

print("=== Student Performance Analysis ===")

student_performance = spark.sql("""

SELECT student_id, COUNT(*) as total_assessments,

       ROUND(AVG(score), 2) as avg_score,

       ROUND(AVG(engagement_score), 2) as avg_engagement,

       ROUND(AVG(completion_time), 2) as avg_completion_time,

       grade_level

FROM education.analytics.student_assessments

GROUP BY student_id, grade_level

ORDER BY avg_score DESC

LIMIT 10

""")



student_performance.show()


# Subject performance analysis

print("\n=== Subject Performance Analysis ===")

subject_analysis = spark.sql("""

SELECT subject, COUNT(*) as total_assessments,

       ROUND(AVG(score), 2) as avg_score,

       ROUND(AVG(completion_time), 2) as avg_completion_time,

       ROUND(AVG(engagement_score), 2) as avg_engagement,

       COUNT(DISTINCT student_id) as unique_students

FROM education.analytics.student_assessments

GROUP BY subject

ORDER BY avg_score DESC

""")



subject_analysis.show()


# Grade level performance

print("\n=== Grade Level Performance ===")

grade_performance = spark.sql("""

SELECT 

    grade_level, 

    COUNT(*) AS total_assessments,

    ROUND(AVG(score), 2) AS avg_score,

    ROUND(AVG(engagement_score), 2) AS avg_engagement,

    COUNT(DISTINCT student_id) AS unique_students

FROM education.analytics.student_assessments

GROUP BY grade_level

ORDER BY 

    CASE 

        WHEN grade_level = 'Kindergarten' THEN 0

        ELSE CAST(REGEXP_REPLACE(grade_level, '[^0-9]', '') AS INT)

    END

""")



grade_performance.show()


# Engagement vs performance correlation

print("\n=== Engagement vs Performance Correlation ===")

engagement_correlation = spark.sql("""

SELECT 

    CASE 

        WHEN engagement_score >= 80 THEN 'High Engagement'

        WHEN engagement_score >= 60 THEN 'Medium Engagement'

        WHEN engagement_score >= 40 THEN 'Low Engagement'

        ELSE 'Very Low Engagement'

    END as engagement_level,

    COUNT(*) as assessment_count,

    ROUND(AVG(score), 2) as avg_score,

    ROUND(AVG(completion_time), 2) as avg_completion_time

FROM education.analytics.student_assessments

GROUP BY 

    CASE 

        WHEN engagement_score >= 80 THEN 'High Engagement'

        WHEN engagement_score >= 60 THEN 'Medium Engagement'

        WHEN engagement_score >= 40 THEN 'Low Engagement'

        ELSE 'Very Low Engagement'

    END

ORDER BY avg_score DESC

""")



engagement_correlation.show()


# Monthly academic trends

print("\n=== Monthly Academic Trends ===")

monthly_trends = spark.sql("""

SELECT DATE_FORMAT(assessment_date, 'yyyy-MM') as month,

       COUNT(*) as total_assessments,

       ROUND(AVG(score), 2) as avg_score,

       ROUND(AVG(engagement_score), 2) as avg_engagement,

       COUNT(DISTINCT student_id) as active_students

FROM education.analytics.student_assessments

GROUP BY DATE_FORMAT(assessment_date, 'yyyy-MM')

ORDER BY month

""")



monthly_trends.show()

=== Student Performance Analysis ===


+----------+-----------------+---------+--------------+-------------------+-----------+
|student_id|total_assessments|avg_score|avg_engagement|avg_completion_time|grade_level|
+----------+-----------------+---------+--------------+-------------------+-----------+
| STU000455|               19|    84.44|         68.79|              50.06|  4th Grade|
| STU001973|               16|    83.68|         75.63|              58.71|  3rd Grade|
| STU002604|               15|    82.76|          80.4|              53.96|  5th Grade|
| STU001557|               18|    81.47|         76.22|              55.65|  5th Grade|
| STU000551|               28|    81.44|          69.5|              51.03|  6th Grade|
| STU002979|               20|    81.09|         76.25|              53.68|  5th Grade|
| STU002271|               16|    80.99|         72.31|              51.99|  4th Grade|
| STU002472|               23|    80.97|         75.39|              54.55|  5th Grade|
| STU000742|               23|  

+------------------+-----------------+---------+-------------------+--------------+---------------+
|           subject|total_assessments|avg_score|avg_completion_time|avg_engagement|unique_students|
+------------------+-----------------+---------+-------------------+--------------+---------------+
|Physical Education|            11190|    91.76|              41.46|         70.01|           2924|
|               Art|            11258|    82.84|              46.64|         69.98|           2940|
|           English|            11166|    64.32|              61.91|          70.0|           2929|
|           History|            11280|    52.59|              56.93|         69.84|           2924|
|              Math|            11117|    51.89|              77.41|         69.88|           2922|
|           Science|            11222|    45.91|              72.15|         69.75|           2919|
+------------------+-----------------+---------+-------------------+--------------+---------------+


+------------+-----------------+---------+--------------+---------------+
| grade_level|total_assessments|avg_score|avg_engagement|unique_students|
+------------+-----------------+---------+--------------+---------------+
|Kindergarten|             5674|     60.3|         70.51|            257|
|   1st Grade|             5044|    64.17|         69.92|            223|
|   2nd Grade|             5068|    66.62|         69.51|            231|
|   3rd Grade|             4670|    68.79|         70.02|            206|
|   4th Grade|             5406|    70.43|         70.21|            244|
|   5th Grade|             5620|    71.76|         70.18|            249|
|   6th Grade|             5108|    67.03|         69.96|            226|
|   7th Grade|             4922|    65.79|         69.51|            227|
|   8th Grade|             4514|    64.82|          70.0|            202|
|   9th Grade|             5842|    63.55|         69.91|            255|
|  10th Grade|             4978|    61

+-----------------+----------------+---------+-------------------+
| engagement_level|assessment_count|avg_score|avg_completion_time|
+-----------------+----------------+---------+-------------------+
|  High Engagement|           23075|     69.0|              59.36|
|Medium Engagement|           22090|     64.8|              59.37|
|   Low Engagement|           22068|    60.69|              59.44|
+-----------------+----------------+---------+-------------------+


=== Monthly Academic Trends ===


+-------+-----------------+---------+--------------+---------------+
|  month|total_assessments|avg_score|avg_engagement|active_students|
+-------+-----------------+---------+--------------+---------------+
|2024-01|             5726|    65.27|         69.71|           2540|
|2024-02|             5228|    64.93|         70.24|           2456|
|2024-03|             5730|     65.3|         70.19|           2553|
|2024-04|             5752|    64.78|          70.3|           2559|
|2024-05|             5668|    65.37|         69.85|           2567|
|2024-06|             5483|    64.27|         69.91|           2516|
|2024-07|             5685|    64.65|         69.55|           2571|
|2024-08|             5607|    64.65|         69.68|           2568|
|2024-09|             5497|    64.68|         69.86|           2502|
|2024-10|             5731|     65.0|         69.49|           2558|
|2024-11|             5458|    64.92|         69.93|           2515|
|2024-12|             5668|    64.

## Step 7: Train a Student Performance Prediction Model

### Machine Learning for Educational Improvement

Now we'll train a machine learning model to predict student performance and identify students who may need early intervention. This model can help education institutions:

- **Predict academic performance** before final assessments
- **Identify at-risk students** early for targeted interventions
- **Personalize learning plans** based on predicted performance
- **Optimize resource allocation** for academic support programs

### Model Approach

We'll use a **Random Forest Classifier** to predict whether a student is a high performer (average score > 75, the threshold used in the feature query) based on:

- Historical performance patterns and subject strengths
- Engagement levels and time spent on assessments
- Subject-specific performance and grade-level progression
- Temporal patterns and learning consistency

### Business Impact

- **Early Intervention**: Identify struggling students before it's too late
- **Resource Optimization**: Focus support where it's most needed
- **Personalized Education**: Tailor learning experiences to student needs
- **Improved Outcomes**: Better academic performance and graduation rates

In [None]:
# Prepare data for machine learning - create student-level performance features

from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

# Create student-level features for performance prediction
student_features = spark.sql("""
SELECT 
    student_id,
    COUNT(*) as total_assessments,
    ROUND(AVG(score), 2) as avg_score,
    ROUND(STDDEV(score), 2) as score_variability,
    ROUND(AVG(engagement_score), 2) as avg_engagement,
    ROUND(AVG(completion_time), 2) as avg_completion_time,
    ROUND(STDDEV(completion_time), 2) as completion_time_variability,
    COUNT(DISTINCT subject) as subjects_attempted,
    COUNT(DISTINCT DATE_FORMAT(assessment_date, 'yyyy-MM')) as active_months,
    -- Subject-specific performance
    ROUND(AVG(CASE WHEN subject = 'Math' THEN score END), 2) as math_avg_score,
    ROUND(AVG(CASE WHEN subject = 'English' THEN score END), 2) as english_avg_score,
    ROUND(AVG(CASE WHEN subject = 'Science' THEN score END), 2) as science_avg_score,
    -- Performance trend (recent vs earlier)
    ROUND(AVG(CASE WHEN assessment_date >= '2024-07-01' THEN score END), 2) as recent_avg_score,
    ROUND(AVG(CASE WHEN assessment_date < '2024-07-01' THEN score END), 2) as earlier_avg_score,
    grade_level,
    -- Target: High performance (>75 average)
    CASE WHEN AVG(score) > 75 THEN 1 ELSE 0 END as high_performer
FROM education.analytics.student_assessments
GROUP BY student_id, grade_level
""")

# Fill null values from conditional aggregations
student_features = student_features.fillna(0, subset=['math_avg_score', 'english_avg_score', 'science_avg_score', 'recent_avg_score', 'earlier_avg_score'])

print(f"Created student performance features for {student_features.count()} students")
student_features.groupBy("high_performer").count().show()

Created student performance features for 3000 students


+--------------+-----+
|high_performer|count|
+--------------+-----+
|             1|  127|
|             0| 2873|
+--------------+-----+
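
The class counts above show a strong imbalance, which matters when reading accuracy figures later. As a quick worked check using only the counts from this output, a model that always predicts "not a high performer" would already be right most of the time:

```python
# Class counts taken from the groupBy output above
high_performers = 127
others = 2873
total = high_performers + others

positive_rate = high_performers / total
majority_baseline_accuracy = others / total   # accuracy of always predicting 0

print(f"Positive class share: {positive_rate:.2%}")
print(f"Majority-class baseline accuracy: {majority_baseline_accuracy:.2%}")
```

For this reason, recall and precision on the positive class are usually more informative here than raw accuracy.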



In [None]:
# Feature engineering for performance prediction

# Create indexers for categorical features
grade_indexer = StringIndexer(inputCol="grade_level", outputCol="grade_level_index")

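# Assemble features for the model
# Note: the high_performer label is derived from AVG(score), so avg_score
# (and, closely, recent_avg_score / earlier_avg_score) leaks the target.
# Consider dropping these columns for a realistic estimate of predictive power.
feature_cols = ["total_assessments", "avg_score", "score_variability", "avg_engagement", 
                "avg_completion_time", "completion_time_variability", "subjects_attempted", 
                "active_months", "math_avg_score", "english_avg_score", "science_avg_score", 
                "recent_avg_score", "earlier_avg_score", "grade_level_index"]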

assembler = VectorAssembler(
    inputCols=feature_cols,
    outputCol="features"
)

# Scale features
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")

# Create and train the model
rf = RandomForestClassifier(
    labelCol="high_performer", 
    featuresCol="scaled_features",
    numTrees=100,
    maxDepth=10
)

# Create pipeline
pipeline = Pipeline(stages=[grade_indexer, assembler, scaler, rf])

# Split data
train_data, test_data = student_features.randomSplit([0.8, 0.2], seed=42)

print(f"Training set: {train_data.count()} students")
print(f"Test set: {test_data.count()} students")

Training set: 2451 students


Test set: 549 students


In [None]:
# Train the student performance prediction model

print("Training student performance prediction model...")
model = pipeline.fit(train_data)

# Make predictions
predictions = model.transform(test_data)

# Evaluate the model
evaluator = BinaryClassificationEvaluator(labelCol="high_performer", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)

print(f"Model AUC: {auc:.4f}")

# Show prediction results
predictions.select("student_id", "avg_score", "avg_engagement", "high_performer", "prediction", "probability").show(10)

# Calculate confusion matrix
confusion_matrix = predictions.groupBy("high_performer", "prediction").count()
confusion_matrix.show()

Training student performance prediction model...


Model AUC: 0.9977


+----------+---------+--------------+--------------+----------+-----------+
|student_id|avg_score|avg_engagement|high_performer|prediction|probability|
+----------+---------+--------------+--------------+----------+-----------+
| STU000003|    57.63|          77.0|             0|       0.0|  [1.0,0.0]|
| STU000007|    58.32|         70.53|             0|       0.0|[0.99,0.01]|
| STU000009|    68.72|          71.0|             0|       0.0|  [1.0,0.0]|
| STU000014|    58.66|          69.5|             0|       0.0|  [1.0,0.0]|
| STU000020|    69.24|         73.31|             0|       0.0|[0.98,0.02]|
| STU000024|     69.8|         68.11|             0|       0.0|[0.99,0.01]|
| STU000030|    59.54|         69.87|             0|       0.0|  [1.0,0.0]|
| STU000036|    62.59|         71.44|             0|       0.0|  [1.0,0.0]|
| STU000046|    56.48|         71.64|             0|       0.0|  [1.0,0.0]|
| STU000047|    60.15|         71.32|             0|       0.0|  [1.0,0.0]|
+----------+

+--------------+----------+-----+
|high_performer|prediction|count|
+--------------+----------+-----+
|             1|       0.0|    8|
|             0|       0.0|  518|
|             1|       1.0|   23|
+--------------+----------+-----+
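
The confusion matrix above can be turned into the usual metrics by hand. This is a small worked example using only the counts shown; the row for (high_performer = 0, prediction = 1.0) is absent from the output, which is read here as zero false positives.

```python
# Cells read from the confusion matrix output above
tp = 23    # high_performer = 1, prediction = 1.0
fn = 8     # high_performer = 1, prediction = 0.0
tn = 518   # high_performer = 0, prediction = 0.0
fp = 0     # high_performer = 0, prediction = 1.0 (row not present)

total = tp + fn + tn + fp
accuracy = (tp + tn) / total
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if (precision + recall) else 0.0)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 score:  {f1:.4f}")
```

Recall is the figure to watch: 8 of the 31 actual high performers in the test set were missed, even though accuracy is above 98%.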



In [None]:
# Model interpretation and business insights

# Feature importance (impurity-based, from the fitted random forest)
rf_model = model.stages[-1]
feature_importance = rf_model.featureImportances.toArray()
feature_names = feature_cols

print("=== Feature Importance for Student Performance Prediction ===")
for name, importance in zip(feature_names, feature_importance):
    print(f"{name}: {importance:.4f}")

# Business impact analysis
print("\n=== Business Impact Analysis ===")

# Calculate potential impact of performance prediction
high_performers_predicted = predictions.filter("prediction = 1")
students_identified = high_performers_predicted.count()
total_test_students = test_data.count()

print(f"Total test students: {total_test_students}")
print(f"Students predicted as high performers: {students_identified}")
print(f"Percentage identified for advanced programs: {(students_identified/total_test_students)*100:.1f}%")

# Students not predicted as high performers (avg score <= 75), used here as a rough proxy for at-risk
at_risk_students = predictions.filter("prediction = 0")
at_risk_count = at_risk_students.count()

print(f"\nStudents predicted as needing intervention: {at_risk_count}")
print(f"Percentage flagged for academic support: {(at_risk_count/total_test_students)*100:.1f}%")

# Calculate intervention value (the three constants below are illustrative assumptions)
intervention_cost_per_student = 500  # Cost of tutoring/mentoring per student
intervention_effectiveness = 0.25  # Expected improvement in performance
avg_student_value = 10000  # Value of improved student outcomes

total_intervention_cost = at_risk_count * intervention_cost_per_student
expected_benefit = at_risk_count * intervention_effectiveness * avg_student_value
# With these constants the ROI does not depend on at_risk_count:
# (0.25 * 10000 - 500) / 500 * 100 = 400%
intervention_roi = (expected_benefit - total_intervention_cost) / total_intervention_cost * 100

print(f"\nEstimated intervention cost per student: ${intervention_cost_per_student:,}")
print(f"Total intervention cost: ${total_intervention_cost:,}")
print(f"Expected benefit from interventions: ${expected_benefit:,.0f}")
print(f"Intervention program ROI: {intervention_roi:.1f}%")

# Advanced program value for high performers
advanced_program_benefit = students_identified * avg_student_value * 0.15  # 15% additional benefit
print(f"\nAdditional value from advanced programs: ${advanced_program_benefit:,.0f}")

# Accuracy metrics (each count is computed once and reused)
total_predictions = predictions.count()
correct = predictions.filter("high_performer = prediction").count()
true_positives = predictions.filter("prediction = 1 AND high_performer = 1").count()
predicted_positives = predictions.filter("prediction = 1").count()
actual_positives = predictions.filter("high_performer = 1").count()

accuracy = correct / total_predictions
precision = true_positives / predicted_positives if predicted_positives > 0 else 0
recall = true_positives / actual_positives if actual_positives > 0 else 0

print("\nModel Performance:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"AUC: {auc:.4f}")

=== Feature Importance for Student Performance Prediction ===
total_assessments: 0.0175
avg_score: 0.5264
score_variability: 0.0138
avg_engagement: 0.0178
avg_completion_time: 0.0344
completion_time_variability: 0.0204
subjects_attempted: 0.0029
active_months: 0.0064
math_avg_score: 0.0298
english_avg_score: 0.0330
science_avg_score: 0.0244
recent_avg_score: 0.1263
earlier_avg_score: 0.1353
grade_level_index: 0.0118

=== Business Impact Analysis ===


Total test students: 549
Students predicted as high performers: 23
Percentage identified for advanced programs: 4.2%



Students predicted as needing intervention: 526
Percentage flagged for academic support: 95.8%

Estimated intervention cost per student: $500
Total intervention cost: $263,000
Expected benefit from interventions: $1,315,000
Intervention program ROI: 400.0%

Additional value from advanced programs: $34,500



Model Performance:
Accuracy: 0.9854
Precision: 1.0000
Recall: 0.7419
AUC: 0.9977
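
The ROI figure above follows directly from the three assumed constants in the cell: the student count cancels out of the ratio, so the 400% result is a property of the assumptions rather than of the model. A small sketch makes this explicit (the function name `intervention_roi_pct` is ours, not part of the notebook):

```python
def intervention_roi_pct(n_students, cost_per_student=500,
                         effectiveness=0.25, value_per_student=10_000):
    """Return ROI in percent using the same formula as the notebook cell."""
    total_cost = n_students * cost_per_student
    expected_benefit = n_students * effectiveness * value_per_student
    return (expected_benefit - total_cost) / total_cost * 100


# The notebook's flagged count and two arbitrary alternatives.
for n in (526, 50, 5_000):
    print(f"{n} students -> ROI {intervention_roi_pct(n):.1f}%")

# Lowering the assumed effectiveness changes the result; the head count does not.
print(f"effectiveness=0.10 -> ROI {intervention_roi_pct(526, effectiveness=0.10):.1f}%")
```

All three student counts give 400.0%, so the number is best read as a sensitivity check on the cost, effectiveness, and value assumptions.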


## Key Takeaways: Delta Liquid Clustering + ML in AIDP

### What We Demonstrated

1. **Automatic Optimization**: Created a table with `CLUSTER BY (student_id, assessment_date)` and let Delta automatically optimize data layout

2. **Performance Benefits**: Queries that filter on the clustering columns (`student_id`, `assessment_date`) can read fewer files because rows with similar values are stored together

3. **Low Maintenance**: No manual partitioning, bucketing, or Z-Ordering required; the clustered layout is maintained during ingestion and maintenance operations

4. **Machine Learning Integration**: Trained a student performance prediction model using the optimized data

5. **Real-World Use Case**: Education analytics where student performance tracking and learning intervention are critical
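
For reference, the clustering setup in point 1 can be sketched as two SQL statements. The helper below only builds the SQL text, so it can be inspected without a cluster. The column list is the subset named in the table design, the column types are assumptions, and whether an explicit `OPTIMIZE` is needed depends on how maintenance is scheduled in your AIDP workspace.

```python
TABLE = "education.analytics.student_assessments"

# Column names come from the table design; the types are assumed for illustration.
COLUMNS = {
    "student_id": "STRING",
    "assessment_date": "DATE",
    "subject": "STRING",
    "score": "DOUBLE",
    "grade_level": "STRING",
    "completion_time": "DOUBLE",
}


def create_clustered_table_sql(table, columns, cluster_by):
    """Build a CREATE TABLE ... USING DELTA CLUSTER BY (...) statement."""
    unknown = [c for c in cluster_by if c not in columns]
    if unknown:
        raise ValueError(f"Clustering columns not in table: {unknown}")
    column_defs = ",\n    ".join(f"{name} {dtype}" for name, dtype in columns.items())
    keys = ", ".join(cluster_by)
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        f"    {column_defs}\n"
        f") USING DELTA\n"
        f"CLUSTER BY ({keys})"
    )


create_sql = create_clustered_table_sql(TABLE, COLUMNS, ["student_id", "assessment_date"])
optimize_sql = f"OPTIMIZE {TABLE}"

print(create_sql)
print(optimize_sql)

# In a notebook with an active Spark session, the statements would be run as:
#   spark.sql(create_sql)
#   spark.sql(optimize_sql)
```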

### AIDP Advantages

- **Unified Analytics**: Seamlessly integrates data optimization with ML
- **Governance**: Catalog and schema isolation for education data
- **Performance**: Optimized for both analytical queries and ML training
- **Scalability**: Built on Spark and Delta, so the same workflow applies to larger education datasets

### Business Benefits for Education

1. **Early Intervention**: Identify struggling students before performance declines
2. **Personalized Learning**: Tailor educational approaches to student needs
3. **Resource Optimization**: Focus support where it's most effective
4. **Academic Excellence**: Improve overall student performance and outcomes
5. **Educational Equity**: Ensure all students receive appropriate support

### Best Practices for Education Analytics

1. **Choose clustering columns** based on your most common query patterns
2. **Use 1 to 4 clustering columns**: Delta Lake allows at most four, and each additional column dilutes the benefit for the others
3. **Consider cardinality**: high-cardinality columns such as `student_id` work best
4. **Monitor and adjust** as query patterns evolve
5. **Combine with ML** for predictive analytics and automation
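
Practices 2 to 4 can be turned into a small check before changing the clustering keys. The sketch below is hypothetical helper code, not an AIDP API. It enforces Delta Lake's limit of four clustering columns, builds a query that estimates the cardinality of each candidate column, and builds an `ALTER TABLE ... CLUSTER BY` statement, which is assumed here to be supported by your Delta runtime.

```python
MAX_CLUSTERING_COLUMNS = 4  # Delta Lake limit on clustering columns


def cardinality_sql(table, candidates):
    """Build a query returning an approximate distinct count per candidate column."""
    selects = ", ".join(
        f"approx_count_distinct({col}) AS {col}_distinct" for col in candidates
    )
    return f"SELECT {selects} FROM {table}"


def recluster_sql(table, cluster_by):
    """Build an ALTER TABLE statement that changes the clustering columns."""
    if not 1 <= len(cluster_by) <= MAX_CLUSTERING_COLUMNS:
        raise ValueError(
            f"Expected 1 to {MAX_CLUSTERING_COLUMNS} clustering columns, "
            f"got {len(cluster_by)}"
        )
    return f"ALTER TABLE {table} CLUSTER BY ({', '.join(cluster_by)})"


table = "education.analytics.student_assessments"
print(cardinality_sql(table, ["student_id", "assessment_date", "subject"]))
print(recluster_sql(table, ["student_id", "subject"]))
```

Both strings can be passed to `spark.sql(...)` in the notebook. Comparing the distinct counts is one simple way to decide which candidates are worth clustering on before committing to a change.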

### Next Steps

- Explore other AIDP ML features like AutoML
- Try liquid clustering with different column combinations
- Scale up to larger education datasets
- Integrate with real learning management systems (LMS) and assessment platforms
- Deploy models for real-time student performance monitoring

This notebook demonstrates how Oracle AI Data Platform makes advanced education analytics accessible while maintaining enterprise-grade performance and governance.