# Education: Medallion Architecture with Delta Liquid Clustering Demo



## Overview



This notebook demonstrates a complete **Medallion Architecture** implementation in Oracle AI Data Platform (AIDP) Workbench for education analytics. The medallion architecture provides a structured approach to data processing:

- **Bronze Layer**: Raw data ingestion and storage
- **Silver Layer**: Cleaned, validated, and enriched data
- **Gold Layer**: Analytics-ready data with aggregations and ML insights

We'll use **Delta Liquid Clustering** throughout to automatically optimize data layout for query performance without manual tuning.

### What is Medallion Architecture?

Medallion Architecture organizes data processing into layers that progressively refine data quality and structure:

- **Bronze**: Raw, unprocessed data as received from sources
- **Silver**: Cleaned, standardized, and enriched data with business rules applied
- **Gold**: Aggregated, analytics-ready data optimized for business intelligence and ML
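
As a rough illustration of how the three layers relate, the following plain-Python sketch pushes a handful of made-up records through each stage. The sample rows and the `to_silver` helper are hypothetical and exist only for this example; the notebook itself does this work with Spark and Delta tables.

```python
from collections import defaultdict

# Bronze: raw values kept as strings, exactly as received (illustrative data)
bronze = [
    {"raw_student_id": "STU000001", "raw_score": "91.5"},
    {"raw_student_id": "STU000001", "raw_score": "N/A"},
    {"raw_student_id": "STU000002", "raw_score": "64.0"},
    {"raw_student_id": "STU000002", "raw_score": "187.2"},  # out of range
]

def to_silver(row):
    """Cast and validate one Bronze row; return None if the score is unusable."""
    try:
        score = float(row["raw_score"])
    except ValueError:
        return None
    if not 0 <= score <= 100:
        return None
    return {"student_id": row["raw_student_id"], "score": score}

# Silver: typed, validated rows
silver = [clean for clean in map(to_silver, bronze) if clean is not None]

# Gold: one aggregated row per student
scores_by_student = defaultdict(list)
for row in silver:
    scores_by_student[row["student_id"]].append(row["score"])
gold = {sid: sum(s) / len(s) for sid, s in scores_by_student.items()}

print(gold)  # {'STU000001': 91.5, 'STU000002': 64.0}
```

Bronze still holds all four rows after this runs, so the rejected values remain available for audit or for reprocessing under different rules.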

### Use Case: Student Performance Analytics and Learning Management

We'll build an end-to-end education analytics pipeline that:

- Ingests raw student assessment data
- Cleans and validates the data
- Creates analytics-ready aggregations
- Trains ML models for performance prediction

### AIDP Environment Setup

This notebook leverages the existing Spark session in your AIDP environment.

## Setup: Create Education Catalog and Schemas

In [None]:
# Create education catalog and layer-specific schemas
# In AIDP, catalogs provide data isolation and governance

spark.sql("CREATE CATALOG IF NOT EXISTS education")

# Bronze layer: Raw data
spark.sql("CREATE SCHEMA IF NOT EXISTS education.bronze")

# Silver layer: Cleaned and processed data
spark.sql("CREATE SCHEMA IF NOT EXISTS education.silver")

# Gold layer: Analytics and ML-ready data
spark.sql("CREATE SCHEMA IF NOT EXISTS education.gold")

print("Education catalog with bronze, silver, and gold schemas created successfully!")

Education catalog with bronze, silver, and gold schemas created successfully!


# Bronze Layer: Raw Data Ingestion

## Overview

The Bronze layer stores raw data exactly as received from source systems, without any transformation. This preserves data integrity and allows for:

- **Audit trails**: Complete historical record of all data received
- **Reprocessing**: Ability to re-run transformations if business rules change
- **Data lineage**: Clear traceability from source to final analytics

## Bronze Table Design

Our `education.bronze.student_assessments_raw` table will store raw assessment data with:

- Raw data fields as received from the Learning Management System (LMS)
- Ingestion timestamps for data freshness tracking
- Source system metadata
- No data validation or cleansing at this layer
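
The design points above can be sketched as a small wrapper that prepares one source row for Bronze. This is a minimal illustration, not the notebook's ingestion code: `to_bronze_record` is a hypothetical helper, only two of the raw fields are shown, and the UTC timestamp is a choice made for this sketch.

```python
import json
from datetime import datetime, timezone

def to_bronze_record(source_row, source_system, batch_id):
    """Wrap one source row for Bronze: string fields, metadata, full payload."""
    return {
        # Every source field is kept as a string; nothing is validated or cast
        "raw_student_id": str(source_row.get("student_id", "")),
        "raw_score": str(source_row.get("score", "")),
        # Ingestion metadata
        "ingestion_timestamp": datetime.now(timezone.utc),
        "source_system": source_system,
        "batch_id": batch_id,
        # Original row serialized unchanged for full fidelity
        "raw_payload": json.dumps(source_row, sort_keys=True),
    }

record = to_bronze_record(
    {"student_id": "STU000001", "score": "N/A"},
    source_system="LMS_System_A",
    batch_id="BATCH_EXAMPLE",
)
print(record["raw_score"], json.loads(record["raw_payload"]))
```

Note that the invalid score `"N/A"` passes through untouched; deciding what to do with it is left to the Silver layer.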

In [None]:
# Create Bronze layer table for raw student assessment data
# CLUSTER BY optimizes for ingestion patterns and audit queries

spark.sql("""
CREATE TABLE IF NOT EXISTS education.bronze.student_assessments_raw (
    -- Raw data fields as received from LMS
    raw_student_id STRING,
    raw_assessment_date STRING,
    raw_subject STRING,
    raw_score STRING,
    raw_grade_level STRING,
    raw_completion_time STRING,
    raw_engagement_score STRING,
    
    -- Ingestion metadata
    ingestion_timestamp TIMESTAMP,
    source_system STRING,
    batch_id STRING,
    
    -- Raw JSON payload for full fidelity
    raw_payload STRING
)
USING DELTA
CLUSTER BY (raw_student_id, ingestion_timestamp)
""")

print("Bronze layer table created successfully!")
print("Liquid clustering will optimize data layout for student_id and ingestion time queries.")

Bronze layer table created successfully!
Liquid clustering will optimize data layout for student_id and ingestion time queries.


In [None]:
# Generate realistic raw education data (simulating LMS export)
# This represents data as it might come from various source systems

import random
import json
from datetime import datetime, timedelta

# Define education data constants
SUBJECTS = ['Math', 'English', 'Science', 'History', 'Art', 'Physical Education']
GRADE_LEVELS = ['Kindergarten', '1st Grade', '2nd Grade', '3rd Grade', '4th Grade', '5th Grade', 
                '6th Grade', '7th Grade', '8th Grade', '9th Grade', '10th Grade', '11th Grade', '12th Grade']

# Simulate different source systems with varying data quality
SOURCE_SYSTEMS = ['LMS_System_A', 'Assessment_Platform_B', 'School_Database_C']

# Generate raw assessment records (some with data quality issues)
raw_assessment_data = []
base_date = datetime(2024, 1, 1)
batch_id = f"BATCH_{datetime.now().strftime('%Y%m%d_%H%M%S')}"

# Create 3,000 students with 15-30 assessments each
for student_num in range(1, 3001):
    student_id = f"STU{student_num:06d}"
    
    # Assign grade level
    grade_level = random.choice(GRADE_LEVELS)
    
    # Each student gets 15-30 assessments over 12 months
    num_assessments = random.randint(15, 30)
    
    for i in range(num_assessments):
        # Spread assessments over 12 months
        days_offset = random.randint(0, 365)
        assessment_date = base_date + timedelta(days=days_offset)
        
        # Select subject
        subject = random.choice(SUBJECTS)
        
        # Generate realistic scores with some data quality issues
        base_score = random.uniform(50, 100)
        
        # Simulate data quality issues (10% of records)
        has_quality_issue = random.random() < 0.1
        
        if has_quality_issue:
            # Introduce various data quality problems in the score field
            quality_issues = [
                lambda: str(random.choice(['', 'NULL', 'N/A'])),  # Missing values
                lambda: str(random.uniform(150, 200)),  # Out-of-range scores
                lambda: f"{random.randint(1,12)}th Grade",  # Grade label placed in the score field
                lambda: str(random.uniform(-10, 50)),  # Unrounded low scores, some negative
            ]
            score = random.choice(quality_issues)()
        else:
            score = str(round(base_score, 2))
        
        # Generate other fields with occasional quality issues
        completion_time = str(round(random.uniform(20, 120), 2)) if random.random() > 0.05 else ''
        engagement_score = str(random.randint(0, 100)) if random.random() > 0.05 else 'INVALID'
        
        # Select source system
        source_system = random.choice(SOURCE_SYSTEMS)
        
        # Create raw JSON payload (simulating source system export)
        raw_payload = json.dumps({
            "student_id": student_id,
            "assessment_date": assessment_date.strftime('%Y-%m-%d'),
            "subject": subject,
            "score": score,
            "grade_level": grade_level,
            "completion_time": completion_time,
            "engagement_score": engagement_score,
            "source_metadata": {
                "export_timestamp": datetime.now().isoformat(),
                "data_quality_score": random.uniform(0.7, 1.0)
            }
        })
        
        raw_assessment_data.append({
            "raw_student_id": student_id,
            "raw_assessment_date": assessment_date.strftime('%Y-%m-%d'),
            "raw_subject": subject,
            "raw_score": score,
            "raw_grade_level": grade_level,
            "raw_completion_time": completion_time,
            "raw_engagement_score": engagement_score,
            "ingestion_timestamp": datetime.now(),
            "source_system": source_system,
            "batch_id": batch_id,
            "raw_payload": raw_payload
        })

print(f"Generated {len(raw_assessment_data)} raw student assessment records")
print("Sample raw record (with potential data quality issues):", raw_assessment_data[0])

Generated 67535 raw student assessment records
Sample raw record (with potential data quality issues): {'raw_student_id': 'STU000001', 'raw_assessment_date': '2024-04-05', 'raw_subject': 'English', 'raw_score': '187.20373006255488', 'raw_grade_level': '3rd Grade', 'raw_completion_time': '89.16', 'raw_engagement_score': '90', 'ingestion_timestamp': datetime.datetime(2025, 12, 20, 0, 48, 32, 590515), 'source_system': 'LMS_System_A', 'batch_id': 'BATCH_20251220_004832', 'raw_payload': '{"student_id": "STU000001", "assessment_date": "2024-04-05", "subject": "English", "score": "187.20373006255488", "grade_level": "3rd Grade", "completion_time": "89.16", "engagement_score": "90", "source_metadata": {"export_timestamp": "2025-12-20T00:48:32.590495", "data_quality_score": 0.7309072167760401}}'}


In [None]:
# Insert raw data into Bronze layer
# Using PySpark for distributed processing and type safety

# Create DataFrame from generated raw data
df_raw_assessments = spark.createDataFrame(raw_assessment_data)

# Display schema and sample data
print("Bronze Layer DataFrame Schema:")
df_raw_assessments.printSchema()

print("\nSample Raw Data (Bronze Layer):")
df_raw_assessments.show(5)

# Insert data into Bronze table with liquid clustering
# This preserves the raw data exactly as received
df_raw_assessments.write.mode("overwrite").saveAsTable("education.bronze.student_assessments_raw")

print(f"\nSuccessfully inserted {df_raw_assessments.count()} raw records into Bronze layer")
print("Data is stored exactly as received - no transformations applied.")

Bronze Layer DataFrame Schema:
root
 |-- batch_id: string (nullable = true)
 |-- ingestion_timestamp: timestamp (nullable = true)
 |-- raw_assessment_date: string (nullable = true)
 |-- raw_completion_time: string (nullable = true)
 |-- raw_engagement_score: string (nullable = true)
 |-- raw_grade_level: string (nullable = true)
 |-- raw_payload: string (nullable = true)
 |-- raw_score: string (nullable = true)
 |-- raw_student_id: string (nullable = true)
 |-- raw_subject: string (nullable = true)
 |-- source_system: string (nullable = true)


Sample Raw Data (Bronze Layer):


+--------------------+--------------------+-------------------+-------------------+--------------------+---------------+--------------------+------------------+--------------+-----------+--------------------+
|            batch_id| ingestion_timestamp|raw_assessment_date|raw_completion_time|raw_engagement_score|raw_grade_level|         raw_payload|         raw_score|raw_student_id|raw_subject|       source_system|
+--------------------+--------------------+-------------------+-------------------+--------------------+---------------+--------------------+------------------+--------------+-----------+--------------------+
|BATCH_20251220_00...|2025-12-20 00:48:...|         2024-04-05|              89.16|                  90|      3rd Grade|{"student_id": "S...|187.20373006255488|     STU000001|    English|        LMS_System_A|
|BATCH_20251220_00...|2025-12-20 00:48:...|         2024-09-10|              53.93|                  87|      3rd Grade|{"student_id": "S...|             85.18|    


Successfully inserted 67535 raw records into Bronze layer
Data is stored exactly as received - no transformations applied.


In [None]:
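# Demonstrate Bronze layer querying
# Show raw rows where the score or the engagement score is not a plain number.
# raw_engagement_score is filtered on but not selected, so some rows in the
# result show a valid-looking raw_score.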

print("=== Bronze Layer: Raw Data Inspection ===")

# Query raw data to show data quality issues
raw_data_sample = spark.sql("""
SELECT raw_student_id, raw_assessment_date, raw_subject, 
       raw_score, raw_grade_level, source_system
FROM education.bronze.student_assessments_raw
WHERE raw_score NOT REGEXP '^[0-9]+\\.?[0-9]*$'
   OR raw_score = '' 
   OR raw_engagement_score NOT REGEXP '^[0-9]+$' 
LIMIT 10
""")

print("Records with potential data quality issues:")
raw_data_sample.show()

print(f"Total raw records in Bronze layer: {spark.table('education.bronze.student_assessments_raw').count()}")

=== Bronze Layer: Raw Data Inspection ===


Records with potential data quality issues:


+--------------+-------------------+------------------+-----------------+---------------+--------------------+
|raw_student_id|raw_assessment_date|       raw_subject|        raw_score|raw_grade_level|       source_system|
+--------------+-------------------+------------------+-----------------+---------------+--------------------+
|     STU002227|         2024-11-17|Physical Education|38.43922011268124|   Kindergarten|Assessment_Platfo...|
|     STU002228|         2024-03-30|              Math|            66.05|      1st Grade|        LMS_System_A|
|     STU002229|         2024-12-16|           English|             NULL|     11th Grade|Assessment_Platfo...|
|     STU002229|         2024-10-16|              Math|                 |     11th Grade|Assessment_Platfo...|
|     STU002230|         2024-05-30|              Math|        6th Grade|   Kindergarten|Assessment_Platfo...|
|     STU002230|         2024-06-23|           History|             NULL|   Kindergarten|   School_Database_C|
|

Total raw records in Bronze layer: 67535


# Silver Layer: Data Cleansing and Standardization

## Overview

The Silver layer transforms raw Bronze data into clean, standardized, and enriched datasets. Key activities:

- **Data validation**: Remove or correct invalid data
- **Standardization**: Normalize formats and values
- **Enrichment**: Add derived fields and business logic
- **Deduplication**: Remove duplicate records
- **Schema enforcement**: Apply consistent data types
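
The transformation cells in this notebook cover validation, standardization, and enrichment, but do not include a deduplication step. One possible approach is sketched below in plain Python. The choice of `(student_id, assessment_date, subject)` as the duplicate key and "latest ingestion wins" as the tie-break are assumptions for illustration, and `deduplicate` is a hypothetical helper.

```python
def deduplicate(rows):
    """Keep the most recently ingested row for each (student, date, subject) key."""
    latest = {}
    for row in rows:
        key = (row["student_id"], row["assessment_date"], row["subject"])
        kept = latest.get(key)
        if kept is None or row["ingestion_timestamp"] > kept["ingestion_timestamp"]:
            latest[key] = row
    return list(latest.values())

rows = [
    {"student_id": "STU000001", "assessment_date": "2024-04-05", "subject": "Math",
     "score": 70.0, "ingestion_timestamp": "2025-12-20T00:48:32"},
    {"student_id": "STU000001", "assessment_date": "2024-04-05", "subject": "Math",
     "score": 72.0, "ingestion_timestamp": "2025-12-21T09:00:00"},  # later re-export
    {"student_id": "STU000002", "assessment_date": "2024-04-05", "subject": "Math",
     "score": 88.0, "ingestion_timestamp": "2025-12-20T00:48:32"},
]

unique_rows = deduplicate(rows)
print(len(unique_rows))  # 2
```

ISO 8601 timestamp strings in the same format sort chronologically, which is why a plain string comparison is enough here.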

## Silver Table Design

Our `education.silver.student_assessments_clean` table will store cleaned data with:

- Validated and standardized fields
- Derived metrics (e.g., performance categories)
- Data quality scores
- Business rule validations
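
The derived fields can be sketched in plain Python as follows. The thresholds (85/70 for scores, 80/60 for engagement) and the two-field quality score follow the Silver transformation in this notebook; the helper names `categorize` and `data_quality_score` are hypothetical.

```python
def categorize(value, high, medium):
    """Map a numeric value to High/Medium/Low, or Unknown when it is missing."""
    if value is None:
        return "Unknown"
    if value >= high:
        return "High"
    if value >= medium:
        return "Medium"
    return "Low"

def data_quality_score(score, engagement):
    """Fraction of the two validated fields (score, engagement) that are present."""
    return ((score is not None) + (engagement is not None)) / 2.0

print(categorize(91.5, high=85, medium=70))  # High
print(categorize(None, high=80, medium=60))  # Unknown
print(data_quality_score(91.5, None))        # 0.5
```

A record with a valid score but an invalid engagement value therefore gets a quality score of 0.5 and an engagement category of `Unknown`.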

In [None]:
# Create Silver layer table for cleaned and validated data
# CLUSTER BY optimizes for analytical queries and student tracking

spark.sql("""
CREATE TABLE IF NOT EXISTS education.silver.student_assessments_clean (
    -- Standardized fields
    student_id STRING,
    assessment_date DATE,
    subject STRING,
    score DECIMAL(5,2),
    grade_level STRING,
    completion_time DECIMAL(6,2),
    engagement_score INT,
    
    -- Derived and enriched fields
    performance_category STRING,  -- High, Medium, Low, or Unknown
    engagement_category STRING,   -- High, Medium, Low, or Unknown
    is_valid_score BOOLEAN,
    is_valid_engagement BOOLEAN,
    
    -- Metadata
    source_system STRING,
    batch_id STRING,
    processing_timestamp TIMESTAMP,
    data_quality_score DECIMAL(3,2),
    
    -- Bronze layer reference for lineage
    bronze_batch_id STRING
)
USING DELTA
CLUSTER BY (student_id, assessment_date)
""")

print("Silver layer table created successfully!")
print("Optimized for student-specific and time-based analytical queries.")

Silver layer table created successfully!
Optimized for student-specific and time-based analytical queries.


In [None]:
# Transform Bronze data to Silver layer
# Apply data cleansing, validation, and enrichment
from pyspark.sql import functions as F

from pyspark.sql.functions import col, when, regexp_replace, to_date, udf
from pyspark.sql.types import BooleanType, StringType

# Read Bronze data
bronze_df = spark.table("education.bronze.student_assessments_raw")

# Define data cleansing functions
def validate_score(score_str):
    """Return the score as a float if it is numeric and within 0-100, else None."""
    if not score_str or score_str in ['', 'NULL', 'N/A', 'INVALID']:
        return None
    try:
        score = float(score_str)
        return score if 0 <= score <= 100 else None
    except (ValueError, TypeError):
        # Non-numeric text such as "6th Grade"
        return None

def validate_engagement(engagement_str):
    """Return the engagement score as an int if it is within 0-100, else None."""
    if not engagement_str or engagement_str in ['', 'NULL', 'N/A', 'INVALID']:
        return None
    try:
        score = int(float(engagement_str))
        return score if 0 <= score <= 100 else None
    except (ValueError, TypeError):
        return None

def standardize_grade_level(grade_str):
    """Standardize grade level formats"""
    if not grade_str:
        return None
    
    # Handle various formats
    grade_str = grade_str.strip()
    
    # Already in "Nth Grade" form (1st, 2nd, 3rd, 4th, ...): keep as-is
    if grade_str.endswith(('th Grade', 'st Grade', 'nd Grade', 'rd Grade')):
        return grade_str
    
    # Convert "Grade X" to "Xth Grade"
    if grade_str.startswith('Grade '):
        grade_num = grade_str.replace('Grade ', '')
        try:
            num = int(grade_num)
            if num == 1:
                return "1st Grade"
            elif num == 2:
                return "2nd Grade"
            elif num == 3:
                return "3rd Grade"
            else:
                return f"{num}th Grade"
        except ValueError:
            return grade_str
    
    return grade_str

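# Register UDFs
# Note: udf() without an explicit returnType defaults to StringType, so the
# score and engagement_score columns come back as strings (see the printed
# schema); Spark casts them implicitly in the numeric comparisons that follow.
# Pass a returnType to udf() if the columns should stay numeric.
validate_score_udf = udf(validate_score)
validate_engagement_udf = udf(validate_engagement)
standardize_grade_udf = udf(standardize_grade_level)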

# Transform Bronze to Silver
silver_df = bronze_df.withColumn("student_id", col("raw_student_id")) \
    .withColumn("assessment_date", to_date(col("raw_assessment_date"))) \
    .withColumn("subject", col("raw_subject")) \
    .withColumn("score", validate_score_udf(col("raw_score"))) \
    .withColumn("grade_level", standardize_grade_udf(col("raw_grade_level"))) \
    .withColumn("completion_time", 
                when(col("raw_completion_time").rlike("^[0-9]+\\.?[0-9]*$"), 
                     col("raw_completion_time").cast("decimal(6,2)")).otherwise(None)) \
    .withColumn("engagement_score", validate_engagement_udf(col("raw_engagement_score"))) \
    .withColumn("performance_category",
                when(col("score") >= 85, "High")
                .when(col("score") >= 70, "Medium")
                .when(col("score") < 70, "Low")
                .otherwise("Unknown")) \
    .withColumn("engagement_category",
                when(col("engagement_score") >= 80, "High")
                .when(col("engagement_score") >= 60, "Medium")
                .when(col("engagement_score") < 60, "Low")
                .otherwise("Unknown")) \
    .withColumn("is_valid_score", col("score").isNotNull()) \
    .withColumn("is_valid_engagement", col("engagement_score").isNotNull()) \
    .withColumn("source_system", col("source_system")) \
    .withColumn("batch_id", col("batch_id")) \
    .withColumn("processing_timestamp", F.current_timestamp()) \
    .withColumn("data_quality_score", 
                ((when(col("is_valid_score"), 1).otherwise(0)) + 
                 (when(col("is_valid_engagement"), 1).otherwise(0))) / 2.0) \
    .withColumn("bronze_batch_id", col("batch_id")) \
    .select("student_id", "assessment_date", "subject", "score", "grade_level", 
            "completion_time", "engagement_score", "performance_category", 
            "engagement_category", "is_valid_score", "is_valid_engagement", 
            "source_system", "batch_id", "processing_timestamp", 
            "data_quality_score", "bronze_batch_id")

# Keep only records with at least one valid field (score or engagement)
silver_df = silver_df.filter("is_valid_score = true OR is_valid_engagement = true")

print(f"Transformed {silver_df.count()} clean records for Silver layer")
print("Applied data validation, standardization, and enrichment.")

Transformed 67240 clean records for Silver layer
Applied data validation, standardization, and enrichment.


In [None]:
# Insert cleaned data into Silver layer

# Display sample of cleaned data
print("Silver Layer DataFrame Schema:")
silver_df.printSchema()

print("\nSample Cleaned Data (Silver Layer):")
silver_df.show(5)

# Insert into Silver table
silver_df.write.mode("overwrite").saveAsTable("education.silver.student_assessments_clean")

print(f"\nSuccessfully inserted {silver_df.count()} cleaned records into Silver layer")

# Show data quality improvements
quality_stats = silver_df.groupBy("is_valid_score", "is_valid_engagement").count()
print("\nData Quality Statistics:")
quality_stats.show()

Silver Layer DataFrame Schema:
root
 |-- student_id: string (nullable = true)
 |-- assessment_date: date (nullable = true)
 |-- subject: string (nullable = true)
 |-- score: string (nullable = true)
 |-- grade_level: string (nullable = true)
 |-- completion_time: decimal(6,2) (nullable = true)
 |-- engagement_score: string (nullable = true)
 |-- performance_category: string (nullable = false)
 |-- engagement_category: string (nullable = false)
 |-- is_valid_score: boolean (nullable = false)
 |-- is_valid_engagement: boolean (nullable = false)
 |-- source_system: string (nullable = true)
 |-- batch_id: string (nullable = true)
 |-- processing_timestamp: timestamp (nullable = false)
 |-- data_quality_score: double (nullable = true)
 |-- bronze_batch_id: string (nullable = true)


Sample Cleaned Data (Silver Layer):


+----------+---------------+------------------+-----------------+------------+---------------+----------------+--------------------+-------------------+--------------+-------------------+--------------------+--------------------+--------------------+------------------+--------------------+
|student_id|assessment_date|           subject|            score| grade_level|completion_time|engagement_score|performance_category|engagement_category|is_valid_score|is_valid_engagement|       source_system|            batch_id|processing_timestamp|data_quality_score|     bronze_batch_id|
+----------+---------------+------------------+-----------------+------------+---------------+----------------+--------------------+-------------------+--------------+-------------------+--------------------+--------------------+--------------------+------------------+--------------------+
| STU002226|     2024-11-22|           History|            66.58|   9th Grade|          84.15|               4|                


Successfully inserted 67240 cleaned records into Silver layer

Data Quality Statistics:


+--------------+-------------------+-----+
|is_valid_score|is_valid_engagement|count|
+--------------+-------------------+-----+
|          true|              false| 3081|
|          true|               true|59184|
|         false|               true| 4975|
+--------------+-------------------+-----+



In [None]:
# Demonstrate Silver layer analytical capabilities
# Show cleaned data benefits

print("=== Silver Layer: Cleaned Data Analytics ===")

# Query performance by category
performance_analysis = spark.sql("""
SELECT performance_category, engagement_category, 
       COUNT(*) as record_count,
       ROUND(AVG(score), 2) as avg_score,
       ROUND(AVG(engagement_score), 2) as avg_engagement,
       ROUND(AVG(data_quality_score * 100), 2) as avg_quality_score
FROM education.silver.student_assessments_clean
GROUP BY performance_category, engagement_category
ORDER BY performance_category, engagement_category
""")

performance_analysis.show()

# Show grade level standardization
grade_standardization = spark.sql("""
SELECT grade_level, COUNT(*) as count
FROM education.silver.student_assessments_clean
GROUP BY grade_level
ORDER BY grade_level
""")

print("\nGrade Level Standardization:")
grade_standardization.show()

=== Silver Layer: Cleaned Data Analytics ===


+--------------------+-------------------+------------+---------+--------------+-----------------+
|performance_category|engagement_category|record_count|avg_score|avg_engagement|avg_quality_score|
+--------------------+-------------------+------------+---------+--------------+-----------------+
|                High|               High|        3626|    92.47|         90.09|            100.0|
|                High|                Low|       10255|     92.5|         29.45|            100.0|
|                High|             Medium|        3447|    92.46|         69.42|            100.0|
|                High|            Unknown|         925|    92.53|          NULL|             50.0|
|                 Low|               High|        5111|    58.11|         89.91|            100.0|
|                 Low|                Low|       14478|    58.07|         29.72|            100.0|
|                 Low|             Medium|        4774|    57.95|         69.42|            100.0|
|         


Grade Level Standardization:


+------------+-----+
| grade_level|count|
+------------+-----+
|  10th Grade| 5059|
|  11th Grade| 5288|
|  12th Grade| 4932|
|   1st Grade| 5150|
|   2nd Grade| 5351|
|   3rd Grade| 5059|
|   4th Grade| 4419|
|   5th Grade| 5544|
|   6th Grade| 5272|
|   7th Grade| 5208|
|   8th Grade| 5364|
|   9th Grade| 5372|
|Kindergarten| 5222|
+------------+-----+



# Gold Layer: Analytics and Machine Learning

## Overview

The Gold layer provides analytics-ready data optimized for business intelligence and machine learning. Key activities:

- **Aggregations**: Pre-computed metrics and KPIs
- **Machine Learning**: Feature engineering and model training
- **Business Intelligence**: Dashboard-ready datasets
- **Performance Optimization**: Denormalized for fast queries

## Gold Tables

- `student_performance_aggregates`: Aggregated student metrics
- `subject_performance_analytics`: Subject-level analytics
- `student_performance_predictions`: ML predictions for intervention planning
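
The risk and trend rules behind `student_performance_aggregates` can be sketched in plain Python. The thresholds and the `2024-07-01` split follow the aggregation query in this notebook; the helper names `risk_level` and `trend` are hypothetical, and the sketch assumes non-null averages, whereas the SQL version also has to cope with NULLs.

```python
from statistics import mean

def risk_level(avg_score, avg_engagement):
    """Apply the two-factor risk rules to a student's averages."""
    if avg_score < 70 and avg_engagement < 60:
        return "High Risk"
    if avg_score < 75 or avg_engagement < 65:
        return "Medium Risk"
    return "Low Risk"

def trend(dated_values, split_date="2024-07-01"):
    """Mean on/after split_date minus mean before it; 0.0 if either side is empty."""
    recent = [v for d, v in dated_values if d >= split_date]
    earlier = [v for d, v in dated_values if d < split_date]
    if not recent or not earlier:
        return 0.0
    return round(mean(recent) - mean(earlier), 2)

assessments = [("2024-03-10", 60.0), ("2024-05-02", 64.0), ("2024-09-15", 72.0)]
print(risk_level(avg_score=68.0, avg_engagement=49.5))  # High Risk
print(trend(assessments))                               # 10.0
```

Because the second rule uses `OR`, a student with a high average score but low engagement is still flagged as `Medium Risk`.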

In [None]:
# Create Gold layer tables for analytics and ML

# Student performance aggregates
spark.sql("""
CREATE TABLE IF NOT EXISTS education.gold.student_performance_aggregates (
    student_id STRING,
    grade_level STRING,
    total_assessments INT,
    avg_score DECIMAL(5,2),
    score_stddev DECIMAL(5,2),
    avg_engagement DECIMAL(5,2),
    avg_completion_time DECIMAL(6,2),
    subjects_attempted INT,
    active_months INT,
    performance_trend DECIMAL(5,2),
    engagement_trend DECIMAL(5,2),
    overall_performance_category STRING,
    risk_level STRING,
    last_assessment_date DATE,
    processing_timestamp TIMESTAMP
)
USING DELTA
CLUSTER BY (grade_level, overall_performance_category)
""")

# Subject performance analytics
spark.sql("""
CREATE TABLE IF NOT EXISTS education.gold.subject_performance_analytics (
    subject STRING,
    grade_level STRING,
    total_assessments INT,
    avg_score DECIMAL(5,2),
    avg_engagement DECIMAL(5,2),
    avg_completion_time DECIMAL(6,2),
    unique_students INT,
    performance_distribution STRING,
    difficulty_rating DECIMAL(3,2),
    processing_timestamp TIMESTAMP
)
USING DELTA
CLUSTER BY (subject, grade_level)
""")

# ML predictions table
spark.sql("""
CREATE TABLE IF NOT EXISTS education.gold.student_performance_predictions (
    student_id STRING,
    predicted_performance_category STRING,
    prediction_probability DECIMAL(5,4),
    risk_score DECIMAL(5,4),
    recommended_intervention STRING,
    model_version STRING,
    prediction_timestamp TIMESTAMP,
    features_used STRING
)
USING DELTA
CLUSTER BY (student_id, prediction_timestamp)
""")

print("Gold layer tables created successfully!")
print("Optimized for analytical queries and ML feature serving.")

Gold layer tables created successfully!
Optimized for analytical queries and ML feature serving.


In [None]:
# Create student performance aggregates

student_aggregates = spark.sql("""
SELECT 
    student_id,
    grade_level,
    COUNT(*) as total_assessments,
    ROUND(AVG(score), 2) as avg_score,
    ROUND(STDDEV(score), 2) as score_stddev,
    ROUND(AVG(engagement_score), 2) as avg_engagement,
    ROUND(AVG(completion_time), 2) as avg_completion_time,
    COUNT(DISTINCT subject) as subjects_attempted,
    COUNT(DISTINCT DATE_FORMAT(assessment_date, 'yyyy-MM')) as active_months,
    -- Performance trend: average score from 2024-07-01 onward minus average score before it
    ROUND(
        AVG(CASE WHEN assessment_date >= '2024-07-01' THEN score END) - 
        AVG(CASE WHEN assessment_date < '2024-07-01' THEN score END), 
        2
    ) as performance_trend,
    -- Engagement trend
    ROUND(
        AVG(CASE WHEN assessment_date >= '2024-07-01' THEN engagement_score END) - 
        AVG(CASE WHEN assessment_date < '2024-07-01' THEN engagement_score END), 
        2
    ) as engagement_trend,
    -- Overall category
    CASE 
        WHEN AVG(score) >= 85 THEN 'High Performer'
        WHEN AVG(score) >= 70 THEN 'Medium Performer'
        ELSE 'Low Performer'
    END as overall_performance_category,
    -- Risk level based on multiple factors
    CASE 
        WHEN AVG(score) < 70 AND AVG(engagement_score) < 60 THEN 'High Risk'
        WHEN AVG(score) < 75 OR AVG(engagement_score) < 65 THEN 'Medium Risk'
        ELSE 'Low Risk'
    END as risk_level,
    MAX(assessment_date) as last_assessment_date,
    CURRENT_TIMESTAMP() as processing_timestamp
FROM education.silver.student_assessments_clean
GROUP BY student_id, grade_level
""")

# Students with assessments in only one half of the year get a NULL trend; treat it as 0
student_aggregates = student_aggregates.fillna(0, subset=['performance_trend', 'engagement_trend'])

print(f"Created performance aggregates for {student_aggregates.count()} students")

# Insert into Gold layer
student_aggregates.write.mode("overwrite").saveAsTable("education.gold.student_performance_aggregates")

print("Student performance aggregates inserted into Gold layer")

Created performance aggregates for 3000 students


Student performance aggregates inserted into Gold layer


In [None]:
# Create subject performance analytics

subject_analytics = spark.sql("""
SELECT 
    subject,
    grade_level,
    COUNT(*) as total_assessments,
    ROUND(AVG(score), 2) as avg_score,
    ROUND(AVG(engagement_score), 2) as avg_engagement,
    ROUND(AVG(completion_time), 2) as avg_completion_time,
    COUNT(DISTINCT student_id) as unique_students,
    -- Performance distribution
    CONCAT(
        'High: ', CAST(COUNT(CASE WHEN performance_category = 'High' THEN 1 END) AS STRING), ', ',
        'Medium: ', CAST(COUNT(CASE WHEN performance_category = 'Medium' THEN 1 END) AS STRING), ', ',
        'Low: ', CAST(COUNT(CASE WHEN performance_category = 'Low' THEN 1 END) AS STRING)
    ) as performance_distribution,
    -- Difficulty rating: (100 - average score) / 20, so lower averages rate as harder
    ROUND((100 - AVG(score)) / 20, 2) as difficulty_rating,
    CURRENT_TIMESTAMP() as processing_timestamp
FROM education.silver.student_assessments_clean
GROUP BY subject, grade_level
ORDER BY subject, grade_level
""")

print(f"Created analytics for {subject_analytics.count()} subject-grade combinations")

# Insert into Gold layer
subject_analytics.write.mode("overwrite").saveAsTable("education.gold.subject_performance_analytics")

print("Subject performance analytics inserted into Gold layer")

Created analytics for 78 subject-grade combinations


Subject performance analytics inserted into Gold layer


In [None]:
# Demonstrate Gold layer analytics capabilities

print("=== Gold Layer: Analytics Dashboard Data ===")

# Student performance overview
performance_overview = spark.sql("""
SELECT overall_performance_category, risk_level, 
       COUNT(*) as student_count,
       ROUND(AVG(avg_score), 2) as avg_score,
       ROUND(AVG(avg_engagement), 2) as avg_engagement
FROM education.gold.student_performance_aggregates
GROUP BY overall_performance_category, risk_level
ORDER BY overall_performance_category, risk_level
""")

performance_overview.show()

# Subject difficulty analysis
subject_difficulty = spark.sql("""
SELECT subject, ROUND(AVG(avg_score), 2) as avg_score, 
       ROUND(AVG(difficulty_rating), 2) as avg_difficulty
FROM education.gold.subject_performance_analytics
GROUP BY subject
ORDER BY avg_difficulty DESC
""")

print("\nSubject Difficulty Analysis:")
subject_difficulty.show()

# At-risk students for intervention
at_risk_students = spark.sql("""
SELECT student_id, grade_level, avg_score, avg_engagement, risk_level,
       performance_trend, engagement_trend
FROM education.gold.student_performance_aggregates
WHERE risk_level IN ('High Risk', 'Medium Risk')
ORDER BY avg_score ASC
LIMIT 10
""")

print("\nStudents Needing Intervention:")
at_risk_students.show()

=== Gold Layer: Analytics Dashboard Data ===


+----------------------------+-----------+-------------+---------+--------------+
|overall_performance_category| risk_level|student_count|avg_score|avg_engagement|
+----------------------------+-----------+-------------+---------+--------------+
|              High Performer|Medium Risk|            2|    85.28|         50.33|
|               Low Performer|  High Risk|          382|    68.03|         49.53|
|               Low Performer|Medium Risk|           22|    67.91|         62.34|
|            Medium Performer|   Low Risk|           19|    77.82|          67.5|
|            Medium Performer|Medium Risk|         2575|    74.78|         50.14|
+----------------------------+-----------+-------------+---------+--------------+




Subject Difficulty Analysis:


+------------------+---------+--------------+
|           subject|avg_score|avg_difficulty|
+------------------+---------+--------------+
|           Science|    73.55|          1.32|
|Physical Education|    73.95|           1.3|
|               Art|    73.96|           1.3|
|              Math|    73.96|           1.3|
|           English|    73.99|           1.3|
|           History|    74.02|           1.3|
+------------------+---------+--------------+




Students Needing Intervention:


+----------+------------+---------+--------------+----------+-----------------+----------------+
|student_id| grade_level|avg_score|avg_engagement|risk_level|performance_trend|engagement_trend|
+----------+------------+---------+--------------+----------+-----------------+----------------+
| STU000045|  11th Grade|    59.83|         45.28| High Risk|            14.64|           20.83|
| STU001660|   6th Grade|    60.19|         43.61| High Risk|            -1.62|           -9.61|
| STU002952|  10th Grade|    61.69|          46.0| High Risk|             1.46|          -11.38|
| STU000141|   3rd Grade|    61.94|         57.45| High Risk|            -0.65|           -7.48|
| STU001770|   5th Grade|    62.71|         40.79| High Risk|             -4.2|           19.19|
| STU000401|  11th Grade|    63.19|         38.88| High Risk|            -1.53|           23.27|
| STU000861|Kindergarten|     63.3|         41.69| High Risk|            -2.98|            7.83|
| STU000802|   5th Grade|    6

In [None]:
# Train student performance prediction model
# The ML component lives in the Gold layer and trains on the Gold aggregates

from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline
import pyspark.sql.functions as F

# Create ML features from Gold layer aggregates
ml_features = spark.sql("""
SELECT 
    student_id,
    total_assessments,
    avg_score,
    score_stddev,
    avg_engagement,
    avg_completion_time,
    subjects_attempted,
    active_months,
    performance_trend,
    engagement_trend,
    grade_level,
    -- Target: High performance (score >= 75)
    CASE WHEN avg_score >= 75 THEN 1 ELSE 0 END as high_performer,
    -- Risk score for intervention planning
    CASE 
        WHEN risk_level = 'High Risk' THEN 0.9
        WHEN risk_level = 'Medium Risk' THEN 0.6
        ELSE 0.2
    END as risk_score
FROM education.gold.student_performance_aggregates
""")

# Fill null values
ml_features = ml_features.fillna(0, subset=['score_stddev', 'performance_trend', 'engagement_trend'])

print(f"Prepared ML features for {ml_features.count()} students")
ml_features.groupBy("high_performer").count().show()

Prepared ML features for 3000 students


+--------------+-----+
|high_performer|count|
+--------------+-----+
|             1| 1162|
|             0| 1838|
+--------------+-----+
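
The classes are moderately imbalanced (1,162 high performers out of 3,000 students), so it helps to know what a trivial model would score before reading any accuracy figure. The sketch below uses plain Python and the counts printed above; `majority_baseline_accuracy` is a hypothetical helper, not part of the notebook.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy of a model that always predicts the most frequent class."""
    total = sum(class_counts.values())
    if total == 0:
        raise ValueError("class_counts must contain at least one example")
    return max(class_counts.values()) / total


# Counts taken from the groupBy("high_performer") output above
label_counts = {1: 1162, 0: 1838}
baseline = majority_baseline_accuracy(label_counts)
print(f"Majority-class baseline accuracy: {baseline:.4f}")
```

Any trained model should beat this baseline by a clear margin to be considered useful.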



In [None]:
# Feature engineering and model training

# Create indexers for categorical features
grade_indexer = StringIndexer(inputCol="grade_level", outputCol="grade_level_index")

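# Assemble features for the model
# Caution: the label high_performer is derived from avg_score (>= 75), so keeping
# avg_score as a feature leaks the target and inflates the evaluation metrics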
feature_cols = ["total_assessments", "avg_score", "score_stddev", "avg_engagement", 
                "avg_completion_time", "subjects_attempted", "active_months", 
                "performance_trend", "engagement_trend", "grade_level_index"]

assembler = VectorAssembler(
    inputCols=feature_cols,
    outputCol="features"
)

# Scale features
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")

# Create and train the model
rf = RandomForestClassifier(
    labelCol="high_performer", 
    featuresCol="scaled_features",
    numTrees=100,
    maxDepth=10,
    seed=42
)

# Create pipeline
pipeline = Pipeline(stages=[grade_indexer, assembler, scaler, rf])

# Split data
train_data, test_data = ml_features.randomSplit([0.8, 0.2], seed=42)

print(f"Training set: {train_data.count()} students")
print(f"Test set: {test_data.count()} students")

# Train the model
print("Training student performance prediction model...")
model = pipeline.fit(train_data)

# Make predictions
predictions = model.transform(test_data)

# Evaluate the model
evaluator = BinaryClassificationEvaluator(labelCol="high_performer", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)

print(f"\nModel Performance - AUC: {auc:.4f}")

# Show prediction results
predictions.select("student_id", "avg_score", "avg_engagement", "high_performer", "prediction", "probability").show(10)

Training set: 2451 students


Test set: 549 students
Training student performance prediction model...



Model Performance - AUC: 0.9999


+----------+---------+--------------+--------------+----------+--------------------+
|student_id|avg_score|avg_engagement|high_performer|prediction|         probability|
+----------+---------+--------------+--------------+----------+--------------------+
| STU000003|    79.26|         47.81|             1|       1.0|[0.00238709677419...|
| STU000007|    78.33|         55.05|             1|       1.0|           [0.0,1.0]|
| STU000009|    77.46|          41.0|             1|       1.0|           [0.0,1.0]|
| STU000014|    75.39|         46.42|             1|       1.0|           [0.0,1.0]|
| STU000020|    73.89|         48.07|             0|       0.0|           [1.0,0.0]|
| STU000024|     72.0|         59.65|             0|       0.0|[0.97654411764705...|
| STU000030|    71.41|         55.63|             0|       0.0|           [1.0,0.0]|
| STU000036|    73.99|         54.08|             0|       0.0|[0.99306451612903...|
| STU000046|    73.72|          56.4|             0|       0.0|[0
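
An AUC this close to 1.0 deserves a second look. The label `high_performer` is defined as `avg_score >= 75`, and `avg_score` is also one of the model's input features, so a single threshold on that feature can reproduce the label. A minimal plain-Python sketch, using only the rows visible in the output above (`threshold_rule` is a hypothetical helper):

```python
# (student_id, avg_score, high_performer) copied from the rows shown above
rows = [
    ("STU000003", 79.26, 1),
    ("STU000007", 78.33, 1),
    ("STU000009", 77.46, 1),
    ("STU000014", 75.39, 1),
    ("STU000020", 73.89, 0),
    ("STU000024", 72.00, 0),
    ("STU000030", 71.41, 0),
    ("STU000036", 73.99, 0),
    ("STU000046", 73.72, 0),
]


def threshold_rule(avg_score, cutoff=75.0):
    """Predict 1 when avg_score reaches the cutoff used to define the label."""
    return 1 if avg_score >= cutoff else 0


matches = sum(threshold_rule(score) == label for _, score, label in rows)
rule_accuracy = matches / len(rows)
print(f"Threshold rule matches {matches} of {len(rows)} labels")
```

If the goal is to predict performance from behavior, one option is to remove `avg_score` from `feature_cols` (or define the label from a later assessment period) and measure the metrics again.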

In [None]:
# Generate predictions and store in Gold layer

from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col, when  # used unqualified in the select below

# Convert probability vector to array for proper indexing
predictions_with_prob_array = predictions.withColumn("prob_array", vector_to_array("probability"))

# Create predictions DataFrame
# Note: `predictions` was produced from test_data, so only the held-out split is scored
model_predictions = predictions_with_prob_array.select(
    "student_id",
    when(col("prediction") == 1, "High Performer").otherwise("Low Performer").alias("predicted_performance_category"),
    (when(col("prediction") == 1, col("prob_array")[1]).otherwise(col("prob_array")[0])).alias("prediction_probability"),
    "risk_score",
    when(col("prediction") == 0, "Academic Support Program").otherwise("Advanced Learning Program").alias("recommended_intervention"),
    F.lit("v1.0").alias("model_version"),
    F.current_timestamp().alias("prediction_timestamp"),
    F.lit(",".join(feature_cols)).alias("features_used")
)

# Insert predictions into Gold layer
model_predictions.write.mode("overwrite").saveAsTable("education.gold.student_performance_predictions")

print(f"Stored {model_predictions.count()} ML predictions in Gold layer")

# Show sample predictions
print("\nSample ML Predictions:")
model_predictions.show(5)

Stored 549 ML predictions in Gold layer

Sample ML Predictions:


+----------+------------------------------+----------------------+----------+------------------------+-------------+--------------------+--------------------+
|student_id|predicted_performance_category|prediction_probability|risk_score|recommended_intervention|model_version|prediction_timestamp|       features_used|
+----------+------------------------------+----------------------+----------+------------------------+-------------+--------------------+--------------------+
| STU000003|                High Performer|    0.9976129032258065|       0.6|    Advanced Learning...|         v1.0|2025-12-20 01:02:...|total_assessments...|
| STU000007|                High Performer|                   1.0|       0.6|    Advanced Learning...|         v1.0|2025-12-20 01:02:...|total_assessments...|
| STU000009|                High Performer|                   1.0|       0.6|    Advanced Learning...|         v1.0|2025-12-20 01:02:...|total_assessments...|
| STU000014|                High Performer|   
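
The `prediction_probability` column holds the probability of whichever class the model predicted, not the probability of being a high performer. A plain-Python sketch of that selection logic, with a list standing in for the Spark probability vector (`predicted_class_confidence` is a hypothetical helper, and the probabilities are rounded illustrative values):

```python
def predicted_class_confidence(prediction, probabilities):
    """Return the probability assigned to the predicted class.

    probabilities[0] is P(class 0) and probabilities[1] is P(class 1),
    mirroring the array produced by vector_to_array("probability").
    """
    index = int(prediction)
    if index not in (0, 1):
        raise ValueError("prediction must be 0 or 1 for a binary classifier")
    return probabilities[index]


high_confidence = predicted_class_confidence(1.0, [0.0024, 0.9976])
low_confidence = predicted_class_confidence(0.0, [0.9765, 0.0235])
print(high_confidence, low_confidence)
```

A value such as 0.98 can therefore describe a confident low-performer prediction just as well as a confident high-performer one, so read it together with `predicted_performance_category`.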

In [None]:
# Model interpretation and business insights

# Feature importance
rf_model = model.stages[-1]
feature_importance = rf_model.featureImportances
feature_names = feature_cols

print("=== Feature Importance for Student Performance Prediction ===")
# featureImportances is an ML Vector; convert it to an array before iterating
for name, importance in zip(feature_names, feature_importance.toArray()):
    print(f"{name}: {importance:.4f}")

# Business impact analysis
print("\n=== ML-Driven Business Impact Analysis ===")

# Calculate intervention recommendations
intervention_stats = spark.sql("""
SELECT recommended_intervention, COUNT(*) as student_count,
       ROUND(AVG(prediction_probability), 4) as avg_confidence
FROM education.gold.student_performance_predictions
GROUP BY recommended_intervention
""")

intervention_stats.show()

# Calculate potential ROI
high_risk_count = spark.sql("""
SELECT COUNT(*) as high_risk_count
FROM education.gold.student_performance_aggregates
WHERE risk_level = 'High Risk'
""").collect()[0][0]

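# Illustrative assumptions, not measured values
intervention_cost_per_student = 500   # assumed cost per student
intervention_effectiveness = 0.25     # assumed share of interventions that succeed
avg_student_value = 10000             # assumed value of one successful intervention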

total_intervention_cost = high_risk_count * intervention_cost_per_student
expected_benefit = high_risk_count * intervention_effectiveness * avg_student_value
intervention_roi = (expected_benefit - total_intervention_cost) / total_intervention_cost * 100 if total_intervention_cost > 0 else 0

print(f"\nEstimated Intervention Program:")
print(f"Students identified for intervention: {high_risk_count}")
print(f"Total intervention cost: ${total_intervention_cost:,}")
print(f"Expected benefit: ${expected_benefit:,.0f}")
print(f"Intervention program ROI: {intervention_roi:.1f}%")

# Model accuracy metrics (each count is computed once and reused)
total = predictions.count()
correct = predictions.filter("high_performer = prediction").count()
true_pos = predictions.filter("prediction = 1 AND high_performer = 1").count()
pred_pos = predictions.filter("prediction = 1").count()
actual_pos = predictions.filter("high_performer = 1").count()

accuracy = correct / total
precision = true_pos / pred_pos if pred_pos > 0 else 0
recall = true_pos / actual_pos if actual_pos > 0 else 0

print(f"\nModel Performance Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"AUC: {auc:.4f}")

=== Feature Importance for Student Performance Prediction ===
total_assessments: 0.0063
avg_score: 0.9327
score_stddev: 0.0294
avg_engagement: 0.0054
avg_completion_time: 0.0058
subjects_attempted: 0.0006
active_months: 0.0026
performance_trend: 0.0065
engagement_trend: 0.0056
grade_level_index: 0.0052

=== ML-Driven Business Impact Analysis ===


+------------------------+-------------+--------------+
|recommended_intervention|student_count|avg_confidence|
+------------------------+-------------+--------------+
|    Academic Support ...|          331|        0.9831|
|    Advanced Learning...|          218|        0.9943|
+------------------------+-------------+--------------+




Estimated Intervention Program:
Students identified for intervention: 382
Total intervention cost: $191,000
Expected benefit: $955,000
Intervention program ROI: 400.0%
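
The ROI figure follows directly from the three constants set in the cell (cost per student, effectiveness, and value per student). The sketch below repeats that arithmetic in plain Python; `intervention_roi_percent` is a hypothetical helper. The student count cancels out of the ratio, so the 400% result depends only on the per-student assumptions, not on how many students are flagged.

```python
def intervention_roi_percent(student_count, cost_per_student,
                             effectiveness, value_per_student):
    """ROI in percent: (expected benefit - total cost) / total cost * 100."""
    total_cost = student_count * cost_per_student
    if total_cost <= 0:
        return 0.0
    expected_benefit = student_count * effectiveness * value_per_student
    return (expected_benefit - total_cost) / total_cost * 100


# Same constants as the notebook cell, with two different student counts
roi_382_students = intervention_roi_percent(382, 500, 0.25, 10000)
roi_50_students = intervention_roi_percent(50, 500, 0.25, 10000)
print(f"{roi_382_students:.1f}% vs {roi_50_students:.1f}%")
```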



Model Performance Metrics:
Accuracy: 0.9945
Precision: 1.0000
Recall: 0.9864
AUC: 0.9999
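
Accuracy, precision, and recall can all be derived from four confusion-matrix counts. The counts below are not printed by the notebook; they are inferred from its output (549 test rows, 218 predicted high performers, precision 1.0000, recall 0.9864) and should be treated as an assumption. `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Inferred counts (assumption): 218 true positives, 0 false positives,
# 3 false negatives, 328 true negatives, for 549 test rows in total
metrics = classification_metrics(tp=218, fp=0, fn=3, tn=328)
for metric_name, value in metrics.items():
    print(f"{metric_name}: {value:.4f}")
```

In PySpark, the four counts could be obtained in a single pass with `predictions.groupBy("high_performer", "prediction").count()` rather than with separate filters.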


# Key Takeaways: Medallion Architecture + ML in AIDP

## What We Demonstrated

1. **Bronze Layer**: Raw data ingestion that preserves source fidelity, including any data quality issues
2. **Silver Layer**: Data cleansing, validation, standardization, and enrichment
3. **Gold Layer**: Analytics-ready aggregations and ML predictions for business insights
4. **Delta Liquid Clustering**: Automatic optimization across all layers for query performance
5. **Machine Learning Integration**: Student performance prediction with business impact analysis

## Medallion Architecture Benefits

- **Data Governance**: Clear data lineage from Bronze to Gold
- **Reusability**: Each layer serves different use cases
- **Performance**: Optimized clustering strategies per layer
- **Maintainability**: Independent layer updates and refreshes
- **Scalability**: Handles data volume growth at each stage

## AIDP Advantages

- **Unified Analytics**: Seamless data transformation pipeline
- **ML Integration**: Built-in ML capabilities with Spark MLlib
- **Performance**: Delta Liquid Clustering for automatic optimization
- **Governance**: Catalog and schema isolation
- **Scalability**: Distributed processing with Spark

## Education Business Impact

1. **Early Intervention**: Risk levels and model predictions flag students who may need support, so staff can act sooner
2. **Personalized Learning**: Performance predictions can inform tailored educational approaches
3. **Resource Optimization**: Data-driven allocation of academic support
4. **Academic Excellence**: Predictive analytics can help target efforts to improve student outcomes
5. **Educational Equity**: Proactive, data-driven support helps direct intervention to the students who need it

## Best Practices

1. **Layer Optimization**: Choose clustering columns based on query patterns per layer
2. **Data Quality**: Implement validation rules early in the Silver layer
3. **Incremental Processing**: Design for incremental updates as data grows
4. **Monitoring**: Track data quality and ML model performance over time
5. **Governance**: Maintain clear documentation and data lineage

## Next Steps

- Implement incremental data processing pipelines
- Add real-time data ingestion for immediate interventions
- Integrate with learning management systems
- Deploy ML models for production scoring
- Build dashboards for real-time academic monitoring
- Expand to multi-year longitudinal analysis

This notebook demonstrates how Oracle AI Data Platform enables sophisticated education analytics through the Medallion Architecture pattern, combining data engineering best practices with machine learning for actionable business insights.