### 🔹 Raw Sensor Data Ingestion

This DLT table ingests streaming CSV files of raw aircraft sensor data using Auto Loader.
The schema includes:
- `timestamp`, `aircraft_id`, `model`
- Sensor metrics like `engine_temp`, `fuel_efficiency`, `vibration`, `altitude`, `airspeed`, and `anomaly_score`.

Files are read from a managed volume at `/Volumes/arao/aerodemo/tmp/raw`.

✅ **DLT Benefits**:
- Built-in schema enforcement
- Automatic lineage
- Ready for continuous or triggered incremental ingestion

In [0]:
import dlt
# Explicit imports avoid shadowing Python built-ins such as max, min, and sum
from pyspark.sql.functions import avg, col, to_date

@dlt.table(
    name="raw_sensor_data",
    comment="Ingested raw sensor data from aircraft using Auto Loader"
)
def load_sensor_data():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .option("pathGlobFilter", "*.csv")
        .option("cloudFiles.schemaLocation", "/Volumes/arao/aerodemo/tmp/raw/schema")
        .schema("""
            timestamp TIMESTAMP,
            aircraft_id STRING,
            model STRING,
            engine_temp DOUBLE,
            fuel_efficiency DOUBLE,
            vibration DOUBLE,
            altitude DOUBLE,
            airspeed DOUBLE,
            anomaly_score DOUBLE
        """)
        .load("/Volumes/arao/aerodemo/tmp/raw")
    )
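
Because the schema is declared explicitly, the column order in each file matters: with its default `enforceSchema` setting, Spark's CSV reader applies a supplied schema by position rather than by header name. A minimal plain-Python sketch of a pre-upload header check follows. The helper `header_matches` and the sample row are hypothetical and not part of the pipeline.

```python
import csv
import io

# Column order assumed from the schema declared in the ingestion cell above.
EXPECTED_COLUMNS = [
    "timestamp", "aircraft_id", "model", "engine_temp", "fuel_efficiency",
    "vibration", "altitude", "airspeed", "anomaly_score",
]


def header_matches(csv_text: str) -> bool:
    """Return True if the first CSV row equals the expected column list."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return [name.strip() for name in header] == EXPECTED_COLUMNS


# Hypothetical file content; the values are made up for illustration only.
sample = (
    "timestamp,aircraft_id,model,engine_temp,fuel_efficiency,"
    "vibration,altitude,airspeed,anomaly_score\n"
    "2024-01-01 00:00:00,A320_101,A320,612.5,78.2,4.1,35000,450,0.12\n"
)

print(header_matches(sample))
```

A file whose columns are reordered or missing would fail this check before it reaches the volume, instead of loading silently with values in the wrong columns.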

### 🔹 Maintenance Events Ingestion

This DLT table loads structured maintenance event logs for aircraft.
Schema includes:
- `aircraft_id`, `event_date`, `event_type` (e.g., "Routine Check", "Engine Repair")

Source files are CSVs dropped in `/Volumes/arao/aerodemo/tmp/maintenance`.

✅ **DLT Benefits**:
- Data quality monitoring using `@dlt.expect` (violations are recorded as metrics; rows are kept)
- Easier governance and visibility

In [0]:
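import dlt
from pyspark.sql.types import StructType, StructField, StringType, DateType

# Explicit schema: without it, Auto Loader infers every CSV column as STRING,
# so event_date would not be a DATE. Column order is assumed to match the files.
maintenance_schema = StructType([
    StructField("aircraft_id", StringType(), True),
    StructField("event_date", DateType(), True),
    StructField("event_type", StringType(), True),
])

@dlt.table(
    comment="Ingested maintenance event logs per aircraft"
)
@dlt.expect("valid_aircraft_id", "aircraft_id IS NOT NULL")
@dlt.expect("valid_event_type", "event_type IN ('Routine Check', 'Engine Repair')")
def maintenance_events():
    volume_path = "/Volumes/arao/aerodemo/tmp/maintenance"
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .option("pathGlobFilter", "*.csv")
        .option("cloudFiles.schemaLocation", f"{volume_path}/schema/maintenance_events")
        .schema(maintenance_schema)
        .load(volume_path)
    )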

### 🔹 Cleaned Sensor Data

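This DLT table filters the raw sensor stream so that only readings within operational thresholds flow downstream. Rows that fail any of the following conditions are dropped:

- `engine_temp` must be ≤ 700
- `fuel_efficiency` must be ≥ 50
- `vibration` must be ≤ 25

Additionally, data expectations (`@dlt.expect`) check that:
- Engine temperature is within 0–1000
- Fuel efficiency is positive
- Vibration is non-negative

Unlike the filters, a plain `@dlt.expect` does not drop rows: violations are only recorded in the pipeline's data quality metrics. Because the expectations are evaluated on the filtered output, only negative `engine_temp` or `vibration` values can still trigger them.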
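
The difference between the filter and a default `@dlt.expect` can be sketched in plain Python, without Spark or DLT. This is an illustration only: it assumes the default expectation behaviour (violating rows are kept and counted) and that expectations are checked on the rows the function returns. The names `passes_filters`, `EXPECTATIONS`, and `clean`, and the sample rows, are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]


def passes_filters(row: Row) -> bool:
    """Hard thresholds: rows failing any of these are removed."""
    return (
        row["engine_temp"] <= 700
        and row["fuel_efficiency"] >= 50
        and row["vibration"] <= 25
    )


# Soft checks mirroring the three expectations: failures are counted, rows are kept.
EXPECTATIONS: Dict[str, Callable[[Row], bool]] = {
    "valid_engine_temp": lambda r: 0 <= r["engine_temp"] <= 1000,
    "valid_fuel_eff": lambda r: r["fuel_efficiency"] > 0,
    "valid_vibration": lambda r: r["vibration"] >= 0,
}


def clean(rows: List[Row]) -> Tuple[List[Row], Dict[str, int]]:
    """Apply the filters, then count expectation violations among the kept rows."""
    kept = [r for r in rows if passes_filters(r)]
    violations = {
        name: sum(1 for r in kept if not check(r))
        for name, check in EXPECTATIONS.items()
    }
    return kept, violations


# Made-up readings for illustration.
rows = [
    {"engine_temp": 650.0, "fuel_efficiency": 75.0, "vibration": 5.0},   # passes everything
    {"engine_temp": 720.0, "fuel_efficiency": 75.0, "vibration": 5.0},   # dropped by the filter
    {"engine_temp": 600.0, "fuel_efficiency": 60.0, "vibration": -1.0},  # kept, but flagged
]
kept, violations = clean(rows)
print(len(kept), violations)
```

If flagged rows should be removed as well, DLT offers `@dlt.expect_or_drop` for that purpose.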

✅ **DLT Benefits**:
- Streaming data validation
- Simplified enforcement of operational thresholds
- Automatic schema tracking and lineage

In [0]:
@dlt.table(
    name="cleaned_sensor_data",
    comment="Cleaned sensor data after applying quality filters"
)
@dlt.expect("valid_engine_temp", "engine_temp BETWEEN 0 AND 1000")
@dlt.expect("valid_fuel_eff", "fuel_efficiency > 0")
@dlt.expect("valid_vibration", "vibration >= 0")
def clean_sensor_data():
    return (
        dlt.read_stream("raw_sensor_data")
        .filter(
            (col("engine_temp") <= 700) &
            (col("fuel_efficiency") >= 50) &
            (col("vibration") <= 25)
        )
    )

### 🔹 Prediction Results

This DLT table performs aggregation on cleaned sensor data to produce a daily risk score per aircraft. The logic includes:

- Grouping by `aircraft_id` and `timestamp` (converted to date)
- Calculating the average `engine_temp` and `vibration`
- Applying a formula to compute a `risk_score`:
  
  \[
  \text{risk_score} = \left(\frac{\text{avg(engine_temp)}}{700} + \frac{\text{avg(vibration)}}{25}\right) \times 50
  \]
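
As a worked example of the formula, here is a minimal plain-Python sketch of the same grouping and scoring, without Spark. The function `daily_risk_scores` and the sample readings are hypothetical. Since the cleaned data caps `engine_temp` at 700 and `vibration` at 25, each ratio is at most 1 and the score is at most 100.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def daily_risk_scores(readings):
    """Compute a risk score per (aircraft_id, date).

    readings: iterable of (aircraft_id, timestamp, engine_temp, vibration),
    where timestamp is a datetime.
    """
    groups = defaultdict(list)
    for aircraft_id, ts, engine_temp, vibration in readings:
        groups[(aircraft_id, ts.date())].append((engine_temp, vibration))

    scores = {}
    for key, values in groups.items():
        avg_temp = mean(v[0] for v in values)
        avg_vib = mean(v[1] for v in values)
        scores[key] = (avg_temp / 700 + avg_vib / 25) * 50
    return scores


# Made-up readings for illustration.
readings = [
    ("A1", datetime(2024, 1, 1, 8), 700.0, 25.0),
    ("A1", datetime(2024, 1, 1, 20), 350.0, 12.5),
    ("A1", datetime(2024, 1, 2, 8), 700.0, 25.0),
]
scores = daily_risk_scores(readings)
for (aircraft_id, day), score in sorted(scores.items()):
    print(aircraft_id, day, score)
```

On 2024-01-01 the averages are 525 and 18.75, so both ratios equal 0.75 and the score is (0.75 + 0.75) × 50 = 75. On 2024-01-02 both readings sit at their limits, which gives the maximum of 100.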

✅ **DLT Benefits**:
- Scores are recomputed on each pipeline update as new sensor data arrives
- Aggregation and rule-based scoring are defined in a single declarative query
- Enables downstream alerting and visualization

In [0]:
@dlt.table(
    name="prediction_results",
    comment="Predicted AOG risk scores from cleaned sensor data"
)
def predict_risk():
    df = dlt.read("cleaned_sensor_data")
    return (
        df.groupBy("aircraft_id", to_date("timestamp").alias("prediction_date"))
          .agg(((avg("engine_temp") / 700 + avg("vibration") / 25) * 50).alias("risk_score"))
    )

---

### 📘 Enriched Sensor Data with Maintenance Events


### 🔹 Enriched Sensor Data

This table joins each cleaned sensor record with the most recent maintenance event for the same `aircraft_id` whose `event_date` falls on or before the reading date (`to_date(timestamp)`).

Purpose:
- Add maintenance context to sensor data
- Enable root cause analysis and post-maintenance tracking

✅ **DLT Benefits**:
- Built-in lineage from both sensor and maintenance streams
- Flexible expansion for ML or analytics

In [0]:
import dlt
from pyspark.sql.functions import col, to_date, row_number
from pyspark.sql.window import Window

@dlt.table(
    comment="Sensor data enriched with most recent maintenance event info"
)
def enriched_sensor_data():
    cleaned_df = dlt.read("cleaned_sensor_data").withColumn("reading_date", to_date("timestamp"))
    # Rename the event columns so the join condition below is unambiguous
    events_df = (
        dlt.read("maintenance_events")
        .withColumnRenamed("event_date", "maint_date")
        .withColumnRenamed("aircraft_id", "maint_aircraft_id")
    )

    # Keep the date condition inside the left join. Filtering on maint_date after
    # the join would drop readings with no earlier maintenance event, silently
    # turning the left join into an inner join.
    joined = cleaned_df.join(
        events_df,
        (col("aircraft_id") == col("maint_aircraft_id"))
        & (col("maint_date") <= col("reading_date")),
        "left",
    ).drop("maint_aircraft_id")

    # Rank candidate events per reading, newest first, and keep only the most recent
    window = Window.partitionBy("aircraft_id", "timestamp").orderBy(col("maint_date").desc())

    return (
        joined
        .withColumn("rank", row_number().over(window))
        .filter(col("rank") == 1)
        .drop("rank", "reading_date")
    )
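
The selection performed by the window function above is an "as-of" lookup: for each reading, take the latest maintenance event dated on or before the reading. A minimal plain-Python sketch of that rule for a single aircraft follows. The helper `latest_event_on_or_before` and the sample events are hypothetical.

```python
from datetime import date


def latest_event_on_or_before(reading_date, events):
    """Return the most recent (event_date, event_type) not after reading_date.

    events: list of (event_date, event_type) tuples for one aircraft.
    Returns None when no event precedes the reading.
    """
    candidates = [e for e in events if e[0] <= reading_date]
    return max(candidates, key=lambda e: e[0]) if candidates else None


# Made-up maintenance history for illustration.
events = [
    (date(2024, 1, 5), "Routine Check"),
    (date(2024, 2, 10), "Engine Repair"),
]

print(latest_event_on_or_before(date(2024, 2, 1), events))
print(latest_event_on_or_before(date(2024, 2, 10), events))
print(latest_event_on_or_before(date(2024, 1, 1), events))
```

The last call returns `None`. In the Spark version this corresponds to a reading whose maintenance columns are null, provided the join is written so that unmatched readings are kept.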