# Notebook 2: Synthetic Data Generation
This notebook simulates realistic aircraft sensor data and writes it to a managed volume.

# 🧪 Synthetic Data Generation

This notebook generates synthetic datasets for the Aircraft on Ground (AOG) demo. It simulates both sensor telemetry and maintenance event logs for a fleet of aircraft, and saves them as CSV files with timestamped filenames.

## ✈️ Datasets Generated

1. **`raw_sensor_data_*.csv`**
   - Simulates daily sensor readings such as engine temperature, fuel efficiency, and vibration.
   - Covers five aircraft models (A320, B737, A330, B777, and E190).
   - Includes a full calendar year of data.

2. **`maintenance_events_*.csv`**
   - Simulates scheduled and unscheduled maintenance activities.
   - Each aircraft receives one "Routine Check" and one "Engine Repair" event on randomly chosen dates.
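
As a rough size check, the expected row counts follow directly from the fleet configuration used later in this notebook (5 models, 5 aircraft per model, 2024-01-01 through 2024-12-31). The sketch below only does the arithmetic; the variable names are illustrative and are not used elsewhere in the notebook.

```python
from datetime import date

# Values mirror the configuration cell later in this notebook.
num_models = 5
aircraft_per_model = 5
start, end = date(2024, 1, 1), date(2024, 12, 31)

num_days = (end - start).days + 1            # inclusive range; 2024 is a leap year
num_aircraft = num_models * aircraft_per_model

expected_sensor_rows = num_days * num_aircraft    # one reading per aircraft per day
expected_maintenance_rows = 2 * num_aircraft      # one Routine Check + one Engine Repair each

print(num_days, expected_sensor_rows, expected_maintenance_rows)  # 366 9150 50
```

Comparing these numbers with `len(raw_data)` and `len(maintenance_events)` after generation is a quick way to spot a run that appended to lists left over from a previous execution.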

## 📦 Output

Files are saved to Unity Catalog-managed volumes in two separate folders:

- `raw_sensor_data_*.csv` → `/Volumes/arao/aerodemo/tmp/raw/`
- `maintenance_events_*.csv` → `/Volumes/arao/aerodemo/tmp/maintenance/`

Each run creates **timestamped filenames** so that Auto Loader can detect and ingest them as new data.

## ✅ Why This Matters

This notebook is designed to support:
- Continuous ingestion via Auto Loader
- Downstream analytics and Delta Live Tables (DLT)
- Simulation of realistic aircraft operational behavior

> You can re-run this notebook anytime to produce a new timestamped batch of files (each run regenerates the full year of readings) for ingestion and pipeline testing.

In [0]:

import numpy as np
from datetime import datetime, timedelta
import random
import pandas as pd
import os


### Synthetic Aircraft Sensor Data Generator

This notebook generates simulated sensor data and maintenance logs for 25 aircraft (5 models, 5 aircraft each) over a 1-year period (2024-01-01 to 2024-12-31). Each aircraft experiences:
- Gradual wear-and-tear over time (drift in temperature, vibration, and fuel efficiency)
- Simulated anomalies and scheduled repairs
- Daily readings with a full set of sensor metrics
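
The wear-and-tear model amounts to a baseline plus a per-day slope plus Gaussian noise. A minimal standard-library sketch for a single metric follows. The helper name `simulate_metric` and the seed are illustrative assumptions, and the notebook itself draws noise with `np.random.normal` rather than `random.gauss`.

```python
import random

def simulate_metric(base, drift_per_day, noise_sd, num_days, seed=None):
    """Return one value per day: base + drift_per_day * day + Gaussian noise."""
    rng = random.Random(seed)  # local RNG so that a seed makes the series reproducible
    return [
        base + drift_per_day * day + rng.gauss(0, noise_sd)
        for day in range(num_days)
    ]

# Engine temperature that creeps upward by about 0.08 units per day over 366 days.
temps = simulate_metric(base=575.0, drift_per_day=0.08, noise_sd=2.0, num_days=366, seed=42)

first_month = sum(temps[:30]) / 30
last_month = sum(temps[-30:]) / 30
print(f"mean of first 30 days: {first_month:.1f}, mean of last 30 days: {last_month:.1f}")
```

Passing a seed is optional here. The notebook cells do not seed their generators, so each run produces different values.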

In [0]:
from datetime import datetime, timedelta
import random
import numpy as np

# --- Configuration ---
NUM_AIRCRAFT_PER_MODEL = 5
START_DATE = datetime(2024, 1, 1)
END_DATE = datetime(2024, 12, 31)

# --- Aircraft Setup ---
aircraft_ids = (
    [f"A320_{i:03d}" for i in range(101, 101 + NUM_AIRCRAFT_PER_MODEL)] +
    [f"B737_{i:03d}" for i in range(201, 201 + NUM_AIRCRAFT_PER_MODEL)] +
    [f"A330_{i:03d}" for i in range(301, 301 + NUM_AIRCRAFT_PER_MODEL)] +
    [f"B777_{i:03d}" for i in range(401, 401 + NUM_AIRCRAFT_PER_MODEL)] +
    [f"E190_{i:03d}" for i in range(501, 501 + NUM_AIRCRAFT_PER_MODEL)]
)

models = (
    ["A320"] * NUM_AIRCRAFT_PER_MODEL +
    ["B737"] * NUM_AIRCRAFT_PER_MODEL +
    ["A330"] * NUM_AIRCRAFT_PER_MODEL +
    ["B777"] * NUM_AIRCRAFT_PER_MODEL +
    ["E190"] * NUM_AIRCRAFT_PER_MODEL
)

date_range = [START_DATE + timedelta(days=i) for i in range((END_DATE - START_DATE).days + 1)]
raw_data, maintenance_events = [], []

### Sensor Data Simulation for Each Aircraft

This cell generates synthetic daily sensor readings and maintenance logs for each of the 25 aircraft defined earlier.

Key logic:
- **Baseline Initialization**: Each aircraft starts with slightly different base values for engine temperature, fuel efficiency, and vibration.
- **Drift Simulation**: Over time, performance drifts to simulate aging and wear.
- **Scheduled Maintenance**:
  - A "Routine Check" is injected at a random index between days 150–180.
  - An "Engine Repair" is scheduled for the day after an artificially injected anomaly.
- **Anomaly Injection**: On one specific day (a random index between days 250–300), engine temperature spikes, fuel efficiency drops, and vibration rises sharply to simulate a high-risk event.
- **Post-Repair Reset**: From the repair day onward, a fixed correction moves sensor readings back toward each aircraft's baseline to reflect the effect of maintenance.

Metrics simulated daily:
- `engine_temp`, `fuel_efficiency`, `vibration`, `altitude`, `airspeed`, `oil_pressure`, `engine_rpm`, and `battery_voltage`.
- Each record is time-stamped using the full datetime string (`%Y-%m-%d %H:%M:%S`) to support future time-based operations.

All generated data is stored in:
- `raw_data`: Daily sensor readings
- `maintenance_events`: Log of maintenance activities

This setup provides a realistic dataset for building and testing predictive maintenance workflows.
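
The anomaly and post-repair rules can be read as two small pure functions. The sketch below restates the multipliers and offsets used in the following cell. The function names `inject_anomaly` and `apply_post_repair` are hypothetical and do not appear in the notebook, and the sample inputs are made up for illustration.

```python
def inject_anomaly(engine_temp, fuel_eff, vibration):
    """Apply the anomaly-day spike: hotter engine, worse fuel efficiency, strong vibration."""
    return (
        engine_temp * 1.3,
        fuel_eff * 0.7,
        max(vibration * 3, vibration + 5),  # vibration rises by at least 5 units
    )

def apply_post_repair(engine_temp, fuel_eff, vibration, base_temp, base_fuel_eff, base_vib):
    """Shift readings back toward the aircraft's baseline values."""
    return (
        max(base_temp, engine_temp - 0.15 * base_temp),       # never cooler than baseline
        min(base_fuel_eff, fuel_eff + 0.15 * base_fuel_eff),  # never better than baseline
        max(0.0, vibration - 0.5 * base_vib),                 # never negative
    )

spiked = inject_anomaly(engine_temp=600.0, fuel_eff=80.0, vibration=4.0)
repaired = apply_post_repair(620.0, 60.0, 9.0, base_temp=575.0, base_fuel_eff=85.0, base_vib=4.0)
print(spiked)
print(repaired)
```

Keeping the rules in functions like these would make them easy to unit-test separately from the generation loop, although the notebook applies them inline.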

In [0]:
from datetime import time, datetime

# Generate data for each aircraft
for aircraft_id, model in zip(aircraft_ids, models):
    # Initialize base sensor values
    base_temp = random.uniform(550, 600)
    base_fuel_eff = random.uniform(80, 90)
    base_vib = random.uniform(3.0, 6.0)

    # Simulate daily drift (wear and tear)
    drift_temp = random.uniform(0.05, 0.1)
    drift_fuel_eff = random.uniform(-0.1, -0.05)
    drift_vib = random.uniform(0.01, 0.03)

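    # Schedule maintenance and anomaly windows
    sched_idx = random.randint(150, 180)
    anomaly_idx = random.randint(250, 300)
    # Defensive guards: not reachable with the index ranges above and a 366-day span,
    # but they keep the ordering valid if those ranges or the dates are changed
    if anomaly_idx <= sched_idx:
        anomaly_idx = sched_idx + 50
    if anomaly_idx >= len(date_range):
        anomaly_idx = len(date_range) - 60
    # Repair happens the day after the anomaly (clamped to the last available day)
    repair_idx = min(anomaly_idx + 1, len(date_range) - 1)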

    # Add maintenance events
    maintenance_events.append({
        "aircraft_id": aircraft_id,
        "event_date": date_range[sched_idx].date(),
        "event_type": "Routine Check"
    })
    maintenance_events.append({
        "aircraft_id": aircraft_id,
        "event_date": date_range[repair_idx].date(),
        "event_type": "Engine Repair"
    })

    # Generate one record per day with randomized timestamp
    for day_idx, current_date in enumerate(date_range):
        # Add randomized time-of-day to the current date
        hour = random.randint(0, 23)
        minute = random.randint(0, 59)
        second = random.randint(0, 59)
        timestamp = datetime.combine(current_date, time(hour, minute, second))

        # Sensor drift and noise
        engine_temp = base_temp + drift_temp * day_idx + np.random.normal(0, 2)
        fuel_eff = base_fuel_eff + drift_fuel_eff * day_idx + np.random.normal(0, 1)
        vibration = base_vib + drift_vib * day_idx + np.random.normal(0, 0.1)

        # Additional sensor values
        altitude = 30000 + np.random.normal(0, 500)
        airspeed = 450 + np.random.normal(0, 20)
        oil_pressure = round(random.uniform(30, 90), 2)
        engine_rpm = int(random.uniform(1500, 5000))
        battery_voltage = round(random.uniform(22.0, 28.0), 2)

        # Inject anomaly on specific day
        anomaly_score = 0.0
        if day_idx == anomaly_idx:
            engine_temp *= 1.3
            fuel_eff *= 0.7
            vibration = max(vibration * 3, vibration + 5)
            anomaly_score = 1.0

        # Post-repair normalization
        if day_idx >= repair_idx:
            engine_temp = max(base_temp, engine_temp - 0.15 * base_temp)
            fuel_eff = min(base_fuel_eff, fuel_eff + 0.15 * base_fuel_eff)
            vibration = max(0.0, vibration - 0.5 * base_vib)

        # Append the record to raw_data
        raw_data.append({
            "timestamp": timestamp.strftime("%Y-%m-%d %H:%M:%S"),
            "aircraft_id": aircraft_id,
            "model": model,
            "engine_temp": round(engine_temp, 2),
            "fuel_efficiency": round(fuel_eff, 2),
            "vibration": round(vibration, 3),
            "altitude": round(altitude, 2),
            "airspeed": round(airspeed, 2),
            "anomaly_score": anomaly_score,
            "oil_pressure": oil_pressure,
            "engine_rpm": engine_rpm,
            "battery_voltage": battery_voltage
        })

In [0]:
import pandas as pd

# Print sample and sizes for verification
# print("Sensor records:", len(raw_data))
# print("Maintenance events:", len(maintenance_events))
# print(pd.DataFrame(raw_data).head())
# print(pd.DataFrame(maintenance_events).head())
print(pd.DataFrame(raw_data).head())


📦 Benefits of the logic in the next cell:
- ✅ Keeps ingestion folders clean
- ✅ Prevents Auto Loader schema collisions
- ✅ Aligned with your 03_ and 03B_ setups
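
The file-naming scheme behind these benefits can be sketched in isolation. The example below uses a temporary directory as a stand-in for the volume root and a hypothetical helper `timestamped_path`. It only illustrates that each dataset gets its own folder and that the filename is derived from the run time.

```python
import tempfile
from datetime import datetime
from pathlib import Path

def timestamped_path(root, folder, prefix, now=None):
    """Build <root>/<folder>/<prefix>_<YYYYmmdd_HHMMSS>.csv and make sure the folder exists."""
    now = now or datetime.now()
    target_dir = Path(root) / folder
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir / f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}.csv"

# A temporary directory stands in for the volume root (an assumption for this sketch).
root = tempfile.mkdtemp()
run_time = datetime(2024, 6, 1, 9, 30, 0)  # fixed time so the output is predictable

raw_path = timestamped_path(root, "raw", "raw_sensor_data", run_time)
maint_path = timestamped_path(root, "maintenance", "maintenance_events", run_time)

print(raw_path.name)    # raw_sensor_data_20240601_093000.csv
print(maint_path.name)  # maintenance_events_20240601_093000.csv
```

Because the two datasets have different columns, keeping them in separate folders means each ingestion stream only ever sees files with one schema.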


In [0]:
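from datetime import datetime
import os
import pandas as pd

# Define output directories (volume paths)
raw_data_path = "/Volumes/arao/aerodemo/tmp/raw"
maint_data_path = "/Volumes/arao/aerodemo/tmp/maintenance"

# Create the output folders if they are missing; to_csv does not create directories
for out_dir in (raw_data_path, maint_data_path):
    os.makedirs(out_dir, exist_ok=True)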

# Generate timestamp string for filenames
timestamp_str = datetime.now().strftime("%Y%m%d_%H%M%S")

# Construct full file paths
raw_file_path = f"{raw_data_path}/raw_sensor_data_{timestamp_str}.csv"
maint_file_path = f"{maint_data_path}/maintenance_events_{timestamp_str}.csv"

# Save raw sensor data
pd.DataFrame(raw_data, columns=[
    "timestamp", "aircraft_id", "model", 
    "engine_temp", "fuel_efficiency", "vibration",
    "altitude", "airspeed", "anomaly_score",
    "oil_pressure", "engine_rpm", "battery_voltage"
]).to_csv(raw_file_path, index=False)

# Save maintenance event data
pd.DataFrame(maintenance_events, columns=[
    "aircraft_id", "event_date", "event_type"
]).to_csv(maint_file_path, index=False)

print("✅ Files written:")
print(f"- {raw_file_path}")
print(f"- {maint_file_path}")

### ✈️ Aircraft Model Reference Table (for Digital Twin Mapping)

This section generates a static reference table called `aircraft_model_reference`, which contains key specifications for each aircraft model (manufacturer, engine type, seating capacity, and range). This table acts as foundational metadata for building Digital Twin representations and linking operational data to aircraft characteristics. Note that the reference data below currently covers only the A320, B737, and A330; the B777 and E190 aircraft generated above have no entry yet.

The table is stored as a Delta table in Unity Catalog under `arao.aerodemo.aircraft_model_reference`.
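
Because sensor rows are linked to this table through the `model` column, it can be useful to confirm that every fleet model has a reference entry. The sketch below is a plain-Python check with the model lists copied from this notebook's cells; the helper name `missing_reference_models` is hypothetical.

```python
def missing_reference_models(fleet_models, reference_models):
    """Return fleet models that have no row in the reference data, sorted for stable output."""
    return sorted(set(fleet_models) - set(reference_models))

# Models used when generating sensor data in this notebook.
fleet_models = ["A320", "B737", "A330", "B777", "E190"]
# Models present in the reference data defined in the following cells.
reference_models = ["A320", "B737", "A330"]

missing = missing_reference_models(fleet_models, reference_models)
print(missing)  # ['B777', 'E190']
```

An empty result would indicate that a join on `model` can match every aircraft in the sensor data.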

In [0]:
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.getOrCreate()

# Define model-level reference data
model_data = [
    ("A320", "Airbus", "CFM56", 180, 6150),
    ("B737", "Boeing", "LEAP-1B", 160, 5600),
    ("A330", "Airbus", "Trent 700", 277, 13450)
]

columns = ["model", "manufacturer", "engine_type", "capacity", "range_km"]

# Create DataFrame
df_model = spark.createDataFrame(model_data, columns)

# Check schema
df_model.printSchema()

# Show sample data
df_model.show()

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

# Create Spark session
spark = SparkSession.builder.getOrCreate()

# Define model-level reference data
model_data = [
    ("A320", "Airbus", "CFM56", 180, 6150),
    ("B737", "Boeing", "LEAP-1B", 160, 5600),
    ("A330", "Airbus", "Trent 700", 277, 13450)
]

columns = ["model", "manufacturer", "engine_type", "capacity", "range_km"]

# Create DataFrame
df_model = spark.createDataFrame(model_data, columns)

# Python ints are inferred as LongType; cast to IntegerType to match the Delta table definition
df_model = df_model.withColumn("capacity", df_model["capacity"].cast(IntegerType()))
df_model = df_model.withColumn("range_km", df_model["range_km"].cast(IntegerType()))

# Overwrite table
df_model.write.format("delta").mode("overwrite").saveAsTable("arao.aerodemo.aircraft_model_reference")

print("✅ Aircraft model reference table written with correct schema.")

In [0]:
pd.read_csv(raw_file_path).columns

In [0]:
# %fs rm -r /Volumes/arao/aerodemo/tmp/raw/schema/raw_sensor_data