# Notebook 2: Synthetic Data Generation
This notebook simulates realistic aircraft sensor data and writes it to a managed volume.

# 🧪 Synthetic Data Generation

This notebook generates synthetic datasets for the Aircraft on Ground (AOG) demo. It simulates sensor telemetry and maintenance event logs for a fleet of aircraft and saves them as CSV files with timestamped filenames.

## ✈️ Datasets Generated

1. **`raw_sensor_data_*.csv`**
   - Simulates daily sensor readings such as engine temperature, fuel efficiency, and vibration.
   - Covers multiple aircraft models (e.g., A320, B737, A330).
   - Covers the full 2024 calendar year, with one reading per aircraft per day (366 days).

2. **`maintenance_events_*.csv`**
   - Simulates scheduled and unscheduled maintenance activities.
   - Each aircraft gets one "Routine Check" and one "Engine Repair". The dates are randomized, and the repair falls on the day after the injected sensor anomaly.

## 📦 Output

Files are saved to Unity Catalog-managed volumes in two separate folders:

- `raw_sensor_data_*.csv` → `/Volumes/arao/aerodemo/tmp/raw/`
- `maintenance_events_*.csv` → `/Volumes/arao/aerodemo/tmp/maintenance/`

Each run creates **timestamped filenames** so that Auto Loader can detect and ingest them as new data.
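
The naming scheme can be illustrated on its own. The sketch below is a minimal example, assuming a hypothetical helper `timestamped_name` and fixed example times instead of `datetime.now()`; it shows that two runs produce distinct filenames and that the `%Y%m%d_%H%M%S` pattern sorts chronologically.

```python
from datetime import datetime


def timestamped_name(prefix: str, now: datetime) -> str:
    """Build a filename in the same pattern the notebook uses (hypothetical helper)."""
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}.csv"


# Two example runs, one second apart
first_run = timestamped_name("raw_sensor_data", datetime(2024, 1, 1, 9, 30, 0))
second_run = timestamped_name("raw_sensor_data", datetime(2024, 1, 1, 9, 30, 1))

print(first_run)               # raw_sensor_data_20240101_093000.csv
print(second_run)              # raw_sensor_data_20240101_093001.csv
print(first_run < second_run)  # True: names sort in time order
```

The timestamp has one-second resolution, so two runs that start within the same second would produce the same filename and the second write would overwrite the first.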

## ✅ Why This Matters

This notebook is designed to support:
- Continuous ingestion via Auto Loader
- Downstream analytics and Delta Live Tables (DLT)
- Simulation of realistic aircraft operational behavior

> You can re-run this notebook anytime to drop a fresh batch of files for ingestion and pipeline testing. Each run regenerates the full year of data under new timestamped filenames.

In [0]:

import numpy as np
from datetime import datetime, timedelta
import random
import pandas as pd
import os
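
The generation cells draw from both Python's `random` module and `np.random`, and neither is seeded, so every run produces different values. If repeatable output is wanted (an assumption, not something this notebook does), both generators would need seeding. A minimal sketch, with `seed_generators` and `sample_reading` as hypothetical helpers:

```python
import random

import numpy as np


def seed_generators(seed: int) -> None:
    """Seed both RNGs used by the notebook (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)


def sample_reading() -> float:
    """Draw a base temperature plus Gaussian noise, mirroring the calls in the generation loop."""
    base_temp = random.uniform(550, 600)   # stdlib generator
    noise = np.random.normal(0, 2)         # NumPy generator
    return round(base_temp + noise, 2)


seed_generators(42)
first = sample_reading()
seed_generators(42)
second = sample_reading()
print(first == second)  # True: identical seeds give identical draws
```

Seeding only one of the two modules would leave the other half of the draws unrepeatable.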


In [0]:

# Define aircraft IDs and models
aircraft_ids = ["A320_101","A320_102","A320_103","A320_104",
                "B737_201","B737_202","B737_203",
                "A330_301","A330_302","A330_303"]
models = ["A320"]*4 + ["B737"]*3 + ["A330"]*3
start_date = datetime(2024, 1, 1)
end_date = datetime(2024, 12, 31)
date_range = [start_date + timedelta(days=i) for i in range((end_date-start_date).days + 1)]
raw_data, maintenance_events = [], []


In [0]:
# Generate data for each aircraft
for aircraft_id, model in zip(aircraft_ids, models):
    base_temp = random.uniform(550, 600)
    base_fuel_eff = random.uniform(80, 90)
    base_vib = random.uniform(3.0, 6.0)
    drift_temp = random.uniform(0.05, 0.1)
    drift_fuel_eff = random.uniform(-0.1, -0.05)
    drift_vib = random.uniform(0.01, 0.03)
    sched_idx = random.randint(150, 180)
    anomaly_idx = random.randint(250, 300)
    if anomaly_idx <= sched_idx:
        anomaly_idx = sched_idx + 50
    if anomaly_idx >= len(date_range):
        anomaly_idx = len(date_range) - 60
    repair_idx = min(anomaly_idx + 1, len(date_range) - 1)

    maintenance_events.append({
        "aircraft_id": aircraft_id,
        "event_date": date_range[sched_idx].date(),
        "event_type": "Routine Check"
    })
    maintenance_events.append({
        "aircraft_id": aircraft_id,
        "event_date": date_range[repair_idx].date(),
        "event_type": "Engine Repair"
    })

    for day_idx, current_date in enumerate(date_range):
        engine_temp = base_temp + drift_temp * day_idx + np.random.normal(0, 2)
        fuel_eff = base_fuel_eff + drift_fuel_eff * day_idx + np.random.normal(0, 1)
        vibration = base_vib + drift_vib * day_idx + np.random.normal(0, 0.1)

        # Simulated new metrics
        altitude = 30000 + np.random.normal(0, 500)             # feet
        airspeed = 450 + np.random.normal(0, 20)                # knots
        oil_pressure = round(random.uniform(30, 90), 2)         # psi
        engine_rpm = int(random.uniform(1500, 5000))            # rpm
        battery_voltage = round(random.uniform(22.0, 28.0), 2)  # volts

        # Anomaly injection
        anomaly_score = 0.0
        if day_idx == anomaly_idx:
            engine_temp *= 1.3
            fuel_eff *= 0.7
            vibration = max(vibration * 3, vibration + 5)
            anomaly_score = 1.0  # High risk day
        if day_idx >= repair_idx:
            engine_temp = max(base_temp, engine_temp - 0.15 * base_temp)
            fuel_eff = min(base_fuel_eff, fuel_eff + 0.15 * base_fuel_eff)
            vibration = max(0.0, vibration - 0.5 * base_vib)

        raw_data.append({
            "timestamp": current_date,
            "aircraft_id": aircraft_id,
            "model": model,
            "engine_temp": round(engine_temp, 2),
            "fuel_efficiency": round(fuel_eff, 2),
            "vibration": round(vibration, 3),
            "altitude": round(altitude, 2),
            "airspeed": round(airspeed, 2),
            "anomaly_score": anomaly_score,
            "oil_pressure": oil_pressure,
            "engine_rpm": engine_rpm,
            "battery_voltage": battery_voltage
        })
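
To see what the drift, anomaly, and repair rules do to a single metric, the arithmetic can be traced without the noise term. The sketch below uses hypothetical mid-range parameters (`base_temp = 575.0`, `drift_temp = 0.075`, `anomaly_idx = 275`); the real values are drawn at random per aircraft, and `noise_free` is an illustrative helper.

```python
def noise_free(base: float, drift: float, day_idx: int) -> float:
    """Baseline plus linear drift, ignoring the Gaussian noise term."""
    return base + drift * day_idx


# Hypothetical parameters chosen from the middle of the sampled ranges
base_temp, drift_temp = 575.0, 0.075
anomaly_idx = 275
repair_idx = anomaly_idx + 1

# Day before the anomaly: drift only
before = noise_free(base_temp, drift_temp, anomaly_idx - 1)

# Anomaly day: the drifted value is multiplied by 1.3
spike = noise_free(base_temp, drift_temp, anomaly_idx) * 1.3

# From the repair day on: subtract 15% of the baseline, but never go below it
after_raw = noise_free(base_temp, drift_temp, repair_idx)
after = max(base_temp, after_raw - 0.15 * base_temp)

print(round(before, 2), round(spike, 2), round(after, 2))
```

With these numbers the post-repair engine temperature is clamped to `base_temp`, because subtracting 15% of the baseline outweighs the drift accumulated by that day.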

In [0]:

# Print a sample for verification (uncomment the lines below for sizes and events)
# print("Sensor records:", len(raw_data))
# print("Maintenance events:", len(maintenance_events))
# print(pd.DataFrame(maintenance_events).head())
print(pd.DataFrame(raw_data).head())
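
The generated lists can be checked against the sizes the configuration implies: 10 aircraft, one reading per day from 2024-01-01 through 2024-12-31, and two maintenance events per aircraft. A small sketch of that arithmetic (the `expected_*` names are illustrative):

```python
from datetime import datetime

start_date = datetime(2024, 1, 1)
end_date = datetime(2024, 12, 31)
n_aircraft = 10

n_days = (end_date - start_date).days + 1      # inclusive range; 2024 is a leap year
expected_sensor_rows = n_aircraft * n_days     # one reading per aircraft per day
expected_events = n_aircraft * 2               # one Routine Check + one Engine Repair each

print(n_days, expected_sensor_rows, expected_events)  # 366 3660 20
```

In the notebook, comparing `len(raw_data)` with `expected_sensor_rows` would flag a common slip: re-running the generation cell without re-running the cell that resets `raw_data` appends a second year of rows.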


📦 Benefits of the logic in the next cell:
- ✅ Keeps the ingestion folders clean by writing sensor and maintenance files to separate paths
- ✅ Prevents Auto Loader schema collisions
- ✅ Aligns with the 03_ and 03B_ setups


In [0]:
from datetime import datetime
import pandas as pd

# Define separate paths
raw_data_path = "/Volumes/arao/aerodemo/tmp/raw"
maint_data_path = "/Volumes/arao/aerodemo/tmp/maintenance"

# Create timestamped filenames
timestamp_str = datetime.now().strftime("%Y%m%d_%H%M%S")
raw_file_path = f"{raw_data_path}/raw_sensor_data_{timestamp_str}.csv"
maint_file_path = f"{maint_data_path}/maintenance_events_{timestamp_str}.csv"

# Convert and save raw sensor data with an explicit column order
pd.DataFrame(raw_data, columns=[
    "timestamp", "aircraft_id", "model",
    "engine_temp", "fuel_efficiency", "vibration",
    "altitude", "airspeed", "anomaly_score",
    "oil_pressure", "engine_rpm", "battery_voltage"
]).to_csv(raw_file_path, index=False)


# Convert and save maintenance event data
pd.DataFrame(maintenance_events).to_csv(maint_file_path, index=False)

print(f"✅ Files written:\n- {raw_file_path}\n- {maint_file_path}")
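
`DataFrame.to_csv` does not create missing parent folders, so the write above assumes both volume folders already exist. If they might not (an assumption about the environment, and assuming the volume paths are reachable through the local file API), `os.makedirs` with `exist_ok=True` can create them first. The sketch below uses a temporary directory as a stand-in for the `/Volumes/...` paths:

```python
import os
import tempfile

# Stand-in for the volume root; in the notebook this would be /Volumes/arao/aerodemo/tmp
root = tempfile.mkdtemp()
raw_data_path = os.path.join(root, "raw")
maint_data_path = os.path.join(root, "maintenance")

for path in (raw_data_path, maint_data_path):
    os.makedirs(path, exist_ok=True)   # no error if the folder already exists

# Calling it again is harmless, so the cell stays safe to re-run
os.makedirs(raw_data_path, exist_ok=True)
print(os.path.isdir(raw_data_path), os.path.isdir(maint_data_path))  # True True
```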

In [0]:
pd.read_csv(raw_file_path).columns

In [0]:
%fs rm -r /Volumes/arao/aerodemo/tmp/raw/schema/raw_sensor_data