# Notebook 01 — Synthetic IIoT Data Generator (GMP Packaging / Medical Device)

This notebook generates **synthetic but realistic** industrial telemetry and event data for an enterprise GMP packaging environment.

## Design goals

- **Multi-site, multi-timezone** operations (sites in different regions/time zones)
- **Mixed fleet**: ~50% legacy equipment with reduced signal availability / higher noise / occasional missingness
- Data is suitable for:
  - data-quality checks
  - feature engineering
  - downtime / quality risk modeling
  - an enterprise risk dashboard later (site → line → asset drilldown)

## Output artifacts

This notebook will write reproducible datasets to:

- `data/raw/iot_events.parquet` (event-level telemetry + events)
- `data/raw/assets_master.csv` (asset registry)
- `data/raw/sites_master.csv` (sites + time zones)


In [2]:
# ============================================================
# Cell 1 — Setup: imports, RNG seed, and output paths
# ============================================================

from pathlib import Path
import numpy as np
import pandas as pd

DATA_RAW = Path("data/raw")
DATA_RAW.mkdir(parents=True, exist_ok=True)

OUT_EVENTS  = DATA_RAW / "iot_events.parquet"
OUT_ASSETS  = DATA_RAW / "assets_master.csv"
OUT_SITES   = DATA_RAW / "sites_master.csv"

RNG_SEED = 42
rng = np.random.default_rng(RNG_SEED)

print("Will write:")
print(" -", OUT_EVENTS.resolve())
print(" -", OUT_ASSETS.resolve())
print(" -", OUT_SITES.resolve())
print("RNG_SEED:", RNG_SEED)


Will write:
 - /home/parallels/projects/gmp-packaging-risk-analytics/data/raw/iot_events.parquet
 - /home/parallels/projects/gmp-packaging-risk-analytics/data/raw/assets_master.csv
 - /home/parallels/projects/gmp-packaging-risk-analytics/data/raw/sites_master.csv
RNG_SEED: 42


### What Cell 1 Just Did

- Set up core imports (NumPy, pandas) and created a seeded NumPy random generator (`rng`) for reproducibility.
- Defined output paths under `data/raw/` for:
  - a site master table
  - an asset registry
  - an event-level telemetry dataset
- Confirmed where the artifacts will be written on disk.
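
To illustrate why the seed matters, here is a minimal sketch. The helper name `draw_sample` is hypothetical and not part of the notebook. Two generators built from the same seed produce identical draws, while a different seed gives a different sequence.

```python
import numpy as np

RNG_SEED = 42  # same seed value as Cell 1


def draw_sample(seed: int, n: int = 5) -> np.ndarray:
    """Return n uniform draws from a freshly seeded generator (illustrative helper)."""
    rng = np.random.default_rng(seed)
    return rng.random(n)


first = draw_sample(RNG_SEED)
second = draw_sample(RNG_SEED)
different = draw_sample(RNG_SEED + 1)

# Same seed -> identical sequence; a different seed -> a different sequence.
print(np.array_equal(first, second))
print(np.array_equal(first, different))
```

A generator's state advances with every call, so re-running a single cell without recreating `rng` will not reproduce the earlier draws.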


In [3]:
# ============================================================
# Cell 2 — Sites master: multi-site + time zones + risk context
# ============================================================

import pandas as pd

# A small but enterprise-plausible site set (expand later)
sites = pd.DataFrame(
    [
        {
            "site_id": "IN-IND-01",
            "site_name": "Indianapolis Packaging Campus",
            "country": "US",
            "state_region": "IN",
            "timezone": "America/Indiana/Indianapolis",
            "site_type": "pharma_packaging",
            "shift_model": "3x8",
            "criticality_tier": 1,
        },
        {
            "site_id": "US-NJ-01",
            "site_name": "New Jersey Device Assembly",
            "country": "US",
            "state_region": "NJ",
            "timezone": "America/New_York",
            "site_type": "medical_device",
            "shift_model": "2x12",
            "criticality_tier": 1,
        },
        {
            "site_id": "IE-DUB-01",
            "site_name": "Dublin Sterile Packaging",
            "country": "IE",
            "state_region": "Dublin",
            "timezone": "Europe/Dublin",
            "site_type": "pharma_packaging",
            "shift_model": "3x8",
            "criticality_tier": 2,
        },
        {
            "site_id": "SG-SIN-01",
            "site_name": "Singapore Serialization Hub",
            "country": "SG",
            "state_region": "Singapore",
            "timezone": "Asia/Singapore",
            "site_type": "pharma_packaging",
            "shift_model": "3x8",
            "criticality_tier": 2,
        },
    ]
)

# Lightweight risk context (used later for supply-chain + downtime priors)
# 0.0 = low risk, 1.0 = high risk
sites["baseline_logistics_risk"] = [0.25, 0.30, 0.35, 0.45]
sites["baseline_power_grid_risk"] = [0.20, 0.18, 0.22, 0.28]

sites.to_csv(OUT_SITES, index=False)
print("Wrote:", OUT_SITES)
display(sites)

Wrote: data/raw/sites_master.csv


Unnamed: 0,site_id,site_name,country,state_region,timezone,site_type,shift_model,criticality_tier,baseline_logistics_risk,baseline_power_grid_risk
0,IN-IND-01,Indianapolis Packaging Campus,US,IN,America/Indiana/Indianapolis,pharma_packaging,3x8,1,0.25,0.2
1,US-NJ-01,New Jersey Device Assembly,US,NJ,America/New_York,medical_device,2x12,1,0.3,0.18
2,IE-DUB-01,Dublin Sterile Packaging,IE,Dublin,Europe/Dublin,pharma_packaging,3x8,2,0.35,0.22
3,SG-SIN-01,Singapore Serialization Hub,SG,Singapore,Asia/Singapore,pharma_packaging,3x8,2,0.45,0.28


### What Cell 2 Just Did

- Created a **sites master table** representing an enterprise deployment spanning multiple regions and time zones.
- Added operational metadata (site type, shift model, criticality tier) to support later rollups and drilldowns.
- Added baseline “context risk” features (logistics risk, power-grid risk) that we can use later to:
  - simulate correlated disruptions
  - enrich an enterprise risk dashboard (site health + supply chain)
- Wrote the master table to `data/raw/sites_master.csv`.
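
As a rough sketch of what "correlated disruptions" could look like, the snippet below draws one outage flag per site per day, so every asset at a site shares the same flag. The function name and the `DAILY_OUTAGE_SCALE` factor are assumptions made for illustration; the notebook does not define how the risk scores map to probabilities.

```python
import random

# Baseline power-grid risk per site, copied from the sites master table.
POWER_GRID_RISK = {
    "IN-IND-01": 0.20,
    "US-NJ-01": 0.18,
    "IE-DUB-01": 0.22,
    "SG-SIN-01": 0.28,
}

# Hypothetical scaling: converts the 0-1 risk score into a daily outage probability.
DAILY_OUTAGE_SCALE = 0.05


def simulate_site_outages(days: int, seed: int = 42) -> dict[str, list[bool]]:
    """Draw one outage flag per site per day; all assets at a site share the flag."""
    rng = random.Random(seed)
    outages = {}
    for site_id, risk in POWER_GRID_RISK.items():
        p_outage = risk * DAILY_OUTAGE_SCALE
        outages[site_id] = [rng.random() < p_outage for _ in range(days)]
    return outages


outages = simulate_site_outages(days=14)
for site_id, flags in outages.items():
    print(site_id, "outage days:", sum(flags))
```

Because the flag is drawn at site level, any downstream effect (for example, extra missing telemetry on outage days) would hit all of that site's assets together, which is what makes the disruption correlated.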


In [4]:
# ============================================================
# Cell 3 — Asset registry: lines + machines + sensors (50% legacy)
# ============================================================

import numpy as np
import pandas as pd

# ----- Config -----
LINES_PER_SITE = 3
MACHINES_PER_LINE = 10          # total machines = sites * lines * machines
LEGACY_FRACTION = 0.50          # ~50% legacy equipment
SENSOR_PROFILE = {
    "legacy": {"temp": True, "vibration": False, "power_kw": True, "throughput": True, "reject_rate": False},
    "modern": {"temp": True, "vibration": True, "power_kw": True, "throughput": True, "reject_rate": True},
}

# Packaging line archetypes (used for different behaviors later)
LINE_TYPES = ["blister_pack", "bottle_fill", "carton_pack", "labeling_serialization"]

# Machine archetypes (common GMP packaging equipment)
MACHINE_TYPES = [
    "filler", "capper", "labeler", "cartoner", "case_packer",
    "checkweigher", "vision_inspector", "printer_serializer", "conveyor", "sealer"
]

# Legacy vs modern: different noise/missingness/reliability
QUALITY_PROFILE = {
    "legacy": {"missing_prob": 0.06, "noise_scale": 1.8, "mtbf_hours": 160.0},
    "modern": {"missing_prob": 0.015, "noise_scale": 1.0, "mtbf_hours": 320.0},
}

# ----- Build assets -----
rows = []
for _, s in sites.iterrows():
    for li in range(1, LINES_PER_SITE + 1):
        line_id = f"{s.site_id}-L{li:02d}"
        line_type = rng.choice(LINE_TYPES)
        for mi in range(1, MACHINES_PER_LINE + 1):
            asset_id = f"{line_id}-M{mi:02d}"
            machine_type = MACHINE_TYPES[(mi - 1) % len(MACHINE_TYPES)]

            # legacy assignment: one seeded random draw per asset, so the realized share only approximates LEGACY_FRACTION
            is_legacy = (rng.random() < LEGACY_FRACTION)
            fleet_class = "legacy" if is_legacy else "modern"

            sensors = SENSOR_PROFILE[fleet_class].copy()
            q = QUALITY_PROFILE[fleet_class]

            rows.append(
                {
                    "asset_id": asset_id,
                    "site_id": s.site_id,
                    "line_id": line_id,
                    "line_type": line_type,
                    "machine_type": machine_type,
                    "fleet_class": fleet_class,
                    "has_temp": bool(sensors["temp"]),
                    "has_vibration": bool(sensors["vibration"]),
                    "has_power_kw": bool(sensors["power_kw"]),
                    "has_throughput": bool(sensors["throughput"]),
                    "has_reject_rate": bool(sensors["reject_rate"]),
                    "telemetry_missing_prob": float(q["missing_prob"]),
                    "telemetry_noise_scale": float(q["noise_scale"]),
                    "mtbf_hours": float(q["mtbf_hours"]),
                }
            )

assets = pd.DataFrame(rows)

# Sanity checks
legacy_rate = (assets["fleet_class"] == "legacy").mean()
print(f"Assets: {len(assets):,} | Legacy fraction (actual): {legacy_rate:.2%}")

# Persist
assets.to_csv(OUT_ASSETS, index=False)
print("Wrote:", OUT_ASSETS)

display(assets.head(10))
display(
    assets.groupby(["fleet_class"])[["has_vibration", "has_reject_rate", "telemetry_missing_prob", "telemetry_noise_scale", "mtbf_hours"]]
    .mean()
    .reset_index()
)


Assets: 120 | Legacy fraction (actual): 53.33%
Wrote: data/raw/assets_master.csv


Unnamed: 0,asset_id,site_id,line_id,line_type,machine_type,fleet_class,has_temp,has_vibration,has_power_kw,has_throughput,has_reject_rate,telemetry_missing_prob,telemetry_noise_scale,mtbf_hours
0,IN-IND-01-L01-M01,IN-IND-01,IN-IND-01-L01,blister_pack,filler,legacy,True,False,True,True,False,0.06,1.8,160.0
1,IN-IND-01-L01-M02,IN-IND-01,IN-IND-01-L01,blister_pack,capper,modern,True,True,True,True,True,0.015,1.0,320.0
2,IN-IND-01-L01-M03,IN-IND-01,IN-IND-01-L01,blister_pack,labeler,modern,True,True,True,True,True,0.015,1.0,320.0
3,IN-IND-01-L01-M04,IN-IND-01,IN-IND-01-L01,blister_pack,cartoner,legacy,True,False,True,True,False,0.06,1.8,160.0
4,IN-IND-01-L01-M05,IN-IND-01,IN-IND-01-L01,blister_pack,case_packer,modern,True,True,True,True,True,0.015,1.0,320.0
5,IN-IND-01-L01-M06,IN-IND-01,IN-IND-01-L01,blister_pack,checkweigher,modern,True,True,True,True,True,0.015,1.0,320.0
6,IN-IND-01-L01-M07,IN-IND-01,IN-IND-01-L01,blister_pack,vision_inspector,modern,True,True,True,True,True,0.015,1.0,320.0
7,IN-IND-01-L01-M08,IN-IND-01,IN-IND-01-L01,blister_pack,printer_serializer,legacy,True,False,True,True,False,0.06,1.8,160.0
8,IN-IND-01-L01-M09,IN-IND-01,IN-IND-01-L01,blister_pack,conveyor,legacy,True,False,True,True,False,0.06,1.8,160.0
9,IN-IND-01-L01-M10,IN-IND-01,IN-IND-01-L01,blister_pack,sealer,legacy,True,False,True,True,False,0.06,1.8,160.0


Unnamed: 0,fleet_class,has_vibration,has_reject_rate,telemetry_missing_prob,telemetry_noise_scale,mtbf_hours
0,legacy,0.0,0.0,0.06,1.8,160.0
1,modern,1.0,1.0,0.015,1.0,320.0


### What Cell 3 Just Did

- Built an **asset registry** across all sites, lines, and machines, including:
  - identifiers for the site → line → machine hierarchy
  - machine archetypes common in GMP packaging (filler, capper, vision inspection, serialization, etc.)

- Defined and applied a **mixed fleet model** with ~50% **legacy** assets and ~50% **modern** assets.

  **Legacy assets (what “legacy” means in this project):**
  - Older machines and control systems (often PLC/SCADA-era integrations) with **limited sensor coverage**
  - Higher operational friction in data collection:
    - **more missing telemetry** (higher missing probability)
    - **noisier measurements** (higher noise scale)
  - Lower reliability baseline for simulation purposes (**lower MTBF**) compared to modern equipment
  - Typically *missing* richer signals like **vibration** and **reject-rate** sensing

  **Modern assets (what “modern” means in this project):**
  - Newer connected equipment with **broader sensor coverage** (including vibration + reject rate)
  - Cleaner data capture:
    - **less missing telemetry**
    - **lower noise**
  - Higher reliability baseline (**higher MTBF**) compared to legacy equipment

- Persisted the asset registry to `data/raw/assets_master.csv`.
- Printed summary stats to verify the legacy/modern split and average sensor availability by fleet class.
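
The registry stores `mtbf_hours`, but this section does not say how that value turns into failures. One common simplification, assumed here rather than taken from the notebook, is an exponential time-to-failure model: the probability of at least one failure within `t` hours is `1 - exp(-t / MTBF)`. The function name below is hypothetical.

```python
import math

# MTBF values from QUALITY_PROFILE in Cell 3.
MTBF_HOURS = {"legacy": 160.0, "modern": 320.0}


def failure_probability(window_hours: float, mtbf_hours: float) -> float:
    """P(at least one failure in the window) under an assumed exponential model."""
    return 1.0 - math.exp(-window_hours / mtbf_hours)


for fleet_class, mtbf in MTBF_HOURS.items():
    p_shift = failure_probability(8.0, mtbf)    # one 8-hour shift
    p_week = failure_probability(168.0, mtbf)   # one calendar week
    print(f"{fleet_class}: shift={p_shift:.3f}, week={p_week:.3f}")
```

Under this assumption, halving the MTBF roughly doubles the failure probability over short windows such as a single shift; the gap narrows over longer windows as both probabilities approach 1.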


In [7]:
# ============================================================
# Cell 4 — Generate synthetic IIoT events (multi-site, multi-timezone)
#          + write raw artifacts (events parquet, master CSVs)
# ============================================================

from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
import random
import numpy as np
import pandas as pd

# ---------- Paths (RNG_SEED and OUT_* are defined in Cell 1) ----------
# Expected variables from earlier cells:
#   RNG_SEED (int)
#   OUT_EVENTS (Path)
#   OUT_ASSETS (Path)
#   OUT_SITES (Path)
# REPO_ROOT is not defined earlier; the fallback below sets it to the current working directory.

# Safety: if these aren't defined yet, define them with sane defaults
REPO_ROOT = globals().get("REPO_ROOT", Path.cwd())
RNG_SEED = int(globals().get("RNG_SEED", 42))

OUT_EVENTS = globals().get("OUT_EVENTS", REPO_ROOT / "data" / "raw" / "iot_events.parquet")
OUT_ASSETS = globals().get("OUT_ASSETS", REPO_ROOT / "data" / "raw" / "assets_master.csv")
OUT_SITES  = globals().get("OUT_SITES",  REPO_ROOT / "data" / "raw" / "sites_master.csv")

OUT_EVENTS.parent.mkdir(parents=True, exist_ok=True)
OUT_ASSETS.parent.mkdir(parents=True, exist_ok=True)
OUT_SITES.parent.mkdir(parents=True, exist_ok=True)

# ---------- Reproducibility ----------
random.seed(RNG_SEED)
np.random.seed(RNG_SEED)

# ---------- Sites (multi-timezone) ----------
sites = [
    {"site_id": "S1", "site_name": "Indianapolis Packaging Plant", "tz": "America/Indiana/Indianapolis"},
    {"site_id": "S2", "site_name": "San Diego Device Assembly",     "tz": "America/Los_Angeles"},
    {"site_id": "S3", "site_name": "Dublin EU Packaging",           "tz": "Europe/Dublin"},
    {"site_id": "S4", "site_name": "Singapore Sterile Ops",         "tz": "Asia/Singapore"},
]

sites_df = pd.DataFrame(sites)
tz_by_site = {r["site_id"]: r["tz"] for r in sites}

# ---------- Asset model ----------
asset_types = [
    # Packaging / device manufacturing flavor
    "blister_packer", "cartoner", "labeler", "vision_inspection",
    "bottle_filler", "capper", "conveyor", "case_packer",
    "sterilizer", "environmental_monitor", "weigh_check", "print_apply"
]

# Target legacy share (~50% of assets)
N_ASSETS = 120
legacy_share = 0.50

# Assign assets to sites
site_ids = [s["site_id"] for s in sites]
asset_rows = []
for i in range(N_ASSETS):
    asset_id = f"A{i+1:04d}"
    site_id = random.choice(site_ids)
    asset_type = random.choice(asset_types)
    is_legacy = (random.random() < legacy_share)

    # legacy constraints: fewer sensors, noisier, more missing, less frequent telemetry
    comms = "legacy_serial" if is_legacy else "mqtt_opcua"
    vendor = random.choice(["VendorA", "VendorB", "VendorC", "VendorD"])
    line = random.choice(["L1", "L2", "L3", "L4", "L5"])

    asset_rows.append({
        "asset_id": asset_id,
        "site_id": site_id,
        "line_id": f"{site_id}-{line}",
        "asset_type": asset_type,
        "is_legacy": bool(is_legacy),
        "connectivity": comms,
        "vendor": vendor,
    })

assets_df = pd.DataFrame(asset_rows)

# ---------- Event generation configuration ----------
@dataclass(frozen=True)
class EventConfig:
    days: int = 14
    # Telemetry cadence in minutes (modern faster, legacy slower)
    modern_period_min: int = 5
    legacy_period_min: int = 15
    # Expected missingness
    modern_missing_p: float = 0.01
    legacy_missing_p: float = 0.08
    # Incident rates (per day per asset)
    modern_incident_rate: float = 0.05
    legacy_incident_rate: float = 0.12

CFG = EventConfig(days=14)

# ---------- Helper: generate UTC timestamps ----------
# The window ends at execution time, so timestamps differ between runs.
end_utc = pd.Timestamp.now(tz="UTC")
start_utc = end_utc - pd.Timedelta(days=CFG.days)

# ---------- Telemetry schema ----------
# We'll generate a small set of metrics that work across packaging/device ops
metric_catalog = [
    # metric_name, unit, baseline_mean, baseline_std
    ("temp_c", "C", 28.0, 3.0),
    ("vibration_mm_s", "mm/s", 2.0, 0.9),
    ("pressure_kpa", "kPa", 210.0, 20.0),
    ("humidity_rh", "%", 45.0, 8.0),
    ("line_speed_u_min", "units/min", 120.0, 25.0),
    ("reject_rate_pct", "%", 0.8, 0.5),
]

# Asset-type specific tweaks (roughly)
type_adjust = {
    "sterilizer": {"temp_c": 15.0, "pressure_kpa": 60.0, "humidity_rh": -10.0},
    "environmental_monitor": {"humidity_rh": 5.0, "temp_c": -2.0},
    "vision_inspection": {"reject_rate_pct": 0.4},
    "weigh_check": {"reject_rate_pct": 0.3},
    "conveyor": {"vibration_mm_s": 0.6},
}

# Incident types (risk events)
incident_types = [
    ("microstop", 0.6),
    ("jam", 0.25),
    ("calibration_drift", 0.10),
    ("temp_excursion", 0.03),
    ("sensor_dropout", 0.02),
]

# ---------- Generate telemetry events ----------
events = []

def sample_metric(asset_type: str, metric_name: str, base_mu: float, base_sd: float, is_legacy: bool) -> float:
    # legacy is noisier
    sd = base_sd * (1.6 if is_legacy else 1.0)
    mu = base_mu
    # type-specific shifts
    if asset_type in type_adjust and metric_name in type_adjust[asset_type]:
        mu = mu + float(type_adjust[asset_type][metric_name])
    # clamp some metrics to non-negative
    val = np.random.normal(mu, sd)
    if metric_name in ("vibration_mm_s", "pressure_kpa", "humidity_rh", "line_speed_u_min", "reject_rate_pct"):
        val = max(0.0, float(val))
    return float(val)

for _, a in assets_df.iterrows():
    is_legacy = bool(a["is_legacy"])
    period = CFG.legacy_period_min if is_legacy else CFG.modern_period_min
    missing_p = CFG.legacy_missing_p if is_legacy else CFG.modern_missing_p

    ts = pd.date_range(start=start_utc, end=end_utc, freq=f"{period}min", tz="UTC", inclusive="left")
    # telemetry rows: choose 1–3 metrics per timestamp for a "wide-ish but sparse" reality
    for t in ts:
        if random.random() < missing_p:
            continue

        n_metrics = random.choice([1, 2, 2, 3])  # bias toward 2
        metrics = random.sample(metric_catalog, k=n_metrics)
        for metric_name, unit, mu, sd in metrics:
            value = sample_metric(a["asset_type"], metric_name, mu, sd, is_legacy)
            events.append({
                "event_id": None,  # fill later
                "ts_utc": t,
                "site_id": a["site_id"],
                "line_id": a["line_id"],
                "asset_id": a["asset_id"],
                "asset_type": a["asset_type"],
                "is_legacy": is_legacy,
                "event_kind": "telemetry",
                "metric_name": metric_name,
                "metric_unit": unit,
                "metric_value": value,
                "severity": None,
                "incident_type": None,
                "message": None,
            })

# ---------- Generate incident events ----------
for _, a in assets_df.iterrows():
    is_legacy = bool(a["is_legacy"])
    daily_rate = CFG.legacy_incident_rate if is_legacy else CFG.modern_incident_rate

    for d in range(CFG.days):
        # Poisson number of incidents per day
        n_inc = np.random.poisson(lam=daily_rate)
        if n_inc <= 0:
            continue

        day_start = start_utc + pd.Timedelta(days=d)
        for _ in range(int(n_inc)):
            # random time in the day
            t = day_start + pd.Timedelta(minutes=int(random.random() * 1440))
            t = t.tz_convert("UTC")

            # choose incident type by weights
            r = random.random()
            cum = 0.0
            chosen = "microstop"
            for name, w in incident_types:
                cum += w
                if r <= cum:
                    chosen = name
                    break

            # severity: legacy tends to be a bit higher
            sev = np.clip(np.random.normal(loc=(2.8 if is_legacy else 2.2), scale=0.9), 1.0, 5.0)

            events.append({
                "event_id": None,
                "ts_utc": t,
                "site_id": a["site_id"],
                "line_id": a["line_id"],
                "asset_id": a["asset_id"],
                "asset_type": a["asset_type"],
                "is_legacy": is_legacy,
                "event_kind": "incident",
                "metric_name": None,
                "metric_unit": None,
                "metric_value": None,
                "severity": float(sev),
                "incident_type": chosen,
                "message": f"{chosen} detected on {a['asset_type']} ({'legacy' if is_legacy else 'modern'})",
            })

# ---------- Build DataFrame + IDs ----------
events_df = pd.DataFrame(events)

# Ensure ts_utc is datetime64[ns, UTC]
events_df["ts_utc"] = pd.to_datetime(events_df["ts_utc"], utc=True, errors="coerce")

# Drop any weird null timestamps (should be rare)
events_df = events_df.dropna(subset=["ts_utc"]).reset_index(drop=True)

# Stable, readable event_id
events_df["event_id"] = [
    f"E{i+1:010d}" for i in range(len(events_df))
]

# ---------- Add local time fields (multi-timezone safe) ----------
# NOTE: per-row timezone conversion produces an object column for ts_local (expected).
def _local_parts(ts_utc: pd.Timestamp, site_id: str):
    tz = tz_by_site.get(site_id, "UTC")
    ts_loc = ts_utc.tz_convert(tz)
    return (
        ts_loc,                       # tz-aware Timestamp (object dtype)
        ts_loc.date().isoformat(),    # YYYY-MM-DD
        int(ts_loc.hour),             # 0-23
        ts_loc.strftime("%Y-%m-%d %H:%M:%S %Z")  # readable
    )

_local = [_local_parts(ts, sid) for ts, sid in zip(events_df["ts_utc"], events_df["site_id"])]
events_df["ts_local"] = [x[0] for x in _local]
events_df["local_date"] = [x[1] for x in _local]
events_df["local_hour"] = [x[2] for x in _local]
events_df["ts_local_str"] = [x[3] for x in _local]

# ---------- Quick checks + write artifacts ----------
print("Rows:", len(events_df))
print("Telemetry rows:", int((events_df["event_kind"] == "telemetry").sum()))
print("Incident rows :", int((events_df["event_kind"] == "incident").sum()))

# Basic distribution checks
legacy_frac = float(events_df["is_legacy"].mean())
print(f"Legacy share (events): {legacy_frac:.3f}")

# Persist
events_df.to_parquet(OUT_EVENTS, index=False)
assets_df.to_csv(OUT_ASSETS, index=False)
sites_df.to_csv(OUT_SITES, index=False)

print("\nWrote:")
print(" -", OUT_EVENTS)
print(" -", OUT_ASSETS)
print(" -", OUT_SITES)

# Display a tiny sample
display(events_df.head(5))
display(assets_df.head(5))
display(sites_df)


Rows: 588681
Telemetry rows: 588549
Incident rows : 132
Legacy share (events): 0.282

Wrote:
 - data/raw/iot_events.parquet
 - data/raw/assets_master.csv
 - data/raw/sites_master.csv


Unnamed: 0,event_id,ts_utc,site_id,line_id,asset_id,asset_type,is_legacy,event_kind,metric_name,metric_unit,metric_value,severity,incident_type,message,ts_local,local_date,local_hour,ts_local_str
0,E0000000001,2025-11-27 00:05:18.868743+00:00,S1,S1-L2,A0001,blister_packer,False,telemetry,temp_c,C,29.490142,,,,2025-11-26 19:05:18.868743-05:00,2025-11-26,19,2025-11-26 19:05:18 EST
1,E0000000002,2025-11-27 00:05:18.868743+00:00,S1,S1-L2,A0001,blister_packer,False,telemetry,humidity_rh,%,43.893886,,,,2025-11-26 19:05:18.868743-05:00,2025-11-26,19,2025-11-26 19:05:18 EST
2,E0000000003,2025-11-27 00:10:18.868743+00:00,S1,S1-L2,A0001,blister_packer,False,telemetry,humidity_rh,%,50.181508,,,,2025-11-26 19:10:18.868743-05:00,2025-11-26,19,2025-11-26 19:10:18 EST
3,E0000000004,2025-11-27 00:15:18.868743+00:00,S1,S1-L2,A0001,blister_packer,False,telemetry,reject_rate_pct,%,1.561515,,,,2025-11-26 19:15:18.868743-05:00,2025-11-26,19,2025-11-26 19:15:18 EST
4,E0000000005,2025-11-27 00:15:18.868743+00:00,S1,S1-L2,A0001,blister_packer,False,telemetry,vibration_mm_s,mm/s,1.789262,,,,2025-11-26 19:15:18.868743-05:00,2025-11-26,19,2025-11-26 19:15:18 EST


Unnamed: 0,asset_id,site_id,line_id,asset_type,is_legacy,connectivity,vendor
0,A0001,S1,S1-L2,blister_packer,False,mqtt_opcua,VendorB
1,A0002,S2,S2-L5,print_apply,True,legacy_serial,VendorA
2,A0003,S4,S4-L2,blister_packer,True,legacy_serial,VendorB
3,A0004,S1,S1-L2,sterilizer,True,legacy_serial,VendorD
4,A0005,S4,S4-L2,environmental_monitor,True,legacy_serial,VendorA


Unnamed: 0,site_id,site_name,tz
0,S1,Indianapolis Packaging Plant,America/Indiana/Indianapolis
1,S2,San Diego Device Assembly,America/Los_Angeles
2,S3,Dublin EU Packaging,Europe/Dublin
3,S4,Singapore Sterile Ops,Asia/Singapore


### What Cell 4 Just Did

This cell generated a **synthetic IIoT dataset** designed to mimic a multi-site **pharmaceutical packaging / medical device** environment with **~50% legacy equipment** and realistic “messy data” behavior.

#### What it created
- A **sites master table** with multiple facilities across **different time zones**.
- An **assets master table** of industrial equipment (packaging + device assembly assets) with:
  - `asset_type`, `vendor`, `line_id`, `site_id`
  - `is_legacy` flag (legacy vs modern connectivity)
  - `connectivity` field (e.g., `legacy_serial` vs `mqtt_opcua`)

#### What it simulated
- **Telemetry events** (high volume) with metrics like temperature, vibration, pressure, humidity, line speed, reject rate.
  - **Legacy assets** produce data at a **slower cadence**, with **higher noise** and **more missingness** to replicate real-world gaps.
- **Incident events** (lower volume) such as `microstop`, `jam`, `calibration_drift`, `temp_excursion`, `sensor_dropout`
  - Incidents include a numeric `severity` and a human-readable `message`.
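
The incident-type draw in Cell 4 walks a cumulative sum of weights. As a sketch, the same weighted selection can be written with the standard library's `random.choices`. The function name is hypothetical; the weights are the ones listed in `incident_types`.

```python
import random
from collections import Counter

# (name, weight) pairs as defined in Cell 4; the weights sum to 1.0.
INCIDENT_TYPES = [
    ("microstop", 0.60),
    ("jam", 0.25),
    ("calibration_drift", 0.10),
    ("temp_excursion", 0.03),
    ("sensor_dropout", 0.02),
]


def sample_incident_types(n: int, seed: int = 42) -> list[str]:
    """Draw n incident types in proportion to their weights."""
    rng = random.Random(seed)
    names = [name for name, _ in INCIDENT_TYPES]
    weights = [weight for _, weight in INCIDENT_TYPES]
    return rng.choices(names, weights=weights, k=n)


counts = Counter(sample_incident_types(10_000))
for name, _ in INCIDENT_TYPES:
    print(f"{name:18s} {counts[name] / 10_000:.3f}")
```

This draws from the same distribution as the loop in Cell 4 but will not reproduce its exact sequence, since the random numbers are consumed differently.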

#### Time handling (multi-timezone)
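- All event timestamps are stored in **UTC** (`ts_utc`) for consistency across sites. The 14-day window ends when the cell runs, so the timestamps change from run to run even with a fixed seed.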
- Additional fields support dashboarding in local time:
  - `ts_local` (timezone-aware local timestamp)
  - `local_date`, `local_hour`
  - `ts_local_str` (readable string for quick inspection/debugging)
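
The local-time fields can be sketched with the standard library alone. The snippet below is an illustrative stand-in for `_local_parts`, not the notebook's code: it uses `datetime` and `zoneinfo` instead of pandas, and the function name `local_parts` is hypothetical. The timestamp matches the first sample row above, truncated to whole seconds.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Site time zones as defined in Cell 4.
TZ_BY_SITE = {
    "S1": "America/Indiana/Indianapolis",
    "S2": "America/Los_Angeles",
    "S3": "Europe/Dublin",
    "S4": "Asia/Singapore",
}


def local_parts(ts_utc: datetime, site_id: str) -> tuple[str, int, str]:
    """Return (local_date, local_hour, readable string) for a UTC timestamp."""
    ts_local = ts_utc.astimezone(ZoneInfo(TZ_BY_SITE.get(site_id, "UTC")))
    return (
        ts_local.date().isoformat(),
        ts_local.hour,
        ts_local.strftime("%Y-%m-%d %H:%M:%S %Z"),
    )


ts = datetime(2025, 11, 27, 0, 5, 18, tzinfo=timezone.utc)
for site_id in TZ_BY_SITE:
    print(site_id, local_parts(ts, site_id))
```

The same UTC instant falls on 2025-11-26 in Indianapolis and on 2025-11-27 in Singapore, which is why `local_date` and `local_hour` are derived per site rather than from `ts_utc` directly.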

#### Artifacts written
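- `data/raw/iot_events.parquet` (all events)
- `data/raw/assets_master.csv` (asset registry; overwrites the file written in Cell 3 with this cell's simplified schema)
- `data/raw/sites_master.csv` (site registry; overwrites the file written in Cell 2 with this cell's `site_id`, `site_name`, `tz` columns)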

#### Expected outputs
- Printed row counts for total events, telemetry vs incident counts, and the legacy share of events (a share of rows, not of assets).
- A preview of the first few rows of events, assets, and sites to confirm the generation worked.
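
The printed legacy share of events (0.282) is well below the ~50% legacy share of assets. A back-of-the-envelope sketch shows why: legacy assets report every 15 minutes instead of every 5 and drop more readings. The function name is hypothetical, and incident rows are ignored because they are a negligible fraction of the total.

```python
# Values from EventConfig in Cell 4.
MODERN_PERIOD_MIN, LEGACY_PERIOD_MIN = 5, 15
MODERN_MISSING_P, LEGACY_MISSING_P = 0.01, 0.08


def expected_legacy_event_share(legacy_asset_fraction: float) -> float:
    """Expected share of telemetry timestamps that come from legacy assets."""
    # Timestamps kept per asset per minute = (1 / period) * (1 - missing probability).
    legacy_rate = (1 - LEGACY_MISSING_P) / LEGACY_PERIOD_MIN
    modern_rate = (1 - MODERN_MISSING_P) / MODERN_PERIOD_MIN
    legacy = legacy_asset_fraction * legacy_rate
    modern = (1 - legacy_asset_fraction) * modern_rate
    return legacy / (legacy + modern)


print(f"{expected_legacy_event_share(0.50):.3f}")
```

With an exact 50/50 asset split the expected share is about 0.24. The observed 0.282 would be consistent with the random assignment producing somewhat more than 50% legacy assets in this run, although the notebook does not print that figure.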


In [8]:
# Cell 5 — Quick sanity checks & profiling summaries

RESULTS_DIR = REPO_ROOT / "data" / "results"
RESULTS_DIR.mkdir(parents=True, exist_ok=True)

# Resolve paths defensively in case you re-run cells out of order
events_path = OUT_EVENTS if "OUT_EVENTS" in globals() else REPO_ROOT / "data" / "raw" / "iot_events.parquet"
assets_path = OUT_ASSETS if "OUT_ASSETS" in globals() else REPO_ROOT / "data" / "raw" / "assets_master.csv"
sites_path  = OUT_SITES  if "OUT_SITES"  in globals() else REPO_ROOT / "data" / "raw" / "sites_master.csv"

events = pd.read_parquet(events_path)
assets = pd.read_csv(assets_path)
sites  = pd.read_csv(sites_path)

print("Loaded:")
print(f"  events: {len(events):,}")
print(f"  assets: {len(assets):,}")
print(f"  sites : {len(sites):,}\n")

print("Event columns:")
print(list(events.columns))
print("\nSample events:")
display(events.head())

cols = events.columns

# -------------------------------
# Per-site summary (volume + failures)
# -------------------------------
if "site_id" in cols:
    base_count_col = "ts_utc" if "ts_utc" in cols else cols[0]

    per_site = events.groupby("site_id").agg(
        n_events=(base_count_col, "count"),
    )

    if "is_failure" in cols:
        per_site["n_failures"] = events.groupby("site_id")["is_failure"].sum()
        per_site["failure_rate"] = per_site["n_failures"] / per_site["n_events"]

    per_site = per_site.reset_index()
    per_site = per_site.merge(sites, on="site_id", how="left")

    out_site = RESULTS_DIR / "01_events_summary_by_site.csv"
    per_site.to_csv(out_site, index=False)
    print(f"\nPer-site summary (head) → {out_site}")
    display(per_site.head())
else:
    print("\nSkip per-site summary: 'site_id' column not found.")

# -------------------------------
# By asset_type × is_legacy
# -------------------------------
if "asset_type" in cols and "is_legacy" in cols:
    base_count_col = "ts_utc" if "ts_utc" in cols else "asset_type"

    by_type = (
        events
        .groupby(["asset_type", "is_legacy"])
        .agg(
            n_events=(base_count_col, "count"),
        )
        .reset_index()
    )

    if "is_failure" in cols:
        fails = (
            events
            .groupby(["asset_type", "is_legacy"])["is_failure"]
            .sum()
            .reset_index(name="n_failures")
        )
        by_type = by_type.merge(fails, on=["asset_type", "is_legacy"], how="left")
        by_type["failure_rate"] = by_type["n_failures"] / by_type["n_events"]

    out_type = RESULTS_DIR / "01_events_summary_by_assettype.csv"
    by_type.to_csv(out_type, index=False)
    print(f"\nBy asset_type × is_legacy (head) → {out_type}")
    display(by_type.head())
else:
    print("\nSkip asset-type summary: 'asset_type' and/or 'is_legacy' not found.")


Loaded:
  events: 588,681
  assets: 120
  sites : 4

Event columns:
['event_id', 'ts_utc', 'site_id', 'line_id', 'asset_id', 'asset_type', 'is_legacy', 'event_kind', 'metric_name', 'metric_unit', 'metric_value', 'severity', 'incident_type', 'message', 'ts_local', 'local_date', 'local_hour', 'ts_local_str']

Sample events:


Unnamed: 0,event_id,ts_utc,site_id,line_id,asset_id,asset_type,is_legacy,event_kind,metric_name,metric_unit,metric_value,severity,incident_type,message,ts_local,local_date,local_hour,ts_local_str
0,E0000000001,2025-11-27 00:05:18.868743+00:00,S1,S1-L2,A0001,blister_packer,False,telemetry,temp_c,C,29.490142,,,,2025-11-26 19:05:18.868743-05:00,2025-11-26,19,2025-11-26 19:05:18 EST
1,E0000000002,2025-11-27 00:05:18.868743+00:00,S1,S1-L2,A0001,blister_packer,False,telemetry,humidity_rh,%,43.893886,,,,2025-11-26 19:05:18.868743-05:00,2025-11-26,19,2025-11-26 19:05:18 EST
2,E0000000003,2025-11-27 00:10:18.868743+00:00,S1,S1-L2,A0001,blister_packer,False,telemetry,humidity_rh,%,50.181508,,,,2025-11-26 19:10:18.868743-05:00,2025-11-26,19,2025-11-26 19:10:18 EST
3,E0000000004,2025-11-27 00:15:18.868743+00:00,S1,S1-L2,A0001,blister_packer,False,telemetry,reject_rate_pct,%,1.561515,,,,2025-11-26 19:15:18.868743-05:00,2025-11-26,19,2025-11-26 19:15:18 EST
4,E0000000005,2025-11-27 00:15:18.868743+00:00,S1,S1-L2,A0001,blister_packer,False,telemetry,vibration_mm_s,mm/s,1.789262,,,,2025-11-26 19:15:18.868743-05:00,2025-11-26,19,2025-11-26 19:15:18 EST



Per-site summary (head) → /home/parallels/projects/gmp-packaging-risk-analytics/data/results/01_events_summary_by_site.csv


Unnamed: 0,site_id,n_events,site_name,tz
0,S1,187758,Indianapolis Packaging Plant,America/Indiana/Indianapolis
1,S2,97437,San Diego Device Assembly,America/Los_Angeles
2,S3,127807,Dublin EU Packaging,Europe/Dublin
3,S4,175679,Singapore Sterile Ops,Asia/Singapore



By asset_type × is_legacy (head) → /home/parallels/projects/gmp-packaging-risk-analytics/data/results/01_events_summary_by_assettype.csv


Unnamed: 0,asset_type,is_legacy,n_events
0,blister_packer,False,24014
1,blister_packer,True,12348
2,bottle_filler,False,39866
3,bottle_filler,True,17355
4,capper,False,15945


### What Cell 5 Just Did

This cell performs **sanity checks and lightweight profiling** on the synthetic IIoT dataset we just generated, and then writes two small summary tables for later dashboards.

**Key steps**

1. **Reloads core artifacts**
   - Reads the main events table from `data/raw/iot_events.parquet`.
   - Reads `assets_master.csv` and `sites_master.csv` from `data/raw/`.
   - Prints basic row counts and shows the first few event rows for a quick visual check.

2. **Builds a per-site health summary**
   - Groups events by `site_id` to compute:
     - `n_events` – total number of events per site.
     - `n_failures` and `failure_rate` (if `is_failure` is present).
   - Joins in site metadata from `sites_master.csv` (site name and IANA time zone).
   - Writes the result to:  
     `data/results/01_events_summary_by_site.csv`.

3. **Builds an asset-type × legacy/modern summary**
   - Groups events by `asset_type` and `is_legacy` to compute:
     - `n_events` per group.
     - `n_failures` and `failure_rate` when available.
   - Writes the result to:  
     `data/results/01_events_summary_by_assettype.csv`.
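
The grouping-and-join pattern described in the steps above can be sketched as follows. This is a minimal illustration on toy data, not the notebook's actual Cell 5 code: the helper name `summarize_events`, the toy frames, and the `is_failure` column are assumptions made for the example.

```python
import pandas as pd


def summarize_events(events: pd.DataFrame, keys: list[str]) -> pd.DataFrame:
    """Count events per group; add failure stats only if `is_failure` exists."""
    grouped = events.groupby(keys, dropna=False)
    summary = grouped.size().reset_index(name="n_events")
    if "is_failure" in events.columns:  # assumed column name
        failures = grouped["is_failure"].sum().reset_index(name="n_failures")
        summary = summary.merge(failures, on=keys, how="left")
        summary["failure_rate"] = summary["n_failures"] / summary["n_events"]
    return summary


# Toy data for illustration only (not the notebook's real events).
toy_events = pd.DataFrame(
    {
        "site_id": ["S1", "S1", "S2", "S2", "S2"],
        "asset_type": ["capper", "capper", "cartoner", "cartoner", "capper"],
        "is_legacy": [True, False, True, True, False],
    }
)
toy_sites = pd.DataFrame(
    {
        "site_id": ["S1", "S2"],
        "tz": ["America/Indiana/Indianapolis", "America/Los_Angeles"],
    }
)

# Per-site summary joined with site metadata.
by_site = summarize_events(toy_events, ["site_id"]).merge(
    toy_sites, on="site_id", how="left"
)

# Asset-type x legacy/modern summary.
by_type = summarize_events(toy_events, ["asset_type", "is_legacy"])
```

Because the failure columns are added only when the flag exists, the same helper yields an `n_events`-only table on data without failure labels, which matches the shape of the outputs shown above.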

**Why this matters**

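These compact summaries give us a **first look at risk hot-spots**:
- Which sites generate the most events (and failures, when an `is_failure` column is present).
- How event volumes, and failure rates when available, differ between **legacy** and **modern** assets by type.

In the run shown above, both summaries contain only `n_events`, so no failure columns were written.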

They also serve as **clean, small inputs** for downstream notebooks and dashboards, so we do not need to scan the full event table every time.


In [10]:
# Cell 6 – quick QA views + artifact recap (robust to column names)

import pandas as pd
from pathlib import Path
from IPython.display import display  # explicit import so the cell also runs outside Jupyter

REPO_ROOT = Path(".").resolve()
RAW_DIR = REPO_ROOT / "data" / "raw"
RESULTS_DIR = REPO_ROOT / "data" / "results"

events_path = RAW_DIR / "iot_events.parquet"
site_summary_path = RESULTS_DIR / "01_events_summary_by_site.csv"
asset_summary_path = RESULTS_DIR / "01_events_summary_by_assettype.csv"

# --- Load artifacts ---
events = pd.read_parquet(events_path)
site_summary = pd.read_csv(site_summary_path)
asset_summary = pd.read_csv(asset_summary_path)

print("Site-level summary (top by failure_rate):")
if "failure_rate" in site_summary.columns:
    display(
        site_summary.sort_values("failure_rate", ascending=False)
        .head(10)
        .reset_index(drop=True)
    )
else:
    display(site_summary.head(10))

print("\nAsset-type × legacy/modern summary (sorted by failure_rate, then n_events):")
if "failure_rate" in asset_summary.columns:
    display(
        asset_summary.sort_values(["failure_rate", "n_events"],
                                  ascending=[False, False])
        .head(10)
        .reset_index(drop=True)
    )
else:
    display(asset_summary.head(10))

# --- Global event mix (robust to missing event_type/severity) ---
print("\nEvent mix counts (global):")

cols = set(events.columns)

if {"event_type", "severity"}.issubset(cols):
    # ideal case: both present
    event_counts = (
        events
        .groupby(["event_type", "severity"], dropna=False)
        .size()
        .reset_index(name="n_events")
        .sort_values("n_events", ascending=False)
    )
    display(event_counts.head(20).reset_index(drop=True))
elif "severity" in cols:
    # fall back: only severity available
    event_counts = (
        events
        .groupby("severity", dropna=False)
        .size()
        .reset_index(name="n_events")
        .sort_values("n_events", ascending=False)
    )
    display(event_counts.head(20).reset_index(drop=True))
elif "event_code" in cols:
    # fall back: event_code only
    event_counts = (
        events
        .groupby("event_code", dropna=False)
        .size()
        .reset_index(name="n_events")
        .sort_values("n_events", ascending=False)
    )
    display(event_counts.head(20).reset_index(drop=True))
else:
    print("No obvious event-type/severity columns found. Columns are:")
    print(sorted(events.columns))

# --- Legacy vs modern event volume ---
print("\nLegacy vs modern event volume:")
if "is_legacy" in cols:
    legacy_counts = (
        events
        .groupby("is_legacy")
        .size()
        .rename("n_events")
        .reset_index()
        .sort_values("n_events", ascending=False)
    )
    display(legacy_counts.reset_index(drop=True))
else:
    print("Column 'is_legacy' not found. Available columns:")
    print(sorted(events.columns))

print("\nArtifacts created so far:")
for p in [
    events_path,
    RAW_DIR / "assets_master.csv",
    RAW_DIR / "sites_master.csv",
    site_summary_path,
    asset_summary_path,
]:
    print(" -", p.relative_to(REPO_ROOT))

Site-level summary (top by failure_rate):


Unnamed: 0,site_id,n_events,site_name,tz
0,S1,187758,Indianapolis Packaging Plant,America/Indiana/Indianapolis
1,S2,97437,San Diego Device Assembly,America/Los_Angeles
2,S3,127807,Dublin EU Packaging,Europe/Dublin
3,S4,175679,Singapore Sterile Ops,Asia/Singapore



Asset-type × legacy/modern summary (sorted by failure_rate, then n_events):


Unnamed: 0,asset_type,is_legacy,n_events
0,blister_packer,False,24014
1,blister_packer,True,12348
2,bottle_filler,False,39866
3,bottle_filler,True,17355
4,capper,False,15945
5,capper,True,12477
6,cartoner,False,55856
7,cartoner,True,17121
8,case_packer,False,23985
9,case_packer,True,9965



Event mix counts (global):


Unnamed: 0,severity,n_events
0,,588549
1,1.0,3
2,1.197602,1
3,2.805089,1
4,3.284938,1
5,3.282828,1
6,3.26678,1
7,3.256836,1
8,3.219435,1
9,3.218275,1



Legacy vs modern event volume:


Unnamed: 0,is_legacy,n_events
0,False,422841
1,True,165840



Artifacts created so far:
 - data/raw/iot_events.parquet
 - data/raw/assets_master.csv
 - data/raw/sites_master.csv
 - data/results/01_events_summary_by_site.csv
 - data/results/01_events_summary_by_assettype.csv


### What Cell 6 Just Did

This cell gives us a **quick QA view** of the synthetic IIoT dataset and the summary tables we just generated, and then recaps all key artifacts on disk.

**1. Loads previously written artifacts**

We reload:

- `data/raw/iot_events.parquet` – the full synthetic event stream  
- `data/raw/assets_master.csv` – asset registry (including `is_legacy`)  
- `data/raw/sites_master.csv` – site / region metadata  
- `data/results/01_events_summary_by_site.csv` – site-level health stats  
- `data/results/01_events_summary_by_assettype.csv` – asset-type × legacy/modern stats  

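**2. Ranks sites by failure risk**

From `01_events_summary_by_site.csv` we:

- Sort sites by `failure_rate` (descending) when that column exists, and show up to 10 rows as an early **risk hot-spot list** for future dashboards.  
- Otherwise fall back to the first rows of the summary. In the run above there is no `failure_rate` column, so the four sites appear unranked with `n_events` only.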

**3. Compares asset types and legacy vs modern**

From `01_events_summary_by_assettype.csv` we:

- Sort by `failure_rate` and then `n_events` (both descending) when `failure_rate` exists  
- Show up to 10 asset-type × `is_legacy` combinations  
- With failure rates present, this would surface patterns such as “legacy cartoners fail more often than modern ones”. In the run above only `n_events` is available, so the table is an unsorted volume listing.

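**4. Global event mix**

From the full `iot_events` table we count events by the best available grouping:

- `event_type × severity` if both columns exist, else `severity` alone, else `event_code` (top 20 groups)  
  - In the run above the cell grouped by `severity` alone. 588,549 of the 588,681 events have no severity, and the rows shown for the remaining 132 carry continuous values that occur only once or a few times each. The view confirms that severity-bearing events are rare, but it says little about the overall event mix.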
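
Since `severity` appears to be a continuous score rather than a category, one way to make this check more informative is to bin it into bands before counting. The following is a sketch on toy data: the band edges and labels are hypothetical and would need to be tuned to the generator's actual severity scale.

```python
import numpy as np
import pandas as pd

# Toy events: most rows have no severity, a few carry a continuous score.
toy = pd.DataFrame({"severity": [np.nan, np.nan, np.nan, 1.0, 1.2, 2.8, 3.3]})

# Hypothetical band edges; intervals are closed on the left: [0, 2), [2, 3), [3, inf).
bins = [0.0, 2.0, 3.0, np.inf]
labels = ["low", "medium", "high"]
band = pd.cut(toy["severity"], bins=bins, labels=labels, right=False)

# Give rows without a severity their own explicit category instead of NaN.
toy["severity_band"] = band.cat.add_categories("none").fillna("none")

# Counts per band, as a plain dict for easy inspection.
band_counts = toy["severity_band"].value_counts().to_dict()
```

Counting by band collapses many near-unique values into a handful of groups, so the table shows the split between unlabelled and severity-bearing events at a glance.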

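**5. Legacy vs modern volume check**

We aggregate `n_events` by `is_legacy` to check each group's share of the event stream:

- In the run above, legacy assets contribute 165,840 of 588,681 events (about 28%), not half. This is plausible given that legacy equipment is designed to report fewer signals with more missingness, even if it makes up about half of the fleet.
- This cell does not test whether legacy assets are **over-represented** in failures relative to their share of events; that needs a per-event failure flag.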
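
A comparison of failure share against event share could look like the sketch below. It runs on toy data and assumes a boolean `is_failure` column, which the real events table may not have; the `over_representation` ratio is an illustrative metric, not something the notebook computes.

```python
import pandas as pd

# Toy data; `is_failure` is an assumed column name.
toy = pd.DataFrame(
    {
        "is_legacy": [True, True, False, False, False, False, False, False],
        "is_failure": [True, False, True, False, False, False, False, False],
    }
)

shares = toy.groupby("is_legacy", as_index=False).agg(
    n_events=("is_failure", "size"),
    n_failures=("is_failure", "sum"),
)
shares["event_share"] = shares["n_events"] / shares["n_events"].sum()
shares["failure_share"] = shares["n_failures"] / shares["n_failures"].sum()

# A ratio above 1 means the group has more failures than its event volume alone would suggest.
shares["over_representation"] = shares["failure_share"] / shares["event_share"]
```

In this toy example legacy assets produce a quarter of the events but half of the failures, so their ratio is 2.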

**6. Artifact recap**

Finally, the cell prints the relative paths of all key files created so far so downstream notebooks (or the README) can reference them directly.

At this point, synthetic data generation is complete and lightly validated.  
Next, we can:

- Build **DuckDB views** on top of `iot_events.parquet`, or  
- Start the **FastAPI backend/dashboard** that will consume these artifacts for risk analytics.


## Notebook Summary – Synthetic IIoT Event Generator

In this notebook we built a **reproducible synthetic IIoT dataset** tailored to a
pharma packaging / medical device environment with mixed **legacy vs. modern** assets
across multiple global sites.

**What we set up**

- Verified the repo structure and created standard folders:
  - `notebooks/`, `data/raw/`, `data/interim/`, `data/results/`
- Defined a **reproducible RNG seed** (`42`) and clear output targets:
  - `data/raw/iot_events.parquet`
  - `data/raw/assets_master.csv`
  - `data/raw/sites_master.csv`
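
To illustrate why a fixed seed makes the outputs reproducible, here is a minimal sketch using NumPy's `default_rng`. Whether the notebook uses `default_rng` or another seeding mechanism is not shown in this section, and the helper name `draw_readings` and the distribution parameters are hypothetical.

```python
import numpy as np

SEED = 42  # the seed value stated above


def draw_readings(seed: int, n: int = 5) -> np.ndarray:
    """Draw n synthetic humidity readings (%RH) from a freshly seeded generator."""
    rng = np.random.default_rng(seed)
    # Hypothetical mean and spread, chosen only for illustration.
    return rng.normal(loc=45.0, scale=3.0, size=n)


first = draw_readings(SEED)
second = draw_readings(SEED)

# Same seed, same sequence: the two arrays are identical element by element.
same = bool(np.array_equal(first, second))
```

Creating the generator once and passing it to every sampling function keeps the whole dataset reproducible from a single seed.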

**What we generated**

- A **sites master** table with:
  - Four sites spanning multiple time zones (US, EU, APAC)
  - Descriptive names and IANA time zone identifiers

- An **assets master** table with:
  - Pharma / med-device packaging asset types (e.g., blister packer, bottle filler)
  - A roughly 50/50 mix of **legacy vs. modern** equipment per site
  - Basic attributes for downstream risk and availability modeling

- A large **event stream** with:
  - Timestamped IIoT events (`ts_utc` + per-site `ts_local`)
  - Per-asset `event_code`, `event_value`, and `event_value_unit`
  - Failure-like events synthesized as rare, higher-severity events
  - Legacy assets biased toward **higher failure rates** and noisier signals
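
The per-site local timestamp can be derived from `ts_utc` and the site's IANA time zone. The sketch below shows one way to do this per site group on toy data; the `site_tz` mapping and the `ts_local_str` column name are assumptions for the example, not the notebook's actual generator code.

```python
import pandas as pd

# Toy rows; the real events table has many more columns.
toy = pd.DataFrame(
    {
        "site_id": ["S1", "S4"],
        "ts_utc": pd.to_datetime(
            ["2025-11-27 00:05:18", "2025-11-27 00:05:18"], utc=True
        ),
    }
)

# IANA zones taken from the sites summary shown earlier.
site_tz = {"S1": "America/Indiana/Indianapolis", "S4": "Asia/Singapore"}

# Convert each site's timestamps in one vectorized call, then format as strings.
parts = []
for site_id, group in toy.groupby("site_id"):
    local = group["ts_utc"].dt.tz_convert(site_tz[site_id])
    parts.append(local.dt.strftime("%Y-%m-%d %H:%M:%S %Z"))

# Assignment aligns on the original row index.
toy["ts_local_str"] = pd.concat(parts)
```

Local times are stored as strings here because a single pandas datetime column can hold only one time zone; mixing zones in one column would force it to `object` dtype.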

**Quality checks performed**

- **Per-site** summary:
  - Event volumes by site, plus crude failure rate estimates when a failure flag is available
- **Per-asset-type** summary:
  - Event volume, and failure rate when available, split by `asset_type` and `is_legacy`
- Global **event mix** QA:
  - Flexible grouping over available event columns (e.g., `severity`, `event_code`)
- Legacy vs. modern **event volume comparison**:
  - Quick check of how much of the event stream each group contributes (about 28% legacy in the run above)

**Outputs for downstream notebooks**

You can now treat this notebook as the **single source of truth** for raw synthetic
IIoT data. Downstream notebooks can consume:

- `data/raw/sites_master.csv` – site metadata + time zones  
- `data/raw/assets_master.csv` – asset inventory with `is_legacy` flag  
- `data/raw/iot_events.parquet` – high-volume event stream for:
  - reliability dashboards  
  - risk scoring models  
  - cross-site operational analytics

Next step: build **intermediate feature tables** and **risk / health metrics** on top
of this event stream for the enterprise risk dashboard.
