# 01 – Data Simulation (IoT Sensor Dataset)

**Goal:** Generate three months of hourly IoT sensor data (January–March 2023) for 20 machines, simulate failures, and inject 72-hour pre-failure drift patterns.

This notebook creates:  
- `sensor_readings.csv` → hourly machine readings  
- `failures.csv` → failure events  

These datasets will be used for Azure SQL ingestion, feature engineering, and model training.

## 1. Setup: Imports and Basic Configuration
Load the required libraries, fix the random seed for reproducibility, and define the time range (January–March 2023).

In [1]:
import numpy as np
import pandas as pd

# Reproducibility
np.random.seed(42)

# Configuration
n_machines = 20                     # number of simulated machines
start_date = "2023-01-01"           # first reading (midnight, 1 January)
end_date   = "2023-03-31 23:00:00"  # last reading (end of March, 3 months)

# Generate hourly timestamps ("h" replaces the "H" alias, deprecated in pandas 2.2)
time_index = pd.date_range(start=start_date, end=end_date, freq="h")
n_hours = len(time_index)

print(f"Hours in dataset: {n_hours}")
print(f"Machines: {n_machines}")

Hours in dataset: 2160
Machines: 20
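
The printed hour count can be cross-checked by hand. The sketch below is a minimal check, assuming the same start and end timestamps as the configuration cell; the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Same bounds as the configuration cell
start = pd.Timestamp("2023-01-01 00:00:00")
end = pd.Timestamp("2023-03-31 23:00:00")

# An inclusive hourly range has (whole hours between the bounds) + 1 points
expected_hours = int((end - start) / pd.Timedelta(hours=1)) + 1

# 2023 is not a leap year: January (31) + February (28) + March (31) days
days = 31 + 28 + 31

print(expected_hours, days * 24)
```

Both numbers should agree with the `Hours in dataset` line above. With 20 machines, the full grid then holds 20 × 2160 = 43,200 rows.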




## 2. Generate Base Sensor Readings
Create a row for each combination of (machine_id × hourly timestamp), then add “normal” sensor values:

- temperature  
- vibration  
- pressure  
- current  
- rpm  

`status_code = 0` means normal behavior.

In [2]:
# Create base index of machine_id × timestamp
machine_ids = np.arange(1, n_machines + 1)

sensor_df = pd.MultiIndex.from_product(
    [machine_ids, time_index],
    names=["machine_id", "reading_time"]
).to_frame(index=False)

# Normal sensor behavior
sensor_df["temperature"] = 60 + np.random.normal(0, 2, len(sensor_df))
sensor_df["vibration"]   = 1.0 + np.random.normal(0, 0.1, len(sensor_df))
sensor_df["pressure"]    = 30 + np.random.normal(0, 1, len(sensor_df))
sensor_df["current"]     = 10 + np.random.normal(0, 0.5, len(sensor_df))
sensor_df["rpm"]         = 1500 + np.random.normal(0, 50, len(sensor_df))

sensor_df["status_code"] = 0  # 0 = normal
sensor_df.head()


| | machine_id | reading_time | temperature | vibration | pressure | current | rpm | status_code |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2023-01-01 00:00:00 | 60.993428 | 1.009931 | 29.872158 | 9.664418 | 1589.581888 | 0 |
| 1 | 1 | 2023-01-01 01:00:00 | 59.723471 | 1.104616 | 28.475418 | 10.343879 | 1522.293431 | 0 |
| 2 | 1 | 2023-01-01 02:00:00 | 61.295377 | 1.15479 | 29.787573 | 9.453091 | 1502.452725 | 0 |
| 3 | 1 | 2023-01-01 03:00:00 | 63.04606 | 0.987813 | 29.386867 | 9.340127 | 1549.795581 | 0 |
| 4 | 1 | 2023-01-01 04:00:00 | 59.531693 | 1.08758 | 31.336242 | 10.02605 | 1559.86806 | 0 |
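
To make the (machine_id × timestamp) grid concrete, here is a small sketch of the same `MultiIndex.from_product` pattern on a toy scale. The two machines and three timestamps are made up for illustration; only the construction mirrors the cell above.

```python
import numpy as np
import pandas as pd

toy_machines = np.arange(1, 3)  # 2 hypothetical machines: 1 and 2
toy_times = pd.date_range("2023-01-01", periods=3, freq="h")  # 3 hourly stamps

# Cartesian product: every machine paired with every timestamp
toy_grid = pd.MultiIndex.from_product(
    [toy_machines, toy_times],
    names=["machine_id", "reading_time"],
).to_frame(index=False)

# One row per (machine, hour) pair, with no duplicates
assert len(toy_grid) == len(toy_machines) * len(toy_times)
assert not toy_grid.duplicated(["machine_id", "reading_time"]).any()

print(toy_grid)
```

The rows come out ordered by machine first and time second, which is why the `head()` output above shows only machine 1.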


## 3. Simulate Failures per Machine
Each machine is assigned 2–5 failures at random, distinct timestamps within the simulated period.

We store:
- `machine_id`
- `failure_time`
- `failure_type` (bearing, overheat, vibration)

This will drive the drift injection and label creation.

In [3]:
failure_records = []

min_failures = 2
max_failures = 5

# Avoid first/last 3 days to leave room for drift
valid_times = time_index[24*3 : -24*3]

for m in machine_ids:
    n_fail = np.random.randint(min_failures, max_failures + 1)
    failure_times = np.random.choice(valid_times, size=n_fail, replace=False)
    failure_times = sorted(failure_times)
    
    for ft in failure_times:
        failure_records.append({
            "machine_id": m,
            "failure_time": ft,
            "failure_type": np.random.choice(["bearing", "overheat", "vibration"])
        })

failures_df = pd.DataFrame(failure_records)
failures_df.head()


| | machine_id | failure_time | failure_type |
|---|---|---|---|
| 0 | 1 | 2023-01-17 17:00:00 | overheat |
| 1 | 1 | 2023-01-26 12:00:00 | overheat |
| 2 | 2 | 2023-01-27 18:00:00 | overheat |
| 3 | 2 | 2023-02-08 12:00:00 | overheat |
| 4 | 3 | 2023-01-28 23:00:00 | vibration |
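
The sampling step for a single machine can be sketched in isolation as follows. This version uses a local `np.random.default_rng` generator with an arbitrary seed instead of the notebook's global seed, and it samples positions rather than timestamps; both are substitutions made for this sketch, so the drawn values will not match the table above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # local generator; the seed is arbitrary

time_index = pd.date_range("2023-01-01", "2023-03-31 23:00:00", freq="h")
margin = 24 * 3  # 3 days trimmed from each end
valid_times = time_index[margin:-margin]

# Upper bound of integers() is exclusive, so this draws from 2..5
n_fail = int(rng.integers(2, 5 + 1))

# Sample distinct positions, then look up the (sorted) timestamps
picked = rng.choice(len(valid_times), size=n_fail, replace=False)
failure_times = valid_times[np.sort(picked)]

# Smallest gap between consecutive failures of this machine
min_gap = failure_times.to_series().diff().min()
print(failure_times, min_gap)
```

Sampling without replacement guarantees distinct timestamps but not a minimum spacing. Two failures of the same machine can fall less than 72 hours apart, in which case their drift windows overlap and the offsets add up.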


## 4. Inject 72-Hour Pre-Failure Drift
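In the 72 hours before each failure, a linear offset is added to three sensors:

- Temperature rises by +2 at the start of the window to +8 in the last hour  
- Vibration rises by +0.2 to +0.8  
- Pressure rises by +1 to +4  

The same drift is applied regardless of `failure_type`, and `current` and `rpm` are left unchanged.

We also mark the final 24 hours before each failure, plus the failure hour itself, with `status_code = 1` to indicate abnormal conditions.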
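
The shape of the drift is easiest to see on a noise-free toy series. The sketch below assumes a single hypothetical failure and a flat baseline of 60; it applies the same `np.linspace(2, 8, ...)` offset and the same window boundaries as the notebook cell that follows.

```python
import numpy as np
import pandas as pd

drift_hours = 72
failure_time = pd.Timestamp("2023-01-10 00:00:00")  # hypothetical failure

# 101 hourly points ending at the failure, flat baseline without noise
times = pd.date_range(failure_time - pd.Timedelta(hours=100), failure_time, freq="h")
temperature = pd.Series(60.0, index=times)

# Drift window is half-open: [failure - 72h, failure)
drift_start = failure_time - pd.Timedelta(hours=drift_hours)
in_drift = (times >= drift_start) & (times < failure_time)
temperature.loc[in_drift] += np.linspace(2, 8, in_drift.sum())

# Abnormal label window is closed: [failure - 24h, failure]
abnormal = (times >= failure_time - pd.Timedelta(hours=24)) & (times <= failure_time)

print(in_drift.sum(), abnormal.sum())
print(temperature.loc[drift_start], temperature.loc[failure_time])
```

Three details follow from the boundaries: the offset starts with a step of +2 rather than ramping up from zero, the failure hour itself receives no drift, and the label window covers 25 rows because both of its ends are inclusive.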

In [4]:
drift_hours = 72

sensor_df = sensor_df.sort_values(["machine_id", "reading_time"]).reset_index(drop=True)
sensor_df.set_index(["machine_id", "reading_time"], inplace=True)

for _, row in failures_df.iterrows():
    m_id = row["machine_id"]
    f_time = row["failure_time"]
    
    drift_start = f_time - pd.Timedelta(hours=drift_hours)
    
    mask = (
        (sensor_df.index.get_level_values("machine_id") == m_id) &
        (sensor_df.index.get_level_values("reading_time") >= drift_start) &
        (sensor_df.index.get_level_values("reading_time") < f_time)
    )
    
    n_points = mask.sum()
    if n_points > 0:
        sensor_df.loc[mask, "temperature"] += np.linspace(2, 8, n_points)
        sensor_df.loc[mask, "vibration"]   += np.linspace(0.2, 0.8, n_points)
        sensor_df.loc[mask, "pressure"]    += np.linspace(1, 4, n_points)

# Mark last 24 hours as abnormal
for _, row in failures_df.iterrows():
    m_id = row["machine_id"]
    f_time = row["failure_time"]
    window_start = f_time - pd.Timedelta(hours=24)
    
    mask = (
        (sensor_df.index.get_level_values("machine_id") == m_id) &
        (sensor_df.index.get_level_values("reading_time") >= window_start) &
        (sensor_df.index.get_level_values("reading_time") <= f_time)
    )
    
    sensor_df.loc[mask, "status_code"] = 1

sensor_df.reset_index(inplace=True)
sensor_df.head()


| | machine_id | reading_time | temperature | vibration | pressure | current | rpm | status_code |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2023-01-01 00:00:00 | 60.993428 | 1.009931 | 29.872158 | 9.664418 | 1589.581888 | 0 |
| 1 | 1 | 2023-01-01 01:00:00 | 59.723471 | 1.104616 | 28.475418 | 10.343879 | 1522.293431 | 0 |
| 2 | 1 | 2023-01-01 02:00:00 | 61.295377 | 1.15479 | 29.787573 | 9.453091 | 1502.452725 | 0 |
| 3 | 1 | 2023-01-01 03:00:00 | 63.04606 | 0.987813 | 29.386867 | 9.340127 | 1549.795581 | 0 |
| 4 | 1 | 2023-01-01 04:00:00 | 59.531693 | 1.08758 | 31.336242 | 10.02605 | 1559.86806 | 0 |
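
These first rows are unchanged from Section 2 because machine 1's first failure is at 2023-01-17 17:00, so its earliest drift window only starts on 2023-01-14. To see the effect of the injection, it helps to compare readings inside and outside the drift windows. The helper below, `drift_window_mask`, is a hypothetical function written for this sketch and is demonstrated on a small noise-free dataset rather than on the notebook's `sensor_df`.

```python
import numpy as np
import pandas as pd

def drift_window_mask(readings, failures, hours=72):
    """True where a reading lies in [failure - hours, failure) for its machine."""
    mask = pd.Series(False, index=readings.index)
    for row in failures.itertuples(index=False):
        start = row.failure_time - pd.Timedelta(hours=hours)
        mask |= (
            (readings["machine_id"] == row.machine_id)
            & (readings["reading_time"] >= start)
            & (readings["reading_time"] < row.failure_time)
        )
    return mask

# Toy data: one machine, 200 hours, flat temperature, one failure at hour 150
times = pd.date_range("2023-01-01", periods=200, freq="h")
readings = pd.DataFrame({"machine_id": 1, "reading_time": times, "temperature": 60.0})
failures = pd.DataFrame({"machine_id": [1], "failure_time": [times[150]]})

in_window = drift_window_mask(readings, failures)
readings.loc[in_window, "temperature"] += np.linspace(2, 8, in_window.sum())

mean_inside = readings.loc[in_window, "temperature"].mean()
mean_outside = readings.loc[~in_window, "temperature"].mean()
print(mean_inside, mean_outside)
```

On the real data the same comparison would be noisier, but the mean temperature inside the windows should sit roughly 5 above the baseline, the midpoint of the +2 to +8 ramp, as long as windows do not overlap.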


## 5. Save Datasets to the `../data` Folder
We export:

- `sensor_readings.csv`
- `failures.csv`

These will be used in the next notebook for feature engineering.

In [5]:
sensor_path   = "../data/sensor_readings.csv"
failures_path = "../data/failures.csv"

sensor_df.to_csv(sensor_path, index=False)
failures_df.to_csv(failures_path, index=False)

print("Saved files:")
print(sensor_path)
print(failures_path)

Saved files:
../data/sensor_readings.csv
../data/failures.csv
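
CSV files do not store column types, so timestamps come back as plain strings unless they are parsed on load. The sketch below is a minimal round-trip check under stated assumptions: it writes to a temporary directory instead of `../data`, and it uses two rows copied from the failures table above rather than the full `failures_df`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two rows copied from the failures output shown earlier
failures = pd.DataFrame({
    "machine_id": [1, 2],
    "failure_time": pd.to_datetime(["2023-01-17 17:00:00", "2023-01-27 18:00:00"]),
    "failure_type": ["overheat", "overheat"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data"
    out_dir.mkdir(parents=True, exist_ok=True)  # to_csv does not create folders
    path = out_dir / "failures.csv"
    failures.to_csv(path, index=False)

    naive = pd.read_csv(path)                                 # timestamps as text
    parsed = pd.read_csv(path, parse_dates=["failure_time"])  # timestamps restored

print(naive["failure_time"].dtype, parsed["failure_time"].dtype)
```

Passing `parse_dates` for `reading_time` and `failure_time` when loading the files in later notebooks avoids comparing strings with timestamps by accident.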


## Summary

This notebook generated a synthetic IoT dataset for predictive maintenance:

- Three months of hourly sensor data (January–March 2023, 2,160 hours)  
- 20 machines  
- 2–5 failures per machine  
- Linear 72-hour pre-failure drift in temperature, vibration, and pressure  
- Two exported CSV files (`sensor_readings.csv`, `failures.csv`)  

Next Notebook → `02_feature_engineering.ipynb`  
We will create rolling features, trends, and the 72-hour failure prediction target.