# 02 – Feature Engineering

**Goal:**  
Transform the raw machine sensor data from Week 1 into features suitable for machine learning:

- Rolling window statistics (6h, 12h, 24h)
- Trend/delta features
- Combining sensor data with the failure log
- Creating the target variable from that log: `failure_within_72h`
- Exporting a clean machine-learning dataset

This prepares the data for Week 3: Modeling.

In [1]:
import pandas as pd
import numpy as np

# Paths to the Week 1 data
sensor_path = "../data/sensor_readings.csv"
failures_path = "../data/failures.csv"

sensor_df = pd.read_csv(sensor_path, parse_dates=["reading_time"])
failures_df = pd.read_csv(failures_path, parse_dates=["failure_time"])

sensor_df.head(), failures_df.head()

(   machine_id        reading_time  temperature  vibration   pressure  \
 0           1 2023-01-01 00:00:00    60.993428   1.009931  29.872158   
 1           1 2023-01-01 01:00:00    59.723471   1.104616  28.475418   
 2           1 2023-01-01 02:00:00    61.295377   1.154790  29.787573   
 3           1 2023-01-01 03:00:00    63.046060   0.987813  29.386867   
 4           1 2023-01-01 04:00:00    59.531693   1.087580  31.336242   
 
      current          rpm  status_code  
 0   9.664418  1589.581888            0  
 1  10.343879  1522.293431            0  
 2   9.453091  1502.452725            0  
 3   9.340127  1549.795581            0  
 4  10.026050  1559.868060            0  ,
    machine_id        failure_time failure_type
 0           1 2023-01-17 17:00:00     overheat
 1           1 2023-01-26 12:00:00     overheat
 2           2 2023-01-27 18:00:00     overheat
 3           2 2023-02-08 12:00:00     overheat
 4           3 2023-01-28 23:00:00    vibration)

## 1. Sort Data for Time-Series Processing
We must sort the data by:
- machine_id
- reading_time

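Rolling windows and `diff` operate on row order within each machine, so the rows must be in chronological order per machine before any feature is computed.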

In [2]:
sensor_df = sensor_df.sort_values(["machine_id", "reading_time"]).reset_index(drop=True)
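A quick sanity check after sorting can confirm that each machine's timestamps are increasing and evenly spaced. The sketch below uses a small made-up frame (`toy`, with hypothetical values) in place of `sensor_df`, and assumes readings are expected once per hour, as the Week 1 sample suggests.

```python
import pandas as pd

# Toy stand-in for sensor_df (hypothetical values), deliberately unsorted
toy = pd.DataFrame({
    "machine_id": [2, 1, 1, 2, 1],
    "reading_time": pd.to_datetime([
        "2023-01-01 01:00", "2023-01-01 02:00", "2023-01-01 00:00",
        "2023-01-01 00:00", "2023-01-01 01:00",
    ]),
    "temperature": [70.0, 61.3, 61.0, 69.5, 59.7],
})

toy = toy.sort_values(["machine_id", "reading_time"]).reset_index(drop=True)

# Time gap between consecutive readings of the same machine
gaps = toy.groupby("machine_id")["reading_time"].diff().dropna()

# True if every machine's timestamps are in increasing order
is_sorted = bool(
    toy.groupby("machine_id")["reading_time"]
    .apply(lambda s: s.is_monotonic_increasing)
    .all()
)
# True if every gap is exactly one hour (assumed cadence)
is_hourly = bool((gaps == pd.Timedelta(hours=1)).all())

print("sorted per machine:", is_sorted)
print("hourly without gaps:", is_hourly)
```

If `is_hourly` comes back `False` on the real data, row-count windows such as `rolling(6)` no longer correspond to six hours.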


## 2. Create Rolling Window Features
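We compute rolling statistics of `temperature` and `vibration` for each machine.

Windows (counted in rows; the sample above shows one reading per hour, so they correspond to hours as long as there are no gaps):
- 6 hours
- 12 hours
- 24 hours

Statistics:
- Mean
- Standard deviation

Min / max are not computed in this notebook.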

In [3]:
rolling_windows = [6, 12, 24]  # window sizes in rows (hours, given hourly readings)
sensor_cols = {"temperature": "temp", "vibration": "vib"}  # column -> feature prefix
features = []

for w in rolling_windows:
    for col, prefix in sensor_cols.items():
        # Group by machine so a window never spans two machines
        rolled = sensor_df.groupby("machine_id")[col].rolling(w)
        # Drop the machine_id index level so the result aligns with sensor_df's index
        sensor_df[f"{prefix}_mean_{w}h"] = rolled.mean().reset_index(0, drop=True)
        sensor_df[f"{prefix}_std_{w}h"] = rolled.std().reset_index(0, drop=True)
        features += [f"{prefix}_mean_{w}h", f"{prefix}_std_{w}h"]

sensor_df[features].head()


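|   | temp_mean_6h | temp_std_6h | vib_mean_6h | vib_std_6h | temp_mean_12h | temp_std_12h | vib_mean_12h | vib_std_12h | temp_mean_24h | temp_std_24h | vib_mean_24h | vib_std_24h |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |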
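`rolling(w)` counts rows, so a "6h" window spans six hours only when a machine has exactly one reading per hour. If the data had gaps, a time-based window would be one alternative. The sketch below is an illustration on made-up data (the `toy` frame and its 3-hour window are assumptions, not part of the notebook) comparing the two approaches.

```python
import pandas as pd

# Toy data for one machine with a gap: readings at 03:00-05:00 are missing
toy = pd.DataFrame({
    "machine_id": [1] * 5,
    "reading_time": pd.to_datetime([
        "2023-01-01 00:00", "2023-01-01 01:00", "2023-01-01 02:00",
        "2023-01-01 06:00", "2023-01-01 07:00",
    ]),
    "temperature": [60.0, 62.0, 64.0, 70.0, 72.0],
})

# Row-count window: always the last 3 rows, however much time has passed
by_rows = (
    toy.groupby("machine_id")["temperature"]
    .rolling(3).mean()
    .reset_index(0, drop=True)
)

# Time-based window: only readings from the last 3 hours (needs a datetime index)
by_time = (
    toy.set_index("reading_time")
    .groupby("machine_id")["temperature"]
    .rolling("3h").mean()
    .reset_index(level=0, drop=True)
)

print(by_rows.tolist())
print(by_time.tolist())
```

At 06:00 the row-count window still averages the readings from 01:00 and 02:00, while the time-based window sees only the 06:00 reading. Note that time-based windows default to `min_periods=1`, so they do not produce the warm-up NaNs shown above.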


## 3. Create Trend (Delta) Features
These measure how quickly a sensor is increasing or decreasing.

We compute the change over:
- 1 hour
- 6 hours

In [4]:
# 1-hour deltas
sensor_df["temp_delta_1h"] = sensor_df.groupby("machine_id")["temperature"].diff(1)
sensor_df["vib_delta_1h"]  = sensor_df.groupby("machine_id")["vibration"].diff(1)

# 6-hour deltas
sensor_df["temp_delta_6h"] = sensor_df.groupby("machine_id")["temperature"].diff(6)
sensor_df["vib_delta_6h"]  = sensor_df.groupby("machine_id")["vibration"].diff(6)

sensor_df[["temp_delta_1h", "temp_delta_6h"]].head()

|   | temp_delta_1h | temp_delta_6h |
|---|---|---|
| 0 | NaN | NaN |
| 1 | -1.269957 | NaN |
| 2 | 1.571906 | NaN |
| 3 | 1.750683 | NaN |
| 4 | -3.514366 | NaN |
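To see why the deltas are computed per machine, here is a small sketch on made-up data (the `toy` frame is hypothetical). `groupby(...).diff(1)` subtracts the previous reading of the same machine, so the first row of each machine is NaN instead of a difference against another machine's last reading.

```python
import pandas as pd

# Two machines with three readings each (hypothetical values)
toy = pd.DataFrame({
    "machine_id": [1, 1, 1, 2, 2, 2],
    "temperature": [60.0, 62.5, 61.0, 80.0, 79.0, 83.0],
})

# Per-machine difference to the previous row
toy["temp_delta_1h"] = toy.groupby("machine_id")["temperature"].diff(1)

# Equivalent formulation: current value minus the per-machine shifted value
same_via_shift = toy["temperature"] - toy.groupby("machine_id")["temperature"].shift(1)

print(toy)
```

Without the `groupby`, row 3 would hold `80.0 - 61.0 = 19.0`, a jump that only reflects the switch from machine 1 to machine 2.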


## 4. Create the 72-Hour Failure Label
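We label every reading that falls in the 72 hours before a failure on the same machine:

`failure_within_72h = 1` if `failure_time - 72h <= reading_time < failure_time` for some failure of that machine; otherwise `0`.
A reading taken at the failure timestamp itself is not labeled by that failure.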


In [5]:
sensor_df["failure_within_72h"] = 0

for _, row in failures_df.iterrows():
    m_id = row["machine_id"]
    f_time = row["failure_time"]
    
    start = f_time - pd.Timedelta(hours=72)
    
    mask = (
        (sensor_df["machine_id"] == m_id) &
        (sensor_df["reading_time"] >= start) &
        (sensor_df["reading_time"] < f_time)
    )
    
    sensor_df.loc[mask, "failure_within_72h"] = 1

sensor_df["failure_within_72h"].value_counts()

failure_within_72h
0    38823
1     4377
Name: count, dtype: int64
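The loop above scans the full sensor table once per failure, which is fine for a short failure log. For larger logs, one possible alternative is `pd.merge_asof` with `direction="forward"`, which attaches the next failure of the same machine to each reading in a single pass. The sketch below runs on made-up frames (`sensors`, `failures` and their values are hypothetical) and is meant to mirror the same `failure_time - 72h <= reading_time < failure_time` rule.

```python
import pandas as pd

sensors = pd.DataFrame({
    "machine_id": [1] * 5 + [2] * 2,
    "reading_time": pd.to_datetime([
        "2023-01-01 11:00", "2023-01-01 12:00", "2023-01-04 11:00",
        "2023-01-04 12:00", "2023-01-05 00:00",
        "2023-01-01 12:00", "2023-01-04 00:00",
    ]),
})
failures = pd.DataFrame({
    "machine_id": [1],
    "failure_time": pd.to_datetime(["2023-01-04 12:00"]),
})

# merge_asof requires both frames to be sorted by their time key
sensors = sensors.sort_values("reading_time")
failures = failures.sort_values("failure_time")

merged = pd.merge_asof(
    sensors, failures,
    left_on="reading_time", right_on="failure_time",
    by="machine_id",             # only match failures of the same machine
    direction="forward",         # look for the next failure in time
    allow_exact_matches=False,   # strictly later, mirrors reading_time < failure_time
)

# Hours until the next failure; NaN when the machine has no later failure
hours_to_failure = (merged["failure_time"] - merged["reading_time"]) / pd.Timedelta(hours=1)
merged["failure_within_72h"] = (hours_to_failure <= 72).astype(int)

merged = merged.sort_values(["machine_id", "reading_time"]).reset_index(drop=True)
print(merged[["machine_id", "reading_time", "failure_within_72h"]])
```

Because only the nearest upcoming failure matters for a "within 72 hours" rule, checking that single failure is enough. Comparing the result against the loop's labels on the real data would be a sensible check before switching.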

## 5. Drop NaNs Created by Rolling Features
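Rolling windows and deltas create missing values at the start of each machine's timeline. The 24-row window needs 24 observations, so the first 23 rows of every machine contain NaNs.
We drop these rows. Note that `dropna()` without a `subset` drops a row if any column is missing, including the raw sensor columns.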

In [6]:
sensor_df_clean = sensor_df.dropna().reset_index(drop=True)
sensor_df_clean.head()

Unnamed: 0,machine_id,reading_time,temperature,vibration,pressure,current,rpm,status_code,temp_mean_6h,temp_std_6h,...,vib_std_12h,temp_mean_24h,temp_std_24h,vib_mean_24h,vib_std_24h,temp_delta_1h,vib_delta_1h,temp_delta_6h,vib_delta_6h,failure_within_72h
0,1,2023-01-01 23:00:00,57.150504,0.937167,30.24298,9.870475,1563.794434,0,59.187442,2.202235,...,0.106564,59.704723,1.947517,1.060162,0.089745,-2.984553,-0.133819,-3.477991,0.036166,0
1,1,2023-01-02 00:00:00,58.911235,1.12646,29.902485,10.259461,1503.926682,0,59.308655,2.155474,...,0.100237,59.617965,1.933944,1.065017,0.090061,1.760731,0.189293,0.727283,0.056084,0
2,1,2023-01-02 01:00:00,60.221845,0.997305,30.119151,9.905707,1437.663984,0,59.816397,1.895608,...,0.101074,59.638731,1.937798,1.060546,0.090671,1.310611,-0.129155,3.046453,0.035122,0
3,1,2023-01-02 02:00:00,57.698013,0.987546,29.613272,9.712392,1431.948007,0,58.944183,1.279613,...,0.102235,59.488841,1.943205,1.053577,0.089533,-2.523832,-0.009759,-5.233285,-0.284031,0
4,1,2023-01-02 03:00:00,60.751396,0.989948,30.191727,10.337799,1480.568773,0,59.144675,1.472871,...,0.099032,59.39323,1.812636,1.053666,0.089465,3.053383,0.002401,1.202949,-0.035349,0
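Since a bare `dropna()` removes rows with a missing value in any column, it can be worth checking where the NaNs come from before dropping. The sketch below uses a made-up frame (`toy`, the 2-row window, and the `engineered_cols` list are assumptions for illustration) to contrast dropping on all columns with dropping on the engineered columns only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "temperature": [60.0, 61.0, 62.0, 63.0],
    "pressure": [30.0, 29.5, np.nan, 30.2],        # gap in a raw sensor (hypothetical)
    "temp_mean_2h": [np.nan, 60.5, 61.5, 62.5],    # warm-up NaN from a 2-row window
})

# Per-column NaN counts show which columns cause rows to be dropped
nan_counts = toy.isna().sum()

# Hypothetical list of engineered columns to base the drop on
engineered_cols = ["temp_mean_2h"]

dropped_any = toy.dropna()                               # any NaN removes the row
dropped_warmup_only = toy.dropna(subset=engineered_cols)  # only warm-up rows removed

print(nan_counts)
print(len(dropped_any), len(dropped_warmup_only))
```

Whether rows with raw-sensor gaps should be kept (and imputed) or dropped is a modeling decision; the point here is only to make the choice explicit.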


## 6. Save Final Dataset
We save the machine-learning ready dataset as:

`dataset_ready_for_model.csv`

In [7]:
output_path = "../data/dataset_ready_for_model.csv"
sensor_df_clean.to_csv(output_path, index=False)

print("Saved dataset:", output_path)
print("Final shape:", sensor_df_clean.shape)


Saved dataset: ../data/dataset_ready_for_model.csv
Final shape: (42740, 25)
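CSV does not store column types, so `reading_time` comes back as plain text unless it is parsed again on load. The sketch below shows the round trip on a tiny made-up frame, using an in-memory buffer instead of a file (both are assumptions for illustration).

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "machine_id": [1, 1],
    "reading_time": pd.to_datetime(["2023-01-01 23:00", "2023-01-02 00:00"]),
    "failure_within_72h": [0, 1],
})

# Write to an in-memory buffer rather than disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reload without and with date parsing
buffer.seek(0)
plain = pd.read_csv(buffer)
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["reading_time"])

print(plain["reading_time"].dtype)
print(parsed["reading_time"].dtype)
```

When loading `dataset_ready_for_model.csv` later, passing `parse_dates=["reading_time"]` keeps time-based operations working as they did here.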


## Summary – Week 2 Completed

We successfully engineered features for predictive maintenance:

### Created Features:
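- Rolling mean and standard deviation of temperature and vibration (6h, 12h, 24h)
- 1h and 6h delta features for temperature and vibration
- 72-hour failure label (`failure_within_72h`)
- Clean dataset ready for modeling (rows with NaNs dropped)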

### Saved:
- `../data/dataset_ready_for_model.csv`

Next → **Week 3: Modeling**