# 🔧 Feature Engineering for Predictive Maintenance
## Creating 40+ Features from Sensor Data

**Objective:** Transform raw sensor data into powerful predictive features

**Strategy:**
1. Temporal Features (20+): Rolling stats, lags, trends
2. Sensor Interactions (10+): Ratios, products, differences
3. Domain Features (10+): Age, maintenance, operational patterns

**Based on EDA Findings:**
- Vibration is #1 predictor (r=0.65 with anomalies)
- Temperature shows degradation trend (+13.8% over 5 years)
- Clear temporal patterns (hourly, seasonal)
- Equipment age affects sensor values

---
## STEP 1: Setup & Data Loading
---

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported!")

✅ Libraries imported!


In [2]:
# Database connection (local development credentials; do not commit real passwords)
DB_CONFIG = {
    'host': 'localhost',
    'port': 5432,
    'database': 'weefarm_db',
    'user': 'postgres',
    'password': '0000'
}

engine = create_engine(
    f"postgresql://{DB_CONFIG['user']}:{DB_CONFIG['password']}@"
    f"{DB_CONFIG['host']}:{DB_CONFIG['port']}/{DB_CONFIG['database']}"
)

print("✅ Database connected!")

✅ Database connected!
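
The connection cell above hard-codes the password. A minimal sketch of one alternative is to read the settings from environment variables and URL-encode the password before building the connection string. The `WEEFARM_DB_*` variable names and the `build_postgres_url` helper are assumptions made for illustration, not part of the original notebook.

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(env=os.environ):
    """Assemble a PostgreSQL URL from environment variables (hypothetical names)."""
    user = env.get("WEEFARM_DB_USER", "postgres")
    # quote_plus escapes characters such as '@' or ' ' that would break the URL
    password = quote_plus(env.get("WEEFARM_DB_PASSWORD", ""))
    host = env.get("WEEFARM_DB_HOST", "localhost")
    port = env.get("WEEFARM_DB_PORT", "5432")
    database = env.get("WEEFARM_DB_NAME", "weefarm_db")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


# Example with an explicit mapping instead of the real environment
url = build_postgres_url({"WEEFARM_DB_PASSWORD": "p@ss word"})
print(url.replace("p%40ss+word", "***"))  # avoid echoing the secret
```

The resulting string can be passed to `create_engine` exactly as in the cell above.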


In [3]:
# Load data (sample for development)
print("Loading data...")

# Equipment data
df_equipment = pd.read_sql("SELECT * FROM equipment", engine)

# Sensor data: the first 200K rows in (equipment_id, timestamp) order,
# so the sample holds the readings of only the first few machines
query = """
    SELECT * FROM sensor_readings 
    ORDER BY equipment_id, timestamp
    LIMIT 200000
"""
df_sensors = pd.read_sql(query, engine)

print(f"✅ Equipment: {len(df_equipment)} records")
print(f"✅ Sensors: {len(df_sensors):,} records")
print(f"✅ Equipment in sample: {df_sensors['equipment_id'].nunique()}")

Loading data...
✅ Equipment: 300 records
✅ Sensors: 200,000 records
✅ Equipment in sample: 7
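
Because the query orders by `equipment_id` before applying `LIMIT`, the 200,000 rows come from only 7 of the 300 machines. If a broader development sample is wanted, one option is to pick a random subset of machines and load their complete histories. The sketch below only builds the query text; the `sample_equipment_query` helper is hypothetical, and it assumes `equipment_id` is an integer column.

```python
import random


def sample_equipment_query(equipment_ids, n_machines, seed=42):
    """Pick n_machines at random and build a query for their full histories."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    chosen = sorted(rng.sample(list(equipment_ids), n_machines))
    # int() ensures only numeric literals end up in the SQL text
    id_list = ", ".join(str(int(i)) for i in chosen)
    query = (
        "SELECT * FROM sensor_readings "
        f"WHERE equipment_id IN ({id_list}) "
        "ORDER BY equipment_id, timestamp"
    )
    return chosen, query


chosen, query = sample_equipment_query(range(1, 301), n_machines=20)
print(len(chosen))
print(query[:60])
```

In the notebook, `equipment_ids` could come from `df_equipment['equipment_id']`, and the query would be passed to `pd.read_sql` as before.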


In [4]:
# Prepare data
df_sensors['timestamp'] = pd.to_datetime(df_sensors['timestamp'])
df_sensors = df_sensors.sort_values(['equipment_id', 'timestamp'])

# Merge with equipment data
df = df_sensors.merge(
    df_equipment[['equipment_id', 'equipment_type', 'purchase_date', 'operating_hours']], 
    on='equipment_id'
)

df['purchase_date'] = pd.to_datetime(df['purchase_date'])

print(f"✅ Merged dataset: {df.shape}")
print(f"✅ Columns: {len(df.columns)}")

✅ Merged dataset: (200000, 25)
✅ Columns: 25
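
The merge above relies on the default inner join, which silently drops sensor rows whose `equipment_id` has no match and duplicates rows if the equipment table ever holds repeated IDs. A small sketch on toy data (the frames below are made up for illustration) shows how `validate` and `indicator` can surface both problems:

```python
import pandas as pd

sensors = pd.DataFrame({
    "equipment_id": [1, 1, 2, 3],
    "temperature": [70.0, 71.5, 68.2, 75.1],
})
equipment = pd.DataFrame({
    "equipment_id": [1, 2],
    "equipment_type": ["tractor", "harvester"],
})

merged = sensors.merge(
    equipment,
    on="equipment_id",
    how="left",              # keep every sensor row
    validate="many_to_one",  # raise if equipment_id repeats in `equipment`
    indicator=True,          # adds a `_merge` column describing each row's match
)

# Machines present in the sensor data but missing from the equipment table
unmatched = merged.loc[merged["_merge"] == "left_only", "equipment_id"].unique().tolist()
print(len(sensors), len(merged), unmatched)
```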


---
## STEP 2: Temporal Features (Rolling Statistics)
---

In [5]:
print("Creating rolling statistics features...")
print("This may take a few minutes...\n")

# Key sensors for rolling features (based on EDA)
key_sensors = ['temperature', 'vibration', 'oil_pressure', 'battery_voltage']

# Rolling windows, counted in rows per machine
# (they equal 1 day, 7 days, 30 days only if each machine reports once per hour)
windows = [24, 168, 720]

feature_count = 0

for sensor in key_sensors:
    for window in windows:
        window_name = f"{window}h" if window < 168 else f"{window//24}d"
        
        # Rolling mean
        df[f'{sensor}_rolling_mean_{window_name}'] = (
            df.groupby('equipment_id')[sensor]
            .transform(lambda x: x.rolling(window=window, min_periods=1).mean())
        )
        
        # Rolling std
        df[f'{sensor}_rolling_std_{window_name}'] = (
            df.groupby('equipment_id')[sensor]
            .transform(lambda x: x.rolling(window=window, min_periods=1).std())
        )
        
        feature_count += 2
        print(f"✅ {sensor} - {window_name} window (mean, std)")

print(f"\n✅ Created {feature_count} rolling features!")

Creating rolling statistics features...
This may take a few minutes...

✅ temperature - 24h window (mean, std)
✅ temperature - 7d window (mean, std)
✅ temperature - 30d window (mean, std)
✅ vibration - 24h window (mean, std)
✅ vibration - 7d window (mean, std)
✅ vibration - 30d window (mean, std)
✅ oil_pressure - 24h window (mean, std)
✅ oil_pressure - 7d window (mean, std)
✅ oil_pressure - 30d window (mean, std)
✅ battery_voltage - 24h window (mean, std)
✅ battery_voltage - 7d window (mean, std)
✅ battery_voltage - 30d window (mean, std)

✅ Created 24 rolling features!
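
`rolling(window=24)` counts rows, so the windows above span 24 hours only when every machine reports exactly once per hour. If readings can be irregular or missing, a time-based window may be safer. The following is a minimal sketch on toy data, not the notebook's method; the `add_time_rolling_mean` helper is hypothetical.

```python
import pandas as pd


def add_time_rolling_mean(df, sensor, hours):
    """Per-machine rolling mean over a time span instead of a row count."""
    out = df.sort_values(["equipment_id", "timestamp"]).reset_index(drop=True)
    rolled = (
        out.set_index("timestamp")
        .groupby("equipment_id")[sensor]
        .rolling(pd.Timedelta(hours=hours), min_periods=1)
        .mean()
    )
    # The result is ordered by equipment_id, then timestamp, which matches
    # the sort order of `out`, so positional assignment is safe here.
    out[f"{sensor}_rolling_mean_{hours}h"] = rolled.to_numpy()
    return out


toy = pd.DataFrame({
    "equipment_id": [1, 1, 1, 2],
    "timestamp": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 12:00",
        "2024-01-02 13:00", "2024-01-01 00:00",
    ]),
    "temperature": [10.0, 20.0, 30.0, 50.0],
})
result = add_time_rolling_mean(toy, "temperature", hours=24)
print(result["temperature_rolling_mean_24h"].tolist())
```

For the third reading, the previous one is 25 hours old and falls outside the 24-hour window, whereas a 24-row window would still include it.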


---
## STEP 3: Lag Features
---

In [6]:
print("Creating lag features...\n")

# Lag periods, counted in rows per machine
# (1 hour, 1 day, 7 days if each machine reports once per hour)
lags = [1, 24, 168]

feature_count = 0

for sensor in key_sensors:
    for lag in lags:
        lag_name = f"{lag}h" if lag < 24 else f"{lag//24}d"
        
        df[f'{sensor}_lag_{lag_name}'] = (
            df.groupby('equipment_id')[sensor]
            .shift(lag)
        )
        
        feature_count += 1
        print(f"✅ {sensor} - lag {lag_name}")

print(f"\n✅ Created {feature_count} lag features!")

Creating lag features...

✅ temperature - lag 1h
✅ temperature - lag 1d
✅ temperature - lag 7d
✅ vibration - lag 1h
✅ vibration - lag 1d
✅ vibration - lag 7d
✅ oil_pressure - lag 1h
✅ oil_pressure - lag 1d
✅ oil_pressure - lag 7d
✅ battery_voltage - lag 1h
✅ battery_voltage - lag 1d
✅ battery_voltage - lag 7d

✅ Created 12 lag features!
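
Shifting inside `groupby('equipment_id')` keeps one machine's readings from leaking into another's lags, at the cost of `lag` missing values at the start of each machine's series. A toy check (made-up values) illustrates both effects:

```python
import pandas as pd

toy = pd.DataFrame({
    "equipment_id": [1] * 5 + [2] * 5,
    "vibration": [0.1, 0.2, 0.3, 0.4, 0.5, 1.1, 1.2, 1.3, 1.4, 1.5],
})

lag = 2
toy["vibration_lag"] = toy.groupby("equipment_id")["vibration"].shift(lag)

# Each machine loses its first `lag` rows: lag * number_of_machines in total
n_missing = int(toy["vibration_lag"].isna().sum())
print(n_missing)
print(toy.loc[5:7, "vibration_lag"].tolist())  # machine 2 starts with NaN, not machine 1's values
```

The same arithmetic explains the missing-value counts reported later in this notebook: 7 machines × 168 rows = 1176 for the 7d lags and 7 × 24 = 168 for the 1d lags.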


---
## STEP 4: Trend Features
---

In [7]:
print("Creating trend features...\n")

feature_count = 0

for sensor in key_sensors:
    # Difference from 24h ago (daily change)
    df[f'{sensor}_change_24h'] = (
        df[sensor] - df.groupby('equipment_id')[sensor].shift(24)
    )
    
    # Difference from 7d ago (weekly change)
    df[f'{sensor}_change_7d'] = (
        df[sensor] - df.groupby('equipment_id')[sensor].shift(168)
    )
    
    # Is increasing? (compared to 24h ago)
    # Note: NaN > 0 evaluates to False, so rows without a 24h history get 0
    df[f'{sensor}_is_increasing'] = (
        df[f'{sensor}_change_24h'] > 0
    ).astype(int)
    
    feature_count += 3
    print(f"✅ {sensor} - trend features (change_24h, change_7d, is_increasing)")

print(f"\n✅ Created {feature_count} trend features!")

Creating trend features...

✅ temperature - trend features (change_24h, change_7d, is_increasing)
✅ vibration - trend features (change_24h, change_7d, is_increasing)
✅ oil_pressure - trend features (change_24h, change_7d, is_increasing)
✅ battery_voltage - trend features (change_24h, change_7d, is_increasing)

✅ Created 12 trend features!
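
In the cell above, rows with no 24h history have a missing `change_24h`, and the comparison `NaN > 0` turns them into `is_increasing = 0`, which is indistinguishable from "not increasing". If that distinction matters for the model, one possible variant keeps the flag missing where the change is missing. The `direction_flag` helper below is a hypothetical sketch:

```python
import pandas as pd


def direction_flag(change):
    """1.0 if the change is positive, 0.0 otherwise, NaN where the change is unknown."""
    return (change > 0).astype(float).where(change.notna())


change = pd.Series([float("nan"), -0.5, 0.0, 1.2])
flag = direction_flag(change)
print(flag.tolist())
```

The flag becomes a float column, so it carries its own missing values into the missing-value step instead of hiding them.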


---
## STEP 5: Sensor Interaction Features
---

In [8]:
print("Creating sensor interaction features...\n")

# Based on EDA correlations

# The small constants added to denominators (+0.01, +1) guard against division by zero

# Temperature × Vibration (both correlate with failures)
df['temp_vibration_product'] = df['temperature'] * df['vibration']
df['temp_vibration_ratio'] = df['temperature'] / (df['vibration'] + 0.01)
print("✅ Temperature × Vibration features")

# Temperature difference (engine - coolant)
df['temp_coolant_diff'] = df['temperature'] - df['coolant_temperature']
print("✅ Temperature - Coolant difference")

# Pressure / Temperature ratio
df['pressure_temp_ratio'] = df['oil_pressure'] / (df['temperature'] + 1)
print("✅ Pressure / Temperature ratio")

# Fuel consumed per unit of engine load (a higher value means lower efficiency)
df['fuel_efficiency'] = df['fuel_consumption'] / (df['engine_load'] + 1)
print("✅ Fuel efficiency")

# Speed per RPM (transmission efficiency)
df['speed_per_rpm'] = df['gps_speed'] / (df['rpm'] + 1)
print("✅ Speed per RPM")

# Tire pressure difference (front - rear)
df['tire_pressure_diff'] = df['tire_pressure_front'] - df['tire_pressure_rear']
print("✅ Tire pressure difference")

# Hydraulic load indicator
df['hydraulic_load_ratio'] = df['hydraulic_pressure'] / (df['engine_load'] + 1)
print("✅ Hydraulic load ratio")

print("\n✅ Created 8 interaction features!")

Creating sensor interaction features...

✅ Temperature × Vibration features
✅ Temperature - Coolant difference
✅ Pressure / Temperature ratio
✅ Fuel efficiency
✅ Speed per RPM
✅ Tire pressure difference
✅ Hydraulic load ratio

✅ Created 8 interaction features!
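
The ratios above add a constant to each denominator to avoid dividing by zero. That keeps the column finite, but it also shifts every ratio slightly and produces very large values when the denominator is near zero (an idle engine, for example). An alternative sketch, assuming missing values are acceptable downstream, marks those rows as missing instead. The `safe_ratio` helper and its `min_abs` threshold are assumptions for illustration.

```python
import pandas as pd


def safe_ratio(numerator, denominator, min_abs=1e-9):
    """Divide two Series, returning NaN where the denominator is (near) zero."""
    denom = denominator.where(denominator.abs() > min_abs)  # near-zero -> NaN
    return numerator / denom


speed = pd.Series([10.0, 5.0, 0.0])
rpm = pd.Series([2.0, 0.0, 4.0])
ratio = safe_ratio(speed, rpm)
print(ratio.tolist())
```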


---
## STEP 6: Domain-Specific Features
---

In [9]:
print("Creating domain-specific features...\n")

# Equipment age (in days)
df['equipment_age_days'] = (df['timestamp'] - df['purchase_date']).dt.days
df['equipment_age_years'] = df['equipment_age_days'] / 365.25
print("✅ Equipment age (days, years)")

# Age category
# Note: pd.cut uses right-closed bins by default, so an age of exactly 0
# falls outside (0, 1] and becomes NaN unless include_lowest=True is passed
df['age_category'] = pd.cut(
    df['equipment_age_years'],
    bins=[0, 1, 2, 3, 4, 100],
    labels=['new', 'young', 'mid', 'old', 'very_old']
)
print("✅ Age category")

# Time features
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['month'] = df['timestamp'].dt.month
df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)
print("✅ Time features (hour, day_of_week, month, is_weekend)")

# Work hours indicator (hours 6 through 18 inclusive, i.e. 06:00-18:59; based on EDA)
df['is_work_hours'] = ((df['hour'] >= 6) & (df['hour'] <= 18)).astype(int)
print("✅ Work hours indicator")

# Season (Tunisia: Dec-Feb=winter, Mar-May=spring, Jun-Aug=summer, Sep-Nov=fall)
df['season'] = df['month'].map({
    12: 'winter', 1: 'winter', 2: 'winter',
    3: 'spring', 4: 'spring', 5: 'spring',
    6: 'summer', 7: 'summer', 8: 'summer',
    9: 'fall', 10: 'fall', 11: 'fall'
})
print("✅ Season")

# Operating status (based on RPM and engine load)
df['is_operating'] = (df['rpm'] > 500).astype(int)
df['is_idle'] = (df['rpm'] < 100).astype(int)
df['is_high_load'] = (df['engine_load'] > 60).astype(int)
print("✅ Operating status (operating, idle, high_load)")

# Note: this cell adds 12 columns; the hard-coded 13 below overcounts by one
print("\n✅ Created 13 domain features!")

Creating domain-specific features...

✅ Equipment age (days, years)
✅ Age category
✅ Time features (hour, day_of_week, month, is_weekend)
✅ Work hours indicator
✅ Season
✅ Operating status (operating, idle, high_load)

✅ Created 13 domain features!
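
With the default right-closed bins, `pd.cut` assigns no category to an age of exactly 0, which is consistent with the minimum `equipment_age_years` of 0.0 and the 144 missing `age_category` values reported later in this notebook. A small sketch on made-up ages shows the effect of `include_lowest=True`:

```python
import pandas as pd

ages = pd.Series([0.0, 0.5, 1.0, 3.2, 7.0])
bins = [0, 1, 2, 3, 4, 100]
labels = ["new", "young", "mid", "old", "very_old"]

# Default: intervals are (0, 1], (1, 2], ... so 0.0 is left uncategorised
default_cut = pd.cut(ages, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left as well
inclusive_cut = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

print(int(default_cut.isna().sum()), int(inclusive_cut.isna().sum()))
print(inclusive_cut.tolist())
```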


---
## STEP 7: Feature Summary
---

In [10]:
# Count features
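original_sensors = 18
rolling_features = 24  # 4 sensors × 3 windows × 2 stats
lag_features = 12  # 4 sensors × 3 lags
trend_features = 12  # 4 sensors × 3 trends
interaction_features = 8
# Note: Step 6 adds 12 columns, not 13; the shape reported below
# (25 -> 93 columns) corresponds to 68 new features
domain_features = 13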

total_new_features = (
    rolling_features + lag_features + trend_features + 
    interaction_features + domain_features
)

print("="*70)
print("📊 FEATURE ENGINEERING SUMMARY")
print("="*70)
print(f"\n📌 Original Features:")
print(f"   - Sensors: {original_sensors}")
print(f"   - Equipment info: 2 (type, operating_hours)")
print(f"   - Total original: {original_sensors + 2}")

print(f"\n✨ New Features Created:")
print(f"   - Rolling statistics: {rolling_features}")
print(f"   - Lag features: {lag_features}")
print(f"   - Trend features: {trend_features}")
print(f"   - Interaction features: {interaction_features}")
print(f"   - Domain features: {domain_features}")
print(f"   - Total new: {total_new_features}")

print(f"\n🎯 Final Feature Count: {original_sensors + 2 + total_new_features}")
print(f"\n📊 Dataset shape: {df.shape}")
print("="*70)

📊 FEATURE ENGINEERING SUMMARY

📌 Original Features:
   - Sensors: 18
   - Equipment info: 2 (type, operating_hours)
   - Total original: 20

✨ New Features Created:
   - Rolling statistics: 24
   - Lag features: 12
   - Trend features: 12
   - Interaction features: 8
   - Domain features: 13
   - Total new: 69

🎯 Final Feature Count: 89

📊 Dataset shape: (200000, 93)


In [11]:
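# List all new features by substring matching
# Caveats: 'ratio' also matches the original 'vibration' column, 'age' also
# matches the original 'battery_voltage' column, and 'speed_per_rpm' matches
# none of the patterns, so it is left out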
new_feature_cols = [col for col in df.columns if any([
    'rolling' in col,
    'lag' in col,
    'change' in col,
    'increasing' in col,
    'product' in col,
    'ratio' in col,
    'diff' in col,
    'efficiency' in col,
    'age' in col,
    col in ['hour', 'day_of_week', 'month', 'is_weekend', 'is_work_hours', 
            'season', 'is_operating', 'is_idle', 'is_high_load']
])]

print(f"\n📋 New Features ({len(new_feature_cols)}):")
for i, col in enumerate(sorted(new_feature_cols), 1):
    print(f"   {i}. {col}")


📋 New Features (69):
   1. age_category
   2. battery_voltage
   3. battery_voltage_change_24h
   4. battery_voltage_change_7d
   5. battery_voltage_is_increasing
   6. battery_voltage_lag_1d
   7. battery_voltage_lag_1h
   8. battery_voltage_lag_7d
   9. battery_voltage_rolling_mean_24h
   10. battery_voltage_rolling_mean_30d
   11. battery_voltage_rolling_mean_7d
   12. battery_voltage_rolling_std_24h
   13. battery_voltage_rolling_std_30d
   14. battery_voltage_rolling_std_7d
   15. day_of_week
   16. equipment_age_days
   17. equipment_age_years
   18. fuel_efficiency
   19. hour
   20. hydraulic_load_ratio
   21. is_high_load
   22. is_idle
   23. is_operating
   24. is_weekend
   25. is_work_hours
   26. month
   27. oil_pressure_change_24h
   28. oil_pressure_change_7d
   29. oil_pressure_is_increasing
   30. oil_pressure_lag_1d
   31. oil_pressure_lag_1h
   32. oil_pressure_lag_7d
   33. oil_pressure_rolling_mean_24h
   34. oil_pressure_rolling_mean_30d
   35. oil_pressure_
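
Substring matching is fragile here: `'ratio'` is contained in `vibration`, `'age'` is contained in `battery_voltage` (item 2 in the list above), and `speed_per_rpm` matches no pattern at all. A more robust sketch, assuming the column list is snapshotted right after the merge in Step 1, is to take the difference between the current and the original columns. The column names below are a small illustrative subset.

```python
# Columns as they were right after the merge (illustrative subset);
# in the notebook this would be `original_cols = list(df.columns)` after Step 1.
original_cols = ["equipment_id", "vibration", "battery_voltage", "rpm", "gps_speed"]

# Columns after feature engineering
all_cols = original_cols + ["vibration_lag_1h", "speed_per_rpm", "equipment_age_days"]

# Substring approach: picks up two original columns and misses speed_per_rpm
substring_match = [c for c in all_cols if any(k in c for k in ("lag", "ratio", "age"))]

# Set-difference approach: exactly the columns added since the snapshot
original_set = set(original_cols)
by_difference = [c for c in all_cols if c not in original_set]

print(substring_match)
print(by_difference)
```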

---
## STEP 8: Handle Missing Values (from lag/rolling)
---

In [12]:
# Check missing values
missing = df[new_feature_cols].isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)

if len(missing) > 0:
    print("⚠️ Missing values in new features:")
    print(missing)
    print(f"\n💡 These are expected (lag/rolling at start of series)")
    print(f"💡 Will be handled during model training (drop the warm-up rows or impute; forward fill cannot fill leading gaps)")
else:
    print("✅ No missing values!")

⚠️ Missing values in new features:
temperature_lag_7d                 1176
temperature_change_7d              1176
vibration_change_7d                1176
battery_voltage_change_7d          1176
oil_pressure_change_7d             1176
oil_pressure_lag_7d                1176
battery_voltage_lag_7d             1176
vibration_lag_7d                   1176
oil_pressure_lag_1d                 168
temperature_change_24h              168
temperature_lag_1d                  168
vibration_lag_1d                    168
battery_voltage_change_24h          168
vibration_change_24h                168
battery_voltage_lag_1d              168
oil_pressure_change_24h             168
age_category                        144
temperature_rolling_std_24h           7
vibration_lag_1h                      7
battery_voltage_rolling_std_30d       7
temperature_lag_1h                    7
battery_voltage_rolling_std_24h       7
battery_voltage_rolling_std_7d        7
oil_pressure_rolling_std_7d           7
o
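
The lag and change features are missing at the start of each machine's series, so there is no earlier value for a forward fill to carry forward. A toy sketch (made-up values) compares forward filling with simply dropping the warm-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "equipment_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "temperature": [70.0, 71.0, 72.0, 73.0, 60.0, 61.0, 62.0, 63.0],
})
toy["temperature_lag_2"] = toy.groupby("equipment_id")["temperature"].shift(2)
lag_cols = ["temperature_lag_2"]

# Forward fill within each machine: leading NaNs have nothing to copy from
ffilled = toy.groupby("equipment_id")[lag_cols].ffill()
still_missing = int(ffilled["temperature_lag_2"].isna().sum())

# Dropping the warm-up rows removes them instead
trimmed = toy.dropna(subset=lag_cols)

print(still_missing, len(trimmed))
```

Dropping costs at most the longest lag per machine (168 rows here), which is small relative to 200,000 rows.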

---
## STEP 9: Save Engineered Features
---

In [13]:
# Save to CSV for later use
output_file = '../data/features_engineered_sample.csv'

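# Select important columns
# Note: new_feature_cols already contains 'vibration' and 'battery_voltage',
# so both appear twice in cols_to_save and in the CSV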
cols_to_save = [
    'equipment_id', 'timestamp', 'equipment_type',
    # Original sensors
    'temperature', 'vibration', 'oil_pressure', 'rpm',
    'fuel_consumption', 'engine_load', 'battery_voltage',
    # Target
    'is_anomaly'
] + new_feature_cols

df[cols_to_save].to_csv(output_file, index=False)

print(f"✅ Features saved to: {output_file}")
print(f"✅ Columns saved: {len(cols_to_save)}")
print(f"✅ Rows saved: {len(df):,}")

✅ Features saved to: ../data/features_engineered_sample.csv
✅ Columns saved: 80
✅ Rows saved: 200,000
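
The 80 saved columns are the 11 listed explicitly plus the 69 entries of `new_feature_cols`, and `vibration` and `battery_voltage` occur in both lists. Selecting a DataFrame with a repeated label returns the column twice, so the CSV ends up with duplicate headers. A minimal sketch of an order-preserving de-duplication, using shortened illustrative lists:

```python
base_cols = ["equipment_id", "timestamp", "vibration", "battery_voltage", "is_anomaly"]
feature_cols = ["battery_voltage", "vibration", "vibration_lag_1h"]

# dict.fromkeys keeps the first occurrence of each name and preserves order
cols_to_save = list(dict.fromkeys(base_cols + feature_cols))

print(cols_to_save)
```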


---
## STEP 10: Feature Statistics
---

In [14]:
# Show statistics for some key new features
key_new_features = [
    'temperature_rolling_mean_7d',
    'vibration_rolling_std_7d',
    'temperature_change_24h',
    'temp_vibration_product',
    'equipment_age_years'
]

print("📊 Statistics for Key New Features:")
print("="*70)
print(df[key_new_features].describe())

📊 Statistics for Key New Features:
       temperature_rolling_mean_7d  vibration_rolling_std_7d  \
count                200000.000000             199993.000000   
mean                     71.996417                  1.332599   
std                       9.511721                  0.164772   
min                      33.210000                  0.000000   
25%                      64.975610                  1.218585   
50%                      71.648452                  1.338348   
75%                      78.373661                  1.447991   
max                      93.206012                  3.387041   

       temperature_change_24h  temp_vibration_product  equipment_age_years  
count           199832.000000           200000.000000        200000.000000  
mean                 0.021253              235.479810             1.798189  
std                  6.751245              201.396257             1.192072  
min                -40.720000               44.525000             0.000000  


---
## ✅ FEATURE ENGINEERING COMPLETE!

### **What We Created:**
- ✅ 24 Rolling statistics (mean, std)
- ✅ 12 Lag features (1h, 1d, 7d)
- ✅ 12 Trend features (changes, directions)
- ✅ 8 Sensor interactions (ratios, products, differences)
- ✅ 12 Domain features (age, time, operational)

### **Total: 68 new features!**

The summary cell in Step 7 reports 69 because it hard-codes 13 domain features; Step 6 creates 12, and the dataset grew from 25 to 93 columns.

### **Next Steps:**
1. Feature Selection (select top 20-25)
2. Model Comparison (test 10-12 algorithms)
3. SHAP Analysis (interpretability)

---