In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Topic: EX2 - Turbofan RUL Prediction
**Task**: Predict the remaining useful life (RUL) of turbofan engines based on given sensor data (time series data). It is a regression problem.
**Data**: Turbofan engine degradation simulation data (NASA) - [Link](https://data.nasa.gov/dataset/Turbofan-Engine-Degradation-Simulation-Data-Set/vrks-gjie). See also the topic [introduction notebook](https://github.com/nina-prog/damage-propagation-modeling/blob/2fb8c1a1102a48d7abbf04e4031807790a913a99/notebooks/Turbofan%20remaining%20useful%20life%20Prediction.ipynb).

**Subtasks**:
1. Perform a deep **exploratory data analysis (EDA)** on the given data.
2. Implement a more efficient **sliding window method** for time series data analysis.
3. Apply **traditional machine learning methods** (SOTA) to predict the remaining useful life. Includes data preparation, feature extraction, feature selection, model selection, and model parameter optimization. -> 🎯 **Focus on this task**: data preparation and feature selection (feature extraction is part of the sliding window method).
4. Create **neural network models** to predict the remaining useful life. Includes different architectures like Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), or Attention Models. Note: You can search for SOTA research papers and reproduce current state-of-the-art models.


# Imports + Settings

In [22]:
# standard library and third-party libraries
import pandas as pd
import numpy as np
import os
from typing import List, Union
import time
from tqdm.notebook import tqdm
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

In [28]:
# source code
from src.utils import load_data, load_config
from src.data_preprocessing import create_rolling_windows_datasets
from src.data_cleaning import identify_missing_values, identify_single_unique_features

In [5]:
# settings
sns.set_style("whitegrid")
sns.set_palette("Set2")
sns.set(rc={"figure.dpi":100, 'savefig.dpi':200})
sns.set_context('notebook')

In [6]:
np.random.seed(42)

# Paths

In [7]:
os.chdir("../") # set working directory to root of project
#os.getcwd() # check current working directory

In [8]:
PATH_TO_CONFIG = "configs/config.yaml"

# Load config + Data

In [16]:
config = load_config(PATH_TO_CONFIG) # config is dict

In [10]:
%%time
train_data, test_data, test_RUL_data = load_data(config_path=PATH_TO_CONFIG, dataset_num=1)

2024-05-13 17:26:06 [src.utils:56] [INFO] >>>> Loading data set 1...
2024-05-13 17:26:06 [src.utils:85] [INFO] >>>> Loaded raw data for dataset 1.
2024-05-13 17:26:06 [src.utils:86] [INFO] >>>> Train Data: (20631, 26)
2024-05-13 17:26:06 [src.utils:87] [INFO] >>>> Test Data: (13096, 26)
2024-05-13 17:26:06 [src.utils:88] [INFO] >>>> Test RUL Data: (100, 1)
CPU times: total: 62.5 ms
Wall time: 325 ms


In [17]:
%%time
X_train, y_train, X_test, y_test = create_rolling_windows_datasets(train_data, test_data, test_RUL_data, column_id="UnitNumber", column_sort="Cycle", max_timeshift=config["preprocessing"]["max_window_size"], min_timeshift=config["preprocessing"]["min_window_size"])

2024-05-13 17:39:27 [src.data_preprocessing:61] [INFO] >>>> Creating rolling windows for train data...


Rolling: 100%|██████████| 37/37 [00:05<00:00,  6.35it/s]


2024-05-13 17:39:33 [src.data_preprocessing:65] [INFO] >>>> Extracting features for train data...


Feature Extraction: 100%|██████████| 40/40 [01:08<00:00,  1.70s/it]


2024-05-13 17:40:55 [src.data_preprocessing:73] [INFO] >>>> Calculating target for train data...
2024-05-13 17:40:55 [src.data_preprocessing:80] [INFO] >>>> Creating rolling windows for test data...


Rolling: 100%|██████████| 38/38 [00:04<00:00,  9.43it/s]


2024-05-13 17:40:59 [src.data_preprocessing:86] [INFO] >>>> Extracting features for test data...


Feature Extraction: 100%|██████████| 40/40 [00:02<00:00, 14.64it/s]


2024-05-13 17:41:02 [src.data_preprocessing:94] [INFO] >>>> Matching target index with test data...
2024-05-13 17:41:02 [src.data_preprocessing:98] [INFO] >>>> Datasets created successfully.
2024-05-13 17:41:02 [src.data_preprocessing:99] [INFO] >>>> Shape of X_train: (20131, 240)
2024-05-13 17:41:02 [src.data_preprocessing:100] [INFO] >>>> Shape of y_train: (20131, 1)
2024-05-13 17:41:02 [src.data_preprocessing:101] [INFO] >>>> Shape of X_test: (100, 240)
2024-05-13 17:41:02 [src.data_preprocessing:102] [INFO] >>>> Shape of y_test: (100, 1)
CPU times: total: 27.3 s
Wall time: 1min 35s
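
The `create_rolling_windows_datasets` call above turns each engine's time series into one feature row per window. Its implementation is not shown in this notebook, so the following is only a minimal pandas sketch of the general idea, assuming a single fixed window length and three aggregates. The function name `rolling_window_features`, its parameters `window` and `min_periods`, and the toy column `s1` are hypothetical and are not the project's API. The output column names follow the `<column>__<statistic>` pattern visible in the extracted features.

```python
import pandas as pd


def rolling_window_features(df, column_id, column_sort, value_cols, window, min_periods):
    """Aggregate each unit's most recent rows into mean/maximum/minimum features.

    Hypothetical helper for illustration only, not the project's implementation.
    """
    df = df.sort_values([column_id, column_sort])
    # Rolling windows are computed per unit, so they never span two engines.
    roll = df.groupby(column_id)[value_cols].rolling(window=window, min_periods=min_periods)
    feats = pd.concat(
        {"mean": roll.mean(), "maximum": roll.max(), "minimum": roll.min()},
        axis=1,
    )
    # Flatten the (statistic, column) MultiIndex into "<column>__<statistic>" names.
    feats.columns = [f"{col}__{stat}" for stat, col in feats.columns]
    # Windows with fewer than `min_periods` rows contain NaN and are dropped.
    return feats.dropna()


# Toy data: two engines with four cycles each and one sensor column.
toy_ts = pd.DataFrame({
    "UnitNumber": [1, 1, 1, 1, 2, 2, 2, 2],
    "Cycle": [1, 2, 3, 4, 1, 2, 3, 4],
    "s1": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})
toy_feats = rolling_window_features(
    toy_ts, "UnitNumber", "Cycle", ["s1"], window=3, min_periods=2
)
print(toy_feats)
```

With `min_periods=2`, the first cycle of each engine yields no feature row, which is one way a windowed dataset can end up with fewer rows than the raw data.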


In [None]:
# load saved preprocessed data from pickle
#X_train = pd.read_pickle("data/processed/ex2_X_train_20240512-155504.pkl")
#y_train = pd.read_pickle("data/processed/ex2_y_train_20240512-155504.pkl")
#X_test = pd.read_pickle("data/processed/ex2_X_test_20240512-155504.pkl")
#y_test = pd.read_pickle("data/processed/ex2_y_test_20240512-155504.pkl")

# Data Preprocessing

## 1. Clean Data


In [29]:
missing_features = identify_missing_values(X_train, threshold=0.1)
missing_features

2024-05-13 19:07:32 [src.data_cleaning:28] [INFO] >>>> Found 0 features with missing values above the threshold of 0.1.


[]

In [30]:
single_unique_features = identify_single_unique_features(X_train)
single_unique_features

2024-05-13 19:07:34 [src.data_cleaning:48] [INFO] >>>> Found 45 features with only a single unique value.


['Operation Setting 3__median',
 'Operation Setting 3__mean',
 'Operation Setting 3__standard_deviation',
 'Operation Setting 3__variance',
 'Operation Setting 3__root_mean_square',
 'Operation Setting 3__maximum',
 'Operation Setting 3__absolute_maximum',
 'Operation Setting 3__minimum',
 'Sensor Measure 1__median',
 'Sensor Measure 1__mean',
 'Sensor Measure 1__standard_deviation',
 'Sensor Measure 1__variance',
 'Sensor Measure 1__maximum',
 'Sensor Measure 1__absolute_maximum',
 'Sensor Measure 1__minimum',
 'Sensor Measure 5__median',
 'Sensor Measure 5__maximum',
 'Sensor Measure 5__absolute_maximum',
 'Sensor Measure 5__minimum',
 'Sensor Measure 6__maximum',
 'Sensor Measure 6__absolute_maximum',
 'Sensor Measure 10__median',
 'Sensor Measure 10__maximum',
 'Sensor Measure 10__absolute_maximum',
 'Sensor Measure 10__minimum',
 'Sensor Measure 16__median',
 'Sensor Measure 16__maximum',
 'Sensor Measure 16__absolute_maximum',
 'Sensor Measure 16__minimum',
 'Sensor Measure 18__m

In [None]:
# drop features with too many missing values or only a single unique value
# (de-duplicated so a feature that appears in both lists is dropped only once)
features_to_drop = list(dict.fromkeys(missing_features + single_unique_features))
X_train = X_train.drop(columns=features_to_drop)
X_test = X_test.drop(columns=features_to_drop)
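
The two cleaning helpers are imported from `src.data_cleaning` and their source is not shown here. As a rough illustration of what such checks can look like, here is a minimal pandas sketch. The names `find_missing` and `find_single_unique` are hypothetical stand-ins, and the assumption that `threshold` means "fraction of missing values per column" is inferred from the log message, not confirmed by the source.

```python
import pandas as pd


def find_missing(df, threshold=0.1):
    """Return columns whose fraction of missing values exceeds `threshold` (assumed meaning)."""
    missing_fraction = df.isna().mean()
    return missing_fraction[missing_fraction > threshold].index.tolist()


def find_single_unique(df):
    """Return columns that contain at most one distinct value (NaN counts as a value)."""
    n_unique = df.nunique(dropna=False)
    return n_unique[n_unique <= 1].index.tolist()


# Toy frame: "a" is half missing, "b" is constant, "c" is informative.
toy_clean = pd.DataFrame({
    "a": [1.0, None, None, 4.0],
    "b": [5, 5, 5, 5],
    "c": [1, 2, 3, 4],
})
missing_cols = find_missing(toy_clean, threshold=0.1)
constant_cols = find_single_unique(toy_clean)

# Drop the union of both lists in a single call.
cleaned = toy_clean.drop(columns=sorted(set(missing_cols) | set(constant_cols)))
print(missing_cols, constant_cols, list(cleaned.columns))
```

Constant columns carry no information for a regression model, which is why features such as the aggregates of `Operation Setting 3` are removed before feature selection.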

## 2. Feature Selection

Orientation:
![Feature Selection](https://machinelearningmastery.com/wp-content/uploads/2019/11/How-to-Choose-Feature-Selection-Methods-For-Machine-Learning.png)

Potential Feature Selection Methods:
* Supervised:
    * Filter Methods:
        * Numerical Input, Numerical Output:
            * Pearson's correlation coefficient (linear)
            * Spearman's rank correlation coefficient (monotonic, possibly nonlinear)
        * --> Use Pearson's correlation coefficient via the `f_regression()` function and the `SelectKBest` class.
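
The filter approach named in the list can be sketched with scikit-learn as follows. This is a minimal example on synthetic data, not the notebook's actual selection step: the frame `X_demo`, the target `y_demo`, and the choice `k=1` are assumptions made for illustration. `f_regression` scores each feature by a univariate F-statistic derived from its Pearson correlation with the target, and `SelectKBest` keeps the `k` highest-scoring features.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: one feature drives the target, two are pure noise.
rng = np.random.default_rng(42)
n_samples = 200
X_demo = pd.DataFrame({
    "informative": rng.normal(size=n_samples),
    "noise_1": rng.normal(size=n_samples),
    "noise_2": rng.normal(size=n_samples),
})
y_demo = 3.0 * X_demo["informative"] + rng.normal(scale=0.1, size=n_samples)

# Score every feature against the target and keep the single best one.
selector = SelectKBest(score_func=f_regression, k=1)
selector.fit(X_demo, y_demo)

# get_support() returns a boolean mask over the input columns.
selected = X_demo.columns[selector.get_support()].tolist()
print(selected)
```

On the real data, the selector would be fitted on `X_train` and `y_train` only and then applied to `X_test` with `transform`, so that the test set does not influence which features are kept.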