In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Topic: EX2 - Turbofan RUL Prediction
**Task**: Predict the remaining useful life (RUL) of turbofan engines based on given sensor data (time series data). It is a forecasting problem, where the goal is to predict the number of cycles an engine will last before it fails.
**Data**: Turbofan engine degradation simulation data (NASA) - [Link](https://data.nasa.gov/dataset/Turbofan-Engine-Degradation-Simulation-Data-Set/vrks-gjie). See also in the topic [introduction notebook](https://github.com/nina-prog/damage-propagation-modeling/blob/2fb8c1a1102a48d7abbf04e4031807790a913a99/notebooks/Turbofan%20remaining%20useful%20life%20Prediction.ipynb).

**Subtasks**:
1. Perform a deep **exploratory data analysis (EDA)** on the given data.
2. Implement a more efficient **sliding window method** for time series data analysis. -> 🎯 **Focus on this task**
3. Apply **traditional machine learning methods** (SOTA) to predict the remaining useful life. Includes data preparation, feature extraction, feature selection, model selection, and model parameter optimization.
4. Create **neural network models** to predict the remaining useful life. Includes different architectures like Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), or Attention Models. Note: You can search for SOTA research papers and reproduce current state-of-the-art models.


# Imports + Settings

In [4]:
# third-party libraries
import pandas as pd
import numpy as np
import os
import pytorch_lightning as pl
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn.metrics import root_mean_squared_error
import torch 

import time

import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
# source code
from src.data_loading import load_data, load_config
from src.data_splitting import train_val_split_by_group
from src.nn_utils import create_sliding_window, create_sliding_window_test
from src.rolling_window_creator import calculate_RUL
from src.data_processing import apply_padding_on_train_data_and_test_data, drop_samples_with_clipped_values
from src.nn_util.nn_models.ligthning.cnnModel1 import CNNModel1 as CNNModel
from src.nn_util.datamodule.lightning.turbofanDatamodule import TurbofanDatamodule
from src.data_cleaning import clean_data

In [6]:
# settings
sns.set_style("whitegrid")
sns.set_palette("Set2")
sns.set(rc={"figure.dpi":100, 'savefig.dpi':200})
sns.set_context('notebook')

In [7]:
np.random.seed(42)

# Paths

In [8]:
# Make sure to execute this cell only once for one kernel session, before running any other cell below.
os.chdir("../") # set working directory to root of project
os.getcwd() # check current working directory

'C:\\Users\\Johannes\\PycharmProjects\\damage-propagation-modeling'
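The `os.chdir("../")` call above is relative, so running the cell a second time moves one directory too far up. A minimal sketch of a more forgiving alternative, assuming the project root can be recognized by the presence of `configs/config.yaml` (the helper name `find_project_root` is hypothetical and not part of `src`):

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "configs/config.yaml") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains {marker}")
```

With this helper, `os.chdir(find_project_root(Path.cwd()))` could be executed repeatedly without leaving the project root.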

In [9]:
PATH_TO_CONFIG = "configs/config.yaml"

# Load Config + Data

In [10]:
config = load_config(PATH_TO_CONFIG) # config is dict

In [114]:
dataset_num = 4
train_data, test_data, test_RUL_data = load_data(config_path=PATH_TO_CONFIG, dataset_num=dataset_num)

2024-05-31 23:40:02 [src.data_loading:43] [INFO] >>>> Loading data set 4...
2024-05-31 23:40:03 [src.data_loading:72] [INFO] >>>> Loaded raw data for dataset 4.
2024-05-31 23:40:03 [src.data_loading:73] [INFO] >>>> Train Data: (61249, 26)
2024-05-31 23:40:03 [src.data_loading:74] [INFO] >>>> Test Data: (41214, 26)
2024-05-31 23:40:03 [src.data_loading:75] [INFO] >>>> Test RUL Data: (248, 1)


# Create Neural Regression Models

Pipeline:
1.	Data cleaning
2.	Optional: padding
3.	Split the train data into train and validation data
4.	Create sliding windows
5.	Drop some of the samples whose RUL equals the clipping value
6.	Scale the data
7.	Find the best hyperparameters
8.	Create the model with the found hyperparameters

Explanation of selected hyperparameters:
*	Window size: We chose a window size of 30 after experimenting with other window sizes. The same window size is used in the paper by Mitici et al. [1], which reports good results with a CNN architecture.
*	Clipping value: The clipping value of 125 was chosen because it has proven useful and is used in the paper [1].

References:
1.	Mihaela Mitici, Ingeborg de Pater, Anne Barros, Zhiguo Zeng, “Dynamic predictive maintenance for multiple components using data-driven probabilistic RUL prognostics: The case of turbofan engines”, Reliability Engineering & System Safety, Volume 234, 2023, https://doi.org/10.1016/j.ress.2023.109199.
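
The clipping described above can be sketched as follows. This is a toy illustration, not the `calculate_RUL` function from `src`: the function name `clipped_rul` is hypothetical, and it assumes the last recorded cycle of a unit has RUL 0 (the project code may use a different convention).

```python
import pandas as pd


def clipped_rul(df: pd.DataFrame, time_column: str = "Cycle",
                group_column: str = "UnitNumber", clip_value: int = 125) -> pd.Series:
    """RUL = (last cycle of the unit) - (current cycle), capped at `clip_value`."""
    last_cycle = df.groupby(group_column)[time_column].transform("max")
    return (last_cycle - df[time_column]).clip(upper=clip_value)


# Two toy engines with 4 and 3 cycles; clip at 2 to make the effect visible.
toy = pd.DataFrame({"UnitNumber": [1] * 4 + [2] * 3,
                    "Cycle": [1, 2, 3, 4, 1, 2, 3]})
toy["RUL"] = clipped_rul(toy, clip_value=2)
print(toy["RUL"].tolist())  # [2, 2, 1, 0, 2, 1, 0]
```

Early cycles all share the capped target, which reflects the idea that degradation is hardly observable while the engine is still healthy.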


In [115]:
# some hyperparameters
time_column = 'Cycle'
group_column = 'UnitNumber'

window_size = 30
clip_value = 125
test_size = 0.1
apply_data_cleaning = True
# If activated, adds a new column for every sensor with the cumulative sum of the peaks
apply_peaks_generation = False

# Apply scaler. The order in the list represents the order in which they are applied
std_scaler = StandardScaler()
minmax_scaler = MinMaxScaler()
robust_scaler = RobustScaler()
scaler = [std_scaler, minmax_scaler, robust_scaler]

Explanation of each step:
- Data cleaning:
    - The outlier detection and replacement method has been deactivated (`method=None`).
    - The removal of columns based on a correlation threshold has been deactivated (`threshold_corr=0.0`) because the neural network performs the feature selection itself.
    - Features that take only a single unique value are removed.
- Padding:
    - Only applied to datasets in which at least one time series in the train or test data is shorter than the window size.
    - The padding length is exactly the difference between the window size and the number of timesteps of the shortest time series.
    - The padding is applied to all time series.
- Split train data into train and validation data:
    - The training and validation sets are split by UnitNumber, so all windows of one engine end up in the same set.
- Create sliding windows:
    - TODO: Explanation by Frederik (in case you would like to add one)
- Drop some samples with the clipped value:
    - To make the target distribution more even, some of the samples whose RUL equals the clipping value are removed.
    - For this, the median frequency of the other RUL values is computed, and the number of samples with the clipping value is reduced to a multiple of that median.
    - We chose a multiple of two so as not to drop too many samples.
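
As a rough illustration of the sliding window step, here is a minimal sketch for a single engine based on NumPy's `sliding_window_view`. It is not the `create_sliding_window` implementation from `src`; the function name `windows_for_unit` and the choice to label each window with the RUL of its last timestep are assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def windows_for_unit(features: np.ndarray, rul: np.ndarray, window_size: int):
    """Return (X, y) for one engine.

    features: (timesteps, n_features), rul: (timesteps,)
    X: (n_windows, window_size, n_features); y: RUL at the last step of each window.
    """
    if len(features) < window_size:
        raise ValueError("series shorter than window_size; pad it first")
    # sliding_window_view appends the window axis: (n_windows, n_features, window_size)
    X = sliding_window_view(features, window_size, axis=0)
    X = np.swapaxes(X, 1, 2)  # -> (n_windows, window_size, n_features)
    y = rul[window_size - 1:]
    return X, y


feats = np.arange(10, dtype=float).reshape(5, 2)  # 5 timesteps, 2 features
rul = np.array([4, 3, 2, 1, 0])
X, y = windows_for_unit(feats, rul, window_size=3)
print(X.shape, y.tolist())  # (3, 3, 2) [2, 1, 0]
```

The result is a read-only view on the original array, so no data is copied; call `.copy()` on it before modifying windows in place, for example when scaling.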


In [116]:
if apply_data_cleaning:
    train_data, test_data = clean_data(train_data, test_data, method=None, ignore_columns=['UnitNumber', 'Cycle'], threshold_missing=0.1, threshold_corr=0.0, contamination=0.05)
    
# Add column RUL to train_data
train_data = calculate_RUL(train_data, time_column, group_column, clip_value)

train_data, test_data = apply_padding_on_train_data_and_test_data(train_data=train_data, test_data=test_data, window_size=window_size)

train, val = train_val_split_by_group(train_data, test_size=test_size, random_state=12)

X_train, y_train = create_sliding_window(train, window_size=window_size)  #, drop_columns=['UnitNumber', 'Cycle', 'RUL'])
X_val, y_val = create_sliding_window(val, window_size=window_size)  #, drop_columns=['UnitNumber', 'Cycle', 'RUL'])
X_test, _ = create_sliding_window_test(test_data, column_RUL=False, drop_columns=['UnitNumber'])
y_test = test_RUL_data.values

X_train, y_train = drop_samples_with_clipped_values(X_train, y_train, clip_value)
X_val, y_val = drop_samples_with_clipped_values(X_val, y_val, clip_value)

2024-05-31 23:40:04 [src.data_cleaning:134] [INFO] >>>> Cleaning train and test data...
2024-05-31 23:40:04 [src.data_cleaning:136] [INFO] >>>> Formatting column types...
2024-05-31 23:40:04 [src.data_cleaning:69] [DEBUG] >>>> Found 0 categorical columns: []
2024-05-31 23:40:04 [src.data_cleaning:69] [DEBUG] >>>> Found 0 categorical columns: []
2024-05-31 23:40:04 [src.data_cleaning:141] [INFO] >>>> Handling duplicates...
2024-05-31 23:40:04 [src.data_cleaning:146] [INFO] >>>> Removing outliers...
2024-05-31 23:40:04 [src.outlier_detection:150] [DEBUG] >>>> Removing outliers using method: None ...
2024-05-31 23:40:04 [src.outlier_detection:162] [INFO] >>>> No outlier detection method specified. Skipping outlier detection.
2024-05-31 23:40:04 [src.outlier_detection:150] [DEBUG] >>>> Removing outliers using method: N
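
The step that drops samples with the clipped RUL value can be sketched as follows. This is an illustration of the idea only, not the `drop_samples_with_clipped_values` function from `src`: the name `downsample_clipped`, the `multiple` and `seed` parameters, and the random choice of which clipped samples to keep are assumptions.

```python
import numpy as np


def downsample_clipped(X, y, clip_value, multiple=2, seed=0):
    """Keep at most `multiple * median_count` samples whose target equals `clip_value`."""
    y_flat = np.asarray(y).reshape(-1)
    clipped_idx = np.flatnonzero(y_flat == clip_value)
    other_idx = np.flatnonzero(y_flat != clip_value)
    if other_idx.size == 0 or clipped_idx.size == 0:
        return X, y
    # Median of how often each non-clipped RUL value occurs.
    _, counts = np.unique(y_flat[other_idx], return_counts=True)
    n_keep = int(multiple * np.median(counts))
    if clipped_idx.size <= n_keep:
        return X, y
    rng = np.random.default_rng(seed)
    kept_clipped = rng.choice(clipped_idx, size=n_keep, replace=False)
    keep = np.sort(np.concatenate([other_idx, kept_clipped]))
    return X[keep], y[keep]


# Toy data: the clipped value 125 occurs 10 times, every other value twice.
y_toy = np.array([125] * 10 + [3, 3, 2, 2, 1, 1])
X_toy = np.zeros((len(y_toy), 3, 2))
X_small, y_small = downsample_clipped(X_toy, y_toy, clip_value=125)
print(len(y_small), int((y_small == 125).sum()))  # 10 4
```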

Scale the data
*	The StandardScaler, the MinMaxScaler, and the RobustScaler are applied one after another, in this order.
*	These three scalers were selected because the training was most robust with them.


In [117]:
# Note: Do not scale the cycle value (feature index 0), which is why the loop starts at index 1
for single_scaler in scaler:
    for i in range(1, X_train.shape[-1]):
        X_train[:, :, i] = single_scaler.fit_transform(X_train[:, :, i])
        X_val[:, :, i] = single_scaler.transform(X_val[:, :, i])
        X_test[:, :, i] = single_scaler.transform(X_test[:, :, i])
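
Note that the loop above passes a `(samples, window)` slice to each scaler, so every window position is scaled as its own column. For comparison, here is a sketch of a variant that treats each sensor as a single column by flattening samples and timesteps. It is not what the notebook does; the function name `scale_windows` and the `skip_first` parameter are hypothetical.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler


def scale_windows(X_train, X_other, skip_first=1):
    """Scale (samples, window, features) arrays per feature; fit on X_train only."""
    n_scaled = X_train.shape[-1] - skip_first
    pipe = make_pipeline(StandardScaler(), MinMaxScaler(), RobustScaler())
    # Flatten samples and timesteps so that each column is one feature.
    pipe.fit(X_train[..., skip_first:].reshape(-1, n_scaled))
    scaled = []
    for X in (X_train, *X_other):
        X = X.astype(np.float64, copy=True)  # leave the caller's arrays untouched
        flat = X[..., skip_first:].reshape(-1, n_scaled)
        X[..., skip_first:] = pipe.transform(flat).reshape(X.shape[0], X.shape[1], -1)
        scaled.append(X)
    return scaled
```

A call such as `X_train, X_val, X_test = scale_windows(X_train, [X_val, X_test])` would leave the first feature (the cycle) unscaled and fit the scalers on the training windows only.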

Change data types of arrays to float32 and swap axes if necessary:

In [118]:
print(X_train.shape)
X_train = np.swapaxes(X_train, 1, 2)
X_train = np.array(X_train, dtype=np.float32)
y_train = np.array(y_train, dtype=np.float32)
print(X_train.shape)

print(X_val.shape)
X_val = np.swapaxes(X_val, 1, 2)
X_val = np.array(X_val, dtype=np.float32)
y_val = np.array(y_val, dtype=np.float32)
print(X_val.shape)

print(X_test.shape)
X_test = np.swapaxes(X_test, 1, 2)
X_test = np.array(X_test, dtype=np.float32)
y_test = np.array(y_test, dtype=np.float32)
print(X_test.shape)

(28188, 30, 25)
(28188, 25, 30)
(3150, 30, 25)
(3150, 25, 30)
(248, 30, 25)
(248, 25, 30)
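
The axes are swapped because `torch.nn.Conv1d` expects its input as `(batch, channels, length)`, so the features have to come before the time axis. A small shape check with dummy data (the layer sizes here are arbitrary and only serve the illustration):

```python
import numpy as np
import torch

X = np.zeros((8, 30, 25), dtype=np.float32)   # (samples, window, features)
X_channels_first = np.swapaxes(X, 1, 2)       # (samples, features, window)

# 25 input channels (features), convolution runs along the 30 timesteps.
conv = torch.nn.Conv1d(in_channels=25, out_channels=4, kernel_size=5)
out = conv(torch.from_numpy(np.ascontiguousarray(X_channels_first)))
print(out.shape)  # torch.Size([8, 4, 26])
```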


Save processed test data

In [119]:
save_test_data = False
if save_test_data:
    timestamp = time.strftime("%Y%m%d-%H%M%S")
    np.save(f"{config['paths']['processed_data_dir']}ex2_preprocessed_X_test_from_dataset_{dataset_num}_for_CNNModel1_{timestamp}.npy", X_test)
    np.save(f"{config['paths']['processed_data_dir']}ex2_preprocessed_y_test_from_dataset_{dataset_num}_for_CNNModel1_{timestamp}.npy", y_test)

## CNN

Architecture
*	The first CNN model (“ExampleCNNModel”) is a minimalistic approach with only two convolutional layers and some fully connected layers.
*	The second CNN model uses more convolutional layers and one more fully connected layer.
*	The additional convolutional layers bring it closer to the architecture in the paper by Mitici et al. [1].
*	Both architectures use only 1D convolutional layers, as in the paper [1].
*	Both use dropout to improve generalization and prevent overfitting.
*	Adam is used as the optimizer and the mean squared error as the loss function.
*	Because all possible targets are greater than or equal to one, the second CNN applies the max function with one to its output.
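
To make the description more concrete, here is a plain PyTorch sketch of such a network. It is not the `CNNModel1` class from `src`: the class name `SketchCNN` is hypothetical, the layer widths (40 channels, kernel size 5, 128 and 64 hidden units) are assumptions chosen to be consistent with the parameter counts in the model summary printed during training for an 18-feature input, and the ReLU activations, the padding, and the dropout placement are guesses.

```python
import torch
from torch import nn


class SketchCNN(nn.Module):
    """Hypothetical 1D-CNN regressor; not the CNNModel1 from `src`."""

    def __init__(self, features: int, window_size: int, channels: int = 40,
                 kernel_size: int = 5, dropout_rate: float = 0.2):
        super().__init__()
        pad = kernel_size // 2  # keeps the sequence length for odd kernel sizes
        layers, in_ch = [], features
        for _ in range(4):
            layers += [nn.Conv1d(in_ch, channels, kernel_size, padding=pad), nn.ReLU()]
            in_ch = channels
        self.conv = nn.Sequential(*layers)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(dropout_rate),
            nn.Linear(channels * window_size, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, features, window_size)
        out = self.head(self.conv(x)).squeeze(-1)
        return torch.clamp(out, min=1.0)  # max(output, 1): targets are never below one


model = SketchCNN(features=18, window_size=30)
# Adam with explicit betas and MSE loss, as described above (values are placeholders).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.96, 0.95))
loss_fn = nn.MSELoss()
```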


Hyperparameter search
*	The best hyperparameters are found with Bayesian optimization.
*	A separate set of hyperparameters has been searched for each dataset.
*	The search was run on the SCC JupyterHub. To parallelize the computation, a separate notebook was created for each dataset.
*	The notebooks are stored in the “notebooks/cnn_hyperparameter_search” folder.


In [109]:
hyper_params = [{'batch_size': 114.84809532072403, 'beta_1': 0.9586517323123119, 'beta_2': 0.9558431375026947, 'dropout': 0.021025382021542985, 'learning_rate_init': 0.01}, 
                {}, 
                {'batch_size': 92.4798215637139, 'beta_1': 0.9635139876762263, 'beta_2': 0.9432583039935667, 'dropout': 0.2119494320551308, 'learning_rate_init': 0.0004461791916105841}, 
                {},
                ]

seeds = [21, 21, 21, 21]
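
The stored `batch_size` values are floats, presumably because the search treats every parameter as continuous, and the entries for datasets 2 and 4 are empty, so selecting one of them fails with a `KeyError` further down. A small sketch of a lookup helper that makes both points explicit (the function `resolve_hyper_params` is hypothetical and not part of the project code):

```python
def resolve_hyper_params(hyper_params: list, dataset_num: int) -> dict:
    """Return a cleaned copy of the entry for `dataset_num` (1-based)."""
    params = hyper_params[dataset_num - 1]
    if not params:
        raise ValueError(f"No hyperparameters stored for dataset {dataset_num}")
    cleaned = dict(params)
    # A DataLoader needs an integer batch size, so truncate the float.
    cleaned["batch_size"] = int(cleaned["batch_size"])
    return cleaned


example = [{"batch_size": 114.8, "dropout": 0.02}, {}]
print(resolve_hyper_params(example, 1))  # {'batch_size': 114, 'dropout': 0.02}
```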

In [60]:
pl.seed_everything(seeds[dataset_num-1])

# Select hyperparameters of trainer!
checkpoint_callback = ModelCheckpoint(monitor="val_loss")
trainer = Trainer(min_epochs=1, max_epochs=150, callbacks=[checkpoint_callback], deterministic=True)
datamodule = TurbofanDatamodule(batch_size=int(hyper_params[dataset_num-1]['batch_size']))
datamodule.set_train_dataset(X_train, y_train)
datamodule.set_val_dataset(X_val, y_val)
datamodule.set_predict_dataset(X_test)
datamodule.set_test_dataset(X_test, y_test[:, 0])
model = CNNModel(lr=hyper_params[dataset_num-1]['learning_rate_init'], beta_1=hyper_params[dataset_num-1]['beta_1'], beta_2=hyper_params[dataset_num-1]['beta_2'], window_size=window_size, features=X_train.shape[1], dropout_rate=hyper_params[dataset_num-1]['dropout'])

Seed set to 21
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


In [61]:
%%capture
# For visualization write 'tensorboard --logdir=lightning_logs/' in console

trainer.fit(model, datamodule=datamodule)


  | Name        | Type    | Params
----------------------------------------
0 | loss        | MSELoss | 0     
1 | dropout     | Dropout | 0     
2 | layer1_conv | Conv1d  | 3.6 K 
3 | layer2_conv | Conv1d  | 8.0 K 
4 | layer3_conv | Conv1d  | 8.0 K 
5 | layer4_conv | Conv1d  | 8.0 K 
6 | fc1         | Linear  | 153 K 
7 | fc2         | Linear  | 8.3 K 
8 | fc3         | Linear  | 65    
----------------------------------------
189 K     Trainable params
0         Non-trainable params
189 K     Total params
0.759     Total estimated model params size (MB)
`Trainer.fit` stopped: `max_epochs=150` reached.


In [62]:
%%capture
pred = trainer.test(model, datamodule=datamodule, ckpt_path="best")

Restoring states from the checkpoint path at C:\Users\Johannes\PycharmProjects\damage-propagation-modeling\lightning_logs\version_0\checkpoints\epoch=125-step=12474.ckpt
Loaded model weights from the checkpoint at C:\Users\Johannes\PycharmProjects\damage-propagation-modeling\lightning_logs\version_0\checkpoints\epoch=125-step=12474.ckpt


In [63]:
pred

[{'test_loss': 286.5386657714844}]
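
Because the model is trained with `MSELoss`, `test_loss` is on a squared scale. Assuming it is the mean squared error over the whole test set, taking the square root gives a value that is directly comparable with RMSE scores:

```python
import math

test_mse = 286.5386657714844   # value reported by trainer.test above
test_rmse = math.sqrt(test_mse)
print(f"{test_rmse:.4f}")      # 16.9275
```

This agrees with the RMSE of about 16.9275 reported for FD001 further down.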

## Scores on all test sets

In [111]:
all_test_data = []
paths = [
    ('data/processed/ex2_preprocessed_X_test_from_dataset_1_for_CNNModel1_20240531-232248.npy', 'data/processed/ex2_preprocessed_y_test_from_dataset_1_for_CNNModel1_20240531-232248.npy'),
    ('data/processed/ex2_preprocessed_X_test_from_dataset_2_for_CNNModel1_20240531-233230.npy', 'data/processed/ex2_preprocessed_y_test_from_dataset_2_for_CNNModel1_20240531-233230.npy'),
    ('data/processed/ex2_preprocessed_X_test_from_dataset_3_for_CNNModel1_20240531-232732.npy', 'data/processed/ex2_preprocessed_y_test_from_dataset_3_for_CNNModel1_20240531-232732.npy'),
    ('data/processed/ex2_preprocessed_X_test_from_dataset_4_for_CNNModel1_20240531-234033.npy', 'data/processed/ex2_preprocessed_y_test_from_dataset_4_for_CNNModel1_20240531-234033.npy'),
]
for i in range(len(paths)):
    X_temp = np.load(paths[i][0])
    y_temp = np.load(paths[i][1])
    all_test_data.append((X_temp, y_temp))

In [112]:
dataset_num_temp = 1
model = CNNModel(lr=hyper_params[dataset_num_temp-1]['learning_rate_init'], beta_1=hyper_params[dataset_num_temp-1]['beta_1'], beta_2=hyper_params[dataset_num_temp-1]['beta_2'], window_size=window_size, features=all_test_data[dataset_num_temp-1][0].shape[1], dropout_rate=hyper_params[dataset_num_temp-1]['dropout'])
checkpoint = torch.load("models/cnn_dataset_1.ckpt")
model.load_state_dict(checkpoint['state_dict'])

model.eval()

pred = model(torch.tensor(all_test_data[dataset_num_temp-1][0])).detach().numpy()
rmse_cnn_1 = root_mean_squared_error(pred, torch.tensor(all_test_data[dataset_num_temp-1][1]))
print(f'The RMSE score on dataset FD00{dataset_num_temp} is {rmse_cnn_1}.')

The RMSE score on dataset FD001 is 16.927452087402344.


In [113]:
dataset_num_temp = 3
model = CNNModel(lr=hyper_params[dataset_num_temp-1]['learning_rate_init'], beta_1=hyper_params[dataset_num_temp-1]['beta_1'], beta_2=hyper_params[dataset_num_temp-1]['beta_2'], window_size=window_size, features=all_test_data[dataset_num_temp-1][0].shape[1], dropout_rate=hyper_params[dataset_num_temp-1]['dropout'])
checkpoint = torch.load("models/cnn_dataset_3.ckpt")
model.load_state_dict(checkpoint['state_dict'])

model.eval()

pred = model(torch.tensor(all_test_data[dataset_num_temp-1][0])).detach().numpy()
rmse_cnn_3 = root_mean_squared_error(pred, torch.tensor(all_test_data[dataset_num_temp-1][1]))
print(f'The RMSE score on dataset FD00{dataset_num_temp} is {rmse_cnn_3}.')

The RMSE score on dataset FD003 is 19.158700942993164.
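
The two evaluation cells above differ only in the dataset number. As a hedged sketch, the metric itself can also be computed with a small NumPy helper (the function `rmse` is hypothetical). It flattens both inputs first, because subtracting a `(n,)` prediction from a `(n, 1)` target would otherwise broadcast to an `(n, n)` matrix.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error after flattening both inputs to 1D."""
    y_true = np.asarray(y_true, dtype=np.float64).reshape(-1)
    y_pred = np.asarray(y_pred, dtype=np.float64).reshape(-1)
    if y_true.shape != y_pred.shape:
        raise ValueError(f"shape mismatch: {y_true.shape} vs {y_pred.shape}")
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


y_true = np.array([[10.0], [20.0], [30.0]])   # shape (3, 1), like test_RUL_data.values
y_pred = np.array([12.0, 18.0, 33.0])         # shape (3,)
print(round(rmse(y_true, y_pred), 4))  # 2.3805
```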
