In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Topic: EX2 - Turbofan RUL Prediction
**Task**: Predict the remaining useful life (RUL) of turbofan engines from the given sensor data (multivariate time series). This is a regression problem.
**Data**: Turbofan engine degradation simulation data (NASA) - [Link](https://data.nasa.gov/dataset/Turbofan-Engine-Degradation-Simulation-Data-Set/vrks-gjie). See also the topic [introduction notebook](https://github.com/nina-prog/damage-propagation-modeling/blob/2fb8c1a1102a48d7abbf04e4031807790a913a99/notebooks/Turbofan%20remaining%20useful%20life%20Prediction.ipynb).

**Subtasks**:
1. Perform a deep **exploratory data analysis (EDA)** on the given data.
2. Implement a more efficient **sliding window method** for time series data analysis. -> 🎯 **Focus on this task**
3. Apply **traditional machine learning methods** (SOTA) to predict the remaining useful life. Includes data preparation, feature extraction, feature selection, model selection, and model parameter optimization.
4. Create **neural network models** to predict the remaining useful life. Includes different architectures like Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), or Attention Models. Note: You can search for SOTA research papers and reproduce current state-of-the-art models.


# Imports + Settings

In [50]:
# standard library
import os
import time

# third-party libraries
import numpy as np
import pandas as pd
import seaborn as sns

from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import roll_time_series, make_forecasting_frame, impute
from tsfresh.feature_extraction import EfficientFCParameters, MinimalFCParameters

In [4]:
# source code
from src.utils import load_data, load_config
from src.data_preprocessing import calculate_RUL

In [5]:
# settings
# a single set_theme call, because a later sns.set() would reset style and palette to their defaults
sns.set_theme(context="notebook", style="whitegrid", palette="Set2",
              rc={"figure.dpi": 100, "savefig.dpi": 200})

In [6]:
np.random.seed(42)

# Paths

In [7]:
os.chdir("../") # set working directory to root of project
#os.getcwd() # check current working directory

In [8]:
PATH_TO_CONFIG = "configs/config.yaml"

# Load Config + Data

In [9]:
config = load_config(PATH_TO_CONFIG) # config is dict

In [10]:
train_data, test_data, test_RUL_data = load_data(config_path=PATH_TO_CONFIG, dataset_num=1)

2024-05-12 15:52:17 [src.utils:56] [INFO] >>>> Loading data set 1...
2024-05-12 15:52:17 [src.utils:85] [INFO] >>>> Loaded raw data for dataset 1.
2024-05-12 15:52:17 [src.utils:86] [INFO] >>>> Train Data: (20631, 26)
2024-05-12 15:52:17 [src.utils:87] [INFO] >>>> Test Data: (13096, 26)
2024-05-12 15:52:17 [src.utils:88] [INFO] >>>> Test RUL Data: (100, 1)


# 📍 Subtask 2: Sliding Window Method

Note:
* For training we need many examples. If we only used the time series up to today (and waited for tomorrow's value to obtain a target), we would have a single training example. Therefore we use a trick: we replay the history.
* At each time step $t$, treat the data as if $t$ were today. Extract the features from everything known up to and including $t$. In a forecasting setup, the target for these features is the value at time $t + 1$ (which we already know, because it lies in the past).
* The window sliding is implemented in the tsfresh function `roll_time_series`. We set `max_timeshift=20`, so each window looks back at most 20 cycles and therefore holds at most 21 rows (including the current cycle), and `min_timeshift=5`, so windows with fewer than 6 rows are discarded.

Generating data to train a forecasting model typically involves the following steps:
1. Generate rolling windows of size k (e.g. 20) from the time series data, i.e. take the last k time steps, including the current one, as the relevant input. The target is then the value at the next time step.
2. Extract features from the rolling windows. This can be as simple as the mean over the last k time steps, or something more complex such as the slope of a linear regression. Each raw window has shape (k, n_features) and each target has shape (1,).

--> The windowed data then has shape (n_samples, k, n_features) and the target has shape (n_samples, 1).
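
As a rough illustration of these shapes, the following sketch builds fixed-length windows with NumPy's `sliding_window_view` on a synthetic array. The array, the value `k = 20`, and the choice of the first feature as next-step target are assumptions made for the example. It is not how `roll_time_series` works internally.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# synthetic series: 30 time steps, 3 features (hypothetical data)
n_steps, n_features, k = 30, 3, 20
series = np.arange(n_steps * n_features, dtype=float).reshape(n_steps, n_features)

# windows over the time axis, shape (n_steps - k + 1, n_features, k)
windows = sliding_window_view(series, window_shape=k, axis=0)

# drop the last window (it has no next step to use as target) and
# reorder the axes to (n_samples, k, n_features)
X = windows[:-1].transpose(0, 2, 1)

# target: first feature of the time step right after each window
y = series[k:, :1]

print(X.shape, y.shape)  # (10, 20, 3) (10, 1)
```

Unlike this sketch, the rolled frame produced below also contains shorter windows at the start of each unit's series.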

## 1. Generate rolling windows

In [11]:
# define window size
MAX_K = 20
MIN_K = 5

In [12]:
%%time
# generate rolling windows
train_data_rolled = roll_time_series(train_data, column_id="UnitNumber", column_sort="Cycle", max_timeshift=MAX_K, min_timeshift=MIN_K)

Rolling: 100%|██████████| 37/37 [00:04<00:00,  8.16it/s]


CPU times: total: 797 ms
Wall time: 5.08 s


In [13]:
train_data_rolled.head(5)

Unnamed: 0,UnitNumber,Cycle,Operation Setting 1,Operation Setting 2,Operation Setting 3,Sensor Measure 1,Sensor Measure 2,Sensor Measure 3,Sensor Measure 4,Sensor Measure 5,...,Sensor Measure 13,Sensor Measure 14,Sensor Measure 15,Sensor Measure 16,Sensor Measure 17,Sensor Measure 18,Sensor Measure 19,Sensor Measure 20,Sensor Measure 21,id
0,1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.7,1400.6,14.62,...,2388.02,8138.62,8.4195,0.03,392,2388,100.0,39.06,23.419,"(1, 6)"
1,1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,...,2388.07,8131.49,8.4318,0.03,392,2388,100.0,39.0,23.4236,"(1, 6)"
2,1,3,-0.0043,0.0003,100.0,518.67,642.35,1587.99,1404.2,14.62,...,2388.03,8133.23,8.4178,0.03,390,2388,100.0,38.95,23.3442,"(1, 6)"
3,1,4,0.0007,0.0,100.0,518.67,642.35,1582.79,1401.87,14.62,...,2388.08,8133.83,8.3682,0.03,392,2388,100.0,38.88,23.3739,"(1, 6)"
4,1,5,-0.0019,-0.0002,100.0,518.67,642.37,1582.85,1406.22,14.62,...,2388.04,8133.8,8.4294,0.03,393,2388,100.0,38.9,23.4044,"(1, 6)"


Findings:
* **Function Result**: A new column `id` is generated, representing a **window id**. It is a tuple of the **`UnitNumber` (the group)** and the **ending `Cycle` (the last time step in the window)**.
* **Example**: The window with `id` (1, 40) contains the original rows of `UnitNumber` 1 for `Cycle` 20 through 40 (inclusive), i.e. 21 rows, because `max_timeshift=20` looks back 20 cycles from the ending cycle.

In [14]:
# check if example interpretation is correct and if the window is generated correctly
rolled_win = train_data_rolled[train_data_rolled["id"] == (1, 40)]
print(rolled_win.shape)
display(rolled_win.head(5))

original_win_data = train_data[(train_data["UnitNumber"] == 1) &
                               (train_data["Cycle"] <= 40) &
                               (train_data["Cycle"] >= 40 - 20)]
print(original_win_data.shape)
display(original_win_data.head(5))

(21, 27)


Unnamed: 0,UnitNumber,Cycle,Operation Setting 1,Operation Setting 2,Operation Setting 3,Sensor Measure 1,Sensor Measure 2,Sensor Measure 3,Sensor Measure 4,Sensor Measure 5,...,Sensor Measure 13,Sensor Measure 14,Sensor Measure 15,Sensor Measure 16,Sensor Measure 17,Sensor Measure 18,Sensor Measure 19,Sensor Measure 20,Sensor Measure 21,id
43900,1,20,-0.0037,0.0001,100.0,518.67,643.04,1581.11,1405.23,14.62,...,2388.02,8129.71,8.421,0.03,392,2388,100.0,39.03,23.422,"(1, 40)"
43901,1,21,-0.0012,0.0001,100.0,518.67,642.37,1586.07,1398.13,14.62,...,2388.08,8134.02,8.4049,0.03,392,2388,100.0,39.09,23.3101,"(1, 40)"
43902,1,22,0.0002,0.0,100.0,518.67,642.77,1592.93,1400.57,14.62,...,2388.03,8130.41,8.4034,0.03,392,2388,100.0,38.92,23.3792,"(1, 40)"
43903,1,23,0.0034,-0.0003,100.0,518.67,642.14,1588.19,1394.75,14.62,...,2388.05,8127.9,8.424,0.03,392,2388,100.0,38.94,23.4562,"(1, 40)"
43904,1,24,-0.001,0.0003,100.0,518.67,642.38,1590.83,1398.81,14.62,...,2388.03,8133.88,8.3891,0.03,392,2388,100.0,39.0,23.3696,"(1, 40)"


(21, 26)


Unnamed: 0,UnitNumber,Cycle,Operation Setting 1,Operation Setting 2,Operation Setting 3,Sensor Measure 1,Sensor Measure 2,Sensor Measure 3,Sensor Measure 4,Sensor Measure 5,...,Sensor Measure 12,Sensor Measure 13,Sensor Measure 14,Sensor Measure 15,Sensor Measure 16,Sensor Measure 17,Sensor Measure 18,Sensor Measure 19,Sensor Measure 20,Sensor Measure 21
19,1,20,-0.0037,0.0001,100.0,518.67,643.04,1581.11,1405.23,14.62,...,522.07,2388.02,8129.71,8.421,0.03,392,2388,100.0,39.03,23.422
20,1,21,-0.0012,0.0001,100.0,518.67,642.37,1586.07,1398.13,14.62,...,522.42,2388.08,8134.02,8.4049,0.03,392,2388,100.0,39.09,23.3101
21,1,22,0.0002,0.0,100.0,518.67,642.77,1592.93,1400.57,14.62,...,522.0,2388.03,8130.41,8.4034,0.03,392,2388,100.0,38.92,23.3792
22,1,23,0.0034,-0.0003,100.0,518.67,642.14,1588.19,1394.75,14.62,...,521.52,2388.05,8127.9,8.424,0.03,392,2388,100.0,38.94,23.4562
23,1,24,-0.001,0.0003,100.0,518.67,642.38,1590.83,1398.81,14.62,...,522.13,2388.03,8133.88,8.3891,0.03,392,2388,100.0,39.0,23.3696


In [15]:
print(f"Number of samples: {train_data_rolled.shape[0]}")
print(f"Number of unique windows: {train_data_rolled['id'].nunique()}")
print(f"Number of windows shorter than 5 days: {(train_data_rolled['id'].value_counts() < MIN_K).sum()}")  # counts windows, not rows
print(f"Number of unique units: {train_data_rolled['UnitNumber'].nunique()}")

Number of samples: 410751
Number of unique windows: 20131
Number of windows shorter than 5 days: 0
Number of unique units: 100


In [16]:
# check windows sizes / lengths
train_data_rolled.groupby("id").size().value_counts()

21    18631
6       100
7       100
8       100
9       100
10      100
11      100
12      100
13      100
14      100
15      100
16      100
17      100
18      100
19      100
20      100
Name: count, dtype: int64

In [17]:
# get list of features
features = config["dataloading"]["features"]["operational_settings"] + \
           config["dataloading"]["features"]["sensor_measure"]

Findings:
* Each window length from 6 to 20 rows occurs exactly 100 times, once per unit: these are the windows at the start of each unit's series, before a full 21-row window is available. This matches the 100 unique units in the data and indicates that the windows are generated correctly.
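
The expected counts can be reasoned out without the data. The sketch below assumes that one window ends at every cycle after the first `min_timeshift` cycles of a unit and that a window holds at most `max_timeshift + 1` rows, which is consistent with the outputs in this notebook. The function name and the unit lengths are hypothetical.

```python
from collections import Counter


def window_length_counts(unit_lengths, max_timeshift, min_timeshift):
    """Count windows by number of rows, one window per ending cycle."""
    counts = Counter()
    for n_cycles in unit_lengths:
        # the first window ends at cycle min_timeshift + 1
        for end_cycle in range(min_timeshift + 1, n_cycles + 1):
            counts[min(end_cycle, max_timeshift + 1)] += 1
    return counts


# hypothetical unit lengths (not the real data): 30, 25 and 22 cycles
counts = window_length_counts([30, 25, 22], max_timeshift=20, min_timeshift=5)
print(counts[6], counts[20], counts[21])  # 3 3 17
print(sum(counts.values()))               # 62
```

Applied to the numbers in this notebook, the same reasoning gives 20631 - 100 * 5 = 20131 windows, of which 20131 - 15 * 100 = 18631 have 21 rows, matching the output above.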

## 2. Extract Features
Note:
* The data has many feature columns, and `RUL` is the target. We therefore extract features from every feature column within each window and also need one y value per window.
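
To make the idea of "one feature row per window" concrete, here is a minimal sketch that uses a plain pandas aggregation as a stand-in for tsfresh. The toy frame and its values are made up, and the statistic names are those of pandas, not of tsfresh.

```python
import pandas as pd

# hypothetical rolled frame: two windows of one unit, one sensor column
rolled = pd.DataFrame({
    "id": [(1, 3)] * 3 + [(1, 4)] * 4,
    "Cycle": [1, 2, 3, 1, 2, 3, 4],
    "Sensor Measure 2": [641.8, 642.1, 642.3, 641.8, 642.1, 642.3, 642.4],
})

# one row per window id, one column per statistic of the sensor
features = rolled.groupby("id")["Sensor Measure 2"].agg(
    ["mean", "median", "min", "max", "size"]
)
print(features)
```

With several sensor columns, the same aggregation yields one column per (sensor, statistic) pair, which is the layout of the feature matrix extracted below.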

In [18]:
%%time
# extract features - tsfresh
X = extract_features(train_data_rolled.drop(["UnitNumber"], axis=1),
                     column_id="id", column_sort="Cycle",
                     default_fc_parameters=MinimalFCParameters(),
                     impute_function=impute, show_warnings=False)
# add index names
X.index = X.index.rename(["UnitNumber", "Cycle"])

Feature Extraction: 100%|██████████| 40/40 [01:00<00:00,  1.51s/it]


CPU times: total: 43 s
Wall time: 1min 10s


In [19]:
X.head(5)

UnitNumber,Cycle,Operation Setting 1__sum_values,Operation Setting 1__median,Operation Setting 1__mean,Operation Setting 1__length,Operation Setting 1__standard_deviation,Operation Setting 1__variance,Operation Setting 1__root_mean_square,Operation Setting 1__maximum,Operation Setting 1__absolute_maximum,Operation Setting 1__minimum,...,Sensor Measure 21__sum_values,Sensor Measure 21__median,Sensor Measure 21__mean,Sensor Measure 21__length,Sensor Measure 21__standard_deviation,Sensor Measure 21__variance,Sensor Measure 21__root_mean_square,Sensor Measure 21__maximum,Sensor Measure 21__absolute_maximum,Sensor Measure 21__minimum
1,6,-0.0086,-0.0013,-0.001433,6.0,0.00234,5e-06,0.002744,0.0019,0.0043,-0.0043,...,140.332,23.38915,23.388667,6.0,0.029032,0.000843,23.388685,23.4236,23.4236,23.3442
1,7,-0.0076,-0.0007,-0.001086,7.0,0.002328,5e-06,0.002568,0.0019,0.0043,-0.0043,...,163.7094,23.3774,23.387057,7.0,0.027166,0.000738,23.387073,23.4236,23.4236,23.3442
1,8,-0.011,-0.0013,-0.001375,8.0,0.002308,5e-06,0.002687,0.0019,0.0043,-0.0043,...,187.02,23.37565,23.3775,8.0,0.035848,0.001285,23.377527,23.4236,23.4236,23.3106
1,9,-0.0102,-0.0007,-0.001133,9.0,0.002281,5e-06,0.002547,0.0019,0.0043,-0.0043,...,210.4266,23.3774,23.380733,9.0,0.035014,0.001226,23.38076,23.4236,23.4236,23.3106
1,10,-0.0135,-0.0013,-0.00135,10.0,0.002259,5e-06,0.002632,0.0019,0.0043,-0.0043,...,233.896,23.3909,23.3896,10.0,0.042555,0.001811,23.389639,23.4694,23.4694,23.3106


In [20]:
# check if example window can also be found in the extracted features
X.loc[(1, 40)]

Operation Setting 1__sum_values            -0.010700
Operation Setting 1__median                -0.000400
Operation Setting 1__mean                  -0.000510
Operation Setting 1__length                21.000000
Operation Setting 1__standard_deviation     0.001900
                                             ...    
Sensor Measure 21__variance                 0.003121
Sensor Measure 21__root_mean_square        23.394533
Sensor Measure 21__maximum                 23.499900
Sensor Measure 21__absolute_maximum        23.499900
Sensor Measure 21__minimum                 23.284100
Name: (1, 40), Length: 240, dtype: float64

## 3. Extract Target & Map to Features
Note:
* **Target Example (DEPRECATED)**: In a forecasting setup, the target for the window with id (1, 40) would be the RUL of the row with id (1, 41), and so on, i.e. the RUL column shifted by one.
* **Target Example**: The target for the window with id (1, 40) is the RUL of the row with id (1, 40).
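
The source of `calculate_RUL` is not shown in this notebook. The sketch below is one way such a target could be computed, assuming RUL counts down to 1 at the last observed cycle of each unit, which is the convention visible in the `tail(5)` output in this section. The function name `add_rul` and the toy data are hypothetical.

```python
import pandas as pd


def add_rul(df, group_column="UnitNumber", time_column="Cycle"):
    """Return a copy with RUL = last cycle of the unit - current cycle + 1."""
    out = df.copy()
    last_cycle = out.groupby(group_column)[time_column].transform("max")
    out["RUL"] = last_cycle - out[time_column] + 1
    return out


# hypothetical data: unit 1 runs for 4 cycles, unit 2 for 3
toy = pd.DataFrame({"UnitNumber": [1, 1, 1, 1, 2, 2, 2],
                    "Cycle":      [1, 2, 3, 4, 1, 2, 3]})
print(add_rul(toy)["RUL"].tolist())  # [4, 3, 2, 1, 3, 2, 1]
```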

In [57]:
# calculate RUL
train_data_RUL = calculate_RUL(data=train_data, time_column="Cycle", group_column="UnitNumber")

In [58]:
train_data_RUL[(train_data_RUL["UnitNumber"] == 1)].tail(5)

Unnamed: 0,UnitNumber,Cycle,Operation Setting 1,Operation Setting 2,Operation Setting 3,Sensor Measure 1,Sensor Measure 2,Sensor Measure 3,Sensor Measure 4,Sensor Measure 5,...,Sensor Measure 13,Sensor Measure 14,Sensor Measure 15,Sensor Measure 16,Sensor Measure 17,Sensor Measure 18,Sensor Measure 19,Sensor Measure 20,Sensor Measure 21,RUL
187,1,188,-0.0067,0.0003,100.0,518.67,643.75,1602.38,1422.78,14.62,...,2388.23,8117.69,8.5207,0.03,396,2388,100.0,38.51,22.9588,5
188,1,189,-0.0006,0.0002,100.0,518.67,644.18,1596.17,1428.01,14.62,...,2388.33,8117.51,8.5183,0.03,395,2388,100.0,38.48,23.1127,4
189,1,190,-0.0027,0.0001,100.0,518.67,643.64,1599.22,1425.95,14.62,...,2388.35,8112.58,8.5223,0.03,398,2388,100.0,38.49,23.0675,3
190,1,191,-0.0,-0.0004,100.0,518.67,643.34,1602.36,1425.77,14.62,...,2388.3,8114.61,8.5174,0.03,394,2388,100.0,38.45,23.1295,2
191,1,192,0.0009,-0.0,100.0,518.67,643.54,1601.41,1427.2,14.62,...,2388.32,8110.93,8.5113,0.03,396,2388,100.0,38.48,22.9649,1


In [89]:
# extract target
#y = train_data_RUL.set_index(["UnitNumber", "Cycle"]).sort_index().RUL.shift(-1) # shift RUL by one for forecasting (DEPRECATED)
y = train_data_RUL.set_index(["UnitNumber", "Cycle"]).sort_index().RUL

In [85]:
# consistency check for example window (1, 40)
print(f"y value for window (1, 40): {y.loc[(1, 40)]}") # unshifted target: RUL at cycle 40, i.e. one more than the RUL at cycle 41
print(f"Train data RUL value for UnitNumber, Cycle (1, 41): {train_data_RUL[(train_data_RUL['UnitNumber'] == 1) & (train_data_RUL['Cycle'] == 41)]['RUL'].values[0]}")

y value for window (1, 40): 153
Train data RUL value for UnitNumber, Cycle (1, 41): 152


Findings:
* With the unshifted target, the y value for window (1, 40) is 153, the RUL at cycle 40 itself, which is one more than the RUL at cycle 41 (152). The two values would coincide only if RUL were shifted by one for forecasting (DEPRECATED).

In [90]:
# compare index and size of X and y
print(f"Shape of X: {X.shape}")
print(f"First 3 rows of index of X: {X.index[:3]}")
print(f"Shape of y: {y.shape}")
print(f"First 3 rows of index of y: {y.index[:3]}")

Shape of X: (20131, 240)
First 3 rows of index of X: MultiIndex([(1, 6),
            (1, 7),
            (1, 8)],
           names=['UnitNumber', 'Cycle'])
Shape of y: (20631,)
First 3 rows of index of y: MultiIndex([(1, 1),
            (1, 2),
            (1, 3)],
           names=['UnitNumber', 'Cycle'])


In [91]:
# check last rows of y
y.tail(5)

UnitNumber  Cycle
100         196      5
            197      4
            198      3
            199      2
            200      1
Name: RUL, dtype: int64

In [92]:
# show nan values in y
print(f"Number of NaN values in y: {y.isnull().sum()}")
y[y.isnull()]

Number of NaN values in y: 0


Series([], Name: RUL, dtype: int64)

Findings:

There are some inconsistencies:
* X is missing the first 5 cycles of each unit (as `min_timeshift` was 5), which can be seen from the index lengths: X has `20131` rows and y has `20631`. The difference of `500` arises because we have `100` units and the first `5` cycles of each unit are not in X, so $100 \cdot 5 = 500$ rows are in y but not in X. --> drop the first `MIN_K` cycles of each unit from y
* (DEPRECATED, only if RUL is shifted by one for forecasting) y would be missing the last cycle of each unit, because there is no next cycle to take the RUL from, so the shifted RUL would be `NaN` there. With the unshifted target used here, y contains no `NaN` values.
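
One way to bring two objects onto a shared index is to intersect their indexes and select with the result. The sketch below shows this on a toy feature frame and target series. All names and values are made up for illustration.

```python
import pandas as pd

names = ["UnitNumber", "Cycle"]
idx_X = pd.MultiIndex.from_tuples([(1, 6), (1, 7), (1, 8)], names=names)
idx_y = pd.MultiIndex.from_tuples([(1, 5), (1, 6), (1, 7), (1, 8)], names=names)

# hypothetical features and targets with partially overlapping indexes
X_toy = pd.DataFrame({"feat": [0.1, 0.2, 0.3]}, index=idx_X)
y_toy = pd.Series([188, 187, 186, 185], index=idx_y, name="RUL")

# keep only the (UnitNumber, Cycle) pairs present in both objects
common = X_toy.index.intersection(y_toy.index)
X_aligned = X_toy.loc[common]
y_aligned = y_toy.loc[common]

print(y_aligned.tolist())  # [187, 186, 185]
```

Selecting both objects with the same `common` index also guarantees that their rows come out in the same order.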

In [93]:
# make X and y consistent
y = y[y.index.isin(X.index)]
X = X[X.index.isin(y.index)]

In [94]:
# compare index of X and y
print(f"Shape of X: {X.shape}")
print(f"First 3 rows of index of X: {X.index[:3]}")
print(f"Shape of y: {y.shape}")
print(f"First 3 rows of index of y: {y.index[:3]}")

Shape of X: (20131, 240)
First 3 rows of index of X: MultiIndex([(1, 6),
            (1, 7),
            (1, 8)],
           names=['UnitNumber', 'Cycle'])
Shape of y: (20131,)
First 3 rows of index of y: MultiIndex([(1, 6),
            (1, 7),
            (1, 8)],
           names=['UnitNumber', 'Cycle'])


In [97]:
print(f"Shape of X: {X.shape}")
display(X.tail(5))
print(f"Shape of y: {y.shape}")
display(y.tail(5))

Shape of X: (20131, 240)


UnitNumber,Cycle,Operation Setting 1__sum_values,Operation Setting 1__median,Operation Setting 1__mean,Operation Setting 1__length,Operation Setting 1__standard_deviation,Operation Setting 1__variance,Operation Setting 1__root_mean_square,Operation Setting 1__maximum,Operation Setting 1__absolute_maximum,Operation Setting 1__minimum,...,Sensor Measure 21__sum_values,Sensor Measure 21__median,Sensor Measure 21__mean,Sensor Measure 21__length,Sensor Measure 21__standard_deviation,Sensor Measure 21__variance,Sensor Measure 21__root_mean_square,Sensor Measure 21__maximum,Sensor Measure 21__absolute_maximum,Sensor Measure 21__minimum
100,196,0.0035,-0.0002,0.000167,21.0,0.001378,2e-06,0.001388,0.0027,0.0027,-0.0017,...,485.5209,23.1218,23.120043,21.0,0.07247,0.005252,23.120156,23.2345,23.2345,22.9735
100,197,0.0036,-0.0002,0.000171,21.0,0.001372,2e-06,0.001382,0.0027,0.0027,-0.0016,...,485.5783,23.1229,23.122776,21.0,0.07282,0.005303,23.122891,23.2345,23.2345,22.9735
100,198,0.0051,-0.0001,0.000243,21.0,0.001342,2e-06,0.001364,0.0027,0.0027,-0.0016,...,485.4764,23.1229,23.117924,21.0,0.081385,0.006624,23.118067,23.2345,23.2345,22.9333
100,199,0.0035,-0.0002,0.000167,21.0,0.001371,2e-06,0.001381,0.0027,0.0027,-0.0016,...,485.3152,23.1218,23.110248,21.0,0.078455,0.006155,23.110381,23.2345,23.2345,22.9333
100,200,-0.0017,-0.0004,-8.1e-05,21.0,0.001482,2e-06,0.001484,0.0027,0.0032,-0.0032,...,485.1989,23.1173,23.10471,21.0,0.078252,0.006123,23.104842,23.2345,23.2345,22.9333


Shape of y: (20131,)


UnitNumber  Cycle
100         196      5
            197      4
            198      3
            199      2
            200      1
Name: RUL, dtype: int64

## 3. Modification for Test Set
Note:
* The test set differs from the training set: its RUL targets come from a separate file and consist of a single value per unit, namely the RUL at the last observed cycle of that unit. Thus, we need to filter the data.
* Use only the last window of each unit in the test set. The target RUL is in the `test_RUL_data` dataframe, which holds the true RUL at the last cycle of each unit and therefore has shape (100, 1).

--> Filter rolled data to only include the last window of each UnitNumber:
* Ensure the selected cycles are sorted in ascending order of UnitNumber and Cycle values.
* Group the filtered data by UnitNumber.
* For each UnitNumber group, select the last `MAX_K` rows.
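
An alternative way to pick the last window is to select it by its window id, since the last window of a unit is the one whose id ends at the unit's final cycle. The sketch below shows this on a toy rolled frame. The frame is made up, and this is not necessarily how `create_rolling_windows_datasets` does it.

```python
import pandas as pd

# hypothetical rolled frame: unit 1 has windows ending at cycles 2 and 3,
# unit 2 has a single window ending at cycle 2
rolled = pd.DataFrame({
    "UnitNumber": [1, 1, 1, 1, 1, 2, 2],
    "Cycle":      [1, 2, 1, 2, 3, 1, 2],
    "id": [(1, 2), (1, 2), (1, 3), (1, 3), (1, 3), (2, 2), (2, 2)],
})

# final cycle of each unit, turned into the ids of the last windows
last_cycle = rolled.groupby("UnitNumber")["Cycle"].max()
last_ids = {(int(unit), int(cycle)) for unit, cycle in last_cycle.items()}

# keep only the rows that belong to one of those windows
is_last = rolled["id"].apply(lambda window_id: window_id in last_ids)
last_windows = rolled[is_last]

print(sorted(set(last_windows["id"])))  # [(1, 3), (2, 2)]
```

Selecting by id keeps every row of the last window regardless of its length, so it does not depend on a fixed row count per unit.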


In [31]:
# generate rolling windows for test data
test_data_rolled = roll_time_series(test_data, column_id="UnitNumber", column_sort="Cycle", max_timeshift=MAX_K, min_timeshift=MIN_K)

Rolling: 100%|██████████| 38/38 [00:03<00:00,  9.65it/s]


In [32]:
test_data_rolled.head(5)

Unnamed: 0,UnitNumber,Cycle,Operation Setting 1,Operation Setting 2,Operation Setting 3,Sensor Measure 1,Sensor Measure 2,Sensor Measure 3,Sensor Measure 4,Sensor Measure 5,...,Sensor Measure 13,Sensor Measure 14,Sensor Measure 15,Sensor Measure 16,Sensor Measure 17,Sensor Measure 18,Sensor Measure 19,Sensor Measure 20,Sensor Measure 21,id
0,1,1,0.0023,0.0003,100.0,518.67,643.02,1585.29,1398.21,14.62,...,2388.03,8125.55,8.4052,0.03,392,2388,100.0,38.86,23.3735,"(1, 6)"
1,1,2,-0.0027,-0.0003,100.0,518.67,641.71,1588.45,1395.42,14.62,...,2388.06,8139.62,8.3803,0.03,393,2388,100.0,39.02,23.3916,"(1, 6)"
2,1,3,0.0003,0.0001,100.0,518.67,642.46,1586.94,1401.34,14.62,...,2388.03,8130.1,8.4441,0.03,393,2388,100.0,39.08,23.4166,"(1, 6)"
3,1,4,0.0042,0.0,100.0,518.67,642.44,1584.12,1406.42,14.62,...,2388.05,8132.9,8.3917,0.03,391,2388,100.0,39.0,23.3737,"(1, 6)"
4,1,5,0.0014,0.0,100.0,518.67,642.51,1587.19,1401.92,14.62,...,2388.03,8129.54,8.4031,0.03,390,2388,100.0,38.99,23.413,"(1, 6)"


In [33]:
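# filter to only include the last window of each unit
# note: tail(MAX_K) keeps the last 20 rows per unit, while a full window holds MAX_K + 1 = 21 rows,
# so the oldest cycle of each unit's last window is left out here
filtered_test_data_rolled = test_data_rolled.groupby("UnitNumber").tail(MAX_K)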

In [34]:
filtered_test_data_rolled.head(40)

Unnamed: 0,UnitNumber,Cycle,Operation Setting 1,Operation Setting 2,Operation Setting 3,Sensor Measure 1,Sensor Measure 2,Sensor Measure 3,Sensor Measure 4,Sensor Measure 5,...,Sensor Measure 13,Sensor Measure 14,Sensor Measure 15,Sensor Measure 16,Sensor Measure 17,Sensor Measure 18,Sensor Measure 19,Sensor Measure 20,Sensor Measure 21,id
76785,1,12,0.0026,0.0003,100.0,518.67,642.54,1587.43,1397.82,14.62,...,2388.06,8132.33,8.3984,0.03,391,2388,100.0,39.11,23.3845,"(1, 31)"
76786,1,13,-0.0056,0.0003,100.0,518.67,641.94,1589.09,1403.94,14.62,...,2388.03,8131.12,8.4166,0.03,392,2388,100.0,39.08,23.3677,"(1, 31)"
76787,1,14,0.0017,-0.0004,100.0,518.67,642.23,1583.16,1402.88,14.62,...,2388.06,8130.3,8.4293,0.03,392,2388,100.0,39.03,23.4572,"(1, 31)"
76788,1,15,-0.0003,-0.0003,100.0,518.67,642.5,1584.81,1398.79,14.62,...,2388.0,8133.62,8.4163,0.03,392,2388,100.0,39.04,23.3672,"(1, 31)"
76789,1,16,-0.0018,0.0003,100.0,518.67,642.32,1584.51,1407.76,14.62,...,2388.1,8133.83,8.43,0.03,390,2388,100.0,38.87,23.3484,"(1, 31)"
76790,1,17,0.0014,0.0002,100.0,518.67,642.19,1582.7,1404.12,14.62,...,2388.02,8126.78,8.4577,0.03,391,2388,100.0,39.09,23.3409,"(1, 31)"
76791,1,18,0.0035,0.0001,100.0,518.67,642.59,1586.53,1403.69,14.62,...,2388.06,8133.22,8.4323,0.03,391,2388,100.0,38.96,23.4481,"(1, 31)"
76792,1,19,0.0029,0.0001,100.0,518.67,642.43,1585.58,1402.3,14.62,...,2388.01,8129.31,8.3892,0.03,391,2388,100.0,39.06,23.3809,"(1, 31)"
76793,1,20,0.0011,-0.0001,100.0,518.67,642.61,1587.78,1400.7,14.62,...,2388.05,8128.59,8.4099,0.03,392,2388,100.0,39.0,23.3325,"(1, 31)"
76794,1,21,0.0038,-0.0002,100.0,518.67,642.7,1583.3,1399.2,14.62,...,2388.11,8126.86,8.4174,0.03,392,2388,100.0,38.96,23.4025,"(1, 31)"


In [35]:
# extract features for test data
X_test = extract_features(filtered_test_data_rolled.drop(["UnitNumber"], axis=1),
                          column_id="id", column_sort="Cycle",
                          default_fc_parameters=MinimalFCParameters(),
                          impute_function=impute, show_warnings=False)
# add index names
X_test.index = X_test.index.rename(["UnitNumber", "Cycle"])

Feature Extraction: 100%|██████████| 40/40 [00:02<00:00, 15.03it/s]


In [36]:
# extract target for test data - match index of y_test with X_test
y_test = test_RUL_data.copy()  # copy, so that the index of test_RUL_data itself is not overwritten
y_test.index = X_test.index

In [37]:
# check if X_test and y_test match
print(f"Shape of X_test: {X_test.shape}")
display(X_test.head(5))
print(f"Shape of y_test: {y_test.shape}")
display(y_test.head(5))

Shape of X_test: (100, 240)


UnitNumber,Cycle,Operation Setting 1__sum_values,Operation Setting 1__median,Operation Setting 1__mean,Operation Setting 1__length,Operation Setting 1__standard_deviation,Operation Setting 1__variance,Operation Setting 1__root_mean_square,Operation Setting 1__maximum,Operation Setting 1__absolute_maximum,Operation Setting 1__minimum,...,Sensor Measure 21__sum_values,Sensor Measure 21__median,Sensor Measure 21__mean,Sensor Measure 21__length,Sensor Measure 21__standard_deviation,Sensor Measure 21__variance,Sensor Measure 21__root_mean_square,Sensor Measure 21__maximum,Sensor Measure 21__absolute_maximum,Sensor Measure 21__minimum
1,31,0.0181,0.0013,0.000905,20.0,0.002373,6e-06,0.00254,0.0047,0.0056,-0.0056,...,467.5243,23.37235,23.376215,20.0,0.034147,0.001166,23.37624,23.4572,23.4572,23.3186
2,49,-0.0091,5e-05,-0.000455,20.0,0.002212,5e-06,0.002258,0.0032,0.0039,-0.0039,...,465.9096,23.30065,23.29548,20.0,0.041742,0.001742,23.295517,23.3693,23.3693,23.2196
3,126,-0.0044,-0.00025,-0.00022,20.0,0.002824,8e-06,0.002832,0.0057,0.0057,-0.0057,...,465.0032,23.2499,23.25016,20.0,0.043298,0.001875,23.2502,23.3559,23.3559,23.1749
4,106,-4.336809e-19,5e-05,-2.1684039999999998e-20,20.0,0.001972,4e-06,0.001972,0.0031,0.0048,-0.0048,...,465.5171,23.27205,23.275855,20.0,0.056855,0.003232,23.275924,23.3769,23.3769,23.1676
5,98,0.0164,0.00105,0.00082,20.0,0.002243,5e-06,0.002388,0.0062,0.0062,-0.0035,...,465.1504,23.24325,23.25752,20.0,0.064874,0.004209,23.25761,23.4117,23.4117,23.1476


Shape of y_test: (100, 1)


UnitNumber,Cycle,RUL
1,31,112
2,49,98
3,126,69
4,106,82
5,98,91


## 4. Summarize in Function

In [51]:
from src.data_preprocessing import create_rolling_windows_datasets

In [98]:
%%time
X_train, y_train, X_test, y_test = create_rolling_windows_datasets(train_data, test_data, test_RUL_data, column_id="UnitNumber", column_sort="Cycle", max_timeshift=20, min_timeshift=5)

2024-05-12 18:31:43 [src.data_preprocessing:59] [INFO] >>>> Creating rolling windows for train data...


Rolling: 100%|██████████| 37/37 [00:10<00:00,  3.63it/s]


2024-05-12 18:31:54 [src.data_preprocessing:62] [INFO] >>>> Extracting features for train data...


Feature Extraction: 100%|██████████| 40/40 [02:19<00:00,  3.48s/it]


2024-05-12 18:34:27 [src.data_preprocessing:70] [INFO] >>>> Calculating target for train data...
2024-05-12 18:34:27 [src.data_preprocessing:77] [INFO] >>>> Creating rolling windows for test data...


Rolling: 100%|██████████| 38/38 [00:07<00:00,  5.18it/s]


2024-05-12 18:34:35 [src.data_preprocessing:82] [INFO] >>>> Extracting features for test data...


Feature Extraction: 100%|██████████| 40/40 [00:04<00:00,  8.07it/s]


2024-05-12 18:34:40 [src.data_preprocessing:90] [INFO] >>>> Matching target index with test data...
2024-05-12 18:34:40 [src.data_preprocessing:94] [INFO] >>>> Shape of X_train: (20131, 240)
2024-05-12 18:34:40 [src.data_preprocessing:95] [INFO] >>>> Shape of y_train: (20131, 1)
2024-05-12 18:34:40 [src.data_preprocessing:96] [INFO] >>>> Shape of X_test: (100, 240)
2024-05-12 18:34:40 [src.data_preprocessing:97] [INFO] >>>> Shape of y_test: (100, 1)
CPU times: total: 49.3 s
Wall time: 2min 57s


In [101]:
# check if X_train and y_train match
print(f"Shape of X_train: {X_train.shape}")
display(X_train.head(5))
print(f"Shape of y_train: {y_train.shape}")
display(y_train.head(5))

Shape of X_train: (20131, 240)


UnitNumber,Cycle,Operation Setting 1__sum_values,Operation Setting 1__median,Operation Setting 1__mean,Operation Setting 1__length,Operation Setting 1__standard_deviation,Operation Setting 1__variance,Operation Setting 1__root_mean_square,Operation Setting 1__maximum,Operation Setting 1__absolute_maximum,Operation Setting 1__minimum,...,Sensor Measure 21__sum_values,Sensor Measure 21__median,Sensor Measure 21__mean,Sensor Measure 21__length,Sensor Measure 21__standard_deviation,Sensor Measure 21__variance,Sensor Measure 21__root_mean_square,Sensor Measure 21__maximum,Sensor Measure 21__absolute_maximum,Sensor Measure 21__minimum
1,6,-0.0086,-0.0013,-0.001433,6.0,0.00234,5e-06,0.002744,0.0019,0.0043,-0.0043,...,140.332,23.38915,23.388667,6.0,0.029032,0.000843,23.388685,23.4236,23.4236,23.3442
1,7,-0.0076,-0.0007,-0.001086,7.0,0.002328,5e-06,0.002568,0.0019,0.0043,-0.0043,...,163.7094,23.3774,23.387057,7.0,0.027166,0.000738,23.387073,23.4236,23.4236,23.3442
1,8,-0.011,-0.0013,-0.001375,8.0,0.002308,5e-06,0.002687,0.0019,0.0043,-0.0043,...,187.02,23.37565,23.3775,8.0,0.035848,0.001285,23.377527,23.4236,23.4236,23.3106
1,9,-0.0102,-0.0007,-0.001133,9.0,0.002281,5e-06,0.002547,0.0019,0.0043,-0.0043,...,210.4266,23.3774,23.380733,9.0,0.035014,0.001226,23.38076,23.4236,23.4236,23.3106
1,10,-0.0135,-0.0013,-0.00135,10.0,0.002259,5e-06,0.002632,0.0019,0.0043,-0.0043,...,233.896,23.3909,23.3896,10.0,0.042555,0.001811,23.389639,23.4694,23.4694,23.3106


Shape of y_train: (20131, 1)


UnitNumber,Cycle,RUL
1,6,187
1,7,186
1,8,185
1,9,184
1,10,183


In [100]:
# check if X_test and y_test match
print(f"Shape of X_test: {X_test.shape}")
display(X_test.head(5))
print(f"Shape of y_test: {y_test.shape}")
display(y_test.head(5))

Shape of X_test: (100, 240)


UnitNumber,Cycle,Sensor Measure 10__sum_values,Sensor Measure 10__median,Sensor Measure 10__mean,Sensor Measure 10__length,Sensor Measure 10__standard_deviation,Sensor Measure 10__variance,Sensor Measure 10__root_mean_square,Sensor Measure 10__maximum,Sensor Measure 10__absolute_maximum,Sensor Measure 10__minimum,...,Sensor Measure 9__sum_values,Sensor Measure 9__median,Sensor Measure 9__mean,Sensor Measure 9__length,Sensor Measure 9__standard_deviation,Sensor Measure 9__variance,Sensor Measure 9__root_mean_square,Sensor Measure 9__maximum,Sensor Measure 9__absolute_maximum,Sensor Measure 9__minimum
1,31,26.0,1.3,1.3,20.0,2.220446e-16,4.930381e-32,1.3,1.3,1.3,1.3,...,180961.92,9046.675,9048.096,20.0,4.379112,19.176624,9048.09706,9056.4,9056.4,9041.12
2,49,26.0,1.3,1.3,20.0,2.220446e-16,4.930381e-32,1.3,1.3,1.3,1.3,...,180961.62,9048.33,9048.081,20.0,3.193564,10.198849,9048.081564,9055.9,9055.9,9042.03
3,126,26.0,1.3,1.3,20.0,2.220446e-16,4.930381e-32,1.3,1.3,1.3,1.3,...,181012.73,9049.89,9050.6365,20.0,4.084531,16.683393,9050.637422,9058.16,9058.16,9044.93
4,106,26.0,1.3,1.3,20.0,2.220446e-16,4.930381e-32,1.3,1.3,1.3,1.3,...,181065.25,9051.79,9053.2625,20.0,4.500998,20.258979,9053.263619,9062.04,9062.04,9046.96
5,98,26.0,1.3,1.3,20.0,2.220446e-16,4.930381e-32,1.3,1.3,1.3,1.3,...,181051.76,9053.015,9052.588,20.0,3.830631,14.673736,9052.58881,9059.29,9059.29,9044.76


Shape of y_test: (100, 1)


UnitNumber,Cycle,RUL
1,31,112
2,49,98
3,126,69
4,106,82
5,98,91


In [41]:
# save processed data (as pickle)
timestamp = time.strftime("%Y%m%d-%H%M%S")
X_train.to_pickle(f"{config['paths']['processed_data_dir']}ex2_X_train_{timestamp}.pkl")
y_train.to_pickle(f"{config['paths']['processed_data_dir']}ex2_y_train_{timestamp}.pkl")
X_test.to_pickle(f"{config['paths']['processed_data_dir']}ex2_X_test_{timestamp}.pkl")
y_test.to_pickle(f"{config['paths']['processed_data_dir']}ex2_y_test_{timestamp}.pkl")

# Excursus - Extracting Features Without tsfresh