# Pre-processing

In this section, the *Markdown* cells highlight how the pre-processing techniques differ from those in the `Pre-processing & RF & SVM` notebook. To follow this Notebook, that notebook should therefore be read first.

Firstly, the **`os`** library is used to set the `TF_CPP_MIN_LOG_LEVEL` environment variable, which controls TensorFlow's low-level logging. Setting the level to `2` filters out **INFO** and **WARNING** messages, so that only **ERROR** messages are displayed. The variable must be set before TensorFlow is imported, otherwise it has no effect. This keeps the cell output readable, as only critical issues are shown.

In [2]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

### Data loading

In [3]:
import pandas as pd

df = pd.read_parquet('./../data/dataset.parquet', engine='pyarrow')

In [4]:
# Check the dataset has been imported correctly
df.shape

(679045, 17)

## Pre-processing steps 

The key distinction in the pre-processing lies in the **reshaping of the data**, which affects the output of both the **sliding window process** and **Random Undersampling (RUS)**.  
For **Random Forest (RF)** and **Support Vector Machine (SVM)**, the data is structured as a **2D array** of shape `(samples, features)`, where each row represents an independent instance with all the readings of a window stacked into one feature vector. This format suits traditional machine learning models, which do not account for temporal dependencies. For **Long Short-Term Memory (LSTM)**, however, the data must be reshaped into a **3D array** of shape `(samples, timesteps, features)` to preserve the **sequential nature** of the time-series data.  
This transformation is crucial, as LSTM models rely on **temporal dependencies** rather than isolated feature vectors. Without this structure, the model would lose the ability to recognise patterns over time, which would limit its effectiveness in capturing trends and predicting alerts accurately.
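As a rough illustration of the two layouts, the sketch below uses NumPy on random placeholder data. The window width of 120 and the 13 features are taken from this notebook; the sample count of 4 and all variable names are assumptions made for the example. It also contrasts two ways of turning the flattened rows back into a 3D array: a single timestep holding all 1560 values, which is the shape the RUS cell further down produces, and 120 timesteps of 13 features each, shown here only for comparison.

```python
import numpy as np

n_samples, width, n_features = 4, 120, 13  # 4 is an arbitrary toy sample count

# 3D layout for an LSTM: (samples, timesteps, features)
windows_3d = np.random.default_rng(0).normal(size=(n_samples, width, n_features))

# 2D layout for RF/SVM: each window flattened into one row of width * n_features values
windows_2d = windows_3d.reshape(n_samples, -1)
print(windows_2d.shape)  # (4, 1560)

# Option 1: one timestep that contains the whole flattened window
single_step = windows_2d.reshape(n_samples, 1, width * n_features)
print(single_step.shape)  # (4, 1, 1560)

# Option 2: restore the time axis; valid because the flattening was row-major
per_reading = windows_2d.reshape(n_samples, width, n_features)
print(per_reading.shape)  # (4, 120, 13)
print(np.array_equal(per_reading, windows_3d))  # True
```

With option 1 the LSTM processes each window in a single step, whereas option 2 lets it iterate over the 120 readings. Which of the two suits the task is a modelling decision.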

### Dataset optimisation

As implemented in the **Pre-processing & RF & SVM** notebook, the `session_counter` and `time_to_failure` features are removed from the dataset, as they are not relevant to the classification task. Eliminating these features ensures that the model focuses solely on the meaningful variables that contribute to the prediction of `alert_11`.

In [5]:
df.drop(columns=['session_counter', 'time_to_failure'], inplace=True)

In [6]:
# Display the dataset to ensure the columns have been removed
df.head()

| | Timestamp | Flag roping | Platform Position [°] | Platform Motor frequency [HZ] | Temperature platform drive [°C] | Temperature slave drive [°C] | Temperature hoist drive [°C] | Tensione totale film [%] | Current speed cart [%] | Platform motor speed [%] | Lifting motor speed [RPM] | Platform rotation speed [RPM] | Slave rotation speed [M/MIN] | Lifting speed rotation [M/MIN] | alert_11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-06-07 04:14:30.742 | 31.0 | 115.0 | 5200.0 | 18.0 | 22.0 | 18.0 | 181.0 | 0.0 | 100.0 | 0.0 | 84.0 | 116.0 | 0.0 | 0.0 |
| 1 | 2021-06-07 04:14:35.742 | 31.0 | 115.0 | 5200.0 | 18.0 | 22.0 | 18.0 | 181.0 | 0.0 | 100.0 | 0.0 | 84.0 | 116.0 | 0.0 | 0.0 |
| 2 | 2021-06-07 04:14:40.742 | 31.0 | 115.0 | 5200.0 | 18.0 | 22.0 | 18.0 | 181.0 | 0.0 | 100.0 | 0.0 | 84.0 | 116.0 | 0.0 | 0.0 |
| 3 | 2021-06-07 04:14:45.742 | 31.0 | 115.0 | 5200.0 | 18.0 | 22.0 | 18.0 | 181.0 | 0.0 | 100.0 | 0.0 | 84.0 | 116.0 | 0.0 | 0.0 |
| 4 | 2021-06-07 04:14:50.742 | 31.0 | 115.0 | 5200.0 | 18.0 | 22.0 | 18.0 | 181.0 | 0.0 | 100.0 | 0.0 | 84.0 | 116.0 | 0.0 | 0.0 |


As is common in most time-series datasets, the `Timestamp` column is set as the index of the dataset. This allows for efficient time-based operations, such as resampling, sliding window calculations, and trend analysis, while preserving the chronological structure of the data.

In [7]:
df.set_index('Timestamp', inplace=True)

In [8]:
# Check if Timestamp has become the index of the dataset
df.index.name

'Timestamp'
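To give an idea of the time-based operations a timestamp index enables, here is a minimal sketch on a toy frame. It assumes the `Timestamp` column is stored as a datetime type; the `value` column and the variable names are hypothetical, and the 5-second spacing simply mirrors the rows shown by `df.head()`.

```python
import pandas as pd

# Toy frame with one reading every 5 seconds
toy = pd.DataFrame({
    "Timestamp": pd.date_range("2021-06-07 04:14:30.742", periods=4,
                               freq=pd.Timedelta(seconds=5)),
    "value": [1.0, 2.0, 3.0, 4.0],  # hypothetical sensor column
}).set_index("Timestamp")

# Select rows by time range (both bounds are inclusive)
subset = toy.loc[pd.Timestamp("2021-06-07 04:14:35"):pd.Timestamp("2021-06-07 04:14:45")]
print(len(subset))  # 2

# Resample into 10-second bins and average the readings in each bin
resampled = toy.resample(pd.Timedelta(seconds=10)).mean()
print(resampled["value"].tolist())  # [1.5, 3.5]
```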

### Features extraction

In this section, the label (`y`) and the features (`X`) for the models are defined, with `alert_11` serving as the target variable and all other columns being designated as features. This decision is based on the fact that `alert_11` represents the primary event of interest, aligning with the study’s objective. By including all other columns as features, the models can leverage the full range of available sensor data to identify patterns and relationships that may contribute to predicting `alert_11`.

In [9]:
# State the label and the features
import numpy as np

label = np.array(['alert_11'])
features = np.array(df.columns.difference(label))

print(f"-> Label (shape={label.shape}): {label}")
print(f"-> Features (shape={features.shape}): {features}")

-> Label (shape=(1,)): ['alert_11']
-> Features (shape=(13,)): ['Current speed cart [%]' 'Flag roping' 'Lifting motor speed [RPM]'
 'Lifting speed rotation [M/MIN]' 'Platform Motor frequency [HZ]'
 'Platform Position [°]' 'Platform motor speed [%]'
 'Platform rotation speed [RPM]' 'Slave rotation speed [M/MIN]'
 'Temperature hoist drive [°C]' 'Temperature platform drive [°C]'
 'Temperature slave drive [°C]' 'Tensione totale film [%]']


In [10]:
# Extract and assign the label and the features, X and y
X = df[features]
y = df[label]

print(f"-> X (shape={X.shape})")
print(f"-> y (shape={y.shape})")

-> X (shape=(679045, 13))
-> y (shape=(679045, 1))


### Sliding window

In [11]:
# Prepare the label and features for the window
X = X.to_numpy()
y = y.to_numpy().flatten()

In [6]:
# Create the window
import numpy as np

def window(X_data, y_data, width: int, shift: int):
    """Build sliding windows of `width` rows; a window is labelled 1 if any
    alert occurs in the `shift` rows that immediately follow it, else 0."""
    X_wins, y_wins = [], []

    # Only keep start positions where the window and its look-ahead both fit
    for start in range(X_data.shape[0] - width - shift + 1):
        end = start + width
        X_wins.append(X_data[start:end])
        # Label from the look-ahead horizon that follows the window
        y_wins.append(int(np.any(y_data[end:end + shift] == 1)))

    X_wins = np.array(X_wins)  # 3D array: (n_windows, width, n_features)
    y_wins = np.array(y_wins)
    # Flatten each window into one row (2D) so that the RUS step can resample it
    return X_wins.reshape(X_wins.shape[0], -1), y_wins

In [7]:
# State the variables and the size of the window
X_wins, y_wins = window(X, y, width=120, shift=180)
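To make the labelling rule concrete, the sketch below applies the same logic to a tiny toy series. The helper name `make_windows` and the toy arrays are hypothetical; it is a compact restatement for illustration, not the function used above.

```python
import numpy as np

def make_windows(X_data, y_data, width, shift):
    """Hypothetical compact helper mirroring the windowing logic above."""
    n_windows = X_data.shape[0] - width - shift + 1
    X_out = np.array([X_data[i:i + width] for i in range(n_windows)])
    y_out = np.array([int(np.any(y_data[i + width:i + width + shift] == 1))
                      for i in range(n_windows)])
    return X_out.reshape(n_windows, -1), y_out

X_toy = np.arange(16).reshape(8, 2)          # 8 time steps, 2 features
y_toy = np.array([0, 0, 0, 0, 1, 0, 0, 0])   # a single alert at step 4

X_flat, y_lab = make_windows(X_toy, y_toy, width=3, shift=2)
print(X_flat.shape)  # (4, 6)
print(y_lab)         # [1 1 0 0]
```

Only the first two windows are labelled 1, because the alert at step 4 falls inside their two-step look-ahead. Applied to the full dataset with `width=120` and `shift=180`, the same arithmetic gives 679045 - 120 - 180 + 1 = 678746 windows of 120 × 13 = 1560 values each. If the 5-second spacing seen in `df.head()` holds throughout, this corresponds to 10 minutes of history used to predict an alert in the following 15 minutes.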

### Random Under Sampler (RUS)

In [8]:
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=0)

X_res, y_res = rus.fit_resample(X_wins, y_wins)
X_res = X_res.reshape(X_res.shape[0], 1, X_res.shape[1]) # Add a timestep axis of length 1 to get a 3D array: (samples, 1, width * n_features)
print(X_res.shape)

(6648, 1, 1560)
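With its default settings, `RandomUnderSampler` randomly discards majority-class rows until every class has as many rows as the minority class, so the 6648 rows above correspond to 3324 windows per class. The NumPy sketch below imitates that idea on toy data. It is not imblearn's implementation, and the helper name `undersample` is hypothetical.

```python
import numpy as np

def undersample(X, y, seed=0):
    """Hypothetical helper: keep an equally sized random subset of every class,
    where the size is the number of rows in the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()  # keep the selected rows in their original order
    return X[keep], y[keep]

X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])  # 7 majority rows, 3 minority rows

X_bal, y_bal = undersample(X_toy, y_toy)
print(np.bincount(y_bal))  # [3 3]
```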


# Modelling

### Long Short-Term Memory (LSTM)

In [9]:
# Set up stratified 5-fold cross-validation
from sklearn.model_selection import StratifiedKFold

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_metrics = []
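As a small illustration of what stratification provides, the sketch below splits a balanced toy label vector with the same `StratifiedKFold` settings and counts the classes in each validation fold. The toy arrays are placeholders, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X_toy = np.zeros((20, 1))               # placeholder features
y_toy = np.array([0] * 10 + [1] * 10)   # balanced labels, as after undersampling

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
val_counts = [np.bincount(y_toy[val_idx]).tolist()
              for _, val_idx in skf.split(X_toy, y_toy)]
print(val_counts)  # [[2, 2], [2, 2], [2, 2], [2, 2], [2, 2]]
```

Each validation fold keeps the class ratio of the full set, so the per-fold precision and recall are computed on equally balanced data.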

In [10]:
# Build, train and evaluate a fresh LSTM model for each fold
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Bidirectional, BatchNormalization, Dense, Dropout
from tensorflow.keras.regularizers import L2
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.metrics import Precision, Recall

for fold, (train_idx, val_idx) in enumerate(kf.split(X_res, y_res)):
    print(f"Training fold {fold+1}...")

    X_train, X_val = X_res[train_idx], X_res[val_idx]
    y_train, y_val = y_res[train_idx], y_res[val_idx]

    model = Sequential()
    model.add(Bidirectional(LSTM(128, return_sequences=True, kernel_regularizer=L2(0.001)), input_shape=(X_train.shape[1], X_train.shape[2])))
    model.add(BatchNormalization())
    model.add(Bidirectional(LSTM(64, return_sequences=True, kernel_regularizer=L2(0.001))))
    model.add(BatchNormalization())
    model.add(Bidirectional(LSTM(128, return_sequences=False, kernel_regularizer=L2(0.001))))
    model.add(BatchNormalization())
    model.add(Dense(units=64, activation="relu", kernel_regularizer=L2(0.001)))
    model.add(Dropout(0.3))
    model.add(Dense(units=32, activation="relu", kernel_regularizer=L2(0.001)))
    model.add(Dense(units=1, activation="sigmoid"))

    model.compile(
        loss='binary_crossentropy',
        optimizer='adam',
        metrics=['accuracy', Precision(), Recall()]
    )

    early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
    history = model.fit(
        X_train, y_train,
        epochs=10,
        batch_size=32,
        validation_data=(X_val, y_val),
        callbacks=[early_stop],
        verbose=1
    )

    score = model.evaluate(X_val, y_val, batch_size=32, verbose=0)
    fold_metrics.append({
        'fold': fold + 1,
        'loss': score[0],
        'accuracy': score[1],
        'precision': score[2],
        'recall': score[3]
    })

Training fold 1...
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Training fold 2...
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Training fold 3...
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Training fold 4...
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Training fold 5...
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [11]:
# Print the model results
metrics_df = pd.DataFrame(fold_metrics)

print("\nCross-Validation Results:")
print(metrics_df)
print("\nAverage metrics across all folds:")
print(metrics_df.mean())


Cross-Validation Results:
   fold      loss  accuracy  precision    recall
0     1  0.898425  0.760150   0.805654  0.685714
1     2  0.892884  0.754887   0.841046  0.628571
2     3  0.901770  0.734586   0.762626  0.681203
3     4  0.939431  0.726862   0.801603  0.602410
4     5  0.840650  0.782543   0.873016  0.661654

Average metrics across all folds:
fold         3.000000
loss         0.894632
accuracy     0.751806
precision    0.816789
recall       0.651911
dtype: float64
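The averaged output above also averages the `fold` column, which is an identifier rather than a metric. A possible way to summarise the folds is sketched below: it drops that column, derives an F1 score per fold from precision and recall, and reports the standard deviation next to the mean. The numbers are copied from the table above; the F1 column and the variable names `scores` and `summary` are additions made for this example.

```python
import pandas as pd

# Per-fold metrics copied from the cross-validation table above
metrics_df = pd.DataFrame({
    "fold": [1, 2, 3, 4, 5],
    "loss": [0.898425, 0.892884, 0.901770, 0.939431, 0.840650],
    "accuracy": [0.760150, 0.754887, 0.734586, 0.726862, 0.782543],
    "precision": [0.805654, 0.841046, 0.762626, 0.801603, 0.873016],
    "recall": [0.685714, 0.628571, 0.681203, 0.602410, 0.661654],
})

scores = metrics_df.drop(columns="fold")  # the fold number is not a metric

# F1 per fold: harmonic mean of precision and recall
scores["f1"] = (2 * scores["precision"] * scores["recall"]
                / (scores["precision"] + scores["recall"]))

summary = scores.agg(["mean", "std"])
print(summary.round(4))
```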
