# Preprocessing of Dataset
- Dataset file: "PM_train.txt"

Dataset Column Description (from official dataset data card)
- id
  - "A unique identifier for each data entry."
- cycle
  - "Denotes the operational cycle or period, indicating the stage or duration of engine operation."
- setting1, setting2, setting3
  - "Numerical values representing various operational settings or parameters of the aircraft engine."
- s1 to s21
  - "Numeric sensor readings obtained from 21 different sensors installed on the engine. These readings encompass a range of physical measurements, including but not limited to temperature, pressure, and other relevant parameters."

-------------------------------------------------------------------------------------------------------------------------------------------------------
Conversion from txt document to CSV. Dataset loading. 

In [41]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Preview the first 5 raw lines before converting the .txt file to .csv
with open('../data/dataset_a_s_e_p/PM_test.txt') as file:
    for _ in range(5):
        print(file.readline())

column_names = ['id', 'cycle', 'setting1', 'setting2', 'setting3'] + \
                [f's{i}' for i in range(1, 22)]

pm_test_df = pd.read_csv(
    '../data/dataset_a_s_e_p/PM_test.txt',
    sep=r'\s+',
    header=None,
    names=column_names
)

print(pm_test_df.head(20))
pm_test_df.to_csv('../data/dataset_a_s_e_p/PM_test.csv', index=False)


1 1 0.0023 0.0003 100.0 518.67 643.02 1585.29 1398.21 14.62 21.61 553.90 2388.04 9050.17 1.30 47.20 521.72 2388.03 8125.55 8.4052 0.03 392 2388 100.00 38.86 23.3735  

1 2 -0.0027 -0.0003 100.0 518.67 641.71 1588.45 1395.42 14.62 21.61 554.85 2388.01 9054.42 1.30 47.50 522.16 2388.06 8139.62 8.3803 0.03 393 2388 100.00 39.02 23.3916  

1 3 0.0003 0.0001 100.0 518.67 642.46 1586.94 1401.34 14.62 21.61 554.11 2388.05 9056.96 1.30 47.50 521.97 2388.03 8130.10 8.4441 0.03 393 2388 100.00 39.08 23.4166  

1 4 0.0042 0.0000 100.0 518.67 642.44 1584.12 1406.42 14.62 21.61 554.07 2388.03 9045.29 1.30 47.28 521.38 2388.05 8132.90 8.3917 0.03 391 2388 100.00 39.00 23.3737  

1 5 0.0014 0.0000 100.0 518.67 642.51 1587.19 1401.92 14.62 21.61 554.16 2388.01 9044.55 1.30 47.31 522.15 2388.03 8129.54 8.4031 0.03 390 2388 100.00 38.99 23.4130  

    id  cycle  setting1  setting2  setting3      s1      s2  ...     s15   s16    s17     s18    s19    s20      s21
0    1      1    0.0023    0.0003     100
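
The raw lines printed above end in trailing spaces, which is why the file is read with `sep=r'\s+'` rather than a single-space separator. The following is a minimal sketch of that behavior on made-up data; the two sample rows and the shortened column list are illustrative assumptions, not values from the dataset.

```python
import io

import pandas as pd

# Two made-up rows in the same layout as the raw file:
# space-separated, trailing spaces, no header row.
raw = "1 1 0.5 10.0  \n1 2 0.7 11.0  \n"

# Hypothetical, shortened column list for the sketch
demo_columns = ['id', 'cycle', 'setting1', 's1']

# A whitespace-regex separator collapses runs of spaces and
# ignores the trailing ones, so no extra empty column appears
demo_df = pd.read_csv(
    io.StringIO(raw),
    sep=r'\s+',
    header=None,
    names=demo_columns
)

print(demo_df)
```

If the number of names passed to `names=` did not match the number of fields per line, the resulting frame would show shifted or all-NaN columns, so checking `demo_df.shape` right after loading is a cheap sanity test.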

-------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset cleaning via generalized NA row drop.
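
Before dropping rows, it can help to see which rows actually contain missing values. The sketch below uses a small made-up frame (`toy_df` and its values are assumptions for illustration) to show how the affected rows can be listed and how many rows a default `dropna()` removes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one incomplete row
toy_df = pd.DataFrame({
    'id': [1, 1, 2],
    's1': [518.67, np.nan, 518.67],
    's2': [642.1, 641.9, 642.4],
})

# Rows that contain at least one missing value
rows_with_na = toy_df[toy_df.isna().any(axis=1)]
print(rows_with_na)

# dropna() with default arguments removes every row holding any NA
cleaned = toy_df.dropna().copy()
n_dropped = len(toy_df) - len(cleaned)
print(f"Rows dropped: {n_dropped}")
```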

In [44]:
# Pre NA drop check
with pd.option_context('display.max_columns', None, 'display.max_rows', None):
    print(pm_test_df.describe())

# Check for missing data
print("\n Null values: \n", pm_test_df.isnull().sum())

                 id         cycle      setting1      setting2  setting3  \
count  11939.000000  11939.000000  11939.000000  11939.000000   11939.0   
mean      47.219281     76.401541     -0.000006      0.000001     100.0   
std       25.796561     52.750586      0.002199      0.000294       0.0   
min        1.000000      1.000000     -0.008200     -0.000600     100.0   
25%       25.000000     33.000000     -0.001500     -0.000200     100.0   
50%       49.000000     68.000000      0.000000      0.000000     100.0   
75%       68.000000    112.000000      0.001500      0.000300     100.0   
max       92.000000    303.000000      0.007800      0.000700     100.0   

                 s1            s2            s3            s4            s5  \
count  1.193900e+04  11939.000000  11939.000000  11939.000000  11939.000000   
mean   5.186700e+02    642.474090   1588.088548   1404.705402     14.618859   
std    5.866487e-11      0.399549      4.999466      6.662342      0.124650   
min    5

In [57]:
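# .copy() makes the cleaned frame independent of pm_test_df,
# avoiding SettingWithCopyWarning on later in-place assignment
pm_test_df_clean = pm_test_df.dropna().copy()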

# Confirmation of null removal
print("\n Null values: \n", pm_test_df_clean.isnull().sum())

# Check for constants
with pd.option_context('display.max_columns', None, 'display.max_rows', None):
    print(pm_test_df_clean.describe())


 Null values: 
 id          0
cycle       0
setting1    0
setting2    0
setting3    0
s1          0
s2          0
s3          0
s4          0
s5          0
s6          0
s7          0
s8          0
s9          0
s10         0
s11         0
s12         0
s13         0
s14         0
s15         0
s16         0
s17         0
s18         0
s19         0
s20         0
s21         0
dtype: int64
                 id         cycle      setting1      setting2  setting3  \
count  11938.000000  11938.000000  11938.000000  11938.000000   11938.0   
mean      47.215530     76.399062     -0.000006      0.000001     100.0   
std       25.794385     52.752100      0.002199      0.000294       0.0   
min        1.000000      1.000000     -0.008200     -0.000600     100.0   
25%       25.000000     33.000000     -0.001500     -0.000200     100.0   
50%       49.000000     68.000000     -0.000000      0.000000     100.0   
75%       68.000000    112.000000      0.001500      0.000300     100.0   
max   

Constants Identified
- Avoid standardization of:
  - setting3
  - s18
  - s19
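
The constants above were read off the `describe()` output by eye. A programmatic check is one way to make that step repeatable. The sketch below is an assumption-laden illustration: `find_constant_columns`, the toy data, and the tolerance of `1e-8` are all hypothetical choices, not part of the original notebook. A tolerance is used instead of an exact zero test because a column can be constant in practice yet show a tiny nonzero standard deviation from floating-point noise (the `describe()` output earlier reports a std on the order of 1e-11 for s1).

```python
import pandas as pd

# Hypothetical toy frame: 'b' is exactly constant,
# 'c' is constant up to floating-point noise.
toy_df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [100.0, 100.0, 100.0, 100.0],
    'c': [518.67, 518.67 + 1e-11, 518.67, 518.67 - 1e-11],
})


def find_constant_columns(df, tol=1e-8):
    """Return the columns whose sample standard deviation is at most tol."""
    stds = df.std()
    return stds[stds <= tol].index.tolist()


constant_cols = find_constant_columns(toy_df)
print(constant_cols)
```

Columns flagged this way would be candidates for exclusion from standardization, since dividing by a standard deviation at or near zero produces NaN or very large values.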

-------------------------------------------------------------------------------------------------------------------------------------------------------
Dataset normalization via z-score
Standardized scaling:
- Centering each feature to zero mean
- Scaling to unit variance
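
The two steps above amount to computing z = (x - mean) / std per column. A minimal sketch on made-up data follows; `toy_df` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy features
toy_df = pd.DataFrame({
    'x': [1.0, 2.0, 3.0],
    'y': [10.0, 20.0, 60.0],
})

# Center each column to zero mean, then scale to unit variance.
# pandas .std() uses the sample standard deviation (ddof=1) by default.
z_scores = (toy_df - toy_df.mean()) / toy_df.std()

print(z_scores)
print(z_scores.mean())  # approximately 0 for every column
print(z_scores.std())   # approximately 1 for every column
```

Note that scikit-learn's `StandardScaler` divides by the population standard deviation (ddof=0) instead, so its results differ slightly from the pandas expression on small samples.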

In [58]:
# Feature columns, so that only features (not id/cycle) are normalized
# setting3, s18, and s19 are constants, so they are not normalized
feature_columns = ['setting1', 'setting2'] + [
    f's{i}' for i in range(1, 22) if i not in (18, 19)
]

print(feature_columns)

# Pre-data "normalization" check
with pd.option_context('display.max_columns', None, 'display.max_rows', None):
    print(pm_test_df_clean.describe())

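# Cast features to float first, in case any feature column was read as
# integer; otherwise the standardized values cannot be stored in place
pm_test_df_clean = pm_test_df_clean.astype(
    {col: 'float64' for col in feature_columns})

# Mean/STD calculation of pm_test
mean = pm_test_df_clean[feature_columns].mean()
std = pm_test_df_clean[feature_columns].std()

# Standardization of each column
pm_test_df_clean.loc[:, feature_columns] = (
    pm_test_df_clean[feature_columns] - mean) / std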


# Scaling confirmation
with pd.option_context('display.max_columns', None, 'display.max_rows', None):
    print(pm_test_df_clean.describe())

['setting1', 'setting2', 's1', 's2', 's3', 's4', 's5', 's6', 's7', 's8', 's9', 's10', 's11', 's12', 's13', 's14', 's15', 's16', 's17', 's20', 's21']
                 id         cycle      setting1      setting2  setting3  \
count  11938.000000  11938.000000  11938.000000  11938.000000   11938.0   
mean      47.215530     76.399062     -0.000006      0.000001     100.0   
std       25.794385     52.752100      0.002199      0.000294       0.0   
min        1.000000      1.000000     -0.008200     -0.000600     100.0   
25%       25.000000     33.000000     -0.001500     -0.000200     100.0   
50%       49.000000     68.000000     -0.000000      0.000000     100.0   
75%       68.000000    112.000000      0.001500      0.000300     100.0   
max       92.000000    303.000000      0.007800      0.000700     100.0   

                 s1            s2            s3            s4            s5  \
count  1.193800e+04  11938.000000  11938.000000  11938.000000  1.193800e+04   
mean   5.186700e+