# Predictive maintenance

## Part 1: Data Preparation

The original data can be [downloaded from this link.](https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository/#turbofan) Since the content in the train and test datasets is different, we are making it uniform before we start the data exploration and the model buiding process. We will convert the data into a more natural format for Vaex.

In [2]:
import pandas as pd

### Read the data

The data contains a list of sensors. These are their names and meanings:


| Name      |Description                      |Unit     |    
|-----------|---------------------------------|---------|    
| T2        | Total temperature at fan inlet  | °R      |    
| T24       | Total temperature at LPC outlet | °R      |    
| T30       | Total temperature at HPC outlet | °R      |    
| T50       | Total temperature at LPT outlet | °R      |    
| P2        | Pressure at fan inlet           | psia    |    
| P15       | Total pressure in bypass-duct   | psia    |    
| P30       | Total pressure at HPC outlet    | psia    |    
| Nf        | Physical fan speed              | rpm     |    
| Nc        | Physical core speed             | rpm     |    
| epr       | Engine pressure ratio (P50/P2)  | --      |    
| Ps30      | Static pressure at HPC outlet   | psia    |    
| phi       | Ratio of fuel flow to Ps30      | pps/psi |    
| NRf       | Corrected fan speed             | rpm     |    
| NRc       | Corrected core speed            | rpm     |    
| BPR       | Bypass Ratio                    | --      |    
| farB      | Burner fuel-air ratio           | --      |    
| htBleed   | Bleed Enthalpy                  | --      |    
| Nf_dmd    | Demanded fan speed              | rpm     |    
| PCNfR_dmd | Demanded corrected fan speed    | rpm     |    
| W31       | HPT coolant bleed               | lbm/s   |    
| W32       | LPT coolant bleed               | lbm/s   |    


In [8]:
column_names = ['unit_number', 'time_in_cycles', 'setting_1', 'setting_2', 'setting_3',
                'T2', 'T24', 'T30', 'T50', 'P2', 'P15', 'P30', 'Nf', 'Nc', 'epr', 'Ps30', 'phi', 
                'NRf', 'NRc', 'BPR', 'farB', 'htBleed', 'Nf_dmd', 'PCNfR_dmd', 'W31', 'W32']


# The training data
train_data = pd.read_csv("../CMAPSSData/train_FD001.txt", sep='\s+', names=column_names)

# The testing data
test_data = pd.read_csv("../CMAPSSData/test_FD001.txt", sep='\s+', names=column_names)

# The "answer" to the test data
y_test = pd.read_csv('../CMAPSSData/RUL_FD001.txt', names=['remaining_cycles'])
y_test['unit_number'] = y_test.index
y_test['unit_number'] = y_test.unit_number.astype('int')

### Create proper train and test datasets

- in the training set, the engines are run until failure occurs, so we can calculate the target varuable, i,e, the RUL (Remaining Useful Life) based on when a particular engines running;
- in the test set the engines are run for some time, and our goal is to predict their RULs. Their RUL are provided in a separate file, so we need to join it so it can be made available for evaluating scores and estimateing model performance

In [13]:
def prepare_data(data, y=None):
    df = data.copy()  # To avoid modifying the original DataFrame
    
    # Count how many cycles each unit has run for - groupby and count
    g = df.groupby('unit_number').agg(max_cycles=('time_in_cycles', 'count')).reset_index()
    
    # Merge the aggregated data to the main DataFrame - adds the "max_cycles" column
    df = df.merge(g, on='unit_number', how='left')
    
    # Calculate the Remaining Useful Life (RUL)
    if y is None:  # This is for the training data -> the last point is the point of failure
        # Calculate the RUL
        df['RUL'] = df['max_cycles'] - df['time_in_cycles']
        # Drop the 'max_cycles' column as it's no longer needed
        df = df.drop(columns=['max_cycles'])
    else:  # This is for the test data -> add the answer to calculate the RUL
        # Merge the answers with the main DataFrame
        df = df.merge(y, on='unit_number', how='left')
        # Calculate the RUL using 'remaining_cycles' from y
        df['RUL'] = df['max_cycles'] + df['remaining_cycles'] - df['time_in_cycles']
        # Drop the 'remaining_cycles' and 'max_cycles' columns as they're no longer needed
        df = df.drop(columns=['remaining_cycles', 'max_cycles'])
    
    # Return the processed DataFrame
    return df

In [14]:
# Add the RUL to the train and test sets
df_train = prepare_data(train_data)
df_test = prepare_data(test_data, y=y_test)


### Quick preview of the datasets

In [15]:
df_train

Unnamed: 0,unit_number,time_in_cycles,setting_1,setting_2,setting_3,T2,T24,T30,T50,P2,...,NRf,NRc,BPR,farB,htBleed,Nf_dmd,PCNfR_dmd,W31,W32,RUL
0,1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.70,1400.60,14.62,...,2388.02,8138.62,8.4195,0.03,392,2388,100.0,39.06,23.4190,191
1,1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,...,2388.07,8131.49,8.4318,0.03,392,2388,100.0,39.00,23.4236,190
2,1,3,-0.0043,0.0003,100.0,518.67,642.35,1587.99,1404.20,14.62,...,2388.03,8133.23,8.4178,0.03,390,2388,100.0,38.95,23.3442,189
3,1,4,0.0007,0.0000,100.0,518.67,642.35,1582.79,1401.87,14.62,...,2388.08,8133.83,8.3682,0.03,392,2388,100.0,38.88,23.3739,188
4,1,5,-0.0019,-0.0002,100.0,518.67,642.37,1582.85,1406.22,14.62,...,2388.04,8133.80,8.4294,0.03,393,2388,100.0,38.90,23.4044,187
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20626,100,196,-0.0004,-0.0003,100.0,518.67,643.49,1597.98,1428.63,14.62,...,2388.26,8137.60,8.4956,0.03,397,2388,100.0,38.49,22.9735,4
20627,100,197,-0.0016,-0.0005,100.0,518.67,643.54,1604.50,1433.58,14.62,...,2388.22,8136.50,8.5139,0.03,395,2388,100.0,38.30,23.1594,3
20628,100,198,0.0004,0.0000,100.0,518.67,643.42,1602.46,1428.18,14.62,...,2388.24,8141.05,8.5646,0.03,398,2388,100.0,38.44,22.9333,2
20629,100,199,-0.0011,0.0003,100.0,518.67,643.23,1605.26,1426.53,14.62,...,2388.23,8139.29,8.5389,0.03,395,2388,100.0,38.29,23.0640,1


In [16]:
df_test

Unnamed: 0,unit_number,time_in_cycles,setting_1,setting_2,setting_3,T2,T24,T30,T50,P2,...,NRf,NRc,BPR,farB,htBleed,Nf_dmd,PCNfR_dmd,W31,W32,RUL
0,1,1,0.0023,0.0003,100.0,518.67,643.02,1585.29,1398.21,14.62,...,2388.03,8125.55,8.4052,0.03,392,2388,100.0,38.86,23.3735,128.0
1,1,2,-0.0027,-0.0003,100.0,518.67,641.71,1588.45,1395.42,14.62,...,2388.06,8139.62,8.3803,0.03,393,2388,100.0,39.02,23.3916,127.0
2,1,3,0.0003,0.0001,100.0,518.67,642.46,1586.94,1401.34,14.62,...,2388.03,8130.10,8.4441,0.03,393,2388,100.0,39.08,23.4166,126.0
3,1,4,0.0042,0.0000,100.0,518.67,642.44,1584.12,1406.42,14.62,...,2388.05,8132.90,8.3917,0.03,391,2388,100.0,39.00,23.3737,125.0
4,1,5,0.0014,0.0000,100.0,518.67,642.51,1587.19,1401.92,14.62,...,2388.03,8129.54,8.4031,0.03,390,2388,100.0,38.99,23.4130,124.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13091,100,194,0.0049,0.0000,100.0,518.67,643.24,1599.45,1415.79,14.62,...,2388.00,8213.28,8.4715,0.03,394,2388,100.0,38.65,23.1974,
13092,100,195,-0.0011,-0.0001,100.0,518.67,643.22,1595.69,1422.05,14.62,...,2388.09,8210.85,8.4512,0.03,395,2388,100.0,38.57,23.2771,
13093,100,196,-0.0006,-0.0003,100.0,518.67,643.44,1593.15,1406.82,14.62,...,2388.04,8217.24,8.4569,0.03,395,2388,100.0,38.62,23.2051,
13094,100,197,-0.0038,0.0001,100.0,518.67,643.26,1594.99,1419.36,14.62,...,2388.08,8220.48,8.4711,0.03,395,2388,100.0,38.66,23.2699,


### Export the datasets to HDF5

In [18]:
df_train.to_csv('../CMAPSSData/data_train.csv')
df_test.to_csv('../CMAPSSData/data_test.csv')

The data is ready and now we can start with the modeling process.