# Analysis of the Turbofan dataset

This notebook is an attempt at studying the Turbofan dataset from NASA. The dataset and the accompanying documentation are available here: https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository/#turbofan

The dataset is intended to support the development of algorithms that can predict the failure of turbofan engines in advance. In the following code, we will try to do that using *Recurrent Neural Networks* built with PyTorch (see https://www.pytorch.org) and the FastAI library (see https://docs.fast.ai/).

## Setting up the environment and getting the file

In [1]:
# Import of the required libraries
import numpy as np
import pandas as pd
import torch
import fastai
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
import urllib.request
from pathlib import Path
import zipfile

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn

seaborn.set()

In [2]:
url = 'https://ti.arc.nasa.gov/c/6/'
CMAPSSDATAFILE = Path.cwd() / 'data' / 'CMAPSSData.zip'
DATAPATH = Path.cwd() / 'data'

In [3]:
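if not CMAPSSDATAFILE.exists():
    # Create the data directory if needed (no error if it already exists)
    DATAPATH.mkdir(parents=True, exist_ok=True)
    # Download the archive, then extract it next to the zip file
    urllib.request.urlretrieve(url, CMAPSSDATAFILE)
    with zipfile.ZipFile(CMAPSSDATAFILE, 'r') as archive:
        archive.extractall(CMAPSSDATAFILE.parent)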

In [4]:
# Load the Turbofan engine dataset (training file FD001)
file = DATAPATH / 'train_FD001.txt'

# Let's define the column names for this dataframe
# os stands for operational setting
# sm stands for sensor measurement
columns = ['Unit', 'Cycle', 'os1', 'os2', 'os3']
columns += ['sm' + str(i) for i in range(1, 22)]

# Only read the first 26 columns, which match the names defined above
cols = list(range(26))

data = pd.read_csv(file, names=columns, header=None, usecols=cols, sep=' ')
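
The `usecols` restriction above suggests that, with a single-space separator, each line of the file yields more than 26 fields. A likely cause, stated here as an assumption and not checked against the actual file, is trailing spaces at the end of each line. The sketch below uses a made-up two-row sample (truncated to 5 columns) to show the effect and a possible alternative: a whitespace regular expression as separator.

```python
import io
import pandas as pd

# Hypothetical two-row sample with trailing spaces, mimicking the assumed layout
# (only 5 of the 26 columns are shown to keep the example short).
sample = "1 1 -0.0007 -0.0004 100.0  \n1 2 0.0019 -0.0003 100.0  \n"
names = ['Unit', 'Cycle', 'os1', 'os2', 'os3']

# With a single-space separator, each trailing space creates an extra empty column.
raw = pd.read_csv(io.StringIO(sample), sep=' ', header=None)

# With a whitespace regex separator, runs of spaces count as one delimiter
# and trailing whitespace does not create extra columns.
clean = pd.read_csv(io.StringIO(sample), sep=r'\s+', header=None, names=names)

print(raw.shape)
print(clean.shape)
```

If the real file behaves like this sample, `sep=r'\s+'` would make the `usecols` argument unnecessary. Both approaches should produce the same dataframe.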

In [5]:
# Let's take a look at the beginning of the data and see if it's properly loaded
data.head()

,Unit,Cycle,os1,os2,os3,sm1,sm2,sm3,sm4,sm5,...,sm12,sm13,sm14,sm15,sm16,sm17,sm18,sm19,sm20,sm21
0,1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.7,1400.6,14.62,...,521.66,2388.02,8138.62,8.4195,0.03,392,2388,100.0,39.06,23.419
1,1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,...,522.28,2388.07,8131.49,8.4318,0.03,392,2388,100.0,39.0,23.4236
2,1,3,-0.0043,0.0003,100.0,518.67,642.35,1587.99,1404.2,14.62,...,522.42,2388.03,8133.23,8.4178,0.03,390,2388,100.0,38.95,23.3442
3,1,4,0.0007,0.0,100.0,518.67,642.35,1582.79,1401.87,14.62,...,522.86,2388.08,8133.83,8.3682,0.03,392,2388,100.0,38.88,23.3739
4,1,5,-0.0019,-0.0002,100.0,518.67,642.37,1582.85,1406.22,14.62,...,522.19,2388.04,8133.8,8.4294,0.03,393,2388,100.0,38.9,23.4044


In [6]:
# Let's have a quick look at statistics describing the different parameters
data.describe()

,Unit,Cycle,os1,os2,os3,sm1,sm2,sm3,sm4,sm5,...,sm12,sm13,sm14,sm15,sm16,sm17,sm18,sm19,sm20,sm21
count,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,...,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0,20631.0
mean,51.506568,108.807862,-9e-06,2e-06,100.0,518.67,642.680934,1590.523119,1408.933782,14.62,...,521.41347,2388.096152,8143.752722,8.442146,0.03,393.210654,2388.0,100.0,38.816271,23.289705
std,29.227633,68.88099,0.002187,0.000293,0.0,6.537152e-11,0.500053,6.13115,9.000605,3.3947e-12,...,0.737553,0.071919,19.076176,0.037505,1.556432e-14,1.548763,0.0,0.0,0.180746,0.108251
min,1.0,1.0,-0.0087,-0.0006,100.0,518.67,641.21,1571.04,1382.25,14.62,...,518.69,2387.88,8099.94,8.3249,0.03,388.0,2388.0,100.0,38.14,22.8942
25%,26.0,52.0,-0.0015,-0.0002,100.0,518.67,642.325,1586.26,1402.36,14.62,...,520.96,2388.04,8133.245,8.4149,0.03,392.0,2388.0,100.0,38.7,23.2218
50%,52.0,104.0,0.0,0.0,100.0,518.67,642.64,1590.1,1408.04,14.62,...,521.48,2388.09,8140.54,8.4389,0.03,393.0,2388.0,100.0,38.83,23.2979
75%,77.0,156.0,0.0015,0.0003,100.0,518.67,643.0,1594.38,1414.555,14.62,...,521.95,2388.14,8148.31,8.4656,0.03,394.0,2388.0,100.0,38.95,23.3668
max,100.0,362.0,0.0087,0.0006,100.0,518.67,644.53,1616.91,1441.49,14.62,...,523.38,2388.56,8293.72,8.5848,0.03,400.0,2388.0,100.0,39.43,23.6184


## Cleaning and preparing the dataset

### Cleaning

As the statistics above show, some of the parameters have a standard deviation of 0.0, which means their values never change. We can therefore drop those parameters entirely, since they carry no information for our prediction.

In [7]:
# .loc['std'] selects the std row of the describe() output shown above
std = data.describe().loc['std']
# Keep only the columns whose standard deviation is not exactly zero
criteria = (std != 0)
data = data[criteria.index[criteria]]
data.shape

(20631, 23)
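
Note that the `!= 0` test only removes columns whose standard deviation is exactly zero. In the statistics above, `sm1`, `sm5` and `sm16` have a standard deviation of the order of 1e-11 or smaller, which looks like floating-point noise around a constant value, so they pass the test. Among the visible columns, only `os3`, `sm18` and `sm19` show exactly 0.0, which is consistent with 26 - 3 = 23 remaining columns. A minimal sketch of a tolerance-based filter follows; the helper name `drop_near_constant` and the threshold `1e-8` are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def drop_near_constant(df, tol=1e-8):
    """Return df without the columns whose standard deviation is <= tol.

    Hypothetical helper; the threshold `tol` is an assumption, not a value
    taken from the notebook.
    """
    std = df.std()
    return df.loc[:, std > tol]

# Toy frame: 'flat' is constant, 'noisy_flat' is constant up to float noise.
toy = pd.DataFrame({
    'Unit': [1, 1, 2, 2],
    'flat': [100.0, 100.0, 100.0, 100.0],
    'noisy_flat': [518.67, 518.67 + 1e-12, 518.67, 518.67 - 1e-12],
    'signal': [641.8, 642.1, 642.3, 642.4],
})

filtered = drop_near_constant(toy)
print(list(filtered.columns))
```

Whether the near-constant sensors should be dropped is a modelling choice. Keeping them is harmless for correctness, but they are unlikely to help the prediction.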

### Preparing

Now that we have removed the parameters (i.e., features) that add no value to our prediction, we have to define the target variable that our model will predict.
According to the documentation accompanying the dataset, the value we are trying to predict is the *Remaining Useful Life*, also named *RUL*. The different units in the dataset are run to failure, so the *RUL* can be defined with the following calculation: $RUL = Cycle_{max} - Cycle_{current}$, where $Cycle_{max}$ is the last recorded cycle of the unit. Let's use this formula to add a column containing the *RUL*.

In [8]:
# For each row, subtract the current cycle from the last cycle of the same unit
data['RUL'] = data.groupby('Unit')['Cycle'].transform('max') - data['Cycle']
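
To make the RUL calculation concrete, here is a small worked example on a made-up dataframe (the values are illustrative, not taken from the dataset). `transform('max')` returns, for every row, the maximum cycle of the unit that row belongs to, so the subtraction is done row by row.

```python
import pandas as pd

# Toy data: unit 1 fails after 3 cycles, unit 2 after 2 cycles.
toy = pd.DataFrame({
    'Unit':  [1, 1, 1, 2, 2],
    'Cycle': [1, 2, 3, 1, 2],
})

# Last observed cycle of each unit, broadcast back to every row of that unit.
max_cycle = toy.groupby('Unit')['Cycle'].transform('max')
toy['RUL'] = max_cycle - toy['Cycle']

print(toy)
```

In this toy example the RUL counts down to 0 on the last recorded cycle of each unit, which matches the formula above.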