# Predictive Maintenance of Turbofan Engines

## Exploratory Data Analysis

The first thing to do is take a look at our data to get a feel for it.

We will create a list of the column names using the information from the included `readme.txt` file.

* 'unit number' has been replaced with `id`, as it is the unique identifier for an engine
* 'time, in cycles' has been shortened to `cycle`
* 'operational setting *X*' has been shortened to `settingX`
* 'sensor measurement *X*' has been shortened to `sX`

*N.B. We can use whatever column names we like, but for this workshop these are the column names that the following code will expect to see.*



In [None]:
import pandas as pd

# Column names 
index_names = ['id', 'cycle']
setting_names = ['setting1', 'setting2', 'setting3']
sensor_names = ['s{}'.format(i) for i in range(1,22)] 
columns = index_names + setting_names + sensor_names

print(columns)

We'll take a look at the first training dataset, i.e. `train_FD001.txt`

Let's import that to a pandas dataframe as well as adding our column names to the dataframe.

In [None]:
# Read in the training dataset passing in the list of column names
# The raw string (r'...') keeps the regex backslash from being treated as a string escape
train1 = pd.read_csv('data/train_FD001.txt', delimiter=r'\s+', header=None, names=columns)
train1.head()
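
To see how the whitespace delimiter behaves, here is a minimal sketch that parses an in-memory string instead of the real file. The two sample rows and the reduced column list (`sample_text`, `sample_columns`) are made up for illustration and are not taken from the dataset.

```python
import io
import pandas as pd

# Hypothetical two-row sample with a reduced set of columns (illustration only)
sample_text = "1 1  -0.0007 518.67 641.82\n1 2   0.0019 518.67 642.15\n"
sample_columns = ['id', 'cycle', 'setting1', 's1', 's2']

# r'\s+' is a regular expression matching one or more whitespace characters,
# so values separated by a varying number of spaces still land in separate columns
sample_df = pd.read_csv(io.StringIO(sample_text), delimiter=r'\s+',
                        header=None, names=sample_columns)
print(sample_df)
```

Because `header=None` is passed, the first row is treated as data and the names come from the list we supply.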

As in the previous workshop we can take a look at the shape of the data to see how many rows and columns we have.

We can use the `.describe()` method to get some basic statistical information from the data.


In [None]:
train1.shape

In [None]:
train1.describe()

Due to the width of the screen we can't see the entire output of the describe function, so let's flip the output.

The `.transpose()` method reflects a dataframe across its diagonal, writing the rows as columns and vice versa.

In [None]:
train1.describe().transpose()
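
As a small illustration of what the transpose does to the summary table, the sketch below uses a made-up two-column dataframe (`toy` is hypothetical, not workshop data) and compares the shapes before and after.

```python
import pandas as pd

# Hypothetical toy data standing in for the much wider training dataframe
toy = pd.DataFrame({'cycle': [1, 2, 3, 4],
                    's2': [641.8, 642.1, 642.3, 642.4]})

summary = toy.describe()
flipped = summary.transpose()

# Before: one row per statistic, one column per original column
print(summary.shape)
# After: one row per original column, one column per statistic
print(flipped.shape)
print(flipped)
```

With 26 columns in the real data, the flipped layout gives 26 rows that scroll vertically instead of 26 columns that run off the screen.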

### Computing RUL

We'll now compute the target variable, Remaining Useful Life (RUL). Doing this now allows us to plot sensor signals against the RUL, which makes the data easier to interpret.

We can compute the Remaining Useful Life as `max_cycle - cycle` for each engine id.
* We group the dataframe by `id` into a new dataframe, `max_cycle_df`
* Compute the `max_cycle` field 
* Merge the `max_cycle_df` back into original dataframe
* Compute RUL by subtracting `cycle` from `max_cycle`
* Drop the `max_cycle` field

In [None]:
def calculate_rul(df):
    max_cycle_df = pd.DataFrame(df.groupby('id')['cycle'].max()).reset_index()
    max_cycle_df.columns = ['id', 'max_cycle']
    df = df.merge(max_cycle_df, on=['id'], how='left')
    df['RUL'] = df['max_cycle'] - df['cycle']
    df.drop('max_cycle', axis=1, inplace=True)
    return df

train1 = calculate_rul(train1)
train1.head()
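
The same calculation can also be written without the intermediate merge. The function below is a sketch of an alternative, not the version the rest of the workshop uses; `calculate_rul_transform` and the `toy` dataframe are names introduced here for illustration.

```python
import pandas as pd

def calculate_rul_transform(df):
    """Hypothetical variant of calculate_rul using groupby + transform."""
    out = df.copy()  # leave the caller's dataframe untouched
    # transform('max') returns each engine's last cycle, aligned to every row
    out['RUL'] = out.groupby('id')['cycle'].transform('max') - out['cycle']
    return out

# Two made-up engines: engine 1 runs for 3 cycles, engine 2 for 2 cycles
toy = pd.DataFrame({'id': [1, 1, 1, 2, 2],
                    'cycle': [1, 2, 3, 1, 2]})
toy_rul = calculate_rul_transform(toy)
print(toy_rul)
```

Each engine's final row ends with an RUL of 0, which matches the merge-based definition above.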

We can now plot a histogram that shows the distribution of each engine's maximum RUL, which is effectively how long the engine lasted.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

df_max_rul = train1[['id', 'RUL']].groupby('id').max().reset_index()
df_max_rul['RUL'].hist(bins=15, figsize=(15,7))
plt.xlabel('RUL')
plt.ylabel('frequency')
plt.show()

This shows us a couple of interesting things:

1. Most engines break down around the 200-cycle mark.
1. The distribution is right-skewed, with few engines lasting beyond the 300-cycle mark.
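
The skew can also be checked numerically rather than by eye. The sketch below uses eight made-up lifetimes (`max_rul` is hypothetical, not taken from the dataset); a positive sample skewness and a mean above the median are both consistent with a right-skewed distribution.

```python
import pandas as pd

# Hypothetical maximum-RUL values for eight engines (not from the dataset)
max_rul = pd.Series([120, 150, 180, 190, 200, 210, 230, 340])

skewness = max_rul.skew()  # sample skewness; positive means a longer right tail
print(f"mean={max_rul.mean():.1f}, median={max_rul.median():.1f}, skew={skewness:.2f}")
```

On the workshop data the same call could be applied to `df_max_rul['RUL']`.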

## How many run-to-failure cycles do we have, and what do they look like?

In [None]:
# There are 100 engines in the first dataset whose RUL reaches 0
len(train1[train1['RUL'] == 0])

In [None]:
# Every engine starts healthy and reaches RUL 0 after a different number of cycles.
# This is why prediction is worthwhile: if they all failed at a fixed cycle, none of this effort would be needed.
one_engine = []
for i,r in train1.iterrows():
    rul = r['RUL']
    one_engine.append(rul)
    if rul == 0:
        plt.plot(one_engine)
        one_engine = []
        
plt.grid()
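
The spread of failure points can also be summarised as numbers. This is a sketch on a made-up three-engine dataframe (`toy` is hypothetical); it takes the last recorded cycle of each engine as its lifetime.

```python
import pandas as pd

# Three made-up engines that fail after 3, 2 and 4 cycles respectively
toy = pd.DataFrame({'id':    [1, 1, 1, 2, 2, 3, 3, 3, 3],
                    'cycle': [1, 2, 3, 1, 2, 1, 2, 3, 4]})

# The last recorded cycle per engine is the cycle at which it failed
lifetimes = toy.groupby('id')['cycle'].max()
print(lifetimes.describe())
print("distinct lifetimes:", lifetimes.nunique())
```

If every engine failed at the same cycle, `lifetimes.nunique()` would be 1 and there would be nothing to predict.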

### Plotting Sensors

Next we'll plot the trends for each of the 21 sensors.

Due to the size of the dataset, it's not practical to plot every value. Therefore we will only plot engines with an id divisible by 10.

We will also reverse the x-axis so that the RUL decreases along the axis.
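
The "every tenth engine" rule can be expressed as a boolean mask. The sketch below shows the filter on a made-up dataframe (`toy` is hypothetical) without any plotting.

```python
import pandas as pd

# Made-up rows for engines 9, 10, 15 and 20
toy = pd.DataFrame({'id':  [9, 10, 10, 15, 20],
                    'RUL': [0, 1, 0, 0, 0]})

# Keep only the rows whose engine id leaves no remainder when divided by 10
subset = toy[toy['id'] % 10 == 0]
print(subset['id'].unique())
```

For the 100 engines of the first dataset this rule selects ten engines, which keeps the plots readable.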

In [None]:
def plot_sensor(df, sensor_name):
    plt.figure(figsize=(13,5))
    for i in df['id'].unique():
        if i % 10 == 0:  # plot every tenth engine only
            plt.plot('RUL', sensor_name, 
                     data=df[df['id']==i])
    plt.xlim(250, 0)
    plt.xticks(np.arange(0, 275, 25))
    plt.ylabel(sensor_name)
    plt.xlabel('Remaining Useful Life')
    plt.show()

for sensor_name in sensor_names:
    plot_sensor(train1, sensor_name)

Based on these graphs we can make a few observations:

1. A few of the sensor readings don't vary throughout our dataset.
1. Sensor readings tend to stay fairly consistent before trending up or down towards failure.

Based on this dataset alone, we might assume that we could omit the readings from s1, s5, s10, s16, s18 and s19.

However, let's check the other datasets to see if they show the same trends.
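
Flat sensors can also be listed programmatically rather than read off the plots. The helper below is a sketch; `constant_columns` and the `toy` dataframe are hypothetical names and values introduced for illustration, and the check only catches columns that hold exactly one distinct value.

```python
import pandas as pd

def constant_columns(df, candidates):
    """Hypothetical helper: return the candidate columns with a single distinct value."""
    return [col for col in candidates if df[col].nunique() == 1]

# Made-up readings: s1 and s5 never change, s2 does
toy = pd.DataFrame({'s1': [518.67, 518.67, 518.67],
                    's2': [641.82, 642.15, 642.35],
                    's5': [14.62, 14.62, 14.62]})

flat = constant_columns(toy, ['s1', 's2', 's5'])
print(flat)
```

On the workshop data this could be called as `constant_columns(train1, sensor_names)`. A sensor that varies only slightly would not be reported, so a threshold on the standard deviation may be a better fit for near-constant signals.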


In [None]:
# Import the CSV data into dataframes
train2 = pd.read_csv('data/train_FD002.txt', delimiter=r'\s+', header=None, names=columns)
train3 = pd.read_csv('data/train_FD003.txt', delimiter=r'\s+', header=None, names=columns)
train4 = pd.read_csv('data/train_FD004.txt', delimiter=r'\s+', header=None, names=columns)

# Calculate the RUL values
train2 = calculate_rul(train2)
train3 = calculate_rul(train3)
train4 = calculate_rul(train4)

In [None]:
for sensor_name in sensor_names:
    plot_sensor(train2, sensor_name)

In [None]:
for sensor_name in sensor_names:
    plot_sensor(train3, sensor_name)

In [None]:
for sensor_name in sensor_names:
    plot_sensor(train4, sensor_name)

As you may have guessed, life isn't going to be that easy. We can't discard any of the columns, as they could be holding valuable information that we just can't see.

In this case we didn't have to do any cleaning of the data.
