### AE4465 (MM&A) - Lecture 7 (Classification)

Classification is the task of assigning objects to preset categories, or “sub-populations.” A machine learning classifier learns from a pre-categorized (labelled) training dataset and then assigns new, unseen data points to those categories.

Classification algorithms in machine learning use input training data to predict the likelihood that subsequent data will fall into one of the predetermined categories. One of the most common uses of classification in maintenance is filtering snapshots of sensor data into “(near) fault” or “non-fault”.

**Popular Classification Algorithms:**
- Logistic Regression
- Naive Bayes
- K-Nearest Neighbors
- Decision Tree
- Support Vector Machines
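
To make the idea of "predicting the likelihood" of a category concrete, here is a minimal sketch using logistic regression from scikit-learn. The data are synthetic and purely illustrative (a single made-up sensor feature, with label 1 standing for "(near) fault"); the names `X_demo`, `y_demo` and `clf_demo` are assumptions for this example and are not part of the tutorial data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, hypothetical data: one "sensor" feature, label 1 = (near) fault.
rng = np.random.default_rng(0)
healthy = rng.normal(loc=0.0, scale=1.0, size=(200, 1))
faulty = rng.normal(loc=3.0, scale=1.0, size=(200, 1))
X_demo = np.vstack([healthy, faulty])
y_demo = np.concatenate([np.zeros(200, dtype=int), np.ones(200, dtype=int)])

clf_demo = LogisticRegression()
clf_demo.fit(X_demo, y_demo)

# predict_proba returns one column per class; column 1 is P(label == 1).
new_readings = np.array([[-1.0], [1.5], [4.0]])
fault_probability = clf_demo.predict_proba(new_readings)[:, 1]
predicted_label = clf_demo.predict(new_readings)
print(fault_probability)
print(predicted_label)
```

The predicted label is simply the class with the highest estimated probability, so the probabilities carry more information than the labels alone.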

In this tutorial we will see how to develop a classification model based on CMAPSS data.

In [27]:
#import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
import scipy.stats

In [5]:
df1 = pd.read_csv('data/les05_CMAPSStrain001.txt', sep=' ')
df1.head()

Unnamed: 0,Equipment,Cycle,Op1,Op2,Op3,1,2,3,4,5,...,12,13,14,15,16,17,18,19,20,21
0,1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.7,1400.6,14.62,...,521.66,2388.02,8138.62,8.4195,0.03,392,2388,100.0,39.06,23.419
1,1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,...,522.28,2388.07,8131.49,8.4318,0.03,392,2388,100.0,39.0,23.4236
2,1,3,-0.0043,0.0003,100.0,518.67,642.35,1587.99,1404.2,14.62,...,522.42,2388.03,8133.23,8.4178,0.03,390,2388,100.0,38.95,23.3442
3,1,4,0.0007,0.0,100.0,518.67,642.35,1582.79,1401.87,14.62,...,522.86,2388.08,8133.83,8.3682,0.03,392,2388,100.0,38.88,23.3739
4,1,5,-0.0019,-0.0002,100.0,518.67,642.37,1582.85,1406.22,14.62,...,522.19,2388.04,8133.8,8.4294,0.03,393,2388,100.0,38.9,23.4044


### RUL Calculation

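The remaining useful life (RUL) of a row is the last recorded cycle of its unit minus the current cycle. It can be calculated per unit using the pandas groupby function.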

In [8]:
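# RUL = last recorded cycle of the unit minus the current cycle.
# transform('max') broadcasts each unit's maximum cycle back to its rows,
# so no groupby(...).apply(...) call (and no group_keys warning) is needed.
df1['RUL'] = df1.groupby('Equipment')['Cycle'].transform('max') - df1['Cycle']
df1.head()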


Unnamed: 0,Equipment,Cycle,Op1,Op2,Op3,1,2,3,4,5,...,13,14,15,16,17,18,19,20,21,RUL
0,1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.7,1400.6,14.62,...,2388.02,8138.62,8.4195,0.03,392,2388,100.0,39.06,23.419,191
1,1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,...,2388.07,8131.49,8.4318,0.03,392,2388,100.0,39.0,23.4236,190
2,1,3,-0.0043,0.0003,100.0,518.67,642.35,1587.99,1404.2,14.62,...,2388.03,8133.23,8.4178,0.03,390,2388,100.0,38.95,23.3442,189
3,1,4,0.0007,0.0,100.0,518.67,642.35,1582.79,1401.87,14.62,...,2388.08,8133.83,8.3682,0.03,392,2388,100.0,38.88,23.3739,188
4,1,5,-0.0019,-0.0002,100.0,518.67,642.37,1582.85,1406.22,14.62,...,2388.04,8133.8,8.4294,0.03,393,2388,100.0,38.9,23.4044,187
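
The RUL calculation can be checked by hand on a tiny example. The sketch below uses a made-up frame (`toy` is a hypothetical name, and the values are not CMAPSS data) in which unit 1 runs for 4 cycles and unit 2 for 3 cycles, so the expected RUL values are easy to verify.

```python
import pandas as pd

# Toy frame (made-up values): unit 1 runs 4 cycles, unit 2 runs 3 cycles.
toy = pd.DataFrame({
    'Equipment': [1, 1, 1, 1, 2, 2, 2],
    'Cycle':     [1, 2, 3, 4, 1, 2, 3],
})

# transform('max') broadcasts each unit's last cycle back to every row.
last_cycle = toy.groupby('Equipment')['Cycle'].transform('max')
toy['RUL'] = last_cycle - toy['Cycle']
print(toy)
```

Each unit counts down to an RUL of 0 at its last recorded cycle, which matches the pattern in the table above (191, 190, 189, ...).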


### Classification Label

We can now create a binary classification label for the failure status: 1 if the unit is within 30 cycles of failure (RUL <= 30), and 0 otherwise.

In [18]:
cycle=30
df1['label'] = df1['RUL'].apply(lambda x: 1 if x <= cycle else 0)
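
Before training, it can help to check how many rows fall into each class, since a threshold on RUL usually produces far fewer positive than negative labels. The sketch below does this on a hypothetical unit that fails after 100 cycles (`toy` is an illustrative name, not the tutorial data), using the same threshold as above.

```python
import pandas as pd

# Toy RUL values for one hypothetical unit that fails after 100 cycles.
toy = pd.DataFrame({'RUL': range(99, -1, -1)})

cycle = 30  # same threshold as above
# Vectorised comparison; equivalent to the row-wise lambda.
toy['label'] = (toy['RUL'] <= cycle).astype(int)

counts = toy['label'].value_counts().sort_index()
positive_share = toy['label'].mean()
print(counts)
print(f"share of positive labels: {positive_share:.2f}")
```

With RUL <= 30, each unit contributes exactly 31 positive rows (RUL 0 to 30), so the longer a unit runs, the smaller its share of positive labels.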

### Prepare the training and testing data

Split the data into a training set (units with Equipment < 80) and a test set (units with Equipment >= 80), then standardize each feature column.

In [25]:
y1 = df1['label']
# Drop the target columns and the unit identifier from the features.
X1 = df1.drop(['RUL', 'Equipment', 'label'], axis=1)

# Split by unit so that no engine appears in both sets.
train_mask = df1['Equipment'] < 80
y_1_train = y1[train_mask]
y_1_test = y1[~train_mask]

# .copy() creates independent frames and avoids SettingWithCopyWarning.
X1_train = X1.loc[train_mask, :].copy()
X1_test = X1.loc[~train_mask, :].copy()

def standardize(frame):
    # Z-score each column in place; constant columns are set to 1.
    for col in frame.columns:
        std = np.std(frame[col].values)
        if std != 0:
            frame[col] = (frame[col].values - np.mean(frame[col].values)) / std
        else:
            frame[col] = 1

# Note: each set is scaled with its own mean and standard deviation.
standardize(X1_train)
standardize(X1_test)

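
A common alternative to scaling each set with its own statistics is to estimate the mean and standard deviation on the training set only and reuse them for the test set. The sketch below shows this with scikit-learn's `StandardScaler` on random stand-in data; `train_demo` and `test_demo` are hypothetical names, and this is not the approach used to produce the scores in this tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the training and test features (random values).
rng = np.random.default_rng(0)
cols = ['s1', 's2', 's3']
train_demo = pd.DataFrame(rng.normal(50.0, 5.0, size=(100, 3)), columns=cols)
test_demo = pd.DataFrame(rng.normal(52.0, 5.0, size=(40, 3)), columns=cols)
train_demo['s3'] = 100.0  # a constant sensor column
test_demo['s3'] = 100.0

# Fit on the training data only, then apply the same transform to both sets.
scaler = StandardScaler()
train_scaled = pd.DataFrame(scaler.fit_transform(train_demo),
                            columns=cols, index=train_demo.index)
test_scaled = pd.DataFrame(scaler.transform(test_demo),
                           columns=cols, index=test_demo.index)

print(train_scaled.mean().round(3))
print(test_scaled.mean().round(3))
```

Here the training columns end up with zero mean, while the test columns generally do not, because they are shifted by the training statistics. `StandardScaler` maps a constant column to 0 rather than 1; either value is uninformative for the classifier.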

### Model Preparation

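Now we can set up several scikit-learn classifiers, fit each one on the training set, and score it on the test set (this might take a while). The score printed for each classifier is its mean accuracy on the test set.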

In [26]:
# Import only the classifiers used below.
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

names = [
    "Nearest Neighbors",
    "Linear SVM",
    "Decision Tree",
    "Random Forest",
    "Neural Net",
    "AdaBoost",
    "Naive Bayes",
    "QDA",
]

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis(),
]


# iterate over classifiers
for name, clf in zip(names, classifiers):
    clf.fit(X1_train, y_1_train)
    score = clf.score(X1_test, y_1_test)
    print('Score of ', name, score)

Score of  Nearest Neighbors 0.9480547242411287
Score of  Linear SVM 0.9581017528858486
Score of  Decision Tree 0.9559640872167593
Score of  Random Forest 0.9576742197520308
Score of  Neural Net 0.9551090209491235
Score of  AdaBoost 0.9568191534843951
Score of  Naive Bayes 0.860837964942283
Score of  QDA 0.860837964942283
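
Accuracy alone can be hard to interpret when one class is much rarer than the other, because a model that always predicts the majority class already scores well. The sketch below illustrates this on synthetic, imbalanced data (not the CMAPSS data) by comparing a majority-class baseline with a decision tree on accuracy, precision, recall, and the confusion matrix. All variable names are assumptions for this example.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data (about 15 % positives); not the CMAPSS data.
rng = np.random.default_rng(0)
n = 2000
y_demo = (rng.random(n) < 0.15).astype(int)
X_demo = rng.normal(size=(n, 2)) + 1.5 * y_demo[:, None]

X_tr, X_te = X_demo[:1500], X_demo[1500:]
y_tr, y_te = y_demo[:1500], y_demo[1500:]

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)

for model_name, fitted in [('baseline', baseline), ('decision tree', model)]:
    y_pred = fitted.predict(X_te)
    print(model_name)
    print('  accuracy :', fitted.score(X_te, y_te))
    print('  precision:', precision_score(y_te, y_pred, zero_division=0))
    print('  recall   :', recall_score(y_te, y_pred, zero_division=0))
    print(confusion_matrix(y_te, y_pred))
```

The baseline reaches a high accuracy while detecting no positives at all (recall of 0), which is why recall on the fault class is often the more relevant number in a maintenance setting. The same metrics could be computed for the classifiers above by passing their test-set predictions to these functions.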


