# Predictive Maintenance 

This assignment covers the topic of predictive maintenance. Predictive Maintenance problems adress predicting when a machine needs to be maintained ahead of breaking down. This problem can occur anywhere regular maintenance is required for a machine. For example, it can be used in manufacturing, fleet operations, train maintenance, etc.

This assignment will use the [Predictive Maintenance Dataset](https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset). The dataset consists of 10 000 data points stored as rows with 14 features in columns. The 'machine failure' label that indicates, whether the machine has failed in this particular datapoint.

# Learning Objectives
- Perform model tuning based on hyper parameters.
- Select the best model after attempting multiple models.
- Perform recursive feature elimination, producing a statistically significant improvement over a model without feature selection.

In [24]:
import pandas as pd
import numpy as np
from sklearn import preprocessing, metrics
from sklearn.model_selection import train_test_split


ai4i2020 = pd.read_csv('ai4i2020.csv')
print(ai4i2020.info())
# ai4i2020.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   UDI                      10000 non-null  int64  
 1   Product ID               10000 non-null  object 
 2   Type                     10000 non-null  object 
 3   Air temperature [K]      10000 non-null  object 
 4   Process temperature [K]  10000 non-null  object 
 5   Rotational speed [rpm]   10000 non-null  int64  
 6   Torque [Nm]              10000 non-null  float64
 7   Tool wear [min]          10000 non-null  int64  
 8   Machine failure          10000 non-null  int64  
dtypes: float64(1), int64(4), object(4)
memory usage: 703.3+ KB
None


Question 1.1:  Write a command that will calculate the number of unique values for each feature in the training data.

In [25]:
# Command(s)
nUDI = ai4i2020['UDI'].unique().shape[0]
nprodIds = ai4i2020['Product ID'].unique().shape[0]
nTypes = ai4i2020['Type'].unique().shape[0]
nAirTemps = ai4i2020['Air temperature [K]'].unique().shape[0]
nProcessTemps = ai4i2020['Process temperature [K]'].unique().shape[0]
nRotationalSpeeds = ai4i2020['Rotational speed [rpm]'].unique().shape[0]
nTorques = ai4i2020['Torque [Nm]'].unique().shape[0]
nToolWear = ai4i2020['Tool wear [min]'].unique().shape[0]
nMachineFailure = ai4i2020['Machine failure'].unique().shape[0]

print(nUDI)
print(nprodIds)
print(nTypes)
print(nAirTemps)
print(nProcessTemps)
print(nRotationalSpeeds)
print(nTorques)
print(nToolWear)
print(nMachineFailure)

10000
10000
3
93
82
941
577
246
2


Question 1.2: Determine if the data contains any missing values, and replace the values with np.nan. Missing values would be '?'.

In [26]:
ai4i2020.replace(to_replace='\s*\?', value=np.nan, regex=True, inplace=True)


print(ai4i2020['Air temperature [K]'].unique())
print(ai4i2020['Process temperature [K]'].unique())
# Note there are no longer ? in the data, those are now replaced with NaN below

['298.1' '298.2' '298.3' '298.5' '298.4' '298.6' '298.7' '298.8' '298.9'
 '299' '299.1' '298' '297.9' '297.8' '297.7' '297.6' '297.5' '297.4'
 '297.3' '297.2' '297.1' '297' '296.9' '296.8' '296.7' '296.6' '296.5'
 '296.3' '296.4' '296.2' '296.1' '296' '295.9' '295.8' '295.7' '295.6'
 '295.5' '295.4' '295.3' '299.2' '299.3' '299.5' '299.4' '299.6' '299.7'
 '299.8' '299.9' nan '300.1' '300.2' '300.3' '300.4' '300.5' '300.6'
 '300.7' '300.8' '300.9' '301' '301.1' '301.2' '301.3' '301.4' '301.5'
 '301.6' '301.7' '301.8' '301.9' '302' '302.1' '302.2' '302.3' '302.4'
 '302.5' '302.6' '302.7' '302.8' '302.9' '303' '303.1' '303.2' '303.3'
 '303.4' '303.5' '303.6' '303.7' '303.8' '303.9' '304' '304.1' '304.2'
 '304.3' '304.4' '304.5']
['308.6' '308.7' '308.5' '309' '308.9' '309.1' '309.2' '309.3' '309.4'
 '309.5' '308.8' '308.4' '308.3' '308.2' '308.1' '308' '307.9' '309.6'
 '309.7' '309.8' '309.9' nan '310.1' '310.2' '307.8' '307.7' '307.6'
 '307.5' '307.4' '307.3' '307.2' '307.1' '307' '306.9

Question 1.3: Replace all missing values with the mean. Change column types to numeric.

In [27]:
#  1   Product ID               10000 non-null  object 
#  2   Type                     10000 non-null  object 
#  3   Air temperature [K]      10000 non-null  object 
#  4   Process temperature [K]  10000 non-null  object 
#ai4i2020['Product ID'] = ai4i2020['Product ID'].astype(float)
#ai4i2020['Type'] = ai4i2020['Type'].astype(float)
ai4i2020['Air temperature [K]'] = ai4i2020['Air temperature [K]'].astype(float)
ai4i2020['Process temperature [K]'] = ai4i2020['Process temperature [K]'].astype(float)

median_airtemp = ai4i2020["Air temperature [K]"].mean()
median_ptemp = ai4i2020["Process temperature [K]"].mean()

ai4i2020["Air temperature [K]"].fillna(median_airtemp, inplace=True)
ai4i2020["Process temperature [K]"].fillna(median_ptemp, inplace=True)

print(ai4i2020.info())

nan_values = ai4i2020.isnull().sum().sum()
if nan_values == 0:
    print("There are no NaN values in the dataset.")
else:
    print("There are", nan_values, "NaN values in the dataset.")


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   UDI                      10000 non-null  int64  
 1   Product ID               10000 non-null  object 
 2   Type                     10000 non-null  object 
 3   Air temperature [K]      10000 non-null  float64
 4   Process temperature [K]  10000 non-null  float64
 5   Rotational speed [rpm]   10000 non-null  int64  
 6   Torque [Nm]              10000 non-null  float64
 7   Tool wear [min]          10000 non-null  int64  
 8   Machine failure          10000 non-null  int64  
dtypes: float64(3), int64(4), object(2)
memory usage: 703.3+ KB
None
There are no NaN values in the dataset.


Question 1.4: Drop UDI and 'Product ID' from the data

In [28]:
ai4i2020.drop(['UDI', 'Product ID'], axis=1, inplace=True)
ai4i2020.head()


Unnamed: 0,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure
0,M,298.1,308.6,1551,42.8,0,0
1,L,298.2,308.7,1408,46.3,3,0
2,L,298.1,308.5,1498,49.4,5,0
3,L,298.2,308.6,1433,39.5,7,0
4,L,298.2,308.7,1408,40.0,9,0


Question 2.1: Split the data into training and testing taking into consideration 'Machine failure' as the target (y)

In [29]:
from sklearn.model_selection import train_test_split

# Split the data into features (X) and target (y)
X = ai4i2020.drop('Machine failure', axis=1)
y = ai4i2020['Machine failure']

# Split the data into training data and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

X_train.head()


Unnamed: 0,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min]
4901,M,303.6,312.3,1630,32.4,223
4375,L,302.0,309.7,1414,36.3,209
6698,L,301.6,310.8,1418,44.7,46
9805,L,298.4,309.2,1651,28.5,141
1101,H,296.7,307.5,1607,33.6,38


Question 2.2: Apply [One-Hot Encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) to data. Make sure to Fit the training data and transform both training and test data. 

I tried encoding just based on Sklearn One-Hot documentation. HOwever, the problem I ran into was that the encoding just gives you the new one-hot encoded columns back but I didn't know how to re-integrate those with the original data set.
The following article was helpful in doing this as a one step process. The article suggested use of the column transformer.

https://datagy.io/sklearn-one-hot-encode/

In [30]:
# Using make_column_transformer to One-Hot Encode
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

c_transform = make_column_transformer(
    (OneHotEncoder(), ['Type']),
    remainder='passthrough')

c_transform.fit(X_train)

X_train_enc = c_transform.transform(X_train)
X_test_enc = c_transform.transform(X_test)

X_train_enc = pd.DataFrame(X_train_enc)
X_test_enc = pd.DataFrame(X_test_enc)

X_train_enc.head()
X_test_enc.head()   
# Hmmm, looks like I lost the original column names from the dataset


Unnamed: 0,0,1,2,3,4,5,6,7
0,0.0,1.0,0.0,300.8,310.3,1538.0,36.1,198.0
1,0.0,0.0,1.0,303.6,311.8,1421.0,44.8,101.0
2,0.0,0.0,1.0,298.3,307.9,1485.0,42.0,117.0
3,0.0,1.0,0.0,303.3,311.3,1592.0,33.7,14.0
4,0.0,1.0,0.0,302.4,310.4,1865.0,23.9,129.0


Question 2.3: Apply [SMOTE](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html) to the training data since there is class imbalance.

In [31]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)

X_train_resampled, y_train_resampled = smote.fit_resample(X_train_enc, y_train)

Question 3.1: Train five machine learning [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), and [XGBClassifier](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier) based on the training data, and evaluate their performance on the test dataset. Use default hyperparameter values.

In [32]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train_resampled, y_train_resampled)

y_pred = logreg.predict(X_test_enc)
print('Logistic Regression')
print(classification_report(y_test, y_pred))

# Support Vector Machine
svc = SVC()
svc.fit(X_train_resampled, y_train_resampled)

y_pred = svc.predict(X_test_enc)
print('Support Vector Machine')
print(classification_report(y_test, y_pred))

# KNeighbors Classifier
knn = KNeighborsClassifier()
knn.fit(X_train_resampled, y_train_resampled)
y_pred = knn.predict(X_test_enc)
print('KNeighbors Classifier')
print(classification_report(y_test, y_pred))

# Decision Tree Classifier
dtree = DecisionTreeClassifier()
dtree.fit(X_train_resampled, y_train_resampled)

y_pred = dtree.predict(X_test_enc)
print('Decision Tree Classifier')
print(classification_report(y_test, y_pred))

# XGBoost Classifier
xgb = XGBClassifier()
xgb.fit(X_train_resampled, y_train_resampled)
y_pred = xgb.predict(X_test_enc)
print('XGBoost Classifier')
print(classification_report(y_test, y_pred))


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Logistic Regression
              precision    recall  f1-score   support

           0       0.99      0.84      0.91      2428
           1       0.13      0.82      0.22        72

    accuracy                           0.84      2500
   macro avg       0.56      0.83      0.57      2500
weighted avg       0.97      0.84      0.89      2500

Support Vector Machine
              precision    recall  f1-score   support

           0       0.99      0.80      0.89      2428
           1       0.11      0.83      0.20        72

    accuracy                           0.80      2500
   macro avg       0.55      0.82      0.54      2500
weighted avg       0.97      0.80      0.87      2500

KNeighbors Classifier
              precision    recall  f1-score   support

           0       0.99      0.90      0.94      2428
           1       0.14      0.58      0.23        72

    accuracy                           0.89      2500
   macro avg       0.56      0.74      0.58      2500
weighted 

By the time I got to this step, I realized that I had pretty much done the 5 models above again below. The below code is much more compact and readable so I left it in.

In [38]:
#Build models (You can either do it combined or separate)
models = {'Logistic Regresion': LogisticRegression(), 
          'Support Vector Machine': SVC(), 
          'K-NN': KNeighborsClassifier(), 
          'Decision Tree':DecisionTreeClassifier(), 
          'XGBoost': XGBClassifier()}

models_accuracy = {}
models_precision = {}

for name, model in models.items():
    model.fit(X_train_resampled, y_train_resampled)
    y_pred = model.predict(X_test_enc)
    print(name)
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
    print('\n\n')
    models_accuracy[name] = metrics.accuracy_score(y_test, y_pred)
    models_precision[name] = metrics.precision_score(y_test, y_pred)

print(models_accuracy, sep='\n')
print(models_precision)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Logistic Regresion
              precision    recall  f1-score   support

           0       0.99      0.84      0.91      2428
           1       0.13      0.82      0.22        72

    accuracy                           0.84      2500
   macro avg       0.56      0.83      0.57      2500
weighted avg       0.97      0.84      0.89      2500

[[2033  395]
 [  13   59]]



Support Vector Machine
              precision    recall  f1-score   support

           0       0.99      0.80      0.89      2428
           1       0.11      0.83      0.20        72

    accuracy                           0.80      2500
   macro avg       0.55      0.82      0.54      2500
weighted avg       0.97      0.80      0.87      2500

[[1948  480]
 [  12   60]]



K-NN
              precision    recall  f1-score   support

           0       0.99      0.90      0.94      2428
           1       0.14      0.58      0.23        72

    accuracy                           0.89      2500
   macro avg       0.

Questions 3.2:  Perform recursive feature elimination (3 features) on the dataset using a logistic regression classifier with max_iter= 1000, random_state=5.  Any difference in the results? Explain.

In [37]:
from sklearn.feature_selection import RFE


lr = LogisticRegression(max_iter=1000, random_state=5)
rfe = RFE(lr, n_features_to_select=3, step=1)
rfe.fit(X_train_resampled, y_train_resampled)

y_pred_rfe = rfe.predict(X_test_enc) 

model_accuracy = metrics.accuracy_score(y_test, y_pred)
model_precision = metrics.precision_score(y_test, y_pred)

print(model_accuracy)
print(model_precision)
print('\n\n')


0.9748
0.5473684210526316





Q.4. Create a new text cell in your Notebook: Complete a 50-100 word summary (or short description of your thinking in applying this week's learning to the solution) of your experience in this assignment. Include:
What was your incoming experience with this model, if any? what steps you took, what obstacles you encountered. how you link this exercise to real-world, machine learning problem-solving. (What steps were missing? What else do you need to learn?) This summary allows your instructor to know how you are doing and allot points for your effort in thinking and planning, and making connections to real-world work. 

## My Response to Q4

From the table seen two cells below, the accuracy score of each model increased in series.
A small improvement can be seen when moving from one model to the next, however a very large jump in accuracy can be seen whe comparing the worst model accuracy ('Logistic Regression') with the best model accuracy ('XGBoost'). the jump in accuracy is dramatic. From 83% accurate to 97+% accurate.

So when the assignment said to try the same prediction on the same data using the worst performing model from Question 3.1 (Logistic Regression) I was skeptical that I would improve the accuracy as much as was observed. I did not expect that logistic regression with simple feature elimination would improve Logistic Regression to be as good as the best performing model from Question 3.1. LR+RFE performed as well as XGBoost with ~97.5% accuracy which was surprising.

XGBoost	                        0.9748
Logistic Regression with RFE	0.9748




In [36]:
import pandas as pd


model_accuracy_values = {
    'Logistic Regression': models_accuracy['Logistic Regresion'],
    'Support Vector Machine': models_accuracy['Support Vector Machine'],
    'K-NN': models_accuracy['K-NN'],
    'Decision Tree': models_accuracy['Decision Tree'],
    'XGBoost': models_accuracy['XGBoost'],
    'Logistic Regression with RFE': model_accuracy
}


accuracy_table = pd.DataFrame.from_dict(model_accuracy_values, orient='index', columns=['Accuracy'])


accuracy_table

Unnamed: 0,Accuracy
Logistic Regression,0.8368
Support Vector Machine,0.8032
K-NN,0.8872
Decision Tree,0.9612
XGBoost,0.9748
Logistic Regression with RFE,0.9748
