# Predictive Maintenance 

This assignment covers predictive maintenance: predicting when a machine will need maintenance before it breaks down. The problem arises wherever machines require regular maintenance, for example in manufacturing, fleet operations, and train maintenance.

This assignment uses the [Predictive Maintenance Dataset](https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset). The dataset consists of 10 000 data points stored as rows, with 14 features in columns. The 'Machine failure' label indicates whether the machine failed at that data point.

# Learning Objectives
- Perform model tuning based on hyperparameters.
- Select the best model after attempting multiple models.
- Perform recursive feature elimination, producing a statistically significant improvement over a model without feature selection.

In [28]:
import pandas as pd
import numpy as np
from sklearn import preprocessing, metrics
from sklearn.model_selection import train_test_split

import re


# Load the dataset and strip bracketed suffixes from the column names.
ai4i2020 = pd.read_csv('ai4i2020.csv')
ai4i2020 = ai4i2020.rename(columns={x: re.sub(r'\[.*\]', '', x).strip() for x in ai4i2020.columns})
print(ai4i2020.info())
ai4i2020.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   UDI                  10000 non-null  int64  
 1   Product ID           10000 non-null  object 
 2   Type                 10000 non-null  object 
 3   Air temperature      10000 non-null  object 
 4   Process temperature  10000 non-null  object 
 5   Rotational speed     10000 non-null  int64  
 6   Torque               10000 non-null  float64
 7   Tool wear            10000 non-null  int64  
 8   Machine failure      10000 non-null  int64  
dtypes: float64(1), int64(4), object(4)
memory usage: 703.3+ KB
None


Unnamed: 0,UDI,Product ID,Type,Air temperature,Process temperature,Rotational speed,Torque,Tool wear,Machine failure
0,1,M14860,M,298.1,308.6,1551,42.8,0,0
1,2,L47181,L,298.2,308.7,1408,46.3,3,0
2,3,L47182,L,298.1,308.5,1498,49.4,5,0
3,4,L47183,L,298.2,308.6,1433,39.5,7,0
4,5,L47184,L,298.2,308.7,1408,40.0,9,0
5,6,M14865,M,298.1,308.6,1425,41.9,11,0
6,7,L47186,L,298.1,308.6,1558,42.4,14,0
7,8,L47187,L,298.1,308.6,1527,40.2,16,0
8,9,M14868,M,298.3,308.7,1667,28.6,18,0
9,10,M14869,M,298.5,309.0,1741,28.0,21,0


Question 1.1:  Write a command that will calculate the number of unique values for each feature in the training data.

In [29]:
display(ai4i2020.nunique())

UDI                    10000
Product ID             10000
Type                       3
Air temperature           93
Process temperature       82
Rotational speed         941
Torque                   577
Tool wear                246
Machine failure            2
dtype: int64

Question 1.2: Determine whether the data contains any missing values, which are encoded as '?', and replace them with np.nan.

In [30]:
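# Count '?' placeholders in each column; any non-zero count means missing data.
has_missing = ai4i2020.where(ai4i2020 == "?").count().any()
print("Has missing values!" if has_missing else "No missing values!")

# Replace the placeholder with NaN, then confirm that no '?' remains.
ai4i2020 = ai4i2020.replace('?', np.nan)
has_missing = ai4i2020.where(ai4i2020 == "?").count().any()
print("Has missing values!" if has_missing else "No missing values!")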


Has missing values!
No missing values!
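
The second check in the cell above looks for '?' again, so it confirms that the placeholder is gone but does not count the NaN values that replaced it. A minimal sketch of counting both, using a hypothetical toy frame (the column names mirror the dataset, the values are invented):

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Air temperature": ["298.1", "?", "298.3"],
    "Torque": [42.8, 46.3, 49.4],
})

# Count '?' placeholders per column before replacing them.
placeholder_counts = (toy == "?").sum()

# Replace the placeholder, then count real missing values with isna().
toy_clean = toy.replace("?", np.nan)
nan_counts = toy_clean.isna().sum()

print(placeholder_counts)
print(nan_counts)
```

Counting with `isna()` after the replacement shows how many cells the later mean imputation will have to fill.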


Question 1.3: Convert the columns to numeric types and replace all missing values with the column mean.

In [31]:
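# Every column except the identifier and category columns should be numeric.
numeric_columns = ai4i2020.columns.drop(['Product ID', 'Type'])

# Convert to numeric dtypes, then fill each column's NaN values with that column's mean.
ai4i2020[numeric_columns] = ai4i2020[numeric_columns].apply(pd.to_numeric).apply(lambda x: x.fillna(x.mean()))

# Fail loudly if any missing value is left.
assert ai4i2020.notna().all().all()

print(ai4i2020.info())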



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   UDI                  10000 non-null  int64  
 1   Product ID           10000 non-null  object 
 2   Type                 10000 non-null  object 
 3   Air temperature      10000 non-null  float64
 4   Process temperature  10000 non-null  float64
 5   Rotational speed     10000 non-null  int64  
 6   Torque               10000 non-null  float64
 7   Tool wear            10000 non-null  int64  
 8   Machine failure      10000 non-null  int64  
dtypes: float64(3), int64(4), object(2)
memory usage: 703.3+ KB
None
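
The cell above computes each mean over all 10 000 rows, before the train/test split in Question 2.1. One alternative, sketched here on invented values and assuming the split has already been made, is to learn the mean from the training rows only and reuse it for the test rows. `SimpleImputer` is a scikit-learn class; the `train` and `test` frames are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented values for illustration; NaN marks a missing reading.
train = pd.DataFrame({"Air temperature": [298.0, np.nan, 300.0]})
test = pd.DataFrame({"Air temperature": [np.nan, 301.0]})

# Learn the column mean from the training rows only ...
imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train)

# ... and reuse that same mean for the test rows.
test_filled = imputer.transform(test)

print(train_filled.ravel())
print(test_filled.ravel())
```

This keeps information from the test rows out of the preprocessing step, at the cost of moving imputation after the split.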


Question 1.4: Drop 'UDI' and 'Product ID' from the data.

In [32]:
ai4i2020_features = ai4i2020.drop(['UDI', 'Product ID'], axis=1)
display(ai4i2020_features.head())
display(ai4i2020_features.info())

Unnamed: 0,Type,Air temperature,Process temperature,Rotational speed,Torque,Tool wear,Machine failure
0,M,298.1,308.6,1551,42.8,0,0
1,L,298.2,308.7,1408,46.3,3,0
2,L,298.1,308.5,1498,49.4,5,0
3,L,298.2,308.6,1433,39.5,7,0
4,L,298.2,308.7,1408,40.0,9,0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Type                 10000 non-null  object 
 1   Air temperature      10000 non-null  float64
 2   Process temperature  10000 non-null  float64
 3   Rotational speed     10000 non-null  int64  
 4   Torque               10000 non-null  float64
 5   Tool wear            10000 non-null  int64  
 6   Machine failure      10000 non-null  int64  
dtypes: float64(3), int64(3), object(1)
memory usage: 547.0+ KB


None

Question 2.1: Split the data into training and test sets, using 'Machine failure' as the target (y).

In [33]:
y = ai4i2020_features[['Machine failure']]
display(y.head())

X = ai4i2020_features.drop('Machine failure', axis=1)
display(X.head())
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Unnamed: 0,Machine failure
0,0
1,0
2,0
3,0
4,0


Unnamed: 0,Type,Air temperature,Process temperature,Rotational speed,Torque,Tool wear
0,M,298.1,308.6,1551,42.8,0
1,L,298.2,308.7,1408,46.3,3
2,L,298.1,308.5,1498,49.4,5
3,L,298.2,308.6,1433,39.5,7
4,L,298.2,308.7,1408,40.0,9
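
Because failures are rare, a plain random split can leave the two parts with different failure rates, and without a seed the split changes on every run. A minimal sketch of a stratified, seeded split on synthetic data (the counts and the seed are arbitrary assumptions, not values from the assignment):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 30 failures among 1000 rows (invented numbers).
y_demo = np.array([1] * 30 + [0] * 970)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    stratify=y_demo,   # keep the failure rate equal in both parts
    random_state=0,    # arbitrary seed, chosen only for repeatability
)

# Both parts should show the same 3% failure rate.
print(y_tr.mean(), y_te.mean())
```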


Question 2.2: Apply [One-Hot Encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) to the data. Make sure to fit on the training data and transform both the training and test data.

In [34]:
from sklearn.preprocessing import OneHotEncoder

one_hot_encoder = OneHotEncoder(sparse_output=False)
one_hot_encoder.fit(X_train[['Type']])

X_train[one_hot_encoder.get_feature_names_out()] = one_hot_encoder.transform(X_train[['Type']])
X_test[one_hot_encoder.get_feature_names_out()] = one_hot_encoder.transform(X_test[['Type']])

X_train = X_train.drop('Type', axis=1)
X_test = X_test.drop('Type', axis=1)
display(X_train.head())
display(X_test.head())


Unnamed: 0,Air temperature,Process temperature,Rotational speed,Torque,Tool wear,Type_H,Type_L,Type_M
9849,298.6,309.4,2312,15.5,44,0.0,1.0,0.0
2596,299.0,308.7,1426,38.6,16,0.0,1.0,0.0
994,296.2,307.2,1168,63.4,172,0.0,0.0,1.0
2469,299.0,308.7,1507,37.5,134,0.0,1.0,0.0
3683,302.0,311.2,1558,39.1,179,0.0,0.0,1.0


Unnamed: 0,Air temperature,Process temperature,Rotational speed,Torque,Tool wear,Type_H,Type_L,Type_M
2990,300.6,309.8,1688,28.6,135,0.0,1.0,0.0
910,295.5,306.0,1546,35.9,169,0.0,1.0,0.0
4913,303.5,312.3,1593,35.2,29,0.0,0.0,1.0
2259,299.1,308.3,1431,52.3,38,0.0,1.0,0.0
6235,301.3,310.9,1596,33.9,157,0.0,1.0,0.0


Question 2.3: Apply [SMOTE](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html) to the training data since there is class imbalance.

In [35]:
from imblearn.over_sampling import SMOTE

smote = SMOTE()
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

display(X_train_resampled.head())
display(y_train_resampled.head())

Unnamed: 0,Air temperature,Process temperature,Rotational speed,Torque,Tool wear,Type_H,Type_L,Type_M
0,298.6,309.4,2312,15.5,44,0.0,1.0,0.0
1,299.0,308.7,1426,38.6,16,0.0,1.0,0.0
2,296.2,307.2,1168,63.4,172,0.0,0.0,1.0
3,299.0,308.7,1507,37.5,134,0.0,1.0,0.0
4,302.0,311.2,1558,39.1,179,0.0,0.0,1.0


Unnamed: 0,Machine failure
0,0
1,0
2,0
3,0
4,0
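
To make the resampling step less of a black box, here is a simplified NumPy illustration of the idea behind SMOTE: each synthetic row is placed on the line segment between a minority sample and one of its nearest minority neighbours. It is not imbalanced-learn's implementation; the function name `smote_like_oversample` and the toy points are hypothetical:

```python
import numpy as np

def smote_like_oversample(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority samples.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)             # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))    # random minority sample
        mate = rng.choice(neighbours[base])     # one of its k nearest neighbours
        gap = rng.random()                      # position along the segment
        synthetic[i] = X_minority[base] + gap * (X_minority[mate] - X_minority[base])
    return synthetic

# Four invented minority-class points in two dimensions.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_rows = smote_like_oversample(minority, n_new=6)
print(new_rows)
```

Interpolation treats every column as continuous, so one-hot columns such as Type_L can receive fractional values; imbalanced-learn provides `SMOTENC` for data that mixes numeric and categorical features.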


Question 3.1: Train five machine learning models, [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), [SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), and [XGBClassifier](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier), on the training data, and evaluate their performance on the test dataset. Use default hyperparameter values.

In [48]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

logistic_regression = LogisticRegression(max_iter=1000)
logistic_regression.fit(X_train_resampled, y_train_resampled['Machine failure'])
y_pred_logistic_regression = logistic_regression.predict(X_test)
print(classification_report(y_test, y_pred_logistic_regression))

svc = SVC()
svc.fit(X_train_resampled, y_train_resampled['Machine failure'])
y_pred_svc = svc.predict(X_test)
print(classification_report(y_test, y_pred_svc))

knn = KNeighborsClassifier()
knn.fit(X_train_resampled, y_train_resampled['Machine failure'])
y_pred_knn = knn.predict(X_test)
print(classification_report(y_test, y_pred_knn))

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train_resampled, y_train_resampled['Machine failure'])
y_pred_decision_tree = decision_tree.predict(X_test)
print(classification_report(y_test, y_pred_decision_tree))

xgboost = XGBClassifier()
xgboost.fit(X_train_resampled, y_train_resampled['Machine failure'])
y_pred_xgboost = xgboost.predict(X_test)
print(classification_report(y_test, y_pred_xgboost))



              precision    recall  f1-score   support

           0       0.99      0.83      0.90      1926
           1       0.14      0.76      0.24        74

    accuracy                           0.82      2000
   macro avg       0.57      0.79      0.57      2000
weighted avg       0.96      0.82      0.88      2000

              precision    recall  f1-score   support

           0       0.99      0.79      0.88      1926
           1       0.13      0.81      0.22        74

    accuracy                           0.79      2000
   macro avg       0.56      0.80      0.55      2000
weighted avg       0.96      0.79      0.86      2000

              precision    recall  f1-score   support

           0       0.98      0.89      0.94      1926
           1       0.17      0.55      0.26        74

    accuracy                           0.88      2000
   macro avg       0.57      0.72      0.60      2000
weighted avg       0.95      0.88      0.91      2000

              preci

In [10]:
# Build models (You can either do it combined or separate)

models = {
    "Logistic Regression": LogisticRegression(),
    "Support Vector Machine": SVC(),
    "K-NN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "XGBoost": XGBClassifier(),
}
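
A dictionary like this lends itself to a single fit-and-score loop in place of repeating the same four lines per model. The sketch below runs on synthetic data from `make_classification` and leaves out SVC and XGBoost only to stay short and dependency-free. The names `demo_models` and `scores`, and the choice of F1 on the positive class as the comparison metric, are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the real data (about 10% positives).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

demo_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-NN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Fit every model on the same split and record one comparable score each.
scores = {}
for name, model in demo_models.items():
    model.fit(X_tr, y_tr)
    scores[name] = f1_score(y_te, model.predict(X_te))

# List the models from best to worst score.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: F1 (positive class) = {score:.2f}")
```

Keeping the scores in one dictionary makes the "select the best model" step a simple sort and not a comparison of printed reports by eye.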

Question 3.2: Perform recursive feature elimination (keeping 3 features) on the dataset using a logistic regression classifier with max_iter=1000 and random_state=5. Is there any difference in the results? Explain.

In [53]:
from sklearn.feature_selection import RFE

estimator = LogisticRegression(max_iter=1000, random_state=5)
selector = RFE(estimator, n_features_to_select=3)
selector.fit(X_train_resampled, y_train_resampled['Machine failure'])

y_pred_rfe = selector.predict(X_test)
print(classification_report(y_test, y_pred_rfe))

print(selector.get_feature_names_out())

              precision    recall  f1-score   support

           0       0.98      0.59      0.73      1926
           1       0.05      0.62      0.10        74

    accuracy                           0.59      2000
   macro avg       0.52      0.60      0.42      2000
weighted avg       0.94      0.59      0.71      2000

['Air temperature' 'Process temperature' 'Type_L']


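Accuracy dropped from 0.82 for the logistic regression trained on all features to 0.59 with the three RFE-selected features, and recall on the failure class fell from 0.76 to 0.62. Keeping only 'Air temperature', 'Process temperature', and 'Type_L' oversimplifies the model: rotational speed, torque, and tool wear are all discarded.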
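
One factor worth checking is feature scale. With a logistic regression estimator, RFE ranks features by the size of their coefficients, and coefficients of unscaled features depend on the units of each column. The sketch below, on synthetic data, standardizes the features inside a pipeline before running RFE. Whether this changes the selection on the real dataset is untested here, and the `StandardScaler` step is an assumption, not part of the assignment:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 8 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0, random_state=5
)
# Stretch one column so the features live on very different scales.
X_demo[:, 0] *= 1000.0

# Standardize first, then let RFE keep the 3 features with the largest coefficients.
rfe_pipeline = make_pipeline(
    StandardScaler(),
    RFE(LogisticRegression(max_iter=1000, random_state=5), n_features_to_select=3),
)
rfe_pipeline.fit(X_demo, y_demo)

# Boolean mask over the 8 input columns; True marks a kept feature.
selected_mask = rfe_pipeline.named_steps["rfe"].support_
print(selected_mask)
```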

Q.4. Create a new text cell in your notebook and write a 50-100 word summary (or a short description of your thinking in applying this week's learning to the solution) of your experience in this assignment. Include:
What was your incoming experience with this model, if any? What steps did you take, and what obstacles did you encounter? How do you link this exercise to real-world machine learning problem-solving? (What steps were missing? What else do you need to learn?) This summary lets your instructor know how you are doing and allot points for your effort in thinking, planning, and making connections to real-world work.

Before starting the exercise, I didn't have any real experience working with a model end to end. I had to massage the data a little, but assuming I did everything correctly, I feel the results are pretty good. One surprising thing is that the model with features removed actually performs worse.