# Classification on Manufacturing Data

In this notebook, you will learn:
- how to clean/tidy data
- how to deal with unbalanced classes
- several classification models, such as logistic regression and KNN

In [62]:
# Load the Python libraries used in this notebook.
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import accuracy_score, balanced_accuracy_score, classification_report
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

### What dataset will we be using?
We will be working with a manufacturing dataset that is modeled after real milling machines.
<br></br>
A milling machine is a versatile industrial tool used to shape and machine solid materials, such as metal, wood, or plastic. It employs rotary cutters to remove material from a workpiece, creating complex shapes, holes, and patterns with high precision. Milling machines are integral to manufacturing processes, enabling the creation of precise parts for industries like automotive, aerospace, and electronics. They come in various types, including vertical and horizontal mills, each offering specific capabilities for tasks such as cutting, drilling, and contouring, making them essential for both prototyping and mass production.
<br></br>
This dataset could be useful if we wish to learn more about how machine failures are related to technical specifications such as air and process temperature, rotational speed, torque, and more.
<br></br>
If you would like to access this dataset from the source, you can find it here: https://www.kaggle.com/datasets/stephanmatzka/predictive-maintenance-dataset-ai4i-2020.

Let's take a look at our dataset: 

In [63]:
# Read the CSV dataset from the data folder into the Python workspace
# and name the resulting DataFrame "data".
data = pd.read_csv('../data/millingMachine.csv')
data.head() # print out only the first few rows, for easy viewing.

Unnamed: 0,UDI,Product ID,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,TWF,HDF,PWF,OSF,RNF
0,1,M14860,M,298.1,308.6,1551,42.8,0,0,0,0,0,0,0
1,2,L47181,L,298.2,308.7,1408,46.3,3,0,0,0,0,0,0
2,3,L47182,L,298.1,308.5,1498,49.4,5,0,0,0,0,0,0
3,4,L47183,L,298.2,308.6,1433,39.5,7,0,0,0,0,0,0
4,5,L47184,L,298.2,308.7,1408,40.0,9,0,0,0,0,0,0


Now that we have loaded our data, let's look further into some of the columns we'll be working with:

- UDI: unique identifier ranging from 1 to 10000
- product ID: consisting of a letter L, M, or H for low (50% of all products), medium (30%) and high (20%) as product quality variants and a variant-specific serial number
- type: just the product type L, M, or H, taken from the product ID column
- air temperature [K]: generated using a random walk process later normalized to a standard deviation of 2 K around 300 K
- process temperature [K]: generated using a random walk process normalized to a standard deviation of 1 K, added to the air temperature plus 10 K.
- rotational speed [rpm]: calculated from a power of 2860 W, overlaid with a normally distributed noise
- torque [Nm]: torque values are normally distributed around 40 Nm with a SD = 10 Nm and no negative values.
- tool wear [min]: The quality variants H/M/L add 5/3/2 minutes of tool wear to the used tool in the process.
- a 'machine failure' label that indicates whether the machine has failed at this particular data point because any of the failure modes below is true.


Moreover, when it comes to machine failure, there are five independent failure types that should be noted:

1. tool wear failure (TWF): the tool will be replaced or fail at a randomly selected tool wear time between 200 and 240 mins (120 times in our dataset). At this point in time, the tool is replaced 69 times, and fails 51 times (randomly assigned).
2. heat dissipation failure (HDF): heat dissipation causes a process failure if the difference between air and process temperature is below 8.6 K and the tool's rotational speed is below 1380 rpm. This is the case for 115 data points.
3. power failure (PWF): the product of torque and rotational speed (in rad/s) equals the power required for the process. If this power is below 3500 W or above 9000 W, the process fails, which is the case 95 times in our dataset.
4. overstrain failure (OSF): if the product of tool wear and torque exceeds 11,000 minNm for the L product variant (12,000 M, 13,000 H), the process fails due to overstrain. This is true for 98 data points.
5. random failures (RNF): each process has a 0.1% chance of failing regardless of its process parameters. This is the case for only 5 data points, fewer than could be expected for 10,000 data points in our dataset.

If at least one of the above failure modes is true, the process fails and the 'machine failure' label is set to 1. The machine learning method therefore cannot see which of the failure modes caused the process to fail.
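The deterministic rules above can be sketched in a few lines. This is a minimal illustration, not the dataset author's generator: the dictionary keys (`rpm`, `torque`, and so on) and the function name `rule_based_flags` are hypothetical, and the two random modes (TWF and RNF) are left out because they cannot be recomputed from a single row. The example values come from the first row of the preview table.

```python
import math

# Overstrain limits (min*Nm) per product variant, taken from the OSF description.
OSF_LIMITS = {"L": 11_000, "M": 12_000, "H": 13_000}


def rule_based_flags(row):
    """Return the deterministic failure flags (HDF, PWF, OSF) for one record."""
    # Convert rpm to rad/s, then multiply by torque to get power in watts.
    omega = row["rpm"] * 2 * math.pi / 60
    power = row["torque"] * omega
    temp_gap = row["process_temp"] - row["air_temp"]
    return {
        "HDF": temp_gap < 8.6 and row["rpm"] < 1380,
        "PWF": power < 3500 or power > 9000,
        "OSF": row["tool_wear"] * row["torque"] > OSF_LIMITS[row["type"]],
    }


# First row of the dataset preview (hypothetical key names).
first_row = {
    "type": "M",
    "air_temp": 298.1,
    "process_temp": 308.6,
    "rpm": 1551,
    "torque": 42.8,
    "tool_wear": 0,
}

flags = rule_based_flags(first_row)
# The label is 1 as soon as any failure mode is true.
machine_failure = int(any(flags.values()))
print(flags, machine_failure)
```

For that row the power is roughly 6,950 W, inside the 3,500 to 9,000 W band, so none of the three flags fires and the label stays 0.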

### Let's get a quick summary of our machine maintenance data! This can be done using the ```.info()``` method.

In [64]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   UDI                      10000 non-null  int64  
 1   Product ID               10000 non-null  object 
 2   Type                     10000 non-null  object 
 3   Air temperature [K]      10000 non-null  float64
 4   Process temperature [K]  10000 non-null  float64
 5   Rotational speed [rpm]   10000 non-null  int64  
 6   Torque [Nm]              10000 non-null  float64
 7   Tool wear [min]          10000 non-null  int64  
 8   Machine failure          10000 non-null  int64  
 9   TWF                      10000 non-null  int64  
 10  HDF                      10000 non-null  int64  
 11  PWF                      10000 non-null  int64  
 12  OSF                      10000 non-null  int64  
 13  RNF                      10000 non-null  int64  
dtypes: float64(3), int64(9)

### What does the above summary tell us about our dataset?
- There's a total of 14 columns and 10,000 rows.
- There are 3 float64 columns, 9 int64 columns, and 2 object columns.
- There are no columns with missing values.

For now, we are interested in the machine failure column.

Let's use the ```value_counts()``` function to obtain statistics about the distribution of working and failing machines.

In [65]:
data['Machine failure'].value_counts()

0    9661
1     339
Name: Machine failure, dtype: int64

What if we wanted the percentage distribution of the machine failures? 

What would that look like?

In [66]:
data['Machine failure'].value_counts(normalize=True)

0    0.9661
1    0.0339
Name: Machine failure, dtype: float64

So about 97% of the data points show no failure, and about 3% record a failure. The two classes are heavily unbalanced.
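With a split this lopsided, plain accuracy can look good for a model that has learned nothing. The sketch below rebuilds only the label counts printed above (no features involved) and scores a hypothetical "model" that never predicts a failure. The variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Rebuild the label distribution reported by value_counts(): 9661 zeros, 339 ones.
y_true = np.array([0] * 9661 + [1] * 339)

# A trivial predictor that always answers "no failure".
y_always_ok = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_always_ok)
bal_acc = balanced_accuracy_score(y_true, y_always_ok)
print(f"accuracy={acc:.4f}, balanced accuracy={bal_acc:.4f}")
```

Accuracy equals the share of the majority class, while balanced accuracy (the mean of per-class recall) drops to 0.5 because every failure is missed. That number is a useful floor to compare any real classifier against.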

Since machine failure is a binary variable, every data point is either a failure (1) or not a failure (0).

This sort of binary classification calls for **logistic regression**. 

### Logistic Regression

Logistic regression is a statistical method used for binary classification. It models the probability of an instance belonging to a particular class by applying the logistic function to a linear combination of input features. The resulting value, ranging between 0 and 1, represents the likelihood of the instance being in the positive class. A decision threshold is used to classify instances. Logistic regression is widely used for tasks like spam detection and medical diagnosis, offering insight into relationships between features and outcomes while providing interpretable results. Despite the name, logistic regression is used for classification, not regression.
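As a rough illustration of that description, the snippet below applies the logistic (sigmoid) function to a linear combination of two features and then thresholds the probability at 0.5. The weights, bias, and feature values are made up for the example and are not fitted to the milling data.

```python
import numpy as np


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical weights, bias, and one (already scaled) feature vector.
w = np.array([0.8, -0.5])
b = -1.0
x = np.array([1.5, 0.2])

z = w @ x + b          # linear combination: 1.2 - 0.1 - 1.0 = 0.1
p = sigmoid(z)         # estimated probability of the positive class
label = int(p >= 0.5)  # decision threshold of 0.5
print(f"z={z:.2f}, p={p:.3f}, label={label}")
```

Moving the threshold away from 0.5 trades false alarms against missed failures, which matters when one class is rare.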

Logistic regression is useful because the independent feature variables that predict the binary outcome can be either numeric or categorical, as long as the categorical ones are encoded as numbers first. The dependent target variable, however, is always categorical. Let's take advantage of this by using both categorical and numeric variables as our predictors.

In [67]:
data.head()

Unnamed: 0,UDI,Product ID,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,TWF,HDF,PWF,OSF,RNF
0,1,M14860,M,298.1,308.6,1551,42.8,0,0,0,0,0,0,0
1,2,L47181,L,298.2,308.7,1408,46.3,3,0,0,0,0,0,0
2,3,L47182,L,298.1,308.5,1498,49.4,5,0,0,0,0,0,0
3,4,L47183,L,298.2,308.6,1433,39.5,7,0,0,0,0,0,0
4,5,L47184,L,298.2,308.7,1408,40.0,9,0,0,0,0,0,0


### Which Columns should be Used as our Predictors?

In terms of predictor variables, let's take Type, Air temperature [K], Process temperature [K], Rotational speed [rpm], Torque [Nm], and Tool wear [min]. 

I did not take UDI because it is simply a row identifier: it has one unique value per row and carries no information about machine failure.

I also left out Product ID. It does contain the product type, which may be useful, but that information is already in the Type column, and the serial number after the type letter (for example, the "14860" in M14860) has no bearing on machine failure.

Lastly, I omitted all 5 of the failure codes (TWF, HDF, PWF, OSF, RNF) because we are only interested in whether the machine failed or not (at least for now), not in the exact type of failure. Including them would also give the answer away, since the machine failure label is set to 1 whenever one of them is 1.

For now, let's separate our data into the input data (X) and the target variable (y).  

In [68]:
X = data.iloc[:,2:8]
y = data[["Machine failure"]]

X

Unnamed: 0,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min]
0,M,298.1,308.6,1551,42.8,0
1,L,298.2,308.7,1408,46.3,3
2,L,298.1,308.5,1498,49.4,5
3,L,298.2,308.6,1433,39.5,7
4,L,298.2,308.7,1408,40.0,9
...,...,...,...,...,...,...
9995,M,298.8,308.4,1604,29.5,14
9996,H,298.9,308.4,1632,31.8,17
9997,M,299.0,308.6,1645,33.4,22
9998,H,299.0,308.7,1408,48.5,25


Now that we have our separated datasets, we can split them further into training and testing sets. 

In [69]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
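The split above is purely random, so with only 339 failures the share of failures in each part can drift a little. One common option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses a placeholder feature column and only the label counts reported earlier.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same counts as the dataset; the feature is a placeholder.
y_demo = np.array([0] * 9661 + [1] * 339)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the failure rate (about 3.4%) nearly identical in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train failure rate={train_rate:.4f}, test failure rate={test_rate:.4f}")
```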

Now, let's identify which columns are categorical and which are numeric. This will be useful later on when we need to perform column transformations, such as scaling and one-hot encoding.

In [70]:
X_train.head()

Unnamed: 0,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min]
9069,M,297.2,308.2,1678,28.1,133
2603,M,299.3,309.2,1334,46.3,31
7738,M,300.5,312.0,1263,60.8,146
1579,L,298.3,308.3,1444,43.8,176
5058,L,303.9,312.9,1526,42.5,194


In [71]:
numeric_features = list(X_train.iloc[:, 1:].columns.values)
categorical_features = ['Type']
numeric_features

['Air temperature [K]',
 'Process temperature [K]',
 'Rotational speed [rpm]',
 'Torque [Nm]',
 'Tool wear [min]']

Let's think about what transformations we need to apply to our columns. 

We need a ```StandardScaler``` for our numerical columns.

We also need an ```OneHotEncoder``` for our categorical variable, Type.

We won't need an imputer because we do not have any missing values. 
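To make the scaling step concrete, here is a small sketch of what `StandardScaler` computes: each value has the column mean subtracted and is divided by the column's population standard deviation. The five air temperatures are the ones shown in the preview rows; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Air temperatures [K] from the first five rows of the dataset preview.
air_temp = np.array([[298.1], [298.2], [298.1], [298.2], [298.2]])

scaled = StandardScaler().fit_transform(air_temp)

# The same result by hand: z = (x - mean) / std, with the population std (ddof=0).
manual = (air_temp - air_temp.mean()) / air_temp.std()

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

After scaling, the column has mean 0 and standard deviation 1, so features measured in very different units (kelvin, rpm, newton metres) contribute on a comparable scale.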

In [72]:
ct = make_column_transformer(
    (StandardScaler(), numeric_features),
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_features),
)

ct

In the next cells, you will notice that the ```named_transformers_``` attribute is accessible only **after** we call ```fit``` or ```fit_transform``` on the column transformer. Until then, the transformers have only been declared. Fitting is the step where each transformer learns its parameters from the data, and ```named_transformers_``` holds those fitted transformer instances.

Here's why it works this way:

1. **Definition of Transformers:** When you create a `ColumnTransformer`, you specify the transformers you want to apply to different columns. However, these transformers are not actually fitted to the data at this point. They are just defined and waiting to be applied.

2. **Fitting and Transformation:** When you call the `fit_transform` method on the `ColumnTransformer`, it goes through the following steps:
   - It fits each of the transformers to their respective subsets of data (columns).
   - It then applies the transformations to the data, producing the transformed dataset.

3. **`named_transformers_` Attribute:** After the `fit_transform` process is complete, the `named_transformers_` attribute becomes accessible. This attribute holds a dictionary-like object containing the fitted transformers. The keys are the transformer names; with `make_column_transformer` they are generated automatically from the lowercased class names (here, `standardscaler` and `onehotencoder`).

By accessing `named_transformers_` after fitting, you can see the actual fitted transformers and possibly use them for further analysis or debugging. However, you cannot access this information before fitting because the transformations haven't been applied until that point. This separation of defining and fitting is a fundamental aspect of scikit-learn's design, allowing for more flexibility and control over the transformation process.
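The define-then-fit behaviour can be checked on a tiny, made-up table. This sketch assumes a four-row DataFrame named `toy` and uses `hasattr` to probe the attribute before and after fitting.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# A tiny illustrative table with one categorical and one numeric column.
toy = pd.DataFrame(
    {"Type": ["M", "L", "L", "H"], "Torque [Nm]": [42.8, 46.3, 49.4, 31.8]}
)

toy_ct = make_column_transformer(
    (StandardScaler(), ["Torque [Nm]"]),
    (OneHotEncoder(handle_unknown="ignore"), ["Type"]),
)

# Before fitting, the attribute lookup fails, so hasattr reports False.
available_before = hasattr(toy_ct, "named_transformers_")

toy_ct.fit(toy)

# After fitting, the fitted transformers can be looked up by name.
available_after = hasattr(toy_ct, "named_transformers_")
names = list(toy_ct.named_transformers_.keys())
print(available_before, available_after, names)
```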

In [73]:
# ct.named_transformers_  # uncommenting this before fitting raises an AttributeError

In [74]:
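# Fit on the full feature table for illustration only; the pipeline below refits on the training split.
transformedX = ct.fit_transform(X)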

print(type(transformedX))
transformedX

<class 'numpy.ndarray'>


array([[-0.95238944, -0.94735989,  0.06818514, ...,  0.        ,
         0.        ,  1.        ],
       [-0.90239341, -0.879959  , -0.72947151, ...,  0.        ,
         1.        ,  0.        ],
       [-0.95238944, -1.01476077, -0.22744984, ...,  0.        ,
         1.        ,  0.        ],
       ...,
       [-0.50242514, -0.94735989,  0.59251888, ...,  0.        ,
         0.        ,  1.        ],
       [-0.50242514, -0.879959  , -0.72947151, ...,  1.        ,
         0.        ,  0.        ],
       [-0.50242514, -0.879959  , -0.2162938 , ...,  0.        ,
         0.        ,  1.        ]])

Let's inspect the fitted transformers, and then build the column names for the transformed data.

In [75]:
ct.named_transformers_

{'standardscaler': StandardScaler(),
 'onehotencoder': OneHotEncoder(handle_unknown='ignore', sparse_output=False)}

In [76]:
column_names = (
    numeric_features + ct.named_transformers_["onehotencoder"].get_feature_names_out().tolist()
)


column_names

['Air temperature [K]',
 'Process temperature [K]',
 'Rotational speed [rpm]',
 'Torque [Nm]',
 'Tool wear [min]',
 'Type_H',
 'Type_L',
 'Type_M']

In [77]:
transformedX = pd.DataFrame(transformedX, columns=column_names)
transformedX.head()

Unnamed: 0,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Type_H,Type_L,Type_M
0,-0.952389,-0.94736,0.068185,0.2822,-1.695984,0.0,0.0,1.0
1,-0.902393,-0.879959,-0.729472,0.633308,-1.648852,0.0,1.0,0.0
2,-0.952389,-1.014761,-0.22745,0.94429,-1.61743,0.0,1.0,0.0
3,-0.902393,-0.94736,-0.590021,-0.048845,-1.586009,0.0,1.0,0.0
4,-0.902393,-0.879959,-0.729472,0.001313,-1.554588,0.0,1.0,0.0


In [78]:
X.head()

Unnamed: 0,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min]
0,M,298.1,308.6,1551,42.8,0
1,L,298.2,308.7,1408,46.3,3
2,L,298.1,308.5,1498,49.4,5
3,L,298.2,308.6,1433,39.5,7
4,L,298.2,308.7,1408,40.0,9


In [79]:
pipe_lr = make_pipeline(ct, LogisticRegression())

In [80]:
# Flatten y_train to a 1-D array, which is what scikit-learn expects for a single target.
pipe_lr.fit(X_train, y_train.values.ravel())


In [81]:
# Make predictions on the testing data
y_pred = pipe_lr.predict(X_test)

In [82]:
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")


Accuracy: 0.97
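An accuracy of 0.97 is close to the share of non-failures in the data, so it says little about how many failures are caught. One common way to handle unbalanced classes is `class_weight="balanced"` in `LogisticRegression`, judged with balanced accuracy. The sketch below runs on synthetic data from `make_classification` with a similar 97/3 split, not on the milling dataset, so its numbers are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an unbalanced problem (about 3% positives).
X_syn, y_syn = make_classification(
    n_samples=5000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.97, 0.03], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn
)

models = {
    "plain": LogisticRegression(max_iter=1000),
    # Reweights each class inversely to its frequency in the training labels.
    "balanced": LogisticRegression(class_weight="balanced", max_iter=1000),
}

results = {}
for name, model in models.items():
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = {
        "accuracy": accuracy_score(y_te, y_hat),
        "balanced_accuracy": balanced_accuracy_score(y_te, y_hat),
    }
    print(name, results[name])
```

Class weighting typically lowers plain accuracy a little while raising balanced accuracy, because more of the rare class is recovered at the cost of extra false alarms.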


### KNN Classification

Now that we've taken a look at logistic regression, let's examine how KNN classification works.

We will start off by loading a fresh copy of the dataset.

In [83]:
data = pd.read_csv('../data/millingMachine.csv')

data

Unnamed: 0,UDI,Product ID,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,TWF,HDF,PWF,OSF,RNF
0,1,M14860,M,298.1,308.6,1551,42.8,0,0,0,0,0,0,0
1,2,L47181,L,298.2,308.7,1408,46.3,3,0,0,0,0,0,0
2,3,L47182,L,298.1,308.5,1498,49.4,5,0,0,0,0,0,0
3,4,L47183,L,298.2,308.6,1433,39.5,7,0,0,0,0,0,0
4,5,L47184,L,298.2,308.7,1408,40.0,9,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,M24855,M,298.8,308.4,1604,29.5,14,0,0,0,0,0,0
9996,9997,H39410,H,298.9,308.4,1632,31.8,17,0,0,0,0,0,0
9997,9998,M24857,M,299.0,308.6,1645,33.4,22,0,0,0,0,0,0
9998,9999,H39412,H,299.0,308.7,1408,48.5,25,0,0,0,0,0,0


This time, we want to predict the product type (L, M, or H, for low, medium, or high quality).

In [84]:
X = data.iloc[:,3:9]
y = data[["Type"]]

X

Unnamed: 0,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure
0,298.1,308.6,1551,42.8,0,0
1,298.2,308.7,1408,46.3,3,0
2,298.1,308.5,1498,49.4,5,0
3,298.2,308.6,1433,39.5,7,0
4,298.2,308.7,1408,40.0,9,0
...,...,...,...,...,...,...
9995,298.8,308.4,1604,29.5,14,0
9996,298.9,308.4,1632,31.8,17,0
9997,299.0,308.6,1645,33.4,22,0
9998,299.0,308.7,1408,48.5,25,0


In [85]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

KNN classification can use categorical predictor variables, but the dataset must be modified first.

Traditional KNN algorithms use distance metrics, such as Euclidean distance, to measure the similarity between data points. These metrics are designed for continuous variables, so applying them directly to categorical variables might not yield meaningful results. How can a KNN algorithm calculate distances if there are non-numeric features?

What we need to do is convert the categories from strings to numbers. This way, we can still represent the categorical information, just using numbers.

The two most common conversion methods are:
1. Ordinal encoding
    - For ordinal categorical variables (categories with an inherent order), you can encode them with numerical values that preserve their order.
    - An example could be income status, like low/middle/high income. It would make sense for this variable to be ordinal encoded, because it has a natural ordering.
2. One-hot encoding
    - Convert categorical variables into numerical dummy variables. Each category becomes a binary column (0 or 1) indicating the presence or absence of that category. This way, the categorical information can be used with distance metrics. However, be cautious of the curse of dimensionality when using one-hot encoding, as it can significantly increase the dimensionality of your dataset.
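The two encodings can be compared side by side on the product type letters. This is a small sketch with three hand-picked values; the explicit category order `L < M < H` passed to `OrdinalEncoder` is an assumption that follows the low/medium/high quality description.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# One example of each product type, as a single-column array.
types = np.array([["M"], ["L"], ["H"]])

# Ordinal encoding: one column, with L=0, M=1, H=2 preserving the order.
ordinal = OrdinalEncoder(categories=[["L", "M", "H"]]).fit_transform(types)

# One-hot encoding: one binary column per category, sorted alphabetically (H, L, M).
one_hot = OneHotEncoder().fit_transform(types).toarray()

print(ordinal.ravel())
print(one_hot)
```

Under ordinal encoding, L and H are two units apart while L and M are one unit apart, so a distance-based model treats M as "between" the others. Under one-hot encoding, every pair of categories is equally far apart.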
    
    
We will aim to explore both methods, and compare/contrast their results.

In [86]:
numeric_features = list(X_train.iloc[:, 0:-1].columns.values)
categorical_features = ['Machine failure']
numeric_features



['Air temperature [K]',
 'Process temperature [K]',
 'Rotational speed [rpm]',
 'Torque [Nm]',
 'Tool wear [min]']

Let's explore the second method first: one-hot encoding (OHE).

In [87]:
ct = make_column_transformer(
    (StandardScaler(), numeric_features),
    (OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_features),
)

ct

In [88]:
ct.fit_transform(X)

array([[-0.95238944, -0.94735989,  0.06818514, ..., -1.69598374,
         1.        ,  0.        ],
       [-0.90239341, -0.879959  , -0.72947151, ..., -1.6488517 ,
         1.        ,  0.        ],
       [-0.95238944, -1.01476077, -0.22744984, ..., -1.61743034,
         1.        ,  0.        ],
       ...,
       [-0.50242514, -0.94735989,  0.59251888, ..., -1.35034876,
         1.        ,  0.        ],
       [-0.50242514, -0.879959  , -0.72947151, ..., -1.30321671,
         1.        ,  0.        ],
       [-0.50242514, -0.879959  , -0.2162938 , ..., -1.22466331,
         1.        ,  0.        ]])

Here, we use a KNN model with K=21, where K is the number of neighbours that vote on each prediction.

In [89]:
pipe_lr = make_pipeline(ct, KNeighborsClassifier(21))
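K=21 is one possible choice. A common way to compare several values of K is cross-validation with `cross_val_score`. The sketch below does this on synthetic three-class data from `make_classification`, so the scores say nothing about the milling dataset; the candidate values of K are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for a real feature table.
X_syn, y_syn = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0,
)

cv_means = {}
for k in (1, 5, 21, 51):
    # Scaling sits inside the pipeline so each fold is scaled using its own training part.
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_means[k] = cross_val_score(pipe, X_syn, y_syn, cv=5).mean()

best_k = max(cv_means, key=cv_means.get)
print(cv_means, "best K:", best_k)
```

Small K follows the training points closely and can overfit, while large K smooths the decision towards the majority class.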

We now have a KNN pipeline, called ```pipe_lr``` (the name is reused from the logistic regression section). Let's fit it on our training data, so that it can learn to make predictions from this dataset.

In [90]:
pipe_lr.fit(X_train, y_train.values.ravel())

Now that our KNN model has been trained, let's use it to make predictions on the testing set.

In [91]:
y_pred = pipe_lr.predict(X_test)

In [92]:
y_pred

array(['L', 'L', 'L', ..., 'L', 'L', 'L'], dtype=object)

As you can see, our predicted values are categories: L, M, or H (low, medium, or high).

Let's look at the distribution of the predicted categories.

In [93]:
series = pd.Series(y_pred)

print(len(series))
series.value_counts()

3000


L    2828
M     168
H       4
dtype: int64

Now let's look at the distribution of the actual categories in the testing set.

In [94]:
print(len(y_test))
y_test.value_counts()

3000


Type
L       1830
M        879
H        291
dtype: int64

Now, let's compute an accuracy score for our KNN model.

In [95]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")


Accuracy: 0.59
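To put 0.59 in context, it helps to compare it with the accuracy of always predicting the most frequent type. The short calculation below uses only the test-set counts and the accuracy printed above; the variable names are illustrative.

```python
# Actual class counts in the testing set, copied from the output above.
test_counts = {"L": 1830, "M": 879, "H": 291}
total = sum(test_counts.values())

# Accuracy of a baseline that always predicts the most frequent class.
majority_class = max(test_counts, key=test_counts.get)
baseline_accuracy = test_counts[majority_class] / total

knn_accuracy = 0.59  # accuracy reported for the KNN pipeline above
beats_baseline = knn_accuracy > baseline_accuracy

print(f"majority class={majority_class}, baseline accuracy={baseline_accuracy:.2f}")
print(f"KNN beats baseline: {beats_baseline}")
```

Always predicting L would score 0.61, so this KNN model does slightly worse than the trivial baseline. One plausible reading is that these features carry little information about the product type, which would also explain why the model predicts L for 2,828 of the 3,000 test rows.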
