# Data Preprocessing

## Steps

| Step                  | Action                                                                 |
|-----------------------|----------------------------------------------------------------------|
| **Missing Values**    | Checked with `df.isna().sum()`; no missing values found (0% missing). |
| **Duplicates**        | Checked `UDI` and `Product ID`; no duplicates found.                  |
| **Data Types**        | `UDI` (int64), `Product ID` (object), `Type` (object); dropped the identifier columns `UDI` and `Product ID`, and kept `Type` for ordinal encoding. |
| **Feature Engineering**| Created `failure_type` from `TWF`, `HDF`, `PWF`, `OSF`, `RNF` (first matching flag wins, otherwise `no failure`); converted air and process temperatures from Kelvin to Celsius. |
| **Categorical Encoding** | Label encoded `failure_type`; ordinal encoded `Type` (L=0, M=1, H=2). |
| **Feature Scaling**   | Applied `MinMaxScaler` to the numerical columns; `StandardScaler` was not used because it produces negative values. |
| **Oversampling**      | Used SMOTE to balance `failure_type`; all classes now have 9652 samples. |
| **Save Dataset**      | Saved as `data_processed.csv` with `index=False`.                     |

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt  
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

In [2]:
df=pd.read_csv('data.csv')
df.head()

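|   | UDI | Product ID | Type | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Machine failure | TWF | HDF | PWF | OSF | RNF |
|---|-----|------------|------|---------------------|-------------------------|------------------------|-------------|-----------------|-----------------|-----|-----|-----|-----|-----|
| 0 | 1 | M14860 | M | 298.1 | 308.6 | 1551 | 42.8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | L47181 | L | 298.2 | 308.7 | 1408 | 46.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | L47182 | L | 298.1 | 308.5 | 1498 | 49.4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | L47183 | L | 298.2 | 308.6 | 1433 | 39.5 | 7 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | L47184 | L | 298.2 | 308.7 | 1408 | 40.0 | 9 | 0 | 0 | 0 | 0 | 0 | 0 |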


Missing Values

In [3]:
print('Missing values in each column:')
df.isna().sum()/len(df)*100

Missing values in each column:


UDI                        0.0
Product ID                 0.0
Type                       0.0
Air temperature [K]        0.0
Process temperature [K]    0.0
Rotational speed [rpm]     0.0
Torque [Nm]                0.0
Tool wear [min]            0.0
Machine failure            0.0
TWF                        0.0
HDF                        0.0
PWF                        0.0
OSF                        0.0
RNF                        0.0
dtype: float64

Check for duplicates

In [4]:
df['Product ID'].duplicated().sum()

0

In [5]:
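# Print the message only if 'UDI' really has no duplicates ('Product ID' was checked in the previous cell)
if df['UDI'].duplicated().sum() == 0:
    print("No duplicates found in 'Product ID' and 'UDI' columns.")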

No duplicates found in 'Product ID' and 'UDI' columns.
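
The per-column checks above can be complemented by a check for fully duplicated rows. The following is a minimal sketch on a small, made-up frame (the `sample` data is hypothetical and only reuses the column names from the dataset):

```python
import pandas as pd

# Hypothetical sample: the third row repeats a 'Product ID' but not a 'UDI'
sample = pd.DataFrame({
    "UDI": [1, 2, 3],
    "Product ID": ["M14860", "L47181", "L47181"],
})

# Number of repeated values per identifier column
dupes_per_column = {
    col: int(sample[col].duplicated().sum()) for col in ["UDI", "Product ID"]
}

# Number of rows that repeat an earlier row in every column
full_row_dupes = int(sample.duplicated().sum())

print(dupes_per_column)
print(full_row_dupes)
```

A repeated value in one column does not imply a duplicated row, which is why both checks can be useful.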


Data Type Conversion

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   UDI                      10000 non-null  int64  
 1   Product ID               10000 non-null  object 
 2   Type                     10000 non-null  object 
 3   Air temperature [K]      10000 non-null  float64
 4   Process temperature [K]  10000 non-null  float64
 5   Rotational speed [rpm]   10000 non-null  int64  
 6   Torque [Nm]              10000 non-null  float64
 7   Tool wear [min]          10000 non-null  int64  
 8   Machine failure          10000 non-null  int64  
 9   TWF                      10000 non-null  int64  
 10  HDF                      10000 non-null  int64  
 11  PWF                      10000 non-null  int64  
 12  OSF                      10000 non-null  int64  
 13  RNF                      10000 non-null  int64  
dtypes: float64(3), int64(9), object(2)

**Feature Engineering**


Create a new column 'failure_type'

In [7]:
def get_failure_type(row):
    if row['TWF'] == 1:
        return 'TWF'
    elif row['HDF'] == 1:
        return 'HDF'
    elif row['PWF'] == 1:
        return 'PWF'
    elif row['OSF'] == 1:
        return 'OSF'
    elif row['RNF'] == 1:
        return 'RNF'
    else:
        return np.nan

df['failure_type'] = df.apply(get_failure_type, axis=1)
# Assign the result back instead of calling replace(..., inplace=True) on a column
# selection, which acts on a copy and triggers a FutureWarning in recent pandas versions.
df['failure_type'] = df['failure_type'].fillna('no failure')
df.drop(['TWF', 'HDF', 'PWF', 'OSF', 'RNF'], axis=1, inplace=True)
df.head()


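|   | UDI | Product ID | Type | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Machine failure | failure_type |
|---|-----|------------|------|---------------------|-------------------------|------------------------|-------------|-----------------|-----------------|--------------|
| 0 | 1 | M14860 | M | 298.1 | 308.6 | 1551 | 42.8 | 0 | 0 | no failure |
| 1 | 2 | L47181 | L | 298.2 | 308.7 | 1408 | 46.3 | 3 | 0 | no failure |
| 2 | 3 | L47182 | L | 298.1 | 308.5 | 1498 | 49.4 | 5 | 0 | no failure |
| 3 | 4 | L47183 | L | 298.2 | 308.6 | 1433 | 39.5 | 7 | 0 | no failure |
| 4 | 5 | L47184 | L | 298.2 | 308.7 | 1408 | 40.0 | 9 | 0 | no failure |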
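
The row-wise `apply` above picks the first failure flag that is set, in the order TWF, HDF, PWF, OSF, RNF. As a sketch of an equivalent vectorized approach (an alternative, not what the notebook ran), `np.select` applies the same priority order. The `sample` frame below is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: no failure, TWF and HDF together, HDF only, RNF only
sample = pd.DataFrame({
    "TWF": [0, 1, 0, 0],
    "HDF": [0, 1, 1, 0],
    "PWF": [0, 0, 0, 0],
    "OSF": [0, 0, 0, 0],
    "RNF": [0, 0, 0, 1],
})

# Priority order matches the if/elif chain in get_failure_type
flags = ["TWF", "HDF", "PWF", "OSF", "RNF"]
conditions = [sample[flag] == 1 for flag in flags]

# np.select returns the choice for the first condition that is True,
# and the default when none of them is
sample["failure_type"] = np.select(conditions, flags, default="no failure")

print(sample["failure_type"].tolist())
```

Rows with several flags set keep only the highest-priority label, exactly as in the `if`/`elif` version.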


In [8]:
df.drop(['UDI', 'Product ID'], axis=1, inplace=True)
df.head()

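|   | Type | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Machine failure | failure_type |
|---|------|---------------------|-------------------------|------------------------|-------------|-----------------|-----------------|--------------|
| 0 | M | 298.1 | 308.6 | 1551 | 42.8 | 0 | 0 | no failure |
| 1 | L | 298.2 | 308.7 | 1408 | 46.3 | 3 | 0 | no failure |
| 2 | L | 298.1 | 308.5 | 1498 | 49.4 | 5 | 0 | no failure |
| 3 | L | 298.2 | 308.6 | 1433 | 39.5 | 7 | 0 | no failure |
| 4 | L | 298.2 | 308.7 | 1408 | 40.0 | 9 | 0 | no failure |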


### Converting Kelvin to Celsius

In [9]:
def kelvin_to_celsius(k_temp):
    # Works element-wise on scalars, Series, and DataFrames
    return k_temp - 273.15


# Subtraction is vectorized, so the function can be called on whole columns
df['Air temperature [C]'] = kelvin_to_celsius(df['Air temperature [K]'])
df['Process temperature [C]'] = kelvin_to_celsius(df['Process temperature [K]'])
df.drop(['Air temperature [K]', 'Process temperature [K]'], axis=1, inplace=True)


In [10]:
# df['Air temperature [C]']=df['Air temperature [K]'] - 273.15
# df['Process temperature [C]']=df['Process temperature [K]'] - 273.15
# df.drop(['Air temperature [K]', 'Process temperature [K]'], axis=1, inplace=True)   
# df.head()

### Categorical Encoding

Label Encoding

In [11]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['failure_type']=le.fit_transform(df['failure_type'])
df.head()

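|   | Type | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Machine failure | failure_type | Air temperature [C] | Process temperature [C] |
|---|------|------------------------|-------------|-----------------|-----------------|--------------|---------------------|-------------------------|
| 0 | M | 1551 | 42.8 | 0 | 0 | 5 | 24.95 | 35.45 |
| 1 | L | 1408 | 46.3 | 3 | 0 | 5 | 25.05 | 35.55 |
| 2 | L | 1498 | 49.4 | 5 | 0 | 5 | 24.95 | 35.35 |
| 3 | L | 1433 | 39.5 | 7 | 0 | 5 | 25.05 | 35.45 |
| 4 | L | 1408 | 40.0 | 9 | 0 | 5 | 25.05 | 35.55 |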
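
`LabelEncoder` assigns integer codes in the sorted order of the class labels, which is why `no failure` appears as 5 in the output above. A minimal sketch for recovering the code-to-label mapping, using a hypothetical list of labels rather than the full dataset:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label list covering every failure type used in this notebook
labels = ["no failure", "TWF", "HDF", "PWF", "OSF", "RNF", "no failure"]

le = LabelEncoder()
codes = le.fit_transform(labels)

# classes_ holds the labels in sorted order; the position is the integer code
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns integer codes back into the original labels
decoded = le.inverse_transform([5, 0])
print(list(decoded))
```

Uppercase labels sort before the lowercase `no failure`, so the failure codes run from 0 to 4 and `no failure` gets the last code.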


Ordinal Encoding

In [20]:
# Define the ordinal encoding function
def ordinal_encoding(X):
    # Map values as per desired order
    mapping = {"L": 0, "M": 1, "H": 2}
    return X.replace(mapping)


In [12]:
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(categories=[['L', 'M', 'H']])
df['Type']=oe.fit_transform(df[['Type']])
df.head()

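|   | Type | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Machine failure | failure_type | Air temperature [C] | Process temperature [C] |
|---|------|------------------------|-------------|-----------------|-----------------|--------------|---------------------|-------------------------|
| 0 | 1.0 | 1551 | 42.8 | 0 | 0 | 5 | 24.95 | 35.45 |
| 1 | 0.0 | 1408 | 46.3 | 3 | 0 | 5 | 25.05 | 35.55 |
| 2 | 0.0 | 1498 | 49.4 | 5 | 0 | 5 | 24.95 | 35.35 |
| 3 | 0.0 | 1433 | 39.5 | 7 | 0 | 5 | 25.05 | 35.45 |
| 4 | 0.0 | 1408 | 40.0 | 9 | 0 | 5 | 25.05 | 35.55 |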
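
The notebook defines both a dictionary-based `ordinal_encoding` function and an `OrdinalEncoder` with an explicit category order. The sketch below, on a hypothetical `types` frame, checks that an explicit L, M, H order and a plain dictionary mapping produce the same codes:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample of the 'Type' column
types = pd.DataFrame({"Type": ["M", "L", "H", "L"]})

# Passing the categories explicitly fixes the order L < M < H;
# without it, OrdinalEncoder would sort the labels alphabetically (H, L, M)
oe = OrdinalEncoder(categories=[["L", "M", "H"]])
encoded = oe.fit_transform(types).ravel()

# The same mapping written as a dictionary lookup
mapped = types["Type"].map({"L": 0, "M": 1, "H": 2}).to_numpy()

print(encoded)
print(mapped)
```

The encoder returns floats while the dictionary lookup returns integers; the values themselves agree.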


In [13]:
df['Type'].value_counts()

Type
0.0    6000
1.0    2997
2.0    1003
Name: count, dtype: int64

In [14]:
df['failure_type'].value_counts()

failure_type
5    9652
0     115
2      91
1      78
4      46
3      18
Name: count, dtype: int64

## Feature Scaling

In [15]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

numerical_cols = ['Air temperature [C]', 'Process temperature [C]', 'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]']

df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

df.head()


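|   | Type | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Machine failure | failure_type | Air temperature [C] | Process temperature [C] |
|---|------|------------------------|-------------|-----------------|-----------------|--------------|---------------------|-------------------------|
| 0 | 1.0 | 0.222934 | 0.535714 | 0.0 | 0 | 5 | 0.304348 | 0.358025 |
| 1 | 0.0 | 0.139697 | 0.583791 | 0.011858 | 0 | 5 | 0.315217 | 0.37037 |
| 2 | 0.0 | 0.192084 | 0.626374 | 0.019763 | 0 | 5 | 0.304348 | 0.345679 |
| 3 | 0.0 | 0.154249 | 0.490385 | 0.027668 | 0 | 5 | 0.315217 | 0.358025 |
| 4 | 0.0 | 0.139697 | 0.497253 | 0.035573 | 0 | 5 | 0.315217 | 0.37037 |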


With `StandardScaler`, the rotational speed and the other numerical columns take negative values after scaling, which was judged unsuitable here, so `MinMaxScaler` (range 0 to 1) is used instead.
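
The difference between the two scalers can be seen on a handful of values. This sketch uses the rotational speeds from the first five rows shown earlier; it is an illustration, not a cell from the original notebook:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Rotational speed [rpm] from the first five rows of the dataset
rpm = np.array([[1551.0], [1408.0], [1498.0], [1433.0], [1408.0]])

# StandardScaler centres the column on its mean, so values below the mean become negative
standardized = StandardScaler().fit_transform(rpm)

# MinMaxScaler maps the minimum to 0 and the maximum to 1, so nothing is negative
min_max = MinMaxScaler().fit_transform(rpm)

print("Any negative after StandardScaler:", bool((standardized < 0).any()))
print("Any negative after MinMaxScaler:", bool((min_max < 0).any()))
```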

## Oversampling

In [16]:
from imblearn.over_sampling import SMOTE

# 'auto' resamples every class except the majority class up to the majority count
smote = SMOTE(sampling_strategy='auto')
X = df.drop('failure_type', axis=1)
y = df['failure_type']
X_resampled, y_resampled = smote.fit_resample(X, y)

# Recombine the resampled features and target into one DataFrame
df_resampled = pd.DataFrame(X_resampled, columns=X.columns)
df_resampled['failure_type'] = y_resampled
df_resampled.head()


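|   | Type | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Machine failure | Air temperature [C] | Process temperature [C] | failure_type |
|---|------|------------------------|-------------|-----------------|-----------------|---------------------|-------------------------|--------------|
| 0 | 1.0 | 0.222934 | 0.535714 | 0.0 | 0 | 0.304348 | 0.358025 | 5 |
| 1 | 0.0 | 0.139697 | 0.583791 | 0.011858 | 0 | 0.315217 | 0.37037 | 5 |
| 2 | 0.0 | 0.192084 | 0.626374 | 0.019763 | 0 | 0.304348 | 0.345679 | 5 |
| 3 | 0.0 | 0.154249 | 0.490385 | 0.027668 | 0 | 0.315217 | 0.358025 | 5 |
| 4 | 0.0 | 0.139697 | 0.497253 | 0.035573 | 0 | 0.315217 | 0.37037 | 5 |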


In [25]:
df_resampled['failure_type'].value_counts()

failure_type
5    9652
2    9652
4    9652
1    9652
3    9652
0    9652
Name: count, dtype: int64
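
SMOTE reaches equal class counts by generating synthetic minority samples rather than copying existing ones. The sketch below illustrates the core interpolation idea with NumPy only. It is a simplified illustration, not the `imblearn` implementation, and the function name `smote_like_sample` and the `minority` points are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)


def smote_like_sample(minority, k=2):
    """Create one synthetic point between a minority sample and one of its k nearest neighbours."""
    # Pick a random minority sample as the starting point
    x = minority[rng.integers(len(minority))]

    # Find its k nearest minority neighbours (index 0 of the sort is the point itself)
    distances = np.linalg.norm(minority - x, axis=1)
    neighbour_idx = np.argsort(distances)[1:k + 1]
    neighbour = minority[rng.choice(neighbour_idx)]

    # Move a random fraction of the way from the sample towards the neighbour
    gap = rng.random()
    return x + gap * (neighbour - x)


# Hypothetical minority-class points in a two-feature space scaled to [0, 1]
minority = np.array([[0.10, 0.20], [0.20, 0.10], [0.15, 0.30], [0.30, 0.25]])
synthetic = np.array([smote_like_sample(minority) for _ in range(5)])
print(synthetic.shape)
```

Because each synthetic point lies on a line segment between two real minority points, interpolated values of integer-coded columns such as `Type` or `Machine failure` may be non-integer.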

In [35]:
df_resampled[df_resampled['Machine failure']==0].shape[0]/df_resampled.shape[0]*100

33.31779251277801

In [18]:
df_resampled.to_csv('data_processed.csv', index=False)

In [21]:
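from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# Columns that are still in Kelvin and need converting to Celsius
kelvin_cols = ["Air temperature [K]", "Process temperature [K]"]
categorical_cols = ["Type"]

# Reuses kelvin_to_celsius and ordinal_encoding defined in earlier cells
feature_transformer = ColumnTransformer(
    transformers=[
        ("kelvin_to_celsius", FunctionTransformer(kelvin_to_celsius), kelvin_cols),
        ("ordinal_encoding", FunctionTransformer(ordinal_encoding), categorical_cols),
    ],
    remainder="passthrough",
)

# The positional indices refer to the output of feature_transformer, which places the
# transformed columns first and the passthrough columns after them; check that they
# point at the numerical columns for the input layout in use.
scaling_transformer = ColumnTransformer(
    transformers=[("feature_scaling", MinMaxScaler(), [1, 2, 4, 5, 6])], remainder="passthrough"
)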


In [23]:
from sklearn import set_config

set_config(display='diagram')
feature_transformer

In [22]:
scaling_transformer
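
The two transformers can be chained in a scikit-learn `Pipeline`. The following sketch runs on a small hypothetical frame (`raw`) with fewer columns than the real dataset, so the scaling indices `[0, 1, 3]` are specific to this example and differ from those in the cell above. `ordinal_encoding` is rewritten here with `Series.map` as an assumption, to keep the example self-contained:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler


def kelvin_to_celsius(k_temp):
    return k_temp - 273.15


def ordinal_encoding(X):
    # Map each column's labels to the order L < M < H
    mapping = {"L": 0, "M": 1, "H": 2}
    return X.apply(lambda col: col.map(mapping))


# Hypothetical raw input with a reduced set of columns
raw = pd.DataFrame({
    "Type": ["M", "L", "H"],
    "Air temperature [K]": [298.1, 298.2, 300.0],
    "Process temperature [K]": [308.6, 308.7, 310.0],
    "Torque [Nm]": [42.8, 46.3, 40.0],
})

feature_transformer = ColumnTransformer(
    transformers=[
        ("kelvin_to_celsius", FunctionTransformer(kelvin_to_celsius),
         ["Air temperature [K]", "Process temperature [K]"]),
        ("ordinal_encoding", FunctionTransformer(ordinal_encoding), ["Type"]),
    ],
    remainder="passthrough",
)

# Output order of feature_transformer: air temp, process temp, Type, torque.
# Indices 0, 1 and 3 are therefore the numerical columns in this example.
scaling_transformer = ColumnTransformer(
    transformers=[("feature_scaling", MinMaxScaler(), [0, 1, 3])],
    remainder="passthrough",
)

pipeline = Pipeline([
    ("features", feature_transformer),
    ("scaling", scaling_transformer),
])

# Final column order: scaled air temp, scaled process temp, scaled torque, Type
result = pipeline.fit_transform(raw)
print(result.shape)
```

Each `ColumnTransformer` moves its transformed columns to the front, so the column order changes at every stage and positional indices have to be rechecked whenever the input layout changes.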