# Problem Statement

Link to the problem statement: https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset

Attribute Information:

The dataset consists of 10,000 data points stored as rows with 14 features in columns.

UDI: unique identifier ranging from 1 to 10000

product ID: consisting of a letter L, M, or H for low (50% of all products), medium (30%) and high (20%) as product quality variants and a variant-specific serial number

air temperature [K]: generated using a random walk process later normalized to a standard deviation of 2 K around 300 K

process temperature [K]: generated using a random walk process normalized to a standard deviation of 1 K, added to the air temperature plus 10 K.

rotational speed [rpm]: calculated from a power of 2860 W, overlaid with a normally distributed noise

torque [Nm]: torque values are normally distributed around 40 Nm with σ = 10 Nm and no negative values.

tool wear [min]: the quality variants H/M/L add 5/3/2 minutes of tool wear to the used tool in the process.

machine failure: a label that indicates whether the machine has failed at this particular data point because any of the following failure modes is true.

The machine failure consists of five independent failure modes:

- tool wear failure (TWF): the tool will be replaced or fail at a randomly selected tool wear time between 200–240 mins (120 times in our dataset). At this point in time, the tool is replaced 69 times, and fails 51 times (randomly assigned).
- heat dissipation failure (HDF): heat dissipation causes a process failure if the difference between air and process temperature is below 8.6 K and the tool's rotational speed is below 1380 rpm. This is the case for 115 data points.
- power failure (PWF): the product of torque and rotational speed (in rad/s) equals the power required for the process. If this power is below 3500 W or above 9000 W, the process fails, which is the case 95 times in our dataset.
- overstrain failure (OSF): if the product of tool wear and torque exceeds 11,000 minNm for the L product variant (12,000 M, 13,000 H), the process fails due to overstrain. This is true for 98 data points.
- random failures (RNF): each process has a chance of 0.1 % to fail regardless of its process parameters. This is the case for only 5 data points, fewer than would be expected for 10,000 data points in our dataset.

If at least one of the above failure modes is true, the process fails and the 'machine failure' label is set to 1. It is therefore not transparent to the machine learning method which of the failure modes has caused the process to fail.
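The three deterministic rules above (HDF, PWF, OSF) can be written down directly, which helps when sanity-checking individual rows. The following is a minimal sketch, not the dataset authors' generator: the function name `rule_based_failures` and its argument names are made up for illustration, and the random modes (TWF, RNF) are left out because they cannot be derived from the features.

```python
import math

# Overstrain thresholds in min*Nm per quality variant, as quoted in the description.
OSF_LIMIT = {"L": 11_000, "M": 12_000, "H": 13_000}


def rule_based_failures(variant, air_k, process_k, rpm, torque_nm, wear_min):
    """Return the deterministic failure flags (HDF, PWF, OSF) for one row."""
    # Power in watts: torque [Nm] times angular speed [rad/s].
    power_w = torque_nm * rpm * 2 * math.pi / 60
    return {
        "HDF": (process_k - air_k) < 8.6 and rpm < 1380,
        "PWF": power_w < 3500 or power_w > 9000,
        "OSF": wear_min * torque_nm > OSF_LIMIT[variant],
    }


# First row of the dataset (type M): no rule fires.
first_row = rule_based_failures("M", 298.1, 308.6, 1551, 42.8, 0)
print(first_row)

# Hypothetical row chosen to trigger HDF and OSF but not PWF.
stressed_row = rule_based_failures("L", 300.0, 308.0, 1300, 60.0, 200)
print(stressed_row)
```

The 'machine failure' label is then the logical OR of these three flags together with TWF and RNF.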

#### Steps to perform
1. Load the data
2. Profile the data
3. Analyze the data
4. Handle NaN values by imputation
5. If the data is not normally distributed, handle it
6. Check for multicollinearity
7. Build a model
8. Save the model
9. Report the model's accuracy
10. Run 10 test cases

In [2]:
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
# step-1:

ai4i_2020 = pd.read_csv(r'C:\Users\mohit.kumar\fullstackdatascience\extra\ai4i2020.csv')
ai4i_2020.head()

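| | UDI | Product ID | Type | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Machine failure | TWF | HDF | PWF | OSF | RNF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | M14860 | M | 298.1 | 308.6 | 1551 | 42.8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | L47181 | L | 298.2 | 308.7 | 1408 | 46.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | L47182 | L | 298.1 | 308.5 | 1498 | 49.4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | L47183 | L | 298.2 | 308.6 | 1433 | 39.5 | 7 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | L47184 | L | 298.2 | 308.7 | 1408 | 40.0 | 9 | 0 | 0 | 0 | 0 | 0 | 0 |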


In [4]:
# step-2:

pf_ai4i_2020 = ProfileReport(ai4i_2020)
pf_ai4i_2020.to_widgets()


In [5]:
ai4i_2020.columns

Index(['UDI', 'Product ID', 'Type', 'Air temperature [K]',
       'Process temperature [K]', 'Rotational speed [rpm]', 'Torque [Nm]',
       'Tool wear [min]', 'Machine failure', 'TWF', 'HDF', 'PWF', 'OSF',
       'RNF'],
      dtype='object')

In [6]:
ai4i_2020.dtypes

UDI                          int64
Product ID                  object
Type                        object
Air temperature [K]        float64
Process temperature [K]    float64
Rotational speed [rpm]       int64
Torque [Nm]                float64
Tool wear [min]              int64
Machine failure              int64
TWF                          int64
HDF                          int64
PWF                          int64
OSF                          int64
RNF                          int64
dtype: object

In [6]:
ai4i_2020.isnull().sum()

UDI                        0
Product ID                 0
Type                       0
Air temperature [K]        0
Process temperature [K]    0
Rotational speed [rpm]     0
Torque [Nm]                0
Tool wear [min]            0
Machine failure            0
TWF                        0
HDF                        0
PWF                        0
OSF                        0
RNF                        0
dtype: int64

In [7]:
ai4i_2020.drop(columns = ['UDI', 'Product ID','Type'], inplace = True)

In [8]:
ai4i_2020.describe()

| | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Machine failure | TWF | HDF | PWF | OSF | RNF |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 300.00493 | 310.00556 | 1538.7761 | 39.98691 | 107.951 | 0.0339 | 0.0046 | 0.0115 | 0.0095 | 0.0098 | 0.0019 |
| std | 2.000259 | 1.483734 | 179.284096 | 9.968934 | 63.654147 | 0.180981 | 0.067671 | 0.106625 | 0.097009 | 0.098514 | 0.04355 |
| min | 295.3 | 305.7 | 1168.0 | 3.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 298.3 | 308.8 | 1423.0 | 33.2 | 53.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 300.1 | 310.1 | 1503.0 | 40.1 | 108.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 301.5 | 311.1 | 1612.0 | 46.8 | 162.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 304.5 | 313.8 | 2886.0 | 76.6 | 253.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
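Because the label columns only take the values 0 and 1, each mean in the summary above is the fraction of rows flagged. Multiplying by the row count recovers how many rows carry each flag, which can then be compared with the counts quoted in the dataset description. A small sketch using only the numbers printed by `describe()`:

```python
# Means of the 0/1 label columns, copied from the describe() output above.
n_rows = 10_000
means = {
    "Machine failure": 0.0339,
    "TWF": 0.0046,
    "HDF": 0.0115,
    "PWF": 0.0095,
    "OSF": 0.0098,
    "RNF": 0.0019,
}

# mean of a 0/1 column * number of rows = number of rows flagged with 1
counts = {name: round(mean * n_rows) for name, mean in means.items()}
for name, count in counts.items():
    print(f"{name:<16} {count}")
```

Only about 3.4 % of the rows are marked as failures, so the label is strongly imbalanced.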


In [9]:
y = ai4i_2020['Air temperature [K]']
x = ai4i_2020.drop(columns = ['Air temperature [K]'])

In [10]:
x.head()

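| | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Machine failure | TWF | HDF | PWF | OSF | RNF |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 308.6 | 1551 | 42.8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 308.7 | 1408 | 46.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 308.5 | 1498 | 49.4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 308.6 | 1433 | 39.5 | 7 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 308.7 | 1408 | 40.0 | 9 | 0 | 0 | 0 | 0 | 0 | 0 |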


### Using OLS Regression

In [11]:
import statsmodels.api as sm

In [12]:
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
summary = model.summary()
print(summary)

                             OLS Regression Results                            
Dep. Variable:     Air temperature [K]   R-squared:                       0.776
Model:                             OLS   Adj. R-squared:                  0.775
Method:                  Least Squares   F-statistic:                     3454.
Date:                 Tue, 19 Jul 2022   Prob (F-statistic):               0.00
Time:                         18:27:54   Log-Likelihood:                -13649.
No. Observations:                10000   AIC:                         2.732e+04
Df Residuals:                     9989   BIC:                         2.740e+04
Df Model:                           10                                         
Covariance Type:             nonrobust                                         
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
const           



***

### Standardizing the features

In [13]:
x.head()

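| | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Machine failure | TWF | HDF | PWF | OSF | RNF |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 308.6 | 1551 | 42.8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 308.7 | 1408 | 46.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 308.5 | 1498 | 49.4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 308.6 | 1433 | 39.5 | 7 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 308.7 | 1408 | 40.0 | 9 | 0 | 0 | 0 | 0 | 0 | 0 |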


In [14]:
from sklearn.preprocessing import StandardScaler

In [15]:
scaler = StandardScaler()

In [16]:
x1 = scaler.fit_transform(x)
x1

array([[-0.94735989,  0.06818514,  0.28219976, ..., -0.09793424,
        -0.09948362, -0.04363046],
       [-0.879959  , -0.72947151,  0.63330802, ..., -0.09793424,
        -0.09948362, -0.04363046],
       [-1.01476077, -0.22744984,  0.94428963, ..., -0.09793424,
        -0.09948362, -0.04363046],
       ...,
       [-0.94735989,  0.59251888, -0.66077672, ..., -0.09793424,
        -0.09948362, -0.04363046],
       [-0.879959  , -0.72947151,  0.85400464, ..., -0.09793424,
        -0.09948362, -0.04363046],
       [-0.879959  , -0.2162938 ,  0.02137647, ..., -0.09793424,
        -0.09948362, -0.04363046]])
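To see where the first entry of this array comes from, the standardization can be redone by hand for one value. `StandardScaler` computes z = (x - mean) / std with the population standard deviation (ddof=0), whereas `describe()` reports the sample standard deviation (ddof=1). The sketch below uses only the summary statistics printed earlier and converts between the two; it is an illustration of the formula, not a replacement for the scaler.

```python
import math

# Summary statistics for "Process temperature [K]" copied from describe().
n = 10_000
mean = 310.00556
sample_std = 1.483734  # pandas describe() uses ddof=1

# StandardScaler divides by the population standard deviation (ddof=0).
population_std = sample_std * math.sqrt((n - 1) / n)

# Standardize the first raw value of the column (308.6 K).
z = (308.6 - mean) / population_std
print(round(z, 5))
```

The result should agree with the first element of the scaled array to about four decimal places; the small remaining difference comes from the rounded summary statistics.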

### Checking VIF

In [17]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [18]:
# creating vif_df dataframe
vif_df = pd.DataFrame()

# calculating the VIF for each column of the standardized array x1
vif_df['vif'] = [variance_inflation_factor(x1, i) for i in range(x1.shape[1])]
vif_df

| | vif |
|---|---|
| 0 | 1.004799 |
| 1 | 5.171592 |
| 2 | 5.236156 |
| 3 | 1.039958 |
| 4 | 11.829612 |
| 5 | 2.433058 |
| 6 | 4.597022 |
| 7 | 3.623946 |
| 8 | 3.3476 |
| 9 | 1.002015 |


In [19]:
vif_df['features'] = x.columns
vif_df

| | vif | features |
|---|---|---|
| 0 | 1.004799 | Process temperature [K] |
| 1 | 5.171592 | Rotational speed [rpm] |
| 2 | 5.236156 | Torque [Nm] |
| 3 | 1.039958 | Tool wear [min] |
| 4 | 11.829612 | Machine failure |
| 5 | 2.433058 | TWF |
| 6 | 4.597022 | HDF |
| 7 | 3.623946 | PWF |
| 8 | 3.3476 | OSF |
| 9 | 1.002015 | RNF |
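A VIF is easier to interpret once it is converted back to an R². By definition VIF_j = 1 / (1 - R_j²), where R_j² is obtained by regressing feature j on all the other features, so R_j² = 1 - 1 / VIF_j. The sketch below applies that identity to the values in the table above; it does not recompute the VIFs.

```python
# VIF values copied from the table above.
vif = {
    "Process temperature [K]": 1.004799,
    "Rotational speed [rpm]": 5.171592,
    "Torque [Nm]": 5.236156,
    "Tool wear [min]": 1.039958,
    "Machine failure": 11.829612,
    "TWF": 2.433058,
    "HDF": 4.597022,
    "PWF": 3.623946,
    "OSF": 3.3476,
    "RNF": 1.002015,
}

# VIF_j = 1 / (1 - R_j^2)  =>  R_j^2 = 1 - 1 / VIF_j
implied_r2 = {name: 1 - 1 / value for name, value in vif.items()}

for name, r2 in implied_r2.items():
    print(f"{name:<25} R^2 = {r2:.3f}")
```

The high value for 'Machine failure' is expected: the label is set whenever any of the five failure-mode columns is 1, so those columns explain most of its variance.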


### Splitting the dataset

In [20]:
from sklearn.model_selection import train_test_split

In [21]:
x_train, x_test, y_train, y_test = train_test_split(x1, y, test_size= 0.25, random_state = 100)

In [22]:
x_train

array([[-1.01476077, -1.3318975 ,  1.30542955, ..., -0.09793424,
        -0.09948362, -0.04363046],
       [-0.94735989, -0.96374828,  1.04460627, ..., -0.09793424,
        -0.09948362, -0.04363046],
       [-0.74515723,  1.1112746 , -1.6037532 , ..., -0.09793424,
        -0.09948362, -0.04363046],
       ...,
       [ 1.27686934,  0.11838731, -0.82128336, ..., -0.09793424,
        -0.09948362, -0.04363046],
       [ 0.73766225, -0.37805634,  0.0715348 , ..., -0.09793424,
        -0.09948362, -0.04363046],
       [ 1.41167111, -0.66253528,  1.16498625, ..., -0.09793424,
        -0.09948362, -0.04363046]])

In [23]:
y_train

314     297.9
8722    297.2
668     297.5
3353    301.6
9839    298.4
        ...  
350     297.6
79      298.8
8039    300.8
6936    300.6
5640    302.6
Name: Air temperature [K], Length: 7500, dtype: float64
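One thing to be aware of in the steps above: the scaler was fitted on all 10,000 rows before the split, so the held-out rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaling statistics on the training part only (with scikit-learn, `scaler.fit(x_train)` followed by `scaler.transform(x_test)`). The sketch below illustrates that ordering in plain Python on a hypothetical, randomly generated column; it is a variation on the notebook's approach, not what the cells above do.

```python
import random
import statistics

rng = random.Random(100)

# Hypothetical feature column standing in for one column of x.
values = [rng.gauss(310.0, 1.5) for _ in range(1000)]

# Shuffle the row indices and hold out 25 % of them, mirroring test_size=0.25.
indices = list(range(len(values)))
rng.shuffle(indices)
cut = int(len(indices) * 0.75)
train = [values[i] for i in indices[:cut]]
test = [values[i] for i in indices[cut:]]

# Fit the scaling statistics on the training rows only ...
mu = statistics.fmean(train)
sigma = statistics.pstdev(train)

# ... and apply the same statistics to both parts.
train_z = [(v - mu) / sigma for v in train]
test_z = [(v - mu) / sigma for v in test]

print(len(train_z), len(test_z))
```

With this ordering the training part has exactly zero mean and unit variance, while the test part is only approximately standardized, which is the situation a deployed model would face.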

In [24]:
ai4i_2020.head(3)

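| | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Machine failure | TWF | HDF | PWF | OSF | RNF |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 298.1 | 308.6 | 1551 | 42.8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 298.2 | 308.7 | 1408 | 46.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 298.1 | 308.5 | 1498 | 49.4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |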


In [25]:
# convert x1 into a DataFrame
print(x1[:3,:])
x2 = pd.DataFrame(x1, columns = x.columns)
x2.head(3)


[[-0.94735989  0.06818514  0.28219976 -1.69598374 -0.18732201 -0.06797983
  -0.10786004 -0.09793424 -0.09948362 -0.04363046]
 [-0.879959   -0.72947151  0.63330802 -1.6488517  -0.18732201 -0.06797983
  -0.10786004 -0.09793424 -0.09948362 -0.04363046]
 [-1.01476077 -0.22744984  0.94428963 -1.61743034 -0.18732201 -0.06797983
  -0.10786004 -0.09793424 -0.09948362 -0.04363046]]


| | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Machine failure | TWF | HDF | PWF | OSF | RNF |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.94736 | 0.068185 | 0.2822 | -1.695984 | -0.187322 | -0.06798 | -0.10786 | -0.097934 | -0.099484 | -0.04363 |
| 1 | -0.879959 | -0.729472 | 0.633308 | -1.648852 | -0.187322 | -0.06798 | -0.10786 | -0.097934 | -0.099484 | -0.04363 |
| 2 | -1.014761 | -0.22745 | 0.94429 | -1.61743 | -0.187322 | -0.06798 | -0.10786 | -0.097934 | -0.099484 | -0.04363 |


***

### Linear Regression

In [26]:
from sklearn.linear_model import LinearRegression

In [27]:
linear = LinearRegression()

In [28]:
linear.fit(x_train,y_train)

LinearRegression()

In [29]:
linear.coef_

array([ 1.73179068e+00,  2.89614151e-02, -8.25601445e-04,  8.56870986e-03,
       -3.01024487e-02,  2.13515033e-02,  1.97882223e-01,  2.53158118e-02,
       -2.29079513e-03, -5.12836991e-03])

In [30]:
linear.intercept_

300.0083028222017

In [31]:
linear.score(x_test,y_test)

0.7964294983765051

In [32]:
# predicting a value (actual value = 298.1)

linear.predict([[-0.947360,0.068185,0.282200,-1.695984,-0.187322,-0.06798,-0.10786,-0.097934,-0.099484,-0.04363]])

array([298.33569912])
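The prediction above is just the intercept plus the dot product of the coefficients with the standardized feature row. Reproducing it by hand is a quick check that the coefficients, the intercept, and the input row line up in the same column order. The numbers below are copied from the cell outputs above.

```python
# Coefficients and intercept copied from linear.coef_ and linear.intercept_.
coef = [1.73179068e+00, 2.89614151e-02, -8.25601445e-04, 8.56870986e-03,
        -3.01024487e-02, 2.13515033e-02, 1.97882223e-01, 2.53158118e-02,
        -2.29079513e-03, -5.12836991e-03]
intercept = 300.0083028222017

# Standardized first row, exactly as passed to linear.predict above.
row = [-0.947360, 0.068185, 0.282200, -1.695984, -0.187322,
       -0.06798, -0.10786, -0.097934, -0.099484, -0.04363]

# y_hat = intercept + sum_i coef_i * x_i
prediction = intercept + sum(c * v for c, v in zip(coef, row))
print(round(prediction, 4))
```

Almost all of the shift away from the intercept comes from the first term (process temperature), which matches its dominant coefficient.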

### Lasso Regression

In [33]:
from sklearn.linear_model import Ridge, Lasso, RidgeCV, LassoCV, ElasticNet, ElasticNetCV

In [34]:
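# Note: the `normalize` argument was deprecated in scikit-learn 1.0 and removed in 1.2;
# the features were already standardized with StandardScaler above.
lassocv = LassoCV(cv = 20, max_iter = 200000, normalize = True)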
lassocv.fit(x_train, y_train)

LassoCV(cv=20, max_iter=200000, normalize=True)

In [35]:
lassocv.alpha_

0.00015086996770029556

In [36]:
lasso = Lasso(alpha = lassocv.alpha_)
lasso.fit(x_train, y_train)

Lasso(alpha=0.00015086996770029556)

In [37]:
lasso.score(x_test,y_test)

0.7964391755002797

In [38]:
lasso.coef_

array([ 1.73165146e+00,  2.88282721e-02, -7.75206237e-04,  8.39295052e-03,
       -2.61693577e-02,  1.98619442e-02,  1.95496983e-01,  2.34023192e-02,
       -3.91098547e-03, -4.94657115e-03])
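The L1 penalty in lasso shrinks coefficients toward zero. With an alpha as small as the one selected here, the effect is slight, but it is still visible when the sum of absolute coefficients is compared with that of the unpenalized linear model. The values below are copied from `linear.coef_` and `lasso.coef_` above.

```python
# Coefficients copied from the LinearRegression and Lasso outputs above.
ols_coef = [1.73179068e+00, 2.89614151e-02, -8.25601445e-04, 8.56870986e-03,
            -3.01024487e-02, 2.13515033e-02, 1.97882223e-01, 2.53158118e-02,
            -2.29079513e-03, -5.12836991e-03]
lasso_coef = [1.73165146e+00, 2.88282721e-02, -7.75206237e-04, 8.39295052e-03,
              -2.61693577e-02, 1.98619442e-02, 1.95496983e-01, 2.34023192e-02,
              -3.91098547e-03, -4.94657115e-03]

# L1 norm = sum of absolute coefficient values.
ols_l1 = sum(abs(c) for c in ols_coef)
lasso_l1 = sum(abs(c) for c in lasso_coef)

# Count coefficients that lasso set exactly to zero.
n_zeroed = sum(1 for c in lasso_coef if c == 0.0)

print(f"OLS L1 norm:   {ols_l1:.5f}")
print(f"Lasso L1 norm: {lasso_l1:.5f}")
print(f"Coefficients set to zero by lasso: {n_zeroed}")
```

No coefficient is driven to zero at this alpha, so lasso is not performing any feature selection here.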

In [39]:
lasso.intercept_

300.00830207132697

In [40]:
lasso.predict([[-0.947360,0.068185,0.282200,-1.695984,-0.187322,-0.06798,-0.10786,-0.097934,-0.099484,-0.04363]])

array([298.33609592])

### Ridge Regression

#### Choosing alpha

In [41]:
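# Note: the `normalize` argument was deprecated in scikit-learn 1.0 and removed in 1.2;
# the features were already standardized with StandardScaler above.
ridgecv = RidgeCV(alphas=[0.1, 1.0, 10.0], cv = 20, normalize = True)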
ridgecv.fit(x_train, y_train)

RidgeCV(alphas=array([ 0.1,  1. , 10. ]), cv=20, normalize=True)

In [42]:
ridgecv.alpha_

0.1

#### Using it in the ridge model

In [43]:
ridge_lr = Ridge(alpha = ridgecv.alpha_)
ridge_lr.fit(x_train,y_train)

Ridge(alpha=0.1)

In [44]:
ridge_lr.coef_

array([ 1.73176726e+00,  2.89591457e-02, -8.27537657e-04,  8.56863929e-03,
       -3.00737713e-02,  2.13417629e-02,  1.97864715e-01,  2.53028854e-02,
       -2.30394815e-03, -5.12787869e-03])

In [45]:
ridge_lr.intercept_

300.00830275222216

In [46]:
ridge_lr.score(x_test,y_test)

0.7964291265050731


In [48]:
# predicting a value (actual value = 298.1)

ridge_lr.predict([[-0.947360,0.068185,0.282200,-1.695984,-0.187322,-0.06798,-0.10786,-0.097934,-0.099484,-0.04363]])

array([298.33572039])

### ElasticNet Regression

In [49]:
elastic = ElasticNetCV(alphas=None, cv = 20)
elastic.fit(x_train,y_train)

ElasticNetCV(cv=20)

In [50]:
elastic.alpha_

0.00598277063022332

In [51]:
# l1_ratio was left at its default, so ElasticNetCV tuned only alpha and l1_ratio_ stays at 0.5

elastic.l1_ratio_

0.5
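For reference, this is the objective that scikit-learn documents for `ElasticNet`, with n the number of training rows, α the `alpha` parameter and ρ the `l1_ratio`:

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
  \;+\; \alpha\,\rho\,\lVert w \rVert_1
  \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

Setting ρ = 1 gives the `Lasso` objective. With ρ = 0.5, as here, the penalty is split evenly between the L1 and L2 terms. Note that `Ridge` in scikit-learn minimizes ‖y − Xw‖² + α‖w‖² without the 1/(2n) factor, so its alpha is on a different scale and is not directly comparable with the lasso and elastic net alphas.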

**elastic model**

In [52]:
elastic_lr = ElasticNet(alpha = elastic.alpha_, l1_ratio = elastic.l1_ratio_)

In [53]:
elastic_lr.fit(x_train,y_train)

ElasticNet(alpha=0.00598277063022332)

In [54]:
elastic_lr.score(x_test,y_test)

0.7963782504962263

In [55]:
# predicting a value (actual value = 298.1)

elastic_lr.predict([[-0.947360,0.068185,0.282200,-1.695984,-0.187322,-0.06798,-0.10786,-0.097934,-0.099484,-0.04363]])

array([298.34902042])

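**All four models (linear, lasso, ridge, and elastic net) reach nearly the same test R² (about 0.796), so regularization makes almost no difference here, which suggests the plain linear model is not overfitting.**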
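The four test scores reported above can be put side by side to quantify how close they are. The values are copied from the `score` outputs; `score` on these regressors returns the coefficient of determination R², not a classification accuracy.

```python
# Test-set R^2 values copied from the score() outputs above.
scores = {
    "LinearRegression": 0.7964294983765051,
    "Lasso": 0.7964391755002797,
    "Ridge": 0.7964291265050731,
    "ElasticNet": 0.7963782504962263,
}

best = max(scores, key=scores.get)
spread = max(scores.values()) - min(scores.values())

for name, r2 in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<17} {r2:.6f}")
print(f"best: {best}, spread: {spread:.6f}")
```

A spread this small means the choice between the four models has no practical effect on this split. Comparing each model's training score with its test score would be a more direct check for overfitting.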

___

### Now dropping the Machine failure column as it had a high VIF

ai4i= ai4i_2020.drop(columns='Machine failure')
ai4i

***