# Milestone 2 Assignment

## Houda Aynaou

The capstone project focuses on diaper manufacturing quality. In the [article](http://www.madehow.com/Volume-3/Disposable-Diaper.html), you discovered how the diaper manufacturing process works. Generally, to ensure or predict quality, a diaper manufacturer needs to monitor every step of the manufacturing process with sensors such as heat sensors, glue sensors, and glue-level sensors.

For this capstone project, we will use the [SECOM manufacturing Data Set](https://archive.ics.uci.edu/ml/datasets/SECOM) from the UCI Machine Learning Repository. The data set originally comes from semiconductor manufacturing, but here we will assume that it describes the diaper manufacturing process.


## To Do

1. Split prepared data from Milestone 1 into training and testing
2. Build a decision tree model that detects faulty products
3. Build an ensemble model that detects faulty products
4. Build an SVM model
5. Evaluate all three models
6. Describe your findings

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

# 1. Read and merge data

In [2]:
# Secom data
LINK = 'https://raw.githubusercontent.com/houdaaynaou/DS-Certificate-UW/master/Course%203%20Machine%20Learning%20Techniques/Data/secom.csv'
secom_data = pd.read_csv(LINK,header= None, delimiter= ' ')
secom_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,580,581,582,583,584,585,586,587,588,589
0,3030.93,2564.0,2187.7333,1411.1265,1.3602,100.0,97.6133,0.1242,1.5005,0.0162,...,,,0.5005,0.0118,0.0035,2.363,,,,
1,3095.78,2465.14,2230.4222,1463.6606,0.8294,100.0,102.3433,0.1247,1.4966,-0.0005,...,0.006,208.2045,0.5019,0.0223,0.0055,4.4447,0.0096,0.0201,0.006,208.2045
2,2932.61,2559.94,2186.4111,1698.0172,1.5102,100.0,95.4878,0.1241,1.4436,0.0041,...,0.0148,82.8602,0.4958,0.0157,0.0039,3.1745,0.0584,0.0484,0.0148,82.8602
3,2988.72,2479.9,2199.0333,909.7926,1.3204,100.0,104.2367,0.1217,1.4882,-0.0124,...,0.0044,73.8432,0.499,0.0103,0.0025,2.0544,0.0202,0.0149,0.0044,73.8432
4,3032.24,2502.87,2233.3667,1326.52,1.5334,100.0,100.3967,0.1235,1.5031,-0.0031,...,,,0.48,0.4766,0.1045,99.3032,0.0202,0.0149,0.0044,73.8432


In [3]:
# Secom labels
LINK2 = 'https://raw.githubusercontent.com/houdaaynaou/DS-Certificate-UW/master/Course%203%20Machine%20Learning%20Techniques/Data/secom_labels.csv'
secom_labels = pd.read_csv(LINK2, header= None, parse_dates=[1], names= ['Class', 'Time Stamp'])
secom_labels.head()

| | Class | Time Stamp |
|---|---|---|
| 0 | -1 | 2008-07-19 11:55:00 |
| 1 | -1 | 2008-07-19 12:32:00 |
| 2 | 1 | 2008-07-19 13:17:00 |
| 3 | -1 | 2008-07-19 14:43:00 |
| 4 | -1 | 2008-07-19 15:22:00 |


In [4]:
# Merge labels and sensor data column-wise

data = pd.concat([secom_labels, secom_data], axis= 1)
data.head(10)

Unnamed: 0,Class,Time Stamp,0,1,2,3,4,5,6,7,...,580,581,582,583,584,585,586,587,588,589
0,-1,2008-07-19 11:55:00,3030.93,2564.0,2187.7333,1411.1265,1.3602,100.0,97.6133,0.1242,...,,,0.5005,0.0118,0.0035,2.363,,,,
1,-1,2008-07-19 12:32:00,3095.78,2465.14,2230.4222,1463.6606,0.8294,100.0,102.3433,0.1247,...,0.006,208.2045,0.5019,0.0223,0.0055,4.4447,0.0096,0.0201,0.006,208.2045
2,1,2008-07-19 13:17:00,2932.61,2559.94,2186.4111,1698.0172,1.5102,100.0,95.4878,0.1241,...,0.0148,82.8602,0.4958,0.0157,0.0039,3.1745,0.0584,0.0484,0.0148,82.8602
3,-1,2008-07-19 14:43:00,2988.72,2479.9,2199.0333,909.7926,1.3204,100.0,104.2367,0.1217,...,0.0044,73.8432,0.499,0.0103,0.0025,2.0544,0.0202,0.0149,0.0044,73.8432
4,-1,2008-07-19 15:22:00,3032.24,2502.87,2233.3667,1326.52,1.5334,100.0,100.3967,0.1235,...,,,0.48,0.4766,0.1045,99.3032,0.0202,0.0149,0.0044,73.8432
5,-1,2008-07-19 17:53:00,2946.25,2432.84,2233.3667,1326.52,1.5334,100.0,100.3967,0.1235,...,0.0052,44.0077,0.4949,0.0189,0.0044,3.8276,0.0342,0.0151,0.0052,44.0077
6,-1,2008-07-19 19:44:00,3030.27,2430.12,2230.4222,1463.6606,0.8294,100.0,102.3433,0.1247,...,,,0.501,0.0143,0.0042,2.8515,0.0342,0.0151,0.0052,44.0077
7,-1,2008-07-19 19:45:00,3058.88,2690.15,2248.9,1004.4692,0.7884,100.0,106.24,0.1185,...,0.0063,95.031,0.4984,0.0106,0.0034,2.1261,0.0204,0.0194,0.0063,95.031
8,-1,2008-07-19 20:24:00,2967.68,2600.47,2248.9,1004.4692,0.7884,100.0,106.24,0.1185,...,0.0045,111.6525,0.4993,0.0172,0.0046,3.4456,0.0111,0.0124,0.0045,111.6525
9,-1,2008-07-19 21:35:00,3016.11,2428.37,2248.9,1004.4692,0.7884,100.0,106.24,0.1185,...,0.0073,90.2294,0.4967,0.0152,0.0038,3.0687,0.0212,0.0191,0.0073,90.2294


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1567 entries, 0 to 1566
Columns: 592 entries, Class to 589
dtypes: datetime64[ns](1), float64(590), int64(1)
memory usage: 7.1 MB


# 2. Clean and prepare data

In [6]:
missing = data.isnull().sum().sort_values(ascending = False).to_frame(name = 'Total missing')

# Columns with missing values (copy so the new column is not written to a slice)
col_missing = missing[missing['Total missing'] != 0].copy()
col_missing['% missing'] = (col_missing['Total missing']/len(data))*100
print('There are {} columns with missing values.'.format(col_missing.shape[0]))
col_missing.head()

There are 538 columns with missing values.


| | Total missing | % missing |
|---|---|---|
| 293 | 1429 | 91.193363 |
| 158 | 1429 | 91.193363 |
| 157 | 1429 | 91.193363 |
| 292 | 1429 | 91.193363 |
| 358 | 1341 | 85.577537 |


In [7]:
# columns where more than 50 % of the data is missing
print('Number of columns where more than 50% missing data', col_missing[col_missing['% missing'] > 50].shape[0])

print('\nList of columns where more than 50% missing data:\n',list(col_missing[col_missing['% missing'] > 50].index))



Number of columns where more than 50% missing data 28

List of columns where more than 50% missing data:
 [293, 158, 157, 292, 358, 220, 492, 85, 244, 111, 518, 109, 384, 383, 382, 110, 246, 245, 516, 517, 581, 578, 579, 580, 346, 73, 72, 345]


In [8]:
# Drop columns where more than 70% of the data is missing
to_drop = list(col_missing[col_missing['% missing'] > 70].index)

d_data = data.drop(to_drop, axis= 1)

d_data.shape

(1567, 584)

In [9]:
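# Impute missing values.
# NOTE: data.mean(axis=1) returns one mean per row, and fillna() matches the
# Series index against column labels, so column k is filled with the mean of
# row k (hence the repeated 81.399372 in column 580 below). Per-column mean
# imputation would be d_data.fillna(d_data.mean(numeric_only=True)); the
# outputs in this notebook come from the call as written.
df = d_data.fillna(data.mean(axis =1))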
df.head()

Unnamed: 0,Class,Time Stamp,0,1,2,3,4,5,6,7,...,580,581,582,583,584,585,586,587,588,589
0,-1,2008-07-19 11:55:00,3030.93,2564.0,2187.7333,1411.1265,1.3602,100.0,97.6133,0.1242,...,81.399372,72.628602,0.5005,0.0118,0.0035,2.363,69.759871,82.904667,91.517791,71.129541
1,-1,2008-07-19 12:32:00,3095.78,2465.14,2230.4222,1463.6606,0.8294,100.0,102.3433,0.1247,...,0.006,208.2045,0.5019,0.0223,0.0055,4.4447,0.0096,0.0201,0.006,208.2045
2,1,2008-07-19 13:17:00,2932.61,2559.94,2186.4111,1698.0172,1.5102,100.0,95.4878,0.1241,...,0.0148,82.8602,0.4958,0.0157,0.0039,3.1745,0.0584,0.0484,0.0148,82.8602
3,-1,2008-07-19 14:43:00,2988.72,2479.9,2199.0333,909.7926,1.3204,100.0,104.2367,0.1217,...,0.0044,73.8432,0.499,0.0103,0.0025,2.0544,0.0202,0.0149,0.0044,73.8432
4,-1,2008-07-19 15:22:00,3032.24,2502.87,2233.3667,1326.52,1.5334,100.0,100.3967,0.1235,...,81.399372,72.628602,0.48,0.4766,0.1045,99.3032,0.0202,0.0149,0.0044,73.8432
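
For comparison, per-column mean imputation can be sketched on a small toy frame. The frame `toy` and its values are made up for illustration. The sketch relies only on `DataFrame.mean` and `DataFrame.fillna`, and on the fact that `fillna` matches a Series against column labels.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with integer column labels, like the SECOM features
toy = pd.DataFrame({0: [1.0, np.nan, 3.0], 1: [10.0, 20.0, np.nan]})

# mean() defaults to axis=0: one mean per column, indexed by column label
col_means = toy.mean()
imputed = toy.fillna(col_means)

# mean(axis=1) gives one mean per row; fillna still matches on column labels,
# so column 0 receives the mean of row 0 and column 1 the mean of row 1
row_means = toy.mean(axis=1)
mismatched = toy.fillna(row_means)

print(imputed)
print(mismatched)
```

In `imputed`, each gap holds the mean of its own column. In `mismatched`, the gaps hold row means that have nothing to do with the column being filled.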


# 3. Normalizing the data

In [10]:
# Normalizing the data
from sklearn.preprocessing import StandardScaler

x = df.drop(['Class', 'Time Stamp'], axis= 1)
y= df['Class']
scaler = StandardScaler()
scaler.fit(x)
scaler.transform(x)


array([[ 1.41546096e-01,  4.37359596e-01,  3.08298605e-02, ...,
         3.95723680e+01,  3.95726875e+01, -3.03963215e-01],
       [ 4.72415824e-01, -1.10021195e-01,  2.41101189e-01, ...,
        -2.35379179e-02, -2.49598098e-02,  1.15684911e+00],
       [-3.60090166e-01,  4.14879665e-01,  2.43171429e-02, ...,
        -1.00183426e-02, -2.11520022e-02, -1.78949193e-01],
       ...,
       [-1.24374184e-01, -5.82653446e-01,  1.22283252e-01, ...,
        -2.90317383e-02, -2.64742788e-02, -5.98165954e-01],
       [-5.52387468e-01,  2.60233241e-01, -2.18747820e-02, ...,
        -2.14359345e-02, -2.43107517e-02, -6.56233736e-02],
       [-2.97283592e-01, -1.89642232e-01,  6.88121753e-02, ...,
        -2.54010396e-02, -2.56088679e-02,  4.06379801e-01]])
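
The scaling step amounts to subtracting each column's mean and dividing by its standard deviation. A minimal NumPy sketch of that calculation, assuming a small made-up matrix `X_toy`: `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

# Hypothetical feature matrix: 3 samples, 2 features on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 600.0]])

mu = X_toy.mean(axis=0)       # per-column mean
sigma = X_toy.std(axis=0)     # per-column population standard deviation (ddof=0)
Z = (X_toy - mu) / sigma      # standardized features

print(Z.mean(axis=0))
print(Z.std(axis=0))
```

After the transformation, every column has mean 0 and standard deviation 1, so features measured on very different scales contribute comparably to distance-based models such as the SVM.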

# 4. Handling Class imbalance


In [11]:
from sklearn.preprocessing import StandardScaler

x = df.drop(['Class', 'Time Stamp'], axis= 1)
y= df['Class']
scaler = StandardScaler()
scaler.fit(x)
x_standarised = pd.DataFrame(scaler.transform(x), columns= x.columns)
x_standarised.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,580,581,582,583,584,585,586,587,588,589
0,0.141546,0.43736,0.03083,0.058187,-0.061219,0.094946,-0.479672,-0.075516,-0.025159,-0.028958,...,0.806977,-0.177287,-0.025129,-0.026955,-0.025356,-0.177453,39.571745,39.572368,39.572687,-0.303963
1,0.472416,-0.110021,0.241101,0.173165,-0.07062,0.094946,0.219624,-0.075413,-0.026256,-0.035612,...,-1.239178,2.231067,-0.024639,-0.021922,-0.024858,0.304854,-0.032002,-0.023538,-0.02496,1.156849
2,-0.36009,0.41488,0.024317,0.686083,-0.058562,0.094946,-0.793912,-0.075537,-0.041156,-0.033779,...,-1.238956,0.004466,-0.026772,-0.025086,-0.025257,0.010563,-0.004294,-0.010018,-0.021152,-0.178949
3,-0.073813,-0.028296,0.08649,-1.039043,-0.061924,0.094946,0.499549,-0.076029,-0.028617,-0.040354,...,-1.239218,-0.155711,-0.025653,-0.027674,-0.025606,-0.248952,-0.025984,-0.026022,-0.025652,-0.275044
4,0.14823,0.098887,0.255605,-0.126984,-0.058152,0.094946,-0.068167,-0.07566,-0.024428,-0.036648,...,0.806977,-0.177287,-0.032298,0.19586,-0.000171,22.282531,-0.025984,-0.026022,-0.025652,-0.275044


In [12]:
# Split the data: 
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_standarised, 
                                                    y, 
                                                    test_size = 0.20, 
                                                    random_state = 0)
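
With a rare faulty class, a plain random split can leave the test set with a different share of faulty units than the full data. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses made-up labels; `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 1000 passes (-1) and 100 fails (1)
y_toy = np.array([-1] * 1000 + [1] * 100)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class proportions roughly equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=0, stratify=y_toy
)

print((y_tr == 1).mean(), (y_te == 1).mean())
```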



In [13]:
from imblearn.over_sampling import SMOTE


# Oversample the minority class with SMOTE (training data only)
sm = SMOTE(random_state = 42)
# fit_resample is the current name; the fit_sample alias was removed in newer imbalanced-learn releases
X_train_new, y_train_new = sm.fit_resample(x_train, y_train)

# Balanced Classes 
np.unique(y_train_new, return_counts= True)

(array([-1,  1]), array([1162, 1162]))
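
SMOTE creates new minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below shows only that interpolation step, in NumPy, on made-up points. `smote_like_sample` is a hypothetical helper written for illustration, not the imbalanced-learn implementation, and it assumes the minority points are distinct.

```python
import numpy as np

def smote_like_sample(minority, k=2, rng=None):
    """Return one synthetic point between a random minority sample
    and one of its k nearest minority neighbours."""
    rng = np.random.default_rng() if rng is None else rng
    i = rng.integers(len(minority))
    # Euclidean distances from sample i to every minority sample
    dists = np.linalg.norm(minority - minority[i], axis=1)
    # Position 0 of the argsort is the sample itself (distance 0), so skip it
    neighbours = np.argsort(dists)[1:k + 1]
    j = rng.choice(neighbours)
    lam = rng.random()  # interpolation weight in [0, 1)
    return minority[i] + lam * (minority[j] - minority[i])

# Hypothetical minority samples: the four corners of the unit square
minority_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_sample(minority_toy, k=2, rng=np.random.default_rng(0))
print(synthetic)
```

Because the new point always lies on the segment between two existing minority samples, oversampling fills in the minority region rather than duplicating rows.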

## 5. Decision Tree

In [17]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

# Decision tree using the Gini impurity criterion
gini_tree = DecisionTreeClassifier(criterion= 'gini', random_state= 42)
gini_tree.fit(X_train_new, y_train_new)

# Predictions
gini_pred = gini_tree.predict(x_test)

# Model performance
print('Confusion matrix of decision tree with SMOTE:\n', confusion_matrix(y_test, gini_pred))

Confusion matrix of decision tree with SMOTE:
 [[273  28]
 [ 11   2]]


## 6. Ensemble methods

In [18]:
from sklearn.ensemble import RandomForestClassifier

hypers = {"n_estimators": 1000, "max_features": "sqrt",}
clf_rf = RandomForestClassifier(random_state = 0, verbose = True, **hypers)
clf_rf.fit(X_train_new, y_train_new)

# Predictions:
rf_pred = clf_rf.predict(x_test)

# Model performance
print('Confusion matrix of random forest with SMOTE:\n', confusion_matrix(y_test, rf_pred))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Confusion matrix of random forest with SMOTE:
 [[301   0]
 [ 13   0]]


[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:   52.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    0.2s finished


## 7. SVM

In [23]:
from sklearn.svm import SVC

svmc = SVC(gamma = 'scale', cache_size = 1024) # cache size can improve performance
svmc.fit(X_train_new, y_train_new)

# Predictions 
svm_pred = svmc.predict(x_test)

# Model performance
print('Confusion matrix of SVM with SMOTE:\n', confusion_matrix(y_test, svm_pred))

Confusion matrix of SVM with SMOTE:
 [[276  25]
 [ 11   2]]


## 8. Evaluate all three models

In [25]:
y_test.value_counts()

-1    301
 1     13
Name: Class, dtype: int64

In [36]:
from sklearn.metrics import classification_report

# Classification report of Decision tree
print('Classification report of Decision tree:\n',classification_report(y_test, gini_pred, target_names=['Class -1', 'Class 1']))

Classification report of Decision tree:
               precision    recall  f1-score   support

    Class -1       0.96      0.91      0.93       301
     Class 1       0.07      0.15      0.09        13

   micro avg       0.88      0.88      0.88       314
   macro avg       0.51      0.53      0.51       314
weighted avg       0.92      0.88      0.90       314



In [34]:
# Classification report of Ensemble method
print('Classification report of Random forest:\n',classification_report(y_test, rf_pred, target_names=['Class -1', 'Class 1']))

Classification report of Random forest:
               precision    recall  f1-score   support

    Class -1       0.96      1.00      0.98       301
     Class 1       0.00      0.00      0.00        13

   micro avg       0.96      0.96      0.96       314
   macro avg       0.48      0.50      0.49       314
weighted avg       0.92      0.96      0.94       314



In [35]:
# Classification report of SVC
print('Classification report of SVC:\n',classification_report(y_test, svm_pred, target_names=['Class -1', 'Class 1']))

Classification report of SVC:
               precision    recall  f1-score   support

    Class -1       0.96      0.92      0.94       301
     Class 1       0.07      0.15      0.10        13

   micro avg       0.89      0.89      0.89       314
   macro avg       0.52      0.54      0.52       314
weighted avg       0.92      0.89      0.90       314



# 9. Findings

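High precision means that most of the samples a model assigns to a class really belong to it, while high recall means that the model finds most of the samples of that class. In the SECOM labels, -1 is a pass and 1 is a fail, so the faulty products are class 1. Reading from the classification reports above, all three models achieve the same precision (0.96) on class -1. They differ in recall: the random forest reaches a recall of 1.00 on class -1, meaning that it never flags a good product as faulty, but it does so by predicting -1 for every test sample, so it misses all 13 faulty products (recall 0.00 on class 1). The decision tree and the SVM each catch 2 of the 13 faulty products (recall 0.15), at the cost of 28 and 25 false alarms respectively. The random forest's f1-score of 0.98 on class -1 therefore mostly reflects the class imbalance (301 good versus 13 faulty test samples). The class 1 and macro-averaged scores are the better indicators here, and by those measures none of the three models detects faulty products reliably.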
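
The class 1 scores in the reports can be recomputed by hand from the confusion matrices printed above. This is a minimal sketch in plain Python, taking 1 as the faulty class (in the SECOM labels, -1 is a pass and 1 is a fail). `class_metrics` is a hypothetical helper written for illustration, and it returns 0.0 where a ratio is undefined.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall and F1 for one class; 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Confusion matrices from the cells above:
# rows are the true class (-1, 1), columns are the predicted class (-1, 1)
matrices = {
    "decision tree": [[273, 28], [11, 2]],
    "random forest": [[301, 0], [13, 0]],
    "SVM": [[276, 25], [11, 2]],
}

faulty_scores = {}
for name, ((tn, fp), (fn, tp)) in matrices.items():
    faulty_scores[name] = class_metrics(tp, fp, fn)
    p, r, f1 = faulty_scores[name]
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

# Accuracy of a trivial model that always predicts -1 on the 314 test samples
baseline_accuracy = 301 / 314
print(f"always predict -1: accuracy={baseline_accuracy:.2f}")
```

The last line shows why overall accuracy is a weak yardstick on this test set: a model that never predicts a fault is already right on 301 of the 314 samples.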

