# Milestone 2 Assignment

## Background

    The capstone project focuses on diaper manufacturing quality. In the Lesson 01 assignment, you discovered how the diaper manufacturing process works. Generally, to ensure or predict quality, a diaper manufacturer needs to monitor every step of the manufacturing process with sensors such as heat sensors, glue sensors, and glue-level sensors.

    For this capstone project, we will use the SECOM manufacturing Data Set from the UCI Machine Learning Repository. The data set was originally collected for semiconductor manufacturing, but in our case, we will assume that it describes the diaper manufacturing process.

    The dataset consists of two files:

    (1) a dataset file SECOM containing 1567 examples, each with 591 features, presented in a 1567 x 591 matrix
    (2) a labels file listing the classifications and date time stamp for each example
    
    Reference
    Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

## Instructions

    Create a new notebook for this assignment named Milestone02_studentname.ipynb (replacing studentname with your own).

    (1) Split prepared data from Milestone 1 into training and testing
    (2) Build a decision tree model that detects faulty products
    (3) Build an ensemble model that detects faulty products
    (4) Build an SVM model
    (5) Evaluate all three models
    (6) Describe your findings

### (1) Split prepared data from Milestone 1 into training and testing

In [1]:
# Import packages

from collections import Counter, OrderedDict
from subprocess import check_call

import category_encoders as ce
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from IPython.display import Image
from matplotlib import pyplot
from sklearn import linear_model, metrics, svm, tree
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [2]:
# Load dataset

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom.data"
names = ["feature" + str(x) for x in range(1, 591)]  # feature1 ... feature590
secom_var = pd.read_csv(url, sep=" ", names=names, na_values="NaN")

url_l = "https://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom_labels.data"
secom_labels = pd.read_csv(url_l, sep=" ", names=["classification", "date"],
                           parse_dates=["date"], na_values="NaN")

# Merge the two datasets on their row index

data = pd.merge(secom_var, secom_labels, left_index=True, right_index=True)
print(data.columns)
(nrows, ncols) = data.shape

Index(['feature1', 'feature2', 'feature3', 'feature4', 'feature5', 'feature6',
       'feature7', 'feature8', 'feature9', 'feature10',
       ...
       'feature583', 'feature584', 'feature585', 'feature586', 'feature587',
       'feature588', 'feature589', 'feature590', 'classification', 'date'],
      dtype='object', length=592)


In [3]:
# Replacing Nulls

data = data.fillna(0)
data_null = data.isnull().sum()
print(data_null)
# Report the count from the data instead of hard-coding it
print("There are {} columns with missing data".format((data_null > 0).sum()))

feature1          0
feature2          0
feature3          0
feature4          0
feature5          0
feature6          0
feature7          0
feature8          0
feature9          0
feature10         0
feature11         0
feature12         0
feature13         0
feature14         0
feature15         0
feature16         0
feature17         0
feature18         0
feature19         0
feature20         0
feature21         0
feature22         0
feature23         0
feature24         0
feature25         0
feature26         0
feature27         0
feature28         0
feature29         0
feature30         0
                 ..
feature563        0
feature564        0
feature565        0
feature566        0
feature567        0
feature568        0
feature569        0
feature570        0
feature571        0
feature572        0
feature573        0
feature574        0
feature575        0
feature576        0
feature577        0
feature578        0
feature579        0
feature580        0
feature581        0


#### Comments:

    I replaced the null values with a value of 0. 
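
As a small illustration of the fill step, the sketch below counts the columns that contain missing values before and after replacing them with 0. The toy frame and its values are hypothetical stand-ins for the SECOM features, not data from the assignment.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the SECOM features (hypothetical values).
toy = pd.DataFrame({
    "feature1": [1.0, np.nan, 3.0],
    "feature2": [np.nan, np.nan, 6.0],
    "feature3": [7.0, 8.0, 9.0],
})

# Count missing values per column before filling.
missing_per_column = toy.isnull().sum()
n_columns_with_missing = int((missing_per_column > 0).sum())
print("Columns with missing data before filling:", n_columns_with_missing)

# Replace every NaN with 0, mirroring the cell above.
toy_filled = toy.fillna(0)
n_columns_after = int((toy_filled.isnull().sum() > 0).sum())
print("Columns with missing data after filling:", n_columns_after)
```

Counting before filling keeps a record of how much of each sensor column was imputed, which the filled frame alone no longer shows.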

In [4]:
# Define the target and features:

target_label = 'classification'
non_feature = 'date'
feature_labels = [x for x in data.columns if x not in [target_label]+ [non_feature]]

# One-hot encode inputs

data_expanded = pd.get_dummies(data, drop_first=True)
print('DataFrame one-hot-expanded shape: {}'.format(data_expanded.shape))

# Get target and original x-matrix

y = data[target_label]
# DataFrame.as_matrix was removed from pandas; select the columns and convert instead
x = data[feature_labels].to_numpy()

DataFrame one-hot-expanded shape: (1567, 592)




In [5]:
# Split dataset into training set and test set

X_train, X_test, y_train, y_test = train_test_split(x, y, 
                                  test_size=0.3,random_state=42) # 70% training and 30% test
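
Because faulty products are rare, a plain random split can leave the two subsets with different fail rates. A minimal sketch of one possible alternative, a stratified split, is shown below on synthetic labels. The data, the 7% fail rate, and the `stratify` argument are assumptions for illustration and are not part of the original cell.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the SECOM data (hypothetical values):
# 1000 rows, 5 features, 7% labeled 1 (fail) and the rest -1 (pass).
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.where(np.arange(1000) < 70, 1, -1)

# stratify=y_demo keeps the pass/fail ratio the same in both subsets.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

train_fail_rate = np.mean(yd_train == 1)
test_fail_rate = np.mean(yd_test == 1)
print("Fail rate in training set:", train_fail_rate)
print("Fail rate in test set:", test_fail_rate)
```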

### (2) Build a decision tree model that detects faulty products

In [6]:
# Gradient-boosted decision trees (an ensemble of shallow trees, not a single tree)

nTrees = 100
max_depth = 5
min_node_size = 5
learning_rate = 0.05
# The loss is left at its default (log loss, named 'deviance' in older scikit-learn releases)
gbm_clf = GradientBoostingClassifier(n_estimators=nTrees, learning_rate=learning_rate,
                                     max_depth=max_depth, min_samples_leaf=min_node_size)
gbm_clf.fit(X_train, y_train)
print(gbm_clf.feature_importances_[:5])
dt_y_test_hat = gbm_clf.predict(X_test)

[4.71153837e-03 5.24252587e-06 3.84566463e-03 8.51207610e-04
 2.41536189e-03]
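
The cell above fits a `GradientBoostingClassifier`, which combines many trees. For comparison, a single decision tree could be sketched as follows. This is only an illustration on synthetic data generated with `make_classification`; the data, the variable names, and `random_state=0` are assumptions, while `max_depth` and `min_samples_leaf` reuse the values from the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the prepared SECOM data (hypothetical).
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.93, 0.07], random_state=42,
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42
)

# A single tree, reusing the depth and leaf-size settings from the cell above.
single_tree = DecisionTreeClassifier(
    max_depth=5, min_samples_leaf=5, random_state=0
)
single_tree.fit(Xs_train, ys_train)

tree_y_test_hat = single_tree.predict(Xs_test)
tree_accuracy = accuracy_score(ys_test, tree_y_test_hat)
print("Single-tree accuracy:", tree_accuracy)
print("First five importances:", single_tree.feature_importances_[:5])
```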


### (3) Build an ensemble model that detects faulty products

In [7]:
# Random Forest

nTrees = 100
max_depth = 5
min_node_size = 5
verbose = 0
clf = RandomForestClassifier(n_estimators=nTrees, max_depth=max_depth, random_state=0, verbose=verbose, min_samples_leaf=min_node_size)
clf.fit(X_train, y_train)
print(clf.feature_importances_[:5])
rf_y_test_hat = clf.predict(X_test)

[0.00556305 0.         0.0035221  0.         0.        ]


### (4) Build an SVM model

In [8]:
# Set the parameters

cost = .9 # penalty parameter C of the error term
gamma = 5 # RBF kernel coefficient; not used by the LinearSVC model below

In [9]:
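# Test a LinearSVC

clf1 = svm.LinearSVC(C=cost).fit(X_train, y_train)
svm_y_test_hat = clf1.predict(X_test)
print("LinearSVC")
# Note: classification_report expects (y_true, y_pred). The predictions are passed
# first here, so "support" counts predicted labels and the precision and recall
# columns are swapped relative to the usual reading.
print(classification_report(svm_y_test_hat, y_test))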

LinearSVC
              precision    recall  f1-score   support

          -1       1.00      0.93      0.96       470
           1       0.00      0.00      0.00         1

   micro avg       0.93      0.93      0.93       471
   macro avg       0.50      0.47      0.48       471
weighted avg       1.00      0.93      0.96       471
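
Linear SVMs are generally sensitive to the scale of their inputs, and `LinearSVC` may not converge on unscaled sensor readings. The sketch below shows one possible way to standardize the features inside a pipeline before fitting. It runs on synthetic data, and the `StandardScaler` step, `max_iter=10000`, and `random_state=0` are assumptions for illustration rather than settings from the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic, imbalanced data standing in for the sensor features (hypothetical).
X_svm, y_svm = make_classification(
    n_samples=600, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
X_svm[:, 0] *= 1000.0  # exaggerate the scale difference between features

Xv_train, Xv_test, yv_train, yv_test = train_test_split(
    X_svm, y_svm, test_size=0.3, random_state=42
)

# Standardize each feature, then fit the linear SVM with a fixed seed.
scaled_svm = make_pipeline(
    StandardScaler(),
    LinearSVC(C=0.9, random_state=0, max_iter=10000),
)
scaled_svm.fit(Xv_train, yv_train)

svm_pred = scaled_svm.predict(Xv_test)
# True labels first, predictions second.
print(classification_report(yv_test, svm_pred))
```

Keeping the scaler inside the pipeline means it is fit on the training rows only, so no information from the test rows leaks into the scaling.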





### (5) Evaluate all three models

#### Decision Tree

In [10]:
# Accuracy

dt_accuracy_score = accuracy_score(y_test, dt_y_test_hat)
print(dt_accuracy_score)

0.9341825902335457


#### Random Forest

In [11]:
# Accuracy

rf_accuracy_score = accuracy_score(y_test, rf_y_test_hat)
print(rf_accuracy_score)

0.9341825902335457


#### SVM

In [12]:
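# Test a LinearSVC (refit; no random_state is set, so the result can differ between runs)

clf1 = svm.LinearSVC(C=cost).fit(X_train, y_train)
svm_y_test_hat = clf1.predict(X_test)
print("LinearSVC")
# Note: the predictions are passed in the y_true position, so "support" counts
# predicted labels and the precision and recall columns are swapped.
print(classification_report(svm_y_test_hat, y_test))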

LinearSVC
              precision    recall  f1-score   support

          -1       0.07      0.92      0.14        36
           1       0.90      0.06      0.12       435

   micro avg       0.13      0.13      0.13       471
   macro avg       0.49      0.49      0.13       471
weighted avg       0.84      0.13      0.12       471
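
scikit-learn's metric functions take the true labels first and the predictions second. The reports above pass the predictions first, which swaps precision and recall for each class. The toy labels below are hypothetical and only illustrate the effect of the argument order.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 8 passes (-1) and 2 fails (1).
y_true_demo = np.array([-1] * 8 + [1] * 2)
# Hypothetical predictions: one pass flagged as a fail, both fails missed.
y_pred_demo = np.array([-1] * 7 + [1] + [-1] * 2)

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[-1, 1])
print(cm)

# Conventional order: (y_true, y_pred).
p_ok = precision_score(y_true_demo, y_pred_demo, pos_label=-1)
r_ok = recall_score(y_true_demo, y_pred_demo, pos_label=-1)

# Swapped order: precision and recall trade places.
p_swapped = precision_score(y_pred_demo, y_true_demo, pos_label=-1)
r_swapped = recall_score(y_pred_demo, y_true_demo, pos_label=-1)

print("Conventional precision, recall:", p_ok, r_ok)
print("Swapped precision, recall:", p_swapped, r_swapped)
```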





### (6) Describe your findings

#### Comments:

    In Milestone 1, I first ran a model using all 591 features to predict whether or not a quality diaper was produced. This model had very low accuracy. To improve the model, I used SMOTE to handle the class imbalance. This improved the accuracy significantly, but it was still only .57. I then used recursive feature elimination to address overfitting, which improved the accuracy to .932. In this Milestone 2 assignment, I first ran the decision tree model from step (2), fit here with GradientBoostingClassifier, which produced an accuracy of .934. I then ran a random forest model, which also produced an accuracy of .934. This is the highest accuracy that I have obtained in predicting whether or not a quality diaper was produced. The LinearSVC results were unstable: the two runs above correspond to accuracies of about .93 and .13.
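
When one class dominates, accuracy alone can look high even for a model that never flags a faulty product. The sketch below compares accuracy with the recall on the fail class for a majority-class baseline. The labels and the 93%/7% split are hypothetical, and `DummyClassifier` is used here only as a reference point, not as a model from the assignment.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 93 passes (-1) and 7 fails (1).
y_base = np.array([-1] * 93 + [1] * 7)
X_base = np.zeros((100, 1))  # the baseline ignores the feature values

# Always predict the most frequent class seen during fitting.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_base, y_base)
base_pred = baseline.predict(X_base)

base_acc = accuracy_score(y_base, base_pred)
fail_recall = recall_score(y_base, base_pred, pos_label=1)
print("Baseline accuracy:", base_acc)
print("Baseline recall on the fail class:", fail_recall)
```

Reporting recall on the fail class alongside accuracy makes it easier to tell whether a model detects faulty products or mostly predicts the majority class.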