# Milestone 03
# Peter Lorenz

## 0. Preliminaries

Import the required libraries:

In [1]:
import sys
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.utils import class_weight
from tensorflow import keras

Set global options:

In [2]:
# Display plots inline
%matplotlib inline

# Display multiple cell outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Suppress scientific notation
np.set_printoptions(suppress=True)
np.set_printoptions(precision=3)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Do not truncate numpy arrays
np.set_printoptions(threshold=sys.maxsize)

Declare constants:

Declare utility functions:

## 1. Split data from Milestone 1 into training and testing
In this section, we split the prepared data from Milestone 1 into training and test data sets. But first we must reload and clean this data following our procedure in Milestone 1.

### Read and clean data (from Milestone 1)
Here we follow the steps taken in Milestone 1 to prepare our data for modeling. Commentary is kept to a minimum as these matters have already been discussed in Milestone 1. Also, cell output is kept to the minimum necessary to confirm that the code is functioning as expected.

In [3]:
# Internet location of the data set and labels
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom.data"
labels_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom_labels.data"

# Download sensor data and labels into a dataframe object, specify python engine for regex
sensor_data = pd.read_csv(url, sep='\s{1,}', engine='python')
sensor_labels_data = pd.read_csv(labels_url, sep='\s{1,}', engine='python')

# Generate index-based column names for the sensor data set
sensor_data.columns = list('s' + str(idx + 1) for idx in range(0, sensor_data.shape[1]))

# Assign column names to the labels
sensor_labels_data.columns = ['result', 'date', 'time']

# Save the original data frame for future reference as we modify its contents
sensor_data_orig = sensor_data

# Confirm that data set and labels are loaded
print('Sensor data set:')
sensor_data.shape
sensor_data.head()

print('Sensor labels:')
sensor_labels_data.shape
sensor_labels_data.head()

Sensor data set:


(1566, 590)

Unnamed: 0,s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,...,s581,s582,s583,s584,s585,s586,s587,s588,s589,s590
0,3095.78,2465.14,2230.422,1463.661,0.829,100.0,102.343,0.125,1.497,-0.001,...,0.006,208.204,0.502,0.022,0.005,4.445,0.01,0.02,0.006,208.204
1,2932.61,2559.94,2186.411,1698.017,1.51,100.0,95.488,0.124,1.444,0.004,...,0.015,82.86,0.496,0.016,0.004,3.175,0.058,0.048,0.015,82.86
2,2988.72,2479.9,2199.033,909.793,1.32,100.0,104.237,0.122,1.488,-0.012,...,0.004,73.843,0.499,0.01,0.003,2.054,0.02,0.015,0.004,73.843
3,3032.24,2502.87,2233.367,1326.52,1.533,100.0,100.397,0.123,1.503,-0.003,...,,,0.48,0.477,0.104,99.303,0.02,0.015,0.004,73.843
4,2946.25,2432.84,2233.367,1326.52,1.533,100.0,100.397,0.123,1.529,0.017,...,0.005,44.008,0.495,0.019,0.004,3.828,0.034,0.015,0.005,44.008


Sensor labels:


(1566, 3)

Unnamed: 0,result,date,time
0,-1,"""19/07/2008","12:32:00"""
1,1,"""19/07/2008","13:17:00"""
2,-1,"""19/07/2008","14:43:00"""
3,-1,"""19/07/2008","15:22:00"""
4,-1,"""19/07/2008","17:53:00"""


Drop columns with more than 10% NaN:

In [4]:
# Count NaN's per column
df_na = sensor_data.isna().sum()

# Identify columns above cutoff of 10% NaN's
nan_10_pct = df_na[df_na > 0.1 * sensor_data.shape[0]]

# Drop columns with more than 5% NaN's
sensor_data = sensor_data.drop(list(nan_10_pct.index), axis=1)
sensor_data.shape

(1566, 538)

Impute fields with NaN in the remaining columns:

In [5]:
# Impute and replace missing values using column median
sensor_data = sensor_data.replace('?', 
                                  np.NaN).apply(lambda x: x.fillna(x.median()))

Remove columns with zero variance:

In [6]:
# Identify columns with zero variance
zero_variance_cols = np.array(sensor_data.columns[sensor_data.var() == 0])

# Drop columns with zero variance
sensor_data = sensor_data.drop(zero_variance_cols, axis=1)
sensor_data.shape

(1566, 422)

Our data set is now almost ready for modeling. We deal with class inbalance and feature standardization after splitting the data into test and training sets.

### Feature standardization
We choose RobustScaler over StandardScaler due to the skewness of a significant number of features in the data set, as determined in Milestone 1. Because StandardScaler must compute the mean and standard deviation, it is susceptible to outliers. On the other hand, RobustScaler is based on percentiles and, hence, is less susceptible to outliers. We now apply RobustScaler to our training set:

In [7]:
# Scale data
scaler = RobustScaler()
sensor_data = pd.DataFrame(scaler.fit_transform(sensor_data), 
                           columns=sensor_data.columns)

# Display scaled data set
sensor_data.head()

Unnamed: 0,s1,s2,s3,s4,s5,s7,s8,s9,s10,s11,...,s577,s578,s583,s584,s585,s586,s587,s588,s589,s590
0,0.937,-0.398,0.794,0.352,-0.973,0.126,0.852,0.332,0.042,-1.322,...,-1.049,-0.523,0.38,1.735,1.9,1.706,-0.769,0.546,0.452,1.937
1,-0.875,0.705,-0.396,0.815,0.386,-0.916,0.63,-0.17,0.281,0.078,...,0.953,-0.814,-0.983,0.388,0.3,0.422,2.674,3.464,3.29,0.156
2,-0.252,-0.226,-0.055,-0.741,0.007,0.414,-0.259,0.252,-0.578,-0.322,...,0.271,-0.938,-0.268,-0.714,-1.1,-0.711,-0.021,0.01,-0.065,0.028
3,0.231,0.041,0.873,0.082,0.432,-0.17,0.407,0.393,-0.094,-0.661,...,0.071,-0.51,-4.514,94.449,100.9,97.651,-0.021,0.01,-0.065,0.028
4,-0.724,-0.773,0.873,0.082,0.432,-0.17,0.407,0.636,0.937,0.443,...,0.088,-0.008,-1.184,1.041,0.8,1.082,0.966,0.031,0.194,-0.396


### Split data into training and test
We now split the data into training and test data sets, reserving ten percent of the rows for testing (157 rows) and using the rest to train our models. We choose this relatively high number to ensure that a sufficient number of positives exist in the test data:

In [8]:
# Split data into training and test
X_train, X_test, y_train, y_test = \
    train_test_split(sensor_data, 
                     sensor_labels_data['result'], 
                     test_size = 0.2,
                     random_state = 0)

# Describe training and test
print("Training data has {} rows.".format(X_train.shape[0]))
print("Test data has {} rows.".format(X_test.shape[0]))

Training data has 1252 rows.
Test data has 314 rows.


Now that the test data has been isolated, we can deal with class inbalance and feature standardization in the training data.

### Balance classes using class weights
To address class inbalance we are not using oversampling as in previous iterations of this project. Instead, we pass a dictionary of class weights to the fit() method on the model (below).

We now turn to our results from Milestone 1 to select the features for our models.

### Feature selection
For our neural networks, we will not use feature selection. Instead, we rely on the internals of the neural network to prioritize features that lead to the best predictions.

Now that our training data set has been resampled and standardized, we are ready to proceed to modeling.

## 1. Build a simple neural networks model
In this section we build a simple neural network with no hidden layers. We begin by computing the class weights to address class imbalance in the target:

In [57]:
# Prepare class weights to balance classes
class_weights = class_weight.compute_class_weight('balanced',
                                                 np.unique(y_train),
                                                 y_train)

# Create class-weights dictionary
class_weights = dict(zip([0, 1], class_weights))

Adjust target to 0 and 1:

In [60]:
y_train = y_train.replace(-1, 0)
y_test = y_test.replace(-1, 0)

Now we build and compile our model using the 'sgd' optimizer and a binary cross entropy loss function:

In [80]:
# Create model
model = keras.Sequential([
    keras.layers.Dense(128, activation = tf.nn.relu, input_dim = 422),
    keras.layers.Dense(1, activation = tf.nn.softmax)
])

# Compile model
model.compile(optimizer = 'sgd', 
              loss = 'binary_crossentropy',
              metrics = ['accuracy'])

Next we fit the model using the class weights computed above:

In [81]:
# Fit model
model.fit(X_train, y_train, 
          epochs = 10, class_weight=class_weights)

Train on 1252 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x407844dec8>

Finally, we evaluate the model:

In [68]:
# Evaluate model
test_loss, test_acc = model.evaluate(X_test, y_test)

# Display accuracy
print('Test accuracy:', test_acc)

Test accuracy: 0.041401275


With an accuracy of just 0.04, clearly there is remove for improvement.

## 2. Build a deep neural networks model
In this section we build a deep neural network with a hidden layer to try to improve the accuracy of our model. We begin by building and compiling our model, using the class weights computed above. We use the 'sgd' optimizer and a binary cross entropy loss function:

In [10]:
# Create model
model = keras.Sequential([
    keras.layers.Dense(128, activation = tf.nn.relu, input_dim = 422),
    keras.layers.Dense(128, activation = tf.nn.relu),
    keras.layers.Dense(1, activation = tf.nn.softmax)
])

# Compile model
model.compile(optimizer = 'sgd', 
              loss = 'binary_crossentropy',
              metrics = ['accuracy'])

Instructions for updating:
If using Keras pass *_constraint arguments to layers.


Next we fit the model, passing in the class weights computed above:

In [11]:
# Fit model
model.fit(X_train, y_train, 
          epochs = 10, class_weight=class_weights)

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Train on 1252 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x4067ed4208>

Finally, we evaluate the model:

In [12]:
# Evaluate model
test_loss, test_acc = model.evaluate(X_test, y_test)

# Display accuracy
print('Test accuracy:', test_acc)

Test accuracy: 0.041401275


The accuracy has not improved over our basic neural network, remaining at 0.04.

## 3. Build a RNN model
In this section we build a recurrent neural network to try to improve upon the accuracy of our previous models. We begin by setting our labels to 0 and 1 (from -1 and 1):

In [60]:
y_train = y_train.replace(-1, 0)
y_test = y_test.replace(-1, 0)

Now we build a recurrent neural network with three LSTM layers and two hidden layers, with the 'adam' optimizer and sparse categorical cross entropy as the loss function:

In [111]:
# Build model
model = keras.models.Sequential()
model.add(keras.layers.LSTM(64, return_sequences=True, input_dim = 422))
model.add(keras.layers.LSTM(64, return_sequences=True))
model.add(keras.layers.LSTM(64))
model.add(keras.layers.Dense(2, activation = 'softmax'))
model.compile(loss = 'sparse_categorical_crossentropy', optimizer = 'adam', 
              metrics = ['accuracy'])
print(model.summary())

Model: "sequential_30"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_31 (LSTM)               (None, None, 64)          124672    
_________________________________________________________________
lstm_32 (LSTM)               (None, None, 64)          33024     
_________________________________________________________________
lstm_33 (LSTM)               (None, None, 64)          33024     
_________________________________________________________________
lstm_34 (LSTM)               (None, None, 64)          33024     
_________________________________________________________________
lstm_35 (LSTM)               (None, 64)                33024     
_________________________________________________________________
dense_36 (Dense)             (None, 2)                 130       
Total params: 256,898
Trainable params: 256,898
Non-trainable params: 0
_______________________________________________

In order to fit the model, we need to reshape the input to 3 dimensions:

In [84]:
# Convert dataframe to array
x_train_arr = np.array(X_train)
x_test_arr = np.array(X_test)

# Reshape input to be 3D, i.e. samples, timesteps, features
x_train_arr = x_train_arr.reshape((x_train_arr.shape[0], 1, x_train_arr.shape[1]))
x_test_arr = x_test_arr.reshape((x_test_arr.shape[0], 1, x_test_arr.shape[1]))
print(x_train_arr.shape, y_train.shape, x_test_arr.shape, y_test.shape)

(1252, 1, 422) (1252,) (314, 1, 422) (314,)


Now we train the model, choosing 5 as the number of epochs:

In [112]:
# Train the model
model.fit(x_train_arr, y_train, 
          validation_data = (x_test_arr, y_test), 
          epochs = 5, batch_size = 128)

Train on 1252 samples, validate on 314 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x40192588c8>

Finally, we evaluate:

In [113]:
# Evaluate model
test_loss, test_acc = model.evaluate(x_test_arr, y_test)

# Display accuracy
print('Test accuracy:', test_acc)

Test accuracy: 0.95859873


Our accuracy is now close to 96%!

## 4. Summarize your findings
In this section, we summarize the findings of our three milestones, focusing on what the models we have developed can tell us about how to optimize the diaper manufacturing process. The overall business objective, as noted in Assignment 1, is to detect problems that might potentially lead to a poor quality product. So quality is the driving motivation behind our model. But as we noted in Milestone 2, there is an inherent balance between quality and cost. A model that tells us to monitor all sensors constantly may end up producing quality products, but would drive up costs to an unreasonable level. More fundamentally, it would nullify the purpose of having a model in the first place to direct us where to focus our energy for maximum efficiency. With this business objective in view, we now evaluate our findings.

### Neural Networks
In the present milestone, we developed three neural networks. Our first neural network was a simple network with no hidden layers, providing an opportunity co compute class weights and align our inputs to the necessary dimensions. Our second neural network added a single hidden layer. We found that adding additional layers did not improve accuracy, so we left this model with a single hidden layer. Both of our first two models used the SGD optimizer and binary cross entropy loss function with RELU activation functions, except in the output layers, which used softmax. Neither of these initial models performed well. However, our third neural network, a recurrent neural network (RNN), achieved an impressive accuracy of nearly 96%. This model used three LSTM (Long Short-Term Memory) layers and a dense layer for the output layer along with the Adam optimizer and sparse categorical cross entropy for the loss function. In five epochs it was able to converge with an accuracy consistently above 95%. 
Since in-depth "X-ray" analysis of our neural network was not included in the initial contract, it is not immediately clear which sensors contributed most to the success of the model. This kind of analysis would be a potential next step to consider for a future contract.

### Feature Selection
In Milestone 1, we identified ten sensors (out of 590 initial sensors) that represent the best predictors of the target variable based on stepwise selection:

1. s22
1. s34
1. s65
1. s104
1. s125
1. s130
1. s144
1. s189
1. s313
1. s438

This selection of feature gives us an idea as to which sensors are potentially the most significant in predicting product defects.

### Support Vector Machine
In Milestone 2, we developed a support vector machine that had some success in predicting defects using this list of features. This model delivered a 67% recall (despite a not-so-impressive precision of 5%) with AUC of 0.72:

<img src="https://github.com/pelorenz/data-science-420/raw/master/svm-auc.png">

While far less accurate than our RNN model, the SVM offers some confirmation that the ten sensors identified in feature selection are helpful predictors of potential defects. We recommend focusing on these sensors to optimize manufacturing for the reduction of defects.

### Model Evaluation
As mentioned, the model with the highest accuracy was our final recurrent neural network, with nearly 96% accuracy. Given the black-box nature of neural networks in general, this metric supplied us with our best (and only) means of evaluating this model. The question that remains is whether this model performed well enough in production. Given the high degree of accuracy, we recommend a trial program to test the success of the model in production. This trial would provide empirical data to assess the model's performance in a live setting. The results of the trial would let us know whether the model is sufficiently accurate in a real-life setting to rely on more extensively. In addition, the trial would supply us with more labeled data to feed into the model to improve accuracy. One limitation of the present model is the fact that it was trained on only 1252 data points out of a total of just 1566. Having more data would certainly have improved our confidence as to the production-readiness of the model. Collecting additional data while the model is in trial will allow us to further improve the model and prepare for releasing it fully into production.

### Conclusion
In this project, we have shown the benefits of applying a neural network (RNN) to data that proved difficult to model with convential methods, such as decision trees, ensemble models, and support vector classifiers. Our highest accuracy with these methods was 67% using the support vector classifier. Our best neural network, a recurrent network, achieved an accuracy of 96%, allowing us to plan for a production trial. This trial will allow us to collect additional labeled data to improve the model further in preparation for a full launch into production. With the model in production, we should be able to gauge when products are not meeting quality standards simply by feeding the sensors into the model, allowing us to optimize for quality by relying on the model.