<b>Importing the CSV file into a Pandas DataFrame</b>

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.model_selection import train_test_split

d_failures=pd.read_csv("C:/Users/mfrid/Downloads/device_failure.csv")

In [2]:
d_failures.dtypes # All the dtypes look correct except for date, so we will cast it to datetime64[ns] type

date          object
device        object
failure        int64
attribute1     int64
attribute2     int64
attribute3     int64
attribute4     int64
attribute5     int64
attribute6     int64
attribute7     int64
attribute8     int64
attribute9     int64
dtype: object

In [3]:
d_failures['date'] = pd.to_datetime(d_failures['date']) # Convert date from object dtype to datetime64[ns]

Looking at the columns available in our DataFrame, we observe the following: <br>
-We assume that date cannot be a significant predictor, since each device has only one entry, there is no way to relate different devices, and there are too many unique dates in the dataset <br>
-Attributes 1 through 9 are all numeric with no missing values and will be evaluated as predictors <br>
-Failure will represent our target variable (the value we are looking to predict); it is binary, containing a value of 0 (non-failure) or 1 (failure) <br><br>
We will divide our data into 2 sets, Training (90%) and Testing (10%), because the percentage of failures is so low<br><br>
<b>Preprocessing/Scaling</b>

In [4]:
from sklearn import preprocessing
X=d_failures[['attribute1','attribute2','attribute3','attribute4','attribute5','attribute6','attribute7','attribute8','attribute9']]
#X contains all predictive attributes
y=d_failures['failure']
#y is our binary target variable
min_max_scaler = preprocessing.MinMaxScaler()
X_scaled = min_max_scaler.fit_transform(X) # Scales each attribute to the range [0, 1] using that column's own min and max
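
To make the scaling step concrete, here is a minimal pure-Python sketch of the min-max formula, (x - min) / (max - min), applied to a single column. The function name `min_max_scale` and the sample values are hypothetical and only for illustration; the notebook itself relies on `MinMaxScaler`.

```python
def min_max_scale(column):
    """Rescale a list of numbers to [0, 1] using the column's own min and max."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.0
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]

# Hypothetical attribute values, not taken from the dataset
column = [10, 20, 40, 110]
scaled = min_max_scale(column)
print(scaled)  # [0.0, 0.1, 0.3, 1.0]
```

Because the minimum and maximum are taken per column, a single extreme value compresses every other value in that column toward 0.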

<b>Creating Training and Test Datasets</b>

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.10, random_state=77)
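
The split above is not stratified, so the number of failures that land in the test set depends on the random seed. The sketch below is a rough pure-Python illustration of the idea behind a stratified split, where each class is split at the same fraction. The helper `stratified_split` is hypothetical; in scikit-learn the same idea is available through the `stratify` argument of `train_test_split`, which this notebook does not use.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac, seed):
    """Return (train_idx, test_idx), splitting every class at the same fraction."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)                        # randomize within the class
        n_test = round(test_frac * len(idx))    # same fraction for every class
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

# Class counts taken from the notebook: 124,388 non-failures and 106 failures
labels = [0] * 124388 + [1] * 106
train_idx, test_idx = stratified_split(labels, test_frac=0.10, seed=77)

test_failures = sum(labels[i] for i in test_idx)
train_failures = sum(labels[i] for i in train_idx)
print(len(test_idx), test_failures, train_failures)
```

With so few failures, fixing the per-class fraction keeps the test set from ending up with an unusually small or large share of them.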

<b>Model Selection/Creation</b><br>

Since the relationship between the features and the target does not appear to be linear, optimal results cannot be obtained from Logistic Regression. Instead, we will use a Support Vector Machine classifier with an RBF (radial basis function) kernel, which measures similarity by the Euclidean distance between feature vectors. Additionally, we balance the class weights, since failures (1s) are far rarer than non-failures (0s) and we would like to maximize the accuracy of predicting both failures and non-failures. The misclassification cost C has been set to 1000 to ensure a heavy penalty is levied if one of the few failures is incorrectly predicted.
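
To illustrate what balancing the class weights means, the sketch below applies the formula scikit-learn documents for `class_weight='balanced'`, n_samples / (n_classes * class_count), to the class counts of the full dataset reported in this notebook. The helper name `balanced_class_weights` is hypothetical, and the model below is fit on the training split, so its actual weights will differ slightly.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * count of that class)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Counts from the full dataset: 124,388 non-failures (0) and 106 failures (1)
counts = {0: 124388, 1: 106}
weights = balanced_class_weights(counts)

for label, weight in weights.items():
    print("class %d weight: %.4f" % (label, weight))

# How much more an error on a failure costs than an error on a non-failure
ratio = weights[1] / weights[0]
print("failure/non-failure weight ratio: %.1f" % ratio)
```

In `SVC`, each class weight multiplies C for that class, so with C=1000 a misclassified failure is penalized far more heavily than a misclassified non-failure.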

In [15]:
from sklearn import svm
svc = svm.SVC(C=1000.0, class_weight='balanced',
    kernel='rbf',
    max_iter=-1, random_state=77, shrinking=True,
    tol=0.001, verbose=False)
svc.fit(X_train, y_train)

SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=77, shrinking=True,
  tol=0.001, verbose=False)

In [16]:
svc.score(X_test,y_test) #Overall % accuracy on the test data

0.96120481927710844

In [17]:
from sklearn.metrics import confusion_matrix #Matrix showing accurate and inaccurate predictions on the test data
confusion_matrix(y_test,svc.predict(X_test))

array([[11961,   479],
       [    4,     6]])

<b>Key</b><br>
TN=[0,0] <br>
FN=[1,0] <br>
FP=[0,1] <br>
TP=[1,1]
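
Overall accuracy can hide how the model performs on the rare failure class. As a small sketch, the four cells of the test-data confusion matrix above can be turned into per-class rates. The function name `summarize` is hypothetical, and the cell values are copied from the output of In [17].

```python
def summarize(tn, fp, fn, tp):
    """Compute common rates from the four confusion-matrix cells."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,     # share of all predictions that are right
        "recall": tp / (tp + fn),          # share of true failures that were caught
        "precision": tp / (tp + fp),       # share of predicted failures that were real
        "specificity": tn / (tn + fp),     # share of non-failures correctly cleared
    }

# Cells from the test-data confusion matrix: [[11961, 479], [4, 6]]
metrics = summarize(tn=11961, fp=479, fn=4, tp=6)
for name, value in metrics.items():
    print("%s: %.4f" % (name, value))
```

The accuracy reproduces the 0.9612 score from In [16], while recall and precision show that 6 of the 10 test failures were caught and that most predicted failures were false alarms.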

In [18]:
from sklearn.metrics import confusion_matrix #Matrix showing accurate and inaccurate predictions on the training data
confusion_matrix(y_train,svc.predict(X_train))

array([[107207,   4741],
       [    33,     63]])

<b>Key</b><br>
TN=[0,0] <br>
FN=[1,0] <br>
FP=[0,1] <br>
TP=[1,1]

In [19]:
from sklearn.metrics import confusion_matrix #Matrix showing accurate and inaccurate predictions on all data
confusion_matrix(y,svc.predict(X_scaled))

array([[119168,   5220],
       [    37,     69]])

<b>Key</b><br>
TN=[0,0] <br>
FN=[1,0] <br>
FP=[0,1] <br>
TP=[1,1]

<b>Rebuilding the Model with All Data</b>

In [20]:
#Same as previous model, but with all data
from sklearn import svm
svc = svm.SVC(C=1000.0, class_weight='balanced',
    kernel='rbf',
    max_iter=-1, random_state=77, shrinking=True,
    tol=0.001, verbose=False)
svc.fit(X_scaled, y)

SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=77, shrinking=True,
  tol=0.001, verbose=False)

In [25]:
svc.score(X_scaled, y) # Overall % accuracy on all data, using the model refit on all data

0.95866467460279214

In [22]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y,svc.predict(X_scaled))

array([[119279,   5109],
       [    37,     69]])

<b>Key</b><br>
TN=[0,0] <br>
FN=[1,0] <br>
FP=[0,1] <br>
TP=[1,1]

<b>Building a Truth Matrix</b>

In [23]:
y_perfect_predictions = (y==1)

In [24]:
confusion_matrix((y==1), y_perfect_predictions) #Represents actual number of failures and non-failures

array([[124388,      0],
       [     0,    106]])

True Failures= 106 <br>
True Non-Failures=124,388

<b>Analysis of Model Predictive Accuracy</b>

In [58]:
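tpr = 69/106*100         # True positive rate: failures correctly predicted (TP / all true failures)
fnr = 100 - tpr          # False negative rate: failures predicted as non-failures
tnr = 119279/124388*100  # True negative rate: non-failures correctly predicted (TN / all true non-failures)
fpr = 100 - tnr          # False positive rate: non-failures predicted as failures
print("Failures Accurately Predicted: %.2f Percent" % tpr)
print("Failures Incorrectly Predicted: %.2f Percent" % fnr)
print("Non-Failures Accurately Predicted: %.2f Percent" % tnr)
print("Non-Failures Incorrectly Predicted: %.2f Percent" % fpr)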

Failures Accurately Predicted: 65.09 Percent
Failures Incorrectly Predicted: 34.91 Percent
Non-Failures Accurately Predicted: 95.89 Percent
Non-Failures Incorrectly Predicted: 4.11 Percent
