# Certificate in Data Science | Milestone 2 |  
> University of Washington, Seattle, WA    
> January 2020  
> N. Hicks

## Instructions

Milestone 2 continues your work with the diaper manufacturing problem, using the same datasets:
- A dataset file SECOM containing 1567 examples, each with 591 features, presented in a 1567 x 591 matrix.
- A labels file listing the classifications and date time stamp for each example.

Accomplish the following outcomes:
    - Split prepared data from Milestone 1 into training and testing data.
    - Build a decision tree model that detects faulty products.
    - Build an ensemble model that detects faulty products.
    - Build an SVM model that detects faulty products.
    - Evaluate all three models.
    - Solicit specific feedback on your code.

# Pre-Existing Work
As derived in previously accomplished assignment `Milestone 01`.

## Establish the Datatset

### Import Libraries

In [1]:
'''
Import Required Libraries
'''
import pandas as pd
import numpy as np
import os

from imblearn.over_sampling import SMOTE
from collections import Counter

from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

from sklearn.metrics import accuracy_score

from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

from sklearn.metrics import confusion_matrix

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn import svm, metrics
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score



### Functions for Scripting

In [2]:
'''
Retrieve the prescribed dataset.
RETURN: pd.DataFrame
'''
def fetch_data(path, file):
    try:
        # import the file to a dataframe
        _df = pd.read_csv(path + file, sep=' ', header=None)
        print('REMOTE FILE USED')
    except:
        # Local Copy -- Link would not permit access
        path = os.getcwd()
        print('LOCAL FILE USED\n\n')
        # import the file to a dataframe
        _df = pd.read_csv(os.path.join(path, file))
    
    return _df

In [3]:
'''
Create a scale function for a single feature.
RETURN: a scaled column feature
'''
def scale(col):
    mean_col = np.mean(col)
    sd_col = np.std(col)
    std = (col - mean_col) / sd_col
    return std

In [4]:
'''
Accomplish a 'train-test-validate' split of a provided dataset.
INPUT: pd.DataFrame
RETURN: pd.DataFrame| [train, validate, test]
'''
    # np.split will split at 60% of the length of the shuffled array,
    # then 80% of length (which is an additional 20% of data),
    # thus leaving a remaining 20% of the data.
    # This is due to the definition of the function.
def train_test_validate_split(_df):
    train, validate, test = np.split(_df.sample(frac=1), [int(.6*len(_df)), int(.8*len(_df))])
    print('TRAIN:    {}\nVALIDATE: {}\nTEST:     {}'.format(train.shape, validate.shape, test.shape))
    return [train, validate, test]

In [5]:
'''
Generate an accuracy Score for a Decision Tree
INPUT: y_test|the test target, y_pred|the predicted scores
RETURN: prints the Accuracy Score
'''
def print_scores_decision_tree(y_test, y_pred):
    print('Accuracy: {}%'.format(np.round(accuracy_score(y_test, y_pred)*100, 2)))

In [6]:
'''
Establish an appropriately labeled confusion matrix  
INPUT: y_test|test target, y_pred| test prediction, pos|positivie outcome, neg|negative outcome
RETURN: pd.DataFrame
'''
def conf_matrix(y_test, y_pred, pos, neg):
    return pd.DataFrame(
        confusion_matrix(y_test, y_pred),
        columns=['Predicted '+neg, 'Predicted '+pos],
        index=['True '+neg, 'True '+pos]
    )

In [7]:
'''
Derive the accuracy score for the Random Forest Decision Tree,
Derive the accuracy score for the Gradient Descent Boost Decision Tree
RETURN: print of the accuracy score
'''
def print_scores_ensemble_tree(model, X, y):
    Y_hat = model.predict(X)
    Accuracy = [1 for i in range(len(Y_hat)) if y.iloc[i] == Y_hat[i]]
    Accuracy = round(float(np.sum(Accuracy))/len(Y_hat)*100,2)
    print('%.2f%%'%Accuracy)

### Import the Data

In [8]:
# import the sensors dataset
path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/secom/'
file_data = 'secom.data'
file_labels = 'secom_labels.data'
secom_df = fetch_data(path, file_data)
labels_df = fetch_data(path, file_labels)

REMOTE FILE USED
REMOTE FILE USED


## Wrangle the Data

In [9]:
# replace the 'NaN' values
secom_df  = secom_df.fillna(0)

In [10]:
# replace all '-1' values with '0' values
# this is a more standardized manner with which to display the target attribute
labels_df = labels_df.replace(-1, 0)

In [11]:
# Change the dtype of 'labels_df[1]' to datetime
cols = [1]
labels_df[cols] = labels_df[cols].apply(pd.to_datetime)

## Merge the DataFrames

In [12]:
df = secom_df.copy(deep=True)
last_col = len(df.columns)
labels_df = labels_df.rename(columns={0:last_col, 1:last_col+1})
labels_df.columns
df = pd.concat([df, labels_df], axis=1)

## Identify 'Mean Zero' Attributes

In [13]:
# return the mean-zero attributes
drop_cols = []
is_zero = np.mean(df)==0
for item in is_zero.index:
    if is_zero[item]==True:
        drop_cols.append(item)

# drop the mean-zero attributes from the DataFrame
df = df.drop(drop_cols, axis=1)

In [14]:
#re-name the columns, after the dropped attributes
new_cols = np.arange(0,len(df.columns))
i = 0

# rename the attributes, consecutively
for item in df.columns:
    df.rename(columns={item:new_cols[i]}, inplace=True)
    i += 1

## Scale the Attributes

In [15]:
# define a list of attributes to iterate 
drop_cols = [478, 479]
attributes = df.columns
attributes = np.delete(attributes, drop_cols)
target = drop_cols[0]

In [16]:
# establish the attributes and target
X = df.copy()
X = X.drop(drop_cols, axis=1)
y = df[target]

In [17]:
# scale the remaining features
for attr in attributes:
    X[attr] = scale(X[attr])

## Over-sample the Dataset

In [18]:
# oversample the dataset to balance it, for improved prediction capability
sm_res = SMOTE(random_state=43)
X_sm, y_sm = sm_res.fit_sample(X, y)
print('Resampled dataset shape {}'.format(Counter(y_sm)))

Resampled dataset shape Counter({0: 1463, 1: 1463})




In [19]:
# recombine the arrays, after oversampling, as DataFrames
smX_df = pd.DataFrame(X_sm)
smY_df = pd.DataFrame(y_sm)

oversampled_df = pd.concat([smX_df, smY_df], axis=1)
oversampled_df.columns = np.arange(0, oversampled_df.shape[1])

# Current Work

## Split Dataset - Train / Test / Validate

In [20]:
# establish the initial dataset split
split_data = train_test_validate_split(oversampled_df)   # [train, validate, test]

# re-assign the split results
print('-----------------------------------')
print('THE OVERSAMPLED DATASET, NOW SPLIT:')
train_data = split_data[0]
val_data = split_data[1]
test_data = split_data[2]

TRAIN:    (1755, 479)
VALIDATE: (585, 479)
TEST:     (586, 479)
-----------------------------------
THE OVERSAMPLED DATASET, NOW SPLIT:


In [21]:
# establish the features and target segmentations
X_train = train_data[attributes]
Y_train = train_data[target]

X_test = test_data[attributes]
Y_test = test_data[target]

X_val = val_data[attributes]
Y_val = val_data[target]


## Decision Tree

### Entropy Model

In [22]:
# return the basic entropy decision tree
clf_entropy = DecisionTreeClassifier(criterion='entropy').fit(X_train, Y_train)
print('TEST DATA')
Y_entropy_test_pred = clf_entropy.predict(X_test)
print_scores_decision_tree(Y_test, Y_entropy_test_pred)

print('\nVALIDATION DATA')
Y_entropy_val_pred = clf_entropy.predict(X_val)
print_scores_decision_tree(Y_val, Y_entropy_val_pred)

TEST DATA
Accuracy: 87.88%

VALIDATION DATA
Accuracy: 85.64%


#### Confusion Matrix

In [23]:
# return the confusion matrix
print('CONFUSION MATRIX - TEST - ENTROPY MODEL')
print(conf_matrix(Y_test, Y_entropy_test_pred, 'FAIL', 'PASS'))

print('\nCONFUSION MATRIX - VALIDATION - ENTROPY MODEL')
print(conf_matrix(Y_val, Y_entropy_val_pred, 'FAIL', 'PASS'))

CONFUSION MATRIX - TEST - ENTROPY MODEL
           Predicted PASS  Predicted FAIL
True PASS             240              51
True FAIL              20             275

CONFUSION MATRIX - VALIDATION - ENTROPY MODEL
           Predicted PASS  Predicted FAIL
True PASS             248              59
True FAIL              25             253


#### AUC

In [24]:
# return the classifications of the binary states of the fitted models
print('       PASS-FAIL')
print('entropy: {}'.format(clf_entropy.classes_))

       PASS-FAIL
entropy: [0 1]


In [25]:
# predict the probabilities for each decision tree
Y_entropy_test_pred_proba = clf_entropy.predict_proba(X_test)
Y_entropy_val_pred_proba = clf_entropy.predict_proba(X_val)

# compute the AUC scores
print('AUC TEST SCORES - PASS')
print('entropy: {}'.format(roc_auc_score(Y_test, Y_entropy_test_pred_proba[:,1])))

print('\nAUC VALIDATION SCORES - PASS')
print('entropy: {}'.format(roc_auc_score(Y_val, Y_entropy_val_pred_proba[:,1])))

AUC TEST SCORES - PASS
entropy: 0.8784728289358728

AUC VALIDATION SCORES - PASS
entropy: 0.8589447660112952


#### Precision / Recall / F1

In [26]:
# return the precision and recall results
print('ENTROPY MODEL - TEST DATA')
print(classification_report(Y_test, Y_entropy_test_pred))

print('\nENTROPY MODEL - VALIDATION DATA')
print(classification_report(Y_val, Y_entropy_val_pred))

ENTROPY MODEL - TEST DATA
              precision    recall  f1-score   support

           0       0.92      0.82      0.87       291
           1       0.84      0.93      0.89       295

    accuracy                           0.88       586
   macro avg       0.88      0.88      0.88       586
weighted avg       0.88      0.88      0.88       586


ENTROPY MODEL - VALIDATION DATA
              precision    recall  f1-score   support

           0       0.91      0.81      0.86       307
           1       0.81      0.91      0.86       278

    accuracy                           0.86       585
   macro avg       0.86      0.86      0.86       585
weighted avg       0.86      0.86      0.86       585



### GINI Model

In [27]:
# return the basic gini decision tree
clf_gini = DecisionTreeClassifier(criterion='gini').fit(X_train, Y_train)
print('TEST DATA')
Y_gini_test_pred = clf_gini.predict(X_test)
print_scores_decision_tree(Y_test, Y_gini_test_pred)

print('\nVALIDATION DATA')
Y_gini_val_pred = clf_gini.predict(X_val)
print_scores_decision_tree(Y_val, Y_gini_val_pred)

TEST DATA
Accuracy: 86.86%

VALIDATION DATA
Accuracy: 87.69%


#### Confusion Matrix

In [28]:
# return the confusion matrix
print('CONFUSION MATRIX- TEST - GINI MODEL')
print(conf_matrix(Y_test, Y_gini_test_pred, 'FAIL', 'PASS'))

print('\nCONFUSION MATRIX - VALIDATION - GINI MODEL')
print(conf_matrix(Y_val, Y_gini_val_pred, 'FAIL', 'PASS'))

CONFUSION MATRIX- TEST - GINI MODEL
           Predicted PASS  Predicted FAIL
True PASS             240              51
True FAIL              26             269

CONFUSION MATRIX - VALIDATION - GINI MODEL
           Predicted PASS  Predicted FAIL
True PASS             254              53
True FAIL              19             259


#### AUC

In [29]:
# return the classifications of the binary states of the fitted models
print('       PASS-FAIL')
print('gini:    {}'.format(clf_gini.classes_))

       PASS-FAIL
gini:    [0 1]


In [30]:
# predict the probabilities for each decision tree
Y_gini_test_pred_proba = clf_gini.predict_proba(X_test)
Y_gini_val_pred_proba = clf_gini.predict_proba(X_val)

# compute the AUC scores
print('AUC TEST SCORES - PASS')
print('gini:    {}'.format(roc_auc_score(Y_test, Y_gini_test_pred_proba[:,1])))

print('\nAUC VALIDATION SCORES - PASS')
print('gini:    {}'.format(roc_auc_score(Y_val, Y_gini_val_pred_proba[:,1])))

AUC TEST SCORES - PASS
gini:    0.8683033374104491

AUC VALIDATION SCORES - PASS
gini:    0.8795081198884541


#### Precision / Recall / F1

In [31]:
# return the precision and recall results
print('GINI MODEL - TEST DATA')
print(classification_report(Y_test, Y_gini_test_pred))

print('\nGINI MODEL - VALIDATION DATA')
print(classification_report(Y_val, Y_gini_val_pred))

GINI MODEL - TEST DATA
              precision    recall  f1-score   support

           0       0.90      0.82      0.86       291
           1       0.84      0.91      0.87       295

    accuracy                           0.87       586
   macro avg       0.87      0.87      0.87       586
weighted avg       0.87      0.87      0.87       586


GINI MODEL - VALIDATION DATA
              precision    recall  f1-score   support

           0       0.93      0.83      0.88       307
           1       0.83      0.93      0.88       278

    accuracy                           0.88       585
   macro avg       0.88      0.88      0.88       585
weighted avg       0.88      0.88      0.88       585



## Ensemble Decision Tree

### Random Forest

In [32]:
# employ basic hyperparameters to the RANDOM FOREST model
nTrees = 100
max_depth = 5
min_node_size = 5
verbose = 0

clf = RandomForestClassifier(n_estimators=nTrees, max_depth=max_depth, random_state=0, verbose=verbose, min_samples_leaf=min_node_size)
clf.fit(X_train, Y_train)
print(clf.feature_importances_)

[9.08570790e-04 9.14249129e-04 1.44156732e-03 2.78085876e-04
 3.82219812e-04 0.00000000e+00 2.97408568e-05 8.97824968e-04
 2.32316855e-03 4.31262533e-04 2.17794233e-03 2.14073965e-03
 5.36350485e-04 2.02858743e-03 3.56287241e-04 9.59494834e-04
 9.01241363e-04 1.35045391e-03 1.73827367e-04 3.77605233e-04
 3.60885940e-03 2.98216997e-04 7.56403000e-04 3.64646171e-03
 6.47201718e-04 7.60576088e-04 1.49145760e-03 9.16698038e-03
 4.08284192e-04 1.28056877e-03 4.04563152e-03 3.09099506e-03
 8.55866509e-03 1.25859334e-03 1.49490567e-03 1.20673280e-05
 2.36371316e-03 1.17755610e-03 2.81774949e-04 1.97002087e-03
 1.35593332e-03 0.00000000e+00 1.40545890e-03 0.00000000e+00
 9.05644919e-04 3.34407790e-04 2.53573700e-03 1.97483696e-04
 0.00000000e+00 1.37045438e-03 5.27868913e-04 1.29445619e-03
 9.11358982e-04 8.56770211e-04 4.33342441e-04 1.29674141e-03
 1.42884663e-03 3.77820083e-02 1.04732896e-04 1.10730483e-04
 9.91248056e-04 1.49417349e-03 3.32664747e-03 3.11008060e-03
 6.23427637e-04 1.807710

In [33]:
# Return the Random Forest Decision Tree accuracy scores
print('RANDOM FOREST\n=============')
print('TEST DATASET')
print_scores_ensemble_tree(clf, X_test, Y_test)

print('\nVALIDATION DATASET')
print_scores_ensemble_tree(clf, X_val, Y_val)

RANDOM FOREST
TEST DATASET
93.17%

VALIDATION DATASET
92.48%


#### Confusion Matrix

In [34]:
# return the confusion matrix for the RANDOM FOREST model
Y_random_test_pred = clf.predict(X_test)
print('RANDOM FOREST - TEST - Confusion Matrix:')
print(conf_matrix(Y_test, Y_random_test_pred, 'FAIL', 'PASS'))

Y_random_val_pred = clf.predict(X_val)
print('\nRANDOM FOREST - VALIDATION - Confusion Matrix:')
print(conf_matrix(Y_val, Y_random_val_pred, 'FAIL', 'PASS'))

RANDOM FOREST - TEST - Confusion Matrix:
           Predicted PASS  Predicted FAIL
True PASS             261              30
True FAIL              10             285

RANDOM FOREST - VALIDATION - Confusion Matrix:
           Predicted PASS  Predicted FAIL
True PASS             276              31
True FAIL              13             265


#### AUC

In [35]:
# predict the probabilities for each decision tree
Y_random_test_pred_proba = clf.predict_proba(X_test)
Y_random_val_pred_proba = clf.predict_proba(X_val)

# compute the AUC scores
print('AUC TEST SCORES - PASS')
print('random forest:    {}'.format(roc_auc_score(Y_test, Y_random_test_pred_proba[:,1])))

print('\nAUC VALIDATION SCORES - PASS')
print('random forest:    {}'.format(roc_auc_score(Y_val, Y_random_val_pred_proba[:,1])))

AUC TEST SCORES - PASS
random forest:    0.9821655308987128

AUC VALIDATION SCORES - PASS
random forest:    0.9823893328334076


#### Precision / Recall / F1

In [36]:
# return the precision and recall results
print('RANDOM FOREST MODEL - TEST DATA')
print(classification_report(Y_test, Y_random_test_pred))

print('\nRANDOM FOREST MODEL - VALIDATION DATA')
print(classification_report(Y_val, Y_random_val_pred))

RANDOM FOREST MODEL - TEST DATA
              precision    recall  f1-score   support

           0       0.96      0.90      0.93       291
           1       0.90      0.97      0.93       295

    accuracy                           0.93       586
   macro avg       0.93      0.93      0.93       586
weighted avg       0.93      0.93      0.93       586


RANDOM FOREST MODEL - VALIDATION DATA
              precision    recall  f1-score   support

           0       0.96      0.90      0.93       307
           1       0.90      0.95      0.92       278

    accuracy                           0.92       585
   macro avg       0.93      0.93      0.92       585
weighted avg       0.93      0.92      0.92       585



### Gradient Descent Boost

In [37]:
# employ basic hyperparameters to the GRADIENT DESCENT model
nTrees = 100
max_depth = 5
min_node_size = 5
verbose = 0
learning_rate = 0.05

gbm_clf = GradientBoostingClassifier(n_estimators=nTrees, loss='deviance', learning_rate=learning_rate, max_depth=max_depth, \
                                    min_samples_leaf=min_node_size)
gbm_clf.fit(X_train, Y_train)
print(gbm_clf.feature_importances_)

[3.20562718e-03 2.52433421e-04 1.29690206e-03 5.72149305e-04
 1.38478147e-03 0.00000000e+00 6.06025673e-04 4.18964515e-06
 2.50948931e-04 2.00192333e-05 9.53596450e-04 1.04994525e-04
 6.96192910e-05 5.88688234e-04 4.24443171e-06 1.53490097e-03
 1.20105280e-03 1.09891829e-02 1.41843398e-03 9.11843673e-04
 3.44892583e-03 2.98382852e-04 1.33379851e-03 3.96259869e-04
 1.03335755e-03 6.59950916e-04 4.18116517e-04 1.10785118e-02
 4.50607742e-04 1.24140566e-04 1.01677491e-02 3.64356816e-03
 2.41919725e-02 2.53350478e-03 2.33020535e-04 1.25232867e-04
 2.43819930e-04 2.12880624e-04 1.05530516e-04 1.35330925e-03
 4.83597534e-04 0.00000000e+00 1.35259697e-04 1.49241114e-04
 2.05291243e-04 1.77944025e-04 5.92714493e-03 1.77850569e-03
 0.00000000e+00 2.00555593e-03 5.17546469e-07 8.39724876e-03
 0.00000000e+00 5.33637548e-03 5.38409414e-03 2.42392927e-04
 8.55453012e-04 1.49449721e-01 7.94809185e-05 2.65284219e-04
 5.68007195e-04 1.01019142e-05 4.89837634e-03 1.04559801e-02
 4.00889799e-06 8.553417

In [38]:
# Return the Random Forest Decision Tree accuracy scores
print('GRADIENT BOOST\n=============')
print('TEST DATASET')
print_scores_ensemble_tree(gbm_clf, X_test, Y_test)

print('\nVALIDATION DATASET')
print_scores_ensemble_tree(gbm_clf, X_val, Y_val)

GRADIENT BOOST
TEST DATASET
96.93%

VALIDATION DATASET
96.75%


#### Confusion Matrix

In [39]:
# return the confusion matrix for the RANDOM FOREST model
Y_gradDescent_test_pred = gbm_clf.predict(X_test)
print('GRADIENT DESCENT BOOST Confusion Matrix:')
print(conf_matrix(Y_test, Y_gradDescent_test_pred, 'FAIL', 'PASS'))

Y_gradDescent_val_pred = gbm_clf.predict(X_val)
print('\nGRADIENT DESCENT BOOST Confusion Matrix:')
print(conf_matrix(Y_val, Y_gradDescent_val_pred, 'FAIL', 'PASS'))

GRADIENT DESCENT BOOST Confusion Matrix:
           Predicted PASS  Predicted FAIL
True PASS             279              12
True FAIL               6             289

GRADIENT DESCENT BOOST Confusion Matrix:
           Predicted PASS  Predicted FAIL
True PASS             292              15
True FAIL               4             274


#### AUC

In [40]:
# predict the probabilities for each decision tree
Y_gradDescent_test_pred_proba = clf_gini.predict_proba(X_test)
Y_gradDescent_val_pred_proba = clf_gini.predict_proba(X_val)

# compute the AUC scores
print('AUC TEST SCORES - PASS')
print('gradiant boost:    {}'.format(roc_auc_score(Y_test, Y_gradDescent_test_pred_proba[:,1])))

print('\nAUC VALIDATION SCORES - PASS')
print('gradiant boost:    {}'.format(roc_auc_score(Y_val, Y_gradDescent_val_pred_proba[:,1])))

AUC TEST SCORES - PASS
gradiant boost:    0.8683033374104491

AUC VALIDATION SCORES - PASS
gradiant boost:    0.8795081198884541


#### Precision / Recall / F1

In [41]:
# return the precision and recall results
print('GRADIENT DESCENT BOOST MODEL - TEST DATA')
print(classification_report(Y_test, Y_gradDescent_test_pred))

print('\nGRADIENT DESCENT BOOST MODEL - VALIDATION DATA')
print(classification_report(Y_val, Y_gradDescent_val_pred))

GRADIENT DESCENT BOOST MODEL - TEST DATA
              precision    recall  f1-score   support

           0       0.98      0.96      0.97       291
           1       0.96      0.98      0.97       295

    accuracy                           0.97       586
   macro avg       0.97      0.97      0.97       586
weighted avg       0.97      0.97      0.97       586


GRADIENT DESCENT BOOST MODEL - VALIDATION DATA
              precision    recall  f1-score   support

           0       0.99      0.95      0.97       307
           1       0.95      0.99      0.97       278

    accuracy                           0.97       585
   macro avg       0.97      0.97      0.97       585
weighted avg       0.97      0.97      0.97       585



## Support Vector Machine

### Support Vector Classifier (SVC)

In [42]:
# define the SVC estimator model, with prediction
sv_class = svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='scale',
                   coef0=0.0, shrinking=True, probability=False, tol=0.001,
                   cache_size=200, class_weight=None, verbose=False, max_iter=-1,
                   decision_function_shape='ovr', break_ties=False, random_state=None
                  ).fit(X_train, Y_train)

sv_class_test_pred = sv_class.predict(X_test)
print("SVC - TEST DATA")
print('accuracy: {}%'.format(np.round(accuracy_score(sv_class_test_pred, Y_test)*100, 2)))

sv_class_val_pred = sv_class.predict(X_val)
print("\nSVC - VALIDATION DATA")
print('accuracy: {}%'.format(np.round(accuracy_score(sv_class_val_pred, Y_val)*100, 2)))

SVC - TEST DATA
accuracy: 98.12%

SVC - VALIDATION DATA
accuracy: 98.12%


#### Compare Kernels

In [43]:
# define the SVC estimator model using various kernels, with prediction
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
for kern in kernels:
    if kern=='poly':   # the 'degree' value only effects the 'poly' kernel
        for deg in np.arange(2,7):
            sv_class = svm.SVC(C=1.0, kernel=kern, degree=deg, gamma='scale',
                               coef0=0.0, shrinking=True, probability=False, tol=0.001,
                               cache_size=200, class_weight=None, verbose=False, max_iter=-1,
                               decision_function_shape='ovr', break_ties=False, random_state=None
                              ).fit(X_train, Y_train)
            sv_class_pred = sv_class.predict(X_test)
            print('kernel: {}\ndegree: {}'.format(kern, deg))
            print('accuracy: {}%\n'.format(np.round(accuracy_score(sv_class_pred, Y_test)*100, 2)))
    else:
        sv_class = svm.SVC(C=1.0, kernel=kern, degree=3, gamma='scale',
                           coef0=0.0, shrinking=True, probability=False, tol=0.001,
                           cache_size=200, class_weight=None, verbose=False, max_iter=-1,
                           decision_function_shape='ovr', break_ties=False, random_state=None
                           ).fit(X_train, Y_train)
        sv_class_pred = sv_class.predict(X_test)
        print('kernel: {}'.format(kern))
        print('accuracy: {}%\n'.format(np.round(accuracy_score(sv_class_pred, Y_test)*100, 2)))

kernel: linear
accuracy: 90.44%

kernel: poly
degree: 2
accuracy: 97.27%

kernel: poly
degree: 3
accuracy: 95.9%

kernel: poly
degree: 4
accuracy: 94.88%

kernel: poly
degree: 5
accuracy: 95.73%

kernel: poly
degree: 6
accuracy: 52.9%

kernel: rbf
accuracy: 98.12%

kernel: sigmoid
accuracy: 78.5%



#### Regularization Parameter

In [44]:
# define the SVC estimator model using various kernels, with prediction
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
for kern in kernels:
    for reg in np.arange(1,11):
        sv_class = svm.SVC(C=reg, kernel=kern, degree=3, gamma='scale',
                           coef0=0.0, shrinking=True, probability=False, tol=0.001,
                           cache_size=200, class_weight=None, verbose=False, max_iter=-1,
                           decision_function_shape='ovr', break_ties=False, random_state=None
                           ).fit(X_train, Y_train)
        sv_class_pred = sv_class.predict(X_test)
        acc = accuracy_score(sv_class_pred, Y_test)

        print('kernel: {}\nParameter: {}'.format(kern, reg))
        print('{}%\n'.format(np.round(acc*100, 2)))

kernel: linear
Parameter: 1
90.44%

kernel: linear
Parameter: 2
90.44%

kernel: linear
Parameter: 3
90.44%

kernel: linear
Parameter: 4
90.44%

kernel: linear
Parameter: 5
90.44%

kernel: linear
Parameter: 6
90.44%

kernel: linear
Parameter: 7
90.44%

kernel: linear
Parameter: 8
90.44%

kernel: linear
Parameter: 9
90.44%

kernel: linear
Parameter: 10
90.44%

kernel: poly
Parameter: 1
95.9%

kernel: poly
Parameter: 2
97.44%

kernel: poly
Parameter: 3
97.27%

kernel: poly
Parameter: 4
97.1%

kernel: poly
Parameter: 5
97.1%

kernel: poly
Parameter: 6
96.93%

kernel: poly
Parameter: 7
96.93%

kernel: poly
Parameter: 8
96.93%

kernel: poly
Parameter: 9
96.93%

kernel: poly
Parameter: 10
97.1%

kernel: rbf
Parameter: 1
98.12%

kernel: rbf
Parameter: 2
98.81%

kernel: rbf
Parameter: 3
98.81%

kernel: rbf
Parameter: 4
98.81%

kernel: rbf
Parameter: 5
98.81%

kernel: rbf
Parameter: 6
98.81%

kernel: rbf
Parameter: 7
98.81%

kernel: rbf
Parameter: 8
98.81%

kernel: rbf
Parameter: 9
98.81%

kerne

#### GridSearchCV

In [46]:
# establish the SVC
sv_class = svm.SVC()
# define the GridSearch parameters; based upon previous hyperparameter tuning
grid_values = {'kernel': ('rbf', 'linear'),
               'C': [0.001, 0.009, 0.01, 0.09, 1., 2., 3., 4., 5.],
               'gamma': [1., 2., 3., 4., 5., 6., 7., 8., 9.]
              }
# return the estimator, then fit it to the data
grid_clf_test_acc = GridSearchCV(sv_class, param_grid=grid_values, scoring='accuracy')
grid_clf_test_acc.fit(X_train, Y_train)
grid_clf_acc_test_pred = grid_clf_test_acc.predict(X_test)

grid_clf_val_acc = GridSearchCV(sv_class, param_grid=grid_values, scoring='accuracy')
grid_clf_val_acc.fit(X_train, Y_train)
grid_clf_acc_val_pred = grid_clf_val_acc.predict(X_val)

#### GridSearch - Best Parameters

In [47]:
print('TEST DATA')
print('=========')
# View the accuracy score
print('Best score:', grid_clf_test_acc.best_score_) 
# View the best parameters for the model found using grid search
print('Best Kernel:', grid_clf_test_acc.best_estimator_.kernel)
print('Best C:', grid_clf_test_acc.best_estimator_.C) 
print('Best Gamma:', grid_clf_test_acc.best_estimator_.gamma)


print('\nVALIDATION DATA')
print('===============')
# View the accuracy score
print('Best score:', grid_clf_val_acc.best_score_) 
# View the best parameters for the model found using grid search
print('Best Kernel:', grid_clf_val_acc.best_estimator_.kernel)
print('Best C:', grid_clf_val_acc.best_estimator_.C) 
print('Best Gamma:', grid_clf_val_acc.best_estimator_.gamma)

TEST DATA
Best score: 0.9247863247863247
Best Kernel: linear
Best C: 1.0
Best Gamma: 1.0

VALIDATION DATA
Best score: 0.9247863247863247
Best Kernel: linear
Best C: 1.0
Best Gamma: 1.0


In [53]:
# establish the 'best' SVC derived from GridSearchCV
cost = 1.0
kern = 'linear'
gam = 1.0
sv_class = svm.SVC(C=cost, kernel=kern, gamma=gam).fit(X_train, Y_train)

sv_class_test_pred = sv_class.predict(X_test)
sv_class_val_pred = sv_class.predict(X_val)

# return the 'precision', 'recall', and 'f-score' of the preceding 'best' model
print('Best Model (TEST):        {}\n{}'.format(kern, classification_report(sv_class_test_pred, Y_test)))
print('\nBest Model (VALIDATION): {}\n{}'.format(kern, classification_report(sv_class_val_pred, Y_val)))

Best Model (TEST):        linear
              precision    recall  f1-score   support

           0       0.81      1.00      0.89       235
           1       1.00      0.84      0.91       351

    accuracy                           0.90       586
   macro avg       0.90      0.92      0.90       586
weighted avg       0.92      0.90      0.91       586


Best Model (VALIDATION): linear
              precision    recall  f1-score   support

           0       0.84      1.00      0.92       259
           1       1.00      0.85      0.92       326

    accuracy                           0.92       585
   macro avg       0.92      0.93      0.92       585
weighted avg       0.93      0.92      0.92       585



#### Confusion Matrix

In [51]:
# return the confusion matrix for the RANDOM FOREST model
print('GRADIENT DESCENT BOOST Confusion Matrix (TEST DATA):')
print(conf_matrix(Y_test, sv_class_test_pred, 'FAIL', 'PASS'))

print('\nGRADIENT DESCENT BOOST Confusion Matrix (VALIDATION DATA):')
print(conf_matrix(Y_val, sv_class_val_pred, 'FAIL', 'PASS'))

GRADIENT DESCENT BOOST Confusion Matrix (TEST DATA):
           Predicted PASS  Predicted FAIL
True PASS             235              56
True FAIL               0             295

GRADIENT DESCENT BOOST Confusion Matrix (VALIDATION DATA):
           Predicted PASS  Predicted FAIL
True PASS             259              48
True FAIL               0             278


### Support Vector Regression (SVR)

Resultant from the target feature is binary, an SVR model would need to use a different, continuous, feature.

Therefore, the SVR approach is not established here.

# Results

This Milestone-02 assignment is intended to exemplify the various machine learning model types employed for typical categorical constructs of a feature rich dataset, one that is based upon predicting product PASS / FAIL outcomes established via a manufacturing process of diaper products.

The initial data wrangling and analysis was executed in the Milestone-01 assignmnet and in part is replicated at the onset of this notebook. The various modifications made to the incoming dataset are: 1) mergin the initially segregated attribute and target attribute DataFrames, 2) removing the mean-zero attributes from the DataFrame, 3) scaling of all features, and 4) for improved prediction outcomes, the DataFrame is over-sampled so as to balance the target attribute.

For the primary work of this assignment the following model types are established: 1) both entropy and gini type decision trees, 2) the random forest and gradient descent boost ensemble decision trees, and 3) a support vector classifier model. Each of these model is assessed standard metrics for deeper understanding of performance and comparison.

For our decision trees, the following outcomes were shown:  
    
   - ENTROPY DECISION TREE
         + Test Data Accuracy          87.88%
         + Validation Data Accuracy    85.64%
      
         + AUC Test Score-PASS         0.88
         + AUC Validation Score-PASS   0.86
    
   - GINI DECISION TREE
         + Test Data Accuracy          86.86%
         + Validation Data Accuracy    87.69%
      
         + AUC Test Score-PASS         0.87
         + AUC Validation Score-PASS   0.88

The results shown here clearly indicate that either the ENTROPY or the GINI decision tree model are comparable for the underlying dataset. 

For the ensemble decision tree models, the same results were accomplsihed. Mainly we now have:
   - RANDOM FOREST DECISION TREE
         + Test Data Accuracy          93.17%
         + Validation Data Accuracy    92.48%
      
         + AUC Test Score-PASS         0.98
         + AUC Validation Score-PASS   0.98
   - GRADIENT DESCENT BOOST DECISION TREE
         + Test Data Accuracy          96.93%
         + Validation Data Accuracy    96.75%
      
         + AUC Test Score-PASS         0.87
         + AUC Validation Score-PASS   0.88

From the results of the ensemble decision trees, it appears that the gradient descent boost model has a slight improvement in accuracy over the random forest decision tree, however, the AUC scores are slightly lower. For further understanding the confusion matrices can be further obseeved (shown in the preceding code).

The last model employed, a support vector classifier (svc), was initially hand-tuned via a comparison of different kernels and regularization parameters. The best results achieved were as follows:
   - TEST DATA ONLY
         + kernel: linear
         + accuracy: 90.44%
    
         + kernel: poly
         + degree: 2
         + accuracy: 97.27%
    
         + kernel: rbf
         + accuracy: 98.12%
    
         + kernel: sigmoid
         + accuracy: 78.5%

   - Comparing these results against the basic SVC model of our dataset gives:
         + TEST DATA
         + accuracy: 98.12%
    
         + VALIDATION DATA
         + accuracy: 98.12%

Further hand-tuning of the SVC model and its regularization parameter returned the following as best results:
   - TEST DATA ONLY
         + kernel:   linear
         + cost:     1
         + accuracy: 90.44%
         
         + kernel:    poly
         + cost:      2
         + accuracy:  97.44%
         
         + kernel:    rbf
         + cost:      2
         + accuracy:  98.81%
         
         + kernel:    sigmoid
         + cost:      3
         + accuracy:  79.52%

With these hand-tuned results, the GridSearchCV algorithm is applied, where only kernel `linear` and `rbf` were evaluated:
   - GridSearch CV: TEST DATA
         + Best score:   0.9247863247863247
         + Best Kernel:  linear
         + Best C:       1.0
         + Best Gamma:   1.0
   - GridSearch CV: VALIDATION DATA
         + Best score:   0.9247863247863247
         + Best Kernel:  linear
         + Best C:       1.0
         + Best Gamma:   1.0

As seen here, the GridSearch model performs equally well on both the TEST and VALIDATION dataset, given the GridSearchCV optimized parameters. As before, the confusion matrix can further provide insight to how the model perfomred across the datasets.