# Part II: Model Development

In this part, we develop three unique pipelines for predicting backorder. We use the smart sample from Part I to fit and evaluate these pipelines. 

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

import os, sys
import itertools
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn import preprocessing

import joblib


## Reload the smart sample here

In [2]:

# Reload your smart sampling from local file 
# ----------------------------------

df = pd.read_csv('sample-data-v1.csv')

# Getting rid of the first column which was saved as the row in the csv
df = df.iloc[:,1:]

In [3]:
df.head()

Unnamed: 0,national_inv,forecast_3_month,forecast_6_month,forecast_9_month,sales_1_month,sales_3_month,sales_6_month,sales_9_month,potential_issue,pieces_past_due,perf_6_month_avg,perf_12_month_avg,deck_risk,oe_constraint,ppap_risk,stop_auto_buy,rev_stop,went_on_backorder
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.7,0.78,0,0,0,1,0,0
1,80.0,644.0,1091.0,1569.0,210.0,616.0,921.0,1338.0,0,0.0,1.0,0.99,0,0,0,1,0,0
2,98.0,0.0,0.0,0.0,1.0,3.0,7.0,12.0,0,0.0,0.79,0.78,0,0,0,1,0,0
3,20.0,0.0,0.0,0.0,1.0,1.0,4.0,12.0,0,0.0,0.47,0.39,0,0,0,1,0,0
4,202.0,224.0,504.0,770.0,88.0,272.0,585.0,842.0,0,0.0,0.33,0.32,0,0,0,1,0,0


In [4]:
df.tail()

Unnamed: 0,national_inv,forecast_3_month,forecast_6_month,forecast_9_month,sales_1_month,sales_3_month,sales_6_month,sales_9_month,potential_issue,pieces_past_due,perf_6_month_avg,perf_12_month_avg,deck_risk,oe_constraint,ppap_risk,stop_auto_buy,rev_stop,went_on_backorder
22581,0.0,3454.0,4388.0,4388.0,1.0,1.0,1.0,1.0,0,0.0,0.83,0.86,1,0,0,1,0,1
22582,5.0,3.0,3.0,9.0,1.0,7.0,10.0,13.0,0,0.0,0.34,0.53,0,0,1,1,0,1
22583,-1.0,73.0,114.0,172.0,10.0,55.0,109.0,171.0,0,7.0,0.37,0.54,0,0,0,1,0,1
22584,6.0,61.0,61.0,85.0,9.0,24.0,75.0,136.0,0,0.0,0.44,0.64,0,0,1,1,0,1
22585,0.0,4.0,5.0,7.0,1.0,4.0,4.0,5.0,0,0.0,0.78,0.78,0,0,0,1,0,1


## Normalize/standardize the data if required

In [5]:
# Standardization of the dataset

scaler = preprocessing.StandardScaler().fit(df)

df_s = scaler.transform(df)

# Normalization of the dataset

df_sn = preprocessing.normalize(df_s, axis = 0, norm='l2')

df_sn = pd.DataFrame(df_sn)

In [6]:
df_sn.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,-0.000457,-0.000571,-0.000555,-0.000541,-0.000431,-0.000454,-0.00046,-0.000359,-0.000332,-0.000497,0.001655,0.001615,-0.00329,-0.000147,-0.002663,0.001353,-9.9e-05,-0.006654
1,-0.00028,0.001869,0.001764,0.001755,0.001922,0.001986,0.001513,0.000993,-0.000332,-0.000497,0.00174,0.001676,-0.00329,-0.000147,-0.002663,0.001353,-9.9e-05,-0.006654
2,-0.00024,-0.000571,-0.000555,-0.000541,-0.00042,-0.000443,-0.000445,-0.000347,-0.000332,-0.000497,0.00168,0.001615,-0.00329,-0.000147,-0.002663,0.001353,-9.9e-05,-0.006654
3,-0.000414,-0.000571,-0.000555,-0.000541,-0.00042,-0.00045,-0.000451,-0.000347,-0.000332,-0.000497,0.00159,0.001501,-0.00329,-0.000147,-0.002663,0.001353,-9.9e-05,-0.006654
4,-8e-06,0.000278,0.000516,0.000586,0.000555,0.000623,0.000793,0.000492,-0.000332,-0.000497,0.00155,0.00148,-0.00329,-0.000147,-0.002663,0.001353,-9.9e-05,-0.006654


## Split the data into Train/Test

In [7]:
# Not using the standardized data here but may come back to it.

X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,:-1], df.iloc[:,-1:],
                                                    test_size = 0.2)

## Developing Pipeline

In this section, we design an operationalized machine learning pipeline, which includes:

* Anomaly detection
* Dimensionality Reduction
* Train a model

We are free to use any of the models that we learned in the past or use new models. 

* It is difficult to fit an anomaly detection method in the sklearn pipeline without writing custom codes. For simplicity, we avoid fitting an anomaly detection method within a pipeline. So we can create the workflow in two steps. 
    * Step I: fit an outlier with the training set
    * Step II: define a pipeline using a feature selection and a classification method. Then cross-validate this pipeline using the training data without outliers. 
        * Note: if your smart sample is somewhat imbalanced, you might want to change the scoring method in GridSearchCV (see the [doc](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)).

* Once we fit the pipeline, we identify the best model and give an unbiased evaluation using the test set that we created in Part II. For unbiased evaluation we report confusion matrix, precision, recall, f1-score, accuracy, and other measures if you like. 

(Optional) Those who are interested in writing custom codes for adding an outlier detection method into the sklearn pipeline, please follow this discussion [thread](https://stackoverflow.com/questions/52346725/can-i-add-outlier-detection-and-removal-to-scikit-learn-pipeline). 


**Note:** <span style='background:yellow'>We will be using Grid Search to find the optimal parameters of the pipelines.</span>

You can add more notebook cells or import any Python modules as needed.

In [8]:
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor

from sklearn.ensemble import IsolationForest

from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif, f_regression

from sklearn.model_selection import GridSearchCV

from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import chi2

from sklearn.svm import SVC


### Your 1st pipeline 
  * Anomaly detection
  * Dimensionality reduction
  * Model training/validation
  
Add cells as needed. 

In [9]:
# Envelop function code

def elliptic_envelope_session(X, y):
    # Fit envelope
    envelope = EllipticEnvelope(support_fraction=1, contamination=0.2).fit(X)

    # Create an boolean indexing array to pick up outliers
    outliers = envelope.predict(X)==-1

    # Re-slice X,y into a cleaned dataset with outliers excluded
    X_clean = X[~outliers]
    y_clean = y[~outliers]
    return X_clean, y_clean

In [10]:
# Add anomaly detection code  (Question #E201)
# ----------------------------------

# Elliptic Envelope for pipeline 1

X_train_env, y_train_env = elliptic_envelope_session(X_train, y_train)


In [11]:
# Add codes for feature selection and classification pipeline with grid search  (Question #E202)
# ----------------------------------

# Using PCA. Set components to 5/10 to start as we only have 16 predictor variables

param_grid = {'PCA__n_components': [5, 10],
              'SVC__C': [1e3, 5e3],        
              'SVC__kernel': ['rbf']}

# Define the pipeline (P102)
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('PCA', PCA()),
    ('SVC', SVC(kernel='rbf'))
])

model_grid_1 = GridSearchCV(pipe, param_grid=param_grid, cv=10, n_jobs=5)
model_grid_1.fit(X_train_env, y_train_env)

  return f(*args, **kwargs)


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('scale', StandardScaler()),
                                       ('PCA', PCA()), ('SVC', SVC())]),
             n_jobs=5,
             param_grid={'PCA__n_components': [5, 10],
                         'SVC__C': [1000.0, 5000.0], 'SVC__kernel': ['rbf']})

In [12]:
print(model_grid_1.best_estimator_)

Pipeline(steps=[('scale', StandardScaler()), ('PCA', PCA(n_components=10)),
                ('SVC', SVC(C=5000.0))])


In [13]:
y_pred = model_grid_1.predict(X_test)

In [14]:
# Given an unbiased evaluation  (Question #E203)
# ----------------------------------

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.77      0.88      0.82      2263
           1       0.86      0.73      0.79      2255

    accuracy                           0.81      4518
   macro avg       0.81      0.80      0.80      4518
weighted avg       0.81      0.81      0.80      4518



#### <center>Record the optimal hyperparameters and performance resulting from this pipeline.</center>

## <span style="background: yellow;">Commit your code!</span> 

### Your 2nd pipeline
  * Anomaly detection
  * Dimensionality reduction
  * Model training/validation

In [15]:
# K-means function code

def kmeans_session(X, y):
    # run k-means clustering
    km_clusters = KMeans(n_clusters=3, algorithm="full").fit_predict(X, y)
    
    # create cluster distribution, this time they are in tuples so we can sort easily
    dist_clusters = ((np.sum(km_clusters==z), z) for z in np.unique(km_clusters))
    
    # sort clusters descendingly by number of data entries in cluster
    dist_clusters = sorted(dist_clusters, reverse = True)
    
    # find out the cluster with max number of data entries
    max_cluster = dist_clusters[0][1]

    # select data in max_cluster as inliers
    inliers = km_clusters == max_cluster
    
    return X[inliers], y[inliers]

In [16]:
# Add anomaly detection code  (Question #E205)
# ----------------------------------

X_train_km, y_train_km = elliptic_envelope_session(X_train, y_train)

In [17]:
# Add codes for feature selection and classification pipeline with grid search  (Question #E206)
# ----------------------------------

# Use FactorAnalysis and GaussianNB

param_grid = {'FactorAnalysis__n_components': [5, 10]}

# Define the pipeline (P102)
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('FactorAnalysis', FactorAnalysis()),
    ('GaussianNB', GaussianNB())
])

model_grid_2 = GridSearchCV(pipe, param_grid=param_grid, cv=10, n_jobs=5)
model_grid_2.fit(X_train_km, y_train_km)

  return f(*args, **kwargs)


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('scale', StandardScaler()),
                                       ('FactorAnalysis', FactorAnalysis()),
                                       ('GaussianNB', GaussianNB())]),
             n_jobs=5, param_grid={'FactorAnalysis__n_components': [5, 10]})

In [18]:
print(model_grid_2.best_estimator_)

Pipeline(steps=[('scale', StandardScaler()),
                ('FactorAnalysis', FactorAnalysis(n_components=10)),
                ('GaussianNB', GaussianNB())])


In [19]:
y_pred2 = model_grid_2.predict(X_test)

In [20]:
# Given an unbiased evaluation  (Question #E207)
# ----------------------------------

print(classification_report(y_test, y_pred2))

              precision    recall  f1-score   support

           0       0.70      0.61      0.65      2263
           1       0.65      0.74      0.69      2255

    accuracy                           0.67      4518
   macro avg       0.68      0.67      0.67      4518
weighted avg       0.68      0.67      0.67      4518



#### <center>Record the optimal hyperparameters and performance resulting from this pipeline.</center>

## <span style="background: yellow;">Commit your code!</span> 

### Your 3rd pipeline
  * Anomaly detection
  * Dimensionality reduction
  * Model training/validation

In [21]:
# Local Outlier function

def local_outlier_factor_session(X, y):
    lof_labels = LocalOutlierFactor(n_neighbors=10).fit_predict(X, y)
    inliers = lof_labels == 1 # select inliers
    return X[inliers], y[inliers]

In [22]:
# Add anomaly detection code  (Question #E209)
# ----------------------------------

X_train_lo, y_train_lo = local_outlier_factor_session(X_train, y_train)

In [23]:
# Add codes for feature selection and classification pipeline with grid search  (Question #E210)
# ----------------------------------

param_grid = {'LR__max_iter': [500, 1000]}

# Define the pipeline (P102)
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('SKB', SelectKBest(f_regression, k=5)),
    ('LR', LogisticRegression())
])

model_grid_3 = GridSearchCV(pipe, param_grid=param_grid, cv=10, n_jobs=5)
model_grid_3.fit(X_train_lo, y_train_lo)

  return f(*args, **kwargs)
  corr /= X_norms
  cond2 = cond0 & (x <= _a)
  return f(*args, **kwargs)


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('scale', StandardScaler()),
                                       ('SKB',
                                        SelectKBest(k=5,
                                                    score_func=<function f_regression at 0x7f326baed9d8>)),
                                       ('LR', LogisticRegression())]),
             n_jobs=5, param_grid={'LR__max_iter': [500, 1000]})

In [24]:
print(model_grid_3.best_estimator_)

Pipeline(steps=[('scale', StandardScaler()),
                ('SKB',
                 SelectKBest(k=5,
                             score_func=<function f_regression at 0x7f326baed9d8>)),
                ('LR', LogisticRegression(max_iter=500))])


In [25]:
y_pred3 = model_grid_3.predict(X_test)

In [26]:
# Given an unbiased evaluation  (Question #E211)
# ----------------------------------

print(classification_report(y_test, y_pred3))

              precision    recall  f1-score   support

           0       0.72      0.46      0.56      2263
           1       0.60      0.82      0.69      2255

    accuracy                           0.64      4518
   macro avg       0.66      0.64      0.63      4518
weighted avg       0.66      0.64      0.63      4518



#### <center>Record the optimal hyperparameters and performance resulting from this pipeline.</center>

## Compare these three pipelines and discuss your findings

## <span style="background: yellow;">Commit your code!</span> 

### Pickle the required pipeline/models for Part III.

In [27]:

joblib.dump(model_grid_1, 'model_one.pkl')


['model_one.pkl']

In [28]:
# Saving the best estimator in its own file

joblib.dump(model_grid_1.best_estimator_, 'model_one_best.pkl')

['model_one_best.pkl']

You should have made a few commits so far of this project.  
**Definitely make a commit of the notebook now!**  
Comment should be: `Final Project, Checkpoint - Pipelines done`


# Save your notebook!
## Then `File > Close and Halt`