# Part II: Model Development

In this part, we develop three distinct pipelines for predicting backorders. We use the smart sample from Part I to fit and evaluate them.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

import os, sys
import itertools
import numpy as np
import pandas as pd

## Reload the smart sample here

In [2]:

# Reload the smart sample from a local file
# ----------------------------------
import joblib
X_sampled, y_sampled = joblib.load('sample-data-v1.pkl')


In [3]:
X_sampled.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22586 entries, 0 to 22585
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   national_inv       22586 non-null  float64
 1   lead_time          22586 non-null  float64
 2   in_transit_qty     22586 non-null  float64
 3   forecast_3_month   22586 non-null  float64
 4   forecast_6_month   22586 non-null  float64
 5   forecast_9_month   22586 non-null  float64
 6   sales_1_month      22586 non-null  float64
 7   sales_3_month      22586 non-null  float64
 8   sales_6_month      22586 non-null  float64
 9   sales_9_month      22586 non-null  float64
 10  min_bank           22586 non-null  float64
 11  potential_issue    22586 non-null  int64  
 12  pieces_past_due    22586 non-null  float64
 13  perf_6_month_avg   22586 non-null  float64
 14  perf_12_month_avg  22586 non-null  float64
 15  local_bo_qty       22586 non-null  float64
 16  deck_risk          225

## Normalize/standardize the data if required

In [4]:
from sklearn.preprocessing import MinMaxScaler
#num_columns = ['national_inv','lead_time','in_transit_qty','forecast_3_month','forecast_6_month',
#              'forecast_9_month','sales_1_month','sales_3_month','sales_6_month','sales_9_month','min_bank',
#              'pieces_past_due','perf_6_month_avg','perf_12_month_avg','local_bo_qty']
#features_scaled = X_sampled[num_columns]
#features_scaled.info()

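# Scale every column to [0, 1]. Caution: the scaler is fit on the full sample,
# before the train/test split below, so test rows influence the min/max used for scaling.
scaler = MinMaxScaler().fit(X_sampled)
X_sampled = pd.DataFrame(scaler.transform(X_sampled), index=X_sampled.index, columns=X_sampled.columns)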

X_sampled.head()

| | national_inv | lead_time | in_transit_qty | forecast_3_month | forecast_6_month | forecast_9_month | sales_1_month | sales_3_month | sales_6_month | sales_9_month | ... | potential_issue | pieces_past_due | perf_6_month_avg | perf_12_month_avg | local_bo_qty | deck_risk | oe_constraint | ppap_risk | stop_auto_buy | rev_stop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.002184 | 0.230769 | 0.0 | 9.6e-05 | 7.3e-05 | 6.8e-05 | 3.8e-05 | 4.9e-05 | 5.687299e-05 | 5.741829e-05 | ... | 1.0 | 0.0 | 0.9973 | 0.9979 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| 1 | 0.002184 | 0.038462 | 0.0 | 4e-06 | 2e-06 | 1e-06 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.99 | 0.99 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.002184 | 0.038462 | 0.0 | 1.8e-05 | 9e-06 | 6e-06 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.9978 | 0.9975 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| 3 | 0.002184 | 0.153846 | 0.0 | 1.6e-05 | 9e-06 | 7e-06 | 3e-06 | 6e-06 | 3.378593e-06 | 4.943297e-06 | ... | 1.0 | 0.0 | 0.9971 | 0.9975 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 |
| 4 | 0.002187 | 0.153846 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.630989e-07 | 3.802536e-07 | ... | 1.0 | 0.0 | 0.9984 | 0.9977 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 |


## Split the data into Train/Test

In [5]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X_sampled,y_sampled,test_size=0.2, random_state=100)
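
The cells above scale the full sample and then split it. A common alternative is to split first and fit the scaler on the training rows only, so the test set does not influence the scaling. The sketch below illustrates that ordering on hypothetical toy data; `X_demo`, `y_demo`, and the `stratify` argument are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data standing in for X_sampled / y_sampled.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y_demo = pd.Series(rng.integers(0, 2, size=200))

# Split first; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100, stratify=y_demo
)

# Fit the scaler on the training rows only, then apply it to both parts.
scaler_demo = MinMaxScaler().fit(X_tr)
X_tr_scaled = pd.DataFrame(
    scaler_demo.transform(X_tr), index=X_tr.index, columns=X_tr.columns
)
X_te_scaled = pd.DataFrame(
    scaler_demo.transform(X_te), index=X_te.index, columns=X_te.columns
)

print(X_tr_scaled.describe().loc[["min", "max"]])
```

With this ordering, scaled test values can fall slightly outside [0, 1], which is expected because the test rows were not seen when the scaler was fit.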

## Developing Pipeline

In this section, we design an operationalized machine learning pipeline, which includes:

* Anomaly detection
* Dimensionality Reduction
* Train a model

We are free to use any model covered previously, or a new one.

* It is difficult to include an anomaly detection method in an sklearn pipeline without writing custom code. For simplicity, we fit the anomaly detector outside the pipeline and build the workflow in two steps.
    * Step I: fit an outlier detector on the training set.
    * Step II: define a pipeline using a feature selection method and a classification method. Then cross-validate this pipeline using the training data without outliers.
        * Note: if your smart sample is somewhat imbalanced, consider changing the `scoring` argument of GridSearchCV (see the [doc](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)).

* Once we fit the pipeline, we identify the best model and give an unbiased evaluation using the test set created above. For this evaluation, report the confusion matrix, precision, recall, F1-score, accuracy, and any other measures you like.
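
The two-step workflow described above can be sketched as follows. This is a minimal illustration on hypothetical toy data, not the pipelines used later in this notebook: the detector, the `f_classif` score function, the `LogisticRegression` classifier, the grid values, and `scoring="f1"` are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the smart sample.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Step I: fit an outlier detector on the training set and keep the inliers.
detector = IsolationForest(contamination=0.05, random_state=0).fit(X_tr)
keep = detector.predict(X_tr) == 1  # predict returns 1 for inliers, -1 for outliers
X_tr_clean, y_tr_clean = X_tr[keep], y_tr[keep]

# Step II: feature selection + classifier, cross-validated on the cleaned data.
demo_pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])
demo_grid = GridSearchCV(
    demo_pipe,
    param_grid={"select__k": [3, 5, 10], "clf__C": [0.1, 1, 10]},
    scoring="f1",  # replaces the default accuracy score
    cv=5,
)
demo_grid.fit(X_tr_clean, y_tr_clean)

print(demo_grid.best_params_)
# The untouched test set gives the final evaluation (F1 here, because of scoring).
print(demo_grid.score(X_te, y_te))
```

Only the training rows are filtered by the detector; the test set is left as-is so the final score reflects data the workflow has not seen.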

(Optional) If you are interested in writing custom code to add an outlier detection method to the sklearn pipeline, follow this discussion [thread](https://stackoverflow.com/questions/52346725/can-i-add-outlier-detection-and-removal-to-scikit-learn-pipeline).


**Note:** <span style='background:yellow'>We will be using Grid Search to find the optimal parameters of the pipelines.</span>

You can add more notebook cells or import any Python modules as needed.

In [6]:
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from scipy.stats import uniform

from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import RandomForestClassifier

from sklearn.linear_model import LogisticRegression


from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

### Your 1st pipeline 
  * Anomaly detection
  * Dimensionality reduction
  * Model training/validation
  
Add cells as needed. 

In [7]:
# Add anomaly detection code  (Question #E201)
# ----------------------------------

envelope = EllipticEnvelope(support_fraction=1, contamination=0.2).fit(X_train)
outliers = envelope.predict(X_train)==-1
X_train_clean = X_train[~outliers]
y_train_clean = y_train[~outliers]

In [23]:
# Add code for feature selection and classification pipeline with grid search  (Question #E202)
# ----------------------------------
param_grid = {'PCA__n_components': [5,10,15,20],
             'SVC__C': [1,5,10,15,20,25],
             }



pipe = Pipeline([
    ('PCA', PCA()),
    ('SVC', SVC(kernel='rbf'))
])

grid_model = GridSearchCV(pipe, param_grid = param_grid, n_jobs = 5, cv=5)


In [24]:
grid_model.fit(X_train_clean,y_train_clean)
print(grid_model.best_estimator_)

Pipeline(steps=[('PCA', PCA(n_components=10)), ('SVC', SVC(C=15))])


In [25]:
# Give an unbiased evaluation  (Question #E203)
# ----------------------------------
predicted_y = grid_model.predict(X_test)
print(classification_report(y_test, predicted_y))

              precision    recall  f1-score   support

           0       0.56      0.57      0.57      2275
           1       0.56      0.55      0.55      2243

    accuracy                           0.56      4518
   macro avg       0.56      0.56      0.56      4518
weighted avg       0.56      0.56      0.56      4518



In [26]:
pd.DataFrame(confusion_matrix(y_test,predicted_y))

| | 0 | 1 |
|---|---|---|
| 0 | 1302 | 973 |
| 1 | 1013 | 1230 |


#### <center>Record the optimal hyperparameters and performance resulting from this pipeline.</center>

## <span style="background: yellow;">Commit your code!</span> 

### Your 2nd pipeline
  * Anomaly detection
  * Dimensionality reduction
  * Model training/validation

In [12]:
# Add anomaly detection code  (Question #E205)
# ----------------------------------
iso = IsolationForest(contamination=0.05).fit(X_train,y_train)
iso_outliers = iso.predict(X_train)==-1
X_train_clean2 = X_train[~iso_outliers]
y_train_clean2 = y_train[~iso_outliers]


In [68]:
# Add code for feature selection and classification pipeline with grid search  (Question #E206)
# ----------------------------------

param_grid2 = {'sKb__k': [5,10,15,20],
              'DT__max_depth': [2,3,4,5,6,7,8,9,10,15,20]}
pipe2 = Pipeline([
    ('sKb',SelectKBest(score_func=chi2)),
    ('DT',DecisionTreeClassifier(criterion='gini'))
])

grid_model2 = GridSearchCV(pipe2, param_grid = param_grid2, n_jobs = 5, cv=5)

In [69]:
grid_model2.fit(X_train_clean2,y_train_clean2)
print(grid_model2.best_estimator_)

Pipeline(steps=[('sKb',
                 SelectKBest(k=20,
                             score_func=<function chi2 at 0x7efe70257840>)),
                ('DT', DecisionTreeClassifier(max_depth=9))])


In [70]:
# Give an unbiased evaluation  (Question #E207)
# ----------------------------------
predicted_y2 = grid_model2.predict(X_test)
print(classification_report(y_test, predicted_y2))

              precision    recall  f1-score   support

           0       0.85      0.88      0.87      2275
           1       0.87      0.85      0.86      2243

    accuracy                           0.86      4518
   macro avg       0.86      0.86      0.86      4518
weighted avg       0.86      0.86      0.86      4518



In [71]:
pd.DataFrame(confusion_matrix(y_test,predicted_y2))

| | 0 | 1 |
|---|---|---|
| 0 | 1999 | 276 |
| 1 | 346 | 1897 |


#### <center>Record the optimal hyperparameters and performance resulting from this pipeline.</center>

## <span style="background: yellow;">Commit your code!</span> 

### Your 3rd pipeline
  * Anomaly detection
  * Dimensionality reduction
  * Model training/validation

In [17]:
# Add anomaly detection code  (Question #E209)
# ----------------------------------
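# fit_predict both fits the detector and labels the training rows (-1 = outlier),
# so a separate .fit() call is not needed
lof = LocalOutlierFactor(novelty=False)
lof_outliers = lof.fit_predict(X_train)==-1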
X_train_clean3 = X_train[~lof_outliers]
y_train_clean3 = y_train[~lof_outliers]

In [76]:
# Add code for feature selection and classification pipeline with grid search  (Question #E210)
# ----------------------------------
param_grid3 = {'RF__n_estimators': [1,2,3,4,5,6,7,8,9,10],
               'RF__max_depth': [10,20,30,40,50,60,70,80,100]
              }
pipe3 = Pipeline([
    ('VT',VarianceThreshold()),
    ('RF',RandomForestClassifier(criterion='entropy'))
])

grid_model3 = GridSearchCV(pipe3, param_grid = param_grid3, n_jobs = 5, cv=5)

In [77]:
grid_model3.fit(X_train_clean3,y_train_clean3)
print(grid_model3.best_estimator_)

Pipeline(steps=[('VT', VarianceThreshold()),
                ('RF',
                 RandomForestClassifier(criterion='entropy', max_depth=70,
                                        n_estimators=9))])


In [78]:
# Give an unbiased evaluation  (Question #E211)
# ----------------------------------
predicted_y3 = grid_model3.predict(X_test)
print(classification_report(y_test, predicted_y3))

              precision    recall  f1-score   support

           0       0.88      0.91      0.90      2275
           1       0.91      0.88      0.89      2243

    accuracy                           0.89      4518
   macro avg       0.89      0.89      0.89      4518
weighted avg       0.89      0.89      0.89      4518



In [79]:
pd.DataFrame(confusion_matrix(y_test,predicted_y3))

| | 0 | 1 |
|---|---|---|
| 0 | 2075 | 200 |
| 1 | 280 | 1963 |


#### <center>Record the optimal hyperparameters and performance resulting from this pipeline.</center>

## Compare these three pipelines and discuss your findings
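
As a starting point for the comparison, the headline metrics can be recomputed from the three confusion matrices reported above. The sketch below hard-codes those matrices and assumes the usual `confusion_matrix` layout (rows are true classes, columns are predicted classes); the dictionary keys are descriptive labels introduced here for illustration.

```python
import numpy as np

# Confusion matrices copied from the three evaluations above
# (rows: true class, columns: predicted class).
matrices = {
    "EllipticEnvelope + PCA + SVC": np.array([[1302, 973], [1013, 1230]]),
    "IsolationForest + SelectKBest + DecisionTree": np.array([[1999, 276], [346, 1897]]),
    "LOF + VarianceThreshold + RandomForest": np.array([[2075, 200], [280, 1963]]),
}

summary = {}
for name, cm in matrices.items():
    tn, fp, fn, tp = cm.ravel()
    accuracy = (tp + tn) / cm.sum()
    precision = tp / (tp + fp)  # of predicted backorders, how many were real
    recall = tp / (tp + fn)     # of real backorders, how many were found
    summary[name] = (accuracy, precision, recall)
    print(f"{name}: accuracy={accuracy:.3f}, "
          f"precision={precision:.3f}, recall={recall:.3f}")
```

On these numbers the third pipeline scores highest on accuracy (about 0.89), the second follows (about 0.86), and the first is only slightly better than chance on this roughly balanced test set (about 0.56).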

## <span style="background: yellow;">Commit your code!</span> 

### Pickle the required pipeline/models for Part III.

In [80]:
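import joblib
# pipe3 is the unfitted pipeline definition (GridSearchCV fits clones of it);
# grid_model3 holds the fitted search, including best_estimator_
joblib.dump([X_sampled,y_sampled,pipe3,grid_model3], 'model-v1.pkl')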




['model-v1.pkl']
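
A minimal sketch of how a list saved this way could be read back, for example in Part III. It uses hypothetical toy data and a temporary file rather than `model-v1.pkl`, so every name below (`X_demo`, `demo_pipe`, `demo_grid`, `demo-model.pkl`) is an assumption for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data and a small pipeline shaped like the third one.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo_pipe = Pipeline([
    ("VT", VarianceThreshold()),
    ("RF", RandomForestClassifier(n_estimators=5, random_state=0)),
])
demo_grid = GridSearchCV(demo_pipe, {"RF__max_depth": [3, 5]}, cv=3)
demo_grid.fit(X_demo, y_demo)

# Save and reload the same four-item list layout used above.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo-model.pkl")
    joblib.dump([X_demo, y_demo, demo_pipe, demo_grid], path)
    X_loaded, y_loaded, pipe_loaded, grid_loaded = joblib.load(path)

# The fitted search exposes the refit best pipeline; the plain pipeline
# object was never fitted itself, because GridSearchCV fits clones.
best_pipeline = grid_loaded.best_estimator_
print(best_pipeline.predict(X_loaded[:5]))
```

The items come back in the order they were dumped, so the loading code has to unpack them in that same order.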

You should have made a few commits so far of this project.  
**Definitely make a commit of the notebook now!**  
Comment should be: `Final Project, Checkpoint - Pipelines done`


# Save your notebook!
## Then `File > Close and Halt`