# Part II: Model Development

In this part, we develop three distinct pipelines for predicting backorders. We use the smart sample from Part I to fit and evaluate these pipelines.

In [16]:
%matplotlib inline
import matplotlib.pyplot as plt
import joblib
import os, sys
import itertools
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score

## Reload the smart sample here

In [103]:

# Reload your smart sample from local file
# ----------------------------------
pipe_samples = joblib.load('pipe_samples')
pipe_samples = pipe_samples.dropna()

train_set_downsampled = joblib.load('train-set-downsized')
train_set_downsampled = train_set_downsampled.dropna()

## Normalize/standardize the data if required; otherwise ignore. You can perform this step inside the pipeline (if required). 
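
If standardization turns out to be needed, one way to do it inside the pipeline is to make `StandardScaler` the first step, so that the scaling statistics are learned from the training folds only. The sketch below is a minimal illustration under assumptions: it uses synthetic data from `make_classification` as a stand-in for the smart sample, and the step names (`scale`, `pca`, `clf`) are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the smart sample (assumption, not the project data)
X_demo, y_demo = make_classification(n_samples=300, n_features=21, random_state=0)

scaled_pipe = Pipeline([
    ("scale", StandardScaler()),                 # refitted on each training fold
    ("pca", PCA(n_components=5)),                # PCA is sensitive to feature scale
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(scaled_pipe, X_demo, y_demo, cv=5)
print("Mean CV accuracy:", scores.mean().round(3))
```

Because the scaler lives inside the pipeline, the same object can be passed to `GridSearchCV` without leaking validation-fold statistics into training.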

## Split the data into Train/Test

In [18]:
X = pipe_samples.iloc[:,:-1]
y = pipe_samples.went_on_backorder

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
print("Training shapes (X, y): ", X_train.shape, y_train.shape)
print("Testing shapes (X, y): ", X_test.shape, y_test.shape)


Training shapes (X, y):  (2295, 21) (2295,)
Testing shapes (X, y):  (574, 21) (574,)
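
The split above is random and unstratified, so the class proportions and the downstream scores can shift from run to run. A minimal sketch of a reproducible, stratified split follows. The synthetic arrays and the `random_state=42` value are assumptions for illustration, not settings used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels as a stand-in (assumption, not the project data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array([1] * 200 + [0] * 800)

# stratify keeps the class ratio the same in both parts; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=42
)
print("Positive rate (train, test):", y_tr.mean(), y_te.mean())
```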


In [105]:
X2 = train_set_downsampled.iloc[:,:-1]
y2 = train_set_downsampled.went_on_backorder

X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.20)
print("Training shapes (X, y): ", X_train2.shape, y_train2.shape)
print("Testing shapes (X, y): ", X_test2.shape, y_test2.shape)

Training shapes (X, y):  (17276, 21) (17276,)
Testing shapes (X, y):  (4320, 21) (4320,)


## Developing Pipeline

In this section, we design an operationalized machine learning pipeline, which includes:

* Anomaly detection
* Dimensionality Reduction
* Train a classification model


We are free to use any of the models that we learned in the past or we can use new models. Here is a pool of methods: 

### Pool of Anomaly Detection Methods (Discussed in M4)
1. IsolationForest
2. EllipticEnvelope
3. LocalOutlierFactor
4. OneClassSVM
5. SGDOneClassSVM

### Pool of Feature Selection Methods (Discussed in M3)

1. VarianceThreshold
2. SelectKBest with any scoring method (e.g., chi2, f_classif, mutual_info_classif)
3. SelectPercentile
4. SelectFpr, SelectFdr, or SelectFwe
5. GenericUnivariateSelect
6. PCA
7. Factor Analysis
8. RFE
9. SelectFromModel


### Classification Methods (Discussed in M1-M2)
1. Decision Tree
2. Random Forest
3. Logistic Regression
4. Naive Bayes
5. Linear SVC
6. SVC with kernels
7. KNeighborsClassifier
8. GradientBoostingClassifier
9. XGBClassifier
10. LGBM Classifier



It is difficult to fit an anomaly detection method into an sklearn pipeline without writing custom code. For simplicity, we avoid fitting an anomaly detection method within a pipeline and instead build the workflow in two steps.
* Step I: fit an outlier detector on the training set and remove the rows it flags.
* Step II: define a pipeline using a feature selection method and a classification method. Then cross-validate this pipeline using the training data without outliers.
* Note: if your smart sample is somewhat imbalanced, you might want to change the scoring method in GridSearchCV (see the [doc](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)).
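
The two steps and the scoring note can be sketched as follows. This is an illustration under assumptions rather than the notebook's actual configuration: the data are synthetic, the `contamination` value and parameter grid are arbitrary, and `scoring="f1"` is just one of the alternatives to the default accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic, mildly imbalanced stand-in data (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.7, 0.3], random_state=0
)

# Step I: fit an outlier detector on the training data and keep the inliers
detector = IsolationForest(contamination=0.05, random_state=0).fit(X_demo)
inlier_mask = detector.predict(X_demo) == 1
X_in, y_in = X_demo[inlier_mask], y_demo[inlier_mask]

# Step II: cross-validate a reduction + classifier pipeline on the inliers,
# scoring with F1 instead of the default accuracy
demo_pipe = Pipeline([("pca", PCA()), ("clf", LogisticRegression(max_iter=1000))])
demo_grid = GridSearchCV(
    demo_pipe,
    param_grid={"pca__n_components": [3, 5], "clf__C": [0.1, 1, 10]},
    scoring="f1",
    cv=5,
)
demo_grid.fit(X_in, y_in)
print(demo_grid.best_params_, round(demo_grid.best_score_, 3))
```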


Once we fit the pipeline with grid search, we identify the best model and give an unbiased evaluation using the test set created earlier in this part. For this evaluation we report the confusion matrix, precision, recall, F1-score, accuracy, and optionally other measures.

**Optional: Those who are interested in writing custom code to add an outlier detection method to the sklearn pipeline can follow this discussion [thread](https://stackoverflow.com/questions/52346725/can-i-add-outlier-detection-and-removal-to-scikit-learn-pipeline).**
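
One possible shape for such custom code is a small meta-estimator that removes flagged rows during `fit` and then trains the classifier on what remains. The class below, `OutlierFilteredClassifier`, is a hypothetical name invented for this sketch. It is not part of scikit-learn and not necessarily the approach taken in the linked thread.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.tree import DecisionTreeClassifier


class OutlierFilteredClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: drop flagged outliers in fit(), then train the classifier."""

    def __init__(self, detector, classifier):
        self.detector = detector
        self.classifier = classifier

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.detector_ = clone(self.detector).fit(X)
        keep = self.detector_.predict(X) == 1          # +1 marks inliers
        self.n_removed_ = int((~keep).sum())
        self.classifier_ = clone(self.classifier).fit(X[keep], y[keep])
        self.classes_ = self.classifier_.classes_
        return self

    def predict(self, X):
        # Rows are only removed during training; every test row gets a prediction
        return self.classifier_.predict(np.asarray(X))


# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
wrapped = OutlierFilteredClassifier(
    IsolationForest(contamination=0.1, random_state=0),
    DecisionTreeClassifier(max_depth=3, random_state=0),
).fit(X_demo, y_demo)
print("Rows removed during fit:", wrapped.n_removed_)
print("Training accuracy:", wrapped.score(X_demo, y_demo))
```

Keeping the filtering inside `fit` means that, under cross-validation, the detector is refitted on each training fold and never sees the validation rows.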


**Note:** <span style='background:yellow'>We will be using Grid Search to find the optimal parameters of the pipelines.</span>

You can add more notebook cells or import any Python modules as needed.

In [54]:
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest

from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, SelectPercentile, chi2, f_classif, mutual_info_classif

from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


### Your 1st pipeline 
  * Anomaly detection
  * Dimensionality reduction
  * Model training/validation
  
Add cells as needed. 

In [106]:
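# Add anomaly detection code  (Question #E201)
# ----------------------------------
# Fit an isolation forest on the training features (the labels are ignored by the detector)
iso_forest = IsolationForest(contamination=0.1).fit(X_train2, y_train2)
# predict() returns -1 for outliers and +1 for inliers
iso_outliers = iso_forest.predict(X_train2) == -1
print(f"Num of outliers = {np.sum(iso_outliers)}")
# Keep only the inliers for model training
X_trainiso = X_train2[~iso_outliers]
y_trainiso = y_train2[~iso_outliers]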


Num of outliers = 1727
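
The count above is roughly 10% of the 17,276 training rows, which is what `contamination=0.1` asks for: the parameter sets the score threshold so that about that fraction of the training data is flagged. A minimal sketch on synthetic data (an assumption, not the project data) illustrates the relationship.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 5))   # synthetic stand-in data (assumption)

for contamination in (0.01, 0.05, 0.10):
    forest = IsolationForest(contamination=contamination, random_state=0).fit(X_demo)
    # Fraction of training rows labelled -1 (outlier)
    flagged = (forest.predict(X_demo) == -1).mean()
    print(f"contamination={contamination:.2f} -> flagged fraction={flagged:.3f}")
```

Since the flagged fraction is chosen rather than discovered, `contamination` is worth treating as a tuning decision in its own right.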


In [107]:
# Add code for the feature selection and classification pipeline with grid search  (Question #E202)
# ----------------------------------
from sklearn.model_selection import GridSearchCV

# PCA for dimensionality reduction followed by logistic regression
clf_pipe = Pipeline([('PCA', PCA()), ('LogisticRegression', LogisticRegression(max_iter=100000))])

# Hyperparameters to search: number of components and inverse regularization strength
param_grid = {'PCA__n_components': [3, 5, 10, 15, 20],
              'LogisticRegression__C': [.1, 1, 10, 100, 1000]}

# 10-fold cross-validated grid search on the training data with outliers removed
model_grid = GridSearchCV(clf_pipe, param_grid=param_grid, cv=10, n_jobs=5)
model_grid.fit(X_trainiso, y_trainiso)

GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('PCA', PCA()),
                                       ('LogisticRegression',
                                        LogisticRegression(max_iter=100000))]),
             n_jobs=5,
             param_grid={'LogisticRegression__C': [0.1, 1, 10, 100, 1000],
                         'PCA__n_components': [3, 5, 10, 15, 20]})

In [108]:
# Give an unbiased evaluation  (Question #E203)
# ----------------------------------
from sklearn.metrics import classification_report, confusion_matrix, recall_score
predicted_y = model_grid.predict(X_test)
print(pd.DataFrame(confusion_matrix(y_test, predicted_y)))
print("Accuracy:", np.round(accuracy_score(y_test, predicted_y), 2))
print("Precision:", np.round(precision_score(y_test, predicted_y, average='weighted'), 2))
print("Recall:", np.round(recall_score(y_test, predicted_y, average='weighted'), 2))
print("F1-Score:", np.round(f1_score(y_test, predicted_y, average='weighted'), 2))


     0    1
0  219   73
1   28  254
Accuracy: 0.82
Precision: 0.83
Recall: 0.82
F1-Score: 0.82


#### <center>Record the optimal hyperparameters and performance resulting from this pipeline.</center>

In [130]:
# Detail Hyperparameters and Results below  (Question #E204)
# ---------------------------------------------
df = pd.DataFrame(model_grid.cv_results_)


best = df[df['rank_test_score'] == 1]
print(best.params)

print('optimal parameters = C = 100, n_components = 20, performance = mean_test_score = .814394')
best[['params','mean_test_score']]

19    {'LogisticRegression__C': 100, 'PCA__n_compone...
Name: params, dtype: object
optimal parameters = C = 100, n_components = 20, performance = mean_test_score = .814394


|    | params | mean_test_score |
|----|--------|-----------------|
| 19 | {'LogisticRegression__C': 100, 'PCA__n_compone... | 0.814394 |
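
The optimal settings can also be read directly from the fitted search object instead of being typed into a string by hand. A minimal sketch, assuming synthetic data and a reduced grid: `best_params_` and `best_score_` correspond to the rank-1 row of `cv_results_`.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

demo_grid = GridSearchCV(
    Pipeline([("PCA", PCA()), ("LogisticRegression", LogisticRegression(max_iter=1000))]),
    param_grid={"PCA__n_components": [3, 5], "LogisticRegression__C": [0.1, 1]},
    cv=5,
).fit(X_demo, y_demo)

# Both attributes are filled in by fit() and stay in sync with cv_results_
print("Optimal parameters:", demo_grid.best_params_)
print("Mean CV score:", round(demo_grid.best_score_, 6))
```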


## <span style="background: yellow;">Commit your code!</span> 

### Your 2nd pipeline
  * Anomaly detection
  * Dimensionality reduction
  * Model training/validation

In [111]:
# Add anomaly detection code  (Question #E205)
# ----------------------------------
from sklearn.svm import OneClassSVM
svm = OneClassSVM(kernel='rbf', nu = .3).fit(X_train2, y_train2)
svm_outliers = svm.predict(X_train2)==-1
print(f"Num of outliers = {np.sum(svm_outliers)}")
X_svm = X_train2[~svm_outliers]
y_svm = y_train2[~svm_outliers]


Num of outliers = 5198



In [112]:
# Add code for the feature selection and classification pipeline with grid search  (Question #E206)
# ----------------------------------
from sklearn.feature_selection import GenericUnivariateSelect
from sklearn.model_selection import GridSearchCV

# Univariate feature selection (ANOVA F-test) followed by a decision tree
clf_pipe2 = Pipeline([('GenericUnivariateSelect', GenericUnivariateSelect(f_classif, mode='k_best')),
                      ('DecisionTreeClassifier', DecisionTreeClassifier(criterion='gini'))])

# Hyperparameters to search: selection mode, tree depth, and split criterion
param_grid2 = {'GenericUnivariateSelect__mode': ['percentile', 'k_best', 'fpr', 'fdr', 'fwe'],
               'DecisionTreeClassifier__max_depth': [5, 7, 8, 12, 15],
               'DecisionTreeClassifier__criterion': ['entropy', 'gini']}

# 10-fold cross-validated grid search on the training data with outliers removed
model_grid2 = GridSearchCV(clf_pipe2, param_grid=param_grid2, cv=10, n_jobs=5)
model_grid2.fit(X_svm, y_svm)

        nan 0.85668102 0.85626711 0.85593557        nan        nan
 0.85577131 0.85618522 0.85452877        nan        nan 0.86123585
 0.86181587 0.86065611        nan        nan 0.85833781 0.86007609
 0.85734471        nan        nan 0.85403023 0.85411301 0.85444427
        nan        nan 0.85858403 0.85808721 0.85841868        nan
        nan 0.85800497 0.85825332 0.85783921        nan        nan
 0.85875035 0.85808851 0.85833583        nan        nan 0.8564324
 0.85601849 0.85850276]


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('GenericUnivariateSelect',
                                        GenericUnivariateSelect(mode='k_best')),
                                       ('DecisionTreeClassifier',
                                        DecisionTreeClassifier())]),
             n_jobs=5,
             param_grid={'DecisionTreeClassifier__criterion': ['entropy',
                                                               'gini'],
                         'DecisionTreeClassifier__max_depth': [5, 7, 8, 12, 15],
                         'GenericUnivariateSelect__mode': ['percentile',
                                                           'k_best', 'fpr',
                                                           'fdr', 'fwe']})
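
In `GenericUnivariateSelect`, the `param` argument means something different for each `mode`: a percentage of features for `percentile`, a feature count for `k_best`, and a significance level for `fpr`, `fdr`, and `fwe`. Candidates whose mode and `param` do not fit together fail to fit and appear as `nan` scores in the search results. A minimal sketch of one way to avoid this is to pass `param_grid` as a list of dictionaries so that each mode is paired with a suitable `param`. The data are synthetic and the specific values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import GenericUnivariateSelect, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with clearly informative features (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=5, n_redundant=2,
    n_clusters_per_class=1, class_sep=1.5, random_state=0,
)

demo_pipe = Pipeline([
    ("select", GenericUnivariateSelect(f_classif)),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Each mode is paired with a `param` value of the kind it expects
paired_grid = [
    {"select__mode": ["percentile"], "select__param": [25, 50]},       # percent of features
    {"select__mode": ["k_best"], "select__param": [5, 8]},             # number of features
    {"select__mode": ["fpr", "fdr", "fwe"], "select__param": [0.05]},  # alpha level
]

demo_grid = GridSearchCV(demo_pipe, param_grid=paired_grid, cv=5).fit(X_demo, y_demo)
scores = demo_grid.cv_results_["mean_test_score"]
print("Candidates:", len(scores), "| non-finite scores:", int(np.isnan(scores).sum()))
```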

In [113]:
# Give an unbiased evaluation  (Question #E207)
# ----------------------------------
from sklearn.metrics import recall_score
y_pred = model_grid2.predict(X_test)
print(pd.DataFrame(confusion_matrix(y_test, y_pred)))
print("Accuracy:", np.round(accuracy_score(y_test, y_pred), 2))
print("Precision:", np.round(precision_score(y_test, y_pred, average='weighted'), 2))
print("Recall:", np.round(recall_score(y_test, y_pred, average='weighted'), 2))
print("F1-Score:", np.round(f1_score(y_test, y_pred, average='weighted'), 2))


     0    1
0  239   53
1   17  265
Accuracy: 0.88
Precision: 0.88
Recall: 0.88
F1-Score: 0.88


#### <center>Record the optimal hyperparameters and performance resulting from this pipeline.</center>

In [128]:
# Detail Hyperparameters and Results below  (Question #E208)
# ---------------------------------------------
df2 = pd.DataFrame(model_grid2.cv_results_)

# Keep the top-ranked candidate from the grid search
best2 = df2[df2['rank_test_score'] == 1]
print(best2.params)

print('optimal params = entropy, max_depth = 12, mode = fdr, performance mean_test_score = .861816')
best2[['params','mean_test_score']]


18    {'DecisionTreeClassifier__criterion': 'entropy...
Name: params, dtype: object
optimal params = entropy, max_depth = 12, mode = fdr, performance mean_test_score = .861816


|    | params | mean_test_score |
|----|--------|-----------------|
| 18 | {'DecisionTreeClassifier__criterion': 'entropy... | 0.861816 |


## <span style="background: yellow;">Commit your code!</span> 

### Your 3rd pipeline
  * Anomaly detection
  * Dimensionality reduction
  * Model training/validation

In [115]:
# Add anomaly detection code  (Question #E209)
# ----------------------------------
lof_labels = LocalOutlierFactor(n_neighbors=10).fit_predict(X_train2, y_train2)
inliers = lof_labels == 1
X_clean = X_train2[inliers]
y_clean = y_train2[inliers]

In [116]:
# Add code for the feature selection and classification pipeline with grid search  (Question #E210)
# ----------------------------------
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Factor analysis for dimensionality reduction followed by a random forest
clf_pipe3 = Pipeline([('FactorAnalysis', FactorAnalysis(random_state=0)),
                      ('RandomForestClassifier', RandomForestClassifier())])

# Hyperparameters to search: number of factors and number of trees
param_grid3 = {'FactorAnalysis__n_components': [5, 10, 15, 20],
               'RandomForestClassifier__n_estimators': [5, 10, 15, 20]}

# 10-fold cross-validated grid search on the training data with outliers removed
model_grid3 = GridSearchCV(clf_pipe3, param_grid=param_grid3, cv=10, n_jobs=5)
model_grid3.fit(X_clean, y_clean)

GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('FactorAnalysis', FactorAnalysis()),
                                       ('RandomForestClassifier',
                                        RandomForestClassifier())]),
             n_jobs=5,
             param_grid={'FactorAnalysis__n_components': [5, 10, 15, 20],
                         'RandomForestClassifier__n_estimators': [5, 10, 15,
                                                                  20]})

In [117]:
# Give an unbiased evaluation  (Question #E211)
# ----------------------------------
from sklearn.metrics import recall_score
y_pred3 = model_grid3.predict(X_test)
print(pd.DataFrame(confusion_matrix(y_test, y_pred3)))
print("Accuracy:", np.round(accuracy_score(y_test, y_pred3), 2))
print("Precision:", np.round(precision_score(y_test, y_pred3, average='weighted'), 2))
print("Recall:", np.round(recall_score(y_test, y_pred3, average='weighted'), 2))
print("F1-Score:", np.round(f1_score(y_test, y_pred3, average='weighted'), 2))

     0    1
0  278   14
1   10  272
Accuracy: 0.96
Precision: 0.96
Recall: 0.96
F1-Score: 0.96
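
The counts in the confusion matrix are enough to recompute the summary metrics, which is a handy sanity check. The sketch below rebuilds label vectors from the matrix printed above and recomputes the weighted scores. The variable names are made up for the example. With `average='weighted'`, recall works out to the same value as accuracy, because each class's recall is weighted by its share of the true labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Confusion matrix reported above (rows = true class, columns = predicted class)
cm = np.array([[278, 14],
               [10, 272]])

# Rebuild label vectors that reproduce these four counts
y_true_demo = np.repeat([0, 0, 1, 1], cm.ravel())
y_pred_demo = np.repeat([0, 1, 0, 1], cm.ravel())

acc = accuracy_score(y_true_demo, y_pred_demo)
prec = precision_score(y_true_demo, y_pred_demo, average="weighted")
rec = recall_score(y_true_demo, y_pred_demo, average="weighted")
f1 = f1_score(y_true_demo, y_pred_demo, average="weighted")
print(np.round([acc, prec, rec, f1], 2))
```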


#### <center>Record the optimal hyperparameters and performance resulting from this pipeline.</center>

In [123]:
# Detail Hyperparameters and Results below  (Question #E212)
# ---------------------------------------------
df3 = pd.DataFrame(model_grid3.cv_results_)

# Keep the top-ranked candidate from the grid search
best3 = df3[df3['rank_test_score'] == 1]
print(best3.params)

print('optimal parameters = param_FactorAnalysis__n_components = 15 param_RandomForestClassifier__n_estimators = 20, mean test score = .889135')
best3[['params','mean_test_score']]



11    {'FactorAnalysis__n_components': 15, 'RandomFo...
Name: params, dtype: object
optimal parameters = param_FactorAnalysis__n_components = 15 param_RandomForestClassifier__n_estimators = 20, mean test score = .889135


|    | params | mean_test_score |
|----|--------|-----------------|
| 11 | {'FactorAnalysis__n_components': 15, 'RandomFo... | 0.889135 |


## Compare these three pipelines and discuss your findings
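
One way to set up the comparison is to gather the figures recorded in the cells above into a single table. The sketch below does only that: the numbers are copied from the earlier outputs (best mean cross-validation score and test accuracy for each pipeline), and the column names are chosen for this example.

```python
import pandas as pd

# Figures copied from the outputs recorded earlier in this notebook
comparison = pd.DataFrame(
    {
        "anomaly_detector": ["IsolationForest", "OneClassSVM", "LocalOutlierFactor"],
        "feature_step": ["PCA", "GenericUnivariateSelect", "FactorAnalysis"],
        "classifier": ["LogisticRegression", "DecisionTreeClassifier", "RandomForestClassifier"],
        "mean_cv_score": [0.814394, 0.861816, 0.889135],
        "test_accuracy": [0.82, 0.88, 0.96],
    },
    index=["Pipeline 1", "Pipeline 2", "Pipeline 3"],
)

# Rank by cross-validated score, best first
comparison = comparison.sort_values("mean_cv_score", ascending=False)
print(comparison)
```

On these numbers the third pipeline ranks first on both measures, which is consistent with it being the one pickled for Part III.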

## <span style="background: yellow;">Commit your code!</span> 

### Pickle the required pipeline/models for Part III.

In [134]:

joblib.dump(model_grid3.best_estimator_,'bestpiperevised')

['bestpiperevised']
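
Before moving on, it can be worth confirming that a dumped estimator reloads and predicts as expected. The round trip below is a minimal sketch using a synthetic model and a temporary directory. The file name `demo_model` is hypothetical and unrelated to the file written above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in model (assumption, not the pipeline saved above)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
fitted = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model")   # hypothetical file name
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# The restored estimator should reproduce the original predictions
same = np.array_equal(fitted.predict(X_demo), restored.predict(X_demo))
print("Predictions identical after reload:", same)
```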

You should have made a few commits of this project so far.  
**Definitely make a commit of the notebook now!**  
Comment should be: `Final Project, Checkpoint - Pipelines done`


# Save your notebook!
## Then `File > Close and Halt`