# Part II: Model Development

In this part, we develop three distinct pipelines for predicting backorders. We use the smart sample from Part I to fit and evaluate these pipelines.

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt

import os, sys
import itertools
import numpy as np
import pandas as pd

## Reload the smart sample here

In [3]:

# Reload your smart sample from a local file
# ----------------------------------

import joblib
# Load the sampled data from the file
dataset_new = joblib.load('dataset_new.pkl')

In [4]:
dataset_new.head()

         national_inv  lead_time  in_transit_qty  forecast_3_month  \
197               0.0        2.0             0.0              54.0   
606               0.0        2.0             0.0               2.0   
846               1.0       12.0             0.0              18.0   
882              -2.0        8.0             0.0              17.0   
898               0.0        8.0             0.0              30.0   

         forecast_6_month  forecast_9_month  sales_1_month  sales_3_month  \
197                  54.0              54.0         

## Normalize/standardize the data if required; otherwise ignore. You can perform this step inside the pipeline (if required). 

## Split the data into Train/Test

In [5]:
X = dataset_new.iloc[:, :-1]  # all columns except the last, which is assumed to be the target
y = dataset_new.went_on_backorder

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

## Developing Pipeline

In this section, we design an operationalized machine learning pipeline, which includes:

* Anomaly detection
* Dimensionality Reduction
* Train a classification model


We are free to use any of the models that we learned in the past or we can use new models. Here is a pool of methods: 

### Pool of Anomaly Detection Methods (Discussed in M4)
1. IsolationForest
2. EllipticEnvelope
3. LocalOutlierFactor
4. OneClassSVM
5. SGDOneClassSVM

### Pool of Feature Selection Methods (Discussed in M3)

1. VarianceThreshold
2. SelectKBest with any scoring function (e.g., chi2, f_classif, mutual_info_classif)
3. SelectPercentile
4. SelectFpr, SelectFdr, or SelectFwe
5. GenericUnivariateSelect
6. PCA
7. Factor Analysis
8. RFE
9. SelectFromModel


### Classification Methods (Discussed in M1-M2)
1. Decision Tree
2. Random Forest
3. Logistic Regression
4. Naive Bayes
5. Linear SVC
6. SVC with kernels
7. KNeighborsClassifier
8. GradientBoostingClassifier
9. XGBClassifier
10. LGBM Classifier



It is difficult to fit an anomaly detection method inside an sklearn pipeline without writing custom code, since a standard pipeline step transforms features but does not remove rows. For simplicity, we fit the anomaly detector outside the pipeline and build the workflow in two steps.
* Step I: fit an outlier detector on the training set and remove the rows it flags.
* Step II: define a pipeline using a feature selection method and a classification method. Then cross-validate this pipeline using the training data without outliers.
* Note: if your smart sample is somewhat imbalanced, you might want to change the scoring method in GridSearchCV (see the [doc](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)).
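
The two steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual run: it uses synthetic data from `make_classification` as a stand-in for the smart sample, and the names `X_tr_clean`, `demo_pipeline`, and `demo_search`, the small grid, and `scoring="f1"` are all assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the smart sample (assumption: not the real backorder data)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Step I: fit the detector on the training set only and keep the inliers
detector = IsolationForest(contamination=0.05, random_state=0).fit(X_tr)
inlier_mask = detector.predict(X_tr) == 1  # predict() gives 1 for inliers, -1 for outliers
X_tr_clean, y_tr_clean = X_tr[inlier_mask], y_tr[inlier_mask]

# Step II: cross-validate a reduction + classification pipeline on the cleaned data
demo_pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=500)),
])
demo_grid = {"pca__n_components": [3, 5], "clf__C": [0.1, 1.0]}

# scoring="f1" is one option for a somewhat imbalanced sample; the default is accuracy
demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=3, scoring="f1")
demo_search.fit(X_tr_clean, y_tr_clean)

print(demo_search.best_params_, round(demo_search.best_score_, 3))
```

The test split `X_te, y_te` is deliberately left untouched here, so it remains available for the final evaluation.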


Once we fit the pipeline with grid search, we identify the best model and give an unbiased evaluation using the test set created earlier in this part. For the unbiased evaluation, we report the confusion matrix, precision, recall, F1-score, accuracy, and any other measures you find useful.
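
As a rough sketch of that evaluation step, the snippet below scores a fitted model on held-out data only. The data is synthetic and the plain `LogisticRegression` stands in for whichever best estimator the grid search returns; both are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the real backorder sample)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Stand-in for the best estimator found by grid search
clf = LogisticRegression(max_iter=500).fit(X_tr, y_tr)

# Evaluate on the test split only, which the model never saw during fitting
y_hat = clf.predict(X_te)
cm = confusion_matrix(y_te, y_hat)  # rows: true class, columns: predicted class
test_accuracy = accuracy_score(y_te, y_hat)

print(cm)
print(classification_report(y_te, y_hat))  # precision, recall, F1 per class
print("accuracy:", test_accuracy)
```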

**Optional: Those who are interested in writing custom code to add an outlier detection method to the sklearn pipeline can follow this discussion [thread](https://stackoverflow.com/questions/52346725/can-i-add-outlier-detection-and-removal-to-scikit-learn-pipeline).**


**Note:** <span style='background:yellow'>We will be using Grid Search to find the optimal parameters of the pipelines.</span>

You can add more notebook cells or import any Python modules as needed.

In [7]:
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest

from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from xgboost import XGBClassifier
from sklearn.svm import SVC


### Your 1st pipeline 
  * Anomaly detection
  * Dimensionality reduction
  * Model training/validation
  
Add cells as needed. 

##  -Anomaly detection: Isolation forest

## -Dimensionality reduction: PCA


## -Model training/validation : Logistic regression

In [7]:
# Add anomaly detection code  (Question #E201)
# ----------------------------------
# Using IsolationForest
from sklearn.ensemble import IsolationForest

# Fit IsolationForest on the training features only (it is unsupervised, so y is not needed)
iso_forest = IsolationForest(contamination=0.05).fit(X_train)

# predict() returns -1 for outliers and 1 for inliers
iso_out_pred = iso_forest.predict(X_train) == -1
print(f"Num of outliers = {np.sum(iso_out_pred)}")
# Keep only the inlier rows of the training set
iso_X_pred = X_train[~iso_out_pred]
iso_y_pred = y_train[~iso_out_pred]

Num of outliers = 847


In [8]:
# Add codes for feature selection and classification pipeline with grid search  (Question #E202)
# ----------------------------------
pipeline1 = Pipeline([
    ('scaler', MinMaxScaler()),
    ('PCA', PCA(n_components=20)),
    ('LogisticRegression', LogisticRegression(max_iter=200))
])


In [9]:
# define the parameter grid for PCA and logistic regression
# Note: not every solver supports every penalty, and PCA cannot use more components
# than there are features; invalid combinations fail to fit and are scored as nan.
param_grid = {
    'PCA__n_components': [20, 50, 100, 150, 200],
    'LogisticRegression__solver': ['lbfgs', 'liblinear', 'newton-cg', 'sag', 'saga'],
    'LogisticRegression__penalty': ['l1', 'l2', 'elasticnet', 'none'],
    'LogisticRegression__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}
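
A full cross-product grid like the one above contains many solver and penalty pairs that cannot be fit. One way to avoid them is to pass `GridSearchCV` a list of dictionaries, so each penalty is paired only with solvers that support it. The sketch below is an assumption-laden illustration: the `n_components` and `l1_ratio` values are made up for the example, and the unpenalized option is left out because its spelling differs between scikit-learn versions.

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
# Assumed to be no larger than the number of features in the data
n_components = [5, 10, 20]

valid_param_grid = [
    {   # l2 is supported by all five solvers
        "PCA__n_components": n_components,
        "LogisticRegression__penalty": ["l2"],
        "LogisticRegression__solver": ["lbfgs", "liblinear", "newton-cg", "sag", "saga"],
        "LogisticRegression__C": C_values,
    },
    {   # l1 is supported by liblinear and saga only
        "PCA__n_components": n_components,
        "LogisticRegression__penalty": ["l1"],
        "LogisticRegression__solver": ["liblinear", "saga"],
        "LogisticRegression__C": C_values,
    },
    {   # elasticnet is supported by saga only and needs an l1_ratio
        "PCA__n_components": n_components,
        "LogisticRegression__penalty": ["elasticnet"],
        "LogisticRegression__solver": ["saga"],
        "LogisticRegression__l1_ratio": [0.5],  # illustrative value
        "LogisticRegression__C": C_values,
    },
]

# Number of candidate settings that would be cross-validated
n_candidates = len(ParameterGrid(valid_param_grid))
print(n_candidates)  # 105 + 42 + 21 = 168
```

For comparison, the single-dictionary grid above expands to 5 x 5 x 4 x 7 = 700 candidates, many of which cannot be fit. The list form can be passed as `param_grid=valid_param_grid` in the same `GridSearchCV` call.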


In [10]:
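# perform grid search over the parameter grid
# Note: this run fits on the full training set; to cross-validate without the
# IsolationForest outliers (Step II), pass iso_X_pred and iso_y_pred instead.
model_grid_search = GridSearchCV(pipeline1, param_grid=param_grid, cv=5, n_jobs=2)
model_grid_search.fit(X_train, y_train)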

        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan 0.50002952        nan        nan        nan
        nan 0.53362013        nan        nan        nan        nan
 0.53367916        nan        nan        nan        nan 0.53362013
        nan        nan        nan        nan 0.53362013        nan
        nan        nan        nan 0.53362013        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan        nan        nan        nan
        nan        nan        nan 0.59796902        nan        nan
        nan        nan        nan        nan        nan        nan
        nan 0.71344254        nan        nan        nan        nan
 0.57583147        nan        nan        nan        nan 0.5733

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scaler', MinMaxScaler()),
                                       ('PCA', PCA(n_components=20)),
                                       ('LogisticRegression',
                                        LogisticRegression(max_iter=200))]),
             n_jobs=2,
             param_grid={'LogisticRegression__C': [0.001, 0.01, 0.1, 1, 10, 100,
                                                   1000],
                         'LogisticRegression__penalty': ['l1', 'l2',
                                                         'elasticnet', 'none'],
                         'LogisticRegression__solver': ['lbfgs', 'liblinear',
                                                        'newton-cg', 'sag',
                                                        'saga'],
                         'PCA__n_components': [20, 50, 100, 150, 200]})

In [11]:
# print the best parameters and score from the grid search
print('Best Parameters: ', model_grid_search.best_params_)
print('Best Score: ', model_grid_search.best_score_)

Best Parameters:  {'LogisticRegression__C': 0.001, 'LogisticRegression__penalty': 'none', 'LogisticRegression__solver': 'newton-cg', 'PCA__n_components': 20}
Best Score:  0.7134425362060437


### making predictions on training data

In [12]:
# using best pipeline to make predictions on the training data 
best_pipeline = model_grid_search.best_estimator_
y_pred = best_pipeline.predict(X_train)


In [13]:
accuracy = accuracy_score(y_train, y_pred)
print('Pipeline Accuracy: ', accuracy)

Pipeline Accuracy:  0.7129700690713737


In [14]:
# Confusion matrix on the training data (rows: true class, columns: predicted class).
# The unbiased evaluation uses the test set in the cells below.
from sklearn.metrics import classification_report, confusion_matrix
pd.DataFrame(confusion_matrix(y_train, y_pred))

      0     1
0  4890  3577
1  1285  7187


In [15]:
#classification report
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.79      0.58      0.67      8467
           1       0.67      0.85      0.75      8472

    accuracy                           0.71     16939
   macro avg       0.73      0.71      0.71     16939
weighted avg       0.73      0.71      0.71     16939



### making predictions on testing data 

In [16]:
y_pred_t=best_pipeline.predict(X_test)


In [17]:
accuracy = accuracy_score(y_test, y_pred_t)
print('Pipeline Accuracy: ', accuracy)

Pipeline Accuracy:  0.7154241190012396


In [18]:
from sklearn.metrics import classification_report, confusion_matrix
pd.DataFrame(confusion_matrix(y_test, y_pred_t))

      0     1
0  1678  1148
1   459  2362


In [19]:
#classification report
print(classification_report(y_test, y_pred_t))

              precision    recall  f1-score   support

           0       0.79      0.59      0.68      2826
           1       0.67      0.84      0.75      2821

    accuracy                           0.72      5647
   macro avg       0.73      0.72      0.71      5647
weighted avg       0.73      0.72      0.71      5647



#### <center>Record the optimal hyperparameters and performance resulting from this pipeline.</center>

## <span style="background: yellow;">Commit your code!</span> 

### Your 2nd pipeline
  * Anomaly detection
  * Dimensionality reduction
  * Model training/validation

##  -Anomaly detection: Elliptic envelope

## -Dimensionality reduction: FA

## -Model training/validation : SVC

In [20]:
# Add anomaly detection code  (Question #E205)
# ----------------------------------
from sklearn.covariance import EllipticEnvelope
model_ee = EllipticEnvelope(contamination=0.05)  # adjust the contamination level as needed
# Note: in this run the detector is fit on all of X (training and test rows)
model_ee.fit(X)


EllipticEnvelope(contamination=0.05)

In [21]:
# Predict the anomalies
y_pred_out = model_ee.predict(X)
anomalies = X[y_pred_out == -1]

In [22]:
print('Number of anomalies:', len(anomalies))
print('Anomalies:', anomalies)

Number of anomalies: 1130
Anomalies:          national_inv  lead_time  in_transit_qty  forecast_3_month  \
2256            356.0        8.0           279.0            7419.0   
5808              0.0        8.0             0.0               0.0   
6602             17.0        8.0             0.0              75.0   
8579             31.0        8.0             1.0            1265.0   
8755            -13.0        8.0            53.0             599.0   
...               ...        ...             ...               ...   
1113189         837.0        2.0          1401.0            1754.0   
1319984          34.0        8.0             0.0               0.0   
1024431        1533.0        2.0             0.0               0.0   
1097205       12668.0        4.0          6010.0           15000.0   
51001           309.0       52.0             0.0               0.0   

         forecast_6_month  forecast_9_month  sales_1_month  sales_3_month  \
2256               7419.0            9405.0  

In [23]:
# Add codes for feature selection and classification pipeline with grid search  (Question #E206)
# ----------------------------------
from sklearn.feature_selection import SelectKBest
from sklearn.svm import SVC
from scipy.stats import uniform
from sklearn.decomposition import FactorAnalysis

In [24]:
# Define the pipeline
pipeline2 = Pipeline([
    ('scale', MinMaxScaler()),
    ('FA', FactorAnalysis(n_components=20)),
    ('SVC', SVC(kernel='rbf'))

])

In [25]:
param_grid = {
    'FA__n_components': [5, 10, 15],
    'SVC__C': [1e3, 5e3],
    'SVC__kernel': ['rbf']
}

In [26]:
# Perform grid search
model_grid_search2 = GridSearchCV(pipeline2, param_grid=param_grid, cv=5,n_jobs=2)
model_grid_search2.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scale', MinMaxScaler()),
                                       ('FA', FactorAnalysis(n_components=20)),
                                       ('SVC', SVC())]),
             n_jobs=2,
             param_grid={'FA__n_components': [5, 10, 15],
                         'SVC__C': [1000.0, 5000.0], 'SVC__kernel': ['rbf']})

In [27]:
# print the best parameters and score from the grid search
print('Best Parameters: ', model_grid_search2.best_params_)
print('Best Score: ', model_grid_search2.best_score_)

Best Parameters:  {'FA__n_components': 15, 'SVC__C': 5000.0, 'SVC__kernel': 'rbf'}
Best Score:  0.7433140778216872


### making predictions on training data

In [28]:
# using best pipeline to make predictions on the training data
best_pipeline = model_grid_search2.best_estimator_
y_pred2 = best_pipeline.predict(X_train)

In [29]:
accuracy = accuracy_score(y_train, y_pred2)
print('Pipeline Accuracy: ', accuracy)

Pipeline Accuracy:  0.7507527008678199


In [30]:
# Evaluation  (Question #E207)
# ----------------------------------
# Confusion matrix on the training data (rows: true class, columns: predicted class).
# The unbiased evaluation uses the test set in the cells below.
from sklearn.metrics import classification_report, confusion_matrix
pd.DataFrame(confusion_matrix(y_train, y_pred2))

      0     1
0  6580  1887
1  2335  6137


In [31]:
#classification report
print(classification_report(y_train, y_pred2))

              precision    recall  f1-score   support

           0       0.74      0.78      0.76      8467
           1       0.76      0.72      0.74      8472

    accuracy                           0.75     16939
   macro avg       0.75      0.75      0.75     16939
weighted avg       0.75      0.75      0.75     16939



### making predictions on testing data

In [32]:
y_pred2_t = best_pipeline.predict(X_test)

In [33]:
accuracy = accuracy_score(y_test, y_pred2_t)
print('Pipeline Accuracy: ', accuracy)

Pipeline Accuracy:  0.737559766247565


In [34]:
#confusion matrix
from sklearn.metrics import classification_report, confusion_matrix
pd.DataFrame(confusion_matrix(y_test, y_pred2_t))

      0     1
0  2164   662
1   820  2001


In [35]:
#classification report
print(classification_report(y_test, y_pred2_t))

              precision    recall  f1-score   support

           0       0.73      0.77      0.74      2826
           1       0.75      0.71      0.73      2821

    accuracy                           0.74      5647
   macro avg       0.74      0.74      0.74      5647
weighted avg       0.74      0.74      0.74      5647



#### <center>Record the optimal hyperparameters and performance resulting from this pipeline.</center>

## <span style="background: yellow;">Commit your code!</span> 

### Your 3rd pipeline
  * Anomaly detection
  * Dimensionality reduction
  * Model training/validation

##  -Anomaly detection: LocalOutlierFactor

## -Dimensionality reduction: RFE


## -Model training/validation : Random forest Algorithm

In [8]:
# Add anomaly detection code  (Question #E209)
# ----------------------------------
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)




In [9]:
# Fit the model and make predictions
y_pred_out = lof.fit_predict(X)

In [10]:
# Get the number of outliers detected
n_outliers = np.sum(y_pred_out == -1)


In [11]:
# Print the number of outliers detected
print("Number of outliers detected: ", n_outliers)

Number of outliers detected:  2256
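
To act on the detected outliers rather than only count them, the flagged rows can be dropped from the training split before the grid search. The following is a minimal sketch on synthetic data; `X_tr_lof` and `y_tr_lof` are hypothetical names, and fitting on the training split only is a choice made here for illustration. In its default mode `LocalOutlierFactor` labels the data it was fit on through `fit_predict`, which is why the filtering is applied to the same rows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor

# Synthetic stand-in data (assumption: not the real backorder sample)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Fit on the training split and label each training row: 1 = inlier, -1 = outlier
lof_demo = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
labels = lof_demo.fit_predict(X_tr)

# Keep only the inlier rows, applying the same mask to features and target
keep = labels == 1
X_tr_lof, y_tr_lof = X_tr[keep], y_tr[keep]

print(f"removed {np.sum(~keep)} of {len(X_tr)} training rows")
```

The filtered arrays would then be passed to `GridSearchCV.fit` in place of the full training split.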


In [12]:
# Add codes for feature selection and classification pipeline with grid search  (Question #E210)
# ----------------------------------
#pipelining with RFE as feature selector and random forest for classification
pipeline3 = Pipeline([
    ('scale', MinMaxScaler()),
    ('rfe', RFE(estimator=RandomForestClassifier())),
    ('rf', RandomForestClassifier())

])
    

In [13]:
#Define the hyperparameters to search over
param_grid = {
    'rfe__n_features_to_select': [5, 10, 15],
    'rf__n_estimators': [100, 200, 500],
    'rf__max_depth': [5, 10, 20]
}

In [14]:
# training by performing grid search
model_grid_search3 = GridSearchCV(pipeline3, param_grid=param_grid, cv=10,n_jobs=5)
model_grid_search3.fit(X_train, y_train)

GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('scale', MinMaxScaler()),
                                       ('rfe',
                                        RFE(estimator=RandomForestClassifier())),
                                       ('rf', RandomForestClassifier())]),
             n_jobs=5,
             param_grid={'rf__max_depth': [5, 10, 20],
                         'rf__n_estimators': [100, 200, 500],
                         'rfe__n_features_to_select': [5, 10, 15]})

In [15]:
# print the best parameters and score from the grid search
print('Best Parameters: ', model_grid_search3.best_params_)
print('Best Score: ', model_grid_search3.best_score_)

Best Parameters:  {'rf__max_depth': 20, 'rf__n_estimators': 200, 'rfe__n_features_to_select': 15}
Best Score:  0.9007617657539797


### making predictions on training data

In [16]:
# using the best pipeline from the third grid search to make predictions on the training data
best_pipeline3 = model_grid_search3.best_estimator_
y_pred3 = best_pipeline3.predict(X_train)

In [17]:
#accuracy
accuracy = accuracy_score(y_train, y_pred3)
print('Pipeline Accuracy: ', accuracy)

Pipeline Accuracy:  0.9757955015054017


In [18]:
# Evaluation  (Question #E211)
# ----------------------------------
# Confusion matrix on the training data (rows: true class, columns: predicted class).
# The unbiased evaluation uses the test set in the cells below.
from sklearn.metrics import classification_report, confusion_matrix
pd.DataFrame(confusion_matrix(y_train, y_pred3))

      0     1
0  8143   324
1    86  8386


In [19]:
#classification report
print(classification_report(y_train, y_pred3))

              precision    recall  f1-score   support

           0       0.99      0.96      0.98      8467
           1       0.96      0.99      0.98      8472

    accuracy                           0.98     16939
   macro avg       0.98      0.98      0.98     16939
weighted avg       0.98      0.98      0.98     16939



### making predictions on testing data

In [20]:
y_pred3_t = best_pipeline3.predict(X_test)

In [21]:
#accuracy
accuracy = accuracy_score(y_test, y_pred3_t)
print('Pipeline Accuracy: ', accuracy)

Pipeline Accuracy:  0.9015406410483443


In [22]:
#confusion matrix
from sklearn.metrics import classification_report, confusion_matrix
pd.DataFrame(confusion_matrix(y_test, y_pred3_t))

      0     1
0  2462   364
1   192  2629


In [23]:
#classification report
print(classification_report(y_test, y_pred3_t))

              precision    recall  f1-score   support

           0       0.93      0.87      0.90      2826
           1       0.88      0.93      0.90      2821

    accuracy                           0.90      5647
   macro avg       0.90      0.90      0.90      5647
weighted avg       0.90      0.90      0.90      5647



#### <center>Record the optimal hyperparameters and performance resulting from this pipeline.</center>

## Compare these three pipelines and discuss your findings
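
One way to start the comparison is to put the test-set results side by side. The sketch below recomputes accuracy and the class-1 precision, recall, and F1 from the three test confusion matrices printed in this notebook. The dictionary labels are shorthand introduced here, and the `_1` suffix marks metrics for class 1.

```python
import pandas as pd

# Test-set confusion matrices copied from the outputs above
# (rows: true class, columns: predicted class)
test_confusion = {
    "1: PCA + LogisticRegression": [[1678, 1148], [459, 2362]],
    "2: FactorAnalysis + SVC": [[2164, 662], [820, 2001]],
    "3: RFE + RandomForest": [[2462, 364], [192, 2629]],
}

rows = []
for name, cm in test_confusion.items():
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    rows.append({
        "pipeline": name,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision_1": precision,
        "recall_1": recall,
        "f1_1": 2 * precision * recall / (precision + recall),
    })

comparison = pd.DataFrame(rows).set_index("pipeline").round(3)
print(comparison)
```

On these numbers the third pipeline has the highest test accuracy (about 0.90, against about 0.72 and 0.74), which matches the classification reports above.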

## <span style="background: yellow;">Commit your code!</span> 

### Pickle the required pipeline/models for Part III.

In [22]:

import joblib
joblib.dump(model_grid_search, 'model1.pkl')



['model1.pkl']

In [36]:
joblib.dump(model_grid_search2, 'pipeline2.pkl')

['pipeline2.pkl']

In [24]:
import joblib
joblib.dump(model_grid_search3,'bestmodel.pkl')      

['bestmodel.pkl']
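
Before moving on, it can be worth checking that a saved object reloads and predicts as expected. The sketch below does a round trip with a small stand-in grid search on synthetic data; the file name `demo_search.pkl` and the temporary directory are hypothetical and are not the files written above.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Small stand-in for a fitted grid search (assumption: synthetic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=3)
search = GridSearchCV(
    Pipeline([("scale", MinMaxScaler()), ("clf", LogisticRegression(max_iter=500))]),
    {"clf__C": [0.1, 1.0]},
    cv=3,
).fit(X_demo, y_demo)

# Round trip: dump to disk, load back, and compare predictions
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_search.pkl")  # hypothetical file name
    joblib.dump(search, path)
    restored = joblib.load(path)

same_predictions = bool((restored.predict(X_demo) == search.predict(X_demo)).all())
print(same_predictions)
```

Dumping the whole `GridSearchCV` object keeps the cross-validation results alongside the refit model. Dumping only `best_estimator_` would give a smaller file when just predictions are needed.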

You should have made a few commits of this project so far.  
**Definitely make a commit of the notebook now!**  
Comment should be: `Final Project, Checkpoint - Pipelines done`


# Save your notebook!
## Then `File > Close and Halt`