# Final Project


For your final project, you will build a classifer for
the **Backorder Prediction** dataset by following our
operationalized machine learning pipeline.

![AppliedML_Workflow IMAGE MISSING](../images/AppliedML_Workflow.png)


--- 

## Data

Details of the dataset are located here:

Dataset (originally posted on Kaggle): https://www.kaggle.com/tiredgeek/predict-bo-trial

The files are accessible in the JupyterHub environment:
 * `/dsa/data/all_datasets/back_order/Kaggle_Training_Dataset_v2.csv`
 * `/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv`

The data is used to predict likelihood of product to go on Back Order.
 
<span style='background:yellow'>**NOTE:** The training data file is 117MB. **Do NOT try to version control any data files** (training, test, or created), you will blow-through the _push limit_.</span>  
You can easily lock up a notebook with bad coding practices.  
Please save you project early, and often, and use `git commits` to checkpoint your process.

## Exploration, Training, and Validation

You will examine the _training_ dataset and perform 
 * **data preparation and exploratory data analysis**, 
 * **anomaly detection / removal**,
 * **dimensionality reduction** and then
 * **train and validate 3 different models**.

Of the 3 different models, you are free to pick any estimator from Scikit-Learn 
or models we have so far covered using TensorFlow.

### Validation Assessment

Your first, intermediate, result will be an **assessment** of the models' performance.
This assessement should be grounded within a 10-fold cross-validation methodology.

This should include the confusion matrix and F-score for each classifier.


---

## Testing

Once you have chosen your final model, you will need to re-train it using all the training data.


--- 
##  Overview / Roadmap

**General steps**:
* Training and Validation
  * Dataset carpentry & Exploratory Data Analysis
    * Develop functions to perform the necessary steps, you will have to carpentry the Training and the Testing data.
  * Generate a **smart sample** of the the data
  * Create 3 alternative pipelines, each does:
      * Anomaly detection
      * Dimensionality reduction
      * Model training/validation
* Testing
  * Train chosen model full training data
  * Evaluate model against testing
  * Write a summary of your processing and an analysis of the model performance




In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

import os, sys
import itertools
import numpy as np
import pandas as pd

## Load dataset

**Description**
~~~
sku - Random ID for the product
national_inv - Current inventory level for the part
lead_time - Transit time for product (if available)
in_transit_qty - Amount of product in transit from source
forecast_3_month - Forecast sales for the next 3 months
forecast_6_month - Forecast sales for the next 6 months
forecast_9_month - Forecast sales for the next 9 months
sales_1_month - Sales quantity for the prior 1 month time period
sales_3_month - Sales quantity for the prior 3 month time period
sales_6_month - Sales quantity for the prior 6 month time period
sales_9_month - Sales quantity for the prior 9 month time period
min_bank - Minimum recommend amount to stock
potential_issue - Source issue for part identified
pieces_past_due - Parts overdue from source
perf_6_month_avg - Source performance for prior 6 month period
perf_12_month_avg - Source performance for prior 12 month period
local_bo_qty - Amount of stock orders overdue
deck_risk - Part risk flag
oe_constraint - Part risk flag
ppap_risk - Part risk flag
stop_auto_buy - Part risk flag
rev_stop - Part risk flag
went_on_backorder - Product actually went on backorder. 
~~~

**Note**: This is a real-world dataset without any preprocessing.  
There will also be warnings due to fact that the 1st column is mixing integer and string values.  
**NOTE:** The last column, `went_on_backorder`, is what we are trying to predict.


In [2]:
# Dataset location
DATASET = '/dsa/data/all_datasets/back_order/Kaggle_Training_Dataset_v2.csv'
assert os.path.exists(DATASET)


# Load and shuffle
dataset = pd.read_csv(DATASET).sample(frac = 1).reset_index(drop=True)
dataset.describe()



  interactivity=interactivity, compiler=compiler, result=result)


Unnamed: 0,national_inv,lead_time,in_transit_qty,forecast_3_month,forecast_6_month,forecast_9_month,sales_1_month,sales_3_month,sales_6_month,sales_9_month,min_bank,pieces_past_due,perf_6_month_avg,perf_12_month_avg,local_bo_qty
count,1687860.0,1586967.0,1687860.0,1687860.0,1687860.0,1687860.0,1687860.0,1687860.0,1687860.0,1687860.0,1687860.0,1687860.0,1687860.0,1687860.0,1687860.0
mean,496.1118,7.872267,44.05202,178.1193,344.9867,506.3644,55.92607,175.0259,341.7288,525.2697,52.7723,2.043724,-6.872059,-6.437947,0.6264507
std,29615.23,7.056024,1342.742,5026.553,9795.152,14378.92,1928.196,5192.378,9613.167,14838.61,1254.983,236.0165,26.55636,25.84333,33.72224
min,-27256.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-99.0,-99.0,0.0
25%,4.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.63,0.66,0.0
50%,15.0,8.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,4.0,0.0,0.0,0.82,0.81,0.0
75%,80.0,9.0,0.0,4.0,12.0,20.0,4.0,15.0,31.0,47.0,3.0,0.0,0.97,0.95,0.0
max,12334400.0,52.0,489408.0,1427612.0,2461360.0,3777304.0,741774.0,1105478.0,2146625.0,3205172.0,313319.0,146496.0,1.0,1.0,12530.0


## Processing

In this section, the goal is to figure out:

* which columns we can use directly,  
* which columns are usable after some processing,  
* and which columns are not processable or obviously irrelevant (like product id) that we will discard.

Then process and prepare this dataset for creating a predictive model.

### Take samples and examine the dataset

In [3]:
dataset.iloc[:3,:6]

Unnamed: 0,sku,national_inv,lead_time,in_transit_qty,forecast_3_month,forecast_6_month
0,1793815,247.0,4.0,0.0,81.0,81.0
1,1697019,5.0,8.0,0.0,0.0,0.0
2,1239403,140.0,15.0,0.0,0.0,100.0


In [4]:
dataset.iloc[:3,6:12]

Unnamed: 0,forecast_9_month,sales_1_month,sales_3_month,sales_6_month,sales_9_month,min_bank
0,81.0,32.0,148.0,396.0,709.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0
2,100.0,11.0,36.0,85.0,151.0,16.0


In [5]:
dataset.iloc[:3,12:18]

Unnamed: 0,potential_issue,pieces_past_due,perf_6_month_avg,perf_12_month_avg,local_bo_qty,deck_risk
0,No,0.0,0.73,0.78,0.0,No
1,No,0.0,0.82,0.75,0.0,No
2,No,0.0,0.5,0.44,0.0,No


In [6]:
dataset.iloc[:3,18:24]

Unnamed: 0,oe_constraint,ppap_risk,stop_auto_buy,rev_stop,went_on_backorder
0,No,No,Yes,No,No
1,No,No,Yes,No,No
2,No,No,Yes,No,No


### Drop columns that are obviously irrelevant or not processable

In [7]:
# Add code below this comment  (Question #E8001)
# ----------------------------------

#dataset.del['sku']
dataset=dataset.drop('sku', axis=1)
#dataset=dataset.drop('AppointmentID', axis=1)


### Find unique values of string columns

Now try to make sure that these Yes/No columns really only contains Yes or No.  
If that's true, proceed to convert them into binaries (0s and 1s).

**Tip**: use [unique()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) function of pandas Series.

Example

~~~python
print('went_on_backorder', dataset['went_on_backorder'].unique())
~~~

In [8]:
# All the column names of these yes/no columns
yes_no_columns = list(filter(lambda i: dataset[i].dtype!=np.float64, dataset.columns))

print(yes_no_columns)

# Add code below this comment  (Question #E8002)
# ----------------------------------
def PrintUniqueValues(yes_no_column,dataset):
    for i in yes_no_column:
        print("'"+i+": '", dataset[i].unique())
PrintUniqueValues(yes_no_columns,dataset)


['potential_issue', 'deck_risk', 'oe_constraint', 'ppap_risk', 'stop_auto_buy', 'rev_stop', 'went_on_backorder']
'potential_issue: ' ['No' 'Yes' nan]
'deck_risk: ' ['No' 'Yes' nan]
'oe_constraint: ' ['No' 'Yes' nan]
'ppap_risk: ' ['No' 'Yes' nan]
'stop_auto_buy: ' ['Yes' 'No' nan]
'rev_stop: ' ['No' 'Yes' nan]
'went_on_backorder: ' ['No' 'Yes' nan]


You may see **nan** also as possible values representing missing values in the dataset.

We fill them using most popular values, the [Mode](https://en.wikipedia.org/wiki/Mode_%28statistics%29) in Stats.

In [9]:
for column_name in yes_no_columns:
    mode = dataset[column_name].apply(str).mode()[0]
    print('Filling missing values of {} with {}'.format(column_name, mode))
    dataset[column_name].fillna(mode, inplace=True)
    


Filling missing values of potential_issue with No
Filling missing values of deck_risk with No
Filling missing values of oe_constraint with No
Filling missing values of ppap_risk with No
Filling missing values of stop_auto_buy with Yes
Filling missing values of rev_stop with No
Filling missing values of went_on_backorder with No


### Convert yes/no columns into binary (0s and 1s)

In [10]:
# Add code below this comment  (Question #E8003)
# ----------------------------------
def CovertToBinary(yes_no_column,data):
    for i in yes_no_column:
        data[i] = data[i].apply(['Yes', 'No'].index)

CovertToBinary(yes_no_columns,dataset)
PrintUniqueValues(yes_no_columns,dataset)
dataset.info()

'potential_issue: ' [1 0]
'deck_risk: ' [1 0]
'oe_constraint: ' [1 0]
'ppap_risk: ' [1 0]
'stop_auto_buy: ' [0 1]
'rev_stop: ' [1 0]
'went_on_backorder: ' [1 0]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1687861 entries, 0 to 1687860
Data columns (total 22 columns):
national_inv         1687860 non-null float64
lead_time            1586967 non-null float64
in_transit_qty       1687860 non-null float64
forecast_3_month     1687860 non-null float64
forecast_6_month     1687860 non-null float64
forecast_9_month     1687860 non-null float64
sales_1_month        1687860 non-null float64
sales_3_month        1687860 non-null float64
sales_6_month        1687860 non-null float64
sales_9_month        1687860 non-null float64
min_bank             1687860 non-null float64
potential_issue      1687861 non-null int64
pieces_past_due      1687860 non-null float64
perf_6_month_avg     1687860 non-null float64
perf_12_month_avg    1687860 non-null float64
local_bo_qty         1687860 non-null 

Now all columns should be either int64 or float64.

### Data cleaning functions

In [11]:
##Cleaning the dataset for anymore NAN or long float values, or negative values.
def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    df.dropna(inplace=True)
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(1)
    return df[indices_to_keep].astype(np.float64)


def delete_negative_values(X_data):
    for i in X_data.columns:
        if any(X_data[i]<0):
            X_data = X_data[X_data[i] > 0]
            X_data = X_data.dropna()
            print(any(X_data[i]<0))
    return X_data

def replace_negative_values_with_0(X_data):
    for i in X_data.columns:
        if any(X_data[i]<0):
            X_data[X_data[i]<0]=0 #a[a < 0] = 0
            print(any(X_data[i]<0))
    return X_data
#dataset=delete_negative_values(dataset)

In [12]:
#First Cleaning the dataset dealing with NAN or long float values
clean_dataset(dataset)

dataset.shape

(1586967, 22)

### Perform additional steps to smartly sample the data into a more manageable size for cross-fold validation

**Note:** After sampling the data, you may want to write the data to a file for reloading later.
Remove the old `dataset` variable

In [11]:
# Add code below this comment   (Question #E8004) 
# ----------------------------------

#Resampling after cleaning the data on NAN or long float values.
dataset_resampled = pd.concat([
    dataset[dataset['went_on_backorder'] == 1].sample(frac = 1).reset_index(drop=True),
    dataset[dataset['went_on_backorder'] == 0]
])

dataset_resampled = dataset_resampled.sample(frac = 1).reset_index(drop=True)
dataset = dataset.reset_index()

In [72]:
num_went_on_backorder = np.sum(dataset[dataset['went_on_backorder']==1]) # find out total number of no-show cases
print('went_on_backorder ratio:', num_went_on_backorder, '/', len(dataset), '=', num_went_on_backorder / len(dataset))

went_on_backorder ratio: index                1.330102e+12
national_inv         7.766043e+08
lead_time            1.242368e+07
in_transit_qty       7.212486e+07
forecast_3_month     2.977890e+08
forecast_6_month     5.770620e+08
forecast_9_month     8.474303e+08
sales_1_month        9.001331e+07
sales_3_month        2.824282e+08
sales_6_month        5.575383e+08
sales_9_month        8.613777e+08
min_bank             8.417925e+07
potential_issue      1.575147e+06
pieces_past_due      3.408234e+06
perf_6_month_avg    -1.607801e+06
perf_12_month_avg   -8.785444e+05
local_bo_qty         9.543310e+05
deck_risk            1.255292e+06
oe_constraint        1.575761e+06
ppap_risk            1.389139e+06
stop_auto_buy        3.877000e+04
rev_stop             1.575588e+06
went_on_backorder    1.575998e+06
dtype: float64 / 1586967 = index                838140.622366
national_inv            489.363844
lead_time                 7.828566
in_transit_qty           45.448244
forecast_3_month        18

In [73]:
#went_on_backorder
upsample_rate = (len(dataset) - num_went_on_backorder) / num_went_on_backorder
print('upsample_rate:', upsample_rate)

upsample_rate: index                -0.999999
national_inv         -0.997957
lead_time            -0.872263
in_transit_qty       -0.977997
forecast_3_month     -0.994671
forecast_6_month     -0.997250
forecast_9_month     -0.998127
sales_1_month        -0.982370
sales_3_month        -0.994381
sales_6_month        -0.997154
sales_9_month        -0.998158
min_bank             -0.981148
potential_issue       0.007504
pieces_past_due      -0.534373
perf_6_month_avg     -1.987042
perf_12_month_avg    -2.806360
local_bo_qty          0.662910
deck_risk             0.264221
oe_constraint         0.007111
ppap_risk             0.142411
stop_auto_buy        39.932860
rev_stop              0.007222
went_on_backorder     0.006960
dtype: float64


In [74]:
dataset_resampled = pd.concat([
    dataset[dataset['went_on_backorder'] == 1].sample(frac = 0.01).reset_index(drop=True),
    dataset[dataset['went_on_backorder'] == 0]
])
print(dataset.count())
dataset_resampled.count()

index                1586967
national_inv         1586967
lead_time            1586967
in_transit_qty       1586967
forecast_3_month     1586967
forecast_6_month     1586967
forecast_9_month     1586967
sales_1_month        1586967
sales_3_month        1586967
sales_6_month        1586967
sales_9_month        1586967
min_bank             1586967
potential_issue      1586967
pieces_past_due      1586967
perf_6_month_avg     1586967
perf_12_month_avg    1586967
local_bo_qty         1586967
deck_risk            1586967
oe_constraint        1586967
ppap_risk            1586967
stop_auto_buy        1586967
rev_stop             1586967
went_on_backorder    1586967
dtype: int64


index                26729
national_inv         26729
lead_time            26729
in_transit_qty       26729
forecast_3_month     26729
forecast_6_month     26729
forecast_9_month     26729
sales_1_month        26729
sales_3_month        26729
sales_6_month        26729
sales_9_month        26729
min_bank             26729
potential_issue      26729
pieces_past_due      26729
perf_6_month_avg     26729
perf_12_month_avg    26729
local_bo_qty         26729
deck_risk            26729
oe_constraint        26729
ppap_risk            26729
stop_auto_buy        26729
rev_stop             26729
went_on_backorder    26729
dtype: int64

In [75]:
dataset_resampled

# Load and shuffle
dataset_resampled = dataset_resampled.sample(frac = 0.3).reset_index(drop=True)
#dataset_resampled.describe()
#
dataset_resampled.head()
dataset_resampled.count()

index                8019
national_inv         8019
lead_time            8019
in_transit_qty       8019
forecast_3_month     8019
forecast_6_month     8019
forecast_9_month     8019
sales_1_month        8019
sales_3_month        8019
sales_6_month        8019
sales_9_month        8019
min_bank             8019
potential_issue      8019
pieces_past_due      8019
perf_6_month_avg     8019
perf_12_month_avg    8019
local_bo_qty         8019
deck_risk            8019
oe_constraint        8019
ppap_risk            8019
stop_auto_buy        8019
rev_stop             8019
went_on_backorder    8019
dtype: int64

In [76]:
print('went_on_backorder ratio:', np.sum(dataset_resampled['went_on_backorder'] == 1) / len(dataset_resampled))
print('went_on_backorder ratio:', np.sum(dataset['went_on_backorder'] == 1) / len(dataset))
#went_on_backorder ratio: 0.5936845255911404 


went_on_backorder ratio: 0.5874797356278838
went_on_backorder ratio: 0.993088073034915


# Smarty resampled data 

Samrtly re-sampled data had a went_on_backorder ratio: 0.5936845255911404, which looked quite balanced subset of the original dataset. Original dataset had a went_on_backorder ratio: 0.993088073034915.

In [None]:
# Write your smart sampling to local file  (Question #E8004 ... cont. ) 
# ----------------------------------
dataset_resampled.to_csv('sampled2.csv',index=False)


You should have made a couple commits so far of this project.  
**Definitely make a commit of the notebook now!**  
Comment should be: `Final Project, Checkpoint - Data Sampled`

### <center><span style='color:green'>This becomes the new Starting Point after initial data work</span></center>

In [16]:
%matplotlib inline
import matplotlib.pyplot as plt

import os, sys
import itertools
import numpy as np
import pandas as pd
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest

from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split
from sklearn.svm import OneClassSVM
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.externals import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report


# Reload your smart sampling from local file  (Question #E8004 ... cont.) 
# ----------------------------------

RESAMPLED = 'sampled2.csv'
assert os.path.exists(RESAMPLED)

# Load and shuffle
dataset2 = pd.read_csv(RESAMPLED).sample(frac = 1).reset_index(drop=True)
dataset2.describe()

dataset2.shape

(13364, 23)

In [11]:
#Splitting the subset into X and y

X=dataset2.iloc[:,:-1]
y=dataset2.went_on_backorder

X.shape

(13364, 22)

In [12]:
#Splitting the data into test and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
print("Training shapes (X, y): ", X_train.shape, y_train.shape)
print("Testing shapes (X, y): ", X_test.shape, y_test.shape)

Training shapes (X, y):  (12027, 22) (12027,)
Testing shapes (X, y):  (1337, 22) (1337,)


In [13]:
X_test2=replace_negative_values_with_0(X_test)

False
False


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


## Pipeline

In this section, design an operationalized machine learning pipeline, which includes:

* Anomaly detection
* Dimensionality Reduction
* Train a model

**Note:** <span style='background:yellow'>Ensure you are using Grid Search to find optimal parameters of your pipelines.</span>

You can add more notebook cells or import any Python modules as needed.

### <font color='Red'>Your 1st pipeline </font>
  * Anomaly detection - <font color='Red'>SVM method used</font>
  * Dimensionality reduction - <font color='Red'>PCA() and selectKBest()</font>
  * Model training/validation - <font color='Red'>Ridge()</font>

In [16]:
# Add code below this comment  (Question #E8005)
# ----------------------------------

svm = OneClassSVM(kernel='rbf',gamma='auto').fit(X_train, y_train)
svm_outliers = svm.predict(X_train)==-1

# Pull inliers
X_svm = X_train[~svm_outliers] 
y_svm = y_train[~svm_outliers]


##### SVM was really hard on removing the ouliers. The subset is slashed down to almost half the number of rows.

In [17]:
print("X_train.shape: ",X_train.shape, "X_svm.shape:", X_svm.shape)

X_train.shape:  (12027, 22) X_svm.shape: (4329, 22)


# Removing negative values in data
The followng step is to replace the -ve values with zero because without this step the pipeline was failing saying "can't handle the negative values." I tried first deleting the negative values, but then a lot of data was going away. Hence I just chose replacing negative values with 0 instead of dropping them.

Also, I purposefully did it after ouliers removal step because I though ouliers would have been affected if I would have done this step before the anomaly detection. 



In [18]:
X_svm2=replace_negative_values_with_0(X_svm)


False
False


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


In [21]:
# 1ST Pipeline with PCA and Ridge
pipe = Pipeline([
    ('reduce_dim', PCA()),
    ('classifier', Ridge())
])

from sklearn.feature_selection import SelectKBest, chi2, f_regression,mutual_info_regression

N_FEATURES_OPTIONS = [2,4,5,6,10,12,18]
param_grid = [{
    'reduce_dim__n_components': N_FEATURES_OPTIONS
},
    {
        'reduce_dim': [SelectKBest(f_regression)],
        'reduce_dim__k': N_FEATURES_OPTIONS
    },
]
reducer_labels = ['PCA','KBest(chi2)'] #

clf2 = GridSearchCV(pipe,cv=10, n_jobs=2, param_grid=param_grid)

clf2.fit(X_svm2, y_svm)





GridSearchCV(cv=10, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('reduce_dim', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('classifier', Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001))]),
       fit_params=None, iid='warn', n_jobs=2,
       param_grid=[{'reduce_dim__n_components': [2, 4, 5, 6, 10, 12, 18]}, {'reduce_dim': [SelectKBest(k=6, score_func=<function f_regression at 0x7f1b5d780c80>)], 'reduce_dim__k': [2, 4, 5, 6, 10, 12, 18]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [26]:
y_pred=clf2.predict(X_test2).round()
print("Confusion Matrix:===> ","\n", confusion_matrix(y_test, y_pred),"\n")
print("Classification Report:===> ","\n", classification_report(y_test, y_pred))
print("Best Accuracy Score:===> ", clf2.best_score_)

Confusion Matrix:===>  
 [[123 417]
 [ 69 728]] 

Classification Report:===>  
               precision    recall  f1-score   support

           0       0.64      0.23      0.34       540
           1       0.64      0.91      0.75       797

   micro avg       0.64      0.64      0.64      1337
   macro avg       0.64      0.57      0.54      1337
weighted avg       0.64      0.64      0.58      1337

Best Accuracy Score:===>  0.04214284499357722


In [27]:
clf2.best_estimator_

Pipeline(memory=None,
     steps=[('reduce_dim', SelectKBest(k=6, score_func=<function f_regression at 0x7f1b5d780c80>)), ('classifier', Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001))])

In [28]:
clf2.cv_results_



{'mean_fit_time': array([0.13946404, 0.19324427, 0.12543812, 0.14788818, 0.19117432,
        0.28480551, 0.09613636, 0.03657842, 0.04688015, 0.02402456,
        0.02375894, 0.04015844, 0.03928351, 0.02968147]),
 'std_fit_time': array([0.0759725 , 0.11147323, 0.05799941, 0.06367585, 0.05210349,
        0.06669843, 0.00433948, 0.03725929, 0.04220813, 0.02974793,
        0.02942295, 0.03796989, 0.03945764, 0.03434674]),
 'mean_score_time': array([0.00153742, 0.00202346, 0.00208526, 0.00172095, 0.00995138,
        0.00216825, 0.00187285, 0.00690937, 0.00152936, 0.00152485,
        0.00148675, 0.00978835, 0.00133595, 0.00127347]),
 'std_score_time': array([6.74347572e-04, 9.66293265e-04, 9.24548792e-04, 6.04504407e-04,
        2.32406592e-02, 9.87240087e-04, 6.80614867e-04, 1.65097738e-02,
        3.56327047e-04, 2.91750099e-04, 2.63384112e-04, 2.37231139e-02,
        1.81391299e-04, 5.39484982e-05]),
 'param_reduce_dim__n_components': masked_array(data=[2, 4, 5, 6, 10, 12, 18, --, --, --, 

#### <center>Record the optimal hyperparameters and performance resulting from this pipeline grid search.</center>

### <font color='Red'>Your 2nd pipeline </font>
  * Anomaly detection - <font color='Red'>Isolation Method method used</font>
  * Dimensionality reduction - <font color='Red'>PCA() and selectKBest()</font>
  * Model training/validation - <font color='Red'>RandomForestClassifier(n_estimators=100, max_depth=10)</font>

In [29]:
# Add code below this comment  (Question #E8007)
# ----------------------------------

# Pipeline 2 => Anomaly detection => Construct IsolationForest 
iso_forest = IsolationForest(n_estimators=250,
                             bootstrap=True).fit(X_train, y_train)

iso_outliers = iso_forest.predict(X_train)==-1

X_iso = X_train[~iso_outliers]
y_iso = y_train[~iso_outliers]

X_iso2=replace_negative_values_with_0(X_iso)



False
False


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


##### Looking at the shape of original training dataset and training dataset without outliers, looks like isolation Forest was not as brutual as svm in removing the ouliers.

In [31]:
print("X_train.shape: ",X_train.shape, "X_iso.shape:", X_iso.shape)

X_train.shape:  (12027, 22) X_iso.shape: (10824, 22)


##### Lets pickle out the anomaly detection classifier.

In [32]:
joblib.dump(iso_forest, 'iso_forest.pkl')

['iso_forest.pkl']

In [33]:
#Pipeline with PCA and Random forest classifier


pipe = Pipeline([
    ('reduce_dim', PCA()),
    ('classifier', RandomForestClassifier(n_estimators=100, max_depth=10,random_state=1))
])



N_FEATURES_OPTIONS = [2,5,8,12,18,20]
param_grid = [{
    'reduce_dim__n_components': N_FEATURES_OPTIONS
},
    {
        'reduce_dim': [SelectKBest(f_regression)],
        'reduce_dim__k': N_FEATURES_OPTIONS
    },
]
reducer_labels = ['PCA','KBest(chi2)'] #

clf3 = GridSearchCV(pipe,cv=10, n_jobs=2, param_grid=param_grid)

clf3.fit(X_iso2, y_iso)

#n_estimators=500, max_depth=5,random_state=0) 85% x_svm2
#n_estimators=100, max_depth=10,random_state=0) 86% x_svm2
#n_estimators=100, max_depth=10,random_state=1) 86.6 x_svm2

GridSearchCV(cv=10, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('reduce_dim', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('classifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features='auto', max_leaf_nodes=Non...mators=100, n_jobs=None,
            oob_score=False, random_state=1, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=2,
       param_grid=[{'reduce_dim__n_components': [2, 5, 8, 12, 18, 20]}, {'reduce_dim': [SelectKBest(k=18, score_func=<function f_regression at 0x7f1b5d780c80>)], 'reduce_dim__k': [2, 5, 8, 12, 18, 20]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [34]:
y_pred3=clf3.predict(X_test2).round()
print("Confusion Matrix:===> ","\n", confusion_matrix(y_test, y_pred3),"\n")
print("Classification Report:===> ","\n", classification_report(y_test, y_pred3))
print("Best Accuracy Score:===> ", clf3.best_score_)

Confusion Matrix:===>  
 [[470  70]
 [ 92 705]] 

Classification Report:===>  
               precision    recall  f1-score   support

           0       0.84      0.87      0.85       540
           1       0.91      0.88      0.90       797

   micro avg       0.88      0.88      0.88      1337
   macro avg       0.87      0.88      0.87      1337
weighted avg       0.88      0.88      0.88      1337

Best Accuracy Score:===>  0.8732446415373245


In [35]:
clf3.best_estimator_

Pipeline(memory=None,
     steps=[('reduce_dim', SelectKBest(k=18, score_func=<function f_regression at 0x7f1b5d780c80>)), ('classifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_im...mators=100, n_jobs=None,
            oob_score=False, random_state=1, verbose=0, warm_start=False))])

In [36]:
clf3.cv_results_



{'mean_fit_time': array([1.45658276, 2.01593521, 2.36447291, 3.03860893, 2.91751575,
        3.11890912, 0.31489751, 0.51972735, 0.52162497, 0.73051088,
        0.99146113, 0.93290198]),
 'std_fit_time': array([0.15329191, 0.16042788, 0.29098683, 0.30247709, 0.17292014,
        0.09337753, 0.04683821, 0.05869908, 0.0326775 , 0.03888426,
        0.05246475, 0.04226542]),
 'mean_score_time': array([0.04450729, 0.03650155, 0.04070563, 0.05709233, 0.07172623,
        0.04302967, 0.016027  , 0.02290478, 0.02109427, 0.02491107,
        0.02405984, 0.02345941]),
 'std_score_time': array([0.04590495, 0.02243414, 0.02933746, 0.03452093, 0.05291311,
        0.01331274, 0.00478787, 0.00454521, 0.00066332, 0.0054267 ,
        0.00487581, 0.00337116]),
 'param_reduce_dim__n_components': masked_array(data=[2, 5, 8, 12, 18, 20, --, --, --, --, --, --],
              mask=[False, False, False, False, False, False,  True,  True,
                     True,  True,  True,  True],
        fill_value='?',
 

#### <center>Record the optimal hyperparameters and performance resulting from this pipeline grid search.</center>

### <font color='Red'>Your 3rd pipeline </font>
  * Anomaly detection - <font color='Red'>EllipticEnvelope(support_fraction=1, contamination=0.2) method used</font>
  * Dimensionality reduction - <font color='Red'>PCA() and selectKBest()</font>
  * Model training/validation - <font color='Red'>LogisticRegression(solver='lbfgs')</font>

In [37]:
# Add code below this comment  (Question #E8009)
# ----------------------------------
from sklearn.covariance import EllipticEnvelope

envelope = EllipticEnvelope(support_fraction=1, contamination=0.2).fit(X_train)

# Create an boolean indexing array to pick up outliers
outliers = envelope.predict(X_train)==-1

# Re-slice X,y into a cleaned dataset with outliers excluded
X_env = X_train[~outliers]
y_env = y_train[~outliers]

X_env2=replace_negative_values_with_0(X_env)

False
False


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


In [38]:
print("X_train.shape: ",X_train.shape, "X_env.shape:", X_env.shape)

X_train.shape:  (12027, 22) X_env.shape: (9621, 22)


In [39]:
#Pipeline with PCA and Random forest classifier


pipe = Pipeline([
    ('reduce_dim', PCA()),
    ('classifier', LogisticRegression(solver='lbfgs'))
])

from sklearn.feature_selection import SelectKBest, chi2

N_FEATURES_OPTIONS = [5,7,8,10,12,18]
C_OPTIONS = [1, 1e-1, 3e3]
param_grid = [{
    'reduce_dim': [PCA(iterated_power=7)],
    'reduce_dim__n_components': N_FEATURES_OPTIONS,
    'classifier__C': C_OPTIONS
},
    {
        # A second set of tests cases for hyperparameters
        'reduce_dim': [SelectKBest(f_regression)],
        'reduce_dim__k': N_FEATURES_OPTIONS,
        'classifier__C': C_OPTIONS
    },
   
]
reducer_labels = ['PCA','KBest(chi2)'] #

clf4 = GridSearchCV(pipe,cv=10, n_jobs=2, param_grid=param_grid)

clf4.fit(X_env2, y_env)




GridSearchCV(cv=10, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('reduce_dim', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('classifier', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=2,
       param_grid=[{'reduce_dim': [PCA(copy=True, iterated_power=7, n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)], 'reduce_dim__n_components': [5, 7, 8, 10, 12, 18], 'classifier__C': [1, 0.1, 3000.0]}, {'reduce_dim': [SelectKBest(k=12, score_func=<function f_regression at 0x7f1b5d780c80>)], 'reduce_dim__k': [5, 7, 8, 10, 12, 18], 'classifier__C': [1, 0.1, 3000.0]}],
       pre_dispatch='2*

In [44]:
y_pred4=clf4.predict(X_test2).round()
print("Confusion Matrix:===> ","\n", confusion_matrix(y_test, y_pred4),"\n")
print("Classification Report:===> ","\n", classification_report(y_test, y_pred4))
print("Best Accuracy Score:===> ", clf4.best_score_)

Confusion Matrix:===>  
 [[435 105]
 [117 680]] 

Classification Report:===>  
               precision    recall  f1-score   support

           0       0.79      0.81      0.80       540
           1       0.87      0.85      0.86       797

   micro avg       0.83      0.83      0.83      1337
   macro avg       0.83      0.83      0.83      1337
weighted avg       0.83      0.83      0.83      1337

Best Accuracy Score:===>  0.825797734123272


In [45]:
clf4.best_estimator_

Pipeline(memory=None,
     steps=[('reduce_dim', SelectKBest(k=12, score_func=<function f_regression at 0x7f1b5d780c80>)), ('classifier', LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))])

In [46]:
clf4.cv_results_



{'mean_fit_time': array([1.6894145 , 1.75725484, 1.93365562, 2.40704446, 2.01250665,
        1.50336368, 1.65684788, 1.66517582, 2.08171687, 1.9510772 ,
        2.80622602, 1.58432474, 1.3049598 , 1.50047667, 1.73769503,
        2.01901641, 1.83836453, 1.10613053, 0.53518987, 0.72288265,
        0.92608991, 1.50344837, 1.48821666, 0.66174414, 0.50366209,
        0.70473917, 0.95827558, 1.62211485, 1.50287652, 0.46263673,
        0.34512308, 0.71867697, 0.8888417 , 1.26221528, 1.39413147,
        0.59744401]),
 'std_fit_time': array([0.21972775, 0.55980279, 0.390492  , 0.52453754, 0.29647712,
        0.43140356, 0.47857213, 0.75381477, 0.3306847 , 0.58721538,
        0.88456633, 0.64319114, 0.37981194, 0.45435315, 0.38329408,
        0.35346463, 0.31477759, 0.26907455, 0.09704853, 0.24125366,
        0.29550304, 0.19227707, 0.15652622, 0.28815497, 0.13232586,
        0.20844806, 0.28831979, 0.17117804, 0.31251406, 0.16470846,
        0.07248768, 0.18574868, 0.19003756, 0.30576889, 0.281

#### <center>Record the optimal hyperparameters and performance resulting from this pipeline grid search.</center>

## Document the cross-validation analysis for the three models

### You may want to pickle some models that do some things.

In [None]:
# Just a suggestion :)
# ----------------------------

Logistic Regression and Random Forest Classifiers performed well over Ridge, so I would like to pick them 
on the final training and predicting.





You should have made a few commits so far of this project.  
**Definitely make a commit of the notebook now!**  
Comment should be: `Final Project, Checkpoint - Pipelines done`

### <center><span style='color:green'>This becomes the new Starting Point after pipeline grid search work</span></center>

In [47]:
%matplotlib inline
import matplotlib.pyplot as plt

import os, sys
import itertools
import numpy as np
import pandas as pd



# Retrain a model using the full training data set

## Train
Use the full training data set to train the model.

In [13]:
# Add code below this comment  (Question #E8012)
# ----------------------------------

X_whole=dataset.iloc[:,:-1]
y_whole=dataset.went_on_backorder


In [14]:
X_whole2=replace_negative_values_with_0(X_whole)

False
False


### Splitting the complete training data into ratio of 90% training and 10% validation set

In [17]:
X_train1, X_test1, y_train1, y_test1 = train_test_split(X_whole2, y_whole, test_size=0.1)


### Logistic Regression on training data

In [18]:
classifier=LogisticRegression(solver='lbfgs', C=0.1, n_jobs=2)
classifier.fit(X_train1,y_train1)
y_pred=classifier.predict(X_test1)

print("Confusion Matrix:===> ","\n", confusion_matrix(y_test1, y_pred),"\n")
print("Classification Report:===> ","\n", classification_report(y_test1, y_pred))
print("Best Accuracy Score:===> ", accuracy_score(y_test1, y_pred))



Confusion Matrix:===>  
 [[     2   1100]
 [     2 157593]] 

Classification Report:===>  
               precision    recall  f1-score   support

           0       0.50      0.00      0.00      1102
           1       0.99      1.00      1.00    157595

   micro avg       0.99      0.99      0.99    158697
   macro avg       0.75      0.50      0.50    158697
weighted avg       0.99      0.99      0.99    158697

Best Accuracy Score:===>  0.9930559493878272


Since the model borrowed from 3rd pipeline failed terribly with imbalanced full training data, yielding 0 went_backorders, as we can see from the confusion matrix, even though accuracy was pretty good.

#####  3rd pipeline was trained and tested on smartly re-sampled balanced data. But now we are dealing with the total training data which is highly unbalanced; and to deal with unbalanced data like this I added one more parameter in the classifier called "class_weight", which improved the results.

In [55]:
class_weight=dict({1:5, 0:180}) #1:3 - 98%
classifier=LogisticRegression(solver='lbfgs', C=0.1, n_jobs=2,class_weight=class_weight)
classifier.fit(X_train1,y_train1)

LogisticRegression(C=0.1, class_weight={1: 5, 0: 180}, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=2, penalty='l2', random_state=None,
          solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)

The modification with the class_weight helped predicting went_backorders, which was our sucess metrics.

### Logistic Regression performace test on the  Validation data (10 % of training data)

In [56]:
y_pred=classifier.predict(X_test1)

print("Confusion Matrix:===> ","\n", confusion_matrix(y_test1, y_pred),"\n")
print("Classification Report:===> ","\n", classification_report(y_test1, y_pred))
print("Best Accuracy Score:===> ", accuracy_score(y_test1, y_pred))

Confusion Matrix:===>  
 [[    37   1074]
 [   194 157392]] 

Classification Report:===>  
               precision    recall  f1-score   support

           0       0.16      0.03      0.06      1111
           1       0.99      1.00      1.00    157586

   micro avg       0.99      0.99      0.99    158697
   macro avg       0.58      0.52      0.53    158697
weighted avg       0.99      0.99      0.99    158697

Best Accuracy Score:===>  0.9920099308745597


### Random Forest Classifier on training data

##### Similarly, RandomForestClassifier performed badly without class_weight and I tried and tested it with both (without and with class_weight)

In [57]:
classifier2= RandomForestClassifier(n_estimators=100, max_depth=10,random_state=0, n_jobs=2)
classifier2.fit(X_train1,y_train1)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=2,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [58]:
y_pred=classifier2.predict(X_test1)

#y_pred4=clf4.predict(X_test2).round()
print("Confusion Matrix:===> ","\n", confusion_matrix(y_test1, y_pred),"\n")
print("Classification Report:===> ","\n", classification_report(y_test1, y_pred))
print("Best Accuracy Score:===> ", accuracy_score(y_test1, y_pred))

Confusion Matrix:===>  
 [[     2   1109]
 [     0 157586]] 

Classification Report:===>  
               precision    recall  f1-score   support

           0       1.00      0.00      0.00      1111
           1       0.99      1.00      1.00    157586

   micro avg       0.99      0.99      0.99    158697
   macro avg       1.00      0.50      0.50    158697
weighted avg       0.99      0.99      0.99    158697

Best Accuracy Score:===>  0.9930118401734123


In [35]:
class_weight=dict({1:1, 0:20}) #1:3 - 98%
classifier2= RandomForestClassifier(n_estimators=100, max_depth=10,random_state=0, n_jobs=2, class_weight=class_weight)
classifier2.fit(X_train1,y_train1)

RandomForestClassifier(bootstrap=True, class_weight={1: 1, 0: 20},
            criterion='gini', max_depth=10, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=2, oob_score=False, random_state=0,
            verbose=0, warm_start=False)

### Random Forest Classifier performace test on the  Validation data (10 % of training data)

In [36]:
y_pred=classifier2.predict(X_test1)

#y_pred4=clf4.predict(X_test2).round()
print("Confusion Matrix:===> ","\n", confusion_matrix(y_test1, y_pred),"\n")
print("Classification Report:===> ","\n", classification_report(y_test1, y_pred))
print("Best Accuracy Score:===> ", accuracy_score(y_test1, y_pred))

Confusion Matrix:===>  
 [[   500    602]
 [  2274 155321]] 

Classification Report:===>  
               precision    recall  f1-score   support

           0       0.18      0.45      0.26      1102
           1       1.00      0.99      0.99    157595

   micro avg       0.98      0.98      0.98    158697
   macro avg       0.59      0.72      0.62    158697
weighted avg       0.99      0.98      0.99    158697

Best Accuracy Score:===>  0.9818774141918247


### Save the trained model with the pickle library.
Saved two models with class_weight parameter, which proved to be a game changer in case of the unbalanced data.

In [37]:
# Add code below this comment  (Question #E8013)
# ----------------------------------


#joblib.dump(classifier, 'LogisticRegression.pkl')

joblib.dump(classifier2, 'RandomForest2.pkl')




['RandomForest2.pkl']

### Reload the trained model from the pickle file
### Load the Testing Data and evaluate your model

 * `/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv`

In [38]:
# Add code below this comment  (Question #E8014)
# ----------------------------------
loaded_model = joblib.load('LogisticRegression.pkl')

loaded_model2 = joblib.load('RandomForest2.pkl')



## Test
Test your new model using the testing data set.
 * `/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv`

### Import and cleaning of the test data

In [28]:
from sklearn.metrics import accuracy_score, confusion_matrix

# Add code below this comment  (Question #E8015)
# ----------------------------------

## Data cleaning

TEST='/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv'
assert os.path.exists(TEST)
test=pd.read_csv(TEST).sample(frac = 1).reset_index(drop=True)
test=test.drop('sku', axis=1)
yes_no_columns_test = list(filter(lambda i: test[i].dtype!=np.float64, test.columns))
for column_name in yes_no_columns_test:
    mode = test[column_name].apply(str).mode()[0]
    print('Filling missing values of {} with {}'.format(column_name, mode))
    test[column_name].fillna(mode, inplace=True)

CovertToBinary(yes_no_columns_test,test)
clean_dataset(test)



  interactivity=interactivity, compiler=compiler, result=result)


Filling missing values of potential_issue with No
Filling missing values of deck_risk with No
Filling missing values of oe_constraint with No
Filling missing values of ppap_risk with No
Filling missing values of stop_auto_buy with Yes
Filling missing values of rev_stop with No
Filling missing values of went_on_backorder with No


Unnamed: 0,national_inv,lead_time,in_transit_qty,forecast_3_month,forecast_6_month,forecast_9_month,sales_1_month,sales_3_month,sales_6_month,sales_9_month,...,pieces_past_due,perf_6_month_avg,perf_12_month_avg,local_bo_qty,deck_risk,oe_constraint,ppap_risk,stop_auto_buy,rev_stop,went_on_backorder
0,10.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.33,0.24,0.0,0.0,1.0,1.0,0.0,1.0,1.0
1,41.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,...,0.0,0.57,0.42,0.0,0.0,1.0,0.0,0.0,1.0,1.0
2,9.0,4.0,1.0,10.0,19.0,25.0,1.0,6.0,20.0,24.0,...,0.0,0.73,0.78,0.0,1.0,1.0,1.0,0.0,1.0,1.0
3,205.0,8.0,123.0,277.0,597.0,917.0,92.0,311.0,637.0,908.0,...,0.0,-99.00,-99.00,0.0,1.0,1.0,0.0,0.0,1.0,1.0
4,47.0,2.0,0.0,0.0,0.0,0.0,1.0,2.0,5.0,9.0,...,0.0,0.80,0.75,0.0,1.0,1.0,1.0,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
242071,4.0,8.0,0.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.35,0.36,0.0,1.0,1.0,1.0,0.0,1.0,1.0
242072,10.0,12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.48,0.48,0.0,0.0,1.0,1.0,0.0,1.0,1.0
242073,9.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.99,0.99,0.0,1.0,1.0,0.0,0.0,1.0,1.0
242074,76.0,9.0,0.0,64.0,96.0,96.0,0.0,0.0,0.0,0.0,...,32.0,0.90,0.88,0.0,1.0,1.0,1.0,0.0,1.0,1.0


### Splitting the test data into X data and y target

In [29]:
#Splitting the data
X_test_full=test.iloc[:,:-1]
y_test_full=test.went_on_backorder
X_test_full2=replace_negative_values_with_0(X_test_full)


False
False


### Predictions made with pickled LOGISTIC REGRESSION model.

In [65]:
##Logistic Regression
y_pred=loaded_model.predict(X_test_full2)

print("Confusion Matrix:===> ","\n", confusion_matrix(y_test_full, y_pred),"\n")
print("Classification Report:===> ","\n", classification_report(y_test_full, y_pred))
print("Best Accuracy Score:===> ", accuracy_score(y_test_full, y_pred))


Confusion Matrix:===>  
 [[    53   2551]
 [   302 224445]] 

Classification Report:===>  
               precision    recall  f1-score   support

           0       0.15      0.02      0.04      2604
           1       0.99      1.00      0.99    224747

   micro avg       0.99      0.99      0.99    227351
   macro avg       0.57      0.51      0.51    227351
weighted avg       0.98      0.99      0.98    227351

Best Accuracy Score:===>  0.9874511218336405


### Predictions made with pickled Random Forest model.

In [39]:
#Random Forest
y_pred=loaded_model2.predict(X_test_full2)

print("Confusion Matrix:===> ","\n", confusion_matrix(y_test_full, y_pred),"\n")
print("Classification Report:===> ","\n", classification_report(y_test_full, y_pred))
print("Best Accuracy Score:===> ", accuracy_score(y_test_full, y_pred))

Confusion Matrix:===>  
 [[   932   1672]
 [  2881 221866]] 

Classification Report:===>  
               precision    recall  f1-score   support

           0       0.24      0.36      0.29      2604
           1       0.99      0.99      0.99    224747

   micro avg       0.98      0.98      0.98    227351
   macro avg       0.62      0.67      0.64    227351
weighted avg       0.98      0.98      0.98    227351

Best Accuracy Score:===>  0.9799736970587329


## Conclusion

## Reflect

Imagine you are data scientist that has been tasked with developing a system to save your 
company money by predicting and preventing back orders of parts in the supply chain.

Write a **brief summary** for "management" that details your findings, 
your level of certainty and trust in the models, 
and recommendations for operationalizing these models for the business.

# Save your notebook!
## Then `File > Close and Halt`