***
***


<br><h1>Final Model  |       Apprentice Chef, Inc. - Business Case</h1>
<h3>MSc in Business Analytics           |       Machine Learning - Python</h3>
<br>Jorge Hernández Jiménez - Marketing Analyst<br>
Hult International Business School<br>

<a href="https://github.com/jhj95">GitHub</a> <br>
<a href="https://www.linkedin.com/in/jorge-hernandez-jimenez/">LinkedIn</a><br><br><br>

***
***

This is the final regression model of a full analysis on the Apprentice Chef, Inc. business case.
It is an analysis on a business case created by my Professor <a href="https://www.linkedin.com/in/kusterer/">Chase B. Kusterer</a> in the Machine Learning course from the MSc in Business Analytics at Hult International Business School in San Francisco. The case consists on a digital food enterprise, Apprentice Chef, Inc. that: “offer a wide selection of daily-prepared gourmet meals delivered directly to your door.”.

The model is build based on supervised learning. After, test 7 different model this is the best performance with a Train Score of 0.7487 and a Test Score of  0.7171. It uses a Logistic Regression (OLS) from the machine learning library Scikit-Learn.

If you want to see the full analysis just click the next link: <a href="https://github.com/jhj95/Supervised-Learning-Analysis-Apprentice-Chef-Inc./blob/master/Supervised%20Learning%20Analysis%20%7C%20Apprentice%20Chef%2C%20Inc.%20-%20Business%20Case.ipynb">Supervised Learning Analysis Apprentice Chef, Inc. - Business Case</a>



<h2>Index</h2>

* <h4> Import Packages </h4>

    
* <h4> Load Data </h4>

    
* <h4> Feature Engineering </h4>
    
    
* <h4> Train & Test Split </h4> 


* <h4> Final Model (instantiate, fit, and predict) </h4>


* <h4> Final Model Score (score) </h4><br><br>
    
    
*
Documentation: <a href="https://scikit-learn.org/stable/">Scikit-Learn</a> ( <a href="https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares">OLS</a> )

<br>

In [1]:
# importing datetime to check the length time of the code
from datetime import datetime

#starting to count the time
startTime = datetime.now()

In [2]:
################################################################################
# Import Packages
################################################################################

import numpy                    as   np      # data science essentials
import pandas                   as   pd      # data science essentials
import matplotlib.pyplot        as   plt     # essential graphical output
import seaborn                  as   sns     # enhanced graphical output
import statsmodels.formula.api  as   smf     # regression modeling
import sklearn.linear_model                  # (scikit-learn)linear models (LinearRegression, Ridge, Lasso, ARD)

from sklearn.model_selection    import train_test_split        # train and test split
from sklearn.neighbors          import KNeighborsRegressor     # KNN for Regression
from sklearn.preprocessing      import StandardScaler          # standard scaler


In [3]:
################################################################################
# Load Data
################################################################################

# setting pandas print options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)


# specifying file name
file = 'Apprentice_Chef_Dataset.xlsx'

# reading the file into Python and naming it apprentice
apprentice = pd.read_excel(file)


In [4]:
################################################################################
# Feature Engineering
################################################################################

# REVENUE is going to be the dependent variable of our model. Because what we want to 
#predict is: "the revenue that each customer is going to pay in his/her first year using this platform."

# However, because REVENUE has really high values, we are going to use instead 
# the logarithm of REVENUE with the variable: LN_REVENUE. This does not affect 
# at all our predictive model and enhance its results, THAT'S AWESOME, RIGHT?
    
# creating a variable with the log of Revenue
LN_REVENUE = np.log(apprentice['REVENUE'])

apprentice['LN_REVENUE'] = pd.Series(LN_REVENUE)



############################
###  Outlier Thresholds  ###
############################

# creating outlier thresholds
UNIQUE_MEALS_PURCH_HI             =     6
MASTER_CLASSES_ATTENDED_HI        =     2
TOTAL_PHOTOS_VIEWED_HI            =     300


# creating a list with all the thresholds to ease the funcitionality of the for loop:
outliers_1 = [['UNIQUE_MEALS_PURCH_HI', UNIQUE_MEALS_PURCH_HI],
              ['MASTER_CLASSES_ATTENDED_HI', MASTER_CLASSES_ATTENDED_HI],
              ['TOTAL_PHOTOS_VIEWED_HI', TOTAL_PHOTOS_VIEWED_HI]]


# for loop to develop features (columns) for outliers:
for x,y in outliers_1:
    
    # setting a title for the column
    out = 'OUT_' + x
        
    # creating a column with value = 0
    apprentice[out.replace('_HI', '')] = 0
        
    # stating the condition to be 'HI'
    condition_hi  = apprentice.loc[0:,out.replace('_HI', '')][apprentice[x.replace('_HI', '')] > y]
        
    # replacing the value 0 for 1 for the case where the user fulfill the condition
    apprentice[out.replace('_HI', '')].replace(to_replace = condition_hi,
                                               value      = 1,
                                               inplace    = True)
    
    

############################
#  Trend-based Thresholds  #
############################

# creating trend-based thresholds
AVG_PREP_VID_TIME_CHANGE_HI               =     300          # data scatters above this point
TOTAL_PHOTOS_VIEWED_CHANGE_HI             =     500          # data scatters above this point
AVG_CLICKS_PER_VISIT_CHANGE_HI            =     11           # trend changes above this point
CONTACTS_W_CUSTOMER_SERVICE_CHANGE_HI     =     10           # trend changes above this point
MASTER_CLASSES_ATTENDED_CHANGE_AT         =     3            # different at 3
TOTAL_PHOTOS_VIEWED_CHANGE_AT             =     0            # zero inflated


# creating a list with all the thresholds to ease the funcitionality of the for loop:
outliers_2 = [['AVG_PREP_VID_TIME_CHANGE_HI', AVG_PREP_VID_TIME_CHANGE_HI],
              ['TOTAL_PHOTOS_VIEWED_CHANGE_HI', TOTAL_PHOTOS_VIEWED_CHANGE_HI],
              ['AVG_CLICKS_PER_VISIT_CHANGE_HI', AVG_CLICKS_PER_VISIT_CHANGE_HI],
              ['CONTACTS_W_CUSTOMER_SERVICE_CHANGE_HI', CONTACTS_W_CUSTOMER_SERVICE_CHANGE_HI],
              ['MASTER_CLASSES_ATTENDED_CHANGE_AT', MASTER_CLASSES_ATTENDED_CHANGE_AT],
              ['TOTAL_PHOTOS_VIEWED_CHANGE_AT', TOTAL_PHOTOS_VIEWED_CHANGE_AT]]


# for loop to develope features (columns) for outliers
for x,y in outliers_2:
    
    # setting a title for the column
    change = 'CHANGE_' + x
    
    ########################################
    ## change above threshold             ##
    ########################################
    if 'HI' in x:
        
        # creating a column with value = 0
        apprentice[change.replace('_CHANGE_HI', '')] = 0
        
        # stating the condition to be 'HI'
        condition  = apprentice.loc[0:, change.replace('_CHANGE_HI', '')]\
                     [apprentice[x.replace('_CHANGE_HI', '')] > y]
        
        # replacing the value 0 for 1 for the case where the user fulfill the condition
        apprentice[change.replace('_CHANGE_HI', '')].replace(to_replace = condition,
                                            value      = 1,
                                            inplace    = True)
    
    ########################################
    ## change at threshold                ##
    ########################################
    elif 'AT' in x:
        
        # creating a column with value = 0
        apprentice[change.replace('_CHANGE_AT', '')] = 0
        
        # stating the condition to be 'AT'
        condition = apprentice.loc[0:, change.replace('_CHANGE_AT', '')]\
                    [apprentice[x.replace('_CHANGE_AT', '')] == y]
        
        # replacing the value 0 for 1 for the case where the user fulfill the condition
        apprentice[change.replace('_CHANGE_AT', '')].replace(to_replace = condition,
                                                      value      = 1,
                                                      inplace    = True)

        
        

In [5]:
################################################################################
# Train/Test Split
################################################################################

# Independent Variables in model 
x_variables1 = ['CROSS_SELL_SUCCESS', 'TOTAL_MEALS_ORDERED', 'UNIQUE_MEALS_PURCH', 
               'CONTACTS_W_CUSTOMER_SERVICE', 'CANCELLATIONS_AFTER_NOON', 'PACKAGE_LOCKER',
               'AVG_PREP_VID_TIME', 'LARGEST_ORDER_SIZE', 'MASTER_CLASSES_ATTENDED', 
               'MEDIAN_MEAL_RATING', 'OUT_UNIQUE_MEALS_PURCH', 'OUT_MASTER_CLASSES_ATTENDED',
               'OUT_TOTAL_PHOTOS_VIEWED', 'CHANGE_AVG_PREP_VID_TIME', 'CHANGE_TOTAL_PHOTOS_VIEWED',
               'CHANGE_AVG_CLICKS_PER_VISIT', 'CHANGE_CONTACTS_W_CUSTOMER_SERVICE',
               'CHANGE_MASTER_CLASSES_ATTENDED']



print(f"""Model :" {x_variables1}""")
    
# applying modelin scikit-learn

# preparing x-variables
apprentice_data   = apprentice.loc[: , x_variables1]  # Here you just need to change the different 
                                                          # x_variables lists to check which 
                                                          # is the best model, I left 
                                                          # model 1 because it's the winner.


# preparing response variable
apprentice_target = apprentice.loc[: , 'LN_REVENUE']


# running train/test split again
X_train, X_test, y_train, y_test = train_test_split(
                                                    apprentice_data,
                                                    apprentice_target,
                                                    test_size    =  0.25, # test data is going to be 25%
                                                    random_state =  222)  # random state set in the
                                                                              # business case


Model :" ['CROSS_SELL_SUCCESS', 'TOTAL_MEALS_ORDERED', 'UNIQUE_MEALS_PURCH', 'CONTACTS_W_CUSTOMER_SERVICE', 'CANCELLATIONS_AFTER_NOON', 'PACKAGE_LOCKER', 'AVG_PREP_VID_TIME', 'LARGEST_ORDER_SIZE', 'MASTER_CLASSES_ATTENDED', 'MEDIAN_MEAL_RATING', 'OUT_UNIQUE_MEALS_PURCH', 'OUT_MASTER_CLASSES_ATTENDED', 'OUT_TOTAL_PHOTOS_VIEWED', 'CHANGE_AVG_PREP_VID_TIME', 'CHANGE_TOTAL_PHOTOS_VIEWED', 'CHANGE_AVG_CLICKS_PER_VISIT', 'CHANGE_CONTACTS_W_CUSTOMER_SERVICE', 'CHANGE_MASTER_CLASSES_ATTENDED']


In [6]:
################################################################################
# Final Model (instantiate, fit, and predict)
################################################################################

# INSTANTIATING a model object
lr = sklearn.linear_model.LinearRegression()


# FITTING to the training data
lr_fit = lr.fit(X_train, y_train)


# PREDICTING on new data
lr_pred = lr_fit.predict(X_test)


# saving scoring data for future use
lr_train_score = lr.score(X_train, y_train).round(4)
lr_test_score  = lr.score(X_test, y_test).round(4)


In [7]:
################################################################################
# Final Model Score (score)
################################################################################

print(f"""
Model      Train Score      Test Score
-----      -----------      ----------
OLS        {lr_train_score}           {lr_test_score}
""")
    
    
# creating a dictionary for model results
model_performance = {'Model'    : ['OLS'],
                     'Training' : [lr_train_score],
                     'Testing'  : [lr_test_score]}
    
    
# converting model_performance into a DataFrame
model_performance = pd.DataFrame(model_performance)


Model      Train Score      Test Score
-----      -----------      ----------
OLS        0.7487           0.7171



In [8]:
# checking the length time of the code
print(datetime.now() - startTime)

0:00:08.127798
