# **Analysis Report to Management**
by: Kwadwo Kyei

## Introduction


This report on Apprentice Chef covers two projects. The first estimates how much revenue Apprentice Chef can expect from each customer within their first year of using the service. The second analyzes Apprentice Chef’s new program, Halfway There, and predicts which customers will subscribe to receive half a bottle of wine from a local California vineyard every Wednesday.


## Project 1

In the first case analysis, I derived meaningful information about what drives revenue per customer. Working with a wide variety of explanatory variables, I singled out complaints (the ratio of customer service contacts to total meals ordered) as an important predictor of expected revenue. The variable was statistically significant, with a p-value below the alpha level of 0.05. Its coefficient is negative: holding the other variables constant, each one-unit increase in complaints is associated with a decrease of 0.4411 in the model's dependent variable, which in the regression shown under Insights for Project 1 is the base-10 log of revenue.
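
Because the regression's dependent variable is `log_REVENUE` (base-10 log of revenue), the coefficient is easier to read once converted to a percentage change. The following is a minimal sketch of that conversion, assuming the -0.4411 coefficient quoted above is on the log10 scale; the helper name `pct_change_from_log10_coef` is hypothetical and not part of the original analysis.

```python
def pct_change_from_log10_coef(coef, delta=1.0):
    """Approximate percent change in the original-scale outcome for a
    `delta` change in a predictor, given a coefficient fitted on
    log10(outcome)."""
    return (10 ** (coef * delta) - 1) * 100


# Coefficient quoted in the report (assumed to be on the log10 scale)
complaints_coef = -0.4411

# Effect of a full one-unit increase in the complaints ratio
full_unit = pct_change_from_log10_coef(complaints_coef)

# Effect of a smaller 0.1 increase in the complaints ratio
tenth_unit = pct_change_from_log10_coef(complaints_coef, delta=0.1)

print(f"One-unit increase: {full_unit:.1f}% change in revenue")
print(f"0.1 increase:      {tenth_unit:.1f}% change in revenue")
```

Since complaints is a ratio of customer service contacts to meals ordered, a full one-unit change is large, so the smaller `delta` is likely the more practical reading.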

## Project 2

For the second case, the cross-sell promotion for “Halfway There”, I used feature importance to select the most valuable variables for my models. Total meals ordered and average time spent on the website were important in determining whether a customer would subscribe to the “Halfway There” wine program; each had a feature importance above 0.10.
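
As an illustration of the selection step described above, here is a minimal sketch of reading feature importances from a fitted scikit-learn tree and keeping those above 0.10. It uses synthetic stand-in data, not the Apprentice Chef dataset; the column names `total_meals`, `avg_site_time`, and `noise` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical; not the Apprentice Chef dataset)
rng = np.random.default_rng(219)
X = pd.DataFrame({
    "total_meals":   rng.integers(10, 200, size=500),
    "avg_site_time": rng.normal(100, 30, size=500),
    "noise":         rng.normal(0, 1, size=500),
})

# The target depends on the first two columns only
y = ((X["total_meals"] > 100) | (X["avg_site_time"] > 130)).astype(int)

# Fit a shallow classification tree
tree = DecisionTreeClassifier(max_depth=4, random_state=219).fit(X, y)

# Pair each importance with its column name and sort
importances = pd.Series(tree.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

# Keep only features above the 0.10 threshold mentioned in the report
selected = importances[importances > 0.10]
print(selected)
```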

## Insights for Project 1

My best-performing revenue model was the OLS regression, with an R-squared of 0.752.

In [13]:
# Importing libraries
import numpy as np                     # used for the log transformation
import pandas as pd                    # data science main library
import statsmodels.formula.api as smf  # regression model

# Specifying the file path
file = './Apprentice_Chef_Case_Info/Apprentice_Chef_Dataset.xlsx' # an Excel file


# Reading the file into Python
Chef = pd.read_excel(io = file) # pd.read_excel because the file is not a CSV

In [14]:
# log transforming REVENUE (base 10) and saving it to the dataset
Chef['log_REVENUE'] = np.log10(Chef['REVENUE']).round(2)

In [15]:
## Creating new variables
## Total meals ordered per unique meal purchased
Chef.loc[: , "frequency_of_unique_orders"] = (Chef.loc[:,"TOTAL_MEALS_ORDERED"]/
                                                   Chef.loc[:,"UNIQUE_MEALS_PURCH"]).round(2)

## Percentage of bad orders ie the inconvenience caused to the customer
Chef.loc[:,"inconvenience"] = ((Chef.loc[:,"EARLY_DELIVERIES"]+Chef.loc[:,"LATE_DELIVERIES"])/
                                    Chef.loc[:,"TOTAL_MEALS_ORDERED"]).round(2)


## Ratio of customer care contact with number of orders
Chef.loc[ : , "complaints"] = (Chef.loc[:,"CONTACTS_W_CUSTOMER_SERVICE"]/
                                              Chef.loc[:,"TOTAL_MEALS_ORDERED"]).round(2)
## Ratio of early deliveries per total delivery
Chef.loc[ : , "percentage_early_deliveries_over_total"] = (Chef.loc[:,"EARLY_DELIVERIES"]/
                                              Chef.loc[:,"TOTAL_MEALS_ORDERED"]).round(2)
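
Each engineered feature above is a ratio, so a zero denominator would produce `inf` values that can break a regression. Whether the dataset actually contains zeros in `TOTAL_MEALS_ORDERED` or `UNIQUE_MEALS_PURCH` is not confirmed here; the following is only a defensive sketch on a small made-up frame, and the helper `safe_ratio` is hypothetical.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator, denominator, decimals=2):
    """Divide two Series, returning NaN wherever the denominator is zero."""
    return (numerator / denominator.replace(0, np.nan)).round(decimals)


# Made-up rows for illustration only
demo = pd.DataFrame({
    "CONTACTS_W_CUSTOMER_SERVICE": [4, 6, 3],
    "TOTAL_MEALS_ORDERED":         [20, 0, 12],
})

demo["complaints"] = safe_ratio(demo["CONTACTS_W_CUSTOMER_SERVICE"],
                                demo["TOTAL_MEALS_ORDERED"])
print(demo)
```

Rows that come back as `NaN` can then be dropped or imputed explicitly before modeling.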

In [16]:
## Best OLS model ##

# Step 1: INSTANTIATE a model object
lm_best = smf.ols(formula = """log_REVENUE ~ complaints +
MEDIAN_MEAL_RATING +
AVG_PREP_VID_TIME +
LARGEST_ORDER_SIZE +
inconvenience +
TOTAL_MEALS_ORDERED +
CONTACTS_W_CUSTOMER_SERVICE +
frequency_of_unique_orders +
percentage_early_deliveries_over_total +
CROSS_SELL_SUCCESS +
MASTER_CLASSES_ATTENDED +
TOTAL_PHOTOS_VIEWED""",
                  data = Chef)


# Step 2: FIT the data into the model object
results = lm_best.fit()


# Step 3: analyze the SUMMARY output
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:            log_REVENUE   R-squared:                       0.752
Model:                            OLS   Adj. R-squared:                  0.750
Method:                 Least Squares   F-statistic:                     487.6
Date:                Mon, 15 Feb 2021   Prob (F-statistic):               0.00
Time:                        23:36:19   Log-Likelihood:                 1615.5
No. Observations:                1946   AIC:                            -3205.
Df Residuals:                    1933   BIC:                            -3132.
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
                                             coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------
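
The coefficient table is cut off in the output above. One way to read a single coefficient and its p-value directly from a fitted statsmodels result, rather than from the printed summary, is sketched below. The data is synthetic and the names `demo_df`, `demo_results`, `x`, and `y` are hypothetical; on the real model the same attributes would be read from `results`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a known negative slope (illustration only)
rng = np.random.default_rng(219)
demo_df = pd.DataFrame({"x": rng.uniform(0, 1, size=200)})
demo_df["y"] = 3.0 - 0.5 * demo_df["x"] + rng.normal(0, 0.1, size=200)

# Fit an OLS model with the formula interface
demo_results = smf.ols(formula="y ~ x", data=demo_df).fit()

# params and pvalues are pandas Series indexed by term name
coef    = demo_results.params["x"]
p_value = demo_results.pvalues["x"]

print(f"coefficient: {coef:.4f}")
print(f"p-value:     {p_value:.4g}")
print("significant at alpha = 0.05:", p_value < 0.05)
```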

## Insights for Project 2

My best cross-sell model by AUC was the tuned classification tree, with an AUC score of 0.7320.

In [17]:
# importing libraries
import numpy as np
import random            as rand                     # random number gen
import pandas            as pd                       # data science essentials
import matplotlib.pyplot as plt                      # data visualization
import seaborn           as sns                      # enhanced data viz
from sklearn.model_selection import train_test_split # train-test split
from sklearn.linear_model import LogisticRegression  # logistic regression
import statsmodels.formula.api as smf                # logistic regression
from sklearn.metrics import confusion_matrix         # confusion matrix
from sklearn.metrics import roc_auc_score            # auc score
from sklearn.neighbors import KNeighborsClassifier   # KNN for classification
from sklearn.neighbors import KNeighborsRegressor    # KNN for regression
from sklearn.preprocessing import StandardScaler     # standard scaler
# libraries for classification trees
from sklearn.tree import DecisionTreeClassifier      # classification trees
from sklearn.tree import export_graphviz             # exports graphics
from six import StringIO                             # saves objects in memory
from IPython.display import Image                    # displays on frontend
import pydotplus                                     # interprets dot objects
from sklearn.model_selection import RandomizedSearchCV     # hyperparameter tuning
from sklearn.metrics import make_scorer              # customizable scorer
from sklearn.ensemble import RandomForestClassifier     # random forest
from sklearn.ensemble import GradientBoostingClassifier # gbm


# setting pandas print options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 100)


# specifying file name
file = './Apprentice_Chef_Case_Info/Apprentice_Chef_Dataset.xlsx' # an Excel file

# reading the file into Python
ap_chef = pd.read_excel(io = file)


In [18]:
# log transforming REVENUE and saving it to the dataset.
ap_chef['log_REVENUE'] = np.log10(ap_chef['REVENUE'])

In [19]:
# making a copy of the Apprentice Chef data
ap_chef_data = ap_chef.copy()


#The following code is meant to drop the potential Y-variable (CROSS_SELL_SUCCESS) 
# as well as categorical variables not used in models like:
# NAME, EMAIL, FIRST_NAME AND FAMILY_NAME. This is so that these variables
# don't accidentally end up on the X-side of the model to be developed.
ap_chef_data = ap_chef_data.drop([ 'CROSS_SELL_SUCCESS',
                                     'NAME',
                                     'EMAIL',
                                     'FIRST_NAME',
                                     'FAMILY_NAME'], axis = 1)


# preparing response variables
ap_chef_target = ap_chef.loc[ : , 'CROSS_SELL_SUCCESS']


# train-test split with stratification
X_train, X_test, y_train, y_test = train_test_split(
            ap_chef_data,
            ap_chef_target,
            test_size    = 0.25,
            random_state = 219,
            stratify     = ap_chef_target)


# merging training data for statsmodels
ap_chef_train = pd.concat([X_train, y_train], axis = 1)

In [21]:
# declaring a hyperparameter space
criterion_space = ['gini', 'entropy']
splitter_space  = ['best', 'random']
depth_space     = np.arange(1, 25, 1)
leaf_space      = np.arange(1, 100, 1)


# creating a hyperparameter grid
param_grid = {'criterion'        : criterion_space,
              'splitter'         : splitter_space,
              'max_depth'        : depth_space,
              'min_samples_leaf' : leaf_space}


# INSTANTIATING the model object without hyperparameters
tuned_tree = DecisionTreeClassifier(random_state = 219)


# RandomizedSearchCV object
tuned_tree_cv = RandomizedSearchCV(estimator             = tuned_tree,
                                   param_distributions   = param_grid,
                                   cv                    = 6,
                                   n_iter                = 1000,
                                   random_state          = 219,
                                   # AUC scored on hard class predictions
                                   scoring               = make_scorer(roc_auc_score))


# FITTING to the FULL DATASET (due to cross-validation)
tuned_tree_cv.fit(ap_chef_data, ap_chef_target)


# PREDICT step is not needed


# printing the optimal parameters and best score
print("Tuned Parameters  :", tuned_tree_cv.best_params_)
print("Tuned Training AUC:", tuned_tree_cv.best_score_.round(4))



Tuned Parameters  : {'splitter': 'random', 'min_samples_leaf': 11, 'max_depth': 19, 'criterion': 'gini'}
Tuned Training AUC: 0.5604


In [23]:
# building a model based on hyperparameter tuning results

# RETRIEVING the best classification tree found by the search
tree_tuned = tuned_tree_cv.best_estimator_


# FIT step is not needed


# PREDICTING based on the testing set
tree_tuned_pred = tree_tuned.predict(X_test)


# SCORING the results
print('Training ACCURACY:', tree_tuned.score(X_train, y_train).round(4))
print('Testing  ACCURACY:', tree_tuned.score(X_test, y_test).round(4))
print('AUC Score        :', roc_auc_score(y_true  = y_test,
                                          y_score = tree_tuned_pred).round(4))


# saving scoring data for future use
tree_tuned_train_score = tree_tuned.score(X_train, y_train).round(4) # accuracy
tree_tuned_test_score  = tree_tuned.score(X_test, y_test).round(4)   # accuracy


# saving the AUC score
tree_tuned_auc         = roc_auc_score(y_true  = y_test,
                                     y_score = tree_tuned_pred).round(4) # auc

Training ACCURACY: 0.7279
Testing  ACCURACY: 0.7536
AUC Score        : 0.6764
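
The AUC above is computed from hard class predictions (`tree_tuned.predict`). `roc_auc_score` also accepts predicted probabilities, which lets it evaluate every classification threshold rather than a single one. The following is a minimal sketch of the two approaches side by side on synthetic data, not the Apprentice Chef dataset; all variable names are hypothetical and the hyperparameters are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustration only)
X_demo, y_demo = make_classification(n_samples=1000,
                                     n_features=8,
                                     random_state=219)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25,
                                          random_state=219,
                                          stratify=y_demo)

# A small tree with arbitrary hyperparameters
demo_tree = DecisionTreeClassifier(max_depth=4,
                                   min_samples_leaf=11,
                                   random_state=219).fit(X_tr, y_tr)

# AUC from hard 0/1 predictions (the approach used in the cell above)
auc_labels = roc_auc_score(y_true=y_te, y_score=demo_tree.predict(X_te))

# AUC from the predicted probability of the positive class
auc_probs = roc_auc_score(y_true=y_te,
                          y_score=demo_tree.predict_proba(X_te)[:, 1])

print("AUC from class labels :", round(auc_labels, 4))
print("AUC from probabilities:", round(auc_probs, 4))
```

The two numbers generally differ, so it helps to state which one a reported AUC refers to.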


## Conclusion

In conclusion, to increase revenue per customer within the first year, we need to reduce the number of customer complaints. To reach more customers with the new wine promotion, we should target customers who order many meals and spend a lot of time on the website.