# A2: Classification Modeling Case Study
## Apprentice Chef - Machine Learning

Student: Markus Proesch
Cohort: 4
Hult International Business School

Deliverables: 
- Analysis of Apprentice Chef's customer data.
- Build insight about the Halfway There promotion, which offers customers over 21 a half bottle of CA wine.
- Build a model to predict cross-sell success.

This assignment encompasses feature engineering, variable selection, and model development.


Import the following packages to run the script and get the right output.

Exploring the dataset, classes and features

In [1]:
import pandas as pd                                      # data science essentials
import matplotlib.pyplot as plt                          # data visualization
import seaborn as sns                                    # enhanced data visualization
import statsmodels.formula.api as smf                    # explanatory model 
from sklearn.model_selection import train_test_split     # divides dataset into a train and test set
from sklearn.linear_model import LogisticRegression      # Logistic Regression
from sklearn.metrics import confusion_matrix             # confusion matrix
from sklearn.metrics import roc_auc_score                # Calculating the ROC and AUC
from sklearn.neighbors import KNeighborsClassifier       # KNN for classification
from sklearn.preprocessing import StandardScaler         # Standardizing values
from sklearn.tree import DecisionTreeClassifier          # classification trees
from sklearn.ensemble import GradientBoostingClassifier  # Gradient Booster for classification
from sklearn.ensemble import RandomForestClassifier      # Random Forest for classification   
from sklearn.model_selection import GridSearchCV         # hyperparameter tuning
from sklearn.metrics import make_scorer                  # customizable scorer

In [2]:
# Load the Excel file 
original_df  = pd.read_excel('Apprentice Chef Dataset.xlsx')

# Explore the data, its structure and its columns
#original_df.info()
#original_df.describe()

# Exploring the original variables in the dataset
#original_df.columns

# Feature engineering:
Developing different features that will be used later in the analysis. 

- AVG_PRICE_PER_MEAL is the average price a customer paid per meal ordered, including the Weekly Plan subscription.

- Missing values in FAMILY_NAME were flagged at first. Further analysis of the dataset showed that for approximately 1/4 of the customers the first name was equal to FAMILY_NAME, so missing values in FAMILY_NAME were filled with the value in FIRST_NAME and the flag was dropped.

- The emails are divided into subgroups based on their domain. The different domain groups are:

Professional: 
@mmm.com, @amex.com, @apple.com, @boeing.com, @caterpillar.com, @chevron.com, @cisco.com, @cocacola.com
@disney.com, @dupont.com, @exxon.com, @ge.org, @goldmansacs.com, @homedepot.com, @ibm.com, @intel.com
@jnj.com, @jpmorgan.com, @mcdonalds.com, @merck.com, @microsoft.com, @nike.com, @pfizer.com, @pg.com, @travelers.com
@unitedtech.com, @unitedhealth.com, @verizon.com, @visa.com, @walmart.com

Personal: 
@gmail.com, @yahoo.com, @protonmail.com

Junk: 
@me.com, @aol.com, @hotmail.com, @live.com, @msn.com, @passport.com
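The grouping rule above can be sketched in plain Python, independent of the DataFrame. This is a minimal illustration rather than the notebook's own code: `domain_group` is a hypothetical helper, the professional list is shortened to four entries, and the example addresses are made up. As in the cell below, every domain that is neither personal nor professional falls into the junk group.

```python
# Hypothetical helper; PROFESSIONAL_DOMAINS is only a subset of the full list above
PERSONAL_DOMAINS     = {'gmail.com', 'yahoo.com', 'protonmail.com'}
PROFESSIONAL_DOMAINS = {'mmm.com', 'amex.com', 'apple.com', 'boeing.com'}


def domain_group(email):
    """Return the domain group label for one email address."""
    # Everything after the last '@' is the domain
    domain = email.rsplit('@', 1)[-1]

    if domain in PERSONAL_DOMAINS:
        return 'PERSONAL_DOMAIN'
    if domain in PROFESSIONAL_DOMAINS:
        return 'PROFESSIONAL_DOMAIN'
    # Any domain that is neither personal nor professional is treated as junk
    return 'JUNK_DOMAIN'


# Made-up example addresses
groups = [domain_group(e) for e in ['jane@gmail.com', 'joe@apple.com', 'sam@aol.com']]
print(groups)
```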

In [3]:
# Add Avg. Price per Meal variable
original_df['AVG_PRICE_PER_MEAL'] = original_df['REVENUE']/original_df['TOTAL_MEALS_ORDERED']


# Flagging missing values: every column with at least one missing value
# gets an mv_<column> indicator (1 = missing, 0 = present)
for col in original_df:

    if original_df[col].isnull().any():
        original_df['mv_'+col] = original_df[col].isnull().astype(int)

# Filled NA in FAMILY_NAME with same name as in FIRST_NAME since that seems to be the way to do it
original_df['FAMILY_NAME'] = original_df['FAMILY_NAME'].fillna(original_df['FIRST_NAME'])

# Drop the flagged missing values in FAMILY_NAME as they are replaced with FIRST_NAME
original_df = original_df.drop(columns = 'mv_FAMILY_NAME')            
            

# Dummy variables from the email domain.
# Split every email address at '@' into its name and its domain
email_df         = original_df['EMAIL'].str.split(pat = '@', expand = True)
email_df.columns = ['name', 'domain']

# Domain groups
personal_domain     = ['@gmail.com', '@yahoo.com','@protonmail.com']
professional_domain = ['@mmm.com', '@amex.com','@apple.com',
                      '@boeing.com','@caterpillar.com',
                      '@chevron.com','@cisco.com','@cocacola.com',
                      '@disney.com','@dupont.com','@exxon.com',
                      '@ge.org','@goldmansacs.com','@homedepot.com',
                      '@ibm.com','@intel.com','@jnj.com',
                      '@jpmorgan.com','@mcdonalds.com','@merck.com',
                      '@microsoft.com','@nike.com','@pfizer.com',
                      '@pg.com','@travelers.com','@unitedtech.com',
                      '@unitedhealth.com','@verizon.com','@visa.com',
                      '@walmart.com']
junk_domain         = ['@me.com', '@aol.com', '@hotmail.com', '@live.com',
                       '@msn.com','@passport.com']

# For loop categorising the different email domains
placeholder_lst = []

for domain in email_df['domain']:
    
    if '@' + domain in personal_domain:
        placeholder_lst.append('PERSONAL_DOMAIN')
    elif '@' + domain in professional_domain:
        placeholder_lst.append('PROFESSIONAL_DOMAIN')
    else:
        placeholder_lst.append('JUNK_DOMAIN')
        
# make the columns into a series to append it to original dataset        
email_df['DOMAIN_GROUP'] = pd.Series(placeholder_lst)

# Add the domain categories column to the original dataset 
original_df['DOMAIN'] = email_df['DOMAIN_GROUP']

# Get dummies from the domain variable (as 0/1 integers) and drop the original column
one_hot_DOMAIN = pd.get_dummies(original_df['DOMAIN'], dtype = int)

# Remove the old and add the 3 new columns
original_df           = original_df.drop('DOMAIN', axis = 1)
original_df           = original_df.join([one_hot_DOMAIN])



- NUMBER_NAMES is a count of how many names a customer has in NAME.

- SAME_NAME is a binary feature: 1 if the customer's FIRST_NAME and FAMILY_NAME are the same, and 0 if they are different.

- ATTENDED_MASTER_CLASS is a binary feature developed from MASTER_CLASSES_ATTENDED. Instead of counting the number of classes a customer attended, it is 1 if the customer attended one or more classes and 0 if the customer never attended a class.

- NOBLE is a binary feature: 1 if the customer's name marks them as part of the noble class, and 0 otherwise. The following parts of a name count as noble:
    - "zo", "mo" and "Mo" come from a foreign language and mean "of"
    - "of" is often a part of "Daughter of .." or "Son of .."
    - "Lord" and "Knight" (capitalized or lower case) are also a part of the noble class
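The three name-based features can be sketched for a single customer in plain Python. This is a minimal illustration under stated assumptions: `name_features` is a hypothetical helper, the marker list mirrors the substrings checked in the NOBLE cell below, and the example names are made up.

```python
# Substrings that mark a name as noble (same markers as in the NOBLE cell)
NOBLE_MARKERS = [' of ', 'lord', 'Lord', ' mo ', ' zo ', ' Mo ', 'Knight', 'knight']


def name_features(name, first_name, family_name):
    """Return NUMBER_NAMES, SAME_NAME and NOBLE for one customer."""
    return {
        # number of space-separated parts in the full name
        'NUMBER_NAMES': len(name.split(' ')),
        # 1 if first and family name are identical, otherwise 0
        'SAME_NAME': int(first_name == family_name),
        # 1 if any noble marker occurs anywhere in the full name, otherwise 0
        'NOBLE': int(any(marker in name for marker in NOBLE_MARKERS)),
    }


# Made-up example customers
noble_example = name_features('Jon of Winterhold', 'Jon', 'Winterhold')
plain_example = name_features('Maria', 'Maria', 'Maria')
print(noble_example)
print(plain_example)
```

Because the markers are matched as plain substrings, a marker such as 'lord' also matches inside longer words, so the flag is a heuristic rather than an exact rule.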
        

In [4]:
# Adding variable, counting the number of names in NAME column

def text_split_feature(col, df, sep=' ', new_col_name='number_of_names'):
    """
Splits values in a string Series (as part of a DataFrame) and counts the number
of resulting items. Automatically appends the count column to the DataFrame
that was passed in.

PARAMETERS
----------
col          : column to split
df           : DataFrame where column is located
sep          : string sequence to split by, default ' '
new_col_name : name of new column holding the item count, default
               'number_of_names'
"""
    
    # count the items produced by splitting each value in col on sep
    df[new_col_name] = df[col].apply(lambda text: len(text.split(sep)))
        
text_split_feature(col = 'NAME', df = original_df, new_col_name = 'NUMBER_NAMES' )



In [5]:
# Adding variable where FIRST NAME is the same as FAMILY NAME
placeholder_lst = []

for row,col in original_df.iterrows():
    if original_df.loc[row,'FIRST_NAME'] == original_df.loc[row,'FAMILY_NAME']:
        placeholder_lst.append(1)
    else:
        placeholder_lst.append(0)

# Adding the new variable to the original dataset
original_df['SAME_NAME'] = pd.Series(placeholder_lst)

In [6]:
# Making attending a master class into a binary variable
placeholder_lst = []

for row,col in original_df.iterrows():
    if original_df.loc[row,'MASTER_CLASSES_ATTENDED'] >= 1:
        placeholder_lst.append(1)
    else:
        placeholder_lst.append(0)

# Adding the new variable to the original dataset
original_df['ATTENDED_MASTER_CLASS'] = pd.Series(placeholder_lst)
original_df = original_df.drop(columns = 'MASTER_CLASSES_ATTENDED')

In [7]:
# Flagging NOBLE people in the customer list
placeholder_lst = []

for row,pattern in original_df.iterrows():
    if ' of ' in original_df.loc[row,'NAME'] or \
    'lord' in original_df.loc[row,'NAME'] or \
    'Lord' in original_df.loc[row,'NAME'] or \
    ' mo ' in original_df.loc[row,'NAME'] or \
    ' zo ' in original_df.loc[row,'NAME'] or \
    ' Mo ' in original_df.loc[row,'NAME'] or \
    'Knight' in original_df.loc[row, 'NAME'] or \
    'knight' in original_df.loc[row, 'NAME']:
        placeholder_lst.append(1)
    else:
        placeholder_lst.append(0)

original_df['NOBLE'] = pd.Series(placeholder_lst)



# Data exploration: Outliers and thresholds

Two for loops (commented out below) present every variable as a histogram (frequency) and as a scatter plot against CROSS_SELL_SUCCESS (relationship).

In [8]:
#original_df_no_char = original_df.drop(columns = ['NAME', 'EMAIL', 'FIRST_NAME', 'FAMILY_NAME'])

#for col in original_df_no_char:
    
#    fig, ax = plt.subplots(figsize = (10, 8))
    
#    plt.hist(original_df_no_char[col], bins = 100)
#    plt.xlabel(col)
#    plt.show()


In [9]:
#original_df_num = original_df.drop(columns = ['NAME', 'EMAIL', 'FIRST_NAME', 'FAMILY_NAME'])


#for col in original_df_num:
    
#    fig, ax = plt.subplots(figsize = (8, 6))
    
#    plt.scatter(x = original_df_num[col], y = 'CROSS_SELL_SUCCESS',
#                data = original_df_num, alpha = 0.6)
#    plt.xlabel(col)
#    plt.show()



In [10]:
# Outliers thresholds determined based on the histograms and scatterplots
revenue_hi                    = 6500
total_meals_hi                = 320
unique_meals_hi               = 12
contact_w_customer_service_hi = 12
avg_time_per_site_hi          = 400
cancel_before_noon_hi         = 7
late_deliveries_hi            = 15
avg_prep_video_hi             = 350
total_photoes_hi              = 900
avg_meal_price                = 120
follow_rec_pct_hi             = 30
follow_rec_pct_lo             = 1

# Outlier flags: 1 if the value lies beyond its threshold, otherwise 0

# FOLLOWED_RECOMMENDATIONS_PCT (flagged above the high AND below the low threshold)
original_df['out_FOLLOWED_RECOMMENDATIONS_PCT'] = (
    (original_df['FOLLOWED_RECOMMENDATIONS_PCT'] > follow_rec_pct_hi) |
    (original_df['FOLLOWED_RECOMMENDATIONS_PCT'] < follow_rec_pct_lo)).astype(int)

# REVENUE
original_df['out_REVENUE'] = (
    original_df['REVENUE'] > revenue_hi).astype(int)

# TOTAL_MEALS_ORDERED
original_df['out_TOTAL_MEALS_ORDERED'] = (
    original_df['TOTAL_MEALS_ORDERED'] > total_meals_hi).astype(int)

# UNIQUE_MEALS_PURCH
original_df['out_UNIQUE_MEALS_PURCH'] = (
    original_df['UNIQUE_MEALS_PURCH'] > unique_meals_hi).astype(int)

# CONTACTS_W_CUSTOMER_SERVICE
original_df['out_CONTACTS_W_CUSTOMER_SERVICE'] = (
    original_df['CONTACTS_W_CUSTOMER_SERVICE'] > contact_w_customer_service_hi).astype(int)

# AVG_TIME_PER_SITE_VISIT
original_df['out_AVG_TIME_PER_SITE_VISIT'] = (
    original_df['AVG_TIME_PER_SITE_VISIT'] > avg_time_per_site_hi).astype(int)

# CANCELLATIONS_BEFORE_NOON
original_df['out_CANCELLATIONS_BEFORE_NOON'] = (
    original_df['CANCELLATIONS_BEFORE_NOON'] > cancel_before_noon_hi).astype(int)

# LATE_DELIVERIES
original_df['out_LATE_DELIVERIES'] = (
    original_df['LATE_DELIVERIES'] > late_deliveries_hi).astype(int)

# AVG_PREP_VID_TIME
original_df['out_AVG_PREP_VID_TIME'] = (
    original_df['AVG_PREP_VID_TIME'] > avg_prep_video_hi).astype(int)

# TOTAL_PHOTOS_VIEWED
original_df['out_TOTAL_PHOTOS_VIEWED'] = (
    original_df['TOTAL_PHOTOS_VIEWED'] > total_photoes_hi).astype(int)

# AVG_PRICE_PER_MEAL
original_df['out_AVG_PRICE_PER_MEAL'] = (
    original_df['AVG_PRICE_PER_MEAL'] > avg_meal_price).astype(int)

In [11]:
# Correlation matrix, used to check each variable's correlation with CROSS_SELL_SUCCESS
# The correlations were key to finding the 2 important insights
#original_df_corr = original_df.corr().round(2)

# Heatmap gave a good overview over correlation within the dataset
#fig, ax = plt.subplots(figsize  = (20,20))

#sns.heatmap(original_df_corr, cmap = 'coolwarm',
#            square = True, annot = True,
#            linecolor = 'black', linewidths = 0.5)

# Looking at correlation with CROSS_SELL_SUCCESS
#print(original_df_corr['CROSS_SELL_SUCCESS'].sort_values(ascending=False))

# Finding correlation for FOLLOWED_RECOMMENDATIONS_PCT triggering the insight
#print(original_df_corr['FOLLOWED_RECOMMENDATIONS_PCT'].sort_values(ascending=False))
    
    

In [12]:
# Splitting the dataset into a train and test set for the statsmodel

original_df_data   = original_df.drop(columns = 'CROSS_SELL_SUCCESS')

original_df_target = original_df.loc[:,'CROSS_SELL_SUCCESS']

X_train, X_test, y_train, y_test = train_test_split(original_df_data,
                                                   original_df_target,
                                                   test_size = 0.25,
                                                   random_state = 222,
                                                   stratify = original_df_target)

# merging training data for statsmodels since it doesn't work the same way as scikit-learn
chef_train = pd.concat([X_train, y_train], axis = 1)

# Variable selection

The p-value threshold for keeping a variable is 0.15.
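The selection rule can be sketched with a plain dictionary. The p-values below are copied from the rows shown in this notebook's regression summary; `HYPOTHETICAL_VAR` and its p-value are made up purely to show a variable being dropped, and `P_THRESHOLD` is just an illustrative name. With a fitted statsmodels result, the same numbers are available through its `pvalues` attribute.

```python
# p-values from the regression summary in this notebook,
# plus one made-up variable (HYPOTHETICAL_VAR) that fails the threshold
p_values = {
    'REVENUE'                      : 0.077,
    'MOBILE_NUMBER'                : 0.000,
    'CANCELLATIONS_BEFORE_NOON'    : 0.000,
    'CANCELLATIONS_AFTER_NOON'     : 0.109,
    'EARLY_DELIVERIES'             : 0.073,
    'LATE_DELIVERIES'              : 0.051,
    'FOLLOWED_RECOMMENDATIONS_PCT' : 0.000,
    'AVG_CLICKS_PER_VISIT'         : 0.006,
    'HYPOTHETICAL_VAR'             : 0.420,   # illustration only, not in the dataset
}

P_THRESHOLD = 0.15

# Keep variables whose p-value is below the threshold, drop the rest
kept    = [var for var, p in p_values.items() if p < P_THRESHOLD]
dropped = [var for var, p in p_values.items() if p >= P_THRESHOLD]

print('Kept   :', kept)
print('Dropped:', dropped)
```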

In [13]:
# instantiating a logistic regression model object
#logistic_initial = smf.logit(formula   = """CROSS_SELL_SUCCESS ~ 
#                             FOLLOWED_RECOMMENDATIONS_PCT""",
#                             data = chef_train)


# FITTING the model object
#results_logistic = logistic_initial.fit()


# checking the results SUMMARY
#results_logistic.summary()

In [14]:
# For loop to print the variables in the right format for a statsmodels formula
#for col in original_df:
#    print(f"{col} +")

### VARIABLES REMOVED:
'TOTAL_MEALS_ORDERED', 'UNIQUE_MEALS_PURCH', 'CONTACTS_W_CUSTOMER_SERVICE', 'PRODUCT_CATEGORIES_VIEWED', 'AVG_TIME_PER_SITE_VISIT', 'TASTES_AND_PREFERENCES', 'MOBILE_LOGINS', 'PC_LOGINS', 'PACKAGE_LOCKER', 'REFRIGERATED_LOCKER', 'AVG_PREP_VID_TIME', 'LARGEST_ORDER_SIZE', 'MEDIAN_MEAL_RATING', 'TOTAL_PHOTOS_VIEWED',
'AVG_PRICE_PER_MEAL', 'WEEKLY_PLAN', 'out_REVENUE', 'out_TOTAL_MEALS_ORDERED', 'out_UNIQUE_MEALS_PURCH', 'out_CONTACTS_W_CUSTOMER_SERVICE', 'out_AVG_TIME_PER_SITE_VISIT', 'out_CANCELLATIONS_BEFORE_NOON',
'out_LATE_DELIVERIES', 'out_AVG_PREP_VID_TIME', 'out_TOTAL_PHOTOS_VIEWED', 'out_AVG_PRICE_PER_MEAL'

In [15]:
logistic_fitted_w_out = smf.logit(formula   = """CROSS_SELL_SUCCESS ~ 
REVENUE +
MOBILE_NUMBER +
CANCELLATIONS_BEFORE_NOON +
CANCELLATIONS_AFTER_NOON +
EARLY_DELIVERIES +
LATE_DELIVERIES +
FOLLOWED_RECOMMENDATIONS_PCT +
AVG_CLICKS_PER_VISIT +
JUNK_DOMAIN +
PROFESSIONAL_DOMAIN +
NUMBER_NAMES +
SAME_NAME +
ATTENDED_MASTER_CLASS +
NOBLE +
out_FOLLOWED_RECOMMENDATIONS_PCT +
out_LATE_DELIVERIES 
 """, data = chef_train)


# FITTING the model object
results_logistic = logistic_fitted_w_out.fit()


# checking the results SUMMARY
results_logistic.summary()

Optimization terminated successfully.
         Current function value: 0.353202
         Iterations 8


Dep. Variable:     CROSS_SELL_SUCCESS   No. Observations:   1459
Model:             Logit                Df Residuals:       1442
Method:            MLE                  Df Model:           16
Date:              Wed, 05 Feb 2020     Pseudo R-squ.:      0.4375
Time:              22:58:29             Log-Likelihood:     -515.32
converged:         True                 LL-Null:            -916.19
Covariance Type:   nonrobust            LLR p-value:        2.691e-160

                               coef      std err    z        P>|z|   [0.025    0.975]
Intercept                      -1.1058   0.824      -1.342   0.180   -2.720    0.509
REVENUE                        -0.0002   9.28e-05   -1.771   0.077   -0.000    1.75e-05
MOBILE_NUMBER                  0.8683    0.232      3.744    0.000   0.414     1.323
CANCELLATIONS_BEFORE_NOON      0.2472    0.053      4.632    0.000   0.143     0.352
CANCELLATIONS_AFTER_NOON       -0.2809   0.175      -1.602   0.109   -0.624    0.063
EARLY_DELIVERIES               0.0628    0.035      1.795    0.073   -0.006    0.131
LATE_DELIVERIES                0.0571    0.029      1.952    0.051   -0.000    0.114
FOLLOWED_RECOMMENDATIONS_PCT   0.0451    0.006      7.640    0.000   0.034     0.057
AVG_CLICKS_PER_VISIT           -0.1109   0.040      -2.773   0.006   -0.189    -0.033


In [16]:
# Variable dictionary for the significant variables and the full dataset

variable_dict = {
    'logit_sig' : ['REVENUE', 'MOBILE_NUMBER','CANCELLATIONS_BEFORE_NOON',
                   'CANCELLATIONS_AFTER_NOON','EARLY_DELIVERIES', 'LATE_DELIVERIES',
                   'FOLLOWED_RECOMMENDATIONS_PCT', 'AVG_CLICKS_PER_VISIT',
                    'JUNK_DOMAIN', 'PROFESSIONAL_DOMAIN', 'NUMBER_NAMES',
                   'SAME_NAME','ATTENDED_MASTER_CLASS', 'out_FOLLOWED_RECOMMENDATIONS_PCT',
                   'NOBLE'],
    
    
   'logit_full' : ['REVENUE', 'TOTAL_MEALS_ORDERED',
                   'UNIQUE_MEALS_PURCH', 'CONTACTS_W_CUSTOMER_SERVICE', 
                   'PRODUCT_CATEGORIES_VIEWED', 'AVG_TIME_PER_SITE_VISIT', 
                   'MOBILE_NUMBER', 'CANCELLATIONS_BEFORE_NOON',
                   'CANCELLATIONS_AFTER_NOON', 'TASTES_AND_PREFERENCES',
                   'MOBILE_LOGINS', 'PC_LOGINS', 'EARLY_DELIVERIES', 
                   'LATE_DELIVERIES', 'PACKAGE_LOCKER', 'REFRIGERATED_LOCKER',
                   'FOLLOWED_RECOMMENDATIONS_PCT', 'AVG_PREP_VID_TIME', 
                   'LARGEST_ORDER_SIZE', 'MEDIAN_MEAL_RATING', 
                   'AVG_CLICKS_PER_VISIT', 'TOTAL_PHOTOS_VIEWED', 
                   'AVG_PRICE_PER_MEAL', 'JUNK_DOMAIN', 'PROFESSIONAL_DOMAIN', 'NUMBER_NAMES', 'SAME_NAME',
                   'NOBLE', 'WEEKLY_PLAN',
                   'out_FOLLOWED_RECOMMENDATIONS_PCT', 'out_REVENUE',
                   'out_TOTAL_MEALS_ORDERED', 'out_UNIQUE_MEALS_PURCH',
                   'out_CONTACTS_W_CUSTOMER_SERVICE', 
                   'out_AVG_TIME_PER_SITE_VISIT', 'out_CANCELLATIONS_BEFORE_NOON',
                   'out_LATE_DELIVERIES', 'out_AVG_PREP_VID_TIME', 
                   'out_TOTAL_PHOTOS_VIEWED', 'out_AVG_PRICE_PER_MEAL']
}


# Model development
The significant variables are divided into a train and a test set.

The split is stratified on FOLLOWED_RECOMMENDATIONS_PCT because it is the most important variable: keeping its distribution equal across the train and test sets should help the model.
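What stratification does can be sketched in plain Python: each distinct value of the stratify column is split separately, so both sets keep the same proportions. This is a simplified illustration, not scikit-learn's actual algorithm; `stratified_split` and the binary example labels are hypothetical. Note that scikit-learn generally needs every distinct value of the stratify column to appear at least twice, which matters when stratifying on a many-valued column.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.25, seed=222):
    """Split row positions so that every label keeps its share in both sets."""
    rng = random.Random(seed)

    # Group row positions by label
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)   # test rows for this label
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    return sorted(train_idx), sorted(test_idx)


# Made-up labels with a 2:1 ratio between the two groups
labels = [1] * 80 + [0] * 40
train_idx, test_idx = stratified_split(labels)

train_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_share  = sum(labels[i] for i in test_idx) / len(test_idx)
print(round(train_share, 4), round(test_share, 4))
```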

In [17]:
# Divide into a data and a target dataset
original_df_data   =  original_df.loc[ : , variable_dict['logit_sig']]
original_df_target =  original_df.loc[ : , 'CROSS_SELL_SUCCESS']


# Train-test split, stratified on FOLLOWED_RECOMMENDATIONS_PCT
X_train, X_test, y_train, y_test = train_test_split(
            original_df_data,
            original_df_target,
            random_state = 222,
            test_size    = 0.25,
            stratify     = original_df['FOLLOWED_RECOMMENDATIONS_PCT'])



In [18]:
# Standardize the data set and divide it into train and test sets to run the models side by side
# (this split is stratified on the target, not on FOLLOWED_RECOMMENDATIONS_PCT)

original_df_data   =  original_df.loc[ : , variable_dict['logit_sig']]
original_df_target =  original_df.loc[ : , 'CROSS_SELL_SUCCESS']

# INSTANTIATING StandardScaler()
scaler = StandardScaler()

# FITTING the independent variable data
scaler.fit(original_df_data)


# TRANSFORMING the independent variable data
X_scaled     = scaler.transform(original_df_data)


# converting to a DataFrame, keeping the original column names
X_scaled_df  = pd.DataFrame(X_scaled, columns = original_df_data.columns) 


# Train test split with all the scaled data
X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled = train_test_split(
                    X_scaled_df,
                    original_df_target,
                    random_state = 222,
                    test_size = 0.25,
                    stratify = original_df_target)
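The cell above fits the scaler on the full dataset before splitting, so the test rows contribute to the mean and standard deviation. A common alternative is to fit on the training rows only and reuse those statistics for the test rows. The sketch below shows that idea in plain Python with made-up numbers; `fit_scaler` and `transform` are hypothetical helpers that mimic standardization with the population standard deviation, which is what `StandardScaler` uses.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the mean and population standard deviation from training values."""
    return mean(values), pstdev(values)


def transform(values, mu, sigma):
    """Standardize values with statistics learned elsewhere."""
    return [(v - mu) / sigma for v in values]


# Made-up feature values
train = [10.0, 20.0, 30.0, 40.0]
test  = [25.0, 50.0]

# Statistics come from the training rows only
mu, sigma = fit_scaler(train)

train_scaled = transform(train, mu, sigma)
test_scaled  = transform(test, mu, sigma)    # test rows reuse the training statistics

print(train_scaled)
print(test_scaled)
```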

In [19]:
# Build a list to keep track of model values
model_performance = [['Model', 'Training Accuracy',
                      'Testing Accuracy', 'AUC Value']]

In [20]:
# INSTANTIATING a logistic regression model
logreg = LogisticRegression(solver = 'liblinear',
                                 C = 1,
                      random_state = 222)


# FITTING the training data
logreg_fit = logreg.fit(X_train, y_train)


# PREDICTING based on the testing set
logreg_pred = logreg_fit.predict(X_test)


# SCORING the results
print('Training ACCURACY:', logreg_fit.score(X_train, y_train).round(4))
print('Testing  ACCURACY:', logreg_fit.score(X_test, y_test).round(4))
print('AUC Score        :', roc_auc_score(y_true  = y_test,
                                          y_score = logreg_pred))

# Adding model results to table
model_performance.append(['Logistic Regression w significant var',
                          logreg_fit.score(X_train, y_train).round(4),
                          logreg_fit.score(X_test, y_test).round(4),
                          roc_auc_score(y_true  = y_test,
                                        y_score = logreg_pred).round(4)])
      

Training ACCURACY: 0.8177
Testing  ACCURACY: 0.807
AUC Score        : 0.7918665636688893
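The AUC values in this notebook are computed from hard 0/1 predictions (the output of `predict`). AUC is usually computed from scores such as predicted probabilities, which in scikit-learn would come from `predict_proba(X_test)[:, 1]`. The plain-Python sketch below, with made-up labels and probabilities and a hypothetical `auc` helper, shows that the two inputs can give different values for the same model.

```python
def auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5

    return wins / (len(pos) * len(neg))


# Made-up true labels and predicted probabilities
y_true = [0, 0, 1, 1, 1, 0]
proba  = [0.10, 0.40, 0.35, 0.80, 0.90, 0.60]

# Hard predictions at the usual 0.5 cut-off
hard = [int(p >= 0.5) for p in proba]

auc_from_proba = auc(y_true, proba)
auc_from_hard  = auc(y_true, hard)

print('AUC from probabilities  :', round(auc_from_proba, 4))
print('AUC from hard labels    :', round(auc_from_hard, 4))
```

Hard labels collapse the ranking into two groups, so the result depends on the 0.5 cut-off; probabilities keep the full ranking.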


In [21]:
# INSTANTIATING a logistic regression model
logreg = LogisticRegression(solver = 'liblinear',
                            C = 1,
                            random_state = 222)


# FITTING the training data
logreg_fit = logreg.fit(X_train_scaled, y_train_scaled)


# PREDICTING based on the testing set
logreg_pred = logreg_fit.predict(X_test_scaled)


# SCORING the results
print('Training ACCURACY:', logreg_fit.score(X_train_scaled, y_train_scaled).round(4))
print('Testing  ACCURACY:', logreg_fit.score(X_test_scaled, y_test_scaled).round(4))
print('AUC Score        :', roc_auc_score(y_true  = y_test_scaled,
                                          y_score = logreg_pred))

# Adding model results to table
model_performance.append(['Logistic Regression scaled significant var',
                          logreg_fit.score(X_train_scaled, y_train_scaled).round(4),
                          logreg_fit.score(X_test_scaled, y_test_scaled).round(4),
                          roc_auc_score(y_true  = y_test_scaled,
                                        y_score = logreg_pred).round(4)])
      

Training ACCURACY: 0.817
Testing  ACCURACY: 0.7967
AUC Score        : 0.7572526919203655


In [22]:
# INSTANTIATING a classification tree object
# (default hyperparameters, so the tree is grown to full depth and not pruned)
tree_pruned      = DecisionTreeClassifier()


# FITTING the training data
tree_pruned_fit  = tree_pruned.fit(X_train, y_train)


# PREDICTING test data set
tree_pred = tree_pruned_fit.predict(X_test)


# SCORING the model
print('Training ACCURACY:', tree_pruned_fit.score(X_train, y_train).round(4))
print('Testing  ACCURACY:', tree_pruned_fit.score(X_test, y_test).round(4))
print('AUC Score        :', roc_auc_score(y_true  = y_test,
                                          y_score = tree_pred).round(4))

# Adding model results to table
model_performance.append(['Decision Tree w significant var',
                          tree_pruned_fit.score(X_train, y_train).round(4),
                          tree_pruned_fit.score(X_test, y_test).round(4),
                          roc_auc_score(y_true  = y_test,
                                          y_score = tree_pred).round(4)])


Training ACCURACY: 1.0
Testing  ACCURACY: 0.7577
AUC Score        : 0.708


In [23]:
# INSTANTIATING a classification tree object for the scaled data
full_tree = DecisionTreeClassifier()


# FITTING the training data
full_tree_fit = full_tree.fit(X_train_scaled, y_train_scaled)


# PREDICTING on new data
full_tree_pred = full_tree_fit.predict(X_test_scaled)


# SCORING the model
print('Training ACCURACY:', full_tree_fit.score(X_train_scaled, y_train_scaled).round(4))
print('Testing  ACCURACY:', full_tree_fit.score(X_test_scaled, y_test_scaled).round(4))
print('AUC Score        :', roc_auc_score(y_true  = y_test_scaled,
                                          y_score = full_tree_pred).round(4))

# Adding model results to table
model_performance.append(['Decision tree scaled w significant var',
                          full_tree_fit.score(X_train_scaled, y_train_scaled).round(4),
                          full_tree_fit.score(X_test_scaled, y_test_scaled).round(4),
                          roc_auc_score(y_true  = y_test_scaled,
                                          y_score = full_tree_pred).round(4)])


Training ACCURACY: 1.0
Testing  ACCURACY: 0.7372
AUC Score        : 0.7016


In [24]:
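# INSTANTIATING a gradient boosting classifier
# (loss = 'deviance' and criterion = 'mae' are names from older scikit-learn
#  releases; newer releases call the loss 'log_loss' and no longer accept 'mae')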
g_boost = GradientBoostingClassifier(loss = 'deviance',
                                     criterion = 'mae',
                                     learning_rate =  0.1,
                                     n_estimators = 95,
                                     max_features = 3,
                                     random_state  = 222)

# FITTING the training data
g_boost_fit = g_boost.fit(X_train, y_train)

# PREDICTING on test data
g_boost_pred = g_boost_fit.predict(X_test)

# SCORING the model
print('Training ACCURACY:', g_boost_fit.score(X_train, y_train).round(4))
print('Testing  ACCURACY:', g_boost_fit.score(X_test, y_test).round(4))
print('AUC Score        :', roc_auc_score(y_true  = y_test,
                                          y_score = g_boost_pred).round(4))

# Adding model results to table
model_performance.append(['GradientBoosting w significant var',
                          g_boost_fit.score(X_train, y_train).round(4),
                          g_boost_fit.score(X_test, y_test).round(4),
                          roc_auc_score(y_true  = y_test,
                                          y_score = g_boost_pred).round(4)])

Training ACCURACY: 0.83
Testing  ACCURACY: 0.8111
AUC Score        : 0.8029


In [25]:
# declaring a hyperparameter space
#loss          = ['deviance', 'exponential']
#learning_rate = [0.1, 0.2, 0.3, 0.4]
#n_estimators  = range(95,105,5)
#criterion     = ['friedman_mse', 'mae', 'mse']


# creating a hyperparameter grid

#param_grid = {'loss'          : loss,
#              'learning_rate' : learning_rate,
#              'n_estimators'  : n_estimators,
#              'criterion'     : criterion}


# INSTANTIATING the model object without hyperparameters

#boost_tuned = GradientBoostingClassifier( max_features = 3,
#                                         random_state  = 222,
#                                               verbose = 1)


# GridSearchCV object

#boost_tuned_grid = GridSearchCV(estimator = boost_tuned,
#                               param_grid = param_grid,
#                                       cv = 4,
#                                  scoring = make_scorer(roc_auc_score,
#                                                        needs_threshold = False))


# FITTING to the training data (cross-validation creates the validation folds)

#boost_tuned_grid.fit(X_train, y_train)


# printing the optimal parameters and best score

#print("Tuned Parameters  :", boost_tuned_grid.best_params_)
#print("Tuned CV AUC      :", boost_tuned_grid.best_score_.round(4))
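To get a feel for the cost of the commented-out grid search, the candidate combinations can be enumerated in plain Python. This sketch only counts the combinations from the hyperparameter space above (with the learning rates written out as a list); it does not fit any model, and the fold count of 4 is taken from the `cv` argument.

```python
from itertools import product

# Hyperparameter space from the cell above
param_grid = {
    'loss'          : ['deviance', 'exponential'],
    'learning_rate' : [0.1, 0.2, 0.3, 0.4],
    'n_estimators'  : list(range(95, 105, 5)),
    'criterion'     : ['friedman_mse', 'mae', 'mse'],
}

# Every combination of one value per hyperparameter
candidates = [dict(zip(param_grid, combo))
              for combo in product(*param_grid.values())]

cv_folds = 4
n_fits   = len(candidates) * cv_folds   # not counting a final refit on the best candidate

print('Candidates:', len(candidates))
print('Model fits:', n_fits)
```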

In [26]:
# INSTANTIATING a gradient boosting classifier for the scaled data
g_boost = GradientBoostingClassifier(loss = 'deviance',
                                     criterion = 'mae',
                                     learning_rate =  0.1,
                                     n_estimators = 95,
                                     max_features = 3,
                                     random_state  = 222)

# FITTING the training data
g_boost_fit = g_boost.fit(X_train_scaled, y_train_scaled)

# PREDICTING on test data
g_boost_pred = g_boost_fit.predict(X_test_scaled)

# SCORING the model
print('Training ACCURACY:', g_boost_fit.score(X_train_scaled, y_train_scaled).round(4))
print('Testing  ACCURACY:', g_boost_fit.score(X_test_scaled, y_test_scaled).round(4))
print('AUC Score        :', roc_auc_score(y_true  = y_test_scaled,
                                          y_score = g_boost_pred).round(4))

# Adding model results to table
model_performance.append(['GradientBoosting scaled w significant var',
                          g_boost_fit.score(X_train_scaled, y_train_scaled).round(4),
                          g_boost_fit.score(X_test_scaled, y_test_scaled).round(4),
                          roc_auc_score(y_true  = y_test_scaled,
                                          y_score = g_boost_pred).round(4)])

Training ACCURACY: 0.8417
Testing  ACCURACY: 0.8049
AUC Score        : 0.7582
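
The AUC values in this notebook are computed from the hard class labels returned by `predict`. With binary labels as the score, the ROC curve has a single interior point, so the result equals the balanced accuracy. A minimal sketch, assuming synthetic data from `make_classification` in place of the Apprentice Chef data, shows how the same metric could be computed from `predict_proba` instead. All variable names below are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic stand-in for the Apprentice Chef data (assumption)
X, y = make_classification(n_samples=500, n_features=8, random_state=222)
X_tr, X_te, y_tr, y_te = train_test_split(X, y,
                                          test_size=0.25,
                                          random_state=222,
                                          stratify=y)

model = GradientBoostingClassifier(n_estimators=50,
                                   random_state=222).fit(X_tr, y_tr)

# hard 0/1 predictions versus the predicted probability of class 1
hard_pred = model.predict(X_te)
proba_pos = model.predict_proba(X_te)[:, 1]

auc_from_labels = roc_auc_score(y_true=y_te, y_score=hard_pred)
auc_from_proba  = roc_auc_score(y_true=y_te, y_score=proba_pos)

print('AUC from hard labels  :', round(auc_from_labels, 4))
print('Balanced accuracy     :', round(balanced_accuracy_score(y_te, hard_pred), 4))
print('AUC from probabilities:', round(auc_from_proba, 4))
```

The probability-based value ranks every test observation, so it uses more information than the single threshold implied by `predict`.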


In [27]:
# INSTANTIATING a random forest classifier
rndfor = RandomForestClassifier(criterion = 'gini',
                                bootstrap = True, 
                                max_depth = 4, 
                                n_estimators = 200,
                                min_samples_leaf = 25, 
                                random_state = 222)

# FITTING the training data
rndfor_fit = rndfor.fit(X_train, y_train)

# PREDICTING on test data
rndfor_pred = rndfor_fit.predict(X_test)

# SCORING the model
print('Training ACCURACY:', rndfor_fit.score(X_train, y_train).round(4))
print('Testing  ACCURACY:', rndfor_fit.score(X_test, y_test).round(4))
print('AUC Score        :', roc_auc_score(y_true  = y_test,
                                          y_score = rndfor_pred).round(4))

model_performance.append(['Random Forrest w significant var',
                          rndfor_fit.score(X_train, y_train).round(4),
                          rndfor_fit.score(X_test, y_test).round(4),
                          roc_auc_score(y_true  = y_test,
                                          y_score = rndfor_pred).round(4)])

Training ACCURACY: 0.8184
Testing  ACCURACY: 0.8131
AUC Score        : 0.7676


In [28]:
# INSTANTIATING a random forest classifier
rndfor = RandomForestClassifier(criterion = 'gini',
                                bootstrap = True, 
                                max_depth = 4, 
                                n_estimators = 200,
                                min_samples_leaf = 25, 
                                random_state = 222)

# FITTING the training data
rndfor_fit = rndfor.fit(X_train_scaled, y_train_scaled)

# PREDICTING on test data
rndfor_pred = rndfor_fit.predict(X_test_scaled)

# SCORING the model
print('Training ACCURACY:', rndfor_fit.score(X_train_scaled, y_train_scaled).round(4))
print('Testing  ACCURACY:', rndfor_fit.score(X_test_scaled, y_test_scaled).round(4))
print('AUC Score        :', roc_auc_score(y_true  = y_test_scaled,
                                          y_score = rndfor_pred).round(4))

model_performance.append(['Random Forrest scaled w significant var',
                          rndfor_fit.score(X_train_scaled, y_train_scaled).round(4),
                          rndfor_fit.score(X_test_scaled, y_test_scaled).round(4),
                          roc_auc_score(y_true  = y_test_scaled,
                                          y_score = rndfor_pred).round(4)])

Training ACCURACY: 0.8211
Testing  ACCURACY: 0.7947
AUC Score        : 0.7337


In [29]:
# User-defined function to find the optimal number of neighbors
from sklearn.neighbors import KNeighborsRegressor        # needed for the 'reg' option below
def optimal_neighbors(X_data,
                      y_data,
                      pct_test=0.25,
                      seed=222,
                      response_type='class',
                      max_neighbors=50):
    """
Exhaustively compute training and testing scores for KNN across
[1, max_neighbors]. Prints and returns the number of neighbors with
the highest test score.
PARAMETERS
----------
X_data        : explanatory variable data
y_data        : response variable
pct_test      : test size for training and validation from (0,1), default 0.25
seed          : random seed to be used in algorithm, default 222
response_type : type of neighbors algorithm to use, default 'class'
    Use 'reg' for regression (KNeighborsRegressor)
    Use 'class' for classification (KNeighborsClassifier)
max_neighbors : maximum number of neighbors in exhaustive search, default 50
"""    
    
    
    
    # train-test split
    X_train, X_test, y_train, y_test = train_test_split(X_data,
                                                        y_data,
                                                        test_size = pct_test,
                                                        random_state = seed)


    # creating lists for training set accuracy and test set accuracy
    training_accuracy = []
    test_accuracy = []
    
    
    # setting neighbor range
    neighbors_settings = range(1, max_neighbors + 1)


    for n_neighbors in neighbors_settings:
        # building the model based on response variable type
        if response_type == 'reg':
            clf = KNeighborsRegressor(n_neighbors = n_neighbors)
            clf.fit(X_train, y_train)
            
        elif response_type == 'class':
            clf = KNeighborsClassifier(n_neighbors = n_neighbors)
            clf.fit(X_train, y_train)            
            
        else:
            # stop here; continuing would fail because no model was built
            raise ValueError("response_type must be 'reg' or 'class'")
        
        
        # recording the training set accuracy
        training_accuracy.append(clf.score(X_train, y_train))
    
        # recording the generalization accuracy
        test_accuracy.append(clf.score(X_test, y_test))
    
    # returning optimal number of neighbors
    print(f"The optimal number of neighbors is: {test_accuracy.index(max(test_accuracy))+1}")
    return test_accuracy.index(max(test_accuracy))+1
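
The function above picks the number of neighbors from a single train-test split. As an alternative sketch, not the method used in this notebook, the same search can average accuracy over several folds with `cross_val_score`, which makes the choice less dependent on one particular split. The synthetic data and the range of 1 to 20 neighbors are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in for the Apprentice Chef data (assumption)
X, y = make_classification(n_samples=300, n_features=6, random_state=222)

# mean 5-fold accuracy for each candidate number of neighbors
cv_accuracy = []
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_accuracy.append(cross_val_score(knn, X, y, cv=5).mean())

# list.index returns the first maximum, so ties go to the smaller k
best_k = cv_accuracy.index(max(cv_accuracy)) + 1
print(f"The cross-validated optimal number of neighbors is: {best_k}")
```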



In [30]:
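# INSTANTIATING a KNN classification model with optimal neighbors
# NOTE: the neighbor search runs on the unscaled training data,
#       while the model itself is fit on the scaled data below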
knn_opt = KNeighborsClassifier(n_neighbors = optimal_neighbors(X_train, 
                                                               y_train))


# FITTING the training data
knn_fit = knn_opt.fit(X_train_scaled, y_train_scaled)


# PREDICTING based on the testing set
knn_pred = knn_fit.predict(X_test_scaled)


# SCORING the results
print('Training ACCURACY:', knn_fit.score(X_train_scaled, y_train_scaled).round(4))
print('Testing  ACCURACY:', knn_fit.score(X_test_scaled, y_test_scaled).round(4))
print('AUC Score        :', roc_auc_score(y_true  = y_test_scaled,
                                          y_score = knn_pred).round(4))

model_performance.append(['KNeighbor Scaled w significant var',
                          knn_fit.score(X_train_scaled, y_train_scaled).round(4),
                          knn_fit.score(X_test_scaled, y_test_scaled).round(4),
                          roc_auc_score(y_true  = y_test_scaled,
                                          y_score = knn_pred).round(4)])


The optimal number of neighbors is: 9
Training ACCURACY: 0.8451
Testing  ACCURACY: 0.7823
AUC Score        : 0.7535


In [31]:
# INSTANTIATING a KNN classification model with optimal neighbors
knn_opt = KNeighborsClassifier(n_neighbors = optimal_neighbors(X_train, 
                                                               y_train))


# FITTING the training data
knn_fit = knn_opt.fit(X_train, y_train)


# PREDICTING based on the testing set
knn_pred = knn_fit.predict(X_test)


# SCORING the results
print('Training ACCURACY:', knn_fit.score(X_train, y_train).round(4))
print('Testing  ACCURACY:', knn_fit.score(X_test, y_test).round(4))
print('AUC Score        :', roc_auc_score(y_true  = y_test,
                                          y_score = knn_pred).round(4))

model_performance.append(['KNeighbor w significant var',
                          knn_fit.score(X_train, y_train).round(4),
                          knn_fit.score(X_test, y_test).round(4),
                          roc_auc_score(y_true  = y_test,
                                          y_score = knn_pred).round(4)])


The optimal number of neighbors is: 9
Training ACCURACY: 0.7608
Testing  ACCURACY: 0.7166
AUC Score        : 0.6707
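
The scaled KNN model above scores higher than the unscaled one, which is consistent with KNN being distance-based: a feature measured on a large scale can dominate the distance calculation. A minimal sketch, assuming synthetic data with one artificially inflated feature, shows how a scikit-learn pipeline could bundle `StandardScaler` with the classifier so that the scaler is fit on the training split only. The names and the inflation factor are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the Apprentice Chef data (assumption)
X, y = make_classification(n_samples=500, n_features=6, random_state=222)

# put the first feature on a much larger scale than the others
X = X * np.array([1000, 1, 1, 1, 1, 1])

X_tr, X_te, y_tr, y_te = train_test_split(X, y,
                                          test_size=0.25,
                                          random_state=222,
                                          stratify=y)

# KNN on raw features
knn_raw = KNeighborsClassifier(n_neighbors=9).fit(X_tr, y_tr)

# KNN with standardization learned from the training split only
knn_scaled = make_pipeline(StandardScaler(),
                           KNeighborsClassifier(n_neighbors=9)).fit(X_tr, y_tr)

raw_acc    = knn_raw.score(X_te, y_te)
scaled_acc = knn_scaled.score(X_te, y_te)

print('Testing ACCURACY (raw)   :', round(raw_acc, 4))
print('Testing ACCURACY (scaled):', round(scaled_acc, 4))
```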


In [32]:
# Adding in all variables to see whether the ensemble models perform better with the full set
original_df_data   =  original_df[variable_dict['logit_full']]
original_df_target =  original_df.loc[:,'CROSS_SELL_SUCCESS']


# this is the exact code we were using before
X_train, X_test, y_train, y_test = train_test_split(
            original_df_data,
            original_df_target,
            random_state = 222,
            test_size    = 0.25,
            stratify     = original_df['FOLLOWED_RECOMMENDATIONS_PCT'])
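
The split above stratifies on `FOLLOWED_RECOMMENDATIONS_PCT`. For comparison, a minimal sketch on made-up data shows what stratifying on the response itself does: the share of positive cases is kept the same in the training and testing sets. The 30/70 class balance and the array names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# made-up data: 1000 rows, 30% positive class (assumption)
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 300 + [0] * 700)

# stratifying on the response keeps class shares equal across splits
X_tr, X_te, y_tr, y_te = train_test_split(X, y,
                                          test_size=0.25,
                                          random_state=222,
                                          stratify=y)

print('Positive share in train:', round(y_tr.mean(), 4))
print('Positive share in test :', round(y_te.mean(), 4))
```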


In [33]:
rndfor = RandomForestClassifier(criterion = 'gini',
                                bootstrap = True, 
                                max_depth = 4, 
                                n_estimators = 200,
                                min_samples_leaf = 25, 
                                random_state = 222)

# FITTING the training data
rndfor_fit = rndfor.fit(X_train, y_train)

# PREDICTING on test data
rndfor_pred = rndfor_fit.predict(X_test)

# SCORING the model
print('Training ACCURACY:', rndfor_fit.score(X_train, y_train).round(4))
print('Testing  ACCURACY:', rndfor_fit.score(X_test, y_test).round(4))
print('AUC Score        :', roc_auc_score(y_true  = y_test,
                                          y_score = rndfor_pred).round(4))

model_performance.append(['Random Forrest w all var',
                          rndfor_fit.score(X_train, y_train).round(4),
                          rndfor_fit.score(X_test, y_test).round(4),
                          roc_auc_score(y_true  = y_test,
                                          y_score = rndfor_pred).round(4)])

Training ACCURACY: 0.8218
Testing  ACCURACY: 0.8234
AUC Score        : 0.7708


In [34]:
# INSTANTIATING a KNN classification model with optimal neighbors
knn_opt = KNeighborsClassifier(n_neighbors = optimal_neighbors(X_train, 
                                                               y_train))


# FITTING the training data
knn_fit = knn_opt.fit(X_train, y_train)


# PREDICTING based on the testing set
knn_pred = knn_fit.predict(X_test)


# SCORING the results
print('Training ACCURACY:', knn_fit.score(X_train, y_train).round(4))
print('Testing  ACCURACY:', knn_fit.score(X_test, y_test).round(4))
print('AUC Score        :', roc_auc_score(y_true  = y_test,
                                          y_score = knn_pred).round(4))

model_performance.append(['KNeighbor w all var',
                          knn_fit.score(X_train, y_train).round(4),
                          knn_fit.score(X_test, y_test).round(4),
                          roc_auc_score(y_true  = y_test,
                                          y_score = knn_pred).round(4)])


The optimal number of neighbors is: 34
Training ACCURACY: 0.6799
Testing  ACCURACY: 0.7125
AUC Score        : 0.5411


In [35]:
# INSTANTIATING a gradient boosting classifier with default hyperparameters
g_boost = GradientBoostingClassifier()

# FITTING the training data
g_boost_fit = g_boost.fit(X_train, y_train)

# PREDICTING on test data
g_boost_pred = g_boost_fit.predict(X_test)

# SCORING the model
print('Training ACCURACY:', g_boost_fit.score(X_train, y_train).round(4))
print('Testing  ACCURACY:', g_boost_fit.score(X_test, y_test).round(4))
print('AUC Score        :', roc_auc_score(y_true  = y_test,
                                          y_score = g_boost_pred).round(4))


model_performance.append(['GradientBoosting w all var',
                          g_boost_fit.score(X_train, y_train).round(4),
                          g_boost_fit.score(X_test, y_test).round(4),
                          roc_auc_score(y_true  = y_test,
                                          y_score = g_boost_pred).round(4)])

Training ACCURACY: 0.902
Testing  ACCURACY: 0.8049
AUC Score        : 0.7843


In [37]:
# The complete list of models and parameters used. I also ran and tuned other models, but none performed better.
print(pd.DataFrame(model_performance))

                                             0                  1  \
0                                        Model  Training Accuracy   
1        Logistic Regression w significant var             0.8177   
2   Logistic Regression scaled significant var              0.817   
3              Decision Tree w significant var                  1   
4       Decision tree scaled w significant var                  1   
5           GradientBoosting w significant var               0.83   
6    GradientBoosting scaled w significant var             0.8417   
7             Random Forrest w significant var             0.8184   
8      Random Forrest scaled w significant var             0.8211   
9           KNeighbor Scaled w significant var             0.8451   
10                 KNeighbor w significant var             0.7608   
11                    Random Forrest w all var             0.8218   
12                         KNeighbor w all var             0.6799   
13                  GradientBoosti