# Summary

Using logistic regression, the final accuracy score was 77%. This model was used to investigate how BigRetail can increase signups.

The results showed that the people most likely to sign up for rewards are women, high value customers, and big spenders.

# Import Libraries and Data

In [2]:
import pandas as pd

# Prepping for modeling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression

# Metrics
from sklearn import metrics
from sklearn.metrics import roc_auc_score, confusion_matrix, roc_curve
from sklearn.metrics import precision_score, recall_score, precision_recall_curve,f1_score, fbeta_score
from sklearn.metrics import accuracy_score

In [3]:
df_train = pd.read_csv('train_df_cleaned.csv')
df_test = pd.read_csv('test_df_cleaned.csv')

In [4]:
df_train.drop(columns=['Unnamed: 0'], inplace = True)
df_test.drop(columns=['Unnamed: 0'], inplace = True)

# Modeling - Logistic Regression

In [5]:
df_train.head()

Unnamed: 0,Rewards_Signup,Age,Sex,Addtl_HH_size,LastPurchaseAmt,CustomerTier,New,Reactivated
0,1,29.0,0,0,10.5,2,1,0
1,1,19.0,0,0,7.8792,3,0,0
2,0,25.0,1,0,7.05,3,1,0
3,1,44.0,0,1,57.9792,1,0,1
4,1,32.0,1,0,7.925,3,1,0


In [6]:
# Define target and features
X = df_train[['Age', 'Sex', 'Addtl_HH_size', 'LastPurchaseAmt', 'CustomerTier', 'New', 'Reactivated']]
y = df_train['Rewards_Signup']

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

In [7]:
# Scale since there are multiple features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test) # Scale test features

In [8]:
# Create the model
logistic_model = LogisticRegression(penalty='l1', solver='liblinear', C=.1) # Create model
logistic_model.fit(X_train, y_train) # Fit model on training data

LogisticRegression(C=0.1, penalty='l1', solver='liblinear')

In [9]:
#Print various scores/metrics
y_predict = (logistic_model.predict_proba(X_test)[:, 1] >= .5)

print("Precision: {:6.4f},   Recall: {:6.4f}".format(precision_score(y_test, y_predict), 
                                                     recall_score(y_test, y_predict)))
print("F1 Score: ", f1_score(y_test, y_predict))
print("ROC AUC score : ", roc_auc_score(y_test, logistic_model.predict_proba(X_test)[:,1]))
    
print("Accuracy score: ", accuracy_score(y_test, y_predict))
    
print('Training accuracy:', logistic_model.score(X_train, y_train))
print('Test accuracy:', logistic_model.score(X_test, y_test))

Precision: 0.7097,   Recall: 0.6875
F1 Score:  0.6984126984126984
ROC AUC score :  0.8397253787878788
Accuracy score:  0.8061224489795918
Training accuracy: 0.8174807197943444
Test accuracy: 0.8061224489795918


In [10]:
# List coefficients
list_coef = list(zip(X.columns, logistic_model.coef_[0]))
list_coef

[('Age', -0.015690393894629926),
 ('Sex', -1.1781879191762892),
 ('Addtl_HH_size', -0.018471616534562117),
 ('LastPurchaseAmt', 0.0909396505793005),
 ('CustomerTier', -0.565217644202949),
 ('New', 0.0),
 ('Reactivated', 0.0)]

In [11]:
from math import e

In [12]:
# List features and their coefficients in a sorted and interpretable format.

# Each coefficient represents log odds. Put each coefficient as the exponent of e so that
# it represents the odds. Subtract one and multiple by 100 so that coefficients can be 
# interpreted as increasing the odds of the pump being functional by ___ percent.
new_tuple = () 
for itup in list_coef:
    new_tuple += (round((e**itup[1] - 1) * 100, 2),)
    
interpretable = list(zip(X.columns, new_tuple)) # List of features and their odds

In [13]:
interpretable

[('Age', -1.56),
 ('Sex', -69.22),
 ('Addtl_HH_size', -1.83),
 ('LastPurchaseAmt', 9.52),
 ('CustomerTier', -43.18),
 ('New', 0.0),
 ('Reactivated', 0.0)]

What this means:
* The three biggest predictors of whether or not a person will sign up for the RewardsProgram is their Sex, their CustomerTier score, and their Last Purchase.
* Women are more likely than men to sign up for rewards. Being a woman increases the odds that they'll sign up by 69.22%.
* The CustomerTier score is doing a pretty good job. Valuable customers are more likely to sign up than non-valuable customers. Being a valuable customer increases the odds that they'll sign up by 43.17%.
* Lastly, people who spent more money on their most recent purchase are more likely to sign up than people who spend less. Each additional dollar that a person spends increases the likelihood that they'll sign up by 9.51%.

In [14]:
# Sanity check
print('Men w rewards: ', len(df_train[(df_train.Sex == 1) & (df_train.Rewards_Signup == 1)]))
print('Men without rewards: ', len(df_train[(df_train.Sex == 1) & (df_train.Rewards_Signup == 0)]))
print(len(df_train[(df_train.Sex == 1) & (df_train.Rewards_Signup == 1)]) / len(df_train[df_train.Sex == 1]))

print('Women w rewards: ', len(df_train[(df_train.Sex == 0) & (df_train.Rewards_Signup == 1)]))
print('Women without rewards: ', len(df_train[(df_train.Sex == 0) & (df_train.Rewards_Signup == 0)]))
print(len(df_train[(df_train.Sex == 0) & (df_train.Rewards_Signup == 1)]) / len(df_train[df_train.Sex == 1]))

Men w rewards:  57
Men without rewards:  267
0.17592592592592593
Women w rewards:  128
Women without rewards:  35
0.3950617283950617


# Final Test

In [16]:
X_test = df_test[['Age', 'Sex', 'Addtl_HH_size', 'LastPurchaseAmt', 'CustomerTier', 'New', 'Reactivated']]
y_test = df_test['Rewards_Signup']

y_predict = (logistic_model.predict_proba(X_test)[:, 1] >= .5)

print("F1 Score: ", f1_score(y_test, y_predict))
print("ROC AUC score : ", roc_auc_score(y_test, logistic_model.predict_proba(X_test)[:,1]))
print("*** Accuracy score: ", accuracy_score(y_test, y_predict))
print("Precision: {:6.4f},   Recall: {:6.4f}".format(precision_score(y_test, y_predict), 
                        recall_score(y_test, y_predict)))

F1 Score:  0.6749999999999999
ROC AUC score :  0.8074675324675324
*** Accuracy score:  0.7719298245614035
Precision: 0.7500,   Recall: 0.6136
